Thinking Machines Lab Says It’s Solved a Key Problem in LLMs: Here’s What That Means
Thinking Machines Lab, the AI startup founded by former OpenAI CTO Mira Murati, says it has made a breakthrough in one of the hardest problems facing large language models (LLMs): nondeterminism during inference. In short, the lab claims to have found a way to make LLMs consistently produce the same output when given the same input, something many of us have struggled with when using AI tools.
While Thinking Machines Lab is addressing global challenges in AI reliability, India is also stepping up its presence in the large language model space. A prime example is Sarvam AI, which is building India-focused LLMs tailored for local languages and contexts. This effort shows how different regions are shaping AI to meet their own needs. You can read more in our article on Sarvam AI’s India LLM initiative.
The lab describes the problem in a recent research blog post titled “Defeating Nondeterminism in LLM Inference”. According to researcher Horace He and his team, the randomness in AI responses often stems from how GPU kernels and batching are handled under the hood. By making those kernels batch-invariant, so that their results do not depend on how requests happen to be batched together, Thinking Machines Lab says it can eliminate much of that unwanted randomness.
Why does this matter
If the claims hold up, this could be a big deal for any application where you want predictability, reproducibility, or reliability from LLMs. A few practical benefits:
- In development or research, it means you can run the same prompt multiple times and get identical (or near-identical) results. That helps with debugging, evaluation, and comparison.
- For business use in regulated domains such as law, medicine, or finance, where consistency and auditability matter, it makes LLMs easier to trust.
- It could reduce frustration among users who see different outputs just because of background GPU scheduling or batching quirks.
What exactly is nondeterminism here
Nondeterminism means that even when you feed the same prompt into the same model with the same settings (including sampling turned off at temperature 0), you might get different outputs at different times. This can be caused by several technical factors, and the research from Thinking Machines Lab points out some of them:
- GPU kernel scheduling and concurrency: which part of the GPU is used when, and in what order.
- Batch size variation: when multiple queries are processed together (“batched”), the size and composition of the batch can change the numerical results slightly. Those small differences propagate through the network and can end up changing which token is produced.
- Floating-point arithmetic: rounding and the order of operations can vary under parallel computation, and floating-point addition is not associative, so regrouping the same sum gives a slightly different answer (see the snippet after this list).
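To make the floating-point point concrete, here is a minimal, framework-free illustration in plain Python (it is not taken from the Thinking Machines code): regrouping the same three numbers changes the result.

```python
# Floating-point addition is not associative: the same three numbers,
# grouped differently, do not sum to the same value.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6

print(left == right)  # False
```

Inside a GPU kernel, the grouping of a reduction, across threads, tiles, or chunks of a batch, plays the role of the parentheses above, so two mathematically equivalent execution plans can produce slightly different numbers.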
The key insight from their work is that using batch-invariant kernels, that is, kernels whose internal computations behave the same regardless of how many inputs are processed together in a batch, removes one major source of randomness.
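As a rough sketch of the batch-size effect (modeled on the kind of experiment the blog post describes, not taken from their code), you can compare one row pushed through a matrix multiply on its own versus as part of a larger batch. Whether the results differ depends on the hardware, math library, and dtype, so treat this as illustrative:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
torch.manual_seed(0)

a = torch.randn(2048, 2048, device=device, dtype=dtype)
b = torch.randn(2048, 2048, device=device, dtype=dtype)

row_in_big_batch = torch.mm(a, b)[:1]  # row 0 computed alongside 2047 other rows
row_alone = torch.mm(a[:1], b)         # row 0 computed by itself

# A batch-invariant kernel would make these bitwise equal; stock kernels may not,
# because the library can pick a different tiling/reduction strategy per shape.
print("bitwise equal:", torch.equal(row_in_big_batch, row_alone))
print("max abs diff :", (row_in_big_batch - row_alone).abs().max().item())
```

In an inference server, the “other rows” are other users’ requests that happen to share your batch, which is why the same prompt can come back slightly different depending on server load.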
How did they test this
From what’s reported:
- They ran many tests (thousands of prompts / inference runs) under different batching and GPU scheduling conditions.
- Without the new approach, results varied widely, with many different outputs for the same prompt across runs. With their “batch-invariant kernel” modifications, outputs became far more consistent, and in some setups identical inputs produced identical outputs every time (a stripped-down version of this kind of check is sketched after this list).
- They tracked and isolated the effects of batch size variance, GPU kernel batching, scheduling, etc., to identify what aspects needed fixing.
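The post does not ship a ready-made harness, but the kind of repeatability check described above is easy to sketch: generate greedily from the same prompt many times and count the distinct completions. The snippet below uses Hugging Face transformers with the small gpt2 model purely as a stand-in; the model, prompt, and run count are arbitrary choices, not part of the original work.

```python
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # stand-in model; any causal LM would do
PROMPT = "Nondeterminism in LLM inference is"
RUNS = 20

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device).eval()

inputs = tokenizer(PROMPT, return_tensors="pt").to(device)

completions = Counter()
with torch.no_grad():
    for _ in range(RUNS):
        out = model.generate(**inputs, do_sample=False, max_new_tokens=32)  # greedy decoding
        completions[tokenizer.decode(out[0], skip_special_tokens=True)] += 1

# With fully deterministic inference this prints 1; anything higher means drift.
print(f"distinct completions across {RUNS} runs: {len(completions)}")
```

Note that a single-process local run like this is often deterministic anyway; the variation the post focuses on shows up mainly in production serving, where your request is batched with other traffic, so the more telling version of this check is to send the same prompt repeatedly to a busy inference endpoint.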
Caveats & questions
While this sounds promising, there are important things to watch out for or verify:
- It’s not yet clear whether the approach carries over to all popular LLMs and hardware setups; different models, GPU architectures, and serving stacks may behave differently.
- There might be trade-offs: things like throughput (how fast the model can respond), resource utilization, or even energy efficiency might suffer if kernels are constrained to be more “deterministic.”
- In real-world systems, environmental factors (driver versions, hardware differences, GPU load, memory pressure, and so on) could still introduce some nondeterminism. The work may not eliminate every source of unpredictability.
- How easy will it be for developers or companies to adopt these batch-invariant kernels? Will major LLM providers integrate them server-side, or will users have to configure or swap kernels themselves? Today’s framework-level determinism knobs, sketched below, only go part of the way.
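For context on that last question, frameworks already expose partial determinism controls. The sketch below shows the standard PyTorch settings; they pin down kernel selection within a single process but do nothing about the batch-composition effects the post targets.

```python
import os

# Ask cuBLAS for reproducible reductions; must be set before CUDA initializes.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch

# Raise an error if an op only has a nondeterministic implementation.
torch.use_deterministic_algorithms(True)

# Pin cuDNN to deterministic algorithms and disable shape-based autotuning.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Fix sampling randomness (moot for greedy decoding, but cheap insurance).
torch.manual_seed(0)
```

These settings make repeated runs of the same script reproducible on the same machine, but they do not help when the numbers depend on how many other requests share the batch, which is exactly the gap batch-invariant kernels aim to close.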
What this could lead to
If adopted broadly, this innovation could enable:
- More reliable AI tools in regulated domains (law, medicine, compliance).
- Better comparison of LLM versions or fine-tuned models, since output variation would be less of a confounding factor.
- Improved benchmarking, since consistency is important when comparing models or prompt engineering techniques (see the sketch after this list).
- A smoother user experience: less “surprise behavior” from LLMs that suddenly change their output for no obvious reason.
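The benchmarking point has a concrete payoff: if inference is truly deterministic, evaluations and regression tests can use exact comparisons instead of statistical tolerances. A minimal sketch follows; the completions are made up and the fingerprint helper is just one way to do it.

```python
import hashlib
import json

def fingerprint(completions):
    """Collapse an ordered list of completions into one comparable hash."""
    return hashlib.sha256(json.dumps(completions).encode("utf-8")).hexdigest()

# Hypothetical outputs from two runs of the same prompt set through a
# deterministic inference stack; with true determinism they match exactly.
run_a = ["Paris is the capital of France.", "2 + 2 = 4"]
run_b = ["Paris is the capital of France.", "2 + 2 = 4"]

assert fingerprint(run_a) == fingerprint(run_b), "outputs drifted between runs"
print("exact-match regression check passed")
```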
Broader context: Why this problem stuck around
Even though many people know that LLM output can be inconsistent, solving it cleanly has been hard for a few reasons:
- The complexity of GPU hardware, libraries, kernels, drivers, and how different scheduling or parallelism arrangements affect computation.
- The fact that many “random” sources of variation are low-level (floating point arithmetic, rounding, concurrency) and often invisible to high-level model architects.
- Pressure for speed and throughput tends to push toward more parallelism, batching, and hardware optimizations, which, ironically, increase variability.
- Diverse deployment environments make it harder to guarantee deterministic behavior everywhere.
Thinking Machines Lab’s work matters because it attempts to pull apart some of these layers and propose solutions that operate at the kernel / batch interface level—closer to hardware/implementation than to just “prompt engineering” or model tuning.
What people are saying & next steps
The announcement has drawn attention in the AI research and developer community:
- Many are impressed: achieving reproducible outputs has long been considered an unsolved (or only partially solved) problem.
- Others are cautiously optimistic, wanting independent audits, tests across many models and hardware setups, and insight into performance trade-offs.
- Some in forums (e.g. on Reddit) report good results in simple tests, but also note that the improvement may drop off in more complex scenarios.
What to expect next:
- Publication of more detailed benchmarks and technical documentation.
- Possibly, integrations by big LLM providers if the method proves robust.
- Open-source tools / kernel libraries that others can use to replicate the approach.
- Community feedback and edge-case testing (e.g. very large context, very large batch sizes, distributed inference).
Bottom line
Thinking Machines Lab’s claim is significant: the company says it has solved, or at least greatly reduced, nondeterminism in inference, one of the long-standing issues in LLM behavior. If the results hold up under broader testing, this could make large language models more trustworthy, predictable, and easier to use in demanding contexts.
That said, we should see this as an exciting step rather than a finished product. Practical adoption, performance trade-offs, and hardware diversity will test the robustness of the solution. For now, it’s a hopeful sign that AI is evolving in ways that don’t just push for larger models, but for smarter and more reliable ones.