Why Temperature 0 Isn't Deterministic

Arpit Tripathi · June 19, 2026 · 11 min read

Temperature 0 picks the top token every time, yet LLM outputs still vary. The real cause is GPU math and the serving stack, not the model.

Set temperature to 0, watch the model pick the same top token every single time, and you can still get a different answer. Temperature 0 makes token selection deterministic. It does not make the whole system deterministic. The same prompt drifts to a different answer because the numbers feeding that choice are computed differently from run to run. The variation you see is not the model. It lives in GPU floating-point math and in the inference server that batches and routes your request. Most explanations stop at GPU rounding. The part that actually bites you is that your answer can change because of who else hit the server at the same moment.

What temperature 0 actually does

Temperature 0 turns sampling into a greedy argmax: at every step the model selects the single token with the highest probability and ignores the rest. There is no sampling distribution to draw from anymore, so the usual source of randomness, picking a less-likely token by chance, is gone. People expect this to guarantee identical text. It removes one source of variation, not all of them. In short: temperature 0 is deterministic in the math it applies, but not in the numbers it receives.
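To make that concrete, here is a minimal sketch of how temperature reshapes a probability distribution, using made-up logits for three candidate tokens. The function names are illustrative, and treating temperature 0 as a plain argmax (rather than dividing by zero) is an assumption about how implementations typically handle it, not a description of any particular library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Temperature 0 treated as a special case: plain argmax, no division."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

# As the temperature drops, probability mass piles onto the top token.
for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 4) for p in probs])

print("greedy pick:", greedy_pick(logits))
```

Lowering the temperature concentrates probability on the leader, and the greedy pick is the limit of that process. Nothing in this code is random, which is exactly why the remaining variation has to come from the logits themselves.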

Here is the subtle part. Argmax is only deterministic if the logits feeding it are bit-for-bit identical every time. The selection rule is fixed, but the inputs to that rule are computed by a massive pile of floating-point arithmetic on a GPU. If those numbers shift by even a tiny amount, the ranking of the top two tokens can flip whenever they are close. One flipped token early in a sequence changes everything that follows it. So the question is never whether the rule is deterministic. It is whether the numbers handed to that rule are.

Insight

Temperature 0 fixes how the model chooses. It does not fix what it chooses from. The logits are still the output of nondeterministic hardware math.

The real cause: floating-point addition is not associative

The root mechanism is that floating-point addition is not associative. In exact math, (a + b) + c equals a + (b + c). In floating point, those two expressions can produce different results because every intermediate sum is rounded to a fixed number of bits. Change the order of additions and you change the rounding, and you get a slightly different total. This is not a bug. It is a fundamental property of finite-precision arithmetic, and it holds on any GPU or CPU that uses it.
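You can see this on any machine with three ordinary decimals. The snippet below is a small illustration in Python, whose floats are 64-bit IEEE 754 doubles; GPUs often use narrower formats, where the same effect is larger.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # round after a + b, then add c
right = a + (b + c)   # round after b + c, then add a

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

Same three numbers, same operation, different grouping, different answer in the last bit.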

A transformer forward pass is built almost entirely from reduction operations: matrix multiplications, attention, and normalization all sum many numbers together. GPUs run these sums in parallel across thousands of threads, then combine the partial results. The order in which those partial sums get combined is not guaranteed to be fixed. It depends on which kernel the runtime picked, how the work was tiled, and, in kernels that accumulate with atomic operations, on thread scheduling. Run the same reduction under a different schedule and the accumulation lands in a different order, which lands on a slightly different number. A single matrix multiply inside one layer can sum thousands of products per output element, and a model has dozens of layers, so these micro-roundings happen millions of times per forward pass.
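A toy model of that tiling, in plain Python rather than real GPU code: the hypothetical `tiled_sum` below splits a list into tiles, sums each tile on its own, then combines the partial sums. The tile size stands in for however the kernel happened to divide the work.

```python
def tiled_sum(values, tile_size):
    """Sum each tile left to right, then combine the per-tile partial sums."""
    partials = []
    for start in range(0, len(values), tile_size):
        partial = 0.0
        for v in values[start:start + tile_size]:
            partial += v
        partials.append(partial)
    total = 0.0
    for p in partials:
        total += p
    return total

values = [0.1] * 10

one_tile = tiled_sum(values, 10)   # one worker adds all ten values in order
two_tiles = tiled_sum(values, 5)   # two workers add five each, then combine

print(one_tile)    # 0.9999999999999999
print(two_tiles)   # 1.0
```

The inputs never changed. Only the way the work was divided did, and that alone moved the result.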

How a tiny error becomes a different answer

Most of the time these rounding differences are invisible. The model's top token sits far ahead of the runner-up, so a 0.0001 wobble in the logits changes nothing. The problem appears at near-ties. When the two most likely tokens are close, a tiny numerical difference can reorder them, and argmax now picks the other one. That single different token feeds back into the context for the next step, so the divergence compounds. This is why two runs often agree for several words and then split apart.
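Here is a small sketch of that flip, with made-up logits and a made-up wobble of 0.0001. The helper names are illustrative only.

```python
def argmax(logits):
    """Index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

def perturb(logits, noise):
    """Add a per-token wobble to the logits."""
    return [x + n for x, n in zip(logits, noise)]

noise = [0.0, -0.0001, 0.0001]   # rounding-sized wobble, chosen for illustration

confident = [1.2, 7.5, 3.1]      # token 1 leads by a wide margin
near_tie = [1.2, 5.00005, 5.0]   # tokens 1 and 2 are almost level

print(argmax(confident), argmax(perturb(confident, noise)))  # 1 1
print(argmax(near_tie), argmax(perturb(near_tie, noise)))    # 1 2
```

The same wobble is harmless in the first case and decisive in the second. Once token 2 is emitted instead of token 1, every later step conditions on a different context.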

Think of it as a fork in the road that only matters when two paths sit at nearly equal height. A confident model rarely hits those forks, so its greedy output is stable even across noisy hardware. A model that is genuinely uncertain, or a prompt where several continuations are about equally good, sits on top of many near-ties at once. The same numerical noise that never mattered before now decides the branch. This is why low-confidence generations and open-ended prompts feel flakier than crisp factual lookups, even at the same temperature setting.
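If your serving stack exposes per-step logits or log-probabilities, one way to find those forks is to measure the gap between the top two candidates at each step. The sketch below assumes you already have the logits as lists of floats; the function names and the `noise_floor` threshold are hypothetical, and the right threshold depends on your hardware and numeric precision.

```python
def top2_margin(logits):
    """Gap between the highest and second-highest logit."""
    ordered = sorted(logits, reverse=True)
    return ordered[0] - ordered[1]

def fragile_steps(per_step_logits, noise_floor):
    """Indices of decoding steps whose top-two gap is within the noise floor."""
    return [i for i, logits in enumerate(per_step_logits)
            if top2_margin(logits) <= noise_floor]

steps = [
    [9.0, 2.0, 1.0],     # crisp: one clear winner
    [4.001, 4.0, 1.0],   # near-tie between the top two
    [6.0, 3.0, 2.9],     # clear winner; the close runners-up do not matter
]

print(fragile_steps(steps, noise_floor=0.01))  # [1]
```

A generation with many fragile steps is one where you should expect run-to-run divergence, whatever the temperature.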

Batch composition: the part nobody expects

The biggest surprise is that your output can depend on other people's requests. To use a GPU efficiently, inference servers batch many incoming requests together and process them in one pass. Which requests get grouped with yours, and how many there are, changes the shapes of the matrices and the reduction schedule the kernel chooses. That changes the rounding, which can shift your logits, which can flip a tie even at temperature 0.

Your prompt is identical. The inputs batched alongside it are not. Server load changes second to second, so the batch your request joins is effectively random to you. The same prompt with the same settings can return different text at 2pm and at 2am. Nothing changed except who else was online. Thinking Machines Lab argues this batch sensitivity, not raw GPU concurrency, is the primary reason production endpoints feel nondeterministic. In their framing, the load, and therefore the batch size, varies nondeterministically, and that variation is what moves the output.
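A toy version of that effect, in plain Python: the hypothetical `split_count_for` heuristic below stands in for a kernel that divides each row's reduction differently depending on how many rows are in the batch. Real kernels use far more elaborate heuristics; the only point is that your row's arithmetic depends on a number you do not control.

```python
def split_count_for(batch_size):
    # Hypothetical heuristic: with few rows in the batch, split each row's
    # reduction across more workers to keep the hardware busy.
    return 2 if batch_size < 4 else 1

def row_sum(row, batch_size):
    """Sum one request's row using the split chosen for the whole batch."""
    splits = split_count_for(batch_size)
    chunk = len(row) // splits  # assumes len(row) divides evenly
    partials = []
    for s in range(splits):
        partial = 0.0
        for v in row[s * chunk:(s + 1) * chunk]:
            partial += v
        partials.append(partial)
    total = 0.0
    for p in partials:
        total += p
    return total

my_row = [0.1] * 10   # identical request data in both cases

quiet_server = row_sum(my_row, batch_size=1)   # batched alone
busy_server = row_sum(my_row, batch_size=8)    # batched with seven others

print(quiet_server)  # 1.0
print(busy_server)   # 0.9999999999999999
```

The row is byte-for-byte the same in both calls. The other requests never touch its values, yet they change how those values get added up.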

This is the counterintuitive bit worth sitting with. You never see the other requests in your batch, you cannot control them, and they do not appear in any log you receive. Yet they are part of the computation that produced your tokens. It is the closest thing in everyday software to a result that depends on a stranger's input. Once you internalize that, a lot of mysterious flakiness in production LLM systems stops looking like a model defect and starts looking like a property of shared, batched serving.

Pro Tip

If you need reproducible results from a hosted API for evaluation, send requests one at a time where you can, and log the response metadata, including the model version and system_fingerprint. Batch context you cannot see will still move outputs at the margins.

Hosted APIs add a second layer: the cluster

On a hosted endpoint, identical requests do not even hit identical machines. Providers run fleets of servers, and your call lands on whichever one is free. Those servers can differ in GPU generation, driver and kernel versions, numeric libraries, or even the exact model checkpoint during a rollout. Any of those shifts the arithmetic. The routing is invisible to you, so the variation looks like the model being moody.

OpenAI is explicit about this. The seed parameter makes a best effort to sample deterministically, but the docs state determinism is not guaranteed, and they expose a system_fingerprint field so you can detect when the backend configuration changed under you. Treat that fingerprint as a tripwire: when it changes, assume your outputs can change too, and do not be surprised by a different answer to the same seed and the same prompt.
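One way to wire up that tripwire is a small recorder that flags the moment the fingerprint changes between responses. The class below is a hypothetical helper, not part of any SDK, and the fingerprint strings are placeholders; in practice you would pass in the system_fingerprint value read from each API response.

```python
from dataclasses import dataclass, field

@dataclass
class FingerprintTripwire:
    """Track the backend fingerprint seen across responses."""
    seen: list = field(default_factory=list)

    def record(self, fingerprint):
        """Return True if the fingerprint differs from the previous response."""
        changed = bool(self.seen) and self.seen[-1] != fingerprint
        self.seen.append(fingerprint)
        return changed

tripwire = FingerprintTripwire()

# Placeholder fingerprint strings; real values come from each API response.
for fp in ["fp_aaa", "fp_aaa", "fp_bbb"]:
    if tripwire.record(fp):
        print("backend changed, outputs may differ:", fp)
```

Storing the fingerprint next to every logged output lets you tell a backend change apart from a prompt regression after the fact. An unchanged fingerprint still does not promise identical output.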

The full causal chain in order

  • Temperature 0 sets a fixed rule: always pick the argmax token. This removes sampling randomness.
  • The logits feeding argmax come from floating-point reductions, and floating-point addition is not associative.
  • GPUs run those reductions in parallel and combine partial sums in a non-fixed order, producing tiny per-run differences.
  • Batch composition changes matrix shapes and reduction schedules, so other requests batched with yours shift your numbers.
  • Near-ties between the top two tokens get reordered by those tiny shifts, and one flipped token cascades through the rest of the sequence.
  • Hosted APIs route requests across clusters with differing hardware, drivers, and checkpoints, adding a further layer of variation.

When you CAN get true determinism

Open-weights models on hardware you control can be made fully reproducible. Fix the seed, set temperature 0, pin the model checkpoint, and run with a stable batch so the reduction order stops moving. One hurdle remains: the kernels. Standard GPU kernels pick different reduction schedules at different batch sizes, which quietly reintroduces variation. The fix is batch-invariant kernels that always reduce in the same order regardless of batch size. Each of these steps closes one of the gaps from the causal chain above, and you need all of them at once. Pin the checkpoint but leave the batch dynamic and the kernels will still drift; fix the kernels but route across mixed hardware and the arithmetic shifts again.
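The idea behind a batch-invariant reduction can be sketched in a few lines of plain Python. This is a toy illustration of the principle, not the published kernels: the tile size is fixed once, and the batch size is deliberately ignored when deciding how to add things up. The names `FIXED_TILE` and `batch_invariant_row_sum` are made up for the example.

```python
FIXED_TILE = 5  # chosen once, never adjusted to the batch size

def batch_invariant_row_sum(row, batch_size):
    """Reduce a row in the same tile order no matter how large the batch is."""
    # batch_size is accepted but deliberately ignored when choosing the tiling.
    partials = []
    for start in range(0, len(row), FIXED_TILE):
        partial = 0.0
        for v in row[start:start + FIXED_TILE]:
            partial += v
        partials.append(partial)
    total = 0.0
    for p in partials:
        total += p
    return total

row = [0.1] * 10
results = {batch_invariant_row_sum(row, batch_size=b) for b in (1, 8, 64)}

print(results)  # a single value: the result no longer depends on batch size
```

The trade-off is visible even here: a fixed tiling cannot adapt to keep the hardware busy at every batch size, so reproducibility costs some speed.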

This is no longer theoretical. Thinking Machines Lab published a set of batch-invariant kernels and a deterministic mode on top of vLLM, then reported 1,000 runs producing bit-for-bit identical output even under dynamic batching. The guarantee is not free: in their published timings the deterministic mode ran the same workload in about 42 seconds against 26 seconds for default vLLM, roughly 1.6 times slower, so you pay a meaningful throughput cost for the reproducibility. vLLM exposes batch-invariant inference as an opt-in feature, and SGLang has since integrated the same batch-invariant kernels behind a deterministic-inference flag. So determinism is achievable when you own the stack and accept that slowdown. It is not something a hosted endpoint promises you by default.

| Setting | Hosted API (temp 0) | Self-hosted open weights (tuned) |
| --- | --- | --- |
| Sampling randomness | Removed by temp 0 | Removed by temp 0 |
| Float reduction order | Varies with kernel and load | Pinned via batch-invariant kernels |
| Batch composition | Mixed with other users, invisible | Controlled by you |
| Hardware and checkpoint | Routed across a varied cluster | Single fixed machine and checkpoint |
| Bit-identical output | Not guaranteed | Achievable, at a throughput cost |

What temperature 0 means for reproducibility in practice

Stop treating temperature 0 as a reproducibility switch. It is a sampling setting, and it does its job perfectly: it removes the dice. The variation that remains is an engineering property of running giant matrix math on parallel hardware behind a shared API. Your tests will flake. That flake is not a model failure; it is the serving stack doing exactly what it was built to do.

Design your tests around it. Assert on meaning, structure, or a scoring rubric rather than exact-string equality. Validate that the JSON parses and the required fields are present rather than that the bytes match a golden file. Pin model versions and watch the system_fingerprint so a silent backend swap does not get blamed on your prompt. If you genuinely need bit-identical output, the only reliable path today is self-hosting with deterministic, batch-invariant kernels and a fixed batch, knowing you trade some speed for that guarantee. For most teams the better move is to stop demanding identical strings and start measuring whether the answer is correct.
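A structural check along those lines might look like the sketch below. The required field names are a hypothetical schema invented for the example; substitute whatever your application actually expects.

```python
import json

REQUIRED_FIELDS = {"answer", "sources"}  # hypothetical schema for illustration

def check_response(raw_text):
    """Return a list of problems; an empty list means the response passes."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    missing = REQUIRED_FIELDS - set(data)
    return [f"missing field: {name}" for name in sorted(missing)]

# Two differently worded outputs both pass; an incomplete one does not.
run_a = '{"answer": "Paris", "sources": ["atlas"]}'
run_b = '{"sources": ["atlas"], "answer": "The capital is Paris."}'
run_c = '{"answer": "Paris"}'

print(check_response(run_a))  # []
print(check_response(run_b))  # []
print(check_response(run_c))  # ['missing field: sources']
```

A golden-file comparison would fail on `run_b` even though nothing is wrong with it. The structural check fails only when something the application depends on is actually missing.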

Where reliable AI memory fits

Nondeterminism at the token level is one reason grounding a model in stable, retrievable context matters more than chasing identical wording. If the facts an assistant works from are fixed and yours, small phrasing wobble stops mattering, because the substance stays anchored. MemX is a consumer AI memory layer over your own documents, photos, and notes across Android, iOS, and WhatsApp, giving your assistant a consistent source to draw on instead of improvising. It is private by architecture, with per-user keys, encryption at rest, and an on-device first pass, so your memory stays yours while the model's surface wording varies as it always will.

Frequently Asked Questions

01. Does temperature 0 make an LLM deterministic?

Only at the selection step. Temperature 0 fixes token choice to greedy argmax, but the logits that argmax reads are computed by parallel floating-point math on a GPU. Floating-point addition is not associative, and the order of those additions varies with kernel choice and batch size. Tiny rounding differences can flip near-tied tokens, so outputs still vary.

02. Does setting a seed make an LLM deterministic?

Not on hosted APIs. OpenAI states the seed is a best-effort attempt and determinism is not guaranteed. They expose a system_fingerprint so you can detect backend changes. A seed helps consistency but cannot override hardware, batch, and cluster variation you do not control.

03. Why does the same prompt give different answers at different times?

Server load changes which other requests get batched with yours, and that changes the GPU reduction order and your logits. Hosted endpoints also route requests across clusters with different hardware and checkpoints. Both shift the arithmetic enough to flip near-tied tokens.

04. Can open-weights models be made fully deterministic?

Yes, on hardware you control. Fix the seed and temperature 0, pin the checkpoint, run a stable batch, and use batch-invariant kernels so reduction order stays constant. Thinking Machines demonstrated 1,000 bit-identical runs this way, at a throughput cost over default serving.

05. What are batch-invariant kernels?

They are GPU kernels that always reduce numbers in the same order regardless of batch size, removing the variation standard kernels introduce when batch composition changes. vLLM and SGLang now offer them as an opt-in deterministic mode for reproducible inference.


Written by Arpit Tripathi

Founder of MemX. Ex-Google Staff Tech Lead Manager, ex-AWS Senior SDE (Elastic Block Store). Writes about practical AI on the MemX blog.
