Temperature, Top-P, Top-K Explained

You bumped temperature to 1.1 for more creative chat, kept top-p low to play it safe, and the output barely changed. Here is the verdict with no hedging: the low top-p quietly threw away the variety you just dialed in. The split between temperature and top-p in LLM sampling is simple. Temperature is a single dial that reshapes the entire probability distribution. Top-p and top-k are filters that trim the long tail before a token gets picked. Get that division of labor right and most tuning problems solve themselves. Two facts trip up almost everyone: temperature 0 does not guarantee deterministic output, and stacking a low top-p on top of a high temperature can cancel the effect you were aiming for.

Here is the mechanism in one breath. Temperature divides the raw logits by a constant before the softmax turns them into probabilities. Top-p (nucleus sampling) keeps the smallest set of top tokens whose probabilities sum to at least a threshold p, so the candidate pool grows and shrinks with the model's confidence. Top-k keeps a fixed number of top tokens no matter what. The rule that saves you the most time: pick one knob to control randomness and one filter to cap the tail, and tune them in that order.

Insight

One-line model: temperature changes how steep or flat the odds are. Top-p and top-k change how many tokens are eligible. They act on the same distribution but do different jobs, which is exactly why combining them carelessly backfires.

How sampling actually works: from logits to a chosen token

Every generation step produces one score per vocabulary token, called a logit. The model then turns those scores into a single next token. Within a given library the pipeline runs in a fixed order, though the exact order can differ between implementations. In a typical setup, the forward pass emits raw logits, temperature scales them, top-k and top-p filter the candidates (top-p measures its cutoff on the softmax probabilities of the scaled logits), a softmax over the survivors renormalizes their probabilities, and a final draw picks the token. Change any stage and you change the output. Knowing that order is what turns the combination traps later in this post from mysterious into obvious.
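The stages above can be sketched in plain Python. This is a minimal illustration, not any library's actual implementation: the function name `sample_next_token` and its parameters are hypothetical, a `top_k` of 0 is assumed to mean "no top-k filter", and the temperature is assumed to be greater than 0. Because top-p is defined on probabilities, the sketch computes a softmax before applying that filter.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Pick one token index from raw logits using temperature, top-k, and top-p."""
    # 1. Temperature: divide every logit by T (assumes T > 0).
    scaled = [x / temperature for x in logits]

    # 2. Rank token indices from highest to lowest scaled logit.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)

    # 3. Top-k: keep only the k best candidates (0 means no top-k filter).
    if top_k > 0:
        order = order[:top_k]

    # 4. Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # 5. Top-p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for idx, p in zip(order, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # 6. Draw one token in proportion to the surviving probabilities.
    indices = [idx for idx, _ in kept]
    kept_probs = [p for _, p in kept]
    return rng.choices(indices, weights=kept_probs, k=1)[0]
```

Calling `sample_next_token([2.0, 1.0, 0.5, -1.0], top_k=1)` always returns index 0, since only the top candidate survives the filter. Passing `rng=random.Random(0)` makes the draws repeatable for experiments.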

There are two broad ways to choose: deterministic search and sampling. Greedy search always takes the single highest-scoring token, and it is the default decoding strategy in Hugging Face Transformers when do_sample is left at False. Sampling, switched on with do_sample=True, draws a token at random in proportion to its probability, so lower-ranked tokens still get a chance. Greedy is tight but repetitive on long outputs. Sampling is varied but can wander. Temperature, top-p, and top-k only matter once sampling is on.
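To make the contrast concrete, here is a small sketch of the two strategies applied to one made-up distribution. The token probabilities and the helper names `greedy` and `sample` are hypothetical, chosen only for illustration.

```python
import random

# Hypothetical next-token probabilities, for illustration only.
probs = {"the": 0.6, "a": 0.3, "this": 0.1}

def greedy(probs):
    """Deterministic: always return the highest-probability token."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Stochastic: draw a token in proportion to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

rng = random.Random(42)  # seeded so the draw sequence is repeatable
print(greedy(probs))                            # always "the"
print([sample(probs, rng) for _ in range(10)])  # draws follow the 60/30/10 weights
```

Greedy returns the same token on every call, while repeated sampling spreads its choices across all three candidates according to their weights.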

Why the long tail exists at all

After softmax, a handful of tokens hold most of the probability mass and thousands of others split the rest. That sliver is the long tail. Most of those tokens are nonsense in context, but a few are genuinely plausible alternatives. Sampling without any filter occasionally reaches deep into the tail and pulls out a strange token, which derails the next step, because each token conditions everything after it. Top-p and top-k exist to cut the tail off before that happens.

Temperature: what high vs low actually changes

Temperature divides every logit by a constant T before softmax, which stretches or compresses the gaps between token scores. Low temperature (below 1) widens the lead of the top token, sharpening the distribution toward the most likely choices and producing steadier, more repeatable text. High temperature (above 1) flattens the distribution, so the gaps between scores shrink and unlikely tokens become competitive. The token rankings never change. Only how confidently the model favors the front-runners does.
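A quick numeric check helps here. Written out, the probability of token i is exp(z_i / T) divided by the sum of exp(z_j / T) over all tokens j, where z are the logits. The sketch below applies that formula to three made-up logits; the function name `softmax_with_temperature` is hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing each by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

For these logits the top token's share falls from roughly 0.87 at T = 0.5 to 0.67 at T = 1 and 0.51 at T = 2, while the ranking of the three tokens stays the same.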

Match the temperature to what the task rewards. Use low temperature for tasks with a right answer: extraction, classification, code, structured output, factual question answering. Use higher temperature for tasks where variety is the point: brainstorming, fiction, marketing copy, alternative phrasings. At low temperatures the model rarely strays from its top picks. As temperature climbs above 1, the output starts to lose coherence, because genuinely improbable tokens slip through. OpenAI's chat completions API, for example, accepts a temperature from 0 to 2 with a default of 1, though useful settings cluster well below that ceiling.

Temperature 0 is not what it sounds like

Setting temperature to 0 collapses sampling into greedy decoding. Dividing by zero is undefined, so implementations typically treat 0 as a special case that means "take the top-scoring token". The sampling step itself stops being random. That is the determinism people expect. The catch is everything upstream. The logits get computed with floating-point math on GPUs, and floating-point addition is not associative, so parallel reductions can finish in different orders and shift logits by tiny amounts between runs. When two top tokens sit close together, that jitter can flip which one wins, and the divergence compounds from there.
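The non-associativity is easy to see on a CPU with ordinary Python floats. The sketch below is an analogy, not a GPU trace: it sums the same three numbers in two groupings, then shows how the tiny difference can flip a greedy pick when a competing logit sits right at the boundary. The `argmax` helper and the logit values are hypothetical.

```python
# Same three numbers, two groupings, two different floating-point results.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)    # 0.6000000000000001 0.6

def argmax(scores):
    """Return the index of the highest score (first index wins a tie)."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Token 0 has a fixed logit of 0.6. Token 1's logit comes from a reduction
# whose summation order differs between two hypothetical runs.
print(argmax([0.6, left_to_right]))   # 1
print(argmax([0.6, right_to_left]))   # 0
```

Real logits are rarely this finely balanced, which is why temperature-0 outputs match most of the time and diverge only occasionally.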

Here is the part that burns teams shipping to production: two identical temperature-0 calls can still return different text, and that is the API working as designed, not a bug in your code. Most providers do not promise bit-exact reproducibility, because the batch a request lands in and the hardware it runs on affect the order of floating-point operations. Research from Thinking Machines Lab on batch invariance points to varying server-side batch size as a major driver of this run-to-run drift. Expect outputs that match most of the time, not every time, and never build a system that breaks if two temperature-0 calls differ by a token.

Pro Tip

If you need stable output for tests or caching, set temperature to 0 and also pass a fixed seed when your provider supports one. Then compare with tolerance, allowing for the occasional one-token difference, rather than asserting exact string equality.
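One way to compare with tolerance is a word-level similarity ratio from Python's standard `difflib` module. This is a sketch under assumptions: the helper name `outputs_match` and the 0.9 threshold are hypothetical, and the right threshold depends on how long your outputs are.

```python
import difflib

def outputs_match(a, b, min_ratio=0.9):
    """Return True when two outputs are nearly identical at the word level."""
    # min_ratio is a hypothetical threshold; tune it for your own fixtures.
    words_a, words_b = a.split(), b.split()
    ratio = difflib.SequenceMatcher(None, words_a, words_b).ratio()
    return ratio >= min_ratio

expected = "the quick brown fox jumps over the lazy dog near the old stone bridge at dawn every single day"
actual = "the quick brown fox jumps over the sleepy dog near the old stone bridge at dawn every single day"

print(outputs_match(expected, actual))                    # True
print(outputs_match(expected, "an unrelated reply"))      # False
```

A single swapped word in a 19-word output still passes, while a completely different answer fails, which is the behavior a snapshot test or cache check usually wants.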

Top-P (nucleus) and Top-K: two ways to cut the long tail

Top-k keeps the k highest-scoring tokens and discards the rest before sampling. Set k to 40 and the model only ever chooses among its top 40 candidates at each step, regardless of how confident it is. Top-k is simple and predictable. The fixed cutoff is its weakness: when the model is very sure, 40 is far too many, and when it is genuinely torn among many options, 40 may be too few. The pool size never adapts to the situation.
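The fixed cutoff is easy to see in code. In this sketch the function name `top_k_filter` and both example distributions are made up for illustration; the filter operates on probabilities for readability, though libraries usually apply it to logits.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}

# Hypothetical distributions: one confident, one genuinely torn.
confident = {"Paris": 0.97, "Lyon": 0.01, "Nice": 0.01, "Rome": 0.01}
torn = {"red": 0.26, "blue": 0.25, "green": 0.25, "gold": 0.24}

print(len(top_k_filter(confident, 2)))  # 2
print(len(top_k_filter(torn, 2)))       # 2
```

With k = 2 the confident case keeps an implausible second token, and the torn case drops two options that were nearly as likely as the ones kept.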

Top-p, also called nucleus sampling, sizes the candidate pool by probability mass instead of by count. Starting from the most likely token, it adds tokens until their cumulative probability reaches the threshold p, then samples from just that set. When the model is confident, two or three tokens may already cover most of the mass, so the pool stays tiny. When the model is uncertain, the pool widens on its own. That adaptiveness is why top-p has grown popular, with common values landing around 0.9 to 0.95.
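A short sketch shows the pool adapting. The function name `top_p_filter` and the two example distributions are hypothetical, chosen so the confident and uncertain cases are easy to compare.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Hypothetical distributions: one confident, one genuinely torn.
confident = {"Paris": 0.97, "Lyon": 0.01, "Nice": 0.01, "Rome": 0.01}
torn = {"red": 0.26, "blue": 0.25, "green": 0.25, "gold": 0.24}

print(len(top_p_filter(confident, 0.9)))  # 1
print(len(top_p_filter(torn, 0.9)))       # 4
```

The same p = 0.9 yields a pool of one token when the model is sure and four when it is torn, which a fixed k cannot do.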

The newer option most explainers skip: min-p

Here is an angle the standard temperature-vs-top-p writeups leave out. A 2024 method called min-p sampling, presented at ICLR 2025 by Minh Nguyen and co-authors, sets its cutoff relative to the top token's probability instead of a fixed mass. The paper reports that min-p beats top-p for creative writing at high temperatures while holding coherence. The contrarian footnote, as of June 2026: a 2025 critical re-analysis (arXiv 2506.13681) argues that advantage shrinks once you optimize for quality and diversity together, and disputes some of the original evaluation. The takeaway for most teams is to treat top-p as the dependable default and reach for min-p only if high-temperature creative output is your specific problem.
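The cutoff rule described above can be sketched in a few lines. This follows the description of min-p as a threshold scaled by the top token's probability; the function name `min_p_filter`, the example distributions, and the value 0.1 are assumptions for illustration, not settings taken from the paper.

```python
def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# Hypothetical distributions: one confident, one genuinely torn.
confident = {"Paris": 0.97, "Lyon": 0.01, "Nice": 0.01, "Rome": 0.01}
torn = {"red": 0.26, "blue": 0.25, "green": 0.25, "gold": 0.24}

print(len(min_p_filter(confident, 0.1)))  # 1
print(len(min_p_filter(torn, 0.1)))       # 4
```

When the top token is at 0.97 the threshold is 0.097, so the stragglers are cut. When the top token is only at 0.26 the threshold drops to 0.026 and every close competitor stays eligible.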

Knob	What it does	Pool size	Best for
Temperature	Scales logits to sharpen or flatten odds	Unchanged; reweights all tokens	Overall randomness control
Top-K	Keeps the k highest-scoring tokens	Fixed at k every step	Hard cap on the candidate set
Top-P	Keeps tokens up to cumulative prob p	Adapts to model confidence	Most general tail trimming
Greedy (temp 0)	Always takes the single top token	One token	Most repeatable, structured output

A set-this-get-that table by task

Pick settings by what the task rewards. Code and extraction reward correctness and repeatability, so push temperature down and let a low top-p cap the tail. Chat and general assistance want a natural feel without going off the rails, so a moderate temperature with a high top-p works well. Creative work rewards surprise, so raise temperature and widen top-p. Treat the values below as starting points, not laws. Every model family responds a little differently, so tune from this baseline.

Code, extraction, classification: temperature 0 to 0.2, top-p around 0.1 to 0.5. You want the obvious answer, every time.
Factual question answering: temperature 0.1 to 0.3, top-p around 0.5 to 0.8. Stay grounded but allow natural phrasing.
General chat and assistants: temperature 0.5 to 0.7, top-p around 0.9 to 0.95. Balanced and readable.
Brainstorming and ideation: temperature 0.8 to 1.0, top-p around 0.95. More divergent options per prompt.
Creative writing and poetry: temperature 0.9 to 1.2, top-p around 0.95 to 1.0. Maximum variety, accept some misses.
Reproducible test fixtures: temperature 0 plus a fixed seed where supported. Compare with a small tolerance.
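If you keep these starting points in code, a small lookup table is enough. The sketch below copies the ranges from the list above; the task keys, the `starting_params` helper, and the choice to begin at the midpoint of each range are assumptions made for illustration.

```python
# Starting-point ranges copied from the list above: (low, high) per knob.
PRESETS = {
    "code_extraction": {"temperature": (0.0, 0.2), "top_p": (0.1, 0.5)},
    "factual_qa": {"temperature": (0.1, 0.3), "top_p": (0.5, 0.8)},
    "chat": {"temperature": (0.5, 0.7), "top_p": (0.9, 0.95)},
    "brainstorming": {"temperature": (0.8, 1.0), "top_p": (0.95, 0.95)},
    "creative_writing": {"temperature": (0.9, 1.2), "top_p": (0.95, 1.0)},
    "test_fixture": {"temperature": (0.0, 0.0)},  # add a fixed seed where supported
}

def starting_params(task):
    """Return the midpoint of each range as a first setting to try (an assumption)."""
    preset = PRESETS[task]
    return {knob: round((lo + hi) / 2, 3) for knob, (lo, hi) in preset.items()}

print(starting_params("chat"))
```

Keeping the ranges rather than single numbers makes it clear which direction you can move in when you start tuning for a specific model.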

Insight

Notice that temperature and top-p move together in every row: tight with tight, loose with loose. No row raises one knob while clamping the other. That is deliberate, and the next section explains why.

Common trap: combining temperature and top-p badly

The Prompt Engineering Guide states the rule plainly: alter temperature or top-p, but not both. They fight over the same distribution. Raise temperature to add randomness, then clamp top-p low to be safe, and the low top-p throws away most of the variety you just paid for. Push both high at once and you get compounding chaos, with the flattened distribution and a wide pool together letting in tokens that wreck coherence. The two knobs do not add up cleanly.

Return to the example from the top of this post, because the failure is so quiet. You set temperature to 1.1 and, out of caution, top-p to 0.3. The high temperature flattens the odds, but top-p 0.3 then keeps only the few tokens covering the first 30 percent of mass, which are still the safe, common ones. The output feels barely different from a low-temperature run, and you conclude temperature does nothing. It did. The filter undid it. Move one knob at a time so you can actually see its effect.
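The cancellation can be reproduced with made-up numbers. In this sketch the six logits and the function name `nucleus_after_temperature` are hypothetical; the point is only to compare which tokens remain eligible under different settings.

```python
import math

def nucleus_after_temperature(logits, temperature, top_p):
    """Return the token indices that survive temperature scaling, then top-p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

logits = [3.0, 2.0, 1.5, 1.0, 0.5, 0.0]  # made-up scores for six tokens

print(nucleus_after_temperature(logits, 0.7, 0.3))  # [0]
print(nucleus_after_temperature(logits, 1.1, 0.3))  # [0]
print(nucleus_after_temperature(logits, 1.1, 1.0))  # [0, 1, 2, 3, 4, 5]
```

With top-p at 0.3, the top token alone covers the threshold at both temperatures, so the two runs sample from an identical one-token pool. Only when top-p is opened up does the higher temperature have any candidates to spread probability across.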

Pro Tip

Pick a primary control and pin the other to a neutral value. If you tune temperature, leave top-p at 1.0 (off) or near it. If you tune top-p, leave temperature at 1.0. Change one, observe, then change the other only if you still need to.

Defaults to start from and how to tune from there

Start from a sensible default, then adjust in one direction at a time. For most assistant and chat use, begin at temperature around 0.7 with top-p around 0.9, and check what your provider ships by default, since defaults vary (OpenAI's temperature default, for example, is 1). For anything that must be correct and stable, start at temperature 0 and only raise it if the output feels unnaturally clipped. Generate several samples at each setting before judging, because one sample tells you almost nothing about a distribution.

Output too repetitive or robotic: raise temperature in steps of 0.1, or widen top-p toward 0.95.
Output too erratic or off-topic: lower temperature, or tighten top-p toward 0.7.
Occasional bizarre tokens in otherwise good text: keep temperature, tighten top-p to trim the tail.
Need strict structure (JSON, code): drop temperature to 0 to 0.2 and stop tuning top-p.
Results jump around between runs at temperature 0: that is run-to-run jitter from floating-point and batching, not your settings; add tolerance to comparisons.

Test against your real prompts, not toy examples, because the right setting depends on the task and the model. Recommendations transfer roughly across models but not exactly. The same temperature can feel tame on one model and wild on another. Log the parameters alongside outputs so you can reproduce a result you liked. Sampling settings are cheap to change and expensive to guess at, so measure instead of assuming.
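Logging can be as simple as appending one JSON line per generation. This is a minimal sketch: the function name `log_generation`, the record fields, and the file name in the usage comment are all hypothetical.

```python
import json
import time

def log_generation(path, prompt, output, params):
    """Append one JSON line recording the prompt, output, and sampling params."""
    record = {
        "timestamp": time.time(),
        "prompt": prompt,
        "output": output,
        "params": params,  # e.g. temperature, top_p, top_k, seed, model name
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Example usage (hypothetical file name and values):
# log_generation("generations.jsonl", "Name a color.", "Blue.",
#                {"temperature": 0.7, "top_p": 0.9, "seed": 1})
```

One line per call keeps the log easy to grep and easy to load later when you want to find the settings behind an output you liked.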

Where memory fits in

Sampling settings shape how a model phrases an answer, but they cannot give it information it never had. If your app needs the model to remember a user's past choices, preferences, or facts across sessions, that is a memory problem, not a temperature problem. No amount of top-p tuning recovers context that was never in the prompt.

That is the gap MemX (memx.app) fills. MemX is an external, model-agnostic AI memory layer that stores and retrieves the right context so your prompts carry what matters, whichever model and sampling settings you use. It is private by architecture: per-user isolation, encryption at rest, and on-device options. It does not change your sampling behavior. It just makes sure the model has the context before temperature and top-p decide how to say it.

Frequently Asked Questions

01. What is the difference between temperature and top_p in LLM sampling?

Temperature scales the whole probability distribution to make it sharper or flatter, controlling overall randomness. Top-p trims the candidate pool to the smallest set of tokens whose probabilities sum to p. Temperature reweights all tokens; top-p decides which ones are eligible to be picked.

02. Should I use both temperature and top_p together?

Usually no. The Prompt Engineering Guide recommends altering temperature or top-p, not both, since they act on the same distribution and can cancel out. Pick one as your main control, leave the other near a neutral 1.0, and tune in one direction at a time.

03. Does temperature 0 make an LLM fully deterministic?

Not fully. Temperature 0 forces greedy decoding, so the sampling step is deterministic, but GPU floating-point math is not associative and server-side batching varies, so logits can shift slightly between runs and flip close calls. Expect highly repeatable output, not bit-exact identical text every call.

04. What is a good temperature for code generation?

Use a low temperature, around 0 to 0.2, for code, extraction, and structured output. You want the most likely, correct token rather than variety. For reproducible results, set temperature to 0, add a fixed seed if supported, and compare outputs with a small tolerance.

05. What is the difference between top_k and top_p?

Top-k keeps a fixed number of the highest-scoring tokens, regardless of confidence. Top-p (nucleus sampling) keeps a variable number of tokens whose cumulative probability reaches p, so the pool grows when the model is uncertain and shrinks when it is confident. Top-p adapts; top-k does not.

Stop losing what you save.
Let MemX remember it for you.