Decoding Strategies (Greedy, Beam Search, Top-k, Top-p)

Decoding strategies are the algorithms that turn a language model's predicted probability distribution over the next token into actual generated text. The main families are deterministic search (greedy, beam search) and stochastic sampling (top-k, top-p/nucleus), which trade off accuracy, diversity, and the risk of repetitive or degenerate output.

What are Decoding Strategies?

Decoding strategies are the algorithms that convert a language model's next-token probability distribution into a concrete sequence of output tokens. At each step the model produces logits over its vocabulary, a softmax turns those logits into probabilities, and the decoding strategy decides which token to emit. The choice of strategy strongly affects whether output is accurate, diverse, repetitive, or incoherent, even though the underlying model is unchanged.

Decoding methods fall into two broad families. Deterministic search methods (greedy decoding and beam search) aim to find a high-probability sequence and produce the same output every time for a given prompt. Stochastic sampling methods (top-k sampling, top-p or nucleus sampling, and temperature scaling) introduce randomness to produce varied and often more natural text. The right choice depends on the task: factual or structured tasks favor search, while open-ended generation favors sampling.

Operate on the per-step probability distribution, not the model weights.
Deterministic family: greedy, beam search; reproducible output.
Stochastic family: top-k, top-p (nucleus), temperature; varied output.
Strongly influence fluency, diversity, and repetition without retraining.
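
The per-step pipeline described above (logits, then softmax, then a choice of token) can be sketched in a few lines of plain Python. The four-token vocabulary and the logit values are made up for illustration, and `softmax` is a helper written for this sketch rather than a library function.

```python
import math
import random

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-token vocabulary and made-up logits for one decoding step.
vocab = ["cat", "dog", "car", "the"]
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)

# Deterministic choice: always the most probable token.
greedy_token = vocab[probs.index(max(probs))]

# Stochastic choice: draw one token in proportion to its probability.
rng = random.Random(0)  # seeded so the draw is repeatable
sampled_token = rng.choices(vocab, weights=probs, k=1)[0]

print(greedy_token, sampled_token)
```

Both choices read the same `probs` list; only the selection rule differs, which is why the decoding strategy can be swapped without touching the model.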

Greedy decoding and beam search

Greedy decoding picks the single highest-probability token at every step. It is fast and deterministic but myopic: a locally optimal token can lead to a globally poor sequence, and greedy output often becomes repetitive on open-ended prompts.
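
To see how greedy decoding can fall into a loop, consider a deliberately contrived sketch. The `NEXT` table below is a hypothetical bigram "model" in which each next-token distribution depends only on the previous token; the probabilities are invented so that the most likely path cycles.

```python
# Hypothetical bigram "model": next-token probabilities given the previous token.
NEXT = {
    "I":     {"think": 0.6, "know": 0.4},
    "think": {"that": 0.7, "so": 0.3},
    "know":  {"that": 0.9, "it": 0.1},
    "that":  {"I": 0.55, "it": 0.45},
    "it":    {"works": 1.0},
    "so":    {"<eos>": 1.0},
    "works": {"<eos>": 1.0},
}

def greedy_decode(start, max_new_tokens):
    """Append the single most probable next token at every step."""
    tokens = [start]
    for _ in range(max_new_tokens):
        dist = NEXT[tokens[-1]]
        next_token = max(dist, key=dist.get)  # argmax over the distribution
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

print(" ".join(greedy_decode("I", 9)))
```

Because "I" is slightly more likely than "it" after "that", greedy decoding never takes the exit and repeats "I think that" until the token budget runs out. A sampler would leave the cycle at "that" about 45% of the time.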

Beam search keeps the k most probable partial sequences (beams) at each step instead of one, expanding each and retaining the top k by cumulative log-probability. It usually finds higher-probability sequences than greedy decoding and is standard in machine translation and other tasks with a clear correct answer. Its weaknesses are cost (roughly k times the compute of greedy decoding) and a known tendency toward bland, repetitive text in open-ended generation, the problem that motivated the nucleus sampling paper.

score(y₁..y_t) = Σ_{i=1}^{t} log P(y_i | y_{<i}, x)

Beam search ranks candidate sequences by summed log-probability of their tokens; greedy decoding is the special case where the beam width k equals 1.

Greedy: argmax at every step; fastest, deterministic, prone to loops.
Beam search: tracks k candidate sequences ranked by cumulative log-probability.
Beam search excels on translation and short, answer-bearing tasks.
Both can produce dull or repetitive text in open-ended writing.
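
The difference between the two search methods can be illustrated with a minimal beam search over a toy model. `MODEL` is a hypothetical lookup table of next-token probabilities keyed by the prefix generated so far, with values chosen so that the best first token does not lead to the best two-token sequence. This is a sketch of the idea, not how any particular library implements it.

```python
import math

# Hypothetical toy model: next-token probabilities keyed by the prefix so far.
MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"z": 0.9, "w": 0.1},
}

def beam_search(model, width, steps):
    """Return the `width` best sequences after `steps` tokens, best first."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, p in model[prefix].items():
                # The score is the sum of token log-probabilities.
                candidates.append((prefix + (token,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:width]  # prune to the top `width` candidates
    return beams

for width in (1, 2):
    sequence, score = beam_search(MODEL, width, steps=2)[0]
    print(width, " ".join(sequence), f"{math.exp(score):.2f}")
```

With width 1 (greedy) the search commits to "A" and ends at "A x" with probability 0.33. With width 2 it keeps "B" alive for one more step and finds "B z" with probability 0.36.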

Top-k and top-p (nucleus) sampling

Top-k sampling restricts the choice to the k most probable tokens, renormalizes their probabilities, and samples from that reduced set. It avoids picking very unlikely tokens but uses a fixed cutoff, which can be too narrow when the distribution is flat and too wide when it is peaked.
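
A minimal sketch of the top-k filter, using two made-up distributions to show why a fixed cutoff fits some steps poorly. `top_k_filter` is a helper written for this illustration, not a library function.

```python
import random

def top_k_filter(probs, k):
    """Zero out everything except the k most probable tokens, then renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

# Two made-up next-token distributions over a 5-token vocabulary.
peaked = [0.92, 0.04, 0.02, 0.01, 0.01]
flat = [0.22, 0.21, 0.20, 0.19, 0.18]

# With k=3 the peaked case still admits two tokens the model considers unlikely,
# while the flat case discards two tokens that are almost as likely as the rest.
print([round(q, 3) for q in top_k_filter(peaked, 3)])
print([round(q, 3) for q in top_k_filter(flat, 3)])

# Sampling then draws a token index from the filtered distribution.
rng = random.Random(0)  # seeded so the draw is repeatable
token_index = rng.choices(range(len(flat)), weights=top_k_filter(flat, 3), k=1)[0]
```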

Top-p sampling, also called nucleus sampling, was introduced in "The Curious Case of Neural Text Degeneration" (Holtzman et al., 2019). Instead of a fixed count, it keeps the smallest set of tokens whose cumulative probability reaches or exceeds a threshold p (for example 0.9), then renormalizes and samples from that nucleus. Because the nucleus grows and shrinks with the model's confidence, top-p adapts to context better than top-k and reduces both repetition and incoherent output. Temperature is usually applied first: it rescales logits before softmax, with values above 1 flattening the distribution for more randomness and values below 1 sharpening it.
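
The adaptive nucleus can be sketched as follows. `top_p_filter` is a helper written for this illustration, and the two distributions are made up: one where the model is confident and one where it is not.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in ranked:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:  # the nucleus is complete once the threshold is reached
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def nucleus_size(probs, p):
    """Count how many tokens survive the filter."""
    return sum(1 for q in top_p_filter(probs, p) if q > 0)

# Two made-up next-token distributions over a 5-token vocabulary.
peaked = [0.92, 0.04, 0.02, 0.01, 0.01]
flat = [0.22, 0.21, 0.20, 0.19, 0.18]

print(nucleus_size(peaked, 0.9))  # confident step: small nucleus
print(nucleus_size(flat, 0.9))    # uncertain step: large nucleus
```

With the same p of 0.9, the confident step keeps a single token while the uncertain step keeps all five, which is the adaptation that a fixed k cannot provide.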

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("The future of AI memory is", return_tensors="pt")

# GPT-2 defines no pad token; reuse EOS so generate() does not warn about it
pad_id = tok.eos_token_id

# Greedy (deterministic): argmax at every step
greedy = model.generate(
    **inputs, do_sample=False, max_new_tokens=40, pad_token_id=pad_id
)

# Beam search with 5 beams (deterministic)
beam = model.generate(
    **inputs, num_beams=5, do_sample=False, max_new_tokens=40, pad_token_id=pad_id
)

# Nucleus (top-p) sampling with temperature
nucleus = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,
    top_k=0,          # disable top-k so only top-p applies
    temperature=0.8,
    max_new_tokens=40,
    pad_token_id=pad_id,
)

# Decode and inspect each output
for name, out in [("greedy", greedy), ("beam", beam), ("nucleus", nucleus)]:
    print(name, "->", tok.decode(out[0], skip_special_tokens=True))
```

Generating with different decoding strategies using Hugging Face Transformers.

Top-k: sample from the fixed k highest-probability tokens.
Top-p (nucleus): sample from the smallest set whose probabilities sum to p.
Top-p adapts the candidate pool size to the model's per-step confidence.
Temperature rescales logits before sampling: higher is more random, lower is more focused.
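
The effect of temperature can be checked numerically with a small sketch. The logits are made up, and `softmax_with_temperature` is a helper written for this illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up logits for a 4-token vocabulary

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(q, 3) for q in probs])
```

The probability of the top token shrinks as the temperature rises, and the ranking of tokens never changes. Dividing by a temperature of exactly 0 is undefined, so implementations that accept 0 have to treat it as a special case, typically by taking the argmax.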

Choosing a strategy in practice

For factual question answering, code generation, or structured output where one correct answer exists, low-randomness decoding is preferred: greedy or beam search, or sampling with a low temperature. For creative writing, brainstorming, or dialogue where variety matters, nucleus sampling with a moderate temperature (commonly p around 0.9 and temperature near 0.7 to 1.0) is a strong default.

Most production APIs expose temperature and top_p as parameters, and many also expose top_k; temperature combined with top-p is the most common setup. Setting temperature to 0 effectively makes generation greedy and, in most cases, reproducible, which is useful for testing and for tasks that need stable output. Some hosted APIs can still show small run-to-run differences at temperature 0.

Deterministic tasks (facts, code, JSON): greedy, beam, or low temperature.
Creative tasks: nucleus sampling, p around 0.9, temperature around 0.7 to 1.0.
Temperature 0 makes most APIs behave greedily and, in most cases, reproducibly.
Top-p and temperature are the most common combination in production.
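
One way to encode these rules of thumb is a small table of presets. Everything in this sketch is hypothetical: the preset names, the task labels, and the `generation_kwargs` helper are invented for illustration, and the values simply follow the ranges given above. The keys are chosen to match the Hugging Face `generate` arguments used in the earlier example.

```python
# Hypothetical presets keyed by decoding style.
PRESETS = {
    "deterministic": {"do_sample": False},
    "beam": {"do_sample": False, "num_beams": 5},
    "creative": {"do_sample": True, "top_p": 0.9, "top_k": 0, "temperature": 0.8},
}

# Hypothetical task labels mapped to a preset.
TASK_TO_PRESET = {
    "facts": "deterministic",
    "code": "deterministic",
    "json": "deterministic",
    "translation": "beam",
    "story": "creative",
    "brainstorm": "creative",
    "dialogue": "creative",
}

def generation_kwargs(task, max_new_tokens=40):
    """Return decoding arguments for a task label; raises KeyError if unknown."""
    preset = PRESETS[TASK_TO_PRESET[task]]
    return {**preset, "max_new_tokens": max_new_tokens}

print(generation_kwargs("json"))
print(generation_kwargs("story"))
```

With a model and inputs prepared as in the Transformers example, the result could be passed along as `model.generate(**inputs, **generation_kwargs("story"))`. The right values depend on the model and the task, so these presets are a starting point to tune rather than a recommendation.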

Key takeaways

Decoding strategies turn an LLM's next-token probabilities into text and strongly shape output quality without changing the model.
Greedy and beam search are deterministic and best for tasks with one correct answer, but can be repetitive in open-ended writing.
Top-k samples from a fixed number of top tokens; top-p (nucleus) samples from a probability-mass threshold that adapts to confidence.
Nucleus sampling was introduced to fix the bland, repetitive output of likelihood-maximizing decoding.
Temperature rescales logits before sampling; setting it to 0 makes generation effectively greedy and reproducible.

Frequently asked questions

What is the difference between greedy decoding and beam search?

Greedy decoding picks the single most likely token at each step. Beam search tracks the k most probable partial sequences at once and chooses the one with the highest cumulative log-probability, usually finding better overall sequences at higher compute cost.

What is the difference between top-k and top-p sampling?

Top-k samples from a fixed number of the highest-probability tokens. Top-p (nucleus) samples from the smallest set of tokens whose probabilities sum to a threshold p, so the candidate pool grows or shrinks with the model's confidence.

What does temperature do?

Temperature rescales the logits before the softmax. Values above 1 flatten the distribution for more random, diverse output; values below 1 sharpen it toward the most likely tokens; temperature 0 makes generation effectively greedy and deterministic.

Which decoding strategy should I use?

It depends on the task. Use greedy, beam search, or low temperature for factual or structured output, and nucleus sampling (top-p around 0.9) with moderate temperature for creative or conversational text where diversity is desired.

Why does greedy decoding produce repetitive text?

Greedy decoding always takes the locally most probable token, which can lock the model into high-likelihood loops. Likelihood-maximizing methods overweight common continuations, so open-ended generation can degenerate into repetition, the problem nucleus sampling was designed to address.
