How LLMs Pick Words: Greedy, Beam, Sampling

You changed one setting in an API call, temperature, and the same model went from sounding like a bored intern to sounding unhinged. You changed nothing about what it knew. Here is the verdict: the model never picks a word. It scores every token, hands those scores to a separate piece of code, and that code does the picking. Greedy, beam search, and sampling are three versions of that code. Temperature and top-p are dials on it. The model already spoke before any of them ran.

Short answer: the model scores every token, then a decoding strategy picks one

At every step a language model produces a score for every token in its vocabulary. Those scores become a probability distribution. Then a separate algorithm, the decoding strategy, chooses which token to emit. That choosing step is where greedy, beam search, and sampling split apart. Temperature and top-p are not the model thinking. They are dials on the gamble that happens after the model has spoken.

This is why two systems running the exact same weights can produce wildly different text. One reads flat and repetitive. The other reads fluent and varied. Not a single weight changed between them; only the decoding strategy bolted on top is different. Learning that one layer is the line between treating an LLM as a black box and knowing exactly why your output looks the way it does. The rest of this post walks the main LLM decoding strategies in the order they actually run.

Insight

The model outputs a probability over thousands of tokens. It never outputs a word. The word is a decision made downstream, by an algorithm you can swap out without retraining a thing.

Logits to softmax: turning raw scores into a probability distribution

The model's final layer emits one raw number per token in its vocabulary. These numbers are logits. They are unbounded real values, not probabilities. A logit can be 8.2 for one token and -3.5 for another, and nothing forces them to add up to anything. To turn this pile of scores into something you can sample from, the model runs the softmax function.

Softmax does two things. It exponentiates each logit, which makes every value positive and widens the gap between high and low scores. Then it divides each result by the sum of all of them, so the whole vector adds up to exactly 1. The formula is e raised to z_i divided by the sum of e raised to z_j across every token j. After softmax, every token has a probability between 0 and 1, and together they form a valid distribution over the vocabulary.
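The two steps above fit in a few lines. What follows is a minimal sketch, not any particular library's implementation: the function name and the three logits are made up for illustration, and subtracting the maximum logit is a standard numerical-stability trick that leaves the result unchanged.

```python
import math

def softmax(logits):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j)."""
    # Subtracting the max logit avoids overflow in exp() without changing the output.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # step 1: exponentiate
    total = sum(exps)
    return [e / total for e in exps]          # step 2: normalize to sum to 1

# Hypothetical logits for a three-token vocabulary.
logits = [8.2, 1.0, -3.5]
probs = softmax(logits)
print([round(p, 6) for p in probs])
print(sum(probs))
```

The ordering of the tokens survives softmax: the largest logit still has the largest probability, only now on a scale you can draw from.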

This is the hidden step every decoding strategy sits on. Greedy, beam, and sampling all operate on this post-softmax distribution. They never touch the logits directly except through one knob: temperature, which scales the logits before softmax runs. A higher temperature flattens the distribution toward uniform and raises the entropy. A lower temperature sharpens it so one token dominates. Temperature is a softmax setting, not a separate strategy, which is why it shows up under sampling later.

Why this layer is invisible in normal use

When you call an API, you see text come out. The logits, the softmax, and the per-token distribution are all computed inside the generation loop, once per generated token. Most users never see the full vocabulary-wide probability vector. It can run from tens of thousands to hundreds of thousands of entries depending on the model, and it is discarded as soon as the decoder collapses it into a single choice. The decoding strategy is the one part of that pipeline you actually control.

Greedy decoding: always take the top token, and why it gets repetitive

Greedy decoding is the simplest strategy. At each step, pick the single token with the highest probability. Formally it takes the argmax over the distribution. There is no randomness and no lookahead. Run the same prompt twice with greedy decoding and you get the identical output every time, because the highest-probability token is always the same.
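Greedy decoding is small enough to show in full. This is a sketch under one assumption: the next-token distribution is handed over as a plain dictionary, and its tokens and numbers are invented for the example.

```python
def greedy_pick(probs):
    """Return the token with the highest probability (the argmax)."""
    return max(probs, key=probs.get)

# Hypothetical next-token distribution; the values are made up.
probs = {"cat": 0.55, "dog": 0.30, "walrus": 0.15}

first = greedy_pick(probs)
second = greedy_pick(probs)
print(first, second)
```

No random number generator appears anywhere, which is the whole reason two runs agree.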

The problem is that greedy decoding loops. The Hugging Face generation guide calls repetition a very common problem in language generation, and worse in greedy and beam search. Once the model lands in a phrasing it likes, the most probable next token keeps pointing back into the same phrase, and the text gets stuck circling. The guide's worked example shows a model emitting the clause 'I'm not sure if I'll ever be able to walk with my dog' twice in a row from pure greedy decoding.

There is a deeper flaw too. Taking the locally best token does not give you the globally best sequence. A token with a slightly lower probability now can open the door to a much higher-probability continuation a few steps later. Greedy decoding never sees that. It commits to the top choice at every step and cannot look back, so it misses high-probability words hidden behind a single low-probability word. That weakness is exactly what beam search was built to fix.

Pro Tip

Greedy is the right default when you want determinism and there is one correct answer: classification, extraction, short factual lookups. It is the wrong default for anything that should read like writing.

Beam search: tracking several candidate paths at once

Beam search fixes greedy's blind spot by keeping more than one candidate sequence alive. Instead of committing to a single top token, it tracks the num_beams most probable partial sequences at every step, expands each one, and keeps the best handful overall. Set num_beams to 5 and it carries five running hypotheses forward at once, pruning the rest.
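Here is a rough sketch of that loop on a toy two-step model. Everything in it is an assumption made for illustration: the `TABLE` of probabilities is invented, and a real decoder would query a model instead of a lookup table. Scores are summed log-probabilities, which is equivalent to multiplying the probabilities.

```python
import math

# Hypothetical toy model: maps a prefix (tuple of tokens) to next-token probabilities.
TABLE = {
    (): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("nice",): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("dog",): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("car",): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}

def beam_search(num_beams, steps):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            # Expand every surviving hypothesis by every possible next token.
            for token, p in TABLE[prefix].items():
                candidates.append((prefix + (token,), score + math.log(p)))
        # Prune: keep only the best num_beams hypotheses overall.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

greedy_like = beam_search(num_beams=1, steps=2)[0]  # one beam behaves like greedy
wider = beam_search(num_beams=2, steps=2)[0]
print(greedy_like[0], round(math.exp(greedy_like[1]), 2))
print(wider[0], round(math.exp(wider[1]), 2))
```

With one beam the search commits to "nice" and ends at probability 0.20. With two beams it keeps "dog" alive long enough to discover "has" and ends at 0.36, which is the case of a strong continuation hidden behind a weaker first token.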

Because it explores several paths, beam search usually ends on a sequence whose overall probability is at least as high as the greedy one, and often higher. The Hugging Face guide puts it more strongly, saying beam search will always beat greedy, and then states the catch plainly: it is still not guaranteed to find the single most likely sequence. Finding the true global best would mean searching the entire tree of possibilities, which grows exponentially with sequence length and is intractable for any real vocabulary. Beam search is a compromise: it goes the full depth, but only num_beams paths wide.

Here is the contrarian part most tutorials skip. A higher-probability sequence is not automatically a better one. The Hugging Face guide, citing Holtzman and colleagues (2019), states that high-quality human language does not follow a distribution of high-probability next words. We want generated text to surprise us, not to be predictable. So the very thing beam search optimizes for, maximum sequence probability, is the thing that makes open-ended writing read like a hostage statement. Beam search shines where the output length is roughly known and there is a clearly correct target, machine translation and summarization being the classic cases. It struggles on anything open-ended. The same repetition from greedy shows up again, which is why beam search is usually paired with an n-gram penalty such as no_repeat_ngram_size, which stops any n-gram of the configured length from appearing twice.
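The idea behind an n-gram penalty can be sketched in a few lines. This is an illustration of the concept only, not the Hugging Face implementation: `banned_next_tokens` is a hypothetical helper, and the token list is invented.

```python
def banned_next_tokens(tokens, n):
    """Return tokens that would complete an n-gram already present in `tokens`."""
    if n < 1 or len(tokens) < n:
        return set()
    # The last n-1 tokens are the start of whatever n-gram comes next.
    prefix = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        # If an earlier n-gram starts the same way, its last token is off limits.
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

history = ["walk", "with", "my", "dog", "walk", "with"]
print(banned_next_tokens(history, 3))
```

A decoder using this idea would zero out the banned tokens before picking, so "walk with my" cannot be emitted a second time.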

The cost of carrying beams

Tracking multiple hypotheses is not free. With num_beams set to 5, the decoder scores continuations for five sequences at every step instead of one, so compute and memory grow roughly in proportion to the beam count. For latency-sensitive chat, that overhead is often not worth it, which is one reason most chat-facing LLM APIs default to sampling rather than beam search.

Sampling: where temperature, top-p, and top-k actually plug in

Sampling abandons the search for the most probable sequence entirely. Instead of taking the argmax, it draws a token at random from the probability distribution, so a token with 12 percent probability gets picked roughly 12 percent of the time. That randomness is what makes output varied and non-deterministic, and it is the default mode for most creative and conversational use. It is also where the dials you have heard of finally enter the picture.
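The "12 percent gets picked roughly 12 percent of the time" claim is easy to check with a weighted random draw. A small sketch, assuming a made-up three-token distribution; the seed is there only so the demo is repeatable.

```python
import random

def sample_token(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical distribution; "zebra" is the 12 percent token.
probs = {"the": 0.60, "a": 0.28, "zebra": 0.12}
rng = random.Random(0)  # seeded only for a repeatable demo
draws = [sample_token(probs, rng) for _ in range(10_000)]
share = draws.count("zebra") / len(draws)
print(round(share, 3))
```

Drop the seed and the individual draws change from run to run, while the share stays near 0.12.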

Pure sampling has a problem. Because every token in the distribution stays a possible candidate, the long tail of bad options can occasionally get drawn, and the text degrades into gibberish. Temperature, top-k, and top-p are three different ways to tame that tail. None of them changes the model. Each one reshapes or trims the distribution before the random draw.

Temperature: reshape the whole distribution

Temperature scales the logits before softmax, changing how sharp or flat the distribution is. Below 1, it sharpens the curve so the top tokens dominate and output gets more conservative and predictable. Above 1, it flattens the curve so unlikely tokens get a real shot, and output gets more random. Temperature removes no token; it only adjusts the odds. Push it to its limit and the Hugging Face guide notes that as temperature approaches 0, temperature-scaled sampling becomes equal to greedy decoding. That is the quiet through-line of this whole post: these strategies are not separate species. They are points on one spectrum that collapses to greedy at the bottom.
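A quick way to see the sharpening and flattening is to run the same logits through softmax at three temperatures. This is a sketch with invented logits; the function name is an assumption, and real APIs may clamp or special-case a temperature of 0 rather than divide by it.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide each logit by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; the T -> 0 limit is argmax")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores
cold = softmax_with_temperature(logits, 0.1)
warm = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 2.0)
for name, probs in (("T=0.1", cold), ("T=1.0", warm), ("T=2.0", hot)):
    print(name, [round(p, 3) for p in probs])
```

At T=0.1 nearly all the mass sits on the top token, which is the slide toward greedy. At T=2.0 the three tokens are much closer together, and none of them has been removed.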

Top-k: keep a fixed shortlist

Top-k filters the distribution down to the k most likely tokens, redistributes the probability mass among only those, and samples from that shortlist. With top-k of 50, the model only ever considers its 50 best guesses and discards the entire tail. Set top-k to 1 and you are back to greedy decoding, because a shortlist of one leaves no choice to make; the Cohere docs state this equivalence directly. The weakness of top-k is that k is fixed. A shortlist of 50 can chop off good options when the model is genuinely uncertain, and waste slots on junk when the model is confident.
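As a sketch, top-k is a sort, a slice, and a renormalization. The distribution below is made up, and `top_k_filter` is a name chosen for the example rather than any library's API.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize so they sum to 1."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Hypothetical next-token distribution.
probs = {"the": 0.5, "a": 0.3, "one": 0.15, "zebra": 0.05}
top2 = top_k_filter(probs, 2)
top1 = top_k_filter(probs, 1)
print(top2)
print(top1)
```

With k set to 1 the surviving token carries all the probability, so the random draw that follows has only one possible outcome.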

Top-p: keep a dynamic nucleus

Top-p, also called nucleus sampling, fixes the fixed-shortlist problem. Instead of a fixed count, it chooses from the smallest possible set of tokens whose cumulative probability exceeds the threshold p, then redistributes the mass across that set. The shortlist grows and shrinks on its own. When the model is uncertain and probability is spread thin, the nucleus widens to include more candidates. When the model is confident, it narrows to just a few. In Cohere's API, as one concrete reference point, top-p defaults to 0.75 and tops out at 0.99, specifically to cut off the long tail of low-probability tokens. Treat those as one vendor's numbers, not a universal law.
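The growing and shrinking shortlist shows up clearly on two invented distributions, one confident and one uncertain. This is a sketch, not a vendor's implementation: it stops once the running total reaches p, and real libraries differ on details such as whether that comparison is strict.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # the nucleus is complete; everything after this is the tail
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Two hypothetical distributions over the same four tokens.
confident = {"the": 0.90, "a": 0.06, "one": 0.03, "zebra": 0.01}
uncertain = {"the": 0.30, "a": 0.28, "one": 0.22, "zebra": 0.20}
narrow = top_p_filter(confident, 0.75)
wide = top_p_filter(uncertain, 0.75)
print(len(narrow), len(wide))
```

The same p of 0.75 keeps one token when the model is confident and three when it is not, which a fixed k cannot do.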

These dials stack. Top-k and top-p can run together, and when both are enabled the Cohere docs are explicit that p acts after k. Temperature reshapes the distribution first, then top-k or top-p trims it, then the random draw happens. Every one of these steps sits on top of the same softmax distribution from the very first section. Nothing here teaches the model anything. It only governs how the existing gamble gets placed.
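The order of operations can be sketched as one function. Treat the details as assumptions: `decode_step` is a hypothetical name, the logits are invented, and measuring p against the mass that survives top-k is one reasonable reading of "p acts after k" rather than a documented rule for every provider.

```python
import math
import random

def decode_step(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """One sampling step: temperature, then top-k, then top-p, then the draw."""
    # 1. Temperature rescales the logits before softmax.
    scaled = {tok: z / temperature for tok, z in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(z - m) for tok, z in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda kv: kv[1], reverse=True)
    # 2. Top-k keeps a fixed-size shortlist.
    if top_k is not None:
        ranked = ranked[:top_k]
    # 3. Top-p trims the shortlist to the smallest nucleus reaching p,
    #    measured on the renormalized mass that survived top-k (an assumption).
    if top_p is not None:
        mass = sum(p for _, p in ranked)
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p / mass
            if cumulative >= top_p:
                break
        ranked = kept
    # 4. The random draw; choices() normalizes the weights itself.
    tokens = [tok for tok, _ in ranked]
    weights = [p for _, p in ranked]
    return rng.choices(tokens, weights=weights, k=1)[0]

logits = {"the": 3.0, "a": 2.0, "one": 1.0, "zebra": -2.0}  # hypothetical
rng = random.Random(0)
picks = {decode_step(logits, temperature=0.8, top_k=3, top_p=0.9, rng=rng)
         for _ in range(200)}
print(sorted(picks))
```

With these settings top-k drops "zebra" and top-p then drops "one", so every draw lands on "the" or "a". The weights never change; only the filtering does.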

Strategy	How it picks	Deterministic?	Best for
Greedy	Always the single highest-probability token	Yes	Classification, extraction, one-answer tasks
Beam search	Tracks several high-probability sequences, keeps the best	Yes	Translation, summarization, fixed-length output
Sampling (temperature)	Random draw after reshaping the whole distribution	No	Tuning overall creativity level
Sampling (top-k)	Random draw from a fixed-size shortlist of k tokens	No	Capping the candidate pool to a known size
Sampling (top-p)	Random draw from a dynamic nucleus exceeding probability p	No	Open-ended chat, fluent natural writing

Which decoding strategy to use for accuracy vs creativity

For accuracy, lean deterministic. When there is one correct answer, you want the model to take the high-probability path and stop gambling. Greedy decoding gives you repeatable output for classification, data extraction, structured generation, and factual lookups, and sampling with a very low temperature gets you close to it. For tasks with a known target shape such as translation or summarization, beam search usually finds a higher-probability sequence than greedy and can earn the extra compute.

For creativity, lean toward sampling. Open-ended writing, brainstorming, and conversational tone all gain from the variety a random draw produces, and top-p sampling tends to read as the most fluent option for open-ended text. The Hugging Face guide is blunt that there is no one-size-fits-all method, so the practical move is to start from your provider's defaults and adjust one dial at a time. The guide's own worked example uses sampling with a top-p of 0.92, a moderate-to-high nucleus that trims the worst of the tail while keeping room for variety. Lower temperature and tighter top-p for more focus; raise them for more range. This guidance holds as of June 2026.

Pro Tip

Change one knob at a time. Temperature, top-k, and top-p interact, so moving all three at once makes it impossible to tell which one caused a change. OpenAI's own API reference recommends altering temperature or top-p, but not both.

Where memory fits: decoding controls the words, not what the model knows

Decoding strategy decides how the model speaks. It does nothing about what the model knows going in. A perfectly tuned top-p setting still cannot recall a preference you stated last week or a fact from a conversation three sessions ago, because that information was never in the context window to begin with. Decoding shapes the gamble. It cannot add cards to the deck.

That gap is the problem MemX (memx.app) works on. MemX is an external, model-agnostic memory layer that holds durable context across sessions and across providers, so the relevant facts are present in the prompt before the model ever computes a single logit. It is private by architecture: per-user isolation, encryption at rest, and on-device options. MemX does not claim end-to-end encryption or zero-knowledge, and it does not pretend memory fixes decoding. It feeds the model better context; the decoding strategy still decides how that context gets phrased.

Frequently Asked Questions

01 What are logits in an LLM?

Logits are the raw, unbounded scores the model assigns to every token in its vocabulary at each step. They are not probabilities yet. The softmax function exponentiates and normalizes them into a distribution that sums to 1, which the decoding strategy then samples from.

02 What is the difference between greedy and beam search?

Greedy decoding picks the single highest-probability token at every step. Beam search keeps several candidate sequences alive at once and selects the best overall. Beam search usually finds a sequence at least as probable as greedy's, but neither is guaranteed to find the single most likely full sequence.

03 Why does greedy decoding repeat itself?

Greedy always takes the most probable next token, so once the model favors a phrasing, the top choice keeps pointing back into the same phrase and the text loops. Repetition is a very common failure of greedy decoding, which is why sampling or n-gram penalties get added.

04 Do temperature and top-p change the model itself?

No. Temperature, top-k, and top-p only reshape or trim the probability distribution before a token is drawn. The model weights stay identical. They control how the choice is made among the model's existing predictions, not what those predictions are.

05 Which decoding strategy is best for accurate answers?

For tasks with one correct answer, use greedy decoding or sampling at a very low temperature, since both push toward the high-probability path and produce repeatable or near-repeatable output. For translation or summarization with a known target, beam search usually finds a higher-probability sequence and is often worth the extra compute.
