Perplexity is an intrinsic metric for language models that measures how well a model predicts a sample of text. It equals the exponential of the average per-token cross-entropy, and can be read as the effective number of equally likely choices the model considers at each step. Lower perplexity means better prediction.
What is perplexity?
Perplexity is an intrinsic evaluation metric for probabilistic language models that quantifies how well a model predicts a held-out sample of text. Formally, it is the exponential of the average negative log-likelihood per token, which is the same as the exponential of the per-token cross-entropy. A lower perplexity indicates the model assigns higher probability to the observed text and is therefore a better predictor of it.
Intuitively, perplexity can be read as the effective branching factor: the average number of equally likely tokens the model is choosing among at each position. A model with a perplexity of 20 is, on average, as uncertain as if it were picking uniformly among 20 options at each step. Because it is computed directly from the model's probabilities, perplexity needs no human labels and is widely used to compare language models during pretraining and research.
- An intrinsic metric: computed from model probabilities, no human labels needed.
- Equals the exponential of the average per-token cross-entropy.
- Lower is better; it reflects higher probability assigned to real text.
- Interpretable as the effective number of equally likely next-token choices.
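As a quick check of the branching-factor reading, the sketch below computes perplexity from the probabilities a model assigned to each observed token. The helper name `perplexity` and the toy probabilities are illustrative assumptions, not part of any library.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probability assigned to each observed token
    (illustrative helper, not a library function)."""
    # Average negative log-likelihood per token, in nats.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that is uniformly unsure among 20 options at every step
# assigns probability 1/20 to each observed token.
uniform_ppl = perplexity([1 / 20] * 5)

# A more confident model assigns higher probability to the observed
# tokens, so its perplexity is lower.
confident_ppl = perplexity([0.5, 0.25, 0.5, 0.25])

print(uniform_ppl, confident_ppl)
```

The uniform case recovers a perplexity of 20 (up to floating-point rounding), matching the "picking among 20 options" intuition, while the more confident toy model lands well below it.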
The perplexity formula
For a tokenized sequence, perplexity is the exponential of the average negative log-probability the model assigns to each token given the preceding context. Because perplexity is a monotonic transform of cross-entropy, minimizing cross-entropy loss during training is equivalent to minimizing perplexity.
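The definition above can be written out as follows. The notation is introduced here for illustration: $x_1, \dots, x_N$ are the tokens of the sequence, $p_\theta$ is the model's conditional distribution, and $H$ is the average per-token cross-entropy in nats.

```latex
H = -\frac{1}{N} \sum_{i=1}^{N} \ln p_\theta\!\left(x_i \mid x_{<i}\right),
\qquad
\mathrm{PPL}(x_{1:N}) = \exp(H)
```

Since $\exp$ is strictly increasing, any change that lowers $H$ also lowers $\mathrm{PPL}$, which is why the two objectives coincide.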
The base of the exponent must match the base of the logarithm: natural log pairs with exp, and log base 2 pairs with 2 raised to the cross-entropy in bits. Either matched pairing yields the same perplexity value; mixing bases does not. The Hugging Face documentation notes that perplexity is only comparable across models that share the same tokenization, since splitting text into more or fewer tokens changes the per-token average.
- Perplexity is a monotonic transform of cross-entropy loss.
- Minimizing training loss is equivalent to minimizing perplexity.
- The exponent base must match the logarithm base (exp with ln, 2 with log₂).
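The base-matching rule can be checked numerically. The sketch below uses made-up token probabilities (an assumption for illustration) and computes perplexity once in nats and once in bits.

```python
import math

# Hypothetical probabilities a model assigned to four observed tokens.
token_probs = [0.5, 0.125, 0.25, 0.5]
n = len(token_probs)

# Cross-entropy in nats (natural log) pairs with exp.
ce_nats = -sum(math.log(p) for p in token_probs) / n
ppl_from_nats = math.exp(ce_nats)

# Cross-entropy in bits (log base 2) pairs with 2 ** x.
ce_bits = -sum(math.log2(p) for p in token_probs) / n
ppl_from_bits = 2 ** ce_bits

# Mismatched bases (exp applied to bits) give a different, incorrect number.
ppl_mismatched = math.exp(ce_bits)

print(ppl_from_nats, ppl_from_bits, ppl_mismatched)
```

The two matched computations agree up to floating-point rounding, while the mismatched one overstates the perplexity.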
Computing perplexity in practice
For modern fixed-context transformers, perplexity over a long document is computed with a sliding window so each token is predicted with as much context as the window allows, as described in the Hugging Face perplexity guide. The cross-entropy loss returned by the model over the target tokens is averaged and then exponentiated.
A subtle but important detail is that tokens used only as context (not predicted) should be masked out of the loss, typically by setting their target label to -100 in PyTorch so they are ignored. Without the mask, tokens in overlapping windows are counted more than once, which skews the reported number.
- Use a sliding window so each prediction has maximum available context.
- Average the per-token cross-entropy, then exponentiate.
- Mask context-only tokens (label -100 in PyTorch) so they do not count.
- Only compare perplexity across models that use the same tokenizer.
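The sliding-window procedure can be sketched without any deep-learning library. In this sketch `toy_nll` is a hypothetical stand-in for a real model's forward pass, and `sliding_window_perplexity`, `max_length`, and `stride` are illustrative names; only the -100 ignore value comes from the text above. A real implementation would batch each window through the model rather than loop over tokens.

```python
import math

IGNORE_INDEX = -100  # label value used to exclude context-only tokens

def toy_nll(context, token):
    """Hypothetical stand-in for a model: negative log-probability of
    `token` given `context`. Here, more context simply means higher
    probability, capped at 0.9."""
    p = min(0.9, 0.1 + 0.1 * len(context))
    return -math.log(p)

def sliding_window_perplexity(tokens, max_length, stride, nll_fn=toy_nll):
    """Score every token exactly once, each with up to `max_length - 1`
    tokens of preceding context. Returns (perplexity, tokens_scored)."""
    if stride > max_length:
        raise ValueError("stride must not exceed max_length")
    total_nll, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + max_length, len(tokens))
        window = tokens[begin:end]
        new_count = end - prev_end  # tokens not scored by an earlier window
        # Context-only positions get IGNORE_INDEX; new positions keep their id.
        labels = [IGNORE_INDEX] * (len(window) - new_count) + window[-new_count:]
        for i, label in enumerate(labels):
            if label == IGNORE_INDEX:
                continue  # used as context, not counted in the loss
            total_nll += nll_fn(window[:i], window[i])
            n_scored += 1
        prev_end = end
        if end == len(tokens):
            break
    # Average the per-token loss, then exponentiate.
    return math.exp(total_nll / n_scored), n_scored

tokens = list(range(10))  # toy token ids
ppl_overlap, n_overlap = sliding_window_perplexity(tokens, max_length=4, stride=2)
ppl_disjoint, n_disjoint = sliding_window_perplexity(tokens, max_length=4, stride=4)
print(ppl_overlap, ppl_disjoint)
```

With this toy scorer, the overlapping windows give a lower perplexity than the disjoint ones because later tokens are predicted with more context, and in both cases each of the 10 tokens is counted once.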
Uses and limitations
Perplexity is most useful as a relative measure during model development: it tracks improvement across training checkpoints and ranks models trained and evaluated under identical tokenization and data conditions. It is cheap, automatic, and correlates with general language-modeling quality.
Its limitations are well documented. Perplexity is not comparable across different tokenizers or vocabularies, since per-token averaging depends on how text is split. It also measures probability assigned to reference text rather than usefulness, so a low-perplexity model is not guaranteed to be more truthful, helpful, or aligned. For instruction-following and chat models, task benchmarks, human preference judgments, and LLM-as-a-judge evaluations are used alongside or instead of perplexity. As The Gradient discusses, intrinsic metrics like perplexity should complement, not replace, extrinsic task evaluation.
- Best as a relative metric under matched tokenization and data.
- Not comparable across different tokenizers or vocabulary sizes.
- Low perplexity does not guarantee truthfulness, helpfulness, or alignment.
- Complement it with task benchmarks and human or LLM-judged evaluation.
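The tokenizer caveat is easy to see with arithmetic. The sketch below assumes two hypothetical tokenizers that split the same text into different numbers of tokens, and assumes (purely for illustration) that both models assign the text the same total log-likelihood.

```python
import math

# Illustrative total negative log-likelihood of one text, in nats.
total_nll_nats = 60.0

# Hypothetical token counts for the same text under two tokenizers.
tokens_a = 20
tokens_b = 30

# Same total likelihood, different per-token averages.
ppl_a = math.exp(total_nll_nats / tokens_a)
ppl_b = math.exp(total_nll_nats / tokens_b)

print(ppl_a, ppl_b)
```

Both models explain the text equally well in total, yet the one whose tokenizer produces more tokens reports a much lower perplexity, which is why cross-tokenizer comparisons are misleading.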
Key takeaways
- Perplexity measures how well a language model predicts text and equals the exponential of the average per-token cross-entropy.
- It can be read as the effective number of equally likely next-token choices; lower perplexity is better.
- Minimizing cross-entropy during training is equivalent to minimizing perplexity.
- Perplexity is only comparable across models that share the same tokenization and evaluation data.
- It does not capture truthfulness, helpfulness, or alignment, so task benchmarks and human evaluation are needed alongside it.