Perplexity

Perplexity is an intrinsic metric for language models that measures how well a model predicts a sample of text. It equals the exponential of the average per-token cross-entropy, and can be read as the effective number of equally likely choices the model considers at each step. Lower perplexity means better prediction.

What is Perplexity?

Perplexity is an intrinsic evaluation metric for probabilistic language models that quantifies how well a model predicts a held-out sample of text. Formally, it is the exponential of the average negative log-likelihood per token, equivalently the exponentiation of the cross-entropy. A lower perplexity indicates the model assigns higher probability to the actual observed text and is therefore a better predictor.

Intuitively, perplexity can be read as the effective branching factor: the average number of equally likely tokens the model is choosing among at each position. A model with a perplexity of 20 is, on average, as uncertain as if it were picking uniformly among 20 options at each step. Because it is computed directly from the model's probabilities, perplexity needs no human labels and is widely used to compare language models during pretraining and research.

An intrinsic metric: computed from model probabilities, no human labels needed.
Equals the exponential of the average per-token cross-entropy.
Lower is better; it reflects higher probability assigned to real text.
Interpretable as the effective number of equally likely next-token choices.
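To make the branching-factor reading concrete, here is a minimal sketch in plain Python. The `perplexity` helper and the two constant-probability "models" are illustrative assumptions, not part of any library.

```python
import math


def perplexity(token_probs):
    """Exponential of the average negative natural-log probability."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Hypothetical model that gives every observed token probability 1/20:
# it behaves like a uniform choice among 20 options at each step.
ppl_uniform_20 = perplexity([1 / 20] * 50)

# Hypothetical, more confident model: probability 1/2 on every observed token.
ppl_uniform_2 = perplexity([1 / 2] * 50)

print(round(ppl_uniform_20, 6), round(ppl_uniform_2, 6))
```

The first value comes out at about 20 and the second at about 2, matching the "number of equally likely options" interpretation.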

The perplexity formula

For a tokenized sequence, perplexity is the exponentiation of the average negative log-probability the model assigns to each token given the preceding context. Because perplexity is a monotonic transform of cross-entropy, minimizing cross-entropy loss during training is equivalent to minimizing perplexity.

The base of the exponent must match the base of the logarithm: natural log pairs with exp, and log base 2 pairs with 2 raised to the cross-entropy in bits. Either pairing yields the same perplexity value; mixing bases (for example, raising 2 to a cross-entropy measured with the natural log) gives a wrong number. The Hugging Face documentation notes that perplexity is only comparable across models that share the same tokenization, since splitting text into more or fewer tokens changes the per-token average.

PPL(X) = exp( −(1/N) Σ_{i=1}^{N} log P(x_i | x_{<i}) )

Perplexity is the exponential of the average per-token negative log-likelihood over a sequence of N tokens. It equals exp(cross-entropy).
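As a worked instance of the formula, the sketch below walks through the three steps (per-token negative log-probability, average, exponentiate) for a four-token sequence. The probabilities are made-up values chosen so the arithmetic is easy to check by hand; they do not come from a real model.

```python
import math

# Hypothetical values of P(x_i | x_<i) for a 4-token sequence.
token_probs = [0.5, 0.25, 0.125, 0.125]

# Step 1: negative log-probability of each token (natural log, so in nats).
nll_per_token = [-math.log(p) for p in token_probs]

# Step 2: average over the N tokens. This is the cross-entropy.
cross_entropy = sum(nll_per_token) / len(nll_per_token)

# Step 3: exponentiate with the matching base.
ppl = math.exp(cross_entropy)

print(f"cross-entropy = {cross_entropy:.4f} nats, PPL = {ppl:.4f}")
```

Since the probabilities are powers of 1/2, the average works out to 2.25 bits per token, so the perplexity is 2^2.25, roughly 4.76.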

PPL = 2^{H},   H = −(1/N) Σ log₂ P(x_i | x_{<i})

The equivalent base-2 form: 2 raised to the cross-entropy H measured in bits per token.

Perplexity is a monotonic transform of cross-entropy loss.
Minimizing training loss is equivalent to minimizing perplexity.
The exponent base must match the logarithm base (exp with ln, 2 with log₂).
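The points above can be checked numerically. This is a small sketch with hypothetical loss values; the helper names are illustrative only.

```python
import math


def ppl_from_nats(loss_nats):
    """Perplexity from a cross-entropy measured with the natural log."""
    return math.exp(loss_nats)


def ppl_from_bits(loss_bits):
    """Perplexity from a cross-entropy measured with log base 2."""
    return 2.0 ** loss_bits


def nats_to_bits(loss_nats):
    """Convert a cross-entropy from nats to bits."""
    return loss_nats / math.log(2)


loss_nats = 3.0                       # hypothetical training loss in nats
loss_bits = nats_to_bits(loss_nats)   # the same loss expressed in bits

matched_e = ppl_from_nats(loss_nats)  # exp paired with ln
matched_2 = ppl_from_bits(loss_bits)  # 2 paired with log2
mismatched = ppl_from_bits(loss_nats) # wrong: base 2 applied to nats

# Monotonicity: a lower loss always maps to a lower perplexity.
losses = [3.5, 3.0, 2.5, 2.0]
ppls = [ppl_from_nats(x) for x in losses]

print(matched_e, matched_2, mismatched)
print(ppls)
```

The two matched pairings agree, while the mismatched call returns 8 instead of roughly 20.09, which is the kind of silent error the base rule guards against.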

Computing perplexity in practice

For modern fixed-context transformers, perplexity over a long document is computed with a sliding window so each token is predicted with as much context as the window allows, as described in the Hugging Face perplexity guide. The cross-entropy loss returned by the model over the target tokens is averaged and then exponentiated.

A subtle but important detail is that tokens used only as context (not predicted) should be masked out of the loss, typically by setting their target label to -100, the default ignore_index of PyTorch's cross-entropy loss, so they are ignored. If they are left unmasked, tokens in overlapping windows are scored more than once, which distorts the reported number.

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Perplexity measures how well a language model predicts text."
enc = tok(text, return_tensors="pt")

# Simple single-window case: the whole text fits in one context window.
# For documents longer than the model's context, slide a window and
# mask context-only tokens with label -100 instead of this one-shot call.
with torch.no_grad():
    # Passing labels makes the model return mean cross-entropy loss
    out = model(**enc, labels=enc["input_ids"])

ppl = torch.exp(out.loss)
print(float(ppl))  # lower is better

Single-window perplexity of a short text with a Hugging Face causal LM. Long documents require the strided sliding-window loop described above.

Use a sliding window so each prediction has maximum available context.
Average the per-token cross-entropy, then exponentiate.
Mask context-only tokens (label -100 in PyTorch) so they do not count.
Only compare perplexity across models that use the same tokenizer.
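The window and masking bookkeeping can be sketched without loading a model. The function below is a simplified illustration under stated assumptions: `strided_windows`, `max_length`, and `stride` are names chosen for this example, and the integer token IDs are placeholders. It only builds the inputs and labels; a real evaluation would feed each pair to the model and accumulate the loss.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy by default


def strided_windows(token_ids, max_length, stride):
    """Yield (input_ids, labels) pairs that cover token_ids.

    Each window holds at most max_length tokens. Tokens already scored in an
    earlier window stay in the input as context but get IGNORE_INDEX as label.
    """
    if not 0 < stride <= max_length:
        raise ValueError("stride must be between 1 and max_length")

    seq_len = len(token_ids)
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        window = token_ids[begin:end]
        n_targets = end - prev_end            # tokens not scored yet
        n_context = len(window) - n_targets   # tokens used only as context
        labels = [IGNORE_INDEX] * n_context + window[n_context:]
        yield window, labels
        prev_end = end
        if end == seq_len:
            break


tokens = list(range(100, 110))  # 10 placeholder token IDs
windows = list(strided_windows(tokens, max_length=4, stride=2))
for window, labels in windows:
    print(window, labels)

# Every token is a prediction target in exactly one window.
scored = [t for _, labels in windows for t in labels if t != IGNORE_INDEX]
```

With these settings every window after the first carries two context-only tokens and two targets. To get the document-level number, sum the negative log-likelihood over all target tokens, divide by the number of targets, and exponentiate. A smaller stride gives each target more context at the cost of more forward passes.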

Uses and limitations

Perplexity is most useful as a relative measure during model development: it tracks improvement across training checkpoints and ranks models trained and evaluated under identical tokenization and data conditions. It is cheap, automatic, and correlates with general language-modeling quality.

Its limitations are well documented. Perplexity is not comparable across different tokenizers or vocabularies, since per-token averaging depends on how text is split. It also measures probability assigned to reference text rather than usefulness, so a low-perplexity model is not guaranteed to be more truthful, helpful, or aligned. For instruction-following and chat models, task benchmarks, human preference judgments, and LLM-as-a-judge evaluations are used alongside or instead of perplexity. As The Gradient discusses, intrinsic metrics like perplexity should complement, not replace, extrinsic task evaluation.

Best as a relative metric under matched tokenization and data.
Not comparable across different tokenizers or vocabulary sizes.
Low perplexity does not guarantee truthfulness, helpfulness, or alignment.
Complement it with task benchmarks and human or LLM-judged evaluation.
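The tokenizer caveat is easy to see with arithmetic. In the sketch below, two hypothetical models assign the same total negative log-likelihood to the same text, but their tokenizers split it into different numbers of tokens. All numbers are invented for illustration.

```python
import math

# Assumption: both models assign the same total probability to the text,
# i.e. the same total negative log-likelihood (in nats).
total_nll_nats = 60.0

tokens_a = 20  # hypothetical coarse tokenizer: fewer, longer tokens
tokens_b = 30  # hypothetical fine tokenizer: more, shorter tokens

# Per-token perplexity divides the same total by different token counts.
ppl_a = math.exp(total_nll_nats / tokens_a)
ppl_b = math.exp(total_nll_nats / tokens_b)

print(f"tokenizer A: {ppl_a:.2f}, tokenizer B: {ppl_b:.2f}")
```

Model B appears to have a much lower perplexity even though both models predict the text equally well in total, which is why the comparison is only meaningful under a shared tokenizer.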

Key takeaways

Perplexity measures how well a language model predicts text and equals the exponential of the average per-token cross-entropy.
It can be read as the effective number of equally likely next-token choices; lower perplexity is better.
Minimizing cross-entropy during training is equivalent to minimizing perplexity.
Perplexity is only comparable across models that share the same tokenization and evaluation data.
It does not capture truthfulness, helpfulness, or alignment, so task benchmarks and human evaluation are needed alongside it.

Frequently asked questions

What is perplexity?

Perplexity is an intrinsic metric measuring how well a model predicts a text sample. It is the exponential of the average per-token cross-entropy and can be read as the effective number of equally likely tokens the model considers at each step. Lower is better.

How is perplexity calculated?

Compute the model's average negative log-probability per token over a sequence, then exponentiate it. Equivalently, it is the exponential of the cross-entropy loss. Using log base 2 gives perplexity as 2 raised to the cross-entropy in bits per token.

Is lower or higher perplexity better?

Lower perplexity is better. It means the model assigns higher probability to the actual observed text, so it is less surprised and a stronger predictor. High perplexity indicates the model finds the text unexpected.

Can perplexity be compared across different models?

Only if they use the same tokenizer and evaluation data. Perplexity averages over tokens, so different tokenizations split text differently and produce non-comparable numbers. Comparing across vocabularies or datasets is not meaningful.

What are the limitations of perplexity?

Perplexity measures probability assigned to reference text, not usefulness, so a low score does not guarantee a model is truthful, helpful, or aligned. It also cannot be compared across tokenizers, so task benchmarks and human evaluation are needed too.
