The softmax function converts a vector of real-valued scores (logits) into a probability distribution, where each output is between 0 and 1 and all outputs sum to 1. It is used in transformer attention to weight tokens and at the output layer to turn logits into next-token probabilities.
What is the Softmax Function?
The softmax function takes a vector of real numbers and converts it into a probability distribution. Each output value lies between 0 and 1, and the outputs sum to exactly 1, so the result can be read as probabilities over a set of mutually exclusive options. It is the standard way to turn raw model scores, called logits, into normalized probabilities.
Softmax works by exponentiating each input and dividing by the sum of all exponentiated inputs. Exponentiation makes every value positive and amplifies differences between large and small scores, so the largest logit receives the most probability mass while smaller ones still receive a nonzero share. This smooth, differentiable behavior is why softmax is preferred over a hard argmax during training: gradients can flow through it.
- Maps any real-valued vector to a probability distribution summing to 1.
- Each output is strictly between 0 and 1.
- Differentiable, so it supports gradient-based training, unlike argmax.
- Larger logits get exponentially more probability mass than smaller ones.
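The exponentiate-then-normalize step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `softmax` and the example logits are assumptions chosen for the demo, and this naive form skips the numerical safeguards that real implementations use.

```python
import math

def softmax(logits):
    """Exponentiate each logit, then divide by the sum of the exponentials."""
    exps = [math.exp(z) for z in logits]   # every value becomes positive
    total = sum(exps)                      # normalizing constant
    return [e / total for e in exps]

# Hypothetical example logits, chosen only for illustration.
probs = softmax([2.0, 1.0, 0.1])
print(probs)  # roughly [0.659, 0.242, 0.099]
```

The largest logit takes most of the probability mass, but the smaller ones keep a nonzero share, which is the behavior the list above summarizes.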
The softmax formula
For an input vector z with components z_i, the softmax of the i-th component divides the exponential of z_i by the sum of the exponentials of all components. The denominator is a normalizing constant that guarantees the outputs sum to 1.
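Written out, the definition just described looks as follows. The symbol n for the number of components is an assumption introduced here for notation; the text does not name it.

```latex
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}, \qquad i = 1, \dots, n
```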
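In practice, implementations subtract the maximum logit from every component before exponentiating. The shift multiplies the numerator and the denominator by the same constant factor, so it cancels and the result is mathematically unchanged, but the largest exponent becomes 0 and the exponential can no longer overflow for large inputs. This standard numerical-stability technique is usually called the max-subtraction trick and is closely related to the log-sum-exp trick used to compute the log of the normalizing sum.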
- The denominator normalizes the outputs so they sum to exactly 1.
- Subtracting the maximum logit avoids floating-point overflow.
- The max-subtraction shift leaves the output mathematically unchanged.
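A small sketch of the max-subtraction idea, assuming plain Python floats. The name `stable_softmax` and the test vectors are illustrative assumptions. The naive approach fails on large logits because `math.exp` raises `OverflowError` for large arguments, while the shifted version works and gives the same answer as it does on the equivalent small logits.

```python
import math

def stable_softmax(logits):
    """Softmax with the maximum logit subtracted before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # largest exponent is now 0
    total = sum(exps)
    return [e / total for e in exps]

# Exponentiating a large logit directly overflows a Python float.
try:
    math.exp(1000.0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

# Two hypothetical inputs that differ only by a constant shift of 1000.
big = stable_softmax([1000.0, 1001.0, 1002.0])
small = stable_softmax([0.0, 1.0, 2.0])
print(naive_overflowed, big, small)
```

Because the two inputs differ only by a constant, both calls return the same distribution, which illustrates that the shift leaves the output unchanged.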
Softmax in attention and output layers
Inside transformer attention, softmax turns raw similarity scores into attention weights. Scaled dot-product attention computes query-key dot products, divides by the square root of the key dimension, and applies softmax across the keys so that each query distributes a total weight of 1 over all positions. Those weights then form a weighted average of the value vectors. The square-root scaling keeps the dot products from growing so large that softmax saturates into near-one-hot outputs with vanishing gradients.
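The attention computation described above can be sketched with nested Python lists. This is a simplified illustration under stated assumptions: a single head, no batching, no masking, and hypothetical names (`attention`, `queries`, `keys`, `values`) with toy 2-dimensional vectors. Real implementations use tensor libraries and matrix multiplication.

```python
import math

def softmax(scores):
    m = max(scores)                               # max-subtraction for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, without masking."""
    d_k = len(keys[0])                            # key dimension
    outputs, all_weights = [], []
    for q in queries:
        # Query-key dot products, scaled by 1/sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)                 # sums to 1 across the keys
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy inputs, invented for illustration only.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = attention(queries, keys, values)
```

Each row of `weights` corresponds to one query and sums to 1, so each output row is a convex combination of the value vectors.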
At the output of a classifier or language model, softmax is applied to the final logits to produce a probability distribution over classes or over the vocabulary. For language models, this distribution over the next token is what decoding strategies then sample from or search over. During training, softmax is paired with the cross-entropy loss; for a one-hot target, the gradient of the combined loss with respect to the logits is simply the predicted probabilities minus the target, which is cheap to compute and numerically well behaved.
- In attention: applied across keys so each query's weights sum to 1.
- Scaling by 1/sqrt(d_k) prevents softmax saturation and vanishing gradients.
- At the output layer: converts logits into class or next-token probabilities.
- Usually paired with cross-entropy loss during training.
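To illustrate the pairing of softmax with cross-entropy, here is a minimal sketch for a single example with a one-hot target. The function names and example logits are assumptions for the demo. The loss is computed from a log-softmax rather than by taking the log of the softmax output, a common stability choice, and the gradient with respect to the logits is the predicted probabilities minus the one-hot target.

```python
import math

def log_softmax(logits):
    """Log-probabilities, computed with the max shift for stability."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class index."""
    return -log_softmax(logits)[target]

def cross_entropy_grad(logits, target):
    """Gradient of the loss w.r.t. the logits: probabilities minus one-hot."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

# Hypothetical logits and target index, for illustration only.
logits = [2.0, 1.0, 0.1]
loss = cross_entropy(logits, target=0)
grad = cross_entropy_grad(logits, target=0)
print(loss, grad)
```

The gradient entries sum to zero, and only the target entry is negative, so a gradient-descent step raises the target logit and lowers the others.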
Temperature and common variants
Softmax is often combined with a positive temperature parameter that divides the logits before exponentiation. A temperature above 1 flattens the distribution toward uniform, increasing randomness, while a temperature below 1 sharpens it toward the largest logit; as the temperature approaches 0, the output approaches a one-hot argmax. This is how temperature controls randomness in LLM text generation.
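A short sketch of temperature scaling, assuming the logits are simply divided by the temperature before a stable softmax. The function name, the positivity check, and the example logits are illustrative choices, not a specific library's API.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits, for illustration only.
logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.5)  # below 1: more peaked
base = softmax_with_temperature(logits, 1.0)   # plain softmax
flat = softmax_with_temperature(logits, 2.0)   # above 1: closer to uniform
print(sharp[0], base[0], flat[0])
```

The probability of the top logit falls as the temperature rises, which matches the sharpening and flattening behavior described above.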
Softmax assumes mutually exclusive classes; when labels can co-occur, an element-wise sigmoid is used instead, so each label gets its own independent probability. The two-class case of softmax reduces to the logistic (sigmoid) function applied to the difference between the two logits. For very large vocabularies, approximations such as hierarchical softmax or sampled softmax were historically used to reduce the cost of the normalizing sum, though modern hardware often computes full softmax directly.
- Temperature divides logits before softmax to control randomness.
- Sigmoid replaces softmax for multi-label problems where classes co-occur.
- Two-class softmax is equivalent to the logistic function of the logit difference.
- Sampled or hierarchical softmax can reduce cost over huge vocabularies.
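The relationship between softmax and sigmoid can be checked numerically. In this sketch, the values `a` and `b` and the multi-label logits are arbitrary assumptions. The simple `sigmoid` here is fine for moderate inputs but would overflow for very large negative ones.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Two-class softmax equals the sigmoid of the logit difference.
a, b = 1.5, -0.3
p_softmax = softmax([a, b])[0]
p_sigmoid = sigmoid(a - b)

# Multi-label case: element-wise sigmoid gives independent probabilities
# that are not constrained to sum to 1.
multi_label = [sigmoid(z) for z in [2.0, 1.5, -1.0]]
print(p_softmax, p_sigmoid, sum(multi_label))
```

The first two printed values agree up to floating-point rounding, while the multi-label probabilities sum to more than 1, which is why sigmoid suits labels that can co-occur.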
Key takeaways
- Softmax converts a vector of logits into a probability distribution whose values are between 0 and 1 and sum to 1.
- It exponentiates each input and normalizes by the sum, amplifying the largest scores while keeping the function smooth and differentiable.
- Numerically stable implementations subtract the maximum logit before exponentiating to avoid overflow.
- In transformer attention, softmax normalizes scaled query-key scores into weights that sum to 1; the 1/sqrt(d_k) scaling prevents saturation.
- Dividing logits by a temperature flattens or sharpens the distribution, which is how sampling randomness is controlled.