Softmax Function

The softmax function converts a vector of real-valued scores (logits) into a probability distribution, where each output is between 0 and 1 and all outputs sum to 1. It is used in transformer attention to weight tokens and at the output layer to turn logits into next-token probabilities.

What is the Softmax Function?

The softmax function takes a vector of real numbers and converts it into a probability distribution. Each output value lies between 0 and 1, and the outputs sum to exactly 1, so the result can be read as probabilities over a set of mutually exclusive options. It is the standard way to turn raw model scores, called logits, into normalized probabilities.

Softmax works by exponentiating each input and dividing by the sum of all exponentiated inputs. Exponentiation makes every value positive and amplifies differences between large and small scores, so the largest logit receives the most probability mass while smaller ones still receive a nonzero share. This smooth, differentiable behavior is why softmax is preferred over a hard argmax during training: gradients can flow through it.

Maps any real-valued vector to a probability distribution summing to 1.
Each output is strictly between 0 and 1.
Differentiable, so it supports gradient-based training, unlike argmax.
Larger logits get exponentially more probability mass than smaller ones.
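The contrast with argmax and the exponential weighting described above can be sketched in a few lines of NumPy. The logit values here are illustrative assumptions, not taken from any model.

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the result is unchanged.
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

logits = np.array([3.0, 1.0, 0.0])  # illustrative values
probs = softmax(logits)

# Hard argmax puts all mass on the winner and gives the rest nothing.
one_hot = np.zeros_like(logits)
one_hot[np.argmax(logits)] = 1.0

# The ratio of two softmax outputs depends only on the gap between their logits:
# probs[i] / probs[j] == exp(logits[i] - logits[j])
ratio = probs[0] / probs[1]
expected_ratio = np.exp(logits[0] - logits[1])

print(probs)                  # every entry is positive, largest logit dominates
print(one_hot)                # all-or-nothing alternative
print(ratio, expected_ratio)  # the two values agree up to rounding
```

A logit gap of 2 therefore translates into a probability ratio of exp(2), roughly 7.4 to 1, which is what "exponentially more probability mass" means in practice.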

The softmax formula

For an input vector z with components z_i, the softmax of the i-th component divides the exponential of z_i by the sum of the exponentials of all components. The denominator is a normalizing constant that guarantees the outputs sum to 1.

In practice, implementations subtract the maximum logit from every component before exponentiating. The shift leaves the result mathematically unchanged, because the common factor exp(−max(z)) cancels between the numerator and the denominator, but it keeps every exponent at or below zero so the exponential cannot overflow. This max-subtraction step is the same idea that underlies the log-sum-exp trick.

softmax(z)_i = exp(z_i) / Σ_{j=1}^{K} exp(z_j)

Softmax over K classes: each logit is exponentiated and divided by the sum of all exponentiated logits, producing values that sum to 1.

softmax(z)_i = exp(z_i − max(z)) / Σ_j exp(z_j − max(z))

The numerically stable form: subtracting the maximum logit before exponentiating avoids overflow without changing the output.

The denominator normalizes the outputs so they sum to exactly 1.
Subtracting the maximum logit avoids floating-point overflow.
The max-subtraction shift leaves the output mathematically unchanged.
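To make the overflow problem concrete, the following sketch compares a naive implementation with the max-subtracted one on deliberately large logits. The function names `naive_softmax` and `stable_softmax` are made up for this illustration, and the behavior shown assumes 64-bit floats.

```python
import numpy as np

def naive_softmax(z):
    # Direct translation of the formula, with no stability shift.
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def stable_softmax(z):
    # Subtract the max so the largest exponent is exactly 0.
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

large = np.array([1000.0, 999.0, 998.0])

# exp(1000) overflows to inf in float64, and inf / inf evaluates to nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(large)

stable = stable_softmax(large)

# The same logits shifted down by 998 give an identical distribution.
small = stable_softmax(np.array([2.0, 1.0, 0.0]))

print(naive)   # all nan
print(stable)  # finite probabilities
print(small)   # matches the line above
```

Because only the differences between logits matter, the shifted and unshifted inputs describe the same distribution; the shift only changes whether the computation survives floating-point limits.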

Softmax in attention and output layers

Inside transformer attention, softmax turns raw similarity scores into attention weights. Scaled dot-product attention computes query-key dot products, divides by the square root of the key dimension, and applies softmax across the keys so that each query distributes a total weight of 1 over all positions. Those weights then form a weighted average of the value vectors. The square-root scaling keeps the dot products from growing so large that softmax saturates into near-one-hot outputs with vanishing gradients.

At the output of a classifier or language model, softmax is applied to the final logits to produce a probability distribution over classes or over the vocabulary. For language models, this distribution over the next token is what decoding strategies then sample from or search over. During training, softmax is paired with the cross-entropy loss, which together yield clean gradients.

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k)·V

Scaled dot-product attention: softmax normalizes the scaled query-key scores into weights that average the value vectors.
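The attention formula can be sketched in NumPy as follows. This is a minimal single-head version under simplifying assumptions: no batching, no masking, no learned projections, and random inputs whose shapes are chosen only for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - np.max(z, axis=axis, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # shape (n_queries, n_keys)
    weights = softmax(scores, axis=-1)  # each row sums to 1 across the keys
    return weights @ V, weights         # weighted average of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))  # 2 queries, d_k = 4
K = rng.normal(size=(3, 4))  # 3 keys
V = rng.normal(size=(3, 5))  # 3 values, each of dimension 5

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=-1))  # one total per query, each approximately 1
print(output.shape)          # (2, 5)
```

Applying softmax along the last axis is what makes each query distribute a total weight of 1 over the keys; normalizing along the other axis would instead make each key's weights sum to 1 across queries, which is not what the formula describes.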

python

import numpy as np

def softmax(z, axis=-1):
    # Subtract the max for numerical stability
    z = z - np.max(z, axis=axis, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=axis, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)          # [0.659..., 0.242..., 0.098...]
print(probs.sum())    # 1.0 (up to floating-point rounding)

A numerically stable softmax in NumPy.

In attention: applied across keys so each query's weights sum to 1.
Scaling by 1/sqrt(d_k) prevents softmax saturation and vanishing gradients.
At the output layer: converts logits into class or next-token probabilities.
Usually paired with cross-entropy loss during training.
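One way to see why softmax and cross-entropy pair well: for a single example, the gradient of the loss with respect to the logits simplifies to the predicted probabilities minus the one-hot target. The sketch below checks that identity against finite differences. The `target` index and the logit values are hypothetical.

```python
import numpy as np

def softmax(z):
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

def cross_entropy(z, target):
    # Negative log-probability assigned to the target class.
    return -np.log(softmax(z)[target])

logits = np.array([2.0, 1.0, 0.1])
target = 0  # hypothetical index of the correct class

# Analytic gradient with respect to the logits: probs - one_hot(target).
analytic = softmax(logits)
analytic[target] -= 1.0

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.zeros_like(logits)
for i in range(len(logits)):
    step = np.zeros_like(logits)
    step[i] = eps
    numeric[i] = (cross_entropy(logits + step, target)
                  - cross_entropy(logits - step, target)) / (2 * eps)

print(analytic)
print(numeric)  # agrees with the analytic gradient up to small numerical error
```

The gradient components sum to zero and shrink as the predicted probability of the target approaches 1, so the update is largest exactly when the model is most wrong.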

Temperature and common variants

Softmax is often combined with a temperature parameter that divides the logits before exponentiation. A temperature of 1 leaves the distribution unchanged, a temperature above 1 flattens it toward uniform and increases randomness, and a temperature below 1 sharpens it toward the largest logit. This is the same mechanism the temperature setting uses to control randomness in LLM text generation.
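The effect of temperature is easy to see numerically. In this sketch, `softmax_with_temperature` is a name made up for illustration, the logits are arbitrary, and the temperature is assumed to be strictly positive.

```python
import numpy as np

def softmax_with_temperature(z, temperature=1.0):
    # Divide the logits by the temperature, then apply a stable softmax.
    scaled = z / temperature
    scaled = scaled - np.max(scaled)
    exp_z = np.exp(scaled)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])

sharp = softmax_with_temperature(logits, temperature=0.5)  # more peaked
base = softmax_with_temperature(logits, temperature=1.0)   # plain softmax
flat = softmax_with_temperature(logits, temperature=2.0)   # closer to uniform

print(sharp)
print(base)
print(flat)
```

The ordering of the classes never changes; only the amount of mass concentrated on the top logit does, which is why lower temperatures make sampling more deterministic.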

Softmax assumes mutually exclusive classes; when labels can co-occur, an element-wise sigmoid is used instead. The binary, two-class case of softmax reduces to the logistic (sigmoid) function. For very large vocabularies, approximations such as hierarchical softmax or sampled softmax were historically used to reduce the cost of the normalizing sum, though modern hardware often computes full softmax directly.

Temperature divides logits before softmax to control randomness.
Sigmoid replaces softmax for multi-label problems where classes co-occur.
Two-class softmax is equivalent to the logistic function.
Sampled or hierarchical softmax can reduce cost over huge vocabularies.
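The relationship between softmax and sigmoid can be checked directly: with two classes, the softmax probability of the first class equals the sigmoid of the difference between the two logits. The values below are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Two-class case: softmax reduces to a sigmoid of the logit difference.
z = np.array([1.5, -0.3])
two_class = softmax(z)
via_sigmoid = sigmoid(z[0] - z[1])

# Multi-label case: element-wise sigmoid scores each label independently,
# so the outputs are not constrained to sum to 1.
multi_label = sigmoid(np.array([2.0, 1.0, 0.1]))

print(two_class[0], via_sigmoid)  # the two values agree up to rounding
print(multi_label.sum())          # not constrained to equal 1
```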

Key takeaways

Softmax converts a vector of logits into a probability distribution whose values are between 0 and 1 and sum to 1.
It exponentiates each input and normalizes by the sum, amplifying the largest scores while keeping the function smooth and differentiable.
Numerically stable implementations subtract the maximum logit before exponentiating to avoid overflow.
In transformer attention, softmax normalizes scaled query-key scores into weights that sum to 1; the 1/sqrt(d_k) scaling prevents saturation.
Dividing logits by a temperature flattens or sharpens the distribution, which is how sampling randomness is controlled.

Frequently asked questions

What does the softmax function do?

It converts a vector of real-valued scores into a probability distribution. Each output is between 0 and 1, and all outputs sum to 1, so the values can be interpreted as probabilities over mutually exclusive options such as classes or vocabulary tokens.

Why is softmax used in attention?

Attention needs weights over positions that sum to 1 so it can form a weighted average of value vectors. Softmax turns the scaled query-key similarity scores into exactly such a normalized, differentiable set of weights.

What is the difference between softmax and sigmoid?

Softmax produces a distribution over mutually exclusive classes that sums to 1. Sigmoid squashes each value independently to between 0 and 1, which suits multi-label problems. Two-class softmax reduces to the sigmoid (logistic) function.

Why subtract the maximum logit before exponentiating?

Subtracting the maximum logit prevents the exponential from overflowing on large inputs. The shift cancels in the numerator and denominator, so the output is mathematically identical but computed safely. This is the standard numerical-stability technique.

How does temperature affect softmax?

Temperature divides the logits before exponentiation. A temperature above 1 flattens the distribution toward uniform for more randomness, while a temperature below 1 sharpens it toward the top score. This is how temperature tunes diversity in LLM text generation.
