RMSNorm (Root Mean Square Normalization) is a normalization layer that rescales a vector by its root mean square and applies a learned gain, without subtracting the mean. Dropping LayerNorm's re-centering step makes it simpler and faster while keeping quality comparable to LayerNorm, which is why it is standard in many modern LLMs.
What is RMSNorm?
RMSNorm (Root Mean Square Normalization) is a layer-normalization variant that scales each input vector by its root mean square and then multiplies by a learned gain vector. Unlike standard Layer Normalization, it does not subtract the mean of the vector and does not add a learned bias. It was introduced by Biao Zhang and Rico Sennrich in 2019 with the hypothesis that LayerNorm's success comes mostly from its re-scaling, not its re-centering, so the mean-subtraction step can be dropped.
Because RMSNorm skips the computation of the mean and the variance and instead computes a single statistic, the root mean square, it does less work per call. The authors reported runtime reductions of roughly 7 to 64 percent across different models while reaching quality comparable to LayerNorm. That combination of simplicity and speed is why RMSNorm has become the default normalization in many large language models.
- Normalizes by root mean square, not by mean and variance.
- Has a learned scale (gain) but no mean subtraction and typically no bias.
- Reaches quality comparable to LayerNorm while reducing per-layer compute.
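As a small worked example of the definition above, the following plain-Python sketch normalizes one vector by hand. The function name `rms_norm`, the `eps` value, and the input numbers are illustrative assumptions, not taken from any particular library.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Scale x by its root mean square, then apply a per-element gain."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)   # eps (assumed value) avoids division by zero
    return [v * inv_rms * g for v, g in zip(x, gain)]

x = [1.0, 7.0, 1.0, 7.0]       # mean of squares = (1 + 49 + 1 + 49) / 4 = 25, so RMS = 5
gain = [1.0, 1.0, 1.0, 1.0]    # a gain of all ones leaves the rescaled values unchanged
y = rms_norm(x, gain)
print([round(v, 3) for v in y])  # [0.2, 1.4, 0.2, 1.4]
```

The output still has a non-zero mean (0.8 here), because nothing is subtracted; only the overall scale changes, so that the mean of squares of the output is approximately 1.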
RMSNorm versus LayerNorm
LayerNorm first subtracts the mean of the vector, divides by the standard deviation, then applies a learned scale and bias. This makes it invariant to both shifts (re-centering) and scaling (re-scaling) of the input. RMSNorm keeps only the re-scaling invariance: it divides by the root mean square and applies a learned scale, with no mean subtraction.
The practical consequences are fewer operations and fewer parameters. RMSNorm avoids computing the mean, avoids the subtraction, and commonly omits the bias term. Empirically this has not hurt model quality for transformer language models, which supports the view that re-centering is not the essential part of LayerNorm for these architectures.
- LayerNorm: subtract mean, divide by standard deviation, scale and shift.
- RMSNorm: divide by root mean square, scale only.
- RMSNorm drops the mean computation and usually the bias, saving compute and parameters.
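The invariance difference can be checked numerically. The sketch below is a simplified illustration that omits the learned scale, bias, and gain so that only the normalization step is compared; the helper names and the test vector are assumptions made for the example.

```python
import math

def layer_norm(x, eps=1e-6):
    """LayerNorm without learned scale and bias: re-center, then re-scale."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    """RMSNorm without the learned gain: re-scale only."""
    mean_square = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_square + eps) for v in x]

def close(a, b, tol=1e-4):
    """True if two vectors agree element-wise within tol."""
    return all(abs(p - q) < tol for p, q in zip(a, b))

x = [1.0, 2.0, 3.0, 6.0]
shifted = [v + 10.0 for v in x]   # add a constant to every element
scaled = [v * 10.0 for v in x]    # multiply every element by a constant

print(close(layer_norm(x), layer_norm(shifted)))  # True: LayerNorm ignores shifts
print(close(rms_norm(x), rms_norm(shifted)))      # False: RMSNorm does not
print(close(layer_norm(x), layer_norm(scaled)))   # True: LayerNorm ignores scaling
print(close(rms_norm(x), rms_norm(scaled)))       # True: so does RMSNorm
```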
Implementing RMSNorm
RMSNorm takes only a few lines of code. LLaMA-style implementations follow four steps: compute the mean of squares, take the reciprocal square root, scale the input by it, then apply the learned weight. Computing in float32 before casting back to the input dtype improves numerical stability in mixed-precision training.
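As an illustration of those steps, here is a minimal NumPy sketch of the forward pass. It is not the LLaMA source, which is written in PyTorch; the class layout, the `eps` default, and the choice to apply the weight in float32 before casting back are assumptions made for this example.

```python
import numpy as np

class RMSNorm:
    """NumPy sketch of RMSNorm over the last axis (forward pass only)."""

    def __init__(self, dim, eps=1e-6):
        self.eps = eps                                 # assumed default; guards against division by zero
        self.weight = np.ones(dim, dtype=np.float32)   # learned gain, initialized to ones

    def __call__(self, x):
        x32 = x.astype(np.float32)                     # upcast so the statistics are computed in float32
        mean_square = np.mean(x32 * x32, axis=-1, keepdims=True)
        inv_rms = 1.0 / np.sqrt(mean_square + self.eps)  # reciprocal square root
        out = x32 * inv_rms * self.weight              # scale, then apply the learned weight
        return out.astype(x.dtype)                     # cast back to the input dtype

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 8)).astype(np.float16)  # (batch, sequence, hidden)
norm = RMSNorm(dim=8)
y = norm(x)
print(y.shape, y.dtype)  # (2, 4, 8) float16
```

In a training framework the weight would be a trainable parameter; here it is a fixed array because the sketch covers only the forward computation.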
Why modern LLMs use RMSNorm
RMSNorm is used in many widely deployed LLMs, including LLaMA, Mistral, Qwen, Gemma, and DeepSeek. It is typically applied in a pre-normalization configuration: the input to each attention and feed-forward sublayer is normalized, and the sublayer's output is then added back through the residual connection, which stabilizes training of deep transformers.
The appeal is that RMSNorm gives the training-stability benefits of LayerNorm at lower cost. With normalization layers running on every sublayer of every transformer block, even a small per-call saving accumulates across a large model and long training run. Later analysis has examined the geometric differences between RMSNorm and LayerNorm, but for practical LLM training the consensus is that RMSNorm is a strong default.
- Used in LLaMA, Mistral, Qwen, Gemma, and DeepSeek.
- Commonly applied as pre-normalization before each sublayer for training stability.
- Per-call savings accumulate across many layers and long training runs.
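The pre-normalization pattern can be sketched as follows. The `attn` and `ffn` functions here are hypothetical stand-ins (a single matrix multiply and a ReLU layer), not real attention or feed-forward implementations, and the depth of 32 blocks is an arbitrary figure chosen for the count at the end.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over the last axis."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def pre_norm_block(x, attn, ffn, w_attn, w_ffn):
    """One pre-norm block: normalize each sublayer's input, add its output to the residual stream."""
    x = x + attn(rms_norm(x, w_attn))   # the residual path carries the un-normalized x
    x = x + ffn(rms_norm(x, w_ffn))
    return x

rng = np.random.default_rng(0)
dim = 8
W1 = rng.standard_normal((dim, dim)) * 0.1
W2 = rng.standard_normal((dim, dim)) * 0.1
attn = lambda h: h @ W1                  # hypothetical stand-in for attention
ffn = lambda h: np.maximum(h @ W2, 0.0)  # hypothetical stand-in for the feed-forward sublayer

x = rng.standard_normal((4, dim))        # (sequence, hidden)
y = pre_norm_block(x, attn, ffn, np.ones(dim), np.ones(dim))
print(y.shape)  # (4, 8)

def norm_calls_per_forward(num_blocks):
    """Two normalizations per block, one before each sublayer."""
    return 2 * num_blocks

print(norm_calls_per_forward(32))  # 64
```

Because the normalization runs before every sublayer, its cost is paid twice per block on every forward pass, which is where a small per-call saving adds up.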
Key takeaways
- RMSNorm normalizes a vector by its root mean square and applies a learned gain, without subtracting the mean.
- Dropping re-centering makes it simpler and faster than LayerNorm while matching quality on transformers.
- The original paper reported runtime reductions of roughly 7 to 64 percent across models.
- RMSNorm is the default normalization in LLaMA, Mistral, Qwen, Gemma, and DeepSeek, usually in a pre-norm setup.