AI Foundations

RMSNorm (Root Mean Square Normalization)

RMSNorm (Root Mean Square Normalization) is a normalization layer that rescales a vector by its root mean square and applies a learned gain, without subtracting the mean. By dropping LayerNorm's re-centering step, it is simpler and faster while matching LayerNorm quality, which is why it is standard in modern LLMs.

What is RMSNorm?

RMSNorm (Root Mean Square Normalization) is a layer-normalization variant that scales each input vector by its root mean square and then multiplies by a learned gain vector. Unlike standard Layer Normalization, it does not subtract the mean of the vector and does not add a learned bias. It was introduced by Biao Zhang and Rico Sennrich in 2019 with the hypothesis that LayerNorm's success comes mostly from its re-scaling, not its re-centering, so the mean-subtraction step can be dropped.

Because RMSNorm skips the computation of the mean and the variance and instead computes a single statistic, the root mean square, it does less work per call. The authors reported runtime reductions of roughly 7 to 64 percent across different models while reaching quality comparable to LayerNorm. That combination of simplicity and speed is why RMSNorm has become the default normalization in many large language models.

Normalizes by root mean square, not by mean and variance.
Has a learned scale (gain) but no mean subtraction and typically no bias.
Matches LayerNorm quality while reducing per-layer compute.

RMSNorm versus LayerNorm

LayerNorm first subtracts the mean of the vector, divides by the standard deviation, then applies a learned scale and bias. This makes it invariant to both shifts (re-centering) and scaling (re-scaling) of the input. RMSNorm keeps only the re-scaling invariance: it divides by the root mean square and applies a learned scale, with no mean subtraction.

The practical consequences are fewer operations and fewer parameters. RMSNorm avoids computing the mean, avoids the subtraction, and commonly omits the bias term. Empirically this does not hurt model quality for transformer language models, which is the evidence that re-centering was not the essential part of LayerNorm for these architectures.

RMS(x) = sqrt( (1/n) · sum_{i=1}^{n} x_i^2 + eps );   RMSNorm(x)_i = ( x_i / RMS(x) ) · g_i

Each element x_i is divided by the root mean square of the vector (with a small epsilon for stability) and multiplied by a learned per-dimension gain g_i. No mean is subtracted.

LayerNorm(x)_i = ( (x_i - mu) / sqrt(sigma^2 + eps) ) · g_i + b_i,   where mu = mean(x), sigma^2 = var(x)

For contrast, LayerNorm subtracts the mean mu and divides by the standard deviation, then applies scale and bias. RMSNorm removes the mu and b terms.

LayerNorm: subtract mean, divide by standard deviation, scale and shift.
RMSNorm: divide by root mean square, scale only.
RMSNorm drops the mean computation and usually the bias, saving compute and parameters.

Implementing RMSNorm

RMSNorm is a few lines of code. The implementation below mirrors the version used in LLaMA-style models: compute the mean of squares, take the reciprocal square root, scale, then apply the learned weight. Computing in float32 before casting back improves numerical stability in mixed-precision training.

python

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain g

    def _norm(self, x):
        # divide by root mean square over the last dimension
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        # compute in float32 for stability, then cast back
        out = self._norm(x.float()).type_as(x)
        return out * self.weight

RMSNorm as used in LLaMA-style transformers (PyTorch).

Why modern LLMs use RMSNorm

RMSNorm is used in many widely deployed LLMs, including LLaMA, Mistral, Qwen, Gemma, and DeepSeek. It is typically applied in a pre-normalization configuration, normalizing the input to each attention and feed-forward sublayer before the residual connection, which stabilizes training of deep transformers.

The appeal is that RMSNorm gives the training-stability benefits of LayerNorm at lower cost. With normalization layers running on every sublayer of every transformer block, even a small per-call saving accumulates across a large model and long training run. Later analysis has examined the geometric differences between RMSNorm and LayerNorm, but for practical LLM training the consensus is that RMSNorm is a strong default.

Used in LLaMA, Mistral, Qwen, Gemma, and DeepSeek.
Commonly applied as pre-normalization before each sublayer for training stability.
Per-call savings accumulate across many layers and long training runs.

Key takeaways

RMSNorm normalizes a vector by its root mean square and applies a learned gain, without subtracting the mean.
Dropping re-centering makes it simpler and faster than LayerNorm while matching quality on transformers.
The original paper reported runtime reductions of roughly 7 to 64 percent across models.
RMSNorm is the default normalization in LLaMA, Mistral, Qwen, Gemma, and DeepSeek, usually in a pre-norm setup.

Frequently asked questions

RMSNorm is a normalization layer that divides each input vector by its root mean square and multiplies by a learned scale. It stabilizes training like LayerNorm but skips subtracting the mean, making it simpler and faster while keeping comparable model quality.

LayerNorm subtracts the mean and divides by the standard deviation, then applies a learned scale and bias. RMSNorm only divides by the root mean square and applies a scale, with no mean subtraction and usually no bias, so it does fewer operations.

RMSNorm matches LayerNorm's training stability and quality on transformers while costing less compute and fewer parameters per call. Across the many normalization layers in a large model, those savings add up, so models like LLaMA and Mistral adopted it as the default.

Yes. RMSNorm has a learned per-dimension gain vector that rescales the normalized output. It typically omits the learned bias that LayerNorm includes, since it does not perform mean re-centering.

It is usually applied as pre-normalization, normalizing the input to each attention and feed-forward sublayer before the residual connection is added. This pre-norm placement helps keep deep transformer training stable.

Put the idea into practice

MemX is an AI memory agent built on these ideas: store anything, skip the folders, and find it again by asking in plain English.

Try MemX Free