Adam Optimizer

Adam (Adaptive Moment Estimation) is a gradient-based optimization algorithm that adapts the learning rate for each parameter using running estimates of the first and second moments of the gradients. Introduced by Kingma and Ba in 2014, it is one of the most widely used optimizers for training neural networks.

What is the Adam Optimizer?

The Adam optimizer is a first-order, gradient-based optimization algorithm for training neural networks that computes an individual adaptive learning rate for every parameter. The name Adam stands for Adaptive Moment Estimation, because the method maintains exponentially decaying running averages of two statistics of the gradients: the first moment (the mean) and the second moment (the uncentered variance). It was introduced by Diederik Kingma and Jimmy Ba in the 2014 paper 'Adam: A Method for Stochastic Optimization.'

Adam combines ideas from two earlier methods. From momentum it borrows the running average of past gradients, which smooths the update direction and helps the optimizer move through noisy or flat regions. From RMSProp and AdaGrad it borrows per-parameter scaling by a running average of squared gradients, which gives parameters with consistently large gradients smaller steps and parameters with small gradients larger steps. The result is an optimizer that is computationally efficient, has modest memory requirements, and works well across a wide range of problems with little hyperparameter tuning.

Adam is the default choice for many deep learning workloads, including transformers and large language models, because it converges quickly and is relatively insensitive to the scale of gradients. It does not always generalize as well as carefully tuned stochastic gradient descent, and its poor interaction with L2 regularization is one known cause. A variant with decoupled weight decay, AdamW, addresses that interaction and is now standard for training large models.

Adam stands for Adaptive Moment Estimation.
It keeps per-parameter running averages of the gradient and of the squared gradient.
Introduced by Kingma and Ba (2014) in 'Adam: A Method for Stochastic Optimization.'
Combines momentum (first moment) with RMSProp-style scaling (second moment).
The default optimizer for many deep nets; AdamW is the standard large-model variant.

How does Adam update weights?

At each step t, Adam first computes the gradient g_t of the loss with respect to the parameters. It then updates two running averages: the first moment estimate m_t and the second moment estimate v_t. These are exponential moving averages controlled by decay rates beta1 and beta2. Because m and v are initialized at zero, they are biased toward zero during the early steps, so Adam applies a bias correction to produce m-hat and v-hat. Finally, each parameter is updated by stepping in the direction of m-hat scaled inversely by the square root of v-hat.

The square-root scaling is what makes the learning rate adaptive per parameter. A parameter whose gradients have been large and consistent accumulates a large v, which shrinks its effective step, while a parameter with small or sparse gradients takes relatively larger steps. The small constant epsilon in the denominator prevents division by zero and stabilizes updates when v-hat is tiny.

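The bias correction matters most early in training. With zero-initialized moments, the raw estimates m_t and v_t underestimate the true moments by factors of (1 minus beta1 to the power t) and (1 minus beta2 to the power t). Because the two factors differ, the ratio of m_t to the square root of v_t is mis-scaled: with the default decay rates, the very first uncorrected step would be roughly three times too large. Dividing m_t by (1 minus beta1 to the power t) and v_t by (1 minus beta2 to the power t) removes this startup bias. With the defaults, the first-moment correction fades within a few dozen steps, while the second-moment correction stays noticeable for a few thousand.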
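A small sketch can make the size of the correction concrete. It assumes the paper's default decay rates, and the helper name correction is hypothetical, chosen only for this illustration. The second half checks a simple case: with a constant gradient, the corrected first moment should recover that gradient at every step.

```python
beta1, beta2 = 0.9, 0.999  # paper defaults

def correction(beta, t):
    """Factor that rescales a raw moment estimate at step t (t starts at 1)."""
    return 1.0 / (1.0 - beta ** t)

# The factor starts large and approaches 1 as t grows.
for t in (1, 10, 100, 1000, 10000):
    print(t, correction(beta1, t), correction(beta2, t))

# With a constant gradient g, the raw average is m_t = (1 - beta1**t) * g,
# so the corrected estimate recovers g at every step (up to rounding).
g, m = 2.5, 0.0
recovered = []
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g
    recovered.append(m * correction(beta1, t))
print(recovered)
```

At t = 1 the factors are about 10 for the first moment and about 1000 for the second. By t = 1000 the first-moment factor is essentially 1, while the second-moment factor is still roughly 1.6.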

Maintain first moment m_t and second moment v_t as exponential moving averages.
Bias-correct m_t and v_t to counter their zero initialization.
Update each weight by the learning rate times m-hat divided by (sqrt(v-hat) plus epsilon).
Per-parameter scaling shrinks steps for large, consistent gradients.
Bias correction chiefly affects the early iterations of training.
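The per-parameter scaling described above can be illustrated with a minimal pure-Python sketch. The function adam_updates and the two gradient sequences are hypothetical, made up for this comparison. It contrasts plain SGD steps with Adam steps for one parameter whose gradients are consistently large and another whose gradients are consistently small.

```python
import math

def adam_updates(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the Adam update applied at each step for one scalar parameter."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates

lr = 1e-3
big_grads = [100.0] * 5     # parameter with consistently large gradients
small_grads = [0.01] * 5    # parameter with consistently small gradients

# Plain SGD steps scale with the gradient: they differ by a factor of 10,000.
sgd_big = [lr * g for g in big_grads]
sgd_small = [lr * g for g in small_grads]

# Adam normalizes by sqrt(v_hat), so both parameters move by roughly lr.
adam_big = adam_updates(big_grads, lr=lr)
adam_small = adam_updates(small_grads, lr=lr)
print(sgd_big[-1], sgd_small[-1])
print(adam_big[-1], adam_small[-1])
```

Constant gradients are the simplest case to reason about. With noisy gradients the ratio of m-hat to sqrt(v-hat) drops below 1, so Adam's steps shrink further when the gradient direction is inconsistent.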

Adam update equations

The following equations define one Adam update step for a parameter vector theta, given gradient g_t at step t, learning rate alpha, decay rates beta1 and beta2, and stability constant epsilon.

m_t = β₁·m_{t-1} + (1 − β₁)·g_t
v_t = β₂·v_{t-1} + (1 − β₂)·g_t²
m̂_t = m_t / (1 − β₁ᵗ)
v̂_t = v_t / (1 − β₂ᵗ)
θ_t = θ_{t-1} − α · m̂_t / (√v̂_t + ε)

m_t and v_t are the running first and second moment estimates; m̂_t and v̂_t are their bias-corrected versions; the final line is the per-parameter weight update. Paper defaults are β₁ = 0.9, β₂ = 0.999, ε = 1e-8.
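The equations above translate almost line for line into code. The following is a minimal pure-Python sketch, not a reference implementation: the function adam_step, the toy objective, and the learning rate of 0.05 are all assumptions chosen for illustration.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a list of parameters; returns (theta, m, v)."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g          # first moment estimate
        v_i = beta2 * v_i + (1 - beta2) * g * g      # second moment estimate
        m_hat = m_i / (1 - beta1 ** t)               # bias-corrected first moment
        v_hat = v_i / (1 - beta2 ** t)               # bias-corrected second moment
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

# Toy problem (an assumption for illustration): minimize
# f(x, y) = (x - 3)^2 + 10 * (y + 1)^2, whose minimum is at (3, -1).
def grad_f(theta):
    x, y = theta
    return [2 * (x - 3), 20 * (y + 1)]

theta = [0.0, 0.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
for t in range(1, 1001):          # t starts at 1 so the bias correction is defined
    theta, m, v = adam_step(theta, grad_f(theta), m, v, t, lr=0.05)

print(theta)  # should end close to [3, -1]
```

Both coordinates initially move by about the same amount per step, even though the gradient along y is ten times steeper. This is the per-parameter normalization at work.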

Default hyperparameters and AdamW

The original paper recommends default values that work well in most settings without tuning: beta1 = 0.9, beta2 = 0.999, and epsilon = 1e-8. The learning rate alpha is the main knob practitioners adjust, with values around 1e-3 common for many tasks and smaller values such as 1e-4 or below typical when fine-tuning large pretrained models. PyTorch's Adam and Optax's adam (a common choice in JAX) use the paper's values as defaults. Keras in TensorFlow uses the same beta values but defaults to epsilon = 1e-7, so it is worth checking the defaults when porting a training setup between frameworks.

A common pitfall is how weight decay interacts with Adam. Naively adding an L2 penalty to the loss couples the regularization with the adaptive per-parameter scaling, so parameters with large second-moment estimates get less effective decay. Loshchilov and Hutter showed in 'Decoupled Weight Decay Regularization' (2017) that decoupling weight decay from the gradient-based update fixes this. Their variant, AdamW, applies weight decay directly to the parameters as a separate term rather than routing it through the moment estimates.

AdamW generally generalizes better than Adam with L2 regularization and has become the standard optimizer for training transformers and large language models. When training modern deep networks, the practical recommendation is to use AdamW with a tuned learning rate and weight decay rather than plain Adam with an L2 loss term.

Paper defaults: β₁ = 0.9, β₂ = 0.999, ε = 1e-8.
Learning rate is the primary hyperparameter to tune (often ~1e-3, smaller for fine-tuning).
L2 regularization couples badly with Adam's adaptive scaling.
AdamW (Loshchilov & Hutter, 2017) decouples weight decay from the update.
AdamW is the standard optimizer for transformers and large language models.
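The coupling problem can be seen in a single update step. The sketch below is a simplified illustration, not any library's code: the function names are hypothetical, the moment histories are set by hand, and scaling the decoupled decay term by the learning rate is an assumption that mirrors a common implementation choice. Both weights have the same value and a zero loss gradient at this step, so any movement comes from weight decay alone.

```python
import math

def moments(g, m, v, t, beta1=0.9, beta2=0.999):
    """Update the moment estimates and return (m, v, m_hat, v_hat)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    return m, v, m / (1 - beta1 ** t), v / (1 - beta2 ** t)

def adam_l2_update(w, g, m, v, t, lr=1e-3, wd=0.01, eps=1e-8):
    """Adam with an L2 penalty: the decay term wd * w is folded into the gradient."""
    _, _, m_hat, v_hat = moments(g + wd * w, m, v, t)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

def adamw_update(w, g, m, v, t, lr=1e-3, wd=0.01, eps=1e-8):
    """AdamW: the moments see only the loss gradient; decay is a separate term."""
    _, _, m_hat, v_hat = moments(g, m, v, t)
    return lr * m_hat / (math.sqrt(v_hat) + eps) + lr * wd * w

# The two cases differ only in gradient history: one weight has a small
# running second moment, the other a large one.
w, g, t = 1.0, 0.0, 5_000
for v_prev in (1e-4, 100.0):
    l2 = adam_l2_update(w, g, m=0.0, v=v_prev, t=t)
    decoupled = adamw_update(w, g, m=0.0, v=v_prev, t=t)
    print(f"v_prev={v_prev:g}  Adam+L2 decay={l2:.3e}  AdamW decay={decoupled:.3e}")
```

In this sketch the L2 route shrinks the weight with the large second moment roughly a thousand times less than the other, while the decoupled route shrinks both by the same amount, lr * wd * w.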

Adam in practice (PyTorch)

Most frameworks expose Adam and AdamW as separate optimizer classes; in PyTorch these are torch.optim.Adam and torch.optim.AdamW. The example below shows a minimal PyTorch training step using AdamW with the standard defaults.

python

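import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(128, 10)

# Synthetic stand-in data so the loop runs end to end; swap in a real dataset.
features = torch.randn(256, 128)
labels = torch.randint(0, 10, (256,))
dataloader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

# AdamW with the original Adam defaults plus decoupled weight decay
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01,
)

for inputs, targets in dataloader:
    optimizer.zero_grad()    # clear gradients left over from the previous step
    logits = model(inputs)
    loss = torch.nn.functional.cross_entropy(logits, targets)
    loss.backward()          # compute gradients
    optimizer.step()         # adaptive per-parameter update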

A minimal PyTorch optimization loop using AdamW with the canonical Adam defaults.

Key takeaways

Adam (Adaptive Moment Estimation) gives each parameter its own adaptive learning rate using running averages of the gradient and squared gradient.
It combines momentum (first moment) with RMSProp-style per-parameter scaling (second moment), plus a bias-correction step for the early iterations.
The paper defaults beta1 = 0.9, beta2 = 0.999, and epsilon = 1e-8 work without tuning in most cases; the learning rate is the main knob.
Plain Adam couples L2 weight decay with its adaptive scaling, which hurts regularization.
AdamW decouples weight decay from the gradient update and is the standard optimizer for transformers and large language models.

Frequently asked questions

What does Adam stand for?

Adam stands for Adaptive Moment Estimation. The name refers to how the algorithm estimates the first moment (mean) and second moment (uncentered variance) of the gradients using exponentially decaying running averages, then uses those estimates to set a separate adaptive learning rate for each parameter.

What are the default hyperparameters for Adam?

The original Kingma and Ba paper recommends beta1 = 0.9, beta2 = 0.999, and epsilon = 1e-8. These defaults work well across most problems and are the built-in values in PyTorch. Other frameworks can differ slightly; Keras in TensorFlow, for example, defaults to epsilon = 1e-7. The learning rate is usually the main hyperparameter that practitioners tune.

What is the difference between Adam and AdamW?

With plain Adam, weight decay is usually implemented as an L2 penalty added to the loss. The decay term then passes through Adam's adaptive per-parameter scaling, which weakens regularization for parameters with large second-moment estimates. AdamW decouples weight decay, applying it directly to the parameters as a separate term. AdamW usually generalizes better and is the standard for training large models.

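Why does Adam use bias correction?

Adam initializes its first and second moment estimates at zero, which biases both toward zero during the early steps. Because the two estimates are biased by different amounts, their ratio is mis-scaled, and with the default decay rates the earliest uncorrected updates would be too large. Bias correction divides each moment by one minus its decay rate raised to the step number, compensating for the startup bias so step sizes are at their intended scale from the first iteration.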

Is Adam better than SGD?

Adam often converges faster and typically needs less tuning, which makes it the default for many tasks including transformers. However, well-tuned stochastic gradient descent with momentum can sometimes generalize better on certain vision tasks. For large language models, the decoupled variant AdamW is the common choice rather than plain SGD.
