Adam (Adaptive Moment Estimation) is a gradient-based optimization algorithm that adapts the learning rate for each parameter using running estimates of the first and second moments of the gradients. Introduced by Kingma and Ba in 2014, it is one of the most widely used optimizers for training neural networks.
What is the Adam Optimizer?
The Adam optimizer is a first-order, gradient-based optimization algorithm for training neural networks that computes an individual adaptive learning rate for every parameter. The name Adam stands for Adaptive Moment Estimation, because the method maintains exponentially decaying running averages of two statistics of the gradients: the first moment (the mean) and the second moment (the uncentered variance). It was introduced by Diederik Kingma and Jimmy Ba in the 2014 paper 'Adam: A Method for Stochastic Optimization.'
Adam combines ideas from two earlier methods. From momentum it borrows the running average of past gradients, which smooths the update direction and helps the optimizer move through noisy or flat regions. From RMSProp and AdaGrad it borrows per-parameter scaling by a running average of squared gradients, which gives parameters with consistently large gradients smaller steps and parameters with small gradients larger steps. The result is an optimizer that is computationally efficient, has modest memory requirements, and works well across a wide range of problems with little hyperparameter tuning.
Adam is the default choice for many deep learning workloads, including transformers and large language models, because it converges quickly and is relatively insensitive to the scale of gradients. It does not always generalize as well as carefully tuned stochastic gradient descent, in part because L2 regularization interacts poorly with its adaptive scaling. A variant called AdamW decouples weight decay from that scaling and is now standard for training large models.
- Adam stands for Adaptive Moment Estimation.
- It keeps per-parameter running averages of the gradient mean and squared gradient.
- Introduced by Kingma and Ba (2014) in 'Adam: A Method for Stochastic Optimization.'
- Combines momentum (first moment) with RMSProp-style scaling (second moment).
- The default optimizer for many deep nets; AdamW is the standard large-model variant.
How does Adam update weights?
At each step t, Adam first computes the gradient g_t of the loss with respect to the parameters. It then updates two running averages: the first moment estimate m_t and the second moment estimate v_t. These are exponential moving averages controlled by decay rates beta1 and beta2. Because m and v are initialized at zero, they are biased toward zero during the early steps, so Adam applies a bias correction to produce m-hat and v-hat. Finally, each parameter is updated by stepping against m-hat (the descent direction), with the step scaled by the learning rate and inversely by the square root of v-hat.
The square-root scaling is what makes the learning rate adaptive per parameter. A parameter whose gradients have been large and consistent accumulates a large v, which shrinks its effective step, while a parameter with small or sparse gradients takes relatively larger steps. The small constant epsilon in the denominator prevents division by zero and stabilizes updates when v-hat is tiny.
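The bias correction matters most early in training. Because beta2 is closer to 1 than beta1, the zero-initialized second moment v_t stays biased toward zero for longer than m_t, so without correction the denominator would be too small and early updates would be artificially large. Dividing m_t by (1 minus beta1 to the power t) and v_t by (1 minus beta2 to the power t) compensates for this startup bias. With the default decay rates, the first-moment correction fades within a few dozen steps and the second-moment correction within a few thousand.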
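The size of the startup bias can be checked numerically. The sketch below is illustrative only: the function name `update_ratios` and the constant-gradient setup are assumptions, chosen so that the ideal ratio of first moment to root second moment is exactly 1.

```python
import math

def update_ratios(g=1.0, beta1=0.9, beta2=0.999, steps=(1, 10, 100, 1000)):
    """Return {t: (raw_ratio, corrected_ratio)} for a constant gradient g.

    raw_ratio uses m_t / sqrt(v_t) directly; corrected_ratio applies
    Adam's bias correction to both moments first.
    """
    m = v = 0.0
    ratios = {}
    for t in range(1, max(steps) + 1):
        m = beta1 * m + (1 - beta1) * g        # first moment (EMA of gradients)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (EMA of squared gradients)
        if t in steps:
            raw = m / math.sqrt(v)
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
            ratios[t] = (raw, m_hat / math.sqrt(v_hat))
    return ratios

ratios = update_ratios()
```

For a constant gradient the corrected ratio stays at 1 for every step, while the uncorrected ratio starts near 3.16 and is still above 1 after 1,000 steps with the default decay rates.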
- Maintain first moment m_t and second moment v_t as exponential moving averages.
- Bias-correct m_t and v_t to counter their zero initialization.
- Update each weight by the learning rate times m-hat divided by (sqrt(v-hat) plus epsilon).
- Per-parameter scaling shrinks steps for large, consistent gradients.
- Bias correction chiefly affects the early iterations of training.
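The steps above can be sketched in plain Python. This is a minimal illustration on lists of floats rather than a framework implementation; the function name `adam_step` and its signature are assumptions made for this example.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return (theta, m, v) as new lists.

    t is the 1-based step count; m and v start as lists of zeros.
    """
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # update first moment
        v_i = beta2 * v_i + (1 - beta2) * g * g    # update second moment
        m_hat = m_i / (1 - beta1 ** t)             # bias-correct both moments
        v_hat = v_i / (1 - beta2 ** t)
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter scaled step
        new_theta.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

# Two parameters whose gradients differ by a factor of 10,000.
theta, m, v = adam_step([1.0, 1.0], [100.0, 0.01], [0.0, 0.0], [0.0, 0.0], t=1)
```

On this first step both parameters move by roughly the learning rate (about 1e-3) even though their gradients differ by four orders of magnitude, which is the per-parameter scaling at work.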
Adam update equations
The following equations define one Adam update step for a parameter vector theta, given gradient g_t at step t, learning rate alpha, decay rates beta1 and beta2, and stability constant epsilon.
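Written out in the standard form from Kingma and Ba, with m_0 = v_0 = 0 and all operations applied elementwise (the loss symbol L is introduced here for illustration):

```latex
\begin{aligned}
g_t &= \nabla_\theta L(\theta_{t-1}) \\
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t} \\
\hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
```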
Default hyperparameters and AdamW
The original paper recommends default values that work well in most settings without tuning: beta1 = 0.9, beta2 = 0.999, and epsilon = 1e-8. The learning rate alpha is the main knob practitioners adjust, with values around 1e-3 common for many tasks and smaller values such as 1e-4 or below typical when fine-tuning large pretrained models. PyTorch and Optax (JAX) ship these values as their Adam defaults; Keras (TensorFlow) uses the same beta values but a default epsilon of 1e-7, so it is worth checking the framework documentation when reproducing results.
A common pitfall is how weight decay interacts with Adam. Naively adding an L2 penalty to the loss couples the regularization with the adaptive per-parameter scaling, so parameters with large second-moment estimates get less effective decay. Loshchilov and Hutter showed in 'Decoupled Weight Decay Regularization' (2017) that decoupling weight decay from the gradient-based update fixes this. Their variant, AdamW, applies weight decay directly to the parameters as a separate term rather than routing it through the moment estimates.
AdamW generally generalizes better than Adam with L2 regularization and has become the standard optimizer for training transformers and large language models. When training modern deep networks, the practical recommendation is to use AdamW with a tuned learning rate and weight decay rather than plain Adam with an L2 loss term.
- Paper defaults: β₁ = 0.9, β₂ = 0.999, ε = 1e-8.
- Learning rate is the primary hyperparameter to tune (often ~1e-3, smaller for fine-tuning).
- L2 regularization couples badly with Adam's adaptive scaling.
- AdamW (Loshchilov & Hutter, 2017) decouples weight decay from the update.
- AdamW is the standard optimizer for transformers and large language models.
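The difference between the two forms of weight decay can be sketched for a single scalar parameter. The function names below are hypothetical, and scaling the decoupled decay by the learning rate follows the convention used in PyTorch's AdamW, which is an assumption about convention rather than the only possible choice.

```python
import math

def _moments(g, m, v, t, beta1, beta2):
    """Update the moment estimates and return (m, v, m_hat, v_hat)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    return m, v, m / (1 - beta1 ** t), v / (1 - beta2 ** t)

def adam_l2_step(p, g, m, v, t, lr=1e-3, wd=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with an L2 penalty: the decay term wd * p is folded into the gradient."""
    m, v, m_hat, v_hat = _moments(g + wd * p, m, v, t, beta1, beta2)
    return p - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(p, g, m, v, t, lr=1e-3, wd=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW: the decay term bypasses the moment estimates and acts on p directly."""
    m, v, m_hat, v_hat = _moments(g, m, v, t, beta1, beta2)
    return p - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * p), m, v

# Same parameter and a large gradient, one step from zero-initialized moments.
p_plain, _, _ = adam_l2_step(1.0, 10.0, 0.0, 0.0, 1, wd=0.0)
p_l2, _, _ = adam_l2_step(1.0, 10.0, 0.0, 0.0, 1)
p_w, _, _ = adamw_step(1.0, 10.0, 0.0, 0.0, 1)
```

With a large gradient, the L2 version lands almost exactly where undecayed Adam does, because the penalty is normalized away along with the gradient. The AdamW version shrinks the parameter by an extra lr * wd * p regardless of gradient size.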
Adam in practice (PyTorch)
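Most frameworks expose Adam and AdamW as ready-made optimizer classes; in PyTorch they are torch.optim.Adam and torch.optim.AdamW. A minimal PyTorch training step constructs AdamW over the model's parameters with the standard defaults, then clears the gradients, runs the backward pass, and calls the optimizer's step method.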
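A rough sketch of such a training step follows. The toy linear model, the synthetic regression data, and the helper name `train_step` are all hypothetical and chosen only to keep the example self-contained. The hyperparameters are written out explicitly and match PyTorch's documented AdamW defaults.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic regression data (illustrative only).
x = torch.randn(64, 4)
y = x @ torch.tensor([[1.0], [-2.0], [0.5], [3.0]])

model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=1e-2,  # decoupled weight decay
)

def train_step(inputs, targets):
    """Run one optimization step and return the loss measured before the update."""
    optimizer.zero_grad()                      # clear gradients from the previous step
    loss = loss_fn(model(inputs), targets)     # forward pass
    loss.backward()                            # compute gradients
    optimizer.step()                           # apply the AdamW update
    return loss.item()

losses = [train_step(x, y) for _ in range(100)]
```

In a real training loop the inputs would come from a data loader in mini-batches, and the learning rate and weight decay would usually be tuned, often together with a learning rate schedule.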
Key takeaways
- Adam (Adaptive Moment Estimation) gives each parameter its own adaptive learning rate using running averages of the gradient and squared gradient.
- It combines momentum (first moment) with RMSProp-style per-parameter scaling (second moment), plus a bias-correction step for the early iterations.
- The paper defaults beta1 = 0.9, beta2 = 0.999, and epsilon = 1e-8 work without tuning in most cases; the learning rate is the main knob.
- Plain Adam couples L2 weight decay with its adaptive scaling, which hurts regularization.
- AdamW decouples weight decay from the gradient update and is the standard optimizer for transformers and large language models.