AI Foundations

Vanishing and Exploding Gradients

Vanishing and exploding gradients are two failure modes in training deep neural networks where gradients shrink toward zero or grow uncontrollably as they propagate backward through layers. Both stall or destabilize learning, especially in very deep or recurrent networks.

What are Vanishing and Exploding Gradients?

Vanishing and exploding gradients are two related problems that arise when training deep neural networks with backpropagation. During backpropagation, the gradient of the loss is computed layer by layer using the chain rule, which multiplies many derivative terms together as the signal travels from the output back to the early layers. When those terms are consistently smaller than one, their product shrinks toward zero, producing vanishing gradients. When they are consistently larger than one, their product grows without bound, producing exploding gradients.

With vanishing gradients, the early layers of a deep network receive almost no learning signal, so their weights barely change and the network struggles to learn long-range dependencies. With exploding gradients, weight updates become enormous and erratic, often pushing the loss to NaN and making training diverge. Both problems are most severe in very deep feedforward networks and in recurrent neural networks unrolled over many time steps.

The issue was characterized in part by Glorot and Bengio in their 2010 paper 'Understanding the difficulty of training deep feedforward networks,' which showed that poor weight initialization and saturating activation functions make the problem worse. Their analysis motivated the initialization schemes and architectural changes that make deep networks trainable today.

Both arise from multiplying many chain-rule terms during backpropagation.
Vanishing: gradient products shrink toward zero, stalling early layers.
Exploding: gradient products grow without bound, destabilizing training.
Worst in very deep feedforward nets and unrolled recurrent networks.
Characterized in part by Glorot and Bengio (2010).

Why does it happen during backpropagation?

Backpropagation computes the gradient at an early layer as a product of Jacobian matrices, one per layer, plus the derivatives of the activation functions. The magnitude of the final gradient depends roughly on the product of the singular values of those matrices and the slopes of the activations. If those magnitudes are on average below one, the product decays exponentially with depth. If they are above one, it grows exponentially.

Saturating activation functions such as the sigmoid and tanh make vanishing gradients more likely because their derivatives are at most 0.25 and 1 respectively, and approach zero whenever the input is large in magnitude. Stacking many such layers multiplies many small derivatives together. Recurrent networks face the same dynamic across time steps, which is why a vanilla RNN cannot easily learn dependencies spanning many steps.

Exploding gradients are common when weights are initialized too large or when recurrent connections amplify the signal at each step. Because the growth is exponential in depth, even a modest per-layer amplification can produce gradients large enough to overflow within a few dozen layers.

Gradients at early layers are products of per-layer Jacobians and activation slopes.
Average term magnitude below one decays; above one grows; both exponential in depth.
Sigmoid and tanh saturate, driving their derivatives toward zero.
Vanilla RNNs vanish across time steps, limiting long-range memory.
Large initial weights or amplifying recurrence cause explosion.

The math behind it

Consider a simple deep network where the gradient at layer 1 is a product over all later layers of weight matrices and activation derivatives. Its magnitude scales as shown below, which makes the exponential dependence on depth explicit.

∂L/∂h₁ = ∂L/∂h_L · ∏_{k=2}^{L} ( Wₖᵀ · diag(σ'(zₖ)) )

‖∂L/∂h₁‖ ∼ γᴸ ,  where γ ≈ ‖Wₖ‖ · |σ'|

γ < 1 ⟹ vanishing,   γ > 1 ⟹ exploding

The gradient at an early layer is a chained product across all L layers. If the typical per-layer factor γ (weight scale times activation slope) is below one the product vanishes exponentially with depth; if above one it explodes.

How do you fix vanishing and exploding gradients?

The most effective fixes target the per-layer factor so that signal magnitude stays near one across depth. Careful weight initialization is the first line of defense: Xavier (Glorot) initialization scales initial weights by the layer's fan-in and fan-out for tanh-style activations, while He initialization scales by fan-in for ReLU networks. Both keep the variance of activations and gradients roughly constant from layer to layer at the start of training.

Non-saturating activation functions such as ReLU and its variants largely avoid the vanishing problem because their derivative is one for positive inputs rather than a fraction. Normalization layers, including batch normalization and layer normalization, keep the distribution of layer inputs stable, which also helps gradients flow. Residual (skip) connections, introduced with ResNets, give gradients a direct path that bypasses the multiplicative chain, which is a key reason networks with hundreds of layers became trainable.

For exploding gradients specifically, gradient clipping caps the gradient norm at a threshold before the optimizer step, preventing rare huge updates from destabilizing training. This is standard practice in recurrent networks and large language model training. Gated recurrent architectures such as LSTM and GRU were designed in part to preserve gradient flow across many time steps, mitigating the vanishing problem in sequence models.

Use Xavier/Glorot initialization for tanh and He initialization for ReLU.
Prefer non-saturating activations like ReLU over sigmoid and tanh.
Apply batch or layer normalization to stabilize layer input distributions.
Add residual (skip) connections so gradients bypass the multiplicative chain.
Clip gradient norms to control explosion; use LSTM/GRU gates in sequence models.

Key takeaways

Vanishing and exploding gradients come from multiplying many chain-rule terms during backpropagation, which decays or grows the signal exponentially with depth.
Vanishing gradients stall the early layers, while exploding gradients cause erratic updates and divergence.
Saturating activations like sigmoid and tanh, and very deep or recurrent architectures, make the problem worse.
Good weight initialization (Xavier, He), non-saturating activations, and normalization keep the per-layer factor near one.
Residual connections and gradient clipping are the practical fixes that made very deep networks trainable.

Frequently asked questions

Backpropagation multiplies many per-layer derivative terms together via the chain rule. When those terms are consistently below one, as happens with saturating activations like sigmoid and tanh or poorly scaled weights, their product shrinks exponentially with depth. The early layers then receive almost no gradient and learn very slowly.

Both come from the same chained multiplication in backpropagation. Vanishing gradients occur when the per-layer factors are below one, so the gradient shrinks toward zero and early layers stop learning. Exploding gradients occur when those factors exceed one, so the gradient grows uncontrollably and produces huge, destabilizing weight updates or NaN losses.

Residual or skip connections add a layer's input directly to its output, giving gradients a shortcut path that bypasses the multiplicative chain of weight matrices and activations. This keeps the gradient magnitude closer to one across many layers, which is a key reason networks with hundreds of layers, such as ResNets, became trainable.

ReLU greatly reduces vanishing gradients because its derivative is one for positive inputs, unlike sigmoid and tanh whose derivatives shrink toward zero. It does not eliminate all issues: dead ReLUs can output zero permanently, and very deep networks still benefit from good initialization, normalization, and residual connections.

Gradient clipping rescales the gradient whenever its norm exceeds a chosen threshold, so the optimizer never takes an excessively large step. This caps rare but catastrophic updates that would otherwise push the loss to NaN. It is standard in recurrent networks and large language model training, where exploding gradients are common.

Sources

Put the idea into practice

MemX is an AI memory agent built on these ideas: store anything, skip the folders, and find it again by asking in plain English.

Try MemX Free