The feed-forward network (FFN), also called the MLP block, is the position-wise sublayer inside every transformer layer that applies two linear projections with a nonlinearity in between. SwiGLU is a gated variant that replaces the simple nonlinearity with a Swish-gated linear unit and is now standard in models like LLaMA and PaLM.
What is a Feed-Forward Network / MLP Block?
A feed-forward network (FFN), also called the MLP block, is the position-wise sublayer that sits after the attention sublayer in each transformer layer. It is a small two-layer multilayer perceptron applied independently and identically to every token position. Where attention mixes information across positions, the FFN processes each position on its own, expanding the hidden representation to a larger inner dimension, applying a nonlinearity, and projecting back down.
In the original transformer, the FFN consists of two linear transformations with a ReLU activation between them. The first projection raises the model dimension d_model to a larger inner dimension d_ff (four times larger in the original design), and the second projection brings it back to d_model. Because the same weights are reused at every position, the FFN is often described as a position-wise feed-forward layer. With d_ff = 4 x d_model, the two FFN matrices hold roughly two-thirds of each layer's weights (about 8 x d_model^2, against 4 x d_model^2 for the four attention projections), which makes the block's design central to both capacity and efficiency.
- Applied per token position, independent of other positions, unlike attention.
- Two linear layers with a nonlinearity between them, expanding then contracting the dimension.
- Typical inner dimension d_ff is about 4x the model dimension d_model in the original design.
- Contains most of a transformer's parameters and a large share of its compute.
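The parameter share mentioned above can be checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: standard multi-head attention with four d_model x d_model projections, no bias terms, and no embedding or normalization weights counted. The function names are illustrative only.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in the two FFN matrices (biases ignored)."""
    return 2 * d_model * d_ff


def attention_params(d_model: int) -> int:
    """Weights in the Q, K, V and output projections (biases ignored)."""
    return 4 * d_model * d_model


d_model = 512
d_ff = 4 * d_model  # the 4x expansion of the original design

ffn = ffn_params(d_model, d_ff)    # 2 * 512 * 2048
attn = attention_params(d_model)   # 4 * 512 * 512
ffn_share = ffn / (ffn + attn)     # fraction of per-layer weights in the FFN

print(f"FFN: {ffn:,}  attention: {attn:,}  FFN share: {ffn_share:.1%}")
```

Under these assumptions the FFN accounts for two-thirds of the per-layer weights regardless of d_model, because both counts scale with d_model squared.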
The original FFN formulation
The feed-forward sublayer in Attention Is All You Need uses a ReLU between two affine maps: FFN(x) = max(0, x W1 + b1) W2 + b2. Each token vector x is projected up by W1, passed through ReLU, then projected back down by W2. This is the simplest and most widely cited form of the block.
In practice, modern transformers often drop the bias terms and pair the FFN with a residual connection and layer normalization, so the actual computation in a layer is x + FFN(Norm(x)) under the common pre-norm arrangement.
- ReLU zeroes out negative pre-activations, keeping positive ones unchanged.
- Bias terms are frequently omitted in large models, with little or no reported loss of quality.
- The block is wrapped in a residual connection plus normalization.
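As a minimal sketch of the block described in this section, the following pure-Python code applies a ReLU FFN to each token vector and wraps it in a pre-norm residual connection. The toy sizes, the random weights, the parameter-free layer normalization, and all helper names are assumptions made for illustration; a real model would use a tensor library and learned parameters.

```python
import math
import random


def linear(x, W, b=None):
    """Compute x @ W (+ b) for a vector x and a matrix W stored as rows."""
    out = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]
    if b is not None:
        out = [o + bj for o, bj in zip(out, b)]
    return out


def relu(v):
    """Zero out negative entries, keep positive ones unchanged."""
    return [max(0.0, a) for a in v]


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one token vector x."""
    return linear(relu(linear(x, W1, b1)), W2, b2)


def layer_norm(v, eps=1e-5):
    """Normalize to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def pre_norm_ffn_block(x, W1, b1, W2, b2):
    """Pre-norm residual arrangement: x + FFN(Norm(x))."""
    y = ffn(layer_norm(x), W1, b1, W2, b2)
    return [a + c for a, c in zip(x, y)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_ff = 4, 16  # toy sizes with d_ff = 4 * d_model
W1, b1 = random_matrix(d_model, d_ff, rng), [0.0] * d_ff
W2, b2 = random_matrix(d_ff, d_model, rng), [0.0] * d_model

# A "sequence" of three token vectors; the same weights are applied to each.
tokens = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(3)]
outputs = [ffn(x, W1, b1, W2, b2) for x in tokens]

# Position-wise: replacing token 0 leaves the outputs for tokens 1 and 2 alone.
changed = [[0.0] * d_model] + tokens[1:]
outputs_changed = [ffn(x, W1, b1, W2, b2) for x in changed]

block_out = pre_norm_ffn_block(tokens[0], W1, b1, W2, b2)
```

Because `ffn` only ever sees one token vector at a time, nothing computed for one position can influence another; mixing across positions is left to attention.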
What is SwiGLU?
SwiGLU is a gated variant of the FFN introduced in Noam Shazeer's 2020 paper GLU Variants Improve Transformer. Instead of a single linear projection followed by an activation, a gated linear unit (GLU) uses two parallel projections of the input: one passes through an activation and acts as a gate, and the other is the value. Their element-wise product is then projected down by a third matrix. SwiGLU specifically uses the Swish (SiLU) function, Swish(z) = z * sigmoid(z), as the gating activation, giving FFN_SwiGLU(x) = (Swish(x W) ⊗ x V) W2 in the paper's bias-free notation, where ⊗ is the element-wise product.
The paper compared several GLU variants, including ReGLU, GEGLU, and SwiGLU, and found that GEGLU and SwiGLU achieved the best perplexities on language modeling. Because SwiGLU adds a third weight matrix, implementations reduce the inner dimension (commonly to about two-thirds of the equivalent non-gated d_ff) so the total parameter count stays comparable. SwiGLU is now the default FFN in models such as PaLM and Meta's LLaMA family.
- GLU projects the input twice, into a gate path and a value path, then multiplies the two element-wise.
- SwiGLU uses Swish (SiLU) on the gate; GEGLU uses GELU; ReGLU uses ReLU.
- Adds a third weight matrix, so d_ff is usually scaled down to keep parameters constant.
- Adopted by PaLM, LLaMA, and many later open models as a quality improvement.
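The gate and value paths described above can be sketched in plain Python as follows. The weight names `W_gate`, `W_value`, and `W_down` are hypothetical labels chosen for readability, the sizes and random weights are toy assumptions, and bias terms are left out; this is an illustration of the idea, not any particular model's implementation.

```python
import math
import random


def matvec(x, W):
    """Compute x @ W for a vector x and a matrix W stored as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def swish(z):
    """Swish / SiLU: z * sigmoid(z). Fine for toy inputs; exp can overflow
    for very large negative z."""
    return z / (1.0 + math.exp(-z))


def swiglu_ffn(x, W_gate, W_value, W_down):
    """(Swish(x W_gate) * (x W_value)) W_down, with no bias terms."""
    gate = [swish(g) for g in matvec(x, W_gate)]    # gate path
    value = matvec(x, W_value)                      # value path
    hidden = [g * v for g, v in zip(gate, value)]   # element-wise product
    return matvec(hidden, W_down)                   # third matrix projects down


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_hidden = 4, 8  # toy sizes chosen only for illustration
W_gate = random_matrix(d_model, d_hidden, rng)
W_value = random_matrix(d_model, d_hidden, rng)
W_down = random_matrix(d_hidden, d_model, rng)

x = [rng.uniform(-1, 1) for _ in range(d_model)]
out = swiglu_ffn(x, W_gate, W_value, W_down)
```

Swapping `swish` for a GELU or ReLU function in the gate path would give the GEGLU and ReGLU variants, with the rest of the computation unchanged.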
Why the FFN matters for LLM behavior
The FFN is where much of a transformer's stored knowledge and feature processing lives. Interpretability research often treats FFN layers as key-value memories: the up-projection detects patterns in the residual stream, and the down-projection writes associated information back. This is one reason the block carries so many parameters relative to attention.
Choosing a better FFN activation is one of the cheapest reliable quality gains available to model designers because it changes only the per-position computation, leaves attention untouched, and keeps parameter counts roughly fixed when the inner dimension is rescaled. That favorable trade-off explains why SwiGLU spread quickly across open and closed models after 2020.
- FFN layers behave like key-value memories that store learned associations.
- Activation choice changes per-token compute without touching attention.
- SwiGLU offers measurable perplexity gains at near-constant parameter cost.
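The "near-constant parameter cost" claim comes down to simple arithmetic: a gated FFN has three matrices instead of two, so its inner width is shrunk to about two-thirds. The sketch below works this through for one example size. The helper name and the `multiple_of` rounding are assumptions added for illustration (rounding to a hardware-friendly multiple is a common practice, but the exact granularity varies by implementation).

```python
def gated_inner_dim(d_ff: int, multiple_of: int = 1) -> int:
    """Inner width for a three-matrix gated FFN that matches a two-matrix FFN.

    Solves 3 * d_model * d_gated = 2 * d_model * d_ff, rounds down to an
    integer, then rounds up to a multiple of `multiple_of` (hypothetical knob).
    """
    d_gated = (2 * d_ff) // 3
    return multiple_of * ((d_gated + multiple_of - 1) // multiple_of)


d_model = 4096
d_ff = 4 * d_model                     # 16384 for the non-gated baseline
baseline = 2 * d_model * d_ff          # two matrices: up and down

d_exact = gated_inner_dim(d_ff)        # 10922
d_round = gated_inner_dim(d_ff, 256)   # 11008

gated_exact = 3 * d_model * d_exact    # three matrices: gate, value, down
gated_round = 3 * d_model * d_round

print(f"exact width {d_exact}: {gated_exact / baseline:.4f} of baseline")
print(f"rounded width {d_round}: {gated_round / baseline:.4f} of baseline")
```

Either way the gated block lands within about one percent of the baseline weight count, which is why the comparison between activations is usually treated as parameter-matched.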
Key takeaways
- The feed-forward network (MLP block) is the per-position sublayer that expands, nonlinearly transforms, and contracts each token vector inside every transformer layer.
- The original design uses a ReLU between two linear projections with inner dimension about 4x the model dimension.
- SwiGLU replaces the single activation with a Swish-gated linear unit using two parallel up-projections multiplied element-wise.
- GLU Variants Improve Transformer (Shazeer, 2020) found that GEGLU and SwiGLU gave the best perplexities among the variants it compared.
- SwiGLU adds a third matrix, so inner dimension is scaled down (often to about two-thirds) to keep parameters constant; it is now standard in LLaMA and PaLM.