Feed-Forward Network / MLP Block (and SwiGLU)

The feed-forward network (FFN), also called the MLP block, is the position-wise sublayer inside every transformer layer that applies two linear projections with a nonlinearity in between. SwiGLU is a gated variant that replaces the simple nonlinearity with a Swish-gated linear unit and is now standard in models like LLaMA and PaLM.

What is a Feed-Forward Network / MLP Block?

A feed-forward network (FFN), also called the MLP block, is the position-wise sublayer that sits after the attention sublayer in each transformer layer. It is a small two-layer multilayer perceptron applied independently and identically to every token position. Where attention mixes information across positions, the FFN processes each position on its own, expanding the hidden representation to a larger inner dimension, applying a nonlinearity, and projecting back down.

In the original transformer, the FFN consists of two linear transformations with a ReLU activation between them. The first projection raises the model dimension d_model to a larger inner dimension d_ff (typically four times larger), and the second projection brings it back to d_model. Because the same weights are reused at every position, the FFN is often described as a position-wise feed-forward layer. With d_ff = 4·d_model, the two FFN matrices hold about twice as many weights as the four attention projections, so the FFN accounts for roughly two-thirds of each layer's parameters. That makes its design central to both capacity and efficiency.

Applied per token position, independent of other positions, unlike attention.
Two linear layers with a nonlinearity between them, expanding then contracting the dimension.
Typical inner dimension d_ff is about 4x the model dimension d_model in the original design.
Contains most of a transformer's parameters and a large share of its compute.
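The parameter claim in the list above can be checked with a little arithmetic. The sketch below assumes a standard layer with four d_model × d_model attention projections (query, key, value, output), no bias terms, and it ignores embeddings and normalization weights. The function names are illustrative and do not come from any library.

```python
def attention_params(d_model: int) -> int:
    # Assumed layout: Q, K, V and output projections, each d_model x d_model.
    return 4 * d_model * d_model


def ffn_params(d_model: int, d_ff: int) -> int:
    # Up-projection (d_model x d_ff) plus down-projection (d_ff x d_model).
    return 2 * d_model * d_ff


d_model = 512
d_ff = 4 * d_model  # the 4x expansion of the original design

attn = attention_params(d_model)
ffn = ffn_params(d_model, d_ff)
ffn_share = ffn / (attn + ffn)

print(attn, ffn, round(ffn_share, 3))  # 1048576 2097152 0.667
```

Under these assumptions the FFN holds two-thirds of the per-layer weights regardless of d_model, because both counts scale with d_model squared.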

The original FFN formulation

The feed-forward sublayer in Attention Is All You Need uses a ReLU between two affine maps. Each token vector x is projected up, passed through ReLU, then projected back down. This is the simplest and most widely cited form of the block.

FFN(x) = max(0, x·W₁ + b₁)·W₂ + b₂

The original transformer FFN: a ReLU between two linear projections. W₁ maps d_model to d_ff, and W₂ maps d_ff back to d_model.

In practice, modern transformers often drop the bias terms and pair the FFN with a residual connection and layer normalization, so the actual computation in a layer is x + FFN(Norm(x)) under the common pre-norm arrangement.

ReLU zeroes out negative pre-activations, keeping positive ones unchanged.
Bias terms are frequently omitted in large models without loss of quality.
The block is wrapped in a residual connection plus normalization.
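As a rough illustration of the formula and the pre-norm wrapper described above, here is a dependency-free sketch that applies the FFN to one token vector at a time. The helper names, the toy weights, and the simplified normalization (no learned scale or shift) are assumptions made for this example, not part of any particular model.

```python
import math


def linear(x, W, b):
    """Return x·W + b for a single token vector x (W has len(x) rows)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x·W1 + b1)·W2 + b2."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # ReLU
    return linear(hidden, W2, b2)


def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (simplified)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def ffn_sublayer(x, W1, b1, W2, b2):
    """Pre-norm residual wrapper: x + FFN(Norm(x))."""
    update = ffn(layer_norm(x), W1, b1, W2, b2)
    return [xi + ui for xi, ui in zip(x, update)]


# Toy weights with d_model = 2 and d_ff = 4, chosen so that FFN(x) == x:
# the hidden units hold relu(x0), relu(-x0), relu(x1), relu(-x1).
W1 = [[1.0, -1.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b2 = [0.0, 0.0]

tokens = [[3.0, -2.0], [0.5, 1.0]]

# Position-wise: the same weights are applied to each token independently.
ffn_out = [ffn(t, W1, b1, W2, b2) for t in tokens]
sublayer_out = [ffn_sublayer(t, W1, b1, W2, b2) for t in tokens]
print(ffn_out)
print(sublayer_out)
```

The identity-like weights are only there to make the result easy to verify by hand. Trained weights would of course produce a nontrivial transformation.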

What is SwiGLU?

SwiGLU is a gated variant of the FFN introduced in Noam Shazeer's 2020 paper GLU Variants Improve Transformer. Instead of a single linear projection followed by an activation, a gated linear unit (GLU) uses two parallel projections of the input: one passes through an activation and acts as a gate, and the other is the value. Their element-wise product is then projected down by a third matrix. SwiGLU specifically uses the Swish (SiLU) function as the gating activation.

The paper compared several GLU variants, including ReGLU, GEGLU, and SwiGLU, and found that GEGLU and SwiGLU achieved the best perplexities on language modeling. Because SwiGLU adds a third weight matrix, implementations reduce the inner dimension (commonly to about two-thirds of the equivalent non-gated d_ff) so the total parameter count stays comparable. SwiGLU is now the default FFN in models such as PaLM and Meta's LLaMA family.

SwiGLU(x) = (Swish₁(x·W) ⊗ (x·V))·W₂,   Swish_β(z) = z·σ(βz)

The SwiGLU FFN: a Swish-gated path (W) multiplied element-wise (⊗) by a value path (V), then projected down by W₂. Here σ is the logistic sigmoid, and β is commonly fixed to 1, which makes Swish identical to SiLU.
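To make the gating activation concrete, the following sketch evaluates Swish at a few points next to ReLU and shows the gate-times-value product for a single hidden unit. It uses only the standard library, and the function names are illustrative.

```python
import math


def swish(z, beta=1.0):
    """Swish_beta(z) = z * sigmoid(beta * z); beta = 1 gives SiLU."""
    return z / (1.0 + math.exp(-beta * z))


def relu(z):
    return max(0.0, z)


def swiglu_unit(gate_preact, value):
    """One hidden unit of SwiGLU: Swish-activated gate times the linear value."""
    return swish(gate_preact) * value


for z in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"z={z:+.1f}  relu={relu(z):.4f}  swish={swish(z):+.4f}")
```

Swish at -1 is lower than Swish at -5, which shows the non-monotonic dip for negative inputs. ReLU maps both to zero.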

python

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        # Two up-projections: gate (w) and value (v)
        self.w = nn.Linear(d_model, d_ff, bias=False)
        self.v = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # F.silu is Swish with beta = 1
        return self.w2(F.silu(self.w(x)) * self.v(x))

# d_ff is often set to ~2/3 of 4*d_model to match a plain FFN's params
block = SwiGLU(d_model=512, d_ff=1365)
out = block(torch.randn(2, 16, 512))  # (batch, seq, d_model)
print(out.shape)  # torch.Size([2, 16, 512])

A minimal SwiGLU feed-forward block in PyTorch.

GLU splits the input into a gate path and a value path, then multiplies them element-wise.
SwiGLU uses Swish (SiLU) on the gate; GEGLU uses GELU; ReGLU uses ReLU.
Adds a third weight matrix, so d_ff is usually scaled down to keep parameters constant.
Adopted by PaLM, LLaMA, and many later open models as a quality improvement.
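The two-thirds rule mentioned above follows from equating parameter counts: a plain FFN has two d_model × d_ff matrices, a gated one has three. The sketch below works through that arithmetic for d_model = 512. It assumes no bias terms and rounds down to an integer, and the function names are made up for this example.

```python
def plain_ffn_params(d_model: int, d_ff: int) -> int:
    # W1 (d_model x d_ff) and W2 (d_ff x d_model).
    return 2 * d_model * d_ff


def swiglu_params(d_model: int, d_ff: int) -> int:
    # Gate W and value V (each d_model x d_ff) plus W2 (d_ff x d_model).
    return 3 * d_model * d_ff


def matched_swiglu_d_ff(d_ff_plain: int) -> int:
    # Solve 3 * d_model * h = 2 * d_model * d_ff_plain for h, rounding down.
    return (2 * d_ff_plain) // 3


d_model = 512
d_ff_plain = 4 * d_model
d_ff_gated = matched_swiglu_d_ff(d_ff_plain)

plain = plain_ffn_params(d_model, d_ff_plain)
gated = swiglu_params(d_model, d_ff_gated)
print(d_ff_gated, plain, gated)  # 1365 2097152 2096640
```

The result matches the d_ff=1365 used in the PyTorch block above. Implementations may round this value to a hardware-friendly multiple instead of using it exactly.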

Why the FFN matters for LLM behavior

The FFN is where much of a transformer's stored knowledge and feature processing lives. Interpretability research often treats FFN layers as key-value memories: the up-projection detects patterns in the residual stream, and the down-projection writes associated information back. This is one reason the block carries so many parameters relative to attention.

Choosing a better FFN activation is among the cheaper quality gains available to model designers: it changes only the per-position computation, leaves attention untouched, and keeps parameter counts roughly fixed when the inner dimension is rescaled. That favorable trade-off helps explain why SwiGLU spread quickly across open and closed models after 2020.

FFN layers behave like key-value memories that store learned associations.
Activation choice changes per-token compute without touching attention.
SwiGLU offers measurable perplexity gains at near-constant parameter cost.
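The key-value reading of the FFN rests on a simple algebraic identity: with biases omitted, the output equals a sum over hidden units of an activation coefficient (how strongly the input matches a "key" column of W₁) times a "value" row of W₂. The sketch below checks that identity on toy weights. The ReLU activation, the weights, and the function names are assumptions for illustration, and the memory interpretation itself is a research framing, not a guarantee about what trained weights encode.

```python
def relu(z):
    return max(0.0, z)


def ffn_matrix_form(x, W1, W2):
    """Bias-free FFN computed as relu(x·W1)·W2."""
    d_ff, d_out = len(W2), len(W2[0])
    hidden = [relu(sum(x[i] * W1[i][j] for i in range(len(x))))
              for j in range(d_ff)]
    return [sum(hidden[j] * W2[j][k] for j in range(d_ff))
            for k in range(d_out)]


def ffn_memory_form(x, W1, W2):
    """Same FFN written as a sum of match(x, key_j) * value_j over hidden units."""
    out = [0.0] * len(W2[0])
    for j in range(len(W2)):
        key = [W1[i][j] for i in range(len(x))]  # column j of W1
        value = W2[j]                            # row j of W2
        match = relu(sum(xi * ki for xi, ki in zip(x, key)))
        for k in range(len(out)):
            out[k] += match * value[k]
    return out


# Toy weights: d_model = 2, d_ff = 3.
W1 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 1.0]]
W2 = [[1.0, 0.0],
      [0.0, 2.0],
      [3.0, 3.0]]

x = [2.0, 1.0]
print(ffn_matrix_form(x, W1, W2))  # [2.0, 2.0]
print(ffn_memory_form(x, W1, W2))  # [2.0, 2.0]
```

For this input the third key does not match (its pre-activation is negative), so its value row contributes nothing to the output.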

Key takeaways

The feed-forward network (MLP block) is the per-position sublayer that expands, nonlinearly transforms, and contracts each token vector inside every transformer layer.
The original design uses a ReLU between two linear projections with inner dimension about 4x the model dimension.
SwiGLU replaces the single activation with a Swish-gated linear unit using two parallel up-projections multiplied element-wise.
GLU Variants Improve Transformer (Shazeer, 2020) reported that GEGLU and SwiGLU achieved the best perplexities among the FFN variants it compared.
SwiGLU adds a third matrix, so inner dimension is scaled down (often to about two-thirds) to keep parameters constant; it is now standard in LLaMA and PaLM.

Frequently asked questions

What does the feed-forward network do in a transformer?

It processes each token position independently, projecting the vector up to a larger inner dimension, applying a nonlinearity, and projecting it back down. Unlike attention, it does not mix information across positions; it transforms each position on its own.

How does SwiGLU differ from a standard FFN?

A standard FFN uses one up-projection plus an activation like ReLU. SwiGLU uses two parallel up-projections: one passed through Swish acts as a gate, the other is the value, and they are multiplied element-wise before the down-projection.

Why do modern LLMs use SwiGLU?

The GLU Variants paper found SwiGLU and GEGLU achieve lower perplexity than ReLU or GELU FFNs. The gain comes at near-constant parameter cost since the inner dimension is rescaled, so models like LLaMA and PaLM adopted it.

What is the Swish activation?

Swish, also called SiLU when beta equals 1, is defined as z times the sigmoid of beta times z. It is smooth and non-monotonic, allowing small negative values to pass rather than hard-zeroing them like ReLU.

Why does the FFN hold so many of a transformer's parameters?

The FFN typically holds the majority of a transformer layer's parameters because its inner dimension is several times the model dimension. This makes the FFN a primary target for efficiency techniques like mixture-of-experts and quantization.
