Multi-Head Latent Attention (MLA)

Multi-Head Latent Attention (MLA) is an attention mechanism introduced in DeepSeek-V2 that compresses each token's keys and values into a single low-rank latent vector. Only that small latent and a decoupled rotary key are cached. DeepSeek reports that DeepSeek-V2 needs a 93.3 percent smaller KV cache than its earlier DeepSeek 67B model.

What is Multi-Head Latent Attention (MLA)?

Multi-Head Latent Attention (MLA) is an attention design introduced in the DeepSeek-V2 technical report that reduces inference memory by compressing the keys and values of each token into a single low-rank latent vector. Instead of caching full per-head keys and values, MLA caches only this compact latent, then reconstructs the per-head keys and values on the fly during attention. DeepSeek reports that DeepSeek-V2 cuts the KV cache by 93.3 percent compared with its earlier DeepSeek 67B model. That figure is a model-to-model comparison rather than a pure MLA-versus-MHA measurement, since DeepSeek 67B itself used grouped-query attention.

MLA targets the same bottleneck as Grouped-Query Attention, the size of the KV cache during autoregressive decoding, but takes a different route. Rather than reducing the number of key-value heads, it projects keys and values down to a shared latent space of much smaller dimension and stores that. This keeps a multi-head expressive form for attention while drastically shrinking what must be held in memory.

Caches a single low-rank latent per token instead of full keys and values.
Reconstructs per-head keys and values from the latent during attention.
DeepSeek reports a 93.3 percent KV-cache reduction for DeepSeek-V2 relative to DeepSeek 67B.

Low-rank key-value compression

MLA applies a down-projection to each token's hidden state to produce a compressed latent vector whose dimension is much smaller than the combined dimension of all key and value heads. At attention time, separate up-projection matrices map this latent back into the per-head keys and values. Because matrix multiplication is associative, the key up-projection can be absorbed into the query projection and the value up-projection into the output projection at inference. Per-head keys and values then never need to be materialized, and the latent is the only content state that has to be cached per token.
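The absorption argument can be checked numerically. The following is a minimal NumPy sketch, not DeepSeek's implementation: the toy dimensions, the random stand-in weights, and the helper names `attend_naive` and `attend_absorbed` are assumptions made for illustration, and RoPE is left out entirely. It compares attention computed from reconstructed per-head keys and values with attention computed directly against the cached latent.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_head, d_latent, seq_len = 64, 4, 16, 8, 5  # toy sizes (assumed)


def w(*shape):
    """Small random matrix standing in for a learned weight."""
    return 0.1 * rng.standard_normal(shape)


W_Q = w(n_heads, d_head, d_model)      # per-head query projection
W_DKV = w(d_latent, d_model)           # shared down-projection
W_UK = w(n_heads, d_head, d_latent)    # per-head key up-projection
W_UV = w(n_heads, d_head, d_latent)    # per-head value up-projection
W_O = w(d_model, n_heads * d_head)     # output projection

H = rng.standard_normal((seq_len, d_model))  # hidden states of past tokens
h_q = rng.standard_normal(d_model)           # hidden state of the current token
C = H @ W_DKV.T                              # (seq_len, d_latent): all that is cached


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def attend_naive(h_q, C):
    """Reconstruct per-head keys and values from the latent, then attend."""
    q = np.einsum("hdm,m->hd", W_Q, h_q)
    K = np.einsum("hdc,tc->htd", W_UK, C)
    V = np.einsum("hdc,tc->htd", W_UV, C)
    probs = softmax(np.einsum("hd,htd->ht", q, K) / np.sqrt(d_head))
    ctx = np.einsum("ht,htd->hd", probs, V)
    return W_O @ ctx.reshape(-1)


def attend_absorbed(h_q, C):
    """Same output, but attention runs directly against the cached latent."""
    # Fold W_UK into the query side: (heads, d_latent, d_model).
    W_QK = np.einsum("hdc,hdm->hcm", W_UK, W_Q)
    # Fold W_UV into the output side: (heads, d_model, d_latent).
    W_OV = np.einsum("mhd,hdc->hmc", W_O.reshape(d_model, n_heads, d_head), W_UV)
    q_lat = np.einsum("hcm,m->hc", W_QK, h_q)        # queries in latent space
    probs = softmax(q_lat @ C.T / np.sqrt(d_head))   # (heads, seq_len)
    ctx_lat = probs @ C                              # (heads, d_latent)
    return np.einsum("hmc,hc->m", W_OV, ctx_lat)


print(np.allclose(attend_naive(h_q, C), attend_absorbed(h_q, C)))
```

The two paths agree up to floating-point error, but the second never builds per-head keys or values for the cached tokens. In a real model the folded matrices would be computed once after training rather than on every call.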

A complication is rotary position embedding. RoPE applies a position-dependent rotation to queries and keys. Applied to reconstructed keys, that rotation would sit between the query projection and the key up-projection, so the two could no longer be merged into one position-independent matrix. MLA solves this with a decoupled design: it keeps a small separate key component, shared across heads, that carries the RoPE rotation, alongside the compressed content latent. The cached state per token is therefore the compressed latent plus this small rotary key.
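To make the obstacle concrete, here is a small sketch under stated assumptions: a single head, toy dimensions, a random `W_UK`, and the interleaved-pair RoPE convention with base 10000. The helper names `rope_matrix` and `effective_key_map` are hypothetical. It builds the matrix that would have to be absorbed into the query side if RoPE were applied to reconstructed keys, and shows that this matrix changes with the distance between query and key.

```python
import numpy as np


def rope_matrix(pos, dim, base=10000.0):
    """Block-diagonal RoPE rotation matrix for one position (dim must be even)."""
    R = np.zeros((dim, dim))
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    for i, f in enumerate(freqs):
        c, s = np.cos(pos * f), np.sin(pos * f)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R


rng = np.random.default_rng(0)
d_head, d_latent = 8, 4                          # toy sizes (assumed)
W_UK = rng.standard_normal((d_head, d_latent))   # stand-in key up-projection


def effective_key_map(query_pos, key_pos):
    """Matrix M with score = q^T M c when RoPE is applied after up-projection."""
    return rope_matrix(query_pos, d_head).T @ rope_matrix(key_pos, d_head) @ W_UK


M_near = effective_key_map(10, 9)      # key one step behind the query
M_far = effective_key_map(10, 2)       # key eight steps behind the query
M_shifted = effective_key_map(5, 4)    # same one-step offset, other positions

print(np.allclose(M_near, M_far))      # different offsets
print(np.allclose(M_near, M_shifted))  # equal offsets
```

Because the matrix depends on the relative offset, there is no single fixed matrix to fold into the query projection. Without RoPE the map would simply be `W_UK` for every pair of positions.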

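\[
\begin{aligned}
c_t^{KV} &= W_{DKV}\, h_t, \qquad d_c \ll n_h d_h && \text{(compressed latent of dimension } d_c\text{)} \\
k_t^{C} &= W_{UK}\, c_t^{KV}, \qquad v_t^{C} = W_{UV}\, c_t^{KV} && \text{(reconstructed per-head keys and values)} \\
k_t^{R} &= \mathrm{RoPE}(W_{KR}\, h_t) && \text{(decoupled rotary key)} \\
\text{cached per token} &= [\, c_t^{KV} ;\, k_t^{R} \,]
\end{aligned}
\]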

Each token's hidden state h_t is compressed to a small latent c_t^{KV}; per-head keys and values are reconstructed from it, while a separate small key k_t^R carries the rotary position signal. Only the latent and the rotary key are stored in the cache.
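The cache layout can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the reference implementation: the dimensions and random weights are toy stand-ins, queries are projected straight from the hidden state (the DeepSeek-V2 report also compresses queries, which is skipped here), the rotary query projection `W_QR` and the scaling by the square root of `d_head + d_rope` follow my reading of the report, and values are omitted to keep the focus on scores.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_head, d_latent, d_rope = 32, 4, 8, 6, 4  # toy sizes (assumed)

W_DKV = 0.1 * rng.standard_normal((d_latent, d_model))        # down-projection
W_UK = 0.1 * rng.standard_normal((n_heads, d_head, d_latent)) # key up-projection
W_KR = 0.1 * rng.standard_normal((d_rope, d_model))           # rotary key projection
W_Q = 0.1 * rng.standard_normal((n_heads, d_head, d_model))   # content queries
W_QR = 0.1 * rng.standard_normal((n_heads, d_rope, d_model))  # rotary queries


def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs along the last axis by position-dependent angles."""
    d = x.shape[-1]
    theta = pos * base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[..., 0::2] = x[..., 0::2] * cos - x[..., 1::2] * sin
    out[..., 1::2] = x[..., 0::2] * sin + x[..., 1::2] * cos
    return out


def cache_entry(h_t, pos):
    """What is stored for one token: [c_t^{KV} ; k_t^R]."""
    c_kv = W_DKV @ h_t
    k_rope = rope(W_KR @ h_t, pos)   # one rotary key shared by all heads
    return np.concatenate([c_kv, k_rope])


def scores_from_cache(h_q, q_pos, cache):
    """Per-head attention logits for one query token against the cache."""
    c_kv, k_rope = cache[:, :d_latent], cache[:, d_latent:]
    q_c = np.einsum("hdm,m->hd", W_Q, h_q)
    q_r = rope(np.einsum("hrm,m->hr", W_QR, h_q), q_pos)
    k_c = np.einsum("hdc,tc->htd", W_UK, c_kv)    # content keys from the latent
    content = np.einsum("hd,htd->ht", q_c, k_c)   # position-independent part
    positional = q_r @ k_rope.T                   # RoPE-carrying part
    return (content + positional) / np.sqrt(d_head + d_rope)


tokens = rng.standard_normal((5, d_model))
cache = np.stack([cache_entry(h, pos) for pos, h in enumerate(tokens)])
scores = scores_from_cache(tokens[-1], q_pos=4, cache=cache)
print(cache.shape, scores.shape)
```

Each cache row holds `d_latent + d_rope` numbers regardless of the number of heads. Since only the rotary part sees positions, shifting every position by the same amount leaves the scores unchanged.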

A down-projection compresses keys and values into a shared latent of low dimension.
Up-projections reconstruct per-head keys and values and can be absorbed into other weights.
A decoupled rotary key carries RoPE position information separately from the latent.

MLA versus MHA and GQA

Standard multi-head attention stores a full set of keys and values for every head, giving the largest KV cache. Grouped-query attention reduces the number of key-value heads, which lowers the cache at a small cost to quality. MLA instead compresses along the feature dimension, keeping the benefit of many heads while caching only a small latent.

DeepSeek reports that MLA not only shrinks the cache far below MHA but also matches or exceeds MHA quality in their experiments, whereas GQA and MQA typically trade some quality for their savings. The decoupled RoPE component is the main added complexity, and it is the reason MLA needs a careful implementation rather than a drop-in swap.

MHA: largest KV cache, full per-head keys and values.
GQA/MQA: fewer key-value heads, smaller cache, small quality trade-off.
MLA: low-rank latent plus decoupled rotary key, large cache reduction with strong quality.
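A quick per-token, per-layer count makes the comparison concrete. The numbers below are assumptions for illustration: 128 heads of dimension 128, a 512-dimensional latent, and a 64-dimensional rotary key are the values I recall from the DeepSeek-V2 report, and the 8-group GQA row is a hypothetical baseline rather than any specific model. The function names are made up for this sketch.

```python
def kv_elements_per_token(n_kv_heads, d_head):
    """Cached elements per token per layer when K and V are stored per KV head."""
    return 2 * n_kv_heads * d_head


def mla_elements_per_token(d_latent, d_rope):
    """Cached elements per token per layer for MLA: latent plus rotary key."""
    return d_latent + d_rope


n_heads, d_head = 128, 128   # assumed attention shape
d_latent, d_rope = 512, 64   # assumed MLA latent and rotary-key sizes

sizes = {
    "MHA": kv_elements_per_token(n_heads, d_head),
    "GQA (8 groups)": kv_elements_per_token(8, d_head),
    "MQA": kv_elements_per_token(1, d_head),
    "MLA": mla_elements_per_token(d_latent, d_rope),
}

for name, n in sizes.items():
    print(f"{name:15s} {n:6d} elements  ({n / sizes['MHA']:.2%} of MHA)")
```

Under these assumptions MHA caches 32,768 elements per token per layer and MLA caches 576, which is the footprint of GQA with 2.25 key-value groups. MQA is still smaller at 256 elements, so the case for MLA rests on combining a near-MQA cache with the quality DeepSeek reports, not on having the smallest cache.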

Where MLA is used

MLA was introduced with DeepSeek-V2 and carried forward into DeepSeek-V3 and the DeepSeek-R1 reasoning model, where it is one of the architectural choices that make long-context, large-scale inference economical. By holding only a small latent per token, MLA lets these models serve long contexts with substantially less GPU memory than an equivalent MHA model would require.

MLA combines with the rest of the DeepSeek stack, including a Mixture-of-Experts feed-forward design and RoPE for position encoding. Its main appeal is for inference at scale, where KV-cache memory governs how long a context can be and how many requests can run concurrently.

Introduced in DeepSeek-V2 and used in DeepSeek-V3 and DeepSeek-R1.
Pairs with Mixture-of-Experts layers and RoPE in the DeepSeek architecture.
Most valuable for long-context, high-concurrency inference where KV memory is the limit.
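The link between cache size and concurrency can be estimated with back-of-the-envelope arithmetic. Everything here is a hypothetical scenario: a 40 GiB KV-cache budget, 32,768-token contexts, 60 layers, 2-byte (16-bit) elements, and the same per-layer element counts assumed for an MHA baseline and for MLA. Real serving stacks add paging overhead and other details that this ignores.

```python
def kv_bytes_per_token(elements_per_layer, n_layers, bytes_per_element=2):
    """KV-cache bytes needed for one token across all layers."""
    return elements_per_layer * n_layers * bytes_per_element


def max_concurrent_requests(budget_bytes, context_len, bytes_per_token):
    """How many full-length contexts fit in the KV-cache budget."""
    return budget_bytes // (context_len * bytes_per_token)


n_layers = 60              # assumed layer count
budget = 40 * 1024**3      # hypothetical: 40 GiB reserved for the KV cache
context_len = 32_768       # hypothetical context length per request

mha_bytes = kv_bytes_per_token(2 * 128 * 128, n_layers)  # full per-head K and V
mla_bytes = kv_bytes_per_token(512 + 64, n_layers)       # latent plus rotary key

for name, per_token in [("MHA", mha_bytes), ("MLA", mla_bytes)]:
    per_request_gib = per_token * context_len / 1024**3
    fits = max_concurrent_requests(budget, context_len, per_token)
    print(f"{name}: {per_request_gib:.2f} GiB per request, {fits} concurrent requests")
```

With these numbers a single MHA request would need 120 GiB of cache and would not fit at all, while the MLA layout needs about 2.11 GiB per request and fits 18 of them in the same budget.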

Key takeaways

MLA compresses each token's keys and values into a single low-rank latent vector that is cached instead of full keys and values.
DeepSeek reports that DeepSeek-V2, which uses MLA, reduces the KV cache by 93.3 percent compared with its earlier DeepSeek 67B model.
A decoupled rotary key carries RoPE position information separately because rotation cannot be folded into the latent.
MLA was introduced in DeepSeek-V2 and is used in DeepSeek-V3 and DeepSeek-R1 to make large-scale long-context inference economical.

Frequently asked questions

What is Multi-Head Latent Attention?

MLA is an attention mechanism from DeepSeek-V2 that compresses each token's keys and values into a small low-rank latent vector. Only that latent, plus a small rotary key, is cached, and the full per-head keys and values are reconstructed during attention.

How does MLA reduce the KV cache?

Instead of storing full keys and values for every head, MLA stores a single compressed latent per token whose dimension is much smaller. DeepSeek reports that DeepSeek-V2 needs a 93.3 percent smaller KV cache than its earlier DeepSeek 67B model, freeing memory for longer contexts and larger batches.

How is MLA different from Grouped-Query Attention?

GQA reduces the number of key-value heads, compressing across heads. MLA compresses across the feature dimension into a low-rank latent while keeping many heads. DeepSeek reports MLA can match multi-head quality, whereas GQA usually trades a little quality for its savings.

Why does MLA need a decoupled rotary key?

Rotary position embedding rotates keys by an amount that depends on position, and that rotation does not commute with MLA's latent reconstruction. MLA therefore keeps a small separate key that carries the rotary signal, alongside the compressed content latent, so position is encoded correctly.

Which models use MLA?

MLA was introduced in DeepSeek-V2 and is used in later DeepSeek models including DeepSeek-V3 and the DeepSeek-R1 reasoning model. It is a core part of how those models keep long-context inference memory low at large scale.
