Multi-Head Latent Attention (MLA) is an attention mechanism introduced in DeepSeek-V2 that compresses keys and values into a single low-rank latent vector per token. Only that small latent and a decoupled rotary key are cached. DeepSeek reports that DeepSeek-V2 cuts the KV cache by 93.3 percent relative to its earlier DeepSeek 67B model.
What is Multi-Head Latent Attention (MLA)?
Multi-Head Latent Attention (MLA) is an attention design introduced in the DeepSeek-V2 technical report that reduces inference memory by compressing the keys and values of each token into a single low-rank latent vector. Instead of caching full per-head keys and values, MLA caches only this compact latent, then reconstructs the per-head keys and values on the fly during attention. DeepSeek reports that DeepSeek-V2 cuts the KV cache by 93.3 percent compared with its earlier dense DeepSeek 67B model.
MLA targets the same bottleneck as Grouped-Query Attention, the size of the KV cache during autoregressive decoding, but takes a different route. Rather than reducing the number of key-value heads, it projects keys and values down to a shared latent space of much smaller dimension and stores that. This keeps a multi-head expressive form for attention while drastically shrinking what must be held in memory.
- Caches a single low-rank latent per token instead of full keys and values.
- Reconstructs per-head keys and values from the latent during attention.
- DeepSeek reports a 93.3 percent KV-cache reduction for DeepSeek-V2 versus its earlier DeepSeek 67B model.
Low-rank key-value compression
MLA applies a down-projection to each token's hidden state to produce a compressed latent vector whose dimension is much smaller than the combined dimension of all key and value heads. At attention time, separate up-projection matrices map this latent back into the per-head keys and values. Because matrix multiplication is associative, these up-projections need not be applied explicitly: the key up-projection can be absorbed into the query projection and the value up-projection into the output projection. For the content part of attention, the latent is therefore the only thing that needs to be cached per token.
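The absorption argument can be written out for a single head. The symbols here are notation assumed for illustration: $h_t$ is a token's hidden state, $c_j$ the cached latent, $a_{t,j}$ the attention weight, and $W^{O}$ the slice of the output projection belonging to this head. RoPE, scaling, and the softmax are left out.

```latex
\begin{aligned}
c_j &= W^{DKV} h_j, \qquad k_j = W^{UK} c_j, \qquad v_j = W^{UV} c_j, \qquad q_t = W^{Q} h_t \\
q_t^{\top} k_j &= \bigl(W^{Q} h_t\bigr)^{\top} \bigl(W^{UK} c_j\bigr)
               = \bigl((W^{UK})^{\top} W^{Q} h_t\bigr)^{\top} c_j \\
W^{O} \sum_j a_{t,j}\, v_j &= \bigl(W^{O} W^{UV}\bigr) \sum_j a_{t,j}\, c_j
\end{aligned}
```

Both right-hand sides touch only the latents $c_j$, so the merged matrices $(W^{UK})^{\top} W^{Q}$ and $W^{O} W^{UV}$ can be formed once and the full keys and values never need to be materialized.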
A complication is rotary position embedding. RoPE applies a position-dependent rotation to keys, which does not commute with the latent up-projection, so it cannot simply be folded in. MLA solves this with a decoupled design: it keeps a small separate key component that carries the RoPE rotation, alongside the compressed content latent. The cached state per token is therefore the compressed latent plus this small rotary key.
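A short derivation shows why the rotation blocks absorption. It is a sketch under assumed notation: $R_t$ is the RoPE rotation matrix for position $t$, and the rotary projections $W^{QR}$ and $W^{KR}$ are illustrative names. For simplicity both act on the hidden state here; a full implementation may derive the rotary query from a compressed query instead.

```latex
\begin{aligned}
\bigl(R_t W^{Q} h_t\bigr)^{\top} \bigl(R_j W^{UK} c_j\bigr)
  &= h_t^{\top} (W^{Q})^{\top} R_t^{\top} R_j\, W^{UK} c_j
   = h_t^{\top} (W^{Q})^{\top} R_{j-t}\, W^{UK} c_j \\[4pt]
q_t = \begin{bmatrix} q_t^{C} \\ q_t^{R} \end{bmatrix}, \quad
k_j &= \begin{bmatrix} k_j^{C} \\ k_j^{R} \end{bmatrix}, \quad
q_t^{R} = R_t W^{QR} h_t, \quad
k_j^{R} = R_j W^{KR} h_j \\[4pt]
q_t^{\top} k_j &= (q_t^{C})^{\top} k_j^{C} + (q_t^{R})^{\top} k_j^{R}
\end{aligned}
```

In the first line the matrix $R_{j-t}$ sits between $(W^{Q})^{\top}$ and $W^{UK}$ and changes with the relative position, so no single merged matrix exists. Splitting the score into a content term without rotation and a small rotary term keeps the content term absorbable, at the cost of also caching $k_j^{R}$.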
- A down-projection compresses keys and values into a shared latent of low dimension.
- Up-projections reconstruct per-head keys and values and can be absorbed into the query and output projections.
- A decoupled rotary key carries RoPE position information separately from the latent.
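The content path summarized above can be checked numerically. The following is a minimal sketch in plain Python, not DeepSeek's implementation: the sizes, weight names, and helper functions are all hypothetical, and it covers one head's attention scores only (no RoPE component, scaling, softmax, or value path). It compares scores computed from reconstructed keys with scores computed directly against the cached latents.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def rand_matrix(rows, cols, rng):
    """Build a rows x cols matrix of uniform random values in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (hypothetical): hidden width 16, one head of width 8, latent width 4.
D_MODEL, D_HEAD, D_LATENT, SEQ_LEN = 16, 8, 4, 5

rng = random.Random(0)
w_q = rand_matrix(D_HEAD, D_MODEL, rng)      # query projection for one head
w_dkv = rand_matrix(D_LATENT, D_MODEL, rng)  # shared down-projection
w_uk = rand_matrix(D_HEAD, D_LATENT, rng)    # key up-projection for this head

hidden = rand_matrix(SEQ_LEN, D_MODEL, rng)  # one row per token

# The only per-token state kept for the content path: latent = W_dkv @ h.
latent_cache = matmul(hidden, transpose(w_dkv))           # SEQ_LEN x D_LATENT

query_hidden = [hidden[-1]]                               # decode step, 1 x D_MODEL

# Explicit path: rebuild full keys from the latents, then take dot products.
keys = matmul(latent_cache, transpose(w_uk))              # SEQ_LEN x D_HEAD
q = matmul(query_hidden, transpose(w_q))                  # 1 x D_HEAD
scores_explicit = matmul(q, transpose(keys))[0]

# Absorbed path: fold W_uk into the query side once, then score against latents.
w_q_absorbed = matmul(transpose(w_uk), w_q)               # D_LATENT x D_MODEL
q_latent = matmul(query_hidden, transpose(w_q_absorbed))  # 1 x D_LATENT
scores_absorbed = matmul(q_latent, transpose(latent_cache))[0]

max_diff = max(abs(a - b) for a, b in zip(scores_explicit, scores_absorbed))
print(f"largest difference between the two paths: {max_diff:.2e}")
```

The two score vectors agree up to floating-point rounding, while the absorbed path only ever reads the `D_LATENT`-wide cache entries. A full layer would add the rotary key to the cache and handle the value up-projection on the output side in the same way.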
MLA versus MHA and GQA
Standard multi-head attention stores a full set of keys and values for every head, giving the largest KV cache. Grouped-query attention reduces the number of key-value heads, which lowers the cache at a small cost to quality. MLA instead compresses along the feature dimension, keeping the benefit of many heads while caching only a small latent.
DeepSeek reports that MLA not only shrinks the cache far below MHA but also matches or exceeds MHA quality in their experiments, whereas GQA and MQA typically trade some quality for their savings. The decoupled RoPE component is the main added complexity, and it is the reason MLA needs a careful implementation rather than a drop-in swap.
- MHA: largest KV cache, full per-head keys and values.
- GQA/MQA: fewer key-value heads, smaller cache, small quality trade-off.
- MLA: low-rank latent plus decoupled rotary key, large cache reduction with strong quality.
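The comparison above can be made concrete by counting cached elements per token per layer. This is a rough sketch: the dimensions are assumptions in the range described for DeepSeek-V2 (128 heads of width 128, a 512-wide latent, a 64-wide rotary key), the GQA group count is hypothetical, and the function names are illustrative.

```python
def kv_elements_per_token(n_kv_heads: int, head_dim: int) -> int:
    """Elements cached per token per layer when full keys and values are stored."""
    return 2 * n_kv_heads * head_dim

def mla_elements_per_token(latent_dim: int, rope_dim: int) -> int:
    """Elements cached per token per layer by MLA: content latent plus rotary key."""
    return latent_dim + rope_dim

# Assumed dimensions; treat them as illustrative rather than authoritative.
N_HEADS, HEAD_DIM = 128, 128
LATENT_DIM, ROPE_DIM = 512, 64
GQA_GROUPS = 8  # hypothetical number of key-value groups

sizes = {
    "MHA": kv_elements_per_token(N_HEADS, HEAD_DIM),
    "GQA": kv_elements_per_token(GQA_GROUPS, HEAD_DIM),
    "MQA": kv_elements_per_token(1, HEAD_DIM),
    "MLA": mla_elements_per_token(LATENT_DIM, ROPE_DIM),
}

for name, size in sizes.items():
    share = size / sizes["MHA"]
    print(f"{name}: {size} elements per token per layer ({share:.2%} of MHA)")
```

Under these assumptions MHA stores 32768 elements, 8-group GQA 2048, MQA 256, and MLA 576, which puts MLA's cache between MQA and a small-group GQA while attention still uses all 128 heads. These per-layer ratios are against a same-shaped MHA layer, so they are not directly comparable to whole-model figures such as the 93.3 percent DeepSeek reports.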
Where MLA is used
MLA was introduced with DeepSeek-V2 and carried forward into DeepSeek-V3 and the DeepSeek-R1 reasoning model, where it is one of the architectural choices that make long-context, large-scale inference economical. By holding only a small latent per token, MLA lets these models serve long contexts with substantially less GPU memory than an equivalent MHA model would require.
MLA combines with the rest of the DeepSeek stack, including a Mixture-of-Experts feed-forward design and RoPE for position encoding. Its main appeal is for inference at scale, where KV-cache memory governs how long a context can be and how many requests can run concurrently.
- Introduced in DeepSeek-V2 and used in DeepSeek-V3 and DeepSeek-R1.
- Pairs with Mixture-of-Experts layers and RoPE in the DeepSeek architecture.
- Most valuable for long-context, high-concurrency inference where KV memory is the limit.
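To see how KV memory bounds context length and concurrency, a back-of-the-envelope estimate helps. The sketch below assumes a 60-layer model, 2-byte cache elements, a 128K-token context, and a 40 GiB cache budget. All of these are illustrative assumptions, not measured figures, and the function names are hypothetical.

```python
GIB = 1024 ** 3

def kv_cache_bytes(tokens: int, layers: int, elements_per_token_per_layer: int,
                   bytes_per_element: int = 2) -> int:
    """Total KV-cache size in bytes for one sequence."""
    return tokens * layers * elements_per_token_per_layer * bytes_per_element

def max_concurrent_sequences(budget_bytes: int, bytes_per_sequence: int) -> int:
    """How many full-length sequences fit in the cache budget."""
    return budget_bytes // bytes_per_sequence

# Assumed configuration (illustrative only).
LAYERS = 60
CONTEXT_TOKENS = 128 * 1024
MHA_ELEMENTS = 2 * 128 * 128   # full keys and values for 128 heads of width 128
MLA_ELEMENTS = 512 + 64        # compressed latent plus decoupled rotary key
CACHE_BUDGET = 40 * GIB        # hypothetical memory set aside for the KV cache

mha_bytes = kv_cache_bytes(CONTEXT_TOKENS, LAYERS, MHA_ELEMENTS)
mla_bytes = kv_cache_bytes(CONTEXT_TOKENS, LAYERS, MLA_ELEMENTS)

print(f"MHA: {mha_bytes / GIB:.2f} GiB per sequence, "
      f"{max_concurrent_sequences(CACHE_BUDGET, mha_bytes)} sequences fit")
print(f"MLA: {mla_bytes / GIB:.2f} GiB per sequence, "
      f"{max_concurrent_sequences(CACHE_BUDGET, mla_bytes)} sequences fit")
```

With these numbers a single full-length MHA sequence would need 480 GiB of cache and does not fit at all, while the MLA cache takes about 8.44 GiB per sequence, so four fit in the same budget.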
Key takeaways
- MLA compresses each token's keys and values into a single low-rank latent vector that is cached instead of full keys and values.
- DeepSeek reports that DeepSeek-V2, which uses MLA, reduces the KV cache by 93.3 percent compared with its earlier DeepSeek 67B model.
- A decoupled rotary key carries RoPE position information separately because rotation cannot be folded into the latent.
- MLA was introduced in DeepSeek-V2 and is used in DeepSeek-V3 and DeepSeek-R1 to make large-scale long-context inference economical.