Grouped-Query Attention (GQA) and Multi-Query Attention (MQA)
By Arpit Tripathi, Founder
Multi-Query Attention (MQA) shares a single key-value head across all query heads, and Grouped-Query Attention (GQA) shares one key-value head across each small group of query heads. Both shrink the key-value cache that dominates LLM inference memory, with GQA preserving most of the quality of full multi-head attention.
What is Grouped-Query Attention (GQA)?
Grouped-Query Attention (GQA) is a variant of transformer attention in which several query heads share a single key and value head. It sits between standard Multi-Head Attention (MHA), where every query head has its own key and value heads, and Multi-Query Attention (MQA), where all query heads share one key-value head. By choosing the number of key-value groups, GQA trades a small amount of quality for a large reduction in the memory and bandwidth needed during inference.
GQA was introduced by Joshua Ainslie and colleagues in 2023. The motivation is that during autoregressive decoding, the model caches the keys and values of all previous tokens. The size of this key-value (KV) cache is proportional to the number of key-value heads, so reducing those heads directly reduces memory use and the memory bandwidth that bottlenecks generation.
- MHA: one key-value head per query head (maximum quality, largest KV cache).
- MQA: one key-value head shared by all query heads (smallest KV cache, some quality loss).
- GQA: one key-value head per group of query heads (a tunable middle ground).
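The three variants above differ only in how many key-value heads exist and which query heads read from each one. The sketch below illustrates that mapping in plain Python. The function names are hypothetical, and it assumes query heads are grouped contiguously (heads 0 to 3 share one key-value head, heads 4 to 7 the next), which is a common convention but not the only possible one.

```python
def kv_head_for_query_head(query_head: int, num_query_heads: int, num_kv_heads: int) -> int:
    """Return the index of the key-value head that a given query head reads from."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads  # query heads per key-value head
    return query_head // group_size               # assumption: contiguous grouping


def head_mapping(num_query_heads: int, num_kv_heads: int) -> list[int]:
    """List the key-value head index used by each query head."""
    return [
        kv_head_for_query_head(h, num_query_heads, num_kv_heads)
        for h in range(num_query_heads)
    ]


mha_map = head_mapping(8, 8)  # MHA: every query head has its own key-value head
gqa_map = head_mapping(8, 2)  # GQA: two groups of four query heads
mqa_map = head_mapping(8, 1)  # MQA: all query heads share key-value head 0

print(mha_map)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(gqa_map)  # [0, 0, 0, 0, 1, 1, 1, 1]
print(mqa_map)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

Seen this way, MHA and MQA are the two extremes of GQA: the number of key-value heads equals the number of query heads in one case and equals one in the other.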
Why the KV cache matters
During text generation, an LLM produces one token at a time and reuses the keys and values computed for all earlier tokens. These stored keys and values form the KV cache. Its size grows linearly with sequence length, batch size, number of layers, head dimension, and the number of key-value heads. For long contexts and large batches, the KV cache can dominate GPU memory and limit how many requests a server can handle at once.
MQA and GQA attack the key-value head term directly. MQA collapses all key-value heads to one, shrinking the cache by a factor equal to the original number of heads. GQA collapses them to a small number of groups, for example 8 key-value heads for 64 query heads (an 8x reduction), which keeps most of the savings while retaining more representational capacity than a single shared head.
- KV cache size scales with the number of key-value heads.
- Fewer key-value heads means less memory and less memory-bandwidth pressure during decoding.
- Decoding is typically memory-bandwidth bound, so smaller KV reads speed up generation.
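To make the scaling concrete, here is a back-of-the-envelope estimate of KV cache size. The model shape is hypothetical (80 layers, head dimension 128, 64 query heads, 16-bit values) and was chosen only to match the 64-to-8 example above. The calculation ignores any framework overhead or paging.

```python
def kv_cache_bytes(
    batch_size: int,
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumption: fp16 or bf16 cache entries
) -> int:
    """Estimate KV cache size; the leading 2 counts keys and values separately."""
    return 2 * batch_size * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical serving scenario, not taken from any specific model card.
shape = dict(batch_size=8, seq_len=4096, num_layers=80, head_dim=128)

mha_bytes = kv_cache_bytes(num_kv_heads=64, **shape)  # one KV head per query head
gqa_bytes = kv_cache_bytes(num_kv_heads=8, **shape)   # 8 KV heads for 64 query heads
mqa_bytes = kv_cache_bytes(num_kv_heads=1, **shape)   # a single shared KV head

for name, size in [("MHA", mha_bytes), ("GQA", gqa_bytes), ("MQA", mqa_bytes)]:
    print(f"{name}: {size / 2**30:.2f} GiB")
```

Under these assumptions the cache drops from 80 GiB with MHA to 10 GiB with GQA and 1.25 GiB with MQA, since the size is directly proportional to the number of key-value heads.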
Converting MHA to GQA: uptraining
A practical contribution of the GQA paper is that an existing multi-head checkpoint can be converted to GQA cheaply. The key and value heads within each target group are mean-pooled into a single head, and the model is then trained briefly on further data, a step the paper calls uptraining, using only about 5 percent of the original pretraining compute. The result reaches quality close to the original MHA model while running close to MQA speed.
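The mean-pooling step can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the projection weights are assumed to be stored per head with shape (num_heads, d_model, head_dim), and heads are grouped contiguously. Real checkpoints often store a fused projection matrix that would need reshaping first, and the uptraining step itself is not shown.

```python
import numpy as np


def mean_pool_kv_heads(weight: np.ndarray, num_kv_heads: int) -> np.ndarray:
    """Average per-head projection weights within each group.

    weight: (num_heads, d_model, head_dim) key or value projection of an MHA model.
    Returns an array of shape (num_kv_heads, d_model, head_dim).
    """
    num_heads, d_model, head_dim = weight.shape
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    group_size = num_heads // num_kv_heads
    # Split the head axis into (groups, heads per group), then average each group.
    grouped = weight.reshape(num_kv_heads, group_size, d_model, head_dim)
    return grouped.mean(axis=1)


rng = np.random.default_rng(0)
k_proj_mha = rng.standard_normal((8, 16, 4))  # 8 heads, d_model=16, head_dim=4
k_proj_gqa = mean_pool_kv_heads(k_proj_mha, num_kv_heads=2)
print(k_proj_gqa.shape)  # (2, 16, 4)
```

The same function would be applied to the value projection. Query projections are left unchanged, since GQA keeps the full number of query heads.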
At run time, a common way to implement GQA is to expand the small set of key-value heads back to the number of query heads during the attention computation, so that every query head in a group reads the same keys and values.
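A minimal NumPy sketch of that expansion is shown here for a single sequence with causal masking. The function name and tensor layout are assumptions made for illustration. Production kernels usually avoid materializing the repeated keys and values and index or broadcast them instead, but the arithmetic is the same.

```python
import numpy as np


def grouped_query_attention(q: np.ndarray, keys: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Causal attention where several query heads share each key-value head.

    q:            (num_query_heads, seq_len, head_dim)
    keys, values: (num_kv_heads, seq_len, head_dim)
    """
    num_query_heads, seq_len, head_dim = q.shape
    num_kv_heads = keys.shape[0]
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads

    # Expand KV heads: each one is repeated for every query head in its group.
    keys = np.repeat(keys, group_size, axis=0)
    values = np.repeat(values, group_size, axis=0)

    scores = q @ keys.transpose(0, 2, 1) / np.sqrt(head_dim)

    # Causal mask: position i may not attend to positions after i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values


rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 4))       # 8 query heads
keys = rng.standard_normal((2, 5, 4))    # 2 key-value heads (group size 4)
values = rng.standard_normal((2, 5, 4))
out = grouped_query_attention(q, keys, values)
print(out.shape)  # (8, 5, 4)
```

Only the two unexpanded key-value heads would need to be stored in the KV cache; the repetition happens on the fly at each decoding step.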
Where GQA and MQA are used
GQA is now common in large open-weight models because it offers most of the inference savings of MQA with little measurable quality loss. LLaMA 2's larger models, LLaMA 3, Mistral, and many other recent architectures use GQA. MQA is used where the smallest possible KV cache matters most and the quality trade-off is acceptable.
GQA also composes with other efficiency techniques. It pairs naturally with FlashAttention for faster kernels and with quantized KV caches for further memory savings. DeepSeek's Multi-Head Latent Attention can be seen as a different answer to the same KV-cache problem, compressing keys and values into a shared latent vector instead of reducing the head count.
- GQA: used in LLaMA 2 (larger sizes), LLaMA 3, Mistral, and many recent LLMs.
- MQA: used when minimizing KV cache outweighs a small quality cost.
- Both combine with FlashAttention and KV-cache quantization.
Key takeaways
- MQA shares one key-value head across all query heads; GQA shares one per group of query heads.
- Both reduce the KV cache, which is a major memory and bandwidth bottleneck during LLM decoding.
- GQA keeps quality close to full multi-head attention while approaching MQA inference speed.
- An MHA checkpoint can be converted to GQA by mean-pooling heads and uptraining on about 5 percent of pretraining compute.