Grouped-Query Attention (GQA) and Multi-Query Attention (MQA)

Multi-Query Attention (MQA) shares a single key-value head across all query heads, and Grouped-Query Attention (GQA) shares one key-value head across each small group of query heads. Both shrink the key-value cache that dominates LLM inference memory, with GQA preserving most of the quality of full multi-head attention.

What is Grouped-Query Attention (GQA)?

Grouped-Query Attention (GQA) is a variant of transformer attention in which several query heads share a single key and value head. It sits between standard Multi-Head Attention (MHA), where every query head has its own key and value heads, and Multi-Query Attention (MQA), where all query heads share one key-value head. By choosing the number of key-value groups, GQA trades a small amount of quality for a large reduction in the memory and bandwidth needed during inference.

GQA was introduced by Joshua Ainslie and colleagues in 2023. The motivation is that during autoregressive decoding, the model caches the keys and values of all previous tokens. The size of this key-value (KV) cache is proportional to the number of key-value heads, so reducing those heads directly reduces memory use and the memory bandwidth that bottlenecks generation.

MHA: one key-value head per query head (maximum quality, largest KV cache).
MQA: one key-value head shared by all query heads (smallest KV cache, some quality loss).
GQA: one key-value head per group of query heads (a tunable middle ground).

Why the KV cache matters

During text generation, an LLM produces one token at a time and reuses the keys and values computed for all earlier tokens. Storing those keys and values is the KV cache. Its size grows with sequence length, batch size, number of layers, and the number of key-value heads. For long contexts and large batches, the KV cache can dominate GPU memory and limit how many requests a server can handle at once.

MQA and GQA attack the key-value head term directly. MQA collapses all key-value heads into one, shrinking the cache by a factor equal to the number of heads. GQA collapses them into a small number of groups, for example 8 key-value heads for 64 query heads (an 8x reduction), which keeps most of the savings while retaining more representational capacity than a single shared head.

KV cache size scales with the number of key-value heads.
Fewer key-value heads means less memory and less memory-bandwidth pressure during decoding.
Decoding is typically memory-bandwidth bound, so smaller KV reads speed up generation.
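
To make the scaling concrete, here is a rough back-of-the-envelope sketch of KV cache size for the three attention variants. The function name and the model shape (80 layers, 64 query heads, head dimension 128, 4096 tokens, 16-bit values) are assumptions chosen for illustration, not figures from any particular model card.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # The factor 2 counts one tensor for keys and one for values.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape, used only for illustration.
config = dict(n_layers=80, head_dim=128, seq_len=4096, batch_size=1)
n_q_heads = 64

sizes = {
    "MHA": kv_cache_bytes(n_kv_heads=n_q_heads, **config),  # one KV head per query head
    "GQA": kv_cache_bytes(n_kv_heads=8, **config),          # 8 groups of 8 query heads
    "MQA": kv_cache_bytes(n_kv_heads=1, **config),          # a single shared KV head
}

for name, size in sizes.items():
    print(f"{name}: {size / 2**30:.2f} GiB")
# MHA: 10.00 GiB
# GQA: 1.25 GiB
# MQA: 0.16 GiB
```

Under these assumptions the cache shrinks in direct proportion to the number of key-value heads, and the same ratio applies to the bytes read from memory at each decoding step.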

Converting MHA to GQA: uptraining

A practical contribution of the GQA paper is that an existing multi-head checkpoint can be converted to GQA cheaply. The key and value heads within each target group are mean-pooled into a single head, and the model is then trained briefly to adapt to the new structure, a step the paper calls uptraining, using only about 5 percent of the original pretraining compute. The result reaches quality close to the original MHA model while running close to MQA speed.
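
The mean-pooling step can be sketched as follows. This is a minimal illustration, not the paper's code: the function name, the assumption that adjacent heads form a group, and the weight layout (rows ordered head by head, as in a `torch.nn.Linear` weight of shape `(out_features, in_features)`) are all assumptions. The uptraining that follows pooling is not shown.

```python
import torch


def mean_pool_kv_heads(weight, n_heads, n_kv_heads):
    """Pool a per-head K or V projection down to n_kv_heads heads.

    weight: (n_heads * head_dim, d_model), rows ordered head by head
            (assumed layout).
    """
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a multiple of n_kv_heads")
    group = n_heads // n_kv_heads
    out_dim, d_model = weight.shape
    head_dim = out_dim // n_heads

    # View as (n_kv_heads, group, head_dim, d_model): adjacent heads share a group.
    grouped = weight.reshape(n_kv_heads, group, head_dim, d_model)
    # Average the heads inside each group into a single head.
    pooled = grouped.mean(dim=1)
    return pooled.reshape(n_kv_heads * head_dim, d_model)


# Hypothetical sizes for illustration.
n_heads, n_kv_heads, head_dim, d_model = 8, 2, 4, 32
w_k = torch.randn(n_heads * head_dim, d_model)
w_k_gqa = mean_pool_kv_heads(w_k, n_heads, n_kv_heads)
print(w_k_gqa.shape)  # torch.Size([8, 32])
```

The same function would be applied to both the key and the value projection of every layer before uptraining begins.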

The snippet below shows the core idea of expanding a small set of key-value heads back to the number of query heads during the attention computation, which is a simple way to implement GQA at run time. Optimized kernels can get the same result without materializing the repeated heads.

python

import torch
import torch.nn.functional as F

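def gqa_attention(q, k, v, n_kv_heads):
    # q: (batch, n_q_heads, seq, head_dim)
    # k, v: (batch, n_kv_heads, seq, head_dim)
    b, n_q_heads, seq, head_dim = q.shape
    # every KV head must serve the same number of query heads
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group = n_q_heads // n_kv_heads   # query heads per KV head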

    # expand each KV head to its group of query heads
    k = k.repeat_interleave(group, dim=1)  # -> (b, n_q_heads, seq, head_dim)
    v = v.repeat_interleave(group, dim=1)

    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# MQA is the special case n_kv_heads = 1
# MHA is the special case n_kv_heads = n_q_heads

GQA repeats each key-value head to cover its group of query heads before computing attention.
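
The expansion above implies a fixed mapping from query heads to key-value heads: with `repeat_interleave`, query head `i` uses key-value head `i // group`. The small sketch below prints that mapping for the three variants. The helper name and the head counts are hypothetical and chosen only for illustration.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the KV head that a given query head attends with."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group = n_q_heads // n_kv_heads  # query heads per KV head
    return q_head // group


n_q_heads = 8
for label, n_kv_heads in [("MHA", 8), ("GQA", 2), ("MQA", 1)]:
    mapping = [kv_head_for_query_head(h, n_q_heads, n_kv_heads)
               for h in range(n_q_heads)]
    print(label, mapping)
# MHA [0, 1, 2, 3, 4, 5, 6, 7]
# GQA [0, 0, 0, 0, 1, 1, 1, 1]
# MQA [0, 0, 0, 0, 0, 0, 0, 0]
```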

Where GQA and MQA are used

GQA is now common in large open-weight models because it offers most of the inference savings of MQA with little measurable quality loss. LLaMA 2's larger models, LLaMA 3, Mistral, and many other recent architectures use GQA. MQA is used where the smallest possible KV cache matters most and the quality trade-off is acceptable.

GQA also composes with other efficiency techniques. It pairs naturally with FlashAttention for faster kernels and with quantized KV caches for further memory savings. DeepSeek's Multi-Head Latent Attention can be seen as a different answer to the same KV-cache problem, compressing keys and values into a shared latent vector instead of reducing the head count.

GQA: used in LLaMA 2 (larger sizes), LLaMA 3, Mistral, and many recent LLMs.
MQA: used when minimizing KV cache outweighs a small quality cost.
Both combine with FlashAttention and KV-cache quantization.

Key takeaways

MQA shares one key-value head across all query heads; GQA shares one per group of query heads.
Both reduce the KV cache, which is a major memory and bandwidth bottleneck during LLM decoding.
GQA keeps quality close to full multi-head attention while approaching MQA inference speed.
An MHA checkpoint can be converted to GQA by mean-pooling heads and uptraining on about 5 percent of pretraining compute.

Frequently asked questions

What is the difference between MHA, MQA, and GQA?

Multi-head attention gives every query head its own key-value head. Multi-query attention shares a single key-value head across all query heads. Grouped-query attention is in between, with one key-value head per group of query heads, balancing quality and the size of the KV cache.

Why does GQA speed up inference?

Generation is usually limited by memory bandwidth, and the KV cache scales with the number of key-value heads. By using fewer key-value heads, GQA shrinks the KV cache so the GPU reads less data per token, which speeds up decoding and frees memory for larger batches.

Does GQA reduce model quality?

Only slightly. The GQA paper showed that grouped-query models reach quality close to full multi-head attention while running near multi-query speed. The small quality cost is usually outweighed by the large inference and memory savings.

How do you convert an MHA model to GQA?

Take a trained multi-head checkpoint, mean-pool the key and value heads within each target group into a single head, then fine-tune briefly. This uptraining uses roughly 5 percent of the original pretraining compute and recovers most of the original quality.

Which models use GQA?

GQA is widely adopted in modern open-weight models, including the larger LLaMA 2 sizes, LLaMA 3, and Mistral. It has become a default choice because it cuts KV-cache memory with little measurable loss in output quality.
