If the first token of a long prompt feels slow and everything after streams fast, the KV cache is the reason. It stores the Key and Value tensors a transformer has already computed for every past token, so the model never has to recompute them. That same cache is what makes long context expensive: it grows linearly with length until it becomes the dominant memory cost at scale.
What the KV cache actually is
A transformer writes one token at a time. Each step runs self-attention, where the current token compares itself against every earlier token. That comparison needs three projected tensors per token: a Query (Q), a Key (K), and a Value (V). The Query belongs to the token being produced right now. The Keys and Values come from everything before it.
Here is the catch that makes caching possible. A past token's Key and Value never change. Token 12's Key is identical whether the model is generating token 13 or token 1,300. Recompute them every step and you redo nearly the entire forward pass for the full sequence on each new token, so the total work over a generation scales quadratically with its length. It is brutally slow.
The KV cache kills that waste. It stores K and V the moment a token is processed, then reads them back. At each decode step the model computes K and V for one new token, appends them, and attends against the full stored set. Pay once. Read forever.
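The equivalence is easy to check on a toy example. The sketch below is a single attention head in NumPy with random weights, not a real model; the names `KVCache`, `decode_with_cache`, and `decode_without_cache` are made up for illustration. It confirms that attending against stored K and V gives the same output as recomputing them for the whole sequence at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # toy width, chosen only for illustration
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention for one query over all stored keys and values."""
    scores = K @ q / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def decode_without_cache(tokens):
    """Recompute K and V for every token seen so far, then attend with the last Q."""
    K = tokens @ W_k
    V = tokens @ W_v
    return attend(tokens[-1] @ W_q, K, V)

class KVCache:
    """Hypothetical cache: one row of K and one row of V per processed token."""
    def __init__(self):
        self.K = np.empty((0, d_model))
        self.V = np.empty((0, d_model))

    def append(self, x):
        self.K = np.vstack([self.K, x @ W_k])
        self.V = np.vstack([self.V, x @ W_v])

def decode_with_cache(cache, x):
    """Compute K and V for the new token only, append, attend over the stored set."""
    cache.append(x)
    return attend(x @ W_q, cache.K, cache.V)

tokens = rng.standard_normal((6, d_model))
cache = KVCache()
outputs_match = True
for t in range(len(tokens)):
    cached = decode_with_cache(cache, tokens[t])
    full = decode_without_cache(tokens[: t + 1])
    outputs_match = outputs_match and bool(np.allclose(cached, full))
```

Note that the Query is computed, used once, and discarded in both versions; only K and V persist between steps.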
Why only K and V get cached, never Q
Queries are not stored because each step uses only the Query for the token it is currently producing. Once token 12 has attended to its predecessors and been emitted, its Query is done and never consulted again. Keys and Values are different: every future token reads them. That asymmetry is the whole design. Reused tensors get cached, single-use tensors do not.
Prefill and decode: where the cache is built and where it is spent
Inference runs in two phases, and the cache behaves oppositely in each. Prefill reads your entire prompt in one parallel pass and fills the cache with K and V for every prompt token at once. Decode then emits the answer token by token, appending one token's K and V per step.
Their performance profiles invert. Prefill is compute-bound: it crunches the whole prompt through every layer in parallel, which is heavy arithmetic. Decode is memory-bandwidth-bound: each new token does little math but must stream the entire growing KV cache plus the model weights out of memory. Past a certain length, memory traffic, not arithmetic, sets generation speed.
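A back-of-the-envelope bound makes the decode side concrete. The sketch below assumes a batch of one, 14 GB of weights (7B parameters at 2 bytes each), and a hypothetical 2 TB/s of GPU memory bandwidth; none of those figures come from a specific product, and the function name is invented for this example. It treats a decode step as taking at least as long as reading every weight and every cached byte once.

```python
def decode_step_lower_bound_ms(weight_bytes, kv_cache_bytes, bandwidth_bytes_per_s):
    """Lower bound on one decode step: weights and KV cache are each read once."""
    return (weight_bytes + kv_cache_bytes) / bandwidth_bytes_per_s * 1e3

GB = 1e9
weights = 14 * GB       # assumed: 7B parameters at 2 bytes each
bandwidth = 2000 * GB   # assumed: 2 TB/s of GPU memory bandwidth

short_ctx_ms = decode_step_lower_bound_ms(weights, 0.5 * GB, bandwidth)
long_ctx_ms = decode_step_lower_bound_ms(weights, 64 * GB, bandwidth)
```

Under these assumptions the floor moves from about 7 ms per token with a small cache to about 39 ms with a 64 GB cache, with no change in the arithmetic per token.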
This is the answer to why the first token of a long prompt lags while the rest stream quickly. Prefill has to read and encode everything before a single output token can appear. After that the cache is warm, and each subsequent token is cheap. The fast streaming people read as the model 'remembering' what it said is just the cache being read back.
Why the KV cache becomes the dominant memory cost of long context
At long context, the cache can outweigh the model itself. NVIDIA's inference optimization guide gives the per-token size directly: roughly 2 * num_layers * (num_heads * head_dim) * precision_in_bytes. The leading 2 accounts for storing both K and V. The (num_heads * head_dim) term equals the hidden size under standard multi-head attention; models that share KV heads across query heads substitute the smaller KV head count for num_heads. Precision is 2 bytes for FP16 or BF16, 1 byte for FP8.
Multiply that per-token figure by sequence length and batch size for the full footprint: batch_size * sequence_length * 2 * num_layers * hidden_size * bytes_per_value. Every term is a multiplier, and none of them shrink as a conversation runs. Double the context, double the cache. Add concurrent users to a batch, multiply again.
The numbers get heavy fast. A 7B model in half precision with full multi-head attention carries roughly 0.5 MB of cache per token, so an 8,000-token session already burns about 4 GB on cache alone, and batching a few such sessions pushes the cache past the model's own FP16 weights (about 14 GB). Long context is not expensive because the model is bigger. It is expensive because the cache feeding the model gets enormous.
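The arithmetic behind those figures can be written out directly. This is a minimal sketch of the formula above; the 32 layers and hidden size of 4096 are an assumed 7B-class shape with full multi-head attention, not a measurement of any particular model.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, hidden_size, bytes_per_value=2):
    """Total KV cache size; the factor of 2 covers both K and V."""
    return batch_size * seq_len * 2 * num_layers * hidden_size * bytes_per_value

# Assumed 7B-class shape: 32 layers, hidden size 4096, FP16 values.
per_token = kv_cache_bytes(1, 1, 32, 4096)         # 524,288 bytes, i.e. 0.5 MiB
one_session = kv_cache_bytes(1, 8000, 32, 4096)    # 4,194,304,000 bytes, about 4.2 GB
four_sessions = kv_cache_bytes(4, 8000, 32, 4096)  # about 16.8 GB

fp16_weights = 7_000_000_000 * 2                   # about 14 GB for 7B parameters
cache_exceeds_weights = four_sessions > fp16_weights
```

With these assumed dimensions, a batch of four 8,000-token sessions is already enough for the cache to pass the weights.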
What most explainers skip: GQA decides whether long context is affordable
The headline context number gets the attention. The architecture detail that actually controls cost gets ignored. Grouped Query Attention (GQA) lets many query heads share a smaller set of KV heads, and the cache size tracks the KV head count, not the query head count. Llama-3-70B uses 8 KV heads where a full multi-head design would use 64, one per query head, which cuts its per-token cache by roughly 8x. Two models with the same advertised context window can have wildly different serving costs purely because of this. When a long-context model is cheap to run, GQA is usually why.
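The 8x figure falls straight out of the per-token formula once the KV head count replaces the query head count. The sketch below assumes a Llama-3-70B-like shape (80 layers, 64 query heads, 8 KV heads, head dimension 128) and FP16 values; treat the dimensions as illustrative inputs rather than a spec sheet.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Per-token KV cache size; only the KV heads count, not the query heads."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Assumed Llama-3-70B-like shape.
num_layers, num_query_heads, num_kv_heads, head_dim = 80, 64, 8, 128

mha_bytes = kv_bytes_per_token(num_layers, num_query_heads, head_dim)  # 2,621,440
gqa_bytes = kv_bytes_per_token(num_layers, num_kv_heads, head_dim)     # 327,680
reduction = mha_bytes / gqa_bytes                                      # 8.0
```

That is 2.5 MiB per token with one KV head per query head against 320 KiB with grouped heads, for the same model width and the same context window.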
| Memory consumer | What it stores | Size driver |
|---|---|---|
| Model weights | Fixed parameters of the network | Model size only (constant per request) |
| KV cache, short context | K and V for a few hundred tokens | Length x batch x layers x KV heads (a small fraction of weight memory) |
| KV cache, long context | K and V for tens of thousands of tokens | Length x batch x layers x KV heads (can exceed weights) |
How engineers fight the KV cache memory wall
Because the cache dominates, most serving optimizations aim straight at it. Three families do the heavy lifting: smarter memory layout, quantization, and offload. Each attacks a different term in the cost equation.
Paged attention and vLLM
Naive serving reserves one contiguous slab per request, sized for the worst-case length. That loses large amounts of memory to fragmentation and over-reservation and caps batch size; measured KV memory utilization on older systems sat around 20 to 40 percent. PagedAttention, the method behind vLLM, borrows virtual memory and paging from operating systems: it splits the cache into fixed-size blocks placed anywhere in memory and mapped through a translation table. Utilization climbs to roughly 96 percent, requests can share cached blocks, and vLLM reaches 2x to 4x higher throughput than earlier serving systems on the same hardware.
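The block-table idea can be sketched in a few lines. The class below is a toy allocator written for this article, not vLLM's API or data structures; `PagedKVAllocator` and its methods are invented names. It hands out a fixed-size block only when a request's existing blocks are full, so the only waste is the unused tail of each request's last block.

```python
class PagedKVAllocator:
    """Toy block allocator: maps each request's logical blocks to physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> number of tokens stored

    def append_token(self, request_id):
        """Reserve a slot for one more token, allocating a block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:  # every allocated block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        # (physical block id, offset inside that block) for the new token
        return table[length // self.block_size], length % self.block_size

    def release(self, request_id):
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

    def utilization(self):
        """Fraction of allocated slots that actually hold a token."""
        allocated = sum(len(t) for t in self.block_tables.values()) * self.block_size
        return sum(self.lengths.values()) / allocated if allocated else 0.0

allocator = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(40):
    allocator.append_token("a")
for _ in range(10):
    allocator.append_token("b")
utilization_now = allocator.utilization()  # 50 tokens in 4 blocks of 16 slots
allocator.release("a")
```

A contiguous design would have reserved the worst-case length for both requests up front. Here the two requests hold four blocks between them, and the other four stay available for new arrivals.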
KV cache quantization
Cache size scales directly with bytes per value, so storing K and V in lower precision shrinks it proportionally. Drop from FP16 (2 bytes) to FP8 (1 byte) and the cache roughly halves; lower-precision formats cut further. The cost is accuracy. Push precision too low and output quality degrades, so the format is chosen to reclaim as much memory as possible while keeping generations faithful.
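The trade can be measured on a small tensor. NumPy has no built-in FP8 type, so the sketch below substitutes symmetric int8 quantization with one scale per row as a stand-in for a 1-byte format; the tensor is random data, and the scheme is a simple illustration rather than what any particular serving stack does.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric int8 quantization with one float32 scale per row (per token here)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-8).astype(np.float32)  # guard against all-zero rows
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from int8 codes and their scales."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
keys_fp16 = rng.standard_normal((1024, 128)).astype(np.float16)  # stand-in for cached K

q, scale = quantize_int8(keys_fp16.astype(np.float32))
restored = dequantize(q, scale)

compression = keys_fp16.nbytes / (q.nbytes + scale.nbytes)
max_abs_error = float(np.abs(restored - keys_fp16.astype(np.float32)).max())
```

The stored size comes out a little under half of the FP16 original because the per-row scales take some space, and `max_abs_error` is the rounding error that buys it.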
Offload
When the cache will not fit in GPU memory, serving systems move blocks to CPU memory or other tiers and pull them back on demand. Offload buys capacity and pays in bandwidth, since hauling cache across the bus is slower than reading it from GPU memory. It is the release valve for context lengths that would otherwise not fit at all.
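A toy two-tier store shows where the bandwidth bill comes from. The class below is an illustration written for this article, with plain dictionaries standing in for GPU and CPU memory and a least-recently-used eviction policy chosen for simplicity; real systems use their own policies and transfer mechanisms.

```python
from collections import OrderedDict

class TieredKVStore:
    """Toy two-tier store: a small 'GPU' tier backed by a larger 'CPU' tier."""

    def __init__(self, gpu_capacity_blocks):
        self.gpu_capacity = gpu_capacity_blocks
        self.gpu = OrderedDict()  # block id -> data, least recently used first
        self.cpu = {}
        self.transfers = 0        # blocks moved across the bus, in either direction

    def put(self, block_id, data):
        """Store a new block in the GPU tier, spilling old blocks if it is full."""
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)
        self._evict_if_needed()

    def get(self, block_id):
        """Read a block, pulling it back from the CPU tier on a miss."""
        if block_id not in self.gpu:
            self.gpu[block_id] = self.cpu.pop(block_id)
            self.transfers += 1
        self.gpu.move_to_end(block_id)
        self._evict_if_needed()
        return self.gpu[block_id]

    def _evict_if_needed(self):
        """Move least recently used blocks to the CPU tier until the GPU tier fits."""
        while len(self.gpu) > self.gpu_capacity:
            victim, data = self.gpu.popitem(last=False)
            self.cpu[victim] = data
            self.transfers += 1

store = TieredKVStore(gpu_capacity_blocks=2)
for block_id in range(3):
    store.put(block_id, f"kv-{block_id}")
fetched = store.get(0)  # block 0 was offloaded, so this read crosses the bus
```

Three blocks fit in a tier built for two, but reading the offloaded block costs one transfer in and one more to push another block out. Capacity goes up, and every miss is paid for in bus traffic.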
What this means for context windows and AI memory
The KV cache is the physical reason a context window has a ceiling. An advertised context limit is not an arbitrary software cap. It reflects how much KV cache the hardware can hold and stream fast enough. A 7B-class model with full multi-head attention at 128k context can demand on the order of 64 GB of cache, most of an 80 GB GPU's memory before weights or activations even enter the picture. Every token in the prompt is a token whose K and V live in that cache for the entire response.
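The ceiling can be estimated the same way. The sketch below assumes an 80 GiB GPU, 14 GiB of FP16 weights, and the 0.5 MiB per-token figure for a 7B-class model with full multi-head attention; it ignores activations and framework overhead, so real limits would be lower. The function name is made up for this example.

```python
def max_context_tokens(gpu_bytes, weight_bytes, kv_bytes_per_token, batch_size=1):
    """Tokens of KV cache per request that fit once the weights are loaded."""
    free_bytes = max(gpu_bytes - weight_bytes, 0)
    return free_bytes // (kv_bytes_per_token * batch_size)

GiB = 1024 ** 3
per_token = 2 * 32 * 4096 * 2                  # assumed 7B-class shape, FP16: 0.5 MiB

cache_at_128k_gib = per_token * 128 * 1024 / GiB                      # 64.0
fits_single = max_context_tokens(80 * GiB, 14 * GiB, per_token)       # 135,168 tokens
fits_batch_of_8 = max_context_tokens(80 * GiB, 14 * GiB, per_token, batch_size=8)
```

Under these assumptions a single request just clears 128k tokens, and a batch of eight is limited to 16,896 tokens each.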
That breaks a common instinct. Stuffing an entire knowledge base, every past chat, and a stack of reference docs into one giant prompt does not buy perfect recall for free. It inflates the cache linearly, raises both latency and cost, and still slams into the window ceiling. Each retained token is a recurring tax, not a one-time deposit.
The context window is not memory. It is a meter, and it runs on every token you leave in it. Retrieving only what is relevant beats carrying everything.
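A toy comparison shows the effect on prompt size. The sketch below uses a crude word-overlap score and whitespace token counts, both chosen only to keep the example self-contained; it is a generic illustration of retrieve-then-prompt and does not describe how any particular product ranks or stores passages.

```python
def score(query, passage):
    """Crude relevance: number of query words that also appear in the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def retrieve(query, passages, top_k=2):
    """Return the top_k passages by overlap score, highest first."""
    return sorted(passages, key=lambda p: score(query, p), reverse=True)[:top_k]

def token_count(text):
    """Whitespace split, a rough stand-in for a real tokenizer."""
    return len(text.split())

history = [
    "the kv cache stores keys and values for past tokens",
    "lunch options near the office include ramen and tacos",
    "paged attention splits the kv cache into fixed-size blocks",
    "the quarterly report is due on friday",
]
query = "how does paged attention manage the kv cache"

selected = retrieve(query, history)
full_tokens = sum(token_count(p) for p in history)
selected_tokens = sum(token_count(p) for p in selected)
```

Because the cache grows linearly with tokens in the window, sending 19 tokens of retrieved context instead of all 35 shrinks the cache for that context by the same proportion, and the gap widens as the history grows.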
Where MemX fits
This is the practical case for memory that lives outside the model. MemX sits beyond the context window and holds your conversations, notes, and documents in its own retrieval system. Instead of pinning everything in the prompt and paying for it in KV cache on every decode step, MemX fetches only the passages relevant to the current question and injects those. The model gets the memory it needs without dragging the whole history through the cache.
The KV cache is bounded by hardware and grows with whatever sits in the window. An external layer is not. MemX is private by architecture, with per-user isolation and encryption at rest, and it works across ChatGPT, Claude, and Gemini. It does not make the context window bigger. It stops you fighting the cache by keeping less in context and pulling back what matters, when it matters.
01 What is a KV cache in an LLM?
It is the block of memory where a transformer stores the Key and Value tensors for tokens it has already processed. During generation the model reuses them instead of recomputing Keys and Values for the whole sequence at each step, which is what keeps token generation fast.
02 Why does the KV cache make LLMs faster?
Without it, the model recomputes Keys and Values for every past token on each new step, which scales quadratically. The cache stores those tensors once during prefill and appends just one token per decode step, turning repeated work into a cheap memory read.
03 How do you calculate KV cache size?
Per token it is roughly 2 x num_layers x num_kv_heads x head_dim x bytes_per_value. Multiply by sequence length and batch size for the total. Because it grows linearly with context length, at long context it can rival or exceed the model's own weights.
04 Why is long context so expensive for LLMs?
Long context inflates the KV cache linearly, and that cache becomes both the dominant memory cost and the main driver of decode-time memory bandwidth. More context means more cache to store and stream, raising GPU memory use and latency together.
05 Does an external memory layer help with KV cache limits?
Yes, indirectly. A layer like MemX keeps history outside the context window and retrieves only relevant passages, so fewer tokens sit in the prompt. That shrinks how much KV cache you carry on every decode step instead of stuffing everything into context.
