Prefill vs Decode: Why LLMs Feel Slow

An LLM feels slow for one of two reasons, and they are not the same reason. A long prompt that takes seconds to start answering is stuck in prefill, the compute-bound phase that reads your whole prompt at once. A short prompt that crawls out a long answer is stuck in decode, the memory-bound phase that writes one token at a time.

Name those two phases and almost every latency mystery in a chatbot or agent dissolves. A giant prompt that takes three seconds to start, then streams fluently? Prefill. A two-line prompt that produces a long reasoning trace and limps the whole way? Decode. Here is the part most explainers skip: the fix for one phase does nothing for the other, so the usual advice to 'shorten your prompt' often targets the wrong bottleneck entirely.

The two phases of LLM inference

Every transformer response runs in two phases: prefill, then decode. Prefill happens once per request. Decode happens once per output token. Same model weights, completely different load on the GPU, which is why they have different bottlenecks and different fixes.

Prefill: read the whole prompt at once

Prefill processes your entire prompt in a single parallel pass and produces the first token. Because every prompt token is known up front, the GPU runs one big matrix-matrix multiply that pegs its compute units. NVIDIA describes prefill as a matrix-matrix operation that is highly parallelized and effectively saturates GPU utilization. That makes prefill compute-bound: the limit is floating-point operations per second, not how fast the GPU can move data.

Prefill is what you wait on during Time To First Token (TTFT), the gap between hitting send and seeing the first word. The longer the prompt, the more work prefill does in that one pass, so TTFT climbs with prompt length. A 200-token question starts answering almost instantly. Paste a 40,000-token document into the context window and the first token takes noticeably longer, because prefill has far more to chew before it can emit anything.
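A back-of-envelope sketch shows why TTFT climbs with prompt length. It assumes the common approximation of about 2 FLOPs per parameter per prompt token for a dense model and ignores attention over the sequence, so it understates very long prompts. The model size, peak FLOPS, and utilization figures below are illustrative assumptions, not measurements of any particular GPU.

```python
def estimate_prefill_seconds(prompt_tokens, n_params, peak_flops, utilization=0.5):
    """Rough prefill time for a dense model.

    Assumes ~2 FLOPs per parameter per token (one multiply, one add) and
    ignores the attention term, which grows faster than linearly with length.
    """
    flops_needed = 2 * n_params * prompt_tokens
    return flops_needed / (peak_flops * utilization)


# Hypothetical setup: 7e9-parameter model, 300e12 peak FLOPS, 50% utilization.
N_PARAMS = 7e9
PEAK_FLOPS = 300e12

short_prompt = estimate_prefill_seconds(200, N_PARAMS, PEAK_FLOPS)
long_prompt = estimate_prefill_seconds(40_000, N_PARAMS, PEAK_FLOPS)

print(f"200-token prompt:    {short_prompt * 1000:.1f} ms")
print(f"40,000-token prompt: {long_prompt:.2f} s")
```

Under these assumptions the estimate scales linearly with prompt length, so the 40,000-token prompt costs 200 times the prefill work of the 200-token one.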

Decode: write one token at a time

Decode generates the answer token by token. Each step takes everything produced so far, computes exactly one new token, then feeds it back to compute the next. You cannot parallelize across future tokens, because the model cannot write token fifty until it has written token forty-nine. Decode is a long chain of tiny steps, not one big step.

The cost hides in what each step has to read. To produce a single token, the GPU streams the full set of model weights plus the KV cache out of memory. NVIDIA frames decode as a matrix-vector operation where the speed at which the data (weights, keys, values, and activations) transfers to the GPU dominates the latency, not how fast the computation runs. So decode is memory-bandwidth-bound: the compute units sit mostly idle, waiting on memory. This also explains why a bigger model decodes slower on an identical prompt. More parameters mean more weight bytes to move per token.
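The same reasoning gives a rough floor on decode time per token: bytes that must be read, divided by memory bandwidth. This is a minimal sketch for a single request, assuming every weight is read once per step. The parameter counts, 2-byte weights, and 2e12 bytes/s bandwidth are hypothetical values chosen for illustration.

```python
def decode_seconds_per_token(n_params, bytes_per_param, mem_bandwidth, kv_cache_bytes=0):
    """Lower bound on one decode step for one request.

    Every step reads all weights plus the current KV cache from memory;
    compute time is assumed to be hidden behind those reads.
    """
    bytes_read = n_params * bytes_per_param + kv_cache_bytes
    return bytes_read / mem_bandwidth


# Hypothetical hardware: 2e12 bytes/s of memory bandwidth, 16-bit weights.
MEM_BANDWIDTH = 2e12

small_model = decode_seconds_per_token(7e9, 2, MEM_BANDWIDTH)
large_model = decode_seconds_per_token(70e9, 2, MEM_BANDWIDTH)

print(f"7e9 params:  {small_model * 1000:.0f} ms per token")
print(f"70e9 params: {large_model * 1000:.0f} ms per token")
```

Ten times the parameters means ten times the bytes per step, which is the "bigger model decodes slower" effect in numbers.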

Insight

The mental model: prefill is one big parallel read of your prompt, bound by GPU math. Decode is many small sequential writes of the answer, bound by memory speed. Different bottleneck, different metric, different fix.
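Because each phase sets its own metric, it helps to measure them separately. The sketch below times any iterable of streamed tokens and reports TTFT and the mean gap between tokens. The `fake_stream` generator is a hypothetical stand-in for a real streaming client, with sleeps imitating prefill and decode delays.

```python
import time


def measure_stream(token_stream, clock=time.perf_counter):
    """Return (ttft_seconds, mean_inter_token_seconds) for a token iterable."""
    start = clock()
    arrival_times = []
    for _ in token_stream:
        arrival_times.append(clock())
    if not arrival_times:
        return None, None
    ttft = arrival_times[0] - start
    if len(arrival_times) < 2:
        return ttft, None
    mean_gap = (arrival_times[-1] - arrival_times[0]) / (len(arrival_times) - 1)
    return ttft, mean_gap


def fake_stream(n_tokens, prefill_delay_s, per_token_delay_s):
    """Hypothetical token source: one long wait, then evenly spaced tokens."""
    time.sleep(prefill_delay_s)          # stands in for prefill
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_delay_s)  # stands in for one decode step
        yield f"tok{i}"


ttft, gap = measure_stream(fake_stream(5, prefill_delay_s=0.05, per_token_delay_s=0.01))
print(f"TTFT: {ttft * 1000:.0f} ms, mean inter-token gap: {gap * 1000:.0f} ms")
```

A high TTFT with a small gap points at prefill; a low TTFT with a large gap points at decode.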

Why decode reads memory on every single step

Decode is memory-bound for one structural reason: there is nothing to amortize. In prefill, the GPU loads the weights once and reuses them across thousands of prompt tokens in parallel, so the read cost spreads thin. In decode, the GPU loads the same weights to produce one token, then loads them again for the next, and again, and again. The weights are a heavy book the GPU rereads cover to cover for every single word it writes.
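The amortization argument can be put in numbers as arithmetic intensity: FLOPs performed per byte of weights read. This sketch assumes about 2 FLOPs per parameter per token, weights read once per pass, and a simple roofline-style threshold of peak FLOPS divided by memory bandwidth. The hardware figures are hypothetical.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param=2):
    """FLOPs per byte of weights read in one forward pass."""
    return 2 * tokens_per_pass / bytes_per_param


def bound_by(tokens_per_pass, peak_flops, mem_bandwidth, bytes_per_param=2):
    """Classify a pass as compute- or memory-bound under a simple roofline."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte the hardware can sustain
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute" if intensity >= ridge else "memory"


# Hypothetical hardware: 300e12 FLOPS and 2e12 bytes/s, so the ridge is 150.
PEAK_FLOPS = 300e12
MEM_BANDWIDTH = 2e12

print(bound_by(4000, PEAK_FLOPS, MEM_BANDWIDTH))  # prefill over 4000 tokens -> compute
print(bound_by(1, PEAK_FLOPS, MEM_BANDWIDTH))     # one decode step -> memory
```

With thousands of tokens per pass the weights are reused enough to keep the math units busy; with one token per pass they are not.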

On top of the weights, every decode step also reads the KV cache: the stored keys and values for all tokens so far. Databricks notes that decode speed depends on how quickly model parameters and cached state load from GPU memory, not on how fast the computation runs, which is a core reason decode is memory-bandwidth-bound. The KV cache grows as the answer grows, so later tokens in a long response have slightly more to read than earlier ones. For the deeper mechanics of why that cache exists and how it grows, see our companion post on the KV cache.
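To get a feel for how much the KV cache adds to each step, the standard size estimate is two tensors (keys and values) per layer, each holding one vector per KV head per token. The model shape below (32 layers, 32 KV heads, head dimension 128, 16-bit values) is a hypothetical example, not a description of any specific model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Size of the KV cache for one request at a given sequence length."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, 16-bit cache entries.
SHAPE = dict(n_layers=32, n_kv_heads=32, head_dim=128)

per_token = kv_cache_bytes(1, **SHAPE)
for seq_len in (1_000, 10_000, 40_000):
    gib = kv_cache_bytes(seq_len, **SHAPE) / 2**30
    print(f"{seq_len:>6} tokens: {gib:.2f} GiB of KV cache")
```

The cache grows linearly with every prompt and output token, which is why later decode steps read more than earlier ones.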

This is why batching helps decode

Batching is the main lever for decode throughput, and the logic is direct. When a server processes many users' requests together, it reads the model weights from memory once and computes one token for every request in the batch. NVIDIA describes batching as spreading out the memory cost of the weights because multiple requests use the same model, with larger batches transferred to the GPU at once. The expensive weight read gets shared, so each extra request in the batch is nearly free in bandwidth terms until the hardware saturates.

This is also why your per-token speed on a busy hosted API can differ from what you see on a quiet one. More concurrent users can mean better weight amortization and higher aggregate throughput, though it can also mean queueing. Prefill, already compute-bound, gains far less from batching, because it was never waiting on memory.
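A simplified model makes the batching effect visible. It assumes one shared weight read per decode step, about 2 FLOPs per parameter per request, and a step time equal to the slower of the two. It deliberately ignores KV cache reads, which are per-request and do not amortize. All hardware and model numbers are hypothetical.

```python
def decode_step_seconds(batch_size, n_params, bytes_per_param, mem_bandwidth, peak_flops):
    """Time for one decode step that emits one token per request in the batch."""
    memory_time = n_params * bytes_per_param / mem_bandwidth  # weights read once
    compute_time = 2 * n_params * batch_size / peak_flops     # work grows with batch
    return max(memory_time, compute_time)


def aggregate_tokens_per_second(batch_size, **hw):
    """Total tokens per second across every request in the batch."""
    return batch_size / decode_step_seconds(batch_size, **hw)


# Hypothetical setup: 7e9 params, 16-bit weights, 2e12 bytes/s, 300e12 FLOPS.
HW = dict(n_params=7e9, bytes_per_param=2, mem_bandwidth=2e12, peak_flops=300e12)

for batch in (1, 8, 64, 300, 600):
    print(f"batch {batch:>3}: {aggregate_tokens_per_second(batch, **HW):,.0f} tokens/s")
```

In this model throughput rises almost linearly with batch size until the step becomes compute-bound, then flattens, which matches the "nearly free until the hardware saturates" description.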


Prefill vs decode at a glance

Property	Prefill	Decode
What it does	Reads the whole prompt in one parallel pass	Generates output one token at a time
How often it runs	Once per request	Once per output token
Bound by	GPU compute (FLOPS)	Memory bandwidth
GPU operation type	Matrix-matrix (saturates compute)	Matrix-vector (underutilizes compute)
Metric it sets	Time To First Token (TTFT)	Inter-token latency / tokens per second
Scales with	Prompt length	Output length and model size
How to speed it up	Shorter prompts, prompt caching, more compute	Batching, smaller or quantized models, faster memory

Why this matters for reasoning models and agents

Reasoning models and agents live in the decode phase, and decode is the phase that burns wall-clock time and money per token. A model that thinks step by step writes a long internal trace before the visible answer, and every thinking token is its own decode step. More output tokens means more decode steps means more memory reads means more seconds and more cost.

This flips a common assumption. People blame slowness on a big input, but for a reasoning model the input is often tiny while the output is enormous. The latency you feel is decode, multiplied across thousands of generated tokens. Prefill barely registers. Databricks states that output length dominates overall response latency, which is exactly what reasoning workloads run into.
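A simple latency budget shows how lopsided this gets. The sketch assumes total time is TTFT plus one decode step for each token after the first. The TTFT, token count, and per-token time are hypothetical numbers for a short prompt that triggers a long reasoning trace.

```python
def response_seconds(ttft_s, output_tokens, seconds_per_output_token):
    """End-to-end latency: prefill once, then one decode step per later token."""
    # The first token arrives at TTFT; each remaining token costs one decode step.
    return ttft_s + max(output_tokens - 1, 0) * seconds_per_output_token


# Hypothetical request: tiny prompt, 5000 generated tokens at 20 ms each.
TTFT_S = 0.05
OUTPUT_TOKENS = 5000
SECONDS_PER_TOKEN = 0.02

total = response_seconds(TTFT_S, OUTPUT_TOKENS, SECONDS_PER_TOKEN)
decode_share = (total - TTFT_S) / total

print(f"total: {total:.1f} s, decode share: {decode_share:.1%}")
```

With these inputs, shaving the prompt would barely move the total; only fewer output tokens or faster decode steps would.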

Agents pay the prefill tax repeatedly

Agents break the problem in the other direction. An agent loop re-sends a growing context every turn: the system prompt, tool definitions, full history, and any documents it pulled in. Each turn pays prefill on that whole context again before it can act. As the conversation stretches, prefill cost per turn climbs, and TTFT per step climbs with it. So an agent can feel slow at the start of each step (prefill on a fat context) and slow during each step (decode on a long action). Both phases bite at once.
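The repeated prefill cost is easy to tally. This sketch assumes the agent re-sends its full context every turn, that the context grows by a fixed amount per turn, and that no prompt caching is in play. The token counts are hypothetical.

```python
def prefill_tokens_per_turn(base_context_tokens, tokens_added_per_turn, n_turns):
    """Tokens that must be prefilled on each turn when the context is re-sent."""
    return [base_context_tokens + turn * tokens_added_per_turn for turn in range(n_turns)]


# Hypothetical agent: 3000 tokens of system prompt and tool definitions,
# plus 1500 tokens of new history and tool output per turn, for 10 turns.
per_turn = prefill_tokens_per_turn(3000, 1500, 10)

print("first turn:", per_turn[0])
print("last turn: ", per_turn[-1])
print("total prefilled across the loop:", sum(per_turn))
```

The per-turn cost grows linearly, so the total across the loop grows roughly with the square of the number of turns.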

Long input, short output: latency is mostly prefill. Trim the prompt or cache it.
Short input, long output: latency is mostly decode. Cut output length or pick a faster-decoding model.
Reasoning models: decode dominates, because hidden thinking tokens are real decode steps you pay for.
Long agent loops: both phases grow as the carried-over context expands turn after turn.

Practical ways to cut each number

Different bottlenecks, different fixes, and using the wrong fix wastes effort. To cut TTFT, attack prefill: send shorter prompts, reuse prompt caching so a repeated prefix is not re-processed, or move to hardware with more compute. To cut streaming latency and cost, attack decode: ask for shorter outputs, pick a smaller or quantized model so fewer weight bytes stream per token, or run on a server that batches well.
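Two of these levers reduce to simple arithmetic. The sketch below is an idealized model: it assumes a prompt cache skips exactly the cached prefix, and that quantization shrinks weight bytes in proportion to bits per parameter. Real caches and quantization schemes carry extra constraints and overhead, and all numbers here are hypothetical.

```python
def uncached_prefill_tokens(prompt_tokens, cached_prefix_tokens):
    """Prompt tokens that still need prefill when a prefix is already cached."""
    cached = min(max(cached_prefix_tokens, 0), prompt_tokens)
    return prompt_tokens - cached


def weight_bytes(n_params, bits_per_param):
    """Bytes of weights streamed per decode step at a given precision."""
    return n_params * bits_per_param / 8


# Prefill lever: a 12,000-token prompt whose first 10,000 tokens repeat.
print("tokens to prefill:", uncached_prefill_tokens(12_000, 10_000))

# Decode lever: the same hypothetical 7e9-parameter model at 16 and 4 bits.
for bits in (16, 4):
    print(f"{bits}-bit weights: {weight_bytes(7e9, bits) / 1e9:.1f} GB per step")
```

The first function only affects TTFT and the second only affects per-token decode time, which is the point of matching the fix to the phase.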

The highest-leverage move for most application builders is sending less prompt. A smaller prompt shrinks prefill directly, lowers TTFT, trims the KV cache decode reads on every step, and cuts token cost on both ends. The hard part is sending less prompt without making the model dumber. That is exactly where an external memory layer earns its place.

Where MemX fits

MemX is an external memory layer for ChatGPT, Claude, Gemini, and your personal documents, and its job here is to keep prompts short without losing context. Instead of pasting a long history or a pile of documents into the context window on every call, which inflates prefill and bloats the KV cache for decode, MemX stores that context outside the model and supplies only the relevant slice when it is needed.

Smaller, targeted prompts mean less prefill work per request and a leaner KV cache for decode to stream through. MemX does not change the physics: prefill stays compute-bound, decode stays memory-bound. It changes how much you feed into each phase. MemX keeps your memory private by architecture, with per-user isolation and encryption at rest, so cutting prompt size does not mean handing your context to a shared pool.

Frequently Asked Questions

01. What is the difference between prefill and decode in an LLM?

Prefill processes your entire prompt in one parallel pass to produce the first token, and is limited by GPU compute. Decode generates the answer one token at a time, and is limited by memory bandwidth. Prefill sets how long you wait to start; decode sets how fast text streams.

02. Why does a long prompt make the first response slower?

A long prompt slows the first token because of prefill, the phase that reads the whole prompt before generating anything. Prefill work grows with prompt length, so Time To First Token rises. Once the first token appears, streaming speed is set by decode, which depends mostly on model size. A longer prompt still adds some per-token cost, because its larger KV cache is read on every step.

03. Why is the decode phase memory-bound instead of compute-bound?

Decode produces one token per step, and each step streams the full model weights plus the KV cache out of memory. The GPU spends most of its time waiting on those transfers rather than computing, so memory bandwidth, not FLOPS, sets the speed. That is what memory-bound means.

04. Does batching make a single LLM request faster?

Not on its own. Batching helps decode throughput across many concurrent requests by reading the model weights once and reusing them for every request in the batch. It raises aggregate tokens per second, not the speed of any single request. Prefill, being compute-bound, benefits much less from batching.

05. Why are reasoning models and AI agents so slow?

Reasoning models generate long hidden thinking traces, and every thinking token is a separate decode step that costs time and money. Agents also re-send a growing context each turn, paying prefill repeatedly. So latency comes from many decode steps, fat contexts, or both, not one slow operation.

Sources

Technical framing drawn from published inference-performance writeups by NVIDIA and Databricks.
