Speculative Decoding: 2x Faster LLMs

Speculative decoding makes large language models generate text roughly 2x to 3x faster, and the catch is smaller than it sounds: the output is statistically indistinguishable from what the slow model would have produced on its own. A small draft model guesses several tokens ahead, the big target model checks all of those guesses in one parallel pass, and a strict accept-reject rule guarantees the final text is drawn from the target model's exact distribution. Same quality, less waiting.

The one-sentence version

A fast little model writes a rough draft of the next few tokens, and the slow expensive model checks the whole draft at once instead of writing each token itself. When the draft is right, the big model confirms many tokens for the price of a single step. When the draft is wrong, the big model catches the first mistake and fixes it. A careful accept-and-reject rule guarantees the final text is sampled from exactly the same distribution the big model would have produced alone. This is an acceleration trick, not an approximation.

This is the third post in our LLM inference series. The first explained the KV cache, the memory that lets a model skip recomputing past tokens. The second covered the prefill and decode phases and why they behave so differently. Speculative decoding builds on both and targets the decode phase specifically: the slow, one-token-at-a-time part of generation.

Why normal decoding is slow in the first place

Decoding is slow because it is sequential and memory-bound, not because the math is hard. To produce one token, the model reads its entire set of weights out of GPU memory, does a relatively small amount of arithmetic, and emits one token. Then it repeats the whole thing for the next token. At 16-bit precision, a 70-billion-parameter model streams roughly 140 GB of weights from memory for every single token it writes.

The bottleneck is memory bandwidth, not arithmetic

During decode, the GPU spends most of its time waiting on memory reads rather than doing math. Loading the weights is the expensive part. Once those weights sit on the chip, running them over one token or over ten tokens costs almost the same wall-clock time, because the arithmetic units were idle anyway. You are paying full price to read the model's weights from memory, then using almost none of the compute you paid for. That is what people mean when they call decode memory-bandwidth-bound.

That idle compute is the opening. Feed the big model several candidate tokens at once, have it check them all in a single weight read, and you extract more useful work from the same expensive memory transfer. Speculative decoding turns that idea into exact, lossless output.
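A back-of-the-envelope calculation makes the bandwidth ceiling concrete. The sketch below assumes 16-bit weights, a hypothetical accelerator with 2,000 GB/s of memory bandwidth, and that every weight is read exactly once per token. It ignores KV-cache reads and multi-GPU sharding, so treat the result as a rough floor, not a benchmark. The function name is illustrative.

```python
def decode_latency_floor_ms(params_billion: float,
                            bytes_per_param: float,
                            bandwidth_gb_s: float) -> float:
    """Lower bound on per-token decode time when weight reads dominate."""
    # 1e9 parameters times bytes per parameter gives gigabytes of weights.
    weight_gb = params_billion * bytes_per_param
    # Time to stream all weights once, converted from seconds to milliseconds.
    return weight_gb / bandwidth_gb_s * 1000.0


# Assumed figures: 70B parameters, 2 bytes each, 2000 GB/s of bandwidth.
floor_ms = decode_latency_floor_ms(70, 2, 2000)
print(f"at least {floor_ms:.0f} ms per token, "
      f"at most {1000.0 / floor_ms:.1f} tokens per second")
```

Under those assumptions the floor is about 70 ms per token, roughly 14 tokens per second, no matter how fast the arithmetic units are.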

How speculative decoding works, step by step

The algorithm runs a draft-then-verify loop. Each round confirms multiple tokens instead of one, as long as the draft model guesses well.

Draft: a small, fast model proposes the next K tokens (a typical K is 4 to 8), generating them quickly because it is cheap to run.
Verify: the large target model takes the original context plus all K proposed tokens and runs ONE parallel forward pass, computing what it would have predicted at each position.
Accept: walking the proposed tokens in order, a probabilistic accept-and-reject rule keeps each one until the first rejection, so the accepted tokens always form a prefix of the draft.
Correct: at the first rejected position, the target samples the right token itself from an adjusted distribution, so no round is ever wasted. If every drafted token is accepted, the same pass also yields one extra token from the target for free.
Repeat: the loop restarts from the new, longer confirmed context.
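The five steps above can be sketched as a toy loop. This is a minimal illustration, not a production implementation: `target_probs` and `draft_probs` are hypothetical stand-ins for real models over a 4-token vocabulary, and the verify step is written as a Python loop where a real engine would run one batched forward pass.

```python
import random

VOCAB = 4  # toy vocabulary: token ids 0..3


def target_probs(context):
    # Hypothetical "big" model: strongly prefers (last token + 1) mod VOCAB.
    nxt = (context[-1] + 1) % VOCAB
    return [0.85 if t == nxt else 0.05 for t in range(VOCAB)]


def draft_probs(context):
    # Hypothetical "small" model: same preference, less confident.
    nxt = (context[-1] + 1) % VOCAB
    return [0.70 if t == nxt else 0.10 for t in range(VOCAB)]


def sample(weights, rng):
    # random.choices normalizes the weights for us.
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def speculative_round(context, k, rng):
    """Run one draft-verify round and return the newly confirmed tokens."""
    # Draft: propose k tokens one at a time with the cheap model.
    drafted, q_dists, ctx = [], [], list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # Verify: target distributions at all k drafted positions, plus one more.
    p_dists = [target_probs(list(context) + drafted[:i]) for i in range(k + 1)]

    confirmed = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        # Accept with probability min(1, p / q).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            confirmed.append(tok)
            continue
        # Correct: resample from the clipped residual max(0, p - q), then stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        confirmed.append(sample(residual, rng))
        return confirmed

    # Every draft was accepted: take one bonus token from the target.
    confirmed.append(sample(p_dists[k], rng))
    return confirmed


# Repeat: keep extending the confirmed context.
rng = random.Random(0)
context, rounds = [0], 0
while len(context) < 50:
    context += speculative_round(context, k=4, rng=rng)
    rounds += 1
print(f"{len(context) - 1} tokens in {rounds} target passes")
```

Each round returns between 1 and k + 1 tokens, which is why the number of target passes ends up well below the number of tokens generated whenever the draft agrees with the target.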

Why the output stays exactly correct

The accept-and-reject rule is the clever core, and it is provably exact. For each proposed token, the algorithm compares the draft model's probability for that token against the target model's probability. If the target assigns equal or higher probability, the token is accepted outright. If the target's probability is lower, the token is accepted with probability equal to the target's probability divided by the draft's, and otherwise rejected. On a rejection, the replacement token is sampled from a corrected residual distribution: the target's probabilities minus the draft's, with negative values clipped to zero and the rest renormalized. Leviathan et al. proved this procedure samples from precisely the target model's distribution, so the final text is identical in distribution to what plain decoding would have produced. Quality does not move.
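You can check the exactness claim numerically for a single position. The sketch below computes, without any sampling, the distribution of the token that comes out of the accept-and-reject rule and compares it with the target distribution. The two example distributions and the function name are made up for illustration, and the residual is taken to be the clipped difference max(0, target - draft), renormalized.

```python
def output_distribution(p, q):
    """Exact distribution of the emitted token at one position.

    p: target model probabilities, q: draft model probabilities.
    """
    # Probability that token i is drafted AND accepted: q_i * min(1, p_i / q_i).
    accepted = [qi * min(1.0, pi / qi) if qi > 0 else 0.0
                for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)
    if reject_mass <= 1e-12:
        # Draft and target agree exactly, so nothing is ever rejected.
        return accepted
    # Residual distribution used after a rejection: clip and renormalize.
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    residual = [r / total for r in raw]
    return [a + reject_mass * r for a, r in zip(accepted, residual)]


target = [0.6, 0.3, 0.1]  # illustrative target distribution
draft = [0.3, 0.5, 0.2]   # illustrative draft distribution

result = output_distribution(target, draft)
print(result)
```

The printed values match the target distribution up to floating-point rounding, even though the draft distribution was quite different. A worse draft changes how often tokens get rejected, not what comes out.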

Where the speedup actually comes from

Verifying K tokens reuses a single expensive weight read. The target model loads its weights from memory once, then evaluates all K candidate positions in parallel on compute that was otherwise idle. Propose 5 tokens, accept 4, and the sequence advances by 5 tokens: the 4 accepted drafts plus the target's own correction at the fifth position. That costs close to one normal decode step, plus the small price of drafting. The more the draft agrees with the target, the bigger the win.
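To put a number on "the more the draft agrees, the bigger the win", here is a small sketch of the expected tokens gained per target pass. It rests on a simplifying assumption: every drafted token is accepted independently with the same probability `alpha`. Real acceptance varies from token to token, so read this as a rough model. The function name is illustrative.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens confirmed by one target pass with k drafted tokens.

    Assumes each drafted token is accepted independently with probability
    alpha. The total is 1 + alpha + alpha**2 + ... + alpha**k: the
    guaranteed token from the target plus each draft that survives.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


for alpha in (0.5, 0.8, 0.95):
    print(alpha, round(expected_tokens_per_pass(alpha, k=5), 2))
```

With k = 5, an acceptance probability of 0.8 gives about 3.7 tokens per pass, while 0.5 gives just under 2. The result can never exceed k + 1.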

What determines the actual speedup

The acceptance rate drives everything. How often the target accepts drafted tokens decides the gain, and that depends on how well the small draft model imitates the big one on your specific traffic. Predictable text, like boilerplate code or common phrasing, gets accepted at high rates and runs fast. Surprising text gets rejected more often and drifts back toward normal speed.

Draft quality: a draft model that tracks the target well produces longer accepted runs and bigger speedups.
Workload predictability: structured or repetitive output accepts more tokens than open-ended creative text.
Draft length K: longer drafts can win more per round but waste more compute when rejected, so there is a sweet spot.
Batch size: at very large batch sizes the target model becomes compute-bound rather than memory-bound, so the idle compute the trick relied on is gone and the benefit shrinks.
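The sweet spot for K falls out of a simple cost model. The sketch below assumes each drafted token is accepted independently with probability `alpha`, that one draft step costs a fraction `c` of a target pass, and that verifying K tokens costs the same as one ordinary decode step. That last assumption only holds in the memory-bound regime, so the model says nothing about large batches. All names and numbers are illustrative.

```python
def estimated_speedup(alpha: float, k: int, c: float) -> float:
    """Rough speedup over plain decoding under the stated assumptions."""
    if alpha >= 1.0:
        tokens = float(k + 1)
    else:
        # Expected tokens confirmed per target pass.
        tokens = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    # Cost of one round, in units of one target pass: k drafts plus 1 verify.
    cost = c * k + 1.0
    return tokens / cost


def best_draft_length(alpha: float, c: float, max_k: int = 16) -> int:
    """Draft length in 1..max_k with the highest estimated speedup."""
    return max(range(1, max_k + 1),
               key=lambda k: estimated_speedup(alpha, k, c))


# Assumed draft cost: 5% of a target pass per drafted token.
for alpha in (0.5, 0.8):
    k = best_draft_length(alpha, c=0.05)
    print(alpha, k, round(estimated_speedup(alpha, k, c=0.05), 2))
```

In this model a draft with 0.5 acceptance peaks at K = 3 for roughly 1.6x, while 0.8 acceptance peaks at K = 8 for roughly 3.1x. Better drafts justify longer speculation, and weak drafts cap the gain no matter how K is set.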

Here is what most explainers skip: the 2x-3x headline is not a property of the algorithm, it is a property of your workload, and it can quietly collapse. A draft model that perfectly matches the target on easy prose can still win almost nothing on hard, high-entropy generation, because acceptance, not accuracy, is the lever. The original paper reported 2x to 3x acceleration on T5-XXL with identical outputs, and production systems land in similar ranges, with some configurations pushing higher on favorable traffic. Treat anything above 3x as workload-dependent, not a guarantee.

Property	Normal decoding	Speculative decoding
How tokens are produced	Target model emits one token per forward pass, strictly sequential	Draft model proposes K tokens; target verifies all K in one parallel pass
Tokens advanced per target pass	Exactly 1	1 up to K+1, depending on acceptance rate (up to K accepted drafts plus one token from the target)
Typical speedup	Baseline (1x)	Around 2x to 3x, workload-dependent
Output quality	Reference output	Identical in distribution; provably lossless
Extra cost	None	Running a small draft model plus a slightly larger verification pass
Best case	Consistent but slow	Predictable text with high draft acceptance
Worst case	Same as always	Low acceptance falls back toward baseline, rarely slower in practice

The popular variants you will hear about

The real question every variant answers is the same: where does the draft come from? The classic recipe uses a separate small model from the same family. Newer methods drop the second model entirely.

Medusa

Medusa skips the separate draft model and bolts several lightweight prediction heads onto the target model itself. Each head guesses a token a few positions ahead, and the candidates get verified together. No second model to train or serve, and still multiple tokens per step.

EAGLE

EAGLE drafts by extrapolating the target model's own internal feature vectors, specifically the second-to-top-layer features, instead of its raw token outputs. Those feature sequences are more regular than token sequences, which raises the acceptance rate. EAGLE reports strong speedups while keeping outputs consistent with the target, and it runs on common serving stacks.

Self-speculative decoding

Self-speculative decoding pulls the draft from the target model directly, for example by skipping some of its own layers to produce fast guesses, then verifying with the full model. The draft and target share parameters, so there is no second model to maintain.

All of these ship in real inference engines. vLLM supports speculative decoding, including draft-model, Medusa, and EAGLE-style configurations, and Hugging Face Text Generation Inference supports it too. You turn it on through configuration instead of writing the algorithm yourself.

Where this fits in the bigger inference picture

Speculative decoding is the headline technique for cutting latency without touching quality, and it pairs with the other two ideas in this series. The KV cache stops the model from recomputing past attention. The prefill-decode split tells you which phase you are optimizing. Speculative decoding then attacks decode, the part that crawls one token at a time, by spending idle compute on parallel verification. Together they explain most of why a modern serving stack feels fast.

Pro Tip

If you are tuning settings, the lever is acceptance rate, not draft accuracy in the abstract. Match your draft model closely to your target and your real traffic, then measure accepted-tokens-per-step on that traffic instead of trusting a generic benchmark.
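If your serving stack lets you record how many drafted tokens were accepted in each round, turning that into the two numbers worth watching takes a few lines. The input format here is an assumption made for illustration: a plain list of per-round accepted counts, with a fixed draft length `k`. Real engines expose these metrics in their own formats.

```python
def acceptance_stats(accepted_per_round, k):
    """Summarize speculative decoding rounds.

    accepted_per_round: number of drafted tokens accepted in each round.
    k: number of tokens drafted per round (assumed constant here).
    """
    rounds = len(accepted_per_round)
    accepted = sum(accepted_per_round)
    return {
        # Each round also yields one token sampled by the target itself.
        "tokens_per_target_pass": (accepted + rounds) / rounds,
        # Share of all drafted tokens that ended up in the output.
        "draft_acceptance_rate": accepted / (rounds * k),
    }


# Hypothetical log from five rounds with k = 4.
stats = acceptance_stats([4, 2, 0, 4, 3], k=4)
print(stats)
```

For this made-up log the result is 3.6 tokens per target pass and a 65% draft acceptance rate. Track the same two numbers on your own traffic before and after any change to the draft model or to K.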

Where MemX fits

Speculative decoding speeds up how fast a model writes tokens. It does nothing about what the model knows about you, which is a separate problem entirely. MemX is an external memory layer that sits alongside ChatGPT, Claude, and Gemini and stores your context, preferences, and documents, so the model has the right facts to work from in the first place. Faster decoding and durable memory are complementary: one shortens the wait, the other makes the answer relevant. MemX keeps that memory private by architecture, with per-user isolation and encryption at rest, so your context stays yours.

Frequently Asked Questions

01. Does speculative decoding change the model's output?

No. The accept-and-reject rule is mathematically proven to sample from the target model's exact distribution, so the output is identical in distribution to normal decoding. It is a speed technique, not an approximation. Quality does not drop.

02. How much faster is speculative decoding?

Usually about 2x to 3x, depending on how often the target accepts the draft model's guesses and on the workload. Predictable text accepts more tokens and runs faster. Some setups beat 3x, but 2x to 3x is the safe, defensible range to quote.

03. Why does verifying multiple tokens at once speed things up?

Decoding is memory-bandwidth-bound: most time goes to reading the model's weights from GPU memory while compute sits idle. Verifying several drafted tokens in one parallel pass reuses that single expensive weight read, doing more useful work per memory transfer.

04. What are Medusa and EAGLE in speculative decoding?

They are variants that avoid running a separate draft model. Medusa adds lightweight prediction heads to the target model itself. EAGLE drafts by extrapolating the target's internal feature vectors for higher acceptance rates. Both keep outputs consistent with the target and run on engines like vLLM.

05. Can I use speculative decoding today?

Yes. It ships in production inference engines including vLLM and Hugging Face Text Generation Inference, with support for draft-model and EAGLE-style configurations. You enable it through configuration rather than coding the algorithm, then tune it by matching the draft model to your workload.
