Test-time compute is the strategy of spending more computation during inference, by generating longer reasoning or sampling many candidate answers, to improve a model's accuracy without changing its weights.
What is Test-Time Compute (Inference-Time Scaling)?
Test-time compute refers to the amount of computation a language model uses while answering a query, as opposed to the computation used to train it. Inference-time scaling is the observation that allocating more of this compute at answer time, by letting the model produce longer chains of reasoning or evaluate multiple candidate solutions, can raise accuracy on hard problems without retraining or enlarging the model.
The idea challenges the long-standing assumption that performance comes mainly from training scale. A fixed model can instead spend a variable budget per question, thinking briefly on easy questions and at length on hard ones. Reasoning models such as the OpenAI o-series are built around this principle: they generate an internal reasoning process before the final answer, and accuracy tends to rise as that process is allowed to grow.
- Compute is spent at answer time, not training time.
- More reasoning or more samples can improve accuracy with the same weights.
- The compute budget can vary per question based on difficulty.
How models spend more compute at inference
There are two broad families of methods. The first modifies the proposal distribution: the model generates a longer reasoning trace, revising and extending its own thinking before committing to an answer. Chain-of-thought prompting and trained reasoning models both fall here. The second modifies the search over outputs: the model produces many candidate answers and a selection mechanism chooses among them.
Selection mechanisms include majority voting over sampled answers, known as self-consistency, and scoring candidates with a verifier or reward model. More elaborate search methods expand a tree of partial solutions and use a process reward model to score intermediate steps, keeping promising branches and pruning weak ones. Each method trades additional tokens or parallel samples for higher expected correctness.
- Sequential scaling: generate longer or self-revising reasoning traces.
- Parallel scaling: sample many answers and select the best.
- Selection can use majority vote, a verifier, or a process reward model over steps.
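The two simplest selection mechanisms can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's API: `toy_sampler` is a hypothetical stand-in for one sampled model answer, and the lambda passed as `score` is a hypothetical stand-in for a verifier or reward model.

```python
import random
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Self-consistency: return the most common final answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: List[str], score: Callable[[str], float]) -> str:
    """Verifier-based selection: return the candidate with the highest score."""
    return max(answers, key=score)


def toy_sampler(rng: random.Random) -> str:
    """Hypothetical stand-in for a model call: answers "42" about 60% of the time."""
    return "42" if rng.random() < 0.6 else rng.choice(["41", "43", "44"])


rng = random.Random(0)
samples = [toy_sampler(rng) for _ in range(25)]  # parallel scaling: 25 samples

print(majority_vote(samples))
# Hypothetical verifier: scores a candidate 1.0 if it equals "42", else 0.0.
print(best_of_n(samples, score=lambda a: 1.0 if a == "42" else 0.0))
```

Majority voting needs no extra model but only works when answers can be compared for equality. Verifier-based selection can rank free-form outputs, but it is only as good as the scorer.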
The compute trade-off
Inference-time scaling exposes a trade-off between training compute and test-time compute. Research on scaling test-time compute optimally shows that on problems where a smaller base model already has a non-trivial success rate, a compute-optimal test-time strategy can let that smaller model outperform a model 14 times larger run with a small inference budget, while on the hardest problems pretraining scale still dominates. The practical implication is that the optimal split between training and inference depends on the difficulty of the workload.
Allocating test-time compute well also matters. A compute-optimal strategy adapts the budget to the prompt, applying more search to questions where the base model is uncertain and less where it is already confident. In the same work, this adaptive allocation made test-time compute more than four times as efficient as a best-of-N baseline, a large gain over applying a fixed, large budget uniformly to every question.
- For some tasks, a compute-optimal budget lets a model beat one 14 times larger.
- For the hardest tasks, pretraining scale remains decisive.
- Adapting the budget to question difficulty is more efficient than a uniform budget.
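One way to picture adaptive allocation is the sketch below. It is an illustrative heuristic, not the method from the research described above: it assumes a few cheap probe samples per question are already available, treats disagreement among them as a proxy for uncertainty, and splits a shared sample budget in proportion to that disagreement. The names `agreement` and `allocate_budget` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def agreement(probe_answers: List[str]) -> float:
    """Fraction of probe samples that match the most common answer."""
    top_count = Counter(probe_answers).most_common(1)[0][1]
    return top_count / len(probe_answers)


def allocate_budget(
    probes: Dict[str, List[str]], total_samples: int, min_samples: int = 1
) -> Dict[str, int]:
    """Give each question min_samples, then split the rest by disagreement.

    Shares are rounded down, so a few samples may be left unallocated.
    """
    spare = total_samples - min_samples * len(probes)
    if spare < 0:
        raise ValueError("total_samples is too small for min_samples per question")
    uncertainty = {q: 1.0 - agreement(a) for q, a in probes.items()}
    total_uncertainty = sum(uncertainty.values())
    budget = {}
    for question, u in uncertainty.items():
        extra = int(spare * u / total_uncertainty) if total_uncertainty > 0 else 0
        budget[question] = min_samples + extra
    return budget


# Assumed probe results: four cheap samples per question.
probes = {
    "easy": ["4", "4", "4", "4"],    # full agreement, no extra budget
    "medium": ["7", "7", "7", "9"],  # mild disagreement
    "hard": ["1", "2", "3", "1"],    # strong disagreement, largest share
}
print(allocate_budget(probes, total_samples=33))
```

A real system would more likely estimate difficulty with a learned verifier or the model's own confidence. The point of the sketch is only that the budget follows uncertainty instead of being uniform.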
Costs and limits
The cost of test-time compute is direct: more generated tokens or more parallel samples mean higher latency and higher spend per query. Because reasoning models emit reasoning tokens that are billed even when they are hidden from the user, a single answer can consume many times the tokens of a non-reasoning response. Returns also diminish: accuracy gains flatten and eventually plateau as the budget grows.
Inference-time scaling is not a substitute for capability the model lacks. If the base model cannot represent the knowledge or reasoning needed for a task, more samples and longer traces will not conjure it. The technique amplifies a model's existing competence rather than adding new abilities, which is why it pairs well with capable reasoning-trained models.
- More compute means higher latency and cost per query.
- Accuracy gains diminish and plateau beyond a point.
- It amplifies existing competence rather than adding new knowledge.
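Both limits show up in a simple idealized model of parallel sampling. Assume each sample is independently correct with probability p and that a perfect verifier always picks a correct sample when one exists. The chance that at least one of n samples is correct is then 1 - (1 - p)^n. Both assumptions are simplifications, so this is a best-case sketch and not a prediction for any real model.

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct,
    given a per-sample success rate p (idealized: perfect verifier assumed)."""
    return 1.0 - (1.0 - p) ** n


# Compare a model that cannot solve the task (p = 0) with weak and moderate ones.
for p in (0.0, 0.05, 0.3):
    curve = [round(coverage(p, n), 3) for n in (1, 4, 16, 64)]
    print(f"p={p}: {curve}")
```

With p = 0 the curve stays at zero however many samples are drawn, which matches the point that extra compute cannot supply a capability the model lacks. For any p > 0, each additional sample adds less than the one before it, so the curve flattens as n grows.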
Key takeaways
- Test-time compute is computation spent during inference to improve answers without changing model weights.
- Methods split into sequential scaling (longer reasoning) and parallel scaling (sample many, then select).
- With a compute-optimal budget, a smaller model can outperform one 14 times larger on some tasks, but not the hardest ones.
- Compute-optimal strategies adapt the budget to question difficulty rather than spending uniformly.
- The gains diminish and carry real latency and cost, since reasoning tokens are billed.