Test-Time Compute (Inference-Time Scaling)

By Aditya Kumar Jha, Engineer

Test-time compute is the strategy of spending more computation during inference, by generating longer reasoning or sampling many candidate answers, to improve a model's accuracy without changing its weights.

What is Test-Time Compute (Inference-Time Scaling)?

Test-time compute refers to the amount of computation a language model uses while answering a query, as opposed to the computation used to train it. Inference-time scaling is the observation that allocating more of this compute at answer time, by letting the model produce longer chains of reasoning or evaluate multiple candidate solutions, can raise accuracy on hard problems without retraining or enlarging the model.

The idea challenges the long-standing assumption that performance comes mainly from training scale. A fixed model can instead spend a variable budget per question, thinking briefly on easy questions and at length on hard ones. Reasoning models such as the OpenAI o-series are built around this principle. They generate an internal reasoning process before the final answer, and accuracy tends to rise as that process is allowed to grow.

  • Compute is spent at answer time, not training time.
  • More reasoning or more samples can improve accuracy with the same weights.
  • The compute budget can vary per question based on difficulty.

How models spend more compute at inference

There are two broad families of methods. The first, sequential scaling, modifies the proposal distribution: the model generates a longer reasoning trace, revising and extending its own thinking before committing to an answer. Chain-of-thought prompting and trained reasoning models both fall here. The second, parallel scaling, modifies the search over outputs: the model produces many candidate answers and a selection mechanism chooses among them.

Selection mechanisms include majority voting over sampled answers, known as self-consistency, and scoring candidates with a verifier or reward model. More elaborate search methods expand a tree of partial solutions and use a process reward model to score intermediate steps, keeping promising branches and pruning weak ones. Each method trades additional tokens or parallel samples for higher expected correctness.
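The two simplest selection mechanisms can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the sampled answers, and the toy scoring table that stands in for a learned verifier are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Self-consistency: return the most frequent final answer among samples."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common orders equally frequent answers by first occurrence.
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N: return the candidate that the scoring function rates highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


# Hypothetical final answers extracted from five sampled reasoning traces.
samples = ["42", "41", "42", "42", "17"]
voted = majority_vote(samples)

# Toy lookup table standing in for a verifier or reward model (an assumption).
toy_scores = {"42": 0.9, "41": 0.4, "17": 0.1}
picked = best_of_n(samples, lambda answer: toy_scores[answer])
```

Majority voting needs no extra model but only works when final answers can be compared for equality. Best-of-N works on free-form outputs but is only as good as the scorer it relies on.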

  • Sequential scaling: generate longer or self-revising reasoning traces.
  • Parallel scaling: sample many answers and select the best.
  • Selection can use majority vote, a verifier, or a process reward model over steps.
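The tree-style search described above can be sketched as a step-level beam search. The sketch assumes two hypothetical callables: `propose`, which stands in for the model suggesting next reasoning steps, and `score_steps`, which stands in for a process reward model scoring a partial solution. The toy task at the bottom (spelling a target string one letter at a time) exists only to make the code runnable.

```python
from typing import Callable, List, Sequence, Tuple

Steps = Tuple[str, ...]


def prm_beam_search(
    propose: Callable[[Steps], Sequence[str]],
    score_steps: Callable[[Steps], float],
    beam_width: int,
    max_depth: int,
) -> Steps:
    """Expand partial solutions step by step, keeping only the top-scoring ones."""
    beam: List[Steps] = [()]
    for _ in range(max_depth):
        # Extend every surviving partial solution with every proposed next step.
        expanded = [partial + (step,) for partial in beam for step in propose(partial)]
        if not expanded:
            break
        # Prune: keep the beam_width partial solutions the scorer prefers.
        expanded.sort(key=score_steps, reverse=True)
        beam = expanded[:beam_width]
    return max(beam, key=score_steps)


# Toy stand-ins (assumptions): each step is a letter, and the "process reward"
# counts how many positions already match a hidden target.
TARGET = "cab"


def toy_propose(partial: Steps) -> Sequence[str]:
    return ["a", "b", "c"]


def toy_prm(partial: Steps) -> float:
    return float(sum(1 for got, want in zip(partial, TARGET) if got == want))


solution = prm_beam_search(toy_propose, toy_prm, beam_width=2, max_depth=3)
```

With a beam width of 2, the search scores 3 + 6 + 6 = 15 partial solutions, compared with 3 + 9 + 27 = 39 for exhaustive expansion of the same tree. The pruning is what trades scorer quality for compute.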

The compute trade-off

Inference-time scaling exposes a trade-off between training compute and test-time compute. Research on scaling test-time compute optimally shows that, on problems where a smaller base model already has a non-trivial success rate, a compute-optimal test-time strategy can let that smaller model outperform a model 14 times larger that is given only a small inference budget. On the hardest problems, pretraining scale still dominates. The practical implication is that the optimal split between training and inference depends on the difficulty of the workload.

Allocating test-time compute well also matters. A compute-optimal strategy adapts the budget to the prompt, applying more search to questions where the base model is uncertain and less where it is already confident. In the same work, this adaptive allocation improved the efficiency of test-time compute by more than four times over a best-of-N baseline.

  • For some tasks, a compute-optimal budget lets a model beat one 14 times larger.
  • For the hardest tasks, pretraining scale remains decisive.
  • Adapting the budget to question difficulty is more efficient than a uniform budget.
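One simple way to adapt the budget to difficulty is to keep sampling only while the answers disagree. The sketch below is a heuristic chosen for illustration, not the allocation method used in the research described above: it treats agreement among samples as a rough proxy for the model's confidence. The `sample` callable, the thresholds, and the two toy samplers are all assumptions.

```python
import itertools
from collections import Counter
from typing import Callable, List, Tuple


def adaptive_self_consistency(
    sample: Callable[[], str],
    pilot: int = 4,
    max_samples: int = 16,
    agree_threshold: float = 0.75,
) -> Tuple[str, int]:
    """Return (answer, samples_used), spending more only when samples disagree."""
    if pilot < 1 or max_samples < pilot:
        raise ValueError("require 1 <= pilot <= max_samples")
    answers: List[str] = [sample() for _ in range(pilot)]
    while True:
        top, count = Counter(answers).most_common(1)[0]
        confident = count / len(answers) >= agree_threshold
        if confident or len(answers) >= max_samples:
            return top, len(answers)
        answers.append(sample())  # still uncertain: buy one more sample


# Toy samplers (assumptions): an "easy" question where every sample agrees,
# and a "hard" one where the samples keep disagreeing.
easy = itertools.cycle(["7"])
hard = itertools.cycle(["3", "5", "3", "8"])

easy_answer, easy_cost = adaptive_self_consistency(lambda: next(easy))
hard_answer, hard_cost = adaptive_self_consistency(lambda: next(hard))
```

In this toy run the easy question stops after the 4 pilot samples, while the hard one uses the full budget of 16. A uniform policy would have spent 16 on both.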

Costs and limits

The cost of test-time compute is direct: more generated tokens or more parallel samples mean higher latency and higher spend per query. Because reasoning models emit hidden reasoning tokens that are billed even when not shown, a single answer can consume many times the tokens of a non-reasoning response. Returns also diminish: accuracy gains flatten and eventually plateau as the budget grows.

Inference-time scaling is not a substitute for capability the model lacks. If the base model cannot represent the knowledge or reasoning needed for a task, more samples and longer traces will not conjure it. The technique amplifies a model's existing competence rather than adding new abilities, which is why it pairs well with capable reasoning-trained models.

  • More compute means higher latency and cost per query.
  • Accuracy gains diminish and plateau beyond a point.
  • It amplifies existing competence rather than adding new knowledge.
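A back-of-the-envelope model makes these limits concrete. The sketch assumes samples are independent and that a perfect selector always picks a correct answer if one exists, so `coverage` is an upper bound on accuracy rather than a prediction. The 20% success rate and the token counts are hypothetical values.

```python
def coverage(p_correct: float, n_samples: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p_correct) ** n_samples


def billed_output_tokens(visible_tokens: int, reasoning_tokens: int) -> int:
    """Hidden reasoning tokens count toward the bill alongside visible ones."""
    return visible_tokens + reasoning_tokens


# Marginal gain from each extra sample for a model that is right 20% of the time.
gains = [coverage(0.2, n + 1) - coverage(0.2, n) for n in range(1, 9)]

# A model with a zero per-sample success rate gains nothing from more samples.
stuck = coverage(0.0, 1000)

# Hypothetical response: 200 visible tokens plus 1,800 hidden reasoning tokens.
multiplier = billed_output_tokens(200, 1800) / billed_output_tokens(200, 0)
```

Under these assumptions each additional sample adds less than the one before it, a zero success rate stays at zero however many samples are drawn, and the hypothetical reasoning response is billed for ten times the tokens of the visible answer alone.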

Key takeaways

  • Test-time compute is computation spent during inference to improve answers without changing model weights.
  • Methods split into sequential scaling (longer reasoning) and parallel scaling (sample many, then select).
  • With a compute-optimal budget, a smaller model can outperform one 14 times larger on some tasks, but not the hardest ones.
  • Compute-optimal strategies adapt the budget to question difficulty rather than spending uniformly.
  • The gains diminish and carry real latency and cost, since reasoning tokens are billed.

Frequently asked questions

What is test-time compute?
Test-time compute is the computation a model uses while answering, such as generating longer reasoning or multiple candidate answers. Spending more of it can raise accuracy on hard problems without retraining the model or increasing its parameter count.

What is the difference between sequential and parallel scaling?
Sequential scaling makes the model think longer through extended, self-revising reasoning traces. Parallel scaling samples many independent answers and selects the best using majority vote, a verifier, or a reward model. Many systems combine both.

Can a smaller model with more test-time compute beat a larger model?
Sometimes. Research shows that with a compute-optimal budget a smaller model can outperform a model 14 times larger on problems where it already has a non-trivial success rate. On the hardest problems, however, pretraining scale still wins, so the trade-off depends on workload difficulty.

Why do reasoning models cost more per query?
Reasoning models generate hidden reasoning tokens before the final answer, and those tokens are billed and add latency. A single response can consume many times the tokens of a standard answer, which is the direct cost of spending more test-time compute.

Does more test-time compute always improve accuracy?
No. Gains diminish and eventually plateau as the budget grows, and the technique cannot add knowledge the base model lacks. It amplifies existing competence, so beyond a point extra compute mainly raises cost without raising accuracy.