Self-consistency is a decoding strategy that samples multiple chain-of-thought reasoning paths for the same prompt, then selects the final answer by majority vote. It improves reasoning accuracy by replacing a single greedy chain with an aggregate over diverse paths.
What is Self-Consistency?
Self-consistency is a decoding strategy that improves chain-of-thought reasoning by sampling several independent reasoning paths for one prompt and choosing the most common final answer. It was introduced by Wang and colleagues at Google in 2022 as a drop-in replacement for the single greedy decode normally used with chain-of-thought prompting.
The intuition is that a hard problem usually admits several valid lines of reasoning that all reach the same correct answer, whereas flawed reasoning tends to scatter across different wrong answers. If several sampled reasoning paths converge on one answer, that answer is more likely to be correct. Self-consistency therefore marginalizes over the reasoning paths: it discards the reasoning, keeps only each path's final answer, and takes a majority vote.
- Type: a decoding strategy layered on top of chain-of-thought prompting.
- Mechanism: sample multiple paths, then majority-vote on the final answer.
- Origin: Wang et al., Google, 2022.
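The majority vote can be written compactly. The symbols here are notation chosen for illustration: $m$ is the number of sampled paths, $a_i$ is the final answer extracted from the $i$-th path, and $\hat{a}$ is the answer that is returned.

```latex
\hat{a} = \arg\max_{a} \sum_{i=1}^{m} \mathbb{1}(a_i = a)
```

The indicator $\mathbb{1}(a_i = a)$ equals 1 when path $i$ ends in answer $a$ and 0 otherwise, so the sum counts how many paths agree on each candidate.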
How it works
Instead of decoding one chain greedily, self-consistency uses temperature sampling to generate a diverse set of reasoning chains for the same prompt. Each chain ends in a candidate answer. The reasoning text is then discarded and the candidate answers are aggregated, with the most frequent answer chosen as the final output.
Sampling temperature controls diversity. Too low and the chains collapse to near-identical reasoning, removing the benefit; too high and the chains become noisy. The number of sampled paths trades accuracy against cost: more samples generally raise accuracy but with diminishing returns, so practitioners pick a budget such as 5 to 40 samples depending on the task.
- Diversity comes from temperature sampling rather than greedy decoding.
- Only final answers are aggregated; intermediate reasoning is discarded.
- Accuracy rises with more samples but shows diminishing returns.
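The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. The `sample_chain` callable stands in for whatever model call you use and is hypothetical. The "The answer is X." suffix convention, the default of 10 samples, and the temperature of 0.7 are all assumptions made for the example.

```python
import re
from collections import Counter
from typing import Callable, List, Optional


def extract_answer(chain: str) -> Optional[str]:
    """Return the final answer, assuming the chain ends with 'The answer is X.'"""
    match = re.search(r"The answer is\s+(.+?)\.?\s*$", chain.strip())
    return match.group(1) if match else None


def self_consistency(
    prompt: str,
    sample_chain: Callable[[str, float], str],  # hypothetical model call
    num_samples: int = 10,       # assumed budget, tune per task
    temperature: float = 0.7,    # assumed value, tune per model
) -> Optional[str]:
    """Sample several reasoning chains and return the most frequent final answer."""
    answers: List[str] = []
    for _ in range(num_samples):
        chain = sample_chain(prompt, temperature)
        answer = extract_answer(chain)
        if answer is not None:   # chains without a parsable answer are skipped
            answers.append(answer)
    if not answers:
        return None
    # The reasoning text is discarded; only the answers are counted.
    return Counter(answers).most_common(1)[0][0]


def make_canned_sampler(chains: List[str]) -> Callable[[str, float], str]:
    """Stand-in sampler that replays fixed chains, used here instead of a real model."""
    iterator = iter(chains)
    return lambda prompt, temperature: next(iterator)


CANNED_CHAINS = [
    "16 - 3 - 4 = 9 eggs, 9 * 2 = 18. The answer is 18.",
    "She sells 16 - 7 = 9 eggs at $2 each. The answer is 18.",
    "16 - 3 = 13, 13 * 2 = 26. The answer is 26.",
    "9 eggs times 2 dollars is 18 dollars. The answer is 18.",
    "I am not sure how to proceed.",
]
```

Calling `self_consistency("Q: ...", make_canned_sampler(CANNED_CHAINS), num_samples=5)` on the canned chains returns `"18"`: three chains agree on 18, one says 26, and one yields no parsable answer.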
Why it improves reasoning
A single chain-of-thought decode can be derailed by one bad step, and greedy decoding commits to that path with no recovery. Self-consistency hedges across many paths, so an occasional faulty chain is outvoted by the correct majority. On arithmetic and commonsense reasoning benchmarks such as GSM8K, Wang et al. reported sizable accuracy gains over plain chain-of-thought prompting.
The method needs no extra training, no verifier, and no changes to the model. It only adds inference cost, since it generates and evaluates several completions per question. That cost makes self-consistency an early and widely cited example of test-time compute, where spending more computation at inference buys higher accuracy.
- Outvoting faulty chains makes reasoning more reliable than a single greedy decode.
- No training or verifier required, only added inference compute.
- An early example of trading test-time compute for accuracy.
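A simplified probability model shows why outvoting helps and why the returns diminish. The model is an assumption made for illustration, not something taken from Wang et al.: each chain is treated as independently correct with probability `p`, and the vote counts as correct only when a strict majority of chains is correct. Real chains are not independent, and wrong answers are often split across several values, so this is a rough guide only.

```python
from math import comb


def majority_correct_probability(p: float, n: int) -> float:
    """Probability that more than half of n independent chains are correct.

    Assumes each chain is correct with probability p, independently of the others.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    for n in (1, 5, 15, 41):
        print(n, round(majority_correct_probability(0.6, n), 4))
```

Under these assumptions a per-chain accuracy of 0.6 gives 0.6 for a single chain and about 0.683 for five chains, and each additional sample adds less than the previous one. When `p` is below 0.5 the same formula moves the other way, which matches the failure mode where most paths share an error.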
Limitations
In its basic form, self-consistency requires answers that can be compared and counted, such as a final number or a multiple-choice letter. For open-ended generation, where two correct answers may be worded differently, a plain majority vote does not apply cleanly.
It also multiplies inference cost by the number of samples, which can be significant for long reasoning chains. And because it aggregates only final answers, it cannot detect a shared mistake: if most sampled paths make the same systematic error, the vote still selects the wrong answer.
- Best for tasks with discrete, comparable answers.
- Inference cost scales with the number of sampled paths.
- Shared systematic errors can still win the vote.
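Even for numeric tasks, answers usually need normalizing before they can be counted, because the same value can appear as "1200", "1,200" or "$1,200.00". The sketch below is one hedged way to do that. The helper names `normalize_numeric` and `vote` are hypothetical, and the regular expression handles only plain decimal numbers, not fractions, percentages, or scientific notation.

```python
import re
from collections import Counter
from typing import Iterable, Optional


def normalize_numeric(answer: str) -> Optional[str]:
    """Map strings such as '$1,200.00' or '1200 dollars' to one canonical form."""
    cleaned = answer.replace(",", "")              # drop thousands separators
    match = re.search(r"-?\d+(?:\.\d+)?", cleaned)  # first plain decimal number
    if match is None:
        return None
    value = float(match.group())
    # Render whole numbers without a trailing '.0' so '1200.00' equals '1200'.
    return str(int(value)) if value.is_integer() else str(value)


def vote(raw_answers: Iterable[str]) -> Optional[str]:
    """Majority vote over normalized answers, ignoring unparsable ones."""
    normalized = (normalize_numeric(a) for a in raw_answers)
    counts = Counter(a for a in normalized if a is not None)
    return counts.most_common(1)[0][0] if counts else None
```

With the raw strings `["$1,200.00", "1200", "1,250", "1200 dollars"]`, no two answers match exactly, but `vote` groups three of them under `"1200"` and returns that value.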
Key takeaways
- Self-consistency samples multiple chain-of-thought paths and chooses the final answer by majority vote.
- Diversity comes from temperature sampling, and only final answers are aggregated.
- It improves accuracy on reasoning benchmarks without any model retraining.
- The cost is extra inference compute, scaling with the number of sampled paths.
- It applies best to tasks with discrete, comparable answers and can fail on shared systematic errors.