Self-Consistency (Decoding)

By Arpit Tripathi, Founder

Self-consistency is a decoding strategy that samples multiple chain-of-thought reasoning paths for the same prompt, then selects the final answer by majority vote. It improves reasoning accuracy by replacing a single greedy chain with an aggregate over diverse paths.

What is Self-Consistency?

Self-consistency is a decoding strategy that improves chain-of-thought reasoning by sampling several independent reasoning paths for one prompt and choosing the most common final answer. It was introduced by Wang and colleagues at Google in 2022 as a drop-in replacement for the single greedy decode normally used with chain-of-thought prompting.

The intuition is that a hard problem usually admits several valid lines of reasoning that all arrive at the same correct answer, while flawed reasoning tends to scatter across different wrong answers. If several sampled reasoning paths converge on the same answer, that answer is more likely to be correct. Self-consistency marginalizes over the reasoning paths: it keeps only each path's final answer and takes a majority vote.

  • Type: a decoding strategy layered on top of chain-of-thought prompting.
  • Mechanism: sample multiple paths, then majority-vote on the final answer.
  • Origin: Wang et al., Google, 2022.
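
The phrase "marginalizes over the reasoning paths" can be written out as a short derivation. The notation here is an assumption made for illustration and does not come from the text above: x is the prompt, r_i is the i-th sampled reasoning path, a_i is its final answer, and m is the number of samples.

```latex
% Assumed notation: x = prompt, (r_i, a_i) = i-th sampled path and its answer,
% m = number of sampled paths.
\[
P(a \mid x) \;=\; \sum_{r} P(r, a \mid x)
\;\approx\; \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}[a_i = a],
\qquad (r_i, a_i) \sim P(r, a \mid x)
\]
\[
a^{*} \;=\; \arg\max_{a} \sum_{i=1}^{m} \mathbb{1}[a_i = a]
\]
```

Summing over r removes the reasoning text from the picture, and the sample average is a Monte Carlo estimate of that sum. Dividing by m does not change which answer is largest, so picking the most probable answer under this estimate is the same as taking a plain majority vote.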

How it works

Instead of decoding one chain greedily, self-consistency uses temperature sampling to generate a diverse set of reasoning chains for the same prompt. Each chain ends in a candidate answer. The reasoning text is then discarded and the candidate answers are aggregated, with the most frequent answer chosen as the final output.

Sampling temperature controls diversity. Too low and the chains collapse to near-identical reasoning, removing the benefit; too high and the chains become noisy. The number of sampled paths trades accuracy against cost: more samples generally raise accuracy but with diminishing returns, so practitioners pick a budget such as 5 to 40 samples depending on the task.
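
A back-of-the-envelope calculation can illustrate why returns diminish. The sketch below rests on strong simplifying assumptions that are not from the source: paths are independent, there are only two possible answers, and every path is correct with the same probability p. The function name majority_vote_accuracy is hypothetical. Under those assumptions, the vote is correct whenever more than half of an odd number of paths are correct, which is a binomial tail.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n independent paths is correct.

    Assumes a binary answer space, an odd n (so the vote cannot tie), and
    that each path is correct independently with probability p. Real sampled
    chains are correlated, so this is an illustration, not a prediction.
    """
    if n % 2 == 0:
        raise ValueError("use an odd n so the vote cannot tie")
    need = n // 2 + 1  # smallest number of correct paths that wins the vote
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


if __name__ == "__main__":
    # Illustrative per-path accuracy of 0.6 (an assumed value).
    for n in (1, 3, 5, 11, 21, 41):
        print(f"n={n:>2}  accuracy={majority_vote_accuracy(0.6, n):.3f}")
```

In this toy model, going from 1 to 3 paths adds more accuracy than going from 3 to 5, and the curve keeps flattening as n grows. The model only helps when p is above 0.5. Below that, adding paths makes the vote worse.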

answer* = argmax_a Σ_i 1[ a_i = a ]
The selected answer is the one that appears most often across the sampled reasoning paths. The indicator 1[a_i = a] counts how many paths produced answer a.
python
import re
from collections import Counter
from typing import Optional

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = "Q: A store had 24 apples, sold 9, then got 15 more. How many now? Let's think step by step."

def extract_final_answer(text: str) -> Optional[int]:
    # Heuristic: treat the last integer in the completion as the final answer.
    # Returns None when the completion contains no integer at all.
    nums = re.findall(r"-?\d+", text)
    return int(nums[-1]) if nums else None

answers = []
for _ in range(10):  # sample 10 reasoning paths
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,  # non-zero temperature gives diversity across chains
    )
    text = resp.choices[0].message.content or ""
    answer = extract_final_answer(text)
    if answer is not None:  # unparsable chains must not cast a vote
        answers.append(answer)

# Majority vote over the parsed answers; None if no chain produced one.
final = Counter(answers).most_common(1)[0][0] if answers else None
print(final)
Self-consistency samples several chain-of-thought completions at non-zero temperature, then returns the most common final answer.
  • Diversity comes from temperature sampling rather than greedy decoding.
  • Only final answers are aggregated; intermediate reasoning is discarded.
  • Accuracy rises with more samples but shows diminishing returns.

Why it improves reasoning

A single chain-of-thought decode can be derailed by one bad step, and greedy decoding commits to that path with no recovery. Self-consistency hedges across many paths, so an occasional faulty chain is outvoted by the correct majority. Wang et al. reported accuracy gains over plain chain-of-thought prompting across arithmetic and commonsense reasoning benchmarks, including an absolute improvement of about 17.9 percentage points on GSM8K.

The method needs no extra training, no verifier, and no changes to the model. It only adds inference cost, since it generates several completions per question and aggregates their answers. That trade makes self-consistency an early and widely cited example of test-time compute, where spending more computation at inference buys higher accuracy.

  • Outvoting faulty chains makes reasoning more reliable than a single greedy decode.
  • No training or verifier required, only added inference compute.
  • An early example of trading test-time compute for accuracy.
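
A toy simulation can make the outvoting argument concrete. The sketch below does not call a model. The function sample_answer is a hypothetical stand-in for one sampled chain, and the numbers (60% per-path accuracy, four distinct wrong answers, 15 paths per question) are illustrative assumptions. It also assumes the paths are independent, which real chains do not guarantee.

```python
import random
from collections import Counter

CORRECT = 30                     # correct answer in this toy setup
WRONG = (15, 18, 33, 48)         # assumed wrong answers that errors scatter over


def sample_answer(rng: random.Random, p_correct: float) -> int:
    """Toy stand-in for one sampled chain: right with probability p_correct,
    otherwise one of several different wrong answers chosen uniformly."""
    if rng.random() < p_correct:
        return CORRECT
    return rng.choice(WRONG)


def vote(answers):
    """Most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def simulate(p_correct=0.6, n_paths=15, trials=2000, seed=0):
    """Return (single-path accuracy, majority-vote accuracy) over many trials."""
    rng = random.Random(seed)
    single_hits = vote_hits = 0
    for _ in range(trials):
        answers = [sample_answer(rng, p_correct) for _ in range(n_paths)]
        single_hits += answers[0] == CORRECT      # accuracy of one chain alone
        vote_hits += vote(answers) == CORRECT     # accuracy after voting
    return single_hits / trials, vote_hits / trials


if __name__ == "__main__":
    single, voted = simulate()
    print(f"single path: {single:.3f}   majority vote: {voted:.3f}")
```

Because the wrong answers are split four ways while the correct answer is not, the correct answer almost always ends up with the most votes in this setup, even though any single path is wrong 40% of the time.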

Limitations

Self-consistency works only when answers can be compared and counted, such as a final number or a multiple-choice letter. For open-ended generation, where two correct answers may be worded differently, a plain majority vote does not apply cleanly.

It also multiplies inference cost by the number of samples, which can be significant for long reasoning chains. And because it aggregates only final answers, a problem where most sampled paths share the same systematic error will still vote for the wrong answer.

  • Best for tasks with discrete, comparable answers.
  • Inference cost scales with the number of sampled paths.
  • Shared systematic errors can still win the vote.
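
The requirement that answers be comparable often comes down to normalization: "30", "30.0" and "$30" should count as the same vote. The following is a minimal sketch of that step for numeric answers. The helper names normalize_answer and vote are hypothetical, and the last-number-in-the-text rule is a heuristic assumption that will not suit every task.

```python
import re
from collections import Counter
from typing import Iterable, Optional


def normalize_answer(raw: str) -> Optional[str]:
    """Map differently formatted numeric answers to one canonical string.

    Takes the last number in the text, drops thousands separators, and
    renders whole numbers without a decimal part. Returns None if the
    text contains no number.
    """
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", raw)
    if not matches:
        return None
    value = float(matches[-1].replace(",", ""))
    return str(int(value)) if value.is_integer() else str(value)


def vote(raw_answers: Iterable[str]) -> Optional[str]:
    """Majority vote over normalized answers, ignoring unparsable ones."""
    normalized = (normalize_answer(r) for r in raw_answers)
    counts = Counter(a for a in normalized if a is not None)
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    samples = ["The answer is 30.", "30.0 apples", "$30", "So we get 28", "no number here"]
    print(vote(samples))
```

Without the normalization step, the three correct completions above would be counted as three different answers. Free-form answers need a different notion of equivalence, which is why plain voting does not transfer cleanly to open-ended generation.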

Key takeaways

  • Self-consistency samples multiple chain-of-thought paths and chooses the final answer by majority vote.
  • Diversity comes from temperature sampling, and only final answers are aggregated.
  • It improves accuracy on reasoning benchmarks without any model retraining.
  • The cost is extra inference compute, scaling with the number of sampled paths.
  • It applies best to tasks with discrete, comparable answers and can fail on shared systematic errors.

Frequently asked questions

What is self-consistency?
Self-consistency is a decoding strategy that samples several chain-of-thought reasoning paths for the same prompt and selects the final answer by majority vote. Aggregating diverse paths makes reasoning more reliable than a single greedy decode.

How does self-consistency improve reasoning?
A single chain can be derailed by one bad step. Self-consistency generates many chains at non-zero temperature, discards the reasoning, and votes on the answers, so occasional faulty chains are outvoted by the correct majority.

Does self-consistency require training or fine-tuning?
No. It is a pure inference-time method that needs no training, fine-tuning, or verifier. It only changes how outputs are decoded and aggregated, at the cost of generating several completions per question.

What temperature does self-consistency need?
It needs non-zero temperature, often around 0.5 to 0.7, so the sampled chains are diverse. Too low a temperature collapses the chains to near-identical reasoning, and too high a temperature makes them noisy and less accurate.

What are the limitations of self-consistency?
It works only when answers can be compared and counted, such as numbers or multiple-choice letters. It multiplies inference cost by the sample count, and if most sampled paths share the same systematic error the vote can still be wrong.