Authors: Xuezhi Wang, Jason Wei, Dale Schuurmans et al. Year: 2022 (published at ICLR 2023) Source: arXiv:2203.11171
Instead of taking a language model’s first answer to a reasoning question, you can get significantly better results by asking it to solve the problem multiple times and picking the answer that comes up most often.
Large language models (LLMs) – neural networks trained on massive text corpora to predict the next word – struggle with tasks that require multi-step reasoning, such as arithmetic word problems or commonsense inference. In 2022, Wei et al. introduced chain-of-thought prompting (CoT), a technique where you show the model a few worked examples that include step-by-step reasoning, and the model learns to produce similar reasoning steps before giving its final answer. This was a significant improvement over just asking for the answer directly.
However, CoT prompting relied on greedy decoding, a strategy where the model always picks the single most probable next word at each step. Think of it like writing an essay by always choosing the most obvious next word: you get a coherent result, but you are locked into one line of thinking. If the model makes a mistake early in its reasoning chain – say, misinterpreting part of a word problem – greedy decoding has no way to recover. The model commits to that one path and arrives at whatever (possibly wrong) answer it leads to.
The core issue is that greedy decoding produces exactly one reasoning path. If that path contains an error, the final answer is wrong, and there is no fallback. Prior approaches to this problem required training additional models (called “verifiers” or “re-rankers”) on labeled data to evaluate the quality of generated solutions, which is expensive and task-specific.
Imagine you are lost in an unfamiliar city and you ask five different people for directions to the train station. Three of them tell you to go north, one says east, and one says west. Even though you cannot verify any single person’s directions, you would reasonably follow the majority and head north. The more people who independently agree on the same answer, the more confident you should be.
Self-consistency applies exactly this intuition to language model reasoning. Instead of generating a single chain of thought (greedy decoding), the method samples multiple reasoning paths from the model’s decoder – each one potentially taking a different approach to the problem – and then takes a majority vote over the final answers. The key insight is that correct reasoning paths, even when they take different routes, tend to converge on the same answer, while incorrect reasoning paths are more likely to scatter across different wrong answers.
What makes this especially elegant is its simplicity. Self-consistency requires no additional training, no extra models, no human annotations, and no fine-tuning. It works with any pre-trained language model that supports chain-of-thought prompting. The only cost is computational: you run the model multiple times (typically 5-40 times) instead of once. The authors call this a “self-ensemble” because it gets the benefits of an ensemble (combining multiple predictions) while using only a single model.
Figure 1: The self-consistency method replaces greedy decoding with sample-and-marginalize. (1) Prompt the model with chain-of-thought exemplars. (2) Sample multiple diverse reasoning paths from the model’s decoder. (3) Marginalize over the reasoning paths by taking a majority vote on the final answers. Different reasoning paths may arrive at the same correct answer through different logic.
Self-consistency is a decoding strategy, not a new model architecture. It operates as a three-step pipeline on top of any existing language model:
Step 1: Prompt with chain-of-thought exemplars. You give the model a prompt containing a few worked examples that show step-by-step reasoning. For instance, for math word problems, you might show 8 examples where each question is followed by a detailed solution ending with “The answer is X.” This is identical to standard CoT prompting.
Step 2: Sample diverse reasoning paths. Instead of using greedy decoding (always taking the most probable token), you use stochastic sampling – the model randomly selects tokens according to their probability distribution, introducing controlled randomness. This means each time you run the model on the same question, it might produce a different reasoning chain. You do this \(m\) times (the paper uses \(m = 40\) in most experiments) to collect a set of reasoning paths, each ending with a final answer. The sampling uses standard techniques like temperature sampling (a parameter \(T\) that controls randomness – higher values produce more diverse outputs) and top-k sampling (only considering the \(k\) most probable tokens at each step).
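To make the sampling mechanics concrete, here is a toy sketch of how temperature and top-k interact when drawing a single token. The function name and the tiny logit list are illustrative, not from the paper; a real decoder operates on vocabulary-sized logit tensors.

```python
import math
import random

def sample_token(logits, temperature=0.7, k=40):
    """Draw one token index using temperature + top-k sampling.

    Illustrative sketch only: `logits` is a plain list of scores,
    one per candidate token.
    """
    # Keep only the k most probable tokens.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # Draw one index in proportion to the softmax weights.
    return random.choices(top, weights=weights, k=1)[0]

random.seed(0)
token = sample_token([2.0, 1.0, 0.5, -1.0], temperature=0.7, k=2)
```

Lower `temperature` concentrates probability on the top token (greedy decoding is the limit as \(T \to 0\)), while higher values spread it out, which is exactly what produces diverse reasoning paths across runs.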
For a concrete example, consider the question: “Janet’s ducks lay 16 eggs per day. She eats 3 for breakfast and bakes muffins with 4. She sells the rest for $2 each. How much does she make daily?” Three different sampled paths might reason (illustrative examples):

- Path 1: “16 - 3 - 4 = 9 eggs are left to sell. 9 × $2 = $18. The answer is $18.”
- Path 2: “She uses 3 + 4 = 7 eggs, so she sells 16 - 7 = 9 eggs at $2 each. The answer is $18.”
- Path 3: “16 - 3 - 4 = 7 eggs are left. 7 × $2 = $14. The answer is $14.” (an arithmetic slip)
Step 3: Aggregate by majority vote. Parse the final answer from each of the \(m\) generated outputs, then count how many times each distinct answer appears. The answer that appears most frequently – the plurality vote – is the final prediction. In the example above, “18” wins with 2 votes vs. 1 for “14”.
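Putting the three steps together, a minimal sketch of the parse-and-vote stage might look like the following. The canned strings stand in for real sampled model outputs, and the “The answer is” parsing convention follows the exemplar format described in Step 1.

```python
import re
from collections import Counter

# Stand-ins for m sampled completions; in practice each would come
# from a separate stochastic decoding run on the same CoT prompt.
completions = [
    "16 - 3 - 4 = 9 eggs are sold. 9 * 2 = 18. The answer is 18.",
    "She uses 7 eggs, sells 16 - 7 = 9 at $2 each. The answer is 18.",
    "16 - 3 - 4 = 7 eggs. 7 * 2 = 14. The answer is 14.",
]

def parse_answer(text):
    """Extract the final answer following the 'The answer is' marker."""
    match = re.search(r"The answer is\s*\$?(-?\d+(?:\.\d+)?)", text)
    return match.group(1) if match else None

def self_consistency(completions):
    """Majority vote over the parsed final answers."""
    answers = [a for a in map(parse_answer, completions) if a is not None]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency(completions))  # prints 18
```

Note that only the final answers are compared; the intermediate reasoning text is discarded after parsing, which is what lets paths with different logic still vote for the same answer.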
The method is compatible with various sampling strategies. The paper tests temperature sampling (with \(T \in \{0.3, 0.5, 0.7\}\)), top-k sampling (with \(k \in \{20, 40\}\)), and nucleus sampling (selecting from the smallest set of tokens whose cumulative probability exceeds a threshold \(p\)). Performance is robust across all these choices.
An important detail: the paper also explores weighted voting, where each sampled path is weighted by how likely the model thinks that path is. However, the experiments show that simple unweighted majority vote performs just as well, because the model assigns similar probabilities to all its generated paths (both correct and incorrect ones). This is actually a finding about model calibration – the model is not good at telling which of its own reasoning paths are correct based on probability alone.
The method relies on just a few key mathematical ideas. Let us walk through them.
Equation 1: Majority Vote (Self-Consistency Criterion)
The core operation of self-consistency is selecting the answer that receives the most votes across all sampled reasoning paths:
\[a^* = \arg\max_a \sum_{i=1}^{m} \mathbb{1}(a_i = a)\]
In plain language: for each possible answer, count how many of the \(m\) sampled paths produced that answer, then pick the answer with the highest count. For a concrete example, suppose we sample \(m = 5\) paths for a math problem, and the parsed answers are \([18, 14, 18, 26, 18]\). The counts are: 18 appears 3 times, 14 appears 1 time, 26 appears 1 time. So \(a^* = 18\). This matters because it is the mechanism that filters out incorrect reasoning: errors tend to produce scattered wrong answers, while correct reasoning converges.
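A direct transcription of Equation 1, assuming the final answers have already been parsed from each path. The formula leaves ties unspecified; this sketch simply returns the first answer to reach the maximum count.

```python
def majority_vote(answers):
    """Equation 1: a* = argmax_a sum_i 1(a_i == a)."""
    counts = {}
    for a in answers:
        counts[a] = counts.get(a, 0) + 1  # the indicator sum for answer a
    return max(counts, key=counts.get)

best = majority_vote([18, 14, 18, 26, 18])  # 18, with 3 of 5 votes
```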
Equation 2: Length-Normalized Conditional Probability
When exploring weighted voting as an alternative to simple majority vote, the paper defines the probability of a specific reasoning path and answer pair, normalized by sequence length:
\[P(r_i, a_i \mid \text{prompt}, \text{question}) = \exp\left(\frac{1}{K}\sum_{k=1}^{K} \log P(t_k \mid \text{prompt}, \text{question}, t_1, \ldots, t_{k-1})\right)\]
In plain language: this computes the geometric mean of the per-token probabilities across the entire output. Without the \(\frac{1}{K}\) normalization, longer reasoning paths would always have lower total probability (since you multiply many probabilities less than 1), unfairly penalizing detailed explanations. This formula is the length-normalized version used in the “weighted sum” aggregation.
To see this concretely, suppose a 3-token output has per-token log-probabilities \([-0.5, -0.3, -0.4]\). The normalized probability is \(\exp\left(\frac{1}{3}(-0.5 + -0.3 + -0.4)\right) = \exp(-0.4) \approx 0.67\). A 10-token output might have the same average log-probability per token but a much lower total product; normalization ensures both are compared fairly.
This matters because it shows why normalization is critical: the unnormalized version performs much worse (Table 1 in the paper), while the normalized version performs comparably to simple majority vote.
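The normalization in Equation 2 is a one-liner once per-token log-probabilities are available (how you obtain them depends on the model API; the list below uses the worked example’s values):

```python
import math

def normalized_path_probability(token_logprobs):
    """Equation 2: the geometric mean of per-token probabilities,
    i.e. exp of the average log-probability over the output."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

short = normalized_path_probability([-0.5, -0.3, -0.4])  # exp(-0.4) ~ 0.670
long = normalized_path_probability([-0.4] * 10)          # same per-token average
```

Because both outputs have the same average log-probability per token, they receive the same score despite the tenfold difference in length, which is exactly the fairness property the normalization provides.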
Equation 3: Weighted Aggregation
When using weighted voting instead of majority vote, each answer is scored by the sum of its associated path probabilities:
\[a^* = \arg\max_a \sum_{i=1}^{m} P(r_i, a_i \mid \text{prompt}, \text{question}) \cdot \mathbb{1}(a_i = a)\]
In plain language: rather than treating all votes equally, weight each vote by how confident the model was in that particular reasoning path. An answer supported by three high-confidence paths would beat an answer supported by four low-confidence paths.
In practice, however, the paper shows this weighted version performs almost identically to the unweighted majority vote. On PaLM-540B across six benchmarks, the “unweighted sum (majority vote)” and “weighted sum (normalized)” achieve nearly identical accuracy (e.g., 74.4% vs 74.1% on GSM8K, 99.3% vs 99.3% on MultiArith). The reason is that the model assigns similar probabilities to all generated paths, meaning the weights are approximately equal anyway. This is an important practical finding: you can skip probability computation entirely and just count votes.
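For completeness, here is a sketch of Equation 3’s weighted vote; with roughly equal weights it reduces to the plain majority vote, matching the paper’s observation.

```python
from collections import defaultdict

def weighted_vote(answers, weights):
    """Equation 3: score each answer by the summed probability of the
    paths that produced it, then take the argmax."""
    scores = defaultdict(float)
    for answer, weight in zip(answers, weights):
        scores[answer] += weight
    return max(scores, key=scores.get)

# With (approximately) equal path probabilities, this coincides with
# the unweighted majority vote.
best = weighted_vote([18, 14, 18, 26, 18], [0.2] * 5)  # 18
```

With unequal weights the outcome can differ: `weighted_vote([1, 2, 2], [0.9, 0.3, 0.3])` returns `1`, since one high-confidence path (0.9) outscores two low-confidence ones (0.6 combined).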
Self-consistency delivers consistent improvements across every model and every benchmark tested. Here are the headline results on arithmetic reasoning with PaLM-540B (the largest model tested, with 540 billion parameters):
| Task | CoT Prompting | Self-Consistency | Gain |
|---|---|---|---|
| GSM8K | 56.5% | 74.4% | +17.9% |
| MultiArith | 94.7% | 99.3% | +4.6% |
| AQuA | 35.8% | 48.3% | +12.5% |
| SVAMP | 79.0% | 86.6% | +7.6% |
| AddSub | 91.9% | 93.7% | +1.8% |
| ASDiv | 74.0% | 81.9% | +7.9% |
The GSM8K result is particularly striking: a +17.9% absolute improvement on a challenging grade-school math benchmark, achieved purely through a decoding strategy change with no additional training. With GPT-3 (code-davinci-002), self-consistency achieves 78.0% on GSM8K, compared to 60.1% for CoT prompting – again a +17.9% gain.
On commonsense reasoning, the gains are also substantial:
| Task | CoT Prompting | Self-Consistency | Gain |
|---|---|---|---|
| StrategyQA | 75.3% | 81.6% | +6.3% |
| CommonsenseQA | 79.0% | 80.7% | +1.7% |
| ARC-challenge | 85.2% | 88.7% | +3.5% |
| ARC-easy | 95.3% | 96.4% | +1.1% |
A key finding is that gains increase with model scale. On the smaller UL2-20B model, improvements are typically 3-7% absolute. On LaMDA-137B, they jump to 5-24%. On PaLM-540B and GPT-3, improvements range from 2-18%. This suggests that larger models generate more diverse and higher-quality reasoning paths, giving majority vote more signal to work with.
Self-consistency outperforms every alternative decoding strategy tested in the paper, including sample-and-rank (sampling multiple outputs and keeping the highest-probability one), beam search decoding, and prompt-based ensembles such as permuting the order of exemplars or using multiple sets of exemplars.
The paper also shows that self-consistency helps even when CoT prompting hurts compared to standard prompting. On tasks like ANLI-R1, e-SNLI, and RTE, adding chain-of-thought rationales actually decreases performance with greedy decoding, but self-consistency recovers and surpasses standard prompting performance (e.g., ANLI-R1: standard prompting 69.1%, CoT 68.8%, self-consistency 78.5%).
Self-consistency became one of the most widely adopted inference-time techniques in the language model ecosystem. Its core insight – that you can trade compute at inference time for accuracy, without any training – influenced a broad family of methods often called “inference-time compute scaling” or “test-time compute.” The idea that spending more computation during inference (rather than during training) can substantially improve performance became a recurring theme in subsequent research.
The technique is now a standard component in LLM evaluation pipelines and reasoning benchmarks. When researchers report “pass@k” metrics or use majority voting in coding benchmarks (like HumanEval), they are applying the same principle. The method also influenced the development of more sophisticated multi-path reasoning techniques that extend the idea from linear chains to branching search trees over reasoning steps.
Self-consistency also highlighted an important property of language models: their outputs are stochastic processes that can be meaningfully aggregated. This shifted the field’s perspective on decoding from “find the single best output” to “sample a population of outputs and extract signal from the distribution.” The paper’s finding that consistency correlates with accuracy also opened the door to using self-consistency as an uncertainty estimator – if the model’s sampled answers are scattered, it is likely uncertain about the correct answer.
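One simple way to operationalize that idea (a sketch, not a method prescribed by the paper) is to report the plurality answer together with the fraction of samples that agree with it:

```python
from collections import Counter

def vote_confidence(answers):
    """Return (plurality answer, agreement fraction) as a rough,
    illustrative consistency-based confidence score."""
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / len(answers)

# Tight agreement -> high confidence; scattered answers -> low confidence.
consistent = vote_confidence([18, 18, 18, 18, 14])  # (18, 0.8)
scattered = vote_confidence([18, 14, 26, 7, 42])    # agreement 0.2
```

A low agreement fraction flags questions where the model is likely uncertain, which can be used to trigger fallbacks such as more samples or human review.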
To fully understand this paper, the reader should be familiar with: