Authors: Xuezhi Wang, Jason Wei, Dale Schuurmans et al. Year: 2022 (published at ICLR 2023) Source: arXiv:2203.11171
Instead of taking a language model’s first answer to a reasoning question, you can get significantly better results by asking it to solve the problem multiple times and picking the answer that comes up most often.
Large language models (LLMs) – neural networks trained on massive text corpora to predict the next word – struggle with tasks that require multi-step reasoning, such as arithmetic word problems or commonsense inference. In 2022, Wei et al. introduced chain-of-thought prompting (CoT), a technique where you show the model a few worked examples that include step-by-step reasoning, and the model learns to produce similar reasoning steps before giving its final answer. This was a significant improvement over just asking for the answer directly.
However, CoT prompting relied on greedy decoding, a strategy where the model always picks the single most probable next word at each step. Think of it like writing an essay by always choosing the most obvious next word: you get a coherent result, but you are locked into one line of thinking. If the model makes a mistake early in its reasoning chain – say, misinterpreting part of a word problem – greedy decoding has no way to recover. The model commits to that one path and arrives at whatever (possibly wrong) answer it leads to.
The core issue is that greedy decoding produces exactly one reasoning path. If that path contains an error, the final answer is wrong, and there is no fallback. Prior approaches to this problem required training additional models (called “verifiers” or “re-rankers”) on labeled data to evaluate the quality of generated solutions, which is expensive and task-specific.
Imagine you are lost in an unfamiliar city and you ask five different people for directions to the train station. Three of them tell you to go north, one says east, and one says west. Even though you cannot verify any single person’s directions, you would reasonably follow the majority and head north. The more people who independently agree on the same answer, the more confident you should be.
Self-consistency applies exactly this intuition to language model reasoning. Instead of generating a single chain of thought (greedy decoding), the method samples multiple reasoning paths from the model’s decoder – each one potentially taking a different approach to the problem – and then takes a majority vote over the final answers. The key insight is that correct reasoning paths, even when they take different routes, tend to converge on the same answer, while incorrect reasoning paths are more likely to scatter across different wrong answers.
What makes this especially elegant is its simplicity. Self-consistency requires no additional training, no extra models, no human annotations, and no fine-tuning. It works with any pre-trained language model that supports chain-of-thought prompting. The only cost is computational: you run the model multiple times (typically 5-40 times) instead of once. The authors call this a “self-ensemble” because it gets the benefits of an ensemble (combining multiple predictions) while using only a single model.
Figure 1: The self-consistency method replaces greedy decoding with sample-and-marginalize. (1) Prompt the model with chain-of-thought exemplars. (2) Sample multiple diverse reasoning paths from the model’s decoder. (3) Marginalize over the reasoning paths by taking a majority vote on the final answers. Different reasoning paths may arrive at the same correct answer through different logic.
Self-consistency is a decoding strategy, not a new model architecture. It operates as a three-step pipeline on top of any existing language model:
Step 1: Prompt with chain-of-thought exemplars. You give the model a prompt containing a few worked examples that show step-by-step reasoning. For instance, for math word problems, you might show 8 examples where each question is followed by a detailed solution ending with “The answer is X.” This is identical to standard CoT prompting.
Step 2: Sample diverse reasoning paths. Instead of using greedy decoding (always taking the most probable token), you use stochastic sampling – the model randomly selects tokens according to their probability distribution, introducing controlled randomness. This means each time you run the model on the same question, it might produce a different reasoning chain. You do this \(m\) times (the paper uses \(m = 40\) in most experiments) to collect a set of reasoning paths, each ending with a final answer. The sampling uses standard techniques like temperature sampling (a parameter \(T\) that controls randomness – higher values produce more diverse outputs) and top-k sampling (only considering the \(k\) most probable tokens at each step).
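To make the sampling mechanics concrete, here is a toy sketch of how temperature and top-k interact when drawing a single token. The function name and the tiny logit list are illustrative, not from the paper; a real decoder operates on vocabulary-sized logit tensors.

```python
import math
import random

def sample_token(logits, temperature=0.7, k=40):
    """Draw one token index using temperature + top-k sampling.

    Illustrative sketch only: `logits` is a plain list of scores,
    one per candidate token.
    """
    # Keep only the k most probable tokens.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # Draw one index in proportion to the softmax weights.
    return random.choices(top, weights=weights, k=1)[0]

random.seed(0)
token = sample_token([2.0, 1.0, 0.5, -1.0], temperature=0.7, k=2)
```

Lower `temperature` concentrates probability on the top token (greedy decoding is the limit as \(T \to 0\)), while higher values spread it out, which is exactly what produces diverse reasoning paths across runs.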
For a concrete example, consider the question: “Janet’s ducks lay 16 eggs per day. She eats 3 for breakfast and bakes muffins with 4. She sells the rest for $2 each. How much does she make daily?” Three different sampled paths might reason (illustrative examples):

- Path 1: “16 - 3 - 4 = 9 eggs are left to sell. 9 × $2 = $18. The answer is $18.”
- Path 2: “She uses 3 + 4 = 7 eggs, so she sells 16 - 7 = 9 eggs at $2 each. The answer is $18.”
- Path 3: “16 - 3 - 4 = 7 eggs are left. 7 × $2 = $14. The answer is $14.” (an arithmetic slip)
Step 3: Aggregate by majority vote. Parse the final answer from each of the \(m\) generated outputs, then count how many times each distinct answer appears. The answer that appears most frequently – the plurality vote – is the final prediction. In the example above, “18” wins with 2 votes vs. 1 for “14”.
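Putting the three steps together, a minimal sketch of the parse-and-vote stage might look like the following. The canned strings stand in for real sampled model outputs, and the “The answer is” parsing convention follows the exemplar format described in Step 1.

```python
import re
from collections import Counter

# Stand-ins for m sampled completions; in practice each would come
# from a separate stochastic decoding run on the same CoT prompt.
completions = [
    "16 - 3 - 4 = 9 eggs are sold. 9 * 2 = 18. The answer is 18.",
    "She uses 7 eggs, sells 16 - 7 = 9 at $2 each. The answer is 18.",
    "16 - 3 - 4 = 7 eggs. 7 * 2 = 14. The answer is 14.",
]

def parse_answer(text):
    """Extract the final answer following the 'The answer is' marker."""
    match = re.search(r"The answer is\s*\$?(-?\d+(?:\.\d+)?)", text)
    return match.group(1) if match else None

def self_consistency(completions):
    """Majority vote over the parsed final answers."""
    answers = [a for a in map(parse_answer, completions) if a is not None]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency(completions))  # prints 18
```

Note that only the final answers are compared; the intermediate reasoning text is discarded after parsing, which is what lets paths with different logic still vote for the same answer.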
The method is compatible with various sampling strategies. The paper tests temperature sampling (with \(T \in \{0.3, 0.5, 0.7\}\)), top-k sampling (with \(k \in \{20, 40\}\)), and nucleus sampling (selecting from the smallest set of tokens whose cumulative probability exceeds a threshold \(p\)). Performance is robust across all these choices.
An important detail: the paper also explores weighted voting, where each sampled path is weighted by how likely the model thinks that path is. However, the experiments show that simple unweighted majority vote performs just as well, because the model assigns similar probabilities to all its generated paths (both correct and incorrect ones). This is actually a finding about model calibration – the model is not good at telling which of its own reasoning paths are correct based on probability alone.
The method relies on just a few key mathematical ideas. Let us walk through them.
Equation 1: Majority Vote (Self-Consistency Criterion)
The core operation of self-consistency is selecting the answer that receives the most votes across all sampled reasoning paths:
\[a^* = \arg\max_a \sum_{i=1}^{m} \mathbb{1}(a_i = a)\]
In plain language: for each possible answer, count how many of the \(m\) sampled paths produced that answer, then pick the answer with the highest count. For a concrete example, suppose we sample \(m = 5\) paths for a math problem, and the parsed answers are \([18, 14, 18, 26, 18]\). The counts are: 18 appears 3 times, 14 appears 1 time, 26 appears 1 time. So \(a^* = 18\). This matters because it is the mechanism that filters out incorrect reasoning: errors tend to produce scattered wrong answers, while correct reasoning converges.
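A direct transcription of Equation 1, assuming the final answers have already been parsed from each path. The formula leaves ties unspecified; this sketch simply returns the first answer to reach the maximum count.

```python
def majority_vote(answers):
    """Equation 1: a* = argmax_a sum_i 1(a_i == a)."""
    counts = {}
    for a in answers:
        counts[a] = counts.get(a, 0) + 1  # the indicator sum for answer a
    return max(counts, key=counts.get)

best = majority_vote([18, 14, 18, 26, 18])  # 18, with 3 of 5 votes
```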
Equation 2: Length-Normalized Conditional Probability
When exploring weighted voting as an alternative to simple majority vote, the paper defines the probability of a specific reasoning path and answer pair, normalized by sequence length:
\[P(r_i, a_i \mid \text{prompt}, \text{question}) = \exp\left(\frac{1}{K}\sum_{k=1}^{K} \log P(t_k \mid \text{prompt}, \text{question}, t_1, \ldots, t_{k-1})\right)\]
In plain language: this computes the geometric mean of the per-token probabilities across the entire output. Without the \(\frac{1}{K}\) normalization, longer reasoning paths would always have lower total probability (since you multiply many probabilities less than 1), unfairly penalizing detailed explanations. This formula is the length-normalized version used in the “weighted sum” aggregation.
To see this concretely, suppose a 3-token output has per-token log-probabilities \([-0.5, -0.3, -0.4]\). The normalized probability is \(\exp\left(\frac{1}{3}(-0.5 + -0.3 + -0.4)\right) = \exp(-0.4) \approx 0.67\). A 10-token output might have the same average log-probability per token but a much lower total product; normalization ensures both are compared fairly.
This matters because it shows why normalization is critical: the unnormalized version performs much worse (Table 1 in the paper), while the normalized version performs comparably to simple majority vote.
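The normalization in Equation 2 is a one-liner once per-token log-probabilities are available (how you obtain them depends on the model API; the list below uses the worked example’s values):

```python
import math

def normalized_path_probability(token_logprobs):
    """Equation 2: the geometric mean of per-token probabilities,
    i.e. exp of the average log-probability over the output."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

short = normalized_path_probability([-0.5, -0.3, -0.4])  # exp(-0.4) ~ 0.670
long = normalized_path_probability([-0.4] * 10)          # same per-token average
```

Because both outputs have the same average log-probability per token, they receive the same score despite the tenfold difference in length, which is exactly the fairness property the normalization provides.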
Equation 3: Weighted Aggregation
When using weighted voting instead of majority vote, each answer is scored by the sum of its associated path probabilities:
\[a^* = \arg\max_a \sum_{i=1}^{m} P(r_i, a_i \mid \text{prompt}, \text{question}) \cdot \mathbb{1}(a_i = a)\]
In plain language: rather than treating all votes equally, weight each vote by how confident the model was in that particular reasoning path. An answer supported by three high-confidence paths would beat an answer supported by four low-confidence paths.
In practice, however, the paper shows this weighted version performs almost identically to the unweighted majority vote. On PaLM-540B across six benchmarks, the “unweighted sum (majority vote)” and “weighted sum (normalized)” achieve nearly identical accuracy (e.g., 74.4% vs 74.1% on GSM8K, 99.3% vs 99.3% on MultiArith). The reason is that the model assigns similar probabilities to all generated paths, meaning the weights are approximately equal anyway. This is an important practical finding: you can skip probability computation entirely and just count votes.
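For completeness, here is a sketch of Equation 3’s weighted vote; with roughly equal weights it reduces to the plain majority vote, matching the paper’s observation.

```python
from collections import defaultdict

def weighted_vote(answers, weights):
    """Equation 3: score each answer by the summed probability of the
    paths that produced it, then take the argmax."""
    scores = defaultdict(float)
    for answer, weight in zip(answers, weights):
        scores[answer] += weight
    return max(scores, key=scores.get)

# With (approximately) equal path probabilities, this coincides with
# the unweighted majority vote.
best = weighted_vote([18, 14, 18, 26, 18], [0.2] * 5)  # 18
```

With unequal weights the outcome can differ: `weighted_vote([1, 2, 2], [0.9, 0.3, 0.3])` returns `1`, since one high-confidence path (0.9) outscores two low-confidence ones (0.6 combined).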
Self-consistency delivers consistent improvements across every model and every benchmark tested. Here are the headline results on arithmetic reasoning with PaLM-540B (the largest model tested, with 540 billion parameters):
| Task | CoT Prompting | Self-Consistency | Gain |
|---|---|---|---|
| GSM8K | 56.5% | 74.4% | +17.9% |
| MultiArith | 94.7% | 99.3% | +4.6% |
| AQuA | 35.8% | 48.3% | +12.5% |
| SVAMP | 79.0% | 86.6% | +7.6% |
| AddSub | 91.9% | 93.7% | +1.8% |
| ASDiv | 74.0% | 81.9% | +7.9% |
The GSM8K result is particularly striking: a +17.9% absolute improvement on a challenging grade-school math benchmark, achieved purely through a decoding strategy change with no additional training. With GPT-3 (code-davinci-002), self-consistency achieves 78.0% on GSM8K, compared to 60.1% for CoT prompting – again a +17.9% gain.
On commonsense reasoning, the gains are also substantial:
| Task | CoT Prompting | Self-Consistency | Gain |
|---|---|---|---|
| StrategyQA | 75.3% | 81.6% | +6.3% |
| CommonsenseQA | 79.0% | 80.7% | +1.7% |
| ARC-challenge | 85.2% | 88.7% | +3.5% |
| ARC-easy | 95.3% | 96.4% | +1.1% |
A key finding is that gains increase with model scale. On the smaller UL2-20B model, improvements are typically 3-7% absolute. On LaMDA-137B, they jump to 5-24%. On PaLM-540B and GPT-3, improvements range from 2-18%. This suggests that larger models generate more diverse and higher-quality reasoning paths, giving majority vote more signal to work with.
Self-consistency outperforms every alternative decoding strategy tested in the paper, including sample-and-rank (sampling multiple outputs and keeping the highest-probability one), beam search decoding, and prompt-based ensembles such as permuting the order of exemplars or using multiple sets of exemplars.
The paper also shows that self-consistency helps even when CoT prompting hurts compared to standard prompting. On tasks like ANLI-R1, e-SNLI, and RTE, adding chain-of-thought rationales actually decreases performance with greedy decoding, but self-consistency recovers and surpasses standard prompting performance (e.g., ANLI-R1: standard prompting 69.1%, CoT 68.8%, self-consistency 78.5%).
Self-consistency became one of the most widely adopted inference-time techniques in the language model ecosystem. Its core insight – that you can trade compute at inference time for accuracy, without any training – influenced a broad family of methods often called “inference-time compute scaling” or “test-time compute.” The idea that spending more computation during inference (rather than during training) can substantially improve performance became a recurring theme in subsequent research.
The technique is now a standard component in LLM evaluation pipelines and reasoning benchmarks. When researchers report “pass@k” metrics or use majority voting in coding benchmarks (like HumanEval), they are applying the same principle. The method also influenced the development of more sophisticated multi-path reasoning techniques that extend the idea from linear chains to branching search trees over reasoning steps.
Self-consistency also highlighted an important property of language models: their outputs are stochastic processes that can be meaningfully aggregated. This shifted the field’s perspective on decoding from “find the single best output” to “sample a population of outputs and extract signal from the distribution.” The paper’s finding that consistency correlates with accuracy also opened the door to using self-consistency as an uncertainty estimator – if the model’s sampled answers are scattered, it is likely uncertain about the correct answer.
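One simple way to operationalize that idea (a sketch, not a method prescribed by the paper) is to report the plurality answer together with the fraction of samples that agree with it:

```python
from collections import Counter

def vote_confidence(answers):
    """Return (plurality answer, agreement fraction) as a rough,
    illustrative consistency-based confidence score."""
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / len(answers)

# Tight agreement -> high confidence; scattered answers -> low confidence.
consistent = vote_confidence([18, 18, 18, 18, 14])  # (18, 0.8)
scattered = vote_confidence([18, 14, 26, 7, 42])    # agreement 0.2
```

A low agreement fraction flags questions where the model is likely uncertain, which can be used to trigger fallbacks such as more samples or human review.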
To fully understand this paper, the reader should be familiar with: