Authors: Jason Wei, Xuezhi Wang, Dale Schuurmans et al. Year: 2022 Source: 2201.11903
Adding step-by-step reasoning examples to the instructions you give a large language model dramatically improves its ability to solve math, logic, and commonsense problems – without changing the model itself.
By 2022, large language models (LLMs) – neural networks trained on massive text corpora to predict the next word – had shown impressive abilities across many tasks. Simply scaling them up (adding more parameters and training data) improved performance on tasks like translation and question answering. But scaling alone hit a wall on tasks requiring multi-step reasoning: arithmetic word problems, commonsense logic, and symbolic manipulation.
The standard approach for getting these models to perform a task was “few-shot prompting” (sometimes called “in-context learning”), popularized by GPT-3 (see Language Models are Few-Shot Learners). You show the model a few input-output examples, and it generalizes to new inputs. For instance, you might show the model three question-answer pairs, then ask a fourth question. This works well for simple factual questions, but fails on problems that require chaining multiple reasoning steps together.
The alternative was to fine-tune a model (update its parameters) on large datasets of reasoning problems with worked solutions. This approach works, but is expensive: you need to collect thousands of hand-written solutions, and the resulting model is specialized to one task. There was no general-purpose way to unlock reasoning in an off-the-shelf language model.
Think about how a math teacher helps a struggling student. The teacher does not just show the student questions and final answers. Instead, the teacher shows their work – writing out each intermediate step: “First we calculate the cost per item, then we multiply by the quantity, then we subtract the discount.” By seeing the teacher’s reasoning process, the student learns not just what the answer is, but how to arrive at it.
Chain-of-thought (CoT) prompting applies this same idea to language models. Instead of providing few-shot examples as simple input-output pairs, you augment each example with a “chain of thought” – a sequence of intermediate natural language reasoning steps that connect the question to the answer. The model then mimics this step-by-step reasoning pattern when it encounters a new question.
Concretely, a standard prompt might look like:
```
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each.
How many tennis balls does he have now?
A: The answer is 11.
```
A chain-of-thought prompt adds the reasoning:
```
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each.
How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6
tennis balls. 5 + 6 = 11. The answer is 11.
```
The critical insight is that this requires no model modification, no fine-tuning, no additional training data beyond a handful of manually written examples. You simply change what you write in the prompt, and the model’s reasoning ability improves dramatically – but only if the model is large enough (roughly 100 billion parameters or more).
Figure 1: Standard prompting (left) directly produces an answer – often incorrect. Chain-of-thought prompting (right) includes intermediate reasoning steps (highlighted in green) that decompose the problem, leading to the correct answer. The same model and the same question are used; only the prompt exemplars differ.
Chain-of-thought prompting is not a new architecture or training procedure. It is a prompting strategy that works with any sufficiently large, pre-trained language model. The method has three components:
1. Prompt construction. For a given task, the user writes a small number of examples (typically 8) in the format \(\langle \text{input}, \text{chain of thought}, \text{output} \rangle\). The chain of thought is a natural language decomposition of the problem into intermediate steps. For math word problems, this means writing out the arithmetic reasoning. For commonsense questions, it means articulating the logical deductions. For symbolic tasks, it means spelling out each manipulation step.
2. Inference. At test time, the prompt (containing the few-shot exemplars with chains of thought) is prepended to a new question. The language model generates a completion that – because of the pattern established by the exemplars – includes its own chain of thought before producing a final answer. The final answer is extracted from the end of the generated text (typically following a phrase like “The answer is”).
3. No training involved. The model’s parameters remain frozen. The same model checkpoint can be used for arithmetic, commonsense, and symbolic reasoning tasks – only the prompt changes. This is the key advantage over fine-tuning approaches: one model, many tasks, no gradient updates.
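Steps 1 and 2 amount to string assembly plus answer extraction, which can be sketched in a few lines of Python. The exemplar triple, helper names, and simulated completion below are illustrative, not from the paper:

```python
import re

# One <input, chain of thought, output> exemplar (the Roger example above).
EXEMPLARS = [
    ("Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. "
     "How many tennis balls does he have now?",
     "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
     "tennis balls. 5 + 6 = 11.",
     "11"),
]

def build_cot_prompt(exemplars, test_question):
    """Prepend few-shot exemplars, each ending in 'The answer is ...'."""
    blocks = [f"Q: {q}\nA: {cot} The answer is {a}." for q, cot, a in exemplars]
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)

def extract_answer(completion):
    """Take the text after the last 'The answer is' marker in the completion."""
    matches = re.findall(r"The answer is\s*([^.\n]+)", completion)
    return matches[-1].strip() if matches else None

prompt = build_cot_prompt(EXEMPLARS, "A baker fills 3 trays with 12 rolls each. "
                                     "How many rolls is that?")
# A frozen LLM would now complete `prompt`; we simulate its output here.
completion = "3 trays of 12 rolls each is 3 * 12 = 36 rolls. The answer is 36."
print(extract_answer(completion))  # -> 36
```

Taking the last match guards against the marker phrase appearing inside an exemplar or mid-reasoning; the model's own final answer comes last in the completion.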
The authors evaluated this method on three categories of reasoning:
They tested across five model families: GPT-3 (350M to 175B parameters), LaMDA (422M to 137B), PaLM (8B to 540B), UL2 (20B), and Codex. All models used greedy decoding (selecting the most probable token at each step, with no randomness).
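Greedy decoding can be stated precisely: at each step \(t\), the model emits the single most probable token given everything generated so far,

\[y_t = \arg\max_{v \in \mathcal{V}} P(v \mid y_{<t}, \text{prompt})\]

where \(\mathcal{V}\) is the vocabulary. This makes the generated chains of thought deterministic and reproducible for a given prompt.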
This paper is empirical rather than mathematical – it introduces no new equations, loss functions, or optimization objectives. The method operates entirely through the existing text generation mechanism of autoregressive language models. However, the underlying framework can be expressed formally to clarify what chain-of-thought prompting actually changes.
Standard few-shot prompting. An autoregressive language model generates text one token at a time, where each token’s probability depends on all preceding tokens. Given a prompt consisting of \(k\) exemplars and a test input, the model produces an output by sampling from:
\[P(y \mid x_{\text{test}}, \{(x_i, y_i)\}_{i=1}^{k})\]
where \(x_i\) is the \(i\)-th example input, \(y_i\) is its corresponding output, and \(x_{\text{test}}\) is the new question. The model generates the output \(y\) token by token, with each token conditioned on everything that came before it. In standard prompting, \(y_i\) is just the final answer (e.g., “11”).
This formalizes what “few-shot prompting” means: the model sees examples and generalizes, without any parameter updates.
Chain-of-thought prompting. The only change is that each exemplar output \(y_i\) is replaced with a pair \((c_i, a_i)\) where \(c_i\) is the chain of thought (intermediate reasoning steps) and \(a_i\) is the final answer:
\[P(c, a \mid x_{\text{test}}, \{(x_i, c_i, a_i)\}_{i=1}^{k})\]
where \(c\) is the chain of thought the model generates for the test input and \(a\) is its final answer. Because the model generates tokens left-to-right, the chain of thought \(c\) is produced before the answer \(a\), meaning the model’s “reasoning” directly influences its answer.
This matters because the intermediate tokens in \(c\) effectively give the model additional “computation” – the answer \(a\) is conditioned on both the input and the reasoning steps, rather than being generated directly from the question.
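Equivalently, because generation is strictly left-to-right, the joint distribution factorizes by the chain rule (writing \(\mathcal{E} = \{(x_i, c_i, a_i)\}_{i=1}^{k}\) for the exemplar set):

\[P(c, a \mid x_{\text{test}}, \mathcal{E}) = P(c \mid x_{\text{test}}, \mathcal{E}) \, P(a \mid c, x_{\text{test}}, \mathcal{E})\]

The second factor makes it explicit that the final answer is sampled conditioned on the generated chain of thought, not directly from the question alone.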
Emergent ability as a function of scale. The paper’s most striking finding is that chain-of-thought prompting only helps above a threshold of approximately 100 billion parameters. Below this threshold, models produce fluent but logically incoherent chains of thought that actually hurt performance. This can be loosely described as: let \(N\) be the number of model parameters, and let \(\Delta(N)\) be the accuracy improvement from chain-of-thought prompting over standard prompting. The paper empirically observes:
\[\Delta(N) \approx \begin{cases} \leq 0 & \text{if } N < 10^{11} \\ > 0 \text{ (and increasing)} & \text{if } N \geq 10^{11} \end{cases}\]
This is not a precise mathematical law but an empirical observation. It matters because it shows that reasoning ability is an “emergent” property – it appears suddenly at a certain scale rather than improving gradually. This means you cannot predict whether chain-of-thought prompting will work for a given model by testing on smaller models first.
The headline result: PaLM 540B with chain-of-thought prompting achieves 56.9% accuracy on GSM8K (a benchmark of grade-school math word problems), compared to 17.9% with standard prompting – more than tripling performance. This surpassed the previous state of the art of 55%, which required fine-tuning GPT-3 on thousands of training examples with a learned verifier.
Arithmetic reasoning results (selected, PaLM 540B):
| Benchmark | Standard Prompting | Chain-of-Thought | Improvement | Prior Best (Fine-tuned) |
|---|---|---|---|---|
| GSM8K | 17.9% | 56.9% | +39.0 pp | 55.0% |
| SVAMP | 69.4% | 79.0% | +9.6 pp | 57.4% |
| MAWPS | 79.2% | 93.3% | +14.1 pp | 88.4% |
| AQuA | 25.2% | 35.8% | +10.6 pp | 37.9% |
| ASDiv | 72.1% | 73.9% | +1.8 pp | 75.3% |
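The improvement column is simply the difference of the two accuracy columns, which a short script can recompute (figures copied from the table above):

```python
# (standard prompting %, chain-of-thought %) for PaLM 540B, from the table above.
results = {
    "GSM8K": (17.9, 56.9),
    "SVAMP": (69.4, 79.0),
    "MAWPS": (79.2, 93.3),
    "AQuA":  (25.2, 35.8),
    "ASDiv": (72.1, 73.9),
}
for name, (std, cot) in results.items():
    # Difference in percentage points, rounded to one decimal place.
    print(f"{name}: +{cot - std:.1f} pp")
```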
Three key patterns emerge from the full results:
First, chain-of-thought prompting is an emergent ability of scale. For models with fewer than about 100 billion parameters, chain-of-thought prompting does not improve performance (and sometimes hurts it). Small models produce grammatically fluent but logically nonsensical reasoning chains.
Second, the harder the task, the larger the gain. GSM8K, which requires multi-step reasoning and has the lowest baseline accuracy, sees the biggest improvement. SingleOp from MAWPS, which only requires one arithmetic operation, sees negligible improvement because standard prompting already solves it well.
Third, the ablation study reveals why chain-of-thought works. The authors tested three alternatives: (a) outputting only the equation (no natural language reasoning), (b) outputting dots equal in length to the chain of thought (extra computation without reasoning), and (c) placing the chain of thought after the answer. None of these matched the full chain-of-thought approach. This rules out the hypotheses that the benefit comes from merely extracting equations, from simply using more tokens, or from activating relevant knowledge. The sequential natural language reasoning is itself doing the work.
Commonsense and symbolic reasoning showed similar patterns. On StrategyQA, PaLM 540B with CoT achieved 77.8% versus 68.6% with standard prompting. On symbolic tasks like last letter concatenation (e.g., “Amy Brown” -> “yn”), CoT enabled near-perfect in-domain performance (99.4%) and meaningful out-of-domain generalization to longer inputs (94.8% on 4-word names, despite only seeing 2-word examples).
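Last letter concatenation is trivial to specify programmatically, which is what makes it a clean probe of whether the model follows the demonstrated reasoning pattern. A reference implementation (the 4-word name below is an invented example, not from the paper):

```python
def last_letter_concat(name):
    """Concatenate the last letter of each word (the paper's symbolic task)."""
    return "".join(word[-1] for word in name.split())

print(last_letter_concat("Amy Brown"))                        # -> yn
# Out-of-domain length-generalization probe (hypothetical 4-word name):
print(last_letter_concat("John Jacob Jingleheimer Schmidt"))  # -> nbrt
```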
Chain-of-thought prompting is one of the most influential papers in the prompt engineering era of AI. It demonstrated that the right prompt design could unlock capabilities that appeared absent from language models – shifting the research focus from “how to train better models” to “how to better use existing models.” The paper’s core insight that intermediate reasoning steps improve model outputs became foundational to modern AI engineering.
The paper spawned an enormous body of follow-up work. Later research improved CoT by sampling multiple reasoning paths and selecting the most consistent answer, extended the linear chain to branching and networked reasoning structures, and showed that simply appending “Let’s think step by step” to a prompt – without any few-shot examples – could trigger chain-of-thought reasoning. These developments collectively established “reasoning via prompting” as a core research direction.
Beyond academia, chain-of-thought prompting became standard practice in production AI systems. Modern LLM applications routinely use system prompts that instruct the model to “think step by step” or “show your reasoning.” OpenAI’s o1 model family and similar “reasoning models” from other providers internalize chain-of-thought as part of the model’s generation process, producing hidden or visible reasoning traces before answering. The paper’s finding that reasoning is an emergent ability of scale also influenced decisions to build ever-larger models, contributing to the scaling race in the AI industry.
To understand this paper, you need: