Authors: Jason Wei, Xuezhi Wang, Dale Schuurmans et al. Year: 2022 Source: 2201.11903
Adding step-by-step reasoning examples to the instructions you give a large language model dramatically improves its ability to solve math, logic, and commonsense problems – without changing the model itself.
By 2022, large language models (LLMs) – neural networks trained on massive text corpora to predict the next word – had shown impressive abilities across many tasks. Simply scaling them up (adding more parameters and training data) improved performance on tasks like translation and question answering. But scaling alone hit a wall on tasks requiring multi-step reasoning: arithmetic word problems, commonsense logic, and symbolic manipulation.
The standard approach for getting these models to perform a task was “few-shot prompting” (sometimes called “in-context learning”), popularized by GPT-3 (see Language Models are Few-Shot Learners). You show the model a few input-output examples, and it generalizes to new inputs. For instance, you might show the model three question-answer pairs, then ask a fourth question. This works well for simple factual questions, but fails on problems that require chaining multiple reasoning steps together.
The alternative was to fine-tune a model (update its parameters) on large datasets of reasoning problems with worked solutions. This approach works, but is expensive: you need to collect thousands of hand-written solutions, and the resulting model is specialized to one task. There was no general-purpose way to unlock reasoning in an off-the-shelf language model.
Think about how a math teacher helps a struggling student. The teacher does not just show the student questions and final answers. Instead, the teacher shows their work – writing out each intermediate step: “First we calculate the cost per item, then we multiply by the quantity, then we subtract the discount.” By seeing the teacher’s reasoning process, the student learns not just what the answer is, but how to arrive at it.
Chain-of-thought (CoT) prompting applies this same idea to language models. Instead of providing few-shot examples as simple input-output pairs, you augment each example with a “chain of thought” – a sequence of intermediate natural language reasoning steps that connect the question to the answer. The model then mimics this step-by-step reasoning pattern when it encounters a new question.
Concretely, a standard prompt might look like:
```
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each.
How many tennis balls does he have now?
A: The answer is 11.
```
A chain-of-thought prompt adds the reasoning:
```
Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each.
How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6
tennis balls. 5 + 6 = 11. The answer is 11.
```
The critical insight is that this requires no model modification, no fine-tuning, no additional training data beyond a handful of manually written examples. You simply change what you write in the prompt, and the model’s reasoning ability improves dramatically – but only if the model is large enough (roughly 100 billion parameters or more).
Figure 1: Standard prompting (left) directly produces an answer – often incorrect. Chain-of-thought prompting (right) includes intermediate reasoning steps (highlighted in green) that decompose the problem, leading to the correct answer. The same model and the same question are used; only the prompt exemplars differ.
Chain-of-thought prompting is not a new architecture or training procedure. It is a prompting strategy that works with any sufficiently large, pre-trained language model. The method has three components:
1. Prompt construction. For a given task, the user writes a small number of examples (typically 8) in the format \(\langle \text{input}, \text{chain of thought}, \text{output} \rangle\). The chain of thought is a natural language decomposition of the problem into intermediate steps. For math word problems, this means writing out the arithmetic reasoning. For commonsense questions, it means articulating the logical deductions. For symbolic tasks, it means spelling out each manipulation step.
2. Inference. At test time, the prompt (containing the few-shot exemplars with chains of thought) is prepended to a new question. The language model generates a completion that – because of the pattern established by the exemplars – includes its own chain of thought before producing a final answer. The final answer is extracted from the end of the generated text (typically following a phrase like “The answer is”).
3. No training involved. The model’s parameters remain frozen. The same model checkpoint can be used for arithmetic, commonsense, and symbolic reasoning tasks – only the prompt changes. This is the key advantage over fine-tuning approaches: one model, many tasks, no gradient updates.
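Steps 1 and 2 amount to string assembly plus answer extraction, which can be sketched in a few lines of Python. The exemplar triple, helper names, and simulated completion below are illustrative, not from the paper:

```python
import re

# One <input, chain of thought, output> exemplar (the Roger example above).
EXEMPLARS = [
    ("Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. "
     "How many tennis balls does he have now?",
     "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
     "tennis balls. 5 + 6 = 11.",
     "11"),
]

def build_cot_prompt(exemplars, test_question):
    """Prepend few-shot exemplars, each ending in 'The answer is ...'."""
    blocks = [f"Q: {q}\nA: {cot} The answer is {a}." for q, cot, a in exemplars]
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)

def extract_answer(completion):
    """Take the text after the last 'The answer is' marker in the completion."""
    matches = re.findall(r"The answer is\s*([^.\n]+)", completion)
    return matches[-1].strip() if matches else None

prompt = build_cot_prompt(EXEMPLARS, "A baker fills 3 trays with 12 rolls each. "
                                     "How many rolls is that?")
# A frozen LLM would now complete `prompt`; we simulate its output here.
completion = "3 trays of 12 rolls each is 3 * 12 = 36 rolls. The answer is 36."
print(extract_answer(completion))  # -> 36
```

Taking the last match guards against the marker phrase appearing inside an exemplar or mid-reasoning; the model's own final answer comes last in the completion.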
The authors evaluated this method on three categories of reasoning:
They tested across five model families: GPT-3 (350M to 175B parameters), LaMDA (422M to 137B), PaLM (8B to 540B), UL2 (20B), and Codex. All models used greedy decoding (selecting the most probable token at each step, with no randomness).
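Greedy decoding can be stated precisely: at each step \(t\), the model emits the single most probable token given everything generated so far,

\[y_t = \arg\max_{v \in \mathcal{V}} P(v \mid y_{<t}, \text{prompt})\]

where \(\mathcal{V}\) is the vocabulary. This makes the generated chains of thought deterministic and reproducible for a given prompt.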
This paper is empirical rather than mathematical – it introduces no new equations, loss functions, or optimization objectives. The method operates entirely through the existing text generation mechanism of autoregressive language models. However, the underlying framework can be expressed formally to clarify what chain-of-thought prompting actually changes.
Standard few-shot prompting. An autoregressive language model generates text one token at a time, where each token’s probability depends on all preceding tokens. Given a prompt consisting of \(k\) exemplars and a test input, the model produces an output by sampling from:
\[P(y \mid x_{\text{test}}, \{(x_i, y_i)\}_{i=1}^{k})\]
where \(x_i\) is the \(i\)-th example input, \(y_i\) is its corresponding output, and \(x_{\text{test}}\) is the new question. The model generates the output \(y\) token by token, with each token conditioned on everything that came before it. In standard prompting, \(y_i\) is just the final answer (e.g., “11”).
This formalizes what “few-shot prompting” means: the model sees examples and generalizes, without any parameter updates.
Chain-of-thought prompting. The only change is that each exemplar output \(y_i\) is replaced with a pair \((c_i, a_i)\) where \(c_i\) is the chain of thought (intermediate reasoning steps) and \(a_i\) is the final answer:
\[P(c, a \mid x_{\text{test}}, \{(x_i, c_i, a_i)\}_{i=1}^{k})\]
where \(c\) is the chain of thought the model generates for the test input and \(a\) is its final answer. Because the model generates tokens left-to-right, the chain of thought \(c\) is produced before the answer \(a\), meaning the model’s “reasoning” directly influences its answer.
This matters because the intermediate tokens in \(c\) effectively give the model additional “computation” – the answer \(a\) is conditioned on both the input and the reasoning steps, rather than being generated directly from the question.
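Equivalently, because generation is strictly left-to-right, the joint distribution factorizes by the chain rule (writing \(\mathcal{E} = \{(x_i, c_i, a_i)\}_{i=1}^{k}\) for the exemplar set):

\[P(c, a \mid x_{\text{test}}, \mathcal{E}) = P(c \mid x_{\text{test}}, \mathcal{E}) \, P(a \mid c, x_{\text{test}}, \mathcal{E})\]

The second factor makes it explicit that the final answer is sampled conditioned on the generated chain of thought, not directly from the question alone.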
Emergent ability as a function of scale. The paper’s most striking finding is that chain-of-thought prompting only helps above a threshold of approximately 100 billion parameters. Below this threshold, models produce fluent but logically incoherent chains of thought that actually hurt performance. This can be loosely described as: let \(N\) be the number of model parameters, and let \(\Delta(N)\) be the accuracy improvement from chain-of-thought prompting over standard prompting. The paper empirically observes:
\[\Delta(N) \approx \begin{cases} \leq 0 & \text{if } N < 10^{11} \\ > 0 \text{ (and increasing)} & \text{if } N \geq 10^{11} \end{cases}\]
This is not a precise mathematical law but an empirical observation. It matters because it shows that reasoning ability is an “emergent” property – it appears suddenly at a certain scale rather than improving gradually. This means you cannot predict whether chain-of-thought prompting will work for a given model by testing on smaller models first.
The headline result: PaLM 540B with chain-of-thought prompting achieves 56.9% accuracy on GSM8K (a benchmark of grade-school math word problems), compared to 17.9% with standard prompting – more than tripling performance. This surpassed the previous state of the art of 55%, which required fine-tuning GPT-3 on thousands of training examples with a learned verifier.
Arithmetic reasoning results (selected, PaLM 540B):
| Benchmark | Standard Prompting | Chain-of-Thought | Improvement | Prior Best (Fine-tuned) |
|---|---|---|---|---|
| GSM8K | 17.9% | 56.9% | +39.0 pp | 55.0% |
| SVAMP | 69.4% | 79.0% | +9.6 pp | 57.4% |
| MAWPS | 79.2% | 93.3% | +14.1 pp | 88.4% |
| AQuA | 25.2% | 35.8% | +10.6 pp | 37.9% |
| ASDiv | 72.1% | 73.9% | +1.8 pp | 75.3% |
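The improvement column is simply the difference of the two accuracy columns, which a short script can recompute (figures copied from the table above):

```python
# (standard prompting %, chain-of-thought %) for PaLM 540B, from the table above.
results = {
    "GSM8K": (17.9, 56.9),
    "SVAMP": (69.4, 79.0),
    "MAWPS": (79.2, 93.3),
    "AQuA":  (25.2, 35.8),
    "ASDiv": (72.1, 73.9),
}
for name, (std, cot) in results.items():
    # Difference in percentage points, rounded to one decimal place.
    print(f"{name}: +{cot - std:.1f} pp")
```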
Three key patterns emerge from the full results:
First, chain-of-thought prompting is an emergent ability of scale. For models with fewer than about 100 billion parameters, chain-of-thought prompting does not improve performance (and sometimes hurts it). Small models produce grammatically fluent but logically nonsensical reasoning chains.
Second, the harder the task, the larger the gain. GSM8K, which requires multi-step reasoning and has the lowest baseline accuracy, sees the biggest improvement. SingleOp from MAWPS, which only requires one arithmetic operation, sees negligible improvement because standard prompting already solves it well.
Third, the ablation study reveals why chain-of-thought works. The authors tested three alternatives: (a) outputting only the equation (no natural language reasoning), (b) outputting dots equal in length to the chain of thought (extra computation without reasoning), and (c) placing the chain of thought after the answer. None of these matched the full chain-of-thought approach. This rules out the hypotheses that the benefit comes from merely extracting equations, from simply using more tokens, or from activating relevant knowledge. The sequential natural language reasoning is itself doing the work.
Commonsense and symbolic reasoning showed similar patterns. On StrategyQA, PaLM 540B with CoT achieved 77.8% versus 68.6% with standard prompting. On symbolic tasks like last letter concatenation (e.g., “Amy Brown” -> “yn”), CoT enabled near-perfect in-domain performance (99.4%) and meaningful out-of-domain generalization to longer inputs (94.8% on 4-word names, despite only seeing 2-word examples).
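Last letter concatenation is trivial to specify programmatically, which is what makes it a clean probe of whether the model follows the demonstrated reasoning pattern. A reference implementation (the 4-word name below is an invented example, not from the paper):

```python
def last_letter_concat(name):
    """Concatenate the last letter of each word (the paper's symbolic task)."""
    return "".join(word[-1] for word in name.split())

print(last_letter_concat("Amy Brown"))                        # -> yn
# Out-of-domain length-generalization probe (hypothetical 4-word name):
print(last_letter_concat("John Jacob Jingleheimer Schmidt"))  # -> nbrt
```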
Chain-of-thought prompting is one of the most influential papers in the prompt engineering era of AI. It demonstrated that the right prompt design could unlock capabilities that appeared absent from language models – shifting the research focus from “how to train better models” to “how to better use existing models.” The paper’s core insight that intermediate reasoning steps improve model outputs became foundational to modern AI engineering.
The paper spawned an enormous body of follow-up work. Later research improved CoT by sampling multiple reasoning paths and selecting the most consistent answer, extended the linear chain to branching and networked reasoning structures, and showed that simply appending “Let’s think step by step” to a prompt – without any few-shot examples – could trigger chain-of-thought reasoning. These developments collectively established “reasoning via prompting” as a core research direction.
Beyond academia, chain-of-thought prompting became standard practice in production AI systems. Modern LLM applications routinely use system prompts that instruct the model to “think step by step” or “show your reasoning.” OpenAI’s o1 model family and similar “reasoning models” from other providers internalize chain-of-thought as part of the model’s generation process, producing hidden or visible reasoning traces before answering. The paper’s finding that reasoning is an emergent ability of scale also influenced decisions to build ever-larger models, contributing to the scaling race in the AI industry.
To understand this paper, you need: