Authors: Shunyu Yao, Dian Yu, Jeffrey Zhao et al. Year: 2023 Source: arXiv 2305.10601
Instead of generating answers in a single left-to-right pass, this paper lets a language model explore multiple reasoning paths organized as a tree, evaluate which paths look promising, and backtrack from dead ends – boosting GPT-4’s success rate on a math puzzle from 4% to 74%.
Large language models like GPT-4 generate text one token at a time, always moving left to right. Once a token is produced, the model never reconsiders it. This works well for many tasks, but it creates a fundamental problem for anything that requires planning, exploration, or backtracking.
Consider the “Game of 24” puzzle: given four numbers (say 4, 9, 10, 13), find arithmetic operations that produce 24. A human might try “13 - 9 = 4, hmm now I have 4, 4, 10… let me try a different first step.” But a standard language model commits to its first arithmetic step and cannot undo it. If the first step leads to a dead end, the entire attempt fails. In experiments, GPT-4 with chain-of-thought prompting (a technique where the model writes out intermediate reasoning steps) solved only 4% of these puzzles – essentially guessing.
The deeper issue connects to cognitive science. Psychologist Daniel Kahneman describes two modes of human thinking: “System 1” (fast, automatic, intuitive) and “System 2” (slow, deliberate, analytical). Standard language model generation resembles System 1 – quick pattern matching without deliberation. But many real problems demand System 2 thinking: considering alternatives, evaluating progress, and changing course when a strategy is not working. Before this paper, there was no general framework to give language models that kind of deliberate, search-based reasoning.
Think of the difference between speed chess and tournament chess. In speed chess, a player glances at the board and makes the first reasonable move that comes to mind – much like how a language model generates tokens one after another. In tournament chess, a player considers multiple possible moves, thinks several turns ahead for each one, evaluates which positions look most promising, and sometimes abandons a line of play to try something completely different. Tree of Thoughts (ToT) gives language models the ability to play “tournament chess” with their reasoning.
The core idea is to organize the problem-solving process as a tree search. Each node in the tree represents a partial solution – the problem statement plus whatever intermediate reasoning steps have been taken so far. The model generates multiple candidate “thoughts” (coherent chunks of reasoning, like one step of an equation or a paragraph plan) at each node, then evaluates which candidates are most promising, and uses a search algorithm (breadth-first or depth-first) to systematically explore the tree. Crucially, the same language model serves as both the thought generator and the evaluator – no separate reward model or external feedback is needed.
What makes ToT novel compared to classical tree search (like the algorithms behind chess engines) is that the heuristic function guiding the search is not a hand-crafted rule or a separately trained neural network. Instead, the language model itself reasons about whether a partial solution looks promising, using natural language deliberation. This means ToT can be applied to problems that are hard to formalize – like judging whether a creative writing plan will produce a coherent story.
ToT is not a neural network architecture but an inference-time framework that wraps around any pretrained language model. It has four configurable components: thought decomposition, thought generation, state evaluation, and search algorithm.
Thought decomposition defines the granularity of each reasoning step. A “thought” is a coherent chunk of text that serves as an intermediate step toward the solution. The right granularity depends on the problem: for Game of 24, each thought is one arithmetic equation (e.g., “13 - 9 = 4, leaving 4, 4, 10”); for creative writing, each thought is a paragraph-level plan; for crossword puzzles, each thought is a single word to fill in. The thought should be large enough that the model can judge whether it is making progress, but small enough that the model can generate diverse candidates.
Thought generation produces candidate thoughts at each tree node. The paper proposes two strategies. The first is i.i.d. sampling: independently sample \(k\) thoughts from the language model given the current state. This works well when the thought space is rich (e.g., many possible paragraph plans). The second is sequential proposal: ask the model to propose \(k\) different thoughts in a single prompt, which avoids duplicates when the thought space is constrained (e.g., there are only a few sensible next arithmetic steps).
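The two strategies can be contrasted in a few lines of Python. This is a minimal sketch, not the paper's implementation: the `sample_thought` and `propose_thoughts` functions are hypothetical stand-ins that return canned Game of 24 thoughts, where a real system would prompt the language model.

```python
import random

# Hypothetical stand-in for one LM sampling call; a real system would
# query the model with the current state and temperature > 0.
def sample_thought(state: str) -> str:
    candidates = ["13 - 9 = 4 (left: 4 4 10)",
                  "10 - 4 = 6 (left: 6 9 13)",
                  "10 + 13 = 23 (left: 4 9 23)"]
    return random.choice(candidates)

# Sequential proposal: a single prompt asks for k distinct thoughts,
# which avoids duplicates when the thought space is constrained.
def propose_thoughts(state: str, k: int) -> list[str]:
    return ["13 - 9 = 4 (left: 4 4 10)",
            "10 - 4 = 6 (left: 6 9 13)",
            "10 + 13 = 23 (left: 4 9 23)"][:k]

# i.i.d. sampling: k independent calls; duplicates are possible,
# which is acceptable when the thought space is rich.
iid = [sample_thought("4 9 10 13") for _ in range(5)]
proposed = propose_thoughts("4 9 10 13", 3)
```

The design trade-off: independent sampling gives diversity for free in rich spaces, while the propose prompt spends one call to get deduplicated candidates in narrow spaces.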
State evaluation judges how promising each partial solution is. Again, two strategies exist. Value estimation asks the model to independently rate each state – for example, classifying a partial Game of 24 solution as “sure” (definitely reachable), “maybe” (possibly reachable), or “impossible” (dead end). Voting asks the model to compare multiple states side by side and vote for the most promising one, which works better for subjective judgments like passage coherency.
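Value estimation reduces to mapping a verbal rating to a numeric score that the search algorithm can rank. The sure/maybe/impossible scale comes from the paper's Game of 24 setup; the specific numeric values below are illustrative assumptions, not the paper's.

```python
# Hypothetical mapping from the model's verbal rating to a score the
# search can sort on (numeric values are illustrative assumptions).
VALUE_MAP = {"sure": 20.0, "maybe": 1.0, "impossible": 0.001}

def value_state(rating: str) -> float:
    # In ToT, a prompt elicits the rating; here we just look it up.
    return VALUE_MAP[rating]

# Scoring the three step-1 candidates from the running [4, 9, 10, 13] example:
ratings = {"13 - 9 = 4": "maybe",
           "10 - 4 = 6": "maybe",
           "10 + 13 = 23": "impossible"}
scores = {state: value_state(r) for state, r in ratings.items()}
```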
Search algorithm determines how to traverse the tree. Breadth-first search (BFS) maintains the \(b\) best states at each depth level, expanding all of them before moving deeper. This suits problems with shallow trees (few steps) where early pruning is effective. Depth-first search (DFS) follows the most promising path all the way to a solution or dead end, then backtracks. This suits problems with deeper trees (many steps) where committing to a direction and backtracking on failure is more efficient.
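The BFS variant fits in a dozen lines once generation and evaluation are abstracted away. This is a sketch under the assumption that `generate` and `evaluate` are LM-backed callables; the toy versions in the usage below use plain integers so the control flow is easy to follow.

```python
def tot_bfs(x, generate, evaluate, b=5, depth=3):
    """Breadth-first Tree of Thoughts: keep the b best states per level.

    generate(state) -> list of successor states;
    evaluate(states) -> list of scores, one per state.
    Both would be backed by LM prompts in a real system.
    """
    frontier = [x]
    for _ in range(depth):
        # Expand every frontier state, then prune to the top b.
        candidates = [s2 for s in frontier for s2 in generate(s)]
        scores = evaluate(candidates)
        ranked = sorted(zip(scores, candidates),
                        key=lambda pair: pair[0], reverse=True)
        frontier = [s for _, s in ranked[:b]]
    return frontier

# Toy usage: states are integers, successors add 1 or 2, and the
# evaluator scores a state by its own value.
best = tot_bfs(0, lambda n: [n + 1, n + 2], lambda ss: ss, b=2, depth=2)
```

With `b = 1` this degenerates to greedy decoding over thoughts, which is why CoT is the breadth-1 special case discussed later.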
For a concrete worked example, consider Game of 24 with input numbers [4, 9, 10, 13]. ToT uses BFS with breadth \(b = 5\) and 3 thought steps. At step 1, the model proposes candidate equations like “13 - 9 = 4 (left: 4, 4, 10)” and “10 - 4 = 6 (left: 6, 9, 13)” and “10 + 13 = 23 (left: 4, 9, 23)”. The model evaluates each: the first two are rated “maybe” (could work), while “10 + 13 = 23” might be rated “impossible” since it is hard to reach 24 from 4, 9, 23. The top 5 candidates survive to step 2, where the process repeats with the remaining numbers. By step 3, a surviving path might reach “4 * 6 = 24” and succeed.
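The step-1 expansion in this example is small enough to enumerate exactly. The sketch below lists every legal first arithmetic step for Game of 24 (the thought-generation role); in ToT the model proposes these via a prompt rather than exhaustive code, but the enumeration shows what the branching factor looks like.

```python
from itertools import combinations

def step1_candidates(nums):
    """Enumerate every legal first step: pick two numbers, apply an
    operation, and carry the remaining numbers forward."""
    out = []
    for i, j in combinations(range(len(nums)), 2):
        a, b = nums[i], nums[j]
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        for expr, val in [(f"{a} + {b}", a + b),
                          (f"{a} - {b}", a - b),
                          (f"{b} - {a}", b - a),
                          (f"{a} * {b}", a * b)]:
            out.append((f"{expr} = {val}", sorted(rest + [val])))
        # Only keep integer divisions, matching the puzzle's convention.
        if b != 0 and a % b == 0:
            out.append((f"{a} / {b} = {a // b}", sorted(rest + [a // b])))
        if a != 0 and b % a == 0:
            out.append((f"{b} / {a} = {b // a}", sorted(rest + [b // a])))
    return out

cands = step1_candidates([4, 9, 10, 13])
```

The candidate "13 - 9 = 4" leaving [4, 4, 10] appears in this list, exactly as in the worked example; the evaluator's job is to prune entries like "10 + 13 = 23" before the tree fans out further.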
The paper builds its notation on the standard autoregressive language model formulation. A language model \(p_\theta\) with parameters \(\theta\) assigns probability to a sequence \(x = (x[1], \ldots, x[n])\) by factoring it token by token:
\[p_\theta(x) = \prod_{i=1}^{n} p_\theta(x[i] \mid x[1 \ldots i-1])\]
This says the probability of the full sequence is the product of each token’s probability given all preceding tokens. Every autoregressive language model works this way – it predicts one token at a time, conditioning on everything generated so far. This is the “System 1” mechanism that ToT aims to augment.
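The chain-rule factorization is easy to verify numerically. In the sketch below the per-token conditional probabilities are made-up toy values (a real model produces them from a softmax over the vocabulary); the point is that the sequence probability is just their product, and that summing log-probabilities gives the same quantity without underflow.

```python
import math

# Toy conditional probabilities for a 3-token sequence:
# p(x1), p(x2 | x1), p(x3 | x1, x2). Values are illustrative.
cond_probs = [0.5, 0.4, 0.9]

# Probability of the full sequence: product of the conditionals.
seq_prob = math.prod(cond_probs)

# Equivalent in log space, which is what implementations actually use
# to avoid numerical underflow on long sequences.
seq_logprob = sum(math.log(p) for p in cond_probs)
```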
Chain-of-thought (CoT) prompting introduces intermediate reasoning steps \(z_1, \ldots, z_n\) between the input \(x\) and output \(y\):
\[z_i \sim p_\theta^{CoT}(z_i \mid x, z_{1 \cdots i-1}), \quad y \sim p_\theta^{CoT}(y \mid x, z_{1 \cdots n})\]
This formalizes the idea that the model can “show its work” by writing intermediate steps. But it still generates a single chain left to right with no branching or backtracking. CoT is a special case of ToT: a tree with breadth 1 (only one path explored).
Self-consistency (CoT-SC) samples \(k\) independent chains and takes a majority vote on the output:
\[\hat{y} = \arg \max_{y} \#\{i \mid y^{(i)} = y\}\]
This aggregates multiple reasoning attempts but still has no local exploration within each chain. It is another special case of ToT: a tree with depth 1 (multiple independent paths, no branching within paths).
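The majority-vote aggregation is a one-liner over the sampled outputs. The answers below are hypothetical samples for illustration; the \(\arg\max\) in the equation above is exactly `Counter.most_common`.

```python
from collections import Counter

# Hypothetical final answers from k = 5 independent CoT samples.
answers = ["24", "24", "22", "24", "20"]

# Self-consistency: the most frequent final answer wins.
y_hat, count = Counter(answers).most_common(1)[0]
```

Note that voting happens only on the final answer; the intermediate chains never interact, which is why CoT-SC has no local exploration.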
In ToT, each tree node is a state \(s = [x, z_{1 \cdots i}]\) combining the input with the thoughts generated so far. The BFS algorithm at each step selects the top-\(b\) states by maximizing the sum of their evaluator scores:
\[S_t = \arg \max_{S \subset S'_t, \, |S| = b} \sum_{s \in S} V_t(s)\]
This is the key equation that turns ToT into a practical algorithm. At each depth level, it generates all possible expansions, scores them, and keeps only the \(b\) most promising ones. This balances exploration (considering multiple paths) with tractability (not letting the tree grow explosively).
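Because the scores \(V_t(s)\) are summed independently over states, maximizing the subset sum reduces to picking the \(b\) highest-scoring states individually. A sketch with assumed scores:

```python
import heapq

# Toy candidate states with evaluator scores V_t(s) (assumed values).
scored = {"s1": 20.0, "s2": 1.0, "s3": 0.001, "s4": 20.0, "s5": 1.0}

b = 2
# Keeping the b states with the highest individual scores maximizes
# the summed score over all subsets of size b.
S_t = heapq.nlargest(b, scored, key=scored.get)
```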
The vote-based evaluator uses an indicator function:
\[V(p_\theta, S)(s) = \mathbf{1}[s = s^*], \quad s^* \sim p_\theta^{vote}(s^* \mid S)\]
Rather than scoring each state in isolation, this asks the model to directly compare candidates – like asking a judge to rank essays rather than score them independently. The paper finds this works better for subjective evaluation tasks like creative writing coherency.
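In practice the vote prompt is sampled several times and the indicator values are summed, so each state's score is simply its vote count. The sketch below assumes five hypothetical vote samples over three candidate states:

```python
from collections import Counter

# Hypothetical winners from n = 5 independent "vote" prompts, each
# naming the most promising state among candidates A, B, C.
votes = ["A", "B", "A", "A", "C"]

# Summing the indicator 1[s = s*] over vote samples gives each state's
# value: the number of times the model chose it.
tally = Counter(votes)
values = {s: tally.get(s, 0) for s in ["A", "B", "C"]}
```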
The paper evaluates ToT on three tasks chosen to challenge GPT-4’s standard prompting capabilities.
Game of 24. Given four numbers, find arithmetic operations to reach 24. Tested on 100 hard games.
| Method | Success Rate |
|---|---|
| IO prompt | 7.3% |
| CoT prompt | 4.0% |
| CoT-SC (k=100) | 9.0% |
| ToT (b=1) | 45% |
| ToT (b=5) | 74% |
| IO (best of 100) | 33% |
| CoT (best of 100) | 49% |
ToT with breadth 5 achieves 74% success, an 18.5x improvement over CoT’s 4%. Even ToT with breadth 1 (45%) outperforms the best-of-100 oracle for IO (33%) and comes close to the best-of-100 oracle for CoT (49%) – oracles that require 100 independent samples per problem, while ToT with b=1 visits far fewer states. Error analysis reveals that about 60% of CoT samples fail at the very first step – the first three tokens commit to an arithmetic operation that makes the problem unsolvable. ToT avoids this by exploring multiple first steps and pruning bad ones.
Creative Writing. Given 4 random sentences, write a coherent 4-paragraph passage where each paragraph ends with one of the sentences.
| Method | GPT-4 Score (1-10) |
|---|---|
| IO prompt | 6.19 |
| CoT prompt | 6.93 |
| ToT | 7.56 |
Human evaluators preferred ToT over CoT in 41 out of 100 comparisons, preferred CoT over ToT in only 21, and found 38 pairs similarly coherent. ToT’s advantage here comes from generating and voting on multiple writing plans before committing to one.
Mini Crosswords (5x5). Fill a 5x5 crossword grid given 10 clues (5 horizontal, 5 vertical). Tested on 20 games using DFS.
| Method | Letter % | Word % | Games Solved |
|---|---|---|---|
| IO | 38.7 | 14 | 0 |
| CoT | 40.6 | 15.6 | 1 |
| ToT | 78 | 60 | 4 |
Figure 6a: The propose prompt for mini crosswords. The language model receives the current board state, unfilled clues, and filled/changed words, then proposes candidate answers with confidence levels (certain/high/medium/low). This prompt structure lets ToT rank which word to fill next, prioritizing high-confidence answers to constrain the remaining search space.
Figure 6b: An intermediate crossword state during ToT’s depth-first search. The grid shows a partial solution with words filled in based on the highest-confidence proposals. If the state evaluator deems a remaining clue “impossible” to fill, DFS backtracks to try alternative words at earlier positions – a capability that IO and CoT completely lack.
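The backtracking behavior described here is ordinary depth-first search with pruning. This is a structural sketch, not the paper's code: `generate`, `is_solution`, and `is_dead_end` stand in for the LM-backed proposal and "impossible" judgments, and the toy usage below replaces them with simple string predicates.

```python
def tot_dfs(state, generate, is_solution, is_dead_end):
    """Depth-first ToT sketch: follow the best thought first and
    backtrack when the evaluator deems a state impossible.

    generate(state) -> successor states, ordered best-first;
    is_solution / is_dead_end would be LM-backed judgments.
    """
    if is_solution(state):
        return state
    if is_dead_end(state):
        return None  # prune this subtree: backtrack to the caller
    for nxt in generate(state):
        found = tot_dfs(nxt, generate, is_solution, is_dead_end)
        if found is not None:
            return found
    return None  # all children failed: backtrack further

# Toy usage: states are strings, children append "a" or "b", anything
# starting with "aa" is a dead end, and "aba" is the solution.
result = tot_dfs("",
                 lambda s: [s + "a", s + "b"] if len(s) < 3 else [],
                 lambda s: s == "aba",
                 lambda s: s.startswith("aa"))
```

The `return None` paths are exactly the backtracking capability that IO and CoT prompting lack: a rejected word choice unwinds to an earlier grid state instead of ending the attempt.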
ToT improves word-level accuracy from 16% to 60% and solves 4 of 20 games outright. IO and CoT cannot backtrack when a word choice conflicts with crossing words, so they get stuck. With an oracle that selects the best DFS state (rather than a heuristic), ToT actually solves 7 of 20 games, showing room for better state-selection heuristics.
Tree of Thoughts crystallized a broader shift in how the field thinks about language model inference. Before ToT, the dominant paradigm was to improve model outputs by improving the model itself (better training data, more parameters, better fine-tuning). ToT demonstrated that inference-time computation – spending more compute at test time through structured search – could dramatically improve performance without any model changes. This “scaling inference compute” idea became a major research direction.
The framework influenced a wave of follow-up work on structured reasoning with LMs. Methods like Graph of Thoughts, Algorithm of Thoughts, and various agent-based reasoning systems can trace their lineage to ToT’s core insight that LMs can serve as both generators and evaluators within a search procedure. The paper’s connection to classical AI search (BFS, DFS, A*) also helped bridge the modern deep learning community with decades of work in symbolic AI and planning.
More broadly, ToT contributed to the conceptual vocabulary now used throughout the field. The idea that standard LM generation is “System 1” thinking and that structured search provides “System 2” capabilities has become a common framing. This distinction has influenced the design of reasoning-focused models and inference strategies, including the trend toward models that “think longer” on harder problems by generating more intermediate tokens before committing to an answer.
To understand this paper, the reader should be comfortable with: