Authors: Shunyu Yao, Dian Yu, Jeffrey Zhao et al. Year: 2023 Source: arXiv 2305.10601
Instead of generating answers in a single left-to-right pass, this paper lets a language model explore multiple reasoning paths organized as a tree, evaluate which paths look promising, and backtrack from dead ends – boosting GPT-4’s success rate on a math puzzle from 4% to 74%.
Large language models like GPT-4 generate text one token at a time, always moving left to right. Once a token is produced, the model never reconsiders it. This works well for many tasks, but it creates a fundamental problem for anything that requires planning, exploration, or backtracking.
Consider the “Game of 24” puzzle: given four numbers (say 4, 9, 10, 13), find arithmetic operations that produce 24. A human might try “13 - 9 = 4, hmm now I have 4, 4, 10… let me try a different first step.” But a standard language model commits to its first arithmetic step and cannot undo it. If the first step leads to a dead end, the entire attempt fails. In experiments, GPT-4 with chain-of-thought prompting (a technique where the model writes out intermediate reasoning steps) solved only 4% of these puzzles – essentially guessing.
The deeper issue connects to cognitive science. Psychologist Daniel Kahneman describes two modes of human thinking: “System 1” (fast, automatic, intuitive) and “System 2” (slow, deliberate, analytical). Standard language model generation resembles System 1 – quick pattern matching without deliberation. But many real problems demand System 2 thinking: considering alternatives, evaluating progress, and changing course when a strategy is not working. Before this paper, there was no general framework to give language models that kind of deliberate, search-based reasoning.
Think of the difference between speed chess and tournament chess. In speed chess, a player glances at the board and makes the first reasonable move that comes to mind – much like how a language model generates tokens one after another. In tournament chess, a player considers multiple possible moves, thinks several turns ahead for each one, evaluates which positions look most promising, and sometimes abandons a line of play to try something completely different. Tree of Thoughts (ToT) gives language models the ability to play “tournament chess” with their reasoning.
The core idea is to organize the problem-solving process as a tree search. Each node in the tree represents a partial solution – the problem statement plus whatever intermediate reasoning steps have been taken so far. The model generates multiple candidate “thoughts” (coherent chunks of reasoning, like one step of an equation or a paragraph plan) at each node, then evaluates which candidates are most promising, and uses a search algorithm (breadth-first or depth-first) to systematically explore the tree. Crucially, the same language model serves as both the thought generator and the evaluator – no separate reward model or external feedback is needed.
What makes ToT novel compared to classical tree search (like the algorithms behind chess engines) is that the heuristic function guiding the search is not a hand-crafted rule or a separately trained neural network. Instead, the language model itself reasons about whether a partial solution looks promising, using natural language deliberation. This means ToT can be applied to problems that are hard to formalize – like judging whether a creative writing plan will produce a coherent story.
ToT is not a neural network architecture but an inference-time framework that wraps around any pretrained language model. It has four configurable components: thought decomposition, thought generation, state evaluation, and search algorithm.
Thought decomposition defines the granularity of each reasoning step. A “thought” is a coherent chunk of text that serves as an intermediate step toward the solution. The right granularity depends on the problem: for Game of 24, each thought is one arithmetic equation (e.g., “13 - 9 = 4, leaving 4, 4, 10”); for creative writing, each thought is a paragraph-level plan; for crossword puzzles, each thought is a single word to fill in. The thought should be large enough that the model can judge whether it is making progress, but small enough that the model can generate diverse candidates.
Thought generation produces candidate thoughts at each tree node. The paper proposes two strategies. The first is i.i.d. sampling: independently sample \(k\) thoughts from the language model given the current state. This works well when the thought space is rich (e.g., many possible paragraph plans). The second is sequential proposal: ask the model to propose \(k\) different thoughts in a single prompt, which avoids duplicates when the thought space is constrained (e.g., there are only a few sensible next arithmetic steps).
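The two strategies can be contrasted in a few lines of Python. This is a minimal sketch, not the paper's implementation: the `sample_thought` and `propose_thoughts` functions are hypothetical stand-ins that return canned Game of 24 thoughts, where a real system would prompt the language model.

```python
import random

# Hypothetical stand-in for one LM sampling call; a real system would
# query the model with the current state and temperature > 0.
def sample_thought(state: str) -> str:
    candidates = ["13 - 9 = 4 (left: 4 4 10)",
                  "10 - 4 = 6 (left: 6 9 13)",
                  "10 + 13 = 23 (left: 4 9 23)"]
    return random.choice(candidates)

# Sequential proposal: a single prompt asks for k distinct thoughts,
# which avoids duplicates when the thought space is constrained.
def propose_thoughts(state: str, k: int) -> list[str]:
    return ["13 - 9 = 4 (left: 4 4 10)",
            "10 - 4 = 6 (left: 6 9 13)",
            "10 + 13 = 23 (left: 4 9 23)"][:k]

# i.i.d. sampling: k independent calls; duplicates are possible,
# which is acceptable when the thought space is rich.
iid = [sample_thought("4 9 10 13") for _ in range(5)]
proposed = propose_thoughts("4 9 10 13", 3)
```

The design trade-off: independent sampling gives diversity for free in rich spaces, while the propose prompt spends one call to get deduplicated candidates in narrow spaces.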
State evaluation judges how promising each partial solution is. Again, two strategies exist. Value estimation asks the model to independently rate each state – for example, classifying a partial Game of 24 solution as “sure” (definitely reachable), “maybe” (possibly reachable), or “impossible” (dead end). Voting asks the model to compare multiple states side by side and vote for the most promising one, which works better for subjective judgments like passage coherency.
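Value estimation reduces to mapping a verbal rating to a numeric score that the search algorithm can rank. The sure/maybe/impossible scale comes from the paper's Game of 24 setup; the specific numeric values below are illustrative assumptions, not the paper's.

```python
# Hypothetical mapping from the model's verbal rating to a score the
# search can sort on (numeric values are illustrative assumptions).
VALUE_MAP = {"sure": 20.0, "maybe": 1.0, "impossible": 0.001}

def value_state(rating: str) -> float:
    # In ToT, a prompt elicits the rating; here we just look it up.
    return VALUE_MAP[rating]

# Scoring the three step-1 candidates from the running [4, 9, 10, 13] example:
ratings = {"13 - 9 = 4": "maybe",
           "10 - 4 = 6": "maybe",
           "10 + 13 = 23": "impossible"}
scores = {state: value_state(r) for state, r in ratings.items()}
```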
Search algorithm determines how to traverse the tree. Breadth-first search (BFS) maintains the \(b\) best states at each depth level, expanding all of them before moving deeper. This suits problems with shallow trees (few steps) where early pruning is effective. Depth-first search (DFS) follows the most promising path all the way to a solution or dead end, then backtracks. This suits problems with deeper trees (many steps) where committing to a direction and backtracking on failure is more efficient.
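The BFS variant fits in a dozen lines once generation and evaluation are abstracted away. This is a sketch under the assumption that `generate` and `evaluate` are LM-backed callables; the toy versions in the usage below use plain integers so the control flow is easy to follow.

```python
def tot_bfs(x, generate, evaluate, b=5, depth=3):
    """Breadth-first Tree of Thoughts: keep the b best states per level.

    generate(state) -> list of successor states;
    evaluate(states) -> list of scores, one per state.
    Both would be backed by LM prompts in a real system.
    """
    frontier = [x]
    for _ in range(depth):
        # Expand every frontier state, then prune to the top b.
        candidates = [s2 for s in frontier for s2 in generate(s)]
        scores = evaluate(candidates)
        ranked = sorted(zip(scores, candidates),
                        key=lambda pair: pair[0], reverse=True)
        frontier = [s for _, s in ranked[:b]]
    return frontier

# Toy usage: states are integers, successors add 1 or 2, and the
# evaluator scores a state by its own value.
best = tot_bfs(0, lambda n: [n + 1, n + 2], lambda ss: ss, b=2, depth=2)
```

With `b = 1` this degenerates to greedy decoding over thoughts, which is why CoT is the breadth-1 special case discussed later.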
For a concrete worked example, consider Game of 24 with input numbers [4, 9, 10, 13]. ToT uses BFS with breadth \(b = 5\) and 3 thought steps. At step 1, the model proposes candidate equations like “13 - 9 = 4 (left: 4, 4, 10)” and “10 - 4 = 6 (left: 6, 9, 13)” and “10 + 13 = 23 (left: 4, 9, 23)”. The model evaluates each: the first two are rated “maybe” (could work), while “10 + 13 = 23” might be rated “impossible” since it is hard to reach 24 from 4, 9, 23. The top 5 candidates survive to step 2, where the process repeats with the remaining numbers. By step 3, a surviving path might reach “4 * 6 = 24” and succeed.
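The step-1 expansion in this example is small enough to enumerate exactly. The sketch below lists every legal first arithmetic step for Game of 24 (the thought-generation role); in ToT the model proposes these via a prompt rather than exhaustive code, but the enumeration shows what the branching factor looks like.

```python
from itertools import combinations

def step1_candidates(nums):
    """Enumerate every legal first step: pick two numbers, apply an
    operation, and carry the remaining numbers forward."""
    out = []
    for i, j in combinations(range(len(nums)), 2):
        a, b = nums[i], nums[j]
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        for expr, val in [(f"{a} + {b}", a + b),
                          (f"{a} - {b}", a - b),
                          (f"{b} - {a}", b - a),
                          (f"{a} * {b}", a * b)]:
            out.append((f"{expr} = {val}", sorted(rest + [val])))
        # Only keep integer divisions, matching the puzzle's convention.
        if b != 0 and a % b == 0:
            out.append((f"{a} / {b} = {a // b}", sorted(rest + [a // b])))
        if a != 0 and b % a == 0:
            out.append((f"{b} / {a} = {b // a}", sorted(rest + [b // a])))
    return out

cands = step1_candidates([4, 9, 10, 13])
```

The candidate "13 - 9 = 4" leaving [4, 4, 10] appears in this list, exactly as in the worked example; the evaluator's job is to prune entries like "10 + 13 = 23" before the tree fans out further.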
The paper builds its notation on the standard autoregressive language model formulation. A language model \(p_\theta\) with parameters \(\theta\) assigns probability to a sequence \(x = (x[1], \ldots, x[n])\) by factoring it token by token:
\[p_\theta(x) = \prod_{i=1}^{n} p_\theta(x[i] \mid x[1 \ldots i-1])\]
This says the probability of the full sequence is the product of each token’s probability given all preceding tokens. Every autoregressive language model works this way – it predicts one token at a time, conditioning on everything generated so far. This is the “System 1” mechanism that ToT aims to augment.
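The chain-rule factorization is easy to verify numerically. In the sketch below the per-token conditional probabilities are made-up toy values (a real model produces them from a softmax over the vocabulary); the point is that the sequence probability is just their product, and that summing log-probabilities gives the same quantity without underflow.

```python
import math

# Toy conditional probabilities for a 3-token sequence:
# p(x1), p(x2 | x1), p(x3 | x1, x2). Values are illustrative.
cond_probs = [0.5, 0.4, 0.9]

# Probability of the full sequence: product of the conditionals.
seq_prob = math.prod(cond_probs)

# Equivalent in log space, which is what implementations actually use
# to avoid numerical underflow on long sequences.
seq_logprob = sum(math.log(p) for p in cond_probs)
```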
Chain-of-thought (CoT) prompting introduces intermediate reasoning steps \(z_1, \ldots, z_n\) between the input \(x\) and output \(y\):
\[z_i \sim p_\theta^{CoT}(z_i \mid x, z_{1 \cdots i-1}), \quad y \sim p_\theta^{CoT}(y \mid x, z_{1 \cdots n})\]
This formalizes the idea that the model can “show its work” by writing intermediate steps. But it still generates a single chain left to right with no branching or backtracking. CoT is a special case of ToT: a tree with breadth 1 (only one path explored).
Self-consistency (CoT-SC) samples \(k\) independent chains and takes a majority vote on the output:
\[\hat{y} = \arg \max_{y} \#\{i \mid y^{(i)} = y\}\]
This aggregates multiple reasoning attempts but still has no local exploration within each chain. It is another special case of ToT: a tree with depth 1 (multiple independent paths, no branching within paths).
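The majority-vote aggregation is a one-liner over the sampled outputs. The answers below are hypothetical samples for illustration; the \(\arg\max\) in the equation above is exactly `Counter.most_common`.

```python
from collections import Counter

# Hypothetical final answers from k = 5 independent CoT samples.
answers = ["24", "24", "22", "24", "20"]

# Self-consistency: the most frequent final answer wins.
y_hat, count = Counter(answers).most_common(1)[0]
```

Note that voting happens only on the final answer; the intermediate chains never interact, which is why CoT-SC has no local exploration.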
In ToT, each tree node is a state \(s = [x, z_{1 \cdots i}]\) combining the input with the thoughts generated so far. The BFS algorithm at each step selects the top-\(b\) states by maximizing the sum of their evaluator scores:
\[S_t = \arg \max_{S \subset S'_t, \, |S| = b} \sum_{s \in S} V_t(s)\]
This is the key equation that turns ToT into a practical algorithm. At each depth level, it generates all possible expansions, scores them, and keeps only the \(b\) most promising ones. This balances exploration (considering multiple paths) with tractability (not letting the tree grow explosively).
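Because the scores \(V_t(s)\) are summed independently over states, maximizing the subset sum reduces to picking the \(b\) highest-scoring states individually. A sketch with assumed scores:

```python
import heapq

# Toy candidate states with evaluator scores V_t(s) (assumed values).
scored = {"s1": 20.0, "s2": 1.0, "s3": 0.001, "s4": 20.0, "s5": 1.0}

b = 2
# Keeping the b states with the highest individual scores maximizes
# the summed score over all subsets of size b.
S_t = heapq.nlargest(b, scored, key=scored.get)
```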
The vote-based evaluator uses an indicator function:
\[V(p_\theta, S)(s) = \mathbf{1}[s = s^*], \quad s^* \sim p_\theta^{vote}(s^* \mid S)\]
Rather than scoring each state in isolation, this asks the model to directly compare candidates – like asking a judge to rank essays rather than score them independently. The paper finds this works better for subjective evaluation tasks like creative writing coherency.
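In practice the vote prompt is sampled several times and the indicator values are summed, so each state's score is simply its vote count. The sketch below assumes five hypothetical vote samples over three candidate states:

```python
from collections import Counter

# Hypothetical winners from n = 5 independent "vote" prompts, each
# naming the most promising state among candidates A, B, C.
votes = ["A", "B", "A", "A", "C"]

# Summing the indicator 1[s = s*] over vote samples gives each state's
# value: the number of times the model chose it.
tally = Counter(votes)
values = {s: tally.get(s, 0) for s in ["A", "B", "C"]}
```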
The paper evaluates ToT on three tasks chosen to challenge GPT-4’s standard prompting capabilities.
Game of 24. Given four numbers, find arithmetic operations to reach 24. Tested on 100 hard games.
| Method | Success Rate |
|---|---|
| IO prompt | 7.3% |
| CoT prompt | 4.0% |
| CoT-SC (k=100) | 9.0% |
| ToT (b=1) | 45% |
| ToT (b=5) | 74% |
| IO (best of 100) | 33% |
| CoT (best of 100) | 49% |
ToT with breadth 5 achieves 74% success, an 18.5x improvement over CoT’s 4%. Even ToT with breadth 1 (45%) outperforms the best-of-100 oracle for IO (33%) and comes close to the best-of-100 oracle for CoT (49%) – oracles that require 100 independent samples per problem, while ToT with b=1 visits far fewer states. Error analysis reveals that about 60% of CoT samples fail at the very first step – the first three tokens commit to an arithmetic operation that makes the problem unsolvable. ToT avoids this by exploring multiple first steps and pruning bad ones.
Creative Writing. Given 4 random sentences, write a coherent 4-paragraph passage where each paragraph ends with one of the sentences.
| Method | GPT-4 Score (1-10) |
|---|---|
| IO prompt | 6.19 |
| CoT prompt | 6.93 |
| ToT | 7.56 |
Human evaluators preferred ToT over CoT in 41 out of 100 comparisons, preferred CoT over ToT in only 21, and found 38 pairs similarly coherent. ToT’s advantage here comes from generating and voting on multiple writing plans before committing to one.
Mini Crosswords (5x5). Fill a 5x5 crossword grid given 10 clues (5 horizontal, 5 vertical). Tested on 20 games using DFS.
| Method | Letter % | Word % | Games Solved |
|---|---|---|---|
| IO | 38.7 | 14 | 0 |
| CoT | 40.6 | 15.6 | 1 |
| ToT | 78 | 60 | 4 |
Figure 6a: The propose prompt for mini crosswords. The language model receives the current board state, unfilled clues, and filled/changed words, then proposes candidate answers with confidence levels (certain/high/medium/low). This prompt structure lets ToT rank which word to fill next, prioritizing high-confidence answers to constrain the remaining search space.
Figure 6b: An intermediate crossword state during ToT’s depth-first search. The grid shows a partial solution with words filled in based on the highest-confidence proposals. If the state evaluator deems a remaining clue “impossible” to fill, DFS backtracks to try alternative words at earlier positions – a capability that IO and CoT completely lack.
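The backtracking behavior described here is ordinary depth-first search with pruning. This is a structural sketch, not the paper's code: `generate`, `is_solution`, and `is_dead_end` stand in for the LM-backed proposal and "impossible" judgments, and the toy usage below replaces them with simple string predicates.

```python
def tot_dfs(state, generate, is_solution, is_dead_end):
    """Depth-first ToT sketch: follow the best thought first and
    backtrack when the evaluator deems a state impossible.

    generate(state) -> successor states, ordered best-first;
    is_solution / is_dead_end would be LM-backed judgments.
    """
    if is_solution(state):
        return state
    if is_dead_end(state):
        return None  # prune this subtree: backtrack to the caller
    for nxt in generate(state):
        found = tot_dfs(nxt, generate, is_solution, is_dead_end)
        if found is not None:
            return found
    return None  # all children failed: backtrack further

# Toy usage: states are strings, children append "a" or "b", anything
# starting with "aa" is a dead end, and "aba" is the solution.
result = tot_dfs("",
                 lambda s: [s + "a", s + "b"] if len(s) < 3 else [],
                 lambda s: s == "aba",
                 lambda s: s.startswith("aa"))
```

The `return None` paths are exactly the backtracking capability that IO and CoT prompting lack: a rejected word choice unwinds to an earlier grid state instead of ending the attempt.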
ToT improves word-level accuracy from 16% to 60% and solves 4 of 20 games outright. IO and CoT cannot backtrack when a word choice conflicts with crossing words, so they get stuck. With an oracle that selects the best DFS state (rather than a heuristic), ToT actually solves 7 of 20 games, showing room for better state-selection heuristics.
Tree of Thoughts crystallized a broader shift in how the field thinks about language model inference. Before ToT, the dominant paradigm was to improve model outputs by improving the model itself (better training data, more parameters, better fine-tuning). ToT demonstrated that inference-time computation – spending more compute at test time through structured search – could dramatically improve performance without any model changes. This “scaling inference compute” idea became a major research direction.
The framework influenced a wave of follow-up work on structured reasoning with LMs. Methods like Graph of Thoughts, Algorithm of Thoughts, and various agent-based reasoning systems can trace their lineage to ToT’s core insight that LMs can serve as both generators and evaluators within a search procedure. The paper’s connection to classical AI search (BFS, DFS, A*) also helped bridge the modern deep learning community with decades of work in symbolic AI and planning.
More broadly, ToT contributed to the conceptual vocabulary now used throughout the field. The idea that standard LM generation is “System 1” thinking and that structured search provides “System 2” capabilities has become a common framing. This distinction has influenced the design of reasoning-focused models and inference strategies, including the trend toward models that “think longer” on harder problems by generating more intermediate tokens before committing to an answer.
To understand this paper, the reader should be comfortable with: