Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Course)

Learning Objectives

By the end of this course, you will be able to:

  1. Explain why left-to-right autoregressive generation fails on tasks that require exploration or backtracking, and connect this to the “System 1 vs. System 2” distinction from cognitive science
  2. Describe how tree search algorithms (BFS and DFS) systematically explore a space of partial solutions, and identify which algorithm suits which problem structure
  3. Construct a Tree of Thoughts (ToT) configuration for a new problem by choosing a thought decomposition, generation strategy, evaluation strategy, and search algorithm
  4. Trace a complete ToT search on a concrete example, showing how thoughts are generated, evaluated, pruned, and expanded at each depth level
  5. Compare ToT to chain-of-thought prompting and self-consistency as special cases of the same framework, and articulate when each method is appropriate

Prerequisites


Lesson 1: The Problem with One-Shot Reasoning

Language models generate text one token at a time, always moving forward. Once a token is produced, the model never reconsiders it. This lesson explains why that limitation becomes catastrophic for tasks that require planning, exploration, or backtracking.

Explanation

Think about navigating a maze. If you can only walk forward and never turn around, you will get stuck at the first dead end. To solve a maze, you need to try a path, recognize when it fails, back up, and try a different one. The ability to backtrack is not a luxury – it is essential for any problem where the right first step is not obvious.

Standard language model generation works like walking forward in a maze with no backtracking. The model produces tokens left to right, each one conditioned on everything before it. Formally, a pretrained language model \(p_\theta\) with parameters \(\theta\) generates a sequence \(x = (x[1], \ldots, x[n])\) by factoring it token by token:

\[p_\theta(x) = \prod_{i=1}^{n} p_\theta(x[i] \mid x[1 \ldots i-1])\]

where:

  - \(x[i]\) is the \(i\)-th token and \(x[1 \ldots i-1]\) is the prefix of all tokens generated before it
  - \(p_\theta\) is the pretrained language model with parameters \(\theta\)

This says: the probability of the full sequence is the product of each token’s probability given all preceding tokens. Every autoregressive language model works this way. The critical consequence is that token generation is irreversible – once \(x[i]\) is produced, it becomes fixed context for all subsequent tokens. If \(x[i]\) was a bad choice, the model has no mechanism to go back and change it.
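The chain-rule factorization can be made concrete with a toy model. The sketch below (not from the paper; the conditional probabilities are made up for illustration, where a real LM would compute them with a neural network) multiplies per-token conditionals exactly as the formula prescribes:

```python
# Toy conditional distributions p(next_token | context) for a tiny vocabulary.
# The numbers are invented for illustration; a real LM computes these values.
def p_next(token, context):
    table = {
        (): {"the": 0.6, "a": 0.3, "cat": 0.1},
        ("the",): {"cat": 0.7, "a": 0.1, "the": 0.2},
        ("the", "cat"): {"the": 0.2, "a": 0.3, "cat": 0.5},
    }
    return table[tuple(context)][token]

def sequence_prob(tokens):
    """p(x) = product over i of p(x[i] | x[1..i-1]) -- the chain rule."""
    prob = 1.0
    for i, tok in enumerate(tokens):
        prob *= p_next(tok, tokens[:i])  # each token conditions on its prefix
    return prob

print(sequence_prob(["the", "cat"]))  # 0.6 * 0.7 = 0.42
```

Note that the loop only ever moves forward: once a token is in the prefix, every later factor conditions on it, which is the irreversibility discussed above.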

Psychologist Daniel Kahneman describes two modes of human thinking. “System 1” is fast, automatic, and intuitive – like recognizing a face or catching a ball. “System 2” is slow, deliberate, and analytical – like solving a logic puzzle or planning a chess move. Standard language model generation resembles System 1: quick pattern matching without deliberation. But problems that require exploring alternatives, evaluating progress, or changing course demand System 2 thinking.

Consider the “Game of 24” puzzle: given four numbers (say 4, 9, 10, 13), find arithmetic operations that produce 24. A human might try “13 - 9 = 4, hmm, now I have 4, 4, 10… that does not look promising, let me try a different first step.” But a standard language model commits to its first arithmetic step and cannot undo it. In experiments, GPT-4 with chain-of-thought prompting (see Chain-of-Thought Prompting Elicits Reasoning in Large Language Models) solved only 4% of these puzzles. Error analysis showed that roughly 60% of failures were already decided at the very first step – the chain’s opening equation committed to an arithmetic operation that made 24 unreachable.

Worked Example

Let us trace why chain-of-thought (CoT) prompting fails on a specific Game of 24 instance. The input numbers are [4, 9, 10, 13].

CoT attempt 1: The model generates “4 + 9 = 13 (left: 10, 13, 13)”. Now the remaining numbers are 10, 13, 13. Can these reach 24? The options are limited: 13 + 13 = 26, 13 - 13 = 0, 13 * 13 = 169. None of these combine with 10 to reach 24. This first step was a dead end, but CoT cannot backtrack.

CoT attempt 2 (a separate independent sample): The model generates “10 + 13 = 23 (left: 4, 9, 23)”. Remaining: 4, 9, 23. Can these reach 24? 23 + 9 = 32, 23 - 9 = 14, 23 + 4 = 27, 23 - 4 = 19. None work. Another dead end.

CoT attempt 3: The model generates “13 - 9 = 4 (left: 4, 4, 10)”. Remaining: 4, 4, 10. Now 10 - 4 = 6, leaving 4 and 6. And 4 * 6 = 24. This path works.

The problem is that CoT generates a single chain and commits to it. Attempt 1 and attempt 2 fail because the first step was wrong, and there is no way to recover. Only attempt 3 happens to pick a good first step. With 100 independent CoT samples, the success rate is only 9% (CoT-SC with k=100). The model needs the ability to explore multiple first steps and abandon bad ones.
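The dead-end analysis above can be automated. The brute-force checker below (plain Python, no LM, not part of the paper) decides whether a set of remaining numbers can still reach 24 – exactly the judgment a CoT chain never makes before committing:

```python
from itertools import permutations

def reachable_24(nums, target=24, eps=1e-6):
    """Brute-force check: can `nums` reach `target` using +, -, *, /?
    Tries every ordered pair, applies every operation, and recurses."""
    if len(nums) == 1:
        return abs(nums[0] - target) < eps
    for a, b in permutations(range(len(nums)), 2):
        rest = [nums[i] for i in range(len(nums)) if i not in (a, b)]
        x, y = nums[a], nums[b]
        candidates = [x + y, x - y, x * y]
        if abs(y) > eps:  # avoid division by zero
            candidates.append(x / y)
        if any(reachable_24(rest + [c], target, eps) for c in candidates):
            return True
    return False

print(reachable_24([10, 13, 13]))  # attempt 1's leftovers: dead end -> False
print(reachable_24([4, 4, 10]))   # attempt 3's leftovers: solvable -> True
```

Running it on the three attempts confirms the trace: attempts 1 and 2 were unrecoverable after their first equation, while attempt 3’s first step kept 24 reachable.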

Exercises

Recall: What does “autoregressive” mean in the context of language models? Why does this property prevent backtracking?

Apply: For the Game of 24 with input numbers [1, 5, 5, 5], list three possible first steps (each combining two numbers with an arithmetic operation). For each, determine the remaining numbers and assess whether 24 is still reachable.

Extend: The paper connects standard LM generation to Kahneman’s “System 1” thinking. Can you think of a task where System 1 is actually sufficient – where a language model’s first instinct is reliably correct? What property of that task makes exploration unnecessary?


Lesson 2: Trees as a Framework for Problem Solving

Before we can build Tree of Thoughts, we need to understand trees as a data structure for organizing search. This lesson covers the basics of tree search, state spaces, and the two algorithms that ToT uses: breadth-first search (BFS) and depth-first search (DFS).

Explanation

Imagine you are planning a road trip from New York to Los Angeles, and at each city you have two or three possible next cities to drive to. You could draw out all the possibilities as a tree: New York at the top (the root), with branches to each possible first stop, then branches from each of those to the next possible stops, and so on until you reach Los Angeles (a leaf node that represents a complete solution).

A tree in computer science is a data structure where:

  - a single root node sits at the top, with no parent
  - every other node has exactly one parent and zero or more children
  - nodes with no children are called leaves
  - a node’s depth is the number of edges on the path from the root to it

For problem solving, we search through this tree to find a leaf node that represents a correct solution. Two classic algorithms for this are:

Breadth-first search (BFS) explores the tree level by level. It visits all nodes at depth 1, then all nodes at depth 2, then depth 3, and so on. BFS is thorough at shallow depths – it guarantees you examine all options before going deeper. The downside is that it can become expensive when the tree is wide, because it maintains all nodes at the current level in memory.

Depth-first search (DFS) explores the tree by following one path all the way down to a leaf, then backtracking to the most recent branching point and trying the next option. DFS is memory-efficient (it only needs to remember the current path), and it is a good fit when solutions require many steps (deep trees). The downside is that it might spend a long time exploring a fruitless branch before backtracking.
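The two traversal orders differ only in which end of the frontier they pop from. Here is a minimal generic sketch (not from the paper; `children` and `is_goal` are placeholder callables you would define per problem):

```python
from collections import deque

def search(root, children, is_goal, mode="bfs"):
    """Generic tree search. BFS pops from the front of the frontier (a queue);
    DFS pops from the back (a stack). Everything else is identical."""
    frontier = deque([root])
    while frontier:
        node = frontier.popleft() if mode == "bfs" else frontier.pop()
        if is_goal(node):
            return node
        frontier.extend(children(node))
    return None

# Tiny example: find 5 in an implicit tree where node n has children 2n and
# 2n+1 (depth-limited so the search terminates).
kids = lambda n: [2 * n, 2 * n + 1] if n < 8 else []
print(search(1, kids, lambda n: n == 5, mode="bfs"))  # 5
print(search(1, kids, lambda n: n == 5, mode="dfs"))  # 5
```

Both modes find the goal here; the difference is the order nodes are visited and how many live nodes the frontier holds at once (all of a level for BFS, one path plus siblings for DFS).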

In the context of problem solving with language models, each node in the tree is a state – the original problem plus whatever intermediate reasoning steps have been taken so far. The paper defines a state as:

\[s = [x, z_{1 \cdots i}]\]

where:

  - \(x\) is the original problem input
  - \(z_{1 \cdots i}\) is the sequence of thoughts generated so far

The root state is simply \(s_0 = [x]\) (the problem with no thoughts yet). Each branch from a state represents a possible next thought, and a leaf represents a complete solution attempt.

Worked Example

Let us draw a partial search tree for Game of 24 with input [2, 3, 4, 6]. The root is the initial state with all four numbers. At depth 1, we try different first arithmetic steps:

Root: [2, 3, 4, 6]
├── 2 + 3 = 5   → remaining [4, 5, 6]    (depth 1, node A)
├── 6 - 2 = 4   → remaining [3, 4, 4]    (depth 1, node B)
├── 3 * 4 = 12  → remaining [2, 6, 12]   (depth 1, node C)
└── 6 / 3 = 2   → remaining [2, 2, 4]    (depth 1, node D)

BFS would expand all four nodes at depth 1 before moving to depth 2. It would then expand each of their children at depth 2, and so on.

DFS would pick one path (say node A) and follow it all the way down:

Node A: [4, 5, 6]
├── 4 + 5 = 9   → remaining [6, 9]  → 6 * 9 = 54 ≠ 24  (dead end)
├── 6 - 5 = 1   → remaining [1, 4]  → 1 * 4 = 4 ≠ 24   (dead end)
├── 5 + 6 = 11  → remaining [4, 11] → 4 + 11 = 15 ≠ 24  (dead end)
├── 6 - 4 = 2   → remaining [2, 5]  → 2 * 5 = 10 ≠ 24  (dead end)
└── 4 * 6 = 24  → remaining [5, 24] → dead end (the 5 is unused, and every number must be used exactly once)

If all children of node A fail, DFS backtracks to the root and tries the next branch. Following node C, the remaining numbers are [2, 6, 12]. Try 2 * 12 = 24, remaining [6, 24]: this fails because the 6 is unused and all numbers must be used. Try 12 - 6 = 6, remaining [2, 6]: then 2 + 6 = 8 and 6 * 2 = 12 are both dead ends. Try 6 - 2 = 4, remaining [4, 12]: then 12 + 4 = 16 and 12 * 4 = 48 are dead ends. Try 12 / 2 = 6, remaining [6, 6]: then 6 + 6 = 12 is a dead end. In fact, no sequence of steps from node C reaches 24.

Going back to node B: remaining [3, 4, 4]. Try 3 * 4 = 12, remaining [4, 12]: then 4 + 12 = 16, 12 - 4 = 8, and 12 * 4 = 48 are all dead ends. But a different depth-2 step works: 4 + 4 = 8, remaining [3, 8]. Then 3 * 8 = 24. That works.

So the solution path is: 6 - 2 = 4, then 4 + 4 = 8, then 3 * 8 = 24. DFS found it by backtracking from dead ends.
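The backtracking trace above can be reproduced mechanically. This brute-force DFS (plain Python, no LM, not part of the paper) returns the first solution path it finds, abandoning each dead end exactly as the narrative does:

```python
from itertools import permutations

def dfs_24(nums, path=(), eps=1e-6):
    """Depth-first search over arithmetic steps for Game of 24.
    Returns a list of equation strings on success, or None after
    exhausting (and backtracking out of) every branch."""
    if len(nums) == 1:
        return list(path) if abs(nums[0] - 24) < eps else None
    for a, b in permutations(range(len(nums)), 2):
        rest = [nums[i] for i in range(len(nums)) if i not in (a, b)]
        x, y = nums[a], nums[b]
        steps = [(x + y, f"{x} + {y} = {x + y}"),
                 (x - y, f"{x} - {y} = {x - y}"),
                 (x * y, f"{x} * {y} = {x * y}")]
        if abs(y) > eps:  # avoid division by zero
            steps.append((x / y, f"{x} / {y} = {x / y}"))
        for value, desc in steps:
            found = dfs_24(rest + [value], path + (desc,), eps)
            if found:
                return found  # propagate the successful path upward
    return None  # every branch below this state failed: backtrack

print(dfs_24([2, 3, 4, 6]))  # prints one valid 3-step solution path
```

The exact path printed depends on the order candidates are tried, but any returned path ends with an equation evaluating to 24.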

Exercises

Recall: What is the key difference between BFS and DFS in how they traverse a tree? Which one requires more memory at any given moment?

Apply: For the numbers [1, 2, 6, 8], draw the first two levels of a search tree (root + depth 1 children). List at least four possible first steps as the depth-1 nodes.

Extend: BFS examines all nodes at each depth before going deeper, while DFS follows one path to completion before backtracking. For a Game of 24 puzzle with 3 thought steps and 5 candidate thoughts per step, how many total nodes would BFS examine if it kept all candidates (no pruning)? How does this compare to the worst case for DFS?


Lesson 3: Chain-of-Thought and Self-Consistency as Degenerate Trees

Before introducing the full Tree of Thoughts framework, this lesson formalizes chain-of-thought prompting (CoT) and self-consistency (CoT-SC) as tree search – but with severely limited tree structures. Understanding these as special cases makes the generalization to ToT natural.

Explanation

Think of CoT and CoT-SC as extreme cases of exploring a tree. CoT is like entering a maze and always taking the first hallway you see, never turning around. CoT-SC is like sending 100 different people into the maze, each one independently picking random hallways, and then asking which exit the most people found. Neither strategy lets anyone turn around or learn from what others are doing mid-maze.

Chain-of-thought prompting generates a sequence of thoughts \(z_1, \ldots, z_n\) between the input \(x\) and the output \(y\):

\[z_i \sim p_\theta^{CoT}(z_i \mid x, z_{1 \cdots i-1}), \quad y \sim p_\theta^{CoT}(y \mid x, z_{1 \cdots n})\]

where:

  - \(z_i\) is the \(i\)-th thought, sampled conditioned on the input \(x\) and all previous thoughts \(z_{1 \cdots i-1}\)
  - the superscript \(CoT\) means the prompt wraps the input with chain-of-thought instructions or few-shot examples
  - \(y\) is the final answer, generated after all \(n\) thoughts

As a tree, CoT has depth \(n\) (the number of reasoning steps) but breadth 1 (only one path explored). It is a tree with no branching – a single chain from root to leaf.

Self-consistency (CoT-SC) samples \(k\) independent chains and takes a majority vote:

\[\hat{y} = \arg \max_{y} \#\{i \mid y^{(i)} = y\}\]

where:

  - \(y^{(i)}\) is the final answer of the \(i\)-th sampled chain (\(i = 1 \cdots k\))
  - \(\#\{i \mid y^{(i)} = y\}\) counts how many chains ended with answer \(y\); the most frequent answer \(\hat{y}\) wins

As a tree, CoT-SC has breadth \(k\) (many independent paths) but each path is generated independently with no interaction between paths. Think of it as \(k\) separate depth-\(n\) chains hanging from the root, with no sharing of information between them. If chain 3 discovers a promising first step, chains 4 through \(k\) cannot benefit from that discovery.
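The majority vote itself is a one-liner. A minimal sketch (the sampled answers below are hypothetical, standing in for real chain outputs):

```python
from collections import Counter

def self_consistency(samples):
    """CoT-SC: each chain's final answer gets one vote; the modal answer
    wins. The reasoning inside each chain is discarded -- only the final
    answers are ever compared."""
    return Counter(samples).most_common(1)[0][0]

# Five hypothetical final answers from five independent chains:
answers = ["24", "no solution", "24", "22", "24"]
print(self_consistency(answers))  # 24
```

Note what the vote cannot do: it never inspects intermediate steps, so a promising first step discovered by one chain is invisible to the others.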

The key limitations of each approach map directly to tree structure:

| Method | Branching | Backtracking | Evaluation | Tree shape |
|---|---|---|---|---|
| CoT | None (breadth 1) | None | None | Single chain |
| CoT-SC | At root only (breadth \(k\)) | None | Final answer only (vote) | \(k\) independent chains |
| ToT | At every node (breadth \(b\)) | Yes (DFS) or via pruning (BFS) | At every node | Full tree |

The full Tree of Thoughts is the general case: branching at every node, evaluation at every node, and the ability to abandon bad paths.

Worked Example

Let us compare all three methods on the Game of 24 with input [4, 9, 10, 13], using 3 thought steps.

CoT (single chain): The model generates one sequence of three equations, for example starting with “4 + 9 = 13 (left: 10, 13, 13)”. As shown in Lesson 1, the remaining numbers [10, 13, 13] cannot reach 24, so the whole chain is doomed from its first step – and there is no way to recover.

CoT-SC (k=5 independent chains): Five chains are sampled independently. Most pick a doomed first step and end with a wrong answer or “no solution”; perhaps one happens to start well and reaches 24. The final answers are then compared by majority vote, so if wrong answers outnumber the correct one, the vote fails. Nothing discovered in one chain helps any other.

ToT (BFS, b=5, 3 steps): At step 1, the model generates 5 candidate first equations and evaluates each – for example “13 - 9 = 4 (left: 4, 4, 10)” rated “sure”, “10 - 4 = 6 (left: 6, 9, 13)” rated “maybe”, and “4 * 9 = 36 (left: 10, 13, 36)” rated “impossible”.

The top 5 surviving candidates (the “maybe” ones plus any “sure” ones) advance to step 2, where the process repeats. By step 3, the surviving path “13 - 9 = 4 → 10 - 4 = 6 → 4 * 6 = 24” reaches the solution.

The difference: ToT evaluates partial solutions at each step and prunes dead ends early. CoT-SC generates complete solutions and only evaluates at the end.

Exercises

Recall: In terms of tree structure, how is CoT a special case of ToT? How is CoT-SC a special case?

Apply: For a problem with 4 reasoning steps, calculate the maximum number of states that ToT-BFS with breadth \(b = 3\) and \(k = 5\) candidates per step would generate at each depth level. What is the total number of states evaluated across all 4 steps? Compare this to CoT-SC with \(k = 100\).

Extend: CoT-SC takes a majority vote over final answers, which works when the output space is small (like a number or multiple-choice answer). Why does this approach fail for creative writing, where each output is a unique passage? What alternative evaluation strategy could work for open-ended outputs?


Lesson 4: Thought Decomposition and Generation

The first two components of the ToT framework are deciding how to break a problem into intermediate steps (thought decomposition) and how to produce candidate steps at each node (thought generation). These choices define the shape and diversity of the search tree.

Explanation

Think about solving a jigsaw puzzle. You could approach it one piece at a time (fine granularity) or by assembling sections first – the sky, the border, a building – then connecting them (coarse granularity). The right granularity depends on the puzzle. For a 20-piece puzzle, one piece at a time works fine. For a 1000-piece puzzle, you need to work in sections or you will never see the big picture.

Thought decomposition defines the granularity of each reasoning step. A “thought” is a coherent chunk of text that serves as an intermediate step toward the solution. The right granularity depends on the problem:

  - Game of 24: a thought is one line – a single arithmetic equation
  - Creative Writing: a thought is a paragraph-level writing plan for the whole passage
  - Mini Crosswords: a thought is a few words – one candidate answer for one clue

The guiding principle: a thought should be large enough that the model can judge whether it is making progress, but small enough that the model can generate diverse candidates. Generating a whole book in one thought gives no room for diversity; generating one token gives no basis for evaluation.

Thought generation produces candidate thoughts at each tree node. The paper proposes two strategies:

(a) Independent sampling (i.i.d. – independently and identically distributed, meaning each sample is drawn from the same distribution without depending on other samples): Sample \(k\) thoughts independently from the language model. For each candidate \(j\):

\[z^{(j)} \sim p_\theta^{CoT}(z_{i+1} \mid x, z_{1 \cdots i}) \quad (j = 1 \cdots k)\]

where:

  - \(z^{(j)}\) is the \(j\)-th candidate thought, each sampled independently from the same CoT-style prompt
  - \(x, z_{1 \cdots i}\) is the current state: the input plus the thoughts taken so far

This works well when the thought space is rich (many possible paragraph plans, many possible story directions). Independent samples naturally produce diversity.

(b) Sequential proposal: Ask the model to propose \(k\) different thoughts in a single prompt:

\[[z^{(1)}, \cdots, z^{(k)}] \sim p_\theta^{propose}(z_{i+1}^{(1 \cdots k)} \mid s)\]

where:

  - \(s = [x, z_{1 \cdots i}]\) is the current state
  - the superscript \(propose\) means the prompt explicitly asks for \(k\) different next thoughts in a single completion

This works better when the thought space is constrained (few sensible next equations, few plausible crossword answers). Because the model sees its own previous proposals, it avoids generating duplicates.
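The two strategies can be sketched side by side. In this illustration (not from the paper), `llm` is a hypothetical text-completion callable; the canned stub at the bottom stands in for a real model call:

```python
def generate_iid(llm, state, k):
    """Strategy (a): k independent samples from the same CoT-style prompt.
    Good when the thought space is rich, so duplicates are unlikely."""
    prompt = f"Problem state:\n{state}\nPropose the next step:"
    return [llm(prompt) for _ in range(k)]

def generate_propose(llm, state, k):
    """Strategy (b): one call that asks for k distinct proposals. The model
    sees its own earlier proposals in the same completion, so it can avoid
    repeating them -- useful when the thought space is small."""
    prompt = (f"Problem state:\n{state}\n"
              f"Propose {k} different possible next steps, one per line:")
    return llm(prompt).strip().splitlines()[:k]

# Canned stub standing in for a real LM call, for illustration only:
fake_llm = lambda prompt: "13 - 9 = 4\n10 - 4 = 6\n4 * 9 = 36"
print(generate_propose(fake_llm, "[4, 9, 10, 13]", 3))
```

With a real model, strategy (a) issues \(k\) separate completions while strategy (b) issues one; the cost and diversity tradeoffs follow directly from that.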

Worked Example

Let us design the thought decomposition and generation strategy for a simple task: solving a 3-variable system of constraints.

Task: Given that A + B = 7, B + C = 11, and A + C = 8, find A, B, and C.

Thought decomposition: Each thought will determine the value of one variable. So we need 3 thoughts (one per variable).

Generation strategy: The thought space is constrained (there are only a few sensible integer values for each variable), so sequential proposal is appropriate.

Step 1 – the model proposes candidate values for A, for example:

  - Candidate 1: A = 2
  - Candidate 2: A = 3
  - Candidate 3: A = 1

With independent sampling, the model might generate “A = 2” three times (since it does not see previous proposals). With sequential proposal, after generating “A = 2” as candidate 1, the model can deliberately try different values for candidates 2 and 3, avoiding duplicates.

Step 2 – given A = 2 (the best candidate from step 1), propose values for B. The constraint A + B = 7 forces B = 5, so proposals that respect it cluster there:

  - Candidate 1: B = 5 (satisfies A + B = 7)
  - Candidate 2: B = 4 (violates A + B = 7, so evaluation will prune it)

Step 3 – given A = 2, B = 5, propose values for C. Both remaining constraints point to the same value: B + C = 11 gives C = 6, and A + C = 8 also gives C = 6. The completed solution is A = 2, B = 5, C = 6.
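As a sanity check on the arithmetic, a brute-force sweep (illustrative only, not part of the ToT pipeline) confirms the solution is unique over small integers:

```python
from itertools import product

# Enumerate all (A, B, C) in 0..9 and keep those satisfying all constraints.
solutions = [(a, b, c) for a, b, c in product(range(10), repeat=3)
             if a + b == 7 and b + c == 11 and a + c == 8]
print(solutions)  # [(2, 5, 6)]
```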

Exercises

Recall: What are the two thought generation strategies proposed in the paper? When is each one preferred?

Apply: You want to use ToT to solve Sudoku puzzles on a 4x4 grid. What would be a good thought decomposition – what does each “thought” represent? Would you use i.i.d. sampling or sequential proposal for thought generation? Justify your choice.

Extend: The paper says a thought should be “big enough to evaluate, small enough to be diverse.” For the Game of 24, each thought is one equation (combining two numbers). What if instead each thought were a complete 3-step solution (all three equations at once)? How would this change the tree structure, and why might it hurt performance?


Lesson 5: State Evaluation – The Language Model as Heuristic

The third component of ToT is state evaluation: judging how promising each partial solution is. This is what makes ToT a guided search rather than blind exploration. Remarkably, the same language model that generates thoughts also evaluates them – no separate reward model or external feedback is needed.

Explanation

Think of a chess player evaluating board positions. After mentally playing out a few candidate moves, the player does not calculate every possible continuation to checkmate. Instead, they use heuristic judgment: “This position gives me control of the center and a strong pawn structure – it looks promising.” The judgment is imperfect but approximately useful, and that is enough to guide the search toward good moves.

In classical AI search, heuristics are either hand-coded rules (like the evaluation function in Deep Blue, the chess computer) or learned from data (like the neural network in AlphaGo). ToT introduces a third option: using the language model itself to reason about whether a partial solution looks promising. The model is prompted to evaluate states using natural language deliberation.

The paper proposes two evaluation strategies:

(a) Value estimation rates each state independently. The model receives a state \(s\) and produces a value judgment:

\[V(p_\theta, S)(s) \sim p_\theta^{value}(v \mid s) \quad \forall s \in S\]

where:

  - \(S\) is the set of candidate states to evaluate; \(V\) assigns each state \(s \in S\) a value independently
  - \(v\) is the value judgment the model generates – a scalar (e.g. 1-10) or a classification (e.g. sure/maybe/impossible) that is mapped to a number

For Game of 24, the value prompt asks the model to classify whether the remaining numbers can reach 24. The model might reason: “Remaining numbers are 10, 10, 10. The largest product is 10 * 10 = 100, and 100 / 10 = 10, not 24. The largest sum is 30. Classification: impossible.” Or: “Remaining numbers are 4, 4, 6. Try 4 * 6 = 24, yes! Classification: sure.”

The paper samples values 3 times per state and uses the aggregate to make decisions. This is like asking three judges to rate each contestant independently and averaging their scores.
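The sample-and-aggregate step can be sketched as follows. The sure/maybe/impossible labels follow the paper’s classification scheme, but the numeric weights here are an illustrative choice, and `llm_value` is a stand-in for a real LM call:

```python
def aggregate_value(llm_value, state, n_samples=3):
    """Sample the value prompt n times and average the mapped scores.
    The label-to-number weights are illustrative, chosen so that a single
    'sure' dominates any number of 'maybe' votes."""
    weights = {"sure": 20.0, "maybe": 1.0, "impossible": 0.001}
    labels = [llm_value(state) for _ in range(n_samples)]
    return sum(weights[label] for label in labels) / n_samples

# Canned judgments standing in for three sampled LM evaluations:
judgments = iter(["sure", "sure", "maybe"])
score = aggregate_value(lambda s: next(judgments), "[4, 4, 10]")
print(round(score, 2))  # (20 + 20 + 1) / 3 = 13.67
```

Averaging over several samples smooths out the evaluator’s noise, the same way averaging three judges’ scores does.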

(b) Voting compares multiple states side by side and picks the most promising one:

\[V(p_\theta, S)(s) = \mathbb{1}[s = s^*], \quad s^* \sim p_\theta^{vote}(s^* \mid S)\]

where:

  - the model sees all candidate states in \(S\) at once and votes for the most promising one, \(s^*\)
  - \(\mathbb{1}[s = s^*]\) is an indicator function: the voted-for state gets value 1, all others get 0

Voting works better for subjective judgments. For creative writing, it is hard to assign a meaningful score to an outline in isolation. But comparing two outlines and judging “this one flows better” is more natural. It is like asking a judge to rank essays rather than score them independently.

Worked Example

Let us trace state evaluation for Game of 24 with input [4, 9, 10, 13]. After step 1, BFS has 5 surviving candidates. Let us evaluate three of them:

State A: “13 - 9 = 4 (left: 4, 4, 10)”

The value prompt asks: “Can you reach 24 with [4, 4, 10]? Evaluate as sure/maybe/impossible.”

The model reasons: “4 * 10 = 40, 40 - 4 = 36. No. 10 + 4 = 14, 14 + 4 = 18. No. 10 - 4 = 6, 6 * 4 = 24. Yes! Classification: sure.”

Across 3 samples: sure, sure, maybe. Aggregate value: high.

State B: “10 + 13 = 23 (left: 4, 9, 23)”

The model reasons: “23 + 9 = 32, too big. 23 - 9 = 14, 14 + 4 = 18. 23 - 4 = 19. 23 * 9 is way too big. 4 * 9 = 36, 36 - 23 = 13. Classification: impossible.”

Across 3 samples: impossible, impossible, maybe. Aggregate value: low.

State C: “9 * 4 = 36 (left: 10, 13, 36)”

The model reasons: “36 - 13 = 23, 23 + 10 = 33. 36 - 10 = 26, 26 - 13 = 13. 36 + 10 = 46. Everything is too large. Classification: impossible.”

Across 3 samples: impossible, impossible, impossible. Aggregate value: lowest.

After evaluation, BFS keeps the top \(b = 5\) states by score. State A advances. States B and C are pruned. The search invests its budget in the most promising directions.

Exercises

Recall: What are the two state evaluation strategies? When does the paper recommend each one?

Apply: You are using ToT to generate a coherent 3-paragraph essay. After generating 5 candidate outlines (thought step 1), you need to evaluate them. Would you use value estimation or voting? Write a short prompt you would use to perform this evaluation.

Extend: The state evaluator is itself an LM call that can be wrong. The paper notes that the evaluator sometimes marks valid but obscure crossword words as “impossible.” What happens to ToT’s performance when the evaluator has a 20% false negative rate (marking 20% of correct partial solutions as “impossible”)? How could you mitigate this?


Lesson 6: The BFS and DFS Search Algorithms for ToT

This lesson puts together the full ToT algorithms. We formalize how BFS and DFS use thought generation and state evaluation to systematically search the tree of thoughts.

Explanation

Think of BFS as a careful committee and DFS as a determined explorer. The committee (BFS) evaluates all options at each stage, votes on the best ones, and advances only those. The explorer (DFS) picks the most promising path and follows it all the way to the end. If it hits a dead end, the explorer backtracks to the last fork and tries a different branch.

ToT-BFS works as follows. It starts with the initial state \(S_0 = \{x\}\) (just the problem input). At each step \(t\) from 1 to \(T\):

  1. Generate: For each surviving state in \(S_{t-1}\), generate \(k\) candidate thoughts to create the expanded set \(S'_t\):

\[S'_t = \{[s, z] \mid s \in S_{t-1}, z \in G(p_\theta, s, k)\}\]

where \(G(p_\theta, s, k)\) is the thought generator producing \(k\) candidates from state \(s\).

  2. Evaluate: Score each candidate state using the evaluator:

\[V_t = V(p_\theta, S'_t)\]

  3. Prune: Keep only the \(b\) highest-scoring states:

\[S_t = \arg \max_{S \subset S'_t, \, |S| = b} \sum_{s \in S} V_t(s)\]

where:

  - \(b\) is the breadth limit: the subset \(S_t \subset S'_t\) of size \(b\) with the highest total score survives to the next step

After \(T\) steps, ToT returns the output generated from the highest-scoring state in \(S_T\).
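The three numbered steps map directly onto a short loop. This is a compact sketch, not the paper’s reference implementation: `generate` and `evaluate` are abstract callables that would be LM calls in practice, and the toy run at the bottom uses made-up scoring:

```python
def tot_bfs(x, generate, evaluate, T, k, b):
    """ToT-BFS sketch. generate(state, k) returns k candidate thoughts;
    evaluate(states) returns one score per state."""
    frontier = [[x]]  # S_0: the root state is just the problem input
    for t in range(T):
        # 1. Generate: expand every surviving state with k candidate thoughts
        expanded = [s + [z] for s in frontier for z in generate(s, k)]
        # 2. Evaluate: score each candidate state
        scores = evaluate(expanded)
        # 3. Prune: keep only the b highest-scoring states
        ranked = sorted(zip(scores, expanded), key=lambda p: p[0], reverse=True)
        frontier = [s for _, s in ranked[:b]]
    return frontier[0]  # highest-scoring state after T steps

# Toy run: thoughts are digits, and the score favors states whose digits sum high.
gen = lambda s, k: list(range(k))
val = lambda states: [sum(s[1:]) for s in states]
print(tot_bfs("problem", gen, val, T=2, k=3, b=2))  # ['problem', 2, 2]
```

The breadth parameter `b` is the only thing standing between this loop and exponential blowup: without pruning, the frontier would grow by a factor of `k` per step.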

ToT-DFS works differently. Starting from the root, it recursively explores the most promising child first. At each state \(s\) at step \(t\):

  1. If \(t > T\) (reached the maximum depth), record the output as a candidate solution.
  2. Otherwise, generate \(k\) candidate next thoughts and sort them by evaluator score.
  3. For each candidate \(s'\) (in order of score): if \(V(p_\theta, \{s'\})(s') > v_{th}\) (the score exceeds a pruning threshold), recursively search from \(s'\). If the score is below \(v_{th}\), prune the entire subtree.
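The recursive structure above can be sketched as follows (again with abstract callables standing in for LM calls; the toy run uses made-up scoring where only the all-ones path survives the threshold):

```python
def tot_dfs(state, t, T, generate, evaluate, v_th, output):
    """ToT-DFS sketch: explore the best-scoring child first and prune any
    subtree whose root state scores at or below the threshold v_th."""
    if t > T:
        return output(state)  # maximum depth reached: record a solution attempt
    candidates = [state + [z] for z in generate(state)]
    # Visit children best-first by evaluator score
    for child in sorted(candidates, key=evaluate, reverse=True):
        if evaluate(child) > v_th:  # below threshold: prune the whole subtree
            found = tot_dfs(child, t + 1, T, generate, evaluate, v_th, output)
            if found is not None:
                return found
    return None  # all children pruned or failed: backtrack

# Toy run: thoughts are bits; any 0 on the path drops the score to 0.
gen = lambda s: [0, 1]
val = lambda s: min(s[1:], default=1)
out = lambda s: s
print(tot_dfs(["root"], 1, 3, gen, val, 0.5, out))  # ['root', 1, 1, 1]
```

Note the frontier is implicit in the call stack: at any moment only one root-to-leaf path (plus untried siblings) is held, which is why DFS is the memory-efficient choice for deep trees.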

The key difference: BFS maintains a fixed-width beam at each depth level, while DFS follows one path at a time and backtracks on failure. BFS suits problems with shallow trees (few steps) where early pruning is effective. DFS suits problems with deeper trees where committing to a path and backtracking is more efficient.

The paper uses BFS for Game of 24 (depth \(T = 3\), breadth \(b = 5\)) and Creative Writing (depth \(T = 1\), breadth \(b = 5\)), and DFS for Mini Crosswords (depth up to 10, with pruning threshold).

Worked Example

Let us trace ToT-BFS on Game of 24 with input [4, 9, 10, 13], \(T = 3\) steps, \(k = 5\) candidates, \(b = 5\) breadth.

Step 1 (\(t = 1\)): Start with \(S_0 = \{[4, 9, 10, 13]\}\).

Generate \(k = 5\) candidates:

| Candidate | Equation | Remaining | Evaluator |
|---|---|---|---|
| \(z_1^{(1)}\) | 13 - 9 = 4 | [4, 4, 10] | sure (2/3), maybe (1/3) |
| \(z_1^{(2)}\) | 10 - 4 = 6 | [6, 9, 13] | maybe (3/3) |
| \(z_1^{(3)}\) | 4 * 9 = 36 | [10, 13, 36] | impossible (3/3) |
| \(z_1^{(4)}\) | 10 + 13 = 23 | [4, 9, 23] | impossible (2/3) |
| \(z_1^{(5)}\) | 13 - 4 = 9 | [9, 9, 10] | maybe (2/3) |

Prune to the top \(b = 5\) states. With only 5 candidates and \(b = 5\), all would formally survive, but the two rated “impossible” contribute almost nothing, so for this trace we follow only the three promising states: \(S_1 = \{z_1^{(1)}, z_1^{(2)}, z_1^{(5)}\}\).

Step 2 (\(t = 2\)): Expand each state in \(S_1\) with \(k\) new candidates.

From \(z_1^{(1)}\) [4, 4, 10], candidate next equations include:

  - 10 - 4 = 6 → remaining [4, 6] → evaluated “sure” (4 * 6 = 24)
  - 4 + 4 = 8 → remaining [8, 10] → evaluated “impossible”
  - 4 * 10 = 40 → remaining [4, 40] → evaluated “impossible”

From \(z_1^{(2)}\) [6, 9, 13], candidate next equations include:

  - 13 - 9 = 4 → remaining [4, 6] → evaluated “sure” (4 * 6 = 24)
  - 9 + 13 = 22 → remaining [6, 22] → evaluated “impossible”

Keep top \(b = 5\) across all expansions. The “sure” state from \(z_1^{(1)}\) dominates.

Step 3 (\(t = 3\)): From the “sure” state [4, 6], the model proposes 4 * 6 = 24, completing the path 13 - 9 = 4 → 10 - 4 = 6 → 4 * 6 = 24. The evaluator confirms the answer, and BFS returns it.

Total states evaluated: roughly 5 (step 1) + 15 (step 2, 3 parents * 5 children each, though the paper uses fewer) + 5 (step 3) = ~25 states. Compare this to CoT-SC’s 100 independent complete chains. ToT visits far fewer nodes but finds the solution more reliably because it prunes dead ends early.

Exercises

Recall: In the BFS algorithm, what happens to states with low evaluator scores at each step? How does the breadth parameter \(b\) control the tradeoff between exploration and efficiency?

Apply: For a problem with \(T = 4\) steps, \(k = 3\) candidates per step, and \(b = 2\) breadth limit in BFS, calculate the maximum number of evaluator calls at each step and the total across all 4 steps.

Extend: The paper uses DFS for mini crosswords because the tree is deep (5-10 words to fill). Why would BFS be impractical for this task? Consider the breadth parameter: with 5 horizontal and 5 vertical clues, each with potentially 10+ word candidates, what would the memory requirement be at each depth level?


Lesson 7: Tree of Thoughts – Putting It All Together

This final lesson assembles all four components – thought decomposition, thought generation, state evaluation, and search algorithm – into the complete Tree of Thoughts framework. This is the paper’s key contribution: a general, modular inference-time framework that turns any pretrained language model into a deliberate problem solver.

Explanation

Think of ToT as building a custom search engine for each problem you encounter. You pick the right gear for each aspect of the search: the size of each step (decomposition), how you brainstorm options (generation), how you judge options (evaluation), and how you navigate the search space (algorithm). Different problems need different configurations, but the framework is the same.

[Image: the propose prompt for a 5x5 mini crossword, with confidence-rated word candidates]

Figure 6a: The propose prompt for mini crosswords. The language model receives the current board state, unfilled clues, and filled/changed words, then proposes candidate answers with confidence levels (certain/high/medium/low). This prompt structure lets ToT rank which word to fill next, prioritizing high-confidence answers to constrain the remaining search space.

The complete ToT framework instantiates four questions for each problem:

  1. Thought decomposition: What constitutes one “step” of reasoning?
  2. Thought generation: How do we produce candidates at each step? (i.i.d. sampling or sequential proposal)
  3. State evaluation: How do we judge partial solutions? (independent value or comparative voting)
  4. Search algorithm: How do we traverse the tree? (BFS or DFS)

The paper demonstrates three different configurations:

Game of 24: Thought = one arithmetic equation. Generation = sequential proposal (\(k = 5\)). Evaluation = independent value (“sure/maybe/impossible”). Search = BFS (\(b = 5\), \(T = 3\)). Result: 74% success rate (vs. 4% for CoT).

Creative Writing: Thought = a paragraph-level plan for the whole passage. Generation = i.i.d. sampling (\(k = 5\)). Evaluation = voting (compare 5 plans side by side). Search = BFS (\(b = 1\) after voting, \(T = 1\)). Result: GPT-4 coherency score of 7.56/10 (vs. 6.93 for CoT). Human evaluators preferred ToT in 41 of 100 comparisons vs. 21 for CoT.

Mini Crosswords (5x5): Thought = one word to fill in. Generation = sequential proposal (propose all candidate words with confidence). Evaluation = independent value (check each unfilled clue for feasibility). Search = DFS (up to 100 steps, with pruning threshold). Result: 60% word accuracy, 4/20 games solved (vs. 15.6% word accuracy, 1 game for CoT).

[Image: partial crossword grid with AREFY, VALUE, ABOMA, LOPER, ETERN filled in]

Figure 6b: An intermediate crossword state during ToT’s depth-first search. The grid shows a partial solution with words filled in based on the highest-confidence proposals. If the state evaluator deems a remaining clue “impossible” to fill, DFS backtracks to try alternative words at earlier positions.

What makes ToT powerful is its modularity. The four components can be varied independently. You do not need to retrain the model or change its weights. You only need to design the right prompts for generation and evaluation, choose the right decomposition and search algorithm, and the same pretrained model becomes a deliberate problem solver.

The framework also reveals that existing methods are special cases:

  - CoT is ToT with breadth 1: a single chain, no branching, no evaluation, no search
  - CoT-SC is ToT that branches only at the root and evaluates only final answers, by majority vote

The key results from the paper demonstrate the power of structured search over unstructured sampling:

| Task | Method | Performance |
|---|---|---|
| Game of 24 | CoT | 4% success |
| Game of 24 | CoT-SC (k=100) | 9% success |
| Game of 24 | ToT (b=5) | 74% success |
| Creative Writing | CoT | 6.93 / 10 |
| Creative Writing | ToT | 7.56 / 10 |
| Mini Crosswords | CoT | 15.6% word accuracy |
| Mini Crosswords | ToT (DFS) | 60% word accuracy |

The tradeoff is cost. ToT requires 5 to 100 times more LM calls than CoT. For Game of 24, solving one problem with ToT costs roughly $0.74 with GPT-4 (vs. $0.47 for 100 CoT samples). ToT is not a free lunch – it trades compute for accuracy.

Worked Example

Let us design a complete ToT configuration for a new problem: arranging 4 meeting slots into a schedule that satisfies constraints.

Problem: Schedule meetings A, B, C, D into slots 9am, 10am, 11am, 12pm. Constraints: A must be before B. C cannot be at 11am. D must be immediately after C.

Step 1 – Thought decomposition: Each thought assigns one meeting to a slot. We need 4 thoughts (one per meeting). We will assign meetings in order: C first (since it has the most constraints), then D (since it depends on C), then A, then B.

Step 2 – Thought generation: The space is constrained (only 4 slots), so we use sequential proposal. For C, propose: “C at 9am”, “C at 10am”, “C at 12pm” (skip 11am per constraint).

Step 3 – State evaluation: For each candidate, check the remaining constraints using independent value. “C at 9am” → D can follow at 10am → sure. “C at 10am” → D can follow at 11am → sure. “C at 12pm” → there is no slot after 12pm for D → impossible.

Step 4 – Search algorithm: BFS with \(b = 2\) (keep 2 best candidates). Both “sure” candidates survive.

Step 5 – Continue BFS: Expand both surviving states to assign D, then A, then B. Both paths lead to valid schedules:

  - C at 9am, D at 10am, A at 11am, B at 12pm
  - A at 9am, C at 10am, D at 11am, B at 12pm

Both are valid. ToT returns the highest-scored one (or either, since both are “sure”).
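The worked example can be double-checked by brute force. This verifier (illustrative helper, not part of the ToT configuration) enumerates all 24 assignments and keeps those satisfying the three constraints:

```python
from itertools import permutations

SLOTS = [9, 10, 11, 12]  # 9am, 10am, 11am, 12pm as hour numbers

def valid(schedule):
    """schedule maps meeting name -> hour; checks all three constraints."""
    return (schedule["A"] < schedule["B"]           # A must be before B
            and schedule["C"] != 11                 # C cannot be at 11am
            and schedule["D"] == schedule["C"] + 1) # D immediately after C

found = [dict(zip("ABCD", p)) for p in permutations(SLOTS)
         if valid(dict(zip("ABCD", p)))]
for s in found:
    print(s)  # prints exactly the two schedules listed above
```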

Exercises

Recall: Name the four configurable components of the ToT framework. For each, what are the two options the paper proposes?

Apply: Design a ToT configuration for solving a simple logic puzzle: “Alice, Bob, and Carol each own a different pet (cat, dog, fish). Alice does not own the cat. Bob owns the dog.” Specify: (1) what each thought represents, (2) generation strategy, (3) evaluation strategy, (4) search algorithm and parameters.

Extend: The paper shows that ToT with GPT-3.5 on Game of 24 achieves only 19% success (vs. 74% with GPT-4). This suggests ToT amplifies the base model’s capabilities but cannot compensate for fundamental weaknesses. Why might a weaker model fail even with tree search? Consider what happens when both the generator and the evaluator are unreliable.


Comprehension Questions

  1. The paper shows that ~60% of CoT failures on Game of 24 occur at the very first step. How does ToT’s BFS approach specifically address this problem? What would happen if the evaluator were perfect vs. random at step 1?

  2. ToT uses the same language model for both thought generation and state evaluation. What are the advantages and risks of this design? Why did the authors not train a separate evaluation model?

  3. For creative writing, the paper uses voting (comparative evaluation) instead of independent value scoring. What property of creative writing makes voting more appropriate? Could you use voting for Game of 24 instead of value scoring, and what would change?

  4. The paper does not compare ToT to the simple baseline of giving the language model access to a Python interpreter for Game of 24. A Python script can enumerate all solutions in milliseconds. When is ToT’s “LM-as-search-engine” approach preferable to giving the LM external tools, and when is it not?

  5. Both self-consistency (see Self-Consistency Improves Chain of Thought Reasoning in Language Models) and ToT spend more inference-time compute to get better results. How do they spend this compute differently? In what scenario would you prefer self-consistency over ToT despite ToT’s higher accuracy?


Hands-On Project

Goal

Build a Tree of Thoughts BFS search that solves “Game of 24” puzzles, using a simulated language model: an exhaustive step generator plays the role of thought generation and a deterministic heuristic plays the role of state evaluation, all implemented in pure Python with no API calls.

Specification

You will implement the core ToT-BFS algorithm. Since we cannot call GPT-4, we simulate the two LM roles: the thought generator becomes generate_thoughts, which exhaustively enumerates legal arithmetic steps, and the state evaluator becomes evaluate_state, a hand-coded heuristic.

The BFS algorithm keeps the top \(b\) states at each depth level, exactly as described in the paper. You will compare BFS with breadth \(b = 1\) (equivalent to greedy search) against \(b = 3\) and \(b = 5\) to see how broader search improves success rate.
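The expand-score-truncate loop at the heart of ToT-BFS can be sketched generically before you implement the Game of 24 version. The toy problem and names below are ours, not part of the project spec:

```python
def beam_step(states, expand, score, b):
    """One BFS/beam step: expand every state, score candidates, keep top b."""
    candidates = [child for s in states for child in expand(s)]
    return sorted(candidates, key=score, reverse=True)[:b]


# Toy problem: grow 3-character digit strings whose digits sum close to 9
target = 9


def expand(s):
    return [s + d for d in "123"]


def score(s):
    return -abs(target - sum(int(c) for c in s))


states = [""]
for _ in range(3):
    states = beam_step(states, expand, score, b=2)
print(states)  # the best surviving string is "333" (digits sum to exactly 9)
```

With \(b = 1\) this loop degenerates to greedy search; larger \(b\) keeps alternatives alive so one bad-looking but ultimately correct step can survive pruning.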

Starter Code

from itertools import combinations


def generate_thoughts(numbers):
    """
    Given a list of remaining numbers, generate all possible next
    arithmetic steps. Each step combines two numbers with an operation,
    producing a new list of remaining numbers.

    Returns a list of (description, remaining_numbers) tuples.

    Example: numbers = [4, 9, 10, 13]
    One thought: ("13 - 9 = 4", [4, 4, 10])

    TODO: For each pair of numbers and each operation (+, -, *, /),
    compute the result and build the new remaining list.
    Skip division by zero, but keep fractional results (some puzzles,
    e.g. [1, 3, 4, 6], require them). Skipping subtractions that
    yield negative numbers is safe, since both orders of each pair
    are tried.
    """
    thoughts = []
    ops = [
        ("+", lambda a, b: a + b),
        ("-", lambda a, b: a - b),
        ("*", lambda a, b: a * b),
        ("/", lambda a, b: a / b),
    ]
    # TODO: implement
    # Hint: use combinations(range(len(numbers)), 2) to get index pairs
    return thoughts


def evaluate_state(numbers):
    """
    Heuristic evaluator: estimate how promising a set of remaining
    numbers is for reaching 24.

    Returns a float score between 0.0 and 1.0.

    TODO: Implement a heuristic. Suggestions:
    - If only one number remains and it is within 1e-6 of 24
      (division produces floats, so use a tolerance), return 1.0
    - If only one number remains and it is not 24, return 0.0
    - If two numbers remain, check all operations; if any result is
      within 1e-6 of 24, return 1.0; otherwise return a score based
      on how close the best result is to 24
    - For three or more numbers, return a moderate score (e.g., 0.5),
      lowered when the state looks hopeless (e.g., all numbers far
      above 24, or all fractions well below 1)
    """
    # TODO: implement
    pass


def tot_bfs(numbers, breadth, max_steps=3):
    """
    Run Tree of Thoughts BFS on a Game of 24 instance.

    Args:
        numbers: list of 4 integers (the input)
        breadth: how many states to keep at each step (the 'b' parameter)
        max_steps: number of thought steps (3 for Game of 24)

    Returns:
        (success: bool, solution_path: list of step descriptions)

    TODO: Implement the BFS loop:
    1. Start with states = [(numbers, [])]  where [] is the path so far
    2. For each step:
       a. Expand: generate thoughts for every state
       b. Evaluate: score each expanded state
       c. Prune: keep the top 'breadth' states by score
    3. After max_steps, check if any surviving state has reached 24
    """
    # TODO: implement
    pass


def run_experiment():
    """Compare ToT-BFS at different breadth values on Game of 24 puzzles."""
    # 20 Game of 24 puzzles (all solvable)
    puzzles = [
        [1, 2, 3, 4],    # 1 * 2 * 3 * 4 = 24
        [1, 3, 4, 6],    # 6 / (1 - 3/4) = 24
        [1, 5, 5, 5],    # 5 * (5 - 1/5) = 24
        [2, 3, 4, 5],    # 2 * (3 + 4 + 5) = 24
        [1, 2, 6, 8],    # 1 * 6 * 8 / 2 = 24
        [3, 3, 8, 8],    # 8 / (3 - 8/3) = 24
        [4, 4, 6, 6],    # 6 * 4 * (4 / 4) = 24
        [1, 4, 5, 6],    # 6 / (5/4 - 1) = 24
        [2, 3, 5, 12],   # 12 / (3 - 5/2) = 24
        [4, 9, 10, 13],  # (13 - 9) * (10 - 4) = 24
        [1, 1, 3, 8],    # 8 * 3 * 1 * 1 = 24
        [2, 2, 4, 6],    # 6 * 4 + 2 - 2 = 24
        [1, 2, 4, 12],   # 12 * (4 - 2) * 1 = 24
        [3, 4, 5, 6],    # (5 - 4 + 3) * 6 = 24
        [3, 3, 7, 7],    # 7 * (3 + 3/7) = 24
        [2, 5, 7, 11],   # 11 * 2 + 7 - 5 = 24
        [1, 3, 5, 7],    # (7 + 5) * (3 - 1) = 24
        [2, 4, 8, 10],   # (10 - 4) * (8 / 2) = 24
        [3, 6, 8, 9],    # (8 - 6) * (9 + 3) = 24
        [1, 2, 7, 9],    # 9 * 2 + 7 - 1 = 24
    ]

    breadths = [1, 3, 5]
    print("Game of 24 -- ToT-BFS Performance by Breadth")
    print("=" * 50)
    print(f"{'Breadth':<10}{'Solved':<10}{'Success Rate':<15}")
    print("-" * 50)

    for b in breadths:
        solved = 0
        for puzzle in puzzles:
            success, path = tot_bfs(puzzle, breadth=b)
            solved += int(success)
        rate = solved / len(puzzles)
        print(f"{b:<10}{solved:<10}{rate:<15.1%}")

    # Show a detailed trace for one puzzle
    print("\nDetailed trace for [4, 9, 10, 13] with breadth=3:")
    success, path = tot_bfs([4, 9, 10, 13], breadth=3)
    if success:
        for i, step in enumerate(path):
            print(f"  Step {i+1}: {step}")
    else:
        print("  Failed to find solution")


if __name__ == "__main__":
    run_experiment()

Expected Output

Game of 24 -- ToT-BFS Performance by Breadth
==================================================
Breadth   Solved    Success Rate
--------------------------------------------------
1         8         40.0%
3         14        70.0%
5         17        85.0%

Detailed trace for [4, 9, 10, 13] with breadth=3:
  Step 1: 13 - 9 = 4 (left: [4, 4, 10])
  Step 2: 10 - 4 = 6 (left: [4, 6])
  Step 3: 4 * 6 = 24 (left: [24])

The exact numbers will depend on your heuristic evaluator, but the pattern should be clear: wider breadth (keeping more candidates at each step) leads to more puzzles solved. Breadth 1 (greedy – equivalent to picking the single best candidate at each step) misses many solutions because it cannot recover from a suboptimal first choice. Breadth 5 explores enough alternatives to find solutions that breadth 1 misses – the same insight the paper demonstrates with GPT-4.
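One implementation subtlety: division produces floats, so the final “did we reach 24?” check should compare against 24 with a tolerance rather than with `==`. A minimal sketch (the helper name is ours):

```python
def reaches_24(x, tol=1e-6):
    """Float-safe success test for a single remaining number."""
    return abs(x - 24) < tol


print(reaches_24(6 / (1 - 3 / 4)))  # solution to [1, 3, 4, 6] -> True
print(reaches_24(23.9))             # -> False
```

The same tolerance belongs in evaluate_state wherever it checks whether a remaining pair of numbers can combine to 24.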


Further Reading