Prompting & Reasoning

Tree of Thoughts (ToT)

By Aditya Kumar Jha, Engineer

Tree of Thoughts (ToT) is a reasoning framework that lets a language model explore multiple intermediate reasoning steps as a search tree, evaluating and backtracking among branches. It generalizes chain-of-thought from a single linear path to deliberate search.

What is Tree of Thoughts (ToT)?

Tree of Thoughts (ToT) is a prompting and reasoning framework that frames problem solving as a search over a tree of intermediate steps, where each node is a thought, a coherent partial solution. Introduced by Yao and colleagues in 2023, it lets a language model generate several candidate next steps, evaluate how promising each is, and explore the most promising branches while backtracking from dead ends.

Where chain-of-thought produces one linear sequence of reasoning, Tree of Thoughts maintains many partial solutions at once and applies classic search to decide which to expand. This turns reasoning into a deliberate process of exploration and self-evaluation rather than a single forward pass.

  • A thought is a coherent intermediate reasoning step, a node in the tree.
  • The model generates candidate steps, evaluates them, then expands the best.
  • Introduced by Yao et al., 2023.

How it differs from chain-of-thought

Chain-of-thought (CoT) prompts a model to produce a single reasoning trace from problem to answer. If an early step is wrong, the whole chain typically fails because there is no mechanism to reconsider. Self-consistency improves on this by sampling many independent chains and voting, but each chain is still linear and there is no exploration within a chain.

Tree of Thoughts adds explicit branching and backtracking. At each step the model proposes several alternative thoughts, scores them, and keeps the strongest, so it can abandon a poor branch and try another. This makes ToT suited to problems that need planning, lookahead, or trial and error, where committing to one path early is risky.

  • CoT: one linear path, no backtracking.
  • Self-consistency: many linear paths, voted at the end, still no in-path exploration.
  • ToT: branching tree with evaluation and backtracking.

Search and evaluation

Tree of Thoughts navigates the tree with standard search algorithms, most commonly breadth-first search (BFS) or depth-first search (DFS). BFS keeps a fixed number of the best partial solutions at each level, while DFS pursues one branch deeply and backtracks when it stops looking promising.

The model also acts as its own evaluator. At each node it judges candidate thoughts, for example rating them as sure, maybe, or impossible toward solving the problem, and the search uses these self-evaluations to prune and prioritize. This pairing of generation and evaluation by the same model is what lets ToT search without an external solver.

python
# Simplified pseudocode: propose_thoughts and evaluate are LLM calls
def tree_of_thoughts_bfs(problem, k=3, beam=3, depth=3):
    # k candidate thoughts per node, keep top `beam` partial solutions
    frontier = [""]  # start with an empty partial solution
    for _ in range(depth):
        candidates = []
        for partial in frontier:
            for thought in propose_thoughts(problem, partial, k):
                score = evaluate(problem, partial + thought)  # model rates promise
                candidates.append((partial + thought, score))
        # keep the most promising partial solutions (beam search)
        candidates.sort(key=lambda x: x[1], reverse=True)
        frontier = [c[0] for c in candidates[:beam]]
    return frontier[0]  # best full solution found
A simplified breadth-first Tree of Thoughts loop: propose candidate thoughts, score each with the model, and keep the top partial solutions at each depth.
  • Search strategies: typically BFS or DFS over the thought tree.
  • The model self-evaluates candidate thoughts to prune and prioritize.
  • Branching factor and search depth are tunable per task.

When to use it

Tree of Thoughts shines on tasks that require search, planning, or exploration rather than recall. The original paper reported a large gap on the Game of 24, a math puzzle, where GPT-4 with chain-of-thought solved about 4 percent of cases while Tree of Thoughts reached about 74 percent. It was also evaluated on tasks such as creative writing and crosswords.

The cost is many more model calls per problem, since each node requires generation and evaluation across multiple branches. For simple questions, chain-of-thought or even direct prompting is cheaper and adequate. Tree of Thoughts is worth its overhead mainly when a problem genuinely benefits from exploring and pruning alternatives.

  • Strong on search and planning tasks like Game of 24 and constrained puzzles.
  • Yao et al. reported roughly 4 percent (CoT) versus 74 percent (ToT) on Game of 24.
  • High inference cost makes it overkill for simple questions.

Key takeaways

  • Tree of Thoughts frames reasoning as a search tree of intermediate thoughts with branching and backtracking.
  • It generalizes chain-of-thought from a single linear path to deliberate exploration of alternatives.
  • The model self-evaluates candidate thoughts, and search proceeds via BFS or DFS.
  • On the Game of 24, the paper reported roughly 74 percent success versus about 4 percent for chain-of-thought.
  • Its strength on planning tasks comes at the cost of many more model calls per problem.

Frequently asked questions

Tree of Thoughts is a reasoning framework that explores multiple intermediate reasoning steps as a search tree. The model proposes candidate thoughts, evaluates them, expands promising branches, and backtracks from dead ends, rather than following one linear chain.
Chain-of-thought produces one linear reasoning path with no way to reconsider a wrong step. Tree of Thoughts adds branching, self-evaluation, and backtracking, so the model can abandon poor branches and explore alternatives before committing to an answer.
It typically uses breadth-first search, which keeps the best partial solutions at each level, or depth-first search, which pursues one branch deeply and backtracks. The model itself scores candidate thoughts to guide pruning and prioritization.
Use it for tasks that need search, planning, or lookahead, such as puzzles and constrained problems where one early wrong step dooms a linear chain. For simple questions, chain-of-thought or direct prompting is cheaper and sufficient.
It requires many more model calls than chain-of-thought because each node needs generation and evaluation across several branches. This raises latency and cost, making it overkill for straightforward questions that do not benefit from exploration.