ReAct: Synergizing Reasoning and Acting in Language Models

Authors: Shunyu Yao, Jeffrey Zhao, Dian Yu et al. Year: 2022 (published at ICLR 2023) Source: arXiv 2210.03629

One-Sentence Summary

By prompting a language model to alternate between writing out its reasoning in plain language and taking concrete actions (like searching Wikipedia), ReAct lets the model ground its thinking in real information and act more intelligently – outperforming methods that only reason or only act.

Problem Statement

Before ReAct, researchers had explored two separate capabilities of large language models (LLMs – models like GPT-3 or PaLM that have been trained on vast amounts of text and can generate fluent language). One line of work focused on reasoning: chain-of-thought prompting (CoT) showed that if you give an LLM a few examples of step-by-step thinking, it can work through multi-step problems by writing out intermediate reasoning steps. The other line focused on acting: using LLMs to generate actions in interactive environments, like navigating websites or controlling robots.

The problem was that reasoning alone is a “static black box.” When a model reasons entirely from its internal knowledge, it has no way to check facts against the real world. This leads to hallucination – the model confidently invents plausible-sounding but incorrect information, and these errors compound as each reasoning step builds on the last. In experiments, hallucination accounted for 56% of CoT’s failures on multi-hop question answering.

Conversely, acting alone – generating sequences of actions without explicit reasoning – makes it hard for the model to formulate high-level plans, track progress toward a goal, handle exceptions, or synthesize information from multiple observations. An acting-only agent might search Wikipedia for an answer but then fail to combine the retrieved facts into a correct conclusion, because it never pauses to reason about what it found.

No prior work had systematically studied how to combine reasoning and acting in a synergistic loop, or whether such a combination would outperform either capability in isolation.

Key Innovation

Think of a detective investigating a case. A purely “reasoning” detective sits in their office, never visits a crime scene, and tries to solve everything from memory – they might construct an elegant theory that turns out to be wrong because they misremembered a key fact. A purely “acting” detective visits every location and interviews every witness but never stops to think about what the evidence means – they collect mountains of information but cannot piece it together. A good detective does both: they think about what evidence they need, go collect it, reflect on what they found, decide what to investigate next, and eventually synthesize everything into a conclusion.

ReAct applies exactly this principle to language models. The core idea is remarkably simple: augment the model’s action space with language itself. Alongside concrete actions like search[entity] or click[button], the model can also produce “thoughts” – free-form text that reasons about the current situation but does not affect the external environment. These thoughts serve multiple purposes: decomposing a complex goal into subgoals, extracting key information from observations, performing commonsense reasoning, tracking progress, handling exceptions, and synthesizing a final answer.

Technically, this is implemented through prompting (see Improving Language Understanding by Generative Pre-Training for background on how language models learn from examples). The researchers write a few example trajectories showing interleaved thoughts, actions, and observations, then give these to the model as in-context examples. The model learns from these examples to produce its own interleaved reasoning-acting trajectories for new problems. No model weights are changed – the model is used as-is, guided only by the prompt.
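Concretely, in-context prompting here just means concatenating a few exemplar trajectories in front of the new problem. A minimal sketch, assuming the thought/action/observation labels from the paper (the helper function and the placeholder exemplar are illustrative, not the authors' code):

```python
def build_prompt(exemplars, question):
    """Join few-shot ReAct trajectories, then append the new problem.

    The trailing 'Thought 1:' cue nudges the model to begin its own
    interleaved reasoning-acting trajectory.
    """
    return "\n\n".join(exemplars) + f"\n\nQuestion: {question}\nThought 1:"

# A placeholder exemplar in the paper's label format (contents elided).
exemplar = (
    "Question: ...\n"
    "Thought 1: ...\n"
    "Action 1: search[...]\n"
    "Observation 1: ...\n"
    "Thought 2: ...\n"
    "Action 2: finish[...]"
)
```

Because no weights change, swapping task domains is just a matter of swapping exemplars.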

Architecture / Method

ReAct is not a new model architecture but a prompting paradigm. It works with any sufficiently capable large language model. The paper primarily uses PaLM-540B (a 540-billion parameter language model from Google) and also validates results with GPT-3.



Figure 1: Top: Comparison of four prompting paradigms. (a) Standard prompting produces answers directly. (b) Chain-of-thought adds reasoning but cannot act on the world. (c) Act-only takes actions without explicit reasoning. (d) ReAct interleaves reasoning (Thought) with acting (Action/Observation). Bottom: Example ReAct trajectories for a HotpotQA question and an ALFWorld household task, showing how thoughts guide action selection.

Step 1: Define the action space. For each task domain, the authors define a small set of concrete actions the model can take. For question answering with Wikipedia, these are: search[entity] (returns the first 5 sentences from a Wikipedia page), lookup[string] (finds the next sentence containing that string on the current page, like Ctrl+F in a browser), and finish[answer] (submits a final answer). For interactive decision-making tasks like ALFWorld (a text-based household simulation), actions include things like go to coffeetable 1, take paper 2, and use desklamp 1.
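As a rough sketch of Step 1 (not the authors' implementation), the three Wikipedia actions can be modeled as parsed commands against a mock page store; all names here are illustrative:

```python
def parse_action(text: str):
    """Split an action string like 'search[Apple Remote]' into (verb, argument)."""
    verb, _, rest = text.partition("[")
    return verb.strip().lower(), rest.rstrip("]")

class WikiEnv:
    """Minimal mock of the Wikipedia tool environment (illustrative only)."""

    def __init__(self, pages):
        self.pages = pages      # {title: list of sentences}
        self.current = []       # sentences of the last searched page
        self.cursor = 0         # lookup position, like Ctrl+F

    def step(self, action: str) -> str:
        verb, arg = parse_action(action)
        if verb == "search":
            self.current, self.cursor = self.pages.get(arg, []), 0
            # Return the first 5 sentences of the page, as in the paper.
            return " ".join(self.current[:5]) if self.current else f"Could not find [{arg}]."
        if verb == "lookup":
            # Find the next sentence on the current page containing the string.
            for i in range(self.cursor, len(self.current)):
                if arg in self.current[i]:
                    self.cursor = i + 1
                    return self.current[i]
            return f"No more results for '{arg}'."
        if verb == "finish":
            return f"Final answer: {arg}"
        raise ValueError(f"Unknown action: {action}")
```

The key property is that the action space is tiny and textual, so the model can emit actions as ordinary generated text.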

Step 2: Augment with thoughts. The model is allowed to produce “thought” actions – free-form text that reasons about the situation. These thoughts do not affect the environment and produce no observation feedback. Instead, they update the model’s internal context, helping it plan, reason, and decide what action to take next.

Step 3: Write few-shot exemplars. Human annotators create a small number of example trajectories (3 to 6 per task) showing how to solve representative problems using interleaved thoughts, actions, and observations. For reasoning-heavy tasks like question answering, every action is preceded by a thought (dense thought pattern). For decision-making tasks that involve many actions, thoughts appear only at key decision points (sparse thought pattern).

Step 4: Prompt and generate. The exemplar trajectories are placed into the model’s prompt, followed by a new problem. The model generates a trajectory token by token, producing thoughts, actions, and (after receiving environment observations) further thoughts and actions, until it reaches a terminal action like finish[answer].
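Steps 2 through 4 together form a generate-act-observe loop, which can be sketched in a few lines of Python. Here `llm` stands for any text-completion function and the labels mirror the paper's trajectory format, but the details are an illustrative sketch, not the authors' implementation:

```python
def react_loop(llm, env, question, exemplars, max_steps=8):
    """Run one ReAct episode; returns the final answer, or None on step limit."""
    context = exemplars + f"Question: {question}\n"
    for step in range(1, max_steps + 1):
        # The model may emit a free-form Thought, then an Action; stopping at
        # the next Observation label prevents it from hallucinating feedback.
        completion = llm(context, stop=[f"\nObservation {step}:"])
        context += completion
        action = completion.split(f"Action {step}:")[-1].strip()
        if action.lower().startswith("finish["):
            return action[len("finish["):].rstrip("]")
        # Thoughts stay in the context only; just the concrete action
        # reaches the environment, which replies with an observation.
        observation = env.step(action)
        context += f"\nObservation {step}: {observation}\n"
    return None
```

Note that thoughts cost nothing in the environment: they only grow the context that conditions the next generation.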

Step 5: Combine with CoT-SC (optional). The authors also propose hybrid strategies that combine ReAct with chain-of-thought self-consistency (CoT-SC, a method that samples multiple reasoning chains and takes a majority vote). Two variants are tested: (A) ReAct first, falling back to CoT-SC if ReAct fails to produce an answer within a step limit; (B) CoT-SC first, falling back to ReAct when the majority vote is not confident (the top answer appears in fewer than half the samples).
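The two hybrid strategies reduce to a majority vote plus a fallback test. A minimal sketch, assuming `react` and `cot_sc` stand for the two prompting methods (function names are illustrative):

```python
from collections import Counter

def majority_answer(samples):
    """CoT-SC core: majority vote over sampled answers, plus the vote share."""
    answer, votes = Counter(samples).most_common(1)[0]
    return answer, votes / len(samples)

def react_then_cot_sc(react, cot_sc, question):
    """Strategy A: fall back to CoT-SC if ReAct hits its step limit (None)."""
    answer = react(question)
    return answer if answer is not None else cot_sc(question)

def cot_sc_then_react(react, samples, question):
    """Strategy B: fall back to ReAct when the vote is not confident
    (top answer appears in fewer than half the samples)."""
    answer, share = majority_answer(samples)
    return answer if share >= 0.5 else react(question)
```

The confidence threshold of one half matches the description above; the paper's exact sampling parameters (e.g., 21 samples) are orthogonal to this logic.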

Concrete Example: HotpotQA

To make this concrete, consider the question: “Aside from the Apple Remote, what other device can control the program Apple Remote was originally designed to interact with?”

A ReAct trajectory looks like this (abridged from the paper):

Thought 1: I need to search Apple Remote and find the program it was originally designed to interact with.
Action 1: search[Apple Remote]
Observation 1: The Apple Remote is a remote control ... originally designed to control the Front Row media center program ...
Thought 2: Apple Remote was originally designed to control the Front Row media center program. I need to search Front Row next and find what other device can control it.
Action 2: search[Front Row]
Observation 2: Could not find [Front Row]. Similar: ['Front Row (software)', ...]
Thought 3: Front Row is not found. I need to search Front Row (software).
Action 3: search[Front Row (software)]
Observation 3: Front Row is a discontinued media center software ...
Thought 4: Front Row (software) is controlled by an Apple Remote or the keyboard function keys. So the answer is keyboard function keys.
Action 4: finish[keyboard function keys]

Notice how each thought reasons about what was observed and plans the next step. Without thoughts, the Act-only baseline failed on this same question because it never paused to synthesize the final answer from the retrieved information.

Mathematical Foundations

The ReAct paper is primarily a prompting strategy paper and contains no traditional loss functions or training objectives. However, it formalizes the agent-environment interaction and the key innovation (adding language to the action space) using precise notation. These definitions are critical for understanding what ReAct actually changes.

Agent-Environment Interaction

\[\pi(a_t \mid c_t), \quad \text{where } c_t = (o_1, a_1, \cdots, o_{t-1}, a_{t-1}, o_t)\]

What it means: At each step, the agent sees the entire history of what it has observed and done, then picks an action according to its policy. For an LLM-based agent, \(\pi\) is the language model itself – it reads the concatenated text of all prior observations and actions and generates the next action as text.

Why it matters: This formulation makes explicit that the agent must map an increasingly long context to a single action. For complex tasks like multi-hop question answering, the mapping \(c_t \mapsto a_t\) requires substantial implicit reasoning, which is exactly the bottleneck that ReAct addresses.

Augmented Action Space

\[\hat{\mathcal{A}} = \mathcal{A} \cup \mathcal{L}\]

What it means: ReAct expands what the agent can “do” by adding every possible natural language string as a valid action. When the agent selects an action from \(\mathcal{L}\) (a thought), nothing happens in the environment – no observation is returned. The thought is purely internal. When it selects from \(\mathcal{A}\), the environment responds with an observation as usual.

Why it matters: This is the entire formal contribution of the paper expressed in a single line. By taking the union of the concrete action space with the language space, reasoning and acting become first-class citizens in the same decision loop. The simplicity of this formulation is the point – the power comes not from a complex mechanism but from the LLM’s ability to use unrestricted language to think.

Context Update via Thought

\[c_{t+1} = (c_t, \hat{a}_t) \quad \text{where } \hat{a}_t \in \mathcal{L}\]

What it means: When the agent produces a thought, it gets appended to the context. The environment does not respond (no new \(o_{t+1}\)), but the thought is visible to the agent in future steps. This is how the model “remembers” its reasoning – the thought becomes part of the text that the LLM conditions on when generating its next output.

Why it matters: This formalizes how thoughts support future decision-making. A worked example: suppose the context \(c_3\) is the text “Question: What is X? Action 1: search[Y]. Obs 1: Y was founded in 1990 and…” The model might produce the thought \(\hat{a}_3\) = “Y was founded in 1990, which is after 1985, so I need to search Z instead.” Now \(c_4\) includes this thought, and the model can directly act on this conclusion without re-deriving it.
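This update rule can be mirrored in a toy snippet, representing the context as a tuple of strings (the helper and its signature are hypothetical, purely for illustration):

```python
def step_context(context, item, is_thought, env=None):
    """ReAct context update: thoughts extend the context but receive no
    environment feedback; concrete actions also append an observation."""
    context = context + (item,)
    if is_thought:
        return context, None          # c_{t+1} = (c_t, a_hat_t), no o_{t+1}
    observation = env(item)           # a_t in A: environment responds
    return context + (observation,), observation
```

The asymmetry is the whole mechanism: a thought changes only what the model conditions on next, never the external world.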

Results

ReAct was evaluated on four diverse benchmarks spanning two categories: knowledge-intensive reasoning tasks (HotpotQA for multi-hop question answering, FEVER for fact verification) and interactive decision-making tasks (ALFWorld for text-based household navigation, WebShop for online shopping).

Knowledge-Intensive Reasoning (HotpotQA and FEVER)

| Method | HotpotQA (EM, exact match) | FEVER (Accuracy) |
|---|---|---|
| Standard | 28.7 | 57.1 |
| CoT | 29.4 | 56.3 |
| CoT-SC (21 samples) | 33.4 | 60.4 |
| Act only | 25.7 | 58.9 |
| ReAct | 27.4 | 60.9 |
| CoT-SC then ReAct | 34.2 | 64.6 |
| ReAct then CoT-SC | 35.1 | 62.0 |
| Supervised SoTA | 67.5 | 89.5 |

On FEVER, ReAct outperforms CoT (60.9 vs. 56.3) because fact verification requires checking claims against actual evidence, and ReAct can look things up. On HotpotQA, ReAct slightly trails CoT (27.4 vs. 29.4) because the structural constraint of alternating thoughts and actions reduces reasoning flexibility. However, the combination strategies significantly outperform all individual methods, with ReAct-then-CoT-SC achieving 35.1 on HotpotQA and CoT-SC-then-ReAct achieving 64.6 on FEVER.

A human analysis of 200 HotpotQA trajectories revealed a striking contrast: hallucination caused 56% of CoT’s failures but 0% of ReAct’s failures. ReAct’s main failure mode was search result errors (23%) – the Wikipedia API returning unhelpful information. This demonstrates that grounding reasoning in external knowledge dramatically reduces hallucination, even though it introduces a new failure mode (bad retrieval).

Interactive Decision Making (ALFWorld and WebShop)

| Method | ALFWorld (Success %) | WebShop (Score / Success %) |
|---|---|---|
| Act only (best of 6) | 45 | 62.3 / 30.1 |
| ReAct (best of 6) | 71 | 66.6 / 40.0 |
| BUTLER (imitation learning) | 37 | – |
| IL + RL (imitation + reinforcement learning) | – | 62.4 / 28.7 |
| Human expert | – | 82.1 / 59.6 |

On ALFWorld, ReAct achieved a 71% success rate – a 26-point absolute improvement over Act-only (45%) and nearly double BUTLER (37%), an imitation learning agent trained on 100,000 expert trajectories. ReAct accomplished this with just 2-3 in-context examples per task type. Even ReAct’s worst trial (48%) beat the best trial of both baselines.

On WebShop, ReAct achieved a 40.0% success rate, an absolute improvement of over 11 percentage points over the previous best (IL+RL at 28.7%). Its sparse reasoning helped bridge the gap between noisy product descriptions and the user’s requirements.

Fine-tuning Scaling Results

When fine-tuned on just 3,000 trajectories (rather than used with prompting), ReAct showed the strongest scaling behavior. Fine-tuned PaLM-8B with ReAct outperformed all PaLM-62B prompting methods, and fine-tuned PaLM-62B with ReAct outperformed all PaLM-540B prompting methods. In contrast, fine-tuning Standard or CoT essentially teaches models to memorize facts, while fine-tuning ReAct teaches models how to retrieve and reason – a more generalizable skill.

Limitations

- Grounding shifts the failure mode rather than eliminating it: 23% of ReAct’s HotpotQA errors came from unhelpful search results, and one bad retrieval step can derail an otherwise sound trajectory.
- The enforced thought-action-observation structure reduces reasoning flexibility, which is why ReAct slightly trails CoT on HotpotQA.
- Prompted ReAct still falls far short of supervised state-of-the-art on knowledge-intensive tasks (35.1 vs. 67.5 EM on HotpotQA).
- Exemplar trajectories must be hand-written for each task domain, and the method presumes a large, capable base model.

Impact and Legacy

ReAct was one of the most influential papers in the emergence of LLM-based agents – systems where language models interact with external tools and environments rather than just generating text. The idea that an LLM can be prompted to interleave reasoning and action with no additional training became a foundational design pattern for building practical AI agent systems.

The paper’s influence extends directly into widely-used frameworks. LangChain, one of the most popular libraries for building LLM applications, adopted the ReAct pattern as its default agent architecture. The concept of “tool use” in modern LLM APIs (where models can call functions, search the web, or execute code) traces a direct line from ReAct’s demonstration that LLMs can effectively decide when and how to use external tools. Systems like ChatGPT’s browsing mode, Bing Chat, and Perplexity AI all implement variants of the ReAct loop.

Beyond tool use, ReAct established the principle that transparent reasoning traces make LLM behavior more interpretable and controllable. The paper demonstrated that humans can intervene by editing thoughts mid-trajectory, steering the model’s behavior with minimal effort. This human-in-the-loop editing capability – where correcting one or two thoughts can change the entire downstream trajectory – became a key design principle for building trustworthy agent systems.

Prerequisites

To understand ReAct, the reader should be familiar with:

- Large language models and few-shot in-context learning (guiding a model with prompt examples rather than weight updates)
- Chain-of-thought prompting for eliciting step-by-step reasoning
- The basic agent-environment loop: actions, observations, and policies

Connections