ReAct: Synergizing Reasoning and Acting in Language Models

Authors: Shunyu Yao, Jeffrey Zhao, Dian Yu et al. Year: 2022 (published at ICLR 2023) Source: arXiv 2210.03629

One-Sentence Summary

By prompting a language model to alternate between writing out its reasoning in plain language and taking concrete actions (like searching Wikipedia), ReAct lets the model ground its thinking in real information and act more intelligently – outperforming methods that only reason or only act.

Problem Statement

Before ReAct, researchers had explored two separate capabilities of large language models (LLMs – models like GPT-3 or PaLM that have been trained on vast amounts of text and can generate fluent language). One line of work focused on reasoning: chain-of-thought prompting (CoT) showed that if you give an LLM a few examples of step-by-step thinking, it can work through multi-step problems by writing out intermediate reasoning steps. The other line focused on acting: using LLMs to generate actions in interactive environments, like navigating websites or controlling robots.

The problem was that reasoning alone is a “static black box.” When a model reasons entirely from its internal knowledge, it has no way to check facts against the real world. This leads to hallucination – the model confidently invents plausible-sounding but incorrect information, and these errors compound as each reasoning step builds on the last. In experiments, hallucination accounted for 56% of CoT’s failures on multi-hop question answering.

Conversely, acting alone – generating sequences of actions without explicit reasoning – makes it hard for the model to formulate high-level plans, track progress toward a goal, handle exceptions, or synthesize information from multiple observations. An acting-only agent might search Wikipedia for an answer but then fail to combine the retrieved facts into a correct conclusion, because it never pauses to reason about what it found.

No prior work had systematically studied how to combine reasoning and acting in a synergistic loop, or whether such a combination would outperform either capability in isolation.

Key Innovation

Think of a detective investigating a case. A purely “reasoning” detective sits in their office, never visits a crime scene, and tries to solve everything from memory – they might construct an elegant theory that turns out to be wrong because they misremembered a key fact. A purely “acting” detective visits every location and interviews every witness but never stops to think about what the evidence means – they collect mountains of information but cannot piece it together. A good detective does both: they think about what evidence they need, go collect it, reflect on what they found, decide what to investigate next, and eventually synthesize everything into a conclusion.

ReAct applies exactly this principle to language models. The core idea is remarkably simple: augment the model’s action space with language itself. Alongside concrete actions like search[entity] or click[button], the model can also produce “thoughts” – free-form text that reasons about the current situation but does not affect the external environment. These thoughts serve multiple purposes: decomposing a complex goal into subgoals, extracting key information from observations, performing commonsense reasoning, tracking progress, handling exceptions, and synthesizing a final answer.

Technically, this is implemented through prompting (see Improving Language Understanding by Generative Pre-Training for background on how language models learn from examples). The researchers write a few example trajectories showing interleaved thoughts, actions, and observations, then give these to the model as in-context examples. The model learns from these examples to produce its own interleaved reasoning-acting trajectories for new problems. No model weights are changed – the model is used as-is, guided only by the prompt.
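Concretely, in-context prompting here just means concatenating a few exemplar trajectories in front of the new problem. A minimal sketch, assuming the thought/action/observation labels from the paper (the helper function and the placeholder exemplar are illustrative, not the authors' code):

```python
def build_prompt(exemplars, question):
    """Join few-shot ReAct trajectories, then append the new problem.

    The trailing 'Thought 1:' cue nudges the model to begin its own
    interleaved reasoning-acting trajectory.
    """
    return "\n\n".join(exemplars) + f"\n\nQuestion: {question}\nThought 1:"

# A placeholder exemplar in the paper's label format (contents elided).
exemplar = (
    "Question: ...\n"
    "Thought 1: ...\n"
    "Action 1: search[...]\n"
    "Observation 1: ...\n"
    "Thought 2: ...\n"
    "Action 2: finish[...]"
)
```

Because no weights change, swapping task domains is just a matter of swapping exemplars.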

Architecture / Method

ReAct is not a new model architecture but a prompting paradigm. It works with any sufficiently capable large language model. The paper primarily uses PaLM-540B (a 540-billion parameter language model from Google) and also validates results with GPT-3.



Figure 1: Top: Comparison of four prompting paradigms. (a) Standard prompting produces answers directly. (b) Chain-of-thought adds reasoning but cannot act on the world. (c) Act-only takes actions without explicit reasoning. (d) ReAct interleaves reasoning (Thought) with acting (Action/Observation). Bottom: Example ReAct trajectories for a HotpotQA question and an ALFWorld household task, showing how thoughts guide action selection.

Step 1: Define the action space. For each task domain, the authors define a small set of concrete actions the model can take. For question answering with Wikipedia, these are: search[entity] (returns the first 5 sentences from a Wikipedia page), lookup[string] (finds the next sentence containing that string on the current page, like Ctrl+F in a browser), and finish[answer] (submits a final answer). For interactive decision-making tasks like ALFWorld (a text-based household simulation), actions include things like go to coffeetable 1, take paper 2, and use desklamp 1.
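As a rough sketch of Step 1 (not the authors' implementation), the three Wikipedia actions can be modeled as parsed commands against a mock page store; all names here are illustrative:

```python
def parse_action(text: str):
    """Split an action string like 'search[Apple Remote]' into (verb, argument)."""
    verb, _, rest = text.partition("[")
    return verb.strip().lower(), rest.rstrip("]")

class WikiEnv:
    """Minimal mock of the Wikipedia tool environment (illustrative only)."""

    def __init__(self, pages):
        self.pages = pages      # {title: list of sentences}
        self.current = []       # sentences of the last searched page
        self.cursor = 0         # lookup position, like Ctrl+F

    def step(self, action: str) -> str:
        verb, arg = parse_action(action)
        if verb == "search":
            self.current, self.cursor = self.pages.get(arg, []), 0
            # Return the first 5 sentences of the page, as in the paper.
            return " ".join(self.current[:5]) if self.current else f"Could not find [{arg}]."
        if verb == "lookup":
            # Find the next sentence on the current page containing the string.
            for i in range(self.cursor, len(self.current)):
                if arg in self.current[i]:
                    self.cursor = i + 1
                    return self.current[i]
            return f"No more results for '{arg}'."
        if verb == "finish":
            return f"Final answer: {arg}"
        raise ValueError(f"Unknown action: {action}")
```

The key property is that the action space is tiny and textual, so the model can emit actions as ordinary generated text.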

Step 2: Augment with thoughts. The model is allowed to produce “thought” actions – free-form text that reasons about the situation. These thoughts do not affect the environment and produce no observation feedback. Instead, they update the model’s internal context, helping it plan, reason, and decide what action to take next.

Step 3: Write few-shot exemplars. Human annotators create a small number of example trajectories (3 to 6 per task) showing how to solve representative problems using interleaved thoughts, actions, and observations. For reasoning-heavy tasks like question answering, every action is preceded by a thought (dense thought pattern). For decision-making tasks that involve many actions, thoughts appear only at key decision points (sparse thought pattern).

Step 4: Prompt and generate. The exemplar trajectories are placed into the model’s prompt, followed by a new problem. The model generates a trajectory token by token, producing thoughts, actions, and (after receiving environment observations) further thoughts and actions, until it reaches a terminal action like finish[answer].
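Steps 2 through 4 together form a generate-act-observe loop, which can be sketched in a few lines of Python. Here `llm` stands for any text-completion function and the labels mirror the paper's trajectory format, but the details are an illustrative sketch, not the authors' implementation:

```python
def react_loop(llm, env, question, exemplars, max_steps=8):
    """Run one ReAct episode; returns the final answer, or None on step limit."""
    context = exemplars + f"Question: {question}\n"
    for step in range(1, max_steps + 1):
        # The model may emit a free-form Thought, then an Action; stopping at
        # the next Observation label prevents it from hallucinating feedback.
        completion = llm(context, stop=[f"\nObservation {step}:"])
        context += completion
        action = completion.split(f"Action {step}:")[-1].strip()
        if action.lower().startswith("finish["):
            return action[len("finish["):].rstrip("]")
        # Thoughts stay in the context only; just the concrete action
        # reaches the environment, which replies with an observation.
        observation = env.step(action)
        context += f"\nObservation {step}: {observation}\n"
    return None
```

Note that thoughts cost nothing in the environment: they only grow the context that conditions the next generation.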

Step 5: Combine with CoT-SC (optional). The authors also propose hybrid strategies that combine ReAct with chain-of-thought self-consistency (CoT-SC, a method that samples multiple reasoning chains and takes a majority vote). Two variants are tested: (A) ReAct first, falling back to CoT-SC if ReAct fails to produce an answer within a step limit; (B) CoT-SC first, falling back to ReAct when the majority vote is not confident (the top answer appears in fewer than half the samples).
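The two hybrid strategies reduce to a majority vote plus a fallback test. A minimal sketch, assuming `react` and `cot_sc` stand for the two prompting methods (function names are illustrative):

```python
from collections import Counter

def majority_answer(samples):
    """CoT-SC core: majority vote over sampled answers, plus the vote share."""
    answer, votes = Counter(samples).most_common(1)[0]
    return answer, votes / len(samples)

def react_then_cot_sc(react, cot_sc, question):
    """Strategy A: fall back to CoT-SC if ReAct hits its step limit (None)."""
    answer = react(question)
    return answer if answer is not None else cot_sc(question)

def cot_sc_then_react(react, samples, question):
    """Strategy B: fall back to ReAct when the vote is not confident
    (top answer appears in fewer than half the samples)."""
    answer, share = majority_answer(samples)
    return answer if share >= 0.5 else react(question)
```

The confidence threshold of one half matches the description above; the paper's exact sampling parameters (e.g., 21 samples) are orthogonal to this logic.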

Concrete Example: HotpotQA

To make this concrete, consider the question: “Aside from the Apple Remote, what other device can control the program Apple Remote was originally designed to interact with?”

A ReAct trajectory looks like this (abridged from the paper):

Thought 1: I need to search Apple Remote and find the program it was originally designed to interact with.
Action 1: search[Apple Remote]
Observation 1: The Apple Remote is a remote control ... originally designed to control the Front Row media center program ...
Thought 2: Apple Remote was originally designed to control the Front Row media center program. I need to search Front Row next and find what other device can control it.
Action 2: search[Front Row]
Observation 2: Could not find [Front Row]. Similar: ['Front Row (software)', ...]
Thought 3: Front Row is not found. I need to search Front Row (software).
Action 3: search[Front Row (software)]
Observation 3: Front Row is a discontinued media center software ...
Thought 4: Front Row (software) is controlled by an Apple Remote or the keyboard function keys. So the answer is keyboard function keys.
Action 4: finish[keyboard function keys]

Notice how each thought reasons about what was observed and plans the next step. Without thoughts, the Act-only baseline failed on this same question because it never paused to synthesize the final answer from the retrieved information.

Mathematical Foundations

The ReAct paper is primarily a prompting strategy paper and contains no traditional loss functions or training objectives. However, it formalizes the agent-environment interaction and the key innovation (adding language to the action space) using precise notation. These definitions are critical for understanding what ReAct actually changes.

Agent-Environment Interaction

\[\pi(a_t \mid c_t), \quad \text{where } c_t = (o_1, a_1, \cdots, o_{t-1}, a_{t-1}, o_t)\]

What it means: At each step, the agent sees the entire history of what it has observed and done, then picks an action according to its policy. For an LLM-based agent, \(\pi\) is the language model itself – it reads the concatenated text of all prior observations and actions and generates the next action as text.

Why it matters: This formulation makes explicit that the agent must map an increasingly long context to a single action. For complex tasks like multi-hop question answering, the mapping \(c_t \mapsto a_t\) requires substantial implicit reasoning, which is exactly the bottleneck that ReAct addresses.

Augmented Action Space

\[\hat{\mathcal{A}} = \mathcal{A} \cup \mathcal{L}\]

What it means: ReAct expands what the agent can “do” by adding every possible natural language string as a valid action. When the agent selects an action from \(\mathcal{L}\) (a thought), nothing happens in the environment – no observation is returned. The thought is purely internal. When it selects from \(\mathcal{A}\), the environment responds with an observation as usual.

Why it matters: This is the entire formal contribution of the paper expressed in a single line. By taking the union of the concrete action space with the language space, reasoning and acting become first-class citizens in the same decision loop. The simplicity of this formulation is the point – the power comes not from a complex mechanism but from the LLM’s ability to use unrestricted language to think.

Context Update via Thought

\[c_{t+1} = (c_t, \hat{a}_t) \quad \text{where } \hat{a}_t \in \mathcal{L}\]

What it means: When the agent produces a thought, it gets appended to the context. The environment does not respond (no new \(o_{t+1}\)), but the thought is visible to the agent in future steps. This is how the model “remembers” its reasoning – the thought becomes part of the text that the LLM conditions on when generating its next output.

Why it matters: This formalizes how thoughts support future decision-making. A worked example: suppose the context \(c_3\) is the text “Question: What is X? Action 1: search[Y]. Obs 1: Y was founded in 1990 and…” The model might produce the thought \(\hat{a}_3\) = “Y was founded in 1990, which is after 1985, so I need to search Z instead.” Now \(c_4\) includes this thought, and the model can directly act on this conclusion without re-deriving it.
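This update rule can be mirrored in a toy snippet, representing the context as a tuple of strings (the helper and its signature are hypothetical, purely for illustration):

```python
def step_context(context, item, is_thought, env=None):
    """ReAct context update: thoughts extend the context but receive no
    environment feedback; concrete actions also append an observation."""
    context = context + (item,)
    if is_thought:
        return context, None          # c_{t+1} = (c_t, a_hat_t), no o_{t+1}
    observation = env(item)           # a_t in A: environment responds
    return context + (observation,), observation
```

The asymmetry is the whole mechanism: a thought changes only what the model conditions on next, never the external world.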

Results

ReAct was evaluated on four diverse benchmarks spanning two categories: knowledge-intensive reasoning tasks (HotpotQA for multi-hop question answering, FEVER for fact verification) and interactive decision-making tasks (ALFWorld for text-based household navigation, WebShop for online shopping).

Knowledge-Intensive Reasoning (HotpotQA and FEVER)

| Method | HotpotQA (EM, exact match) | FEVER (Accuracy) |
|---|---|---|
| Standard | 28.7 | 57.1 |
| CoT | 29.4 | 56.3 |
| CoT-SC (21 samples) | 33.4 | 60.4 |
| Act only | 25.7 | 58.9 |
| ReAct | 27.4 | 60.9 |
| CoT-SC then ReAct | 34.2 | 64.6 |
| ReAct then CoT-SC | 35.1 | 62.0 |
| Supervised SoTA | 67.5 | 89.5 |

On FEVER, ReAct outperforms CoT (60.9 vs. 56.3) because fact verification requires checking claims against actual evidence, and ReAct can look things up. On HotpotQA, ReAct slightly trails CoT (27.4 vs. 29.4) because the structural constraint of alternating thoughts and actions reduces reasoning flexibility. However, the combination strategies significantly outperform all individual methods, with ReAct-then-CoT-SC achieving 35.1 on HotpotQA and CoT-SC-then-ReAct achieving 64.6 on FEVER.

A human analysis of 200 HotpotQA trajectories revealed a striking contrast: hallucination caused 56% of CoT’s failures but 0% of ReAct’s failures. ReAct’s main failure mode was search result errors (23%) – the Wikipedia API returning unhelpful information. This demonstrates that grounding reasoning in external knowledge dramatically reduces hallucination, even though it introduces a new failure mode (bad retrieval).

Interactive Decision Making (ALFWorld and WebShop)

| Method | ALFWorld (Success %) | WebShop (Score / Success %) |
|---|---|---|
| Act only (best of 6) | 45 | 62.3 / 30.1 |
| ReAct (best of 6) | 71 | 66.6 / 40.0 |
| BUTLER (imitation learning) | 37 | – |
| IL + RL (imitation + reinforcement learning) | – | 62.4 / 28.7 |
| Human expert | – | 82.1 / 59.6 |

On ALFWorld, ReAct achieved a 71% success rate – a 26-point absolute improvement over Act-only (45%) and nearly double BUTLER (37%), an imitation learning agent trained on 100,000 expert trajectories. ReAct accomplished this with just 2-3 in-context examples per task type. Even ReAct’s worst trial (48%) beat the best trial of both baselines.

On WebShop, ReAct achieved a 40.0% success rate, an absolute improvement of over 11 percentage points over the previous best (IL+RL at 28.7%). Its sparse reasoning helped bridge the gap between noisy product descriptions and the user’s requirements.

Fine-tuning Scaling Results

When fine-tuned on just 3,000 trajectories (rather than used with prompting), ReAct showed the strongest scaling behavior. Fine-tuned PaLM-8B with ReAct outperformed all PaLM-62B prompting methods, and fine-tuned PaLM-62B with ReAct outperformed all PaLM-540B prompting methods. In contrast, fine-tuning Standard or CoT essentially teaches models to memorize facts, while fine-tuning ReAct teaches models how to retrieve and reason – a more generalizable skill.

Limitations

- Grounding shifts the failure mode rather than eliminating it: 23% of ReAct’s HotpotQA errors came from unhelpful search results, and one bad retrieval step can derail an otherwise sound trajectory.
- The enforced thought-action-observation structure reduces reasoning flexibility, which is why ReAct slightly trails CoT on HotpotQA.
- Prompted ReAct still falls far short of supervised state-of-the-art on knowledge-intensive tasks (35.1 vs. 67.5 EM on HotpotQA).
- Exemplar trajectories must be hand-written for each task domain, and the method presumes a large, capable base model.

Impact and Legacy

ReAct was one of the most influential papers in the emergence of LLM-based agents – systems where language models interact with external tools and environments rather than just generating text. The idea that an LLM can be prompted to interleave reasoning and action with no additional training became a foundational design pattern for building practical AI agent systems.

The paper’s influence extends directly into widely-used frameworks. LangChain, one of the most popular libraries for building LLM applications, adopted the ReAct pattern as its default agent architecture. The concept of “tool use” in modern LLM APIs (where models can call functions, search the web, or execute code) traces a direct line from ReAct’s demonstration that LLMs can effectively decide when and how to use external tools. Systems like ChatGPT’s browsing mode, Bing Chat, and Perplexity AI all implement variants of the ReAct loop.

Beyond tool use, ReAct established the principle that transparent reasoning traces make LLM behavior more interpretable and controllable. The paper demonstrated that humans can intervene by editing thoughts mid-trajectory, steering the model’s behavior with minimal effort. This human-in-the-loop editing capability – where correcting one or two thoughts can change the entire downstream trajectory – became a key design principle for building trustworthy agent systems.

Prerequisites

To understand ReAct, the reader should be familiar with:

- Large language models and few-shot in-context learning (guiding a model with prompt examples rather than weight updates)
- Chain-of-thought prompting for eliciting step-by-step reasoning
- The basic agent-environment loop: actions, observations, and policies

Connections