Authors: Aman Madaan, Niket Tandon, Prakhar Gupta et al. Year: 2023 Source: arXiv:2303.17651
Self-Refine is a technique that makes a large language model critique and revise its own output in a loop – generating a draft, writing feedback on that draft, then rewriting based on the feedback – improving quality by roughly 20% on average without any additional training.
When you ask a large language model (LLM) to generate text, code, or solve a problem, it gives you one shot at an answer. Sometimes that answer is good enough. Often, especially for tasks with multiple objectives (like writing a dialogue response that is engaging, informative, and safe), the first attempt misses something. A human writer would naturally re-read their draft, spot the weak parts, and revise. LLMs, by default, do not do this.
Before Self-Refine, getting better outputs from an LLM required one of three expensive approaches. First, you could collect human feedback and use reinforcement learning from human feedback (RLHF) to fine-tune the model – a process that demands thousands of labeled preference pairs and significant compute to retrain the model’s weights. Second, you could train a separate “corrector” or “refiner” model for each specific task, which requires supervised training data showing pairs of bad-and-improved outputs. Third, you could use an external reward signal (like a compiler for code or a unit test suite) to guide the model, limiting the approach to domains where such signals exist.
All three approaches share the same bottleneck: they require either human labor, task-specific training data, or external tools. This makes them expensive to build and hard to generalize across different tasks. The question Self-Refine addresses is: can we get the benefits of iterative revision using nothing more than the LLM itself – no new training, no human-in-the-loop, no external reward models?
Think of Self-Refine like a writer who plays three roles in sequence: author, editor, and rewriter. First, the writer drafts a paragraph (the author role). Then, the same writer puts on an editor’s hat and reads the draft critically, writing specific notes like “the opening sentence is too vague” or “this code uses a brute-force loop that could be replaced with a formula” (the editor role). Finally, the writer takes those notes and rewrites the paragraph to address each criticism (the rewriter role). This cycle repeats until the editor is satisfied or a maximum number of rounds is reached.
The technical insight is that a single LLM, guided by different prompts, can perform all three roles. The “author” prompt contains a few examples of good input-output pairs for the task. The “editor” prompt contains examples of outputs paired with specific, actionable feedback – not vague comments like “make it better” but concrete critiques like “the variable name `x` should be renamed to `input_buffer` for clarity.” The “rewriter” prompt contains examples of how to take an output and its feedback and produce an improved version. By switching between these prompts and feeding each step’s output into the next, the LLM iteratively improves its own work.
What makes this especially notable is its simplicity and generality. The same three-step loop works across radically different tasks – from reversing the sentiment of a review, to optimizing Python code, to generating acronyms, to solving math word problems – with only the few-shot examples in the prompts changing between tasks. No model weights are modified. No reinforcement learning is involved. The entire mechanism operates at inference time (also called “test time”), using only the LLM’s existing capabilities and carefully designed prompts.
Self-Refine has no neural architecture of its own – it is a prompting strategy that wraps around an existing LLM. The method has three components that run in a loop: initial generation, feedback, and refinement.
Step 1: Initial Generation. Given an input \(x\) (for example, “write a function to compute the sum of 1 to N”) and a generation prompt \(p_{\text{gen}}\) containing a few task-specific examples, the model \(\mathcal{M}\) produces an initial output \(y_0\). For instance, \(y_0\) might be a Python function using a for-loop to accumulate the sum.
Step 2: Feedback. The same model \(\mathcal{M}\) receives a feedback prompt \(p_{\text{fb}}\) along with the input \(x\) and the current output \(y_t\). The feedback prompt contains examples of triples: an input, an output, and specific multi-aspect feedback on that output. The model generates feedback \(fb_t\) such as: “This code is slow because it uses a for loop (brute force). A better approach is to use the formula n*(n+1)/2.”
Step 3: Refinement. The model receives a refinement prompt \(p_{\text{refine}}\) along with the input, the current output, and the feedback. It produces an improved output \(y_{t+1}\). In our running example, the refined code might be a one-liner: `return n * (n + 1) // 2`.
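The running example’s before-and-after can be written out concretely. A minimal sketch – the function names are illustrative, not from the paper:

```python
def sum_to_n_loop(n):
    # Brute-force version, like the model's initial draft y_0:
    # accumulate 1 + 2 + ... + n in a loop.
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_to_n_formula(n):
    # Refined version after feedback: the closed-form Gauss sum.
    return n * (n + 1) // 2
```

Both compute the same value; the refinement replaces an O(n) loop with an O(1) formula, which is exactly the kind of fix the feedback step is prompted to suggest.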
Iteration. Steps 2 and 3 repeat. At each iteration, the model receives the full history of all previous outputs and feedback, allowing it to avoid repeating past mistakes. The loop stops either after a fixed number of iterations (typically 4) or when the feedback itself indicates no further improvement is needed (for example, by including a numeric “stop score”).
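The three steps above can be sketched as a short loop. In this sketch, `call_model`, the newline-concatenated prompt format, and the substring stop test are illustrative assumptions standing in for the paper’s actual API calls and stop condition:

```python
def self_refine(x, call_model, p_gen, p_fb, p_refine, max_iters=4):
    """Generate an initial draft, then alternate feedback and refinement.

    `call_model` is any function mapping a prompt string to a completion
    string (e.g. a wrapper around an LLM API); p_gen, p_fb, and p_refine
    carry the task-specific few-shot examples.
    """
    y = call_model(p_gen + "\n" + x)                 # Step 1: initial output y_0
    history = []                                     # full chain of (y_t, fb_t)
    for _ in range(max_iters):
        fb = call_model(p_fb + "\n" + x + "\n" + y)  # Step 2: self-feedback
        if "no further improvement" in fb.lower():   # simplified stop condition
            break
        history.append((y, fb))
        # Step 3: refine, conditioning on the complete revision history
        hist = "\n".join(part for pair in history for part in pair)
        y = call_model(p_refine + "\n" + x + "\n" + hist)
    return y
```

The key structural point survives even in this toy version: the same model is called three times with three different prompts, and the refinement call sees every earlier output and critique.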
As a concrete example with real values, consider the dialogue task. The input \(x\) is “I am interested in playing table tennis.” The initial output \(y_0\) might be the generic response “I’m sure it’s a great way to socialize, stay active.” The feedback \(fb_0\) identifies two problems: “Engaging: Provides no information about table tennis. User understanding: Lacks understanding of user’s needs.” The refined output \(y_1\) becomes: “That’s great to hear! It’s a fun sport requiring quick reflexes and good hand-eye coordination. Have you played before, or are you looking to learn?” – a response that is more engaging, more informative, and shows understanding of the user.
Figure 1: An initial website layout generated by ChatGPT for an ice cream parlor. The design is functional but basic – plain white background, minimal styling, no visual hierarchy. This serves as the starting point (y_0) for Self-Refine’s iterative improvement loop.
Figure 2: The same ice cream parlor website after one round of Self-Refine. The model generated feedback suggesting: change the background color to light blue, increase the heading font size, add an icon, add an extra paragraph of content, and update the button color. The model then applied its own suggestions to produce this improved layout. This demonstrates Self-Refine working on a real-world creative task beyond standard benchmarks.
Self-Refine’s formalism is intentionally simple – the entire method reduces to four equations describing how a language model \(\mathcal{M}\) is called with different prompts at different stages.
Equation 1: Initial Generation
\[y_0 = \mathcal{M}(p_{\text{gen}} \| x)\]
In plain language: feed the model a prompt with a few examples of the task, followed by the actual input, and take whatever it generates as the first draft. This is standard few-shot prompting (also called “in-context learning”), where the model learns what to do from the examples in the prompt without any weight updates.
This matters because it establishes the baseline – the output that Self-Refine will then try to improve. The quality of \(y_0\) depends on the model’s inherent capabilities and the quality of the few-shot examples in \(p_{\text{gen}}\).
Equation 2: Feedback Generation
\[fb_t = \mathcal{M}(p_{\text{fb}} \| x \| y_t)\]
In plain language: show the model examples of what good feedback looks like (via \(p_{\text{fb}}\)), then give it the original input and the current draft, and ask it to critique the draft. The resulting feedback is natural language text, not a scalar score – it might say “the variable names are unclear” or “the response doesn’t address the user’s actual question.”
This is the critical innovation. By prompting the model with examples of specific, actionable feedback, Self-Refine gets the model to identify concrete problems. The examples in \(p_{\text{fb}}\) are triples \(\langle x^{(k)}, y^{(k)}, fb^{(k)} \rangle\) – an input, an output, and a detailed critique of that output.
Equation 3: Single-Step Refinement
\[y_{t+1} = \mathcal{M}(p_{\text{refine}} \| x \| y_t \| fb_t)\]
In plain language: show the model examples of how to revise an output based on feedback, then give it the input, the current draft, and the feedback, and let it produce an improved version. The examples in \(p_{\text{refine}}\) are quadruples \(\langle x^{(k)}, y_t^{(k)}, fb_t^{(k)}, y_{t+1}^{(k)} \rangle\).
This equation describes a single refinement step. In practice, Self-Refine retains the full history across iterations, as described in the next equation.
Equation 4: Refinement with Full History
\[y_{t+1} = \mathcal{M}(p_{\text{refine}} \| x \| y_0 \| fb_0 \| \ldots \| y_t \| fb_t)\]
This is the actual instantiation of refinement used in practice. Instead of only seeing the most recent output and feedback, the model receives the complete chain: the original output \(y_0\), its feedback \(fb_0\), the first revision \(y_1\), its feedback \(fb_1\), and so on up to the current output \(y_t\) and its feedback \(fb_t\).
In plain language: give the model the full revision history so it can see what has already been tried and what problems have already been identified. This prevents the model from regressing – undoing a fix it made in a previous iteration – or from generating the same feedback repeatedly without acting on it.
This matters because it is what makes the iteration productive. Without history, the model might oscillate between two versions or keep pointing out the same problem without ever fixing it. With history, each round of feedback can focus on remaining issues.
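The history-concatenated prompt of Equation 4 can be assembled mechanically. A minimal sketch, where the function name and the blank-line separator are assumptions for illustration:

```python
def build_refine_prompt(p_refine, x, history):
    """Concatenate p_refine || x || y_0 || fb_0 || ... || y_t || fb_t.

    `history` is a list of (output, feedback) pairs in chronological order.
    """
    parts = [p_refine, x]
    for output, feedback in history:
        parts.extend([output, feedback])
    return "\n\n".join(parts)
```

Every earlier output is immediately followed by its critique, so the model can see which problems were already raised and whether the next revision actually addressed them.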
As a worked example with concrete numbers, consider constrained generation where the input is a list of 25 keywords that must all appear in a single coherent sentence. At \(t=0\), the model might produce \(y_0\) containing 18 of the 25 keywords (72% coverage). The feedback \(fb_0\) identifies the 7 missing keywords. At \(t=1\), the refined output \(y_1\) includes 22 keywords (88%). The feedback \(fb_1\) identifies the remaining 3. By \(t=2\), the output \(y_2\) includes all 25 (100% coverage). In the paper’s experiments, constrained generation improved from 29.0% to 49.7% coverage across three iterations.
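The coverage numbers in this example can be computed mechanically. A small sketch – naive case-insensitive substring matching is an assumption here, not the paper’s scoring method:

```python
def keyword_coverage(keywords, sentence):
    """Return the fraction of required keywords present, plus the missing ones."""
    text = sentence.lower()
    missing = [k for k in keywords if k.lower() not in text]
    covered = 1 - len(missing) / len(keywords)
    return covered, missing
```

The `missing` list plays the role of the feedback \(fb_t\): it tells the refiner exactly which constraints remain unmet, which is why coverage climbs monotonically across iterations.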
Self-Refine was evaluated on 7 tasks across three base models: GPT-3.5 (text-davinci-003), ChatGPT (gpt-3.5-turbo), and GPT-4. The table below shows the core results.
| Task | GPT-3.5 Base | +Self-Refine | ChatGPT Base | +Self-Refine | GPT-4 Base | +Self-Refine |
|---|---|---|---|---|---|---|
| Sentiment Reversal | 8.8 | 30.4 (+21.6) | 11.4 | 43.2 (+31.8) | 3.8 | 36.2 (+32.4) |
| Dialogue Response | 36.4 | 63.6 (+27.2) | 40.1 | 59.9 (+19.8) | 25.4 | 74.6 (+49.2) |
| Code Optimization | 14.8 | 23.0 (+8.2) | 23.9 | 27.5 (+3.6) | 27.3 | 36.0 (+8.7) |
| Code Readability | 37.4 | 51.3 (+13.9) | 27.7 | 63.1 (+35.4) | 27.4 | 56.2 (+28.8) |
| Math Reasoning | 64.1 | 64.1 (+0.0) | 74.8 | 75.0 (+0.2) | 92.9 | 93.1 (+0.2) |
| Acronym Generation | 41.6 | 56.4 (+14.8) | 27.2 | 37.2 (+10.0) | 30.4 | 56.0 (+25.6) |
| Constrained Gen. | 28.0 | 37.0 (+9.0) | 44.0 | 67.0 (+23.0) | 15.0 | 45.0 (+30.0) |
The improvements are substantial and consistent across models and tasks. GPT-4 + Self-Refine improves by an average of roughly 25 percentage points across the 7 tasks. The largest gains appear in preference-based tasks: dialogue response generation with GPT-4 jumps from 25.4% to 74.6% preference rate – meaning human evaluators preferred the Self-Refine output nearly three-quarters of the time.
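The “roughly 25 percentage points” figure can be checked directly from the GPT-4 deltas in the table:

```python
# Per-task GPT-4 gains from the table above (percentage points):
# sentiment, dialogue, code opt., readability, math, acronym, constrained
gpt4_gains = [32.4, 49.2, 8.7, 28.8, 0.2, 25.6, 30.0]
avg_gain = sum(gpt4_gains) / len(gpt4_gains)  # just under 25 points
```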
The one clear outlier is math reasoning, where gains are minimal (0-0.2%). The paper traces this to feedback quality: for math, 94% of ChatGPT’s feedback was “everything looks good” even when the answer was wrong, because a plausible-looking chain of reasoning can fool the model’s self-evaluation. When an external oracle tells the model its answer is wrong (without saying where), gains improve to 1-5% – still modest, but demonstrating that the bottleneck is feedback accuracy, not the refinement mechanism itself.
The iterative nature of the method matters. Across code optimization, sentiment reversal, and constrained generation, most improvement occurs in the first 1-2 iterations, with diminishing but still positive returns through iteration 3. For code optimization with averaged results across all three models, scores went from 22.0 (\(y_0\)) to 27.0 (\(y_1\)) to 27.9 (\(y_2\)) to 28.8 (\(y_3\)).
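The diminishing returns are visible in the per-iteration deltas for code optimization, using the model-averaged scores just quoted:

```python
scores = [22.0, 27.0, 27.9, 28.8]  # y_0 through y_3, averaged across models
# Marginal gain contributed by each iteration
deltas = [round(b - a, 1) for a, b in zip(scores, scores[1:])]
```

The first iteration contributes most of the improvement, with each later round adding under a point.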
The authors also showed that Self-Refine is not merely benefiting from generating multiple outputs. When ChatGPT generates 4 independent samples (without feedback or refinement), human evaluators still prefer the single Self-Refine output over all 4 samples. This confirms that the feedback-driven revision, not just extra sampling, is the source of improvement.
Requires strong base models. Self-Refine fails with weaker models. Testing with Vicuna-13B (a smaller open-source model) showed it could not consistently generate feedback in the required format, and even when given correct feedback, it often failed to refine properly – repeating its original output or generating hallucinated conversations instead.
Feedback quality is the bottleneck. When the model cannot accurately identify what is wrong with its output (as in math reasoning), Self-Refine provides little benefit. The paper’s own qualitative analysis found that 33% of failures were due to feedback that inaccurately identified the problem location, and 61% were due to feedback suggesting an incorrect fix. Only 6% of failures were the refiner’s fault.
Increased inference cost. Each iteration adds two LLM calls (one for feedback, one for refinement) on top of the initial generation, so four iterations mean roughly nine model calls instead of one – and later calls carry progressively longer contexts as the history accumulates. The paper does not discuss this cost-quality tradeoff.
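Under the simple accounting of one initial generation plus a feedback and a refinement call per iteration, the call count grows linearly with the number of iterations:

```python
def total_llm_calls(iterations):
    # One initial generation, then one feedback call and one
    # refinement call per iteration.
    return 1 + 2 * iterations
```

Note this counts calls, not tokens; because each refinement call includes the full history, token cost grows faster than linearly.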
English-only evaluation. All experiments use English-language tasks. The method’s effectiveness in other languages is unknown, and likely degraded given the weaker multilingual capabilities of the base models used.
Closed-source model dependence. All main experiments use proprietary OpenAI models (GPT-3.5, ChatGPT, GPT-4, Codex) whose training data, model sizes, and internal details are not fully disclosed. This limits reproducibility and makes it hard to disentangle what model capabilities make Self-Refine work.
No adversarial robustness analysis. The paper acknowledges that bad actors could use similar prompting techniques to steer a model toward generating more harmful content through iterative refinement toward a toxic goal, but does not investigate mitigations.
Potential for self-reinforcing errors. The paper does not deeply examine cases where the model’s feedback reinforces its own biases or mistakes – a blind spot that becomes more concerning as the same model serves as both generator and critic.
Self-Refine was an early and influential demonstration that LLMs can meaningfully improve their own outputs through structured self-critique at inference time, without any training. Posted to arXiv in March 2023, it appeared alongside several related works (Reflexion, Constitutional AI, Self-Correction) that together established “test-time compute” and “self-improvement” as major research directions. The core idea – that the same model can generate, evaluate, and refine – became a foundational pattern in LLM application design.
The practical impact has been significant. The generate-critique-refine loop described in Self-Refine has become a standard pattern in LLM-powered agents and pipelines. Modern agent frameworks routinely implement “reflection” steps where the LLM reviews its own work before finalizing. The paper’s finding that feedback quality matters more than the number of iterations has influenced how practitioners design critique prompts – emphasizing specificity and actionability over generic quality judgments.
The paper also highlighted an important boundary in self-improvement: models struggle to improve on tasks where they cannot accurately self-evaluate (like math). This limitation has driven subsequent work on combining self-refinement with external verification tools – code interpreters for math, test suites for code, and retrieval systems for factual claims – creating hybrid systems that pair the model’s linguistic refinement abilities with reliable external signals.
Before reading this paper, you should understand:
Self-Refine builds on the foundation of few-shot prompting introduced with GPT-3 (Brown et al., 2020) and used in the GPT line of work (see Improving Language Understanding by Generative Pre-Training). The entire method depends on in-context learning: the model must be able to learn the “feedback” and “refinement” behaviors from just a handful of examples in the prompt.
The paper is closely related to Chain-of-Thought prompting (Wei et al., 2022), which showed that prompting LLMs to show intermediate reasoning steps improves performance on complex tasks. Self-Refine extends this insight: rather than just reasoning step-by-step within a single generation, it structures multi-step reasoning across generations, with explicit feedback as the bridge between iterations. Both methods share the principle that how you structure the LLM’s output format matters as much as the model’s raw capability.
Self-Refine is directly comparable to Reflexion (Shinn et al., 2023) and ReAct (Yao et al., 2022), which also appeared around the same time. ReAct interleaves reasoning and action in a single generation pass, while Self-Refine separates generation, feedback, and refinement into distinct steps. Reflexion focuses on planning tasks and stores reflections in an episodic memory, while Self-Refine provides more granular, multi-aspect feedback and targets a broader range of generation tasks. All three methods share the core insight that LLMs benefit from structured self-evaluation.
The method connects to reinforcement learning from human feedback (RLHF), as used in InstructGPT (Ouyang et al., 2022), but takes the opposite approach to the same problem. RLHF trains the model to internalize human preferences by updating its weights. Self-Refine achieves a similar effect – improved alignment with human preferences – without any weight updates, by externalizing the critique step as an explicit prompt-driven process. This makes Self-Refine faster to deploy but potentially less robust, since the feedback quality depends entirely on the model’s in-context capabilities rather than baked-in learned preferences.
The iterative refinement pattern in Self-Refine has since influenced later work that generalizes single-chain revision into branching exploration of multiple solution paths and token-level verification of generated outputs.