When you prompt a language model to write code, draft an email, or answer a question, you get one attempt. Sometimes that attempt is good enough. Often – especially for tasks with multiple competing objectives – it falls short. This lesson examines why a single generation pass is fundamentally limited, and why human writers naturally iterate.
Think about writing a cover letter. Your first draft might hit the main points but feel generic, miss the company’s specific needs, or use an awkward phrase. You re-read it, spot the problems, and revise. This cycle – draft, critique, rewrite – is how humans produce good writing. We almost never get it right on the first try.
Language models, by default, generate in a single pass. You provide an input \(x\) (like “write a function to sum 1 to N”) and a prompt \(p_{\text{gen}}\) with a few examples of the task, and the model \(\mathcal{M}\) produces output \(y_0\):
\[y_0 = \mathcal{M}(p_{\text{gen}} \| x)\]
where:

- \(x\) is the task input,
- \(p_{\text{gen}}\) is the generation prompt containing a few examples of the task,
- \(\|\) denotes concatenation,
- \(y_0\) is the model's initial output.
This is standard few-shot prompting (see Chain-of-Thought Prompting). The model learns what to do from the examples in the prompt, without any weight updates. The quality of \(y_0\) depends entirely on the model’s existing capabilities and the quality of those few-shot examples.
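As a concrete sketch, the single pass is just prompt concatenation. Here `call_llm` is a hypothetical stand-in for \(\mathcal{M}\), and the prompt format is illustrative:

```python
def build_gen_prompt(examples, x):
    """Assemble p_gen || x: few-shot (input, output) examples, then the new input."""
    parts = [f"Input: {ex_in}\nOutput: {ex_out}" for ex_in, ex_out in examples]
    parts.append(f"Input: {x}\nOutput:")
    return "\n\n".join(parts)

prompt = build_gen_prompt([("Reverse 'abc'", "'cba'")], "Generate sum of 1, ..., N")
# y0 = call_llm(prompt)  # one pass: no critique, no revision
```

The model sees the examples and the new input in one context window and completes the final `Output:` slot in a single forward pass.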
The problem is that many real tasks have multiple objectives that are hard to satisfy simultaneously. A dialogue response should be engaging, informative, and sensitive to the user’s emotional state. A piece of code should be correct, efficient, and readable. An acronym should be catchy, pronounceable, and relevant. A single generation pass may nail two of these objectives but miss the third.
Before Self-Refine, getting better outputs required one of three expensive approaches: collecting human-written feedback or revisions, training refinement models on task-specific data, or wiring in external tools (compilers, search engines, verifiers) to check outputs.
All three share the same bottleneck: human labor, task-specific training data, or external tools. The question Self-Refine addresses is: can we get the benefits of iterative revision using only the LLM itself?
Consider the code optimization task. The input \(x\) is “Generate sum of 1, …, N.” The model produces:
```python
def sum(n):
    res = 0
    for i in range(n + 1):
        res += i
    return res
```

This output \(y_0\) is functionally correct – it computes the right answer. But it has a problem: it uses a brute-force loop with \(O(n)\) time complexity. For \(n = 1{,}000{,}000\), this loops a million times. The closed-form solution \(n(n+1)/2\) computes the same result in constant time.
A human programmer would immediately spot this: “I’m looping when I could use a formula.” But the model produced its answer and stopped. There is no mechanism in standard prompting for the model to look back at its own work and improve it.
Now consider the dialogue task. The input \(x\) is “User: I am interested in playing table tennis.” The model produces:
Response: I'm sure it's a great way to socialize, stay active.
This response is safe and grammatically correct, but it fails on two dimensions. First, it is not engaging – it provides no information about table tennis. Second, it shows no understanding of the user – it does not ask whether the user has played before or what aspect they are interested in. A human responding to this message would naturally tailor their response to the user’s specific interest.
In both cases, the first draft is adequate but improvable. The model’s knowledge is sufficient to do better – it “knows” the closed-form formula, and it “knows” how to ask follow-up questions. The limitation is not knowledge but process: the model lacks a mechanism to review and revise.
Recall: What are the three expensive approaches to improving LLM outputs that Self-Refine seeks to avoid? What bottleneck do they share?
Apply: Consider the task of generating a product description for a laptop. List three competing objectives the description must satisfy simultaneously (e.g., accuracy, persuasiveness, brevity). Write a one-sentence initial output \(y_0\) that satisfies two of the three objectives but fails on the third.
Extend: Chain-of-thought prompting (see Chain-of-Thought Prompting) improved reasoning by adding intermediate steps within a single generation. Why is this insufficient for tasks like dialogue generation or code readability, where the problem is not a lack of reasoning steps but a failure to satisfy multiple quality dimensions?
The key insight of Self-Refine is that the same model that generated the output can also critique it – if you prompt it correctly. This lesson covers how to design feedback prompts that produce specific, actionable critiques rather than vague suggestions.
Think about the difference between two kinds of writing feedback. A vague critique says “make it better.” A specific critique says “the opening sentence is too generic – replace ‘I’m sure it’s great’ with a concrete fact about table tennis, like its ranking as the world’s most popular racquet sport.” The vague version gives the writer nothing to act on. The specific version points to an exact phrase, explains the problem, and suggests a concrete fix.
Self-Refine generates feedback by prompting the same model \(\mathcal{M}\) with a feedback prompt \(p_{\text{fb}}\):
\[fb_t = \mathcal{M}(p_{\text{fb}} \| x \| y_t)\]
where:

- \(fb_t\) is the feedback on the current output \(y_t\),
- \(p_{\text{fb}}\) is the feedback prompt containing example critiques,
- \(x\) and \(y_t\) are the original input and the output being critiqued.
The feedback prompt \(p_{\text{fb}}\) contains a small number of triples \(\langle x^{(k)}, y^{(k)}, fb^{(k)} \rangle\) – an input, an output, and a detailed critique of that output. These examples teach the model what good feedback looks like.
Two properties make the feedback effective:

- Specific: the feedback points to an exact element of the output. "Variable x should be renamed to input_buffer" is specific. "Improve variable naming" is not.
- Actionable: the feedback suggests a concrete change the refiner can implement, not just a judgment that something is wrong.

The paper's analysis shows that feedback quality is the single most important factor in Self-Refine's success. In a controlled experiment, replacing Self-Refine's specific feedback with generic feedback ("improve the efficiency of the code") reduced Code Optimization scores from 27.5 to 26.0. Removing feedback entirely dropped scores to 24.8. In Sentiment Reversal, the effect was dramatic: specific feedback scored 43.2, generic scored 31.2, and no feedback scored 0 – the task completely failed without feedback.
Feedback can address multiple aspects of the output simultaneously. For the dialogue task, the feedback prompt teaches the model to evaluate along dimensions like “engaging,” “user understanding,” and “safety.” For code, the dimensions might be “efficiency,” “readability,” and “correctness.” The few-shot examples in \(p_{\text{fb}}\) demonstrate what multi-aspect feedback looks like for each task.
Let’s construct a concrete feedback prompt for the dialogue task. The prompt \(p_{\text{fb}}\) contains triples like this:
Example triple in \(p_{\text{fb}}\):
Input: User: I am interested in playing table tennis.
Output: I'm sure it's a great way to socialize, stay active.
Feedback:
- Engaging: Provides no information about table tennis
or how to play it. [Score: 3/10]
- User understanding: Lacks understanding of user's
needs and state of mind. [Score: 2/10]
- Overall: The response should mention specific aspects
of table tennis and ask a follow-up question. [Score: 3/10]
Now suppose the model has already generated an output for a new input. The input \(x\) is “User: I just got a new puppy!” and the current output \(y_0\) is “That’s nice. Dogs are popular pets.”
The model receives: \(p_{\text{fb}} \| x \| y_0\) – the feedback examples, followed by the new input and output. It generates:
Feedback:
- Engaging: The response is flat and generic. It doesn't
express enthusiasm or ask about the puppy. [Score: 2/10]
- User understanding: The user is excited and sharing
personal news. The response fails to match their
energy or show interest. [Score: 2/10]
- Overall: Replace "That's nice" with an enthusiastic
response. Ask about the breed, name, or age. [Score: 2/10]
Notice how the feedback mirrors the structure of the examples in \(p_{\text{fb}}\). It evaluates along the same dimensions (engaging, user understanding), assigns scores, and provides specific suggestions. The model learned this format from the few-shot examples – no training required.
The numeric scores serve a dual purpose. They guide the refinement step (lower scores indicate more urgent problems), and they can function as a stopping indicator – if all scores exceed a threshold, the loop can terminate.
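A minimal sketch of how scores in this format could be parsed for a threshold-based stopping check. The regex and the threshold of 8 are illustrative assumptions, not values from the paper:

```python
import re

def extract_scores(feedback):
    """Pull numeric scores like '[Score: 3/10]' out of a feedback string."""
    return [int(s) for s in re.findall(r"\[Score:\s*(\d+)/10\]", feedback)]

def all_scores_pass(feedback, threshold=8):
    """True when every dimension meets the threshold, signalling the loop can stop."""
    scores = extract_scores(feedback)
    return bool(scores) and min(scores) >= threshold

fb = "- Engaging: flat and generic. [Score: 2/10]\n- Overall: revise. [Score: 2/10]"
assert extract_scores(fb) == [2, 2]
assert not all_scores_pass(fb)  # scores of 2 are well below the threshold
```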
Recall: What two properties must feedback have to be effective in Self-Refine? Give an example of feedback that has both properties and an example that has neither.
Apply: Write a feedback prompt triple for the code readability task. The input is “Sort a list of names alphabetically.” The output is:
```python
def s(l):
    for i in range(len(l)):
        for j in range(i+1, len(l)):
            if l[i] > l[j]:
                l[i], l[j] = l[j], l[i]
    return l
```

Write specific, actionable feedback addressing at least two aspects (variable naming, algorithm choice, documentation).
Extend: The paper found that 94% of ChatGPT’s feedback on math reasoning was “everything looks good” – even when the answer was wrong. Why might a model be poor at critiquing its own math work, when it can effectively critique code or dialogue? What is fundamentally different about the error signal in math versus natural language tasks?
Generating good feedback is only half the problem. The model must also use that feedback to produce an improved output. This lesson covers the refinement equation and shows how the model translates natural language critiques into concrete changes.
Think of an architect receiving a client’s notes on a building design: “The lobby is too dark – add skylights. The staircase blocks the emergency exit – relocate it to the east wall.” The architect does not start from scratch. They take the existing design, address each specific note, and produce a revised version. The revision preserves everything that was already good while fixing the identified problems.
Self-Refine’s refinement step works the same way. The model receives the original input, the current output, and the feedback, and produces an improved version:
\[y_{t+1} = \mathcal{M}(p_{\text{refine}} \| x \| y_t \| fb_t)\]
where:

- \(y_{t+1}\) is the improved output,
- \(p_{\text{refine}}\) is the refinement prompt containing example revisions,
- \(y_t\) and \(fb_t\) are the current output and the feedback on it.
The refinement prompt \(p_{\text{refine}}\) contains quadruples \(\langle x^{(k)}, y_t^{(k)}, fb_t^{(k)}, y_{t+1}^{(k)} \rangle\) – an input, an output, feedback on that output, and an improved version that addresses the feedback. These examples teach the model what effective refinement looks like.
The refinement step is distinct from generating a new output from scratch. The model sees both the current output and the feedback, so it can make targeted changes rather than starting over. This is important because the current output likely has many good qualities that should be preserved – the refinement should fix what the feedback identified without breaking what already works.
The paper’s qualitative analysis found that the refiner was surprisingly robust. In 61% of successful cases, the refiner precisely implemented accurate feedback. In 33% of successful cases, the refiner managed to improve the output even when the feedback was partially incorrect – it extracted the useful signal from imperfect critiques and ignored the rest. Only 6% of failures were the refiner’s fault (implementing feedback incorrectly). The remaining 94% of failures traced back to feedback quality: 33% from misidentifying the error location and 61% from suggesting the wrong fix.
Let’s trace the refinement step for the code optimization task.
Input \(x\): “Generate sum of 1, …, N”
Current output \(y_0\):
```python
def sum(n):
    res = 0
    for i in range(n + 1):
        res += i
    return res
```

Feedback \(fb_0\): "This code is slow because it uses a for loop (brute force). A better approach is to use the formula \(n(n+1)/2\)."
The model receives \(p_{\text{refine}} \| x \| y_0 \| fb_0\). The examples in \(p_{\text{refine}}\) show how to revise code based on feedback – preserving the function signature and correctness while addressing the efficiency concern. The model generates:
Refined output \(y_1\):
```python
def sum_faster(n):
    return (n * (n + 1)) // 2
```

The refinement addressed the feedback precisely: it replaced the \(O(n)\) loop with the \(O(1)\) closed-form formula. The function still accepts the same input and returns the same output, but it runs dramatically faster.
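A quick check (using the function names from this example) confirms the refinement preserved correctness while changing the algorithm:

```python
def sum_loop(n):
    # y0: O(n) brute-force loop
    res = 0
    for i in range(n + 1):
        res += i
    return res

def sum_faster(n):
    # y1: O(1) closed-form n(n+1)/2
    return (n * (n + 1)) // 2

# Both versions agree on every input; only the running time differs.
for n in (0, 1, 10, 1000):
    assert sum_loop(n) == sum_faster(n)
```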
Now let’s trace the dialogue example.
Input \(x\): “User: I am interested in playing table tennis.”
Current output \(y_0\): “I’m sure it’s a great way to socialize, stay active.”
Feedback \(fb_0\): “Engaging: Provides no information about table tennis. User understanding: Lacks understanding of user’s needs.”
Refined output \(y_1\): “That’s great to hear! It’s a fun sport requiring quick reflexes and good hand-eye coordination. Have you played before, or are you looking to learn?”
The refined response addresses both feedback points: it now contains specific information about table tennis (quick reflexes, hand-eye coordination) and shows understanding of the user by asking a follow-up question. The tone also improved from indifferent (“I’m sure”) to enthusiastic (“That’s great to hear!”).
Recall: What is the structure of the examples in the refinement prompt \(p_{\text{refine}}\)? How does this differ from the examples in the feedback prompt \(p_{\text{fb}}\)?
Apply: Given the following input, output, and feedback, write what a good refined output \(y_1\) would look like.
```python
def f(n): return all(n % i != 0 for i in range(2, n))
```

Feedback: "The function name f is not descriptive. The function does not handle edge cases (n <= 1). The range can be optimized to check up to sqrt(n) instead of n."

Extend: The paper reports that the refiner succeeded even with partially incorrect feedback 33% of the time. What property of large language models might explain this robustness? Consider how a model trained on millions of code examples might respond to feedback that says "use binary search" when the actual improvement should be "use a hash set."
A single round of feedback and refinement often is not enough. Self-Refine iterates, and crucially, it gives the model the full history of all previous outputs and feedback. This lesson covers why history matters and how the complete loop works.
Imagine an editor reviewing successive drafts of an article. If the editor only ever sees the latest draft, they might give the same feedback twice – or worse, suggest a change that reverses an improvement from a previous round. But if the editor can see all previous drafts and all previous notes, they can focus on remaining issues and avoid retreading old ground.
Self-Refine preserves the full revision history. In practice, the refinement equation is instantiated as:
\[y_{t+1} = \mathcal{M}(p_{\text{refine}} \| x \| y_0 \| fb_0 \| \ldots \| y_t \| fb_t)\]
where:

- \(y_0 \| fb_0 \| \ldots \| y_t \| fb_t\) is the full history of previous outputs and their feedback, concatenated in order,
- \(y_{t+1}\) is the next refined output.
This prevents two failure modes:

- Repeated feedback: without history, the model may raise an issue that was already identified and fixed in an earlier round.
- Regression: a revision may accidentally undo an improvement made in a previous iteration.
The loop terminates based on a stopping condition \(\mathrm{stop}(fb_t, t)\) that either stops at a fixed maximum number of iterations (typically 4) or checks a stopping indicator extracted from the feedback (e.g., all quality scores above a threshold).
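A sketch of \(\mathrm{stop}(fb_t, t)\), assuming the feedback signals completion with an explicit "Stop" marker (an illustrative convention; a score-threshold check would work the same way):

```python
def stop(fb, t, max_iters=4):
    """Terminate on a fixed iteration budget or an explicit stop marker in the feedback.

    The substring check is a deliberate simplification: a real system might
    parse a structured score or a dedicated stop token instead.
    """
    return t >= max_iters or "stop" in fb.lower()
```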
The Self-Refine algorithm. Given an input (step 0), the model generates an initial output, then provides feedback on it (step 1), then refines the output using the feedback (step 2). Steps 1 and 2 iterate until a stopping condition is met.
The iterative nature matters empirically. Across code optimization, sentiment reversal, and constrained generation (averaged over ChatGPT, GPT-3.5, and GPT-4), the paper measured iteration-by-iteration improvements:
| Iteration | Code Optimization | Sentiment Reversal | Constrained Generation |
|---|---|---|---|
| \(y_0\) | 22.0 | 33.9 | 29.0 |
| \(y_1\) | 27.0 (+5.0) | 34.9 (+1.0) | 40.3 (+11.3) |
| \(y_2\) | 27.9 (+0.9) | 36.1 (+1.2) | 46.7 (+6.4) |
| \(y_3\) | 28.8 (+0.9) | 36.8 (+0.7) | 49.7 (+3.0) |
The pattern is consistent: most improvement comes in the first iteration, with diminishing but still positive returns in subsequent rounds. For constrained generation (where the task is to include all of 25 given keywords in a single sentence), the gains are especially large – coverage jumps from 29.0% to 49.7% over three iterations, because each round of feedback can identify and fix missing keywords.
Let’s trace the full Self-Refine loop on a constrained generation task. The input \(x\) is a list of 10 keywords: [“river”, “bridge”, “sunset”, “bicycle”, “dog”, “wind”, “flower”, “stone”, “cloud”, “path”].
Iteration 0 – Initial generation:
\(y_0\) = “The cyclist rode across the old stone bridge, watching the sunset paint the clouds above the river.”
Keywords present: river, bridge, sunset, stone, cloud (5/10 = 50%). Keywords missing: bicycle, dog, wind, flower, path.
Iteration 0 – Feedback:
\(fb_0\) = “The sentence is missing 5 of 10 required keywords: bicycle, dog, wind, flower, path. The sentence should be expanded to naturally incorporate these words.”
Iteration 1 – Refinement (model sees \(x\), \(y_0\), \(fb_0\)):
\(y_1\) = “The cyclist pedaled their bicycle along the winding path beside the river, crossing the old stone bridge as the wind carried flower petals past a resting dog, all beneath a sunset that colored the clouds.”
Keywords present: bicycle, path, river, stone, bridge, wind, flower, dog, sunset, cloud (10/10 = 100%).
Iteration 1 – Feedback:
\(fb_1\) = “All 10 keywords are present. The sentence is grammatically correct and reads naturally. No further improvement needed. Stop.”
The stopping condition triggers because the feedback indicates completion. The final output is \(y_1\).
Notice how the feedback at iteration 0 was specific – it listed the exact missing keywords. This gave the refiner a clear target. And at iteration 1, the feedback correctly identified that the task was complete, triggering the stop.
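For constrained generation, the feedback step reduces to a membership check. A minimal sketch (plain substring matching is a simplifying assumption; a real checker might tokenize and handle morphology):

```python
def keyword_feedback(keywords, sentence):
    """Specific feedback for constrained generation: name the exact missing keywords."""
    text = sentence.lower()
    missing = [kw for kw in keywords if kw not in text]
    if not missing:
        return "All keywords are present. No further improvement needed. Stop."
    return (f"The sentence is missing {len(missing)} of {len(keywords)} "
            f"required keywords: {', '.join(missing)}.")

keywords = ["river", "bridge", "sunset", "bicycle", "dog",
            "wind", "flower", "stone", "cloud", "path"]
y0 = ("The cyclist rode across the old stone bridge, watching "
      "the sunset paint the clouds above the river.")
print(keyword_feedback(keywords, y0))  # names bicycle, dog, wind, flower, path
```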
Recall: Why does Self-Refine pass the full history of outputs and feedback to the model at each iteration, rather than just the most recent output and feedback?
Apply: Trace through one complete iteration of Self-Refine for the following. Input \(x\): “User: I’m feeling really stressed about my exams.” Initial output \(y_0\): “You should study harder.” Write the feedback \(fb_0\) (addressing at least two aspects: empathy and helpfulness) and the refined output \(y_1\).
Extend: The iteration table shows diminishing returns – the improvement from \(y_0\) to \(y_1\) is always much larger than from \(y_1\) to \(y_2\). Propose two distinct explanations for this diminishing returns pattern. One should be optimistic (the method is working well) and one should be pessimistic (the method has a limitation).
Self-Refine is not magic. It works impressively well on some tasks and barely helps on others. Understanding these boundaries is essential for knowing when to apply the technique. This lesson examines the empirical results and traces successes and failures back to feedback quality.
Think about the difference between asking someone to review your cooking versus asking them to review your math homework. Tasting a dish, a reviewer can immediately identify “too salty,” “needs more acid,” or “overcooked pasta.” These are observable, sensory judgments. But reviewing a math proof requires verifying each logical step – and if the reviewer has the same mathematical blind spots as the writer, they will miss the same errors.
Self-Refine faces the same asymmetry. On tasks where quality is perceptible (dialogue engagement, code readability, keyword coverage), the model can generate useful feedback and improve. On tasks where quality requires deep verification (mathematical correctness), the model struggles to identify errors in its own reasoning.
The paper evaluated Self-Refine on 7 tasks across three models (GPT-3.5, ChatGPT, GPT-4). The results split into three categories:
Large gains (preference-based tasks):
| Task | GPT-4 Base | GPT-4 + Self-Refine | Improvement |
|---|---|---|---|
| Dialogue Response | 25.4% | 74.6% | +49.2 |
| Sentiment Reversal | 3.8% | 36.2% | +32.4 |
| Constrained Generation | 15.0% | 45.0% | +30.0 |
| Code Readability | 27.4% | 56.2% | +28.8 |
| Acronym Generation | 30.4% | 56.0% | +25.6 |
Moderate gains (code optimization):
| Task | GPT-4 Base | GPT-4 + Self-Refine | Improvement |
|---|---|---|---|
| Code Optimization | 27.3% | 36.0% | +8.7 |
Minimal gains (math reasoning):
| Task | GPT-4 Base | GPT-4 + Self-Refine | Improvement |
|---|---|---|---|
| Math Reasoning | 92.9% | 93.1% | +0.2 |
The math reasoning failure is especially revealing. The paper found that 94% of ChatGPT’s feedback on math problems was “everything looks good” – even when the answer was wrong. A plausible-looking chain of reasoning deceives the model’s self-evaluation. When the authors gave the model an external oracle that simply said “your answer is wrong” (without indicating where), improvement jumped to 1-5%. This confirms that the bottleneck is feedback accuracy, not the refinement mechanism.
The paper also tested Self-Refine against a natural baseline: instead of iterating with feedback, just generate 4 independent samples and pick the best one. If Self-Refine only helped because it produced “more outputs,” this baseline should perform comparably. It did not. Human evaluators preferred the single Self-Refine output over all 4 independent samples. The feedback-driven revision, not mere resampling, drives the improvement.
Weak models fail entirely. When the authors tried Self-Refine with Vicuna-13B (a smaller open-source model), it could not consistently generate feedback in the required format. Even when given correct feedback from an oracle, it often repeated its original output or generated hallucinated (fabricated, not grounded in the input) conversations instead of refining. Self-Refine requires a model strong enough to follow few-shot instructions for generation, feedback, and refinement – three distinct tasks steered only by prompts.
Let’s trace why Self-Refine fails on a math problem. The input \(x\) is: “There are 15 trees in the grove. Grove workers will plant trees today. After they are done, there will be 21 trees. How many trees did the workers plant?”
\(y_0\): “There are 15 trees originally. Then there were 21 trees after some more were planted. So 15 + 21 = 36. The answer is 36.”
The answer is wrong (correct answer: 21 - 15 = 6). Now the model generates feedback:
\(fb_0\): “The reasoning looks correct. The initial count is 15, the final count is 21, and the calculation 15 + 21 = 36 appears to be right. Everything looks good.”
The feedback approves the wrong answer because the chain of reasoning “looks plausible” to the model. The model cannot reliably detect that addition should have been subtraction. Without accurate feedback, the refinement step has nothing to fix, so \(y_1 \approx y_0\).
Now contrast this with Sentiment Reversal, where the model excels. The input \(x\) is a positive Yelp review: “The pizza was amazing! Best I’ve ever had. Will definitely come back.”
\(y_0\): “The pizza was terrible. I would not recommend this place.”
\(fb_0\): “The reversal is incomplete. The original mentions three aspects: pizza quality, comparison (‘best I’ve ever had’), and intent to return. The reversal only addresses pizza quality. Add negative counterparts for the comparison and return intent.”
\(y_1\): “The pizza was terrible. Worst I’ve ever had. I will never come back.”
Here the feedback correctly identifies exactly what is missing (two of three aspects were not reversed), and the refiner addresses each point. The task succeeds because quality is observable: you can check whether each sentiment aspect has been reversed by reading the text.
Recall: What percentage of ChatGPT’s feedback on math reasoning was “everything looks good”? What does this tell you about the relationship between feedback quality and task improvement?
Apply: Classify each of the following tasks as likely to benefit strongly, moderately, or minimally from Self-Refine. Explain your reasoning for each. (a) Translating English to French. (b) Generating a haiku that uses all 5 given words. (c) Proving a theorem in formal logic. (d) Writing a product description that is persuasive, accurate, and under 50 words.
Extend: The paper shows that an external oracle saying “your answer is wrong” (without explaining where) improves math performance from +0.2% to +1-5%. Design a hybrid system that combines Self-Refine with an external signal for math reasoning. What external tool would you use, what information would it provide to the model, and how would you integrate it into the feedback step?
This final lesson assembles all the pieces into the complete Self-Refine algorithm. You now understand one-shot generation (Lesson 1), feedback (Lesson 2), refinement (Lesson 3), iterative history (Lesson 4), and the method’s boundaries (Lesson 5). Here we formalize the full algorithm, examine how only the prompts change between tasks, and understand what makes Self-Refine a general-purpose technique.
Think of a one-person theater company where the same actor plays three roles by changing costumes. Between scenes, the actor steps offstage, changes costume, and returns as a different character. The audience sees three distinct roles – the writer, the editor, and the rewriter – but behind the curtain it is always the same person.
Self-Refine works the same way. A single language model \(\mathcal{M}\) plays three roles, switching between them by changing the prompt:
- Generator: prompted with \(p_{\text{gen}}\), it produces the initial output.
- Feedback provider: prompted with \(p_{\text{fb}}\), it critiques the current output.
- Refiner: prompted with \(p_{\text{refine}}\), it produces an improved output.
The full algorithm is:
```
Require: input x, model M, prompts {p_gen, p_fb, p_refine},
         stop condition stop(.)
 1: y_0 = M(p_gen || x)                          -- Initial generation
 2: for t = 0, 1, ... do
 3:     fb_t = M(p_fb || x || y_t)               -- Feedback
 4:     if stop(fb_t, t) then
 5:         break                                -- Stop condition met
 6:     else
 7:         y_{t+1} = M(p_refine || x || y_0 || fb_0 || ... || y_t || fb_t)  -- Refine
 8:     end if
 9: end for
10: return y_t
```
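The pseudocode maps directly onto a short Python sketch, with `model(prompt)` as a hypothetical stand-in for \(\mathcal{M}\) and string concatenation standing in for \(\|\):

```python
def self_refine(x, model, p_gen, p_fb, p_refine, stop, max_iters=4):
    """Sketch of the Self-Refine loop: generate once, then alternate feedback and refinement."""
    y = model(p_gen + x)              # initial generation
    history = ""
    for t in range(max_iters):
        fb = model(p_fb + x + y)      # feedback on the current output
        if stop(fb, t):               # stopping condition met
            break
        history += y + fb             # keep the full history of outputs and feedback
        y = model(p_refine + x + history)  # refine using the full history
    return y
```

The same loop serves every task; only the contents of `p_gen`, `p_fb`, and `p_refine` change.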
The stopping condition \(\mathrm{stop}(fb_t, t)\) is task-specific. It may check:
What makes Self-Refine general is that the same three-step loop applies to every task – only the contents of the three prompts change. For dialogue, \(p_{\text{gen}}\) contains example dialogues, \(p_{\text{fb}}\) contains example dialogues with multi-aspect feedback, and \(p_{\text{refine}}\) contains example dialogues with feedback and improved versions. For code, the same three prompt types contain code examples instead. The algorithm is task-agnostic; the task-specific knowledge lives entirely in the few-shot examples within each prompt.
Examples of Self-Refine. Top row: dialogue response generation, where an initial generic response is transformed into an engaging, user-aware response after feedback. Bottom row: code optimization, where a brute-force loop is replaced with a closed-form formula after feedback identifies the inefficiency.
This design has several important properties:
No training required. The model’s weights never change. All behavior changes come from the prompts. This means Self-Refine can be applied to any capable LLM immediately, without data collection or GPU-hours for fine-tuning.
Inference-time compute. All improvement happens at “test time” – when the model is generating outputs for real inputs. This is the same idea explored in different ways by chain-of-thought prompting (spending more tokens on reasoning within a single generation) and ReAct (interleaving reasoning with tool use). Self-Refine extends the principle across multiple generation passes.
Cost tradeoff. Each iteration requires two LLM calls (one for feedback, one for refinement). With 3 iterations, the total cost is roughly 7x a single generation (1 initial + 3 feedback + 3 refinement). The paper does not analyze this cost-quality tradeoff in detail, but the empirical results suggest that 1-2 iterations capture most of the improvement.
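The call count follows from one initial generation plus two calls per iteration:

```python
def total_llm_calls(iterations):
    """1 initial generation + (1 feedback + 1 refinement) per iteration."""
    return 1 + 2 * iterations

assert total_llm_calls(3) == 7  # the ~7x cost of 3 iterations cited above
```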
Comparison with RLHF. Reinforcement learning from human feedback (see Training Language Models to Follow Instructions with Human Feedback) trains a model to internalize human preferences by updating weights. Self-Refine achieves a similar effect – better alignment with what humans want – without touching weights. The preferences are externalized in the feedback prompt. This makes Self-Refine faster to deploy but potentially less robust: RLHF bakes preferences into every generation, while Self-Refine relies on the model’s ability to follow few-shot feedback instructions.
Comparison with ReAct. ReAct (see ReAct: Synergizing Reasoning and Acting) interleaves reasoning and action within a single generation pass. Self-Refine separates generation, feedback, and refinement into distinct passes over the full context. ReAct is better suited for tasks requiring external tool use (search, databases). Self-Refine is better suited for tasks requiring quality improvement through critique and revision.
Let’s trace the complete Self-Refine algorithm on the Acronym Generation task. The goal is to generate an acronym for a given title that is catchy, relevant, and pronounceable.
Input \(x\): “Cooperative Robotic Assistants for Flexible Task Scheduling”
Step 1 – Initial generation (\(t=0\)):
\(y_0\) = “CRAFTS”
Letters: C(ooperative) R(obotic) A(ssistants) F(or) T(ask) S(cheduling). All letters map to words. The acronym is pronounceable and spells an English word.
Step 2 – Feedback (\(t=0\)):
\(fb_0\) = “Relevance: The acronym CRAFTS is pronounceable and uses all first letters. However, ‘crafts’ suggests arts and crafts, not robotics or scheduling. The acronym should evoke technology or automation. Score: 5/10.”
Step 3 – Check stopping condition:
Score is 5/10, below threshold (8/10). Continue.
Step 4 – Refinement (\(t=0\), model sees \(x\), \(y_0\), \(fb_0\)):
\(y_1\) = “CoRAFTS” (with lowercase letters indicating partial word usage: Co-operative, R-obotic, A-ssistants, F-lexible, T-ask, S-cheduling).
Alternatively, a more realistic refinement might be:
\(y_1\) = “DRAFTS” (Dynamic Robotic Assistants for Flexible Task Scheduling) – the model modifies “Cooperative” to “Dynamic” to create an acronym that still spells a word but is more relevant to automation.
Step 5 – Feedback (\(t=1\), model sees \(x\), \(y_0\), \(fb_0\), \(y_1\)):
\(fb_1\) = “Relevance: DRAFTS is better – ‘Dynamic’ suggests automation. But ‘drafts’ still primarily evokes writing or beer. The first word was changed from the original title, which may not be acceptable if the title is fixed. Score: 6/10.”
Step 6 – Refinement (\(t=1\), model sees full history):
\(y_2\) = “CRAFTS” with the subtitle “Cooperative Robotic Assistants for Flexible Task Scheduling” – the model decides to keep the original mapping but add context that makes the technology connection clear.
Or: The model generates \(y_2\) = “C-RAFTS” (emphasizing the robotic angle).
In practice, acronym generation shows non-monotonic (not strictly increasing) improvement – one aspect (pronounceability) may improve while another (relevance) dips. The paper addresses this by having the feedback generate scores for multiple dimensions, and selecting the output with the best overall score rather than always taking the last iteration.
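That selection rule can be sketched as a simple argmax over per-iteration overall scores (the scores below are illustrative, not from the paper):

```python
def select_best(outputs, overall_scores):
    """Pick the output with the best overall feedback score, not necessarily the last one."""
    best_idx = max(range(len(outputs)), key=lambda i: overall_scores[i])
    return outputs[best_idx]

# Non-monotonic trajectory: the middle iteration scores highest, so it is returned.
assert select_best(["CRAFTS", "DRAFTS", "C-RAFTS"], [5, 6, 4]) == "DRAFTS"
```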
Recall: How many distinct prompts does Self-Refine use, and what does each one contain? What is the only thing that changes between tasks?
Apply: You want to apply Self-Refine to the task of writing unit tests for Python functions. Design the three prompts (\(p_{\text{gen}}\), \(p_{\text{fb}}\), \(p_{\text{refine}}\)) at a high level. For each prompt, describe what the few-shot examples look like (input, output, feedback structure). What would your stopping condition be?
Extend: Self-Refine uses the same model for all three roles. What would happen if you used a stronger model for the feedback step and a weaker (cheaper) model for the generation and refinement steps? Would this improve cost-effectiveness? What risks would it introduce?
Self-Refine operates entirely at inference time with no weight updates. What are the advantages of this approach over RLHF-style training? What are the disadvantages? Consider cost, robustness, and generalization.
The paper found that 94% of failures were due to bad feedback (33% wrong error location + 61% wrong fix suggestion) and only 6% were due to the refiner implementing good feedback incorrectly. If you could improve only one component of Self-Refine, which would you choose and why?
Self-Refine was shown to outperform generating 4 independent samples without feedback. Why does feedback-guided revision produce better results than simply generating more candidates? Consider what information the feedback provides that independent sampling does not.
The paper evaluates only closed-source models (GPT-3.5, ChatGPT, GPT-4). The one open-source model tested (Vicuna-13B) failed. What minimum capabilities must a model have for Self-Refine to work? Is this a limitation of the method or a reflection of model quality in 2023?
Chain-of-thought prompting (see Chain-of-Thought Prompting) improves reasoning within a single generation, while Self-Refine improves output quality across multiple generations. Could you combine the two – using chain-of-thought within each generation step and Self-Refine across steps? What kinds of tasks would benefit from this combination?
Build a Self-Refine simulator that iteratively improves a text output through feedback and refinement, demonstrating how quality increases with each iteration and how feedback specificity determines the improvement rate.
You will simulate the Self-Refine loop for a constrained generation task: producing a sentence that includes as many of N required keywords as possible. The “model” is a simple function that picks keywords and assembles them into phrases. The “feedback” function checks which keywords are missing. The “refiner” tries to incorporate missing keywords.
The simulation demonstrates three key findings from the paper: (1) specific, actionable feedback substantially outperforms generic score-only feedback; (2) most of the improvement arrives in the first one or two iterations; and (3) later iterations still yield positive, if diminishing, returns.
import numpy as np


def generate_initial(keywords, rng, include_prob=0.5):
    """
    Simulate initial generation: each keyword is independently
    included with probability include_prob.

    Returns a set of included keywords (simulating the model's
    first draft).
    """
    included = set()
    for kw in keywords:
        if rng.random() < include_prob:
            included.add(kw)
    return included


def generate_feedback_specific(keywords, included):
    """
    Generate specific, actionable feedback: list the exact
    keywords that are missing.

    TODO: Return a dict with:
      - "missing": set of keywords not in included
      - "score": fraction of keywords included (0.0 to 1.0)
      - "stop": True if all keywords are included, False otherwise
    """
    # TODO: implement
    pass


def generate_feedback_generic(keywords, included):
    """
    Generate generic feedback: only report the score,
    without listing which keywords are missing.

    TODO: Return a dict with:
      - "missing": empty set (generic feedback doesn't identify specifics)
      - "score": fraction of keywords included (0.0 to 1.0)
      - "stop": True if all keywords are included, False otherwise
    """
    # TODO: implement
    pass


def refine_with_feedback(keywords, included, feedback, rng,
                         per_keyword_fix_prob=0.7):
    """
    Simulate refinement: attempt to incorporate missing keywords.

    With specific feedback (missing keywords listed), each missing
    keyword is added with probability per_keyword_fix_prob.

    With generic feedback (no missing keywords listed), the refiner
    picks random keywords from ALL keywords (not just missing ones)
    and tries to add them -- simulating the model guessing at what
    to fix without specific guidance.

    The refiner never removes keywords that are already included
    (it preserves what works).

    TODO: Implement both specific and generic refinement paths.
    Return the updated set of included keywords.
    """
    # TODO: implement
    pass


def run_self_refine(keywords, feedback_fn, rng, max_iterations=4,
                    include_prob=0.5, fix_prob=0.7):
    """
    Run the full Self-Refine loop.

    TODO:
    1. Generate initial output
    2. Loop up to max_iterations times:
       a. Generate feedback
       b. If stop condition met, break
       c. Refine based on feedback
    3. Return a list of scores (one per iteration, including initial)

    Return: list of floats, where each float is the keyword
    coverage (0.0 to 1.0) at that iteration.
    """
    # TODO: implement
    pass


def run_experiment():
    """Compare specific vs generic feedback across multiple trials."""
    rng = np.random.default_rng(42)
    keywords = [
        "river", "bridge", "sunset", "bicycle", "dog",
        "wind", "flower", "stone", "cloud", "path",
        "morning", "garden", "shadow", "bell", "wave"
    ]
    n_trials = 200
    max_iters = 5

    # Run trials with specific feedback
    specific_scores = np.zeros((n_trials, max_iters))
    for trial in range(n_trials):
        scores = run_self_refine(
            keywords, generate_feedback_specific, rng,
            max_iterations=max_iters - 1, include_prob=0.4,
            fix_prob=0.7
        )
        # Pad scores to max_iters length if loop stopped early
        for i in range(len(scores)):
            specific_scores[trial, i] = scores[i]
        for i in range(len(scores), max_iters):
            specific_scores[trial, i] = scores[-1]

    # Run trials with generic feedback
    generic_scores = np.zeros((n_trials, max_iters))
    for trial in range(n_trials):
        scores = run_self_refine(
            keywords, generate_feedback_generic, rng,
            max_iterations=max_iters - 1, include_prob=0.4,
            fix_prob=0.7
        )
        for i in range(len(scores)):
            generic_scores[trial, i] = scores[i]
        for i in range(len(scores), max_iters):
            generic_scores[trial, i] = scores[-1]

    # Display results
    print("Self-Refine: Specific vs Generic Feedback")
    print("=" * 60)
    print(f"{'Iteration':<12}{'Specific':<15}{'Generic':<15}{'Gap':<10}")
    print("-" * 60)
    for i in range(max_iters):
        s_mean = np.mean(specific_scores[:, i])
        g_mean = np.mean(generic_scores[:, i])
        gap = s_mean - g_mean
        label = "y_0 (init)" if i == 0 else f"y_{i} (iter {i})"
        print(f"{label:<12}{s_mean:<15.1%}{g_mean:<15.1%}{gap:<+10.1%}")
    print()
    print("Diminishing returns analysis (specific feedback):")
    print("-" * 40)
    for i in range(1, max_iters):
        delta = np.mean(specific_scores[:, i]) - np.mean(specific_scores[:, i-1])
        print(f"  y_{i-1} -> y_{i}: {delta:+.1%}")
    print()
    print(f"Final coverage -- Specific: {np.mean(specific_scores[:, -1]):.1%}, "
          f"Generic: {np.mean(generic_scores[:, -1]):.1%}")


if __name__ == "__main__":
    run_experiment()

Self-Refine: Specific vs Generic Feedback
============================================================
Iteration Specific Generic Gap
------------------------------------------------------------
y_0 (init) 40.1% 39.9% +0.2%
y_1 (iter 1)72.1% 52.3% +19.8%
y_2 (iter 2)89.3% 60.8% +28.5%
y_3 (iter 3)96.1% 67.0% +29.1%
y_4 (iter 4)98.8% 71.5% +27.3%
Diminishing returns analysis (specific feedback):
----------------------------------------
y_0 -> y_1: +32.0%
y_1 -> y_2: +17.2%
y_2 -> y_3: +6.8%
y_3 -> y_4: +2.7%
Final coverage -- Specific: 98.8%, Generic: 71.5%
The exact numbers depend on the random seed, but the pattern should match the paper’s findings: (1) specific feedback dramatically outperforms generic feedback, (2) most improvement occurs in the first 1-2 iterations, (3) each iteration still provides positive returns, and (4) the gap between specific and generic feedback grows as iterations proceed.
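If you get stuck, here is one possible way to complete the TODO stubs. This is a sketch under the assumptions stated in the docstrings above, not the only valid solution; in particular, the generic path's guess count of `len(keywords) // 3` is an arbitrary choice.

```python
# One possible completion of the exercise stubs: specific feedback names
# the missing keywords; generic feedback reports only a score; the refiner
# either fixes named gaps or guesses blindly. The generic-path guess count
# of len(keywords) // 3 is an arbitrary modeling choice.
import numpy as np

def generate_initial(keywords, rng, include_prob=0.5):
    """First draft: each keyword independently included with include_prob."""
    return {kw for kw in keywords if rng.random() < include_prob}

def generate_feedback_specific(keywords, included):
    """Actionable feedback: names exactly which keywords are missing."""
    missing = set(keywords) - included
    return {"missing": missing,
            "score": len(included) / len(keywords),
            "stop": not missing}

def generate_feedback_generic(keywords, included):
    """Score-only feedback: reports coverage but not what is missing."""
    return {"missing": set(),
            "score": len(included) / len(keywords),
            "stop": len(included) == len(keywords)}

def refine_with_feedback(keywords, included, feedback, rng,
                         per_keyword_fix_prob=0.7):
    """Add keywords; never remove ones already present."""
    updated = set(included)
    if feedback["missing"]:
        # Specific path: target exactly the named missing keywords.
        targets = feedback["missing"]
    else:
        # Generic path: guess random keywords from the full list.
        n_guesses = max(1, len(keywords) // 3)
        targets = rng.choice(keywords, size=n_guesses, replace=False)
    for kw in targets:
        if rng.random() < per_keyword_fix_prob:
            updated.add(str(kw))
    return updated

def run_self_refine(keywords, feedback_fn, rng, max_iterations=4,
                    include_prob=0.5, fix_prob=0.7):
    """Generate once, then alternate feedback and refinement,
    recording keyword coverage at each step (initial included)."""
    included = generate_initial(keywords, rng, include_prob)
    fb = feedback_fn(keywords, included)
    scores = [fb["score"]]
    for _ in range(max_iterations):
        if fb["stop"]:
            break
        included = refine_with_feedback(keywords, included, fb, rng,
                                        per_keyword_fix_prob=fix_prob)
        fb = feedback_fn(keywords, included)
        scores.append(fb["score"])
    return scores
```

Because the refiner never removes included keywords, coverage under specific feedback is monotonically non-decreasing in this simulation; the generic path can stall because its guesses may land on keywords that are already present.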