Authors: Long Ouyang, Jeff Wu, Xu Jiang et al. Year: 2022 Source: arXiv 2203.02155
A small language model trained on human feedback about what makes a good response can outperform a model 100 times its size that was only trained to predict the next word.
By 2022, the largest language models (like GPT-3 with 175 billion parameters) were trained on a simple objective: predict the next word in a sequence of internet text. This made them remarkably fluent writers, but “fluent” and “helpful” are not the same thing. Ask GPT-3 to explain quantum physics to a 5-year-old, and it might instead continue writing a Wikipedia article, produce a toxic rant, or confidently make up facts – because its training objective never taught it what a user actually wants.
This gap between “predict the next token” and “follow the user’s instructions helpfully and safely” is what the authors call the alignment problem. The language modeling objective is misaligned with user intent. A model trained to mimic internet text will produce text that looks like internet text – including all the misinformation, toxicity, and irrelevance that the internet contains.
Previous attempts to fix this included filtering training data, adding safety tokens, or crafting clever prompts. But these were band-aids. The fundamental issue remained: the model’s training objective did not encode what humans actually want. The question was whether you could take a pretrained language model and systematically reshape its behavior to follow instructions, tell the truth, and avoid harm – without destroying its general capabilities.
Think of training a language model like raising a child who has read every book in a library. The child can recite facts, mimic writing styles, and complete sentences – but it has no sense of what’s appropriate to say in a conversation. InstructGPT’s innovation is a three-step “finishing school” that teaches the model social skills through human feedback.
The key idea is Reinforcement Learning from Human Feedback (RLHF), a pipeline with three stages. First, human labelers write example responses to prompts, and the model learns by imitation (supervised fine-tuning). Second, labelers rank multiple model outputs from best to worst, and a separate “reward model” learns to predict which outputs humans prefer. Third, the original model practices generating responses and uses the reward model as an automated judge, adjusting its behavior to score higher – a process driven by a reinforcement learning algorithm called Proximal Policy Optimization (PPO).
The striking result: a 1.3 billion parameter model trained this way was preferred by human evaluators over the raw 175 billion parameter GPT-3 – a model more than 100 times larger. This demonstrated that how you train a model matters far more than how big it is, at least for the goal of following instructions. RLHF became the foundational technique behind ChatGPT and nearly every modern conversational AI system.
InstructGPT starts with GPT-3 as its base model – a decoder-only Transformer (a neural network architecture where information flows in one direction, from left to right through a sequence of tokens; see Attention Is All You Need) pretrained on internet text (see Improving Language Understanding by Generative Pre-Training for the GPT approach). The method then applies three sequential steps to reshape this model’s behavior.
Figure 2: The RLHF pipeline. Step 1: Collect demonstration data from human labelers and fine-tune GPT-3 with supervised learning (SFT). Step 2: Collect comparison data by having labelers rank model outputs, then train a reward model to predict human preferences. Step 3: Use the reward model as a scoring function and optimize a new policy against it using PPO (Proximal Policy Optimization). Blue arrows indicate data used to train each stage.
Step 1: Supervised Fine-Tuning (SFT). A team of about 40 human labelers writes high-quality demonstrations of how to respond to diverse prompts (questions, instructions, creative writing requests, code tasks, etc.). The prompts come from real users of the OpenAI API and from labeler-written examples. The model is fine-tuned on roughly 13,000 of these prompt-demonstration pairs using standard supervised learning – the same next-token prediction objective as pretraining, but now on curated instruction-response pairs instead of raw internet text. The model trains for 16 epochs with cosine learning rate decay and 0.2 residual dropout.
Step 2: Reward Model (RM) Training. Instead of writing more demonstrations (which is expensive), labelers now compare model outputs. For a given prompt, the model generates \(K\) different responses (between 4 and 9). Labelers rank these responses from best to worst. Each ranking produces \(\binom{K}{2}\) pairwise comparisons (for example, \(K = 4\) gives 6 pairs). A 6-billion-parameter reward model (a modified GPT-3 with its final text-generation layer replaced by a single scalar output) learns to assign a numerical score to each prompt-response pair such that preferred responses receive higher scores. The key efficiency trick: all \(\binom{K}{2}\) comparisons from one prompt are trained as a single batch, requiring only one forward pass per response rather than one per pair. The RM training dataset contains about 33,000 prompts.
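The expansion of one ranking into \(\binom{K}{2}\) pairs can be sketched in a few lines of Python (the helper name and response labels are illustrative, not from the paper):

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Expand one labeler ranking (best first) into all C(K, 2)
    (preferred, dispreferred) training pairs for the reward model."""
    # combinations preserves input order, so the earlier-ranked
    # (better) response always comes first in each pair.
    return list(combinations(ranked_responses, 2))

# A labeler ranked K = 4 responses from best to worst.
ranked = ["resp_A", "resp_B", "resp_C", "resp_D"]
pairs = ranking_to_pairs(ranked)
print(len(pairs))  # C(4, 2) = 6 pairwise comparisons
```

Because every pair reuses the same per-response scores, computing all six comparisons needs only four forward passes through the reward model, which is the batching efficiency described above.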
Figure 3: The labeling interface used by human evaluators. For each model output, labelers assign a quality rating on a 1-7 Likert scale and flag specific issues such as failing to follow instructions, hallucinating information, containing inappropriate content, or expressing harmful advice. This structured evaluation feeds into both the reward model training and the final evaluation of InstructGPT.
Step 3: Reinforcement Learning with PPO. The SFT model from Step 1 is now treated as a “policy” (a term from reinforcement learning meaning a function that decides what action to take – here, what token to generate next). This policy generates responses to prompts, and the reward model from Step 2 scores them. The PPO algorithm (Proximal Policy Optimization, a reinforcement learning method that updates the policy in small, stable steps) adjusts the model’s parameters to maximize the reward score. To prevent the model from “gaming” the reward model by generating degenerate text that scores high but reads poorly, a KL divergence penalty (a measure of how much the RL-trained model’s output distribution has drifted from the original SFT model) is added at each token. The PPO training uses about 31,000 prompts from the API.
Figure 4: The comparison ranking interface. Labelers see multiple model outputs for the same prompt and drag them into ranked order from best (Rank 1) to worst. These rankings produce the pairwise preference data used to train the reward model. For example, with 5 outputs shown here, a single ranking task yields 10 pairwise comparisons.
An additional variant called PPO-ptx mixes standard language modeling updates on the original pretraining data into the RL phase. This counteracts the performance regressions on standard NLP tasks that alignment training otherwise causes (a cost the authors call the “alignment tax”) while the model still learns to follow instructions.
The paper introduces two key equations. The first defines how the reward model learns from human preferences, and the second defines the objective that the final model optimizes.
Reward Model Loss:
\[\text{loss}(\theta) = -\frac{1}{\binom{K}{2}} E_{(x, y_w, y_l) \sim D} \left[ \log \left( \sigma \left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right) \right]\]
where:

- \(\theta\) – the reward model’s parameters
- \(x\) – a prompt; \(y_w\) is the preferred and \(y_l\) the dispreferred response in a comparison pair
- \(r_\theta(x, y)\) – the scalar score the reward model assigns to response \(y\) for prompt \(x\)
- \(\sigma\) – the sigmoid function, which maps the score difference to a probability
- \(D\) – the dataset of human comparisons
- \(K\) – the number of responses ranked per prompt, yielding \(\binom{K}{2}\) pairs
In plain language: for each pair of responses where a human preferred one over the other, this loss pushes the reward model to assign a higher score to the preferred response. The sigmoid of the score difference represents the probability that the model correctly predicts the human preference. Taking the log and negating it turns this into a cross-entropy loss – the same kind used in logistic regression. The \(\frac{1}{\binom{K}{2}}\) term averages over all pairs from the ranking.
This matters because it converts subjective human judgments (“I like response A better than B”) into a learnable numerical signal. The reward model becomes a proxy for human preference that can evaluate millions of responses automatically.
For a worked example: suppose \(K = 4\) responses are ranked, producing \(\binom{4}{2} = 6\) pairs. For one pair, if \(r_\theta(x, y_w) = 2.0\) and \(r_\theta(x, y_l) = 0.5\), then \(\sigma(2.0 - 0.5) = \sigma(1.5) \approx 0.82\), and \(\log(0.82) \approx -0.20\). The loss contribution is \(0.20\) – small because the model correctly ranked this pair. If the scores were reversed (\(r_\theta(x, y_w) = 0.5\), \(r_\theta(x, y_l) = 2.0\)), then \(\sigma(-1.5) \approx 0.18\), \(\log(0.18) \approx -1.71\), and the loss would be \(1.71\) – much larger, strongly penalizing the incorrect ranking.
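The worked example above can be checked directly; here is a minimal sketch of the per-pair loss (function names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_rm_loss(score_preferred, score_dispreferred):
    """Per-pair reward-model loss: -log(sigmoid(r_w - r_l))."""
    return -math.log(sigmoid(score_preferred - score_dispreferred))

# Correctly ranked pair: small loss.
print(round(pairwise_rm_loss(2.0, 0.5), 2))  # ~0.2
# Reversed scores: much larger loss.
print(round(pairwise_rm_loss(0.5, 2.0), 2))  # ~1.7
```

The asymmetry is the point: confidently correct rankings contribute almost nothing, while confidently wrong rankings dominate the gradient.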
PPO-ptx Objective:
\[\text{objective}(\phi) = E_{(x,y) \sim D_{\pi_\phi^{\text{RL}}}} \left[ r_\theta(x, y) - \beta \log \left( \pi_\phi^{\text{RL}}(y \mid x) / \pi^{\text{SFT}}(y \mid x) \right) \right] + \gamma E_{x \sim D_{\text{pretrain}}} \left[ \log(\pi_\phi^{\text{RL}}(x)) \right]\]
where:

- \(\phi\) – the parameters of the RL policy \(\pi_\phi^{\text{RL}}\) being trained
- \(\pi^{\text{SFT}}\) – the frozen supervised fine-tuned model from Step 1
- \(r_\theta(x, y)\) – the reward model’s score for response \(y\) to prompt \(x\)
- \(\beta\) – the coefficient on the per-token KL penalty
- \(\gamma\) – the coefficient on the pretraining (language modeling) term
- \(D_{\text{pretrain}}\) – the original pretraining data distribution
In plain language: this objective has two parts. The first part says “generate responses that the reward model likes (\(r_\theta\)), but don’t stray too far from the SFT model” (the \(\beta \log(\pi^{\text{RL}} / \pi^{\text{SFT}})\) term is a KL divergence penalty that grows when the RL model’s token probabilities diverge from the SFT model’s). The second part says “also keep being good at predicting text in general” (the \(\gamma\) term is just the standard language modeling loss on pretraining data). For plain PPO models (without pretraining mix), \(\gamma = 0\).
This matters because without the KL penalty, the RL model would learn to exploit weaknesses in the reward model – producing degenerate outputs that score high but are actually gibberish. Without the pretraining mix (\(\gamma\) term), the model loses performance on standard NLP tasks. The objective balances three competing goals: follow instructions well, stay close to the SFT model, and retain general language abilities.
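The three-way balance can be made concrete with a per-sample sketch of the objective (to be maximized, not minimized). The coefficient values here are illustrative placeholders, not the paper’s tuned hyperparameters:

```python
def ppo_ptx_objective(reward, logp_rl, logp_sft, logp_pretrain,
                      beta=0.2, gamma=0.0):
    """Per-sample PPO-ptx objective. beta and gamma are illustrative;
    gamma=0 recovers plain PPO (no pretraining mix)."""
    kl_penalty = beta * (logp_rl - logp_sft)  # log(pi_RL / pi_SFT)
    ptx_bonus = gamma * logp_pretrain         # language-modeling term
    return reward - kl_penalty + ptx_bonus

# The policy drifted (logp_rl > logp_sft), so the KL term
# subtracts from the reward-model score.
obj = ppo_ptx_objective(reward=1.2, logp_rl=-3.0, logp_sft=-3.5,
                        logp_pretrain=-2.0, beta=0.2, gamma=0.0)
print(round(obj, 2))  # 1.2 - 0.2*0.5 = 1.1
```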
Standard Language Modeling Loss (SFT stage):
\[L_{\text{SFT}} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, \ldots, x_{t-1}; \theta)\]
where:

- \(x_t\) – the token at position \(t\) in the sequence
- \(T\) – the sequence length
- \(\theta\) – the model’s parameters
- \(P(x_t \mid x_1, \ldots, x_{t-1}; \theta)\) – the model’s predicted probability of the next token given all preceding tokens
In plain language: this is the standard next-token prediction loss. For each position in the sequence, the model predicts a probability distribution over all possible next tokens, and the loss measures how surprised the model is by the actual next token (lower loss means the model assigned high probability to the correct token). During SFT, this is applied to the labeler-written demonstration data rather than internet text.
This matters because it establishes the starting point: a model that can mimic the style and content of human-written demonstrations. The SFT model is already significantly better than raw GPT-3 at following instructions, and serves as both the initialization for PPO training and the reference distribution for the KL penalty.
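The SFT loss reduces to summing the negative log-probabilities the model assigned to each actual token of a demonstration. A toy sketch (the probabilities are made up for illustration):

```python
import math

def sft_loss(token_logprobs):
    """Sequence-level SFT loss: negative sum of the log-probabilities
    the model assigned to each actual next token."""
    return -sum(token_logprobs)

# Toy example: the model's probabilities for the 3 tokens
# of a short labeler-written demonstration.
logprobs = [math.log(0.5), math.log(0.8), math.log(0.9)]
print(round(sft_loss(logprobs), 3))  # ≈ 1.022
```

Each token the model finds “surprising” (low assigned probability) adds more to the loss, so fine-tuning pushes probability mass toward the demonstrated responses.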
InstructGPT demonstrates dramatic improvements over GPT-3 across multiple evaluation axes.
Human preference evaluations. On held-out API prompts, the 175B InstructGPT (PPO-ptx) model was preferred over the 175B GPT-3 85% of the time (with a 95% confidence interval of plus or minus 3 percentage points). Even the 1.3B InstructGPT model was preferred over the 175B GPT-3, demonstrating that alignment training provides a larger effective capability boost than a 100x increase in model size. The model was also preferred over GPT-3 with a carefully crafted few-shot prompt 71% of the time.
| Model | Size | Win Rate vs. 175B SFT |
|---|---|---|
| GPT-3 | 175B | ~20% |
| GPT-3 (prompted) | 175B | ~30% |
| SFT | 175B | 50% (baseline) |
| PPO | 175B | ~65% |
| PPO-ptx (InstructGPT) | 175B | ~65% |
| PPO-ptx (InstructGPT) | 1.3B | ~55% |
Truthfulness. On the TruthfulQA benchmark, InstructGPT generates truthful and informative answers roughly twice as often as GPT-3. On closed-domain tasks (where the answer should only use information from the input), InstructGPT hallucinates about 21% of the time compared to 41% for GPT-3.
Toxicity. When instructed to produce safe and respectful output, InstructGPT generates about 25% fewer toxic outputs than GPT-3 as measured by the Perspective API on the RealToxicityPrompts dataset. However, without the respectful instruction, the improvement disappears. And when explicitly instructed to be toxic, InstructGPT is more toxic than GPT-3 – the model has learned to follow instructions well, including harmful ones.
Alignment tax. Raw PPO training causes regressions on some standard NLP benchmarks (SQuAD, DROP, HellaSwag, translation). The PPO-ptx variant, which mixes pretraining updates into the RL training, largely eliminates these regressions while preserving the instruction-following gains.
Comparison to instruction-tuned models. InstructGPT was preferred over GPT-3 fine-tuned on FLAN (78% win rate) and T0 (79% win rate), two models trained on curated NLP datasets with instructions. This suggests that real user prompts are more diverse than academic NLP benchmarks.
InstructGPT is arguably the most consequential AI alignment paper published to date, because it provided the recipe that transformed language models from text completion engines into conversational assistants. The RLHF pipeline described here – SFT, then reward model, then PPO – became the standard training procedure behind ChatGPT (released later in 2022), and variations of it were adopted by virtually every major AI lab building conversational models (Anthropic’s Claude, Google’s Gemini, Meta’s Llama-based models).
The paper also shifted the field’s understanding of what makes a capable model. Before InstructGPT, the dominant strategy for improving model performance was scaling – training larger models on more data. InstructGPT demonstrated that a relatively cheap post-training alignment step could produce a more useful model than a 100x increase in parameters. This realization catalyzed a wave of research into alignment and post-training methods, including Direct Preference Optimization (DPO), Constitutional AI (CAI), and Reinforcement Learning from AI Feedback (RLAIF).
Beyond the technical contribution, the paper was unusually transparent about the limitations of the alignment process itself – who the labelers are, what values are being encoded, and how the choice of training data shapes model behavior. This honesty about the “alignment target” problem (section 5.2 of the paper asks “who are we aligning to?”) helped frame a research agenda that the field continues to grapple with. The tension between helpfulness and harmlessness that InstructGPT identified – a model that follows instructions well will also follow harmful instructions well – remains one of the central challenges in AI safety.
To understand this paper, you need: