Authors: Long Ouyang, Jeff Wu, Xu Jiang et al. Year: 2022 Source: arXiv 2203.02155
A small language model trained on human feedback about what makes a good response can outperform a model 100 times its size that was only trained to predict the next word.
By 2022, the largest language models (like GPT-3 with 175 billion parameters) were trained on a simple objective: predict the next word in a sequence of internet text. This made them remarkably fluent writers, but “fluent” and “helpful” are not the same thing. Ask GPT-3 to explain quantum physics to a 5-year-old, and it might instead continue writing a Wikipedia article, produce a toxic rant, or confidently make up facts – because its training objective never taught it what a user actually wants.
This gap between “predict the next token” and “follow the user’s instructions helpfully and safely” is what the authors call the alignment problem. The language modeling objective is misaligned with user intent. A model trained to mimic internet text will produce text that looks like internet text – including all the misinformation, toxicity, and irrelevance that the internet contains.
Previous attempts to fix this included filtering training data, adding safety tokens, or crafting clever prompts. But these were band-aids. The fundamental issue remained: the model’s training objective did not encode what humans actually want. The question was whether you could take a pretrained language model and systematically reshape its behavior to follow instructions, tell the truth, and avoid harm – without destroying its general capabilities.
Think of training a language model like raising a child who has read every book in a library. The child can recite facts, mimic writing styles, and complete sentences – but it has no sense of what’s appropriate to say in a conversation. InstructGPT’s innovation is a three-step “finishing school” that teaches the model social skills through human feedback.
The key idea is Reinforcement Learning from Human Feedback (RLHF), a pipeline with three stages. First, human labelers write example responses to prompts, and the model learns by imitation (supervised fine-tuning). Second, labelers rank multiple model outputs from best to worst, and a separate “reward model” learns to predict which outputs humans prefer. Third, the original model practices generating responses and uses the reward model as an automated judge, adjusting its behavior to score higher – a process driven by a reinforcement learning algorithm called Proximal Policy Optimization (PPO).
The striking result: a 1.3 billion parameter model trained this way was preferred by human evaluators over the raw 175 billion parameter GPT-3 – a model more than 100 times larger. This demonstrated that how you train a model matters far more than how big it is, at least for the goal of following instructions. RLHF became the foundational technique behind ChatGPT and nearly every modern conversational AI system.
InstructGPT starts with GPT-3 as its base model – a decoder-only Transformer (a neural network architecture where information flows in one direction, from left to right through a sequence of tokens; see Attention Is All You Need) pretrained on internet text (see Improving Language Understanding by Generative Pre-Training for the GPT approach). The method then applies three sequential steps to reshape this model’s behavior.
Figure 2: The RLHF pipeline. Step 1: Collect demonstration data from human labelers and fine-tune GPT-3 with supervised learning (SFT). Step 2: Collect comparison data by having labelers rank model outputs, then train a reward model to predict human preferences. Step 3: Use the reward model as a scoring function and optimize a new policy against it using PPO (Proximal Policy Optimization). Blue arrows indicate data used to train each stage.
Step 1: Supervised Fine-Tuning (SFT). A team of about 40 human labelers writes high-quality demonstrations of how to respond to diverse prompts (questions, instructions, creative writing requests, code tasks, etc.). The prompts come from real users of the OpenAI API and from labeler-written examples. The model is fine-tuned on roughly 13,000 of these prompt-demonstration pairs using standard supervised learning – the same next-token prediction objective as pretraining, but now on curated instruction-response pairs instead of raw internet text. The model trains for 16 epochs with cosine learning rate decay and 0.2 residual dropout.
Step 2: Reward Model (RM) Training. Instead of writing more demonstrations (which is expensive), labelers now compare model outputs. For a given prompt, the model generates \(K\) different responses (between 4 and 9). Labelers rank these responses from best to worst. Each ranking produces \(\binom{K}{2}\) pairwise comparisons (for example, \(K = 4\) gives 6 pairs). A 6-billion-parameter reward model (a modified GPT-3 with its final text-generation layer replaced by a single scalar output) learns to assign a numerical score to each prompt-response pair such that preferred responses receive higher scores. The key efficiency trick: all \(\binom{K}{2}\) comparisons from one prompt are trained as a single batch, requiring only one forward pass per response rather than one per pair. The RM training dataset contains about 33,000 prompts.
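The expansion of one ranking into \(\binom{K}{2}\) pairs can be sketched in a few lines of Python (the helper name and response labels are illustrative, not from the paper):

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Expand one labeler ranking (best first) into all C(K, 2)
    (preferred, dispreferred) training pairs for the reward model."""
    # combinations preserves input order, so the earlier-ranked
    # (better) response always comes first in each pair.
    return list(combinations(ranked_responses, 2))

# A labeler ranked K = 4 responses from best to worst.
ranked = ["resp_A", "resp_B", "resp_C", "resp_D"]
pairs = ranking_to_pairs(ranked)
print(len(pairs))  # C(4, 2) = 6 pairwise comparisons
```

Because every pair reuses the same per-response scores, computing all six comparisons needs only four forward passes through the reward model, which is the batching efficiency described above.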
Figure 3: The labeling interface used by human evaluators. For each model output, labelers assign a quality rating on a 1-7 Likert scale and flag specific issues such as failing to follow instructions, hallucinating information, containing inappropriate content, or expressing harmful advice. This structured evaluation feeds into both the reward model training and the final evaluation of InstructGPT.
Step 3: Reinforcement Learning with PPO. The SFT model from Step 1 is now treated as a “policy” (a term from reinforcement learning meaning a function that decides what action to take – here, what token to generate next). This policy generates responses to prompts, and the reward model from Step 2 scores them. The PPO algorithm (Proximal Policy Optimization, a reinforcement learning method that updates the policy in small, stable steps) adjusts the model’s parameters to maximize the reward score. To prevent the model from “gaming” the reward model by generating degenerate text that scores high but reads poorly, a KL divergence penalty (a measure of how much the RL-trained model’s output distribution has drifted from the original SFT model) is added at each token. The PPO training uses about 31,000 prompts from the API.
Figure 4: The comparison ranking interface. Labelers see multiple model outputs for the same prompt and drag them into ranked order from best (Rank 1) to worst. These rankings produce the pairwise preference data used to train the reward model. For example, with 5 outputs shown here, a single ranking task yields 10 pairwise comparisons.
An additional variant called PPO-ptx mixes standard language modeling updates on the original pretraining data into the RL phase. This counteracts the performance regressions on standard NLP tasks that alignment training otherwise causes (a cost the authors call the “alignment tax”) while the model still learns to follow instructions.
The paper introduces two key equations. The first defines how the reward model learns from human preferences, and the second defines the objective that the final model optimizes.
Reward Model Loss:
\[\text{loss}(\theta) = -\frac{1}{\binom{K}{2}} E_{(x, y_w, y_l) \sim D} \left[ \log \left( \sigma \left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right) \right]\]
where:

- \(\theta\) – the reward model’s parameters
- \(x\) – a prompt; \(y_w\) is the preferred and \(y_l\) the dispreferred response in a comparison pair
- \(r_\theta(x, y)\) – the scalar score the reward model assigns to response \(y\) for prompt \(x\)
- \(\sigma\) – the sigmoid function, which maps the score difference to a probability
- \(D\) – the dataset of human comparisons
- \(K\) – the number of responses ranked per prompt, yielding \(\binom{K}{2}\) pairs
In plain language: for each pair of responses where a human preferred one over the other, this loss pushes the reward model to assign a higher score to the preferred response. The sigmoid of the score difference represents the probability that the model correctly predicts the human preference. Taking the log and negating it turns this into a cross-entropy loss – the same kind used in logistic regression. The \(\frac{1}{\binom{K}{2}}\) term averages over all pairs from the ranking.
This matters because it converts subjective human judgments (“I like response A better than B”) into a learnable numerical signal. The reward model becomes a proxy for human preference that can evaluate millions of responses automatically.
For a worked example: suppose \(K = 4\) responses are ranked, producing \(\binom{4}{2} = 6\) pairs. For one pair, if \(r_\theta(x, y_w) = 2.0\) and \(r_\theta(x, y_l) = 0.5\), then \(\sigma(2.0 - 0.5) = \sigma(1.5) \approx 0.82\), and \(\log(0.82) \approx -0.20\). The loss contribution is \(0.20\) – small because the model correctly ranked this pair. If the scores were reversed (\(r_\theta(x, y_w) = 0.5\), \(r_\theta(x, y_l) = 2.0\)), then \(\sigma(-1.5) \approx 0.18\), \(\log(0.18) \approx -1.71\), and the loss would be \(1.71\) – much larger, strongly penalizing the incorrect ranking.
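The worked example above can be checked directly; here is a minimal sketch of the per-pair loss (function names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_rm_loss(score_preferred, score_dispreferred):
    """Per-pair reward-model loss: -log(sigmoid(r_w - r_l))."""
    return -math.log(sigmoid(score_preferred - score_dispreferred))

# Correctly ranked pair: small loss.
print(round(pairwise_rm_loss(2.0, 0.5), 2))  # ~0.2
# Reversed scores: much larger loss.
print(round(pairwise_rm_loss(0.5, 2.0), 2))  # ~1.7
```

The asymmetry is the point: confidently correct rankings contribute almost nothing, while confidently wrong rankings dominate the gradient.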
PPO-ptx Objective:
\[\text{objective}(\phi) = E_{(x,y) \sim D_{\pi_\phi^{\text{RL}}}} \left[ r_\theta(x, y) - \beta \log \left( \pi_\phi^{\text{RL}}(y \mid x) / \pi^{\text{SFT}}(y \mid x) \right) \right] + \gamma E_{x \sim D_{\text{pretrain}}} \left[ \log(\pi_\phi^{\text{RL}}(x)) \right]\]
where:

- \(\phi\) – the parameters of the RL policy \(\pi_\phi^{\text{RL}}\) being trained
- \(\pi^{\text{SFT}}\) – the frozen supervised fine-tuned model from Step 1
- \(r_\theta(x, y)\) – the reward model’s score for response \(y\) to prompt \(x\)
- \(\beta\) – the coefficient on the per-token KL penalty
- \(\gamma\) – the coefficient on the pretraining (language modeling) term
- \(D_{\text{pretrain}}\) – the original pretraining data distribution
In plain language: this objective has two parts. The first part says “generate responses that the reward model likes (\(r_\theta\)), but don’t stray too far from the SFT model” (the \(\beta \log(\pi^{\text{RL}} / \pi^{\text{SFT}})\) term is a KL divergence penalty that grows when the RL model’s token probabilities diverge from the SFT model’s). The second part says “also keep being good at predicting text in general” (the \(\gamma\) term is just the standard language modeling loss on pretraining data). For plain PPO models (without pretraining mix), \(\gamma = 0\).
This matters because without the KL penalty, the RL model would learn to exploit weaknesses in the reward model – producing degenerate outputs that score high but are actually gibberish. Without the pretraining mix (\(\gamma\) term), the model loses performance on standard NLP tasks. The objective balances three competing goals: follow instructions well, stay close to the SFT model, and retain general language abilities.
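The three-way balance can be made concrete with a per-sample sketch of the objective (to be maximized, not minimized). The coefficient values here are illustrative placeholders, not the paper’s tuned hyperparameters:

```python
def ppo_ptx_objective(reward, logp_rl, logp_sft, logp_pretrain,
                      beta=0.2, gamma=0.0):
    """Per-sample PPO-ptx objective. beta and gamma are illustrative;
    gamma=0 recovers plain PPO (no pretraining mix)."""
    kl_penalty = beta * (logp_rl - logp_sft)  # log(pi_RL / pi_SFT)
    ptx_bonus = gamma * logp_pretrain         # language-modeling term
    return reward - kl_penalty + ptx_bonus

# The policy drifted (logp_rl > logp_sft), so the KL term
# subtracts from the reward-model score.
obj = ppo_ptx_objective(reward=1.2, logp_rl=-3.0, logp_sft=-3.5,
                        logp_pretrain=-2.0, beta=0.2, gamma=0.0)
print(round(obj, 2))  # 1.2 - 0.2*0.5 = 1.1
```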
Standard Language Modeling Loss (SFT stage):
\[L_{\text{SFT}} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, \ldots, x_{t-1}; \theta)\]
where:

- \(x_t\) – the token at position \(t\) in the sequence
- \(T\) – the sequence length
- \(\theta\) – the model’s parameters
- \(P(x_t \mid x_1, \ldots, x_{t-1}; \theta)\) – the model’s predicted probability of the next token given all preceding tokens
In plain language: this is the standard next-token prediction loss. For each position in the sequence, the model predicts a probability distribution over all possible next tokens, and the loss measures how surprised the model is by the actual next token (lower loss means the model assigned high probability to the correct token). During SFT, this is applied to the labeler-written demonstration data rather than internet text.
This matters because it establishes the starting point: a model that can mimic the style and content of human-written demonstrations. The SFT model is already significantly better than raw GPT-3 at following instructions, and serves as both the initialization for PPO training and the reference distribution for the KL penalty.
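The SFT loss reduces to summing the negative log-probabilities the model assigned to each actual token of a demonstration. A toy sketch (the probabilities are made up for illustration):

```python
import math

def sft_loss(token_logprobs):
    """Sequence-level SFT loss: negative sum of the log-probabilities
    the model assigned to each actual next token."""
    return -sum(token_logprobs)

# Toy example: the model's probabilities for the 3 tokens
# of a short labeler-written demonstration.
logprobs = [math.log(0.5), math.log(0.8), math.log(0.9)]
print(round(sft_loss(logprobs), 3))  # ≈ 1.022
```

Each token the model finds “surprising” (low assigned probability) adds more to the loss, so fine-tuning pushes probability mass toward the demonstrated responses.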
InstructGPT demonstrates dramatic improvements over GPT-3 across multiple evaluation axes.
Human preference evaluations. On held-out API prompts, the 175B InstructGPT (PPO-ptx) model was preferred over the 175B GPT-3 85% of the time (with a 95% confidence interval of plus or minus 3 percentage points). Even the 1.3B InstructGPT model was preferred over the 175B GPT-3, demonstrating that alignment training provides a larger effective capability boost than a 100x increase in model size. The model was also preferred over GPT-3 with a carefully crafted few-shot prompt 71% of the time.
| Model | Size | Win Rate vs. 175B SFT |
|---|---|---|
| GPT-3 | 175B | ~20% |
| GPT-3 (prompted) | 175B | ~30% |
| SFT | 175B | 50% (baseline) |
| PPO | 175B | ~65% |
| PPO-ptx (InstructGPT) | 175B | ~65% |
| PPO-ptx (InstructGPT) | 1.3B | ~55% |
Truthfulness. On the TruthfulQA benchmark, InstructGPT generates truthful and informative answers roughly twice as often as GPT-3. On closed-domain tasks (where the answer should only use information from the input), InstructGPT hallucinates about 21% of the time compared to 41% for GPT-3.
Toxicity. When instructed to produce safe and respectful output, InstructGPT generates about 25% fewer toxic outputs than GPT-3 as measured by the Perspective API on the RealToxicityPrompts dataset. However, without the respectful instruction, the improvement disappears. And when explicitly instructed to be toxic, InstructGPT is more toxic than GPT-3 – the model has learned to follow instructions well, including harmful ones.
Alignment tax. Raw PPO training causes regressions on some standard NLP benchmarks (SQuAD, DROP, HellaSwag, translation). The PPO-ptx variant, which mixes pretraining updates into the RL training, largely eliminates these regressions while preserving the instruction-following gains.
Comparison to instruction-tuned models. InstructGPT was preferred over GPT-3 fine-tuned on FLAN (78% win rate) and T0 (79% win rate), two models trained on curated NLP datasets with instructions. This suggests that real user prompts are more diverse than academic NLP benchmarks.
InstructGPT is arguably the most consequential AI alignment paper published to date, because it provided the recipe that transformed language models from text completion engines into conversational assistants. The RLHF pipeline described here – SFT, then reward model, then PPO – became the standard training procedure behind ChatGPT (released later in 2022), and variations of it were adopted by virtually every major AI lab building conversational models (Anthropic’s Claude, Google’s Gemini, Meta’s Llama-based models).
The paper also shifted the field’s understanding of what makes a capable model. Before InstructGPT, the dominant strategy for improving model performance was scaling – training larger models on more data. InstructGPT demonstrated that a relatively cheap post-training alignment step could produce a more useful model than a 100x increase in parameters. This realization catalyzed a wave of research into alignment and post-training methods, including Direct Preference Optimization (DPO), Constitutional AI (CAI), and Reinforcement Learning from AI Feedback (RLAIF).
Beyond the technical contribution, the paper was unusually transparent about the limitations of the alignment process itself – who the labelers are, what values are being encoded, and how the choice of training data shapes model behavior. This honesty about the “alignment target” problem (section 5.2 of the paper asks “who are we aligning to?”) helped frame a research agenda that the field continues to grapple with. The tension between helpfulness and harmlessness that InstructGPT identified – a model that follows instructions well will also follow harmful instructions well – remains one of the central challenges in AI safety.
To understand this paper, you need: