Improving Language Understanding by Generative Pre-Training

Authors: Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Year: 2018. Source: OpenAI (no arXiv ID; published as an OpenAI technical report)

One-Sentence Summary

A single neural network first learns the structure of language by reading a massive book corpus, then adapts to specific tasks like question answering or sentiment analysis with minimal changes – outperforming models that were custom-built for each task.

Problem Statement

Before this paper, getting a computer to understand language well enough to answer questions, detect sentiment, or judge whether two sentences mean the same thing required large amounts of hand-labeled data for each task. Labeling data is expensive: a human must read each example and assign the correct answer. For many tasks and many languages, labeled data simply does not exist in sufficient quantities.

The natural solution is to learn from unlabeled text, which exists in enormous quantities – books, web pages, articles. Researchers had already shown that learning word representations (embeddings) from unlabeled text and then using them as features in task-specific models could help. But word embeddings capture individual word meanings, not the higher-level structure of sentences and paragraphs. Models like ELMo went further by learning contextualized word representations, but these were used as fixed features fed into task-specific architectures that still needed to be designed separately for each task.

Two fundamental questions remained unresolved. First, what training objective best captures useful language knowledge? Options included language modeling (predicting the next word), machine translation, and discourse coherence, with each working well on some tasks but not others. Second, how should learned representations be transferred to downstream tasks? Existing methods required designing new architecture components for each task, adding auxiliary objectives, or using complex multi-stage training procedures. The field needed a simple, general approach: one model architecture, one pre-training method, and minimal task-specific modification.

Key Innovation

Think of a student preparing for a standardized test with sections on reading comprehension, grammar, and logical reasoning. One strategy is to study each section independently with section-specific practice materials. But a better strategy might be: first, spend months reading thousands of books across every genre – novels, textbooks, newspapers, scientific papers. This broad reading builds a deep understanding of how language works: grammar, reasoning patterns, world knowledge, and argumentation structure. Then, with that foundation in place, a few hours of targeted practice for each specific test section is enough to excel.

This is exactly the GPT approach. Stage one (pre-training) trains a model to predict the next word in a massive book corpus. This task forces the model to learn syntax, semantics, factual knowledge, and reasoning patterns – because predicting what comes next requires understanding all of these. Stage two (fine-tuning) takes this pre-trained model and adapts it to a specific task using labeled data, requiring only a simple output layer on top.

The technical innovation has three parts. First, the paper uses a Transformer decoder (the architecture from the “Attention Is All You Need” paper) instead of the recurrent networks (LSTMs) used by prior work like ELMo and ULMFiT. The Transformer’s self-attention mechanism can directly relate any two positions in a text regardless of distance, giving it much better handling of long-range dependencies – understanding that a pronoun in sentence 10 refers to a character introduced in sentence 1. Second, the paper introduces task-specific input transformations that convert structured inputs (question-answer pairs, premise-hypothesis pairs) into a single token sequence that the pre-trained model can process without architectural changes. Third, the paper adds the language modeling objective as an auxiliary loss during fine-tuning, which acts as a regularizer that improves generalization.

Architecture / Method

The GPT model is a 12-layer Transformer decoder. Each layer contains two sub-components: a masked multi-head self-attention mechanism (which lets each token attend to all previous tokens but not future ones) and a position-wise feed-forward network. Each sub-component is wrapped in a residual connection (a shortcut that adds the sub-component’s input directly to its output, making it easier for gradients to flow during training) followed by layer normalization (a technique that standardizes the values flowing through the network to stabilize training).
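The masking described above can be sketched in a few lines of numpy. This is a minimal single-head illustration with toy dimensions, not the paper's implementation: positions above the diagonal (the future) are blocked before the softmax, so each attention row covers only the current and earlier tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d). Each position attends only to itself and earlier positions."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (seq_len, seq_len)
    future = np.triu(np.ones_like(scores), k=1) == 1 # True above the diagonal
    scores = np.where(future, -1e9, scores)          # block attention to the future
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq, d = 5, 8                                        # toy sizes, not the paper's 512/768
X = rng.normal(size=(seq, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, w = masked_self_attention(X, Wq, Wk, Wv)
```

After masking, the attention weight matrix is lower-triangular: row *i* distributes its probability mass only over positions 0 through *i*.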

GPT Transformer architecture diagram

Figure 1: The GPT architecture. Input tokens are converted to embeddings and combined with position embeddings at the bottom. These pass through 12 identical Transformer blocks, each containing masked multi-head self-attention and a feed-forward network with residual connections and layer normalization. The top produces two outputs: a text prediction head (for the language modeling objective) and a task classifier head (for the downstream task). The “12x” label on the left indicates the block is repeated 12 times.

The model processes input as follows. Given a sequence of tokens, each token is first converted to a 768-dimensional vector using a learned embedding table. A separate learned position embedding is added to encode each token’s position in the sequence (position 1, position 2, etc.). These combined embeddings pass through all 12 Transformer layers. The output of the final layer is then used for prediction.

Pre-training uses a language modeling objective on the BooksCorpus dataset (over 7,000 unpublished books). The model reads text in a sliding window of 512 tokens and learns to predict each next token given the preceding ones. Concretely, for a token sequence like [“The”, “cat”, “sat”, “on”, “the”, “mat”], the model tries to predict “cat” from [“The”], then “sat” from [“The”, “cat”], then “on” from [“The”, “cat”, “sat”], and so on. Pre-training runs for 100 epochs with a batch size of 64 sequences, using the Adam optimizer with a learning rate that warms up linearly over the first 2,000 updates and then decays following a cosine schedule.
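The learning-rate schedule above can be sketched as a small function. The 2,000-step warmup matches the text and the 2.5e-4 peak rate is the value reported in the paper; `total_steps` here is purely illustrative.

```python
import math

def lr_at(step, peak_lr=2.5e-4, warmup=2000, total_steps=100_000):
    """Linear warmup over the first `warmup` updates, then cosine decay to 0."""
    if step < warmup:
        return peak_lr * step / warmup                      # linear ramp up
    progress = (step - warmup) / (total_steps - warmup)     # 0 -> 1 after warmup
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay
```

The rate climbs linearly to its peak at update 2,000, then follows a half-cosine down to zero by the end of training.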

Fine-tuning adapts the pre-trained model to a specific task. A single linear layer is added on top of the final Transformer output, mapping the hidden state to task-specific predictions (e.g., 3 classes for entailment: entailment, contradiction, neutral). The entire model – all 12 Transformer layers plus the new linear layer – is trained on the labeled task data for just 3 epochs. This requires very few additional parameters: only the linear output layer weights and special delimiter token embeddings.

The key to making fine-tuning work across different task types is a set of input transformations that restructure each task’s input into a single token sequence:

Task-specific input transformations

Figure 2: How GPT handles four different task types without changing the model architecture. Classification: the text is wrapped with Start and Extract tokens. Entailment: premise and hypothesis are concatenated with a Delim(iter) token between them. Similarity: because sentence order should not matter, both orderings are processed separately and their representations are added. Multiple Choice: each answer option is concatenated with the context and processed independently, then compared via softmax. In every case, the same pre-trained Transformer processes the token sequence, and only a final Linear layer differs per task.
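The four transformations in Figure 2 can be sketched as simple sequence-building functions. The special-token strings below are placeholders (the actual delimiter embeddings are learned during fine-tuning), and the function names are illustrative:

```python
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"   # placeholder special tokens

def classification(text):
    return [START, *text, EXTRACT]

def entailment(premise, hypothesis):
    return [START, *premise, DELIM, *hypothesis, EXTRACT]

def similarity(a, b):
    # Sentence order should not matter, so both orderings are built;
    # their final hidden states are added before the linear layer.
    return [entailment(a, b), entailment(b, a)]

def multiple_choice(context, options):
    # One sequence per answer option; the per-option scores are
    # compared via a softmax to pick the answer.
    return [[START, *context, DELIM, *opt, EXTRACT] for opt in options]
```

In every case the output is a flat token sequence (or a small set of them), which is exactly what the pre-trained Transformer already knows how to process.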

Model specifications: 768-dimensional hidden states, 12 attention heads, 3072-dimensional feed-forward inner states, BPE (Byte Pair Encoding, a tokenization method that builds a vocabulary by iteratively merging the most frequent character pairs) vocabulary with 40,000 merges, GELU (Gaussian Error Linear Unit, a smooth alternative to the ReLU activation function) activation function, and dropout of 0.1 on residual, embedding, and attention connections.
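One round of the BPE merge process mentioned above can be illustrated in pure Python: count every adjacent symbol pair in the corpus, then merge the most frequent pair into a single symbol. The real vocabulary repeats this 40,000 times; the tiny word-frequency table here is made up.

```python
from collections import Counter

def most_frequent_pair(words):
    """words: {tuple_of_symbols: frequency}. Return the most common adjacent pair."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1]); i += 2
            else:
                out.append(symbols[i]); i += 1
        merged[tuple(out)] = freq
    return merged

# toy corpus: each word as a tuple of symbols, with its frequency
words = {("l", "o", "w"): 5, ("l", "o", "t"): 3, ("n", "e", "w"): 6}
pair = most_frequent_pair(words)     # ("l", "o") occurs 5 + 3 = 8 times
words = merge_pair(words, pair)      # "l","o" becomes the single symbol "lo"
```

Each merge adds one entry to the vocabulary, so 40,000 merges yield a vocabulary that covers common words as single tokens while still being able to spell out rare words from smaller pieces.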

Mathematical Foundations

The Pre-Training Objective (Equation 1)

\[L_1(\mathcal{U}) = \sum_{i} \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)\]

What it means: For each token in the corpus, the model looks at the previous \(k\) tokens and predicts a probability distribution over all possible next tokens. This equation sums the log-probabilities of the correct next tokens. Maximizing this sum means the model assigns high probability to the actual text – it gets better at predicting what comes next. For example, given the context “The cat sat on the”, a well-trained model should assign high probability to “mat” and low probability to “elephant”.
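Numerically, Equation 1 is just a sum of log-probabilities. In this toy sketch the per-step probabilities are made up; a real model produces them with a softmax over the whole vocabulary at each position:

```python
import math

# model's (made-up) predicted probability for the true next token at each step
# of "The cat sat on the mat"
p_next = {"cat": 0.40, "sat": 0.25, "on": 0.60, "the": 0.70, "mat": 0.50}

L1 = sum(math.log(p) for p in p_next.values())
# L1 is negative because every log-probability is; maximizing it pushes
# each predicted probability toward 1.
```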

Why it matters: This single objective drives the model to learn grammar (word order patterns), semantics (word meanings in context), factual knowledge (common co-occurrences), and reasoning (logical continuations). The model must implicitly understand all of these to predict the next word well. This learned knowledge then transfers to downstream tasks during fine-tuning.

The Transformer Computation (Equation 2)

\[h_0 = U W_e + W_p\]

\[h_l = \text{transformer\_block}(h_{l-1}) \quad \forall \, l \in [1, n]\]

\[P(u) = \text{softmax}(h_n W_e^T)\]

What it means: The computation proceeds in three stages. First, each token is looked up in the embedding matrix and added to its position embedding, producing \(h_0\) – a matrix where each row is a 768-dimensional vector representing one token in context. Second, this representation passes through \(n\) Transformer blocks, each applying masked self-attention (where each token can attend only to itself and earlier tokens) and a feed-forward network. Third, the final hidden state \(h_n\) is multiplied by the transpose of the embedding matrix to produce scores for every word in the vocabulary, and softmax converts these scores to probabilities.

For a concrete example: if our vocabulary has 40,000 tokens and our context window is 512 tokens, then \(W_e\) is a \(40000 \times 768\) matrix, \(W_p\) is a \(512 \times 768\) matrix, and after all 12 layers, \(h_{12}\) is a \(512 \times 768\) matrix. Multiplying \(h_{12}\) by \(W_e^T\) (a \(768 \times 40000\) matrix) gives a \(512 \times 40000\) matrix of vocabulary scores for each position.
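These shapes can be verified directly, using zero-filled matrices as stand-ins for the learned parameters (only the shapes matter here, not the values):

```python
import numpy as np

vocab, ctx, d = 40_000, 512, 768
W_e = np.zeros((vocab, d), dtype=np.float32)  # token embeddings (tied with output)
W_p = np.zeros((ctx, d), dtype=np.float32)    # learned position embeddings
tokens = np.zeros(ctx, dtype=int)             # one 512-token input window

h0 = W_e[tokens] + W_p       # (512, 768): token embedding + position embedding
# ... the 12 transformer blocks each map (512, 768) -> (512, 768) ...
logits = h0 @ W_e.T          # (512, 40000): a score for every vocab entry
```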

Why it matters: Notice that the token embedding matrix \(W_e\) is used twice – once at the input to convert tokens to vectors, and once at the output (transposed) to convert vectors back to token probabilities. This weight tying reduces the total parameter count and forces the model to learn embeddings that work well for both understanding and prediction.

The Fine-Tuning Objective (Equation 4)

\[L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x_1, \ldots, x_m)\]

What it means: For each labeled example, pass the input tokens through the pre-trained Transformer, take the final layer’s representation at the last position, project it to the number of classes via a learned linear layer, apply softmax, and maximize the log-probability of the correct label. This is standard supervised classification, but the Transformer’s weights start from pre-trained values rather than random initialization.

Why it matters: Because the Transformer has already learned rich language representations during pre-training, the linear layer only needs to learn a simple mapping from these representations to task labels. This is why fine-tuning works with very little labeled data and converges in just 3 epochs.
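The fine-tuning head is small enough to sketch in full. This toy numpy version uses random values in place of a real pre-trained hidden state; the only genuinely new task parameters are the weights of the linear layer `W_y`:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, n_classes = 768, 3                    # entailment / contradiction / neutral
h_last = rng.normal(size=d)              # stand-in for the final-layer state
                                         # at the last token position
W_y = rng.normal(size=(d, n_classes)) * 0.01  # the new linear layer

probs = softmax(h_last @ W_y)            # P(y | x) over the 3 classes
loss = -np.log(probs[0])                 # -log P of the (toy) correct label
```

Gradients from this loss flow back through all 12 pre-trained layers, so the whole model adapts, but the head itself adds only `768 × 3` weights.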

The Combined Objective (Equation 5)

\[L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})\]

What it means: During fine-tuning, the model optimizes two objectives simultaneously. It tries to classify correctly (the \(L_2\) term), and it also tries to continue predicting the next word in the task input text (the \(L_1\) term). With \(\lambda = 0.5\), the language modeling loss contributes half as much as the classification loss.

Why it matters: The auxiliary language modeling objective acts as a regularizer – it prevents the pre-trained weights from drifting too far from their original values during fine-tuning. The paper’s ablation study shows this is particularly beneficial for larger datasets. Without the auxiliary objective, the model more easily overfits to the specific task and loses some of its general language understanding. This idea of using the pre-training objective as a regularizer during fine-tuning became a common technique in later transfer learning work.
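Written as losses to minimize (negative log-likelihoods, the usual implementation of the maximization in Equation 5), the combination is one line of arithmetic. Both loss values below are made up:

```python
lam = 0.5                     # the paper's lambda
classification_loss = 0.40    # made-up -log P(label | input)
lm_loss = 3.10                # made-up next-token loss on the same input
combined = classification_loss + lam * lm_loss   # = 1.95
```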

Results

GPT achieved state-of-the-art results on 9 out of 12 evaluated tasks, often outperforming models that were custom-designed for each task and even ensemble models (multiple models combined for better predictions).

Natural Language Inference (determining if a hypothesis follows from a premise):

| Method | MNLI-m | MNLI-mm | SNLI | SciTail | QNLI | RTE |
|---|---|---|---|---|---|---|
| ESIM + ELMo (5x ensemble) | – | – | 89.3 | – | – | – |
| CAFE (5x ensemble) | 80.2 | 79.0 | 89.3 | – | – | – |
| Stochastic Answer Net (3x ensemble) | 80.6 | 80.1 | 83.3 | – | – | – |
| GPT (single model) | 82.1 | 81.4 | 89.9 | 88.3 | 88.1 | 56.0 |

GPT beat ensemble models (combinations of 3-5 models) using a single model on MNLI and SNLI. The 5% absolute gain on SciTail is notable because it requires understanding scientific reasoning. The only task where GPT underperformed was RTE, the smallest dataset (2,490 examples), where a multi-task BiLSTM achieved 61.7% vs. GPT’s 56.0%.

Question Answering and Commonsense Reasoning:

| Method | Story Cloze | RACE-m | RACE-h | RACE |
|---|---|---|---|---|
| Previous best | 77.6 | – | – | 53.3 |
| GPT | 86.5 | 62.9 | 57.4 | 59.0 |

The 8.9% improvement on Story Cloze (selecting the correct ending for a story) and 5.7% on RACE (reading comprehension from exams) demonstrate the model’s ability to handle long-range context and reasoning – exactly the capabilities that pre-training on books should build.

Ablation studies revealed three critical findings. Removing the auxiliary language modeling objective during fine-tuning dropped average performance modestly (by under a point), with the auxiliary objective helping most on the larger datasets. Replacing the Transformer with an LSTM (keeping everything else identical) dropped average performance by 5.6 points, confirming that the Transformer architecture is essential for effective transfer. And removing pre-training entirely caused a 14.8% average drop, showing that pre-training is the dominant factor.

Layer transfer analysis

Figure 3: Effect of transferring different numbers of Transformer layers on RACE (question answering) and MultiNLI (natural language inference). Zero layers means only the embeddings are transferred. Each additional layer improves performance, with gains continuing all the way to the full 12 layers. This demonstrates that every layer of the pre-trained model captures useful knowledge for downstream tasks – earlier layers likely encode syntax, while deeper layers capture more abstract semantic and reasoning patterns.

Zero-shot performance during pre-training

Figure 4: Zero-shot task performance (no fine-tuning) as a function of pre-training updates, comparing the Transformer (solid lines) against an LSTM (dashed lines). The Transformer steadily improves on all tasks as pre-training progresses, while the LSTM shows higher variance and lower overall performance. This suggests the Transformer learns task-relevant capabilities as a side effect of language modeling, and its structured self-attention memory transfers better than the LSTM’s sequential processing.

Limitations

Impact and Legacy

GPT established the “pre-train then fine-tune” paradigm that dominated NLP from 2018 onward and ultimately reshaped the entire field of artificial intelligence. The core insight – that a language model pre-trained on a large text corpus learns transferable representations useful for virtually any language task – proved to be one of the most consequential ideas in modern machine learning.

The direct lineage from GPT is striking. GPT-2 (2019) scaled up the model and training data, showing emergent zero-shot capabilities that the original paper only hinted at. GPT-3 (2020) scaled further to 175 billion parameters and demonstrated that sufficiently large language models could perform tasks via in-context learning (providing examples in the prompt) without any fine-tuning at all. GPT-4 (2023) continued the scaling trajectory and, alongside GPT-3.5, powered ChatGPT, one of the most widely used AI applications in history. Each successor validated the original paper’s bet that the Transformer decoder architecture, combined with generative pre-training, would scale.

Beyond the GPT family, this paper directly inspired BERT (published just months later), which took the pre-train/fine-tune approach but used bidirectional context and a different pre-training objective (masked language modeling). The competition between GPT-style autoregressive models (left-to-right generation) and BERT-style bidirectional models drove rapid progress from 2018 to 2020. Today, the GPT-style decoder-only architecture has largely won out for general-purpose AI, while BERT-style models remain dominant for specialized classification and retrieval tasks. The paper also influenced work on parameter-efficient fine-tuning (adapting pre-trained models without updating all weights) and prompt engineering (designing inputs that elicit desired behavior from pre-trained models).

Prerequisites

To fully understand this paper, you should be comfortable with:

Connections