Improving Language Understanding by Generative Pre-Training

Authors: Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Year: 2018. Source: OpenAI (no arXiv ID; published as an OpenAI technical report)

One-Sentence Summary

A single neural network first learns the structure of language by reading a massive book corpus, then adapts to specific tasks like question answering or sentiment analysis with minimal changes – outperforming models that were custom-built for each task.

Problem Statement

Before this paper, getting a computer to understand language well enough to answer questions, detect sentiment, or judge whether two sentences mean the same thing required large amounts of hand-labeled data for each task. Labeling data is expensive: a human must read each example and assign the correct answer. For many tasks and many languages, labeled data simply does not exist in sufficient quantities.

The natural solution is to learn from unlabeled text, which exists in enormous quantities – books, web pages, articles. Researchers had already shown that learning word representations (embeddings) from unlabeled text and then using them as features in task-specific models could help. But word embeddings capture individual word meanings, not the higher-level structure of sentences and paragraphs. Models like ELMo went further by learning contextualized word representations, but these were used as fixed features fed into task-specific architectures that still needed to be designed separately for each task.

Two fundamental questions remained unresolved. First, what training objective best captures useful language knowledge? Options included language modeling (predicting the next word), machine translation, and discourse coherence, with each working well on some tasks but not others. Second, how should learned representations be transferred to downstream tasks? Existing methods required designing new architecture components for each task, adding auxiliary objectives, or using complex multi-stage training procedures. The field needed a simple, general approach: one model architecture, one pre-training method, and minimal task-specific modification.

Key Innovation

Think of a student preparing for a standardized test with sections on reading comprehension, grammar, and logical reasoning. One strategy is to study each section independently with section-specific practice materials. But a better strategy might be: first, spend months reading thousands of books across every genre – novels, textbooks, newspapers, scientific papers. This broad reading builds a deep understanding of how language works: grammar, reasoning patterns, world knowledge, and argumentation structure. Then, with that foundation in place, a few hours of targeted practice for each specific test section is enough to excel.

This is exactly the GPT approach. Stage one (pre-training) trains a model to predict the next word in a massive book corpus. This task forces the model to learn syntax, semantics, factual knowledge, and reasoning patterns – because predicting what comes next requires understanding all of these. Stage two (fine-tuning) takes this pre-trained model and adapts it to a specific task using labeled data, requiring only a simple output layer on top.

The technical innovation has three parts. First, the paper uses a Transformer decoder (the architecture from the “Attention Is All You Need” paper) instead of the recurrent networks (LSTMs) used by prior work like ELMo and ULMFiT. The Transformer’s self-attention mechanism can directly relate any two positions in a text regardless of distance, giving it much better handling of long-range dependencies – understanding that a pronoun in sentence 10 refers to a character introduced in sentence 1. Second, the paper introduces task-specific input transformations that convert structured inputs (question-answer pairs, premise-hypothesis pairs) into a single token sequence that the pre-trained model can process without architectural changes. Third, the paper adds the language modeling objective as an auxiliary loss during fine-tuning, which acts as a regularizer that improves generalization.

Architecture / Method

The GPT model is a 12-layer Transformer decoder. Each layer contains two sub-components: a masked multi-head self-attention mechanism (which lets each token attend to all previous tokens but not future ones) and a position-wise feed-forward network. Each sub-component is wrapped in a residual connection (a shortcut that adds the sub-component’s input directly to its output, making it easier for gradients to flow during training) followed by layer normalization (a technique that standardizes the values flowing through the network to stabilize training).
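The masking described above can be sketched in a few lines of numpy. This is a minimal single-head illustration with toy dimensions, not the paper's implementation: positions above the diagonal (the future) are blocked before the softmax, so each attention row covers only the current and earlier tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d). Each position attends only to itself and earlier positions."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (seq_len, seq_len)
    future = np.triu(np.ones_like(scores), k=1) == 1 # True above the diagonal
    scores = np.where(future, -1e9, scores)          # block attention to the future
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq, d = 5, 8                                        # toy sizes, not the paper's 512/768
X = rng.normal(size=(seq, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, w = masked_self_attention(X, Wq, Wk, Wv)
```

After masking, the attention weight matrix is lower-triangular: row *i* distributes its probability mass only over positions 0 through *i*.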

GPT Transformer architecture diagram

Figure 1: The GPT architecture. Input tokens are converted to embeddings and combined with position embeddings at the bottom. These pass through 12 identical Transformer blocks, each containing masked multi-head self-attention and a feed-forward network with residual connections and layer normalization. The top produces two outputs: a text prediction head (for the language modeling objective) and a task classifier head (for the downstream task). The “12x” label on the left indicates the block is repeated 12 times.

The model processes input as follows. Given a sequence of tokens, each token is first converted to a 768-dimensional vector using a learned embedding table. A separate learned position embedding is added to encode each token’s position in the sequence (position 1, position 2, etc.). These combined embeddings pass through all 12 Transformer layers. The output of the final layer is then used for prediction.

Pre-training uses a language modeling objective on the BooksCorpus dataset (over 7,000 unpublished books). The model reads text in a sliding window of 512 tokens and learns to predict each next token given the preceding ones. Concretely, for a token sequence like [“The”, “cat”, “sat”, “on”, “the”, “mat”], the model tries to predict “cat” from [“The”], then “sat” from [“The”, “cat”], then “on” from [“The”, “cat”, “sat”], and so on. Pre-training runs for 100 epochs with a batch size of 64 sequences, using the Adam optimizer with a learning rate that warms up linearly over the first 2,000 updates and then decays following a cosine schedule.
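The learning-rate schedule above can be sketched as a small function. The 2,000-step warmup matches the text and the 2.5e-4 peak rate is the value reported in the paper; `total_steps` here is purely illustrative.

```python
import math

def lr_at(step, peak_lr=2.5e-4, warmup=2000, total_steps=100_000):
    """Linear warmup over the first `warmup` updates, then cosine decay to 0."""
    if step < warmup:
        return peak_lr * step / warmup                      # linear ramp up
    progress = (step - warmup) / (total_steps - warmup)     # 0 -> 1 after warmup
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay
```

The rate climbs linearly to its peak at update 2,000, then follows a half-cosine down to zero by the end of training.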

Fine-tuning adapts the pre-trained model to a specific task. A single linear layer is added on top of the final Transformer output, mapping the hidden state to task-specific predictions (e.g., 3 classes for entailment: entailment, contradiction, neutral). The entire model – all 12 Transformer layers plus the new linear layer – is trained on the labeled task data for just 3 epochs. This requires very few additional parameters: only the linear output layer weights and special delimiter token embeddings.

The key to making fine-tuning work across different task types is a set of input transformations that restructure each task’s input into a single token sequence:

Task-specific input transformations

Figure 2: How GPT handles four different task types without changing the model architecture. Classification: the text is wrapped with Start and Extract tokens. Entailment: premise and hypothesis are concatenated with a Delim(iter) token between them. Similarity: because sentence order should not matter, both orderings are processed separately and their representations are added. Multiple Choice: each answer option is concatenated with the context and processed independently, then compared via softmax. In every case, the same pre-trained Transformer processes the token sequence, and only a final Linear layer differs per task.
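The four transformations in Figure 2 can be sketched as simple sequence-building functions. The special-token strings below are placeholders (the actual delimiter embeddings are learned during fine-tuning), and the function names are illustrative:

```python
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"   # placeholder special tokens

def classification(text):
    return [START, *text, EXTRACT]

def entailment(premise, hypothesis):
    return [START, *premise, DELIM, *hypothesis, EXTRACT]

def similarity(a, b):
    # Sentence order should not matter, so both orderings are built;
    # their final hidden states are added before the linear layer.
    return [entailment(a, b), entailment(b, a)]

def multiple_choice(context, options):
    # One sequence per answer option; the per-option scores are
    # compared via a softmax to pick the answer.
    return [[START, *context, DELIM, *opt, EXTRACT] for opt in options]
```

In every case the output is a flat token sequence (or a small set of them), which is exactly what the pre-trained Transformer already knows how to process.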

Model specifications: 768-dimensional hidden states, 12 attention heads, 3072-dimensional feed-forward inner states, BPE (Byte Pair Encoding, a tokenization method that builds a vocabulary by iteratively merging the most frequent character pairs) vocabulary with 40,000 merges, GELU (Gaussian Error Linear Unit, a smooth alternative to the ReLU activation function) activation function, and dropout of 0.1 on residual, embedding, and attention connections.
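One round of the BPE merge process mentioned above can be illustrated in pure Python: count every adjacent symbol pair in the corpus, then merge the most frequent pair into a single symbol. The real vocabulary repeats this 40,000 times; the tiny word-frequency table here is made up.

```python
from collections import Counter

def most_frequent_pair(words):
    """words: {tuple_of_symbols: frequency}. Return the most common adjacent pair."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1]); i += 2
            else:
                out.append(symbols[i]); i += 1
        merged[tuple(out)] = freq
    return merged

# toy corpus: each word as a tuple of symbols, with its frequency
words = {("l", "o", "w"): 5, ("l", "o", "t"): 3, ("n", "e", "w"): 6}
pair = most_frequent_pair(words)     # ("l", "o") occurs 5 + 3 = 8 times
words = merge_pair(words, pair)      # "l","o" becomes the single symbol "lo"
```

Each merge adds one entry to the vocabulary, so 40,000 merges yield a vocabulary that covers common words as single tokens while still being able to spell out rare words from smaller pieces.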

Mathematical Foundations

The Pre-Training Objective (Equation 1)

\[L_1(\mathcal{U}) = \sum_{i} \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)\]

What it means: For each token in the corpus, the model looks at the previous \(k\) tokens and predicts a probability distribution over all possible next tokens. This equation sums the log-probabilities of the correct next tokens. Maximizing this sum means the model assigns high probability to the actual text – it gets better at predicting what comes next. For example, given the context “The cat sat on the”, a well-trained model should assign high probability to “mat” and low probability to “elephant”.
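Numerically, Equation 1 is just a sum of log-probabilities. In this toy sketch the per-step probabilities are made up; a real model produces them with a softmax over the whole vocabulary at each position:

```python
import math

# model's (made-up) predicted probability for the true next token at each step
# of "The cat sat on the mat"
p_next = {"cat": 0.40, "sat": 0.25, "on": 0.60, "the": 0.70, "mat": 0.50}

L1 = sum(math.log(p) for p in p_next.values())
# L1 is negative because every log-probability is; maximizing it pushes
# each predicted probability toward 1.
```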

Why it matters: This single objective drives the model to learn grammar (word order patterns), semantics (word meanings in context), factual knowledge (common co-occurrences), and reasoning (logical continuations). The model must implicitly understand all of these to predict the next word well. This learned knowledge then transfers to downstream tasks during fine-tuning.

The Transformer Computation (Equation 2)

\[h_0 = U W_e + W_p\]

\[h_l = \text{transformer\_block}(h_{l-1}) \quad \forall \, l \in [1, n]\]

\[P(u) = \text{softmax}(h_n W_e^T)\]

What it means: The computation proceeds in three stages. First, each token is looked up in the embedding matrix and added to its position embedding, producing \(h_0\) – a matrix where each row is a 768-dimensional vector representing one token in context. Second, this representation passes through \(n\) Transformer blocks, each applying masked self-attention (where each token can attend only to itself and earlier tokens) and a feed-forward network. Third, the final hidden state \(h_n\) is multiplied by the transpose of the embedding matrix to produce scores for every word in the vocabulary, and softmax converts these scores to probabilities.

For a concrete example: if our vocabulary has 40,000 tokens and our context window is 512 tokens, then \(W_e\) is a \(40000 \times 768\) matrix, \(W_p\) is a \(512 \times 768\) matrix, and after all 12 layers, \(h_{12}\) is a \(512 \times 768\) matrix. Multiplying \(h_{12}\) by \(W_e^T\) (a \(768 \times 40000\) matrix) gives a \(512 \times 40000\) matrix of vocabulary scores for each position.
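These shapes can be verified directly, using zero-filled matrices as stand-ins for the learned parameters (only the shapes matter here, not the values):

```python
import numpy as np

vocab, ctx, d = 40_000, 512, 768
W_e = np.zeros((vocab, d), dtype=np.float32)  # token embeddings (tied with output)
W_p = np.zeros((ctx, d), dtype=np.float32)    # learned position embeddings
tokens = np.zeros(ctx, dtype=int)             # one 512-token input window

h0 = W_e[tokens] + W_p       # (512, 768): token embedding + position embedding
# ... the 12 transformer blocks each map (512, 768) -> (512, 768) ...
logits = h0 @ W_e.T          # (512, 40000): a score for every vocab entry
```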

Why it matters: Notice that the token embedding matrix \(W_e\) is used twice – once at the input to convert tokens to vectors, and once at the output (transposed) to convert vectors back to token probabilities. This weight tying reduces the total parameter count and forces the model to learn embeddings that work well for both understanding and prediction.

The Fine-Tuning Objective (Equation 4)

\[L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x_1, \ldots, x_m)\]

What it means: For each labeled example, pass the input tokens through the pre-trained Transformer, take the final layer’s representation at the last position, project it to the number of classes via a learned linear layer, apply softmax, and maximize the log-probability of the correct label. This is standard supervised classification, but the Transformer’s weights start from pre-trained values rather than random initialization.

Why it matters: Because the Transformer has already learned rich language representations during pre-training, the linear layer only needs to learn a simple mapping from these representations to task labels. This is why fine-tuning works with very little labeled data and converges in just 3 epochs.
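The fine-tuning head is small enough to sketch in full. This toy numpy version uses random values in place of a real pre-trained hidden state; the only genuinely new task parameters are the weights of the linear layer `W_y`:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, n_classes = 768, 3                    # entailment / contradiction / neutral
h_last = rng.normal(size=d)              # stand-in for the final-layer state
                                         # at the last token position
W_y = rng.normal(size=(d, n_classes)) * 0.01  # the new linear layer

probs = softmax(h_last @ W_y)            # P(y | x) over the 3 classes
loss = -np.log(probs[0])                 # -log P of the (toy) correct label
```

Gradients from this loss flow back through all 12 pre-trained layers, so the whole model adapts, but the head itself adds only `768 × 3` weights.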

The Combined Objective (Equation 5)

\[L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})\]

What it means: During fine-tuning, the model optimizes two objectives simultaneously. It tries to classify correctly (the \(L_2\) term), and it also tries to continue predicting the next word in the task input text (the \(L_1\) term). With \(\lambda = 0.5\), the language modeling loss contributes half as much as the classification loss.

Why it matters: The auxiliary language modeling objective acts as a regularizer – it prevents the pre-trained weights from drifting too far from their original values during fine-tuning. The paper’s ablation study shows this is particularly beneficial for larger datasets. Without the auxiliary objective, the model more easily overfits to the specific task and loses some of its general language understanding. This idea of using the pre-training objective as a regularizer during fine-tuning became a common technique in later transfer learning work.
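Written as losses to minimize (negative log-likelihoods, the usual implementation of the maximization in Equation 5), the combination is one line of arithmetic. Both loss values below are made up:

```python
lam = 0.5                     # the paper's lambda
classification_loss = 0.40    # made-up -log P(label | input)
lm_loss = 3.10                # made-up next-token loss on the same input
combined = classification_loss + lam * lm_loss   # = 1.95
```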

Results

GPT achieved state-of-the-art results on 9 out of 12 evaluated tasks, often outperforming models that were custom-designed for each task and even ensemble models (multiple models combined for better predictions).

Natural Language Inference (determining if a hypothesis follows from a premise):

| Method | MNLI-m | MNLI-mm | SNLI | SciTail | QNLI | RTE |
|---|---|---|---|---|---|---|
| ESIM + ELMo (5x ensemble) | – | – | 89.3 | – | – | – |
| CAFE (5x ensemble) | 80.2 | 79.0 | 89.3 | – | – | – |
| Stochastic Answer Net (3x ensemble) | 80.6 | 80.1 | 83.3 | – | – | – |
| GPT (single model) | 82.1 | 81.4 | 89.9 | 88.3 | 88.1 | 56.0 |

GPT beat ensemble models (combinations of 3-5 models) using a single model on MNLI and SNLI. The 5% absolute gain on SciTail is notable because it requires understanding scientific reasoning. The only task where GPT underperformed was RTE, the smallest dataset (2,490 examples), where a multi-task BiLSTM achieved 61.7% vs. GPT’s 56.0%.

Question Answering and Commonsense Reasoning:

| Method | Story Cloze | RACE-m | RACE-h | RACE |
|---|---|---|---|---|
| Previous best | 77.6 | – | – | 53.3 |
| GPT | 86.5 | 62.9 | 57.4 | 59.0 |

The 8.9% improvement on Story Cloze (selecting the correct ending for a story) and 5.7% on RACE (reading comprehension from exams) demonstrate the model’s ability to handle long-range context and reasoning – exactly the capabilities that pre-training on books should build.

Ablation studies revealed three critical findings. Removing the auxiliary language modeling objective during fine-tuning dropped average performance modestly (by under a point), with the auxiliary objective helping most on the larger datasets. Replacing the Transformer with an LSTM (keeping everything else identical) dropped average performance by 5.6 points, confirming that the Transformer architecture is essential for effective transfer. And removing pre-training entirely caused a 14.8% average drop, showing that pre-training is the dominant factor.

Layer transfer analysis

Figure 3: Effect of transferring different numbers of Transformer layers on RACE (question answering) and MultiNLI (natural language inference). Zero layers means only the embeddings are transferred. Each additional layer improves performance, with gains continuing all the way to the full 12 layers. This demonstrates that every layer of the pre-trained model captures useful knowledge for downstream tasks – earlier layers likely encode syntax, while deeper layers capture more abstract semantic and reasoning patterns.

Zero-shot performance during pre-training

Figure 4: Zero-shot task performance (no fine-tuning) as a function of pre-training updates, comparing the Transformer (solid lines) against an LSTM (dashed lines). The Transformer steadily improves on all tasks as pre-training progresses, while the LSTM shows higher variance and lower overall performance. This suggests the Transformer learns task-relevant capabilities as a side effect of language modeling, and its structured self-attention memory transfers better than the LSTM’s sequential processing.

Limitations

Impact and Legacy

GPT established the “pre-train then fine-tune” paradigm that dominated NLP from 2018 onward and ultimately reshaped the entire field of artificial intelligence. The core insight – that a language model pre-trained on a large text corpus learns transferable representations useful for virtually any language task – proved to be one of the most consequential ideas in modern machine learning.

The direct lineage from GPT is striking. GPT-2 (2019) scaled up the model and training data, showing emergent zero-shot capabilities that the original paper only hinted at. GPT-3 (2020) scaled further to 175 billion parameters and demonstrated that sufficiently large language models could perform tasks via in-context learning (providing examples in the prompt) without any fine-tuning at all. GPT-4 (2023) continued the scaling trajectory and, alongside GPT-3.5, powered ChatGPT, one of the most widely used AI applications in history. Each successor validated the original paper’s bet that the Transformer decoder architecture, combined with generative pre-training, would scale.

Beyond the GPT family, this paper directly inspired BERT (published just months later), which took the pre-train/fine-tune approach but used bidirectional context and a different pre-training objective (masked language modeling). The competition between GPT-style autoregressive models (left-to-right generation) and BERT-style bidirectional models drove rapid progress from 2018 to 2020. Today, the GPT-style decoder-only architecture has largely won out for general-purpose AI, while BERT-style models remain dominant for specialized classification and retrieval tasks. The paper also influenced work on parameter-efficient fine-tuning (adapting pre-trained models without updating all weights) and prompt engineering (designing inputs that elicit desired behavior from pre-trained models).

Prerequisites

To fully understand this paper, you should be comfortable with:

Connections