BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee et al. Year: 2018 Source: arXiv 1810.04805

One-Sentence Summary

A language model learns rich representations of text by reading sentences in both directions simultaneously – using a fill-in-the-blank training game – and then adapts to a wide range of language tasks by adding just a single output layer on top.

Problem Statement

Before BERT, the dominant strategy for building language understanding systems was to pre-train a model on large amounts of text and then adapt it to specific tasks. The problem was that existing pre-training methods were fundamentally limited in how they read text.

Standard language models (the kind used by ELMo and GPT) read text in one direction: either left-to-right or right-to-left. A left-to-right model processing the sentence “The bank by the river was muddy” can only use “The bank by the” when predicting “river” – it cannot look ahead to see “was muddy,” which would strongly suggest we are talking about a riverbank rather than a financial institution. OpenAI GPT, the strongest system at the time, used exactly this left-to-right approach, built on the Transformer architecture (Vaswani et al., 2017). ELMo tried a workaround: it trained two separate models (one left-to-right, one right-to-left) and concatenated their outputs. But this concatenation is shallow – the left-to-right model never sees the right-to-left context during its own processing, so it cannot learn deep interactions between both directions.

This unidirectional constraint was particularly damaging for tasks that require understanding relationships between parts of a sentence. Question answering, for example, needs the model to understand how a question relates to a passage – which requires looking at context from both sides of every word. The field needed a way to pre-train models that could genuinely look in both directions at every layer of processing.

Key Innovation

Think of learning a language by doing fill-in-the-blank exercises. A teacher gives you a sentence like “The cat ___ on the mat” and you have to guess the missing word. To make a good guess, you naturally look at the words on both sides of the blank – “The cat” on the left and “on the mat” on the right. This is fundamentally different from trying to predict the next word in a sequence (like left-to-right language models do), because you get to use context from both directions.

BERT applies exactly this idea. During pre-training, BERT randomly selects 15% of the tokens in each sequence for prediction – most of them are replaced with a special [MASK] token – and asks the model to recover the original words. This is called the Masked Language Model (MLM) objective. Because the masked words can appear anywhere in the sentence, the model must build representations that incorporate information from both left and right context at every layer of the Transformer – achieving true bidirectionality.

The key technical insight is that you cannot simply use a standard bidirectional model for language modeling, because each word would indirectly “see itself” through the network layers, making the prediction trivial. By masking out specific words, BERT prevents this information leak while still allowing fully bidirectional attention across all other positions. This is a simple idea, inspired by a 1953 psychological test called the Cloze task, but it unlocked a massive performance jump across virtually all NLP benchmarks.

BERT also introduces a second pre-training task: Next Sentence Prediction (NSP). The model receives two text segments and learns to predict whether the second segment actually follows the first in the original document, or is a randomly chosen sentence. This teaches the model about relationships between sentences, which is useful for tasks like question answering and natural language inference.

Architecture / Method

BERT uses the Transformer encoder architecture, the same building block described in “Attention Is All You Need” (Vaswani et al., 2017). If you think of the original Transformer as having two halves – an encoder that reads input and a decoder that generates output – BERT uses only the encoder half, stacked into a deep tower of layers. Each layer applies multi-head self-attention (where every token attends to every other token, in both directions) followed by a feed-forward network.

The paper defines two model sizes. BERT_BASE uses 12 Transformer layers, a hidden size of 768, and 12 attention heads, totaling 110 million parameters. BERT_LARGE uses 24 layers, a hidden size of 1024, and 16 attention heads, totaling 340 million parameters. BERT_BASE was deliberately sized to match OpenAI GPT for fair comparison – the only architectural difference is that BERT allows bidirectional attention while GPT restricts each token to attending only to tokens on its left.


Figure 1: The same pre-trained BERT architecture is used for both pre-training and fine-tuning. During pre-training, the model learns from unlabeled text using masked language modeling and next sentence prediction. During fine-tuning, all parameters are updated for the specific downstream task. [CLS] is a special classification token prepended to every input, and [SEP] separates sentence pairs.

Input representation. BERT constructs its input by summing three embedding vectors for each token:

\[E_{\text{input}} = E_{\text{token}} + E_{\text{segment}} + E_{\text{position}}\]

Every input sequence starts with a special [CLS] token. For tasks involving two sentences (like question answering), the sentences are separated by a special [SEP] token. For example, a question-answering input looks like: [CLS] Where was the battle fought? [SEP] The battle took place near the river. [SEP].


Figure 2: BERT input representation. For the sentence pair “my dog is cute [SEP] he likes play ##ing [SEP]”, each token receives three embeddings: a token embedding (from the WordPiece vocabulary), a segment embedding (\(E_A\) for the first sentence, \(E_B\) for the second), and a position embedding. These three vectors are summed element-wise to form the input representation.
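A minimal NumPy sketch of the three-way embedding sum. The sizes and the randomly initialized lookup tables here are toy stand-ins for BERT's learned parameters, and the token ids are arbitrary – only the element-wise sum mirrors the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes for illustration; BERT_BASE uses a ~30k WordPiece vocabulary,
# sequences up to 512 tokens, and hidden size H = 768.
vocab_size, max_len, num_segments, H = 200, 16, 2, 768

token_emb = rng.normal(size=(vocab_size, H))      # learned, one row per vocabulary entry
segment_emb = rng.normal(size=(num_segments, H))  # learned: E_A and E_B
position_emb = rng.normal(size=(max_len, H))      # learned, one row per position

def embed(token_ids, segment_ids):
    """E_input = E_token + E_segment + E_position, summed element-wise."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]

# A toy sentence-pair input shaped like: [CLS] w1 w2 [SEP] w3 w4 [SEP]
token_ids = np.array([1, 17, 23, 2, 42, 57, 2])   # arbitrary toy ids
segment_ids = np.array([0, 0, 0, 0, 1, 1, 1])     # 0 = segment A, 1 = segment B
x = embed(token_ids, segment_ids)
print(x.shape)  # (7, 768)
```

Each row of `x` is the sum of exactly three vectors, so the model sees word identity, sentence membership, and position in a single input representation.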

Pre-training task 1: Masked LM. For each input sequence, BERT randomly selects 15% of tokens for prediction. Of those selected tokens: 80% are replaced with [MASK], 10% are replaced with a random token from the vocabulary, and 10% are left unchanged. The model must predict the original token at each selected position. This 80/10/10 split mitigates the mismatch between pre-training (which sees [MASK] tokens) and fine-tuning (which never sees [MASK]).

For a concrete example, given the sentence “my dog is hairy”, if the 4th token is selected for masking:

- 80% of the time, it is replaced with [MASK]: “my dog is [MASK]”
- 10% of the time, it is replaced with a random word: “my dog is apple”
- 10% of the time, it is left unchanged: “my dog is hairy”
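The selection procedure can be sketched as follows. This is a toy implementation: `VOCAB`, the helper name `mask_tokens`, and the independent per-position sampling are illustrative assumptions, not BERT's actual data pipeline.

```python
import random

MASK = "[MASK]"
VOCAB = ["apple", "river", "table", "blue"]  # toy stand-in for the WordPiece vocabulary

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Corrupt a token sequence for MLM training. Each position is selected
    with probability ~15%; a selected token is replaced with [MASK] 80% of
    the time, with a random vocabulary token 10% of the time, and kept
    unchanged 10% of the time. Returns the corrupted sequence and a
    {position: original token} dict of prediction targets."""
    rng = random.Random(seed)
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok                 # the model must predict this token
            r = rng.random()
            if r < 0.8:
                out[i] = MASK                # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = rng.choice(VOCAB)   # 10%: random replacement
            # else 10%: leave the token unchanged (still predicted)
    return out, targets

corrupted, targets = mask_tokens("my dog is hairy".split(), seed=0)
```

The loss is computed only at the positions recorded in `targets`, which is what keeps the prediction non-trivial despite bidirectional attention.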

Pre-training task 2: Next Sentence Prediction. The model receives two segments A and B. Half the time, B is the actual next sentence from the document. Half the time, B is a random sentence from a different document. The model predicts IsNext or NotNext using the [CLS] token’s output representation.
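Building one NSP training pair can be sketched like this. The function name and the document-as-list-of-sentences format are hypothetical; in the real pipeline, segments A and B can each span multiple sentences up to the 512-token limit.

```python
import random

def make_nsp_example(doc_sentences, all_docs, idx, rng):
    """Build one NSP pair: segment A is sentence idx; segment B is the actual
    next sentence half the time (IsNext), and a sentence drawn from a randomly
    chosen document the other half (NotNext)."""
    a = doc_sentences[idx]
    if rng.random() < 0.5 and idx + 1 < len(doc_sentences):
        return a, doc_sentences[idx + 1], "IsNext"
    other = rng.choice(all_docs)        # sketch: may occasionally pick the same doc
    return a, rng.choice(other), "NotNext"

docs = [["a1", "a2", "a3"], ["b1", "b2"]]
pair = make_nsp_example(docs[0], docs, 0, random.Random(0))
```

The resulting (A, B, label) triples are what the [CLS] representation is trained to classify during pre-training.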

Pre-training data and procedure. BERT trains on BooksCorpus (800 million words) and English Wikipedia (2.5 billion words), using a batch size of 256 sequences of up to 512 tokens (roughly 128,000 tokens per batch) for 1 million steps – approximately 40 passes over the data. Pre-training takes 4 days: BERT_BASE on 4 Cloud TPUs (16 TPU chips) and BERT_LARGE on 16 Cloud TPUs (64 chips).

Fine-tuning. For each downstream task, BERT adds one task-specific output layer and fine-tunes all parameters end-to-end. For classification tasks (like sentiment analysis), the [CLS] token’s final hidden state feeds into a classification layer. For token-level tasks (like named entity recognition), each token’s final hidden state feeds into a token classifier. For question answering, the model learns start and end vectors that identify the answer span in the passage. Fine-tuning takes at most 1 hour on a single TPU.


Figure 4: How BERT adapts to different downstream tasks. (a) Sentence pair classification (MNLI, QQP, etc.) feeds both sentences through BERT separated by [SEP] and classifies from [CLS]. (b) Single sentence classification (SST-2, CoLA) feeds one sentence and classifies from [CLS]. (c) Question answering (SQuAD) feeds the question and paragraph, predicting start and end positions. (d) Sequence tagging (NER) produces a label per token. In all cases, the same pre-trained BERT architecture is used with only a task-specific output layer added.

Mathematical Foundations

Classification Loss (GLUE tasks)

For sentence classification tasks, BERT uses the final hidden vector \(C \in \mathbb{R}^H\) of the [CLS] token and a learned weight matrix \(W \in \mathbb{R}^{K \times H}\):

\[\text{loss} = -\log\left(\text{softmax}(CW^T)_y\right)\]

where \(y\) is the index of the correct class.

What it means: BERT takes the [CLS] token’s representation, multiplies it by a classification matrix to get one score per class, converts those scores to probabilities with softmax, and maximizes the log-probability of the correct class. For a worked example with \(H = 4\) and \(K = 2\) (binary classification): if \(C = [0.5, -0.3, 0.8, 0.1]\) and \(W = [[0.2, 0.1, -0.3, 0.4], [-0.1, 0.5, 0.2, -0.2]]\), then \(CW^T = [0.5 \times 0.2 + (-0.3) \times 0.1 + 0.8 \times (-0.3) + 0.1 \times 0.4, \; 0.5 \times (-0.1) + (-0.3) \times 0.5 + 0.8 \times 0.2 + 0.1 \times (-0.2)] = [-0.13, -0.06]\), and softmax converts this to approximately \([0.48, 0.52]\).

Why it matters: The entire classification head is just a single matrix multiplication plus softmax. This simplicity is the point – BERT’s pre-trained representations are so rich that a single linear layer on top is enough to achieve state-of-the-art results. The \(W\) matrix is the only new parameter learned during fine-tuning (alongside the fine-tuning of BERT’s existing parameters).
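The worked example above can be checked numerically with NumPy; assuming (for the loss term only) that class 1 is the gold label:

```python
import numpy as np

# The worked example from the text: H = 4, K = 2 (binary classification).
C = np.array([0.5, -0.3, 0.8, 0.1])            # [CLS] hidden vector
W = np.array([[0.2, 0.1, -0.3, 0.4],
              [-0.1, 0.5, 0.2, -0.2]])          # classification matrix, K x H

logits = C @ W.T                                # one score per class: CW^T
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the K classes
loss = -np.log(probs[1])                        # assume class 1 is the gold label

print(np.round(logits, 2))  # [-0.13 -0.06]
print(np.round(probs, 2))   # [0.48 0.52]
```

The entire fine-tuning head really is this one matrix multiply plus softmax; everything else is BERT's pre-trained encoder.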

Answer Span Start Probability (SQuAD)

For question answering, BERT learns a start vector \(S \in \mathbb{R}^H\) and computes the probability that word \(i\) is the start of the answer:

\[P_i = \frac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}\]

What it means: Each token gets a score based on how similar its representation is to the learned “start” direction. Softmax normalizes these scores into a probability distribution over all tokens. The token with the highest probability is the predicted start of the answer. An analogous formula using a separate end vector \(E \in \mathbb{R}^H\) finds the end of the answer span.

Why it matters: This formulation reduces question answering to a simple dot-product-plus-softmax operation on top of BERT’s contextualized representations. The dot product \(S \cdot T_i\) acts as a learned scoring function: during fine-tuning, \(S\) learns to point in a direction in the \(H\)-dimensional space that aligns with tokens that tend to start answer spans. No complex task-specific architecture is needed.
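A sketch of the start-probability computation, with random vectors standing in for the learned start vector \(S\) and the contextual token representations \(T_i\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, H = 6, 8                    # toy: 6 paragraph tokens, hidden size 8
T = rng.normal(size=(n, H))    # stand-ins for contextual representations T_i
S = rng.normal(size=H)         # stand-in for the learned start vector

scores = T @ S                                # S . T_i for every position i
P = np.exp(scores) / np.exp(scores).sum()     # softmax over all positions
start = int(P.argmax())                       # predicted answer start position
```

An identical computation with a separate end vector \(E\) yields the end-position distribution.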

Answer Span Scoring (SQuAD v1.1)

The score of a candidate answer span from position \(i\) to position \(j\) is:

\[\text{score}(i, j) = S \cdot T_i + E \cdot T_j\]

What it means: The total span score is the sum of the start score and end score, computed independently. BERT predicts the span with the highest combined score where the end comes after the start. The training objective maximizes the log-likelihood of the correct start and end positions.

Why it matters: By decomposing the span score into independent start and end scores, BERT avoids the computational cost of scoring all \(O(n^2)\) possible spans jointly. Finding the best span is efficient: compute start scores for all tokens, compute end scores for all tokens, then find the pair \((i, j)\) with \(j \geq i\) that maximizes the sum.
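Because the score decomposes, the best legal span can be found in a single linear pass: track the best start score seen so far while scanning end positions. A sketch, with the score arrays standing in for precomputed \(S \cdot T_i\) and \(E \cdot T_j\):

```python
import numpy as np

def best_span(start_scores, end_scores):
    """Find (i, j) with j >= i maximizing start_scores[i] + end_scores[j]
    in one O(n) pass, instead of scoring all O(n^2) pairs."""
    best = (-np.inf, 0, 0)                 # (score, i, j)
    best_start, best_start_idx = -np.inf, 0
    for j in range(len(start_scores)):
        if start_scores[j] > best_start:   # best start position among i <= j
            best_start, best_start_idx = start_scores[j], j
        score = best_start + end_scores[j]
        if score > best[0]:
            best = (score, best_start_idx, j)
    return best

score, i, j = best_span(np.array([1.0, 3.0, 0.5, 2.0]),
                        np.array([0.2, 0.1, 4.0, 1.0]))
print(i, j)  # 1 2
```

The running-maximum trick is valid precisely because start and end contributions are independent terms in the sum.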

No-Answer Scoring (SQuAD v2.0)

For questions that may have no answer in the passage, BERT compares a null score to the best span score:

\[s_{\text{null}} = S \cdot C + E \cdot C\]

\[\hat{s}_{i,j} = \max_{j \geq i} \left(S \cdot T_i + E \cdot T_j\right)\]

What it means: BERT treats “no answer” as a special span starting and ending at the [CLS] position. If the best actual answer span does not score sufficiently higher than this null span (by at least \(\tau\)), the model predicts that the question is unanswerable. The threshold \(\tau\) is chosen on the development set to maximize F1.

Why it matters: This extends the span extraction approach to handle unanswerable questions with minimal additional machinery – just a threshold comparison. The elegance is that the same dot-product scoring mechanism handles both answerable and unanswerable cases.
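A sketch of the no-answer decision rule. The function name and its brute-force span loop are illustrative; `cls_start` and `cls_end` stand for \(S \cdot C\) and \(E \cdot C\), and in practice the best span would be found with the linear-pass trick rather than a double loop.

```python
def predict_answer(start_scores, end_scores, cls_start, cls_end, tau=0.0):
    """SQuAD v2.0-style decision: return the best span (i, j) only if its
    score beats the null ([CLS]) span score by more than tau; otherwise
    return None, meaning the question is predicted unanswerable."""
    s_null = cls_start + cls_end               # S . C + E . C
    best_score, best_ij = float("-inf"), None
    n = len(start_scores)
    for i in range(n):                         # candidate start positions
        for j in range(i, n):                  # candidate end positions, j >= i
            s = start_scores[i] + end_scores[j]
            if s > best_score:
                best_score, best_ij = s, (i, j)
    return best_ij if best_score > s_null + tau else None

print(predict_answer([1.0, 3.0], [0.0, 4.0], cls_start=0.0, cls_end=0.0))  # (1, 1)
```

With a large null score (or a large \(\tau\)), the same call returns None – the entire unanswerable-question mechanism is this one comparison.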

Results

BERT achieved state-of-the-art results on all 11 tasks it was evaluated on, often by large margins. The results demonstrated that a single pre-trained model, with minimal task-specific modification, could outperform systems that had been carefully engineered for each individual task.

GLUE Benchmark (General Language Understanding Evaluation):

| Task | Dataset size | Previous best | OpenAI GPT | BERT_BASE | BERT_LARGE |
|------|--------------|---------------|------------|-----------|------------|
| MNLI (matched/mismatched) | 392k | 80.6/80.1 | 82.1/81.4 | 84.6/83.4 | 86.7/85.9 |
| QQP | 363k | 66.1 | 70.3 | 71.2 | 72.1 |
| QNLI | 108k | 82.3 | 87.4 | 90.5 | 92.7 |
| SST-2 | 67k | 93.2 | 91.3 | 93.5 | 94.9 |
| CoLA | 8.5k | 35.0 | 45.4 | 52.1 | 60.5 |
| STS-B | 5.7k | 81.0 | 80.0 | 85.8 | 86.5 |
| MRPC | 3.5k | 86.0 | 82.3 | 88.9 | 89.3 |
| RTE | 2.5k | 61.7 | 56.0 | 66.4 | 70.1 |
| Average | – | 74.0 | 75.1 | 79.6 | 82.1 |

The most striking result is the comparison with OpenAI GPT: BERT_BASE uses the same architecture size but achieves 4.5% higher average accuracy, demonstrating that bidirectional pre-training (not just model architecture or scale) drives the improvement. BERT_LARGE pushes this to 7.0% over previous state-of-the-art. The gains are especially large on smaller datasets like CoLA (8.5k examples, +15.1 over GPT) and RTE (2.5k examples, +14.1 over GPT), suggesting that pre-trained representations are most valuable when labeled data is scarce.

Question Answering (SQuAD v1.1 and v2.0):

On SQuAD v1.1, a single BERT_LARGE model achieved 90.9 F1 on the development set, and an ensemble reached 93.2 F1 on the test set – surpassing the previous best ensemble by 1.5 points and even exceeding reported human performance (91.2 F1 / 82.3 exact match). On SQuAD v2.0, which includes unanswerable questions, BERT_LARGE achieved 83.1 F1, a +5.1 improvement over the previous best system.

Ablation results confirmed that both bidirectionality and NSP contribute to BERT’s performance. Removing NSP hurt QNLI (-3.5), MNLI (-0.5), and SQuAD (-0.6). Switching from bidirectional MLM to left-to-right training (like GPT) caused even larger drops: MRPC fell by 9.2 points and SQuAD by 10.7 points. The ablation over model size showed a consistent trend: larger models improved accuracy across all tasks, even on datasets with only 3,600 training examples.

Limitations

Impact and Legacy

BERT fundamentally changed how NLP systems are built. Before BERT, practitioners designed task-specific architectures for each problem – a different model for sentiment analysis, another for question answering, another for named entity recognition. After BERT, the dominant paradigm became “pre-train once, fine-tune everywhere.” A single pre-trained model, with a trivial output layer on top, replaced years of task-specific engineering. This democratized high-quality NLP: researchers and practitioners could download BERT’s pre-trained weights and fine-tune them on their own data in hours, achieving results that previously required months of architecture design.

BERT spawned an enormous family of successor models. RoBERTa (2019) showed that BERT was undertrained and that removing NSP and training longer improved results. ALBERT (2019) reduced parameters through factorized embeddings and cross-layer sharing. DistilBERT (2019) compressed BERT to 60% of its size while retaining 97% of performance. SpanBERT (2019) improved span prediction by masking contiguous spans. Whole-word masking and domain-specific variants (BioBERT, SciBERT, ClinicalBERT, FinBERT) adapted the approach to specialized fields. The concept of masked language modeling became a standard pre-training objective, used in later models like T5, DeBERTa, and multimodal models like ViLBERT.

More broadly, BERT cemented the Transformer encoder as the architecture of choice for language understanding tasks and demonstrated the scaling hypothesis for NLP: bigger models pre-trained on more data yield better results on even the smallest downstream tasks. This lesson directly influenced the scaling of GPT-2, GPT-3, and subsequent large language models. While modern systems have largely moved toward decoder-only architectures (the GPT lineage) for their generative capabilities, BERT-style encoders remain the standard for classification, retrieval, semantic search, and embedding tasks.

Prerequisites

To fully understand this paper, you should be comfortable with:

Connections

Looking Ahead

These papers build on ideas introduced here. You will encounter them later in the collection.