BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee et al. Year: 2018 Source: arXiv 1810.04805

One-Sentence Summary

A language model learns rich representations of text by reading sentences in both directions simultaneously – using a fill-in-the-blank training game – and then adapts to a wide range of language tasks by adding just a single output layer on top.

Problem Statement

Before BERT, the dominant strategy for building language understanding systems was to pre-train a model on large amounts of text and then adapt it to specific tasks. The problem was that existing pre-training methods were fundamentally limited in how they read text.

Standard language models (the kind used by ELMo and GPT) read text in one direction: either left-to-right or right-to-left. A left-to-right model processing the sentence “The bank by the river was muddy” can only use “The bank by the” when predicting “river” – it cannot look ahead to see “was muddy,” which would strongly suggest we are talking about a riverbank rather than a financial institution. OpenAI GPT, the strongest system at the time, used exactly this left-to-right approach, built on the Transformer architecture (Vaswani et al., 2017). ELMo tried a workaround: it trained two separate models (one left-to-right, one right-to-left) and concatenated their outputs. But this concatenation is shallow – the left-to-right model never sees the right-to-left context during its own processing, so it cannot learn deep interactions between both directions.

This unidirectional constraint was particularly damaging for tasks that require understanding relationships between parts of a sentence. Question answering, for example, needs the model to understand how a question relates to a passage – which requires looking at context from both sides of every word. The field needed a way to pre-train models that could genuinely look in both directions at every layer of processing.

Key Innovation

Think of learning a language by doing fill-in-the-blank exercises. A teacher gives you a sentence like “The cat ___ on the mat” and you have to guess the missing word. To make a good guess, you naturally look at the words on both sides of the blank – “The cat” on the left and “on the mat” on the right. This is fundamentally different from trying to predict the next word in a sequence (like left-to-right language models do), because you get to use context from both directions.

BERT applies exactly this idea. During pre-training, BERT randomly selects 15% of the tokens in each sequence for prediction – most of them are replaced with a special [MASK] token – and asks the model to recover the original words. This is called the Masked Language Model (MLM) objective. Because the masked words can appear anywhere in the sentence, the model must build representations that incorporate information from both left and right context at every layer of the Transformer – achieving true bidirectionality.

The key technical insight is that you cannot simply use a standard bidirectional model for language modeling, because each word would indirectly “see itself” through the network layers, making the prediction trivial. By masking out specific words, BERT prevents this information leak while still allowing fully bidirectional attention across all other positions. This is a simple idea, inspired by a 1953 psychological test called the Cloze task, but it unlocked a massive performance jump across virtually all NLP benchmarks.

BERT also introduces a second pre-training task: Next Sentence Prediction (NSP). The model receives two text segments and learns to predict whether the second segment actually follows the first in the original document, or is a randomly chosen sentence. This teaches the model about relationships between sentences, which is useful for tasks like question answering and natural language inference.

Architecture / Method

BERT uses the Transformer encoder architecture, the same building block described in “Attention Is All You Need” (Vaswani et al., 2017). If you think of the original Transformer as having two halves – an encoder that reads input and a decoder that generates output – BERT uses only the encoder half, stacked into a deep tower of layers. Each layer applies multi-head self-attention (where every token attends to every other token, in both directions) followed by a feed-forward network.

The paper defines two model sizes. BERT_BASE uses 12 Transformer layers, a hidden size of 768, and 12 attention heads, totaling 110 million parameters. BERT_LARGE uses 24 layers, a hidden size of 1024, and 16 attention heads, totaling 340 million parameters. BERT_BASE was deliberately sized to match OpenAI GPT for fair comparison – the only architectural difference is that BERT allows bidirectional attention while GPT restricts each token to attending only to tokens on its left.


Figure 1: The same pre-trained BERT architecture is used for both pre-training and fine-tuning. During pre-training, the model learns from unlabeled text using masked language modeling and next sentence prediction. During fine-tuning, all parameters are updated for the specific downstream task. [CLS] is a special classification token prepended to every input, and [SEP] separates sentence pairs.

Input representation. BERT constructs its input by summing three embedding vectors for each token:

\[E_{\text{input}} = E_{\text{token}} + E_{\text{segment}} + E_{\text{position}}\]

Every input sequence starts with a special [CLS] token. For tasks involving two sentences (like question answering), the sentences are separated by a special [SEP] token. For example, a question-answering input looks like: [CLS] Where was the battle fought? [SEP] The battle took place near the river. [SEP].


Figure 2: BERT input representation. For the sentence pair “my dog is cute [SEP] he likes play ##ing [SEP]”, each token receives three embeddings: a token embedding (from the WordPiece vocabulary), a segment embedding (\(E_A\) for the first sentence, \(E_B\) for the second), and a position embedding. These three vectors are summed element-wise to form the input representation.
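A minimal NumPy sketch of the three-way embedding sum. The sizes and the randomly initialized lookup tables here are toy stand-ins for BERT's learned parameters, and the token ids are arbitrary – only the element-wise sum mirrors the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes for illustration; BERT_BASE uses a ~30k WordPiece vocabulary,
# sequences up to 512 tokens, and hidden size H = 768.
vocab_size, max_len, num_segments, H = 200, 16, 2, 768

token_emb = rng.normal(size=(vocab_size, H))      # learned, one row per vocabulary entry
segment_emb = rng.normal(size=(num_segments, H))  # learned: E_A and E_B
position_emb = rng.normal(size=(max_len, H))      # learned, one row per position

def embed(token_ids, segment_ids):
    """E_input = E_token + E_segment + E_position, summed element-wise."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]

# A toy sentence-pair input shaped like: [CLS] w1 w2 [SEP] w3 w4 [SEP]
token_ids = np.array([1, 17, 23, 2, 42, 57, 2])   # arbitrary toy ids
segment_ids = np.array([0, 0, 0, 0, 1, 1, 1])     # 0 = segment A, 1 = segment B
x = embed(token_ids, segment_ids)
print(x.shape)  # (7, 768)
```

Each row of `x` is the sum of exactly three vectors, so the model sees word identity, sentence membership, and position in a single input representation.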

Pre-training task 1: Masked LM. For each input sequence, BERT randomly selects 15% of tokens for prediction. Of those selected tokens: 80% are replaced with [MASK], 10% are replaced with a random token from the vocabulary, and 10% are left unchanged. The model must predict the original token at each selected position. This 80/10/10 split mitigates the mismatch between pre-training (which sees [MASK] tokens) and fine-tuning (which never sees [MASK]).

For a concrete example, given the sentence “my dog is hairy”, if the 4th token is selected for masking:

- 80% of the time, it is replaced with [MASK]: “my dog is [MASK]”
- 10% of the time, it is replaced with a random word: “my dog is apple”
- 10% of the time, it is left unchanged: “my dog is hairy”
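The selection procedure can be sketched as follows. This is a toy implementation: `VOCAB`, the helper name `mask_tokens`, and the independent per-position sampling are illustrative assumptions, not BERT's actual data pipeline.

```python
import random

MASK = "[MASK]"
VOCAB = ["apple", "river", "table", "blue"]  # toy stand-in for the WordPiece vocabulary

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Corrupt a token sequence for MLM training. Each position is selected
    with probability ~15%; a selected token is replaced with [MASK] 80% of
    the time, with a random vocabulary token 10% of the time, and kept
    unchanged 10% of the time. Returns the corrupted sequence and a
    {position: original token} dict of prediction targets."""
    rng = random.Random(seed)
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok                 # the model must predict this token
            r = rng.random()
            if r < 0.8:
                out[i] = MASK                # 80%: replace with [MASK]
            elif r < 0.9:
                out[i] = rng.choice(VOCAB)   # 10%: random replacement
            # else 10%: leave the token unchanged (still predicted)
    return out, targets

corrupted, targets = mask_tokens("my dog is hairy".split(), seed=0)
```

The loss is computed only at the positions recorded in `targets`, which is what keeps the prediction non-trivial despite bidirectional attention.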

Pre-training task 2: Next Sentence Prediction. The model receives two segments A and B. Half the time, B is the actual next sentence from the document. Half the time, B is a random sentence from a different document. The model predicts IsNext or NotNext using the [CLS] token’s output representation.
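Building one NSP training pair can be sketched like this. The function name and the document-as-list-of-sentences format are hypothetical; in the real pipeline, segments A and B can each span multiple sentences up to the 512-token limit.

```python
import random

def make_nsp_example(doc_sentences, all_docs, idx, rng):
    """Build one NSP pair: segment A is sentence idx; segment B is the actual
    next sentence half the time (IsNext), and a sentence drawn from a randomly
    chosen document the other half (NotNext)."""
    a = doc_sentences[idx]
    if rng.random() < 0.5 and idx + 1 < len(doc_sentences):
        return a, doc_sentences[idx + 1], "IsNext"
    other = rng.choice(all_docs)        # sketch: may occasionally pick the same doc
    return a, rng.choice(other), "NotNext"

docs = [["a1", "a2", "a3"], ["b1", "b2"]]
pair = make_nsp_example(docs[0], docs, 0, random.Random(0))
```

The resulting (A, B, label) triples are what the [CLS] representation is trained to classify during pre-training.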

Pre-training data and procedure. BERT trains on BooksCorpus (800 million words) and English Wikipedia (2.5 billion words), using a batch size of 256 sequences of up to 512 tokens (roughly 128,000 tokens per batch) for 1 million steps – approximately 40 passes over the data. Pre-training takes 4 days: BERT_BASE on 4 Cloud TPUs (16 TPU chips) and BERT_LARGE on 16 Cloud TPUs (64 chips).

Fine-tuning. For each downstream task, BERT adds one task-specific output layer and fine-tunes all parameters end-to-end. For classification tasks (like sentiment analysis), the [CLS] token’s final hidden state feeds into a classification layer. For token-level tasks (like named entity recognition), each token’s final hidden state feeds into a token classifier. For question answering, the model learns start and end vectors that identify the answer span in the passage. Fine-tuning takes at most 1 hour on a single TPU.


Figure 4: How BERT adapts to different downstream tasks. (a) Sentence pair classification (MNLI, QQP, etc.) feeds both sentences through BERT separated by [SEP] and classifies from [CLS]. (b) Single sentence classification (SST-2, CoLA) feeds one sentence and classifies from [CLS]. (c) Question answering (SQuAD) feeds the question and paragraph, predicting start and end positions. (d) Sequence tagging (NER) produces a label per token. In all cases, the same pre-trained BERT architecture is used with only a task-specific output layer added.

Mathematical Foundations

Classification Loss (GLUE tasks)

For sentence classification tasks, BERT uses the final hidden vector \(C \in \mathbb{R}^H\) of the [CLS] token and a learned weight matrix \(W \in \mathbb{R}^{K \times H}\):

\[\text{loss} = -\log\left(\text{softmax}(CW^T)_y\right)\]

where \(y\) is the index of the correct class.

What it means: BERT takes the [CLS] token’s representation, multiplies it by a classification matrix to get one score per class, converts those scores to probabilities with softmax, and maximizes the log-probability of the correct class. For a worked example with \(H = 4\) and \(K = 2\) (binary classification): if \(C = [0.5, -0.3, 0.8, 0.1]\) and \(W = [[0.2, 0.1, -0.3, 0.4], [-0.1, 0.5, 0.2, -0.2]]\), then \(CW^T = [0.5 \times 0.2 + (-0.3) \times 0.1 + 0.8 \times (-0.3) + 0.1 \times 0.4, \; 0.5 \times (-0.1) + (-0.3) \times 0.5 + 0.8 \times 0.2 + 0.1 \times (-0.2)] = [-0.13, -0.06]\), and softmax converts this to approximately \([0.48, 0.52]\).

Why it matters: The entire classification head is just a single matrix multiplication plus softmax. This simplicity is the point – BERT’s pre-trained representations are so rich that a single linear layer on top is enough to achieve state-of-the-art results. The \(W\) matrix is the only new parameter learned during fine-tuning (alongside the fine-tuning of BERT’s existing parameters).
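The worked example above can be checked numerically with NumPy; assuming (for the loss term only) that class 1 is the gold label:

```python
import numpy as np

# The worked example from the text: H = 4, K = 2 (binary classification).
C = np.array([0.5, -0.3, 0.8, 0.1])            # [CLS] hidden vector
W = np.array([[0.2, 0.1, -0.3, 0.4],
              [-0.1, 0.5, 0.2, -0.2]])          # classification matrix, K x H

logits = C @ W.T                                # one score per class: CW^T
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the K classes
loss = -np.log(probs[1])                        # assume class 1 is the gold label

print(np.round(logits, 2))  # [-0.13 -0.06]
print(np.round(probs, 2))   # [0.48 0.52]
```

The entire fine-tuning head really is this one matrix multiply plus softmax; everything else is BERT's pre-trained encoder.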

Answer Span Start Probability (SQuAD)

For question answering, BERT learns a start vector \(S \in \mathbb{R}^H\) and computes the probability that word \(i\) is the start of the answer:

\[P_i = \frac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}}\]

What it means: Each token gets a score based on how similar its representation is to the learned “start” direction. Softmax normalizes these scores into a probability distribution over all tokens. The token with the highest probability is the predicted start of the answer. An analogous formula using a separate end vector \(E \in \mathbb{R}^H\) finds the end of the answer span.

Why it matters: This formulation reduces question answering to a simple dot-product-plus-softmax operation on top of BERT’s contextualized representations. The dot product \(S \cdot T_i\) acts as a learned scoring function: during fine-tuning, \(S\) learns to point in a direction in the \(H\)-dimensional space that aligns with tokens that tend to start answer spans. No complex task-specific architecture is needed.
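A sketch of the start-probability computation, with random vectors standing in for the learned start vector \(S\) and the contextual token representations \(T_i\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, H = 6, 8                    # toy: 6 paragraph tokens, hidden size 8
T = rng.normal(size=(n, H))    # stand-ins for contextual representations T_i
S = rng.normal(size=H)         # stand-in for the learned start vector

scores = T @ S                                # S . T_i for every position i
P = np.exp(scores) / np.exp(scores).sum()     # softmax over all positions
start = int(P.argmax())                       # predicted answer start position
```

An identical computation with a separate end vector \(E\) yields the end-position distribution.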

Answer Span Scoring (SQuAD v1.1)

The score of a candidate answer span from position \(i\) to position \(j\) is:

\[\text{score}(i, j) = S \cdot T_i + E \cdot T_j\]

What it means: The total span score is the sum of the start score and end score, computed independently. BERT predicts the span with the highest combined score where the end comes after the start. The training objective maximizes the log-likelihood of the correct start and end positions.

Why it matters: By decomposing the span score into independent start and end scores, BERT avoids the computational cost of scoring all \(O(n^2)\) possible spans jointly. Finding the best span is efficient: compute start scores for all tokens, compute end scores for all tokens, then find the pair \((i, j)\) with \(j \geq i\) that maximizes the sum.
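Because the score decomposes, the best legal span can be found in a single linear pass: track the best start score seen so far while scanning end positions. A sketch, with the score arrays standing in for precomputed \(S \cdot T_i\) and \(E \cdot T_j\):

```python
import numpy as np

def best_span(start_scores, end_scores):
    """Find (i, j) with j >= i maximizing start_scores[i] + end_scores[j]
    in one O(n) pass, instead of scoring all O(n^2) pairs."""
    best = (-np.inf, 0, 0)                 # (score, i, j)
    best_start, best_start_idx = -np.inf, 0
    for j in range(len(start_scores)):
        if start_scores[j] > best_start:   # best start position among i <= j
            best_start, best_start_idx = start_scores[j], j
        score = best_start + end_scores[j]
        if score > best[0]:
            best = (score, best_start_idx, j)
    return best

score, i, j = best_span(np.array([1.0, 3.0, 0.5, 2.0]),
                        np.array([0.2, 0.1, 4.0, 1.0]))
print(i, j)  # 1 2
```

The running-maximum trick is valid precisely because start and end contributions are independent terms in the sum.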

No-Answer Scoring (SQuAD v2.0)

For questions that may have no answer in the passage, BERT compares a null score to the best span score:

\[s_{\text{null}} = S \cdot C + E \cdot C\]

\[\hat{s}_{i,j} = \max_{j \geq i} \left(S \cdot T_i + E \cdot T_j\right)\]

What it means: BERT treats “no answer” as a special span starting and ending at the [CLS] position. If the best actual answer span does not score sufficiently higher than this null span (by at least \(\tau\)), the model predicts that the question is unanswerable. The threshold \(\tau\) is chosen on the development set to maximize F1.

Why it matters: This extends the span extraction approach to handle unanswerable questions with minimal additional machinery – just a threshold comparison. The elegance is that the same dot-product scoring mechanism handles both answerable and unanswerable cases.
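A sketch of the no-answer decision rule. The function name and its brute-force span loop are illustrative; `cls_start` and `cls_end` stand for \(S \cdot C\) and \(E \cdot C\), and in practice the best span would be found with the linear-pass trick rather than a double loop.

```python
def predict_answer(start_scores, end_scores, cls_start, cls_end, tau=0.0):
    """SQuAD v2.0-style decision: return the best span (i, j) only if its
    score beats the null ([CLS]) span score by more than tau; otherwise
    return None, meaning the question is predicted unanswerable."""
    s_null = cls_start + cls_end               # S . C + E . C
    best_score, best_ij = float("-inf"), None
    n = len(start_scores)
    for i in range(n):                         # candidate start positions
        for j in range(i, n):                  # candidate end positions, j >= i
            s = start_scores[i] + end_scores[j]
            if s > best_score:
                best_score, best_ij = s, (i, j)
    return best_ij if best_score > s_null + tau else None

print(predict_answer([1.0, 3.0], [0.0, 4.0], cls_start=0.0, cls_end=0.0))  # (1, 1)
```

With a large null score (or a large \(\tau\)), the same call returns None – the entire unanswerable-question mechanism is this one comparison.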

Results

BERT achieved state-of-the-art results on all 11 tasks it was evaluated on, often by large margins. The results demonstrated that a single pre-trained model, with minimal task-specific modification, could outperform systems that had been carefully engineered for each individual task.

GLUE Benchmark (General Language Understanding Evaluation):

| Task | Dataset size | Previous best | OpenAI GPT | BERT_BASE | BERT_LARGE |
|------|--------------|---------------|------------|-----------|------------|
| MNLI (matched/mismatched) | 392k | 80.6/80.1 | 82.1/81.4 | 84.6/83.4 | 86.7/85.9 |
| QQP | 363k | 66.1 | 70.3 | 71.2 | 72.1 |
| QNLI | 108k | 82.3 | 87.4 | 90.5 | 92.7 |
| SST-2 | 67k | 93.2 | 91.3 | 93.5 | 94.9 |
| CoLA | 8.5k | 35.0 | 45.4 | 52.1 | 60.5 |
| STS-B | 5.7k | 81.0 | 80.0 | 85.8 | 86.5 |
| MRPC | 3.5k | 86.0 | 82.3 | 88.9 | 89.3 |
| RTE | 2.5k | 61.7 | 56.0 | 66.4 | 70.1 |
| Average | – | 74.0 | 75.1 | 79.6 | 82.1 |

The most striking result is the comparison with OpenAI GPT: BERT_BASE uses the same architecture size but achieves 4.5% higher average accuracy, demonstrating that bidirectional pre-training (not just model architecture or scale) drives the improvement. BERT_LARGE pushes this to 7.0% over previous state-of-the-art. The gains are especially large on smaller datasets like CoLA (8.5k examples, +15.1 over GPT) and RTE (2.5k examples, +14.1 over GPT), suggesting that pre-trained representations are most valuable when labeled data is scarce.

Question Answering (SQuAD v1.1 and v2.0):

On SQuAD v1.1, a single BERT_LARGE model achieved 90.9 F1 on the development set, and an ensemble reached 93.2 F1 on the test set – surpassing the previous best ensemble by 1.5 points and even exceeding reported human performance (91.2 F1 / 82.3 exact match). On SQuAD v2.0, which includes unanswerable questions, BERT_LARGE achieved 83.1 F1, a +5.1 improvement over the previous best system.

Ablation results confirmed that both bidirectionality and NSP contribute to BERT’s performance. Removing NSP hurt QNLI (-3.5), MNLI (-0.5), and SQuAD (-0.6). Switching from bidirectional MLM to left-to-right training (like GPT) caused even larger drops: MRPC fell by 9.2 points and SQuAD by 10.7 points. The ablation over model size showed a consistent trend: larger models improved accuracy across all tasks, even on datasets with only 3,600 training examples.

Limitations

Impact and Legacy

BERT fundamentally changed how NLP systems are built. Before BERT, practitioners designed task-specific architectures for each problem – a different model for sentiment analysis, another for question answering, another for named entity recognition. After BERT, the dominant paradigm became “pre-train once, fine-tune everywhere.” A single pre-trained model, with a trivial output layer on top, replaced years of task-specific engineering. This democratized high-quality NLP: researchers and practitioners could download BERT’s pre-trained weights and fine-tune them on their own data in hours, achieving results that previously required months of architecture design.

BERT spawned an enormous family of successor models. RoBERTa (2019) showed that BERT was undertrained and that removing NSP and training longer improved results. ALBERT (2019) reduced parameters through factorized embeddings and cross-layer sharing. DistilBERT (2019) compressed BERT to 60% of its size while retaining 97% of performance. SpanBERT (2019) improved span prediction by masking contiguous spans. Whole-word masking and domain-specific variants (BioBERT, SciBERT, ClinicalBERT, FinBERT) adapted the approach to specialized fields. The concept of masked language modeling became a standard pre-training objective, used in later models like T5, DeBERTa, and multimodal models like ViLBERT.

More broadly, BERT cemented the Transformer encoder as the architecture of choice for language understanding tasks and demonstrated the scaling hypothesis for NLP: bigger models pre-trained on more data yield better results on even the smallest downstream tasks. This lesson directly influenced the scaling of GPT-2, GPT-3, and subsequent large language models. While modern systems have largely moved toward decoder-only architectures (the GPT lineage) for their generative capabilities, BERT-style encoders remain the standard for classification, retrieval, semantic search, and embedding tasks.

Prerequisites

To fully understand this paper, you should be comfortable with:

Connections

Looking Ahead

These papers build on ideas introduced here. You will encounter them later in the collection.