Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar et al. Year: 2017 Source: arXiv 1706.03762
A neural network architecture called the Transformer processes entire sequences of words simultaneously using a mechanism called “attention” – which lets each word look at every other word to decide what is relevant – replacing the older approach of reading words one at a time.
Before the Transformer, the dominant models for tasks like machine translation (converting a sentence from one language to another) were recurrent neural networks (RNNs). An RNN processes a sequence one element at a time, left to right: it reads the first word, updates its internal state, reads the second word, updates again, and so on. This is like reading a book one word at a time while trying to remember everything in a single notebook. The problem is that this sequential processing cannot be parallelized – the computation for word 5 depends on the result from word 4, which depends on word 3, and so on. On modern GPUs (hardware designed to do thousands of operations simultaneously), this sequential bottleneck wastes enormous computational capacity and makes training slow.
The sequential nature of RNNs creates a second problem: long-range dependencies are hard to learn. If a word near the end of a sentence depends on a word near the beginning, the relevant information must survive a chain of sequential processing steps. In practice, information degrades across long chains, making it difficult for RNNs to connect distant parts of a sequence. A technique called “attention” had been added on top of RNNs to help with this – it lets the model look directly at any position in the input – but attention was always used as a supplement to recurrence, never as a replacement.
Convolutional neural networks (CNNs) offered an alternative that could process positions in parallel, but they had their own limitation: a single convolutional layer can only relate positions within a fixed window (the “kernel size”). To connect distant positions, you need to stack many layers, creating paths of length \(O(n/k)\) between positions (or \(O(\log_k(n))\) with dilated convolutions) – better than RNNs’ \(O(n)\), but still not ideal.
Imagine you are translating a long document from English to French. The old approach (RNNs) is like having a single translator who reads the English document word by word, trying to keep everything in their head, then writes the French translation word by word. The Transformer’s approach is like spreading the entire English document on a large table so that when translating any French word, the translator can look at all the English words at once and decide which ones are most relevant to the word being translated right now.
This “looking at everything at once” mechanism is called self-attention. For each position in the sequence, self-attention computes a weighted sum over all other positions, where the weights reflect how relevant each other position is. The key insight of the paper is that self-attention alone, without any recurrence or convolution, is sufficient to build a state-of-the-art sequence transduction model. This eliminates the sequential bottleneck entirely: since every position attends to every other position in a single operation, long-range dependencies are captured in \(O(1)\) sequential steps instead of \(O(n)\).
The paper also introduces multi-head attention, which runs several independent attention operations in parallel, each with its own learned projections. This is like having multiple translators at the table, each paying attention to different aspects of the source text – one focuses on grammar, another on meaning, another on word order – and then combining their insights. The resulting architecture trains faster and achieves better translation quality than any prior model, reaching state-of-the-art performance on English-to-French translation after just 3.5 days of training on 8 P100 GPUs.
The Transformer follows the encoder-decoder structure common to sequence-to-sequence models. The encoder reads the input sequence and produces a rich representation of it. The decoder generates the output sequence one token at a time, using both the encoder’s representation and the tokens it has already generated.
Figure 1: The full Transformer architecture. The left half is the encoder: input tokens are embedded and combined with positional encodings, then pass through N=6 identical layers, each containing multi-head self-attention followed by a feed-forward network. The right half is the decoder: it has the same structure but adds a cross-attention layer that looks at the encoder output, and its self-attention is masked to prevent looking at future tokens. Residual connections (the arrows skipping around each sub-layer) and layer normalization (“Add & Norm”) stabilize training.
Encoder. The encoder is a stack of \(N = 6\) identical layers. Each layer has two sub-layers: (1) a multi-head self-attention mechanism and (2) a position-wise feed-forward network. Each sub-layer is wrapped with a residual connection (the input is added to the sub-layer’s output) followed by layer normalization. Formally, the output of each sub-layer is \(\text{LayerNorm}(x + \text{Sublayer}(x))\). All sub-layers and embedding layers produce outputs of dimension \(d_\text{model} = 512\).
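The residual-plus-normalization wrapper can be sketched in a few lines of NumPy – a minimal illustration, omitting the learned gain and bias parameters that a full layer normalization includes:

```python
import numpy as np

d_model = 512

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_wrap(x, sublayer):
    # LayerNorm(x + Sublayer(x)): residual connection, then normalization.
    return layer_norm(x + sublayer(x))

x = np.random.randn(10, d_model)          # 10 positions, 512 dims each
y = sublayer_wrap(x, lambda h: 0.5 * h)   # placeholder sub-layer
```

Because the residual path carries \(x\) through unchanged, gradients can flow around each sub-layer, which is part of what makes stacking six layers stable.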
Decoder. The decoder is also a stack of \(N = 6\) identical layers, but each layer has three sub-layers: (1) masked multi-head self-attention, (2) multi-head attention over the encoder output (called “cross-attention”), and (3) a position-wise feed-forward network. The masking in the first sub-layer prevents position \(i\) from attending to positions greater than \(i\) – this ensures that predictions for each position depend only on previously generated tokens, preserving the auto-regressive property (the model generates one token at a time, left to right).
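The mask itself is simple: before the softmax, every illegal (future) position is set to \(-\infty\), so its attention weight becomes exactly zero. A minimal NumPy sketch, with random scores standing in for real query-key dot products:

```python
import numpy as np

n = 5  # sequence length
# -inf strictly above the diagonal: position i may not see positions j > i.
mask = np.triu(np.full((n, n), -np.inf), k=1)

scores = np.random.randn(n, n) + mask              # masked attention scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
# Row i now assigns zero weight to every future position.
```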
Scaled Dot-Product Attention. The core computation in the Transformer is scaled dot-product attention. Each position produces three vectors: a query (what am I looking for?), a key (what do I contain?), and a value (what information do I provide?). The attention output for a position is a weighted sum of all values, where the weight for each value is determined by how well the corresponding key matches the query. The weights are computed by taking dot products between the query and all keys, scaling by \(1/\sqrt{d_k}\) to prevent the dot products from growing too large, and applying a softmax function to obtain a probability distribution.
Figure 2: The computation flow of Scaled Dot-Product Attention. Queries (Q) and Keys (K) are multiplied together (MatMul), scaled by \(1/\sqrt{d_k}\), optionally masked (for the decoder, to block future positions), passed through a softmax to get attention weights, and finally multiplied with Values (V) to produce the output. Every operation here is a matrix operation that runs efficiently on GPUs.
Multi-Head Attention. Rather than performing a single attention function with \(d_\text{model}\)-dimensional queries, keys, and values, the model projects them into \(h = 8\) separate lower-dimensional spaces (\(d_k = d_v = d_\text{model}/h = 64\)), runs attention in parallel on each projection, concatenates the results, and applies a final linear projection. This lets different heads learn to attend to different types of relationships.
Figure 3: Multi-Head Attention. The inputs V, K, Q are each projected through h separate learned linear transformations (shown as stacked “Linear” boxes), producing h sets of lower-dimensional queries, keys, and values. Scaled dot-product attention runs on each set in parallel, the results are concatenated, and a final linear layer maps back to the original dimension. With h=8 heads and d_model=512, each head operates on 64-dimensional vectors.
Positional Encoding. Since the Transformer processes all positions simultaneously (unlike an RNN, which processes them in order), it has no inherent notion of word order. The paper injects position information by adding sinusoidal signals to the input embeddings. Even-indexed dimensions use sine functions and odd-indexed dimensions use cosine functions, with wavelengths forming a geometric progression from \(2\pi\) to \(10000 \cdot 2\pi\). The authors chose sinusoidal over learned positional embeddings because the sinusoidal version can potentially generalize to sequence lengths longer than those seen during training, since for any fixed offset \(k\), \(PE_{pos+k}\) can be expressed as a linear function of \(PE_{pos}\).
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
What it means: For each query position, compute how similar it is to every key position (via dot product), scale the result to prevent extreme values, convert to probabilities with softmax, then take a weighted average of the value vectors. The output is an \(n \times d_v\) matrix – one output vector per query position.
Why it matters: This single equation is the computational heart of the Transformer. It replaces the recurrent computation of RNNs with a parallelizable matrix multiplication. The scaling by \(\sqrt{d_k}\) is critical: without it, when \(d_k\) is large (e.g., 64), the dot products grow in variance proportionally to \(d_k\) (assuming unit-variance components), pushing the softmax into regions with near-zero gradients where the model cannot learn.
Worked example: Let our input be the 3-word sentence [“the”, “cat”, “sat”] with \(d_k = 4\). Suppose after linear projection, we have:
\(Q = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 0 & 0 \end{bmatrix}\), \(K = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 \end{bmatrix}\), \(V = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \\ 0.5 & 0.6 \end{bmatrix}\)
Then \(QK^T = \begin{bmatrix} 1 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 1 \end{bmatrix}\), and \(QK^T / \sqrt{4} = \begin{bmatrix} 0.5 & 0.5 & 1.0 \\ 0.5 & 0.5 & 0.0 \\ 1.0 & 0.0 & 0.5 \end{bmatrix}\).
Applying softmax to each row gives weights like \([0.27, 0.27, 0.45]\) for the first row – meaning “the” attends most strongly to “sat”. Multiplying by \(V\) produces the weighted combination of value vectors.
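The arithmetic of the worked example can be checked directly with NumPy – a minimal sketch of the attention computation using the matrices above, not the paper’s actual code:

```python
import numpy as np

Q = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]], dtype=float)
K = np.array([[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]], dtype=float)
V = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # scaled dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

out, w = attention(Q, K, V)
print(np.round(w, 2))   # first row ≈ [0.27, 0.27, 0.45]
```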
\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O\]
\[\text{where } \text{head}_i = \text{Attention}(QW^Q_i, KW^K_i, VW^V_i)\]
What it means: Instead of running one attention function on the full 512-dimensional vectors, project the inputs into 8 separate 64-dimensional spaces, run attention independently in each, concatenate the 8 outputs (back to 512 dimensions), and apply one more linear transformation.
Why it matters: A single attention head computes a single set of attention weights, which means it can only capture one type of relationship at a time. Multi-head attention allows the model to simultaneously capture different types of relationships – for example, one head might learn to attend to syntactic dependencies (subject-verb agreement) while another attends to semantic relationships (what the sentence is about). The total computation cost is similar to single-head attention because each head operates on smaller dimensions (\(d_k = d_\text{model}/h\)).
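The project-per-head structure can be sketched as follows. Randomly initialized weights stand in for learned parameters, and the init scale is an arbitrary choice for the sketch; a real implementation would batch all heads into single matrix multiplies:

```python
import numpy as np

d_model, h, n = 512, 8, 10
d_k = d_model // h   # 64 dimensions per head
rng = np.random.default_rng(0)

# Per-head projection matrices (learned in a real model).
scale = d_model ** -0.5
Wq = rng.normal(size=(h, d_model, d_k)) * scale
Wk = rng.normal(size=(h, d_model, d_k)) * scale
Wv = rng.normal(size=(h, d_model, d_k)) * scale
Wo = rng.normal(size=(h * d_k, d_model)) * scale

def attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

x = rng.normal(size=(n, d_model))
heads = [attention(x @ Wq[i], x @ Wk[i], x @ Wv[i]) for i in range(h)]
out = np.concatenate(heads, axis=-1) @ Wo   # concat heads, project back
```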
\[\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2\]
What it means: At each position independently, expand the 512-dimensional representation to 2048 dimensions, apply a nonlinear activation, then compress back to 512 dimensions. The same function is applied at every position, but different layers use different weight matrices.
Why it matters: Attention alone is a linear operation on the values (it computes weighted sums). The feed-forward network introduces nonlinearity, which is essential for the model to learn complex transformations. The expansion to 4x the model dimension (\(2048 = 4 \times 512\)) gives the network a larger internal workspace. This pattern of “expand, activate, compress” is used in virtually all subsequent Transformer variants.
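The position-wise FFN is just two matrix multiplies with a ReLU in between, applied identically at every position. A sketch with random weights (the small init scale is an arbitrary choice for illustration):

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

def ffn(x):
    # Expand to 2048 dims, ReLU, compress back to 512 – per position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

y = ffn(rng.normal(size=(10, d_model)))   # shape (10, 512)
```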
\[PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_\text{model}}}\right)\]
\[PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_\text{model}}}\right)\]
What it means: Each position gets a unique 512-dimensional vector added to its embedding. Even dimensions use sine and odd dimensions use cosine, with the frequency decreasing as the dimension index increases. The first few dimensions oscillate rapidly (high frequency, distinguishing nearby positions) while later dimensions oscillate slowly (low frequency, encoding coarser position information). The wavelengths form a geometric progression from \(2\pi\) (dimension 0) to \(10000 \cdot 2\pi\) (the last dimension).
Why it matters: Without positional encoding, the Transformer would treat “the cat sat on the mat” identically to “mat the on sat cat the” – it has no concept of word order. The sinusoidal encoding was chosen because for any fixed offset \(k\), \(PE_{pos+k}\) is a linear function of \(PE_{pos}\). This means the model can learn to attend to relative positions (e.g., “two words ahead”) rather than only absolute positions.
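The full encoding table can be generated in a few lines of NumPy, matching the formulas above:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]         # frequency index
    angle = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)   # even dimensions
    pe[:, 1::2] = np.cos(angle)   # odd dimensions
    return pe

pe = positional_encoding(100, 512)
# Every value lies in [-1, 1], so the table can be added to embeddings.
```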
\[lrate = d_\text{model}^{-0.5} \cdot \min(step\_num^{-0.5}, \; step\_num \cdot warmup\_steps^{-1.5})\]
What it means: The learning rate increases linearly during the first 4000 steps (the warmup phase), then decreases proportionally to the inverse square root of the step number. During warmup, the formula simplifies to \(lrate \approx d_\text{model}^{-0.5} \cdot step\_num \cdot warmup\_steps^{-1.5}\), a line starting from near zero. After warmup, it simplifies to \(lrate \approx d_\text{model}^{-0.5} \cdot step\_num^{-0.5}\), a gradually decreasing curve.
Why it matters: The warmup is important because the Adam optimizer’s second-moment estimates are inaccurate at the start of training (they are initialized to zero). Large learning rates in this phase cause unstable updates. The gradual warmup lets the optimizer’s statistics stabilize before taking large steps. This learning rate schedule became standard practice in subsequent Transformer-based models.
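The schedule is one line of code (with warmup_steps = 4000, as in the paper):

```python
def lrate(step, d_model=512, warmup_steps=4000):
    # Linear warmup, then decay proportional to 1/sqrt(step).
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The two branches of min() meet at step == warmup_steps, where the
# learning rate peaks (about 7e-4 for d_model = 512).
peak = lrate(4000)
```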
The Transformer was evaluated primarily on machine translation, using the BLEU score (a standard metric that measures how closely machine-translated text matches human reference translations, on a scale where higher is better).
| Model | EN-DE BLEU | EN-FR BLEU | Training Cost (FLOPs, floating-point operations) |
|---|---|---|---|
| ByteNet | 23.75 | – | – |
| GNMT + RL | 24.6 | 39.92 | \(2.3 \times 10^{19}\) (EN-DE) |
| ConvS2S | 25.16 | 40.46 | \(9.6 \times 10^{18}\) (EN-DE) |
| MoE | 26.03 | 40.56 | \(2.0 \times 10^{19}\) (EN-DE) |
| GNMT + RL Ensemble | 26.30 | 41.16 | \(1.8 \times 10^{20}\) (EN-DE) |
| ConvS2S Ensemble | 26.36 | 41.29 | \(7.7 \times 10^{19}\) (EN-DE) |
| Transformer (base) | 27.3 | 38.1 | \(3.3 \times 10^{18}\) |
| Transformer (big) | 28.4 | 41.8 | \(2.3 \times 10^{19}\) |
On English-to-German (WMT 2014), the big Transformer achieves 28.4 BLEU, surpassing all previous models – including ensembles (multiple models combined) – by more than 2 BLEU points. The base Transformer already beats all prior single models and ensembles at a training cost of just \(3.3 \times 10^{18}\) FLOPs, roughly 3-60x cheaper than competing approaches. On English-to-French, the big Transformer reaches 41.8 BLEU, a new single-model record, trained in 3.5 days on 8 GPUs – less than 1/4 the training cost of the previous best model.
The paper also tested the Transformer on English constituency parsing (analyzing the grammatical structure of sentences), a task very different from translation. With no task-specific tuning, a 4-layer Transformer achieved 91.3 F1 (a metric combining precision and recall, where 100 is perfect) in the WSJ-only setting and 92.7 F1 in the semi-supervised setting, competitive with the best specialized parsers. This demonstrated that the architecture generalizes beyond translation.
Ablation experiments (Table 3 in the paper) revealed several insights: single-head attention performs 0.9 BLEU worse than 8 heads; reducing key dimension \(d_k\) hurts quality (dot-product compatibility needs sufficient dimensions); bigger models improve quality; and dropout is critical for preventing overfitting. Learned positional embeddings performed nearly identically to sinusoidal encodings.
The Transformer is arguably the most consequential single paper in modern artificial intelligence. Its architecture became the foundation for virtually all major language models that followed: GPT (decoder-only Transformer for language generation), BERT (encoder-only Transformer for language understanding), T5 (encoder-decoder Transformer for text-to-text tasks), and the entire lineage of large language models including GPT-3, GPT-4, PaLM, LLaMA, and Claude.
The impact extends far beyond language. The Vision Transformer (ViT) adapted the architecture to images by treating image patches as tokens. Diffusion models for image generation use Transformer-based architectures for their denoising networks. Transformers have been applied to protein structure prediction (AlphaFold), code generation (Codex, GitHub Copilot), music, robotics, and scientific simulation.
The paper’s deeper contribution is conceptual: it demonstrated that attention is not merely a useful add-on to recurrence, but a complete replacement for it. This challenged the assumption that sequential processing is necessary for sequential data. The resulting architecture has favorable scaling properties – larger models trained on more data consistently improve – which enabled the “scaling laws” research that drove the development of increasingly powerful foundation models. The Transformer’s design, with its modular layers, residual connections, and parallelizable computation, became the template for building neural networks at scales that were previously impractical.
To fully understand this paper, you should be comfortable with:
These papers build on ideas introduced here. You will encounter them later in the collection.