Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar et al. Year: 2017 Source: arXiv 1706.03762
A neural network architecture called the Transformer processes entire sequences of words simultaneously using a mechanism called “attention” – which lets each word look at every other word to decide what is relevant – replacing the older approach of reading words one at a time.
Before the Transformer, the dominant models for tasks like machine translation (converting a sentence from one language to another) were recurrent neural networks (RNNs). An RNN processes a sequence one element at a time, left to right: it reads the first word, updates its internal state, reads the second word, updates again, and so on. This is like reading a book one word at a time while trying to remember everything in a single notebook. The problem is that this sequential processing cannot be parallelized – the computation for word 5 depends on the result from word 4, which depends on word 3, and so on. On modern GPUs (hardware designed to do thousands of operations simultaneously), this sequential bottleneck wastes enormous computational capacity and makes training slow.
The sequential nature of RNNs creates a second problem: long-range dependencies are hard to learn. If a word near the end of a sentence depends on a word near the beginning, the relevant information must survive a chain of sequential processing steps. In practice, information degrades across long chains, making it difficult for RNNs to connect distant parts of a sequence. A technique called “attention” had been added on top of RNNs to help with this – it lets the model look directly at any position in the input – but attention was always used as a supplement to recurrence, never as a replacement.
Convolutional neural networks (CNNs) offered an alternative that could process positions in parallel, but they had their own limitation: a single convolutional layer can only relate positions within a fixed window (the “kernel size”). To connect distant positions, you need to stack many layers, creating paths of length \(O(n/k)\) between positions (or \(O(\log_k(n))\) with dilated convolutions) – better than RNNs’ \(O(n)\), but still not ideal.
Imagine you are translating a long document from English to French. The old approach (RNNs) is like having a single translator who reads the English document word by word, trying to keep everything in their head, then writes the French translation word by word. The Transformer’s approach is like spreading the entire English document on a large table so that when translating any French word, the translator can look at all the English words at once and decide which ones are most relevant to the word being translated right now.
This “looking at everything at once” mechanism is called self-attention. For each position in the sequence, self-attention computes a weighted sum over all other positions, where the weights reflect how relevant each other position is. The key insight of the paper is that self-attention alone, without any recurrence or convolution, is sufficient to build a state-of-the-art sequence transduction model. This eliminates the sequential bottleneck entirely: since every position attends to every other position in a single operation, long-range dependencies are captured in \(O(1)\) sequential steps instead of \(O(n)\).
The paper also introduces multi-head attention, which runs several independent attention operations in parallel, each with its own learned projections. This is like having multiple translators at the table, each paying attention to different aspects of the source text – one focuses on grammar, another on meaning, another on word order – and then combining their insights. The resulting architecture trains faster and achieves better translation quality than any prior model, reaching state-of-the-art performance on English-to-French translation after just 3.5 days of training on 8 P100 GPUs.
The Transformer follows the encoder-decoder structure common to sequence-to-sequence models. The encoder reads the input sequence and produces a rich representation of it. The decoder generates the output sequence one token at a time, using both the encoder’s representation and the tokens it has already generated.
Figure 1: The full Transformer architecture. The left half is the encoder: input tokens are embedded and combined with positional encodings, then pass through N=6 identical layers, each containing multi-head self-attention followed by a feed-forward network. The right half is the decoder: it has the same structure but adds a cross-attention layer that looks at the encoder output, and its self-attention is masked to prevent looking at future tokens. Residual connections (the arrows skipping around each sub-layer) and layer normalization (“Add & Norm”) stabilize training.
Encoder. The encoder is a stack of \(N = 6\) identical layers. Each layer has two sub-layers: (1) a multi-head self-attention mechanism and (2) a position-wise feed-forward network. Each sub-layer is wrapped with a residual connection (the input is added to the sub-layer’s output) followed by layer normalization. Formally, the output of each sub-layer is \(\text{LayerNorm}(x + \text{Sublayer}(x))\). All sub-layers and embedding layers produce outputs of dimension \(d_\text{model} = 512\).
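The residual-plus-normalization wrapper can be sketched in a few lines of NumPy – a minimal illustration, omitting the learned gain and bias parameters that a full layer normalization includes:

```python
import numpy as np

d_model = 512

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_wrap(x, sublayer):
    # LayerNorm(x + Sublayer(x)): residual connection, then normalization.
    return layer_norm(x + sublayer(x))

x = np.random.randn(10, d_model)          # 10 positions, 512 dims each
y = sublayer_wrap(x, lambda h: 0.5 * h)   # placeholder sub-layer
```

Because the residual path carries \(x\) through unchanged, gradients can flow around each sub-layer, which is part of what makes stacking six layers stable.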
Decoder. The decoder is also a stack of \(N = 6\) identical layers, but each layer has three sub-layers: (1) masked multi-head self-attention, (2) multi-head attention over the encoder output (called “cross-attention”), and (3) a position-wise feed-forward network. The masking in the first sub-layer prevents position \(i\) from attending to positions greater than \(i\) – this ensures that predictions for each position depend only on previously generated tokens, preserving the auto-regressive property (the model generates one token at a time, left to right).
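The mask itself is simple: before the softmax, every illegal (future) position is set to \(-\infty\), so its attention weight becomes exactly zero. A minimal NumPy sketch, with random scores standing in for real query-key dot products:

```python
import numpy as np

n = 5  # sequence length
# -inf strictly above the diagonal: position i may not see positions j > i.
mask = np.triu(np.full((n, n), -np.inf), k=1)

scores = np.random.randn(n, n) + mask              # masked attention scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
# Row i now assigns zero weight to every future position.
```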
Scaled Dot-Product Attention. The core computation in the Transformer is scaled dot-product attention. Each position produces three vectors: a query (what am I looking for?), a key (what do I contain?), and a value (what information do I provide?). The attention output for a position is a weighted sum of all values, where the weight for each value is determined by how well the corresponding key matches the query. The weights are computed by taking dot products between the query and all keys, scaling by \(1/\sqrt{d_k}\) to prevent the dot products from growing too large, and applying a softmax function to obtain a probability distribution.
Figure 2: The computation flow of Scaled Dot-Product Attention. Queries (Q) and Keys (K) are multiplied together (MatMul), scaled by \(1/\sqrt{d_k}\), optionally masked (for the decoder, to block future positions), passed through a softmax to get attention weights, and finally multiplied with Values (V) to produce the output. Every operation here is a matrix operation that runs efficiently on GPUs.
Multi-Head Attention. Rather than performing a single attention function with \(d_\text{model}\)-dimensional queries, keys, and values, the model projects them into \(h = 8\) separate lower-dimensional spaces (\(d_k = d_v = d_\text{model}/h = 64\)), runs attention in parallel on each projection, concatenates the results, and applies a final linear projection. This lets different heads learn to attend to different types of relationships.
Figure 3: Multi-Head Attention. The inputs V, K, Q are each projected through h separate learned linear transformations (shown as stacked “Linear” boxes), producing h sets of lower-dimensional queries, keys, and values. Scaled dot-product attention runs on each set in parallel, the results are concatenated, and a final linear layer maps back to the original dimension. With h=8 heads and d_model=512, each head operates on 64-dimensional vectors.
Positional Encoding. Since the Transformer processes all positions simultaneously (unlike an RNN, which processes them in order), it has no inherent notion of word order. The paper injects position information by adding sinusoidal signals to the input embeddings. Even-indexed dimensions use sine functions and odd-indexed dimensions use cosine functions, with wavelengths forming a geometric progression from \(2\pi\) to \(10000 \cdot 2\pi\). The authors chose sinusoidal over learned positional embeddings because the sinusoidal version can potentially generalize to sequence lengths longer than those seen during training, since for any fixed offset \(k\), \(PE_{pos+k}\) can be expressed as a linear function of \(PE_{pos}\).
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
What it means: For each query position, compute how similar it is to every key position (via dot product), scale the result to prevent extreme values, convert to probabilities with softmax, then take a weighted average of the value vectors. The output is an \(n \times d_v\) matrix – one output vector per query position.
Why it matters: This single equation is the computational heart of the Transformer. It replaces the recurrent computation of RNNs with a parallelizable matrix multiplication. The scaling by \(\sqrt{d_k}\) is critical: without it, when \(d_k\) is large (e.g., 64), the dot products grow in variance proportionally to \(d_k\) (assuming unit-variance components), pushing the softmax into regions with near-zero gradients where the model cannot learn.
Worked example: Let our input be the 3-word sentence [“the”, “cat”, “sat”] with \(d_k = 4\). Suppose after linear projection, we have:
\(Q = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 0 & 0 \end{bmatrix}\), \(K = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 \end{bmatrix}\), \(V = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \\ 0.5 & 0.6 \end{bmatrix}\)
Then \(QK^T = \begin{bmatrix} 1 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 1 \end{bmatrix}\), and \(QK^T / \sqrt{4} = \begin{bmatrix} 0.5 & 0.5 & 1.0 \\ 0.5 & 0.5 & 0.0 \\ 1.0 & 0.0 & 0.5 \end{bmatrix}\).
Applying softmax to each row gives weights like \([0.27, 0.27, 0.45]\) for the first row – meaning “the” attends most strongly to “sat”. Multiplying by \(V\) produces the weighted combination of value vectors.
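The arithmetic of the worked example can be checked directly with NumPy – a minimal sketch of the attention computation using the matrices above, not the paper’s actual code:

```python
import numpy as np

Q = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]], dtype=float)
K = np.array([[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]], dtype=float)
V = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # scaled dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

out, w = attention(Q, K, V)
print(np.round(w, 2))   # first row ≈ [0.27, 0.27, 0.45]
```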
\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O\]
\[\text{where } \text{head}_i = \text{Attention}(QW^Q_i, KW^K_i, VW^V_i)\]
What it means: Instead of running one attention function on the full 512-dimensional vectors, project the inputs into 8 separate 64-dimensional spaces, run attention independently in each, concatenate the 8 outputs (back to 512 dimensions), and apply one more linear transformation.
Why it matters: A single attention head computes a single set of attention weights, which means it can only capture one type of relationship at a time. Multi-head attention allows the model to simultaneously capture different types of relationships – for example, one head might learn to attend to syntactic dependencies (subject-verb agreement) while another attends to semantic relationships (what the sentence is about). The total computation cost is similar to single-head attention because each head operates on smaller dimensions (\(d_k = d_\text{model}/h\)).
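The project-per-head structure can be sketched as follows. Randomly initialized weights stand in for learned parameters, and the init scale is an arbitrary choice for the sketch; a real implementation would batch all heads into single matrix multiplies:

```python
import numpy as np

d_model, h, n = 512, 8, 10
d_k = d_model // h   # 64 dimensions per head
rng = np.random.default_rng(0)

# Per-head projection matrices (learned in a real model).
scale = d_model ** -0.5
Wq = rng.normal(size=(h, d_model, d_k)) * scale
Wk = rng.normal(size=(h, d_model, d_k)) * scale
Wv = rng.normal(size=(h, d_model, d_k)) * scale
Wo = rng.normal(size=(h * d_k, d_model)) * scale

def attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

x = rng.normal(size=(n, d_model))
heads = [attention(x @ Wq[i], x @ Wk[i], x @ Wv[i]) for i in range(h)]
out = np.concatenate(heads, axis=-1) @ Wo   # concat heads, project back
```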
\[\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2\]
What it means: At each position independently, expand the 512-dimensional representation to 2048 dimensions, apply a nonlinear activation, then compress back to 512 dimensions. The same function is applied at every position, but different layers use different weight matrices.
Why it matters: Attention alone is a linear operation on the values (it computes weighted sums). The feed-forward network introduces nonlinearity, which is essential for the model to learn complex transformations. The expansion to 4x the model dimension (\(2048 = 4 \times 512\)) gives the network a larger internal workspace. This pattern of “expand, activate, compress” is used in virtually all subsequent Transformer variants.
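The position-wise FFN is just two matrix multiplies with a ReLU in between, applied identically at every position. A sketch with random weights (the small init scale is an arbitrary choice for illustration):

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

def ffn(x):
    # Expand to 2048 dims, ReLU, compress back to 512 – per position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

y = ffn(rng.normal(size=(10, d_model)))   # shape (10, 512)
```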
\[PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_\text{model}}}\right)\]
\[PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_\text{model}}}\right)\]
What it means: Each position gets a unique 512-dimensional vector added to its embedding. Even dimensions use sine and odd dimensions use cosine, with the frequency decreasing as the dimension index increases. The first few dimensions oscillate rapidly (high frequency, distinguishing nearby positions) while later dimensions oscillate slowly (low frequency, encoding coarser position information). The wavelengths form a geometric progression from \(2\pi\) (dimension 0) to \(10000 \cdot 2\pi\) (the last dimension).
Why it matters: Without positional encoding, the Transformer would treat “the cat sat on the mat” identically to “mat the on sat cat the” – it has no concept of word order. The sinusoidal encoding was chosen because for any fixed offset \(k\), \(PE_{pos+k}\) is a linear function of \(PE_{pos}\). This means the model can learn to attend to relative positions (e.g., “two words ahead”) rather than only absolute positions.
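The full encoding table can be generated in a few lines of NumPy, matching the formulas above:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]         # frequency index
    angle = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)   # even dimensions
    pe[:, 1::2] = np.cos(angle)   # odd dimensions
    return pe

pe = positional_encoding(100, 512)
# Every value lies in [-1, 1], so the table can be added to embeddings.
```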
\[lrate = d_\text{model}^{-0.5} \cdot \min(step\_num^{-0.5}, \; step\_num \cdot warmup\_steps^{-1.5})\]
What it means: The learning rate increases linearly during the first 4000 steps (the warmup phase), then decreases proportionally to the inverse square root of the step number. During warmup, the formula simplifies to \(lrate \approx d_\text{model}^{-0.5} \cdot step\_num \cdot warmup\_steps^{-1.5}\), a line starting from near zero. After warmup, it simplifies to \(lrate \approx d_\text{model}^{-0.5} \cdot step\_num^{-0.5}\), a gradually decreasing curve.
Why it matters: The warmup is important because the Adam optimizer’s second-moment estimates are inaccurate at the start of training (they are initialized to zero). Large learning rates in this phase cause unstable updates. The gradual warmup lets the optimizer’s statistics stabilize before taking large steps. This learning rate schedule became standard practice in subsequent Transformer-based models.
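The schedule is one line of code (with warmup_steps = 4000, as in the paper):

```python
def lrate(step, d_model=512, warmup_steps=4000):
    # Linear warmup, then decay proportional to 1/sqrt(step).
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The two branches of min() meet at step == warmup_steps, where the
# learning rate peaks (about 7e-4 for d_model = 512).
peak = lrate(4000)
```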
The Transformer was evaluated primarily on machine translation, using the BLEU score (a standard metric that measures how closely machine-translated text matches human reference translations, on a scale where higher is better).
| Model | EN-DE BLEU | EN-FR BLEU | Training Cost (FLOPs, floating-point operations) |
|---|---|---|---|
| ByteNet | 23.75 | – | – |
| GNMT + RL | 24.6 | 39.92 | \(2.3 \times 10^{19}\) (EN-DE) |
| ConvS2S | 25.16 | 40.46 | \(9.6 \times 10^{18}\) (EN-DE) |
| MoE | 26.03 | 40.56 | \(2.0 \times 10^{19}\) (EN-DE) |
| GNMT + RL Ensemble | 26.30 | 41.16 | \(1.8 \times 10^{20}\) (EN-DE) |
| ConvS2S Ensemble | 26.36 | 41.29 | \(7.7 \times 10^{19}\) (EN-DE) |
| Transformer (base) | 27.3 | 38.1 | \(3.3 \times 10^{18}\) |
| Transformer (big) | 28.4 | 41.8 | \(2.3 \times 10^{19}\) |
On English-to-German (WMT 2014), the big Transformer achieves 28.4 BLEU, surpassing all previous models – including ensembles (multiple models combined) – by more than 2 BLEU points. The base Transformer already beats all prior single models and ensembles at a training cost of just \(3.3 \times 10^{18}\) FLOPs, roughly 3-60x cheaper than competing approaches. On English-to-French, the big Transformer reaches 41.8 BLEU, a new single-model record, trained in 3.5 days on 8 GPUs – less than 1/4 the training cost of the previous best model.
The paper also tested the Transformer on English constituency parsing (analyzing the grammatical structure of sentences), a task very different from translation. With no task-specific tuning, a 4-layer Transformer achieved 91.3 F1 (a metric combining precision and recall, where 100 is perfect) in the WSJ-only setting and 92.7 F1 in the semi-supervised setting, competitive with the best specialized parsers. This demonstrated that the architecture generalizes beyond translation.
Ablation experiments (Table 3 in the paper) revealed several insights: single-head attention performs 0.9 BLEU worse than 8 heads; reducing key dimension \(d_k\) hurts quality (dot-product compatibility needs sufficient dimensions); bigger models improve quality; and dropout is critical for preventing overfitting. Learned positional embeddings performed nearly identically to sinusoidal encodings.
The Transformer is arguably the most consequential single paper in modern artificial intelligence. Its architecture became the foundation for virtually all major language models that followed: GPT (decoder-only Transformer for language generation), BERT (encoder-only Transformer for language understanding), T5 (encoder-decoder Transformer for text-to-text tasks), and the entire lineage of large language models including GPT-3, GPT-4, PaLM, LLaMA, and Claude.
The impact extends far beyond language. The Vision Transformer (ViT) adapted the architecture to images by treating image patches as tokens. Diffusion models for image generation use Transformer-based architectures for their denoising networks. Transformers have been applied to protein structure prediction (AlphaFold), code generation (Codex, GitHub Copilot), music, robotics, and scientific simulation.
The paper’s deeper contribution is conceptual: it demonstrated that attention is not merely a useful add-on to recurrence, but a complete replacement for it. This challenged the assumption that sequential processing is necessary for sequential data. The resulting architecture has favorable scaling properties – larger models trained on more data consistently improve – which enabled the “scaling laws” research that drove the development of increasingly powerful foundation models. The Transformer’s design, with its modular layers, residual connections, and parallelizable computation, became the template for building neural networks at scales that were previously impractical.
To fully understand this paper, you should be comfortable with:
These papers build on ideas introduced here. You will encounter them later in the collection.