Authors: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, et al. Year: 2020. Source: arXiv:2010.11929
A standard Transformer architecture, applied directly to sequences of image patches instead of words, matches or beats the best convolutional neural networks on image classification when trained on enough data.
Before this paper, the dominant approach to image recognition was the convolutional neural network (CNN) – a type of neural network that slides small learned filters across an image to detect patterns like edges, textures, and shapes. CNNs have a built-in structural assumption (called an “inductive bias”) that nearby pixels matter more than distant ones, and that the same pattern is useful regardless of where it appears in the image. These assumptions are helpful, especially when training data is limited, because they constrain what the model needs to learn.
Meanwhile, in natural language processing (NLP), a different architecture called the Transformer (see Attention Is All You Need) had become dominant. Transformers use a mechanism called “self-attention” that lets every element in a sequence look at every other element, learning which pairs matter most. This made them extremely good at capturing long-range dependencies in text. Researchers wondered: could Transformers work for images too?
The problem was scaling. A 224x224 pixel image has 50,176 pixels. Self-attention computes relationships between every pair of elements, so the cost grows with the square of the sequence length – making it prohibitively expensive to treat every pixel as a token. Previous attempts either applied attention only locally (defeating the purpose), used specialized sparse attention patterns (requiring complex engineering), or combined attention with CNNs (never fully replacing them). No one had shown that a pure, standard Transformer could compete with CNNs on large-scale image recognition.
Think of how you might describe a photograph to someone over the phone. You would not describe it pixel by pixel – instead, you would describe it in chunks: “there’s a dog in the lower-left,” “there’s grass underneath,” “blue sky at the top.” The Vision Transformer (ViT) works the same way. Instead of feeding individual pixels into the Transformer (which would be computationally infeasible), it chops the image into a grid of fixed-size patches – for example, 16x16 pixel squares – and treats each patch as a single “word” in a sequence. A 224x224 image becomes just 196 patches, a manageable sequence length for a Transformer.
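The sequence-length arithmetic can be checked directly. Self-attention's pairwise cost is what makes per-pixel tokens infeasible and per-patch tokens manageable:

```python
# Pairwise self-attention cost grows with the square of the sequence length.
pixels = 224 * 224             # one token per pixel -> 50,176 tokens
patches = (224 // 16) ** 2     # one token per 16x16 patch -> 196 tokens

pixel_pairs = pixels ** 2      # ~2.5 billion pairwise interactions
patch_pairs = patches ** 2     # 38,416 pairwise interactions

print(pixel_pairs // patch_pairs)  # 65536: patching cuts attention cost by ~65,000x
```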
Each patch is flattened into a vector and linearly projected into an embedding space, just like how words in NLP are converted to embeddings before being fed to a Transformer. A special learnable “classification token” is prepended to the sequence (borrowed from BERT, see BERT), and learnable position embeddings are added so the model knows where each patch came from in the original image. The resulting sequence goes through a standard Transformer encoder, and the output corresponding to the classification token is used to predict the image class.
The key insight is that this approach is deliberately minimal – ViT makes almost no assumptions about the structure of images. It does not assume nearby pixels are more related, or that patterns are translation-invariant. Instead, it relies on large-scale pre-training (on datasets with 14 million to 300 million images) to learn these patterns from data. The paper shows that when you have enough data, learning the right visual patterns from scratch beats having them built into the architecture.
The ViT pipeline starts by dividing an input image of size \(H \times W\) with \(C\) color channels into a grid of non-overlapping patches, each of size \(P \times P\) pixels. For a standard 224x224 RGB image with \(P = 16\), this produces \(N = 224^2 / 16^2 = 196\) patches. Each patch is a \(16 \times 16 \times 3 = 768\)-dimensional vector when flattened.
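The patch extraction can be sketched in a few lines of plain numpy (illustrative only; the variable names are mine, not the paper's):

```python
import numpy as np

# Chop a 224x224x3 image into 196 non-overlapping 16x16 patches and
# flatten each into a 768-dimensional vector.
H, W, C, P = 224, 224, 3, 16
image = np.random.rand(H, W, C)

patches = image.reshape(H // P, P, W // P, P, C)   # (14, 16, 14, 16, 3)
patches = patches.transpose(0, 2, 1, 3, 4)         # (14, 14, 16, 16, 3)
patches = patches.reshape(-1, P * P * C)           # (196, 768)

print(patches.shape)  # (196, 768)
```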
These flattened patches are projected into a \(D\)-dimensional embedding space using a trainable linear layer (a simple matrix multiplication). To this sequence, the model prepends a learnable “class embedding” vector – think of it as a blank ticket that collects information about the whole image as it passes through the Transformer layers. Learnable position embeddings (one per patch, plus one for the class token) are added element-wise to tell the model the spatial arrangement of patches.
The Transformer encoder consists of \(L\) identical layers stacked on top of each other. Each layer has two sub-blocks. The first sub-block applies multi-head self-attention (MSA): every patch looks at every other patch and computes attention scores that determine how much information to gather from each. Before this operation, a layer normalization (LayerNorm, or LN – a technique that normalizes the values in each vector to have zero mean and unit variance) is applied. The output of the attention is added back to the input via a residual connection (a shortcut that adds the original input to the processed output, helping with training stability).
The second sub-block applies a two-layer feed-forward network (called an MLP) with a GELU activation function (a smooth, non-linear function similar to ReLU). Again, LayerNorm is applied first, and a residual connection is used. This structure – LayerNorm, attention, residual, LayerNorm, MLP, residual – is repeated \(L\) times.
After all \(L\) layers, the output vector corresponding to the class token position is extracted, passed through a final LayerNorm, and fed to a classification head (a small network) that outputs the predicted class.
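Putting these pieces together, one encoder layer can be sketched in plain numpy. This is a shape-level illustration with random stand-in weights (function names like `encoder_layer` are mine), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N1, D, heads, mlp_dim = 197, 768, 12, 3072   # sequence length, dims for ViT-Base
Dh = D // heads                              # per-head dimension (64)

def layer_norm(x, eps=1e-6):
    # Normalize each vector to zero mean and unit variance.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def msa(z, Wqkv, Wo):
    # Multi-head self-attention: each head attends over the full sequence.
    q, k, v = np.split(z @ Wqkv, 3, axis=-1)          # each (N1, D)
    outs = []
    for h in range(heads):
        qh, kh, vh = (m[:, h*Dh:(h+1)*Dh] for m in (q, k, v))
        A = softmax(qh @ kh.T / np.sqrt(Dh))          # (N1, N1) attention weights
        outs.append(A @ vh)
    return np.concatenate(outs, axis=-1) @ Wo         # project back to D

def mlp(z, W1, W2):
    # Two-layer feed-forward network with a (tanh-approximated) GELU.
    h = z @ W1
    gelu = 0.5 * h * (1 + np.tanh(np.sqrt(2/np.pi) * (h + 0.044715 * h**3)))
    return gelu @ W2

def encoder_layer(z, params):
    Wqkv, Wo, W1, W2 = params
    z = msa(layer_norm(z), Wqkv, Wo) + z   # Eq. 2: pre-LN attention + residual
    z = mlp(layer_norm(z), W1, W2) + z     # Eq. 3: pre-LN MLP + residual
    return z

params = (rng.normal(0, 0.02, (D, 3*D)), rng.normal(0, 0.02, (D, D)),
          rng.normal(0, 0.02, (D, mlp_dim)), rng.normal(0, 0.02, (mlp_dim, D)))
z = rng.normal(size=(N1, D))
print(encoder_layer(z, params).shape)  # (197, 768)
```

Stacking this layer \(L\) times (12, 24, or 32, depending on model size) gives the full encoder.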
The model comes in three sizes mirroring BERT’s configurations:
| Model | Layers | Hidden size \(D\) | MLP size | Heads | Parameters |
|---|---|---|---|---|---|
| ViT-Base | 12 | 768 | 3072 | 12 | 86M |
| ViT-Large | 24 | 1024 | 4096 | 16 | 307M |
| ViT-Huge | 32 | 1280 | 5120 | 16 | 632M |
The naming convention includes the patch size: “ViT-L/16” means the Large model with 16x16 patches. Smaller patches produce longer sequences and cost more compute.
Pre-training and fine-tuning: ViT is first pre-trained on a large dataset (ImageNet-21k with 14M images, or JFT-300M with 303M images) for image classification. For fine-tuning on a smaller downstream task, the pre-trained classification head is removed and replaced with a new zero-initialized linear layer sized for the target number of classes. Fine-tuning is often done at a higher resolution than pre-training (e.g., 384x384 instead of 224x224), which increases the number of patches. The pre-trained position embeddings are 2D-interpolated to handle the new grid size – this is one of only two points where ViT acknowledges the 2D structure of images.
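The position-embedding resize can be sketched as a hand-rolled bilinear interpolation in plain numpy (for example, fine-tuning at 384x384 with \(P = 16\) turns the 14x14 grid into 24x24); this is an illustration of the idea, not the paper's actual resize code:

```python
import numpy as np

def interp_axis(arr, new_len):
    # Linearly interpolate along the first axis to `new_len` samples.
    old_len = arr.shape[0]
    coords = np.linspace(0, old_len - 1, new_len)
    i0 = np.floor(coords).astype(int)
    i1 = np.minimum(i0 + 1, old_len - 1)
    w = (coords - i0).reshape(-1, *([1] * (arr.ndim - 1)))
    return arr[i0] * (1 - w) + arr[i1] * w

D = 768
pos = np.random.rand(14 * 14, D)        # pre-trained patch position embeddings
pos_2d = pos.reshape(14, 14, D)         # recover the 2D grid
rows = interp_axis(pos_2d, 24)          # interpolate rows: 14 -> 24
grid = interp_axis(rows.transpose(1, 0, 2), 24).transpose(1, 0, 2)  # then columns
pos_new = grid.reshape(24 * 24, D)      # back to a 1D sequence

print(pos_new.shape)  # (576, 768)
```

The class token's embedding has no spatial position, so it is kept as-is and concatenated back in front of the interpolated grid.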
The entire ViT forward pass is described by four equations. Let us walk through each one using a concrete example: a 224x224 RGB image with patch size \(P = 16\), giving \(N = 196\) patches and embedding dimension \(D = 768\).
Equation 1: Patch embedding and position encoding
\[z_0 = [x_\text{class};\; x_p^1 E;\; x_p^2 E;\; \cdots;\; x_p^N E] + E_{pos}\]
where \(E \in \mathbb{R}^{(P^2 \cdot C) \times D}\) and \(E_{pos} \in \mathbb{R}^{(N+1) \times D}\).
In plain language: flatten each image patch, multiply by a learned matrix to get an embedding, prepend the class token, then add position information. The output is a sequence of 197 vectors, each 768-dimensional. This equation converts a 2D image into a 1D sequence of embeddings that a standard Transformer can process.
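Equation 1 maps onto code almost symbol for symbol. Here is a sketch with random stand-in parameters (in a real model, \(x_\text{class}\), \(E\), and \(E_{pos}\) are all learned):

```python
import numpy as np

rng = np.random.default_rng(2)
N, P2C, D = 196, 768, 768                # patches, flattened patch dim, embed dim

x_p = rng.normal(size=(N, P2C))          # flattened image patches
E = rng.normal(0, 0.02, (P2C, D))        # patch projection matrix
x_class = rng.normal(0, 0.02, (1, D))    # learnable class token
E_pos = rng.normal(0, 0.02, (N + 1, D))  # position embeddings (patches + class token)

# z_0 = [x_class; x_p^1 E; ...; x_p^N E] + E_pos
z0 = np.concatenate([x_class, x_p @ E], axis=0) + E_pos
print(z0.shape)  # (197, 768)
```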
Equation 2: Multi-head self-attention with residual connection
\[z'_\ell = \operatorname{MSA}(\operatorname{LN}(z_{\ell-1})) + z_{\ell-1}, \quad \ell = 1 \ldots L\]
This equation says: normalize the previous layer’s output, run self-attention so each patch can gather information from all other patches, then add the result back to the input. The residual connection ensures the model can learn small refinements on top of what it already knows.
Equation 3: MLP with residual connection
\[z_\ell = \operatorname{MLP}(\operatorname{LN}(z'_\ell)) + z'_\ell, \quad \ell = 1 \ldots L\]
This equation says: normalize the attention output, run it through a two-layer network that processes each position independently, then add back the input. While attention mixes information across positions, the MLP processes each position’s representation individually, acting like a per-patch feature transformation.
Equation 4: Final classification output
\[y = \operatorname{LN}(z_L^0)\]
This extracts the class token’s vector from the final layer and normalizes it. This single vector summarizes the entire image and is fed to the classification head.
Self-attention (from Appendix A): How attention weights are computed
\[[q, k, v] = z U_{qkv}, \quad U_{qkv} \in \mathbb{R}^{D \times 3D_h}\]
\[A = \operatorname{softmax}(qk^\top / \sqrt{D_h}), \quad A \in \mathbb{R}^{(N+1) \times (N+1)}\]
\[\operatorname{SA}(z) = Av\]
For multi-head self-attention with \(k\) heads, the model runs \(k\) independent attention operations in parallel, each with \(D_h = D/k\) dimensions, and concatenates the results:
\[\operatorname{MSA}(z) = [\operatorname{SA}_1(z); \operatorname{SA}_2(z); \cdots; \operatorname{SA}_k(z)] \, U_{msa}, \quad U_{msa} \in \mathbb{R}^{k \cdot D_h \times D}\]
Multiple heads let the model attend to different types of relationships simultaneously – one head might focus on color similarity, another on spatial proximity, another on texture patterns.
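The single-head equations transcribe directly into numpy (random weights, for checking shapes only):

```python
import numpy as np

rng = np.random.default_rng(1)
N1, D, Dh = 197, 768, 64                 # sequence length, model dim, head dim

z = rng.normal(size=(N1, D))
U_qkv = rng.normal(0, 0.02, (D, 3 * Dh))

q, k, v = np.split(z @ U_qkv, 3, axis=-1)         # [q, k, v] = z U_qkv
scores = q @ k.T / np.sqrt(Dh)                    # scaled dot products
A = np.exp(scores - scores.max(-1, keepdims=True))
A = A / A.sum(-1, keepdims=True)                  # A = softmax(q k^T / sqrt(D_h))
sa = A @ v                                        # SA(z) = A v

print(A.shape, sa.shape)  # (197, 197) (197, 64)
```

Each row of `A` sums to 1: it is a probability distribution over which positions that token gathers information from.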
ViT achieves state-of-the-art results on multiple image classification benchmarks when pre-trained on large datasets, while using substantially less compute than competing CNNs.
| Model | ImageNet | ImageNet ReaL | CIFAR-100 | VTAB (19 tasks) | TPUv3-core-days |
|---|---|---|---|---|---|
| ViT-H/14 (JFT) | 88.55% | 90.72% | 94.55% | 77.63% | 2,500 |
| ViT-L/16 (JFT) | 87.76% | 90.54% | 93.90% | 76.28% | 680 |
| BiT-L (ResNet152x4) | 87.54% | 90.54% | 93.51% | 76.29% | 9,900 |
| Noisy Student (EfficientNet-L2) | 88.4% | 90.55% | – | – | 12,300 |
The critical finding is not just accuracy but efficiency. ViT-L/16 matches BiT-L on nearly every benchmark while using roughly 15 times less compute to pre-train (680 vs. 9,900 TPUv3-core-days). Even ViT-H/14, the largest model, uses roughly 5 times less compute than Noisy Student (2,500 vs. 12,300) while achieving competitive or better accuracy.
However, ViT’s performance depends heavily on the size of the pre-training dataset. When pre-trained only on ImageNet (1.3M images), ViT-Large actually underperforms ViT-Base – the larger model overfits because it lacks the inductive biases that CNNs have built in. With ImageNet-21k (14M images), the two sizes perform similarly. Only with JFT-300M (303M images) does the larger model clearly win. ResNets, by contrast, perform relatively well even on smaller datasets thanks to their built-in assumptions about image structure. This reveals a fundamental tradeoff: CNNs encode useful assumptions that help with limited data, but those same assumptions become a ceiling when data is abundant enough for the model to discover even better patterns on its own.
Figure 1: Position embedding similarity for ViT-L/16. Each cell in the 14x14 grid shows the cosine similarity between one patch’s learned position embedding and all others. Nearby patches (bright spots near each cell’s center) have more similar embeddings, and row-column structure is visible – the model learns 2D spatial relationships from 1D position indices alone, without being told that images have rows and columns.
The model’s internal representations reveal that ViT learns meaningful spatial structure despite receiving no explicit 2D information. The learned position embeddings (shown above) encode distance relationships: patches that are close in the image have similar position embeddings, and the grid structure of rows and columns emerges automatically. Attention distance analysis shows that some heads in the earliest layers attend globally across the entire image (something impossible for early CNN layers), while others attend locally – suggesting the model learns to use both fine-grained and coarse-grained processing.
Figure 2: An input image shown to ViT-L/16. The model must classify this image, and we can visualize which parts of the image the model attends to.
Figure 3: The attention map for the same image, computed by recursively multiplying attention weight matrices across all layers (the “Attention Rollout” method). Bright areas indicate where the model focuses most. ViT learns to attend to the semantically relevant object (the dog) and suppress the background, demonstrating that it discovers meaningful visual features without any built-in spatial assumptions.
The Vision Transformer fundamentally changed computer vision. Before ViT, the field assumed that convolutional inductive biases were essential for visual recognition, and Transformers were considered a language-specific architecture. ViT showed that with sufficient data, a generic sequence-processing architecture could match or beat purpose-built vision models. This opened the floodgates for Transformer-based vision research.
In the years following, ViT’s approach was extended in every direction. DeiT (Data-efficient Image Transformers) showed that with better training recipes, ViT could perform well even without massive datasets. Swin Transformer introduced hierarchical feature maps and shifted windows to make ViT practical for dense prediction tasks. DINO and MAE (Masked Autoencoders) closed the self-supervised gap the paper identified, eventually making self-supervised ViT pre-training competitive with supervised pre-training. ViT also became the visual encoder in multimodal models like CLIP, which aligned image and text representations and powered zero-shot image classification.
Perhaps ViT’s most lasting contribution is conceptual: it demonstrated that scale and data can substitute for domain-specific architectural assumptions. This finding, combined with similar observations in NLP (see Improving Language Understanding by Generative Pre-Training and the scaling laws literature, see Scaling Laws for Neural Language Models), cemented the trend toward large, general-purpose models pre-trained on massive datasets – the paradigm that now dominates both vision and language.
To understand this paper, the reader should be familiar with: