Authors: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, et al. Year: 2020. Source: arXiv:2010.11929
A standard Transformer architecture, applied directly to sequences of image patches instead of words, matches or beats the best convolutional neural networks on image classification when trained on enough data.
Before this paper, the dominant approach to image recognition was the convolutional neural network (CNN) – a type of neural network that slides small learned filters across an image to detect patterns like edges, textures, and shapes. CNNs have a built-in structural assumption (called an “inductive bias”) that nearby pixels matter more than distant ones, and that the same pattern is useful regardless of where it appears in the image. These assumptions are helpful, especially when training data is limited, because they constrain what the model needs to learn.
Meanwhile, in natural language processing (NLP), a different architecture called the Transformer (see Attention Is All You Need) had become dominant. Transformers use a mechanism called “self-attention” that lets every element in a sequence look at every other element, learning which pairs matter most. This made them extremely good at capturing long-range dependencies in text. Researchers wondered: could Transformers work for images too?
The problem was scaling. A 224x224 pixel image has 50,176 pixels. Self-attention computes relationships between every pair of elements, so the cost grows with the square of the sequence length – making it prohibitively expensive to treat every pixel as a token. Previous attempts either applied attention only locally (defeating the purpose), used specialized sparse attention patterns (requiring complex engineering), or combined attention with CNNs (never fully replacing them). No one had shown that a pure, standard Transformer could compete with CNNs on large-scale image recognition.
Think of how you might describe a photograph to someone over the phone. You would not describe it pixel by pixel – instead, you would describe it in chunks: “there’s a dog in the lower-left,” “there’s grass underneath,” “blue sky at the top.” The Vision Transformer (ViT) works the same way. Instead of feeding individual pixels into the Transformer (which would be computationally infeasible), it chops the image into a grid of fixed-size patches – for example, 16x16 pixel squares – and treats each patch as a single “word” in a sequence. A 224x224 image becomes just 196 patches, a manageable sequence length for a Transformer.
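The sequence-length arithmetic can be checked directly. Self-attention's pairwise cost is what makes per-pixel tokens infeasible and per-patch tokens manageable:

```python
# Pairwise self-attention cost grows with the square of the sequence length.
pixels = 224 * 224             # one token per pixel -> 50,176 tokens
patches = (224 // 16) ** 2     # one token per 16x16 patch -> 196 tokens

pixel_pairs = pixels ** 2      # ~2.5 billion pairwise interactions
patch_pairs = patches ** 2     # 38,416 pairwise interactions

print(pixel_pairs // patch_pairs)  # 65536: patching cuts attention cost by ~65,000x
```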
Each patch is flattened into a vector and linearly projected into an embedding space, just like how words in NLP are converted to embeddings before being fed to a Transformer. A special learnable “classification token” is prepended to the sequence (borrowed from BERT, see BERT), and learnable position embeddings are added so the model knows where each patch came from in the original image. The resulting sequence goes through a standard Transformer encoder, and the output corresponding to the classification token is used to predict the image class.
The key insight is that this approach is deliberately minimal – ViT makes almost no assumptions about the structure of images. It does not assume nearby pixels are more related, or that patterns are translation-invariant. Instead, it relies on large-scale pre-training (on datasets with 14 million to 300 million images) to learn these patterns from data. The paper shows that when you have enough data, learning the right visual patterns from scratch beats having them built into the architecture.
The ViT pipeline starts by dividing an input image of size \(H \times W\) with \(C\) color channels into a grid of non-overlapping patches, each of size \(P \times P\) pixels. For a standard 224x224 RGB image with \(P = 16\), this produces \(N = 224^2 / 16^2 = 196\) patches. Each patch is a \(16 \times 16 \times 3 = 768\)-dimensional vector when flattened.
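The patch extraction can be sketched in a few lines of plain numpy (illustrative only; the variable names are mine, not the paper's):

```python
import numpy as np

# Chop a 224x224x3 image into 196 non-overlapping 16x16 patches and
# flatten each into a 768-dimensional vector.
H, W, C, P = 224, 224, 3, 16
image = np.random.rand(H, W, C)

patches = image.reshape(H // P, P, W // P, P, C)   # (14, 16, 14, 16, 3)
patches = patches.transpose(0, 2, 1, 3, 4)         # (14, 14, 16, 16, 3)
patches = patches.reshape(-1, P * P * C)           # (196, 768)

print(patches.shape)  # (196, 768)
```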
These flattened patches are projected into a \(D\)-dimensional embedding space using a trainable linear layer (a simple matrix multiplication). To this sequence, the model prepends a learnable “class embedding” vector – think of it as a blank ticket that collects information about the whole image as it passes through the Transformer layers. Learnable position embeddings (one per patch, plus one for the class token) are added element-wise to tell the model the spatial arrangement of patches.
The Transformer encoder consists of \(L\) identical layers stacked on top of each other. Each layer has two sub-blocks. The first sub-block applies multi-head self-attention (MSA): every patch looks at every other patch and computes attention scores that determine how much information to gather from each. Before this operation, a layer normalization (LayerNorm, or LN – a technique that normalizes the values in each vector to have zero mean and unit variance) is applied. The output of the attention is added back to the input via a residual connection (a shortcut that adds the original input to the processed output, helping with training stability).
The second sub-block applies a two-layer feed-forward network (called an MLP) with a GELU activation function (a smooth, non-linear function similar to ReLU). Again, LayerNorm is applied first, and a residual connection is used. This structure – LayerNorm, attention, residual, LayerNorm, MLP, residual – is repeated \(L\) times.
After all \(L\) layers, the output vector corresponding to the class token position is extracted, passed through a final LayerNorm, and fed to a classification head (a small network) that outputs the predicted class.
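Putting these pieces together, one encoder layer can be sketched in plain numpy. This is a shape-level illustration with random stand-in weights (function names like `encoder_layer` are mine), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N1, D, heads, mlp_dim = 197, 768, 12, 3072   # sequence length, dims for ViT-Base
Dh = D // heads                              # per-head dimension (64)

def layer_norm(x, eps=1e-6):
    # Normalize each vector to zero mean and unit variance.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def msa(z, Wqkv, Wo):
    # Multi-head self-attention: each head attends over the full sequence.
    q, k, v = np.split(z @ Wqkv, 3, axis=-1)          # each (N1, D)
    outs = []
    for h in range(heads):
        qh, kh, vh = (m[:, h*Dh:(h+1)*Dh] for m in (q, k, v))
        A = softmax(qh @ kh.T / np.sqrt(Dh))          # (N1, N1) attention weights
        outs.append(A @ vh)
    return np.concatenate(outs, axis=-1) @ Wo         # project back to D

def mlp(z, W1, W2):
    # Two-layer feed-forward network with a (tanh-approximated) GELU.
    h = z @ W1
    gelu = 0.5 * h * (1 + np.tanh(np.sqrt(2/np.pi) * (h + 0.044715 * h**3)))
    return gelu @ W2

def encoder_layer(z, params):
    Wqkv, Wo, W1, W2 = params
    z = msa(layer_norm(z), Wqkv, Wo) + z   # Eq. 2: pre-LN attention + residual
    z = mlp(layer_norm(z), W1, W2) + z     # Eq. 3: pre-LN MLP + residual
    return z

params = (rng.normal(0, 0.02, (D, 3*D)), rng.normal(0, 0.02, (D, D)),
          rng.normal(0, 0.02, (D, mlp_dim)), rng.normal(0, 0.02, (mlp_dim, D)))
z = rng.normal(size=(N1, D))
print(encoder_layer(z, params).shape)  # (197, 768)
```

Stacking this layer \(L\) times (12, 24, or 32, depending on model size) gives the full encoder.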
The model comes in three sizes mirroring BERT’s configurations:
| Model | Layers | Hidden size \(D\) | MLP size | Heads | Parameters |
|---|---|---|---|---|---|
| ViT-Base | 12 | 768 | 3072 | 12 | 86M |
| ViT-Large | 24 | 1024 | 4096 | 16 | 307M |
| ViT-Huge | 32 | 1280 | 5120 | 16 | 632M |
The naming convention includes the patch size: “ViT-L/16” means the Large model with 16x16 patches. Smaller patches produce longer sequences and cost more compute.
Pre-training and fine-tuning: ViT is first pre-trained on a large dataset (ImageNet-21k with 14M images, or JFT-300M with 303M images) for image classification. For fine-tuning on a smaller downstream task, the pre-trained classification head is removed and replaced with a new zero-initialized linear layer sized for the target number of classes. Fine-tuning is often done at a higher resolution than pre-training (e.g., 384x384 instead of 224x224), which increases the number of patches. The pre-trained position embeddings are 2D-interpolated to handle the new grid size – this is one of only two points where ViT acknowledges the 2D structure of images.
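The position-embedding resize can be sketched as a hand-rolled bilinear interpolation in plain numpy (for example, fine-tuning at 384x384 with \(P = 16\) turns the 14x14 grid into 24x24); this is an illustration of the idea, not the paper's actual resize code:

```python
import numpy as np

def interp_axis(arr, new_len):
    # Linearly interpolate along the first axis to `new_len` samples.
    old_len = arr.shape[0]
    coords = np.linspace(0, old_len - 1, new_len)
    i0 = np.floor(coords).astype(int)
    i1 = np.minimum(i0 + 1, old_len - 1)
    w = (coords - i0).reshape(-1, *([1] * (arr.ndim - 1)))
    return arr[i0] * (1 - w) + arr[i1] * w

D = 768
pos = np.random.rand(14 * 14, D)        # pre-trained patch position embeddings
pos_2d = pos.reshape(14, 14, D)         # recover the 2D grid
rows = interp_axis(pos_2d, 24)          # interpolate rows: 14 -> 24
grid = interp_axis(rows.transpose(1, 0, 2), 24).transpose(1, 0, 2)  # then columns
pos_new = grid.reshape(24 * 24, D)      # back to a 1D sequence

print(pos_new.shape)  # (576, 768)
```

The class token's embedding has no spatial position, so it is kept as-is and concatenated back in front of the interpolated grid.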
The entire ViT forward pass is described by four equations. Let us walk through each one using a concrete example: a 224x224 RGB image with patch size \(P = 16\), giving \(N = 196\) patches and embedding dimension \(D = 768\).
Equation 1: Patch embedding and position encoding
\[z_0 = [x_\text{class};\; x_p^1 E;\; x_p^2 E;\; \cdots;\; x_p^N E] + E_{pos}\]
where \(E \in \mathbb{R}^{(P^2 \cdot C) \times D}\) and \(E_{pos} \in \mathbb{R}^{(N+1) \times D}\).
In plain language: flatten each image patch, multiply by a learned matrix to get an embedding, prepend the class token, then add position information. The output is a sequence of 197 vectors, each 768-dimensional. This equation converts a 2D image into a 1D sequence of embeddings that a standard Transformer can process.
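Equation 1 maps onto code almost symbol for symbol. Here is a sketch with random stand-in parameters (in a real model, \(x_\text{class}\), \(E\), and \(E_{pos}\) are all learned):

```python
import numpy as np

rng = np.random.default_rng(2)
N, P2C, D = 196, 768, 768                # patches, flattened patch dim, embed dim

x_p = rng.normal(size=(N, P2C))          # flattened image patches
E = rng.normal(0, 0.02, (P2C, D))        # patch projection matrix
x_class = rng.normal(0, 0.02, (1, D))    # learnable class token
E_pos = rng.normal(0, 0.02, (N + 1, D))  # position embeddings (patches + class token)

# z_0 = [x_class; x_p^1 E; ...; x_p^N E] + E_pos
z0 = np.concatenate([x_class, x_p @ E], axis=0) + E_pos
print(z0.shape)  # (197, 768)
```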
Equation 2: Multi-head self-attention with residual connection
\[z'_\ell = \operatorname{MSA}(\operatorname{LN}(z_{\ell-1})) + z_{\ell-1}, \quad \ell = 1 \ldots L\]
This equation says: normalize the previous layer’s output, run self-attention so each patch can gather information from all other patches, then add the result back to the input. The residual connection ensures the model can learn small refinements on top of what it already knows.
Equation 3: MLP with residual connection
\[z_\ell = \operatorname{MLP}(\operatorname{LN}(z'_\ell)) + z'_\ell, \quad \ell = 1 \ldots L\]
This equation says: normalize the attention output, run it through a two-layer network that processes each position independently, then add back the input. While attention mixes information across positions, the MLP processes each position’s representation individually, acting like a per-patch feature transformation.
Equation 4: Final classification output
\[y = \operatorname{LN}(z_L^0)\]
This extracts the class token’s vector from the final layer and normalizes it. This single vector summarizes the entire image and is fed to the classification head.
Self-attention (from Appendix A): How attention weights are computed
\[[q, k, v] = z U_{qkv}, \quad U_{qkv} \in \mathbb{R}^{D \times 3D_h}\]
\[A = \operatorname{softmax}(qk^\top / \sqrt{D_h}), \quad A \in \mathbb{R}^{(N+1) \times (N+1)}\]
\[\operatorname{SA}(z) = Av\]
For multi-head self-attention with \(k\) heads, the model runs \(k\) independent attention operations in parallel, each with \(D_h = D/k\) dimensions, and concatenates the results:
\[\operatorname{MSA}(z) = [\operatorname{SA}_1(z); \operatorname{SA}_2(z); \cdots; \operatorname{SA}_k(z)] \, U_{msa}, \quad U_{msa} \in \mathbb{R}^{k \cdot D_h \times D}\]
Multiple heads let the model attend to different types of relationships simultaneously – one head might focus on color similarity, another on spatial proximity, another on texture patterns.
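The single-head equations transcribe directly into numpy (random weights, for checking shapes only):

```python
import numpy as np

rng = np.random.default_rng(1)
N1, D, Dh = 197, 768, 64                 # sequence length, model dim, head dim

z = rng.normal(size=(N1, D))
U_qkv = rng.normal(0, 0.02, (D, 3 * Dh))

q, k, v = np.split(z @ U_qkv, 3, axis=-1)         # [q, k, v] = z U_qkv
scores = q @ k.T / np.sqrt(Dh)                    # scaled dot products
A = np.exp(scores - scores.max(-1, keepdims=True))
A = A / A.sum(-1, keepdims=True)                  # A = softmax(q k^T / sqrt(D_h))
sa = A @ v                                        # SA(z) = A v

print(A.shape, sa.shape)  # (197, 197) (197, 64)
```

Each row of `A` sums to 1: it is a probability distribution over which positions that token gathers information from.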
ViT achieves state-of-the-art results on multiple image classification benchmarks when pre-trained on large datasets, while using substantially less compute than competing CNNs.
| Model | ImageNet | ImageNet ReaL | CIFAR-100 | VTAB (19 tasks) | TPUv3-core-days |
|---|---|---|---|---|---|
| ViT-H/14 (JFT) | 88.55% | 90.72% | 94.55% | 77.63% | 2,500 |
| ViT-L/16 (JFT) | 87.76% | 90.54% | 93.90% | 76.28% | 680 |
| BiT-L (ResNet152x4) | 87.54% | 90.54% | 93.51% | 76.29% | 9,900 |
| Noisy Student (EfficientNet-L2) | 88.4% | 90.55% | – | – | 12,300 |
The critical finding is not just accuracy but efficiency. ViT-L/16 matches BiT-L on nearly every benchmark while using roughly 15 times less compute to pre-train (680 vs. 9,900 TPUv3-core-days). Even ViT-H/14, the largest model, uses roughly 5 times less compute than Noisy Student (2,500 vs. 12,300) while achieving competitive or better accuracy.
However, ViT’s performance depends heavily on the size of the pre-training dataset. When pre-trained only on ImageNet (1.3M images), ViT-Large actually underperforms ViT-Base – the larger model overfits because it lacks the inductive biases that CNNs have built in. With ImageNet-21k (14M images), the two sizes perform similarly. Only with JFT-300M (303M images) does the larger model clearly win. ResNets, by contrast, perform relatively well even on smaller datasets thanks to their built-in assumptions about image structure. This reveals a fundamental tradeoff: CNNs encode useful assumptions that help with limited data, but those same assumptions become a ceiling when data is abundant enough for the model to discover even better patterns on its own.
Figure 1: Position embedding similarity for ViT-L/16. Each cell in the 14x14 grid shows the cosine similarity between one patch’s learned position embedding and all others. Nearby patches (bright spots near each cell’s center) have more similar embeddings, and row-column structure is visible – the model learns 2D spatial relationships from 1D position indices alone, without being told that images have rows and columns.
The model’s internal representations reveal that ViT learns meaningful spatial structure despite receiving no explicit 2D information. The learned position embeddings (shown above) encode distance relationships: patches that are close in the image have similar position embeddings, and the grid structure of rows and columns emerges automatically. Attention distance analysis shows that some heads in the earliest layers attend globally across the entire image (something impossible for early CNN layers), while others attend locally – suggesting the model learns to use both fine-grained and coarse-grained processing.
Figure 2: An input image shown to ViT-L/16. The model must classify this image, and we can visualize which parts of the image the model attends to.
Figure 3: The attention map for the same image, computed by recursively multiplying attention weight matrices across all layers (the “Attention Rollout” method). Bright areas indicate where the model focuses most. ViT learns to attend to the semantically relevant object (the dog) and suppress the background, demonstrating that it discovers meaningful visual features without any built-in spatial assumptions.
The Vision Transformer fundamentally changed computer vision. Before ViT, the field assumed that convolutional inductive biases were essential for visual recognition, and Transformers were considered a language-specific architecture. ViT showed that with sufficient data, a generic sequence-processing architecture could match or beat purpose-built vision models. This opened the floodgates for Transformer-based vision research.
In the years following, ViT’s approach was extended in every direction. DeiT (Data-efficient Image Transformers) showed that with better training recipes, ViT could perform well even without massive datasets. Swin Transformer introduced hierarchical feature maps and shifted windows to make ViT practical for dense prediction tasks. DINO and MAE (Masked Autoencoders) closed the self-supervised gap the paper identified, eventually making self-supervised ViT pre-training competitive with supervised pre-training. ViT also became the visual encoder in multimodal models like CLIP, which aligned image and text representations and powered zero-shot image classification.
Perhaps ViT’s most lasting contribution is conceptual: it demonstrated that scale and data can substitute for domain-specific architectural assumptions. This finding, combined with similar observations in NLP (see Improving Language Understanding by Generative Pre-Training and the scaling laws literature, see Scaling Laws for Neural Language Models), cemented the trend toward large, general-purpose models pre-trained on massive datasets – the paradigm that now dominates both vision and language.
To understand this paper, the reader should be familiar with: