Learning Transferable Visual Models From Natural Language Supervision

Authors: Alec Radford, Jong Wook Kim, Chris Hallacy, et al. Year: 2021. Source: arXiv:2103.00020

One-Sentence Summary

CLIP trains an image encoder and a text encoder together on 400 million image-caption pairs from the internet so that, at test time, you can classify images into any set of categories just by describing them in plain English – no labeled training data required.

Problem Statement

Before CLIP, the standard recipe for building a computer vision system was: collect a large dataset of images with human-assigned labels (like ImageNet’s 1.28 million images across 1,000 categories), train a model to predict those labels, then fine-tune the model on whatever downstream task you actually care about. This approach has two fundamental problems.

First, the label vocabulary is fixed at training time. An ImageNet model knows about “golden retriever” and “construction crane” but has no concept of “a photo of someone social distancing” or “a satellite image of deforestation.” Every new task requires collecting and labeling a new dataset, which is expensive and slow. Meanwhile, in natural language processing, models like GPT-2 and GPT-3 (see Improving Language Understanding by Generative Pre-Training) had shown that pre-training on raw internet text produced models that could perform new tasks zero-shot – without any task-specific training data at all.

Second, models trained on fixed label sets tend to learn shortcuts. A model trained to classify 1,000 ImageNet categories might learn to associate “green background” with “frog” rather than learning what a frog actually looks like. This makes the model brittle: performance on ImageNet can be impressive while performance on slightly different distributions (sketches, adversarial examples, video frames) drops dramatically. A ResNet-101 makes 5 times as many mistakes on natural distribution shifts compared to the ImageNet validation set.

The question CLIP addresses: could the same revolution that happened in NLP – learning directly from raw text at internet scale – produce visual representations that are more general, more flexible, and more robust?

Key Innovation

Think of how humans learn to recognize things. A child does not see 1,000 flash cards labeled “dog” before understanding what a dog is. Instead, a parent points at the family pet and says “Look at the dog!” The child learns to connect visual experience with language naturally, absorbing thousands of such associations over time. Eventually, you can tell the child “find the spatula” and they can search the kitchen for an object they may have only heard described once.

CLIP works the same way. Instead of training on a fixed set of categories, CLIP trains on 400 million (image, text) pairs scraped from the internet – photos paired with their captions, alt-text, titles, and descriptions. The training objective is simple: given a batch of images and captions, figure out which caption goes with which image. This is a contrastive learning objective (the model contrasts correct pairings against incorrect ones), not a generative one (the model does not try to predict exact caption words).

The critical technical insight is that contrastive learning is dramatically more efficient than trying to predict the exact words of a caption. The authors found that a contrastive objective learns useful ImageNet representations 12 times faster than a transformer language model trained to generate captions, and 4 times faster than a bag-of-words prediction baseline. This efficiency gain is what made it practical to scale training to 400 million pairs.

At test time, zero-shot classification works by turning class labels into text. To classify an image as one of [“dog”, “cat”, “bird”], CLIP encodes the prompts “A photo of a dog”, “A photo of a cat”, and “A photo of a bird” through the text encoder, encodes the image through the image encoder, and picks the class whose text embedding is most similar to the image embedding. The text encoder effectively generates the weights of a linear classifier on the fly from a natural language description – no labeled examples needed.

Architecture / Method

CLIP has two parallel encoders that share no weights but are trained together:

Image encoder: Either a ResNet (with modifications: ResNet-D improvements, antialiased blur pooling, and attention pooling replacing global average pooling) or a Vision Transformer (ViT) (see An Image is Worth 16x16 Words). The image encoder takes a raw image and produces a feature vector \(I_f\) of dimension \(d_i\).

Text encoder: A Transformer (see Attention Is All You Need) with 63 million parameters, 12 layers, 512 width, and 8 attention heads. It operates on byte-pair-encoded text with a vocabulary of 49,152 tokens and a maximum sequence length of 76 tokens. The text is bracketed with [SOS] and [EOS] tokens, and the activation at the [EOS] position in the final layer serves as the text feature vector \(T_f\) of dimension \(d_t\). Masked self-attention is used (causal, left-to-right), following the GPT architecture.

Projection to shared space: Each encoder’s output is linearly projected into a shared embedding space of dimension \(d_e\), then L2-normalized. For the image: \(I_e = \text{normalize}(I_f \cdot W_i)\) where \(W_i\) is a learned \(d_i \times d_e\) matrix. For the text: \(T_e = \text{normalize}(T_f \cdot W_t)\) where \(W_t\) is a learned \(d_t \times d_e\) matrix. Notably, CLIP uses only a linear projection here – not the nonlinear projection head that was popular in self-supervised methods like SimCLR.

CLIP architecture: contrastive pre-training, dataset classifier creation, and zero-shot prediction

Figure 1: The CLIP architecture in three stages. (1) Contrastive pre-training: an image encoder and text encoder are jointly trained on image-text pairs, learning to match correct pairs via a matrix of pairwise cosine similarities. (2) Dataset classifier creation: class labels are converted to text prompts like “A photo of a {object}” and encoded by the text encoder. (3) Zero-shot prediction: a new image is encoded and compared against all class text embeddings to find the best match.

Training procedure: Given a minibatch of \(N\) image-text pairs (CLIP uses \(N = 32{,}768\)), the model computes all \(N \times N\) pairwise cosine similarities between image embeddings and text embeddings, scaled by a learned temperature parameter \(\tau\). The training objective is a symmetric cross-entropy loss: for each image, the correct text should have the highest similarity, and for each text, the correct image should have the highest similarity. The loss averages these two directions.

Here is the numpy-like pseudocode from the paper:

# I_f = image_encoder(I)   # [n, d_i]
# T_f = text_encoder(T)    # [n, d_t]

# Project to joint embedding space [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# Scaled pairwise cosine similarities [n, n]
# (t is the learned log-temperature, so np.exp(t) = 1/tau)
logits = np.dot(I_e, T_e.T) * np.exp(t)

# Symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss   = (loss_i + loss_t) / 2
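
The pseudocode above can be fleshed out into a runnable NumPy sketch (the helper names here are mine, not the paper's). Note the direction of each softmax: the image-to-text loss normalizes over each row of the logit matrix, the text-to-image loss over each column:

```python
import numpy as np

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(I_f, T_f, W_i, W_t, t):
    """Symmetric contrastive loss over a batch of n image-text pairs."""
    I_e = l2_normalize(I_f @ W_i)                  # [n, d_e]
    T_e = l2_normalize(T_f @ W_t)                  # [n, d_e]
    logits = (I_e @ T_e.T) * np.exp(t)             # [n, n] scaled cosine similarities

    n = logits.shape[0]
    diag = np.arange(n)
    loss_i = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return (loss_i + loss_t) / 2
```

With perfectly matched pairs and a large temperature scale the loss approaches zero; with random embeddings it sits close to \(\log N\).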

Zero-shot classification: At inference, CLIP converts each class name into a text prompt (e.g., “A photo of a {class}.”), encodes all prompts through the text encoder, and caches the resulting text embeddings. For each test image, the model computes cosine similarity against all cached text embeddings and predicts the class with the highest similarity. This is mathematically equivalent to a multinomial logistic regression classifier with L2-normalized inputs, L2-normalized weights, no bias, and temperature scaling.
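
A minimal sketch of this inference step, assuming the embeddings are already computed and unit-normalized (the function name is illustrative, not from the paper):

```python
import numpy as np

def zero_shot_predict(image_emb, class_embs):
    """Return the index of the class whose cached text embedding has the
    highest cosine similarity with the image embedding.
    image_emb: [d_e] unit vector; class_embs: [C, d_e] unit vectors."""
    sims = class_embs @ image_emb   # [C] cosine similarities
    return int(np.argmax(sims))
```

Because the class embeddings are cached, classifying each additional image costs one image-encoder forward pass plus a single matrix-vector product.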

Prompt engineering and ensembling: The authors found that the default “A photo of a {label}.” template could be improved with task-specific context. For pet classification: “A photo of a {label}, a type of pet.” For satellite images: “A satellite photo of a {label}.” On ImageNet, ensembling 80 different prompt templates improved accuracy by 3.5%, and combined with prompt engineering, the total gain was nearly 5 percentage points.
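
The paper ensembles in embedding space rather than probability space: a class's text embeddings under the different templates are averaged and then renormalized. A sketch with a toy stand-in encoder and a few hypothetical templates:

```python
import numpy as np

def ensemble_class_embedding(encode_text, templates, label):
    """Average one class's text embeddings across prompt templates, then
    renormalize. `encode_text` stands in for CLIP's text encoder and is
    assumed to return a unit vector."""
    embs = np.stack([encode_text(t.format(label)) for t in templates])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

def toy_encode(text):
    """Deterministic toy 'encoder' (hash-seeded), just to exercise the function."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=8)
    return v / np.linalg.norm(v)

templates = ["A photo of a {}.", "A blurry photo of a {}.", "A sketch of a {}."]
emb = ensemble_class_embedding(toy_encode, templates, "dog")
```

The averaged-and-renormalized vector then serves as that class's single classifier weight at inference time.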

Scale: The authors trained 8 models: 5 ResNets (ResNet-50, ResNet-101, RN50x4, RN50x16, RN50x64) and 3 Vision Transformers (ViT-B/32, ViT-B/16, ViT-L/14). All models were trained for 32 epochs on WIT (WebImageText), the 400-million-pair dataset described earlier. The largest ResNet (RN50x64) took 18 days on 592 V100 GPUs; the largest ViT (ViT-L/14) took 12 days on 256 V100 GPUs. The best model, ViT-L/14@336px, was produced by fine-tuning ViT-L/14 at 336-pixel resolution for one additional epoch.

Mathematical Foundations

Cosine Similarity with Temperature Scaling

\[\text{logit}(i, j) = \frac{I_e^{(i)} \cdot T_e^{(j)}}{\tau}\]

where:

\[I_e^{(i)} \cdot T_e^{(j)} = \sum_{k=1}^{d_e} I_{e,k}^{(i)} \, T_{e,k}^{(j)}\]

What it means: Because both embeddings are L2-normalized (their length is 1), the dot product equals the cosine of the angle between them. Values range from -1 (opposite directions) to +1 (same direction). The temperature \(\tau\) controls how “peaked” the resulting probability distribution is – a smaller \(\tau\) makes the model more confident in its top choice.

Why it matters: Cosine similarity in a shared embedding space is what allows CLIP to compare images and text directly, despite them being fundamentally different modalities. Two vectors pointing in the same direction means “this image and this text describe the same thing.”

Worked example: Suppose we have \(d_e = 4\) and two normalized embeddings: \(I_e = [0.5, 0.5, 0.5, 0.5]\) and \(T_e = [0.4, 0.6, 0.4, 0.5]\) (approximately normalized). Their dot product is \(0.5 \times 0.4 + 0.5 \times 0.6 + 0.5 \times 0.4 + 0.5 \times 0.5 = 0.2 + 0.3 + 0.2 + 0.25 = 0.95\). With \(\tau = 0.07\), the logit is \(0.95 / 0.07 \approx 13.6\), a very confident match.
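
The arithmetic checks out directly in NumPy:

```python
import numpy as np

I_e = np.array([0.5, 0.5, 0.5, 0.5])   # exactly unit norm
T_e = np.array([0.4, 0.6, 0.4, 0.5])   # approximately unit norm (||T_e|| ~ 0.96)

sim = float(I_e @ T_e)                  # dot product of the embeddings, ~0.95
logit = sim / 0.07                      # divide by temperature tau = 0.07, ~13.57
```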

Symmetric Contrastive Loss (InfoNCE)

For the image-to-text direction:

\[\mathcal{L}_i = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\text{logit}(i, i))}{\sum_{j=1}^{N} \exp(\text{logit}(i, j))}\]

For the text-to-image direction:

\[\mathcal{L}_t = -\frac{1}{N} \sum_{j=1}^{N} \log \frac{\exp(\text{logit}(j, j))}{\sum_{i=1}^{N} \exp(\text{logit}(i, j))}\]

The total loss:

\[\mathcal{L} = \frac{\mathcal{L}_i + \mathcal{L}_t}{2}\]

What it means: For each image in the batch, the loss asks: “Out of all \(N\) texts, which one is the correct match?” The softmax converts the \(N\) similarity scores into a probability distribution, and the loss penalizes the model for assigning low probability to the correct pairing. The text-to-image loss does the same thing in reverse: for each text, which image is correct? Averaging the two directions makes the loss symmetric.

Why it matters: This is a contrastive loss, specifically InfoNCE (introduced by van den Oord et al. in Contrastive Predictive Coding). It does not require the model to predict exact words – only to determine which image-text pairs go together. This is much easier to learn than caption generation and is the key efficiency insight behind CLIP. With \(N = 32{,}768\), each training step effectively poses a 32,768-way classification problem in each direction.

Worked example: With a batch of \(N = 3\) and temperature-scaled logits:

            Text 1   Text 2   Text 3
Image 1      14.3      2.1     -1.5
Image 2       0.8     12.7      3.2
Image 3      -0.3      1.9     11.4

The diagonal values are the correct pairings. For Image 1, the softmax probability of the correct text is \(\exp(14.3) / (\exp(14.3) + \exp(2.1) + \exp(-1.5)) \approx 0.999995\). The loss contribution is \(-\log(0.999995) \approx 0.000005\) – very small because the model is confident and correct. If the model were confused and assigned equal logits to all three texts, the probability would be \(1/3\) and the loss would be \(-\log(1/3) \approx 1.10\).
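
These numbers can be verified directly:

```python
import numpy as np

logits = np.array([[14.3,  2.1, -1.5],
                   [ 0.8, 12.7,  3.2],
                   [-0.3,  1.9, 11.4]])

row = logits[0]
p_correct = np.exp(row[0]) / np.exp(row).sum()  # softmax prob. of Text 1 for Image 1
loss = -np.log(p_correct)                       # tiny: model is confident and correct
uniform_loss = -np.log(1.0 / 3.0)               # ~1.10: model is maximally confused
```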

Zero-Shot Prediction

\[p(y = c \mid x) = \frac{\exp(I_e(x) \cdot T_e(c) / \tau)}{\sum_{c'=1}^{C} \exp(I_e(x) \cdot T_e(c') / \tau)}\]

What it means: This is just a softmax over cosine similarities – the same operation used during training, but now the “batch” is replaced by the set of possible class labels. The model picks the class whose text description is most similar to the image.

Why it matters: This is where the zero-shot magic happens. The text encoder acts as a “hypernetwork” (a network that generates the weights of another network): it converts class names into classifier weights on the fly. No labeled training images are needed for the downstream task. Any set of text descriptions becomes a valid classifier.

Scaling Law

\[\text{Error} \propto (\text{Compute})^{-\alpha}\]

or equivalently in log-log space:

\[\log(\text{Error}) = -\alpha \cdot \log(\text{Compute}) + \beta\]

What it means: When the authors plotted average zero-shot error against model compute on a log-log scale, the 5 ResNet CLIP models fell on a straight line. This means doubling compute reduces error by a predictable, fixed percentage – the same kind of power law scaling observed in language models like GPT.

Why it matters: Predictable scaling means you can estimate how much compute you need to reach a target accuracy. The authors estimate that reaching overall state-of-the-art performance via zero-shot CLIP would require roughly 1,000 times more compute than their largest model – infeasible with current hardware, but a clear research direction.
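
Under such a power law, the compute required to reach a target error follows by simple rearrangement (the exponent used below is illustrative, not a value reported in the paper):

```python
def compute_multiplier(current_error, target_error, alpha):
    """If Error = k * Compute**(-alpha), moving from current_error down to
    target_error requires multiplying compute by this factor.
    alpha is an assumed slope for illustration only."""
    return (current_error / target_error) ** (1.0 / alpha)

# e.g. with an assumed alpha = 0.5, halving the error costs 4x the compute:
# compute_multiplier(0.2, 0.1, 0.5) -> 4.0
```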

Results

CLIP’s most striking result is matching the accuracy of the original ResNet-50 on ImageNet (76.2% top-1) in a zero-shot setting, without using any of ImageNet’s 1.28 million labeled training images. This was a massive leap from the previous best zero-shot result of 11.5% by Visual N-Grams.

Dataset     Visual N-Grams   Zero-Shot CLIP
aYahoo          72.4%            98.4%
ImageNet        11.5%            76.2%
SUN397          23.0%            58.5%

Across a suite of 27 datasets, zero-shot CLIP outperformed a fully supervised linear classifier trained on ResNet-50 features on 16 of the 27 datasets. The model excelled on action recognition (outperforming ResNet-50 by 14.5% on Kinetics700 and 7.7% on UCF101), general object classification, and OCR tasks. It struggled on specialized tasks like satellite image classification (EuroSAT), medical imaging (PatchCamelyon), and counting (CLEVRCounts).

Zero-shot CLIP matched the performance of 4-shot logistic regression on its own features, and roughly matched the best 16-shot classifier in the evaluation suite (BiT-M ResNet-152x2 trained on ImageNet-21K). This is remarkable because zero-shot CLIP communicates what to look for through language, while few-shot learners must infer the concept from example images alone.

For representation learning, CLIP’s best model (ViT-L/14@336px) outperformed the Noisy Student EfficientNet-L2 – the previous state-of-the-art – on 21 of 27 datasets when evaluated via linear probes (training a single linear classifier on top of frozen features – the standard test of representation quality). Vision Transformers were approximately 3 times more compute-efficient than ResNets in the CLIP training regime.

Perhaps most importantly, CLIP demonstrated dramatically improved robustness to distribution shift. On 7 natural distribution shift variants of ImageNet, zero-shot CLIP models reduced the “robustness gap” (the difference between ImageNet performance and performance under distribution shift) by up to 75%. Standard ImageNet-trained models see their accuracy drop severely on sketches, adversarial images, and video frames; CLIP’s zero-shot models maintain much more consistent performance across these variations. However, this robustness advantage largely disappears when CLIP is fine-tuned on ImageNet, suggesting the benefit comes specifically from avoiding distribution-specific training.

Limitations

The paper is candid about CLIP's weaknesses. Zero-shot CLIP only matches a 2015-era ResNet-50 on ImageNet, well below the overall state of the art, and the authors estimate that closing that gap zero-shot would require roughly 1,000 times more compute. Performance is weak on fine-grained and systematic tasks such as counting objects, and on data that is truly out-of-distribution for web-scraped training: zero-shot CLIP reaches only 88% on handwritten MNIST digits, worse than a logistic regression on raw pixels. CLIP can only choose among the textual classes it is given – it cannot generate a caption for an image. Finally, because the training data is unfiltered internet content, the model inherits the social biases present on the web.

Impact and Legacy

CLIP fundamentally changed how the computer vision community thinks about training visual models. Before CLIP, the dominant paradigm was: pre-train on ImageNet, then fine-tune on your target task. After CLIP, the field moved toward learning visual representations from web-scale text-image pairs, enabling zero-shot transfer and dramatically broader capabilities.

CLIP became a foundational building block for subsequent multimodal systems. DALL-E 2 (Ramesh et al., 2022) used CLIP embeddings as the bridge between text and image generation. Stable Diffusion and other text-to-image systems rely on CLIP’s text encoder (or derivatives) to guide image generation. The “CLIP embedding space” became a shared representation that connected vision and language across many different applications – image search, content moderation, visual question answering, and robotic manipulation.

The paper also popularized several ideas that became standard practice: using natural language prompts to define classifiers, prompt engineering for vision tasks (directly inspired by the GPT-3 finding that prompt wording matters), and evaluating models on broad zero-shot transfer suites rather than single fine-tuned benchmarks. OpenCLIP, SigLIP, and many other open reproductions extended the approach. The paper’s emphasis on robustness to distribution shift influenced how the field evaluates visual models, shifting focus from single-dataset accuracy to out-of-distribution performance. CLIP’s demonstrated biases also catalyzed important work on bias evaluation and mitigation in multimodal foundation models.

Prerequisites

To understand CLIP, you should be comfortable with:

Neural network training basics: gradient descent, softmax, and cross-entropy loss.
Convolutional networks (ResNets) and the Transformer architecture, including self-attention.
Embedding spaces: representing inputs as vectors and comparing them with cosine similarity.
The core idea of contrastive learning: pulling matched pairs together in embedding space while pushing mismatched pairs apart.

Connections