Learning Transferable Visual Models From Natural Language Supervision

Authors: Alec Radford, Jong Wook Kim, Chris Hallacy, et al. Year: 2021. Source: arXiv:2103.00020

One-Sentence Summary

CLIP trains an image encoder and a text encoder together on 400 million image-caption pairs from the internet so that, at test time, you can classify images into any set of categories just by describing them in plain English – no labeled training data required.

Problem Statement

Before CLIP, the standard recipe for building a computer vision system was: collect a large dataset of images with human-assigned labels (like ImageNet’s 1.28 million images across 1,000 categories), train a model to predict those labels, then fine-tune the model on whatever downstream task you actually care about. This approach has two fundamental problems.

First, the label vocabulary is fixed at training time. An ImageNet model knows about “golden retriever” and “construction crane” but has no concept of “a photo of someone social distancing” or “a satellite image of deforestation.” Every new task requires collecting and labeling a new dataset, which is expensive and slow. Meanwhile, in natural language processing, models like GPT-2 and GPT-3 (see Improving Language Understanding by Generative Pre-Training) had shown that pre-training on raw internet text produced models that could perform new tasks zero-shot – without any task-specific training data at all.

Second, models trained on fixed label sets tend to learn shortcuts. A model trained to classify 1,000 ImageNet categories might learn to associate “green background” with “frog” rather than learning what a frog actually looks like. This makes the model brittle: performance on ImageNet can be impressive while performance on slightly different distributions (sketches, adversarial examples, video frames) drops dramatically. A ResNet-101 makes 5 times as many mistakes on natural distribution shifts compared to the ImageNet validation set.

The question CLIP addresses: could the same revolution that happened in NLP – learning directly from raw text at internet scale – produce visual representations that are more general, more flexible, and more robust?

Key Innovation

Think of how humans learn to recognize things. A child does not see 1,000 flash cards labeled “dog” before understanding what a dog is. Instead, a parent points at the family pet and says “Look at the dog!” The child learns to connect visual experience with language naturally, absorbing thousands of such associations over time. Eventually, you can tell the child “find the spatula” and they can search the kitchen for an object they may have only heard described once.

CLIP works the same way. Instead of training on a fixed set of categories, CLIP trains on 400 million (image, text) pairs scraped from the internet – photos paired with their captions, alt-text, titles, and descriptions. The training objective is simple: given a batch of images and captions, figure out which caption goes with which image. This is a contrastive learning objective (the model contrasts correct pairings against incorrect ones), not a generative one (the model does not try to predict exact caption words).

The critical technical insight is that contrastive learning is dramatically more efficient than trying to predict the exact words of a caption. The authors found that a contrastive objective learns useful ImageNet representations 12 times faster than a transformer language model trained to generate captions, and 4 times faster than a bag-of-words prediction baseline. This efficiency gain is what made it practical to scale training to 400 million pairs.

At test time, zero-shot classification works by turning class labels into text. To classify an image as one of [“dog”, “cat”, “bird”], CLIP encodes the prompts “A photo of a dog”, “A photo of a cat”, and “A photo of a bird” through the text encoder, encodes the image through the image encoder, and picks the class whose text embedding is most similar to the image embedding. The text encoder effectively generates the weights of a linear classifier on the fly from a natural language description – no labeled examples needed.

Architecture / Method

CLIP has two parallel encoders that share no weights but are trained together:

Image encoder: Either a ResNet (with modifications: ResNet-D improvements, antialiased blur pooling, and attention pooling replacing global average pooling) or a Vision Transformer (ViT) (see An Image is Worth 16x16 Words). The image encoder takes a raw image and produces a feature vector \(I_f\) of dimension \(d_i\).

Text encoder: A Transformer (see Attention Is All You Need) with 63 million parameters, 12 layers, 512 width, and 8 attention heads. It operates on byte-pair-encoded text with a vocabulary of 49,152 tokens and a maximum sequence length of 76 tokens. The text is bracketed with [SOS] and [EOS] tokens, and the activation at the [EOS] position in the final layer serves as the text feature vector \(T_f\) of dimension \(d_t\). Masked self-attention is used (causal, left-to-right), following the GPT architecture.

Projection to shared space: Each encoder’s output is linearly projected into a shared embedding space of dimension \(d_e\), then L2-normalized. For the image: \(I_e = \text{normalize}(I_f \cdot W_i)\) where \(W_i\) is a learned \(d_i \times d_e\) matrix. For the text: \(T_e = \text{normalize}(T_f \cdot W_t)\) where \(W_t\) is a learned \(d_t \times d_e\) matrix. Notably, CLIP uses only a linear projection here – not the nonlinear projection head that was popular in self-supervised methods like SimCLR.

CLIP architecture: contrastive pre-training, dataset classifier creation, and zero-shot prediction

Figure 1: The CLIP architecture in three stages. (1) Contrastive pre-training: an image encoder and text encoder are jointly trained on image-text pairs, learning to match correct pairs via a matrix of pairwise cosine similarities. (2) Dataset classifier creation: class labels are converted to text prompts like “A photo of a {object}” and encoded by the text encoder. (3) Zero-shot prediction: a new image is encoded and compared against all class text embeddings to find the best match.

Training procedure: Given a minibatch of \(N\) image-text pairs (CLIP uses \(N = 32{,}768\)), the model computes all \(N \times N\) pairwise cosine similarities between image embeddings and text embeddings, scaled by a learned temperature parameter \(\tau\). The training objective is a symmetric cross-entropy loss: for each image, the correct text should have the highest similarity, and for each text, the correct image should have the highest similarity. The loss averages these two directions.

Here is the numpy-like pseudocode from the paper:

# I_f = image_encoder(I)   # [n, d_i]
# T_f = text_encoder(T)    # [n, d_t]

# Project to joint embedding space [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# Scaled pairwise cosine similarities [n, n]
# (t is the learned log-temperature, so np.exp(t) = 1/tau)
logits = np.dot(I_e, T_e.T) * np.exp(t)

# Symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss   = (loss_i + loss_t) / 2
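
The pseudocode above can be fleshed out into a runnable NumPy sketch (the helper names here are mine, not the paper's). Note the direction of each softmax: the image-to-text loss normalizes over each row of the logit matrix, the text-to-image loss over each column:

```python
import numpy as np

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(I_f, T_f, W_i, W_t, t):
    """Symmetric contrastive loss over a batch of n image-text pairs."""
    I_e = l2_normalize(I_f @ W_i)                  # [n, d_e]
    T_e = l2_normalize(T_f @ W_t)                  # [n, d_e]
    logits = (I_e @ T_e.T) * np.exp(t)             # [n, n] scaled cosine similarities

    n = logits.shape[0]
    diag = np.arange(n)
    loss_i = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return (loss_i + loss_t) / 2
```

With perfectly matched pairs and a large temperature scale the loss approaches zero; with random embeddings it sits close to \(\log N\).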

Zero-shot classification: At inference, CLIP converts each class name into a text prompt (e.g., “A photo of a {class}.”), encodes all prompts through the text encoder, and caches the resulting text embeddings. For each test image, the model computes cosine similarity against all cached text embeddings and predicts the class with the highest similarity. This is mathematically equivalent to a multinomial logistic regression classifier with L2-normalized inputs, L2-normalized weights, no bias, and temperature scaling.
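
A minimal sketch of this inference step, assuming the embeddings are already computed and unit-normalized (the function name is illustrative, not from the paper):

```python
import numpy as np

def zero_shot_predict(image_emb, class_embs):
    """Return the index of the class whose cached text embedding has the
    highest cosine similarity with the image embedding.
    image_emb: [d_e] unit vector; class_embs: [C, d_e] unit vectors."""
    sims = class_embs @ image_emb   # [C] cosine similarities
    return int(np.argmax(sims))
```

Because the class embeddings are cached, classifying each additional image costs one image-encoder forward pass plus a single matrix-vector product.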

Prompt engineering and ensembling: The authors found that the default “A photo of a {label}.” template could be improved with task-specific context. For pet classification: “A photo of a {label}, a type of pet.” For satellite images: “A satellite photo of a {label}.” On ImageNet, ensembling 80 different prompt templates improved accuracy by 3.5%, and combined with prompt engineering, the total gain was nearly 5 percentage points.
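
The paper ensembles in embedding space rather than probability space: a class's text embeddings under the different templates are averaged and then renormalized. A sketch with a toy stand-in encoder and a few hypothetical templates:

```python
import numpy as np

def ensemble_class_embedding(encode_text, templates, label):
    """Average one class's text embeddings across prompt templates, then
    renormalize. `encode_text` stands in for CLIP's text encoder and is
    assumed to return a unit vector."""
    embs = np.stack([encode_text(t.format(label)) for t in templates])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

def toy_encode(text):
    """Deterministic toy 'encoder' (hash-seeded), just to exercise the function."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=8)
    return v / np.linalg.norm(v)

templates = ["A photo of a {}.", "A blurry photo of a {}.", "A sketch of a {}."]
emb = ensemble_class_embedding(toy_encode, templates, "dog")
```

The averaged-and-renormalized vector then serves as that class's single classifier weight at inference time.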

Scale: The authors trained 8 models: 5 ResNets (ResNet-50, ResNet-101, RN50x4, RN50x16, RN50x64) and 3 Vision Transformers (ViT-B/32, ViT-B/16, ViT-L/14). All models were trained for 32 epochs on WIT (WebImageText), the 400-million-pair dataset described earlier. The largest ResNet (RN50x64) took 18 days on 592 V100 GPUs; the largest ViT (ViT-L/14) took 12 days on 256 V100 GPUs. The best model, ViT-L/14@336px, was produced by fine-tuning ViT-L/14 at 336-pixel resolution for one additional epoch.

Mathematical Foundations

Cosine Similarity with Temperature Scaling

\[\text{logit}(i, j) = \frac{I_e^{(i)} \cdot T_e^{(j)}}{\tau}\]

where:

\[I_e^{(i)} \cdot T_e^{(j)} = \sum_{k=1}^{d_e} I_{e,k}^{(i)} \, T_{e,k}^{(j)}\]

What it means: Because both embeddings are L2-normalized (their length is 1), the dot product equals the cosine of the angle between them. Values range from -1 (opposite directions) to +1 (same direction). The temperature \(\tau\) controls how “peaked” the resulting probability distribution is – a smaller \(\tau\) makes the model more confident in its top choice.

Why it matters: Cosine similarity in a shared embedding space is what allows CLIP to compare images and text directly, despite them being fundamentally different modalities. Two vectors pointing in the same direction means “this image and this text describe the same thing.”

Worked example: Suppose we have \(d_e = 4\) and two normalized embeddings: \(I_e = [0.5, 0.5, 0.5, 0.5]\) and \(T_e = [0.4, 0.6, 0.4, 0.5]\) (approximately normalized). Their dot product is \(0.5 \times 0.4 + 0.5 \times 0.6 + 0.5 \times 0.4 + 0.5 \times 0.5 = 0.2 + 0.3 + 0.2 + 0.25 = 0.95\). With \(\tau = 0.07\), the logit is \(0.95 / 0.07 \approx 13.6\), a very confident match.
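
The arithmetic checks out directly in NumPy:

```python
import numpy as np

I_e = np.array([0.5, 0.5, 0.5, 0.5])   # exactly unit norm
T_e = np.array([0.4, 0.6, 0.4, 0.5])   # approximately unit norm (||T_e|| ~ 0.96)

sim = float(I_e @ T_e)                  # dot product of the embeddings, ~0.95
logit = sim / 0.07                      # divide by temperature tau = 0.07, ~13.57
```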

Symmetric Contrastive Loss (InfoNCE)

For the image-to-text direction:

\[\mathcal{L}_i = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\text{logit}(i, i))}{\sum_{j=1}^{N} \exp(\text{logit}(i, j))}\]

For the text-to-image direction:

\[\mathcal{L}_t = -\frac{1}{N} \sum_{j=1}^{N} \log \frac{\exp(\text{logit}(j, j))}{\sum_{i=1}^{N} \exp(\text{logit}(i, j))}\]

The total loss:

\[\mathcal{L} = \frac{\mathcal{L}_i + \mathcal{L}_t}{2}\]

What it means: For each image in the batch, the loss asks: “Out of all \(N\) texts, which one is the correct match?” The softmax converts the \(N\) similarity scores into a probability distribution, and the loss penalizes the model for assigning low probability to the correct pairing. The text-to-image loss does the same thing in reverse: for each text, which image is correct? Averaging the two directions makes the loss symmetric.

Why it matters: This is a contrastive loss, specifically InfoNCE (introduced by van den Oord et al. in Contrastive Predictive Coding). It does not require the model to predict exact words – only to determine which image-text pairs go together. This is much easier to learn than caption generation and is the key efficiency insight behind CLIP. With \(N = 32{,}768\), each training step effectively poses a 32,768-way classification problem in each direction.

Worked example: With a batch of \(N = 3\) and temperature-scaled logits:

            Text 1   Text 2   Text 3
Image 1      14.3      2.1     -1.5
Image 2       0.8     12.7      3.2
Image 3      -0.3      1.9     11.4

The diagonal values are the correct pairings. For Image 1, the softmax probability of the correct text is \(\exp(14.3) / (\exp(14.3) + \exp(2.1) + \exp(-1.5)) \approx 0.999995\). The loss contribution is \(-\log(0.999995) \approx 0.000005\) – very small because the model is confident and correct. If the model were confused and assigned equal logits to all three texts, the probability would be \(1/3\) and the loss would be \(-\log(1/3) \approx 1.10\).
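
These numbers can be verified directly:

```python
import numpy as np

logits = np.array([[14.3,  2.1, -1.5],
                   [ 0.8, 12.7,  3.2],
                   [-0.3,  1.9, 11.4]])

row = logits[0]
p_correct = np.exp(row[0]) / np.exp(row).sum()  # softmax prob. of Text 1 for Image 1
loss = -np.log(p_correct)                       # tiny: model is confident and correct
uniform_loss = -np.log(1.0 / 3.0)               # ~1.10: model is maximally confused
```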

Zero-Shot Prediction

\[p(y = c \mid x) = \frac{\exp(I_e(x) \cdot T_e(c) / \tau)}{\sum_{c'=1}^{C} \exp(I_e(x) \cdot T_e(c') / \tau)}\]

What it means: This is just a softmax over cosine similarities – the same operation used during training, but now the “batch” is replaced by the set of possible class labels. The model picks the class whose text description is most similar to the image.

Why it matters: This is where the zero-shot magic happens. The text encoder acts as a “hypernetwork” (a network that generates the weights of another network): it converts class names into classifier weights on the fly. No labeled training images are needed for the downstream task. Any set of text descriptions becomes a valid classifier.

Scaling Law

\[\text{Error} \propto (\text{Compute})^{-\alpha}\]

or equivalently in log-log space:

\[\log(\text{Error}) = -\alpha \cdot \log(\text{Compute}) + \beta\]

What it means: When the authors plotted average zero-shot error against model compute on a log-log scale, the 5 ResNet CLIP models fell on a straight line. This means doubling compute reduces error by a predictable, fixed percentage – the same kind of power law scaling observed in language models like GPT.

Why it matters: Predictable scaling means you can estimate how much compute you need to reach a target accuracy. The authors estimate that reaching overall state-of-the-art performance via zero-shot CLIP would require roughly 1,000 times more compute than their largest model – infeasible with current hardware, but a clear research direction.
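
Under such a power law, the compute required to reach a target error follows by simple rearrangement (the exponent used below is illustrative, not a value reported in the paper):

```python
def compute_multiplier(current_error, target_error, alpha):
    """If Error = k * Compute**(-alpha), moving from current_error down to
    target_error requires multiplying compute by this factor.
    alpha is an assumed slope for illustration only."""
    return (current_error / target_error) ** (1.0 / alpha)

# e.g. with an assumed alpha = 0.5, halving the error costs 4x the compute:
# compute_multiplier(0.2, 0.1, 0.5) -> 4.0
```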

Results

CLIP’s most striking result is matching the accuracy of the original ResNet-50 on ImageNet (76.2% top-1) in a zero-shot setting, without using any of ImageNet’s 1.28 million labeled training images. This was a massive leap from the previous best zero-shot result of 11.5% by Visual N-Grams.

Dataset     Visual N-Grams   Zero-Shot CLIP
aYahoo          72.4%            98.4%
ImageNet        11.5%            76.2%
SUN397          23.0%            58.5%

Across a suite of 27 datasets, zero-shot CLIP outperformed a fully supervised linear classifier trained on ResNet-50 features on 16 of the 27 datasets. The model excelled on action recognition (outperforming ResNet-50 by 14.5% on Kinetics700 and 7.7% on UCF101), general object classification, and OCR tasks. It struggled on specialized tasks like satellite image classification (EuroSAT), medical imaging (PatchCamelyon), and counting (CLEVRCounts).

Zero-shot CLIP matched the performance of 4-shot logistic regression on its own features, and roughly matched the best 16-shot classifier in the evaluation suite (BiT-M ResNet-152x2 trained on ImageNet-21K). This is remarkable because zero-shot CLIP communicates what to look for through language, while few-shot learners must infer the concept from example images alone.

For representation learning, CLIP’s best model (ViT-L/14@336px) outperformed the Noisy Student EfficientNet-L2 – the previous state-of-the-art – on 21 of 27 datasets when evaluated via linear probes (training a single linear classifier on top of frozen features – the standard test of representation quality). Vision Transformers were approximately 3 times more compute-efficient than ResNets in the CLIP training regime.

Perhaps most importantly, CLIP demonstrated dramatically improved robustness to distribution shift. On 7 natural distribution shift variants of ImageNet, zero-shot CLIP models reduced the “robustness gap” (the difference between ImageNet performance and performance under distribution shift) by up to 75%. Standard ImageNet-trained models see their accuracy drop severely on sketches, adversarial images, and video frames; CLIP’s zero-shot models maintain much more consistent performance across these variations. However, this robustness advantage largely disappears when CLIP is fine-tuned on ImageNet, suggesting the benefit comes specifically from avoiding distribution-specific training.

Limitations

The paper is candid about CLIP's weaknesses. Zero-shot CLIP only matches a 2015-era ResNet-50 on ImageNet, well below the overall state of the art, and the authors estimate that closing that gap zero-shot would require roughly 1,000 times more compute. Performance is weak on fine-grained and systematic tasks such as counting objects, and on data that is truly out-of-distribution for web-scraped training: zero-shot CLIP reaches only 88% on handwritten MNIST digits, worse than a logistic regression on raw pixels. CLIP can only choose among the textual classes it is given – it cannot generate a caption for an image. Finally, because the training data is unfiltered internet content, the model inherits the social biases present on the web.

Impact and Legacy

CLIP fundamentally changed how the computer vision community thinks about training visual models. Before CLIP, the dominant paradigm was: pre-train on ImageNet, then fine-tune on your target task. After CLIP, the field moved toward learning visual representations from web-scale text-image pairs, enabling zero-shot transfer and dramatically broader capabilities.

CLIP became a foundational building block for subsequent multimodal systems. DALL-E 2 (Ramesh et al., 2022) used CLIP embeddings as the bridge between text and image generation. Stable Diffusion and other text-to-image systems rely on CLIP’s text encoder (or derivatives) to guide image generation. The “CLIP embedding space” became a shared representation that connected vision and language across many different applications – image search, content moderation, visual question answering, and robotic manipulation.

The paper also popularized several ideas that became standard practice: using natural language prompts to define classifiers, prompt engineering for vision tasks (directly inspired by the GPT-3 finding that prompt wording matters), and evaluating models on broad zero-shot transfer suites rather than single fine-tuned benchmarks. OpenCLIP, SigLIP, and many other open reproductions extended the approach. The paper’s emphasis on robustness to distribution shift influenced how the field evaluates visual models, shifting focus from single-dataset accuracy to out-of-distribution performance. CLIP’s demonstrated biases also catalyzed important work on bias evaluation and mitigation in multimodal foundation models.

Prerequisites

To understand CLIP, you should be comfortable with:

Neural network training basics: gradient descent, softmax, and cross-entropy loss.
Convolutional networks (ResNets) and the Transformer architecture, including self-attention.
Embedding spaces: representing inputs as vectors and comparing them with cosine similarity.
The core idea of contrastive learning: pulling matched pairs together in embedding space while pushing mismatched pairs apart.

Connections