CLIP (Contrastive Language-Image Pre-training)

By Arpit Tripathi, Founder

CLIP is a model from OpenAI that trains an image encoder and a text encoder together so matching image-caption pairs land near each other in a shared embedding space, enabling zero-shot image classification from text labels.

What is CLIP?

CLIP (Contrastive Language-Image Pre-training) is a model introduced by OpenAI in 2021 that learns a single embedding space shared by images and text. It uses two encoders, one for images and one for text, trained jointly so that an image and its true caption produce vectors that point in nearly the same direction, while mismatched image-caption pairs are pushed apart. Once trained, the same model can classify images it was never explicitly trained on by comparing an image embedding against the embeddings of candidate text labels.

CLIP was described in the paper 'Learning Transferable Visual Models From Natural Language Supervision' and trained on roughly 400 million image-text pairs collected from the internet. Its headline result was zero-shot performance: a CLIP model matched the accuracy of a fully supervised ResNet-50 on ImageNet without using any of ImageNet's 1.28 million labeled training images.

  • Two encoders, image and text, share one embedding space.
  • Trained on about 400 million image-text pairs from the web.
  • Enables zero-shot classification by comparing images to text labels.
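
As a rough illustration of what "one shared embedding space" means, the sketch below projects image and text features of different widths into a common dimension and L2-normalizes them, so the dot product between any image vector and any text vector is a cosine similarity. The feature sizes, the random stand-in features, and the names `W_image` and `W_text` are assumptions made for illustration; in CLIP the encoders and projections are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

# Hypothetical sizes: the two encoders do not need to share an output width.
n, d_img, d_txt, d_emb = 4, 768, 512, 256
image_features = rng.normal(size=(n, d_img))  # stand-in for image encoder output
text_features = rng.normal(size=(n, d_txt))   # stand-in for text encoder output

# One linear projection per modality (random here, learned in a real model).
W_image = rng.normal(size=(d_img, d_emb)) / np.sqrt(d_img)
W_text = rng.normal(size=(d_txt, d_emb)) / np.sqrt(d_txt)

image_emb = l2_normalize(image_features @ W_image)
text_emb = l2_normalize(text_features @ W_text)

# Every image can now be compared with every text in the same space.
similarity = image_emb @ text_emb.T  # shape (n, n), entries in [-1, 1]
```

With random weights the similarities carry no meaning; training is what moves matching pairs toward high values.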

How contrastive training works

For a batch of N image-text pairs, CLIP computes N image embeddings and N text embeddings, then forms an N by N matrix of cosine similarities. The N correct (image, text) pairs lie on the diagonal and should score high; the N² minus N incorrect pairings should score low. CLIP optimizes a symmetric cross-entropy loss: for each image it predicts the correct caption among all captions in the batch, and for each caption it predicts the correct image. A learnable temperature scales the similarities before the softmax.
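
To see why the similarities are scaled before the softmax, consider that cosine similarities lie in [-1, 1], so an unscaled softmax over them is close to uniform. The sketch below uses made-up similarity values and illustrative scale factors (neither comes from the article) to show how a larger scale sharpens the distribution toward the best-matching caption.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()

# Made-up cosine similarities of one image against four captions.
# Index 0 plays the role of the true caption.
sims = np.array([0.30, 0.22, 0.18, 0.15])

# Probability assigned to the true caption at each (illustrative) scale.
p_true = {scale: float(softmax(scale * sims)[0]) for scale in (1.0, 10.0, 100.0)}

for scale, p in p_true.items():
    print(f"scale={scale:6.1f}  p(true caption)={p:.3f}")
```

The same small gap in cosine similarity yields a nearly flat distribution at scale 1 and an almost one-hot distribution at scale 100, which is the sharpness that the learnable temperature controls.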

This contrastive objective is what makes CLIP scalable. Predicting which caption belongs to which image is a far cheaper learning signal than predicting exact words or pixels, so the model can absorb hundreds of millions of noisy web pairs. The result is a general-purpose representation that transfers across dozens of vision benchmarks.

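sim(I, T) = (vᵢ · vₜ) / (‖vᵢ‖ ‖vₜ‖)

Cosine similarity between an image embedding vᵢ and a text embedding vₜ. CLIP wants this high for true pairs and low for mismatched pairs.

L = ½ ( CE(logits, labels; dim=image) + CE(logits, labels; dim=text) ), logits = (Iₑ · Tₑᵀ) · exp(τ)

The symmetric contrastive loss averages two cross-entropy terms over the similarity logits, one taken along the image axis and one along the text axis. Iₑ and Tₑ are the batch's L2-normalized image and text embedding matrices, so Iₑ · Tₑᵀ holds every pairwise cosine similarity. The labels are the diagonal indices 0 to N − 1, and the logits are scaled by exp(τ), where τ is the learnable temperature parameter.
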
  • Similarity is measured by cosine similarity in the shared space.
  • A symmetric loss matches images to text and text to images.
  • A learnable temperature controls the sharpness of the softmax.
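
The symmetric loss can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the training code: the embeddings are random stand-ins, `clip_loss` and `log_scale` are names chosen here (`log_scale` plays the role of τ), and the value used for it is only illustrative.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, log_scale):
    """Symmetric cross-entropy over an N x N image-text similarity matrix."""
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    # Rows index images, columns index texts; exp(log_scale) sharpens the softmax.
    logits = (image_emb @ text_emb.T) * np.exp(log_scale)
    diag = np.arange(logits.shape[0])  # correct pairs sit on the diagonal
    # Image -> text: for each image (row), captions compete.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text -> image: for each caption (column), images compete.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)

rng = np.random.default_rng(0)
images = rng.normal(size=(8, 32))
matched_texts = images + 0.05 * rng.normal(size=(8, 32))  # near-perfect pairs
shuffled_texts = np.roll(matched_texts, shift=1, axis=0)  # every pair is wrong

log_scale = np.log(1 / 0.07)  # illustrative value only
loss_matched = clip_loss(images, matched_texts, log_scale)
loss_shuffled = clip_loss(images, shuffled_texts, log_scale)
print(f"matched: {loss_matched:.4f}  shuffled: {loss_shuffled:.4f}")
```

The loss is near zero when each image's own caption is its closest match and large when the pairing is scrambled, which is the gradient signal that pulls true pairs together and pushes the N² − N mismatches apart.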

Zero-shot classification with CLIP

CLIP turns classification into a retrieval problem. To classify an image into one of K categories, you write each label as a short text prompt, for example 'a photo of a {label}', encode all K prompts with the text encoder, encode the image, and pick the label whose text embedding has the highest cosine similarity to the image embedding. No task-specific fine-tuning or labeled training set is needed.

Because labels are expressed in natural language, the set of possible classes is open-ended. You can add or change categories at inference time simply by editing the text prompts, which is how the authors could evaluate CLIP zero-shot on more than 30 vision datasets without training a separate classifier for each. Prompt wording measurably affects accuracy, so choosing good prompts matters; the authors call this practice prompt engineering.

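```python
import torch
import clip  # OpenAI's CLIP package
from PIL import Image

# Use the GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Preprocess one image and add a batch dimension.
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)

# Each candidate class is written as a text prompt and tokenized.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    # The forward pass returns temperature-scaled similarity logits in both directions.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# Highest probability = predicted label.
print(dict(zip(labels, probs[0].tolist())))
```

Zero-shot image classification with OpenAI's CLIP library.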
  • Labels become text prompts encoded by the text tower.
  • Prediction is the label with the highest image-text similarity.
  • Class set is open-ended and editable at inference time.
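
The decision rule itself needs no model to demonstrate. The sketch below uses tiny hand-made vectors in place of real encoder outputs (the `toy_text_encoder` table, the embedding values, and the helper names are all hypothetical) to show the prompt template, the argmax over cosine similarities, and how adding a class at inference time only means encoding one more prompt.

```python
import numpy as np

def build_prompts(labels, template="a photo of a {label}"):
    """Wrap each class name in a natural-language prompt."""
    return [template.format(label=label) for label in labels]

def zero_shot_predict(image_emb, text_embs, labels):
    """Return the label whose prompt embedding is most similar to the image."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_emb  # cosine similarity per label
    return labels[int(np.argmax(sims))], sims

# Hand-made 3-d vectors standing in for text encoder outputs (hypothetical).
toy_text_encoder = {
    "a photo of a cat": np.array([0.9, 0.1, 0.0]),
    "a photo of a dog": np.array([0.1, 0.9, 0.0]),
    "a photo of a car": np.array([0.0, 0.1, 0.9]),
    "a photo of a truck": np.array([0.0, 0.4, 0.9]),
}
# Stand-in for the image encoder output of a truck-like picture.
image_emb = np.array([0.05, 0.35, 0.9])

labels = ["cat", "dog", "car"]
text_embs = np.stack([toy_text_encoder[p] for p in build_prompts(labels)])
pred_before, _ = zero_shot_predict(image_emb, text_embs, labels)

# Extending the class set: add a label, encode its prompt, predict again.
labels = labels + ["truck"]
text_embs = np.stack([toy_text_encoder[p] for p in build_prompts(labels)])
pred_after, _ = zero_shot_predict(image_emb, text_embs, labels)

print(pred_before, "->", pred_after)
```

With only three classes the closest available label wins; once "truck" is added as a prompt, the prediction changes without any retraining.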

Why CLIP matters for multimodal AI

CLIP's shared embedding space made it a foundational component of later systems. Both of its encoders are widely reused: text-to-image models such as Stable Diffusion use a CLIP text encoder to condition generation, and vision language models such as LLaVA use a frozen CLIP vision encoder as their eyes. CLIP embeddings also power text-to-image and image-to-image retrieval in vector databases, since matching across modalities reduces to a similarity search.

The approach has limits. CLIP inherits biases and gaps from its web-scraped data, struggles with fine-grained counting and precise spatial reasoning, and its zero-shot accuracy varies widely by domain. These are active research areas, but the core idea of aligning modalities through contrastive learning remains a standard building block.

  • CLIP text/image encoders feed diffusion models and vision language models.
  • Cross-modal retrieval becomes a vector similarity search.
  • Known weaknesses: data bias, counting, and fine-grained spatial reasoning.
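
Cross-modal retrieval as a similarity search can be sketched as a brute-force top-k lookup. The index below holds random vectors in place of stored image embeddings, and the query is constructed to lie near one of them; both are assumptions for illustration, as is the `top_k` helper. A vector database would typically replace the exhaustive scan with an approximate nearest-neighbor index.

```python
import numpy as np

def top_k(query_emb, index_embs, k=3):
    """Return (row index, cosine similarity) for the k closest index entries."""
    query = query_emb / np.linalg.norm(query_emb)
    index = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    scores = index @ query
    order = np.argsort(-scores)[:k]  # highest similarity first
    return [(int(i), float(scores[i])) for i in order]

rng = np.random.default_rng(0)
image_index = rng.normal(size=(1000, 64))  # stand-in for stored image embeddings

# Hypothetical text query whose embedding lands close to image 42.
text_query = image_index[42] + 0.1 * rng.normal(size=64)

hits = top_k(text_query, image_index, k=3)
for rank, (idx, score) in enumerate(hits, start=1):
    print(f"{rank}. image {idx}  similarity={score:.3f}")
```

Because text and images share one space, the same function serves text-to-image and image-to-image search; only the source of the query vector changes.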

Key takeaways

  • CLIP trains image and text encoders together so matching pairs align in one shared embedding space.
  • Contrastive training over about 400 million web pairs gives strong zero-shot transfer across many vision tasks.
  • Zero-shot classification works by encoding text labels as prompts and choosing the highest image-text similarity.
  • CLIP encoders are reused inside diffusion models, vision language models, and cross-modal retrieval systems.

Frequently asked questions

What is CLIP used for?
CLIP is used for zero-shot image classification, cross-modal retrieval (finding images from text or vice versa), and as a pre-trained vision or text encoder inside other systems such as text-to-image diffusion models and vision language models.

How does CLIP perform zero-shot classification?
CLIP encodes candidate labels as text prompts and encodes the image into the same space, then picks the label whose embedding is most similar to the image. Because labels are plain text, you can change the class set at inference without retraining.

What data was CLIP trained on?
CLIP was trained on roughly 400 million image-text pairs gathered from the internet, a dataset called WIT (WebImageText) in the original work. The captions act as a weak, scalable supervision signal instead of hand-curated category labels.

What is CLIP's contrastive loss?
CLIP's contrastive loss is a symmetric cross-entropy loss over a batch's image-text similarity matrix. True pairs on the diagonal are pushed up and all other pairings down, with the similarities scaled by a learnable temperature, so the model matches images to captions in both directions.

What are CLIP's limitations?
CLIP can absorb biases from its web data, struggles with fine-grained counting and precise spatial reasoning, and its zero-shot accuracy varies a lot by domain. Prompt wording also affects results, so the same task can score differently with different label phrasings.