CLIP is a model from OpenAI that trains an image encoder and a text encoder together so matching image-caption pairs land near each other in a shared embedding space, enabling zero-shot image classification from text labels.
What is CLIP?
CLIP (Contrastive Language-Image Pre-training) is a model introduced by OpenAI in 2021 that learns a single embedding space shared by images and text. It uses two encoders, one for images and one for text, trained jointly so that an image and its true caption produce vectors that point in nearly the same direction, while mismatched image-caption pairs are pushed apart. Once trained, the same model can classify images it was never explicitly trained on by comparing an image embedding against the embeddings of candidate text labels.
CLIP was described in the paper 'Learning Transferable Visual Models From Natural Language Supervision' and trained on roughly 400 million image-text pairs collected from the internet. Its headline result was zero-shot performance: a CLIP model matched the accuracy of a fully supervised ResNet-50 on ImageNet without using any of ImageNet's 1.28 million labeled training images.
- Two encoders, image and text, share one embedding space.
- Trained on about 400 million image-text pairs from the web.
- Enables zero-shot classification by comparing images to text labels.
How contrastive training works
For a batch of N image-text pairs, CLIP computes N image embeddings and N text embeddings, then forms an N by N matrix of cosine similarities. The N correct (image, text) pairs lie on the diagonal and should score high; the N² minus N incorrect pairings should score low. CLIP optimizes a symmetric cross-entropy loss: for each image it predicts the correct caption among all captions in the batch, and for each caption it predicts the correct image. A learnable temperature scales the similarities before the softmax.
This contrastive objective is what makes CLIP scalable. Predicting which caption belongs to which image is a far cheaper learning signal than predicting exact words or pixels, so the model can absorb hundreds of millions of noisy web pairs. The result is a general-purpose representation that transfers across dozens of vision benchmarks.
- Similarity is measured by cosine similarity in the shared space.
- A symmetric loss matches images to text and text to images.
- A learnable temperature controls the sharpness of the softmax.
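The loss described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names (`l2_normalize`, `cross_entropy`, `clip_loss`) are hypothetical, the embeddings are assumed to be already computed by the two encoders, and the temperature is passed as a fixed number, whereas CLIP learns it during training.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the correct entry."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over N pairs, where image i matches text i.

    Assumption: temperature is a fixed illustrative value here; in CLIP it is
    a learnable parameter.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # N x N matrix of cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text: each row should put its probability mass on the diagonal.
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image: the same check over the columns.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2
```

With correctly paired toy vectors the loss is close to zero, and swapping the captions makes it large, which is the signal that pulls matching pairs together and pushes mismatched pairs apart. Real training code would use a tensor library and compute this over large batches.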
Zero-shot classification with CLIP
CLIP turns classification into a retrieval problem. To classify an image into one of K categories, you write each label as a short text prompt, for example 'a photo of a {label}', encode all K prompts with the text encoder, encode the image, and pick the label whose text embedding has the highest cosine similarity to the image embedding. No task-specific fine-tuning or labeled training set is needed.
Because labels are expressed in natural language, the set of possible classes is open-ended. You can add or change categories at inference time simply by editing the text prompts, which is how the same model could be evaluated zero-shot on more than 30 vision datasets. Prompt wording measurably affects accuracy; the authors call the practice of tuning it prompt engineering.
- Labels become text prompts encoded by the text tower.
- Prediction is the label with the highest image-text similarity.
- Class set is open-ended and editable at inference time.
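The procedure above can be sketched as follows. This is a toy illustration under stated assumptions: the helper names (`build_prompts`, `zero_shot_classify`) are hypothetical, and the 3-dimensional vectors are hand-written stand-ins for what the text and image encoders would actually produce.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def build_prompts(labels, template="a photo of a {label}"):
    """Turn each class label into a short text prompt."""
    return [template.format(label=label) for label in labels]

def zero_shot_classify(image_emb, labels, prompt_embs):
    """Pick the label whose prompt embedding is most similar to the image."""
    scores = [cosine_similarity(image_emb, t) for t in prompt_embs]
    best = max(range(len(labels)), key=lambda k: scores[k])
    return labels[best], scores

labels = ["cat", "dog", "car"]
prompts = build_prompts(labels)  # these strings would go to the text encoder

# Assumption: made-up embeddings standing in for real encoder outputs.
prompt_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.2, 0.8, 0.1]

prediction, scores = zero_shot_classify(image_emb, labels, prompt_embs)
print(prediction)  # dog
```

Adding a class only means appending a label and encoding one more prompt; nothing about the image side changes.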
Why CLIP matters for multimodal AI
CLIP's shared embedding space made it a foundational component of later systems. Both of its encoders are widely reused: text-to-image models such as Stable Diffusion use a CLIP text encoder to condition generation, and vision-language models such as LLaVA use a frozen CLIP vision encoder as their eyes. CLIP embeddings also power text-to-image and image-to-image retrieval in vector databases, since matching across modalities reduces to a similarity search.
The approach has limits. CLIP inherits biases and gaps from its web-scraped data, struggles with fine-grained counting and precise spatial reasoning, and its zero-shot accuracy varies widely by domain. These are active research areas, but the core idea of aligning modalities through contrastive learning remains a standard building block.
- CLIP text/image encoders feed diffusion models and vision language models.
- Cross-modal retrieval becomes a vector similarity search.
- Known weaknesses: data bias, counting, and fine-grained spatial reasoning.
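Cross-modal retrieval as a similarity search can be sketched like this. It is a brute-force illustration, not how a vector database works internally: the function name `top_k`, the file names, and the embeddings are all hypothetical, and a production system would typically use an approximate nearest-neighbor index instead of scoring every item.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query_emb, index, k=2):
    """Return the k (item_id, score) pairs most similar to the query."""
    q = normalize(query_emb)
    scored = [
        (item_id, sum(a * b for a, b in zip(q, normalize(emb))))
        for item_id, emb in index.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Assumption: toy image embeddings keyed by hypothetical file names.
image_index = {
    "beach.jpg": [0.9, 0.1, 0.1],
    "forest.jpg": [0.1, 0.9, 0.2],
    "city.jpg": [0.1, 0.2, 0.9],
}

# Stand-in for the text encoder's embedding of a search query.
text_query_emb = [0.8, 0.3, 0.1]

results = top_k(text_query_emb, image_index, k=2)
print([item_id for item_id, _ in results])  # ['beach.jpg', 'forest.jpg']
```

Because text and images share one space, the same function serves image-to-image search by passing an image embedding as the query.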
Key takeaways
- CLIP trains image and text encoders together so matching pairs align in one shared embedding space.
- Contrastive training over about 400 million web pairs gives strong zero-shot transfer across many vision tasks.
- Zero-shot classification works by encoding text labels as prompts and choosing the highest image-text similarity.
- CLIP encoders are reused inside diffusion models, vision language models, and cross-modal retrieval systems.