A Vision Transformer (ViT) is an image model that splits an image into fixed-size patches, treats each patch as a token, and processes the patch sequence with a standard Transformer encoder instead of convolutions.
What is a Vision Transformer (ViT)?
A Vision Transformer (ViT) is a neural network that applies the Transformer architecture, originally built for text, directly to images. Rather than sliding convolutional filters across pixels, ViT cuts an image into a grid of fixed-size patches (commonly 16x16 pixels), flattens each patch, projects it into an embedding vector, and feeds the resulting sequence into a standard Transformer encoder. Each patch plays the role that a word token plays in a language model.
ViT was introduced in the 2020 paper 'An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale' by a team at Google, and presented at ICLR 2021. Its central finding was that, given enough pre-training data, a near-pure Transformer can match or beat convolutional networks on image classification, even though it has far less built-in knowledge about images.
- Input image is divided into non-overlapping patches, each treated as a token.
- A Transformer encoder, not convolutions, does the heavy lifting.
- Strong results depend on large-scale pre-training (tens to hundreds of millions of images).
How does ViT turn an image into tokens?
The pipeline has four steps. First, an image of height H, width W, and C channels is cut into N = HW / P^2 non-overlapping patches of P by P pixels, and each patch is flattened into a vector of length P^2 * C. Second, each flattened patch is passed through a single shared linear projection to produce a patch embedding of dimension D. Third, a learnable position embedding is added to each patch embedding so the model knows where each patch sat in the original image, since attention is otherwise order-agnostic. Fourth, a special learnable [class] token is prepended to the sequence; after the encoder runs, the final state of that token is used for classification.
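The tokenization pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper name `patchify`, the random weights, and the sizes (224x224 RGB image, P = 16, D = 768) are assumptions chosen for the example. The sketch prepends the class token before adding position embeddings so that one position table with N + 1 rows covers every token, as in the original paper's formulation.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image sides must be divisible by the patch size"
    grid = image.reshape(H // P, P, W // P, P, C)   # (rows, P, cols, P, C)
    grid = grid.transpose(0, 2, 1, 3, 4)            # (rows, cols, P, P, C)
    return grid.reshape(-1, P * P * C)              # (N, P*P*C)

rng = np.random.default_rng(0)
H, W, C, P, D = 224, 224, 3, 16, 768                # illustrative sizes (assumed)

image = rng.standard_normal((H, W, C))              # stand-in for a real image
patches = patchify(image, P)                        # (196, 768): N patches, P*P*C values each

# One shared linear projection maps every flattened patch to dimension D.
# Random weights stand in for learned ones.
projection = rng.standard_normal((P * P * C, D)) * 0.02
patch_embeddings = patches @ projection             # (196, D)

# Learnable [class] token and position table (random/zero stand-ins here).
cls_token = np.zeros((1, D))
pos_embeddings = rng.standard_normal((patches.shape[0] + 1, D)) * 0.02

# Prepend the class token, then add one position embedding per token.
tokens = np.concatenate([cls_token, patch_embeddings], axis=0) + pos_embeddings
print(patches.shape, tokens.shape)                  # (196, 768) (197, 768)
```

Note that with P = 16 and C = 3 the flattened patch length (768) happens to equal the assumed D; the two are independent in general.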
The Transformer encoder itself is the familiar stack of alternating multi-head self-attention and feed-forward (MLP) blocks, each with layer normalization and residual connections. Because every patch attends to every other patch, ViT can model long-range relationships across the whole image from the very first layer, unlike a convolution whose receptive field grows slowly with depth.
- Number of patches: a 224x224 image with 16x16 patches yields a 14x14 grid, or 196 patches (197 tokens once the [class] token is added).
- Position embeddings restore spatial order lost by attention.
- The [class] token aggregates global information for the final prediction.
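To make the encoder description concrete, here is a rough NumPy sketch of one encoder block. It is simplified under stated assumptions: a single attention head instead of multi-head attention, layer normalization without learned scale and shift, a tanh approximation of GELU, random weights, and a small width (D = 64) to keep it fast. All function and variable names are illustrative. The attention weight matrix has one row and one column per token, which is what lets every patch draw on every other patch in a single layer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token vector; learned scale/shift omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def self_attention(x, Wq, Wk, Wv, Wo):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (T, T): every token scored against every token
    weights = softmax(scores)
    return (weights @ v) @ Wo, weights

def encoder_block(x, params):
    # Attention sub-layer with a residual connection.
    attn_out, weights = self_attention(layer_norm(x), *params["attn"])
    x = x + attn_out
    # Feed-forward (MLP) sub-layer with a residual connection.
    W1, W2 = params["mlp"]
    x = x + gelu(layer_norm(x) @ W1) @ W2
    return x, weights

rng = np.random.default_rng(0)
T, D, hidden = 197, 64, 256                  # 196 patches + [class]; small D for speed
params = {
    "attn": [rng.standard_normal((D, D)) * 0.02 for _ in range(4)],
    "mlp": (rng.standard_normal((D, hidden)) * 0.02,
            rng.standard_normal((hidden, D)) * 0.02),
}
x = rng.standard_normal((T, D))
out, weights = encoder_block(x, params)
print(out.shape, weights.shape)              # (197, 64) (197, 197)
```

Each row of `weights` sums to 1 and has a nonzero entry for every token, including patches on the opposite side of the image.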
ViT versus convolutional networks
Convolutional neural networks bake in strong inductive biases: locality (nearby pixels are related) and translation equivariance (shifting the input shifts the feature map in the same way, so a feature is detected wherever it appears). These priors make CNNs data-efficient on small and medium datasets. ViT has weaker built-in priors, so when trained on a mid-sized dataset such as ImageNet alone, it tends to trail comparable CNNs by a few percentage points.
The picture flips at scale. When pre-trained on very large datasets (the paper used datasets up to roughly 300 million images) and then fine-tuned, large ViT variants reach or exceed state-of-the-art CNN accuracy while often using less compute to train. The lesson is that, with enough data, learned attention patterns can substitute for hand-designed convolutional priors.
- CNN strengths: data efficiency, locality, translation equivariance.
- ViT strengths: global context from layer one, strong scaling with data.
- Common ViT sizes: ViT-Base (about 86M parameters), ViT-Large (about 307M), and ViT-Huge (about 632M).
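The parameter counts quoted for these sizes can be roughly reproduced with some arithmetic. The sketch below is a back-of-the-envelope estimate, not an official formula: the helper `vit_param_count` is hypothetical, and the layer counts, widths, and patch sizes are assumed to be the configurations reported in the original paper (12/768/3072 for Base, 24/1024/4096 for Large, 32/1280/5120 with 14x14 patches for Huge). It counts the encoder without a classification head.

```python
def vit_param_count(layers, dim, mlp_dim, patch, image=224, channels=3):
    """Rough parameter count of a ViT encoder, without a classification head."""
    tokens = (image // patch) ** 2 + 1                    # patches + [class] token
    embed = patch * patch * channels * dim + dim          # patch projection weights + bias
    embed += dim                                          # [class] token
    embed += tokens * dim                                 # position embeddings
    attn = 4 * (dim * dim + dim)                          # Q, K, V, and output projections
    mlp = dim * mlp_dim + mlp_dim + mlp_dim * dim + dim   # two linear layers with biases
    norms = 2 * 2 * dim                                   # two layer norms (scale + shift)
    final_norm = 2 * dim
    return embed + layers * (attn + mlp + norms) + final_norm

# Assumed configurations: (layers, hidden size, MLP size, patch size).
configs = {
    "ViT-Base": (12, 768, 3072, 16),
    "ViT-Large": (24, 1024, 4096, 16),
    "ViT-Huge": (32, 1280, 5120, 14),
}
counts = {name: vit_param_count(*cfg) for name, cfg in configs.items()}
for name, n in counts.items():
    print(f"{name}: {n / 1e6:.1f}M parameters")
```

Under these assumptions the estimates come out at roughly 86M, 303M, and 631M, within a couple of percent of the commonly quoted figures; the exact totals depend on details such as the classification head and input resolution.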
A minimal ViT patch embedding in code
The core of ViT is straightforward to implement. The patch embedding is usually built as a convolution whose kernel size and stride both equal the patch size, which is the standard trick to extract and project all patches in one operation; a learnable class token is then prepended to the resulting sequence.
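A minimal PyTorch sketch of that trick follows. The class name `PatchEmbedding`, its default sizes, and the initialization choices are assumptions made for illustration; it also adds position embeddings so the output is ready for an encoder. It is one common way to write this layer, not the original authors' code.

```python
import torch
from torch import nn

class PatchEmbedding(nn.Module):
    """Turn a batch of images into a sequence of patch tokens plus a class token."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        if image_size % patch_size != 0:
            raise ValueError("image_size must be divisible by patch_size")
        self.num_patches = (image_size // patch_size) ** 2
        # kernel_size == stride == patch_size, so each output position sees
        # exactly one non-overlapping patch and applies the same linear map to it.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

    def forward(self, x):
        # x: (batch, channels, height, width)
        x = self.proj(x)                      # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)      # (B, N, D): one row per patch
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)        # (B, N + 1, D)
        return x + self.pos_embed

embed = PatchEmbedding()
tokens = embed(torch.randn(2, 3, 224, 224))
print(tokens.shape)                           # torch.Size([2, 197, 768])
```

The output can be passed directly to a stack of Transformer encoder layers; the first token of each sequence is the class token.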
Where ViTs are used
The Vision Transformer is now a default backbone for many vision systems. Pre-trained ViT encoders supply the image side of contrastive models such as CLIP and the visual front end of vision language models such as LLaVA. Variants extend the idea to dense prediction, video, and self-supervised learning, and ViT features are commonly stored as embeddings in a vector database for semantic image search.
- Image classification, retrieval, and as a frozen feature extractor.
- The visual encoder inside multimodal and vision language models.
- Self-supervised pre-training and dense tasks via ViT-derived variants.
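As a rough illustration of the retrieval use case, the sketch below ranks stored embeddings by cosine similarity to a query. It is a brute-force NumPy stand-in for a vector database, and the embeddings are random placeholders rather than real ViT features; the names `normalize` and `top_k` are hypothetical.

```python
import numpy as np

def normalize(v):
    # Scale vectors to unit length so a dot product equals cosine similarity.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def top_k(query, index, k=3):
    """Return the ids and cosine scores of the k stored vectors closest to the query."""
    scores = normalize(index) @ normalize(query)
    order = np.argsort(-scores)[:k]
    return order, scores[order]

rng = np.random.default_rng(0)
index = rng.standard_normal((1000, 768))             # placeholder for stored image embeddings
query = index[42] + 0.1 * rng.standard_normal(768)   # slightly perturbed copy of item 42

ids, scores = top_k(query, index)
print(ids[0])                                        # 42: the near-duplicate ranks first
```

A production system would typically replace the exhaustive dot product with an approximate nearest-neighbor index, but the ranking principle is the same.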
Key takeaways
- ViT treats fixed-size image patches as tokens and runs them through a standard Transformer encoder, no convolutions required.
- Position embeddings and a learnable class token let the model recover spatial order and produce a single classification vector.
- ViT has weaker image priors than CNNs, so it needs large-scale pre-training to reach top accuracy, after which it matches or beats CNNs.
- ViT backbones now power CLIP, vision language models, and image retrieval pipelines.