
Vision Transformer (ViT)

A Vision Transformer (ViT) is an image model that splits an image into fixed-size patches, treats each patch as a token, and processes the patch sequence with a standard Transformer encoder instead of convolutions.

What is a Vision Transformer (ViT)?

A Vision Transformer (ViT) is a neural network that applies the Transformer architecture, originally built for text, directly to images. Rather than sliding convolutional filters across pixels, ViT cuts an image into a grid of fixed-size patches (commonly 16x16 pixels), flattens each patch, projects it into an embedding vector, and feeds the resulting sequence into a standard Transformer encoder. Each patch plays the role that a word token plays in a language model.

ViT was introduced in the 2020 paper 'An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale' by a team at Google, and presented at ICLR 2021. Its central finding was that, given enough pre-training data, a near-pure Transformer can match or beat convolutional networks on image classification, even though it has far less built-in knowledge about images.

Input image is divided into non-overlapping patches, each treated as a token.
A Transformer encoder, not convolutions, does the heavy lifting.
Strong results depend on large-scale pre-training (tens to hundreds of millions of images).

How does ViT turn an image into tokens?

The pipeline has four steps. First, an image of height H, width W, and C channels is reshaped into N patches, where each patch is P by P pixels. Second, each flattened patch is passed through a single linear projection to produce a patch embedding of dimension D. Third, a learnable position embedding is added to each patch embedding so the model knows where each patch sat in the original image, since attention is otherwise order-agnostic. Fourth, a special learnable [class] token is prepended to the sequence; after the encoder runs, the final state of that token is used for classification.
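
The claim that attention is order-agnostic can be checked with a small experiment. The sketch below is a toy single-head self-attention written by hand, not the ViT implementation; the sizes, the random weights, and the fixed permutation are all assumptions made for illustration.

```python
import torch

torch.manual_seed(0)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over tokens x: (N, D)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    weights = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
    return weights @ v

n_tokens, dim = 6, 8                         # toy sizes, chosen only for illustration
x = torch.randn(n_tokens, dim)               # stand-in for patch embeddings
wq, wk, wv = (torch.randn(dim, dim) * dim ** -0.5 for _ in range(3))
perm = torch.tensor([3, 0, 5, 1, 4, 2])      # a fixed shuffle of the patch order

# Without position information, shuffling the inputs only shuffles the outputs.
out = self_attention(x, wq, wk, wv)
out_shuffled = self_attention(x[perm], wq, wk, wv)
same_without_pos = torch.allclose(out[perm], out_shuffled, atol=1e-4)

# Adding a distinct vector to each slot ties the result to the patch order.
pos = torch.randn(n_tokens, dim)
out_pos = self_attention(x + pos, wq, wk, wv)
out_pos_shuffled = self_attention(x[perm] + pos, wq, wk, wv)
same_with_pos = torch.allclose(out_pos[perm], out_pos_shuffled, atol=1e-4)

print(same_without_pos, same_with_pos)
```

The first check should pass and the second should fail: once each slot carries its own position vector, moving a patch to a different slot changes what the model computes.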

The Transformer encoder itself is the familiar stack of alternating multi-head self-attention and feed-forward (MLP) blocks, each with layer normalization and residual connections. Because every patch attends to every other patch, ViT can model long-range relationships across the whole image from the very first layer, unlike a convolution whose receptive field grows slowly with depth.
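
One encoder block can be sketched in a few lines of PyTorch. This is a minimal illustration, assuming a pre-norm layout (layer normalization before each sub-layer) and a GELU activation in the MLP; the class name EncoderBlock and the small sizes in the usage lines are hypothetical choices, not taken from any particular codebase.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One pre-norm Transformer encoder block (a sketch, not a tuned implementation)."""

    def __init__(self, dim=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim)
        )

    def forward(self, x):                        # x: (B, N + 1, dim)
        h = self.norm1(x)
        # Every token attends to every other token, including the class token.
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                         # residual around attention
        x = x + self.mlp(self.norm2(x))          # residual around the MLP
        return x

block = EncoderBlock(dim=64, heads=4, mlp_dim=128)   # small sizes for a quick check
tokens = torch.randn(2, 197, 64)
out = block(tokens)
print(out.shape)                                 # torch.Size([2, 197, 64])
```

The block preserves the sequence shape, which is why such blocks can be stacked to any depth.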

N = (H · W) / P²

Number of patches N for an H by W image cut into P by P patches. A 224x224 image with P = 16 gives N = 196 tokens.
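
The formula is easy to check numerically. The helper below is a hypothetical convenience function, not part of any library; it also prints the size of the N by N attention matrix to show how quickly the cost of self-attention grows as patches shrink.

```python
def num_patches(height, width, patch):
    """Number of non-overlapping patch-by-patch tiles in a height-by-width image."""
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    return (height * width) // (patch * patch)

for patch in (32, 16, 8):
    n = num_patches(224, 224, patch)
    # Self-attention compares every token with every other token, so each head
    # builds an n-by-n attention matrix (class token ignored here).
    print(f"P={patch}: N={n} tokens, {n * n} attention entries")
```

For a 224x224 image this gives 49, 196, and 784 tokens respectively. Halving the patch side quadruples the token count and multiplies the attention matrix by sixteen.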

z₀ = [x_class; x¹ₚE; x²ₚE; …; xᴺₚE] + E_pos,   E ∈ ℝ^(P²·C × D),   E_pos ∈ ℝ^((N+1) × D)

The input sequence z₀ is the class token plus each flattened patch xⁱₚ linearly projected by E, with position embeddings E_pos added.

Number of patches: a 224x224 image with 16x16 patches yields 196 patches.
Position embeddings restore spatial order lost by attention.
The [class] token aggregates global information for the final prediction.
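
The equation for z₀ can be followed literally in code. The sketch below cuts the image into patches with a reshape, flattens them, and applies one projection matrix. The helper patchify is a hypothetical name, and the class token and position embeddings are plain zero tensors here, whereas a real model would learn them.

```python
import torch
import torch.nn as nn

def patchify(images, patch):
    """Cut (B, C, H, W) images into flattened patches of shape (B, N, P*P*C)."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch              # patches per column and per row
    x = images.reshape(b, c, gh, patch, gw, patch)
    x = x.permute(0, 2, 4, 3, 5, 1)              # (B, gh, gw, P, P, C)
    return x.reshape(b, gh * gw, patch * patch * c)

B, C, H, W, P, D = 2, 3, 224, 224, 16, 768
N = (H * W) // (P * P)
images = torch.randn(B, C, H, W)

E = nn.Linear(P * P * C, D, bias=False)          # the projection matrix E
x_class = torch.zeros(1, 1, D)                   # class token (learned in practice)
E_pos = torch.zeros(1, N + 1, D)                 # position embeddings (learned in practice)

patches = patchify(images, P)                    # (B, N, P*P*C)
z0 = torch.cat([x_class.expand(B, -1, -1), E(patches)], dim=1) + E_pos
print(patches.shape, z0.shape)
```

With these sizes each flattened patch has 16·16·3 = 768 values, which happens to equal D, and z₀ has N + 1 = 197 rows per image.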

ViT versus convolutional networks

Convolutional neural networks bake in strong inductive biases: locality (nearby pixels are related) and translation equivariance (shifting the input shifts the feature map by the same amount, so a feature is detected wherever it appears). These priors make CNNs data-efficient on small and medium datasets. ViT has weaker built-in priors, so when trained on a mid-sized dataset such as ImageNet alone, it tends to trail comparable CNNs by a few percentage points.

The picture flips at scale. When pre-trained on very large datasets (the paper used datasets up to roughly 300 million images) and then fine-tuned, large ViT variants reach or exceed state-of-the-art CNN accuracy while often using less compute to train. The lesson is that, with enough data, learned attention patterns can substitute for hand-designed convolutional priors.

CNN strengths: data efficiency, locality, translation equivariance.
ViT strengths: global context from layer one, strong scaling with data.
Common ViT sizes: ViT-Base (about 86M parameters), ViT-Large (about 307M), and ViT-Huge (about 632M).
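
These sizes can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes the configurations usually cited for the three variants (depth, width, MLP width, patch size); those numbers are not stated in this article and should be treated as assumptions. The count leaves out the classification head.

```python
def vit_params(layers, dim, mlp_dim, patch, img_size=224, channels=3):
    """Rough parameter count for a ViT encoder, excluding the classification head."""
    n_tokens = (img_size // patch) ** 2 + 1           # patches plus class token
    embed = patch * patch * channels * dim + dim      # patch projection weights + bias
    embed += dim + n_tokens * dim                     # class token + position embeddings
    attn = 4 * (dim * dim + dim)                      # Q, K, V and output projections
    mlp = dim * mlp_dim + mlp_dim + mlp_dim * dim + dim
    norms = 2 * 2 * dim                               # two LayerNorms per block
    return embed + layers * (attn + mlp + norms) + 2 * dim   # plus a final LayerNorm

# (layers, width, MLP width, patch size): assumed configurations, see the note above
configs = {
    "ViT-Base/16": (12, 768, 3072, 16),
    "ViT-Large/16": (24, 1024, 4096, 16),
    "ViT-Huge/14": (32, 1280, 5120, 14),
}
for name, cfg in configs.items():
    print(f"{name}: ~{vit_params(*cfg) / 1e6:.0f}M parameters")
```

The estimates land within a couple of percent of the published figures of roughly 86M, 307M, and 632M. The remaining gap presumably comes from layers this sketch omits, such as the classification head.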

A minimal ViT patch embedding in code

The core of ViT is straightforward. The snippet below builds the patch embedding, prepends a class token, and adds position embeddings. It uses a convolution whose kernel size and stride both equal the patch size, which is the standard trick to extract and project all patches in one operation.

python

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.n_patches = (img_size // patch) ** 2
        # A conv with stride=patch extracts and projects each patch at once
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches + 1, dim))

    def forward(self, x):                 # x: (B, 3, 224, 224)
        x = self.proj(x)                  # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)  # (B, 196, dim)
        cls = self.cls.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)    # (B, 197, dim)
        return x + self.pos               # add position embeddings

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                       # torch.Size([2, 197, 768])

Patch + class-token embedding, the front end of a Vision Transformer.
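
To see why the strided convolution counts as a linear projection of flattened patches, the two routes can be compared numerically. This is a small check under assumed toy sizes (64x64 input, D = 32): it extracts patches with torch.nn.functional.unfold and reuses the convolution's own weights as the projection matrix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
P, C, D = 16, 3, 32                                # small D keeps the check quick
conv = nn.Conv2d(C, D, kernel_size=P, stride=P)
x = torch.randn(2, C, 64, 64)                      # 64x64 image -> 4x4 = 16 patches

# Route 1: the strided convolution used in PatchEmbed.
via_conv = conv(x).flatten(2).transpose(1, 2)      # (B, N, D)

# Route 2: explicit patch extraction, then one matrix multiply with the same weights.
patches = F.unfold(x, kernel_size=P, stride=P)     # (B, C*P*P, N)
patches = patches.transpose(1, 2)                  # (B, N, C*P*P)
weight = conv.weight.reshape(D, -1)                # (D, C*P*P)
via_linear = patches @ weight.T + conv.bias        # (B, N, D)

match = torch.allclose(via_conv, via_linear, atol=1e-4)
print(via_conv.shape, match)
```

Both routes should agree up to floating-point error, so the convolution here is only an efficient way to apply the projection E to every patch at once.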

Where ViTs are used

The Vision Transformer is now a default backbone for many vision systems. Pre-trained ViT encoders supply the image side of contrastive models such as CLIP and the visual front end of vision language models such as LLaVA. Variants extend the idea to dense prediction, video, and self-supervised learning, and ViT features are commonly stored as embeddings in a vector database for semantic image search.

Image classification, retrieval, and as a frozen feature extractor.
The visual encoder inside multimodal and vision language models.
Self-supervised pre-training and dense tasks via ViT-derived variants.
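
Retrieval over stored embeddings comes down to ranking by vector similarity. The sketch below uses made-up four-dimensional vectors and hypothetical file names purely for illustration; a real pipeline would take the vectors from a ViT encoder and keep them in a vector database rather than a dictionary.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; a real ViT would emit hundreds of dimensions per image.
index = {
    "beach.jpg": [0.9, 0.1, 0.0, 0.2],
    "forest.jpg": [0.1, 0.8, 0.3, 0.0],
    "harbor.jpg": [0.7, 0.2, 0.1, 0.4],
}
query = [0.8, 0.1, 0.1, 0.3]        # embedding of the query image (made up)

# Rank stored images from most to least similar to the query.
ranked = sorted(index, key=lambda name: cosine(query, index[name]), reverse=True)
print(ranked)
```

Images whose embeddings point in nearly the same direction as the query rank first, which is the basic operation a vector database performs at larger scale.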

Key takeaways

ViT treats fixed-size image patches as tokens and runs them through a standard Transformer encoder, no convolutions required.
Position embeddings and a learnable class token let the model recover spatial order and produce a single classification vector.
ViT has weaker image priors than CNNs, so it needs large-scale pre-training to reach top accuracy, after which it matches or beats CNNs.
ViT backbones now power CLIP, vision language models, and image retrieval pipelines.

Frequently asked questions

What does a Vision Transformer do?

A Vision Transformer classifies and represents images by splitting them into small patches, embedding each patch as a token, and processing the sequence with a Transformer encoder. The output captures global relationships across the whole image rather than just local pixel neighborhoods.

Why does ViT use 16x16 patches?

Cutting a 224x224 image into 16x16 patches yields 196 tokens, a sequence short enough for self-attention to handle efficiently while keeping each patch large enough to carry meaningful local detail. The paper's title, 'An Image is Worth 16x16 Words,' reflects this choice.

Is ViT better than a CNN?

It depends on data scale. On small or medium datasets, CNNs usually win because of built-in locality priors. With large-scale pre-training, ViT matches or surpasses comparable CNNs, often using less training compute.

How are image patches turned into embeddings?

Each patch is flattened and passed through one linear projection to a fixed-dimension embedding, after which a position embedding is added. In practice a single strided convolution performs the patch extraction and projection in one step.

Does ViT need position embeddings?

Yes. Self-attention treats its inputs as an unordered set, so without position embeddings the model would not know where each patch came from. Learnable position embeddings are added to every token to restore spatial order.
