Vision Language Model (VLM)

A Vision Language Model (VLM) is a multimodal model that combines a vision encoder, a projection module, and a large language model so it can take images and text together as input and produce text that reasons about both.

What is a Vision Language Model (VLM)?

A Vision Language Model (VLM) is a multimodal AI model that accepts both images and text and generates text in response. It lets a system answer questions about a picture, describe a scene, read a chart, or follow instructions that refer to what is visible. In effect, a VLM gives a large language model a way to see.

The common recipe has three parts: a vision encoder that converts an image into feature vectors, a projection module (also called a connector or adapter) that maps those vectors into the language model's embedding space, and a large language model that consumes the projected image tokens alongside ordinary text tokens. The LLaVA model, introduced in the 2023 paper 'Visual Instruction Tuning', is a widely studied open example of this design.

Takes images and text as input and produces text as output.
Built from a vision encoder, a projector, and a language model.
Powers image question answering, captioning, and document and chart reading.

The three building blocks

The vision encoder is usually a pre-trained image model such as a CLIP vision transformer. It splits an image into patches and turns them into a grid of feature vectors that capture objects, text, and layout. Because it is pre-trained on large image-text datasets, its representations already align reasonably well with language concepts.
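As a rough illustration of the "grid of feature vectors", a ViT-style encoder cuts the image into fixed-size square patches and emits one feature vector per patch. The sketch below only does the arithmetic. The function name `patch_grid` is hypothetical, and the example sizes (a 336-pixel square input with 14-pixel patches, as in the CLIP ViT-L/14 encoder used by LLaVA-1.5) are chosen for illustration.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches_per_side, total_patches) for a square image.

    Hypothetical helper for illustration; real encoders may also add
    extra tokens (for example a class token) that are not counted here.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


per_side, total = patch_grid(image_size=336, patch_size=14)
print(f"{per_side} x {per_side} grid = {total} patch feature vectors")
# prints: 24 x 24 grid = 576 patch feature vectors
```

Each of those patch vectors later becomes one image token, so input resolution directly drives how much of the language model's context an image consumes.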

The projection module is the bridge. A vision encoder's outputs do not live in the same space as the language model's word embeddings, so a learned layer maps image features into that space. In the original LLaVA this was a single linear layer; LLaVA-1.5 replaced it with a small two-layer MLP for better alignment. The output is a set of image tokens that the language model treats much like word tokens.

The large language model does the reasoning and writes the answer. It reads the projected image tokens interleaved with the user's text prompt and generates a response token by token, the same way it would for a text-only conversation. A VLM inherits most of its conversational and reasoning ability from this component, so a strong instruction-tuned LLM matters.

Vision encoder: often a CLIP-style ViT that turns pixels into feature vectors.
Projector: a learned linear layer or small MLP that maps image features into the LLM's embedding space.
Language model: an instruction-tuned LLM that reasons over image and text tokens together.
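The data flow through the three blocks can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not any model's actual implementation: the dimensions are tiny, the weights are random rather than trained, the patch features and text embeddings are random stand-ins for real encoder and LLM outputs, and all names (`make_linear`, `project`, and so on) are hypothetical. The projector follows the two-layer MLP shape described above, with GELU assumed as the nonlinearity.

```python
import math
import random

random.seed(0)

VISION_DIM = 4  # toy width of each vision-encoder feature vector (assumption)
LLM_DIM = 6     # toy width of the LLM's token embeddings (assumption)


def make_linear(in_dim, out_dim):
    """Random weights and zero bias: an untrained stand-in for a learned layer."""
    weights = [[random.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Apply y = W x + b to a single vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def gelu(x):
    """Element-wise GELU (the choice of nonlinearity is an assumption)."""
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]


def project(patch_features, fc1, fc2):
    """Two-layer MLP projector: vision features -> LLM embedding space."""
    return [linear(gelu(linear(f, fc1)), fc2) for f in patch_features]


# Stand-ins for real model outputs: 3 patch features, 2 text-token embeddings.
patch_features = [[random.random() for _ in range(VISION_DIM)] for _ in range(3)]
text_embeddings = [[random.random() for _ in range(LLM_DIM)] for _ in range(2)]

fc1 = make_linear(VISION_DIM, LLM_DIM)
fc2 = make_linear(LLM_DIM, LLM_DIM)
image_tokens = project(patch_features, fc1, fc2)

# The LLM receives one sequence: image tokens followed by the prompt tokens.
llm_input = image_tokens + text_embeddings
print(len(llm_input), "tokens, each of width", len(llm_input[0]))
# prints: 5 tokens, each of width 6
```

The point of the sketch is the shape change: after projection, image tokens have the same width as text embeddings, which is what lets the language model process them in a single sequence.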

How a VLM is trained

Training typically happens in stages. A feature-alignment stage trains mainly the projection module on image-caption pairs while the vision encoder and often the LLM stay frozen, so the projector learns to translate image features into something the LLM understands. A visual instruction-tuning stage then trains on image-grounded instructions and dialogues, updating the projector and the LLM so the model learns to follow visual instructions and hold multimodal conversations.

This staged approach keeps costs manageable. The expensive pre-trained components are reused; only the small connector is trained from scratch, and a comparatively cheap instruction-tuning pass adapts the parts to work together. The quality and diversity of the instruction data largely determine how well the final model reasons about images.

Stage one aligns image features to the language space, often training just the projector.
Stage two is visual instruction tuning on image-grounded prompts and dialogues.
Reusing pre-trained encoders and LLMs, frozen where possible, keeps training affordable.
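One way to picture the two stages is as a plan of which components receive gradient updates. The sketch below is a hypothetical data structure, not a training script, and the names `Stage`, `STAGES`, and `frozen_components` are illustrative. Keeping the vision encoder frozen in stage two is an assumption here (the text above only says the projector and the LLM are updated), as is freezing the LLM in stage one, which the text describes as common rather than universal.

```python
from dataclasses import dataclass

COMPONENTS = ("vision_encoder", "projector", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    trainable: frozenset  # components whose weights are updated in this stage

    def frozen_components(self):
        """Components that keep their pre-trained weights in this stage."""
        return [c for c in COMPONENTS if c not in self.trainable]


STAGES = [
    # Stage one: only the projector learns (LLM frozen by assumption).
    Stage("feature_alignment", "image-caption pairs",
          frozenset({"projector"})),
    # Stage two: projector and LLM learn (vision encoder frozen by assumption).
    Stage("visual_instruction_tuning",
          "image-grounded instructions and dialogues",
          frozenset({"projector", "llm"})),
]

for stage in STAGES:
    trainable = [c for c in COMPONENTS if c in stage.trainable]
    print(f"{stage.name}: train {trainable}, freeze {stage.frozen_components()}")
# prints:
# feature_alignment: train ['projector'], freeze ['vision_encoder', 'llm']
# visual_instruction_tuning: train ['projector', 'llm'], freeze ['vision_encoder']
```

In a real framework, the same plan would typically be applied by toggling whether each component's parameters require gradients before building the optimizer for that stage.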

What VLMs are used for

VLMs handle a broad range of tasks: visual question answering, detailed image captioning, optical character recognition combined with reasoning, reading charts and tables, describing screenshots for accessibility, and grounding instructions in what a camera or screen shows. They are also the perception layer for computer-use agents that act on what they see on a screen.

Because a VLM reasons in language, it can chain visual understanding with the rest of an agent's tools. Its limitations mirror those of its parts: it can hallucinate details not present in an image, struggle with precise counting or spatial layout, and inherit biases from its training data. Grounding outputs in the actual image content and evaluating carefully remain important.

Visual question answering, captioning, document and chart understanding.
The perception layer for screen-reading and computer-use agents.
Shared weaknesses: hallucinated details, unreliable counting, and imprecise spatial reasoning.

Key takeaways

A VLM combines a vision encoder, a projection module, and a large language model to reason jointly over images and text.
The projector is the key bridge that maps image features into the language model's embedding space.
Training is usually staged: align image features first, then run visual instruction tuning.
VLMs enable image question answering and document understanding and serve as the perception layer for computer-use agents, but they can still hallucinate visual details.

Frequently asked questions

What is a vision language model?

A vision language model is a multimodal model that takes images and text as input and produces text. It joins a vision encoder, a projection layer, and a large language model so it can describe images, answer questions about them, and follow instructions that refer to visual content.

How does a VLM connect images to a language model?

A projection module, a learned linear layer or small MLP, maps the vision encoder's image features into the language model's embedding space. The result is a set of image tokens the language model processes alongside text tokens, letting it reason over both at once.

How is a VLM different from CLIP?

CLIP aligns images and text in a shared embedding space for matching and retrieval but does not generate sentences. A VLM generates text and typically uses a CLIP-style encoder as its vision component, adding a projector and a language model to produce conversational, reasoned answers.

How is a VLM trained?

Usually in stages: a feature-alignment stage trains the projector on image-caption pairs, often with the encoder and LLM frozen, followed by a visual instruction-tuning stage on image-grounded instructions that updates the projector and the language model.

What are the limitations of VLMs?

VLMs can hallucinate details that are not in an image, struggle with precise counting and fine spatial reasoning, and inherit biases from their training data. Grounding answers in the actual image and careful evaluation help mitigate these issues.
