Matryoshka Embeddings: One Vector, Many Sizes

A Matryoshka embedding is a single vector you can cut shorter on demand, because the model packed the most important meaning into its first dimensions. Slice a 3072-number vector down to its first 256 and it still works as a search key. You pay a small accuracy cost and get a large drop in storage and compute. Front-loaded information is the whole idea, and it turns vector size from a permanent decision into a dial you turn per task.

The name comes from Russian nesting dolls. Open the big doll and a smaller, complete doll sits inside; open that and a smaller one waits again. A Matryoshka embedding nests the same way: the full vector contains a usable shorter vector, which contains an even shorter usable vector, all the way down. The technique is called Matryoshka Representation Learning, or MRL, introduced in a 2022 paper from researchers at the University of Washington, Google, and Harvard.

Why a normal embedding cannot be cut

Cut a normal embedding and you degrade it in ways you cannot predict. A standard model spreads meaning across every dimension with no priority order, so the last 100 numbers can matter as much as the first 100. Drop the tail and you throw away real signal at random: the truncated vector drifts away from where it should point, and nearest-neighbor quality falls off faster the deeper you cut.

There is a simple reason for that. A model with no instruction to rank its dimensions has no incentive to concentrate meaning anywhere in particular. Training pushes every number toward whatever value lowers the loss, and the loss only ever looks at the full vector, so each dimension is just one equal voice in a large committee. Useful information ends up smeared across all of them, and any slice you take is a random subset of that committee rather than a summary of it.

This is why dimension count used to be a hard commitment. Pick a 1536-dimension model and every vector in your index is 1536 numbers, forever, even for the many queries that would have been fine with a coarse match. You paid full storage and full search cost on every lookup whether the task needed that precision or not, and the only way to get a smaller vector was to train or fine-tune a separate, smaller model from scratch.

What MRL actually changes during training

MRL adds one twist to ordinary contrastive training: it measures the loss at several vector lengths at once, not just the full length. MRL scores the model on its full 3072-dimension output, and also on the first 1536, the first 768, the first 256, the first 64, and so on. Each prefix has to be a good embedding on its own. To satisfy all of those targets simultaneously, the model learns to put the broadest, most discriminating information up front and reserve the later dimensions for fine detail.

Insight

The key shift: meaning goes from evenly spread to coarse-to-fine. Early dimensions hold the gist, later dimensions hold the nuance. Truncation then removes nuance first, not meaning at random.

The pressure that produces this ordering is worth picturing. If the first 64 dimensions were allowed to be weak, the 64-dimension loss term would punish the model, so those early numbers are forced to stand on their own as a coarse summary. The 256-dimension term grades the first 256 numbers as a whole, and since the first 64 are already pinned down, the remaining dimensions can only help by adding what the first block missed. The same holds at every longer prefix. The model ends up arranging its output from most important to least, the same way you would write a summary before the supporting detail.
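To make the multi-length loss concrete, here is a minimal sketch in plain Python. It assumes an InfoNCE-style contrastive loss as the base objective, equal weights per prefix, and a temperature of 0.05; the function names are illustrative and none of these choices come from the paper or from any particular library. Real training would run on tensors with autograd, but the structure is the same: one loss term per prefix length, summed.

```python
import math


def _unit(v):
    """Scale a vector to unit length; an all-zero vector stays all zeros."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, docs, temperature=0.05):
    """Contrastive loss: queries[i] should score highest against docs[i]."""
    q = [_unit(v) for v in queries]
    d = [_unit(v) for v in docs]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, dj)) / temperature for dj in d]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]  # negative log-softmax of the matching pair
    return total / len(q)


def matryoshka_loss(queries, docs, prefix_lengths, weights=None):
    """Sum the same loss over several prefixes of the same vectors."""
    weights = weights or [1.0] * len(prefix_lengths)
    return sum(
        w * info_nce([v[:k] for v in queries], [v[:k] for v in docs])
        for k, w in zip(prefix_lengths, weights)
    )
```

A model that hides its signal in the late dimensions can score well on the full-length term and still be punished by the short-prefix terms. The summed loss is what pushes that signal toward the front.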

The cost of this is close to nothing. MRL needs no extra training runs, no separate small models, and no change at inference time beyond slicing the array and renormalizing. The paper frames it as coarse-to-fine representations that are at least as accurate as separately trained low-dimension models, obtained with a minimal change to the training pipeline and no additional cost at inference. You train once and get an entire ladder of vector sizes out of the same weights.

How much accuracy do you lose when you truncate an embedding?

Here is the number that sells MRL. OpenAI's text-embedding-3-large produces 3072-dimension vectors. Truncated to 256 dimensions, it still scores higher on the MTEB benchmark, by OpenAI's own reported average, than the full 1536-dimension text-embedding-ada-002, the model it replaced.

Insight

A vector cut to one-twelfth of its full length, and one-sixth the length of an ada-002 vector, beats the model it replaced. That is the whole pitch in one number.

The retention curve at smaller cuts is gentle too. On the STS benchmark, one Matryoshka model truncated to just 64 of its 768 dimensions, about 8 percent of the full length, kept 98.37 percent of the full model's performance. A non-Matryoshka model truncated the same way kept only 96.46 percent, and that gap widens fast as you cut deeper. The training is what buys you the graceful decline.

Treat the exact numbers as benchmark-dependent

Those percentages come from specific models on specific test sets. Your retention at a given cut depends on the model, the truncation depth, your corpus, and your query mix. A figure like 98 percent on STS does not promise 98 percent on your support-ticket search. The honest claim is directional and reliable: with an MRL-trained model, moderate truncation keeps most of the quality, and the drop is smooth rather than a cliff. Measure on your own data before you commit a depth.

What you save by cutting dimensions

Vector storage scales linearly with dimension count, so halving the dimensions roughly halves the index. Microsoft's Azure AI Search documents the formula directly: 1,000 documents with two 1,536-dimension vector fields consume 1,000 × 2 × 1,536 × 4 bytes, which is about 12.3 MB, since each number is a 4-byte float. Cut each vector to 768 dimensions and that figure halves to roughly 6.1 MB.
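The arithmetic is easy to check for your own numbers. This sketch reproduces the figures above; `index_bytes` is an illustrative helper name, and it counts raw vector data only, so any overhead a specific index structure adds on top is outside what it measures.

```python
def index_bytes(num_docs, fields_per_doc, dims, bytes_per_value=4):
    """Raw vector storage: one fixed-size float per dimension, per field."""
    return num_docs * fields_per_doc * dims * bytes_per_value


full = index_bytes(1_000, 2, 1536)  # two 1,536-dimension fields per document
half = index_bytes(1_000, 2, 768)   # the same fields cut to 768 dimensions

print(f"full: {full / 1e6:.1f} MB, half: {half / 1e6:.1f} MB")
```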

Speed moves for the same reason, because the work a nearest-neighbor search does is proportional to the length of the vectors it compares. A similarity score is a sum over dimensions, so a 768-dimension comparison is half the arithmetic of a 1536-dimension one, on every candidate, for every query. At large index sizes this compounds with the funnel pattern below: a short index lets the first pass scan millions of vectors cheaply, and the slow full-length comparison then runs on only the few dozen candidates that survive, instead of the whole corpus.

Storage and memory: linear in dimensions, and held in RAM for fast search. Halve the dimensions, roughly halve the bytes.
Query speed: distance math runs over fewer numbers per comparison, so each nearest-neighbor lookup does proportionally less work.
Cost at scale: for a large index kept in memory, the dimension count is a direct line item on your infrastructure bill.
No retraining: truncation is a slice of an existing vector, so you re-index without touching the model or paying to embed again.

Insight

Renormalize after truncating. Dot-product search, and many indexes tuned for cosine similarity, assume unit-length vectors, and chopping the tail changes a vector's length. Slice, then divide by the new norm, then index or compare.
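The slice-then-renormalize step is only a few lines. This is a minimal sketch in plain Python with an illustrative function name; with an array library you would do the same thing on a whole batch at once.

```python
import math


def truncate_and_renormalize(vector, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and the vector length")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale in an all-zero prefix
    return [x / norm for x in head]


short = truncate_and_renormalize([3.0, 4.0, 12.0], dims=2)
```

Both the stored vectors and the query vector need the same treatment, at the same length, before you compare them.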

Which embedding models support Matryoshka truncation

MRL is now standard in the default embedding models people reach for, not a research curiosity. OpenAI's text-embedding-3 family exposes it through a simple dimensions API parameter. Nomic Embed v1.5 trains with MRL and supports any dimension between 64 and 768. Jina's recent models truncate down from 1024 to 256 while holding quality, and Google's Gemini Embedding offers MRL cuts from 3072 down to 768 or 256. Each lets you shorten without retraining.

Because the capability has moved into the defaults, the practical question stopped being whether you can truncate and became how short you should go.
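For the OpenAI models, shortening happens in the request itself. The sketch below assumes the official `openai` Python package and its `client.embeddings.create` call with the `dimensions` argument; `embed_short` is an illustrative wrapper, not part of the library, and the client is passed in so the function itself has no network or key setup baked in.

```python
def embed_short(client, texts, dims=256, model="text-embedding-3-large"):
    """Request embeddings already shortened to `dims` values each."""
    response = client.embeddings.create(
        model=model,
        input=texts,
        dimensions=dims,  # ask the API for the shorter vector directly
    )
    return [item.embedding for item in response.data]


# Usage (needs the openai package and an API key in the environment):
# from openai import OpenAI
# vectors = embed_short(OpenAI(), ["nesting dolls"], dims=256)
```

Other providers expose the same idea under different parameter names, so check the documentation for the model you use.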

A simple when-to-truncate rule

Insight

Here is what the benchmark-bragging posts skip: for most indexes you should not truncate at all. A few thousand vectors cost almost nothing at full length, so trading away accuracy buys you nothing.

Default to full dimensions, and truncate only when scale or latency forces the issue. The savings become real once the index is large enough that its dimension count drives memory cost or query latency you can actually feel. Below that, the cleaner choice is to keep every dimension and spend your tuning effort elsewhere.

Index under ~100k vectors and latency is fine: keep full dimensions. The savings are not worth any accuracy loss.
Index in the millions and memory or speed is the bottleneck: truncate, and tune the depth against your own retrieval metric.
Tight recall requirements (legal, medical, compliance): stay near full length, or use the funnel pattern below to keep precision.
Quick prototypes and on-device use: short vectors first, since a coarse match is usually enough and the footprint matters.

The funnel pattern: best of both

You do not have to choose one length. A funnel search uses a short, truncated index for a fast first pass to gather candidates, then re-scores only those candidates with the full-length vectors. You get the speed and small memory of short vectors for the broad sweep, and the precision of full vectors for the final ranking, since the expensive comparison runs on a few dozen items instead of millions.

What makes the funnel safe with MRL is that the short vectors and the long vectors are the same vectors, just cut to different lengths. The coarse first pass is reading the front of the very embedding the precise second pass will finish reading, so a candidate that looks promising at 256 dimensions is not a different object at 3072; it is the same point seen at lower resolution. That coherence is why a cheap shortlist rarely drops the answer the full pass would have ranked first.
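The two passes can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not a production design: both passes are brute-force scans, the short vectors are sliced on the fly where a real system would keep a prebuilt short index, and all names are made up for the example.

```python
import math


def _unit(v):
    """Scale a vector to unit length; an all-zero vector stays all zeros."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def funnel_search(query, corpus, short_dims, shortlist_size, top_k):
    """Coarse pass on truncated vectors, then re-score survivors at full length."""
    # Pass 1: rank every document using only the first `short_dims` values.
    q_short = _unit(query[:short_dims])
    shortlist = sorted(
        range(len(corpus)),
        key=lambda i: _dot(q_short, _unit(corpus[i][:short_dims])),
        reverse=True,
    )[:shortlist_size]

    # Pass 2: re-rank only the shortlist with the full-length vectors.
    q_full = _unit(query)
    reranked = sorted(
        shortlist,
        key=lambda i: _dot(q_full, _unit(corpus[i])),
        reverse=True,
    )
    return reranked[:top_k]
```

The shortlist size is the knob to tune: too small and the first pass can drop the true best match, too large and you lose the speed you were after.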

Property	Standard embedding	Matryoshka (MRL) embedding
Information layout	Spread evenly across all dimensions	Front-loaded, coarse-to-fine
Effect of truncation	Unpredictable; signal lost at random	Graceful; detail drops before meaning
Sizes from one model	One fixed length	A ladder of usable lengths
Retraining to shrink	Retrain or train a separate small model	None; just slice and renormalize
Training cost vs standard	Baseline	Roughly the same

Pro Tip

Before you pick a truncation depth, run your real evaluation set at several lengths (full, half, quarter) and plot recall against index size. The right cut is the shortest one that still clears your quality bar, not a number from someone else's benchmark.
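A sweep like that can be scripted. The sketch below assumes you have query vectors, document vectors, and the index of the one correct document per query; it uses exact search and counts raw vector bytes only. The names `recall_at_k` and `sweep` are illustrative.

```python
import math


def _unit_prefix(v, dims):
    """First `dims` values of v, rescaled to unit length."""
    head = v[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]


def recall_at_k(queries, relevant, corpus, dims, k):
    """Share of queries whose correct document lands in the top k at this depth."""
    docs = [_unit_prefix(d, dims) for d in corpus]
    hits = 0
    for q, target in zip(queries, relevant):
        qs = _unit_prefix(q, dims)
        scores = [sum(a * b for a, b in zip(qs, d)) for d in docs]
        top = sorted(range(len(docs)), key=scores.__getitem__, reverse=True)[:k]
        hits += target in top
    return hits / len(queries)


def sweep(queries, relevant, corpus, depths, k=1, bytes_per_value=4):
    """One (dims, index_bytes, recall) row per truncation depth."""
    return [
        (d, len(corpus) * d * bytes_per_value,
         recall_at_k(queries, relevant, corpus, d, k))
        for d in depths
    ]
```

Plot the rows, or just read them top to bottom, and stop at the shortest depth whose recall still clears your bar.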

Where this fits in a real memory system

Embeddings are the index behind any system that searches your own content by meaning rather than keywords, and MRL is what keeps that index affordable as it grows. MemX is a consumer AI memory layer over your own documents, photos, notes, and chats across Android, iOS, and WhatsApp, so semantic search over a personal corpus is the core job. Front-loaded vectors help keep that search fast and light enough to run close to the device instead of shipping everything to a server.

MemX is private by architecture: per-user keys, encryption at rest, and an on-device first pass over your data. That on-device step is where compact, truncatable vectors pull their weight, because a phone has far less memory than a data center. The design point is straightforward: keep the index small and the matching local, then reach for heavier compute only when a query genuinely needs it.

Frequently Asked Questions

01. What is a Matryoshka embedding?

It is an embedding vector trained so its first dimensions carry the most important information. That lets you cut it shorter on demand, using just the first 256 or 64 numbers as a smaller but still usable embedding, named for Russian nesting dolls.

02. How do you reduce OpenAI embedding dimensions without losing quality?

Use the dimensions API parameter on the text-embedding-3 models, which are MRL-trained. OpenAI reports that text-embedding-3-large truncated to 256 dimensions still beats the full 1536-dimension ada-002 on the MTEB benchmark, so you shorten the vector with no retraining.

03. How much storage do you save by halving embedding dimensions?

Roughly half. Vector index size scales linearly with dimension count, since each number is a fixed-size float. Cutting a 1536-dimension vector to 768 roughly halves the memory and storage that index consumes.

04. Does truncating an embedding lose accuracy?

Some, but little for moderate cuts on MRL models. One reported case kept about 98 percent of full performance on the STS benchmark at a deep cut. Exact retention depends on the model, depth, and your own data, so measure it.

05. Which embedding models support Matryoshka truncation?

OpenAI text-embedding-3, Nomic Embed v1.5, recent Jina models, and Google Gemini Embedding all train with MRL and support dimension truncation. It has become a standard feature of modern default embedding models.

Matryoshka training turns vector size from a one-time commitment into a dial you can turn per task. Train once, then choose your length: full precision when accuracy is everything, a quarter of the length when scale and speed win. The doll just keeps a smaller, complete copy of itself inside, ready whenever you need to travel lighter.
