Contextual Retrieval

Contextual Retrieval is a technique introduced by Anthropic that prepends a short, chunk-specific explanation to each document chunk before it is embedded and indexed, so retrieval keeps the surrounding context that naive chunking strips away. Anthropic reports it cuts retrieval failures by roughly a third, and by about two-thirds when combined with BM25 and reranking.

What is Contextual Retrieval?

Contextual Retrieval is a retrieval-augmented generation (RAG) technique, introduced by Anthropic in September 2024, that adds a short explanatory preamble to every document chunk before that chunk is embedded and indexed. The preamble situates the chunk inside its source document so the stored representation carries context that ordinary chunking would otherwise discard. The goal is simple: make each chunk understandable on its own so the retriever can match it to the right query.

The problem it solves is the loss of context during chunking. Standard RAG pipelines split a document into small pieces of a few hundred tokens each. A chunk that reads "The company's revenue grew by 3% over the previous quarter" no longer states which company, which quarter, or which prior figure it refers to. Once that chunk is embedded in isolation, a query like "What was ACME Corp's Q2 2023 revenue growth?" may fail to retrieve it, because the chunk's stored vector and keywords never mention ACME or Q2 2023.

Contextual Retrieval fixes this by using a language model to write 50 to 100 tokens of context for each chunk, such as "This chunk is from ACME Corp's Q2 2023 10-Q filing; the previous quarter's revenue was 314 million dollars." That context is prepended to the chunk, and the combined text is then both embedded (Contextual Embeddings) and indexed for keyword search (Contextual BM25).

A RAG technique from Anthropic that prepends chunk-specific context before embedding and indexing.
Addresses context loss caused by splitting documents into small, isolated chunks.
Uses an LLM to generate 50 to 100 tokens of situating context per chunk.
Applies to both vector embeddings (Contextual Embeddings) and keyword search (Contextual BM25).
Designed so each chunk is independently meaningful at retrieval time.
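
To make the context-loss problem concrete, the sketch below counts which query terms appear in the bare chunk and in the context-prepended chunk from the ACME example. The `tokenize` and `term_overlap` helpers and the small stopword set are illustrative assumptions, not part of Anthropic's method; a real system would score with BM25 and embeddings rather than raw term overlap.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"what", "was", "the", "by", "is", "from", "over", "this"}

def tokenize(text: str) -> set[str]:
    """Lowercase, keep alphanumeric words (with possessives), drop stopwords."""
    words = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())
    return {w for w in words if w not in STOPWORDS}

def term_overlap(query: str, passage: str) -> set[str]:
    """Return the query terms that also appear in the passage."""
    return tokenize(query) & tokenize(passage)

query = "What was ACME Corp's Q2 2023 revenue growth?"
bare_chunk = "The company's revenue grew by 3% over the previous quarter."
context = (
    "This chunk is from ACME Corp's Q2 2023 10-Q filing; "
    "the previous quarter's revenue was 314 million dollars."
)
contextual_chunk = f"{context}\n\n{bare_chunk}"

print(sorted(term_overlap(query, bare_chunk)))        # only the generic term matches
print(sorted(term_overlap(query, contextual_chunk)))  # entity and period now match too
```

The bare chunk shares only a generic term with the query, while the contextualized chunk also carries the company name and the reporting period that the query asks about.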

How does contextual retrieval work step by step?

The pipeline runs once, at indexing time. First the document is chunked normally. For each chunk, the full document and the chunk are passed to a language model with a prompt asking for a short, succinct context that situates the chunk within the overall document, specifically for the purpose of improving search retrieval. Anthropic's published prompt asks the model to answer only with the context and nothing else.

The generated context is prepended to the original chunk text. The combined text is then embedded with an embeddings model to populate the vector index, and the same combined text feeds a BM25 lexical index. At query time the system runs both vector search and BM25, merges the candidate lists (commonly with reciprocal rank fusion), and optionally passes the top results through a reranker before handing the final chunks to the generator.
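
The merge step can be sketched with reciprocal rank fusion, which scores each chunk by summing 1 / (k + rank) over every list it appears in. This is a minimal sketch: the chunk IDs are hypothetical, and k = 60 is a commonly used default rather than a value taken from Anthropic's write-up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk IDs; each list contributes 1 / (k + rank) per ID."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

vector_hits = ["c3", "c1", "c7"]  # hypothetical IDs returned by the vector index
bm25_hits = ["c1", "c9", "c3"]    # hypothetical IDs returned by the BM25 index
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks that both retrievers rank highly rise to the top, while a chunk found by only one retriever still stays in the candidate list for the optional reranker.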

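Running an LLM over every chunk sounds expensive, but prompt caching makes it practical. Because the full document is reused across all of its chunks, it can be written to the cache once and then read back at a reduced rate on each later call; only the chunk, the instructions, and the generated context are billed at normal rates. Anthropic reports a one-time cost of about 1.02 US dollars per million document tokens to generate the contextualized chunks with caching enabled, assuming 800-token chunks, 8,000-token documents, 50-token context instructions, and 100 tokens of context per chunk.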
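
The arithmetic behind a cost figure like this can be sketched as follows. The function name, its parameters, and the prices passed in are assumptions for illustration: the prices are placeholders rather than a quote of any model's current rates, so the output is not expected to reproduce Anthropic's number.

```python
def contextualization_cost(
    doc_tokens: int,
    chunk_tokens: int,
    instruction_tokens: int,
    context_tokens: int,
    price_input: float,
    price_cache_write: float,
    price_cache_read: float,
    price_output: float,
) -> float:
    """Estimate the dollar cost of contextualizing one document.

    All prices are in dollars per million tokens.
    """
    n_chunks = -(-doc_tokens // chunk_tokens)  # ceiling division
    # The first call writes the document to the cache; later calls read it back.
    cache_write = doc_tokens * price_cache_write
    cache_read = doc_tokens * (n_chunks - 1) * price_cache_read
    # The chunk and the instructions are not cached, so every call pays for them.
    uncached_input = n_chunks * (chunk_tokens + instruction_tokens) * price_input
    output = n_chunks * context_tokens * price_output
    return (cache_write + cache_read + uncached_input + output) / 1_000_000

# Placeholder prices; substitute the current rates for the model you use.
cost = contextualization_cost(
    doc_tokens=8_000, chunk_tokens=800, instruction_tokens=50, context_tokens=100,
    price_input=1.00, price_cache_write=1.25, price_cache_read=0.10, price_output=5.00,
)
print(f"${cost / 8_000 * 1_000_000:.2f} per million document tokens")
```

The sketch assumes every call after the first hits the cache, which holds only if the calls for one document are issued close together and the document meets the model's minimum cacheable length.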

Chunk the document, then for each chunk ask an LLM to write situating context.
Prepend the context to the chunk and index the combined text in both a vector store and a BM25 index.
At query time, run hybrid search, fuse the results, and optionally rerank.
Prompt caching reuses the document across its chunks to keep the one-time cost low.
Anthropic cites about 1.02 dollars per million document tokens to contextualize with caching.

How much does contextual retrieval improve accuracy?

Anthropic measured retrieval failure rate, defined as the share of queries whose relevant chunk was not retrieved in the top 20 results (1 minus recall@20). On their evaluations, Contextual Embeddings alone reduced the failure rate by about 35 percent, from 5.7 percent to 3.7 percent. Combining Contextual Embeddings with Contextual BM25 reduced it by about 49 percent, to 2.9 percent.

Adding a reranking stage on top, which rescores the merged candidates with a cross-encoder before selecting the final chunks, pushed the total reduction to about 67 percent, taking the failure rate from 5.7 percent down to 1.9 percent. The three techniques are complementary: contextualization improves what gets indexed, hybrid search broadens recall across semantic and lexical matches, and reranking improves the final ordering.
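
A reranking stage can be sketched as a function that rescores the fused candidates against the query and keeps the best few. Everything here is an assumption for illustration: `rerank`, its `score_fn` parameter, and the `top_k` default are hypothetical names and values, and `word_overlap` is a stand-in scorer where a production system would call a cross-encoder or a hosted reranking model.

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 20,
) -> list[str]:
    """Rescore candidate chunks against the query and keep the best top_k."""
    scored = [(score_fn(query, text), text) for text in candidates]
    # Stable sort: candidates with equal scores keep their fused order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]

def word_overlap(query: str, text: str) -> float:
    """Stand-in scorer for demonstration only: counts shared lowercase words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))

candidates = ["revenue fell", "acme revenue grew", "weather report"]
print(rerank("acme revenue", candidates, word_overlap, top_k=2))
# ['acme revenue grew', 'revenue fell']
```

Keeping the scorer as a parameter lets the retrieval pipeline stay unchanged when the reranking model is swapped.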

These figures come from Anthropic's own internal benchmarks across multiple datasets and should be treated as representative rather than guaranteed for every corpus. Gains depend on chunk size, the embeddings model, the quality of the context-generation prompt, and how much the source documents rely on implicit context.

Contextual Embeddings alone: failure rate 5.7% to 3.7%, about a 35% reduction.
Contextual Embeddings plus Contextual BM25: down to 2.9%, about a 49% reduction.
Adding reranking: down to 1.9%, about a 67% reduction overall.
The techniques stack because they target indexing, recall, and final ordering.
Numbers are Anthropic's internal benchmarks; real gains vary by corpus and setup.
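
The percentages above follow directly from the failure rates, and the same metric is straightforward to compute on your own evaluation set. The sketch below assumes one relevant chunk ID per query, which is a simplification; the function names are illustrative.

```python
def failure_rate_at_k(retrieved: list[list[str]], relevant: list[str], k: int) -> float:
    """Share of queries whose relevant chunk ID is missing from the top-k results."""
    misses = sum(1 for hits, gold in zip(retrieved, relevant) if gold not in hits[:k])
    return misses / len(relevant)

def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional drop in failure rate relative to the baseline."""
    return (baseline - improved) / baseline

# Failure rates (in percent) as reported by Anthropic, against the 5.7% baseline.
for label, rate in [("embeddings", 3.7), ("+ BM25", 2.9), ("+ reranking", 1.9)]:
    print(f"{label}: {relative_reduction(5.7, rate):.0%}")
# embeddings: 35%
# + BM25: 49%
# + reranking: 67%
```

Running `failure_rate_at_k` before and after contextualizing a sample of your own corpus is a direct way to check whether the reported gains carry over.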

Code: generating chunk context with prompt caching

The core operation is one model call per chunk that produces the situating context. The document is marked for prompt caching so it is processed once and reused across every chunk, which is what keeps the cost low. The example below uses Anthropic's Python SDK and a minimal version of the published prompt.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

DOC_PROMPT = "<document>\n{doc}\n</document>"
CHUNK_PROMPT = (
    "Here is the chunk we want to situate within the whole document:\n"
    "<chunk>\n{chunk}\n</chunk>\n\n"
    "Give a short, succinct context to situate this chunk within the "
    "overall document for the purposes of improving search retrieval. "
    "Answer only with the succinct context and nothing else."
)

def contextualize(document: str, chunk: str) -> str:
    """Return a short context that situates `chunk` within `document`."""
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": DOC_PROMPT.format(doc=document),
                    # Cache the document so later chunks reuse it. Caching only
                    # applies above a model-specific minimum prompt length.
                    "cache_control": {"type": "ephemeral"},
                },
                {"type": "text", "text": CHUNK_PROMPT.format(chunk=chunk)},
            ],
        }],
    )
    return resp.content[0].text

def build_indexed_texts(full_document: str, chunks: list[str]) -> list[str]:
    """Prepend generated context to each chunk; index these texts, not the bare chunks."""
    indexed_texts = []
    for chunk in chunks:
        context = contextualize(full_document, chunk)
        indexed_texts.append(f"{context}\n\n{chunk}")
    return indexed_texts  # embed each text and add it to the BM25 index
```

Generate situating context per chunk, caching the full document to cut cost.
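
For the keyword side of the index, the sketch below shows a minimal Okapi BM25 scorer built over context-prepended chunk text. It is illustrative only: `TinyBM25` is a hypothetical helper written for this example, the `k1` and `b` values are common defaults, Globex Inc is an invented second company, and a production system would use an established search engine or library instead.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

class TinyBM25:
    """Minimal Okapi BM25 over a fixed list of texts (illustrative, not optimized)."""

    def __init__(self, texts: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(t)) for t in texts]
        self.lengths = [sum(d.values()) for d in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        # Document frequency: how many texts contain each term.
        self.df = Counter(term for d in self.docs for term in d)

    def score(self, query: str, i: int) -> float:
        doc, n = self.docs[i], len(self.docs)
        total = 0.0
        for term in tokenize(query):
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n - self.df[term] + 0.5) / (self.df[term] + 0.5))
            norm = tf + self.k1 * (1 - self.b + self.b * self.lengths[i] / self.avg_len)
            total += idf * tf * (self.k1 + 1) / norm
        return total

    def search(self, query: str, top_k: int = 5) -> list[int]:
        """Return indices of the top_k texts, best match first."""
        ranked = sorted(range(len(self.docs)), key=lambda i: -self.score(query, i))
        return ranked[:top_k]

# Two near-identical chunks that only their prepended context tells apart.
contextual_texts = [
    "This chunk is from ACME Corp's Q2 2023 10-Q filing.\n\n"
    "The company's revenue grew by 3% over the previous quarter.",
    "This chunk is from Globex Inc's Q4 2022 annual report.\n\n"
    "The company's revenue fell by 1% over the previous quarter.",
]
index = TinyBM25(contextual_texts)
print(index.search("ACME Q2 2023 revenue growth"))  # [0, 1]
```

Without the prepended context, both chunks would match the query only on "revenue", so BM25 would have little basis for preferring the ACME chunk.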

When to use it, and its limits

Contextual Retrieval is most useful for corpora where individual chunks lean on document-level context: financial filings, legal contracts, technical manuals, codebases, and long reports where a paragraph is meaningless without knowing the section, entity, or time period it belongs to. For small knowledge bases that fit inside a model's context window, Anthropic notes you may not need RAG at all and can supply the whole corpus directly.

The main costs are an extra preprocessing pass at indexing time and slightly larger stored text per chunk. Reindexing is required whenever documents change, and the quality of the generated context depends on the model and prompt. The technique also does not, on its own, address how the model uses retrieved chunks once they are in the prompt, which is where issues like lost-in-the-middle still apply. A dedicated memory layer such as MemX addresses a complementary concern: contextual retrieval improves what a system finds in a document store, while a memory layer governs what persists about a user across sessions and is designed to be private by architecture, with per-user isolation.

Best for documents where chunks depend heavily on section, entity, or time context.
For tiny corpora that fit the context window, full-context prompting may beat RAG.
Costs: a one-time preprocessing pass and larger per-chunk stored text.
Documents must be reindexed when they change.
It improves retrieval, not how the model uses chunks once they are in the prompt.

Key takeaways

Contextual Retrieval prepends a short, LLM-generated context to each chunk before embedding and indexing so chunks stay meaningful in isolation.
Anthropic reports retrieval failures falling about 35% with Contextual Embeddings, about 49% adding Contextual BM25, and about 67% with reranking on top.
Prompt caching reuses the full document across its chunks, keeping the one-time cost near 1.02 dollars per million document tokens.
It is strongest for documents where chunks depend on section, entity, or time-period context, such as filings, contracts, and manuals.
It improves what gets retrieved but does not fix how a model uses chunks already in context, so pair it with reranking and good context engineering.

Frequently asked questions

What is contextual retrieval?

It is an Anthropic technique that uses a language model to write 50 to 100 tokens of context describing where a chunk sits in its document, then prepends that context before embedding and indexing the chunk. This keeps chunks meaningful in isolation and reduces retrieval failures.

How much does contextual retrieval improve retrieval accuracy?

In Anthropic's benchmarks, Contextual Embeddings alone cut retrieval failures by about 35 percent, adding Contextual BM25 reached about 49 percent, and adding reranking reached about 67 percent overall, taking the failure rate from 5.7 percent to 1.9 percent. Results vary by corpus.

Is contextual retrieval expensive to run?

It adds a one-time preprocessing pass that calls an LLM per chunk, but prompt caching reuses the full document across its chunks. Anthropic reports about 1.02 US dollars per million document tokens with caching enabled, making it practical for most corpora.

What is the difference between Contextual Embeddings and Contextual BM25?

Both index the same context-prepended chunk text, but Contextual Embeddings stores it as a vector for semantic search, while Contextual BM25 indexes it for lexical keyword search. Running both as hybrid search catches matches that either method alone would miss.

Do I need reranking with contextual retrieval?

Reranking is optional but complementary. Contextual retrieval improves what gets indexed and recalled, while a reranker rescores the merged candidates to improve final ordering. Anthropic found adding reranking lifted the total failure-rate reduction from about 49 percent to about 67 percent.
