Lost in the Middle

By Aditya Kumar Jha, Engineer

Lost in the middle is a documented failure pattern in which language models use information placed at the beginning or end of a long context far more reliably than information placed in the middle. Named after a 2023 paper by Liu et al., it produces a U-shaped accuracy curve as a function of where the relevant content sits.

What is Lost in the Middle?

Lost in the middle is a failure pattern in which a language model makes much better use of relevant information when it appears near the start or the end of its input context, and significantly worse use of the same information when it sits in the middle. The term comes from the paper "Lost in the Middle: How Language Models Use Long Contexts" by Nelson F. Liu and colleagues, a Stanford-led team. It was first released in 2023 and later published in the Transactions of the Association for Computational Linguistics.

Plotted against the position of the relevant content, model accuracy traces a U shape: high at the two ends and sagging in the middle. This means simply having a long context window is not the same as using all of it equally. A model with a large window can still effectively ignore content buried in the center, even when that content directly answers the question.

The pattern reflects two positional biases, named by analogy with effects in human memory. Primacy bias favors information seen first, and recency bias favors information seen last. Content positioned between the two extremes receives the least reliable attention, which is why the middle is where retrieval-augmented and long-document systems lose accuracy.

  • LLMs use information at the start and end of a long context more reliably than the middle.
  • Named after the 2023 paper by Liu et al. (Stanford), published in TACL.
  • Accuracy as a function of relevant-content position forms a U-shaped curve.
  • A large context window does not guarantee uniform use of that window.
  • Driven by primacy bias (favoring the first content) and recency bias (favoring the last).

What did the original study find?

The authors tested two controlled tasks. In multi-document question answering, the model received many documents, exactly one of which contained the answer, and the position of that gold document was varied. In a synthetic key-value retrieval task, the model had to return the value for a specific key from a long list, isolating pure positional retrieval from reasoning.
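The key-value task is simple enough to sketch. The version below is a minimal illustration loosely modeled on the description above, not the authors' code: the function name `build_kv_prompt`, the prompt wording, and the use of random UUID-style strings are all assumptions made for this example.

```python
import json
import random
import uuid


def build_kv_prompt(num_pairs: int, gold_index: int, seed: int = 0):
    """Build a synthetic key-value retrieval prompt (illustrative sketch).

    Returns (prompt, gold_key, gold_value). The pair at `gold_index`
    is the one the model is asked to retrieve.
    """
    if not 0 <= gold_index < num_pairs:
        raise ValueError("gold_index must be within the list of pairs")

    rng = random.Random(seed)

    def random_id() -> str:
        # Meaningless random strings, so the task tests retrieval rather than reasoning.
        return str(uuid.UUID(int=rng.getrandbits(128)))

    pairs = [(random_id(), random_id()) for _ in range(num_pairs)]
    gold_key, gold_value = pairs[gold_index]

    # dict() keeps insertion order, so gold_index controls where the pair appears.
    data = json.dumps(dict(pairs), indent=1)

    # Prompt wording is an assumption for this sketch.
    prompt = (
        "Extract the value corresponding to the specified key "
        "in the JSON object below.\n\n"
        f"JSON data:\n{data}\n\n"
        f'Key: "{gold_key}"\n'
        "Corresponding value:"
    )
    return prompt, gold_key, gold_value
```

Sweeping `gold_index` from 0 to `num_pairs - 1` while holding `num_pairs` fixed varies only the position of the target pair, which is the variable the study manipulated.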

Across models, accuracy was highest when the relevant document sat at the very beginning or end of the input and dropped sharply when it was in the middle. In some configurations a model performed worse with the answer in the middle of a long context than it did with no retrieved documents at all (the closed-book setting). The gap between the best and worst positions reached roughly 20 percentage points for some models.

Crucially, the effect held for models marketed as long-context. Extended-context variants showed essentially the same positional weakness as their standard counterparts, which showed that enlarging the window does not by itself fix how evenly a model attends across that window. The study covered a range of models available at the time, including GPT-3.5-Turbo and Claude 1.3.

  • Two tasks: multi-document QA with one gold document, and synthetic key-value retrieval.
  • Accuracy peaked at the start and end positions and sagged in the middle.
  • In some cases mid-context accuracy fell below the no-documents closed-book baseline.
  • Best-to-worst position gaps reached roughly 20 percentage points for some models.
  • Long-context model variants showed the same positional weakness as standard ones.
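A position sweep of this kind can be reproduced against any model with a small harness. The sketch below is an assumption-laden outline rather than the study's evaluation code: `make_example` and `ask_model` are hypothetical callables you would supply, and correctness is judged by a simple substring match.

```python
from typing import Callable, Dict, List, Tuple


def accuracy_by_position(
    positions: List[int],
    trials: int,
    make_example: Callable[[int, int], Tuple[str, str]],
    ask_model: Callable[[str], str],
) -> Dict[int, float]:
    """Measure accuracy at each position of the relevant content.

    make_example(position, trial) -> (prompt, expected_answer)   # hypothetical
    ask_model(prompt) -> model output text                       # hypothetical
    """
    if trials <= 0:
        raise ValueError("trials must be positive")

    results: Dict[int, float] = {}
    for pos in positions:
        correct = 0
        for trial in range(trials):
            prompt, expected = make_example(pos, trial)
            # Substring match: counts as correct if the answer appears in the output.
            if expected in ask_model(prompt):
                correct += 1
        results[pos] = correct / trials
    return results


def best_to_worst_gap(accuracy: Dict[int, float]) -> float:
    """Gap between the best and worst positions, in percentage points."""
    return 100.0 * (max(accuracy.values()) - min(accuracy.values()))
```

Plotting the returned dictionary against position shows whether a given model exhibits the U-shaped curve, and `best_to_worst_gap` summarizes its depth in a single number.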

Why does it happen?

There is no single agreed cause, but several factors contribute. Transformer attention is not uniform across positions, and training data plus positional encodings tend to reinforce attention to the boundaries of a sequence. Many naturally occurring documents put key information in introductions and conclusions, so models learn that the ends are information-rich.

Decoder-only architectures and certain position-encoding schemes can also make distant middle tokens harder to attend to as sequences grow. The result is that as input length increases, the relative salience of the central region declines. This connects to the broader idea of context rot, where the usable signal in a long prompt degrades as it fills with more material.

The practical implication is that position is a variable you control. Where you place the most important evidence in a prompt measurably affects whether the model uses it, independent of whether that evidence is present at all.

  • Attention is not distributed uniformly across all positions in a long sequence.
  • Training data and position encodings reinforce attention to sequence boundaries.
  • Salience of the central region declines as input length grows.
  • Related to context rot, the degradation of usable signal in long prompts.
  • Placement of key evidence is a controllable factor that affects model accuracy.

How to mitigate lost in the middle

The most direct mitigation is to put the most relevant content where the model attends best: near the start or the end of the prompt rather than buried in the middle. In RAG systems, this means ordering retrieved chunks by relevance and placing the top results at the edges of the assembled context.
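One simple way to do this is to alternate ranked chunks between the front and the back of the context. The function below is a minimal sketch under the assumption that the input is already sorted best-first; the name `place_at_edges` is made up for this example.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def place_at_edges(ranked: Sequence[T]) -> List[T]:
    """Reorder items ranked best-first so the strongest sit at both ends.

    Items alternate between the front and the back of the output,
    which leaves the lowest-ranked items in the middle.
    """
    front: List[T] = []
    back: List[T] = []
    for i, item in enumerate(ranked):
        if i % 2 == 0:
            front.append(item)  # ranks 1, 3, 5, ... fill from the start
        else:
            back.append(item)   # ranks 2, 4, 6, ... fill from the end
    return front + back[::-1]


# Chunks "a".."e" are ranked from most to least relevant.
print(place_at_edges(["a", "b", "c", "d", "e"]))  # ['a', 'c', 'e', 'd', 'b']
```

The best chunk lands first, the second-best lands last, and the weakest chunk ends up in the center, where an underused position costs the least.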

Retrieve less, not more. Sending fewer, higher-precision chunks shrinks the context so there is less middle to lose information in. A reranking stage helps by promoting the genuinely relevant chunks to the top of the list before assembly, and tighter retrieval reduces the total length the model must scan.
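As a rough illustration of retrieving less, the sketch below keeps only chunks that clear a relevance floor and fit a token budget. The function name, the parameters, and the whitespace-based token estimate are assumptions for this example; a real pipeline would use its own reranker scores and tokenizer.

```python
from typing import List, Tuple


def select_chunks(
    scored: List[Tuple[str, float]],
    max_chunks: int,
    min_score: float,
    token_budget: int,
) -> List[str]:
    """Keep the highest-scoring chunks that clear a score floor and fit a budget.

    `scored` holds (chunk_text, relevance_score) pairs, higher is better.
    """
    kept: List[str] = []
    used = 0
    for text, score in sorted(scored, key=lambda pair: pair[1], reverse=True):
        if len(kept) >= max_chunks or score < min_score:
            break  # everything after this point is lower-scoring
        cost = len(text.split())  # crude token estimate (assumption)
        if used + cost > token_budget:
            continue  # too large to fit; a smaller chunk may still fit
        kept.append(text)
        used += cost
    return kept
```

The returned list is still ordered best-first, so it can be passed straight to whatever step assembles the final context.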

Other tactics include summarizing or compressing long inputs before generation and splitting very long tasks into smaller calls. Where feasible, ask the model to first extract or quote the relevant passages, which forces it to surface mid-context evidence explicitly before reasoning over it.
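The extract-then-reason tactic can be sketched as two prompt templates used in separate calls. The function names and prompt wording below are assumptions chosen for illustration, and the model call between the two steps is left out because it depends on your client library.

```python
from typing import List


def build_extraction_prompt(question: str, documents: List[str]) -> str:
    """Step 1: ask the model to quote relevant passages before answering."""
    numbered = "\n\n".join(
        f"[Document {i}]\n{doc}" for i, doc in enumerate(documents, start=1)
    )
    return (
        f"{numbered}\n\n"
        f"Question: {question}\n\n"
        "Quote verbatim every passage from the documents above that helps "
        "answer the question, citing the document number for each quote. "
        "If nothing is relevant, reply with NONE."
    )


def build_answer_prompt(question: str, extracted_quotes: str) -> str:
    """Step 2: answer from the short list of quotes, not the full context."""
    return (
        f"Relevant quotes:\n{extracted_quotes}\n\n"
        f"Question: {question}\n\n"
        "Answer using only the quotes above."
    )
```

The second prompt is much shorter than the first, so the evidence the model reasons over no longer has a long middle to get lost in.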

  • Place the most relevant chunks at the start and end of the assembled context.
  • Retrieve fewer, higher-precision chunks so there is less middle to lose.
  • Use reranking to surface the genuinely relevant chunks before placing them at the edges.
  • Summarize or compress long inputs, or split a task into smaller calls.
  • Ask the model to extract or quote relevant passages before reasoning over them.

Why it still matters in 2026

Context windows have grown to hundreds of thousands or even millions of tokens, but the positional weakness has not fully disappeared. Long-context benchmarks continue to find that retrieval accuracy and reasoning quality can degrade as relevant information moves toward the center of very long inputs, even when the model nominally supports the length.

For anyone building RAG pipelines, long-document assistants, or agent systems that accumulate large histories, lost in the middle is a reason to treat context as a budget to be curated rather than a bucket to be filled. Ordering, pruning, and memory management matter as much as raw window size. A dedicated memory layer such as MemX addresses the related but distinct problem of deciding what should persist across sessions at all, so the model receives a curated, relevant context rather than an undifferentiated dump.

  • Larger context windows have reduced but not eliminated the positional weakness.
  • Long-context benchmarks still show degradation as relevant info moves to the center.
  • Treat context as a curated budget, not a bucket to fill to the maximum.
  • Ordering, pruning, and memory management matter alongside raw window size.
  • A memory layer curates what persists, reducing how much the model must scan.

Key takeaways

  • Lost in the middle is the tendency of LLMs to use information at the start and end of a long context far better than information in the middle, producing a U-shaped accuracy curve.
  • It was documented in the 2023 paper by Liu et al. using multi-document QA and key-value retrieval, with best-to-worst position gaps reaching around 20 percentage points.
  • Enlarging the context window does not fix it; long-context model variants showed the same positional weakness.
  • Mitigate it by placing key chunks at the edges, retrieving fewer high-precision chunks, reranking, and compressing long inputs.
  • It remains relevant in 2026 despite very large windows, which is why curated context and memory management still matter.

Frequently asked questions

What does "lost in the middle" mean?
It means a language model uses relevant information placed at the beginning or end of its input far more reliably than the same information placed in the middle. Accuracy traces a U-shaped curve against position, so content buried mid-context is often underused or ignored.

Who first described it?
It was characterized in the paper "Lost in the Middle: How Language Models Use Long Contexts" by Nelson F. Liu and colleagues, a Stanford-led team, first released in 2023 and later published in the Transactions of the Association for Computational Linguistics. They demonstrated it with multi-document QA and key-value retrieval tasks.

Does a larger context window fix it?
Not by itself. The original study found long-context model variants showed the same positional weakness as standard ones, and 2026 long-context benchmarks still report degradation as relevant content moves toward the center. Window size and uniform use of that window are different things.

How can I reduce it in a RAG system?
Place the most relevant chunks at the start and end of the assembled context, retrieve fewer high-precision chunks, add a reranking stage to promote the best results, and compress or summarize long inputs. Curating context beats simply filling the window.

How does it differ from context rot?
They are related but distinct. Lost in the middle is specifically about the position of relevant information within a context. Context rot is the broader degradation of usable signal as a prompt fills with more material. Both argue for curating context rather than maximizing length.