Context Rot: Why Longer Chats Get Worse

A chat that started sharp drifts into vague, hedged answers an hour later, and nothing about the questions got harder. That is context rot: an LLM's accuracy drops as the raw length of its input grows, even when the context window is far from full. Chroma tested 18 frontier models and every single one degraded as input length increased, so this is not a quirk of one vendor or one prompt. It is a property of how attention works.

Here is what the spec sheets won't tell you. Vendors advertise windows up to 10 million tokens, and the implied promise is that you can stuff everything in and the model will sort it out. The data says otherwise. More tokens in the prompt make the model worse at using any of them, including the few that matter. The fix is counterintuitive: treat context as a finite attention budget and trim, not stuff.

Context rot is not lost-in-the-middle

These are two failures, and conflating them leads to the wrong fix. Lost-in-the-middle is positional: a model retrieves facts near the start and end of a long prompt more reliably than facts buried in the center. Move the fact to the top and accuracy recovers. Context rot is about total length: accuracy falls as the input grows regardless of where the relevant token sits, and it shows up well before the window is full.

The distinction matters because the remedies diverge. If your only problem were position, you would reorder the prompt and stop. Because the real problem is volume, reordering is not enough. You have to cut. A prompt that is correctly ordered but bloated still rots.

Insight

Lost-in-the-middle asks where the needle sits. Context rot asks how big the haystack is. A long prompt can fail on both axes at once, which is why a single reorder rarely fixes a sprawling chat.

What the Chroma context rot study found

Chroma ran 18 state-of-the-art models, spanning the GPT, Claude, Gemini, and Qwen families, across controlled tasks that held difficulty constant and varied only input length. The headline: performance degrades as input length increases, often in non-uniform ways. The degradation appeared even when the task itself stayed trivial, which rules out the easy explanation that longer inputs just contain harder questions.

The repeated-words test

The cleanest result came from a task with no reasoning at all. Researchers fed models a long run of one repeated word with a single different word inserted somewhere, then asked the model to copy the input back verbatim. They swept word counts from 25 up to 10,000 across 1,090 variations of length and insertion point. Short inputs: near perfect. As the sequence got longer, every model family degraded, frequently misplacing the unique word when it sat later in the text, and some began emitting words that never appeared in the input at all. Copying text is the simplest operation a language model can perform, and length alone broke it.

Needle-in-a-haystack: distractors and the shuffle result

On needle-in-a-haystack retrieval, two factors made things worse as length grew. Lower semantic similarity between the question and the buried fact produced steeper drops, and adding topically related distractors degraded accuracy further. Even a single distractor cut performance below the needle-only baseline, four distractors compounded the damage, and the effect was non-uniform: one particular distractor hurt more than the others. Unexpectedly, models often scored higher when the haystack was shuffled into incoherence than when it read as a coherent document. A logically flowing context gives the model more plausible-but-wrong threads to follow.

Long conversational memory

On LongMemEval, a conversational question-answering benchmark, Chroma compared a focused prompt of roughly 300 tokens against the full history of about 113,000 tokens containing the same answer. The focused version won by a wide margin across every model family tested. Same answer, two prompts: 300 tokens beat 113,000. The information was identical; the only variable was how much irrelevant material surrounded it. That gap is the whole thesis in one experiment: irrelevant context is not free.

Also on MemX

AI Explained

What Is MCP? Model Context Protocol

11 min read→

AI Explained

Does Gemini Remember? Personal Context Explained

7 min read→

AI Explained

Context Window vs Memory: The Difference

10 min read→

Why context rot happens: the attention budget

The cause is structural, not a tuning bug. In a transformer, every token attends to every other token, which creates n-squared pairwise relationships for n tokens. Anthropic's engineering team frames the consequence as an attention budget. A model has a finite pool of attention, and each additional token draws it down. Past a point returns diminish, and the signal you care about competes with everything else for the same finite resource.

Insight

A context window is how much fits. The attention budget is how much the model can actually hold in mind, and it runs out long before the window does.

This is why a bigger advertised window does not rescue you. The window sets a hard ceiling on what fits. The attention budget governs how well the model uses what is inside, and that budget does not scale linearly with the window. A 10-million-token window with a depleting budget still produces a model that loses the plot long before it runs out of room.

It shows up in production RAG, not just lab tests

The Chroma tasks are deliberately clean, so the obvious objection is that real workloads behave differently. They do not. Databricks Mosaic Research ran more than 2,000 experiments across 13 open and closed models on retrieval-augmented generation, the workhorse pattern behind most production assistants. Most models improved as they retrieved more documents, then reversed and got worse past a threshold. Llama-3.1-405B started declining around 32,000 tokens. GPT-4 held out to roughly 64,000. Only a handful of the newest models kept accuracy steady above 64,000 tokens.

The failures were not uniform misses either. As context grew, models drifted into distinct bad behaviors: some refused more often, some stopped following instructions, and some produced answers untethered from the retrieved passages. The practical reading is blunt. Retrieving the top 50 chunks because the window can hold them is not a safety margin. Past a model-specific threshold, every extra chunk is a liability, and that threshold can sit well under one tenth of the advertised window.

The four ways long context fails

Context rot is the symptom; in practice it arrives through four failure modes. Drew Breunig's taxonomy is the most cited, and each maps to a different cause you can diagnose.

Context poisoning: a hallucination or error enters the context and then gets referenced over and over, compounding because the model trusts its own prior text. A Gemini 2.5 agent playing Pokemon poisoned its own goal state and chased an unreachable objective for hundreds of turns.
Context distraction: the context grows so long that the model over-focuses on accumulated history and neglects what it learned in training. The same agent, once it passed about 100,000 tokens, started repeating actions from its past instead of forming new plans.
Context confusion: superfluous content in the prompt gets pulled into the answer because the model feels obliged to use what is in front of it. On the Berkeley Function-Calling Leaderboard, every model scored worse with more than one tool available, and a quantized Llama 3.1 8B that handled 19 tools failed outright at 46.
Context clash: newly added information or tools conflict with earlier content in the same context, and the contradiction degrades reasoning.

Agents are especially exposed. They gather from many sources, append tool outputs, and run for many turns, so all four modes accumulate over a session.

Insight

A long-running agent that quietly gets worse is usually rotting, not malfunctioning.

The multi-turn cliff: how chats rot in real time

Context rot is not only an agent or RAG problem. It hits ordinary multi-turn chat, and a Microsoft Research and Salesforce study put a number on it. The team took fully specified tasks and sharded them, releasing the same instructions one fragment per turn instead of all at once. Across more than 200,000 simulated conversations, every model degraded in the multi-turn setting, with an average drop of 39 percent versus the single-turn version of the identical task.

The decomposition is the part worth screenshotting. Almost none of the loss came from missing information: concatenating all the shards back into one prompt recovered about 95 percent of full performance. The damage came from spreading the same content across turns, which produced a 112 percent jump in unreliability against a smaller 16 percent dip in raw aptitude. The model still knew how to do the task. It just answered inconsistently once early turns let it commit to a wrong assumption and then build on it. That is poisoning and distraction playing out one message at a time, which is exactly why starting a clean chat often beats nursing a long one back to health.

Property	Lost-in-the-middle	Context rot
Root cause	Positional bias in attention	Total input length depleting attention budget
Trigger	Where the relevant token sits	How many tokens are present, relevant or not
Window must be full?	No	No, degrades well before the limit
Shows up on trivial tasks?	Less so	Yes, including verbatim copying
Primary fix	Reorder so key facts sit at edges	Trim irrelevant context, retrieve only what is needed

How to fix context rot: trim, do not stuff

Find the smallest set of high-signal tokens that gets the job done, and cut the rest. Anthropic states the principle directly: context should be informative yet tight, and the goal is the smallest possible set of high-signal tokens that maximize the likelihood of the desired outcome. Every token you add that does not earn its place is spending attention budget against you.

Retrieve, do not dump: pull the few relevant passages with search or RAG instead of pasting an entire document or transcript, and cap the retrieved set rather than maxing it out.
Prune the running context: drop stale tool outputs and resolved sub-tasks rather than carrying every turn forward.
Summarize long histories: compress old conversation into a short factual digest so the model keeps the conclusions without the bulk.
Quarantine conflicting sources: isolate information that might clash so a contradiction cannot poison the rest of the prompt.
Cut distractors aggressively: topically similar but irrelevant text is worse than obviously off-topic text, because it competes directly with the real answer.

For agents that run too long to fit any window, Anthropic describes three patterns worth copying. Compaction summarizes the conversation as it nears the limit and continues from the distilled version, preserving decisions without the raw transcript. Structured note-taking pushes durable facts into an external file the agent reads back on demand, keeping the active context lean. Sub-agent architectures hand focused work to specialized agents with clean windows, each returning only a condensed result to the coordinator. All three are the same move: keep the working context small and store the rest outside it.

Pro Tip

When an answer degrades, your first instinct is to add more context to clarify. Do the opposite. Remove everything not strictly needed and re-ask. Shorter, sharper prompts frequently beat longer, more complete ones.

Where a memory layer fits

If trimming is the answer, the hard part is deciding what to keep across many sessions. A memory layer addresses exactly that: instead of replaying a whole transcript into every prompt, it stores durable facts and retrieves only the relevant few when they are needed. MemX (memx.app) is an external AI memory layer built on this idea, supplying a small, relevant slice of personal context to a model rather than a giant history. It is private by architecture, with per-user isolation and encryption at rest, and on-device options. Used well, a memory layer is a trimming tool, not another place to pile tokens.

Insight

The lesson generalizes well past any one study. As long as transformers spread a finite attention budget across every token, more input will keep costing accuracy, and curating context will keep beating stuffing it.

Frequently Asked Questions

01What is context rot in LLMs?

Context rot is when a model's accuracy drops as its input grows, even when the window is not full and the task is simple. It stems from attention being a finite resource spread across every token in the prompt.

02Is context rot the same as lost-in-the-middle?

No. Lost-in-the-middle is positional: facts buried mid-prompt are recalled worse than facts at the edges. Context rot is about total length: accuracy falls as input grows no matter where the key fact sits, even on trivial tasks.

03Does context rot affect ChatGPT and Claude?

Yes. Chroma tested 18 frontier models across the GPT, Claude, Gemini, and Qwen families, and all 18 degraded as input length increased, including on a task that only required copying repeated words back verbatim.

04Does a bigger context window fix context rot?

No. A larger window sets how much fits, but the attention budget governs how well the model uses it. In production RAG tests, some models began degrading around 32,000 tokens, far below their advertised limits.

05How do I reduce context rot in practice?

Trim instead of stuff. Retrieve only the few relevant passages, cap how many, prune stale tool outputs, summarize long histories, and remove distracting or conflicting text. Aim for the smallest set of high-signal tokens that still answers the question.

Context Rot: Why Longer Chats Get Worse

Context rot is not lost-in-the-middle

What the Chroma context rot study found

The repeated-words test

Needle-in-a-haystack: distractors and the shuffle result

Long conversational memory

Why context rot happens: the attention budget

It shows up in production RAG, not just lab tests

The four ways long context fails

The multi-turn cliff: how chats rot in real time

How to fix context rot: trim, do not stuff

Where a memory layer fits

Stop losing what you save.
Let MemX remember it for you.

Keep reading

Context rot is not lost-in-the-middle

What the Chroma context rot study found

The repeated-words test

Needle-in-a-haystack: distractors and the shuffle result

Long conversational memory

Why context rot happens: the attention budget

It shows up in production RAG, not just lab tests

The four ways long context fails

The multi-turn cliff: how chats rot in real time

How to fix context rot: trim, do not stuff

Where a memory layer fits

Stop losing what you save.Let MemX remember it for you.

Keep reading

Stop losing what you save.
Let MemX remember it for you.