Choose an agent memory architecture by access pattern, not by whatever a vendor demo pushes. Three factors decide it: how big your corpus is, whether the facts have relationships that matter, and how much p95 latency your turn budget can spend. Bolting a flat vector store onto an agent is the reflex move, and for a single-session chatbot it is usually correct. For multi-session recall, relational reasoning, or unlimited history, four other patterns beat it on the metric you actually care about. This post lays out all five with measured accuracy and latency tradeoffs, so the choice becomes arithmetic instead of fashion.
Short answer: choose by access pattern, not by trend
The five patterns are monolithic context, context plus a retrieval store, tiered memory with a learned controller, graph plus vector, and a context-engineering layer that curates the window each turn. Accuracy and speed trade off against each other across them. Stuffing everything into the window scores highest on raw answer quality, around 72.9% on the LOCOMO long-conversation benchmark, but pays a 17.12 second p95 and roughly 26,000 tokens per conversation. A flat vector store drops accuracy to about 66.9% and cuts p95 to 1.44 seconds, a 91% latency reduction. Graph plus vector brings accuracy back up to roughly 68% at a 2.59 second p95. Each pattern is the right answer for a different workload.
Here is what most coverage buries. On these LOCOMO numbers the accuracy spread across patterns is only about 6 points, from 72.9% down to 66.9%, while the p95 latency spread is roughly 12x, from 17.12 seconds down to 1.44. So latency, not accuracy, is the variable that actually separates the patterns. Pick the slowest design for a few points of answer quality and you have paid an order of magnitude in response time for a margin most users never notice.
Latency is the real decision driver. Accuracy moves about 6 points across these patterns; p95 moves roughly 12x. You are choosing a speed budget, not an accuracy tier.
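The two spreads are simple enough to check by hand, but a short sketch makes the comparison explicit. The figures are the ones quoted above; the dictionary layout and function name are illustrative only.

```python
# Figures quoted above for LOCOMO: (accuracy in %, p95 latency in seconds).
PATTERNS = {
    "monolithic context": (72.9, 17.12),
    "flat vector store": (66.9, 1.44),
    "graph plus vector": (68.0, 2.59),
}

def spreads(patterns):
    """Return (accuracy spread in points, p95 ratio of slowest to fastest)."""
    accuracies = [acc for acc, _ in patterns.values()]
    latencies = [p95 for _, p95 in patterns.values()]
    return max(accuracies) - min(accuracies), max(latencies) / min(latencies)

accuracy_points, latency_ratio = spreads(PATTERNS)
print(f"accuracy spread: {accuracy_points:.1f} points")  # accuracy spread: 6.0 points
print(f"p95 latency ratio: {latency_ratio:.1f}x")        # p95 latency ratio: 11.9x
```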
Pattern 1: monolithic context, zero infra, capped by the window
Put the entire history in the prompt and let the model read all of it. No retrieval, no database, no index. On the LOCOMO benchmark this full-context approach posts the highest raw accuracy of any pattern, about 72.9%, because nothing is ever lost to a retrieval miss. The cost is brutal on both axes that scale with history: p95 latency reaches 17.12 seconds, and a single conversation can consume around 26,000 tokens, since the model reprocesses the whole transcript every turn.
Use monolithic context when the working set is small and bounded: a single support session, a code-review thread, a document that fits inside the window with room to reason. It breaks the moment history outgrows the context window. There is no eviction policy, so once you hit the cap you either truncate and lose facts or pay a latency and cost penalty that grows with every turn. Treat it as the baseline you graduate away from the instant recall has to outlive one session.
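To see why the penalty grows with every turn, here is a rough sketch of how many prompt tokens get read when each turn replays the full transcript. The 200-tokens-per-turn figure is a made-up assumption for illustration, not a benchmark number.

```python
def tokens_processed(turns, tokens_per_turn):
    """Total prompt tokens read across a conversation when every turn
    replays the whole transcript so far (no eviction, no retrieval)."""
    total = 0
    for turn in range(1, turns + 1):
        total += turn * tokens_per_turn  # turn N re-reads all N turns
    return total

# Hypothetical workload: each turn adds 200 tokens to the transcript.
for turns in (10, 50, 100):
    print(turns, tokens_processed(turns, 200))
# 10 11000
# 50 255000
# 100 1010000
```

Ten times the turns costs roughly a hundred times the tokens, which is the quadratic growth a bounded working set avoids.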
Pattern 2: context plus a retrieval store, the default workhorse
Write facts to an external vector store, embed the query, pull back the top matches, and inject only those into the window. Most teams reach for this pattern first. By retrieving concise facts instead of replaying the full transcript, it cuts p95 latency by 91%, from 17.12 seconds down to 1.44, and reduces token usage by roughly 73%, from about 26,000 to around 7,000 per conversation. Accuracy lands near 66.9%, about 6 points under full context because retrieval occasionally misses the relevant chunk.
The flat vector store shines when facts are largely independent and a good semantic match answers the question. It struggles the moment the answer depends on how facts connect: who reports to whom, which event preceded which, how a preference changed across sessions. Semantic similarity retrieves passages that look alike, not passages that are linked. For single-fact recall, this is the default to start from and the floor the next three patterns try to beat on specific weaknesses.
A flat vector store retrieves what looks similar, not what is connected. When the answer lives in the relationships, similarity search quietly loses it.
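A minimal sketch of the write, embed, and top-k loop described above. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and `FlatVectorStore` is a hypothetical class, not any vendor's API.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class FlatVectorStore:
    def __init__(self):
        self.items = []  # list of (fact, vector) pairs

    def write(self, fact):
        self.items.append((fact, embed(fact)))

    def top_k(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [fact for fact, _ in ranked[:k]]

store = FlatVectorStore()
for fact in [
    "alice prefers dark mode",
    "bob reports to carol",
    "carol reports to dana",
    "alice lives in lisbon",
]:
    store.write(fact)

print(store.top_k("alice prefers which mode", k=1))  # ['alice prefers dark mode']
print(store.top_k("who does bob answer to", k=1))    # ['bob reports to carol']
```

The second query shows the limit. The store finds Bob's direct manager, but the fact about Carol's manager shares only the word "to" with the query, so nothing in the ranking says it is the next link in the chain.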
Pattern 3: tiered memory with a learned controller, OS-style hierarchy for unlimited recall
Treat the context window like RAM and let the agent page information in and out of slower tiers, the way an operating system manages virtual memory. The MemGPT design, now shipped in the Letta framework, splits memory into a main context held in the window, a recall tier holding all past messages, and an archival tier for long-term documents in a vector index. A controller, the model itself driving function calls, decides what to load and what to evict each turn.
The payoff is effectively unbounded recall. History can exceed the context window by orders of magnitude because cold data lives on disk and only hot data occupies tokens. The cost is orchestration. Every paging decision is an extra model call that adds latency and tokens, and a controller that pages in the wrong record produces a confident wrong answer. Reach for this pattern when long-running agents must remember across many sessions and the operational complexity of a paging loop is justified by genuinely unlimited memory.
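The paging idea can be sketched in a few lines. This is a simplified illustration under loud assumptions: the class and method names are hypothetical and are not the Letta or MemGPT API, token counting is a crude whitespace split, and a keyword search stands in for the model-driven controller.

```python
from collections import deque

class TieredMemory:
    """Bounded main context backed by a recall tier that keeps every message.
    Hypothetical sketch; not the Letta or MemGPT API."""

    def __init__(self, max_main_tokens):
        self.max_main_tokens = max_main_tokens
        self.main = deque()  # messages currently occupying the window
        self.recall = []     # all past messages, the cold tier

    @staticmethod
    def count_tokens(message):
        return len(message.split())  # crude whitespace token count

    def main_tokens(self):
        return sum(self.count_tokens(m) for m in self.main)

    def _load(self, message):
        self.main.append(message)
        # Evict the oldest messages until the window fits the budget again.
        while self.main_tokens() > self.max_main_tokens and len(self.main) > 1:
            self.main.popleft()

    def append(self, message):
        self.recall.append(message)  # the recall tier never forgets
        self._load(message)

    def page_in(self, keyword):
        """Stand-in for the controller's search call: load matching cold
        messages back into the main context."""
        hits = [m for m in self.recall
                if keyword.lower() in m.lower() and m not in self.main]
        for hit in hits:
            self._load(hit)
        return hits

mem = TieredMemory(max_main_tokens=10)
mem.append("my dog is named biscuit")
mem.append("book a flight to oslo")
mem.append("what is the weather")
evicted = "my dog is named biscuit" not in mem.main  # True: paged out
paged = mem.page_in("dog")                            # loads the dog fact back
```

In a real system each `page_in` would be a model-issued function call, which is where the extra latency and tokens come from.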
Pattern 4: graph plus vector, relational accuracy at a modest latency cost
Layer a knowledge graph over the vector store. Vectors handle the semantic entry point; the graph captures entities and the edges between them, so the agent traverses relationships instead of hoping similarity surfaces them. Mem0's graph-enhanced variant on LOCOMO recovers accuracy to roughly 68%, about a point above the flat vector store, while holding p95 latency near 2.59 seconds, well under the 17 seconds of full context. The graph earns its keep on multi-session, multi-hop questions.
The tradeoff is build and maintenance cost. Something has to extract entities and relationships, keep the graph current as facts change, and run traversal at query time, which is why p95 sits above a flat store. Graph queries that take seconds break conversational flow, so the engineering target is sub-second retrieval at scale. Choose graph plus vector when relationships carry the answer: org charts, dependency chains, evolving user preferences, anything where the link between two facts is itself the fact.
Graph plus vector buys a point or two of accuracy and the power to answer multi-hop questions a flat store cannot reach. The bill is an extraction and maintenance pipeline.
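A small sketch of the traversal half of this pattern. It assumes the vector step has already resolved the query to a starting entity, and that an extraction pipeline has already produced the edges; `EntityGraph` and the `reports_to` relation are hypothetical names for illustration.

```python
from collections import defaultdict

class EntityGraph:
    def __init__(self):
        self.edges = defaultdict(list)  # subject -> [(relation, object), ...]

    def add(self, subject, relation, obj):
        self.edges[subject].append((relation, obj))

    def follow(self, start, relation, max_hops=5):
        """Walk one relation type outward from start and return the chain
        of entities reached, stopping at a dead end or a cycle."""
        chain, current, seen = [], start, {start}
        for _ in range(max_hops):
            nxt = next(
                (o for r, o in self.edges.get(current, []) if r == relation),
                None,
            )
            if nxt is None or nxt in seen:
                break
            chain.append(nxt)
            seen.add(nxt)
            current = nxt
        return chain

graph = EntityGraph()
graph.add("bob", "reports_to", "carol")
graph.add("carol", "reports_to", "dana")

# Entry entity "bob" is assumed to come from the vector lookup.
print(graph.follow("bob", "reports_to"))  # ['carol', 'dana']
```

The second hop, from Carol to Dana, is the part that similarity search alone has no reason to surface.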
Pattern 5: a context-engineering layer that decides what occupies the window each turn
Stop asking which store to query and start governing what occupies the context window every single turn. Context engineering treats the window as a constrained resource and curates its payload: instructions, retrieved facts, tool results, and state. Anthropic's published strategies include just-in-time retrieval, where the agent holds lightweight identifiers and loads data only when a tool needs it, structured note-taking to external storage, and clearing stale tool results while keeping message structure. The store underneath can be any of the previous patterns; this layer decides what reaches the model.
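One way to picture the curation step is a budgeted packer that drops stale items and admits the rest by priority. This is a hedged sketch: the `Candidate` fields, the priority scheme, and the whitespace token count are all assumptions made for illustration, not a description of any published implementation.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    kind: str          # e.g. "instructions", "fact", "tool_result", "state"
    text: str
    priority: int      # lower number means more important (assumed scheme)
    stale: bool = False

def count_tokens(text):
    return len(text.split())  # crude whitespace token count

def build_window(candidates, budget):
    """Greedy curation: skip stale items, then admit candidates in priority
    order for as long as each one still fits the token budget."""
    window, used = [], 0
    fresh = (c for c in candidates if not c.stale)
    for c in sorted(fresh, key=lambda c: c.priority):
        cost = count_tokens(c.text)
        if used + cost <= budget:
            window.append(c)
            used += cost
    return window, used

candidates = [
    Candidate("state", "ticket 4821 is open and awaiting reply", priority=2),
    Candidate("tool_result", "old search results from yesterday", priority=1, stale=True),
    Candidate("instructions", "answer briefly and cite sources", priority=0),
    Candidate("fact", "user prefers metric units", priority=1),
]

window, used = build_window(candidates, budget=10)
print([c.kind for c in window], used)  # ['instructions', 'fact'] 9
```

The store that produced the candidates does not matter here, which is the sense in which this layer sits on top of any of the storage patterns.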
In governed enterprise settings this layer measures well. Atlan reports a roughly 3x improvement on text-to-SQL when the agent sees live metadata instead of a bare schema, about a 20% accuracy gain from an ontology layer, and a 39% reduction in tool calls. The accuracy-latency axis and the governance-freshness axis are independent. A bigger context window never solves which facts the agent is allowed to see or how current they are. Most production systems combine this layer with one of the storage patterns rather than picking a single approach.
The tradeoff table: accuracy, latency, and infra cost per pattern
The numbers below come from the LOCOMO long-conversation benchmark for patterns 1, 2, and 4, and from governed-metadata reporting for pattern 5. Pattern 3 has no comparable benchmark figure in this post, so its row is qualitative. Read accuracy and latency together. The highest-accuracy pattern is also the slowest and most expensive; the fastest pattern gives up about 6 points. There is no free lunch, only the lunch that fits your turn budget.
| Pattern | Accuracy / quality | Latency and cost |
|---|---|---|
| 1. Monolithic context | Highest, ~72.9% on LOCOMO | Worst: ~17.12s p95, ~26k tokens/conversation |
| 2. Context + vector store | ~66.9% on LOCOMO | ~1.44s p95, ~7k tokens/conversation |
| 3. Tiered + controller | Unbounded recall across sessions | Extra model calls per paging step |
| 4. Graph + vector | ~68% on LOCOMO, best multi-hop | ~2.59s p95 |
| 5. Context-engineering layer | ~3x text-to-SQL, ~20% from ontology | ~39% fewer tool calls |
How to choose: a decision rule mapping your workload to one pattern
Run your workload through four questions in order. Does all relevant history fit in the context window with room to reason? If yes, use monolithic context and add nothing. If history outgrows the window but facts are mostly independent, a flat vector store gives you 91% lower latency for a few points of accuracy, and it is the right default. If the answer depends on relationships between facts, add a graph over the vector store and accept the extraction pipeline and the higher p95.
If memory must persist across many sessions and exceed the window by orders of magnitude, adopt tiered memory with a learned controller and budget for the paging orchestration. Layer context engineering on top of any of these once governance, freshness, or token discipline starts to matter, since that axis is independent of the storage choice. Mature systems usually combine patterns: a vector store for entry, a graph for relationships, and a curation layer deciding what reaches the model each turn.
Start at monolithic, move to vector when history outgrows the window, add a graph when relationships carry the answer, go tiered for unlimited recall, and wrap it in a context layer when governance bites.
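The decision rule can be written down as a small function. This is one reading of the rule above, not a definitive mapping: the parameter names are made up, and stacking tiered memory on top of a vector store when both conditions hold is an assumption drawn from the remark that mature systems combine patterns.

```python
def choose_patterns(fits_in_window, relationships_matter,
                    unbounded_recall, governance_matters):
    """Map four yes/no workload answers to a list of patterns to combine."""
    if fits_in_window:
        chosen = ["monolithic context"]
    else:
        chosen = ["context + vector store"]       # default once history outgrows the window
        if relationships_matter:
            chosen.append("graph over vectors")   # links between facts carry the answer
        if unbounded_recall:
            chosen.append("tiered memory with controller")
    if governance_matters:
        chosen.append("context-engineering layer")  # independent of the storage choice
    return chosen

print(choose_patterns(True, False, False, False))
# ['monolithic context']
print(choose_patterns(False, True, False, True))
# ['context + vector store', 'graph over vectors', 'context-engineering layer']
```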
Where MemX fits
MemX is an external AI memory layer by Neural Forge Technologies, built so an agent or assistant can read and write durable memory across sessions without you standing up the storage tiers yourself. It sits in the patterns above as the persistent store and retrieval surface, so the architecture choice becomes how you query memory rather than which database to operate. MemX is private by architecture: per-user isolation, encryption at rest, and master key management through Google Cloud KMS hardware security modules, with decrypted data held only briefly to serve a query. The point is to give an agent memory that outlives a single context window without making memory infrastructure your team's full-time job.
Frequently asked questions
01. What is the best memory architecture for an AI agent?
There is no single best one. Use monolithic context if history fits the window, a vector store if facts are independent, graph plus vector if relationships matter, tiered memory for unlimited recall, and a context-engineering layer when governance and freshness matter. Choose by access pattern and latency budget, since latency varies far more across patterns than accuracy does.
02. Is a vector database enough for agent memory?
Often yes for single-fact recall. A flat vector store cuts p95 latency by about 91% versus full context and answers independent-fact questions well. It struggles when the answer depends on how facts connect, since similarity search retrieves what looks alike, not what is linked. Add a graph for relational questions.
03. How much does graph memory improve accuracy over plain vectors?
On the LOCOMO benchmark, a graph-enhanced store reaches about 68% accuracy versus roughly 66.9% for a flat vector store, with p95 latency near 2.59 seconds. The gain is a point or two plus the power to answer multi-hop relational questions a flat store cannot reach, at the cost of an extraction pipeline.
04. What is the difference between agent memory and context engineering?
Agent memory is the persistent store of facts across sessions. Context engineering is the layer that decides what occupies the context window each turn: which retrieved facts, tool results, and state reach the model. Memory is where facts live; context engineering governs what the model sees right now.
05. What is tiered or hierarchical agent memory?
Tiered memory treats the context window like RAM and pages data in and out of slower tiers, modeled on an operating system. The MemGPT design, now in Letta, uses a main context, a recall tier for past messages, and an archival tier, giving recall that far exceeds the window at the cost of paging overhead.
The reflex is to bolt a vector store onto the agent and move on. The better move is to read your own access pattern off four questions, weigh latency before accuracy, and combine patterns only where the metrics justify it.
