AI Agent Memory

AI agent memory is the system that lets an AI agent retain and reuse information beyond a single message. It splits into short-term memory (the context window, which acts as temporary working memory) and long-term memory (a persistent external store of facts, past events, and learned procedures that survives across sessions).

What is AI Agent Memory?

AI agent memory is the set of mechanisms that let an AI agent (an LLM-driven system that plans and takes actions) hold on to information and reuse it later, instead of treating every request as a blank slate. It is usually described in two tiers. Short-term memory is the information the model can see right now inside its context window, the block of text passed in with each request. Long-term memory is a separate, persistent store that the agent writes to and reads from across many conversations and sessions, so that something learned on Monday is still available the following month.

The distinction matters because a base language model has no memory of its own. As LangChain's LangGraph documentation puts it, short-term (thread-scoped) memory tracks the ongoing conversation by maintaining message history within a single session, while long-term memory stores user-specific or application-level data across sessions and is shared across conversational threads. Anything not placed back into the context window is, from the model's point of view, forgotten.

Memory is what turns a stateless chatbot into something that behaves like an assistant that knows you. It lets an agent recall your name and preferences, remember the outcome of a task it ran last week, and reapply a workflow it figured out earlier. The implementation of that persistence (vector databases, rolling summaries, plain files, or a managed memory layer) is a separate question from the conceptual split between working memory and long-term memory.

Two tiers: short-term (in the context window) and long-term (a persistent external store).
A base LLM is stateless; memory is added on top by the agent framework or product.
Short-term memory is scoped to one conversation thread; long-term memory is shared across sessions.
Memory is the difference between a stateless chatbot and an assistant that remembers you.
How memory is stored (vectors, summaries, files) is separate from the working-vs-long-term split.

Why is the context window working memory (RAM), not storage?

The context window is the fixed-size block of tokens the model reads on each turn, and it behaves like RAM rather than a hard drive. It is fast, directly usable for reasoning, and wiped between independent requests. Once a conversation grows past the window, the oldest content falls out of view and the model can no longer reference it unless that content is re-supplied. This is why a long-running chat eventually starts to forget what you told it earlier in the same session.
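The overflow behavior described above can be sketched in a few lines. This is a minimal illustration, not how any particular product trims history: `count_tokens` is a crude whitespace stand-in for a real tokenizer, and `fit_to_window` and the sample messages are hypothetical names chosen for the example.

```python
from collections import deque


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fit_to_window(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose combined size fits in max_tokens."""
    kept: deque[str] = deque()
    used = 0
    # Walk backwards from the newest message and stop at the first one that no longer fits.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.appendleft(message)
        used += cost
    return list(kept)


history = [
    "My name is Priya and I am vegetarian.",
    "Can you suggest a dinner recipe?",
    "Sure, how about a lentil curry?",
    "Sounds good, what should I buy?",
]
# With a 20-token budget the oldest message (the one with the name and diet) drops out.
visible = fit_to_window(history, max_tokens=20)
```

The first message is exactly the kind of detail a user expects the assistant to keep, which is why it needs to live in a long-term store rather than only in the window.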

The clearest articulation of this analogy comes from the MemGPT paper (Packer et al., arXiv:2310.08560, October 2023), which proposes treating an LLM like an operating system. In that framing, the context window is fast main memory analogous to RAM, and an external store is slow memory analogous to disk. The system moves information between the two tiers on demand, giving the appearance of a much larger memory than the window alone allows. This OS-inspired pattern, virtual context management, underpins many modern agent memory designs.
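As a rough illustration of the two-tier idea (not MemGPT's actual implementation), the sketch below keeps a small "main context" list and moves overflow into an archive, then pages items back in on request. The class name `TieredMemory`, its methods, and the keyword-substring search are all assumptions made for the example; a real system would use proper search and let the model decide when to page.

```python
class TieredMemory:
    """Toy two-tier memory: a small 'main context' list plus an unbounded archive."""

    def __init__(self, main_capacity: int) -> None:
        self.main_capacity = main_capacity
        self.main: list[str] = []     # what the model would see (the "RAM" tier)
        self.archive: list[str] = []  # external store (the "disk" tier)

    def add(self, item: str) -> None:
        """Append to main context, evicting the oldest items to the archive on overflow."""
        self.main.append(item)
        while len(self.main) > self.main_capacity:
            self.archive.append(self.main.pop(0))

    def page_in(self, keyword: str) -> list[str]:
        """Move archived items containing keyword back into main context."""
        hits = [item for item in self.archive if keyword.lower() in item.lower()]
        for item in hits:
            self.archive.remove(item)
            self.add(item)
        return hits


mem = TieredMemory(main_capacity=2)
for fact in ["user lives in Lisbon", "user prefers tea", "user asked about flights"]:
    mem.add(fact)

# The Lisbon fact was evicted to the archive; paging it in evicts the oldest main item.
recalled = mem.page_in("lisbon")
```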

Treating the window as RAM has practical consequences. Stuffing everything into context is expensive (you pay per token), slow, and degrades quality, because models can lose track of relevant facts buried in a very long prompt. LangGraph's documentation notes that most LLMs still perform poorly over long contexts and get distracted by stale or off-topic content. The fix is to keep the active context small and pull in only what is relevant, retrieving from long-term storage just in time rather than loading everything upfront.

The context window is fast, temporary working memory; it is cleared between independent requests.
When a conversation exceeds the window, the oldest tokens drop out and are forgotten.
MemGPT (2023) frames the window as RAM and an external store as disk, with data moved between tiers.
Long context is costly, slower, and can degrade accuracy, so it is not a substitute for real storage.
Good agents keep active context small and retrieve relevant memories on demand.

Episodic, semantic, and procedural memory

Long-term agent memory is commonly divided into three types borrowed from human cognitive psychology: semantic, episodic, and procedural. This taxonomy appears across agent frameworks, including LangChain's LangGraph documentation, which maps them to facts, experiences, and instructions respectively. Each type answers a different question about what an agent should carry forward.

Semantic memory stores facts and stable knowledge about the user or the world, such as your job, your dietary preferences, or a constraint that holds across time. Episodic memory stores specific past events and interactions tied to a time and context, often distilled into examples of how a situation was handled before. Procedural memory stores how to do things: the rules, skills, and workflows an agent follows. In practice procedural knowledge often lives in the agent's system prompt, its code, and its tools rather than in a database.
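One way to make the three types concrete is to give each its own record shape and render them into separate parts of a prompt. The sketch below is illustrative only: the class names, fields, and `render_for_prompt` helper are hypothetical and do not come from LangGraph or any other framework.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SemanticMemory:
    """A stable fact about the user or the world."""
    subject: str
    fact: str


@dataclass
class EpisodicMemory:
    """A specific past interaction, tied to a time."""
    when: datetime
    situation: str
    outcome: str


@dataclass
class ProceduralMemory:
    """An instruction or workflow the agent follows."""
    name: str
    steps: list[str] = field(default_factory=list)


def render_for_prompt(
    facts: list[SemanticMemory],
    episodes: list[EpisodicMemory],
    procedures: list[ProceduralMemory],
) -> str:
    """Lay the three memory types out as labeled sections of a prompt."""
    lines = ["Instructions:"]
    for proc in procedures:
        lines.append(f"- {proc.name}: " + "; ".join(proc.steps))
    lines.append("Known facts:")
    for fact in facts:
        lines.append(f"- {fact.subject}: {fact.fact}")
    lines.append("Relevant past interactions:")
    for ep in episodes:
        lines.append(f"- {ep.when.date().isoformat()}: {ep.situation} -> {ep.outcome}")
    return "\n".join(lines)


prompt = render_for_prompt(
    facts=[SemanticMemory("diet", "vegetarian")],
    episodes=[EpisodicMemory(datetime(2025, 1, 6, tzinfo=timezone.utc),
                             "asked for a weeknight dinner", "liked the lentil curry")],
    procedures=[ProceduralMemory("recipes", ["check diet", "list ingredients first"])],
)
```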

The three types are complementary. Semantic memory alone would make an agent knowledgeable but unable to learn from its own history. Episodic memory alone would make it over-personalized with no general grounding. Procedural memory alone would make it good at fixed tasks but unable to adapt. A capable agent draws on all three, often retrieving the relevant pieces of each into the context window for a given task. Most consumer memory features today (for example saved facts about a user) are a practical form of semantic memory.

Semantic memory = facts and stable knowledge about the user or the world.
Episodic memory = specific past events and interactions, often stored as examples.
Procedural memory = how to perform tasks; usually held in the system prompt, code, and tools.
The three are complementary; capable agents use all of them together.
Most consumer 'memory' features today are essentially semantic memory about the user.

How is AI agent memory implemented?

There is no single mechanism. The most common pattern for long-term memory is a vector database: text is converted into embeddings (numeric vectors), stored, and later retrieved by semantic similarity to the current query, with the top matches injected back into the context window. This is the same retrieval machinery used in retrieval-augmented generation (RAG), which is why agent memory and RAG are often confused. The difference is usually intent: RAG retrieves from a fixed knowledge corpus to answer a question, while agent memory accumulates and updates information about a specific user or task over time.

Not every implementation uses vectors. Rolling summaries compress old turns into a short recap that stays in context. Hierarchical systems like MemGPT keep a small in-context core plus searchable recall and archival tiers. Anthropic's Claude memory tool, documented at platform.claude.com, takes a deliberately file-based approach: Claude reads and writes plain files in a /memories directory that you host yourself, so memory persists across sessions without a vector store. Managed memory layers and SDKs (for example LangChain's LangMem) package the extraction, storage, and retrieval steps so developers do not build the pipeline from scratch.
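The rolling-summary pattern can be sketched as a function that folds older turns into a running recap and leaves the most recent turns verbatim. This is a simplified illustration: `roll_summary` and `naive_summarize` are hypothetical names, and the summarizer here just truncates and joins text, where a real system would typically call an LLM.

```python
from typing import Callable


def roll_summary(
    summary: str,
    turns: list[str],
    keep_recent: int,
    summarize: Callable[[str, list[str]], str],
) -> tuple[str, list[str]]:
    """Fold all but the last keep_recent turns into the running summary."""
    split = len(turns) - keep_recent
    if split <= 0:
        return summary, turns  # nothing old enough to compress yet
    old, recent = turns[:split], turns[split:]
    return summarize(summary, old), recent


def naive_summarize(previous: str, old_turns: list[str]) -> str:
    """Placeholder summarizer: keeps the first 40 characters of each old turn."""
    pieces = ([previous] if previous else []) + [turn[:40] for turn in old_turns]
    return " | ".join(pieces)


turns = ["I am vegetarian", "Suggest a dinner", "Lentil curry?", "What should I buy?"]
summary, recent = roll_summary("", turns, keep_recent=2, summarize=naive_summarize)
# The context for the next request would be the summary followed by the recent turns.
```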

On the consumer side, the same ideas show up as product features. ChatGPT exposes two layers: 'reference saved memories' (explicit facts it remembers) and 'reference chat history' (implicit recall from past chats, which OpenAI began rolling out on April 10, 2025), per OpenAI's help documentation. Anthropic made automatic memory available to all Claude users, including the free tier, on March 2, 2026. These are concrete instances of long-term memory built on top of a stateless model. A privacy-first option such as MemX positions memory as private by architecture, with per-user isolation, encryption at rest, and on-device options, but the underlying conceptual split between working memory and long-term store is the same everywhere.

$$\mathrm{score}(q, m) = \cos(e_q, e_m) = \frac{e_q \cdot e_m}{\lVert e_q \rVert \, \lVert e_m \rVert}$$

$$\mathrm{retrieved} = \operatorname{top\_k}_{m \in \mathrm{store}} \; \mathrm{score}(q, m)$$

where $e_q$ is the embedding of the current query $q$, $e_m$ is the stored embedding of memory $m$, and $\operatorname{top\_k}$ returns the $k$ memories with the highest score (the ones injected into the context window).

How the vector-store retrieval step (step 2 of the loop below) actually scores and selects memories.
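The scoring and selection above can be written out directly. This is a minimal sketch in plain Python: the three-dimensional vectors are hand-made toy values standing in for real embeddings, and `cosine`, `top_k`, and the sample store are hypothetical names for illustration. A production system would use an embedding model and a vector database instead of a dictionary.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int) -> list[tuple[float, str]]:
    """Score every stored memory against the query and return the k best (score, text) pairs."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy embeddings (assumed values, not produced by a real model).
store = {
    "User is vegetarian": [0.9, 0.1, 0.0],
    "User works as a nurse": [0.1, 0.9, 0.0],
    "User ran a 10k last week": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for the embedding of a cooking question
results = top_k(query, store, k=2)
```

Scanning every record is fine for a handful of memories; vector databases exist largely to avoid that full scan at scale.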

1. Receive user request
2. RETRIEVE: query long-term store (vector DB / files) for relevant memories
3. ASSEMBLE: inject top-k memories + recent turns into the context window (working memory)
4. REASON / ACT: model plans and responds using only what is in context
5. WRITE: extract new facts, events, or learned steps -> persist to long-term store
6. (next session) the window is empty again; step 2 repopulates it on demand

A simplified read-then-write memory loop an agent runs around each task.
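The same loop can be sketched in Python. Everything here is a stand-in chosen for illustration: `retrieve` ranks by shared words rather than embeddings, `fake_model` replaces a real LLM call, and `fake_extract` only persists statements the user prefixes with "Remember that". None of these names come from a real framework.

```python
from typing import Callable


def retrieve(store: list[str], request: str, k: int = 2) -> list[str]:
    """Step 2: rank stored memories by how many words they share with the request."""
    request_words = set(request.lower().split())
    scored = [(len(request_words & set(m.lower().split())), m) for m in store]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:k]]


def run_turn(
    store: list[str],
    recent_turns: list[str],
    request: str,
    model: Callable[[str], str],
    extract: Callable[[str, str], list[str]],
) -> str:
    memories = retrieve(store, request)                       # step 2: RETRIEVE
    context = "\n".join(                                      # step 3: ASSEMBLE
        ["Memories:"] + memories + ["Recent turns:"] + recent_turns + ["User: " + request]
    )
    reply = model(context)                                    # step 4: REASON / ACT
    for item in extract(request, reply):                      # step 5: WRITE
        if item not in store:
            store.append(item)
    return reply


def fake_model(context: str) -> str:
    """Placeholder for an LLM call: it can only react to what is in the context."""
    return "Try a lentil curry." if "vegetarian" in context else "Any dietary needs?"


def fake_extract(request: str, reply: str) -> list[str]:
    """Placeholder extractor: persist only statements the user explicitly flags."""
    prefix = "remember that "
    return [request[len(prefix):]] if request.lower().startswith(prefix) else []


store: list[str] = []
run_turn(store, [], "Remember that I am vegetarian", fake_model, fake_extract)
# Step 6: a new session starts with no recent turns; retrieval repopulates the context.
later_reply = run_turn(store, [], "What should I cook, I am hungry", fake_model, fake_extract)
```

Running the second request against an empty store gives the generic reply instead, which is the stateless behavior the loop is meant to fix.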

Vector databases store embeddings and retrieve memories by semantic similarity (the RAG machinery).
Rolling summaries compress old turns; hierarchical systems keep core, recall, and archival tiers.
Anthropic's Claude memory tool is file-based: Claude edits files in a /memories directory you host.
Managed memory layers and SDKs (such as LangMem) package extraction, storage, and retrieval.
Consumer features like ChatGPT's saved memories and reference chat history are long-term memory in practice.

How to view, manage, and turn off agent memory

For consumer products, memory is a setting you control. In ChatGPT, memory lives under Settings then Personalization, where you can toggle 'Reference saved memories' and 'Reference chat history' on or off, view and delete individual saved memories, and ask the assistant directly what it remembers or to forget something, according to OpenAI's Memory FAQ. Turning off 'Reference saved memories' also turns off 'Reference chat history'. To have a conversation that is not saved to or drawn from memory, use a Temporary Chat, which does not appear in your history, does not reference saved memories, and does not create new ones.

For developer-built agents, control is at the implementation layer. With Anthropic's memory tool you decide where files are stored and can delete or expire them; the documentation explicitly recommends tracking file sizes, periodically clearing files that have not been accessed, and strict path validation so the agent cannot read outside its /memories directory. In a vector-database setup, managing memory means editing or deleting the underlying records and re-embedding as needed. Either way, because the model itself is stateless, removing a memory from the store is what actually makes the agent forget it.
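For a file-based store, path validation and deletion might look like the sketch below. It is an illustration of the general technique, not Anthropic's reference code: `resolve_memory_path` and `forget` are hypothetical helpers, and they assume the agent passes paths relative to a memory root directory that you choose.

```python
from pathlib import Path


def resolve_memory_path(root: Path, requested: str) -> Path:
    """Return the absolute path for requested, or raise if it escapes root."""
    root = root.resolve()
    # resolve() collapses ".." segments and symlinks before the containment check.
    candidate = (root / requested).resolve()
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"path escapes memory directory: {requested}")
    return candidate


def forget(root: Path, requested: str) -> bool:
    """Delete one memory file; return True if something was removed."""
    target = resolve_memory_path(root, requested)
    if target.is_file():
        target.unlink()
        return True
    return False
```

A call such as `forget(Path("memories"), "user/preferences.md")` removes that file if it exists, while a request like `"../secrets.txt"` raises `ValueError` before anything is touched. An absolute path is rejected the same way, because joining it onto the root replaces the root entirely.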

Privacy posture varies by vendor, so check the specifics rather than assuming. Some providers may use stored conversations to improve their models unless you opt out; others isolate memory per user and keep it out of training. The honest framing is that 'memory' means your data is being retained somewhere, so the controls that matter are visibility (can you see what is stored), editability (can you change or delete it), and data use (is it used for training). Read each product's current settings and policy before relying on it.

ChatGPT: Settings > Personalization to toggle 'Reference saved memories' and 'Reference chat history'.
Use a Temporary Chat in ChatGPT for a session that is not saved to or read from memory.
For developer agents, deleting the underlying record or file is what makes the agent forget.
Anthropic's memory tool recommends size limits, periodic cleanup, and path-traversal protection.
Check each vendor's data-use policy; the key controls are visibility, editability, and training use.

Key takeaways

AI agent memory has two tiers: short-term memory inside the context window and long-term memory in a persistent external store.
The context window behaves like RAM (fast, temporary, wiped between independent requests), not like disk storage; once a chat outgrows the window, its earliest details drop out of view.
The MemGPT paper (2023) popularized treating the window as RAM and an external store as disk, moving data between tiers on demand.
Long-term memory is usually split into semantic (facts), episodic (past events), and procedural (how-to) memory, and capable agents use all three.
Memory is implemented via vector databases, rolling summaries, file-based stores, or managed memory layers; the vector-database approach uses the same retrieval machinery as RAG and is distinguished mainly by intent.
Memory is controllable: consumer features like ChatGPT's let you view, edit, and turn it off, and deleting the stored record is what makes a stateless model actually forget.

Frequently asked questions

How is AI agent memory different from RAG?

They can share the same retrieval machinery (embeddings and a vector store), but differ in intent. RAG retrieves from a relatively fixed knowledge corpus to ground an answer to a question. Agent memory accumulates, updates, and reuses information about a specific user or ongoing task over time, including facts, past events, and learned procedures.

Is the context window the same as long-term memory?

No. The context window is short-term working memory: a fixed-size block the model reads on each turn, comparable to RAM. It is fast but temporary and gets wiped between independent requests. True long-term memory is a separate persistent store that survives across sessions, and content must be retrieved from it back into the window to be used.

How do I turn off memory in ChatGPT?

Open Settings, then Personalization, and toggle off 'Reference saved memories' and 'Reference chat history', per OpenAI's Memory FAQ. Turning off saved memories also turns off chat-history reference. You can delete individual saved memories there, or start a Temporary Chat for a conversation that is not saved to or drawn from memory.

What is the difference between semantic, episodic, and procedural memory?

Semantic memory stores facts and stable knowledge about the user or world. Episodic memory stores specific past events and interactions. Procedural memory stores how to perform tasks, which often lives in the agent's system prompt, code, and tools rather than a database. Frameworks like LangGraph map these to facts, experiences, and instructions.

Is AI agent memory private?

It depends on the vendor, so check the specific product. Memory by definition means your data is retained somewhere, and some providers may use it for model training unless you opt out. The controls that matter are whether you can see what is stored, edit or delete it, and whether it is used for training. A privacy-first service can isolate memory per user, encrypt it at rest, and offer on-device options, but you should verify each provider's current policy rather than assume.

Can an AI agent remember things across sessions?

Not without a memory system. The underlying language model is stateless and only sees what is in the current context window. Cross-session recall requires the surrounding agent or product to write information to a persistent store and retrieve it later. If memory is off or not implemented, each new session starts fresh.

Put the idea into practice

MemX is an AI memory agent built on these ideas: store anything, skip the folders, and find it again by asking in plain English.
