Memory & Cognition

Generative Agents (Memory Stream & Reflection)

By Aditya Kumar Jha, Engineer

Generative Agents are LLM-driven simulated characters from Stanford and Google research that store experiences in a natural-language memory stream, score memories by recency, importance, and relevance for retrieval, and synthesize higher-level reflections to drive believable behavior.

What are Generative Agents?

Generative Agents are LLM-powered simulated characters introduced by Park et al. in the 2023 paper "Generative Agents: Interactive Simulacra of Human Behavior". In a small sandbox town called Smallville, 25 agents went about daily routines, formed opinions, started conversations, and even coordinated a Valentine's Day party, all driven by language-model reasoning over remembered experience. In the paper's ablation study, agents that lost access to observations, reflections, or plans were rated less believable, which points to the external memory architecture, rather than the base model alone, as the source of consistent long-horizon behavior.

The core contribution is that architecture: a memory stream that records experiences in natural language, a retrieval mechanism that surfaces the right memories at the right time, and a reflection process that turns raw observations into higher-level insight. Together these let an agent act coherently across many simulated days far beyond what a single context window could hold.

  • Introduced by Park et al. (2023) in the Smallville sandbox simulation.
  • 25 agents showed emergent social behavior driven by remembered experience.
  • The contribution is the memory architecture, not a new base model.

The Memory Stream

The memory stream is a long, time-stamped list of memory objects written in natural language. Each object records an observation, such as something the agent saw, did, or was told, along with its creation time and the time it was last accessed. Nothing is discarded: the stream is a complete record that grows continuously as the simulation runs.

Because the stream quickly becomes far too large to fit in a prompt, the agent never feeds all of it to the model. Instead, at each decision point it retrieves a small, relevant subset. The design problem is therefore retrieval: choosing which handful of memories out of thousands should inform the next action.

  • Each memory object stores a natural-language description plus timestamps.
  • The stream is append-only and grows without bound during the simulation.
  • Only a small retrieved subset is placed into the prompt at any time.
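
As a rough illustration of the data structure described above, here is a minimal Python sketch. The names `MemoryObject`, `MemoryStream`, `add`, and `touch` are hypothetical and chosen for this example; they are not taken from the paper's code.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class MemoryObject:
    """One entry in the stream: a description plus two timestamps."""
    description: str
    created_at: datetime
    last_accessed_at: datetime


@dataclass
class MemoryStream:
    """Append-only list of memory objects; this sketch never deletes entries."""
    memories: List[MemoryObject] = field(default_factory=list)

    def add(self, description: str, now: datetime) -> MemoryObject:
        # A new memory counts as "accessed" at the moment it is created.
        memory = MemoryObject(description, created_at=now, last_accessed_at=now)
        self.memories.append(memory)
        return memory

    def touch(self, memory: MemoryObject, now: datetime) -> None:
        # Retrieval updates the access time, which resets the recency clock.
        memory.last_accessed_at = now


stream = MemoryStream()
morning = datetime(2023, 2, 13, 9, 0)
seen = stream.add("saw a neighbor watering plants", morning)
stream.touch(seen, datetime(2023, 2, 13, 12, 0))
```

Keeping the creation time separate from the last-access time matters because recency scoring depends on the latter, while the former preserves when the event actually happened.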

Retrieval Scoring: Recency, Importance, Relevance

Generative Agents rank candidate memories by combining three signals. Recency favors memories that were accessed recently and is modeled as exponential decay over the hours since last access. Importance captures how significant a memory is: the agent asks the language model to rate it on a scale of 1 to 10, so a mundane observation scores low and a meaningful life event scores high. Relevance is the semantic similarity between the current query and the memory, computed as the cosine similarity of their embeddings.

Each component is min-max normalized to the range 0 to 1, and the three are summed (with equal weights in the paper) to produce a final retrieval score. The top-ranked memories that fit in the context window are passed into the prompt. Many later agent memory systems reuse this scoring scheme as a reference design.

score = α_rec · recency + α_imp · importance + α_rel · relevance
Each memory's retrieval score sums normalized recency, importance, and relevance; the paper weights all three equally (α = 1).
recency = γ^(hours since last access), with γ = 0.995
Recency decays exponentially with decay factor γ = 0.995, so memories not touched in a long time gradually lose retrieval priority.
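
To get a feel for how quickly γ = 0.995 discounts a memory, the short Python sketch below evaluates the raw (un-normalized) recency formula at a few elapsed times and derives the half-life. The function names `recency` and `half_life_hours` are illustrative, not from the paper.

```python
import math

GAMMA = 0.995  # decay factor stated above


def recency(hours_since_last_access: float, gamma: float = GAMMA) -> float:
    """Raw recency score: gamma raised to the number of elapsed hours."""
    return gamma ** hours_since_last_access


def half_life_hours(gamma: float = GAMMA) -> float:
    """Hours until the raw recency score falls to 0.5."""
    return math.log(0.5) / math.log(gamma)


for hours in (0, 1, 24, 168):
    print(f"{hours:>4} h -> {recency(hours):.3f}")
print(f"half-life: {half_life_hours():.1f} h")
```

With this decay factor, a memory untouched for a day keeps roughly 0.89 of its raw recency, one untouched for a week keeps roughly 0.43, and the half-life is about 138 hours.
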
  • Recency: exponential decay since the memory was last accessed.
  • Importance: an LLM-assigned score from 1 to 10 of how significant the memory is.
  • Relevance: cosine similarity between the query embedding and the memory embedding.
  • The three normalized scores are summed, and the highest are retrieved.
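
The scoring scheme above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: memories are plain dicts with hypothetical keys, embeddings are tiny hand-written vectors rather than real model outputs, normalization is applied across the candidate set, and a component whose values are all identical is mapped to zeros.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale to [0, 1]; identical values all map to 0.0 (an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def retrieve(
    query_embedding: Sequence[float],
    memories: List[Dict],
    top_k: int = 3,
    gamma: float = 0.995,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> List[Tuple[float, str]]:
    """Return the top_k (score, text) pairs, highest score first."""
    if not memories:
        return []
    recency = min_max([gamma ** m["hours_since_access"] for m in memories])
    importance = min_max([m["importance"] for m in memories])
    relevance = min_max(
        [cosine_similarity(query_embedding, m["embedding"]) for m in memories]
    )
    a_rec, a_imp, a_rel = weights
    scored = [
        (a_rec * r + a_imp * i + a_rel * s, m["text"])
        for r, i, s, m in zip(recency, importance, relevance, memories)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy data: 2-D vectors stand in for real embeddings.
memories = [
    {"text": "ate breakfast", "hours_since_access": 1,
     "importance": 1, "embedding": [1.0, 0.0]},
    {"text": "friend announced a party", "hours_since_access": 30,
     "importance": 7, "embedding": [0.0, 1.0]},
    {"text": "walked past the park", "hours_since_access": 100,
     "importance": 2, "embedding": [0.6, 0.8]},
]
results = retrieve([0.0, 1.0], memories, top_k=2)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

In this toy run the party memory ranks first: it is not the most recent, but it is both the most important and the most relevant to the query, which shows how the summed score lets two strong signals outweigh one weak one.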

Reflection and Planning

Raw observations alone do not let an agent generalize, for example to conclude that another character is passionate about research. Reflection fills this gap. Periodically, when the sum of importance scores for recent memories crosses a threshold (150 in the paper's implementation), the agent pauses, asks itself what salient questions arise from recent memories, retrieves evidence for each question, and synthesizes higher-level statements. These reflections are written back into the memory stream as new memory objects, so they can themselves be retrieved and reflected upon, forming a tree of increasingly abstract insight.

Reflections and retrieved memories feed planning, where the agent lays out a day and adjusts it as events unfold. The loop of observe, retrieve, reflect, and plan is what produces behavior that stays consistent with an agent's identity and past over long stretches of simulated time.

  • Reflection triggers when accumulated importance crosses a threshold.
  • The agent poses questions, gathers evidence, and writes synthesized insights back into the stream.
  • Reflections plus retrieval drive day-level planning and consistent long-horizon behavior.
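
The reflection trigger and write-back loop can be sketched as follows. This is an illustrative Python outline, not the paper's code: `ReflectionTrigger` and `reflect` are hypothetical names, and the three callables passed to `reflect` are stand-ins for the LLM and retrieval calls a real agent would make. The default threshold of 150 is the value the paper reports, but it is best treated as a tunable parameter.

```python
from typing import Callable, List, Sequence


class ReflectionTrigger:
    """Accumulates importance scores and signals when a reflection is due."""

    def __init__(self, threshold: float = 150.0) -> None:
        self.threshold = threshold
        self.accumulated = 0.0

    def observe(self, importance: float) -> bool:
        """Add one new memory's importance; return True if reflection is due."""
        self.accumulated += importance
        return self.accumulated >= self.threshold

    def reset(self) -> None:
        self.accumulated = 0.0


def reflect(
    recent_memories: Sequence[str],
    ask_questions: Callable[[Sequence[str]], List[str]],
    retrieve_evidence: Callable[[str], List[str]],
    synthesize: Callable[[str, List[str]], List[str]],
) -> List[str]:
    """One pass: pose questions, gather evidence, return synthesized insights."""
    insights: List[str] = []
    for question in ask_questions(recent_memories):
        evidence = retrieve_evidence(question)
        insights.extend(synthesize(question, evidence))
    return insights


# Toy run with stub callables and a low threshold so the trigger fires quickly.
trigger = ReflectionTrigger(threshold=10)
stream: List[str] = []
observations = [
    ("made coffee", 1),
    ("the neighbor spent all day on a research paper", 6),
    ("the neighbor skipped lunch to keep writing", 5),
]
for text, importance in observations:
    stream.append(text)
    if trigger.observe(importance):
        new_insights = reflect(
            stream,
            ask_questions=lambda mems: ["What is the neighbor passionate about?"],
            retrieve_evidence=lambda question: stream[-2:],
            synthesize=lambda question, ev: [
                f"Insight drawn from {len(ev)} memories: {question}"
            ],
        )
        stream.extend(new_insights)  # reflections re-enter the stream
        trigger.reset()
```

Because the insights are appended to the same stream as the observations, a later reflection pass can retrieve them as evidence, which is how the tree of increasingly abstract statements builds up.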

Key takeaways

  • Generative Agents store experiences in an append-only natural-language memory stream.
  • Retrieval ranks memories by summing normalized recency, importance, and relevance scores.
  • Recency uses exponential decay, importance is LLM-assigned, and relevance is embedding cosine similarity.
  • Reflection synthesizes raw observations into higher-level insights stored back in the stream.
  • The architecture, not the base model, is what enables believable long-horizon behavior.

Frequently asked questions

What are Generative Agents?
Generative Agents are LLM-driven simulated characters from a 2023 Stanford and Google study. In a sandbox town, they remember experiences, form relationships, and plan their days, demonstrating believable human-like behavior driven by a memory and reflection architecture rather than a new model.

How do Generative Agents store memories?
They use a memory stream, an append-only list of time-stamped natural-language records of observations and actions. Because it grows too large for a prompt, the agent retrieves only the most relevant memories when deciding what to do next.

How are memories scored for retrieval?
Each memory gets three normalized scores: recency (exponential decay since last access), importance (an LLM rating of significance), and relevance (cosine similarity to the current query). The scores are summed with equal weight, and the top memories are retrieved into the prompt.

What is reflection?
Reflection is a periodic process where the agent reviews recent important memories, asks itself salient questions, and synthesizes higher-level conclusions. These insights are written back into the memory stream and can be retrieved later, building increasingly abstract understanding.

Why do Generative Agents matter?
They provided an influential, reusable blueprint: a memory stream plus recency-importance-relevance retrieval plus reflection. Many later AI agent memory systems adopt this scoring scheme, making the paper a foundational reference for long-term agent memory.