Agentic RAG

Agentic RAG is a retrieval-augmented generation design in which an AI agent, rather than a fixed pipeline, decides when, what, and how to retrieve. It runs a plan, retrieve, and critique loop that can rewrite queries, call multiple tools or sources, judge the evidence it gets back, and retrieve again before answering.

What is Agentic RAG?

Agentic RAG is a form of retrieval-augmented generation (RAG) in which an autonomous agent controls the retrieval process instead of a fixed, one-shot pipeline. In traditional RAG, the system embeds a user query, fetches the top matching chunks from a vector database once, and passes them to the model to generate an answer. In agentic RAG, a language model acting as an agent decides whether to retrieve at all, reformulates the query, chooses which tool or data source to query, evaluates the documents it gets back, and can repeat the cycle several times before producing a final response.

The 2025 survey "Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG" (Singh et al., arXiv 2501.09136) defines the approach as RAG pipelines enhanced with autonomous agents that use reflection, planning, tool use, and multi-agent collaboration to manage retrieval strategies dynamically. These four agentic design patterns are what separate it from a static retriever. Reflection lets the agent critique its own intermediate results, planning lets it break a hard question into steps, tool use lets it reach beyond a single vector index, and multi-agent collaboration lets several agents divide the work.

The practical effect is that retrieval becomes a loop rather than a single step. The agent treats each retrieval as evidence to be judged, not as a finished answer. If the retrieved passages are irrelevant, thin, or contradictory, the agent can rewrite the query, switch sources, or decompose the question and try again. This adaptive control is the core idea behind agentic RAG.

An agent, not a fixed pipeline, decides when, what, and how to retrieve.
Built on four patterns: reflection, planning, tool use, and multi-agent collaboration.
Retrieval becomes an iterative loop with self-evaluation rather than a single pass.
The agent can rewrite queries, switch tools or sources, and retrieve again.
Defined in the agentic RAG survey (Singh et al., arXiv 2501.09136, 2025).

How does agentic RAG differ from traditional (naive) RAG?

Traditional or naive RAG is a fixed pipeline: embed the query, run a single similarity search, and generate from whatever chunks come back. It performs the same steps in the same order for every question, which keeps it fast, cheap, and predictable but leaves it unable to recover from a bad retrieval. If the first search misses, the model either answers from weak evidence or hallucinates. Agentic RAG adds a decision-making layer on top of retrieval so the system can detect that failure and respond to it.
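The fixed pipeline described above can be sketched in a few lines. This is a minimal illustration, not a production retriever: the bag-of-words `embed` function stands in for a real embedding model, and `echo_generate` is a hypothetical generator that simply returns the context it receives.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def naive_rag(question, chunks, generate, top_k=2):
    """Embed once, search once, generate once. Retrieval quality is never checked."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return generate(question, ranked[:top_k])

chunks = [
    "Self-RAG uses reflection tokens to decide when to retrieve.",
    "CRAG adds a retrieval evaluator and a web search fallback.",
    "Bananas are rich in potassium.",
]

def echo_generate(question, context):
    """Hypothetical generator: returns the context it was handed."""
    return context

top = naive_rag("What does CRAG add?", chunks, echo_generate, top_k=1)
```

Note that `naive_rag` always returns `top_k` chunks, even when none of them match the question. That is the failure mode the agentic loop is meant to catch.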

The clearest practical difference is control flow. Anthropic describes traditional RAG as static retrieval that fetches the chunks most similar to an input query, then contrasts it with the dynamic, multi-step search used in its Claude Research feature, where the system adapts based on what it finds along the way. In a multi-step search the agent reads partial results, decides what is still missing, and issues new searches, which lets it handle multi-hop questions that a single similarity search cannot.

Several published patterns sit between naive RAG and full agentic RAG. Self-RAG (Asai et al., arXiv 2310.11511, 2023) trains a model to emit reflection tokens that decide when to retrieve and whether passages are relevant and supported. Corrective RAG, or CRAG (Yan et al., arXiv 2401.15884, 2024), adds a lightweight retrieval evaluator that scores document quality and falls back to web search when the local corpus is weak. Both add self-correction, which is the defining move of the agentic style.
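The corrective idea can be sketched as a score-then-fallback step. This is a simplified illustration in the spirit of CRAG, not the paper's method: the term-overlap score stands in for a real retrieval evaluator, the `threshold` value is an assumption, and `local_search` and `web_search` are hypothetical callables supplied by the caller.

```python
def overlap_score(query, doc):
    """Stand-in evaluator: fraction of query terms found in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

def corrective_retrieve(query, local_search, web_search, threshold=0.5):
    """Keep local documents that score well; otherwise fall back to web search."""
    docs = local_search(query)
    good = [d for d in docs if overlap_score(query, d) >= threshold]
    if good:
        return "local", good
    return "web", web_search(query)

def local_search(query):
    """Hypothetical local corpus lookup with fixed results."""
    return ["agentic rag uses a retrieval loop", "bananas are yellow"]

def web_search(query):
    """Hypothetical web search stub."""
    return ["web result for: " + query]

hit = corrective_retrieve("agentic rag loop", local_search, web_search)
miss = corrective_retrieve("tax law 2024", local_search, web_search)
```

In this sketch `hit` stays on the local corpus because one document covers every query term, while `miss` is routed to the web stub because no local document clears the threshold.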

Naive RAG: one embed, one search, one generation, no recovery from bad retrieval.
Agentic RAG: an agent inspects results and decides whether to retrieve again.
Agentic RAG handles multi-hop and ambiguous queries that single-shot search misses.
Self-RAG uses learned reflection tokens to control retrieval and judge passages.
CRAG adds a retrieval evaluator and a web-search fallback when local results are weak.

The plan, retrieve, critique loop

At the center of agentic RAG is a control loop that repeats until the agent is confident or hits a stop condition. The agent plans by deciding what it needs to know and how to break the question down, retrieves by selecting a source and issuing a query, then critiques by judging whether the returned evidence is relevant and sufficient. Based on that judgment it either answers, rewrites the query and retrieves again, or switches to a different tool. The loop ends on a stop condition such as high confidence with citations, a maximum step count, or a cost or token budget cap.
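The stop conditions named above can be expressed as one small check that runs on every iteration. This is a sketch under assumptions: the `LoopState` fields and the default thresholds (`max_steps`, `token_budget`, `min_confidence`) are illustrative choices, not values from any particular system.

```python
from dataclasses import dataclass

@dataclass
class LoopState:
    steps: int = 0            # loop iterations completed so far
    tokens_used: int = 0      # running token total across all calls
    confidence: float = 0.0   # agent's self-assessed confidence, 0 to 1
    citations: int = 0        # supporting passages gathered so far

def stop_reason(state, max_steps=4, token_budget=20_000, min_confidence=0.8):
    """Return why the loop should stop, or None to keep going."""
    if state.confidence >= min_confidence and state.citations > 0:
        return "confident_with_citations"
    if state.steps >= max_steps:
        return "max_steps"
    if state.tokens_used >= token_budget:
        return "budget_exhausted"
    return None
```

Requiring at least one citation alongside high confidence keeps the loop from ending on an answer the agent feels sure about but cannot support with evidence.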

Self-evaluation is what makes the loop work. In Self-RAG, the model predicts four kinds of reflection tokens: Retrieve (whether to fetch passages), IsRel (whether a passage is relevant), IsSup (whether the answer is supported by the passage), and IsUse (how useful the overall response is). These let the model explicitly ask whether it needs to retrieve and whether the evidence actually backs its claims, which reduces hallucination compared with generating from weak retrieval. CRAG's evaluator plays a similar role by routing to corrective actions when document confidence is low.
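One way to picture how critique signals steer generation is to score candidate answers by their relevance, support, and utility judgments. The sketch below is only an illustration: Self-RAG produces these signals as token predictions from a trained model, whereas here they are plain fields, and the label values and weights are assumptions chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class Critique:
    is_rel: bool   # IsRel: is the passage relevant?
    is_sup: str    # IsSup: "full", "partial", or "none"
    is_use: int    # IsUse: utility rating from 1 (low) to 5 (high)

# Assumed numeric values for the support labels.
SUPPORT = {"full": 1.0, "partial": 0.5, "none": 0.0}

def critique_score(c, w_rel=1.0, w_sup=1.0, w_use=0.5):
    """Combine the three critique signals into one number (weights are assumed)."""
    utility = (c.is_use - 1) / 4  # rescale 1..5 to 0..1
    return w_rel * float(c.is_rel) + w_sup * SUPPORT[c.is_sup] + w_use * utility

def pick_best(candidates):
    """candidates: list of (answer_text, Critique). Return the best-scored answer."""
    return max(candidates, key=lambda pair: critique_score(pair[1]))[0]

candidates = [
    ("fluent but unsupported", Critique(is_rel=True, is_sup="none", is_use=5)),
    ("plain but grounded", Critique(is_rel=True, is_sup="full", is_use=3)),
]
best = pick_best(candidates)
```

With these weights the grounded answer wins even though the unsupported one has a higher utility rating, which mirrors the goal of preferring claims the evidence actually backs.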

Query rewriting and decomposition are common operations inside the loop. Instead of searching with the raw user question, the agent can rephrase it for better recall, split a multi-part question into sub-questions, or generate hypothetical phrasings to match the corpus vocabulary.
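Decomposition and rewriting can be illustrated with simple string rules. In a real system both steps would usually be model calls; the rule-based `decompose` and the `synonyms` map below are hypothetical stand-ins that only show the shape of the inputs and outputs.

```python
import re

def decompose(question):
    """Rule-based stand-in for an LLM call: split on ' and ' or ';'."""
    parts = re.split(r"\s+and\s+|;\s*", question.strip().rstrip("?"))
    return [p.strip() + "?" for p in parts if p.strip()]

def rewrite(sub_question, synonyms):
    """Expand a sub-question with alternative phrasings from a synonym map."""
    variants = [sub_question]
    for term, alternatives in synonyms.items():
        if term in sub_question:
            variants.extend(sub_question.replace(term, alt) for alt in alternatives)
    return variants

subs = decompose("Who wrote Self-RAG and when was CRAG published?")
variants = rewrite(subs[1], {"CRAG": ["Corrective RAG"]})
```

Splitting on "and" is deliberately naive and would wrongly break a phrase such as "salt and pepper", which is one reason production systems delegate this step to the agent.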

python

def agentic_rag(question, agent, budget, max_steps=4):
    """Plan-retrieve-critique loop.

    `agent` and `budget` are caller-supplied objects; the methods called on
    them are illustrative, not a real library API.
    """
    plan = agent.plan(question)        # decompose into sub-questions
    evidence = []
    failed = []                        # attempts that returned nothing useful
    for _ in range(max_steps):
        if budget.spent():
            break                      # cost or token cap reached
        sub = agent.next_subquestion(plan, evidence)
        if sub is None:
            break                      # nothing left to look up
        query = agent.rewrite(sub, failed)     # query rewriting / expansion
        tool = agent.choose_tool(sub, failed)  # vector store, SQL, or web search
        docs = tool.retrieve(query)
        # critique: are these docs relevant and sufficient?
        if not agent.is_relevant(docs, sub):
            failed.append((sub, query, tool))  # lets the retry rewrite or switch source
            continue
        evidence.extend(agent.keep_supporting(docs, sub))
        if agent.is_confident(question, evidence):
            break
    return agent.generate(question, evidence)  # answer with citations

Simplified agentic RAG plan-retrieve-critique loop (pseudocode).

Loop: plan, choose a source, retrieve, critique the evidence, then answer or retry.
Stop conditions include confident-with-citations, max steps, and a budget cap.
Self-RAG tokens: Retrieve, IsRel (relevance), IsSup (support), IsUse (utility).
Query rewriting and decomposition improve recall on hard or multi-hop questions.
Self-evaluation at each step is what cuts hallucination from weak retrieval.

Single-agent vs multi-agent retrieval

Agentic RAG comes in single-agent and multi-agent forms. In a single-agent (router) design, one agent runs the whole loop: it picks tools, retrieves, critiques, and answers by itself. This is the simplest agentic setup and is often enough to add query routing and a retry loop on top of an existing retriever. The survey (Singh et al., arXiv 2501.09136) organizes systems along axes such as agent count, control structure, and knowledge representation, with families including single-agent, multi-agent, hierarchical, corrective, adaptive, and graph-based RAG.

In a multi-agent design, a lead or orchestrator agent breaks the task into parts and spawns specialized subagents that work in parallel. Anthropic's Claude Research feature, which it shipped in April 2025, is a production example of this orchestrator-worker pattern: a lead agent analyzes the query, plans a strategy, and spawns subagents that each search independently and report findings back, after which the lead agent synthesizes the results and decides whether more research is needed. Parallelism helps with broad questions that span many sources, and specialization lets each subagent focus on one slice of the problem.

Multi-agent retrieval is more capable but harder to run. More agents mean more coordination, more failure points, and far higher token usage. Anthropic reports that in its data, agents typically use about four times more tokens than a chat interaction, and multi-agent systems about fifteen times more, so the pattern pays off mainly on high-value tasks where the accuracy gain justifies the cost.
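The reported multipliers translate into a quick back-of-the-envelope estimate. The sketch below applies the roughly 4x and 15x figures to a chat baseline; the baseline token count and the price per million tokens are placeholder inputs, not real pricing.

```python
# Approximate multipliers relative to a chat interaction, as reported by Anthropic.
TOKEN_MULTIPLIER = {"chat": 1, "single_agent": 4, "multi_agent": 15}

def estimated_tokens(chat_tokens, mode):
    """Scale a chat-sized token count by the multiplier for the given mode."""
    return chat_tokens * TOKEN_MULTIPLIER[mode]

def estimated_cost(chat_tokens, mode, price_per_million_tokens):
    """Rough cost of one query; the price is a caller-supplied assumption."""
    return estimated_tokens(chat_tokens, mode) * price_per_million_tokens / 1_000_000

agent_tokens = estimated_tokens(2_000, "single_agent")
multi_tokens = estimated_tokens(2_000, "multi_agent")
```

For a 2,000-token chat baseline this gives about 8,000 tokens for a single agent and about 30,000 for a multi-agent run, which is why the pattern is usually reserved for high-value queries.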

python

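from concurrent.futures import ThreadPoolExecutor

def multi_agent_rag(question, lead, make_subagent, max_rounds=2):
    """Orchestrator-worker retrieval.

    `lead` and `make_subagent` are caller-supplied; the methods called on
    them are illustrative, not a real library API.
    """
    plan = lead.plan(question)          # break the task into subtasks
    evidence = []
    for _ in range(max_rounds):
        subtasks = lead.assign(plan, evidence)
        if not subtasks:
            break
        # spawn one subagent per subtask and run the searches in parallel threads
        with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
            results = list(pool.map(lambda t: make_subagent(t).search(), subtasks))
        evidence.extend(results)
        if lead.is_complete(question, evidence):
            break                       # enough gathered, stop spawning
    return lead.synthesize(question, evidence)  # combine + cite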

Simplified multi-agent (orchestrator-worker) retrieval (pseudocode).

Single-agent (router) RAG: one agent routes, retrieves, critiques, and answers.
Multi-agent RAG: a lead agent spawns parallel subagents that each search a slice.
Anthropic's Claude Research uses a lead-agent-plus-subagents orchestration.
The survey adds hierarchical, corrective, adaptive, and graph-based families to the taxonomy.
Multi-agent gains accuracy on broad tasks but multiplies cost and coordination.

Tradeoffs, cost, and when to use agentic RAG

Agentic RAG buys accuracy and recovery from failed retrieval at the price of cost, latency, and predictability. Because the loop can issue several LLM calls and multiple searches per question, a single answer may involve three to five model calls instead of one, which can triple or quadruple latency versus single-shot RAG. Each extra retrieval and critique step adds tokens, so the cost per query rises with the number of loop iterations. Looping and tool calls also make the system harder to debug, since a wrong answer can come from any step rather than from a single retrieval.

Because of that, a common production pattern is to route by difficulty rather than send every query through the full loop. A cheap, fast classifier at the front decides whether a question is a simple lookup that standard RAG can answer or a complex multi-hop question that needs the agentic pipeline. The survey calls this adaptive RAG: a controller predicts task complexity and invokes the minimum workflow required. This keeps latency and cost low for the easy majority while reserving the expensive loop for questions that actually need iterative retrieval, decomposition, or multiple sources.
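Difficulty routing can be sketched as a cheap check in front of two pipelines. The keyword heuristic below is a stand-in for a trained classifier, and the cue list, word limit, and the `standard_rag` and `agentic_rag` callables are all assumptions made for illustration.

```python
# Assumed phrases that hint at a multi-hop or comparative question.
MULTI_HOP_CUES = ("compare", "versus", " vs ", "both", "and then", "why", "how does")

def classify(question, max_simple_words=12):
    """Cheap heuristic stand-in for a trained complexity classifier."""
    padded = " " + question.lower() + " "
    if len(question.split()) > max_simple_words:
        return "complex"
    if any(cue in padded for cue in MULTI_HOP_CUES):
        return "complex"
    return "simple"

def answer(question, standard_rag, agentic_rag):
    """Send simple lookups to one-shot RAG and everything else to the agentic loop."""
    route = classify(question)
    pipeline = standard_rag if route == "simple" else agentic_rag
    return route, pipeline(question)

easy = answer("What is CRAG?", lambda q: "one-shot", lambda q: "loop")
hard = answer("Compare Self-RAG and CRAG on multi-hop questions",
              lambda q: "one-shot", lambda q: "loop")
```

A misroute in either direction has a cost: a hard question sent to one-shot RAG may get a weak answer, and an easy one sent to the loop pays for iterations it did not need, so the classifier's threshold is worth tuning against real traffic.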

Agentic RAG also overlaps with, but is distinct from, agent memory. RAG and its agentic variants pull external documents into context at query time, while agent memory persists facts and preferences about a user or task across sessions. A system can use both: agentic retrieval to gather evidence and a memory layer to remember context. For personal-memory products that store user data, the privacy posture matters; MemX, for example, is private by architecture, using per-user isolation and encryption at rest with optional on-device handling rather than pooling everyone's data into one shared index.

Agentic RAG adds accuracy and self-correction at higher cost and latency.
Expect roughly 3 to 5 model calls per query versus 1 for naive RAG.
Route by difficulty: a cheap classifier sends only hard queries to the full loop.
Looping and tool calls reduce predictability and complicate debugging.
Retrieval (RAG) fetches documents per query; agent memory persists across sessions.

Key takeaways

Agentic RAG puts an autonomous agent in charge of retrieval, so the system decides when, what, and how to retrieve instead of running a fixed one-shot pipeline.
It is built on four agentic patterns from the 2025 survey (arXiv 2501.09136): reflection, planning, tool use, and multi-agent collaboration.
The core mechanism is a plan, retrieve, critique loop that rewrites queries, judges evidence, and retrieves again until a stop condition is met.
Self-RAG (reflection tokens) and CRAG (a retrieval evaluator with web-search fallback) are published patterns that add the self-correction central to agentic RAG.
Single-agent RAG runs the loop in one agent; multi-agent RAG uses a lead agent and parallel subagents, as in Anthropic's Claude Research, at much higher token cost.
The tradeoff is accuracy versus cost and latency, so production systems often route only hard, multi-hop queries through the full agentic loop.

Frequently asked questions

Agentic RAG is retrieval-augmented generation where an AI agent controls the search instead of a fixed pipeline. The agent decides whether to retrieve, rewrites the query, picks which source or tool to use, checks whether the results are good enough, and searches again if needed before answering. This turns a single retrieval step into an adaptive loop.

Traditional (naive) RAG runs one similarity search and generates from whatever chunks come back, with no way to recover if the search misses. Agentic RAG adds a decision layer that evaluates the retrieved evidence and can rewrite the query, switch sources, or retrieve again. Anthropic frames the difference as static retrieval versus dynamic, multi-step search that adapts to what it finds.

The plan, retrieve, critique loop is the control loop at the heart of agentic RAG. The agent plans what it needs to know, retrieves from a chosen source, then critiques whether the evidence is relevant and sufficient. Based on that judgment it answers, retries with a rewritten query, or switches tools, and the loop stops on confidence, a step limit, or a budget cap.

Single-agent RAG runs the whole loop in one agent that routes, retrieves, critiques, and answers. Multi-agent RAG uses a lead agent that spawns specialized subagents to search in parallel and then synthesizes their findings, as in Anthropic's Claude Research. Multi-agent is more capable on broad questions but uses far more tokens, around fifteen times a normal chat in Anthropic's data.

Agentic RAG is usually slower and more expensive than traditional RAG. Because the loop can make several model calls and multiple searches per question, agentic RAG often uses three to five model calls instead of one, which raises both latency and token cost. To control this, many systems route only complex multi-hop questions through the agentic loop and send simple lookups through standard RAG.

Agentic RAG is not the same as agent memory. Agentic RAG fetches external documents into context at query time to answer the current question, while agent memory persists facts and preferences about a user or task across sessions. A system can use both together. For products that store personal data, the storage model matters; MemX, for instance, is private by architecture with per-user isolation and encryption at rest.
