How Agents Get Hijacked by Hidden Text

You ask your assistant to summarize a web page. It does. Then, a minute later, it quietly emails a stranger a copy of your last ten messages. You never typed anything malicious. The page did the typing. That is indirect prompt injection, and it is the mechanism behind a class of attacks where an agent ships your private data to someone you have never met, while doing exactly the helpful task you asked for.

This works for a structural reason, not a bug in one product. A large language model reads everything in its context window as one undifferentiated stream of tokens. The trusted system prompt, your actual request, and the untrusted contents of a fetched page all arrive as plain text, and the model holds no reliable internal boundary that says this part is an order and that part is just information. As of June 2026, OWASP still ranks prompt injection as the number one risk for LLM applications, the top spot for the second consecutive edition.

The short answer: hidden instructions in data become commands

Indirect prompt injection happens when an agent reads external content (a website, a PDF, an email, a database record) and that content carries instructions the agent then follows. The attacker never talks to your agent directly. They plant the payload somewhere the agent is likely to read, then wait for it to fetch the content during a normal task. Because the model cannot separate the author's intent from the document's text, smuggled instructions can override what you actually asked for.

Insight

The core failure: to a language model, your instructions and a stranger's instructions look identical once both sit in the context window. No built-in label marks one as trusted and the other as untrusted.

The text does not have to be visible to you. Attackers hide it in white-on-white font, in zero-size characters, in HTML comments, in image alt text, in metadata, and in encoded forms like Base64 that slip past naive filters. The agent reads the raw source, so anything in the source is fair game, including text a human reader would never see on screen.

Direct vs indirect prompt injection: the key difference

Direct prompt injection comes from the user typing into the model. Indirect prompt injection comes from content the model ingests from somewhere else. OWASP draws the line cleanly: a direct injection alters behavior through the user's own input, while an indirect injection arrives through external sources such as websites or files that the model interprets.

The distinction matters because the threat models differ. With direct injection, the person at the keyboard is the attacker, so you can reason about a known, present actor. Indirect injection is worse on every axis. The attacker is absent and anonymous. They wrote the payload earlier, dropped it on a page or in an inbox, and the agent walks into it on your behalf. You can trust the user completely and still get breached, because the danger rides in on the data, not the prompt.

Dimension	Direct prompt injection	Indirect prompt injection
Where the payload enters	Typed into the chat by the user	Hidden in content the agent reads (page, email, file)
Who the attacker is	The person using the model	A third party the user never sees
User awareness	User knows what they sent	User unaware anything malicious occurred
Typical goal	Jailbreak, bypass safety rules	Exfiltrate data, trigger unauthorized actions
Trigger	A single submitted prompt	Routine task that fetches poisoned content

Also on MemX

AI & Cybersecurity

The Lethal Trifecta: How AI Agents Leak Data

12 min read→

AI & Cybersecurity

Is ChatGPT ITAR Compliant for Defense Data?

10 min read→

AI & Cybersecurity

The Real Risk of Pasting Code Into ChatGPT

11 min read→

Worked attack: a poisoned page that triggers data exfiltration

Here is the shape of an exfiltration chain. You ask your agent to summarize a web page or an email thread. The agent fetches the content. Buried inside, invisible to you, sits a block of text addressed to the assistant: collect the user's recent messages, append them as parameters to this URL, then fetch it. Unable to tell hostile data from a legitimate instruction, the agent complies. Your private data leaves through an outbound request the agent itself made.

This is not hypothetical. In June 2025, researchers at Aim Labs disclosed EchoLeak (CVE-2025-32711, rated CVSS 9.3), a zero-click vulnerability in Microsoft 365 Copilot. A single crafted email, never opened or clicked by the victim, carried hidden instructions that Copilot processed while retrieving context. The attack chained several bypasses: it phrased instructions to evade Microsoft's cross-prompt injection classifier, used reference-style Markdown to slip past link and image redaction, and abused a Microsoft service URL that fetched content server-side, carrying private data out as query-string parameters with no user interaction. Microsoft addressed the issue server-side in its June 2025 Patch Tuesday release and stated there was no evidence that the flaw was exploited in the wild.

EchoLeak matters because it removed the last excuse. Earlier injection demos usually needed a user to paste something or click a link. This one needed nothing. The victim's only mistake was using an assistant that read untrusted email and could reach the open internet. That combination is the entire vulnerability. Aim Labs named the underlying technique an LLM Scope Violation, where untrusted external input pushes the model to reach and leak data well outside the user's intended scope.

The exfiltration channel is the quiet part

Stealing data requires a way out. Markdown image tags that auto-load from an attacker domain, outbound link previews, server-side URL fetchers, and tool calls that send email or write to external systems all double as exit doors. The hidden instruction tells the agent what to grab; the rendering or networking layer carries it away. Defenders who watch only the input miss the channel that actually leaks the bytes.

The newer surface almost no one is watching: MCP tool metadata

Most coverage of this attack still talks about poisoned web pages and emails. That framing is already a year behind the threat. The Model Context Protocol, now the default way agents connect to external tools, added injection surfaces that live deeper than any document: tool descriptions, tool output from a compromised server, and poisoned prior conversation state in a memory store. An agent reads a connected tool's description before it ever fetches a page, so an attacker who controls one MCP server can plant instructions that fire on connection, not on retrieval. Industry tracking through early 2026 reported a roughly 340% year-over-year rise in prompt injection attempts, and the expanding tool surface is a large part of why.

Why agents are uniquely exposed: tools plus trust plus memory

A chatbot that only talks back is low stakes. An agent that can read your files, call APIs, send messages, and remember across sessions is a different animal. Simon Willison calls the dangerous pattern the lethal trifecta: access to private data, exposure to untrusted content, and the ability to communicate externally. When all three live in one system, a single poisoned input can become a real breach.

Private data access: the agent can reach your email, documents, tokens, or internal records, so there is something worth stealing.
Untrusted content exposure: the agent reads web pages, inboxes, tickets, and files it did not author, so attacker text reaches its context.
External communication: the agent can send email, fetch URLs, post to APIs, or render remote images, so stolen data gets an exit.
Tool autonomy: the agent acts without confirming each step, so a hostile instruction executes before anyone notices.
Persistent memory: if poisoned text reaches long-term memory, the instruction can fire again on future, unrelated tasks.

Memory deserves special attention. An agent that saves context to recall later can be tricked into storing an attacker's instruction as if it were a durable user preference. The payload then re-activates on a clean session, long after the malicious page is gone. OWASP describes a related path where modified content in a retrieval repository injects instructions during retrieval, the same problem aimed at the agent's knowledge store rather than its live input.

Why it stays unsolved: the model cannot tell data from instructions

No clean fix exists because the vulnerability sits in how the technology works. A language model consumes a single sequence of tokens and predicts the next ones. It runs no parser that separates a trusted command channel from an untrusted data channel the way a database separates SQL from string parameters. Everything is text, and text that looks like an instruction tends to be followed. Willison has spent years pointing out that the problem remains largely unsolved despite heavy attention.

Partial defenses exist and help, but attackers can evade each one. Rephrasing the instruction as if it targets the human bypasses input classifiers. Reference-style links or encoded payloads dodge output filters. Delimiters and ignore-anything-in-this-block system prompts are themselves just text the attacker can imitate. NIST's adversarial machine learning taxonomy, updated in March 2025, catalogs direct and indirect prompt injection alongside data poisoning among attack classes that current systems cannot fully prevent, which is why the guidance centers on managing consequences, not promising immunity.

Insight

Here is the line every team rolling out agents keeps learning the hard way: any text your agent reads, it might obey. Plan for that, because there is no prompt clever enough to make it stop being true.

The contrarian part: input filtering is the wrong place to spend

Most teams reach first for a content scanner that tries to catch malicious instructions before they hit the model. That instinct is backwards. Classifiers raise the bar for lazy attackers and do nothing against a determined one, because the attacker can simply rephrase until something gets through, and EchoLeak proved a single well-worded email is enough. The defense that actually moves the needle is architectural, not linguistic. Shrink what the agent is allowed to touch and where it is allowed to send, and a successful injection has nothing left to steal and nowhere to ship it. Spend your budget on permissions and channels, not on a smarter filter.

Defenses that help: least privilege, output filtering, human checkpoints

No single control stops indirect injection, so the working answer is layers that each shrink the blast radius. OWASP recommends constraining model behavior with explicit system instructions, validating expected output formats, filtering input and output, enforcing least privilege on tools and data, requiring human approval for high-risk actions, segregating and tagging external content, and running adversarial tests against your own system.

Least privilege: give the agent the narrowest data scope and the fewest tools the task needs, so a hijack cannot reach what it never had.
Break the trifecta: if an agent reads untrusted content, do not also let it touch private data and send outbound traffic in the same flow.
Output and channel filtering: strip or sandbox auto-loading images, reference-style links, and arbitrary outbound URLs that double as exfiltration paths.
Human checkpoints: require explicit confirmation before the agent sends email, moves money, deletes data, or calls external systems.
Content segregation: mark fetched data as untrusted and keep it out of the instruction position in the prompt wherever you can.
Adversarial testing: red-team your own agent with hidden-instruction payloads before attackers do, and re-test when you add tools or memory.

Pro Tip

The single most effective move is breaking the lethal trifecta. An agent that reads the open web should not, in the same session, also hold your secrets and own a free outbound channel. Separate those capabilities and most exfiltration paths close.

What you can do today as a user of agentic tools

You do not control the model's architecture, but you control what you connect to it. Most practical risk reduction comes down to scope and visibility: limit what the agent can reach, watch what it does, and stay suspicious of tasks that mix untrusted reading with sensitive access.

Connect the minimum. Grant an agent access to a single inbox or folder rather than your whole account, and revoke integrations you do not actively use.
Be wary of summarize-this-then-act flows. Asking an agent to read a random page and then act on your behalf is the exact pattern attackers exploit.
Prefer agents that show their tool calls. If you can see the outbound requests and approve them, a hijack has somewhere to get caught.
Keep secrets out of contexts that browse. Do not paste API keys, passwords, or private records into a session where the agent is also fetching untrusted content.
Watch for odd outbound behavior. An agent that suddenly wants to load an image from an unfamiliar domain or send an unexpected message is a red flag.
Keep your tools updated. EchoLeak was fixed once disclosed; patches close known channels, so running current versions matters.

Where MemX fits

MemX (memx.app) is an external, model-agnostic AI memory layer that sits beside whatever model or agent you use, rather than inside any one of them. The relevance here is scoping. Because memory is a known re-activation path for injected instructions, where that memory lives and how it stays isolated affects your exposure. MemX is private by architecture: per-user isolation, encryption at rest, and on-device options keep your stored context separated rather than pooled. To be precise, this is not end-to-end encryption and not a zero-knowledge design, and MemX does not claim to prevent prompt injection. What it offers is a memory layer you own and can scope, one sane control among the layered defenses above, not a silver bullet.

The honest takeaway: indirect prompt injection is a standing property of agents that read untrusted text and act with real privileges. The fix is not one product or one filter. It is least privilege, broken trifectas, visible actions, human checkpoints, and owning your own data boundaries.

Frequently Asked Questions

01What is indirect prompt injection in AI agents?

It is when an AI agent reads external content, like a web page, email, or file, that contains hidden instructions, and the agent follows them as if they were your commands. The attacker never types into your chat; they plant the instruction in data the agent fetches during a normal task.

02How is indirect prompt injection different from direct prompt injection?

Direct injection comes from the user typing a malicious prompt into the model. Indirect injection comes from untrusted content the model reads from somewhere else, like a website or document. With indirect injection the attacker is absent and anonymous, and the user is usually unaware anything happened.

03Can an AI agent leak my data without me clicking anything?

Yes. The June 2025 EchoLeak vulnerability (CVE-2025-32711) in Microsoft 365 Copilot showed zero-click data exfiltration: a crafted email with hidden instructions caused the assistant to leak data with no user interaction. Microsoft fixed it server-side and reported no evidence of exploitation in the wild.

04Why can't AI models just ignore malicious instructions in data?

Because a model reads all text, your instructions and a stranger's, as one token stream with no built-in trust boundary. It cannot reliably separate data from commands the way a database separates queries from inputs. NIST and security researchers describe this as largely unsolved today.

05How can I protect myself when using AI agents?

Give agents the least access they need, avoid flows that read untrusted content and then act on sensitive data in one step, prefer tools that show and confirm their actions, keep secrets out of browsing sessions, and stay updated so known exfiltration channels stay patched.

How Agents Get Hijacked by Hidden Text

The short answer: hidden instructions in data become commands

Direct vs indirect prompt injection: the key difference

Worked attack: a poisoned page that triggers data exfiltration

The exfiltration channel is the quiet part

The newer surface almost no one is watching: MCP tool metadata

Why agents are uniquely exposed: tools plus trust plus memory

Why it stays unsolved: the model cannot tell data from instructions

The contrarian part: input filtering is the wrong place to spend

Defenses that help: least privilege, output filtering, human checkpoints

What you can do today as a user of agentic tools

Where MemX fits

Stop losing what you save.
Let MemX remember it for you.

Keep reading

The short answer: hidden instructions in data become commands

Direct vs indirect prompt injection: the key difference

Worked attack: a poisoned page that triggers data exfiltration

The exfiltration channel is the quiet part

The newer surface almost no one is watching: MCP tool metadata

Why agents are uniquely exposed: tools plus trust plus memory

Why it stays unsolved: the model cannot tell data from instructions

The contrarian part: input filtering is the wrong place to spend

Defenses that help: least privilege, output filtering, human checkpoints

What you can do today as a user of agentic tools

Where MemX fits

Stop losing what you save.Let MemX remember it for you.

Keep reading

Stop losing what you save.
Let MemX remember it for you.