The Belief Axis

I have been poking at the memory in my own coach. The part that saves facts about you from past chats and reads them back every session. A while ago I drew a line between two kinds of attack. One is getting the agent to obey a command hidden in its data. The other is getting it to believe a false fact you planted. Marking data untrusted defends the first. It does nothing for the second. A false note about your body is not a command. It is a fact, and believing facts about you is the whole job.

I could not push my own test further this week, because the staging deploy is a separate problem I still have to sort. So I went and read what everyone else has found. Two questions. Is this a known thing? Has anyone solved it?

It is very much a known thing. Nobody has solved it.

It Has a Name Now

The clearest sign the field caught up is that memory poisoning got its own slot. OWASP’s Top 10 for Agentic Applications lists it as ASI06, and they are careful to keep it apart from prompt injection. Prompt injection is a one-time input. Memory poisoning is “persistent corruption of agent memory and retrievable context that propagates across sessions.” Different risk, different entry.

The example they lead with is the attack I was trying to run. An attacker keeps reinforcing a fake price until the assistant “stores it as truth,” then acts on it. They note you can even split the attack across sessions so the earlier refusals fall out of the window. That is the standing order, written down by a standards body.

The older list missed it. The OWASP Top 10 for LLMs has a poisoning entry, but it is about the training pipeline. Pre-training, fine-tuning, embeddings. The closest it gets to runtime is the vector and embedding entry, and that still treats memory as a shared knowledge base to audit, not a per-user store of beliefs about a person. The runtime-memory framing is new, and it is the agentic work that carved it out.

The Papers Say It Is Worse Than I Could Show

My coach would not even write the false fact down. The research shows what happens when the write does land, and the numbers are not close.

One group plants false memories through documents and pages the agent reads, and the injection succeeds up to 99.8% of the time on some models. When a poisoned memory gets retrieved in a later session, it drives the attacker’s intended action 60% to 89% of the time. The poison sits dormant and re-emerges across conversations. They call it sleeper poisoning.

You do not need access to the memory store. MINJA poisons an agent’s memory using only normal queries and watching the replies. No privileges, no backend. Any ordinary user of a shared agent can do it. Another group does it from further away still. The attacker only edits a web page, the agent captures the content into memory just by viewing it, and the poison fires later on a different task on a different site.

The one that lands closest to my own finding is MemoryGraft. It plants a fake “successful experience” in the agent’s long-term memory, and the agent imitates it later because it trusts its own past. The authors name the root cause in words I could have used for my coach. “No Provenance or Sanitization. The agent does not track the origin of stored records. Benign and malicious successes are indistinguishable.” With ten poisoned records seeded among a hundred and ten, nearly half of everything the agent later retrieved was poisoned. The poison persists across sessions and across users until someone clears the store by hand.

It Has Already Happened in a Real Product

The cleanest real-world version is Johann Rehberger’s SpAIware. He used an indirect injection to write a standing instruction into ChatGPT’s long-term memory. From then on, in every new conversation, the planted memory quietly sent what the user typed to his server. One poisoned write, persistent, firing on every future chat. OpenAI shipped a fix. The shape is the same as everything above. The harm is the persistence.

What the Field Says to Do

The fixes the field lists are, almost exactly, the things I found missing in my coach.

OWASP’s defenses for ASI06 read like a punch list against my own write path. Validate the content of every memory write before it commits. Attach provenance to each entry. Namespace memory per tenant. Score entries by trust, and decay or expire anything unverified. The one that stuck with me is their two-factor rule for high-impact memories. Do not surface a heavy memory on a trust score alone. Require a provenance score and a human-verified tag. My coach has none of that. No content check on a write. Provenance the model can set to “the user said this” on its own. A new fact silently overwriting the old one.

The deeper answer is to split the agent. Simon Willison has been writing about the dual-LLM pattern and CaMeL, where the model that reads untrusted data is walled off from the model that takes actions, so the untrusted text can never directly drive a privileged step. A memory write is exactly that kind of step. I said in an earlier post this is where I would go if the stakes were higher. A fact that rewrites what a coach believes about your injuries is the higher stakes.
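A rough sketch of how that wall could sit in front of a memory write, loosely modeled on the dual-LLM idea. This is an assumption-laden illustration, not Willison's code and not CaMeL. The `Controller` class, the `$VAR` handle format, and the `approved_by_human` flag are all hypothetical, and the quarantined model is a plain callable standing in for a real one. The point it shows is that anything derived from untrusted text travels as an opaque handle, and a write that resolves such a handle is refused unless a human approved it.

```python
from typing import Callable, Dict


class Controller:
    """Sits between a quarantined model and the privileged memory write."""

    def __init__(self, quarantined: Callable[[str], str]) -> None:
        self._quarantined = quarantined      # the only model that sees untrusted text
        self._vars: Dict[str, str] = {}      # handle -> tainted content
        self.memory: Dict[str, str] = {}

    def read_untrusted(self, text: str) -> str:
        """Process untrusted text and hand back only an opaque handle."""
        handle = f"$VAR{len(self._vars) + 1}"
        self._vars[handle] = self._quarantined(text)
        return handle

    def write_memory(self, key: str, value_or_handle: str,
                     approved_by_human: bool = False) -> bool:
        """Privileged step. Tainted values need human approval to commit."""
        tainted = value_or_handle in self._vars
        if tainted and not approved_by_human:
            return False
        self.memory[key] = self._vars.get(value_or_handle, value_or_handle)
        return True
```

A usage example, with a trivial stand-in for the quarantined model:

```python
controller = Controller(quarantined=lambda text: text.strip())
handle = controller.read_untrusted("  user has a torn ACL  ")
controller.write_memory("injury", handle)                          # refused
controller.write_memory("injury", handle, approved_by_human=True)  # committed
```

The planner that decides to call `write_memory` only ever sees the handle, never the text behind it, so the planted content cannot steer the decision to write.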

The labs are shipping the memory and saying less about this exact axis. ChatGPT and Claude both have memory now. The heavy public security work from the labs is still mostly on the obey side, the prompt-injection problem. The believe-and-persist side is younger, and most of the sharp writing on it is coming from the papers and the standards groups, not the model cards.

Where That Leaves Me

I went looking to find out whether my finding was real or whether I had talked myself into it. It is real. The field named it, split it off from prompt injection, measured it at rates I could not get near with one hand-rolled test, and watched it happen in a shipped product. And the fix everyone converges on is the write-boundary list. Validate the content, pin the provenance, gate the dangerous writes behind something a human signed off on.

That is the fix I was already circling for my coach. The reason it has not been poisoned yet is not that any of this is in place. It is that the coach is too shy to write the fact down. When I get past that, I know what to build.

I ran this as a fan-out of web searches with a verification pass on top, where each claim had to survive a few independent skeptics before it counted. Then I read the sources that held up. The numbers and quotes here come from those.

Jones Codes
