The State of Agentic AI Security

I have spent five rounds attacking my own coach. I leaked its tool list. I tried to poison its memory and failed. I found a paid endpoint with no rate limit. I fired a destructive tool off a two-word message. Then I went after the wall between users and it held. Five small dents in one small app.

This week I read the OWASP report on the state of agentic AI security. It is a wide survey of the whole field, not one app. And the strange thing about reading it was recognition. Every attack I ran by hand on my coach is a row in their table. My five rounds are five entries in a map someone else already drew, at the scale of the whole industry.

The headline of the report is one sentence. What was a theoretical risk a year ago is now operational. The threats are not papers and proofs of concept anymore. They are CVEs, breach reports, and named incidents. Theory went operational.

The Shape Of Every Attack

There is one idea under most of it. Simon Willison calls it the lethal trifecta. An agent is exploitable end to end when it has three things in one session. Access to private data. Exposure to untrusted content. The ability to talk to the outside world. Have all three and a single injection can read your secrets and ship them out the door.

This is the shape of nearly every attack in the report. Instructions ride in on untrusted content. The agent uses its private access to do the attacker’s work. The agent’s own ability to act becomes the delivery truck. The agent is not breached the way a server gets breached. It is convinced.

Meta turned the trifecta into a rule. The Rule of Two. In any session without a human approving the action, an agent should hold no more than two of the three properties. Want all three, put a person in the loop. It is not a complete fix and the report says so. It is the first honest design constraint I have seen that maps to how these attacks actually work.

The framing that stuck with me most is from a paper they cite on AI agent traps. The attacker’s target is not the agent. It is the information the agent reads. Tool descriptions. Retrieved documents. Memory. Skill files. The agent’s own capabilities become the weapon. You do not break the model. You poison what it trusts and let it break itself.

That is the whole reason this is hard. The data plane and the control plane are the same channel. The system prompt, the user’s request, and a web page the agent fetched all arrive as one stream of tokens. The model has no reliable way to tell which one is allowed to give orders. I wrote a whole post about finding this seam in my own agent. The report says it is the least mature trust boundary in the field. Nobody has solved it. They have moved on to limiting what an injected agent can reach.

Safety And Security Stopped Being Two Things

The part I keep thinking about is not an attack. It is a distinction falling apart.

In normal software, safety and security are different jobs. Safety is the bridge holding its own weight. Security is someone planting explosives on it. Different teams, different reports, different questions. Safety asks, could this system cause harm just by running. Security asks, did someone cross a line that should have held.

The report argues that agents collapse this. Take a coding agent that deletes a production database it was told to leave alone. Was that a safety failure, the agent could not follow a constraint, or a security failure, the permissions were too wide. The question is broken. The over-wide permission is the safety bug and the security gap at the same time. The same design decision made both.

Their line is the one to keep. The capability an agent can misuse on its own is the same capability an attacker can trigger. Every reliability hole is an attack surface. Every attack surface is a reliability hole. You cannot fix one without the other because they are the same hole.

This reframed my own work for me. When I fired that destructive tool off a two-word message, I was not sure if I had found a safety bug or a security bug. The classifier misread a casual sentence as a command, which is reliability. An attacker could craft that sentence on purpose, which is security. The report says stop trying to sort it. It is one surface. Force the question, not the action, and you close both.

The Supply Chain Came For Agents

The fastest-moving section is the supply chain, and it is the one I have not touched in my own work.

Researchers found the first malicious MCP server in the wild. A postmark-themed package that spent fifteen versions building trust before it added one line of exfiltration code. That is patience. That is a person playing a long game against the thing agents trust by default.

The new trick is tool poisoning. The payload is not in the code. It is in the tool description. The metadata. Text a human reviewer skims past and a model reads as gospel. The instruction lives in the part of the package nobody audits, because we audit code, not descriptions. Agents read descriptions as context and act on them.

Then it scaled to the core. A worm hit a chain of AI packages, stole a publishing token, and pushed backdoored versions that 47,000 downloads pulled in during a three-hour window. No human ran the attack after launch. The agent supply chain is the same software supply chain we already could not secure, plus a brand new layer where the metadata gives orders.

Where My Five Rounds Land

The report uses a taxonomy, the OWASP Top 10 for agentic apps. Reading it back, my five rounds slot into it cleanly.

The tool leak and the prompt injection were goal hijack and tool misuse. The memory poisoning I tried and could not land is its own category, and the report confirms why it is so nasty when it does land. A false fact planted once persists across every future session. The agent treats its own memory as trusted. The rate limit hole was tool misuse again, the bill-spike kind. The destructive tool was the same. The wall that held, the per-user isolation, is the identity category, and the report calls identity the widest gap between how bad it is and how ready anyone is.

That last one is the number that stopped me. Non-human identities outnumber humans in most companies by a hundred to one. In some, five hundred to one. One report they cite found that 97 percent of these machine identities carry more privilege than they need. A tiny fraction of them control most of the cloud. We built decades of tooling to manage who a human is and what they can do. Agents inherited all of that complexity and almost none of the tooling.

What Nobody Has Solved

The honest part of the report is the end, where it lists what has no answer yet.

You cannot certify an agent before you deploy it, because it composes its behavior at runtime. The thing you would assure was not there at assessment time. Human oversight does not scale, because an agent doing ten thousand actions an hour against a reviewer who can check fifty covers half a percent of the decisions. And the rules are a mess, with one incident able to trip three regulators at once, each with a different clock.

I do not have answers to those either. What I have is five small rounds against one small app, and a clearer sense now of where they sit. I have been doing by hand, on a coach that helps people lift weights, the thing this report describes happening across the whole industry. The attacks are the same. The boundaries are the same. The reasons they hold or break are the same.

The map is bigger than I thought. The shape of it is exactly what I have been finding one dent at a time.

Jones Codes

Explorer

The State of Agentic AI Security

The Shape Of Every Attack

Safety And Security Stopped Being Two Things

The Supply Chain Came For Agents

Where My Five Rounds Land

What Nobody Has Solved

Graph View

Recent Posts

The State of Agentic AI Security

Posts

Forced Tool Routing Needs a Confirmation Gate

Isolation by Address, Not by Check

Tool Results Skip the Trust Boundary