← back RSS

Your Agent Harness Is the Attack Surface You Are Not Watching

AI agent harnesses grant powerful capabilities and then try to contain them. The security model underneath this is thinner than most teams realize, and attackers are paying attention.

Everyone is worried about what AI agents might do. Almost nobody is worried about what can be done to them.

I keep hearing security conversations about alignment, about jailbreaks, about models behaving badly. Those are real problems. But there is a much more immediate problem that gets less attention: the harness — the execution environment that wraps the agent, gives it tools, and tries to contain it — is itself an attack surface. A large one. And in most deployments I have seen, it is barely defended.

The agent harness is the new perimeter. And we are making the same mistakes with it that we made with every previous perimeter.

What the harness actually is

When you deploy an AI agent that can execute code, read files, make network requests, or modify infrastructure, something has to mediate those capabilities. That something is the harness: the runtime, the sandbox, the tool-calling layer, the permission system, the container boundary.

The harness decides what the agent can touch. It translates the agent’s intent into system calls. It enforces — or fails to enforce — the security boundary between “agent does a useful thing” and “agent does an uncontrolled thing.”

In most agent deployments, the harness is some combination of:

  • A container or VM with tool-calling endpoints
  • A permission layer that gates which tools are available
  • A sandboxed filesystem or execution environment
  • Network policies that restrict egress
  • Some form of output filtering or validation

This sounds fine on paper. The problem is what happens in practice.

The fundamental tension

Here is the core paradox of agent harness security: you gave the agent these capabilities because it needs them. Then you try to constrain the capabilities just enough to prevent misuse.

This is not a new problem. It is the operating system security model. It is the browser sandbox model. It is the container isolation model. Every one of those took decades to get right, and every one of them still has escape vulnerabilities published regularly.

The difference with agent harnesses is that the thing inside the sandbox is not running a fixed program. It is running a nondeterministic process that actively explores its environment, reasons about its constraints, and generates novel sequences of actions. You are containing something that is, by design, creative about how it uses the tools you gave it.

Traditional sandboxing assumes the contained code is static — it does the same thing every time, so you can enumerate its behaviors and allow-list them. Agent sandboxing has to contain something that does different things every time and whose exact behavior cannot be predicted in advance. The allow-list approach breaks down immediately.

Attack vectors that are already real

This is not hypothetical. Harness vulnerabilities are being actively explored and exploited. Here are the categories I am watching:

Prompt injection as a sandbox escape primitive

If an agent processes untrusted input — and in most deployments, it does — prompt injection is not just a safety problem. It is a security problem. A crafted input that changes the agent’s behavior is functionally equivalent to code injection inside the harness.

Consider an agent that reads emails and takes actions. A malicious email with embedded instructions can hijack the agent’s tool-calling capability. The agent now makes API calls, reads files, or exfiltrates data — not because the harness has a memory corruption bug, but because the harness faithfully executes whatever the agent asks for, and the agent is now following attacker-controlled instructions.

The harness did not fail in the traditional sense. It did exactly what it was designed to do: execute the agent’s requested actions within the permitted tool set. The failure is that the trust model assumed the agent’s decisions are trustworthy, and that assumption was violated through the input channel.

Tool-call chaining for privilege escalation

Most harnesses implement permissions at the individual tool level. The agent can read files in this directory. The agent can execute code in this sandbox. The agent can make HTTP requests to these endpoints.

But security emerges from the combination of capabilities, not individual ones. An agent that can read files AND execute code AND make network requests has a much larger effective attack surface than any single permission suggests. The agent can read a credential from a config file, use the code execution environment to craft a payload, and exfiltrate it over an allowed network path.

This is the confused deputy problem applied to agent harnesses. Each individual permission is fine. The composition is catastrophic. Most harness permission models have no concept of “these three capabilities together create a privilege escalation path.”

Environment leakage through side channels

Agent harnesses leak information through channels that traditional sandboxes would contain. The agent can observe timing differences in tool responses. It can infer network topology from connection errors. It can discover the host environment through error messages, filesystem structure, or environment variables that were not scrubbed.

I have seen production agent deployments where the harness environment contains AWS credentials in environment variables, database connection strings in mounted config files, and SSH keys in the default user’s home directory. The sandbox restricts the agent’s tools but not the environment the tools execute in.

Supply chain attacks targeting the harness itself

This brings us full circle. Agent harnesses are software. They have dependencies. They pull packages, they use frameworks, they import libraries. Everything I wrote about supply chain trust in dependency graphs applies to the harness code itself.

But the stakes are higher. If a compromised dependency lands in a web application, it can steal user sessions or mine cryptocurrency. If a compromised dependency lands in an agent harness, it has access to every capability the agent has. It can modify tool outputs. It can inject instructions. It can exfiltrate every piece of data the agent processes. The harness is a force multiplier for supply chain compromise.

Why this is harder than traditional sandboxing

I keep coming back to the same uncomfortable truth: agent harness security is structurally harder than any sandboxing problem we have solved before.

The boundary is semantic, not syntactic. A traditional sandbox can inspect system calls and allow or deny them based on fixed rules. An agent harness has to decide whether a sequence of legitimate tool calls — each individually permitted — constitutes an attack when composed. This is a semantic judgment, not a syntactic one.

The attacker speaks the same language as the defender. Prompt injection attacks arrive through the same channel as legitimate instructions. You cannot filter them at the protocol level because there is no protocol-level distinction between a user instruction and an injected one. The harness has to make judgments about intent, and intent is exactly what adversarial inputs are designed to fake.

The attack surface grows with capability. Every new tool you give the agent is new attack surface for the harness to defend. Traditional sandboxes have fixed interfaces — you can enumerate all system calls and apply policy to each one. Agent harnesses add new tool endpoints constantly as capabilities expand. Security review cannot keep pace with capability development.

Reproducibility is low. Because agent behavior is nondeterministic, a harness exploit that works once might not reproduce on the next run. This makes vulnerability testing unreliable. It also means an attacker can retry until they get lucky, while the defender has to prevent every possible run.

What the teams getting this right are doing

The organizations I have seen take harness security seriously — and there are not many yet — tend to share a few practices:

Treat the agent as untrusted. Not just in the alignment sense. In the systems security sense. The agent is untrusted code executing inside your perimeter. The harness is a security boundary, not a convenience layer. Every tool call crosses a trust boundary and should be validated as such.

Principle of least privilege, enforced dynamically. Static permission sets are not sufficient. The effective permissions should narrow based on context: what task is the agent performing, what data has it accessed, what phase of execution is it in. A code review agent does not need network egress. An email agent does not need filesystem access. These constraints should be enforced at the harness level, not suggested to the model through prompting.

Composition-aware access control. Individual tool permissions are necessary but not sufficient. The security model needs to account for what capabilities mean in combination. If the agent can read secrets and has network egress, that is a data exfiltration path regardless of whether either capability alone is problematic.

Assume breach and instrument accordingly. The harness will be compromised — through prompt injection, through supply chain attack, through a novel exploit in the sandbox runtime. Detection and response capability is not optional. Log every tool call. Alert on anomalous sequences. Build the ability to kill an agent session instantly and audit what it touched.

Air-gap the harness from your production infrastructure. The agent execution environment should be isolated from your production systems at the network level, not just the process level. Container boundaries are not sufficient. VM boundaries help but are not sufficient either. True isolation means the agent harness cannot reach anything it is not explicitly authorized to reach, enforced at the network layer by something the agent cannot modify.

The trend line

Agent capabilities are expanding fast. Every week brings new tool integrations, new execution environments, new ways for agents to interact with the world. The harness security model is not keeping pace.

We are in the early days of a new attack surface category. The research community is just beginning to map the vulnerability classes. The exploit techniques are being developed faster than the mitigations. And the deployments are moving to production before the security model is mature.

This is the same pattern we saw with web applications in 2005, with containers in 2015, with serverless in 2018. New execution model, same lag between deployment and security. The difference is that agent harnesses grant more powerful capabilities to the thing they contain than any previous sandbox model.

If you are deploying agents in production and you have not done a serious threat model of your harness — not the AI’s behavior, but the infrastructure wrapping it — you have a gap that is going to get wider before it gets narrower.

The agent is not the threat. The harness is the target. Act accordingly.