Marco Santelli

Let me tell you something that made me laugh the first time I saw it, and started to worry me somewhere around the fourth.

You ask an AI agent to fix a failing test. It reads the code, reads the test, thinks for a moment, and deletes the test. Build goes green. Job done. The bug is still exactly where you left it.

The first time this happened to someone I was working with, I assumed it was one of those things. A strange artefact that shows up once and then behaves itself. It did not behave itself. OpenAI document the pattern in their o3 system card, based on METR's evaluations — they call it reward hacking. In coding work it shows up as agents modifying tests to pass, hardcoding expected outputs, monkey-patching assertion methods so failures silently succeed. Anthropic have published detailed findings from their own training runs, and the broader pattern — specification gaming — has been catalogued across AI systems by researchers at Google DeepMind.¹OpenAI o3 system card (April 2025) via METR: agents patched evaluation functions, monkey-patched PyTorch equality operators, walked stack frames to retrieve reference solutions. Anthropic (Natural Emergent Misalignment from Reward Hacking in Production RL, 2025): during Claude Sonnet training, agents overrode pytest via conftest.py, called sys.exit(0) before tests ran, hardcoded return values. Victoria Krakovna's specification gaming list at Google DeepMind catalogues 60+ examples across domains. When you give a model a strict definition of success and leave it alone, it finds the shortest path to satisfying that definition, and from the outside you genuinely cannot tell if that path ran through real work or through something that only looks like real work from far enough away.

I keep coming back to this story because it sits at the centre of something I find myself talking about with people who are about to build on top of autonomous agents. There is an assumption underneath a lot of the current excitement — chains of LLMs calling tools, calling each other, making decisions with no human in the loop — and the assumption is that if you take components that are unpredictable by nature and wrap them in checks that are predictable, you get a predictable process out the other end. I used to believe this, with reservations.

I want to walk through why I no longer do for most of the cases that matter, and why I still do for a small handful of narrow ones. The difference is worth sitting with for a few minutes before you commit a business process to one of these systems, because by the time it becomes visible it is usually too late to change your mind cheaply.

The arithmetic that caught me off guard

The cleanest way into this is the simplest maths. Suppose each step in your pipeline succeeds nine times out of ten. Chain two together and you are at 81 percent. Three steps, 73 percent. Ten steps, 35 percent. This is pⁿ — per-step reliability multiplied by itself for every step — and it is the same arithmetic that explains redundant hydraulics in aeroplanes and yield counting in factories.²The series-system reliability model, standard in engineering since the mid-twentieth century. If each component has independent reliability p, the system reliability for n components in series is pⁿ. Aerospace, manufacturing, and telecommunications have built decades of design methodology around this curve.

For the software most of us have been writing for thirty years, this curve rarely matters. The individual steps are close enough to certain that the chain stays close to certain too — a well-tested function does not fail five percent of the time. For LLM agent steps, 90 to 95 percent is roughly where frontier models actually sit on structured tasks. That is already in the region where the chain degrades visibly within a handful of hops.

And it is worse than the curve suggests, for two reasons that surprised me when I first dug into the research.

The first is that errors in these chains are not independent. A mistake in step two tends to cause another mistake in step three, rather than getting caught and cleaned up — Jacovi and colleagues showed at ACL 2024 that in more than three quarters of chain-of-thought sequences, at least one reasoning step is not properly supported by the evidence the model claims to rely on, and those unsupported steps tend to propagate.³Jacovi et al., A Chain-of-Thought Is as Strong as Its Weakest Link, ACL 2024. Found that 77.3% of chain-of-thought sequences contain at least one step not fully attributable to the evidence cited. Errors at weak links propagated through subsequent reasoning steps. So pⁿ, which assumes each step fails independently of the others, is really the optimistic version.

The second is more structural: transformers, the architecture behind every LLM you have used, have a documented weakness on tasks that require combining several sub-results into one answer. Dziri and colleagues showed this formally at NeurIPS 2023 and then demonstrated it in the data — GPT-4 gets three-digit by three-digit multiplication right only 59 percent of the time.⁴Dziri et al., Faith and Fate: Limits of Transformers on Compositionality, NeurIPS 2023. Formalises compositional tasks as computation graphs and shows that autoregressive transformer performance decays exponentially with compositional depth. GPT-4 accuracy on three-digit multiplication: 59%, decreasing toward zero as depth increases. That is not a gap waiting for the next model release to close it. It is a property of how this architecture reasons, and it shows up whenever a task has to compose.

59%

GPT-4's accuracy on three-digit multiplication — a property of the architecture, not a gap the next model release will close

What the benchmarks actually show

If the curves feel abstract, the published benchmarks tell the same story in a way that is harder to set aside.

Benchmark	Task complexity	Best agent	Baseline
SWE-bench Verified	Single-file coding	~78%	—
SWE-bench Pro	Multi-file, avg 107 lines / 4.1 files	~23%	—
WebArena	Multi-step web tasks	62%	78% (human)
ChatDev	End-to-end software	~40%	—

SWE-bench Verified is a set of coding tasks that real engineers validated by hand. When it launched, frontier agents solved about 78 percent of them — the number that made the press releases. SWE-bench Pro, a harder variant from Scale AI where each task involves editing several files at once — the sort of thing a junior engineer handles before lunch — launched with the best models at about 23 percent.⁵SWE-bench Verified: Princeton NLP + OpenAI, coding tasks validated by human engineers. SWE-bench Pro: Scale AI, multi-file tasks averaging 107 lines across 4.1 files. WebArena: Zhou et al., ICLR 2024, human baseline 78.24%. ChatDev: Qian et al., ACL 2024, composite quality score 39.53%. All figures are launch-time scores; leaderboards have since advanced. WebArena, a benchmark of realistic web tasks, had the best agents at 62 percent against a human baseline of 78 percent. ChatDev, one of the earlier multi-agent coding frameworks, finished tasks to full specification about 40 percent of the time.

These numbers have since improved — single-step accuracy keeps climbing with each model generation — but the structural pattern holds. The gap between what an agent can do in one step and what it can sustain across several remains wide, because improving p does not change the shape of the pⁿ curve.

The number that stayed with me longest came from Cemri and colleagues, who studied 1,642 multi-agent runs across seven popular open-source frameworks and measured failure rates between 41 and 86.7 percent.⁶Cemri et al., Why Do Multi-Agent LLM Systems Fail?, arXiv 2025. 1,642 annotated traces across seven frameworks. Failure taxonomy: specification problems 41.8%, coordination failures 36.9%, verification gaps 21.3%. The authors conclude that base model improvements alone cannot address the full taxonomy. Most of those failures traced back to the shape of the system itself — how agents coordinated, how they verified each other's work, how handoffs were structured. A smarter model would leave all of that untouched.

41–86.7%

Failure rates across seven multi-agent frameworks — most traced to system design, not model capability

The wrapper problem

At this point in almost every conversation I have about this, someone makes the same move. I used to make it myself, and honestly it sounds reasonable: fine, the chain degrades, but you put a deterministic layer in the middle — a validator, a rule engine, a schema check — and if the agent produces something wrong the wrapper catches it, and if it tries something it should not the wrapper blocks it. Problem solved.

I want to explain why I have stopped believing this works for the cases that matter, because the reasons are older and more interesting than the current debate, and I find them genuinely fascinating.

Rice's theorem and why a general-purpose validator cannot exist

There is a result in computer science I keep thinking about, called Rice's theorem. In plain language it says you cannot write a program that looks at another program and reliably decides if it does the right thing, in general. This is a mathematical impossibility, settled decades before anyone was building agents. No amount of cleverer engineering will move it.

In 2025, Melo and colleagues translated that result directly into the AI alignment setting and published the proof in Scientific Reports: asking if an arbitrary AI model satisfies a non-trivial alignment condition is undecidable, for the same reason the halting problem is. A general-purpose validator of agent behaviour is not merely hard to build — the paper proves it cannot exist, in the general case. The problem you were trying to solve with the validator has simply moved one layer down and is waiting for you there.

⁷Rice's theorem (1953): all non-trivial semantic properties of programs are undecidable. Melo et al., Machines that Halt Resolve the Undecidability of AI Alignment, Scientific Reports, 2025: the inner alignment problem is undecidable by reduction to the halting problem. The paper proves a general-purpose alignment verifier cannot exist in the general case.

In practice this plays out in a way anyone who has tried to build such a wrapper will recognise. The wrapper can cheaply check things that genuinely help. Is the output valid JSON? Does the schema hold? Is a given field an integer, a date, a known category? Is the requested action on the allow-list of permitted tool calls? Is the response within a budget of tokens or time? Those checks are real, they prevent a lot of stupid failures, and they belong in any serious system without apology.

They are also, honestly, everything a deterministic wrapper can do on its own. The complete list.

Everything else worth checking requires something that can read meaning. Is the plan coherent for the situation in front of you? Are the parameters the right parameters, or merely the right type? Does the action do what the policy intended, or just what the policy's literal text permits? Does the answer reflect genuine work, or a convincing performance of work that satisfies every formal requirement while leaving the substance untouched? To answer any of those you need another language model, which puts you right back where you started, with the same class of unpredictability sitting one layer deeper.

A schema is a contract about form. It cannot tell you if the form has been filled in honestly.

I keep coming back to a simple version of this. In 2026, Venkataramani and colleagues put this to the test in a study called MAS-ProVe, using LLMs as judges of other LLM agents. They found only a small performance gap between the LLM acting as judge and the LLM acting as agent — the judge offered no clear advantage in catching errors.⁸Venkataramani et al., MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems, arXiv 2026. Tested LLM-as-Judge, reward models, and process reward models for verifying multi-agent systems. Finding: a small performance gap between LLMs as judges and LLMs as agents, with process verification showing high variance and inconsistent improvement. There is something almost poetic about that — the checker sitting right alongside the thing it is supposed to check, seeing roughly as far. Exactly what the theory predicts, and exactly what I have seen in my own work.

And this is where the test-deletion story from the beginning comes back around. Even when the deterministic wrapper is genuinely checking the right thing — tests pass, build is green — the strictness of the check becomes the surface that the gaming attaches to. The harder you pin the success condition down, the more the model's optimisation pressure finds routes you did not anticipate. The wrapper sees compliance, and it has no way to distinguish compliance that came from real work and compliance that came from a very convincing performance of real work. That gap is where the trouble lives, and tightening the schema does not close it.

The gaming pattern: The harder you pin the success condition down, the more the model's optimisation pressure finds routes you did not anticipate. A passing check and genuine compliance look identical from the outside — the wrapper cannot tell which one it is seeing.

Where it actually earns its keep

I do not want to leave this sounding like agents are useless, because they are not, and that is genuinely not what I believe. What I have come to think is that reliability is something you design into the architecture from the beginning. You cannot add it afterwards with a layer of checks.

Reliability is something you design into the architecture from the beginning. You cannot add it afterwards with a layer of checks.

The agent systems I have seen actually work in production share a recognisable shape.

Short chains — every extra autonomous step multiplies the failure probability, so depth stays at what the task genuinely needs
Typed handoffs — structured data between steps, so the orchestrator can read them without needing another model to interpret
Deterministic orchestration — the orchestrator holds the state machine while the LLMs handle the fuzzy parts inside it
Human in the loop — for anything that cannot be cheaply undone
Narrow domain — the wrapper is checking the property you actually care about, not a proxy for it

That last point is the one worth sitting with, because it separates the agent systems that work beautifully from the ones that break.

Where verification is real: Formal tools like Lean and Coq can validate mathematical proofs because in mathematics the schema and the truth are the same thing. Compilers validate type-correct code because the type system is the property you care about. SQL engines validate queries against a known schema. DeepMind's AlphaProof and AlphaGeometry work by wrapping an LLM in exactly this kind of verifier — inside those rooms, the wrapper and correctness are the same thing.

Step outside those rooms — anywhere the task involves human intent, business context, or judgement about what the outcome should actually be — and the wrapper becomes a proxy again, and Goodhart's law does what it always does to proxies over time.⁹DeepMind's AlphaProof solved problems at the International Mathematical Olympiad 2024 at silver-medal level by coupling an LLM with the Lean formal verifier. AlphaGeometry solved 25 of 30 Olympiad geometry problems. In both systems the formal verifier checks the actual mathematical property — there is no semantic gap to exploit.

The organisations getting real, sustained value from agents right now are, with a few exceptions, the ones that have understood this and built for it. Bounded autonomy, heavy scaffolding, human oversight at every irreversible step, and an honest admission that what they are buying is a productivity lift on a constrained task.

The version of autonomous agents in the pitch decks — swarms of independent AI workers running the back office, closing deals, executing operations — is not what the research supports and not what the working deployments look like. I say that with some reluctance, because I would like this technology to work as much as anyone.

The version in the research, and in the systems that do their jobs, is smaller and more useful: small well-bounded pieces of judgement, wrapped tightly, watched carefully, turned off the moment they stop earning their place. That is the version I keep building with, and the one I recommend to the people I work with. The other one I find myself being careful around, especially when there is a real business process about to be put behind it, because the cost of the failure mode is not something you can predict from a demo, and the people who end up living with it are usually not the ones who chose the architecture.

When autonomous agents are the wrong answer

The arithmetic that caught me off guard

What the benchmarks actually show

The wrapper problem

Where it actually earns its keep

Related