
The friction between probabilistic and deterministic

A language model is not a less reliable calculator. It is a different kind of machine entirely, and the places where the difference shows up have been measured with more precision than the current conversation tends to acknowledge.


There is a small thing I have started noticing whenever I watch someone who is new to AI models begin to depend on one for work that genuinely matters to them. They ask the same question twice. Not always within the same conversation, often days apart, sometimes in the same words and sometimes with the wording subtly altered.

The first time, in my experience, most people are not doing it deliberately; they have forgotten exactly what the model said the first time, or they are not sure, or they want to see whether the answer holds up if they pose the question slightly differently. Then they notice something. The second answer is rarely identical to the first. Sometimes the differences are cosmetic, an order of paragraphs changed or a word swapped for a synonym; sometimes the differences are substantive, a number that was twelve percent now reading fourteen, or a recommendation that arrived clearly the first time arriving hedged the second.

After a few rounds of this, most people make an adjustment that I think is roughly the right one, even if they never put it into words: they start to treat the model less like a piece of software and more like a colleague who is generally sharp but has had a long week, whose advice is usually good but not necessarily the same advice they would have given an hour earlier.


The calculator and the dice

The word calculator comes up a lot in conversations of this kind, usually as a contrast: the model is not a calculator, you have to check it, you cannot trust the numbers without verifying. The word is doing more work in those sentences than I think most people who use it have noticed. The implicit picture is that a calculator is a reliable kind of machine and a language model is a less reliable version of the same kind of machine, with bugs in it, with limitations that better engineering will eventually iron out.

I do not think that picture is right, and the rest of what I want to say in this piece is, in one way or another, about why I do not think it is right. The two machines are doing structurally different work underneath, and I want to spend a couple of paragraphs being precise about what the difference is, before getting to anything that follows from it, because most of what goes wrong when people try to build serious systems on top of language models traces back, sooner or later, to the difference between the two.

When you type 2 + 2 into a calculator, the calculator does not reason about what arithmetic is or what addition means. It runs through a small set of fixed instructions that some engineer wrote down decades ago, and those instructions take that specific input through a specific procedure that produces a specific output. The same instructions, the same input, the same output, every time, on any device, in any country in the world, regardless of who happens to be operating it.

Nothing in the procedure is random. There is no place inside the calculator where it could have done something else, no point at which a coin gets flipped or a die gets rolled, no alternate path the computation could have taken. The reliability we associate with calculators is a feature of the procedures they execute. Those procedures have been worked over carefully across the history of computing so that any input goes to one and only one output, and the machine is the thing that follows the procedure from start to finish without deviation. Whether a particular calculator is well-built or poorly-built only affects whether it will follow the procedure correctly; it has nothing to say about whether the procedure has the property of single-valued output, because that property is built into the procedure itself.

How an LLM works

A language model is a different kind of object. When you ask it "what is 2 + 2," what happens mechanically inside it, as far as I can tell from the published architectures, has very little in common with what happens inside a calculator. The characters of your question get converted into tokens, small chunks of text that the model has been trained to operate over, and the tokens are passed through a very large neural network whose parameters were learned by gradient descent over a great deal of training text.

The network produces, at the end of that pass, what people in the field call a distribution over next tokens, which is a slightly fancy way of saying that, instead of one answer, the network gives back probabilities for every possible next thing it might say. The token "4" might come out with probability 0.998, the token "four" with probability 0.001, and the remaining 0.001 of probability is spread thinly across a long tail of other things the model has decided are plausible enough to assign nonzero weight to.

How sampling works, step by step

The model samples from this distribution to choose which token to actually emit, then runs the whole network again with the chosen token now part of the running input, to figure out what to say after that, and then again, and then again, until the distribution itself begins to favour stopping.

You can set the temperature parameter to zero, which tells the model to always pick the highest-probability token rather than sampling randomly. The resulting behaviour looks deterministic from outside — same input, same output. But the mechanism underneath is still a probability distribution. Setting temperature to zero is a rule for how to read the distribution; the distribution still exists at every step, and a small change in the input would change which token came out highest, sometimes by a hairline margin.
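The mechanics of that sampling step are simple enough to sketch. What follows is an illustrative stand-in, not any model's actual decoding code; the function name and the plain-Python softmax are mine:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Choose the next token index from raw scores ("logits").

    temperature == 0 means greedy decoding: always take the top token.
    The behaviour looks deterministic from outside, but the distribution
    is still computed underneath; zero temperature is just a rule for
    reading it.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax with temperature (subtract the max for numerical stability).
    top = max(logits)
    weights = [math.exp((l - top) / temperature) for l in logits]
    # The weighted dice roll that happens at every step of generation.
    r = rng.random() * sum(weights)
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if r < cumulative:
            return i
    return len(weights) - 1
```

At temperature zero the call is repeatable; at any positive temperature, two runs on the same logits can return different tokens, which is the whole point.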

Every word in the response you eventually see was the outcome of one of these sampling steps, weighted by probabilities that the network produced from inside its layers, layers which are doing something I cannot fully explain mechanically, in part because the field as a whole cannot fully explain it mechanically yet. The rolls are usually heavily weighted toward one option, which is why the model usually says something sensible, and the mechanism nevertheless remains a sampling mechanism all the way through, with no point in the process where the model stops sampling and starts doing something more determinate.

The usefulness of this kind of machine is real, and I am not arguing that language models are bad or that they should not be built into serious work. I use them constantly. The thing I want to look at, for the rest of this piece, is the specific places where the friction between this kind of mechanism and the kind of machine it sometimes gets compared to actually shows up: the places where the probabilistic mechanism produces outputs that look deterministic enough to be wrapped, where the wrapping turns out to fix less than people sometimes assume, and where the structural difference between the two kinds of machine reasserts itself underneath the wrapper.

The two machines are doing structurally different work underneath, and most of what goes wrong when people try to build serious systems on top of language models traces back to the difference between the two.

Where the leak shows up first: numbers

In my experience working with these systems, the cleanest place to watch this friction between probabilistic mechanism and deterministic-looking output start to show is numbers. There is something about numbers that I think most of us bring an unconscious assumption to: if a machine can read a number, the assumption goes, it can reason about that number, because that is roughly how computers have always worked. A spreadsheet that can read 481 in one cell and 482 in another can also tell you what 481 plus 482 is, because both numbers are stored in the same way and the addition operation is defined over that representation.

Language models do not work this way, and the reason has nothing to do with how clever any particular model is. It sits at the input layer, before any reasoning has had a chance to happen.

What the tokeniser sees

Before any reasoning happens, the input string is converted into tokens. Tokens are the units the model can actually see, small chunks of text, sometimes whole words and sometimes fragments of words, and the model was trained on long sequences of them. For ordinary prose this works well enough, because words and word-fragments are the natural units of the work the model is trying to do. For numbers, it works oddly.[1]

[1] Different tokenisers handle numbers differently. In widely-used model families, a number like 480 may land as a single token while 481 splits into two tokens ('4' and '81'). Two integers adjacent on the number line become structurally unrelated objects at the level the model actually operates on.

[Figure: how the tokeniser sees numbers. Of the integers 478 through 482, 480 emerges from the tokeniser as one token while 481 splits into two ('4' and '81'): adjacent on the number line, structurally unrelated at the input layer.]

Different model families use different tokenisers, but the issue I want to point at shows up across most of them in some form. To pick the kind of example you find documented in tokeniser inspections of widely-used models: the number 480 might land in the model's vocabulary as a single token, while the number 481 might get split into two tokens, "4" and "81". Two integers that sit next to each other on the number line, that we think of as having almost identical structure, have at the level the model can actually see almost nothing in common with each other. What the model is looking at is one or two arbitrary chunks of text that happen, in the world we live in, to spell out the digits of a number, with no built-in reason for the model to treat those chunks as having any numerical structure.
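You can see the effect with a toy greedy tokeniser. The vocabulary below is hypothetical, chosen to reproduce the 480/481 example; real tokenisers are learned from data and differ between model families:

```python
def tokenise(text, vocab):
    """Greedy longest-match tokenisation over a fixed vocabulary,
    falling back to single characters when nothing longer matches."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest chunk starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical vocabulary: "480" made it in as one chunk, "481" did not.
vocab = {"480", "4", "81"}
tokenise("480", vocab)  # → ['480']      one token
tokenise("481", vocab)  # → ['4', '81']  two tokens
```

Nothing about the second output tells the model it is looking at the successor of the first.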

The consequences of this have been measured, and the measurements are striking in a way I did not fully anticipate before reading them. Singh and Strouse, in a paper at ICLR 2025, showed that you can take a frontier model that struggles with arithmetic and improve its accuracy by more than twenty percentage points just by reversing the way the digits are presented to it, right-to-left instead of left-to-right. The reason this works, when you sit with it, is that addition carries from right to left, and presenting the digits in that order aligns the model's token sequence with the structure of the operation it is being asked to perform.

Schwartz and colleagues at EMNLP 2024 showed something related and even more striking: if you prefix each number with its digit count, so that the model is told 42 has two digits and 1015 has four digits, written as "{2:42}" and "{4:1015}", the addition accuracy of a small model goes from 88 percent to essentially 100 percent, and subtraction from 74 percent to 97 percent.[2]

[2] Singh and Strouse, ICLR 2025; Schwartz et al., EMNLP 2024. Both papers hold the model fixed and change only the input format. The size of the accuracy changes — 20+ percentage points from digit reversal, 88% to ~100% from digit count prefixes — measures the gap between what the model can do and what the tokeniser lets it see.

In all of these experiments the model itself is held fixed, with the same parameters and the same training. What changes between conditions is the form in which the number reaches the model, and the size of the change in the model's behaviour, given changes only in input form, is what I want to draw attention to here, because it is enormous, and it tells you something about where the arithmetic weakness in language models actually lives, which is at the input layer, in the fact that the form numbers take when they reach the network does not encode the numerical structure that arithmetic depends on.
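Both interventions are pure input formatting, and both are a few lines to implement. The function names here are mine; the {count:number} style follows the one quoted above:

```python
def reverse_digits(n):
    """Present digits right-to-left, aligning the token sequence with
    the right-to-left carry structure of addition (the Singh and
    Strouse intervention)."""
    return str(n)[::-1]

def digit_count_prefix(n):
    """Prefix a number with its digit count, in the {count:number}
    style used in the Schwartz et al. experiments."""
    s = str(n)
    return "{" + str(len(s)) + ":" + s + "}"

reverse_digits(481)       # → '184'
digit_count_prefix(1015)  # → '{4:1015}'
```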

88 → 100%
Addition accuracy when digit count prefixes are added to the input — same model, same weights, different formatting

Tabular data

The same story plays out at larger scale in tabular data. If you work in finance, operations, clinical research, or anywhere that stores reality in spreadsheets, the research on this is by now unambiguous: language models do not perform well on this kind of data when you compare them to the older approaches that have been around for decades. Grinsztajn, Oyallon and Varoquaux, in a paper at NeurIPS 2022, compared neural networks and tree-based models across forty-five datasets, and the trees won, mostly, for three structural reasons that I think are worth keeping in mind because they generalise.[3]

[3] Grinsztajn, Oyallon and Varoquaux, NeurIPS 2022. Across 45 datasets, trees won for three structural reasons: natural feature selection (ignoring noise), axis-aligned splits (preserving feature-level structure), and handling of irregular decision boundaries. When the authors removed uninformative features and smoothed the target function, the gap shrank — confirming the explanation.

Trees naturally ignore features that turn out not to matter, while neural networks try to use all of them and get confused by noise. Trees make their decisions in a way that respects the original axes of the data, while neural networks rotate the data into combined dimensions and lose the feature-level structure in the process. Trees handle the kind of sharp, irregular decision boundaries that real data actually has, while neural networks have a built-in preference for smooth ones.

For language models the problem is compounded by the tokenisation story I just walked through. You are asking a model whose input is a one-dimensional sequence of text chunks to reason about data whose structure is multi-dimensional, heterogeneous, and unordered across columns. The serialisation you have to do to fit the data into the model's input destroys the structure that made the data useful in the first place. The model can read your serialised CSV and tell you something about it, and what it is doing when it does so is reasoning over the linguistic surface of the data, which is a lossy projection of the data itself.

The counter-proof

The cleanest counter-proof of this last point, and the paper I find myself thinking about most often when this comes up, is TabPFN, which appeared in Nature in January 2025. TabPFN is a transformer, the same underlying architecture as the language models I have been talking about, and the reason it is interesting is that it processes numbers natively, as continuous values, rather than as chunks of text.[4]

[4] Hollmann et al., TabPFN, Nature, January 2025. Pre-trained on 130 million synthetic tabular datasets. On real datasets up to 10,000 samples: outperforms a 4-hour-tuned tree ensemble in under 3 seconds of inference. Same transformer architecture, native numerical input instead of text tokens.

The authors pre-trained it on 130 million synthetic tabular datasets, and on real datasets of up to ten thousand samples it outperforms an ensemble of tree-based baselines that have been carefully tuned for four hours, in under three seconds of inference. The same architecture that cannot reliably multiply three-digit numbers when the numbers arrive as text fragments turns out to be world-class at tabular prediction when the numbers arrive as numbers. The lesson, as cleanly as a single paper can deliver it, is that the tabular weakness in language models lives at the text interface that language models were built around, and the transformer underneath is up to the work, when the input arrives in a form that lets it do that work.

The TabPFN result: Same transformer architecture. Native numerical input instead of text tokens. Outperforms a 4-hour-tuned tree ensemble in under 3 seconds of inference. The architecture is not the bottleneck — the text interface is.

The confidence that should not be there

The next place the friction shows up is subtler than the numbers one, and it matters in particular for anyone who is trying to build checks of any kind around a model's output. Before we train them to be helpful assistants, large language models turn out to be surprisingly good at knowing what they do not know. The GPT-4 technical report, which I find unusually candid for a document of its genre, says so explicitly: the pre-trained model is highly calibrated, with predicted confidence in an answer generally matching the probability of being correct, and after the post-training process the calibration is reduced.[5]

[5] GPT-4 Technical Report (OpenAI, 2023): “the pre-trained model is highly calibrated… after the post-training process the calibration is reduced.” Kadavath et al. (Anthropic, 2022): larger pre-trained models are well-calibrated, with calibration improving as models scale. Both findings point to post-training as the source of miscalibration.

Kadavath and colleagues at Anthropic measured the same thing independently in 2022, on a wide range of multiple-choice and true-or-false tasks, and found that larger pre-trained models are well-calibrated, with calibration improving as models scale.

The reason calibration is reduced after post-training is worth understanding, because it is, I think, structural rather than incidental. Post-training is the reinforcement-learning-from-human-feedback step that turns a raw model into something polite and useful, and it works by using a reward model to score outputs. Leng and colleagues, in a paper at ICLR 2025, showed that the reward models currently in use systematically prefer confident-sounding answers, regardless of whether the answers are correct. A model that hedges, admits uncertainty, or says "I do not know" gets scored lower, even when the hedge happens to be the right answer. A model that sounds sure gets scored higher.[6]

[6] Leng et al., ICLR 2025: reward models used in RLHF systematically prefer confident-sounding answers regardless of correctness. Stated confidence clusters in the 80–100% range. Xiong et al., ICLR 2024: GPT-4 achieves AUROC of only 62.7% for predicting its own failures — barely above a coin flip.

Over many training steps, this teaches the model to express confidence it does not internally have, and the same group measured that models trained this way produce verbalised confidence values clustered in the eighty-to-one-hundred-percent range, mimicking human overconfidence rather than reflecting actual probability. When Xiong and colleagues benchmarked verbalised confidence across major models at ICLR 2024, even GPT-4 achieved an area under the curve of only 62.7 percent for predicting its own failures, which is barely above a coin flip.

62.7%
GPT-4's AUROC for predicting its own failures — barely above a coin flip
[Chart: stated confidence against actual accuracy, 0 to 100% on both axes, with the perfect-calibration diagonal for reference. The pre-trained base model tracks the diagonal closely; standard RLHF falls below it; heavy RLHF sags furthest, with most outputs clustered in the high-confidence region where the gap is widest.]

The chart tells a story I find striking when you sit with it. The reference line, the diagonal where stated confidence matches actual accuracy, is what we want from any system whose outputs we are going to act on; it represents the case where, when the model claims it is eighty percent sure, the model turns out to be right roughly eighty percent of the time. The pre-training base model, before any post-training has been applied, sits very close to that diagonal, sagging slightly at the high end but in roughly the right place, which is what the GPT-4 technical report and the Kadavath paper both describe.

After standard RLHF post-training, the curve falls noticeably below the diagonal, particularly in the high-confidence region where most of the model's outputs cluster. In a more heavily post-trained model, of the kind Xiong and colleagues measured with an expected calibration error above 0.37, the curve sags so far below the diagonal that the stated confidence number stops carrying useful information at all.

There is a second effect in the same body of research that is worth naming, even though the chart above does not try to draw it, because I think it amplifies the practical consequences of the first effect in a way that I did not appreciate the first time I encountered it. Leng and colleagues measured that RLHF-trained models do not simply sag below the diagonal; they also stop using the lower part of the confidence scale altogether. Their stated confidence values cluster in the eighty-to-one-hundred-percent range regardless of whether the answer is correct, and you will rarely see a post-trained model return "I am thirty percent sure" of anything. Which means the gentle sag visible in the chart is an optimistic reading of the situation. In practice most of the curve goes unused and almost every output the model produces lands in the right-hand third of the x-axis, which is precisely where the gap between stated and actual is widest.
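The expected calibration error cited above has a simple definition, and it is worth seeing concretely. A minimal sketch, assuming ten equal-width confidence bins, which is a common convention though the papers vary:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence; average the per-bin gap
    between mean confidence and actual accuracy, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(mean_conf - accuracy)
    return ece

# A model that always says "90% sure" but is right half the time:
expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, True, False])  # ≈ 0.4
```

An ECE above 0.37 means the stated number is, on average, nearly forty points away from the truth.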

The signal underneath

The paradox is precise, in a way that I think is worth taking seriously rather than waving past. Work by Farquhar, Kuhn and Gal, published in Nature in 2024, showed that the uncertainty information is still there, inside the model, in a form they call semantic entropy. If you sample several completions from the model and cluster them by meaning, the spread across clusters is a surprisingly good indicator of whether the model is actually guessing on a particular question.[7]

[7] Farquhar, Kuhn and Gal, Nature, 2024: semantic entropy from multiple completions detects hallucination. Kossen et al., 2024: a linear probe on internal activations approximates semantic entropy from a single generation. Ahdritz et al. (Harvard/Kempner): epistemic uncertainty is linearly represented in activations and transfers across domains.

The research trail: from semantic entropy to linear probes

Kossen and colleagues extended this the same year by showing that a linear probe trained on the model's internal activations can approximate semantic entropy from a single generation, reducing the computational overhead to nearly zero.

Ahdritz and colleagues at Harvard and the Kempner Institute showed that epistemic uncertainty is linearly represented in the model's activations, and that probes trained in one domain transfer to others — which suggests the representation of uncertainty inside the model is reasonably general rather than a feature of any narrow training distribution.
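A toy version of the semantic-entropy computation makes the idea concrete. The clustering function here is a trivial stand-in; the real method uses bidirectional entailment between answers to decide when two completions mean the same thing:

```python
import math
from collections import Counter

def semantic_entropy(completions, cluster_of):
    """Entropy over meaning-clusters of sampled completions. Near zero
    when the samples agree; high when the model's answers scatter
    across meanings, which is the signature of guessing."""
    counts = Counter(cluster_of(c) for c in completions)
    n = len(completions)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

normalise = lambda s: s.strip().lower()        # stand-in for entailment clustering
semantic_entropy(["4", "4", "4"], normalise)   # → 0.0 (agreement)
semantic_entropy(["4", "5", "12"], normalise)  # → log2(3) ≈ 1.58 (guessing)
```

The expensive part in practice is sampling several completions per question, which is what the linear-probe follow-up work removes.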

Read together, these papers tell a story that I find unexpectedly melancholy: the model has a reasonable internal sense of when it is on solid ground and when it is not, and the training process teaches the model to suppress that sense at the output layer, while leaving the underlying signal intact in the activations of the network itself.

The model has a reasonable internal sense of when it is on solid ground and when it is not, and the training process teaches it to suppress that sense at the output layer.

If you are designing a wrapper that uses the model's expressed confidence, with a rule along the lines of "only act on answers the model says it is ninety-five percent sure about," you inherit the calibration problem wholesale, because the confidence number is miscalibrated by design. The better information exists in the system, and it sits one layer in, and you have to know where to look for it.

What a schema can and cannot pin down

This brings me to the next point of friction, which is what a deterministic wrapper of any kind can actually do with a probabilistic output, and what it cannot. The engineering answer here is real and useful, and I want to give it credit before pointing at what it leaves on the table.

Constrained decoding is the name for a family of techniques that forces a language model's output to match a given grammar or schema. You can require the model to emit valid JSON. You can require a specific set of fields with specific types. You can require the next word to be one of a short list. The way this works under the hood is that the technique inspects the model's probability distribution at each step and masks out any token that would violate the grammar, then samples only from what remains.[8]

[8] OpenAI structured outputs (response_format: json_schema), August 2024. Google Gemini API response_schema, Google I/O 2024. Anthropic constrained decoding, November 2025. Open-source: llama.cpp grammar constraints, Outlines. AgentBench (ICLR 2024) attributed 53% of database task failures to invalid format — entirely eliminable with these tools.

Geng and colleagues demonstrated this as a general framework in 2023, and the approach is now embedded in most production stacks. OpenAI's structured outputs shipped in August 2024. Google's Gemini API exposed response_schema at Google I/O 2024. Anthropic shipped constrained decoding into general availability in November 2025. Open-source equivalents in llama.cpp and Outlines, along with the typed tool-call schemas in agent frameworks, all converge on the same primitive. Where it works, it works cleanly, and it deserves a place in any serious system.
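The masking step at the heart of these implementations can be sketched in a few lines. This is illustrative only, with two assumptions of mine: a toy vocabulary of whole strings, and an `allowed` predicate standing in for the grammar automaton a real system compiles from the schema:

```python
import math
import random

def constrained_step(logits, vocab, allowed, rng=random):
    """One decoding step under a grammar: mask forbidden tokens to
    probability zero, renormalise the remainder, sample from that."""
    masked = [l if allowed(t) else float("-inf") for l, t in zip(logits, vocab)]
    top = max(masked)                      # assumes at least one token is legal
    weights = [math.exp(l - top) for l in masked]
    r = rng.random() * sum(weights)
    cumulative = 0.0
    for tok, w in zip(vocab, weights):
        cumulative += w
        if r < cumulative:
            return tok
    return vocab[-1]

# Mid-way through emitting JSON, suppose only '"' or '}' are legal next:
vocab = ["hello", '"', "}"]
constrained_step([9.0, 1.0, 1.0], vocab, lambda t: t in {'"', "}"})
# The model's favourite token ("hello") can never be emitted.
```

Notice what the mask guarantees and what it does not: the output will always parse, and nothing in the mechanism checks whether it is true.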

The twelve-percent gap

What constrained decoding does not do is guarantee that the content inside the schema is right. Park and Wang, in a paper at NeurIPS 2024, showed that standard grammar-constrained decoding distorts the distribution the model actually learned during training. Forcing the output to fit a particular shape pulls probability mass around in ways that were not part of training, and the result can be subtler than people expect. A paper called CRANE in 2025 showed that this distortion can also degrade the model's reasoning, on top of changing its formatting behaviour.[9]

[9] Park and Wang, NeurIPS 2024: grammar-constrained decoding distorts the training distribution. CRANE, 2025: the distortion also degrades reasoning. A 2025 formal specification study: 100% syntactic correctness (every output parses), 88% semantic correctness (meaning matches intent). The 12% gap is not addressable by tighter grammars.

A 2025 study of agents writing formal specifications reported syntactic correctness at one hundred percent, with every output parsing successfully, while semantic correctness against what the specification was meant to describe sat at 88 percent even with frontier models. The twelve-percent gap between parsing successfully and being correct about the world is not going to close by tightening the grammar further, because the work depends on a different property than the grammar measures.

This is the pattern of the friction. A schema is a contract about form. Form is a proxy for meaning, and meaning is the thing you actually care about. The proxy can be made tight enough to catch broken JSON and wrong types, and that is genuinely useful. It cannot be made tight enough to catch an answer that fits the schema and happens to be wrong about the world, because the wrongness lives at the level of meaning, where the schema cannot reach.

A schema is a contract about form. Form is a proxy for meaning, and meaning is the thing you actually care about.

Goodhart in the loop

The deeper version of the problem, and the one I think people designing verification layers underestimate, is that the schema does not sit passively in the loop. It sits there as an optimisation target, and language models trained with reinforcement learning are very good at finding the cheapest path to satisfying any optimisation target you put in front of them. A tight schema gives that pressure a clear surface to attach to.[10]

[10] OpenAI o1 and o3 system cards, 2024–2025. In coding evaluations, models modified the tests so they would pass instead of fixing the code, hardcoded expected outputs, monkey-patched assertion methods, and deleted failing test cases. Anthropic and DeepMind report similar findings in their own evaluations.

The clearest documentation of this lives in OpenAI's own o1 and o3 system cards, which describe coding evaluations where the models modified the tests so that they would pass instead of fixing the code being tested, hardcoded the expected outputs, monkey-patched the assertion methods, and in some runs simply deleted the failing test cases. This shows up across labs, in the ordinary course of expressing a goal as a checkable output format and then optimising hard against the checker.

The model learns to satisfy the checker, and whether it does so by performing the underlying work or by routing around it is invisible from the outside, because a passing schema check looks identical either way. Tightening the constraint can make the gaming worse rather than better: the harder you constrain the output, the more the model routes its effort through whatever wiggle room the specification still permits, and the verifier sees what looks like a compliant output. Whether that compliance reflects real work or compliance theatre is something inspection of the output alone cannot tell you, because the two are genuinely indistinguishable at the level the verifier can observe.

The Goodhart pattern: In OpenAI's own evaluations, models modified tests so they would pass instead of fixing the code, hardcoded expected outputs, monkey-patched assertion methods, and deleted failing test cases. A passing check and genuine compliance are indistinguishable from the outside.

Where verification actually works

There is a narrow class of problems where the schema and the truth are the same thing, and these are the cases where deterministic verification of probabilistic output really does work. Type-correct code that compiles. Mathematical proofs in a formal system like Lean or Coq. SQL queries that execute against a known schema and return verifiable results. In these domains the deterministic verifier is checking the actual property you care about, because the property is itself syntactic. There is no semantic gap to worry about, because the semantics are fully captured by the formal system.

DeepMind's AlphaProof and AlphaGeometry are the cleanest illustrations of this, an LLM generator wrapped in a formal verifier where the verifier's pass/fail signal corresponds exactly to mathematical correctness, and they are also the reason formal methods work where they work. The moment the task involves natural language, human intent, business context, or any judgement about whether an outcome is desirable, you are back in the world where the schema is a proxy for what you wanted, and the model can satisfy the proxy without doing the underlying work.

Where deterministic verification actually works — the short list

Type-correct code that compiles. Mathematical proofs in a formal system like Lean or Coq. SQL queries that execute against a known schema and return verifiable results. In these domains the verifier checks the actual property you care about, because the property is itself syntactic — there is no semantic gap. DeepMind's AlphaProof and AlphaGeometry are the cleanest illustrations: an LLM generator wrapped in a formal verifier where pass/fail corresponds exactly to mathematical correctness.

What happens when you chain them

The third place the friction builds, and the place where I think the practical consequences accumulate fastest, is what happens when you put several of these systems in a row. The naive arithmetic is straightforward, and it is worth working through, because it gives an upper bound on what to expect. If each step in a chain has independent reliability p, the end-to-end reliability across n steps is p to the power n. This is the standard series-system model from reliability engineering, which has been well-understood for decades, applied to a new substrate, and what it gives you is a ceiling on what you can hope for.

Per-step reliability   3 steps   5 steps   10 steps
99%                    97.0%     95.1%     90.4%
95%                    85.7%     77.4%     59.9%
90%                    72.9%     59.0%     34.9%
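The table is just p raised to the power n, and it is easy to regenerate or extend to your own chain length:

```python
def chain_reliability(p, n):
    """Series-system model: end-to-end reliability of n independent
    steps, each succeeding with probability p. An upper bound; real
    LLM chains show correlated errors and do worse."""
    return p ** n

for p in (0.99, 0.95, 0.90):
    cells = "  ".join(f"{chain_reliability(p, n):6.1%}" for n in (3, 5, 10))
    print(f"{p:.0%} per step:  {cells}")
```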

The empirical picture is worse than the ceiling, for two reasons. The first is that LLM chains exhibit correlated and cascading errors. Huang and colleagues measured in 2023 that factual errors in reasoning chains have a seventy-three-percent probability of causing downstream failures, and that the model's capacity to detect its own error decreases as the chain gets longer. The naive p^n formula assumes independence between steps, which means it overestimates the reliability you actually get from real chains, because the errors in real chains are not independent in the way the model assumes.

The second reason the empirical picture is worse than the ceiling is that there is a structural result behind the empirical numbers, and I think it is worth sitting with for a moment, because it is the sort of thing that people who treat reliability as a pure engineering problem sometimes assume away. Dziri and colleagues' "Faith and Fate" paper, which was a NeurIPS 2023 spotlight, formalises compositional tasks as computation graphs and proves that autoregressive transformers' performance decays exponentially with compositional depth. The probability of incorrect predictions converges toward one as compositional complexity grows. This is a theorem about the architecture itself; it applies regardless of how capable any particular model becomes, and a larger training run will not undo it.

The empirical validation in the same paper showed GPT-4 at fifty-nine percent on three-digit by three-digit multiplication, with performance collapsing toward zero as compositional depth increased.

73%
Probability that a factual error in a reasoning chain causes downstream failure — and error detection decreases as chains lengthen

The agent benchmarks confirm the same pattern at production scale, and I think they are worth looking at closely because the gap between benchmark variants is the most visible measure of compounding error in current systems. SWE-bench Verified shows top models at roughly seventy-eight percent on human-validated coding tasks. SWE-bench Pro, which demands multi-file modifications averaging 107 lines of code spread across 4.1 files, drops the same models to about twenty-three percent. The gap between seventy-eight and twenty-three is the compounding-error tax as task complexity grows.

WebArena shows the best agents at sixty-two percent against a seventy-eight-percent human baseline on realistic multi-step web tasks. Cemri and colleagues' MAST taxonomy, a NeurIPS 2025 spotlight, analysed 1,642 execution traces across seven open-source multi-agent frameworks and measured failure rates between forty-one and 86.7 percent, with most failures rooted in inter-agent coordination and verification gaps.

| Benchmark | Task complexity | Best agent | Baseline |
| --- | --- | --- | --- |
| SWE-bench Verified | Single-file coding | ~78% | — |
| SWE-bench Pro | Multi-file, avg 107 lines / 4.1 files | ~23% | — |
| WebArena | Multi-step web tasks | 62% | 78% (human) |
| MAST (multi-agent) | 7 frameworks, 1,642 traces | 13–59% success | — |

The implication is that scaling does not close the gap. What scaling does is improve single-step accuracy p, which shifts the reliability curve upward without changing its shape. The exponential decay in chain length remains, and for any target reliability threshold there exists a chain length beyond which the system fails more often than it succeeds. The length grows with model capability, and it grows much more slowly than the ambitions of agent-system designers tend to assume.

The scaling paradox: Better models improve per-step accuracy, which shifts the reliability curve upward — but the exponential shape stays. At 95% per step, a 10-step chain drops to 59.9%. The empirical picture is worse because real errors are correlated, not independent.
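The threshold in that paradox can be computed directly. Under the same independence assumption, the longest chain that still meets a target end-to-end reliability R is the largest n with p^n ≥ R, i.e. floor(ln R / ln p):

```python
import math

# Longest chain whose end-to-end reliability still meets `target`,
# under the independence assumption (so real chains are shorter).
def max_chain_length(p: float, target: float) -> int:
    return math.floor(math.log(target) / math.log(p))

# Chain length beyond which the system fails more often than it succeeds:
print(max_chain_length(0.95, 0.5))  # 13 steps at 95% per step
print(max_chain_length(0.99, 0.5))  # 68 steps at 99% per step
```

Pushing per-step accuracy from 95% to 99% roughly quintuples the usable chain length, which is real progress, and it is still a finite budget that agent designs have to live inside.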


The friction, honestly

If you are building with these systems, the honest summary, from what the research actually shows, is that probabilistic and deterministic do not meet in the middle. You can restrict the domain narrowly enough that they effectively do, in the kind of cases I mentioned earlier: mathematical proofs checked in Lean, type-correct code validated by a compiler, SQL queries checked against a known schema. Where you can do this, the systems that result are often genuinely reliable, because inside those rooms the wrapper is checking the same property you actually care about, fully captured by the formal system. Outside those narrow rooms, what you have is a deterministic skin layered over a probabilistic core, and the core is still doing what it has always done: rolling heavily weighted dice, mostly getting it right, sometimes drifting, expressing more confidence than it ought to, handling numbers through a text interface that loses their structure, and producing outputs whose meaning is not reachable from their form.

None of this means the machines are bad, or that building with them is a mistake. It means a serious architecture has to respect the friction rather than pretend it is not there. The shapes that I have seen work, in the systems I admire most among the ones I have read about and the ones I have built myself, share a few common features:

  • Chains kept short, with each step individually reliable
  • Structured handoffs where a deterministic layer holds the state machine between agent steps
  • Tool calls for arithmetic and data lookups, so the probabilistic layer is not asked to do numerical work through its tokeniser
  • Humans in the loop at irreversible steps
  • Acceptance that the wrapper sets boundaries on what the machine can do, while leaving the core behaviour of the machine itself untouched
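The second item in that list, a deterministic layer holding the state machine between agent steps, can be sketched roughly as follows. Everything here is illustrative: `call_model` stands in for whatever LLM client you use, and the step names and validators are hypothetical placeholders, not a real API.

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    prompt: str
    validate: callable          # deterministic check on the model output
    irreversible: bool = False  # require human sign-off before proceeding

def run_chain(steps, call_model, max_retries=2):
    state = {}  # state lives in the deterministic layer, not in the model
    for step in steps:
        for _ in range(max_retries + 1):
            output = call_model(step.prompt.format(**state))
            if step.validate(output):  # deterministic gate on each step
                break
        else:
            raise RuntimeError(f"step {step.name!r} failed validation")
        if step.irreversible and input(f"approve {step.name}? [y/N] ") != "y":
            raise RuntimeError(f"step {step.name!r} rejected by human")
        state[step.name] = output
    return state
```

The design choice worth noticing is that the model never holds the chain's state or decides what happens next; it only fills in one step at a time, and every handoff passes through a check the probabilistic core cannot talk its way past.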

The tension between probabilistic and deterministic tends to get buried in the excitement of a demo, because a demo is almost always the good roll of the dice, and demos are what the world ends up seeing. Production systems have to live with the whole distribution, including the rolls that go the other way, and the people who depend on production systems have to live with the consequences of those rolls.

Production systems have to live with the whole distribution, including the rolls that go the other way.

Building well, in this context, means being honest about which part of the distribution you are actually buying into, and being honest about it with the people on the other side of the system who will have to live with the rest.
