The Verifier's Dividend
How much task horizon does a checker buy an agent? A closed-form answer, three interactive instruments, and one uncomfortable thing the METR data has been saying all along.
On this page
There is a number in METR's task-horizon data that gets far less attention than it deserves, and it isn't the doubling time.
METR's headline result is rightly famous: the length of task an AI agent can complete at a 50% success rate has been doubling roughly every seven months for six years.1 The less-quoted result sits one table away: the horizon at an 80% success rate is about five times shorter than the horizon at 50%. Same models, same tasks — just a different bar for "reliable."
That ratio is a measurement of something. This note works out what, derives a closed-form result we use constantly at the lab, and ends with a claim you can check us on.
Compounding is the whole story
Model an agent's task as a chain of steps — tool calls, edits, decisions. Suppose each step succeeds independently with probability . Then an -step task succeeds with probability
and the horizon — the task length at which success drops to some target — is
This is the constant-hazard reading of the METR curves, which Toby Ord spelled out shortly after the paper landed: an agent has a half-life, and survival decays exponentially with task length.2
The arithmetic is brutal in a specific, useful way. Drag the slider:
Per-step reliability compounds exponentially against task length. The dashed markers are the 80% and 50% horizons — note they sit a fixed ratio apart (~3.1×) no matter where you drag the slider. Ghost curves: 99% and 99.9% for reference.
Two things worth noticing. At 99% per-step reliability — a number that would make any agent harness proud — the coin-flip horizon is 69 steps. And every "nine" you add multiplies the horizon by ten: near . Horizons are bought with nines, and nines of actor reliability are the most expensive commodity in this industry. You rent them from a frontier lab, at the frontier's price, on the frontier's schedule.
There's a cheaper store.
Now add a checker
Put a verifier in the loop. After each step attempt:
- a failed attempt is caught with probability — the step retries, up to a budget of retries;
- with probability the failure slips through silently, and the task is quietly poisoned.
One geometric sum later, the effective per-step reliability is
where is the probability a given attempt fails and earns a retry. Define the verifier's dividend as the multiplier on horizon:
Note what cancelled: doesn't depend on the target at all. The dividend you earn at the coin-flip horizon is the same dividend you earn at the four-nines horizon.
Dashed curve: the bare actor. Solid curve: the same actor with a verifier in the loop. The dividend converges to 1/(1−c) as the actor improves — verifier nines convert directly into horizon multipliers, and the retry overhead stays negligible.
The dividend has a closed form, and it's rude
In the regime that matters — a competent actor, near 1, any retry budget at all — expand the logs and almost everything falls away:3
Read that carefully, because it's saying three things:
- Verifier nines convert one-for-one into horizon multipliers. A checker that catches 90% of failures multiplies your horizon by ten. Catch 99%, multiply by a hundred. This is the same exchange rate as actor nines —
- — except the price is different. The compute overhead of the whole scheme is the expected retry rate, . At , : a tenfold horizon for under 1% extra compute. Actor nines cost a training run. Verifier nines cost an engineering sprint — type checkers, property tests, sandboxed replay, invariant checks, a second model reading diffs. Checking is cheaper than doing, and this is the formula that monetizes the gap.
- The dividend doesn't care how good the actor is. has no in it. Whatever model you're running next year, the same verifier still multiplies it.
We keep this formula taped above the workbench, metaphorically speaking. When someone proposes spending a quarter chasing actor reliability, the first question is now: what's , and what would it cost to raise it instead?
What the 5× gap is telling us
Back to METR's ratio. Under constant hazard, the gap between the 50% and 80% horizons is fixed:
You can see it in Instrument 01 — the two dashed markers keep their ratio no matter where the slider sits. But METR measures roughly 5×, not 3.1×. The constant-hazard model is wrong in a directional way, and the direction is interesting.
Fit the simplest generalization — Weibull survival, — and the observed ratio pins the shape parameter near .4 Shape less than one means the hazard falls as the task progresses: agents that survive the opening act tend to keep surviving. Failures are front-loaded — the fatal move is the early wrong assumption that the next two hundred steps faithfully build on, not the thousandth tool call.
Which sharpens the engineering advice considerably: a uniform verifier is spending its catch-rate budget in the wrong place. Verify the plan like it's radioactive; spot-check the execution. The first ten steps of an agent run deserve a different class of scrutiny than the last hundred — the data, not just the intuition, says so.
A thousand agents walk into a task
Closed forms are tidy; production is a histogram. Instrument 03 runs the actual process — seeded, so you'll see what we see.
Each run is a 200-step task at 99% per-step reliability (seeded, reproducible). The histogram shows where doomed runs were poisoned. Red is the dangerous mass — failures nobody noticed at the time. A bare actor fails silently by definition. Add the verifier and most of that mass simply disappears into retries; the red that remains is the 10% that slipped past the checker.
The toggle is the entire argument of this note, compressed. A bare actor doesn't just fail more — it fails silently, every single time, because nothing was watching. The verifier eliminates most of the failure mass and, just as valuable, converts the texture of what remains: caught-and-exhausted retries halt loudly where you can see them. The red that survives is precisely the the checker missed — which is the number to attack next.
Silent failures are the expensive kind. They're the ones that ship.
What we do with this
Three working rules, all derived above, all in force at the lab:
- Buy nines from the verifier store first. at overhead is the best trade in agent engineering. Exhaust it before paying actor prices.
- Front-load the scrutiny. . The plan review is worth more than the step review.
- Engineer failures to be loud. The verifier's second product, after the dividend, is the conversion of silent failures into loud ones. Loud failures cost a retry; silent ones cost a postmortem.
And one falsifiable claim, so this note earns the name field note rather than opinion: the model above says horizon ratios between success thresholds are invariant to actor quality — as models improve, the gap should stay near 5× rather than closing toward 3.1×, unless harnesses (not models) change. That's checkable against every future METR release. If it breaks, the constant- assumption breaks with it, and we'll write the follow-up.5
The instruments on this page run the same math we unit-test in the site's repository — the essay and the code can't drift apart without CI noticing.
Footnotes
-
Kwa et al., Measuring AI Ability to Complete Long Tasks (METR, 2025). The 50%-horizon doubling time is ~7 months over 2019–2025, faster in the most recent span; the 80% horizon runs roughly 5× shorter than the 50% horizon across models. ↩
-
Toby Ord, The Half-Life of AI Agents (2025) — the observation that METR's curves are consistent with a constant per-minute hazard rate, i.e. exponential survival. ↩
-
With : for any , so . The condition is doing real work: with no retry budget, detection alone buys you loudness but no dividend. ↩
-
, so . Setting that to 5 gives . We're fitting one parameter to one ratio — call it what it is, a back-of-the-envelope — but the sign of is the robust part. ↩
-
Known liberties: steps aren't independent, verifiers have false-positive rates (which tax throughput, not correctness), and a verifier whose blind spots correlate with the actor's failure modes has a lower effective than its benchmark suggests — the case where "use a second model as the checker" quietly underdelivers. Each of these moves the constants, not the shape. ↩