[Figure: How often a 20-step agent finishes clean. End-to-end outcomes at 95% per-step reliability: fails somewhere, 64%; succeeds end-to-end, 36%. Chain twenty 95% steps and most runs fail before the finish.]
Essay

You can't ship an agent you can't test

KC·Jun 29, 2026·11 min read

A demo proves the agent can succeed. It says nothing about how often it will. Those are different questions, and the gap between them is where money dies — so testing an agent is not a dashboard you check after the fact. It is the contract that decides whether the thing ships at all.

They say you shouldn't deploy on Fridays. I know the rule, so technically I never broke it: I let my DevOps agent do it for me. It read the incident, opened the pull request, patched the Kubernetes manifest, ran the checks, and the suite came back six green as a spring lawn: valid manifest, pinned image, probes, replica floor, diff before apply, healthy rollout. I shipped it Friday afternoon feeling like a man who had found a loophole in his own superstition, closed the laptop, and thought: this is fine. Automated, tested, six-for-six. This is fine.

It was not fine.

[Figure: The Friday deploy, by the numbers. 6 tests green before I shipped; 11 min cluster down at peak traffic; 36% end-to-end success at 20 steps.]

By Saturday night the cluster was down — not from a crash, not from an exception, not from anything that turned one of those six tests red. One number in the manifest, a memory request, was quietly set too high, and when Saturday traffic peaked the autoscaler tried to add pods the scheduler could not place. The replicas sat Pending, the healthy pods drowned under the load, and the checkout service went dark for eleven minutes in the busiest window of the week. Every test I had written still passed — passes to this day. The agent had not been tested. It had been auditioned, I liked the performance, and I confused applause for a passing grade.

Six green checks, one dead service

What I tested                               Result
Manifest is valid YAML + schema             PASS
Image pinned to a digest                    PASS
Liveness + readiness probes set             PASS
Replica floor respected                     PASS
diff runs before apply                      PASS
Rollout waits for healthy pods              PASS
Request small enough to schedule at peak    never written

Here is the thing nobody tells you when you start building agents: a demo proves the agent can succeed; it says nothing about how often it will. Those are different questions, and the gap between them is where money dies. The discipline that closes the gap already exists, has existed for fifty years, and we threw it out the window the moment we started building with models. It is called testing. For agents we call it evals — and evals are not a dashboard you check after the fact. They are the contract that decides whether the thing is allowed to ship at all.

The bug isn't a bug, it's a distribution

A normal bug is a fact. The function returns -1 when it should return 0; you write a test that asserts 0, watch it go red, fix it, watch it go green, and it stays green. The test is a tripwire that fires the same way every time because the code is deterministic. That determinism is what makes a unit test mean something — a green test is a promise about all future runs.

An agent makes no such promise. Ask the same model the same question twice and you can get two different answers, because sampling from a probability distribution is the whole mechanism, not a defect you can patch out. So a "passing run" is not a fact about the agent. It is a single sample from a distribution you have not measured. You flipped the coin once, it came up heads, and you wrote down "this coin lands heads."

That is why pass@1 — did it work that one time — is the most dangerous number in the field. It feels like a test result and it is a coin flip. The number that matters is closer to pass^k: run the same task k times, and ask whether it succeeds every time.
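A minimal sketch of the difference, assuming each run of a task is recorded as a boolean. The function names are illustrative, not taken from any library, and pass^k is computed here empirically as "the first k recorded runs all succeeded".

```python
from typing import Sequence


def pass_at_1(runs: Sequence[bool]) -> float:
    """Fraction of individual runs that succeeded (the single-run rate)."""
    return sum(runs) / len(runs)


def pass_hat_k(runs: Sequence[bool], k: int) -> bool:
    """True only if every one of the first k recorded runs succeeded."""
    if len(runs) < k:
        raise ValueError(f"need at least {k} runs, got {len(runs)}")
    return all(runs[:k])


# Hypothetical record of ten runs of one task: nine passes, one failure.
runs = [True, True, True, False, True, True, True, True, True, True]

single_run_rate = pass_at_1(runs)         # 0.9: looks like a good agent
holds_every_time = pass_hat_k(runs, 10)   # False: it does not hold every time
```

The same ten runs give a flattering rate and a failing verdict, which is the gap the two numbers are meant to expose.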

[Figure: The number you report changes the truth. Reported success of the same 20-step agent: 95% as pass@1 (one run), 36% as pass^k (every run). Report the worst of k runs, not the one demo that happened to pass.]

The reason pass^k is the honest number is that an agent is a chain, and a chain only holds if every link holds. Anthropic's writeup on building effective agents frames them this way, as multi-step loops of model calls and tool calls, and the arithmetic of a chain is unforgiving. If each step succeeds independently with probability p, all n steps succeed with probability p^n, and the same multiplication governs k repeated runs of one task. A 95%-reliable step looks wonderful right up until you chain twenty of them and land at 0.95^20 ≈ 36%, the split the header chart makes impossible to ignore. Headline benchmarks hide this: a model with a great average is reporting the mean of a distribution whose tail is eating your uptime. You do not ship the mean. You ship every sample, including the ugly ones, to real traffic. So the first move in testing an agent is to stop asking "did it work?" and start asking "how often, and how badly does it fail when it fails?" That is not a vibe. It is a measurement, and a measurement needs a harness.
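The chain arithmetic is easy to check for yourself. A sketch, assuming every step succeeds independently with the same probability; real steps are rarely fully independent, so treat it as a rough model rather than a prediction.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    return per_step ** steps


# Watch a 95%-reliable step decay as the chain grows.
for steps in (1, 5, 10, 20):
    print(f"{steps:>2} steps at 95%: {chain_success(0.95, steps):.0%}")
```

At twenty steps this prints 36%, the figure quoted above.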

Evals are unit tests with the determinism moved

The cleanest way I have found to think about it: an eval is a unit test where the assertion lives somewhere other than assertEqual. You still have the three classic parts — an input, an execution, a check — but the check can no longer be exact-match-or-bust, because the output legitimately varies. You move the rigidity from the output to the property. Concretely, an eval is a dataset plus a scorer: the dataset is your corpus of cases — the incident that needs a rollback, the manifest with a bad resource limit, the change that passes a smoke test but starves the autoscaler — and the scorer is the function that grades the run.
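As a rough sketch of "dataset plus scorer": the `Case` type, the stub agent, and the numbers below are all hypothetical, chosen only to show the scorer grading a property of the output instead of an exact string.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    incident: str             # the input handed to the agent
    max_mem_request_mib: int  # labeled property the output must respect


def run_eval(cases: list[Case],
             agent: Callable[[str], dict],
             scorer: Callable[[dict, Case], bool]) -> float:
    """Run the agent on every case and return the fraction the scorer accepts."""
    results = [scorer(agent(case.incident), case) for case in cases]
    return sum(results) / len(results)


def stub_agent(incident: str) -> dict:
    # Stand-in for a real agent call; always proposes the same request.
    return {"mem_request_mib": 3072}


def fits_node(output: dict, case: Case) -> bool:
    # The check is on a property (request fits), not on the exact output.
    return output["mem_request_mib"] <= case.max_mem_request_mib


cases = [
    Case("OOMKilled in checkout", max_mem_request_mib=4096),
    Case("bad image tag in cart", max_mem_request_mib=2048),
]
score = run_eval(cases, stub_agent, fits_node)  # one of two cases passes
```

Swapping the scorer changes what "correct" means without touching the dataset, which is why the two are kept separate.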

[Figure: An eval is a unit test, rearranged. Dataset (labeled deploy cases, JSONL) → Deterministic check (manifests, limits, probes) → Trajectory check (right order, gated rollout) → Judge, calibrated (open-ended, last resort).]

Scorers come in a strict hierarchy of cost and trust, and internalizing it before you spend a dollar is the whole game.

[Figure: Where your eval value actually lives. Share of failures caught, by scorer type: deterministic asserts 80%, trajectory checks 15%, LLM-as-judge 5%. The cheapest layer catches the most; teams skip it because it is boring.]

At the base sit deterministic assertions. Did the agent emit a manifest that passes schema validation? Does every container that sets a memory limit also set a request, and is the request small enough that the scheduler can actually place a replica on a real node? Did it run kubectl diff before kubectl apply? These run in under a millisecond, never flake, cost nothing, and catch the dumbest, most common, most expensive failures. This is where most of your eval value lives, and it is the layer everyone skips because it is boring — but boring is the point, because boring is what you can afford to run on every commit. The middle layer checks the trajectory: not the final answer but the path, because agents fail in the steps, not the summary, and the right rollout reached by skipping the canary stage is an outage wearing a green checkmark. Open frameworks like DeepEval, Comet's Opik, and Arize Phoenix exist precisely to let you assert over tool-call sequences and traces rather than final strings.
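A trajectory check can be as small as an ordered-subsequence test over the recorded tool calls. This is a sketch, assuming the trace is available as a plain list of step names; the names are illustrative and not tied to the API of DeepEval, Opik, or Phoenix.

```python
from typing import Iterable, Sequence


def follows_order(steps: Iterable[str], required: Sequence[str]) -> bool:
    """True if `required` appears within `steps` in order (gaps are allowed)."""
    remaining = iter(steps)
    # `name in remaining` consumes the iterator up to the match,
    # so each later name must appear after the earlier ones.
    return all(name in remaining for name in required)


# Hypothetical required path: diff first, then canary, then the full apply.
REQUIRED = ["kubectl_diff", "canary", "kubectl_apply"]

good = ["read_incident", "kubectl_diff", "canary", "wait_healthy", "kubectl_apply"]
skipped_canary = ["read_incident", "kubectl_diff", "kubectl_apply"]
wrong_order = ["kubectl_apply", "kubectl_diff", "canary"]
```

Only `good` satisfies the check; the other two end in the same apply step but reach it by a path the check rejects.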

[Figure: Two kinds of check, opposite trust. Deterministic assert: <1ms, never flakes, free, exact. LLM-as-judge: ~80% agreement with humans, on a good day.]

Only at the top — slow, expensive, and the least trustworthy — sits LLM-as-judge: a second model grading the first one's work for reasonableness, for whether the postmortem reads sound, for whether the change summary matches the diff. It is necessary for genuinely open-ended outputs, and it is also a model, which means it has its own distribution, its own biases, its own bad days. Treat a judge as a noisy sensor you calibrate against human labels, not as ground truth; the teams shipping this seriously report roughly 80% agreement between judge and human on a good day. The rule that falls out of the pyramid is simple: push every check as far down it as you can, because the bottom is free and the top lies to you about a fifth of the time.
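Calibrating a judge starts with measuring how often it agrees with people. A minimal sketch, assuming you have a pass/fail verdict from the judge and a human label for the same cases; the sample labels are made up for illustration.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of cases where the judge's verdict matches the human label."""
    if len(judge) != len(human) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def disagreements(judge: Sequence[bool], human: Sequence[bool]) -> list[int]:
    """Indices of the cases worth reading by hand."""
    return [i for i, (j, h) in enumerate(zip(judge, human)) if j != h]


# Hypothetical labels for ten open-ended outputs.
human = [True, True, True, True, True, False, False, False, False, False]
judge = [True, True, True, True, False, False, False, False, False, True]

rate = agreement_rate(judge, human)   # 0.8
to_review = disagreements(judge, human)
```

The list of disagreements is usually more useful than the rate, because reading those cases shows whether the judge or the label is wrong.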

A real example, in code

Here is the shape of it, stripped to the bone. The broken version — the one I shipped — had no contract at all:

# what I shipped on Friday: a vibe with a deploy button
plan = agent.run(incident)
patch_manifest(plan.manifest)
kubectl_apply(plan.manifest)   # trusts the model's YAML. completely.

Nothing there can ever fail loudly. There is no manifest it rejects, no number it distrusts; the model writes YAML, the cluster eats it. The memory request that took me down was valid YAML, valid schema, valid everything — it was just too big for any node to hold once the autoscaler asked for more. Now the version with a contract — an eval suite that runs in CI and a runtime assertion that mirrors it:

# evals/test_deploy.py — runs on every PR, blocks merge on red
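import pytest
from agent import plan_deploy
# Assumed project-local helpers (hypothetical module path, not shown here):
# load_jsonl parses one case per line, run_n samples the agent n times, and
# LARGEST_NODE_ALLOCATABLE is the largest node's allocatable memory.
from evals.helpers import LARGEST_NODE_ALLOCATABLE, load_jsonl, run_n

CASES = load_jsonl("datasets/deploy_golden.jsonl")  # 200 labeled incidents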

@pytest.mark.parametrize("case", CASES)
def test_deploy(case):
    out = plan_deploy(case.incident)
    m = out.manifest
    # base of the pyramid: deterministic, <1ms, zero flake
    assert m.schema_valid                                  # well-formed manifest
    for c in m.containers:
        assert c.mem_request and c.mem_limit               # both set, never one
        assert c.mem_request <= LARGEST_NODE_ALLOCATABLE   # a replica can be scheduled
        assert c.mem_request <= c.mem_limit                # request never exceeds limit
    # trajectory: it must diff and canary before a full rollout
    assert out.steps.index("kubectl_diff") < out.steps.index("kubectl_apply")
    assert "canary" in out.steps

def test_pass_caret_k():
    # the honest number: every case must hold across 10 samples, not just once
    rate = min(run_n(plan_deploy, c, n=10).success_rate for c in CASES)
    assert rate >= 0.98, f"worst-case reliability {rate:.0%} below bar"

The second block is not longer because I like typing. It is longer because it encodes the things the model is not allowed to get wrong, and it runs them on a corpus, across repeated samples, before the code can merge. The mem_request <= LARGEST_NODE_ALLOCATABLE line is the single assertion that would have kept my cluster up. It is one line. It was always one line. I just had not written the test that demanded a replica be schedulable at peak, because at demo time there was only one pod and it fit fine.
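The runtime assertion that mirrors the suite could look something like this. It is a sketch under stated assumptions: the `Container` type, the `UnsafeManifest` exception, and the node size are hypothetical, and memory is expressed in MiB for simplicity.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed value: allocatable memory of the largest node type, in MiB.
LARGEST_NODE_ALLOCATABLE_MIB = 14_000


@dataclass
class Container:
    name: str
    mem_request_mib: Optional[int]
    mem_limit_mib: Optional[int]


class UnsafeManifest(Exception):
    """Raised when a generated manifest breaks a deploy invariant."""


def check_containers(containers: list[Container]) -> None:
    """Raise before apply if any container violates the memory invariants."""
    for c in containers:
        if c.mem_request_mib is None or c.mem_limit_mib is None:
            raise UnsafeManifest(f"{c.name}: request and limit must both be set")
        if c.mem_request_mib > c.mem_limit_mib:
            raise UnsafeManifest(f"{c.name}: request exceeds limit")
        if c.mem_request_mib > LARGEST_NODE_ALLOCATABLE_MIB:
            raise UnsafeManifest(f"{c.name}: request cannot fit on any node")
```

Calling `check_containers` immediately before the apply step means a manifest that slips past CI still fails loudly instead of reaching the cluster.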

Make the green gate, not the green dashboard

The mistake I see most — the one I made — is treating evals as observability. You ship, you watch a dashboard, you notice pods going Pending, you sigh, you page yourself at 2 a.m. That is not testing. That is an autopsy with good charts: observability tells you the patient died, while a test refuses to let them on the table sick.

[Figure: Make the eval a gate, not a dashboard. Change (prompt, model, or tool) → Run suite (dataset on the PR, in CI) → Red / green (a regression grays the merge) → Merge (only a green suite ships).]

The shift that changes outcomes is to make the eval suite a gate. A prompt change, a model swap, a new tool — every one is a deploy, and none of them merge until the suite is green. CI runs the dataset on the pull request, and a regression — the new prompt that is 3% better at picking rollback targets but 15% worse at sizing resource requests — shows up as a red check, not a Saturday-night surprise. You caught it the way you catch a null-pointer: the build went red and the merge button grayed out.
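One way to turn that into a gate is to compare per-metric pass rates against a stored baseline and return a non-zero exit code on any regression. A sketch only: the metric names, the numbers, and the 1-point tolerance are assumptions, not values from a real suite.

```python
import sys


def regressions(baseline: dict[str, float], candidate: dict[str, float],
                tolerance: float = 0.01) -> dict[str, float]:
    """Metrics where the candidate is worse than baseline by more than tolerance."""
    return {
        name: candidate.get(name, 0.0) - base
        for name, base in baseline.items()
        if candidate.get(name, 0.0) < base - tolerance
    }


def gate(baseline: dict[str, float], candidate: dict[str, float]) -> int:
    """Exit code for CI: 0 if nothing regressed, 1 otherwise."""
    failed = regressions(baseline, candidate)
    for name, delta in failed.items():
        print(f"REGRESSION {name}: {delta:+.0%}", file=sys.stderr)
    return 1 if failed else 0


# Hypothetical pass rates: better at rollback targets, worse at sizing.
baseline = {"rollback_target": 0.90, "resource_sizing": 0.95}
candidate = {"rollback_target": 0.93, "resource_sizing": 0.80}
exit_code = gate(baseline, candidate)
```

In CI the script would finish with `sys.exit(exit_code)`, so an improvement in one metric cannot hide a drop in another.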

[Figure: pass@1 hides the tail. 36% hold across 20 steps: runs that fail, 64%; runs that pass, 36%.]

This is also the only honest way to swap models. Everybody wants to drop in the new, cheaper, faster model the week it ships. The only thing standing between you and a silent quality cliff is a suite that re-runs your real incidents against the new weights and tells you, in numbers, what you traded. Without it you are not upgrading — you are redeploying a stranger straight into prod.

An agent you can't test is not a product. It's a prototype you happen to be running in production.

What to actually do Monday

You do not need a platform or a budget to start. You need to convert your vibes into cases and your hopes into assertions. In order of return on effort:

The scorer pyramid, at a glance

Layer                   Cost     Trust    Run it
Deterministic assert    <1ms     exact    every commit
Trajectory check        cheap    high     every PR
LLM-as-judge            slow     ~80%     open-ended only

  1. Write down ten cases you already tested by hand. Every demo you ran is an eval you did not save. Capture the incident and the correct outcome as a row in a JSONL file; that file is your dataset, and it only ever grows.
  2. Assert the boring, deterministic things first. Schema validity, requests and limits both set, requests small enough to schedule at peak, the diff run before the apply. This is the bottom of the pyramid, and it catches the failures that actually take you down.
  3. Report pass^k, never pass@1. Run every case 5-10 times and record the worst result, not the average. One green run is a coin flip you only flipped once.
  4. Turn the suite into a CI gate. A prompt change is a deploy. If it can reach production without going through a red/green check, you do not have evals — you have logs.
  5. Add LLM-as-judge last, and calibrate it. Only for the genuinely open-ended outputs, only after the cheap checks pass, and only once you have measured how often the judge agrees with a human.
  6. Re-run the whole suite on every model swap. The new model is a stranger until your cases say otherwise.
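Step 1 needs very little machinery. A sketch of an append-only JSONL dataset, assuming each row holds an incident description and the expected outcome; the field names and sample rows are illustrative, and the demo writes to a temporary directory where a real dataset would live in the repository.

```python
import json
import tempfile
from pathlib import Path


def append_case(path: Path, incident: str, expected: dict) -> None:
    """Append one labeled case as a single JSON line; existing rows are untouched."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"incident": incident, "expected": expected}) + "\n")


def load_cases(path: Path) -> list[dict]:
    """Read every non-empty line back as one case."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


dataset = Path(tempfile.mkdtemp()) / "deploy_golden.jsonl"
append_case(dataset, "checkout pods OOMKilled after deploy", {"action": "rollback"})
append_case(dataset, "memory request too large to schedule", {"action": "reject_manifest"})
cases = load_cases(dataset)
```

One line per case keeps diffs readable in review, so adding a case after an incident is a one-line pull request.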

Fifty years ago we learned that code you can't test is code you don't understand. Agents did not repeal that law — they just hid it behind a convincing demo. The contract is the same as it ever was: you can't ship what you can't test, and for production AI, the test is the eval.
