
Silent retries don't fix the bug.
They hide it from you.

KC · Jun 24, 2026 · 12 min read

A retry that succeeds on attempt three didn't heal anything. It buried the evidence, smoothed the dashboard, and billed me for the funeral.


Last month a single agent step in my pipeline started returning the wrong tool output about once every forty runs. The dashboard said 100% success. It wasn't. It was a retry(3) wrapper quietly papering over a 2.5% failure rate, and I'd been shipping new features on top of that lie for nine days before a customer screenshot caught it.

The model didn't get flakier. My harness got quieter.

That's the whole essay in two sentences, but the trap is subtle enough that I keep watching good engineers walk into it, so let me lay it out with the numbers.

A retry doesn't fix the bug. It relocates it.

Here's the setup. You have a call that fails non-deterministically — a JSON parse that chokes maybe 3% of the time on a malformed completion, a tool that races on shared state, an embedding lookup that times out under load. It's reproducible if you actually look at it: run it a hundred times, count the reds. So you do the obvious thing. You wrap it in retry(3), the red turns green, CI passes, you ship.

What you actually did is convert a bug you could catch into a bug you can't. Before the wrapper, that failure was a hard line in your logs — a stack trace with a timestamp and an input you could replay. After the wrapper, it's a probability. A 1-in-N ghost that only materializes under load, at 3am, in front of the one customer who screenshots everything.

[Figure: one call, three silent attempts. Attempt 1 fails, attempt 2 fails, attempt 3 succeeds. The caller sees SUCCESS and the dashboard records 1 SUCCESS; the two real failures are erased from the record.]

Walk the arithmetic, because the arithmetic is where the danger hides. A 2.5% per-call failure rate, wrapped in three attempts, drops to an observed failure rate of roughly 0.025³ ≈ 0.0016%, about 1 in 64,000 calls, if the failures are independent. So your dashboard goes from "2.5% red" to "99.998% green," and every instinct you have says the second number is better. It isn't. The 2.5% didn't go anywhere. You're still paying for it: up to three times the latency on every failing call, up to three times the token spend, up to three times the rate-limit pressure. And now you've blinded yourself to which calls are paying it.
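To make that arithmetic checkable, here is a minimal sketch under the generous assumption that attempts are independent. The function names are mine, chosen for illustration; nothing here comes from a real library.

```python
def observed_failure_rate(p: float, attempts: int = 3) -> float:
    """Chance that every attempt fails, assuming attempts are independent."""
    return p ** attempts


def expected_attempts(p: float, attempts: int = 3) -> float:
    """Average number of attempts actually made per call.

    Attempt k only runs if the previous k - 1 attempts all failed,
    so it runs with probability p ** (k - 1).
    """
    return sum(p ** k for k in range(attempts))


p = 0.025  # the raw per-call failure rate from the text
hidden = observed_failure_rate(p)
print(f"reported failure rate: {hidden:.4%}")
print(f"roughly 1 in {round(1 / hidden):,} calls")
print(f"average attempts per call: {expected_attempts(p):.4f}")
```

Under this assumption the average overhead is small, about 1.03 attempts per call. The cost concentrates on the failing calls, which are exactly the calls the dashboard can no longer point to.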

And that independence assumption is doing enormous unearned work. In an agent loop the retries are almost never independent. If attempt one failed because the upstream context was malformed, attempts two and three fail the same way — you've spent triple the money to fail identically, three times, in silence. The retry didn't add resilience. It added a bill.
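A small demonstration of the correlated case, as a sketch: `naive_retry` and `parse_upstream_context` are hypothetical stand-ins, and the truncated JSON string plays the role of a malformed upstream context. Because the input never changes between attempts, neither does the failure.

```python
import json

# Hypothetical malformed completion: the JSON is cut off mid-object.
MALFORMED = '{"tool": "search", "args": '

attempt_log = []  # one entry per failed attempt, for inspection afterwards


def naive_retry(fn, attempts=3):
    """A deliberately naive helper: retries on any exception, logs nothing."""
    last_error = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
    raise last_error


def parse_upstream_context():
    """Parse the same broken input every time it is called."""
    try:
        return json.loads(MALFORMED)
    except json.JSONDecodeError as exc:
        attempt_log.append(str(exc))
        raise


try:
    naive_retry(parse_upstream_context)
except json.JSONDecodeError:
    identical = len(set(attempt_log)) == 1
    print(f"{len(attempt_log)} attempts, identical error each time: {identical}")
    # prints "3 attempts, identical error each time: True"
```

Three attempts were paid for and the outcome was decided by the first one. The independence arithmetic only applies to failures whose cause can actually change between attempts.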

The number you trust is the number that's lying

The deeper problem isn't the latency or the spend. It's epistemic. A failure rate is a measurement device pointed at your own system, and a silent retry quietly unplugs it.

[Figure: true failure rate (raw call) vs. what retry(3) reports on the dashboard. Week 1: 2.5% real, 0.002% reported. Week 2 (regression): 4.0% real, 0.006% reported. Week 3 (shipped): 6.0% real, 0.022% reported.]

Look at what those bars actually say. The raw call exposes the true failure rate; the retry-wrapped call reports a flat line near zero no matter what the truth underneath is. A call that fails 2.5% of the time is a signal: something is non-deterministic, and here's how loud. When that rate drifts from 2.5% to 6%, the drift is information — a dependency degrading, a prompt change regressing, a model endpoint getting flakier. The retry wrapper flattens all of it into the same featureless green. You lose the trend, which is the single most useful thing a metric can give you.

I learned this the expensive way. During those nine days the underlying failure rate had actually climbed from 2.5% to about 6% — a real regression I'd introduced in a prompt refactor. The dashboard never twitched, because retry(3) absorbs a 6% rate almost as easily as a 2.5% one (0.06³ ≈ 1 in 4,600, still invisible at my traffic). The forcing function that should have stopped me on day one was disconnected by design. I'd built a smoke detector and then wrapped it in a blanket so it wouldn't bother me.
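The flattening is easy to reproduce. This sketch recomputes the three weeks from the chart, again assuming independent attempts; the week labels and raw rates are the ones quoted in this essay, and `reported_rate` is an illustrative name.

```python
# Raw per-call failure rates for the three weeks described in the essay.
weekly_raw_rates = {"week 1": 0.025, "week 2": 0.040, "week 3": 0.060}


def reported_rate(raw: float, attempts: int = 3) -> float:
    """Failure rate visible after retry(attempts), assuming independence."""
    return raw ** attempts


for week, raw in weekly_raw_rates.items():
    shown = reported_rate(raw)
    print(
        f"{week}: raw {raw:.1%} -> reported {shown:.4%} "
        f"(about 1 in {round(1 / shown):,} calls)"
    )

# How much the raw rate moved over the period, as a ratio.
growth = weekly_raw_rates["week 3"] / weekly_raw_rates["week 1"]
print(f"raw rate grew {growth:.1f}x while the reported rate stayed under 0.03%")
```

The raw series more than doubles. The reported series never leaves the bottom pixel of the chart, so there is no trend left to alarm on.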

A green dashboard built on silent retries isn't telling you the system works. It's telling you the system stopped talking.

This is the part that makes silent retry worse than no retry at all. No retry and a 6% failure rate is painful — but painful is honest, and honest is debuggable. Silent retry and a 6% failure rate is comfortable, right up until the moment it's catastrophic, and by then you've shipped a week of features on a foundation you can't trust and can't even see.

The fix is structural, not numeric

The instinct, once you notice the problem, is to tune the number — more retries, longer backoff, jitter the delays. Wrong axis entirely. You can't fix a measurement problem by measuring harder in the same blind spot. The fix is three structural moves, and not one of them is a bigger number:

# the cover-up: the loop eats its own evidence
def call():
    return retry(do_thing, attempts=3)       # green, lying

# the fix: fail loud, log the cause, repeat on purpose
def call():
    try:
        return do_thing()                    # idempotent: safe to repeat
    except ThingError as e:
        log.error("thing_failed",
                  root_cause=e.kind,          # WHY, not "attempt 2 of 3"
                  input_hash=e.input_hash,    # so I can replay it
                  trace=e.trace)
        raise                                 # hand the decision UP

Three moves. Fail loud: let the error surface to whatever layer actually owns the decision, instead of dying anonymously inside a helper. Log the root cause, not the attempt count: root_cause=rate_limit and an input_hash I can replay are worth a thousand lines of retrying… (2/3). And make the call idempotent, so that when you do retry it's a safe operation, not a gamble on a side effect.
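For anyone who wants to run the loud path, here is one self-contained way it could look using only the standard library. Everything in it is an assumption for illustration: the shape of `ThingError`, the 12-character hash, and the toy `do_thing` that rejects empty input. The structured fields travel through `logging`'s `extra` argument.

```python
import hashlib
import logging

log = logging.getLogger("pipeline")


class ThingError(Exception):
    """Hypothetical error carrying a named cause and a replayable input hash."""

    def __init__(self, kind: str, payload: str):
        super().__init__(f"{kind}: {payload[:40]!r}")
        self.kind = kind
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        self.input_hash = digest[:12]  # short key for finding the stored input


def do_thing(payload: str) -> str:
    """Toy step with no side effects, so repeating it is safe."""
    if not payload.strip():
        raise ThingError("empty_input", payload)
    return payload.upper()


def call(payload: str) -> str:
    try:
        return do_thing(payload)
    except ThingError as exc:
        # Record WHY it failed and how to replay it, then hand the decision up.
        log.error(
            "thing_failed",
            extra={"root_cause": exc.kind, "input_hash": exc.input_hash},
            exc_info=True,
        )
        raise
```

The helper makes no decision about recovery. It records the cause once and re-raises, so whichever layer catches `ThingError` sees the same information the log does.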

Then the retry doesn't disappear — it moves up, out of the anonymous helper and into the orchestrator, where it becomes a visible, reasoned decision with a paper trail:

# the orchestrator retries deliberately: a named class, a ceiling, an alarm
def step():
    return orchestrate(
        call,
        retry_when=is_transient,    # NOT "any exception" — a named class
        max_attempts=3,
        on_exhausted=alert,         # exhaustion is an event, not a shrug
        emit=metrics.failure_rate,  # the raw rate still reaches the dashboard
    )

That emit=metrics.failure_rate line is the whole game. The retry still happens, but the true failure rate is recorded before the retry masks it — so the dashboard shows reality and the system still recovers. You get the resilience without the blindfold.
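A minimal sketch of what such an orchestrator might do internally. The keyword names match the snippet above, but the body is my assumption, including the choice that `emit` receives one boolean per attempt and that exhaustion raises a hypothetical `RetriesExhausted`.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetriesExhausted(Exception):
    """Every allowed attempt failed with an error classed as transient."""


def orchestrate(
    call: Callable[[], T],
    retry_when: Callable[[Exception], bool],
    max_attempts: int,
    on_exhausted: Callable[[Optional[Exception]], None],
    emit: Callable[[bool], None],
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            result = call()
        except Exception as exc:
            emit(False)  # record the raw failure before deciding anything
            if not retry_when(exc):
                raise  # not a named transient error: surface it unchanged
            last_error = exc
        else:
            emit(True)
            return result
    on_exhausted(last_error)  # exhaustion is an event someone gets told about
    raise RetriesExhausted(f"gave up after {max_attempts} attempts") from last_error


# Demo: two transient failures, then success. TimeoutError is treated as
# transient here only to keep the demo short.
outcomes, alerts = [], []
script = iter([TimeoutError("slow"), TimeoutError("slow"), "done"])


def flaky():
    item = next(script)
    if isinstance(item, Exception):
        raise item
    return item


result = orchestrate(
    flaky,
    retry_when=lambda exc: isinstance(exc, TimeoutError),
    max_attempts=3,
    on_exhausted=alerts.append,
    emit=outcomes.append,
)
print(result, outcomes)  # prints: done [False, False, True]
```

The caller still gets `done`, but the metric stream holds two failures and one success instead of a single success, which is the record the silent wrapper would have erased.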

[Figure: two flows. Hidden retry, where the loop eats its own evidence: do_thing() → catch + retry; the error never leaves the loop and the dashboard stays green. Loud + idempotent, where the failure leaves the loop: do_thing() → log + raise with root_cause → orchestrator (retry when transient, emit raw rate, alarm on exhausted); the true rate still reaches the dashboard.]

Retry isn't banned. Hidden retry is. The line between the two is whether the failure ever reached a log a human reads. A retry the orchestrator makes on purpose, against a transient error it named, with the exhaustion wired to an alarm and the raw rate still on the dashboard — that's resilience. A retry a helper makes silently, against any exception, that flattens the dashboard and deletes the cause — that's a cover-up wearing resilience's clothes.

What it costs, and what's still open

None of this is free, and I'd be lying by omission if I pretended the loud path has no price. Failing loud means you'll see failures you used to never know about, and the first week after you pull the blanket off the smoke detector is loud — pages you didn't get before, a dashboard that's suddenly honest and therefore ugly. That's not the fix breaking things; that's the fix showing you what was already broken. Sit with the discomfort. It's the cost of getting your measurement device back.

The genuinely open question is where to draw the transient line. retry_when=is_transient is easy to write and hard to define: a 429 is obviously transient, a malformed-JSON parse error obviously isn't, but a tool timeout could be either depending on why it timed out. Get that classifier wrong in the permissive direction and you've quietly rebuilt the cover-up — retry_when=lambda e: True is just retry(3) with extra steps. I don't have a clean rule here; I have a heuristic, which is that anything I can't name a cause for is not transient, it's a bug, and bugs get raised, not retried.
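That heuristic translates naturally into an allowlist that defaults to "not transient". The sketch below is one way to write it; `StepError`, its `cause` attribute, and the cause names are hypothetical, and the allowlist holds only the one case called obviously transient above.

```python
from typing import Optional

# Causes I can name AND believe will clear on their own. Timeouts are left
# out on purpose until the reason for the timeout is known.
TRANSIENT_CAUSES = frozenset({"rate_limit"})  # e.g. an HTTP 429


class StepError(Exception):
    """Hypothetical error type: `cause` is a short name, or None if unknown."""

    def __init__(self, message: str, cause: Optional[str] = None):
        super().__init__(message)
        self.cause = cause


def is_transient(exc: Exception) -> bool:
    """Retry only errors with a named cause on the allowlist.

    Unknown errors, and errors with no recorded cause, count as bugs.
    """
    return getattr(exc, "cause", None) in TRANSIENT_CAUSES


print(is_transient(StepError("429 from provider", cause="rate_limit")))   # True
print(is_transient(StepError("bad completion", cause="malformed_json")))  # False
print(is_transient(StepError("tool timed out")))                          # False
print(is_transient(ValueError("something else")))                         # False
```

Adding a cause to the allowlist becomes a reviewable change with a name attached. The failure mode of this design is retrying too little, which is loud, not retrying too much, which is silent.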

So make it fail where you can see it. A bug you can read is already half fixed; a bug you've retried into silence is one you'll meet again at the worst possible time, at the worst possible scale, in front of the worst possible customer — and it'll still be the same 2.5% you could have caught on day one.
