Durable agents: event-source the run so a crash resumes mid-thought
Long agent runs are expensive and side-effecting. Treat them like a database: store the events, fold them into state, and replay on crash without re-firing effects.
A real agent run is a long chain of expensive, irreversible steps: it calls a model fifty times, hits a dozen APIs, charges a card, sends an email. Now the process crashes at step 40. What happens next decides whether your agent is a toy or a system.
If the run restarts from zero, you have burned 39 steps of model spend and — worse — you are about to charge that card a second time. If the run is simply lost, the user is left with a half-finished task and no way to recover it. Neither is acceptable for anything that touches money, infrastructure, or a customer's inbox.
The fix is the one databases have used for decades: don't store where the agent is — store how it got there.
The problem with storing state
The obvious approach is to checkpoint the agent's current state to a row and update it each step. It fails for two reasons.
First, the state is a moving target. An agent's "state" is its full context, plan, partial results, and pending tool calls — a tangle that reshapes every turn. Serializing it correctly each step and deserializing it back into a runnable agent is brittle, and one schema change orphans every in-flight run.
Second, a checkpoint loses the path. You know the agent is at step 40, but not what it tried, what it observed, or why it took this branch. You cannot audit it, debug it, or replay it. The most valuable artifact of an agent run — the sequence of decisions — is exactly what a snapshot throws away.
Event sourcing applied to agents
Invert it. Instead of storing state and mutating it, store an append-only event log of everything that happened and derive the current state by folding the log. Each step appends one or more immutable events: a thought, then either a tool call and its observation or a final result. State is never written directly. It is always reduce(events).
type Event =
  | { type: "thought"; text: string }
  | { type: "tool_call"; id: string; tool: string; args: unknown; idemKey: string }
  | { type: "observation"; callId: string; result: unknown }
  | { type: "done"; result: unknown };

interface RunState {
  thoughts: string[];
  pending: Map<string, { tool: string; args: unknown; idemKey: string }>;
  finished: boolean;
  result?: unknown;
}

const empty = (): RunState => ({
  thoughts: [],
  pending: new Map(),
  finished: false,
});

// Current state is a pure fold over the immutable log.
function reduce(events: readonly Event[]): RunState {
  return events.reduce<RunState>((s, e) => {
    switch (e.type) {
      case "thought":
        s.thoughts.push(e.text);
        return s;
      case "tool_call":
        s.pending.set(e.id, { tool: e.tool, args: e.args, idemKey: e.idemKey });
        return s;
      case "observation":
        s.pending.delete(e.callId); // call resolved — no longer in flight
        return s;
      case "done":
        s.finished = true;
        s.result = e.result;
        return s;
    }
  }, empty());
}
The log is the source of truth. The RunState is a cache you can throw away and rebuild from the log at any time. That single property — state is derived, never authoritative — is what makes the run durable.
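To make the "cache you can rebuild" claim concrete, here is a minimal sketch. It assumes a trimmed-down event type and an immutable fold (the names MiniEvent, MiniState, and fold are hypothetical, not the definitions above) and a hand-written sample log in which the process died after issuing a second call.

```typescript
// Trimmed-down event shape for illustration only (hypothetical names).
type MiniEvent =
  | { type: "tool_call"; id: string; tool: string }
  | { type: "observation"; callId: string }
  | { type: "done" };

interface MiniState {
  pending: string[]; // ids of calls that have no observation yet
  finished: boolean;
}

// Fold the log into state without mutating any input.
function fold(events: readonly MiniEvent[]): MiniState {
  return events.reduce<MiniState>(
    (s, e) => {
      switch (e.type) {
        case "tool_call":
          return { ...s, pending: [...s.pending, e.id] };
        case "observation":
          return { ...s, pending: s.pending.filter((id) => id !== e.callId) };
        case "done":
          return { ...s, finished: true };
      }
    },
    { pending: [], finished: false },
  );
}

// Sample log: the first call resolved, the second was still in flight.
const sampleLog: MiniEvent[] = [
  { type: "tool_call", id: "c1", tool: "search" },
  { type: "observation", callId: "c1" },
  { type: "tool_call", id: "c2", tool: "charge_card" },
];

const first = fold(sampleLog);
const rebuilt = fold(sampleLog); // throw `first` away and rebuild from the log
```

Folding the same log twice yields structurally equal states, and the unresolved call c2 shows up in pending both times. Nothing outside the log is needed to reconstruct either one.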
Resume without re-firing effects
Replay is where event sourcing earns its keep, and where it can bite you. Rebuilding state means reloading the log and folding it, and a tool_call event with no matching observation marks a call that was in flight when the process died. The log alone cannot tell you whether the effect fired: the crash may have come before the request went out, or after the card was charged but before the result was recorded. You have to re-drive the call, but the effect must not fire twice. The discipline is one rule: every tool call carries an idempotency key, and the executor dedupes on it. A repeated key returns the prior result instead of re-performing the effect, exactly like a payment API's idempotency header.
interface Store {
  load(runId: string): Promise<Event[]>;
  append(runId: string, e: Event): Promise<void>;
}

interface Tools {
  // Dedupes on idemKey: a repeat returns the prior result, no re-effect.
  run(tool: string, args: unknown, idemKey: string): Promise<unknown>;
}

interface Model {
  propose(state: RunState): Promise<{
    thought: string;
    toolCall?: { tool: string; args: unknown };
    result?: unknown;
  }>;
}

async function resume(runId: string, store: Store, model: Model, tools: Tools) {
  let state = reduce(await store.load(runId)); // fast-forward, no effects

  // Re-drive any call that crashed before its observation landed. Safe to
  // repeat: the idempotency key collapses a re-run into the original effect.
  for (const [callId, call] of state.pending) {
    const result = await tools.run(call.tool, call.args, call.idemKey);
    await store.append(runId, { type: "observation", callId, result });
  }
  state = reduce(await store.load(runId));

  // Continue the run forward from real, rebuilt state.
  while (!state.finished) {
    const next = await model.propose(state);
    await store.append(runId, { type: "thought", text: next.thought });
    if (next.toolCall) {
      const id = crypto.randomUUID();
      // Deterministic key: thoughts.length grows by one per loop iteration.
      const idemKey = `${runId}:${state.thoughts.length}:${next.toolCall.tool}`;
      // Log the call and its key before firing the effect, so a crash after
      // this line leaves an orphaned tool_call that the next resume re-drives.
      await store.append(runId, { type: "tool_call", id, idemKey, ...next.toolCall });
      const result = await tools.run(next.toolCall.tool, next.toolCall.args, idemKey);
      await store.append(runId, { type: "observation", callId: id, result });
    } else {
      await store.append(runId, { type: "done", result: next.result });
    }
    state = reduce(await store.load(runId));
  }
  return state.result;
}
The idempotency key is the whole trick. It is derived deterministically from the runId, the thought count, and the tool name, and it is written to the log in the tool_call event before the effect fires, so a re-driven call presents exactly the key the original call did. tools.run recognizes the key and returns the recorded result instead of charging again. Combined with the append-only log, this turns a crash into routine recovery: reload, fold, re-drive any orphaned call safely, continue.
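The text leaves the inside of tools.run unspecified, so the following is only one assumed shape for the dedupe step. DedupingExecutor, keyFor, and chargeCard are hypothetical names, the executor is synchronous for brevity, and its record lives in memory. A production version would need that record to survive the crash, either in durable storage or by passing the key to a downstream API that honours idempotency keys itself.

```typescript
// Hypothetical in-memory executor that dedupes effects by idempotency key.
// Synchronous for brevity; a real one would be async and persist `results`.
class DedupingExecutor {
  private results = new Map<string, unknown>();

  // Runs `effect` at most once per key; a repeat returns the recorded result.
  run<T>(idemKey: string, effect: () => T): T {
    if (this.results.has(idemKey)) {
      return this.results.get(idemKey) as T;
    }
    const result = effect();
    this.results.set(idemKey, result);
    return result;
  }
}

// Deterministic key built from the same three parts the article uses.
const keyFor = (runId: string, thoughtCount: number, tool: string): string =>
  `${runId}:${thoughtCount}:${tool}`;

let charges = 0;
const chargeCard = () => {
  charges += 1; // stands in for the irreversible side effect
  return { receipt: `r-${charges}` };
};

const executor = new DedupingExecutor();
const key = keyFor("run-1", 40, "charge_card");

const original = executor.run(key, chargeCard);
const redriven = executor.run(key, chargeCard); // same key: no second charge
```

The second call never reaches chargeCard. It gets back the object recorded for the first call, which is the behaviour the resume loop relies on when it re-drives an orphaned tool_call.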
What durability gives you for free
Event sourcing is not only a crash-recovery trick. The same log buys three more things at no extra cost. The first is an audit trail: every thought and action of a nondeterministic agent, in order and immutable. The second is time-travel debugging: fold the log up to any event and inspect the exact state the agent saw at step 23. The third is deterministic replay: because the model's thoughts and the tools' observations are both recorded, you can step through the agent's decisions against fixed inputs, without calling the model or the tools again, and reproduce a bug that surfaces once in a thousand runs.
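Time-travel debugging amounts to folding a prefix of the log. The sketch below assumes simplified event and state shapes (StepEvent, Snapshot, foldSteps, and stateAt are hypothetical names, and the sample history is invented for illustration).

```typescript
// Simplified event and state shapes for illustration (hypothetical names).
type StepEvent =
  | { type: "thought"; text: string }
  | { type: "observation"; result: string };

interface Snapshot {
  thoughts: string[];
  observations: string[];
}

// Fold events into a fresh snapshot; the input array is never modified.
function foldSteps(events: readonly StepEvent[]): Snapshot {
  const snap: Snapshot = { thoughts: [], observations: [] };
  for (const e of events) {
    if (e.type === "thought") snap.thoughts.push(e.text);
    else snap.observations.push(e.result);
  }
  return snap;
}

// State as it stood after the first `n` events: fold a prefix of the log.
function stateAt(events: readonly StepEvent[], n: number): Snapshot {
  return foldSteps(events.slice(0, n));
}

const history: StepEvent[] = [
  { type: "thought", text: "look up the order" },
  { type: "observation", result: "order found" },
  { type: "thought", text: "issue the refund" },
];

// What the agent knew just before it decided to issue the refund.
const beforeRefund = stateAt(history, 2);
console.log(beforeRefund);
```

Because stateAt only reads a slice of the log, inspecting an earlier point never disturbs the run itself. The same call with a larger n walks forward one event at a time.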
A nondeterministic agent does not have to be an unaccountable one. Store the events, not the state; fold to recover; keep the tool boundary idempotent — and a run that crashes at step 40 picks up mid-thought as if nothing happened.