Essay

Agents don't get dumber — their context rots

KC·Jun 21, 2026·9 min read

The model that scored perfectly on your eval starts making careless mistakes forty turns into a run. The weights didn't change. The window did.

Run your agent on a single clean prompt and it nails the task. Run it across a forty-turn investigation and the quality quietly falls apart: it forgets a constraint stated in turn three, fixates on a stale detail, repeats a step it already finished. The instinct is to blame the model. The model is fine — the same weights answered the clean prompt perfectly. What changed is the context. The window the model reads from has rotted.

Context rot is the gap between a model's performance on a tidy prompt and its performance when the same information is buried in a long, noisy, accumulating transcript. It is not a property of the model. It is a property of how you fed it.

Why long runs degrade

Three mechanisms compound as a run grows.

Token accumulation. The naive loop appends every thought, tool call, and observation to one ever-growing array. By turn forty you are sending tens of thousands of tokens, most irrelevant to the current decision. Signal-to-noise collapses. The model is not reasoning over your task; it is reasoning over a transcript of everything that ever happened, with the task one sentence somewhere in the middle.

Distractor dilution. Every irrelevant token in the window is a distractor. A tool that returned a 4,000-token JSON blob twenty turns ago is still sitting there, pulling attention away from the three facts that matter now. More context is not more knowledge — past a point it is more noise, and attention is a finite resource being spent on garbage.

Lost in the middle. Models attend most reliably to the start and end of their context, least reliably to the middle. In a long run, the constraint you stated early gets pushed into exactly that low-attention band. It is technically present and effectively invisible.

A bigger window is a bigger bucket, not a better memory. The fix is to put less in the bucket, not to buy a bigger one.

The four levers

Stop treating the window as a dumping ground. Treat it as a token budget you assemble deliberately on every turn. Four levers spend that budget well.

Retrieval — pull only what this turn needs. Store knowledge externally and fetch the few relevant chunks per turn instead of carrying every document inline. The window holds what is relevant now, not everything that might ever be.

Compaction — summarize old turns. The most recent turns stay verbatim because they carry the live working state. Older turns collapse into a dense summary that preserves decisions and constraints while shedding the raw transcript.

Scratchpads — externalize working memory. Let the agent write durable notes — a plan, a checklist, intermediate results — to a store outside the window, and read them back on demand. The window stops being the only memory the agent has.

Isolation — spawn sub-agents with clean contexts. A noisy subtask (search a codebase, parse a large file) runs in its own fresh window and returns only a distilled result. The parent's context never sees the mess.

Here is a context assembler that applies the first two levers within a hard budget:

interface Turn {
  role: "user" | "assistant" | "tool";
  content: string;
  tokens: number;
}

interface AssembleArgs {
  system: string;
  query: string;
  history: Turn[];
  budget: number; // total token ceiling for the request
  keepLastK: number; // recent turns kept verbatim
}

declare function countTokens(s: string): number;
declare function summarize(turns: Turn[]): Promise<string>;
declare function retrieve(query: string, budget: number): Promise<string[]>;

async function assembleContext(a: AssembleArgs): Promise<string> {
  let remaining = a.budget - countTokens(a.system) - countTokens(a.query);

  // 1. Recent turns are kept verbatim — they hold the live state.
  const recent = a.history.slice(-a.keepLastK);
  remaining -= recent.reduce((n, t) => n + t.tokens, 0);

  // 2. Older turns are compacted into one summary.
  const older = a.history.slice(0, -a.keepLastK);
  let summary = "";
  if (older.length > 0) {
    summary = await summarize(older);
    remaining -= countTokens(summary);
  }

  // 3. Spend whatever budget is left on retrieved, relevant docs.
  const docs = remaining > 0 ? await retrieve(a.query, remaining) : [];

  // 4. Structure the window: high-attention bands top and bottom,
  //    compacted history in the low-attention middle.
  return [
    a.system,
    docs.length ? `# Relevant context\n${docs.join("\n\n")}` : "",
    summary ? `# Earlier in this run\n${summary}` : "",
    recent.map((t) => `${t.role}: ${t.content}`).join("\n"),
    `user: ${a.query}`,
  ]
    .filter(Boolean)
    .join("\n\n");
}

Structure matters as much as the budget. Retrieved docs and the summary sit near the top; the live recent turns and the current query sit at the bottom — the two high-attention bands. The low-attention middle holds the compacted history, the part you can most afford to have read loosely.

Context engineering over prompt engineering

For a one-shot task, prompt engineering — wording, examples, structure — is the lever that matters. For a long-horizon agent, it is nearly irrelevant. No amount of clever wording survives forty turns of accumulated noise. What matters is context engineering: deciding, on every single turn, what the model gets to see and what it does not.

The window is not free storage. Every token you add costs attention from every other token. Budget it like the scarce resource it is, assemble it fresh each turn, and the model that aced your clean prompt keeps acing it at turn one hundred — because you kept its context as clean as the prompt it started with.

← Browse all