
Harness engineering is control theory for a stochastic component

KC·Jun 25, 2026·13 min read

Same prompt, different output, every time, and no prompt drives that to zero. Agent reliability is not a prompting problem. It is a control problem, and harness engineering is the controller you build around a plant you do not get to fix.

Every honest conversation about agent reliability eventually arrives at the same uncomfortable admission: you cannot make a language model do the same thing twice. Same prompt, same tools, same temperature pinned to zero if you like — and it will still, some fraction of the time, call the wrong tool, skip a step you told it not to skip, or invent an argument that was never in the schema. No prompt drives that fraction to zero. I have never seen one, and I have stopped expecting to.

So I want to retire the framing that treats this as a prompting problem, because it is not. It is a control problem, and it has a control-theoretic answer. Here is the thesis, and I am going to state it exactly once: the model is the plant — a stochastic, non-deterministic component you do not get to fix — and the harness is the controller you wrap around it to produce reliable, durable behavior at scale. Everything that follows is what that controller is actually made of. The model is commodity; the engineering lives in the controller.

The control problem

Borrow the vocabulary, because it fits with no stretching. In control theory, a plant is the system you are trying to govern — a motor, a reactor, a process — characterized by dynamics you cannot rewrite, only steer. A controller sits around the plant and drives it toward a setpoint despite noise and disturbance. A language model is a plant with one unusual property: its disturbance is internal. The noise is not wind on the airframe; it is the sampling distribution itself. You cannot tune it out of the plant.

What you can do is the only thing any control engineer does with a noisy plant: you stop trying to perfect the plant and you build the loop around it. Two halves, and they have precise names that Martin Fowler has already given the agent world. Feedforward control acts before the plant does — it shapes the input so disturbances are anticipated and suppressed in advance. Feedback control acts after the plant responds — it measures the output against a reference and corrects the error. An agent harness is exactly this: feedforward to constrain what the model can do, feedback to verify what it did. Fowler's definition of the harness is the widest and the right one — everything in an agent except the model itself — and the two control halves are how that "everything" earns its keep.

Feedforward: shrink the choice space until bad output is impossible

Feedforward is the cheaper half because it is preventive. The goal is not to discourage the model from emitting a bad action; it is to make whole classes of bad action unrepresentable. You do not ask politely. You narrow the channel.

The first lever is the action space itself. A model choosing among forty tools is a model with forty ways to be wrong; curation is feedforward control on the selection problem. When Vercel cut roughly 80% of v0's tools, they reported fewer steps, fewer tokens, and faster responses — not because the model got smarter but because its choice space got smaller and the controller got tighter. Fewer, sharper tools is a control decision, not a UX one.
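As a minimal sketch of curation as a control decision, the harness can expose only the tools a given task needs. The catalog, the task types, and the `tools_for` helper below are all hypothetical names invented for illustration; they do not come from Vercel or any real SDK.

```python
# Hypothetical tool catalog; names and descriptions are illustrative only.
CATALOG = {
    "read_file": "Read a file from the workspace",
    "write_file": "Write a file to the workspace",
    "run_tests": "Run the project's test suite",
    "send_email": "Send an email to a customer",
    "refund_order": "Refund a customer order",
}

# Assumed mapping from task type to the tools that task is allowed to see.
ALLOWLIST = {
    "code_fix": ["read_file", "write_file", "run_tests"],
    "support": ["refund_order", "send_email"],
}


def tools_for(task_type: str) -> dict[str, str]:
    """Return only the tools exposed for this task type; unknown types get none."""
    names = ALLOWLIST.get(task_type, [])
    return {name: CATALOG[name] for name in names}
```

A tool that is never offered cannot be selected by mistake, which is the sense in which the allowlist acts before the plant does.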

The second lever is the shape of the action. Free-text output is an unbounded plant; a JSON-schema-constrained tool call is a bounded one. If the provider constrains decoding to the schema and the runtime accepts nothing but a structurally valid action, a malformed action can never reach your tools: the invalid region of the output space has been removed before the model ever samples.

# Feedforward: the model can only emit a STRUCTURALLY valid action.
TOOLS = [{
    "name": "refund_order",
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "pattern": "^ord_[a-z0-9]{12}$"},
            "amount_cents": {"type": "integer", "minimum": 1, "maximum": 500_00},
            "reason": {"enum": ["defective", "late", "duplicate"]},
        },
        "required": ["order_id", "amount_cents", "reason"],
        "additionalProperties": False,
    },
}]

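# Index the tool specs by name so dispatch can look up each schema.
TOOLS_BY_NAME = {tool["name"]: tool for tool in TOOLS}

def dispatch(call):
    # validate, require_scope, current_actor, and REGISTRY are harness helpers
    # assumed to exist elsewhere.
    # The provider constrains generation to the schema; re-validate at the
    # boundary anyway: trust the contract, verify the bytes.
    validate(call.input, TOOLS_BY_NAME[call.name]["input_schema"])  # raises on drift
    require_scope(call.name, actor=current_actor())                 # authz is feedforward too
    return REGISTRY[call.name](**call.input)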

Three classes of failure are gone before the loop even turns: malformed arguments, out-of-range values, and unauthorized actions. Permission scoping belongs here too — authorization is feedforward control, deciding what the plant is allowed to drive before it tries. None of this is prompt-tweaking. It is choosing the model's reachable state space.
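The `validate` call in `dispatch` is left abstract. One way to picture the boundary check, using only the standard library, is a hand-rolled validator for the `refund_order` schema. This is a sketch for this one schema, not a general JSON Schema implementation; `validate_refund` is a hypothetical name, and a real harness would more likely delegate to a schema library.

```python
import re

ORDER_ID = re.compile(r"^ord_[a-z0-9]{12}$")
REASONS = {"defective", "late", "duplicate"}
FIELDS = {"order_id", "amount_cents", "reason"}


def validate_refund(args: dict) -> None:
    """Raise ValueError unless args satisfies the refund_order schema."""
    # "required" plus "additionalProperties": False means exactly these keys.
    if set(args) != FIELDS:
        raise ValueError(f"expected exactly the fields {sorted(FIELDS)}")

    order_id = args["order_id"]
    if not isinstance(order_id, str) or not ORDER_ID.fullmatch(order_id):
        raise ValueError("order_id must match ^ord_[a-z0-9]{12}$")

    amount = args["amount_cents"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int):
        raise ValueError("amount_cents must be an integer")
    if not 1 <= amount <= 50_000:
        raise ValueError("amount_cents must be between 1 and 50000")

    reason = args["reason"]
    if not isinstance(reason, str) or reason not in REASONS:
        raise ValueError(f"reason must be one of {sorted(REASONS)}")
```

Each `ValueError` message is written so it can be handed straight back to the model as an observation.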

Feedback: measure against ground truth, not against the model's mood

Feedforward lowers the error rate; it never reaches zero, because the plant is stochastic. So you close the loop. Feedback control observes the actual output, compares it to a reference, and feeds the error back as a correction. In an agent, the reference is ground truth — a test, a type-check, a schema, an assertion — and the correction is the error string itself, handed back to the model as the next input.

Fowler draws the distinction that matters most for cost here: feedback controls come in two kinds. Computational sensors are deterministic and fast — tests, linters, type-checkers, schema validation — milliseconds, reliable, run on every single action. Inferential sensors are the expensive, non-deterministic kind — LLM-as-judge, semantic review — and you reserve them for late stages where a deterministic check cannot reach. The engineering discipline is to push as much verification as possible onto the computational side, because a deterministic sensor is a sensor you can trust, and a non-deterministic sensor is another plant you now have to control.
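The ordering rule can be sketched as a small pipeline that spends an inferential check only after every computational check has passed. The `Sensor` type and `run_sensors` function are assumptions made for this illustration, not part of any named framework.

```python
from typing import Callable, Optional

# A sensor returns None when the action passes, or an error string when it fails.
Sensor = Callable[[dict], Optional[str]]


def run_sensors(
    action: dict,
    computational: list[Sensor],
    inferential: list[Sensor],
) -> Optional[str]:
    """Run cheap deterministic sensors first; use expensive ones only if all pass."""
    for sensor in computational:
        error = sensor(action)
        if error is not None:
            return error  # short-circuit: no judge call is spent on this action
    for sensor in inferential:
        error = sensor(action)
        if error is not None:
            return error
    return None
```

Because the deterministic stage short-circuits, the cost of the non-deterministic stage scales with the number of actions that already look valid rather than with every action the model emits.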

# Feedback: propose -> check against ground truth -> feed the error BACK.
def run_step(model, ctx, max_tries=3):
    for attempt in range(max_tries):
        action = model.propose(ctx)             # the ONE stochastic call
        ok, error = check(action)               # computational sensor: deterministic
        if ok:
            return execute(action)
        # the error string is optimized for the model's own consumption
        ctx = ctx.append_observation(
            f"action rejected: {error}. fix and re-emit a valid action.")
    raise StepExhausted(ctx)                     # fail loud; never silently "done"

The non-obvious move is the last line. The most dangerous output a harness can accept is the model's own claim that it finished. "Done because the model said so" and "done because a test passed" are different epistemic objects, and only one of them is a fact. A controller that trusts the plant's self-report of success is open-loop wearing a closed-loop costume. Done is a sensor reading, never a sentiment.

Determinism: only one box in the loop is stochastic

Step back and look at what the two halves buy you together, because this is the insight the whole discipline rests on. Trace the loop: validate the schema, check authorization, dispatch the tool, run the test, append the observation, checkpoint the state. Every one of those is ordinary deterministic code. The model call — model.propose(ctx) — is the only stochastic operation in the entire system. The harness's job is to take that one probabilistic emission and collapse it, immediately, into a typed and validated action, so that the non-determinism never propagates past the single line where it is born.
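The collapse step might look like the following sketch, which assumes the model's raw emission arrives as a JSON string with `tool` and `args` keys. The `Action` type, `InvalidAction` error, and `KNOWN_TOOLS` registry are hypothetical names for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    tool: str
    args: dict


class InvalidAction(ValueError):
    """Raised when a model emission cannot be turned into a typed Action."""


KNOWN_TOOLS = frozenset({"refund_order"})  # assumed registry for this sketch


def parse_action(raw: str) -> Action:
    """Collapse raw model text into a typed Action, or raise InvalidAction."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidAction(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict) or set(data) != {"tool", "args"}:
        raise InvalidAction("expected an object with exactly 'tool' and 'args'")
    tool, args = data["tool"], data["args"]
    if not isinstance(tool, str) or tool not in KNOWN_TOOLS:
        raise InvalidAction(f"unknown tool: {tool!r}")
    if not isinstance(args, dict):
        raise InvalidAction("'args' must be an object")
    return Action(tool=tool, args=args)
```

Everything downstream of `parse_action` receives either an `Action` or an exception, never a string of uncertain shape.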

This is why determinism is a property of the interface, not the model. The arXiv work Adapting the Interface, Not the Model (2605.22166) makes the point empirically: a frozen LLM, weights untouched, wrapped in a lifecycle-aware runtime harness, improved on 116 of 126 model–environment settings across eighteen model backbones — an average relative improvement near 88% — and the harness evolved on one small model transferred to seventeen others. The reliability lived in the interface, and it was reusable across plants. You do not make the model deterministic. You build a deterministic envelope so tight that the model's randomness has nowhere to go.

Durability: surviving the two-hundredth tool call

Single-turn quality is a solved-enough problem. The frontier, as Phil Schmid frames it, is durability — how well a model follows instructions while executing hundreds of tool calls over time. A one-percent leaderboard gap tells you nothing about whether an agent drifts off-track after its fiftieth step; durability is the property that separates a demo from a system, and it is entirely the controller's responsibility.

A controller that runs for hours must survive crashes without restarting from zero. That is durable execution, and it is a known systems discipline — Temporal, DBOS, and Prefect exist precisely so a process that dies at step 200 resumes at step 200 instead of replaying 199 stochastic calls you cannot reproduce. The pattern is checkpoint, execute, commit, made resumable:

# Durability: a crash at step 200 RESUMES, it does not restart.
@workflow
def agent_run(task):
    state = load_checkpoint(task.id) or init_state(task)
    while not state.done:
        action = step.execute(state)            # durably-recorded activity
        result = tool.execute(action)           # idempotency key = action.id
        state = state.apply(result)
        persist_checkpoint(task.id, state)       # committed before the next turn
    return state.answer
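The `load_checkpoint` and `persist_checkpoint` calls above are left abstract. A minimal file-based sketch follows, assuming the state is JSON-serializable and adding a `directory` parameter that the original signatures do not have. It relies on `os.replace`, which swaps the file in a single step when source and destination are on the same filesystem; an engine such as Temporal or DBOS would replace all of this with its own persistence.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def persist_checkpoint(directory: Path, task_id: str, state: dict) -> None:
    """Write to a temp file, then rename over the target so readers never see a partial file."""
    directory.mkdir(parents=True, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, directory / f"{task_id}.json")
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # do not leave stray temp files behind
        raise


def load_checkpoint(directory: Path, task_id: str) -> Optional[dict]:
    """Return the last committed state, or None if the task has never checkpointed."""
    path = directory / f"{task_id}.json"
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```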

Two things make this hold over the long run. Tool calls are idempotent — keyed on the action id — so a retry after a partial failure does not double-charge a card or send a message twice. And the filesystem, not the context window, is the controller's primary state store: you offload artifacts to disk, carry a handle in context, and re-open on demand, which is what keeps a thousand-step run from drowning in its own history. Context engineering — compaction, offload, sub-agent isolation — is real, but in this frame it is one line of the controller's job description: keep working memory bounded so the plant stays in its operating range.
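Idempotency keyed on the action id can be sketched as follows. `IdempotentExecutor` is a hypothetical class, and its in-memory dictionary stands in for the durable store a real harness would need.

```python
from typing import Any, Callable


class IdempotentExecutor:
    """Runs each action id at most once; a retry returns the recorded result."""

    def __init__(self) -> None:
        # In-memory stand-in for a durable result store (an assumption of this sketch).
        self._results: dict[str, Any] = {}

    def execute(self, action_id: str, tool: Callable[..., Any], **args: Any) -> Any:
        if action_id in self._results:
            # Retry path: replay the stored result and skip the side effect.
            return self._results[action_id]
        result = tool(**args)
        self._results[action_id] = result
        return result
```

This sketch only protects against retries that happen after the result was recorded. A crash between the tool call and the write would still repeat the side effect, which is why payment-style APIs usually accept the idempotency key themselves and deduplicate on their side.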

Scale, and the law

Cost and throughput fall out of the same controller, not out of the model. Stable prompt prefixes keep the KV-cache warm; bounded context keeps each call cheap; back-pressure and concurrency limits keep a fleet of agent runs from exhausting your rate limits. These are not incidental details; they are control parameters to set. The model vendor does not tune them for you. The harness does, or no one does.
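A concurrency limit is the simplest of these parameters to show. The sketch below uses `asyncio.Semaphore` to cap how many agent runs are in flight at once; `run_fleet` and its `run_one` callback are hypothetical names, and the default of 4 is an arbitrary placeholder rather than a recommendation.

```python
import asyncio


async def run_fleet(tasks, run_one, max_concurrent: int = 4):
    """Run every task, with at most max_concurrent of them in flight at a time."""
    gate = asyncio.Semaphore(max_concurrent)

    async def guarded(task):
        async with gate:  # waits here while the fleet is at its limit
            return await run_one(task)

    # gather preserves input order in its results.
    return await asyncio.gather(*(guarded(t) for t in tasks))
```

Setting `max_concurrent` from the provider's rate limit and the expected tokens per call turns the limit into a tuned value instead of a guess.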

Which is the whole point, and the reason this is a durable discipline rather than a season of prompt fashion. The plant is swappable — and you should design for it to be swapped. Schmid's rule is build to delete: write every harness component so that when the next model absorbs its job, you can rip the component out instead of accreting rigid control flow around a model that no longer needs it. Manus rebuilt their framework four times; LangChain shipped three architectures of Open Deep Research. That churn is not failure. It is the controller adapting to a changing plant while the controller's purpose — feedforward to shape the input, feedback to verify the output, durable execution to survive failure — stays fixed.

That purpose is the engineering. It is classical control theory pointed at a stochastic software component, and it is where the reliability, the value, and the moat actually live. The model is the part you rent. The controller is the part you build, and it is the only part that was ever yours.

What to take from this

Engineer the controller, not the prompt. The model is a plant whose noise lives in the sampling distribution — you cannot tune it out. Hours spent perfecting the prompt are hours not spent on the loop that actually governs the plant.
Feedforward first: make bad actions unrepresentable, not discouraged. Curate the toolset hard, constrain every action to a JSON schema, and scope authorization before the call. Delete the invalid region of the output space so the model cannot reach it.
Close the loop on ground truth, not the model's mood. Verify with deterministic sensors — tests, types, schemas — and feed the error back as the next input. "Done because a test passed" is a fact; "done because the model said so" is a sentiment. Never ship the second.
Keep the stochasticity to one line. Validate, authorize, dispatch, test, and checkpoint are all deterministic code. Collapse the single model call into a typed, validated action immediately, so non-determinism dies where it is born instead of propagating through the run.
Treat durability as a systems problem. Durable execution (Temporal, DBOS, Prefect), checkpoint-resume, idempotent tool calls, and the filesystem as state are what let a crash at the two-hundredth tool call resume instead of replaying 199 calls you cannot reproduce.
Build to delete. The model is the part you rent, and the next one will be better — so write every harness component to be ripped out when the model absorbs its job, not wrapped in control flow a smarter plant has already outgrown.
