Agent Harness Engineering – O’Reilly - The Home of WEBFILMBOOKS

This text was initially printed on Addy Osmani’s weblog. It’s being reposted right here with the writer’s permission.

Roughly: Anytime you discover an agent makes a mistake, you are taking the time to engineer an answer such that the agent by no means makes that mistake once more.

We’ve spent the final two years arguing about fashions. Which one is smartest, which one writes the cleanest React, which one hallucinates much less. That dialog is ok so far as it goes, nevertheless it’s lacking the opposite half of the system. The mannequin is one enter right into a working agent. The remaining is the harness: the prompts, instruments, context insurance policies, hooks, sandboxes, subagents, suggestions loops, and restoration paths wrapped across the mannequin so it might truly end one thing.

A good mannequin with an important harness beats an important mannequin with a nasty harness. I’ve watched this play out alone work again and again. And more and more the fascinating engineering isn’t in selecting the mannequin; it’s in designing the scaffolding round it.

That self-discipline now has a reputation. Viv Trivedy coined the time period harness engineering, and his “Anatomy of an Agent Harness” publish is the cleanest derivation of what a harness truly is and why every bit exists. Dex Horthy has been monitoring the sample because it emerges. HumanLayer frames most agent failures as “ability points” that come right down to configuration relatively than mannequin weights. Anthropic’s engineering crew has printed what I feel is the perfect public breakdown of find out how to design a harness for long-running work. And Birgitta Böckeler has a superb overview of what this appears like from the person’s aspect.

This publish is my try to tug these threads collectively.

What’s a harness, actually?

Viv’s one-liner does a lot of the work:

Agent = Mannequin + Harness. When you’re not the mannequin, you’re the harness.

A harness is every bit of code, configuration, and execution logic that isn’t the mannequin itself. A uncooked mannequin will not be an agent. It turns into one as soon as a harness provides it state, device execution, suggestions loops, and enforceable constraints.

Concretely, a harness contains:

System prompts, CLAUDE.md, AGENTS.md, ability recordsdata, and subagent prompts
Instruments, abilities, MCP servers, and their descriptions
Bundled infrastructure (filesystem, sandbox, browser)
Orchestration logic (subagent spawning, handoffs, mannequin routing)
Hooks and middleware for deterministic execution (compaction, continuation, lint checks)
Observability (logs, traces, price and latency metering)

Simon Willison reduces the loop half to its essence: an agent is a system that “runs instruments in a loop to realize a objective.” The ability is within the design of each the instruments and the loop.

If that feels like a variety of floor space, it’s. And it’s your floor space, not the mannequin supplier’s. Claude Code, Cursor, Codex, Aider, Cline: These are all harnesses. The mannequin beneath is usually the identical, however the habits you expertise is dominated by what the harness does.

coding agent = AI mannequin(s) + harness

This equation, articulated by Viv and echoed by HumanLayer, is the place the work truly lives. The talk over the left-hand aspect is loud. Many of the precise leverage sits on the best.

The “ability challenge” reframe

There’s a sample I watch engineers fall into. The agent does one thing dumb, the engineer blames the mannequin, and the blame will get filed underneath “look ahead to the subsequent model.”

The harness-engineering mindset rejects that default. The failure is often legible. The agent didn’t learn about a conference, so that you add it to AGENTS.md. The agent ran a harmful command, so that you add a hook that blocks it. The agent acquired misplaced in a 40-step activity, so that you break up it right into a planner and an executor. The agent stored “ending” damaged code, so that you wire a typecheck back-pressure sign into the loop.

HumanLayer says: “It’s not a mannequin drawback. It’s a configuration drawback.” Harness engineering is what occurs whenever you take that significantly.

There’s a putting knowledge level that exhibits up in each Viv’s write-up and HumanLayer’s. On Terminal Bench 2.0, Claude Opus 4.6 working inside Claude Code scores far decrease than the identical mannequin working in a customized harness. Viv’s crew moved a coding agent from High 30 to High 5 by altering solely the harness. Fashions get posttraining coupled to the harness they have been skilled towards. Shifting them into a special harness, with higher instruments on your codebase, a tighter immediate, and sharper backpressure, can unlock functionality the unique harness was leaving on the ground.

That is the other of the “simply look ahead to GPT-6” narrative. The hole between what immediately’s fashions can do and what you see them doing is essentially a harness hole.

The ratchet: Each mistake turns into a rule

Crucial behavior in harness engineering is treating agent errors as everlasting alerts. Not one-off tales to snicker about, not “dangerous runs” to retry. Alerts.

If the agent ships a PR with a commented-out take a look at and I merge it accidentally, that’s an enter. The subsequent model of my AGENTS.md says “by no means remark out checks; delete them or repair them.” The subsequent model of my precommit hook greps for .skip( and xit( within the diff. The subsequent model of my reviewer subagent flags commented-out checks as a blocker.

You solely add constraints whenever you’ve seen an actual failure. You solely take away them when a succesful mannequin has made them redundant. Each line in a superb AGENTS.md ought to be traceable again to a selected factor that went flawed.

That is additionally why harness engineering is a self-discipline relatively than a framework. The suitable harness on your codebase is formed by your failure historical past. You possibly can’t obtain it.

Working backward from habits

The framing from Viv that I discover most helpful after I’m truly designing a harness is to start out from the habits you need and derive the harness piece that delivers it. His sample: habits we wish (or need to repair) → harness design to assist the mannequin obtain this.

The helpful factor about deriving it this fashion is that each harness part has a selected job. When you can’t identify the habits a part exists to ship, it in all probability shouldn’t be there.

The remainder of this part walks the items in roughly the order Viv does, with the particular patterns I’ve discovered value stealing.

Filesystem and Git: Sturdy state

The filesystem is probably the most foundational primitive, and it tends to be underrated as a result of it’s boring. Fashions can solely straight function on what matches in context. And not using a filesystem, you’re copy-pasting right into a chat window, and that isn’t a workflow.

After you have a filesystem, the agent will get a workspace to learn knowledge, code, and docs; a spot to dump intermediate work as an alternative of holding it in context; and a floor the place a number of brokers and people can coordinate by way of shared recordsdata. Including Git on high provides you versioning without cost, so the agent can observe progress, roll again errors, and department experiments.

Many of the different harness primitives find yourself pointing on the filesystem for one thing.

Bash and code execution: The overall-purpose device

The primary agent loop immediately is a ReAct loop: The mannequin causes, takes an motion through a device name, observes the outcome, and repeats. However a harness can solely execute the instruments it has logic for. You possibly can attempt to prebuild a device for each potential motion, otherwise you can provide the agent bash and let it construct the instruments it wants on the fly.

Willison’s tackle that is that brokers already excel at shell instructions; most duties collapse to some well-chosen CLI invocations. Harnesses nonetheless ship targeted instruments, however bash plus code execution has grow to be the default general-purpose technique for autonomous drawback fixing. It’s the distinction between educating somebody to make use of a single kitchen gadget and handing them a kitchen.

Sandboxes and default tooling

Bash is just helpful if it runs someplace secure. Working agent-generated code in your laptop computer is dangerous, and a single native surroundings doesn’t scale to many parallel brokers.

Sandboxes give brokers an remoted working surroundings. As an alternative of executing regionally, the harness connects to a sandbox to run code, examine recordsdata, set up dependencies, and confirm work. You possibly can allow-list instructions, implement community isolation, spin up new environments on demand, and tear them down when the duty is completed.

sandbox ships with good defaults: preinstalled language runtimes and packages, Git and take a look at CLIs, a headless browser for internet interplay. Browsers, logs, screenshots, and take a look at runners are what let the agent observe its personal work and shut the self-verification loop.

The mannequin doesn’t configure its execution surroundings. Deciding the place the agent runs, what’s out there, and the way it verifies its output are all harness-level calls.

Reminiscence and search: Continuous studying

Fashions don’t have any extra information past their weights and what’s at present in context. With out the power to edit weights, the one manner so as to add information is thru context injection.

The filesystem is once more the primitive. Harnesses assist reminiscence file requirements like AGENTS.md that get injected on each begin. Because the agent edits that file, the harness reloads it, and information from one session carries into the subsequent. It is a crude however efficient type of continuous studying.

For information that didn’t exist at coaching time (new library variations, present docs, immediately’s knowledge), internet search and MCP instruments like Context7 bridge the cutoff. These are helpful primitives to bake into the harness relatively than leaving to the person.

Battling context rot

Context rot is the commentary that fashions worsen at reasoning and finishing duties because the context window fills up. Context is scarce, and harnesses are largely supply mechanisms for good context engineering.

Three strategies present up repeatedly:

Compaction. When the window will get near full, one thing has to present. Letting the API error will not be an possibility for a manufacturing harness, so the harness intelligently summarizes and offloads older context so the agent can hold working.

Software-call offloading. Massive device outputs (suppose 2,000-line log recordsdata) muddle context with out including a lot sign. The harness retains the pinnacle and tail tokens above a threshold and offloads the complete output to the filesystem, the place the agent can learn it on demand.

Abilities with progressive disclosure. Loading each device and MCP into context at startup degrades efficiency earlier than the agent takes a single motion. Abilities let the harness reveal directions and instruments solely when the duty truly requires them.

Anthropic’s harness publish provides another approach for the actually lengthy jobs: full context resets, the place the harness tears the session down and rebuilds it from a compact handoff file. They’re express that compaction alone wasn’t ample for lengthy duties; typically you have to begin recent with a structured transient. That is nearer to how people onboard a brand new engineer than to how we often take into consideration “reminiscence.”

Lengthy-horizon execution: Ralph loops, planning, verification

Autonomous long-horizon work is the holy grail and the toughest factor to get proper. Right this moment’s fashions undergo from early stopping, poor decomposition of advanced issues, and incoherence as work stretches throughout a number of context home windows. The harness has to design round all of that.

I’ve written about autonomous coding loops just like the Ralph loop earlier than in self-improving brokers and in my 2026 developments piece, nevertheless it’s value restating on this framing: A hook intercepts the mannequin’s try to exit and reinjects the unique immediate right into a recent context window, forcing the agent to proceed towards a completion objective. Every iteration begins clear however reads state from the earlier one by way of the filesystem. It’s a surprisingly easy trick for turning a single-session agent right into a multisession one, and it’s the sort of primitive you’d by no means derive from “simply use a wiser mannequin.”

Planning is when the mannequin decomposes a objective right into a sequence of steps, often right into a plan file on disk. The harness helps this with prompting and reminders about find out how to use the plan file. After every step, the agent checks its work through self-verification: Hooks run a predefined take a look at suite and loop failures again to the mannequin with the error textual content, or the mannequin opinions its personal output towards express standards.

Planner/generator/evaluator splits. Anthropic’s long-running harness work is express that separating technology from analysis into distinct brokers outperforms self-evaluation, as a result of brokers reliably skew constructive when grading their very own work. It’s GANs for prose. The associated sample is the dash contract, the place the generator and evaluator negotiate what “performed” truly means earlier than code will get written. In my very own workflows, writing down the performed situation earlier than beginning has caught extra scope drift than any immediate change I’ve ever made.

Hooks: The enforcement layer

Hooks are what separate “I advised the agent to do X” from “the system enforces X.”

A hook is a script that runs at a selected lifecycle level: earlier than a device name, after a file edit, earlier than commit, on session begin. They’re the best place for issues the agent ought to always remember however typically does. Run typecheck and lint and checks after each edit and floor failures. Block harmful bash (rm -rf, git push --force, DROP TABLE). Require approval earlier than opening a PR or pushing to major. Auto-format on write so the agent doesn’t waste tokens on whitespace.

The precept HumanLayer highlights and I’ve come to agree with is: Success is silent; failures are verbose. If typecheck passes, the agent hears nothing. If it fails, the error textual content will get injected into the loop and the agent self-corrects. That makes the suggestions loop virtually free within the widespread case and straight actionable when one thing goes flawed.

AGENTS.md and power selection

The flat markdown rulebook on the root of your repo remains to be the one highest-leverage configuration level, as a result of it lands within the system immediate each flip. Conventions go right here: bundle supervisor, take a look at framework, formatting, “by no means contact /legacy,” “at all times use our logger.” Two hard-won classes:

Hold it brief. HumanLayer retains theirs underneath 60 strains. Each line is competing for consideration, and extra guidelines make every rule matter much less. Pilot’s guidelines, not model information.

Earn every line. Guidelines ought to hint to a selected previous failure or a tough exterior constraint. In the event that they don’t, they’re noise. Ratchet; don’t brainstorm.

Similar self-discipline applies to instruments. Every device’s identify, description, and schema will get stamped into the immediate each request. Ten targeted instruments outperform fifty overlapping ones as a result of the mannequin can maintain the menu in its head. HumanLayer additionally flags an actual safety concern right here: device descriptions populate the immediate, so any MCP server you put in is trusted textual content the mannequin will learn. A sloppy or malicious MCP can prompt-inject your agent earlier than you’ve typed something.

What this appears like in manufacturing

The clearest public image I’ve seen of a mature harness is Fareed Khan’s (estimated) breakdown of Claude Code’s structure.

Virtually each idea from the earlier part exhibits up on this diagram as a named part. Context injection is the information layer. Loop state lives within the reminiscence retailer and the worktree isolator. Damaging-action hooks sit behind the permission gate. Subagent context firewalls are all the multi-agent layer. The device dispatch registry is the place MCP servers and bash each plug in. Khan’s argument is similar as Viv’s, simply labored by way of a delivery product: Claude Code’s trajectory is concerning the harness no less than as a lot as concerning the mannequin beneath it.

Harnesses don’t shrink; they transfer

One of many higher observations within the Anthropic write-up is that as fashions enhance, the house of fascinating harness combos doesn’t shrink. It strikes.

The naive story is that higher fashions make harnesses out of date. If the mannequin can plan, no planner. If the mannequin is coherent at lengthy horizons, no context resets. And sure, Opus 4.6 largely killed the context-anxiety failure mode (Sonnet 4.5 used to wrap up work prematurely because it approached what it thought was its context restrict), which suggests a complete class of anxiety-mitigation scaffolding I used to be writing six months in the past is now useless code.

However the ceiling moved with the mannequin. Duties that have been unreachable are in play, and so they have their very own failure modes. The anxiousness scaffolding goes away, and as a replacement you want a multiday reminiscence coverage or a harness that coordinates three specialised brokers or evaluators for design high quality in generated UIs. The assumptions shift, and so does the scaffolding that encodes them.

Anthropic places it cleanly: “Each part in a harness encodes an assumption about what the mannequin can’t do by itself.” When the mannequin will get higher at one thing, that part turns into load-bearing for nothing and will come out. When the mannequin unlocks one thing new, new scaffolding is required to succeed in the brand new ceiling.

The model-harness coaching loop

The opposite factor that’s taking place, which Viv names explicitly, is a suggestions loop between harness design and mannequin coaching.

Right this moment’s agent merchandise are posttrained with harnesses within the loop. The mannequin will get particularly higher on the actions the harness designers suppose it ought to be good at: filesystem operations, bash, planning, subagent dispatch. That’s why Opus 4.6 feels completely different inside Claude Code than inside another person’s harness, and it’s why altering a device’s logic typically causes unusual regressions. A genuinely common mannequin wouldn’t care whether or not you used apply_patch or str_replace, however cotraining creates overfitting.

The sensible implication is twofold. A harness is a dwelling system, not a config file you arrange as soon as. And the “finest” harness isn’t essentially the one the mannequin was skilled inside; it’s the one designed on your activity. Viv’s High 30 to High 5 Terminal Bench soar is the clearest proof level I’ve seen.

Harness as a service

Viv’s different contribution is the HaaS framing: harness as a service. The commentary is that we’re shifting from constructing on LLM APIs (which offer you a completion) to constructing on harness APIs (which offer you a runtime). The Claude Agent SDK, the Codex SDK, and the OpenAI Brokers SDK all level in the identical path. You get the loop, the instruments, the context administration, the hooks, and the sandbox primitives out of the field, and also you customise them.

The shift issues as a result of the default path was once: construct your individual loop, wire up your individual tool-calling, deal with your individual dialog state, invent your individual approval circulation. Now the default path is: choose a harness framework, configure it alongside the 4 pillars (system immediate, instruments, context, subagents), and put the remainder of your effort into domain-specific immediate and power design.

That’s what makes “ability challenge” tractable. You’re not rebuilding an agent from scratch each time one thing goes flawed. You’re tuning a configuration floor that’s already well-factored.

Viv’s line on that is additionally the perfect argument for beginning messy: “Good agent constructing is an train in iteration. You possibly can’t do iterations in the event you don’t have a v0.1.”

The place that is going

Take a look at the highest coding brokers aspect by aspect (Claude Code, Cursor, Codex, Aider, Cline) and they give the impression of being extra like one another than their underlying fashions do. The fashions are completely different. The harness patterns are converging. I don’t suppose that’s an accident. It’s the trade slowly discovering the load-bearing items of scaffolding that flip a generative mannequin into one thing that may ship.

Viv’s framing of the open issues is the one I discover most fun: orchestrating many brokers working in parallel on a shared codebase; brokers that analyze their very own traces to determine and repair harness-level failure modes; harnesses that dynamically assemble the best instruments and context just-in-time for a given activity as an alternative of being preconfigured at startup.

That final one, particularly, looks like the place harnesses cease being static config and begin turning into one thing nearer to a compiler.

Supply hyperlink

Agent Harness Engineering – O’Reilly