AI Engineering · 2026-06-05 · 7 min read

Reliability Is a Harness Problem, Not Just a Model Problem

The smartest model still ships broken work if the harness around it is weak. Here is what I do to keep coding agents reliable once they hit production.

In 2026 the industry finally stopped arguing about whether AI agents are real and started getting burned by how they fail. That is progress, even if it came the hard way. The lesson I keep coming back to, the one this year's public incidents reinforced for everyone, is that reliability lives in the harness around the model as much as in the model itself.

A model regression. A dropped piece of reasoning history. A default setting that quietly changed under you. Any one of these can turn a capable agent into one that ships broken work, and the cruel part is that you usually find out in production, after the bad output already went somewhere it matters. I do not trust an agent because it is smart. I trust it because the harness I built around it makes its failures visible and cheap.

This is the builder's-seat view of what I actually do about that. Not a framework to copy, the methodology, and why each piece earns its place.

The agent did not get dumber. The harness changed under it.

The defining pattern of 2026 was capable coding agents suddenly producing worse work, and the cause often was not the model. Public reporting this year traced quality dips to harness-level causes: a changed default effort setting here, a bug that dropped older reasoning history there. The intelligence was intact. The system feeding it context and constraints had shifted.

That is the whole thesis in one observation. Reliability is a property of the system around the model, and the system is the part I can actually control. I cannot stop a hosted model from changing. I can build everything around it so that when it changes, I notice immediately and the blast radius is small.

It also reframed cost. As of June 1, 2026, GitHub Copilot plans moved to usage-based AI credits, and plenty of developers reported sharp cost surprises in the days before the switch. An undisciplined agent run stopped being a free annoyance and became a line item. Capability stopped being the bottleneck. The system around the model became the differentiator, on quality and on spend.

What "breaks in production" actually looks like

Before the fixes, it helps to name the failure modes precisely, because vague worry leads to vague safeguards.

  • Silent failures. The agent returns something plausible and wrong, and nothing flags it. This is the dangerous one, because plausible-and-wrong sails through every check a human does at a glance.
  • Drift. A default or a model version changes, and your behavior changes with it, without a single commit on your side. "It worked last week" is not a status you get to keep for free when you build on hosted models.
  • Cost surprises. A run loops, or over-reads context, or fans out wider than it needed to. With usage-based billing, that is now money, not just minutes.

Every technique below exists to convert one of these from an invisible problem into a visible, bounded one.

Pin what can change under you

The first move is to stop inheriting defaults. Model versions, effort settings, and tool definitions are not things to accept as they come. They are things to pin and version, the same way you pin a dependency.

If a model version is going to determine the quality of what I ship, it belongs in my configuration explicitly, not as "whatever the provider points at today." When a provider updates a default, I want that to be a decision I make on purpose after testing, not a surprise that arrives in my users' output first. Pinning does not stop change. It moves change from something that happens to me into something I choose.

The same logic extends past the model. Effort and reasoning settings, the system prompt, the exact tool definitions the agent is handed: all of it is configuration that shapes behavior, and all of it can drift if you treat it as ambient. I keep these in version control and review changes to them the way I review code, because functionally they are code. A one-word change to a tool description can quietly redirect how an agent uses it. If that change is tracked, I can see it and reason about it. If it lives in a console somewhere as a default, it is a liability I cannot audit. Multi-provider discipline helps here too. When I am not wholly dependent on one lab's defaults, a change on their side is a thing I evaluate, not an emergency I absorb.

Put deterministic orchestration around the non-negotiables

Here is the boundary that matters most in agent design. Let the model drive where judgment genuinely helps, and wrap the steps that must happen every single time in deterministic control flow.

Some things are not the model's call. The order of operations, the validation that has to run, the fan-out that must cover every item: those belong in code I wrote, not in a prompt I am hoping the model honors this time. The model is good at the fuzzy, context-heavy decisions. It is the wrong tool for guaranteeing that a required step ran. Deciding which steps are model-driven and which are deterministic is not an implementation detail. It is the core design decision, and getting it right is most of what separates a reliable agent from a demo that worked once.

The tell is whether a step needs to be correct or merely plausible. Drafting a summary, choosing how to phrase a finding, deciding which file is most relevant: plausible is fine there, and judgment is exactly what the model adds. Running every item in a list, enforcing a schema on the output, checking a result before it moves to the next stage: those need to be correct every time, so I do not leave them to inference. A useful habit is to write the control flow first, in plain code, with the model calls as clearly marked holes in it. That forces me to decide on purpose where judgment lives, instead of letting the prompt slowly absorb responsibilities that should have been guaranteed by the harness.

Evals before vibes

You cannot tell whether the harness changed by looking at a demo. A demo shows you the happy path on a good day. A small, real eval harness, even a modest one built from actual cases, catches regressions a demo never will.

This is how I notice the harness shifted before my users do. When a default changes or a model updates, the eval moves, and I see it in a number instead of in a complaint. I am not talking about a heavyweight academic benchmark. I mean a focused set of real inputs with known-good outputs, run regularly, measuring the things that actually matter for the job. The point is not the framework. The point is that "did this get worse" becomes a question I can answer with data instead of a feeling.

Keep a human where a wrong answer is expensive

Automation is not the goal. Good outcomes are the goal, and human-in-the-loop is not a failure of automation. It is the cheapest insurance you can buy on a high-cost step.

The decision is about asymmetry. Where a wrong answer is cheap and reversible, I let the agent ship straight through and I move on. Where a wrong answer is expensive, irreversible, or lands in front of a customer or a regulator, a human reviews it before it goes out. That is not timidity. It is putting the review where the cost is, and removing it where the cost is not. The skill is knowing which outputs are which, and being honest about it.

The unsexy part is the point

None of this is exciting. Pinning versions, writing evals, drawing the deterministic boundary, deciding what a human checks: it is the boring work that starts after the demo already impressed everyone. And it is the entire job. The demo proves the capability exists. The harness is what makes the capability survive contact with production, a changing provider, and a billing meter that is now running.

That is the difference between selling the model's intelligence and shipping a system you can stand behind. Anyone can wire up a smart model and show a good day. The work is building the part that makes the bad days visible and cheap, which is the part you actually own.

This is the seat I work from on every build. If you want a partner who treats reliability as the job and not the afterthought, that is what my agent orchestration methodology is built around, and you can start a conversation any time at yikesdude.com/contact.

AI engineeringagent reliabilityagentic coding

Liked this?

Tell us what is broken. We’ll tell you what the first week looks like.

Next read →

Retail AI in 2026: What Actually Ships to the Floor, and What Stays a Demo