AI Engineering · 2026-06-19 · 7 min read

Context Engineering Is the Real Job: How I Decide What an Agent Gets to See

Context engineering, not prompt wording, separates an agent that demos from one that survives production. Here is how I keep my own agents lean and reliable.

I stopped tuning prompt wording a while ago. Not because prompts do not matter, but because I kept finding that the wording was a rounding error next to a bigger lever: what the model actually sees when it runs. The industry has a name for that lever now. It is context engineering, and in 2026 it quietly became the successor to prompt engineering. Anthropic and others formalized it this year, and the framing that stuck is a good one. Context engineering is finding the smallest set of high-signal tokens that maximize the odds of a good outcome, given a finite attention budget.

That last phrase is the whole thing. The attention budget is finite. So the job is not to give the model more, it is to give it the right amount of the right stuff. When I get called in to fix an agent that works in the demo and falls apart in production, the problem is almost never the model. It is the context.

What context engineering is, and why it replaced prompt engineering

Prompt engineering was about phrasing. You found the magic words that nudged a model toward the answer you wanted. It worked when the task was a single turn and the input was clean.

Agents broke that. An agent runs many turns, calls tools, accumulates history, and reads from external systems. By turn ten, the prompt wording you agonized over is buried under thousands of tokens of tool output, retrieved documents, and prior reasoning. The model is not failing your clever instruction. It is drowning in everything else you handed it.

So the discipline shifted. Instead of asking "what should I say," I ask "what should be in the window at this exact step, and nothing else." That is context engineering. The wording still matters at the margin. What goes in the window decides whether the agent is reliable.

The finite attention budget is the whole game

Here is the counterintuitive part that trips up most teams. More context is not better context.

A bloated window does three bad things at once. It makes the model slower, because it has more to process. It makes every call cost more, because you pay per token. And it makes the model more likely to drift or hallucinate, because every irrelevant token competes for attention with the ones that actually matter. The signal you need is still in there somewhere. It is just sitting next to a wall of noise you put there yourself.

The failure I see most often looks like good intentions. Someone wants the agent to be informed, so they dump the whole config file, the entire table, and the full conversation history into the prompt. Then the agent makes a confident, wrong decision, and they blame the model and go shopping for a bigger one. The bigger model has the same finite attention problem. It just hides it for a few more turns.

The four moves I actually use

When I design an agent's context, I lean on four moves. They show up under different names in different write-ups, but they are the same four. Anthropic's guidance and write-ups like Weaviate's both converge on roughly this set.

Offload. Keep state on disk or in a store, not in the prompt. Pass a pointer, not the payload. If an agent needs a 400-line document, it does not need all 400 lines sitting in its window. It needs to know the document exists and how to pull the part it needs.

Retrieve dynamically. Pull the slice the task needs at the moment it needs it, instead of front-loading everything at the start. Front-loading feels safe. It is actually how you fill a window with things the agent will never use on this particular run.

Isolate. Give each subtask its own clean context, so one step does not poison the next. If step one did a messy exploration full of dead ends, step two should not have to read all of it. It should get a clean, narrow brief and start fresh.

Reduce. Compress or summarize history, keeping only what the agent will still need downstream. The trick is the "still need" part. Reducing is not just truncating. It is deciding what is load-bearing and dropping the rest on purpose.

How this shows up in my own agents

The reason I trust these moves is that I run them every night in real systems. I will keep the specifics anonymized, but the shapes are real.

I run nightly routines that check what changed since their last run and read nothing on a quiet night. If no source file moved, the agent does not pull a single document into context. That is reduce, applied before the model is even called. The cheapest token is the one you never load.

I run work that fans out to subagents, where each one gets a narrow, clean slice of the job and hands back a small structured result. The parent never sees the messy middle of each subtask, only the tidy output. That is isolate plus reduce working together, and it is the difference between a parent context that stays lean and one that balloons with every child's scratch work.

And I have two-agent setups where one agent decides and another executes, and the only thing they share is a strict file schema on disk. The decision agent writes a small, well-defined record. The execution agent reads exactly that record and nothing else. They never drag a transcript between them. That is offload in its purest form, and it keeps each agent's window scoped to its actual job.

None of these are clever prompts. They are decisions about what enters the window, made before the model runs. And that is the mindset shift I want operators to take away. The leverage moved upstream of the prompt. By the time you are wording an instruction, the expensive decisions, what to load, what to drop, what to isolate, are already made or already missed. I spend more design time on the plumbing that decides what reaches the model than on the sentences I send it, because that is where reliability actually gets built or lost.

When context engineering is the difference between a demo and production

A demo runs once, on a clean input, with a fresh context. Of course it looks great. Production runs the same agent thousands of times, on messy inputs, with state and history piling up.

The agent that survives is not the one with the cleverest prompt at session one. It is the one whose context stays lean at session one thousand. That is an engineering property, not a wording property. You earn it by controlling what accumulates.

This is also why I stay provider-agnostic about it. The attention budget is finite on Claude, on Gemini, and on GPT alike. I pick the lab per problem for other reasons, latency, cost, a specific capability, a compliance posture, but the context discipline travels across all of them unchanged. A model that is twice as capable does not exempt you from feeding it well. It just raises the ceiling on what good context can buy you. If you want the deeper version of how I structure this across an engagement, that is the core of my agent orchestration methodology.

Where to start if your agent is unreliable

If you have an agent that demos well and misbehaves in the wild, resist the urge to swap models or rewrite the prompt first. Do this instead.

Log exactly what tokens entered the window on each call. Not a summary, the actual content. Most teams have never looked at this, and it is usually horrifying in a useful way. You will see the full history, the redundant tool outputs, the documents the agent never used.

Then cut what the task does not need. Isolate the subtasks so a messy step stops contaminating the next one. Move state out of the prompt and into a store the agent reads on demand.

I will say it plainly, because it is the through-line of almost every reliability project I take on. Most of the agent problems I get called in for are context problems wearing a model-quality costume. Fix what the agent sees, and the model you already have usually turns out to be enough.

If you are staring at an agent that works in the demo and not in production, that is exactly the kind of problem worth an hour. Book a working session and we will look at what your agent is actually reading.

agent engineeringcontext engineeringproduction AI

Liked this?

Want this built for your team, or want to learn it yourself? Either way, start here.

Start a project →

Learn 1:1 →

Next read →

AI Document Review Is the Wedge for Legal AI. Here Is Where I'd Actually Start a Firm.