The Current State of AI Engineering

Software code with list of AI options

Part 4: Make Every Token Count

Part 3 ended with a promise. We walked the scaffolding layer (the base harness, the instructions, the skills, the runtime context) and argued that the most important part of AI engineering isn't the model; it's what goes into the model. Then we said: next, cost control. Token optimization strategies, and cost offsetting through locally deployed open-weight models.

This is the first half of that. The local open-weight story is big enough to stand on its own, so it gets Part 5. This post is about the metered side: what happens to your engineering when every token has a price tag.

For most of the last two years, it didn't. Tokens were effectively free, subsidized by providers buying market share, and that subsidy quietly shaped how everyone worked. When a request costs nothing, accuracy is optional. You can fire off ten loosely specified agents and keep whichever one happens to land. You can paste the whole repository into context because trimming it takes effort and the tokens are on the house. The dominant strategy was volume.

That era is closing. The industry is shifting to usage-based billing, and the subsidy is going away. The instinct most teams reach for is the obvious one: count tokens, set budgets, ration requests. That instinct is wrong, or at least incomplete. The way to control cost under metered billing isn't to spend less. It's to spend better, and spending better is an engineering problem, not an accounting one.

When tokens were cheap, accuracy was a nicety. Now it's the whole game. The goal isn't to count tokens; it's to make every token count.

The Shift from Counting to Spending

Start with the failure mode the subsidy created. Call it agent gambling: dispatch a swarm of low-context agents at a problem and hope one of them gets it right. When the marginal cost of an attempt is zero, gambling is rational. You're not paying for the misses, only the win.

Strip the subsidy out and the math inverts. Now you pay for every miss. The swarm that used to be free is a recurring line item, and most of what it produces is waste you're billed for. The rational move flips from volume to precision: send fewer, better-aimed requests, each carrying enough context to succeed on the first attempt.

This is where people mistake the problem for a cost problem. It looks like one (the bill went up) so they treat it like one (spend less). But cutting spend without raising quality just means fewer attempts at the same hit rate, which means more failures reaching production, which means more rework, which costs more than the tokens ever did. The token bill is the visible cost. The invisible cost is everything downstream of a bad output.

The reframe that actually works is return on investment. The value of an agent's output, net of what it cost to produce, over that cost. Two things fall out of that framing immediately. First, optimizing token cost is wasted effort if the value of the output is zero; you can't divide your way to a positive return on garbage. Second, and less obviously, the most reliable way to increase the value of an output is often to decrease the tokens going in, because most of those tokens are irrelevant context that's actively making the model worse. Quality and cost aren't in tension. They're the same lever pulled from two ends.

Cutting spend without raising quality is just gambling with a smaller stack. The lever that moves both value and cost is quality.

The Math of Compounding Error

Here's why quality matters more than it used to, and the argument is just arithmetic.

An LLM is a probability machine. It doesn't execute logic; it predicts the next token, and even a very good prediction is occasionally wrong. On a single step that's fine. Across a multi-step agentic workflow it compounds, and compounding is brutal.

Take an agent that's right 99% of the time on any individual step. Run it through a fifty-step task and its odds of getting the whole chain right aren't 99%. They're 0.99 to the fiftieth power, which is about 60%. Now drop the per-step accuracy to 95%, which still sounds reliable. The same fifty-step chain succeeds 0.95 to the fiftieth power of the time: roughly 8%.

Sit with that. A four-point drop in per-step accuracy, from 99% to 95%, takes a workflow from a coin-flip-and-then-some down to failing more than nine times out of ten. The relationship isn't linear; it's exponential, and it runs against you. Which means every incremental improvement in step quality buys a disproportionate improvement in your odds of finishing. Quality isn't a polish you apply at the end. It's the thing that determines whether the chain completes at all, and a chain that fails halfway through has already spent its tokens.

The way out is to stop the compounding before it runs away. You insert deterministic checkpoints (tests, linters, type checks, security scans, schema validation) into the non-deterministic chain. A passing test doesn't just catch a bug. It resets the accuracy margin: everything upstream of a green check is known-good, so the error rate starts compounding fresh from there instead of from the beginning. Without those checkpoints you get the failure pattern anyone who's watched an agent run unsupervised will recognize: a buggy change built on a buggy change built on a buggy change, each one billed, plus the CI minutes to discover it and the human hours to unwind it. This is the same shift-left discipline software has practiced for fifteen years. The agent era just raised the stakes on getting it right.

Across enough steps, even 99% accuracy decays toward zero. Deterministic controls are how you reset the margin before it collapses.

Context Is the Cost

If quality is the lever, context is where you pull it, and the rule is narrower than most people expect: provide as little context as possible, but as much as required.

Both failure directions are real. Too little context and the model fills the gap the only way it knows how, by predicting something plausible, which is the polite name for a hallucination. Too much context is the failure people don't see coming, because it feels safe. It isn't. The model can't reliably tell your relevant context from your irrelevant context; it treats the whole window as signal, so every extra file you paste in is another opportunity for it to anchor on the wrong thing. More context doesn't mean more grounding. Past a point it means more noise, and you're paying by the token for the privilege of confusing the model.

It gets worse inside a long window. Models don't attend to a context uniformly. Performance is highest when the relevant information sits at the very beginning or the very end of the input and degrades, sometimes sharply, when the model has to retrieve it from the middle: a U-shaped curve that holds even for models explicitly built for long contexts.¹ Whatever you bury in the middle of a bloated window is the content the model is least likely to use. There's a companion effect in conversation: as a session fills past roughly half its window, the model increasingly favors what's recent over what came earlier, so the careful setup you provided at the top quietly loses weight the longer you talk.

None of this is absolute. These are biases in a probabilistic system, not hard rules, and treating them as hard rules is its own mistake. But they point at one discipline. The skill that controls cost in 2026 is the same skill that controls quality, and it's the one Part 3 was circling the whole time: context engineering. Deciding what the model sees, in what order, and what it never sees at all. Keeping the working window lean (well short of full, not because there's a magic threshold but because a packed window is where both the lost-in-the-middle and recency effects do their damage) and starting clean when the task changes rather than dragging a stale conversation into new work. Tokens don't carry over between sessions, so context is cheap to discard and expensive to hoard. Throw it away liberally.

The model can't separate signal from noise in its own context; that's your job. Context engineering is the discipline that buys quality and economy at the same time.

The Levers You Actually Pull

The principles cash out into a handful of concrete moves.

Match the model to the task. Bigger isn't better; bigger is just bigger, and the price gap between a flagship reasoning model and a small execution model is an order of magnitude, not a rounding error.² Reasoning models earn their cost on planning, architecture, and genuinely hard debugging, the work where the model needs to think. Execution against a tight, well-specified plan is different work, and a small fast model is often better at it, not just cheaper. Point a heavyweight reasoning model at a precise spec and it'll start second-guessing the instructions, reopening decisions you'd already closed, spending expensive output tokens to talk itself out of the plan. The skill it has is exactly the wrong skill for the job. (The task-aware "auto" model selection now shipping in the major harnesses is a reasonable default for people who don't want to make this call by hand on every request; it's trying to make this match for you.)

Work in phases, and clear between them. Research, planning, implementation are three different jobs with three different context needs, and cramming them into one session is how windows bloat. Research is exploratory and token-heavy; you're casting wide. Planning is where a good reasoning model turns that pile into a precise specification, ideally a concrete to-do list. Implementation takes that spec and executes it, often with cheaper models, sometimes in parallel, each agent handed only the slice of context its piece needs. The move between phases is to start a fresh window rather than let research sediment pile up under your implementation work. A clean handoff (a tight spec into a clean context) is worth more than all the conversational history you'd otherwise carry.

There's a corollary worth stating: the exploratory front end doesn't have to run on metered agent tokens at all. The research and early shaping (reading docs, summarizing internal threads, pressure-testing an approach before any code exists) is human-in-the-loop thinking, and it's a good fit for a flat-rate assistant surface rather than per-request agent billing. In a Microsoft shop that's M365 Copilot; the principle generalizes to whatever cheap-or-flat surface you have. Do the expensive thinking where thinking is cheap, then spend your metered tokens only where they convert into code.

Be precise, and tell the agent when to stop. Hand it the context you already have instead of paying it to rediscover what you know; if you know which file the bug is in, say so. Add explicit stop conditions. "Once the bug is fixed and the tests pass, stop" prevents the agent from wandering off to refactor things you didn't ask about on tokens you didn't budget for.

Lean on deterministic controls. Covered above, but it belongs in the list: tests, linters, scanners, type checks. They're the cheapest quality you can buy, they run for free relative to model calls, and they're the thing standing between you and buggy-change-on-buggy-change.

The expensive token is the output token, and the model will happily generate more than you need. Spend your context budget on the work, not on rediscovering what you already know.

The Scaffolding Pays Twice

Here's where this post and the last one close into a single argument.

Part 3 made the case that your instructions, skills, and prompt templates are source code for behavior: a directory of Markdown files that programs how the agent reasons. That was a quality argument. The same scaffolding is also your primary cost-control surface, and it's worth seeing that these are not two features of the scaffolding. They're one.

Your AGENTS.md (or copilot-instructions.md, or CLAUDE.md) loads on every single interaction. That's leverage in both directions. Keep it small and sharp (the non-negotiables, the project guardrails, the decision rules) and every session starts grounded for a near-zero marginal token cost. Let it bloat into a pasted-in wiki and you're paying that weight on every request while actively burying the rules that matter under the lost-in-the-middle curve. The discipline that makes the file good (precise, ordered, contradiction-free) is the same discipline that makes it cheap. One of the highest-leverage uses of that file is as an agent-miss log: when the agent makes a class of mistake, you add the one line that prevents it, and you've fixed every future session for the price of a sentence.

Skills extend the pattern. They load procedural knowledge dynamically, only when the task matches, instead of carrying every procedure in context all the time. That's a context-engineering decision wearing a capability hat: the win is precisely that the knowledge isn't always-on. The caution is the inverse. Don't encode skills for things the model already does well, and don't let redundant or stale skills accumulate; a skill you wrote six months ago against a weaker model may now be dead weight loading into your window. Re-check them. Throw away the ones that no longer earn their tokens.

MCP servers are the sharpest version of the same trade. They're genuinely powerful (external tools, live data, dynamic content) and they're also where token cost hides, because every connected server injects its tool descriptions into context and a heavy one can quietly bloat the window and tempt extra tool calls before you've asked for anything. Gate the heavy integrations behind a custom agent that only loads them when that specific work demands it, rather than leaving them on for every interaction. And where a plain CLI does the job (a gh command instead of a full GitHub MCP surface), the leaner tool is usually the cheaper one.

Even output is a scaffolding decision. Output tokens cost more than input tokens, and a model left to its own defaults pads its responses. A one-line instruction to be concise and drop the conversational niceties trims that, and it's nearly as effective as an elaborate "minimal output" skill at a fraction of the complexity.

The Markdown files that program quality also govern cost. They're not two levers that happen to sit together; they're one lever, and you were already holding it.

Strategy Is the Job That's Left

Part 1 of this series argued that reading comprehension is the most important skill in AI engineering: the ability to evaluate what an agent produced against what you asked for. Part 3 extended it to writing scaffolding well. This post points at what those skills are for now that the meter is running.

Agents execute. They don't strategize. They'll run a plan with remarkable competence and they will not tell you the plan was wrong, the architecture was working against them, or the whole task was framed badly. That judgment (which model for which task, how to phase the work, what context to provide and what to withhold, where to spend the expensive tokens and where to refuse) is the part that doesn't automate. It's also the part that determines whether your token bill is an investment or a leak.

So the developer's center of gravity keeps moving the same direction it's been moving through this whole series: away from typing the code, toward designing the system that produces it. Architecture matters more, not less, because good structure (clear domains, clean boundaries) is navigation guardrail for an agent and a smaller blast radius for its mistakes. Prompts and configs get iterated like the engineering artifacts they are, rewritten as models change, thrown away when they stop earning their place. Nothing here is set-and-forget. It's a discipline, and like every engineering discipline before it, it rewards the people who treat it as one.

Stop counting tokens. Start making them count.

Next: Part 5. Cost offsetting through locally deployed open-weight models, where the cheapest token is the one you don't send to anyone's meter at all.

Sources

¹ Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang. "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the Association for Computational Linguistics, vol. 12 (2024), pp. 157–173. https://doi.org/10.1162/tacl_a_00638 — Source for the U-shaped performance curve: model performance is highest when relevant information is at the beginning or end of the input context and degrades in the middle, even for long-context models.

² Pricing gap between Anthropic's flagship reasoning model and its small execution model is exactly 5× across every pricing tier, not a rounding error. Anthropic model pricing (as of May 2026): Claude Opus 4.8 lists at $5 / MTok base input, $6.25 / MTok for 5-minute cache writes, $10 / MTok for 1-hour cache writes, $0.50 / MTok for cache hits and refreshes, and $25 / MTok for output. Claude Haiku 4.5 lists at $1 / MTok, $1.25 / MTok, $2 / MTok, $0.10 / MTok, and $5 / MTok respectively — a uniform 5× multiplier at every line item. The output-token gap is the one that hits hardest on agentic workloads, where the model generates far more than it consumes; a Haiku run that costs $5 of output costs the same agent run on Opus $25, before the input and cache charges are added.