The Current State of AI Engineering


Part 3: The Scaffolding Layer

Part 2 ended with a promise. We looked at the open-weight question (whether frontier models are even necessary for most coding tasks) and noted that a 12-billion-parameter open model with the right scaffolding doesn't just close the gap with frontier models. It surpasses them. We closed by promising a look at what the research actually says about scaffolding, and at why a directory full of Markdown files might be the most important thing in AI engineering right now.

Here it is.

The most important part of AI engineering isn't the model. It's what goes into the model. That's not a provocation; it's what the benchmark data is showing. On SWE-Bench Pro, the same model with a basic scaffold scores 23%. With an optimized 250-turn scaffold, it scores 45%+.¹ That 22-point swing is larger than the performance gap between any two frontier models competing head-to-head on standardized tooling.² The pattern holds outside pure coding benchmarks too: one team moved from the top 30 to the top 5 on Terminal Bench without changing their model once, and every point came from harness improvements.¹ If you're optimizing the wrong layer, you're leaving more capability on the table than any model upgrade will give back.

The scaffolding layer is where that capability lives. It's the architecture between the model and the problem. Understanding it (how it's structured, how information flows through it, how it can be extended and composed) is quickly becoming the core engineering discipline of this era.


The Base Layer: Four Tools and a Token Budget

At the bottom of every agentic system is something deceptively simple: a small set of tools and a system prompt.

Pi, the minimal terminal coding harness built by Mario Zechner, ships with four tools: read, write, edit, and bash. The tool definitions and system prompt together are deliberately minimal, a short focused context that gives the model grounding without burying it.² That's it. No plan mode. No sub-agents. No built-in safety rails. Just four verbs and a model that knows how to use them.

The design philosophy is deliberate. The model already has general intelligence. What it needs isn't more built-in capabilities; it needs grounding. Give it hands and feet. Let it act on the filesystem. The rest follows from the model's existing reasoning.
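To make the shape concrete, here is a minimal sketch of a four-tool harness loop in TypeScript. It is not Pi's actual implementation; the function names and the turn budget are invented for illustration, and a scripted function stands in for the model:

```typescript
import * as fs from "fs";
import { execSync } from "child_process";

type ToolCall = { tool: string; args: Record<string, string> };

// The four verbs: each is a thin wrapper over the filesystem or shell.
const tools: Record<string, (args: Record<string, string>) => string> = {
  read: (a) => fs.readFileSync(a.path, "utf8"),
  write: (a) => { fs.writeFileSync(a.path, a.content); return "ok"; },
  edit: (a) => {
    // Replace the first occurrence of `find` with `replace` in the file.
    const text = fs.readFileSync(a.path, "utf8");
    fs.writeFileSync(a.path, text.replace(a.find, a.replace));
    return "ok";
  },
  bash: (a) => execSync(a.cmd, { encoding: "utf8" }),
};

// The agent loop: execute each tool call, feed the result back to the
// model as the next observation, stop when the model stops calling tools.
function runAgent(model: (observation: string) => ToolCall | null): string {
  let observation = "";
  for (let turn = 0; turn < 50; turn++) {        // hard turn budget
    const call = model(observation);
    if (call === null) return observation;       // model is done
    observation = tools[call.tool](call.args);   // act, then observe
  }
  return observation;
}
```

The entire harness is the dispatch table plus the loop. Everything else (planning, review, safety) is left to the model and to the scaffolding layered above.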

Claude Code takes the same bet from a different angle. It's terminal-native, built for remote headless execution, designed to run in parallel across isolated git worktrees. The SWE-bench data shows it scoring higher on coding benchmarks than the raw Claude Opus model it runs on.³ That gap isn't the model weights. It's tool use patterns, retry logic, and context management: Anthropic's agent engineering, layered on top of the model.

OpenCode, the MIT-licensed open alternative with 120K+ GitHub stars, adds LSP support and a Plan/Build mode that gives you a review step before the agent touches your files. Still provider-agnostic. Still built around the same core primitives.⁴

The harness provides the interface to the world. It gives the agent something to act on. The intelligence is already in the model; the harness just gives it traction.

The agent isn't the LLM. The agent is the LLM plus the scaffolding. Neither is useful alone.


Instructions: Behavior as Code

On top of the base harness sits the instructions layer. This is where the scaffolding starts to get interesting.

AGENTS.md, CLAUDE.md, copilot-instructions.md: these files go by different names depending on which tool you're using. What they are, regardless of name, is source code for behavior. They're the program that runs inside the model.

Instructions define who the agent is and how it works. They specify communication style and persona. They set constraints on what the agent is and isn't allowed to do. They define decision-making frameworks: when you're unsure, do this; when you encounter this pattern, handle it this way. They describe output formats, quality standards, escalation paths.

A single well-placed instruction (always read a file completely before attempting to edit it) can head off an entire class of common agent failures. A poorly written one, whether vague, ambiguous, or contradicted by another instruction, produces exactly the hallucination-adjacent behavior that makes people distrust agents.

The insight here is architectural: instructions aren't documentation. They're not a README that a human might skim. They're context that the model receives on every request, every session, a consistent signal shaping how it reasons and responds. The discipline required to write them well is closer to writing tests than writing prose. Precision matters. Order matters. Contradiction is a bug.

Instructions can also be layered. They can live at the project level, the user level, or inside a specific skill. They accumulate. A project's AGENTS.md sets the baseline context. A skill brings its own instructions. A user's personal overrides layer on top. The model sees all of it, and it has to reason coherently across the whole stack.
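What such a file looks like in practice: a hypothetical project-level AGENTS.md fragment. The specific rules below are invented for illustration, not taken from any real project:

```markdown
# AGENTS.md (project level)

## Constraints
- Always read a file completely before editing it.
- Never run destructive git commands (reset --hard, push --force) without asking.

## Decision framework
- When a test fails after your change, revert and re-read the failing test first.
- When unsure which module owns a behavior, search the codebase before guessing.

## Output
- Keep commit messages to one imperative line under 72 characters.
```

Note the register: imperative, unambiguous, one rule per line. Each line is a constraint the model will see on every request, which is why contradiction between lines is a bug, not a style problem.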

A directory full of Markdown files is a programming language for behavior. If the model is the CPU, instructions are the assembly code.


Skills: Procedural Knowledge

Skills are the next layer up. If instructions define how an agent behaves, skills encode what to do in specific domains. They capture procedural knowledge that makes the agent reliable on complex, multi-step tasks.

A skill isn't just a prompt. It's a procedure written in Markdown: a sequence of steps, decision branches, validation checkpoints. It brings the context, vocabulary, and guardrails needed for a specific class of work. When it runs, it gives the agent a structured sequence to follow that reduces hallucination and helps ensure the output is complete and correct. The model doesn't improvise; it executes a procedure that has been vetted, tested, and refined.

The BMad Method (Build More Architect Dreams) is one of the clearest examples of this pattern at scale. It's a collection of skills organized around a four-phase software development workflow: analysis, planning, solutioning, and implementation.⁵ Each skill captures a procedure for doing something reliably: researching a technology, creating a PRD, performing a code review, updating architecture documentation. A story creation skill, for instance, follows a defined procedure: it performs architecture analysis, reads git history to understand recent work patterns, loads the PRD and epic files for context, researches the latest library versions, and synthesizes learnings from previous stories. The agent executes each step in order, validating as it goes, producing a story that matches the project's actual context and requirements. The skill encodes the procedure as Markdown.

Pi handles skills similarly. Each skill is a directory of Markdown files (a SKILL.md entry point, steps, templates, resources) installed into a .agents/skills/ folder in the project. The model loads whichever skills are relevant to the current task and executes the procedures defined in each skill on top of the same four base tools.²
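As a sketch of the shape, here is what a skill's entry point might look like. The skill name, steps, and file names are hypothetical, not Pi's or BMad's actual format:

```markdown
<!-- .agents/skills/code-review/SKILL.md (hypothetical) -->
# Skill: Code Review

## Procedure
1. Read the full diff before commenting (never review from file names alone).
2. Load checklist.md and evaluate every item against the diff.
3. For each finding, cite the file and line and propose a concrete fix.
4. Validate: mark every checklist item pass/fail before writing the summary.

## Resources
- checklist.md — the review criteria
- templates/summary.md — required output format
```

The numbered steps, the validation checkpoint, and the pointers to sibling resource files are what distinguish a skill from a bare prompt.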

Skills encode procedural knowledge. You install a skill and you're not just adding a prompt; you're encoding a procedure that the agent can reliably execute.


Prompts: The Runtime Layer

The final layer before the model sees anything is the runtime context, the dynamic assembly of everything that's accumulated across the scaffolding stack.

Prompt templates handle the recurring patterns. A brainstorming session, a code review, a story implementation: each has a shape that works well and a shape that doesn't. Templates encode the shape that works. They're pre-built scaffolding for common interactions, and they compose with the instructions and skills above them.

Dynamic context adds the live layer. File references (@filename) pull content from the filesystem directly into the conversation. Tool outputs from the current session accumulate as the agent works. Session history builds and is managed through compaction when the token budget gets tight. Images and attachments extend the model's surface further.

The prompt that reaches the model isn't a single thing you wrote. It's built: system prompt, plus project instructions, plus skill context, plus template, plus dynamic file references, plus session history, plus the current user message. Each layer is a contribution. The engineer isn't writing a prompt; they're designing a system that builds prompts correctly for any situation the agent encounters.

That distinction is what separates brittle agents from reliable ones. A brittle agent works when the user's message happens to include enough context. A reliable agent works because the scaffolding around it ensures that enough context always arrives, regardless of what the user says.
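The layered assembly described above can be sketched as a pure function over ordered layers. The layer names and the comment-marker convention here are hypothetical, chosen only to make the composition visible:

```typescript
type Layer = { name: string; content: string };

// Assemble the final prompt in a fixed, documented order.
// Empty layers (a skill that isn't loaded, a template that doesn't apply)
// simply drop out; the ordering of the rest never changes.
function assemblePrompt(layers: Layer[]): string {
  return layers
    .filter((l) => l.content.trim() !== "")
    .map((l) => `<!-- ${l.name} -->\n${l.content}`)
    .join("\n\n");
}

const prompt = assemblePrompt([
  { name: "system", content: "You are a coding agent with read/write/edit/bash." },
  { name: "project-instructions", content: "Always read files before editing." },
  { name: "loaded-skill", content: "" },                 // no skill this turn
  { name: "file:@src/app.ts", content: "export const x = 1;" },
  { name: "user", content: "Rename x to count." },
]);
```

The engineering work lives in deciding the layer order and what feeds each layer, not in any single piece of text.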


The Self-Extending System

Here's where the scaffolding layer stops being just architecture and starts being something more interesting.

The same read, write, edit, and bash tools that let an agent write application code can also let it modify its own scaffolding. Its own AGENTS.md. Its own skill definitions. Its own prompt templates.

This isn't a theoretical capability. The BMad bmad-help skill exists to inspect project state and recommend next steps, and the same toolset that does that can author and update configuration.⁵ The skill uses the agent to improve the scaffolding that shapes future agent behavior. The loop is closed.

OpenClaw takes the pattern further. Rather than calling Pi as a subprocess, it imports Pi's AgentSession directly via createAgentSession() and runs it in-process.⁶ That gives it full control over the session lifecycle: custom tool injection, dynamic system prompts per channel and context, provider-agnostic model switching. It replaces Pi's default bash tool with its own execution layer and adds channel-specific tools for each messaging platform. The harness is no longer fixed infrastructure; it's a pluggable runtime that the integrating system shapes at startup and can continue to reshape as context changes.
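The pattern is easier to see in a self-contained sketch. This is not OpenClaw's or Pi's actual API; `createSession` and its options are invented to illustrate the idea of a session factory whose system prompt and tools the integrating system can replace at startup:

```typescript
type Tool = (args: Record<string, string>) => string;

interface SessionOptions {
  systemPrompt: string;
  tools: Record<string, Tool>;
}

// A hypothetical in-process session factory: the harness supplies defaults,
// and the integrating system can override any of them.
function createSession(overrides: Partial<SessionOptions>): SessionOptions {
  const defaults: SessionOptions = {
    systemPrompt: "You are a coding agent.",
    tools: { bash: (a) => `ran: ${a.cmd}` },  // stand-in default executor
  };
  return {
    systemPrompt: overrides.systemPrompt ?? defaults.systemPrompt,
    tools: { ...defaults.tools, ...overrides.tools },  // replace or extend
  };
}

// A messaging integration swaps bash for a sandboxed executor and adds a
// channel-specific tool, without touching the base harness at all.
const session = createSession({
  systemPrompt: "You are a chat-channel assistant.",
  tools: {
    bash: (a) => `sandboxed: ${a.cmd}`,
    sendMessage: (a) => `sent to #${a.channel}: ${a.text}`,
  },
});
```

The key design choice is the spread-merge of the tool table: defaults survive unless explicitly replaced, so the harness stays a baseline rather than a boundary.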

The architectural implication is subtle but significant. Traditional software has a hard boundary between the program and its configuration. The program runs; the config is read at startup; the behavior is fixed until the next deploy. That boundary doesn't exist here. The scaffolding is just files. The agent can read them, reason about them, and rewrite them. The agent is aware of its own architecture in the only sense that matters practically: it can act on it.

This isn't consciousness. It's not agency in any philosophical sense. But it is something genuinely new: software that can participate in its own extension. A system that, given the right instruction, can read the skill that governs its brainstorming workflow, identify that it keeps making a particular class of mistake, update the guardrail, and run better next session, without a human having written a single line of that update.
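A minimal sketch of that loop, with a hypothetical skill file in a temp directory standing in for real project scaffolding:

```typescript
import * as fs from "fs";
import * as os from "os";
import * as path from "path";

// A skill is just text on disk; the same write/edit tools that touch
// application code can touch it. (Hypothetical file and rule names.)
const skillPath = path.join(os.tmpdir(), "brainstorm-skill.md");
fs.writeFileSync(skillPath, [
  "# Skill: Brainstorm",
  "## Guardrails",
  "- Limit to 10 ideas per round.",
].join("\n"));

// The "agent" observes a recurring failure and patches its own guardrail,
// exactly as it would patch source code.
function addGuardrail(file: string, rule: string): void {
  const text = fs.readFileSync(file, "utf8");
  fs.writeFileSync(file, text + `\n- ${rule}`);
}

addGuardrail(skillPath, "Check each idea against existing features before listing it.");
```

Every session that loads the skill from now on runs with the new guardrail. That persistence, not the edit itself, is what closes the loop.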

The scaffolding layer is where the self-modifying loop lives. Not in the model. In the Markdown files.


Reading Comprehension Was Just the Beginning

Part 1 of this series argued that reading comprehension is the most important skill in AI engineering. The data backed it. Developers who can evaluate generated code critically (who can read fast, spot errors, and understand what the agent actually produced versus what they asked for) are the ones delivering reliable work.

Part 3 extends that argument. Reading comprehension is necessary. But the engineers who will define this era aren't just the ones who read well. They're the ones who write scaffolding well.

Writing scaffolding well means writing instructions that are precise enough for a model to work with, not just clear enough for a human to understand. It means designing skills that package domain knowledge into reusable, composable modules. It means building systems where each layer of context makes the next layer better, where the information that arrives at the model is correct, complete, and structured for the task at hand.

The governance implications follow directly. If agents can modify their own scaffolding, the scaffolding has to be under version control. Instruction changes have to go through review. The same processes that catch bad code need to catch bad prompts, because a poorly written instruction at the scaffolding layer will produce bad behavior at scale, across every session that loads it, until someone catches and corrects it.

The engineers who understand this architecture (who can design it, extend it, review it, and govern it) are the ones building the infrastructure that everyone else's AI tools will run on.

Next: Cost Control. We’ll explore token optimization strategies and how locally deployed open-weight models can offset costs.


Sources

¹ Best AI for Coding (2026): Every Model Ranked by Real Benchmarks. Morph LLM. https://www.morphllm.com/best-ai-model-for-coding — Source for basic scaffold 23% vs. optimized 250-turn scaffold 45%+ on SWE-Bench Pro; Terminal Bench team moving from top 30 to top 5 through harness improvements alone.

² Mario Zechner, "What I learned building an opinionated and minimal coding agent." November 2025. https://mariozechner.at/posts/2025-11-30-pi-coding-agent/

³ Best AI for Coding (2026): Every Model Ranked by Real Benchmarks. Morph LLM. https://www.morphllm.com/best-ai-model-for-coding — Claude Code scores 80.9% on SWE-bench Verified, higher than raw Claude Opus 4.6.

⁴ OpenCode. https://opencode.ai / https://github.com/opencode-ai/opencode

⁵ BMad Method Documentation. https://docs.bmad-method.org/reference/workflow-map/

⁶ OpenClaw, "Pi Integration Architecture." https://github.com/openclaw/openclaw