
The Six-Layer Architecture of Harness Engineering: How to Build Reliable AI Agents


Purpose

I’ve been building AI agents for a while now. The hardest part isn’t making them work once—it’s making them work reliably across hundreds of tasks.

When I looked at why agents failed, I found they broke at different points. Some got lost in irrelevant information. Others called the wrong tools. Some corrupted their own context.

What I needed was a structured way to address each failure mode. The six-layer architecture gave me that.

The Six Layers

┌─────────────────────────────────────────────┐
│ L6: Constraints, Validation & Recovery      │ ← Error catching, rollback
├─────────────────────────────────────────────┤
│ L5: Evaluation & Observability              │ ← Self-check mechanisms
├─────────────────────────────────────────────┤
│ L4: Memory & State                          │ ← Task state, handoff docs
├─────────────────────────────────────────────┤
│ L3: Execution Orchestration                 │ ← Planning, execution loops
├─────────────────────────────────────────────┤
│ L2: Tool System                             │ ← Discovery, result extraction
├─────────────────────────────────────────────┤
│ L1: Information Boundary                    │ ← What agent should/shouldn't know
└─────────────────────────────────────────────┘

L1: Information Boundary

This defines what the agent should and shouldn’t know. The best practice is to keep it minimal.

Mitchell Hashimoto’s approach: create an AGENTS.md file, 50-100 lines max. Each line should address a past agent failure. Structure it as a “task → location” map, not an encyclopedia.

AGENTS.md example
# Project Overview
React + TypeScript frontend for policy templates.
## Quick Navigation
| What you want | Where to look |
|---------------|---------------|
| Module structure | docs/architecture/overview.md |
| Component specs | docs/conventions/components.md |
| API reference | docs/reference/api-spec.yaml |
## Hard Rules
1. Components ≤ 300 lines
2. Use TDesign only, no other UI frameworks
3. All API calls via apiClient, no raw fetch

The key insight: don’t dump everything on the agent. Give it a map.
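One way to keep the map small is to check it mechanically. Here is a minimal sketch, assuming a helper I'm calling `checkAgentsFile` (a hypothetical name, not part of any tool) that counts non-empty lines against a budget and pulls the "task → location" rows out of the navigation table. It assumes the first table row is the header.

```javascript
// Hypothetical helper: checks an AGENTS.md string against a line budget
// and extracts the "task -> location" rows from its navigation table.
function checkAgentsFile(text, maxLines = 100) {
  const lines = text.split('\n').filter((line) => line.trim() !== '');
  const rows = [];
  for (const line of lines) {
    // A navigation row looks like: | What you want | Where to look |
    const cells = line.split('|').map((c) => c.trim()).filter((c) => c !== '');
    const isSeparator = cells.every((c) => /^-+$/.test(c));
    if (line.trim().startsWith('|') && cells.length === 2 && !isSeparator) {
      rows.push({ task: cells[0], location: cells[1] });
    }
  }
  return {
    lineCount: lines.length,
    withinBudget: lines.length <= maxLines,
    // Assumption: the first table row is the header, so drop it.
    routes: rows.slice(1),
  };
}

const sampleAgentsFile = [
  '# Project Overview',
  'React + TypeScript frontend for policy templates.',
  '## Quick Navigation',
  '| What you want | Where to look |',
  '|---------------|---------------|',
  '| Module structure | docs/architecture/overview.md |',
  '| Component specs | docs/conventions/components.md |',
].join('\n');

const agentsReport = checkAgentsFile(sampleAgentsFile);
console.log(agentsReport.lineCount, agentsReport.withinBudget, agentsReport.routes.length);
```

Running a check like this in CI would turn the "50-100 lines" guideline into something that fails loudly when the file starts growing into an encyclopedia.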

L2: Tool System

This layer handles how the agent interacts with external systems.

Anthropic’s Tool Search Tool showed me something important: loading all tools at once wastes context. By loading only the relevant tools on demand, Anthropic reports cutting token usage by about 85%, and accuracy rose from 49% to 74% on Opus 4.

The principle is simple:

Don't: Load 50 tools upfront
Do: Search for tools, load only what's needed
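To make the principle concrete, here is a minimal sketch of search-then-load. This is not Anthropic's API: the registry, the tool names, and the `searchTools` function are all hypothetical, and the scoring is a naive keyword overlap where a real system would more likely use BM25 or embeddings.

```javascript
// Hypothetical registry: only names and one-line descriptions live here;
// full tool schemas would be loaded on demand after the search.
const toolRegistry = [
  { name: 'git_diff', description: 'show changes between commits or files' },
  { name: 'run_tests', description: 'run the unit tests and report failures' },
  { name: 'query_db', description: 'run a read-only sql query' },
  { name: 'send_email', description: 'send an email to a user' },
];

// Naive keyword-overlap score: count query words found in name + description.
function searchTools(query, tools, limit = 2) {
  const words = query.toLowerCase().split(/\W+/).filter(Boolean);
  return tools
    .map((tool) => {
      const haystack = `${tool.name} ${tool.description}`.toLowerCase();
      const score = words.filter((w) => haystack.includes(w)).length;
      return { tool, score };
    })
    .filter((entry) => entry.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((entry) => entry.tool.name);
}

// Only the matching tools would be added to the agent's context.
const selectedTools = searchTools('run failing unit tests', toolRegistry);
console.log(selectedTools);
```

The agent's context then carries two tool definitions instead of the whole registry, which is where the token savings come from.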

L3: Execution Orchestration

This layer sequences multi-step tasks.

Anthropic uses a Planner → Generator → Evaluator pattern. Stripe uses a hybrid state machine: deterministic nodes for known paths, agent nodes for exploration.

I think the key is separating planning from execution. Let one agent figure out the steps, another execute them, and a third check results.
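The separation can be sketched as a small loop. This is my own illustration, not Anthropic's or Stripe's implementation: `runPipeline` and the three role functions are hypothetical stand-ins, and in a real harness each role would be a separate model call rather than a plain function.

```javascript
// Planner -> Generator -> Evaluator loop with stubbed roles.
// The evaluator's critique is fed back into the next generation attempt.
function runPipeline(task, { plan, generate, evaluate }, maxAttempts = 3) {
  const completed = [];
  for (const step of plan(task)) {
    let feedback = null;
    let accepted = false;
    for (let attempt = 1; attempt <= maxAttempts && !accepted; attempt++) {
      const output = generate(step, feedback);
      const verdict = evaluate(step, output);
      if (verdict.ok) {
        completed.push({ step, output, attempts: attempt });
        accepted = true;
      } else {
        feedback = verdict.reason; // critique for the next attempt
      }
    }
    if (!accepted) {
      throw new Error(`Step "${step}" failed after ${maxAttempts} attempts`);
    }
  }
  return completed;
}

// Hypothetical stub roles for illustration only.
const stubRoles = {
  plan: (task) => [`outline ${task}`, `implement ${task}`],
  // This generator only revises its output once it has seen feedback.
  generate: (step, feedback) => (feedback ? `${step} (revised)` : step),
  evaluate: (step, output) =>
    step.startsWith('implement') && !output.endsWith('(revised)')
      ? { ok: false, reason: 'missing tests' }
      : { ok: true },
};

const pipelineResults = runPipeline('login form', stubRoles);
console.log(pipelineResults.map((r) => r.attempts));
```

The point of the structure is that the generator never grades its own work, and a step that keeps failing stops the run instead of silently passing through.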

L4: Memory & State

This layer persists intermediate results.

Anthropic’s Context Resets changed my thinking. Instead of compressing context when it gets full, they launch fresh agents with structured handoff documents.

The idea: when context quality drops, reset. Hand off state cleanly. Don’t try to squeeze more into a degraded context window.
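Here is a minimal sketch of the reset decision and the handoff document. Everything in it is an assumption for illustration: the function names, the handoff sections, the file paths, and the 0.4 threshold (borrowed from the 40% rule of thumb under Common Mistakes, not a measured constant).

```javascript
// Assumed threshold: a rule of thumb, tune it for your model and tasks.
const RESET_THRESHOLD = 0.4;

function shouldReset(usedTokens, windowSize, threshold = RESET_THRESHOLD) {
  return usedTokens / windowSize > threshold;
}

// Hypothetical handoff format: enough for a fresh agent to resume, nothing more.
function buildHandoff(state) {
  return [
    `# Handoff: ${state.goal}`,
    '## Done',
    ...state.done.map((item) => `- ${item}`),
    '## Next',
    ...state.next.map((item) => `- ${item}`),
    '## Known pitfalls',
    ...state.pitfalls.map((item) => `- ${item}`),
  ].join('\n');
}

// Illustrative task state; the paths and pitfall are made up.
const taskState = {
  goal: 'Migrate API calls to apiClient',
  done: ['src/pages/Login.tsx'],
  next: ['src/pages/Profile.tsx'],
  pitfalls: ['apiClient.get returns parsed JSON, not a Response'],
};

if (shouldReset(90000, 200000)) {
  // This document would seed the fresh agent's context.
  console.log(buildHandoff(taskState));
}
```

The handoff is deliberately short. If it grows to include the full transcript, the fresh agent inherits the same degraded context the reset was meant to escape.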

L5: Evaluation & Observability

This layer lets agents validate their own work.

OpenAI exposes its observability stack to agents, so they can grab DOM snapshots, screenshots, and logs directly. The agent can see what happened and adjust.

Self-check mechanisms matter. An agent that can inspect its own work is more reliable than one that can’t.
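A self-check can be as simple as running a list of checks over whatever the agent can observe. The sketch below is not OpenAI's stack: `selfCheck`, the check names, and the observation bundle are hypothetical, and in a real harness the logs and DOM snapshot would come from the browser and the log pipeline rather than hard-coded strings.

```javascript
// Runs each check against the observations and collects the failures.
// A check returns null when it passes, or a short problem description.
function selfCheck(observations, checks) {
  const failures = checks
    .map((check) => ({ name: check.name, problem: check.run(observations) }))
    .filter((result) => result.problem !== null);
  return { passed: failures.length === 0, failures };
}

// Hypothetical checks for illustration.
const uiChecks = [
  {
    name: 'no-console-errors',
    run: (obs) => {
      const errors = obs.logs.filter((line) => line.startsWith('ERROR'));
      return errors.length > 0 ? `${errors.length} error(s) in logs` : null;
    },
  },
  {
    name: 'submit-button-rendered',
    run: (obs) =>
      obs.domSnapshot.includes('<button type="submit"') ? null : 'submit button missing',
  },
];

// Stand-in observations; a real agent would capture these after each action.
const observed = {
  logs: ['INFO app started', 'ERROR failed to load /api/user'],
  domSnapshot: '<form><button type="submit">Save</button></form>',
};

const checkReport = selfCheck(observed, uiChecks);
console.log(checkReport.passed, checkReport.failures);
```

The failure descriptions are what the agent reads next, so they should say what is wrong in plain terms rather than just returning false.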

L6: Constraints, Validation & Recovery

This layer catches errors and handles recovery.

OpenAI’s principle stuck with me: “If it cannot be enforced mechanically, agents will deviate.”

Custom linters with embedded fix instructions work well:

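eslint-rules/no-raw-fetch.js
// Custom ESLint rule: flags direct fetch() calls and tells the agent how to fix them.
module.exports = {
  meta: {
    type: 'problem',
    messages: {
      noRawFetch: [
        '❌ Direct fetch() forbidden.',
        '✅ FIX: Use apiClient:',
        ' import { apiClient } from "@/lib/api-client";',
        ' const data = await apiClient.get("/endpoint");',
        '📖 See: docs/conventions/api-calls.md'
      ].join('\n')
    }
  },
  // Minimal detection sketch: matches bare fetch(...) calls only, not window.fetch(...).
  create(context) {
    return {
      CallExpression(node) {
        if (node.callee.type === 'Identifier' && node.callee.name === 'fetch') {
          context.report({ node, messageId: 'noRawFetch' });
        }
      }
    };
  }
};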

The fix instruction is embedded directly in the error message. The agent knows exactly what to do.
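Linting covers the validation half of this layer. The recovery half can be sketched as a wrapper that snapshots state, runs a step, validates the result, and rolls back on failure. This is a minimal illustration under my own assumptions: `runWithRollback`, the in-memory `files` map, and the regex validator are hypothetical, and the state is assumed to be JSON-serializable so it can be copied cheaply.

```javascript
// Deep copy via JSON; assumes the state is plain JSON-serializable data.
const cloneState = (value) => JSON.parse(JSON.stringify(value));

// Snapshot, run the step on a copy, validate, and roll back if anything fails.
function runWithRollback(state, step, validate) {
  const snapshot = cloneState(state);
  try {
    const next = step(cloneState(state));
    const problems = validate(next);
    if (problems.length > 0) {
      return { state: snapshot, rolledBack: true, problems };
    }
    return { state: next, rolledBack: false, problems: [] };
  } catch (error) {
    return { state: snapshot, rolledBack: true, problems: [error.message] };
  }
}

// Hypothetical in-memory workspace and a step that breaks a hard rule.
const workspace = { files: { 'a.ts': 'apiClient.get("/x")' } };
const badStep = (s) => {
  s.files['a.ts'] = 'fetch("/x")';
  return s;
};

// Validator whose messages carry the fix, like the lint rule's error text.
const forbidRawFetch = (s) =>
  Object.entries(s.files)
    .filter(([, source]) => /\bfetch\(/.test(source))
    .map(([path]) => `${path}: direct fetch() forbidden, use apiClient`);

const outcome = runWithRollback(workspace, badStep, forbidRawFetch);
console.log(outcome.rolledBack, outcome.state.files['a.ts'], outcome.problems);
```

In a real repository the snapshot would more likely be a git commit or stash than an in-memory copy, but the shape is the same: a bad step never becomes the starting point for the next one.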

Where to Start

Here’s what I found works best:

Start with L1 and L6.

Why? They’re the bookends. L1 defines the boundary before the agent starts. L6 catches everything that goes wrong. Without L1, agents wander into irrelevant areas. Without L6, errors compound and become unrecoverable.

The middle layers (L2-L5) matter too, but they’re harder to get right upfront. L1 and L6 give you immediate leverage.

Common Mistakes

I’ve made these mistakes myself:

  1. Skipping L1, jumping to L3 - Agents lack guidance and waste time on irrelevant areas
  2. Treating L6 as optional - Errors compound, sessions become unrecoverable
  3. Loading all tools at once - Wastes context, degrades performance
  4. Ignoring context thresholds - Quality drops sharply once usage passes about 40% of the context window

Summary

In this post, I explained the six-layer architecture of Harness Engineering. The key point is to start with L1 (Information Boundary) and L6 (Constraints, Validation & Recovery), because they deliver the highest ROI. The other layers fill in the gaps between boundary definition and error recovery.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
