Skip to content

What Is a Claude Harness and How Does Multi-Agent Design Improve Coding?

Problem

I spent 6 hours asking Claude to build a simple game. It produced broken code with hallucinated imports and non-existent API calls. The code looked plausible at first glance, but when I tried to run it:

error-output.txt
ModuleNotFoundError: No module named 'pygame_ext'
AttributeError: 'GameEngine' object has no attribute 'render_sprites'
ImportError: cannot import name 'SpriteManager' from 'game.utils'

I spent another 4 hours debugging these phantom references. Eventually, I realized Claude had invented entire modules that never existed in my project.

Then I found Anthropic’s blog post about a “harness” architecture. A Reddit user claimed:

“The big bugs just disappeared. No more hallucinations” - using a 12-agent system

The numbers from Anthropic were dramatic: solo Claude took 6 hours and $200 to produce broken code. A harness with multiple agents took 20 minutes and $9 to produce working code.

I wanted to understand: What exactly is a Claude harness, and why does using multiple specialized agents produce better code?

What is a Claude Harness

A Claude harness is a multi-agent architecture that wraps Claude with specialized sub-agents following a generator-evaluator pattern. Think of it like a development team: one agent plans, another writes code, and a third validates.

Here’s the core architecture:

harness-architecture.txt
User Task --> Planner Agent --> Sprint Contract
|
Generator <--+--> Evaluator
| |
+-------------+
(feedback loop)
|
Validated Code

The key concept is the sprint contract - a negotiated agreement between agents that defines what “done” means before coding begins.

Why Solo AI Coding Fails

When I used Claude solo, I observed three recurring problems:

Hallucination Cascade

A single agent makes assumptions without validation. Wrong assumptions compound into systemic bugs. There’s no checkpoint to catch errors early.

hallucination-example.txt
Step 1: Claude assumes I have a SpriteManager class (hallucination)
Step 2: Claude writes code calling SpriteManager.load()
Step 3: Claude assumes GameEngine has render_sprites() method
Step 4: Generated code references both hallucinated features
Step 5: Code runs, errors cascade, debugging nightmare

Context Blindness

Claude has a large context window but no structured memory. Previous decisions get lost in conversation history. There’s no explicit contract defining success criteria.

No Quality Gates

Solo Claude generates code without test verification. It accepts its own output without critique. There’s no second perspective on correctness.

The Generator-Evaluator Pattern

Anthropic’s solution borrows from Generative Adversarial Networks (GANs). In GANs, two networks compete:

  • Generator: creates images
  • Discriminator: evaluates if images are real

Applied to Claude coding:

  • Generator Agent: creates code
  • Evaluator Agent: critiques if code meets spec

Why this works: competition forces quality. The evaluator catches hallucinations. Iteration improves output. The contract defines success.

Three-Agent Workflow

The harness uses three specialized agents:

three-agent-architecture.py
class SprintContract:
"""Agreement between Planner and Generator."""
spec: str # What to implement
tests: list[str] # Test cases that must pass
constraints: list # Limitations and boundaries
acceptance: str # Definition of "done"
class PlannerAgent:
"""Defines what needs to be built."""
def create_contract(self, task: str) -> SprintContract:
spec = self.analyze_requirements(task)
tests = self.define_test_cases(spec)
constraints = self.identify_limits(task)
return SprintContract(spec, tests, constraints)
class GeneratorAgent:
"""Creates code based on contract."""
def implement(self, contract: SprintContract) -> str:
return self.generate_code(contract.spec, contract.constraints)
def iterate(self, feedback: str, code: str) -> str:
return self.revise_code(code, feedback)
class EvaluatorAgent:
"""Validates code against contract."""
def evaluate(self, code: str, contract: SprintContract) -> Evaluation:
test_results = self.run_tests(code, contract.tests)
hallucinations = self.detect_hallucinations(code, contract.spec)
bugs = self.find_bugs(code)
return Evaluation(test_results, hallucinations, bugs)
def harness_workflow(task: str) -> str:
planner = PlannerAgent()
generator = GeneratorAgent()
evaluator = EvaluatorAgent()
# Step 1: Plan
contract = planner.create_contract(task)
# Step 2: Generate-Evaluate loop
code = generator.implement(contract)
evaluation = evaluator.evaluate(code, contract)
while not evaluation.is_acceptable():
feedback = evaluator.provide_feedback(evaluation)
code = generator.iterate(feedback, code)
evaluation = evaluator.evaluate(code, contract)
return code # Validated, working code

The workflow:

workflow.txt
Step 1: Planner defines sprint contract
|
Step 2: Generator creates implementation
|
Step 3: Evaluator validates against contract
| (feedback loop if issues found)
Step 4: Generator iterates based on feedback
| (repeat until Evaluator approves)
Step 5: Output validated, working code

The Sprint Contract

The sprint contract is the critical piece I was missing. It defines success before coding starts:

sprint-contract-example.yaml
spec:
task: "Implement user authentication API endpoint"
requirements:
- POST /auth/login accepts email + password
- Returns JWT token on success
- Returns 401 on failure
- Rate limited to 5 requests per minute
tests:
- "Valid credentials return 200 with token"
- "Invalid credentials return 401"
- "Rate limit enforced after 5 requests"
- "Token contains user_id claim"
constraints:
- "Use existing User model"
- "Use bcrypt for password verification"
- "Token expiry: 24 hours"
- "No new database tables"
acceptance_criteria:
- "All tests pass"
- "No hallucinated imports"
- "Follows existing auth pattern"
- "Security review approved"

Without this contract, agents argue indefinitely about scope. With it, the negotiation phase happens first, before any code is written.

The Economics of Quality

The Reddit skeptic had a valid point:

“I’m skeptical of the shovel-peddler telling me the way to fix my shovel problem is to get my shovel a bunch of tinier shovels”

More agents do cost more tokens. But here’s the paradox:

cost-comparison.txt
Solo approach:
- $200 generation cost
- $500 debugging cost (10 hours of engineer time)
- Total: $700
Harness approach:
- $9 generation cost (more tokens but faster)
- $0 debugging cost (code already validated)
- Total: $9

Higher token cost pays for itself in debugging time saved. The economics make sense when you account for the full lifecycle.

Scaling to 12 Agents

The Reddit user who eliminated hallucinations used 12 specialized agents:

twelve-agent-system.py
specialized_agents = {
# Planning Phase
"pseudo_code_writer": "Writes high-level logic flow",
"contract_definer": "Defines specification contracts",
"architect": "Designs system structure",
# Implementation Phase
"code_writer": "Implements contracts",
"test_writer": "Creates test cases",
"doc_writer": "Documents implementation",
# Validation Phase
"code_reviewer": "Reviews for patterns and style",
"compliance_reviewer": "Checks spec adherence",
"security_reviewer": "Identifies vulnerabilities",
# Integration Phase
"integration_agent": "Combines components",
"qa_agent": "End-to-end testing",
"deploy_agent": "Production readiness check"
}
# Key insight: Each agent has ONE specific job
# Result: "The big bugs just disappeared. No more hallucinations"

But I wouldn’t start with 12 agents. The Reddit skeptic was right about over-engineering. Start with 3 (Planner, Generator, Evaluator). Add more only when complexity demands.

Common Mistakes

When I first tried building my own harness, I made several mistakes:

Mistake 1: Skipping Sprint Contracts

Without contracts, my Generator and Evaluator argued endlessly about scope. The negotiation phase is critical - it defines “done” before coding starts.

Mistake 2: Overlapping Agent Roles

I initially gave my Generator agent some review responsibilities. This confused the system. Each agent must have a singular, specific job.

Mistake 3: Single-Pass Generation

I expected one generate-evaluate cycle to produce perfect code. The feedback loop is the quality mechanism. Iteration is essential.

iteration-necessity.txt
Iteration 1: Generator produces draft, Evaluator finds 5 issues
Iteration 2: Generator fixes 3, Evaluator finds 2 remaining
Iteration 3: Generator fixes 2, Evaluator approves
Result: Validated code in 3 iterations

Mistake 4: Ignoring the Evaluator

The Evaluator is not optional. Without validation, you’re back to solo Claude’s hallucination problem. The Generator needs external critique.

Why GANs Work for AI Coding

Generative Adversarial Networks succeed because the discriminator forces the generator to improve. The same principle applies:

  1. Generator wants to produce “good enough” code quickly
  2. Evaluator enforces stricter standards
  3. Competition drives quality up

Claude Code vs Explicit Harness

A Reddit comment raised an important point:

“I don’t get it. Are we pretending Claude Code doesn’t exist? Isn’t that a harness?”

Claude Code does have built-in validation loops. But explicit multi-agent design gives you:

  • Control over agent roles
  • Custom evaluation criteria
  • Domain-specific validators
  • Negotiation protocols

How to Build Your Own Harness

Start minimal:

minimal-harness.py
def simple_harness(task: str) -> str:
# Agent 1: Define contract
contract = define_contract(task)
# Agent 2: Generate
code = generate_code(contract)
# Agent 3: Evaluate
while not evaluate(code, contract).passes:
feedback = get_feedback(code, contract)
code = iterate(code, feedback)
return code

Add specialized agents only when you identify specific failure patterns in your workflow.

Summary

A Claude harness is a multi-agent architecture that uses generator-evaluator patterns and sprint contracts to produce validated, working code. The key insight from practitioners: specialized sub-agents with singular jobs eliminate hallucinations and big bugs.

The evidence is compelling:

  • Solo Claude: 6 hours, $200, broken code
  • Harness: 20 minutes, $9, working code

The token cost increase pays for itself in debugging time saved. More importantly, the generator-evaluator pattern transforms AI coding from a risky gamble into a reliable workflow.

If you’re skeptical about “tinier shovels,” start with a minimal three-agent harness. Define sprint contracts before coding begins. See if the quality improvement outweighs the complexity cost.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments