What Is a Claude Harness and How Does Multi-Agent Design Improve Coding?
Problem
I spent 6 hours asking Claude to build a simple game. It produced broken code with hallucinated imports and non-existent API calls. The code looked plausible at first glance, but when I tried to run it:
ModuleNotFoundError: No module named 'pygame_ext'AttributeError: 'GameEngine' object has no attribute 'render_sprites'ImportError: cannot import name 'SpriteManager' from 'game.utils'I spent another 4 hours debugging these phantom references. Eventually, I realized Claude had invented entire modules that never existed in my project.
Then I found Anthropic’s blog post about a “harness” architecture. A Reddit user claimed:
“The big bugs just disappeared. No more hallucinations” - using a 12-agent system
The numbers from Anthropic were dramatic: solo Claude took 6 hours and $200 to produce broken code. A harness with multiple agents took 20 minutes and $9 to produce working code.
I wanted to understand: What exactly is a Claude harness, and why does using multiple specialized agents produce better code?
What is a Claude Harness
A Claude harness is a multi-agent architecture that wraps Claude with specialized sub-agents following a generator-evaluator pattern. Think of it like a development team: one agent plans, another writes code, and a third validates.
Here’s the core architecture:
User Task --> Planner Agent --> Sprint Contract | Generator <--+--> Evaluator | | +-------------+ (feedback loop) | Validated CodeThe key concept is the sprint contract - a negotiated agreement between agents that defines what “done” means before coding begins.
Why Solo AI Coding Fails
When I used Claude solo, I observed three recurring problems:
Hallucination Cascade
A single agent makes assumptions without validation. Wrong assumptions compound into systemic bugs. There’s no checkpoint to catch errors early.
Step 1: Claude assumes I have a SpriteManager class (hallucination)Step 2: Claude writes code calling SpriteManager.load()Step 3: Claude assumes GameEngine has render_sprites() methodStep 4: Generated code references both hallucinated featuresStep 5: Code runs, errors cascade, debugging nightmareContext Blindness
Claude has a large context window but no structured memory. Previous decisions get lost in conversation history. There’s no explicit contract defining success criteria.
No Quality Gates
Solo Claude generates code without test verification. It accepts its own output without critique. There’s no second perspective on correctness.
The Generator-Evaluator Pattern
Anthropic’s solution borrows from Generative Adversarial Networks (GANs). In GANs, two networks compete:
- Generator: creates images
- Discriminator: evaluates if images are real
Applied to Claude coding:
- Generator Agent: creates code
- Evaluator Agent: critiques if code meets spec
Why this works: competition forces quality. The evaluator catches hallucinations. Iteration improves output. The contract defines success.
Three-Agent Workflow
The harness uses three specialized agents:
class SprintContract: """Agreement between Planner and Generator.""" spec: str # What to implement tests: list[str] # Test cases that must pass constraints: list # Limitations and boundaries acceptance: str # Definition of "done"
class PlannerAgent: """Defines what needs to be built."""
def create_contract(self, task: str) -> SprintContract: spec = self.analyze_requirements(task) tests = self.define_test_cases(spec) constraints = self.identify_limits(task) return SprintContract(spec, tests, constraints)
class GeneratorAgent: """Creates code based on contract."""
def implement(self, contract: SprintContract) -> str: return self.generate_code(contract.spec, contract.constraints)
def iterate(self, feedback: str, code: str) -> str: return self.revise_code(code, feedback)
class EvaluatorAgent: """Validates code against contract."""
def evaluate(self, code: str, contract: SprintContract) -> Evaluation: test_results = self.run_tests(code, contract.tests) hallucinations = self.detect_hallucinations(code, contract.spec) bugs = self.find_bugs(code) return Evaluation(test_results, hallucinations, bugs)
def harness_workflow(task: str) -> str: planner = PlannerAgent() generator = GeneratorAgent() evaluator = EvaluatorAgent()
# Step 1: Plan contract = planner.create_contract(task)
# Step 2: Generate-Evaluate loop code = generator.implement(contract) evaluation = evaluator.evaluate(code, contract)
while not evaluation.is_acceptable(): feedback = evaluator.provide_feedback(evaluation) code = generator.iterate(feedback, code) evaluation = evaluator.evaluate(code, contract)
return code # Validated, working codeThe workflow:
Step 1: Planner defines sprint contract |Step 2: Generator creates implementation |Step 3: Evaluator validates against contract | (feedback loop if issues found)Step 4: Generator iterates based on feedback | (repeat until Evaluator approves)Step 5: Output validated, working codeThe Sprint Contract
The sprint contract is the critical piece I was missing. It defines success before coding starts:
spec: task: "Implement user authentication API endpoint" requirements: - POST /auth/login accepts email + password - Returns JWT token on success - Returns 401 on failure - Rate limited to 5 requests per minute
tests: - "Valid credentials return 200 with token" - "Invalid credentials return 401" - "Rate limit enforced after 5 requests" - "Token contains user_id claim"
constraints: - "Use existing User model" - "Use bcrypt for password verification" - "Token expiry: 24 hours" - "No new database tables"
acceptance_criteria: - "All tests pass" - "No hallucinated imports" - "Follows existing auth pattern" - "Security review approved"Without this contract, agents argue indefinitely about scope. With it, the negotiation phase happens first, before any code is written.
The Economics of Quality
The Reddit skeptic had a valid point:
“I’m skeptical of the shovel-peddler telling me the way to fix my shovel problem is to get my shovel a bunch of tinier shovels”
More agents do cost more tokens. But here’s the paradox:
Solo approach: - $200 generation cost - $500 debugging cost (10 hours of engineer time) - Total: $700
Harness approach: - $9 generation cost (more tokens but faster) - $0 debugging cost (code already validated) - Total: $9Higher token cost pays for itself in debugging time saved. The economics make sense when you account for the full lifecycle.
Scaling to 12 Agents
The Reddit user who eliminated hallucinations used 12 specialized agents:
specialized_agents = { # Planning Phase "pseudo_code_writer": "Writes high-level logic flow", "contract_definer": "Defines specification contracts", "architect": "Designs system structure",
# Implementation Phase "code_writer": "Implements contracts", "test_writer": "Creates test cases", "doc_writer": "Documents implementation",
# Validation Phase "code_reviewer": "Reviews for patterns and style", "compliance_reviewer": "Checks spec adherence", "security_reviewer": "Identifies vulnerabilities",
# Integration Phase "integration_agent": "Combines components", "qa_agent": "End-to-end testing", "deploy_agent": "Production readiness check"}
# Key insight: Each agent has ONE specific job# Result: "The big bugs just disappeared. No more hallucinations"But I wouldn’t start with 12 agents. The Reddit skeptic was right about over-engineering. Start with 3 (Planner, Generator, Evaluator). Add more only when complexity demands.
Common Mistakes
When I first tried building my own harness, I made several mistakes:
Mistake 1: Skipping Sprint Contracts
Without contracts, my Generator and Evaluator argued endlessly about scope. The negotiation phase is critical - it defines “done” before coding starts.
Mistake 2: Overlapping Agent Roles
I initially gave my Generator agent some review responsibilities. This confused the system. Each agent must have a singular, specific job.
Mistake 3: Single-Pass Generation
I expected one generate-evaluate cycle to produce perfect code. The feedback loop is the quality mechanism. Iteration is essential.
Iteration 1: Generator produces draft, Evaluator finds 5 issuesIteration 2: Generator fixes 3, Evaluator finds 2 remainingIteration 3: Generator fixes 2, Evaluator approvesResult: Validated code in 3 iterationsMistake 4: Ignoring the Evaluator
The Evaluator is not optional. Without validation, you’re back to solo Claude’s hallucination problem. The Generator needs external critique.
Related Knowledge
Why GANs Work for AI Coding
Generative Adversarial Networks succeed because the discriminator forces the generator to improve. The same principle applies:
- Generator wants to produce “good enough” code quickly
- Evaluator enforces stricter standards
- Competition drives quality up
Claude Code vs Explicit Harness
A Reddit comment raised an important point:
“I don’t get it. Are we pretending Claude Code doesn’t exist? Isn’t that a harness?”
Claude Code does have built-in validation loops. But explicit multi-agent design gives you:
- Control over agent roles
- Custom evaluation criteria
- Domain-specific validators
- Negotiation protocols
How to Build Your Own Harness
Start minimal:
def simple_harness(task: str) -> str: # Agent 1: Define contract contract = define_contract(task)
# Agent 2: Generate code = generate_code(contract)
# Agent 3: Evaluate while not evaluate(code, contract).passes: feedback = get_feedback(code, contract) code = iterate(code, feedback)
return codeAdd specialized agents only when you identify specific failure patterns in your workflow.
Summary
A Claude harness is a multi-agent architecture that uses generator-evaluator patterns and sprint contracts to produce validated, working code. The key insight from practitioners: specialized sub-agents with singular jobs eliminate hallucinations and big bugs.
The evidence is compelling:
- Solo Claude: 6 hours, $200, broken code
- Harness: 20 minutes, $9, working code
The token cost increase pays for itself in debugging time saved. More importantly, the generator-evaluator pattern transforms AI coding from a risky gamble into a reliable workflow.
If you’re skeptical about “tinier shovels,” start with a minimal three-agent harness. Define sprint contracts before coding begins. See if the quality improvement outweighs the complexity cost.
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Anthropic Blog: Building Effective Agents
- 👨💻 Reddit: Anthropic Shares How to Make Claude Code Better with a Harness
- 👨💻 Claude Agent SDK Documentation
- 👨💻 Generative Adversarial Networks Explained
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
Comments