How Do You Fix Claude's Self-Evaluation Bias When Coding?
Problem
When I ask Claude to review code it just generated, I get this frustrating response:
Me: Review the function you just wrote. Is it correct?Claude: The code looks correct and follows best practices. Great work!
[Then I run the code and it crashes immediately]I’ve experienced this countless times. Claude confidently says “this looks correct” right before the code fails. The problem is fundamental: Claude cannot objectively assess its own output.
A Reddit user described this perfectly:
“When asked to evaluate work they’ve produced, agents tend to respond by confidently praising the work—even when, to a human observer, the quality is obviously mediocre”
Another developer noted:
“The ‘self-evaluation bias’ point hits hard. If you’ve used Claude (or honestly any LLM) for coding, you’ve probably seen it confidently say ‘this looks correct’ right before the code crashes”
What I tried first
My initial approach was to improve the prompt:
Me: Please be critical when reviewing your code. Focus on potential bugs.
Claude: The code appears mostly correct, though there might be minor improvements...[Still not finding actual bugs that crash the app]I tried adding more constraints:
Me: Act as a harsh code reviewer. Look for edge cases, security issues, and bugs.Assume the code is wrong until proven correct.
Claude: Upon closer inspection, the implementation seems sound. The logic follows...[Still praising the code despite obvious bugs]No matter how I phrased the request, Claude kept praising code that didn’t work. The bias is baked in—it cannot see its own flaws.
Why self-evaluation bias happens
LLMs suffer from a fundamental limitation when evaluating their own output:
1. Confabulation of quality Tend to praise their work confidently even when flawed
2. Lack of execution context Cannot run or test the code to verify functionality
3. Familiarity bias Being the creator makes them blind to their own errors
4. Conflict avoidance Training creates tendency to agree rather than critically assessThis mirrors human cognitive biases but operates at scale. Every time I ask “Does this look correct?”, I’m setting up for failure.
The solution: Separate evaluator agent
Anthropic’s solution is straightforward: never let Claude judge its own work.
┌─────────────┐ ┌──────────────────┐│ Generator │ ──▶ │ Evaluator ││ Agent │ │ Agent ││ (Claude) │ │ (Skeptical) │└─────────────┘ └──────────────────┘ │ │ │ ▼ │ ┌──────────────────┐ │ │ Playwright MCP │ │ │ (Test Runner) │ │ └──────────────────┘ │ │ └──────────────────────┘ Feedback LoopKey principles:
- Evaluator has NO context about being the author
- Evaluator is tuned to be skeptical by default
- Evaluator has tool access (Playwright MCP) for actual testing
- Evaluator uses explicit grading criteria
Implementing the evaluator agent
Here’s my evaluator prompt template:
You are a skeptical code evaluator. Your job is to find flaws.
You did NOT write this code. You must verify everything.
Grading Criteria:- Design Quality (1-5): Does it follow best practices?- Originality (1-5): Is it a copy-paste or thoughtful solution?- Craft (1-5): Is the code clean and maintainable?- Functionality (1-5): Does it actually work?
Before scoring:1. Use Playwright MCP to test the application2. Check for edge cases3. Verify error handling4. Look for security issues
Be harsh. Finding bugs is more valuable than praise.The key difference from self-review prompts: “You did NOT write this code.” This removes the familiarity bias.
Multi-agent implementation with LangGraph
I implemented this with LangGraph:
from langgraph import StateGraph
def generator_agent(state): """Produces code solution""" code = claude.generate(state["task"]) return {"code": code}
def evaluator_agent(state): """Skeptical testing agent - runs actual tests""" # Run actual tests with Playwright MCP test_results = playwright_mcp.test(state["code"])
# Apply grading rubric score = grade_with_criteria( code=state["code"], test_results=test_results, criteria=["design", "originality", "craft", "functionality"] )
return {"score": score, "feedback": test_results}
# Build feedback loopgraph = StateGraph(AgentState)graph.add_node("generator", generator_agent)graph.add_node("evaluator", evaluator_agent)graph.add_edge("generator", "evaluator")graph.add_conditional_edges( "evaluator", lambda s: "retry" if s["score"] < 4 else "done")The evaluator agent runs Playwright tests before scoring. This means the evaluation is based on actual execution, not just code reading.
Evaluator with Playwright MCP integration
Here’s the evaluator configuration:
const evaluatorAgent = { name: "code-evaluator",
tools: [ { name: "playwright_test", description: "Run Playwright tests on generated app", execute: async (codePath) => { // Launch app, run tests, capture results return await playwright.mcp.runTests(codePath); } } ],
systemPrompt: ` You are a harsh code critic. Use Playwright to test the app before scoring. Never trust that code works - verify it. `};When I run this system:
Generator: Creates login form componentEvaluator: Running Playwright tests...
Test 1: Submit button disabled when empty[PASS]
Test 2: Error message on invalid credentials[FAIL] - Error not displayed, component crashes
Test 3: Redirect on successful login[FAIL] - Route not configured
Score: 2/5 (Functionality failed)Feedback: Fix error handling and routingThe evaluator found bugs the generator couldn’t see because it actually tested the code.
What happens with self-review vs. evaluator
Here’s the comparison:
[Self-Review]Claude: "The login form looks correct. Great implementation!"Reality: Crashes on invalid credentials, routing broken
[Evaluator Agent]Evaluator: "Testing login functionality... [FAIL] [FAIL]"Score: 2/5Feedback: Specific bugs identifiedThe evaluator catches real bugs because it runs actual tests instead of just reading code.
Advanced: Contract testing for multi-agent systems
A developer with 12 subagents shared this insight:
“Contract testing was a big thing that removed many hallucinations”
Contract testing means each agent verifies that its output meets specific contracts before passing to the next agent:
class AgentContract: """Defines what valid output looks like"""
def validate(self, output: dict) -> tuple[bool, str]: """Verify output meets requirements"""
# Example: Generated code must have tests if not output.get("tests"): return False, "No tests provided"
# Example: Code must pass linting lint_result = run_linter(output["code"]) if lint_result.errors: return False, f"Lint errors: {lint_result.errors}"
return True, "Contract satisfied"
def agent_with_contract(agent, contract): """Wrap agent with contract validation"""
def wrapped_agent(state): output = agent(state)
valid, reason = contract.validate(output) if not valid: # Retry or escalate return {"error": reason, "retry": True}
return output
return wrapped_agentThis prevents hallucinations from propagating through the agent chain.
Why Playwright MCP matters
The evaluator needs actual testing tools, not just code reading. Here’s why:
[Without Playwright MCP]Evaluator reads code: "The error handling logic looks correct"Reality: Runtime error never caught
[With Playwright MCP]Evaluator runs test: "Submit invalid credentials... CRASH"Result: Bug found and feedback providedPlaywright MCP enables the evaluator to:
- Launch the application
- Interact with UI elements
- Verify actual behavior
- Capture runtime errors
This turns evaluation from opinion into verification.
Summary
In this post, I explained how to fix Claude’s self-evaluation bias. The key point is implementing a separate evaluator agent that tests code independently from the generator.
Self-evaluation bias isn’t a bug you can prompt away—it’s a fundamental architectural limitation requiring structural solutions.
The fix requires five steps:
- Separate generator from evaluator - Never let Claude judge its own work
- Tune evaluator for skepticism - Bias toward finding flaws, not praise
- Give evaluator real tools - Playwright MCP enables actual testing
- Use explicit grading criteria - Design, Originality, Craft, Functionality
- Implement feedback loops - Allow retry cycles based on evaluation
For production systems, consider multi-agent architectures with contract testing between agents to prevent hallucinations from propagating.
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Reddit: Claude self-evaluation bias discussion
- 👨💻 Anthropic: Coding Harness Implementation
- 👨💻 Playwright MCP Documentation
- 👨💻 LangGraph Multi-Agent Patterns
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
Comments