Skip to content

How Do You Fix Claude's Self-Evaluation Bias When Coding?

Problem

When I ask Claude to review code it just generated, I get this frustrating response:

self-review-problem.txt
Me: Review the function you just wrote. Is it correct?
Claude: The code looks correct and follows best practices. Great work!
[Then I run the code and it crashes immediately]

I’ve experienced this countless times. Claude confidently says “this looks correct” right before the code fails. The problem is fundamental: Claude cannot objectively assess its own output.

A Reddit user described this perfectly:

“When asked to evaluate work they’ve produced, agents tend to respond by confidently praising the work—even when, to a human observer, the quality is obviously mediocre”

Another developer noted:

“The ‘self-evaluation bias’ point hits hard. If you’ve used Claude (or honestly any LLM) for coding, you’ve probably seen it confidently say ‘this looks correct’ right before the code crashes”

What I tried first

My initial approach was to improve the prompt:

first-approach.txt
Me: Please be critical when reviewing your code. Focus on potential bugs.
Claude: The code appears mostly correct, though there might be minor improvements...
[Still not finding actual bugs that crash the app]

I tried adding more constraints:

second-approach.txt
Me: Act as a harsh code reviewer. Look for edge cases, security issues, and bugs.
Assume the code is wrong until proven correct.
Claude: Upon closer inspection, the implementation seems sound. The logic follows...
[Still praising the code despite obvious bugs]

No matter how I phrased the request, Claude kept praising code that didn’t work. The bias is baked in—it cannot see its own flaws.

Why self-evaluation bias happens

LLMs suffer from a fundamental limitation when evaluating their own output:

bias-causes.txt
1. Confabulation of quality
Tend to praise their work confidently even when flawed
2. Lack of execution context
Cannot run or test the code to verify functionality
3. Familiarity bias
Being the creator makes them blind to their own errors
4. Conflict avoidance
Training creates tendency to agree rather than critically assess

This mirrors human cognitive biases but operates at scale. Every time I ask “Does this look correct?”, I’m setting up for failure.

The solution: Separate evaluator agent

Anthropic’s solution is straightforward: never let Claude judge its own work.

architecture-diagram.txt
┌─────────────┐ ┌──────────────────┐
│ Generator │ ──▶ │ Evaluator │
│ Agent │ │ Agent │
│ (Claude) │ │ (Skeptical) │
└─────────────┘ └──────────────────┘
│ │
│ ▼
│ ┌──────────────────┐
│ │ Playwright MCP │
│ │ (Test Runner) │
│ └──────────────────┘
│ │
└──────────────────────┘
Feedback Loop

Key principles:

  • Evaluator has NO context about being the author
  • Evaluator is tuned to be skeptical by default
  • Evaluator has tool access (Playwright MCP) for actual testing
  • Evaluator uses explicit grading criteria

Implementing the evaluator agent

Here’s my evaluator prompt template:

evaluator-prompt.md
You are a skeptical code evaluator. Your job is to find flaws.
You did NOT write this code. You must verify everything.
Grading Criteria:
- Design Quality (1-5): Does it follow best practices?
- Originality (1-5): Is it a copy-paste or thoughtful solution?
- Craft (1-5): Is the code clean and maintainable?
- Functionality (1-5): Does it actually work?
Before scoring:
1. Use Playwright MCP to test the application
2. Check for edge cases
3. Verify error handling
4. Look for security issues
Be harsh. Finding bugs is more valuable than praise.

The key difference from self-review prompts: “You did NOT write this code.” This removes the familiarity bias.

Multi-agent implementation with LangGraph

I implemented this with LangGraph:

multi-agent-evaluator.py
from langgraph import StateGraph
def generator_agent(state):
"""Produces code solution"""
code = claude.generate(state["task"])
return {"code": code}
def evaluator_agent(state):
"""Skeptical testing agent - runs actual tests"""
# Run actual tests with Playwright MCP
test_results = playwright_mcp.test(state["code"])
# Apply grading rubric
score = grade_with_criteria(
code=state["code"],
test_results=test_results,
criteria=["design", "originality", "craft", "functionality"]
)
return {"score": score, "feedback": test_results}
# Build feedback loop
graph = StateGraph(AgentState)
graph.add_node("generator", generator_agent)
graph.add_node("evaluator", evaluator_agent)
graph.add_edge("generator", "evaluator")
graph.add_conditional_edges(
"evaluator",
lambda s: "retry" if s["score"] < 4 else "done"
)

The evaluator agent runs Playwright tests before scoring. This means the evaluation is based on actual execution, not just code reading.

Evaluator with Playwright MCP integration

Here’s the evaluator configuration:

evaluator-with-playwright.ts
const evaluatorAgent = {
name: "code-evaluator",
tools: [
{
name: "playwright_test",
description: "Run Playwright tests on generated app",
execute: async (codePath) => {
// Launch app, run tests, capture results
return await playwright.mcp.runTests(codePath);
}
}
],
systemPrompt: `
You are a harsh code critic.
Use Playwright to test the app before scoring.
Never trust that code works - verify it.
`
};

When I run this system:

test-results.txt
Generator: Creates login form component
Evaluator: Running Playwright tests...
Test 1: Submit button disabled when empty
[PASS]
Test 2: Error message on invalid credentials
[FAIL] - Error not displayed, component crashes
Test 3: Redirect on successful login
[FAIL] - Route not configured
Score: 2/5 (Functionality failed)
Feedback: Fix error handling and routing

The evaluator found bugs the generator couldn’t see because it actually tested the code.

What happens with self-review vs. evaluator

Here’s the comparison:

comparison.txt
[Self-Review]
Claude: "The login form looks correct. Great implementation!"
Reality: Crashes on invalid credentials, routing broken
[Evaluator Agent]
Evaluator: "Testing login functionality... [FAIL] [FAIL]"
Score: 2/5
Feedback: Specific bugs identified

The evaluator catches real bugs because it runs actual tests instead of just reading code.

Advanced: Contract testing for multi-agent systems

A developer with 12 subagents shared this insight:

“Contract testing was a big thing that removed many hallucinations”

Contract testing means each agent verifies that its output meets specific contracts before passing to the next agent:

contract-testing.py
class AgentContract:
"""Defines what valid output looks like"""
def validate(self, output: dict) -> tuple[bool, str]:
"""Verify output meets requirements"""
# Example: Generated code must have tests
if not output.get("tests"):
return False, "No tests provided"
# Example: Code must pass linting
lint_result = run_linter(output["code"])
if lint_result.errors:
return False, f"Lint errors: {lint_result.errors}"
return True, "Contract satisfied"
def agent_with_contract(agent, contract):
"""Wrap agent with contract validation"""
def wrapped_agent(state):
output = agent(state)
valid, reason = contract.validate(output)
if not valid:
# Retry or escalate
return {"error": reason, "retry": True}
return output
return wrapped_agent

This prevents hallucinations from propagating through the agent chain.

Why Playwright MCP matters

The evaluator needs actual testing tools, not just code reading. Here’s why:

playwright-importance.txt
[Without Playwright MCP]
Evaluator reads code: "The error handling logic looks correct"
Reality: Runtime error never caught
[With Playwright MCP]
Evaluator runs test: "Submit invalid credentials... CRASH"
Result: Bug found and feedback provided

Playwright MCP enables the evaluator to:

  • Launch the application
  • Interact with UI elements
  • Verify actual behavior
  • Capture runtime errors

This turns evaluation from opinion into verification.

Summary

In this post, I explained how to fix Claude’s self-evaluation bias. The key point is implementing a separate evaluator agent that tests code independently from the generator.

Self-evaluation bias isn’t a bug you can prompt away—it’s a fundamental architectural limitation requiring structural solutions.

The fix requires five steps:

  1. Separate generator from evaluator - Never let Claude judge its own work
  2. Tune evaluator for skepticism - Bias toward finding flaws, not praise
  3. Give evaluator real tools - Playwright MCP enables actual testing
  4. Use explicit grading criteria - Design, Originality, Craft, Functionality
  5. Implement feedback loops - Allow retry cycles based on evaluation

For production systems, consider multi-agent architectures with contract testing between agents to prevent hallucinations from propagating.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments