How to Benchmark AI Models for Code Review: A Rigorous 133-Cycle Methodology

Mar 5, 2026

I spent 22 days comparing AI models for code review. After 133 cycles across 42 phases of a platform refactoring project, I realized my initial approach was fundamentally flawed.

The problem? Every comparison I saw showed “their” model winning. The OpenAI benchmarks showed GPT on top. Anthropic’s tests showed Claude ahead. Independent reviews often had subtle biases I couldn’t identify.

A comment on Reddit crystallized my concern: “Anyone notice any ‘evaluation’ posted in any of the AI frontier model subs that model always wins… Just food for thought when you read these people’s opinions.”

I needed a methodology that would produce results I could trust. Here’s what I built.

The Core Problem: Context Bleed-Through

My first attempt at comparing models looked like this:

Session: Run GPT review on commit X
Session: Run Claude review on same commit
Session: Compare results in same conversation

This failed immediately. Once I mentioned GPT’s output in the Claude session, Claude’s responses subtly shifted. It started agreeing with points it had not raised on its own, or worse, reporting issues it had only “noticed” in GPT’s report.

Even worse: when I ran sequential reviews, the second model would see implementation changes made based on the first model’s feedback. The comparison was no longer fair.

Model A reviews commit → Developer fixes issues → Model B reviews updated commit
                                          ↑
                                    Unfair advantage!

Solution: Session Isolation

I rebuilt the entire approach around a simple principle: complete session isolation.

┌─────────────────────────────────────────────────────────────────┐
│                     REVIEW CYCLE                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐                  │
│   │ Model A  │    │ Model B  │    │ Model C  │                  │
│   │ (fresh)  │    │ (fresh)  │    │ (fresh)  │                  │
│   └────┬─────┘    └────┬─────┘    └────┬─────┘                  │
│        │               │               │                         │
│        │  Same prompt  │  Same prompt  │  Same prompt            │
│        │  Same code    │  Same code    │  Same code              │
│        │               │               │                         │
│        ▼               ▼               ▼                         │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐                  │
│   │ Output A │    │ Output B │    │ Output C │                  │
│   └────┬─────┘    └────┬─────┘    └────┬─────┘                  │
│        │               │               │                         │
│        └───────────────┼───────────────┘                         │
│                        │                                         │
│                        ▼                                         │
│              ┌─────────────────┐                                 │
│              │  Orchestrator   │  ← Fresh session,               │
│              │  (synthesize)   │    no evaluation                │
│              └─────────────────┘                                 │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Each model receives:

A fresh CLI session with no conversation history
Identical prompts (validated beforehand)
The exact same code state
No knowledge of other models’ outputs
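A minimal sketch of what a fresh session per model could look like, assuming each CLI can run non-interactively, read the prompt on stdin, and print its review on stdout. The function name `run_isolated` and the shape of `cmd` are hypothetical; substitute whatever invocation your tool actually documents.

```python
import os
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List


def run_isolated(cmd: List[str], prompt: str, code_dir: Path, timeout: int = 600) -> str:
    """Run one review in a throwaway working directory and return its stdout.

    `cmd` is a placeholder for the command that launches a model's CLI
    in non-interactive mode (an assumption, not a documented invocation).
    """
    # The temporary directory, and anything the tool writes into it,
    # is deleted as soon as the `with` block exits.
    with tempfile.TemporaryDirectory(prefix="review-") as tmp:
        workdir = Path(tmp) / "code"
        # Every model gets its own copy of the exact same code state.
        shutil.copytree(code_dir, workdir)
        # Minimal environment so nothing leaks in from the parent shell.
        # A real CLI will likely need its API key added here explicitly.
        env = {k: v for k, v in os.environ.items() if k in ("PATH", "SYSTEMROOT")}
        env["HOME"] = tmp
        result = subprocess.run(
            cmd,
            input=prompt,
            capture_output=True,
            text=True,
            cwd=workdir,
            env=env,
            timeout=timeout,
            check=True,
        )
        return result.stdout
```

Pointing `HOME` at the temporary directory is one way to keep a tool from picking up cached history or config from earlier runs; whether that is enough depends on where a given CLI stores its state.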

Why Separate CLI Tools Matter

I initially used a single tool with different model configurations. Bad idea. The underlying framework can cache context or share state in ways that aren’t obvious.

The solution: use native tools for each model family.

GPT models → Codex CLI (native OpenAI integration)
Claude models → Claude Code (native Anthropic API)

This ensures complete separation at the tooling layer.

Standardized Prompts: Eliminating Engineering Bias

My second major mistake: optimizing prompts for one model’s preferences.

I wrote prompts that worked well with GPT’s response style. When I ran the same prompts through Claude, the results looked worse—not because Claude was worse at the task, but because the prompt format didn’t match Claude’s strengths.

Pre-Benchmark Validation

Before running any benchmark cycles, I ran a prompt validation phase:

1. Draft the review prompt format
2. Send to all models in the panel
3. Collect feedback on clarity and structure
4. Iterate until ALL models agree the prompt is unambiguous
5. Lock the prompt for all benchmark cycles

✓ All models understand the task description
✓ All models interpret evaluation criteria identically
✓ All models produce output in the expected format
✓ No model expresses confusion about constraints
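One way to enforce the "lock" step is to fingerprint the validated prompt and refuse to run if the text later changes. This is a sketch under assumptions: `PROMPT_TEMPLATE` is a shortened stand-in for the real prompt, and the `{diff}` placeholder and function names are hypothetical.

```python
import hashlib

# Shortened stand-in for the validated prompt; "{diff}" marks where the code goes.
PROMPT_TEMPLATE = "Review the following code changes.\n\n{diff}\n"


def fingerprint(text: str) -> str:
    """Return the SHA-256 hex digest of the prompt text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Computed once when the prompt is locked, then stored with the benchmark config.
LOCKED_FINGERPRINT = fingerprint(PROMPT_TEMPLATE)


def build_prompt(template: str, diff: str) -> str:
    """Fill in the diff, refusing any template that changed after locking."""
    if fingerprint(template) != LOCKED_FINGERPRINT:
        raise ValueError("prompt template changed after it was locked")
    # str.replace instead of str.format: the real prompt contains literal
    # JSON braces, which str.format would try to interpret as fields.
    return template.replace("{diff}", diff)
```

Logging the fingerprint next to every recorded output also gives you evidence afterwards that all models received the same text.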

The Final Prompt Structure

Review the following code changes for:

1. Blockers: Critical issues that prevent deployment
2. Non-blockers: Issues that should be addressed but don't block
3. Suggestions: Optional improvements
4. Summary: Brief wrap-up of findings

[Code diff here]

Output in the following JSON format:
{
  "blockers": [...],
  "non_blockers": [...],
  "suggestions": [...],
  "summary": "..."
}

This structure emerged from the validation phase. Both GPT and Claude variants confirmed it was unambiguous.
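Because every model is asked for the same JSON shape, each reply can be checked mechanically before it is recorded. A minimal sketch, assuming the reply is pure JSON with no surrounding prose; the function name `parse_review` is hypothetical.

```python
import json
from typing import Any, Dict

REQUIRED_LISTS = ("blockers", "non_blockers", "suggestions")


def parse_review(raw: str) -> Dict[str, Any]:
    """Parse a model's reply and check that it matches the locked output format."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) if malformed
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    for key in REQUIRED_LISTS:
        if not isinstance(data.get(key), list):
            raise ValueError(f"'{key}' must be a list")
    if not isinstance(data.get("summary"), str):
        raise ValueError("'summary' must be a string")
    return data
```

Rejected replies are worth recording too, since a model that fails to follow the format is itself a data point.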

Blind Evaluation

The final piece: models must never see their own scores or other models’ reports.

Two-Stage Evaluation

Stage 1 (During benchmark): Recording only

from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class RecordedOutput:
    """Anonymized output - model identity hidden."""
    output_id: str  # Random UUID, NOT model name
    blockers: List[dict]
    non_blockers: List[dict]
    suggestions: List[dict]
    summary: str
    timestamp: datetime
    # No score, no model_id at this stage
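Hiding the model name only works if the mapping from ID to model lives somewhere the evaluators never see. The sketch below shows one way to split the two, assuming reviews are already parsed into dicts; the function name `anonymize` and the idea of a separate key entry are assumptions, not something the record above prescribes.

```python
import uuid
from typing import Dict, Tuple


def anonymize(model_name: str, review: dict) -> Tuple[dict, Dict[str, str]]:
    """Return (record for evaluators, key entry to store separately)."""
    output_id = str(uuid.uuid4())
    # The record carries the review content plus a random ID, never the model name.
    record = {**review, "output_id": output_id}
    # The key entry is the only place the ID is tied back to a model.
    key_entry = {output_id: model_name}
    return record, key_entry
```

The key entries would be opened only after scoring is finished, to map scores back to models.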

Stage 2 (Post-benchmark): Blind evaluation

After all 133 cycles completed, I used two independent AI evaluators to score the accumulated data. They received:

Anonymized outputs (output_id only, no model names)
Standardized scoring rubric
Ground truth for each cycle (human-verified correct answers)

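from dataclasses import dataclass


@dataclass
class Score:
    # Assumed shape; the full version carries more metrics
    blocker_accuracy: float
    false_positive_rate: float


class BlindEvaluator:
    """Evaluates outputs without knowing which model produced them."""

    def __init__(self, ground_truth: dict):
        # Human-verified findings for one cycle, e.g. {"blockers": ["B1", "B2"]} (illustrative IDs)
        self.ground_truth = ground_truth

    def evaluate_output(self, output: RecordedOutput) -> Score:
        # Score against ground truth, not other models
        real_blockers = set(self.ground_truth["blockers"])
        found_blockers = {b["id"] for b in output.blockers}

        # Share of real blockers caught; 1.0 by convention when there were none to find
        accuracy = len(real_blockers & found_blockers) / len(real_blockers) if real_blockers else 1.0
        false_positives = found_blockers - real_blockers
        # Share of flagged blockers that were not real
        fp_rate = len(false_positives) / len(found_blockers) if found_blockers else 0.0

        return Score(
            blocker_accuracy=accuracy,
            false_positive_rate=fp_rate,
            # ... other metrics
        )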

Evaluation Criteria Weights

┌────────────────────────────────────────────────────────────┐
│                    SCORING BREAKDOWN                        │
├────────────────────────────────────────────────────────────┤
│                                                             │
│  Blocker Identification ────────────────────── 40%         │
│  ├── Accuracy: Did it catch real blockers?                  │
│  └── False positive rate: Flagging non-blockers?            │
│                                                             │
│  Non-Blocker Detection ─────────────────────── 30%         │
│  ├── Coverage: How many issues found?                       │
│  └── Relevance: Are they actually issues?                   │
│                                                             │
│  Suggestion Quality ────────────────────────── 20%         │
│  ├── Actionability: Can developers use them?                │
│  └── Insight: Deep understanding shown?                     │
│                                                             │
│  Summary Completeness ──────────────────────── 10%         │
│  ├── Clarity: Clear and concise?                            │
│  └── Accuracy: Reflects findings correctly?                 │
│                                                             │
└────────────────────────────────────────────────────────────┘
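The weights above combine into a single number as a plain weighted sum. The sketch below assumes each category has already been reduced to a score between 0 and 1; how the two sub-criteria within a category are merged is not specified here, so that step is left out. The names `WEIGHTS` and `overall_score` are hypothetical.

```python
WEIGHTS = {
    "blockers": 0.40,
    "non_blockers": 0.30,
    "suggestions": 0.20,
    "summary": 0.10,
}


def overall_score(category_scores: dict) -> float:
    """Combine per-category scores (each in [0, 1]) into one weighted total."""
    if set(category_scores) != set(WEIGHTS):
        raise ValueError("expected exactly these categories: " + ", ".join(WEIGHTS))
    for name, value in category_scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for '{name}' must be between 0 and 1")
    return sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)
```

For example, category scores of 0.8 (blockers), 0.5 (non-blockers), 0.5 (suggestions), and 1.0 (summary) give 0.32 + 0.15 + 0.10 + 0.10 = 0.67.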

The Orchestrator Pattern: Multi-Model Synthesis

For each review cycle, I ran 3-4 models in parallel, then used a fresh orchestrator session to synthesize their outputs.

class ReviewOrchestrator:
    """Synthesizes outputs from multiple models WITHOUT evaluation."""

    def synthesize(self, outputs: List[RecordedOutput]) -> dict:
        # Find consensus (all models agree)
        consensus = self._find_consensus([o.blockers for o in outputs])

        # Find disagreements (valuable signal)
        disagreements = self._find_disagreements(outputs)

        return {
            "consensus": consensus,
            "disagreements": disagreements,
            "unique_findings": self._extract_unique(outputs),
            # No scores, no model names
        }

The orchestrator’s job is synthesis, not judgment. It produces a unified report for human review but never evaluates which model performed better.
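The helper methods are not shown above, so here is one possible way to do the bucketing, working on sets of blocker IDs rather than full records. It assumes findings from different models can already be matched by a shared ID, which in practice is the hard part; the function name `split_findings` is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def split_findings(blocker_ids_per_output: List[Set[str]]) -> Dict[str, Set[str]]:
    """Bucket blocker IDs by how many outputs reported them."""
    total = len(blocker_ids_per_output)
    if total < 2:
        raise ValueError("need at least two outputs to compare")
    # Each output contributes at most one vote per ID because it is a set.
    counts = Counter(i for ids in blocker_ids_per_output for i in ids)
    return {
        # Reported by every output
        "consensus": {i for i, n in counts.items() if n == total},
        # Reported by some outputs but not all
        "disagreements": {i for i, n in counts.items() if 1 < n < total},
        # Reported by exactly one output
        "unique_findings": {i for i, n in counts.items() if n == 1},
    }
```

Note that nothing in the result says which output raised which finding, which keeps the synthesis free of model identity.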

What 133 Cycles Revealed

After 22 days and 133 review cycles, the blind evaluation produced clear patterns:

Blocker detection accuracy varied significantly between models—not always in the direction I expected based on each company’s marketing.
False positive rates differed wildly. One model was aggressive in flagging blockers, catching more real issues but also generating more noise.
Suggestion quality showed the most variance. Some models excelled at identifying problems but struggled with actionable recommendations.
Model agreement was lower than expected. On the same code, models often disagreed on what constituted a blocker vs. non-blocker.
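Agreement between models can be put on a scale. The sketch below uses Jaccard similarity (shared blockers divided by all blockers either output flagged) as an assumed metric; the results above do not say how agreement was measured, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union; 1.0 if both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_agreement(blockers: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """Jaccard similarity of flagged blocker IDs for every pair of outputs.

    Keys of `blockers` are anonymized output IDs, not model names.
    """
    return {
        (x, y): jaccard(blockers[x], blockers[y])
        for x, y in combinations(sorted(blockers), 2)
    }
```

A value of 1.0 means two outputs flagged exactly the same blockers, and 0.0 means they shared none.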

The detailed results deserve their own article, but the methodology itself is what matters: without isolation, standardization, and blinding, I would have drawn incorrect conclusions.

Implementation Checklist

If you’re building your own benchmark:

Session Isolation
─────────────────────────────────────────────────────────────
[ ] Separate CLI tools for each model family
[ ] Fresh session per model per cycle
[ ] No conversation history carried forward
[ ] Clean API connections (no shared state)
[ ] Temporary files deleted after each session

Prompt Standardization
─────────────────────────────────────────────────────────────
[ ] Pre-validate prompts with all models
[ ] Lock format before benchmark begins
[ ] Same prompt text delivered to all models
[ ] Consistent output format requirements

Blind Evaluation
─────────────────────────────────────────────────────────────
[ ] Models never see their own scores
[ ] Models never see other models' outputs
[ ] Evaluators receive anonymized data only
[ ] Ground truth established independently
[ ] Two-stage process: record then evaluate

Statistical Rigor
─────────────────────────────────────────────────────────────
[ ] 50+ cycles as a working minimum (the count needed for significance depends on effect size and variance)
[ ] Record all raw outputs before evaluation
[ ] Use multiple independent evaluators
[ ] Report variance, not just means
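For the last item, a minimal sketch of reporting spread alongside the mean, using only the standard library. The percentile bootstrap for the confidence interval is an assumed choice, not necessarily what was used in this benchmark, and `summarize` is a hypothetical name.

```python
import random
import statistics
from typing import List


def summarize(scores: List[float], n_resamples: int = 2000, seed: int = 0) -> dict:
    """Mean, sample standard deviation, and a bootstrap 95% interval for the mean."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate variance")
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    # Resample the per-cycle scores with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return {
        "mean": statistics.fmean(scores),
        "stdev": statistics.stdev(scores),
        "ci95": (lower, upper),
    }
```

If two models' intervals overlap heavily, the difference in their means is probably not worth reporting as a win.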

Why This Matters

The AI landscape is flooded with comparisons that show the author’s preferred model winning. Some of this is intentional bias. Most is unintentional—methodology flaws that favor one approach over another.

The 133-cycle methodology isn’t perfect, but it produces results I can defend. When someone asks “how do you know this comparison is fair?”, I can point to:

Complete session isolation logs
Pre-validated prompts
Blind evaluation protocols
Statistical analysis across sufficient cycles

That’s the minimum bar for trustworthy AI model benchmarking.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
