How to Benchmark AI Models for Code Review: A Rigorous 133-Cycle Methodology
I spent 22 days comparing AI models for code review. After 133 cycles across 42 phases of a platform refactoring project, I realized my initial approach was fundamentally flawed.
The problem? Every comparison I saw showed “their” model winning. The OpenAI benchmarks showed GPT on top. Anthropic’s tests showed Claude ahead. Independent reviews often had subtle biases I couldn’t identify.
A comment on Reddit crystallized my concern: “Anyone notice any ‘evaluation’ posted in any of the AI frontier model subs that model always wins… Just food for thought when you read these people’s opinions.”
I needed a methodology that would produce results I could trust. Here’s what I built.
The Core Problem: Context Bleed-Through
My first attempt at comparing models looked like this:
Session: Run GPT review on commit X
Session: Run Claude review on same commit
Session: Compare results in same conversation

This failed immediately. Once I mentioned GPT’s output in the Claude session, Claude’s responses subtly shifted. It started agreeing with points it had independently discovered, or worse, finding issues it had “noticed” from GPT’s report.
Even worse: when I ran sequential reviews, the second model would see implementation changes made based on the first model’s feedback. The comparison was no longer fair.
Model A reviews commit → Developer fixes issues → Model B reviews updated commit
                                                   ↑ Unfair advantage!

Solution: Session Isolation
I rebuilt the entire approach around a simple principle: complete session isolation.
┌─────────────────────────────────────────────────────────────────┐
│                          REVIEW CYCLE                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐                  │
│   │ Model A  │    │ Model B  │    │ Model C  │                  │
│   │ (fresh)  │    │ (fresh)  │    │ (fresh)  │                  │
│   └────┬─────┘    └────┬─────┘    └────┬─────┘                  │
│        │               │               │                        │
│        │ Same prompt   │ Same prompt   │ Same prompt            │
│        │ Same code     │ Same code     │ Same code              │
│        │               │               │                        │
│        ▼               ▼               ▼                        │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐                  │
│   │ Output A │    │ Output B │    │ Output C │                  │
│   └────┬─────┘    └────┬─────┘    └────┬─────┘                  │
│        │               │               │                        │
│        └───────────────┼───────────────┘                        │
│                        │                                        │
│                        ▼                                        │
│               ┌─────────────────┐                               │
│               │  Orchestrator   │ ← Fresh session,              │
│               │  (synthesize)   │   no evaluation               │
│               └─────────────────┘                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Each model receives:
- A fresh CLI session with no conversation history
- Identical prompts (validated beforehand)
- The exact same code state
- No knowledge of other models’ outputs
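As a minimal sketch of the "identical prompts, exact same code state" requirement, the inputs for one cycle can be fingerprinted and compared before anything is sent. The names `ReviewJob`, `build_jobs`, and `assert_identical_inputs` are hypothetical and not from my benchmark code; the sketch assumes the prompt and the diff are available as plain strings.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReviewJob:
    """One model's input for one cycle. Holds no other model's output."""
    cycle_id: int
    model_label: str  # hypothetical label, e.g. "model-a"
    prompt: str
    diff: str

    def input_fingerprint(self) -> str:
        """SHA-256 over prompt and diff; the model label is excluded on purpose."""
        h = hashlib.sha256()
        h.update(self.prompt.encode("utf-8"))
        h.update(b"\x00")  # separator so the prompt/diff boundary cannot blur
        h.update(self.diff.encode("utf-8"))
        return h.hexdigest()


def build_jobs(cycle_id: int, models: List[str], prompt: str, diff: str) -> List[ReviewJob]:
    """Create one job per model from a single frozen prompt and diff."""
    return [ReviewJob(cycle_id, m, prompt, diff) for m in models]


def assert_identical_inputs(jobs: List[ReviewJob]) -> None:
    """Raise if any model in the cycle would see a different prompt or diff."""
    fingerprints = {job.input_fingerprint() for job in jobs}
    if len(fingerprints) > 1:
        raise ValueError("models in this cycle would receive different inputs")
```

Logging the fingerprint next to each recorded output gives a simple audit trail: two outputs are comparable only if their fingerprints match.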
Why Separate CLI Tools Matter
I initially used a single tool with different model configurations. Bad idea. The underlying framework can cache context or share state in ways that aren’t obvious.
The solution: use native tools for each model family.
GPT models → Codex CLI (native OpenAI integration)
Claude models → Claude Code (native Anthropic API)

This ensures complete separation at the tooling layer.
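One way to reinforce that separation at the process level is to launch every review in a throwaway working directory with a stripped-down environment. The sketch below is an assumption about how this could be done, not the wrapper I used: `run_isolated` is a hypothetical helper, and the actual command line for each CLI is deliberately left to the caller because it is not part of this article.

```python
import os
import subprocess
import tempfile
from typing import List, Tuple


def run_isolated(command: List[str], stdin_text: str,
                 keep_env: Tuple[str, ...] = ("PATH",)) -> str:
    """Run one command in a temporary directory with a minimal environment.

    Only the variables named in keep_env are passed through, so anything a
    tool needs (for example an API key variable) must be listed explicitly.
    """
    env = {name: os.environ[name] for name in keep_env if name in os.environ}
    with tempfile.TemporaryDirectory() as workdir:
        result = subprocess.run(
            command,
            input=stdin_text,
            capture_output=True,
            text=True,
            cwd=workdir,   # fresh directory, deleted when the block exits
            env=env,       # nothing inherited beyond keep_env
            check=True,    # fail loudly instead of recording a broken run
        )
    return result.stdout
```

This only isolates what the process can see locally. It does not control state that a tool keeps elsewhere, such as configuration in the user's home directory, so that still has to be checked per tool.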
Standardized Prompts: Eliminating Engineering Bias
My second major mistake: optimizing prompts for one model’s preferences.
I wrote prompts that worked well with GPT’s response style. When I ran the same prompts through Claude, the results looked worse—not because Claude was worse at the task, but because the prompt format didn’t match Claude’s strengths.
Pre-Benchmark Validation
Before running any benchmark cycles, I ran a prompt validation phase:
- Draft the review prompt format
- Send to all models in the panel
- Collect feedback on clarity and structure
- Iterate until ALL models agree the prompt is unambiguous
- Lock the prompt for all benchmark cycles
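The loop above can be sketched as follows. This is an illustration under assumptions, not my actual tooling: `collect_feedback` and `revise` are hypothetical callables that you would supply (the first asks every model for objections, the second produces the next draft), and the SHA-256 digest is simply one convenient way to prove later that the locked prompt never changed.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Feedback = Dict[str, List[str]]  # model label -> list of objections


def validate_and_lock(
    draft: str,
    collect_feedback: Callable[[str], Feedback],
    revise: Callable[[str, Feedback], str],
    max_rounds: int = 10,
) -> Tuple[str, str]:
    """Iterate until no model objects, then return (prompt, sha256 digest)."""
    prompt = draft
    for _ in range(max_rounds):
        feedback = collect_feedback(prompt)
        open_issues = {m: notes for m, notes in feedback.items() if notes}
        if not open_issues:
            digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
            return prompt, digest
        prompt = revise(prompt, open_issues)
    raise RuntimeError("prompt still has open objections after max_rounds")
```

Before each benchmark cycle, recomputing the digest of the prompt about to be sent and comparing it with the stored one catches accidental edits.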
✓ All models understand the task description
✓ All models interpret evaluation criteria identically
✓ All models produce output in the expected format
✓ No model expresses confusion about constraints

The Final Prompt Structure
Review the following code changes for:

1. Blockers: Critical issues that prevent deployment
2. Non-blockers: Issues that should be addressed but don't block
3. Suggestions: Optional improvements
4. Summary: Brief wrap-up of findings

[Code diff here]

Output in the following JSON format:
{
  "blockers": [...],
  "non_blockers": [...],
  "suggestions": [...],
  "summary": "..."
}

This structure emerged from the validation phase. Both GPT and Claude variants confirmed it was unambiguous.
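Because every model is asked for the same JSON shape, replies can be checked mechanically before they are recorded. The following is a minimal sketch of such a check; `parse_review` is a hypothetical helper, and it validates only the four top-level keys because the prompt above does not specify the structure of the individual list items.

```python
import json
from typing import List

REQUIRED_LISTS = ("blockers", "non_blockers", "suggestions")


def parse_review(raw: str) -> dict:
    """Parse a model's reply and check it against the locked output format.

    Raises ValueError on malformed JSON (json.JSONDecodeError is a subclass
    of ValueError) or on a missing or wrongly typed top-level key.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    problems: List[str] = []
    for key in REQUIRED_LISTS:
        if not isinstance(data.get(key), list):
            problems.append(f"'{key}' must be a list")
    if not isinstance(data.get("summary"), str):
        problems.append("'summary' must be a string")
    if problems:
        raise ValueError("; ".join(problems))
    return data
```

Rejected replies are worth recording as a result in their own right rather than silently retrying, since failing to follow the format is itself a difference between models.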
Blind Evaluation: The Trust Layer
The final piece: models must never see their own scores or other models’ reports.
Two-Stage Evaluation
Stage 1 (During benchmark): Recording only
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class RecordedOutput:
    """Anonymized output - model identity hidden."""
    output_id: str  # Random UUID, NOT model name
    blockers: List[dict]
    non_blockers: List[dict]
    suggestions: List[dict]
    summary: str
    timestamp: datetime
    # No score, no model_id at this stage

Stage 2 (Post-benchmark): Blind evaluation
After all 133 cycles completed, I used two independent AI evaluators to score the accumulated data. They received:
- Anonymized outputs (output_id only, no model names)
- Standardized scoring rubric
- Ground truth for each cycle (human-verified correct answers)
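A minimal sketch of the anonymization step, assuming each raw result arrives as a `(model_name, output_dict)` pair: the hypothetical `anonymize` function returns the blinded records for the evaluators and a separate key that maps IDs back to models. The key would be stored away from the evaluators and opened only after scoring is complete.

```python
import random
import uuid
from typing import Dict, List, Tuple


def anonymize(outputs: List[Tuple[str, dict]]) -> Tuple[List[dict], Dict[str, str]]:
    """Split (model_name, output) pairs into blinded records and a sealed key."""
    blinded: List[dict] = []
    key: Dict[str, str] = {}
    for model_name, output in outputs:
        output_id = str(uuid.uuid4())  # random, carries no model information
        key[output_id] = model_name
        blinded.append({**output, "output_id": output_id})
    random.shuffle(blinded)  # remove ordering as a hint to model identity
    return blinded, key
```

Random IDs do not hide stylistic fingerprints in the text itself, so this blinds the evaluators to labels, not necessarily to every clue about which model wrote an output.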
class BlindEvaluator:
    """Evaluates outputs without knowing which model produced them."""

    def evaluate_output(self, output: RecordedOutput) -> Score:
        # Score against ground truth, not other models
        real_blockers = set(self.ground_truth["blockers"])  # IDs of verified blockers
        found_blockers = {b["id"] for b in output.blockers}

        # Assumption: a cycle with no real blockers counts as fully accurate,
        # which also avoids dividing by zero
        accuracy = (
            len(real_blockers & found_blockers) / len(real_blockers)
            if real_blockers else 1.0
        )
        false_positives = found_blockers - real_blockers
        fp_rate = len(false_positives) / len(found_blockers) if found_blockers else 0.0

        return Score(
            blocker_accuracy=accuracy,
            false_positive_rate=fp_rate,
            # ... other metrics
        )

Evaluation Criteria Weights
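The weights in this section's scoring breakdown (40/30/20/10) can be combined into a single number with a weighted sum. The sketch below assumes each category has already been reduced to a score between 0 and 1; how the two sub-criteria in each category are merged into that score is not specified here, so it is left out. `WEIGHTS` and `composite_score` are hypothetical names.

```python
WEIGHTS = {
    "blocker_identification": 0.40,
    "non_blocker_detection": 0.30,
    "suggestion_quality": 0.20,
    "summary_completeness": 0.10,
}


def composite_score(category_scores: dict) -> float:
    """Weighted sum of per-category scores, each expected in [0, 1]."""
    missing = set(WEIGHTS) - set(category_scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= category_scores[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)
```

For example, category scores of 0.9, 0.7, 0.5, and 1.0 give 0.36 + 0.21 + 0.10 + 0.10 = 0.77.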
┌────────────────────────────────────────────────────────────┐
│                     SCORING BREAKDOWN                      │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  Blocker Identification ────────────────────── 40%         │
│  ├── Accuracy: Did it catch real blockers?                 │
│  └── False positive rate: Flagging non-blockers?           │
│                                                            │
│  Non-Blocker Detection ─────────────────────── 30%         │
│  ├── Coverage: How many issues found?                      │
│  └── Relevance: Are they actually issues?                  │
│                                                            │
│  Suggestion Quality ────────────────────────── 20%         │
│  ├── Actionability: Can developers use them?               │
│  └── Insight: Deep understanding shown?                    │
│                                                            │
│  Summary Completeness ──────────────────────── 10%         │
│  ├── Clarity: Clear and concise?                           │
│  └── Accuracy: Reflects findings correctly?                │
│                                                            │
└────────────────────────────────────────────────────────────┘

The Orchestrator Pattern: Multi-Model Synthesis
For each review cycle, I ran 3-4 models in parallel, then used a fresh orchestrator session to synthesize their outputs.
class ReviewOrchestrator:
    """Synthesizes outputs from multiple models WITHOUT evaluation."""

    def synthesize(self, outputs: List[RecordedOutput]) -> dict:
        # Find consensus (all models agree)
        consensus = self._find_consensus([o.blockers for o in outputs])

        # Find disagreements (valuable signal)
        disagreements = self._find_disagreements(outputs)

        return {
            "consensus": consensus,
            "disagreements": disagreements,
            "unique_findings": self._extract_unique(outputs),
            # No scores, no model names
        }

The orchestrator’s job is synthesis, not judgment. It produces a unified report for human review but never evaluates which model performed better.
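The helper methods are not shown above, so here is one possible reading of them as a standalone function. It rests on a strong assumption: that findings from different models have already been matched to shared IDs (in practice this matching, for example by file and line, is the hard part). `split_findings` and the exact definitions in its docstring are illustrative, not my production code.

```python
from collections import Counter
from typing import Dict, List, Set


def split_findings(blockers_per_output: List[Set[str]]) -> Dict[str, Set[str]]:
    """Group finding IDs by how many outputs reported them.

    consensus:       reported by every output
    disagreements:   reported by at least one output, but not by all
    unique_findings: reported by exactly one output (a subset of disagreements)
    """
    n = len(blockers_per_output)
    counts = Counter(fid for found in blockers_per_output for fid in found)
    return {
        "consensus": {f for f, c in counts.items() if c == n},
        "disagreements": {f for f, c in counts.items() if c < n},
        "unique_findings": {f for f, c in counts.items() if c == 1 and n > 1},
    }
```

Nothing in the result says which output contributed which finding, which keeps the synthesized report free of model identity.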
What 133 Cycles Revealed
After 22 days and 133 review cycles, the blind evaluation produced clear patterns:
- Blocker detection accuracy varied significantly between models, and not always in the direction I expected based on each company’s marketing.
- False positive rates differed wildly. One model was aggressive in flagging blockers, catching more real issues but also generating more noise.
- Suggestion quality showed the most variance. Some models excelled at identifying problems but struggled with actionable recommendations.
- Model agreement was lower than expected. On the same code, models often disagreed on what constituted a blocker vs. non-blocker.
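Agreement between models can be quantified in several ways. One simple option, shown here as a sketch and not necessarily the measure behind the observations above, is the Jaccard similarity between the sets of blocker IDs two outputs reported for the same cycle. The function names are hypothetical, and the labels would be anonymized output IDs while evaluation is still blind.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # two empty sets agree completely
    return len(a & b) / len(a | b)


def pairwise_agreement(blockers: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """Jaccard similarity for every pair of outputs in one cycle."""
    return {
        (first, second): jaccard(blockers[first], blockers[second])
        for first, second in combinations(sorted(blockers), 2)
    }
```

A value of 1.0 means two outputs flagged exactly the same blockers, and 0.0 means they shared none.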
The detailed results deserve their own article, but the methodology itself is what matters: without isolation, standardization, and blinding, I would have drawn incorrect conclusions.
Implementation Checklist
If you’re building your own benchmark:
Session Isolation
─────────────────────────────────────────────────────────────
[ ] Separate CLI tools for each model family
[ ] Fresh session per model per cycle
[ ] No conversation history carried forward
[ ] Clean API connections (no shared state)
[ ] Temporary files deleted after each session

Prompt Standardization
─────────────────────────────────────────────────────────────
[ ] Pre-validate prompts with all models
[ ] Lock format before benchmark begins
[ ] Same prompt text delivered to all models
[ ] Consistent output format requirements

Blind Evaluation
─────────────────────────────────────────────────────────────
[ ] Models never see their own scores
[ ] Models never see other models' outputs
[ ] Evaluators receive anonymized data only
[ ] Ground truth established independently
[ ] Two-stage process: record then evaluate

Statistical Rigor
─────────────────────────────────────────────────────────────
[ ] 50+ cycles minimum for statistical significance
[ ] Record all raw outputs before evaluation
[ ] Use multiple independent evaluators
[ ] Report variance, not just means

Why This Matters
The AI landscape is flooded with comparisons that show the author’s preferred model winning. Some of this is intentional bias. Most is unintentional—methodology flaws that favor one approach over another.
The 133-cycle methodology isn’t perfect, but it produces results I can defend. When someone asks “how do you know this comparison is fair?”, I can point to:
- Complete session isolation logs
- Pre-validated prompts
- Blind evaluation protocols
- Statistical analysis across sufficient cycles
That’s the minimum bar for trustworthy AI model benchmarking.
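For the statistical analysis item, "report variance, not just means" can be as simple as publishing the spread and an interval alongside each average. The sketch below uses a percentile bootstrap over per-cycle scores; it is one reasonable choice among several, not a description of the exact analysis I ran, and `summarize` is a hypothetical helper. It needs at least two scores.

```python
import random
import statistics
from typing import List


def summarize(scores: List[float], n_resamples: int = 2000, seed: int = 0) -> dict:
    """Mean, sample standard deviation, and a 95% bootstrap interval for the mean."""
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    resampled_means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))  # sample with replacement
        for _ in range(n_resamples)
    )
    lower = resampled_means[int(0.025 * n_resamples)]
    upper = resampled_means[int(0.975 * n_resamples) - 1]
    return {
        "n": len(scores),
        "mean": statistics.fmean(scores),
        "stdev": statistics.stdev(scores),
        "ci95": (lower, upper),
    }
```

If the intervals for two models overlap heavily, the honest conclusion is that the benchmark did not separate them, whatever the means say.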
Final Words + More Resources
My intention with this article was to share my knowledge and experience with others. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨‍💻 OpenAI Cookbook - Model Evaluation
- 👨‍💻 Anthropic Claude Documentation
- 👨‍💻 LLM Benchmarking Methods
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!