
Are AI Coding Benchmarks Reliable? The SWE-Bench Problem

A few weeks ago, I saw a Reddit thread that stopped me in my tracks. The headline claimed that Alibaba’s Qwen3.6-Plus was “beating Claude Opus in coding.” The comments told a different story. The top-voted response—74 upvotes—said: “When everyone is optimizing for the benchmarks the benchmarks stop meaning anything.”

That comment captures the fundamental problem with AI coding benchmarks today. We’ve created sophisticated evaluation frameworks like SWE-Bench and Terminal-Bench, but their meaning has become increasingly murky as model developers optimize specifically for these tests.

I think this is a critical issue that deserves a closer look. Let me break down what these benchmarks actually measure, where they fail, and how we should interpret them.

What These Benchmarks Actually Measure

Before we can judge their reliability, we need to understand what SWE-Bench and Terminal-Bench are trying to evaluate.

SWE-Bench: Real-World GitHub Issues

SWE-Bench presents language models with actual GitHub issues from popular repositories. The model receives:

  1. A problem description (the GitHub issue)
  2. The corresponding codebase
  3. The task: generate a functional patch that resolves the issue

The evaluation runs in Docker containers to ensure reproducibility. A model succeeds if, once its patch is applied, the tests tied to the original fix (which fail on the unpatched code) now pass and the tests that were already passing still pass.
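To make the pass criterion concrete: each SWE-Bench instance lists tests that the fix should turn from failing to passing (FAIL_TO_PASS) and tests that must keep passing (PASS_TO_PASS). Here is a minimal sketch of that check. The function name and test IDs are my own placeholders, not the official harness.

```python
from typing import Dict, List


def is_resolved(
    fail_to_pass: List[str],
    pass_to_pass: List[str],
    results_after_patch: Dict[str, bool],
) -> bool:
    """Return True only if every target test passes and nothing regresses.

    results_after_patch maps a test ID to whether it passed once the
    model's patch was applied. A test missing from the results counts
    as a failure.
    """
    fixed = all(results_after_patch.get(t, False) for t in fail_to_pass)
    unbroken = all(results_after_patch.get(t, False) for t in pass_to_pass)
    return fixed and unbroken


# Hypothetical test IDs, for illustration only.
results = {"test_parse_empty": True, "test_parse_basic": True, "test_render": False}

# The target test passes and the previously passing test still passes.
print(is_resolved(["test_parse_empty"], ["test_parse_basic"], results))

# Same patch, but it broke test_render, so the instance is not resolved.
print(is_resolved(["test_parse_empty"], ["test_parse_basic", "test_render"], results))
```

Notice that nothing in this check looks at the patch itself. Two very different patches count the same as long as the tests agree.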

This tests several capabilities:

  • Code comprehension across complex codebases
  • Issue understanding (translating natural language to code changes)
  • Repository navigation (finding the right files to modify)
  • Patch generation (writing correct, maintainable fixes)

Terminal-Bench: End-to-End Task Completion

Terminal-Bench takes a different approach. Instead of just generating code patches, it tests AI agents in real terminal environments. Tasks include:

  • Compiling and building code
  • Training machine learning models
  • Setting up servers and services
  • Running multi-step workflows

Each task is self-contained with Docker-based evaluation. The model must autonomously navigate the terminal, use tools, and complete the objective.

This tests:

  • Autonomous problem-solving in realistic environments
  • Tool usage and command-line proficiency
  • End-to-end task completion (not just code generation)
  • Handling real-world complexity and failure modes

The Benchmark Gaming Problem

Here’s where things break down. Both benchmarks suffer from a core reliability issue: they’ve become targets rather than measures.

Data Contamination: The Elephant in the Room

The most glaring problem is data contamination. SWE-Bench pulls from public GitHub repositories. The issues, pull requests, and “gold” solutions are all publicly available data.

Models trained on GitHub data (which includes most modern LLMs) have likely seen many of these examples during training. This creates several issues:

  1. Memorization vs. Reasoning: A model might recall a specific issue and its solution rather than genuinely reasoning through the problem.

  2. Pattern Matching: Models learn patterns from training data. If they’ve seen similar GitHub issues repeatedly, they can match patterns without truly understanding.

  3. Unfair Comparisons: Newer models often train on more recent data, including benchmark examples that older models haven’t seen. Leaderboard comparisons become apples-to-oranges.

I don’t think model developers are deliberately cheating here—the contamination is often incidental. But it undermines the validity of the benchmarks as pure measures of capability.
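One rough way to look for contamination is to compare a model's resolve rate on tasks created before its training cutoff with its rate on tasks created after it. The sketch below assumes you have per-task results with issue dates. The `TaskResult` fields, the cutoff date, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class TaskResult:
    task_id: str
    issue_created: date  # when the underlying issue was filed
    resolved: bool       # did the model's patch pass evaluation?


def resolve_rate(results: List[TaskResult]) -> Optional[float]:
    """Fraction of tasks resolved, or None if there are no tasks."""
    if not results:
        return None
    return sum(r.resolved for r in results) / len(results)


def split_by_cutoff(
    results: List[TaskResult], cutoff: date
) -> Tuple[Optional[float], Optional[float]]:
    """Resolve rates for tasks filed before vs. on or after the cutoff."""
    before = [r for r in results if r.issue_created < cutoff]
    after = [r for r in results if r.issue_created >= cutoff]
    return resolve_rate(before), resolve_rate(after)


# Made-up results and an assumed training cutoff, for illustration only.
sample = [
    TaskResult("t1", date(2023, 3, 1), True),
    TaskResult("t2", date(2023, 9, 15), True),
    TaskResult("t3", date(2024, 6, 1), True),
    TaskResult("t4", date(2024, 8, 20), False),
]
before_rate, after_rate = split_by_cutoff(sample, cutoff=date(2024, 1, 1))
print(before_rate, after_rate)
```

A large gap between the two rates is a hint, not proof. Newer tasks can simply be harder, and a handful of tasks per bucket says very little.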

Optimization for Benchmarks: Goodhart’s Law in Action

Goodhart’s Law states: “When a measure becomes a target, it ceases to be a good measure.” This is exactly what’s happening with AI benchmarks.

The Reddit comment I mentioned earlier suggests that at least part of the community recognizes this problem. Model developers optimize for these benchmarks in several ways:

Prompt Engineering for Benchmarks:

  • Crafting specific prompts that work well for SWE-Bench’s format
  • Tuning system instructions to match benchmark evaluation criteria
  • Building scaffolding that compensates for model weaknesses

Training on Similar Tasks:

  • Creating synthetic datasets that mirror benchmark tasks
  • Fine-tuning on code-generation tasks similar to benchmark examples
  • Reinforcement learning from benchmark-like problems

Agent Architecture Optimization:

  • Building complex agent pipelines specifically for benchmark tasks
  • Adding tools and workflows that target benchmark requirements
  • Iterating on agent design until benchmark scores improve

There’s nothing inherently wrong with optimizing for tasks you care about. But when the benchmark becomes the goal itself, scores diverge from real-world performance.

What Benchmarks Do Well (And When to Trust Them)

Despite these issues, I don’t want to dismiss benchmarks entirely. They provide genuine value when used appropriately.

Valid Use Cases

Relative Comparisons Within Model Families: If you’re comparing GPT-4o to GPT-4-turbo or Claude 3.5 Sonnet to Claude 3 Opus, benchmarks provide useful signals. The same developer built both models, most likely with similar data and training approaches, so differences in contamination and benchmark-specific optimization are probably smaller than across vendors.

Identifying Specific Capability Gaps: Benchmarks reveal where models struggle. If a model fails consistently at certain types of issues (e.g., multi-file refactoring, database migrations), that’s valuable information.

Tracking Progress Over Time: When a new benchmark launches, early scores are often more reliable. As time passes and developers optimize for it, scores become inflated. But the initial period offers cleaner data.

Reproducible Testing: The containerized evaluation approach is excellent. Anyone can run these benchmarks, verify claims, and compare results. This standardization is valuable for the community.

Design Strengths

Both SWE-Bench and Terminal-Bench have strong methodological foundations:

  • Real-world tasks: They use actual GitHub issues and terminal work, not artificial toy problems
  • Containerized evaluation: Docker ensures reproducibility
  • Automated testing: Scale and consistency that manual evaluation can’t match
  • Multi-step complexity: They require sustained reasoning, not just single-shot answers

The Reliability Limitations

Let me be specific about where these benchmarks fail.

Known Issues

Temporal Validity: Benchmarks become stale as models improve. A benchmark that was challenging in 2024 might be trivial by 2026. Once scores cluster near the ceiling, the benchmark can no longer tell strong models apart.

Distribution Shift: The tasks in these benchmarks may not represent your actual use case. The original SWE-Bench draws its tasks from popular open-source Python projects and focuses on bug fixes. Terminal-Bench emphasizes DevOps-style tasks. If you’re building a code completion tool for enterprise Java development, these benchmarks might not predict performance well.

Evaluation Brittleness: Pass/fail on specific test cases misses nuance. A model might:

  • Solve the problem but with poor code quality
  • Fix the bug but introduce new issues
  • Pass tests but produce unmaintainable code
  • Fail the benchmark but actually solve the user’s real problem

Agent Dependency: Results vary dramatically based on the agent setup. The same model with different prompting, tools, or scaffolding can score very differently. Leaderboards often don’t control for this properly.
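A simple guard against the agent-dependency problem is to rank models only against others that ran with the same setup. Here is a minimal sketch. The model names, setup labels, and scores are invented, and real leaderboards may not expose the setup this cleanly.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (model, agent setup, score in percent)
Entry = Tuple[str, str, float]


def group_by_setup(entries: List[Entry]) -> Dict[str, List[Tuple[str, float]]]:
    """Group scores by agent setup and rank models within each group."""
    grouped: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for model, setup, score in entries:
        grouped[setup].append((model, score))
    # Sort each group from highest to lowest score.
    return {
        setup: sorted(rows, key=lambda row: row[1], reverse=True)
        for setup, rows in grouped.items()
    }


# Hypothetical leaderboard rows, for illustration only.
entries = [
    ("model-a", "bare", 41.0),
    ("model-b", "bare", 45.0),
    ("model-a", "agent-x", 62.0),
]
for setup, ranking in group_by_setup(entries).items():
    print(setup, ranking)
```

In this made-up data, model-a looks far ahead if you only read the top score, yet model-b beats it when both run bare. The scaffold accounts for the jump, not the model.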

Specific Concerns by Benchmark

SWE-Bench:

  • Tests issue resolution but not code quality or maintainability
  • Focuses on bugs in existing codebases, not greenfield development
  • Repository selection bias: popular open-source projects might not represent typical work

Terminal-Bench:

  • Tests task completion but not efficiency or robustness
  • Emphasizes specific workflows (compiling, training, serving)
  • Self-contained tasks might not reflect real-world messiness

Both miss critical human factors:

  • Communication with stakeholders
  • Collaboration with other developers
  • Adaptability to new domains
  • Code review and feedback incorporation

Interpreting Benchmark Results Responsibly

If you’re using these benchmarks to make decisions, here’s my guidance.

Do These Things

Look at Multiple Benchmarks: Never rely on a single benchmark. Check SWE-Bench, Terminal-Bench, HumanEval, and others. A model that performs consistently across diverse benchmarks is a safer bet than one that tops a single leaderboard.

Consider Evaluation Methodology: Does the score come from:

  • Direct generation (model outputs code directly)?
  • Agent-based solution (model uses tools, retries, scaffolding)?
  • Human-in-the-loop (model assists but doesn’t complete autonomously)?

These are very different capabilities. An 80% score with heavy scaffolding isn’t comparable to a 60% score with direct generation.

Check for Contamination Disclosures: Some papers disclose whether benchmark data was in the training set. Look for this. If a model’s training data includes benchmark repositories, take its scores with a grain of salt.

Compare Similar Setups: When comparing models, ensure the agent architecture is similar. Comparing a bare model to a heavily-tooled agent is unfair.

Test on Your Own Tasks: The best benchmark is your actual work. Create internal evaluation sets from your tasks. Test models on your proprietary codebase where there’s no training contamination.

Don’t Do These Things

Treat Leaderboard Position as Definitive: The #1 spot on a benchmark today might be #5 next month after another round of optimization. Rankings change rapidly and don’t necessarily reflect real capability differences.

Assume Benchmark Score = Real-World Performance: I’ve seen models with impressive benchmark scores struggle with simple real-world tasks. Benchmarks test specific skills in specific contexts.

Ignore the Agent/Scaffolding Layer: A mediocre model with excellent tooling can outperform a strong model with poor setup. Don’t attribute agent performance solely to the underlying model.

Discount Models Without Benchmark Submissions: Some of the best models don’t submit to public leaderboards. The absence of a score isn’t the absence of capability.
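On the leaderboard point above, it helps to remember that a score is a proportion over a finite number of tasks, so it carries sampling noise. The sketch below computes a Wilson score interval for two scores and checks whether the intervals overlap. The task counts are invented, and treating tasks as independent is a simplifying assumption.

```python
import math
from typing import Tuple


def wilson_interval(solved: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = solved / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical: two models on a 500-task benchmark, 70% vs. 72% solved.
model_a = wilson_interval(350, 500)
model_b = wilson_interval(360, 500)
print(model_a, model_b, intervals_overlap(model_a, model_b))
```

Overlapping intervals are only a rough check, not a formal significance test. Still, if a two-point gap sits well inside both intervals, I wouldn’t read much into who is ranked first.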

Recommendations for Real-World Evaluation

So what should you actually do? Here’s a practical framework.

Create Internal Evaluation Sets

Build your own benchmarks from actual work:

  1. Collect Real Tasks: Gather issues, bugs, and features your team actually worked on
  2. Create Test Cases: Write automated tests that verify solutions
  3. Measure Time-to-Solution: Not just success rate, but how long it takes
  4. Include Code Review: Have humans evaluate code quality, not just correctness

This is more work than using public benchmarks, but it’s far more relevant to your needs.
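The first three steps can be sketched as a tiny harness. Everything here is a placeholder: `EvalTask`, `run_eval`, and the stub solver are names I made up, and a real check would run your test suite in a sandbox rather than inspect a string. Step 4, human code review, stays manual.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalTask:
    name: str
    prompt: str                   # the real task, as given to the model
    check: Callable[[str], bool]  # automated test over the model's output


@dataclass
class EvalResult:
    name: str
    passed: bool
    seconds: float                # wall-clock time to produce the answer


def run_eval(tasks: List[EvalTask], solve: Callable[[str], str]) -> List[EvalResult]:
    """Run every task through `solve`, recording pass/fail and elapsed time."""
    results = []
    for task in tasks:
        start = time.perf_counter()
        try:
            passed = bool(task.check(solve(task.prompt)))
        except Exception:
            # A crash in the solver or the check counts as a failure.
            passed = False
        results.append(EvalResult(task.name, passed, time.perf_counter() - start))
    return results


def stub_solver(prompt: str) -> str:
    # Stand-in for a real model call; always returns the same snippet.
    return "def add(a, b):\n    return a + b\n"


tasks = [
    EvalTask("add", "Write add(a, b)", lambda out: "return a + b" in out),
    EvalTask("sub", "Write sub(a, b)", lambda out: "return a - b" in out),
]
results = run_eval(tasks, stub_solver)
print([(r.name, r.passed) for r in results])
print(sum(r.passed for r in results) / len(results))
```

Swapping `stub_solver` for a call to the model you want to evaluate is the only change needed to reuse the same tasks across candidates.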

Use a Hybrid Approach

Combine public and private evaluation:

  1. Initial Filter: Use public benchmarks to narrow down candidate models (top 3-5)
  2. Validate on Domain Tasks: Test finalists on your internal evaluation set
  3. Monitor Over Time: Track performance on your tasks as models update
  4. Consider Cost-Efficiency: A 5% performance drop for 50% cost savings might be worth it
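To put a number on the cost-efficiency trade-off in step 4, one option is to compare the expected cost per solved task. The sketch below reads "5% performance drop" as five percentage points, which is my assumption, and the success rates and prices are invented.

```python
def cost_per_solved_task(success_rate: float, cost_per_attempt: float) -> float:
    """Expected spend to get one solved task, assuming independent attempts."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    if cost_per_attempt < 0:
        raise ValueError("cost_per_attempt must be non-negative")
    return cost_per_attempt / success_rate


# Hypothetical numbers: the cheaper model solves fewer tasks at half the price.
strong = cost_per_solved_task(success_rate=0.60, cost_per_attempt=1.00)
cheap = cost_per_solved_task(success_rate=0.55, cost_per_attempt=0.50)
print(round(strong, 2), round(cheap, 2))
```

This ignores what a failed attempt costs you in review time, so treat it as a first pass rather than the whole picture.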

Test Collaboration, Not Just Solo Performance

Real software development is collaborative. Evaluate models on:

  • How well they incorporate code review feedback
  • Whether they can work with partial information
  • If they ask clarifying questions when requirements are ambiguous
  • How they handle conflicting constraints

Current benchmarks don’t measure this. You’ll need to build custom evaluations.

The Bottom Line

AI coding benchmarks like SWE-Bench and Terminal-Bench are valuable tools, but they’re not crystal balls. They provide signals about model capabilities, but those signals are noisy and sometimes misleading.

The key insight from that Reddit thread is correct: when benchmarks become targets, they lose their meaning as measures. As an industry, we need to recognize this limitation and build more robust evaluation frameworks.

Until then, I recommend:

  • Using benchmarks as starting points, not definitive answers
  • Building internal evaluations that match your actual work
  • Maintaining healthy skepticism about leaderboard rankings
  • Testing models on your own tasks before making decisions

The benchmarks aren’t broken—they’re just incomplete. Understanding what they measure (and what they don’t) helps us use them appropriately.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in a way that helps others. If you want to get in touch, you can reach me by email.

