Why Harness Engineering Beats Model Upgrades: Experimental Evidence for AI Agent Performance

Purpose

When my AI agents kept failing, my first instinct was to upgrade the model. Pay for GPT-5.4, switch to Opus 4.6, or try Gemini 3.1.

But then I saw something that changed my thinking: experimental data showing that optimizing the infrastructure around the model delivered far bigger gains than the model upgrade itself.

The Evidence

Here are three experiments that proved the point.

Experiment #1: Can Bölük’s Edit Tool Comparison

Can Bölük ran a systematic comparison: 16 models, 3 edit formats, 540 tasks each. Same models, different formats.

The results shocked me:

Grok Code Fast 1:
str_replace format: 6.7% success rate
hashline format: 68.3% success rate
Change: Only edit format (zero model changes)
Improvement: roughly 10x (68.3 / 6.7 ≈ 10.2)

His quote stuck with me: “You blame the pilot, but the landing gear is broken.”
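
To make the difference between the two formats concrete, here is a minimal Python sketch. It assumes a hashline-style format tags every line with its number and a short content hash, so the model points at a line instead of reproducing its text exactly. The tag layout (`N:hash|content`), the 2-character hash length, and the function names are my own illustrative assumptions, not Can Bölük's actual implementation.

```python
import hashlib


def line_hash(text: str) -> str:
    """Short content hash for one line (2 hex chars is an assumed length)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:2]


def tag_lines(source: str) -> list[str]:
    """Render a file the way the model would see it: 'N:hash|content'."""
    return [f"{i}:{line_hash(line)}|{line}"
            for i, line in enumerate(source.splitlines(), start=1)]


def apply_str_replace(source: str, old: str, new: str) -> str:
    """str_replace-style edit: fails unless `old` matches exactly once."""
    if source.count(old) != 1:
        raise ValueError("old text must match exactly once")
    return source.replace(old, new)


def apply_hashline_edit(source: str, line_no: int, expected_hash: str,
                        new_line: str) -> str:
    """Hashline-style edit: address a line by number, verify it by hash."""
    lines = source.splitlines()
    if not 1 <= line_no <= len(lines):
        raise ValueError("line number out of range")
    if line_hash(lines[line_no - 1]) != expected_hash:
        raise ValueError("stale hash: file changed since it was read")
    lines[line_no - 1] = new_line
    return "\n".join(lines)


source = "def greet():\n    print('helo')\n    return None"
tagged = tag_lines(source)
# The model copies the tag of line 2 instead of retyping its content.
line_no, rest = tagged[1].split(":", 1)
tag = rest.split("|", 1)[0]
fixed = apply_hashline_edit(source, int(line_no), tag, "    print('hello')")
```

In this sketch the str_replace path breaks as soon as the model misquotes a space or quote character in `old`, while the hashline path only needs a short tag copied correctly. That is one plausible reason a format change alone could move success rates so much.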

Experiment #2: LangChain’s Terminal Bench 2.0 Jump

LangChain optimized their harness without changing model parameters. Here’s what happened:

Terminal Bench 2.0:
Before: 52.8% score, rank 30th globally
After: 66.5% score, rank 5th globally
Changes:
• Better docs loading
• Validation loops
• Tracing improvements
Model parameters: unchanged

That is a 13.7-point gain and a jump from 30th to 5th place globally, with zero model changes.
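
Of the three changes, the validation loop is the easiest to picture. The sketch below is a generic Python pattern, not LangChain's code: `generate` and `validate` are hypothetical callables standing in for a model call and a check such as a test run or a linter.

```python
from typing import Callable, Optional


def run_with_validation(
    generate: Callable[[str], str],
    validate: Callable[[str], Optional[str]],
    task: str,
    max_attempts: int = 3,
) -> str:
    """Call `generate`, check the output, and feed any error back as context.

    `validate` returns None on success or an error message on failure.
    """
    prompt = task
    for attempt in range(1, max_attempts + 1):
        output = generate(prompt)
        error = validate(output)
        if error is None:
            return output
        # Append the concrete failure so the next attempt can correct it.
        prompt = (f"{task}\n\nAttempt {attempt} failed validation: {error}\n"
                  "Fix this and try again.")
    raise RuntimeError(f"no valid output after {max_attempts} attempts")


# Toy stand-ins: the "model" only produces valid output once it sees feedback.
def fake_generate(prompt: str) -> str:
    return "42" if "failed validation" in prompt else "forty-two"


def must_be_digits(output: str) -> Optional[str]:
    return None if output.isdigit() else "output must contain only digits"


result = run_with_validation(fake_generate, must_be_digits,
                             "Return the answer as a number.")
```

The design choice that matters here is that the validator's message goes back into the prompt, so a retry is a correction rather than a blind re-roll.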

Experiment #3: Dex Horthy’s 40% Context Threshold

Dex Horthy tested what happens when context windows get full. The finding:

168K token context window:
"Smart Zone" (0-40% usage):
• Clear reasoning
• Accurate tool calls
• Good format compliance
"Dumb Zone" (>40% usage):
• Hallucinations
• Circular reasoning
• Format errors

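The implication: context management matters more than window size. If the 40% threshold holds, a 200K window at 80% usage would perform worse than a 100K window at 30% usage, because the first is deep in the "Dumb Zone" and the second is not.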
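
The arithmetic behind that comparison is easy to check. The Python sketch below treats the 40% figure as a tunable threshold; the right cutoff probably varies by model and task, and the helper names are illustrative assumptions of mine.

```python
SMART_ZONE_LIMIT = 0.40  # threshold reported above; treat as a tunable assumption


def usage_fraction(tokens_used: int, window_size: int) -> float:
    """Fraction of the context window currently occupied."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return tokens_used / window_size


def in_smart_zone(tokens_used: int, window_size: int,
                  limit: float = SMART_ZONE_LIMIT) -> bool:
    """True while usage is at or below the threshold."""
    return usage_fraction(tokens_used, window_size) <= limit


def token_budget(window_size: int, limit: float = SMART_ZONE_LIMIT) -> int:
    """Tokens available before crossing the threshold."""
    return int(window_size * limit)


budget_168k = token_budget(168_000)               # 67200 tokens
large_but_full = in_smart_zone(160_000, 200_000)  # 80% of 200K -> False
small_but_lean = in_smart_zone(30_000, 100_000)   # 30% of 100K -> True
```

A harness could call `in_smart_zone` before each model turn and trigger a summary-and-reset step once it returns False.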

What This Means

The bottleneck isn’t the model. It’s the harness.

Here’s a rough comparison of what each approach delivers:

┌────────────────────────────────────────────────────────────┐
│ IMPROVEMENT POTENTIAL                                      │
├────────────────────────────────────────────────────────────┤
│                                                            │
│ Model Upgrade (GPT-4 → GPT-5):                             │
│ ┌──────────────────────────────────────┐                   │
│ │ ████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │ ~5-15%            │
│ └──────────────────────────────────────┘                   │
│                                                            │
│ Harness Optimization (edit format change):                 │
│ ┌──────────────────────────────────────┐                   │
│ │ ████████████████████████████████████ │ ~10x              │
│ └──────────────────────────────────────┘                   │
│                                                            │
└────────────────────────────────────────────────────────────┘

The ROI difference isn’t marginal. It’s dramatic.

What to Do

Before paying for a stronger model, fix these harness components:

  1. Edit formats - Optimize for the model’s strengths. Some models work better with str_replace, others with hashline or unified diff.

  2. Tool organization - Load only the tools a task needs. Anthropic’s Tool Search Tool saves ~85% of context tokens.

  3. Context resets - Don’t wait for the window to fill. Reset at ~40% usage with structured handoff documents.

  4. Linter rules - Add fix instructions directly in error messages. OpenAI: “If it cannot be enforced mechanically, agents will deviate.”
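
Item 4 is the one I find easiest to get wrong, so here is a minimal Python sketch of a mechanical check whose error message carries its own fix instruction. The rule (no bare `except:`), the message wording, and the function name `check_bare_except` are illustrative assumptions; only the standard-library `ast` module is used.

```python
import ast


def check_bare_except(source: str, filename: str = "<agent_output>") -> list[str]:
    """Flag bare `except:` clauses and say exactly how to fix each one."""
    messages = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        # A handler with no exception type is a bare `except:`.
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            messages.append(
                f"{filename}:{node.lineno}: bare 'except:' is not allowed. "
                "FIX: catch a specific exception, e.g. 'except ValueError:', "
                "and re-raise anything you cannot handle."
            )
    return messages


snippet = "try:\n    int('x')\nexcept:\n    pass\n"
report = check_bare_except(snippet)
```

Because the message names the location, the rule, and the remedy, an agent that reads `report` has what it needs to repair the code without guessing at the team's conventions.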

Common Mistakes

I’ve seen teams make these errors repeatedly:

  • Treating the context window as “more is better” - Past 40% usage, quality drops sharply
  • Loading all tools upfront - Wastes tokens, degrades performance
  • Expecting model upgrades to fix structural problems - They won’t
  • Skipping harness work because it’s “just infrastructure” - That’s where the gains are
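
As an illustration of the alternative to loading all tools upfront, here is a minimal Python sketch that selects tools per task. It uses naive keyword overlap against a hypothetical in-memory registry. It is not how Anthropic's Tool Search Tool works internally, and every name in it is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    keywords: frozenset[str]


# Hypothetical registry; a real harness would hold many more entries.
REGISTRY = [
    Tool("read_file", "Read a file from disk", frozenset({"read", "file", "open"})),
    Tool("run_tests", "Run the test suite", frozenset({"test", "tests", "pytest"})),
    Tool("query_db", "Run a SQL query", frozenset({"sql", "database", "query"})),
]


def select_tools(task: str, registry: list[Tool], limit: int = 2) -> list[Tool]:
    """Return up to `limit` tools whose keywords overlap the task text."""
    words = set(task.lower().split())
    scored = [(len(tool.keywords & words), tool) for tool in registry]
    matches = [(score, tool) for score, tool in scored if score > 0]
    # Highest overlap first; ties keep registry order because sort is stable.
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in matches[:limit]]


selected = select_tools("read the config file and run the tests", REGISTRY)
```

Only the definitions of the tools in `selected` would go into the prompt; the rest stay out of the context window until a task actually calls for them.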

Summary

In this post, I presented experimental evidence that harness optimization beats model upgrades for AI agent performance. The key point is that the bottleneck is infrastructure, not model capability. Can Bölük’s experiment showed a roughly 10x improvement from changing only the edit format. Before blaming the model, check whether the landing gear is broken.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are the most important links from this article, along with some further resources on this topic:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
