Why Harness Engineering Beats Model Upgrades: Experimental Evidence for AI Agent Performance

Apr 21, 2026


Purpose

When my AI agents kept failing, my first instinct was to upgrade the model. Pay for GPT-5.4, switch to Opus 4.6, or try Gemini 3.1.

But then I saw something that changed my thinking: experimental data showing that optimizing the infrastructure around the model delivered far bigger gains than the model upgrade itself.

The Evidence

Here are three experiments that support the point.

Experiment #1: Can Bölük’s Edit Tool Comparison

Can Bölük ran a systematic comparison: 16 models, 3 edit formats, 540 tasks each. Same models, different formats.

The results shocked me:

Grok Code Fast 1:
  str_replace format:  6.7% success rate
  hashline format:    68.3% success rate

Change: Only edit format (zero model changes)
Improvement: roughly 10x

His quote stuck with me: “You blame the pilot, but the landing gear is broken.”
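To make the difference between the two formats concrete, here is a minimal sketch of a hashline-style edit. It is not Can Bölük's implementation. It assumes each line shown to the model carries its line number plus a short content hash, and that an edit is rejected when the hash no longer matches. The tag layout, the 4-character hash length, and the names `line_hash`, `tag_lines`, and `apply_edit` are all illustrative assumptions.

```python
import hashlib


def line_hash(text: str) -> str:
    """Short content hash for one line (length 4 is an arbitrary choice)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:4]


def tag_lines(source: str) -> list[str]:
    """Render the file the way the model would see it: 'N:hash|content'."""
    return [
        f"{i}:{line_hash(line)}|{line}"
        for i, line in enumerate(source.splitlines(), start=1)
    ]


def apply_edit(source: str, line_no: int, expected_hash: str, new_text: str) -> str:
    """Replace one line, but only if its hash still matches what the model saw."""
    lines = source.splitlines()
    if not 1 <= line_no <= len(lines):
        raise ValueError(f"line {line_no} is out of range")
    actual = line_hash(lines[line_no - 1])
    if actual != expected_hash:
        raise ValueError(
            f"stale edit: line {line_no} has hash {actual}, expected {expected_hash}"
        )
    lines[line_no - 1] = new_text
    return "\n".join(lines)


original = "def add(a, b):\n    return a - b"
tagged = tag_lines(original)
# The model copies the line number and hash from the tagged view.
target_no, target_hash = 2, line_hash("    return a - b")
fixed = apply_edit(original, target_no, target_hash, "    return a + b")
```

The design idea in this sketch is that the model only has to copy a short tag instead of reproducing the old text character for character, which is what a str_replace-style tool requires. A stale or mistargeted edit fails loudly instead of silently changing the wrong line.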

Experiment #2: LangChain’s Terminal Bench 2.0 Jump

LangChain optimized their harness without changing model parameters. Here’s what happened:

Terminal Bench 2.0:
  Before: 52.8% score, rank 30th globally
  After:  66.5% score, rank 5th globally

Changes:
  • Better docs loading
  • Validation loops
  • Tracing improvements

Model parameters: unchanged

That is a 13.7-point gain and a jump from 30th to 5th place globally, with zero model changes.
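LangChain's actual harness code is not reproduced here. As a rough sketch of what a validation loop can look like, the snippet below retries a step until a checker accepts the output, feeding the failure message back each time. `generate` and `validate` are hypothetical callables standing in for the model call and the check (tests, schema validation, a linter), and the toy stand-ins exist only so the sketch runs without a model.

```python
from typing import Callable, Optional


def run_with_validation(
    generate: Callable[[str], str],
    validate: Callable[[str], Optional[str]],
    task: str,
    max_attempts: int = 3,
) -> str:
    """Call generate until validate returns None (no error) or attempts run out."""
    prompt = task
    for attempt in range(1, max_attempts + 1):
        output = generate(prompt)
        error = validate(output)
        if error is None:
            return output
        # Feed the concrete failure back instead of silently retrying.
        prompt = (
            f"{task}\n\nAttempt {attempt} failed validation: {error}\n"
            "Fix this and try again."
        )
    raise RuntimeError(f"no valid output after {max_attempts} attempts")


# Toy stand-ins (assumptions) so the sketch runs without a model.
def fake_generate(prompt: str) -> str:
    return '{"status": "ok"}' if "failed validation" in prompt else "status ok"


def must_be_json_object(output: str) -> Optional[str]:
    if output.startswith("{") and output.endswith("}"):
        return None
    return "output must be a JSON object"


result = run_with_validation(
    fake_generate, must_be_json_object, "Report the status as JSON."
)
```

The cap on attempts matters: without it, a check the model cannot satisfy turns into an endless loop that only burns tokens.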

Experiment #3: Dex Horthy’s 40% Context Threshold

Dex Horthy tested what happens when context windows get full. The finding:

168K token context window:

"Smart Zone" (0-40% usage):
  • Clear reasoning
  • Accurate tool calls
  • Good format compliance

"Dumb Zone" (>40% usage):
  • Hallucinations
  • Circular reasoning
  • Format errors

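The implication: context management matters more than window size. If the 40% threshold holds, a 200K window at 80% usage (160K tokens, deep in the dumb zone) should perform worse than a 100K window at 30% usage (30K tokens, still in the smart zone).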
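The arithmetic behind that comparison, plus a reset check built on it, can be sketched as follows. The 0.40 limit comes from the finding above. The function names and the `Handoff` fields are assumptions for illustration, one possible way to carry state into a fresh context, not Dex Horthy's tooling.

```python
from dataclasses import dataclass, field

SMART_ZONE_LIMIT = 0.40  # threshold reported above; tune for your own agent


def usage_ratio(used_tokens: int, window_tokens: int) -> float:
    """Fraction of the context window currently occupied."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return used_tokens / window_tokens


def should_reset(
    used_tokens: int, window_tokens: int, limit: float = SMART_ZONE_LIMIT
) -> bool:
    """True once usage reaches the limit, before quality is expected to drop."""
    return usage_ratio(used_tokens, window_tokens) >= limit


@dataclass
class Handoff:
    """Hypothetical structured note passed to the fresh context after a reset."""

    goal: str
    done: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)

    def render(self) -> str:
        done = "\n".join(f"- {item}" for item in self.done)
        todo = "\n".join(f"- {item}" for item in self.next_steps)
        return f"Goal: {self.goal}\nDone:\n{done}\nNext:\n{todo}"


# The two cases from the text: 160K of 200K used vs. 30K of 100K used.
big_window_full = should_reset(160_000, 200_000)    # usage ratio 0.80
small_window_light = should_reset(30_000, 100_000)  # usage ratio 0.30
```

Checking the ratio rather than the raw token count is the point of the sketch: the larger window is the one that needs a reset here.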

What This Means

The bottleneck isn’t the model. It’s the harness.

Here’s a rough comparison of what each approach delivers:

┌────────────────────────────────────────────────────────────┐
│                    IMPROVEMENT POTENTIAL                    │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  Model Upgrade (GPT-4 → GPT-5):                            │
│  ┌──────────────────────────────────────┐                 │
│  │ ████████░░░░░░░░░░░░░░░░░░░░░░░░░░░ │  ~5-15%          │
│  └──────────────────────────────────────┘                 │
│                                                            │
│  Harness Optimization (edit format change):                │
│  ┌──────────────────────────────────────┐                 │
│  │ ████████████████████████████████████ │  ~10x           │
│  └──────────────────────────────────────┘                 │
│                                                            │
└────────────────────────────────────────────────────────────┘

The ROI difference isn’t marginal. It’s dramatic.

What to Do

Before paying for a stronger model, fix these harness components:

Edit formats - Optimize for the model’s strengths. Some models work better with str_replace, others with hashline or unified diff.
Tool organization - Load only needed tools. Anthropic’s Tool Search Tool saves ~85% context tokens.
Context resets - Don’t wait for the window to fill. Reset at ~40% usage with structured handoff documents.
Linter rules - Add fix instructions directly in error messages. OpenAI: “If it cannot be enforced mechanically, agents will deviate.”
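As an illustration of the last item, here is a minimal sketch of a lint check whose error message carries the fix instruction. The rule itself (no bare `print()` in library code), the `LintFinding` type, and the message wording are made up for this example and do not come from any particular linter.

```python
import re
from dataclasses import dataclass


@dataclass
class LintFinding:
    line_no: int
    rule: str
    message: str
    fix: str  # the concrete instruction the agent should follow

    def render(self) -> str:
        return f"line {self.line_no}: [{self.rule}] {self.message} Fix: {self.fix}"


# Hypothetical project rule: no bare print() calls in library code.
PRINT_CALL = re.compile(r"^\s*print\(")


def lint(source: str) -> list[LintFinding]:
    """Scan line by line and attach a fix instruction to every finding."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        if PRINT_CALL.match(line):
            findings.append(
                LintFinding(
                    line_no=line_no,
                    rule="no-print",
                    message="print() is not allowed in library code.",
                    fix="replace it with logger.info(...) using the module-level logger.",
                )
            )
    return findings


code = "import logging\nprint('starting')\nlogger = logging.getLogger(__name__)"
report = [finding.render() for finding in lint(code)]
```

An agent that reads this report gets the violation and the remedy in the same message, so it does not have to guess the project convention.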

Common Mistakes

I’ve seen teams make these errors repeatedly:

Treating context window as “more is better” - Past 40%, quality drops sharply
Loading all tools upfront - Wastes tokens, degrades performance
Expecting model upgrades to fix structural problems - They won’t
Skipping harness work because it’s “just infrastructure” - That’s where the gains are
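The tool-loading mistake has a simple structural fix: keep the full tool catalog outside the prompt and load only the tools that match the task. The sketch below uses naive keyword overlap as the selector. It is not Anthropic's Tool Search Tool, and the catalog entries, stop-word list, and `select_tools` function are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


# Made-up catalog; a real one would hold full tool schemas.
CATALOG = [
    Tool("read_file", "read the contents of a file from disk"),
    Tool("run_tests", "run the project test suite and report failures"),
    Tool("send_email", "send an email message to a recipient"),
    Tool("query_db", "execute a sql query against the database"),
]

STOP_WORDS = {"a", "an", "the", "of", "to", "and", "from"}


def select_tools(task: str, catalog: list[Tool], limit: int = 2) -> list[Tool]:
    """Rank tools by shared keywords with the task and keep the top few."""
    words = set(task.lower().split()) - STOP_WORDS
    scored = []
    for tool in catalog:
        overlap = len(words & (set(tool.description.split()) - STOP_WORDS))
        if overlap > 0:
            scored.append((overlap, tool))
    # Highest overlap first; the sort is stable, so ties keep catalog order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in scored[:limit]]


loaded = select_tools("run the failing test suite", CATALOG)
loaded_names = [tool.name for tool in loaded]
```

For this task only `run_tests` matches, so the other three tool definitions never enter the context. A production selector would likely use embeddings or a search index rather than word overlap.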

Summary

In this post, I presented experimental evidence that harness optimization beats model upgrades for AI agent performance. The key point is that the bottleneck is infrastructure, not model capability. Can Bölük’s experiment showed a 10x improvement from changing only the edit format. Before blaming the model, check if the landing gear is broken.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.

Here are the most important links from this article, along with some further resources on this topic:

👨‍💻 Can Bölük: Edit Tool Comparison Experiment
👨‍💻 LangChain Terminal Bench 2.0 Results
👨‍💻 Dex Horthy: Context Window 40% Threshold

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!