How Does Autoresearch Work for Automated Code Optimization?

Mar 29, 2026

Manual code optimization is painful. You tweak a parameter, run tests, check results, and repeat. Hours disappear. Sometimes you’re not even sure if your “optimization” actually improved anything.

When I discovered autoresearch, I realized there’s a better way. Let me show you what it is and how it works.

What is Autoresearch?

Autoresearch is an automated experiment-loop methodology that continuously tests code changes, measures results against objective metrics, and keeps only improvements.

Andrej Karpathy originally developed this approach for ML training. The key insight? ML training has an unambiguous metric: validation bits per byte (val_bpb). Lower is objectively better. No debates, no subjective judgments.

But when I generalized this concept, I realized it works far beyond ML. Any code with measurable metrics can benefit from autoresearch.

The Core Loop: How Autoresearch Works

Here’s the essence of autoresearch in a simple diagram:

┌─────────────────────────────────────────────────────────────┐
│                     AUTORESEARCH LOOP                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│    ┌──────────┐    ┌──────────┐    ┌──────────┐           │
│    │  THOUGHT │───▶│ EXPERIMENT│───▶│ MEASURE  │           │
│    │    ↓     │    │           │    │    ↓     │           │
│    │ Hypothesis│   │ Code Change│   │  Metrics  │           │
│    └──────────┘    └──────────┘    └──────────┘           │
│         │                               │                  │
│         │                               │                  │
│         └───────────◀───────────────────┘                  │
│                     │                                      │
│                     ▼                                      │
│            ┌──────────────┐                               │
│            │  KEEP or     │                               │
│            │  DISCARD?    │                               │
│            └──────────────┘                               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The loop operates on a simple principle:

1. Generate hypothesis - “Maybe changing X will improve Y”
2. Run experiment - Apply the change
3. Measure results - Collect objective metrics
4. Decide - Keep if better, discard if worse
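
To make the four steps concrete, here’s a minimal sketch of the loop on a toy problem. The names `run_loop`, `propose`, and `measure` are my own placeholders, not part of any autoresearch tool, and the sketch assumes a lower metric is better.

```python
import random
from typing import Callable, List, Tuple


def run_loop(
    initial: float,
    propose: Callable[[float], float],
    measure: Callable[[float], float],
    iterations: int,
) -> Tuple[float, float, List[bool]]:
    """Hypothesize, experiment, measure, then keep or discard (lower is better)."""
    best = initial
    best_score = measure(best)
    decisions: List[bool] = []
    for _ in range(iterations):
        candidate = propose(best)    # 1. hypothesis: "maybe this value is better"
        score = measure(candidate)   # 2-3. run the experiment and measure it
        keep = score < best_score    # 4. decide
        if keep:
            best, best_score = candidate, score
        decisions.append(keep)
    return best, best_score, decisions


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-in for a real benchmark: pretend the metric is minimized at x = 3.
    toy_metric = lambda x: (x - 3.0) ** 2
    nudge = lambda x: x + rng.uniform(-1.0, 1.0)
    best, score, decisions = run_loop(0.0, nudge, toy_metric, iterations=200)
    print(f"best x = {best:.3f}, score = {score:.5f}, kept {sum(decisions)} changes")
```

In a real project, `propose` would apply a code change and `measure` would run your benchmark or test suite.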

ML-Specific vs Generalized Autoresearch

Karpathy’s original autoresearch was ML-specific. Here’s how it differs from the generalized version:

Aspect	ML-Specific	Generalized
Metric	val_bpb	Any measurable metric
Experiments	Hyperparameter changes	Code changes, algorithms, configs
Branching	Training checkpoints	Git commits, patches
Convergence	Loss plateau	Metric improvement plateau
Context	Training logs	Test results, benchmarks

The key difference is what you measure. ML has natural metrics like loss and accuracy. For general code, you need to define what “better” means.
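
One way to pin down what “better” means is a small comparison helper. This is only a sketch: the function name `is_better`, the `"maximize"` option, and the `min_delta` parameter are assumptions of mine for illustration.

```python
def is_better(
    new: float,
    baseline: float,
    direction: str = "minimize",
    min_delta: float = 0.0,
) -> bool:
    """Return True if `new` beats `baseline` by more than `min_delta`."""
    if direction == "minimize":
        # e.g. latency, memory usage, error rate
        return new < baseline - min_delta
    if direction == "maximize":
        # e.g. test coverage, conversion rate
        return new > baseline + min_delta
    raise ValueError(f"unknown direction: {direction!r}")
```

The `min_delta` margin keeps tiny differences, which may just be measurement noise, from counting as improvements.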

Practical Applications Beyond ML

The solution is to find measurable metrics for your own code. Here are some examples:

Performance Optimization:

Benchmark execution time
Measure memory usage
Track API latency percentiles

Code Quality:

Test coverage percentage
Static analysis scores
Cyclomatic complexity

Business Metrics:

Page load time
Conversion rates
Error rates

The requirement is simple: your metric must be unambiguous. Lower latency is better. Higher coverage is better. No subjective interpretation needed.
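
For a performance metric like execution time, the measurement itself also needs to be stable. Here’s a minimal sketch of a timing helper that discards warm-up runs and reports the median. The helper name `measure_median_seconds` and its defaults are assumptions for illustration.

```python
import statistics
import time
from typing import Callable, List


def measure_median_seconds(
    fn: Callable[[], object],
    warmup_runs: int = 3,
    measurement_runs: int = 5,
) -> float:
    """Time `fn` several times and return the median duration in seconds."""
    if measurement_runs < 1:
        raise ValueError("measurement_runs must be at least 1")

    # Warm-up runs are executed but not timed (caches, lazy imports, etc.)
    for _ in range(warmup_runs):
        fn()

    samples: List[float] = []
    for _ in range(measurement_runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)

    # The median is less sensitive to one slow outlier run than the mean
    return statistics.median(samples)
```

You could pass something like `lambda: sorted(data)` as `fn` and use the returned value as the metric in the loop.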

Code Example: Basic Autoresearch Loop

Here’s a simplified Python implementation:

import subprocess
from dataclasses import dataclass
from typing import Callable

@dataclass
class ExperimentResult:
    change_id: str
    metric_value: float
    improvement: bool
    metadata: dict

class AutoOptimizer:
    def __init__(
        self,
        metric_fn: Callable[[], float],
        threshold: float = 0.01,
        max_iterations: int = 100
    ):
        self.metric_fn = metric_fn
        self.threshold = threshold
        self.max_iterations = max_iterations
        self.baseline = metric_fn()
        self.history = []

    def run_experiment(
        self,
        change_fn: Callable[[], str],
        change_id: str
    ) -> ExperimentResult:
        """Run a single experiment with the given change."""

        # Apply the change
        description = change_fn()

        # Measure the result
        new_value = self.metric_fn()

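        # Calculate improvement. This assumes a lower metric is better and
        # treats threshold as an absolute margin the change must beat.
        improvement = new_value < self.baseline - self.threshold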

        result = ExperimentResult(
            change_id=change_id,
            metric_value=new_value,
            improvement=improvement,
            metadata={"description": description}
        )

        self.history.append(result)

        # Keep improvement, restore baseline if not
        if improvement:
            self.baseline = new_value
        else:
            self.revert(change_id)

        return result

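    def revert(self, change_id: str):
        """Revert a change that didn't improve metrics."""
        # Implementation depends on version control. This version assumes
        # change_id is a Git commit hash. --no-commit leaves the revert
        # staged but uncommitted; check=True raises if Git fails.
        subprocess.run(["git", "revert", "--no-commit", change_id], check=True)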

Convergence Detection Strategies

One challenge is knowing when to stop. Here are practical strategies:

CONVERGENCE DETECTION APPROACHES
─────────────────────────────────

1. PLATEAU DETECTION
   ┌────────────────────────┐
   │ Metric │ Iterations    │
   │ ───────┼────────────── │
   │ 100ms  │ 1-5           │
   │  85ms  │ 6-10          │
   │  82ms  │ 11-15         │
   │  81ms  │ 16-20 ◀── PLATEAU (stop here)
   └────────────────────────┘

2. BUDGET EXHAUSTION
   - Set max iterations
   - Set max time budget
   - Set max "no improvement" streak

3. THRESHOLD REACHED
   - Stop when metric below target
   - Example: "Stop when latency < 50ms"

Here’s convergence detection code:

from typing import List

class ConvergenceDetector:
    def __init__(
        self,
        window_size: int = 10,
        threshold: float = 0.001
    ):
        self.window_size = window_size
        self.threshold = threshold

    def check(self, metric_history: List[float]) -> bool:
        """Returns True if converged."""

        # Not enough data points to judge yet
        if len(metric_history) < self.window_size:
            return False

        # Spread (max - min) of the most recent window, not a statistical variance
        recent = metric_history[-self.window_size:]
        spread = max(recent) - min(recent)

        return spread < self.threshold

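    def improvement_rate(
        self,
        metric_history: List[float],
        window: int = 10
    ) -> float:
        """Calculate relative rate of improvement (lower metric is better)."""

        # Too little history: return 1.0, i.e. treat it as still improving
        if len(metric_history) < window:
            return 1.0

        start = metric_history[-window]
        end = metric_history[-1]

        # Guard against division by zero when the starting metric is 0
        if start == 0:
            return 0.0

        return (start - end) / start  # Positive = improving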
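
The detector above covers plateau detection. The other two strategies, budget exhaustion and threshold reached, can be sketched as a single stop check. The names `StopConditions` and `stop_reason` are hypothetical, the default values are only examples, and the sketch assumes a lower metric is better.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StopConditions:
    max_iterations: int = 100
    max_seconds: float = 4 * 3600.0
    no_improvement_limit: int = 10
    target: Optional[float] = None  # e.g. 50.0 for "stop when latency < 50ms"


def stop_reason(
    cond: StopConditions,
    iteration: int,
    elapsed_seconds: float,
    no_improvement_streak: int,
    best_metric: float,
) -> Optional[str]:
    """Return why the loop should stop, or None if it should keep going."""
    if cond.target is not None and best_metric < cond.target:
        return "target_reached"
    if iteration >= cond.max_iterations:
        return "max_iterations"
    if elapsed_seconds >= cond.max_seconds:
        return "time_budget"
    if no_improvement_streak >= cond.no_improvement_limit:
        return "no_improvement"
    return None
```

Returning a reason string instead of a plain boolean makes it easier to log why a run ended.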

Challenges and Considerations

When I started using autoresearch, I hit several issues:

1. Metric Gaming

Bad metrics lead to bad optimizations. If you optimize for “lines of code,” you’ll get terse but unreadable code. Choose metrics that reflect actual goals.

2. Non-Linear Branching

Real experiments don’t always follow a single path. Sometimes you want to explore multiple branches:

        ┌─── Branch A (failed) ────┐
        │                          │
Main ───┼─── Branch B (improved) ──┼─── Merge B ─── Continue
        │                          │
        └─── Branch C (failed) ────┘
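
A rough sketch of that diagram in code: measure each branch against the current baseline and carry forward only the best one that actually improves. The function `pick_best_branch` is hypothetical, each callable stands in for “check out the branch and run the benchmark,” and the branches are measured one after another here rather than in parallel.

```python
from typing import Callable, Dict, Optional, Tuple


def pick_best_branch(
    baseline: float,
    branches: Dict[str, Callable[[], float]],
) -> Tuple[Optional[str], float]:
    """Measure every branch; return the winner's name and the new baseline.

    Lower is better. If no branch beats the baseline, the name is None
    and the baseline is returned unchanged.
    """
    best_name: Optional[str] = None
    best_value = baseline
    for name, measure in branches.items():
        value = measure()
        if value < best_value:
            best_name, best_value = name, value
    return best_name, best_value


if __name__ == "__main__":
    # Fixed numbers stand in for real benchmark runs.
    winner, new_baseline = pick_best_branch(
        100.0,
        {"A": lambda: 110.0, "B": lambda: 85.0, "C": lambda: 101.0},
    )
    print(winner, new_baseline)
```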

3. Session Resume

Long-running optimizations need checkpointing. Store state so you can resume after interruptions.
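
Here’s a minimal checkpointing sketch using a JSON file. The function names and the shape of the state dictionary are assumptions. The write goes to a temporary file first and is then renamed, so an interruption mid-write should not leave a half-written checkpoint behind.

```python
import json
import os
import tempfile
from typing import Any, Dict, Optional


def save_state(path: str, state: Dict[str, Any]) -> None:
    """Write state to `path` via a temp file in the same directory."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
        # os.replace swaps the file in one step on the same filesystem
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_state(path: str) -> Optional[Dict[str, Any]]:
    """Return the saved state, or None if no checkpoint exists yet."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return None
```

On startup, call `load_state` first and fall back to a fresh baseline measurement only when it returns `None`.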

4. Thought Experiments

Not every hypothesis needs immediate testing. Sometimes you generate ideas, rank them by potential impact, and test only the most promising.
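
A small sketch of that ranking step. The `Hypothesis` class and the `estimated_impact` score are made up for illustration; where the estimate comes from (your own judgment, a model, past results) is up to you.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    description: str
    estimated_impact: float  # a guess, e.g. expected fractional improvement


def most_promising(hypotheses: List[Hypothesis], top_k: int = 3) -> List[Hypothesis]:
    """Return the top_k hypotheses by estimated impact, highest first."""
    ranked = sorted(hypotheses, key=lambda h: h.estimated_impact, reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    ideas = [
        Hypothesis("Cache parsed config", 0.05),
        Hypothesis("Replace nested loop with a dict lookup", 0.30),
        Hypothesis("Batch database writes", 0.20),
        Hypothesis("Inline a small helper", 0.01),
    ]
    for idea in most_promising(ideas, top_k=2):
        print(idea.description)
```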

Configuration Example

Here’s a YAML config for autoresearch:

objective:
  metric: "latency_p99"
  direction: "minimize"
  target: 50  # milliseconds

constraints:
  max_iterations: 100
  max_time_hours: 4
  no_improvement_limit: 10

branching:
  strategy: "parallel"
  max_branches: 3

checkpoint:
  enabled: true
  interval_minutes: 5
  path: "./autoresearch_state.json"

metrics:
  collection:
    warmup_runs: 3
    measurement_runs: 5
    aggregate: "median"
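
To use a config like this from Python, you might validate it into a typed object before the loop starts. The sketch below assumes the YAML has already been parsed into a dictionary by a YAML library of your choice, and it only covers the `objective` and `constraints` sections. `RunConfig` and `parse_config` are hypothetical names, and accepting `"maximize"` as a direction is my assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class RunConfig:
    metric: str
    direction: str
    target: float
    max_iterations: int
    max_time_hours: float
    no_improvement_limit: int


def parse_config(raw: Dict[str, Any]) -> RunConfig:
    """Build a RunConfig from a parsed config dict, failing early on bad values."""
    objective = raw["objective"]
    constraints = raw["constraints"]
    config = RunConfig(
        metric=str(objective["metric"]),
        direction=str(objective["direction"]),
        target=float(objective["target"]),
        max_iterations=int(constraints["max_iterations"]),
        max_time_hours=float(constraints["max_time_hours"]),
        no_improvement_limit=int(constraints["no_improvement_limit"]),
    )
    if config.direction not in ("minimize", "maximize"):
        raise ValueError(f"unknown direction: {config.direction!r}")
    if config.max_iterations < 1 or config.no_improvement_limit < 1:
        raise ValueError("iteration limits must be at least 1")
    if config.max_time_hours <= 0:
        raise ValueError("max_time_hours must be positive")
    return config
```

Failing at startup on a typo like `direction: "minimise"` is cheaper than discovering it four hours into a run.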

Getting Started with Autoresearch

To implement autoresearch for your project:

1. Define your metric - Something unambiguous and measurable
2. Set up measurement - Automated benchmark or test suite
3. Implement change tracking - Git branches, patches, or checkpoints
4. Create the loop - Generate, test, measure, decide
5. Add convergence detection - Know when to stop

Start small. Maybe just optimize a single function’s performance. Then expand as you get comfortable with the process.

Summary

In this post, I explained autoresearch—a methodology that transforms code optimization from manual art into automated science. The core is simple: continuously test changes, measure objectively, keep improvements.

The magic isn’t in AI or complex algorithms. It’s in having unambiguous metrics and the discipline to let the loop run. When your metric is truly objective—like val_bpb for ML or latency for performance—autoresearch can find optimizations you’d never discover manually.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.
