How Does Autoresearch Work for Automated Code Optimization?
Manual code optimization is painful. You tweak a parameter, run tests, check results, and repeat. Hours disappear. Sometimes you’re not even sure if your “optimization” actually improved anything.
When I discovered autoresearch, I realized there’s a better way. Let me show you what it is and how it works.
What is Autoresearch?
Autoresearch is an automated experiment-loop methodology that continuously tests code changes, measures results against objective metrics, and keeps only improvements.
Andrej Karpathy originally developed this approach for ML training. The key insight? ML training has an unambiguous metric: validation bits per byte (val_bpb). Lower is objectively better. No debates, no subjective judgments.
But when I generalized this concept, I realized it works far beyond ML. Any code with measurable metrics can benefit from autoresearch.
The Core Loop: How Autoresearch Works
Here’s the essence of autoresearch in a simple diagram:
```
┌─────────────────────────────────────────────────────────────┐
│                      AUTORESEARCH LOOP                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    │
│    │   THOUGHT   │───▶│ EXPERIMENT  │───▶│   MEASURE   │    │
│    │      ↓      │    │             │    │      ↓      │    │
│    │ Hypothesis  │    │ Code Change │    │   Metrics   │    │
│    └─────────────┘    └─────────────┘    └─────────────┘    │
│           │                                     │           │
│           └────────◀─────────┬──────────────────┘           │
│                              │                              │
│                              ▼                              │
│                       ┌─────────────┐                       │
│                       │   KEEP or   │                       │
│                       │  DISCARD?   │                       │
│                       └─────────────┘                       │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

The loop operates on a simple principle:
- Generate hypothesis - “Maybe changing X will improve Y”
- Run experiment - Apply the change
- Measure results - Collect objective metrics
- Decide - Keep if better, discard if worse
ML-Specific vs Generalized Autoresearch
Karpathy’s original autoresearch was ML-specific. Here’s how it differs from the generalized version:
| Aspect | ML-Specific | Generalized |
|---|---|---|
| Metric | val_bpb | Any measurable metric |
| Experiments | Hyperparameter changes | Code changes, algorithms, configs |
| Branching | Training checkpoints | Git commits, patches |
| Convergence | Loss plateau | Metric improvement plateau |
| Context | Training logs | Test results, benchmarks |
The key difference is what you measure. ML has natural metrics like loss and accuracy. For general code, you need to define what “better” means.
Practical Applications Beyond ML
For code outside ML, that means picking metrics you can measure automatically. Here are examples:
Performance Optimization:
- Benchmark execution time
- Measure memory usage
- Track API latency percentiles
Code Quality:
- Test coverage percentage
- Static analysis scores
- Cyclomatic complexity
Business Metrics:
- Page load time
- Conversion rates
- Error rates
The requirement is simple: your metric must be unambiguous. Lower latency is better. Higher coverage is better. No subjective interpretation needed.
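To make "lower is better" versus "higher is better" explicit, here is a minimal sketch of a metric definition that carries its own direction. The `Metric` class, its `min_delta` noise margin, and the metric names are my own illustrative assumptions, not part of Karpathy's original setup.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Metric:
    """Hypothetical helper: a named metric plus the direction that counts as better."""
    name: str
    direction: Literal["minimize", "maximize"]
    min_delta: float = 0.0  # smallest change treated as a real improvement

    def is_better(self, candidate: float, baseline: float) -> bool:
        """Return True if candidate beats baseline by more than min_delta."""
        if self.direction == "minimize":
            return candidate < baseline - self.min_delta
        return candidate > baseline + self.min_delta


# Illustrative metrics; the names and margins are assumptions.
latency = Metric("latency_p99_ms", "minimize", min_delta=1.0)
coverage = Metric("test_coverage_pct", "maximize", min_delta=0.5)

print(latency.is_better(82.0, 85.0))   # 3 ms faster than the baseline
print(latency.is_better(84.5, 85.0))   # inside the 1 ms noise margin
print(coverage.is_better(91.0, 90.0))  # one point more coverage
```

Keeping the direction next to the metric means the keep-or-discard decision never has to guess which way is "better".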
Code Example: Basic Autoresearch Loop
Here’s a simplified Python implementation:
```python
import subprocess
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    change_id: str
    metric_value: float
    improvement: bool
    metadata: dict


class AutoOptimizer:
    def __init__(
        self,
        metric_fn: Callable[[], float],
        threshold: float = 0.01,
        max_iterations: int = 100,
    ):
        self.metric_fn = metric_fn
        self.threshold = threshold
        self.max_iterations = max_iterations
        self.baseline = metric_fn()
        self.history: List[ExperimentResult] = []

    def run_experiment(
        self,
        change_fn: Callable[[], str],
        change_id: str,
    ) -> ExperimentResult:
        """Run a single experiment with the given change."""
        # Apply the change
        description = change_fn()

        # Measure the result
        new_value = self.metric_fn()

        # Calculate improvement (assumes lower is better)
        improvement = new_value < self.baseline - self.threshold

        result = ExperimentResult(
            change_id=change_id,
            metric_value=new_value,
            improvement=improvement,
            metadata={"description": description},
        )
        self.history.append(result)

        # Keep improvement, restore baseline if not
        if improvement:
            self.baseline = new_value
        else:
            self.revert(change_id)

        return result

    def revert(self, change_id: str):
        """Revert a change that didn't improve metrics."""
        # Implementation depends on version control. This version assumes
        # change_fn committed the change and change_id is that commit's hash.
        subprocess.run(["git", "revert", "--no-commit", change_id], check=True)
```

Convergence Detection Strategies
One challenge is knowing when to stop. Here are practical strategies:
```
CONVERGENCE DETECTION APPROACHES
─────────────────────────────────

1. PLATEAU DETECTION
   ┌─────────────────────────┐
   │ Metric  │ Iterations    │
   │ ────────┼────────────── │
   │ 100ms   │ 1-5           │
   │ 85ms    │ 6-10          │
   │ 82ms    │ 11-15         │
   │ 81ms    │ 16-20         │ ◀── PLATEAU (stop here)
   └─────────────────────────┘

2. BUDGET EXHAUSTION
   - Set max iterations
   - Set max time budget
   - Set max "no improvement" streak

3. THRESHOLD REACHED
   - Stop when metric below target
   - Example: "Stop when latency < 50ms"
```

Here’s convergence detection code:
```python
from typing import List


class ConvergenceDetector:
    def __init__(
        self,
        window_size: int = 10,
        threshold: float = 0.001,
    ):
        self.window_size = window_size
        self.threshold = threshold

    def check(self, metric_history: List[float]) -> bool:
        """Returns True if converged."""
        if len(metric_history) < self.window_size:
            return False

        # Spread (max - min) of the most recent window, not statistical variance
        recent = metric_history[-self.window_size:]
        spread = max(recent) - min(recent)

        return spread < self.threshold

    def improvement_rate(
        self,
        metric_history: List[float],
        window: int = 10,
    ) -> float:
        """Calculate rate of improvement."""
        if len(metric_history) < window:
            return 1.0  # Not enough data yet; treat as still improving

        start = metric_history[-window]
        end = metric_history[-1]
        if start == 0:
            return 0.0  # Avoid division by zero

        return (start - end) / start  # Positive = improving
```

Challenges and Considerations
When I started using autoresearch, I hit several issues:
1. Metric Gaming
Bad metrics lead to bad optimizations. If you optimize for “lines of code,” you’ll get terse but unreadable code. Choose metrics that reflect actual goals.
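One way I'd hedge against metric gaming is to pair the primary metric with guard metrics that must not regress. The sketch below is an assumption about how that could look, not a prescribed method: `accept_change`, the metric names, and the tolerances are all hypothetical, and every metric is treated as lower-is-better.

```python
from typing import Dict


def accept_change(
    before: Dict[str, float],
    after: Dict[str, float],
    primary: str,
    guards: Dict[str, float],
) -> bool:
    """Accept only if the primary metric drops and no guard metric regresses.

    All metrics are 'lower is better'. guards maps a metric name to the
    largest increase that is still tolerated.
    """
    if after[primary] >= before[primary]:
        return False
    for name, tolerance in guards.items():
        if after[name] > before[name] + tolerance:
            return False
    return True


# Illustrative numbers only.
before = {"latency_ms": 85.0, "error_rate": 0.01, "memory_mb": 200.0}
faster_but_broken = {"latency_ms": 60.0, "error_rate": 0.05, "memory_mb": 200.0}
faster_and_safe = {"latency_ms": 80.0, "error_rate": 0.01, "memory_mb": 205.0}
guards = {"error_rate": 0.0, "memory_mb": 10.0}

print(accept_change(before, faster_but_broken, "latency_ms", guards))
print(accept_change(before, faster_and_safe, "latency_ms", guards))
```

The first candidate is much faster but raises the error rate, so it is rejected. The second is a smaller win that stays inside every guard.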
2. Non-Linear Branching
Real experiments don’t always follow a single path. Sometimes you want to explore multiple branches:
```
        ┌─── Branch A (failed) ────┐
        │                          │
Main ───┼─── Branch B (improved) ──┼─── Merge B ─── Continue
        │                          │
        └─── Branch C (failed) ────┘
```

3. Session Resume
Long-running optimizations need checkpointing. Store state so you can resume after interruptions.
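Here's a minimal checkpointing sketch, assuming the loop state fits in a small JSON file. `LoopState`, `save_state`, and `load_state` are hypothetical names I'm using for illustration. The write goes through a temporary file so an interruption mid-save can't leave a half-written checkpoint.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class LoopState:
    """Hypothetical state worth persisting between sessions."""
    iteration: int = 0
    baseline: Optional[float] = None
    history: List[float] = field(default_factory=list)


def save_state(state: LoopState, path: str) -> None:
    """Write state to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(asdict(state), f)
    os.replace(tmp_path, path)


def load_state(path: str) -> LoopState:
    """Load a saved state, or start fresh if no checkpoint exists."""
    if not os.path.exists(path):
        return LoopState()
    with open(path) as f:
        return LoopState(**json.load(f))
```

On startup you would call `load_state(path)` and continue from `state.iteration`. Inside the loop, call `save_state(state, path)` after each experiment or on a timer.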
4. Thought Experiments
Not every hypothesis needs immediate testing. Sometimes you generate ideas, rank them by potential impact, and test only the most promising.
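A rough sketch of that ranking step, assuming you can attach a guessed gain and a guessed cost to each idea. The `Hypothesis` fields and the gain-per-cost score are my own illustrative choices; any scoring rule you trust would work in their place.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """Hypothetical record for an untested idea."""
    description: str
    expected_gain: float  # rough guess at relative improvement, 0.0 to 1.0
    cost: float           # rough guess at effort to test, in minutes (> 0)


def pick_most_promising(ideas: List[Hypothesis], top_k: int = 3) -> List[Hypothesis]:
    """Rank by expected gain per unit of cost and keep the top_k."""
    ranked = sorted(ideas, key=lambda h: h.expected_gain / h.cost, reverse=True)
    return ranked[:top_k]


# Made-up ideas and estimates, purely for illustration.
ideas = [
    Hypothesis("Cache parsed config", expected_gain=0.10, cost=5),
    Hypothesis("Rewrite hot loop", expected_gain=0.40, cost=60),
    Hypothesis("Switch JSON library", expected_gain=0.05, cost=10),
    Hypothesis("Add index to lookup table", expected_gain=0.30, cost=10),
]

for h in pick_most_promising(ideas, top_k=2):
    print(h.description)
```

Only the selected ideas go into the experiment loop. The rest stay in the backlog until the estimates change.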
Configuration Example
Here’s a YAML config for autoresearch:
```yaml
objective:
  metric: "latency_p99"
  direction: "minimize"
  target: 50  # milliseconds

constraints:
  max_iterations: 100
  max_time_hours: 4
  no_improvement_limit: 10

branching:
  strategy: "parallel"
  max_branches: 3

checkpoint:
  enabled: true
  interval_minutes: 5
  path: "./autoresearch_state.json"

metrics:
  collection:
    warmup_runs: 3
    measurement_runs: 5
    aggregate: "median"
```

Getting Started with Autoresearch
To implement autoresearch for your project:
- Define your metric - Something unambiguous and measurable
- Set up measurement - Automated benchmark or test suite
- Implement change tracking - Git branches, patches, or checkpoints
- Create the loop - Generate, test, measure, decide
- Add convergence detection - Know when to stop
Start small. Maybe just optimize a single function’s performance. Then expand as you get comfortable with the process.
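For the "set up measurement" step, a single-function benchmark can be as small as the sketch below. It borrows the `warmup_runs`, `measurement_runs`, and median aggregation from the config example. `measure_seconds` and `build_squares` are hypothetical names, and the workload is just a stand-in for whatever function you want to optimize.

```python
import statistics
import time
from typing import Callable


def measure_seconds(
    fn: Callable[[], object],
    warmup_runs: int = 3,
    measurement_runs: int = 5,
) -> float:
    """Return the median wall-clock time of fn over several timed runs."""
    for _ in range(warmup_runs):
        fn()  # warm caches; timings from these runs are discarded
    samples = []
    for _ in range(measurement_runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def build_squares() -> list:
    """Stand-in workload; replace with the function under optimization."""
    return [i * i for i in range(10_000)]


elapsed = measure_seconds(build_squares)
print(f"median: {elapsed * 1000:.3f} ms")
```

A function like this can be passed in as the `metric_fn` of an optimization loop. The median is less sensitive to a single slow run than the mean, which keeps the keep-or-discard decision from flipping on noise.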
Summary
In this post, I explained autoresearch—a methodology that transforms code optimization from manual art into automated science. The core is simple: continuously test changes, measure objectively, keep improvements.
The magic isn’t in AI or complex algorithms. It’s in having unambiguous metrics and the discipline to let the loop run. When your metric is truly objective—like val_bpb for ML or latency for performance—autoresearch can find optimizations you’d never discover manually.