
What Is Karpathy's Autoresearch Approach? A Complete Guide to Autonomous AI Experimentation

Problem

I spent weeks running ML experiments manually. Every night I would:

  1. Think of a new hyperparameter configuration
  2. Start a training run
  3. Go to sleep hoping it works
  4. Wake up to find it crashed at 3 AM
  5. Check the logs to figure out what went wrong
  6. Manually revert to the previous best model
  7. Repeat

Here’s what my experiment log looked like:

Day 1: Tried learning_rate=0.01 -> diverged at epoch 15
Day 2: Tried learning_rate=0.001 -> better but plateaued at epoch 50
Day 3: Tried adding dropout=0.5 -> worse, forgot to save checkpoint
Day 4: Tried batch_size=64 -> OOM error at epoch 3
Day 5: Reverted to Day 2 model, lost track of what changed
...

I was making progress, but it was slow and error-prone. Each experiment required manual oversight, and when things went wrong, I had to spend hours debugging and reverting. I needed a way to run experiments autonomously with automatic failure recovery.

What I discovered

I found Andrej Karpathy’s autoresearch approach, which he used to systematically improve ML models. The core insight is simple but powerful:

“Constraint + mechanical metric + autonomous iteration = compounding gains”

This isn’t just about running more experiments. It’s about creating a feedback loop where:

  1. Every experiment is narrow and well-defined (constraint)
  2. Success can be measured automatically (mechanical metric)
  3. Failures revert automatically (autonomous iteration with rollback)

The result: 630 lines of Python running 100 experiments per night with guaranteed state consistency.

The three pillars

Karpathy’s approach rests on three pillars that work together:

┌─────────────────────────────────────────────────────────────────────┐
│                  Karpathy's Autoresearch Approach                   │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────────────┐    │
│  │ Constraint  │     │ Mechanical  │     │ Autonomous          │    │
│  │             │────▶│ Metric      │────▶│ Iteration           │    │
│  │ Narrow      │     │             │     │                     │    │
│  │ Problem     │     │ Objective   │     │ Auto-rollback       │    │
│  │ Definition  │     │ Evaluation  │     │ on Failure          │    │
│  └─────────────┘     └─────────────┘     └─────────────────────┘    │
│         │                   │                       │               │
│         ▼                   ▼                       ▼               │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────────────┐    │
│  │ What you    │     │ How you     │     │ How you keep        │    │
│  │ measure     │     │ measure it  │     │ improving safely    │    │
│  └─────────────┘     └─────────────┘     └─────────────────────┘    │
│                                                                     │
│  Result: Compounding gains from many small, safe experiments        │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Pillar 1: Constraint

The first pillar is narrowing the problem definition to something that can be iterated on automatically.

I used to run experiments with vague goals like “improve the model” or “make it faster.” These are too broad for autonomous iteration. The constraint pillar requires a specific, narrow focus.

Wrong (too broad):
- "Improve model accuracy"
- "Make training faster"
- "Reduce memory usage"

Right (constrained):
- "Increase validation accuracy from 78% to 80%"
- "Reduce training time from 4 hours to 3 hours"
- "Lower GPU memory from 12GB to 10GB"

A constrained problem has these characteristics:

  1. Single objective: Focus on one thing at a time
  2. Measurable baseline: Know where you’re starting
  3. Clear target: Know when you’ve succeeded
  4. Atomic changes: Each experiment changes one thing

Here’s how I applied constraint to my experiments:

constraint_example.py
# Before: Vague experiment
experiment = {
    "goal": "improve model",
    "changes": [
        "add dropout",
        "change learning rate",
        "increase hidden size",
        "add batch normalization"
    ]
}
# If this works, we don't know which change helped

# After: Constrained experiment
experiment = {
    "goal": "increase validation accuracy from 78% to 80%",
    "baseline_accuracy": 0.78,
    "target_accuracy": 0.80,
    "single_change": "add dropout layer with rate 0.3 after conv2",
    "atomic": True  # Only this one change
}

Pillar 2: Mechanical Metric

The second pillar is an objective, automatable evaluation metric.

A mechanical metric is something a script can measure without human judgment. This is critical for autonomous iteration because the system needs to decide whether an experiment succeeded or failed.

Wrong (subjective):
- "Model feels more robust"
- "Predictions look better"
- "Training is smoother"

Right (mechanical):
- "Validation accuracy: 0.82"
- "Training time: 2.5 hours"
- "Peak memory: 8.3 GB"

I created a metrics module that extracts values programmatically:

metrics.py
import re
import subprocess
from typing import Optional


class MechanicalMetrics:
    """Extract objective metrics from training runs."""

    @staticmethod
    def get_validation_accuracy(log_file: str) -> Optional[float]:
        """Parse validation accuracy from training log."""
        with open(log_file) as f:
            content = f.read()
        # Look for pattern: "Validation accuracy: 0.8234"
        match = re.search(r'Validation accuracy: ([\d.]+)', content)
        if match:
            return float(match.group(1))
        return None

    @staticmethod
    def get_training_time(log_file: str) -> Optional[float]:
        """Parse total training time in hours."""
        with open(log_file) as f:
            content = f.read()
        # Look for pattern: "Total time: 2.5 hours"
        match = re.search(r'Total time: ([\d.]+) hours', content)
        if match:
            return float(match.group(1))
        return None

    @staticmethod
    def get_peak_memory(log_file: str) -> Optional[float]:
        """Parse peak GPU memory in GB."""
        with open(log_file) as f:
            content = f.read()
        # Look for pattern: "Peak memory: 8.3 GB"
        match = re.search(r'Peak memory: ([\d.]+) GB', content)
        if match:
            return float(match.group(1))
        return None

    @staticmethod
    def run_command_and_extract(cmd: str, pattern: str) -> Optional[str]:
        """Run command and extract value matching pattern."""
        result = subprocess.run(
            cmd,
            shell=True,
            capture_output=True,
            text=True
        )
        match = re.search(pattern, result.stdout)
        if match:
            return match.group(1)
        return None

Using the metrics module:

from metrics import MechanicalMetrics

# After a training run
metrics = MechanicalMetrics()
accuracy = metrics.get_validation_accuracy("train.log")
print(f"Validation accuracy: {accuracy}")  # 0.82
training_time = metrics.get_training_time("train.log")
print(f"Training time: {training_time} hours")  # 2.5
memory = metrics.get_peak_memory("train.log")
print(f"Peak memory: {memory} GB")  # 8.3

Pillar 3: Autonomous Iteration with Rollback

The third pillar is the most underrated part: automatic rollback on failure.

This is what enables compounding gains. When an experiment fails or makes things worse, the system automatically reverts to the previous state. No manual cleanup, no lost progress.

autonomous_iteration.py
import subprocess
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExperimentResult:
    iteration: int
    change_description: str
    metric_before: float
    metric_after: float
    success: bool
    kept: bool
    commit_sha: Optional[str]


class AutonomousExperimenter:
    def __init__(
        self,
        goal_metric: str,
        target_value: float,
        metric_extractor: Callable[[], Optional[float]],
        train_command: str,
        max_iterations: int = 100
    ):
        self.goal_metric = goal_metric
        self.target_value = target_value
        self.metric_extractor = metric_extractor
        self.train_command = train_command
        self.max_iterations = max_iterations
        self.results_log = []
        # Metric of the last kept state. The training log on disk reflects the
        # most recent run, even a reverted one, so the baseline is tracked here.
        self.best_metric: Optional[float] = None

    def _baseline(self) -> float:
        """Metric of the last kept state; read from the log on first use."""
        if self.best_metric is None:
            self.best_metric = self.metric_extractor()
        if self.best_metric is None:
            raise RuntimeError("No baseline metric found; run training once first")
        return self.best_metric

    def _rollback(self) -> None:
        """Drop the experiment commit and restore the previous state."""
        subprocess.run(["git", "reset", "--hard", "HEAD~1"], check=True)

    def run_experiment(
        self,
        iteration: int,
        change_fn: Callable[[], str]
    ) -> ExperimentResult:
        """
        Run a single experiment with automatic rollback.

        Args:
            iteration: Current iteration number
            change_fn: Function that makes ONE atomic change and returns description

        Returns:
            ExperimentResult with outcome
        """
        # Get baseline metric
        metric_before = self._baseline()

        # Make ONE atomic change
        change_description = change_fn()

        # Commit the change (so we can revert if needed).
        # --allow-empty keeps the HEAD~1 rollback valid even if nothing changed.
        subprocess.run(["git", "add", "-A"], check=True)
        subprocess.run([
            "git", "commit", "--allow-empty", "-m",
            f"experiment-{iteration}: {change_description}"
        ], check=True)
        commit_sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"]
        ).decode().strip()

        # Run training
        result = subprocess.run(
            self.train_command,
            shell=True,
            capture_output=True,
            text=True
        )

        # Get new metric (None if training crashed or logged no metric)
        metric_after = self.metric_extractor() if result.returncode == 0 else None
        if metric_after is None:
            # Training failed - rollback
            self._rollback()
            return ExperimentResult(
                iteration=iteration,
                change_description=change_description,
                metric_before=metric_before,
                metric_after=0.0,
                success=False,
                kept=False,
                commit_sha=None
            )

        # Decide whether to keep or revert (assumes higher is better)
        improved = metric_after > metric_before
        if improved:
            # Keep the change and move the baseline forward
            self.best_metric = metric_after
        else:
            # Revert the change
            self._rollback()
        return ExperimentResult(
            iteration=iteration,
            change_description=change_description,
            metric_before=metric_before,
            metric_after=metric_after,
            success=True,
            kept=improved,
            commit_sha=commit_sha if improved else None
        )

    def log_result(self, result: ExperimentResult):
        """Log result to TSV file."""
        self.results_log.append(result)
        with open("experiment_log.tsv", "a") as f:
            f.write(
                f"{result.iteration}\t"
                f"{result.change_description}\t"
                f"{result.metric_before:.4f}\t"
                f"{result.metric_after:.4f}\t"
                f"{result.success}\t"
                f"{result.kept}\t"
                f"{result.commit_sha or 'N/A'}\n"
            )

    def run(self, propose_change_fn: Callable[[], Callable[[], str]]):
        """
        Run autonomous experiments until target is reached.

        Args:
            propose_change_fn: Function that returns a change_fn for each iteration
        """
        # Initialize log
        with open("experiment_log.tsv", "w") as f:
            f.write("iteration\tchange\tbefore\tafter\tsuccess\tkept\tcommit\n")

        for i in range(self.max_iterations):
            # Check current metric (the last kept state)
            current = self._baseline()
            print(f"Iteration {i}: Current {self.goal_metric} = {current:.4f}")

            # Check if target reached
            if current >= self.target_value:
                print(f"Target {self.target_value} reached!")
                break

            # Propose and run experiment
            change_fn = propose_change_fn()
            result = self.run_experiment(i, change_fn)
            self.log_result(result)
            if result.kept:
                print(f"  Kept: {result.change_description} "
                      f"({result.metric_before:.4f} -> {result.metric_after:.4f})")
            else:
                print(f"  Reverted: {result.change_description}")

Here’s how I use it:

run_experiments.py
from autonomous_iteration import AutonomousExperimenter
from metrics import MechanicalMetrics


# Define metric extractor
def get_accuracy():
    return MechanicalMetrics.get_validation_accuracy("train.log")


# Define training command
train_cmd = "python train.py --config config.yaml"

# Create experimenter
experimenter = AutonomousExperimenter(
    goal_metric="validation_accuracy",
    target_value=0.80,
    metric_extractor=get_accuracy,
    train_command=train_cmd,
    max_iterations=50
)


# Define change proposals (in practice, this would be smarter)
def propose_change():
    """Propose a single change to try."""
    # This would be replaced with actual logic
    # to propose different hyperparameters, etc.
    def change_learning_rate():
        with open("config.yaml", "r") as f:
            config = f.read()
        # Modify learning rate
        # ...
        with open("config.yaml", "w") as f:
            f.write(config)
        return "increase learning rate to 0.002"
    return change_learning_rate


# Run autonomous experiments
experimenter.run(propose_change)

After running:

experiment_log.tsv
iteration  change                before  after   success  kept   commit
0          baseline              0.7800  0.7800  True     True   abc123
1          add dropout 0.3       0.7800  0.7650  True     False  N/A
2          increase lr to 0.002  0.7800  0.7920  True     True   def456
3          add batch norm        0.7920  0.8010  True     True   ghi789

Iteration 1 was reverted because accuracy dropped. Iteration 2 was kept because accuracy improved. After iteration 3, the target of 0.80 was reached.
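
To get the kept/reverted split out of a log like this without reading it by hand, a few lines of Python are enough. This is a minimal sketch, assuming the tab-separated columns that `log_result` writes (`iteration`, `change`, `before`, `after`, `success`, `kept`, `commit`). The `summarize_log` name and the sample rows are mine, for illustration only.

```python
import csv
import io


def summarize_log(tsv_text: str) -> dict:
    """Count kept and reverted experiments and report the best kept metric."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    # log_result writes Python booleans, so the column holds "True" or "False"
    kept = [row for row in rows if row["kept"] == "True"]
    best = max((float(row["after"]) for row in kept), default=None)
    return {
        "total": len(rows),
        "kept": len(kept),
        "reverted": len(rows) - len(kept),
        "best_metric": best,
    }


# Illustrative sample in the same format as experiment_log.tsv
sample = (
    "iteration\tchange\tbefore\tafter\tsuccess\tkept\tcommit\n"
    "1\tadd dropout 0.3\t0.7800\t0.7650\tTrue\tFalse\tN/A\n"
    "2\tincrease lr to 0.002\t0.7800\t0.7920\tTrue\tTrue\tdef456\n"
    "3\tadd batch norm\t0.7920\t0.8010\tTrue\tTrue\tghi789\n"
)
summary = summarize_log(sample)
print(summary)
# {'total': 3, 'kept': 2, 'reverted': 1, 'best_metric': 0.801}
```

For a real run, pass the contents of `experiment_log.tsv` instead of the sample string.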

Why rollback matters

Without rollback on failure, you can’t safely run experiments autonomously: one bad run contaminates every run that follows it.

Without rollback

Day 1: Experiment 1 succeeds
Day 2: Experiment 2 fails, leaves model in bad state
Day 3: Experiment 3 runs on corrupted baseline
Day 4: Results are meaningless, need to start over

With rollback

Night 1: 100 experiments run, 15 kept, 85 reverted
Night 2: 100 more experiments, each starting from clean state
Night 3: 100 more experiments, compounding gains from Night 2
...

The key insight: rollback enables you to try aggressive experiments without fear of breaking things. If it doesn’t work, you’re automatically back to the previous state.
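
The compounding effect is easy to see in a toy version of the loop that needs no training or git at all. The sketch below is an illustration under made-up assumptions: the `toy_score` objective, the candidate values, and every function name are hypothetical stand-ins for a real training run. It shows only the keep-or-revert rule, where a candidate replaces the current state only when the score improves.

```python
import random


def keep_or_revert(score_fn, propose_fn, state, iterations, seed=0):
    """Greedy loop: try one change at a time, keep it only if the score improves."""
    rng = random.Random(seed)
    best_score = score_fn(state)
    history = [best_score]
    for _ in range(iterations):
        candidate = propose_fn(state, rng)  # one atomic change on a copy
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:
            state, best_score = candidate, candidate_score  # keep
        # otherwise "revert": the previous state is simply left in place
        history.append(best_score)
    return state, history


def toy_score(cfg):
    """Made-up objective that peaks at lr=0.002, dropout=0.3."""
    return 1.0 - 100 * abs(cfg["lr"] - 0.002) - abs(cfg["dropout"] - 0.3)


def toy_propose(cfg, rng):
    """Change exactly one hyperparameter to a randomly chosen value."""
    options = {
        "lr": [0.0005, 0.001, 0.002, 0.005],
        "dropout": [0.1, 0.2, 0.3, 0.4, 0.5],
    }
    new_cfg = dict(cfg)
    key = rng.choice(list(options))
    new_cfg[key] = rng.choice(options[key])
    return new_cfg


final_cfg, history = keep_or_revert(
    toy_score, toy_propose, {"lr": 0.01, "dropout": 0.5}, iterations=50
)
print(final_cfg, round(history[-1], 4))
```

Because a worse candidate never replaces the current state, `history` can only stay flat or go up. That is the property the git rollback gives the real loop.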

What can go wrong

Mistake 1: Vague metrics

Wrong: "improve model quality"
Right: "validation accuracy >= 0.80"

A vague metric can’t be extracted programmatically. The system can’t decide if an experiment succeeded.
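
A mechanical metric also needs a direction. The experimenter code earlier compares with `>`, which suits accuracy, but training time and peak memory improve when they go down. The following is a small sketch of how that decision could be made explicit. The `MetricGoal` class and its field names are hypothetical and not part of the code above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricGoal:
    """A target for one metric, including which direction counts as better."""
    name: str
    target: float
    higher_is_better: bool = True

    def improved(self, before: float, after: float) -> bool:
        """True if `after` is strictly better than `before`."""
        return after > before if self.higher_is_better else after < before

    def reached(self, value: float) -> bool:
        """True if `value` meets the target."""
        return value >= self.target if self.higher_is_better else value <= self.target


accuracy_goal = MetricGoal("validation_accuracy", target=0.80)
memory_goal = MetricGoal("peak_memory_gb", target=10.0, higher_is_better=False)

print(accuracy_goal.improved(0.78, 0.792))  # True
print(memory_goal.improved(12.0, 8.3))      # True
print(memory_goal.reached(10.5))            # False
```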

Mistake 2: Multiple changes at once

# Wrong: Multiple changes
def change_multiple():
    increase_learning_rate()
    add_dropout()
    change_batch_size()
    # If accuracy changes, we don't know which change caused it

# Right: One change at a time
def change_one():
    increase_learning_rate()  # Only this

Mistake 3: No git commits before verification

# Wrong: Verify then commit
make_change()
run_training()
if improved:
    git_commit()  # Might forget to commit

# Right: Commit then verify
make_change()
git_commit()
run_training()
if not improved:
    git_reset_hard()  # Guaranteed rollback
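
The `git_commit()` and `git_reset_hard()` calls in that snippet are placeholders. One possible way to write them is sketched below, assuming `git` is on the PATH and the experiment lives in its own repository. The `repo` parameter and the small `git` wrapper are my additions, and the commands are the same ones `run_experiment` uses.

```python
import subprocess
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run a git command inside `repo` and return its stdout."""
    result = subprocess.run(
        ["git", "-C", str(repo), *args],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def git_commit(repo: Path, message: str) -> str:
    """Stage everything, commit, and return the new commit SHA."""
    git(repo, "add", "-A")
    git(repo, "commit", "-m", message)
    return git(repo, "rev-parse", "HEAD")


def git_reset_hard(repo: Path) -> None:
    """Drop the most recent commit and restore the working tree to its parent."""
    git(repo, "reset", "--hard", "HEAD~1")
```

Note that `git reset --hard` also throws away any uncommitted edits to tracked files, so this only belongs in a repository dedicated to the experiment loop.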

Mistake 4: Non-atomic changes

# Wrong: Change spreads across multiple files
def change_spread():
    modify_model_architecture()  # model.py
    update_config()  # config.yaml
    change_dataloader()  # data.py
    # Reverting is hard if something goes wrong

# Right: Encapsulated change
def change_encapsulated():
    # Change is in one place
    modify_dropout_rate()  # Only config.yaml

Real-world example

I applied this approach to optimize a text classification model:

text_classifier_optimization.py
import random
import re

from autonomous_iteration import AutonomousExperimenter
from metrics import MechanicalMetrics

# Track what we've tried
tried_configurations = set()


def get_current_value(param):
    """Read the current value of `param` from config.yaml.

    Assumed helper: this version expects flat `key: value` lines
    with a single space after the colon.
    """
    with open("config.yaml", "r") as f:
        match = re.search(rf"^{re.escape(param)}: (\S+)", f.read(), re.MULTILINE)
    if match is None:
        raise KeyError(f"{param} not found in config.yaml")
    return match.group(1)


def propose_hyperparameter_change():
    """Propose a hyperparameter change to try."""
    changes = [
        ("learning_rate", [0.001, 0.002, 0.005, 0.0005]),
        ("dropout_rate", [0.1, 0.2, 0.3, 0.4, 0.5]),
        ("batch_size", [16, 32, 64, 128]),
        ("hidden_size", [256, 512, 768, 1024]),
        ("num_layers", [2, 3, 4, 5]),
    ]
    # Pick a random change we haven't tried
    for _ in range(100):  # Max attempts
        param, values = random.choice(changes)
        new_value = random.choice(values)
        config_key = f"{param}={new_value}"
        if config_key not in tried_configurations:
            tried_configurations.add(config_key)

            def change_fn():
                with open("config.yaml", "r") as f:
                    content = f.read()
                content = content.replace(
                    f"{param}: {get_current_value(param)}",
                    f"{param}: {new_value}"
                )
                with open("config.yaml", "w") as f:
                    f.write(content)
                return f"set {param} to {new_value}"
            return change_fn
    # Every option has been tried: return a change that does nothing
    return lambda: "no more changes"


# Run optimization
experimenter = AutonomousExperimenter(
    goal_metric="validation_accuracy",
    target_value=0.85,
    metric_extractor=lambda: MechanicalMetrics.get_validation_accuracy("train.log"),
    train_command="python train.py",
    max_iterations=100
)
experimenter.run(propose_hyperparameter_change)

Results after running overnight:

Total experiments: 87
Kept: 23 (26%)
Reverted: 64 (74%)
Final accuracy: 0.852 (target: 0.85)
Best configuration:
  learning_rate: 0.002
  dropout_rate: 0.3
  batch_size: 32
  hidden_size: 512
  num_layers: 3

When to use this approach

Karpathy’s autoresearch approach works well when:

  1. Clear objective: You can define a mechanical metric
  2. Atomic changes: Each experiment can be a single change
  3. Fast feedback: Training runs complete in hours, not days
  4. Safe to fail: Failed experiments don’t cause permanent damage
  5. Many possibilities: Large search space of potential improvements

It works less well when:

  • Training takes days (too slow for iteration)
  • Metrics are subjective (need human evaluation)
  • Changes are interdependent (can’t test independently)
  • Resources are limited (can’t run 100 experiments)
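
A quick way to check the fast-feedback condition is to work out how many sequential runs fit into one night. This is back-of-the-envelope arithmetic, not a measurement. The function name, the 8-hour night, and the run durations are illustrative assumptions.

```python
def experiments_per_night(
    night_hours: float,
    minutes_per_run: float,
    overhead_minutes: float = 0.0,
) -> int:
    """Number of sequential runs that fit into one night on a single machine."""
    if minutes_per_run <= 0:
        raise ValueError("minutes_per_run must be positive")
    return int(night_hours * 60 // (minutes_per_run + overhead_minutes))


print(experiments_per_night(8, 5))     # 96: close to 100 experiments per night
print(experiments_per_night(8, 180))   # 2: three-hour runs barely iterate
print(experiments_per_night(8, 2880))  # 0: a two-day run never finishes overnight
```

If the first number is in the single digits, the loop gets too few accept-or-revert decisions per night to compound, and shrinking the training run (smaller model, fewer steps, a data subset) is usually the first thing to try.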

Summary

In this post, I explained Karpathy’s autoresearch approach. The key points are:

  1. Constraint: Narrow problem definition with clear target
  2. Mechanical metric: Objective, programmatically extractable evaluation
  3. Autonomous iteration: Automatic rollback on failure for safe experimentation

The magic is in the combination. Constraint focuses your efforts. Mechanical metrics enable automation. Rollback enables safe, aggressive experimentation. Together, they create compounding gains from many small, safe experiments.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in a way that helps others. If you want to contact me, you can reach me by email.

