What Is Karpathy's Autoresearch Approach? A Complete Guide to Autonomous AI Experimentation
Problem
I spent weeks running ML experiments manually. Every night I would:
- Think of a new hyperparameter configuration
- Start a training run
- Go to sleep hoping it works
- Wake up to find it crashed at 3 AM
- Check the logs to figure out what went wrong
- Manually revert to the previous best model
- Repeat
Here’s what my experiment log looked like:

    Day 1: Tried learning_rate=0.01 -> diverged at epoch 15
    Day 2: Tried learning_rate=0.001 -> better but plateaued at epoch 50
    Day 3: Tried adding dropout=0.5 -> worse, forgot to save checkpoint
    Day 4: Tried batch_size=64 -> OOM error at epoch 3
    Day 5: Reverted to Day 2 model, lost track of what changed...

I was making progress, but it was slow and error-prone. Each experiment required manual oversight, and when things went wrong, I had to spend hours debugging and reverting. I needed a way to run experiments autonomously with automatic failure recovery.
What I discovered
I found Andrej Karpathy’s autoresearch approach, which he used to systematically improve ML models. The core insight is simple but powerful:
“Constraint + mechanical metric + autonomous iteration = compounding gains”
This isn’t just about running more experiments. It’s about creating a feedback loop where:
- Every experiment is narrow and well-defined (constraint)
- Success can be measured automatically (mechanical metric)
- Failures revert automatically (autonomous iteration with rollback)
The result: 630 lines of Python running 100 experiments per night with guaranteed state consistency.
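The feedback loop described above can be sketched in a few lines. This is a minimal in-memory illustration under my own assumptions, not Karpathy's code: `autoresearch_loop`, `propose`, and `evaluate` are hypothetical names, and the "rollback" here is simply discarding the candidate rather than a git reset.

```python
from typing import Callable, List, Tuple, TypeVar

State = TypeVar("State")


def autoresearch_loop(
    state: State,
    propose: Callable[[State, int], State],
    evaluate: Callable[[State], float],
    iterations: int,
) -> Tuple[State, List[float]]:
    """Keep a candidate only if the metric improves; otherwise discard it."""
    best = evaluate(state)
    history = [best]
    for i in range(iterations):
        candidate = propose(state, i)  # constraint: one narrow change
        score = evaluate(candidate)    # mechanical metric: a number, no judgment
        if score > best:
            state, best = candidate, score  # keep the gain
        # otherwise `state` is left untouched, which is the rollback
        history.append(best)
    return state, history


# Toy problem (hypothetical): move x toward 3 by alternating +1 / -1 steps.
def propose_step(x: float, i: int) -> float:
    return x + (1.0 if i % 2 == 0 else -1.0)


def negative_distance(x: float) -> float:
    return -abs(x - 3.0)


final_x, history = autoresearch_loop(0.0, propose_step, negative_distance, 10)
print(final_x, history)
```

Because rejected candidates never replace `state`, the best metric in `history` can only stay flat or go up, which is where the compounding comes from.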
The three pillars
Karpathy’s approach rests on three pillars that work together:

    ┌─────────────────────────────────────────────────────────────────────┐
    │                  Karpathy's Autoresearch Approach                   │
    ├─────────────────────────────────────────────────────────────────────┤
    │                                                                     │
    │   ┌─────────────┐     ┌─────────────┐     ┌─────────────────────┐   │
    │   │ Constraint  │     │ Mechanical  │     │ Autonomous          │   │
    │   │             │────▶│ Metric      │────▶│ Iteration           │   │
    │   │ Narrow      │     │             │     │                     │   │
    │   │ Problem     │     │ Objective   │     │ Auto-rollback       │   │
    │   │ Definition  │     │ Evaluation  │     │ on Failure          │   │
    │   └─────────────┘     └─────────────┘     └─────────────────────┘   │
    │          │                   │                       │              │
    │          ▼                   ▼                       ▼              │
    │   ┌─────────────┐     ┌─────────────┐     ┌─────────────────────┐   │
    │   │ What you    │     │ How you     │     │ How you keep        │   │
    │   │ measure     │     │ measure it  │     │ improving safely    │   │
    │   └─────────────┘     └─────────────┘     └─────────────────────┘   │
    │                                                                     │
    │   Result: Compounding gains from many small, safe experiments       │
    │                                                                     │
    └─────────────────────────────────────────────────────────────────────┘

Pillar 1: Constraint
The first pillar is narrowing the problem definition to something that can be iterated on automatically.
I used to run experiments with vague goals like “improve the model” or “make it faster.” These are too broad for autonomous iteration. The constraint pillar requires a specific, narrow focus.
Wrong (too broad):
- "Improve model accuracy"
- "Make training faster"
- "Reduce memory usage"
Right (constrained):
- "Increase validation accuracy from 78% to 80%"
- "Reduce training time from 4 hours to 3 hours"
- "Lower GPU memory from 12GB to 10GB"
A constrained problem has these characteristics:
- Single objective: Focus on one thing at a time
- Measurable baseline: Know where you’re starting
- Clear target: Know when you’ve succeeded
- Atomic changes: Each experiment changes one thing
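The "atomic changes" rule in the list above can be checked mechanically. The sketch below is my own illustration, assuming the experiment is described by a flat dictionary of settings; `changed_keys` and `assert_atomic` are hypothetical helper names.

```python
from typing import Any, Dict, List


def changed_keys(before: Dict[str, Any], after: Dict[str, Any]) -> List[str]:
    """Return the sorted keys whose values differ between two flat configs."""
    missing = object()  # sentinel so added or removed keys count as changes
    keys = set(before) | set(after)
    return sorted(k for k in keys if before.get(k, missing) != after.get(k, missing))


def assert_atomic(before: Dict[str, Any], after: Dict[str, Any]) -> str:
    """Raise unless exactly one setting changed; return the changed key."""
    diff = changed_keys(before, after)
    if len(diff) != 1:
        raise ValueError(f"expected exactly 1 change, got {len(diff)}: {diff}")
    return diff[0]


baseline = {"learning_rate": 0.001, "dropout": 0.0, "batch_size": 32}
atomic = {**baseline, "dropout": 0.3}
bundled = {**baseline, "dropout": 0.3, "learning_rate": 0.01}

print(assert_atomic(baseline, atomic))   # the single changed key
print(changed_keys(baseline, bundled))   # two keys: not atomic
```

Running a check like this before each experiment rejects bundled changes early, so every logged result can be attributed to one setting.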
Here’s how I applied constraint to my experiments:

    # Before: Vague experiment
    experiment = {
        "goal": "improve model",
        "changes": [
            "add dropout",
            "change learning rate",
            "increase hidden size",
            "add batch normalization",
        ],
    }
    # If this works, we don't know which change helped

    # After: Constrained experiment
    experiment = {
        "goal": "increase validation accuracy from 78% to 80%",
        "baseline_accuracy": 0.78,
        "target_accuracy": 0.80,
        "single_change": "add dropout layer with rate 0.3 after conv2",
        "atomic": True,  # Only this one change
    }

Pillar 2: Mechanical Metric
The second pillar is an objective, automatable evaluation metric.
A mechanical metric is something a script can measure without human judgment. This is critical for autonomous iteration because the system needs to decide whether an experiment succeeded or failed.
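The succeed-or-fail decision can itself be a small function. The sketch below is an assumption of mine rather than part of the original setup: `is_improvement` is a hypothetical helper, and `min_delta` is an optional noise threshold. It also handles lower-is-better metrics such as training time or peak memory.

```python
from typing import Optional


def is_improvement(
    before: Optional[float],
    after: Optional[float],
    higher_is_better: bool = True,
    min_delta: float = 0.0,
) -> bool:
    """Decide mechanically whether `after` beats `before`.

    A missing metric (None) never counts as an improvement.
    """
    if before is None or after is None:
        return False
    gain = (after - before) if higher_is_better else (before - after)
    return gain > min_delta


print(is_improvement(0.78, 0.82))                         # accuracy went up
print(is_improvement(4.0, 3.0, higher_is_better=False))   # training time went down
print(is_improvement(0.78, 0.781, min_delta=0.005))       # gain too small to count
print(is_improvement(0.78, None))                         # metric could not be parsed
```

Treating a missing metric as "not improved" means a crashed or silent run is handled the same way as a bad one.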
Wrong (subjective):
- "Model feels more robust"
- "Predictions look better"
- "Training is smoother"
Right (mechanical):
- "Validation accuracy: 0.82"
- "Training time: 2.5 hours"
- "Peak memory: 8.3 GB"
I created a metrics module that extracts values programmatically:

    import re
    import subprocess
    from typing import Optional


    class MechanicalMetrics:
        """Extract objective metrics from training runs."""

        @staticmethod
        def get_validation_accuracy(log_file: str) -> Optional[float]:
            """Parse validation accuracy from training log."""
            with open(log_file) as f:
                content = f.read()

            # Look for pattern: "Validation accuracy: 0.8234"
            match = re.search(r'Validation accuracy: ([\d.]+)', content)
            if match:
                return float(match.group(1))
            return None

        @staticmethod
        def get_training_time(log_file: str) -> Optional[float]:
            """Parse total training time in hours."""
            with open(log_file) as f:
                content = f.read()

            # Look for pattern: "Total time: 2.5 hours"
            match = re.search(r'Total time: ([\d.]+) hours', content)
            if match:
                return float(match.group(1))
            return None

        @staticmethod
        def get_peak_memory(log_file: str) -> Optional[float]:
            """Parse peak GPU memory in GB."""
            with open(log_file) as f:
                content = f.read()

            # Look for pattern: "Peak memory: 8.3 GB"
            match = re.search(r'Peak memory: ([\d.]+) GB', content)
            if match:
                return float(match.group(1))
            return None

        @staticmethod
        def run_command_and_extract(cmd: str, pattern: str) -> Optional[str]:
            """Run command and extract value matching pattern."""
            result = subprocess.run(
                cmd,
                shell=True,
                capture_output=True,
                text=True
            )

            match = re.search(pattern, result.stdout)
            if match:
                return match.group(1)
            return None

Using the metrics module:

    from metrics import MechanicalMetrics

    # After a training run
    metrics = MechanicalMetrics()

    accuracy = metrics.get_validation_accuracy("train.log")
    print(f"Validation accuracy: {accuracy}")  # 0.82

    time = metrics.get_training_time("train.log")
    print(f"Training time: {time} hours")  # 2.5

    memory = metrics.get_peak_memory("train.log")
    print(f"Peak memory: {memory} GB")  # 8.3

Pillar 3: Autonomous Iteration with Rollback
The third pillar is the most underrated part: automatic rollback on failure.
This is what enables compounding gains. When an experiment fails or makes things worse, the system automatically reverts to the previous state. No manual cleanup, no lost progress.
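Deciding that an experiment "failed" has to be mechanical too, and that includes runs that hang rather than crash. The sketch below is one way to do it under my own assumptions: `run_training` is a hypothetical helper, the time limit is arbitrary, and the current Python interpreter stands in for a real training script.

```python
import subprocess
import sys
from typing import List


def run_training(cmd: List[str], timeout_s: float) -> bool:
    """Return True only if the command exits with code 0 within the time limit."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False  # a hung run counts as a failed experiment
    return result.returncode == 0


# Stand-ins for a real training script.
ok = run_training([sys.executable, "-c", "print('done')"], timeout_s=30)
crashed = run_training([sys.executable, "-c", "raise SystemExit(1)"], timeout_s=30)
hung = run_training([sys.executable, "-c", "import time; time.sleep(60)"], timeout_s=1)
print(ok, crashed, hung)
```

Any `False` result is then a signal to roll back, whether the cause was a crash, an out-of-memory error, or a run that never finished.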

    import subprocess
    from dataclasses import dataclass
    from typing import Callable, List, Optional


    @dataclass
    class ExperimentResult:
        iteration: int
        change_description: str
        metric_before: float
        metric_after: float
        success: bool  # training finished and produced a metric
        kept: bool     # the change improved the metric and stayed committed
        commit_sha: Optional[str]


    class AutonomousExperimenter:
        # Assumes the working directory is a git repository with at least one
        # commit, and that train.log and experiment_log.tsv are git-ignored so
        # that a rollback does not touch them.

        def __init__(
            self,
            goal_metric: str,
            target_value: float,
            metric_extractor: Callable[[], Optional[float]],
            train_command: str,
            max_iterations: int = 100
        ):
            self.goal_metric = goal_metric
            self.target_value = target_value
            self.metric_extractor = metric_extractor
            self.train_command = train_command
            self.max_iterations = max_iterations
            self.results_log: List[ExperimentResult] = []
            # Metric of the last kept state. Tracked in memory because the
            # log on disk reflects the most recent run, even a reverted one.
            self.best_metric: Optional[float] = None

        def _baseline(self) -> float:
            """Return the metric of the current kept state, reading it once."""
            if self.best_metric is None:
                self.best_metric = self.metric_extractor()
            if self.best_metric is None:
                raise RuntimeError("No baseline metric found; run training once first")
            return self.best_metric

        def _rollback(self) -> None:
            """Drop the experiment commit and restore the previous state."""
            subprocess.run(["git", "reset", "--hard", "HEAD~1"], check=True)

        def run_experiment(
            self,
            iteration: int,
            change_fn: Callable[[], str]
        ) -> ExperimentResult:
            """
            Run a single experiment with automatic rollback.

            Args:
                iteration: Current iteration number
                change_fn: Function that makes ONE atomic change and returns description

            Returns:
                ExperimentResult with outcome
            """
            # Get baseline metric
            metric_before = self._baseline()

            # Make ONE atomic change
            change_description = change_fn()

            # Commit the change (so we can revert if needed).
            # --allow-empty keeps a no-op change from aborting the loop.
            subprocess.run(["git", "add", "-A"], check=True)
            subprocess.run([
                "git", "commit", "--allow-empty", "-m",
                f"experiment-{iteration}: {change_description}"
            ], check=True)
            commit_sha = subprocess.check_output(
                ["git", "rev-parse", "HEAD"]
            ).decode().strip()

            # Run training
            result = subprocess.run(
                self.train_command,
                shell=True,
                capture_output=True,
                text=True
            )

            # Get new metric (None if training crashed or logged nothing)
            metric_after = self.metric_extractor() if result.returncode == 0 else None

            if metric_after is None:
                # Training failed - rollback
                self._rollback()
                return ExperimentResult(
                    iteration=iteration,
                    change_description=change_description,
                    metric_before=metric_before,
                    metric_after=0.0,
                    success=False,
                    kept=False,
                    commit_sha=None
                )

            # Decide whether to keep or revert
            improved = metric_after > metric_before
            if improved:
                self.best_metric = metric_after  # Keep the change
            else:
                self._rollback()  # Revert the change

            return ExperimentResult(
                iteration=iteration,
                change_description=change_description,
                metric_before=metric_before,
                metric_after=metric_after,
                success=True,
                kept=improved,
                commit_sha=commit_sha if improved else None
            )

        def log_result(self, result: ExperimentResult):
            """Log result to TSV file."""
            self.results_log.append(result)
            with open("experiment_log.tsv", "a") as f:
                f.write(
                    f"{result.iteration}\t"
                    f"{result.change_description}\t"
                    f"{result.metric_before:.4f}\t"
                    f"{result.metric_after:.4f}\t"
                    f"{result.success}\t"
                    f"{result.kept}\t"
                    f"{result.commit_sha or 'N/A'}\n"
                )

        def run(self, propose_change_fn: Callable[[], Callable[[], str]]):
            """
            Run autonomous experiments until target is reached.

            Args:
                propose_change_fn: Function that returns a change_fn for each iteration
            """
            # Initialize log
            with open("experiment_log.tsv", "w") as f:
                f.write("iteration\tchange\tbefore\tafter\tsuccess\tkept\tcommit\n")

            for i in range(self.max_iterations):
                # Check current metric (the last kept state, not the last run)
                current = self._baseline()
                print(f"Iteration {i}: Current {self.goal_metric} = {current:.4f}")

                # Check if target reached
                if current >= self.target_value:
                    print(f"Target {self.target_value} reached!")
                    break

                # Propose and run experiment
                change_fn = propose_change_fn()
                result = self.run_experiment(i, change_fn)
                self.log_result(result)

                if result.kept:
                    print(f"  Kept: {result.change_description} "
                          f"({result.metric_before:.4f} -> {result.metric_after:.4f})")
                else:
                    print(f"  Reverted: {result.change_description}")

Here’s how I use it:

    from autonomous_iteration import AutonomousExperimenter
    from metrics import MechanicalMetrics

    # Define metric extractor
    def get_accuracy():
        return MechanicalMetrics.get_validation_accuracy("train.log")

    # Define training command
    train_cmd = "python train.py --config config.yaml"

    # Create experimenter
    experimenter = AutonomousExperimenter(
        goal_metric="validation_accuracy",
        target_value=0.80,
        metric_extractor=get_accuracy,
        train_command=train_cmd,
        max_iterations=50
    )

    # Define change proposals (in practice, this would be smarter)
    def propose_change():
        """Propose a single change to try."""
        # This would be replaced with actual logic
        # to propose different hyperparameters, etc.
        def change_learning_rate():
            with open("config.yaml", "r") as f:
                config = f.read()
            # Modify learning rate
            # ...
            with open("config.yaml", "w") as f:
                f.write(config)
            return "increase learning rate to 0.002"

        return change_learning_rate

    # Run autonomous experiments
    experimenter.run(propose_change)

After running:

    iteration  change                 before  after   success  kept   commit
    0          baseline               0.7800  0.7800  True     True   abc123
    1          add dropout 0.3        0.7800  0.7650  True     False  N/A
    2          increase lr to 0.002   0.7800  0.7920  True     True   def456
    3          add batch norm         0.7920  0.8010  True     True   ghi789

Iteration 1 was reverted because accuracy dropped. Iteration 2 was kept because accuracy improved. After iteration 3, the target of 0.80 was reached.
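Since the log is plain TSV, summarizing a night of experiments takes only a few lines. This is a minimal sketch under my own assumptions: `summarize_log` is a hypothetical helper, and the sample string repeats the four rows shown above using the column names that `log_result` writes.

```python
import csv
import io
from typing import Dict


def summarize_log(tsv_text: str) -> Dict[str, float]:
    """Count kept vs. reverted experiments and report the best kept metric."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    kept = [r for r in rows if r["kept"] == "True"]
    return {
        "total": len(rows),
        "kept": len(kept),
        "reverted": len(rows) - len(kept),
        "best": max((float(r["after"]) for r in kept), default=float("nan")),
    }


sample = (
    "iteration\tchange\tbefore\tafter\tsuccess\tkept\tcommit\n"
    "0\tbaseline\t0.7800\t0.7800\tTrue\tTrue\tabc123\n"
    "1\tadd dropout 0.3\t0.7800\t0.7650\tTrue\tFalse\tN/A\n"
    "2\tincrease lr to 0.002\t0.7800\t0.7920\tTrue\tTrue\tdef456\n"
    "3\tadd batch norm\t0.7920\t0.8010\tTrue\tTrue\tghi789\n"
)
summary = summarize_log(sample)
print(summary)
```

For a real run, the same function can be fed the contents of `experiment_log.tsv` read from disk.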
Why rollback matters
Rollback on failure is what makes unattended runs safe. Without it, a single bad experiment contaminates every experiment that follows.
Without rollback

    Day 1: Experiment 1 succeeds
    Day 2: Experiment 2 fails, leaves model in bad state
    Day 3: Experiment 3 runs on corrupted baseline
    Day 4: Results are meaningless, need to start over

With rollback

    Night 1: 100 experiments run, 15 kept, 85 reverted
    Night 2: 100 more experiments, each starting from clean state
    Night 3: 100 more experiments, compounding gains from Night 2
    ...

The key insight: rollback enables you to try aggressive experiments without fear of breaking things. If it doesn’t work, you’re automatically back to the previous state.
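The same guarantee can be shown without git. The sketch below swaps in a simpler technique, a file snapshot restored by a context manager, purely to illustrate the idea. `rollback_on_failure` is a hypothetical helper, and it only reacts to exceptions, so a worse metric would have to be signalled by raising.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def rollback_on_failure(path: str):
    """Restore `path` to its previous contents if the body raises."""
    backup = path + ".bak"
    shutil.copy2(path, backup)      # snapshot the known-good state
    try:
        yield
    except Exception:
        shutil.copy2(backup, path)  # experiment failed: restore the snapshot
        raise
    finally:
        os.remove(backup)


# Demo in a temporary directory with a stand-in config file.
workdir = tempfile.mkdtemp()
config_path = os.path.join(workdir, "config.yaml")
with open(config_path, "w") as f:
    f.write("dropout_rate: 0.3\n")

try:
    with rollback_on_failure(config_path):
        with open(config_path, "w") as f:
            f.write("dropout_rate: 0.9\n")       # aggressive change
        raise RuntimeError("training diverged")  # simulated failure
except RuntimeError:
    pass

with open(config_path) as f:
    restored = f.read()
print(restored)
```

Git does the same job across many files at once, which is why the experimenter above commits before training and resets afterwards.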
What can go wrong
Mistake 1: Vague metrics

    Wrong: "improve model quality"
    Right: "validation accuracy >= 0.80"

A vague metric can’t be extracted programmatically. The system can’t decide if an experiment succeeded.
Mistake 2: Multiple changes at once

    # Wrong: Multiple changes
    def change_multiple():
        increase_learning_rate()
        add_dropout()
        change_batch_size()
        # If accuracy changes, we don't know which change caused it

    # Right: One change at a time
    def change_one():
        increase_learning_rate()  # Only this

Mistake 3: No git commits before verification

    # Wrong: Verify then commit
    make_change()
    run_training()
    if improved:
        git_commit()  # Might forget to commit

    # Right: Commit then verify
    make_change()
    git_commit()
    run_training()
    if not improved:
        git_reset_hard()  # Guaranteed rollback

Mistake 4: Non-atomic changes

    # Wrong: Change spreads across multiple files
    def change_spread():
        modify_model_architecture()  # model.py
        update_config()              # config.yaml
        change_dataloader()          # data.py
        # Reverting is hard if something goes wrong

    # Right: Encapsulated change
    def change_encapsulated():
        # Change is in one place
        modify_dropout_rate()  # Only config.yaml

Real-world example
I applied this approach to optimize a text classification model:

    import random
    import re

    from autonomous_iteration import AutonomousExperimenter
    from metrics import MechanicalMetrics

    # Track what we've tried
    tried_configurations = set()

    def propose_hyperparameter_change():
        """Propose a hyperparameter change to try."""
        changes = [
            ("learning_rate", [0.001, 0.002, 0.005, 0.0005]),
            ("dropout_rate", [0.1, 0.2, 0.3, 0.4, 0.5]),
            ("batch_size", [16, 32, 64, 128]),
            ("hidden_size", [256, 512, 768, 1024]),
            ("num_layers", [2, 3, 4, 5]),
        ]

        # Pick a random change we haven't tried
        for _ in range(100):  # Max attempts
            param, values = random.choice(changes)
            new_value = random.choice(values)
            config_key = f"{param}={new_value}"

            if config_key not in tried_configurations:
                tried_configurations.add(config_key)

                def change_fn():
                    with open("config.yaml", "r") as f:
                        content = f.read()
                    # Rewrite the "<param>: <value>" line in place. Assumes a
                    # flat config.yaml with one "key: value" pair per line.
                    content = re.sub(
                        rf"^{param}:.*$",
                        f"{param}: {new_value}",
                        content,
                        flags=re.MULTILINE
                    )
                    with open("config.yaml", "w") as f:
                        f.write(content)
                    return f"set {param} to {new_value}"

                return change_fn

        return lambda: "no more changes"

    # Run optimization
    experimenter = AutonomousExperimenter(
        goal_metric="validation_accuracy",
        target_value=0.85,
        metric_extractor=lambda: MechanicalMetrics.get_validation_accuracy("train.log"),
        train_command="python train.py",
        max_iterations=100
    )

    experimenter.run(propose_hyperparameter_change)

Results after running overnight:

    Total experiments: 87
    Kept: 23 (26%)
    Reverted: 64 (74%)
    Final accuracy: 0.852 (target: 0.85)
    Best configuration:
      learning_rate: 0.002
      dropout_rate: 0.3
      batch_size: 32
      hidden_size: 512
      num_layers: 3

When to use this approach
Karpathy’s autoresearch approach works well when:
- Clear objective: You can define a mechanical metric
- Atomic changes: Each experiment can be a single change
- Fast feedback: Training runs complete in hours, not days
- Safe to fail: Failed experiments don’t cause permanent damage
- Many possibilities: Large search space of potential improvements
It works less well when:
- Training takes days (too slow for iteration)
- Metrics are subjective (need human evaluation)
- Changes are interdependent (can’t test independently)
- Resources are limited (can’t run 100 experiments)
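Whether the approach fits is partly simple arithmetic: how many complete runs fit into one night. The sketch below is my own back-of-the-envelope helper. `experiments_per_night` is a hypothetical name, and the 8-hour night and the worker count are assumptions.

```python
def experiments_per_night(
    run_minutes: float, night_hours: float = 8.0, workers: int = 1
) -> int:
    """Number of complete runs that fit into one night across all workers."""
    runs_per_worker = int(night_hours * 60 // run_minutes)
    return runs_per_worker * workers


print(experiments_per_night(5))               # 96 five-minute runs on one machine
print(experiments_per_night(120))             # 4 two-hour runs on one machine
print(experiments_per_night(120, workers=8))  # 32 two-hour runs on eight machines
```

On these assumptions, reaching roughly 100 experiments per night needs either runs of about five minutes or enough parallel workers to make up the difference.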
Summary
In this post, I explained Karpathy’s autoresearch approach. The key points are:
- Constraint: Narrow problem definition with clear target
- Mechanical metric: Objective, programmatically extractable evaluation
- Autonomous iteration: Automatic rollback on failure for safe experimentation
The magic is in the combination. Constraint focuses your efforts. Mechanical metrics enable automation. Rollback enables safe, aggressive experimentation. Together, they create compounding gains from many small, safe experiments.
Final Words + More Resources
My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to contact me, you can reach me by email.