What Is Karpathy's Autoresearch Approach? A Complete Guide to Autonomous AI Experimentation

Mar 15, 2026

Problem

I spent weeks running ML experiments manually. Every night I would:

Think of a new hyperparameter configuration
Start a training run
Go to sleep hoping it works
Wake up to find it crashed at 3 AM
Check the logs to figure out what went wrong
Manually revert to the previous best model
Repeat

Here’s what my experiment log looked like:

Day 1: Tried learning_rate=0.01 -> diverged at epoch 15
Day 2: Tried learning_rate=0.001 -> better but plateaued at epoch 50
Day 3: Tried adding dropout=0.5 -> worse, forgot to save checkpoint
Day 4: Tried batch_size=64 -> OOM error at epoch 3
Day 5: Reverted to Day 2 model, lost track of what changed
...

I was making progress, but it was slow and error-prone. Each experiment required manual oversight, and when things went wrong, I had to spend hours debugging and reverting. I needed a way to run experiments autonomously with automatic failure recovery.

What I discovered

I found Andrej Karpathy’s autoresearch approach, which he used to systematically improve ML models. The core insight is simple but powerful:

“Constraint + mechanical metric + autonomous iteration = compounding gains”

This isn’t just about running more experiments. It’s about creating a feedback loop where:

Every experiment is narrow and well-defined (constraint)
Success can be measured automatically (mechanical metric)
Failures revert automatically (autonomous iteration with rollback)

The result: 630 lines of Python running 100 experiments per night with guaranteed state consistency.

The three pillars

Karpathy’s approach rests on three pillars that work together:

┌─────────────────────────────────────────────────────────────────────┐
│                    Karpathy's Autoresearch Approach                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   ┌─────────────┐     ┌─────────────┐     ┌─────────────────────┐   │
│   │  Constraint │     │  Mechanical │     │ Autonomous          │   │
│   │             │────▶│   Metric    │────▶│ Iteration           │   │
│   │  Narrow     │     │             │     │                     │   │
│   │  Problem    │     │  Objective  │     │  Auto-rollback      │   │
│   │  Definition │     │  Evaluation │     │  on Failure         │   │
│   └─────────────┘     └─────────────┘     └─────────────────────┘   │
│         │                    │                      │                │
│         ▼                    ▼                      ▼                │
│   ┌─────────────┐     ┌─────────────┐     ┌─────────────────────┐   │
│   │ What you    │     │ How you     │     │ How you keep       │   │
│   │ measure     │     │ measure it  │     │ improving safely   │   │
│   └─────────────┘     └─────────────┘     └─────────────────────┘   │
│                                                                      │
│   Result: Compounding gains from many small, safe experiments       │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Pillar 1: Constraint

The first pillar is narrowing the problem definition to something that can be iterated on automatically.

I used to run experiments with vague goals like “improve the model” or “make it faster.” These are too broad for autonomous iteration. The constraint pillar requires a specific, narrow focus.

Wrong (too broad):
- "Improve model accuracy"
- "Make training faster"
- "Reduce memory usage"

Right (constrained):
- "Increase validation accuracy from 78% to 80%"
- "Reduce training time from 4 hours to 3 hours"
- "Lower GPU memory from 12GB to 10GB"

A constrained problem has these characteristics:

Single objective: Focus on one thing at a time
Measurable baseline: Know where you’re starting
Clear target: Know when you’ve succeeded
Atomic changes: Each experiment changes one thing

Here’s how I applied constraint to my experiments:

# Before: Vague experiment
experiment = {
    "goal": "improve model",
    "changes": [
        "add dropout",
        "change learning rate",
        "increase hidden size",
        "add batch normalization"
    ]
}
# If this works, we don't know which change helped

# After: Constrained experiment
experiment = {
    "goal": "increase validation accuracy from 78% to 80%",
    "baseline_accuracy": 0.78,
    "target_accuracy": 0.80,
    "single_change": "add dropout layer with rate 0.3 after conv2",
    "atomic": True  # Only this one change
}
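One way to keep experiments honest about these four characteristics is to encode them in a small data structure that rejects vague definitions. This is only an illustrative sketch: the `ConstrainedExperiment` class, its field names, and its validation rules are assumptions made for this example, not part of Karpathy's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConstrainedExperiment:
    """Hypothetical container for one narrowly defined experiment."""
    metric_name: str               # single objective, e.g. "validation_accuracy"
    baseline: float                # measurable starting point
    target: float                  # clear success threshold
    change: str                    # the ONE change under test
    higher_is_better: bool = True  # False for time or memory targets

    def __post_init__(self):
        if not self.change.strip():
            raise ValueError("an experiment needs exactly one described change")
        if self.baseline == self.target:
            raise ValueError("target must differ from the baseline")
        if (self.target > self.baseline) != self.higher_is_better:
            raise ValueError("target is on the wrong side of the baseline")

    def reached(self, value: float) -> bool:
        """True once the measured value meets the target."""
        if self.higher_is_better:
            return value >= self.target
        return value <= self.target


exp = ConstrainedExperiment(
    metric_name="validation_accuracy",
    baseline=0.78,
    target=0.80,
    change="add dropout layer with rate 0.3 after conv2",
)
print(exp.reached(0.79), exp.reached(0.81))  # False True
```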

Pillar 2: Mechanical Metric

The second pillar is an objective, automatable evaluation metric.

A mechanical metric is something a script can measure without human judgment. This is critical for autonomous iteration because the system needs to decide whether an experiment succeeded or failed.

Wrong (subjective):
- "Model feels more robust"
- "Predictions look better"
- "Training is smoother"

Right (mechanical):
- "Validation accuracy: 0.82"
- "Training time: 2.5 hours"
- "Peak memory: 8.3 GB"

I created a metrics module that extracts values programmatically:

import subprocess
import re
from typing import Optional

class MechanicalMetrics:
    """Extract objective metrics from training runs."""

    @staticmethod
    def get_validation_accuracy(log_file: str) -> Optional[float]:
        """Parse validation accuracy from training log."""
        with open(log_file) as f:
            content = f.read()

        # Look for pattern: "Validation accuracy: 0.8234"
        match = re.search(r'Validation accuracy: ([\d.]+)', content)
        if match:
            return float(match.group(1))
        return None

    @staticmethod
    def get_training_time(log_file: str) -> Optional[float]:
        """Parse total training time in hours."""
        with open(log_file) as f:
            content = f.read()

        # Look for pattern: "Total time: 2.5 hours"
        match = re.search(r'Total time: ([\d.]+) hours', content)
        if match:
            return float(match.group(1))
        return None

    @staticmethod
    def get_peak_memory(log_file: str) -> Optional[float]:
        """Parse peak GPU memory in GB."""
        with open(log_file) as f:
            content = f.read()

        # Look for pattern: "Peak memory: 8.3 GB"
        match = re.search(r'Peak memory: ([\d.]+) GB', content)
        if match:
            return float(match.group(1))
        return None

    @staticmethod
    def run_command_and_extract(cmd: str, pattern: str) -> Optional[str]:
        """Run command and extract value matching pattern."""
        result = subprocess.run(
            cmd,
            shell=True,
            capture_output=True,
            text=True
        )

        match = re.search(pattern, result.stdout)
        if match:
            return match.group(1)
        return None

Using the metrics module:

# After a training run
metrics = MechanicalMetrics()

accuracy = metrics.get_validation_accuracy("train.log")
print(f"Validation accuracy: {accuracy}")  # 0.82

training_time = metrics.get_training_time("train.log")
print(f"Training time: {training_time} hours")  # 2.5

memory = metrics.get_peak_memory("train.log")
print(f"Peak memory: {memory} GB")  # 8.3
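Training scripts often print the metric once per epoch, in which case `re.search` returns the first epoch's value rather than the final one. As a hedged sketch, assuming the log uses the `Validation accuracy: <number>` line format shown above (the sample log text and the `last_metric` helper are made up for illustration), this variant takes the last reported value:

```python
import re
from typing import Optional

# Made-up log excerpt in the format assumed by the parser above
SAMPLE_LOG = """\
Epoch 1 | Validation accuracy: 0.7012
Epoch 2 | Validation accuracy: 0.7790
Epoch 3 | Validation accuracy: 0.8234
Total time: 2.5 hours
"""


def last_metric(text: str, label: str) -> Optional[float]:
    """Return the last '<label>: <number>' value in text, or None if absent."""
    pattern = re.escape(label) + r":\s*(\d+(?:\.\d+)?)"
    matches = re.findall(pattern, text)
    return float(matches[-1]) if matches else None


print(last_metric(SAMPLE_LOG, "Validation accuracy"))  # 0.8234
print(last_metric(SAMPLE_LOG, "Peak memory"))          # None
```

Returning `None` for a missing metric lets the caller treat "no metric logged" as a failed run instead of crashing.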

Pillar 3: Autonomous Iteration with Rollback

The third pillar is the most underrated part: automatic rollback on failure.

This is what enables compounding gains. When an experiment fails or makes things worse, the system automatically reverts to the previous state. No manual cleanup, no lost progress.

import subprocess
from dataclasses import dataclass
from typing import Optional, Callable

@dataclass
class ExperimentResult:
    iteration: int
    change_description: str
    metric_before: float
    metric_after: float
    success: bool
    kept: bool
    commit_sha: Optional[str]

class AutonomousExperimenter:
    def __init__(
        self,
        goal_metric: str,
        target_value: float,
        metric_extractor: Callable[[], float],
        train_command: str,
        max_iterations: int = 100
    ):
        self.goal_metric = goal_metric
        self.target_value = target_value
        self.metric_extractor = metric_extractor
        self.train_command = train_command
        self.max_iterations = max_iterations
        self.results_log = []

    def run_experiment(
        self,
        iteration: int,
        change_fn: Callable[[], str]
    ) -> ExperimentResult:
        """
        Run a single experiment with automatic rollback.

        Args:
            iteration: Current iteration number
            change_fn: Function that makes ONE atomic change and returns description

        Returns:
            ExperimentResult with outcome
        """
        # Baseline is the best metric kept so far. It is read from the log only
        # once: after a revert, the log no longer reflects the kept state.
        if getattr(self, "best_metric", None) is None:
            self.best_metric = self.metric_extractor()
        metric_before = self.best_metric

        # Make ONE atomic change
        change_description = change_fn()

        # Commit the change (so we can revert if needed).
        # --allow-empty keeps commit and reset paired even if nothing changed.
        subprocess.run(["git", "add", "-A"], check=True)
        subprocess.run([
            "git", "commit", "--allow-empty", "-m",
            f"experiment-{iteration}: {change_description}"
        ], check=True)
        commit_sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"]
        ).decode().strip()

        # Run training
        result = subprocess.run(
            self.train_command,
            shell=True,
            capture_output=True,
            text=True
        )

        # A crashed run, or one that logged no metric, counts as a failure
        metric_after = self.metric_extractor() if result.returncode == 0 else None
        if metric_after is None:
            # Training failed - rollback
            subprocess.run(["git", "reset", "--hard", "HEAD~1"], check=True)
            return ExperimentResult(
                iteration=iteration,
                change_description=change_description,
                metric_before=metric_before,
                metric_after=float("nan"),  # no usable measurement
                success=False,
                kept=False,
                commit_sha=None
            )

        # Decide whether to keep or revert (assumes higher is better)
        improved = metric_after > metric_before

        if improved:
            # Keep the change and raise the baseline
            self.best_metric = metric_after
            return ExperimentResult(
                iteration=iteration,
                change_description=change_description,
                metric_before=metric_before,
                metric_after=metric_after,
                success=True,
                kept=True,
                commit_sha=commit_sha
            )
        else:
            # Revert the change
            subprocess.run(["git", "reset", "--hard", "HEAD~1"], check=True)
            return ExperimentResult(
                iteration=iteration,
                change_description=change_description,
                metric_before=metric_before,
                metric_after=metric_after,
                success=True,
                kept=False,
                commit_sha=None
            )

    def log_result(self, result: ExperimentResult):
        """Log result to TSV file."""
        with open("experiment_log.tsv", "a") as f:
            f.write(
                f"{result.iteration}\t"
                f"{result.change_description}\t"
                f"{result.metric_before:.4f}\t"
                f"{result.metric_after:.4f}\t"
                f"{result.success}\t"
                f"{result.kept}\t"
                f"{result.commit_sha or 'N/A'}\n"
            )

    def run(self, propose_change_fn: Callable[[], Callable[[], str]]):
        """
        Run autonomous experiments until target is reached.

        Args:
            propose_change_fn: Function that returns a change_fn for each iteration
        """
        # Initialize log
        with open("experiment_log.tsv", "w") as f:
            f.write("iteration\tchange\tbefore\tafter\tsuccess\tkept\tcommit\n")

        for i in range(self.max_iterations):
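            # Check current metric: prefer the best kept value if one has been
            # recorded, since the log may belong to a run that was reverted
            current = getattr(self, "best_metric", None)
            if current is None:
                current = self.metric_extractor()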
            print(f"Iteration {i}: Current {self.goal_metric} = {current:.4f}")

            # Check if target reached
            if current >= self.target_value:
                print(f"Target {self.target_value} reached!")
                break

            # Propose and run experiment
            change_fn = propose_change_fn()
            result = self.run_experiment(i, change_fn)
            self.log_result(result)

            if result.kept:
                print(f"  Kept: {result.change_description} "
                      f"({result.metric_before:.4f} -> {result.metric_after:.4f})")
            else:
                print(f"  Reverted: {result.change_description}")

Here’s how I use it:

from autonomous_iteration import AutonomousExperimenter
from metrics import MechanicalMetrics

# Define metric extractor
def get_accuracy():
    return MechanicalMetrics.get_validation_accuracy("train.log")

# Define training command
train_cmd = "python train.py --config config.yaml"

# Create experimenter
experimenter = AutonomousExperimenter(
    goal_metric="validation_accuracy",
    target_value=0.80,
    metric_extractor=get_accuracy,
    train_command=train_cmd,
    max_iterations=50
)

# Define change proposals (in practice, this would be smarter)
def propose_change():
    """Propose a single change to try."""
    # This would be replaced with actual logic
    # to propose different hyperparameters, etc.
    def change_learning_rate():
        with open("config.yaml", "r") as f:
            config = f.read()
        # Modify learning rate
        # ...
        with open("config.yaml", "w") as f:
            f.write(config)
        return "increase learning rate to 0.002"

    return change_learning_rate

# Run autonomous experiments
experimenter.run(propose_change)

After running:

iteration  change                before  after   success  kept   commit
0          baseline              0.7800  0.7800  True     True   abc123
1          add dropout 0.3       0.7800  0.7650  True     False  N/A
2          increase lr to 0.002  0.7800  0.7920  True     True   def456
3          add batch norm        0.7920  0.8010  True     True   ghi789

Iteration 1 was reverted because accuracy dropped. Iteration 2 was kept because accuracy improved. After iteration 3, the target of 0.80 was reached.
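Because the log is plain TSV, the keep rate can be computed with the standard library. A minimal sketch, assuming the column names written by `log_result` above; the `summarize` helper is hypothetical, and the sample rows simply reuse the table above:

```python
import csv
import io

SAMPLE_TSV = (
    "iteration\tchange\tbefore\tafter\tsuccess\tkept\tcommit\n"
    "0\tbaseline\t0.7800\t0.7800\tTrue\tTrue\tabc123\n"
    "1\tadd dropout 0.3\t0.7800\t0.7650\tTrue\tFalse\tN/A\n"
    "2\tincrease lr to 0.002\t0.7800\t0.7920\tTrue\tTrue\tdef456\n"
    "3\tadd batch norm\t0.7920\t0.8010\tTrue\tTrue\tghi789\n"
)


def summarize(tsv_file) -> dict:
    """Count kept and reverted experiments and report the best kept metric."""
    rows = list(csv.DictReader(tsv_file, delimiter="\t"))
    kept = [r for r in rows if r["kept"] == "True"]
    return {
        "total": len(rows),
        "kept": len(kept),
        "reverted": len(rows) - len(kept),
        "best": max((float(r["after"]) for r in kept), default=None),
    }


print(summarize(io.StringIO(SAMPLE_TSV)))
```

For the real file, pass `open("experiment_log.tsv", newline="")` instead of the `StringIO` object.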

Why rollback matters

Without rollback on failure, you can’t safely run experiments autonomously: one bad run contaminates the baseline for every run that follows.

Without rollback

Day 1: Experiment 1 succeeds
Day 2: Experiment 2 fails, leaves model in bad state
Day 3: Experiment 3 runs on corrupted baseline
Day 4: Results are meaningless, need to start over

With rollback

Night 1: 100 experiments run, 15 kept, 85 reverted
Night 2: 100 more experiments, each starting from clean state
Night 3: 100 more experiments, compounding gains from Night 2
...

The key insight: rollback enables you to try aggressive experiments without fear of breaking things. If it doesn’t work, you’re automatically back to the previous state.
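The difference shows up even in a toy simulation with no training involved. In the sketch below, each "experiment" nudges a score by a random amount that is negative on average. The effect sizes, the seed, and the `simulate` function are all invented for illustration and say nothing about real training runs.

```python
import random


def simulate(n_experiments: int, rollback: bool, seed: int = 0) -> float:
    """Apply random score changes; with rollback, discard the harmful ones."""
    rng = random.Random(seed)
    score = 0.78
    for _ in range(n_experiments):
        delta = rng.gauss(-0.002, 0.005)  # most changes hurt slightly
        candidate = score + delta
        if rollback and candidate <= score:
            continue  # revert: the score stays at the last good state
        score = candidate
    return score


with_rollback = simulate(100, rollback=True)
without_rollback = simulate(100, rollback=False)
print(f"with rollback:    {with_rollback:.3f}")
print(f"without rollback: {without_rollback:.3f}")
```

Both runs see the same sequence of random changes. The only difference is that the rollback run never carries a harmful change forward, so its score can only go up.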

What can go wrong

Mistake 1: Vague metrics

Wrong: "improve model quality"
Right: "validation accuracy >= 0.80"

A vague metric can’t be extracted programmatically. The system can’t decide if an experiment succeeded.

Mistake 2: Multiple changes at once

# Wrong: Multiple changes
def change_multiple():
    increase_learning_rate()
    add_dropout()
    change_batch_size()
    # If accuracy changes, we don't know which change caused it

# Right: One change at a time
def change_one():
    increase_learning_rate()  # Only this

Mistake 3: No git commits before verification

# Wrong: Verify then commit
make_change()
run_training()
if improved:
    git_commit()  # Might forget to commit

# Right: Commit then verify
make_change()
git_commit()
run_training()
if not improved:
    git_reset_hard()  # Guaranteed rollback
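The "commit first, verify second" ordering can also be enforced structurally, so that no code path skips the rollback. The sketch below uses a plain file copy as a stand-in for `git commit` and `git reset --hard`, purely so the example runs without a repository; the `snapshot` context manager and the file names are hypothetical.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def snapshot(path: Path):
    """Save a copy of path up front; restore it unless the caller calls keep()."""
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)          # "commit" before anything runs
    state = {"keep": False}

    def keep():
        state["keep"] = True

    try:
        yield keep
    finally:
        if not state["keep"]:
            shutil.copy2(backup, path)  # "reset --hard": guaranteed rollback
        backup.unlink()


workdir = Path(tempfile.mkdtemp())
config = workdir / "config.yaml"
config.write_text("dropout_rate: 0.1\n")

# Experiment that does not improve the metric: the change is rolled back
with snapshot(config) as keep:
    config.write_text("dropout_rate: 0.5\n")
    improved = False
    if improved:
        keep()

print(config.read_text().strip())  # dropout_rate: 0.1
```

Because the restore lives in a `finally` block, it also runs when training raises an exception partway through.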

Mistake 4: Non-atomic changes

# Wrong: Change spreads across multiple files
def change_spread():
    modify_model_architecture()  # model.py
    update_config()              # config.yaml
    change_dataloader()          # data.py
    # Reverting is hard if something goes wrong

# Right: Encapsulated change
def change_encapsulated():
    # Change is in one place
    modify_dropout_rate()  # Only config.yaml

Real-world example

I applied this approach to optimize a text classification model:

import random
import re
from autonomous_iteration import AutonomousExperimenter
from metrics import MechanicalMetrics

# Track what we've tried
tried_configurations = set()

def propose_hyperparameter_change():
    """Propose a hyperparameter change to try."""
    changes = [
        ("learning_rate", [0.001, 0.002, 0.005, 0.0005]),
        ("dropout_rate", [0.1, 0.2, 0.3, 0.4, 0.5]),
        ("batch_size", [16, 32, 64, 128]),
        ("hidden_size", [256, 512, 768, 1024]),
        ("num_layers", [2, 3, 4, 5]),
    ]

    # Pick a random change we haven't tried
    for _ in range(100):  # Max attempts
        param, values = random.choice(changes)
        new_value = random.choice(values)
        config_key = f"{param}={new_value}"

        if config_key not in tried_configurations:
            tried_configurations.add(config_key)

            def change_fn():
                with open("config.yaml", "r") as f:
                    content = f.read()
                # Overwrite the top-level "param: ..." line in place,
                # whatever value it currently holds
                content = re.sub(
                    rf"^{param}:.*$",
                    f"{param}: {new_value}",
                    content,
                    flags=re.MULTILINE
                )
                with open("config.yaml", "w") as f:
                    f.write(content)
                return f"set {param} to {new_value}"

            return change_fn

    return lambda: "no more changes"

# Run optimization
experimenter = AutonomousExperimenter(
    goal_metric="validation_accuracy",
    target_value=0.85,
    metric_extractor=lambda: MechanicalMetrics.get_validation_accuracy("train.log"),
    train_command="python train.py",
    max_iterations=100
)

experimenter.run(propose_hyperparameter_change)
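Random sampling with retries can run out of untried values well before `max_iterations` is reached: the grid above contains only 21 distinct `param=value` pairs. A hedged alternative is to enumerate the single-parameter changes once, shuffle them, and hand them out in order. The `single_changes` generator, the `SEARCH_SPACE` name, and the seed are illustrative assumptions.

```python
import random
from typing import Dict, Iterator, List, Tuple

SEARCH_SPACE: Dict[str, List] = {
    "learning_rate": [0.001, 0.002, 0.005, 0.0005],
    "dropout_rate": [0.1, 0.2, 0.3, 0.4, 0.5],
    "batch_size": [16, 32, 64, 128],
    "hidden_size": [256, 512, 768, 1024],
    "num_layers": [2, 3, 4, 5],
}


def single_changes(space: Dict[str, List], seed: int = 0) -> Iterator[Tuple[str, object]]:
    """Yield every (param, value) pair exactly once, in shuffled order."""
    pairs = [(param, value) for param, values in space.items() for value in values]
    random.Random(seed).shuffle(pairs)
    yield from pairs


changes = list(single_changes(SEARCH_SPACE))
print(len(changes))       # 21
print(len(set(changes)))  # 21
```

When the generator is exhausted, the loop can stop instead of committing a no-op change.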

Results after running overnight:

Total experiments: 87
Kept: 23 (26%)
Reverted: 64 (74%)
Final accuracy: 0.852 (target: 0.85)
Best configuration:
  learning_rate: 0.002
  dropout_rate: 0.3
  batch_size: 32
  hidden_size: 512
  num_layers: 3

When to use this approach

Karpathy’s autoresearch approach works well when:

Clear objective: You can define a mechanical metric
Atomic changes: Each experiment can be a single change
Fast feedback: Training runs complete in hours, not days
Safe to fail: Failed experiments don’t cause permanent damage
Many possibilities: Large search space of potential improvements

It works less well when:

Training takes days (too slow for iteration)
Metrics are subjective (need human evaluation)
Changes are interdependent (can’t test independently)
Resources are limited (can’t run 100 experiments)
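The "fast feedback" condition is easy to sanity-check with arithmetic before committing to an overnight run. A rough sketch, where the 8-hour night and the per-run durations are assumed numbers rather than measurements:

```python
def experiments_per_night(night_hours: float, minutes_per_run: float, parallel_runs: int = 1) -> int:
    """Upper bound on experiments that fit in one night, ignoring setup overhead."""
    if minutes_per_run <= 0 or parallel_runs < 1:
        raise ValueError("minutes_per_run must be positive and parallel_runs >= 1")
    return int(night_hours * 60 // minutes_per_run) * parallel_runs


print(experiments_per_night(8, 5))                     # 96
print(experiments_per_night(8, 180))                   # 2
print(experiments_per_night(8, 180, parallel_runs=4))  # 8
```

Under these assumptions, reaching roughly 100 experiments per night on one machine requires each run to finish in about five minutes. Multi-hour runs leave room for only a handful of iterations unless they run in parallel.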

Summary

In this post, I explained Karpathy’s autoresearch approach. The key points are:

Constraint: Narrow problem definition with clear target
Mechanical metric: Objective, programmatically extractable evaluation
Autonomous iteration: Automatic rollback on failure for safe experimentation

The magic is in the combination. Constraint focuses your efforts. Mechanical metrics enable automation. Rollback enables safe, aggressive experimentation. Together, they create compounding gains from many small, safe experiments.

Final Words + More Resources

My intention with this article was to share my knowledge and experience with others. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Andrej Karpathy's Blog
