
How Does Karpathy's Autoresearch Automated ML Loop Actually Work?

Problem

I spent months manually tuning my ML training pipeline. Each experiment took hours: read papers, implement changes, wait for training to finish, analyze results. The bottleneck wasn’t the code - it was me.

When I saw Andrej Karpathy’s autoresearch project, I realized what I was doing wrong. His approach ran 700 experiments in 48 hours on a single GPU. It found a bug he had missed. It achieved an 11% performance improvement without any human intervention.

This post explains how this automated ML research loop works and why it succeeds where manual approaches struggle.

What Is Autoresearch?

Autoresearch is an automated loop that:

  1. Reads a research brief (program.md)
  2. Proposes code changes to train.py
  3. Trains for exactly 5 minutes
  4. Validates the improvement
  5. Commits successful changes, discards failures
  6. Repeats forever

The entire system runs on a 630-line stripped-down LLM framework called nanochat. No complex infrastructure. No multiple metrics. Just a simple binary decision: did the validation loss improve?

The Loop Architecture

Here’s how the loop looks conceptually:

autoresearch-loop-flow.txt
+--------------------+
|  Read program.md   |
|  (research goal)   |
+--------------------+
          |
          v
+--------------------+
| Propose code change|
|    to train.py     |
+--------------------+
          |
          v
+--------------------+
|  Train for 5 min   |
|   on single GPU    |
+--------------------+
          |
          v
+--------------------+
| Measure validation |
|        loss        |
+--------------------+
          |
          v
     +----------+
     | Improved?|
     +----------+
       /      \
     YES       NO
      |         |
      v         v
+--------+ +--------+
| Commit | | Discard|
| change | | revert |
+--------+ +--------+
     |          |
     +-----+----+
           |
           v
    (Repeat forever)

This simplicity is intentional. Karpathy chose a minimal 630-line codebase because complex frameworks introduce more failure modes. The agent cannot effectively navigate complexity.

The Core Loop Code

Here’s a simplified representation of the actual loop:

autoresearch_loop.py
import subprocess

# NOTE: read_file, write_file, call_llm, get_current_validation_loss,
# evaluate_validation_loss, git_commit, git_reset_hard, update_program_md
# and describe_change are placeholders for project-specific helpers.


def run_autoresearch_loop():
    while True:
        # Step 1: Read research brief and the current training script
        program = read_file("program.md")
        current_code = read_file("train.py")

        # Step 2: Agent proposes code change
        proposed_code = llm_propose_improvement(
            program_brief=program,
            current_code=current_code,
            goal="Reduce validation loss",
        )

        # Step 3: Record the loss to beat, then apply the change temporarily
        baseline_loss = get_current_validation_loss()
        write_file("train.py", proposed_code)

        # Step 4: Train for exactly 5 minutes
        subprocess.run(["python", "train.py", "--duration", "5m"])
        new_loss = evaluate_validation_loss()

        # Step 5: Binary decision - commit or discard
        if new_loss < baseline_loss:
            # Update the brief first so the note is part of the commit
            update_program_md(f"SUCCESS: {describe_change()}")
            git_commit(f"Improved: {new_loss:.4f} < {baseline_loss:.4f}")
        else:
            git_reset_hard()  # Discard changes
            # Written after the reset so the note itself is not discarded
            update_program_md(f"FAILED: {describe_change()}")


def llm_propose_improvement(program_brief, current_code, goal):
    """LLM agent reads context and proposes a specific change."""
    prompt = f"""
Research Goal: {program_brief}

Current train.py:
{current_code}

Propose ONE specific change. Objective: {goal}.
Output the modified train.py only.
"""
    return call_llm(prompt)

The key insight: every decision is binary. Either the loss improved, or it didn’t. No subjective judgment. No weighing multiple metrics.

Why 5-Minute Training Windows?

I initially thought longer training runs would give better results. But Karpathy’s approach proved me wrong.

training-duration-comparison.txt
LONG TRAINING (my old approach):
  - 12 hours per experiment
  - ~2 experiments per day
  - Subject to daily interruptions
  - Result: 14 experiments in 1 week

SHORT TRAINING (autoresearch approach):
  - 5 minutes per experiment
  - 700 experiments in 48 hours
  - No human needed
  - Result: 20 successful improvements found

Short windows enable rapid iteration. The agent can explore many hypotheses quickly. Most fail, but the successful ones accumulate.

This is the exploration vs. exploitation tradeoff. Short runs favor exploration. The agent tries many approaches, discarding failures quickly.
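To make the trade-off concrete, here is the arithmetic behind that comparison as a small sketch. The function name `sequential_runs` is made up for this illustration, and it assumes runs happen back to back on one GPU with no setup or evaluation overhead, so it gives an upper bound rather than a measured figure.

```python
def sequential_runs(budget_hours: float, run_minutes: float) -> int:
    """Upper bound on back-to-back runs that fit in a wall-clock budget."""
    if run_minutes <= 0:
        raise ValueError("run_minutes must be positive")
    return int(budget_hours * 60 // run_minutes)


# One week of 12-hour runs, as in the "long training" case.
long_runs_per_week = sequential_runs(budget_hours=7 * 24, run_minutes=12 * 60)

# One hour of 5-minute runs.
short_runs_per_hour = sequential_runs(budget_hours=1, run_minutes=5)

print(long_runs_per_week)   # 14
print(short_runs_per_hour)  # 12
```

At 12 runs per hour versus one run per 12 hours, the short window tests 144 times as many hypotheses per GPU-hour, which is why a low hit rate can still add up to real progress.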

The Research Brief (program.md)

The program.md file acts as persistent research memory. Without it, the agent explores randomly. With it, the agent focuses on promising directions.

program.md
# Research Brief: Improve GPT-2 Training Speed
## Goal
Reduce "Time to GPT-2" - the time to reach GPT-2 validation loss baseline
## Current Status
- Baseline: 2.02 hours
- Best achieved: 1.80 hours (after 48 hours of autoresearch)
## Successful Changes (DO NOT REMOVE)
- QKNorm: Added missing scalar multiplier (major improvement)
- Learning rate: Increased from 3e-4 to 6e-4
- Gradient accumulation: Reduced from 8 to 4
## Failed Attempts (AVOID REPEATING)
- Weight decay adjustments (no improvement)
- Dropout addition (worse loss)
- AdamW epsilon changes (no improvement)
## Promising Directions (EXPLORE THESE)
- Check attention layer optimization
- Review normalization implementations
- Consider batch size adjustments

This file guides the agent. It shows what worked, what failed, and where to look next. After each experiment, the file updates with the result.
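The update step can be as simple as appending one bullet under the matching heading. Below is a minimal sketch, assuming the heading names from the example brief; `record_result` is a hypothetical helper of mine, not part of autoresearch itself, and the example entry is made up.

```python
def record_result(brief: str, description: str, success: bool) -> str:
    """Return a copy of the brief with one bullet added to the matching section."""
    heading = "## Successful Changes" if success else "## Failed Attempts"
    lines = brief.splitlines()

    # Locate the section heading (prefix match, so suffixes such as
    # "(DO NOT REMOVE)" are tolerated).
    start = next(
        (i for i, line in enumerate(lines) if line.startswith(heading)), None
    )
    if start is None:
        # Section missing: create it at the end of the brief.
        return "\n".join(lines + [heading, f"- {description}"]) + "\n"

    # The section ends at the next "## " heading or at end of file.
    end = next(
        (i for i in range(start + 1, len(lines)) if lines[i].startswith("## ")),
        len(lines),
    )
    # Insert after the last non-blank line of the section.
    while end > start + 1 and not lines[end - 1].strip():
        end -= 1
    lines.insert(end, f"- {description}")
    return "\n".join(lines) + "\n"


brief = """# Research Brief
## Successful Changes (DO NOT REMOVE)
- Learning rate: Increased from 3e-4 to 6e-4
## Failed Attempts (AVOID REPEATING)
- Dropout addition (worse loss)
"""
updated = record_result(brief, "Label smoothing 0.1 (no improvement)", success=False)
print(updated)
```

Keeping the function pure (string in, string out) makes it easy to test separately from the file I/O and git steps around it.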

The Bug That Karpathy Missed

The most striking result: the agent found a bug Karpathy himself had written and missed.

qknorm_bug_fix.py
import torch
import torch.nn as nn

# Illustrative reconstruction, not the actual nanochat source.

# Before (bug - missing scalar multiplier)
class QKNorm(nn.Module):
    """Query-Key normalization for attention"""
    def forward(self, q, k):
        # Normalize query and key vectors to unit length
        q = q / q.norm(dim=-1, keepdim=True)
        k = k / k.norm(dim=-1, keepdim=True)
        # Compute attention scores
        # BUG: unit-norm q and k keep every score in [-1, 1], so the
        # softmax over them stays close to uniform
        return q @ k.transpose(-2, -1)


# After (fixed by autoresearch agent)
class QKNorm(nn.Module):
    """Query-Key normalization for attention"""
    def __init__(self, init_scale=10.0):  # init_scale is an assumed value
        super().__init__()
        # Learnable multiplier that lets the model sharpen attention
        self.scale = nn.Parameter(torch.tensor(init_scale))

    def forward(self, q, k):
        q = q / q.norm(dim=-1, keepdim=True)
        k = k / k.norm(dim=-1, keepdim=True)
        # Scores now span [-scale, scale]
        return (q @ k.transpose(-2, -1)) * self.scale

Karpathy wrote the buggy version. He reviewed it. He missed the bug. The agent found it in 48 hours of autonomous exploration.

This illustrates the value of automated research. The agent explores paths humans skip, and it is less anchored to a researcher’s assumptions about what “should” work.
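To see why a missing multiplier matters here: once q and k are unit-normalized, every score is a cosine similarity in [-1, 1], so the multiplier alone controls how peaked the softmax can become. The following standard-library sketch uses made-up scores and scale values purely for illustration; none of the numbers come from nanochat.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Cosine similarities between one query and four keys (illustrative values).
cosines = [1.0, 0.2, -0.3, -1.0]

for scale in (0.125, 1.0, 10.0):
    weights = softmax([scale * c for c in cosines])
    print(f"scale={scale:>6}: max attention weight = {max(weights):.3f}")
```

With a small scale the weights stay close to the uniform value of 0.25 even though one key matches the query perfectly. A larger scale lets the best-matching key dominate.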

Why Single Metrics Matter

I used to track multiple metrics: training loss, validation loss, accuracy, speed, memory usage. This created ambiguous tradeoffs.

multi-metric-problem.txt
My old approach:

Experiment A:
  - Training loss: 2.3 (better)
  - Validation loss: 2.5 (same)
  - Accuracy: 78% (better)
  - Speed: 12 hours (worse)
  - Memory: 8GB (better)

Decision: ??? (Is this good or bad?)

Autoresearch uses one metric: validation loss. This makes every decision clear.

single_metric_decision.py
# Clear, objective decision
if new_validation_loss < baseline_validation_loss:
    commit_change()   # SUCCESS
else:
    discard_change()  # FAILURE

The agent cannot effectively weigh competing objectives. Single metrics enable autonomous operation.

The Results

After 48 hours on a single H100 GPU:

autoresearch-results.txt
Total experiments: 700
Successful changes: 20 (2.9% success rate)
Performance improvement: 11%

Metric: "Time to GPT-2"
  - Before: 2.02 hours
  - After: 1.80 hours

Key discoveries:
  - QKNorm missing scalar multiplier (a bug Karpathy had missed)
  - Learning rate adjustments
  - Gradient accumulation optimization

A 2.9% success rate sounds low. But that’s 20 genuine improvements in 48 hours. A human researcher might find 1-2 in the same time.
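Both percentages follow directly from the raw figures in the results, which makes for a quick sanity check:

```python
experiments = 700
improvements = 20
hours_before = 2.02
hours_after = 1.80

success_rate = improvements / experiments
reduction = (hours_before - hours_after) / hours_before

print(f"Success rate: {success_rate:.1%}")          # 2.9%
print(f"Time-to-GPT-2 reduction: {reduction:.1%}")  # 10.9%
```

The 10.9% reduction is what the results round to 11%.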

Common Mistakes to Avoid

Mistake 1: Overcomplicating the Framework

The nanochat framework is intentionally only 630 lines. I initially wanted to use my full training pipeline (5000+ lines). That would likely have failed.

framework-complexity.txt
Complex framework problems:
  - More code = more failure modes
  - Agent struggles with large context
  - Debugging becomes harder
  - Changes have unpredictable side effects

Simple framework benefits:
  - Agent understands the full codebase
  - Changes are isolated
  - Failures are easy to diagnose
  - Iteration is faster

Mistake 2: Long Training Runs

I thought 5-minute runs were too short to show real improvements. But the results prove otherwise.

training-duration-wisdom.txt
Why short runs work:
  - Quick feedback enables more exploration
  - Trends visible in minutes, not hours
  - 5 minutes x 700 = many hypotheses tested
  - Successful changes compound

Why long runs fail:
  - Fewer experiments overall
  - Human patience required
  - Opportunity cost of waiting
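One way to keep every run inside the budget, whatever the agent changes in train.py, is to enforce the limit from outside the training process. This is a sketch under assumptions: `run_with_budget` is a hypothetical wrapper, and treating a timeout or crash as a failed experiment is my choice here, not something I know the original project does.

```python
import subprocess
import sys


def run_with_budget(cmd: list[str], budget_seconds: float) -> tuple[bool, str]:
    """Run cmd; return (finished_cleanly, captured_stdout)."""
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=budget_seconds,  # child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return False, ""
    return proc.returncode == 0, proc.stdout


# Stand-in commands; a real loop would pass something like
# ["python", "train.py"] with a budget of 300 seconds.
ok, out = run_with_budget([sys.executable, "-c", "print('val_loss 2.31')"], 30)
slow_ok, _ = run_with_budget(
    [sys.executable, "-c", "import time; time.sleep(5)"], 0.5
)
print(ok, out.strip(), slow_ok)
```

A run that overshoots the budget or exits with an error comes back as `False`, so the loop can discard it the same way it discards a change that did not improve the loss.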

Mistake 3: No Research Context

Without program.md, the agent explores blindly. I tried running autoresearch without context files. The agent made random changes that mostly failed.

research-context-value.txt
With program.md:
  - Agent focuses on promising areas
  - Avoids repeating failed attempts
  - Builds on successful changes
  - Progress is cumulative

Without program.md:
  - Agent tries random modifications
  - Repeats similar failed experiments
  - No cumulative learning
  - Wasted GPU time

How to Implement Your Own Loop

The basic structure is straightforward:

minimal_autoresearch.py
import subprocess

# evaluate_model, read_file, write_file, propose_change, commit,
# log_success and log_failure are placeholders you supply.


def minimal_autoresearch():
    baseline = evaluate_model()  # validation loss; lower is better
    while True:
        # 1. Get current state
        code = read_file("train.py")
        program = read_file("program.md")

        # 2. Propose change (your LLM API)
        new_code = propose_change(code, program)

        # 3. Test change
        write_file("train.py", new_code)
        subprocess.run(["python", "train.py", "--epochs", "1"])
        result = evaluate_model()

        # 4. Decide
        if result < baseline:
            commit("Improvement found")
            baseline = result
            log_success(new_code)
        else:
            # Restore the last committed version of train.py
            subprocess.run(["git", "checkout", "--", "train.py"], check=True)
            log_failure(new_code)

The hard part isn’t the loop code. It’s:

  1. A minimal, clean training codebase
  2. A fast evaluation method
  3. A clear research goal in program.md
  4. Patience to let it run
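For item 2, a cheap evaluation hook is to have train.py print its final validation loss and parse that line from the captured output. The log format below (`val_loss: <number>`) is an assumption for this sketch, and `parse_final_val_loss` is a hypothetical helper; adapt the pattern to whatever your script prints.

```python
import re
from typing import Optional

# Assumed log format: a line such as "step 500 | val_loss: 2.3145".
VAL_LOSS_PATTERN = re.compile(
    r"val_loss:\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)"
)


def parse_final_val_loss(log_text: str) -> Optional[float]:
    """Return the last validation loss found in the log, or None if absent."""
    matches = VAL_LOSS_PATTERN.findall(log_text)
    return float(matches[-1]) if matches else None


log = """step 100 | val_loss: 2.9012
step 500 | val_loss: 2.3145
"""
print(parse_final_val_loss(log))        # 2.3145
print(parse_final_val_loss("crashed"))  # None
```

Returning `None` when no loss is found (for example after a crash, or when the script printed `nan`) lets the loop treat the run as a failure instead of comparing against a stale number.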

Summary

Karpathy’s autoresearch proves AI agents can autonomously conduct meaningful ML research. The approach succeeds because:

  1. Simple framework - 630 lines, easy for agent to understand
  2. Single metric - Validation loss makes decisions binary
  3. Short runs - 5-minute windows enable rapid exploration
  4. Research context - program.md guides exploration
  5. Binary commit-or-discard - No subjective decisions

The results: 700 experiments, 20 improvements, 11% performance gain, one bug found that Karpathy missed.

If you’re bottlenecked by manual ML research, try this approach. Start small. Use a minimal codebase. Track one metric. Let it run overnight.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
