How Does Karpathy's Autoresearch Automated ML Loop Actually Work?
Problem
I spent months manually tuning my ML training pipeline. Each experiment took hours: read papers, implement changes, wait for training to finish, analyze results. The bottleneck wasn’t the code - it was me.
When I saw Andrej Karpathy’s autoresearch project, I realized what I was doing wrong. His approach ran 700 experiments in 48 hours on a single GPU. It found a bug he had missed. It achieved an 11% performance improvement without any human intervention.
This post explains how this automated ML research loop works and why it succeeds where manual approaches struggle.
What Is Autoresearch?
Autoresearch is an automated loop that:
- Reads a research brief (program.md)
- Proposes code changes to train.py
- Trains for exactly 5 minutes
- Validates the improvement
- Commits successful changes, discards failures
- Repeats forever
The entire system runs on a 630-line stripped-down LLM framework called nanochat. No complex infrastructure. No multiple metrics. Just a simple binary decision: did the validation loss improve?
The Loop Architecture
Here’s how the loop looks conceptually:
    +---------------------+
    | Read program.md     |
    | (research goal)     |
    +---------------------+
               |
               v
    +---------------------+
    | Propose code change |
    | to train.py         |
    +---------------------+
               |
               v
    +---------------------+
    | Train for 5 min     |
    | on single GPU       |
    +---------------------+
               |
               v
    +---------------------+
    | Measure validation  |
    | loss                |
    +---------------------+
               |
               v
         +-----------+
         | Improved? |
         +-----------+
           /       \
         YES        NO
          |          |
          v          v
     +--------+  +---------+
     | Commit |  | Discard |
     | change |  | revert  |
     +--------+  +---------+
          |          |
          +----+-----+
               |
               v
        (Repeat forever)

This simplicity is intentional. Karpathy chose a minimal 630-line codebase because complex frameworks introduce more failure modes. The agent cannot effectively navigate complexity.
The Core Loop Code
Here’s a simplified representation of the actual loop:
    import subprocess

    # read_file, write_file, call_llm, the git helpers, and describe_change
    # are placeholders for this illustration.

    def run_autoresearch_loop():
        while True:
            # Step 1: Read research brief
            program = read_file("program.md")
            current_code = read_file("train.py")

            # Step 2: Agent proposes code change
            proposed_code = llm_propose_improvement(
                program_brief=program,
                current_code=current_code,
                goal="Reduce validation loss",
            )

            # Step 3: Apply change temporarily
            write_file("train.py", proposed_code)

            # Step 4: Train for exactly 5 minutes
            baseline_loss = get_current_validation_loss()  # best loss so far
            subprocess.run(["python", "train.py", "--duration", "5m"])
            new_loss = evaluate_validation_loss()

            # Step 5: Binary decision - commit or discard
            if new_loss < baseline_loss:
                git_commit(f"Improved: {new_loss:.4f} < {baseline_loss:.4f}")
                update_program_md(f"SUCCESS: {describe_change()}")
            else:
                git_reset_hard()  # Discard changes
                update_program_md(f"FAILED: {describe_change()}")

    def llm_propose_improvement(program_brief, current_code, goal):
        """LLM agent reads context and proposes a specific change"""
        prompt = f"""
        Research Goal: {program_brief}

        Current train.py: {current_code}

        Propose ONE specific change. Goal: {goal}.
        Output the modified train.py only.
        """
        return call_llm(prompt)

The key insight: every decision is binary. Either the loss improved, or it didn’t. No subjective judgment. No weighing multiple metrics.
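The commit-or-discard rule amounts to a greedy hill climb: a candidate survives only when it strictly beats the best score so far. The sketch below simulates that rule on a toy problem, with no LLM or GPU involved. The names `greedy_search`, `toy_propose`, and `toy_loss` are hypothetical, and the quadratic "loss" merely stands in for a real training run.

```python
import random

def greedy_search(initial, propose, score, n_trials, seed=0):
    """Keep a candidate only if it strictly lowers the score (hypothetical helper)."""
    rng = random.Random(seed)
    best, best_score = initial, score(initial)
    accepted = 0
    for _ in range(n_trials):
        candidate = propose(best, rng)
        candidate_score = score(candidate)
        if candidate_score < best_score:   # the same binary test as the loop
            best, best_score = candidate, candidate_score
            accepted += 1                  # "commit"
        # otherwise the candidate is simply dropped ("discard")
    return best, best_score, accepted

def toy_loss(x):
    # Stand-in for validation loss; the minimum is at x = 3.
    return (x - 3.0) ** 2

def toy_propose(x, rng):
    # Stand-in for the agent: a small random tweak to the current best.
    return x + rng.uniform(-0.5, 0.5)

if __name__ == "__main__":
    best, loss, accepted = greedy_search(0.0, toy_propose, toy_loss, n_trials=700)
    print(f"best={best:.3f} loss={loss:.5f} accepted={accepted}/700")
```

Most proposals are rejected once the search gets close to the optimum, which mirrors the low acceptance rate reported later in the post.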
Why 5-Minute Training Windows?
I initially thought longer training runs would give better results. But Karpathy’s approach proved me wrong.
LONG TRAINING (my old approach):
- 12 hours per experiment
- ~2 experiments per day
- Subject to daily interruptions
- Result: 14 experiments in 1 week

SHORT TRAINING (autoresearch approach):
- 5 minutes per experiment
- 700 experiments in 48 hours
- No human needed
- Result: 20 successful improvements found

Short windows enable rapid iteration. The agent can explore many hypotheses quickly. Most fail, but the successful ones accumulate.
This is the exploration vs. exploitation tradeoff. Short runs favor exploration. The agent tries many approaches, discarding failures quickly.
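One way to enforce a fixed training window is to check a wall-clock budget on every step. This is a minimal sketch and not necessarily how train.py does it; `train_with_budget`, `step_fn`, and the injectable `clock` are hypothetical names, and the budget is given in seconds.

```python
import time

def train_with_budget(step_fn, budget_seconds, clock=time.monotonic):
    """Call step_fn() repeatedly until the wall-clock budget is spent.

    Returns the number of completed steps. `clock` is injectable so the
    loop can be tested without actually waiting.
    """
    start = clock()
    steps = 0
    while clock() - start < budget_seconds:
        step_fn()
        steps += 1
    return steps

if __name__ == "__main__":
    # Tiny demo: a 0.05 s budget with a "training step" that sleeps 0.01 s.
    n = train_with_budget(lambda: time.sleep(0.01), budget_seconds=0.05)
    print(f"completed {n} steps")
```

Because every experiment gets the same budget, a change that makes each step cheaper is rewarded with more steps inside the window, so speed and quality are both folded into the single end-of-run loss.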
The Research Brief (program.md)
The program.md file acts as persistent research memory. Without it, the agent explores randomly. With it, the agent focuses on promising directions.
    # Research Brief: Improve GPT-2 Training Speed

    ## Goal
    Reduce "Time to GPT-2" - the time to reach GPT-2 validation loss baseline

    ## Current Status
    - Baseline: 2.02 hours
    - Best achieved: 1.80 hours (after 48 hours of autoresearch)

    ## Successful Changes (DO NOT REMOVE)
    - QKNorm: Added missing scaler multiplier (major improvement)
    - Learning rate: Increased from 3e-4 to 6e-4
    - Gradient accumulation: Reduced from 8 to 4

    ## Failed Attempts (AVOID REPEATING)
    - Weight decay adjustments (no improvement)
    - Dropout addition (worse loss)
    - AdamW epsilon changes (no improvement)

    ## Promising Directions (EXPLORE THESE)
    - Check attention layer optimization
    - Review normalization implementations
    - Consider batch size adjustments

This file guides the agent. It shows what worked, what failed, and where to look next. After each experiment, the file updates with the result.
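Updating the brief after each experiment can be as simple as appending a bullet to the right section. The helper below is a sketch under the assumption that sections start with `## ` headings, as in the sample brief; `record_result` is a hypothetical name, not a function from the repository.

```python
def record_result(markdown, heading, entry):
    """Append `- entry` to the end of the section titled `heading`.

    Sections are assumed to start with '## ' lines. If the heading is
    missing, a new section is added at the end of the document.
    """
    lines = markdown.splitlines()
    target = f"## {heading}"
    try:
        start = next(i for i, line in enumerate(lines) if line.strip() == target)
    except StopIteration:
        return "\n".join(lines + ["", target, f"- {entry}"]) + "\n"

    # The section ends at the next '## ' heading or at the end of the file.
    end = len(lines)
    for i in range(start + 1, len(lines)):
        if lines[i].startswith("## "):
            end = i
            break

    # Step back over trailing blank lines so the bullet joins the list.
    insert_at = end
    while insert_at > start + 1 and not lines[insert_at - 1].strip():
        insert_at -= 1

    lines.insert(insert_at, f"- {entry}")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    brief = (
        "# Research Brief\n\n"
        "## Failed Attempts (AVOID REPEATING)\n"
        "- Dropout addition (worse loss)\n"
    )
    print(record_result(brief, "Failed Attempts (AVOID REPEATING)",
                        "Weight decay adjustments (no improvement)"))
```

Keeping the update purely textual means the brief stays readable by both the agent and a human skimming the history.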
The Bug That Karpathy Missed
The most striking result: the agent found a bug Karpathy himself had written and missed.
    import torch.nn as nn

    # Before (bug - missing scaler multiplier)
    class QKNorm(nn.Module):
        """Query-Key normalization for attention"""

        def forward(self, q, k):
            # Normalize query and key vectors
            q = q / q.norm(dim=-1, keepdim=True)
            k = k / k.norm(dim=-1, keepdim=True)

            # Compute attention scores
            return q @ k.transpose(-2, -1)  # BUG: missing scale!

    # After (fixed by autoresearch agent)
    class QKNorm(nn.Module):
        """Query-Key normalization for attention"""

        def __init__(self, dim):
            super().__init__()
            self.scale = dim ** -0.5  # Standard attention scaling

        def forward(self, q, k):
            q = q / q.norm(dim=-1, keepdim=True)
            k = k / k.norm(dim=-1, keepdim=True)

            # Now correctly scaled
            return (q @ k.transpose(-2, -1)) * self.scale

Karpathy wrote the buggy version. He reviewed it. He missed the bug. The agent found it in 48 hours of autonomous exploration.
This illustrates the value of automated research. The agent explores paths humans skip. It doesn’t have human biases about what “should” work.
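To see why a multiplier matters after normalization: once q and k are unit vectors, every score is a cosine similarity in [-1, 1], so the multiplier alone decides how peaked the softmax can become. The sketch below uses only the standard library. The helper names, the 1024-key context, and the scale values are assumptions chosen for illustration, not values from the repository.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_attention_weight(n_keys, scale):
    """Largest weight a single key can receive when scores are cosine similarities.

    Best case for one key: its similarity is +1 and every other key's is -1.
    """
    logits = [scale * 1.0] + [scale * -1.0] * (n_keys - 1)
    return softmax(logits)[0]

if __name__ == "__main__":
    for scale in (0.125, 1.0, 8.0):
        w = max_attention_weight(n_keys=1024, scale=scale)
        print(f"scale={scale:>5}: max weight over 1024 keys = {w:.4f}")
```

Larger multipliers allow sharper attention and smaller ones flatten it, which is why the choice of scale on normalized scores has a visible effect on the loss.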
Why Single Metrics Matter
I used to track multiple metrics: training loss, validation loss, accuracy, speed, memory usage. This created ambiguous tradeoffs.
My old approach:
Experiment A:
- Training loss: 2.3 (better)
- Validation loss: 2.5 (same)
- Accuracy: 78% (better)
- Speed: 12 hours (worse)
- Memory: 8GB (better)

Decision: ??? (Is this good or bad?)

Autoresearch uses one metric: validation loss. This makes every decision clear.

    # Clear, objective decision
    if new_validation_loss < baseline_validation_loss:
        commit_change()   # SUCCESS
    else:
        discard_change()  # FAILURE

The agent cannot effectively weigh competing objectives. Single metrics enable autonomous operation.
The Results
After 48 hours on a single H100 GPU:
Total experiments: 700
Successful changes: 20 (2.9% success rate)
Performance improvement: 11%

Metric: "Time to GPT-2"
- Before: 2.02 hours
- After: 1.80 hours

Key discoveries:
- QKNorm missing scaler (found bug Karpathy missed)
- Learning rate adjustments
- Gradient accumulation optimization

A 2.9% success rate sounds low. But that’s 20 genuine improvements in 48 hours. A human researcher might find 1-2 in the same time.
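The headline percentages follow directly from the figures quoted above, and the arithmetic is easy to reproduce. In the sketch below, `summarize_results` is a hypothetical helper and all inputs are the numbers from this post. The last two values are a consistency check rather than reported results: at a strict 5 minutes per run, 48 hours on one GPU fits 576 back-to-back runs, so the 700 and 48-hour figures are best read as approximate.

```python
def summarize_results(experiments=700, successes=20,
                      before_hours=2.02, after_hours=1.80,
                      run_minutes=5, wall_clock_hours=48):
    """Recompute the summary figures from the numbers quoted in the post."""
    return {
        # Share of experiments that were kept.
        "success_rate": successes / experiments,
        # Relative reduction in "Time to GPT-2".
        "improvement": (before_hours - after_hours) / before_hours,
        # How many strictly sequential runs fit in the wall-clock window.
        "max_sequential_runs": wall_clock_hours * 60 // run_minutes,
        # Wall-clock hours needed if every run used its full budget.
        "hours_for_all_runs": experiments * run_minutes / 60,
    }

if __name__ == "__main__":
    for name, value in summarize_results().items():
        print(f"{name}: {value:.4g}")
```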
Common Mistakes to Avoid
Mistake 1: Overcomplicating the Framework
The nanochat framework is 630 lines intentionally. I initially wanted to use my full training pipeline (5000+ lines). That would likely have failed.
Complex framework problems:
- More code = more failure modes
- Agent struggles with large context
- Debugging becomes harder
- Changes have unpredictable side effects

Simple framework benefits:
- Agent understands the full codebase
- Changes are isolated
- Failures are easy to diagnose
- Iteration is faster

Mistake 2: Long Training Runs
I thought 5-minute runs were too short to show real improvements. But the results prove otherwise.
Why short runs work:
- Quick feedback enables more exploration
- Trends visible in minutes, not hours
- 5 minutes x 700 = many hypotheses tested
- Successful changes compound

Why long runs fail:
- Fewer experiments overall
- Human patience required
- Opportunity cost of waiting

Mistake 3: No Research Context
Without program.md, the agent explores blindly. I tried running autoresearch without context files. The agent made random changes that mostly failed.
With program.md:
- Agent focuses on promising areas
- Avoids repeating failed attempts
- Builds on successful changes
- Progress is cumulative

Without program.md:
- Agent tries random modifications
- Repeats similar failed experiments
- No cumulative learning
- Wasted GPU time

How to Implement Your Own Loop
The basic structure is straightforward:
    import subprocess

    # read_file, write_file, propose_change, evaluate_model, commit,
    # log_success, and log_failure are placeholders for your own helpers.

    def minimal_autoresearch():
        baseline = evaluate_model()

        while True:
            # 1. Get current state
            code = read_file("train.py")
            program = read_file("program.md")

            # 2. Propose change (your LLM API)
            new_code = propose_change(code, program)

            # 3. Test change
            write_file("train.py", new_code)
            subprocess.run(["python", "train.py", "--epochs", "1"])
            result = evaluate_model()

            # 4. Decide
            if result < baseline:
                commit("Improvement found")
                baseline = result
                log_success(new_code)
            else:
                # Revert the working copy of train.py
                subprocess.run(["git", "checkout", "--", "train.py"], check=True)
                log_failure(new_code)

The hard part isn’t the loop code. It’s:
- A minimal, clean training codebase
- A fast evaluation method
- A clear research goal in program.md
- Patience to let it run
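For the "fast evaluation method" item, one workable pattern is to run the training script as a subprocess with a hard timeout and read the metric from its output. The sketch below assumes the script prints lines such as `val_loss: 2.3456`; that log format and the names `parse_val_loss` and `run_and_evaluate` are assumptions for illustration, not part of autoresearch.

```python
import re
import subprocess
import sys

# Assumed log format: the script prints a line such as "val_loss: 2.3456".
VAL_LOSS_RE = re.compile(r"val_loss:\s*([0-9]*\.?[0-9]+)")

def parse_val_loss(output):
    """Return the last reported validation loss, or None if none was printed."""
    matches = VAL_LOSS_RE.findall(output)
    return float(matches[-1]) if matches else None

def run_and_evaluate(cmd, timeout_seconds):
    """Run a training command; crashes and timeouts count as failed experiments."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None
    return parse_val_loss(proc.stdout)

if __name__ == "__main__":
    # A stand-in "training script" that just prints two validation losses.
    fake_training = [
        sys.executable, "-c",
        "print('step 10 val_loss: 2.5'); print('step 20 val_loss: 2.25')",
    ]
    print(run_and_evaluate(fake_training, timeout_seconds=30))
```

A `None` result should be treated as a failed experiment before any comparison against the baseline, since a proposed change can easily break the script outright.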
Summary
Karpathy’s autoresearch proves AI agents can autonomously conduct meaningful ML research. The approach succeeds because:
- Simple framework - 630 lines, easy for agent to understand
- Single metric - Validation loss makes decisions binary
- Short runs - 5-minute windows enable rapid exploration
- Research context - program.md guides exploration
- Binary commit-or-discard - No subjective decisions
The results: 700 experiments, 20 improvements, 11% performance gain, one bug found that Karpathy missed.
If you’re bottlenecked by manual ML research, try this approach. Start small. Use a minimal codebase. Track one metric. Let it run overnight.
Final Words + More Resources
My intention with this article was to share my knowledge and experience with others. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 GitHub: karpathy/autoresearch
- 👨💻 nanoGPT repository
- 👨💻 Andrej Karpathy's YouTube ML tutorials
- 👨💻 Reddit discussion on autoresearch
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!