How Does Karpathy's Autoresearch Autonomous AI Agent Work?

Problem

Traditional ML research takes forever. I spend days designing experiments, waiting for results, analyzing logs, then deciding what to try next. Each experiment needs manual intervention - I have to start it, monitor it, and interpret the results.

Here’s what a typical research day looks like:

Manual Research Cycle
8:00 AM - Design experiment (modify hyperparameters)
9:00 AM - Start training run
12:00 PM - Training finishes, check results
12:30 PM - Analyze loss curves, decide next experiment
1:00 PM - Modify code, start new run
...
6:00 PM - End of day, maybe 2-3 experiments completed

That's maybe 3 experiments in a whole day, and I have to be there the entire time.

What happened?

Karpathy released autoresearch, a small repo that lets an AI coding agent run ML experiments autonomously. The key insight is simple: give the agent a small but real LLM training setup and let it experiment overnight.

I looked at the repo and found something surprising. The entire system is just 4 files:

autoresearch File Structure
prepare.py — fixed constants, data prep (do not modify)
train.py — model, optimizer, training loop (agent edits this)
program.md — agent instructions (human edits this)
pyproject.toml — dependencies

That’s it. No complex agent framework, no distributed system, just a markdown file that tells the agent what to do.

How it works

The system has three components that work together.

1. The program.md Pattern

The entire research strategy lives in program.md. This is a markdown file that the AI agent reads and interprets.

Here’s a simplified version:

program.md (simplified)
# autoresearch
## Setup
1. Agree on a run tag (e.g. `mar5`)
2. Create the branch: `git checkout -b autoresearch/mar5`
3. Read the in-scope files: README.md, prepare.py, train.py
4. Verify data exists
5. Initialize results.tsv
## Experimentation
Each experiment runs for 5 minutes. You launch it as:
`uv run train.py`
**What you CAN do:**
- Modify train.py — model architecture, optimizer, hyperparameters
**What you CANNOT do:**
- Modify prepare.py (read-only)
- Install new packages
**The goal: get the lowest val_bpb.**
## The experiment loop
LOOP FOREVER:
1. Tune train.py with an experimental idea
2. git commit
3. Run: `uv run train.py > run.log 2>&1`
4. Read results: `grep "^val_bpb:" run.log`
5. Record in results.tsv
6. If improved: keep the commit
7. If worse: git reset back
**NEVER STOP:** Continue until human interrupts.

The agent reads this file and executes it like code. The markdown becomes the “program.”
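
Step 4 of the loop pulls the metric out of the log with grep. For anyone who wants to post-process logs outside the agent, here is a minimal Python equivalent. The function name `read_val_bpb` is my own, and I'm assuming the log contains a line of the form `val_bpb: <number>`, as in the grep pattern above.

```python
import re
from typing import Optional

# Same anchor as the grep in step 4: a line that starts with "val_bpb:".
VAL_BPB_LINE = re.compile(r"^val_bpb:[ \t]*([0-9]+(?:\.[0-9]+)?)[ \t]*$", re.MULTILINE)


def read_val_bpb(log_text: str) -> Optional[float]:
    """Return the last val_bpb found in the log, or None if none was printed."""
    matches = VAL_BPB_LINE.findall(log_text)
    return float(matches[-1]) if matches else None
```

Returning None when the line is missing gives the caller a clean way to treat a run that died before printing its summary as a crash.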

2. The 7-Step Experiment Loop

The agent follows a simple loop. I counted 7 steps:

Experiment Loop Flow
Step 1: Read current git state
Step 2: Modify train.py with an idea
Step 3: Commit the change
Step 4: Run experiment (5 minutes)
Step 5: Extract val_bpb from log
Step 6: Log to results.tsv
Step 7: Keep if better, reset if worse
→ Back to Step 1
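
The decision logic in this loop is a greedy hill climb. A rough sketch of it in Python, with the actual edit, commit, and training run hidden behind a hypothetical `run_experiment` callable (that name and `run_loop` are mine, not from the repo):

```python
from typing import Callable, Iterable, List, Optional, Tuple

Attempt = Tuple[str, Optional[float], str]  # (idea, val_bpb, status)


def run_loop(
    ideas: Iterable[str],
    run_experiment: Callable[[str], Optional[float]],
    baseline: float,
) -> Tuple[float, List[Attempt]]:
    """Greedy keep-or-reset loop: an idea survives only if it lowers val_bpb."""
    best = baseline
    history: List[Attempt] = []
    for idea in ideas:
        val_bpb = run_experiment(idea)  # steps 2-5: edit, commit, run, extract
        if val_bpb is None:
            status = "crash"            # no metric in the log, e.g. OOM
        elif val_bpb < best:
            status = "keep"             # step 7: commit stays as the new base
            best = val_bpb
        else:
            status = "discard"          # step 7: reset to the previous commit
        history.append((idea, val_bpb, status))  # step 6: log the attempt
    return best, history
```

The real agent never stops on its own. This sketch ends when `ideas` runs out only so that it can be run and tested.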

Here’s how the agent logs results:

results.tsv format
commit val_bpb memory_gb status description
a1b2c3d 0.997900 44.0 keep baseline
b2c3d4e 0.993200 44.2 keep increase LR to 0.04
c3d4e5f 1.005000 44.0 discard switch to GeLU
d4e5f6a 0.000000 0.0 crash double width (OOM)

The agent tracks everything: commit hash, metric, memory, status, description.
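
If I wanted to write rows in this format from a script instead of letting the agent do it, a minimal sketch with Python's csv module could look like this. `append_result` and `COLUMNS` are names I made up. The column order and number formatting follow the example above.

```python
import csv
import io
from typing import TextIO

COLUMNS = ["commit", "val_bpb", "memory_gb", "status", "description"]


def append_result(out: TextIO, commit: str, val_bpb: float, memory_gb: float,
                  status: str, description: str, write_header: bool = False) -> None:
    """Write one tab-separated row (and optionally the header) to an open file."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    if write_header:
        writer.writerow(COLUMNS)
    writer.writerow([commit, f"{val_bpb:.6f}", f"{memory_gb:.1f}", status, description])


# Usage: an in-memory buffer stands in for open("results.tsv", "a", newline="").
buf = io.StringIO()
append_result(buf, "a1b2c3d", 0.9979, 44.0, "keep", "baseline", write_header=True)
print(buf.getvalue(), end="")
```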

3. The Single-GPU Training Setup

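The training runs on a single GPU with a fixed 5-minute training budget. This fixed time budget is key: every experiment gets the same amount of compute, so val_bpb values stay comparable from run to run.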

train.py (key hyperparameters)
# Model architecture
DEPTH = 8 # number of transformer layers
ASPECT_RATIO = 64 # model_dim = depth * ASPECT_RATIO
HEAD_DIM = 128 # target head dimension
# Optimization
TOTAL_BATCH_SIZE = 2**19 # ~524K tokens per step
MATRIX_LR = 0.04 # Muon learning rate
WEIGHT_DECAY = 0.2 # cautious weight decay
# Device settings
DEVICE_BATCH_SIZE = 128 # reduce if OOM
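
The comments on these constants imply a couple of derived shapes. Here is a small sketch of that arithmetic. Only `model_dim = depth * ASPECT_RATIO` comes from the comment above. Computing the head count as `model_dim // HEAD_DIM` is my assumption about what "target head dimension" means, and `derived_shapes` is a hypothetical helper, not a function from train.py.

```python
DEPTH = 8
ASPECT_RATIO = 64
HEAD_DIM = 128
TOTAL_BATCH_SIZE = 2**19  # 524,288 tokens per optimizer step


def derived_shapes(depth: int, aspect_ratio: int, head_dim: int) -> dict:
    """Derive model width and head count from the three architecture knobs."""
    model_dim = depth * aspect_ratio
    if model_dim % head_dim != 0:
        # ASSUMPTION: this sketch simply rejects widths that do not split
        # evenly into heads; the real script may handle this differently.
        raise ValueError(f"model_dim {model_dim} is not a multiple of {head_dim}")
    return {"model_dim": model_dim, "num_heads": model_dim // head_dim}


shapes = derived_shapes(DEPTH, ASPECT_RATIO, HEAD_DIM)
print(shapes)
```

Because the width is tied to the depth, changing DEPTH alone also changes the model width. Under these assumptions, going from depth 8 to depth 10 moves model_dim from 512 to 640.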

After 5 minutes, the script prints:

Training output example
---
val_bpb: 0.997900
training_seconds: 300.1
peak_vram_mb: 45060.2
mfu_percent: 39.80
total_tokens_M: 499.6
num_steps: 953
num_params_M: 50.3
depth: 8

The metric val_bpb (validation bits per byte) is what the agent optimizes. Lower is better.
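
For readers who haven't met bits per byte before, here is a sketch of the usual definition: total cross-entropy converted from nats to bits, divided by the number of bytes of text that was scored. I'm assuming the standard formula. I haven't shown how prepare.py actually computes it, and `bits_per_byte` is my own name.

```python
import math


def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats) over a text into bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nats / (math.log(2) * total_bytes)
```

Normalizing by bytes instead of tokens means the number doesn't depend on the tokenizer, which helps when comparing runs that change the model in different ways.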

The reason

Why does this design work? I see three reasons.

Experiment Velocity

With 5-minute experiments, the agent runs at most 12 experiments per hour (60 / 5). Over 8 hours of sleep, that’s up to 96 experiments, or roughly 100 per night.

Experiment Velocity Comparison
Manual research: 3 experiments/day
Autoresearch: 100 experiments/night
Improvement factor: 33x

The agent explores far more ideas than I could manually.
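
The velocity numbers are easy to recompute. This sketch also takes a per-experiment overhead (startup, evaluation, the agent thinking), which is a parameter I added. The 1-minute value in the usage below is purely illustrative, not a measurement.

```python
def experiments_per_night(hours: float, train_minutes: float,
                          overhead_minutes: float = 0.0) -> int:
    """How many whole experiments fit into the given number of hours."""
    minutes_per_experiment = train_minutes + overhead_minutes
    return int(hours * 60 // minutes_per_experiment)


print(experiments_per_night(8, 5))                      # no overhead
print(experiments_per_night(8, 5, overhead_minutes=1))  # hypothetical 1 min overhead
```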

Cost Efficiency

Single GPU, fixed 5 minutes, no distributed training. The cost is predictable and low.

Cost calculation
# H100 GPU rental: ~$3/hour
# 5-minute experiment: ~$0.25
# 100 experiments overnight: ~$25

Compare that to a researcher’s daily salary. The economics work.
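
The same cost arithmetic as a tiny function, in case you want to plug in your own GPU price. The $3/hour rate is the rough figure from above, not a quote, and `experiment_cost` is a name I made up.

```python
def experiment_cost(gpu_dollars_per_hour: float, minutes: float) -> float:
    """Rental cost of one experiment that occupies the GPU for `minutes`."""
    return gpu_dollars_per_hour * minutes / 60


per_run = experiment_cost(3.0, 5)
print(per_run, 100 * per_run)
```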

Git-Integrated Version Control

Every experiment is tracked in git. The agent commits before each run, keeps good results, discards bad ones.

Git workflow in the loop
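# Before the experiment
git add train.py
git commit -m "experiment: increase learning rate"
# After the experiment (val_bpb_improved is a placeholder set to true or false)
if [ "$val_bpb_improved" = "true" ]; then
    echo "keeping commit"    # keep the commit, continue from here
else
    git reset --hard HEAD~1  # reset back to the previous state
fi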

I can review the entire experiment history when I wake up.
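
Here is how I'd sketch the same workflow in Python. The git subcommands are the ones from the workflow above. The function names and the injectable `run` parameter are my own additions, there so the logic can be tested without touching a real repository.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]


def _run_git(args: Sequence[str]) -> None:
    """Run one git subcommand in the current directory; raise if it fails."""
    subprocess.run(["git", *args], check=True)


def commit_experiment(message: str, run: Runner = _run_git) -> None:
    """Stage train.py and commit it before the training run starts."""
    run(["add", "train.py"])
    run(["commit", "-m", message])


def finish_experiment(improved: bool, run: Runner = _run_git) -> None:
    """Keep the commit if val_bpb improved, otherwise drop it."""
    if not improved:
        run(["reset", "--hard", "HEAD~1"])
```

One detail worth knowing: `git reset --hard` leaves untracked files alone, so a results.tsv that is never added to git survives the discarded experiments.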

Common Mistakes

I looked at Reddit discussions and found common mistakes people make:

| Mistake | What Goes Wrong | Fix |
| --- | --- | --- |
| Modifying prepare.py | Agent breaks evaluation harness | Keep prepare.py read-only |
| Changing time budget | Experiments become incomparable | Keep fixed 5-minute budget |
| Installing packages | Bloated dependencies | Use only pyproject.toml packages |
| Stopping for feedback | Agent waits forever | Remove “should I continue?” prompts |
| Complex program.md | Agent confused by instructions | Keep instructions simple and direct |

The simplicity is deliberate. The agent only touches one file.
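
The one-file rule is easy to check mechanically. As a sketch, assuming you feed it the list of changed paths (for example the output of `git diff --name-only HEAD~1`), a hypothetical helper like `out_of_scope` could flag any experiment that touched something other than train.py:

```python
from typing import FrozenSet, Iterable, List

ALLOWED: FrozenSet[str] = frozenset({"train.py"})


def out_of_scope(changed_paths: Iterable[str],
                 allowed: FrozenSet[str] = ALLOWED) -> List[str]:
    """Return the changed paths the agent was not supposed to touch, sorted."""
    return sorted(path for path in changed_paths if path not in allowed)
```

An empty list means the experiment stayed inside the rules. Anything else, such as prepare.py or pyproject.toml, is a reason to discard the run regardless of its val_bpb.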

program.md Template

Here’s a template I’d use for my own autoresearch:

my_program.md template
# autoresearch
## Goal
Minimize val_bpb in 5-minute experiments.
## Constraints
- Only modify train.py
- Fixed 5-minute time budget
- Use existing packages only
## Strategy Ideas to Try
1. Adjust learning rates
2. Change model depth/width
3. Modify optimizer settings
4. Try different activation patterns
5. Adjust batch sizes
## Evaluation
- Lower val_bpb = keep
- Equal or higher = discard
- Crash = log and move on
## Output
Log to results.tsv (tab-separated):
commit | val_bpb | memory_gb | status | description
## Loop
Modify → Commit → Run → Evaluate → Keep/Discard → Repeat

Running the Agent

To run autoresearch, I do:

Terminal
# 1. Clone and setup
git clone https://github.com/karpathy/autoresearch
cd autoresearch
uv sync
# 2. Prepare data (one-time)
uv run prepare.py
# 3. Start the agent
# Point Claude/Codex at the repo with this prompt:

Agent prompt
Hi, have a look at program.md and let's kick off a new experiment!
Let's do the setup first.

The agent reads program.md, sets up the branch, and starts running experiments.

What Happens Overnight

When I wake up, I check results.tsv:

Morning results.tsv
commit val_bpb memory_gb status description
a1b2c3d 1.002000 44.0 keep baseline run
b2c3d4e 0.998500 44.0 keep reduce weight decay
c3d4e5f 0.995100 44.2 keep increase depth to 10
d4e5f6a 0.992800 45.0 keep add warmup schedule
...

I see what the agent tried, what worked, what failed. The git history shows the progression.
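
With a hundred rows, I'd rather not read the file by eye. A minimal sketch for summarizing it, assuming the file is tab-separated with the header shown above (`summarize` is a hypothetical helper):

```python
import csv
import io
from collections import Counter
from typing import Dict, Optional, Tuple


def summarize(tsv_text: str) -> Tuple[Dict[str, int], Optional[Dict[str, str]]]:
    """Count rows per status and return the kept row with the lowest val_bpb."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    counts = dict(Counter(row["status"] for row in rows))
    kept = [row for row in rows if row["status"] == "keep"]
    best = min(kept, key=lambda row: float(row["val_bpb"])) if kept else None
    return counts, best
```

In practice I'd call it as `summarize(open("results.tsv").read())`. Crashed rows are logged with val_bpb 0.000000, so filtering on the keep status before taking the minimum matters.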

Summary

Karpathy’s autoresearch works by giving an AI a small but real LLM training setup and letting it experiment autonomously. The key innovation is the program.md pattern - the entire research strategy lives in a markdown file that agents interpret and execute.

The system has three components: the program.md pattern (markdown as code), the 7-step experiment loop (read state, modify, commit, run, extract the metric, log, keep or reset), and the single-GPU fixed-time setup (5 minutes, one file).

I can run ~100 experiments overnight while I sleep, with full git tracking of every attempt. The simplicity is the power - just 4 files, one editable by the agent, one editable by the human.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
