How Does Karpathy's Autoresearch Autonomous AI Agent Work?
Problem
Traditional ML research takes forever. I spend days designing experiments, waiting for results, analyzing logs, then deciding what to try next. Each experiment needs manual intervention - I have to start it, monitor it, and interpret the results.
Here’s what a typical research day looks like:
8:00 AM - Design experiment (modify hyperparameters)
9:00 AM - Start training run
12:00 PM - Training finishes, check results
12:30 PM - Analyze loss curves, decide next experiment
1:00 PM - Modify code, start new run
...
6:00 PM - End of day, maybe 2-3 experiments completed

I got maybe 3 experiments done in a whole day. And I had to be there the entire time.
What happened?
Karpathy released autoresearch - an AI agent that runs ML experiments autonomously. The key insight is simple: give an AI agent a small but real LLM training setup and let it experiment overnight.
I looked at the repo and found something surprising. The entire system is just 4 files:
prepare.py - fixed constants, data prep (do not modify)
train.py - model, optimizer, training loop (agent edits this)
program.md - agent instructions (human edits this)
pyproject.toml - dependencies

That’s it. No complex agent framework, no distributed system, just a markdown file that tells the agent what to do.
How it works
The system has three components that work together.
1. The program.md Pattern
The entire research strategy lives in program.md. This is a markdown file that the AI agent reads and interprets.
Here’s a simplified version:
# autoresearch

## Setup
1. Agree on a run tag (e.g. `mar5`)
2. Create the branch: `git checkout -b autoresearch/mar5`
3. Read the in-scope files: README.md, prepare.py, train.py
4. Verify data exists
5. Initialize results.tsv

## Experimentation
Each experiment runs for 5 minutes. You launch it as:
`uv run train.py`

**What you CAN do:**
- Modify train.py - model architecture, optimizer, hyperparameters

**What you CANNOT do:**
- Modify prepare.py (read-only)
- Install new packages

**The goal: get the lowest val_bpb.**

## The experiment loop
LOOP FOREVER:
1. Tune train.py with an experimental idea
2. git commit
3. Run: `uv run train.py > run.log 2>&1`
4. Read results: `grep "^val_bpb:" run.log`
5. Record in results.tsv
6. If improved: keep the commit
7. If worse: git reset back

**NEVER STOP:** Continue until human interrupts.

The agent reads this file and executes it like code. The markdown becomes the “program.”
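To make steps 3 and 4 of the loop concrete, here is a minimal Python sketch of "run the training command, capture the log, pull out val_bpb". In the real setup the agent does this through shell commands, so this is only my illustration. The function name `run_experiment` and the `timeout_s` parameter are hypothetical and not part of the repo.

```python
import re
import subprocess
import sys
from pathlib import Path

# Same pattern as `grep "^val_bpb:" run.log`
VAL_BPB_RE = re.compile(r"^val_bpb:\s*([0-9.]+)", re.MULTILINE)


def run_experiment(cmd, log_path="run.log", timeout_s=900):
    """Run one training command, capture its output, return val_bpb or None.

    `timeout_s` is an assumed safety limit, not something program.md defines.
    """
    try:
        with open(log_path, "w") as log:
            # Equivalent of `cmd > run.log 2>&1`
            subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    match = VAL_BPB_RE.search(Path(log_path).read_text())
    # No val_bpb line means the run crashed before printing its summary
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    # Stand-in command so the sketch runs without a GPU;
    # the real command would be ["uv", "run", "train.py"].
    fake = [sys.executable, "-c", "print('---'); print('val_bpb: 0.997900')"]
    print(run_experiment(fake))
```

Returning `None` when no `val_bpb:` line is found gives the caller one simple signal for a crashed run.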
2. The 7-Step Experiment Loop
The agent follows a simple loop. I counted 7 steps:
Step 1: Read current git state
Step 2: Modify train.py with an idea
Step 3: Commit the change
Step 4: Run experiment (5 minutes)
Step 5: Extract val_bpb from log
Step 6: Log to results.tsv
Step 7: Keep if better, reset if worse → Back to Step 1

Here’s how the agent logs results:
commit   val_bpb   memory_gb  status   description
a1b2c3d  0.997900  44.0       keep     baseline
b2c3d4e  0.993200  44.2       keep     increase LR to 0.04
c3d4e5f  1.005000  44.0       discard  switch to GeLU
d4e5f6g  0.000000  0.0        crash    double width (OOM)

The agent tracks everything: commit hash, metric, memory, status, description.
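The keep/discard/crash decision and the row format can be sketched in a few lines of Python. This is my own illustration of the logic, not code from the repo. The helper names `classify` and `tsv_row` are hypothetical, and treating a tie as "discard" is an assumption.

```python
def classify(val_bpb, best_so_far):
    """Return 'crash', 'keep', or 'discard' for one finished run."""
    if val_bpb is None:                      # no metric in the log
        return "crash"
    if best_so_far is None or val_bpb < best_so_far:
        return "keep"                        # strictly lower val_bpb wins
    return "discard"                         # assumption: ties are discarded


def tsv_row(commit, val_bpb, memory_gb, status, description):
    """Format one results.tsv line; crashes are logged with zeros."""
    bpb = 0.0 if val_bpb is None else val_bpb
    return "\t".join([commit, f"{bpb:.6f}", f"{memory_gb:.1f}", status, description])


if __name__ == "__main__":
    status = classify(0.9932, best_so_far=0.9979)
    print(tsv_row("b2c3d4e", 0.9932, 44.2, status, "increase LR to 0.04"))
```

Writing crashes as `0.000000` and `0.0` mirrors the example log above, so every row keeps the same five columns.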
3. The Single-GPU Training Setup
The training runs on a single GPU for exactly 5 minutes. This fixed time budget is key.
# Model architecture
DEPTH = 8                  # number of transformer layers
ASPECT_RATIO = 64          # model_dim = depth * ASPECT_RATIO
HEAD_DIM = 128             # target head dimension

# Optimization
TOTAL_BATCH_SIZE = 2**19   # ~524K tokens per step
MATRIX_LR = 0.04           # Muon learning rate
WEIGHT_DECAY = 0.2         # cautious weight decay

# Device settings
DEVICE_BATCH_SIZE = 128    # reduce if OOM

After 5 minutes, the script prints:

---
val_bpb: 0.997900
training_seconds: 300.1
peak_vram_mb: 45060.2
mfu_percent: 39.80
total_tokens_M: 499.6
num_steps: 953
num_params_M: 50.3
depth: 8

The metric val_bpb (validation bits per byte) is what the agent optimizes. Lower is better.
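The summary block is easy to read programmatically, and its numbers can be checked against each other. Below is a small sketch with a hypothetical helper, `parse_summary`. It assumes the summary is the last `---` section of the log, and that the `memory_gb` column in results.tsv is `peak_vram_mb / 1024` (this matches the example values, but it is my guess).

```python
SAMPLE = """---
val_bpb: 0.997900
training_seconds: 300.1
peak_vram_mb: 45060.2
mfu_percent: 39.80
total_tokens_M: 499.6
num_steps: 953
num_params_M: 50.3
depth: 8
"""


def parse_summary(log_text):
    """Collect the numeric `key: value` lines printed after the last `---`."""
    _, marker, tail = log_text.rpartition("---")
    if not marker:
        return {}
    summary = {}
    for line in tail.splitlines():
        key, colon, value = line.partition(":")
        if not colon:
            continue
        try:
            summary[key.strip()] = float(value)
        except ValueError:
            pass  # skip lines whose value is not a number
    return summary


if __name__ == "__main__":
    stats = parse_summary(SAMPLE)
    # Tokens seen = steps * TOTAL_BATCH_SIZE (2**19 tokens per step)
    print(round(stats["num_steps"] * 2**19 / 1e6, 1))   # compare with total_tokens_M
    # Assumed conversion from peak_vram_mb to the memory_gb column
    print(round(stats["peak_vram_mb"] / 1024, 1))
    # model_dim = depth * ASPECT_RATIO
    print(int(stats["depth"]) * 64)
```

With the sample above, 953 steps at 2**19 tokens each reproduce the reported 499.6M tokens, and depth 8 with aspect ratio 64 gives a model dimension of 512.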
The reason
Why does this design work? I see three reasons.
Experiment Velocity
With 5-minute experiments, the agent runs ~12 experiments per hour. Over 8 hours of sleep, that’s ~100 experiments.
Manual research: 3 experiments/day
Autoresearch: 100 experiments/night
Improvement factor: 33x

The agent explores far more ideas than I could manually.
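The arithmetic behind these numbers is short enough to write down. The sketch below treats the figures as an upper bound. The `overhead_minutes` parameter is a hypothetical knob I added for the time spent outside training (editing, startup, evaluation), which I have not measured.

```python
def experiments_per_night(hours=8, train_minutes=5, overhead_minutes=0):
    """How many back-to-back experiments fit into one night."""
    return int(hours * 60 // (train_minutes + overhead_minutes))


if __name__ == "__main__":
    ideal = experiments_per_night()                      # 8 * 60 / 5 = 96, i.e. ~100
    padded = experiments_per_night(overhead_minutes=1)   # 480 / 6 = 80
    print(ideal, padded, ideal / 3)                      # 96 vs 3 manual runs is ~32x
```

Even with a minute of overhead per run, the count stays far above what I manage by hand.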
Cost Efficiency
Single GPU, fixed 5 minutes, no distributed training. The cost is predictable and low.
# H100 GPU rental: ~$3/hour
# 5-minute experiment: ~$0.25
# 100 experiments overnight: ~$25

Compare that to a researcher’s daily salary. The economics work.
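The same estimate as a tiny Python function, in case you want to plug in your own GPU price. The $3/hour rate is the rough figure from above, not a quote from any provider. The function covers GPU time only, not whatever the coding agent itself costs.

```python
def gpu_cost(usd_per_hour, minutes_per_run, runs=1):
    """GPU rental cost for `runs` experiments of fixed length."""
    return usd_per_hour * minutes_per_run / 60 * runs


if __name__ == "__main__":
    print(gpu_cost(3.0, 5))        # one 5-minute experiment
    print(gpu_cost(3.0, 5, 100))   # 100 experiments overnight
```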
Git-Integrated Version Control
Every experiment goes through git. The agent commits before each run, keeps the commit when val_bpb improves, and resets it otherwise, so the branch only ever contains changes that helped.
# Before experiment
git add train.py
git commit -m "experiment: increase learning rate"

# After experiment (pseudocode)
if val_bpb_improved:
    # Keep the commit, continue from here
else:
    # Reset back to previous state
    git reset --hard HEAD~1

I can review the entire experiment history when I wake up.
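If you wanted to drive the same git steps from a script instead of letting the agent type them, a minimal sketch could look like this. The function names are hypothetical. The `run` parameter exists only so the commands can be inspected without touching a real repository.

```python
import subprocess


def git(*args, run=subprocess.run):
    """Run one git command and raise if it fails."""
    return run(["git", *args], check=True)


def commit_experiment(message, run=subprocess.run):
    """Stage train.py and commit it before the training run."""
    git("add", "train.py", run=run)
    git("commit", "-m", message, run=run)


def finish_experiment(improved, run=subprocess.run):
    """Keep the commit if val_bpb improved, otherwise drop it."""
    if not improved:
        git("reset", "--hard", "HEAD~1", run=run)


if __name__ == "__main__":
    # Dry run: record the commands instead of executing them
    recorded = []
    dry_run = lambda cmd, check: recorded.append(" ".join(cmd))
    commit_experiment("experiment: increase learning rate", run=dry_run)
    finish_experiment(improved=False, run=dry_run)
    print("\n".join(recorded))
```

One thing to keep in mind: `git reset --hard` also throws away uncommitted changes to tracked files, so the results log only survives a reset if it is not tracked by git (which is what I assume here).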
Common Mistakes
I looked at Reddit discussions and found common mistakes people make:
| Mistake | What Goes Wrong | Fix |
|---|---|---|
| Modifying prepare.py | Agent breaks evaluation harness | Keep prepare.py read-only |
| Changing time budget | Experiments become incomparable | Keep fixed 5-minute budget |
| Installing packages | Bloated dependencies | Use only pyproject.toml packages |
| Stopping for feedback | Agent waits forever | Remove “should I continue?” prompts |
| Complex program.md | Agent confused by instructions | Keep instructions simple and direct |
The simplicity is deliberate. The agent only touches one file.
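The first and third mistakes in the table can be caught mechanically. Here is a small sketch of a check I might run on the output of `git diff --name-only HEAD~1 HEAD` after each commit. The helper `forbidden_changes` and the `ALLOWED_FILES` set are my own names, not part of autoresearch.

```python
ALLOWED_FILES = {"train.py"}  # the only file the agent may edit


def forbidden_changes(diff_output):
    """Return changed paths outside ALLOWED_FILES.

    `diff_output` is the text printed by `git diff --name-only`,
    one path per line.
    """
    changed = {line.strip() for line in diff_output.splitlines() if line.strip()}
    return sorted(changed - ALLOWED_FILES)


if __name__ == "__main__":
    print(forbidden_changes("train.py\n"))                # nothing to report
    print(forbidden_changes("train.py\nprepare.py\n"))    # prepare.py was touched
```

An empty list means the experiment stayed inside the rules. Anything else is a reason to reset the commit before running it.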
program.md Template
Here’s a template I’d use for my own autoresearch:
# autoresearch

## Goal
Minimize val_bpb in 5-minute experiments.

## Constraints
- Only modify train.py
- Fixed 5-minute time budget
- Use existing packages only

## Strategy Ideas to Try
1. Adjust learning rates
2. Change model depth/width
3. Modify optimizer settings
4. Try different activation patterns
5. Adjust batch sizes

## Evaluation
- Lower val_bpb = keep
- Equal or higher = discard
- Crash = log and move on

## Output
Log to results.tsv (tab-separated):
commit | val_bpb | memory_gb | status | description

## Loop
Modify → Commit → Run → Evaluate → Keep/Discard → Repeat

Running the Agent
To run autoresearch, I do:
# 1. Clone and setup
git clone https://github.com/karpathy/autoresearch
cd autoresearch
uv sync

# 2. Prepare data (one-time)
uv run prepare.py

# 3. Start the agent
# Point Claude/Codex to the repo with this prompt:
Hi, have a look at program.md and let's kick off a new experiment!
Let's do the setup first.

The agent reads program.md, sets up the branch, and starts running experiments.
What Happens Overnight
When I wake up, I check results.tsv:
commit   val_bpb   memory_gb  status  description
a1b2c3d  1.002000  44.0       keep    baseline run
b2c3d4e  0.998500  44.0       keep    reduce weight decay
c3d4e5f  0.995100  44.2       keep    increase depth to 10
d4e5f6g  0.992800  45.0       keep    add warmup schedule
...

I see what the agent tried, what worked, what failed. The git history shows the progression.
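For a long night, scrolling through the file gets tedious, so a quick summary helps. This is a minimal sketch using Python's standard `csv` module. It assumes results.tsv has the five-column header shown above. The function name `summarize` is hypothetical.

```python
import csv
import io

SAMPLE = (
    "commit\tval_bpb\tmemory_gb\tstatus\tdescription\n"
    "a1b2c3d\t1.002000\t44.0\tkeep\tbaseline run\n"
    "b2c3d4e\t0.998500\t44.0\tkeep\treduce weight decay\n"
    "c3d4e5f\t0.995100\t44.2\tkeep\tincrease depth to 10\n"
    "d4e5f6g\t0.992800\t45.0\tkeep\tadd warmup schedule\n"
)


def summarize(tsv_text):
    """Return (best kept row, count of rows per status)."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    kept = [r for r in rows if r["status"] == "keep"]
    best = min(kept, key=lambda r: float(r["val_bpb"])) if kept else None
    counts = {}
    for r in rows:
        counts[r["status"]] = counts.get(r["status"], 0) + 1
    return best, counts


if __name__ == "__main__":
    best, counts = summarize(SAMPLE)
    print(best["commit"], best["val_bpb"], best["description"])
    print(counts)
```

For a real run, I would pass the contents of results.tsv instead of `SAMPLE`, for example `summarize(open("results.tsv").read())`.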
Summary
Karpathy’s autoresearch works by giving an AI a small but real LLM training setup and letting it experiment autonomously. The key innovation is the program.md pattern - the entire research strategy lives in a markdown file that agents interpret and execute.
The system has three components: the program.md pattern (markdown as code), the 7-step experiment loop (modify, commit, run, evaluate, decide, repeat), and the single-GPU fixed-time setup (5 minutes, one file).
I can run ~100 experiments overnight while I sleep, with full git tracking of every attempt. The simplicity is the power - just 4 files, one editable by the agent, one editable by the human.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can email me.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Karpathy autoresearch GitHub
- 👨💻 Karpathy tweet about autoresearch
- 👨💻 Reddit Discussion: karpathy autoresearch
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!