How Does Karpathy's Autoresearch Autonomous AI Agent Work?

Mar 30, 2026

Problem

Manual ML research is slow. I spend days designing experiments, waiting for results, analyzing logs, then deciding what to try next. Each experiment needs manual intervention: I have to start it, monitor it, and interpret the results.

Here’s what a typical research day looks like:

8:00 AM  - Design experiment (modify hyperparameters)
9:00 AM  - Start training run
12:00 PM - Training finishes, check results
12:30 PM - Analyze loss curves, decide next experiment
1:00 PM  - Modify code, start new run
...
6:00 PM  - End of day, maybe 2-3 experiments completed

That's maybe 3 experiments in a whole day, and I have to be there the entire time.

What happened?

Karpathy released autoresearch, an AI agent that runs ML experiments autonomously. The key insight is simple: give an AI agent a small but real LLM training setup and let it experiment overnight.

I looked at the repo and found something surprising. The entire system is just 4 files:

prepare.py   — fixed constants, data prep (do not modify)
train.py     — model, optimizer, training loop (agent edits this)
program.md   — agent instructions (human edits this)
pyproject.toml — dependencies

That’s it. No complex agent framework, no distributed system, just a markdown file that tells the agent what to do.

How it works

The system has three components that work together.

1. The program.md Pattern

The entire research strategy lives in program.md. This is a markdown file that the AI agent reads and interprets.

Here’s a simplified version:

# autoresearch

## Setup
1. Agree on a run tag (e.g. `mar5`)
2. Create the branch: `git checkout -b autoresearch/mar5`
3. Read the in-scope files: README.md, prepare.py, train.py
4. Verify data exists
5. Initialize results.tsv

## Experimentation
Each experiment runs for 5 minutes. You launch it as:
`uv run train.py`

**What you CAN do:**
- Modify train.py — model architecture, optimizer, hyperparameters

**What you CANNOT do:**
- Modify prepare.py (read-only)
- Install new packages

**The goal: get the lowest val_bpb.**

## The experiment loop
LOOP FOREVER:
1. Tune train.py with an experimental idea
2. git commit
3. Run: `uv run train.py > run.log 2>&1`
4. Read results: `grep "^val_bpb:" run.log`
5. Record in results.tsv
6. If improved: keep the commit
7. If worse: git reset back

**NEVER STOP:** Continue until human interrupts.

The agent reads this file and executes it like code. The markdown becomes the “program.”

2. The 7-Step Experiment Loop

The agent follows a simple loop. I counted 7 steps:

Step 1: Read current git state
Step 2: Modify train.py with an idea
Step 3: Commit the change
Step 4: Run experiment (5 minutes)
Step 5: Extract val_bpb from log
Step 6: Log to results.tsv
Step 7: Keep if better, reset if worse
        → Back to Step 1
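
To make the keep/discard decision in Step 7 concrete, here is a minimal sketch in Python. It is not code from the repo: `decide` and `run_loop` are hypothetical names, and the git commit/reset is reduced to tracking the best value so the example runs without a GPU. The inputs are val_bpb values like the ones in this article's sample log, with `None` standing in for a crashed run.

```python
from typing import Optional

def decide(best: Optional[float], new: Optional[float]) -> str:
    """Return 'crash', 'keep', or 'discard' for one experiment result."""
    if new is None:                    # the run produced no val_bpb (e.g. OOM)
        return "crash"
    if best is None or new < best:     # lower val_bpb is better
        return "keep"
    return "discard"                   # equal or worse: reset the commit

def run_loop(results: list[Optional[float]]) -> tuple[Optional[float], list[str]]:
    """Replay a sequence of val_bpb results and track the best one kept."""
    best = None
    statuses = []
    for val_bpb in results:
        status = decide(best, val_bpb)
        if status == "keep":
            best = val_bpb
        statuses.append(status)
    return best, statuses

best, statuses = run_loop([0.9979, 0.9932, 1.0050, None])
print(best, statuses)  # 0.9932 ['keep', 'keep', 'discard', 'crash']
```

Treating a crash as its own status, rather than as a very bad score, keeps the best-so-far value from being polluted by placeholder numbers.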

Here’s how the agent logs results:

commit    val_bpb    memory_gb  status    description
a1b2c3d   0.997900   44.0       keep      baseline
b2c3d4e   0.993200   44.2       keep      increase LR to 0.04
c3d4e5f   1.005000   44.0       discard   switch to GeLU
d4e5f6a   0.000000   0.0        crash     double width (OOM)

The agent tracks everything: commit hash, metric, memory, status, description.
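
In practice the agent writes these rows itself, but the format is easy to reproduce. Below is a small sketch of how a row could be appended with Python's `csv` module. The helper name `append_result` and the number formatting are my assumptions, chosen to match the sample rows; an in-memory buffer stands in for the real results.tsv.

```python
import csv
import io

FIELDS = ["commit", "val_bpb", "memory_gb", "status", "description"]

def append_result(fh, commit, val_bpb, memory_gb, status, description):
    """Write one tab-separated row in the column order of the sample log."""
    writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
    writer.writerow([commit, f"{val_bpb:.6f}", f"{memory_gb:.1f}", status, description])

# An in-memory buffer stands in for open("results.tsv", "a", newline="")
buf = io.StringIO()
csv.writer(buf, delimiter="\t", lineterminator="\n").writerow(FIELDS)
append_result(buf, "a1b2c3d", 0.9979, 44.0, "keep", "baseline")
append_result(buf, "c3d4e5f", 1.005, 44.0, "discard", "switch to GeLU")
print(buf.getvalue())
```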

3. The Single-GPU Training Setup

The training runs on a single GPU for a fixed 5 minutes. This fixed time budget is key: every run gets the same budget, so val_bpb values stay comparable across experiments.

# Model architecture
DEPTH = 8                  # number of transformer layers
ASPECT_RATIO = 64          # model_dim = depth * ASPECT_RATIO
HEAD_DIM = 128             # target head dimension

# Optimization
TOTAL_BATCH_SIZE = 2**19   # ~524K tokens per step
MATRIX_LR = 0.04           # Muon learning rate
WEIGHT_DECAY = 0.2         # cautious weight decay

# Device settings
DEVICE_BATCH_SIZE = 128    # reduce if OOM
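
To see what these constants imply, here is a quick worked calculation. The `model_dim` formula comes from the comment above; deriving the head count as `model_dim // HEAD_DIM` is my assumption about how a "target head dimension" is used, not something I verified in train.py.

```python
DEPTH = 8
ASPECT_RATIO = 64
HEAD_DIM = 128
TOTAL_BATCH_SIZE = 2**19

model_dim = DEPTH * ASPECT_RATIO      # 8 * 64 = 512
num_heads = model_dim // HEAD_DIM     # assumption: 512 / 128 = 4 heads

print(model_dim, num_heads, TOTAL_BATCH_SIZE)  # 512 4 524288
```

Because width is tied to depth, an experiment like "increase depth to 10" also widens the model, which is worth keeping in mind when reading memory numbers.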

After 5 minutes, the script prints:

---
val_bpb: 0.997900
training_seconds: 300.1
peak_vram_mb: 45060.2
mfu_percent: 39.80
total_tokens_M: 499.6
num_steps: 953
num_params_M: 50.3
depth: 8

The metric val_bpb (validation bits per byte) is what the agent optimizes. Lower is better.
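
The summary block is simple enough to parse with a regular expression. The sketch below is a Python equivalent of the `grep "^val_bpb:"` step that reads every `key: number` line; `parse_summary` is a hypothetical helper, and the sample text is just the output shown above. As a sanity check, it also confirms that `num_steps` times the 2**19-token batch matches `total_tokens_M`.

```python
import re

SAMPLE_LOG = """\
---
val_bpb: 0.997900
training_seconds: 300.1
peak_vram_mb: 45060.2
mfu_percent: 39.80
total_tokens_M: 499.6
num_steps: 953
num_params_M: 50.3
depth: 8
"""

def parse_summary(log_text: str) -> dict[str, float]:
    """Collect 'key: number' lines into a dict; other lines are ignored."""
    pattern = re.compile(r"^(\w+):[ \t]*(-?\d+(?:\.\d+)?)[ \t]*$", re.MULTILINE)
    return {key: float(value) for key, value in pattern.findall(log_text)}

summary = parse_summary(SAMPLE_LOG)
tokens_from_steps = summary["num_steps"] * 2**19 / 1e6   # millions of tokens

print(summary["val_bpb"], round(tokens_from_steps, 1))  # 0.9979 499.6
```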

The reason

Why does this design work? I see three reasons.

Experiment Velocity

With 5-minute experiments, the agent runs ~12 experiments per hour. Over 8 hours of sleep, that’s ~100 experiments.

Manual research:     3 experiments/day
Autoresearch:        100 experiments/night
Improvement factor:  33x

The agent explores far more ideas than I could manually.
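
The arithmetic behind those numbers, as a quick sketch. It assumes back-to-back 5-minute runs and ignores the time the agent spends editing and committing, so it is an upper bound; the exact figures (96 runs, 32x) are what I rounded to ~100 and ~33x.

```python
def experiments_per_night(hours: int, minutes_per_run: int) -> int:
    """Upper bound on runs, ignoring agent overhead between experiments."""
    return (hours * 60) // minutes_per_run

runs = experiments_per_night(hours=8, minutes_per_run=5)
speedup = runs / 3   # 3 manual experiments per day, from my schedule above

print(runs, round(speedup))  # 96 32
```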

Cost Efficiency

Single GPU, fixed 5 minutes, no distributed training. The cost is predictable and low.

# H100 GPU rental: ~$3/hour
# 5-minute experiment: ~$0.25
# 100 experiments overnight: ~$25

Compare that to a researcher’s daily salary. The economics work.
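
Here is the same cost estimate as a tiny function, so you can plug in your own numbers. The $3/hour rate is my rough assumption from above, not a quoted price.

```python
def run_cost(hourly_rate_usd: float, minutes: float) -> float:
    """GPU rental cost of one fixed-length experiment."""
    return hourly_rate_usd * minutes / 60

per_run = run_cost(hourly_rate_usd=3.0, minutes=5)
overnight = per_run * 100

print(per_run, overnight)  # 0.25 25.0
```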

Git-Integrated Version Control

Every experiment is tracked in git. The agent commits before each run, keeps good results, discards bad ones.

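# Before the experiment: commit the candidate change
git add train.py
git commit -m "experiment: increase learning rate"

# After the experiment: keep the commit only if val_bpb improved.
# val_bpb_improved is a placeholder the agent sets after reading run.log.
if [ "$val_bpb_improved" = "true" ]; then
    echo "keep: continue from this commit"
else
    # Reset back to the previous state
    git reset --hard HEAD~1
fi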

I can review the entire experiment history when I wake up.
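
If you wanted to drive the same git steps from a script instead of an agent, a minimal Python sketch could look like this. The function names are hypothetical and this is not how the repo does it (there, the agent simply runs git commands). The `run` parameter is there so the logic can be tried with a fake runner before pointing it at a real repository.

```python
import subprocess

def git(*args: str) -> str:
    """Run a git command in the current repo and return its stdout."""
    result = subprocess.run(["git", *args], check=True, capture_output=True, text=True)
    return result.stdout.strip()

def commit_experiment(description: str, run=git) -> str:
    """Commit train.py and return the short hash to record in results.tsv."""
    run("add", "train.py")
    run("commit", "-m", f"experiment: {description}")
    return run("rev-parse", "--short", "HEAD")

def finish_experiment(improved: bool, run=git) -> None:
    """Keep the commit if val_bpb improved, otherwise drop it."""
    if not improved:
        run("reset", "--hard", "HEAD~1")
```

Note that `git reset --hard` also throws away any uncommitted edits to tracked files, which is exactly what you want here and exactly why I would not run this in a repo with unrelated work in progress.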

Common Mistakes

I looked at Reddit discussions and found common mistakes people make:

Mistake                What Goes Wrong                  Fix
Modifying prepare.py   Agent breaks evaluation harness  Keep prepare.py read-only
Changing time budget   Experiments become incomparable  Keep fixed 5-minute budget
Installing packages    Bloated dependencies             Use only pyproject.toml packages
Stopping for feedback  Agent waits forever              Remove “should I continue?” prompts
Complex program.md     Agent confused by instructions   Keep instructions simple and direct

The simplicity is deliberate. The agent only touches one file.
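
One way to catch the first mistake early is to check which files a commit touched. This is a sketch of my own, not part of autoresearch: `out_of_scope` is a hypothetical helper, and the list of changed files would come from something like `git diff --name-only HEAD~1`.

```python
ALLOWED = {"train.py"}   # the only file the agent is supposed to edit

def out_of_scope(changed_files: list[str]) -> list[str]:
    """Return changed paths the agent was not supposed to touch."""
    return sorted(path for path in changed_files if path not in ALLOWED)

print(out_of_scope(["train.py"]))                # []
print(out_of_scope(["train.py", "prepare.py"]))  # ['prepare.py']
```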

program.md Template

Here’s a template I’d use for my own autoresearch:

# autoresearch

## Goal
Minimize val_bpb in 5-minute experiments.

## Constraints
- Only modify train.py
- Fixed 5-minute time budget
- Use existing packages only

## Strategy Ideas to Try
1. Adjust learning rates
2. Change model depth/width
3. Modify optimizer settings
4. Try different activation patterns
5. Adjust batch sizes

## Evaluation
- Lower val_bpb = keep
- Equal or higher = discard
- Crash = log and move on

## Output
Log to results.tsv (tab-separated):
commit | val_bpb | memory_gb | status | description

## Loop
Modify → Commit → Run → Evaluate → Keep/Discard → Repeat

Running the Agent

To run autoresearch, I do:

# 1. Clone and setup
git clone https://github.com/karpathy/autoresearch
cd autoresearch
uv sync

# 2. Prepare data (one-time)
uv run prepare.py

# 3. Start the agent
# Point Claude/Codex to the repo with this prompt:

Hi, have a look at program.md and let's kick off a new experiment!
Let's do the setup first.

The agent reads program.md, sets up the branch, and starts running experiments.

What Happens Overnight

When I wake up, I check results.tsv:

commit    val_bpb    memory_gb  status    description
a1b2c3d   1.002000   44.0       keep      baseline run
b2c3d4e   0.998500   44.0       keep      reduce weight decay
c3d4e5f   0.995100   44.2       keep      increase depth to 10
d4e5f6a   0.992800   45.0       keep      add warmup schedule
...

I see what the agent tried, what worked, what failed. The git history shows the progression.
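
For a long night of runs, a short script makes the morning review faster. The sketch below assumes the tab-separated layout from this article. The rows are modeled on the earlier sample log, and `summarize` is a hypothetical helper. It filters on status before taking the minimum, because a crashed run is logged with val_bpb 0.000000 and would otherwise look like the best result.

```python
import csv
import io
from collections import Counter

SAMPLE = (
    "commit\tval_bpb\tmemory_gb\tstatus\tdescription\n"
    "a1b2c3d\t0.997900\t44.0\tkeep\tbaseline\n"
    "b2c3d4e\t0.993200\t44.2\tkeep\tincrease LR to 0.04\n"
    "c3d4e5f\t1.005000\t44.0\tdiscard\tswitch to GeLU\n"
    "d4e5f6a\t0.000000\t0.0\tcrash\tdouble width (OOM)\n"
)

def summarize(tsv_text: str):
    """Count runs per status and find the best kept run (lowest val_bpb)."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    counts = Counter(row["status"] for row in rows)
    kept = [row for row in rows if row["status"] == "keep"]
    best = min(kept, key=lambda row: float(row["val_bpb"]), default=None)
    return counts, best

counts, best = summarize(SAMPLE)
print(dict(counts), best["commit"], best["val_bpb"])
# {'keep': 2, 'discard': 1, 'crash': 1} b2c3d4e 0.993200
```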

Summary

Karpathy’s autoresearch works by giving an AI agent a small but real LLM training setup and letting it experiment autonomously. The key innovation is the program.md pattern: the entire research strategy lives in a markdown file that the agent interprets and executes.

The system has three components: the program.md pattern (markdown as code), the 7-step experiment loop (modify, commit, run, evaluate, decide, repeat), and the single-GPU fixed-time setup (5 minutes, one file).

I can run ~100 experiments overnight while I sleep, with full git tracking of every attempt. The simplicity is the power: just 4 files, one editable by the agent, one editable by the human.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Karpathy autoresearch GitHub
👨‍💻 Karpathy tweet about autoresearch
👨‍💻 Reddit Discussion: karpathy autoresearch

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!