How to Write a program.md File for AI Automated Research
Purpose
This post shows how to write a program.md file for automated AI research. The key point: program.md is NOT a configuration file - it’s a natural language brief that tells your AI agent what to optimize, what constraints to follow, and what counts as success.
The Problem
When I started exploring AI research automation, I hit a wall. Traditional ML research needs constant human intervention: hypothesize, code, run experiment, analyze results, iterate. This cycle is tedious and time-consuming.
I wanted to run 100+ experiments overnight. But I couldn’t figure out how to encode my research intent so an AI could autonomously run this loop.
Then I found Karpathy’s autoresearch project. The answer was simple: a program.md file.
What is program.md?
A program.md file is a natural language contract between you and your AI agent. It defines:
- Goal: What metric to optimize
- Constraints: What files can/cannot be modified
- Success Criteria: How to measure improvement
- Iteration Protocol: How to log results and when to keep/discard changes
- Boundaries: Resource limits, simplicity criterion
The key insight: this is natural language, not YAML or JSON. It’s meant to be read by an LLM, not parsed by a machine.
Here’s the high-level structure:
    +-------------------+
    | Setup Section     |  <- How to prepare before experiments
    +-------------------+
              |
              v
    +-------------------+
    | Experimentation   |  <- What CAN and CANNOT do
    +-------------------+
              |
              v
    +-------------------+
    | Output Format     |  <- How to report results
    +-------------------+
              |
              v
    +-------------------+
    | Logging Results   |  <- How to track progress
    +-------------------+
              |
              v
    +-------------------+
    | Experiment Loop   |  <- The forever loop
    +-------------------+

Why This Matters
Before program.md, I had to babysit every experiment. Now:
- Scalability: Run 100+ experiments while I sleep
- Consistency: The agent follows my research principles exactly
- Accumulation: Each version captures learned lessons
- Portability: Share research strategies as markdown files
- Autonomy: The loop runs until manually stopped
The Reddit discussion summed it up: “The program.md file is the whole game - the most important piece often glossed over.”
The Minimal Template
Karpathy’s default program.md is intentionally minimal. Here’s a simplified version I use:
    # autoresearch
    This is an experiment to have the LLM do its own research.

    ## Setup
    1. Agree on a run tag (e.g., `mar5`)
    2. Create the branch: `git checkout -b autoresearch/<tag>`
    3. Read in-scope files for context
    4. Verify data exists
    5. Initialize results tracking

    ## Experimentation
    **What you CAN do:**
    - Modify `train.py` - model architecture, optimizer, hyperparameters

    **What you CANNOT do:**
    - Modify `prepare.py` (read-only)
    - Install new packages
    - Modify evaluation harness

    **Goal:** Get the lowest val_bpb

    **Simplicity criterion:** All else equal, simpler is better

    ## Output format
    val_bpb: X.XXXXXX
    training_seconds: 300.1
    peak_vram_mb: XXXXX.X

    ## Logging results
    TSV format: commit, val_bpb, memory_gb, status, description

    ## The experiment loop
    LOOP FOREVER:
    1. Look at git state
    2. Tune code with experimental idea
    3. git commit
    4. Run experiment
    5. Read results
    6. Record to TSV
    7. If improved -> keep, else -> reset

    **NEVER STOP:** Continue until manually interrupted.

The Five Key Sections
I learned that each section serves a specific purpose:
1. Setup Section
Tells the agent how to prepare. This prevents common errors like missing data or wrong branches.
    ## Setup
    1. Agree on a run tag (e.g., `mar5`)
    2. Create the branch: `git checkout -b autoresearch/<tag>`
    3. Read in-scope files for context
    4. Verify data exists at `/data/train.bin`
    5. Initialize `results.tsv` with headers

Why this matters: Without setup, agents often skip critical steps. I once lost 3 hours of experiments because the agent didn’t verify data existence.
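The last two setup steps can also be checked by a small script before the loop starts, so a missing data file stops the run immediately. This is a minimal sketch, not part of the autoresearch project: the function name `preflight` is hypothetical, and it assumes `results.tsv` is tab-separated with the column names from the logging section of the template.

```python
from pathlib import Path

# Column names taken from the "Logging results" section of the template.
TSV_HEADER = ["commit", "val_bpb", "memory_gb", "status", "description"]


def preflight(data_path: Path, results_path: Path) -> bool:
    """Return True if the data file exists; create the results file if it is missing."""
    if not data_path.is_file():
        print(f"Missing data file: {data_path}")
        return False
    if not results_path.exists():
        # Assumption: tab-separated header row, one line.
        results_path.write_text("\t".join(TSV_HEADER) + "\n", encoding="utf-8")
    return True
```

Calling `preflight(Path("/data/train.bin"), Path("results.tsv"))` before handing control to the agent turns a silent failure into an early, visible one.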
2. Experimentation Constraints
The most critical section. It defines boundaries:
    **What you CAN do:**
    - Modify `train.py` - model architecture, optimizer, hyperparameters
    - Adjust learning rate, batch size, model depth

    **What you CANNOT do:**
    - Modify `prepare.py` (read-only)
    - Install new packages
    - Change evaluation harness
    - Exceed 24GB VRAM limit

I made mistakes here early on:
| Mistake | Consequence | Fix |
|---|---|---|
| Too vague on constraints | Agent modified evaluation code | Explicit “read-only” labels |
| No VRAM limit | OOM crashes at 2AM | Added explicit memory limit |
| Forgetting simplicity criterion | Complex solutions that didn’t generalize | Added “simpler is better” rule |
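The first mistake in the table can also be caught mechanically by comparing the files the agent changed against an allow-list after each commit. The sketch below is illustrative only: `forbidden_changes` and `changed_since` are hypothetical helper names, and the allow-list assumes `train.py` is the only editable file, as in the template. It relies on `git diff --name-only <commit>`, which prints the paths that differ from the given commit.

```python
import subprocess

# Assumption: train.py is the only file the agent may edit, as in the template.
ALLOWED_FILES = frozenset({"train.py"})


def forbidden_changes(changed_files: list[str],
                      allowed: frozenset[str] = ALLOWED_FILES) -> list[str]:
    """Return the changed paths that are not on the allow-list, sorted."""
    return sorted(path for path in changed_files if path not in allowed)


def changed_since(base_ref: str) -> list[str]:
    """List the files that differ between base_ref and the working tree."""
    result = subprocess.run(
        ["git", "diff", "--name-only", base_ref],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line]
```

If `forbidden_changes(changed_since(base_ref))` returns a non-empty list, the experiment can be reset before it is run. The natural language rule in program.md stays the primary guard; this check is only a backstop.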
3. Output Format
Standardizes how the agent reports results:
    ## Output format
    val_bpb: X.XXXXXX
    training_seconds: 300.1
    peak_vram_mb: XXXXX.X

Why: Without a standard format, the agent might report results as JSON, YAML, or prose. A fixed format lets a script parse the results automatically.
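To show what the fixed format buys you, here is a minimal parsing sketch. The key names come from the output format above; the function name `parse_metrics` and the assumption that each metric sits on its own line of the run log are mine.

```python
import re

# Matches lines such as "val_bpb: 0.004500"; key names come from the output format.
METRIC_LINE = re.compile(
    r"^(val_bpb|training_seconds|peak_vram_mb):\s*([0-9]+(?:\.[0-9]+)?)$"
)


def parse_metrics(log_text: str) -> dict[str, float]:
    """Collect reported metrics from a run log; the last occurrence of a key wins."""
    metrics: dict[str, float] = {}
    for line in log_text.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics
```

A run whose log yields no `val_bpb` key can then be treated as a failed experiment without reading the log by hand.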
4. Logging Protocol
Defines how to track progress over iterations:
    ## Logging results
    TSV format: commit, val_bpb, memory_gb, status, description

    Example entries:
    abc123, 0.0045, 12.3, SUCCESS, "Added layer normalization"
    def456, 0.0050, 15.1, FAILED, "Increased batch size - OOM"

5. The Loop
The forever loop that runs experiments:
    +------------------+
    | Look at state    |
    +------------------+
             |
             v
    +------------------+
    | Generate idea    |
    +------------------+
             |
             v
    +------------------+
    | Modify code      |
    +------------------+
             |
             v
    +------------------+
    | Commit changes   |
    +------------------+
             |
             v
    +------------------+
    | Run experiment   |
    +------------------+
             |
             v
    +------------------+
    | Read results     |
    +------------------+
             |
             v
    +------------------+
    | Log to TSV       |
    +------------------+
             |
             v
    +------------------+
    | Better? Keep     |
    | Worse? Reset     |
    +------------------+
             |
             v
      [ LOOP FOREVER ]

Connection to OpenClaw Skill Files
The program.md pattern maps directly to OpenClaw skill files. They share the same structure:
    # Skill Name

    ## When to Use
    Invoke this skill:
    - After completing X
    - Before creating Y
    - When Z condition met

    ## What you CAN do
    - Action 1
    - Action 2

    ## What you CANNOT do
    - Forbidden action 1
    - Forbidden action 2

    ## Success Criteria
    - Metric to optimize
    - Threshold for acceptance

    ## Output Format
    Expected output structure

    ## Loop Protocol
    Iteration rules and termination conditions

The only difference: OpenClaw’s evaluation loop is automated and runs in Git. program.md is the inspiration for this pattern.
Common Mistakes
I made several mistakes when writing program.md files:
1. Treating it as a config file
    # WRONG: Thinking it needs YAML syntax
    goal: minimize_val_bpb
    constraints:
      - read_only: prepare.py

    # CORRECT: Natural language for LLM
    Goal: Get the lowest validation bits per byte (val_bpb)
    The prepare.py file is read-only - do not modify it.

2. Being too vague about success
    # WRONG
    Goal: Improve the model

    # CORRECT
    Goal: Get the lowest val_bpb (lower is better)
    Current baseline: 0.0050
    Success threshold: < 0.0040

3. Not specifying what NOT to touch
    # WRONG
    You can modify training code.

    # CORRECT
    What you CAN do:
    - Modify train.py only

    What you CANNOT do:
    - prepare.py (read-only)
    - evaluate.py (read-only)
    - Any file outside src/

4. Missing the simplicity criterion
Without this, agents tend toward complex solutions. I added:
    **Simplicity criterion:** All else equal, simpler is better.
    Prefer:
    - Fewer lines of code
    - Standard techniques over novel ones
    - Removing code over adding code

5. Adding too many constraints
Too many constraints limit exploration. I keep it minimal:
- 3-5 CAN actions
- 3-5 CANNOT actions
- One clear goal
- One simplicity criterion
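A quick way to keep yourself honest about this budget is to count the bullets under each heading. This is a rough sketch, assuming the CAN and CANNOT lists are written as `- ` bullets directly under lines containing "What you CAN do" and "What you CANNOT do", as in the template; `count_constraints` and `within_budget` are hypothetical names.

```python
def count_constraints(program_text: str) -> dict[str, int]:
    """Count the bullets listed under the CAN and CANNOT headings."""
    counts = {"can": 0, "cannot": 0}
    section = None
    for raw in program_text.splitlines():
        line = raw.strip()
        if "What you CANNOT do" in line:
            section = "cannot"
        elif "What you CAN do" in line:
            section = "can"
        elif line.startswith("- ") and section is not None:
            counts[section] += 1
        elif line:
            # Any other non-blank line ends the current list.
            section = None
    return counts


def within_budget(counts: dict[str, int], low: int = 3, high: int = 5) -> bool:
    """Check both lists against the 3-5 item guideline."""
    return all(low <= n <= high for n in counts.values())
```

The thresholds are only the rule of thumb from this section, not a hard requirement.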
How to Iterate on program.md
The “meta-skill” is learning to write better program.md files over time. Here’s my approach:
Day 1: Write minimal program.md -> Run 10 experiments -> Review results.tsv
Day 2: Notice patterns in failures -> Add constraints to prevent repeats -> Update program.md v2
Day 3: Run 50 more experiments -> Review what worked -> Refine success criteria
Week 2: program.md v3 with accumulated lessons -> Run 100 experiments overnight -> Wake up to results

Each version accumulates the lessons from earlier runs. I track versions in Git:
    git log --oneline program.md

    # Output
    abc123 program.md v3: Added VRAM limit
    def456 program.md v2: Added simplicity criterion
    789abc program.md v1: Initial minimal version

Summary
In this post, I showed how to write a program.md file for automated AI research. The key points:
- program.md is NOT a config file - it’s a natural language brief
- Five sections: Setup, Constraints, Output Format, Logging, Loop
- Constraints are the most critical - be explicit about what NOT to touch
- Simplicity criterion prevents over-engineering
- Iterate on program.md itself - each version captures lessons
Next steps:
- Start with the minimal template above
- Add constraints specific to your domain
- Run your first overnight experiment batch
- Review results and iterate on program.md
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Karpathy's Autoresearch Project
- Reddit Discussion on program.md
- OpenClaw Skill Files Pattern
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!