How to Write a program.md File for AI Automated Research

Purpose

This post shows how to write a program.md file for automated AI research. The key point: program.md is NOT a configuration file - it’s a natural language brief that tells your AI agent what to optimize, what constraints to follow, and what counts as success.

The Problem

When I started exploring AI research automation, I hit a wall. Traditional ML research needs constant human intervention: hypothesize, code, run experiment, analyze results, iterate. This cycle is tedious and time-consuming.

I wanted to run 100+ experiments overnight. But I couldn’t figure out how to encode my research intent so an AI could autonomously run this loop.

Then I found Karpathy’s autoresearch project. The answer was simple: a program.md file.

What is program.md?

A program.md file is a natural language contract between you and your AI agent. It defines:

  • Goal: What metric to optimize
  • Constraints: What files can/cannot be modified
  • Success Criteria: How to measure improvement
  • Iteration Protocol: How to log results and when to keep/discard changes
  • Boundaries: Resource limits, simplicity criterion

The key insight: this is natural language, not YAML or JSON. It’s meant to be read by an LLM, not parsed by a machine.

Here’s the high-level structure:

program.md Structure
+-------------------+
| Setup Section | <- How to prepare before experiments
+-------------------+
|
v
+-------------------+
| Experimentation | <- What the agent CAN and CANNOT do
+-------------------+
|
v
+-------------------+
| Output Format | <- How to report results
+-------------------+
|
v
+-------------------+
| Logging Results | <- How to track progress
+-------------------+
|
v
+-------------------+
| Experiment Loop | <- The forever loop
+-------------------+

Why This Matters

Before program.md, I had to babysit every experiment. Now:

  • Scalability: Run 100+ experiments while I sleep
  • Consistency: The agent follows my research principles exactly
  • Accumulation: Each version captures learned lessons
  • Portability: Share research strategies as markdown files
  • Autonomy: The loop runs until manually stopped

The Reddit discussion summed it up: “The program.md file is the whole game - the most important piece often glossed over.”

The Minimal Template

Karpathy’s default program.md is intentionally minimal. Here’s a simplified version I use:

program.md
# autoresearch
This is an experiment to have the LLM do its own research.
## Setup
1. Agree on a run tag (e.g., `mar5`)
2. Create the branch: `git checkout -b autoresearch/<tag>`
3. Read in-scope files for context
4. Verify data exists
5. Initialize results tracking
## Experimentation
**What you CAN do:**
- Modify `train.py` - model architecture, optimizer, hyperparameters
**What you CANNOT do:**
- Modify `prepare.py` (read-only)
- Install new packages
- Modify evaluation harness
**Goal:** Get the lowest val_bpb
**Simplicity criterion:** All else equal, simpler is better
## Output format
val_bpb: X.XXXXXX
training_seconds: 300.1
peak_vram_mb: XXXXX.X
## Logging results
TSV format: commit, val_bpb, memory_gb, status, description
## The experiment loop
LOOP FOREVER:
1. Look at git state
2. Tune code with experimental idea
3. git commit
4. Run experiment
5. Read results
6. Record to TSV
7. If improved -> keep, else -> reset
**NEVER STOP:** Continue until manually interrupted.
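
Although program.md is read by an LLM rather than parsed, a quick check on the author's side can catch a section that was accidentally deleted. The following is a minimal sketch, assuming the `## ` heading names from the template above; `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names, not part of autoresearch.

```python
import re

# Section names copied from the template above; adjust them to your own file.
REQUIRED_SECTIONS = [
    "Setup",
    "Experimentation",
    "Output format",
    "Logging results",
    "The experiment loop",
]


def missing_sections(program_text: str) -> list[str]:
    """Return the required section names that have no matching '## ' heading."""
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^##[ \t]+(.+)$", program_text, flags=re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]
```

Calling `missing_sections(open("program.md").read())` before starting a run returns an empty list when all five sections are present.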

The Five Key Sections

I learned that each section serves a specific purpose:

1. Setup Section

Tells the agent how to prepare. This prevents common errors like missing data or wrong branches.

Setup Section Example
## Setup
1. Agree on a run tag (e.g., `mar5`)
2. Create the branch: `git checkout -b autoresearch/<tag>`
3. Read in-scope files for context
4. Verify data exists at `/data/train.bin`
5. Initialize `results.tsv` with headers

Why this matters: Without an explicit setup section, agents often skip critical steps. I once lost 3 hours of experiments because the agent didn’t verify that the data existed.
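
Steps 4 and 5 of the setup can also be backed by a small script that the agent (or you) runs before the first experiment. This is a sketch under assumptions: the function name `preflight` is made up for illustration, and the column names are taken from the logging section of the template.

```python
import csv
from pathlib import Path

# Column names follow the "Logging results" section of the template.
TSV_HEADER = ["commit", "val_bpb", "memory_gb", "status", "description"]


def preflight(data_path: Path, results_path: Path) -> None:
    """Fail fast if the data file is missing; create results.tsv with headers if absent."""
    if not data_path.is_file():
        raise FileNotFoundError(f"Training data not found: {data_path}")
    if not results_path.exists():
        with results_path.open("w", newline="") as handle:
            csv.writer(handle, delimiter="\t").writerow(TSV_HEADER)
```

For example, `preflight(Path("/data/train.bin"), Path("results.tsv"))` raises immediately when the data is missing, instead of letting the loop run for hours on nothing.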

2. Experimentation Constraints

The most critical section. It defines boundaries:

Constraints Section
**What you CAN do:**
- Modify `train.py` - model architecture, optimizer, hyperparameters
- Adjust learning rate, batch size, model depth
**What you CANNOT do:**
- Modify `prepare.py` (read-only)
- Install new packages
- Change evaluation harness
- Exceed 24GB VRAM limit

I made mistakes here early on:

| Mistake | Consequence | Fix |
| --- | --- | --- |
| Too vague on constraints | Agent modified evaluation code | Explicit “read-only” labels |
| No VRAM limit | OOM crashes at 2AM | Added explicit memory limit |
| Forgetting simplicity criterion | Complex solutions that didn’t generalize | Added “simpler is better” rule |
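
Natural-language constraints can be paired with a mechanical check after each commit. The sketch below assumes `train.py` is the only editable file, as in the template; `forbidden_changes` and `changed_in_last_commit` are hypothetical helper names, and the git call assumes the repository has at least two commits.

```python
import subprocess

# Assumption: train.py is the only file the agent may edit, as in the template.
ALLOWED_FILES = frozenset({"train.py"})


def forbidden_changes(
    changed_files: list[str], allowed: frozenset[str] = ALLOWED_FILES
) -> list[str]:
    """Return the changed paths that the brief does not allow the agent to edit."""
    return sorted(path for path in changed_files if path not in allowed)


def changed_in_last_commit() -> list[str]:
    """List the files touched by the most recent commit."""
    result = subprocess.run(
        ["git", "diff", "--name-only", "HEAD~1", "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in result.stdout.splitlines() if line]
```

If `forbidden_changes(changed_in_last_commit())` is non-empty, the run can be discarded before any GPU time is spent on it.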

3. Output Format

Standardizes how the agent reports results:

Output Format
## Output format
val_bpb: X.XXXXXX
training_seconds: 300.1
peak_vram_mb: XXXXX.X

Why: Without a standard format, the agent might output JSON, YAML, or prose. A fixed format lets you parse results automatically.
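
With the three `name: value` lines fixed, extracting them from a training log takes one regular expression. This is a minimal sketch, assuming the metrics appear one per line exactly as in the format above; `parse_metrics` is a hypothetical name.

```python
import re

# Matches lines such as "val_bpb: 0.004500" anywhere in the log text.
METRIC_LINE = re.compile(
    r"^(val_bpb|training_seconds|peak_vram_mb):[ \t]*([0-9]+(?:\.[0-9]+)?)[ \t]*$",
    re.MULTILINE,
)


def parse_metrics(log_text: str) -> dict[str, float]:
    """Return the metrics found in the log; missing metrics are simply absent."""
    return {name: float(value) for name, value in METRIC_LINE.findall(log_text)}
```

An empty or partial result is a useful signal in itself: it usually means the run crashed before printing its summary.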

4. Logging Protocol

Defines how to track progress over iterations:

Logging Protocol
## Logging results
TSV format: commit, val_bpb, memory_gb, status, description
Example entries (columns separated by tabs):
abc123	0.0045	12.3	SUCCESS	Added layer normalization
def456	0.0050	15.1	FAILED	Increased batch size - OOM
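
Appending one row per experiment is easy to get right with the standard `csv` module and a tab delimiter. This is a sketch only: `log_result` is a hypothetical helper, and the number formatting (six decimals for val_bpb, one for memory) is my assumption.

```python
import csv
from pathlib import Path


def log_result(
    results_path: Path,
    commit: str,
    val_bpb: float,
    memory_gb: float,
    status: str,
    description: str,
) -> None:
    """Append one tab-separated row to the results file."""
    with results_path.open("a", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(
            [commit, f"{val_bpb:.6f}", f"{memory_gb:.1f}", status, description]
        )
```

Using tabs rather than commas means descriptions that contain commas need no quoting.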

5. The Loop

The forever loop that runs experiments:

Experiment Loop Visualization
+------------------+
| Look at state |
+------------------+
|
v
+------------------+
| Generate idea |
+------------------+
|
v
+------------------+
| Modify code |
+------------------+
|
v
+------------------+
| Commit changes |
+------------------+
|
v
+------------------+
| Run experiment |
+------------------+
|
v
+------------------+
| Read results |
+------------------+
|
v
+------------------+
| Log to TSV |
+------------------+
|
v
+------------------+
| Better? Keep |
| Worse? Reset |
+------------------+
|
v
[ LOOP FOREVER ]
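
In practice the agent carries out these steps itself, but writing the control flow as code makes the keep/reset rule explicit. The sketch below is illustrative only: every callable is a hypothetical stand-in for an agent action, and it uses a `max_iterations` bound instead of a true forever loop so that it can be tested.

```python
from typing import Callable, Optional


def run_loop(
    propose_and_commit: Callable[[], str],
    run_experiment: Callable[[], Optional[float]],
    reset_last_commit: Callable[[], None],
    log: Callable[[str, Optional[float], str], None],
    baseline: float,
    max_iterations: int,
) -> float:
    """Keep a commit only if it lowers val_bpb; otherwise log it and reset."""
    best = baseline
    for _ in range(max_iterations):
        commit = propose_and_commit()   # modify code and commit
        val_bpb = run_experiment()      # None stands for a crashed run
        if val_bpb is not None and val_bpb < best:
            best = val_bpb
            log(commit, val_bpb, "keep")
        else:
            log(commit, val_bpb, "discard")
            reset_last_commit()         # return to the last good state
    return best
```

Note that the result is logged in both branches, so failed ideas stay visible in the TSV even though their commits are reset.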

Connection to OpenClaw Skill Files

The program.md pattern maps directly to OpenClaw skill files. They share the same structure:

OpenClaw Skill File Pattern
# Skill Name
## When to Use
Invoke this skill:
- After completing X
- Before creating Y
- When Z condition met
## What you CAN do
- Action 1
- Action 2
## What you CANNOT do
- Forbidden action 1
- Forbidden action 2
## Success Criteria
- Metric to optimize
- Threshold for acceptance
## Output Format
Expected output structure
## Loop Protocol
Iteration rules and termination conditions

The only difference: OpenClaw’s evaluation loop is automated and runs in Git. program.md is the inspiration for this pattern.

Common Mistakes

I made several mistakes when writing program.md files:

1. Treating it as a config file

# WRONG: Thinking it needs YAML syntax
goal: minimize_val_bpb
constraints:
- read_only: prepare.py
# CORRECT: Natural language for LLM
Goal: Get the lowest validation bits per byte (val_bpb)
The prepare.py file is read-only - do not modify it.

2. Being too vague about success

# WRONG
Goal: Improve the model
# CORRECT
Goal: Get the lowest val_bpb (lower is better)
Current baseline: 0.0050
Success threshold: < 0.0040

3. Not specifying what NOT to touch

# WRONG
You can modify training code.
# CORRECT
What you CAN do:
- Modify train.py only
What you CANNOT do:
- prepare.py (read-only)
- evaluate.py (read-only)
- Any file outside src/

4. Missing the simplicity criterion

Without this, agents tend toward complex solutions. I added:

**Simplicity criterion:** All else equal, simpler is better.
Prefer:
- Fewer lines of code
- Standard techniques over novel ones
- Removing code over adding code

5. Adding too many constraints

Too many constraints limit exploration. I keep it minimal:

  • 3-5 CAN actions
  • 3-5 CANNOT actions
  • One clear goal
  • One simplicity criterion

How to Iterate on program.md

The “meta-skill” is learning to write better program.md files over time. Here’s my approach:

Iteration Workflow
Day 1: Write minimal program.md
-> Run 10 experiments
-> Review results.tsv
Day 2: Notice patterns in failures
-> Add constraints to prevent repeats
-> Update program.md v2
Day 3: Run 50 more experiments
-> Review what worked
-> Refine success criteria
Week 2: program.md v3 with accumulated lessons
-> Run 100 experiments overnight
-> Wake up to results
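
The "review results.tsv" steps in this workflow are easier with a short summary script. This is a minimal sketch, assuming the file starts with the header row `commit, val_bpb, memory_gb, status, description` (tab-separated) and uses `SUCCESS` as the status of a good run, as in the logging examples; `summarize` is a hypothetical name.

```python
import csv
from collections import Counter
from pathlib import Path


def summarize(results_path: Path) -> dict:
    """Count runs per status and report the best val_bpb among successful runs."""
    with results_path.open(newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    scores = [float(row["val_bpb"]) for row in rows if row["status"] == "SUCCESS"]
    return {
        "runs": len(rows),
        "by_status": dict(Counter(row["status"] for row in rows)),
        "best_val_bpb": min(scores) if scores else None,
    }
```

A high share of failed runs with similar descriptions is usually the cue for the next constraint to add to program.md.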

Each version becomes accumulated intelligence. I track versions in Git:

git log --oneline program.md
# Output
abc123 program.md v3: Added VRAM limit
def456 program.md v2: Added simplicity criterion
789abc program.md v1: Initial minimal version

Summary

In this post, I showed how to write a program.md file for automated AI research. The key points:

  1. program.md is NOT a config file - it’s a natural language brief
  2. Five sections: Setup, Constraints, Output Format, Logging, Loop
  3. Constraints are the most critical - be explicit about what NOT to touch
  4. Simplicity criterion prevents over-engineering
  5. Iterate on program.md itself - each version captures lessons

Next steps:

  1. Start with the minimal template above
  2. Add constraints specific to your domain
  3. Run your first overnight experiment batch
  4. Review results and iterate on program.md

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
