How to Write a program.md File for Automated AI Research

Mar 30, 2026

Purpose

This post shows how to write a program.md file for automated AI research. The key point: program.md is NOT a configuration file - it’s a natural language brief that tells your AI agent what to optimize, what constraints to follow, and what counts as success.

The Problem

When I started exploring AI research automation, I hit a wall. Traditional ML research needs constant human intervention: form a hypothesis, write the code, run the experiment, analyze the results, iterate. This cycle is tedious and time-consuming.

I wanted to run 100+ experiments overnight. But I couldn’t figure out how to encode my research intent so an AI could autonomously run this loop.

Then I found Karpathy’s autoresearch project. The answer was simple: a program.md file.

What is program.md?

A program.md file is a natural language contract between you and your AI agent. It defines:

Goal: What metric to optimize
Constraints: What files can/cannot be modified
Success Criteria: How to measure improvement
Iteration Protocol: How to log results and when to keep/discard changes
Boundaries: Resource limits, simplicity criterion

The key insight: this is natural language, not YAML or JSON. It’s meant to be read and interpreted by an LLM, not parsed by a conventional program.

Here’s the high-level structure:

+-------------------+
|   Setup Section   |  <- How to prepare before experiments
+-------------------+
         |
         v
+-------------------+
| Experimentation   |  <- What the agent CAN and CANNOT do
+-------------------+
         |
         v
+-------------------+
|  Output Format    |  <- How to report results
+-------------------+
         |
         v
+-------------------+
|  Logging Results  |  <- How to track progress
+-------------------+
         |
         v
+-------------------+
| Experiment Loop   |  <- The forever loop
+-------------------+
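
Although program.md is written for an LLM rather than a parser, a small pre-flight check can still catch a missing section before an overnight run. The sketch below is only an illustration: the heading names are taken from the template in this post, and `missing_sections` is a hypothetical helper, not part of autoresearch.

```python
import re

# Heading names follow the template in this post; adjust them to match
# your own program.md. This list is an assumption, not a fixed standard.
EXPECTED_SECTIONS = [
    "Setup",
    "Experimentation",
    "Output format",
    "Logging results",
    "The experiment loop",
]


def missing_sections(program_md_text: str) -> list[str]:
    """Return expected section names that have no matching '## ' heading."""
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^##\s+(.+)$", program_md_text, flags=re.MULTILINE)
    }
    return [name for name in EXPECTED_SECTIONS if name.lower() not in headings]
```

Calling `missing_sections(open("program.md").read())` before starting the agent returns an empty list when all five headings are present.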

Why This Matters

Before program.md, I had to babysit every experiment. Now:

Scalability: Run 100+ experiments while I sleep
Consistency: The agent works from the same written research principles on every run
Accumulation: Each version captures learned lessons
Portability: Share research strategies as markdown files
Autonomy: The loop runs until manually stopped

The Reddit discussion summed it up: “The program.md file is the whole game - the most important piece often glossed over.”

The Minimal Template

Karpathy’s default program.md is intentionally minimal. Here’s a simplified version I use:

# autoresearch
This is an experiment to have the LLM do its own research.

## Setup
1. Agree on a run tag (e.g., `mar5`)
2. Create the branch: `git checkout -b autoresearch/<tag>`
3. Read in-scope files for context
4. Verify data exists
5. Initialize results tracking

## Experimentation
**What you CAN do:**
- Modify `train.py` - model architecture, optimizer, hyperparameters

**What you CANNOT do:**
- Modify `prepare.py` (read-only)
- Install new packages
- Modify evaluation harness

**Goal:** Get the lowest val_bpb

**Simplicity criterion:** All else equal, simpler is better

## Output format
val_bpb: X.XXXXXX
training_seconds: 300.1
peak_vram_mb: XXXXX.X

## Logging results
TSV format: commit, val_bpb, memory_gb, status, description

## The experiment loop
LOOP FOREVER:
1. Look at git state
2. Modify `train.py` with an experimental idea
3. git commit
4. Run experiment
5. Read results
6. Record to TSV
7. If val_bpb improved -> keep the commit, else -> reset to the last kept commit

**NEVER STOP:** Continue until manually interrupted.

The Five Key Sections

I learned that each section serves a specific purpose:

1. Setup Section

Tells the agent how to prepare. This prevents common errors like missing data or wrong branches.

## Setup
1. Agree on a run tag (e.g., `mar5`)
2. Create the branch: `git checkout -b autoresearch/<tag>`
3. Read in-scope files for context
4. Verify data exists at `/data/train.bin`
5. Initialize `results.tsv` with headers

Why this matters: Without an explicit setup section, agents often skip critical steps. I once lost 3 hours of experiments because the agent didn’t verify that the data existed.
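
Steps 4 and 5 of the setup can also be backed by a small script that the agent (or you) runs once before the loop starts. This is a minimal sketch under assumptions: `preflight` is a hypothetical helper, and the column names are the ones listed in the logging section of the template.

```python
from pathlib import Path

# Column names follow the "Logging results" section of the template.
TSV_HEADER = ["commit", "val_bpb", "memory_gb", "status", "description"]


def preflight(data_path: Path, results_path: Path) -> None:
    """Fail fast if the data is missing; create results.tsv with a header if absent."""
    if not data_path.is_file():
        raise FileNotFoundError(f"training data not found: {data_path}")
    if not results_path.exists():
        # Only write the header once so earlier results are never overwritten.
        results_path.write_text("\t".join(TSV_HEADER) + "\n", encoding="utf-8")
```

For the paths used above, this would be called as `preflight(Path("/data/train.bin"), Path("results.tsv"))`.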

2. Experimentation Constraints

The most critical section. It defines boundaries:

**What you CAN do:**
- Modify `train.py` - model architecture, optimizer, hyperparameters
- Adjust learning rate, batch size, model depth

**What you CANNOT do:**
- Modify `prepare.py` (read-only)
- Install new packages
- Change evaluation harness
- Exceed 24GB VRAM limit

I made mistakes here early on:

| Mistake | Consequence | Fix |
| --- | --- | --- |
| Too vague on constraints | Agent modified evaluation code | Explicit “read-only” labels |
| No VRAM limit | OOM crashes at 2AM | Added explicit memory limit |
| Forgetting simplicity criterion | Complex solutions that didn’t generalize | Added “simpler is better” rule |
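
Written constraints can be paired with a mechanical check, so that a violation such as an edited evaluation file is caught even if the agent misreads the brief. The following is a sketch under assumptions: the allow-list contains only `train.py` as in the CAN list above, and both function names are hypothetical.

```python
import subprocess

# Assumption: train.py is the only writable file, as in the CAN list above.
ALLOWED_FILES = frozenset({"train.py"})


def forbidden_changes(changed_files: list[str], allowed: frozenset = ALLOWED_FILES) -> list[str]:
    """Return the changed paths that are not on the allow-list, sorted."""
    return sorted(path for path in changed_files if path not in allowed)


def changed_in_last_commit() -> list[str]:
    """List files touched by HEAD relative to its parent (needs a git repo with 2+ commits)."""
    result = subprocess.run(
        ["git", "diff", "--name-only", "HEAD~1", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line]
```

A wrapper script could call `forbidden_changes(changed_in_last_commit())` after each commit and discard the experiment if the list is not empty.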

3. Output Format

Standardizes how the agent reports results:

## Output format
val_bpb: X.XXXXXX
training_seconds: 300.1
peak_vram_mb: XXXXX.X

Why: Without a standard format, the agent might output JSON, YAML, or prose. A fixed format lets you parse the results automatically.
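
To show what the fixed format buys you, here is a minimal parser sketch for the three lines above. The field names come from the template; `parse_run_output` itself is a hypothetical helper and assumes each metric is printed on its own line.

```python
import re

# Matches lines such as "val_bpb: 0.004500"; field names come from the template.
_FIELD = re.compile(
    r"^(val_bpb|training_seconds|peak_vram_mb):[ \t]*([-+]?\d+(?:\.\d+)?)[ \t]*$",
    re.MULTILINE,
)


def parse_run_output(stdout: str) -> dict[str, float]:
    """Extract the three reported metrics; raise ValueError if any is missing."""
    metrics = {name: float(value) for name, value in _FIELD.findall(stdout)}
    missing = {"val_bpb", "training_seconds", "peak_vram_mb"} - metrics.keys()
    if missing:
        raise ValueError(f"missing metrics in output: {sorted(missing)}")
    return metrics
```

Any other log lines in the training output are ignored, and a crashed run that never prints the metrics surfaces as a `ValueError` instead of a silent gap.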

4. Logging Protocol

Defines how to track progress over iterations:

## Logging results
TSV format: commit, val_bpb, memory_gb, status, description

Example entries (columns separated by tabs, not commas):
abc123	0.0045	12.3	SUCCESS	Added layer normalization
def456	0.0050	15.1	FAILED	Increased batch size - OOM
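
The logging step can be sketched with Python's standard `csv` module, using a tab delimiter as the TSV name implies. `log_result` is a hypothetical helper, and the number formatting (six decimals for val_bpb, one for memory) is an assumption chosen to match the examples in this post.

```python
import csv
from pathlib import Path


def log_result(results_path: Path, commit: str, val_bpb: float,
               memory_gb: float, status: str, description: str) -> None:
    """Append one tab-separated row to the results file."""
    # Collapse tabs and newlines so the description stays in a single column.
    clean_description = " ".join(description.split())
    with results_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow([commit, f"{val_bpb:.6f}", f"{memory_gb:.1f}", status, clean_description])
```

Opening the file in append mode means an interrupted run never loses earlier rows.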

5. The Loop

The forever loop that runs experiments:

     +------------------+
     |  Look at state   |
     +------------------+
              |
              v
     +------------------+
     |  Generate idea   |
     +------------------+
              |
              v
     +------------------+
     |  Modify code     |
     +------------------+
              |
              v
     +------------------+
     |  Commit changes  |
     +------------------+
              |
              v
     +------------------+
     |  Run experiment  |
     +------------------+
              |
              v
     +------------------+
     |  Read results    |
     +------------------+
              |
              v
     +------------------+
     |  Log to TSV      |
     +------------------+
              |
              v
     +------------------+
     |  Better? Keep    |
     |  Worse? Reset    |
     +------------------+
              |
              v
     [ LOOP FOREVER ]
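
The keep-or-reset rule at the bottom of the diagram can be written down as a few lines of Python. In practice the agent carries out these steps itself by following program.md; this sketch only illustrates the decision logic. The three callables are hypothetical stand-ins for "modify and commit", "run experiment", and "git reset", and `max_iterations` is added so the example terminates, whereas the real loop runs until interrupted.

```python
from typing import Callable, Optional


def run_loop(
    propose_and_commit: Callable[[], str],           # hypothetical: edits code, commits, returns hash
    run_experiment: Callable[[], Optional[float]],   # hypothetical: returns val_bpb, None on crash
    reset_to: Callable[[str], None],                 # hypothetical: e.g. wraps a git reset
    baseline_commit: str,
    baseline_bpb: float,
    max_iterations: int,
) -> tuple[str, float]:
    """Keep a commit only when val_bpb strictly improves; otherwise reset to the best one."""
    best_commit, best_bpb = baseline_commit, baseline_bpb
    for _ in range(max_iterations):
        commit = propose_and_commit()
        val_bpb = run_experiment()
        if val_bpb is not None and val_bpb < best_bpb:
            best_commit, best_bpb = commit, val_bpb  # better: keep
        else:
            reset_to(best_commit)                    # worse or crashed: reset
    return best_commit, best_bpb
```

Treating a crash the same as a worse result keeps the branch at the last known-good commit.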

Connection to OpenClaw Skill Files

The program.md pattern maps directly to OpenClaw skill files. They share the same structure:

# Skill Name

## When to Use
Invoke this skill:
- After completing X
- Before creating Y
- When Z condition met

## What you CAN do
- Action 1
- Action 2

## What you CANNOT do
- Forbidden action 1
- Forbidden action 2

## Success Criteria
- Metric to optimize
- Threshold for acceptance

## Output Format
Expected output structure

## Loop Protocol
Iteration rules and termination conditions

The only difference: OpenClaw’s evaluation loop is automated and runs in Git. program.md is the inspiration for this pattern.

Common Mistakes

I made several mistakes when writing program.md files:

1. Treating it as a config file

# WRONG: Thinking it needs YAML syntax
goal: minimize_val_bpb
constraints:
  - read_only: prepare.py

# CORRECT: Natural language for LLM
Goal: Get the lowest validation bits per byte (val_bpb)
The prepare.py file is read-only - do not modify it.

2. Being too vague about success

# WRONG
Goal: Improve the model

# CORRECT
Goal: Get the lowest val_bpb (lower is better)
Current baseline: 0.0050
Success threshold: < 0.0040

3. Not specifying what NOT to touch

# WRONG
You can modify training code.

# CORRECT
What you CAN do:
- Modify train.py only

What you CANNOT do:
- Modify prepare.py (read-only)
- Modify evaluate.py (read-only)
- Touch any file outside src/

4. Missing the simplicity criterion

Without this, agents tend toward complex solutions. I added:

**Simplicity criterion:** All else equal, simpler is better.
Prefer:
- Fewer lines of code
- Standard techniques over novel ones
- Removing code over adding code
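
One way to make "all else equal" concrete is a tie-break rule: if the metric barely moves, keep the change only when it shrinks the code. This is a sketch of one possible reading, not a rule from autoresearch; the `tolerance` value and the use of added/removed line counts as a proxy for simplicity are both assumptions.

```python
def should_keep(old_bpb: float, new_bpb: float, lines_added: int, lines_removed: int,
                tolerance: float = 1e-4) -> bool:
    """Keep clear improvements; on a near-tie, keep only changes that shrink the code."""
    if new_bpb < old_bpb - tolerance:
        return True                       # clearly better metric
    if new_bpb > old_bpb + tolerance:
        return False                      # clearly worse metric
    return lines_removed > lines_added    # near-tie: simpler wins
```

The same rule can be stated in prose inside program.md; the code form is mainly useful for checking that the rule you wrote is the rule you meant.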

5. Adding too many constraints

Too many constraints limit exploration. I keep it minimal:

3-5 CAN actions
3-5 CANNOT actions
One clear goal
One simplicity criterion

How to Iterate on program.md

The “meta-skill” is learning to write better program.md files over time. Here’s my approach:

Day 1: Write minimal program.md
       -> Run 10 experiments
       -> Review results.tsv

Day 2: Notice patterns in failures
       -> Add constraints to prevent repeats
       -> Update program.md v2

Day 3: Run 50 more experiments
       -> Review what worked
       -> Refine success criteria

Week 2: program.md v3 with accumulated lessons
        -> Run 100 experiments overnight
        -> Wake up to results

Each version becomes accumulated intelligence. I track versions in Git:

git log --oneline program.md

# Output
abc123 program.md v3: Added VRAM limit
def456 program.md v2: Added simplicity criterion
789abc program.md v1: Initial minimal version
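
For the "review results.tsv" step in the schedule above, a short summary script makes failure patterns easier to spot than scrolling through rows. This sketch assumes a tab-separated file with a header row containing the five columns from the logging section, and the SUCCESS/FAILED status values used in this post; `summarize` is a hypothetical helper.

```python
import csv
from collections import Counter
from pathlib import Path


def summarize(results_path: Path) -> dict:
    """Count runs per status and find the best successful val_bpb (None if there is none)."""
    with results_path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    status_counts = Counter(row["status"] for row in rows)
    successes = [row for row in rows if row["status"] == "SUCCESS"]
    best = min(successes, key=lambda row: float(row["val_bpb"]), default=None)
    return {
        "runs": len(rows),
        "by_status": dict(status_counts),
        "best_commit": best["commit"] if best else None,
        "best_val_bpb": float(best["val_bpb"]) if best else None,
    }
```

A high share of FAILED rows with similar descriptions is usually the signal to add or tighten a constraint in the next program.md version.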

Summary

In this post, I showed how to write a program.md file for automated AI research. The key points:

program.md is NOT a config file - it’s a natural language brief
Five sections: Setup, Constraints, Output Format, Logging, Loop
Constraints are the most critical - be explicit about what NOT to touch
Simplicity criterion prevents over-engineering
Iterate on program.md itself - each version captures lessons

Next steps:

Start with the minimal template above
Add constraints specific to your domain
Run your first overnight experiment batch
Review results and iterate on program.md

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.
