
What is autoresearch and how does it work?

I kept seeing “autoresearch” pop up in my feed. People were talking about running “100 ML experiments overnight” without writing a single line of Python. Sounded like hype. Then I saw Andrej Karpathy’s name attached to it.

So I dug in. Here’s what I found.

The Problem: ML Research is Tedious

Every ML researcher knows the drill:

  1. Have an idea for improving the model
  2. Modify code
  3. Train for hours
  4. Check results
  5. Tweak something
  6. Repeat

Most of this is not creative work. It’s mechanical iteration. You’re essentially a glorified button-clicker hoping something works.

Karpathy’s question: What if an AI could do this loop instead?

What autoresearch Actually Does

autoresearch is a minimalist framework that turns language models (Claude, GPT-4, etc.) into autonomous ML researchers. Here’s the core idea:

The autoresearch loop
┌──────────────────────────────────────────┐
│                                          │
│  ┌──────────┐      ┌──────────────┐      │
│  │ AI Agent │ ───► │ Modify Code  │      │
│  └──────────┘      └──────────────┘      │
│       │                   │              │
│       │                   ▼              │
│       │            ┌──────────────┐      │
│       │            │ Train Model  │      │
│       │            │   (5 min)    │      │
│       │            └──────────────┘      │
│       │                   │              │
│       │                   ▼              │
│       │            ┌──────────────┐      │
│       └─────────── │   Evaluate   │      │
│                    └──────────────┘      │
│                           │              │
│                           ▼              │
│                    ┌──────────────┐      │
│                    │ Keep/Discard │      │
│                    └──────────────┘      │
│                                          │
└──────────────────────────────────────────┘
             (repeat forever)

The AI runs this loop indefinitely. It modifies Python code, trains for 5 minutes, evaluates performance, and decides whether to keep the changes.
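To make the loop concrete, here is a minimal sketch of the keep/discard logic as I understand it. It is not the actual autoresearch code: `run_loop`, `toy_loss`, and `toy_propose` are names I made up, the "agent" is just a random perturbation, and the "5-minute training run" is replaced by a toy function so the example finishes instantly.

```python
import random


def run_loop(evaluate, propose, baseline, n_experiments, seed=0):
    """Greedy keep/discard loop: a candidate replaces the baseline only if it scores lower."""
    rng = random.Random(seed)
    best, best_score = baseline, evaluate(baseline)
    history = []
    for i in range(n_experiments):
        candidate = propose(best, rng)   # stand-in for "agent modifies the code"
        score = evaluate(candidate)      # stand-in for "train 5 minutes, then evaluate"
        kept = score < best_score
        if kept:
            best, best_score = candidate, score   # keep: becomes the new baseline
        history.append((i, score, kept))           # discard: baseline stays untouched
    return best, best_score, history


# Toy stand-ins (assumptions for illustration only):
# the "config" is a single number and the "validation loss" is a bowl around 0.3.
def toy_loss(lr):
    return (lr - 0.3) ** 2


def toy_propose(lr, rng):
    return lr + rng.uniform(-0.1, 0.1)


best, best_score, history = run_loop(toy_loss, toy_propose, baseline=1.0, n_experiments=100)
print(f"best config: {best:.3f}, loss: {best_score:.5f}")
```

Swap the toy functions for "ask the LLM for a code change" and "run the training script" and you get the shape of the real thing: the baseline only ever moves when an experiment beats it.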

How I Understood It (The Mental Model)

At first, I thought this was just “LLM writes code, runs it.” But the key insight is the feedback loop:

Traditional workflow vs autoresearch
Traditional:
  Human ──► Write code ──► Train ──► Evaluate ──► Human decides

autoresearch:
  Human ──► Write goals (Markdown) ──► AI takes over
  AI ──► Code ──► Train ──► Evaluate ──► AI decides
  │                                          │
  └──────────────────────────────────────────┘
                (loops forever)

The human’s role shifts from doing experiments to defining experiments.

The Configuration: Just Markdown

Here’s what blew my mind. You don’t write Python. You write this:

experiment.md (simplified example)
# Goal
Minimize validation loss for character-level language model
# Constraints
- Training time: 5 minutes per experiment
- Hardware: Single GPU
- Model size: < 1M parameters
# Metrics
Track validation perplexity, training stability, inference speed

That’s it. The AI reads this, generates Python code, runs experiments, and iterates.
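To show how little structure that file needs, here is a small sketch that splits the example above into sections. This is purely illustrative: as far as I can tell the LLM reads the Markdown directly, so `parse_goal_file` is a hypothetical helper of mine, not something the framework provides.

```python
def parse_goal_file(text):
    """Split a Markdown goal file into {heading: [lines under that heading]}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# "):
            current = line[2:].strip()      # start a new section
            sections[current] = []
        elif current is not None:
            if line.startswith("- "):
                line = line[2:].strip()     # drop the bullet marker
            sections[current].append(line)
    return sections


EXPERIMENT_MD = """\
# Goal
Minimize validation loss for character-level language model
# Constraints
- Training time: 5 minutes per experiment
- Hardware: Single GPU
- Model size: < 1M parameters
# Metrics
Track validation perplexity, training stability, inference speed
"""

goal = parse_goal_file(EXPERIMENT_MD)
for heading, lines in goal.items():
    print(heading, "->", lines)
```

The point is that a goal, a few constraints, and the metrics to track are the whole interface.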

What Actually Happens in One Loop

Let me trace through one iteration:

Step 1: The AI Analyzes Current State

  • Looks at the existing model architecture
  • Reviews previous experiment results
  • Checks what’s been tried before

Step 2: Proposes a Change

  • Maybe adjusts the learning rate
  • Maybe adds a new layer type
  • Maybe changes the activation function
  • Generates clean, modular Python code

Step 3: Trains for 5 Minutes

  • Spins up training on the GPU
  • Monitors loss curves
  • Logs metrics

Step 4: Evaluates Results

  • Compares new metrics to baseline
  • Checks if the change helped
  • Looks for training instabilities

Step 5: Makes a Decision

  • Keep: Change becomes new baseline
  • Discard: Revert to previous state
  • Branch: Try alternative approach

Then it repeats, until you stop it or it hits your target.
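Steps 4 and 5 are the part I'd want to pin down, so here is a rough sketch of one way to do the evaluation and the keep/discard decision. The function names, the `spike_factor` threshold, and the rule itself are my assumptions, not the framework's actual logic. It assumes positive loss values and leaves out the "Branch" option.

```python
import math


def is_unstable(losses, spike_factor=3.0):
    """Flag a run whose loss goes non-finite or jumps above spike_factor x the best loss so far."""
    best = math.inf
    for loss in losses:
        if not math.isfinite(loss):
            return True                       # NaN or inf: training diverged
        if best < math.inf and loss > spike_factor * best:
            return True                       # large spike relative to the best value seen
        best = min(best, loss)
    return False


def decide(baseline_loss, losses, min_improvement=0.0):
    """Return "keep" if the run is stable and its final loss beats the baseline, else "discard"."""
    if not losses or is_unstable(losses):
        return "discard"
    final = losses[-1]
    return "keep" if final < baseline_loss - min_improvement else "discard"


print(decide(2.0, [3.0, 2.5, 1.8]))           # stable and better than the 2.0 baseline
print(decide(2.0, [3.0, 2.5, 2.1]))           # stable but worse than the baseline
print(decide(2.0, [3.0, float("nan")]))       # diverged
print(decide(2.0, [1.0, 0.5, 1.9]))           # ends below baseline, but spiked on the way
```

The last case is the interesting one: a run can end below the baseline and still be rejected because the loss curve looked unhealthy.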

The “100 Experiments Overnight” Claim

This is where it clicked for me. Traditional ML research:

Manual approach (8 hours):
- Read papers, plan experiments: 2 hours
- Write/debug code: 2 hours
- Wait for training: 3 hours
- Analyze results: 1 hour
- Actual experiments run: ~3-5

autoresearch approach:

Automated approach (8 hours):
- Human writes Markdown: 30 minutes
- AI runs experiments continuously: 7.5 hours
- Actual experiments run: ~90 (5 min each, back to back; reaching 100 takes about 8 hours 20 minutes of training)

The AI doesn’t need to sleep, eat, or debug typos. It just iterates.
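As a back-of-the-envelope check on the throughput, here is the arithmetic in a few lines. The `overhead_minutes` parameter is my own addition to account for time the agent spends thinking and editing between runs. I don't have a measured value for it.

```python
def experiments_in(hours, minutes_per_experiment=5.0, overhead_minutes=0.0):
    """Number of complete experiments that fit into the given number of hours."""
    return int(hours * 60 // (minutes_per_experiment + overhead_minutes))


def hours_needed(n_experiments, minutes_per_experiment=5.0, overhead_minutes=0.0):
    """Wall-clock hours needed to run n_experiments back to back."""
    return n_experiments * (minutes_per_experiment + overhead_minutes) / 60


print(experiments_in(7.5))                        # 90
print(experiments_in(8))                          # 96
print(experiments_in(7.5, overhead_minutes=1))    # 75
print(round(hours_needed(100), 2))                # 8.33
```

So 5-minute runs work out to 12 per hour with zero overhead, and "100 overnight" means a bit over 8 hours of uninterrupted training.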

Key Components Under the Hood

I looked at the architecture:

Code Generation Engine

  • Parses your Markdown goals
  • Generates PyTorch/TensorFlow/JAX code
  • Follows ML best practices automatically

Training Orchestrator

  • Manages GPU allocation
  • Handles checkpointing
  • Implements fault tolerance

Evaluation Framework

  • Collects metrics automatically
  • Compares to baseline
  • Generates visualizations

Version Control Integration

  • Every experiment is a git commit
  • Can reproduce any experiment
  • Branch for parallel exploration
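Here is one way the keep/discard step could map onto git. This is a sketch under my own assumptions, not the framework's actual integration: the helper names are hypothetical, and only the git commands themselves (`add`, `commit`, `reset --hard`, `clean -fd`) are real. It defaults to a dry run that just prints the commands.

```python
import subprocess


def keep_commands(message):
    """Commands that record the current working tree as a new commit."""
    return [["git", "add", "-A"], ["git", "commit", "-m", message]]


def discard_commands():
    """Commands that throw away uncommitted changes and return to the last commit.

    Careful: `git clean -fd` deletes untracked files and directories.
    """
    return [["git", "reset", "--hard", "HEAD"], ["git", "clean", "-fd"]]


def run_all(commands, repo_dir, dry_run=True):
    """Print the commands (dry run) or execute them inside repo_dir."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, cwd=repo_dir, check=True)


# Dry run: shows what would happen for a kept and a discarded experiment.
run_all(keep_commands("exp 7: lower learning rate"), repo_dir=".")
run_all(discard_commands(), repo_dir=".")
```

With one commit per kept experiment, reproducing an old result comes down to checking out that commit and rerunning the training script.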

Agent Interface

  • Works with Claude, GPT-4, or other LLMs
  • Maintains context across experiments
  • Has safety constraints
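"Maintains context across experiments" could be as simple as feeding a short summary of past results back into the next prompt. The sketch below is my guess at what that might look like. The record fields (`change`, `val_loss`, `kept`) and the function name are made up for illustration.

```python
def summarize_history(history, max_entries=5):
    """Turn a list of experiment records into a short text block for the next prompt."""
    if not history:
        return "No experiments yet."
    lowest = min(history, key=lambda h: h["val_loss"])
    lines = [f"Lowest val_loss so far: {lowest['change']} ({lowest['val_loss']:.3f})"]
    for h in history[-max_entries:]:          # only the most recent entries, to save tokens
        status = "kept" if h["kept"] else "discarded"
        lines.append(f"- {h['change']}: val_loss={h['val_loss']:.3f} ({status})")
    return "\n".join(lines)


history = [
    {"change": "baseline", "val_loss": 2.10, "kept": True},
    {"change": "lower lr", "val_loss": 1.95, "kept": True},
    {"change": "swap activation", "val_loss": 2.05, "kept": False},
]
print(summarize_history(history))
```

Capping the summary at the last few entries keeps the prompt short even after a hundred runs.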

What Makes This Different from “LLM Writes Code”

I initially dismissed this as “just another code generator.” But the difference is:

Code Generator:
  Prompt ──► Code ──► Done

autoresearch:
  Goal ──► Code ──► Run ──► Result ──► New Code ──► ...
  │                                        │
  └────────────────────────────────────────┘
              (feedback loop)

The feedback loop is the innovation. The AI sees the results of its changes and uses them to choose the next one.

Limitations (What It Can’t Do Yet)

From what I’ve seen:

  1. Requires clear metrics - If you can’t define “better,” the AI can’t optimize
  2. Small models only - 5-minute training means tiny models
  3. No big architectural leaps - It iterates, doesn’t invent fundamentally new approaches
  4. GPU-bound - Still needs compute

When Would I Use This?

Real talk - this isn’t for:

  • Production model training (too small scale)
  • Novel architecture research (needs human creativity)
  • Quick prototypes (overkill)

But it IS useful for:

  • Hyperparameter exploration
  • Ablation studies
  • Finding surprising combinations that work
  • Overnight iteration when you’re stuck

The Real Insight

The framework isn’t about “AI replacing researchers.” It’s about removing the mechanical bottleneck from research.

Researchers still:

  • Define the problem space
  • Set constraints and goals
  • Interpret surprising results
  • Decide research direction

But the AI handles the tedious loop of “try this, check that, adjust, repeat.”

Getting Started

If you want to try it:

  1. Clone the repo
  2. Write your experiment.md
  3. Point it at your Claude/GPT-4 API
  4. Let it run overnight
  5. Review results in the morning

The barrier to entry is essentially: “Can you write a Markdown file describing what you want optimized?”

This fits into a broader trend:

  • AutoML - Automated machine learning (hyperparameter tuning)
  • Neural Architecture Search - AI designing neural networks
  • LLM Agents - Language models taking actions

autoresearch is essentially “LLM Agent + AutoML” in a tight loop.

Final Thoughts

What struck me most: this changes the unit of work in ML research.

Before: “One experiment” (takes hours of human time)
After: “One goal file” (takes minutes of human time)

The AI handles the experiment density. You handle the research direction.

That’s not replacing researchers. That’s multiplying their iteration speed.


Final Words + More Resources

My intention with this article was to share my knowledge and experience with others. If you want to get in touch, you can contact me by email.

