What is autoresearch and how does it work?

Mar 29, 2026

I kept seeing “autoresearch” pop up in my feed. People were talking about running “100 ML experiments overnight” without writing a single line of Python. Sounded like hype. Then I saw Andrej Karpathy’s name attached to it.

So I dug in. Here’s what I found.

The Problem: ML Research is Tedious

Every ML researcher knows the drill:

Have an idea for improving the model
Modify code
Train for hours
Check results
Tweak something
Repeat

Most of this is not creative work. It’s mechanical iteration. You’re essentially a glorified button-clicker hoping something works.

Karpathy’s question: What if an AI could do this loop instead?

What autoresearch Actually Does

autoresearch is a minimalist framework that turns language models (Claude, GPT-4, etc.) into autonomous ML researchers. Here’s the core idea:

┌─────────────────────────────────────────┐
│                                         │
│   ┌──────────┐      ┌──────────────┐    │
│   │ AI Agent │ ───► │ Modify Code  │    │
│   └──────────┘      └──────────────┘    │
│         │                  │            │
│         │                  ▼            │
│         │           ┌──────────────┐    │
│         │           │ Train Model  │    │
│         │           │   (5 min)    │    │
│         │           └──────────────┘    │
│         │                  │            │
│         │                  ▼            │
│         │           ┌──────────────┐    │
│         └───────────┤  Evaluate    │    │
│                     └──────────────┘    │
│                            │            │
│                            ▼            │
│                     ┌──────────────┐    │
│                     │ Keep/Discard │    │
│                     └──────────────┘    │
│                            │            │
└────────┬───────────────────┴────────────┘
         │
         ▼
    (repeat forever)

The AI runs this loop indefinitely. It modifies Python code, trains for 5 minutes, evaluates performance, and decides whether to keep the changes.
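To make the loop concrete, here is a minimal sketch of the keep/discard cycle in plain Python. It is not the framework’s code: `train_and_evaluate` and `propose_change` are hypothetical stand-ins I made up (a toy loss function and a random learning-rate tweak) so the control flow can run without a GPU or an LLM.

```python
import random

def train_and_evaluate(config, rng):
    """Hypothetical stand-in for a 5-minute training run; returns a validation loss.

    Toy objective: the loss is lowest when lr is near 0.01, plus a little noise.
    """
    noise = rng.uniform(0.0, 0.01)
    return abs(config["lr"] - 0.01) * 10 + noise

def propose_change(config, rng):
    """Hypothetical stand-in for the agent editing the code: scale one setting."""
    candidate = dict(config)
    candidate["lr"] = config["lr"] * rng.choice([0.5, 0.8, 1.25, 2.0])
    return candidate

def research_loop(config, steps, seed=0):
    rng = random.Random(seed)
    best_loss = train_and_evaluate(config, rng)
    history = [("baseline", best_loss)]
    for _ in range(steps):
        candidate = propose_change(config, rng)
        loss = train_and_evaluate(candidate, rng)
        if loss < best_loss:
            # Keep: the candidate becomes the new baseline.
            config, best_loss = candidate, loss
            history.append(("keep", loss))
        else:
            # Discard: the baseline stays exactly as it was.
            history.append(("discard", loss))
    return config, best_loss, history

best_config, best_loss, history = research_loop({"lr": 0.1}, steps=20)
print(best_config, round(best_loss, 4), len(history))
```

The important property is that a discarded experiment never touches the baseline, so the best result can only stay the same or improve from one iteration to the next.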

How I Understood It (The Mental Model)

At first, I thought this was just “LLM writes code, runs it.” But the key insight is the feedback loop:

Traditional:
  Human ──► Write code ──► Train ──► Evaluate ──► Human decides

autoresearch:
  Human ──► Write goals (Markdown) ──► AI takes over
                                    │
                                    ▼
                          AI ──► Code ──► Train ──► Evaluate ──► AI decides
                                    │                              │
                                    └──────────────────────────────┘
                                              (loops forever)

The human’s role shifts from doing experiments to defining experiments.

The Configuration: Just Markdown

Here’s what blew my mind. You don’t write Python. You write this:

# Goal
Minimize validation loss for character-level language model

# Constraints
- Training time: 5 minutes per experiment
- Hardware: Single GPU
- Model size: < 1M parameters

# Metrics
Track validation perplexity, training stability, inference speed

That’s it. The AI reads this, generates Python code, runs experiments, and iterates.
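I don’t know whether the framework parses this file or simply hands the text to the model, so treat the following as an illustration only. It is a small sketch of how a goal file like the one above could be split into sections with the standard library; `parse_goal_file` is a name I made up.

```python
GOAL_FILE = """\
# Goal
Minimize validation loss for character-level language model

# Constraints
- Training time: 5 minutes per experiment
- Hardware: Single GPU
- Model size: < 1M parameters

# Metrics
Track validation perplexity, training stability, inference speed
"""

def parse_goal_file(text):
    """Split a Markdown goal file into {heading: [non-empty lines]}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# "):
            # A top-level heading opens a new section.
            current = line[2:].strip()
            sections[current] = []
        elif line and current is not None:
            if line.startswith("- "):
                line = line[2:]  # drop the bullet marker
            sections[current].append(line)
    return sections

sections = parse_goal_file(GOAL_FILE)
# Constraints are written as "name: value", so they map cleanly onto a dict.
constraints = dict(item.split(": ", 1) for item in sections["Constraints"])
print(constraints["Training time"])  # 5 minutes per experiment
```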

What Actually Happens in One Loop

Let me trace through one iteration:

Step 1: The AI Analyzes Current State

Looks at the existing model architecture
Reviews previous experiment results
Checks what’s been tried before

Step 2: Proposes a Change

Maybe adjusts the learning rate
Maybe adds a new layer type
Maybe changes the activation function
Generates clean, modular Python code

Step 3: Trains for 5 Minutes

Spins up training on the GPU
Monitors loss curves
Logs metrics

Step 4: Evaluates Results

Compares new metrics to baseline
Checks if the change helped
Looks for training instabilities

Step 5: Makes a Decision

Keep: Change becomes new baseline
Discard: Revert to previous state
Branch: Try alternative approach

Then it repeats, until you stop it or it hits your target.
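Steps 4 and 5 are where the judgment lives, so here is a rough sketch of what that check could look like. The threshold `spike_factor` and the function names are my own assumptions, and I left out “Branch” because I couldn’t tell what triggers it.

```python
import math

def is_unstable(losses, spike_factor=3.0):
    """Flag a run whose loss went NaN/inf or spiked far above its running minimum."""
    best = math.inf
    for loss in losses:
        if not math.isfinite(loss):
            return True
        if best < math.inf and loss > spike_factor * best:
            return True
        best = min(best, loss)
    return False

def decide(baseline_loss, val_loss, train_losses, min_gain=0.0):
    """Return 'keep' or 'discard' for one experiment (assumed rule, not the framework's)."""
    if is_unstable(train_losses) or not math.isfinite(val_loss):
        return "discard"
    # Keep only if validation loss beat the baseline by more than min_gain.
    return "keep" if baseline_loss - val_loss > min_gain else "discard"

print(decide(1.50, 1.42, [3.0, 2.1, 1.6, 1.45]))    # keep
print(decide(1.50, 1.55, [3.0, 2.0, 1.6]))          # discard
print(decide(1.50, 1.40, [3.0, 2.0, float("nan")])) # discard
```

Setting `min_gain` above zero is one way to avoid keeping changes whose improvement is within run-to-run noise.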

The “100 Experiments Overnight” Claim

This is where it clicked for me. Traditional ML research:

Manual approach (8 hours):
  - Read papers, plan experiments: 2 hours
  - Write/debug code: 2 hours
  - Wait for training: 3 hours
  - Analyze results: 1 hour
  - Actual experiments run: ~3-5

autoresearch approach:

Automated approach (8 hours):
  - Human writes Markdown: 30 minutes
  - AI runs experiments continuously: 7.5 hours
  - Actual experiments run: ~90 (7.5 hours at 5 min each, before any per-run overhead)

The AI doesn’t need to sleep, eat, or debug typos. It just iterates.
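A quick back-of-the-envelope check on the numbers: at a flat 5 minutes per run, the count is just the time budget divided by the run length. That gives 90 runs in 7.5 hours, and you need about 8.3 hours (500 minutes) to reach a full 100. The `overhead_minutes` parameter is my own addition to show how setup time between runs eats into the total.

```python
def experiments_in(hours, minutes_per_run=5.0, overhead_minutes=0.0):
    """How many back-to-back runs fit into a time budget."""
    total_minutes = hours * 60
    return int(total_minutes // (minutes_per_run + overhead_minutes))

print(experiments_in(7.5))                        # 90
print(experiments_in(8.0))                        # 96
print(experiments_in(8.0, overhead_minutes=1.0))  # 80
```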

Key Components Under the Hood

I looked at the architecture:

Code Generation Engine

Parses your Markdown goals
Generates PyTorch/TensorFlow/JAX code
Follows ML best practices automatically

Training Orchestrator

Manages GPU allocation
Handles checkpointing
Implements fault tolerance

Evaluation Framework

Collects metrics automatically
Compares to baseline
Generates visualizations

Version Control Integration

Every experiment is a git commit
Can reproduce any experiment
Branch for parallel exploration
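The “every experiment is a git commit” idea is easy to sketch with ordinary git commands. This is my guess at one workable scheme, not the framework’s actual implementation: commit each attempt, and if it is discarded, use `git revert` so the failed attempt still stays in history. The helper names and the commit message format are hypothetical.

```python
import subprocess

def git(args, runner=subprocess.run):
    """Run one git command in the current directory; raises if git exits non-zero."""
    return runner(["git", *args], check=True, capture_output=True, text=True)

def record_experiment(name, kept, runner=subprocess.run):
    """Commit the working tree for an experiment, then undo it if it was discarded."""
    git(["add", "-A"], runner)
    # Note: this raises if the experiment changed no files (nothing to commit).
    git(["commit", "-m", f"experiment: {name}"], runner)
    if not kept:
        # Revert adds a new commit that undoes the change, so the discarded
        # attempt remains in the log and can still be checked out later.
        git(["revert", "--no-edit", "HEAD"], runner)

# Usage inside a repository, after the agent has edited the code:
#   record_experiment("lr-0.02", kept=False)
```

The `runner` parameter exists only so the functions can be exercised without touching a real repository.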

Agent Interface

Works with Claude, GPT-4, or other LLMs
Maintains context across experiments
Has safety constraints

What Makes This Different from “LLM Writes Code”

I initially dismissed this as “just another code generator.” But the difference is:

Code Generator:
  Prompt ──► Code ──► Done

autoresearch:
  Goal ──► Code ──► Run ──► Result ──► New Code ──► ...
          │                                    │
          └────────────────────────────────────┘
                    (feedback loop)

The feedback loop is the innovation. The AI sees the results of its changes and uses them to choose what to try next.
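One plausible way to close that loop is to fold past results into the next request to the model. The sketch below assumes a simple history of dicts with `change`, `val_loss`, and `decision` keys; that record shape and the prompt wording are mine, not something I found in the repo.

```python
def build_prompt(goal, history, max_entries=5):
    """Fold the most recent experiment results into the next request to the model."""
    lines = ["Goal:", goal.strip(), "", "Recent experiments (oldest first):"]
    recent = history[-max_entries:]
    if not recent:
        lines.append("(none yet)")
    for entry in recent:
        lines.append(
            f"- {entry['change']}: val_loss={entry['val_loss']:.3f} ({entry['decision']})"
        )
    lines += ["", "Propose one new change that has not been tried above."]
    return "\n".join(lines)

history = [
    {"change": "lr 0.1 -> 0.05", "val_loss": 1.42, "decision": "keep"},
    {"change": "add dropout 0.2", "val_loss": 1.47, "decision": "discard"},
]
print(build_prompt("Minimize validation loss", history))
```

Capping the history with `max_entries` keeps the prompt from growing without bound over a long overnight run.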

Limitations (What It Can’t Do Yet)

From what I’ve seen:

Requires clear metrics - If you can’t define “better,” the AI can’t optimize
Small models only - 5-minute training means tiny models
No big architectural leaps - It iterates, doesn’t invent fundamentally new approaches
GPU-bound - Still needs compute

When Would I Use This?

Real talk - this isn’t for:

Production model training (too small scale)
Novel architecture research (needs human creativity)
Quick prototypes (overkill)

But it IS useful for:

Hyperparameter exploration
Ablation studies
Finding surprising combinations that work
Overnight iteration when you’re stuck

The Real Insight

The framework isn’t about “AI replacing researchers.” It’s about removing the mechanical bottleneck from research.

Researchers still:

Define the problem space
Set constraints and goals
Interpret surprising results
Decide research direction

But the AI handles the tedious loop of “try this, check that, adjust, repeat.”

Getting Started

If you want to try it:

Clone the repo
Write your experiment.md
Point it at your Claude/GPT-4 API
Let it run overnight
Review results in the morning

The barrier to entry is essentially: “Can you write a Markdown file describing what you want optimized?”

This fits into a broader trend:

AutoML - Automated machine learning (hyperparameter tuning)
Neural Architecture Search - AI designing neural networks
LLM Agents - Language models taking actions

autoresearch is essentially “LLM Agent + AutoML” in a tight loop.

Final Thoughts

What struck me most: this changes the unit of work in ML research.

Before: “One experiment” (takes hours of human time)
After: “One goal file” (takes minutes of human time)

The AI handles the experiment density. You handle the research direction.

That’s not replacing researchers. That’s multiplying their iteration speed.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.

Here are the most important links from this article, along with some further resources on this topic:

👨‍💻 autoresearch - GitHub Repository
👨‍💻 Reddit Discussion: What is autoresearch and how does it work?

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!