What is autoresearch and how does it work?
I kept seeing “autoresearch” pop up in my feed. People were talking about running “100 ML experiments overnight” without writing a single line of Python. Sounded like hype. Then I saw Andrej Karpathy’s name attached to it.
So I dug in. Here’s what I found.
The Problem: ML Research is Tedious
Every ML researcher knows the drill:
- Have an idea for improving the model
- Modify code
- Train for hours
- Check results
- Tweak something
- Repeat
Most of this is not creative work. It’s mechanical iteration. You’re essentially a glorified button-clicker hoping something works.
Karpathy’s question: What if an AI could do this loop instead?
What autoresearch Actually Does
autoresearch is a minimalist framework that turns language models (Claude, GPT-4, etc.) into autonomous ML researchers. Here’s the core idea:
┌─────────────────────────────────────────┐
│                                         │
│  ┌──────────┐      ┌──────────────┐     │
│  │ AI Agent │ ───► │ Modify Code  │     │
│  └──────────┘      └──────────────┘     │
│       │                   │             │
│       │                   ▼             │
│       │            ┌──────────────┐     │
│       │            │ Train Model  │     │
│       │            │   (5 min)    │     │
│       │            └──────────────┘     │
│       │                   │             │
│       │                   ▼             │
│       │            ┌──────────────┐     │
│       └─────────── │   Evaluate   │     │
│                    └──────────────┘     │
│                           │             │
│                           ▼             │
│                    ┌──────────────┐     │
│                    │ Keep/Discard │     │
│                    └──────────────┘     │
│                           │             │
│                           ▼             │
│                    (repeat forever)     │
└─────────────────────────────────────────┘
The AI runs this loop indefinitely. It modifies Python code, trains for 5 minutes, evaluates performance, and decides whether to keep the changes.
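To make the loop concrete, here is a minimal Python sketch of it. Everything in it is hypothetical and mine, not taken from the autoresearch repo: `propose_change` stands in for the AI agent, and `train_and_evaluate` replaces the 5-minute training run with a toy loss function.

```python
import random

def propose_change(config, rng):
    """Stand-in for the AI agent (hypothetical): halve or double the learning rate."""
    new_config = dict(config)
    new_config["lr"] = config["lr"] * rng.choice([0.5, 2.0])
    return new_config

def train_and_evaluate(config):
    """Stand-in for a 5-minute training run: a toy loss that is lowest at lr = 0.01."""
    return abs(config["lr"] - 0.01)

def research_loop(config, n_iterations, seed=0):
    """Run the modify -> train -> evaluate -> keep/discard loop a fixed number of times."""
    rng = random.Random(seed)
    best_loss = train_and_evaluate(config)
    for _ in range(n_iterations):
        candidate = propose_change(config, rng)
        loss = train_and_evaluate(candidate)
        if loss < best_loss:
            # Keep: the candidate becomes the new baseline.
            config, best_loss = candidate, loss
        # Otherwise discard: the baseline config is left untouched.
    return config, best_loss

start = {"lr": 0.08}
final_config, final_loss = research_loop(start, n_iterations=20)
```

Because a change is only kept when it lowers the loss, the baseline can never get worse from one iteration to the next. That is the whole trick, as far as I can tell.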
How I Understood It (The Mental Model)
At first, I thought this was just “LLM writes code, runs it.” But the key insight is the feedback loop:
Traditional: Human ──► Write code ──► Train ──► Evaluate ──► Human decides
autoresearch: Human ──► Write goals (Markdown) ──► AI takes over
                                                         │
                                                         ▼
              AI ──► Code ──► Train ──► Evaluate ──► AI decides
              │                                          │
              └──────────────────────────────────────────┘
                             (loops forever)
The human’s role shifts from doing experiments to defining experiments.
The Configuration: Just Markdown
Here’s what blew my mind. You don’t write Python. You write this:
# Goal
Minimize validation loss for character-level language model
# Constraints
- Training time: 5 minutes per experiment
- Hardware: Single GPU
- Model size: < 1M parameters
# Metrics
Track validation perplexity, training stability, inference speed
That’s it. The AI reads this, generates Python code, runs experiments, and iterates.
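A goal file like this is easy to turn into structured data. The sketch below is my own illustration, not code from the repo: `parse_goal_file` is a hypothetical helper, and it assumes the file uses top-level `# ` headings with optional `- ` bullets underneath.

```python
def parse_goal_file(text):
    """Split a Markdown goal file into {heading: [content lines]} (hypothetical helper)."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# "):
            # A new top-level heading starts a new section.
            current = line[2:].strip()
            sections[current] = []
        elif current is not None:
            # Drop the bullet marker, if any, and store the content.
            if line.startswith("- "):
                line = line[2:]
            sections[current].append(line.strip())
    return sections

GOAL_MD = """\
# Goal
Minimize validation loss for character-level language model

# Constraints
- Training time: 5 minutes per experiment
- Hardware: Single GPU
- Model size: < 1M parameters

# Metrics
Track validation perplexity, training stability, inference speed
"""

sections = parse_goal_file(GOAL_MD)
```

Here `sections["Constraints"]` ends up as a list of three strings, one per bullet. In practice the agent probably just reads the Markdown as-is, so treat this as a way to think about the file's structure rather than a required step.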
What Actually Happens in One Loop
Let me trace through one iteration:
Step 1: The AI Analyzes Current State
- Looks at the existing model architecture
- Reviews previous experiment results
- Checks what’s been tried before
Step 2: Proposes a Change
- Maybe adjusts the learning rate
- Maybe adds a new layer type
- Maybe changes the activation function
- Generates clean, modular Python code
Step 3: Trains for 5 Minutes
- Spins up training on the GPU
- Monitors loss curves
- Logs metrics
Step 4: Evaluates Results
- Compares new metrics to baseline
- Checks if the change helped
- Looks for training instabilities
Step 5: Makes a Decision
- Keep: Change becomes new baseline
- Discard: Revert to previous state
- Branch: Try alternative approach
Then it repeats. Forever, until you stop it or it hits your target.
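Steps 4 and 5 boil down to a small decision rule. Here is one way it could look. This is a sketch under my own assumptions: the `tolerance` band and the rule "branch when the result is too close to call" are invented for illustration, and the real agent makes this call in natural language rather than with a fixed function.

```python
import math

def decide(baseline_loss, new_loss, loss_curve, tolerance=0.01):
    """Return "keep", "discard", or "branch" for one experiment (hypothetical rule)."""
    # Step 4: look for training instabilities (NaN or infinite loss values).
    if any(math.isnan(x) or math.isinf(x) for x in loss_curve):
        return "discard"
    # Step 5: compare the new metric to the baseline.
    if new_loss < baseline_loss - tolerance:
        return "keep"      # clearly better: becomes the new baseline
    if new_loss > baseline_loss + tolerance:
        return "discard"   # clearly worse: revert to the previous state
    return "branch"        # too close to call: try an alternative approach

curve = [1.2, 1.0, 0.9]
result = decide(baseline_loss=1.0, new_loss=0.9, loss_curve=curve)
```

With these numbers `result` is `"keep"`, since 0.9 beats the baseline by more than the tolerance.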
The “100 Experiments Overnight” Claim
This is where it clicked for me. Traditional ML research:
Manual approach (8 hours):
- Read papers, plan experiments: 2 hours
- Write/debug code: 2 hours
- Wait for training: 3 hours
- Analyze results: 1 hour
- Actual experiments run: ~3-5
autoresearch approach:
Automated approach (8 hours):
- Human writes Markdown: 30 minutes
- AI runs experiments continuously: 7.5 hours
- Actual experiments run: ~90 (7.5 hours at 5 min each), close to the headline 100
The AI doesn’t need to sleep, eat, or debug typos. It just iterates.
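The throughput number is easy to sanity-check with a back-of-the-envelope calculation. The `overhead_minutes` parameter is my own addition, to account for the time the agent spends thinking and editing code between runs. I don't have a measured value for it.

```python
def max_experiments(hours, minutes_per_experiment, overhead_minutes=0.0):
    """Upper bound on sequential experiments that fit in the given wall-clock time."""
    total_minutes = hours * 60
    return int(total_minutes // (minutes_per_experiment + overhead_minutes))

no_overhead = max_experiments(7.5, 5)        # 90
with_overhead = max_experiments(7.5, 5, 1)   # 75, assuming 1 minute of overhead per run
```

So a strict 5-minute budget gives 90 runs in 7.5 hours, and reaching a full 100 takes a bit over 8 hours (500 minutes) on a single GPU. The order of magnitude holds up either way.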
Key Components Under the Hood
I looked at the architecture:
Code Generation Engine
- Parses your Markdown goals
- Generates PyTorch/TensorFlow/JAX code
- Follows ML best practices automatically
Training Orchestrator
- Manages GPU allocation
- Handles checkpointing
- Implements fault tolerance
Evaluation Framework
- Collects metrics automatically
- Compares to baseline
- Generates visualizations
Version Control Integration
- Every experiment is a git commit
- Can reproduce any experiment
- Branch for parallel exploration
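Here is roughly how "every experiment is a git commit" could work in practice. This is a sketch of one possible scheme, not the repo's actual workflow: the commit message format and both function names are hypothetical. Only standard git subcommands are used (`add`, `commit`, `revert`).

```python
import subprocess

def git_commands(decision, experiment_id, val_loss):
    """Return the git commands (as argument lists) to record one experiment."""
    message = f"exp {experiment_id}: val_loss={val_loss:.4f} ({decision})"
    # Every experiment is committed, so any of them can be reproduced later.
    commands = [["git", "add", "-A"], ["git", "commit", "-m", message]]
    if decision == "discard":
        # Undo the change with a new commit, keeping the failed attempt in history.
        commands.append(["git", "revert", "--no-edit", "HEAD"])
    elif decision != "keep":
        raise ValueError(f"unknown decision: {decision!r}")
    return commands

def run_commands(commands, repo_dir="."):
    """Execute the commands inside an existing git repository."""
    for cmd in commands:
        subprocess.run(cmd, cwd=repo_dir, check=True)

kept = git_commands("keep", 7, 1.23456)
discarded = git_commands("discard", 8, 1.5)
```

Reverting instead of resetting means discarded experiments stay visible in `git log`, which matters if you want to see what the agent already tried.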
Agent Interface
- Works with Claude, GPT-4, or other LLMs
- Maintains context across experiments
- Has safety constraints
What Makes This Different from “LLM Writes Code”
I initially dismissed this as “just another code generator.” But the difference is:
Code Generator: Prompt ──► Code ──► Done
autoresearch: Goal ──► Code ──► Run ──► Result ──► New Code ──► ...
                        │                             │
                        └─────────────────────────────┘
                                (feedback loop)
The feedback loop is the innovation. The AI sees the results of its changes and uses them to decide what to try next.
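One plausible way to close that loop is to feed a summary of past experiments back into the model's prompt. The sketch below is my own guess at the shape of it: the `Experiment` record and the prompt wording are hypothetical, not the repo's actual format.

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    change: str       # short description of what was modified
    val_loss: float   # metric measured after training
    kept: bool        # whether the change became the new baseline

def build_prompt(goal, history, max_items=5):
    """Build the next prompt from the goal plus the most recent results."""
    lines = [f"Goal: {goal}", "Previous experiments:"]
    for exp in history[-max_items:]:
        status = "kept" if exp.kept else "discarded"
        lines.append(f"- {exp.change}: val_loss={exp.val_loss:.3f} ({status})")
    lines.append("Propose the next change.")
    return "\n".join(lines)

history = [
    Experiment("double learning rate", 1.42, kept=False),
    Experiment("add dropout 0.1", 1.31, kept=True),
]
prompt = build_prompt("Minimize validation loss", history)
```

A plain code generator never sees the lines under "Previous experiments". That is the difference in a nutshell.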
Limitations (What It Can’t Do Yet)
From what I’ve seen:
- Requires clear metrics - If you can’t define “better,” the AI can’t optimize
- Small models only - 5-minute training means tiny models
- No big architectural leaps - It iterates, doesn’t invent fundamentally new approaches
- GPU-bound - Still needs compute
When Would I Use This?
Real talk - this isn’t for:
- Production model training (too small scale)
- Novel architecture research (needs human creativity)
- Quick prototypes (overkill)
But it IS useful for:
- Hyperparameter exploration
- Ablation studies
- Finding surprising combinations that work
- Overnight iteration when you’re stuck
The Real Insight
The framework isn’t about “AI replacing researchers.” It’s about removing the mechanical bottleneck from research.
Researchers still:
- Define the problem space
- Set constraints and goals
- Interpret surprising results
- Decide research direction
But the AI handles the tedious loop of “try this, check that, adjust, repeat.”
Getting Started
If you want to try it:
- Clone the repo
- Write your experiment.md
- Point it at your Claude/GPT-4 API
- Let it run overnight
- Review results in the morning
The barrier to entry is essentially: “Can you write a Markdown file describing what you want optimized?”
Related Concepts
This fits into a broader trend:
- AutoML - Automated machine learning (hyperparameter tuning)
- Neural Architecture Search - AI designing neural networks
- LLM Agents - Language models taking actions
autoresearch is essentially “LLM Agent + AutoML” in a tight loop.
Final Thoughts
What struck me most: this changes the unit of work in ML research.
Before: “One experiment” (takes hours of human time)
After: “One goal file” (takes minutes of human time)
The AI handles the experiment density. You handle the research direction.
That’s not replacing researchers. That’s multiplying their iteration speed.
Final Words + More Resources
My intention with this article was to share my knowledge and experience with others. If you want to get in touch, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 autoresearch - GitHub Repository
- 👨💻 Reddit Discussion: What is autoresearch and how does it work?
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!