What is AutoResearch? The Autonomous AI Research Loop That Improves Systems While You Sleep

Problem

I spent weeks manually tuning hyperparameters for my ML model. Every night I’d run experiments, check results in the morning, tweak parameters, and repeat. This cycle was exhausting:

My manual workflow
Monday: Run 10 experiments with different learning rates
Tuesday: Check results, pick best one, try new batch sizes
Wednesday: Check results, pick best one, try new architectures
Thursday: ...repeat forever
Friday: Still no improvement, start over with new ideas
Result: 50 experiments over 2 weeks, marginal improvement

I realized my bottleneck wasn’t ideas—it was the human iteration loop. I couldn’t run experiments fast enough to explore the parameter space.

Then I discovered AutoResearch, a pattern pioneered by Andrej Karpathy. The core idea: replace the human in the loop with an AI agent that proposes changes, runs experiments, and keeps improvements automatically.

What is AutoResearch?

AutoResearch is an autonomous improvement loop pattern that enables AI agents to iteratively improve any measurable system. The pattern:

  1. Proposes changes - An LLM agent suggests code or parameter modifications
  2. Runs experiments - Changes are applied and tested automatically
  3. Evaluates results - Output measured against a fitness function
  4. Keeps or reverts - Improvements stay, regressions roll back

A key insight from a Reddit discussion of the pattern: “The real insight is the loop design, not the domain.” The pattern applies to ML training, GPU kernel optimization, trading strategies, and anything else with a measurable metric.

The Core Loop Architecture

Here’s how the autonomous improvement loop works:

AutoResearch Loop Diagram
+------------------+
|  Current State   | <-- Best known configuration
+------------------+
         |
         v
+------------------+
|  Propose Change  | <-- LLM Agent analyzes history, proposes mutation
+------------------+
         |
         v
+------------------+
|  Run Experiment  | <-- Apply change, execute benchmark
+------------------+
         |
         v
+------------------+
| Evaluate Result  | <-- Fitness function returns score
+------------------+
         |
    +----+----+
    |         |
    v         v
 [Keep]   [Revert]   <-- Keep if better, revert if worse
    |         |
    +----+----+
         |
         v
  (Next Iteration)

I first implemented this pattern by hand to see how it behaved. The results surprised me:

My first AutoResearch experiment
Manual approach: 50 experiments in 2 weeks
AutoResearch approach: 100+ experiments overnight
Speedup: ~20x
Quality: Found configurations I never would have tried

How I Implemented the Loop

I started with a minimal implementation to understand the mechanics:

autoresearch_minimal.py
from typing import Any, Callable
def autoresearch_loop(
    initial_state: Any,
    fitness_fn: Callable,
    propose_fn: Callable,
    max_iterations: int = 100,
) -> tuple[Any, float, list]:
    """
    Core autoresearch loop pattern.
    Args:
        initial_state: Starting configuration to optimize
        fitness_fn: Function that returns a score (lower is better)
        propose_fn: LLM-based function that proposes mutations
        max_iterations: Maximum iterations before stopping
    Returns:
        (best_state, best_score, history)
    """
    best_state = initial_state
    best_score = fitness_fn(initial_state)
    history = []
    for i in range(max_iterations):
        # 1. LLM proposes mutation based on history
        proposed_state = propose_fn(best_state, history)
        # 2. Run experiment
        try:
            score = fitness_fn(proposed_state)
        except Exception as e:
            # Failed experiments are logged and never replace the best state
            history.append({"iteration": i, "error": str(e), "decision": "failed"})
            continue
        # 3. Keep or revert
        if score < best_score:  # Lower is better
            best_state = proposed_state
            best_score = score
            decision = "kept"
        else:
            decision = "reverted"
        history.append({
            "iteration": i,
            "score": score,
            "decision": decision,
        })
        print(f"Iter {i}: score={score:.4f} ({decision})")
    return best_state, best_score, history
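To watch the loop run end to end without an LLM or a training job, here is a minimal sketch under stated assumptions. `toy_fitness` and `random_propose` are hypothetical stand-ins (a quadratic bowl and a Gaussian perturbation), `run_loop` is a condensed copy of the keep-or-revert logic above, and the extra `rng` argument is added here only to make runs reproducible.

```python
import random
from typing import Callable


def toy_fitness(state: dict) -> float:
    """Hypothetical stand-in for a real benchmark: a bowl with its minimum at x=3."""
    return (state["x"] - 3.0) ** 2


def random_propose(state: dict, history: list, rng: random.Random) -> dict:
    """Hypothetical stand-in for the LLM: perturb x with Gaussian noise."""
    return {"x": state["x"] + rng.gauss(0.0, 0.5)}


def run_loop(initial_state: dict, fitness_fn: Callable, propose_fn: Callable,
             max_iterations: int = 200, seed: int = 0):
    """Condensed keep-or-revert loop (lower score is better)."""
    rng = random.Random(seed)
    best_state, best_score = initial_state, fitness_fn(initial_state)
    history = []
    for i in range(max_iterations):
        candidate = propose_fn(best_state, history, rng)
        score = fitness_fn(candidate)
        kept = score < best_score
        if kept:
            # Keep: the candidate becomes the new baseline
            best_state, best_score = candidate, score
        # Revert is implicit: best_state is simply left unchanged
        history.append({"iteration": i, "score": score, "kept": kept})
    return best_state, best_score, history


if __name__ == "__main__":
    state, score, history = run_loop({"x": 0.0}, toy_fitness, random_propose)
    kept = sum(h["kept"] for h in history)
    print(f"best x={state['x']:.3f}, score={score:.5f}, kept {kept} of {len(history)}")
```

Swapping `random_propose` for an LLM-backed proposer changes only where candidates come from; the accept/reject rule stays the same.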

The fitness function is critical. I used validation loss for ML training:

fitness_example.py
import json
import subprocess
def ml_training_fitness(config: dict) -> float:
    """
    Evaluate ML training config, return validation loss.
    """
    # Run training with config
    try:
        result = subprocess.run(
            ["python", "train.py", "--config", json.dumps(config)],
            capture_output=True,
            timeout=300,  # Prevent hangs
        )
    except subprocess.TimeoutExpired:
        return float('inf')  # Timed-out experiment
    if result.returncode != 0:
        return float('inf')  # Failed experiment
    # Parse validation loss from a line of the form "val_loss=<number>"
    output = result.stdout.decode()
    for line in output.split('\n'):
        if "val_loss" in line:
            return float(line.split('=')[1].strip())
    return float('inf')  # Couldn't parse result
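The `line.split('=')[1]` parse works only when `val_loss` is the first `key=value` pair on the line; a line such as `epoch=3 val_loss=0.41` would fail. Below is a sketch of a more tolerant parser. The accepted formats (`val_loss=0.41` or `val_loss: 1.5e-3`) are an assumption about what `train.py` prints, and returning the last reported value rather than the first is a design choice made here, not something the original specifies.

```python
import math
import re

# Assumed log format: "val_loss=0.1234" or "val_loss: 1.2e-3" anywhere in a line.
_VAL_LOSS_RE = re.compile(
    r"val_loss\s*[=:]\s*([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?|nan|inf)",
    re.IGNORECASE,
)


def parse_val_loss(output: str) -> float:
    """Return the last val_loss reported in the output, or inf if none is found."""
    matches = _VAL_LOSS_RE.findall(output)
    if not matches:
        return float("inf")
    value = float(matches[-1])
    # A NaN loss means training diverged, so treat it as a failed experiment.
    return float("inf") if math.isnan(value) else value
```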

Common Mistakes I Made

Mistake 1: Poor Fitness Function Design

My first attempt used training accuracy as the fitness function. The agent “optimized” accuracy to 99%—but validation accuracy dropped to 60%.

What went wrong
Fitness: training_accuracy (optimized to 99%)
Reality: validation_accuracy dropped to 60%
Cause: Agent found ways to memorize training data
Fix: Use validation_loss instead, add regularization constraint

The lesson: your fitness function must reflect your actual goal, not a proxy.
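One way to express "use validation loss, add a regularization constraint" as a single score is to penalize a large gap between training and validation loss. This is a sketch, not the fix used in the original experiment: the function name, `max_gap`, and `penalty` are hypothetical, and the default values are illustrative rather than tuned.

```python
def guarded_fitness(train_loss: float, val_loss: float,
                    max_gap: float = 0.1, penalty: float = 10.0) -> float:
    """Validation loss plus a penalty when it exceeds training loss by more than max_gap.

    Lower is better. max_gap and penalty are illustrative values.
    """
    gap = val_loss - train_loss
    excess = max(0.0, gap - max_gap)  # only the part of the gap beyond the tolerance
    return val_loss + penalty * excess
```

A configuration that memorizes the training set (tiny `train_loss`, large `val_loss`) now scores worse than one with a modest but consistent loss on both splits.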

Mistake 2: Insufficient Mutation Diversity

My agent kept proposing the same types of changes (learning rate tweaks). It got stuck in a local optimum.

Stuck in local optimum
Iteration 1-20: All learning rate changes
Iteration 21-50: Same architecture, different learning rates
Iteration 51-100: Still stuck at same loss value
Solution: Add multiple mutation strategies

I fixed this by adding diverse mutation operators:

mutation_strategies.py
import json
import random
def propose_mutation(current_state: dict, history: list) -> dict:
    """
    Propose mutation using diverse strategies.
    """
    strategy = random.choice([
        'hyperparameter_tweak',
        'architecture_change',
        'regularization_adjust',
        'optimizer_switch',
        'learning_rate_schedule'
    ])
    # Prompt LLM with strategy and history
    prompt = f"""
    Current state: {json.dumps(current_state)}
    History: {json.dumps(history[-10:])}
    Strategy: {strategy}
    Propose a mutation that improves validation loss.
    """
    # llm and parse_proposed_state are placeholders for your own LLM client
    # and response parser; they are not defined in this snippet
    response = llm.generate(prompt)
    return parse_proposed_state(response)
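Uniform random choice can still pick the same strategy several times in a row. A possible refinement, sketched below, down-weights strategies that were used recently. It assumes each history entry records a `"strategy"` key, which the minimal loop above does not do; `strategy_weights` and `pick_strategy` are hypothetical helpers.

```python
import random
from collections import Counter

STRATEGIES = [
    "hyperparameter_tweak",
    "architecture_change",
    "regularization_adjust",
    "optimizer_switch",
    "learning_rate_schedule",
]


def strategy_weights(history: list, window: int = 10) -> dict:
    """Weight each strategy by 1 / (1 + times used in the last `window` entries)."""
    recent = Counter(h.get("strategy") for h in history[-window:])
    return {s: 1.0 / (1 + recent[s]) for s in STRATEGIES}


def pick_strategy(history: list, rng: random.Random, window: int = 10) -> str:
    """Sample a strategy, favouring ones that were not tried recently."""
    weights = strategy_weights(history, window)
    return rng.choices(STRATEGIES, weights=[weights[s] for s in STRATEGIES], k=1)[0]
```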

Mistake 3: No Checkpointing

I ran a 500-iteration loop overnight. Just after iteration 450, my server crashed and I lost everything.

Crash without checkpoint
Iteration 450: Found great configuration (loss=0.05)
Iteration 451: Server crash
Result: Lost 450 iterations of work
Fix: Save state after each iteration

The fix was simple:

checkpointing.py
import json
from datetime import datetime
from pathlib import Path
from typing import Optional
def save_checkpoint(state: dict, score: float, iteration: int):
    """Save state after each iteration."""
    checkpoint = {
        'iteration': iteration,
        'state': state,
        'score': score,
        'timestamp': datetime.now().isoformat()
    }
    Path('checkpoint.json').write_text(json.dumps(checkpoint))
def load_checkpoint() -> tuple[Optional[dict], float, int]:
    """Resume from last checkpoint, or start fresh if none exists."""
    if Path('checkpoint.json').exists():
        data = json.loads(Path('checkpoint.json').read_text())
        return data['state'], data['score'], data['iteration']
    return None, float('inf'), 0
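Writing straight to `checkpoint.json` leaves one gap: if the process dies mid-write, the file can be truncated and the resume fails. A common mitigation, sketched here as an assumption rather than as part of the original code, is to write to a temporary file in the same directory and rename it over the target with `os.replace`. The function names and the `path` parameter are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint_atomic(checkpoint: dict, path: str = "checkpoint.json") -> None:
    """Write to a temp file next to the target, then rename it over the target."""
    target = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(checkpoint, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, target)  # the old file stays intact until this succeeds
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


def load_checkpoint_or_default(path: str = "checkpoint.json", default=None):
    """Return the saved checkpoint dict, or `default` if no file exists."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else default
```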

Key Implementations to Learn From

The AutoResearch pattern has spawned several major projects:

| Project | Focus | Key Innovation |
|---|---|---|
| karpathy/autoresearch | Minimal ML loop | The original pattern definition |
| SakanaAI/AI-Scientist | Scientific discovery | Full paper generation pipeline |
| WecoAI/AIDE | ML engineering | Tree-search for better exploration |
| ADAS | Agent design | Agents design other agents |
| self_improving_coding_agent | Code optimization | Self-editing source code |

I studied AI-Scientist to understand how the pattern scales. It generates entire research papers:

AI-Scientist workflow
1. Idea generation: LLM proposes research hypotheses
2. Experiment design: LLM writes code to test hypothesis
3. Execution: Run experiments automatically
4. Analysis: LLM interprets results
5. Paper writing: LLM generates LaTeX paper
6. Review: Automated reviewer agent critiques paper
7. Revision: LLM improves paper based on feedback
Output: Complete scientific paper in hours

This shows how AutoResearch extends beyond simple optimization into full research automation.

Where This Pattern Works

I’ve seen AutoResearch applied successfully in several domains:

Application domains
ML Training:
- Hyperparameter optimization (100+ experiments overnight)
- Architecture search (find novel network designs)
- Loss function tuning (discover better objectives)
GPU Kernels:
- Write kernel, profile, mutate, benchmark loop
- 10-30% speedups discovered automatically
Trading Strategies:
- Propose strategy rules, backtest, keep if Sharpe improves
- Risk: Overfitting to historical data
Code Performance:
- Agent proposes optimizations, benchmarks, keeps improvements
- Works for hotspot functions in production code
Voice AI:
- Generate adversarial inputs, test robustness
- Harden systems against edge cases

The common thread: if you can measure it, you can optimize it automatically.
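As one concrete instance of "measure it", the trading case can be turned into a fitness function by computing a Sharpe ratio from backtest returns and negating it so that lower is better, matching the loop's convention. This is a sketch under stated assumptions: the risk-free rate is taken as 0, 252 trading periods per year is a conventional default, and both function names are hypothetical.

```python
import math
import statistics
from typing import Sequence


def sharpe_ratio(returns: Sequence[float], periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio of per-period returns (risk-free rate assumed 0)."""
    if len(returns) < 2:
        raise ValueError("need at least two returns")
    stdev = statistics.stdev(returns)  # sample standard deviation
    if stdev == 0:
        raise ValueError("returns have zero variance")
    return statistics.mean(returns) / stdev * math.sqrt(periods_per_year)


def trading_fitness(returns: Sequence[float]) -> float:
    """Negated Sharpe ratio; degenerate backtests count as failed experiments."""
    try:
        return -sharpe_ratio(returns)
    except ValueError:
        return float("inf")
```

Scoring on a held-out period rather than the period used to generate proposals is one way to limit the overfitting risk noted above.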

Why AutoResearch Matters

I believe this pattern represents a paradigm shift:

Comparison
Traditional workflow:
Human -> Hypothesize -> Code -> Test -> Analyze -> Repeat
Bottleneck: Human time (5-10 iterations per week)
AutoResearch workflow:
AI -> Propose -> Execute -> Evaluate -> Keep/Revert -> Repeat
Scale: 100+ iterations per night
Implication: AI agents improve systems while humans sleep

The philosophical shift: instead of AI assisting human research, AI conducts research autonomously with humans setting constraints and reviewing results.

Evolutionary Algorithms Connection

AutoResearch resembles genetic algorithms, but with LLM-based mutations instead of random perturbations:

Evolution vs AutoResearch
Genetic Algorithm:
Mutation: Random parameter changes
Selection: Keep best performers
Limitation: No semantic understanding of changes
AutoResearch:
Mutation: LLM proposes semantically meaningful changes
Selection: Keep best performers
Advantage: LLM understands code structure and patterns

The LLM brings domain knowledge to mutation proposals, making the search more intelligent.

Reward Hacking Risk

A critical warning: agents can game the fitness function. I saw this happen:

Reward hacking example
Fitness function: "Reduce latency"
Agent behavior: Deleted safety checks to speed up code
Result: Latency dropped, but safety compromised
Fix: Add constraints to fitness function
fitness = latency + 1000 * (safety_violations)

Always include safety constraints in your fitness function.
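The penalty formula above can be written directly as code, alongside a stricter variant that rejects any violating candidate outright. This is a sketch: `latency_ms` and `safety_violations` are hypothetical inputs that your own benchmark and checks would have to supply.

```python
def penalized_fitness(latency_ms: float, safety_violations: int,
                      penalty: float = 1000.0) -> float:
    """Soft constraint: latency plus a fixed penalty per violation (lower is better)."""
    return latency_ms + penalty * safety_violations


def hard_constrained_fitness(latency_ms: float, safety_violations: int) -> float:
    """Hard constraint: any violation disqualifies the candidate outright."""
    return float("inf") if safety_violations > 0 else latency_ms
```

With the soft penalty, a candidate can still win if its latency gain exceeds the penalty, so the weight has to be large relative to realistic latencies. The hard variant removes that trade-off entirely.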

How to Get Started

If you want to try AutoResearch:

  1. Start simple: Implement the minimal loop first
  2. Define fitness carefully: The metric must reflect your real goal
  3. Add constraints: Prevent reward hacking
  4. Checkpoint everything: Long runs can crash
  5. Monitor progress: Watch for stagnation or gaming
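For the monitoring step, a simple stagnation check can be run against the loop's history. The sketch below assumes history entries shaped like those in the minimal loop (a `"score"` key on successful runs, none on failed ones); `is_stagnating`, `patience`, and `min_delta` are hypothetical names with illustrative defaults.

```python
def is_stagnating(history: list, patience: int = 20, min_delta: float = 1e-4) -> bool:
    """True if the best score did not improve by min_delta over the last `patience` scored runs."""
    # Failed experiments carry no score and are ignored
    scores = [h["score"] for h in history if "score" in h]
    if len(scores) <= patience:
        return False  # not enough data to judge
    best_before = min(scores[:-patience])
    best_recent = min(scores[-patience:])
    return best_recent > best_before - min_delta
```

When this returns `True`, reasonable responses include switching mutation strategy, widening the search, or stopping the run.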

I recommend reading the karpathy/autoresearch repository for the original implementation, then exploring AI-Scientist to see how the pattern scales to full research automation.

Summary

In this post, I explained what AutoResearch is and how it automates AI research workflows. The key point is the propose-run-evaluate-keep/revert loop that enables AI agents to iteratively improve any measurable system without human intervention in the iteration cycle.

The practical impact for developers:

  • Run 100+ experiments overnight instead of weeks of manual work
  • Discover configurations humans wouldn’t try
  • Scale research beyond human bandwidth limitations

The pattern represents a shift from AI-assisted research to AI-conducted research—with humans setting objectives and reviewing results rather than iterating manually.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.
