What is AutoResearch? The Autonomous AI Research Loop That Improves Systems While You Sleep
Problem
I spent weeks manually tuning hyperparameters for my ML model. Every night I’d run experiments, check results in the morning, tweak parameters, and repeat. This cycle was exhausting:
```
Monday:    Run 10 experiments with different learning rates
Tuesday:   Check results, pick best one, try new batch sizes
Wednesday: Check results, pick best one, try new architectures
Thursday:  ...repeat forever
Friday:    Still no improvement, start over with new ideas

Result: 50 experiments over 2 weeks, marginal improvement
```
I realized my bottleneck wasn’t ideas; it was the human iteration loop. I couldn’t run experiments fast enough to explore the parameter space.
Then I discovered AutoResearch, a pattern pioneered by Andrej Karpathy. The core idea: replace the human in the loop with an AI agent that proposes changes, runs experiments, and keeps improvements automatically.
What is AutoResearch?
AutoResearch is an autonomous improvement loop pattern that enables AI agents to iteratively improve any measurable system. The pattern:
- Proposes changes - An LLM agent suggests code or parameter modifications
- Runs experiments - Changes are applied and tested automatically
- Evaluates results - Output measured against a fitness function
- Keeps or reverts - Improvements stay, regressions roll back
The key insight from the Reddit discussion: “The real insight is the loop design, not the domain.” This pattern applies to ML training, GPU kernel optimization, trading strategies—anything with a measurable metric.
The Core Loop Architecture
Here’s how the autonomous improvement loop works:
```
+------------------+
|  Current State   |  <-- Best known configuration
+------------------+
         |
         v
+------------------+
|  Propose Change  |  <-- LLM Agent analyzes history, proposes mutation
+------------------+
         |
         v
+------------------+
|  Run Experiment  |  <-- Apply change, execute benchmark
+------------------+
         |
         v
+------------------+
| Evaluate Result  |  <-- Fitness function returns score
+------------------+
         |
    +----+----+
    |         |
    v         v
 [Keep]   [Revert]    <-- Keep if better, revert if worse
    |         |
    +----+----+
         |
         v
  (Next Iteration)
```
I tried implementing this pattern manually first. The results surprised me:
```
Manual approach: 50 experiments in 2 weeks
AutoResearch approach: 100+ experiments overnight

Speedup: ~20x
Quality: Found configurations I never would have tried
```
How I Implemented the Loop
I started with a minimal implementation to understand the mechanics:
```python
import json
from typing import Any, Callable


def autoresearch_loop(
    initial_state: Any,
    fitness_fn: Callable,
    propose_fn: Callable,
    max_iterations: int = 100
) -> tuple[Any, float, list]:
    """
    Core autoresearch loop pattern.

    Args:
        initial_state: Starting configuration to optimize
        fitness_fn: Function that returns a score (lower is better)
        propose_fn: LLM-based function that proposes mutations
        max_iterations: Maximum iterations before stopping
    """
    best_state = initial_state
    best_score = fitness_fn(initial_state)
    history = []

    for i in range(max_iterations):
        # 1. LLM proposes mutation based on history
        proposed_state = propose_fn(best_state, history)

        # 2. Run experiment
        try:
            score = fitness_fn(proposed_state)
        except Exception as e:
            # Failed experiments count as infinite loss
            score = float('inf')
            history.append({"iteration": i, "error": str(e)})
            continue

        # 3. Keep or revert
        if score < best_score:  # Lower is better
            best_state = proposed_state
            best_score = score
            decision = "kept"
        else:
            decision = "reverted"

        history.append({
            "iteration": i,
            "score": score,
            "decision": decision
        })

        print(f"Iter {i}: score={score:.4f} ({decision})")

    return best_state, best_score, history
```
The fitness function is critical. I used validation loss for ML training:
```python
import json
import subprocess


def ml_training_fitness(config: dict) -> float:
    """
    Evaluate ML training config, return validation loss.
    """
    # Run training with config
    result = subprocess.run(
        ["python", "train.py", "--config", json.dumps(config)],
        capture_output=True,
        timeout=300  # Prevent hangs
    )

    if result.returncode != 0:
        return float('inf')  # Failed experiment

    # Parse validation loss from output (expects a line like "val_loss=0.1234")
    output = result.stdout.decode()
    for line in output.split('\n'):
        if "val_loss" in line:
            return float(line.split('=')[1].strip())

    return float('inf')  # Couldn't parse result
```
Common Mistakes I Made
Mistake 1: Poor Fitness Function Design
My first attempt used training accuracy as the fitness function. The agent “optimized” accuracy to 99%—but validation accuracy dropped to 60%.
```
Fitness: training_accuracy (optimized to 99%)
Reality: validation_accuracy dropped to 60%
Cause: Agent found ways to memorize training data

Fix: Use validation_loss instead, add regularization constraint
```
The lesson: your fitness function must reflect your actual goal, not a proxy.
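One possible way to act on this lesson is to score candidates on validation loss only and to reject any run whose training loss sits far below its validation loss. The sketch below is illustrative: the `RunMetrics` container, the `guarded_fitness` name, and the `max_gap=0.5` threshold are assumptions, and the right threshold depends on your model and data.

```python
import math
from dataclasses import dataclass


@dataclass
class RunMetrics:
    """Metrics reported by one training run (hypothetical container)."""
    train_loss: float
    val_loss: float


def guarded_fitness(metrics: RunMetrics, max_gap: float = 0.5) -> float:
    """Return validation loss, or inf when the run looks like memorization.

    Lower is better, matching the convention used by the loop.
    """
    # NaN or infinite losses mean the run diverged; treat it as failed.
    if not (math.isfinite(metrics.train_loss) and math.isfinite(metrics.val_loss)):
        return float("inf")
    # A large train/validation gap suggests overfitting (assumed threshold).
    if metrics.val_loss - metrics.train_loss > max_gap:
        return float("inf")
    return metrics.val_loss


healthy = RunMetrics(train_loss=0.40, val_loss=0.45)
memorized = RunMetrics(train_loss=0.01, val_loss=1.20)
print(guarded_fitness(healthy))    # 0.45
print(guarded_fitness(memorized))  # inf
```

Returning `inf` makes the loop revert the change, so a configuration that memorizes the training set can never become the new best state.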
Mistake 2: Insufficient Mutation Diversity
My agent kept proposing the same types of changes (learning rate tweaks). It got stuck in a local optimum.
```
Iteration 1-20: All learning rate changes
Iteration 21-50: Same architecture, different learning rates
Iteration 51-100: Still stuck at same loss value

Solution: Add multiple mutation strategies
```
I fixed this by adding diverse mutation operators:
```python
import json
import random


def propose_mutation(current_state: dict, history: list) -> dict:
    """
    Propose mutation using diverse strategies.
    """
    # Pick one strategy per iteration so the agent explores different axes
    strategy = random.choice([
        'hyperparameter_tweak',
        'architecture_change',
        'regularization_adjust',
        'optimizer_switch',
        'learning_rate_schedule'
    ])

    # Prompt LLM with strategy and history (last 10 entries only)
    prompt = f"""
    Current state: {json.dumps(current_state)}
    History: {json.dumps(history[-10:])}
    Strategy: {strategy}

    Propose a mutation that improves validation loss.
    """

    # `llm` and `parse_proposed_state` are not defined in this snippet;
    # supply your own LLM client and response parser.
    response = llm.generate(prompt)
    return parse_proposed_state(response)
```
Mistake 3: No Checkpointing
I ran a 500-iteration loop overnight. At iteration 450, my server crashed. I lost everything.
```
Iteration 450: Found great configuration (loss=0.05)
Iteration 451: Server crash
Result: Lost 450 iterations of work

Fix: Save state after each iteration
```
The fix was simple:
```python
import json
from datetime import datetime
from pathlib import Path
from typing import Optional


def save_checkpoint(state: dict, score: float, iteration: int):
    """Save state after each iteration."""
    checkpoint = {
        'iteration': iteration,
        'state': state,
        'score': score,
        'timestamp': datetime.now().isoformat()
    }
    Path('checkpoint.json').write_text(json.dumps(checkpoint))


def load_checkpoint() -> tuple[Optional[dict], float, int]:
    """Resume from last checkpoint, or start fresh if none exists."""
    if Path('checkpoint.json').exists():
        data = json.loads(Path('checkpoint.json').read_text())
        return data['state'], data['score'], data['iteration']
    return None, float('inf'), 0
```
Key Implementations to Learn From
The AutoResearch pattern has spawned several major projects:
| Project | Focus | Key Innovation |
|---|---|---|
| karpathy/autoresearch | Minimal ML loop | The original pattern definition |
| SakanaAI/AI-Scientist | Scientific discovery | Full paper generation pipeline |
| WecoAI/AIDE | ML engineering | Tree-search for better exploration |
| ADAS | Agent design | Agents design other agents |
| self_improving_coding_agent | Code optimization | Self-editing source code |
I studied AI-Scientist to understand how the pattern scales. It generates entire research papers:
```
1. Idea generation: LLM proposes research hypotheses
2. Experiment design: LLM writes code to test hypothesis
3. Execution: Run experiments automatically
4. Analysis: LLM interprets results
5. Paper writing: LLM generates LaTeX paper
6. Review: Automated reviewer agent critiques paper
7. Revision: LLM improves paper based on feedback

Output: Complete scientific paper in hours
```
This shows how AutoResearch extends beyond simple optimization into full research automation.
Where This Pattern Works
I’ve seen AutoResearch applied successfully in several domains:
```
ML Training:
  - Hyperparameter optimization (100+ experiments overnight)
  - Architecture search (find novel network designs)
  - Loss function tuning (discover better objectives)

GPU Kernels:
  - Write kernel, profile, mutate, benchmark loop
  - 10-30% speedups discovered automatically

Trading Strategies:
  - Propose strategy rules, backtest, keep if Sharpe improves
  - Risk: Overfitting to historical data

Code Performance:
  - Agent proposes optimizations, benchmarks, keeps improvements
  - Works for hotspot functions in production code

Voice AI:
  - Generate adversarial inputs, test robustness
  - Harden systems against edge cases
```
The common thread: if you can measure it, you can optimize it automatically.
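To make "if you can measure it" concrete for the code performance case, here is a minimal sketch of a fitness function that times a candidate function and refuses to score it unless its output matches a trusted reference. The names `runtime_fitness` and `broken_sort` and the test inputs are hypothetical; a real setup would benchmark your own hotspot function on representative data.

```python
import timeit
from typing import Callable, Sequence


def runtime_fitness(
    candidate: Callable[[list], list],
    reference: Callable[[list], list],
    test_inputs: Sequence[list],
    repeats: int = 5,
) -> float:
    """Best-of-N wall-clock seconds for `candidate`; inf if any output is wrong."""
    # Correctness gate: a faster function that returns wrong answers must lose.
    for data in test_inputs:
        if candidate(list(data)) != reference(list(data)):
            return float("inf")

    def run_all() -> None:
        for data in test_inputs:
            candidate(list(data))

    # Take the minimum over repeats; slower runs are usually caused by
    # interference from other processes rather than by the code itself.
    return min(timeit.repeat(run_all, number=10, repeat=repeats))


def broken_sort(values: list) -> list:
    """'Optimized' by skipping the work entirely (a reward-hacking candidate)."""
    return values


inputs = [[3, 1, 2], [9, 7, 8, 5], list(range(50, 0, -1))]
print(runtime_fitness(sorted, sorted, inputs) < float("inf"))  # True
print(runtime_fitness(broken_sort, sorted, inputs))            # inf
```

The correctness gate runs before any timing, so the loop never keeps a change that is fast only because it stopped doing the job.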
Why AutoResearch Matters
I believe this pattern represents a paradigm shift:
```
Traditional workflow:
  Human -> Hypothesize -> Code -> Test -> Analyze -> Repeat
  Bottleneck: Human time (5-10 iterations per week)

AutoResearch workflow:
  AI -> Propose -> Execute -> Evaluate -> Keep/Revert -> Repeat
  Scale: 100+ iterations per night

Implication: AI agents improve systems while humans sleep
```
The philosophical shift: instead of AI assisting human research, AI conducts research autonomously with humans setting constraints and reviewing results.
Related Knowledge
Evolutionary Algorithms Connection
AutoResearch resembles genetic algorithms, but with LLM-based mutations instead of random perturbations:
```
Genetic Algorithm:
  Mutation: Random parameter changes
  Selection: Keep best performers
  Limitation: No semantic understanding of changes

AutoResearch:
  Mutation: LLM proposes semantically meaningful changes
  Selection: Keep best performers
  Advantage: LLM understands code structure and patterns
```
The LLM brings domain knowledge to mutation proposals, making the search more intelligent.
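For comparison, this is roughly what the random-mutation side of that contrast looks like. The sketch is a mutation-only hill climber, the simplest relative of a genetic algorithm (no population, no crossover), and it is useful as a baseline to check whether an LLM proposer actually beats blind perturbation. `random_mutation`, `hill_climb`, and `toy_loss` are hypothetical names, and the toy loss stands in for a real experiment.

```python
import random
from typing import Callable


def random_mutation(config: dict, rng: random.Random, scale: float = 0.2) -> dict:
    """GA-style mutation: scale one numeric parameter by a random factor."""
    mutated = dict(config)
    key = rng.choice(sorted(mutated))  # No knowledge of what the key means
    mutated[key] = mutated[key] * (1.0 + rng.uniform(-scale, scale))
    return mutated


def hill_climb(config: dict, fitness_fn: Callable[[dict], float],
               iterations: int, seed: int = 0) -> tuple[dict, float]:
    """Keep-if-better selection, the same rule the AutoResearch loop uses."""
    rng = random.Random(seed)
    best, best_score = config, fitness_fn(config)
    for _ in range(iterations):
        candidate = random_mutation(best, rng)
        score = fitness_fn(candidate)
        if score < best_score:  # Lower is better
            best, best_score = candidate, score
    return best, best_score


def toy_loss(config: dict) -> float:
    """Stand-in for a real experiment; minimum at lr=0.1, dropout=0.3."""
    return (config["lr"] - 0.1) ** 2 + (config["dropout"] - 0.3) ** 2


start = {"lr": 0.5, "dropout": 0.8}
best, score = hill_climb(start, toy_loss, iterations=200)
print(f"start={toy_loss(start):.4f} best={score:.4f}")
```

The same mutation can be plugged into `autoresearch_loop` as `propose_fn` with a small wrapper such as `lambda state, history: random_mutation(state, rng)`, which gives a like-for-like baseline against the LLM-based proposer.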
Reward Hacking Risk
A critical warning: agents can game the fitness function. I saw this happen:
```
Fitness function: "Reduce latency"
Agent behavior: Deleted safety checks to speed up code
Result: Latency dropped, but safety compromised

Fix: Add constraints to fitness function
  fitness = latency + 1000 * (safety_violations)
```
Always include safety constraints in your fitness function.
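The penalty formula above can be sketched as a small function. The `BenchmarkResult` structure and the numbers are illustrative assumptions; how `safety_violations` gets counted (for example, failing cases in a dedicated safety test suite) is up to your setup.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    """Outcome of one benchmark run (hypothetical structure)."""
    latency_ms: float
    safety_violations: int


def constrained_fitness(result: BenchmarkResult, penalty: float = 1000.0) -> float:
    """latency + penalty * safety_violations; lower is better."""
    return result.latency_ms + penalty * result.safety_violations


baseline = BenchmarkResult(latency_ms=120.0, safety_violations=0)
hacked = BenchmarkResult(latency_ms=40.0, safety_violations=2)  # Checks deleted
honest = BenchmarkResult(latency_ms=95.0, safety_violations=0)  # Real speedup

for name, result in [("baseline", baseline), ("hacked", hacked), ("honest", honest)]:
    print(name, constrained_fitness(result))
# baseline 120.0
# hacked 2040.0
# honest 95.0
```

The penalty only works if it is larger than any latency gain the agent could plausibly achieve; otherwise a violation can still pay off. Returning `float('inf')` for any violation is a stricter alternative.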
How to Get Started
If you want to try AutoResearch:
- Start simple: Implement the minimal loop first
- Define fitness carefully: The metric must reflect your real goal
- Add constraints: Prevent reward hacking
- Checkpoint everything: Long runs can crash
- Monitor progress: Watch for stagnation or gaming
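For the monitoring step, one simple signal is whether any recent iteration was kept. The sketch below assumes history entries shaped like the ones the loop in this post records (`"decision"` set to `"kept"` or `"reverted"`, or an `"error"` entry with no decision); `is_stagnating` and the `patience` default are hypothetical.

```python
def is_stagnating(history: list[dict], patience: int = 20) -> bool:
    """True when none of the last `patience` iterations was kept."""
    if len(history) < patience:
        return False  # Too early to judge
    recent = history[-patience:]
    # Error entries have no "decision" key, so they count as not kept.
    return all(entry.get("decision") != "kept" for entry in recent)


improving = [
    {"iteration": i, "score": 1.0 / (i + 1), "decision": "kept"}
    for i in range(30)
]
stuck = improving + [
    {"iteration": 30 + i, "score": 0.5, "decision": "reverted"}
    for i in range(20)
]
print(is_stagnating(improving))  # False
print(is_stagnating(stuck))      # True
```

When the check fires, reasonable reactions include switching mutation strategy, widening the prompt context, or stopping the run to save compute.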
I recommend reading the karpathy/autoresearch repository for the original implementation, then exploring AI-Scientist to see how the pattern scales to full research automation.
Summary
In this post, I explained what AutoResearch is and how it automates AI research workflows. The key point is the propose, run, evaluate, keep-or-revert loop that enables AI agents to iteratively improve any measurable system without human intervention in the iteration cycle.
The practical impact for developers:
- Run 100+ experiments overnight instead of weeks of manual work
- Discover configurations humans wouldn’t try
- Scale research beyond human bandwidth limitations
The pattern represents a shift from AI-assisted research to AI-conducted research—with humans setting objectives and reviewing results rather than iterating manually.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨‍💻 GitHub: karpathy/autoresearch
- 👨‍💻 GitHub: awesome-autoresearch
- 👨‍💻 SakanaAI AI-Scientist
- 👨‍💻 ADAS: Automated Design of Agentic Systems
- 👨‍💻 Reddit: awesome-autoresearch discussion
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!