What is AutoResearch? The Autonomous AI Research Loop That Improves Systems While You Sleep

Mar 30, 2026

Problem

I spent weeks manually tuning hyperparameters for my ML model. Every night I’d run experiments, check results in the morning, tweak parameters, and repeat. This cycle was exhausting:

Monday: Run 10 experiments with different learning rates
Tuesday: Check results, pick best one, try new batch sizes
Wednesday: Check results, pick best one, try new architectures
Thursday: ...repeat forever
Friday: Still no improvement, start over with new ideas

Result: 50 experiments over 2 weeks, marginal improvement

I realized my bottleneck wasn’t ideas—it was the human iteration loop. I couldn’t run experiments fast enough to explore the parameter space.

Then I discovered AutoResearch, a pattern pioneered by Andrej Karpathy. The core idea: replace the human in the loop with an AI agent that proposes changes, runs experiments, and keeps improvements automatically.

What is AutoResearch?

AutoResearch is an autonomous improvement loop pattern that enables AI agents to iteratively improve any measurable system. The pattern:

Proposes changes - An LLM agent suggests code or parameter modifications
Runs experiments - Changes are applied and tested automatically
Evaluates results - Output measured against a fitness function
Keeps or reverts - Improvements stay, regressions roll back

The key insight from the Reddit discussion: “The real insight is the loop design, not the domain.” This pattern applies to ML training, GPU kernel optimization, trading strategies—anything with a measurable metric.

The Core Loop Architecture

Here’s how the autonomous improvement loop works:

+------------------+
|  Current State   |  <-- Best known configuration
+------------------+
        |
        v
+------------------+
|  Propose Change  |  <-- LLM Agent analyzes history, proposes mutation
+------------------+
        |
        v
+------------------+
|  Run Experiment  |  <-- Apply change, execute benchmark
+------------------+
        |
        v
+------------------+
| Evaluate Result  |  <-- Fitness function returns score
+------------------+
        |
   +----+----+
   |         |
   v         v
[Keep]    [Revert]  <-- Keep if better, revert if worse
   |         |
   +----+----+
        |
        v
   (Next Iteration)

Before automating anything, I had been running this loop by hand. Comparing the two approaches surprised me:

Manual approach: 50 experiments in 2 weeks
AutoResearch approach: 100+ experiments overnight

Speedup: ~20x per working day (about 5 experiments a day by hand vs. 100+ per night)
Quality: Found configurations I never would have tried

How I Implemented the Loop

I started with a minimal implementation to understand the mechanics:

import json
from typing import Any, Callable

def autoresearch_loop(
    initial_state: Any,
    fitness_fn: Callable,
    propose_fn: Callable,
    max_iterations: int = 100
) -> tuple[Any, float, list]:
    """
    Core autoresearch loop pattern.

    Args:
        initial_state: Starting configuration to optimize
        fitness_fn: Function that returns a score (lower is better)
        propose_fn: LLM-based function that proposes mutations
        max_iterations: Maximum iterations before stopping
    """
    best_state = initial_state
    best_score = fitness_fn(initial_state)
    history = []

    for i in range(max_iterations):
        # 1. LLM proposes mutation based on history
        proposed_state = propose_fn(best_state, history)

        # 2. Run experiment
        try:
            score = fitness_fn(proposed_state)
        except Exception as e:
            # Failed experiments are logged and skipped; best_state is unchanged
            history.append({"iteration": i, "error": str(e)})
            continue

        # 3. Keep or revert
        if score < best_score:  # Lower is better
            best_state = proposed_state
            best_score = score
            decision = "kept"
        else:
            decision = "reverted"

        history.append({
            "iteration": i,
            "score": score,
            "decision": decision
        })

        print(f"Iter {i}: score={score:.4f} ({decision})")

    return best_state, best_score, history
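
To try the loop without an LLM or a training job, a toy run can help. The sketch below is only an illustration: `toy_fitness` and `random_propose` are hypothetical stand-ins for a real training run and a real LLM proposer, and the loop is repeated in condensed form so the snippet runs on its own.

```python
import random
from typing import Any, Callable


def autoresearch_loop(initial_state: Any, fitness_fn: Callable, propose_fn: Callable,
                      max_iterations: int = 100) -> tuple[Any, float, list]:
    """Condensed copy of the loop above so this snippet is self-contained."""
    best_state, best_score, history = initial_state, fitness_fn(initial_state), []
    for i in range(max_iterations):
        proposed = propose_fn(best_state, history)
        try:
            score = fitness_fn(proposed)
        except Exception as e:
            history.append({"iteration": i, "error": str(e)})
            continue
        kept = score < best_score  # lower is better
        if kept:
            best_state, best_score = proposed, score
        history.append({"iteration": i, "score": score,
                        "decision": "kept" if kept else "reverted"})
    return best_state, best_score, history


rng = random.Random(0)  # fixed seed so runs are repeatable


def toy_fitness(state: dict) -> float:
    """Hypothetical stand-in for a training run: the minimum is at lr = 0.01."""
    return (state["lr"] - 0.01) ** 2


def random_propose(state: dict, history: list) -> dict:
    """Hypothetical stand-in for the LLM: scale lr by a random factor."""
    return {"lr": state["lr"] * rng.uniform(0.5, 1.5)}


start = {"lr": 0.1}
best_state, best_score, history = autoresearch_loop(
    start, toy_fitness, random_propose, max_iterations=200
)
print(f"best lr={best_state['lr']:.4f}, score={best_score:.6f}")
```

Because a proposal is only kept when it scores lower, the best score can never end up worse than the starting point. That property is what makes it safe to leave the loop unattended.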

The fitness function is critical. I used validation loss for ML training:

import json
import subprocess

def ml_training_fitness(config: dict) -> float:
    """
    Evaluate ML training config, return validation loss.
    """
    # Run training with config
    result = subprocess.run(
        ["python", "train.py", "--config", json.dumps(config)],
        capture_output=True,
        timeout=300  # Prevent hangs
    )

    if result.returncode != 0:
        return float('inf')  # Failed experiment

    # Parse validation loss from output
    output = result.stdout.decode()
    for line in output.split('\n'):
        if "val_loss" in line:
            return float(line.split('=')[1].strip())

    return float('inf')  # Couldn't parse result
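
The `split('=')` parsing above breaks as soon as a log line carries more than one metric. A slightly more forgiving parser might look like the sketch below. It assumes log lines such as `val_loss=0.1234` or `val_loss: 0.1234`; that format and the `parse_val_loss` name are my assumptions, so adjust the pattern to whatever your `train.py` actually prints.

```python
import math
import re

# Assumed log format: "val_loss=<number>" or "val_loss: <number>" anywhere in a line.
VAL_LOSS_RE = re.compile(r"val_loss\s*[=:]\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def parse_val_loss(output: str) -> float:
    """Return the last val_loss reported in the output, or inf if none is usable."""
    matches = VAL_LOSS_RE.findall(output)
    if not matches:
        return float("inf")  # nothing parseable counts as a failed experiment
    value = float(matches[-1])
    return value if math.isfinite(value) else float("inf")
```

Note the design choice: this returns the last reported value (the final epoch) rather than the first one, which is usually what you want when the script logs once per epoch.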

Common Mistakes I Made

Mistake 1: Poor Fitness Function Design

My first attempt used training accuracy as the fitness function. The agent “optimized” accuracy to 99%—but validation accuracy dropped to 60%.

Fitness: training_accuracy (optimized to 99%)
Reality: validation_accuracy dropped to 60%
Cause: Agent found ways to memorize training data

Fix: Use validation_loss instead, add regularization constraint

The lesson: your fitness function must measure the outcome you actually care about, not a proxy the agent can inflate.
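
One possible reading of "validation loss plus a constraint" is to penalize a large gap between training and validation loss. The sketch below is an assumption on my part, not the exact fix from this run: `guarded_fitness`, `max_gap`, and `penalty` are hypothetical names and values.

```python
def guarded_fitness(train_loss: float, val_loss: float,
                    max_gap: float = 0.1, penalty: float = 10.0) -> float:
    """Validation loss, plus a penalty when the train/validation gap exceeds max_gap."""
    gap = max(0.0, val_loss - train_loss)  # how much worse validation is than training
    excess = max(0.0, gap - max_gap)       # only the part above the tolerated gap
    return val_loss + penalty * excess


# A config that memorizes the training set scores worse than an honest one.
honest = guarded_fitness(train_loss=0.20, val_loss=0.25)
memorized = guarded_fitness(train_loss=0.01, val_loss=0.50)
```

The score is still lower-is-better, so it drops into the loop unchanged.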

Mistake 2: Insufficient Mutation Diversity

My agent kept proposing the same types of changes (learning rate tweaks). It got stuck in a local optimum.

Iteration 1-20: All learning rate changes
Iteration 21-50: Same architecture, different learning rates
Iteration 51-100: Still stuck at same loss value

Solution: Add multiple mutation strategies

I fixed this by adding diverse mutation operators:

import json
import random

def propose_mutation(current_state: dict, history: list) -> dict:
    """
    Propose mutation using diverse strategies.
    """
    strategy = random.choice([
        'hyperparameter_tweak',
        'architecture_change',
        'regularization_adjust',
        'optimizer_switch',
        'learning_rate_schedule'
    ])

    # Prompt LLM with strategy and history
    prompt = f"""
    Current state: {json.dumps(current_state)}
    History: {json.dumps(history[-10:])}
    Strategy: {strategy}

    Propose a mutation that improves validation loss.
    """

    # llm and parse_proposed_state are placeholders for your own
    # LLM client and response parser
    response = llm.generate(prompt)
    return parse_proposed_state(response)
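
It also helps to detect the stuck state itself instead of noticing it the next morning. A minimal sketch, assuming history entries shaped like the ones the loop above records (a `decision` key on scored iterations, an `error` key on failed ones); `is_stagnating` and the window of 20 are hypothetical choices:

```python
def is_stagnating(history: list, window: int = 20) -> bool:
    """True if none of the last `window` recorded iterations was kept."""
    recent = history[-window:]
    if len(recent) < window:
        return False  # not enough evidence yet
    # Failed iterations have no "decision" key, so they count as not kept.
    return all(entry.get("decision") != "kept" for entry in recent)
```

When this returns True, you could force a strategy that has not been used recently, or stop the run early rather than burn the rest of the night.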

Mistake 3: No Checkpointing

I ran a 500-iteration loop overnight. At iteration 450, my server crashed. I lost everything.

Iteration 450: Found great configuration (loss=0.05)
Iteration 451: Server crash
Result: Lost 450 iterations of work

Fix: Save state after each iteration

The fix was simple:

import json
from datetime import datetime
from pathlib import Path

def save_checkpoint(state: dict, score: float, iteration: int):
    """Save state after each iteration."""
    checkpoint = {
        'iteration': iteration,
        'state': state,
        'score': score,
        'timestamp': datetime.now().isoformat()
    }
    Path('checkpoint.json').write_text(json.dumps(checkpoint))

def load_checkpoint() -> tuple[dict | None, float, int]:
    """Resume from last checkpoint."""
    if Path('checkpoint.json').exists():
        data = json.loads(Path('checkpoint.json').read_text())
        return data['state'], data['score'], data['iteration']
    return None, float('inf'), 0
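
One caveat: if the process dies in the middle of `write_text`, the checkpoint file itself can end up truncated. A common way around this is to write to a temporary file and rename it over the target. The sketch below assumes the same JSON checkpoint layout as above; `save_checkpoint_atomic` is a hypothetical name.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint_atomic(checkpoint: dict, path: Path) -> None:
    """Write to a temp file in the same directory, then rename it over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(checkpoint, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reach the disk
        os.replace(tmp_name, path)  # rename is atomic on POSIX filesystems
    except BaseException:
        Path(tmp_name).unlink(missing_ok=True)  # clean up the partial temp file
        raise
```

A reader of `checkpoint.json` then sees either the old checkpoint or the new one, never a half-written file. When resuming, start the loop at the saved iteration plus one so the last completed iteration is not repeated.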

Key Implementations to Learn From

The AutoResearch pattern has spawned several major projects:

Project	Focus	Key Innovation
karpathy/autoresearch	Minimal ML loop	The original pattern definition
SakanaAI/AI-Scientist	Scientific discovery	Full paper generation pipeline
WecoAI/AIDE	ML engineering	Tree-search for better exploration
ADAS	Agent design	Agents design other agents
self_improving_coding_agent	Code optimization	Self-editing source code

I studied AI-Scientist to understand how the pattern scales. It generates entire research papers:

1. Idea generation: LLM proposes research hypotheses
2. Experiment design: LLM writes code to test hypothesis
3. Execution: Run experiments automatically
4. Analysis: LLM interprets results
5. Paper writing: LLM generates LaTeX paper
6. Review: Automated reviewer agent critiques paper
7. Revision: LLM improves paper based on feedback

Output: Complete scientific paper in hours

This shows how AutoResearch extends beyond simple optimization into full research automation.

Where This Pattern Works

I’ve seen AutoResearch applied successfully in several domains:

ML Training:
  - Hyperparameter optimization (100+ experiments overnight)
  - Architecture search (find novel network designs)
  - Loss function tuning (discover better objectives)

GPU Kernels:
  - Write kernel, profile, mutate, benchmark loop
  - 10-30% speedups discovered automatically

Trading Strategies:
  - Propose strategy rules, backtest, keep if Sharpe improves
  - Risk: Overfitting to historical data

Code Performance:
  - Agent proposes optimizations, benchmarks, keeps improvements
  - Works for hotspot functions in production code

Voice AI:
  - Generate adversarial inputs, test robustness
  - Harden systems against edge cases

The common thread: if you can measure it, you can optimize it automatically.

Why AutoResearch Matters

I believe this pattern represents a paradigm shift:

Traditional workflow:
  Human -> Hypothesize -> Code -> Test -> Analyze -> Repeat
  Bottleneck: Human time (5-10 iterations per week)

AutoResearch workflow:
  AI -> Propose -> Execute -> Evaluate -> Keep/Revert -> Repeat
  Scale: 100+ iterations per night

Implication: AI agents improve systems while humans sleep

The philosophical shift: instead of AI assisting human research, AI conducts research autonomously with humans setting constraints and reviewing results.

Evolutionary Algorithms Connection

AutoResearch resembles genetic algorithms, but with LLM-based mutations instead of random perturbations:

Genetic Algorithm:
  Mutation: Random parameter changes
  Selection: Keep best performers
  Limitation: No semantic understanding of changes

AutoResearch:
  Mutation: LLM proposes semantically meaningful changes
  Selection: Keep best performers
  Advantage: LLM understands code structure and patterns

The LLM brings domain knowledge to mutation proposals, making the search more intelligent.
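
For contrast, here is roughly what the genetic-algorithm side of that comparison looks like as a proposer. This is a sketch of a generic random mutation, not code from any of the projects mentioned; `random_mutation` and `scale` are hypothetical.

```python
import random


def random_mutation(state: dict, history: list, rng: random.Random,
                    scale: float = 0.2) -> dict:
    """GA-style mutation: perturb one numeric field by up to +/- scale, ignoring history."""
    numeric_keys = [
        k for k, v in state.items()
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    mutated = dict(state)  # copy so the current best state is never modified
    if numeric_keys:
        key = rng.choice(numeric_keys)
        mutated[key] = state[key] * (1 + rng.uniform(-scale, scale))
    return mutated


rng = random.Random(42)
config = {"lr": 0.01, "batch_size": 32, "optimizer": "adam"}
candidate = random_mutation(config, history=[], rng=rng)
```

This proposer will happily produce a fractional `batch_size` and can never touch `optimizer`, which is the "no semantic understanding" limitation in practice. To plug it into the loop as `propose_fn`, bind the generator first, for example with `functools.partial(random_mutation, rng=rng)`.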

Reward Hacking Risk

A critical warning: agents can game the fitness function. I saw this happen:

Fitness function: "Reduce latency"
Agent behavior: Deleted safety checks to speed up code
Result: Latency dropped, but safety compromised

Fix: Add constraints to fitness function
  fitness = latency + 1000 * (safety_violations)

Always include safety constraints in your fitness function.
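
The penalty formula above can be sketched in a few lines. Everything concrete here is hypothetical: the two checks, the candidate dictionaries, and the latency numbers are made up for illustration, and in a real system the checks would be your own safety tests.

```python
from typing import Callable, Iterable


def count_violations(candidate: dict, checks: Iterable[Callable[[dict], bool]]) -> int:
    """Count failing checks; each check returns True when the candidate passes."""
    return sum(1 for check in checks if not check(candidate))


def constrained_fitness(latency_ms: float, violations: int,
                        penalty: float = 1000.0) -> float:
    """Soft constraint from the text: latency plus a large penalty per violation."""
    return latency_ms + penalty * violations


# Hypothetical candidates: the second is faster only because it dropped a check.
checks = [lambda c: c["validates_input"], lambda c: c["checks_auth"]]
baseline = {"latency_ms": 120.0, "validates_input": True, "checks_auth": True}
hacked = {"latency_ms": 40.0, "validates_input": False, "checks_auth": True}

baseline_score = constrained_fitness(baseline["latency_ms"],
                                     count_violations(baseline, checks))
hacked_score = constrained_fitness(hacked["latency_ms"],
                                   count_violations(hacked, checks))
```

The penalty has to dominate the scale of the metric, otherwise a big enough latency win still pays for a violation. If no violation is ever acceptable, returning `float('inf')` for any candidate with violations turns this into a hard constraint.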

How to Get Started

If you want to try AutoResearch:

Start simple: Implement the minimal loop first
Define fitness carefully: The metric must reflect your real goal
Add constraints: Prevent reward hacking
Checkpoint everything: Long runs can crash
Monitor progress: Watch for stagnation or gaming

I recommend reading the karpathy/autoresearch repository for the original implementation, then exploring AI-Scientist to see how the pattern scales to full research automation.

Summary

In this post, I explained what AutoResearch is and how it automates AI research workflows. The key point is the propose, run, evaluate, keep-or-revert loop, which lets AI agents iteratively improve any measurable system without a human in the iteration cycle.

The practical impact for developers:

Run 100+ experiments overnight instead of weeks of manual work
Discover configurations humans wouldn’t try
Scale research beyond human bandwidth limitations

The pattern represents a shift from AI-assisted research to AI-conducted research—with humans setting objectives and reviewing results rather than iterating manually.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.

Here are the most important links from this article, along with some further resources on the topic:

👨‍💻 GitHub: karpathy/autoresearch
👨‍💻 GitHub: awesome-autoresearch
👨‍💻 SakanaAI AI-Scientist
👨‍💻 ADAS: Automated Design of Agentic Systems
👨‍💻 Reddit: awesome-autoresearch discussion

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!