What is Karpathy's AutoResearch? A Developer's Guide to Recursive Self-Improvement

Mar 29, 2026

Problem

When I first heard about “AutoResearch” from Andrej Karpathy, I thought it was a tool for academic literature research—automatically finding and summarizing papers. But when I looked at the actual code, I realized I was completely wrong.

The Reddit discussion confirmed this confusion:

“Autoresearch is basically recursive self improvement… The difference is that Karpathy put something out there that you can actually run” — TokenRingAI

So what is it actually?

Environment

Python-based ML optimization framework
Created by Andrej Karpathy (OpenAI founding member, former Tesla director of AI)
Released in 2026 as open-source

What Is AutoResearch?

AutoResearch is a recursive self-improvement (RSI) framework for ML optimization. It runs a loop where an LLM proposes code changes, tests them against metrics, and iteratively improves.

Think of it as automating the job that many ML engineers do manually: “make number go down by twiddling code semi-intelligently.”

A user from the Reddit thread applied this concept immediately:

“The day after autoresearch was posted I applied the idea to a stereo depth estimator and by the end of the day it had made massive gains.” — sdfgeoff

The Core Concept

The structure is surprisingly simple—a while True loop:

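# Conceptual AutoResearch loop (pseudocode: llm, apply_change, and evaluate are placeholders)
current_metrics = evaluate(current_code)  # Baseline score before any change
while True:
    proposal = llm.suggest_improvement(current_code, current_metrics)
    new_code = apply_change(current_code, proposal)
    new_metrics = evaluate(new_code)
    if new_metrics < current_metrics:  # Lower loss = better
        current_code = new_code
        current_metrics = new_metrics
    # Otherwise new_code is discarded and the next proposal starts from current_code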

The LLM proposes a code change, you test it, and if the metric improves, you keep it. Otherwise, you revert and try again.
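To make the accept-or-reject step concrete, here is a toy version that runs without an LLM. It is only a sketch under loud assumptions: `propose` is a random nudge standing in for the LLM, the "code" is a dict of two settings, and `evaluate` is a made-up metric with a known optimum. None of these names come from Karpathy’s repository.

```python
import random


def evaluate(params):
    """Toy metric (assumption): squared distance from a hidden optimum. Lower is better."""
    return (params["lr"] - 0.3) ** 2 + (params["dropout"] - 0.1) ** 2


def propose(params, rng):
    """Stand-in for the LLM: nudge one randomly chosen setting by a small amount."""
    key = rng.choice(sorted(params))
    candidate = dict(params)  # copy, so the current best is never mutated
    candidate[key] = params[key] + rng.uniform(-0.05, 0.05)
    return candidate


def improve(params, steps=200, seed=0):
    """Keep a candidate only when it scores strictly better than the current best."""
    rng = random.Random(seed)
    score = evaluate(params)
    for _ in range(steps):
        candidate = propose(params, rng)
        candidate_score = evaluate(candidate)
        if candidate_score < score:
            params, score = candidate, candidate_score
        # A rejected candidate is simply dropped; nothing needs to be undone.
    return params, score
```

Calling `improve({"lr": 1.0, "dropout": 0.5})` returns the best settings found and their score. The real loop differs in what gets proposed and how expensive each evaluation is, not in the keep-or-discard logic.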

A Minimal Implementation

Here’s a simplified version I could actually run:

# propose_improvement, apply_proposal, and revert_proposal are helpers
# you supply; they are not shown here.

def autoresearch_loop(codebase_path, metric_fn, max_iterations=100):
    """Run AutoResearch-style optimization loop."""
    current_score = metric_fn()  # Baseline before any change

    for i in range(max_iterations):
        # 1. LLM proposes improvement
        proposal = propose_improvement(codebase_path, current_score)

        # 2. Apply change
        apply_proposal(codebase_path, proposal)

        # 3. Evaluate; a proposal that crashes the evaluation counts as a failure
        try:
            new_score = metric_fn()
        except Exception:
            new_score = float("inf")

        # 4. Keep or revert
        if new_score < current_score:  # Lower is better
            current_score = new_score
            print(f"Iteration {i}: Improved to {new_score}")
        else:
            revert_proposal(codebase_path, proposal)
            print(f"Iteration {i}: Rejected, score {new_score}")

    return current_score

The key parts:

metric_fn: Your evaluation function. The loop treats lower as better, so return a loss, or negate a score such as accuracy
propose_improvement: LLM analyzes code and suggests changes
apply_proposal / revert_proposal: Modify codebase safely
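Two of these parts can be sketched with the standard library alone. This is one possible shape, not how the AutoResearch repository does it. The assumptions are mine: a proposal is a dict with a `file` key and a `new_text` key, the evaluation script prints the metric as the last line of its output, and `make_metric_fn` is a hypothetical helper name. The LLM call behind `propose_improvement` is left out on purpose.

```python
import subprocess
import sys
from pathlib import Path


def make_metric_fn(script_path, timeout=600):
    """Build a metric_fn that runs a script and parses its last stdout line as a float."""
    def metric_fn():
        result = subprocess.run(
            [sys.executable, str(script_path)],
            capture_output=True, text=True, check=True, timeout=timeout,
        )
        return float(result.stdout.strip().splitlines()[-1])
    return metric_fn


def _backup_path(target):
    # Hypothetical convention: keep the backup next to the file, e.g. train.py.bak
    return target.with_name(target.name + ".bak")


def apply_proposal(codebase_path, proposal):
    """Overwrite one file with the proposed text, saving the old text first."""
    target = Path(codebase_path) / proposal["file"]
    _backup_path(target).write_text(target.read_text())
    target.write_text(proposal["new_text"])


def revert_proposal(codebase_path, proposal):
    """Restore the file from the backup written by apply_proposal."""
    target = Path(codebase_path) / proposal["file"]
    _backup_path(target).replace(target)  # atomic overwrite, removes the backup
```

Because `check=True` raises when the script exits with an error, a proposal that breaks the code shows up as an exception in the caller, which can then treat it as a rejected change. A git-based revert would work just as well; file backups simply keep the sketch self-contained.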

The Overfitting Problem

But there’s a catch. A Reddit user raised a critical concern:

“95% concern: overfitting to validation set” — kaggleqrdl

If you optimize against the same validation data repeatedly, the search itself overfits: changes get accepted because they happen to score well on that particular split, not because they generalize.
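A small simulation shows how this happens even when no proposal helps at all. The setup is an assumption made for illustration: every proposal has the same true loss, and its measured loss on a fixed split is the true loss plus an independent Gaussian deviation. The function name and parameters are hypothetical.

```python
import random


def simulate_selection(n_proposals=200, true_loss=1.0, noise=0.05, seed=0):
    """Pick the proposal with the lowest validation loss; report its loss on a second split.

    Every proposal is equally good (same true_loss), so any apparent gain on
    the validation split comes purely from selecting on its random deviation.
    """
    rng = random.Random(seed)
    best_val = true_loss + rng.gauss(0.0, noise)
    best_test = true_loss + rng.gauss(0.0, noise)
    for _ in range(n_proposals):
        val = true_loss + rng.gauss(0.0, noise)   # score on the split used for selection
        test = true_loss + rng.gauss(0.0, noise)  # score on a split never used for selection
        if val < best_val:
            best_val, best_test = val, test
    return best_val, best_test
```

The returned validation loss sits below the true loss because it is the minimum of many draws, while the second number is an unbiased estimate for the selected proposal. The more proposals the loop evaluates, the wider that gap tends to become.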

Here’s how I’d protect against that:

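# propose_improvement, apply_proposal, revert_proposal, and save_checkpoint
# are helpers you supply; they are not shown here.

def safe_autoresearch_loop(codebase_path, train_metric, val_metric, patience=10):
    """AutoResearch loop that accepts changes on validation loss and stops early."""
    best_val = val_metric()
    no_improve = 0  # Consecutive rejected proposals

    while no_improve < patience:
        proposal = propose_improvement(codebase_path, best_val)
        apply_proposal(codebase_path, proposal)

        train = train_metric()
        val = val_metric()

        if val < best_val:
            best_val = val
            no_improve = 0
            save_checkpoint(codebase_path, f"best_val_{val:.4f}")
            print(f"Accepted: train {train:.4f}, val {val:.4f}")
        else:
            # Covers the overfitting case: train may have dropped, but val did not
            revert_proposal(codebase_path, proposal)
            no_improve += 1
            print(f"No val improvement ({no_improve}/{patience}): train {train:.4f}, val {val:.4f}")

    return best_val

Now I track validation metrics separately from training metrics. A change is kept only if validation improves; if training drops but validation does not, I revert it, and after 10 rejections in a row I stop. This limits the damage but does not remove it: every accepted change was still selected on the same validation set, so the final number should be confirmed on data the loop never saw.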

What AutoResearch Can’t Do

I need to be clear about limitations:

Expectation	Reality
Invents new architectures	Only tweaks existing code
Replaces ML engineers	Automates tedious tuning, not design
Works on any problem	Best for optimization tasks with clear metrics
Guarantees improvement	May produce “confident nonsense” if the metric or evaluation is flawed

The Reddit thread emphasized: “No new paradigms—only nibbling around the edges.”

AutoResearch doesn’t create breakthroughs. It’s a tool for incremental optimization.

Why Karpathy’s Implementation Matters

The AI industry has discussed RSI for years. Mark Zuckerberg, Sam Altman, and Dario Amodei all mentioned it in summer 2025. But their implementations were:

Closed-source
Too abstract to understand
Buried in complex research codebases

Karpathy’s contribution was making RSI accessible:

He wrote clean, minimal code
He explained the concept clearly
He released it open-source so you can run it yourself

This is why it gained traction—not because it was technically novel, but because it solved a real problem with an understandable implementation.

Summary

In this post, I explained what Karpathy’s AutoResearch actually is: a recursive self-improvement framework for ML optimization, not academic research software. The key point is that it automates the iterative tuning loop that ML engineers do manually. But I need to set up proper validation guardrails to avoid overfitting, and I shouldn’t expect it to invent new architectures—only to optimize existing ones.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 AutoResearch GitHub Repository
👨‍💻 Reddit Discussion: AutoResearch vs OpenClaw Buzz

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!