How does Andrej Karpathy's autoresearch loop work for self-improving AI systems?

Mar 29, 2026

I kept hitting the same wall with my AI agents. Every time they made a mistake, I had to manually debug, rewrite prompts, and redeploy. The improvement cycle was painfully slow - sometimes taking weeks to see if a change actually helped.

Then I stumbled across Andrej Karpathy’s autoresearch loop concept. The idea was radical: what if AI systems could rewrite their own code based on actual performance, with no human inside the improvement loop?

The Problem: Human Bottleneck in AI Improvement

Here’s the cycle I was stuck in:

Agent makes prediction → Outcome measured → I analyze results →
I rewrite prompts → I test changes → I deploy → Wait for results →
Agent makes prediction → ...

The loop worked, but I was the bottleneck. Every improvement required my:

Manual analysis of what went wrong
Creative thinking about prompt modifications
Subjective judgment on what to keep
Time to implement and test changes

For a single agent, this was manageable. But as I scaled to multiple specialized agents, the maintenance burden became overwhelming.

The Autoresearch Loop Solution

Karpathy’s insight: automate the entire improvement cycle using git operations as memory.

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐           │
│   │  Agents  │───▶│ Evaluate │───▶│  Modify  │           │
│   │   Run    │    │ Results  │    │  Code    │           │
│   └──────────┘    └──────────┘    └──────────┘           │
│        ▲                                │                 │
│        │                                ▼                 │
│        │                         ┌──────────────┐        │
│        │                         │ Git Commit   │        │
│        │                         │ (successful) │        │
│        │                         │ Git Revert   │        │
│        │                         │ (failed)     │        │
│        │                         └──────────────┘        │
│        │                                │                 │
│        └────────────────────────────────┘                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The key innovation: git commit/revert as the learning mechanism. When code changes improve performance, they survive. When they hurt performance, the system automatically reverts to the previous version.
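Here is a minimal sketch of that keep-or-revert rule, with an in-memory log standing in for the git history. The `propose` and `evaluate` callables are hypothetical placeholders for the AI rewrite step and the performance measurement; none of this is taken from Karpathy's or Worsey's code.

```python
import random
from typing import Callable, List, Tuple


def keep_or_revert(
    initial: float,
    propose: Callable[[float], float],
    evaluate: Callable[[float], float],
    rounds: int,
) -> Tuple[float, List[str]]:
    """Accept a candidate only if it scores strictly better than the current version."""
    current = initial
    current_score = evaluate(current)
    log: List[str] = []
    for _ in range(rounds):
        candidate = propose(current)
        candidate_score = evaluate(candidate)
        if candidate_score > current_score:
            # Improvement: the change survives (the "git commit" branch)
            current, current_score = candidate, candidate_score
            log.append("commit")
        else:
            # No improvement: fall back to the previous version (the "git revert" branch)
            log.append("revert")
    return current, log


# Toy problem: the "agent" is a single number and the score peaks at 3.0.
rng = random.Random(0)
best, history = keep_or_revert(
    initial=0.0,
    propose=lambda x: x + rng.uniform(-1.0, 1.0),
    evaluate=lambda x: -(x - 3.0) ** 2,
    rounds=200,
)
print(round(best, 2), history.count("commit"), "of", len(history), "changes kept")
```

Most proposals in the toy run are rejected, which is the same pattern as the survival rates reported later in this article.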

Real-World Case: 25 Agents Trading Markets

Chris Worsey took Karpathy’s concept and applied it to financial markets. I studied his implementation closely because it’s one of the few public examples of this architecture in production.

The system uses 25 agents organized in 4 layers:

Layer 1: Macro Analysis
├── Agent M1: Interest rate trends
├── Agent M2: Inflation signals
├── Agent M3: Geopolitical risk
├── ... (total ~6 agents)
│
Layer 2: Sector Analysis
├── Agent S1: Technology sector
├── Agent S2: Healthcare sector
├── Agent S3: Energy sector
├── ... (total ~8 agents)
│
Layer 3: Commodities
├── Agent C1: Oil/gas dynamics
├── Agent C2: Precious metals
├── ... (total ~5 agents)
│
Layer 4: Single Names
├── Agent N1: Apple analysis
├── Agent N2: Microsoft analysis
├── ... (total ~6 agents)
│
▼
Portfolio Manager Agent (synthesizes all inputs)
│
▼
Position Decisions

Each morning, agents debate their perspectives. The portfolio manager synthesizes everything into actual trading positions. By evening, real market outcomes score every agent’s contribution.
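The case study does not spell out how each agent's contribution is scored, so the following is only one plausible scheme, sketched under my own assumptions: every agent emits a signed conviction between -1 and 1 per ticker, and its score is the average of conviction times realized return. The function name, agent names, and all numbers are made up for illustration.

```python
from typing import Dict


def score_agent(convictions: Dict[str, float], realized_returns: Dict[str, float]) -> float:
    """Average of conviction * realized return over the tickers the agent rated."""
    rated = [ticker for ticker in convictions if ticker in realized_returns]
    if not rated:
        return 0.0
    return sum(convictions[t] * realized_returns[t] for t in rated) / len(rated)


# Hypothetical morning calls (positive = bullish, negative = bearish)
morning_calls = {
    "macro_rates": {"AAPL": 0.5, "XOM": -1.0},
    "sector_tech": {"AAPL": -0.5, "MSFT": 1.0},
}
# Hypothetical close-to-close returns for the same day
closing_returns = {"AAPL": 0.02, "MSFT": -0.01, "XOM": -0.03}

scores = {name: score_agent(calls, closing_returns) for name, calls in morning_calls.items()}
for name, value in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: {value:+.4f}")
```

An agent that was bullish on a stock that fell, or bearish on one that rose, ends up with a negative score and sorts to the top of the rewrite list.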

The Daily Improvement Cycle

Here’s what happens every single day:

06:00 - Agents analyze data and debate
08:00 - Portfolio manager synthesizes positions
09:30 - Market opens, positions taken
16:00 - Market closes, outcomes measured
18:00 - Performance scores calculated
20:00 - Worst agents identified
22:00 - System rewrites underperformers
00:00 - Git commit new versions
        (or revert if regression detected)

The numbers from the case study blew me away:

54 prompt modifications attempted
16 survived based on actual performance
~30% survival rate - only beneficial changes persist

This is evolution applied to code. Natural selection where performance is the fitness function.
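As a quick check on the arithmetic, using only the two counts quoted above:

```python
attempted = 54
survived = 16

survival_rate = survived / attempted
reverted = attempted - survived

print(f"{survival_rate:.1%} survived, {reverted} reverted")  # 29.6% survived, 38 reverted
```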

How Git Operations Enable Learning

I initially didn’t understand why git was central to this architecture. Then I realized: git provides the memory and rollback mechanism.

class AutoresearchLoop:
    # GitManager, synthesize(), execute_positions(), collect_outcomes(),
    # find_worst_performers() and agent.apply() are placeholders for
    # project-specific code that is not shown here.
    def __init__(self, agents, evaluator, modifier):
        self.agents = agents
        self.evaluator = evaluator
        self.modifier = modifier
        self.version_control = GitManager()

    def run_daily_cycle(self):
        # Morning: agents generate predictions
        predictions = [agent.predict() for agent in self.agents]

        # Synthesize into positions
        portfolio = self.synthesize(predictions)
        self.execute_positions(portfolio)

        # Evening: evaluate with real outcomes (only known after the close)
        real_outcomes = self.collect_outcomes(portfolio)
        scores = self.evaluator.score(portfolio, real_outcomes)

        # Night: identify underperformers
        worst_agents = self.find_worst_performers(scores)

        # Rewrite and commit
        for agent in worst_agents:
            # Save current state
            self.version_control.commit(
                agent,
                message=f"Baseline: {agent.id}"
            )

            # AI modifies the agent; apply the result so the commit contains it
            new_code = self.modifier.rewrite(agent)
            agent.apply(new_code)

            # Commit the modification
            self.version_control.commit(
                agent,
                message=f"Rewritten: {agent.id}"
            )

        return worst_agents

    def review_rewrites(self, rewritten_agents):
        # Run after the NEXT evaluation: undo only the rewrites that regressed
        for agent in rewritten_agents:
            if self.evaluator.regression_detected(agent):
                self.version_control.revert(agent)

The git operations solve three critical problems:

Memory - Previous versions are preserved
Attribution - Each change is tracked to specific performance metrics
Rollback - Failed experiments can be automatically undone
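To make those three roles concrete, here is a small in-memory stand-in for one agent's git history. It is a sketch under my own assumptions, not a real `GitManager`: the class and method names are hypothetical, and `revert` here simply drops the latest version, which is closer to `git reset --hard HEAD~1` than to `git revert`. The 0.72 score comes from the git example later in this article; the 0.65 score and the prompts are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Version:
    prompt: str
    message: str
    score: Optional[float] = None  # filled in once this version has been evaluated


@dataclass
class PromptHistory:
    """In-memory stand-in for the git history of one agent's prompt."""
    versions: List[Version] = field(default_factory=list)

    def commit(self, prompt: str, message: str) -> None:
        # Memory: every version is kept
        self.versions.append(Version(prompt, message))

    def record_score(self, score: float) -> None:
        # Attribution: the score is stored next to the change that produced it
        self.versions[-1].score = score

    def revert(self) -> Version:
        # Rollback: drop the latest version and return the one before it
        if len(self.versions) < 2:
            raise ValueError("nothing to revert to")
        self.versions.pop()
        return self.versions[-1]

    @property
    def head(self) -> Version:
        return self.versions[-1]


history = PromptHistory()
history.commit("Weigh CPI prints heavily.", "Baseline")
history.record_score(0.72)
history.commit("Weigh CPI and wage growth equally.", "Rewritten")
history.record_score(0.65)

if history.head.score < history.versions[-2].score:
    history.revert()

print(history.head.message, history.head.score)  # Baseline 0.72
```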

Performance-Based Agent Evolution

The survival mechanism fascinated me. Here’s how agents evolve:

def evolve_agents(agents, performance_scores):
    """Evolve agents based on performance metrics"""
    # Rank agents by actual outcomes, worst first
    ranked = sorted(agents, key=lambda a: performance_scores[a.id])

    # Bottom 20% get rewritten (always at least one agent)
    cutoff = max(1, int(len(agents) * 0.2))
    worst_performers = ranked[:cutoff]

    for agent in worst_performers:
        # Save baseline
        git_commit(agent, f"Before rewrite: {agent.id}")

        # AI analyzes what went wrong
        failure_analysis = analyze_failures(agent, performance_scores)

        # Generate improved prompts/logic
        new_version = ai_rewrite(
            agent,
            context=failure_analysis
        )

        # Apply the rewrite (apply() is a placeholder) so the commit contains it
        agent.apply(new_version)

        # Deploy and test
        git_commit(agent, f"Rewritten: {agent.id}")

    return agents

def evaluate_and_purge(old_scores, new_scores):
    """Keep only changes that do not hurt performance.

    Pass scores for rewritten agents only; reverting an agent that was
    not rewritten would undo an unrelated commit.
    """
    for agent_id, new_score in new_scores.items():
        old_score = old_scores.get(agent_id)
        if old_score is None:
            continue  # no baseline to compare against
        if new_score < old_score:
            # Performance regressed - revert
            git_revert(agent_id)
            log_result(agent_id, "REVERTED - regression detected")
        else:
            log_result(agent_id, "KEPT - no regression")

The key insight: no human decides what stays. Only performance metrics matter.

Multi-Agent Debate Structure

The layered debate structure caught my attention. Why organize agents in layers?

Agent A says: "Tech sector is overvalued, P/E ratios extreme"
Agent B argues: "But AI growth justifies higher multiples"
Agent C counters: "Regulatory risks in EU overlooked"
    │
    ▼
Consensus: "Mixed signals - reduce tech exposure but keep AI leaders"

The debate process creates more robust decisions:

def run_debate_cycle(agents, layers, portfolio_manager):
    """Run multi-layer agent debate"""
    layer_outputs = {}

    for layer in layers:
        # Get agents for this specialization
        layer_agents = [a for a in agents if a.layer == layer]

        # Agents present arguments
        arguments = [agent.analyze() for agent in layer_agents]

        # Debate until convergence or the iteration cap
        consensus = debate_to_consensus(layer_agents, arguments)

        layer_outputs[layer] = consensus

    # Portfolio manager synthesizes all layers
    final_decision = portfolio_manager.synthesize(layer_outputs)

    return final_decision

def debate_to_consensus(agents, arguments, max_iterations=5):
    """Agents debate until they agree or run out of rounds.

    has_consensus() and synthesize_consensus() are placeholders
    for project-specific logic.
    """
    iteration = 0

    while not has_consensus(arguments) and iteration < max_iterations:
        # Each agent responds to the full set of current arguments
        arguments = [
            agent.respond_to_others(arguments)
            for agent in agents
        ]
        iteration += 1

    return synthesize_consensus(arguments)

The layers provide specialization, the debate provides robustness, and the portfolio manager provides integration.
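The source does not say how consensus is detected. One simple possibility, sketched here as an assumption, is to reduce each argument to a stance label and call it consensus when a large enough share of agents hold the same stance. The function below, the stance labels, and the 60% threshold are all hypothetical.

```python
from collections import Counter
from typing import Sequence


def has_consensus(stances: Sequence[str], min_share: float = 0.6) -> bool:
    """True when the most common stance is held by at least min_share of agents."""
    if not stances:
        return False
    _, top_count = Counter(stances).most_common(1)[0]
    return top_count / len(stances) >= min_share


print(has_consensus(["reduce", "reduce", "hold"]))  # True: 2 of 3 agree
print(has_consensus(["reduce", "hold", "add"]))     # False: no stance has a majority
```

A share-based rule like this tolerates one dissenting agent, which fits the "mixed signals" style of consensus in the example above better than requiring unanimity.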

What I Tried: Implementing My Own Loop

After studying the architecture, I attempted a simplified version for a content recommendation system:

My Setup:
- 5 agents analyzing article quality
- Single layer (simpler than the 4-layer approach)
- Daily evaluation based on click-through rates
- Git-based version control for agent prompts

Results after 2 weeks:
- 23 modifications attempted
- 7 survived based on CTR improvement
- 2 agents showed significant improvement
- 1 agent actually got worse and was reverted

The implementation taught me several lessons:

Lesson 1: Evaluation Metrics Must Be Clean

My first attempt used noisy metrics (time-on-page). The system kept making changes based on random fluctuations. I switched to click-through rate and suddenly the improvements became real.
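A rough way to tell a real change from random fluctuation is to compare the difference in means against its standard error. The sketch below is a rule of thumb under my own assumptions, not a rigorous significance test (with only a handful of days, something like Welch's t-test would be more careful), and the daily values are invented for illustration.

```python
import math
import statistics
from typing import Sequence


def exceeds_noise(baseline: Sequence[float], candidate: Sequence[float], z: float = 2.0) -> bool:
    """True if the candidate's mean beats the baseline's by more than z standard errors.

    Each sequence needs at least two observations.
    """
    diff = statistics.mean(candidate) - statistics.mean(baseline)
    standard_error = math.sqrt(
        statistics.variance(baseline) / len(baseline)
        + statistics.variance(candidate) / len(candidate)
    )
    return diff > z * standard_error


# Hypothetical daily click-through rates over five days each
baseline_ctr = [0.040, 0.042, 0.041, 0.039, 0.043]
small_gain = [0.041, 0.043, 0.040, 0.042, 0.044]
large_gain = [0.050, 0.052, 0.051, 0.049, 0.053]

print(exceeds_noise(baseline_ctr, small_gain))  # False
print(exceeds_noise(baseline_ctr, large_gain))  # True
```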

Lesson 2: The Baseline Matters

I initially didn’t save proper baselines. When a change happened, I couldn’t tell if it helped or hurt. Once I added proper git tagging for each version, the system actually learned.

Lesson 3: Revert Threshold is Critical

Too aggressive: system reverts everything, no learning happens. Too permissive: system keeps bad changes, performance degrades.

I settled on: revert if performance drops >5%. This allowed exploration while preventing catastrophic regression.
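In code, that rule can be as small as the sketch below. I am assuming the 5% is a relative drop against the baseline score; the function and parameter names are just illustrative.

```python
def should_revert(baseline_score: float, new_score: float, max_drop: float = 0.05) -> bool:
    """Revert when the new score is more than max_drop (relative) below the baseline."""
    if baseline_score <= 0:
        raise ValueError("relative drop needs a positive baseline")
    relative_drop = (baseline_score - new_score) / baseline_score
    return relative_drop > max_drop


print(should_revert(0.040, 0.039))  # False: a 2.5% drop is tolerated
print(should_revert(0.040, 0.037))  # True: a 7.5% drop triggers a revert
print(should_revert(0.040, 0.045))  # False: performance improved
```

Raising `max_drop` makes the system more permissive, and lowering it toward zero makes it revert almost everything.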

Why This Architecture Matters

The autoresearch loop represents a shift from:

Traditional AI Development:
Human designs system → Human tests → Human improves → Repeat

Autoresearch Loop:
Human designs evaluator → System tests itself →
System improves itself → Repeat (no human)

The human moves from being the improver to being the evaluator designer. This is a fundamentally different role.

What the System Does Automatically

Identifies weak agents
Generates hypotheses for improvement
Implements code/prompt changes
Tests against real outcomes
Retains successful changes
Discards failed experiments

What Humans Still Do

Design the evaluation metrics
Set up the architecture
Monitor for systematic failures
Provide the compute budget

Practical Considerations

If you’re thinking about implementing this:

Git Strategy:

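# Each agent has its own branch
git checkout agents/macro-analyzer

# Before modification: record the baseline
# (--allow-empty lets the commit succeed when nothing has changed yet)
git commit --allow-empty -m "Baseline: 2026-03-29 performance score 0.72"

# After AI rewrite
git commit -am "Modified: adjusted inflation signal weight"

# If performance improves: keep the commit
# If performance regresses: undo it
git revert --no-edit HEAD       # keeps the failed attempt in history
# or: git reset --hard HEAD~1   # discards the failed attempt entirely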

Evaluation Design:

The evaluator is the most critical component. Garbage in, garbage out:

def design_evaluation_metrics():
    """Good evaluation metrics are:
    1. Objective (no human judgment)
    2. Fast (daily or faster feedback)
    3. Stable (low noise)
    4. Relevant (measures what matters)
    """
    return {
        'financial': {
            'metric': 'risk_adjusted_return',
            'feedback_frequency': 'daily',
            'stability': 'high'
        },
        'content': {
            'metric': 'engagement_rate',
            'feedback_frequency': 'hourly',
            'stability': 'medium'
        },
        'coding': {
            'metric': 'test_pass_rate',
            'feedback_frequency': 'per_commit',
            'stability': 'very_high'
        }
    }

LangGraph Integration:

The case study mentioned LangGraph, which makes sense for managing multi-agent workflows:

from langgraph.graph import StateGraph, END

def build_autoresearch_graph():
    """Build the autoresearch workflow"""
    # AgentState and the five node functions are assumed to be defined elsewhere
    workflow = StateGraph(AgentState)

    # Add nodes for each phase
    workflow.add_node("analyze", run_analysis)
    workflow.add_node("debate", run_debate)
    workflow.add_node("decide", make_decisions)
    workflow.add_node("evaluate", measure_outcomes)
    workflow.add_node("evolve", improve_agents)

    # Start each run at the analysis phase
    workflow.set_entry_point("analyze")

    # Define the cycle
    workflow.add_edge("analyze", "debate")
    workflow.add_edge("debate", "decide")
    workflow.add_edge("decide", "evaluate")
    workflow.add_edge("evaluate", "evolve")
    # Loop back. As written there is no edge to END, so a real build
    # would add a conditional exit once the day's cycle is complete.
    workflow.add_edge("evolve", "analyze")

    return workflow.compile()

Limitations and Risks

I need to be honest about what doesn’t work well yet:

1. Convergence isn’t guaranteed

The system might find local optima - good but not great performance - and get stuck there. Random exploration helps but isn’t systematic.

2. Evaluation gaming

Agents might learn to game the metrics rather than actually improve. If you measure click-through rate, agents might optimize for clickbait.
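One common mitigation, which the case study does not mention and which I am sketching here as my own assumption, is to pair the metric being optimized with a guardrail metric that must not get worse. All names and numbers below are hypothetical.

```python
def accept_change(
    primary_before: float,
    primary_after: float,
    guard_before: float,
    guard_after: float,
    guard_tolerance: float = 0.0,
) -> bool:
    """Keep a change only if the primary metric rises and the guardrail does not fall."""
    primary_improved = primary_after > primary_before
    guard_held = guard_after >= guard_before - guard_tolerance
    return primary_improved and guard_held


# Primary metric: click-through rate. Guardrail: share of readers who return.
print(accept_change(0.040, 0.046, guard_before=0.30, guard_after=0.22))  # False
print(accept_change(0.040, 0.046, guard_before=0.30, guard_after=0.31))  # True
```

In the first call the click-through rate rises while returning readers drop, which is what a clickbait-style rewrite would look like, so the change is rejected.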

3. Compute costs

Running 25 agents daily with continuous rewrites isn’t cheap. The market trading use case has clear ROI, but other domains might not justify the cost.

4. Lack of creativity

The system optimizes within its current framework. It won’t discover entirely new approaches - that still requires human creativity.

What’s Still Unknown

I couldn’t find detailed information about several aspects:

Karpathy’s original implementation specifics
How the system handles catastrophic failures
Convergence criteria (when to stop improving)
How much compute is actually required
Long-term stability of these systems

The public case study provides a high-level view but lacks implementation details.

Key Takeaways

Autoresearch loops automate improvement - AI systems can rewrite themselves based on performance feedback
Git operations provide memory - Version control enables learning and rollback
Evolutionary selection works - Roughly 30% of attempted changes survived in the case study; the rest were reverted because they did not improve performance
Multi-agent debate adds robustness - Layered agents with synthesis create better decisions
Evaluation design is critical - The metrics determine what the system optimizes for
Human role shifts - From improver to evaluator designer

The autoresearch loop suggests a future where AI systems don’t just execute tasks but actively improve their own capabilities. We’re moving from “AI that works” to “AI that gets better at working.”

This architecture connects to several broader concepts:

Evolutionary algorithms - Selection pressure on code
Reinforcement learning - Reward-driven behavior modification
Multi-agent systems - Coordination and debate structures
DevOps practices - Version control as core infrastructure
MLOps - Continuous improvement pipelines

The convergence of these fields is what makes the autoresearch loop possible.

Final Words + More Resources

My intention with this article was to share my knowledge and experience with others. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Andrej Karpathy on Twitter/X
👨‍💻 LangGraph - Multi-agent workflow framework
👨‍💻 Chris Worsey - Autoresearch loop implementation

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!