
How does Andrej Karpathy's autoresearch loop work for self-improving AI systems?

I kept hitting the same wall with my AI agents. Every time they made a mistake, I had to manually debug, rewrite prompts, and redeploy. The improvement cycle was painfully slow - sometimes taking weeks to see if a change actually helped.

Then I stumbled across Andrej Karpathy’s autoresearch loop concept. The idea was radical: what if AI systems could rewrite their own code based on actual performance, without any human intervention?

The Problem: Human Bottleneck in AI Improvement

Here’s the cycle I was stuck in:

improvement-cycle-problem.txt
Agent makes prediction → Outcome measured → I analyze results →
I rewrite prompts → I test changes → I deploy → Wait for results →
Agent makes prediction → ...

The loop worked, but I was the bottleneck. Every improvement required my:

  1. Manual analysis of what went wrong
  2. Creative thinking about prompt modifications
  3. Subjective judgment on what to keep
  4. Time to implement and test changes

For a single agent, this was manageable. But as I scaled to multiple specialized agents, the maintenance burden became overwhelming.

The Autoresearch Loop Solution

Karpathy’s insight: automate the entire improvement cycle using git operations as memory.

autoresearch-cycle.txt
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐              │
│   │  Agents  │───▶│ Evaluate │───▶│  Modify  │              │
│   │   Run    │    │ Results  │    │   Code   │              │
│   └──────────┘    └──────────┘    └──────────┘              │
│        ▲                               │                    │
│        │                               ▼                    │
│        │                        ┌──────────────┐            │
│        │                        │  Git Commit  │            │
│        │                        │ (successful) │            │
│        │                        │  Git Revert  │            │
│        │                        │   (failed)   │            │
│        │                        └──────────────┘            │
│        │                               │                    │
│        └───────────────────────────────┘                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The key innovation: git commit/revert as the learning mechanism. When code changes improve performance, they survive. When they hurt performance, the system automatically reverts to the previous version.
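To make the mechanism concrete, here is a minimal sketch of the keep-or-revert idea with git taken out of the picture. Everything in it is my own illustration, not Karpathy's code: `propose` stands in for the AI rewrite, `score` stands in for the evaluator, and "commit" and "revert" are just labels in a log.

```python
import random
from typing import Callable, List, Tuple


def keep_or_revert(
    initial: float,
    propose: Callable[[float], float],
    score: Callable[[float], float],
    rounds: int,
) -> Tuple[float, List[str]]:
    """Accept a proposed change only if it scores strictly better."""
    current = initial
    best = score(current)
    log: List[str] = []
    for _ in range(rounds):
        candidate = propose(current)  # stands in for an AI rewrite
        candidate_score = score(candidate)
        if candidate_score > best:
            # The change helped, so it becomes the new baseline ("commit")
            current, best = candidate, candidate_score
            log.append("commit")
        else:
            # The change did not help, so it is thrown away ("revert")
            log.append("revert")
    return current, log


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy objective: the closer the value is to 3.0, the better the score
    final, log = keep_or_revert(
        initial=0.0,
        propose=lambda x: x + rng.uniform(-1.0, 1.0),
        score=lambda x: -abs(x - 3.0),
        rounds=200,
    )
    print(final, log.count("commit"), log.count("revert"))
```

Because a candidate only replaces the baseline when it scores higher, the score can never go down from one round to the next. That is the whole trick; git just makes the baseline durable and auditable.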

Real-World Case: 25 Agents Trading Markets

Chris Worsey took Karpathy’s concept and applied it to financial markets. I studied his implementation closely because it’s one of the few public examples of this architecture in production.

The system uses 25 agents organized in 4 layers:

multi-agent-architecture.txt
Layer 1: Macro Analysis
├── Agent M1: Interest rate trends
├── Agent M2: Inflation signals
├── Agent M3: Geopolitical risk
└── ... (total ~6 agents)
Layer 2: Sector Analysis
├── Agent S1: Technology sector
├── Agent S2: Healthcare sector
├── Agent S3: Energy sector
└── ... (total ~8 agents)
Layer 3: Commodities
├── Agent C1: Oil/gas dynamics
├── Agent C2: Precious metals
└── ... (total ~5 agents)
Layer 4: Single Names
├── Agent N1: Apple analysis
├── Agent N2: Microsoft analysis
└── ... (total ~6 agents)
        ↓
Portfolio Manager Agent (synthesizes all inputs)
        ↓
Position Decisions

Each morning, agents debate their perspectives. The portfolio manager synthesizes everything into actual trading positions. By evening, real market outcomes score every agent’s contribution.

The Daily Improvement Cycle

Here’s what happens every single day:

daily-cycle.txt
06:00 - Agents analyze data and debate
08:00 - Portfolio manager synthesizes positions
09:30 - Market opens, positions taken
16:00 - Market closes, outcomes measured
18:00 - Performance scores calculated
20:00 - Worst agents identified
22:00 - System rewrites underperformers
00:00 - Git commit new versions (or revert if regression detected)

The numbers from the case study blew me away:

  • 54 prompt modifications attempted
  • 16 survived based on actual performance
  • ~30% survival rate - only beneficial changes persist

This is evolution applied to code. Natural selection where performance is the fitness function.
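The survival rate is simple arithmetic on the two counts above. A tiny helper (the function name is my own) makes it easy to track the same figure for your own runs:

```python
def survival_rate(attempted: int, survived: int) -> float:
    """Fraction of attempted modifications that were kept."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    if not 0 <= survived <= attempted:
        raise ValueError("survived must be between 0 and attempted")
    return survived / attempted


if __name__ == "__main__":
    # Figures from the case study: 54 attempted, 16 survived
    print(f"{survival_rate(54, 16):.1%}")  # prints 29.6%
```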

How Git Operations Enable Learning

I initially didn’t understand why git was central to this architecture. Then I realized: git provides the memory and rollback mechanism.

autoresearch_loop.py
class AutoresearchLoop:
    """Daily run -> evaluate -> rewrite cycle, with git as the memory.

    GitManager, synthesize, execute_positions, find_worst_performers,
    and agent.apply are placeholders you would have to supply.
    """

    def __init__(self, agents, evaluator, modifier):
        self.agents = agents
        self.evaluator = evaluator
        self.modifier = modifier
        self.version_control = GitManager()

    def run_daily_cycle(self, real_outcomes):
        # Morning: agents generate predictions
        predictions = [agent.predict() for agent in self.agents]
        # Synthesize into positions
        portfolio = self.synthesize(predictions)
        self.execute_positions(portfolio)
        # Evening: evaluate with real outcomes
        scores = self.evaluator.score(portfolio, real_outcomes)
        # Yesterday's rewrites have now been evaluated: revert on regression
        if self.evaluator.regression_detected():
            self.version_control.revert()
        # Night: identify underperformers
        worst_agents = self.find_worst_performers(scores)
        # Rewrite and commit
        for agent in worst_agents:
            # Save current state
            self.version_control.commit(
                agent,
                message=f"Baseline: {agent.id}"
            )
            # AI modifies the agent
            new_code = self.modifier.rewrite(agent)
            agent.apply(new_code)
            # Commit the modification
            self.version_control.commit(
                agent,
                message=f"Rewritten: {agent.id}"
            )

The git operations solve three critical problems:

  1. Memory - Previous versions are preserved
  2. Attribution - Each change is tracked to specific performance metrics
  3. Rollback - Failed experiments can be automatically undone
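The `GitManager` used above is never shown in the case study, so here is one way it could look. This is a sketch under my own assumptions: each agent lives in a single file, the code runs inside an existing git repository, and the class and method names are hypothetical. The git subcommands themselves (`add`, `commit`, `rev-parse`, `revert`) are standard.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], str]


def run_git(args: Sequence[str]) -> str:
    """Run a git command in the current directory and return its stdout."""
    result = subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


class GitManager:
    """Hypothetical wrapper: one file per agent, one commit per change."""

    def __init__(self, runner: Runner = run_git) -> None:
        # The runner is injectable so the class can be tested without git
        self.runner = runner

    def commit(self, path: str, message: str) -> str:
        """Memory and attribution: record a version and return its hash."""
        self.runner(["add", path])
        # --allow-empty lets a baseline be recorded even if nothing changed
        self.runner(["commit", "--allow-empty", "-m", message])
        return self.runner(["rev-parse", "HEAD"])

    def revert_last(self) -> None:
        """Rollback: add a new commit that undoes the most recent one."""
        self.runner(["revert", "--no-edit", "HEAD"])
```

I chose `git revert` over `git reset --hard` here because the failed attempt stays in the history. If attribution matters, you want a record of what was tried and rejected, not only of what survived.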

Performance-Based Agent Evolution

The survival mechanism fascinated me. Here’s how agents evolve:

agent_evolution.py
def evolve_agents(agents, performance_scores):
    """Evolve agents based on performance metrics.

    git_commit, analyze_failures, ai_rewrite, and agent.apply are
    placeholder helpers.
    """
    # Rank agents by actual outcomes, worst first
    ranked = sorted(agents, key=lambda a: performance_scores[a.id])
    # Bottom 20% get rewritten (at least one agent)
    cutoff = max(1, int(len(agents) * 0.2))
    worst_performers = ranked[:cutoff]
    for agent in worst_performers:
        # Save baseline
        git_commit(agent, f"Before rewrite: {agent.id}")
        # AI analyzes what went wrong
        failure_analysis = analyze_failures(agent, performance_scores)
        # Generate improved prompts/logic
        new_version = ai_rewrite(
            agent,
            context=failure_analysis
        )
        # Swap the rewritten version into the agent
        agent.apply(new_version)
        # Deploy and test
        git_commit(agent, f"Rewritten: {agent.id}")
    return agents


def evaluate_and_purge(old_scores, new_scores):
    """Keep only changes that do not hurt performance."""
    for agent_id in new_scores:
        if new_scores[agent_id] < old_scores[agent_id]:
            # Performance regressed - revert
            git_revert(agent_id)
            log_result(agent_id, "REVERTED - regression detected")
        else:
            log_result(agent_id, "KEPT - no regression detected")

The key insight: no human decides what stays. Only performance metrics matter.
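The selection step is the part most worth testing in isolation, because an off-by-one in the slice quietly changes how many agents get rewritten. Here is a self-contained sketch of picking the bottom 20% by score. The function name and the "at least one agent" rule are my own choices, not something the case study specifies.

```python
import math
from typing import Dict, List


def select_worst(scores: Dict[str, float], fraction: float = 0.2) -> List[str]:
    """Return the ids of the lowest-scoring `fraction` of agents."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not scores:
        return []
    ranked = sorted(scores, key=scores.get)  # lowest score first
    # Always rewrite at least one agent, even for very small teams
    count = max(1, math.floor(len(ranked) * fraction))
    return ranked[:count]


if __name__ == "__main__":
    demo = {f"agent_{i}": i / 10 for i in range(10)}
    print(select_worst(demo))  # the two lowest-scoring agents
```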

Multi-Agent Debate Structure

The layered debate structure caught my attention. Why organize agents in layers?

debate-flow.txt
Agent A says: "Tech sector is overvalued, P/E ratios extreme"
Agent B argues: "But AI growth justifies higher multiples"
Agent C counters: "Regulatory risks in EU overlooked"
Consensus: "Mixed signals - reduce tech exposure but keep AI leaders"

The debate process creates more robust decisions:

multi_agent_debate.py
def run_debate_cycle(agents, layers, portfolio_manager):
    """Run multi-layer agent debate.

    has_consensus and synthesize_consensus are placeholder helpers.
    """
    layer_outputs = {}
    for layer in layers:
        # Get agents for this specialization
        layer_agents = [a for a in agents if a.layer == layer]
        # Agents present arguments
        arguments = [agent.analyze() for agent in layer_agents]
        # Debate until convergence
        consensus = debate_to_consensus(layer_agents, arguments)
        layer_outputs[layer] = consensus
    # Portfolio manager synthesizes all layers
    final_decision = portfolio_manager.synthesize(layer_outputs)
    return final_decision


def debate_to_consensus(layer_agents, arguments):
    """Agents debate until they agree or the round limit is reached."""
    iteration = 0
    max_iterations = 5
    while not has_consensus(arguments) and iteration < max_iterations:
        # Each agent responds to the others' current arguments
        arguments = [
            agent.respond_to_others(arguments)
            for agent in layer_agents
        ]
        iteration += 1
    return synthesize_consensus(arguments)

The layers provide specialization, the debate provides robustness, and the portfolio manager provides integration.
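The convergence check is left undefined in the listing above, so here is a toy version to show the control flow. It is a deliberate simplification and an assumption on my part: each agent's argument is reduced to a single number between -1 (bearish) and 1 (bullish), and "responding to others" just means moving halfway toward the group average. A real system would use LLM calls here; all names are hypothetical.

```python
from statistics import mean
from typing import List


def stances_agree(stances: List[float], tolerance: float = 0.1) -> bool:
    """Consensus here means every stance is within `tolerance` of the others."""
    return max(stances) - min(stances) <= tolerance


def debate_round(stances: List[float], pull: float = 0.5) -> List[float]:
    """Each agent moves part of the way toward the group average."""
    center = mean(stances)
    return [s + pull * (center - s) for s in stances]


def run_toy_debate(stances: List[float], max_rounds: int = 5) -> float:
    """Debate until the stances agree or the round limit is reached."""
    if not stances:
        raise ValueError("need at least one stance")
    rounds = 0
    while not stances_agree(stances) and rounds < max_rounds:
        stances = debate_round(stances)
        rounds += 1
    return mean(stances)


if __name__ == "__main__":
    # One bearish, one neutral, one bullish agent
    print(run_toy_debate([-1.0, 0.0, 1.0]))
```

The round limit matters as much as the agreement test. Without it, agents that never converge would stall the whole daily cycle.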

What I Tried: Implementing My Own Loop

After studying the architecture, I attempted a simplified version for a content recommendation system:

my-implementation-attempt.txt
My Setup:
- 5 agents analyzing article quality
- Single layer (simpler than the 4-layer approach)
- Daily evaluation based on click-through rates
- Git-based version control for agent prompts
Results after 2 weeks:
- 23 modifications attempted
- 7 survived based on CTR improvement
- 2 agents showed significant improvement
- 1 agent actually got worse and was reverted

The implementation taught me several lessons:

Lesson 1: Evaluation Metrics Must Be Clean

My first attempt used noisy metrics (time-on-page). The system kept making changes based on random fluctuations. I switched to click-through rate and suddenly the improvements became real.
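One way to tell a real improvement from a random fluctuation is a standard two-proportion z-test on the click counts before and after a change. I did not use this in my original setup, so treat it as a sketch of what I would add. The function names and the 1.96 cutoff (roughly a 95% two-sided level) are my own choices.

```python
import math


def ctr_z_score(clicks_a: int, views_a: int, clicks_b: int, views_b: int) -> float:
    """Two-proportion z-score for the CTR difference between B and A."""
    if views_a <= 0 or views_b <= 0:
        raise ValueError("view counts must be positive")
    ctr_a = clicks_a / views_a
    ctr_b = clicks_b / views_b
    # Pooled CTR under the assumption that A and B are really the same
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    if std_err == 0:
        return 0.0
    return (ctr_b - ctr_a) / std_err


def is_real_change(clicks_a: int, views_a: int, clicks_b: int, views_b: int,
                   z_threshold: float = 1.96) -> bool:
    """True when the CTR difference is larger than noise would explain."""
    z = ctr_z_score(clicks_a, views_a, clicks_b, views_b)
    return abs(z) >= z_threshold


if __name__ == "__main__":
    print(is_real_change(50, 1000, 52, 1000))  # small difference
    print(is_real_change(50, 1000, 90, 1000))  # large difference
```

With only a day of traffic per agent, many differences will not clear this bar. That is useful information: it means the loop should wait for more data before committing or reverting.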

Lesson 2: The Baseline Matters

I initially didn’t save proper baselines. When a change happened, I couldn’t tell if it helped or hurt. Once I added proper git tagging for each version, the system actually learned.

Lesson 3: Revert Threshold is Critical

Too aggressive: system reverts everything, no learning happens. Too permissive: system keeps bad changes, performance degrades.

I settled on: revert if performance drops >5%. This allowed exploration while preventing catastrophic regression.
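The rule fits in a few lines. This sketch assumes the 5% is a relative drop from the baseline score (not percentage points) and that higher scores are better; the function name is my own.

```python
def should_revert(baseline: float, current: float, max_drop: float = 0.05) -> bool:
    """Revert when the relative drop from the baseline exceeds `max_drop`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive for a relative comparison")
    relative_drop = (baseline - current) / baseline
    return relative_drop > max_drop


if __name__ == "__main__":
    print(should_revert(0.72, 0.70))  # small dip, tolerated
    print(should_revert(0.72, 0.60))  # large drop, reverted
```

Raising `max_drop` makes the system more tolerant of short-term dips, which allows more exploration. Lowering it makes the system more conservative.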

Why This Architecture Matters

The autoresearch loop represents a shift from:

paradigm-shift.txt
Traditional AI Development:
  Human designs system → Human tests → Human improves → Repeat
Autoresearch Loop:
  Human designs evaluator → System tests itself →
  System improves itself → Repeat (no human)

The human moves from being the improver to being the evaluator designer. This is a fundamentally different role.

What the System Does Automatically

  1. Identifies weak agents
  2. Generates hypotheses for improvement
  3. Implements code/prompt changes
  4. Tests against real outcomes
  5. Retains successful changes
  6. Discards failed experiments

What Humans Still Do

  1. Design the evaluation metrics
  2. Set up the architecture
  3. Monitor for systematic failures
  4. Provide the compute budget

Practical Considerations

If you’re thinking about implementing this:

Git Strategy:

git-workflow.sh
# Each agent has its own branch
git checkout agents/macro-analyzer
# Before modification
git commit -am "Baseline: 2026-03-29 performance score 0.72"
# After AI rewrite
git commit -am "Modified: adjusted inflation signal weight"
# If performance improves: keep the branch as it is
# If performance regresses: run ONE of the two commands below
# git revert --no-edit HEAD   # undoes the rewrite, keeps it in history
# git reset --hard HEAD~1     # drops the rewrite commit entirely

Evaluation Design:

The evaluator is the most critical component. Garbage in, garbage out:

evaluator_design.py
def design_evaluation_metrics():
    """Good evaluation metrics are:
    1. Objective (no human judgment)
    2. Fast (daily or faster feedback)
    3. Stable (low noise)
    4. Relevant (measures what matters)
    """
    return {
        'financial': {
            'metric': 'risk_adjusted_return',
            'feedback_frequency': 'daily',
            'stability': 'high'
        },
        'content': {
            'metric': 'engagement_rate',
            'feedback_frequency': 'hourly',
            'stability': 'medium'
        },
        'coding': {
            'metric': 'test_pass_rate',
            'feedback_frequency': 'per_commit',
            'stability': 'very_high'
        }
    }
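The case study does not say how risk-adjusted return is computed. One common choice is a Sharpe-style ratio: average excess return divided by the volatility of returns. The sketch below assumes that definition, uses daily returns without annualizing, and uses names of my own.

```python
from statistics import mean, stdev
from typing import Sequence


def risk_adjusted_return(daily_returns: Sequence[float],
                         risk_free_daily: float = 0.0) -> float:
    """Sharpe-style ratio: mean excess return divided by its volatility."""
    if len(daily_returns) < 2:
        raise ValueError("need at least two returns to estimate volatility")
    excess = [r - risk_free_daily for r in daily_returns]
    volatility = stdev(excess)  # sample standard deviation
    if volatility == 0:
        raise ValueError("volatility is zero; the ratio is undefined")
    return mean(excess) / volatility


if __name__ == "__main__":
    steady = [0.01, 0.02, 0.03]
    erratic = [-0.04, 0.02, 0.08]
    # Same average return, but the steadier series scores higher
    print(risk_adjusted_return(steady), risk_adjusted_return(erratic))
```

Scoring agents this way rewards consistency, not just raw gains, which makes it harder for a lucky high-variance agent to survive selection.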

LangGraph Integration:

The case study mentioned LangGraph, which makes sense for managing multi-agent workflows:

langgraph_setup.py
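from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    """Placeholder state schema; the real fields depend on your agents."""
    keep_running: bool

def build_autoresearch_graph():
    """Build the autoresearch workflow (you supply the node functions)."""
    workflow = StateGraph(AgentState)
    # Add nodes for each phase
    workflow.add_node("analyze", run_analysis)
    workflow.add_node("debate", run_debate)
    workflow.add_node("decide", make_decisions)
    workflow.add_node("evaluate", measure_outcomes)
    workflow.add_node("evolve", improve_agents)
    # Start each run at the analysis phase
    workflow.set_entry_point("analyze")
    # Define the cycle
    workflow.add_edge("analyze", "debate")
    workflow.add_edge("debate", "decide")
    workflow.add_edge("decide", "evaluate")
    workflow.add_edge("evaluate", "evolve")
    # Loop back, or stop when the state says so (keep_running is hypothetical)
    workflow.add_conditional_edges(
        "evolve",
        lambda state: "analyze" if state["keep_running"] else END,
        {"analyze": "analyze", END: END},
    )
    return workflow.compile()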

Limitations and Risks

I need to be honest about what doesn’t work well yet:

1. Convergence isn’t guaranteed

The system might find local optima - good but not great performance - and get stuck there. Random exploration helps but isn’t systematic.

2. Evaluation gaming

Agents might learn to game the metrics rather than actually improve. If you measure click-through rate, agents might optimize for clickbait.

3. Compute costs

Running 25 agents daily with continuous rewrites isn’t cheap. The market trading use case has clear ROI, but other domains might not justify the cost.

4. Lack of creativity

The system optimizes within its current framework. It won’t discover entirely new approaches - that still requires human creativity.

What’s Still Unknown

I couldn’t find detailed information about several aspects:

  • Karpathy’s original implementation specifics
  • How the system handles catastrophic failures
  • Convergence criteria (when to stop improving)
  • How much compute is actually required
  • Long-term stability of these systems

The public case study provides a high-level view but lacks implementation details.

Key Takeaways

  1. Autoresearch loops automate improvement - AI systems can rewrite themselves based on performance feedback

  2. Git operations provide memory - Version control enables learning and rollback

  3. Evolutionary selection works - Only about 30% of attempted changes survived, so most rewrites were filtered out by measured performance

  4. Multi-agent debate adds robustness - Layered agents with synthesis create better decisions

  5. Evaluation design is critical - The metrics determine what the system optimizes for

  6. Human role shifts - From improver to evaluator designer

The autoresearch loop suggests a future where AI systems don’t just execute tasks but actively improve their own capabilities. We’re moving from “AI that works” to “AI that gets better at working.”


This architecture connects to several broader concepts:

  • Evolutionary algorithms - Selection pressure on code
  • Reinforcement learning - Reward-driven behavior modification
  • Multi-agent systems - Coordination and debate structures
  • DevOps practices - Version control as core infrastructure
  • MLOps - Continuous improvement pipelines

The convergence of these fields is what makes the autoresearch loop possible.

Final Words + More Resources

My intention with this article was to share my knowledge and experience with others. If you want to get in touch, you can contact me by email.
