How does Andrej Karpathy's autoresearch loop work for self-improving AI systems?
I kept hitting the same wall with my AI agents. Every time they made a mistake, I had to manually debug, rewrite prompts, and redeploy. The improvement cycle was painfully slow - sometimes taking weeks to see if a change actually helped.
Then I stumbled across Andrej Karpathy’s autoresearch loop concept. The idea was radical: what if AI systems could rewrite their own code based on actual performance, without any human intervention?
The Problem: Human Bottleneck in AI Improvement
Here’s the cycle I was stuck in:
    Agent makes prediction → Outcome measured → I analyze results →
    I rewrite prompts → I test changes → I deploy → Wait for results →
    Agent makes prediction → ...

The loop worked, but I was the bottleneck. Every improvement required my:
- Manual analysis of what went wrong
- Creative thinking about prompt modifications
- Subjective judgment on what to keep
- Time to implement and test changes
For a single agent, this was manageable. But as I scaled to multiple specialized agents, the maintenance burden became overwhelming.
The Autoresearch Loop Solution
Karpathy’s insight: automate the entire improvement cycle using git operations as memory.
    ┌─────────────────────────────────────────────────────────────┐
    │                                                             │
    │   ┌──────────┐    ┌──────────┐    ┌──────────┐              │
    │   │  Agents  │───▶│ Evaluate │───▶│  Modify  │              │
    │   │   Run    │    │ Results  │    │   Code   │              │
    │   └──────────┘    └──────────┘    └──────────┘              │
    │        ▲                               │                    │
    │        │                               ▼                    │
    │        │                        ┌──────────────┐            │
    │        │                        │  Git Commit  │            │
    │        │                        │ (successful) │            │
    │        │                        │  Git Revert  │            │
    │        │                        │   (failed)   │            │
    │        │                        └──────────────┘            │
    │        │                               │                    │
    │        └───────────────────────────────┘                    │
    │                                                             │
    └─────────────────────────────────────────────────────────────┘

The key innovation: git commit/revert as the learning mechanism. When code changes improve performance, they survive. When they hurt performance, the system automatically reverts to the previous version.
Real-World Case: 25 Agents Trading Markets
Chris Worsey took Karpathy’s concept and applied it to financial markets. I studied his implementation closely because it’s one of the few public examples of this architecture in production.
The system uses 25 agents organized in 4 layers:
    Layer 1: Macro Analysis
    ├── Agent M1: Interest rate trends
    ├── Agent M2: Inflation signals
    ├── Agent M3: Geopolitical risk
    ├── ... (total ~6 agents)
    │
    Layer 2: Sector Analysis
    ├── Agent S1: Technology sector
    ├── Agent S2: Healthcare sector
    ├── Agent S3: Energy sector
    ├── ... (total ~8 agents)
    │
    Layer 3: Commodities
    ├── Agent C1: Oil/gas dynamics
    ├── Agent C2: Precious metals
    ├── ... (total ~5 agents)
    │
    Layer 4: Single Names
    ├── Agent N1: Apple analysis
    ├── Agent N2: Microsoft analysis
    ├── ... (total ~6 agents)
    │
    ▼
    Portfolio Manager Agent (synthesizes all inputs)
    │
    ▼
    Position Decisions

Each morning, agents debate their perspectives. The portfolio manager synthesizes everything into actual trading positions. By evening, real market outcomes score every agent’s contribution.
The Daily Improvement Cycle
Here’s what happens every single day:
    06:00 - Agents analyze data and debate
    08:00 - Portfolio manager synthesizes positions
    09:30 - Market opens, positions taken
    16:00 - Market closes, outcomes measured
    18:00 - Performance scores calculated
    20:00 - Worst agents identified
    22:00 - System rewrites underperformers
    00:00 - Git commit new versions (or revert if regression detected)

The numbers from the case study blew me away:
- 54 prompt modifications attempted
- 16 survived based on actual performance
- ~30% survival rate - only beneficial changes persist
This is evolution applied to code. Natural selection where performance is the fitness function.
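The survival rate quoted above is simple arithmetic, and it is worth computing explicitly when you track your own loop. A minimal sketch, where `survival_rate` is a helper name I made up for illustration:

```python
def survival_rate(attempted: int, survived: int) -> float:
    """Fraction of attempted modifications that were kept."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    return survived / attempted


# Numbers reported in the case study: 54 attempted, 16 survived.
case_study_rate = survival_rate(attempted=54, survived=16)
print(f"{case_study_rate:.1%}")  # 29.6%
```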
How Git Operations Enable Learning
I initially didn’t understand why git was central to this architecture. Then I realized: git provides the memory and rollback mechanism.
    class AutoresearchLoop:
        """Daily improve-or-revert cycle. GitManager, synthesize,
        execute_positions and find_worst_performers are left unspecified."""

        def __init__(self, agents, evaluator, modifier):
            self.agents = agents
            self.evaluator = evaluator
            self.modifier = modifier
            self.version_control = GitManager()

        def run_daily_cycle(self, real_outcomes):
            # Morning: agents generate predictions
            predictions = [agent.predict() for agent in self.agents]

            # Synthesize into positions
            portfolio = self.synthesize(predictions)
            self.execute_positions(portfolio)

            # Evening: evaluate with real outcomes
            scores = self.evaluator.score(portfolio, real_outcomes)

            # Yesterday's rewrites have now been scored: revert on regression
            if self.evaluator.regression_detected():
                self.version_control.revert()

            # Night: identify underperformers
            worst_agents = self.find_worst_performers(scores)

            # Rewrite and commit
            for agent in worst_agents:
                # Save current state
                self.version_control.commit(agent, message=f"Baseline: {agent.id}")

                # AI modifies the agent (agent.code is an assumed attribute)
                agent.code = self.modifier.rewrite(agent)

                # Commit the modification
                self.version_control.commit(agent, message=f"Rewritten: {agent.id}")

The git operations solve three critical problems:
- Memory - Previous versions are preserved
- Attribution - Each change is tracked to specific performance metrics
- Rollback - Failed experiments can be automatically undone
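The `GitManager` used in the class above is never defined in the material I studied, so here is a minimal sketch of what it could look like. Everything in it is my assumption: it takes a file path rather than an agent object, it shells out to the standard `git` CLI (`git add`, `git commit`, `git revert`), and the `run` parameter exists only so the class can be exercised without a real repository.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], None]


def _run_git(args: Sequence[str]) -> None:
    """Default runner: execute a git command and raise if it fails."""
    subprocess.run(["git", *args], check=True)


class GitManager:
    """Hypothetical helper: one prompt file per agent, tracked in git."""

    def __init__(self, run: Runner = _run_git):
        self._run = run

    def commit(self, path: str, message: str) -> None:
        """Memory: snapshot the agent file with a descriptive message."""
        self._run(["add", path])
        # --allow-empty lets a baseline commit succeed when nothing changed.
        self._run(["commit", "--allow-empty", "-m", message])

    def revert_last(self) -> None:
        """Rollback: undo HEAD with a new commit, keeping the failed
        attempt in history (assumes HEAD actually changed a file)."""
        self._run(["revert", "--no-edit", "HEAD"])


# Dry run: record the commands instead of executing them.
recorded = []
manager = GitManager(run=lambda args: recorded.append(list(args)))
manager.commit("agents/macro.txt", "Baseline: macro")
manager.revert_last()
```

Putting the score in the commit message, as in the git strategy later in this article, is one simple way to get attribution as well.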
Performance-Based Agent Evolution
The survival mechanism fascinated me. Here’s how agents evolve:
    def evolve_agents(agents, performance_scores):
        """Evolve agents based on performance metrics"""
        # Rank agents by actual outcomes, worst first
        ranked = sorted(agents, key=lambda a: performance_scores[a.id])

        # Bottom 20% get rewritten (always at least one agent)
        cutoff = max(1, int(len(agents) * 0.2))
        worst_performers = ranked[:cutoff]

        for agent in worst_performers:
            # Save baseline
            git_commit(agent, f"Before rewrite: {agent.id}")

            # AI analyzes what went wrong
            failure_analysis = analyze_failures(agent, performance_scores)

            # Generate improved prompts/logic and apply them
            # (agent.prompt is an assumed attribute)
            agent.prompt = ai_rewrite(agent, context=failure_analysis)

            # Deploy and test
            git_commit(agent, f"Rewritten: {agent.id}")

        return agents


    def evaluate_and_purge(old_scores, new_scores):
        """Keep only changes that do not regress performance.

        new_scores should contain only the agents rewritten in the last cycle.
        """
        for agent_id in new_scores:
            if new_scores[agent_id] < old_scores[agent_id]:
                # Performance regressed - revert
                git_revert(agent_id)
                log_result(agent_id, "REVERTED - regression detected")
            else:
                log_result(agent_id, "KEPT - no regression")

The key insight: no human decides what stays. Only performance metrics matter.
Multi-Agent Debate Structure
The layered debate structure caught my attention. Why organize agents in layers?
    Agent A says: "Tech sector is overvalued, P/E ratios extreme"
    Agent B argues: "But AI growth justifies higher multiples"
    Agent C counters: "Regulatory risks in EU overlooked"
            │
            ▼
    Consensus: "Mixed signals - reduce tech exposure but keep AI leaders"

The debate process creates more robust decisions:
    def run_debate_cycle(agents, layers, portfolio_manager):
        """Run multi-layer agent debate"""
        layer_outputs = {}

        for layer in layers:
            # Get agents for this specialization
            layer_agents = [a for a in agents if a.layer == layer]

            # Agents present arguments
            arguments = [agent.analyze() for agent in layer_agents]

            # Debate until convergence
            consensus = debate_to_consensus(layer_agents, arguments)

            layer_outputs[layer] = consensus

        # Portfolio manager synthesizes all layers
        final_decision = portfolio_manager.synthesize(layer_outputs)

        return final_decision


    def debate_to_consensus(layer_agents, arguments, max_iterations=5):
        """Agents debate until they agree or the round limit is reached"""
        iteration = 0

        while not has_consensus(arguments) and iteration < max_iterations:
            # Each agent responds to others' arguments
            arguments = [
                agent.respond_to_others(arguments)
                for agent in layer_agents
            ]
            iteration += 1

        return synthesize_consensus(arguments)

The layers provide specialization, the debate provides robustness, and the portfolio manager provides integration.
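The debate code calls `has_consensus` and `synthesize_consensus` without showing them, and the case study doesn't say how agreement is measured. One simple possibility, sketched below, is a quorum vote over stance labels. The argument shape (`author`, `stance`, `reason`) and the two-thirds quorum are my assumptions, not details from the original system.

```python
from collections import Counter
from typing import Dict, List

# Assumed shape of one argument: {"author": ..., "stance": ..., "reason": ...}
Argument = Dict[str, str]


def has_consensus(arguments: List[Argument], quorum: float = 0.66) -> bool:
    """True when the most common stance is held by at least `quorum` of agents."""
    if not arguments:
        return False
    counts = Counter(arg["stance"] for arg in arguments)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(arguments) >= quorum


def synthesize_consensus(arguments: List[Argument]) -> Dict[str, object]:
    """Return the majority stance together with the reasons supporting it."""
    if not arguments:
        raise ValueError("cannot synthesize an empty debate")
    stance = Counter(arg["stance"] for arg in arguments).most_common(1)[0][0]
    reasons = [arg["reason"] for arg in arguments if arg["stance"] == stance]
    return {"stance": stance, "reasons": reasons}


debate = [
    {"author": "A", "stance": "reduce", "reason": "P/E ratios extreme"},
    {"author": "B", "stance": "hold", "reason": "AI growth justifies multiples"},
    {"author": "C", "stance": "reduce", "reason": "EU regulatory risk"},
]
result = synthesize_consensus(debate)
```

A vote like this throws away nuance such as "reduce exposure but keep AI leaders", so an LLM-written summary of the majority position is probably closer to what a production system would do.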
What I Tried: Implementing My Own Loop
After studying the architecture, I attempted a simplified version for a content recommendation system:
My Setup:
- 5 agents analyzing article quality
- Single layer (simpler than the 4-layer approach)
- Daily evaluation based on click-through rates
- Git-based version control for agent prompts

Results after 2 weeks:
- 23 modifications attempted
- 7 survived based on CTR improvement
- 2 agents showed significant improvement
- 1 agent actually got worse and was reverted

The implementation taught me several lessons:
Lesson 1: Evaluation Metrics Must Be Clean
My first attempt used a noisy metric (time-on-page). The system kept making changes based on random fluctuations. Once I switched to click-through rate, the changes that survived reflected real improvements instead of noise.
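Even a cleaner metric like CTR fluctuates from day to day, so it helps to check whether a difference is larger than sampling noise before treating it as an improvement. The sketch below uses a standard two-proportion z-test. The function name and the 1.96 threshold (roughly 95% confidence) are my choices for illustration, not something I ran in the experiment above.

```python
import math


def ctr_change_is_significant(
    clicks_before: int,
    views_before: int,
    clicks_after: int,
    views_after: int,
    z_threshold: float = 1.96,
) -> bool:
    """Two-proportion z-test: is the CTR difference larger than sampling noise?"""
    ctr_before = clicks_before / views_before
    ctr_after = clicks_after / views_after

    # Pooled click probability under the assumption that nothing changed.
    pooled = (clicks_before + clicks_after) / (views_before + views_after)
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / views_before + 1 / views_after)
    )
    if std_err == 0:
        return False

    z_score = (ctr_after - ctr_before) / std_err
    return abs(z_score) >= z_threshold


# 5.0% -> 5.5% on 1,000 views each is within noise; 5.0% -> 9.0% is not.
small_shift = ctr_change_is_significant(50, 1000, 55, 1000)
large_shift = ctr_change_is_significant(50, 1000, 90, 1000)
```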
Lesson 2: The Baseline Matters
I initially didn’t save proper baselines. When a change happened, I couldn’t tell if it helped or hurt. Once I added proper git tagging for each version, the system actually learned.
Lesson 3: Revert Threshold is Critical
Too aggressive: system reverts everything, no learning happens. Too permissive: system keeps bad changes, performance degrades.
I settled on: revert if performance drops >5%. This allowed exploration while preventing catastrophic regression.
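As a sketch, the rule can be written as a single function. I'm reading "drops >5%" as a relative drop against the baseline score. The name `should_revert` and the fallback for non-positive baselines are assumptions for illustration.

```python
def should_revert(
    baseline_score: float, new_score: float, max_drop: float = 0.05
) -> bool:
    """Revert only when the score falls more than `max_drop` (relative) below baseline."""
    if baseline_score <= 0:
        # A relative drop is undefined here, so fall back to a plain comparison.
        return new_score < baseline_score
    relative_drop = (baseline_score - new_score) / baseline_score
    return relative_drop > max_drop


small_dip = should_revert(0.72, 0.70)   # about a 2.8% drop: tolerated
big_drop = should_revert(0.72, 0.60)    # about a 16.7% drop: reverted
improved = should_revert(0.72, 0.80)    # improvement: kept
```

Tolerating small dips is what leaves room for exploration: a change that looks slightly worse on one noisy day is not thrown away immediately.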
Why This Architecture Matters
The autoresearch loop represents a shift from:
Traditional AI Development:
Human designs system → Human tests → Human improves → Repeat

Autoresearch Loop:
Human designs evaluator → System tests itself → System improves itself → Repeat (no human)

The human moves from being the improver to being the evaluator designer. This is a fundamentally different role.
What the System Does Automatically
- Identifies weak agents
- Generates hypotheses for improvement
- Implements code/prompt changes
- Tests against real outcomes
- Retains successful changes
- Discards failed experiments
What Humans Still Do
- Design the evaluation metrics
- Set up the architecture
- Monitor for systematic failures
- Provide the compute budget
Practical Considerations
If you’re thinking about implementing this:
Git Strategy:
    # Each agent has its own branch
    git checkout agents/macro-analyzer

    # Before modification
    git commit -am "Baseline: 2026-03-29 performance score 0.72"

    # After AI rewrite
    git commit -am "Modified: adjusted inflation signal weight"

    # If performance improves: keep the branch
    # If performance regresses: revert or reset
    git reset --hard HEAD~1
    # (or, to keep the failed attempt in history: git revert --no-edit HEAD)

Evaluation Design:
The evaluator is the most critical component. Garbage in, garbage out:
    def design_evaluation_metrics():
        """Good evaluation metrics are:
        1. Objective (no human judgment)
        2. Fast (daily or faster feedback)
        3. Stable (low noise)
        4. Relevant (measures what matters)
        """
        return {
            'financial': {
                'metric': 'risk_adjusted_return',
                'feedback_frequency': 'daily',
                'stability': 'high'
            },
            'content': {
                'metric': 'engagement_rate',
                'feedback_frequency': 'hourly',
                'stability': 'medium'
            },
            'coding': {
                'metric': 'test_pass_rate',
                'feedback_frequency': 'per_commit',
                'stability': 'very_high'
            }
        }

LangGraph Integration:
The case study mentioned LangGraph, which makes sense for managing multi-agent workflows:
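    from typing import TypedDict

    from langgraph.graph import StateGraph, END


    class AgentState(TypedDict, total=False):
        """Shared state passed between nodes (fields are illustrative)."""
        predictions: list
        scores: dict


    def build_autoresearch_graph():
        """Build the autoresearch workflow"""
        workflow = StateGraph(AgentState)

        # Add nodes for each phase
        workflow.add_node("analyze", run_analysis)
        workflow.add_node("debate", run_debate)
        workflow.add_node("decide", make_decisions)
        workflow.add_node("evaluate", measure_outcomes)
        workflow.add_node("evolve", improve_agents)

        # Start each run at the analysis phase
        workflow.set_entry_point("analyze")

        # Define the cycle
        workflow.add_edge("analyze", "debate")
        workflow.add_edge("debate", "decide")
        workflow.add_edge("decide", "evaluate")
        workflow.add_edge("evaluate", "evolve")
        # Loop back. As written the cycle never routes to END, so a real
        # graph needs a conditional edge or a recursion limit to stop.
        workflow.add_edge("evolve", "analyze")

        return workflow.compile()

Limitations and Risks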
I need to be honest about what doesn’t work well yet:
1. Convergence isn’t guaranteed
The system might find local optima - good but not great performance - and get stuck there. Random exploration helps but isn’t systematic.
2. Evaluation gaming
Agents might learn to game the metrics rather than actually improve. If you measure click-through rate, agents might optimize for clickbait.
3. Compute costs
Running 25 agents daily with continuous rewrites isn’t cheap. The market trading use case has clear ROI, but other domains might not justify the cost.
4. Lack of creativity
The system optimizes within its current framework. It won’t discover entirely new approaches - that still requires human creativity.
What’s Still Unknown
I couldn’t find detailed information about several aspects:
- Karpathy’s original implementation specifics
- How the system handles catastrophic failures
- Convergence criteria (when to stop improving)
- How much compute is actually required
- Long-term stability of these systems
The public case study provides a high-level view but lacks implementation details.
Key Takeaways
- Autoresearch loops automate improvement - AI systems can rewrite themselves based on performance feedback
- Git operations provide memory - Version control enables learning and rollback
- Evolutionary selection works - Only about 30% of attempted changes survived, so just the ones that measurably helped persist
- Multi-agent debate adds robustness - Layered agents with synthesis create better decisions
- Evaluation design is critical - The metrics determine what the system optimizes for
- Human role shifts - From improver to evaluator designer
The autoresearch loop suggests a future where AI systems don’t just execute tasks but actively improve their own capabilities. We’re moving from “AI that works” to “AI that gets better at working.”
Related Topics
This architecture connects to several broader concepts:
- Evolutionary algorithms - Selection pressure on code
- Reinforcement learning - Reward-driven behavior modification
- Multi-agent systems - Coordination and debate structures
- DevOps practices - Version control as core infrastructure
- MLOps - Continuous improvement pipelines
The convergence of these fields is what makes the autoresearch loop possible.
Final Words + More Resources
My intention with this article was to share my knowledge and experience with others. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Andrej Karpathy on Twitter/X
- LangGraph - Multi-agent workflow framework
- Chris Worsey - Autoresearch loop implementation
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!