
How to Build a Self-Improving Coding Agent That Edits Its Own Codebase

The Problem: Manual Architecture Design is Slow

When I started building AI agents, I spent weeks iterating on architectures. Each change required manual code editing, running benchmarks, and analyzing results. The cycle was slow and error-prone.

I wondered: could an agent improve itself? Could it propose changes to its own code, test them, and keep successful modifications?

The answer is yes. Self-improving coding agents do exactly this. They run evaluation benchmarks, propose code changes, test improvements, and keep successful modifications while reverting regressions.

What is a Self-Improving Coding Agent?

A self-improving coding agent is an AI system that can propose, test, and apply modifications to its own source code. Unlike traditional AI improvement that tunes parameters or prompts, these agents modify their own architecture.

The key difference: scaffold-level improvement changes how the agent thinks, not just what it knows.

Traditional improvement focuses on:

  • Model weights (gradient descent)
  • Hyperparameters (grid search)
  • Prompts (prompt engineering)

Self-improving agents modify:

  • Agent architecture (how components connect)
  • Tool definitions (what actions are available)
  • Reasoning patterns (how problems are decomposed)
  • Memory systems (how context is managed)

The Self-Improvement Loop

The basic loop consists of five phases:

self-improvement-loop.txt
+-------------------+
|   Current Agent   |
+-------------------+
          |
          v
+-------------------+
|   Self-Reflect    |  <-- Analyze own code, identify improvements
+-------------------+
          |
          v
+-------------------+
|  Propose Change   |  <-- LLM suggests specific modification
+-------------------+
          |
          v
+-------------------+
|  Run Evaluation   |  <-- Test on benchmark tasks
+-------------------+
          |
     +----+----+
     |         |
     v         v
  [Keep]   [Revert]

Let me show you a minimal implementation:

basic_self_improving_agent.py
import json
from typing import Callable, Dict, List


class SelfImprovingAgent:
    """
    Minimal self-improving agent scaffold.
    WARNING: For educational purposes only. Not production-safe.
    """

    def __init__(self, code_path: str, benchmark: Callable[[str], float]):
        self.code_path = code_path
        self.benchmark = benchmark
        self.history: List[Dict] = []
        self.best_score = float('-inf')

    def self_reflect(self) -> str:
        """Analyze own code and propose improvement."""
        with open(self.code_path, 'r') as f:
            current_code = f.read()
        # In practice: call LLM API with code + history
        proposal = self._generate_proposal(current_code, self.history)
        return proposal

    def _generate_proposal(self, code: str, history: List[Dict]) -> str:
        """
        LLM proposes code modification.
        Includes: analysis of past attempts, failure modes, potential improvements
        """
        prompt = f"""
Current agent code:
{code}
Past attempts and results:
{json.dumps(history[-10:], indent=2)}
Propose ONE specific code change that could improve benchmark performance.
Return only the modified code.
"""
        # Placeholder: replace with your own LLM client, e.g. `return llm_call(prompt)`
        raise NotImplementedError("Plug in an LLM call that returns the modified code")

    def evaluate(self) -> float:
        """Run benchmark and return score."""
        return self.benchmark(self.code_path)

    def improve(self, max_iterations: int = 100):
        """Main self-improvement loop."""
        self.best_score = self.evaluate()
        for i in range(max_iterations):
            # 1. Self-reflect and propose change
            modified_code = self.self_reflect()
            # 2. Apply modification (keep a copy of the working version first)
            with open(self.code_path, 'r') as f:
                previous_code = f.read()
            with open(self.code_path, 'w') as f:
                f.write(modified_code)
            # 3. Evaluate
            try:
                score = self.evaluate()
            except Exception:
                # Evaluation failed - treat it as a regression
                score = float('-inf')
            # 4. Keep or revert
            if score > self.best_score:
                self.best_score = score
                decision = "kept"
            else:
                decision = "reverted"
                # Restore previous version
                with open(self.code_path, 'w') as f:
                    f.write(previous_code)
            # 5. Log
            self.history.append({
                "iteration": i,
                "score": score,
                "decision": decision
            })
            print(f"Iter {i}: score={score:.4f} ({decision})")

This minimal example shows the core loop. But I quickly realized running this without safety measures is dangerous.
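To watch the keep-or-revert step run end to end without an LLM, here is a toy sketch. Everything in it is a stand-in I made up for illustration: the "agent code" is a text file holding one number, `toy_benchmark` rewards being close to `TARGET`, and `propose` is a random nudge in place of the model call.

```python
import random
import tempfile
from pathlib import Path
from typing import Callable

TARGET = 10.0  # hypothetical optimum that the toy benchmark rewards


def toy_benchmark(path: Path) -> float:
    """Score is higher the closer the number in the file is to TARGET."""
    return -abs(float(path.read_text()) - TARGET)


def propose(current: str, rng: random.Random) -> str:
    """Stand-in for the LLM proposal: nudge the number at random."""
    return str(float(current) + rng.uniform(-1.0, 1.0))


def hill_climb(path: Path, benchmark: Callable[[Path], float],
               iterations: int = 200, seed: int = 0) -> float:
    """Keep a change only if it beats the best score so far, else revert."""
    rng = random.Random(seed)
    best = benchmark(path)
    for _ in range(iterations):
        previous = path.read_text()            # checkpoint before editing
        path.write_text(propose(previous, rng))
        try:
            score = benchmark(path)
        except Exception:
            score = float("-inf")              # a crash counts as a regression
        if score > best:
            best = score                       # keep the change
        else:
            path.write_text(previous)          # revert to the checkpoint
    return best


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        agent_file = Path(tmp) / "agent.txt"
        agent_file.write_text("0.0")
        print(hill_climb(agent_file, toy_benchmark))
```

Because every rejected change is written back, the file on disk always matches the best score seen so far. That invariant is the one the real loop needs.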

CRITICAL Safety Warning

From the README of ADAS (Automated Design of Agentic Systems):

“The code in this repository involves executing untrusted model-generated code. We strongly advise users to be aware of this safety concern. While it is highly unlikely that model-generated code will perform overtly malicious actions in our current settings and with the models we use, such code may still act destructively due to limitations in model capability or alignment.”

When I first ran self-improvement experiments, I learned this lesson the hard way. The agent proposed a modification that deleted important files during evaluation.

Here are the critical safety measures:

Risk                      | Mitigation
--------------------------|---------------------------------------
Code injection            | Sandbox execution (Docker containers)
Resource exhaustion       | Time/compute limits per iteration
Reward hacking            | Multi-metric evaluation, human oversight
Catastrophic modification | Checkpointing, rollback capability
Unintended behavior       | Comprehensive test suites

Let me show you how to implement safe evaluation:

safe_docker_evaluator.py
import re

import docker


class SafeEvaluator:
    """
    Evaluate agent modifications in isolated Docker container.
    Prevents catastrophic code from affecting host system.
    """

    def __init__(self, benchmark_image: str):
        self.client = docker.from_env()
        self.image = benchmark_image

    def evaluate(self, agent_code: str, timeout: int = 300) -> dict:
        """
        Run agent code in isolated container.
        Returns benchmark results or error.
        """
        container = None
        try:
            # Do NOT mount /var/run/docker.sock here: even read-only, the
            # socket lets code in the container drive the host's Docker daemon.
            container = self.client.containers.run(
                self.image,
                command=["python", "-c", agent_code],
                mem_limit='2g',
                cpu_period=100000,
                cpu_quota=50000,  # 50% CPU
                network_mode='none',  # No network access
                detach=True,
            )
            # Wait for completion; raises if the time limit is exceeded
            result = container.wait(timeout=timeout)
            logs = container.logs().decode('utf-8')
            # Parse results
            return {
                "success": result['StatusCode'] == 0,
                "logs": logs,
                "score": self._parse_score(logs)
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "score": float('-inf')
            }
        finally:
            # Always cleanup (container is still None if run() itself failed)
            if container is not None:
                container.remove(force=True)

    def _parse_score(self, logs: str) -> float:
        """Assumed log format: the benchmark prints a line like 'SCORE: 0.87'."""
        match = re.search(r"SCORE:\s*(-?\d+(?:\.\d+)?)", logs)
        return float(match.group(1)) if match else float('-inf')

The key safety features:

  • network_mode='none': No network access prevents data exfiltration
  • mem_limit='2g': Memory limit prevents runaway processes
  • cpu_quota=50000: CPU limit prevents resource exhaustion
  • timeout: Time limit prevents infinite loops
  • finally: Always cleanup containers
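The time-limit idea from the list above also works at the process level, for example for the command that runs inside the container. The sketch below uses only the standard library; `run_with_timeout` is a hypothetical helper name. It is not a sandbox: the child process has the same file and network access as its parent, so treat it as a complement to Docker, not a replacement.

```python
import subprocess
import sys


def run_with_timeout(code: str, timeout_s: float = 5.0) -> dict:
    """Run `code` in a fresh Python process and kill it after timeout_s seconds."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,   # child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return {"success": False, "error": "timeout", "stdout": ""}
    return {
        "success": proc.returncode == 0,
        "error": proc.stderr.strip(),
        "stdout": proc.stdout.strip(),
    }


if __name__ == "__main__":
    print(run_with_timeout("print(6 * 7)"))
    print(run_with_timeout("while True: pass", timeout_s=1.0))
```

A timed-out run can be scored as a regression, which gives the keep-or-revert step a uniform signal for crashes and for infinite loops.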

Three Architecture Patterns

After researching the field, I found three main approaches.

Pattern 1: Scaffold-Level Improvement (SICA)

SICA (Self-Improving Coding Agent) focuses on improving the “scaffold” - the code structure that defines agent behavior.

sica-flow.txt
+-------------------+
|  Agent Scaffold   |  <-- The code defining agent behavior
+-------------------+
          |
          v
+-------------------+
|   Benchmark Set   |  <-- Tasks to measure performance
+-------------------+
          |
          v
+-------------------+
|  Propose Change   |  <-- LLM suggests scaffold modification
+-------------------+
          |
          v
+-------------------+
|  Run Evaluation   |  <-- Test modified agent on benchmarks
+-------------------+
          |
     +----+----+
     |         |
     v         v
  [Keep]   [Revert]

Key insight: Scaffold-level changes modify how the agent thinks, not just what it knows.

Pattern 2: Meta Agent Search (ADAS)

ADAS uses a meta-agent that programs new agents. This is fundamentally different from SICA. Instead of improving one agent, the meta-agent invents new agent architectures.

adas-flow.txt
+----------------------+
|      Meta Agent      |  <-- "Programs" new agents
+----------------------+
           |
           v
+----------------------+
|   Agent Candidate    |  <-- Newly generated agent code
+----------------------+
           |
           v
+----------------------+
|  Evaluate on Tasks   |  <-- Benchmark performance
+----------------------+
           |
           v
+----------------------+
|     Archive Best     |  <-- Store successful designs
+----------------------+
           |
           v
(Next iteration learns from archive)

Key insight: Agents can invent novel architectures humans never designed.

Here’s a simplified ADAS-style meta agent:

meta_agent_search.py
from dataclasses import dataclass
from typing import List, Optional
import json


@dataclass
class AgentDesign:
    """Represents a candidate agent architecture."""
    code: str
    score: float
    generation: int
    parent_id: Optional[str]


class MetaAgentSearch:
    """
    Meta agent that programs new agents.
    Based on ADAS pattern.
    """

    def __init__(self, base_functions: List[str], benchmark):
        self.base_functions = base_functions  # Available building blocks
        self.benchmark = benchmark
        self.archive: List[AgentDesign] = []

    def design_new_agent(self) -> str:
        """
        Meta agent designs a new agent using base functions.
        Returns complete agent code.
        """
        # Get inspiration from past successes
        successful_designs = [
            a for a in self.archive
            if a.score > self._get_average_score()
        ]
        prompt = f"""
Available building blocks:
{json.dumps(self.base_functions, indent=2)}
Past successful designs:
{[s.code[:500] for s in successful_designs[-5:]]}
Design a NOVEL agent architecture that combines these blocks
in an innovative way. The agent should solve the benchmark task.
Return ONLY the Python code for the agent.
"""
        # Placeholder: replace with your own LLM client, e.g. `return llm_call(prompt)`
        raise NotImplementedError("Plug in an LLM call that returns agent code")

    def _get_average_score(self) -> float:
        """Average score of the ten most recent designs (0.0 if none yet)."""
        if not self.archive:
            return 0.0
        recent = self.archive[-10:]
        return sum(a.score for a in recent) / len(recent)

    def search(self, generations: int = 50, pop_size: int = 10):
        """
        Run meta agent search across generations.
        """
        for gen in range(generations):
            print(f"Generation {gen}")
            # Design new agents
            for _ in range(pop_size):
                agent_code = self.design_new_agent()
                # Evaluate
                score = self.benchmark(agent_code)
                # Archive
                self.archive.append(AgentDesign(
                    code=agent_code,
                    score=score,
                    generation=gen,
                    parent_id=None
                ))
            # Report best in generation
            best = max(
                [a for a in self.archive if a.generation == gen],
                key=lambda x: x.score
            )
            print(f" Best score: {best.score}")
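The code above picks prompt examples by filtering on a recent average and taking the last five. A simpler alternative is to show the meta agent the top-k designs by score. The sketch below is my own illustration of that variant, not the selection rule ADAS itself uses. `top_k` is a hypothetical helper, and `AgentDesign` is redefined here only so the block runs on its own.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AgentDesign:
    """Same fields as the archive entry used in the search class."""
    code: str
    score: float
    generation: int
    parent_id: Optional[str] = None


def top_k(archive: List[AgentDesign], k: int = 5) -> List[AgentDesign]:
    """Return the k highest-scoring designs, best first."""
    return sorted(archive, key=lambda d: d.score, reverse=True)[:k]


if __name__ == "__main__":
    archive = [
        AgentDesign(code="agent_a", score=0.2, generation=0),
        AgentDesign(code="agent_b", score=0.9, generation=0),
        AgentDesign(code="agent_c", score=0.5, generation=1),
    ]
    for design in top_k(archive, k=2):
        print(design.code, design.score)
```

Top-k keeps the prompt focused on the strongest designs, at the cost of less diversity than sampling from the whole archive.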

Pattern 3: Clade-Based Tree Search (HGM)

HGM (Huxley-Gödel Machine) approximates the theoretically optimal self-improving machine, the Gödel Machine. It uses tree-based search with clade promise estimation.

hgm-tree.txt
+------------------------+
|   Current Agent Code   |
+------------------------+
            |
            v
+------------------------+
| Generate Modifications |  <-- Propose multiple changes
+------------------------+
            |
   +-----+-----+-----+
   |     |     |     |
   v     v     v     v
  [A]   [B]   [C]   [D]   <-- Modification branches (clades)
   |     |     |     |
   v     v     v     v
  Eval  Eval  Eval  Eval
   |     |     |     |
 Score Score Score Score
   |     |     |     |
   +-----+-----+-----+
            |
            v
+------------------------+
| Expand Most Promising  |  <-- Use subtree promise estimates
+------------------------+

Key insight: Tree-based search with clade promise estimation enables efficient exploration.

Here’s the clade-based expansion logic:

huxley_godel_machine.py
from dataclasses import dataclass
from typing import Dict
import math


@dataclass
class Modification:
    """A proposed code modification."""
    code: str
    parent_id: str
    estimated_promise: float  # Estimated subtree potential


class HuxleyGodelMachine:
    """
    Approximates the theoretical optimal self-improving machine.
    Uses clade-based tree search with promise estimation.
    """

    def __init__(self, initial_code: str, benchmark):
        self.initial_code = initial_code
        self.benchmark = benchmark
        self.tree: Dict[str, Modification] = {}
        self.scores: Dict[str, float] = {}

    def estimate_clade_promise(self, modification: Modification) -> float:
        """
        Estimate promise of entire subtree rooted at modification.
        Uses: current_score + exploration_bonus + estimated_improvement
        """
        current_score = self.scores.get(modification.parent_id, 0)
        # UCB-style exploration bonus
        visit_count = sum(
            1 for m in self.tree.values()
            if m.parent_id == modification.parent_id
        )
        exploration_bonus = math.sqrt(
            2 * math.log(visit_count + 1) / (visit_count + 1)
        )
        # Estimated improvement from modification
        estimated_improvement = modification.estimated_promise
        return current_score + exploration_bonus + estimated_improvement

    def search(self, iterations: int = 1000):
        """
        Main search loop using clade-based expansion.
        """
        # Initialize
        root = Modification(
            code=self.initial_code,
            parent_id="root",
            estimated_promise=0.5
        )
        self.tree["root"] = root
        self.scores["root"] = self.benchmark(root.code)
        for i in range(iterations):
            # Select most promising clade
            best_clade = max(
                self.tree.values(),
                key=lambda m: self.estimate_clade_promise(m)
            )
            # Expand (generate branches)
            # In practice: call LLM to generate modifications
            # Evaluate and add to tree
            # Report progress
            best_score = max(self.scores.values())
            print(f"Iter {i}: best={best_score:.4f}")

The UCB-style exploration bonus balances exploitation (known good modifications) and exploration (potentially better but untested modifications).
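It helps to plug numbers into the bonus. The sketch below reproduces the formula from `estimate_clade_promise` as `article_bonus` and sets it beside the textbook UCB1 bonus, sqrt(2 ln N / n). UCB1 is included only for comparison; I am not claiming it is what HGM uses. Both function names are hypothetical.

```python
import math


def article_bonus(visit_count: int) -> float:
    """Bonus exactly as computed in estimate_clade_promise."""
    return math.sqrt(2 * math.log(visit_count + 1) / (visit_count + 1))


def ucb1_bonus(parent_visits: int, child_visits: int) -> float:
    """Textbook UCB1 bonus: sqrt(2 ln N / n); unvisited children come first."""
    if child_visits == 0:
        return math.inf
    return math.sqrt(2 * math.log(parent_visits) / child_visits)


if __name__ == "__main__":
    for n in (0, 1, 2, 5, 20):
        print(f"visits={n:2d}  article_bonus={article_bonus(n):.3f}")
    for n in (0, 1, 5, 20):
        print(f"child_visits={n:2d}  ucb1_bonus={ucb1_bonus(20, n):.3f}")
```

One thing to notice: with a visit count of 0 the article's formula gives a bonus of exactly 0, because log(1) = 0, so a branch with no siblings yet gets no exploration boost. If you want untested branches tried first, the UCB1 form, or a small constant added inside the logarithm, is closer to that intent.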

Benchmarks and Results

From the research papers:

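System | Benchmark              | Key Achievement              | Method
-------|------------------------|------------------------------|-------------------
SICA   | Coding benchmarks      | Scaffold-level gains         | Self-editing loop
ADAS   | ARC, DROP, MGSM, MMLU  | Novel architectures invented | Meta agent search
HGM    | SWE-bench, Polyglot    | Human-level coding           | Clade-based search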

ADAS won an Outstanding Paper award at the NeurIPS 2024 Open-World Agents Workshop. HGM received an oral presentation at ICLR 2026. These results suggest that, on the reported benchmarks, self-improving agents can discover architectures that outperform human-designed ones.

Getting Started: Practical Setup

Here’s how to set up a safe development environment:

setup_environment.sh
# Verify Docker is configured (safety requirement)
docker run hello-world
# Create isolated environment
conda create -n self-improving python=3.11
conda activate self-improving
# Install dependencies
pip install docker anthropic openai
# Set API keys (use environment variables, never hardcode)
export OPENAI_API_KEY='your-key-here'
export ANTHROPIC_API_KEY='your-key-here'

For running ADAS:

run_adas.sh
# Navigate to a domain
cd _arc # or _drop, _mgsm, _mmlu
# Run meta agent search
python search.py

For running HGM:

run_hgm.sh
# Setup SWE-bench
cd swe_bench
git clone https://github.com/princeton-nlp/SWE-bench.git
cd SWE-bench
git checkout dc4c087c2b9e4cefebf2e3d201d27e36
pip install -e .
cd ../../
# Prepare dataset
python -m polyglot.prepare_polyglot_dataset
# Run HGM
./run.sh

Why Scaffold-Level Improvement Matters

I found this distinction crucial. Traditional improvement operates on a fixed architecture:

traditional-vs-scaffold.txt
Traditional Improvement:
    Model        → Weights → Gradient descent
    Hyperparams  → Grid search
    Prompts      → Prompt engineering

Scaffold-Level Improvement:
    Architecture → How components connect
    Tools        → What actions are available
    Reasoning    → How problems are decomposed
    Memory       → How context is managed

Scaffold-level changes are qualitatively different. They change the fundamental structure of the agent, not just its parameters.

For example, a scaffold modification might:

  • Change from single-agent to multi-agent architecture
  • Add a new reasoning module (e.g., planning before execution)
  • Modify how tools are selected and invoked
  • Change memory from linear to hierarchical

These changes can unlock performance gains that parameter tuning cannot achieve.
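To make "add a planning step" concrete, here is a minimal sketch in which the scaffold is just an ordered list of step functions. All names (`run_scaffold`, `make_plan`, `execute`) are hypothetical, and the steps return canned strings where a real agent would call a model.

```python
from typing import Callable, Dict, List

State = Dict[str, str]
Step = Callable[[State], State]


def make_plan(state: State) -> State:
    """Hypothetical planner: a real one would ask an LLM for a plan."""
    return {**state, "plan": f"break down {state['task']}"}


def execute(state: State) -> State:
    """Hypothetical executor: follows the plan if an earlier step made one."""
    current_plan = state.get("plan")
    prefix = f"following plan '{current_plan}'" if current_plan else "no plan"
    return {**state, "result": f"{prefix}: solved {state['task']}"}


def run_scaffold(steps: List[Step], task: str) -> State:
    """Thread a state dict through each step in order."""
    state: State = {"task": task}
    for step in steps:
        state = step(state)
    return state


baseline = [execute]
with_planning = [make_plan, execute]  # the scaffold edit: one step inserted

if __name__ == "__main__":
    print(run_scaffold(baseline, "fix bug")["result"])
    print(run_scaffold(with_planning, "fix bug")["result"])
```

No weight or prompt changed between `baseline` and `with_planning`. Only the wiring did, and that wiring is the kind of edit a self-improving agent proposes.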

Common Pitfalls I Encountered

When implementing self-improving agents, I made several mistakes:

1. Running without sandboxing

The agent proposed code that created infinite loops during evaluation. Without Docker isolation, this crashed my development machine.

2. Single-metric evaluation

I initially used only accuracy as the metric. The agent discovered a “reward hack” - it modified the evaluation code to always return high scores. Multi-metric evaluation (accuracy + cost + safety) prevented this.
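A sketch of what that multi-metric gate can look like, combined with a fingerprint check that rejects any run where the evaluation harness was edited. The metric fields, the cost ceiling, and the names `Metrics`, `fingerprint`, and `accept` are assumptions for illustration, not values from my experiments.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float          # fraction of tasks solved, 0..1
    cost_usd: float          # API spend for the run
    safety_violations: int   # count of flagged actions


def fingerprint(source: str) -> str:
    """SHA-256 of the evaluation harness source text."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def accept(candidate: Metrics, best: Metrics, harness_source: str,
           trusted_fingerprint: str, max_cost_usd: float = 5.0) -> bool:
    """Keep a candidate only if it passes every gate and improves accuracy."""
    if fingerprint(harness_source) != trusted_fingerprint:
        return False   # harness was edited, so its scores cannot be trusted
    if candidate.safety_violations > 0:
        return False
    if candidate.cost_usd > max_cost_usd:
        return False
    return candidate.accuracy > best.accuracy


if __name__ == "__main__":
    harness = "def score(answer, expected): return answer == expected"
    trusted = fingerprint(harness)
    best = Metrics(accuracy=0.60, cost_usd=1.0, safety_violations=0)
    hacked = Metrics(accuracy=1.00, cost_usd=0.1, safety_violations=0)
    print(accept(hacked, best, "def score(a, e): return True", trusted))
```

Record the trusted fingerprint before the loop starts and store it outside anything the agent can write to.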

3. No rollback capability

A modification broke the agent completely, and I had no way to restore the previous working version. Git versioning and checkpointing became essential.
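Git is what I would reach for in a real project. For a self-contained picture of the idea, here is a file-copy stand-in that keeps numbered snapshots so any earlier version can be restored. `CheckpointStore` and its method names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


class CheckpointStore:
    """Keeps numbered copies of one file so any version can be restored."""

    def __init__(self, target: Path, store_dir: Path):
        self.target = target
        self.store_dir = store_dir
        self.store_dir.mkdir(parents=True, exist_ok=True)
        self.count = 0

    def save(self) -> int:
        """Copy the current file into the store and return its checkpoint id."""
        checkpoint_id = self.count
        shutil.copy2(self.target, self.store_dir / f"{checkpoint_id}.bak")
        self.count += 1
        return checkpoint_id

    def restore(self, checkpoint_id: int) -> None:
        """Overwrite the target file with a stored checkpoint."""
        shutil.copy2(self.store_dir / f"{checkpoint_id}.bak", self.target)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        agent_file = Path(tmp) / "agent.py"
        agent_file.write_text("print('working version')")
        store = CheckpointStore(agent_file, Path(tmp) / "checkpoints")
        last_good = store.save()
        agent_file.write_text("this is not valid python (")
        store.restore(last_good)
        print(agent_file.read_text())
```

Keep the checkpoint directory outside the paths the agent is allowed to edit, otherwise a bad modification can damage the backups too.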

4. Over-aggressive exploration

The agent explored too many modifications simultaneously, overwhelming my compute budget. Clade-based expansion with promise estimation (like HGM) solved this.

5. Ignoring safety warnings

I initially thought “it’s just generated code, nothing bad can happen.” The ADAS team’s warning is real - model-generated code can act destructively.

Summary

Self-improving coding agents represent a frontier where AI systems modify their own architectures. Three main approaches exist:

  1. SICA: Scaffold-level improvement - modify agent’s code structure
  2. ADAS: Meta agent search - agents that program new agents
  3. HGM: Clade-based tree search - approximate the optimal self-improving machine

The core loop: self-reflect, propose change, evaluate, keep or revert.

Critical safety measures: Docker sandboxing, resource limits, multi-metric evaluation, rollback capability.

The future direction: agents that not only improve their own code but invent entirely new architectures. This opens possibilities for automated AI research at unprecedented scale.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
