How to Build a Self-Improving Coding Agent That Edits Its Own Codebase
The Problem: Manual Architecture Design is Slow
When I started building AI agents, I spent weeks iterating on architectures. Each change required manual code editing, running benchmarks, and analyzing results. The cycle was slow and error-prone.
I wondered: could an agent improve itself? Could it propose changes to its own code, test them, and keep successful modifications?
The answer is yes. Self-improving coding agents do exactly this. They run evaluation benchmarks, propose code changes, test improvements, and keep successful modifications while reverting regressions.
What is a Self-Improving Coding Agent?
A self-improving coding agent is an AI system that can propose, test, and apply modifications to its own source code. Unlike traditional AI improvement that tunes parameters or prompts, these agents modify their own architecture.
The key difference: scaffold-level improvement changes how the agent thinks, not just what it knows.
Traditional improvement focuses on:
- Model weights (gradient descent)
- Hyperparameters (grid search)
- Prompts (prompt engineering)
Self-improving agents modify:
- Agent architecture (how components connect)
- Tool definitions (what actions are available)
- Reasoning patterns (how problems are decomposed)
- Memory systems (how context is managed)
The Self-Improvement Loop
The basic loop consists of five phases:
+-------------------+| Current Agent |+-------------------+ | v+-------------------+| Self-Reflect | <-- Analyze own code, identify improvements+-------------------+ | v+-------------------+| Propose Change | <-- LLM suggests specific modification+-------------------+ | v+-------------------+| Run Evaluation | <-- Test on benchmark tasks+-------------------+ | +----+----+ | | v v[Keep] [Revert]Let me show you a minimal implementation:
import jsonfrom typing import Callable, List, Dict
class SelfImprovingAgent: """ Minimal self-improving agent scaffold. WARNING: For educational purposes only. Not production-safe. """
def __init__(self, code_path: str, benchmark: Callable): self.code_path = code_path self.benchmark = benchmark self.history: List[Dict] = [] self.best_score = float('-inf')
def self_reflect(self) -> str: """Analyze own code and propose improvement.""" with open(self.code_path, 'r') as f: current_code = f.read()
# In practice: call LLM API with code + history proposal = self._generate_proposal(current_code, self.history) return proposal
def _generate_proposal(self, code: str, history: List) -> str: """ LLM proposes code modification. Includes: analysis of past attempts, failure modes, potential improvements """ prompt = f""" Current agent code: {code}
Past attempts and results: {json.dumps(history[-10:], indent=2)}
Propose ONE specific code change that could improve benchmark performance. Return only the modified code. """ # return llm_call(prompt) pass
def evaluate(self) -> float: """Run benchmark and return score.""" return self.benchmark(self.code_path)
def improve(self, max_iterations: int = 100): """Main self-improvement loop.""" self.best_score = self.evaluate()
for i in range(max_iterations): # 1. Self-reflect and propose change modified_code = self.self_reflect()
# 2. Apply modification with open(self.code_path, 'w') as f: f.write(modified_code)
# 3. Evaluate try: score = self.evaluate() except Exception as e: # Evaluation failed - revert score = float('-inf')
# 4. Keep or revert if score > self.best_score: self.best_score = score decision = "kept" else: decision = "reverted" # Restore previous version (simplified)
# 5. Log self.history.append({ "iteration": i, "score": score, "decision": decision })
print(f"Iter {i}: score={score:.4f} ({decision})")This minimal example shows the core loop. But I quickly realized running this without safety measures is dangerous.
CRITICAL Safety Warning
From the ADAS README:
“The code in this repository involves executing untrusted model-generated code. We strongly advise users to be aware of this safety concern. While it is highly unlikely that model-generated code will perform overtly malicious actions in our current settings and with the models we use, such code may still act destructively due to limitations in model capability or alignment.”
When I first ran self-improvement experiments, I learned this lesson the hard way. The agent proposed a modification that deleted important files during evaluation.
Here are the critical safety measures:
| Risk | Mitigation |
|---|---|
| Code injection | Sandbox execution (Docker containers) |
| Resource exhaustion | Time/compute limits per iteration |
| Reward hacking | Multi-metric evaluation, human oversight |
| Catastrophic modification | Checkpointing, rollback capability |
| Unintended behavior | Comprehensive test suites |
Let me show you how to implement safe evaluation:
import docker
class SafeEvaluator: """ Evaluate agent modifications in isolated Docker container. Prevents catastrophic code from affecting host system. """
def __init__(self, benchmark_image: str): self.client = docker.from_env() self.image = benchmark_image
def evaluate(self, agent_code: str, timeout: int = 300) -> dict: """ Run agent code in isolated container. Returns benchmark results or error. """ try: container = self.client.containers.run( self.image, command=["python", "-c", agent_code], volumes={ '/var/run/docker.sock': { 'bind': '/var/run/docker.sock', 'mode': 'ro' } }, mem_limit='2g', cpu_period=100000, cpu_quota=50000, # 50% CPU network_mode='none', # No network access detach=True, timeout=timeout )
# Wait for completion result = container.wait(timeout=timeout) logs = container.logs().decode('utf-8')
# Parse results return { "success": result['StatusCode'] == 0, "logs": logs, "score": self._parse_score(logs) }
except Exception as e: return { "success": False, "error": str(e), "score": float('-inf') } finally: # Always cleanup container.remove(force=True)The key safety features:
network_mode='none': No network access prevents data exfiltrationmem_limit='2g': Memory limit prevents runaway processescpu_quota=50000: CPU limit prevents resource exhaustiontimeout: Time limit prevents infinite loopsfinally: Always cleanup containers
Three Architecture Patterns
After researching the field, I found three main approaches.
Pattern 1: Scaffold-Level Improvement (SICA)
SICA focuses on improving the “scaffold” - the code structure that defines agent behavior.
+-------------------+| Agent Scaffold | <-- The code defining agent behavior+-------------------+ | v+-------------------+| Benchmark Set | <-- Tasks to measure performance+-------------------+ | v+-------------------+| Propose Change | <-- LLM suggests scaffold modification+-------------------+ | v+-------------------+| Run Evaluation | <-- Test modified agent on benchmarks+-------------------+ | +----+----+ | | v v[Keep] [Revert]Key insight: Scaffold-level changes modify how the agent thinks, not just what it knows.
Pattern 2: Meta Agent Search (ADAS)
ADAS uses a meta-agent that programs new agents. This is fundamentally different from SICA. Instead of improving one agent, the meta-agent invents new agent architectures.
+----------------------+| Meta Agent | <-- "Programs" new agents+----------------------+ | v+----------------------+| Agent Candidate | <-- Newly generated agent code+----------------------+ | v+----------------------+| Evaluate on Tasks | <-- Benchmark performance+----------------------+ | v+----------------------+| Archive Best | <-- Store successful designs+----------------------+ | v(Next iteration learns from archive)Key insight: Agents can invent novel architectures humans never designed.
Here’s a simplified ADAS-style meta agent:
from dataclasses import dataclassfrom typing import List, Optionalimport json
@dataclassclass AgentDesign: """Represents a candidate agent architecture.""" code: str score: float generation: int parent_id: Optional[str]
class MetaAgentSearch: """ Meta agent that programs new agents. Based on ADAS pattern. """
def __init__(self, base_functions: List[str], benchmark): self.base_functions = base_functions # Available building blocks self.benchmark = benchmark self.archive: List[AgentDesign] = []
def design_new_agent(self) -> str: """ Meta agent designs a new agent using base functions. Returns complete agent code. """ # Get inspiration from past successes successful_designs = [ a for a in self.archive if a.score > self._get_average_score() ]
prompt = f""" Available building blocks: {json.dumps(self.base_functions, indent=2)}
Past successful designs: {[s.code[:500] for s in successful_designs[-5:]]}
Design a NOVEL agent architecture that combines these blocks in an innovative way. The agent should solve the benchmark task.
Return ONLY the Python code for the agent. """ # return llm_call(prompt) pass
def _get_average_score(self) -> float: if not self.archive: return 0.0 recent = self.archive[-10:] return sum(a.score for a in recent) / len(recent)
def search(self, generations: int = 50, pop_size: int = 10): """ Run meta agent search across generations. """ for gen in range(generations): print(f"Generation {gen}")
# Design new agents for _ in range(pop_size): agent_code = self.design_new_agent()
# Evaluate score = self.benchmark(agent_code)
# Archive self.archive.append(AgentDesign( code=agent_code, score=score, generation=gen, parent_id=None ))
# Report best in generation best = max( [a for a in self.archive if a.generation == gen], key=lambda x: x.score ) print(f" Best score: {best.score}")Pattern 3: Clade-Based Tree Search (HGM)
HGM approximates the theoretical optimal self-improving machine (Godel Machine). It uses tree-based search with clade promise estimation.
+------------------------+| Current Agent Code |+------------------------+ | v+------------------------+| Generate Modifications| <-- Propose multiple changes+------------------------+ | +-----+-----+-----+ | | | | v v v v [A] [B] [C] [D] <-- Modification branches (clades) | | | | v v v v Eval Eval Eval Eval | | | | Score Score Score Score | | | | +-----+-----+-----+ | v+------------------------+| Expand Most Promising | <-- Use subtree promise estimates+------------------------+Key insight: Tree-based search with clade promise estimation enables efficient exploration.
Here’s the clade-based expansion logic:
from dataclasses import dataclassfrom typing import Dictimport math
@dataclassclass Modification: """A proposed code modification.""" code: str parent_id: str estimated_promise: float # Estimated subtree potential
class HuxleyGodelMachine: """ Approximates the theoretical optimal self-improving machine. Uses clade-based tree search with promise estimation. """
def __init__(self, initial_code: str, benchmark): self.initial_code = initial_code self.benchmark = benchmark self.tree: Dict[str, Modification] = {} self.scores: Dict[str, float] = {}
def estimate_clade_promise(self, modification: Modification) -> float: """ Estimate promise of entire subtree rooted at modification. Uses: current_score + exploration_bonus + estimated_improvement """ current_score = self.scores.get(modification.parent_id, 0)
# UCB-style exploration bonus visit_count = sum( 1 for m in self.tree.values() if m.parent_id == modification.parent_id ) exploration_bonus = math.sqrt( 2 * math.log(visit_count + 1) / (visit_count + 1) )
# Estimated improvement from modification estimated_improvement = modification.estimated_promise
return current_score + exploration_bonus + estimated_improvement
def search(self, iterations: int = 1000): """ Main search loop using clade-based expansion. """ # Initialize root = Modification( code=self.initial_code, parent_id="root", estimated_promise=0.5 ) self.tree["root"] = root self.scores["root"] = self.benchmark(root.code)
for i in range(iterations): # Select most promising clade best_clade = max( self.tree.values(), key=lambda m: self.estimate_clade_promise(m) )
# Expand (generate branches) # In practice: call LLM to generate modifications
# Evaluate and add to tree
# Report progress best_score = max(self.scores.values()) print(f"Iter {i}: best={best_score:.4f}")The UCB-style exploration bonus balances exploitation (known good modifications) and exploration (potentially better but untested modifications).
Benchmarks and Results
From the research papers:
| System | Benchmark | Key Achievement | Method |
|---|---|---|---|
| SICA | Coding benchmarks | Scaffold-level gains | Self-editing loop |
| ADAS | ARC, DROP, MGSM, MMLU | Novel architectures invented | Meta agent search |
| HGM | SWE-bench, Polyglot | Human-level coding | Clade-based search |
ADAS won Outstanding Paper at NeurIPS 2024. HGM received an oral presentation at ICLR 2026. These results show self-improving agents can discover architectures that outperform human-designed ones.
Getting Started: Practical Setup
Here’s how to set up a safe development environment:
# Verify Docker is configured (safety requirement)docker run hello-world
# Create isolated environmentconda create -n self-improving python=3.11conda activate self-improving
# Install dependenciespip install docker anthropic openai
# Set API keys (use environment variables, never hardcode)export OPENAI_API_KEY='your-key-here'export ANTHROPIC_API_KEY='your-key-here'For running ADAS:
# Navigate to a domaincd _arc # or _drop, _mgsm, _mmlu
# Run meta agent searchpython search.pyFor running HGM:
# Setup SWE-benchcd swe_benchgit clone https://github.com/princeton-nlp/SWE-bench.gitcd SWE-benchgit checkout dc4c087c2b9e4cefebf2e3d201d27e36pip install -e .cd ../../
# Prepare datasetpython -m polyglot.prepare_polyglot_dataset
# Run HGM./run.shWhy Scaffold-Level Improvement Matters
I found this distinction crucial. Traditional improvement operates on a fixed architecture:
Traditional Improvement: Model → Weights → Gradient descent Hyperparams → Grid search Prompts → Prompt engineering
Scaffold-Level Improvement: Architecture → How components connect Tools → What actions are available Reasoning → How problems are decomposed Memory → How context is managedScaffold-level changes are qualitatively different. They change the fundamental structure of the agent, not just its parameters.
For example, a scaffold modification might:
- Change from single-agent to multi-agent architecture
- Add a new reasoning module (e.g., planning before execution)
- Modify how tools are selected and invoked
- Change memory from linear to hierarchical
These changes can unlock performance gains that parameter tuning cannot achieve.
Common Pitfalls I Encountered
When implementing self-improving agents, I made several mistakes:
1. Running without sandboxing
The agent proposed code that created infinite loops during evaluation. Without Docker isolation, this crashed my development machine.
2. Single-metric evaluation
I initially used only accuracy as the metric. The agent discovered a “reward hack” - it modified the evaluation code to always return high scores. Multi-metric evaluation (accuracy + cost + safety) prevented this.
3. No rollback capability
A modification broke the agent completely, and I had no way to restore the previous working version. Git versioning and checkpointing became essential.
4. Over-aggressive exploration
The agent explored too many modifications simultaneously, overwhelming my compute budget. Clade-based expansion with promise estimation (like HGM) solved this.
5. Ignoring safety warnings
I initially thought “it’s just generated code, nothing bad can happen.” The ADAS team’s warning is real - model-generated code can act destructively.
Summary
Self-improving coding agents represent a frontier where AI systems modify their own architectures. Three main approaches exist:
- SICA: Scaffold-level improvement - modify agent’s code structure
- ADAS: Meta agent search - agents that program new agents
- HGM: Clade-based tree search - approximate the optimal self-improving machine
The core loop: self-reflect, propose change, evaluate, keep or revert.
Critical safety measures: Docker sandboxing, resource limits, multi-metric evaluation, rollback capability.
The future direction: agents that not only improve their own code but invent entirely new architectures. This opens possibilities for automated AI research at unprecedented scale.
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 SICA: Self-Improving Coding Agent
- 👨💻 ADAS: Automated Design of Agentic Systems
- 👨💻 HGM: Huxley-Godel Machine
- 👨💻 awesome-autoresearch: Curated Implementations
- 👨💻 SICA ICLR 2025 Workshop Paper
- 👨💻 ADAS NeurIPS 2024 Outstanding Paper
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
Comments