How to Build a Self-Improving Coding Agent That Edits Its Own Codebase

Mar 30, 2026

The Problem: Manual Architecture Design is Slow

When I started building AI agents, I spent weeks iterating on architectures. Each change required manual code editing, running benchmarks, and analyzing results. The cycle was slow and error-prone.

I wondered: could an agent improve itself? Could it propose changes to its own code, test them, and keep successful modifications?

The answer is yes. Self-improving coding agents do exactly this. They run evaluation benchmarks, propose code changes, test improvements, and keep successful modifications while reverting regressions.

What is a Self-Improving Coding Agent?

A self-improving coding agent is an AI system that can propose, test, and apply modifications to its own source code. Unlike traditional AI improvement that tunes parameters or prompts, these agents modify their own architecture.

The key difference: scaffold-level improvement changes how the agent thinks, not just what it knows.

Traditional improvement focuses on:

Model weights (gradient descent)
Hyperparameters (grid search)
Prompts (prompt engineering)

Self-improving agents modify:

Agent architecture (how components connect)
Tool definitions (what actions are available)
Reasoning patterns (how problems are decomposed)
Memory systems (how context is managed)

The Self-Improvement Loop

The basic loop consists of five phases:

+-------------------+
|   Current Agent   |
+-------------------+
        |
        v
+-------------------+
|  Self-Reflect     |  <-- Analyze own code, identify improvements
+-------------------+
        |
        v
+-------------------+
|  Propose Change   |  <-- LLM suggests specific modification
+-------------------+
        |
        v
+-------------------+
|  Run Evaluation   |  <-- Test on benchmark tasks
+-------------------+
        |
   +----+----+
   |         |
   v         v
[Keep]    [Revert]

Let me show you a minimal implementation:

import json
from typing import Callable, List, Dict

class SelfImprovingAgent:
    """
    Minimal self-improving agent scaffold.
    WARNING: For educational purposes only. Not production-safe.
    """

    def __init__(self, code_path: str, benchmark: Callable):
        self.code_path = code_path
        self.benchmark = benchmark
        self.history: List[Dict] = []
        self.best_score = float('-inf')

    def self_reflect(self) -> str:
        """Analyze own code and propose improvement."""
        with open(self.code_path, 'r') as f:
            current_code = f.read()

        # In practice: call LLM API with code + history
        proposal = self._generate_proposal(current_code, self.history)
        return proposal

    def _generate_proposal(self, code: str, history: List) -> str:
        """
        LLM proposes code modification.
        Includes: analysis of past attempts, failure modes, potential improvements
        """
        prompt = f"""
        Current agent code:
        {code}

        Past attempts and results:
        {json.dumps(history[-10:], indent=2)}

        Propose ONE specific code change that could improve benchmark performance.
        Return only the modified code.
        """
        # return llm_call(prompt)
        pass

    def evaluate(self) -> float:
        """Run benchmark and return score."""
        return self.benchmark(self.code_path)

    def improve(self, max_iterations: int = 100):
        """Main self-improvement loop."""
        self.best_score = self.evaluate()

        for i in range(max_iterations):
            # 1. Self-reflect and propose change
            modified_code = self.self_reflect()

            # 2. Apply modification
            with open(self.code_path, 'w') as f:
                f.write(modified_code)

            # 3. Evaluate
            try:
                score = self.evaluate()
            except Exception as e:
                # Evaluation failed - revert
                score = float('-inf')

            # 4. Keep or revert
            if score > self.best_score:
                self.best_score = score
                decision = "kept"
            else:
                decision = "reverted"
                # Restore previous version (simplified)

            # 5. Log
            self.history.append({
                "iteration": i,
                "score": score,
                "decision": decision
            })

            print(f"Iter {i}: score={score:.4f} ({decision})")

This minimal example shows the core loop. But I quickly realized running this without safety measures is dangerous.

CRITICAL Safety Warning

From the ADAS README:

“The code in this repository involves executing untrusted model-generated code. We strongly advise users to be aware of this safety concern. While it is highly unlikely that model-generated code will perform overtly malicious actions in our current settings and with the models we use, such code may still act destructively due to limitations in model capability or alignment.”

When I first ran self-improvement experiments, I learned this lesson the hard way. The agent proposed a modification that deleted important files during evaluation.

Here are the critical safety measures:

Risk	Mitigation
Code injection	Sandbox execution (Docker containers)
Resource exhaustion	Time/compute limits per iteration
Reward hacking	Multi-metric evaluation, human oversight
Catastrophic modification	Checkpointing, rollback capability
Unintended behavior	Comprehensive test suites

Let me show you how to implement safe evaluation:

import docker

class SafeEvaluator:
    """
    Evaluate agent modifications in isolated Docker container.
    Prevents catastrophic code from affecting host system.
    """

    def __init__(self, benchmark_image: str):
        self.client = docker.from_env()
        self.image = benchmark_image

    def evaluate(self, agent_code: str, timeout: int = 300) -> dict:
        """
        Run agent code in isolated container.
        Returns benchmark results or error.
        """
        try:
            container = self.client.containers.run(
                self.image,
                command=["python", "-c", agent_code],
                volumes={
                    '/var/run/docker.sock': {
                        'bind': '/var/run/docker.sock',
                        'mode': 'ro'
                    }
                },
                mem_limit='2g',
                cpu_period=100000,
                cpu_quota=50000,  # 50% CPU
                network_mode='none',  # No network access
                detach=True,
                timeout=timeout
            )

            # Wait for completion
            result = container.wait(timeout=timeout)
            logs = container.logs().decode('utf-8')

            # Parse results
            return {
                "success": result['StatusCode'] == 0,
                "logs": logs,
                "score": self._parse_score(logs)
            }

        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "score": float('-inf')
            }
        finally:
            # Always cleanup
            container.remove(force=True)

The key safety features:

network_mode='none': No network access prevents data exfiltration
mem_limit='2g': Memory limit prevents runaway processes
cpu_quota=50000: CPU limit prevents resource exhaustion
timeout: Time limit prevents infinite loops
finally: Always cleanup containers

Three Architecture Patterns

After researching the field, I found three main approaches.

Pattern 1: Scaffold-Level Improvement (SICA)

SICA focuses on improving the “scaffold” - the code structure that defines agent behavior.

+-------------------+
|  Agent Scaffold   |  <-- The code defining agent behavior
+-------------------+
        |
        v
+-------------------+
|   Benchmark Set   |  <-- Tasks to measure performance
+-------------------+
        |
        v
+-------------------+
|   Propose Change  |  <-- LLM suggests scaffold modification
+-------------------+
        |
        v
+-------------------+
|  Run Evaluation   |  <-- Test modified agent on benchmarks
+-------------------+
        |
   +----+----+
   |         |
   v         v
[Keep]    [Revert]

Key insight: Scaffold-level changes modify how the agent thinks, not just what it knows.

Pattern 2: Meta Agent Search (ADAS)

ADAS uses a meta-agent that programs new agents. This is fundamentally different from SICA. Instead of improving one agent, the meta-agent invents new agent architectures.

+----------------------+
|    Meta Agent        |  <-- "Programs" new agents
+----------------------+
        |
        v
+----------------------+
|  Agent Candidate     |  <-- Newly generated agent code
+----------------------+
        |
        v
+----------------------+
|  Evaluate on Tasks   |  <-- Benchmark performance
+----------------------+
        |
        v
+----------------------+
|  Archive Best        |  <-- Store successful designs
+----------------------+
        |
        v
(Next iteration learns from archive)

Key insight: Agents can invent novel architectures humans never designed.

Here’s a simplified ADAS-style meta agent:

from dataclasses import dataclass
from typing import List, Optional
import json

@dataclass
class AgentDesign:
    """Represents a candidate agent architecture."""
    code: str
    score: float
    generation: int
    parent_id: Optional[str]

class MetaAgentSearch:
    """
    Meta agent that programs new agents.
    Based on ADAS pattern.
    """

    def __init__(self, base_functions: List[str], benchmark):
        self.base_functions = base_functions  # Available building blocks
        self.benchmark = benchmark
        self.archive: List[AgentDesign] = []

    def design_new_agent(self) -> str:
        """
        Meta agent designs a new agent using base functions.
        Returns complete agent code.
        """
        # Get inspiration from past successes
        successful_designs = [
            a for a in self.archive
            if a.score > self._get_average_score()
        ]

        prompt = f"""
        Available building blocks:
        {json.dumps(self.base_functions, indent=2)}

        Past successful designs:
        {[s.code[:500] for s in successful_designs[-5:]]}

        Design a NOVEL agent architecture that combines these blocks
        in an innovative way. The agent should solve the benchmark task.

        Return ONLY the Python code for the agent.
        """
        # return llm_call(prompt)
        pass

    def _get_average_score(self) -> float:
        if not self.archive:
            return 0.0
        recent = self.archive[-10:]
        return sum(a.score for a in recent) / len(recent)

    def search(self, generations: int = 50, pop_size: int = 10):
        """
        Run meta agent search across generations.
        """
        for gen in range(generations):
            print(f"Generation {gen}")

            # Design new agents
            for _ in range(pop_size):
                agent_code = self.design_new_agent()

                # Evaluate
                score = self.benchmark(agent_code)

                # Archive
                self.archive.append(AgentDesign(
                    code=agent_code,
                    score=score,
                    generation=gen,
                    parent_id=None
                ))

            # Report best in generation
            best = max(
                [a for a in self.archive if a.generation == gen],
                key=lambda x: x.score
            )
            print(f"  Best score: {best.score}")

Pattern 3: Clade-Based Tree Search (HGM)

HGM approximates the theoretical optimal self-improving machine (Godel Machine). It uses tree-based search with clade promise estimation.

+------------------------+
|   Current Agent Code   |
+------------------------+
           |
           v
+------------------------+
|  Generate Modifications|  <-- Propose multiple changes
+------------------------+
           |
     +-----+-----+-----+
     |     |     |     |
     v     v     v     v
   [A]   [B]   [C]   [D]  <-- Modification branches (clades)
     |     |     |     |
     v     v     v     v
  Eval   Eval  Eval  Eval
     |     |     |     |
  Score  Score Score Score
     |     |     |     |
     +-----+-----+-----+
           |
           v
+------------------------+
|  Expand Most Promising |  <-- Use subtree promise estimates
+------------------------+

Key insight: Tree-based search with clade promise estimation enables efficient exploration.

Here’s the clade-based expansion logic:

from dataclasses import dataclass
from typing import Dict
import math

@dataclass
class Modification:
    """A proposed code modification."""
    code: str
    parent_id: str
    estimated_promise: float  # Estimated subtree potential

class HuxleyGodelMachine:
    """
    Approximates the theoretical optimal self-improving machine.
    Uses clade-based tree search with promise estimation.
    """

    def __init__(self, initial_code: str, benchmark):
        self.initial_code = initial_code
        self.benchmark = benchmark
        self.tree: Dict[str, Modification] = {}
        self.scores: Dict[str, float] = {}

    def estimate_clade_promise(self, modification: Modification) -> float:
        """
        Estimate promise of entire subtree rooted at modification.
        Uses: current_score + exploration_bonus + estimated_improvement
        """
        current_score = self.scores.get(modification.parent_id, 0)

        # UCB-style exploration bonus
        visit_count = sum(
            1 for m in self.tree.values()
            if m.parent_id == modification.parent_id
        )
        exploration_bonus = math.sqrt(
            2 * math.log(visit_count + 1) / (visit_count + 1)
        )

        # Estimated improvement from modification
        estimated_improvement = modification.estimated_promise

        return current_score + exploration_bonus + estimated_improvement

    def search(self, iterations: int = 1000):
        """
        Main search loop using clade-based expansion.
        """
        # Initialize
        root = Modification(
            code=self.initial_code,
            parent_id="root",
            estimated_promise=0.5
        )
        self.tree["root"] = root
        self.scores["root"] = self.benchmark(root.code)

        for i in range(iterations):
            # Select most promising clade
            best_clade = max(
                self.tree.values(),
                key=lambda m: self.estimate_clade_promise(m)
            )

            # Expand (generate branches)
            # In practice: call LLM to generate modifications

            # Evaluate and add to tree

            # Report progress
            best_score = max(self.scores.values())
            print(f"Iter {i}: best={best_score:.4f}")

The UCB-style exploration bonus balances exploitation (known good modifications) and exploration (potentially better but untested modifications).

Benchmarks and Results

From the research papers:

System	Benchmark	Key Achievement	Method
SICA	Coding benchmarks	Scaffold-level gains	Self-editing loop
ADAS	ARC, DROP, MGSM, MMLU	Novel architectures invented	Meta agent search
HGM	SWE-bench, Polyglot	Human-level coding	Clade-based search

ADAS won Outstanding Paper at NeurIPS 2024. HGM received an oral presentation at ICLR 2026. These results show self-improving agents can discover architectures that outperform human-designed ones.

Getting Started: Practical Setup

Here’s how to set up a safe development environment:

# Verify Docker is configured (safety requirement)
docker run hello-world

# Create isolated environment
conda create -n self-improving python=3.11
conda activate self-improving

# Install dependencies
pip install docker anthropic openai

# Set API keys (use environment variables, never hardcode)
export OPENAI_API_KEY='your-key-here'
export ANTHROPIC_API_KEY='your-key-here'

For running ADAS:

# Navigate to a domain
cd _arc  # or _drop, _mgsm, _mmlu

# Run meta agent search
python search.py

For running HGM:

# Setup SWE-bench
cd swe_bench
git clone https://github.com/princeton-nlp/SWE-bench.git
cd SWE-bench
git checkout dc4c087c2b9e4cefebf2e3d201d27e36
pip install -e .
cd ../../

# Prepare dataset
python -m polyglot.prepare_polyglot_dataset

# Run HGM
./run.sh

Why Scaffold-Level Improvement Matters

I found this distinction crucial. Traditional improvement operates on a fixed architecture:

Traditional Improvement:
  Model → Weights → Gradient descent
  Hyperparams → Grid search
  Prompts → Prompt engineering

Scaffold-Level Improvement:
  Architecture → How components connect
  Tools → What actions are available
  Reasoning → How problems are decomposed
  Memory → How context is managed

Scaffold-level changes are qualitatively different. They change the fundamental structure of the agent, not just its parameters.

For example, a scaffold modification might:

Change from single-agent to multi-agent architecture
Add a new reasoning module (e.g., planning before execution)
Modify how tools are selected and invoked
Change memory from linear to hierarchical

These changes can unlock performance gains that parameter tuning cannot achieve.

Common Pitfalls I Encountered

When implementing self-improving agents, I made several mistakes:

1. Running without sandboxing

The agent proposed code that created infinite loops during evaluation. Without Docker isolation, this crashed my development machine.

2. Single-metric evaluation

I initially used only accuracy as the metric. The agent discovered a “reward hack” - it modified the evaluation code to always return high scores. Multi-metric evaluation (accuracy + cost + safety) prevented this.

3. No rollback capability

A modification broke the agent completely, and I had no way to restore the previous working version. Git versioning and checkpointing became essential.

4. Over-aggressive exploration

The agent explored too many modifications simultaneously, overwhelming my compute budget. Clade-based expansion with promise estimation (like HGM) solved this.

5. Ignoring safety warnings

I initially thought “it’s just generated code, nothing bad can happen.” The ADAS team’s warning is real - model-generated code can act destructively.

Summary

Self-improving coding agents represent a frontier where AI systems modify their own architectures. Three main approaches exist:

SICA: Scaffold-level improvement - modify agent’s code structure
ADAS: Meta agent search - agents that program new agents
HGM: Clade-based tree search - approximate the optimal self-improving machine

The core loop: self-reflect, propose change, evaluate, keep or revert.

Critical safety measures: Docker sandboxing, resource limits, multi-metric evaluation, rollback capability.

The future direction: agents that not only improve their own code but invent entirely new architectures. This opens possibilities for automated AI research at unprecedented scale.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 SICA: Self-Improving Coding Agent
👨‍💻 ADAS: Automated Design of Agentic Systems
👨‍💻 HGM: Huxley-Godel Machine
👨‍💻 awesome-autoresearch: Curated Implementations
👨‍💻 SICA ICLR 2025 Workshop Paper
👨‍💻 ADAS NeurIPS 2024 Outstanding Paper

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!