How Does AutoGen's Self-Debugging Capability Work for Code Generation?

Apr 29, 2026

I was skeptical when I first heard about self-debugging AI code. Every developer knows the pain of debugging - understanding error messages, tracing execution, fixing one bug only to find three more. How could an AI possibly do this autonomously?

Then I tried AutoGen and watched it solve a pathfinding problem I’d been stuck on for weeks. The code wasn’t just correct - it was clean, efficient, and arrived at without me babysitting every step.

The Problem: AI Code Usually Breaks

When you ask an LLM to generate code, you typically get:

Code that looks syntactically correct
But fails at runtime due to missing imports
Or uses outdated APIs
Or has logic errors the LLM didn’t catch

I’ve spent countless hours copying code from ChatGPT, running it, seeing errors, and pasting those errors back into the chat. It works, but it’s manual, tedious, and breaks my flow.

AutoGen solves this by automating the entire feedback loop.

How It Works: Two Agents, One Loop

The magic lies in a simple architecture with two specialized agents:

+------------------+          +---------------------+
|  AssistantAgent  | <------> | UserProxyAgent/     |
|  (Code Generator)|          | CodeExecutorAgent   |
+------------------+          | (Code Executor)     |
         ^                    +---------------------+
         |                            |
         |        Error Feedback      |
         +----------------------------+

Agent 1: AssistantAgent (The Brain)

This agent:

Receives the task description
Generates Python code using an LLM (GPT-4, Claude, etc.)
Understands error messages and rewrites code
Maintains conversation context across iterations

Agent 2: UserProxyAgent or CodeExecutorAgent (The Hands)

This agent:

Extracts code blocks from the Assistant’s messages
Executes code in a sandbox environment
Captures stdout, stderr, and the exit code
Feeds results back to the Assistant
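To make those four jobs concrete, here is a minimal sketch using only the standard library. It is not AutoGen's implementation: `extract_code_blocks` and `run_block` are hypothetical helpers I made up for illustration, and a plain subprocess stands in for the sandbox, so it provides no real isolation.

```python
import re
import subprocess
import sys

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fence

# Matches a fenced Python block and captures its body.
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def extract_code_blocks(message: str) -> list[str]:
    """Return the body of every fenced Python block in a chat message."""
    return CODE_BLOCK.findall(message)


def run_block(code: str, timeout: float = 10.0) -> dict:
    """Run one block in a fresh interpreter and capture what it produced."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return {"exit_code": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}


reply = f"Here is the code:\n{FENCE}python\nprint(2 + 3)\n{FENCE}\nLet me know how it goes."
blocks = extract_code_blocks(reply)
result = run_block(blocks[0])
print(result)
```

The dictionary returned by `run_block` is what would be turned into the next chat message for the Assistant.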

The Self-Debugging Feedback Loop

Here’s what happens when you ask AutoGen to solve a coding problem:

Iteration 1: Initial Attempt

User: "Write a function to calculate factorial"

AssistantAgent generates:
def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n-1)

CodeExecutorAgent runs it and... SUCCESS!
Output: factorial(5) = 120

But what if there’s an error?

Iteration 2: Error Detection

User: "Write a function to parse a JSON file"

AssistantAgent generates:
import json
def parse_json(filepath):
    with open(filepath) as f:
        return json.load(f)

result = parse_json("nonexistent.json")

CodeExecutorAgent runs it:
ERROR: FileNotFoundError: [Errno 2] No such file or directory: 'nonexistent.json'

Iteration 3: Self-Correction

AssistantAgent sees the error and thinks:
"I need to handle missing files gracefully"

AssistantAgent rewrites:
import json

def parse_json(filepath):
    try:
        with open(filepath) as f:
            return json.load(f)
    except FileNotFoundError:
        return None
    except json.JSONDecodeError as e:
        print(f"Invalid JSON: {e}")
        return None

CodeExecutorAgent runs it:
SUCCESS! Returns None for missing file

This continues until:

The code runs without errors, OR
A termination condition is met
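The loop itself is small enough to sketch without any LLM at all. In the version below a scripted list of attempts stands in for the AssistantAgent, and `execute` and `debug_loop` are hypothetical names, not AutoGen functions. It assumes no file called `nonexistent.json` exists in the working directory.

```python
import subprocess
import sys

# Scripted stand-in for the LLM: each entry is the code it "writes" on that turn.
# A real AssistantAgent would generate these from the task and the error history.
SCRIPTED_ATTEMPTS = [
    "import json\nprint(json.load(open('nonexistent.json')))",
    (
        "import json\n"
        "try:\n"
        "    print(json.load(open('nonexistent.json')))\n"
        "except FileNotFoundError:\n"
        "    print(None)\n"
    ),
]


def execute(code: str) -> tuple[int, str]:
    """Run code in a fresh interpreter; return (exit code, combined output)."""
    proc = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True, timeout=30
    )
    return proc.returncode, proc.stdout + proc.stderr


def debug_loop(attempts: list[str], max_turns: int = 10) -> tuple[bool, int, list[str]]:
    """Try attempts in order until one exits cleanly or the turn limit is hit."""
    feedback: list[str] = []  # what would be sent back to the model each turn
    for turn, code in enumerate(attempts[:max_turns], start=1):
        exit_code, output = execute(code)
        feedback.append(output)
        if exit_code == 0:
            return True, turn, feedback
    return False, min(len(attempts), max_turns), feedback


ok, turns, feedback = debug_loop(SCRIPTED_ATTEMPTS)
print(ok, turns)
```

Both exit paths from the list above are visible here: the early `return` on a clean exit code, and the `max_turns` cap that ends the loop even if nothing ever succeeds.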

What Makes This Actually Work

Three critical components enable autonomous debugging:

1. Natural Language Understanding of Errors

The AssistantAgent doesn’t just see “Error code 1” - it receives the full stack trace and error message. An LLM trained on millions of Stack Overflow posts can:

Identify the root cause
Understand the context
Generate an appropriate fix

2. Execution Isolation

Code runs in a separate process, ideally inside a Docker container, so:

Malformed code can’t crash the agent itself, and inside Docker it can’t touch your host system
Each execution starts clean
Output and exit status are captured and returned

3. Conversation Memory

The agent remembers the full history:

User request → First attempt → Error → Fix attempt 1 → Error → Fix attempt 2 → Success

This prevents repeating the same mistakes and enables incremental improvement.
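As a rough picture of what that memory looks like, here is a sketch that stores the history as a list of role/content messages, the shape most chat-completion APIs use. The `record` helper, the placeholder attempt strings, and the `exitcode:` prefix are all illustrative assumptions, not AutoGen's exact message format.

```python
# Each message is a role/content pair; the whole list is resent on every model call.
history: list[dict[str, str]] = []


def record(role: str, content: str) -> None:
    """Append one turn to the shared conversation history."""
    history.append({"role": role, "content": content})


record("user", "Implement A* pathfinding for a 2D grid with obstacles")
record("assistant", "<attempt 1 code>")
record("user", "exitcode: 1\nAttributeError: 'list' object has no attribute 'heappush'")
record("assistant", "<attempt 2 code>")
record("user", "exitcode: 1\nNameError: name 'heapq' is not defined")

# The next model call receives every earlier attempt and every earlier error.
errors_seen = [
    m["content"] for m in history if m["role"] == "user" and "Error" in m["content"]
]
print(len(history), len(errors_seen))
```

Because the third attempt is generated with both earlier errors in view, the model has no reason to reintroduce the `heappush` mistake while fixing the missing import.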

A Real Example: The Pathfinding Problem

I gave AutoGen this task:

"Implement A* pathfinding for a 2D grid with obstacles"

Here’s the actual debug sequence:

Attempt 1: Generated A* with wrong priority queue usage

Error: 'list' object has no attribute 'heappush'

Attempt 2: Fixed heap operations but forgot to import heapq

Error: name 'heapq' is not defined

Attempt 3: Added import but had wrong heuristic calculation

Error: unsupported operand type(s) for -: 'tuple' and 'int'

Attempt 4: Fixed tuple arithmetic, added proper coordinate handling

Result: Working A* implementation with path visualization

Total iterations: 4
My intervention: None

The key insight: Each error became context for the next attempt. The LLM used its understanding of Python errors to systematically eliminate bugs.
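For reference, here is a minimal A* sketch that avoids the three errors above. It is my own reconstruction for illustration, not the code AutoGen produced: it assumes a 4-connected grid where `1` marks an obstacle, uses a Manhattan heuristic, and leaves out the path visualization.

```python
import heapq
from typing import Optional

Cell = tuple[int, int]


def manhattan(a: Cell, b: Cell) -> int:
    """Heuristic: index into the tuples rather than subtracting them directly."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def astar(grid: list[list[int]], start: Cell, goal: Cell) -> Optional[list[Cell]]:
    """Shortest 4-connected path on a grid where 1 marks an obstacle."""
    rows, cols = len(grid), len(grid[0])
    # Heap entries are (estimated total cost, cost so far, cell).
    open_heap: list[tuple[int, int, Cell]] = [(manhattan(start, goal), 0, start)]
    came_from: dict[Cell, Cell] = {}
    best_cost: dict[Cell, int] = {start: 0}

    while open_heap:
        _, cost, current = heapq.heappop(open_heap)
        if current == goal:
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return path[::-1]
        if cost > best_cost[current]:
            continue  # stale heap entry; a cheaper route was already found
        r, c = current
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == 1:
                continue
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                came_from[nxt] = current
                # heappush is a module-level function, not a list method
                heapq.heappush(open_heap, (new_cost + manhattan(nxt, goal), new_cost, nxt))
    return None


grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(astar(grid, (0, 0), (2, 0)))
```

The comments mark the spots where attempts 1 through 3 went wrong: `heappush` called on the list, `heapq` not imported, and arithmetic applied to a whole tuple.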

Common Mistakes When Setting Up Self-Debugging

Mistake 1: No Termination Condition

# DANGEROUS: Agents may loop forever
user_proxy.initiate_chat(
    assistant,
    message="Solve this impossible problem"
)

Fix: Always set termination conditions:

user_proxy = UserProxyAgent(
    name="user_proxy",  # name is a required argument
    max_consecutive_auto_reply=10,  # Stop after 10 turns
    human_input_mode="NEVER",
    code_execution_config={"use_docker": True}
)

Mistake 2: Unsafe Code Execution

# DANGEROUS: Running untrusted code directly on the host
code_execution_config={"work_dir": "coding", "use_docker": False}

Fix: Use Docker isolation:

code_execution_config={
    "use_docker": True,  # Sandboxed execution
    "work_dir": "coding",
    "timeout": 60
}
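The `timeout` entry matters as much as the isolation, because generated code can hang as easily as it can crash. The sketch below shows the same idea with a plain subprocess rather than Docker. `run_with_timeout` is a hypothetical helper, not part of AutoGen.

```python
import subprocess
import sys


def run_with_timeout(code: str, timeout: float) -> str:
    """Run code in a child interpreter and give up after `timeout` seconds."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return "timed out"
    return "ok" if proc.returncode == 0 else "failed"


print(run_with_timeout("print('done')", timeout=10))
print(run_with_timeout("while True: pass", timeout=1))
```

Without the limit, the second call would block the whole agent conversation indefinitely.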

Mistake 3: Vague Task Descriptions

# BAD: Too vague, agent may produce irrelevant code
"Write a sorting function"

Fix: Be specific:

# GOOD: Clear requirements
"Write a function that sorts a list of dictionaries by a specific key.
The key name should be a parameter. Handle edge cases like empty lists
and missing keys. Include doctests."
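To show how much the specific prompt pins down, here is one function that would satisfy it. This is my own sketch, not agent output, and the policy for missing keys (those records sort last, in their original order) is an assumption the prompt leaves open.

```python
import doctest


def sort_by_key(records, key):
    """Return a new list of dicts sorted by `key`.

    Records missing `key` sort last, in their original order.

    >>> sort_by_key([{"n": 2}, {"n": 1}], "n")
    [{'n': 1}, {'n': 2}]
    >>> sort_by_key([], "n")
    []
    >>> sort_by_key([{"m": 0}, {"n": 1}], "n")
    [{'n': 1}, {'m': 0}]
    """
    # Split first so values are only compared among records that have the key.
    present = [r for r in records if key in r]
    missing = [r for r in records if key not in r]
    return sorted(present, key=lambda r: r[key]) + missing


if __name__ == "__main__":
    doctest.testmod()
```

Every clause of the prompt maps to something checkable here: the key parameter, the empty-list case, the missing-key case, and the doctests themselves. The vague version gives the agent nothing to verify against.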

The Trade-offs

Self-debugging isn’t magic - it has real costs:

Token Consumption: Each iteration sends the full conversation history. Complex bugs can burn through tokens quickly.
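A back-of-the-envelope calculation shows how quickly this adds up. The sketch assumes a fixed base prompt and a fixed number of tokens added per attempt/error pair. The numbers are invented for illustration, not measurements.

```python
def total_prompt_tokens(base: int, per_turn: int, calls: int) -> int:
    """Prompt tokens summed over `calls` model calls when history is resent each time.

    Call k sees the base prompt plus the (k - 1) earlier attempt/error pairs.
    """
    return sum(base + (k - 1) * per_turn for k in range(1, calls + 1))


for calls in (1, 4, 10):
    print(calls, total_prompt_tokens(base=200, per_turn=600, calls=calls))
```

With these made-up figures, four calls cost 4,400 prompt tokens and ten calls cost 29,000: the total grows roughly with the square of the number of iterations, not linearly.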

Time: Multiple LLM calls mean latency. A simple bug might take 30-60 seconds to fix autonomously.

Non-Determinism: The same error might be fixed differently each run. Sometimes the agent goes down wrong paths.

Limited Scope: Works best for:

Pure Python code
Self-contained problems
Deterministic logic

Struggles with:

Database connections
External APIs
Complex state management

When to Use Self-Debugging

Good use cases:

Algorithm implementation
Data processing scripts
Test case generation
Prototyping new features
Educational coding tasks

Poor use cases:

Production code (you’ll still need human review)
Complex system integrations
Security-sensitive operations
Performance-critical code

Under the Hood: The Conversation Flow

The actual implementation uses a simple but effective pattern:

┌─────────────────────────────────────────────────────────┐
│  AssistantAgent                    UserProxyAgent       │
│  ──────────────                    ───────────────      │
│                                                         │
│  [System: You are a helpful assistant...]               │
│  [User: Write a function to...]                         │
│         │                                               │
│         ▼                                               │
│  [Assistant: Here's the code:                           │
│   ```python                                             │
│   def solve(): ...                                      │
│   ```                                                   │
│  ]                                                      │
│         │                                               │
│         │  (message sent to UserProxyAgent)             │
│         ▼                                               │
│                                    [Execute code block] │
│                                    [Capture output]     │
│         │                                               │
│         │  <─────────────────── [Output/Error]         │
│         ▼                                               │
│  [User (Proxy): Execution result:                       │
│   Error: ...                                            │
│   or                                                    │
│   Output: ...                                           │
│  ]                                                      │
│         │                                               │
│         ▼                                               │
│  [Assistant: I see the error. Let me fix it:            │
│   ```python                                             │
│   def solve(): ... # fixed version                      │
│   ```                                                   │
│  ]                                                      │
│         │                                               │
│         ... loop continues until success ...            │
│                                                         │
└─────────────────────────────────────────────────────────┘

The UserProxyAgent acts as a simulated user. Configured this way, it makes no LLM calls of its own; it just executes code and returns the results. The AssistantAgent does all the thinking.

Key Configuration Options

from autogen import AssistantAgent, UserProxyAgent

# The thinker
assistant = AssistantAgent(
    name="assistant",
    llm_config={
        "model": "gpt-4",
        "temperature": 0,  # Lower = more deterministic
    },
    system_message="You are a Python expert. Write clean, efficient code."
)

# The executor
user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",  # Fully autonomous
    max_consecutive_auto_reply=10,
    code_execution_config={
        "work_dir": "coding",
        "use_docker": True,  # Sandboxed
    },
    # Termination condition
    # `or ""` guards against messages whose content is None
    is_termination_msg=lambda x: (x.get("content") or "").rstrip().endswith("TERMINATE")
)
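The termination check is plain Python, so it can be tested on its own before wiring it into an agent. Below is a standalone sketch of the same predicate written as a named function. The sample messages are made up.

```python
from typing import Optional


def is_termination_msg(message: dict) -> bool:
    """True when the message text ends with the TERMINATE keyword."""
    content: Optional[str] = message.get("content")
    # `content` can be None (for example on a tool-call message), so fall back to "".
    return (content or "").rstrip().endswith("TERMINATE")


print(is_termination_msg({"content": "All tests pass.\nTERMINATE  "}))
print(is_termination_msg({"content": "TERMINATE once the tests pass"}))
print(is_termination_msg({"content": None}))
```

Only the first message ends the chat. Matching on the end of the message, not anywhere inside it, keeps the loop from stopping when the Assistant merely mentions the keyword.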

Why This Matters

Self-debugging changes the relationship between developers and AI:

Before: I copy-paste code, run it, see errors, paste errors back to AI, repeat
After: I describe what I want, AI handles the implementation details

This doesn’t replace developers - it changes what we focus on:

Less time on syntax errors and import statements
More time on system design and requirements
Faster iteration on complex algorithms
Better exploration of solution spaces

The code still needs review. The architecture still needs human judgment. But the tedious debugging loop? That’s automated.

Why LLMs Can Debug Their Own Code

LLMs are trained on billions of lines of code and the discussion that surrounds it, including:

Stack Overflow Q&A pairs
GitHub issues and fixes
Documentation with error examples
Tutorial explanations

This creates a model that “understands” error messages not as abstract symbols, but as natural language descriptions of problems with known solutions.

The Convergence of Code Generation and Execution

Self-debugging represents a fundamental shift: code generation is no longer a one-shot operation, but an iterative process with real-world feedback. This mirrors how humans write code - we write, run, debug, and iterate.

Limitations of Current Approaches

No persistent learning (each session starts fresh)
Limited multi-file reasoning
Struggles with complex dependency chains
No understanding of system-level constraints

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.
