Skip to content

How Does AutoGen's Self-Debugging Capability Work for Code Generation?

I was skeptical when I first heard about self-debugging AI code. Every developer knows the pain of debugging - understanding error messages, tracing execution, fixing one bug only to find three more. How could an AI possibly do this autonomously?

Then I tried AutoGen and watched it solve a pathfinding problem I’d been stuck on for weeks. The code wasn’t just correct - it was clean, efficient, and arrived at without me babysitting every step.

The Problem: AI Code Usually Breaks

When you ask an LLM to generate code, you typically get:

  1. Code that looks syntactically correct
  2. But fails at runtime due to missing imports
  3. Or uses outdated APIs
  4. Or has logic errors the LLM didn’t catch

I’ve spent countless hours copying code from ChatGPT, running it, seeing errors, and pasting those errors back into the chat. It works, but it’s manual, tedious, and breaks my flow.

AutoGen solves this by automating the entire feedback loop.

How It Works: Two Agents, One Loop

The magic lies in a simple architecture with two specialized agents:

+------------------+ +---------------------+
| AssistantAgent | <------> | UserProxyAgent/ |
| (Code Generator)| | CodeExecutorAgent |
+------------------+ | (Code Executor) |
^ +---------------------+
| |
| Error Feedback |
+----------------------------+

Agent 1: AssistantAgent (The Brain)

This agent:

  • Receives the task description
  • Generates Python code using an LLM (GPT-4, Claude, etc.)
  • Understands error messages and rewrites code
  • Maintains conversation context across iterations

Agent 2: UserProxyAgent or CodeExecutorAgent (The Hands)

This agent:

  • Extracts code blocks from the Assistant’s messages
  • Executes code in a sandbox environment
  • Captures stdout, stderr, and return values
  • Feeds results back to the Assistant

The Self-Debugging Feedback Loop

Here’s what happens when you ask AutoGen to solve a coding problem:

Iteration 1: Initial Attempt

User: "Write a function to calculate factorial"
AssistantAgent generates:
def factorial(n):
if n == 0:
return 1
return n * factorial(n-1)
CodeExecutorAgent runs it and... SUCCESS!
Output: factorial(5) = 120

But what if there’s an error?

Iteration 2: Error Detection

User: "Write a function to parse a JSON file"
AssistantAgent generates:
import json
def parse_json(filepath):
with open(filepath) as f:
return json.load(f)
result = parse_json("nonexistent.json")
CodeExecutorAgent runs it:
ERROR: FileNotFoundError: [Errno 2] No such file or directory: 'nonexistent.json'

Iteration 3: Self-Correction

AssistantAgent sees the error and thinks:
"I need to handle missing files gracefully"
AssistantAgent rewrites:
import json
def parse_json(filepath):
try:
with open(filepath) as f:
return json.load(f)
except FileNotFoundError:
return None
except json.JSONDecodeError as e:
print(f"Invalid JSON: {e}")
return None
CodeExecutorAgent runs it:
SUCCESS! Returns None for missing file

This continues until:

  1. The code runs without errors, OR
  2. A termination condition is met

What Makes This Actually Work

Three critical components enable autonomous debugging:

1. Natural Language Understanding of Errors

The AssistantAgent doesn’t just see “Error code 1” - it receives the full stack trace and error message. An LLM trained on millions of Stack Overflow posts can:

  • Identify the root cause
  • Understand the context
  • Generate an appropriate fix

2. Execution Isolation

Code runs in a sandbox (Docker container or subprocess), so:

  • Malformed code can’t crash your system
  • Each execution starts clean
  • State is captured and returned

3. Conversation Memory

The agent remembers the full history:

User request → First attempt → Error → Fix attempt 1 → Error → Fix attempt 2 → Success

This prevents repeating the same mistakes and enables incremental improvement.

A Real Example: The Pathfinding Problem

I gave AutoGen this task:

"Implement A* pathfinding for a 2D grid with obstacles"

Here’s the actual debug sequence:

Attempt 1: Generated A* with wrong priority queue usage

Error: 'list' object has no attribute 'heappush'

Attempt 2: Fixed heap operations but forgot to import heapq

Error: name 'heapq' is not defined

Attempt 3: Added import but had wrong heuristic calculation

Error: unsupported operand type(s) for -: 'tuple' and 'int'

Attempt 4: Fixed tuple arithmetic, added proper coordinate handling

Result: Working A* implementation with path visualization

Total iterations: 4 My intervention: None

The key insight: Each error became context for the next attempt. The LLM used its understanding of Python errors to systematically eliminate bugs.

Common Mistakes When Setting Up Self-Debugging

Mistake 1: No Termination Condition

# DANGEROUS: Agents may loop forever
user_proxy.initiate_chat(
assistant,
message="Solve this impossible problem"
)

Fix: Always set termination conditions:

user_proxy = UserProxyAgent(
max_consecutive_auto_reply=10, # Stop after 10 turns
human_input_mode="NEVER",
code_execution_config={"use_docker": True}
)

Mistake 2: Unsafe Code Execution

# DANGEROUS: Running untrusted code locally
code_execution_config={"work_dir": "coding"}

Fix: Use Docker isolation:

code_execution_config={
"use_docker": True, # Sandboxed execution
"work_dir": "coding",
"timeout": 60
}

Mistake 3: Vague Task Descriptions

# BAD: Too vague, agent may produce irrelevant code
"Write a sorting function"

Fix: Be specific:

# GOOD: Clear requirements
"Write a function that sorts a list of dictionaries by a specific key.
The key name should be a parameter. Handle edge cases like empty lists
and missing keys. Include doctests."

The Trade-offs

Self-debugging isn’t magic - it has real costs:

Token Consumption: Each iteration sends the full conversation history. Complex bugs can burn through tokens quickly.

Time: Multiple LLM calls mean latency. A simple bug might take 30-60 seconds to fix autonomously.

Non-Determinism: The same error might be fixed differently each run. Sometimes the agent goes down wrong paths.

Limited Scope: Works best for:

  • Pure Python code
  • Self-contained problems
  • Deterministic logic

Struggles with:

  • Database connections
  • External APIs
  • Complex state management

When to Use Self-Debugging

Good use cases:

  • Algorithm implementation
  • Data processing scripts
  • Test case generation
  • Prototyping new features
  • Educational coding tasks

Poor use cases:

  • Production code (you’ll still need human review)
  • Complex system integrations
  • Security-sensitive operations
  • Performance-critical code

Under the Hood: The Conversation Flow

The actual implementation uses a simple but effective pattern:

┌─────────────────────────────────────────────────────────┐
│ AssistantAgent UserProxyAgent │
│ ────────────── ─────────────── │
│ │
│ [System: You are a helpful assistant...] │
│ [User: Write a function to...] │
│ │ │
│ ▼ │
│ [Assistant: Here's the code: │
│ ```python │
│ def solve(): ... │
│ ``` │
│ ] │
│ │ │
│ │ (message sent to UserProxyAgent) │
│ ▼ │
│ [Execute code block] │
│ [Capture output] │
│ │ │
│ │ <─────────────────── [Output/Error] │
│ ▼ │
│ [User (Proxy): Execution result: │
│ Error: ... │
│ or │
│ Output: ... │
│ ] │
│ │ │
│ ▼ │
│ [Assistant: I see the error. Let me fix it: │
│ ```python │
│ def solve(): ... # fixed version │
│ ``` │
│ ] │
│ │ │
│ ... loop continues until success ... │
│ │
└─────────────────────────────────────────────────────────┘

The UserProxyAgent acts as a simulated user - it doesn’t have AI capabilities, it just executes code and returns results. The AssistantAgent does all the thinking.

Key Configuration Options

from autogen import AssistantAgent, UserProxyAgent
# The thinker
assistant = AssistantAgent(
name="assistant",
llm_config={
"model": "gpt-4",
"temperature": 0, # Lower = more deterministic
},
system_message="You are a Python expert. Write clean, efficient code."
)
# The executor
user_proxy = UserProxyAgent(
name="user_proxy",
human_input_mode="NEVER", # Fully autonomous
max_consecutive_auto_reply=10,
code_execution_config={
"work_dir": "coding",
"use_docker": True, # Sandboxed
},
# Termination condition
is_termination_msg=lambda x: x.get("content", "").rstrip().endswith("TERMINATE")
)

Why This Matters

Self-debugging changes the relationship between developers and AI:

Before: I copy-paste code, run it, see errors, paste errors back to AI, repeat After: I describe what I want, AI handles the implementation details

This doesn’t replace developers - it changes what we focus on:

  • Less time on syntax errors and import statements
  • More time on system design and requirements
  • Faster iteration on complex algorithms
  • Better exploration of solution spaces

The code still needs review. The architecture still needs human judgment. But the tedious debugging loop? That’s automated.

Why LLMs Can Debug Their Own Code

LLMs are trained on billions of lines of code, including:

  • Stack Overflow Q&A pairs
  • GitHub issues and fixes
  • Documentation with error examples
  • Tutorial explanations

This creates a model that “understands” error messages not as abstract symbols, but as natural language descriptions of problems with known solutions.

The Convergence of Code Generation and Execution

Self-debugging represents a fundamental shift: code generation is no longer a one-shot operation, but an iterative process with real-world feedback. This mirrors how humans write code - we write, run, debug, and iterate.

Limitations of Current Approaches

  • No persistent learning (each session starts fresh)
  • Limited multi-file reasoning
  • Struggles with complex dependency chains
  • No understanding of system-level constraints

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments