Why Does Writing a Test Help AI Fix Bugs Faster?

Mar 17, 2026

I spent three hours watching Codex fail at the same bug. Three different approaches, three confident “I’ve fixed it” responses, three failures. The fix took five minutes once I changed my approach.

The difference? I wrote a failing test first.

The Bug That Wouldn’t Die

I had a filter function that was supposed to return an empty list when no items matched the predicate. Simple, right? But every time Codex tried to fix it, it would:

Add null checks everywhere except where needed
Refactor the function to be “more robust”
Add logging to help debug

None of these addressed the actual issue: the function returned None instead of [] when the filter excluded all items.

Attempt 1: "Added null safety checks"
Result: Still returns None, now with extra code

Attempt 2: "Refactored for clarity"
Result: Same bug, more lines

Attempt 3: "Added comprehensive logging"
Result: Now I can see it returning None more clearly

I was frustrated. The AI was clearly trying, but it kept missing the mark.

The Test That Changed Everything

Instead of asking again, I wrote this:

def test_filter_returns_empty_when_predicate_excludes_all():
    items = [1, 2, 3, 4, 5]
    result = filter_items(items, lambda x: x > 10)
    # Expected: []
    # Actual: None

    assert result == [], f"Expected [], got {result}"

I shared the failing test output with Codex:

FAILED - test_filter_returns_empty_when_predicate_excludes_all
AssertionError: Expected [], got None

The fix came immediately:

def filter_items(items, predicate):
    return [item for item in items if predicate(item)]
    # Previously had a special case returning None when result was empty

One attempt. Five minutes. Done.

Why Tests Unstick AI

The problem wasn’t that Codex couldn’t understand the code. The problem was that without a test, I was asking it to:

Guess what “broken” means - Is it returning wrong values? Crashing? Performance issues?
Invent success criteria - What does “fixed” look like?
Verify without tools - How does it check if the fix worked?

A test answers all three questions explicitly. Look at what the test provides:

┌─────────────────────────────────────────────────────────┐
│  Test Component     │  What AI Gets                     │
├─────────────────────────────────────────────────────────┤
│  Test name         │  "filter returns empty list..."   │
│                    │  → Problem description             │
├─────────────────────────────────────────────────────────┤
│  Input             │  items=[1,2,3,4,5], predicate     │
│                    │  → Exact reproduction case        │
├─────────────────────────────────────────────────────────┤
│  Assertion         │  result == []                     │
│                    │  → Clear success criteria         │
├─────────────────────────────────────────────────────────┤
│  Failure message   │  "Expected [], got None"           │
│                    │  → What specifically is wrong      │
└─────────────────────────────────────────────────────────┘

This isn’t just about having more information. It’s about framing.

The Framing Problem

Here’s what I’ve realized after using AI coding assistants for months: they don’t struggle with code complexity. They struggle with problem framing.

When I said “fix the filter function,” I was thinking about one specific edge case. But the AI had no way to know that. It was exploring the entire problem space:

"Fix the filter function" could mean:
├── Handle null inputs?
├── Improve performance?
├── Fix type errors?
├── Add error handling?
├── Refactor for readability?
└── ...or maybe the actual bug?

A test collapses this infinite possibility space into one specific target. The AI stops wandering and starts solving.

The Community Consensus

I’m not alone in this discovery. A recent thread on Reddit’s r/codex captured the same insight:

“After three unsuccessful attempts, Codex still couldn’t fix the issue. So I investigated the data myself and wrote the root cause… Then I asked it to write a test for the case and reproduce the steps causing the problem. Once it did that, it fixed the issue.”

Another user chimed in:

“A lot of the time the model is not really stuck on code, it is stuck on having the wrong frame for the problem. Once you write down the actual failure mode and force a repro or test, it stops wandering and gets useful fast.”

This matches my experience exactly. The test-first approach is the turning point.

Common Anti-Patterns

I’ve learned (the hard way) what doesn’t work:

1. Writing Tests After the Fix

❌ Wrong order:
1. "Fix the bug" → AI guesses → multiple failures
2. Write test → test passes
3. Wonder why it took so long

✓ Right order:
1. Write failing test → clear failure
2. "Make this test pass" → AI targets exactly
3. Fix in one attempt

The test before the fix is what grounds the AI’s reasoning.

2. Vague Test Names

# Bad - tells AI nothing
def test_bug():
    assert filter_items([], lambda x: x) == []

# Good - documents the specific bug
def test_returns_empty_list_when_filter_excludes_all_items():
    items = [1, 2, 3]
    result = filter_items(items, lambda x: x > 100)
    assert result == [], f"Expected empty list, got {result}"

The test name should read like a bug report.

I used to paste just the test code. Now I always include the failure output:

Test: test_returns_empty_list_when_filter_excludes_all_items
Input: items=[1,2,3], predicate=x>100
Expected: []
Actual: None

Error: AssertionError: Expected [], got None

The AI needs to see what it’s trying to fix, not just the test that will eventually verify the fix.

A Simple Framework

When an AI assistant gets stuck on a bug, I follow this sequence:

1. REPRODUCE
   Write the smallest test case that triggers the bug

2. ISOLATE
   Remove all unrelated code from the test

3. DOCUMENT
   Name the test after the bug behavior

4. SHARE
   Give AI both test code AND failure output

5. ITERATE
   Let AI run test, see failure, fix, verify

This transforms debugging from an art into a science. The AI becomes a precise tool rather than a creative guesser.

Why This Matters More Now

AI coding assistants are getting better at writing code. But they’re not getting better at reading minds. The gap between what you want and what you ask for still exists.

Tests bridge that gap. They turn vague intent into verifiable specification. They create a feedback loop that doesn’t require your intervention.

I’ve stopped asking “fix this bug.” Now I ask “make this test pass.” The difference in results is night and day.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Reddit r/codex: You were right, eventually
👨‍💻 Martin Fowler on TDD
👨‍💻 Anthropic Claude Coding Best Practices

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!