Why Does Writing a Test Help AI Fix Bugs Faster?
I spent three hours watching Codex fail at the same bug. Three different approaches, three confident “I’ve fixed it” responses, three failures. The fix took five minutes once I changed my approach.
The difference? I wrote a failing test first.
The Bug That Wouldn’t Die
I had a filter function that was supposed to return an empty list when no items matched the predicate. Simple, right? But every time Codex tried to fix it, it would:
- Add null checks everywhere except where needed
- Refactor the function to be “more robust”
- Add logging to help debug
None of these addressed the actual issue: the function returned None instead of [] when the filter excluded all items.
Attempt 1: "Added null safety checks"Result: Still returns None, now with extra code
Attempt 2: "Refactored for clarity"Result: Same bug, more lines
Attempt 3: "Added comprehensive logging"Result: Now I can see it returning None more clearlyI was frustrated. The AI was clearly trying, but it kept missing the mark.
The Test That Changed Everything
Instead of asking again, I wrote this:
def test_filter_returns_empty_when_predicate_excludes_all(): items = [1, 2, 3, 4, 5] result = filter_items(items, lambda x: x > 10) # Expected: [] # Actual: None
assert result == [], f"Expected [], got {result}"I shared the failing test output with Codex:
FAILED - test_filter_returns_empty_when_predicate_excludes_allAssertionError: Expected [], got NoneThe fix came immediately:
def filter_items(items, predicate): return [item for item in items if predicate(item)] # Previously had a special case returning None when result was emptyOne attempt. Five minutes. Done.
Why Tests Unstick AI
The problem wasn’t that Codex couldn’t understand the code. The problem was that without a test, I was asking it to:
- Guess what “broken” means - Is it returning wrong values? Crashing? Performance issues?
- Invent success criteria - What does “fixed” look like?
- Verify without tools - How does it check if the fix worked?
A test answers all three questions explicitly. Look at what the test provides:
┌─────────────────────────────────────────────────────────┐│ Test Component │ What AI Gets │├─────────────────────────────────────────────────────────┤│ Test name │ "filter returns empty list..." ││ │ → Problem description │├─────────────────────────────────────────────────────────┤│ Input │ items=[1,2,3,4,5], predicate ││ │ → Exact reproduction case │├─────────────────────────────────────────────────────────┤│ Assertion │ result == [] ││ │ → Clear success criteria │├─────────────────────────────────────────────────────────┤│ Failure message │ "Expected [], got None" ││ │ → What specifically is wrong │└─────────────────────────────────────────────────────────┘This isn’t just about having more information. It’s about framing.
The Framing Problem
Here’s what I’ve realized after using AI coding assistants for months: they don’t struggle with code complexity. They struggle with problem framing.
When I said “fix the filter function,” I was thinking about one specific edge case. But the AI had no way to know that. It was exploring the entire problem space:
"Fix the filter function" could mean:├── Handle null inputs?├── Improve performance?├── Fix type errors?├── Add error handling?├── Refactor for readability?└── ...or maybe the actual bug?A test collapses this infinite possibility space into one specific target. The AI stops wandering and starts solving.
The Community Consensus
I’m not alone in this discovery. A recent thread on Reddit’s r/codex captured the same insight:
“After three unsuccessful attempts, Codex still couldn’t fix the issue. So I investigated the data myself and wrote the root cause… Then I asked it to write a test for the case and reproduce the steps causing the problem. Once it did that, it fixed the issue.”
Another user chimed in:
“A lot of the time the model is not really stuck on code, it is stuck on having the wrong frame for the problem. Once you write down the actual failure mode and force a repro or test, it stops wandering and gets useful fast.”
This matches my experience exactly. The test-first approach is the turning point.
Common Anti-Patterns
I’ve learned (the hard way) what doesn’t work:
1. Writing Tests After the Fix
❌ Wrong order:1. "Fix the bug" → AI guesses → multiple failures2. Write test → test passes3. Wonder why it took so long
✓ Right order:1. Write failing test → clear failure2. "Make this test pass" → AI targets exactly3. Fix in one attemptThe test before the fix is what grounds the AI’s reasoning.
2. Vague Test Names
# Bad - tells AI nothingdef test_bug(): assert filter_items([], lambda x: x) == []# Good - documents the specific bugdef test_returns_empty_list_when_filter_excludes_all_items(): items = [1, 2, 3] result = filter_items(items, lambda x: x > 100) assert result == [], f"Expected empty list, got {result}"The test name should read like a bug report.
3. Sharing Code Without Output
I used to paste just the test code. Now I always include the failure output:
Test: test_returns_empty_list_when_filter_excludes_all_itemsInput: items=[1,2,3], predicate=x>100Expected: []Actual: None
Error: AssertionError: Expected [], got NoneThe AI needs to see what it’s trying to fix, not just the test that will eventually verify the fix.
A Simple Framework
When an AI assistant gets stuck on a bug, I follow this sequence:
1. REPRODUCE Write the smallest test case that triggers the bug
2. ISOLATE Remove all unrelated code from the test
3. DOCUMENT Name the test after the bug behavior
4. SHARE Give AI both test code AND failure output
5. ITERATE Let AI run test, see failure, fix, verifyThis transforms debugging from an art into a science. The AI becomes a precise tool rather than a creative guesser.
Why This Matters More Now
AI coding assistants are getting better at writing code. But they’re not getting better at reading minds. The gap between what you want and what you ask for still exists.
Tests bridge that gap. They turn vague intent into verifiable specification. They create a feedback loop that doesn’t require your intervention.
I’ve stopped asking “fix this bug.” Now I ask “make this test pass.” The difference in results is night and day.
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Reddit r/codex: You were right, eventually
- 👨💻 Martin Fowler on TDD
- 👨💻 Anthropic Claude Coding Best Practices
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
Comments