Can AI Agents Find Bugs That Expert Engineers Miss?
Problem
I was skeptical about AI code review tools. How could an AI agent find bugs that experienced engineers miss? If Andrej Karpathy - former Tesla AI director, OpenAI co-founder, one of the most respected ML practitioners - reviewed his own code and found nothing wrong, surely an AI couldn’t do better.
Then I read about his autoresearch project. An AI agent ran 700 experiments over 48 hours and discovered a bug Karpathy himself had written and missed in his own QKNorm implementation. This changed my perspective entirely.
This post explains how AI agents find bugs that experts miss, using the QKNorm case study as a concrete example.
What Is the QKNorm Bug?
QKNorm stands for Query-Key Normalization, a technique used in attention mechanisms of transformer models. The idea is simple: normalize the query and key vectors before computing their dot product attention scores.
Karpathy implemented this correctly - or so he thought. The bug was subtle:
```python
import torch.nn as nn
import torch.nn.functional as F

# Original implementation (with bug)
class QKNorm(nn.Module):
    """Query-Key normalization for attention"""

    def __init__(self, dim, norm_eps=1e-6):
        super().__init__()
        self.norm_eps = norm_eps  # Guards against division by zero when normalizing
        self.scale = dim ** -0.5  # Standard attention scaling

    def forward(self, q, k):
        # Normalize query and key vectors
        q_norm = F.normalize(q, dim=-1, eps=self.norm_eps)
        k_norm = F.normalize(k, dim=-1, eps=self.norm_eps)

        # BUG: Missing scaler multiplier!
        # After normalization, vectors have unit length
        # The standard scale factor alone is insufficient
        attn = q_norm @ k_norm.transpose(-2, -1) * self.scale
        return attn
```
The code runs without errors. It produces reasonable-looking outputs. But the attention becomes too diffuse - the model can’t focus properly on relevant information.
Why Karpathy Missed It
Even expert engineers have blind spots. I identified several reasons why manual code review failed here:
Reason 1: The code "works"
- No runtime errors
- Outputs look reasonable
- Model still trains
Reason 2: Domain knowledge creates assumptions
- "I know how attention should work"
- "Standard scaling is correct"
- Experienced engineers skip "obvious" checks
Reason 3: Bug manifests subtly
- Not a crash or obvious wrong output
- Performance degradation, not failure
- Requires empirical testing to detect
Reason 4: Human attention is finite
- Reviewers see what they expect
- Pattern recognition can miss anomalies
- We focus on complexity, not simplicity
Karpathy had all the domain knowledge. He wrote the code. He reviewed it. He still missed the bug because his experience told him “this should work.”
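Reason 3 says this kind of bug only shows up when you measure something. As a minimal sketch of what such a measurement could look like (my own illustration, not something taken from autoresearch), the helper below scores how spread out one row of attention weights is. The function name and the example rows are hypothetical.

```python
import math

def normalized_entropy(weights):
    """Entropy of one attention row divided by log(n).

    Returns a value near 1.0 for a uniform row and near 0.0 for a one-hot row.
    Assumes the row has at least two entries and sums to 1.
    """
    n = len(weights)
    entropy = -sum(p * math.log(p) for p in weights if p > 0.0)
    return entropy / math.log(n)

# Hypothetical attention rows over four keys
diffuse_row = [0.28, 0.25, 0.25, 0.22]   # barely prefers any key
focused_row = [0.97, 0.01, 0.01, 0.01]   # clearly attends to one key

diffuse_score = normalized_entropy(diffuse_row)
focused_score = normalized_entropy(focused_row)
print(f"diffuse: {diffuse_score:.3f}, focused: {focused_score:.3f}")
```

A score that stays close to 1.0 for every head throughout training would be a hint that attention cannot sharpen, even though nothing crashes.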
How the AI Agent Found It
The autoresearch agent used a different approach: exhaustive empirical testing rather than reasoning.
Agent Strategy:
1. Read program.md (research goal)
2. Propose code modifications systematically
3. Train for 5 minutes per experiment
4. Measure validation loss
5. Commit improvements, discard failures
6. Repeat 700 times in 48 hours
Key difference from humans:
- No assumptions about "correct" implementation
- Tests variations humans would never try
- Empirical validation, not reasoning
The agent tried adding various components to QKNorm. Most failed. One succeeded: adding a learnable scaler multiplier.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Fixed implementation (agent's discovery)
class QKNorm(nn.Module):
    """Query-Key normalization with proper scaling"""

    def __init__(self, dim, norm_eps=1e-6):
        super().__init__()
        self.norm_eps = norm_eps
        self.scale = dim ** -0.5
        self.scaler = nn.Parameter(torch.ones(1))  # Learnable scaler

    def forward(self, q, k):
        q_norm = F.normalize(q, dim=-1, eps=self.norm_eps)
        k_norm = F.normalize(k, dim=-1, eps=self.norm_eps)

        # FIX: Scaler controls attention sharpness
        attn = q_norm @ k_norm.transpose(-2, -1) * self.scale * self.scaler
        return attn
```
The learnable scaler parameter lets the model adjust attention sharpness during training. Without it, the scores of normalized vectors stay in the same narrow range no matter what the model learns, making attention too uniform.
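To get a feel for how much work the scaler has to do, here is a rough back-of-the-envelope sketch. It assumes a toy setting that is not from Karpathy's code: one key matches the query perfectly (cosine 1), every other key is unrelated (cosine 0), and the sizes `dim = 64` and `n_keys = 1024` are picked only for illustration.

```python
import math

def peak_weight(logit_gap, n_keys):
    """Softmax weight of one key whose logit beats the other n_keys - 1 by logit_gap."""
    return 1.0 / (1.0 + (n_keys - 1) * math.exp(-logit_gap))

def required_scaler(target_weight, n_keys, dim):
    """Scaler needed for the best key to reach target_weight in the toy setting."""
    gap = math.log(target_weight * (n_keys - 1) / (1.0 - target_weight))
    return gap / (dim ** -0.5)

dim, n_keys = 64, 1024            # illustrative sizes, not from the post
scale = dim ** -0.5               # 0.125

without_scaler = peak_weight(1.0 * scale, n_keys)
needed = required_scaler(0.9, n_keys, dim)

print(f"peak weight with scale only: {without_scaler:.4f}")
print(f"scaler needed for a 0.9 peak: {needed:.1f}")
```

In this toy setting the best possible match gets roughly a tenth of a percent of the attention when only `dim ** -0.5` is applied, and the multiplier would have to grow to several dozen before one key can dominate. A learnable parameter lets training find that value instead of hard-coding it.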
Why Scale Matters After Normalization
I needed to understand why this bug caused problems. Here’s what I learned:
Before normalization:
- Query and Key vectors have varying magnitudes
- Dot product = magnitude_q x magnitude_k x cosine(angle)
- Larger vectors contribute more to attention
After normalization:
- All vectors have unit magnitude (length = 1)
- Dot product = cosine(angle) only
- Every score is bounded in [-1, 1], so no key can stand out through magnitude
The problem:
- Standard attention relies on magnitude differences
- Normalization removes this signal
- Without additional scaling, attention is "flat"
Standard attention uses magnitude to weight importance. Normalization removes magnitude, so you need another way to control sharpness. Multiplying the bounded scores by dim ** -0.5 squeezes them into an even narrower band around zero, and softmax over nearly equal values returns a nearly uniform distribution. The missing scaler was a fundamental mismatch between normalization and the standard attention formula.
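The before/after comparison is easy to check by hand. The sketch below uses tiny made-up vectors (4 dimensions, three keys) and plain Python instead of PyTorch so it runs anywhere; all names and values are my own illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Rescale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def softmax(logits):
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up query and keys: aligned, orthogonal, and opposite to the query
query = [3.0, 0.0, 0.0, 0.0]
keys = [[4.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [-1.0, 0.0, 0.0, 0.0]]
scale = len(query) ** -0.5                   # standard dim ** -0.5 factor

raw_scores = [dot(query, k) for k in keys]                        # magnitude x cosine
cos_scores = [dot(normalize(query), normalize(k)) for k in keys]  # cosine only

raw_weights = softmax([s * scale for s in raw_scores])
cos_weights = softmax([s * scale for s in cos_scores])

print("raw scores:", raw_scores, "-> weights:", [round(w, 3) for w in raw_weights])
print("cos scores:", cos_scores, "-> weights:", [round(w, 3) for w in cos_weights])
```

With magnitudes intact, the aligned key takes almost all of the attention. After normalization the same key gets only about half, even though it is still a perfect directional match.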
The Scale of Experimentation
The numbers explain why the AI succeeded where humans failed:
Human approach:
- Maybe 5-10 experiments per day
- Each experiment takes hours of thought
- Subject to fatigue and assumptions
- Cost: Low throughput
AI agent approach:
- 700 experiments in 48 hours
- 5 minutes per experiment
- No fatigue, no assumptions
- Cost: High throughput, systematic coverage
Result comparison:
- Human: Might test 2-3 variations of QKNorm
- AI: Tested hundreds of variations
- AI found the one that worked
Scale matters. The agent tested variations a human would never think worth trying. One of those “unlikely” variations was the fix.
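Taking the round numbers above at face value, the throughput gap is simple arithmetic. The snippet below only restates figures from this post; the variable names are mine.

```python
experiments = 700
wall_clock_hours = 48
minutes_per_experiment = 5
human_per_day_low, human_per_day_high = 5, 10

agent_per_day = experiments / (wall_clock_hours / 24)
speedup_low = agent_per_day / human_per_day_high
speedup_high = agent_per_day / human_per_day_low

# 700 runs x 5 minutes is about 58 hours of back-to-back training,
# so "5 minutes" and "48 hours" are best read as approximate figures.
sequential_hours = experiments * minutes_per_experiment / 60

print(f"agent: {agent_per_day:.0f} experiments/day")
print(f"speedup over a human: {speedup_low:.0f}x to {speedup_high:.0f}x")
print(f"sequential training time: {sequential_hours:.1f} hours")
```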
What Makes AI Agents Good at Finding Bugs
I analyzed the Reddit discussion and identified key factors:
Factor 1: No assumptions
- Agents don't "know" what should work
- They test what might work
- More exploration, less expectation
Factor 2: Exhaustive search
- Humans skip "obviously wrong" options
- Agents test everything systematically
- What's obvious to humans might be wrong
Factor 3: Empirical validation
- Agents measure results, not reason about correctness
- Code that "looks wrong" might work
- Code that "looks right" might be buggy
Factor 4: Documentation
- Agents explain what they found
- Research remains legible to humans
- We can understand and validate results
A Reddit user named CryptographerUnable8 raised an important point: “The real question isn’t whether AI can do research. It’s whether the research AI does will be legible to us.”
The autoresearch agent documented its findings. Humans can understand why the scaler fix works. The agent augments human capability rather than replacing it.
Common Misconceptions About AI Code Review
I held several wrong beliefs before this case study. Let me address them:
Misconception 1: “AI Can’t Understand Complex ML Code”
Modern AI agents can work productively with transformer architectures, including attention mechanisms, normalization, and training dynamics. The QKNorm case is evidence of this: the agent made a targeted, sensible change to attention code.
Misconception 2: “If I Wrote It, I Understand It”
Karpathy wrote the buggy code. He understood QKNorm conceptually. But implementation details differ from conceptual understanding. Even experts miss things in their own code.
Why authors miss bugs:
- We read what we intended to write
- We skip details we "know" are correct
- Our mental model fills gaps in code
- Fresh eyes see what's actually there
Misconception 3: “Code Review by Experts Is Enough”
Human review is valuable but limited. We have finite attention, patterned expectations, and domain assumptions. AI agents complement human review with systematic exploration.
Misconception 4: “Bugs Cause Crashes”
In ML code, bugs often manifest as degraded performance, not crashes. The QKNorm bug made attention less effective. The model still worked, just not optimally.
```python
# How ML bugs differ from traditional bugs

# Traditional bug (crashes):
def calculate_average(numbers):
    total = 0
    for n in numbers:
        total += n
    return total / len(numbers)  # Raises ZeroDivisionError if numbers is empty!

# ML bug (performance degradation):
def attention(query, key, value):
    # Missing scaling - model still trains
    # But learns slower, converges worse
    scores = query @ key.transpose(-2, -1)
    # Should be: scores * (query.shape[-1] ** -0.5)
    weights = scores.softmax(dim=-1)
    return weights @ value  # No crash, just worse model
```
How to Use AI Agents for Bug Detection
Based on this case study, here’s my approach:
Step 1: Prepare Your Codebase
The agent needs a minimal, understandable codebase. The training script in autoresearch, a stripped-down single-file version of Karpathy’s nanochat, is about 630 lines. Large codebases confuse agents.
Good for AI review:
- Focused, single-purpose code
- Clear input/output
- Quick evaluation (minutes, not hours)
- Well-documented goals
Bad for AI review:
- Large, complex systems
- Multiple interdependent modules
- Long evaluation times
- Unclear success metrics
Step 2: Define a Clear Metric
Autoresearch uses validation loss as its single metric. This makes decisions binary: improved or not improved.
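```python
# Placeholder values and helpers so the snippet runs on its own
baseline_loss, new_loss = 1.00, 0.98
baseline_speed, new_speed = 100.0, 95.0

def commit_change():
    print("commit")

def discard_change():
    print("discard")

# Good metric: binary decision
if new_loss < baseline_loss:
    commit_change()
else:
    discard_change()

# Bad metric: multiple objectives
if new_loss < baseline_loss and new_speed > baseline_speed:
    # Which matters more? Agent can't decide
    commit_change()
```
Step 3: Run Systematic Experiments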
Let the agent explore. Don’t limit what it can try. Most experiments will fail, but the successful ones compound.
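The "keep what improves, drop the rest" idea fits in a few lines. This is a toy sketch under my own assumptions, not the autoresearch code: `propose` and `evaluate` are hypothetical stand-ins for "the agent edits the training script" and "train for 5 minutes and read the validation loss", and the fake loss is simply lowest when a single number reaches 8.

```python
import random

def run_experiments(baseline_config, propose, evaluate, budget):
    """Greedy search: keep a proposed change only if the single metric improves."""
    best_config = dict(baseline_config)
    best_loss = evaluate(best_config)
    accepted = []                                # (step, loss) for every kept change
    for step in range(budget):
        candidate = propose(best_config)         # build on the current best
        loss = evaluate(candidate)
        if loss < best_loss:                     # commit improvements
            best_config, best_loss = candidate, loss
            accepted.append((step, loss))
        # otherwise the candidate is discarded
    return best_config, best_loss, accepted

# Hypothetical stand-ins for the real propose/train/measure steps
rng = random.Random(0)

def propose(config):
    candidate = dict(config)
    candidate["scaler"] = config["scaler"] + rng.uniform(-1.0, 1.0)
    return candidate

def evaluate(config):
    return (config["scaler"] - 8.0) ** 2         # pretend loss, minimal at scaler = 8

baseline = {"scaler": 1.0}
best, best_loss, accepted = run_experiments(baseline, propose, evaluate, budget=700)
print(f"kept {len(accepted)} of 700 proposals, final loss {best_loss:.4f}")
```

Because every new proposal starts from the current best configuration, each kept change becomes the baseline for the next one. That is the sense in which the successful experiments compound.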
Step 4: Validate Results
The agent finds candidates. You validate them. Karpathy reviewed the scaler fix and confirmed it made sense. AI finds, humans validate.
Agent finds candidate fix
-> Human reviews the change
-> Does the change make sense conceptually?
- YES: Run longer training to confirm, then apply to larger model
- NO: Discard and continue
Results from Autoresearch
The 48-hour run produced:
Experiments: 700
Successful changes: 20
Success rate: 2.9%
Metric improvement:
- "Time to GPT-2": 2.02 hours -> 1.80 hours
- Overall improvement: 11%
Key discoveries:
- QKNorm missing scaler (found bug Karpathy missed)
- Learning rate adjustments
- Gradient accumulation optimization
All 20 improvements were:
- Additive (can be combined)
- Transferable (work on larger models)
- Documented (human-understandable)
A 2.9% success rate sounds low. But 20 real improvements in 48 hours beats what most human researchers achieve.
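The headline percentages follow directly from the raw figures. A quick check, using only the numbers reported above (variable names are mine):

```python
experiments = 700
successful_changes = 20
hours_before = 2.02   # "Time to GPT-2" before the run
hours_after = 1.80    # "Time to GPT-2" after the run

success_rate = 100 * successful_changes / experiments
improvement = 100 * (hours_before - hours_after) / hours_before
experiments_per_success = experiments / successful_changes

print(f"success rate: {success_rate:.1f}%")
print(f"time-to-GPT-2 improvement: {improvement:.1f}%")
print(f"experiments per kept change: {experiments_per_success:.0f}")
```

Put differently, the agent needed about 35 attempts for every change worth keeping, which is why the short evaluation cycle matters so much.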
When AI Agents Struggle
I want to be honest about limitations. AI agents fail when:
Condition 1: Complex codebases
- Agent struggles with thousands of lines
- Changes have unpredictable side effects
- Context becomes overwhelming
Condition 2: Multiple metrics
- Agent can't weigh competing objectives
- "Better loss but slower" - which matters?
- Single metric is essential
Condition 3: Long evaluation cycles
- Each experiment needs hours/days
- Can't iterate quickly
- Exploration becomes impractical
Condition 4: Domain-specific knowledge
- Some bugs require deep expertise
- Agent might miss conceptual errors
- Human insight still necessary
AI agents complement human review rather than replace it. Use them for systematic exploration. Use humans for conceptual validation.
Practical Takeaways
For your next ML project:
- Write minimal code - 600-800 lines, focused purpose
- Define one metric - validation loss, accuracy, or speed, but only one
- Add AI review - let agents explore variations
- Validate findings - human review of agent discoveries
- Combine approaches - AI finds, humans verify
The QKNorm bug case proves AI agents can find what experts miss. Not because AI is smarter, but because AI explores without assumptions.
Summary
AI agents found a bug Andrej Karpathy missed in his own QKNorm implementation. The missing scaler multiplier made attention too diffuse. The agent discovered it through 700 experiments in 48 hours.
Key insights:
- Experts have patterned blind spots from experience
- AI agents lack assumptions and explore exhaustively
- Scale matters: hundreds of experiments find what single reviews miss
- ML bugs manifest as performance degradation, not crashes
- AI finds candidates, humans validate
For beginner developers: don’t over-rely on your own code review. Add AI-assisted exploration to catch subtle bugs. The QKNorm case shows even world-class experts benefit from AI review.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 GitHub: karpathy/autoresearch
- 👨💻 Reddit discussion on autoresearch findings
- 👨💻 nanoGPT repository
- 👨💻 Andrej Karpathy's YouTube channel
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!