Can AI Agents Find Bugs That Expert Engineers Miss?

Mar 30, 2026

Problem

I was skeptical about AI code review tools. How could an AI agent find bugs that experienced engineers miss? If Andrej Karpathy - former Tesla AI director, OpenAI co-founder, one of the most respected ML practitioners - reviewed his own code and found nothing wrong, surely an AI couldn’t do better.

Then I read about his autoresearch project. An AI agent ran 700 experiments over 48 hours and discovered a bug Karpathy himself had written and missed in his own QKNorm implementation. This changed my perspective entirely.

This post explains how AI agents find bugs that experts miss, using the QKNorm case study as a concrete example.

What Is the QKNorm Bug?

QKNorm stands for Query-Key Normalization, a technique used in attention mechanisms of transformer models. The idea is simple: normalize the query and key vectors before computing their dot product attention scores.

Karpathy implemented this correctly - or so he thought. The bug was subtle:

# Original implementation (with bug)
import torch.nn as nn
import torch.nn.functional as F


class QKNorm(nn.Module):
    """Query-Key normalization for attention"""

    def __init__(self, dim, norm_eps=1e-6):
        super().__init__()
        self.eps = norm_eps  # Avoids division by zero when normalizing
        self.scale = dim ** -0.5  # Standard attention scaling

    def forward(self, q, k):
        # Normalize query and key vectors to unit length
        q_norm = F.normalize(q, dim=-1, eps=self.eps)
        k_norm = F.normalize(k, dim=-1, eps=self.eps)

        # BUG: Missing scaler multiplier!
        # After normalization, every score is a cosine in [-1, 1],
        # so the standard scale factor alone leaves the logits too small
        attn = q_norm @ k_norm.transpose(-2, -1) * self.scale
        return attn

The code runs without errors. It produces reasonable-looking outputs. But the attention becomes too diffuse - the model can’t focus properly on relevant information.

Why Karpathy Missed It

Even expert engineers have blind spots. I identified several reasons why manual code review failed here:

Reason 1: The code "works"
- No runtime errors
- Outputs look reasonable
- Model still trains

Reason 2: Domain knowledge creates assumptions
- "I know how attention should work"
- "Standard scaling is correct"
- Experienced engineers skip "obvious" checks

Reason 3: Bug manifests subtly
- Not a crash or obvious wrong output
- Performance degradation, not failure
- Requires empirical testing to detect

Reason 4: Human attention is finite
- Reviewers see what they expect
- Pattern recognition can miss anomalies
- We focus on complexity, not simplicity

Karpathy had all the domain knowledge. He wrote the code. He reviewed it. He still missed the bug because his experience told him “this should work.”

How the AI Agent Found It

The autoresearch agent used a different approach: exhaustive empirical testing rather than reasoning.

Agent Strategy:
1. Read program.md (research goal)
2. Propose code modifications systematically
3. Train for 5 minutes per experiment
4. Measure validation loss
5. Commit improvements, discard failures
6. Repeat 700 times in 48 hours
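
The loop above can be sketched in a few lines of Python. This is a minimal illustration of the keep-or-discard idea, not the autoresearch code: `run_experiments`, `toy_propose`, and `toy_evaluate` are hypothetical names, and the "training run" is replaced by a toy loss function so the example finishes instantly.

```python
import random


def run_experiments(propose, evaluate, baseline, n_experiments):
    """Greedy keep-or-discard search over candidate configurations.

    propose(current) returns a modified copy of the current config.
    evaluate(config) returns a validation loss (lower is better).
    """
    best = baseline
    best_loss = evaluate(best)
    kept = []  # (step, loss) for every change that was committed
    for step in range(n_experiments):
        candidate = propose(best)
        loss = evaluate(candidate)
        if loss < best_loss:  # commit the improvement
            best, best_loss = candidate, loss
            kept.append((step, loss))
        # otherwise discard the candidate and keep the previous best
    return best, best_loss, kept


# Toy stand-ins (hypothetical): the "config" is a single number and the
# "validation loss" is a parabola with its minimum at 3.0.
rng = random.Random(0)


def toy_propose(config):
    return config + rng.uniform(-0.5, 0.5)


def toy_evaluate(config):
    return (config - 3.0) ** 2


best, best_loss, kept = run_experiments(toy_propose, toy_evaluate, 0.0, 700)
print(f"best config {best:.3f}, loss {best_loss:.5f}, kept {len(kept)} of 700")
```

In a real setup, `evaluate` would be the 5-minute training run that returns validation loss, and `propose` would be the agent editing the code.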

Key difference from humans:
- No assumptions about "correct" implementation
- Tests variations humans would never try
- Empirical validation, not reasoning

The agent tried adding various components to QKNorm. Most failed. One succeeded: adding a learnable scaler multiplier.

# Fixed implementation (agent's discovery)
import torch
import torch.nn as nn
import torch.nn.functional as F


class QKNorm(nn.Module):
    """Query-Key normalization with proper scaling"""

    def __init__(self, dim, norm_eps=1e-6):
        super().__init__()
        self.eps = norm_eps  # Avoids division by zero when normalizing
        self.scale = dim ** -0.5
        self.scaler = nn.Parameter(torch.ones(1))  # Learnable scaler

    def forward(self, q, k):
        q_norm = F.normalize(q, dim=-1, eps=self.eps)
        k_norm = F.normalize(k, dim=-1, eps=self.eps)

        # FIX: Scaler controls attention sharpness
        attn = q_norm @ k_norm.transpose(-2, -1) * self.scale * self.scaler
        return attn

The learnable scaler parameter lets the model adjust attention sharpness during training. Without it, every score is a cosine in [-1, 1] multiplied by the fixed factor dim ** -0.5, so the logits stay in a narrow range and attention is too uniform.

Why Scale Matters After Normalization

I needed to understand why this bug caused problems. Here’s what I learned:

Before normalization:
- Query and Key vectors have varying magnitudes
- Dot product = magnitude_q x magnitude_k x cosine(angle)
- Larger vectors produce larger scores and sharper attention

After normalization:
- All vectors have unit magnitude (length = 1)
- Dot product = cosine(angle) only
- Every score lies between -1 and 1

The problem:
- Standard attention relies on magnitude to produce large score differences
- Normalization removes this signal
- Scaling by dim ** -0.5 narrows the range further, so without additional scaling, attention is "flat"

Standard attention uses magnitude to weight importance. Normalization removes magnitude, so you need another way to control sharpness. The missing scaler was a fundamental mismatch between normalization and the standard attention formula.
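
To put a number on "flat", here is a small sketch using only the Python standard library. It assumes a head dimension of 64 and eight keys, and the multiplier of 10.0 is an arbitrary illustrative value, not the value the model learned. Because each score is a cosine in [-1, 1], the logits lie within ±dim ** -0.5, so the ratio between the largest and smallest attention weight can never exceed exp(2 * dim ** -0.5).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_weight_ratio(dim, multiplier=1.0):
    """Upper bound on (largest weight / smallest weight) when every
    score is a cosine in [-1, 1] scaled by dim ** -0.5 * multiplier."""
    return math.exp(2.0 * multiplier * dim ** -0.5)


dim = 64                      # assumed head dimension
cosines = [1.0] + [-1.0] * 7  # best case: one perfect match, seven opposites

for multiplier in (1.0, 10.0):  # 10.0 is an arbitrary illustrative value
    logits = [c * dim ** -0.5 * multiplier for c in cosines]
    weights = softmax(logits)
    print(f"multiplier={multiplier:>4}: top weight={weights[0]:.3f}, "
          f"ratio bound={max_weight_ratio(dim, multiplier):.2f}")
```

Even in this best case, without the multiplier the perfectly matching key receives only about 0.155 of the attention, barely above the uniform value of 0.125. With the multiplier, the same key can dominate.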

The Scale of Experimentation

The numbers explain why the AI succeeded where humans failed:

Human approach:
- Maybe 5-10 experiments per day
- Each experiment takes hours of thought
- Subject to fatigue and assumptions
- Cost: Low throughput

AI agent approach:
- 700 experiments in 48 hours
- 5 minutes per experiment
- No fatigue, no assumptions
- Cost: High throughput, systematic coverage

Result comparison:
- Human: Might test 2-3 variations of QKNorm
- AI: Tested hundreds of variations
- AI found the one that worked

Scale matters. The agent tested variations a human would never think worth trying. One of those “unlikely” variations was the fix.

What Makes AI Agents Good at Finding Bugs

I analyzed the Reddit discussion and identified key factors:

Factor 1: No assumptions
- Agents don't "know" what should work
- They test what might work
- More exploration, less expectation

Factor 2: Exhaustive search
- Humans skip "obviously wrong" options
- Agents test everything systematically
- What's obvious to humans might be wrong

Factor 3: Empirical validation
- Agents measure results, not reason about correctness
- Code that "looks wrong" might work
- Code that "looks right" might be buggy

Factor 4: Documentation
- Agents explain what they found
- Research remains legible to humans
- We can understand and validate results

A Reddit user named CryptographerUnable8 raised an important point: “The real question isn’t whether AI can do research. It’s whether the research AI does will be legible to us.”

The autoresearch agent documented its findings. Humans can understand why the scaler fix works. The agent augments human capability rather than replacing it.

Common Misconceptions About AI Code Review

I held several wrong beliefs before this case study. Let me address them:

Misconception 1: “AI Can’t Understand Complex ML Code”

Modern AI agents can reason about transformer architectures. They understand attention mechanisms, normalization, and training dynamics. The QKNorm case is evidence of this.

Misconception 2: “If I Wrote It, I Understand It”

Karpathy wrote the buggy code. He understood QKNorm conceptually. But implementation details differ from conceptual understanding. Even experts miss things in their own code.

Why authors miss bugs:
- We read what we intended to write
- We skip details we "know" are correct
- Our mental model fills gaps in code
- Fresh eyes see what's actually there

Misconception 3: “Code Review by Experts Is Enough”

Human review is valuable but limited. We have finite attention, patterned expectations, and domain assumptions. AI agents complement human review with systematic exploration.

Misconception 4: “Bugs Cause Crashes”

In ML code, bugs often manifest as degraded performance, not crashes. The QKNorm bug made attention less effective. The model still worked, just not optimally.

# How ML bugs differ from traditional bugs
import torch

# Traditional bug (crashes):
def calculate_average(numbers):
    total = 0
    for n in numbers:
        total += n
    return total / len(numbers)  # Crashes if numbers is empty!

# ML bug (performance degradation):
def attention(query, key, value):
    # Missing scaling - model still trains
    # But learns slower, converges worse
    scores = query @ key.transpose(-2, -1)
    # Should be: scores * (query.shape[-1] ** -0.5)
    weights = torch.softmax(scores, dim=-1)
    return weights @ value  # No crash, just worse model
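
Since bugs like this never raise an exception, one way to surface them is to measure a property of the output instead of waiting for a crash. The sketch below is my own illustration, not something the autoresearch agent is described as doing: `normalized_entropy` and `looks_too_diffuse` are hypothetical helpers, and the 0.99 threshold is an arbitrary cut-off.

```python
import math


def normalized_entropy(weights):
    """Entropy of one attention row divided by log(n).

    1.0 means perfectly uniform, 0.0 means all weight on one key.
    Expects a row of at least two weights that sum to 1.
    """
    n = len(weights)
    h = -sum(w * math.log(w) for w in weights if w > 0.0)
    return h / math.log(n)


def looks_too_diffuse(rows, threshold=0.99):
    """Flag attention whose mean normalized entropy exceeds the threshold.

    The 0.99 default is an arbitrary illustrative cut-off.
    """
    mean = sum(normalized_entropy(r) for r in rows) / len(rows)
    return mean > threshold


nearly_uniform = [[0.26, 0.25, 0.25, 0.24]]
peaked = [[0.85, 0.05, 0.05, 0.05]]

print(looks_too_diffuse(nearly_uniform))
print(looks_too_diffuse(peaked))
```

A check like this only says that attention is flat, not why. It is a prompt for a human to look closer.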

How to Use AI Agents for Bug Detection

Based on this case study, here’s my approach:

Step 1: Prepare Your Codebase

The agent needs a minimal, understandable codebase. Karpathy’s nanochat is 630 lines. Large codebases confuse agents.

Good for AI review:
- Focused, single-purpose code
- Clear input/output
- Quick evaluation (minutes, not hours)
- Well-documented goals

Bad for AI review:
- Large, complex systems
- Multiple interdependent modules
- Long evaluation times
- Unclear success metrics

Step 2: Define a Clear Metric

Autoresearch uses validation loss as its single metric. This makes decisions binary: improved or not improved.

# Good metric: binary decision
if new_loss < baseline_loss:
    commit_change()
else:
    discard_change()

# Bad metric: multiple objectives
if new_loss < baseline_loss and new_speed > baseline_speed:
    # Which matters more? Agent can't decide
    commit_change()

Step 3: Run Systematic Experiments

Let the agent explore. Don’t limit what it can try. Most experiments will fail, but the successful ones compound.

Step 4: Validate Results

The agent finds candidates. You validate them. Karpathy reviewed the scaler fix and confirmed it made sense. AI finds, humans validate.

Agent finds candidate fix
    |
    v
Human reviews the change
    |
    v
Does change make sense conceptually?
    |
    +-- YES --> Run longer training to confirm --> Apply to larger model
    |
    +-- NO  --> Discard and continue

Results from Autoresearch

The 48-hour run produced:

Experiments: 700
Successful changes: 20
Success rate: 2.9%

Metric improvement:
- "Time to GPT-2": 2.02 hours -> 1.80 hours
- Overall improvement: 11%

Key discoveries:
- QKNorm missing scaler (found bug Karpathy missed)
- Learning rate adjustments
- Gradient accumulation optimization

All 20 improvements were:
- Additive (can be combined)
- Transferable (work on larger models)
- Documented (human-understandable)

A 2.9% success rate sounds low. But 20 real improvements in 48 hours matches or exceeds the total number of experiments a human running 5-10 per day would even attempt in that time.
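
The headline figures can be rechecked from the raw numbers in this section. The snippet below uses only values quoted above; the variable names are mine.

```python
experiments = 700
successful = 20
hours = 48
time_before, time_after = 2.02, 1.80  # "Time to GPT-2" in hours

success_rate = successful / experiments
improvement = (time_before - time_after) / time_before
minutes_per_experiment = hours * 60 / experiments

print(f"success rate: {success_rate:.1%}")    # 2.9%
print(f"improvement: {improvement:.1%}")      # 10.9%
print(f"average minutes per experiment: {minutes_per_experiment:.1f}")  # 4.1
```

The average of about 4.1 wall-clock minutes per experiment is below the 5-minute training budget, which suggests that not every run took a full, strictly sequential 5 minutes. The reported figures do not say whether runs were cut short or overlapped.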

When AI Agents Struggle

I want to be honest about limitations. AI agents fail when:

Condition 1: Complex codebases
- Agent struggles with thousands of lines
- Changes have unpredictable side effects
- Context becomes overwhelming

Condition 2: Multiple metrics
- Agent can't weigh competing objectives
- "Better loss but slower" - which matters?
- Single metric is essential

Condition 3: Long evaluation cycles
- Each experiment needs hours/days
- Can't iterate quickly
- Exploration becomes impractical

Condition 4: Domain-specific knowledge
- Some bugs require deep expertise
- Agent might miss conceptual errors
- Human insight still necessary

AI agents complement human review rather than replace it. Use them for systematic exploration. Use humans for conceptual validation.

Practical Takeaways

For your next ML project:

1. Write minimal code: 600-800 lines, focused purpose
2. Define one metric: validation loss, accuracy, or speed
3. Add AI review: let agents explore variations
4. Validate findings: human review of agent discoveries
5. Combine approaches: AI finds, humans verify

The QKNorm bug case proves AI agents can find what experts miss. Not because AI is smarter, but because AI explores without assumptions.

Summary

AI agents found a bug Andrej Karpathy missed in his own QKNorm implementation. The missing scaler multiplier made attention too diffuse. The agent discovered it through 700 experiments in 48 hours.

Key insights:

- Experts have patterned blind spots from experience
- AI agents lack assumptions and explore exhaustively
- Scale matters: hundreds of experiments find what single reviews miss
- ML bugs manifest as performance degradation, not crashes
- AI finds candidates, humans validate

For beginner developers: don’t over-rely on your own code review. Add AI-assisted exploration to catch subtle bugs. The QKNorm case shows even world-class experts benefit from AI review.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 GitHub: karpathy/autoresearch
👨‍💻 Reddit discussion on autoresearch findings
👨‍💻 nanoGPT repository
👨‍💻 Andrej Karpathy's YouTube channel

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!