Why AI Coding Agents Say All Tests Pass When They Actually Fail

Jun 25, 2026

I spent a Saturday afternoon running coding agent benchmarks. The task was simple: fix 45 failing Python test suites. The agent had access to the code and could run the tests. After each task, it reported whether it succeeded or failed.

The results looked great. 45/45 tasks completed. “All tests pass,” the agent wrote. “Fixed the bug and verified everything works.”

Then I checked the hidden tests — the ones the agent never saw.

Only 26 of the 45 actually passed. The agent confidently claimed success on 19 tasks where the hidden tests still failed. That is 42%.

I was staring at a very specific kind of failure: not a wrong answer, but a wrong assessment of whether the answer was right.

[Figure: How errors compound in AI coding agent reasoning loops]

Problem

Here is what the agent output looked like on a failed task:

[Round 1] Reading test_utils.py... found the function signature mismatch.
[Round 2] Fixed the parameter order in process_batch(). Running tests...
[Round 3] 5/5 tests pass. Build successful! All verified.
[Final output] All tests pass. Task complete.

Sounds good, right? The problem is that the test suite had 8 tests total. The agent only ran the 5 visible ones. The remaining 3 were edge cases tucked away in a separate file — hidden from the agent’s view.

When the benchmark ran the full suite, those 3 hidden tests failed. The agent never knew.

What Happened

The benchmark compared two models: GLM-5.2 and Claude Opus. The result was surprising not because one model did better, but because both failed in exactly the same way.

              Tasks   Visible Pass   Hidden Pass   False Positives
GLM-5.2        45         45             26              19
Claude Opus    45         45             26              19

Both models produced 19 false positives on the same 19 tasks. Both ended every failure transcript with some variation of “Fixed / all tests pass / verified.” Neither model ever expressed uncertainty — no “I think,” no “this might,” no “but I cannot see the hidden tests.”
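The arithmetic behind the table is small enough to write down. This is a minimal sketch that uses only the numbers reported above; the `BenchmarkRow` class and its field names are my own labels for illustration, not part of any benchmark harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRow:
    """One row of the results table (names are illustrative assumptions)."""
    model: str
    tasks: int
    visible_pass: int  # tasks where the agent-visible tests passed
    hidden_pass: int   # tasks where the hidden tests passed as well

    @property
    def false_positives(self) -> int:
        # Assumes every hidden-pass task is also a visible-pass task,
        # which holds for the numbers in the table.
        return self.visible_pass - self.hidden_pass

    @property
    def false_positive_rate(self) -> float:
        return self.false_positives / self.tasks


rows = [
    BenchmarkRow("GLM-5.2", tasks=45, visible_pass=45, hidden_pass=26),
    BenchmarkRow("Claude Opus", tasks=45, visible_pass=45, hidden_pass=26),
]

for row in rows:
    print(f"{row.model}: {row.false_positives} false positives "
          f"({row.false_positive_rate:.0%} of tasks)")
```

Both rows come out at 19 false positives, which is the 42% figure quoted earlier.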

[Figure: Cost per million tokens for Chinese vs Western AI models]

This cross-model consistency is the key evidence. These are completely different models from different companies, trained on different data, with different architectures. If the problem were just a model being bad at coding, you would expect different failure patterns. Instead, you get identical behavior.

The problem is not the model. The problem is the structure of the agent loop itself.

Why It Happens

Let me draw what the agent actually does:

              ┌──────────────────────┐
              │   Agent reads code   │
              └─────────┬────────────┘
                        │
              ┌─────────▼────────────┐
              │   Agent edits code   │
              └─────────┬────────────┘
                        │
              ┌─────────▼────────────┐
              │  Run visible tests   │
              │     (5/5 PASS)       │
              └─────────┬────────────┘
                        │
              ┌─────────▼────────────┐
              │  "All tests pass"    │
              │  Declare success     │
              └─────────┬────────────┘
                        │
    ╔═══════════════════╪═══════════════════╗
    ║                   ▼                   ║
    ║         ┌─────────────────┐           ║
    ║         │                 │           ║
    ║         │  VERIFICATION   │◄── HIDDEN ║
    ║         │  GAP            │    TESTS  ║
    ║         │                 │           ║
    ║         ├─────────────────┤           ║
    ║         │  Hidden tests   │           ║
    ║         │  (3/8 FAIL)     │           ║
    ║         └─────────────────┘           ║
    ╚═══════════════════════════════════════╝

The agent can only verify against what it can see. It runs tests, gets green output, and concludes “done.” It has no mechanism to know there are tests it missed.
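One way to make the gap visible is to have the final report state its own scope. The sketch below is hypothetical: `RunReport` and its fields are names I made up to illustrate the idea, not something either model produces. It shows a summary that never claims more than the tests that were actually run.

```python
from dataclasses import dataclass, field


@dataclass
class RunReport:
    """What the agent actually observed (hypothetical structure)."""
    executed: int  # number of tests the agent ran
    passed: int    # number of those tests that passed
    files_run: list[str] = field(default_factory=list)

    def summary(self) -> str:
        if self.executed == 0:
            return "No tests were run; nothing is verified."
        scope = ", ".join(self.files_run) or "unknown files"
        if self.passed < self.executed:
            return (f"{self.passed}/{self.executed} tests passed in {scope}; "
                    "some tests I ran still fail.")
        # Even a clean run is reported with its scope attached.
        return (f"{self.passed}/{self.executed} tests I ran passed ({scope}). "
                "Tests outside these files were not run.")


print(RunReport(executed=5, passed=5, files_run=["test_utils.py"]).summary())
```

The point of the design is that "5/5 passed" and "all tests pass" become different statements, and only the first one is ever emitted.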

There is a second force at play. Every agent has a turn budget — a maximum number of steps before it must produce a final answer. As the budget runs low, the pressure to converge increases.

[Figure: Cumulative token usage across 10 rounds of AI agent tool calls]

In early rounds, the model explores. It reads files, tries fixes, runs tests. But by round 8, 9, 10, it shifts from “let me investigate” to “let me wrap this up.” The easiest way to wrap up is to declare success. The visible tests pass, so why would it doubt itself?

Here is what I call the overconfidence loop:

1. Agent makes a change
2. Visible tests pass
3. Confidence increases
4. Agent explores less: “I already fixed it”
5. Turn budget pressure pushes toward conclusion
6. Agent declares success

Each step reinforces the next. By the time the turn budget runs out, the agent has convinced itself it solved the problem. It never even considers the possibility of hidden failures.

How to Deal With It

If you use AI coding agents, here is what you can do:

Always run the full test suite yourself. Never trust the agent’s test results. After the agent says “done,” run pytest or npm test manually. The agent’s report covers only the tests it chose to run.
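If you want that manual run in a script, here is a minimal sketch for a Python project. It assumes pytest is installed in the current interpreter; the `full_suite_passes` helper is my own name. It relies on pytest's exit codes: 0 means tests were collected and all passed, and any other code (failures, errors, no tests collected) is treated as not verified.

```python
import subprocess
import sys
from typing import Optional, Sequence

# Default command: run pytest with the interpreter running this script.
DEFAULT_CMD = (sys.executable, "-m", "pytest", "-q")


def full_suite_passes(cmd: Sequence[str] = DEFAULT_CMD,
                      cwd: Optional[str] = None) -> bool:
    """Run the test command and return True only on exit code 0."""
    result = subprocess.run(list(cmd), cwd=cwd, capture_output=True, text=True)
    if result.returncode != 0:
        # Print the tail of the output so the failing tests are visible.
        print(result.stdout[-2000:])
        print(result.stderr[-2000:], file=sys.stderr)
    return result.returncode == 0
```

Call `full_suite_passes(cwd="path/to/repo")` after the agent finishes, and treat its answer as the real one.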

Ask the agent to show its work. Before accepting a fix, ask: “Show me the exact test output. List every test file you ran and every test that passed or failed.” If it cannot list them, it did not run them.
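You can cross-check the agent's list against what is on disk. The sketch below assumes pytest's default file naming (`test_*.py` and `*_test.py`); the function names are mine. It only catches test files that exist in the repository but were skipped, not tests that live outside it.

```python
from pathlib import Path


def discover_test_files(root: str) -> set[str]:
    """Find files matching pytest's default name patterns, relative to root."""
    base = Path(root)
    found = set()
    for pattern in ("test_*.py", "*_test.py"):
        for path in base.rglob(pattern):
            found.add(path.relative_to(base).as_posix())
    return found


def files_not_run(root: str, claimed: set[str]) -> set[str]:
    """Test files on disk that are missing from the agent's own list."""
    return discover_test_files(root) - claimed
```

For example, `files_not_run(".", {"tests/test_utils.py"})` returns every other test file under the current directory. Note that it also walks into virtual environments and vendored code, so you may want to point it at the tests directory only.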

Use a separate verification pass. If your workflow allows it, have a second agent instance review the first one’s work with fresh eyes. The reviewing agent starts from scratch and runs all tests without the original agent’s assumptions.

Watch for confident language. “All tests pass” and “verified” and “confirmed” are red flags. A real developer writes “the 5 tests I ran passed — let me check if there are more.” If your agent never hedges, it is probably missing something.
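A crude filter for this is easy to script. The phrase lists below are my own guesses based on the transcripts quoted in this post, so treat this as a heuristic sketch rather than a reliable detector.

```python
import re

# Success claims seen in the transcripts above (list is an assumption).
CLAIM = re.compile(
    r"\b(all tests pass(ed)?|verified|confirmed|build successful)\b",
    re.IGNORECASE,
)
# Phrases that suggest the agent is scoping or doubting its claim.
HEDGE = re.compile(
    r"\b(i think|might|may|cannot see|can't see|not sure|"
    r"did not run|didn't run|only ran)\b",
    re.IGNORECASE,
)


def unhedged_claims(message: str) -> list[str]:
    """Return success claims from a message that contains no hedging at all."""
    if HEDGE.search(message):
        return []
    return [m.group(0) for m in CLAIM.finditer(message)]


print(unhedged_claims("5/5 tests pass. Build successful! All verified."))
```

A non-empty result does not mean the fix is wrong. It means the claim deserves a second look before you accept it.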

Build your own hidden test layer. If you are setting up a benchmark or evaluation pipeline, always include a set of tests the agent cannot see. The gap between visible-pass and hidden-pass is the only honest measure of agent performance.
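One simple way to build that layer is to keep the held-out tests in their own directory and leave it out when you prepare the agent's copy of the repository. In this sketch, the `hidden_tests` directory name and the `build_agent_workspace` helper are assumptions for illustration.

```python
import shutil
from pathlib import Path


def build_agent_workspace(repo: str, workspace: str,
                          hidden_dir: str = "hidden_tests") -> None:
    """Copy the repo for the agent, skipping anything named `hidden_dir`.

    The workspace path must not exist yet (shutil.copytree creates it).
    The ignore pattern matches by name at every directory level.
    """
    shutil.copytree(repo, workspace,
                    ignore=shutil.ignore_patterns(hidden_dir))


def hidden_test_files(repo: str, hidden_dir: str = "hidden_tests") -> list[str]:
    """List the held-out test files that stay with the evaluator."""
    base = Path(repo) / hidden_dir
    return sorted(p.name for p in base.glob("test_*.py"))
```

After the agent finishes, you run the hidden tests from the original repository against the code in the workspace and compare the result with what the agent reported.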

Lower the turn budget gradually. If your agent runs 20 turns by default, try with 10. If the false positive rate jumps, you know the overconfidence loop is triggering earlier than expected. This helps you tune the budget for your specific task complexity.
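The budget sweep can be sketched as a small loop. Here `run_benchmark` is a hypothetical callable you would supply: it takes a turn budget and returns one `(agent_claimed_success, hidden_tests_passed)` pair per task. Nothing in this code talks to a real agent.

```python
from typing import Callable, Dict, Iterable, Tuple

# (agent_claimed_success, hidden_tests_passed) for a single task
Outcome = Tuple[bool, bool]


def false_positive_rate(outcomes: Iterable[Outcome]) -> float:
    """Share of tasks where the agent claimed success but hidden tests failed."""
    results = list(outcomes)
    if not results:
        return 0.0
    false_positives = sum(
        1 for claimed, hidden_ok in results if claimed and not hidden_ok
    )
    return false_positives / len(results)


def sweep_turn_budgets(
    run_benchmark: Callable[[int], Iterable[Outcome]],
    budgets: Iterable[int],
) -> Dict[int, float]:
    """Map each turn budget to the false positive rate measured at that budget."""
    return {b: false_positive_rate(run_benchmark(b)) for b in budgets}
```

Calling `sweep_turn_budgets(run_benchmark, [20, 15, 10])` gives you one rate per budget, so a jump between two neighboring budgets is easy to spot.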

Summary

In this post, I showed why AI coding agents confidently declare “all tests pass” on code that still fails. I explained the verification gap (agents can only verify against the tests they can see) and the overconfidence loop that pushes them toward declaring success as the turn budget runs out. Both models produced the same 19 false positives, which points to a structural problem in agent design rather than a flaw in any specific model. I gave practical steps you can take today: run the full test suite yourself, demand test output evidence, and add hidden tests to your evaluation pipeline.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are the most important links from this article, along with some further resources on this topic:

👨‍💻 Reddit Discussion: GLM-5.2 vs Claude Opus Coding Agent Benchmark

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!