What Is the Difference Between False Positives and False Negatives in Python Type Checkers?
Problem
I spent a week debugging a production crash that mypy said was impossible. The type checker had approved my code - no errors, no warnings. But at runtime, an IndexError brought down our API.
This is the false negative problem, and I learned it’s far more dangerous than the annoying false positives I’d been dealing with.
What False Positives Look Like
False positives are the noisy errors that make you want to disable the type checker:
    from typing import Union

    def process(value: Union[int, str]) -> str:
        if isinstance(value, int):
            # Some type checkers complain here:
            # "int has no attribute .upper()"
            # But this is actually safe - we convert to str first
            return str(value).upper()
        return value.upper()

When I ran this through mypy with strict settings, it flagged the first branch incorrectly. The code was correct - I was converting the integer to a string before calling .upper(). This is a false positive: the checker reports an error where none exists.
Annoying? Yes. But I could add an explicit annotation and move on:

    from typing import Union

    def process(value: Union[int, str]) -> str:
        if isinstance(value, int):
            # The annotation states the intended type outright
            result: str = str(value).upper()  # Works fine
            return result
        return value.upper()

The noise was frustrating, but at least I knew there was a potential issue to investigate.
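A false positive that reproduces more reliably is narrowing hidden behind a helper function. The sketch below is an illustration only: `Config`, `has_name`, and `shout` are hypothetical names, not code from the incident described here. Checkers such as mypy and pyright do not carry the `is not None` check out of `has_name`, so they report an error on the guarded branch even though that branch only runs when `name` is a string.

```python
from typing import Optional


class Config:
    """Hypothetical settings object used only for this illustration."""

    def __init__(self, name: Optional[str] = None) -> None:
        self.name: Optional[str] = name


def has_name(cfg: Config) -> bool:
    # The None check lives here, where the caller's checker cannot see it.
    return cfg.name is not None


def shout(cfg: Config) -> str:
    if has_name(cfg):
        # Checkers typically report something like:
        #   Item "None" of "Optional[str]" has no attribute "upper"
        # At runtime this branch only runs when name is a str.
        return cfg.name.upper()
    return ""


print(shout(Config("abc")))
```

Inlining the check (`if cfg.name is not None:`) lets the checker narrow the type and makes the report go away without a `# type: ignore`.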
What False Negatives Look Like
False negatives are silent killers. The type checker approves code that will crash:
    from typing import List

    def get_first(items: List[int]) -> int:
        # Type checker says: "Looks good! Returns int as expected"
        # Reality: Will crash with IndexError on empty list
        return items[0]

    # This passes type checking but crashes at runtime
    result = get_first([])  # IndexError: list index out of range

I had code like this in production. Mypy saw no problems. My tests passed (because they used non-empty lists). But when an empty list came through in production, everything crashed.
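One way to make the empty-list case visible to the checker is to put it in the return type. This is a minimal sketch, not the fix used in the original incident, and `get_first_or_none` is a hypothetical name. Because the function returns `Optional[int]`, a strict checker rejects callers that treat the result as an `int` without handling `None` first.

```python
from typing import List, Optional


def get_first_or_none(items: List[int]) -> Optional[int]:
    # Return None instead of raising IndexError on an empty list.
    return items[0] if items else None


first = get_first_or_none([])
if first is None:
    print("no items")   # the empty case has to be handled explicitly
else:
    print(first + 1)    # the checker has narrowed first to int here
```

The tradeoff is that every caller now has to deal with `None`, which is exactly the point: the edge case moves from a runtime surprise to something the checker enforces.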
Another example that slipped past my type checker:
    from typing import Dict

    def unsafe_access(data: Dict[str, int]) -> int:
        # Type checker assumes the key exists
        # No error reported, but KeyError possible
        return data["count"]

    # This type-checks perfectly
    result = unsafe_access({"other": 1})  # KeyError: 'count'

No warnings. No errors. Just a runtime crash waiting to happen.
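For dictionaries with a known set of keys, `typing.TypedDict` gives the checker enough information to catch the missing key. The sketch below assumes a hypothetical `Counts` schema and a hypothetical `safe_access` function; it is one possible approach, not a drop-in replacement for every `Dict[str, int]`.

```python
from typing import TypedDict


class Counts(TypedDict):
    """Hypothetical schema: the 'count' key is required."""

    count: int


def safe_access(data: Counts) -> int:
    # The checker knows 'count' exists because the TypedDict requires it.
    return data["count"]


print(safe_access({"count": 3}))

# A checker rejects the call below because the literal lacks 'count'
# and carries an undeclared key. It stays commented out because it
# would raise KeyError at runtime.
# safe_access({"other": 1})
```

This only helps when the dictionary is built in checked code. Data parsed from JSON or another external source still needs runtime validation, because `TypedDict` performs no checks when the program runs.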
Why False Negatives Are More Dangerous
I compared the tradeoff data from the Python typing specification conformance tests:
Type Checker False Positive vs False Negative Tradeoffs
==========================================================

| Checker  | False Positives | False Negatives |
|----------|-----------------|-----------------|
| zuban    | 10              | 0               |
| pyright  | 15              | 4               |
| pyrefly  | 52              | 21              |
| mypy     | 231             | 76              |
| ty       | 159             | 211             |

The insight from experienced developers was clear:
“The zero false negatives from Zuban is really impressive. In my experience, false negatives are way more dangerous than false positives in a type checker since they silently let bugs through.”
Here’s why false negatives are worse:
1. Silent Failures
False positives scream at you. False negatives stay quiet:
    # Reported error: loud and visible (this one is a genuine type error)
    x: int = "string"  # Type checker: ERROR

    # False negative: Silent and deadly
    def risky_operation(data: dict[str, int]) -> int:
        return data["key"]  # Type checker: OK (but KeyError at runtime)

2. False Confidence
When my type checker passes, I trust the code. I skip manual testing. I deploy to production. Then the crash happens.
3. Debugging Difficulty
With false positives, I know there’s an issue to investigate. With false negatives, I only discover the bug when users report crashes.
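Since the checker stays silent about these cases, boundary inputs have to be exercised some other way. The sketch below reuses `get_first` from the earlier example and adds a hypothetical helper, `find_crashes`, that calls a function on a handful of edge cases and records which ones raise. It is a rough illustration of the idea, not a substitute for a real test suite.

```python
from typing import Callable, List, Sequence, Tuple


def get_first(items: List[int]) -> int:
    return items[0]  # same function as in the earlier example


def find_crashes(
    func: Callable[[List[int]], int],
    cases: Sequence[List[int]],
) -> List[Tuple[List[int], str]]:
    """Call func on each case and record the exception type of any crash."""
    crashes: List[Tuple[List[int], str]] = []
    for case in cases:
        try:
            func(case)
        except Exception as exc:  # broad on purpose: every failure is of interest
            crashes.append((case, type(exc).__name__))
    return crashes


# Boundary inputs: empty, single element, several elements.
print(find_crashes(get_first, [[], [1], [1, 2, 3]]))
# prints [([], 'IndexError')]
```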
The Tradeoff in Practice
Type checkers must balance between two extremes:
Type Checker Spectrum
==========================================================

STRICT                        LENIENT
   |                             |
   v                             v
More False Positives     More False Negatives
   |                             |
   v                             v
Annoying but safe        Quiet but dangerous
   |                             |
   v                             v
Catch everything         Miss real bugs
   |                             |
   v                             v
Add type annotations     Trust and pray

I used to prefer lenient checkers because they were quieter. Now I understand: the noise of false positives is the price of safety.
How I Configure My Checkers Now
I’ve changed my approach to minimize false negatives:
    {
      "typeCheckingMode": "strict",
      "reportUnnecessaryTypeIgnoreComment": true,
      "reportMissingImports": true,
      "reportMissingTypeStubs": false
    }

The strict mode produces more false positives, but it catches more real bugs.
For mypy, I enabled stricter settings:
    [mypy]
    strict = True
    warn_return_any = True
    warn_unused_ignores = True
    disallow_untyped_defs = True

Yes, I get more warnings now. But those warnings represent real potential issues, not just noise.
Common Mistakes I Made
1. Using # type: ignore Too Much
    # Before: Silencing all errors
    result = process(data)  # type: ignore
    items = transform(result)  # type: ignore
    final = combine(items)  # type: ignore

    # This creates hidden false negatives!
    # One of these might actually be a real error

Now I only use # type: ignore when I’ve verified the code is correct and the checker is wrong:

    # After: Only ignore specific, verified false positives
    result = process(data)  # type: ignore[arg-type]  # Union type not inferred correctly

2. Choosing Based on Popularity
I used mypy because it was popular. But popularity doesn’t mean accuracy:
GitHub Stars vs False Negatives
===================================

mypy:    18k stars,  76 false negatives
pyright: 12k stars,  4 false negatives
zuban:   0.5k stars, 0 false negatives

In this data, the newer, less popular checkers produce fewer false negatives.
3. Not Running Multiple Checkers
I now run both pyright and mypy in CI:
    # .github/workflows/typecheck.yml
    on: [push, pull_request]  # trigger assumed; adjust to your workflow
    jobs:
      typecheck:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Install dependencies
            run: pip install pyright mypy
          - name: Run pyright
            # pyright has no --strict CLI flag; strict mode comes from pyrightconfig.json
            run: pyright src/
          - name: Run mypy
            run: mypy --strict src/

Different checkers catch different issues. The overlap provides defense in depth.
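To get the same two-checker run locally before pushing, a small wrapper script is one option. This is a sketch under assumptions: `run_checkers` and `main` are hypothetical helpers, the `src/` path is taken from the workflow above, and both tools are assumed to be installed and on `PATH` (a missing tool raises `FileNotFoundError`).

```python
import subprocess
import sys
from typing import Dict, List


def run_checkers(commands: Dict[str, List[str]]) -> Dict[str, int]:
    """Run each command and return its exit code, keyed by checker name."""
    results: Dict[str, int] = {}
    for name, argv in commands.items():
        # check=False: a non-zero exit means "errors found", not a script failure.
        completed = subprocess.run(argv, check=False)
        results[name] = completed.returncode
    return results


def main() -> int:
    codes = run_checkers({
        # pyright picks up its mode from pyrightconfig.json in the project root
        "pyright": ["pyright", "src/"],
        "mypy": ["mypy", "--strict", "src/"],
    })
    # Fail if either checker reported problems.
    return 1 if any(code != 0 for code in codes.values()) else 0


# In a script, finish with: sys.exit(main())
```

Running both checkers to completion, rather than stopping at the first failure, shows where they disagree, which is the information this setup is meant to surface.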
What I Check For Now
When evaluating a type checker, I look at:
- False negative rate - Most important. Zero is ideal.
- False positive rate - Important for developer experience.
- Spec conformance - Does it follow the Python typing spec?
- Maintenance activity - Is it actively developed?
Based on the conformance data, pyright and zuban lead with only 4 and 0 false negatives respectively.
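The priority order above (false negatives first, false positives second) can be written down as a sort key. The sketch below copies the numbers from the conformance table earlier in this article; `Result` and `rank` are hypothetical names, and the ranking reflects only this one data set.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    checker: str
    false_positives: int
    false_negatives: int


# Numbers copied from the conformance table earlier in this article.
RESULTS = [
    Result("zuban", 10, 0),
    Result("pyright", 15, 4),
    Result("pyrefly", 52, 21),
    Result("mypy", 231, 76),
    Result("ty", 159, 211),
]


def rank(results: List[Result]) -> List[Result]:
    # Sort by false negatives first, then by false positives as a tie-breaker.
    return sorted(results, key=lambda r: (r.false_negatives, r.false_positives))


for r in rank(RESULTS):
    print(f"{r.checker:8} FN={r.false_negatives:3} FP={r.false_positives:3}")
```

With these numbers the order comes out as zuban, pyright, pyrefly, mypy, ty. Spec conformance and maintenance activity are not captured by the sort and still need a manual look.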
Summary
False negatives in Python type checkers are significantly more dangerous than false positives. They give a false sense of security while letting real bugs slip through to production.
After my production crash, I changed my approach:
- Prioritize checkers with low false negative rates (pyright, zuban)
- Accept more false positives as the cost of safety
- Run multiple type checkers in CI
- Avoid # type: ignore unless I’ve verified the code is correct
The noise of false positives is annoying. But silence from false negatives is deadly.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.