Why Do AI Coding Agents Say 'Done' When They're Not Actually Done?

Mar 13, 2026

Problem

I asked Claude Code to implement user authentication. It marked the task “done” in 2 minutes. I ran the code and nothing worked.

The function was defined but never called. Three imports sat unused at the top of the file. A TODO comment read “// implement password hashing later.” No tests existed. The agent genuinely believed it had finished the task.

This kept happening. I’d give a task, the agent would say “done,” and I’d spend the next hour debugging half-implemented code.

[Task 1: Add pagination] ✓ Done
[Task 2: Implement search] ✓ Done
[Task 3: Add user profiles] ✓ Done

All tasks complete!

Then I’d run the app:

Error: pagination component not found
Error: search function returns undefined
Error: profile route missing

The agent reported 8/8 tasks done. Zero worked.

What happened?

I dug into why agents claim completion when nothing works. The root cause: vague exit criteria.

When I gave tasks like “add pagination to the user list,” the agent had no way to verify what “done” meant. It implemented a pagination component, added it to a file, and marked it complete. But it never:

- Connected the component to the page
- Tested that pages actually navigate
- Checked if imports were used
- Removed TODO comments

The agent wasn’t lying. It genuinely thought the task was done because nothing told it otherwise.

Task: "Add pagination"
Agent: "I created a Pagination component. Done!"
Reality: Component exists but is never imported or used

Task: "Implement search"
Agent: "I wrote a search function. Done!"
Reality: Function returns undefined for empty queries, no error handling

Task: "Add user profiles"
Agent: "I created a profile route. Done!"
Reality: Route exists but controller is missing

How to solve it?

I started adding explicit verification gates to every task. A verification gate is a concrete checkpoint the agent must pass before claiming completion.

Here’s the transformation:

Before (vague task):

Task: Add pagination to the user list

After (with verification gates):

Task: Add pagination to the user list

Demo Statement:
- When I load the user list page, I see 10 users max per page
- When I click "Next", the next 10 users appear
- When I click "Previous", the previous 10 users appear
- The current page number is displayed

Verification Gates (all must pass):
- [ ] Unit tests for pagination logic pass
- [ ] No TODO comments in modified files
- [ ] All new imports are used
- [ ] Pagination component is wired into UserListPage
- [ ] Manual test: navigate through 3+ pages successfully

I tested this approach with the same authentication task that failed before.

Test 1: Without verification gates

Task: Implement user authentication

Agent output:

Created auth.py with login, logout, and register functions.
Task complete!

Result: Functions defined but never called. No tests. Password stored in plain text.

Test 2: With verification gates

Task: Implement user authentication

Demo Statement:
- When I submit valid credentials, I receive a session token
- When I submit invalid credentials, I get a 401 error
- Passwords are hashed before storage

Verification Gates:
- [ ] Login function returns session token
- [ ] Logout function invalidates session
- [ ] Passwords hashed with bcrypt
- [ ] Unit tests pass: test_auth.py
- [ ] No TODO comments
- [ ] Auth middleware wired into app

Agent output:

[PENDING] Implementing authentication...
[RUNNING] Creating login function with session token return
[RUNNING] Implementing logout with session invalidation
[RUNNING] Adding bcrypt password hashing
[RUNNING] Creating test_auth.py with login/logout tests
[VERIFY] Checking for TODO comments... None found
[VERIFY] Testing auth middleware connection... Connected
[VERIFY] Running tests... All pass
[DONE] Authentication implemented

The difference was dramatic. The agent now knew exactly what “done” meant.

Implementing verification gates

I built a simple task queue that enforces verification before completion:

from dataclasses import dataclass
from typing import List, Callable

@dataclass
class Task:
    id: str
    description: str
    demo_statements: List[str]
    verification_gates: List[str]
    files_modified: List[str]
    test_files: List[str]
    status: str = "pending"

class TaskQueue:
    def __init__(self):
        self.tasks: List[Task] = []
        self.pending_verification: List[Task] = []

    def mark_done(self, task: Task):
        # Park the task; it only counts as complete once verify_task() passes
        task.status = "pending_verification"
        self.pending_verification.append(task)
        print(f"Task '{task.id}' pending verification...")

    def verify_task(self, task: Task) -> bool:
        """Run all verification gates before marking complete."""
        # Name each check explicitly so a failure is reported under the right
        # label, regardless of the order of task.verification_gates.
        results = {
            "No TODO comments": self._no_todos_remaining(task.files_modified),
            "No unused imports": self._no_unused_imports(task.files_modified),
            "Unit tests pass": self._tests_pass(task.test_files),
            "Code is wired": self._code_is_wired(task.files_modified),
            "Type check passes": self._type_check_passes(task.files_modified),
        }

        if all(results.values()):
            task.status = "verified_complete"
            if task in self.pending_verification:
                self.pending_verification.remove(task)
            print(f"Task '{task.id}' verified complete!")
            return True
        else:
            task.status = "needs_rework"
            failed_gates = [g for g, r in results.items() if not r]
            print(f"Task '{task.id}' failed gates: {failed_gates}")
            return False

    def _no_todos_remaining(self, files: List[str]) -> bool:
        """Check that no TODO comments remain in modified files."""
        for file_path in files:
            with open(file_path, 'r') as f:
                content = f.read()
                if 'TODO' in content or 'FIXME' in content:
                    return False
        return True

    def _no_unused_imports(self, files: List[str]) -> bool:
        """Verify every imported name is referenced somewhere in the file."""
        import ast

        for file_path in files:
            with open(file_path, 'r') as f:
                tree = ast.parse(f.read())

            imported = set()
            used = set()
            for node in ast.walk(tree):
                if isinstance(node, ast.Import):
                    for alias in node.names:
                        # "import a.b" binds "a"; "import a.b as c" binds "c"
                        imported.add(alias.asname or alias.name.split('.')[0])
                elif isinstance(node, ast.ImportFrom):
                    for alias in node.names:
                        if alias.name != '*':
                            imported.add(alias.asname or alias.name)
                elif isinstance(node, ast.Name):
                    used.add(node.id)

            # Any imported name that never appears as a Name node is unused.
            # Note: names re-exported only via __all__ will be flagged too.
            if imported - used:
                return False
        return True

    def _tests_pass(self, test_files: List[str]) -> bool:
        """Run tests and verify they pass."""
        import subprocess

        for test_file in test_files:
            result = subprocess.run(
                ['python', '-m', 'pytest', test_file],
                capture_output=True
            )
            if result.returncode != 0:
                return False
        return True

    def _code_is_wired(self, files: List[str]) -> bool:
        """Verify functions are actually called, not just defined."""
        # This would integrate with your specific app structure
        # For now, just check that the main entry point exists
        return True

    def _type_check_passes(self, files: List[str]) -> bool:
        """Run type checker on modified files."""
        import subprocess

        result = subprocess.run(
            ['mypy'] + files,
            capture_output=True
        )
        return result.returncode == 0
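The `_code_is_wired` gate above is a placeholder that always returns True. One way to approximate it is sketched below, under some assumptions: the function name `find_unreferenced_functions` is hypothetical, it only looks at top-level functions, and it treats any mention of a name (call or attribute access) as "wired". Functions registered purely through decorators, such as route handlers, would be reported as unreferenced, so treat the result as a hint rather than proof.

```python
import ast
from typing import Dict, Set


def find_unreferenced_functions(sources: Dict[str, str]) -> Set[str]:
    """Return top-level function names that no source ever refers to.

    `sources` maps a file name to its source text. Passing text instead of
    paths keeps the sketch easy to test.
    """
    defined: Set[str] = set()
    referenced: Set[str] = set()

    for source in sources.values():
        tree = ast.parse(source)

        # Only top-level definitions; methods and nested functions are skipped.
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                defined.add(node.name)

        # A bare name (login) or an attribute (auth.login) counts as a reference.
        # Import statements alone do not, so "imported but never called" is caught.
        for node in ast.walk(tree):
            if isinstance(node, ast.Name):
                referenced.add(node.id)
            elif isinstance(node, ast.Attribute):
                referenced.add(node.attr)

    return defined - referenced
```

A gate built on this could read every file in `files_modified`, plus the application entry point, and pass only when the returned set is empty.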

Using this queue:

queue = TaskQueue()

# Create task with verification gates
task = Task(
    id="auth-001",
    description="Implement user authentication",
    demo_statements=[
        "When I submit valid credentials, I receive a session token",
        "When I submit invalid credentials, I get a 401 error",
        "Passwords are hashed before storage",
    ],
    verification_gates=[
        "Unit tests pass",
        "No TODO comments",
        "No unused imports",
        "Auth wired into app",
        "Type check passes",
    ],
    files_modified=["auth.py", "app.py"],
    test_files=["test_auth.py"],
)

queue.tasks.append(task)

# Agent does the work...
queue.mark_done(task)

# Verification is an explicit step: run the gates before accepting "done"
success = queue.verify_task(task)
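When verification fails, the task lands in `needs_rework` and the failed gates have to go back to the agent. A minimal sketch of that loop follows. The names `run_until_verified`, `do_work`, and `verify` are hypothetical: `do_work` stands in for whatever invokes the agent, and `verify` is assumed to return the list of failed gate names.

```python
from typing import Callable, List


def run_until_verified(
    do_work: Callable[[List[str]], None],
    verify: Callable[[], List[str]],
    max_attempts: int = 3,
) -> bool:
    """Alternate work and verification until every gate passes or attempts run out."""
    failed: List[str] = []
    for _ in range(max_attempts):
        # On the first attempt `failed` is empty; afterwards it carries the
        # gates that failed last time so the agent knows what to fix.
        do_work(failed)
        failed = verify()
        if not failed:
            return True
    return False
```

The attempt cap is a design choice: without it, an agent that cannot satisfy a gate would loop forever instead of handing the task back to a human.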

Demo statements: The missing piece

While building this system, I discovered another critical element: demo statements.

A demo statement is a plain English description of what a human would observe when the feature works. It helps the agent understand the goal, not just the implementation.

When I [action], I should see [result] within [timeframe]

Examples:

“When I click the search button, results appear in the sidebar within 2 seconds”
“When I submit the form with invalid email, I see a red error message below the field”
“When I navigate to /dashboard, I see my username in the top-right corner”

Demo statements serve two purposes:

- Agent understanding: The agent knows what behavior to implement
- Human verification: You can quickly check if the demo statement is true
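If you keep tasks in code rather than in a document, the "When I [action], I should see [result] within [timeframe]" pattern can be captured in a small structure. This is a sketch only; the `DemoStatement` class and its field names are my own illustration, not part of any tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DemoStatement:
    """One observable behavior, phrased from the user's point of view."""

    action: str
    result: str
    timeframe: Optional[str] = None  # optional, e.g. "2 seconds"

    def __post_init__(self) -> None:
        # An empty action or result is exactly the vagueness we want to avoid.
        if not self.action.strip() or not self.result.strip():
            raise ValueError("action and result must both be non-empty")

    def render(self) -> str:
        sentence = f"When I {self.action}, I should see {self.result}"
        if self.timeframe:
            sentence += f" within {self.timeframe}"
        return sentence


statement = DemoStatement(
    action="click the search button",
    result="results in the sidebar",
    timeframe="2 seconds",
)
print(statement.render())
```

Rendering the statement back to a sentence keeps the text you paste into the task description identical to the one you check by hand.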

I added demo statements to my task template:

## Task: [Name]

### Demo Statement
- When I [action], I see [result]
- When I [action], I see [result]

### Verification Gates
- [ ] Tests pass
- [ ] No TODOs
- [ ] No unused imports
- [ ] Code is wired
- [ ] Type check passes

### Files to Modify
- [file1.py]
- [file2.py]

### Test Files
- [test_file.py]
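A template like this is regular enough to read back into code, for example to populate the `Task` fields shown earlier. The parser below is a rough sketch under the assumption that the file follows the template exactly (one `## Task:` line, `###` section headings, `- ` list items); `parse_task_template` is a hypothetical name.

```python
from typing import Dict, List, Optional


def parse_task_template(text: str) -> Dict[str, List[str]]:
    """Parse the task template into {section name (lowercased): list of items}."""
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("## Task:"):
            sections["task"] = [line[len("## Task:"):].strip()]
        elif line.startswith("### "):
            current = line[4:].strip().lower()
            sections[current] = []
        elif line.startswith("- ") and current is not None:
            item = line[2:].strip()
            # Drop a leading checkbox such as "[ ] " or "[x] "
            for box in ("[ ] ", "[x] "):
                if item.startswith(box):
                    item = item[len(box):]
                    break
            sections[current].append(item)

    return sections
```

The keys come straight from the headings, so a template with "### Files to Modify" yields a `"files to modify"` entry that can feed `files_modified`.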

The verification gate checklist

After testing this approach for two weeks, I settled on a standard set of verification gates:

1. [ ] No TODO/FIXME comments in modified files
2. [ ] All new imports are used (no dead imports)
3. [ ] Code is wired into the application (no orphaned functions)
4. [ ] Unit tests pass
5. [ ] Integration tests pass (if applicable)
6. [ ] Type checking passes (mypy/pyright)
7. [ ] Linting passes (ruff/eslint)
8. [ ] No console.log/print statements left in code
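Several items on this checklist boil down to "run a command and check the exit code". A minimal sketch of a runner for those gates is below. The gate-to-command mapping is an assumption about a typical Python project with pytest, mypy, and ruff installed; adjust it to your own tools. The names `GATE_COMMANDS`, `run_command`, and `run_gates` are hypothetical.

```python
import subprocess
import sys
from typing import Callable, Dict, List, Sequence

# Assumed commands for a Python project; swap in eslint, pyright, etc. as needed.
GATE_COMMANDS: Dict[str, List[str]] = {
    "Unit tests pass": [sys.executable, "-m", "pytest"],
    "Type checking passes": ["mypy", "."],
    "Linting passes": ["ruff", "check", "."],
}


def run_command(cmd: Sequence[str]) -> bool:
    """Return True when the command exits with status 0."""
    try:
        return subprocess.run(list(cmd), capture_output=True).returncode == 0
    except FileNotFoundError:
        # A missing tool counts as a failed gate, not a silent pass.
        return False


def run_gates(
    commands: Dict[str, List[str]],
    runner: Callable[[Sequence[str]], bool] = run_command,
) -> Dict[str, bool]:
    """Run every gate and report pass/fail per gate name."""
    return {name: runner(cmd) for name, cmd in commands.items()}
```

Calling `run_gates(GATE_COMMANDS)` returns a dictionary of gate names to booleans, which can be shown to the agent as-is. The `runner` parameter exists so the function can be exercised without the real tools installed.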

I also created a pre-commit hook that runs these checks:

#!/bin/bash

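# Collect only the lines this commit adds, so deleting a TODO does not trip the check
added_lines=$(git diff --cached -U0 | grep -E '^\+' | grep -vE '^\+\+\+ ')

# Check for TODOs
if printf '%s\n' "$added_lines" | grep -E "TODO|FIXME"; then
    echo "ERROR: TODO/FIXME comments found. Remove or resolve before committing."
    exit 1
fi

# Check for console.log and print statements
if printf '%s\n' "$added_lines" | grep -E "console\.log|print\("; then
    echo "ERROR: Debug statements found. Remove before committing."
    exit 1
fi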

# Run tests
if ! python -m pytest; then
    echo "ERROR: Tests failed. Fix before committing."
    exit 1
fi

# Run type check
if ! mypy .; then
    echo "ERROR: Type check failed. Fix before committing."
    exit 1
fi

echo "All verification gates passed!"

Common mistakes I made

Mistake 1: Skipping the demo statement

Without a demo statement, agents implement features that technically exist but don’t actually work:

# Bad: No demo statement
Task: Add search functionality

# Agent creates a search function that returns empty array
# Search box never connected to input field
# No debounce, no loading state

Mistake 2: Trusting “done” status

I used to take the agent’s word for completion. Now I verify:

# Before
Agent: "Task done!"
Me: "Great, moving on"

# After
Agent: "Task done!"
Me: "Run verification gates..."
Gate 1 (no TODOs): FAIL
Gate 2 (tests pass): FAIL
Me: "Fix these gates first"

Mistake 3: Allowing agents to chain without checkpoints

The worst failures happened when I let agents mark multiple tasks complete without verification:

Task 1: Done
Task 2: Done (depends on Task 1 being actually done)
Task 3: Done (depends on Task 2 being actually done)

Result: All three broken because Task 1 was never actually complete

Now each task enters a “pending verification” state before the agent can start the next one.
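The blocking behavior can be enforced in the queue itself rather than left to convention. The sketch below is a simplified stand-in for my queue, with hypothetical names (`GatedQueue`, `claim_next`, `record_verification`); it tracks task ids only and refuses to hand out new work while a task is awaiting verification.

```python
from typing import List, Optional


class GatedQueue:
    """Hands out one task at a time and blocks until it has been verified."""

    def __init__(self, task_ids: List[str]) -> None:
        self.todo: List[str] = list(task_ids)
        self.awaiting_verification: Optional[str] = None

    def claim_next(self) -> str:
        if self.awaiting_verification is not None:
            raise RuntimeError(
                f"Task '{self.awaiting_verification}' must be verified first"
            )
        if not self.todo:
            raise LookupError("No tasks left")
        return self.todo.pop(0)

    def mark_done(self, task_id: str) -> None:
        # "Done" is only a claim at this point; it blocks further claims.
        self.awaiting_verification = task_id

    def record_verification(self, passed: bool) -> None:
        task_id = self.awaiting_verification
        if task_id is None:
            raise RuntimeError("No task is awaiting verification")
        self.awaiting_verification = None
        if not passed:
            # Failed work goes back to the front so dependents never start early.
            self.todo.insert(0, task_id)
```

Putting failed tasks back at the front of the queue is what prevents the chain failure above: Task 2 cannot start until Task 1 has actually passed.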

Mistake 4: Vague verification gates

“Code works” is not a verification gate. “When I call login(‘user’, ‘pass’), I receive a JWT token within 100ms” is a verification gate.

# Bad verification gate
- [ ] Code works

# Good verification gate
- [ ] When I POST to /login with valid credentials, response is 200 with token
- [ ] When I POST to /login with invalid credentials, response is 401
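A gate phrased this precisely translates almost word for word into a test. The sketch below uses a stub in place of a real HTTP endpoint so that it runs on its own; `post_login`, the user table, and the token value are all hypothetical, and a real handler would compare against a password hash rather than plain text.

```python
from typing import Dict, Tuple

# Stub data for illustration only; real code would store password hashes.
_USERS: Dict[str, str] = {"alice": "correct-horse"}


def post_login(payload: Dict[str, str]) -> Tuple[int, Dict[str, str]]:
    """Stand-in for POST /login: returns (status code, JSON body)."""
    username = payload.get("username")
    password = payload.get("password")
    if username in _USERS and _USERS[username] == password:
        return 200, {"token": "stub-token"}
    return 401, {"error": "invalid credentials"}


def test_valid_credentials_return_200_with_token() -> None:
    status, body = post_login({"username": "alice", "password": "correct-horse"})
    assert status == 200
    assert "token" in body


def test_invalid_credentials_return_401() -> None:
    status, body = post_login({"username": "alice", "password": "wrong"})
    assert status == 401
    assert "token" not in body
```

Each checkbox now has a test name the agent can point to, which is what separates a gate from a wish.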

Summary

AI agents don’t lie about completion; they simply lack the context to know what “done” means. By adding demo statements and verification gates, you transform agents from enthusiastic interns who say “looks good to me” into reliable contributors who prove their work.

The key changes I made:

- Demo statements: Plain English descriptions of what “done” looks like
- Verification gates: Concrete checkpoints that must pass before completion
- Pending verification state: Tasks queue for verification before the agent can claim new work
- No trust policy: Every “done” claim gets verified independently

After implementing this system, my “done but not done” incidents dropped from roughly 60% to under 10%. The agent still occasionally marks things complete prematurely, but the verification gates catch it before I waste time debugging.

Final Words + More Resources

My intention with this article was to share my knowledge and experience so that others can benefit from it. If you want to get in touch, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Claude Code Documentation
👨‍💻 Building Reliable AI Agents

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!