
Why Do AI Coding Agents Say 'Done' When They're Not Actually Done?

Problem

I asked Claude Code to implement user authentication. It marked the task “done” in 2 minutes. I ran the code and nothing worked.

The function was defined but never called. Three imports sat unused at the top of the file. A TODO comment read “// implement password hashing later.” No tests existed. The agent genuinely believed it had finished the task.

This kept happening. I’d give a task, the agent would say “done,” and I’d spend the next hour debugging half-implemented code.

Agent Output
[Task 1: Add pagination] ✓ Done
[Task 2: Implement search] ✓ Done
[Task 3: Add user profiles] ✓ Done
All tasks complete!

Then I’d run the app:

Terminal
Error: pagination component not found
Error: search function returns undefined
Error: profile route missing

The agent reported 8/8 tasks done. Zero worked.

What happened?

I dug into why agents claim completion when nothing works. The root cause: vague exit criteria.

When I gave tasks like “add pagination to the user list,” the agent had no way to verify what “done” meant. It implemented a pagination component, added it to a file, and marked it complete. But it never:

  • Connected the component to the page
  • Tested that pages actually navigate
  • Checked if imports were used
  • Removed TODO comments

The agent wasn’t lying. It genuinely thought the task was done because nothing told it otherwise.

How Agents See Tasks
Task: "Add pagination"
Agent: "I created a Pagination component. Done!"
Reality: Component exists but is never imported or used

Task: "Implement search"
Agent: "I wrote a search function. Done!"
Reality: Function returns undefined for empty queries, no error handling

Task: "Add user profiles"
Agent: "I created a profile route. Done!"
Reality: Route exists but controller is missing

How to solve it?

I started adding explicit verification gates to every task. A verification gate is a concrete checkpoint the agent must pass before claiming completion.

Here’s the transformation:

Before (vague task):

Bad Task Definition
Task: Add pagination to the user list

After (with verification gates):

Good Task Definition
Task: Add pagination to the user list

Demo Statement:
- When I load the user list page, I see 10 users max per page
- When I click "Next", the next 10 users appear
- When I click "Previous", the previous 10 users appear
- The current page number is displayed

Verification Gates (all must pass):
- [ ] Unit tests for pagination logic pass
- [ ] No TODO comments in modified files
- [ ] All new imports are used
- [ ] Pagination component is wired into UserListPage
- [ ] Manual test: navigate through 3+ pages successfully

I tested this approach with the same authentication task that failed before.

Test 1: Without verification gates

Task: Implement user authentication

Agent output:

Created auth.py with login, logout, and register functions.
Task complete!

Result: Functions defined but never called. No tests. Passwords stored in plain text.

Test 2: With verification gates

Task: Implement user authentication

Demo Statement:
- When I submit valid credentials, I receive a session token
- When I submit invalid credentials, I get a 401 error
- Passwords are hashed before storage

Verification Gates:
- [ ] Login function returns session token
- [ ] Logout function invalidates session
- [ ] Passwords hashed with bcrypt
- [ ] Unit tests pass: test_auth.py
- [ ] No TODO comments
- [ ] Auth middleware wired into app

Agent output:

[PENDING] Implementing authentication...
[RUNNING] Creating login function with session token return
[RUNNING] Implementing logout with session invalidation
[RUNNING] Adding bcrypt password hashing
[RUNNING] Creating test_auth.py with login/logout tests
[VERIFY] Checking for TODO comments... None found
[VERIFY] Testing auth middleware connection... Connected
[VERIFY] Running tests... All pass
[DONE] Authentication implemented

The difference was dramatic. The agent now knew exactly what “done” meant.

Implementing verification gates

I built a simple task queue that enforces verification before completion:

task_queue.py
import ast
import subprocess
import sys
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Task:
    id: str
    description: str
    demo_statements: List[str]
    verification_gates: List[str]
    files_modified: List[str]
    test_files: List[str]
    status: str = "pending"


class TaskQueue:
    def __init__(self):
        self.tasks: List[Task] = []
        self.pending_verification: List[Task] = []

    def mark_done(self, task: Task):
        # A "done" claim only queues the task; verify_task decides completion
        task.status = "pending_verification"
        if task not in self.pending_verification:
            self.pending_verification.append(task)
        print(f"Task '{task.id}' pending verification...")

    def verify_task(self, task: Task) -> bool:
        """Run all verification gates before marking complete."""
        # Each automated check is keyed by its own name, so a failure is
        # reported under the right gate however task.verification_gates
        # happens to be ordered.
        results: Dict[str, bool] = {
            "No TODO comments": self._no_todos_remaining(task.files_modified),
            "No unused imports": self._no_unused_imports(task.files_modified),
            "Unit tests pass": self._tests_pass(task.test_files),
            "Code is wired": self._code_is_wired(task.files_modified),
            "Type check passes": self._type_check_passes(task.files_modified),
        }
        if all(results.values()):
            task.status = "verified_complete"
            if task in self.pending_verification:
                self.pending_verification.remove(task)
            print(f"Task '{task.id}' verified complete!")
            return True
        else:
            task.status = "needs_rework"
            failed_gates = [gate for gate, passed in results.items() if not passed]
            print(f"Task '{task.id}' failed gates: {failed_gates}")
            return False

    def _no_todos_remaining(self, files: List[str]) -> bool:
        """Check that no TODO/FIXME comments remain in modified files."""
        for file_path in files:
            with open(file_path, 'r') as f:
                content = f.read()
            if 'TODO' in content or 'FIXME' in content:
                return False
        return True

    def _no_unused_imports(self, files: List[str]) -> bool:
        """Verify every imported name is referenced somewhere in its file."""
        for file_path in files:
            if not file_path.endswith('.py'):
                continue  # the AST check only understands Python sources
            with open(file_path, 'r') as f:
                tree = ast.parse(f.read())
            imported = set()
            used = set()
            for node in ast.walk(tree):
                if isinstance(node, ast.Import):
                    for alias in node.names:
                        # "import a.b" binds "a"; "import a as x" binds "x"
                        imported.add(alias.asname or alias.name.split('.')[0])
                elif isinstance(node, ast.ImportFrom):
                    for alias in node.names:
                        if alias.name != '*':
                            imported.add(alias.asname or alias.name)
                elif isinstance(node, ast.Name):
                    used.add(node.id)
            if imported - used:
                return False
        return True

    def _tests_pass(self, test_files: List[str]) -> bool:
        """Run tests and verify they pass."""
        if not test_files:
            # A task that ships no tests has not passed this gate
            return False
        for test_file in test_files:
            result = subprocess.run(
                [sys.executable, '-m', 'pytest', test_file],
                capture_output=True
            )
            if result.returncode != 0:
                return False
        return True

    def _code_is_wired(self, files: List[str]) -> bool:
        """Verify functions are actually called, not just defined."""
        # Placeholder: this gate always passes until it is integrated
        # with your specific app structure (entry points, routes, etc.)
        return True

    def _type_check_passes(self, files: List[str]) -> bool:
        """Run type checker on modified files."""
        result = subprocess.run(
            ['mypy'] + files,
            capture_output=True
        )
        return result.returncode == 0
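
The `_code_is_wired` gate above is only a stub. One possible way to approximate it for Python code is sketched below. It treats a function as orphaned when its name is never referenced anywhere in the project. The helper names (`defined_functions`, `referenced_names`, `orphaned_functions`) are assumptions made for this illustration, and the check is name-based only, so it can miss dynamic calls or confuse two functions that share a name.

```python
import ast
from typing import Dict, Set


def defined_functions(source: str) -> Set[str]:
    """Names of all functions and methods defined in a source string."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def referenced_names(source: str) -> Set[str]:
    """Every plain name and attribute name that the source refers to."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            names.add(node.id)       # e.g. login(...)
        elif isinstance(node, ast.Attribute):
            names.add(node.attr)     # e.g. auth.login(...)
    return names


def orphaned_functions(modified: Dict[str, str], project: Dict[str, str]) -> Set[str]:
    """Functions defined in the modified files but never referenced.

    Both arguments map a file name to its source text (hypothetical input
    shape chosen so the sketch does not touch the file system).
    """
    defined: Set[str] = set()
    for source in modified.values():
        defined |= defined_functions(source)
    used: Set[str] = set()
    for source in project.values():
        used |= referenced_names(source)
    # Dunder methods such as __init__ are invoked implicitly, so skip them
    return {name for name in defined if name not in used and not name.startswith("__")}
```

A gate built on this would pass only when `orphaned_functions(...)` returns an empty set. For the failed authentication task, a `register` function that nothing calls would show up in the result.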

Using this queue:

example_usage.py
from task_queue import Task, TaskQueue

queue = TaskQueue()

# Create task with verification gates
task = Task(
    id="auth-001",
    description="Implement user authentication",
    demo_statements=[
        "When I submit valid credentials, I receive a session token",
        "When I submit invalid credentials, I get a 401 error",
        "Passwords are hashed before storage",
    ],
    # Gates the task must pass (same order as the checks in verify_task)
    verification_gates=[
        "No TODO comments",
        "No unused imports",
        "Unit tests pass",
        "Auth wired into app",
        "Type check passes",
    ],
    files_modified=["auth.py", "app.py"],
    test_files=["test_auth.py"],
)
queue.tasks.append(task)

# Agent does the work...
queue.mark_done(task)

# Run the verification gates before accepting the "done" claim
success = queue.verify_task(task)

Demo statements: The missing piece

While building this system, I discovered another critical element: demo statements.

A demo statement is a plain English description of what a human would observe when the feature works. It helps the agent understand the goal, not just the implementation.

Demo Statement Format
When I [action], I should see [result] within [timeframe]

Examples:

  • “When I click the search button, results appear in the sidebar within 2 seconds”
  • “When I submit the form with invalid email, I see a red error message below the field”
  • “When I navigate to /dashboard, I see my username in the top-right corner”

Demo statements serve two purposes:

  1. Agent understanding: The agent knows what behavior to implement
  2. Human verification: You can quickly check if the demo statement is true
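
The format above can also be checked mechanically before a task is handed to the agent. The following is a minimal sketch, assuming a statement is acceptable when it starts with "When I", has a comma-separated outcome, and optionally ends with "within [timeframe]". The names `DEMO_PATTERN` and `parse_demo_statement` are made up for this example.

```python
import re
from typing import Dict, Optional

# "When I <action>, <result>" with an optional trailing "within <timeframe>"
DEMO_PATTERN = re.compile(
    r"^When I (?P<action>.+?), (?P<result>.+?)(?: within (?P<timeframe>.+))?$"
)


def parse_demo_statement(statement: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a demo statement into action, result, and timeframe.

    Returns None when the statement does not follow the format, which is
    a signal that it is too vague to verify.
    """
    match = DEMO_PATTERN.match(statement.strip())
    if match is None:
        return None
    return match.groupdict()
```

A statement such as "Code works" returns `None`, while the search example is split into its action, its result, and a timeframe of "2 seconds".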

I added demo statements to my task template:

Task Template
## Task: [Name]
### Demo Statement
- When I [action], I see [result]
- When I [action], I see [result]
### Verification Gates
- [ ] Tests pass
- [ ] No TODOs
- [ ] No unused imports
- [ ] Code is wired
- [ ] Type check passes
### Files to Modify
- [file1.py]
- [file2.py]
### Test Files
- [test_file.py]
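
If tasks are written in this template, the sections can be read back into structured data, for example to reject a task that has no demo statement. The sketch below assumes the exact heading names used in the template; `parse_task_template`, `missing_sections`, and `REQUIRED_SECTIONS` are hypothetical names introduced here.

```python
import re
from typing import Dict, List

# Section headings taken from the task template
REQUIRED_SECTIONS = ["Demo Statement", "Verification Gates", "Files to Modify", "Test Files"]


def parse_task_template(text: str) -> Dict[str, object]:
    """Parse a filled-in task template into a name and per-section item lists."""
    name = ""
    sections: Dict[str, List[str]] = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("## Task:"):
            name = line[len("## Task:"):].strip()
        elif line.startswith("### "):
            current = line[len("### "):].strip()
            sections[current] = []
        elif line.startswith("- ") and current is not None:
            # Drop a leading "[ ]" or "[x]" checkbox from gate items
            item = re.sub(r"^\[[ xX]\]\s*", "", line[2:].strip())
            sections[current].append(item)
    return {"name": name, "sections": sections}


def missing_sections(parsed: Dict[str, object]) -> List[str]:
    """Required sections that are absent or empty."""
    sections = parsed["sections"]
    return [s for s in REQUIRED_SECTIONS if not sections.get(s)]
```

A task whose `missing_sections` result is non-empty could be sent back for clarification before the agent starts on it.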

The verification gate checklist

After testing this approach for two weeks, I settled on a standard set of verification gates:

Standard Verification Gates
1. [ ] No TODO/FIXME comments in modified files
2. [ ] All new imports are used (no dead imports)
3. [ ] Code is wired into the application (no orphaned functions)
4. [ ] Unit tests pass
5. [ ] Integration tests pass (if applicable)
6. [ ] Type checking passes (mypy/pyright)
7. [ ] Linting passes (ruff/eslint)
8. [ ] No console.log/print statements left in code

I also created a pre-commit hook that runs these checks:

pre-commit-hook.sh
#!/bin/bash

# Only inspect lines added in the staged diff (skip the "+++ b/file" headers),
# so that removing a TODO or a debug statement does not block the commit
added_lines=$(git diff --cached -U0 | grep -E '^\+' | grep -vE '^\+\+\+ ')

# Check for TODOs
if printf '%s\n' "$added_lines" | grep -E "TODO|FIXME"; then
    echo "ERROR: TODO/FIXME comments found. Remove or resolve before committing."
    exit 1
fi

# Check for debug statements (console.log / print)
if printf '%s\n' "$added_lines" | grep -E "console\.log|print\("; then
    echo "ERROR: Debug statements found. Remove before committing."
    exit 1
fi

# Run tests
if ! python -m pytest; then
    echo "ERROR: Tests failed. Fix before committing."
    exit 1
fi

# Run type check
if ! mypy .; then
    echo "ERROR: Type check failed. Fix before committing."
    exit 1
fi

echo "All verification gates passed!"

Common mistakes I made

Mistake 1: Skipping the demo statement

Without a demo statement, agents implement features that technically exist but don’t actually work:

# Bad: No demo statement
Task: Add search functionality
# Agent creates a search function that returns empty array
# Search box never connected to input field
# No debounce, no loading state

Mistake 2: Trusting “done” status

I used to take the agent’s word for completion. Now I verify:

# Before
Agent: "Task done!"
Me: "Great, moving on"
# After
Agent: "Task done!"
Me: "Run verification gates..."
Gate 1 (no TODOs): FAIL
Gate 2 (tests pass): FAIL
Me: "Fix these gates first"

Mistake 3: Allowing agents to chain without checkpoints

The worst failures happened when I let agents mark multiple tasks complete without verification:

Task 1: Done
Task 2: Done (depends on Task 1 being actually done)
Task 3: Done (depends on Task 2 being actually done)
Result: All three broken because Task 1 was never actually complete

Now each task enters a “pending verification” state before the agent can start the next one.
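
A minimal sketch of that checkpoint rule is shown below. The queue refuses to hand out new work while any earlier task is still unresolved. The class names `QueuedTask` and `CheckpointQueue` and the `in_progress` status are assumptions for this example; the other status names follow the task queue described earlier.

```python
from dataclasses import dataclass
from typing import List

# Statuses that block the agent from claiming another task
BLOCKING_STATUSES = ("in_progress", "pending_verification", "needs_rework")


@dataclass
class QueuedTask:
    id: str
    status: str = "pending"


class CheckpointQueue:
    def __init__(self, tasks: List[QueuedTask]):
        self.tasks = list(tasks)

    def claim_next(self) -> QueuedTask:
        """Hand out the next pending task, but only if nothing is unresolved."""
        blocked = [t.id for t in self.tasks if t.status in BLOCKING_STATUSES]
        if blocked:
            raise RuntimeError(f"Cannot claim new work; unresolved tasks: {blocked}")
        for task in self.tasks:
            if task.status == "pending":
                task.status = "in_progress"
                return task
        raise LookupError("No pending tasks left")

    def mark_done(self, task: QueuedTask) -> None:
        # The agent's claim is recorded, not trusted
        task.status = "pending_verification"

    def record_verification(self, task: QueuedTask, passed: bool) -> None:
        task.status = "verified_complete" if passed else "needs_rework"
```

With this in place, Task 2 cannot start until Task 1 has been recorded as `verified_complete`, so a broken first task no longer propagates down the chain.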

Mistake 4: Vague verification gates

“Code works” is not a verification gate. “When I call login('user', 'pass'), I receive a JWT token within 100ms” is a verification gate.

# Bad verification gate
- [ ] Code works
# Good verification gate
- [ ] When I POST to /login with valid credentials, response is 200 with token
- [ ] When I POST to /login with invalid credentials, response is 401
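
Gates phrased this way translate almost directly into executable checks. The sketch below illustrates the idea against a hypothetical in-memory stand-in (`FakeAuthApp`) rather than a real HTTP endpoint. The class, its `post_login` method, the user "alice", and the gate function names are all assumptions for this example, and PBKDF2 from the standard library stands in for bcrypt so the code has no third-party dependencies.

```python
import hashlib
import hmac
import os
import secrets
from typing import Dict, Tuple


def _hash(password: str, salt: bytes) -> bytes:
    # PBKDF2 from the standard library, standing in for bcrypt
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


class FakeAuthApp:
    """Hypothetical in-memory stand-in for the real /login endpoint."""

    def __init__(self) -> None:
        self._users: Dict[str, Tuple[bytes, bytes]] = {}

    def register(self, username: str, password: str) -> None:
        salt = os.urandom(16)
        self._users[username] = (salt, _hash(password, salt))

    def post_login(self, username: str, password: str) -> Tuple[int, Dict[str, str]]:
        record = self._users.get(username)
        if record is not None:
            salt, stored = record
            if hmac.compare_digest(stored, _hash(password, salt)):
                return 200, {"token": secrets.token_hex(16)}
        return 401, {"error": "invalid credentials"}


def gate_valid_login_returns_token(app: FakeAuthApp) -> bool:
    """Gate: valid credentials give a 200 response with a token."""
    status, body = app.post_login("alice", "correct-password")
    return status == 200 and bool(body.get("token"))


def gate_invalid_login_returns_401(app: FakeAuthApp) -> bool:
    """Gate: invalid credentials give a 401 response and no token."""
    status, body = app.post_login("alice", "wrong-password")
    return status == 401 and "token" not in body
```

Each gate returns a plain boolean, so it either passes or fails with no room for "looks good to me". In a real project the same two checks would run against the actual application through its test client.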

Summary

AI agents don’t lie about completion — they simply lack the context to know what “done” means. By adding demo statements and self-verification gates, you transform agents from enthusiastic interns who say “looks good to me” into reliable contributors who prove their work.

The key changes I made:

  1. Demo statements: Plain English descriptions of what “done” looks like
  2. Verification gates: Concrete checkpoints that must pass before completion
  3. Pending verification state: Tasks queue for verification before the agent can claim new work
  4. No trust policy: Every “done” claim gets verified independently

After implementing this system, my “done but not done” incidents dropped from roughly 60% to under 10%. The agent still occasionally marks things complete prematurely, but the verification gates catch it before I waste time debugging.
