Why Do AI Coding Agents Say 'Done' When They're Not Actually Done?
Problem
I asked Claude Code to implement user authentication. It marked the task “done” in 2 minutes. I ran the code and nothing worked.
The function was defined but never called. Three imports sat unused at the top of the file. A TODO comment read “// implement password hashing later.” No tests existed. The agent genuinely believed it had finished the task.
This kept happening. I’d give a task, the agent would say “done,” and I’d spend the next hour debugging half-implemented code.

    [Task 1: Add pagination] ✓ Done
    [Task 2: Implement search] ✓ Done
    [Task 3: Add user profiles] ✓ Done

    All tasks complete!

Then I’d run the app:

    Error: pagination component not found
    Error: search function returns undefined
    Error: profile route missing

The agent reported 8/8 tasks done. Zero worked.
What happened?
I dug into why agents claim completion when nothing works. The root cause: vague exit criteria.
When I gave tasks like “add pagination to the user list,” the agent had no way to verify what “done” meant. It implemented a pagination component, added it to a file, and marked it complete. But it never:
- Connected the component to the page
- Tested that pages actually navigate
- Checked if imports were used
- Removed TODO comments
The agent wasn’t lying. It genuinely thought the task was done because nothing told it otherwise.
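The "defined but never connected" gap in the list above is one that can be checked mechanically. Here is a minimal sketch of such a check, assuming Python source code. The helper name `find_unreferenced_functions` is hypothetical, and it only looks inside a single source string, so a function that is used only from another module would also be reported.

```python
import ast
from typing import List


def find_unreferenced_functions(source: str) -> List[str]:
    """Return names of functions defined in `source` that are never referenced in it."""
    tree = ast.parse(source)

    # Every function (sync or async) defined anywhere in the source
    defined = [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]

    # Every plain name (foo) and attribute name (obj.foo) that appears in the code
    referenced = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            referenced.add(node.id)
        elif isinstance(node, ast.Attribute):
            referenced.add(node.attr)

    return [name for name in defined if name not in referenced]


# Hypothetical sample: hash_password is defined but nothing ever calls it
sample = '''
def hash_password(password):
    return password[::-1]

def login(user, password):
    return user == "admin"

print(login("admin", "secret"))
'''

print(find_unreferenced_functions(sample))  # prints ['hash_password']
```

This is a heuristic, not proof of wiring: a name match says the function is mentioned somewhere, not that the call path is reachable from the running app.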

    Task: "Add pagination"
    Agent: "I created a Pagination component. Done!"
    Reality: Component exists but is never imported or used

    Task: "Implement search"
    Agent: "I wrote a search function. Done!"
    Reality: Function returns undefined for empty queries, no error handling

    Task: "Add user profiles"
    Agent: "I created a profile route. Done!"
    Reality: Route exists but controller is missing

How to solve it?
I started adding explicit verification gates to every task. A verification gate is a concrete checkpoint the agent must pass before claiming completion.
Here’s the transformation:
Before (vague task):

    Task: Add pagination to the user list

After (with verification gates):

    Task: Add pagination to the user list

    Demo Statement:
    - When I load the user list page, I see 10 users max per page
    - When I click "Next", the next 10 users appear
    - When I click "Previous", the previous 10 users appear
    - The current page number is displayed

    Verification Gates (all must pass):
    - [ ] Unit tests for pagination logic pass
    - [ ] No TODO comments in modified files
    - [ ] All new imports are used
    - [ ] Pagination component is wired into UserListPage
    - [ ] Manual test: navigate through 3+ pages successfully

I tested this approach with the same authentication task that failed before.
Test 1: Without verification gates

    Task: Implement user authentication

Agent output:

    Created auth.py with login, logout, and register functions.
    Task complete!

Result: Functions defined but never called. No tests. Password stored in plain text.
Test 2: With verification gates

    Task: Implement user authentication

    Demo Statement:
    - When I submit valid credentials, I receive a session token
    - When I submit invalid credentials, I get a 401 error
    - Passwords are hashed before storage

    Verification Gates:
    - [ ] Login function returns session token
    - [ ] Logout function invalidates session
    - [ ] Passwords hashed with bcrypt
    - [ ] Unit tests pass: test_auth.py
    - [ ] No TODO comments
    - [ ] Auth middleware wired into app

Agent output:

    [PENDING] Implementing authentication...
    [RUNNING] Creating login function with session token return
    [RUNNING] Implementing logout with session invalidation
    [RUNNING] Adding bcrypt password hashing
    [RUNNING] Creating test_auth.py with login/logout tests
    [VERIFY] Checking for TODO comments... None found
    [VERIFY] Testing auth middleware connection... Connected
    [VERIFY] Running tests... All pass
    [DONE] Authentication implemented

The difference was dramatic. The agent now knew exactly what “done” meant.
Implementing verification gates
I built a simple task queue that enforces verification before completion:

    import ast
    import subprocess
    import sys
    from dataclasses import dataclass
    from typing import List


    @dataclass
    class Task:
        id: str
        description: str
        demo_statements: List[str]
        verification_gates: List[str]
        files_modified: List[str]
        test_files: List[str]
        status: str = "pending"


    class TaskQueue:
        def __init__(self):
            self.tasks: List[Task] = []
            self.pending_verification: List[Task] = []

        def mark_done(self, task: Task):
            # "Done" only queues the task; verify_task decides whether it is complete
            task.status = "pending_verification"
            self.pending_verification.append(task)
            print(f"Task '{task.id}' pending verification...")

        def verify_task(self, task: Task) -> bool:
            """Run all verification gates before marking complete."""
            # Name each check explicitly so the failure report does not depend
            # on the order of task.verification_gates
            results = {
                "No TODO comments": self._no_todos_remaining(task.files_modified),
                "No unused imports": self._no_unused_imports(task.files_modified),
                "Unit tests pass": self._tests_pass(task.test_files),
                "Code is wired": self._code_is_wired(task.files_modified),
                "Type check passes": self._type_check_passes(task.files_modified),
            }

            if all(results.values()):
                task.status = "verified_complete"
                if task in self.pending_verification:
                    self.pending_verification.remove(task)
                print(f"Task '{task.id}' verified complete!")
                return True

            task.status = "needs_rework"
            failed_gates = [gate for gate, passed in results.items() if not passed]
            print(f"Task '{task.id}' failed gates: {failed_gates}")
            return False

        def _no_todos_remaining(self, files: List[str]) -> bool:
            """Check that no TODO/FIXME comments remain in modified files."""
            for file_path in files:
                with open(file_path, 'r') as f:
                    content = f.read()
                if 'TODO' in content or 'FIXME' in content:
                    return False
            return True

        def _no_unused_imports(self, files: List[str]) -> bool:
            """Heuristic: every imported name must be referenced in the same file."""
            for file_path in files:
                with open(file_path, 'r') as f:
                    tree = ast.parse(f.read())

                imported = set()
                for node in ast.walk(tree):
                    if isinstance(node, (ast.Import, ast.ImportFrom)):
                        for alias in node.names:
                            # The bound name is the alias, or the first dotted segment
                            imported.add((alias.asname or alias.name).split('.')[0])

                used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
                if imported - used - {'*'}:
                    return False
            return True

        def _tests_pass(self, test_files: List[str]) -> bool:
            """Run tests and verify they pass."""
            for test_file in test_files:
                result = subprocess.run(
                    [sys.executable, '-m', 'pytest', test_file],
                    capture_output=True,
                )
                if result.returncode != 0:
                    return False
            return True

        def _code_is_wired(self, files: List[str]) -> bool:
            """Verify functions are actually called, not just defined."""
            # Placeholder: this would integrate with your specific app structure.
            # Until it does, this gate always passes.
            return True

        def _type_check_passes(self, files: List[str]) -> bool:
            """Run type checker on modified files (requires mypy on PATH)."""
            result = subprocess.run(['mypy'] + files, capture_output=True)
            return result.returncode == 0

Using this queue:

    queue = TaskQueue()

    # Create task with verification gates
    task = Task(
        id="auth-001",
        description="Implement user authentication",
        demo_statements=[
            "When I submit valid credentials, I receive a session token",
            "When I submit invalid credentials, I get a 401 error",
            "Passwords are hashed before storage",
        ],
        # Listed in the same order verify_task runs its checks
        verification_gates=[
            "No TODO comments",
            "No unused imports",
            "Unit tests pass",
            "Auth wired into app",
            "Type check passes",
        ],
        files_modified=["auth.py", "app.py"],
        test_files=["test_auth.py"],
    )

    queue.tasks.append(task)

    # Agent does the work...
    queue.mark_done(task)

    # Run the verification gates before accepting the "done" claim
    success = queue.verify_task(task)

Demo statements: The missing piece
While building this system, I discovered another critical element: demo statements.
A demo statement is a plain English description of what a human would observe when the feature works. It helps the agent understand the goal, not just the implementation.

    When I [action], I should see [result] within [timeframe]

Examples:
- “When I click the search button, results appear in the sidebar within 2 seconds”
- “When I submit the form with invalid email, I see a red error message below the field”
- “When I navigate to /dashboard, I see my username in the top-right corner”
Demo statements serve two purposes:
- Agent understanding: The agent knows what behavior to implement
- Human verification: You can quickly check if the demo statement is true
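Because demo statements follow a fixed shape, a task file can be checked for statements that drift from it. The following is a minimal sketch under my own assumptions: `DemoStatement` and `parse_demo_statement` are hypothetical names, the timeframe is treated as part of the result, and an action that itself contains a comma would be split too early.

```python
import re
from typing import NamedTuple, Optional


class DemoStatement(NamedTuple):
    action: str
    result: str


# Expected shape: "When I <action>, <observable result>"
_PATTERN = re.compile(r"^When I (?P<action>[^,]+),\s*(?P<result>.+)$")


def parse_demo_statement(text: str) -> Optional[DemoStatement]:
    """Return the action/result parts, or None if the text does not follow the template."""
    match = _PATTERN.match(text.strip())
    if match is None:
        return None
    return DemoStatement(match.group("action").strip(), match.group("result").strip())


good = parse_demo_statement(
    "When I click the search button, results appear in the sidebar within 2 seconds"
)
bad = parse_demo_statement("Add search functionality")

print(good)
print(bad)  # prints None
```

A statement that comes back as `None` is usually a task description in disguise ("Add search functionality") and is worth rewriting before handing it to the agent.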
I added demo statements to my task template:

    ## Task: [Name]

    ### Demo Statement
    - When I [action], I see [result]
    - When I [action], I see [result]

    ### Verification Gates
    - [ ] Tests pass
    - [ ] No TODOs
    - [ ] No unused imports
    - [ ] Code is wired
    - [ ] Type check passes

    ### Files to Modify
    - [file1.py]
    - [file2.py]

    ### Test Files
    - [test_file.py]

The verification gate checklist
After testing this approach for two weeks, I settled on a standard set of verification gates:

    1. [ ] No TODO/FIXME comments in modified files
    2. [ ] All new imports are used (no dead imports)
    3. [ ] Code is wired into the application (no orphaned functions)
    4. [ ] Unit tests pass
    5. [ ] Integration tests pass (if applicable)
    6. [ ] Type checking passes (mypy/pyright)
    7. [ ] Linting passes (ruff/eslint)
    8. [ ] No console.log/print statements left in code

I also created a pre-commit hook that runs these checks:

    #!/bin/bash

    # Only inspect lines added in the staged diff, skipping the "+++ b/file" headers,
    # so that removing an old TODO does not block the commit
    added_lines=$(git diff --cached -U0 | grep -E '^\+' | grep -vE '^\+\+\+ ')

    # Check for TODOs
    if printf '%s\n' "$added_lines" | grep -E "TODO|FIXME"; then
        echo "ERROR: TODO/FIXME comments found. Remove or resolve before committing."
        exit 1
    fi

    # Check for debug statements
    if printf '%s\n' "$added_lines" | grep -E "console\.log|print\("; then
        echo "ERROR: Debug statements found. Remove before committing."
        exit 1
    fi

    # Run tests
    if ! python -m pytest; then
        echo "ERROR: Tests failed. Fix before committing."
        exit 1
    fi

    # Run type check
    if ! mypy .; then
        echo "ERROR: Type check failed. Fix before committing."
        exit 1
    fi

    echo "All verification gates passed!"

Common mistakes I made
Mistake 1: Skipping the demo statement
Without a demo statement, agents implement features that technically exist but don’t actually work:

    # Bad: No demo statement
    Task: Add search functionality

    # Agent creates a search function that returns empty array
    # Search box never connected to input field
    # No debounce, no loading state

Mistake 2: Trusting “done” status
I used to take the agent’s word for completion. Now I verify:

    # Before
    Agent: "Task done!"
    Me: "Great, moving on"

    # After
    Agent: "Task done!"
    Me: "Run verification gates..."
    Gate 1 (no TODOs): FAIL
    Gate 2 (tests pass): FAIL
    Me: "Fix these gates first"

Mistake 3: Allowing agents to chain without checkpoints
The worst failures happened when I let agents mark multiple tasks complete without verification:

    Task 1: Done
    Task 2: Done (depends on Task 1 being actually done)
    Task 3: Done (depends on Task 2 being actually done)

    Result: All three broken because Task 1 was never actually complete

Now each task enters a “pending verification” state before the agent can start the next one.
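One way to make that checkpoint hard to skip is to have the queue itself refuse to hand out the next task. This is a minimal sketch of the idea, not a full implementation: `GatedQueue`, `claim_next`, and `record_verification` are hypothetical names, and the verification result is passed in as a plain boolean rather than computed.

```python
from typing import List, Optional


class GatedQueue:
    """Hands out one task at a time; the next task stays locked until the current one is verified."""

    def __init__(self, task_ids: List[str]):
        self._todo = list(task_ids)
        self._current: Optional[str] = None  # task claimed but not yet verified

    def claim_next(self) -> str:
        if self._current is not None:
            raise RuntimeError(f"'{self._current}' must pass verification first")
        if not self._todo:
            raise LookupError("no tasks left")
        self._current = self._todo.pop(0)
        return self._current

    def record_verification(self, task_id: str, passed: bool) -> None:
        if task_id != self._current:
            raise ValueError(f"'{task_id}' is not awaiting verification")
        if passed:
            self._current = None  # unlock the next task
        # On failure the task stays current, so the queue remains locked


queue = GatedQueue(["task-1", "task-2"])
first = queue.claim_next()

try:
    queue.claim_next()  # blocked: task-1 has not been verified yet
except RuntimeError as err:
    print(err)

queue.record_verification(first, passed=True)
second = queue.claim_next()
print(second)  # prints task-2
```

With this shape, a broken Task 1 stops the chain at Task 1 instead of surfacing three tasks later.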
Mistake 4: Vague verification gates
“Code works” is not a verification gate. “When I call login(‘user’, ‘pass’), I receive a JWT token within 100ms” is a verification gate.

    # Bad verification gate
    - [ ] Code works

    # Good verification gate
    - [ ] When I POST to /login with valid credentials, response is 200 with token
    - [ ] When I POST to /login with invalid credentials, response is 401

Summary
AI agents don’t lie about completion; they simply lack the context to know what “done” means. By adding demo statements and verification gates, you transform agents from enthusiastic interns who say “looks good to me” into reliable contributors who prove their work.
The key changes I made:
- Demo statements: Plain English descriptions of what “done” looks like
- Verification gates: Concrete checkpoints that must pass before completion
- Pending verification state: Tasks queue for verification before the agent can claim new work
- No trust policy: Every “done” claim gets verified independently
After implementing this system, my “done but not done” incidents dropped from roughly 60% to under 10%. The agent still occasionally marks things complete prematurely, but the verification gates catch it before I waste time debugging.