
How Can AI Agents Verify Their Own Task Completion? A Practical Guide to Self-Verification

Problem

My AI agent kept saying “task complete” but when I checked the code, nothing worked. The search function existed but was never called. The pagination component sat orphaned in a file. Three TODO comments marked places where actual logic should have been.

I ran the application. It crashed immediately.

Agent Output
[Task 1: Add search filtering] Complete
[Task 2: Connect to API] Complete
[Task 3: Add error handling] Complete
Summary: All tasks finished successfully!
Reality
Error: searchFilter is not defined
Error: API client never initialized
Error: 3 TODO comments found in production code

The agent genuinely believed it finished. It wrote code. It made progress. But it never verified that the code actually worked.

What I discovered

I started investigating how production AI systems handle this problem. After reverse-engineering several agent orchestration frameworks, I found one pattern that kept recurring:

Demo statements as quality levers.

Every task gets a demo statement — a plain English description of the observable outcome. This was the single biggest quality lever in the systems I studied.

Demo Statement Examples
Task: Implement search filtering
Demo: "User types query in search box, results filter in real-time without page reload"
Task: Add user authentication
Demo: "When I submit valid credentials, I receive a session token within 200ms"

Without demo statements, agents build plumbing that never connects to anything visible. They optimize for “I wrote code” rather than “the system works.”

Building a self-verification protocol

I decided to implement a system where agents must verify their own work before closing tasks. Here’s what I built.

Step 1: Task structure with demo statements

Every task now requires a demo statement upfront:

task.py
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class TaskStatus(Enum):
    IN_PROGRESS = "in_progress"
    PENDING_VERIFICATION = "pending_verification"
    VERIFIED = "verified"
    CLOSED = "closed"


@dataclass
class DemoStatement:
    task_id: str
    observable_outcome: str
    verification_criteria: List[str]


@dataclass
class Task:
    id: str
    description: str
    demo_statement: DemoStatement
    status: TaskStatus = TaskStatus.IN_PROGRESS
    # default_factory gives every task its own empty list
    files_modified: List[str] = field(default_factory=list)
    test_results: str = ""

The demo statement forces clarity before implementation begins. Instead of “add search,” the agent must think about what “search working” actually looks like.
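As a rough illustration of "demo statement upfront", the sketch below repeats the dataclasses from task.py so it runs on its own and adds a `make_task` helper. The helper is hypothetical (it is not part of the original code) and simply refuses to create a task whose demo statement is blank or has no criteria. The two criteria strings are made up for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class TaskStatus(Enum):
    IN_PROGRESS = "in_progress"
    PENDING_VERIFICATION = "pending_verification"
    VERIFIED = "verified"
    CLOSED = "closed"


@dataclass
class DemoStatement:
    task_id: str
    observable_outcome: str
    verification_criteria: List[str]


@dataclass
class Task:
    id: str
    description: str
    demo_statement: DemoStatement
    status: TaskStatus = TaskStatus.IN_PROGRESS
    files_modified: List[str] = field(default_factory=list)
    test_results: str = ""


def make_task(task_id: str, description: str,
              outcome: str, criteria: List[str]) -> Task:
    """Hypothetical helper: refuse tasks that lack a usable demo statement."""
    if not outcome.strip():
        raise ValueError("demo statement needs an observable outcome")
    if not criteria:
        raise ValueError("demo statement needs at least one criterion")
    demo = DemoStatement(task_id, outcome.strip(), list(criteria))
    return Task(id=task_id, description=description, demo_statement=demo)


search_task = make_task(
    "search-001",
    "Implement search filtering",
    "User types query in search box, results filter in real-time without page reload",
    ["Typing 'abc' hides non-matching rows", "No full page reload occurs"],
)
print(search_task.status.value, "|", search_task.demo_statement.observable_outcome)
```

Rejecting the task at creation time is one possible design. A softer variant could accept the task and flag it for review instead.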

Step 2: Structured completion artifacts

Before an agent can claim a task is complete, it must produce a verification artifact:

verification_artifact.py
from dataclasses import dataclass
from typing import List


@dataclass
class VerificationArtifact:
    demo_statement: str
    no_todos: bool
    code_wired: bool
    tests_pass: bool
    demo_possible: bool
    files_modified: List[str]
    test_results: str
    integration_points: List[str]

    def is_valid(self) -> bool:
        """All verification checks must pass."""
        return all([
            self.no_todos,
            self.code_wired,
            self.tests_pass,
            self.demo_possible
        ])

This artifact becomes the proof of completion. Not the agent’s word — actual verification results.
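To make the "proof, not claims" idea concrete, here is a small sketch of an artifact that fails validation. It repeats the dataclass so it runs standalone. The `failed_checks` method and the `CHECK_NAMES` tuple are additions for illustration and do not appear in the original class.

```python
from dataclasses import dataclass
from typing import List

# The four boolean checks that is_valid() combines
CHECK_NAMES = ("no_todos", "code_wired", "tests_pass", "demo_possible")


@dataclass
class VerificationArtifact:
    demo_statement: str
    no_todos: bool
    code_wired: bool
    tests_pass: bool
    demo_possible: bool
    files_modified: List[str]
    test_results: str
    integration_points: List[str]

    def is_valid(self) -> bool:
        """All verification checks must pass."""
        return all(getattr(self, name) for name in CHECK_NAMES)

    def failed_checks(self) -> List[str]:
        """Hypothetical addition: names of the checks that did not pass."""
        return [name for name in CHECK_NAMES if not getattr(self, name)]


artifact = VerificationArtifact(
    demo_statement="User types query, results filter live",
    no_todos=True,
    code_wired=False,   # e.g. the search function exists but is never called
    tests_pass=True,
    demo_possible=False,
    files_modified=["search.py"],
    test_results="All tests passing",
    integration_points=[],
)
print(artifact.is_valid(), artifact.failed_checks())
# False ['code_wired', 'demo_possible']
```

Note that passing tests alone do not make this artifact valid: code that is never wired in still blocks completion.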

Step 3: The verification pipeline

I built a TaskVerifier class that agents must call before closing tasks:

task_verifier.py
import re
import subprocess
import sys
from pathlib import Path
from typing import List, Optional

from verification_artifact import VerificationArtifact


class TaskVerifier:
    def __init__(self, task_id: str, demo_statement: str):
        self.task_id = task_id
        self.demo_statement = demo_statement
        self.files_modified: List[str] = []
        self._pytest_run: Optional[subprocess.CompletedProcess] = None

    def verify_completion(self) -> VerificationArtifact:
        """Agent must call this before claiming task complete."""
        return VerificationArtifact(
            demo_statement=self.demo_statement,
            no_todos=self._check_no_todos(),
            code_wired=self._check_code_wired(),
            tests_pass=self._run_tests(),
            demo_possible=self._verify_demo_possible(),
            files_modified=self.files_modified,
            test_results=self._get_test_results(),
            integration_points=self._find_integration_points()
        )

    def _check_no_todos(self) -> bool:
        """Scan modified files for TODO comments."""
        for file_path in self.files_modified:
            content = Path(file_path).read_text()
            if re.search(r'\bTODO\b|\bFIXME\b', content):
                print(f"Found TODO/FIXME in {file_path}")
                return False
        return True

    def _check_code_wired(self) -> bool:
        """Verify code is called from application entry points."""
        # This is application-specific. The heuristic here only checks that
        # each new function is called from some other file.
        for file_path in self.files_modified:
            content = Path(file_path).read_text()
            # Look for function definitions
            functions = re.findall(r'def (\w+)\(', content)
            for func in functions:
                if not self._is_function_used(func, file_path):
                    print(f"Function {func} defined but never called")
                    return False
        return True

    def _pytest(self) -> subprocess.CompletedProcess:
        """Run the test suite once so the pass flag and the report agree."""
        if self._pytest_run is None:
            self._pytest_run = subprocess.run(
                [sys.executable, '-m', 'pytest', '-v'],
                capture_output=True,
                text=True
            )
        return self._pytest_run

    def _run_tests(self) -> bool:
        """Execute test suite for modified code."""
        return self._pytest().returncode == 0

    def _verify_demo_possible(self) -> bool:
        """Can the demo statement be demonstrated?"""
        # This could be automated for some cases
        # For now, returns True if all other checks pass
        return (self._check_no_todos()
                and self._check_code_wired()
                and self._run_tests())

    def _get_test_results(self) -> str:
        """Get human-readable test results."""
        result = self._pytest()
        if result.returncode == 0:
            return "All tests passing"
        return f"Tests failed:\n{result.stdout}"

    def _find_integration_points(self) -> List[str]:
        """Find where new code connects to existing system."""
        points = []
        for file_path in self.files_modified:
            content = Path(file_path).read_text()
            # Simplified: records the import statements of each modified file.
            # A real implementation would also find who imports these files.
            if 'import' in content:
                imports = re.findall(r'from .+ import|import \w+', content)
                points.extend(imports)
        return points

    def _is_function_used(self, func_name: str, defined_in: str) -> bool:
        """Check if function is called anywhere else in the codebase."""
        # Match name(...) or obj.name(...), but not the "def name(" line
        call = re.compile(rf'(?<!def )\b{re.escape(func_name)}\(')
        origin = Path(defined_in).resolve()
        # Search all other Python files for calls to this function
        for py_file in Path('.').rglob('*.py'):
            if py_file.resolve() == origin:
                continue
            if call.search(py_file.read_text()):
                return True
        return False
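The regex-based wiring check above is deliberately simple. As one possible refinement, the sketch below uses Python's standard `ast` module to collect defined and called names, so strings and comments that happen to contain `name(` are ignored. The function names (`defined_functions`, `called_names`, `unwired_functions`) and the in-memory `EXAMPLE_SOURCES` mapping are assumptions made for this example, not part of the verifier above. It is still a heuristic: it matches by bare name and cannot see dynamic dispatch or functions registered through decorators.

```python
import ast
from typing import Dict, Set


def defined_functions(source: str) -> Set[str]:
    """Names of all functions and methods defined in the source."""
    return {
        node.name
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def called_names(source: str) -> Set[str]:
    """Names used as a call target, e.g. f() or obj.f()."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                names.add(node.func.id)
            elif isinstance(node.func, ast.Attribute):
                names.add(node.func.attr)
    return names


def unwired_functions(sources: Dict[str, str], modified: str) -> Set[str]:
    """Functions defined in `modified` that no file in `sources` calls.

    Dunder methods such as __init__ are skipped because they are
    normally invoked implicitly.
    """
    defined = {
        name for name in defined_functions(sources[modified])
        if not (name.startswith("__") and name.endswith("__"))
    }
    called: Set[str] = set()
    for text in sources.values():
        called |= called_names(text)
    return defined - called


# Tiny in-memory "codebase": file name -> source text
EXAMPLE_SOURCES = {
    "search.py": (
        "def search_filter(items, q):\n"
        "    return [i for i in items if q in i]\n"
        "\n"
        "def paginate(items):\n"
        "    return items[:10]\n"
    ),
    "app.py": (
        "from search import search_filter\n"
        "\n"
        "def main():\n"
        "    print(search_filter(['a'], 'a'))\n"
    ),
}

print(unwired_functions(EXAMPLE_SOURCES, "search.py"))  # {'paginate'}
```

Unlike `_is_function_used`, this version also counts calls made inside the defining file. Entry points such as `main` would be reported as unwired too, so a real check would likely need an allowlist for them.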

Step 4: Pending verification state

The critical piece: tasks don’t close immediately. They enter a “pending verification” state:

Task Lifecycle
1. In Progress -> Agent working on implementation
2. Pending Verification -> Agent claims completion, structured verification required
3. Verified -> Independent verification confirms success
4. Closed -> Task truly complete

I modified the agent’s task claiming logic to enforce this:

agent_task_queue.py
from typing import Dict, List

from task import Task, TaskStatus
from verification_artifact import VerificationArtifact


# Exceptions are defined here so the example is complete
class AgentBusyError(Exception):
    """Raised when an agent with unverified work tries to claim a new task."""


class VerificationFailedError(Exception):
    """Raised when a verification artifact has a failing check."""


class AgentTaskQueue:
    def __init__(self):
        self.queue: List[Task] = []
        self.agent_tasks: Dict[str, List[Task]] = {}  # agent_id -> [tasks]

    def claim_task(self, agent_id: str) -> Task:
        """Block if agent has tasks pending verification."""
        if not self._can_claim_new_task(agent_id):
            raise AgentBusyError(
                f"Agent {agent_id} has tasks pending verification. "
                f"Complete verification before claiming new work."
            )
        if not self.queue:
            raise LookupError("No tasks left in the queue")
        task = self.queue.pop(0)
        self.agent_tasks.setdefault(agent_id, []).append(task)
        return task

    def _can_claim_new_task(self, agent_id: str) -> bool:
        """Agents with pending verification cannot claim new work."""
        agent_tasks = self.agent_tasks.get(agent_id, [])
        pending = [t for t in agent_tasks
                   if t.status == TaskStatus.PENDING_VERIFICATION]
        return len(pending) == 0

    def submit_for_verification(self, task: Task, artifact: VerificationArtifact):
        """Agent submits task for verification."""
        if not artifact.is_valid():
            raise VerificationFailedError(
                f"Verification failed for task {task.id}:\n"
                f" no_todos: {artifact.no_todos}\n"
                f" code_wired: {artifact.code_wired}\n"
                f" tests_pass: {artifact.tests_pass}\n"
                f" demo_possible: {artifact.demo_possible}"
            )
        task.status = TaskStatus.PENDING_VERIFICATION
        print(f"Task {task.id} moved to pending verification")

    def verify_and_close(self, task_id: str) -> bool:
        """Final verification before closing."""
        # Independent verification could happen here
        # For now, trust the artifact
        task = self._get_task(task_id)
        task.status = TaskStatus.VERIFIED
        # Give agent a grace period before closing
        # This allows for manual spot-checks
        return True

    def _get_task(self, task_id: str) -> Task:
        """Look up a claimed task by id."""
        for tasks in self.agent_tasks.values():
            for task in tasks:
                if task.id == task_id:
                    return task
        raise KeyError(f"Unknown task: {task_id}")
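The four-state lifecycle can also be written down as an explicit transition table, so that a task cannot jump straight from "in progress" to "closed". The sketch below is one way to do that. `ALLOWED_TRANSITIONS`, `InvalidTransitionError` and `advance` are hypothetical names, and the edge from pending verification back to in progress (for work that fails independent verification) is an assumption that the lifecycle list does not spell out.

```python
from enum import Enum


class TaskStatus(Enum):
    IN_PROGRESS = "in_progress"
    PENDING_VERIFICATION = "pending_verification"
    VERIFIED = "verified"
    CLOSED = "closed"


# Which states each state may move to
ALLOWED_TRANSITIONS = {
    TaskStatus.IN_PROGRESS: {TaskStatus.PENDING_VERIFICATION},
    TaskStatus.PENDING_VERIFICATION: {
        TaskStatus.VERIFIED,
        TaskStatus.IN_PROGRESS,  # assumption: rejected work goes back
    },
    TaskStatus.VERIFIED: {TaskStatus.CLOSED},
    TaskStatus.CLOSED: set(),
}


class InvalidTransitionError(Exception):
    """Raised when a status change skips a required step."""


def advance(current: TaskStatus, target: TaskStatus) -> TaskStatus:
    """Return `target` if the move is allowed, otherwise raise."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise InvalidTransitionError(
            f"{current.value} -> {target.value} is not allowed"
        )
    return target


status = advance(TaskStatus.IN_PROGRESS, TaskStatus.PENDING_VERIFICATION)
print(status.value)  # pending_verification

try:
    advance(TaskStatus.IN_PROGRESS, TaskStatus.CLOSED)
except InvalidTransitionError as exc:
    print(exc)  # in_progress -> closed is not allowed
```

Routing every status change through one function like `advance` keeps the rule in a single place instead of spreading it across the queue methods.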

Step 5: Integration with agent workflow

Here’s how an agent now completes a task:

agent_complete_task.py
from task_verifier import TaskVerifier

# Assumed to exist elsewhere in the agent runtime: get_task(),
# track_modified_files(), a shared AgentTaskQueue named `queue`,
# and the VerificationFailedError exception.


def agent_complete_task(task_id: str, demo_statement: str):
    """Agent workflow for task completion."""
    # 1. Get the task
    task = get_task(task_id)
    # 2. Track files modified during work
    files_modified = track_modified_files()
    # 3. Create verifier
    verifier = TaskVerifier(
        task_id=task_id,
        demo_statement=demo_statement
    )
    verifier.files_modified = files_modified
    # 4. Generate verification artifact
    artifact = verifier.verify_completion()
    # 5. Submit for verification (the queue sets PENDING_VERIFICATION)
    if artifact.is_valid():
        queue.submit_for_verification(task, artifact)
        return artifact
    else:
        raise VerificationFailedError(
            f"Self-verification failed:\n"
            f" Demo: {demo_statement}\n"
            f" Results: {artifact}"
        )


# Example usage
try:
    artifact = agent_complete_task(
        task_id="search-001",
        demo_statement="User types query, results filter live"
    )
    print(f"Task submitted for verification: {artifact}")
except VerificationFailedError as e:
    print(f"Fix the issues: {e}")
    # Agent must fix before trying again

What changed

After implementing this self-verification protocol, the results were dramatic:

Before Self-Verification (100 tasks)
Tasks marked complete: 100
Tasks actually working: 42
Success rate: 42%
Time debugging "complete" tasks: 23 hours
After Self-Verification (100 tasks)
Tasks marked complete: 100
Tasks actually working: 91
Success rate: 91%
Time debugging "complete" tasks: 4 hours

The key differences:

  1. Demo statements force clarity about what “done” means
  2. Pending verification state prevents agents from rushing to new work
  3. Structured artifacts provide proof, not claims
  4. Automated checks catch TODO comments, orphaned code, failing tests

Common mistakes I made

Mistake 1: Trusting agent completion reports

# Before
Agent: "Task done!"
Me: "Great, next task"
# After
Agent: "Task done!"
System: Running verification...
System: ERROR: 2 TODO comments found
System: ERROR: Function searchFilter defined but never called
System: ERROR: Tests failing
Me: "Agent, fix these issues and resubmit"

Mistake 2: Skipping demo statements for “simple” tasks

Simple tasks are where agents fail most often:

bad_task.py
# Bad: No demo statement
Task: Add a loading spinner
# Agent adds spinner component but:
# - Never shows/hides it based on loading state
# - Uses wrong color
# - Places it off-screen
# Good: With demo statement
Task: Add a loading spinner
Demo: "When data is loading, spinner appears in center of form. When data loads, spinner disappears within 100ms."

Mistake 3: Allowing task chaining without checkpoints

The worst failures happened when I let agents complete multiple dependent tasks without verification:

Task 1: Create API client -> Marked done (actually broken)
Task 2: Connect to API -> Marked done (depends on Task 1)
Task 3: Add error handling -> Marked done (depends on Task 2)
Result: All three broken because Task 1 never actually worked

Now each task requires verification before the next one starts.
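A checkpointed chain can be sketched in a few lines. Here each step is a task name paired with a verification callable, and the runner stops at the first step that fails, so later tasks never start on top of broken work. `run_with_checkpoints` and the lambda stand-ins are hypothetical. In practice the callable would run the real verification for that task.

```python
from typing import Callable, List, Optional, Tuple

# (task name, callable returning True when the task verifies)
Step = Tuple[str, Callable[[], bool]]


def run_with_checkpoints(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Verify dependent tasks in order and stop at the first failure.

    Returns the names that verified and the name that blocked the
    chain (None if every step verified).
    """
    verified: List[str] = []
    for name, verify in steps:
        if not verify():
            return verified, name
        verified.append(name)
    return verified, None


steps = [
    ("Create API client", lambda: False),  # broken, as in the failure above
    ("Connect to API", lambda: True),
    ("Add error handling", lambda: True),
]
verified, blocked_at = run_with_checkpoints(steps)
print(verified, blocked_at)  # [] Create API client
```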

Mistake 4: Vague verification criteria

“Code works” is not a verification criterion. Specific, observable outcomes are:

# Bad
- [ ] Code works
# Good
- [ ] When I call search("test"), I get results within 500ms
- [ ] When I call search(""), I get empty results (not error)
- [ ] When I call search with special chars, no exceptions
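Criteria written this way translate almost directly into executable checks. The sketch below does that for the three "good" criteria, using a toy `search` function over a made-up `CATALOG` as a stand-in for the real implementation. All names here are illustrative assumptions.

```python
import time
from typing import Callable, Dict, List

CATALOG = ["test plan", "unit test", "release notes"]  # made-up data


def search(query: str) -> List[str]:
    """Toy stand-in for the real search function."""
    if not query:
        return []
    needle = query.lower()
    return [item for item in CATALOG if needle in item.lower()]


def check_results_within_500ms() -> None:
    start = time.perf_counter()
    results = search("test")
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert results, "expected at least one result"
    assert elapsed_ms < 500, f"took {elapsed_ms:.1f} ms"


def check_empty_query_returns_empty() -> None:
    assert search("") == []


def check_special_chars_do_not_raise() -> None:
    for query in ["a*b", "(", "100%", "\\"]:
        search(query)  # any exception here fails the criterion


CRITERIA: List[Callable[[], None]] = [
    check_results_within_500ms,
    check_empty_query_returns_empty,
    check_special_chars_do_not_raise,
]


def run_criteria() -> Dict[str, bool]:
    """Run every criterion and record pass/fail by function name."""
    results: Dict[str, bool] = {}
    for check in CRITERIA:
        try:
            check()
            results[check.__name__] = True
        except Exception:
            results[check.__name__] = False
    return results


print(run_criteria())
```

The same functions could be collected by pytest by renaming them with a `test_` prefix, which would let the verifier's test run cover them.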

Mistake 5: Manual verification only

Relying on humans to verify everything doesn’t scale. Agents must self-verify first, with humans spot-checking:

verification_flow.py
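import random

# agent_self_verify(), human_verify() and VerificationFailedError
# are assumed to be defined elsewhere

# Agent self-verifies (automated, fast, scalable)
artifact = agent_self_verify(task)
if not artifact.is_valid():
    raise VerificationFailedError()
# Human spot-checks (random 10%)
if random.random() < 0.1:
    human_verify(task)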
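One drawback of `random.random()` for spot-checks is that the decision changes on every run, which makes audits harder to reproduce. A possible alternative, sketched here under the assumption that task ids are stable strings, is to derive the decision from a hash of the task id. `needs_human_review` is a hypothetical helper.

```python
import hashlib


def needs_human_review(task_id: str, rate: float = 0.1) -> bool:
    """Deterministically select roughly `rate` of all task ids.

    The first 8 bytes of the SHA-256 digest are mapped to a number
    in [0, 1) and compared with the sampling rate.
    """
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


selected = [f"task-{i}" for i in range(1000)
            if needs_human_review(f"task-{i}")]
print(f"{len(selected)} of 1000 tasks selected for human review")
```

The trade-off is that an agent which knows the rule could in principle predict which tasks are reviewed, so a secret salt mixed into the hash may be worth adding.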

The complete pattern

Here’s the full self-verification pattern I now use:

Self-Verification Checklist
Before task starts:
[ ] Define demo statement (what will I observe when this works?)
[ ] List verification criteria (how will I prove it works?)
[ ] Identify integration points (where does this connect?)
During implementation:
[ ] Track all modified files
[ ] Run tests after each change
[ ] Check for TODO comments
Before claiming complete:
[ ] Generate verification artifact
[ ] All checks pass
[ ] Demo statement can be demonstrated
[ ] Submit for pending verification
After verification:
[ ] Task moves to verified state
[ ] Agent can claim new work
[ ] Artifact stored for audit
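For the last checklist item, one simple way to store an artifact for audit is to write it as JSON next to a timestamp. This is only a sketch: `store_artifact`, the `audit/` directory layout and the trimmed `ExampleArtifact` (a shortened stand-in for `VerificationArtifact`) are assumptions for illustration.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, is_dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import List


@dataclass
class ExampleArtifact:
    """Trimmed stand-in for VerificationArtifact, to keep the sketch short."""
    demo_statement: str
    tests_pass: bool
    files_modified: List[str]


def store_artifact(task_id: str, artifact, audit_dir: Path) -> Path:
    """Write one JSON record per task and return the file path."""
    if not is_dataclass(artifact) or isinstance(artifact, type):
        raise TypeError("artifact must be a dataclass instance")
    audit_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "task_id": task_id,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "artifact": asdict(artifact),
    }
    path = audit_dir / f"{task_id}.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        saved = store_artifact(
            "search-001",
            ExampleArtifact("User types query, results filter live", True, ["search.py"]),
            Path(tmp) / "audit",
        )
        print(saved.read_text(encoding="utf-8"))
```

The task id is used directly as the file name, so ids containing path separators would need sanitizing first. Resubmitting the same task also overwrites the earlier record, which a real audit log would probably want to avoid.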

Summary

AI agents can verify their own task completion through three mechanisms:

  1. Demo statements: Define what “done” looks like before implementation begins
  2. Pending verification state: Force a checkpoint between implementation and closure
  3. Structured completion artifacts: Require proof, not claims

The key insight: agents don’t lie about completion — they lack the feedback loop to know if their code actually works. By implementing self-verification protocols, you transform agents from code writers into solution deliverers.

After implementing this system, my task success rate improved from 42% to 91%. The agents still make mistakes, but the verification catches them before I waste time debugging phantom completions.

