How Can AI Agents Verify Their Own Task Completion? A Practical Guide to Self-Verification
Problem
My AI agent kept saying “task complete” but when I checked the code, nothing worked. The search function existed but was never called. The pagination component sat orphaned in a file. Three TODO comments marked places where actual logic should have been.
I ran the application. It crashed immediately.
```
[Task 1: Add search filtering] Complete
[Task 2: Connect to API] Complete
[Task 3: Add error handling] Complete

Summary: All tasks finished successfully!
Error: searchFilter is not defined
Error: API client never initialized
Error: 3 TODO comments found in production code
```

The agent genuinely believed it finished. It wrote code. It made progress. But it never verified that the code actually worked.
What I discovered
I started investigating how production AI systems handle this problem. After reverse-engineering several agent orchestration frameworks, I found a pattern that changed everything:
Demo statements as quality levers.
Every task gets a demo statement — a plain English description of the observable outcome. This was the single biggest quality lever in the systems I studied.
```
Task: Implement search filtering
Demo: "User types query in search box, results filter in real-time without page reload"

Task: Add user authentication
Demo: "When I submit valid credentials, I receive a session token within 200ms"
```

Without demo statements, agents build plumbing that never connects to anything visible. They optimize for “I wrote code” rather than “the system works.”
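One way to make a demo statement enforceable is to pair it with an executable check. The following is a minimal sketch under stated assumptions: `DemoCheck`, `run_demo_check`, and the stand-in `fake_login` function are hypothetical names invented for illustration, and a real check would call the actual system rather than a stub.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class DemoCheck:
    """Pairs a demo statement with code that exercises it (hypothetical helper)."""
    statement: str
    action: Callable[[], object]      # performs the demo and returns its result
    accept: Callable[[object], bool]  # decides whether the result is acceptable
    max_seconds: float                # time budget taken from the statement


def run_demo_check(check: DemoCheck) -> bool:
    """Return True only if the action succeeds within the time budget."""
    start = time.perf_counter()
    try:
        result = check.action()
    except Exception:
        # Any exception means the demo cannot be shown
        return False
    elapsed = time.perf_counter() - start
    return check.accept(result) and elapsed <= check.max_seconds


def fake_login() -> dict:
    # Stand-in for a real authentication call (assumption for this sketch)
    return {"session_token": "abc123"}


check = DemoCheck(
    statement="When I submit valid credentials, I receive a session token within 200ms",
    action=fake_login,
    accept=lambda r: bool(r.get("session_token")),
    max_seconds=0.2,
)
print(run_demo_check(check))
```

The point of the design is that the time budget and the acceptance rule come straight from the wording of the demo statement, so a vague statement becomes obvious the moment you try to fill in these fields.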
Building a self-verification protocol
I decided to implement a system where agents must verify their own work before closing tasks. Here’s what I built.
Step 1: Task structure with demo statements
Every task now requires a demo statement upfront:
```python
from dataclasses import dataclass
from typing import List
from enum import Enum


class TaskStatus(Enum):
    IN_PROGRESS = "in_progress"
    PENDING_VERIFICATION = "pending_verification"
    VERIFIED = "verified"
    CLOSED = "closed"


@dataclass
class DemoStatement:
    task_id: str
    observable_outcome: str
    verification_criteria: List[str]


@dataclass
class Task:
    id: str
    description: str
    demo_statement: DemoStatement
    status: TaskStatus = TaskStatus.IN_PROGRESS
    files_modified: List[str] = None
    test_results: str = ""

    def __post_init__(self):
        if self.files_modified is None:
            self.files_modified = []
```

The demo statement forces clarity before implementation begins. Instead of “add search,” the agent must think about what “search working” actually looks like.
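A demo statement only forces clarity if vague ones are rejected up front. Here is a rough sketch of such a gate, assuming a simple phrase-matching heuristic: `lint_demo_statement` and the `VAGUE_PHRASES` list are hypothetical and would need tuning for a real project (plain substring matching will, for example, also flag the word "networks").

```python
from typing import List

# Hypothetical phrases that describe effort rather than an observable outcome
VAGUE_PHRASES = ("works", "is done", "is implemented", "as expected")


def lint_demo_statement(observable_outcome: str,
                        verification_criteria: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the statement is usable."""
    problems = []
    if not observable_outcome.strip():
        problems.append("observable_outcome is empty")
    if not verification_criteria:
        problems.append("no verification criteria listed")
    for criterion in verification_criteria:
        lowered = criterion.lower()
        # Flag each criterion at most once, on the first vague phrase found
        for phrase in VAGUE_PHRASES:
            if phrase in lowered:
                problems.append(f"vague criterion: {criterion!r}")
                break
    return problems


print(lint_demo_statement("User types query, results filter live", ["Code works"]))
```

Running a check like this before a task is handed to an agent keeps the "what does done look like" question from being deferred until after the code is written.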
Step 2: Structured completion artifacts
Before an agent can claim a task is complete, it must produce a verification artifact:
```python
from dataclasses import dataclass
from typing import List


@dataclass
class VerificationArtifact:
    demo_statement: str
    no_todos: bool
    code_wired: bool
    tests_pass: bool
    demo_possible: bool
    files_modified: List[str]
    test_results: str
    integration_points: List[str]

    def is_valid(self) -> bool:
        """All verification checks must pass."""
        return all([
            self.no_todos,
            self.code_wired,
            self.tests_pass,
            self.demo_possible
        ])
```

This artifact becomes the proof of completion. Not the agent’s word, but actual verification results.
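When an artifact is invalid, it helps to say exactly which checks failed instead of returning a bare `False`. A small sketch of that idea follows, assuming a trimmed-down stand-in class: `CheckResults` and `failed_checks` are hypothetical names, and `CheckResults` only mirrors the four boolean fields of `VerificationArtifact`.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class CheckResults:
    """Stand-in for the boolean fields of VerificationArtifact (assumption)."""
    no_todos: bool
    code_wired: bool
    tests_pass: bool
    demo_possible: bool


def failed_checks(results: CheckResults) -> List[str]:
    """Names of the checks that did not pass, in declaration order."""
    return [f.name for f in fields(results) if not getattr(results, f.name)]


results = CheckResults(no_todos=True, code_wired=False,
                       tests_pass=True, demo_possible=False)
print(failed_checks(results))  # ['code_wired', 'demo_possible']
```

Feeding this list back to the agent gives it a concrete to-do list for the next attempt.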
Step 3: The verification pipeline
I built a TaskVerifier class that agents must call before closing tasks:
```python
import re
import subprocess
from pathlib import Path
from typing import List

# VerificationArtifact is the dataclass defined in Step 2.


class TaskVerifier:
    def __init__(self, task_id: str, demo_statement: str):
        self.task_id = task_id
        self.demo_statement = demo_statement
        self.files_modified = []
        self._pytest_result = None  # cached so the suite runs only once

    def verify_completion(self) -> VerificationArtifact:
        """Agent must call this before claiming task complete."""
        return VerificationArtifact(
            demo_statement=self.demo_statement,
            no_todos=self._check_no_todos(),
            code_wired=self._check_code_wired(),
            tests_pass=self._run_tests(),
            demo_possible=self._verify_demo_possible(),
            files_modified=self.files_modified,
            test_results=self._get_test_results(),
            integration_points=self._find_integration_points()
        )

    def _check_no_todos(self) -> bool:
        """Scan modified files for TODO comments."""
        for file_path in self.files_modified:
            content = Path(file_path).read_text()
            if re.search(r'\bTODO\b|\bFIXME\b', content):
                print(f"Found TODO/FIXME in {file_path}")
                return False
        return True

    def _check_code_wired(self) -> bool:
        """Verify code is called from application entry points."""
        # Check if new functions are imported/called
        # This is application-specific, but here's the idea:
        for file_path in self.files_modified:
            content = Path(file_path).read_text()
            # Look for function definitions
            functions = re.findall(r'def (\w+)\(', content)
            for func in functions:
                if not self._is_function_used(func, file_path):
                    print(f"Function {func} defined but never called")
                    return False
        return True

    def _run_pytest(self) -> subprocess.CompletedProcess:
        """Run the test suite once and reuse the result for every check."""
        if self._pytest_result is None:
            self._pytest_result = subprocess.run(
                ['python', '-m', 'pytest', '-v'],
                capture_output=True,
                text=True
            )
        return self._pytest_result

    def _run_tests(self) -> bool:
        """Execute test suite for modified code."""
        return self._run_pytest().returncode == 0

    def _verify_demo_possible(self) -> bool:
        """Can the demo statement be demonstrated?"""
        # This could be automated for some cases
        # For now, returns True if all other checks pass
        return self._check_no_todos() and self._run_tests()

    def _get_test_results(self) -> str:
        """Get human-readable test results."""
        result = self._run_pytest()
        if result.returncode == 0:
            return "All tests passing"
        return f"Tests failed:\n{result.stdout}"

    def _find_integration_points(self) -> List[str]:
        """Find where new code connects to existing system."""
        points = []
        for file_path in self.files_modified:
            content = Path(file_path).read_text()
            # Collect the import statements of each modified file
            # This is simplified - real implementation would be more thorough
            if 'import' in content:
                imports = re.findall(r'from .+ import|import \w+', content)
                points.extend(imports)
        return points

    def _is_function_used(self, func_name: str, defined_in: str) -> bool:
        """Check if function is called anywhere in the codebase."""
        defined_path = Path(defined_in).resolve()
        # Search all other Python files for calls to this function.
        # A plain substring match also counts definitions and comments,
        # so treat a True result as a hint, not proof.
        for py_file in Path('.').rglob('*.py'):
            if py_file.resolve() == defined_path:
                continue
            content = py_file.read_text()
            if func_name + '(' in content:
                return True
        return False
```

Step 4: Pending verification state
The critical piece: tasks don’t close immediately. They enter a “pending verification” state:
1. In Progress -> Agent working on implementation
2. Pending Verification -> Agent claims completion, structured verification required
3. Verified -> Independent verification confirms success
4. Closed -> Task truly complete

I modified the agent’s task claiming logic to enforce this:
```python
class AgentBusyError(Exception):
    """Raised when an agent with unverified work tries to claim a new task."""


class VerificationFailedError(Exception):
    """Raised when a verification artifact has failing checks."""


class AgentTaskQueue:
    def __init__(self):
        self.queue = []
        self.agent_tasks = {}  # agent_id -> [tasks]

    def claim_task(self, agent_id: str) -> Task:
        """Block if agent has tasks pending verification."""
        if not self._can_claim_new_task(agent_id):
            raise AgentBusyError(
                f"Agent {agent_id} has tasks pending verification. "
                f"Complete verification before claiming new work."
            )

        task = self.queue.pop(0)  # raises IndexError if the queue is empty
        if agent_id not in self.agent_tasks:
            self.agent_tasks[agent_id] = []
        self.agent_tasks[agent_id].append(task)
        return task

    def _can_claim_new_task(self, agent_id: str) -> bool:
        """Agents with pending verification cannot claim new work."""
        agent_tasks = self.agent_tasks.get(agent_id, [])
        pending = [t for t in agent_tasks
                   if t.status == TaskStatus.PENDING_VERIFICATION]
        return len(pending) == 0

    def _get_task(self, task_id: str) -> Task:
        """Look up a claimed task by id."""
        for tasks in self.agent_tasks.values():
            for task in tasks:
                if task.id == task_id:
                    return task
        raise KeyError(f"Unknown task: {task_id}")

    def submit_for_verification(self, task: Task, artifact: VerificationArtifact):
        """Agent submits task for verification."""
        if not artifact.is_valid():
            raise VerificationFailedError(
                f"Verification failed for task {task.id}:\n"
                f"  no_todos: {artifact.no_todos}\n"
                f"  code_wired: {artifact.code_wired}\n"
                f"  tests_pass: {artifact.tests_pass}\n"
                f"  demo_possible: {artifact.demo_possible}"
            )

        task.status = TaskStatus.PENDING_VERIFICATION
        print(f"Task {task.id} moved to pending verification")

    def verify_and_close(self, task_id: str) -> bool:
        """Final verification before closing."""
        # Independent verification could happen here
        # For now, trust the artifact
        task = self._get_task(task_id)
        task.status = TaskStatus.VERIFIED

        # Give agent a grace period before closing
        # This allows for manual spot-checks
        return True
```

Step 5: Integration with agent workflow
Here’s how an agent now completes a task:
```python
# get_task, track_modified_files, and queue (an AgentTaskQueue instance)
# are assumed to be provided by the surrounding application.

def agent_complete_task(task_id: str, demo_statement: str):
    """Agent workflow for task completion."""
    # 1. Get the task
    task = get_task(task_id)

    # 2. Track files modified during work
    files_modified = track_modified_files()

    # 3. Create verifier
    verifier = TaskVerifier(
        task_id=task_id,
        demo_statement=demo_statement
    )
    verifier.files_modified = files_modified

    # 4. Generate verification artifact
    artifact = verifier.verify_completion()

    # 5. Submit for verification
    if artifact.is_valid():
        task.status = TaskStatus.PENDING_VERIFICATION
        queue.submit_for_verification(task, artifact)
        return artifact
    else:
        raise VerificationFailedError(
            f"Self-verification failed:\n"
            f"  Demo: {demo_statement}\n"
            f"  Results: {artifact}"
        )


# Example usage
try:
    artifact = agent_complete_task(
        task_id="search-001",
        demo_statement="User types query, results filter live"
    )
    print(f"Task submitted for verification: {artifact}")
except VerificationFailedError as e:
    print(f"Fix the issues: {e}")
    # Agent must fix before trying again
```

What changed
After implementing this self-verification protocol, the results were dramatic:
```
Before:
Tasks marked complete: 100
Tasks actually working: 42
Success rate: 42%
Time debugging "complete" tasks: 23 hours

After:
Tasks marked complete: 100
Tasks actually working: 91
Success rate: 91%
Time debugging "complete" tasks: 4 hours
```

The key differences:
- Demo statements force clarity about what “done” means
- Pending verification state prevents agents from rushing to new work
- Structured artifacts provide proof, not claims
- Automated checks catch TODO comments, orphaned code, failing tests
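The regex-and-substring approach to finding orphaned code in Step 3 is easy to fool, because a function's own definition or a comment can count as a "call". One possible refinement is to parse the source with Python's standard `ast` module. This is a minimal sketch, not the check used above: `defined_functions`, `called_names`, and `find_orphans` are hypothetical names, matching is by name only, and functions passed around as callbacks would still be reported as orphans.

```python
import ast
from typing import Dict, Set


def defined_functions(source: str) -> Set[str]:
    """Names of all functions and methods defined in a module's source."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def called_names(source: str) -> Set[str]:
    """Names that appear as the target of a call, e.g. f() or obj.f()."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                names.add(node.func.id)
            elif isinstance(node.func, ast.Attribute):
                names.add(node.func.attr)
    return names


def find_orphans(modules: Dict[str, str]) -> Set[str]:
    """Functions defined in any module but never called in any of them."""
    defined, called = set(), set()
    for source in modules.values():
        defined |= defined_functions(source)
        called |= called_names(source)
    # Dunder methods are invoked implicitly, so skip them
    return {name for name in defined - called
            if not (name.startswith("__") and name.endswith("__"))}


# Two tiny in-memory modules: search_filter exists but nothing calls it
modules = {
    "search.py": "def search_filter(items, q):\n    return [i for i in items if q in i]\n",
    "app.py": "def main():\n    print('no search wired in yet')\n\nmain()\n",
}
print(find_orphans(modules))  # {'search_filter'}
```

Working on source strings rather than file paths keeps the sketch easy to test; in practice the dictionary would be filled by reading the project's `.py` files.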
Common mistakes I made
Mistake 1: Trusting agent completion reports
```
# Before
Agent: "Task done!"
Me: "Great, next task"

# After
Agent: "Task done!"
System: Running verification...
System: ERROR: 2 TODO comments found
System: ERROR: Function searchFilter defined but never called
System: ERROR: Tests failing
Me: "Agent, fix these issues and resubmit"
```

Mistake 2: Skipping demo statements for “simple” tasks
Simple tasks are where agents fail most often:
```
# Bad: No demo statement
Task: Add a loading spinner

# Agent adds spinner component but:
# - Never shows/hides it based on loading state
# - Uses wrong color
# - Places it off-screen

# Good: With demo statement
Task: Add a loading spinner
Demo: "When data is loading, spinner appears in center of form. When data loads, spinner disappears within 100ms."
```

Mistake 3: Allowing task chaining without checkpoints
The worst failures happened when I let agents complete multiple dependent tasks without verification:
```
Task 1: Create API client -> Marked done (actually broken)
Task 2: Connect to API -> Marked done (depends on Task 1)
Task 3: Add error handling -> Marked done (depends on Task 2)

Result: All three broken because Task 1 never actually worked
```

Now each task requires verification before the next one starts.
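A checkpoint between dependent tasks can be expressed as a small gate that refuses to start a task until everything it depends on has been verified. The sketch below assumes tasks are tracked in plain dictionaries: `Status`, `blocked_by`, and the task ids are hypothetical, with `Status` mirroring the task states from Step 1.

```python
from enum import Enum
from typing import Dict, List


class Status(Enum):
    IN_PROGRESS = "in_progress"
    PENDING_VERIFICATION = "pending_verification"
    VERIFIED = "verified"
    CLOSED = "closed"


def blocked_by(task_id: str,
               depends_on: Dict[str, List[str]],
               status: Dict[str, Status]) -> List[str]:
    """Dependencies of task_id that are not yet verified or closed."""
    done = {Status.VERIFIED, Status.CLOSED}
    # A dependency with no recorded status counts as not done
    return [dep for dep in depends_on.get(task_id, [])
            if status.get(dep) not in done]


depends_on = {"connect-api": ["api-client"], "error-handling": ["connect-api"]}
status = {"api-client": Status.PENDING_VERIFICATION}

print(blocked_by("connect-api", depends_on, status))  # ['api-client']
status["api-client"] = Status.VERIFIED
print(blocked_by("connect-api", depends_on, status))  # []
```

A scheduler would call `blocked_by` before handing out a task and only release it when the returned list is empty, which stops a broken first task from silently poisoning everything built on top of it.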
Mistake 4: Vague verification criteria
“Code works” is not a verification criterion. Specific, observable outcomes are:
```
# Bad
- [ ] Code works

# Good
- [ ] When I call search("test"), I get results within 500ms
- [ ] When I call search(""), I get empty results (not error)
- [ ] When I call search with special chars, no exceptions
```

Mistake 5: Manual verification only
Relying on humans to verify everything doesn’t scale. Agents must self-verify first, with humans spot-checking:
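```python
import random

# agent_self_verify and human_verify are placeholders for your own
# verification and review hooks.

# Agent self-verifies (automated, fast, scalable)
artifact = agent_self_verify(task)
if not artifact.is_valid():
    raise VerificationFailedError()

# Human spot-checks (random 10%)
if random.random() < 0.1:
    human_verify(task)
```

The complete pattern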
Here’s the full self-verification pattern I now use:
```
Before task starts:
  [ ] Define demo statement (what will I observe when this works?)
  [ ] List verification criteria (how will I prove it works?)
  [ ] Identify integration points (where does this connect?)

During implementation:
  [ ] Track all modified files
  [ ] Run tests after each change
  [ ] Check for TODO comments

Before claiming complete:
  [ ] Generate verification artifact
  [ ] All checks pass
  [ ] Demo statement can be demonstrated
  [ ] Submit for pending verification

After verification:
  [ ] Task moves to verified state
  [ ] Agent can claim new work
  [ ] Artifact stored for audit
```

Summary
AI agents can verify their own task completion through three mechanisms:
- Demo statements: Define what “done” looks like before implementation begins
- Pending verification state: Force a checkpoint between implementation and closure
- Structured completion artifacts: Require proof, not claims
The key insight: agents don’t lie about completion — they lack the feedback loop to know if their code actually works. By implementing self-verification protocols, you transform agents from code writers into solution deliverers.
After implementing this system, my task success rate improved from 42% to 91%. The agents still make mistakes, but the verification catches them before I waste time debugging phantom completions.