How Can AI Agents Verify Their Own Task Completion? A Practical Guide to Self-Verification

Mar 13, 2026

Problem

My AI agent kept saying “task complete” but when I checked the code, nothing worked. The search function existed but was never called. The pagination component sat orphaned in a file. Three TODO comments marked places where actual logic should have been.

I ran the application. It crashed immediately.

[Task 1: Add search filtering] Complete
[Task 2: Connect to API] Complete
[Task 3: Add error handling] Complete

Summary: All tasks finished successfully!

Error: searchFilter is not defined
Error: API client never initialized
Error: 3 TODO comments found in production code

The agent genuinely believed it finished. It wrote code. It made progress. But it never verified that the code actually worked.

What I discovered

I started investigating how production AI systems handle this problem. After reverse-engineering several agent orchestration frameworks, I found a pattern that changed everything:

Demo statements as quality levers.

Every task gets a demo statement — a plain English description of the observable outcome. This was the single biggest quality lever in the systems I studied.

Task: Implement search filtering
Demo: "User types query in search box, results filter in real-time without page reload"

Task: Add user authentication
Demo: "When I submit valid credentials, I receive a session token within 200ms"

Without demo statements, agents build plumbing that never connects to anything visible. They optimize for “I wrote code” rather than “the system works.”

Building a self-verification protocol

I decided to implement a system where agents must verify their own work before closing tasks. Here’s what I built.

Step 1: Task structure with demo statements

Every task now requires a demo statement upfront:

from dataclasses import dataclass, field
from typing import List
from enum import Enum

class TaskStatus(Enum):
    IN_PROGRESS = "in_progress"
    PENDING_VERIFICATION = "pending_verification"
    VERIFIED = "verified"
    CLOSED = "closed"

@dataclass
class DemoStatement:
    task_id: str
    observable_outcome: str
    verification_criteria: List[str]

@dataclass
class Task:
    id: str
    description: str
    demo_statement: DemoStatement
    status: TaskStatus = TaskStatus.IN_PROGRESS
    # default_factory gives every Task its own empty list
    files_modified: List[str] = field(default_factory=list)
    test_results: str = ""

The demo statement forces clarity before implementation begins. Instead of “add search,” the agent must think about what “search working” actually looks like.
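
One way to make that requirement mechanical is to reject demo statements that are too thin to verify. The following is a minimal sketch, not part of the system described here: `validate_demo_statement` is a hypothetical helper, the five-word threshold is an arbitrary heuristic, and `DemoStatement` is re-declared (with a default for its criteria) only so the block runs on its own.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DemoStatement:
    task_id: str
    observable_outcome: str
    verification_criteria: List[str] = field(default_factory=list)


def validate_demo_statement(demo: DemoStatement, min_words: int = 5) -> List[str]:
    """Return a list of problems; an empty list means the statement is usable.

    Hypothetical helper: the word threshold is an arbitrary heuristic.
    """
    problems = []
    if len(demo.observable_outcome.split()) < min_words:
        problems.append("observable_outcome is too short to describe an outcome")
    if not demo.verification_criteria:
        problems.append("no verification criteria listed")
    return problems


vague = DemoStatement("search-001", "add search")
clear = DemoStatement(
    "search-001",
    "User types query in search box, results filter in real-time without page reload",
    ["Typing 'abc' narrows the list to matching rows", "No full page reload occurs"],
)

print(validate_demo_statement(vague))   # two problems reported
print(validate_demo_statement(clear))   # no problems reported
```

A check like this cannot tell whether a statement is a good one, only whether it is obviously missing; a human or a reviewing agent still has to judge the content.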

Step 2: Structured completion artifacts

Before an agent can claim a task is complete, it must produce a verification artifact:

from dataclasses import dataclass
from typing import List

@dataclass
class VerificationArtifact:
    demo_statement: str
    no_todos: bool
    code_wired: bool
    tests_pass: bool
    demo_possible: bool
    files_modified: List[str]
    test_results: str
    integration_points: List[str]

    def is_valid(self) -> bool:
        """All verification checks must pass."""
        return all([
            self.no_todos,
            self.code_wired,
            self.tests_pass,
            self.demo_possible
        ])

This artifact becomes the proof of completion. Not the agent’s word — actual verification results.
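
When an artifact is invalid, it helps to report which checks failed rather than a bare `False`. A small sketch of that idea, assuming a hypothetical `failed_checks` helper and a reduced `CheckResults` stand-in that carries only the four boolean fields of `VerificationArtifact`:

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class CheckResults:
    """Stand-in for the boolean fields of VerificationArtifact."""
    no_todos: bool
    code_wired: bool
    tests_pass: bool
    demo_possible: bool


def failed_checks(results: CheckResults) -> List[str]:
    """Names of the checks that did not pass, in declaration order."""
    return [f.name for f in fields(results) if getattr(results, f.name) is False]


results = CheckResults(no_todos=False, code_wired=True, tests_pass=False, demo_possible=False)
print(failed_checks(results))  # ['no_todos', 'tests_pass', 'demo_possible']
```

The returned names can be fed straight back to the agent as its fix list.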

Step 3: The verification pipeline

I built a TaskVerifier class that agents must call before closing tasks:

import subprocess
import re
from pathlib import Path
from typing import List

class TaskVerifier:
    def __init__(self, task_id: str, demo_statement: str):
        self.task_id = task_id
        self.demo_statement = demo_statement
        self.files_modified = []

    def verify_completion(self) -> VerificationArtifact:
        """Agent must call this before claiming task complete."""
        return VerificationArtifact(
            demo_statement=self.demo_statement,
            no_todos=self._check_no_todos(),
            code_wired=self._check_code_wired(),
            tests_pass=self._run_tests(),
            demo_possible=self._verify_demo_possible(),
            files_modified=self.files_modified,
            test_results=self._get_test_results(),
            integration_points=self._find_integration_points()
        )

    def _check_no_todos(self) -> bool:
        """Scan modified files for TODO comments."""
        for file_path in self.files_modified:
            content = Path(file_path).read_text()
            if re.search(r'\bTODO\b|\bFIXME\b', content):
                print(f"Found TODO/FIXME in {file_path}")
                return False
        return True

    def _check_code_wired(self) -> bool:
        """Verify code is called from application entry points."""
        # Check if new functions are imported/called
        # This is application-specific, but here's the idea:
        for file_path in self.files_modified:
            content = Path(file_path).read_text()
            # Look for function definitions
            functions = re.findall(r'def (\w+)\(', content)
            for func in functions:
                if not self._is_function_used(func, file_path):
                    print(f"Function {func} defined but never called")
                    return False
        return True

    def _pytest(self) -> subprocess.CompletedProcess:
        """Run the test suite once and reuse the result for every check."""
        if getattr(self, '_pytest_result', None) is None:
            self._pytest_result = subprocess.run(
                ['python', '-m', 'pytest', '-v'],
                capture_output=True,
                text=True
            )
        return self._pytest_result

    def _run_tests(self) -> bool:
        """Execute the test suite; exit code 0 means every test passed."""
        return self._pytest().returncode == 0

    def _verify_demo_possible(self) -> bool:
        """Can the demo statement be demonstrated?"""
        # This could be automated for some cases
        # For now it is a placeholder: it only repeats the TODO scan and the
        # test check, so it adds no independent signal
        return self._check_no_todos() and self._run_tests()

    def _get_test_results(self) -> str:
        """Get human-readable test results."""
        result = self._pytest()
        if result.returncode == 0:
            return "All tests passing"
        return f"Tests failed:\n{result.stdout}"

    def _find_integration_points(self) -> List[str]:
        """Find where new code connects to existing system."""
        points = []
        for file_path in self.files_modified:
            content = Path(file_path).read_text()
            # Find imports from this file in other files
            # This is simplified - real implementation would be more thorough
            if 'import' in content:
                imports = re.findall(r'from .+ import|import \w+', content)
                points.extend(imports)
        return points

    def _is_function_used(self, func_name: str, defined_in: str) -> bool:
        """Check if function is called anywhere else in the codebase."""
        defined_path = Path(defined_in).resolve()
        # Textual match only: a same-named definition elsewhere also counts
        for py_file in Path('.').rglob('*.py'):
            # Compare resolved paths so './a.py' and 'a.py' are the same file
            if py_file.resolve() == defined_path:
                continue
            content = py_file.read_text()
            if func_name + '(' in content:
                return True
        return False
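
The regex and substring matching in `_check_code_wired` can be fooled by comments, strings, and same-named definitions. A swapped-in alternative, sketched here with Python's standard `ast` module, compares defined function names against actual call sites. This is an illustration and not the verifier described in this article: the helper names are hypothetical, names are matched without scope or import resolution, and functions registered through decorators or called dynamically would still be reported as uncalled.

```python
import ast
from typing import Dict, Set


def defined_functions(source: str) -> Set[str]:
    """Names of all functions defined in the source (any nesting level)."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def called_names(source: str) -> Set[str]:
    """Names that appear as call targets: foo(...) or obj.foo(...)."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                names.add(node.func.id)
            elif isinstance(node.func, ast.Attribute):
                names.add(node.func.attr)
    return names


def uncalled_functions(sources: Dict[str, str]) -> Set[str]:
    """Functions defined somewhere in `sources` but called nowhere in it."""
    defined, called = set(), set()
    for text in sources.values():
        defined |= defined_functions(text)
        called |= called_names(text)
    return defined - called


# Two in-memory "files": search_filter is wired into app.py, orphan is not.
sources = {
    "search.py": (
        "def search_filter(items, q):\n"
        "    return [i for i in items if q in i]\n"
        "\n"
        "def orphan():\n"
        "    pass\n"
    ),
    "app.py": (
        "from search import search_filter\n"
        "\n"
        "def main():\n"
        "    print(search_filter(['ab', 'cd'], 'a'))\n"
        "\n"
        "main()\n"
    ),
}
print(uncalled_functions(sources))  # {'orphan'}
```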

Step 4: Pending verification state

The critical piece: tasks don’t close immediately. They enter a “pending verification” state:

1. In Progress     -> Agent working on implementation
2. Pending Verification -> Agent claims completion, structured verification required
3. Verified        -> Independent verification confirms success
4. Closed          -> Task truly complete

I modified the agent’s task claiming logic to enforce this:

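class AgentBusyError(Exception):
    """Raised when an agent with unverified work tries to claim a new task."""

class VerificationFailedError(Exception):
    """Raised when a verification artifact has at least one failing check."""

class AgentTaskQueue:
    def __init__(self):
        self.queue = []
        self.agent_tasks = {}  # agent_id -> [tasks]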

    def claim_task(self, agent_id: str) -> Task:
        """Block if agent has tasks pending verification."""
        if not self._can_claim_new_task(agent_id):
            raise AgentBusyError(
                f"Agent {agent_id} has tasks pending verification. "
                f"Complete verification before claiming new work."
            )

        task = self.queue.pop(0)
        if agent_id not in self.agent_tasks:
            self.agent_tasks[agent_id] = []
        self.agent_tasks[agent_id].append(task)
        return task

    def _can_claim_new_task(self, agent_id: str) -> bool:
        """Agents with pending verification cannot claim new work."""
        agent_tasks = self.agent_tasks.get(agent_id, [])
        pending = [t for t in agent_tasks
                   if t.status == TaskStatus.PENDING_VERIFICATION]
        return len(pending) == 0

    def submit_for_verification(self, task: Task, artifact: VerificationArtifact):
        """Agent submits task for verification."""
        if not artifact.is_valid():
            raise VerificationFailedError(
                f"Verification failed for task {task.id}:\n"
                f"  no_todos: {artifact.no_todos}\n"
                f"  code_wired: {artifact.code_wired}\n"
                f"  tests_pass: {artifact.tests_pass}\n"
                f"  demo_possible: {artifact.demo_possible}"
            )

        task.status = TaskStatus.PENDING_VERIFICATION
        print(f"Task {task.id} moved to pending verification")

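    def verify_and_close(self, task_id: str) -> bool:
        """Final verification before closing."""
        # Independent verification could happen here
        # For now, trust the artifact
        task = self._get_task(task_id)
        task.status = TaskStatus.VERIFIED

        # The task stays VERIFIED rather than CLOSED for a grace period
        # This allows for manual spot-checks
        return True

    def _get_task(self, task_id: str) -> Task:
        """Minimal lookup; assumes the task was claimed through claim_task."""
        for tasks in self.agent_tasks.values():
            for task in tasks:
                if task.id == task_id:
                    return task
        raise KeyError(f"Unknown task id: {task_id}")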
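
The four-state lifecycle listed at the start of this step can also be enforced with an explicit transition table, so that no code path can jump from "in progress" straight to "closed". This is a sketch under assumptions: `ALLOWED`, `transition`, and `InvalidTransitionError` are hypothetical names, and the backward edge from pending verification to in progress (for rejected work) is my addition, not something the original queue implements.

```python
from enum import Enum


class TaskStatus(Enum):
    IN_PROGRESS = "in_progress"
    PENDING_VERIFICATION = "pending_verification"
    VERIFIED = "verified"
    CLOSED = "closed"


# Allowed moves. The backward edge (pending -> in progress) is an assumption
# for the case where verification is rejected and the agent must rework.
ALLOWED = {
    TaskStatus.IN_PROGRESS: {TaskStatus.PENDING_VERIFICATION},
    TaskStatus.PENDING_VERIFICATION: {TaskStatus.VERIFIED, TaskStatus.IN_PROGRESS},
    TaskStatus.VERIFIED: {TaskStatus.CLOSED},
    TaskStatus.CLOSED: set(),
}


class InvalidTransitionError(Exception):
    """Raised when a status change skips a step in the lifecycle."""


def transition(current: TaskStatus, target: TaskStatus) -> TaskStatus:
    """Return the new status, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise InvalidTransitionError(f"{current.value} -> {target.value} is not allowed")
    return target


status = TaskStatus.IN_PROGRESS
status = transition(status, TaskStatus.PENDING_VERIFICATION)
status = transition(status, TaskStatus.VERIFIED)
status = transition(status, TaskStatus.CLOSED)
print(status)

try:
    transition(TaskStatus.IN_PROGRESS, TaskStatus.CLOSED)
except InvalidTransitionError as exc:
    print(exc)
```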

Step 5: Integration with agent workflow

Here’s how an agent now completes a task:

def agent_complete_task(task_id: str, demo_statement: str):
    """Agent workflow for task completion."""
    # 1. Get the task
    task = get_task(task_id)

    # 2. Track files modified during work
    files_modified = track_modified_files()

    # 3. Create verifier
    verifier = TaskVerifier(
        task_id=task_id,
        demo_statement=demo_statement
    )
    verifier.files_modified = files_modified

    # 4. Generate verification artifact
    artifact = verifier.verify_completion()

    # 5. Submit for verification
    if artifact.is_valid():
        # submit_for_verification sets the status to PENDING_VERIFICATION
        queue.submit_for_verification(task, artifact)
        return artifact
    else:
        raise VerificationFailedError(
            f"Self-verification failed:\n"
            f"  Demo: {demo_statement}\n"
            f"  Results: {artifact}"
        )

# Example usage
try:
    artifact = agent_complete_task(
        task_id="search-001",
        demo_statement="User types query, results filter live"
    )
    print(f"Task submitted for verification: {artifact}")
except VerificationFailedError as e:
    print(f"Fix the issues: {e}")
    # Agent must fix before trying again

What changed

After implementing this self-verification protocol, the results were dramatic:

Before self-verification:

Tasks marked complete: 100
Tasks actually working: 42
Success rate: 42%
Time debugging "complete" tasks: 23 hours

After self-verification:

Tasks marked complete: 100
Tasks actually working: 91
Success rate: 91%
Time debugging "complete" tasks: 4 hours

The key differences:

Demo statements force clarity about what “done” means
Pending verification state prevents agents from rushing to new work
Structured artifacts provide proof, not claims
Automated checks catch TODO comments, orphaned code, failing tests

Common mistakes I made

Mistake 1: Trusting agent completion reports

# Before
Agent: "Task done!"
Me: "Great, next task"

# After
Agent: "Task done!"
System: Running verification...
System: ERROR: 2 TODO comments found
System: ERROR: Function searchFilter defined but never called
System: ERROR: Tests failing
Me: "Agent, fix these issues and resubmit"

Mistake 2: Skipping demo statements for “simple” tasks

In my experience, simple tasks are where agents fail most often:

# Bad: No demo statement
Task: Add a loading spinner

# Agent adds spinner component but:
# - Never shows/hides it based on loading state
# - Uses wrong color
# - Places it off-screen

# Good: With demo statement
Task: Add a loading spinner
Demo: "When data is loading, spinner appears in center of form. When data loads, spinner disappears within 100ms."

Mistake 3: Allowing task chaining without checkpoints

The worst failures happened when I let agents complete multiple dependent tasks without verification:

Task 1: Create API client -> Marked done (actually broken)
Task 2: Connect to API -> Marked done (depends on Task 1)
Task 3: Add error handling -> Marked done (depends on Task 2)

Result: All three broken because Task 1 never actually worked

Now each task requires verification before the next one starts.
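
A minimal sketch of that gating rule, assuming a hypothetical `ready_tasks` helper and made-up task ids: a task becomes eligible only once every task it depends on is in the verified set.

```python
from typing import Dict, List, Set


def ready_tasks(dependencies: Dict[str, List[str]], verified: Set[str]) -> List[str]:
    """Tasks not yet verified whose dependencies are all verified."""
    return [
        task
        for task, deps in dependencies.items()
        if task not in verified and all(d in verified for d in deps)
    ]


# Hypothetical ids mirroring the three chained tasks above.
deps = {
    "create-api-client": [],
    "connect-to-api": ["create-api-client"],
    "add-error-handling": ["connect-to-api"],
}

print(ready_tasks(deps, verified=set()))                  # ['create-api-client']
print(ready_tasks(deps, verified={"create-api-client"}))  # ['connect-to-api']
```

With this in place, a broken first task blocks the chain instead of silently poisoning everything built on top of it.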

Mistake 4: Vague verification criteria

“Code works” is not a verification criterion. Specific, observable outcomes are:

# Bad
- [ ] Code works

# Good
- [ ] When I call search("test"), I get results within 500ms
- [ ] When I call search(""), I get empty results (not error)
- [ ] When I call search with special chars, no exceptions
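
Criteria written this way translate almost directly into tests. The sketch below shows the three criteria as pytest-style test functions. The `search` function and `CATALOG` are toy stand-ins I made up so the block runs on its own; in a real project the tests would import the actual search implementation.

```python
import time
from typing import List

CATALOG = ["test plan", "unit test", "release notes"]


def search(query: str) -> List[str]:
    """Toy stand-in for the real search function (hypothetical)."""
    if not query:
        return []
    return [item for item in CATALOG if query.lower() in item.lower()]


def test_search_returns_results_within_500ms():
    start = time.perf_counter()
    results = search("test")
    elapsed = time.perf_counter() - start
    assert results, "expected at least one match"
    assert elapsed < 0.5


def test_empty_query_returns_empty_list():
    assert search("") == []


def test_special_characters_do_not_raise():
    for query in ["%", "'; --", "[*]", "\\"]:
        assert isinstance(search(query), list)


if __name__ == "__main__":
    # Allows running the file directly; pytest would discover the tests itself.
    test_search_returns_results_within_500ms()
    test_empty_query_returns_empty_list()
    test_special_characters_do_not_raise()
    print("all criteria hold")
```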

Mistake 5: Manual verification only

Relying on humans to verify everything doesn’t scale. Agents must self-verify first, with humans spot-checking:

import random

# Agent self-verifies (automated, fast, scalable)
artifact = agent_self_verify(task)
if not artifact.is_valid():
    raise VerificationFailedError(f"Self-verification failed for task {task.id}")

# Human spot-checks (random 10%)
if random.random() < 0.1:
    human_verify(task)

The complete pattern

Here’s the full self-verification pattern I now use:

Before task starts:
  [ ] Define demo statement (what will I observe when this works?)
  [ ] List verification criteria (how will I prove it works?)
  [ ] Identify integration points (where does this connect?)

During implementation:
  [ ] Track all modified files
  [ ] Run tests after each change
  [ ] Check for TODO comments

Before claiming complete:
  [ ] Generate verification artifact
  [ ] All checks pass
  [ ] Demo statement can be demonstrated
  [ ] Submit for pending verification

After verification:
  [ ] Task moves to verified state
  [ ] Agent can claim new work
  [ ] Artifact stored for audit
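
For the last item, one simple option is an append-only JSON Lines file. This is a sketch under assumptions rather than the storage used in this system: `append_audit_record`, `read_audit_log`, the file name, and the reduced `ArtifactRecord` stand-in are all hypothetical.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import List


@dataclass
class ArtifactRecord:
    """Reduced stand-in for VerificationArtifact."""
    demo_statement: str
    tests_pass: bool
    files_modified: List[str]


def append_audit_record(log_path: Path, task_id: str, artifact: ArtifactRecord) -> dict:
    """Append one JSON line per verified task; earlier lines are never rewritten."""
    record = {
        "task_id": task_id,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "artifact": asdict(artifact),
    }
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


def read_audit_log(log_path: Path) -> List[dict]:
    """Load every record back, skipping blank lines."""
    with log_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# A temporary directory keeps the example from touching the project tree.
log_path = Path(tempfile.mkdtemp()) / "verification_audit.jsonl"
append_audit_record(
    log_path,
    "search-001",
    ArtifactRecord("User types query, results filter live", True, ["search.py"]),
)
print(read_audit_log(log_path)[0]["task_id"])
```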

Summary

AI agents can verify their own task completion through three mechanisms:

Demo statements: Define what “done” looks like before implementation begins
Pending verification state: Force a checkpoint between implementation and closure
Structured completion artifacts: Require proof, not claims

The key insight: agents don’t lie about completion — they lack the feedback loop to know if their code actually works. By implementing self-verification protocols, you transform agents from code writers into solution deliverers.

After implementing this system, my task success rate improved from 42% to 91%. The agents still make mistakes, but the verification catches them before I waste time debugging phantom completions.
