
Testing AI Agents: Evidence Collectors, Reality Checkers, and Accessibility Auditors

Purpose

I used to treat testing as an afterthought. After building a feature, I’d run through some quick manual checks, write a few unit tests, and call it done. Then production bugs happened. Users found edge cases I never considered. Accessibility violations I didn’t know existed.

The problem wasn’t that I didn’t test. The problem was that I tested like a developer who built the feature - I knew how it was supposed to work, so I tested the happy path. What I needed was someone who would actively try to break things.

This post covers the 8 specialized testing agents in the Agency’s Testing Division. These agents are designed as quality gates, not afterthoughts. They default to finding problems, require visual proof, and provide certification-style outputs.

The Problem with Afterthought Testing

When I tested my own code, I noticed a pattern:

| Testing Type | My Approach | Result |
|--------------|-------------|--------|
| Visual QA | Click through the UI | Missed layout issues on different screens |
| Performance | Check if it loads | Didn’t measure under load |
| Accessibility | “Seems accessible” | WCAG violations in production |
| API Testing | Happy path only | Edge cases crashed the backend |

I think the root cause is that testing requires a different mindset than building. A builder wants things to work. A tester wants things to break. Generic AI assistants often default to the builder mindset - they confirm functionality rather than challenge it.

The Testing Division Overview

The Testing Division contains 8 specialized agents organized into functional areas:

Testing Division organization
+-------------------------------------------------------------------------+
|                       Testing Division (8 Agents)                       |
|                "Breaking things so users don't have to"                 |
+-------------------------------------------------------------------------+
|                                                                         |
|  Visual & Certification              Test Analysis                      |
|  ----------------------              -------------                      |
|  - Evidence Collector                - Test Results Analyzer            |
|  - Reality Checker                                                      |
|                                      Decision Support                   |
|  Performance & Infrastructure        ----------------                   |
|  ----------------------------        - Tool Evaluator                   |
|  - Performance Benchmarker           - Workflow Optimizer               |
|  - API Tester                                                           |
|                                                                         |
|  Compliance                                                             |
|  ----------                                                             |
|  - Accessibility Auditor                                                |
|                                                                         |
+-------------------------------------------------------------------------+

Visual & Certification Specialists

Evidence Collector

This agent embodies a rigorous testing philosophy: “I don’t just test your code - I default to finding 3-5 issues and require visual proof for everything.”

I used the Evidence Collector for a recent feature release. Instead of a generic “looks good” response, it provided:

Evidence Collector output example
## Evidence Report: User Dashboard Feature
### Issue 1: Responsive Breakpoint
- Screenshot: dashboard-breakpoint-375px.png
- Location: /dashboard at 375px viewport
- Problem: Sidebar overlaps main content below 480px
- Severity: Medium
- Expected: Sidebar should collapse to hamburger menu
- Actual: Sidebar remains visible, covering 40% of content
### Issue 2: Loading State Missing
- Screenshot: dashboard-loading-async.png
- Location: /dashboard -> Projects widget
- Problem: No loading indicator during async fetch
- Severity: Low
- Expected: Skeleton or spinner while data loads
- Actual: Empty space, then sudden content appearance
### Issue 3: Focus Trap in Modal
- Screenshot: settings-modal-focus-trap.png
- Location: /dashboard -> Settings -> Edit Profile modal
- Problem: Tab key escapes modal without closing
- Severity: High (Accessibility)
- Expected: Focus trapped within modal until closed
- Actual: Focus returns to background elements
### Visual Verification Checklist
[ ] Dashboard renders correctly on 320px viewport
[ ] Dashboard renders correctly on 768px viewport
[ ] Dashboard renders correctly on 1440px viewport
[ ] All interactive elements have visible focus states
[ ] Charts display correctly with 0 data points
[ ] Charts display correctly with 1000+ data points

The key insight: the agent didn’t just verify functionality. It actively looked for problems. Every issue came with a screenshot. Every finding was actionable.
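The two rules behind this report (a minimum number of findings, and a screenshot for every finding) are simple enough to check mechanically. The sketch below is only an illustration: the `EvidenceIssue` shape and the `checkEvidenceReport` helper are hypothetical names modeled on the fields in the report above, not the agent's actual schema.

```typescript
// Hypothetical shape for one finding, mirroring the fields of the report above.
interface EvidenceIssue {
  title: string;
  screenshot: string; // file name of the visual proof; empty if none was captured
  location: string;
  severity: "Low" | "Medium" | "High";
}

// Returns the rule violations found; an empty array means the report is acceptable.
function checkEvidenceReport(issues: EvidenceIssue[], minIssues: number = 3): string[] {
  const problems: string[] = [];
  if (issues.length < minIssues) {
    problems.push("Expected at least " + minIssues + " issues, got " + issues.length);
  }
  for (const issue of issues) {
    if (issue.screenshot.trim() === "") {
      problems.push("Missing screenshot: " + issue.title);
    }
  }
  return problems;
}

// The three findings from the dashboard report.
const dashboardIssues: EvidenceIssue[] = [
  { title: "Responsive Breakpoint", screenshot: "dashboard-breakpoint-375px.png", location: "/dashboard at 375px viewport", severity: "Medium" },
  { title: "Loading State Missing", screenshot: "dashboard-loading-async.png", location: "/dashboard -> Projects widget", severity: "Low" },
  { title: "Focus Trap in Modal", screenshot: "settings-modal-focus-trap.png", location: "/dashboard -> Settings -> Edit Profile modal", severity: "High" },
];

console.log(checkEvidenceReport(dashboardIssues)); // no violations for the report above
```

A check like this could run before a report is accepted, so that a finding without visual proof is sent back rather than filed.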

Reality Checker

I used this agent for production readiness certification. It provides formal sign-off with pass/fail criteria:

Reality Checker certification template
## Production Readiness Certification
### Application: User Dashboard v2.1
### Certification Date: 2026-03-17
### Certifier: Reality Checker Agent
---
## Pass/Fail Criteria
| Criterion | Status | Evidence |
|-----------|--------|----------|
| Error handling graceful | PASS | All API failures show user-friendly messages |
| Loading states implemented | PASS | All async operations have loading indicators |
| Empty states handled | FAIL | Missing state for zero projects |
| Authentication required | PASS | Unauthenticated users redirected to login |
| Authorization enforced | PASS | Role-based access verified for all routes |
| Input validation | PASS | All forms validate before submission |
| Responsive design | PASS | Tested at 320px, 768px, 1440px viewports |
| Accessibility scan | FAIL | 3 WCAG violations detected |
## Blocking Issues
1. Empty state for zero projects (UX)
2. WCAG violations: missing alt text, insufficient contrast, missing ARIA labels
## Certification Status: NOT READY FOR PRODUCTION
### Required Actions Before Release
1. Implement empty state component for projects list
2. Add alt text to dashboard icons (3 instances)
3. Increase contrast ratio for secondary text (#888888 -> #666666)
4. Add aria-label to icon-only buttons (5 instances)
### Re-certification Required: YES

This certification approach forces a decision. It doesn’t say “maybe fix these things.” It says “NOT READY FOR PRODUCTION” with explicit blocking issues.
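The decision rule in that certificate is strict: a single FAIL blocks the release. Here is a minimal sketch of that rule, assuming a hypothetical `GateCriterion` type and `certify` function. The "READY FOR PRODUCTION" label for the passing case is also an assumption, since the report above only shows the failing verdict.

```typescript
type GateStatus = "PASS" | "FAIL";

// Hypothetical row of the pass/fail table.
interface GateCriterion {
  label: string;
  status: GateStatus;
  evidence: string;
}

interface CertificationResult {
  ready: boolean;
  verdict: string;
  blocking: string[]; // labels of every failed criterion
}

// Any failed criterion blocks the release; an empty criteria list never certifies.
function certify(criteria: GateCriterion[]): CertificationResult {
  const blocking = criteria.filter((c) => c.status === "FAIL").map((c) => c.label);
  const ready = criteria.length > 0 && blocking.length === 0;
  return {
    ready: ready,
    verdict: ready ? "READY FOR PRODUCTION" : "NOT READY FOR PRODUCTION",
    blocking: blocking,
  };
}

// The eight criteria from the certificate above.
const dashboardCriteria: GateCriterion[] = [
  { label: "Error handling graceful", status: "PASS", evidence: "All API failures show user-friendly messages" },
  { label: "Loading states implemented", status: "PASS", evidence: "All async operations have loading indicators" },
  { label: "Empty states handled", status: "FAIL", evidence: "Missing state for zero projects" },
  { label: "Authentication required", status: "PASS", evidence: "Unauthenticated users redirected to login" },
  { label: "Authorization enforced", status: "PASS", evidence: "Role-based access verified for all routes" },
  { label: "Input validation", status: "PASS", evidence: "All forms validate before submission" },
  { label: "Responsive design", status: "PASS", evidence: "Tested at 320px, 768px, 1440px viewports" },
  { label: "Accessibility scan", status: "FAIL", evidence: "3 WCAG violations detected" },
];

const dashboardCertification = certify(dashboardCriteria);
console.log(dashboardCertification.verdict, dashboardCertification.blocking);
```

Six passes out of eight still yields a failing verdict here, which is the point: the gate has no "mostly ready" state.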

Test Analysis

Test Results Analyzer

I used this agent to make sense of test output that would otherwise be noise. Instead of dumping raw test results, it provides actionable analysis:

Test health dashboard output
## Test Health Dashboard
### Summary
- Total Tests: 847
- Pass Rate: 94.1% (797 passed)
- Flaky Tests: 23 (2.7%)
- Coverage: 78.3%
### Flaky Test Analysis
| Test | Flakiness Rate | Root Cause |
|------|----------------|------------|
| test_user_login | 34% | Race condition in async auth |
| test_search_results | 28% | External API timeout not mocked |
| test_file_upload | 22% | File system state not isolated |
### Coverage Gaps
1. **auth/password_reset.ts** - 0% coverage
- Recommendation: Add unit tests for reset flow
2. **api/webhooks.ts** - 12% coverage
- Recommendation: Mock external services, test handlers
3. **components/Chart.tsx** - 45% coverage
- Recommendation: Test edge cases (empty data, overflow)
### Recommended Actions
1. Fix flaky tests before adding new tests
2. Target 80% coverage in identified gaps
3. Add integration tests for auth flow

The analyzer transforms raw test output into a health dashboard. It identifies patterns I would miss - like the connection between flaky tests and unmocked external APIs.
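The dashboard does not say how the flakiness rate is calculated. One common definition, assumed here, is the share of failing runs when the same test is repeated against unchanged code. The helper names below are hypothetical.

```typescript
// Each entry is one run of the same test against unchanged code: true = passed.
type RunHistory = boolean[];

// Share of runs that failed, between 0 and 1.
function flakinessRate(history: RunHistory): number {
  if (history.length === 0) return 0;
  const failures = history.filter((passed) => !passed).length;
  return failures / history.length;
}

// A test is flaky when it both passes and fails on the same code.
function isFlaky(history: RunHistory): boolean {
  const rate = flakinessRate(history);
  return rate > 0 && rate < 1;
}

// Percentage rounded to one decimal place, as shown in the dashboard summary.
function percentOf(part: number, total: number): number {
  return Math.round((part / total) * 1000) / 10;
}

// Builds a synthetic history: the first `failures` runs fail, the rest pass.
function buildHistory(runs: number, failures: number): RunHistory {
  const history: RunHistory = [];
  for (let i = 0; i < runs; i++) {
    history.push(i >= failures);
  }
  return history;
}

const loginHistory = buildHistory(100, 34); // illustrative data, not real run logs
console.log(flakinessRate(loginHistory), isFlaky(loginHistory));
console.log(percentOf(23, 847)); // 23 flaky tests out of 847 total
```

Under this definition, a test that fails on every run is broken rather than flaky, which is why `isFlaky` excludes a rate of exactly 1.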

Performance & Infrastructure

Performance Benchmarker

I used this agent to move beyond “it feels fast” to actual metrics:

Performance benchmark report
## Performance Benchmark Report
### Application: User Dashboard v2.1
### Test Environment: Chrome 122, Desktop, 100Mbps
---
## Load Time Metrics
| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| First Contentful Paint | < 1.5s | 1.2s | PASS |
| Largest Contentful Paint | < 2.5s | 2.8s | FAIL |
| Time to Interactive | < 3.5s | 3.1s | PASS |
| Cumulative Layout Shift | < 0.1 | 0.15 | FAIL |
| Total Blocking Time | < 300ms | 180ms | PASS |
## Bottleneck Analysis
### 1. Largest Contentful Paint (2.8s)
Root Cause: Dashboard chart loads 2.4MB of historical data
Recommendation:
- Implement data pagination (last 30 days by default)
- Add skeleton loading for chart area
- Estimated improvement: 1.5s LCP reduction
### 2. Cumulative Layout Shift (0.15)
Root Cause: Dynamic content loads without reserved space
Affected Elements:
- Projects list (shifts 120px when loaded)
- Activity feed (shifts 80px when loaded)
Recommendation:
- Add min-height to containers
- Use skeleton placeholders
- Estimated improvement: CLS to 0.05
## Load Test Results
| Concurrent Users | Avg Response Time | Error Rate |
|-----------------|-------------------|------------|
| 10 | 120ms | 0% |
| 50 | 340ms | 0% |
| 100 | 580ms | 0.1% |
| 500 | 2.1s | 2.3% |
| 1000 | 4.8s | 12.7% |
### Capacity Recommendation
- Safe concurrent limit: 100 users
- Scale-out trigger: 80 users
- Required for 500+ users: Horizontal scaling, connection pooling

The benchmarker doesn’t just report numbers. It explains why metrics fail and provides specific remediation steps.
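Both tables in that report reduce to threshold comparisons. The sketch below uses hypothetical helpers to reproduce the PASS/FAIL column and to derive a safe concurrency figure. The error-rate and latency budgets passed to `safeConcurrency` are assumptions chosen for illustration, because the report does not state how its 100-user limit was derived.

```typescript
interface VitalMeasurement {
  metric: string;
  target: number; // upper bound; the measured value must stay below it
  actual: number;
}

// Names of the metrics that miss their target.
function failingVitals(measurements: VitalMeasurement[]): string[] {
  return measurements.filter((m) => m.actual >= m.target).map((m) => m.metric);
}

interface LoadStep {
  users: number;
  avgResponseMs: number;
  errorRatePercent: number;
}

// Largest tested user count reached before any budget is exceeded.
function safeConcurrency(steps: LoadStep[], maxErrorRatePercent: number, maxResponseMs: number): number {
  const ordered = steps.slice().sort((a, b) => a.users - b.users);
  let safe = 0;
  for (const step of ordered) {
    if (step.errorRatePercent > maxErrorRatePercent || step.avgResponseMs > maxResponseMs) break;
    safe = step.users;
  }
  return safe;
}

// Values copied from the report above (seconds, unitless CLS, milliseconds).
const dashboardVitals: VitalMeasurement[] = [
  { metric: "First Contentful Paint", target: 1.5, actual: 1.2 },
  { metric: "Largest Contentful Paint", target: 2.5, actual: 2.8 },
  { metric: "Time to Interactive", target: 3.5, actual: 3.1 },
  { metric: "Cumulative Layout Shift", target: 0.1, actual: 0.15 },
  { metric: "Total Blocking Time", target: 300, actual: 180 },
];

const dashboardLoadSteps: LoadStep[] = [
  { users: 10, avgResponseMs: 120, errorRatePercent: 0 },
  { users: 50, avgResponseMs: 340, errorRatePercent: 0 },
  { users: 100, avgResponseMs: 580, errorRatePercent: 0.1 },
  { users: 500, avgResponseMs: 2100, errorRatePercent: 2.3 },
  { users: 1000, avgResponseMs: 4800, errorRatePercent: 12.7 },
];

console.log(failingVitals(dashboardVitals));
// Assumed budgets: at most 0.1% errors and 1000 ms average response time.
console.log(safeConcurrency(dashboardLoadSteps, 0.1, 1000));
```

With those assumed budgets the function lands on the same 100-user limit as the report, but a stricter latency budget would lower it.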

API Tester

This agent validates endpoints beyond the happy path:

API test suite structure
## API Test Suite: User Dashboard Endpoints
### Base URL: /api/v1/dashboard
---
### GET /dashboard
| Test Case | Request | Expected | Status |
|-----------|---------|----------|--------|
| Auth required | No token | 401 Unauthorized | PASS |
| Valid token | Bearer token | 200 OK | PASS |
| Expired token | Expired JWT | 401 Unauthorized | PASS |
| Malformed token | Invalid format | 401 Unauthorized | PASS |
| Rate limit | 101 req/min | 429 Too Many Requests | PASS |
### POST /dashboard/settings
| Test Case | Request Body | Expected | Status |
|-----------|--------------|----------|--------|
| Valid update | {theme: "dark"} | 200 OK | PASS |
| Invalid theme | {theme: "invalid"} | 400 Bad Request | PASS |
| Extra fields | {theme: "dark", extra: 1} | 400 Bad Request | FAIL |
| Empty body | {} | 400 Bad Request | PASS |
| SQL injection | {theme: "'; DROP TABLE"} | 400 Bad Request | PASS |
### Contract Validation
- OpenAPI spec: /api/docs/openapi.yaml
- Contract tests: 47/50 PASS
- Missing response codes: 204, 422
### Integration Tests
- Database connectivity: PASS
- Cache invalidation: PASS
- Event publishing: FAIL (missing event for settings.update)

The API tester found issues I didn’t know existed - like the server accepting extra fields and missing event publishing for settings updates.
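The failing "Extra fields" case means the handler ignores unknown properties instead of rejecting them. A strict validator along the following lines would make that case return 400. This is a framework-independent sketch: `validateSettingsBody` is a hypothetical function, and the list of allowed themes is an assumption, since the test suite only shows that "dark" is accepted and "invalid" is rejected.

```typescript
type SettingsValidation =
  | { ok: true; theme: string }
  | { ok: false; status: 400; error: string };

const ALLOWED_FIELDS = ["theme"];
const ALLOWED_THEMES = ["light", "dark"]; // assumed set of valid themes

// Rejects non-objects, unknown fields, and missing or invalid themes.
function validateSettingsBody(body: unknown): SettingsValidation {
  if (typeof body !== "object" || body === null || Array.isArray(body)) {
    return { ok: false, status: 400, error: "Body must be a JSON object" };
  }
  const record = body as { [key: string]: unknown };
  const unknownFields = Object.keys(record).filter((key) => ALLOWED_FIELDS.indexOf(key) === -1);
  if (unknownFields.length > 0) {
    return { ok: false, status: 400, error: "Unknown field(s): " + unknownFields.join(", ") };
  }
  const theme = record["theme"];
  if (typeof theme !== "string" || ALLOWED_THEMES.indexOf(theme) === -1) {
    return { ok: false, status: 400, error: "Invalid or missing theme" };
  }
  return { ok: true, theme: theme };
}

// The five request bodies from the POST /dashboard/settings table.
console.log(validateSettingsBody({ theme: "dark" }));
console.log(validateSettingsBody({ theme: "invalid" }));
console.log(validateSettingsBody({ theme: "dark", extra: 1 }));
console.log(validateSettingsBody({}));
console.log(validateSettingsBody({ theme: "'; DROP TABLE" }));
```

Checking against an allow-list also covers the SQL injection case without any special handling, because the payload is simply not a known theme.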

Decision Support

Tool Evaluator

When I needed to choose a testing framework, this agent provided structured comparison:

Tool evaluation matrix
## Testing Framework Evaluation
### Candidates: Jest, Vitest, Mocha
### Criteria: Startup MVP, React + TypeScript, E2E needed
---
| Criterion | Jest | Vitest | Mocha |
|-----------|------|--------|-------|
| TypeScript support | Via Babel or ts-jest | Native | Requires config |
| ESM support | Config required | Native | Native |
| Speed (1000 tests) | 12s | 3s | 8s |
| Watch mode | Yes | Yes | Requires plugin |
| Coverage built-in | Yes | Yes | Requires Istanbul |
| React Testing Library | Excellent | Excellent | Good |
| E2E integration | Playwright/Jest | Playwright/Vitest | Multiple options |
| Community size | Large | Growing | Large |
| Learning curve | Medium | Low | Medium |
### Recommendation: Vitest
**Reasons:**
1. 4x faster test runs (critical for TDD workflow)
2. Native ESM support matches modern React setup
3. Drop-in replacement for Jest in most cases
4. Built-in coverage, no extra configuration
5. Vite-native, matches build toolchain
**Migration path from Jest:**
1. Install vitest, @vitest/ui
2. Update package.json scripts
3. Rename jest.config.js to vitest.config.ts
4. Most tests run without modification
**Estimated setup time:** 2 hours
**Estimated migration time:** 4-8 hours for typical React project

Workflow Optimizer

This agent analyzes testing processes for efficiency gains:

Testing workflow analysis
## Current Testing Workflow Analysis
### Time Breakdown (per PR)
- Unit test execution: 3 minutes
- Integration test setup: 5 minutes
- Integration test execution: 8 minutes
- E2E test execution: 15 minutes
- Manual QA review: 20 minutes
- **Total: 51 minutes per PR**
### Bottlenecks Identified
1. **Sequential execution**
- Unit, integration, E2E run one after another
- Parallel execution possible: Save 11 minutes
2. **Full E2E suite for every PR**
- Only 30% of E2E tests affected by typical change
- Selective test selection possible: Save 10 minutes
3. **Manual QA review**
- Many checks can be automated with visual testing
- Automated visual regression: Save 15 minutes
### Optimized Workflow
```text title="Optimized testing pipeline"
     PR Created
          |
          v
+-------------------+
| Parallel Execution|
+-------------------+
| Unit Tests (3m)   |
| Integration (8m)  |
| Selective E2E (5m)|
+-------------------+
          |
          v
+-------------------+
| Visual Regression |
| (Automated)       |
+-------------------+
          |
          v
+-------------------+
| Reality Checker   |
| (Only on fail)    |
+-------------------+
          |
          v
  Ready for Review
```
Total: ~16 minutes per PR

Savings: 35 minutes per PR (69% reduction)
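The headline numbers can be checked with a few lines of arithmetic. The snippet below takes the step durations and the ~16 minute optimized total from the analysis as given. The report does not itemize the time spent after the parallel stage, so only that first stage is computed here.

```typescript
// Current per-PR durations in minutes, as listed in the time breakdown.
const currentStepMinutes = {
  unitTests: 3,
  integrationSetup: 5,
  integrationTests: 8,
  e2eTests: 15,
  manualQa: 20,
};

const currentTotalMinutes =
  currentStepMinutes.unitTests +
  currentStepMinutes.integrationSetup +
  currentStepMinutes.integrationTests +
  currentStepMinutes.e2eTests +
  currentStepMinutes.manualQa;

// Optimized total as stated in the report.
const optimizedTotalMinutes = 16;

const savedMinutes = currentTotalMinutes - optimizedTotalMinutes;
const reductionPercent = Math.round((savedMinutes / currentTotalMinutes) * 100);

// Jobs that run in parallel take as long as the slowest one:
// unit tests (3m), integration (8m), selective E2E (5m).
const parallelStageMinutes = Math.max(3, 8, 5);

console.log(currentTotalMinutes, savedMinutes, reductionPercent, parallelStageMinutes);
// prints: 51 35 69 8
```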

Compliance

Accessibility Auditor

This agent caught WCAG violations I would have missed entirely:

WCAG audit report
## WCAG 2.1 AA Audit Report
### Application: User Dashboard v2.1
### Scope: All user-facing pages
### Audit Date: 2026-03-17
---
## Violations Summary
| Level | Count | Impact |
|-------|-------|--------|
| Critical | 2 | Blocks users completely |
| Serious | 5 | Significant barriers |
| Moderate | 8 | Frustrating but usable |
| Minor | 12 | Low impact |
## Critical Violations
### 1. Keyboard Navigation Blocked (2.1.1)
**Location:** Dashboard -> Settings modal
**Issue:** Modal opens but cannot be closed with keyboard
**Impact:** Keyboard users cannot escape modal
**Code:**
```javascript
// Current (broken)
<Button onClick={closeModal}>X</Button>
// Required fix
<Button
  onClick={closeModal}
  aria-label="Close settings"
  onKeyDown={(e) => {
    if (e.key === 'Escape' || e.key === 'Enter') closeModal();
  }}
>
  X
</Button>
```

### 2. Form Missing Labels (3.3.2)
**Location:** Dashboard -> Edit Profile -> Email field
**Issue:** Input has no visible label or aria-label
**Impact:** Screen readers announce "edit text" with no context
**Remediation:** Add visible label or aria-label
## Serious Violations
### 3. Insufficient Color Contrast (1.4.3)
**Affected:** 47 text elements
**Current ratio:** 3.2:1 average
**Required:** 4.5:1 for normal text, 3:1 for large text
**Remediation:** Increase contrast ratios
### 4. Missing Alt Text (1.1.1)
**Affected:** 12 images, 8 icons
**Impact:** Screen readers skip meaningful content
**Remediation:** Add descriptive alt text
### 5. Focus Not Visible (2.4.7)
**Affected:** 23 interactive elements
**Issue:** Focus ring removed with outline: none
**Remediation:** Add visible focus indicator
## Remediation Plan
| Week | Focus | Violations Addressed |
|------|-------|----------------------|
| 1 | Critical | 2 |
| 2 | Serious | 5 |
| 3 | Moderate | 8 |
| 4 | Minor + Retest | 12 |
**Estimated effort:** 40 hours
**Legal risk if unfixed:** High (ADA compliance)

The Accessibility Auditor didn't just find problems. It provided specific code fixes, prioritized by impact, with a remediation timeline.
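
The contrast requirement (4.5:1 for normal text) is something you can verify yourself. The sketch below implements the contrast ratio formula from the WCAG 2.x definition of relative luminance and applies it to the #888888 -> #666666 change from the certification report earlier in this post. The white background is an assumption, as the reports do not say what the text sits on, and the function names are illustrative.

```typescript
// Converts one 8-bit sRGB channel to its linear value (WCAG 2.x relative luminance).
function linearChannel(value8bit: number): number {
  const s = value8bit / 255;
  return s <= 0.03928 ? s / 12.92 : Math.pow((s + 0.055) / 1.055, 2.4);
}

// Relative luminance of a 6-digit hex color such as "#666666".
function relativeLuminance(hex: string): number {
  const clean = hex.charAt(0) === "#" ? hex.slice(1) : hex;
  const r = parseInt(clean.slice(0, 2), 16);
  const g = parseInt(clean.slice(2, 4), 16);
  const b = parseInt(clean.slice(4, 6), 16);
  return 0.2126 * linearChannel(r) + 0.7152 * linearChannel(g) + 0.0722 * linearChannel(b);
}

// Contrast ratio between two colors, from 1 (identical) to 21 (black on white).
function contrastRatio(foreground: string, background: string): number {
  const l1 = relativeLuminance(foreground);
  const l2 = relativeLuminance(background);
  const lighter = Math.max(l1, l2);
  const darker = Math.min(l1, l2);
  return (lighter + 0.05) / (darker + 0.05);
}

// WCAG 2.1 AA threshold for normal-size text.
function passesAaNormalText(foreground: string, background: string): boolean {
  return contrastRatio(foreground, background) >= 4.5;
}

const assumedBackground = "#FFFFFF"; // assumption: secondary text on a white surface
console.log(contrastRatio("#888888", assumedBackground).toFixed(2)); // about 3.5, below 4.5
console.log(contrastRatio("#666666", assumedBackground).toFixed(2)); // about 5.7, above 4.5
```

On a white background the original gray fails AA for normal text and the replacement passes, which is consistent with the remediation the agents asked for.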

The Quality Gate Workflow Pattern

I learned to insert testing agents as checkpoints between phases:

Quality gate workflow pattern
  [Development Agent]
           |
           v
+---------------------+
| Evidence Collector  | <-- Screenshots, visual proof
+---------------------+
           |
           v
+---------------------+
| Reality Checker     | <-- Certification, pass/fail
+---------------------+
           |
           v
  [Release Approved]

This pattern appears in multiple README scenarios as the standard quality gate. The key insight: don’t move forward until the gate passes.
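
In code terms, "don't move forward until the gate passes" is a sequence that stops at the first failure. The sketch below models that with stubbed gates. How the real agents are invoked is not covered in this post, so the `QualityGate` interface and `runQualityGates` function are purely illustrative.

```typescript
interface GateOutcome {
  passed: boolean;
  notes: string[];
}

// A gate is a label plus a function that performs the check (stubbed here).
interface QualityGate {
  label: string;
  run: () => GateOutcome;
}

interface PipelineResult {
  approved: boolean;
  stoppedAt: string | null; // label of the failing gate, or null if all passed
  executed: string[];       // gates that actually ran, in order
}

// Runs gates in order and stops at the first one that fails.
function runQualityGates(gates: QualityGate[]): PipelineResult {
  const executed: string[] = [];
  for (const gate of gates) {
    executed.push(gate.label);
    const outcome = gate.run();
    if (!outcome.passed) {
      return { approved: false, stoppedAt: gate.label, executed: executed };
    }
  }
  return { approved: true, stoppedAt: null, executed: executed };
}

// Stub gates standing in for the two agents in the diagram.
const releaseGates: QualityGate[] = [
  { label: "Evidence Collector", run: () => ({ passed: true, notes: ["3 issues fixed and re-verified"] }) },
  { label: "Reality Checker", run: () => ({ passed: false, notes: ["Empty state missing"] }) },
];

const releaseResult = runQualityGates(releaseGates);
console.log(releaseResult.approved, releaseResult.stoppedAt);
```

Returning the list of executed gates makes it visible that nothing after the failing gate ran, which keeps a failed certification from being quietly skipped.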

The Evidence Collector Philosophy

What makes this agent different from generic testing approaches:

  1. Default expectation: Find 3-5 issues - Not “check if it works” but “actively find problems”
  2. Visual requirement: Every issue has a screenshot - No vague descriptions
  3. Verification: Don’t just report - verify the fix - Follow through to completion

I tested this philosophy on a real project. When I asked a generic AI to “test my dashboard,” it confirmed everything worked. When I deployed the Evidence Collector, it found 7 issues I hadn’t considered - from responsive breakpoints to loading states to focus traps.

Why Specialized Testing Agents Matter

| Metric | Generic AI Testing | Specialized Agent |
|--------|--------------------|-------------------|
| Issues found | 1-2 per review | 3-5 per review |
| Issue documentation | Text description | Screenshot + code location |
| Pass/fail clarity | Maybe fix these | NOT READY + blocking issues |
| Accessibility coverage | “Seems accessible” | WCAG 2.1 AA compliance scan |
| Performance metrics | “It’s fast” | LCP, FCP, CLS with baselines |

The difference: Specialized agents have explicit criteria and default to finding problems. Generic AI assistants default to confirming functionality.

How I Use Testing Agents Now

My current workflow:

Testing agent deployment
Feature Complete
|
v
Evidence Collector (visual QA, 5-10 min)
|
v
Performance Benchmarker (load test, 15 min)
|
v
Accessibility Auditor (WCAG scan, 10 min)
|
v
Reality Checker (certification, 5 min)
|
v
Production Release

Total testing time: ~35 minutes per feature. But I’ve caught more bugs in that 35 minutes than I used to find in a full day of manual testing.

Summary

The Testing Division provides specialized QA agents designed as quality gates. The Evidence Collector requires visual proof for every issue. The Reality Checker provides production readiness certification with explicit pass/fail criteria. The Accessibility Auditor ensures WCAG compliance with specific remediation steps.

The key insight: these agents default to finding problems, not confirming functionality. They provide screenshot evidence, certification reports, and actionable remediation plans. Deploy them as checkpoints in your workflow, not afterthoughts at the end.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
