Testing AI Agents: Evidence Collectors, Reality Checkers, and Accessibility Auditors
Purpose
I used to treat testing as an afterthought. After building a feature, I’d run through some quick manual checks, write a few unit tests, and call it done. Then production bugs happened. Users found edge cases I never considered. Accessibility violations I didn’t know existed.
The problem wasn’t that I didn’t test. The problem was that I tested like a developer who built the feature - I knew how it was supposed to work, so I tested the happy path. What I needed was someone who would actively try to break things.
This post covers the 8 specialized testing agents in the Agency’s Testing Division. These agents are designed as quality gates, not afterthoughts. They default to finding problems, require visual proof, and provide certification-style outputs.
The Problem with Afterthought Testing
When I tested my own code, I noticed a pattern:
| Testing Type | My Approach | Result |
|---|---|---|
| Visual QA | Click through the UI | Missed layout issues on different screens |
| Performance | Check if it loads | Didn’t measure under load |
| Accessibility | “Seems accessible” | WCAG violations in production |
| API Testing | Happy path only | Edge cases crashed the backend |
I think the root cause is that testing requires a different mindset than building. A builder wants things to work. A tester wants things to break. Generic AI assistants often default to the builder mindset - they confirm functionality rather than challenge it.
The Testing Division Overview
The Testing Division contains 8 specialized agents organized into functional areas:
+-------------------------------------------------------------------------+
|                       Testing Division (8 Agents)                       |
|                "Breaking things so users don't have to"                |
+-------------------------------------------------------------------------+
|                                                                         |
|  Visual & Certification            Test Analysis                        |
|  ----------------------            ------------------                   |
|  - Evidence Collector              - Test Results Analyzer              |
|  - Reality Checker                                                      |
|                                    Decision Support                     |
|  Performance & Infrastructure      ------------------------             |
|  -----------------------------     - Tool Evaluator                     |
|  - Performance Benchmarker         - Workflow Optimizer                 |
|  - API Tester                                                           |
|                                                                         |
|  Compliance                                                             |
|  ----------                                                             |
|  - Accessibility Auditor                                                |
|                                                                         |
+-------------------------------------------------------------------------+
Visual & Certification Specialists
Evidence Collector
This agent embodies a rigorous testing philosophy: “I don’t just test your code - I default to finding 3-5 issues and require visual proof for everything.”
I used the Evidence Collector for a recent feature release. Instead of a generic “looks good” response, it provided:
## Evidence Report: User Dashboard Feature
### Issue 1: Responsive Breakpoint
- Screenshot: dashboard-breakpoint-375px.png
- Location: /dashboard at 375px viewport
- Problem: Sidebar overlaps main content below 480px
- Severity: Medium
- Expected: Sidebar should collapse to hamburger menu
- Actual: Sidebar remains visible, covering 40% of content

### Issue 2: Loading State Missing
- Screenshot: dashboard-loading-async.png
- Location: /dashboard -> Projects widget
- Problem: No loading indicator during async fetch
- Severity: Low
- Expected: Skeleton or spinner while data loads
- Actual: Empty space, then sudden content appearance

### Issue 3: Focus Trap in Modal
- Screenshot: settings-modal-focus-trap.png
- Location: /dashboard -> Settings -> Edit Profile modal
- Problem: Tab key escapes modal without closing
- Severity: High (Accessibility)
- Expected: Focus trapped within modal until closed
- Actual: Focus returns to background elements

### Visual Verification Checklist
[ ] Dashboard renders correctly on 320px viewport
[ ] Dashboard renders correctly on 768px viewport
[ ] Dashboard renders correctly on 1440px viewport
[ ] All interactive elements have visible focus states
[ ] Charts display correctly with 0 data points
[ ] Charts display correctly with 1000+ data points

The key insight: the agent didn’t just verify functionality. It actively looked for problems. Every issue came with a screenshot. Every finding was actionable.
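One way to make the "visual proof for everything" rule mechanical is to reject any finding that arrives without a screenshot. The following is a minimal sketch of that idea, not the Evidence Collector's actual format: the `EvidenceIssue` shape and the `findUnprovenIssues` helper are hypothetical names, and the second sample finding has its screenshot removed on purpose to show the check firing.

```typescript
// Hypothetical shape for one finding, mirroring the fields in the report above.
type Severity = "Low" | "Medium" | "High";

interface EvidenceIssue {
  title: string;
  screenshot: string | null; // file name of the captured proof, if any
  location: string;
  severity: Severity;
  expected: string;
  actual: string;
}

// Return the titles of findings that cannot be accepted because they
// have no screenshot attached.
function findUnprovenIssues(issues: EvidenceIssue[]): string[] {
  return issues
    .filter((issue) => issue.screenshot === null || issue.screenshot.trim() === "")
    .map((issue) => issue.title);
}

const sampleReport: EvidenceIssue[] = [
  {
    title: "Responsive Breakpoint",
    screenshot: "dashboard-breakpoint-375px.png",
    location: "/dashboard at 375px viewport",
    severity: "Medium",
    expected: "Sidebar should collapse to hamburger menu",
    actual: "Sidebar remains visible, covering 40% of content",
  },
  {
    title: "Loading State Missing",
    screenshot: null, // deliberately missing for this illustration
    location: "/dashboard -> Projects widget",
    severity: "Low",
    expected: "Skeleton or spinner while data loads",
    actual: "Empty space, then sudden content appearance",
  },
];

// Only the second finding is flagged, because it has no proof attached.
const unprovenTitles = findUnprovenIssues(sampleReport);
console.log(unprovenTitles);
```

A check like this could run before a report is accepted, so a finding without evidence is sent back rather than argued about.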
Reality Checker
I used this agent for production readiness certification. It provides formal sign-off with pass/fail criteria:
## Production Readiness Certification
### Application: User Dashboard v2.1
### Certification Date: 2026-03-17
### Certifier: Reality Checker Agent

---

## Pass/Fail Criteria

| Criterion | Status | Evidence |
|-----------|--------|----------|
| Error handling graceful | PASS | All API failures show user-friendly messages |
| Loading states implemented | PASS | All async operations have loading indicators |
| Empty states handled | FAIL | Missing state for zero projects |
| Authentication required | PASS | Unauthenticated users redirected to login |
| Authorization enforced | PASS | Role-based access verified for all routes |
| Input validation | PASS | All forms validate before submission |
| Responsive design | PASS | Tested at 320px, 768px, 1440px viewports |
| Accessibility scan | FAIL | 3 WCAG violations detected |

## Blocking Issues
1. Empty state for zero projects (UX)
2. WCAG violations: missing alt text, insufficient contrast, missing ARIA labels

## Certification Status: NOT READY FOR PRODUCTION

### Required Actions Before Release
1. Implement empty state component for projects list
2. Add alt text to dashboard icons (3 instances)
3. Increase contrast ratio for secondary text (#888888 -> #666666)
4. Add aria-label to icon-only buttons (5 instances)

### Re-certification Required: YES

This certification approach forces a decision. It doesn’t say “maybe fix these things.” It says “NOT READY FOR PRODUCTION” with explicit blocking issues.
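The "forces a decision" behaviour comes down to a simple rule: a single FAIL blocks the release. Here is a minimal sketch of that rule, assuming a hypothetical `Criterion` shape and `certify` function (neither comes from the Reality Checker itself), fed with four rows from the table above.

```typescript
type Status = "PASS" | "FAIL";

// Hypothetical record for one row of the pass/fail table.
interface Criterion {
  name: string;
  status: Status;
  evidence: string;
}

interface Certification {
  ready: boolean;
  verdict: "READY FOR PRODUCTION" | "NOT READY FOR PRODUCTION";
  blocking: string[]; // names of the failed criteria
}

// Any single FAIL blocks the release; there is no "mostly ready" outcome.
function certify(criteria: Criterion[]): Certification {
  const blocking = criteria
    .filter((criterion) => criterion.status === "FAIL")
    .map((criterion) => criterion.name);
  const ready = blocking.length === 0;
  return {
    ready,
    verdict: ready ? "READY FOR PRODUCTION" : "NOT READY FOR PRODUCTION",
    blocking,
  };
}

const dashboardCriteria: Criterion[] = [
  { name: "Error handling graceful", status: "PASS", evidence: "All API failures show user-friendly messages" },
  { name: "Empty states handled", status: "FAIL", evidence: "Missing state for zero projects" },
  { name: "Input validation", status: "PASS", evidence: "All forms validate before submission" },
  { name: "Accessibility scan", status: "FAIL", evidence: "3 WCAG violations detected" },
];

const certification = certify(dashboardCriteria);
console.log(certification.verdict, certification.blocking);
```

Returning the blocking criteria alongside the verdict keeps the output actionable: the release is refused, and the reason is attached.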
Test Analysis
Test Results Analyzer
I used this agent to make sense of test output that would otherwise be noise. Instead of dumping raw test results, it provides actionable analysis:
## Test Health Dashboard
### Summary
- Total Tests: 847
- Pass Rate: 94.1% (797 passed)
- Flaky Tests: 23 (2.7%)
- Coverage: 78.3%
### Flaky Test Analysis
| Test | Flakiness Rate | Root Cause |
|------|----------------|------------|
| test_user_login | 34% | Race condition in async auth |
| test_search_results | 28% | External API timeout not mocked |
| test_file_upload | 22% | File system state not isolated |
### Coverage Gaps
1. **auth/password_reset.ts** - 0% coverage
   - Recommendation: Add unit tests for reset flow
2. **api/webhooks.ts** - 12% coverage
   - Recommendation: Mock external services, test handlers
3. **components/Chart.tsx** - 45% coverage
   - Recommendation: Test edge cases (empty data, overflow)

### Recommended Actions
1. Fix flaky tests before adding new tests
2. Target 80% coverage in identified gaps
3. Add integration tests for auth flow

The analyzer transforms raw test output into a health dashboard. It identifies patterns I would miss - like the connection between flaky tests and unmocked external APIs.
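The report doesn't say how the flakiness rate is calculated. A common definition, and the one assumed in this sketch, is the share of failing runs for a test that both passes and fails against the same code. The `RunHistory` format, the `findFlakyTests` helper, and the sample data are all hypothetical.

```typescript
// Test name -> outcome of each run against the same commit (true = passed).
// This input format is an assumption for the sketch.
type RunHistory = Record<string, boolean[]>;

interface FlakyTest {
  name: string;
  flakinessRate: number; // failed runs / total runs, as a whole percentage
}

function findFlakyTests(history: RunHistory): FlakyTest[] {
  const flaky: FlakyTest[] = [];
  for (const name of Object.keys(history)) {
    const runs = history[name];
    const failures = runs.filter((passed) => !passed).length;
    // Always-pass and always-fail tests are deterministic, not flaky.
    if (failures === 0 || failures === runs.length) continue;
    flaky.push({ name, flakinessRate: Math.round((failures / runs.length) * 100) });
  }
  // Worst offenders first.
  return flaky.sort((a, b) => b.flakinessRate - a.flakinessRate);
}

// Made-up run data: one intermittent test, one stable test, one broken test.
const sampleHistory: RunHistory = {
  test_user_login: [true, false, true, true, false, true, true, false, true, true],
  test_checkout: [true, true, true, true],
  test_legacy_export: [false, false, false],
};

const flakyTests = findFlakyTests(sampleHistory);
console.log(flakyTests);
```

Keeping consistently failing tests out of the flaky list matters: they need a fix, while flaky ones need isolation or mocking, as the root causes in the table suggest.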
Performance & Infrastructure
Performance Benchmarker
I used this agent to move beyond “it feels fast” to actual metrics:
## Performance Benchmark Report
### Application: User Dashboard v2.1
### Test Environment: Chrome 122, Desktop, 100 Mbps

---

## Load Time Metrics

| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| First Contentful Paint | < 1.5s | 1.2s | PASS |
| Largest Contentful Paint | < 2.5s | 2.8s | FAIL |
| Time to Interactive | < 3.5s | 3.1s | PASS |
| Cumulative Layout Shift | < 0.1 | 0.15 | FAIL |
| Total Blocking Time | < 300ms | 180ms | PASS |

## Bottleneck Analysis

### 1. Largest Contentful Paint (2.8s)
Root Cause: Dashboard chart loads 2.4 MB of historical data
Recommendation:
- Implement data pagination (last 30 days by default)
- Add skeleton loading for chart area
- Estimated improvement: 1.5s LCP reduction

### 2. Cumulative Layout Shift (0.15)
Root Cause: Dynamic content loads without reserved space
Affected Elements:
- Projects list (shifts 120px when loaded)
- Activity feed (shifts 80px when loaded)

Recommendation:
- Add min-height to containers
- Use skeleton placeholders
- Estimated improvement: CLS to 0.05

## Load Test Results

| Concurrent Users | Avg Response Time | Error Rate |
|-----------------|-------------------|------------|
| 10 | 120ms | 0% |
| 50 | 340ms | 0% |
| 100 | 580ms | 0.1% |
| 500 | 2.1s | 2.3% |
| 1000 | 4.8s | 12.7% |

### Capacity Recommendation
- Safe concurrent limit: 100 users
- Scale-out trigger: 80 users
- Required for 500+ users: Horizontal scaling, connection pooling

The benchmarker doesn’t just report numbers. It explains why metrics fail and provides specific remediation steps.
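The capacity recommendation can be read as a rule applied to the load test table: pick the highest tested load that stays within a response-time and error-rate budget. The report doesn't state which budgets it used, so the 1000 ms and 0.5% limits in this sketch are assumptions, as is the 80% scale-out rule. They were chosen because they reproduce the reported 100 and 80 users.

```typescript
interface LoadSample {
  users: number;
  avgResponseMs: number;
  errorRatePct: number;
}

// Rows copied from the load test table above.
const loadSamples: LoadSample[] = [
  { users: 10, avgResponseMs: 120, errorRatePct: 0 },
  { users: 50, avgResponseMs: 340, errorRatePct: 0 },
  { users: 100, avgResponseMs: 580, errorRatePct: 0.1 },
  { users: 500, avgResponseMs: 2100, errorRatePct: 2.3 },
  { users: 1000, avgResponseMs: 4800, errorRatePct: 12.7 },
];

// Highest tested load that stays within both budgets.
// Returns 0 when no tested load qualifies.
function safeConcurrentLimit(
  rows: LoadSample[],
  maxResponseMs: number,
  maxErrorRatePct: number
): number {
  let limit = 0;
  for (const row of rows) {
    const withinBudget =
      row.avgResponseMs <= maxResponseMs && row.errorRatePct <= maxErrorRatePct;
    if (withinBudget && row.users > limit) limit = row.users;
  }
  return limit;
}

// Assumed budgets: 1000 ms average response, 0.5% errors.
const safeLimit = safeConcurrentLimit(loadSamples, 1000, 0.5);
// Assumed rule: start scaling out at 80% of the safe limit.
const scaleOutTrigger = Math.round(safeLimit * 0.8);
console.log(safeLimit, scaleOutTrigger);
```

With a tighter or looser budget the limit changes, so it's worth writing the budget down next to the number it produced.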
API Tester
This agent validates endpoints beyond the happy path:
## API Test Suite: User Dashboard Endpoints
### Base URL: /api/v1/dashboard
---
### GET /dashboard

| Test Case | Request | Expected | Status |
|-----------|---------|----------|--------|
| Auth required | No token | 401 Unauthorized | PASS |
| Valid token | Bearer token | 200 OK | PASS |
| Expired token | Expired JWT | 401 Unauthorized | PASS |
| Malformed token | Invalid format | 401 Unauthorized | PASS |
| Rate limit | 101 req/min | 429 Too Many Requests | PASS |

### POST /dashboard/settings

| Test Case | Request Body | Expected | Status |
|-----------|--------------|----------|--------|
| Valid update | {theme: "dark"} | 200 OK | PASS |
| Invalid theme | {theme: "invalid"} | 400 Bad Request | PASS |
| Extra fields | {theme: "dark", extra: 1} | 400 Bad Request | FAIL |
| Empty body | {} | 400 Bad Request | PASS |
| SQL injection | {theme: "'; DROP TABLE"} | 400 Bad Request | PASS |

### Contract Validation
- OpenAPI spec: /api/docs/openapi.yaml
- Contract tests: 47/50 PASS
- Missing response codes: 204, 422

### Integration Tests
- Database connectivity: PASS
- Cache invalidation: PASS
- Event publishing: FAIL (missing event for settings.update)

The API tester found issues I didn’t know existed - like the server accepting extra fields and missing event publishing for settings updates.
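The "extra fields" failure is the kind of bug strict request validation prevents. Below is a minimal, framework-free sketch of a validator that would satisfy all five POST test cases above. The `validateSettingsBody` function is hypothetical, and the allowed theme set is an assumption, since the report only shows "dark" being accepted.

```typescript
// Assumed allow-lists for this sketch.
const ALLOWED_KEYS: string[] = ["theme"];
const ALLOWED_THEMES: string[] = ["light", "dark"];

interface ValidationResult {
  status: 200 | 400;
  error?: string;
}

function validateSettingsBody(body: unknown): ValidationResult {
  if (typeof body !== "object" || body === null || Array.isArray(body)) {
    return { status: 400, error: "body must be a JSON object" };
  }
  const record = body as Record<string, unknown>;
  const keys = Object.keys(record);
  if (keys.length === 0) {
    return { status: 400, error: "body is empty" };
  }
  // Reject unknown keys instead of silently ignoring them.
  const unknownKeys = keys.filter((key) => ALLOWED_KEYS.indexOf(key) === -1);
  if (unknownKeys.length > 0) {
    return { status: 400, error: "unknown field(s): " + unknownKeys.join(", ") };
  }
  // Only values from the allow-list pass, which also rejects injection strings.
  const theme = record["theme"];
  if (typeof theme !== "string" || ALLOWED_THEMES.indexOf(theme) === -1) {
    return { status: 400, error: "invalid theme" };
  }
  return { status: 200 };
}

console.log(validateSettingsBody({ theme: "dark" }).status);
console.log(validateSettingsBody({ theme: "dark", extra: 1 }).status);
```

Validating against an allow-list of keys and values, rather than checking only the fields you expect, is what turns the FAIL row into a PASS.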
Decision Support
Tool Evaluator
When I needed to choose a testing framework, this agent provided structured comparison:
## Testing Framework Evaluation
### Candidates: Jest, Vitest, Mocha
### Criteria: Startup MVP, React + TypeScript, E2E needed

---

| Criterion | Jest | Vitest | Mocha |
|-----------|------|--------|-------|
| TypeScript support | Native | Native | Requires config |
| ESM support | Config required | Native | Native |
| Speed (1000 tests) | 12s | 3s | 8s |
| Watch mode | Yes | Yes | Requires plugin |
| Coverage built-in | Yes | Yes | Requires Istanbul |
| React Testing Library | Excellent | Excellent | Good |
| E2E integration | Playwright/Jest | Playwright/Vitest | Multiple options |
| Community size | Large | Growing | Large |
| Learning curve | Medium | Low | Medium |

### Recommendation: Vitest

**Reasons:**
1. 4x faster test runs (critical for TDD workflow)
2. Native ESM support matches modern React setup
3. Drop-in replacement for Jest in most cases
4. Built-in coverage, no extra configuration
5. Vite-native, matches build toolchain

**Migration path from Jest:**
1. Install vitest, @vitest/ui
2. Update package.json scripts
3. Rename jest.config.js to vitest.config.ts
4. Most tests run without modification

**Estimated setup time:** 2 hours
**Estimated migration time:** 4-8 hours for typical React project

Workflow Optimizer
This agent analyzes testing processes for efficiency gains:
## Current Testing Workflow Analysis
### Time Breakdown (per PR)
- Unit test execution: 3 minutes
- Integration test setup: 5 minutes
- Integration test execution: 8 minutes
- E2E test execution: 15 minutes
- Manual QA review: 20 minutes
- **Total: 51 minutes per PR**

### Bottlenecks Identified

1. **Sequential execution**
   - Unit, integration, E2E run one after another
   - Parallel execution possible: Save 11 minutes
2. **Full E2E suite for every PR**
   - Only 30% of E2E tests affected by typical change
   - Selective test selection possible: Save 10 minutes
3. **Manual QA review**
   - Many checks can be automated with visual testing
   - Automated visual regression: Save 15 minutes

### Optimized Workflow

```text title="Optimized testing pipeline"
PR Created
     |
     v
+-------------------+
| Parallel Execution|
+-------------------+
| Unit Tests (3m)   |
| Integration (8m)  |
| Selective E2E (5m)|
+-------------------+
     |
     v
+-------------------+
| Visual Regression |
| (Automated)       |
+-------------------+
     |
     v
+-------------------+
| Reality Checker   |
| (Only on fail)    |
+-------------------+
     |
     v
Ready for Review
```

Total: ~16 minutes per PR
Savings: 35 minutes per PR (69% reduction)
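The arithmetic behind these numbers is easy to model: stages run one after another, and a parallel stage takes as long as its slowest job. The sketch below uses the hypothetical helpers `pipelineMinutes` and `savings`. It only reproduces figures that follow from the report (the 51-minute sequential total, the 8-minute parallel stage, and the 69% reduction), because the report's ~16 minute total includes steps whose durations it doesn't itemize.

```typescript
// Each stage is a list of job durations in minutes that run in parallel.
// Stages themselves run one after another.
function pipelineMinutes(stages: number[][]): number {
  let total = 0;
  for (const jobs of stages) {
    // A parallel stage takes as long as its slowest job.
    total += jobs.reduce((slowest, job) => Math.max(slowest, job), 0);
  }
  return total;
}

function savings(beforeMin: number, afterMin: number): { savedMin: number; percent: number } {
  const savedMin = beforeMin - afterMin;
  return { savedMin, percent: Math.round((savedMin / beforeMin) * 100) };
}

// Current workflow: every step is its own sequential stage.
const sequentialTotal = pipelineMinutes([[3], [5], [8], [15], [20]]);

// Parallel block from the diagram: unit, integration, selective E2E.
const parallelStage = pipelineMinutes([[3, 8, 5]]);

// Before and after totals as stated in the report.
const reported = savings(51, 16);

console.log(sequentialTotal, parallelStage, reported);
```

Modelling the pipeline this way makes it obvious where the next saving has to come from: once jobs run in parallel, only the slowest one in each stage matters.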
## Compliance
### Accessibility Auditor
This agent caught WCAG violations I would have missed entirely:
## WCAG 2.1 AA Audit Report

### Application: User Dashboard v2.1
### Scope: All user-facing pages
### Audit Date: 2026-03-17

---

## Violations Summary

| Level | Count | Impact |
|-------|-------|--------|
| Critical | 2 | Blocks users completely |
| Serious | 5 | Significant barriers |
| Moderate | 8 | Frustrating but usable |
| Minor | 12 | Low impact |

## Critical Violations

### 1. Keyboard Navigation Blocked (2.1.1)
- **Location:** Dashboard -> Settings modal
- **Issue:** Modal opens but cannot be closed with keyboard
- **Impact:** Keyboard users cannot escape modal

**Code:**
```javascript
// Current (broken)
<Button onClick={closeModal}>X</Button>

// Required fix
<Button
  onClick={closeModal}
  aria-label="Close settings"
  onKeyDown={(e) => {
    if (e.key === 'Escape' || e.key === 'Enter') closeModal();
  }}
>
  X
</Button>
```

### 2. Form Missing Labels (3.3.2)
- **Location:** Dashboard -> Edit Profile -> Email field
- **Issue:** Input has no visible label or aria-label
- **Impact:** Screen readers announce “edit text” with no context
- **Remediation:** Add visible label or aria-label

## Serious Violations

### 3. Insufficient Color Contrast (1.4.3)
- **Affected:** 47 text elements
- **Current ratio:** 3.2:1 average
- **Required:** 4.5:1 for normal text, 3:1 for large text
- **Remediation:** Increase contrast ratios

### 4. Missing Alt Text (1.1.1)
- **Affected:** 12 images, 8 icons
- **Impact:** Screen readers skip meaningful content
- **Remediation:** Add descriptive alt text

### 5. Focus Not Visible (2.4.7)
- **Affected:** 23 interactive elements
- **Issue:** Focus ring removed with outline: none
- **Remediation:** Add visible focus indicator

## Remediation Plan

| Week | Focus | Violations Addressed |
|---|---|---|
| 1 | Critical | 2 |
| 2 | Serious | 5 |
| 3 | Moderate | 8 |
| 4 | Minor + Retest | 12 |

**Estimated effort:** 40 hours
**Legal risk if unfixed:** High (ADA compliance)
The Accessibility Auditor didn't just find problems. It provided specific code fixes, prioritized by impact, with a remediation timeline.
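The contrast findings are among the easiest to verify yourself, because WCAG defines the contrast ratio as a formula over the relative luminance of the two colors. The sketch below implements that formula and checks the `#888888 -> #666666` change from the Reality Checker's required actions. It assumes a white background, which the reports don't state, and the helper names are my own.

```typescript
// Parse "#rrggbb" into red, green, and blue channel values in 0..255.
function parseHex(hex: string): [number, number, number] {
  const match = /^#([0-9a-f]{6})$/i.exec(hex);
  if (match === null) throw new Error("expected #rrggbb, got " + hex);
  const value = parseInt(match[1], 16);
  return [(value >> 16) & 0xff, (value >> 8) & 0xff, value & 0xff];
}

// Convert one sRGB channel to its linear value, per the WCAG 2.1 definition.
function linearize(channel: number): number {
  const c = channel / 255;
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

function relativeLuminance(hex: string): number {
  const [r, g, b] = parseHex(hex);
  return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b);
}

// Contrast ratio = (lighter + 0.05) / (darker + 0.05), ranging from 1 to 21.
function contrastRatio(foreground: string, background: string): number {
  const l1 = relativeLuminance(foreground);
  const l2 = relativeLuminance(background);
  return (Math.max(l1, l2) + 0.05) / (Math.min(l1, l2) + 0.05);
}

const BACKGROUND = "#ffffff"; // assumed page background
const AA_NORMAL_TEXT = 4.5; // WCAG 2.1 AA minimum for normal-size text

const beforeRatio = contrastRatio("#888888", BACKGROUND); // roughly 3.5:1, fails AA
const afterRatio = contrastRatio("#666666", BACKGROUND); // roughly 5.7:1, passes AA
console.log(beforeRatio >= AA_NORMAL_TEXT, afterRatio >= AA_NORMAL_TEXT);
```

On a different background the ratios change, so the same check should be run against the colors actually used in the dashboard.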
## The Quality Gate Workflow Pattern
I learned to insert testing agents as checkpoints between phases:
```text title="Quality gate workflow pattern"
[Development Agent]
         |
         v
+---------------------+
| Evidence Collector  | <-- Screenshots, visual proof
+---------------------+
         |
         v
+---------------------+
| Reality Checker     | <-- Certification, pass/fail
+---------------------+
         |
         v
[Release Approved]
```

This pattern appears in multiple README scenarios as the standard quality gate. The key insight: don’t move forward until the gate passes.
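In code, "don’t move forward until the gate passes" is a loop that stops at the first failing gate. The sketch below is an illustration of the pattern, not how the agents are actually wired together: the `Gate` interface and `runQualityGates` function are hypothetical, and the two gate checks are stubs that return fixed results (real gates would call the agents and would likely be asynchronous).

```typescript
interface GateResult {
  passed: boolean;
  notes: string[];
}

interface Gate {
  name: string;
  check: () => GateResult;
}

interface PipelineOutcome {
  approved: boolean;
  stoppedAt: string | null; // name of the gate that blocked the release, if any
  ran: string[]; // gates that were actually executed, in order
}

// Run gates in order and stop at the first failure, so later gates
// (and the release) never see a build that an earlier gate rejected.
function runQualityGates(gates: Gate[]): PipelineOutcome {
  const ran: string[] = [];
  for (const gate of gates) {
    ran.push(gate.name);
    if (!gate.check().passed) {
      return { approved: false, stoppedAt: gate.name, ran };
    }
  }
  return { approved: true, stoppedAt: null, ran };
}

// Stubbed gates for illustration only.
const releaseGates: Gate[] = [
  {
    name: "Evidence Collector",
    check: () => ({ passed: false, notes: ["Sidebar overlaps content below 480px"] }),
  },
  {
    name: "Reality Checker",
    check: () => ({ passed: true, notes: [] }),
  },
];

const outcome = runQualityGates(releaseGates);
console.log(outcome);
```

Because the first gate fails here, the Reality Checker never runs, which is the point: certification effort is only spent on builds that already have clean evidence.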
The Evidence Collector Philosophy
What makes this agent different from generic testing approaches:
- Default expectation: Find 3-5 issues
  - Not “check if it works” but “actively find problems”
- Visual requirement: Every issue has a screenshot
  - No vague descriptions
- Verification: Don’t just report - verify the fix
  - Follow through to completion
I tested this philosophy on a real project. When I asked a generic AI to “test my dashboard,” it confirmed everything worked. When I deployed the Evidence Collector, it found 7 issues I hadn’t considered - from responsive breakpoints to loading states to focus traps.
Why Specialized Testing Agents Matter
| Metric | Generic AI Testing | Specialized Agent |
|---|---|---|
| Issues found | 1-2 per review | 3-5 per review |
| Issue documentation | Text description | Screenshot + code location |
| Pass/fail clarity | Maybe fix these | NOT READY + blocking issues |
| Accessibility coverage | “Seems accessible” | WCAG 2.1 AA compliance scan |
| Performance metrics | “It’s fast” | LCP, FCP, CLS with baselines |
The difference: Specialized agents have explicit criteria and default to finding problems. Generic AI assistants default to confirming functionality.
How I Use Testing Agents Now
My current workflow:
Feature Complete
|
v
Evidence Collector (visual QA, 5-10 min)
|
v
Performance Benchmarker (load test, 15 min)
|
v
Accessibility Auditor (WCAG scan, 10 min)
|
v
Reality Checker (certification, 5 min)
|
v
Production Release

Total testing time: ~35 minutes per feature. But I’ve caught more bugs in that 35 minutes than I used to find in a full day of manual testing.
Summary
The Testing Division provides specialized QA agents designed as quality gates. The Evidence Collector requires visual proof for every issue. The Reality Checker provides production readiness certification with explicit pass/fail criteria. The Accessibility Auditor ensures WCAG compliance with specific remediation steps.
The key insight: these agents default to finding problems, not confirming functionality. They provide screenshot evidence, certification reports, and actionable remediation plans. Deploy them as checkpoints in your workflow, not afterthoughts at the end.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can reach me by email.