How to Define Metrics and Verification Commands for AI-Driven Code Optimization
Problem
I wanted Claude Code to optimize my code automatically. I set up the iteration loop, configured the skill, and let it run overnight. The next morning, I checked the results.
The metric had improved. But the code was worse.
Here’s what happened:
    Iteration 1: Score improved from 45 to 48 -> kept
    Iteration 2: Score improved from 48 to 52 -> kept
    Iteration 3: Score improved from 52 to 55 -> kept
    ...
    Iteration 47: Score improved from 78 to 81 -> kept

    Final result: 81/100 score
    Actual code: Over-optimized to the specific metric, broke edge cases

I had used a subjective metric: “does this code look better?” Claude kept making changes that improved the score but accumulated subtle problems. By iteration 50, the drift was invisible until I ran real-world tests.
The root cause: I defined the wrong metric. The iteration loop worked perfectly. The verification command ran correctly. But my metric was subjective, not mechanical.
What I discovered
The Reddit discussion about Karpathy’s autoresearch pattern highlighted something I missed:
“The repo assumes you already know your metric and verification command. But that’s actually the hardest part for most people”
And more importantly:
“Mechanical metric matters too; ‘does this seem better’ as the eval produces drift that’s invisible until you’re 50 iterations in”
The key insight: your metric must be mechanical and objective. A script should be able to measure it without any human judgment. Subjective evaluations like “does this look better” or “is this cleaner” will produce drift over many iterations.
The metric-selection problem
Most tutorials on AI-driven optimization skip the hardest part: how do you actually choose the right metric?
    ┌─────────────────────────────────────────────────────────────────────┐
    │                    The Metric Selection Problem                     │
    ├─────────────────────────────────────────────────────────────────────┤
    │                                                                     │
    │  What most tutorials show:                                          │
    │  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐           │
    │  │ Define Goal  │───▶│   Run Loop   │───▶│   Success!   │           │
    │  └──────────────┘    └──────────────┘    └──────────────┘           │
    │                                                                     │
    │  What actually happens:                                             │
    │  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐           │
    │  │ Define Goal  │───▶│     ???      │───▶│ Wrong metric │           │
    │  └──────────────┘    │ What metric? │    │  Invisible   │           │
    │                      │   How to     │    │    drift     │           │
    │                      │ extract it?  │    └──────────────┘           │
    │                      └──────────────┘                               │
    │                                                                     │
    │  The gap: Choosing the right metric is the hardest part             │
    │                                                                     │
    └─────────────────────────────────────────────────────────────────────┘

The pattern “works for anything measurable: test coverage, bundle size, Lighthouse scores, API response time.” But those are all mechanical metrics. The moment you try to optimize something subjective, you hit problems.
What makes a good metric
A good metric for AI-driven optimization has three properties:
Property 1: Mechanical
A script can extract the value without human judgment.
Mechanical (good):
- Test coverage percentage: npm test -- --coverage
- Bundle size in KB: du -k dist/bundle.js
- Lighthouse score: npx lighthouse --output=json
- API p95 latency: curl timing metrics
- Number of TypeScript errors: npx tsc --noEmit

Non-mechanical (bad):
- "Code readability"
- "Is this cleaner?"
- "Does this seem better?"
- "Code maintainability"

Non-mechanical metrics require human evaluation. You can’t script them. AI can only optimize what can be measured programmatically.
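As a minimal sketch of what "mechanical" means in practice, the bundle-size metric can be written as a function that returns a number with no judgment involved. The function name `bundle_size_kb` is hypothetical, and 1 KB is assumed to be 1024 bytes. The result can differ slightly from `du -k`, which reports disk blocks and not the exact byte count.

```python
import os


def bundle_size_kb(path: str) -> float:
    """Return the size of the file at `path` in kilobytes (1 KB = 1024 bytes).

    The value depends only on the file's contents, so any script can
    compute it and no human evaluation is needed.
    """
    return os.path.getsize(path) / 1024
```

Called as `bundle_size_kb("dist/bundle.js")`, it yields a single number the loop can compare between iterations.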
Property 2: Objective
The same input always produces the same output.
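One way to check this property before starting a loop is to run the metric several times on unchanged code and look at the spread. The sketch below assumes the metric is available as a zero-argument Python callable; the names `repeatability` and `is_stable` and the `tolerance` parameter are hypothetical.

```python
import statistics
from typing import Callable, Tuple


def repeatability(metric: Callable[[], float], runs: int = 5) -> Tuple[float, float]:
    """Run `metric` several times and return (median, spread).

    Spread is max - min across the runs. A spread of 0 means the metric
    gave the same value every time for the same code.
    """
    values = [metric() for _ in range(runs)]
    return statistics.median(values), max(values) - min(values)


def is_stable(metric: Callable[[], float], runs: int = 5, tolerance: float = 0.0) -> bool:
    """Return True if repeated runs differ by at most `tolerance`."""
    _, spread = repeatability(metric, runs)
    return spread <= tolerance
```

Metrics such as latency are never perfectly repeatable, so a non-zero `tolerance` is a reasonable choice there. An improvement smaller than the measured spread is probably noise.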
    # This always returns the same number for the same code
    npm test -- --coverage 2>&1 | grep "All files" | awk '{print $4}'
    # Output: 72.5%

Contrast with subjective metrics:

    # Different evaluators might give different scores
    "Rate the readability of this code on a scale of 1-10"
    Evaluator A: 7
    Evaluator B: 5
    Evaluator C: 8

Property 3: Aligned with your actual goal
The metric should directly measure what you want to improve.
    Goal: Improve performance
    Wrong metric: Lines of code (less code != faster code)
    Right metric: API response time

    Goal: Improve code quality
    Wrong metric: Number of comments (more comments != better quality)
    Right metric: Test coverage or bug count

    Goal: Reduce bundle size
    Wrong metric: Number of files
    Right metric: Total bundle size in KB

How to create verification commands
A verification command does two things:
- Runs your test or measurement
- Exits with code 0 on success, non-zero on failure
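These two responsibilities can also be sketched as a small Python function that turns a measured value into an exit code. The name `verify` and the `higher_is_better` flag are hypothetical. The comparison is assumed to be `>=` for metrics that should rise and strict `<` for metrics that should fall.

```python
def verify(value: float, target: float, higher_is_better: bool = True) -> int:
    """Return 0 if `value` meets `target`, otherwise 1 (shell exit-code convention).

    higher_is_better=True  passes when value >= target (e.g. coverage >= 80)
    higher_is_better=False passes when value <  target (e.g. size < 100)
    """
    passed = value >= target if higher_is_better else value < target
    return 0 if passed else 1
```

In a script, the return value would be handed to `sys.exit`, for example `sys.exit(verify(72.5, 80))`. Unlike an integer-only shell test, this comparison also accepts decimal values such as 72.5.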
Here’s the pattern:
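    # Threshold to reach (placeholder value)
    target=80

    # Run the measurement (some_command is a placeholder)
    measurement_output=$(some_command 2>&1)

    # Extract the metric (replace PATTERN and the field number to match your output)
    metric_value=$(echo "$measurement_output" | grep "PATTERN" | awk '{print $2}')

    # Compare against target (the [ -ge ] test only handles integers)
    if [ "$metric_value" -ge "$target" ]; then
        exit 0  # Success
    else
        exit 1  # Failure
    fi

Example 1: Test coverage verification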
    #!/bin/bash
    # Goal: Test coverage >= 80%

    # Run tests with coverage (send stdout to the file first, then stderr to the same place)
    npm test -- --coverage > coverage-report.txt 2>&1

    # Extract the total coverage and drop any decimals, since [ -ge ] compares integers only
    coverage=$(grep "All files" coverage-report.txt | awk '{print $4}' | tr -d '%' | cut -d. -f1)

    # Check against threshold
    if [ "$coverage" -ge 80 ]; then
        echo "Coverage: ${coverage}% (PASS)"
        exit 0
    else
        echo "Coverage: ${coverage}% (FAIL - need 80%)"
        exit 1
    fi

Example 2: Bundle size verification
    #!/bin/bash
    # Goal: Bundle size < 100KB

    # Build the project (stop on a failed build so a stale bundle is never measured)
    npm run build || exit 1

    # Get bundle size in KB
    size=$(du -k dist/bundle.js | cut -f1)

    # Check against threshold
    if [ "$size" -lt 100 ]; then
        echo "Bundle size: ${size}KB (PASS)"
        exit 0
    else
        echo "Bundle size: ${size}KB (FAIL - need < 100KB)"
        exit 1
    fi

Example 3: Performance verification
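    #!/bin/bash
    # Goal: p95 response time < 200ms

    # Run load test (--latency is needed for wrk to print percentiles)
    wrk -t4 -c100 -d30s --latency https://api.example.com/endpoint > load-test.txt 2>&1

    # Extract tail latency (this varies by load testing tool).
    # wrk's latency distribution lists 50%, 75%, 90% and 99% but not 95%,
    # so the 99% line is used here as a stricter stand-in for p95.
    tail_latency=$(awk '$1 == "99%" {print $2}' load-test.txt)

    # Convert to whole milliseconds (wrk prints values such as 850.00us, 12.34ms or 1.20s)
    latency_ms=$(echo "$tail_latency" | awk '/us$/ {print int($1 / 1000); next} /ms$/ {print int($1); next} /s$/ {print int($1 * 1000)}')

    # Check against threshold
    if [ "$latency_ms" -lt 200 ]; then
        echo "Tail latency: ${latency_ms}ms (PASS)"
        exit 0
    else
        echo "Tail latency: ${latency_ms}ms (FAIL - need < 200ms)"
        exit 1
    fi

Example 4: Lighthouse verification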
    #!/bin/bash
    # Goal: Performance score >= 90

    # Run Lighthouse
    npx lighthouse https://example.com --output=json --quiet > lighthouse-report.json

    # Extract performance score
    score=$(jq '.categories.performance.score' lighthouse-report.json)

    # Convert to 0-100 scale
    score_100=$(echo "$score * 100" | bc | cut -d. -f1)

    # Check against threshold
    if [ "$score_100" -ge 90 ]; then
        echo "Lighthouse performance: ${score_100} (PASS)"
        exit 0
    else
        echo "Lighthouse performance: ${score_100} (FAIL - need >= 90)"
        exit 1
    fi

The metric extraction step
The verification command needs to extract a numeric value. Here are patterns for different output formats:
    # Pattern 1: Extract from tabular output
    # Input: "All files | 72 | 85 | 68 | 72.5"
    # Command (whitespace-separated field 4 is the first percentage, here 72):
    grep "All files" report.txt | awk '{print $4}'

    # Pattern 2: Extract from key-value output
    # Input: "coverage: 72.5%"
    # Command:
    grep "coverage:" report.txt | sed 's/coverage: //' | tr -d '%'

    # Pattern 3: Extract from JSON
    # Input: {"coverage": {"total": 72.5}}
    # Command:
    jq '.coverage.total' report.json

    # Pattern 4: Extract from log lines
    # Input: "2026-03-15 10:00:00 INFO Coverage is 72.5%"
    # Command:
    grep "Coverage is" log.txt | sed 's/.*Coverage is //' | tr -d '%'

    # Pattern 5: Extract timing from curl
    # Input: HTTP response time (curl prints the total time in seconds)
    # Command:
    curl -w "%{time_total}" -o /dev/null -s https://api.example.com/endpoint

Common mistakes
Mistake 1: Subjective evaluation
    # WRONG: subjective score from a model (ask_claude is a placeholder helper)
    def evaluate_code(code):
        """Ask Claude if the code looks better."""
        response = ask_claude(f"Rate this code on a scale of 1-10: {code}")
        return int(response)

This seems reasonable but produces drift. Each iteration, the AI might interpret “better” differently. By iteration 50, you’ve optimized for something that doesn’t match your actual goal.

The fix is to measure something mechanical instead:

    import re
    import subprocess

    def evaluate_code():
        """Extract test coverage percentage."""
        result = subprocess.run(
            ["npm", "test", "--", "--coverage"],
            capture_output=True, text=True
        )
        # Parse the first percentage after "All files",
        # e.g. "All files | 72 | 85 | 68 | 72.5" -> 72.0 (integers such as 100 also match)
        match = re.search(r'All files\s*\|\s*(\d+(?:\.\d+)?)', result.stdout)
        return float(match.group(1)) if match else 0.0

Mistake 2: Metric doesn’t match goal
    Goal: Improve application performance
    Metric: Number of tests (increasing)

    Problem: More tests don't make the app faster. They might even slow down CI.

    Fix: Use actual performance metric
    - API response time
    - Page load time
    - Throughput

Mistake 3: Metric can be gamed
    Goal: Improve code quality
    Metric: Lines of code (decreasing)

    Problem: AI might delete error handling, comments, and useful code.

    Fix: Use a metric that is harder to game
    - Test coverage (harder to fake, as long as the tests still have to pass)
    - Static analysis score (requires actual improvements)
    - Bug count in production (real-world impact)

Mistake 4: No baseline measurement
    # WRONG: Start optimizing without knowing where you are
    claude-code iterate --goal "improve performance"

    # RIGHT: Measure baseline first
    curl -w "%{time_total}" -o /dev/null -s https://api.example.com/endpoint
    # Baseline: 0.342 seconds

    claude-code iterate --goal "reduce response time below 200ms"

Real-world example
I used this approach to optimize API response times:
    #!/bin/bash
    # Goal: Reduce p95 API response time from 350ms to under 200ms

    # Create verification command
    cat > verify_response_time.sh << 'EOF'
    #!/bin/bash
    # Run 100 requests and calculate p95
    response_times=()
    for i in {1..100}; do
        time=$(curl -w "%{time_total}" -o /dev/null -s https://api.example.com/users)
        response_times+=("$time")
    done

    # Sort and get p95 (95th percentile = 95th value in sorted array)
    IFS=$'\n' sorted=($(sort -n <<<"${response_times[*]}"))
    unset IFS
    p95=${sorted[94]}  # 0-indexed

    # Convert to milliseconds
    p95_ms=$(echo "$p95 * 1000" | bc | cut -d. -f1)

    echo "p95 response time: ${p95_ms}ms"

    if [ "$p95_ms" -lt 200 ]; then
        exit 0
    else
        exit 1
    fi
    EOF

    chmod +x verify_response_time.sh

    # Create metric extraction command
    cat > get_response_time.sh << 'EOF'
    #!/bin/bash
    # Get current p95 in milliseconds
    response_times=()
    for i in {1..100}; do
        time=$(curl -w "%{time_total}" -o /dev/null -s https://api.example.com/users)
        response_times+=("$time")
    done
    IFS=$'\n' sorted=($(sort -n <<<"${response_times[*]}"))
    unset IFS
    p95=${sorted[94]}
    p95_ms=$(echo "$p95 * 1000" | bc | cut -d. -f1)
    echo "$p95_ms"
    EOF

    chmod +x get_response_time.sh

Then I configured the iteration loop:
    Goal: p95 API response time < 200ms
    Baseline: 350ms (measured)
    Metric command: ./get_response_time.sh
    Verify command: ./verify_response_time.sh
    Max iterations: 50

The AI made these changes over 12 iterations:

    Iteration 1: Added index on users.email -> 312ms (kept)
    Iteration 2: Added Redis cache for user profiles -> 245ms (kept)
    Iteration 3: Removed unnecessary JOIN -> 198ms (kept)
    Iteration 4: Added connection pooling -> 195ms (kept)

Each iteration:
- Made one atomic change
- Committed to git
- Measured the impact
- Kept or reverted based on p95
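The four steps above can be sketched as a small keep-or-revert loop. This is an illustration under assumptions, not the tool's actual implementation: the `Change` class, the `run_loop` function, and the `apply`/`revert` callables are hypothetical stand-ins for "edit and commit" and "undo the commit".

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Change:
    name: str
    apply: Callable[[], None]   # stand-in for: make one atomic edit and commit it
    revert: Callable[[], None]  # stand-in for: undo that commit


def run_loop(changes: List[Change], measure: Callable[[], float],
             lower_is_better: bool = True) -> List[str]:
    """Apply changes one at a time, keeping each only if the metric improves."""
    best = measure()  # baseline, measured before any change
    log = []
    for change in changes:
        change.apply()
        value = measure()
        improved = value < best if lower_is_better else value > best
        if improved:
            best = value  # the kept change becomes the new reference point
            log.append(f"{change.name} -> {value} (kept)")
        else:
            change.revert()
            log.append(f"{change.name} -> {value} (reverted)")
    return log
```

Because every candidate is compared against the best value so far and not the original baseline, a change that merely avoids regressing below the starting point is still rejected.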
Choosing between competing metrics
Sometimes you have multiple possible metrics. How do you choose?
    Goal: Improve API performance

    Option A: Average response time
    - Pro: Easy to measure
    - Con: Averages hide outliers

    Option B: p95 response time
    - Pro: Catches tail latency
    - Con: Requires more samples

    Option C: Throughput (requests/second)
    - Pro: Measures capacity
    - Con: Might optimize for wrong thing

    Option D: Time to first byte
    - Pro: Measures user perception
    - Con: Doesn't capture full experience

The answer depends on your actual goal:

    If users complain about slow requests: Choose p95 response time
    If you need to handle more traffic: Choose throughput
    If pages feel slow to load: Choose time to first byte

Summary
In this post, I showed how to define metrics and verification commands for AI-driven code optimization. The key points are:
- Choose mechanical metrics: A script must be able to extract the value
- Make metrics objective: Same input should always produce same output
- Align metric with goal: The metric should directly measure what you want to improve
- Create parseable output: Verification commands should output metrics in a format you can extract
- Avoid subjective evaluations: “Does this seem better” produces invisible drift over iterations
The iteration loop pattern is powerful, but it’s only as good as your metric. Spend time upfront choosing the right metric. A bad metric will optimize for the wrong thing, and you won’t notice until 50 iterations have passed.