How to Define Metrics and Verification Commands for AI-Driven Code Optimization

Mar 15, 2026

Problem

I wanted Claude Code to optimize my code automatically. I set up the iteration loop, configured the skill, and let it run overnight. The next morning, I checked the results.

The metric had improved. But the code was worse.

Here’s what happened:

Iteration 1: Score improved from 45 to 48 -> kept
Iteration 2: Score improved from 48 to 52 -> kept
Iteration 3: Score improved from 52 to 55 -> kept
...
Iteration 47: Score improved from 78 to 81 -> kept

Final result: 81/100 score
Actual code: Over-optimized to the specific metric, broke edge cases

I had used a subjective metric: “does this code look better?” Claude kept making changes that raised the score while subtle problems accumulated. The drift stayed invisible until I ran real-world tests, about 50 iterations in.

The root cause: I defined the wrong metric. The iteration loop worked perfectly. The verification command ran correctly. But my metric was subjective, not mechanical.

What I discovered

The Reddit discussion about Karpathy’s autoresearch pattern highlighted something I missed:

“The repo assumes you already know your metric and verification command. But that’s actually the hardest part for most people”

And more importantly:

“Mechanical metric matters too; ‘does this seem better’ as the eval produces drift that’s invisible until you’re 50 iterations in”

The key insight: your metric must be mechanical and objective. A script should be able to measure it without any human judgment. Subjective evaluations like “does this look better” or “is this cleaner” will produce drift over many iterations.

The metric-selection problem

Most tutorials on AI-driven optimization skip the hardest part: how do you actually choose the right metric?

┌─────────────────────────────────────────────────────────────────────┐
│                    The Metric Selection Problem                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   What most tutorials show:                                         │
│   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐         │
│   │ Define Goal  │───▶│  Run Loop    │───▶│  Success!    │         │
│   └──────────────┘    └──────────────┘    └──────────────┘         │
│                                                                      │
│   What actually happens:                                            │
│   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐         │
│   │ Define Goal  │───▶│ ???          │───▶│ Wrong metric │         │
│   └──────────────┘    │ What metric? │    │ Invisible    │         │
│                       │ How to       │    │ drift        │         │
│                       │ extract it?  │    └──────────────┘         │
│                       └──────────────┘                             │
│                                                                      │
│   The gap: Choosing the right metric is the hardest part           │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

The pattern “works for anything measurable: test coverage, bundle size, Lighthouse scores, API response time.” But those are all mechanical metrics. The moment you try to optimize something subjective, you hit problems.

What makes a good metric

A good metric for AI-driven optimization has three properties:

Property 1: Mechanical

A script can extract the value without human judgment.

Mechanical (good):
- Test coverage percentage: npm test -- --coverage
- Bundle size in KB: du -k dist/bundle.js
- Lighthouse score: npx lighthouse --output=json
- API p95 latency: curl timing metrics
- Number of TypeScript errors: npx tsc --noEmit

Non-mechanical (bad):
- "Code readability"
- "Is this cleaner?"
- "Does this seem better?"
- "Code maintainability"

Non-mechanical metrics require human evaluation. You can’t script them. AI can only optimize what can be measured programmatically.

Property 2: Objective

The same input always produces the same output.

# This returns the same number every time for the same code and tests
npm test -- --coverage 2>&1 | grep "All files" | awk '{print $4}'
# Output (first numeric column of the summary row): 72.5

Contrast with subjective metrics:

# Different evaluators might give different scores
"Rate the readability of this code on a scale of 1-10"
Evaluator A: 7
Evaluator B: 5
Evaluator C: 8
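
A quick way to test this property before starting a loop is to measure the same unchanged code several times and look at the spread. The Python sketch below assumes a hypothetical `is_repeatable` helper and a tolerance you choose yourself; neither is part of the original setup.

```python
from statistics import mean


def is_repeatable(samples: list[float], tolerance: float = 0.0) -> bool:
    """Return True if every sample lies within `tolerance` of the mean.

    `samples` are repeated measurements of the same, unchanged code.
    """
    if not samples:
        raise ValueError("need at least one sample")
    center = mean(samples)
    return all(abs(s - center) <= tolerance for s in samples)


# Coverage measured three times on the same commit: identical every time.
print(is_repeatable([72.5, 72.5, 72.5]))        # True

# Readability ratings from three evaluators (the 7 / 5 / 8 example above).
print(is_repeatable([7, 5, 8], tolerance=0.5))  # False
```

A noisy metric such as latency will likely need a non-zero tolerance, or more samples per measurement, before it passes a check like this.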

Property 3: Aligned with your actual goal

The metric should directly measure what you want to improve.

Goal: Improve performance
Wrong metric: Lines of code (less code != faster code)
Right metric: API response time

Goal: Improve code quality
Wrong metric: Number of comments (more comments != better quality)
Right metric: Test coverage or bug count

Goal: Reduce bundle size
Wrong metric: Number of files
Right metric: Total bundle size in KB

How to create verification commands

A verification command does two things:

- Runs your test or measurement
- Exits with code 0 on success, non-zero on failure

Here’s the pattern:

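# Run the measurement
measurement_output=$(some_command 2>&1)

# Extract the metric (replace PATTERN and the field number to match your tool)
metric_value=$(echo "$measurement_output" | grep "PATTERN" | awk '{print $4}')

# Compare against target ([ -ge ] is integer-only, so drop any decimals first)
if [ "${metric_value%.*}" -ge "$target" ]; then
    exit 0  # Success
else
    exit 1  # Failure
fi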
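
The same run, extract, compare pattern can be sketched in Python, which sidesteps the fact that the shell's `[ -ge ]` test only compares integers. The function names `extract_metric` and `verify`, and the sample output string, are assumptions for illustration.

```python
import re


def extract_metric(output: str, pattern: str) -> float:
    """Pull the first captured number out of a command's text output."""
    match = re.search(pattern, output)
    if match is None:
        raise ValueError(f"pattern {pattern!r} not found in output")
    return float(match.group(1))


def verify(value: float, target: float, higher_is_better: bool = True) -> int:
    """Return a shell-style exit code: 0 on success, 1 on failure."""
    passed = value >= target if higher_is_better else value < target
    return 0 if passed else 1


sample_output = "coverage: 72.5%"  # made-up output for illustration
value = extract_metric(sample_output, r"coverage:\s*([\d.]+)")
print(value, verify(value, target=80))  # 72.5 1
```

In a real script, the return value of `verify` would be passed to `sys.exit` so the loop can read it as the command's exit status.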

Example 1: Test coverage verification

#!/bin/bash
# Goal: Test coverage >= 80%

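# Run tests with coverage (send both stdout and stderr to the report)
npm test -- --coverage > coverage-report.txt 2>&1

# Extract the total coverage and drop any decimals, since [ -ge ] is integer-only
coverage=$(grep "All files" coverage-report.txt | awk '{print $4}' | tr -d '%' | cut -d. -f1)

# Check against threshold (treat a missing value as 0)
if [ "${coverage:-0}" -ge 80 ]; then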
    echo "Coverage: ${coverage}% (PASS)"
    exit 0
else
    echo "Coverage: ${coverage}% (FAIL - need 80%)"
    exit 1
fi

Example 2: Bundle size verification

#!/bin/bash
# Goal: Bundle size < 100KB

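# Build the project (stop if the build fails)
npm run build || exit 1

# Get bundle size in KB (wc -c counts bytes; du -k reports disk blocks, which rounds up)
size=$(( $(wc -c < dist/bundle.js) / 1024 ))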

# Check against threshold
if [ "$size" -lt 100 ]; then
    echo "Bundle size: ${size}KB (PASS)"
    exit 0
else
    echo "Bundle size: ${size}KB (FAIL - need < 100KB)"
    exit 1
fi

Example 3: Performance verification

#!/bin/bash
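# Goal: p95 response time < 200ms
# Note: wrk's built-in distribution (--latency) prints only the 50/75/90/99
# percentiles, so this script checks p99, which is stricter than p95.

# Run load test
wrk -t4 -c100 -d30s --latency https://api.example.com/endpoint > load-test.txt 2>&1

# Extract p99 and convert to whole milliseconds (wrk prints us, ms, or s)
p99_ms=$(awk '$1 == "99%" {
    v = $2
    if (v ~ /us$/)      { sub(/us$/, "", v); v = v / 1000 }
    else if (v ~ /ms$/) { sub(/ms$/, "", v) }
    else if (v ~ /s$/)  { sub(/s$/, "", v); v = v * 1000 }
    printf "%d", v
}' load-test.txt)

# Check against threshold (a missing value counts as a failure)
if [ "${p99_ms:-999999}" -lt 200 ]; then
    echo "p99 latency: ${p99_ms}ms (PASS)"
    exit 0
else
    echo "p99 latency: ${p99_ms}ms (FAIL - need < 200ms)"
    exit 1
fi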

Example 4: Lighthouse verification

#!/bin/bash
# Goal: Performance score >= 90

# Run Lighthouse
npx lighthouse https://example.com --output=json --quiet > lighthouse-report.json

# Extract performance score
score=$(jq '.categories.performance.score' lighthouse-report.json)

# Convert to 0-100 scale
score_100=$(echo "$score * 100" | bc | cut -d. -f1)

# Check against threshold
if [ "$score_100" -ge 90 ]; then
    echo "Lighthouse performance: ${score_100} (PASS)"
    exit 0
else
    echo "Lighthouse performance: ${score_100} (FAIL - need >= 90)"
    exit 1
fi

The metric extraction step

The verification command needs to extract a numeric value. Here are patterns for different output formats:

# Pattern 1: Extract from tabular output
# Input: "All files | 72 | 85 | 68 | 72.5"
# Command (splitting on whitespace, $4 is the first numeric column: 72):
grep "All files" report.txt | awk '{print $4}'

# Pattern 2: Extract from key-value output
# Input: "coverage: 72.5%"
# Command:
grep "coverage:" report.txt | sed 's/coverage: //' | tr -d '%'

# Pattern 3: Extract from JSON
# Input: {"coverage": {"total": 72.5}}
# Command:
jq '.coverage.total' report.json

# Pattern 4: Extract from log lines
# Input: "2026-03-15 10:00:00 INFO Coverage is 72.5%"
# Command:
grep "Coverage is" log.txt | sed 's/.*Coverage is //' | tr -d '%'

# Pattern 5: Extract timing from curl
# Input: HTTP response time
# Command (prints the total time in seconds, e.g. 0.342):
curl -w "%{time_total}" -o /dev/null -s https://api.example.com/endpoint
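
If the loop driver is written in Python rather than shell, the same extractions might look like the sketch below. The helper names (`from_table`, `from_text`, `from_json`) are hypothetical, and the inputs are the sample strings from the patterns above.

```python
import json
import re


def from_table(line: str, column: int) -> float:
    """Split a pipe-delimited row and return the given column (0 is the label)."""
    cells = [cell.strip() for cell in line.split("|")]
    return float(cells[column])


def from_text(line: str, label: str) -> float:
    """Return the first number that follows `label` in a text line."""
    match = re.search(re.escape(label) + r"\s*(\d+(?:\.\d+)?)", line)
    if match is None:
        raise ValueError(f"{label!r} not found")
    return float(match.group(1))


def from_json(document: str, *keys: str) -> float:
    """Walk nested JSON objects by key and return the final value."""
    node = json.loads(document)
    for key in keys:
        node = node[key]
    return float(node)


print(from_table("All files | 72 | 85 | 68 | 72.5", column=4))           # 72.5
print(from_text("coverage: 72.5%", "coverage:"))                         # 72.5
print(from_text("2026-03-15 10:00:00 INFO Coverage is 72.5%", "Coverage is"))  # 72.5
print(from_json('{"coverage": {"total": 72.5}}', "coverage", "total"))   # 72.5
```

Selecting the table column by index, rather than by whitespace position, makes it explicit which of the four coverage figures the loop is optimizing.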

Common mistakes

Mistake 1: Subjective evaluation

def evaluate_code(code):
    """Ask Claude if the code looks better."""
    # ask_claude is a hypothetical helper that sends a prompt and returns the reply text
    response = ask_claude(f"Rate this code on a scale of 1-10: {code}")
    return int(response)

This seems reasonable but produces drift. On each iteration, the AI might interpret “better” differently. By iteration 50, you’ve optimized for something that doesn’t match your actual goal.

Fix: measure something mechanical instead, such as test coverage:

import re
import subprocess

def evaluate_code():
    """Extract a test coverage percentage."""
    result = subprocess.run(
        ["npm", "test", "--", "--coverage"],
        capture_output=True,
        text=True
    )
    # Parse "All files | 72 | 85 | 68 | 72.5" and take the first numeric column
    match = re.search(r'All files\s*\|\s*(\d+(?:\.\d+)?)', result.stdout)
    return float(match.group(1)) if match else 0.0

Mistake 2: Metric doesn’t match goal

Goal: Improve application performance
Metric: Number of tests (increasing)

Problem: More tests don't make the app faster. They might even slow down CI.

Fix: Use actual performance metric
- API response time
- Page load time
- Throughput

Mistake 3: Metric can be gamed

Goal: Improve code quality
Metric: Lines of code (decreasing)

Problem: AI might delete error handling, comments, and useful code.

Fix: Use a metric that is harder to game
- Test coverage with the suite still passing (harder to fake, though assertion-free tests can still inflate it)
- Static analysis score (requires actual improvements)
- Bug count in production (real-world impact)
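
One possible way to make a metric harder to game is to pair it with guard checks that must not regress. This is a sketch under assumptions, not part of the original setup: the `Measurement` fields, the `accept_change` helper, and all numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    lines_of_code: int   # primary metric, lower is better
    tests_passed: bool   # guard: the suite must stay green
    coverage: float      # guard: must not drop


def accept_change(before: Measurement, after: Measurement) -> bool:
    """Keep a change only if the primary metric improves and no guard regresses."""
    improved = after.lines_of_code < before.lines_of_code
    guards_hold = after.tests_passed and after.coverage >= before.coverage
    return improved and guards_hold


baseline = Measurement(lines_of_code=1200, tests_passed=True, coverage=72.5)
deleted_error_handling = Measurement(lines_of_code=1100, tests_passed=False, coverage=70.0)
real_refactor = Measurement(lines_of_code=1150, tests_passed=True, coverage=73.0)

print(accept_change(baseline, deleted_error_handling))  # False
print(accept_change(baseline, real_refactor))           # True
```

The first candidate scores better on lines of code but is rejected because a guard failed, which is the gaming case described above.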

Mistake 4: No baseline measurement

# WRONG: Start optimizing without knowing where you are
# (claude-code iterate stands in for however you launch your loop)
claude-code iterate --goal "improve performance"

# RIGHT: Measure baseline first
curl -w "%{time_total}" -o /dev/null -s https://api.example.com/endpoint
# Baseline: 0.342 seconds

claude-code iterate --goal "reduce response time below 200ms"
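
With a baseline recorded, each later measurement can be reported as progress toward the target rather than as a bare number. A minimal sketch, assuming a hypothetical `progress` helper; the 245 ms reading is a made-up later measurement.

```python
def progress(baseline: float, current: float, target: float) -> float:
    """Fraction of the baseline-to-target gap closed so far (can exceed 1.0)."""
    gap = baseline - target
    if gap == 0:
        raise ValueError("baseline already equals target")
    return (baseline - current) / gap


baseline_ms = 342.0  # the 0.342 s curl measurement above
target_ms = 200.0
print(f"{progress(baseline_ms, 245.0, target_ms):.0%}")  # 68%
```

Because the gap is signed, the same function works for higher-is-better metrics such as coverage.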

Real-world example

I used this approach to optimize API response times:

#!/bin/bash
# Goal: Reduce p95 API response time from 350ms to under 200ms

# Create verification command
cat > verify_response_time.sh << 'EOF'
#!/bin/bash
# Run 100 requests and calculate p95
response_times=()
for i in {1..100}; do
    time=$(curl -w "%{time_total}" -o /dev/null -s https://api.example.com/users)
    response_times+=("$time")
done

# Sort and get p95 (95th percentile = 95th value in sorted array)
IFS=$'\n' sorted=($(sort -n <<<"${response_times[*]}"))
unset IFS
p95=${sorted[94]}  # 0-indexed

# Convert to milliseconds
p95_ms=$(echo "$p95 * 1000" | bc | cut -d. -f1)

echo "p95 response time: ${p95_ms}ms"

if [ "$p95_ms" -lt 200 ]; then
    exit 0
else
    exit 1
fi
EOF

chmod +x verify_response_time.sh

# Create metric extraction command
cat > get_response_time.sh << 'EOF'
#!/bin/bash
# Get current p95 in milliseconds
response_times=()
for i in {1..100}; do
    time=$(curl -w "%{time_total}" -o /dev/null -s https://api.example.com/users)
    response_times+=("$time")
done
IFS=$'\n' sorted=($(sort -n <<<"${response_times[*]}"))
unset IFS
p95=${sorted[94]}
p95_ms=$(echo "$p95 * 1000" | bc | cut -d. -f1)
echo "$p95_ms"
EOF

chmod +x get_response_time.sh
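
For reference, the percentile step in these scripts is the nearest-rank method. A small Python sketch of it follows; the `percentile` helper and the evenly spaced timings are assumptions for illustration, not measurements.

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the value at 1-indexed rank ceil(pct * n / 100)."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be an integer from 1 to 100")
    ordered = sorted(samples)
    # Integer ceiling division avoids floating-point rounding in the rank.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# 100 made-up timings in seconds: 0.001, 0.002, ..., 0.100
timings = [i / 1000 for i in range(1, 101)]
p95_ms = round(percentile(timings, 95) * 1000)
print(p95_ms)  # 95
```

With 100 samples the rank is 95, which matches the `sorted[94]` index used in the shell scripts.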

Then configured the iteration loop:

Goal: p95 API response time < 200ms
Baseline: 350ms (measured)
Metric command: ./get_response_time.sh
Verify command: ./verify_response_time.sh
Max iterations: 50

The loop ran for 12 iterations. The first four looked like this:

Iteration 1: Added index on users.email -> 312ms (kept)
Iteration 2: Added Redis cache for user profiles -> 245ms (kept)
Iteration 3: Removed unnecessary JOIN -> 198ms (kept)
Iteration 4: Added connection pooling -> 195ms (kept)

Each iteration:

- Made one atomic change
- Committed to git
- Measured the impact
- Kept or reverted based on p95
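
The keep-or-revert cycle can be sketched as a small greedy loop. This is an illustration under assumptions, not the actual skill's implementation: `run_loop`, `propose_change`, and `revert_change` are hypothetical names, and the readings are simulated rather than measured. In a real setup the change would come from the agent and the revert would be a git operation.

```python
from typing import Callable


def run_loop(
    measure: Callable[[], float],
    propose_change: Callable[[], None],
    revert_change: Callable[[], None],
    target: float,
    max_iterations: int,
) -> list[tuple[int, float, str]]:
    """Greedy keep-or-revert loop for a lower-is-better metric."""
    best = measure()  # baseline measurement
    history = []
    for i in range(1, max_iterations + 1):
        if best < target:
            break  # goal reached, stop early
        propose_change()
        current = measure()
        if current < best:
            best = current
            history.append((i, current, "kept"))
        else:
            revert_change()
            history.append((i, current, "reverted"))
    return history


# Simulated stand-ins: each "change" sets the next p95 reading in ms.
readings = iter([312.0, 330.0, 245.0, 198.0])
state = {"p95": 350.0, "previous": 350.0}


def propose_change() -> None:
    state["previous"] = state["p95"]
    state["p95"] = next(readings)


def revert_change() -> None:
    state["p95"] = state["previous"]


history = run_loop(lambda: state["p95"], propose_change, revert_change,
                   target=200.0, max_iterations=50)
for entry in history:
    print(entry)
```

In this simulation the second change makes latency worse and is reverted, and the loop stops after the fourth change because the metric has dropped below the target.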

Choosing between competing metrics

Sometimes you have multiple possible metrics. How do you choose?

Goal: Improve API performance

Option A: Average response time
- Pro: Easy to measure
- Con: Averages hide outliers

Option B: p95 response time
- Pro: Catches tail latency
- Con: Requires more samples

Option C: Throughput (requests/second)
- Pro: Measures capacity
- Con: Might optimize for wrong thing

Option D: Time to first byte
- Pro: Measures user perception
- Con: Doesn't capture full experience

The answer depends on your actual goal:

- If users complain about slow requests: Choose p95 response time
- If you need to handle more traffic: Choose throughput
- If pages feel slow to load: Choose time to first byte
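
To see why the choice matters, compare the average with p95 on a skewed sample. The sketch assumes a hypothetical `summarize` helper and made-up timings: 94 fast requests and 6 slow ones.

```python
from statistics import mean


def summarize(samples: list[float]) -> dict[str, float]:
    """Return the mean and the nearest-rank p95 of a list of timings."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return {"mean": mean(ordered), "p95": ordered[rank - 1]}


# Made-up response times in ms: 94 requests at 100 ms, 6 at 1500 ms.
samples = [100.0] * 94 + [1500.0] * 6
print(summarize(samples))  # {'mean': 184.0, 'p95': 1500.0}
```

Here the average sits under a 200 ms target, while p95 shows that more than one request in twenty takes 1.5 seconds.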

Summary

In this post, I showed how to define metrics and verification commands for AI-driven code optimization. The key points are:

- Choose mechanical metrics: A script must be able to extract the value
- Make metrics objective: Same input should always produce same output
- Align metric with goal: The metric should directly measure what you want to improve
- Create parseable output: Verification commands should output metrics in a format you can extract
- Avoid subjective evaluations: “Does this seem better” produces invisible drift over iterations

The iteration loop pattern is powerful, but it’s only as good as your metric. Spend time upfront choosing the right metric. A bad metric will optimize for the wrong thing, and you won’t notice until 50 iterations have passed.

Final Words + More Resources


Here are the most important links from this article:

👨‍💻 Reddit Discussion: Claude Code skill with Karpathy's autoresearch
