Skip to content

How Do I Choose Good Metrics for AI Coding Experiments?

Problem

I tried running an autoresearch-style optimization loop on my codebase. The loop kept “improving” code that was already working, breaking tests, and optimizing metrics that didn’t reflect actual quality.

Here’s what happened when I used “lines of code” as my metric:

Terminal window
# Before optimization
src/utils/helpers.py: 120 lines, all tests pass
# After "optimization"
src/utils/helpers.py: 45 lines, 3 tests fail
# Agent "optimized" by removing error handling and edge case logic

The agent did exactly what I asked. But my metric was gameable—it reduced lines while destroying functionality.

What Makes a Good Metric?

Good metrics for AI coding experiments must be unambiguous, game-proof, and directly correlated with actual quality improvement.

Unlike ML training where validation loss (val_bpb) provides clear directional signals, non-ML code experiments require composite validation signals.

The Three Pillars of Good Metrics

  1. Unambiguous Direction: Everyone agrees on what “better” means
  2. Game-Proof Design: Cannot be manipulated without real improvement
  3. Quality Correlation: Higher scores genuinely mean better code

Here’s the comparison:

metric-comparison.txt
| Metric Type | Unambiguous | Game-Proof | Correlates with Quality |
|--------------------|-------------|------------|-------------------------|
| Test pass rate | Yes | Yes | Yes |
| Type safety score | Yes | Yes | Yes |
| Lines of code | No | No | No |
| Code coverage % | No | No | Partial |
| Execution time | Yes | Partial | Yes |
| "Readability" | No | No | Unknown |

Why Single Metrics Fail

I learned this lesson from a Reddit discussion about autoresearch loops:

"The generalization that matters most is the metric selection.
autoresearch works because val_bpb is completely unambiguous —
lower is objectively better, no debate."
"The challenge with applying this pattern to business problems
is that most real-world metrics are gameable: open rates,
click rates, 'engagement'."

Single metrics fail because they can be optimized without improving real quality. Code coverage can hit 100% with useless tests. Lines of code can shrink by removing essential logic.

The Goodhart’s Law Trap

Goodhart’s Law states: “When a measure becomes a target, it ceases to be a good measure.”

When I made “test coverage” my optimization target:

fake_tests.py
# Agent-generated tests to hit coverage target
def test_nothing_1():
pass # Covers function import
def test_nothing_2():
pass # Covers class definition
def test_nothing_3():
pass # Covers module load
# Coverage: 95% | Tests: 12 | Actual validation: 0

The agent found the easiest path to maximize coverage: generate empty tests that exercise imports without validating behavior.

Composite Verification Strategy

The solution is combining multiple orthogonal metrics with hard gates.

Here’s my composite score approach:

metrics.py
from dataclasses import dataclass
from typing import List, Dict
@dataclass
class ExperimentMetrics:
"""Composite metrics for AI coding experiments."""
test_pass_rate: float # 0.0 - 1.0
type_safety_score: float # 0.0 - 1.0 (from mypy/pyright)
lint_score: float # 0.0 - 1.0 (from ruff/flake8)
performance_ratio: float # relative to baseline
coverage_quality: float # weighted by test value
# Hard gates (must pass)
all_tests_pass: bool
no_type_errors: bool
no_critical_lint: bool
def composite_score(self, weights: Dict[str, float]) -> float:
"""Calculate weighted composite score."""
if not self.all_tests_pass:
return 0.0 # Hard gate: tests must pass
if not self.no_type_errors:
return 0.0 # Hard gate: no type errors
return (
weights['tests'] * self.test_pass_rate +
weights['types'] * self.type_safety_score +
weights['lint'] * self.lint_score +
weights['perf'] * self.performance_ratio +
weights['coverage'] * self.coverage_quality
)
# Default weights prioritize correctness over style
DEFAULT_WEIGHTS = {
'tests': 0.4,
'types': 0.25,
'lint': 0.15,
'perf': 0.1,
'coverage': 0.1
}

The hard gates prevent gaming. If tests fail or type errors exist, the score is zero regardless of other metrics.

Validation Gates Configuration

I configure validation gates in YAML:

validation_gates.yaml
experiment:
name: "optimize-utils-module"
hard_gates:
- gate: "all_tests_pass"
command: "pytest --tb=short"
failure_action: "reject" # Block completely
- gate: "no_type_errors"
command: "mypy --strict src/"
failure_action: "reject"
- gate: "no_critical_lint"
command: "ruff check --select=E9,F63,F7,F82 src/"
failure_action: "reject"
soft_metrics:
- metric: "test_coverage_quality"
command: "pytest --cov --cov-report=json"
weight: 0.1
threshold: 0.6
- metric: "lint_warnings"
command: "ruff check --output-format=json src/"
weight: 0.15
threshold: 0.8
- metric: "execution_time"
command: "python -m timeit 'from src.utils import process; process(data)'"
weight: 0.1
baseline: 2.5 # seconds
threshold_ratio: 1.2 # must not be > 20% slower
composite_threshold: 0.75 # Minimum acceptable score

The distinction between hard gates and soft metrics prevents gaming while allowing optimization.

Metric Collection Script

Here’s how I collect metrics:

collect_metrics.py
import subprocess
import json
from pathlib import Path
from dataclasses import dataclass
class MetricCollector:
"""Collect metrics for AI coding experiments."""
def __init__(self, project_root: Path):
self.root = project_root
def collect_all(self) -> ExperimentMetrics:
"""Run all metric collection."""
return ExperimentMetrics(
test_pass_rate=self.collect_test_pass_rate(),
type_safety_score=self.collect_type_safety(),
lint_score=self.collect_lint_score(),
performance_ratio=self.collect_performance(),
coverage_quality=self.collect_coverage_quality(),
all_tests_pass=self.check_tests_pass(),
no_type_errors=self.check_type_errors(),
no_critical_lint=self.check_critical_lint()
)
def collect_test_pass_rate(self) -> float:
"""Get test pass rate."""
result = subprocess.run(
["pytest", "--tb=no", "-q", "--json-report"],
cwd=self.root,
capture_output=True,
text=True
)
report = json.loads(result.stdout)
passed = report.get('summary', {}).get('passed', 0)
total = report.get('summary', {}).get('total', 1)
return passed / total if total > 0 else 0.0
def collect_type_safety(self) -> float:
"""Get type safety score from mypy."""
result = subprocess.run(
["mypy", "--strict", "--error-summary", "src/"],
cwd=self.root,
capture_output=True,
text=True
)
# Parse: "Found X errors in Y files"
errors = self._parse_mypy_errors(result.stdout)
files = self._count_python_files()
return 1.0 - (errors / max(files, 1))
def collect_coverage_quality(self) -> float:
"""Get coverage weighted by test quality."""
result = subprocess.run(
["pytest", "--cov=src", "--cov-report=json", "-q"],
cwd=self.root,
capture_output=True,
text=True
)
cov_report = json.loads(
(self.root / "coverage.json").read_text()
)
# Weight coverage by test assertions (simple heuristic)
coverage = cov_report.get('totals', {}).get('percent_covered', 0)
test_value = self._estimate_test_value()
return (coverage / 100) * test_value
def _estimate_test_value(self) -> float:
"""Heuristic for test quality."""
# Count assertions in tests
test_files = list(self.root.glob("tests/**/*.py"))
assertion_count = 0
for f in test_files:
content = f.read_text()
assertion_count += content.count("assert ")
assertion_count += content.count("assertEqual")
assertion_count += content.count("expect(")
# More assertions = higher quality (simple heuristic)
return min(assertion_count / (len(test_files) * 5 + 1), 1.0)
def check_tests_pass(self) -> bool:
"""Hard gate: all tests must pass."""
result = subprocess.run(
["pytest", "--tb=no", "-q"],
cwd=self.root,
capture_output=True
)
return result.returncode == 0
def check_type_errors(self) -> bool:
"""Hard gate: no type errors."""
result = subprocess.run(
["mypy", "--strict", "src/"],
cwd=self.root,
capture_output=True
)
return result.returncode == 0

AI Experiment Runner

Now I can run experiments with proper validation:

experiment_runner.py
from dataclasses import dataclass
from typing import Optional
import json
@dataclass
class ExperimentResult:
accepted: bool
score: float
metrics: ExperimentMetrics
reason: Optional[str]
class AIExperimentRunner:
"""Run AI coding experiments with proper metric validation."""
def __init__(self, config_path: str):
with open(config_path) as f:
self.config = yaml.safe_load(f)
self.collector = MetricCollector(Path("."))
self.weights = self.config['soft_metrics']
def run_iteration(self, proposal: dict) -> ExperimentResult:
"""Evaluate a proposed code change."""
# 1. Apply proposed changes (in isolation)
self.apply_proposal(proposal)
# 2. Collect metrics
metrics = self.collector.collect_all()
# 3. Check hard gates first
gate_failures = self._check_hard_gates(metrics)
if gate_failures:
self.revert_proposal()
return ExperimentResult(
accepted=False,
score=0.0,
metrics=metrics,
reason=f"Hard gate failed: {gate_failures}"
)
# 4. Calculate composite score
score = metrics.composite_score(self._build_weights())
# 5. Check threshold
threshold = self.config.get('composite_threshold', 0.75)
if score >= threshold:
return ExperimentResult(
accepted=True,
score=score,
metrics=metrics,
reason=None
)
else:
self.revert_proposal()
return ExperimentResult(
accepted=False,
score=score,
metrics=metrics,
reason=f"Score {score} below threshold {threshold}"
)
def _check_hard_gates(self, metrics: ExperimentMetrics) -> list[str]:
"""Check all hard gates."""
failures = []
if not metrics.all_tests_pass:
failures.append("tests_failed")
if not metrics.no_type_errors:
failures.append("type_errors")
if not metrics.no_critical_lint:
failures.append("critical_lint")
return failures
def _build_weights(self) -> dict[str, float]:
"""Build weight dict from config."""
return {
m['metric']: m['weight']
for m in self.config['soft_metrics']
}
# Usage
runner = AIExperimentRunner("validation_gates.yaml")
# Test each proposal
for proposal in ai_generated_proposals:
result = runner.run_iteration(proposal)
if result.accepted:
print(f"Accepted: {result.score:.2f}")
else:
print(f"Rejected: {result.reason}")

When I run this with real proposals:

Terminal window
$ python experiment_runner.py
Proposal 1: Reduce error handling
Rejected: Hard gate failed: tests_failed
Proposal 2: Optimize loop performance
Accepted: 0.82
Proposal 3: Add caching layer
Rejected: Score 0.68 below threshold 0.75
Proposal 4: Refactor for readability
Accepted: 0.91

Anti-Patterns to Avoid

I learned these anti-patterns from failed experiments:

1. Single Metric Optimization

# WRONG: Optimize one metric
metric = "test_coverage"
optimize_for(metric)
# RIGHT: Composite with hard gates
metric = CompositeMetric(
hard_gates=["tests_pass", "no_type_errors"],
soft_metrics=["coverage", "lint_score", "perf"]
)

2. Subjective Metrics

# WRONG: Subjective metric
metric = "code_readability" # Who decides?
# RIGHT: Objective proxies
metrics = [
"cyclomatic_complexity", # Measurable
"lines_per_function", # Countable
"naming_convention_match" # Checkable
]

3. Metrics Without Baselines

# WRONG: No baseline comparison
score = calculate_score(current_code)
# RIGHT: Compare to baseline
baseline = calculate_score(original_code)
improvement = score / baseline
if improvement < 1.0:
reject("Code got worse")

Domain-Specific Metric Examples

Different domains need different metric combinations:

Frontend Optimization

frontend-metrics.yaml
hard_gates:
- gate: "build_succeeds"
command: "npm run build"
- gate: "no_console_errors"
command: "npm run test:e2e"
soft_metrics:
- metric: "bundle_size"
weight: 0.3
baseline: "500kb"
threshold_ratio: 0.9
- metric: "lighthouse_score"
weight: 0.25
threshold: 90
- metric: "component_coverage"
weight: 0.15
threshold: 80

Backend API Optimization

backend-metrics.yaml
hard_gates:
- gate: "all_endpoints_work"
command: "pytest tests/api/"
- gate: "response_schema_valid"
command: "python validate_schemas.py"
soft_metrics:
- metric: "response_time_p99"
weight: 0.35
baseline: "200ms"
threshold_ratio: 0.8
- metric: "memory_usage"
weight: 0.2
baseline: "150MB"
threshold_ratio: 1.1
- metric: "error_rate"
weight: 0.25
threshold: 0.01

Human Validation for Business Metrics

Some metrics require human judgment. The Reddit discussion highlighted this:

"If you're going to run an autoresearch-style loop,
the metric has to be 'meaningful reply received'
with a human doing spot-checks on quality."

For business problems, I use a hybrid approach:

hybrid_validation.py
class HybridValidator:
"""Combine automated metrics with human spot-checks."""
def validate_proposal(self, proposal: dict) -> tuple[bool, str]:
# 1. Automated hard gates
metrics = self.collect_metrics(proposal)
if not metrics.all_tests_pass:
return False, "Tests failed"
# 2. Automated soft metrics
score = metrics.composite_score(self.weights)
if score < 0.6:
return False, f"Score too low: {score}"
# 3. Human spot-check for subjective quality
if random.random() < self.spot_check_rate:
human_result = self.request_human_review(proposal)
if not human_result.approved:
return False, f"Human rejected: {human_result.reason}"
# 4. Accept if all pass
return True, f"Accepted with score {score}"

This balances automation efficiency with human judgment for ambiguous cases.

Summary

In this post, I showed how to choose good metrics for AI coding experiments. The key point is that metrics must be unambiguous (clear direction), game-proof (cannot be manipulated), and correlated with actual quality.

The solution is using composite verification with hard gates and soft metrics. Hard gates (test pass, no type errors, no critical lint) block bad changes completely. Soft metrics (coverage quality, performance ratio, lint score) allow optimization while preventing gaming.

Requirements for good metrics:

  1. Unambiguous Direction - everyone agrees what “better” means
  2. Game-Proof Design - cannot be gamed without real improvement
  3. Quality Correlation - higher scores mean genuinely better code
  4. Composite Signals - multiple orthogonal metrics, not single targets
  5. Hard Requirements - gates that block unacceptable changes
  6. Soft Optimization - metrics that allow improvement within bounds

Use multiple orthogonal metrics combined with hard gates. Single metrics like “coverage” or “lines of code” are gameable and will lead your AI agent to optimize the wrong thing.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments