How Do I Choose Good Metrics for AI Coding Experiments?
Problem
I tried running an autoresearch-style optimization loop on my codebase. The loop kept “improving” code that was already working, breaking tests, and optimizing metrics that didn’t reflect actual quality.
Here’s what happened when I used “lines of code” as my metric:
    # Before optimization
    src/utils/helpers.py: 120 lines, all tests pass

    # After "optimization"
    src/utils/helpers.py: 45 lines, 3 tests fail
    # Agent "optimized" by removing error handling and edge case logic

The agent did exactly what I asked. But my metric was gameable: it reduced lines while destroying functionality.
What Makes a Good Metric?
Good metrics for AI coding experiments must be unambiguous, game-proof, and directly correlated with actual quality improvement.
Unlike ML training, where validation loss (val_bpb, validation bits per byte) gives a clear directional signal, non-ML code experiments have no single number like that and need composite validation signals.
The Three Pillars of Good Metrics
- Unambiguous Direction: Everyone agrees on what “better” means
- Game-Proof Design: Cannot be manipulated without real improvement
- Quality Correlation: Higher scores genuinely mean better code
Here’s the comparison:
| Metric Type       | Unambiguous | Game-Proof | Correlates with Quality |
|-------------------|-------------|------------|-------------------------|
| Test pass rate    | Yes         | Yes        | Yes                     |
| Type safety score | Yes         | Yes        | Yes                     |
| Lines of code     | No          | No         | No                      |
| Code coverage %   | No          | No         | Partial                 |
| Execution time    | Yes         | Partial    | Yes                     |
| "Readability"     | No          | No         | Unknown                 |

Why Single Metrics Fail
I learned this lesson from a Reddit discussion about autoresearch loops:

"The generalization that matters most is the metric selection. autoresearch works because val_bpb is completely unambiguous: lower is objectively better, no debate."

"The challenge with applying this pattern to business problems is that most real-world metrics are gameable: open rates, click rates, 'engagement'."

Single metrics fail because they can be optimized without improving real quality. Code coverage can hit 100% with useless tests. Lines of code can shrink by removing essential logic.
The Goodhart’s Law Trap
Goodhart’s Law states: “When a measure becomes a target, it ceases to be a good measure.”
When I made “test coverage” my optimization target:
    # Agent-generated tests to hit coverage target
    def test_nothing_1(): pass  # Covers function import

    def test_nothing_2(): pass  # Covers class definition

    def test_nothing_3(): pass  # Covers module load

    # Coverage: 95% | Tests: 12 | Actual validation: 0

The agent found the easiest path to maximize coverage: generate empty tests that exercise imports without validating behavior.
Composite Verification Strategy
The solution is combining multiple orthogonal metrics with hard gates.
Here’s my composite score approach:
    from dataclasses import dataclass
    from typing import Dict


    @dataclass
    class ExperimentMetrics:
        """Composite metrics for AI coding experiments."""
        test_pass_rate: float      # 0.0 - 1.0
        type_safety_score: float   # 0.0 - 1.0 (from mypy/pyright)
        lint_score: float          # 0.0 - 1.0 (from ruff/flake8)
        performance_ratio: float   # relative to baseline
        coverage_quality: float    # weighted by test value

        # Hard gates (must pass)
        all_tests_pass: bool
        no_type_errors: bool
        no_critical_lint: bool

        def composite_score(self, weights: Dict[str, float]) -> float:
            """Calculate weighted composite score."""
            if not self.all_tests_pass:
                return 0.0  # Hard gate: tests must pass

            if not self.no_type_errors:
                return 0.0  # Hard gate: no type errors

            if not self.no_critical_lint:
                return 0.0  # Hard gate: no critical lint errors

            return (
                weights['tests'] * self.test_pass_rate +
                weights['types'] * self.type_safety_score +
                weights['lint'] * self.lint_score +
                weights['perf'] * self.performance_ratio +
                weights['coverage'] * self.coverage_quality
            )


    # Default weights prioritize correctness over style
    DEFAULT_WEIGHTS = {
        'tests': 0.4,
        'types': 0.25,
        'lint': 0.15,
        'perf': 0.1,
        'coverage': 0.1
    }

The hard gates prevent gaming. If tests fail, type errors exist, or a critical lint error is present, the score is zero regardless of other metrics.
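To make the gating behaviour concrete, here is a small standalone sketch. It restates the scoring rule as a plain function so it can run on its own. The name `score_change` is hypothetical and the metric values are made up for illustration.

```python
WEIGHTS = {"tests": 0.4, "types": 0.25, "lint": 0.15, "perf": 0.1, "coverage": 0.1}


def score_change(soft: dict[str, float], gates: dict[str, bool],
                 weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of soft metrics; zero if any hard gate fails."""
    if not all(gates.values()):
        return 0.0
    return sum(weights[name] * soft[name] for name in weights)


# Made-up metric values for one proposed change
soft = {"tests": 1.0, "types": 0.9, "lint": 0.8, "perf": 1.0, "coverage": 0.5}

passing = {"all_tests_pass": True, "no_type_errors": True, "no_critical_lint": True}
failing = {**passing, "all_tests_pass": False}

# 0.4*1.0 + 0.25*0.9 + 0.15*0.8 + 0.1*1.0 + 0.1*0.5
print(round(score_change(soft, passing), 3))
# A single failed gate overrides every soft metric
print(score_change(soft, failing))
```

With the same soft metrics, the first call scores well above the 0.75 threshold while the second is rejected outright. This is the property that stops an agent from trading a broken test for a better lint score.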
Validation Gates Configuration
I configure validation gates in YAML:
    experiment:
      name: "optimize-utils-module"

    hard_gates:
      - gate: "all_tests_pass"
        command: "pytest --tb=short"
        failure_action: "reject"  # Block completely

      - gate: "no_type_errors"
        command: "mypy --strict src/"
        failure_action: "reject"

      - gate: "no_critical_lint"
        command: "ruff check --select=E9,F63,F7,F82 src/"
        failure_action: "reject"

    soft_metrics:
      - metric: "test_coverage_quality"
        command: "pytest --cov --cov-report=json"
        weight: 0.1
        threshold: 0.6

      - metric: "lint_warnings"
        command: "ruff check --output-format=json src/"
        weight: 0.15
        threshold: 0.8

      - metric: "execution_time"
        command: "python -m timeit 'from src.utils import process; process(data)'"
        weight: 0.1
        baseline: 2.5         # seconds
        threshold_ratio: 1.2  # must not be > 20% slower

    composite_threshold: 0.75  # Minimum acceptable score

The distinction between hard gates and soft metrics prevents gaming while allowing optimization.
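A hard gate in this scheme is just a command whose exit code decides accept or reject. The sketch below shows one way to run the gates. It assumes the YAML has already been parsed into a list of dicts with `gate` and `command` keys. `run_hard_gates` is a hypothetical helper name, and the demo commands are stand-ins that call the current Python interpreter instead of pytest or mypy.

```python
import shlex
import subprocess
import sys


def run_hard_gates(gates: list[dict]) -> list[str]:
    """Run each gate command; return the names of gates that failed.

    A gate fails when its command exits with a non-zero status.
    """
    failed = []
    for gate in gates:
        result = subprocess.run(
            shlex.split(gate["command"]),  # split the command string into argv
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            failed.append(gate["gate"])
    return failed


# Stand-in commands so the example runs without pytest/mypy/ruff installed
py = shlex.quote(sys.executable)
demo_gates = [
    {"gate": "always_passes", "command": f"{py} -c 'raise SystemExit(0)'"},
    {"gate": "always_fails", "command": f"{py} -c 'raise SystemExit(1)'"},
]

failures = run_hard_gates(demo_gates)
print(failures)
```

Running every gate instead of stopping at the first failure costs more time per iteration, but the full failure list gives the agent more useful feedback for its next proposal.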
Metric Collection Script
Here’s how I collect metrics:
    import json
    import re
    import subprocess
    from pathlib import Path


    class MetricCollector:
        """Collect metrics for AI coding experiments."""

        def __init__(self, project_root: Path):
            self.root = project_root

        def collect_all(self) -> ExperimentMetrics:
            """Run all metric collection."""
            return ExperimentMetrics(
                test_pass_rate=self.collect_test_pass_rate(),
                type_safety_score=self.collect_type_safety(),
                lint_score=self.collect_lint_score(),
                performance_ratio=self.collect_performance(),
                coverage_quality=self.collect_coverage_quality(),
                all_tests_pass=self.check_tests_pass(),
                no_type_errors=self.check_type_errors(),
                no_critical_lint=self.check_critical_lint()
            )

        def collect_test_pass_rate(self) -> float:
            """Get test pass rate (needs the pytest-json-report plugin)."""
            subprocess.run(
                ["pytest", "--tb=no", "-q", "--json-report",
                 "--json-report-file=.report.json"],
                cwd=self.root, capture_output=True, text=True
            )

            # The plugin writes its report to a file, not to stdout
            report = json.loads((self.root / ".report.json").read_text())
            summary = report.get('summary', {})
            passed = summary.get('passed', 0)
            total = summary.get('total', 0)

            return passed / total if total > 0 else 0.0

        def collect_type_safety(self) -> float:
            """Get type safety score from mypy."""
            result = subprocess.run(
                ["mypy", "--strict", "--error-summary", "src/"],
                cwd=self.root, capture_output=True, text=True
            )

            # Parse: "Found X errors in Y files"
            errors = self._parse_mypy_errors(result.stdout)
            files = self._count_python_files()

            # Clamp at 0.0 so many errors in few files cannot go negative
            return max(0.0, 1.0 - (errors / max(files, 1)))

        def _parse_mypy_errors(self, output: str) -> int:
            """Extract the error count from mypy's summary line."""
            match = re.search(r"Found (\d+) errors?", output)
            return int(match.group(1)) if match else 0

        def _count_python_files(self) -> int:
            """Count Python source files under src/."""
            return len(list((self.root / "src").rglob("*.py")))

        def collect_lint_score(self) -> float:
            """Lint score in 0.0 - 1.0 (project-specific)."""
            raise NotImplementedError

        def collect_performance(self) -> float:
            """Performance relative to baseline (project-specific)."""
            raise NotImplementedError

        def collect_coverage_quality(self) -> float:
            """Get coverage weighted by test quality."""
            subprocess.run(
                ["pytest", "--cov=src", "--cov-report=json", "-q"],
                cwd=self.root, capture_output=True, text=True
            )

            cov_report = json.loads(
                (self.root / "coverage.json").read_text()
            )

            # Weight coverage by test assertions (simple heuristic)
            coverage = cov_report.get('totals', {}).get('percent_covered', 0)
            test_value = self._estimate_test_value()

            return (coverage / 100) * test_value

        def _estimate_test_value(self) -> float:
            """Heuristic for test quality."""
            # Count assertions in tests
            test_files = list(self.root.glob("tests/**/*.py"))
            assertion_count = 0

            for f in test_files:
                content = f.read_text()
                assertion_count += content.count("assert ")
                assertion_count += content.count("assertEqual")
                assertion_count += content.count("expect(")

            # More assertions = higher quality (simple heuristic)
            return min(assertion_count / (len(test_files) * 5 + 1), 1.0)

        def check_tests_pass(self) -> bool:
            """Hard gate: all tests must pass."""
            result = subprocess.run(
                ["pytest", "--tb=no", "-q"],
                cwd=self.root, capture_output=True
            )
            return result.returncode == 0

        def check_type_errors(self) -> bool:
            """Hard gate: no type errors."""
            result = subprocess.run(
                ["mypy", "--strict", "src/"],
                cwd=self.root, capture_output=True
            )
            return result.returncode == 0

        def check_critical_lint(self) -> bool:
            """Hard gate: no critical lint errors (same rules as the YAML gate)."""
            result = subprocess.run(
                ["ruff", "check", "--select=E9,F63,F7,F82", "src/"],
                cwd=self.root, capture_output=True
            )
            return result.returncode == 0

AI Experiment Runner
Now I can run experiments with proper validation:
    from dataclasses import dataclass
    from pathlib import Path
    from typing import Optional

    import yaml  # PyYAML


    @dataclass
    class ExperimentResult:
        accepted: bool
        score: float
        metrics: ExperimentMetrics
        reason: Optional[str]


    class AIExperimentRunner:
        """Run AI coding experiments with proper metric validation."""

        # Maps soft-metric names in the YAML to composite_score() weight keys
        # (assumed correspondence between the two naming schemes)
        METRIC_KEYS = {
            'test_coverage_quality': 'coverage',
            'lint_warnings': 'lint',
            'execution_time': 'perf',
        }

        def __init__(self, config_path: str):
            with open(config_path) as f:
                self.config = yaml.safe_load(f)

            self.collector = MetricCollector(Path("."))

        def apply_proposal(self, proposal: dict) -> None:
            """Apply the proposed change in isolation (project-specific)."""
            raise NotImplementedError

        def revert_proposal(self) -> None:
            """Undo the last applied proposal (project-specific)."""
            raise NotImplementedError

        def run_iteration(self, proposal: dict) -> ExperimentResult:
            """Evaluate a proposed code change."""
            # 1. Apply proposed changes (in isolation)
            self.apply_proposal(proposal)

            # 2. Collect metrics
            metrics = self.collector.collect_all()

            # 3. Check hard gates first
            gate_failures = self._check_hard_gates(metrics)
            if gate_failures:
                self.revert_proposal()
                return ExperimentResult(
                    accepted=False,
                    score=0.0,
                    metrics=metrics,
                    reason=f"Hard gate failed: {', '.join(gate_failures)}"
                )

            # 4. Calculate composite score
            score = metrics.composite_score(self._build_weights())

            # 5. Check threshold
            threshold = self.config.get('composite_threshold', 0.75)

            if score >= threshold:
                return ExperimentResult(
                    accepted=True,
                    score=score,
                    metrics=metrics,
                    reason=None
                )
            else:
                self.revert_proposal()
                return ExperimentResult(
                    accepted=False,
                    score=score,
                    metrics=metrics,
                    reason=f"Score {score:.2f} below threshold {threshold}"
                )

        def _check_hard_gates(self, metrics: ExperimentMetrics) -> list[str]:
            """Check all hard gates."""
            failures = []

            if not metrics.all_tests_pass:
                failures.append("tests_failed")

            if not metrics.no_type_errors:
                failures.append("type_errors")

            if not metrics.no_critical_lint:
                failures.append("critical_lint")

            return failures

        def _build_weights(self) -> dict[str, float]:
            """Build weight dict from config, falling back to DEFAULT_WEIGHTS."""
            weights = dict(DEFAULT_WEIGHTS)
            for m in self.config['soft_metrics']:
                key = self.METRIC_KEYS.get(m['metric'], m['metric'])
                weights[key] = m['weight']
            return weights


    # Usage
    runner = AIExperimentRunner("validation_gates.yaml")

    # Test each proposal
    for proposal in ai_generated_proposals:
        result = runner.run_iteration(proposal)

        if result.accepted:
            print(f"Accepted: {result.score:.2f}")
        else:
            print(f"Rejected: {result.reason}")

When I run this with real proposals:
    $ python experiment_runner.py

    Proposal 1: Reduce error handling
    Rejected: Hard gate failed: tests_failed

    Proposal 2: Optimize loop performance
    Accepted: 0.82

    Proposal 3: Add caching layer
    Rejected: Score 0.68 below threshold 0.75

    Proposal 4: Refactor for readability
    Accepted: 0.91

Anti-Patterns to Avoid
I learned these anti-patterns from failed experiments:
1. Single Metric Optimization
    # WRONG: Optimize one metric
    metric = "test_coverage"
    optimize_for(metric)

    # RIGHT: Composite with hard gates
    metric = CompositeMetric(
        hard_gates=["tests_pass", "no_type_errors"],
        soft_metrics=["coverage", "lint_score", "perf"]
    )

2. Subjective Metrics
    # WRONG: Subjective metric
    metric = "code_readability"  # Who decides?

    # RIGHT: Objective proxies
    metrics = [
        "cyclomatic_complexity",   # Measurable
        "lines_per_function",      # Countable
        "naming_convention_match"  # Checkable
    ]

3. Metrics Without Baselines
    # WRONG: No baseline comparison
    score = calculate_score(current_code)

    # RIGHT: Compare to baseline
    baseline = calculate_score(original_code)
    score = calculate_score(current_code)
    improvement = score / baseline

    if improvement < 1.0:
        reject("Code got worse")

Domain-Specific Metric Examples
Different domains need different metric combinations:
Frontend Optimization
    hard_gates:
      - gate: "build_succeeds"
        command: "npm run build"

      - gate: "no_console_errors"
        command: "npm run test:e2e"

    soft_metrics:
      - metric: "bundle_size"
        weight: 0.3
        baseline: "500kb"
        threshold_ratio: 0.9

      - metric: "lighthouse_score"
        weight: 0.25
        threshold: 90

      - metric: "component_coverage"
        weight: 0.15
        threshold: 80

Backend API Optimization
    hard_gates:
      - gate: "all_endpoints_work"
        command: "pytest tests/api/"

      - gate: "response_schema_valid"
        command: "python validate_schemas.py"

    soft_metrics:
      - metric: "response_time_p99"
        weight: 0.35
        baseline: "200ms"
        threshold_ratio: 0.8

      - metric: "memory_usage"
        weight: 0.2
        baseline: "150MB"
        threshold_ratio: 1.1

      - metric: "error_rate"
        weight: 0.25
        threshold: 0.01

Human Validation for Business Metrics
Some metrics require human judgment. The Reddit discussion highlighted this:

"If you're going to run an autoresearch-style loop, the metric has to be 'meaningful reply received' with a human doing spot-checks on quality."

For business problems, I use a hybrid approach:
    import random


    class HybridValidator:
        """Combine automated metrics with human spot-checks."""

        def __init__(self, weights: dict, spot_check_rate: float):
            self.weights = weights
            self.spot_check_rate = spot_check_rate  # fraction sent to a human

        # collect_metrics() and request_human_review() are project-specific
        # and not shown here.

        def validate_proposal(self, proposal: dict) -> tuple[bool, str]:
            # 1. Automated hard gates
            metrics = self.collect_metrics(proposal)

            if not metrics.all_tests_pass:
                return False, "Tests failed"

            # 2. Automated soft metrics
            score = metrics.composite_score(self.weights)

            if score < 0.6:
                return False, f"Score too low: {score:.2f}"

            # 3. Human spot-check for subjective quality
            if random.random() < self.spot_check_rate:
                human_result = self.request_human_review(proposal)

                if not human_result.approved:
                    return False, f"Human rejected: {human_result.reason}"

            # 4. Accept if all pass
            return True, f"Accepted with score {score:.2f}"

This balances automation efficiency with human judgment for ambiguous cases.
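One possible variation on the random spot-check: `random.random()` gives a different answer on every run, so a proposal that is resubmitted may skip review the second time. The sketch below is an alternative, not what the validator above does. It picks proposals deterministically by hashing an identifier. The function name `needs_human_review` and the idea that each proposal carries a stable string id are assumptions.

```python
import hashlib


def needs_human_review(proposal_id: str, spot_check_rate: float) -> bool:
    """Deterministically select roughly `spot_check_rate` of proposals.

    The SHA-256 digest of the id is mapped to a number in [0, 1); the
    proposal is selected when that number falls below the rate.
    """
    digest = hashlib.sha256(proposal_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < spot_check_rate


ids = [f"proposal-{i}" for i in range(1000)]
selected = [pid for pid in ids if needs_human_review(pid, 0.1)]
print(len(selected))  # roughly 100, and the same set on every run
```

The same id always gets the same decision, which also makes it possible to audit later which proposals should have been reviewed.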
Summary
In this post, I showed how to choose good metrics for AI coding experiments. The key point is that metrics must be unambiguous (clear direction), game-proof (cannot be manipulated), and correlated with actual quality.
The solution is using composite verification with hard gates and soft metrics. Hard gates (test pass, no type errors, no critical lint) block bad changes completely. Soft metrics (coverage quality, performance ratio, lint score) allow optimization while preventing gaming.
Requirements for good metrics:
- Unambiguous Direction - everyone agrees what “better” means
- Game-Proof Design - cannot be gamed without real improvement
- Quality Correlation - higher scores mean genuinely better code
- Composite Signals - multiple orthogonal metrics, not single targets
- Hard Requirements - gates that block unacceptable changes
- Soft Optimization - metrics that allow improvement within bounds
Use multiple orthogonal metrics combined with hard gates. Single metrics like “coverage” or “lines of code” are gameable and will lead your AI agent to optimize the wrong thing.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Reddit: Why metric selection matters more than model choice
- Goodhart's Law
- Software Testing Best Practices
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!