
What Codebase Metrics Can Improve AI-Generated Code Quality? (A Complete Guide)

I asked Claude to generate a utility module for my project. The code worked perfectly — clean syntax, proper error handling, even comprehensive tests. Three months later, that same module became a maintenance nightmare. Every small change cascaded into unexpected breakages. The module had quietly accumulated circular dependencies, god-class behavior, and tight coupling that no linter or code review caught.

The problem? I had no way to measure architectural quality. I only measured correctness.

This is the hidden trap of AI-generated code: it often produces output that is syntactically correct but architecturally problematic. Let me walk through the five codebase metrics that finally gave me objective visibility into this problem.

The Measurement Gap

Traditional code quality tools focus on surface-level issues:

  • Linting catches style violations and common bug patterns
  • Type checkers catch type errors
  • Tests check correctness, and coverage shows how much code they exercise

None of these measure maintainability — the ability to change code without breaking everything.

When AI generates code, it optimizes for:

  • Completing the task
  • Following syntax rules
  • Passing tests

It does NOT optimize for:

  • Module boundaries
  • Dependency complexity
  • Code distribution balance
  • Long-term architectural health

I needed metrics that measure these structural properties. Here’s what I found.

Metric 1: Newman’s Modularity Q

I first encountered this in a Reddit discussion about codebase quality metrics. Newman’s Modularity Q comes from network science — it measures how well a graph separates into distinct communities.

What it measures: Architectural cohesion vs. coupling.

The formula compares:

  • Edges within modules (good — cohesion)
  • Edges between modules (bad — coupling)

A Q-score ranges from -0.5 to 1.0. Higher values mean better modularity.
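To make the comparison concrete, here is a minimal sketch of how Q can be computed. It rests on assumptions that are mine, not part of the original analysis: imports are treated as undirected edges, each module's "community" is simply the package it lives in, and the function name and sample modules are hypothetical.

```python
from collections import defaultdict

def modularity_q(edges, community_of):
    """Newman's modularity Q for an undirected graph.

    edges: iterable of (a, b) module pairs (import direction is ignored here).
    community_of: dict mapping each module to its package/community label.
    Q = sum over communities c of  L_c / m - (d_c / (2m))**2
    where m is the edge count, L_c the edges inside c, d_c the total degree of c.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)   # L_c: edges with both ends in community c
    degree = defaultdict(int)     # d_c: sum of node degrees in community c
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree[ca] += 1
        degree[cb] += 1
        if ca == cb:
            internal[ca] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Hypothetical example: two packages of three modules each, joined by one import.
edges = [("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
         ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
         ("a3", "b1")]
packages = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "b3": "B"}
q = modularity_q(edges, packages)
```

The second term is what separates Q from a plain "share of internal edges": it subtracts the share you would expect if the same modules were wired together at random, so putting everything into one giant package scores 0 rather than 1.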

The AI Code Problem

When I analyzed AI-generated codebases, I noticed a pattern:

# AI-generated module structure (Q = 0.12)
user_service imports: [auth, db, cache, email, logger, config, utils, validators]
auth imports: [user_service, db, cache, session, tokens]
db imports: [user_service, auth, config, logger]

# Well-structured codebase (Q = 0.73)
user_service imports: [user_repository]
auth imports: [token_service]
user_repository imports: [db_connection]

AI generators tend to create “everything depends on everything” structures because they stitch together code without understanding module boundaries. The result: a low Q-score that predicts future refactoring pain.

How to Interpret

Q-Score      Interpretation
0.7 - 1.0    Excellent modularity
0.4 - 0.7    Acceptable, some coupling issues
0.0 - 0.4    Significant architectural problems
< 0.0        Modules are arbitrary; restructure immediately

Metric 2: Tarjan’s Cycle Detection

This one hit me hard. I inherited a codebase that “just worked” until I tried to extract a module for reuse. Import errors cascaded everywhere. Circular dependencies.

Tarjan’s algorithm (1972) finds strongly connected components — groups of modules that all depend on each other. In codebases, these are architectural red flags.

What it measures: Circular dependencies in your dependency graph.

Why AI Creates Cycles

When I prompt Claude: “Add email notifications to the user service,” it might add:

# user_service.py
from email_service import send_email

def create_user(data):
    # ... creates user
    send_email(user.email, "Welcome!")

# email_service.py
from user_service import get_user_preferences

def send_email(address, subject):
    prefs = get_user_preferences(address)
    # ... sends email based on prefs

This creates a cycle: user_service → email_service → user_service.

AI generators don’t see the big picture. They solve immediate problems without considering the dependency graph. Each feature addition risks creating cycles.
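For illustration, here is a compact recursive version of Tarjan's algorithm run on the import graph from the example. The graph is a plain dict of module name to imported modules; the helper names `import_cycles` and `share_in_cycles` are my own, and the recursive form assumes the graph is small enough not to hit Python's recursion limit.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. graph: dict mapping module -> list of imported modules."""
    index = {}        # discovery order of each node
    lowlink = {}      # smallest index reachable from the node's DFS subtree
    stack, on_stack = [], set()
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for dep in graph.get(node, ()):
            if dep not in index:
                visit(dep)
                lowlink[node] = min(lowlink[node], lowlink[dep])
            elif dep in on_stack:
                lowlink[node] = min(lowlink[node], index[dep])
        if lowlink[node] == index[node]:   # node is the root of a component
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            components.append(component)

    for node in list(graph):
        if node not in index:
            visit(node)
    return components

def import_cycles(graph):
    """Components with more than one module, or a module importing itself."""
    return [c for c in strongly_connected_components(graph)
            if len(c) > 1 or any(n in graph.get(n, ()) for n in c)]

graph = {
    "user_service": ["email_service", "db"],
    "email_service": ["user_service"],
    "db": [],
}
cycles = import_cycles(graph)
# Fraction of modules that sit inside at least one cycle.
share_in_cycles = sum(len(c) for c in cycles) / len(graph)
```

Every module is its own trivial component, so the filter in `import_cycles` is what turns the raw output into a list of actual circular dependencies.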

Cycle Density Score

I compute a cycle density score:

cycle_density = (modules_in_cycles / total_modules) * (avg_cycle_depth / max_depth)

A score of 0.0 means no cycles. Higher scores indicate worse problems. In AI-generated codebases, I’ve seen scores from 0.15 to 0.60 before anyone noticed something was wrong.

Metric 3: Gini Coefficient

This surprised me. The Gini coefficient is normally used to measure income inequality. But it applies perfectly to code distribution.

What it measures: How evenly code is distributed across modules.

The God Module Problem

AI generators love creating god modules:

# Typical AI-generated distribution
modules:
- app.py: 8,500 lines
- utils.py: 120 lines
- helpers.py: 80 lines
- config.py: 45 lines
- constants.py: 30 lines
Gini coefficient: 0.78 (very unequal)

This pattern emerges because AI optimizes for “put related code together” without considering module size. When a module grows too large, it should split. AI doesn’t do this autonomously.

A healthy codebase has more even distribution:

# Well-balanced distribution
modules:
- user_service.py: 520 lines
- auth_service.py: 480 lines
- email_service.py: 450 lines
- payment_service.py: 490 lines
- notification_service.py: 440 lines
Gini coefficient: 0.03 (nearly equal)

Interpretation

Gini         Codebase Health
0.0 - 0.2    Excellent distribution
0.2 - 0.4    Some modules need attention
0.4 - 0.6    Significant god modules exist
0.6 - 1.0    Critical: refactor immediately

I use 1 - Gini as the quality score, so higher is better.
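A minimal sketch of the calculation, assuming lines of code per module as the "income" being distributed. The function and variable names are hypothetical; the formula is the standard rank-weighted form of the Gini coefficient.

```python
def gini(sizes):
    """Gini coefficient of non-negative sizes (e.g. lines of code per module)."""
    values = sorted(sizes)
    n, total = len(values), sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, i from 1
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

god_module_layout = [8500, 120, 80, 45, 30]   # line counts from the god-module example
g = gini(god_module_layout)                   # about 0.78
distribution_score = 1 - g                    # higher is better
```

Note that with n modules the coefficient tops out at (n - 1) / n rather than 1.0, so very small projects can never reach the top of the scale.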

Metric 4: Tree-Sitter Structural Analysis

The first two metrics require language-specific import analysis, and even counting lines per module means knowing what a module is in each language. That’s a problem for polyglot codebases. Enter tree-sitter.

What it measures: Structural code properties using AST parsing.

Tree-sitter is a parser generator that works across 52 programming languages. It parses code into an abstract syntax tree, enabling:

  • Function/method counts
  • Nesting depth measurement
  • Parameter count analysis
  • Class hierarchy extraction

Why Language-Agnostic Matters

My codebase has:

  • Python services
  • TypeScript frontend
  • Go microservices
  • Rust utilities
  • Shell scripts

Each language has different import syntax. But tree-sitter handles all of them uniformly:

MAX_POSSIBLE_ISSUES = 3  # one per check below

def structural_health(ast_graph):
    # max_nesting_depth, max_parameter_count and max_function_length are
    # helpers that walk the parsed tree (not shown here)
    issues = []

    # Deeply nested code (maintainability killer)
    if max_nesting_depth(ast_graph) > 5:
        issues.append("deep_nesting")

    # Too many parameters (AI loves adding "just one more")
    if max_parameter_count(ast_graph) > 7:
        issues.append("parameter_bloat")

    # God functions (AI dumps logic into single functions)
    if max_function_length(ast_graph) > 100:
        issues.append("god_functions")

    return 1.0 - (len(issues) / MAX_POSSIBLE_ISSUES)

This produces a structural score that works across all languages in my codebase.
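As an illustration of one such helper, here is a sketch of a nesting-depth walk. It relies only on nodes exposing `.type` and `.children`, which tree-sitter nodes do. The set of node type names is my assumption based on the tree-sitter Python grammar and would need a per-language mapping in practice, and `FakeNode` is a stand-in so the sketch runs without any grammar installed.

```python
from dataclasses import dataclass, field

# Assumed node types from the tree-sitter Python grammar; other grammars use other names.
NESTING_TYPES = {"if_statement", "for_statement", "while_statement",
                 "try_statement", "with_statement"}

def max_nesting_depth(node, depth=0):
    """Deepest chain of nested control-flow nodes at or below `node`.

    Works on any object exposing `.type` and `.children`, as tree-sitter nodes do.
    """
    if node.type in NESTING_TYPES:
        depth += 1
    return max([depth] + [max_nesting_depth(child, depth) for child in node.children])

@dataclass
class FakeNode:
    """Stand-in for a tree-sitter node so the sketch runs without a grammar installed."""
    type: str
    children: list = field(default_factory=list)

# Shape of: def f(): for ...: if ...: pass
tree = FakeNode("module", [
    FakeNode("function_definition", [
        FakeNode("block", [
            FakeNode("for_statement", [
                FakeNode("block", [FakeNode("if_statement")]),
            ]),
        ]),
    ]),
])
depth = max_nesting_depth(tree)
```

The walk itself is language-agnostic. Only `NESTING_TYPES` changes per grammar, which is where most of the cross-language effort goes.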

Metric 5: Geometric Mean Aggregation

Here’s where it all comes together — and where I made a mistake.

Initially, I averaged the scores:

# WRONG: Arithmetic mean
quality_score = (modularity_q + cycle_score + gini_score + structure_score) / 4

This allowed gaming. I could maximize modularity Q while ignoring cycles and still get a “good” score.

What it measures: Overall quality that cannot be gamed.

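The fix borrows from game theory: the Nash bargaining solution (Nash, 1950) maximizes a product of utilities rather than a sum. The same idea applied here is the geometric mean: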

# CORRECT: Geometric mean
quality_score = (modularity_q * cycle_score * gini_score * structure_score) ** (1/4)

Why This Works

If any single metric approaches zero, the entire score collapses:

Arithmetic mean example:
modularity = 0.9, cycles = 0.1, gini = 0.8, structure = 0.7
score = (0.9 + 0.1 + 0.8 + 0.7) / 4 = 0.625 # Looks acceptable!

Geometric mean example:
score = (0.9 * 0.1 * 0.8 * 0.7) ** 0.25 = 0.474 # Reveals the problem

The geometric mean enforces balance. You cannot game one metric while ignoring others.
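One practical wrinkle: modularity Q can be negative, and a negative factor makes the fourth root undefined. A minimal sketch of an aggregator that handles this by clamping is below. The clamping rule and the function names are my assumptions, not something the formula above prescribes.

```python
import math

def geometric_mean_score(scores, floor=0.0):
    """Geometric mean of scores expected in [0, 1].

    Values are clamped to [floor, 1] first, since modularity Q can be negative
    and a negative factor would make the root undefined.
    """
    clamped = [min(1.0, max(floor, s)) for s in scores]
    if not clamped:
        raise ValueError("at least one score is required")
    if min(clamped) == 0.0:
        return 0.0
    # Mean of logs avoids underflow when many small scores are multiplied.
    return math.exp(sum(math.log(s) for s in clamped) / len(clamped))

def arithmetic_mean_score(scores):
    return sum(scores) / len(scores)

balanced = [0.7, 0.7, 0.7, 0.7]
lopsided = [1.0, 1.0, 0.8, 0.0]   # same arithmetic mean, one metric at zero
```

Both lists average to 0.7 arithmetically, but the geometric mean of `lopsided` is 0: a single failed metric cannot be bought back by the others.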

Putting It Together: A Quality Score Pipeline

Here’s the complete approach:

┌─────────────────────────────────────────────────────────────┐
│                       CODEBASE INPUT                        │
│                (Python, TS, Go, Rust, etc.)                 │
└─────────────────────────────────────────────────────────────┘
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                     TREE-SITTER PARSER                      │
│  - Build AST for each file                                  │
│  - Extract imports/exports                                  │
│  - Build dependency graph                                   │
└─────────────────────────────────────────────────────────────┘
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                  DEPENDENCY GRAPH ANALYSIS                  │
│                                                             │
│   ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐   │
│   │ Newman's Q  │ │ Tarjan's    │ │  Code Distribution  │   │
│   │ Modularity  │ │ Cycle Det.  │ │ (Gini Coefficient)  │   │
│   └─────────────┘ └─────────────┘ └─────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                     STRUCTURAL ANALYSIS                     │
│  - Nesting depth, function size                             │
│  - Parameter counts, cyclomatic complexity                  │
└─────────────────────────────────────────────────────────────┘
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                 GEOMETRIC MEAN AGGREGATION                  │
│                                                             │
│      Q = (Q_mod * Q_cycle * Q_gini * Q_struct) ^ (1/4)      │
│                                                             │
│                 Nash bargaining principle:                  │
│     Cannot neglect one metric without tanking the score     │
└─────────────────────────────────────────────────────────────┘
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                        QUALITY SCORE                        │
│                  0.0 ──────────────── 1.0                   │
│                                                             │
│         0.8-1.0: Excellent   │  0.4-0.6: Needs work         │
│         0.6-0.8: Good        │  0.0-0.4: Critical           │
└─────────────────────────────────────────────────────────────┘

What I Learned

Mistake 1: Treating Metrics as Targets

Initially, I tried to “maximize the score.” This led to:

  • Artificial module splits (high Q, but poor coherence)
  • Over-abstracting (low Gini, but excessive indirection)

Metrics are diagnostic tools, not goals. They tell you where to investigate, not what to do.

Mistake 2: Uniform Thresholds

A microservice has different acceptable thresholds than a monolith. A prototype differs from production code. I now use weighted thresholds based on project context.

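A static score is less useful than a trend. I now track metrics over time. Cycles and Gini are shown as raw values (lower is better) and converted to scores as 1 - value before aggregating:

Week 1: Q = 0.72, Cycles = 0.05, Gini = 0.25, Structure = 0.81 → Overall: 0.80
Week 4: Q = 0.68, Cycles = 0.12, Gini = 0.31, Structure = 0.74 → Overall: 0.74
Week 8: Q = 0.61, Cycles = 0.28, Gini = 0.45, Structure = 0.62 → Overall: 0.62

The overall score understates the decline. Tracking individual metrics reveals the pattern: cycle density grew more than fivefold while the aggregate fell by less than a quarter.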
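The per-metric comparison can be sketched as a small helper. The snapshot layout, the metric keys, and the `higher_is_better` map are hypothetical choices of mine; the numbers are the raw weekly values listed above.

```python
def metric_trends(snapshots, higher_is_better):
    """Change of each metric from the first to the last snapshot.

    snapshots: list of dicts with the same metric names, oldest first.
    higher_is_better: dict metric -> bool; the sign is flipped for metrics
    where a rising value is bad, so a negative result always means "got worse".
    """
    first, last = snapshots[0], snapshots[-1]
    trends = {}
    for name, better_up in higher_is_better.items():
        delta = last[name] - first[name]
        trends[name] = delta if better_up else -delta
    return trends

snapshots = [
    {"q": 0.72, "cycles": 0.05, "gini": 0.25, "structure": 0.81},  # week 1
    {"q": 0.68, "cycles": 0.12, "gini": 0.31, "structure": 0.74},  # week 4
    {"q": 0.61, "cycles": 0.28, "gini": 0.45, "structure": 0.62},  # week 8
]
direction = {"q": True, "cycles": False, "gini": False, "structure": True}
trends = metric_trends(snapshots, direction)
worsening = sorted(name for name, change in trends.items() if change < 0)
```

Normalizing the sign up front means any alerting logic only has to look for negative numbers, regardless of which way a given metric naturally runs.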

Implementation Notes

For tree-sitter integration, I use the Python bindings:

import tree_sitter_python as tspython
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

def parse_file(path):
    with open(path, 'rb') as f:
        tree = parser.parse(f.read())
    return tree.root_node

For dependency graph analysis, I built a simple graph structure and implemented Newman’s Q and Tarjan’s SCC detection using standard algorithms.
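A sketch of what such a graph structure might look like for the Python-only case. To keep it dependency-free it swaps tree-sitter for the standard-library `ast` module, so it is not the cross-language pipeline described above. The function names and the two sample modules are hypothetical.

```python
import ast

def imported_modules(source):
    """Top-level names of modules imported by a piece of Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports, which need package context to resolve
            found.add(node.module.split(".")[0])
    return found

def dependency_graph(sources):
    """sources: dict module name -> source text. Keeps only in-project edges."""
    project = set(sources)
    return {name: sorted((imported_modules(text) & project) - {name})
            for name, text in sources.items()}

sources = {
    "user_service": "from email_service import send_email\nimport os\n",
    "email_service": "from user_service import get_user_preferences\n",
}
graph = dependency_graph(sources)
```

The resulting dict of module to imported modules is the same shape that a cycle detector or a modularity calculation would consume, with third-party and standard-library imports already filtered out.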

The Real Value

These metrics gave me something I lacked: early warning.

Before:

  • Code works → Ship it → Technical debt accumulates → Refactoring nightmare

After:

  • Code works → Check metrics → Fix architectural issues → Ship healthy code

When AI generates code, I now run this analysis before committing. It catches problems that code review misses — because reviewers see individual files, not the dependency graph.

Conclusion

Newman’s modularity Q catches coupling problems. Tarjan’s cycle detection finds circular dependencies. Gini coefficient identifies god modules. Tree-sitter enables cross-language analysis. Geometric mean prevents gaming.

Together, they provide objective measurement of what “good architecture” means — something I can track, trend, and improve as AI assistants generate more of my codebase.

Start by integrating tree-sitter parsing into your CI pipeline. Establish baseline metrics now, before AI-generated code accumulates technical debt you can’t measure.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.
