What Codebase Metrics Can Improve AI-Generated Code Quality? (A Complete Guide)
I asked Claude to generate a utility module for my project. The code worked perfectly — clean syntax, proper error handling, even comprehensive tests. Three months later, that same module became a maintenance nightmare. Every small change cascaded into unexpected breakages. The module had quietly accumulated circular dependencies, god-class behavior, and tight coupling that no linter or code review caught.
The problem? I had no way to measure architectural quality. I only measured correctness.
This is the hidden trap of AI-generated code: it produces syntactically correct but architecturally problematic output. Let me walk through the five codebase metrics that finally gave me objective visibility into this problem.
The Measurement Gap
Traditional code quality tools focus on surface-level issues:
- Linting catches syntax problems
- Type checkers catch type errors
- Tests verify behavior, and coverage shows how much code they exercise
None of these measure maintainability — the ability to change code without breaking everything.
When AI generates code, it optimizes for:
- Completing the task
- Following syntax rules
- Passing tests
It does NOT optimize for:
- Module boundaries
- Dependency complexity
- Code distribution balance
- Long-term architectural health
I needed metrics that measure these structural properties. Here’s what I found.
Metric 1: Newman’s Modularity Q
I first encountered this in a Reddit discussion about codebase quality metrics. Newman’s Modularity Q comes from network science — it measures how well a graph separates into distinct communities.
What it measures: Architectural cohesion vs. coupling.
The formula compares:
- Edges within modules (good — cohesion)
- Edges between modules (bad — coupling)
A Q-score ranges from -0.5 to 1.0. Higher values mean better modularity.
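To make the comparison concrete, here is a minimal sketch of the Q computation. It assumes the dependency graph is treated as undirected and unweighted (a simplification, since imports have a direction) and that each module is already assigned to a community such as its package. The names `modularity_q` and `community_of` are illustrative, not taken from any library. For each community it takes the fraction of edges that fall inside it and subtracts the fraction expected by chance, `(degree_sum / 2m) ** 2`.

```python
from collections import defaultdict


def modularity_q(edges, community_of):
    """Newman's Q for an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each dependency listed once.
    community_of: dict mapping node -> community label (e.g. its package).
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)  # edges with both ends in the same community
    degree = defaultdict(int)    # sum of node degrees per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two tightly knit triangles joined by a single cross-package edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
packages = {"a": "x", "b": "x", "c": "x", "d": "y", "e": "y", "f": "y"}
print(round(modularity_q(edges, packages), 3))  # 0.357
```

Putting every module into a single community gives Q = 0 with this formula, which is why a high score requires both dense packages and few edges between them.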
The AI Code Problem
When I analyzed AI-generated codebases, I noticed a pattern:
    # AI-generated module structure (Q = 0.12)
    user_service imports: [auth, db, cache, email, logger, config, utils, validators]
    auth imports: [user_service, db, cache, session, tokens]
    db imports: [user_service, auth, config, logger]

    # Well-structured codebase (Q = 0.73)
    user_service imports: [user_repository]
    auth imports: [token_service]
    user_repository imports: [db_connection]

AI generators tend to create “everything depends on everything” structures because they stitch together code without understanding module boundaries. The result: a low Q-score that predicts future refactoring pain.
How to Interpret
| Q-Score | Interpretation |
|---|---|
| 0.7 - 1.0 | Excellent modularity |
| 0.4 - 0.7 | Acceptable, some coupling issues |
| 0.0 - 0.4 | Significant architectural problems |
| < 0.0 | Modules are arbitrary — restructure immediately |
Metric 2: Tarjan’s Cycle Detection
This one hit me hard. I inherited a codebase that “just worked” until I tried to extract a module for reuse. Import errors cascaded everywhere. Circular dependencies.
Tarjan’s algorithm (1972) finds strongly connected components — groups of modules that all depend on each other. In codebases, these are architectural red flags.
What it measures: Circular dependencies in your dependency graph.
Why AI Creates Cycles
When I prompt Claude: “Add email notifications to the user service,” it might add:
    # user_service.py
    from email_service import send_email

    def create_user(data):
        # ... creates user
        send_email(user.email, "Welcome!")

    # email_service.py
    from user_service import get_user_preferences

    def send_email(address, subject):
        prefs = get_user_preferences(address)
        # ... sends email based on prefs

This creates a cycle: user_service → email_service → user_service.
AI generators don’t see the big picture. They solve immediate problems without considering the dependency graph. Each feature addition risks creating cycles.
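A cycle like this one can be found mechanically. The following is a minimal sketch of Tarjan's strongly connected components algorithm over a dependency graph stored as a plain dict. The function names and the extra `db` module are illustrative assumptions. The recursive form is the textbook one; a very deep graph might need an iterative rewrite to stay under Python's recursion limit.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. graph: dict mapping module -> list of imported modules."""
    index = {}        # discovery order of each node
    lowlink = {}      # lowest index reachable from the node's DFS subtree
    stack = []
    on_stack = set()
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for dep in graph.get(node, ()):
            if dep not in index:
                visit(dep)
                lowlink[node] = min(lowlink[node], lowlink[dep])
            elif dep in on_stack:
                lowlink[node] = min(lowlink[node], index[dep])
        if lowlink[node] == index[node]:
            # node is the root of a component: pop its members off the stack
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            components.append(component)

    for node in list(graph):
        if node not in index:
            visit(node)
    return components


def import_cycles(graph):
    """Components with more than one module, plus self-imports, are cycles."""
    return [
        sorted(c) for c in strongly_connected_components(graph)
        if len(c) > 1 or c[0] in graph.get(c[0], ())
    ]


graph = {
    "user_service": ["email_service", "db"],
    "email_service": ["user_service"],
    "db": [],  # hypothetical extra module with no cycle
}
print(import_cycles(graph))  # [['email_service', 'user_service']]
```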
Cycle Density Score
I compute a cycle density score:
    cycle_density = (modules_in_cycles / total_modules) * (avg_cycle_depth / max_depth)

A score of 0.0 means no cycles. Higher scores indicate worse problems. In AI-generated codebases, I’ve seen scores from 0.15 to 0.60 before anyone noticed something was wrong.
Metric 3: Gini Coefficient
This surprised me. The Gini coefficient is normally used to measure income inequality. But it applies perfectly to code distribution.
What it measures: How evenly code is distributed across modules.
The God Module Problem
AI generators love creating god modules:
    # Typical AI-generated distribution
    modules:
      - app.py: 8,500 lines
      - utils.py: 120 lines
      - helpers.py: 80 lines
      - config.py: 45 lines
      - constants.py: 30 lines

    Gini coefficient: 0.78 (very unequal)

This pattern emerges because AI optimizes for “put related code together” without considering module size. When a module grows too large, it should split. AI doesn’t do this autonomously.
A healthy codebase has more even distribution:
    # Well-balanced distribution
    modules:
      - user_service.py: 520 lines
      - auth_service.py: 480 lines
      - email_service.py: 450 lines
      - payment_service.py: 490 lines
      - notification_service.py: 440 lines

    Gini coefficient: 0.03 (nearly equal)

Interpretation
| Gini | Codebase Health |
|---|---|
| 0.0 - 0.2 | Excellent distribution |
| 0.2 - 0.4 | Some modules need attention |
| 0.4 - 0.6 | Significant god modules exist |
| 0.6 - 1.0 | Critical — refactor immediately |
I use 1 - Gini as the quality score, so higher is better.
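For reference, a minimal sketch of the calculation, using the standard rank-weighted formula over sorted module sizes. The function name `gini` and the choice of lines per module as the size measure are assumptions for illustration.

```python
def gini(sizes):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    values = sorted(sizes)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum: larger values get larger ranks after sorting.
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


god_module = [8500, 120, 80, 45, 30]   # line counts from the example above
print(round(gini(god_module), 2))      # 0.78
print(round(1 - gini(god_module), 2))  # 0.22, the "higher is better" score
```

With only a handful of modules the coefficient cannot reach 1.0 (its maximum is `(n - 1) / n`), so very small codebases may warrant slightly different thresholds.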
Metric 4: Tree-Sitter Structural Analysis
The first three metrics require language-specific import analysis. That’s a problem for polyglot codebases. Enter tree-sitter.
What it measures: Structural code properties using AST parsing.
Tree-sitter is a parser generator that works across 52 programming languages. It parses code into an abstract syntax tree, enabling:
- Function/method counts
- Nesting depth measurement
- Parameter count analysis
- Class hierarchy extraction
Why Language-Agnostic Matters
My codebase has:
- Python services
- TypeScript frontend
- Go microservices
- Rust utilities
- Shell scripts
Each language has different import syntax. But tree-sitter handles all of them uniformly:
    MAX_POSSIBLE_ISSUES = 3  # assumed: one per check below

    def structural_health(ast_graph):
        # max_nesting_depth, max_parameter_count and max_function_length are
        # helpers that walk the parsed tree (not shown here).
        issues = []

        # Deeply nested code (maintainability killer)
        if max_nesting_depth(ast_graph) > 5:
            issues.append("deep_nesting")

        # Too many parameters (AI loves adding "just one more")
        if max_parameter_count(ast_graph) > 7:
            issues.append("parameter_bloat")

        # God functions (AI dumps logic into single functions)
        if max_function_length(ast_graph) > 100:
            issues.append("god_functions")

        return 1.0 - (len(issues) / MAX_POSSIBLE_ISSUES)

This produces a structural score that works across all languages in my codebase.
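To show what the three measurements might look like in practice, here is a sketch for Python source only. It uses the standard-library `ast` module as a stand-in for tree-sitter so it runs without extra packages. A tree-sitter version would walk node types in the same way, but that is not what this code does. The function names and the set of node types counted as nesting are assumptions.

```python
import ast

# Node types treated as one level of nesting (an assumption; adjust to taste).
# Note that `elif` is parsed as an If nested inside the previous If's else
# branch, so long elif chains also count as deeper nesting here.
NESTING_NODES = (ast.If, ast.For, ast.While, ast.With, ast.Try)


def nesting_depth(node, depth=0):
    """Deepest chain of nested control-flow blocks under `node`."""
    deepest = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, NESTING_NODES) else depth
        deepest = max(deepest, nesting_depth(child, child_depth))
    return deepest


def param_count(fn):
    """Number of declared parameters, including *args and **kwargs."""
    a = fn.args
    return (len(a.posonlyargs) + len(a.args) + len(a.kwonlyargs)
            + (a.vararg is not None) + (a.kwarg is not None))


def function_metrics(source):
    """Max nesting depth, parameter count and length over all functions."""
    functions = [n for n in ast.walk(ast.parse(source))
                 if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    if not functions:
        return {"max_nesting": 0, "max_params": 0, "max_length": 0}
    return {
        "max_nesting": max(nesting_depth(fn) for fn in functions),
        "max_params": max(param_count(fn) for fn in functions),
        "max_length": max(fn.end_lineno - fn.lineno + 1 for fn in functions),
    }
```

The returned values can be compared against thresholds such as the 5, 7 and 100 used in the scoring function above.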
Metric 5: Geometric Mean Aggregation
Here’s where it all comes together — and where I made a mistake.
Initially, I averaged the scores:
    # WRONG: Arithmetic mean
    quality_score = (modularity_q + cycle_score + gini_score + structure_score) / 4

This allowed gaming. I could maximize modularity Q while ignoring cycles and still get a “good” score.
What it measures: Overall quality that cannot be gamed.
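The solution borrows from game theory: the Nash bargaining solution (Nash, 1950) maximizes the product of the players' utilities rather than their sum. Applied to metric scores, that is the geometric mean: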
    # CORRECT: Geometric mean
    quality_score = (modularity_q * cycle_score * gini_score * structure_score) ** (1/4)

Why This Works
If any single metric approaches zero, the entire score collapses:
    Arithmetic mean example:
    modularity = 0.9, cycles = 0.1, gini = 0.8, structure = 0.7
    score = (0.9 + 0.1 + 0.8 + 0.7) / 4 = 0.625  # Looks acceptable!

    Geometric mean example:
    score = (0.9 * 0.1 * 0.8 * 0.7) ** 0.25 = 0.474  # Reveals the problem

The geometric mean enforces balance. You cannot game one metric while ignoring others.
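The comparison is easy to reproduce. This sketch assumes every input is already a "higher is better" score in the range 0 to 1. Since modularity Q can be negative, it would need to be clamped or rescaled first, and that step is an assumption of this sketch rather than something described above.

```python
import math


def arithmetic_mean(scores):
    return sum(scores) / len(scores)


def geometric_mean(scores):
    """Geometric mean of scores in [0, 1]; any zero collapses the result."""
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative; clamp or rescale first")
    return math.prod(scores) ** (1 / len(scores))


scores = [0.9, 0.1, 0.8, 0.7]  # modularity, cycles, gini, structure
print(round(arithmetic_mean(scores), 3))
print(round(geometric_mean(scores), 3))

# One metric at zero drags the whole score to zero.
print(geometric_mean([0.9, 0.0, 0.8, 0.7]))  # 0.0
```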
Putting It Together: A Quality Score Pipeline
Here’s the complete approach:
    ┌─────────────────────────────────────────────────────────────┐
    │                       CODEBASE INPUT                        │
    │                (Python, TS, Go, Rust, etc.)                 │
    └─────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
    ┌─────────────────────────────────────────────────────────────┐
    │                     TREE-SITTER PARSER                      │
    │  - Build AST for each file                                  │
    │  - Extract imports/exports                                  │
    │  - Build dependency graph                                   │
    └─────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
    ┌─────────────────────────────────────────────────────────────┐
    │                  DEPENDENCY GRAPH ANALYSIS                  │
    │                                                             │
    │   ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐   │
    │   │ Newman's Q  │ │ Tarjan's    │ │ Code Distribution   │   │
    │   │ Modularity  │ │ Cycle Det.  │ │ (Gini Coefficient)  │   │
    │   └─────────────┘ └─────────────┘ └─────────────────────┘   │
    └─────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
    ┌─────────────────────────────────────────────────────────────┐
    │                     STRUCTURAL ANALYSIS                     │
    │  - Nesting depth, function size                             │
    │  - Parameter counts, cyclomatic complexity                  │
    └─────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
    ┌─────────────────────────────────────────────────────────────┐
    │                 GEOMETRIC MEAN AGGREGATION                  │
    │                                                             │
    │  Q = (Q_mod * Q_cycle * Q_gini * Q_struct) ^ (1/4)          │
    │                                                             │
    │  Nash equilibrium principle:                                │
    │  Cannot game one metric without tanking all                 │
    └─────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
    ┌─────────────────────────────────────────────────────────────┐
    │                        QUALITY SCORE                        │
    │                  0.0 ──────────────── 1.0                   │
    │                                                             │
    │  0.8-1.0: Excellent  │ 0.4-0.6: Needs work                  │
    │  0.6-0.8: Good       │ 0.0-0.4: Critical                    │
    └─────────────────────────────────────────────────────────────┘

What I Learned
Mistake 1: Treating Metrics as Targets
Initially, I tried to “maximize the score.” This led to:
- Artificial module splits (high Q, but poor coherence)
- Over-abstracting (low Gini, but excessive indirection)
Metrics are diagnostic tools, not goals. They tell you where to investigate, not what to do.
Mistake 2: Uniform Thresholds
A microservice has different acceptable thresholds than a monolith. A prototype differs from production code. I now use weighted thresholds based on project context.
Mistake 3: Ignoring Trends
A static score is less useful than a trend. I now track metrics over time:
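    Week 1: Q = 0.72, Cycles = 0.05, Gini = 0.25, Structure = 0.81 → Overall: 0.80
    Week 4: Q = 0.68, Cycles = 0.12, Gini = 0.31, Structure = 0.74 → Overall: 0.74
    Week 8: Q = 0.61, Cycles = 0.28, Gini = 0.45, Structure = 0.62 → Overall: 0.62

Here Cycles is the cycle density and Gini is the raw coefficient. The overall score uses 1 - Gini as described earlier and assumes cycle density is inverted the same way. The overall score shows that something is slipping, but not what. Tracking individual metrics reveals the pattern: cycle density grew more than fivefold while the other metrics drifted more slowly.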
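A small sketch of what per-metric tracking could look like, using the weekly values from the log above. The `tolerance` of 0.05 and the function name `regressions` are hypothetical choices, not part of any tool.

```python
# "cycles" and "gini" are raw values where lower is better;
# "q" and "structure" are scores where higher is better.
snapshots = [
    {"week": 1, "q": 0.72, "cycles": 0.05, "gini": 0.25, "structure": 0.81},
    {"week": 4, "q": 0.68, "cycles": 0.12, "gini": 0.31, "structure": 0.74},
    {"week": 8, "q": 0.61, "cycles": 0.28, "gini": 0.45, "structure": 0.62},
]
LOWER_IS_BETTER = {"cycles", "gini"}


def regressions(first, last, tolerance=0.05):
    """Metrics that got worse by more than `tolerance` between two snapshots."""
    worse = {}
    for name in ("q", "cycles", "gini", "structure"):
        delta = last[name] - first[name]
        if name in LOWER_IS_BETTER:
            delta = -delta  # flip the sign so negative always means "worse"
        if delta < -tolerance:
            worse[name] = round(delta, 2)
    return worse


# Comparing week 1 with week 8 flags all four metrics.
print(regressions(snapshots[0], snapshots[-1]))
```

Comparing consecutive snapshots instead of first and last would show when each metric started to slip.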
Implementation Notes
For tree-sitter integration, I use the Python bindings:
    import tree_sitter_python as tspython
    from tree_sitter import Language, Parser

    PY_LANGUAGE = Language(tspython.language())
    parser = Parser(PY_LANGUAGE)

    def parse_file(path):
        with open(path, 'rb') as f:
            tree = parser.parse(f.read())
        return tree.root_node

For dependency graph analysis, I built a simple graph structure and implemented Newman’s Q and Tarjan’s SCC detection using standard algorithms.
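The graph structure itself is not shown, so the following is only a guess at its simplest form: a dict mapping each module to the internal modules it imports. To keep the sketch runnable without extra packages it extracts imports with Python's standard-library `ast` module rather than tree-sitter, and it ignores relative imports. Both are simplifications, and the function names are hypothetical.

```python
import ast


def imported_modules(source):
    """Top-level module names imported by a piece of Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level > 0 means a relative import, which this sketch skips
            names.add(node.module.split(".")[0])
    return names


def build_dependency_graph(sources):
    """sources: dict of module name -> source text. Keeps only internal edges."""
    return {
        name: sorted((imported_modules(text) & set(sources)) - {name})
        for name, text in sources.items()
    }


sources = {
    "user_service": "from email_service import send_email\nimport os\n",
    "email_service": "from user_service import get_user_preferences\n",
}
print(build_dependency_graph(sources))
```

A dict of this shape is enough input for both a modularity calculation and a strongly connected components pass.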
The Real Value
These metrics gave me something I lacked: early warning.
Before:
- Code works → Ship it → Technical debt accumulates → Refactoring nightmare
After:
- Code works → Check metrics → Fix architectural issues → Ship healthy code
When AI generates code, I now run this analysis before committing. It catches problems that code review misses — because reviewers see individual files, not the dependency graph.
Conclusion
Newman’s modularity Q catches coupling problems. Tarjan’s cycle detection finds circular dependencies. Gini coefficient identifies god modules. Tree-sitter enables cross-language analysis. Geometric mean prevents gaming.
Together, they provide objective measurement of what “good architecture” means — something I can track, trend, and improve as AI assistants generate more of my codebase.
Start by integrating tree-sitter parsing into your CI pipeline. Establish baseline metrics now, before AI-generated code accumulates technical debt you can’t measure.
Final Words + More Resources
My intention with this article was to share my knowledge and experience so that others can benefit from it. If you want to get in touch, you can contact me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Newman, M.E.J. (2006) - Modularity and community structure in networks
- Tarjan's strongly connected components algorithm
- Gini coefficient - Wikipedia
- Tree-sitter - An incremental parsing system for programming tools
- Nash equilibrium - Wikipedia
- Reddit discussion on measuring codebase quality mathematically
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!