
Is Lines of Code a Good Metric for AI Coding Assistant Productivity?

Problem

When I saw someone claiming their AI tool generates “600,000+ lines of production code in 60 days” or “10k-20k lines per day,” I immediately questioned whether lines of code (LOC) means anything useful for measuring AI coding productivity.

I’ve worked with AI coding assistants extensively, and I noticed something counterintuitive: the more code an AI generates, the more cleanup I often have to do.

What’s Wrong with LOC for AI-Generated Code?

I think the core issue is that AI assistants generate verbose code by default. Here’s a simple comparison:

verbosity-comparison.txt
Human Developer (2 lines):

def calculate(x):
    return x * 2

AI Assistant (10 lines):

def calculate(x: int) -> int:
    """
    Calculate the doubled value.
    Args:
        x: The input integer
    Returns:
        The doubled result
    """
    result = x * 2
    return result

Both achieve the same thing. But if I measure productivity by LOC, the AI version looks 5x more “productive.” In reality, I have to read, understand, and maintain all 10 lines.
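To make the gap concrete, here is a minimal sketch that compares the two versions three ways: raw non-blank lines, executable statements, and actual behavior. The helper names (`physical_lines`, `executable_statements`, `run`) are my own for illustration, and "executable statements" here simply means function-body statements that are not bare string expressions such as docstrings.

```python
import ast

TERSE = '''
def calculate(x):
    return x * 2
'''

VERBOSE = '''
def calculate(x: int) -> int:
    """
    Calculate the doubled value.
    Args:
        x: The input integer
    Returns:
        The doubled result
    """
    result = x * 2
    return result
'''


def physical_lines(source: str) -> int:
    """Count non-blank lines, which is what a naive LOC metric sees."""
    return sum(1 for line in source.splitlines() if line.strip())


def executable_statements(source: str) -> int:
    """Count statements in function bodies, skipping bare strings (docstrings)."""
    count = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            for stmt in node.body:
                is_bare_string = (
                    isinstance(stmt, ast.Expr)
                    and isinstance(stmt.value, ast.Constant)
                    and isinstance(stmt.value.value, str)
                )
                if not is_bare_string:
                    count += 1
    return count


def run(source: str, arg: int) -> int:
    """Execute the snippet and call its calculate() function."""
    namespace = {}
    exec(source, namespace)
    return namespace["calculate"](arg)


if __name__ == "__main__":
    for label, src in (("terse", TERSE), ("verbose", VERBOSE)):
        print(label, physical_lines(src), executable_statements(src), run(src, 21))
```

The line counts differ by a wide margin while the results are identical, which is exactly the information a LOC metric throws away.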

The Verbosity Bias

I analyzed my own AI coding sessions and found this pattern:

ai-code-analysis.txt
AI Code Generation

Prompt: "Add error handling to this function"

Human Result (3 lines):

    try:
        result = process(data)
    except ValueError: return None

AI Result (15+ lines):

    try:
        # Attempt to process the data
        result = process(data)
    except ValueError as e:
        # Log the error for debugging
        logger.error(f"Error: {e}")
        return None
    except TypeError as e:
        # Handle type errors
        logger.warning(f"Type error: {e}")
        return None
    except Exception as e:
        # Catch any other exceptions
        logger.critical(f"Unexpected: {e}")
        return None

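Again, the AI version has more lines. Is it better? Sometimes. But all three branches return None and differ only in log level, and the catch-all `except Exception` can hide bugs that the 3-line version would let surface. Often it’s over-engineering for the actual requirements.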

What the Reddit Discussion Revealed

When I looked at the Reddit discussion about Garry Tan’s gstack claims, the top comments were skeptical:

reddit-reactions.txt
Top Comment (145 upvotes):
"So 5M+ lines of code per year! As we all know, more code
is always better so it must be really good."
Engineering Critique:
"LOC theatre. '600,000+ lines of production code in 60 days'
— anyone who's worked in a serious engineering org knows lines
of code is a vanity metric at best and actively misleading
at worst."
AI-Specific Concern:
"AI-generated code is verbose by default. 35% test coverage
doesn't redeem that — it just means 35% of the bloat has tests."
Practical Question:
"ship 10-20k lines per day - where? To what?"

These comments highlight the absurdity of using LOC as a productivity metric, especially for AI-generated code.
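A quick back-of-the-envelope check shows why the numbers invite skepticism. The sketch below takes the claimed 600,000 lines over 60 days and asks how many review hours per day that would imply. The review rate of 400 lines per hour is purely a hypothetical figure I picked for illustration, not a measured value; plug in your own.

```python
def claim_breakdown(total_lines: int, days: int, review_lines_per_hour: float) -> dict:
    """Turn a headline LOC claim into per-day output and implied review time."""
    lines_per_day = total_lines / days
    return {
        "lines_per_day": lines_per_day,
        "lines_per_year": lines_per_day * 365,
        "review_hours_per_day": lines_per_day / review_lines_per_hour,
    }


# Hypothetical assumption: one careful reviewer reads about 400 lines per hour.
ASSUMED_REVIEW_RATE = 400

breakdown = claim_breakdown(600_000, 60, ASSUMED_REVIEW_RATE)

if __name__ == "__main__":
    for name, value in breakdown.items():
        print(f"{name}: {value:,.0f}")
```

Under that assumed rate, reviewing a single day's output would take more hours than a day contains, so either most of the code is not being read closely or the lines are not the kind that need reading.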

Why LOC Fails as a Metric

I think the problem runs deeper than verbosity. Here’s my analysis:

loc-problems.txt
Why LOC Fails for AI Coding

1. INVERSE QUALITY RELATIONSHIP
   Great Developer   → Negative LOC (deletes)
   Average Developer → Low/Neutral LOC
   Poor Developer    → High LOC (bloat)
   AI (unfiltered)   → Very High LOC (verbose)

2. CONTEXT BLINDNESS
   "10k-20k lines per day" tells you nothing about:
   • What problem was solved
   • Whether tests exist
   • Code maintainability
   • User value delivered
   • Bug count and severity

3. HISTORICAL LESSONS IGNORED
   • A quote widely attributed to Bill Gates: "Measuring programming
     progress by lines of code is like measuring aircraft building
     progress by weight"
   • Many mature orgs moved away from SLOC metrics decades ago
   • Function points were introduced as an alternative to LOC for estimation

The irony is that the best developers I know often have negative LOC contributions over time. Through refactoring and simplification, they delete more code than they add.
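If you want to see net LOC per contributor in your own repository, here is a rough sketch. It assumes the input is text in the shape produced by `git log --numstat --format=@%an`, where the `@` prefix is just a marker I chose to make author lines easy to spot. The sample log and author names are made up.

```python
from collections import defaultdict

# Made-up sample in the shape of `git log --numstat --format=@%an` output:
# an "@author" line, then "added<TAB>removed<TAB>path" rows per commit.
SAMPLE_LOG = """\
@alice
3\t120\tsrc/billing.py
0\t45\tsrc/legacy_utils.py

@assistant
850\t2\tsrc/billing_v2.py
-\t-\tdocs/diagram.png
"""


def net_loc_by_author(log_text: str) -> dict:
    """Return lines added minus lines removed for each author."""
    totals = defaultdict(lambda: {"added": 0, "removed": 0})
    author = None
    for line in log_text.splitlines():
        if line.startswith("@"):
            author = line[1:]
            continue
        parts = line.split("\t")
        # Skip blank lines and binary files, which numstat reports as "-".
        if author is None or len(parts) != 3 or parts[0] == "-":
            continue
        totals[author]["added"] += int(parts[0])
        totals[author]["removed"] += int(parts[1])
    return {name: t["added"] - t["removed"] for name, t in totals.items()}


if __name__ == "__main__":
    for name, net in net_loc_by_author(SAMPLE_LOG).items():
        print(f"{name}: {net:+d}")
```

In this made-up sample, the contributor who removed legacy code ends up with a negative number, and a LOC leaderboard would rank them last.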

Better Metrics for AI Coding Productivity

So if LOC is misleading, what should I measure instead? I’ve found these categories useful:

Output Quality Metrics

| Metric | Why It Matters | How to Measure |
|---|---|---|
| Feature Delivery Time | Time from spec to working feature | Issue tracker timestamps |
| Code Review Cycles | Fewer rounds = clearer initial code | PR iteration count |
| Bug Rate Post-Merge | Quality indicator | Issue tracker + time window |
| Test Coverage | Especially for new code paths | Coverage tools |
| Customer Satisfaction | Does shipped code solve real problems? | Feedback surveys |
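
As one example of turning these into numbers, here is a minimal sketch of "Bug Rate Post-Merge" using an issue tracker export and a time window. The `MergedChange` record and its fields are hypothetical; in practice you would fill them from whatever your tracker exposes, and linking a bug back to a change is the hard part this sketch takes for granted.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class MergedChange:
    """Hypothetical record: one merged change plus bugs traced back to it."""
    change_id: str
    merged_at: datetime
    bug_reports: list = field(default_factory=list)  # datetimes of linked bugs


def bug_rate_post_merge(changes: list, window_days: int = 30) -> float:
    """Average number of bugs reported within the window after each merge."""
    if not changes:
        return 0.0
    window = timedelta(days=window_days)
    bugs = sum(
        1
        for change in changes
        for reported_at in change.bug_reports
        if timedelta(0) <= reported_at - change.merged_at <= window
    )
    return bugs / len(changes)


# Made-up data: one bug lands inside the 30-day window, one well outside it.
changes = [
    MergedChange("change-1", datetime(2024, 1, 1),
                 [datetime(2024, 1, 10), datetime(2024, 3, 1)]),
    MergedChange("change-2", datetime(2024, 1, 5)),
]

if __name__ == "__main__":
    print(bug_rate_post_merge(changes, window_days=30))
```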

Code Health Metrics

| Metric | Why It Matters | How to Measure |
|---|---|---|
| Code Deletion Ratio | Great developers delete code | Git stats (lines removed) |
| Complexity Scores | Lower is more maintainable | Cyclomatic complexity tools |
| Documentation Coverage | Self-explanatory code | Doc coverage tools |
| Static Analysis Score | Code smell detection | Linters, SonarQube |
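
To give a feel for what a complexity score captures, here is a rough approximation of cyclomatic complexity built on Python's `ast` module. It is a simplified sketch, not a replacement for a real tool: it counts one path plus one for each branch point it knows about. The two samples are the error-handling variants from earlier, wrapped in a hypothetical `handle` function so they parse on their own.

```python
import ast

SHORT_VERSION = '''
def handle(data):
    try:
        result = process(data)
    except ValueError:
        return None
    return result
'''

LONG_VERSION = '''
def handle(data):
    try:
        result = process(data)
    except ValueError as e:
        logger.error(f"Error: {e}")
        return None
    except TypeError as e:
        logger.warning(f"Type error: {e}")
        return None
    except Exception as e:
        logger.critical(f"Unexpected: {e}")
        return None
    return result
'''

# Node types treated as decision points in this simplified count.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def approximate_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    score = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra paths.
            score += len(node.values) - 1
    return score


if __name__ == "__main__":
    print("short:", approximate_complexity(SHORT_VERSION))
    print("long:", approximate_complexity(LONG_VERSION))
```

The code is only parsed, never executed, so `process` and `logger` do not need to exist for the sketch to run.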

Developer Experience Metrics

| Metric | Why It Matters | How to Measure |
|---|---|---|
| Time Saved by AI | Actual productivity gain | Developer surveys, time tracking |
| Iteration Speed | How fast can devs refine code? | Code review turnaround |
| Learning Curve | Does AI help developers learn? | Skill assessments over time |

A Better Visualization

I think the relationship between LOC and actual productivity looks like this:

productivity-curve.txt
Actual
Value
│        ★ Optimal Zone
│        ╱╲
│       ╱  ╲
│      ╱    ╲
│     ╱      ╲
│    ╱        ╲
│   ╱          ╲
│  ╱            ╲
│ ╱              ╲
│╱                ╲
└──────────────────────────→ Lines of Code
 Too Little    Just Right    Too Much
 (incomplete)  (optimal)     (bloat/technical debt)

The key insight: there’s an optimal zone. Both too little and too much code indicate problems.

What I Actually Track

For my own AI coding sessions, I track these instead of LOC:

my-tracking-metrics.txt
1. FEATURE VELOCITY
   - Features shipped per sprint
   - Time from idea to production
2. CODE QUALITY
   - Bugs found in code review
   - Bugs found in production
   - Test coverage percentage
3. MAINTENANCE BURDEN
   - Time spent on bug fixes
   - Time spent on refactoring
   - Time spent understanding AI-generated code
4. DEVELOPER SATISFACTION
   - "Did this AI help or hinder?"
   - "How much cleanup was needed?"
These metrics actually tell me whether the AI coding assistant is helping or creating more work.
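
A lightweight way to keep this kind of log is one record per session. The sketch below is only an illustration of the idea: the `SessionLog` fields mirror the list above, `minutes_building` is an extra field I am assuming so there is something to compare maintenance time against, and the sample numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class SessionLog:
    """Hypothetical per-session record of the metrics listed above."""
    features_shipped: int
    bugs_in_review: int
    bugs_in_production: int
    minutes_building: int               # assumed field: time on new work
    minutes_bug_fixes: int
    minutes_refactoring: int
    minutes_understanding_ai_code: int
    helped: bool                        # "Did this AI help or hinder?"


def maintenance_share(sessions: list) -> float:
    """Fraction of tracked time that went to maintenance rather than building."""
    maintenance = sum(
        s.minutes_bug_fixes + s.minutes_refactoring + s.minutes_understanding_ai_code
        for s in sessions
    )
    total = maintenance + sum(s.minutes_building for s in sessions)
    return maintenance / total if total else 0.0


# Invented sample sessions.
sessions = [
    SessionLog(2, 1, 0, 120, 30, 20, 30, True),
    SessionLog(1, 3, 1, 60, 10, 0, 30, False),
]

if __name__ == "__main__":
    print(f"maintenance share: {maintenance_share(sessions):.0%}")
    print(f"sessions where AI helped: {sum(s.helped for s in sessions)}/{len(sessions)}")
```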

The Real Question

When someone claims “600,000 lines of code in 60 days,” I want to know:

real-questions.txt
Questions that Actually Matter

1. What percentage shipped to production?
2. How much was deleted within 30 days?
3. What's the bug rate post-merge?
4. Did developers spend more time reviewing/fixing than they saved?
5. Would a human have written 100k lines to solve the same problems?

Without answers to these questions, LOC counts are just noise.
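
Several of these questions reduce to simple ratios once the underlying data exists. Here is a sketch under the assumption that generated code can be tracked in batches. The `GeneratedBatch` record, its field names, and every number in the sample are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GeneratedBatch:
    """Hypothetical record for one batch of AI-generated code."""
    lines_added: int
    shipped: bool
    lines_deleted_within_30_days: int
    minutes_saved: int
    minutes_reviewing_and_fixing: int


def answer_questions(batches: list) -> dict:
    """Compute shipped share, 30-day churn, and net time saved across batches."""
    total = sum(b.lines_added for b in batches)
    shipped = sum(b.lines_added for b in batches if b.shipped)
    churned = sum(b.lines_deleted_within_30_days for b in batches)
    net_minutes = sum(b.minutes_saved - b.minutes_reviewing_and_fixing for b in batches)
    return {
        "shipped_pct": 100 * shipped / total if total else 0.0,
        "churn_pct": 100 * churned / total if total else 0.0,
        "net_minutes_saved": net_minutes,
    }


# Invented sample: 5,000 generated lines across three batches.
batches = [
    GeneratedBatch(1000, True, 200, 120, 90),
    GeneratedBatch(3000, False, 3000, 60, 150),
    GeneratedBatch(1000, True, 50, 90, 30),
]

if __name__ == "__main__":
    for name, value in answer_questions(batches).items():
        print(f"{name}: {value}")
```

In this invented sample, 5,000 lines were "written," yet fewer than half shipped, most were deleted within a month, and the net time saved was zero. The headline LOC number reveals none of that.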

Summary

In this post, I analyzed why lines of code is a poor productivity metric for AI coding assistants. The key points are:

  1. AI-generated code tends to be verbose by default
  2. More code often means more maintenance burden, not more value
  3. The best developers often have negative LOC contributions (they delete code)
  4. Better metrics include feature delivery time, code deletion ratio, test coverage, and developer satisfaction

The next time someone claims high LOC counts as evidence of AI productivity, ask what those lines actually delivered. The number that matters isn’t lines of code written—it’s problems solved and value created.
