Is Lines of Code a Good Metric for AI Coding Assistant Productivity?

Mar 31, 2026

Problem

When I saw someone claiming their AI tool generates “600,000+ lines of production code in 60 days” or “10k-20k lines per day,” I immediately questioned whether lines of code (LOC) means anything useful for measuring AI coding productivity.

I’ve worked with AI coding assistants extensively, and I noticed something counterintuitive: the more code an AI generates, the more cleanup I often have to do.

What’s Wrong with LOC for AI-Generated Code?

I think the core issue is that AI assistants generate verbose code by default. Here’s a simple comparison:

Human Developer (2 lines):        AI Assistant (12 lines):

def calculate(x):                 def calculate(x: int) -> int:
    return x * 2                      """
                                      Calculate the doubled value.

                                      Args:
                                          x: The input integer

                                      Returns:
                                          The doubled result
                                      """
                                      result = x * 2
                                      return result

Both achieve the same thing. But if I measure productivity by LOC, the AI version looks 6x more “productive.” In reality, I have to read, understand, and maintain all 12 lines.
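
To make the gap between line count and actual logic concrete, here is a minimal sketch that compares the two versions above. It assumes Python's standard `ast` module is an acceptable way to count statements; the helper names (`physical_lines`, `executable_statements`, `behaves_the_same`) are made up for this illustration and do not come from any tool mentioned in this post.

```python
import ast

HUMAN_VERSION = '''
def calculate(x):
    return x * 2
'''

AI_VERSION = '''
def calculate(x: int) -> int:
    """
    Calculate the doubled value.

    Args:
        x: The input integer

    Returns:
        The doubled result
    """
    result = x * 2
    return result
'''


def physical_lines(source: str) -> int:
    """Count physical lines (blank lines included), like a naive LOC counter."""
    return len(source.strip().splitlines())


def executable_statements(source: str) -> int:
    """Count statement nodes in the AST, skipping docstrings."""
    count = 0
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.stmt):
            continue
        # A bare string expression is a docstring, not logic.
        is_docstring = (
            isinstance(node, ast.Expr)
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        )
        if not is_docstring:
            count += 1
    return count


def behaves_the_same(inputs) -> bool:
    """Run both versions on the same inputs and compare the results."""
    human_ns, ai_ns = {}, {}
    exec(HUMAN_VERSION, human_ns)
    exec(AI_VERSION, ai_ns)
    return all(human_ns["calculate"](x) == ai_ns["calculate"](x) for x in inputs)


if __name__ == "__main__":
    for name, src in (("human", HUMAN_VERSION), ("ai", AI_VERSION)):
        print(name, physical_lines(src), executable_statements(src))
    print("same behavior:", behaves_the_same(range(-3, 4)))
```

Under these assumptions the two versions differ by a single statement (the extra `result` assignment), and `behaves_the_same` confirms they return identical values, even though a line counter sees a much larger difference.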

The Verbosity Bias

I analyzed my own AI coding sessions and found this pattern:

┌─────────────────────────────────────────────────────────────┐
│                     AI Code Generation                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   Prompt: "Add error handling to this function"             │
│                                                             │
│   Human Result (3 lines):                                   │
│   ┌─────────────────────────────────────────┐               │
│   │ try:                                    │               │
│   │     result = process(data)              │               │
│   │ except ValueError: return None          │               │
│   └─────────────────────────────────────────┘               │
│                                                             │
│   AI Result (15+ lines):                                    │
│   ┌─────────────────────────────────────────┐               │
│   │ try:                                    │               │
│   │     # Attempt to process the data       │               │
│   │     result = process(data)              │               │
│   │ except ValueError as e:                 │               │
│   │     # Log the error for debugging       │               │
│   │     logger.error(f"Error: {e}")         │               │
│   │     return None                         │               │
│   │ except TypeError as e:                  │               │
│   │     # Handle type errors                │               │
│   │     logger.warning(f"Type error: {e}")  │               │
│   │     return None                         │               │
│   │ except Exception as e:                  │               │
│   │     # Catch any other exceptions        │               │
│   │     logger.critical(f"Unexpected: {e}") │               │
│   │     return None                         │               │
│   └─────────────────────────────────────────┘               │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Again, the AI version has more lines. Is it better? Sometimes. But often it’s over-engineering for the actual requirements.

What the Reddit Discussion Revealed

When I looked at the Reddit discussion about Garry Tan’s gstack claims, the top comments were skeptical:

Top Comment (145 upvotes):
"So 5M+ lines of code per year! As we all know, more code
is always better so it must be really good."

Engineering Critique:
"LOC theatre. '600,000+ lines of production code in 60 days'
— anyone who's worked in a serious engineering org knows lines
of code is a vanity metric at best and actively misleading
at worst."

AI-Specific Concern:
"AI-generated code is verbose by default. 35% test coverage
doesn't redeem that — it just means 35% of the bloat has tests."

Practical Question:
"ship 10-20k lines per day - where? To what?"

These comments highlight the absurdity of using LOC as a productivity metric, especially for AI-generated code.

Why LOC Fails as a Metric

I think the problem runs deeper than verbosity. Here’s my analysis:

┌────────────────────────────────────────────────────────────┐
│                Why LOC Fails for AI Coding                 │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  1. INVERSE QUALITY RELATIONSHIP                           │
│     ┌────────────────────────────────────────────┐         │
│     │ Great Developer   → Negative LOC (deletes) │         │
│     │ Average Developer → Low/Neutral LOC        │         │
│     │ Poor Developer    → High LOC (bloat)       │         │
│     │ AI (unfiltered)   → Very High LOC (verbose)│         │
│     └────────────────────────────────────────────┘         │
│                                                            │
│  2. CONTEXT BLINDNESS                                      │
│     "10k-20k lines per day" tells you nothing about:       │
│     • What problem was solved                              │
│     • Whether tests exist                                  │
│     • Code maintainability                                 │
│     • User value delivered                                 │
│     • Bug count and severity                               │
│                                                            │
│  3. HISTORICAL LESSONS IGNORED                             │
│     • Bill Gates: "Measuring programming progress by       │
│       lines of code is like measuring aircraft building    │
│       progress by weight"                                  │
│     • Mature orgs abandoned SLOC metrics decades ago       │
│     • Function points replaced LOC for serious estimation  │
│                                                            │
└────────────────────────────────────────────────────────────┘

The irony is that the best developers I know often have negative LOC contributions over time. Through refactoring and simplification, they delete more code than they add.
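
One way to see this in a repository is to total added and deleted lines per author. The sketch below parses text in the tab-separated shape that `git log --numstat` prints. The `author:` prefix assumes a custom `--format=author:%an` option, and the sample log, file paths, and author names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical sample of what `git log --numstat --format=author:%an` might print.
SAMPLE_LOG = """\
author:alice
12\t48\tsrc/billing.py
0\t130\tsrc/legacy_report.py

author:assistant
640\t3\tsrc/handlers.py
-\t-\tassets/logo.png
"""


def net_lines_by_author(log_text: str) -> dict[str, dict[str, int]]:
    """Sum added/deleted lines per author and compute the net change."""
    totals = defaultdict(lambda: {"added": 0, "deleted": 0})
    author = None
    for line in log_text.splitlines():
        if line.startswith("author:"):
            author = line[len("author:"):]
            continue
        parts = line.split("\t")
        if author is None or len(parts) != 3:
            continue  # blank lines or anything that is not a numstat row
        added, deleted, _path = parts
        if not (added.isdigit() and deleted.isdigit()):
            continue  # binary files are reported as "-" instead of counts
        totals[author]["added"] += int(added)
        totals[author]["deleted"] += int(deleted)
    return {
        name: {**t, "net": t["added"] - t["deleted"]}
        for name, t in totals.items()
    }


if __name__ == "__main__":
    for name, stats in net_lines_by_author(SAMPLE_LOG).items():
        print(name, stats)
```

On the sample data, `alice` ends up with a net of -166 lines and `assistant` with +637. Neither number says who delivered more value, which is exactly the problem with reading LOC as productivity.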

Better Metrics for AI Coding Productivity

So if LOC is misleading, what should I measure instead? I’ve found these categories useful:

Output Quality Metrics

Metric                  Why It Matters                          How to Measure
Feature Delivery Time   Time from spec to working feature       Issue tracker timestamps
Code Review Cycles      Fewer rounds = clearer initial code     PR iteration count
Bug Rate Post-Merge     Quality indicator                       Issue tracker + time window
Test Coverage           Especially for new code paths           Coverage tools
Customer Satisfaction   Does shipped code solve real problems?  Feedback surveys
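
As a rough illustration of the "Issue tracker + time window" idea for post-merge bug rate, the sketch below counts bugs reported within a fixed window after a change was merged. The `MergedChange` and `BugReport` records are hypothetical stand-ins for whatever an issue tracker exports, and the 30-day default window is an assumption, not a recommendation from any standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MergedChange:
    change_id: str
    merged_at: datetime


@dataclass
class BugReport:
    change_id: str          # the change the bug was traced back to
    reported_at: datetime


def post_merge_bug_rate(changes, bugs, window_days=30):
    """Bugs reported within the window after merge, per merged change."""
    merged_at = {c.change_id: c.merged_at for c in changes}
    window = timedelta(days=window_days)
    in_window = sum(
        1
        for bug in bugs
        if bug.change_id in merged_at
        and timedelta(0) <= bug.reported_at - merged_at[bug.change_id] <= window
    )
    # Avoid dividing by zero when nothing was merged.
    return in_window / len(changes) if changes else 0.0
```

Bugs that cannot be traced to a known change, or that arrive after the window closes, are ignored here. A real setup would need a reliable way to link bugs to changes, which is usually the hard part.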

Code Health Metrics

Metric                  Why It Matters                          How to Measure
Code Deletion Ratio     Great developers delete code            Git stats (lines removed)
Complexity Scores       Lower is more maintainable              Cyclomatic complexity tools
Documentation Coverage  Self-explanatory code                   Doc coverage tools
Static Analysis Score   Code smell detection                    Linters, SonarQube
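
For the complexity row, dedicated tools do this properly, but the idea can be sketched in a few lines. The function below is a simplified approximation of cyclomatic complexity built on Python's standard `ast` module: it starts at 1 and adds 1 per branch point. The two handler snippets are adapted from the error-handling example earlier in this post; the `handle` wrapper and the function name `approx_cyclomatic_complexity` are assumptions made for illustration.

```python
import ast

# Node types treated as decision points in this simplified count.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)

SIMPLE_HANDLER = '''
def handle(data):
    try:
        return process(data)
    except ValueError:
        return None
'''

VERBOSE_HANDLER = '''
def handle(data):
    try:
        return process(data)
    except ValueError as e:
        logger.error(f"Error: {e}")
        return None
    except TypeError as e:
        logger.warning(f"Type error: {e}")
        return None
    except Exception as e:
        logger.critical(f"Unexpected: {e}")
        return None
'''


def approx_cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 plus the number of branch points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra paths.
            complexity += len(node.values) - 1
    return complexity


if __name__ == "__main__":
    print("simple:", approx_cyclomatic_complexity(SIMPLE_HANDLER))
    print("verbose:", approx_cyclomatic_complexity(VERBOSE_HANDLER))
```

With this counting rule the single-handler version scores 2 and the three-handler version scores 4. The code is only parsed, never executed, so `process` and `logger` do not need to exist.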

Developer Experience Metrics

Metric                  Why It Matters                          How to Measure
Time Saved by AI        Actual productivity gain                Developer surveys, time tracking
Iteration Speed         How fast can devs refine code?          Code review turnaround
Learning Curve          Does AI help developers learn?          Skill assessments over time

A Better Visualization

I think the relationship between LOC and actual productivity looks like this:

Actual
Value
  │
  │         ★ Optimal Zone
  │        ╱╲
  │       ╱  ╲
  │      ╱    ╲
  │     ╱      ╲
  │    ╱        ╲
  │   ╱          ╲
  │  ╱            ╲
  │ ╱              ╲
  │╱                ╲
  └──────────────────────────→
                    Lines of Code

  Too Little    Just Right    Too Much
  (incomplete)  (optimal)     (bloat/technical debt)

The key insight: there’s an optimal zone. Both too little and too much code indicate problems.

What I Actually Track

For my own AI coding sessions, I track these instead of LOC:

1. FEATURE VELOCITY
   - Features shipped per sprint
   - Time from idea to production

2. CODE QUALITY
   - Bugs found in code review
   - Bugs found in production
   - Test coverage percentage

3. MAINTENANCE BURDEN
   - Time spent on bug fixes
   - Time spent on refactoring
   - Time spent understanding AI-generated code

4. DEVELOPER SATISFACTION
   - "Did this AI help or hinder?"
   - "How much cleanup was needed?"

These metrics actually tell me whether the AI coding assistant is helping or creating more work.
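
A minimal sketch of how the maintenance-burden items could be recorded per session is shown below. The `SessionLog` fields, the feature names in the usage example, and the idea of estimating manual effort up front are all assumptions for illustration; they are not a description of an existing tool.

```python
from dataclasses import dataclass


@dataclass
class SessionLog:
    feature: str
    estimated_manual_minutes: float  # a guess at doing the work by hand
    generation_minutes: float        # prompting and waiting for output
    review_minutes: float            # understanding the AI-generated code
    cleanup_minutes: float           # fixing, trimming, refactoring

    @property
    def net_minutes_saved(self) -> float:
        spent = self.generation_minutes + self.review_minutes + self.cleanup_minutes
        return self.estimated_manual_minutes - spent


def summarize(sessions):
    """Total net time saved, plus the features where the AI cost time."""
    total = sum(s.net_minutes_saved for s in sessions)
    hindered = [s.feature for s in sessions if s.net_minutes_saved < 0]
    return {"net_minutes_saved": total, "hindered": hindered}


if __name__ == "__main__":
    sessions = [
        SessionLog("csv export", 90, 10, 20, 15),
        SessionLog("auth refactor", 60, 15, 40, 30),
    ]
    print(summarize(sessions))
```

The manual estimate is subjective, so the result is best read as a trend over many sessions rather than a precise figure.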

The Real Question

When someone claims “600,000 lines of code in 60 days,” I want to know:

┌─────────────────────────────────────────────────────────┐
│ Questions that Actually Matter                          │
├─────────────────────────────────────────────────────────┤
│                                                         │
│ 1. What percentage shipped to production?               │
│                                                         │
│ 2. How much was deleted within 30 days?                 │
│                                                         │
│ 3. What's the bug rate post-merge?                      │
│                                                         │
│ 4. Did developers spend more time reviewing/fixing      │
│    than they saved?                                     │
│                                                         │
│ 5. Would a human have written 100k lines to solve       │
│    the same problems?                                   │
│                                                         │
└─────────────────────────────────────────────────────────┘

Without answers to these questions, LOC counts are just noise.
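
The first two questions reduce to simple ratios once the data exists. The sketch below assumes a hypothetical `ChangeRecord` per batch of generated code, with fields for lines added, whether the batch reached production, and how many of its lines were deleted within 30 days. Collecting those fields is the real work and is not shown here.

```python
from dataclasses import dataclass


@dataclass
class ChangeRecord:
    lines_added: int
    shipped_to_production: bool
    lines_deleted_within_30_days: int


def loc_reality_check(records):
    """Share of generated lines that shipped, and share deleted within 30 days."""
    total = sum(r.lines_added for r in records)
    if total == 0:
        return {"shipped_pct": 0.0, "churn_pct": 0.0}
    shipped = sum(r.lines_added for r in records if r.shipped_to_production)
    churned = sum(r.lines_deleted_within_30_days for r in records)
    return {
        "shipped_pct": 100 * shipped / total,
        "churn_pct": 100 * churned / total,
    }


if __name__ == "__main__":
    records = [
        ChangeRecord(1000, True, 200),
        ChangeRecord(3000, False, 1800),
    ]
    print(loc_reality_check(records))
```

In this made-up example, a quarter of the generated lines shipped and half were deleted within a month, which puts a headline LOC figure in a very different light.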

Summary

In this post, I analyzed why lines of code is a poor productivity metric for AI coding assistants. The key points are:

- AI-generated code tends to be verbose by default
- More code often means more maintenance burden, not more value
- The best developers often have negative LOC contributions (they delete code)
- Better metrics include feature delivery time, code deletion ratio, test coverage, and developer satisfaction

The next time someone claims high LOC counts as evidence of AI productivity, ask what those lines actually delivered. The number that matters isn’t lines of code written—it’s problems solved and value created.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email: Email me

Here are the most important links from this article, along with some further resources on this topic:

👨‍💻 Reddit Discussion: Garry Tan's gstack Claims
👨‍💻 Bill Gates on LOC Metrics

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!