Is Lines of Code a Good Metric for AI Coding Assistant Productivity?
Problem
When I saw someone claiming their AI tool generates “600,000+ lines of production code in 60 days” or “10k-20k lines per day,” I immediately questioned whether lines of code (LOC) means anything useful for measuring AI coding productivity.
I’ve worked with AI coding assistants extensively, and I noticed something counterintuitive: the more code an AI generates, the more cleanup I often have to do.
What’s Wrong with LOC for AI-Generated Code?
I think the core issue is that AI assistants generate verbose code by default. Here’s a simple comparison:
Human Developer (2 lines):

    def calculate(x):
        return x * 2

AI Assistant (12 lines):

    def calculate(x: int) -> int:
        """
        Calculate the doubled value.

        Args:
            x: The input integer

        Returns:
            The doubled result
        """
        result = x * 2
        return result

Both achieve the same thing. But if I measure productivity by LOC, the AI version looks 6x more “productive.” In reality, I have to read, understand, and maintain all 12 lines.
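To make the gap concrete, here is a minimal sketch that counts raw physical lines and executable statements for the two versions. The helper names `physical_lines` and `logical_statements` are my own, and treating "AST statements minus docstrings" as the logical size is an assumption, not a standard definition.

```python
import ast

HUMAN_VERSION = '''\
def calculate(x):
    return x * 2
'''

AI_VERSION = '''\
def calculate(x: int) -> int:
    """
    Calculate the doubled value.

    Args:
        x: The input integer

    Returns:
        The doubled result
    """
    result = x * 2
    return result
'''


def physical_lines(source: str) -> int:
    """Count raw physical lines, blank lines included."""
    return len(source.splitlines())


def logical_statements(source: str) -> int:
    """Count AST statement nodes, skipping bare string expressions (docstrings)."""
    count = 0
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.stmt):
            continue
        is_docstring = (
            isinstance(node, ast.Expr)
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        )
        if not is_docstring:
            count += 1
    return count


if __name__ == "__main__":
    for name, src in [("human", HUMAN_VERSION), ("ai", AI_VERSION)]:
        print(name, physical_lines(src), logical_statements(src))
    # human: 2 physical lines, 2 statements
    # ai:    12 physical lines, 3 statements
```

The raw line count grows sixfold while the statement count goes from 2 to 3, which is closer to how much logic I actually have to reason about.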
The Verbosity Bias
I analyzed my own AI coding sessions and found this pattern:
Prompt: "Add error handling to this function"

Human Result (3 lines):

    try:
        result = process(data)
    except ValueError: return None

AI Result (15+ lines):

    try:
        # Attempt to process the data
        result = process(data)
    except ValueError as e:
        # Log the error for debugging
        logger.error(f"Error: {e}")
        return None
    except TypeError as e:
        # Handle type errors
        logger.warning(f"Type error: {e}")
        return None
    except Exception as e:
        # Catch any other exceptions
        logger.critical(f"Unexpected: {e}")
        return None

Again, the AI version has more lines. Is it better? Sometimes. But often it’s over-engineering for the actual requirements.
What the Reddit Discussion Revealed
When I looked at the Reddit discussion about Garry Tan’s gstack claims, the top comments were skeptical:
Top Comment (145 upvotes): "So 5M+ lines of code per year! As we all know, more code is always better so it must be really good."

Engineering Critique: "LOC theatre. '600,000+ lines of production code in 60 days' - anyone who's worked in a serious engineering org knows lines of code is a vanity metric at best and actively misleading at worst."

AI-Specific Concern: "AI-generated code is verbose by default. 35% test coverage doesn't redeem that - it just means 35% of the bloat has tests."

Practical Question: "ship 10-20k lines per day - where? To what?"

These comments highlight the absurdity of using LOC as a productivity metric, especially for AI-generated code.
Why LOC Fails as a Metric
I think the problem runs deeper than verbosity. Here’s my analysis:
1. INVERSE QUALITY RELATIONSHIP
   - Great Developer → Negative LOC (deletes)
   - Average Developer → Low/Neutral LOC
   - Poor Developer → High LOC (bloat)
   - AI (unfiltered) → Very High LOC (verbose)
2. CONTEXT BLINDNESS: "10k-20k lines per day" tells you nothing about:
   - What problem was solved
   - Whether tests exist
   - Code maintainability
   - User value delivered
   - Bug count and severity
3. HISTORICAL LESSONS IGNORED
   - Bill Gates: "Measuring programming progress by lines of code is like measuring aircraft building progress by weight"
   - Mature orgs abandoned SLOC metrics decades ago
   - Function points replaced LOC for serious estimation

The irony is that the best developers I know often have negative LOC contributions over time. They delete more code than they add through refactoring and simplification.
Better Metrics for AI Coding Productivity
So if LOC is misleading, what should I measure instead? I’ve found these categories useful:
Output Quality Metrics
| Metric | Why It Matters | How to Measure |
|---|---|---|
| Feature Delivery Time | Time from spec to working feature | Issue tracker timestamps |
| Code Review Cycles | Fewer rounds = clearer initial code | PR iteration count |
| Bug Rate Post-Merge | Quality indicator | Issue tracker + time window |
| Test Coverage | Especially for new code paths | Coverage tools |
| Customer Satisfaction | Does shipped code solve real problems? | Feedback surveys |
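As one illustration of the "issue tracker + time window" idea for bug rate post-merge, here is a rough sketch. It assumes you can export merge timestamps per PR and bug reports already linked to the PR that caused them; the function name, the data shapes, and the 30-day default are all hypothetical choices, not something a particular tracker provides.

```python
from datetime import datetime, timedelta


def post_merge_bug_rate(
    merges: dict[str, datetime],
    bugs: list[tuple[str, datetime]],
    window_days: int = 30,
) -> float:
    """Return bugs per merged PR that were reported within the window after merge.

    merges maps a PR id to its merge time (assumed export format).
    bugs is a list of (pr_id, reported_at) pairs (assumed linkage).
    """
    if not merges:
        return 0.0
    window = timedelta(days=window_days)
    in_window = 0
    for pr_id, reported_at in bugs:
        merged_at = merges.get(pr_id)
        if merged_at is None:
            continue  # bug is not linked to a PR in this sample
        if timedelta(0) <= reported_at - merged_at <= window:
            in_window += 1
    return in_window / len(merges)


if __name__ == "__main__":
    merges = {"pr-1": datetime(2024, 1, 1), "pr-2": datetime(2024, 1, 10)}
    bugs = [
        ("pr-1", datetime(2024, 1, 5)),   # inside the window
        ("pr-1", datetime(2024, 3, 1)),   # too late to count
        ("pr-2", datetime(2024, 1, 11)),  # inside the window
        ("pr-9", datetime(2024, 1, 2)),   # unknown PR, ignored
    ]
    print(post_merge_bug_rate(merges, bugs))  # 1.0
```

Comparing this rate between AI-assisted and hand-written PRs is more informative than comparing their line counts, provided the bug-to-PR linkage is reasonably reliable.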
Code Health Metrics
| Metric | Why It Matters | How to Measure |
|---|---|---|
| Code Deletion Ratio | Great developers delete code | Git stats (lines removed) |
| Complexity Scores | Lower is more maintainable | Cyclomatic complexity tools |
| Documentation Coverage | Self-explanatory code | Doc coverage tools |
| Static Analysis Score | Code smell detection | Linters, SonarQube |
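The code deletion ratio can be approximated from Git's numstat output. The sketch below parses text in the tab-separated `added<TAB>removed<TAB>path` form that `git log --numstat` emits. The definition used here, removed lines divided by all changed lines, is my own assumption; pick whatever definition suits your team.

```python
def deletion_ratio(numstat_output: str) -> float:
    """Return removed / (added + removed) for numstat-style text.

    Input is assumed to come from something like:
        git log --numstat --pretty=format: --author="..."
    Binary files appear as "-<TAB>-<TAB>path" and are skipped.
    """
    added = removed = 0
    for line in numstat_output.splitlines():
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # blank separator lines or anything unexpected
        add_str, del_str, _path = parts
        if not (add_str.isdigit() and del_str.isdigit()):
            continue  # binary file entries use "-" instead of counts
        added += int(add_str)
        removed += int(del_str)
    total = added + removed
    return removed / total if total else 0.0


if __name__ == "__main__":
    sample = "10\t2\tapp.py\n-\t-\tlogo.png\n\n0\t8\tlegacy.py\n"
    print(deletion_ratio(sample))  # 0.5
```

A ratio near zero over a long period suggests code is only ever being added, which is worth a closer look when most of it is AI-generated.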
Developer Experience Metrics
| Metric | Why It Matters | How to Measure |
|---|---|---|
| Time Saved by AI | Actual productivity gain | Developer surveys, time tracking |
| Iteration Speed | How fast can devs refine code? | Code review turnaround |
| Learning Curve | Does AI help developers learn? | Skill assessments over time |
A Better Visualization
I think the relationship between LOC and actual productivity looks like this:
    Actual
    Value
      │
      │        ★ Optimal Zone
      │        ╱╲
      │       ╱  ╲
      │      ╱    ╲
      │     ╱      ╲
      │    ╱        ╲
      │   ╱          ╲
      │  ╱            ╲
      │ ╱              ╲
      │╱                ╲
      └──────────────────────────→ Lines of Code
       Too Little    Just Right    Too Much
      (incomplete)   (optimal)     (bloat/technical debt)

The key insight: there’s an optimal zone. Both too little and too much code indicate problems.
What I Actually Track
For my own AI coding sessions, I track these instead of LOC:
1. FEATURE VELOCITY
   - Features shipped per sprint
   - Time from idea to production
2. CODE QUALITY
   - Bugs found in code review
   - Bugs found in production
   - Test coverage percentage
3. MAINTENANCE BURDEN
   - Time spent on bug fixes
   - Time spent on refactoring
   - Time spent understanding AI-generated code
4. DEVELOPER SATISFACTION
   - "Did this AI help or hinder?"
   - "How much cleanup was needed?"

These metrics actually tell me whether the AI coding assistant is helping or creating more work.
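The maintenance-burden items above can be turned into a small per-session log. This is only a sketch of how such tracking might look: `SessionLog`, its field names, and the `summarize` helper are hypothetical, and every number is a self-reported estimate in minutes.

```python
from dataclasses import dataclass


@dataclass
class SessionLog:
    """One AI-assisted coding session; all values are self-reported minutes."""
    estimated_minutes_saved: float
    minutes_understanding: float
    minutes_fixing: float
    minutes_refactoring: float

    @property
    def cleanup_minutes(self) -> float:
        """Total time spent dealing with the generated code afterwards."""
        return (
            self.minutes_understanding
            + self.minutes_fixing
            + self.minutes_refactoring
        )

    @property
    def net_minutes_saved(self) -> float:
        """Negative values mean the session cost more time than it saved."""
        return self.estimated_minutes_saved - self.cleanup_minutes


def summarize(sessions: list[SessionLog]) -> dict[str, float]:
    """Aggregate net savings and the share of savings eaten by cleanup."""
    total_saved = sum(s.estimated_minutes_saved for s in sessions)
    total_cleanup = sum(s.cleanup_minutes for s in sessions)
    return {
        "net_minutes_saved": total_saved - total_cleanup,
        "cleanup_share": total_cleanup / total_saved if total_saved else 0.0,
    }


if __name__ == "__main__":
    sessions = [SessionLog(60, 10, 15, 5), SessionLog(30, 20, 20, 10)]
    print(summarize(sessions))
```

In this made-up example the second session has a net saving of -20 minutes, which is exactly the kind of signal a line count would hide.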
The Real Question
When someone claims “600,000 lines of code in 60 days,” I want to know:
1. What percentage shipped to production?
2. How much was deleted within 30 days?
3. What's the bug rate post-merge?
4. Did developers spend more time reviewing/fixing than they saved?
5. Would a human have written 100k lines to solve the same problems?

Without answers to these questions, LOC counts are just noise.
Summary
In this post, I analyzed why lines of code is a poor productivity metric for AI coding assistants. The key points are:
- AI-generated code tends to be verbose by default
- More code often means more maintenance burden, not more value
- The best developers often have negative LOC contributions (they delete code)
- Better metrics include feature delivery time, code deletion ratio, test coverage, and developer satisfaction
The next time someone claims high LOC counts as evidence of AI productivity, ask what those lines actually delivered. The number that matters isn’t lines of code written—it’s problems solved and value created.