
Why AI Refactoring Adds More Code Than It Removes

Problem

I asked Claude to refactor a simple Python function. The original was 8 lines. The result? 35 lines of “improved” code.

Here’s what I started with:

original.py
def get_user_name(user_id):
    user = db.users.find(user_id)
    return user.name

And here’s what the AI gave me:

ai_refactored.py
def get_user_name(user_id):
    """
    Retrieve the name of a user from the database.
    Args:
        user_id: The unique identifier for the user
    Returns:
        str: The user's name
    Raises:
        ValueError: If user_id is None or empty
        UserNotFoundError: If user does not exist
        DatabaseError: If database connection fails
    """
    # Validate input
    if user_id is None:
        raise ValueError("user_id cannot be None")
    if not isinstance(user_id, str):
        raise ValueError("user_id must be a string")
    if user_id.strip() == "":
        raise ValueError("user_id cannot be empty")
    # Attempt database retrieval with error handling
    try:
        user = db.users.find(user_id)
    except DatabaseConnectionError as e:
        logger.error(f"Database connection failed: {e}")
        raise DatabaseError("Failed to connect to database") from e
    # Validate user exists
    if user is None:
        raise UserNotFoundError(f"User with id {user_id} not found")
    # Return name with validation
    name = user.name
    if name is None:
        logger.warning(f"User {user_id} has no name set")
        return "Unknown"
    return name

8 lines became 35. A 4x increase. And I’ve seen this pattern repeat across dozens of refactoring sessions with various AI tools.

What happened?

This isn’t a bug. It’s a feature of how LLMs are trained. Three root causes drive this behavior.

Cause 1: Safety Bias

AI models are trained to be helpful. In coding contexts, “helpful” often means “thorough.” The model interprets my simple request as “make this production-ready.”

The training data includes millions of production codebases with extensive error handling. When the model sees “refactor,” it pattern-matches against production-quality examples it learned from.

This isn’t wrong. It’s just not what I asked for. The model added error handling for:

  • user_id being None
  • user_id being the wrong type
  • user_id being an empty string
  • Database connection errors
  • User not existing
  • User having no name

Most of these cases might never happen in my codebase. But the model doesn’t know my context.
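For contrast, a version scoped to one known failure mode might look like the sketch below. It assumes, hypothetically, that a missing user is the only case that actually occurs in this codebase. `USERS`, `User`, and `UserNotFoundError` are stand-ins invented for the illustration, not part of my project.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str


class UserNotFoundError(LookupError):
    """Hypothetical error type for a missing user."""


# Hypothetical in-memory stand-in for db.users
USERS = {"u1": User(name="Ada")}


def get_user_name(user_id):
    user = USERS.get(user_id)
    if user is None:
        # The one failure case assumed to occur in practice
        raise UserNotFoundError(f"User with id {user_id} not found")
    return user.name
```

That is one guard instead of six. Whether it is the right guard depends entirely on context the model was never given.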

Cause 2: The Abstraction Reflex

Watch what happens when I ask AI to refactor two file parsing functions:

before_refactor.py
def process_csv(file_path):
    with open(file_path) as f:
        data = f.read()
    lines = data.strip().split('\n')
    return [line.split(',') for line in lines]

def process_json(file_path):
    import json
    with open(file_path) as f:
        return json.load(f)

The AI created three helper functions and rewired the two originals to call them:

after_refactor.py
def read_file_content(file_path, encoding='utf-8'):
    """Read content from a file with error handling."""
    try:
        with open(file_path, 'r', encoding=encoding) as f:
            return f.read()
    except FileNotFoundError:
        raise FileNotFoundError(f"File not found: {file_path}")

def parse_csv_content(content):
    """Parse CSV content into a list of lists."""
    lines = content.strip().split('\n')
    return [line.split(',') for line in lines]

def parse_json_content(content):
    """Parse JSON content into a Python object."""
    import json
    try:
        return json.loads(content)
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON content: {e}")

def process_csv(file_path):
    content = read_file_content(file_path)
    return parse_csv_content(content)

def process_json(file_path):
    content = read_file_content(file_path)
    return parse_json_content(content)

The problem? Both parse helpers are called exactly once, and read_file_content is called only twice. The DRY principle says “don’t repeat yourself,” but creating abstractions for code with one or two call sites adds complexity, not value.

I call this the Abstraction Reflex. AI sees repeated patterns and extracts them by default, without asking whether the abstraction will ever be reused.
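A more restrained refactor might leave the two functions largely alone. The sketch below is one possibility, not something the AI produced. It only adds an explicit encoding and moves the import to module level, with no new helpers.

```python
import json


def process_csv(file_path):
    # Naive comma split, same behavior as the original (no quoted-field support)
    with open(file_path, encoding="utf-8") as f:
        lines = f.read().strip().split("\n")
    return [line.split(",") for line in lines]


def process_json(file_path):
    with open(file_path, encoding="utf-8") as f:
        return json.load(f)
```

The two `open` calls are still duplicated, but at this size the duplication is easier to read than an extra layer of indirection.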

Cause 3: Template Matching from Verbose Examples

LLMs learn from training data. That training data includes:

  • Production codebases: Often carry years of accumulated error handling, legacy compatibility, and defensive patterns
  • Tutorial code: Designed for teaching, with extensive comments and explanations
  • Stack Overflow answers: Written to handle every possible edge case for broad applicability

What’s underrepresented in training data?

  • Minimal implementations that solve exactly one problem
  • Refactored code showing the cleaned-up version, not the process
  • Delete commits showing what was removed, not just what was added

The model learned from code that tends toward verbosity.

When More Code Is Actually Better

Here’s where the story gets interesting. Sometimes the AI is right to add code.

Consider this API call I asked AI to improve:

before.ts
async function fetchUser(id: string) {
  const response = await fetch(`/api/users/${id}`);
  return response.json();
}

The AI gave me 15 lines instead of 4:

after.ts
async function fetchUser(id: string): Promise<User> {
  if (!id?.trim()) {
    throw new Error(`Invalid user ID: received "${id}"`);
  }
  const response = await fetch(`/api/users/${id}`);
  if (!response.ok) {
    throw new Error(
      `Failed to fetch user ${id}: ${response.status} ${response.statusText}`
    );
  }
  try {
    return await response.json();
  } catch (e) {
    throw new Error(`Invalid JSON response for user ${id}: ${e.message}`);
  }
}

I initially thought this was overkill. Then I spent 3 hours debugging a production issue caused by:

  1. An empty string being passed as id
  2. The API returning a 500 error that was silently ignored
  3. Invalid JSON causing a cryptic “unexpected token” error downstream

The verbose version would have caught all three issues immediately with clear error messages. The 11 extra lines would have saved 3 hours of debugging.

The Key Distinction

Good verbosity: Code that prevents future code

  • Clear error messages (saves debugging time)
  • Good logging (enables quick investigation)
  • Comprehensive tests (prevents regression bugs)
  • Type definitions (catches errors at compile time)

Bad verbosity: Code that only adds complexity

  • Single-use helper functions
  • Error handlers for errors that never occur
  • Fallback chains that obscure the main path
  • Comments that restate the code

How to Get Better Results

I’ve developed a framework for working with AI refactoring that produces more appropriate results.

Before Refactoring: Set Explicit Constraints

The most common mistake I made was giving vague instructions. Instead of “refactor this,” I now specify:

Context to provide:
1. Show existing codebase patterns
2. Specify LOC target explicitly
3. Indicate which abstractions already exist
4. Clarify reuse potential of new helpers

Example prompt that works:

Refactor this function. Constraints:
- Keep total lines under 15
- No new helper functions unless called 3+ times
- Maintain current error handling approach
- Follow existing project patterns
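If you reuse the same constraints across sessions, a small template function keeps them consistent. This is a minimal sketch. `build_refactor_prompt` and its parameters are hypothetical names, and the defaults simply mirror the example prompt above.

```python
def build_refactor_prompt(code, max_lines=15, min_helper_calls=3):
    """Assemble a refactoring prompt with explicit constraints (hypothetical helper)."""
    constraints = [
        f"Keep total lines under {max_lines}",
        f"No new helper functions unless called {min_helper_calls}+ times",
        "Maintain current error handling approach",
        "Follow existing project patterns",
    ]
    bullet_list = "\n".join(f"- {c}" for c in constraints)
    # The code to refactor goes last, separated from the constraints by a blank line
    return f"Refactor this function. Constraints:\n{bullet_list}\n\n{code}"
```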

During Refactoring: Use Specific Constraints

I’ve found these constraint patterns effective:

Constraints to set:
1. "Reduce LOC, do not add helper functions unless called 3+ times"
2. "Maintain current LOC or reduce by at least 10%"
3. "Inline all single-use functions"
4. "Remove fallbacks that handle <0.1% of cases"

The key is being explicit about what “good” means in your context.

After Refactoring: Audit the Output

I use a simple checklist to evaluate AI output:

Red Flags:
[ ] Helper functions called exactly once
[ ] Error handlers for errors that never occur
[ ] Fallback chains that obscure the main code path
[ ] Duplicate logic across files (AI forgot previous implementations)
[ ] Comments that restate the code

When I spot these, I ask the AI to fix specific issues:

Inline the parse_csv_content helper - it's only called once.
Remove the FileNotFoundError handler - that's handled at the application level.
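The first red flag can be checked mechanically. The sketch below uses Python's standard `ast` module to count call sites per function within a single file. It is an approximation under stated assumptions: it only sees direct calls by bare name, so method calls, aliases, and calls from other modules are missed. The function names and `SAMPLE` are made up for illustration.

```python
import ast
from collections import Counter


def count_helper_calls(source):
    """Map each top-level function name to its number of call sites in `source`."""
    tree = ast.parse(source)
    defined = {node.name for node in tree.body if isinstance(node, ast.FunctionDef)}
    calls = Counter()
    for node in ast.walk(tree):
        # Only direct calls such as helper(x); obj.helper(x) is not counted
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in defined:
                calls[node.func.id] += 1
    return {name: calls[name] for name in sorted(defined)}


def single_use_helpers(source):
    """Return names of functions called exactly once within the same file."""
    return [name for name, n in count_helper_calls(source).items() if n == 1]


SAMPLE = '''
def read_file_content(path):
    with open(path) as f:
        return f.read()

def parse_csv_content(content):
    return [line.split(",") for line in content.splitlines()]

def process_csv(path):
    return parse_csv_content(read_file_content(path))

def process_json(path):
    import json
    return json.loads(read_file_content(path))
'''

print(single_use_helpers(SAMPLE))  # → ['parse_csv_content']
```

Functions with zero in-file calls (here, the two `process_*` entry points) are deliberately not flagged, since they are presumably called from elsewhere.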

Measuring the Problem

I track three metrics after every AI refactoring session:

1. LOC Delta: Before vs after. Anything greater than +20% triggers a review.

2. Helper Reuse Ratio: Unique call sites divided by helper function definitions. Should be greater than 1.5.

3. Complexity Trend: Does cyclomatic complexity increase? If so, the refactor might be adding unnecessary branching.

These metrics help distinguish between helpful verbosity and harmful bloat.
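Here is a rough sketch of how these three numbers could be computed for Python sources with the standard `ast` module. Several things are assumptions of the sketch rather than a standard: LOC counts non-blank lines, the reuse ratio is computed as call sites per helper definition (so higher means more reuse), helpers are every top-level function not listed in `entry_points`, and complexity is approximated as one plus the number of branching constructs in the whole file.

```python
import ast

# Constructs treated as branches in this approximation
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp, ast.BoolOp)


def count_loc(source):
    """Count non-blank lines."""
    return sum(1 for line in source.splitlines() if line.strip())


def loc_delta_percent(before, after):
    """Percentage change in non-blank lines from `before` to `after`."""
    old = count_loc(before)
    return (count_loc(after) - old) / old * 100


def helper_reuse_ratio(source, entry_points):
    """Call sites per helper; helpers are top-level functions not in `entry_points`."""
    tree = ast.parse(source)
    helpers = {
        node.name
        for node in tree.body
        if isinstance(node, ast.FunctionDef) and node.name not in entry_points
    }
    if not helpers:
        return None  # no helpers, nothing to measure
    call_sites = sum(
        1
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id in helpers
    )
    return call_sites / len(helpers)


def branch_complexity(source):
    """Rough cyclomatic complexity: 1 + number of branching constructs."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))
```

Applied to the file-parsing refactor from earlier, with `process_csv` and `process_json` as entry points, the three helpers have four call sites between them, which gives a ratio of about 1.33 and falls below the 1.5 threshold.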

The “Low Quality Slop” Argument

Some developers dismiss all AI-generated code as “low quality slop.” Is that fair?

The skeptic’s case:

  • LLM code prioritizes speed over elegance
  • No understanding of project-specific context
  • No ability to see long-term maintenance implications
  • Pattern matching without judgment of pattern quality

The nuanced reality:

  • AI is a force multiplier for developer judgment
  • Speed advantage is real and valuable
  • Quality depends heavily on constraints provided
  • Better prompts produce better output

The “slop” label ignores that manual code is often verbose too. I’ve seen developers create the same single-use helpers and excessive fallbacks without any AI assistance. The difference is that AI produces verbose code faster.

Summary

In this post, I showed why AI refactoring often adds more code than it removes. Three root causes drive this: safety bias from training on production code, the abstraction reflex that creates unnecessary helpers, and template matching from verbose training examples.

The key insight is that more code isn’t always bad. Clear errors, good logging, and tests reduce long-term code through prevention. The real question isn’t “how do I make AI write less code?” but “how do I make AI write the right amount of code?”

To get better results:

  1. Set explicit LOC targets and constraints upfront
  2. Specify what abstractions already exist and what’s reusable
  3. Audit output with metrics: LOC delta, helper reuse ratio, complexity
  4. Distinguish between good verbosity (error handling, tests, types) and bad verbosity (single-use helpers, excessive fallbacks)

Remember: your AI will optimize for what it learned from training data, not what your specific project needs. Bridge that gap with explicit constraints.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.
