Why AI Refactoring Adds More Code Than It Removes
Problem
I asked Claude to refactor a simple Python function. The original was 8 lines. The result? 35 lines of “improved” code.
Here’s what I started with:
```python
def get_user_name(user_id):
    user = db.users.find(user_id)
    return user.name
```

And here’s what the AI gave me:

```python
def get_user_name(user_id):
    """
    Retrieve the name of a user from the database.

    Args:
        user_id: The unique identifier for the user

    Returns:
        str: The user's name

    Raises:
        ValueError: If user_id is None or empty
        UserNotFoundError: If user does not exist
        DatabaseError: If database connection fails
    """
    # Validate input
    if user_id is None:
        raise ValueError("user_id cannot be None")

    if not isinstance(user_id, str):
        raise ValueError("user_id must be a string")

    if user_id.strip() == "":
        raise ValueError("user_id cannot be empty")

    # Attempt database retrieval with error handling
    try:
        user = db.users.find(user_id)
    except DatabaseConnectionError as e:
        logger.error(f"Database connection failed: {e}")
        raise DatabaseError("Failed to connect to database") from e

    # Validate user exists
    if user is None:
        raise UserNotFoundError(f"User with id {user_id} not found")

    # Return name with validation
    name = user.name
    if name is None:
        logger.warning(f"User {user_id} has no name set")
        return "Unknown"

    return name
```

8 lines became 35. A 4x increase. And I’ve seen this pattern repeat across dozens of refactoring sessions with various AI tools.
What happened?
This isn’t a bug. It’s a feature of how LLMs are trained. Three root causes drive this behavior.
Cause 1: Safety Bias
AI models are trained to be helpful. In coding contexts, “helpful” often means “thorough.” The model interprets my simple request as “make this production-ready.”
The training data includes millions of production codebases with extensive error handling. When the model sees “refactor,” it pattern-matches against production-quality examples it learned from.
This isn’t wrong. It’s just not what I asked for. The model added error handling for:
- `user_id` being `None`
- `user_id` being the wrong type
- `user_id` being an empty string
- Database connection errors
- User not existing
- User having no name
Most of these cases might never happen in my codebase. But the model doesn’t know my context.
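For comparison, here is a minimal sketch of a leaner refactor that keeps only the one check that plausibly matters: the user not existing. Everything around the function is an assumption made so the sketch runs on its own: `UserNotFoundError`, the `User` dataclass, and the in-memory `USERS` dict standing in for `db.users` are hypothetical names, not code from a real project.

```python
from dataclasses import dataclass


class UserNotFoundError(LookupError):
    """Hypothetical error type, named after the one in the AI's version."""


@dataclass
class User:
    name: str


# Hypothetical in-memory stand-in for db.users, so the sketch is runnable.
USERS = {"u1": User(name="Ada")}


def get_user_name(user_id):
    """Return the user's name, failing loudly if the user does not exist."""
    user = USERS.get(user_id)
    if user is None:
        raise UserNotFoundError(f"User with id {user_id} not found")
    return user.name
```

Whether even this single check belongs here depends on the codebase. If callers already guarantee that the user exists, the original three-line version may be the right answer.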
Cause 2: The Abstraction Reflex
Watch what happens when I ask AI to refactor two file parsing functions:
```python
def process_csv(file_path):
    with open(file_path) as f:
        data = f.read()
    lines = data.strip().split('\n')
    return [line.split(',') for line in lines]

def process_json(file_path):
    import json
    with open(file_path) as f:
        return json.load(f)
```

The AI kept both entry points and added 3 helper functions:

```python
def read_file_content(file_path, encoding='utf-8'):
    """Read content from a file with error handling."""
    try:
        with open(file_path, 'r', encoding=encoding) as f:
            return f.read()
    except FileNotFoundError:
        raise FileNotFoundError(f"File not found: {file_path}")

def parse_csv_content(content):
    """Parse CSV content into a list of lists."""
    lines = content.strip().split('\n')
    return [line.split(',') for line in lines]

def parse_json_content(content):
    """Parse JSON content into a Python object."""
    import json
    try:
        return json.loads(content)
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON content: {e}")

def process_csv(file_path):
    content = read_file_content(file_path)
    return parse_csv_content(content)

def process_json(file_path):
    content = read_file_content(file_path)
    return parse_json_content(content)
```

The problem? Both parse helpers are called exactly once, and read_file_content is called only twice, to wrap a two-line `with open` block. The DRY principle says “don’t repeat yourself,” but creating abstractions for code that is barely reused adds complexity, not value.
I call this the Abstraction Reflex. AI sees repeated patterns and extracts them by default, without asking whether the abstraction will ever be reused.
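One way to spot the reflex mechanically is to count how often each function in a module is actually called. The following is a rough sketch using Python's standard `ast` module. The function names `count_call_sites` and `single_use_helpers` and the `SAMPLE` source are my own illustrations, and the approach has clear limits: it only sees plain-name calls inside a single module, so methods, `async` functions, and cross-file callers are not counted.

```python
import ast
from collections import Counter


def count_call_sites(source: str) -> dict[str, int]:
    """Map each top-level function defined in `source` to its number of calls."""
    tree = ast.parse(source)
    defined = {node.name for node in tree.body if isinstance(node, ast.FunctionDef)}
    calls = Counter(
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id in defined
    )
    return {name: calls[name] for name in sorted(defined)}


def single_use_helpers(source: str) -> list[str]:
    """Return functions that are called exactly once within the same module."""
    return [name for name, n in count_call_sites(source).items() if n == 1]


# Simplified, hypothetical module used only to exercise the functions above.
SAMPLE = """
def read_file_content(path):
    with open(path) as f:
        return f.read()

def parse_csv_content(content):
    return [line.split(",") for line in content.splitlines()]

def process_csv(path):
    return parse_csv_content(read_file_content(path))

def process_json(path):
    return read_file_content(path)
"""

print(count_call_sites(SAMPLE))
print(single_use_helpers(SAMPLE))  # ['parse_csv_content']
```

Functions with zero calls in the output are usually entry points rather than dead code, so I would treat the result as a list of candidates to inspect, not a verdict.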
Cause 3: Template Matching from Verbose Examples
LLMs learn from training data. That training data includes:
- Production codebases: Often carry years of accumulated error handling, legacy compatibility, and defensive patterns
- Tutorial code: Designed for teaching, with extensive comments and explanations
- Stack Overflow answers: Written to handle every possible edge case for broad applicability
What’s underrepresented in training data?
- Minimal implementations that solve exactly one problem
- Refactored code showing the cleaned-up version, not the process
- Delete commits showing what was removed, not just what was added
The model learned from code that tends toward verbosity.
When More Code Is Actually Better
Here’s where the story gets interesting. Sometimes the AI is right to add code.
Consider this API call I asked AI to improve:
```typescript
async function fetchUser(id: string) {
  const response = await fetch(`/api/users/${id}`);
  return response.json();
}
```

The AI gave me 15 lines instead of 4:

```typescript
async function fetchUser(id: string): Promise<User> {
  if (!id?.trim()) {
    throw new Error(`Invalid user ID: received "${id}"`);
  }

  const response = await fetch(`/api/users/${id}`);

  if (!response.ok) {
    throw new Error(
      `Failed to fetch user ${id}: ${response.status} ${response.statusText}`
    );
  }

  try {
    return await response.json();
  } catch (e) {
    throw new Error(`Invalid JSON response for user ${id}: ${e.message}`);
  }
}
```

I initially thought this was overkill. Then I spent 3 hours debugging a production issue caused by:
- An empty string being passed as `id`
- The API returning a 500 error that was silently ignored
- Invalid JSON causing a cryptic “unexpected token” error downstream
The verbose version would have caught all three issues immediately with clear error messages. The 11 extra lines would have saved 3 hours of debugging.
The Key Distinction
Good verbosity: Code that prevents future code
- Clear error messages (saves debugging time)
- Good logging (enables quick investigation)
- Comprehensive tests (prevents regression bugs)
- Type definitions (catches errors at compile time)
Bad verbosity: Code that only adds complexity
- Single-use helper functions
- Error handlers for errors that never occur
- Fallback chains that obscure the main path
- Comments that restate the code
How to Get Better Results
I’ve developed a framework for working with AI refactoring that produces more appropriate results.
Before Refactoring: Set Explicit Constraints
The most common mistake I made was giving vague instructions. Instead of “refactor this,” I now specify:
Context to provide:
1. Show existing codebase patterns
2. Specify LOC target explicitly
3. Indicate which abstractions already exist
4. Clarify reuse potential of new helpers

Example prompt that works:

```text
Refactor this function. Constraints:
- Keep total lines under 15
- No new helper functions unless called 3+ times
- Maintain current error handling approach
- Follow existing project patterns
```

During Refactoring: Use Specific Constraints
I’ve found these constraint patterns effective:
Constraints to set:
1. "Reduce LOC, do not add helper functions unless called 3+ times"
2. "Maintain current LOC or reduce by at least 10%"
3. "Inline all single-use functions"
4. "Remove fallbacks that handle <0.1% cases"

The key is being explicit about what “good” means in your context.
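If you send prompts like this often, it can help to generate them so the limits are never left out. Below is a minimal sketch; `build_refactor_prompt` and its parameters are hypothetical names of my own, and the wording simply mirrors the example prompt above rather than any tool's required format.

```python
def build_refactor_prompt(code: str, max_lines: int = 15, min_helper_calls: int = 3) -> str:
    """Assemble a refactoring prompt with explicit, checkable constraints."""
    constraints = [
        f"Keep total lines under {max_lines}",
        f"No new helper functions unless called {min_helper_calls}+ times",
        "Maintain current error handling approach",
        "Follow existing project patterns",
    ]
    bullet_list = "\n".join(f"- {c}" for c in constraints)
    return f"Refactor this function. Constraints:\n{bullet_list}\n\n{code}"


print(build_refactor_prompt("def f(x):\n    return x.name", max_lines=10))
```

Keeping the numbers as parameters makes it easy to tighten or loosen the limits per file instead of rewriting the prompt each time.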
After Refactoring: Audit the Output
I use a simple checklist to evaluate AI output:
Red Flags:
[ ] Helper functions called exactly once
[ ] Error handlers for errors that never occur
[ ] Fallback chains that obscure the main code path
[ ] Duplicate logic across files (AI forgot previous implementations)
[ ] Comments that restate the code

When I spot these, I ask the AI to fix specific issues:

```text
Inline the parse_csv_content helper - it's only called once.
Remove the FileNotFoundError handler - that's handled at the application level.
```

Measuring the Problem
I track three metrics after every AI refactoring session:
1. LOC Delta: Before vs after. Anything greater than +20% triggers a review.
2. Helper Reuse Ratio: Unique call sites divided by helper function definitions. Should be greater than 1.5.
3. Complexity Trend: Does cyclomatic complexity increase? If so, the refactor might be adding unnecessary branching.
These metrics help distinguish between helpful verbosity and harmful bloat.
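The LOC delta and complexity trend can be approximated with a few lines of standard-library Python. This is a sketch under stated assumptions, not a substitute for a real complexity tool: the function names are my own, LOC here means non-blank, non-comment lines (docstrings still count), and "complexity" is just 1 plus the number of branching nodes in the syntax tree, which only roughly tracks cyclomatic complexity.

```python
import ast

# Node types that add a decision point; a rough proxy for cyclomatic complexity.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp, ast.IfExp)


def count_loc(source: str) -> int:
    """Count non-blank lines that are not pure comments."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


def loc_delta_percent(before: str, after: str) -> float:
    """Percentage change in LOC from `before` to `after` (`before` must be non-empty)."""
    old = count_loc(before)
    return (count_loc(after) - old) / old * 100


def branch_count(source: str) -> int:
    """Approximate complexity as 1 + number of branching nodes."""
    return 1 + sum(isinstance(n, BRANCH_NODES) for n in ast.walk(ast.parse(source)))


def needs_review(before: str, after: str, loc_threshold: float = 20.0) -> bool:
    """Flag a refactor whose LOC grew past the threshold or whose branching increased."""
    return (
        loc_delta_percent(before, after) > loc_threshold
        or branch_count(after) > branch_count(before)
    )


BEFORE = "def f(x):\n    return x.name\n"
AFTER = (
    "def f(x):\n"
    "    # guard against missing input\n"
    "    if x is None:\n"
    "        raise ValueError('x')\n"
    "    return x.name\n"
)

print(loc_delta_percent(BEFORE, AFTER))  # 100.0
print(needs_review(BEFORE, AFTER))       # True
```

A flag from `needs_review` only means the diff deserves a second look. The added lines may still be the good kind of verbosity.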
The “Low Quality Slop” Argument
Some developers dismiss all AI-generated code as “low quality slop.” Is that fair?
The skeptic’s case:
- LLM code prioritizes speed over elegance
- No understanding of project-specific context
- No ability to see long-term maintenance implications
- Pattern matching without judgment of pattern quality
The nuanced reality:
- AI is a force multiplier for developer judgment
- Speed advantage is real and valuable
- Quality depends heavily on constraints provided
- Better prompts produce better output
The “slop” label ignores that manual code is often verbose too. I’ve seen developers create the same single-use helpers and excessive fallbacks without any AI assistance. The difference is that AI produces verbose code faster.
Summary
In this post, I showed why AI refactoring often adds more code than it removes. Three root causes drive this: safety bias from training on production code, the abstraction reflex that creates unnecessary helpers, and template matching from verbose training examples.
The key insight is that more code isn’t always bad. Clear errors, good logging, and tests reduce long-term code through prevention. The real question isn’t “how do I make AI write less code?” but “how do I make AI write the right amount of code?”
To get better results:
- Set explicit LOC targets and constraints upfront
- Specify what abstractions already exist and what’s reusable
- Audit output with metrics: LOC delta, helper reuse ratio, complexity
- Distinguish between good verbosity (error handling, tests, types) and bad verbosity (single-use helpers, excessive fallbacks)
Remember: your AI will optimize for what it learned from training data, not what your specific project needs. Bridge that gap with explicit constraints.