What Can LLMs Actually Do Well in Software Development?
Problem
I spent three hours last week debugging a TPM authentication failure. Every time I asked Claude for help, it suggested generic fixes that didn’t address the root cause. The problem involved:
- TPM 2.0 specification quirks
- Hardware-specific timing issues
- Race conditions in the authentication flow
- Legacy system integration points
Claude kept hallucinating solutions. It suggested kernel parameters that didn’t exist, claimed TPM had features it doesn’t have, and completely missed the actual timing issue that was causing intermittent failures.
Meanwhile, earlier that same day, Claude generated a perfect SQL query for finding inactive users in under 30 seconds. Same model. Same session. Completely different results.
Why the difference?
The answer lies in what each task requires:
┌─────────────────────────────────────────────────────────────────────────┐
│                          LLM Task Requirements                          │
├─────────────────────────────┬───────────────────────────────────────────┤
│ SQL Query                   │ TPM Debugging                             │
├─────────────────────────────┼───────────────────────────────────────────┤
│ • Well-documented patterns  │ • Hardware-specific quirks                │
│ • Abundant training examples│ • No public documentation                 │
│ • Clear input/output format │ • Race conditions hard to reproduce       │
│ • Standard SQL syntax       │ • Requires deep system knowledge          │
│ • Deterministic behavior    │ • Context from multiple systems           │
└─────────────────────────────┴───────────────────────────────────────────┘

For the SQL query, Claude had thousands of examples in its training data. The pattern is well-documented. The input (user requirements) and output (SQL syntax) have clear formats.
For TPM debugging, Claude had almost nothing. TPM internals are obscure. The timing issue only appeared on specific hardware. No blog posts or Stack Overflow questions covered this exact scenario.
Where LLMs excel
Based on my experience over the past year, LLMs perform best on these types of tasks:
1. SQL query generation
When I need to write SQL queries, LLMs are my first stop:
Prompt: "Find users who haven't logged in for 30 days but still have active subscriptions"
Output:

    SELECT u.id, u.email, u.last_login_at, s.plan_type
    FROM users u
    JOIN subscriptions s ON u.id = s.user_id
    WHERE u.last_login_at < NOW() - INTERVAL '30 days'
      AND s.status = 'active'
    ORDER BY u.last_login_at DESC;

The output is usually correct or close enough that minor tweaks fix it. SQL has:
- Largely standard syntax across databases (dialects differ in details such as date arithmetic)
- Clear semantic meaning
- Thousands of examples in training data
- Deterministic behavior
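A generated query like this one is cheap to sanity-check against a tiny fixture before it goes anywhere near real data. The sketch below is an illustration under stated assumptions, not the setup used in this post: it runs on Python's built-in sqlite3 with an in-memory database, so the PostgreSQL expression NOW() - INTERVAL '30 days' is swapped for SQLite's datetime('now', '-30 days'), and the two tables contain only the columns the query touches. The function names and fixture rows are made up for the example.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def find_inactive_subscribers(conn: sqlite3.Connection) -> list:
    """Users with an active subscription and no login in the last 30 days."""
    # SQLite variant of the generated query (assumption: the original targets PostgreSQL).
    query = """
        SELECT u.id, u.email, u.last_login_at, s.plan_type
        FROM users u
        JOIN subscriptions s ON u.id = s.user_id
        WHERE u.last_login_at < datetime('now', '-30 days')
          AND s.status = 'active'
        ORDER BY u.last_login_at DESC
    """
    return conn.execute(query).fetchall()


def build_fixture() -> sqlite3.Connection:
    """Create an in-memory database with three illustrative users."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, last_login_at TEXT);
        CREATE TABLE subscriptions (user_id INTEGER, status TEXT, plan_type TEXT);
        """
    )
    now = datetime.now(timezone.utc)

    def days_ago(n: int) -> str:
        # Same "YYYY-MM-DD HH:MM:SS" UTC format that SQLite's datetime() returns.
        return (now - timedelta(days=n)).strftime("%Y-%m-%d %H:%M:%S")

    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?)",
        [
            (1, "stale@example.com", days_ago(45)),      # should match
            (2, "recent@example.com", days_ago(2)),      # logged in recently
            (3, "cancelled@example.com", days_ago(60)),  # subscription not active
        ],
    )
    conn.executemany(
        "INSERT INTO subscriptions VALUES (?, ?, ?)",
        [(1, "active", "pro"), (2, "active", "basic"), (3, "cancelled", "pro")],
    )
    return conn


rows = find_inactive_subscribers(build_fixture())
matched_ids = [row[0] for row in rows]
print(matched_ids)
```

Only the user who is both stale and still subscribed should come back. A fixture this small won't catch performance problems, but it does catch a wrong join or an inverted comparison.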
2. Regex pattern generation
Regex is another area where LLMs shine:
Prompt: "Match US phone numbers in formats: (123) 456-7890, 123-456-7890, 123.456.7890"
Output:

    ^\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$

I used to spend 20 minutes debugging regex. Now I get a working pattern in seconds. Regex has:
- Formal grammar with clear rules
- Well-documented syntax
- Standard patterns for common use cases
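"Working" is worth confirming with a few cases. Here is a minimal sketch that exercises the pattern with Python's re module against the three formats from the prompt plus a couple of inputs that should fail. The parse_phone helper is a made-up name for this example. One thing the check surfaces: the two parentheses are optional independently, so an unbalanced input such as "(123 456-7890" is also accepted.

```python
import re
from typing import Optional, Tuple

# The pattern from the LLM output above, unchanged.
PHONE_RE = re.compile(r"^\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$")


def parse_phone(text: str) -> Optional[Tuple[str, str, str]]:
    """Return (area, prefix, line) if the text matches, otherwise None."""
    match = PHONE_RE.match(text)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


for sample in ["(123) 456-7890", "123-456-7890", "123.456.7890",
               "123-456-789", "(123 456-7890"]:
    print(repr(sample), parse_phone(sample))
```

Whether the lenient parenthesis handling is acceptable depends on the use case. For strict validation the pattern would need an alternation that requires both parentheses or neither.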
3. Boilerplate code
LLMs excel at generating repetitive code structures:
Prompt: "Create a FastAPI endpoint for user registration with validation"
Output:

    from fastapi import APIRouter, HTTPException
    from pydantic import BaseModel, EmailStr
    from typing import Optional

    router = APIRouter()

    class UserRegistration(BaseModel):
        email: EmailStr
        password: str
        name: Optional[str] = None

    @router.post("/register")
    async def register_user(user: UserRegistration):
        # Validation logic
        if len(user.password) < 8:
            raise HTTPException(status_code=400, detail="Password too short")

        # Database logic would go here
        return {"status": "success", "email": user.email}

This code follows standard patterns. The structure is well-documented in FastAPI tutorials. LLMs have seen thousands of similar endpoints.
4. Well-defined codebases
LLMs perform better in established codebases with lots of examples:
┌─────────────────────────────────────────────────────────────────────┐
│                 LLM Performance by Codebase Context                 │
├───────────────────────────┬─────────────────────────────────────────┤
│ High Context              │ Low Context                             │
│ (Good Performance)        │ (Poor Performance)                      │
├───────────────────────────┼─────────────────────────────────────────┤
│ • Many similar functions  │ • Novel architecture                    │
│ • Consistent patterns     │ • No existing examples                  │
│ • Clear naming conventions│ • Inconsistent naming                   │
│ • Well-documented code    │ • Sparse documentation                  │
│ • Standard libraries      │ • Custom/niche libraries                │
└───────────────────────────┴─────────────────────────────────────────┘

In my main project, the codebase has consistent patterns. When I ask for a new API endpoint, Claude finds similar endpoints and follows the same structure. The result fits naturally.
In a new prototype project with no established patterns, Claude reinvents everything. Sometimes good, sometimes a mess.
Where LLMs fail
Here’s where I’ve seen consistent failures:
1. Troubleshooting complex issues
The TPM debugging story is one example. Here’s another:
I had a microservice that sporadically returned 500 errors. The logs showed nothing useful. Claude suggested:
- “Add more logging” (we had extensive logging)
- “Check memory usage” (memory was fine)
- “Verify database connections” (connections were healthy)
None of these addressed the actual issue: a race condition between service startup and dependency initialization that only manifested under specific load patterns.
Claude couldn’t help because:
- The issue required understanding the startup sequence
- Race conditions rarely show up in logs
- The problem only appeared under load
- No documentation covered our specific architecture
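The shape of this kind of bug is easier to see in a toy reproduction. The sketch below is hypothetical throughout (the Service class, its methods, and the simulated dependency aren't the microservice from this post). It shows a handler that assumes the dependency is ready, next to one common mitigation: gating requests on a readiness flag that's set only after initialization finishes.

```python
import threading
import time


class Service:
    """Toy service whose dependency takes a moment to initialize."""

    def __init__(self) -> None:
        self._ready = threading.Event()
        self._dependency = None

    def start(self, init_delay: float = 0.2) -> threading.Thread:
        """Begin initializing the dependency in a background thread."""
        def _init() -> None:
            time.sleep(init_delay)              # stands in for a slow handshake
            self._dependency = {"connected": True}
            self._ready.set()                   # signal only after init completes

        thread = threading.Thread(target=_init, daemon=True)
        thread.start()
        return thread

    def handle_unsafe(self) -> int:
        """Racy handler: assumes the dependency already exists."""
        return 200 if self._dependency is not None else 500

    def handle_safe(self, timeout: float = 2.0) -> int:
        """Gated handler: waits for readiness, returns 503 if it never arrives."""
        if not self._ready.wait(timeout):
            return 503
        return 200


service = Service()
service.start()
early = service.handle_unsafe()   # arrives while initialization is still running
gated = service.handle_safe()     # blocks until the readiness flag is set
print(early, gated)
```

The unsafe handler fails only when a request lands inside the initialization window, which is why a bug like this can look sporadic and load-dependent from the outside.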
2. Reverse engineering unfamiliar systems
I inherited a legacy system with minimal documentation. I asked Claude to explain how the authentication flow worked.
Claude made up an authentication flow that sounded plausible but was completely wrong. It:
- Claimed there was a JWT validation step (there wasn’t)
- Suggested a middleware pattern (the code used direct checks)
- Invented a “session manager” class (no such class existed)
The hallucinations were convincing. They followed standard authentication patterns. But the actual system used a completely different approach that Claude couldn’t know about.
3. Niche libraries and old code
When I needed to modify code using a deprecated library from 2019, Claude:
- Suggested API calls that no longer existed
- Recommended patterns that were removed in v2.0
- Proposed solutions that would break at runtime
The library had changed significantly. Claude’s training data contained old examples but not the current reality.
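One cheap guard against this failure mode is to confirm that a suggested call resolves in the installed version before building on it. The helper below is a hypothetical sketch (api_exists isn't part of any library) that resolves a dotted name using importlib and getattr. It proves only that the name exists, not that its signature or behavior matches what the LLM described. Note that importing a module executes its top-level code.

```python
import importlib


def api_exists(dotted_name: str) -> bool:
    """Return True if a name like 'package.module.attr' resolves when imported."""
    parts = dotted_name.split(".")
    # Try the longest importable module prefix first, then walk the attributes.
    for split in range(len(parts), 0, -1):
        module_name = ".".join(parts[:split])
        try:
            obj = importlib.import_module(module_name)
        except (ImportError, ValueError):
            continue
        for attr in parts[split:]:
            if not hasattr(obj, attr):
                return False
            obj = getattr(obj, attr)
        return True
    return False


# Standard-library names are used here so the example runs anywhere.
print(api_exists("os.path.join"))
print(api_exists("os.path.no_such_function"))
```

Running a check like this over every call in a generated snippet won't replace reading the library's changelog, but it flags removed functions before they fail at runtime.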
4. Overengineering
LLMs tend to reimplement existing code instead of finding it:
    # My codebase already has:
    from app.utils import validate_email, sanitize_input, log_error

    # Claude generated:
    import re

    def validate_email(email: str) -> bool:
        """Validate email format."""
        pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
        return bool(re.match(pattern, email))

    def sanitize_input(text: str) -> str:
        """Remove dangerous characters."""
        return text.replace('<', '').replace('>', '')

    def log_error(message: str) -> None:
        """Log error message."""
        print(f"[ERROR] {message}")  # Not even using our logging system!

Claude missed that we already had validated implementations. It reinvented the wheel with simpler (worse) versions.
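A simple automated check can catch this kind of duplication before review. The sketch below is illustrative only: it parses a generated snippet with Python's ast module and reports any top-level function names that collide with helpers the project already has. The known_helpers set is hard-coded from the import line above, and the generated snippet is shortened for the example. In a real project the set would have to be collected from the codebase.

```python
import ast
from typing import Set


def defined_functions(source: str) -> Set[str]:
    """Names of functions defined at the top level of a source string."""
    tree = ast.parse(source)
    return {
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def find_reimplementations(generated: str, known_helpers: Set[str]) -> Set[str]:
    """Functions in the generated code that shadow existing project helpers."""
    return defined_functions(generated) & known_helpers


generated_code = '''
import re

def validate_email(email: str) -> bool:
    return bool(re.match(r"^[^@]+@[^@]+$", email))

def build_response(email: str) -> dict:
    return {"email": email}
'''

known_helpers = {"validate_email", "sanitize_input", "log_error"}
duplicates = find_reimplementations(generated_code, known_helpers)
print(sorted(duplicates))
```

Name matching is a blunt instrument: it misses a reimplementation under a different name and flags harmless local helpers. It's still a quick first pass for the "do we already have this?" question.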
Why this happens
The pattern is clear: LLMs succeed when they have abundant, relevant examples in their training data. They fail when they lack context.
┌───────────────────────────────────────────────────────────────────────────┐
│                           LLM Capability Model                            │
│                                                                           │
│  Performance = (Training Examples) × (Pattern Clarity)                    │
│                ───────────────────────────────────────                    │
│                        (Context Requirements)                             │
│                                                                           │
├───────────────────────────────────────────────────────────────────────────┤
│                                                                           │
│  High Performance Tasks:                                                  │
│  ───────────────────────                                                  │
│  SQL:                 Many examples + Clear syntax      / Low context     │
│  Regex:               Many examples + Formal grammar    / Low context     │
│  Boilerplate:         Many examples + Standard patterns / Medium context  │
│                                                                           │
│  Low Performance Tasks:                                                   │
│  ──────────────────────                                                   │
│  TPM debugging:       Few examples + No standard        / High context    │
│  Reverse engineering: Few examples + Varies             / High context    │
│  Niche libraries:     Few examples + Often outdated     / Medium context  │
│                                                                           │
└───────────────────────────────────────────────────────────────────────────┘

The hallucination problem
When Claude lacks training examples, it fills gaps with plausible-sounding content. This isn’t lying—it’s pattern completion. Claude generates what a correct answer would look like without knowing what the actual answer is.
In the TPM case, Claude knew what TPM debugging advice should contain (check logs, verify connections, examine error codes). It generated that structure. But the actual fix required knowing:
- TPM initialization timing on our specific hardware
- A race condition between kernel module load and application startup
- A workaround involving a startup delay we discovered experimentally
Claude couldn’t know this. No training example contained it. So it hallucinated standard debugging advice that was useless for our specific situation.
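For readers facing a similar startup race, a fixed delay can often be replaced with polling against a deadline. The sketch below is a generic illustration, not the workaround from this debugging session: wait_for_path is a hypothetical helper, and the idea that the resource shows up as a filesystem path is an assumption.

```python
import os
import time


def wait_for_path(path: str, timeout: float = 10.0, interval: float = 0.1) -> bool:
    """Poll until `path` exists or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Demonstration with paths that exist (or not) on any machine.
present = wait_for_path(os.getcwd(), timeout=0.5)
missing = wait_for_path(os.path.join(os.getcwd(), "no_such_path_for_demo"),
                        timeout=0.2, interval=0.05)
print(present, missing)
```

On a Linux system the application might call something like wait_for_path("/dev/tpm0", timeout=30) before starting authentication; that device path is an assumption here. A device node existing doesn't guarantee the hardware has finished initializing, so this narrows the race window without necessarily closing it.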
The context gap
LLMs also miss context about existing code:
┌──────────────────────────────────────────────────┐
│         What LLMs Know vs What They Need         │
├──────────────────────────────────────────────────┤
│                                                  │
│  LLM Training Knowledge:                         │
│  ───────────────────────                         │
│  • Standard patterns across many projects        │
│  • Public documentation and tutorials            │
│  • Stack Overflow answers                        │
│  • Open source code examples                     │
│                                                  │
│  LLM Missing Context:                            │
│  ────────────────────                            │
│  • Your specific architecture                    │
│  • Internal conventions and patterns             │
│  • Company-specific workarounds                  │
│  • Hardware/environment specifics                │
│  • History of why code exists                    │
│                                                  │
└──────────────────────────────────────────────────┘

This explains the overengineering problem. Claude knows standard patterns but doesn’t know your project has existing implementations.
How to use LLMs effectively
Based on my experience, here’s the right approach:
Use LLMs for:
- SQL and regex generation - They excel at these
- Boilerplate code - Standard patterns are well-represented
- Code explanation - For well-documented libraries and patterns
- Initial drafts - Generate first versions, then refine manually
- Documentation generation - For code that follows standard patterns
Avoid LLMs for:
- Complex troubleshooting - They lack context and hallucinate solutions
- Reverse engineering - They don’t know your specific system
- Niche libraries - Training data is often outdated or missing
- Security-sensitive code - Authentication and encryption require exact implementations
- Architecture decisions - They can’t understand your specific constraints
Verification workflow
For all LLM-generated code, I follow this process:
┌───────────────────────────────────────────────────────┐
│            LLM Code Verification Workflow             │
├───────────────────────────────────────────────────────┤
│                                                       │
│  1. Generate with LLM                                 │
│     ─────────────────                                 │
│     Output: Code draft                                │
│                                                       │
│  2. Check existing codebase                           │
│     ───────────────────────                           │
│     "Do we already have this?"                        │
│     "Does this match our patterns?"                   │
│                                                       │
│  3. Validate functionality                            │
│     ──────────────────────                            │
│     Run tests, check edge cases                       │
│                                                       │
│  4. Security review                                   │
│     ───────────────                                   │
│     Check for hardcoded secrets, unsafe patterns      │
│                                                       │
│  5. Integrate carefully                               │
│     ───────────────────                               │
│     Adapt to project conventions                      │
│                                                       │
└───────────────────────────────────────────────────────┘

Improving LLM performance
To get better results from LLMs:
- Provide context explicitly
Bad: "Create a user registration endpoint"
Better: "Create a user registration endpoint following the pattern in app/api/endpoints/auth.py. Use our existing validate_email from app/utils/validation.py and log with app/utils/logging.py"
- Show examples from your codebase
Paste similar code before asking for new code. LLMs will follow the pattern.
- Be specific about constraints
"Generate SQL for inactive users, but we use PostgreSQL so use NOW() - INTERVAL syntax, and include the subscription join"
- Ask for explanation, not just code
“Explain why this SQL works” helps verify the logic before using it.
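These four habits can be folded into a small prompt-building helper. This is a hypothetical sketch: build_prompt and its parameters are made up for illustration, and the example snippet is a placeholder, not real project code.

```python
from typing import Dict, List


def build_prompt(task: str, constraints: List[str], examples: Dict[str, str]) -> str:
    """Assemble a prompt that states the task, constraints, and reference code."""
    sections = [f"Task: {task}"]
    if constraints:
        bullet_list = "\n".join(f"- {item}" for item in constraints)
        sections.append(f"Constraints:\n{bullet_list}")
    for path, code in examples.items():
        # Pasting existing code gives the model a concrete pattern to follow.
        sections.append(f"Existing code from {path} (follow this pattern):\n{code}")
    sections.append("Explain why the solution works before showing the code.")
    return "\n\n".join(sections)


prompt = build_prompt(
    task="Generate SQL for inactive users",
    constraints=[
        "We use PostgreSQL; use NOW() - INTERVAL syntax",
        "Include the subscription join",
    ],
    examples={"app/utils/validation.py": "def validate_email(email): ..."},
)
print(prompt)
```

Keeping the constraints and reference files as data makes it easy to reuse the same project context across prompts, so it doesn't have to be retyped each time.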
The reality check
LLMs are powerful accelerators for certain tasks. But they’re not magic. They work by pattern matching on training data. When your problem matches patterns in their training, they excel. When it doesn’t, they hallucinate.
The TPM debugging failure wasn’t Claude being lazy or incompetent. It was Claude having no relevant training examples. The SQL success wasn’t Claude being brilliant. It was Claude having thousands of similar examples.
Understanding this distinction prevents frustration and wasted time. I now use LLMs for their strengths (SQL, regex, boilerplate) and rely on human expertise for their weaknesses (complex debugging, unfamiliar systems, niche problems).
Summary
In this post, I showed where LLMs excel vs fail in software development based on real experience. The key insights:
- LLMs excel at SQL, regex, and boilerplate because they have abundant training examples with clear patterns
- LLMs fail at complex troubleshooting, reverse engineering, and niche libraries because they lack relevant training data
- LLMs overengineer because they miss context about existing code
- Hallucination is pattern completion when training data is insufficient
- Use LLMs for pattern-based tasks, avoid them for context-heavy tasks
- Always verify LLM output against your codebase and requirements
The right approach is not “LLMs are useless” or “LLMs can do everything.” It’s knowing where they work and where they don’t, then using them accordingly.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience.