What Can LLMs Actually Do Well in Software Development?

Mar 28, 2026

Problem

I spent three hours last week debugging a TPM authentication failure. Every time I asked Claude for help, it suggested generic fixes that didn’t address the root cause. The problem involved:

TPM 2.0 specification quirks
Hardware-specific timing issues
Race conditions in the authentication flow
Legacy system integration points

Claude kept hallucinating solutions. It suggested kernel parameters that didn’t exist, claimed TPM had features it doesn’t have, and completely missed the actual timing issue that was causing intermittent failures.

Meanwhile, earlier that same day, Claude generated a perfect SQL query for finding inactive users in under 30 seconds. Same model. Same session. Completely different results.

Why the difference?

The answer lies in what each task requires:

┌─────────────────────────────────────────────────────────────────────────┐
│                          LLM Task Requirements                          │
├─────────────────────────────┬───────────────────────────────────────────┤
│         SQL Query           │           TPM Debugging                   │
├─────────────────────────────┼───────────────────────────────────────────┤
│ • Well-documented patterns  │ • Hardware-specific quirks                │
│ • Abundant training examples│ • Sparse public documentation             │
│ • Clear input/output format │ • Race conditions hard to reproduce       │
│ • Standard SQL syntax       │ • Requires deep system knowledge          │
│ • Deterministic behavior    │ • Context from multiple systems           │
└─────────────────────────────┴───────────────────────────────────────────┘

For the SQL query, Claude has almost certainly seen thousands of similar examples in its training data. The pattern is well-documented, and both the input (user requirements) and the output (SQL syntax) have clear formats.

For TPM debugging, Claude had almost nothing to draw on. TPM internals are obscure, the timing issue only appeared on specific hardware, and as far as I could find, no blog posts or Stack Overflow questions covered this exact scenario.

Where LLMs excel

Based on my experience over the past year, LLMs perform best on these types of tasks:

1. SQL query generation

When I need to write SQL queries, LLMs are my first stop:

Prompt: "Find users who haven't logged in for 30 days but still have active subscriptions"

Output:
SELECT u.id, u.email, u.last_login_at, s.plan_type
FROM users u
JOIN subscriptions s ON u.id = s.user_id
WHERE u.last_login_at < NOW() - INTERVAL '30 days'
  AND s.status = 'active'
ORDER BY u.last_login_at DESC;

The output is usually correct or close enough that minor tweaks fix it. SQL has:

A largely standardized core syntax (dialects differ in details such as date arithmetic)
Clear semantic meaning
Thousands of examples in training data
Deterministic behavior
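
One way to check a generated query before trusting it is to run it against a throwaway database. The sketch below uses Python's built-in sqlite3 module. The table and column names come from the query above; the sample rows, the helper names (build_demo_db, find_inactive_users), and the fixed "now" value are assumptions made for this example. SQLite has no INTERVAL syntax, so the date arithmetic is rewritten with datetime(), which is exactly the kind of dialect difference worth checking.

```python
import sqlite3

# Illustrative schema; only the table and column names come from the query above.
SCHEMA = """
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, last_login_at TEXT);
CREATE TABLE subscriptions (user_id INTEGER, plan_type TEXT, status TEXT);
"""

# Same logic as the generated query, but with SQLite date arithmetic
# instead of PostgreSQL's NOW() - INTERVAL '30 days'.
INACTIVE_USERS_SQL = """
SELECT u.id, u.email, u.last_login_at, s.plan_type
FROM users u
JOIN subscriptions s ON u.id = s.user_id
WHERE u.last_login_at < datetime(:now, '-30 days')
  AND s.status = 'active'
ORDER BY u.last_login_at DESC;
"""


def find_inactive_users(conn, now):
    """Run the query with an explicit 'now' so results are reproducible."""
    return conn.execute(INACTIVE_USERS_SQL, {"now": now}).fetchall()


def build_demo_db():
    """Create an in-memory database with three made-up users."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?)",
        [
            (1, "old@example.com", "2026-01-01 00:00:00"),     # inactive, active plan
            (2, "recent@example.com", "2026-03-20 00:00:00"),  # logged in recently
            (3, "lapsed@example.com", "2026-01-15 00:00:00"),  # inactive, cancelled plan
        ],
    )
    conn.executemany(
        "INSERT INTO subscriptions VALUES (?, ?, ?)",
        [(1, "pro", "active"), (2, "pro", "active"), (3, "basic", "cancelled")],
    )
    return conn


if __name__ == "__main__":
    print(find_inactive_users(build_demo_db(), "2026-03-28 00:00:00"))
```

With these sample rows, only the user with an old login and an active subscription comes back. A user who has never logged in (NULL last_login_at) is silently excluded by the comparison, which may or may not be what the prompt intended.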

2. Regex pattern generation

Regex is another area where LLMs shine:

Prompt: "Match US phone numbers in formats: (123) 456-7890, 123-456-7890, 123.456.7890"

Output:
^\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$

I used to spend 20 minutes debugging regex. Now I get a working pattern in seconds. Regex has:

Formal grammar with clear rules
Well-documented syntax
Standard patterns for common use cases
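
A generated pattern is still worth a quick test against inputs it should reject. The sketch below runs the pattern above against the three requested formats plus two edge cases I picked myself. The STRICT variant is my own assumption about what a tighter pattern could look like, not something the model produced.

```python
import re

# Pattern exactly as generated above.
GENERATED = re.compile(r"^\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$")

# A stricter variant (an assumption about intent): parentheses must be
# balanced and a separator is required between the digit groups.
STRICT = re.compile(r"^(?:\(\d{3}\) |\d{3}[-.])\d{3}[-.]\d{4}$")

REQUESTED = ["(123) 456-7890", "123-456-7890", "123.456.7890"]
EDGE_CASES = ["(123 456-7890", "1234567890"]  # unbalanced paren, no separators


def matches(pattern, samples):
    """Map each sample string to whether the pattern matches it."""
    return {s: bool(pattern.match(s)) for s in samples}


if __name__ == "__main__":
    print(matches(GENERATED, REQUESTED + EDGE_CASES))
    print(matches(STRICT, REQUESTED + EDGE_CASES))
```

Both patterns accept the three requested formats. The generated one also accepts the unbalanced parenthesis and the bare ten digits, which is fine for loose input cleanup but probably too permissive for validation.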

3. Boilerplate code

LLMs excel at generating repetitive code structures:

Prompt: "Create a FastAPI endpoint for user registration with validation"

Output:
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel, EmailStr
from typing import Optional

router = APIRouter()

class UserRegistration(BaseModel):
    email: EmailStr
    password: str
    name: Optional[str] = None

@router.post("/register")
async def register_user(user: UserRegistration):
    # Validation logic
    if len(user.password) < 8:
        raise HTTPException(status_code=400, detail="Password too short")

    # Database logic would go here
    return {"status": "success", "email": user.email}

This code follows standard patterns. The structure is well-documented in FastAPI tutorials. LLMs have seen thousands of similar endpoints.

4. Well-defined codebases

LLMs perform better in established codebases with lots of examples:

┌─────────────────────────────────────────────────────────────────────┐
│              LLM Performance by Codebase Context                    │
├───────────────────────────┬─────────────────────────────────────────┤
│     High Context          │              Low Context                │
│    (Good Performance)     │            (Poor Performance)           │
├───────────────────────────┼─────────────────────────────────────────┤
│ • Many similar functions  │ • Novel architecture                    │
│ • Consistent patterns     │ • No existing examples                  │
│ • Clear naming conventions│ • Inconsistent naming                   │
│ • Well-documented code    │ • Sparse documentation                  │
│ • Standard libraries      │ • Custom/niche libraries                │
└───────────────────────────┴─────────────────────────────────────────┘

In my main project, the codebase has consistent patterns. When I ask for a new API endpoint, Claude finds similar endpoints and follows the same structure. The result fits naturally.

In a new prototype project with no established patterns, Claude reinvents everything. Sometimes good, sometimes a mess.

Where LLMs fail

Here’s where I’ve seen consistent failures:

1. Troubleshooting complex issues

The TPM debugging story is one example. Here’s another:

I had a microservice that sporadically returned 500 errors. The logs showed nothing useful. Claude suggested:

“Add more logging” (we had extensive logging)
“Check memory usage” (memory was fine)
“Verify database connections” (connections were healthy)

None of these addressed the actual issue: a race condition between service startup and dependency initialization that only manifested under specific load patterns.

Claude couldn’t help because:

The issue required understanding the startup sequence
Race conditions rarely show up in logs
The problem only appeared under load
No documentation covered our specific architecture
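
To make the failure mode concrete, here is a toy reproduction of a startup race and one common fix, a readiness gate. This is not the actual service. The Service class, its method names, and the choice of a 503 response are all assumptions made for illustration.

```python
import threading


class Service:
    """Toy service whose handler depends on a slowly initialized dependency."""

    def __init__(self):
        self._dependency = None
        self._ready = threading.Event()

    def init_dependency(self):
        # Stand-in for connecting to a database, cache, or message broker.
        self._dependency = {"connected": True}
        self._ready.set()

    def handle_unsafe(self):
        # Racy: assumes startup has already finished.
        if self._dependency is None:
            return 500
        return 200

    def handle_gated(self, timeout=2.0):
        # Wait (bounded) for initialization instead of assuming it happened.
        if not self._ready.wait(timeout):
            return 503  # explicit "not ready" instead of an unexplained 500
        return 200


if __name__ == "__main__":
    svc = Service()
    print(svc.handle_unsafe())  # request arrives before initialization
    threading.Timer(0.1, svc.init_dependency).start()
    print(svc.handle_gated())   # waits until the dependency is ready
```

The unsafe handler only fails when a request happens to arrive before initialization completes, which is why this kind of bug tends to appear under load and stay invisible in normal testing.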

2. Reverse engineering unfamiliar systems

I inherited a legacy system with minimal documentation. I asked Claude to explain how the authentication flow worked.

Claude made up an authentication flow that sounded plausible but was completely wrong. It:

Claimed there was a JWT validation step (there wasn’t)
Suggested a middleware pattern (the code used direct checks)
Invented a “session manager” class (no such class existed)

The hallucinations were convincing. They followed standard authentication patterns. But the actual system used a completely different approach that Claude couldn’t know about.
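
A cheap defense is to check every class or function an explanation names against the real source. The sketch below does this for Python code with the standard ast module. It assumes the legacy code is Python, and the LEGACY_SOURCE module and the name SessionManager are made up for the example.

```python
import ast


def defined_names(source):
    """Return the names of all classes and functions defined in the source."""
    tree = ast.parse(source)
    kinds = (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    return {node.name for node in ast.walk(tree) if isinstance(node, kinds)}


def unverified_claims(source, claimed_names):
    """Return the claimed names that are not defined anywhere in the source."""
    return sorted(set(claimed_names) - defined_names(source))


# Hypothetical legacy module, invented for this example.
LEGACY_SOURCE = '''
def check_credentials(request):
    return request.get("token") == "expected"

class LoginHandler:
    def handle(self, request):
        return check_credentials(request)
'''

if __name__ == "__main__":
    # Names an explanation claimed were part of the authentication flow.
    print(unverified_claims(LEGACY_SOURCE, ["SessionManager", "LoginHandler"]))
```

Any name that comes back from unverified_claims is a signal to stop reading the explanation and start reading the code.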

3. Niche libraries and old code

When I needed to modify code using a deprecated library from 2019, Claude:

Suggested API calls that no longer existed
Recommended patterns that were removed in v2.0
Proposed solutions that would break at runtime

The library had changed significantly. Claude’s training data contained old examples but not the current reality.
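
Before spending time on a suggestion, it is quick to confirm that the suggested call exists in the version that is actually installed. The helper below is a minimal sketch using only importlib. The name api_exists is my own, and it only checks that an attribute is present, not that its signature or behavior matches what the model described.

```python
import importlib


def api_exists(module_name, attr_path):
    """Return True if module_name imports and exposes the dotted attr_path."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True


if __name__ == "__main__":
    print(api_exists("os", "path.join"))              # a call that exists
    print(api_exists("os", "path.no_such_function"))  # a call that does not
```

The same check catches removals in the standard library too: for example, the abstract base class aliases such as collections.Mapping were removed in Python 3.10, while older tutorials still use them.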

4. Overengineering

LLMs tend to reimplement existing code instead of finding it:

# My codebase already has:
from app.utils import validate_email, sanitize_input, log_error

# Claude generated:
import re

def validate_email(email: str) -> bool:
    """Validate email format."""
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))

def sanitize_input(text: str) -> str:
    """Remove dangerous characters."""
    return text.replace('<', '').replace('>', '')

def log_error(message: str) -> None:
    """Log error message."""
    print(f"[ERROR] {message}")  # Not even using our logging system!

Claude missed that we already had validated implementations. It reinvented the wheel with simpler (worse) versions.
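
This kind of duplication can be caught mechanically. The sketch below compares the functions defined in a generated snippet against a set of helper names the project already provides. The helper names come from the import line above; the function names top_level_functions and duplicated_helpers, and the GENERATED sample, are assumptions made for the example.

```python
import ast

# Helpers the project already provides (from the import shown above).
EXISTING_HELPERS = {"validate_email", "sanitize_input", "log_error"}


def top_level_functions(source):
    """Names of functions defined at module level in the given source."""
    tree = ast.parse(source)
    return {
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def duplicated_helpers(generated_source, existing_names):
    """Functions the generated code defines although the project has them."""
    return sorted(top_level_functions(generated_source) & set(existing_names))


# Made-up generated snippet: two reinvented helpers and one new function.
GENERATED = '''
def validate_email(email):
    return "@" in email

def sanitize_input(text):
    return text.strip()

def register_user(email):
    return validate_email(email)
'''

if __name__ == "__main__":
    print(duplicated_helpers(GENERATED, EXISTING_HELPERS))
```

A check like this fits naturally into a pre-commit hook or a review script, so reinvented helpers are flagged before they reach the main branch.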

Why this happens

The pattern is clear: LLMs succeed when they have abundant, relevant examples in their training data. They fail when they lack context.

┌─────────────────────────────────────────────────────────────────────┐
│                    LLM Capability Model                             │
│                                                                     │
│    Performance = (Training Examples) × (Pattern Clarity)            │
│                  ───────────────────────────────────────            │
│                          (Context Requirements)                     │
│                                                                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│    High Performance Tasks:                                          │
│    ───────────────────────                                          │
│    SQL:           Many examples × Clear syntax / Low context        │
│    Regex:         Many examples × Formal grammar / Low context      │
│    Boilerplate:   Many examples × Standard patterns / Medium context│
│                                                                     │
│    Low Performance Tasks:                                           │
│    ──────────────────────                                           │
│    TPM debugging: Few examples × No standard / High context         │
│    Reverse engineering: Few examples × Varies / High context        │
│    Niche libraries: Few examples × Often outdated / Medium context  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
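
The model above is a heuristic, not a measurement, but plugging in rough ratings shows how it orders tasks. The 1-5 ratings below are invented purely for illustration, mirroring the qualitative labels in the box ("many" = 5, "few" = 1, and so on). Only the resulting ordering means anything, not the numbers.

```python
# Ordinal ratings on a 1-5 scale, invented purely for illustration.
RATINGS = {
    "SQL": {"examples": 5, "clarity": 5, "context": 1},
    "Boilerplate": {"examples": 5, "clarity": 4, "context": 3},
    "TPM debugging": {"examples": 1, "clarity": 1, "context": 5},
}


def score(rating):
    """Performance = (training examples x pattern clarity) / context requirements."""
    return rating["examples"] * rating["clarity"] / rating["context"]


def ranked(ratings):
    """Task names ordered from highest to lowest heuristic score."""
    return sorted(ratings, key=lambda task: score(ratings[task]), reverse=True)


if __name__ == "__main__":
    for task in ranked(RATINGS):
        print(f"{task}: {score(RATINGS[task]):.2f}")
```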

The hallucination problem

When Claude lacks training examples, it fills gaps with plausible-sounding content. This isn’t lying—it’s pattern completion. Claude generates what a correct answer would look like without knowing what the actual answer is.

In the TPM case, Claude knew what TPM debugging advice should contain (check logs, verify connections, examine error codes). It generated that structure. But the actual fix required knowing:

TPM initialization timing on our specific hardware
A race condition between kernel module load and application startup
A workaround involving a startup delay we discovered experimentally

Claude couldn’t know this. No training example contained it. So it fell back on standard debugging advice that was useless for our specific situation.

The context gap

LLMs also miss context about existing code:

┌───────────────────────────────────────────────────────────────────────┐
│              What LLMs Know vs What They Need                         │
├───────────────────────────────────────────────────────────────────────┤
│                                                                       │
│   LLM Training Knowledge:                                             │
│   ───────────────────────                                             │
│   • Standard patterns across many projects                            │
│   • Public documentation and tutorials                                │
│   • Stack Overflow answers                                            │
│   • Open source code examples                                         │
│                                                                       │
│   LLM Missing Context:                                                │
│   ────────────────────                                                │
│   • Your specific architecture                                        │
│   • Internal conventions and patterns                                 │
│   • Company-specific workarounds                                      │
│   • Hardware/environment specifics                                    │
│   • History of why code exists                                        │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘

This explains the overengineering problem. Claude knows standard patterns but doesn’t know your project has existing implementations.

How to use LLMs effectively

Based on my experience, here’s the right approach:

Use LLMs for:

SQL and regex generation - They excel at these
Boilerplate code - Standard patterns are well-represented
Code explanation - For well-documented libraries and patterns
Initial drafts - Generate first versions, then refine manually
Documentation generation - For code that follows standard patterns

Avoid LLMs for:

Complex troubleshooting - They lack context and hallucinate solutions
Reverse engineering - They don’t know your specific system
Niche libraries - Training data is often outdated or missing
Security-sensitive code - Authentication and encryption require exact implementations
Architecture decisions - They can’t understand your specific constraints

Verification workflow

For all LLM-generated code, I follow this process:

┌─────────────────────────────────────────────────────────────────────┐
│                    LLM Code Verification Workflow                   │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. Generate with LLM                                               │
│     ─────────────────                                               │
│     Output: Code draft                                              │
│                                                                     │
│  2. Check existing codebase                                         │
│     ───────────────────────                                         │
│     "Do we already have this?"                                      │
│     "Does this match our patterns?"                                 │
│                                                                     │
│  3. Validate functionality                                          │
│     ──────────────────────                                          │
│     Run tests, check edge cases                                     │
│                                                                     │
│  4. Security review                                                 │
│     ───────────────                                                 │
│     Check for hardcoded secrets, unsafe patterns                    │
│                                                                     │
│  5. Integrate carefully                                             │
│     ───────────────────                                             │
│     Adapt to project conventions                                    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
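
Part of step 4 can be automated. The sketch below is a deliberately narrow heuristic for spotting hardcoded secrets: it flags a quoted literal assigned to a variable whose name suggests a credential. The pattern, the function name find_hardcoded_secrets, and the sample input are assumptions for illustration. It is no substitute for a dedicated secret scanner or a human review.

```python
import re

# Heuristic: a credential-like variable name assigned a non-empty quoted literal.
SECRET_ASSIGNMENT = re.compile(
    r"\b(\w*(?:password|passwd|secret|api_key|token)\w*)\s*=\s*(['\"])([^'\"]+)\2",
    re.IGNORECASE,
)


def find_hardcoded_secrets(source):
    """Return (line_number, variable_name) for each suspicious assignment."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = SECRET_ASSIGNMENT.search(line)
        if match:
            findings.append((lineno, match.group(1)))
    return findings


# Made-up snippet: one hardcoded secret, one value read from the environment.
SAMPLE = '''import os
DB_PASSWORD = "hunter2"
api_key = os.environ["API_KEY"]
timeout = 30
'''

if __name__ == "__main__":
    print(find_hardcoded_secrets(SAMPLE))
```

The environment lookup on line 3 is not flagged because the right-hand side is not a quoted literal, which keeps false positives down at the cost of missing secrets built from concatenation or stored under innocuous names.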

Improving LLM performance

To get better results from LLMs:

Provide context explicitly

Bad: "Create a user registration endpoint"

Better: "Create a user registration endpoint following the pattern in
app/api/endpoints/auth.py. Use our existing validate_email from
app/utils/validation.py and log with app/utils/logging.py"

Show examples from your codebase

Paste similar code before asking for new code. LLMs will follow the pattern.
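
If you do this often, assembling the prompt can be scripted. The helper below is a minimal sketch. The name build_prompt, the separator format, and the wording of the instruction line are assumptions, and the file path is simply the one used as an example earlier in this post.

```python
def build_prompt(task, examples):
    """Assemble a prompt that shows existing project code before the request.

    examples maps a file path to the snippet taken from that file.
    """
    parts = ["Here is existing code from our project. Follow its patterns.", ""]
    for path, snippet in examples.items():
        parts.append(f"--- {path} ---")
        parts.append(snippet.strip())
        parts.append("")
    parts.append(f"Task: {task}")
    return "\n".join(parts)


if __name__ == "__main__":
    prompt = build_prompt(
        "Create a user registration endpoint",
        {"app/api/endpoints/auth.py": "@router.post('/login')\nasync def login(): ...\n"},
    )
    print(prompt)
```

Putting the examples before the task keeps the request itself at the end of the prompt, where it reads as the thing to act on.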

Be specific about constraints

"Generate SQL for inactive users, but we use PostgreSQL so use
NOW() - INTERVAL syntax, and include the subscription join"

Ask for explanation, not just code

“Explain why this SQL works” helps verify the logic before using it.

The reality check

LLMs are powerful accelerators for certain tasks. But they’re not magic. They work by pattern matching on training data. When your problem matches patterns in their training, they excel. When it doesn’t, they hallucinate.

The TPM debugging failure wasn’t Claude being lazy or incompetent. It was Claude having no relevant training examples. The SQL success wasn’t Claude being brilliant. It was Claude having thousands of similar examples.

Understanding this distinction prevents frustration and wasted time. I now use LLMs for their strengths (SQL, regex, boilerplate) and rely on human expertise for their weaknesses (complex debugging, unfamiliar systems, niche problems).

Summary

In this post, I showed where LLMs excel and where they fail in software development, based on real experience. The key insights:

LLMs excel at SQL, regex, and boilerplate because they have abundant training examples with clear patterns
LLMs fail at complex troubleshooting, reverse engineering, and niche libraries because they lack relevant training data
LLMs overengineer because they miss context about existing code
Hallucination is pattern completion when training data is insufficient
Use LLMs for pattern-based tasks, avoid them for context-heavy tasks
Always verify LLM output against your codebase and requirements

The right approach is not “LLMs are useless” or “LLMs can do everything.” It’s knowing where they work and where they don’t, then using them accordingly.
