Why Do LLM Agents Fail Unpredictably in Production?

Mar 20, 2026

The Problem: Same Input, Different Output

I shipped an LLM agent to production. It worked in testing. It worked in demos. Two weeks later, I got a bug report: same input, different output. No error. No helpful log. Just a wrong answer delivered confidently.

I checked the logs. The agent had called a different tool than before. With different parameters. In a different order. The input was identical. The result was wrong.

This happened randomly across production. Sometimes the agent worked perfectly. Sometimes it hallucinated tool calls. Sometimes it went into infinite loops. I could not reproduce the issue consistently.

My First Mistake: Blaming the Model

I spent weeks tuning prompts. I added examples. I clarified instructions. I tried different temperatures. Nothing worked reliably.

The pattern I kept seeing in production logs:

Day 1: Input "search for user john" -> calls search_user("john") -> correct
Day 3: Input "search for user john" -> calls find_user("john") -> correct
Day 5: Input "search for user john" -> calls search("john", type="user") -> correct
Day 7: Input "search for user john" -> calls database_query("SELECT * FROM users WHERE name LIKE '%john%'") -> exposed SQL

Same input. Different tool calls. Different behavior each time.

Then I realized the problem was never the model. The problem was that I handed the model full control over execution.

The Root Cause: Unconstrained Agent Execution

Why does the LLM get to decide which tool to call, in what order, with what parameters? That is just unconstrained execution with no contract, no validation, and no recovery path.

Here is the problematic pattern I was using:

from langchain.agents import initialize_agent

agent = initialize_agent(
    tools=[search_tool, database_tool, api_tool],
    llm=llm,
    agent="zero-shot-react-description"
)

# Model has full control over:
# - Which tool to call
# - What order to call them
# - What parameters to pass
# - When to stop
result = agent.run(user_input)  # Unpredictable!

The model decides everything. And models are probabilistic by nature. Small variations in context, temperature, or token limits create different execution paths for identical inputs.

This causes three problems:

Non-determinism: Same input produces different execution paths
No debugging: Cannot reproduce issues without knowing exact model state
No recovery: When things go wrong, no rollback or retry strategy
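
To make the non-determinism concrete, here is a toy sketch of sampling-based tool selection. The tool scores and the `pick_tool` function are made up for illustration and do not describe how any particular model or framework selects tools. The sketch only shows that sampling from a score distribution can return different tools for one identical input.

```python
import math
import random

# Hypothetical scores a model might assign to each candidate tool for one input.
TOOL_LOGITS = {"search_user": 2.0, "find_user": 1.6, "database_query": 1.1}

def pick_tool(logits: dict, temperature: float, rng: random.Random) -> str:
    """Pick a tool by softmax sampling; temperature <= 0 means greedy argmax."""
    if temperature <= 0:
        return max(logits, key=logits.get)
    # Subtract the max score before exponentiating for numerical stability.
    top = max(logits.values())
    weights = [math.exp((score - top) / temperature) for score in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]

rng = random.Random(0)
sampled = [pick_tool(TOOL_LOGITS, 1.0, rng) for _ in range(200)]
greedy = [pick_tool(TOOL_LOGITS, 0.0, rng) for _ in range(200)]

print(sorted(set(sampled)))  # distinct tools chosen for the same input
print(sorted(set(greedy)))   # the single top-scoring tool
```

This toy model ignores other sources of variation, such as a changing context window, so it understates how many paths a real agent can take.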

The shift from “fix the prompt” to “fix the execution layer” was the pattern I needed.

The Solution: Deterministic Execution Layer

I stopped giving the model control over execution flow. Instead, I wrote code that controls execution, and only used the model for content generation.

Step 1: Define Execution Contracts

I started by specifying exact tool sequences for each task type:

from pydantic import BaseModel, field_validator  # assumes the Pydantic v2 API

class SearchParams(BaseModel):
    """Contract for search tool parameters"""
    query: str
    max_results: int = 10

    @field_validator('query')
    @classmethod
    def validate_query(cls, v: str) -> str:
        # Reject queries too short to produce a meaningful search
        if len(v) < 3:
            raise ValueError('Query too short')
        return v

class AgentWorkflow:
    """Deterministic execution with model at specific points only"""

    def __init__(self, llm, tools):
        self.llm = llm
        self.tools = tools
        self.state = "initialized"

    def execute_search(self, params: SearchParams) -> list:
        """Contract: exact tool, validated params"""
        validated = SearchParams(**params.dict())
        return self.tools['search'].run(validated.query, validated.max_results)

    def synthesize_answer(self, query: str, context: list) -> str:
        """Model generates content, not control flow"""
        prompt = f"Based on: {context}\nAnswer: {query}"
        return self.llm.generate(prompt)

    def run(self, user_query: str) -> dict:
        """Fixed execution path with recovery"""
        try:
            # Step 1: Search (deterministic)
            results = self.execute_search(SearchParams(query=user_query))

            # Step 2: Synthesize (model for content only)
            answer = self.synthesize_answer(user_query, results)

            return {"success": True, "answer": answer, "source": results}
        except Exception as e:
            # Recovery path
            return self.fallback_response(user_query, e)

    def fallback_response(self, query: str, error: Exception) -> dict:
        """Known failure mode with graceful degradation"""
        return {
            "success": False,
            "answer": "Unable to process request",
            "error": str(error),
            "fallback": True
        }

Now the execution path is fixed. The model only generates the answer text. It cannot decide which tools to call or in what order.
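
The same idea extends to several task types by mapping each one to a fixed, ordered list of steps. The sketch below is one possible shape, not the original implementation: `WORKFLOWS`, `run_task`, and the two step functions are hypothetical names, and the steps are stand-ins for a real search tool and a real LLM call.

```python
from typing import Callable, Dict, List

# Hypothetical step functions; each takes and returns a shared context dict.
def search_users(ctx: dict) -> dict:
    ctx["results"] = [f"user:{ctx['query']}"]  # stand-in for a real search tool
    return ctx

def summarize(ctx: dict) -> dict:
    ctx["answer"] = f"Found {len(ctx['results'])} result(s)"  # stand-in for an LLM call
    return ctx

# Each task type maps to exactly one fixed, ordered sequence of steps.
WORKFLOWS: Dict[str, List[Callable[[dict], dict]]] = {
    "user_lookup": [search_users, summarize],
}

def run_task(task_type: str, query: str) -> dict:
    """Run the fixed sequence for a task type; unknown types are rejected."""
    if task_type not in WORKFLOWS:
        raise ValueError(f"No workflow registered for task type: {task_type}")
    ctx = {"query": query, "trace": []}
    for step in WORKFLOWS[task_type]:
        ctx["trace"].append(step.__name__)  # record the path for debugging
        ctx = step(ctx)
    return ctx

out = run_task("user_lookup", "john")
print(out["trace"])
print(out["answer"])
```

Because the sequence lives in a plain dictionary, the execution path for a task type can be reviewed and tested like any other code.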

Step 2: Add State Machine for Complex Workflows

For more complex workflows, I added a state machine to enforce execution order:

from enum import Enum
from typing import Optional

class AgentState(Enum):
    INIT = "init"
    PLANNING = "planning"
    EXECUTING = "executing"
    VALIDATING = "validating"
    COMPLETED = "completed"
    FAILED = "failed"

class StateMachineAgent:
    """Enforce execution order, no model control over flow"""

    TRANSITIONS = {
        AgentState.INIT: [AgentState.PLANNING],
        AgentState.PLANNING: [AgentState.EXECUTING, AgentState.FAILED],
        AgentState.EXECUTING: [AgentState.VALIDATING, AgentState.FAILED],
        AgentState.VALIDATING: [AgentState.COMPLETED, AgentState.EXECUTING],
        AgentState.COMPLETED: [],
        AgentState.FAILED: []
    }

    def __init__(self, llm, tools):
        self.llm = llm
        self.tools = tools
        self.state = AgentState.INIT
        self.attempts = 0
        self.max_attempts = 3

    def transition(self, next_state: AgentState) -> bool:
        """Only allow valid transitions"""
        if next_state in self.TRANSITIONS[self.state]:
            self.state = next_state
            return True
        return False  # Invalid transition, enforce order

    def run(self, task: dict) -> dict:
        # create_plan, execute_plan, validate, and build_result are
        # application-specific methods that are not shown here.
        plan = None
        result = None
        while self.state not in (AgentState.COMPLETED, AgentState.FAILED):
            if self.state == AgentState.INIT:
                self.transition(AgentState.PLANNING)

            elif self.state == AgentState.PLANNING:
                plan = self.create_plan(task)
                if plan:
                    self.transition(AgentState.EXECUTING)
                else:
                    self.transition(AgentState.FAILED)

            elif self.state == AgentState.EXECUTING:
                try:
                    result = self.execute_plan(plan)
                    self.transition(AgentState.VALIDATING)
                except Exception:
                    # A tool error ends the run in FAILED instead of escaping the loop
                    self.transition(AgentState.FAILED)

            elif self.state == AgentState.VALIDATING:
                if self.validate(result):
                    self.transition(AgentState.COMPLETED)
                elif self.attempts < self.max_attempts:
                    # Retry execution a bounded number of times
                    self.attempts += 1
                    self.transition(AgentState.EXECUTING)
                else:
                    self.transition(AgentState.FAILED)

        return self.build_result()

The state machine prevents the model from jumping between states arbitrarily. Each transition is checked against the TRANSITIONS table. If result validation fails, the agent re-executes up to max_attempts times before moving to FAILED.
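
One design choice worth considering: `transition` returns False for an invalid move, and `run` does not check that return value, so a mistaken transition would leave the state unchanged and the loop spinning. A stricter variant raises instead. The sketch below uses a reduced, hypothetical state set (`State`, `ALLOWED`, `InvalidTransition`, `strict_transition`) to keep the example short.

```python
from enum import Enum

class State(Enum):
    INIT = "init"
    EXECUTING = "executing"
    COMPLETED = "completed"

# Allowed moves; anything not listed here is rejected.
ALLOWED = {
    State.INIT: {State.EXECUTING},
    State.EXECUTING: {State.COMPLETED},
    State.COMPLETED: set(),
}

class InvalidTransition(RuntimeError):
    """Raised when code requests a transition the table does not allow."""

def strict_transition(current: State, target: State) -> State:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.name} -> {target.name} is not allowed")
    return target

state = strict_transition(State.INIT, State.EXECUTING)  # allowed move
try:
    strict_transition(State.INIT, State.COMPLETED)      # skips EXECUTING
except InvalidTransition as exc:
    print(exc)
```

Raising makes a wiring mistake visible immediately, at the cost of needing an exception handler around the run loop.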

Step 3: Validate Every Tool Call

I added validation schemas to prevent parameter hallucination:

# Assumes the Pydantic v2 API
from pydantic import BaseModel, Field, ValidationError, ValidationInfo, field_validator

REGISTERED_TOOLS = {'search', 'database', 'api'}
TOOL_SCHEMAS = {
    'search': SearchParams,
    # other tool schemas...
}

class ToolCallSchema(BaseModel):
    """Contract for every tool call - no hallucinated params"""

    tool_name: str = Field(..., pattern="^[a-z_]+$")
    parameters: dict
    reasoning: str = Field(..., min_length=10)

    @field_validator('tool_name')
    @classmethod
    def tool_must_exist(cls, v: str) -> str:
        if v not in REGISTERED_TOOLS:
            raise ValueError(f"Unknown tool: {v}")
        return v

    @field_validator('parameters')
    @classmethod
    def validate_params_against_tool_schema(cls, v: dict, info: ValidationInfo) -> dict:
        # info.data holds the fields validated so far; tool_name is missing if it failed
        tool_name = info.data.get('tool_name')
        if tool_name is None:
            return v  # the tool_name error is already reported
        schema = TOOL_SCHEMAS.get(tool_name)
        if schema is None:
            # Fail closed: a registered tool without a parameter schema is rejected
            raise ValueError(f"No parameter schema for tool: {tool_name}")
        return schema(**v).model_dump()

def safe_execute(model_output: dict):
    # execute_tool, log_failure, and fallback_response are application helpers (not shown)
    try:
        validated = ToolCallSchema(**model_output)
        return execute_tool(validated.tool_name, validated.parameters)
    except ValidationError as e:
        log_failure(e)
        return fallback_response()

Now every tool call goes through validation. If the model hallucinates a parameter or calls an unregistered tool, the validation fails before execution.
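
To see the effect on the Day 7 failure from the earlier logs, here is a dependency-free sketch of the same allowlist idea. It is not the Pydantic schema above: `ALLOWED_TOOLS`, `ToolCall`, and `check_tool_call` are hypothetical names, and the single registered tool is an assumption made for the example.

```python
from dataclasses import dataclass

# Hypothetical allowlist: tool name -> parameter names and their expected types.
ALLOWED_TOOLS = {
    "search_user": {"name": str},
}

@dataclass(frozen=True)
class ToolCall:
    tool_name: str
    parameters: dict

def check_tool_call(raw: dict) -> ToolCall:
    """Return a ToolCall if raw matches the allowlist, otherwise raise ValueError."""
    tool_name = raw.get("tool_name")
    if not isinstance(tool_name, str) or tool_name not in ALLOWED_TOOLS:
        raise ValueError(f"Unknown tool: {tool_name!r}")
    expected = ALLOWED_TOOLS[tool_name]
    params = raw.get("parameters")
    # Parameter names must match the contract exactly: no extras, none missing.
    if not isinstance(params, dict) or set(params) != set(expected):
        raise ValueError(f"Parameters for {tool_name} must be exactly {sorted(expected)}")
    for key, expected_type in expected.items():
        if not isinstance(params[key], expected_type):
            raise ValueError(f"Parameter {key!r} must be {expected_type.__name__}")
    return ToolCall(tool_name, params)

good = {"tool_name": "search_user", "parameters": {"name": "john"}}
bad = {"tool_name": "database_query", "parameters": {"sql": "SELECT * FROM users"}}

print(check_tool_call(good))
try:
    check_tool_call(bad)
except ValueError as exc:
    print(exc)  # the raw SQL call is rejected before anything executes
```

The raw SQL call never reaches a database because the tool name is not on the allowlist, regardless of what the parameters contain.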

What Changed in Production

After implementing the deterministic execution layer:

Day 1: Input "search for user john" -> search_user("john") -> correct
Day 3: Input "search for user john" -> search_user("john") -> correct
Day 7: Input "search for user john" -> search_user("john") -> correct

Same input. Same execution path. Consistent results.

When failures occur now, I can see exactly where in the state machine they happened. I can reproduce them. I can add recovery logic for that specific failure mode.
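
Locating a failure this way requires that each run records its transitions. A minimal sketch of such a record follows. `TransitionLog` and its methods are hypothetical and not part of the classes above, and the failure note is invented for the demonstration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class TransitionLog:
    """Hypothetical helper that records every state change for one agent run."""
    entries: List[Tuple[str, str, str]] = field(default_factory=list)

    def record(self, source: str, target: str, note: str = "") -> None:
        self.entries.append((source, target, note))

    def path(self) -> List[str]:
        """Return the sequence of states visited, in order."""
        if not self.entries:
            return []
        return [self.entries[0][0]] + [target for _, target, _ in self.entries]

    def failure_point(self) -> Optional[Tuple[str, str]]:
        """Return (state before failure, note) if the run failed, else None."""
        for source, target, note in self.entries:
            if target == "failed":
                return source, note
        return None

log = TransitionLog()
log.record("init", "planning")
log.record("planning", "executing")
log.record("executing", "failed", note="search tool timed out")  # invented example note

print(log.path())
print(log.failure_point())
```

Comparing `path()` across two runs of the same input also gives a cheap regression check that the execution path has not drifted.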

The Three Principles

After six months of production failures and fixes, I learned three principles:

Models generate, code controls - Use LLMs for content, not control flow
Validate everything - Every tool call, every parameter, every state transition
Design for failure - Build recovery paths, not just happy paths

Most production issues were not “quality” problems. They were “behavior drift” problems. The model drifted to different execution paths without any code change. Constraining that drift fixed the reliability issues.

Common Mistakes I Made

I made several mistakes along the way:

Blaming the model: I spent weeks tuning prompts when the architecture was the problem.

Over-engineering prompts: I added complexity to handle edge cases that code should handle.

Missing validation: I trusted model output without schema validation. This let hallucinated parameters through to production.

No circuit breakers: I let agents continue after failures instead of failing fast with recovery paths.
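
For the last mistake, a circuit breaker can be sketched in a few lines. This is a simplified illustration under assumed defaults, not the code I ran in production: the `CircuitBreaker` class, its thresholds, and `flaky_tool` are all hypothetical.

```python
import time

class CircuitBreaker:
    """Stop calling a failing tool after repeated errors, then retry after a cool-down."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, fallback=None, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback  # circuit open: fail fast without calling the tool
            # Cool-down elapsed: allow one trial call; one more failure reopens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback
        self.failures = 0  # a success closes the circuit fully
        return result

calls = []

def flaky_tool(query: str) -> str:
    calls.append(query)
    raise TimeoutError("tool timed out")

breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
results = [breaker.call(flaky_tool, "john", fallback="unavailable") for _ in range(5)]

print(len(calls))  # the tool is only invoked until the circuit opens
print(results)
```

After the circuit opens, the remaining requests get the fallback immediately, which keeps a broken tool from being hammered and gives the recovery path a single, predictable entry point.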

Summary

In this post, I explained why LLM agents fail unpredictably in production and how to fix it. The key insight is that models should generate content within a framework, not control execution flow. By implementing deterministic execution layers with contracts, validation, and recovery paths, you can build reliable agent systems that behave predictably in production.

The shift from “fix the prompt” to “fix the execution layer” was the pattern that finally made my agents reliable.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Reddit discussion on LLM agent production failures
👨‍💻 LangChain Documentation
👨‍💻 Pydantic Validation Guide
👨‍💻 State Machine Pattern in Python
