Why Do LLM Agents Fail Unpredictably in Production?
The Problem: Same Input, Different Output
I shipped an LLM agent to production. It worked in testing. It worked in demos. Two weeks later, I got a bug report: same input, different output. No error. No helpful log. Just a wrong answer delivered confidently.
I checked the logs. The agent had called a different tool than before. With different parameters. In a different order. The input was identical. The result was wrong.
This happened randomly across production. Sometimes the agent worked perfectly. Sometimes it hallucinated tool calls. Sometimes it went into infinite loops. I could not reproduce the issue consistently.
My First Mistake: Blaming the Model
I spent weeks tuning prompts. I added examples. I clarified instructions. I tried different temperatures. Nothing worked reliably.
The pattern I kept seeing in production logs:
Day 1: Input "search for user john" -> calls search_user("john") -> correct
Day 3: Input "search for user john" -> calls find_user("john") -> correct
Day 5: Input "search for user john" -> calls search("john", type="user") -> correct
Day 7: Input "search for user john" -> calls database_query("SELECT * FROM users WHERE name LIKE '%john%'") -> exposed SQL

Same input. Different tool calls. Different behavior each time.
Then I realized the problem was never the model. The problem was that I handed the model full control over execution.
The Root Cause: Unconstrained Agent Execution
Why does the LLM get to decide which tool to call, in what order, with what parameters? That is just unconstrained execution with no contract, no validation, and no recovery path.
Here is the problematic pattern I was using:
from langchain.agents import initialize_agent

agent = initialize_agent(
    tools=[search_tool, database_tool, api_tool],
    llm=llm,
    agent="zero-shot-react-description",
)

# Model has full control over:
# - Which tool to call
# - What order to call them
# - What parameters to pass
# - When to stop
result = agent.run(user_input)  # Unpredictable!

The model decides everything. And models are probabilistic by nature. Small variations in context, temperature, or token limits create different execution paths for identical inputs.
This causes three problems:
- Non-determinism: Same input produces different execution paths
- No debugging: Cannot reproduce issues without knowing exact model state
- No recovery: When things go wrong, no rollback or retry strategy
The shift from “fix the prompt” to “fix the execution layer” was the pattern I needed.
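To make the non-determinism point concrete, here is a toy simulation of sampling-based tool selection. It is only a sketch: the scores in `TOOL_SCORES` are invented for illustration, and a real model's decoding involves far more than one softmax over four options. The names `sample_tool` and `distinct_choices` are hypothetical helpers, not part of any library.

```python
import math
import random

# Made-up scores a model might assign to each candidate tool for the
# input "search for user john". These numbers are assumptions.
TOOL_SCORES = {
    "search_user": 2.0,
    "find_user": 1.6,
    "search": 1.2,
    "database_query": 0.4,
}


def sample_tool(scores: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick a tool by softmax sampling; temperature 0 means greedy argmax."""
    if temperature == 0:
        return max(scores, key=scores.get)
    weights = [math.exp(score / temperature) for score in scores.values()]
    return rng.choices(list(scores), weights=weights, k=1)[0]


def distinct_choices(temperature: float, runs: int = 200, seed: int = 0) -> set[str]:
    """Collect every tool chosen across repeated runs of the same input."""
    rng = random.Random(seed)
    return {sample_tool(TOOL_SCORES, temperature, rng) for _ in range(runs)}


if __name__ == "__main__":
    print("temperature 0:  ", distinct_choices(0))
    print("temperature 1.0:", distinct_choices(1.0))
```

In this toy setup, greedy selection always lands on the highest-scoring tool, while sampling at a non-zero temperature spreads identical inputs across several tools. Lowering the temperature narrows the spread, but it does not give the calling code any say over which tool runs.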
The Solution: Deterministic Execution Layer
I stopped giving the model control over execution flow. Instead, I wrote code that controls execution and used the model only for content generation.
Step 1: Define Execution Contracts
I started by specifying exact tool sequences for each task type:
from typing import TypedDict
from pydantic import BaseModel, validator

class SearchParams(BaseModel):
    """Contract for search tool parameters"""
    query: str
    max_results: int = 10

    @validator('query')
    def validate_query(cls, v):
        if len(v) < 3:
            raise ValueError('Query too short')
        return v

class AgentWorkflow:
    """Deterministic execution with model at specific points only"""

    def __init__(self, llm, tools):
        self.llm = llm
        self.tools = tools
        self.state = "initialized"

    def execute_search(self, params: SearchParams) -> list:
        """Contract: exact tool, validated params"""
        validated = SearchParams(**params.dict())
        return self.tools['search'].run(validated.query, validated.max_results)

    def synthesize_answer(self, query: str, context: list) -> str:
        """Model generates content, not control flow"""
        prompt = f"Based on: {context}\nAnswer: {query}"
        return self.llm.generate(prompt)

    def run(self, user_query: str) -> dict:
        """Fixed execution path with recovery"""
        try:
            # Step 1: Search (deterministic)
            results = self.execute_search(SearchParams(query=user_query))

            # Step 2: Synthesize (model for content only)
            answer = self.synthesize_answer(user_query, results)

            return {"success": True, "answer": answer, "source": results}
        except Exception as e:
            # Recovery path
            return self.fallback_response(user_query, e)

    def fallback_response(self, query: str, error: Exception) -> dict:
        """Known failure mode with graceful degradation"""
        return {
            "success": False,
            "answer": "Unable to process request",
            "error": str(error),
            "fallback": True
        }

Now the execution path is fixed. The model only generates the answer text. It cannot decide which tools to call or in what order.
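The idea of one exact tool sequence per task type can also be sketched as a plain lookup table. This is a minimal illustration using only the standard library, not the workflow class above: the tool functions (`search_user`, `fetch_profile`), the task-type names, and `run_workflow` are all hypothetical stand-ins.

```python
from typing import Callable


# Hypothetical tools: plain functions standing in for real integrations.
def search_user(query: str) -> list[str]:
    return [f"user:{query}"]


def fetch_profile(hits: list[str]) -> dict:
    return {"profiles": hits}


# Each task type maps to a fixed, ordered tool sequence chosen by code,
# so the model never picks the tools or their order.
WORKFLOWS: dict[str, list[Callable]] = {
    "user_search": [search_user],
    "user_lookup": [search_user, fetch_profile],
}


def run_workflow(task_type: str, payload):
    """Run the registered sequence, feeding each tool's output to the next."""
    if task_type not in WORKFLOWS:
        raise KeyError(f"No workflow registered for task type: {task_type}")
    calls = []
    for tool in WORKFLOWS[task_type]:
        payload = tool(payload)
        calls.append(tool.__name__)
    return payload, calls


if __name__ == "__main__":
    print(run_workflow("user_lookup", "john"))
```

Because the sequence lives in a dictionary rather than in model output, the list of calls for a given task type is the same on every run, and an unknown task type fails loudly instead of improvising.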
Step 2: Add State Machine for Complex Workflows
For more complex workflows, I added a state machine to enforce execution order:
from enum import Enum
from typing import Optional

class AgentState(Enum):
    INIT = "init"
    PLANNING = "planning"
    EXECUTING = "executing"
    VALIDATING = "validating"
    COMPLETED = "completed"
    FAILED = "failed"

class StateMachineAgent:
    """Enforce execution order, no model control over flow"""

    TRANSITIONS = {
        AgentState.INIT: [AgentState.PLANNING],
        AgentState.PLANNING: [AgentState.EXECUTING, AgentState.FAILED],
        AgentState.EXECUTING: [AgentState.VALIDATING, AgentState.FAILED],
        # FAILED must be reachable from VALIDATING; otherwise run() loops
        # forever once the retry budget is exhausted.
        AgentState.VALIDATING: [AgentState.COMPLETED, AgentState.EXECUTING, AgentState.FAILED],
        AgentState.COMPLETED: [],
        AgentState.FAILED: []
    }

    def __init__(self, llm, tools):
        self.llm = llm
        self.tools = tools
        self.state = AgentState.INIT
        self.attempts = 0
        self.max_attempts = 3

    def transition(self, next_state: AgentState) -> bool:
        """Only allow valid transitions"""
        if next_state in self.TRANSITIONS[self.state]:
            self.state = next_state
            return True
        return False  # Invalid transition, enforce order

    def run(self, task: dict) -> dict:
        while self.state not in [AgentState.COMPLETED, AgentState.FAILED]:
            if self.state == AgentState.INIT:
                self.transition(AgentState.PLANNING)

            elif self.state == AgentState.PLANNING:
                plan = self.create_plan(task)
                if plan:
                    self.transition(AgentState.EXECUTING)
                else:
                    self.transition(AgentState.FAILED)

            elif self.state == AgentState.EXECUTING:
                result = self.execute_plan(plan)
                self.transition(AgentState.VALIDATING)

            elif self.state == AgentState.VALIDATING:
                if self.validate(result):
                    self.transition(AgentState.COMPLETED)
                elif self.attempts < self.max_attempts:
                    self.attempts += 1
                    self.transition(AgentState.EXECUTING)
                else:
                    self.transition(AgentState.FAILED)

        return self.build_result()

The state machine prevents the model from jumping between states arbitrarily. Each transition is validated. If validation fails, the agent retries up to max_attempts before failing.
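The retry behavior described here can be traced end to end with a stripped-down, runnable sketch. It is not the agent class above: planning and execution are omitted, validation outcomes are scripted through a hypothetical `validate_results` argument, and `TracedAgent` is an illustrative name. The sketch also assumes VALIDATING may move to FAILED, so that an exhausted retry budget can end the run.

```python
from enum import Enum


class State(Enum):
    INIT = "init"
    PLANNING = "planning"
    EXECUTING = "executing"
    VALIDATING = "validating"
    COMPLETED = "completed"
    FAILED = "failed"


TRANSITIONS = {
    State.INIT: {State.PLANNING},
    State.PLANNING: {State.EXECUTING, State.FAILED},
    State.EXECUTING: {State.VALIDATING, State.FAILED},
    State.VALIDATING: {State.COMPLETED, State.EXECUTING, State.FAILED},
    State.COMPLETED: set(),
    State.FAILED: set(),
}


class TracedAgent:
    """Toy agent that records every state it passes through."""

    def __init__(self, validate_results, max_attempts=3):
        # Scripted validation outcomes stand in for a real validation step.
        self._validate_results = iter(validate_results)
        self.state = State.INIT
        self.attempts = 0
        self.max_attempts = max_attempts
        self.trace = [State.INIT]

    def transition(self, next_state):
        # Raise instead of returning False so illegal moves cannot go unnoticed.
        if next_state not in TRANSITIONS[self.state]:
            raise RuntimeError(f"Illegal transition {self.state} -> {next_state}")
        self.state = next_state
        self.trace.append(next_state)

    def run(self):
        self.transition(State.PLANNING)
        self.transition(State.EXECUTING)
        while self.state not in (State.COMPLETED, State.FAILED):
            self.transition(State.VALIDATING)
            if next(self._validate_results, False):
                self.transition(State.COMPLETED)
            elif self.attempts < self.max_attempts:
                self.attempts += 1
                self.transition(State.EXECUTING)
            else:
                self.transition(State.FAILED)
        return [state.value for state in self.trace]


if __name__ == "__main__":
    print(TracedAgent([False, True]).run())
```

Raising on an illegal transition is a design choice of this sketch. Returning False, as in the code above, works too, but only if the caller checks the return value. The recorded trace is what makes a failure locatable: it shows which state the run was in when it stopped.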
Step 3: Validate Every Tool Call
I added validation schemas to prevent parameter hallucination:
from pydantic import BaseModel, Field, ValidationError, validator
from typing import List, Optional

REGISTERED_TOOLS = {'search', 'database', 'api'}
TOOL_SCHEMAS = {
    'search': SearchParams,
    # other tool schemas...
}

class ToolCallSchema(BaseModel):
    """Contract for every tool call - no hallucinated params"""

    tool_name: str = Field(..., pattern="^[a-z_]+$")
    parameters: dict
    reasoning: str = Field(..., min_length=10)

    @validator('tool_name')
    def tool_must_exist(cls, v, values, **kwargs):
        if v not in REGISTERED_TOOLS:
            raise ValueError(f"Unknown tool: {v}")
        return v

    @validator('parameters')
    def validate_params_against_tool_schema(cls, v, values):
        tool_name = values.get('tool_name')
        if tool_name:
            schema = TOOL_SCHEMAS.get(tool_name)
            if schema is None:
                # A registered tool without a schema is rejected rather than
                # raising an uncaught KeyError.
                raise ValueError(f"No schema registered for tool: {tool_name}")
            return schema(**v).dict()
        return v

def safe_execute(model_output: dict):
    try:
        validated = ToolCallSchema(**model_output)
        return execute_tool(validated.tool_name, validated.parameters)
    except ValidationError as e:
        log_failure(e)
        return fallback_response()

Now every tool call goes through validation. If the model hallucinates a parameter or calls an unregistered tool, the validation fails before execution.
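To see the gate reject the kind of call that showed up in the Day 7 log, here is a standard-library-only sketch of the same idea. It is an illustration under assumptions, not the Pydantic version above: `TOOL_SPECS`, `validate_tool_call`, and this `safe_execute` signature are hypothetical, and the type checks are deliberately simple.

```python
# Hypothetical registry: tool name -> {parameter name: (expected type, required?)}
TOOL_SPECS = {
    "search_user": {"query": (str, True), "max_results": (int, False)},
}


def validate_tool_call(call: dict) -> dict:
    """Return a cleaned call or raise ValueError; never executes anything."""
    name = call.get("tool_name")
    if not isinstance(name, str) or name not in TOOL_SPECS:
        raise ValueError(f"Unknown tool: {name}")
    params = call.get("parameters") or {}
    if not isinstance(params, dict):
        raise ValueError("Parameters must be a dict")
    spec = TOOL_SPECS[name]
    unexpected = sorted(set(params) - set(spec))
    if unexpected:
        raise ValueError(f"Unexpected parameters for {name}: {unexpected}")
    for key, (expected_type, required) in spec.items():
        if key not in params:
            if required:
                raise ValueError(f"Missing required parameter: {key}")
            continue
        if not isinstance(params[key], expected_type):
            raise ValueError(f"Parameter {key} must be {expected_type.__name__}")
    return {"tool_name": name, "parameters": dict(params)}


def safe_execute(call: dict, tools: dict) -> dict:
    """Validate first; only a call that passes every check reaches a tool."""
    try:
        checked = validate_tool_call(call)
    except ValueError as exc:
        return {"success": False, "error": str(exc)}
    result = tools[checked["tool_name"]](**checked["parameters"])
    return {"success": True, "result": result}


if __name__ == "__main__":
    tools = {"search_user": lambda query, max_results=10: [query][:max_results]}
    print(safe_execute({"tool_name": "search_user", "parameters": {"query": "john"}}, tools))
    print(safe_execute({"tool_name": "database_query", "parameters": {"sql": "SELECT 1"}}, tools))
```

The unregistered `database_query` call and the invented `type` parameter are both stopped before any tool code runs, which is the property the validation layer is meant to provide.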
What Changed in Production
After implementing the deterministic execution layer:
Day 1: Input "search for user john" -> search_user("john") -> correct
Day 3: Input "search for user john" -> search_user("john") -> correct
Day 7: Input "search for user john" -> search_user("john") -> correct

Same input. Same execution path. Consistent results.
When failures occur now, I can see exactly where in the state machine they happened. I can reproduce them. I can add recovery logic for that specific failure mode.
The Three Principles
After six months of production failures and fixes, I learned three principles:
- Models generate, code controls - Use LLMs for content, not control flow
- Validate everything - Every tool call, every parameter, every state transition
- Design for failure - Build recovery paths, not just happy paths
Most production issues were not “quality” problems. They were “behavior drift” problems. The model drifted to different execution paths without any code change. Constraining that drift fixed the reliability issues.
Common Mistakes I Made
I made several mistakes along the way:
Blaming the model: I spent weeks tuning prompts when the architecture was the problem.
Over-engineering prompts: I added complexity to handle edge cases that code should handle.
Missing validation: I trusted model output without schema validation. This let hallucinated parameters through to production.
No circuit breakers: I let agents continue after failures instead of failing fast with recovery paths.
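For the circuit-breaker point, a minimal sketch of the pattern looks like this. It is an assumption-laden illustration rather than what ran in production: the `CircuitBreaker` class, its thresholds, and the `fallback` argument are hypothetical, and the clock is injectable only so the behavior can be tested without sleeping.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated errors, then allow one trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, fallback=None, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback  # open: skip the call entirely
            # Cool-down elapsed: allow a single trial call (half-open).
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback
        self.failures = 0  # any success closes the circuit again
        return result


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2, reset_after=5.0)

    def flaky_tool():
        raise RuntimeError("tool unavailable")

    for _ in range(3):
        print(breaker.call(flaky_tool, fallback="Unable to process request"))
```

After the threshold is reached, later calls return the fallback without touching the failing tool, so a misbehaving step cannot keep the agent looping. A single failure during the trial call reopens the circuit immediately.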
Summary
In this post, I explained why LLM agents fail unpredictably in production and how to fix it. The key insight is that models should generate content within a framework, not control execution flow. By implementing deterministic execution layers with contracts, validation, and recovery paths, you can build reliable agent systems that behave predictably in production.
The shift from “fix the prompt” to “fix the execution layer” was the pattern that finally made my agents reliable.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Reddit discussion on LLM agent production failures
- 👨💻 LangChain Documentation
- 👨💻 Pydantic Validation Guide
- 👨💻 State Machine Pattern in Python