
How to build reliable AI agents in production: 6 lessons from the trenches

Purpose

When I deployed my first AI agent for the operations team, I thought we were done. Six months later, I had learned that the real work begins after deployment. AI agents promise intelligence but often deliver chaos in production: hallucinations, vendor communication failures, and unexpected output patterns can cripple your system.

Production AI Failure Patterns

The harsh reality hits you when your AI agent starts making things up. In our production environment, we saw:

  • Vendor communication failures: the agent claimed “I’ve integrated with Slack API” when no integration existed
  • Technical specification generation: invented API endpoints with realistic-looking responses
  • Time-sensitive operational decisions: incorrect timestamps that affected critical workflows
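The timestamp failure in particular is cheap to catch mechanically. The following is a minimal sketch, not the check we ran in production: the function name `timestamp_is_plausible` and the five-minute tolerance are assumptions chosen for illustration. It compares an agent-supplied ISO 8601 timestamp against the real clock.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def timestamp_is_plausible(
    claimed_iso: str,
    now: Optional[datetime] = None,
    tolerance: timedelta = timedelta(minutes=5),  # assumed tolerance
) -> bool:
    """Return True if an agent-supplied timestamp is close to the real clock."""
    if now is None:
        now = datetime.now(timezone.utc)
    try:
        claimed = datetime.fromisoformat(claimed_iso)
    except ValueError:
        return False  # not a parseable ISO 8601 timestamp
    if claimed.tzinfo is None:
        # Assumption: naive timestamps are interpreted as UTC
        claimed = claimed.replace(tzinfo=timezone.utc)
    return abs(now - claimed) <= tolerance
```

A workflow could refuse to act on any agent output whose timestamp fails this check and fall back to the system clock instead.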

On Reddit, teams who’ve been there share similar stories:

  • “4 weeks supervised deployment” - gradual rollout prevents disaster
  • “Human-in-the-loop essential” - the safety net that saves you
  • “Output validation layers as survival tools” - your last line of defense

The Reliability Architecture: Multi-Layered Defense

Layer 1: Input Validation & Sanitization

I implemented pre-processing filters for user inputs before they reach the agent:

input_validation.py
from langchain_core.runnables import RunnableLambda


def validate_input(input_text: str) -> dict:
    """Validate and sanitize user input before agent processing"""
    lowered = input_text.lower()
    # Check for prompt injection (plain keyword match, so expect some false positives)
    injection_keywords = ["ignore previous", "start over", "forget"]
    if any(keyword in lowered for keyword in injection_keywords):
        return {"valid": False, "error": "Potential prompt injection detected"}
    # Check for sensitive data leakage
    if "password" in lowered or "api_key" in lowered:
        return {"valid": False, "error": "Sensitive content detected"}
    return {"valid": True, "cleaned": input_text}


# Create input validation middleware
input_validator = RunnableLambda(validate_input)
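One weakness of plain substring matching is that a keyword such as "forget" also rejects harmless requests like "don't forget to attach the report". A possible refinement, sketched here under the assumption that phrase-level patterns are acceptable for your traffic, is to match whole injection phrases instead. The patterns and the name `looks_like_injection` are illustrative, not an exhaustive or authoritative list.

```python
import re

# Illustrative patterns only (assumptions); tune these against real traffic
INJECTION_PATTERNS = [
    re.compile(r"\bignore (all |any )?(previous|prior) instructions\b", re.IGNORECASE),
    re.compile(r"\bforget (everything|all) (above|before)\b", re.IGNORECASE),
]


def looks_like_injection(text: str) -> bool:
    """Return True if the text contains a known injection phrase."""
    return any(pattern.search(text) for pattern in INJECTION_PATTERNS)
```

Pattern matching of either kind is only a first filter; it does not replace the later layers.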

Layer 2: Agent-Level Guardrails

For production reliability, I needed more than basic validation. I implemented model-based safety checks using secondary LLMs to validate outputs:

agent_guardrails.py
import json
from typing import Any, Dict

from langchain_core.messages import AIMessage
from langchain_openai import ChatOpenAI


class ContentSafetyGuardrail:
    """Validate agent outputs for safety and accuracy"""

    def __init__(self):
        self.safety_model = ChatOpenAI(model="gpt-4.1-mini", temperature=0.1)

    def check_output(self, message: AIMessage) -> Dict[str, Any]:
        """Validate output using secondary model"""
        prompt = f"""
        Analyze this AI agent output for:
        1. Factual accuracy
        2. Safety concerns
        3. Hallucination risk
        4. Compliance with operational requirements
        Output: {message.content}
        Respond with JSON:
        {{
        "safe": true/false,
        "confidence": 0.0-1.0,
        "issues": ["list of concerns"],
        "suggestion": "revised output or 'APPROVED'"
        }}
        """
        response = self.safety_model.invoke(prompt)
        try:
            return json.loads(response.content)
        except (json.JSONDecodeError, TypeError):
            # Fail closed: an unparseable verdict is treated as a rejection
            return {"safe": False, "confidence": 0.0, "issues": ["Parsing failed"], "suggestion": "REJECT"}
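The guardrail above rejects a verdict only when JSON parsing fails. In practice a model may also wrap its JSON in a code fence, omit fields, or report low confidence. The sketch below shows one way to fail closed in those cases; the helper names `parse_verdict` and `_rejected` and the 0.8 confidence threshold are assumptions for illustration, not part of the original system.

```python
import json
from typing import Any, Dict

FENCE = "`" * 3  # markdown code fence that models sometimes wrap JSON in


def _rejected(reason: str) -> Dict[str, Any]:
    """Build a fresh rejection verdict."""
    return {"safe": False, "confidence": 0.0, "issues": [reason], "suggestion": "REJECT"}


def parse_verdict(raw: str, min_confidence: float = 0.8) -> Dict[str, Any]:
    """Parse the safety model's reply, failing closed on anything unexpected."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the surrounding fence and an optional "json" language tag
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        verdict = json.loads(text)
    except json.JSONDecodeError:
        return _rejected("Parsing failed")
    if not isinstance(verdict, dict) or not isinstance(verdict.get("safe"), bool):
        return _rejected("Malformed verdict")
    confidence = verdict.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or confidence < min_confidence:
        # A low or missing confidence overrides an optimistic "safe" flag
        verdict["safe"] = False
    return verdict
```

Treating every ambiguous verdict as unsafe costs some extra human review, which is usually the cheaper failure mode.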

The real breakthrough came when I added a middleware architecture:

middleware_architecture.py
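# Layered middleware implementation
from langchain.agents import create_agent
from langchain.agents.middleware import HumanInTheLoopMiddleware, PIIMiddleware

# NOTE: create_agent expects AgentMiddleware instances. validate_input and
# ContentSafetyGuardrail appear here to show the intended order of the layers;
# each needs to be wrapped in an AgentMiddleware subclass before this will run.
# search_tool and calculator_tool are assumed to be defined elsewhere.
agent = create_agent(
    model="gpt-4.1",
    tools=[search_tool, calculator_tool],
    middleware=[
        # Input validation layer
        RunnableLambda(validate_input),
        # PII protection layer
        PIIMiddleware("email", strategy="redact", apply_to_input=True),
        PIIMiddleware("phone", strategy="redact", apply_to_input=True),
        # Human approval for sensitive operations
        HumanInTheLoopMiddleware(
            interrupt_on={
                "send_email": True,
                "delete_resource": True,
                "vendor_api_call": True,
            }
        ),
        # Output safety validation
        ContentSafetyGuardrail(),
    ],
)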
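To make the human-approval layer concrete, here is a framework-free sketch of the idea behind it: sensitive tool calls are held until a reviewer approves them. This is not how the library implements its middleware; `run_tool_with_approval`, `SENSITIVE_TOOLS`, and the `approve` callback are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Dict

# Tool names taken from the configuration above
SENSITIVE_TOOLS = {"send_email", "delete_resource", "vendor_api_call"}


def run_tool_with_approval(
    tool_name: str,
    tool_func: Callable[..., Any],
    params: Dict[str, Any],
    approve: Callable[[str, Dict[str, Any]], bool],
) -> Any:
    """Run a tool, asking a human reviewer first if the tool is sensitive."""
    if tool_name in SENSITIVE_TOOLS and not approve(tool_name, params):
        # The tool is never invoked when approval is refused
        raise PermissionError(f"{tool_name} was not approved by a human reviewer")
    return tool_func(**params)
```

In a real deployment the `approve` callback would post the request to a review queue and block, or suspend the run, until someone responds.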

Layer 3: Tool Execution Monitoring

I learned the hard way that unchecked tool execution can break your system:

tool_monitoring.py
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError


class ToolExecutionMonitor:
    """Monitor tool execution in real time"""

    def __init__(self, timeout_seconds=30):
        self.timeout = timeout_seconds

    def validate_parameters(self, tool_name: str, params: dict) -> bool:
        """Validate tool parameters before execution"""
        # Tool-specific validation
        if tool_name == "search_tool":
            if "query" not in params or len(params["query"]) < 3:
                return False
        elif tool_name == "calculator_tool":
            if "expression" in params and "delete" in params["expression"].lower():
                return False
        return True

    def monitor_execution(self, tool_name: str, tool_func, params: dict):
        """Run a tool with an enforced timeout and log failures"""
        start_time = time.time()
        # Run in a worker thread so the timeout is actually enforced
        executor = ThreadPoolExecutor(max_workers=1)
        try:
            result = executor.submit(tool_func, **params).result(timeout=self.timeout)
            # Log execution time
            execution_time = time.time() - start_time
            # Check for suspiciously fast results (potential hallucination)
            if execution_time < 0.1 and tool_name == "search_tool":
                print(f"WARNING: Suspiciously fast execution for {tool_name}")
            return result
        except FutureTimeoutError:
            print(f"ERROR: Tool {tool_name} timed out after {self.timeout}s")
            raise
        except Exception as e:
            print(f"ERROR: Tool {tool_name} failed: {e}")
            raise
        finally:
            # Do not block on a worker that is still running after a timeout
            executor.shutdown(wait=False)
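The monitor above does not limit how often a tool may be called. If you need rate limiting as well, a sliding-window limiter is one common approach. The following is a minimal sketch under that assumption; the class name `SlidingWindowRateLimiter` and its parameters are hypothetical and were not part of the original monitor.

```python
import time
from collections import deque
from typing import Deque, Dict, Optional


class SlidingWindowRateLimiter:
    """Allow at most max_calls per tool within any window_seconds span."""

    def __init__(self, max_calls: int, window_seconds: float):
        self.max_calls = max_calls
        self.window = window_seconds
        self._calls: Dict[str, Deque[float]] = {}

    def allow(self, tool_name: str, now: Optional[float] = None) -> bool:
        """Record a call and return True if it is within the limit."""
        if now is None:
            now = time.monotonic()
        calls = self._calls.setdefault(tool_name, deque())
        # Discard calls that have fallen out of the window
        while calls and now - calls[0] >= self.window:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True
```

Calling `allow(tool_name)` before `monitor_execution` and skipping the tool when it returns False keeps a looping agent from hammering a vendor API.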

Layer 4: Output Validation & Verification

The most critical layer I implemented was structured output validation:

output_validation.py
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class OperationalResponse(BaseModel):
    """Structured output model for operational responses"""
    action_required: bool
    confidence: float
    response_text: str
    vendor_impact: Optional[str] = None
    technical_specs: Optional[List[str]] = None
    warnings: List[str] = Field(default_factory=list)


def validate_and_enhance_output(agent_output: str) -> OperationalResponse:
    """Validate agent output and enhance with reliability metrics"""
    try:
        # Parse structured output (Pydantic v2 API)
        response = OperationalResponse.model_validate_json(agent_output)
    except ValidationError:
        # Fallback for unstructured output
        return OperationalResponse(
            action_required=False,
            confidence=0.3,
            response_text=agent_output,
            warnings=["Output validation failed - using raw response"],
        )
    # Additional validation
    if response.confidence < 0.7:
        response.warnings.append("Low confidence output detected")
        response.response_text += " [CONFIDENCE WARNING]"
    # Check for potential hallucinations (vendor_impact may be None)
    if response.vendor_impact and "integrated" in response.vendor_impact.lower():
        # verify_vendor_integration() is a project-specific helper not shown in this post
        if not verify_vendor_integration():
            response.warnings.append("Vendor integration claim not verified")
            response.response_text += " [INTEGRATION NOT VERIFIED]"
    return response
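The listing relies on `verify_vendor_integration`, which is not defined in this post. One plausible shape for such a check, offered purely as an assumption, is to compare the vendors mentioned in the text against a registry of integrations that actually exist. The function name `unverified_integration_claims` and both vendor lists are hypothetical.

```python
from typing import Iterable, List


def unverified_integration_claims(
    text: str,
    known_vendors: Iterable[str],
    active_integrations: Iterable[str],
) -> List[str]:
    """Return vendors the text claims as integrated that are not in the registry."""
    lowered = text.lower()
    if "integrat" not in lowered:
        return []  # no integration claim at all
    active = {vendor.lower() for vendor in active_integrations}
    candidates = {vendor.lower() for vendor in known_vendors}
    # A vendor is suspicious if it is mentioned but has no active integration
    return sorted(v for v in candidates if v in lowered and v not in active)
```

This is a coarse heuristic: it flags any mention of an unintegrated vendor in a text that talks about integration, so a non-empty result should route the response to a human rather than reject it outright.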

LangSmith Implementation for Production Monitoring

I set up LangSmith to monitor our agent’s performance in real time:

langsmith_monitoring.py
from langsmith import Client
from langchain.smith import RunEvalConfig, run_on_dataset

# Assumes a LangChain release that still ships the legacy langchain.smith module,
# an `agent` built as shown earlier, and LangSmith credentials in the environment.
DATASET_NAME = "production-like-inputs"  # illustrative dataset name


def setup_production_monitoring():
    """Configure LangSmith for production reliability evaluation"""
    evaluation_config = RunEvalConfig(
        evaluators=[
            # Evaluate response quality
            RunEvalConfig.Criteria("conciseness"),
            RunEvalConfig.Criteria("coherence"),
            # Custom reliability criterion, phrased so that a higher score is better
            RunEvalConfig.Criteria(
                {"factual": "Is the response free of factual errors?"}
            ),
        ],
    )
    # Test with production-like inputs
    fake_production_inputs = [
        "Check system status and report any anomalies",
        "Generate monthly operational report for Q4",
        "Analyze recent performance trends and recommend actions",
    ]
    client = Client()
    dataset = client.create_dataset(DATASET_NAME)
    client.create_examples(
        inputs=[
            {"messages": [{"role": "user", "content": content}]}
            for content in fake_production_inputs
        ],
        dataset_id=dataset.id,
    )
    # Run evaluation
    results = run_on_dataset(
        client=client,
        dataset_name=DATASET_NAME,
        llm_or_chain_factory=lambda: agent,
        evaluation=evaluation_config,
    )
    # Analyze reliability patterns
    reliability_score = calculate_reliability_score(results)
    print(f"Current reliability score: {reliability_score:.2f}")
    return reliability_score


def calculate_reliability_score(results) -> float:
    """Calculate overall reliability score from evaluation results"""
    total_score = 0.0
    valid_evaluations = 0
    # Per-example feedback is assumed to sit under results["results"][...]["feedback"]
    for row in results.get("results", {}).values():
        for eval_result in row.get("feedback", []):
            score = getattr(eval_result, "score", None)
            if score is not None:
                total_score += score
                valid_evaluations += 1
    return total_score / max(valid_evaluations, 1)
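A single reliability score is a snapshot. To watch the trend over time, one option is a rolling average with an alert threshold. The sketch below is framework-free and makes no LangSmith calls; the class name `RollingReliability`, the window size, and the 0.8 threshold are assumptions for illustration.

```python
from collections import deque


class RollingReliability:
    """Track a rolling average of reliability scores and flag drops."""

    def __init__(self, window_size: int = 50, alert_below: float = 0.8):
        self.scores = deque(maxlen=window_size)  # oldest scores fall out automatically
        self.alert_below = alert_below

    def average(self) -> float:
        """Return the mean of the scores in the window (0.0 when empty)."""
        if not self.scores:
            return 0.0
        return sum(self.scores) / len(self.scores)

    def record(self, score: float) -> bool:
        """Add a score and return True if the rolling average is below the threshold."""
        self.scores.append(score)
        return self.average() < self.alert_below
```

Feeding each evaluated interaction into `record` and paging someone when it returns True turns the score into an operational signal.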

Deployment Strategies That Actually Work

I learned that deployment strategy makes or breaks reliability:

Phase 1: Sandbox (2-3 weeks)

# Development configuration - extensive logging
agent = create_agent(
    model="gpt-4.1",
    tools=[search_tool, calculator_tool],
    middleware=[
        ContentSafetyGuardrail(),
        # Enable all debugging in development
        # (LoggingMiddleware is assumed to be a custom middleware; it is not defined in this post)
        LoggingMiddleware(level="DEBUG"),
    ],
    debug=True,  # Show all internal workings
)

Phase 2: Supervised (4 weeks)

# Staging configuration - human oversight
agent = create_agent(
    model="gpt-4.1",
    tools=[search_tool, calculator_tool],
    middleware=[
        ContentSafetyGuardrail(),
        HumanInTheLoopMiddleware(
            interrupt_on={"vendor_api_call": True, "delete_resource": True}
        ),
    ],
    debug=False,
)

Phase 3: Production (2-4 weeks limited scope)

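# Production configuration - minimal logging, maximum safety
agent = create_agent(
    model="gpt-4.1",
    tools=[search_tool, calculator_tool],
    middleware=[
        # Multi-layered production safety
        # (ToolExecutionMonitor and OutputValidationMiddleware stand for the layers
        # described above, wrapped as middleware; the wrappers are not shown in this post)
        ContentSafetyGuardrail(),
        ToolExecutionMonitor(timeout_seconds=15),
        OutputValidationMiddleware(),
    ],
    debug=False,
)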
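"Limited scope" in Phase 3 means only a fraction of traffic reaches the agent. One common way to do that, sketched here as an assumption and not as our exact mechanism, is deterministic hash bucketing, so that the same user always gets the same experience and the rollout percentage can be raised gradually. The function name `in_rollout` is hypothetical.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the first `percent` of 100 buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < percent
```

Because the bucket depends only on the user ID, raising `percent` from 10 to 50 keeps every user who was already in the rollout and adds new ones.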

Real Results: From Chaos to Control

Before implementing reliability measures:

  • Error rate: 23% of interactions
  • Hallucination incidents: 15 per week
  • Manual intervention: 40% of requests
  • User satisfaction: 6.2/10

After implementing the multi-layered defense:

  • Error rate: 3.1% of interactions
  • Hallucination incidents: 2 per week
  • Manual intervention: 8% of requests
  • User satisfaction: 8.7/10
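To put the before and after figures side by side, the relative reductions can be computed directly from the numbers listed above. The helper name `relative_reduction` is just for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


# (before, after) pairs taken from the two lists above
metrics = {
    "Error rate (%)": (23.0, 3.1),
    "Hallucination incidents per week": (15, 2),
    "Manual intervention (%)": (40, 8),
}

for name, (before, after) in metrics.items():
    print(f"{name}: {relative_reduction(before, after):.1f}% reduction")
# Error rate (%): 86.5% reduction
# Hallucination incidents per week: 86.7% reduction
# Manual intervention (%): 80.0% reduction
```

User satisfaction moved in the other direction, from 6.2 to 8.7, a gain of 2.5 points on the 10-point scale.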

The key success factors were:

  1. Multi-layered validation approach
  2. Comprehensive monitoring system
  3. Gradual deployment strategy
  4. Continuous improvement processes

Summary

In this post, I showed how to build reliable AI agents in production environments. The key point is that reliability isn’t accidental - it’s engineered through systematic validation and monitoring.

I implemented a four-layer defense system: input validation, agent guardrails, tool monitoring, and output verification. Each layer catches different types of failures, creating redundancy in your safety system.

When I deployed without the gradual rollout strategy, I ran into production issues that could have been prevented. The phased approach (sandbox → supervised → limited production → full production) gave us time to learn and adapt.

The most important lesson was that AI agents need both automated validation and human oversight. I’ve seen too many teams rely purely on technical solutions without the human element that catches the unexpected failures.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can contact me by email.

