How to monitor AI agents: Essential metrics & tools for long-term success
Purpose
This post demonstrates how to evaluate and monitor AI agent performance over time. After I deployed our AI agent for the operations team, I discovered that, without proper monitoring, its performance had silently degraded by 40% over 6 months.
The Problem
AI agents aren’t “set it and forget it” systems. When I first deployed our agent, I assumed it would maintain performance indefinitely. But performance drift is inevitable, and without monitoring it goes unnoticed.
Here’s what I observed:
- Response accuracy dropped from 95% to 57%
- Tool call failures increased 300%
- User satisfaction fell by 48%
The core error is treating AI agents like traditional software. They need continuous evaluation because:
- Data drift: Real-world inputs change over time
- Model drift: Foundation models evolve and change behavior
- Context drift: User expectations shift
- Environment drift: Tools and APIs change
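To make the headline number concrete: the 40% degradation quoted in the introduction matches the accuracy drop listed above when it is expressed as a relative change. Here is a minimal sketch of that arithmetic. Note that `relative_drop` is a hypothetical helper name used for illustration, not part of LangChain or LangSmith.

```python
def relative_drop(baseline: float, current: float) -> float:
    """Return the fractional drop from baseline to current (0.40 means 40% worse)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline


# Accuracy figures from the list above: 95% at launch, 57% six months later.
drop = relative_drop(0.95, 0.57)
print(f"Relative accuracy drop: {drop:.0%}")  # prints "Relative accuracy drop: 40%"
```

Tracking the relative change rather than the raw difference makes drops comparable across metrics that live on different scales.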
What I built
I implemented a comprehensive monitoring framework using LangChain and LangSmith. Here’s my setup:
from agentevals.trajectory.match import create_trajectory_match_evaluator
from agentevals.trajectory.llm import create_trajectory_llm_as_judge
from langsmith import Client
import logging
from dataclasses import dataclass
from typing import List, Dict, Any
import json

@dataclass
class QualityThresholds:
    trajectory_accuracy: float = 0.85
    response_quality: float = 0.80
    latency_ms: int = 5000
    error_rate: float = 0.05

class AgentMonitor:
    def __init__(self, thresholds: QualityThresholds):
        self.thresholds = thresholds
        self.client = Client()
        self.setup_evaluators()

    def setup_evaluators(self):
        """Initialize LangChain evaluation tools"""
        # Trajectory accuracy - validates tool call sequence
        self.trajectory_evaluator = create_trajectory_match_evaluator(
            trajectory_match_mode="superset"
        )

        # Response quality - LLM-as-judge evaluation
        self.quality_evaluator = create_trajectory_llm_as_judge(
            prompt="""Evaluate if the agent response correctly addresses the user's request.
            Consider: completeness, accuracy, and helpfulness.
            Return JSON with score 0-1 and reasoning.""",
            model="openai:o3-mini"
        )

How It Works
When I run this monitoring system:
# Initialize monitor with production thresholds
monitor = AgentMonitor(QualityThresholds())

def evaluate_agent_performance():
    """Run comprehensive evaluation against production dataset"""
    try:
        # This is a module-level function, so it uses the monitor's client
        # (there is no `self` here). `run_agent` is the agent entry point,
        # defined elsewhere, and is passed as the positional target.
        results = monitor.client.evaluate(
            run_agent,
            data="production_dataset",
            evaluators=[
                monitor.trajectory_evaluator,
                monitor.quality_evaluator
            ],
            max_concurrency=4
        )

        # Check against quality thresholds.
        # NOTE: the aggregate attributes below are kept from the original
        # listing; adapt them to however your LangSmith version exposes
        # aggregated scores on the returned results object.
        performance_metrics = {
            "trajectory_accuracy": results.trajectory_score,
            "response_quality": results.quality_score,
            "latency_ms": results.average_latency,
            "error_rate": results.error_count / results.total_count
        }

        return performance_metrics, evaluate_thresholds(performance_metrics)

    except Exception as e:
        logging.error(f"Evaluation failed: {e}")
        return None, False

I get this output:

{
    "trajectory_accuracy": 0.57,
    "response_quality": 0.62,
    "latency_ms": 6800,
    "error_rate": 0.15
}

The system immediately alerts me when performance drops below thresholds. I can see that the trajectory accuracy (0.57) is below my minimum threshold (0.85).
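Trajectory accuracy is not the only problem in that output: compared against the default `QualityThresholds` values from the setup, all four metrics are out of bounds. The sketch below checks this using only the standard library. `find_breaches` is a hypothetical helper written for this illustration; the threshold values and the metrics dictionary are copied from the listings above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QualityThresholds:
    trajectory_accuracy: float = 0.85
    response_quality: float = 0.80
    latency_ms: int = 5000
    error_rate: float = 0.05


def find_breaches(metrics: Dict[str, float], thresholds: QualityThresholds) -> List[str]:
    """Return the names of all metrics that violate their threshold."""
    breaches = []
    # Higher is better: breach when the metric falls below the threshold.
    for name in ("trajectory_accuracy", "response_quality"):
        if metrics[name] < getattr(thresholds, name):
            breaches.append(name)
    # Lower is better: breach when the metric rises above the threshold.
    for name in ("latency_ms", "error_rate"):
        if metrics[name] > getattr(thresholds, name):
            breaches.append(name)
    return breaches


metrics = {
    "trajectory_accuracy": 0.57,
    "response_quality": 0.62,
    "latency_ms": 6800,
    "error_rate": 0.15,
}
print(find_breaches(metrics, QualityThresholds()))
# prints ['trajectory_accuracy', 'response_quality', 'latency_ms', 'error_rate']
```

Returning the list of breached metric names, rather than a single boolean, makes it easier to route each alert to whoever owns that metric.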
The Solution: Comprehensive Monitoring
I implemented four key monitoring strategies:
1. Trajectory Accuracy Evaluation
from agentevals.trajectory.match import create_trajectory_match_evaluator
from langchain_core.messages import HumanMessage
from langsmith import testing as t

def test_trajectory_accuracy():
    """Validate tool call sequence matches expected pattern"""
    # Test agent behavior (`agent` and `monitor` are defined in the setup above)
    result = agent.invoke({
        "messages": [HumanMessage(content="What's the weather in SF?")]
    })

    # Define expected tool call sequence
    reference_trajectory = [
        {"tool": "search_web", "input": "weather San Francisco"},
        {"tool": "parse_weather", "input": "weather data"},
        {"tool": "format_response", "input": "parsed weather"}
    ]

    # Log evaluation data
    t.log_inputs({})
    t.log_outputs({"messages": result["messages"]})
    t.log_reference_outputs({"messages": reference_trajectory})

    # Run evaluation
    score = monitor.trajectory_evaluator(
        outputs=result["messages"],
        reference_outputs=reference_trajectory
    )

    return score

2. BigQuery Integration for Event Logging
CREATE TABLE `your-gcp-project-id.adk_agent_logs.agent_events_v2`
(
    timestamp TIMESTAMP NOT NULL,
    event_type STRING,
    agent STRING,
    session_id STRING,
    user_id STRING,
    trace_id STRING,
    content JSON,
    latency_ms JSON,
    status STRING,
    error_message STRING
)
PARTITION BY DATE(timestamp)
CLUSTER BY event_type, agent, user_id;

3. Periodic Audits with LangSmith
def run_weekly_audit():
    """Comprehensive performance review"""
    audit_results = {
        "trajectory_accuracy": [],
        "response_quality": [],
        "latency_trends": [],
        "error_analysis": []
    }

    # Test against multiple datasets
    datasets = ["production_dataset", "edge_cases", "new_scenarios"]

    for dataset in datasets:
        # Reuse the monitor's LangSmith client; `run_agent` is passed
        # as the positional target.
        results = monitor.client.evaluate(
            run_agent,
            data=dataset,
            evaluators=[monitor.trajectory_evaluator, monitor.quality_evaluator],
            experiment_prefix=f"weekly-audit-{dataset}"
        )

        audit_results["trajectory_accuracy"].append(results.trajectory_score)
        audit_results["response_quality"].append(results.quality_score)

    # Generate performance report (`generate_performance_report` is a
    # reporting helper defined elsewhere)
    return generate_performance_report(audit_results)

4. Quality Thresholds and Alerts
def evaluate_thresholds(metrics: Dict[str, Any]) -> bool:
    """Check if performance meets quality standards"""
    alerts = []

    if metrics["trajectory_accuracy"] < monitor.thresholds.trajectory_accuracy:
        alerts.append(f"Trajectory accuracy too low: {metrics['trajectory_accuracy']}")

    if metrics["response_quality"] < monitor.thresholds.response_quality:
        alerts.append(f"Response quality degraded: {metrics['response_quality']}")

    if metrics["latency_ms"] > monitor.thresholds.latency_ms:
        alerts.append(f"Latency increased: {metrics['latency_ms']}ms")

    if metrics["error_rate"] > monitor.thresholds.error_rate:
        alerts.append(f"Error rate too high: {metrics['error_rate']}")

    if alerts:
        send_alerts(alerts)
        return False

    return True

Production Implementation
I deployed this monitoring stack in production:
- LangSmith: Centralized evaluation platform
- OpenTelemetry: Distributed tracing
- BigQuery: Data warehousing and analysis
- Custom Dashboards: Real-time performance visualization
- Alerting Systems: Automated quality threshold notifications
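To connect the agent to the BigQuery piece of this stack, each agent event has to become a row shaped like the `agent_events_v2` table from strategy 2. The sketch below builds such a row with the standard library only. Treat the details as assumptions: `build_event_row` is a hypothetical helper, the `total_ms` key inside `latency_ms` and the `OK`/`ERROR` status values are illustrative choices, and the example IDs are made up.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_event_row(
    event_type: str,
    agent: str,
    session_id: str,
    user_id: str,
    trace_id: str,
    content: Dict[str, Any],
    latency_ms: float,
    error_message: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one row whose keys match the columns of the agent_events_v2 table."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event_type": event_type,
        "agent": agent,
        "session_id": session_id,
        "user_id": user_id,
        "trace_id": trace_id,
        # The two JSON columns are serialized to strings (assumed convention).
        "content": json.dumps(content),
        "latency_ms": json.dumps({"total_ms": latency_ms}),  # key name is an assumption
        "status": "ERROR" if error_message else "OK",  # status values are assumptions
        "error_message": error_message,
    }


# Hypothetical example event; all identifiers are placeholders.
row = build_event_row(
    event_type="TOOL_CALL",
    agent="ops_agent",
    session_id="session-123",
    user_id="user-456",
    trace_id="trace-789",
    content={"tool": "search_web", "input": "weather San Francisco"},
    latency_ms=412,
)
print(json.dumps(row, indent=2))
```

A row like this can then be handed to whichever BigQuery client or streaming pipeline you already use. Keeping the row builder separate from the upload code makes it easy to unit test the schema mapping without touching GCP.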
The most important line is the quality threshold check. Without thresholds, you can’t detect performance drift.
Real-World Results
After implementing this monitoring system:
- Immediate Detection: Performance alerts within 2 hours of degradation
- Proactive Fixes: Average recovery time reduced from 7 days to 4 hours
- Continuous Improvement: 15% performance gain over 3 months
- User Satisfaction: Increased from 57% to 89%
The Reason
I think this works because it treats AI agents as learning systems, not static software. The monitoring framework has four parts:
- Baseline Establishment: Creates initial performance benchmarks
- Automated Alerts: Detects deviations in real-time
- Periodic Audits: Comprehensive reviews of edge cases
- Feedback Loops: Continuous improvement based on data
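The first two items, baseline establishment and automated deviation alerts, can be sketched in a few lines. This is one possible approach under stated assumptions, not the exact mechanism used in the production system: it assumes a baseline built from a handful of early evaluation scores and flags any later score that falls more than `k` standard deviations below the baseline mean. The baseline scores shown are illustrative, and `establish_baseline` and `is_deviation` are hypothetical helper names.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def establish_baseline(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of the initial evaluation scores."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to establish a baseline")
    return mean(scores), stdev(scores)


def is_deviation(score: float, baseline_mean: float, baseline_std: float, k: float = 3.0) -> bool:
    """Flag a score that falls more than k standard deviations below the baseline mean."""
    return score < baseline_mean - k * baseline_std


# Illustrative accuracy scores from the first few evaluation runs (made-up values).
baseline_scores = [0.95, 0.94, 0.96, 0.95, 0.93]
mu, sigma = establish_baseline(baseline_scores)

print(is_deviation(0.94, mu, sigma))  # prints False: within normal variation
print(is_deviation(0.57, mu, sigma))  # prints True: far below the baseline
```

A statistical baseline like this complements the fixed thresholds: fixed thresholds encode the minimum quality you will accept, while the baseline check catches a slide that is unusual for this particular agent even before it crosses the hard limit.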
In this post, I demonstrated how to monitor AI agents effectively. The key point is implementing comprehensive evaluation frameworks with quality thresholds that catch performance drift before it impacts users.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to reach me, you can email me.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 LangChain Documentation
- 👨💻 LangSmith Evaluation Platform
- 👨💻 OpenTelemetry
- 👨💻 Reddit: AI Agent Operations
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!