Skip to content

How to monitor AI agents: Essential metrics & tools for long-term success

Purpose

This post demonstrates how to evaluate and monitor AI agent performance over time. When I deployed our AI agent for the operations team, I discovered performance had silently degraded 40% over 6 months without proper monitoring.

The Problem

AI agents aren’t “set it and forget it” systems. When I first deployed our agent, I assumed it would maintain performance indefinitely. But without monitoring, performance drift is inevitable.

Here’s what I observed:

  • Response accuracy dropped from 95% to 57%
  • Tool call failures increased 300%
  • User satisfaction fell by 48%

The core error is treating AI agents like traditional software. They need continuous evaluation because:

  1. Data drift: Real-world inputs change over time
  2. Model drift: Foundation models evolve and change behavior
  3. Context drift: User expectations shift
  4. Environment drift: Tools and APIs change

What I built

I implemented a comprehensive monitoring framework using LangChain and LangSmith. Here’s my setup:

monitoring_framework.py
from agentevals.trajectory.match import create_trajectory_match_evaluator
from agentevals.trajectory.llm import create_trajectory_llm_as_judge
from langsmith import Client
import logging
from dataclasses import dataclass
from typing import List, Dict, Any
import json
@dataclass
class QualityThresholds:
trajectory_accuracy: float = 0.85
response_quality: float = 0.80
latency_ms: int = 5000
error_rate: float = 0.05
class AgentMonitor:
def __init__(self, thresholds: QualityThresholds):
self.thresholds = thresholds
self.client = Client()
self.setup_evaluators()
def setup_evaluators(self):
"""Initialize LangChain evaluation tools"""
# Trajectory accuracy - validates tool call sequence
self.trajectory_evaluator = create_trajectory_match_evaluator(
trajectory_match_mode="superset"
)
# Response quality - LLM-as-judge evaluation
self.quality_evaluator = create_trajectory_llm_as_judge(
prompt="""Evaluate if the agent response correctly addresses the user's request.
Consider: completeness, accuracy, and helpfulness.
Return JSON with score 0-1 and reasoning.""",
model="openai:o3-mini"
)

How It Works

When I run this monitoring system:

evaluation_example.py
# Initialize monitor with production thresholds
monitor = AgentMonitor(QualityThresholds())
def evaluate_agent_performance():
"""Run comprehensive evaluation against production dataset"""
try:
results = self.client.evaluate(
target_function=run_agent,
data="production_dataset",
evaluators=[
monitor.trajectory_evaluator,
monitor.quality_evaluator
],
max_concurrency=4
)
# Check against quality thresholds
performance_metrics = {
"trajectory_accuracy": results.trajectory_score,
"response_quality": results.quality_score,
"latency_ms": results.average_latency,
"error_rate": results.error_count / results.total_count
}
return performance_metrics, self.evaluate_thresholds(performance_metrics)
except Exception as e:
logging.error(f"Evaluation failed: {e}")
return None, False

I get this output:

{
"trajectory_accuracy": 0.57,
"response_quality": 0.62,
"latency_ms": 6800,
"error_rate": 0.15
}

The system immediately alerts me when performance drops below thresholds. I can see that the trajectory accuracy (0.57) is below my minimum threshold (0.85).

The Solution: Comprehensive Monitoring

I implemented four key monitoring strategies:

1. Trajectory Accuracy Evaluation

trajectory_evaluation.py
from agentevals.trajectory.match import create_trajectory_match_evaluator
from langsmith import testing as t
def test_trajectory_accuracy():
"""Validate tool call sequence matches expected pattern"""
# Test agent behavior
result = agent.invoke({
"messages": [HumanMessage(content="What's the weather in SF?")]
})
# Define expected tool call sequence
reference_trajectory = [
{"tool": "search_web", "input": "weather San Francisco"},
{"tool": "parse_weather", "input": "weather data"},
{"tool": "format_response", "input": "parsed weather"}
]
# Log evaluation data
t.log_inputs({})
t.log_outputs({"messages": result["messages"]})
t.log_reference_outputs({"messages": reference_trajectory})
# Run evaluation
score = monitor.trajectory_evaluator(
outputs=result["messages"],
reference_outputs=reference_trajectory
)
return score

2. BigQuery Integration for Event Logging

event_logging.sql
CREATE TABLE `your-gcp-project-id.adk_agent_logs.agent_events_v2`
(
timestamp TIMESTAMP NOT NULL,
event_type STRING,
agent STRING,
session_id STRING,
user_id STRING,
trace_id STRING,
content JSON,
latency_ms JSON,
status STRING,
error_message STRING
)
PARTITION BY DATE(timestamp)
CLUSTER BY event_type, agent, user_id;

3. Periodic Audits with LangSmith

weekly_audit.py
def run_weekly_audit():
"""Comprehensive performance review"""
audit_results = {
"trajectory_accuracy": [],
"response_quality": [],
"latency_trends": [],
"error_analysis": []
}
# Test against multiple datasets
datasets = ["production_dataset", "edge_cases", "new_scenarios"]
for dataset in datasets:
results = client.evaluate(
target_function=run_agent,
data=dataset,
evaluators=[monitor.trajectory_evaluator, monitor.quality_evaluator],
experiment_prefix=f"weekly-audit-{dataset}"
)
audit_results["trajectory_accuracy"].append(results.trajectory_score)
audit_results["response_quality"].append(results.quality_score)
# Generate performance report
return generate_performance_report(audit_results)

4. Quality Thresholds and Alerts

thresholds.py
def evaluate_thresholds(metrics: Dict[str, Any]) -> bool:
"""Check if performance meets quality standards"""
alerts = []
if metrics["trajectory_accuracy"] < monitor.thresholds.trajectory_accuracy:
alerts.append(f"Trajectory accuracy too low: {metrics['trajectory_accuracy']}")
if metrics["response_quality"] < monitor.thresholds.response_quality:
alerts.append(f"Response quality degraded: {metrics['response_quality']}")
if metrics["latency_ms"] > monitor.thresholds.latency_ms:
alerts.append(f"Latency increased: {metrics['latency_ms']}ms")
if metrics["error_rate"] > monitor.thresholds.error_rate:
alerts.append(f"Error rate too high: {metrics['error_rate']}")
if alerts:
send_alerts(alerts)
return False
return True

Production Implementation

I deployed this monitoring stack in production:

  • LangSmith: Centralized evaluation platform
  • OpenTelemetry: Distributed tracing
  • BigQuery: Data warehousing and analysis
  • Custom Dashboards: Real-time performance visualization
  • Alerting Systems: Automated quality threshold notifications

The most important line is the quality threshold check. Without thresholds, you can’t detect performance drift.

Real-World Results

After implementing this monitoring system:

  1. Immediate Detection: Performance alerts within 2 hours of degradation
  2. Proactive Fixes: Average recovery time reduced from 7 days to 4 hours
  3. Continuous Improvement: 15% performance gain over 3 months
  4. User Satisfaction: Increased from 57% to 89%

The Reason

I think the key reason this works is because it treats AI agents as learning systems, not static software. The monitoring framework:

  1. Baseline Establishment: Creates initial performance benchmarks
  2. Automated Alerts: Detects deviations in real-time
  3. Periodic Audits: Comprehensive reviews of edge cases
  4. Feedback Loops: Continuous improvement based on data

In this post, I demonstrated how to monitor AI agents effectively. The key point is implementing comprehensive evaluation frameworks with quality thresholds that catch performance drift before it impacts users.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments