How to monitor AI agents: Essential metrics & tools for long-term success

Feb 28, 2026

Purpose

This post demonstrates how to evaluate and monitor AI agent performance over time. After I deployed our AI agent for the operations team without proper monitoring, I discovered that its performance had silently degraded by roughly 40% over six months.

The Problem

AI agents aren’t “set it and forget it” systems. When I first deployed our agent, I assumed it would maintain performance indefinitely. But performance drift is inevitable, and without monitoring it goes unnoticed.

Here’s what I observed:

Response accuracy dropped from 95% to 57%
Tool call failures increased 300%
User satisfaction fell by 48%

The core error is treating AI agents like traditional software. They need continuous evaluation because:

Data drift: Real-world inputs change over time
Model drift: Foundation models evolve and change behavior
Context drift: User expectations shift
Environment drift: Tools and APIs change
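
To make drift measurable, one option is to compare recent scores against a launch baseline. The sketch below is a minimal illustration, not the setup described in this post: the function names and the 10% tolerance are assumptions chosen for the example.

```python
from statistics import mean
from typing import Sequence


def relative_drop(baseline: float, current: float) -> float:
    """Fraction by which `current` has fallen below `baseline` (0.0 if it has not fallen)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return max(0.0, (baseline - current) / baseline)


def has_drifted(baseline: float, recent_scores: Sequence[float], tolerance: float = 0.10) -> bool:
    """Flag drift when the mean of recent scores is more than `tolerance` below baseline."""
    if not recent_scores:
        raise ValueError("recent_scores must not be empty")
    return relative_drop(baseline, mean(recent_scores)) > tolerance


# Accuracy figures quoted above: 95% at launch, 57% six months later.
drop = relative_drop(0.95, 0.57)
print(f"relative drop: {drop:.0%}")  # relative drop: 40%
```

Measured this way, the fall in response accuracy from 95% to 57% is a 40% relative drop, far beyond any reasonable tolerance.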

What I built

I implemented a comprehensive monitoring framework using LangChain’s agentevals package and LangSmith. Here’s my setup:

from agentevals.trajectory.match import create_trajectory_match_evaluator
from agentevals.trajectory.llm import create_trajectory_llm_as_judge
from langsmith import Client
import logging
from dataclasses import dataclass
from typing import List, Dict, Any
import json

@dataclass
class QualityThresholds:
    trajectory_accuracy: float = 0.85
    response_quality: float = 0.80
    latency_ms: int = 5000
    error_rate: float = 0.05

class AgentMonitor:
    def __init__(self, thresholds: QualityThresholds):
        self.thresholds = thresholds
        self.client = Client()
        self.setup_evaluators()

    def setup_evaluators(self):
        """Initialize LangChain evaluation tools"""
        # Trajectory accuracy - validates tool call sequence
        self.trajectory_evaluator = create_trajectory_match_evaluator(
            trajectory_match_mode="superset"
        )

        # Response quality - LLM-as-judge evaluation
        # The {outputs} placeholder is where the agent trajectory is injected
        self.quality_evaluator = create_trajectory_llm_as_judge(
            prompt="""Evaluate if the agent response correctly addresses the user's request.
            Consider: completeness, accuracy, and helpfulness.
            Return JSON with score 0-1 and reasoning.

            <trajectory>
            {outputs}
            </trajectory>""",
            model="openai:o3-mini"
        )
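
For intuition about the "superset" match mode used above: the agent's trajectory passes if it contains at least every tool call in the reference, and extra calls are tolerated. The plain-Python sketch below illustrates that idea on tool names only. It is a simplification that ignores order and arguments, not the agentevals implementation, and the helper name and tool names are hypothetical.

```python
from collections import Counter
from typing import List


def is_superset_trajectory(actual_tools: List[str], reference_tools: List[str]) -> bool:
    """True if every reference tool call appears in the actual trajectory at least as often.

    Order is ignored and extra calls are allowed.
    """
    actual = Counter(actual_tools)
    reference = Counter(reference_tools)
    return all(actual[name] >= count for name, count in reference.items())


reference = ["search_web", "parse_weather"]

# An extra tool call is fine in superset mode
print(is_superset_trajectory(["search_web", "get_time", "parse_weather"], reference))  # True

# A missing reference call is not
print(is_superset_trajectory(["search_web"], reference))  # False
```

This mode suits monitoring when you care that key tools were called but do not want to penalize the agent for additional, harmless calls.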

How It Works

When I run this monitoring system:

# Initialize monitor with production thresholds
monitor = AgentMonitor(QualityThresholds())

def evaluate_agent_performance():
    """Run comprehensive evaluation against production dataset"""
    try:
        # run_agent (defined elsewhere) runs the agent on one dataset example;
        # Client.evaluate takes the target as its first positional argument
        results = monitor.client.evaluate(
            run_agent,
            data="production_dataset",
            evaluators=[
                monitor.trajectory_evaluator,
                monitor.quality_evaluator
            ],
            max_concurrency=4
        )

        # Check against quality thresholds
        # (the aggregate fields read from `results` are placeholders for
        # values computed from the per-example evaluation results)
        performance_metrics = {
            "trajectory_accuracy": results.trajectory_score,
            "response_quality": results.quality_score,
            "latency_ms": results.average_latency,
            "error_rate": results.error_count / results.total_count
        }

        # evaluate_thresholds is a module-level function, not a method
        return performance_metrics, evaluate_thresholds(performance_metrics)

    except Exception as e:
        logging.error(f"Evaluation failed: {e}")
        return None, False

I get this output:

{
    "trajectory_accuracy": 0.57,
    "response_quality": 0.62,
    "latency_ms": 6800,
    "error_rate": 0.15
}

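The system immediately alerts me when performance drops below thresholds. Here every metric is out of bounds: trajectory accuracy (0.57) and response quality (0.62) are below their minimums (0.85 and 0.80), while latency (6800 ms) and error rate (0.15) exceed their limits (5000 ms and 0.05).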
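
A quick way to check that output against the thresholds is to encode each limit with its direction, since scores must stay above a minimum while latency and error rate must stay below a maximum. The sketch below is an illustrative, self-contained version; the `LIMITS` table and `find_breaches` helper are hypothetical names, with values copied from `QualityThresholds`.

```python
from typing import Dict, List, Tuple

# "min": the metric must stay at or above the limit.
# "max": the metric must stay at or below the limit.
LIMITS: Dict[str, Tuple[str, float]] = {
    "trajectory_accuracy": ("min", 0.85),
    "response_quality": ("min", 0.80),
    "latency_ms": ("max", 5000),
    "error_rate": ("max", 0.05),
}


def find_breaches(metrics: Dict[str, float]) -> List[str]:
    """Return the names of metrics that violate their limit."""
    breaches = []
    for name, (kind, limit) in LIMITS.items():
        value = metrics[name]
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            breaches.append(name)
    return breaches


sample = {
    "trajectory_accuracy": 0.57,
    "response_quality": 0.62,
    "latency_ms": 6800,
    "error_rate": 0.15,
}
print(find_breaches(sample))
# ['trajectory_accuracy', 'response_quality', 'latency_ms', 'error_rate']
```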

The Solution: Comprehensive Monitoring

I implemented four key monitoring strategies:

1. Trajectory Accuracy Evaluation

from agentevals.trajectory.match import create_trajectory_match_evaluator
from langchain_core.messages import HumanMessage
from langsmith import testing as t

def test_trajectory_accuracy():
    """Validate tool call sequence matches expected pattern"""
    # Test agent behavior (`agent` is the deployed agent, defined elsewhere)
    result = agent.invoke({
        "messages": [HumanMessage(content="What's the weather in SF?")]
    })

    # Define expected tool call sequence
    reference_trajectory = [
        {"tool": "search_web", "input": "weather San Francisco"},
        {"tool": "parse_weather", "input": "weather data"},
        {"tool": "format_response", "input": "parsed weather"}
    ]

    # Log evaluation data
    t.log_inputs({})
    t.log_outputs({"messages": result["messages"]})
    t.log_reference_outputs({"messages": reference_trajectory})

    # Run evaluation
    score = monitor.trajectory_evaluator(
        outputs=result["messages"],
        reference_outputs=reference_trajectory
    )

    return score

2. BigQuery Integration for Event Logging

CREATE TABLE `your-gcp-project-id.adk_agent_logs.agent_events_v2`
(
  timestamp TIMESTAMP NOT NULL,
  event_type STRING,
  agent STRING,
  session_id STRING,
  user_id STRING,
  trace_id STRING,
  content JSON,
  latency_ms JSON,
  status STRING,
  error_message STRING
)
PARTITION BY DATE(timestamp)
CLUSTER BY event_type, agent, user_id;
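
To show what a single event for this table might look like, the sketch below builds one row as a JSON-serializable dict whose keys match the columns above. The helper name, the sample values, and the shapes of the `content` and `latency_ms` payloads are assumptions for illustration only.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_event_row(
    event_type: str,
    agent: str,
    session_id: str,
    user_id: str,
    trace_id: str,
    content: Dict[str, Any],
    latency_ms: Dict[str, Any],
    status: str = "OK",
    error_message: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one row whose keys match the columns of the table above."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event_type": event_type,
        "agent": agent,
        "session_id": session_id,
        "user_id": user_id,
        "trace_id": trace_id,
        "content": content,
        "latency_ms": latency_ms,
        "status": status,
        "error_message": error_message,
    }


# Hypothetical sample event
row = build_event_row(
    event_type="TOOL_COMPLETED",
    agent="ops_agent",
    session_id="session-123",
    user_id="user-42",
    trace_id="trace-abc",
    content={"tool": "search_web", "result_chars": 512},
    latency_ms={"total_ms": 830},
)
print(json.dumps(row, indent=2))
```

Keeping `session_id` and `trace_id` on every row is what later lets you join individual tool calls back to a full agent run.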

3. Periodic Audits with LangSmith

def run_weekly_audit():
    """Comprehensive performance review"""
    audit_results = {
        "trajectory_accuracy": [],
        "response_quality": [],
        "latency_trends": [],
        "error_analysis": []
    }

    # Test against multiple datasets
    datasets = ["production_dataset", "edge_cases", "new_scenarios"]

    for dataset in datasets:
        # Use the monitor's LangSmith client; the target is passed positionally
        results = monitor.client.evaluate(
            run_agent,
            data=dataset,
            evaluators=[monitor.trajectory_evaluator, monitor.quality_evaluator],
            experiment_prefix=f"weekly-audit-{dataset}"
        )

        audit_results["trajectory_accuracy"].append(results.trajectory_score)
        audit_results["response_quality"].append(results.quality_score)

    # Generate performance report
    return generate_performance_report(audit_results)
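
The audit code calls `generate_performance_report` without showing it. One hypothetical way to fill it in, purely as a sketch, is to summarize each metric across datasets as mean, minimum, and maximum, skipping metrics that were never populated.

```python
from statistics import mean
from typing import Dict, List


def generate_performance_report(
    audit_results: Dict[str, List[float]],
) -> Dict[str, Dict[str, float]]:
    """Summarize each metric that has at least one score as mean / min / max."""
    report: Dict[str, Dict[str, float]] = {}
    for metric, scores in audit_results.items():
        if not scores:
            continue  # skip metrics the audit did not populate
        report[metric] = {
            "mean": mean(scores),
            "min": min(scores),
            "max": max(scores),
        }
    return report


# Made-up scores for three datasets
example = {
    "trajectory_accuracy": [0.9, 0.8, 0.7],
    "response_quality": [0.85, 0.75, 0.8],
    "latency_trends": [],
    "error_analysis": [],
}
report = generate_performance_report(example)
print(sorted(report))  # ['response_quality', 'trajectory_accuracy']
```

Reporting the minimum alongside the mean matters here: a weak result on the edge-case dataset can hide behind a healthy average.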

4. Quality Thresholds and Alerts

def evaluate_thresholds(metrics: Dict[str, Any]) -> bool:
    """Check if performance meets quality standards"""
    alerts = []

    if metrics["trajectory_accuracy"] < monitor.thresholds.trajectory_accuracy:
        alerts.append(f"Trajectory accuracy too low: {metrics['trajectory_accuracy']}")

    if metrics["response_quality"] < monitor.thresholds.response_quality:
        alerts.append(f"Response quality degraded: {metrics['response_quality']}")

    if metrics["latency_ms"] > monitor.thresholds.latency_ms:
        alerts.append(f"Latency increased: {metrics['latency_ms']}ms")

    if metrics["error_rate"] > monitor.thresholds.error_rate:
        alerts.append(f"Error rate too high: {metrics['error_rate']}")

    if alerts:
        send_alerts(alerts)
        return False

    return True
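
`send_alerts` is likewise left undefined above. A minimal sketch, assuming you want every alert logged and optionally forwarded to a notification channel, could look like this; the `notify` callback is a hypothetical stand-in for whatever delivers the message (chat webhook, email, pager).

```python
import logging
from typing import Callable, List, Optional

logger = logging.getLogger("agent_monitor")


def send_alerts(
    alerts: List[str],
    notify: Optional[Callable[[str], None]] = None,
) -> int:
    """Log each alert and optionally forward it to `notify`. Returns the number sent."""
    for message in alerts:
        logger.warning("ALERT: %s", message)
        if notify is not None:
            notify(message)
    return len(alerts)


# Usage: collect forwarded alerts in a list instead of calling a real channel
received: List[str] = []
sent = send_alerts(["Trajectory accuracy too low: 0.57"], notify=received.append)
print(sent, received)  # 1 ['Trajectory accuracy too low: 0.57']
```

Passing the delivery channel in as a callback keeps the threshold logic testable without network access.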

Production Implementation

I deployed this monitoring stack in production:

LangSmith: Centralized evaluation platform
OpenTelemetry: Distributed tracing
BigQuery: Data warehousing and analysis
Custom Dashboards: Real-time performance visualization
Alerting Systems: Automated quality threshold notifications

The most important line is the quality threshold check. Without thresholds, you can’t detect performance drift.

Real-World Results

After implementing this monitoring system:

Immediate Detection: Performance alerts within 2 hours of degradation
Proactive Fixes: Average recovery time reduced from 7 days to 4 hours
Continuous Improvement: 15% performance gain over 3 months
User Satisfaction: Increased from 57% to 89%

The Reason

I think the key reason this works is that it treats AI agents as learning systems, not static software. The monitoring framework has four parts:

Baseline Establishment: Creates initial performance benchmarks
Automated Alerts: Detects deviations in real-time
Periodic Audits: Comprehensive reviews of edge cases
Feedback Loops: Continuous improvement based on data

In this post, I demonstrated how to monitor AI agents effectively. The key point is implementing comprehensive evaluation frameworks with quality thresholds that catch performance drift before it impacts users.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can email me.
