How to Evaluate Deep Agents Performance with Harbor and LangSmith

Mar 20, 2026

Purpose

I built an agent. It worked in my tests. But would it work reliably in production? I needed a way to measure its performance systematically. The Harbor framework, combined with Terminal Bench 2.0, provides exactly that: automated evaluation with sandbox environments and scoring.

Why Evaluation Matters

Building agents is easy. Building reliable agents is hard. Without systematic evaluation:

I don’t know where my agent fails
I can’t measure if changes improve or hurt performance
I ship with false confidence

With evaluation:

Benchmarks reveal failure patterns
Scores quantify progress
Tracing helps debug specific failures

Environment

Python 3.11+
Docker (for local sandbox)
Harbor framework
LangSmith account (optional, for tracing)

What is Harbor?

Harbor is an evaluation framework for AI agents. It provides:

Sandbox environments - Docker, Modal, Daytona, E2B
Automatic test execution - Runs agent, checks results
Reward scoring - 0.0 to 1.0 scale

Terminal Bench 2.0

Terminal Bench 2.0 is a benchmark with 90+ tasks across domains:

Software engineering
Biology
Security
Gaming

Example tasks:

path-tracing - Trace execution paths
chess-best-move - Find optimal chess moves
git-multibranch - Complex git operations
sqlite-with-gcov - Database with coverage
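
Because every task yields a reward between 0.0 and 1.0, a run over tasks like these can be reduced to a couple of headline numbers. The following is a minimal sketch, not Harbor's own reporting code: `summarize_rewards`, the `pass_threshold` parameter, and the reward values in the example are assumptions made for illustration.

```python
from statistics import mean


def summarize_rewards(rewards: dict[str, float], pass_threshold: float = 1.0) -> dict[str, float]:
    """Summarize per-task rewards on the 0.0 to 1.0 scale.

    `rewards` maps a task name to its reward. `pass_threshold` is a
    hypothetical cutoff for counting a task as solved.
    """
    if not rewards:
        raise ValueError("no rewards to summarize")
    for task, reward in rewards.items():
        if not 0.0 <= reward <= 1.0:
            raise ValueError(f"reward for {task!r} is outside [0.0, 1.0]: {reward}")

    passed = [task for task, reward in rewards.items() if reward >= pass_threshold]
    return {
        "mean_reward": mean(rewards.values()),
        "pass_rate": len(passed) / len(rewards),
    }


if __name__ == "__main__":
    # Illustrative rewards only; these are not measured results.
    example = {
        "path-tracing": 1.0,
        "chess-best-move": 0.0,
        "git-multibranch": 1.0,
        "sqlite-with-gcov": 0.5,
    }
    print(summarize_rewards(example))
```

Tracking both numbers is useful because the mean rewards partial credit, while the pass rate only counts tasks that reach the threshold.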

How to Run Evaluation

Quick Start with Docker

Run a single task locally:

uv run harbor run --agent-import-path deepagents_harbor:DeepAgentsWrapper \
  --dataset terminal-bench@2.0 -n 1 --jobs-dir jobs/terminal-bench --env docker

Scale with Daytona

Run 10 tasks in cloud sandboxes:

uv run harbor run --agent-import-path deepagents_harbor:DeepAgentsWrapper \
  --dataset terminal-bench@2.0 -n 10 --jobs-dir jobs/terminal-bench --env daytona

The --agent-import-path points to the Deep Agents wrapper that Harbor uses to invoke the agent.
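
The value `deepagents_harbor:DeepAgentsWrapper` follows the common `module:attribute` convention. How Harbor resolves it internally is not shown here; the sketch below only illustrates how such a string is typically turned into a Python object with the standard library. The helper name `load_from_import_path` is hypothetical, and the demo loads a standard-library class so it runs without Harbor or Deep Agents installed.

```python
import importlib
from typing import Any


def load_from_import_path(import_path: str) -> Any:
    """Resolve a 'package.module:AttributeName' string to the named object."""
    module_name, separator, attribute_name = import_path.partition(":")
    if not separator or not module_name or not attribute_name:
        raise ValueError(f"expected 'module:attribute', got {import_path!r}")

    # Import the module part, then look up the attribute (usually a class) on it.
    module = importlib.import_module(module_name)
    return getattr(module, attribute_name)


if __name__ == "__main__":
    # Stand-in for "deepagents_harbor:DeepAgentsWrapper".
    ordered_dict_class = load_from_import_path("collections:OrderedDict")
    print(ordered_dict_class.__name__)
```

In practice this means the wrapper module has to be importable from the environment where the `harbor` command runs.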

Deep Agent Architecture for Evaluation

The evaluation-optimized Deep Agent uses:

┌─────────────────────────────────────────────────────┐
│                  Deep Agent                         │
├─────────────────────────────────────────────────────┤
│  1. Detailed System Prompt                          │
│     - Expansive instructions                        │
│     - Tool usage guidance                           │
├─────────────────────────────────────────────────────┤
│  2. Planning Middleware                             │
│     - write_todos for task structure                │
├─────────────────────────────────────────────────────┤
│  3. Filesystem                                      │
│     - File operations for context                   │
├─────────────────────────────────────────────────────┤
│  4. SubAgents                                       │
│     - Specialized agents for isolated work          │
└─────────────────────────────────────────────────────┘

LangSmith Integration

The evaluation workflow integrates with LangSmith for tracing:

Deep Agents ──▶ Harbor (evaluate) ──▶ LangSmith (analyze)
                     ▲                      │
                     └──────────────────────┘
                          Improve & Repeat

Create Dataset and Experiment

# Create the dataset, then run the experiment, using scripts/harbor_langsmith.py
python scripts/harbor_langsmith.py create-dataset
python scripts/harbor_langsmith.py run-experiment

Add feedback scores in LangSmith to filter and analyze results.

Common Failure Patterns

Harbor reveals patterns in agent failures:

Pattern	Symptom	Fix
Poor Planning	Jumps into coding without reading requirements	Add upfront planning requirement
Incorrect Tool Usage	Uses `bash cat` instead of `read_file`	Improve tool descriptions
No Incremental Testing	Writes 200 lines, tests once	Prompt to test after each unit
Hallucinated Paths	Reads files before checking existence	Add “always ls before read” rule
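
Prompt rules like "always ls before read" can also be backed up in the tool itself. As a rough sketch of that idea (not the Deep Agents `read_file` implementation), a file-reading tool can check for existence and, on a miss, return the directory listing so the agent can correct its path. The name `safe_read_file` and the `max_chars` limit are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def safe_read_file(path: str, max_chars: int = 10_000) -> str:
    """Read a text file, returning a corrective message instead of raising.

    If the path does not exist, the message lists what is in the parent
    directory, which is the information an `ls` would have provided.
    """
    target = Path(path)
    if not target.exists():
        parent = target.parent
        entries = sorted(p.name for p in parent.iterdir()) if parent.is_dir() else []
        return f"ERROR: {path} does not exist. Entries in {parent}: {entries}"
    if target.is_dir():
        return f"ERROR: {path} is a directory, not a file."
    # Truncate so a large file cannot flood the agent's context.
    return target.read_text(encoding="utf-8", errors="replace")[:max_chars]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        (Path(workdir) / "README.md").write_text("hello", encoding="utf-8")
        print(safe_read_file(str(Path(workdir) / "README.md")))
        print(safe_read_file(str(Path(workdir) / "readme.txt")))
```

Returning an error string rather than raising keeps the failure visible to the model as a normal tool result, so it can retry with a valid path.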

Available Environments

Choose based on your needs:

Environment	Use Case
`docker`	Local testing, quick iterations
`daytona`	Cloud sandboxes, scaling up
`modal`	Modal cloud compute
`runloop`	Runloop sandboxes
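
When switching between these environments often, it can help to assemble the command in one place. The sketch below builds the same `harbor run` invocation used earlier, using only the flags that appear in this post. The helper `build_harbor_command` is hypothetical, and the dataset identifier `terminal-bench@2.0` is assumed to be the Terminal Bench 2.0 dataset name.

```python
import shlex

# Environment names taken from the table in this post.
KNOWN_ENVS = ("docker", "daytona", "modal", "runloop")


def build_harbor_command(
    env: str,
    n: int,
    dataset: str,
    agent_import_path: str = "deepagents_harbor:DeepAgentsWrapper",
    jobs_dir: str = "jobs/terminal-bench",
) -> list[str]:
    """Build the argument list for a `harbor run` call.

    `n` is the value passed to `-n` (this post uses 1 with Docker and 10 with Daytona).
    """
    if env not in KNOWN_ENVS:
        raise ValueError(f"unknown environment {env!r}; expected one of {KNOWN_ENVS}")
    if n < 1:
        raise ValueError("n must be at least 1")
    return [
        "uv", "run", "harbor", "run",
        "--agent-import-path", agent_import_path,
        "--dataset", dataset,
        "-n", str(n),
        "--jobs-dir", jobs_dir,
        "--env", env,
    ]


if __name__ == "__main__":
    command = build_harbor_command(env="docker", n=1, dataset="terminal-bench@2.0")
    # Print a copy-pasteable shell command; pass `command` to subprocess.run to execute it.
    print(shlex.join(command))
```

Validating the environment name up front catches typos before any sandbox is started.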

Summary

In this post, I showed how to evaluate Deep Agents using Harbor and Terminal Bench 2.0. The evaluation pipeline runs agents in sandbox environments, scores their outputs, and traces results through LangSmith. The key benefit is systematic measurement - I can identify failure patterns, iterate on improvements, and ship with confidence.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in a way that helps others. If you want to get in touch, you can email me.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Deep Agents Documentation
👨‍💻 Harbor Framework
👨‍💻 LangSmith

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!