How to Evaluate Deep Agents Performance with Harbor and LangSmith
Purpose
I built an agent. It worked in my tests. But would it work reliably in production? I needed a way to measure its performance systematically. The Harbor framework with Terminal Bench 2.0 provides exactly that: automated evaluation with sandbox environments and scoring.
Why Evaluation Matters
Building agents is easy. Building reliable agents is hard. Without systematic evaluation:
- I don’t know where my agent fails
- I can’t measure if changes improve or hurt performance
- I ship with false confidence
With evaluation:
- Benchmarks reveal failure patterns
- Scores quantify progress
- Tracing helps debug specific failures
Environment
- Python 3.11+
- Docker (for local sandbox)
- Harbor framework
- LangSmith account (optional, for tracing)
What is Harbor?
Harbor is an evaluation framework for AI agents. It provides:
- Sandbox environments - Docker, Modal, Daytona, E2B
- Automatic test execution - Runs agent, checks results
- Reward scoring - 0.0 to 1.0 scale
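Because every task ends with a reward between 0.0 and 1.0, a run can be condensed into a few numbers. The following is a minimal sketch, not part of Harbor: the `summarize_rewards` helper, the `pass_threshold` parameter, and the reward values are assumptions made for illustration, and Harbor's own reporting may look different.

```python
from statistics import mean


def summarize_rewards(rewards: dict[str, float], pass_threshold: float = 1.0) -> dict:
    """Summarize per-task rewards (hypothetical helper, not a Harbor API).

    `rewards` maps a task name to its reward on the 0.0 to 1.0 scale.
    A task counts as passed when its reward reaches `pass_threshold`.
    """
    if not rewards:
        raise ValueError("no rewards to summarize")
    for task, reward in rewards.items():
        if not 0.0 <= reward <= 1.0:
            raise ValueError(f"reward for {task!r} is outside 0.0..1.0: {reward}")

    failed = sorted(task for task, reward in rewards.items() if reward < pass_threshold)
    passed = len(rewards) - len(failed)
    return {
        "mean_reward": mean(rewards.values()),
        "pass_rate": passed / len(rewards),
        "failed_tasks": failed,
    }


if __name__ == "__main__":
    # Made-up rewards, only to show the shape of the summary.
    example = {"task-a": 1.0, "task-b": 0.0, "task-c": 0.5}
    print(summarize_rewards(example))
```

Comparing the summary from before and after a prompt change gives a quick signal of whether the change helped or hurt.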
Terminal Bench 2.0
Terminal Bench 2.0 is a benchmark with 90+ tasks across domains:
- Software engineering
- Biology
- Security
- Gaming
Example tasks:
- path-tracing - Trace execution paths
- chess-best-move - Find optimal chess moves
- git-multibranch - Complex git operations
- sqlite-with-gcov - Database with coverage
How to Run Evaluation
Quick Start with Docker
Run a single task locally:
    uv run harbor run --agent-import-path deepagents_harbor:DeepAgentsWrapper \
Scale with Daytona
Run 10 tasks in cloud sandboxes:
    uv run harbor run --agent-import-path deepagents_harbor:DeepAgentsWrapper \
The --agent-import-path points to the Deep Agents wrapper that Harbor uses to invoke the agent.
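A value such as deepagents_harbor:DeepAgentsWrapper follows the common `module:attribute` convention. How Harbor resolves it internally is not shown in this post, so the sketch below only illustrates the general idea with the standard library. `load_from_import_path` is a hypothetical helper name, not a Harbor function.

```python
import importlib


def load_from_import_path(import_path: str):
    """Resolve a 'module:attribute' string to the object it names.

    Illustrative only: this mirrors the general convention, not
    necessarily what Harbor does under the hood.
    """
    module_name, separator, attribute_name = import_path.partition(":")
    if not separator or not module_name or not attribute_name:
        raise ValueError(f"expected 'module:attribute', got {import_path!r}")
    module = importlib.import_module(module_name)  # raises ImportError if missing
    return getattr(module, attribute_name)         # raises AttributeError if missing


if __name__ == "__main__":
    # A standard-library target keeps the example runnable anywhere.
    ordered_dict_class = load_from_import_path("collections:OrderedDict")
    print(ordered_dict_class)
```

If the wrapper module is not importable from the working directory, a lookup like this fails with an `ImportError`, which is a useful first thing to check when a run does not start.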
Deep Agent Architecture for Evaluation
The evaluation-optimized Deep Agent uses:
    ┌─────────────────────────────────────────────────────┐
    │                     Deep Agent                      │
    ├─────────────────────────────────────────────────────┤
    │ 1. Detailed System Prompt                           │
    │    - Expansive instructions                         │
    │    - Tool usage guidance                            │
    ├─────────────────────────────────────────────────────┤
    │ 2. Planning Middleware                              │
    │    - write_todos for task structure                 │
    ├─────────────────────────────────────────────────────┤
    │ 3. Filesystem                                       │
    │    - File operations for context                    │
    ├─────────────────────────────────────────────────────┤
    │ 4. SubAgents                                        │
    │    - Specialized agents for isolated work           │
    └─────────────────────────────────────────────────────┘
LangSmith Integration
The evaluation workflow integrates with LangSmith for tracing:
    Deep Agents ──▶ Harbor (evaluate) ──▶ LangSmith (analyze)
         ▲                      │
         └──────────────────────┘
             Improve & Repeat
Create Dataset and Experiment
    # Run the harbor_langsmith.py scripts
    python scripts/harbor_langsmith.py create-dataset
    python scripts/harbor_langsmith.py run-experiment
Add feedback scores in LangSmith to filter and analyze results.
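Attaching a score to a trace could look roughly like the sketch below. It assumes `client` is a `langsmith.Client`, whose `create_feedback` method records a score against a run. The `record_reward` helper and the `harbor_reward` feedback key are hypothetical names chosen for this example, and the harbor_langsmith.py script may handle this step differently.

```python
def record_reward(client, run_id: str, reward: float, key: str = "harbor_reward") -> None:
    """Attach a Harbor-style reward to a LangSmith run as a feedback score.

    `client` is assumed to behave like `langsmith.Client`; `key` is a
    made-up feedback name used only in this sketch.
    """
    if not 0.0 <= reward <= 1.0:
        raise ValueError(f"reward must be within 0.0..1.0, got {reward}")
    # The feedback key becomes a filterable column in the LangSmith UI.
    client.create_feedback(run_id, key=key, score=reward)
```

With real credentials configured, `client` would be created with `langsmith.Client()` and `run_id` would be the ID of the trace produced for the task.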
Common Failure Patterns
Harbor reveals patterns in agent failures:
| Pattern | Symptom | Fix |
|---|---|---|
| Poor Planning | Jumps into coding without reading requirements | Add upfront planning requirement |
| Incorrect Tool Usage | Uses bash cat instead of read_file | Improve tool descriptions |
| No Incremental Testing | Writes 200 lines, tests once | Prompt to test after each unit |
| Hallucinated Paths | Reads files before checking existence | Add “always ls before read” rule |
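The "always ls before read" fix can also be enforced inside a tool instead of relying on the prompt alone. The sketch below is written under assumptions: `read_file_checked` is a hypothetical stand-in, not the Deep Agents `read_file` tool, and the error wording is invented. The idea is that a failed read returns the directory listing, so the agent can correct a hallucinated path in its next step.

```python
from pathlib import Path


def read_file_checked(path: str, max_listing: int = 20) -> str:
    """Read a text file, or explain what exists nearby if the path is wrong.

    Hypothetical tool for illustration; it returns an error string rather
    than raising so the message can be fed straight back to the agent.
    """
    target = Path(path)
    if target.is_file():
        return target.read_text(encoding="utf-8")

    parent = target.parent
    if parent.is_dir():
        # Show a bounded listing so the agent can pick a real file.
        names = sorted(entry.name for entry in parent.iterdir())[:max_listing]
        listing = ", ".join(names) or "(empty directory)"
        return f"ERROR: {path} does not exist. Entries in {parent}: {listing}"

    return f"ERROR: {path} does not exist and neither does its parent directory {parent}."
```

Returning the listing in the error keeps the recovery to a single extra step, whereas a bare "file not found" often leads the agent to guess again.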
Available Environments
Choose based on your needs:
| Environment | Use Case |
|---|---|
| docker | Local testing, quick iterations |
| daytona | Cloud sandboxes, scaling up |
| modal | Modal cloud compute |
| runloop | Runloop sandboxes |
Summary
In this post, I showed how to evaluate Deep Agents using Harbor and Terminal Bench 2.0. The evaluation pipeline runs agents in sandbox environments, scores their outputs, and traces results through LangSmith. The key benefit is systematic measurement: I can identify failure patterns, iterate on improvements, and ship with confidence backed by scores rather than by a handful of manual tests.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Deep Agents Documentation
- 👨💻 Harbor Framework
- 👨💻 LangSmith
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!