
How to Evaluate Deep Agents Performance with Harbor and LangSmith

Purpose

I built an agent. It worked in my tests. But would it work reliably in production? I needed a way to measure its performance systematically. The Harbor framework, combined with Terminal Bench 2.0, provides exactly that: automated evaluation with sandbox environments and scoring.

Why Evaluation Matters

Building agents is easy. Building reliable agents is hard. Without systematic evaluation:

  • I don’t know where my agent fails
  • I can’t measure if changes improve or hurt performance
  • I ship with false confidence

With evaluation:

  • Benchmarks reveal failure patterns
  • Scores quantify progress
  • Tracing helps debug specific failures

Environment

  • Python 3.11+
  • Docker (for local sandbox)
  • Harbor framework
  • LangSmith account (optional, for tracing)

What is Harbor?

Harbor is an evaluation framework for AI agents. It provides:

  • Sandbox environments - Docker, Modal, Daytona, E2B
  • Automatic test execution - Runs agent, checks results
  • Reward scoring - 0.0 to 1.0 scale
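Because every task receives a reward between 0.0 and 1.0, a run can be summarized with a mean score and a list of tasks that fell short. Here is a minimal sketch, assuming the rewards have already been collected into a plain dict keyed by task name. The dict, the `summarize_rewards` helper, and the 1.0 pass threshold are my own illustrative assumptions, not part of Harbor's API.

```python
from statistics import mean


def summarize_rewards(rewards: dict[str, float], pass_threshold: float = 1.0) -> dict:
    """Summarize per-task rewards on the 0.0 to 1.0 scale."""
    for task, reward in rewards.items():
        if not 0.0 <= reward <= 1.0:
            raise ValueError(f"reward for {task!r} is outside [0.0, 1.0]: {reward}")
    # A task counts as failed when its reward is below the (assumed) threshold.
    failed = sorted(task for task, reward in rewards.items() if reward < pass_threshold)
    return {
        "mean_reward": mean(rewards.values()) if rewards else 0.0,
        "passed": len(rewards) - len(failed),
        "failed_tasks": failed,
    }


# Hypothetical rewards, made up for illustration only.
example = {"chess-best-move": 1.0, "git-multibranch": 0.0, "sqlite-with-gcov": 1.0}
summary = summarize_rewards(example)
print(summary)
```

Keeping the failed task names next to the mean makes it easier to jump from a score to the specific runs worth inspecting.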

Terminal Bench 2.0

Terminal Bench 2.0 is a benchmark with 90+ tasks across domains:

  • Software engineering
  • Biology
  • Security
  • Gaming

Example tasks:

  • path-tracing - Trace execution paths
  • chess-best-move - Find optimal chess moves
  • git-multibranch - Complex git operations
  • sqlite-with-gcov - Database with coverage

How to Run Evaluation

Quick Start with Docker

Run a single task locally:

Run 1 task via Docker
uv run harbor run --agent-import-path deepagents_harbor:DeepAgentsWrapper \
--dataset terminal-bench@2.0 -n 1 --jobs-dir jobs/terminal-bench --env docker

Scale with Daytona

Run 10 tasks in cloud sandboxes:

Run 10 tasks via Daytona
uv run harbor run --agent-import-path deepagents_harbor:DeepAgentsWrapper \
--dataset terminal-bench@2.0 -n 10 --jobs-dir jobs/terminal-bench --env daytona

The --agent-import-path points to the Deep Agents wrapper that Harbor uses to invoke the agent.
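Since the two commands differ only in the task count and the environment, they are easy to script. Here is a minimal sketch that assembles the same command line from Python, using only the flags shown above. The `build_harbor_command` helper and its defaults are my own illustration, and the dataset identifier in the example is an assumption that should match whatever you pass to `--dataset`.

```python
import shlex


def build_harbor_command(
    dataset: str,
    n_tasks: int,
    env: str,
    jobs_dir: str = "jobs/terminal-bench",
    agent: str = "deepagents_harbor:DeepAgentsWrapper",
) -> list[str]:
    """Build the argv list for the `uv run harbor run` command shown above."""
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    return [
        "uv", "run", "harbor", "run",
        "--agent-import-path", agent,
        "--dataset", dataset,
        "-n", str(n_tasks),
        "--jobs-dir", jobs_dir,
        "--env", env,
    ]


DATASET = "terminal-bench@2.0"  # assumed identifier; use the one from your own command
cmd = build_harbor_command(DATASET, n_tasks=1, env="docker")
print(shlex.join(cmd))
# To actually launch the run: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting issues when it is later handed to `subprocess.run`.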

Deep Agent Architecture for Evaluation

The evaluation-optimized Deep Agent uses:

Deep Agent architecture
┌─────────────────────────────────────────────────────┐
│                     Deep Agent                      │
├─────────────────────────────────────────────────────┤
│ 1. Detailed System Prompt                           │
│    - Expansive instructions                         │
│    - Tool usage guidance                            │
├─────────────────────────────────────────────────────┤
│ 2. Planning Middleware                              │
│    - write_todos for task structure                 │
├─────────────────────────────────────────────────────┤
│ 3. Filesystem                                       │
│    - File operations for context                    │
├─────────────────────────────────────────────────────┤
│ 4. SubAgents                                        │
│    - Specialized agents for isolated work           │
└─────────────────────────────────────────────────────┘

LangSmith Integration

The evaluation workflow integrates with LangSmith for tracing:

Evaluation workflow
Deep Agents ──▶ Harbor (evaluate) ──▶ LangSmith (analyze)
     ▲                      │
     └──────────────────────┘
         Improve & Repeat

Create Dataset and Experiment

LangSmith setup
# Run the harbor_langsmith.py scripts
python scripts/harbor_langsmith.py create-dataset
python scripts/harbor_langsmith.py run-experiment

Add feedback scores in LangSmith to filter and analyze results.

Common Failure Patterns

Harbor reveals patterns in agent failures:

Pattern                | Symptom                                        | Fix
-----------------------|------------------------------------------------|----------------------------------
Poor Planning          | Jumps into coding without reading requirements | Add upfront planning requirement
Incorrect Tool Usage   | Uses bash cat instead of read_file             | Improve tool descriptions
No Incremental Testing | Writes 200 lines, tests once                   | Prompt to test after each unit
Hallucinated Paths     | Reads files before checking existence          | Add “always ls before read” rule
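Three of these fixes are prompt-level rules, so they can be kept as data and appended to the agent's system prompt, while the tool-usage fix belongs in the tool descriptions themselves. Here is a minimal sketch of that idea. The rule wording, the `PROMPT_RULES` mapping, the `build_rules_section` helper, and the base prompt are illustrative assumptions, not taken from the Deep Agents codebase.

```python
# Failure pattern -> prompt rule (wording is illustrative).
PROMPT_RULES = {
    "Poor Planning": "Read the full task requirements and write a plan before editing any file.",
    "No Incremental Testing": "Test after completing each unit of work, not only at the end.",
    "Hallucinated Paths": "Always run ls on a directory before reading a file inside it.",
}


def build_rules_section(rules: dict[str, str]) -> str:
    """Render the rules as a numbered list for the system prompt."""
    lines = ["Rules:"]
    lines += [f"{i}. {rule}" for i, rule in enumerate(rules.values(), start=1)]
    return "\n".join(lines)


BASE_PROMPT = "You are a coding agent working in a sandboxed terminal."  # hypothetical
system_prompt = BASE_PROMPT + "\n\n" + build_rules_section(PROMPT_RULES)
print(system_prompt)
```

Keeping each rule tied to the failure pattern it targets makes it easier to check, after the next evaluation run, whether a given rule actually moved the score.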

Available Environments

Choose based on your needs:

Environment | Use Case
------------|--------------------------------
docker      | Local testing, quick iterations
daytona     | Cloud sandboxes, scaling up
modal       | Modal cloud compute
runloop     | Runloop sandboxes

Summary

In this post, I showed how to evaluate Deep Agents using Harbor and Terminal Bench 2.0. The evaluation pipeline runs agents in sandbox environments, scores their outputs, and traces results through LangSmith. The key benefit is systematic measurement - I can identify failure patterns, iterate on improvements, and ship with confidence.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
