The Problem

You built an agent. It has a loop, tools, planning, subagents, skills, context management, tasks, background execution, teams, protocols, autonomy, and isolation. You can watch it work and it looks impressive. But is it actually good?

You can’t answer that question by watching. Agents are non-deterministic — the same prompt produces different tool call sequences on different runs. A change to your system prompt might improve file creation but silently break refactoring. You won’t notice until a user does, and by then the damage is done.

Traditional unit tests don’t help. You can’t assert that the agent called write_file on turn 3, because tomorrow it might call bash on turn 2 and get the same result. You need tests that check outcomes, not steps.

The Solution

An eval harness. Define scenarios with known-good outcomes, run your agent in a sandbox, check what it produced, and score the results. Run the same suite every time you change the agent. Catch regressions before they ship.

Define scenario  →  Run agent in sandbox  →  Check outcomes  →  Score  →  Report

Building the Eval Harness

The core is two dataclasses and one function.

from dataclasses import dataclass
from typing import Callable
import json
import os
import subprocess
import sys
import tempfile

@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], "EvalResult"]
    max_turns: int = 20
    max_tokens: int = 50000

@dataclass
class EvalResult:
    passed: bool
    score: float   # 0.0 to 1.0
    details: str

EvalCase is the input. EvalResult is the output. Every eval, no matter how complex, conforms to this interface.

The runner creates an isolated workspace, executes the agent, and hands the workspace to the checker:

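def run_eval(case: EvalCase, agent_fn) -> EvalResult:
    # Fresh temp directory per eval so runs cannot see each other's files
    workspace = tempfile.mkdtemp()
    messages = [{"role": "user", "content": case.prompt}]
    turns = 0
    total_tokens = 0

    while turns < case.max_turns:
        response = agent_fn(messages)
        total_tokens += response.usage.input_tokens + response.usage.output_tokens
        if total_tokens > case.max_tokens:
            return EvalResult(False, 0.0, "Token budget exceeded")
        if response.stop_reason != "tool_use":
            break  # agent finished on its own
        # execute_tools is assumed to run each requested tool inside the workspace
        # and append the assistant turn and tool results to messages
        execute_tools(response, messages, cwd=workspace)
        turns += 1

    # Reached on normal completion and also when max_turns runs out;
    # the checker scores whatever is in the workspace at that point
    return case.check(workspace)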

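The cwd=workspace parameter is critical. Every tool call executes with the temp directory as its working directory, so the files the agent creates and the commands it runs land there by default. Note that cwd alone is not a security boundary: a tool that accepts absolute paths or runs arbitrary shell commands can still reach outside it, so run untrusted agents inside a container or VM as well. When the eval finishes, you inspect what’s in the workspace.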
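
To see the loop run end to end without a real model, here is a minimal sketch that drives run_eval with a scripted agent. The names fake_agent, check_file_written, and the tool_calls attribute are hypothetical, and this execute_tools is a toy that only understands a single write_file call; a real agent_fn and tool executor will look different.

```python
import os
import tempfile
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Callable


@dataclass
class EvalResult:
    passed: bool
    score: float
    details: str


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], EvalResult]
    max_turns: int = 20
    max_tokens: int = 50000


def fake_agent(messages):
    """Hypothetical stand-in for a model call: one tool call, then stop."""
    usage = SimpleNamespace(input_tokens=10, output_tokens=5)
    if any(m["role"] == "tool" for m in messages):
        # The file was already written on a previous turn, so finish.
        return SimpleNamespace(stop_reason="end_turn", usage=usage, tool_calls=[])
    call = {"name": "write_file", "path": "hello.py",
            "content": "print('Hello, World!')\n"}
    return SimpleNamespace(stop_reason="tool_use", usage=usage, tool_calls=[call])


def execute_tools(response, messages, cwd):
    """Toy tool executor: supports only write_file, resolved relative to cwd."""
    for call in response.tool_calls:
        with open(os.path.join(cwd, call["path"]), "w") as f:
            f.write(call["content"])
        messages.append({"role": "tool", "content": f"wrote {call['path']}"})


def run_eval(case, agent_fn):
    # Same loop as in the text, repeated so this sketch runs on its own.
    workspace = tempfile.mkdtemp()
    messages = [{"role": "user", "content": case.prompt}]
    turns = 0
    total_tokens = 0
    while turns < case.max_turns:
        response = agent_fn(messages)
        total_tokens += response.usage.input_tokens + response.usage.output_tokens
        if total_tokens > case.max_tokens:
            return EvalResult(False, 0.0, "Token budget exceeded")
        if response.stop_reason != "tool_use":
            break
        execute_tools(response, messages, cwd=workspace)
        turns += 1
    return case.check(workspace)


def check_file_written(workspace):
    exists = os.path.exists(os.path.join(workspace, "hello.py"))
    return EvalResult(exists, 1.0 if exists else 0.0,
                      "hello.py present" if exists else "hello.py missing")


demo_case = EvalCase(name="demo", prompt="Create hello.py", check=check_file_written)
demo_result = run_eval(demo_case, fake_agent)
print(demo_result)

# A budget smaller than one call's usage trips the token guard immediately.
tiny_case = EvalCase(name="tiny", prompt="Create hello.py",
                     check=check_file_written, max_tokens=5)
tiny_result = run_eval(tiny_case, fake_agent)
print(tiny_result.details)
```

A scripted agent like this is also a cheap way to test the harness itself: if the runner or a checker has a bug, you find out without spending tokens.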

A concrete checker:

def check_hello_world(workspace):
    path = os.path.join(workspace, "hello.py")
    if not os.path.exists(path):
        return EvalResult(False, 0.0, "hello.py not found")
    # Use a context manager so the file handle is closed promptly
    with open(path) as f:
        content = f.read()
    # Loose string check: confirms the file looks right, not that it runs
    if "print" in content and "Hello" in content:
        return EvalResult(True, 1.0, "Correct")
    return EvalResult(False, 0.5, "File exists but content wrong")

hello_eval = EvalCase(
    name="hello-world",
    prompt="Create a file called hello.py that prints 'Hello, World!'",
    check=check_hello_world,
)
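
Checkers are ordinary functions, so they can be exercised against hand-built workspaces before any agent is involved. The sketch below assumes a hypothetical helper, make_workspace, that writes a dict of files into a temp directory, and repeats the checker so it runs on its own.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class EvalResult:
    passed: bool
    score: float
    details: str


def check_hello_world(workspace):
    # Same checker as in the text, repeated so the sketch is self-contained.
    path = os.path.join(workspace, "hello.py")
    if not os.path.exists(path):
        return EvalResult(False, 0.0, "hello.py not found")
    with open(path) as f:
        content = f.read()
    if "print" in content and "Hello" in content:
        return EvalResult(True, 1.0, "Correct")
    return EvalResult(False, 0.5, "File exists but content wrong")


def make_workspace(files):
    """Hypothetical helper: build a temp workspace from {filename: content}."""
    workspace = tempfile.mkdtemp()
    for name, content in files.items():
        with open(os.path.join(workspace, name), "w") as f:
            f.write(content)
    return workspace


# One workspace per branch of the checker.
empty = check_hello_world(make_workspace({}))
wrong = check_hello_world(make_workspace({"hello.py": "x = 1\n"}))
right = check_hello_world(make_workspace({"hello.py": "print('Hello, World!')\n"}))
print(empty.score, wrong.score, right.score)  # 0.0 0.5 1.0
```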

Scoring Strategies

Not every eval is pass/fail. Three strategies, escalating in nuance:

Binary Pass/Fail

The simplest. Did the file exist? Did the test pass?

def check_file_exists(workspace):
    if os.path.exists(os.path.join(workspace, "output.txt")):
        return EvalResult(True, 1.0, "File created")
    return EvalResult(False, 0.0, "File missing")
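
The "did the test pass?" variant can be sketched the same way: run a test command inside the workspace and map its exit code to pass or fail. This example assumes the workspace contains unittest-style files named test_*.py; check_tests_pass, make_test_workspace, and the 60-second timeout are illustrative choices, not part of the harness above.

```python
import os
import subprocess
import sys
import tempfile
from dataclasses import dataclass


@dataclass
class EvalResult:
    passed: bool
    score: float
    details: str


def check_tests_pass(workspace, timeout=60):
    """Binary check: pass only if unittest discovery exits with code 0."""
    try:
        proc = subprocess.run(
            [sys.executable, "-m", "unittest", "discover", "-s", workspace],
            capture_output=True, text=True, cwd=workspace, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return EvalResult(False, 0.0, "Tests timed out")
    # Caveat: on some Python versions a directory with zero tests also exits 0,
    # so a stricter checker might first confirm that test files exist.
    if proc.returncode == 0:
        return EvalResult(True, 1.0, "Tests passed")
    return EvalResult(False, 0.0, f"Tests failed: {proc.stderr[-200:]}")


def make_test_workspace(test_source):
    """Hypothetical helper: temp workspace holding a single test file."""
    workspace = tempfile.mkdtemp()
    with open(os.path.join(workspace, "test_sample.py"), "w") as f:
        f.write(test_source)
    return workspace


PASSING = (
    "import unittest\n\n"
    "class T(unittest.TestCase):\n"
    "    def test_ok(self):\n"
    "        self.assertEqual(1 + 1, 2)\n"
)
FAILING = PASSING.replace("2)", "3)")  # same test with a wrong expectation

good = check_tests_pass(make_test_workspace(PASSING))
bad = check_tests_pass(make_test_workspace(FAILING))
print(good.passed, bad.passed)
```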

Partial Credit

Award points for each criterion met. This catches agents that get close but not all the way:

def check_refactor(workspace):
    path = os.path.join(workspace, "math_utils.py")
    if not os.path.exists(path):
        return EvalResult(False, 0.0, "File not found")

    with open(path) as f:
        content = f.read()
    score = 0.0
    details = []

    # Criterion 1: function exists
    if "def calculate_average" in content:
        score += 0.25
        details.append("PASS: function exists")
    else:
        details.append("FAIL: function missing")

    # Criterion 2: has type hints
    if "def calculate_average(numbers: list" in content:
        score += 0.25
        details.append("PASS: type hints present")
    else:
        details.append("FAIL: no type hints")

    # Criterion 3: has docstring
    if '"""' in content or "'''" in content:
        score += 0.25
        details.append("PASS: docstring present")
    else:
        details.append("FAIL: no docstring")

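    # Criterion 4: actually runs and returns the right value (requires `import subprocess`)
    try:
        result = subprocess.run(
            ["python", "-c", "import math_utils; print(math_utils.calculate_average([1, 2, 3]))"],
            capture_output=True, text=True, cwd=workspace, timeout=30,
        )
        ok, output, error = result.returncode == 0, result.stdout.strip(), result.stderr.strip()
    except subprocess.TimeoutExpired:
        ok, output, error = False, "", "timed out after 30s"
    # Exact match: a substring test would also accept outputs like "12" or "0.2"
    if ok and output in ("2", "2.0"):
        score += 0.25
        details.append("PASS: correct output")
    elif not ok:
        details.append(f"FAIL: runtime error: {error[:100]}")
    else:
        details.append(f"FAIL: wrong output: {output[:100]!r}")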

    return EvalResult(score >= 0.75, score, "; ".join(details))

The key insight: criterion 4 actually runs the generated code. Checking string content tells you the agent wrote something that looks right. Running it tells you it is right.
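
If executing the code matters more than how it looks, one possible variation is to weight the criteria unequally. The helper below is a sketch rather than part of the harness above: score_criteria, its tuple format, and the example weights are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    passed: bool
    score: float
    details: str


def score_criteria(criteria, pass_threshold=0.75):
    """Combine (label, weight, met) tuples into one EvalResult.

    Weights are relative and need not sum to 1; the score is the
    earned weight divided by the total weight.
    """
    total_weight = sum(weight for _, weight, _ in criteria)
    if total_weight <= 0:
        return EvalResult(False, 0.0, "No criteria defined")
    earned = sum(weight for _, weight, met in criteria if met)
    score = earned / total_weight
    details = "; ".join(
        f"{'PASS' if met else 'FAIL'}: {label}" for label, _, met in criteria
    )
    return EvalResult(score >= pass_threshold, round(score, 3), details)


# The execution check counts double, so a missing docstring alone does not fail the case.
example = score_criteria([
    ("function exists", 1, True),
    ("type hints present", 1, True),
    ("docstring present", 1, False),
    ("correct output", 2, True),
])
print(example.score, example.passed)  # 0.8 True
```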

Rubric-Based Scoring with LLM Judge

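For subjective qualities like “is this code clean?”, use a second LLM call as a judge. The example assumes client is an already configured API client and MODEL is the name of the judge model: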

def llm_judge(workspace, criteria: str) -> EvalResult:
    with open(os.path.join(workspace, "solution.py")) as f:
        content = f.read()
    response = client.messages.create(
        model=MODEL,
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Score this code from 0.0 to 1.0 on: {criteria}

Code:
{content}

Respond with JSON: {{"score": float, "reasoning": str}}"""
        }],
    )
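    # Judges sometimes return malformed JSON or out-of-range scores;
    # treat unparseable output as a failed judgment and clamp the score
    try:
        result = json.loads(response.content[0].text)
        score = min(1.0, max(0.0, float(result["score"])))
    except (ValueError, KeyError, TypeError) as e:
        return EvalResult(False, 0.0, f"Judge output unusable: {e}")
    return EvalResult(score >= 0.7, score, str(result.get("reasoning", "")))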

Use LLM judges sparingly. They add cost, latency, and their own non-determinism. Prefer deterministic checks when possible.
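
Part of that non-determinism shows up in the reply format: a judge may wrap its JSON in prose or a code fence, or return a score outside the requested range. A defensive parser might look like the sketch below. parse_judge_reply is a hypothetical helper, and the regex simply takes everything from the first "{" to the last "}", which is a heuristic and not a full JSON extractor.

```python
import json
import re


def parse_judge_reply(text):
    """Return (score, reasoning) from a judge reply, or None if unusable."""
    # Take the span from the first "{" to the last "}" so that surrounding
    # prose or code fences are ignored.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
        score = float(data["score"])
    except (ValueError, KeyError, TypeError):
        return None
    # Clamp so an out-of-range reply cannot distort the suite average.
    score = min(1.0, max(0.0, score))
    return score, str(data.get("reasoning", ""))


fenced_reply = 'Sure, here it is:\n```json\n{"score": 0.8, "reasoning": "clean"}\n```'
print(parse_judge_reply(fenced_reply))
print(parse_judge_reply("I cannot score this."))
```

Returning None instead of raising lets the caller decide whether an unusable judgment counts as a failure or should be retried.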

Running Evals at Scale

A single eval tells you very little. You need a suite that runs many cases and tracks results over time.
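
Because the same prompt can produce different runs, a single pass or fail per case is also noisy. One common response, sketched here as an assumption and not as part of the harness in this section, is to repeat each case several times and report a pass rate. run_trials and flaky_run are hypothetical names; flaky_run stands in for a call such as run_eval(case, agent_fn).

```python
import random
from dataclasses import dataclass


@dataclass
class EvalResult:
    passed: bool
    score: float
    details: str


def run_trials(run_once, trials=5):
    """Call run_once() `trials` times and summarize pass rate and mean score."""
    if trials < 1:
        raise ValueError("trials must be at least 1")
    results = [run_once() for _ in range(trials)]
    passes = sum(1 for r in results if r.passed)
    return {
        "trials": trials,
        "pass_rate": passes / trials,
        "mean_score": sum(r.score for r in results) / trials,
    }


# Stand-in for a flaky agent that succeeds roughly 70% of the time.
# Seeded so repeated runs of this sketch give the same summary.
rng = random.Random(0)


def flaky_run():
    ok = rng.random() < 0.7
    return EvalResult(ok, 1.0 if ok else 0.0, "ok" if ok else "failed")


summary = run_trials(flaky_run, trials=20)
print(summary)
```

The trade-off is cost: every extra trial multiplies token spend and wall-clock time, so a small number of repeats on the least stable cases is often a reasonable starting point.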

Parallel Execution

from concurrent.futures import ThreadPoolExecutor
import json, time

def run_suite(cases: list[EvalCase], agent_fn, workers: int = 4) -> dict:
    results = {}
    start = time.time()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(run_eval, case, agent_fn): case.name
            for case in cases
        }
        for future in futures:
            name = futures[future]
            try:
                results[name] = future.result(timeout=300)
            except Exception as e:
                results[name] = EvalResult(False, 0.0, f"Error: {e}")

    elapsed = time.time() - start
    passed = sum(1 for r in results.values() if r.passed)
    # Average score across cases; guard against an empty suite (ZeroDivisionError)
    total_score = sum(r.score for r in results.values()) / len(results) if results else 0.0

    return {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "elapsed_seconds": round(elapsed, 1),
        "total": len(cases),
        "passed": passed,
        "failed": len(cases) - passed,
        "avg_score": round(total_score, 3),
        "results": {
            name: {"passed": r.passed, "score": r.score, "details": r.details}
            for name, r in results.items()
        },
    }

JSON Reports and Regression Detection

Save each run’s results. Compare against a baseline to catch regressions:

def save_report(report: dict, path: str = "eval_results.json"):
    with open(path, "w") as f:
        json.dump(report, f, indent=2)

def check_regressions(report: dict, baseline_path: str = "baseline.json") -> list[str]:
    if not os.path.exists(baseline_path):
        return []

    with open(baseline_path) as f:
        baseline = json.load(f)
    regressions = []

    for name, result in report["results"].items():
        if name in baseline["results"]:
            old_score = baseline["results"][name]["score"]
            new_score = result["score"]
            if new_score < old_score - 0.1:  # tolerate drops of up to 0.1 on the 0-1 scale
                regressions.append(
                    f"REGRESSION: {name} dropped from {old_score} to {new_score}"
                )

    return regressions
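
To make the threshold concrete, here is a small worked example on in-memory data. find_regressions is a hypothetical pure-dict variant of the comparison above (no file I/O), and the case names and scores are made up for illustration.

```python
def find_regressions(baseline_results, new_results, tolerance=0.1):
    """Return names of cases whose score fell by more than `tolerance`."""
    regressions = []
    for name, result in new_results.items():
        old = baseline_results.get(name)
        if old is None:
            continue  # new case: nothing to compare against yet
        if result["score"] < old["score"] - tolerance:
            regressions.append(name)
    return regressions


baseline = {
    "hello-world": {"score": 1.0},
    "refactor": {"score": 0.75},
    "summarize": {"score": 0.9},
}
current = {
    "hello-world": {"score": 1.0},   # unchanged
    "refactor": {"score": 0.5},      # dropped by 0.25: flagged
    "summarize": {"score": 0.85},    # dropped by 0.05: within tolerance
    "new-case": {"score": 0.2},      # not in the baseline: skipped
}
print(find_regressions(baseline, current))  # ['refactor']
```

Note that this comparison, like the one above, says nothing about cases that exist in the baseline but are missing from the new report; if cases can be removed or renamed, that is worth checking separately.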

CI Integration

Wire the suite into your CI pipeline. A failing eval blocks the merge, just like a failing test:

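import sys

if __name__ == "__main__":
    # refactor_eval, summarize_eval, and my_agent are assumed to be defined
    # elsewhere, following the same pattern as hello_eval
    cases = [hello_eval, refactor_eval, summarize_eval]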
    report = run_suite(cases, agent_fn=my_agent)
    save_report(report)

    regressions = check_regressions(report)
    if regressions:
        for r in regressions:
            print(f"  {r}")
        sys.exit(1)

    print(f"Evals passed: {report['passed']}/{report['total']}")
    print(f"Average score: {report['avg_score']}")
    sys.exit(0 if report['failed'] == 0 else 1)

Now python run_evals.py returns exit code 0 on success, 1 on failure. Any CI system knows what to do with that.

What Changed From Worktree + Task Isolation

Component	Worktree + Task Isolation	Agent Evals
Focus	Building the agent	Measuring the agent
Workspace	Git worktree per task	Temp directory per eval
Success criteria	Task marked complete	Checker function returns score
Isolation purpose	Prevent agents from interfering	Prevent evals from leaking state
Output	Merged branch	JSON report with scores
Feedback loop	Agent reports to lead	Eval suite reports to developer

Key Takeaway

Evals close the loop. Without them, every change to your agent is a guess — you hope it got better, you assume nothing broke. With an eval harness, you know. Define your scenarios, write your checkers, run the suite, and read the scores. The agent is only as good as your ability to measure it. Now you can measure it.
