13. Agent Evals

"You can't improve what you can't measure"

25 min read
πŸ’‘New to this?

What is an agent eval?

A structured test that measures how well your agent performs a task. Unlike unit tests that check one function, evals measure end-to-end behavior: did the agent use the right tools, complete the task, stay within budget?

Why can't you just test agents like normal code?

Because agents are non-deterministic β€” the same prompt can produce different tool call sequences. Evals handle this by checking outcomes (did the file get created correctly?) rather than exact steps (did it call write_file on line 3?).

What is a scoring rubric?

A set of criteria that define success. For example: 'file exists' (pass/fail), 'file contains correct function' (pass/fail), 'completed in under 5 tool calls' (efficiency score). The rubric turns subjective quality into measurable numbers.

The Problem

You built an agent. It has a loop, tools, planning, subagents, skills, context management, tasks, background execution, teams, protocols, autonomy, and isolation. You can watch it work and it looks impressive. But is it actually good?

You can’t answer that question by watching. Agents are non-deterministic β€” the same prompt produces different tool call sequences on different runs. A change to your system prompt might improve file creation but silently break refactoring. You won’t notice until a user does, and by then the damage is done.

Traditional unit tests don’t help. You can’t assert that the agent called write_file on turn 3, because tomorrow it might call bash on turn 2 and get the same result. You need tests that check outcomes, not steps.

The Solution

An eval harness. Define scenarios with known-good outcomes, run your agent in a sandbox, check what it produced, and score the results. Run the same suite every time you change the agent. Catch regressions before they ship.

Define scenario  β†’  Run agent in sandbox  β†’  Check outcomes  β†’  Score  β†’  Report

Building the Eval Harness

The core is two dataclasses and one function.

from dataclasses import dataclass
from typing import Callable
import tempfile, os

@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], "EvalResult"]
    max_turns: int = 20
    max_tokens: int = 50000

@dataclass
class EvalResult:
    passed: bool
    score: float   # 0.0 to 1.0
    details: str

EvalCase is the input. EvalResult is the output. Every eval, no matter how complex, conforms to this interface. The max_turns and max_tokens limits keep a runaway agent from burning through your token budget during testing.

The runner creates an isolated workspace, executes the agent, and hands the workspace to the checker:

def run_eval(case: EvalCase, agent_fn) -> EvalResult:
    workspace = tempfile.mkdtemp()
    messages = [{"role": "user", "content": case.prompt}]
    turns = 0
    total_tokens = 0

    while turns < case.max_turns:
        response = agent_fn(messages)
        total_tokens += response.usage.input_tokens + response.usage.output_tokens
        if total_tokens > case.max_tokens:
            return EvalResult(False, 0.0, "Token budget exceeded")
        if response.stop_reason != "tool_use":
            break
        execute_tools(response, messages, cwd=workspace)
        turns += 1

    return case.check(workspace)

The cwd=workspace parameter is critical. Every tool call executes inside the temp directory. The agent can create files, run commands, and modify state β€” all contained to that workspace. When the eval finishes, you inspect what’s there.
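
execute_tools itself comes from the earlier agent-loop chapters, so it isn't repeated in full here. What matters for the sandbox is that every command and path it touches resolves against cwd, never the process's own working directory. A minimal sketch, assuming the Anthropic-style response blocks used above and just two illustrative tools (bash and write_file):

import os
import subprocess

def execute_tools(response, messages, cwd):
    # Run each tool call inside `cwd` and append the results to the conversation.
    tool_results = []
    for block in response.content:
        if block.type != "tool_use":
            continue
        if block.name == "bash":
            # Shell commands execute with the workspace as their working directory.
            proc = subprocess.run(
                block.input["command"], shell=True,
                capture_output=True, text=True, cwd=cwd, timeout=60,
            )
            output = (proc.stdout + proc.stderr)[:2000]
        elif block.name == "write_file":
            # File paths resolve relative to the workspace, never the host cwd.
            target = os.path.join(cwd, block.input["path"])
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "w") as f:
                f.write(block.input["content"])
            output = "File written"
        else:
            output = f"Unknown tool: {block.name}"
        tool_results.append(
            {"type": "tool_result", "tool_use_id": block.id, "content": output}
        )
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})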

A concrete checker:

def check_hello_world(workspace):
    path = os.path.join(workspace, "hello.py")
    if not os.path.exists(path):
        return EvalResult(False, 0.0, "hello.py not found")
    content = open(path).read()
    if "print" in content and "Hello" in content:
        return EvalResult(True, 1.0, "Correct")
    return EvalResult(False, 0.5, "File exists but content wrong")

hello_eval = EvalCase(
    name="hello-world",
    prompt="Create a file called hello.py that prints 'Hello, World!'",
    check=check_hello_world,
)
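
The harness and the case compose directly. A quick smoke run, where my_agent stands in for whatever agent loop you built in the earlier chapters:

result = run_eval(hello_eval, agent_fn=my_agent)
print(f"{hello_eval.name}: passed={result.passed}, score={result.score} ({result.details})")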

Scoring Strategies

Not every eval is pass/fail. Three strategies, escalating in nuance:

Binary Pass/Fail

The simplest. Did the file exist? Did the test pass?

def check_file_exists(workspace):
    if os.path.exists(os.path.join(workspace, "output.txt")):
        return EvalResult(True, 1.0, "File created")
    return EvalResult(False, 0.0, "File missing")

Partial Credit

Award points for each criterion met. This catches agents that get close but not all the way:

import subprocess

def check_refactor(workspace):
    path = os.path.join(workspace, "math_utils.py")
    if not os.path.exists(path):
        return EvalResult(False, 0.0, "File not found")

    content = open(path).read()
    score = 0.0
    details = []

    # Criterion 1: function exists
    if "def calculate_average" in content:
        score += 0.25
        details.append("PASS: function exists")
    else:
        details.append("FAIL: function missing")

    # Criterion 2: has type hints
    if "def calculate_average(numbers: list" in content:
        score += 0.25
        details.append("PASS: type hints present")
    else:
        details.append("FAIL: no type hints")

    # Criterion 3: has docstring
    if '"""' in content or "'''" in content:
        score += 0.25
        details.append("PASS: docstring present")
    else:
        details.append("FAIL: no docstring")

    # Criterion 4: actually runs
    result = subprocess.run(
        ["python", "-c", f"import math_utils; print(math_utils.calculate_average([1,2,3]))"],
        capture_output=True, text=True, cwd=workspace
    )
    if result.returncode == 0 and "2" in result.stdout:
        score += 0.25
        details.append("PASS: correct output")
    else:
        details.append(f"FAIL: runtime error: {result.stderr[:100]}")

    return EvalResult(score >= 0.75, score, "; ".join(details))

The key insight: criterion 4 actually runs the generated code. Checking string content tells you the agent wrote something that looks right. Running it tells you it is right.

Rubric-Based Scoring with LLM Judge

For subjective qualities like β€œis this code clean?”, use a second LLM call as a judge:

import json

def llm_judge(workspace, criteria: str) -> EvalResult:
    content = open(os.path.join(workspace, "solution.py")).read()
    response = client.messages.create(
        model=MODEL,
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Score this code from 0.0 to 1.0 on: {criteria}

Code:
{content}

Respond with JSON: {{"score": float, "reasoning": str}}"""
        }],
    )
    result = json.loads(response.content[0].text)
    return EvalResult(
        result["score"] >= 0.7,
        result["score"],
        result["reasoning"],
    )

Use LLM judges sparingly. They add cost, latency, and their own non-determinism. Prefer deterministic checks when possible.
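
One way to keep a judge honest is to gate it behind deterministic checks: only spend the judge call once the cheap checks pass, and blend the scores so a flaky judge can never turn a broken solution into a passing one. A sketch, with arbitrary weights and the same solution.py convention as llm_judge above:

def check_with_judge(workspace):
    # Deterministic gate: the file must exist and compile before we pay for a judge call.
    path = os.path.join(workspace, "solution.py")
    if not os.path.exists(path):
        return EvalResult(False, 0.0, "solution.py not found")
    compiled = subprocess.run(
        ["python", "-m", "py_compile", "solution.py"],
        capture_output=True, text=True, cwd=workspace,
    )
    if compiled.returncode != 0:
        return EvalResult(False, 0.25, f"Does not compile: {compiled.stderr[:100]}")

    # Subjective half: ask the judge about readability only.
    judged = llm_judge(workspace, "readability and clear naming")

    # Blend: the deterministic half is already earned; the judge fills in the rest.
    score = 0.5 + 0.5 * judged.score
    return EvalResult(score >= 0.75, score, f"compiles; judge says: {judged.details}")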

Running Evals at Scale

One eval tells you nothing. You need a suite that runs many cases and tracks results over time.

Parallel Execution

from concurrent.futures import ThreadPoolExecutor
import json, time

def run_suite(cases: list[EvalCase], agent_fn, workers: int = 4) -> dict:
    results = {}
    start = time.time()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(run_eval, case, agent_fn): case.name
            for case in cases
        }
        for future in futures:
            name = futures[future]
            try:
                results[name] = future.result(timeout=300)
            except Exception as e:
                results[name] = EvalResult(False, 0.0, f"Error: {e}")

    elapsed = time.time() - start
    passed = sum(1 for r in results.values() if r.passed)
    total_score = sum(r.score for r in results.values()) / len(results)

    return {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "elapsed_seconds": round(elapsed, 1),
        "total": len(cases),
        "passed": passed,
        "failed": len(cases) - passed,
        "avg_score": round(total_score, 3),
        "results": {
            name: {"passed": r.passed, "score": r.score, "details": r.details}
            for name, r in results.items()
        },
    }

JSON Reports and Regression Detection

Save each run’s results. Compare against a baseline to catch regressions:

def save_report(report: dict, path: str = "eval_results.json"):
    with open(path, "w") as f:
        json.dump(report, f, indent=2)

def check_regressions(report: dict, baseline_path: str = "baseline.json") -> list[str]:
    if not os.path.exists(baseline_path):
        return []

    baseline = json.load(open(baseline_path))
    regressions = []

    for name, result in report["results"].items():
        if name in baseline["results"]:
            old_score = baseline["results"][name]["score"]
            new_score = result["score"]
            if new_score < old_score - 0.1:  # allow up to a 0.1 score drop before flagging
                regressions.append(
                    f"REGRESSION: {name} dropped from {old_score} to {new_score}"
                )

    return regressions
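
The baseline itself comes from a run you trust. One simple convention, sketched here, is to promote a good report by copying it into place; every later run is then measured against that snapshot:

import shutil

def promote_baseline(report_path: str = "eval_results.json",
                     baseline_path: str = "baseline.json"):
    # Call this after a run you're happy with; future runs regress against it.
    shutil.copyfile(report_path, baseline_path)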

CI Integration

Wire the suite into your CI pipeline. A failing eval blocks the merge, just like a failing test:

import sys

if __name__ == "__main__":
    cases = [hello_eval, refactor_eval, summarize_eval]
    report = run_suite(cases, agent_fn=my_agent)
    save_report(report)

    regressions = check_regressions(report)
    if regressions:
        for r in regressions:
            print(f"  {r}")
        sys.exit(1)

    print(f"Evals passed: {report['passed']}/{report['total']}")
    print(f"Average score: {report['avg_score']}")
    sys.exit(0 if report['failed'] == 0 else 1)

Now python run_evals.py returns exit code 0 on success, 1 on failure. Any CI system knows what to do with that.

What Changed From Worktree + Task Isolation

| Component | Worktree + Task Isolation | Agent Evals |
| --- | --- | --- |
| Focus | Building the agent | Measuring the agent |
| Workspace | Git worktree per task | Temp directory per eval |
| Success criteria | Task marked complete | Checker function returns score |
| Isolation purpose | Prevent agents from interfering | Prevent evals from leaking state |
| Output | Merged branch | JSON report with scores |
| Feedback loop | Agent reports to lead | Eval suite reports to developer |

Key Takeaway

Evals close the loop. Without them, every change to your agent is a guess β€” you hope it got better, you assume nothing broke. With an eval harness, you know. Define your scenarios, write your checkers, run the suite, and read the scores. The agent is only as good as your ability to measure it. Now you can measure it.

πŸ§ͺ Try it yourself
πŸ”₯ Warm-up ~5 min

Predict: if you run the same eval 10 times, will the agent get the same score every time? Why or why not?

Hint

Think about temperature, non-deterministic tool execution order, and network timing.

πŸ”¨ Build ~20 min

Write 3 eval cases for a file-manipulation agent: (1) create a file, (2) read and summarize a file, (3) refactor a function. Include scoring rubrics.

Hint

Use subprocess to run the generated code and check if it actually works, not just if it looks right.

πŸš€ Stretch ~45 min

Build an eval suite that runs N cases in parallel, collects scores into a JSON report, and flags regressions when scores drop below a baseline.

Hint

Use concurrent.futures.ThreadPoolExecutor and compare against a saved baseline.json
