15. Observability & Debugging

"When the agent derails at 2am, logs are your only witness"

25 min read
💡 New to this?

What is structured logging?

Instead of printing plain text like 'tool called', you log JSON objects with fields: timestamp, event type, tool name, input, output, duration, token count. This makes logs searchable and analyzable.
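For example, here is what that looks like in practice (field values and the `web_search` tool name are purely illustrative):

```python
import json
from datetime import datetime, timezone

# The same fact as print("tool called"), but as queryable fields.
entry = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "event_type": "tool_exec",
    "tool": "web_search",   # hypothetical tool name
    "duration_ms": 142,
    "tokens": 0,
}
line = json.dumps(entry)  # one JSON object per line -> JSONL
```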

What is a trace?

A trace follows one agent task from start to finish — every LLM call, tool execution, and decision. Think of it like a flight recorder for your agent. When something goes wrong, you replay the trace to see exactly what happened.

What is replay debugging?

Recording all LLM responses so you can re-run the agent without making real API calls. This lets you reproduce bugs deterministically and test fixes cheaply — no tokens spent during replay.

The Problem

You have a 30-turn agent session. It called 15 tools, compressed context twice (Context Compact), and eventually produced the wrong output. Where did it go wrong? Turn 7? Turn 22? Was it a bad tool result or a bad LLM decision?

Print statements do not scale here. A print("calling tool") line tells you nothing about which tool, what input it received, how long it took, or what the LLM was thinking when it chose that tool. Multiply this across an autonomous agent team (Autonomous Agents) and you are flying blind.

The Agent Evals session taught you to measure whether agents succeed. This session teaches you to understand why they fail.

The Solution

Structured tracing: wrap every decision point in the agent loop with a recorder that captures timestamped, typed events. Store them as JSONL — one JSON object per line, append-only, grep-friendly. Every LLM call, tool execution, error, and context compression becomes a searchable event.

Three capabilities emerge from this:

  1. Post-mortem analysis — read the trace file after a failure and see exactly what happened, in order.
  2. Performance profiling — which tools are slow? Which LLM calls burn the most tokens?
  3. Replay debugging — re-run the agent using recorded LLM responses, no API calls needed.

The Tracing System

The core is two pieces: a TraceEvent dataclass and an AgentTracer that writes events to disk.

import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path

@dataclass
class TraceEvent:
    timestamp: str
    event_type: str  # "llm_call", "tool_exec", "error", "compact"
    data: dict
    duration_ms: int = 0
    tokens: int = 0

class AgentTracer:
    def __init__(self, task_id: str):
        self.task_id = task_id
        self.events: list[TraceEvent] = []
        self.trace_file = Path(f".traces/{task_id}.jsonl")
        self.trace_file.parent.mkdir(parents=True, exist_ok=True)

    def record(self, event_type: str, data: dict,
               duration_ms: int = 0, tokens: int = 0):
        event = TraceEvent(
            # datetime.utcnow() is deprecated; use an explicit UTC timezone
            timestamp=datetime.now(timezone.utc).isoformat(),
            event_type=event_type, data=data,
            duration_ms=duration_ms, tokens=tokens,
        )
        self.events.append(event)
        with open(self.trace_file, "a") as f:
            f.write(json.dumps(asdict(event)) + "\n")

    def replay(self) -> list[TraceEvent]:
        """Read all events back from disk."""
        lines = self.trace_file.read_text().splitlines()
        return [TraceEvent(**json.loads(line)) for line in lines if line]

Every event has a type, a timestamp, a data payload, and optional performance fields. The JSONL format means you can grep "error" .traces/task_42.jsonl and immediately find failures.
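The same filtering works in a few lines of Python when grep is not enough; a sketch (the function name is ours) that parses each JSONL line and keeps one event type:

```python
import json

def events_of_type(path: str, event_type: str) -> list[dict]:
    """Like grep, but returns parsed events instead of raw lines."""
    with open(path) as f:
        return [event for event in map(json.loads, f)
                if event["event_type"] == event_type]
```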

Integrating with the Agent Loop

Here is the agent loop from The Agent Loop, now instrumented with tracing at every decision point.

import time

def agent_loop(prompt: str, tools: list, task_id: str) -> str:
    # `client` and `system_prompt` come from the setup in earlier sessions
    tracer = AgentTracer(task_id)
    messages = [{"role": "user", "content": prompt}]

    while True:
        # Record the LLM call
        t0 = time.time()
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=system_prompt,
            tools=tools,
            messages=messages,
        )
        duration = int((time.time() - t0) * 1000)
        tokens_used = response.usage.input_tokens + response.usage.output_tokens

        tracer.record("llm_call", {
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "stop_reason": response.stop_reason,
        }, duration_ms=duration, tokens=tokens_used)

        # Check for end_turn
        if response.stop_reason == "end_turn":
            final = next((b.text for b in response.content
                          if hasattr(b, "text")), "")
            tracer.record("end_turn", {"response_length": len(final)})
            return final

        # Record the assistant turn once, then run every requested tool
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                t0 = time.time()
                try:
                    result = execute_tool(block.name, block.input)
                    tool_duration = int((time.time() - t0) * 1000)
                    tracer.record("tool_exec", {
                        "tool": block.name,
                        "input": block.input,
                        "output_preview": str(result)[:200],
                    }, duration_ms=tool_duration)
                except Exception as e:
                    tracer.record("error", {
                        "tool": block.name,
                        "input": block.input,
                        "error": str(e),
                    })
                    result = f"Error: {e}"

                tool_results.append({"type": "tool_result",
                                     "tool_use_id": block.id,
                                     "content": str(result)})

        # One user message carries all tool results back to the model
        messages.append({"role": "user", "content": tool_results})

Four event types flow from the loop: an llm_call after every model response, a tool_exec after each tool run, an error when a tool throws, and an end_turn at completion. The output_preview field truncates tool output to 200 characters — enough to debug, not enough to bloat the trace file.

Trace Analysis

Raw JSONL files are useful for grep, but structured analysis reveals patterns across an entire run.

from collections import Counter

def trace_summary(task_id: str) -> dict:
    tracer = AgentTracer(task_id)
    events = tracer.replay()

    summary = {
        "total_events": len(events),
        "total_tokens": sum(e.tokens for e in events),
        "total_duration_ms": sum(e.duration_ms for e in events),
        "turns": sum(1 for e in events if e.event_type == "llm_call"),
        "errors": [e.data for e in events if e.event_type == "error"],
    }

    # Tool usage stats
    tool_events = [e for e in events if e.event_type == "tool_exec"]
    tool_names = [e.data["tool"] for e in tool_events]
    summary["tool_counts"] = dict(Counter(tool_names))
    summary["slowest_tool_calls"] = sorted(
        [{"tool": e.data["tool"], "duration_ms": e.duration_ms} for e in tool_events],
        key=lambda x: x["duration_ms"], reverse=True,
    )[:5]

    return summary

def print_trace_summary(task_id: str):
    s = trace_summary(task_id)
    print(f"Turns: {s['turns']}  |  Tokens: {s['total_tokens']}  |  Duration: {s['total_duration_ms']}ms")
    print(f"Tools used: {s['tool_counts']}")
    if s["errors"]:
        print(f"ERRORS ({len(s['errors'])}):")
        for err in s["errors"]:
            print(f"  - {err['tool']}: {err['error']}")
    print("Slowest calls:")
    for call in s["slowest_tool_calls"]:
        print(f"  {call['tool']}: {call['duration_ms']}ms")

This immediately answers debugging questions: Did the agent burn tokens on a loop? Did a tool consistently fail? Was one tool call responsible for most of the latency?
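The loop question in particular can be answered straight from the events. A sketch (function name and threshold are our choices, not part of the tracer): flag any tool called repeatedly with identical input, a common symptom of a stuck agent.

```python
import json
from collections import Counter

def repeated_tool_calls(events, threshold: int = 3) -> list[tuple[str, int]]:
    """Return (tool, count) pairs where the same tool ran on identical input."""
    calls = Counter(
        (e.data["tool"], json.dumps(e.data["input"], sort_keys=True))
        for e in events if e.event_type == "tool_exec"
    )
    return [(tool, count) for (tool, _input), count in calls.items()
            if count >= threshold]
```

Run it over `tracer.replay()` after a suspicious session; anything it returns is worth reading the trace around.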

Replay Debugging

The most powerful capability: record LLM responses during a live run, then replay them without spending tokens.

class RecordingClient:
    """Wraps the real client and records every response."""
    def __init__(self, real_client, tracer: AgentTracer):
        self.real_client = real_client
        self.tracer = tracer
        self.messages = self  # lets callers keep using client.messages.create(...)

    def create(self, **kwargs):
        response = self.real_client.messages.create(**kwargs)
        # Store the full response for replay
        self.tracer.record("llm_response", {
            "content": [block_to_dict(b) for b in response.content],
            "stop_reason": response.stop_reason,
            "usage": {"input": response.usage.input_tokens,
                      "output": response.usage.output_tokens},
        })
        return response


class ReplayClient:
    """Serves recorded responses instead of calling the API."""
    def __init__(self, task_id: str):
        tracer = AgentTracer(task_id)
        events = tracer.replay()
        self.responses = [
            e.data for e in events if e.event_type == "llm_response"
        ]
        self.index = 0
        self.messages = self  # same client.messages.create(...) interface

    def create(self, **kwargs):
        if self.index >= len(self.responses):
            raise RuntimeError("Replay exhausted — agent took a different path")
        data = self.responses[self.index]
        self.index += 1
        return MockResponse(data)
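The block_to_dict and MockResponse helpers are not shown above. One minimal sketch, assuming content blocks only carry the fields the agent loop actually reads (type, text, name, input, id):

```python
from types import SimpleNamespace

def block_to_dict(block) -> dict:
    """Keep only the JSON-safe fields the agent loop reads from a block."""
    return {attr: getattr(block, attr)
            for attr in ("type", "text", "name", "input", "id")
            if hasattr(block, attr)}

class MockResponse:
    """Rebuild a response-shaped object from a recorded llm_response event."""
    def __init__(self, data: dict):
        self.content = [SimpleNamespace(**b) for b in data["content"]]
        self.stop_reason = data["stop_reason"]
        self.usage = SimpleNamespace(input_tokens=data["usage"]["input"],
                                     output_tokens=data["usage"]["output"])
```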

The workflow: run once with RecordingClient to capture the trace. When a bug surfaces, swap in ReplayClient and re-run. The agent receives identical LLM responses, so it makes identical tool calls. Now you can add print statements, step through with a debugger, or test a fix — all without spending a single token.

If the agent diverges during replay (different tool call than expected), the replay raises an error. That divergence itself is a signal: it means your fix changed the agent's behavior at that exact point.
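You can also make that check explicit by diffing two runs' tool calls; a sketch (function name is ours) over parsed event dicts in the shape AgentTracer writes:

```python
def first_divergence(events_a: list[dict], events_b: list[dict]):
    """Index of the first differing tool call between two runs, or None."""
    def tool_calls(events):
        return [(e["data"]["tool"], repr(e["data"]["input"]))
                for e in events if e["event_type"] == "tool_exec"]
    a, b = tool_calls(events_a), tool_calls(events_b)
    for i, (call_a, call_b) in enumerate(zip(a, b)):
        if call_a != call_b:
            return i
    # One run made extra calls: it diverged where the shorter one stopped.
    return None if len(a) == len(b) else min(len(a), len(b))
```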

What Changed From Guardrails

| Component | Before (Guardrails) | After (Observability) |
| --- | --- | --- |
| Error handling | Block dangerous actions | Record every action for analysis |
| Failure mode | Prevent bad outcomes | Diagnose why bad outcomes happened |
| Logging | Ad-hoc print statements | Structured JSONL with typed events |
| Debugging | Re-run and hope to reproduce | Replay exact LLM responses deterministically |
| Performance | Not tracked | Duration and token counts per event |
| Scope | Single-turn validation | Full trace across all turns |

Key Takeaway

An agent without observability is a black box. Structured tracing turns it into a glass box — every LLM call, tool execution, and error is recorded with timestamps and performance data. Replay debugging eliminates the most frustrating part of agent development: non-reproducible failures. Record once, replay forever, fix with confidence. This is the foundation for running agents in production, where the 2am failure needs to be debugged at 9am from a trace file alone.

🧪 Try it yourself
🔥 Warm-up ~5 min

Add tracer.record() calls to the agent loop from [The Agent Loop](/en/s01-the-agent-loop). Run a task and inspect the .traces/ JSONL file. What patterns do you notice?

Hint

Record events at: before LLM call, after LLM response, before tool exec, after tool exec.

🔨 Build ~20 min

Build a trace_summary() function that reads a trace file and prints: total turns, total tokens, total duration, most-used tool, and any errors.

Hint

Group events by event_type using collections.Counter.

🚀 Stretch ~45 min

Implement full replay debugging: record LLM responses during a live run, then create a MockClient that serves recorded responses. Verify the agent produces identical tool calls.

Hint

Replace client.messages.create with a function that pops from a list of recorded responses.
