15. Observability and Debugging

"When the agent goes off the rails at two in the morning, the logs are the only witness"

25 min read

💡 New to this topic?

What is structured logging?

Instead of printing plain text like 'tool called', you log JSON objects with fields: timestamp, event type, tool name, input, output, duration, token count. That makes the logs searchable and analyzable.
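To make the contrast concrete, here is a minimal sketch of one structured log line next to a plain print. The helper name structured_log, the tool name read_file, and the input are illustrative assumptions, not part of the lesson's code.

```python
import json
from datetime import datetime, timezone


def structured_log(event_type: str, **fields) -> str:
    """Build one JSON log line with a timestamp and an event type."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event_type": event_type,
        **fields,
    }
    return json.dumps(entry)


# Unstructured: nothing to filter or aggregate on.
print("tool called")

# Structured: every field can be searched or summed later.
line = structured_log(
    "tool_exec",
    tool="read_file",             # hypothetical tool name
    input={"path": "notes.txt"},  # hypothetical input
    duration_ms=12,
    tokens=0,
)
print(line)
```

Because each line parses back into a dict with json.loads, questions like "how long did read_file take?" become a field lookup rather than a text search.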

What is a trace?

A trace follows a single agent task from start to finish: every LLM call, tool execution, and decision. Think of it as an aircraft's black box. When something goes wrong, you replay the trace to see exactly what happened.

What is replay debugging?

You record every LLM response so the agent can be rerun without making real API calls. That lets you reproduce bugs deterministically and test fixes cheaply: a replay spends zero tokens.

The Problem

You have a 30-turn agent session. It called 15 tools, compacted its context twice, and in the end produced wrong output. Where did it go wrong? Turn 7? Turn 22? Was it a bad tool result or a bad LLM decision?

Print statements don't scale here. A line like print("calling tool") tells you nothing about which tool ran, what input it received, how long it took, or what the LLM was thinking when it chose that tool. Multiply that across a team of autonomous agents and you are flying blind.

The Agent Evals session taught you to measure whether agents succeed. This session teaches you to understand why they fail.

The Solution

Structured tracing: wrap every decision point in the agent loop with a recorder that captures timestamped, categorized events. Store them as JSONL: one JSON object per line, append-only, grep-friendly. Every LLM call, tool execution, error, and context compaction becomes a searchable event.

Three capabilities follow:

Post-mortem analysis: read the trace file after a failure and see exactly what happened, in order.
Performance profiling: which tools are slow? Which LLM calls burn the most tokens?
Replay debugging: rerun the agent with recorded LLM responses, no API calls needed.

The Tracing System

The core has two parts: a TraceEvent dataclass and an AgentTracer that writes events to disk.

import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path

@dataclass
class TraceEvent:
    timestamp: str
    event_type: str  # "llm_call", "tool_exec", "error", "compact"
    data: dict
    duration_ms: int = 0
    tokens: int = 0

class AgentTracer:
    def __init__(self, task_id: str):
        self.task_id = task_id
        self.events: list[TraceEvent] = []
        self.trace_file = Path(f".traces/{task_id}.jsonl")
        # parents=True so nested task ids such as "team/task_42" also work
        self.trace_file.parent.mkdir(parents=True, exist_ok=True)

    def record(self, event_type: str, data: dict,
               duration_ms: int = 0, tokens: int = 0):
        event = TraceEvent(
            # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC time
            timestamp=datetime.now(timezone.utc).isoformat(),
            event_type=event_type, data=data,
            duration_ms=duration_ms, tokens=tokens,
        )
        self.events.append(event)
        # Append-only: one JSON object per line
        with open(self.trace_file, "a", encoding="utf-8") as f:
            # default=str keeps one odd value in data from aborting the write
            f.write(json.dumps(asdict(event), default=str) + "\n")

    def replay(self) -> list[TraceEvent]:
        """Read all events back from disk."""
        if not self.trace_file.exists():
            return []
        lines = self.trace_file.read_text(encoding="utf-8").splitlines()
        # Skip blank lines so an empty file yields [] instead of a JSON error
        return [TraceEvent(**json.loads(line)) for line in lines if line.strip()]

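Every event has a type, a timestamp, a data payload, and optional performance fields. The JSONL format lets you run grep "error" .traces/task_42.jsonl and find failures immediately. To match only error events, rather than any line that happens to contain the word, search for the field itself: grep '"event_type": "error"' .traces/task_42.jsonl.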
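When grep is too blunt, the same lookup can be done in a few lines of Python. The sketch below assumes only the JSONL layout described above (one object per line with an event_type field); the function name find_events and the demo data are illustrative.

```python
import json
import tempfile
from pathlib import Path


def find_events(trace_path: Path, event_type: str) -> list[dict]:
    """Return every event of the given type from a JSONL trace file."""
    matches = []
    with open(trace_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            event = json.loads(line)
            if event.get("event_type") == event_type:
                matches.append(event)
    return matches


# Demo on a throwaway file with two hand-written events (illustrative data only).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "task_42.jsonl"
    demo.write_text(
        json.dumps({"event_type": "tool_exec",
                    "data": {"tool": "search", "input": {"q": "error codes"}}}) + "\n"
        + json.dumps({"event_type": "error",
                      "data": {"tool": "read_file", "error": "not found"}}) + "\n",
        encoding="utf-8",
    )
    errors = find_events(demo, "error")

print(errors)
```

A bare text search for the word error would match both demo lines, because the first one mentions "error codes" in its input. Filtering on the parsed field returns only the real failure.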

Integrating With the Agent Loop

Here is the agent loop from the Agent Loop session, now instrumented with tracing at every decision point.

import time

# client, system_prompt, and execute_tool are assumed to be defined
# as in the Agent Loop session.

def agent_loop(prompt: str, tools: list, task_id: str) -> str:
    tracer = AgentTracer(task_id)
    messages = [{"role": "user", "content": prompt}]

    while True:
        # Record the LLM call
        t0 = time.time()
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=system_prompt,
            tools=tools,
            messages=messages,
        )
        duration = int((time.time() - t0) * 1000)
        tokens_used = response.usage.input_tokens + response.usage.output_tokens

        tracer.record("llm_call", {
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "stop_reason": response.stop_reason,
        }, duration_ms=duration, tokens=tokens_used)

        # Any stop reason other than tool_use ends the loop (end_turn, max_tokens, ...);
        # otherwise a response without tool calls would spin forever
        if response.stop_reason != "tool_use":
            final = next((b.text for b in response.content if hasattr(b, "text")), "")
            tracer.record("end_turn", {"response_length": len(final),
                                       "stop_reason": response.stop_reason})
            return final

        # Append the assistant turn once, before processing its tool calls
        messages.append({"role": "assistant", "content": response.content})

        # Process tool calls, collecting all results into a single user message
        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            t0 = time.time()
            try:
                result = execute_tool(block.name, block.input)
                tool_duration = int((time.time() - t0) * 1000)
                tracer.record("tool_exec", {
                    "tool": block.name,
                    "input": block.input,
                    "output_preview": str(result)[:200],
                }, duration_ms=tool_duration)
            except Exception as e:
                tracer.record("error", {
                    "tool": block.name,
                    "input": block.input,
                    "error": str(e),
                }, duration_ms=int((time.time() - t0) * 1000))
                # Hand the error back to the model instead of crashing the loop
                result = f"Error: {e}"

            tool_results.append({"type": "tool_result",
                                 "tool_use_id": block.id,
                                 "content": str(result)})

        messages.append({"role": "user", "content": tool_results})

Four trace points: after each LLM call, when the turn ends, after each tool execution, and on errors. The output_preview field truncates tool output to 200 characters: enough for debugging, not enough to bloat the trace file.
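The timing boilerplate around each trace point can be factored out. The following is a sketch, not the lesson's implementation: a context manager named timed (a hypothetical helper) that measures the wrapped block and calls any recorder with the same signature as tracer.record. It uses time.perf_counter, which is monotonic, where the loop above uses time.time.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(record, event_type: str, data: dict):
    """Time the wrapped block, then call record(event_type, data, duration_ms=...)."""
    t0 = time.perf_counter()
    try:
        yield data  # the block may add fields such as output_preview
    except Exception as e:
        duration = int((time.perf_counter() - t0) * 1000)
        record("error", {**data, "error": str(e)}, duration_ms=duration)
        raise
    else:
        duration = int((time.perf_counter() - t0) * 1000)
        record(event_type, data, duration_ms=duration)


# Stand-in recorder for the demo; in the lesson this would be tracer.record.
recorded = []


def fake_record(event_type, data, duration_ms=0, tokens=0):
    recorded.append((event_type, data, duration_ms))


with timed(fake_record, "tool_exec", {"tool": "sleep"}) as data:
    time.sleep(0.01)
    data["output_preview"] = "done"

print(recorded)
```

Unlike the loop above, this helper re-raises after recording the error, so the caller still decides whether to feed the failure back to the model or abort.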

Trace Analysis

Raw JSONL files are handy for grep, but structured analysis reveals patterns across an entire run.

from collections import Counter

def trace_summary(task_id: str) -> dict:
    tracer = AgentTracer(task_id)
    events = tracer.replay()

    summary = {
        "total_events": len(events),
        "total_tokens": sum(e.tokens for e in events),
        "total_duration_ms": sum(e.duration_ms for e in events),
        "turns": sum(1 for e in events if e.event_type == "llm_call"),
        "errors": [e.data for e in events if e.event_type == "error"],
    }

    # Tool usage stats
    tool_events = [e for e in events if e.event_type == "tool_exec"]
    tool_names = [e.data["tool"] for e in tool_events]
    summary["tool_counts"] = dict(Counter(tool_names))
    summary["slowest_tool_calls"] = sorted(
        [{"tool": e.data["tool"], "duration_ms": e.duration_ms} for e in tool_events],
        key=lambda x: x["duration_ms"], reverse=True,
    )[:5]

    return summary

def print_trace_summary(task_id: str):
    s = trace_summary(task_id)
    print(f"Turns: {s['turns']}  |  Tokens: {s['total_tokens']}  |  Duration: {s['total_duration_ms']}ms")
    print(f"Tools used: {s['tool_counts']}")
    if s["errors"]:
        print(f"ERRORS ({len(s['errors'])}):")
        for err in s["errors"]:
            print(f"  - {err['tool']}: {err['error']}")
    print("Slowest calls:")
    for call in s["slowest_tool_calls"]:
        print(f"  {call['tool']}: {call['duration_ms']}ms")

This immediately answers debugging questions: did the agent burn tokens in a loop? Did a tool fail consistently? Was a single tool call responsible for most of the latency?
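The first of those questions, whether the agent got stuck in a loop, can be checked mechanically. The sketch below is one possible heuristic, assuming events are plain dicts as read from the JSONL file (for TraceEvent objects, dataclasses.asdict gives the same shape). The function name find_repeated_tool_calls and the threshold of three repeats are assumptions.

```python
import json


def find_repeated_tool_calls(events: list[dict], min_repeats: int = 3) -> list[dict]:
    """Flag runs of consecutive tool_exec events with the same tool and input."""
    runs = []
    prev_key, count = None, 0
    tool_events = [e for e in events if e["event_type"] == "tool_exec"]
    for e in tool_events + [None]:  # the None sentinel flushes the final run
        key = None
        if e is not None:
            key = (e["data"]["tool"],
                   json.dumps(e["data"].get("input"), sort_keys=True))
        if key is not None and key == prev_key:
            count += 1
            continue
        if prev_key is not None and count >= min_repeats:
            runs.append({"tool": prev_key[0],
                         "input": json.loads(prev_key[1]),
                         "repeats": count})
        prev_key, count = key, 1
    return runs


# Illustrative events: the same read_file call three times, then a different call.
events = [
    {"event_type": "llm_call", "data": {}},
    {"event_type": "tool_exec", "data": {"tool": "read_file", "input": {"path": "a.txt"}}},
    {"event_type": "llm_call", "data": {}},
    {"event_type": "tool_exec", "data": {"tool": "read_file", "input": {"path": "a.txt"}}},
    {"event_type": "llm_call", "data": {}},
    {"event_type": "tool_exec", "data": {"tool": "read_file", "input": {"path": "a.txt"}}},
    {"event_type": "tool_exec", "data": {"tool": "write_file", "input": {"path": "b.txt"}}},
]
loops = find_repeated_tool_calls(events)
print(loops)
```

Only back-to-back identical tool calls are counted here; a loop that alternates between two tools would need a different check.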

Replay Debugging

The most powerful capability: record LLM responses during a live run, then replay them without spending tokens.

class RecordingClient:
    """Wraps the real client and records every response."""
    def __init__(self, real_client, tracer: AgentTracer):
        self.real_client = real_client
        self.tracer = tracer
        # Expose .messages so client.messages.create(...) in agent_loop works unchanged
        self.messages = self

    def create(self, **kwargs):
        response = self.real_client.messages.create(**kwargs)
        # Store the full response for replay.
        # block_to_dict is a helper (not shown here) that turns a content block
        # into a plain, JSON-serializable dict.
        self.tracer.record("llm_response", {
            "content": [block_to_dict(b) for b in response.content],
            "stop_reason": response.stop_reason,
            "usage": {"input": response.usage.input_tokens,
                      "output": response.usage.output_tokens},
        })
        return response


class ReplayClient:
    """Serves recorded responses instead of calling the API."""
    def __init__(self, task_id: str):
        tracer = AgentTracer(task_id)
        events = tracer.replay()
        self.responses = [
            e.data for e in events if e.event_type == "llm_response"
        ]
        self.index = 0
        # Same drop-in shape as RecordingClient
        self.messages = self

    def create(self, **kwargs):
        if self.index >= len(self.responses):
            raise RuntimeError("Replay exhausted: agent took a different path")
        data = self.responses[self.index]
        self.index += 1
        # MockResponse is a helper (not shown here) that rebuilds
        # .content, .stop_reason, and .usage from the recorded dict.
        return MockResponse(data)

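The workflow: run once with RecordingClient to capture the trace. When a bug appears, swap in ReplayClient and rerun. The agent receives identical LLM responses, so it makes identical tool calls. Now you can add print statements, step through with a debugger, or test a fix, all without spending a single token. Give the replay run its own task_id, since the tracer appends to the trace file and would otherwise mix replay events into the recording.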
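The replay classes rely on two helpers, block_to_dict and MockResponse, that the lesson does not define. The sketch below shows one way they might look, assuming content blocks expose type, text, id, name, and input attributes and that the agent loop reads only content, stop_reason, and usage from a response. Treat it as an illustration under those assumptions, not the author's implementation.

```python
from types import SimpleNamespace


def block_to_dict(block) -> dict:
    """Convert a content block to a JSON-serializable dict (text and tool_use only)."""
    if block.type == "text":
        return {"type": "text", "text": block.text}
    if block.type == "tool_use":
        return {"type": "tool_use", "id": block.id,
                "name": block.name, "input": block.input}
    # Other block types are kept as a marker so the gap is visible in the trace.
    return {"type": block.type}


class MockResponse:
    """Rebuilds just the attributes the agent loop reads from a recorded dict."""
    def __init__(self, data: dict):
        self.content = [SimpleNamespace(**b) for b in data["content"]]
        self.stop_reason = data["stop_reason"]
        self.usage = SimpleNamespace(
            input_tokens=data["usage"]["input"],
            output_tokens=data["usage"]["output"],
        )


# Round trip with stand-in blocks (illustrative values, no API involved).
recorded = {
    "content": [
        block_to_dict(SimpleNamespace(type="text", text="Reading the file.")),
        block_to_dict(SimpleNamespace(type="tool_use", id="toolu_demo",
                                      name="read_file", input={"path": "a.txt"})),
    ],
    "stop_reason": "tool_use",
    "usage": {"input": 120, "output": 35},
}
response = MockResponse(recorded)
print(response.content[1].name, response.usage.input_tokens)
```

SimpleNamespace is enough here because the loop only does attribute access (block.type, block.name, hasattr(b, "text")); nothing is sent back to the real API during replay.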

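If the agent diverges during replay (a different tool call than expected), the replay raises an error; with the ReplayClient above, that happens once the agent asks for more LLM responses than were recorded. The divergence itself is a signal: it tells you that your fix changed the agent's behavior around that point.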
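To catch a divergence at the exact call where it happens, the recording can store a fingerprint of each request and the replay can compare against it. This is a sketch of that idea under stated assumptions: the names request_fingerprint and CheckedReplay are hypothetical, and it assumes the messages list serializes to the same JSON on both runs (non-JSON values fall back to str).

```python
import hashlib
import json


def request_fingerprint(messages: list) -> str:
    """Stable short hash of the conversation sent to the LLM."""
    payload = json.dumps(messages, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


class CheckedReplay:
    """Serves recorded responses, but only if the request matches the recording."""
    def __init__(self, recorded: list[dict]):
        # Each entry: {"fingerprint": str, "response": <recorded response data>}
        self.recorded = recorded
        self.index = 0

    def create(self, **kwargs):
        if self.index >= len(self.recorded):
            raise RuntimeError(f"Replay exhausted at LLM call {self.index}")
        entry = self.recorded[self.index]
        if request_fingerprint(kwargs["messages"]) != entry["fingerprint"]:
            raise RuntimeError(f"Replay diverged at LLM call {self.index}")
        self.index += 1
        return entry["response"]


# Demo: the first request matches the recording, so the response is served.
first = [{"role": "user", "content": "Summarize a.txt"}]
replay = CheckedReplay([
    {"fingerprint": request_fingerprint(first),
     "response": {"stop_reason": "end_turn"}},
])
ok = replay.create(messages=first)
print(ok)
```

One caveat with this design: tool outputs that vary between runs (timestamps, random ids) change the fingerprint and would be reported as divergence, so such tools may need to be stubbed or their outputs normalized first.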

What Changed From Guardrails

Component	Before (Guardrails)	After (Observability)
Error handling	Block dangerous actions	Record every action for analysis
Failure mode	Prevent bad outcomes	Diagnose why bad outcomes happened
Logging	Ad-hoc print statements	Structured JSONL with categorized events
Debugging	Rerun and hope it reproduces	Deterministic replay of the exact LLM responses
Performance	Not measured	Duration and token count per event
Scope	Single-turn validation	Full trace across all turns

Key Takeaway

An agent without observability is a black box. Structured tracing turns it into a glass box: every LLM call, tool execution, and error is recorded with timestamps and performance data. Replay debugging removes the most frustrating part of agent development: failures that cannot be reproduced. Record once, replay forever, fix with confidence. This is the foundation for running agents in production, where the 2 AM failure needs to be diagnosed at 9 AM from the trace file alone.

Interactive Code Walkthrough

Agent Tracing System

import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path

@dataclass
class TraceEvent:
    timestamp: str
    event_type: str  # "llm_call", "tool_exec", "error", "compact"
    data: dict
    duration_ms: int = 0
    tokens: int = 0

class AgentTracer:
    def __init__(self, task_id: str):
        self.task_id = task_id
        self.events: list[TraceEvent] = []
        self.trace_file = Path(f".traces/{task_id}.jsonl")
        self.trace_file.parent.mkdir(parents=True, exist_ok=True)

    def record(self, event_type: str, data: dict,
               duration_ms: int = 0, tokens: int = 0):
        event = TraceEvent(
            # datetime.utcnow() is deprecated since Python 3.12
            timestamp=datetime.now(timezone.utc).isoformat(),
            event_type=event_type, data=data,
            duration_ms=duration_ms, tokens=tokens,
        )
        self.events.append(event)
        with open(self.trace_file, "a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(event), default=str) + "\n")

    def replay(self) -> list[TraceEvent]:
        if not self.trace_file.exists():
            return []
        lines = self.trace_file.read_text(encoding="utf-8").splitlines()
        # Skip blank lines so an empty file yields [] instead of a JSON error
        return [TraceEvent(**json.loads(line)) for line in lines if line.strip()]

TraceEvent represents one thing that happened. event_type categorizes it (LLM call, tool execution, error, compaction). duration_ms and tokens enable performance analysis: which tools are slow? Which calls are expensive?
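As a small illustration of the "which tools are slow?" question, the sketch below aggregates duration_ms per tool. It assumes events are plain dicts as read from the JSONL file; the function name tool_latency_profile and the sample numbers are made up for the example.

```python
from collections import defaultdict


def tool_latency_profile(events: list[dict]) -> dict[str, dict]:
    """Aggregate call count, total duration, and mean duration per tool."""
    totals = defaultdict(lambda: {"calls": 0, "total_ms": 0})
    for e in events:
        if e["event_type"] != "tool_exec":
            continue
        stats = totals[e["data"]["tool"]]
        stats["calls"] += 1
        stats["total_ms"] += e["duration_ms"]
    return {
        tool: {**s, "mean_ms": s["total_ms"] / s["calls"]}
        for tool, s in totals.items()
    }


# Illustrative events only; real values come from the trace file.
events = [
    {"event_type": "tool_exec", "data": {"tool": "search"}, "duration_ms": 700},
    {"event_type": "tool_exec", "data": {"tool": "read_file"}, "duration_ms": 15},
    {"event_type": "llm_call", "data": {}, "duration_ms": 2100},
    {"event_type": "tool_exec", "data": {"tool": "search"}, "duration_ms": 200},
]
profile = tool_latency_profile(events)
print(profile)
```

A per-tool mean complements a list of the slowest individual calls: it separates a tool that is always slow from one that had a single bad call.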


🧪 Try It Yourself

🔥 Warm-up ~5 min

Add tracer.record() calls to the agent loop from [The Agent Loop](/he/s01-the-agent-loop). Run a task and inspect the JSONL file in .traces/. What patterns do you notice?

Hint

Record events at four points: before the LLM call, after the LLM response, before tool execution, and after tool execution.

🔨 Build ~20 min

Build a trace_summary() function that reads a trace file and prints: total turns, total tokens, total duration, the most-used tool, and any errors.

Hint

Group events by event_type using collections.Counter.

🚀 Challenge ~45 min

Implement full replay debugging: record LLM responses during a live run, then create a MockClient that serves the recorded responses. Verify that the agent produces identical tool calls.

Hint

Replace client.messages.create with a function that pulls from a list of recorded responses.
