The Problem

The context window is finite. A single read_file on a 1000-line file costs ~4000 tokens. After reading 30 files and running 20 bash commands, you hit 100,000+ tokens. The agent cannot work on large codebases without compression.
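
As a back-of-the-envelope check on those numbers, the sketch below assumes about 4 tokens per line (implied by the 1000-line, ~4000-token figure above) and a hypothetical average of 500 tokens of output per bash command. The second constant is an illustration, not a measurement.

```python
# Back-of-the-envelope context budget. Both constants are assumptions.
TOKENS_PER_LINE = 4           # implied by "1000 lines ≈ 4000 tokens"
TOKENS_PER_BASH_RESULT = 500  # hypothetical average, for illustration only


def estimate_history_tokens(files_read: int, lines_per_file: int, bash_calls: int) -> int:
    """Estimate tokens consumed by uncompressed tool results."""
    file_tokens = files_read * lines_per_file * TOKENS_PER_LINE
    bash_tokens = bash_calls * TOKENS_PER_BASH_RESULT
    return file_tokens + bash_tokens


total = estimate_history_tokens(files_read=30, lines_per_file=1000, bash_calls=20)
print(total)  # 130000
```

Under these assumptions the file reads alone account for 120,000 tokens, so the history passes the 100,000 mark before any assistant text is counted.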

The Solution

Three layers, increasing in aggressiveness:

Every turn:
+------------------+
| Tool call result |
+------------------+
        |
        v
[Layer 1: micro_compact]        (silent, every turn)
  Replace tool_result > 3 turns old
  with "[Previous: used {tool_name}]"
        |
        v
[Check: tokens > 50000?]
   |               |
  yes              no
   |               |
   v               +--- continue normally
[Layer 2: mid_compact]
  Summarize assistant messages
  Keep only last 5 tool results
   |
   v
[Check: tokens > 80000?]
   |               |
  yes              no
   |               +--- continue
   v
[Layer 3: hard_compact]
  Call LLM to write a dense summary
  Replace entire history with summary
  Inject <identity> reminder

How It Works

Layer 1 — Micro compaction runs silently every turn. Tool results older than 3 turns become one-line placeholders.

def micro_compact(messages: list) -> list:
    compacted = []
    for i, msg in enumerate(messages):
        if msg["role"] == "user" and isinstance(msg["content"], list):
            age = len(messages) - i
            if age > 6:  # older than 3 turns (user+assistant pairs)
                new_content = []
                for block in msg["content"]:
                    if block.get("type") == "tool_result":
                        tool_name = block.get("_tool_name", "tool")
                        new_content.append({
                            "type": "tool_result",
                            "tool_use_id": block["tool_use_id"],
                            "content": f"[Previous: used {tool_name}]",
                        })
                    else:
                        new_content.append(block)
                compacted.append({**msg, "content": new_content})
                continue
        compacted.append(msg)
    return compacted
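
The function above reads the tool name from a `_tool_name` key on each tool_result block, but how that key gets there is not shown. A tool_result block normally carries only a tool_use_id. One way to recover the name is sketched here, assuming assistant messages hold tool_use blocks as plain dicts with `id` and `name` fields. The helper names `tool_names_by_id` and `placeholder_for` are hypothetical.

```python
def tool_names_by_id(messages: list) -> dict:
    """Map each tool_use id to its tool name by scanning assistant messages."""
    names = {}
    for msg in messages:
        if msg["role"] != "assistant" or not isinstance(msg["content"], list):
            continue
        for block in msg["content"]:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                names[block["id"]] = block["name"]
    return names


def placeholder_for(block: dict, names: dict) -> dict:
    """Build the one-line placeholder for an old tool_result block."""
    tool_name = names.get(block["tool_use_id"], "tool")
    return {
        "type": "tool_result",
        "tool_use_id": block["tool_use_id"],
        "content": f"[Previous: used {tool_name}]",
    }


# A tiny made-up history: one file read and its (large) result.
history = [
    {"role": "user", "content": "Show me main.py"},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "tu_1", "name": "read_file",
         "input": {"path": "main.py"}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "tu_1",
         "content": "print('hello')\n" * 200},
    ]},
]

names = tool_names_by_id(history)
compacted_block = placeholder_for(history[2]["content"][0], names)
print(compacted_block["content"])  # [Previous: used read_file]
```

Looking the name up at compaction time avoids storing an extra key on blocks that are later sent back to the API.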

Layer 2 — Mid compaction triggers when token count exceeds 50,000. It keeps the system prompt, the most recent 5 tool results in full, and summarizes the rest.

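import json

def count_tokens(messages: list) -> int:
    # default=str keeps json.dumps from failing on non-dict content blocks
    text = json.dumps(messages, default=str)
    return len(text) // 4  # rough estimate: 4 chars ≈ 1 token

def maybe_compact(messages: list) -> list:
    # Layer 1 runs on every turn, as in the diagram above.
    messages = micro_compact(messages)
    # Layer 2: soft limit.
    if count_tokens(messages) > 50000:
        messages = mid_compact(messages)
    # Layer 3: only if the history is still too large after Layer 2.
    if count_tokens(messages) > 80000:
        messages = hard_compact(messages)
    return messages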
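
maybe_compact calls mid_compact, which is not shown. A minimal sketch of what it might look like follows. It assumes messages are plain dicts, that "summarizing" means truncating long assistant text rather than calling the model, and that the 200-character cutoff and the placeholder wording are illustrative choices. The system prompt is untouched because it is passed separately from the message list.

```python
KEEP_TOOL_RESULTS = 5  # from the text: keep the 5 most recent results in full
MAX_TEXT_CHARS = 200   # hypothetical cutoff for assistant text


def _shorten(text: str) -> str:
    """Crude stand-in for a summary: keep only the start of long text."""
    if len(text) <= MAX_TEXT_CHARS:
        return text
    return text[:MAX_TEXT_CHARS] + " [truncated]"


def mid_compact(messages: list) -> list:
    # Locate every tool_result as (message index, block index).
    positions = [
        (i, j)
        for i, msg in enumerate(messages)
        if msg["role"] == "user" and isinstance(msg["content"], list)
        for j, block in enumerate(msg["content"])
        if block.get("type") == "tool_result"
    ]
    keep = set(positions[-KEEP_TOOL_RESULTS:])

    compacted = []
    for i, msg in enumerate(messages):
        content = msg["content"]
        if msg["role"] == "assistant" and isinstance(content, str):
            compacted.append({**msg, "content": _shorten(content)})
        elif isinstance(content, list):
            new_content = []
            for j, block in enumerate(content):
                if block.get("type") == "tool_result" and (i, j) not in keep:
                    # Older result: drop the payload, keep the id for pairing.
                    block = {**block, "content": "[Previous: tool result dropped]"}
                elif msg["role"] == "assistant" and block.get("type") == "text":
                    block = {**block, "text": _shorten(block["text"])}
                new_content.append(block)
            compacted.append({**msg, "content": new_content})
        else:
            compacted.append(msg)
    return compacted


# Made-up history with 7 tool calls; only the last 5 results should survive.
history = []
for n in range(7):
    history.append({"role": "assistant", "content": [
        {"type": "text", "text": "thinking " * 100},
        {"type": "tool_use", "id": f"tu_{n}", "name": "bash", "input": {}},
    ]})
    history.append({"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": f"tu_{n}", "content": f"output {n}"},
    ]})

result = mid_compact(history)
print(result[1]["content"][0]["content"])   # [Previous: tool result dropped]
print(result[13]["content"][0]["content"])  # output 6
```

The tool_use blocks are left in place so that every remaining tool_result still has a matching call earlier in the history.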

Layer 3 — Hard compaction asks the LLM itself to write a dense summary of what happened, then replaces the entire history with that summary plus an identity reminder.

def hard_compact(messages: list) -> list:
    # client, MODEL, and SYSTEM are assumed to be defined by the surrounding agent code.
    summary_prompt = (
        "Summarize the conversation so far. Include: "
        "what the user asked, what tools you used, "
        "what you found, what's left to do. Be dense."
    )
    # Append the request as a final user turn so the model sees the full history.
    summary_messages = messages + [{"role": "user", "content": summary_prompt}]
    response = client.messages.create(
        model=MODEL, system=SYSTEM,
        messages=summary_messages, max_tokens=2000,
    )
    summary = response.content[0].text  # first content block holds the summary text
    # These two messages replace the entire history.
    return [
        {"role": "user", "content": f"<context_summary>\n{summary}\n</context_summary>"},
        {"role": "assistant", "content": "Understood. Continuing from the summary."},
    ]
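
The returned history contains the summary but not the identity reminder named in the diagram. One possible way to add it is sketched below, assuming the reminder is a short fixed string wrapped in an `<identity>` tag. The `IDENTITY` text and the `build_compacted_history` helper are hypothetical.

```python
# Hypothetical reminder text; a real agent would likely reuse part of its system prompt.
IDENTITY = "You are a coding agent. Use your tools to finish the user's task."


def build_compacted_history(summary: str, identity: str = IDENTITY) -> list:
    """Return the two-message history that replaces everything after a hard compact."""
    return [
        {
            "role": "user",
            "content": (
                f"<context_summary>\n{summary}\n</context_summary>\n"
                f"<identity>\n{identity}\n</identity>"
            ),
        },
        {"role": "assistant", "content": "Understood. Continuing from the summary."},
    ]


history = build_compacted_history(
    "User asked to fix a failing test; test_parse.py still fails."
)
print(history[0]["content"])
```

Keeping the reminder in the same user message as the summary preserves the alternating user/assistant structure of the compacted history.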

What Changed From Skills

Component	Before (Skills)	After (Context Compact)
Context	Grows forever	Three-layer compression
Old results	Full content	One-line placeholders
Token limit	Hit and crash	Soft limit at 50k, hard at 80k
History	Unbounded	Compacted on demand

Key Takeaway

Context compression is what makes long-running agents practical. The three-layer strategy is progressive: do the cheapest thing first (micro), escalate only when needed (mid), and as a last resort ask the model to summarize itself (hard). The loop code barely changes — just wrap messages through maybe_compact() before each LLM call.
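
A rough sketch of that wiring follows, with the model call and the compactor passed in as parameters so the example runs without an API key. The names `agent_loop`, `call_llm`, `fake_llm`, and `keep_last_four` are hypothetical, and the stand-in compactor only keeps the last few messages.

```python
from typing import Callable


def agent_loop(
    messages: list,
    call_llm: Callable[[list], str],
    compact: Callable[[list], list],
    max_turns: int = 3,
) -> list:
    """Run a few turns, compacting the history before every model call."""
    for _ in range(max_turns):
        # The only addition to the loop: compact before calling the model.
        messages = list(compact(messages))
        reply = call_llm(messages)
        messages.append({"role": "assistant", "content": reply})
        # Stand-in for the next user turn (in a real agent: tool results).
        messages.append({"role": "user", "content": "continue"})
    return messages


def fake_llm(messages: list) -> str:
    """Stand-in for the model: reports how much history it was given."""
    return f"saw {len(messages)} messages"


def keep_last_four(messages: list) -> list:
    """Stand-in for maybe_compact: keep only the 4 most recent messages."""
    return messages[-4:]


history = agent_loop([{"role": "user", "content": "start"}], fake_llm, keep_last_four)
print(len(history))            # 6
print(history[-2]["content"])  # saw 4 messages
```

In a real agent, `compact` would be maybe_compact and `call_llm` would wrap the model request; the rest of the loop stays as it was.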

6. Context Compact
