6. Context Compact
"Context will fill up; you need a way to make room"
New to this?
What is a context window?
The total amount of text (tokens) the model can 'see' at once, including your prompt, conversation history, tool results, and all previous messages. Claude's context window is large but finite.
Why does context fill up?
Every tool call appends its result to the messages array. Reading a 1000-line file adds ~4000 tokens. After 30 file reads and 20 bash commands, you can easily hit 100,000+ tokens and approach the limit.
What is micro-compaction?
A technique where tool results older than 3 turns are replaced with a short summary like '[Previous: used read_file]'. This silently trims stale detail while keeping recent context intact.
The Problem
The context window is finite. A single read_file on a 1000-line file costs ~4000 tokens. After reading 30 files and running 20 bash commands, you hit 100,000+ tokens. The agent cannot work on large codebases without compression.
The Solution
Three layers, increasing in aggressiveness:
Every turn:
+------------------+
| Tool call result |
+------------------+
         |
         v
[Layer 1: micro_compact]  (silent, every turn)
  Replace tool_result > 3 turns old
  with "[Previous: used {tool_name}]"
         |
         v
[Check: tokens > 50000?]
     |         |
    yes        no --- continue normally
     |
     v
[Layer 2: mid_compact]
  Summarize assistant messages
  Keep only last 5 tool results
         |
         v
[Check: tokens > 80000?]
     |         |
    yes        no --- continue
     |
     v
[Layer 3: hard_compact]
  Call LLM to write a dense summary
  Replace entire history with summary
  Inject <identity> reminder
How It Works
- Layer 1: Micro compaction runs silently every turn. Tool results older than 3 turns become one-line placeholders.
def micro_compact(messages: list) -> list:
    compacted = []
    for i, msg in enumerate(messages):
        if msg["role"] == "user" and isinstance(msg["content"], list):
            age = len(messages) - i
            if age > 6:  # older than 3 turns (user+assistant pairs)
                new_content = []
                for block in msg["content"]:
                    if block.get("type") == "tool_result":
                        tool_name = block.get("_tool_name", "tool")
                        new_content.append({
                            "type": "tool_result",
                            "tool_use_id": block["tool_use_id"],
                            "content": f"[Previous: used {tool_name}]",
                        })
                    else:
                        new_content.append(block)
                compacted.append({**msg, "content": new_content})
                continue
        compacted.append(msg)
    return compacted
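Applied to a sample history, only the stale tool result collapses while the fresh one survives. The definition is repeated here so the snippet runs standalone; the history itself is illustrative:

```python
# micro_compact repeated from the section above so this demo runs standalone.
def micro_compact(messages: list) -> list:
    compacted = []
    for i, msg in enumerate(messages):
        if msg["role"] == "user" and isinstance(msg["content"], list):
            age = len(messages) - i
            if age > 6:  # older than 3 turns (user+assistant pairs)
                new_content = []
                for block in msg["content"]:
                    if block.get("type") == "tool_result":
                        tool_name = block.get("_tool_name", "tool")
                        new_content.append({
                            "type": "tool_result",
                            "tool_use_id": block["tool_use_id"],
                            "content": f"[Previous: used {tool_name}]",
                        })
                    else:
                        new_content.append(block)
                compacted.append({**msg, "content": new_content})
                continue
        compacted.append(msg)
    return compacted

# A 7-message history: the first tool result is 7 messages old (collapsed),
# the last one is brand new (kept in full).
old = {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "t1",
                                    "_tool_name": "read_file", "content": "long file contents..."}]}
recent = {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "t2",
                                       "_tool_name": "bash", "content": "fresh output"}]}
filler = [{"role": "assistant", "content": "ok"}, {"role": "user", "content": "next"}] * 2
messages = [old, {"role": "assistant", "content": "read it"}] + filler + [recent]

out = micro_compact(messages)
print(out[0]["content"][0]["content"])   # [Previous: used read_file]
print(out[-1]["content"][0]["content"])  # fresh output
```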
- Layer 2: Mid compaction triggers when the token count exceeds 50,000. It keeps the system prompt and the most recent 5 tool results in full, and summarizes the rest.
import json

def count_tokens(messages: list) -> int:
    text = json.dumps(messages)
    return len(text) // 4  # rough estimate: 4 chars ≈ 1 token

def maybe_compact(messages: list) -> list:
    tokens = count_tokens(messages)
    if tokens > 80000:
        return hard_compact(messages)
    if tokens > 50000:
        return mid_compact(messages)
    return micro_compact(messages)
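The lesson references mid_compact without showing its body. A minimal sketch consistent with the description above, assuming the same message shapes as micro_compact; the 200-character truncation standing in for a real summary is an assumption:

```python
def mid_compact(messages: list) -> list:
    # Indices of user messages that carry tool results.
    tool_idxs = [i for i, m in enumerate(messages)
                 if m["role"] == "user" and isinstance(m["content"], list)]
    keep = set(tool_idxs[-5:])  # the 5 most recent tool results stay intact

    compacted = []
    for i, msg in enumerate(messages):
        if i in keep:
            compacted.append(msg)
        elif msg["role"] == "assistant" and isinstance(msg["content"], str):
            # Crude local "summary": truncate long assistant prose.
            compacted.append({**msg, "content": msg["content"][:200]})
        elif msg["role"] == "user" and isinstance(msg["content"], list):
            # Older tool results collapse to placeholders, as in micro_compact.
            compacted.append({**msg, "content": [
                {"type": "tool_result",
                 "tool_use_id": b["tool_use_id"],
                 "content": f"[Previous: used {b.get('_tool_name', 'tool')}]"}
                for b in msg["content"] if b.get("type") == "tool_result"
            ]})
        else:
            compacted.append(msg)
    return compacted
```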
- Layer 3: Hard compaction asks the LLM itself to write a dense summary of what happened, then replaces the entire history with that summary plus an identity reminder.
def hard_compact(messages: list) -> list:
    summary_prompt = (
        "Summarize the conversation so far. Include: "
        "what the user asked, what tools you used, "
        "what you found, what's left to do. Be dense."
    )
    summary_messages = messages + [{"role": "user", "content": summary_prompt}]
    response = client.messages.create(
        model=MODEL, system=SYSTEM,
        messages=summary_messages, max_tokens=2000,
    )
    summary = response.content[0].text
    return [
        {"role": "user", "content": f"<context_summary>\n{summary}\n</context_summary>"},
        {"role": "assistant", "content": "Understood. Continuing from the summary."},
    ]
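To see the output shape without spending API calls, hard_compact can be exercised against a stand-in client. FakeClient, its canned summary, and the MODEL/SYSTEM placeholders are all assumptions for this offline demo; in the real agent they come from earlier chapters:

```python
# Stand-ins for the real Anthropic client and config (demo assumptions only).
class FakeBlock:
    def __init__(self, text):
        self.text = text

class FakeResponse:
    def __init__(self, text):
        self.content = [FakeBlock(text)]

class FakeMessages:
    def create(self, **kwargs):
        return FakeResponse("User asked for a refactor; read 3 files; tests pass; docs remain.")

class FakeClient:
    def __init__(self):
        self.messages = FakeMessages()

client = FakeClient()
MODEL, SYSTEM = "model-name", "system prompt"

# hard_compact repeated from the section above so the demo runs standalone.
def hard_compact(messages: list) -> list:
    summary_prompt = (
        "Summarize the conversation so far. Include: "
        "what the user asked, what tools you used, "
        "what you found, what's left to do. Be dense."
    )
    summary_messages = messages + [{"role": "user", "content": summary_prompt}]
    response = client.messages.create(
        model=MODEL, system=SYSTEM,
        messages=summary_messages, max_tokens=2000,
    )
    summary = response.content[0].text
    return [
        {"role": "user", "content": f"<context_summary>\n{summary}\n</context_summary>"},
        {"role": "assistant", "content": "Understood. Continuing from the summary."},
    ]

history = [{"role": "user", "content": "refactor auth"}] * 40
out = hard_compact(history)
print(len(out))  # 2 -- the whole history collapses to summary + acknowledgement
```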
What Changed From Skills
| Component | Before (Skills) | After (Context Compact) |
|---|---|---|
| Context | Grows forever | Three-layer compression |
| Old results | Full content | One-line placeholders |
| Token limit | Hit and crash | Soft limit at 50k, hard at 80k |
| History | Unbounded | Compacted on demand |
Key Takeaway
Context compression is what makes long-running agents practical. The three-layer strategy is progressive: do the cheapest thing first (micro), escalate only when needed (mid), and as a last resort ask the model to summarize itself (hard). The loop code barely changes: just wrap messages through maybe_compact() before each LLM call.
Interactive Code Walkthrough
import json

def count_tokens(messages: list) -> int:
    text = json.dumps(messages)
    return len(text) // 4  # rough estimate: 4 chars ≈ 1 token

def maybe_compact(messages: list) -> list:
    tokens = count_tokens(messages)
    if tokens > 80000:
        return hard_compact(messages)
    if tokens > 50000:
        return mid_compact(messages)
    return micro_compact(messages)

def hard_compact(messages: list) -> list:
    summary_prompt = (
        "Summarize the conversation so far. Include: "
        "what the user asked, what tools you used, "
        "what you found, what's left to do. Be dense."
    )
    summary_messages = messages + [{"role": "user", "content": summary_prompt}]
    response = client.messages.create(
        model=MODEL, system=SYSTEM,
        messages=summary_messages, max_tokens=2000,
    )
    summary = response.content[0].text
    return [
        {"role": "user", "content": f"<context_summary>\n{summary}\n</context_summary>"},
        {"role": "assistant", "content": "Understood. Continuing from the summary."},
    ]

If the rough token estimate is len(json.dumps(messages)) // 4, how accurate is it? Try counting the exact tokens for a few messages using the API's usage field and compare.
Hint
The estimate is usually within 20%, which is good enough for triggering compression thresholds.
Fill the context window by giving the agent many tasks in sequence. Watch the compression kick in. Add a log message that prints which layer triggered and how many tokens were saved.
Hint
Check the token count before and after compaction and print the difference.
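One way to sketch that logging is a wrapper around the dispatch logic. The layer functions below are trivial stubs so the snippet runs standalone; in the agent they are the real micro/mid/hard functions from this chapter, and the thresholds match the ones above:

```python
import json

def count_tokens(messages: list) -> int:
    return len(json.dumps(messages)) // 4  # rough estimate: 4 chars ≈ 1 token

# Stubs standing in for the chapter's real compaction layers.
def micro_compact(messages): return messages
def mid_compact(messages): return messages[-10:]
def hard_compact(messages): return messages[-2:]

def maybe_compact_logged(messages: list) -> list:
    before = count_tokens(messages)
    if before > 80000:
        layer, out = "hard", hard_compact(messages)
    elif before > 50000:
        layer, out = "mid", mid_compact(messages)
    else:
        layer, out = "micro", micro_compact(messages)
    saved = before - count_tokens(out)
    print(f"[compact] layer={layer} tokens_before={before} saved={saved}")
    return out

# ~65k estimated tokens -> mid layer fires; double it and hard fires instead.
big = [{"role": "user", "content": "x" * 400}] * 600
trimmed = maybe_compact_logged(big)
```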
Implement a fourth compression layer: semantic deduplication. Before hard compaction, detect if the agent read the same file multiple times and keep only the most recent version.
Hint
Track file paths in tool results and remove older read_file results for the same path.
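A starting point for that fourth layer, assuming tool-result blocks carry a _file_path annotation recorded by the tool loop (that annotation is an assumption, like _tool_name above):

```python
def dedup_reads(messages: list) -> list:
    # Walk backwards so the most recent read of each path is the one kept.
    seen_paths = set()
    out = []
    for msg in reversed(messages):
        if msg["role"] == "user" and isinstance(msg["content"], list):
            new_content = []
            for block in msg["content"]:
                path = block.get("_file_path")  # assumed annotation
                if block.get("type") == "tool_result" and path:
                    if path in seen_paths:
                        # Older duplicate read: collapse to a placeholder.
                        new_content.append({
                            "type": "tool_result",
                            "tool_use_id": block["tool_use_id"],
                            "content": f"[Superseded: re-read {path} later]",
                        })
                        continue
                    seen_paths.add(path)
                new_content.append(block)
            out.append({**msg, "content": new_content})
        else:
            out.append(msg)
    return list(reversed(out))
```

Running dedup_reads before hard_compact means the expensive LLM summary never wastes tokens restating stale copies of the same file.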