17. Capstone: Code Review Agent
"Everything you learned, one system"
New to this?
Why a code review agent?
Code review is the perfect capstone because it naturally requires everything: reading files (tools), delegating to specialists (subagents/teams), coordinating findings (mailboxes/protocols), running tests in isolation (worktrees), and producing structured output. It's complex enough to exercise every concept but concrete enough to build in a day.
What does the finished system look like?
You run `python capstone.py PR_URL` and get back a structured review with sections for security, performance, style, and tests. Under the hood, 4 agents work in parallel, each examining the diff from their specialty, coordinating through the team protocols you built.
Can I customize this for my own use case?
Absolutely. The architecture is the same whether you're building a code reviewer, a research assistant, a deployment bot, or a customer support agent. The capstone teaches the pattern; you choose what to build with it.
The Problem
You have learned 16 concepts in isolation. The agent loop runs tools. Tools read files and execute commands. TodoWrite plans work. Subagents delegate. Skills load context on demand. Context compression keeps conversations from exploding. Tasks persist state. Background tasks run work off the main thread. Teams coordinate multiple agents. Protocols define how they talk. Autonomous agents claim their own work. Worktrees give each agent an isolated workspace. Evals measure quality. Guardrails enforce safety. Observability makes the system visible. Production deployment makes it reliable.
Each works alone. Real systems combine them all. This capstone ties every concept together into one production-grade system: a multi-agent code review bot that receives a PR diff, delegates to specialist reviewers, coordinates their findings, and produces a structured review summary.
The Architecture
PR Diff
|
+------v------+
| Lead Agent |--------> Task Board
| (s01) | (s07)
+------+------+
| plan + delegate
+-------------+-------------+
v v v
+--------------+ +--------------+ +--------------+
| Security | | Performance | | Style |
| Reviewer | | Reviewer | | Reviewer |
| (s09) | | (s09) | | (s09) |
+------+-------+ +------+-------+ +------+-------+
| | |
+------v-------+ +------v-------+ +------v-------+
| worktree/ | | worktree/ | | worktree/ |
| security | | performance | | style |
| (s12) | | (s12) | | (s12) |
+------+-------+  +------+-------+  +------+-------+
| | |
+----------------+----------------+
| findings via mailbox (s10)
+------v------+
| Synthesize |
| (s04) |
+------+------+
|
+------v------+
| Review |
| Summary |
+-------------+
Cross-cutting: Guardrails (s14) | Tracer (s15) | Retries + Streaming (s16)
Every box maps to a session you completed. The lead agent runs the core loop. The task board is the task system. Specialists are agent teams. Worktrees provide isolation. Findings flow through mailboxes. Synthesis uses a subagent. And the entire system is wrapped in guardrails, observability, and production infrastructure. This is not a new concept; it is the assembly of everything you have learned.
Phase 1: Planning the Review
The lead agent receives a PR diff and breaks it into reviewable chunks using the TodoWrite planning approach and the Task System for persistence:
import json

def plan_review(diff: str) -> list[dict]:
    """Use the LLM to analyze a diff and create review tasks."""
    prompt = f"""Analyze this PR diff and create specific review tasks.
For each concern, output a JSON task with:
- title: what to check
- category: security | performance | style
- files: list of relevant file paths
- priority: critical | high | medium | low
Diff:
{diff}"""
    messages = [{"role": "user", "content": prompt}]
    response = agent_loop(messages)  # s01 loop
    tasks = parse_tasks(response)

    # Persist tasks to disk via the task system (s07)
    for task in tasks:
        task_manager.create(
            title=task["title"],
            category=task["category"],
            files=task["files"],
            priority=task["priority"],
            status="pending",
            depends_on=[],  # DAG edges for ordering
        )
    return tasks
The task manager writes each task as a JSON file in .tasks/. This means tasks survive crashes: if the agent restarts, it picks up where it left off. The depends_on field supports the DAG structure from s07, so you can express ordering constraints like "run the security check before the summary."
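To make the persistence idea concrete, here is a minimal sketch of a file-backed task store with DAG-aware readiness. The class and method names (`FileTaskManager`, `ready`) are illustrative stand-ins for the s07 task system, not its actual API:

```python
import json
import os
import uuid

class FileTaskManager:
    """Illustrative file-backed task store: one JSON file per task in .tasks/."""

    def __init__(self, root: str = ".tasks"):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def create(self, **fields) -> dict:
        # Each task gets a short unique id and its own file on disk,
        # so state survives a crash or restart.
        task = {"id": uuid.uuid4().hex[:8], **fields}
        path = os.path.join(self.root, f"{task['id']}.json")
        with open(path, "w") as f:
            json.dump(task, f, indent=2)
        return task

    def load_all(self) -> list[dict]:
        tasks = []
        for name in sorted(os.listdir(self.root)):
            if name.endswith(".json"):
                with open(os.path.join(self.root, name)) as f:
                    tasks.append(json.load(f))
        return tasks

    def ready(self) -> list[dict]:
        """Pending tasks whose depends_on edges all point at done tasks."""
        tasks = self.load_all()
        done = {t["id"] for t in tasks if t.get("status") == "done"}
        return [
            t for t in tasks
            if t.get("status") == "pending"
            and all(dep in done for dep in t.get("depends_on", []))
        ]
```

With this sketch, a "summary" task that depends on a "security scan" task simply stays out of the `ready()` list until the scan is marked done.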
Phase 2: Assembling the Team
Each specialist agent gets a focused system prompt that shapes what it looks for. This is the Agent Teams pattern with identity, lifecycle, and role-based behavior:
import threading

SPECIALIST_CONFIGS = {
    "security": {
        "name": "security",
        "role": "security reviewer",
        "system": (
            "You are a security reviewer. Examine the diff for: "
            "SQL injection, XSS, path traversal, authentication bypass, "
            "hardcoded secrets, insecure deserialization, missing input validation. "
            "Rate each finding: critical / high / medium / low. "
            "Output JSON: {\"findings\": [{\"issue\": ..., \"file\": ..., "
            "\"line\": ..., \"severity\": ..., \"fix\": ...}]}"
        ),
    },
    "performance": {
        "name": "performance",
        "role": "performance reviewer",
        "system": (
            "You are a performance reviewer. Examine the diff for: "
            "N+1 queries, unnecessary allocations, missing database indexes, "
            "O(n^2) algorithms where O(n) exists, synchronous I/O in hot paths, "
            "unbounded list growth, missing pagination. "
            "Estimate impact: high / medium / low. "
            "Output JSON: {\"findings\": [{\"issue\": ..., \"file\": ..., "
            "\"line\": ..., \"impact\": ..., \"suggestion\": ...}]}"
        ),
    },
    "style": {
        "name": "style",
        "role": "style reviewer",
        "system": (
            "You are a style reviewer. Examine the diff for: "
            "inconsistent naming conventions, missing type hints, dead code, "
            "functions over 40 lines, missing docstrings on public APIs, "
            "unused imports, magic numbers without constants. "
            "Suggest concrete fixes for each issue. "
            "Output JSON: {\"findings\": [{\"issue\": ..., \"file\": ..., "
            "\"line\": ..., \"suggestion\": ...}]}"
        ),
    },
}

def setup_review_team() -> list[dict]:
    """Initialize the team with specialist agents."""
    teammates = list(SPECIALIST_CONFIGS.values())
    init_team(teammates)  # s09: register teammates and create mailboxes
    return teammates
Notice that each specialist outputs structured JSON. This is critical for Phase 4: the lead agent needs to parse and aggregate findings programmatically, not read free-form text.
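One way to enforce that contract is to validate each specialist's output against the keys its system prompt promises before aggregating. This is a sketch, not part of the capstone code; `validate_findings` is a hypothetical helper, and the required-key sets mirror the prompts above:

```python
import json

# Required keys per category, taken from the specialist system prompts above.
REQUIRED_KEYS = {
    "security": {"issue", "file", "line", "severity", "fix"},
    "performance": {"issue", "file", "line", "impact", "suggestion"},
    "style": {"issue", "file", "line", "suggestion"},
}

def validate_findings(category: str, raw: str) -> list[dict]:
    """Parse a specialist response; drop findings missing required keys."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return []  # non-JSON responses are handled by the Phase 4 fallback
    required = REQUIRED_KEYS[category]
    return [
        f for f in parsed.get("findings", [])
        if isinstance(f, dict) and required <= f.keys()
    ]
```

Validating at the boundary means a malformed finding degrades into a skipped entry rather than a crash deep inside the aggregation loop.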
Phase 3: Parallel Execution
Each specialist gets its own worktree and runs as an autonomous agent that claims tasks from the board:
import os
import threading

def execute_review(tasks: list, team: list):
    """Assign worktrees and let agents work autonomously in parallel."""
    # Create isolated worktrees for each specialist (s12)
    worktrees = {}
    for mate in team:
        wt_path = create_worktree(mate["name"])
        worktrees[mate["name"]] = wt_path

    # Launch each specialist in its own thread (s09 + s11)
    threads = []
    for mate in team:
        t = threading.Thread(
            target=specialist_loop,
            args=(mate, worktrees[mate["name"]]),
            daemon=True,
        )
        t.start()
        threads.append(t)

    # Wait for all specialists to finish (with timeout)
    wait_for_completion(task_manager, timeout=300)

    # Cleanup worktrees (s12)
    for name, path in worktrees.items():
        cleanup_worktree(path)

def specialist_loop(mate: dict, worktree_path: str):
    """Autonomous loop for a single specialist agent."""
    guardrail = GuardRail(REVIEW_PERMISSIONS)          # s14
    tracer = AgentTracer(f"reviewer-{mate['name']}")   # s15
    while True:
        # Claim a task matching this specialist's category (s11)
        task = task_manager.claim(
            category=mate["name"],
            agent_id=mate["name"],
        )
        if task is None:
            break  # No more tasks for this specialist
        tracer.record("task_claimed", {"task": task["title"]})

        # Build the review prompt with the relevant files
        file_contents = ""
        for f in task["files"]:
            path = os.path.join(worktree_path, f)
            if os.path.exists(path):
                content = read_file(path)  # s02: tool use
                file_contents += f"\n--- {f} ---\n{content}\n"
        messages = [{"role": "user", "content": (
            f"Review these files for {mate['name']} issues.\n"
            f"Task: {task['title']}\n"
            f"Files:\n{file_contents}"
        )}]

        # Run the agent loop with guardrails (s01 + s14)
        response = agent_loop_with_guardrails(
            messages=messages,
            system=mate["system"],
            guardrail=guardrail,
            tracer=tracer,
        )

        # Send findings to lead via mailbox (s10)
        send_message(
            from_agent=mate["name"],
            to_agent="lead",
            content=response,
        )

        # Mark task complete (s07)
        task_manager.update(task["id"], status="done")
        tracer.record("task_done", {"task": task["title"]})
Each specialist runs its own agent loop with the guardrail layer checking every tool call. The tracer records every claim, every tool execution, every completion. If a specialist loads a large file, context compression can kick in to keep the conversation within token limits. And if a specialist needs domain-specific context, say a list of known vulnerability patterns, skill loading can inject it on demand.
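To show what "checking every tool call" might look like, here is a minimal permission layer. This is an illustrative stand-in for the s14 guardrails, not the real `GuardRail` implementation; it simply allow-lists tools and rejects paths that climb out of the workspace:

```python
import os

# Hypothetical permission set for a read-only reviewer.
REVIEW_PERMISSIONS = {
    "allowed_tools": {"read_file", "bash", "grep"},
}

class GuardRail:
    """Sketch: validate each tool call before it executes."""

    def __init__(self, permissions: dict):
        self.permissions = permissions

    def check(self, tool: str, args: dict) -> None:
        # Reject any tool outside the allow-list.
        if tool not in self.permissions["allowed_tools"]:
            raise PermissionError(f"tool not allowed: {tool}")
        # Reject paths that traverse out of the worktree.
        path = args.get("path", "")
        if ".." in os.path.normpath(path).split(os.sep):
            raise PermissionError(f"path escapes workspace: {path}")
```

The agent loop would call `guardrail.check(tool, args)` before dispatching each tool, turning a policy violation into an exception instead of an executed action.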
Phase 4: Collecting Findings
Once all specialists finish, the lead agent drains their mailboxes and aggregates the structured outputs:
def collect_findings(team: list) -> dict:
    """Drain all specialist mailboxes and aggregate findings."""
    findings = {"security": [], "performance": [], "style": []}

    # Drain the lead's inbox using the request-response protocol (s10)
    messages = drain_inbox("lead")
    for msg in messages:
        sender = msg["from"]
        if sender in findings:
            try:
                parsed = json.loads(msg["content"])
                findings[sender].extend(parsed.get("findings", []))
            except json.JSONDecodeError:
                # Fallback: treat raw text as a single finding
                findings[sender].append({
                    "issue": msg["content"],
                    "severity": "medium",
                    "raw": True,
                })

    # Sort each category by severity
    severity_order = {"critical": 0, "high": 1, "medium": 2, "low": 3}
    for category in findings:
        findings[category].sort(
            key=lambda f: severity_order.get(
                f.get("severity", f.get("impact", "medium")), 2
            )
        )
    return findings
The fallback for non-JSON responses is important. Agents are probabilistic: sometimes a specialist returns prose instead of JSON. Robust code handles both cases instead of crashing.
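A middle ground between "parse cleanly" and "treat as raw text" is to try to recover a JSON object embedded in prose, since models often wrap valid JSON in explanation. This `extract_findings` helper is a sketch of that idea, not part of the capstone code:

```python
import json
import re

def extract_findings(content: str):
    """Return the findings list, or None if no JSON object is recoverable."""
    # Fast path: the whole message is a JSON object.
    try:
        parsed = json.loads(content)
        if isinstance(parsed, dict):
            return parsed.get("findings", [])
    except json.JSONDecodeError:
        pass
    # Slow path: grab the outermost {...} span and retry.
    match = re.search(r"\{.*\}", content, re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, dict):
                return parsed.get("findings", [])
        except json.JSONDecodeError:
            pass
    return None  # caller falls back to the raw-text finding
```

Only when this returns None does the raw-text fallback above need to fire, which keeps more findings machine-readable.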
Phase 5: Synthesizing the Review
A subagent takes all findings and produces a coherent, prioritized review summary:
def synthesize_review(findings: dict) -> str:
    """Use a subagent to produce a unified review summary."""
    all_findings = json.dumps(findings, indent=2)

    # Context compression (s06) if findings are very large
    if len(all_findings) > 50000:
        all_findings = compress_context(all_findings, target_tokens=10000)

    summary = run_subagent(
        system="You are a senior engineer writing a code review summary.",
        prompt=f"""Synthesize these findings from three specialist reviewers
into a single, actionable code review.
Rules:
- Lead with critical/high severity issues
- Group by theme, not by reviewer
- Include file paths and line numbers
- End with a clear accept/request-changes/block verdict
Findings:
{all_findings}
Output format:
## Verdict: [APPROVE | REQUEST_CHANGES | BLOCK]
## Critical Issues
(list or "None found")
## Recommendations
(list)
## Style Suggestions
(list)
## Summary
(1-2 sentence overall assessment)
""",
    )
    return summary
The output format is structured so downstream systems can parse the verdict programmatically. A CI integration can block merges on BLOCK, request changes on REQUEST_CHANGES, and auto-approve on APPROVE.
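The CI side of that contract can be a few lines. This sketch assumes the "## Verdict:" heading from the synthesis prompt above; `parse_verdict` and `ci_exit_code` are hypothetical helper names:

```python
import re

def parse_verdict(summary: str) -> str:
    """Extract the verdict from the review summary's '## Verdict:' line."""
    match = re.search(
        r"## Verdict:\s*\[?(APPROVE|REQUEST_CHANGES|BLOCK)\]?", summary
    )
    # Fail safe: an unparseable summary should never auto-approve.
    return match.group(1) if match else "REQUEST_CHANGES"

def ci_exit_code(verdict: str) -> int:
    """Map the verdict to a CI exit code: 0 passes, nonzero blocks the merge."""
    return {"APPROVE": 0, "REQUEST_CHANGES": 1, "BLOCK": 2}[verdict]
```

Note the fail-safe default: if the model drifts from the format, the pipeline requests changes rather than silently approving.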
Putting It All Together
Here is the complete entry point that ties every session concept into one function:
import os
import json
import threading
import anthropic

def run_code_review(diff: str) -> str:
    """Full production code review pipeline.

    Concepts used:
        s01: agent loop    s02: tool use     s03: planning
        s04: subagent      s05: skills       s06: compression
        s07: task system   s08: background   s09: teams
        s10: protocols     s11: autonomy     s12: worktrees
        s13: evals         s14: guardrails   s15: observability
        s16: production
    """
    # Production infrastructure (s15, s16)
    tracer = AgentTracer("code-review")
    cost_tracker = CostTracker(budget_limit_usd=5.0)
    guardrail = GuardRail(REVIEW_PERMISSIONS)
    tracer.record("review_start", {"diff_size": len(diff)})

    # Phase 1: Plan the review (s03 + s07)
    tasks = plan_review(diff)

    # Phase 2: Assemble the team (s09)
    team = setup_review_team()

    # Phase 3: Parallel execution (s11 + s12)
    execute_review(tasks, team)

    # Phase 4: Collect findings (s10)
    findings = collect_findings(team)

    # Phase 5: Synthesize (s04 + s06)
    summary = synthesize_review(findings)

    # Record completion (s15)
    tracer.record("review_complete", {
        "total_findings": sum(len(v) for v in findings.values()),
        "security_findings": len(findings["security"]),
        "performance_findings": len(findings["performance"]),
        "style_findings": len(findings["style"]),
        "total_cost_usd": cost_tracker.total_cost_usd(),
        "total_tokens": tracer.total_tokens(),
    })
    return summary

if __name__ == "__main__":
    import sys
    diff = fetch_pr_diff(sys.argv[1])  # e.g., python capstone.py PR_URL
    review = run_code_review(diff)
    print(review)
Count the session references. Every single concept from s01 through s16 appears in this system. The agent loop drives every agent. Tools let them read code. TodoWrite plans the review. Subagents synthesize findings. Skills load domain knowledge. Context compression handles large diffs. The task system persists state. Background tasks run tests. Teams coordinate specialists. Protocols define communication. Autonomous agents self-assign work. Worktrees provide isolation. Evals measure review quality. Guardrails enforce safety. Observability makes it all visible. Production infrastructure makes it reliable.
Session Concept Map
| Session | Concept | Where It Appears in the Capstone |
|---|---|---|
| s01 | Agent Loop | Every agent, lead and specialists alike, runs the core loop |
| s02 | Tool Use | Specialists use read_file and bash to analyze code in worktrees |
| s03 | TodoWrite | Lead agent plans the review as a structured task list |
| s04 | Subagents | Synthesis subagent produces the final unified review |
| s05 | Skills | Specialists load domain-specific knowledge (vulnerability patterns, lint rules) |
| s06 | Context Compact | Compression kicks in when diffs or findings exceed token limits |
| s07 | Task System | Review tasks persisted as a JSON DAG on disk |
| s08 | Background Tasks | Test execution and linting run in background threads |
| s09 | Agent Teams | Three specialist teammates plus one lead agent |
| s10 | Team Protocols | Request-response mailboxes for findings collection |
| s11 | Autonomous Agents | Specialists self-assign tasks from the board |
| s12 | Worktree Isolation | Each specialist works in its own git worktree |
| s13 | Agent Evals | Eval suite scores review quality against known bugs |
| s14 | Guardrails | Permission checks on every tool call in every agent |
| s15 | Observability | Full trace of every review session for debugging |
| s16 | Production | Retries, streaming, cost tracking, model routing |
Key Takeaway
The harness is complete. From a single `while True` loop in session 1, you built a system where multiple agents autonomously coordinate to perform complex work in parallel, with planning, isolation, safety, observability, and production readiness baked in. The model provides the intelligence. The code you wrote is the harness that makes that intelligence useful, safe, and reliable. You did not learn 16 disconnected ideas. You learned 16 layers of the same system, each building on the last. Now you know how to build it from scratch.
Interactive Code Walkthrough
def run_code_review(diff: str) -> dict:
    # Phase 1: Planning (s03, s07)
    tasks = plan_review(diff)
    task_manager.create_all(tasks)

    # Phase 2: Team setup (s09)
    team = init_team([
        {"name": "security", "role": "security reviewer"},
        {"name": "performance", "role": "performance reviewer"},
        {"name": "style", "role": "style reviewer"},
    ])

    # Phase 3: Parallel execution (s11, s12)
    for task in task_manager.get_ready():
        wt = create_worktree(task.id)
        assign_task(task.id, wt)
        # Autonomous agents claim and execute

    # Phase 4: Collect results (s10)
    wait_for_completion(task_manager)
    findings = collect_findings(team)

    # Phase 5: Synthesize (s04)
    summary = synthesize_review(findings)
    return summary

Draw the architecture diagram for this system. Map each box to the session where you learned that concept.
Hint
Your diagram should have: Lead Agent (s01), Task Board (s07), 3 Specialist Agents (s09), Mailboxes (s09), Worktrees (s12), Guardrails (s14), and Tracer (s15).
Implement the capstone. Start with a single reviewer agent that reads a diff and produces findings. Then add a second specialist and coordinate them via mailboxes.
Hint
Start simple: a lead + 1 specialist is enough. Add the second specialist only after the mailbox coordination works.
Add an eval suite (s13) for your code review agent. Define 5 test cases with known issues (SQL injection, unused import, O(n^2) loop, inconsistent naming, missing error handling) and score the agent's ability to find them.
Hint
Create synthetic diffs with known bugs. The checker verifies that the agent's output mentions each bug.
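As a starting point for this exercise, the checker can be a keyword match over the agent's review. This is a minimal sketch; the case names, keywords, and helper names (`score_review`, `run_evals`) are all illustrative, and real eval cases would attach the synthetic diffs themselves:

```python
# Each eval case pairs a synthetic diff (elided here) with keywords the
# review must mention for the case to count as caught.
EVAL_CASES = [
    {"name": "sql_injection", "expected_keywords": ["sql", "injection"]},
    {"name": "unused_import", "expected_keywords": ["unused", "import"]},
    {"name": "quadratic_loop", "expected_keywords": ["o(n^2)", "nested"]},
]

def score_review(review: str, case: dict) -> bool:
    """A case passes if any expected keyword appears in the review text."""
    text = review.lower()
    return any(kw in text for kw in case["expected_keywords"])

def run_evals(reviews: dict[str, str]) -> float:
    """Return the fraction of cases the agent caught (0.0 to 1.0)."""
    passed = sum(
        score_review(reviews.get(case["name"], ""), case)
        for case in EVAL_CASES
    )
    return passed / len(EVAL_CASES)
```

Keyword matching is deliberately crude; once it works end to end, you can upgrade the checker to an LLM grader that judges whether each known bug was genuinely identified.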