Why this piece exists
The payoff is not the size of the script. It is that the chain actually runs end to end.
Up to this point, each lab taught one layer at a time. That is useful for learning, but
it leaves an honest question behind: do these pieces actually compose, or did we just build
a pile of disconnected examples? Lab 11 answers that question.
The capstone makes one governed run and saves the evidence. In plain English, it asks a
tool to count a term in local files, but only after policy says yes, while also reading the
current task graph and scoring the run with eval checks. Under the hood, that means the host
boundary, protocol boundary, state boundary, and audit boundary all show up in one place.
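The flow described above can be sketched in a few lines of Python. This is a minimal illustration and not the capstone script itself: the function names (`policy_hook`, `term_count`, `ready_tasks`, `governed_run`), the in-memory document contents, and every task name other than add-hook are assumptions made for the example. Only the record fields mirror the capstone_run.json shape shown in this section.

```python
from datetime import datetime, timezone

# Hypothetical stand-ins: the real labs read files and a task graph from disk.
DOCS = {
    "labs/sample_docs/agents.txt": "An agent calls tools. Each agent run is logged.",
    "labs/sample_docs/protocols.txt": "Protocols define message shapes.",
}
TASKS = {
    "design-schema": {"done": True, "needs": []},
    "add-hook": {"done": False, "needs": ["design-schema"]},
    "write-report": {"done": False, "needs": ["add-hook"]},
}


def policy_hook(term, blocked_terms):
    """Decide whether the call may run (assumed shape of the hook's decision)."""
    if term in blocked_terms:
        return {"allowed": False, "reason": f"blocked sensitive term: {term}"}
    return {"allowed": True, "reason": "read-only sample docs query"}


def term_count(term, files):
    """Count case-insensitive occurrences of term in each document."""
    items = [{"file": f, "count": DOCS[f].lower().count(term.lower())} for f in files]
    total = sum(item["count"] for item in items)
    return {"ok": True, "items": items, "summary": {"total": total}, "errors": []}


def ready_tasks(tasks):
    """A task is ready when it is not done and all of its dependencies are done."""
    return [
        name
        for name, task in tasks.items()
        if not task["done"] and all(tasks[dep]["done"] for dep in task["needs"])
    ]


def governed_run(term, files, blocked_terms=(), approved_by="toy-user"):
    """Run the tool only if policy allows it, and always return an audit record."""
    decision = policy_hook(term, set(blocked_terms))
    if decision["allowed"]:
        result = term_count(term, files)
    else:
        result = {"ok": False, "error": "blocked_by_hook"}
    return {
        "time": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "host": {"approved_by": approved_by},
        "tool": "term_count",
        "arguments": {"term": term, "files": list(files)},
        "policy": decision,
        "ready_tasks": ready_tasks(TASKS),
        "tool_result": result,
    }
```

The point of the sketch is the ordering: the policy decision is made first, the tool runs only when that decision allows it, and the record is built either way, so a blocked call still leaves a trail.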
Expected output
What capstone_run.json looks like when the run succeeds and when policy blocks it.
Your file paths may appear as absolute paths depending on where you run the script. That is expected.
With the source as written, a successful run uses the term agent, reads the two sample docs,
sees add-hook as the currently ready task, and records a passing eval:
{
  "time": "2026-05-06T18:24:16Z",
  "goal": "Find agent mentions in sample docs and leave an auditable trail.",
  "host": {
    "approved_by": "toy-user"
  },
  "tool": "term_count",
  "arguments": {
    "term": "agent",
    "files": [
      "labs/sample_docs/agents.txt",
      "labs/sample_docs/protocols.txt"
    ]
  },
  "policy": {
    "allowed": true,
    "reason": "read-only sample docs query"
  },
  "ready_tasks": [
    "add-hook"
  ],
  "tool_result": {
    "ok": true,
    "items": [
      {
        "file": "labs/sample_docs/agents.txt",
        "count": 2
      },
      {
        "file": "labs/sample_docs/protocols.txt",
        "count": 0
      }
    ],
    "summary": {
      "total": 2
    },
    "errors": []
  },
  "eval": {
    "passed": true,
    "checks": [
      {
        "name": "has_policy_decision",
        "passed": true
      },
      {
        "name": "successful_calls_include_summary",
        "passed": true,
        "applies": true
      },
      {
        "name": "ready_tasks_visible",
        "passed": true
      }
    ]
  }
}
If you run the capstone with --block-term agent, the same record shape stays intact, but the
hook blocks the call and the governance eval still passes because the policy worked as intended:
{
  "time": "2026-05-06T00:00:00Z",
  "goal": "Find agent mentions in sample docs and leave an auditable trail.",
  "host": {
    "approved_by": "toy-user"
  },
  "tool": "term_count",
  "arguments": {
    "term": "agent",
    "files": [
      "labs/sample_docs/agents.txt",
      "labs/sample_docs/protocols.txt"
    ]
  },
  "policy": {
    "allowed": false,
    "reason": "blocked sensitive term: agent"
  },
  "ready_tasks": [
    "add-hook"
  ],
  "tool_result": {
    "ok": false,
    "error": "blocked_by_hook"
  },
  "eval": {
    "passed": true,
    "checks": [
      {
        "name": "has_policy_decision",
        "passed": true
      },
      {
        "name": "blocked_calls_do_not_succeed",
        "passed": true
      },
      {
        "name": "successful_calls_include_summary",
        "passed": true,
        "applies": false
      },
      {
        "name": "ready_tasks_visible",
        "passed": true
      }
    ]
  }
}
That blocked example is useful because it shows the audit value clearly: you still get a durable record of
the attempted run and the decision that blocked it. The eval scores governance, not tool success, so a blocked call passes when the policy did its job.
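One way to see why a blocked run can still pass is to write the checks out. The sketch below is an assumption inferred from the check names and the two records in this section, not the Lab 10 source: `score_run` is a hypothetical name, and the rule that `blocked_calls_do_not_succeed` is emitted only when policy denies the call is a guess based on where that check appears in the examples.

```python
def score_run(record):
    """Score one audit record on governance, not on whether the tool succeeded."""
    policy = record.get("policy", {})
    result = record.get("tool_result", {})
    allowed = policy.get("allowed")
    ok = result.get("ok") is True

    # The record must carry an explicit allow/deny decision.
    checks = [{"name": "has_policy_decision", "passed": isinstance(allowed, bool)}]

    # Assumed: this check is only reported when policy denied the call.
    if allowed is False:
        checks.append({"name": "blocked_calls_do_not_succeed", "passed": not ok})

    # Vacuously true for failed or blocked calls; "applies" records that.
    checks.append({
        "name": "successful_calls_include_summary",
        "passed": (not ok) or "summary" in result,
        "applies": ok,
    })

    # The task state must be visible in the record as a list.
    checks.append({
        "name": "ready_tasks_visible",
        "passed": isinstance(record.get("ready_tasks"), list),
    })

    return {"passed": all(check["passed"] for check in checks), "checks": checks}


# Usage: a blocked record shaped like the example above.
blocked_record = {
    "policy": {"allowed": False, "reason": "blocked sensitive term: agent"},
    "ready_tasks": ["add-hook"],
    "tool_result": {"ok": False, "error": "blocked_by_hook"},
}
blocked_eval = score_run(blocked_record)
```

Under these assumptions the only way a blocked record fails is if the tool result reports success anyway, which is exactly the governance failure the eval exists to catch.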
What you built
You ran the full stack in one command.
To be clear, this is not a simulation anymore. This one script now composes five earlier labs in one real run:
the Lab 03 protocol server, Lab 05 hook filtering, Lab 07 task state, Lab 09 approval gate, and Lab 10 eval pass.
In other words, you did not just repeat the same pattern five times. You wired separate pieces into one working
system, and changes in those upstream labs can show up here immediately. That is the real integration payoff of
the sequence.