
Wire the tiny stack together

Run one command that crosses the whole toy stack: JSON result shape, protocol tool call, approval boundary, durable task state, and eval output.

What you'll build

A small run that proves the earlier labs were real pieces, not isolated demos.

This capstone does not introduce a brand-new subsystem. It takes pieces from Labs 02 through 10 (specifically Labs 02, 03, 05, 07, 09, and 10) and actually runs them together. The script builds a record, checks policy, dispatches a protocol-style tool call, reads durable task state, evaluates the result, and writes the whole thing out as capstone_run.json.

That matters because integration is where toy examples usually get hand-wavy. Here, the seams are the lesson. Each layer still does its own job, but now you can see the full chain behave like a tiny governed agent stack.

Run it

cd ai_ecosystem_labs
python3 11-capstone/capstone.py

Starting here? Quick setup:

git clone https://github.com/BanditF/ai_ecosystem_labs
cd ai_ecosystem_labs
python3 reset.py   # restores tasks.json to initial state
python3 11-capstone/capstone.py

Run reset.py first because the capstone reads task graph state.

Run from the ai_ecosystem_labs/ repo root. The command paths above are relative to it.

Local mirror: labs/11-capstone/capstone.py

Time guide. Setup: 2–5 min, including a reset if you have already changed task state. Working through it: 25–45 min because it pays off several earlier labs at once.

Why this piece exists

The payoff is not the script size. It is the fact that the chain is real.

Up to this point, each lab taught one layer at a time. That is useful for learning, but it leaves an honest question behind: do these pieces actually compose, or did we just build a pile of disconnected examples? Lab 11 answers that question.

The capstone makes one governed run and saves the evidence. In plain English, it asks a tool to count a term in local files, but only after policy says yes, while also reading the current task graph and scoring the run with eval checks. Under the hood, that means the host boundary, protocol boundary, state boundary, and audit boundary all show up in one place.
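The ordering described above (hook first, then approval, then the tool) can be sketched in a few lines. This is a minimal illustration, not the code in capstone.py: governed_call, its parameters, and the "approval denied" / "not_approved" strings are hypothetical names chosen for this sketch. Only the blocked_by_hook error and the two policy reasons come from the sample output later on this page.

```python
from typing import Callable


def governed_call(
    term: str,
    blocked: set[str],
    approve: Callable[[str], bool],
    tool: Callable[[str], dict],
) -> dict:
    """Run `tool` only if the hook and the approval gate both say yes."""
    # Hook boundary: a blocked term stops the run before anything else happens.
    if term.lower() in blocked:
        return {
            "policy": {"allowed": False, "reason": f"blocked sensitive term: {term}"},
            "tool_result": {"ok": False, "error": "blocked_by_hook"},
        }
    # Approval boundary (hypothetical denial strings, not from the lab source).
    if not approve(term):
        return {
            "policy": {"allowed": False, "reason": "approval denied"},
            "tool_result": {"ok": False, "error": "not_approved"},
        }
    # Only now does the tool actually run.
    return {
        "policy": {"allowed": True, "reason": "read-only sample docs query"},
        "tool_result": tool(term),
    }


blocked_run = governed_call("agent", {"agent"}, lambda t: True, lambda t: {"ok": True})
allowed_run = governed_call("agent", set(), lambda t: True, lambda t: {"ok": True})
```

The point of the shape is that the tool function is never invoked on a blocked or denied path, yet every path still returns a policy decision that can be recorded.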

Integration map

What earlier labs are doing inside the capstone.

Prior lab   What Lab 11 calls or imports
Lab 02      JSON result envelope: ok, items, summary
Lab 03      calls the protocol server over tools/call
Lab 05      imports run_tool() (as run_hooked_tool) plus blocked_terms() for real term filtering and blocking
Lab 07      reads tasks.json to pull in ready task state
Lab 09      imports request_tool_approval() for the real approval gate
Lab 10      imports evaluate() for the real eval pass and audit checks

The code

capstone.py

Walk through it

Notice the seams, not just the functions.

Subprocess calls test real integration, not mocked interfaces

protocol_call() does not import a helper function and pretend that counts as integration. It shells out to 03-protocol-adapter/protocol_server.py, sends a JSON request, and parses the returned JSON response. That is a tiny end-to-end boundary. The capstone is exercising the protocol surface the way another process would actually use it.
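A rough sketch of that subprocess boundary follows. The request shape (a method of tools/call with name and arguments under params) and passing it on stdin are assumptions for illustration, and the inline stand-in server exists only so the example runs on its own. In the lab, the command would point at 03-protocol-adapter/protocol_server.py instead.

```python
import json
import subprocess
import sys

# Stand-in server for this sketch only: reads one JSON request on stdin
# and echoes the arguments back. It is NOT the lab's protocol_server.py.
STAND_IN_SERVER = r'''
import json, sys
request = json.load(sys.stdin)
args = request["params"]["arguments"]
print(json.dumps({"ok": True, "echo": args}))
'''


def protocol_call(command: list[str], tool: str, arguments: dict) -> dict:
    """Send one tools/call request to a child process and parse its reply."""
    request = {"method": "tools/call", "params": {"name": tool, "arguments": arguments}}
    proc = subprocess.run(
        command,
        input=json.dumps(request),
        capture_output=True,
        text=True,
        check=True,   # raise if the server exits non-zero
        timeout=10,   # do not hang forever on a stuck server
    )
    return json.loads(proc.stdout)


response = protocol_call([sys.executable, "-c", STAND_IN_SERVER], "term_count", {"term": "agent"})
```

Because the only contract is JSON in and JSON out, the caller cannot reach into the server's internals, which is exactly the property the capstone is testing.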

Labs 05, 09, and 10 are real runtime dependencies now, not just inspiration

The capstone imports run_tool() (as run_hooked_tool) and blocked_terms() from Lab 05, request_tool_approval() from Lab 09, and evaluate() from Lab 10. So if you update Lab 05's blocked terms, Lab 11 sees that change on the next run. This is not a copied pattern anymore. It is one script calling into the others.
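Lab directories such as 03-protocol-adapter and 11-capstone contain hyphens and leading digits, so they cannot be named in a plain import statement. One way to load code from a directory like that is importlib, sketched below. This is an assumption about technique, not a description of what capstone.py does; it may well adjust sys.path instead. The directory and file names in the demo are made up.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_module(name: str, path: Path):
    """Load a Python file as a module, regardless of its directory name."""
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Demo with a throwaway directory standing in for a sibling lab.
with tempfile.TemporaryDirectory() as tmp:
    lab_dir = Path(tmp) / "05-demo-lab"  # hypothetical directory name
    lab_dir.mkdir()
    (lab_dir / "hooks.py").write_text("def blocked_terms():\n    return ['secret']\n")
    hooks = load_module("demo_hooks", lab_dir / "hooks.py")
    terms = hooks.blocked_terms()
```

Whichever mechanism is used, the effect is the same: the function is read from the other lab's file at run time, so an edit there changes behavior here.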

Task graph state from Lab 07 feeds into the run: the capstone reads durable state, not just in-memory state

ready_tasks() reads tasks.json from the task-graph lab and carries the ready task IDs into the final record. That means this run is not self-contained theater. It looks at external state that could have been created earlier and could still be there after the process exits.
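The idea of a ready task can be sketched as "not done, and every dependency is done". The tasks.json schema below (a tasks list with id, status, and depends_on) is an assumption made for this example; check the Lab 07 file for the real field names.

```python
import json
import tempfile
from pathlib import Path


def ready_tasks(path: Path) -> list[str]:
    """Return IDs of unfinished tasks whose dependencies are all done."""
    tasks = json.loads(path.read_text())["tasks"]
    done = {task["id"] for task in tasks if task["status"] == "done"}
    return [
        task["id"]
        for task in tasks
        if task["status"] != "done"
        and all(dep in done for dep in task.get("depends_on", []))
    ]


# Demo: a two-task graph written to disk, read, edited, and read again.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tasks.json"
    graph = {
        "tasks": [
            {"id": "add-hook", "status": "pending", "depends_on": []},
            {"id": "agent-loop", "status": "pending", "depends_on": ["add-hook"]},
        ]
    }
    path.write_text(json.dumps(graph))
    before = ready_tasks(path)

    graph["tasks"][0]["status"] = "done"  # mark add-hook done on disk
    path.write_text(json.dumps(graph))
    after = ready_tasks(path)
```

The function holds no state of its own. Everything it knows comes from the file, which is why editing the file changes the next run.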

capstone_run.json is your audit artifact — it proves the imported pieces actually ran together

The last step is not just printing a result. The script writes the full record to capstone_run.json so you can inspect the goal, hook decision, approval decision, ready tasks, tool result, and eval checks after the fact. That is the governance move from Lab 10 in its real form: leave evidence, not just a success message.
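Writing the audit artifact is a small step, sketched here under the assumption that the record is a plain dict and the timestamp is UTC in the same format as the sample output. The write_record helper is a hypothetical name, and the demo writes to a temporary directory rather than the real capstone_run.json.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_record(record: dict, path: Path) -> None:
    """Stamp the record with a UTC time and write it as indented JSON."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    stamped = {"time": stamp, **record}  # time first, as in the sample output
    path.write_text(json.dumps(stamped, indent=2) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "capstone_run.json"
    write_record({"goal": "demo", "policy": {"allowed": True}}, out)
    saved = json.loads(out.read_text())
```

Writing the record on both the allowed and the blocked path is what turns it into evidence rather than a success log.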

Expected output

What capstone_run.json looks like when the run succeeds and when policy blocks it.

Your file paths may appear as absolute paths depending on where you run the script — that's expected.

With the source as written, a successful run uses the term agent, reads the two sample docs, sees add-hook as the currently ready task, and records a passing eval:

{
  "time": "2026-05-06T18:24:16Z",
  "goal": "Find agent mentions in sample docs and leave an auditable trail.",
  "host": {
    "approved_by": "toy-user"
  },
  "tool": "term_count",
  "arguments": {
    "term": "agent",
    "files": [
      "labs/sample_docs/agents.txt",
      "labs/sample_docs/protocols.txt"
    ]
  },
  "policy": {
    "allowed": true,
    "reason": "read-only sample docs query"
  },
  "ready_tasks": [
    "add-hook"
  ],
  "tool_result": {
    "ok": true,
    "items": [
      {
        "file": "labs/sample_docs/agents.txt",
        "count": 2
      },
      {
        "file": "labs/sample_docs/protocols.txt",
        "count": 0
      }
    ],
    "summary": {
      "total": 2
    },
    "errors": []
  },
  "eval": {
    "passed": true,
    "checks": [
      {
        "name": "has_policy_decision",
        "passed": true
      },
      {
        "name": "successful_calls_include_summary",
        "passed": true,
        "applies": true
      },
      {
        "name": "ready_tasks_visible",
        "passed": true
      }
    ]
  }
}

If you run the capstone with --block-term agent, the same record shape stays intact, but the hook blocks the call and the governance eval still passes because the policy worked as intended:

{
  "time": "2026-05-06T00:00:00Z",
  "goal": "Find agent mentions in sample docs and leave an auditable trail.",
  "host": {
    "approved_by": "toy-user"
  },
  "tool": "term_count",
  "arguments": {
    "term": "agent",
    "files": [
      "labs/sample_docs/agents.txt",
      "labs/sample_docs/protocols.txt"
    ]
  },
  "policy": {
    "allowed": false,
    "reason": "blocked sensitive term: agent"
  },
  "ready_tasks": [
    "add-hook"
  ],
  "tool_result": {
    "ok": false,
    "error": "blocked_by_hook"
  },
  "eval": {
    "passed": true,
    "checks": [
      {
        "name": "has_policy_decision",
        "passed": true
      },
      {
        "name": "blocked_calls_do_not_succeed",
        "passed": true
      },
      {
        "name": "successful_calls_include_summary",
        "passed": true,
        "applies": false
      },
      {
        "name": "ready_tasks_visible",
        "passed": true
      }
    ]
  }
}

That blocked example is useful because it shows the audit value clearly. You still get a durable record of the attempted run and the decision that blocked it. A blocked call passes governance — the policy did its job.
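The two sample records suggest how the eval checks might be structured. The sketch below reproduces the check names and the applies flag seen in the output above, but the logic is inferred from those samples and is not the Lab 10 source. In particular, treating ready_tasks_visible as "the key is present" is an assumption.

```python
def evaluate(record: dict) -> dict:
    """Score one run record against a few governance checks."""
    policy = record.get("policy", {})
    result = record.get("tool_result", {})
    succeeded = result.get("ok") is True

    checks = [{"name": "has_policy_decision", "passed": "allowed" in policy}]
    if policy.get("allowed") is False:
        # A blocked call must not report success.
        checks.append({"name": "blocked_calls_do_not_succeed", "passed": not succeeded})
    checks.append(
        {
            "name": "successful_calls_include_summary",
            "passed": (not succeeded) or "summary" in result,
            "applies": succeeded,  # only meaningful when the call succeeded
        }
    )
    checks.append({"name": "ready_tasks_visible", "passed": "ready_tasks" in record})
    return {"passed": all(check["passed"] for check in checks), "checks": checks}


blocked_eval = evaluate(
    {
        "policy": {"allowed": False, "reason": "blocked sensitive term: agent"},
        "ready_tasks": ["add-hook"],
        "tool_result": {"ok": False, "error": "blocked_by_hook"},
    }
)
```

Note that the eval scores governance, not task success: a correctly blocked call passes, while a successful call with a malformed result would fail.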

Try this

Three small changes that make the integration more obvious.

  1. Run it with --block-term agent. Use python3 11-capstone/capstone.py --block-term agent, then compare the new capstone_run.json with the successful one. The capstone always searches for the term agent. Blocking agent is what triggers the policy gate.
  2. Mark add-hook done in tasks.json. Then run the capstone again and watch ready_tasks change to agent-loop. That is a quick way to see that the capstone is reading durable state rather than hard-coding the answer.
  3. Break the protocol response on purpose. Remove summary from the term-count result in the protocol lab and re-run. The eval should fail on successful_calls_include_summary, which makes the dependency between Labs 03, 10, and 11 very visible.
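When comparing two saved runs, a small helper that lists which top-level keys differ can save some squinting. This is a hypothetical convenience for the exercises above, not part of the lab repo. It ignores time by default since that changes on every run.

```python
def changed_keys(before: dict, after: dict, ignore: tuple[str, ...] = ("time",)) -> list[str]:
    """List top-level keys whose values differ between two run records."""
    keys = sorted(set(before) | set(after))
    return [key for key in keys if key not in ignore and before.get(key) != after.get(key)]


# Abbreviated stand-ins for a successful and a blocked capstone_run.json.
success = {
    "time": "2026-05-06T18:24:16Z",
    "tool": "term_count",
    "policy": {"allowed": True},
    "tool_result": {"ok": True},
}
blocked = {
    "time": "2026-05-06T00:00:00Z",
    "tool": "term_count",
    "policy": {"allowed": False},
    "tool_result": {"ok": False, "error": "blocked_by_hook"},
}
diff = changed_keys(success, blocked)
```

To use it on real runs, copy capstone_run.json aside before the second run and load both files with json.load.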

What you built

You ran the full stack in one command.

To be clear, this is not a simulation anymore. This one script now composes five earlier labs in one real run: the Lab 03 protocol server, Lab 05 hook filtering, Lab 07 task state, Lab 09 approval gate, and Lab 10 eval pass. The tool result they pass around keeps the Lab 02 envelope shape.

In other words, you did not just repeat the same pattern five times. You wired separate pieces into one working system, and changes in those upstream labs can show up here immediately. That is the real integration payoff of the sequence.

Concepts behind this

If you want to zoom back out, compare this lab with the stack view and the tooling catalog. The capstone is small, but the architectural shape is the same one larger systems keep repeating.

It is also a good moment to re-read Agents with fresher eyes. The agent loop is not magic. It is a chain of boundaries, state, and enforcement points that can be inspected.