Lab 10: Governance and Evals — AI Tooling Field Guide

What you'll build

A tiny governance runner that checks saved tool calls.

By the end of this lab you will have a script that loads previously saved tool call records from JSON, evaluates each one against a few policy checks, and writes the results as JSONL. Each line says, in a machine-readable way, whether that call passed governance expectations.

The important idea is not the size of the script. It is that governance lives in code. Instead of saying "someone should probably review this kind of call," you write rules that can be versioned, tested, rerun, and dropped into CI.

Run it

cd ai_ecosystem_labs
python3 10-governance/eval_runner.py

Starting here? Quick setup

git clone https://github.com/BanditF/ai_ecosystem_labs
cd ai_ecosystem_labs
python3 10-governance/eval_runner.py

Requires Python 3.8+. This lab uses only the standard library.

Time guide. Setup: ~2 min. Working through it: 20–35 min, mostly spent tracing policy checks, logs, and eval output together.

Why this piece exists

You want policy to be checkable, not just discussable.

Once an agent can call tools, "did it behave correctly?" stops being a vague product question and becomes an engineering question. You need records of what happened, plus rules that tell you whether each action matched the intended policy. Otherwise governance lives in memory, Slack threads, or a page nobody reruns.

This lab keeps things deliberately small: load saved calls, inspect the policy decision and the returned result, and log named checks. But the shape scales. The same pattern works for safety checks, approvals, PII handling, rate limits, or any other rule you want enforced consistently.

Real-world analog: this is a bit like replaying old API requests against a test suite after you change authorization logic. You are not redoing the live work. You are asking whether the recorded behavior still satisfies the rules.

The code

eval_runner.py

Walk through it

Four things worth noticing.

Evaluation runs over saved calls

The script reads saved_calls.json, not a live tool endpoint. That matters because you do not need to re-run the original calls to test policy. You can replay the record you already captured and check whether it meets the spec. This makes governance cheap to rerun and safe to automate.
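To make the replay idea concrete, here is a minimal sketch of loading recorded calls from disk. The record shape (tool, policy.allowed, result.ok, result.summary) and the load_saved_calls helper are assumptions for illustration; the lab's actual saved_calls.json and eval_runner.py may be organized differently.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical record shape; the lab's real saved_calls.json may differ.
SAMPLE_CALLS = [
    {"tool": "term_count", "policy": {"allowed": True},
     "result": {"ok": True, "summary": "3 terms"}},
    {"tool": "term_count", "policy": {"allowed": False},
     "result": {"ok": False}},
]


def load_saved_calls(path):
    """Read recorded calls from disk. Nothing here invokes a tool."""
    with Path(path).open(encoding="utf-8") as handle:
        calls = json.load(handle)
    if not isinstance(calls, list):
        raise ValueError("expected a JSON array of call records")
    return calls


# Demo: write the sample to a temporary file, then replay it from disk.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = Path(tmp) / "saved_calls.json"
    saved_path.write_text(json.dumps(SAMPLE_CALLS), encoding="utf-8")
    calls = load_saved_calls(saved_path)

print(f"loaded {len(calls)} recorded calls")
```

Because the loader only reads a file, it can run in CI or on a laptop without credentials, network access, or side effects.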

Checks are named booleans

Each evaluation emits a list of checks with a name and a passed field. That is simple on purpose. Humans can read it, scripts can parse it, and a quick grep can tell you which rule failed most often. You do not need a huge framework before the log becomes useful.
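A small sketch of that structure, and of the "which rule failed most often" question, might look like this. The helper names make_check, make_record, and failures_by_check are hypothetical; only the name/passed shape and the check names come from the lab's output.

```python
from collections import Counter


def make_check(name, passed):
    """One named boolean: the smallest useful unit of the eval log."""
    return {"name": name, "passed": bool(passed)}


def make_record(tool, checks):
    """A record passes only if every one of its checks passed."""
    return {"tool": tool,
            "passed": all(c["passed"] for c in checks),
            "checks": checks}


def failures_by_check(records):
    """Count how often each named check failed across all records."""
    counts = Counter()
    for record in records:
        for check in record["checks"]:
            if not check["passed"]:
                counts[check["name"]] += 1
    return counts


records = [
    make_record("term_count", [
        make_check("has_policy_decision", True),
        make_check("successful_calls_include_summary", True),
    ]),
    make_record("term_count", [
        make_check("has_policy_decision", True),
        make_check("successful_calls_include_summary", False),
    ]),
]

print(failures_by_check(records).most_common())
# [('successful_calls_include_summary', 1)]
```

Keeping each check as a flat name/passed pair is what makes this kind of tally a few lines of code rather than a reporting project.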

`allowed` and `ok` should agree

The main bug this lab is looking for is a mismatch between policy and result. If a call was allowed, the result should come back with ok: true. If the call was blocked, the result should say ok: false. When those disagree, something is wrong in the wrapper, the policy gate, or the logging path.
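The agreement rule can be sketched as a single check. The function and check name allowed_and_ok_agree are hypothetical, as is the assumption that records nest the fields under policy and result; the lab's own evaluate() may split this into separate rules.

```python
def allowed_and_ok_agree(call):
    """Hypothetical check: policy.allowed and result.ok must be equal booleans."""
    allowed = (call.get("policy") or {}).get("allowed")
    ok = (call.get("result") or {}).get("ok")
    # A missing or non-boolean value counts as a failure, not a pass.
    both_present = isinstance(allowed, bool) and isinstance(ok, bool)
    return {"name": "allowed_and_ok_agree",
            "passed": both_present and allowed == ok}


consistent = {"policy": {"allowed": False}, "result": {"ok": False}}
mismatch = {"policy": {"allowed": False}, "result": {"ok": True}}

print(allowed_and_ok_agree(consistent)["passed"])  # True
print(allowed_and_ok_agree(mismatch)["passed"])    # False
```

Treating missing fields as a failure is a design choice in this sketch: a record that cannot be checked should not silently count as compliant.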

Governance-as-code is the real pattern

The checks live in Python inside evaluate(), not in a doc or a Confluence page. That means they can be reviewed in Git, tested on every commit, and enforced in CI. Fairly quickly, that becomes the difference between "we have a policy" and "the system actually follows it."
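As one illustration of "tested on every commit," the rules can be pinned down with ordinary test functions that a runner such as pytest could collect. The evaluate() below is a simplified stand-in written for this sketch, not the lab's implementation, and the record shape is assumed.

```python
def evaluate(call):
    """Stand-in for the lab's evaluate(); the real rules may differ."""
    policy = call.get("policy") or {}
    result = call.get("result") or {}
    checks = [{"name": "has_policy_decision",
               "passed": isinstance(policy.get("allowed"), bool)}]
    if policy.get("allowed") is False:
        # Only blocked calls are subject to this rule.
        checks.append({"name": "blocked_calls_do_not_succeed",
                       "passed": result.get("ok") is not True})
    return {"passed": all(c["passed"] for c in checks), "checks": checks}


def test_blocked_call_that_fails_is_fine():
    call = {"policy": {"allowed": False}, "result": {"ok": False}}
    assert evaluate(call)["passed"]


def test_blocked_call_that_succeeds_is_flagged():
    call = {"policy": {"allowed": False}, "result": {"ok": True}}
    failed = [c["name"] for c in evaluate(call)["checks"] if not c["passed"]]
    assert failed == ["blocked_calls_do_not_succeed"]


def test_missing_policy_is_flagged():
    assert not evaluate({"result": {"ok": True}})["passed"]
```

Once tests like these sit next to the rules, a change that weakens a policy shows up as a failing test in review instead of as a surprise in production.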

Expected output

What an intentional eval failure looks like.

The script prints a JSON object to stdout and writes one JSON record per evaluated call to eval_results.jsonl. The command exits non-zero because one saved call is deliberately malformed; that is the lesson.

With the saved calls in this lab, the JSONL log includes two passing records and one intentionally failing one:

{"time": "2025-...Z", "tool": "term_count", "passed": true, "checks": [{"name": "has_policy_decision", "passed": true}, {"name": "successful_calls_include_summary", "passed": true, "applies": true}]}
{"time": "2025-...Z", "tool": "term_count", "passed": true, "checks": [{"name": "has_policy_decision", "passed": true}, {"name": "blocked_calls_do_not_succeed", "passed": true}, {"name": "successful_calls_include_summary", "passed": true, "applies": false}]}
{"time": "2025-...Z", "tool": "term_count", "passed": false, "checks": [{"name": "has_policy_decision", "passed": true}, {"name": "successful_calls_include_summary", "passed": false, "applies": true}]}

The third line is the interesting one. That saved call was allowed and marked successful, but it is missing the summary field, so successful_calls_include_summary fails. The record is malformed on purpose, so the non-zero exit is expected.
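The link between the log and the exit code can be sketched as follows. The write_jsonl and exit_status helpers are hypothetical, and the records are abbreviated versions of the three shown above; the real runner may compute its exit status differently.

```python
import io
import json


def write_jsonl(records, stream):
    """Write one JSON object per line, the JSONL convention."""
    for record in records:
        stream.write(json.dumps(record) + "\n")


def exit_status(records):
    """0 when every record passed, 1 otherwise, so CI can gate on it."""
    return 0 if all(record["passed"] for record in records) else 1


# Abbreviated records mirroring the pass/pass/fail pattern shown above.
results = [
    {"tool": "term_count", "passed": True},
    {"tool": "term_count", "passed": True},
    {"tool": "term_count", "passed": False},
]

buffer = io.StringIO()  # stands in for the eval_results.jsonl file handle
write_jsonl(results, buffer)

status = exit_status(results)
print(buffer.getvalue(), end="")
print("exit status:", status)  # a real runner would call sys.exit(status)
```

A single failing record is enough to flip the status to 1, which is what lets a CI job or hook treat any policy violation as a failed build.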

Try this

Three things to try before moving on.

Add a malformed saved record. Open saved_calls.json and add a case where policy.allowed is true but result.ok is false. Re-run the script and check which named rule fails. That is a fast way to build intuition for what the evaluator is actually asserting.
Add one more rule to evaluate(). For example, require that every saved call names the tool it invoked. Then run the saved calls again and see whether any records now fail. This is the governance-as-code move in miniature: change the policy in Python, rerun the evidence, inspect the diff.
Run eval_runner.sh and inspect it. Run ./10-governance/eval_runner.sh, read through what it does, and then adapt it to your own workflow, for example by running it as a pre-commit hook.

Concepts behind this

Read Agents for the bigger loop this wraps. Governance sits around the agent's planning and tool execution cycle rather than replacing it.

Then read Extensions for the practical places these checks often get attached: wrappers, hooks, and other thin layers around model or tool calls.

Next lab

Lab 11: wire the pieces together in the capstone →