What you'll build
A tiny governance runner that checks saved tool calls.
By the end of this lab you will have a script that loads previously saved
tool call records from JSON, evaluates each one against a few policy checks,
and writes the results as JSONL. Each line says, in a machine-readable way,
whether that call passed governance expectations.
The important idea is not the size of the script. It is that governance lives
in code. Instead of saying "someone should probably review this kind of call,"
you write rules that can be versioned, tested, rerun, and dropped into CI.
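The load, evaluate, and write flow described above can be sketched in a few lines of standard-library Python. This is an illustration of the shape only, not the lab's actual eval_runner.py: the file name saved_calls.json, the helper names run, placeholder_evaluate, and demo, and the has_tool_name rule are all assumptions made up for the example.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Callable, Dict, List, Tuple

Call = Dict[str, Any]


def run(calls_path: Path, out_path: Path,
        evaluate: Callable[[Call], Dict[str, Any]]) -> int:
    """Evaluate every saved call, write one JSON object per line to
    out_path, and return the number of calls that failed."""
    calls = json.loads(calls_path.read_text(encoding="utf-8"))
    failed = 0
    with out_path.open("w", encoding="utf-8") as out:
        for call in calls:
            verdict = evaluate(call)
            stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
            # JSONL: exactly one serialized record per line.
            out.write(json.dumps({"time": stamp, **verdict}) + "\n")
            if not verdict["passed"]:
                failed += 1
    return failed


def placeholder_evaluate(call: Call) -> Dict[str, Any]:
    """Hypothetical stand-in rule: a call passes if it names a tool."""
    passed = bool(call.get("tool"))
    return {
        "tool": call.get("tool"),
        "passed": passed,
        "checks": [{"name": "has_tool_name", "passed": passed}],
    }


def demo() -> Tuple[int, List[str]]:
    """Run the sketch against two made-up calls in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        calls_path = Path(tmp) / "saved_calls.json"  # hypothetical file name
        out_path = Path(tmp) / "eval_results.jsonl"
        calls_path.write_text(json.dumps([{"tool": "term_count"}, {}]),
                              encoding="utf-8")
        failed = run(calls_path, out_path, placeholder_evaluate)
        return failed, out_path.read_text(encoding="utf-8").splitlines()
```

Because the rule is an ordinary function passed into run, it can be unit-tested on its own and swapped out without touching the file handling.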
Run it
cd ai_ecosystem_labs
python3 10-governance/eval_runner.py
Starting here? Quick setup
git clone https://github.com/BanditF/ai_ecosystem_labs
cd ai_ecosystem_labs
python3 10-governance/eval_runner.py
Requires Python 3.8+. This lab uses only the standard library.
Time guide. Setup: ~2 min. Working through it: 20–35 min, most of it spent tracing how the policy checks, logs, and eval output fit together.
Why this piece exists
You want policy to be checkable, not just discussable.
Once an agent can call tools, "did it behave correctly?" stops being a vague
product question and becomes an engineering question. You need records of what
happened, plus rules that tell you whether each action matched the intended policy.
Otherwise governance lives in memory, Slack threads, or a page nobody reruns.
This lab keeps things deliberately small: load saved calls, inspect the policy
decision and the returned result, and log named checks. But the shape scales.
The same pattern works for safety checks, approvals, PII handling, rate limits,
or any other rule you want enforced consistently.
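One way to picture "named checks" is as small functions that each return a name and a verdict. The sketch below borrows two check names from this lab's output, but the record layout is an assumption: the policy, decision, result, ok, and summary keys are hypothetical and may not match the lab's saved calls.

```python
from typing import Any, Callable, Dict, List

# Hypothetical record shape; the key names are illustrative assumptions.
Record = Dict[str, Any]
CheckResult = Dict[str, Any]


def has_policy_decision(record: Record) -> CheckResult:
    """Every saved call should carry an explicit allow/block decision."""
    decision = (record.get("policy") or {}).get("decision")
    return {"name": "has_policy_decision",
            "passed": decision in ("allow", "block")}


def successful_calls_include_summary(record: Record) -> CheckResult:
    """Only applies to successful calls; those must include a summary."""
    result = record.get("result") or {}
    applies = bool(result.get("ok"))
    passed = (not applies) or bool(result.get("summary"))
    return {"name": "successful_calls_include_summary",
            "passed": passed, "applies": applies}


CHECKS: List[Callable[[Record], CheckResult]] = [
    has_policy_decision,
    successful_calls_include_summary,
]


def evaluate(record: Record) -> Dict[str, Any]:
    """Run every named check; the call passes only if all checks pass."""
    checks = [check(record) for check in CHECKS]
    return {
        "tool": record.get("tool"),
        "passed": all(c["passed"] for c in checks),
        "checks": checks,
    }


if __name__ == "__main__":
    sample = {"tool": "term_count",
              "policy": {"decision": "allow"},
              "result": {"ok": True}}  # allowed and successful, no summary
    print(evaluate(sample))
```

Adding a new rule (say, a PII or rate-limit check) then means writing one more function and appending it to CHECKS, which keeps each rule individually testable.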
Real-world analog: this is a bit like replaying old API requests against a test
suite after you change authorization logic. You are not redoing the live work.
You are asking whether the recorded behavior still satisfies the rules.
Expected output
What an intentional eval failure looks like.
The script prints a JSON object to stdout and also writes one JSON record
per evaluated call to eval_results.jsonl. The command exits non-zero because
one saved call is deliberately malformed; that's the lesson. With the saved
calls in this lab, the JSONL log includes two passing records and one
intentionally failing one:
{"time": "2025-...Z", "tool": "term_count", "passed": true, "checks": [{"name": "has_policy_decision", "passed": true}, {"name": "successful_calls_include_summary", "passed": true, "applies": true}]}
{"time": "2025-...Z", "tool": "term_count", "passed": true, "checks": [{"name": "has_policy_decision", "passed": true}, {"name": "blocked_calls_do_not_succeed", "passed": true}, {"name": "successful_calls_include_summary", "passed": true, "applies": false}]}
{"time": "2025-...Z", "tool": "term_count", "passed": false, "checks": [{"name": "has_policy_decision", "passed": true}, {"name": "successful_calls_include_summary", "passed": false, "applies": true}]}
The third line is the interesting one. That saved call was allowed and marked
successful, but it has no summary, so
successful_calls_include_summary fails. The failure is intentional, so the
non-zero exit code is expected.
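To see how a log like this can drive an exit code in CI, here is a minimal sketch that scans JSONL lines and reports failed checks. It relies only on the tool, passed, checks, and name fields visible in the records above; the helper names failing_checks and exit_code and the abbreviated SAMPLE lines are assumptions, not part of the lab's script.

```python
import json
from typing import Iterable, List

# Abbreviated stand-ins for two lines of eval_results.jsonl.
SAMPLE = [
    '{"time": "2025-...Z", "tool": "term_count", "passed": true, '
    '"checks": [{"name": "has_policy_decision", "passed": true}]}',
    '{"time": "2025-...Z", "tool": "term_count", "passed": false, '
    '"checks": [{"name": "has_policy_decision", "passed": true}, '
    '{"name": "successful_calls_include_summary", "passed": false, '
    '"applies": true}]}',
]


def failing_checks(jsonl_lines: Iterable[str]) -> List[str]:
    """Return "tool:check_name" for every failed check in failed records."""
    failures: List[str] = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        if record.get("passed", False):
            continue
        for check in record.get("checks", []):
            if not check.get("passed", False):
                failures.append(f'{record.get("tool")}:{check.get("name")}')
    return failures


def exit_code(jsonl_lines: Iterable[str]) -> int:
    """0 when every record passed, 1 otherwise (the usual CI convention)."""
    return 1 if failing_checks(jsonl_lines) else 0


if __name__ == "__main__":
    for failure in failing_checks(SAMPLE):
        print("FAILED", failure)
    print("exit code:", exit_code(SAMPLE))
```

Against a real log, the same functions accept an open file handle, for example failing_checks(open("eval_results.jsonl", encoding="utf-8")), and the result of exit_code can be passed to sys.exit.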