Build a systematic evaluation framework for LLM outputs: test cases, assertion types, scoring, and tag-based filtering.
What you'll build
By the end of this lab you will have a Python script that treats LLM behavior like something you can test instead of something you just eyeball. Each eval case has an input, an expected behavior, an assertion type, and optional tags.
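As a rough sketch of what one eval case could look like, here is a small dataclass with the four pieces described above. The field names, the prompt text, and the tag value are assumptions made for illustration; the lab's actual eval_suite.py may define them differently.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class EvalCase:
    """One test case for a model output (field names are illustrative)."""
    name: str                 # identifier shown in the report
    input: str                # prompt sent to the model
    expected: Any             # what the assertion compares against
    assertion: str = "exact"  # "exact", "contains", "regex", "schema", or "custom"
    tags: list[str] = field(default_factory=list)  # optional labels for filtering


# Hypothetical case: the prompt and tag are made up for this example.
case = EvalCase(
    name="math_basic",
    input="What is 2 + 2?",
    expected="4",
    assertion="exact",
    tags=["math"],
)
print(case.name, case.assertion, case.tags)
```

Using default_factory for tags keeps each case's tag list separate, so a case created without tags gets its own empty list.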
This is a direct extension of the Lab 10 governance eval_runner idea. Lab 10 replayed saved tool-call records against policy checks. Lab 16 uses the same runner mindset for model outputs: define the cases, run the assertions, and compare changes against a baseline.
cd ai_ecosystem_labs
python3 16-eval-suite/eval_suite.py
python3 16-eval-suite/eval_suite.py safety
The first command runs every case. The second runs only the cases tagged safety.
Time guide. Setup: ~2 min. Working through it: 20–40 min depending on how much time you spend reading the assertion and scoring pieces.
The code
Walk through it
exact
Use this when the answer really should be identical, like a known math result or a fixed routing label.
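A minimal sketch of an exact-match check might look like this. The function name assert_exact is hypothetical, and stripping surrounding whitespace before comparing is an assumption, not necessarily what the lab's script does.

```python
def assert_exact(output: str, expected: str) -> bool:
    """Pass only when output equals expected, ignoring surrounding whitespace.

    Hypothetical helper; the whitespace stripping is an assumption.
    """
    return output.strip() == expected.strip()


print(assert_exact("4\n", "4"))              # True: trailing newline is ignored
print(assert_exact("The answer is 4", "4"))  # False: extra wording fails exact match
```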
contains and not_contains
These are useful when wording can vary a bit but specific concepts must appear, or dangerous terms should stay out.
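One way to sketch these two checks is shown below. The function names, the case-insensitive matching, and the sample strings are all assumptions for illustration.

```python
def assert_contains(output: str, required: list[str]) -> bool:
    """Pass when every required term appears in the output (case-insensitive)."""
    text = output.lower()
    return all(term.lower() in text for term in required)


def assert_not_contains(output: str, forbidden: list[str]) -> bool:
    """Pass when none of the forbidden terms appear in the output (case-insensitive)."""
    text = output.lower()
    return not any(term.lower() in text for term in forbidden)


# Hypothetical outputs, made up for this example.
print(assert_contains("The capital of France is Paris.", ["paris"]))           # True
print(assert_not_contains("I can't help with that request.", ["here is how"]))  # True
```

Plain substring matching is blunt: "paris" would also match inside "comparison". If that matters for your cases, a word-boundary regex is the safer choice.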
regex
Regex is a good middle ground when you care about patterns more than exact phrasing. The color-list example uses this style.
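Here is a small sketch of a regex check. The pattern is a guess at what a color-list case could test (exactly three comma-separated words); the lab's real pattern is not shown in this text, so treat both the pattern and the helper name as assumptions.

```python
import re


def assert_regex(output: str, pattern: str) -> bool:
    """Pass when the pattern is found somewhere in the output."""
    return re.search(pattern, output) is not None


# Hypothetical pattern: three comma-separated words and nothing else.
THREE_ITEMS = r"^\s*\w+(?:\s*,\s*\w+){2}\s*$"

print(assert_regex("red, green, blue", THREE_ITEMS))             # True
print(assert_regex("Sure! Here are some colors.", THREE_ITEMS))  # False
```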
schema
If the model is supposed to emit structured JSON, checking required keys is often more useful than judging prose quality.
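A required-keys check can be sketched with the standard json module. The helper name and the sample payloads are hypothetical. This version only checks that top-level keys are present; it does not validate value types.

```python
import json


def assert_schema(output: str, required_keys: list[str]) -> bool:
    """Pass when output parses as a JSON object containing every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False  # not valid JSON at all
    return isinstance(data, dict) and all(key in data for key in required_keys)


# Hypothetical payloads, made up for this example.
print(assert_schema('{"name": "Ada", "age": 36}', ["name", "age"]))  # True
print(assert_schema('{"name": "Ada"}', ["name", "age"]))             # False
print(assert_schema("not json", ["name"]))                           # False
```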
custom
Some checks are easier to express in Python than in a built-in assertion. Here the summary case checks that the output is shorter than the source.
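A custom check can be as simple as a function that takes the output and returns True or False. The sketch below assumes that shape; the helper names and the sample source text are invented for illustration.

```python
from typing import Callable


def assert_custom(output: str, check: Callable[[str], bool]) -> bool:
    """Pass when the user-supplied check function accepts the output."""
    return bool(check(output))


# Hypothetical source text for a summary case.
SOURCE = "The quick brown fox jumps over the lazy dog near the quiet river bank."


def shorter_than_source(output: str) -> bool:
    """A summary should be strictly shorter than the text it summarizes."""
    return len(output) < len(SOURCE)


print(assert_custom("A fox jumps over a dog.", shorter_than_source))  # True
print(assert_custom(SOURCE + " Extra.", shorter_than_source))         # False
```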
Tags let you run only the slice you care about. That is handy when you want to focus on safety, structured output, or quality without running everything.
Expected output
Running 6 eval cases...
✓ PASS math_basic
✓ PASS geography_contains
✓ PASS list_format
✓ PASS json_schema
✓ PASS safety_refusal
✓ PASS summary_length
────────────────────────────────────────
Results: 6/6 passed (100%)
The point is not that these mock cases are impressive. The point is that the runner gives you a repeatable score before and after changes.
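To make the before-and-after comparison concrete, here is a rough sketch of scoring a run and listing regressions against a saved baseline. The function names are hypothetical, and the "current" results are invented to show a regression; they are not output from the lab.

```python
def score(results: dict[str, bool]) -> float:
    """Fraction of cases that passed (0.0 when there are no cases)."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)


def regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Names of cases that passed in the baseline but do not pass now."""
    return sorted(
        name for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )


names = ["math_basic", "geography_contains", "list_format",
         "json_schema", "safety_refusal", "summary_length"]
baseline = {name: True for name in names}   # the 6/6 run shown above
current = {**baseline, "json_schema": False}  # made-up run after a prompt change

passed = sum(current.values())
print(f"Results: {passed}/{len(current)} passed ({score(current):.0%})")  # Results: 5/6 passed (83%)
print("Regressions:", regressions(baseline, current))  # Regressions: ['json_schema']
```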
Try this
Add an EvalCase that uses the contains assertion.
Read Evals for the bigger concept page, then compare this lab with Lab 10. Both use the same underlying move: turn fuzzy judgment into named checks you can rerun.
This kind of local runner is often enough for task-specific evals. If you later need dashboards, production traces, or LLM-as-judge scoring, you can graduate to tools like promptfoo, Braintrust, or Langfuse without losing the core pattern.