Build a systematic eval suite

Build a systematic evaluation framework for LLM outputs: test cases, assertion types, scoring, and tag-based filtering.

What you'll build

A small eval harness you can rerun every time the prompt changes.

By the end of this lab you will have a Python script that treats LLM behavior like something you can test instead of something you just eyeball. Each eval case has an input, an expected behavior, an assertion type, and optional tags.
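The four parts of an eval case map naturally onto a small record type. The sketch below is one way to express that shape; the field names and the `math` tag are assumptions for illustration and may not match what eval_suite.py actually uses.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class EvalCase:
    # Field names are hypothetical; the lab's script may name them differently.
    name: str        # identifier printed in the results
    input: str       # prompt sent to the model
    expected: Any    # what the assertion compares against
    assertion: str   # e.g. "exact", "contains", "regex"
    tags: list[str] = field(default_factory=list)  # optional, empty by default


case = EvalCase(
    name="math_basic",
    input="What is 2 + 2?",
    expected="4",
    assertion="exact",
    tags=["math"],
)
```

Using `default_factory` keeps tags optional without sharing one list object between cases.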

This lab directly extends the eval_runner idea from the Lab 10 governance work. Lab 10 replayed saved tool-call records against policy checks. Lab 16 applies the same runner approach to model outputs: define the cases, run the assertions, and compare the results against a baseline.

Run it

cd ai_ecosystem_labs
python3 16-eval-suite/eval_suite.py
python3 16-eval-suite/eval_suite.py safety

The first command runs every case. The second runs only the cases tagged safety.

Time guide. Setup: ~2 min. Working through it: 20–40 min depending on how much time you spend reading the assertion and scoring pieces.

The code

eval_suite.py

Walk through it

Six assertion shapes in one small runner.

exact

Use this when the answer really should be identical, like a known math result or a fixed routing label.
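A minimal sketch of an exact check might look like the following. The function name is hypothetical, and stripping surrounding whitespace is an assumption made here so that a trailing newline does not count as a mismatch; the lab's script may compare more strictly.

```python
def check_exact(output: str, expected: str) -> bool:
    # Ignore leading/trailing whitespace, then require an identical string.
    return output.strip() == expected.strip()


ok = check_exact("4\n", "4")        # same answer, trailing newline ignored
mismatch = check_exact("four", "4") # different string, so the check fails
```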

contains and not_contains

These are useful when wording can vary but specific concepts must appear, or when dangerous terms must stay out.
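As a rough sketch, both checks reduce to substring tests. The function names are hypothetical, and the case-insensitive comparison is an assumption chosen here so that "Paris" and "paris" are treated alike.

```python
def check_contains(output: str, required: list[str]) -> bool:
    # Every required term must appear somewhere in the output.
    text = output.lower()
    return all(term.lower() in text for term in required)


def check_not_contains(output: str, forbidden: list[str]) -> bool:
    # None of the forbidden terms may appear in the output.
    text = output.lower()
    return not any(term.lower() in text for term in forbidden)


answer = "The capital of France is Paris."
has_concept = check_contains(answer, ["paris"])
stays_clean = check_not_contains(answer, ["password", "exploit"])
```

Plain substring matching is blunt: it cannot tell "Paris" from "Parisian", which is one reason a regex check is sometimes the better fit.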

regex

Regex is a good middle ground when you care about patterns more than exact phrasing. The color-list example uses this style.
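A regex check can be sketched with the standard `re` module. The pattern below, which expects a three-item numbered list, is an assumption for illustration and is not necessarily the pattern the lab's color-list case uses.

```python
import re


def check_regex(output: str, pattern: str) -> bool:
    # MULTILINE lets ^ and $ anchor at line boundaries inside the output.
    return re.search(pattern, output, flags=re.MULTILINE) is not None


# Hypothetical pattern: three consecutive lines numbered 1., 2., 3.
THREE_ITEM_LIST = r"^1\. .+\n2\. .+\n3\. .+$"

numbered = check_regex("1. red\n2. green\n3. blue", THREE_ITEM_LIST)
inline = check_regex("red, green, blue", THREE_ITEM_LIST)
```

Here the check cares only about the list structure, so any three colors pass while a comma-separated answer does not.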

schema

If the model is supposed to emit structured JSON, checking required keys is often more useful than judging prose quality.
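One simple way to check required keys is to parse the output and test membership. This is a sketch under assumptions: the function name and the example keys are hypothetical, and it checks only top-level keys, not value types.

```python
import json


def check_schema(output: str, required_keys: list[str]) -> bool:
    # Output that is not valid JSON fails outright.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    # Only a JSON object can carry named keys.
    if not isinstance(data, dict):
        return False
    return all(key in data for key in required_keys)


complete = check_schema('{"name": "Ada", "age": 36}', ["name", "age"])
missing_key = check_schema('{"name": "Ada"}', ["name", "age"])
not_json = check_schema("Sure! Here is the JSON you asked for.", ["name"])
```

Treating unparseable output as a plain failure, rather than raising, keeps one bad response from stopping the rest of the run.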

custom

Some checks are easier to express in Python than in a built-in assertion. Here the summary case checks that the output is shorter than the source.
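A custom check can be any function that returns True or False. The sketch below assumes a hypothetical signature that receives both the output and the source text; the lab's script may pass arguments differently.

```python
def shorter_than_source(output: str, source: str) -> bool:
    # A summary should be non-empty and strictly shorter than what it summarizes.
    return 0 < len(output) < len(source)


source_text = (
    "Evals turn fuzzy judgment into named checks. "
    "Each check can be rerun whenever the prompt changes."
)
good_summary = shorter_than_source("Evals make judgment repeatable.", source_text)
padded_summary = shorter_than_source(source_text + " Indeed.", source_text)
```

The extra `0 <` guard is a design choice made here so that an empty response does not pass as a very short summary.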

Tags and filtering

Tags let you run only the slice you care about. That is handy when you want to focus on safety, structured output, or quality without running everything.
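Filtering by tag can be sketched as a list comprehension driven by the first command-line argument. The helper names and the `math` and `structured` tags are assumptions; only the `safety` tag appears in this lab's commands.

```python
from typing import Optional

# Case names come from the lab's expected output; tag values other than
# "safety" are hypothetical.
CASES = [
    {"name": "math_basic", "tags": ["math"]},
    {"name": "json_schema", "tags": ["structured"]},
    {"name": "safety_refusal", "tags": ["safety"]},
]


def tag_from_argv(argv: list[str]) -> Optional[str]:
    # argv[0] is the script name; an optional tag follows it.
    return argv[1] if len(argv) > 1 else None


def filter_by_tag(cases: list[dict], tag: Optional[str]) -> list[dict]:
    # No tag means run everything.
    if tag is None:
        return list(cases)
    return [case for case in cases if tag in case["tags"]]


selected = filter_by_tag(CASES, tag_from_argv(["eval_suite.py", "safety"]))
everything = filter_by_tag(CASES, tag_from_argv(["eval_suite.py"]))
```

In a real script the argument list would come from `sys.argv` rather than a literal.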

Expected output

What a clean run looks like.

Running 6 eval cases...

✓ PASS  math_basic
✓ PASS  geography_contains
✓ PASS  list_format
✓ PASS  json_schema
✓ PASS  safety_refusal
✓ PASS  summary_length

────────────────────────────────────────
Results: 6/6 passed (100%)

The point is not that these mock cases are impressive. The point is that the runner gives you a repeatable score before and after changes.
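To make the before-and-after comparison concrete, the sketch below scores a run and lists cases that passed in a baseline but fail now. The function names and the result format (a mapping from case name to pass/fail) are assumptions, not the lab script's actual structure.

```python
def score(results: dict[str, bool]) -> tuple[int, int, int]:
    # Returns (passed, total, percent); an empty run scores 0%.
    passed = sum(results.values())
    total = len(results)
    percent = round(100 * passed / total) if total else 0
    return passed, total, percent


def regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    # A regression is a case that passed before and does not pass now.
    return sorted(
        name for name, ok in baseline.items()
        if ok and not current.get(name, False)
    )


baseline_run = {"math_basic": True, "safety_refusal": True}
current_run = {"math_basic": True, "safety_refusal": False}

passed, total, percent = score(current_run)
summary = f"Results: {passed}/{total} passed ({percent}%)"
broken = regressions(baseline_run, current_run)
```

Listing regressions by name is often more actionable than the percentage alone, because two runs with the same score can fail different cases.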

Try this

Three ways to make the pattern real.

  1. Run the suite and read each result so you can see how each assertion type behaves.
  2. Add a new EvalCase that uses the contains assertion.
  3. Intentionally break one mock response and watch the suite catch it.
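Steps 2 and 3 can be tried in miniature with a canned response. Everything in this sketch is hypothetical: the prompt, the mock table, and the helper names stand in for whatever eval_suite.py uses to fake model output.

```python
# Canned responses standing in for a real model call (hypothetical data).
MOCK_RESPONSES = {
    "What is the capital of France?": "The capital of France is Paris.",
}


def mock_model(prompt: str) -> str:
    # Unknown prompts return an empty string instead of raising.
    return MOCK_RESPONSES.get(prompt, "")


def run_contains_case(prompt: str, required: str) -> bool:
    # A contains-style case: the required term must appear in the response.
    return required.lower() in mock_model(prompt).lower()


prompt = "What is the capital of France?"
before = run_contains_case(prompt, "Paris")  # intact mock

# Deliberately break the mock response, then rerun the same case.
MOCK_RESPONSES[prompt] = "The capital of France is Lyon."
after = run_contains_case(prompt, "Paris")   # broken mock
```

The case itself never changes; only the response does, which is the situation the suite exists to detect.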

Concepts behind this

Read the Evals concept page for the bigger picture, then compare this lab with Lab 10. Both use the same underlying move: turn fuzzy judgment into named checks you can rerun.

What this is good for

This kind of local runner is often enough for task-specific evals. If you later need dashboards, production traces, or LLM-as-judge scoring, you can graduate to tools like promptfoo, Braintrust, or Langfuse without losing the core pattern.