Evals are how you stop prompt work from turning into guesswork. Plain English first: they are tests for model behavior. Technical version: they are repeatable input-plus-assertion cases you can rerun whenever prompts, models, tools, or policies change.
You can't improve what you can't measure. Running an LLM without evals is flying blind. Evals let you catch regressions when you change a prompt, compare models objectively, quantify quality improvements, and build confidence before deploying.
The uncomfortable truth: most people skip evals until something breaks in production. Don't do that.
If a prompt or model change matters enough to ship, it matters enough to measure. Even 20 to 50 honest test cases will teach you more than a week of vibes-based comparison.
Section 1
An eval is a test case: an input, an expected behavior, and an assertion.
Simple evals check for exact matches or keywords. More complex ones use another LLM as a judge, or check statistical properties of outputs at scale.
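To make the input, expected behavior, and assertion triple concrete, here is a minimal sketch. The names `EvalCase`, `run_case`, and `fake_model` are invented for illustration, and `fake_model` is a stand-in for a real model call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One eval: an input, a statement of expected behavior, and an assertion."""
    name: str
    prompt: str                    # the input sent to the model
    expected: str                  # human-readable description of the expected behavior
    check: Callable[[str], bool]   # the assertion, applied to the model output


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call (assumption for this sketch).
    return "refund" if "money back" in prompt else "other"


def run_case(case: EvalCase, model: Callable[[str], str]) -> bool:
    """Send the input to the model and apply the assertion to its output."""
    return case.check(model(case.prompt))


case = EvalCase(
    name="routes refund requests",
    prompt="I want my money back for order 1182.",
    expected="label is exactly 'refund'",
    check=lambda output: output.strip() == "refund",
)
passed = run_case(case, fake_model)
```

Keeping the assertion as a plain callable means the same case structure can hold an exact-match check today and a judge-based check later.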
Deterministic evals
What they do: exact match, substring match, regex, or JSON schema validation.
Why they matter: fast, reliable, and essentially free to run, since grading the output needs no extra model call. They are a great fit for classification, structured output, and routing tasks.
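The four deterministic checks can be sketched with the standard library alone. The function names are illustrative, and `valid_json_with_fields` is a simplified stand-in for real JSON Schema validation: it only checks that the output parses to an object and that the listed keys have the expected Python types.

```python
import json
import re


def exact_match(output: str, expected: str) -> bool:
    # Ignore leading and trailing whitespace, which models often add.
    return output.strip() == expected.strip()


def contains(output: str, needle: str) -> bool:
    # Case-insensitive substring match.
    return needle.lower() in output.lower()


def matches_regex(output: str, pattern: str) -> bool:
    return re.search(pattern, output) is not None


def valid_json_with_fields(output: str, required: dict) -> bool:
    """Simplified stand-in for schema validation (assumption, not a full validator)."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    # A missing key yields None, which fails the isinstance check.
    return all(isinstance(data.get(key), typ) for key, typ in required.items())


route_output = '{"intent": "refund", "confidence": 0.92}'
schema_ok = valid_json_with_fields(route_output, {"intent": str, "confidence": float})
regex_ok = matches_regex("Order #A-1182 shipped", r"#[A-Z]-\d{4}")
```

For production use, a dedicated schema validator is likely a better fit than the hand-rolled field check shown here.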
What they do: length checks, readability checks, and format validators.
Why they matter: they are good at catching obvious failures even when exact answers are not the point.
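As a rough sketch of these checks, the helpers below cover one length check, one crude readability proxy, and one format validator. The names and thresholds are illustrative assumptions, and average words per sentence is only a stand-in for a proper readability score.

```python
import re


def within_length(output: str, min_words: int, max_words: int) -> bool:
    """Length check: word count falls inside an inclusive range."""
    return min_words <= len(output.split()) <= max_words


def average_sentence_length(output: str) -> float:
    """Crude readability proxy: mean number of words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", output) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def is_bulleted_list(output: str) -> bool:
    """Format validator: every non-empty line starts with '- '."""
    lines = [line for line in output.splitlines() if line.strip()]
    return bool(lines) and all(line.lstrip().startswith("- ") for line in lines)


summary = "Evals are tests. They catch regressions early."
length_ok = within_length(summary, 5, 20)
avg_len = average_sentence_length(summary)
```

None of these checks say whether the answer is right. They only flag outputs that are obviously too long, too dense, or in the wrong shape.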
What it does: an LLM-as-judge eval uses a second, usually stronger, model to score outputs against a rubric.
Why it matters: more expensive, but it can capture quality dimensions that are hard to assert programmatically.
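A judge-based check might look roughly like this. The rubric wording, the `SCORE: <n>` reply format, the threshold, and every function name here are assumptions for illustration, and `stub_judge` replaces a real call to a judge model so the sketch runs offline.

```python
import re
from typing import Callable, Optional

# Illustrative rubric; a real one would be tuned to the task.
RUBRIC = (
    "Score the ANSWER from 1 to 5 for how well it addresses the QUESTION.\n"
    "5 = accurate, complete, and clear; 1 = wrong or off-topic.\n"
    "Reply with 'SCORE: <n>' on the first line, then one sentence of reasoning."
)


def build_judge_prompt(question: str, answer: str) -> str:
    return f"{RUBRIC}\n\nQUESTION:\n{question}\n\nANSWER:\n{answer}"


def parse_score(judge_reply: str) -> Optional[int]:
    """Extract a 1-5 score; return None when the judge ignored the format."""
    match = re.search(r"SCORE:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None


def judge_passes(question: str, answer: str,
                 call_judge: Callable[[str], str], threshold: int = 4) -> bool:
    score = parse_score(call_judge(build_judge_prompt(question, answer)))
    # An unparseable reply counts as a failure rather than a silent pass.
    return score is not None and score >= threshold


def stub_judge(prompt: str) -> str:
    # Hypothetical stand-in for a real judge-model call.
    return "SCORE: 4\nCovers the main point but omits an example."


verdict = judge_passes("What is an eval?", "A test for model behavior.", stub_judge)
```

Treating a malformed judge reply as a failure keeps parsing problems visible instead of letting them inflate the pass rate.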
Section 2
MMLU, HumanEval, MATH, and HellaSwag are standardized test suites that compare models head to head. They are useful for shortlisting models, not for telling you how a model will behave in your exact application.
A model can score 90% on MMLU and still fail your use case.
These are cases built from your real inputs and your actual expectations. They are what really matter once you are building something concrete.
Build 20 to 50 of them before you ship anything.
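One way to keep application-specific cases is as plain data, one JSON object per line, with a pass rate computed over the set. The three cases below are invented examples, not real traffic, and `keyword_classifier` is a hypothetical stand-in for the model under test.

```python
import json

# Illustrative cases only; in practice these come from your real inputs.
CASES_JSONL = """\
{"input": "Where is my package?", "expected_label": "shipping"}
{"input": "I was charged twice.", "expected_label": "billing"}
{"input": "Cancel my subscription.", "expected_label": "cancellation"}
"""


def load_cases(jsonl_text: str) -> list:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def keyword_classifier(text: str) -> str:
    # Hypothetical stand-in for the model under test.
    lowered = text.lower()
    if "package" in lowered:
        return "shipping"
    if "charged" in lowered:
        return "billing"
    return "other"


def pass_rate(cases: list, model) -> float:
    """Fraction of cases where the model output equals the expected label."""
    passed = sum(model(case["input"]) == case["expected_label"] for case in cases)
    return passed / len(cases)


cases = load_cases(CASES_JSONL)
rate = pass_rate(cases, keyword_classifier)  # the stand-in misses the cancellation case
```

Storing cases as data rather than code makes it easy to add a new one every time a real input surprises you.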
Section 3
Section 4
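This is basically TDD (test-driven development) for prompts: write the eval case first, watch it fail, then change the prompt until it passes.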
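A test-first loop for a prompt can be sketched as follows. The two prompt templates and `fake_model` are invented for illustration; the stand-in model replies in JSON only when the prompt asks for it, which is enough to show the check failing first and passing after the prompt change.

```python
def fake_model(prompt: str) -> str:
    # Hypothetical stand-in: replies in JSON only when told to.
    if "Respond with JSON" in prompt:
        return '{"sentiment": "positive"}'
    return "The sentiment is positive."


def outputs_json_object(output: str) -> bool:
    """The eval written first: output must look like a single JSON object."""
    text = output.strip()
    return text.startswith("{") and text.endswith("}")


PROMPT_V1 = "Classify the sentiment of: {text}"
PROMPT_V2 = "Classify the sentiment of: {text}\nRespond with JSON only."


def eval_prompt(template: str, model) -> bool:
    return outputs_json_object(model(template.format(text="I love this product")))


red = eval_prompt(PROMPT_V1, fake_model)    # fails: the check exists before the fix
green = eval_prompt(PROMPT_V2, fake_model)  # passes after the prompt change
```

The failing run matters as much as the passing one: it shows the check can actually detect the problem it was written for.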
Section 5
CLI and CI eval runner. You write cases in YAML and compare providers or prompts side by side.
Eval runs, traces, and datasets in one place. More end-to-end quality tracking than a simple local runner.
Useful for online evals on production traffic, where the interesting failures only show up after real usage starts.
Often the right answer for task-specific evals. If your checks are simple and local, a small Python runner is usually enough.
If you want the small, concrete version, go to Lab 16. If you want the governance-flavored ancestor of the same idea, look at Lab 10 and its eval_runner pattern.
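For a sense of how small a local runner can be, here is a sketch. `Case`, `run_suite`, `main`, and `shouting_model` are illustrative names, not taken from Lab 16 or Lab 10, and the model is a trivial stand-in.

```python
from typing import Callable, NamedTuple


class Case(NamedTuple):
    name: str
    prompt: str
    check: Callable[[str], bool]


def run_suite(cases: list, model: Callable[[str], str]) -> list:
    """Run every case and return the names of the ones that failed."""
    failures = []
    for case in cases:
        try:
            ok = case.check(model(case.prompt))
        except Exception:
            ok = False  # a crash in the model or the check counts as a failure
        if not ok:
            failures.append(case.name)
    return failures


def main(cases: list, model: Callable[[str], str]) -> int:
    """Print a summary and return an exit code (0 = all passed)."""
    failures = run_suite(cases, model)
    print(f"{len(cases) - len(failures)}/{len(cases)} passed")
    for name in failures:
        print(f"FAIL {name}")
    return 1 if failures else 0


def shouting_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return prompt.upper()


CASES = [
    Case("uppercases input", "hello", lambda out: out == "HELLO"),
    Case("adds a greeting", "hello", lambda out: out.startswith("Hi")),
]
exit_code = main(CASES, shouting_model)  # a CI script would pass this to sys.exit
```

Returning a nonzero exit code on any failure is what lets a CI job block a prompt or model change that breaks a case.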
Section 6
Evals test what you thought to test. They will not catch failure modes you did not anticipate.
A passing eval suite is a floor, not a ceiling. Combine evals with production monitoring such as Langfuse or Helicone if you want to catch the failures that only appear in real traffic.