Evals — AI Tooling Field Guide

Why evals matter

You can't improve what you can't measure. Running an LLM application without evals is flying blind. Evals let you catch regressions when you change a prompt, compare models objectively, quantify quality improvements, and build confidence before deploying.

The uncomfortable truth: most people skip evals until something breaks in production. Don't do that.

A practical rule

If a prompt or model change matters enough to ship, it matters enough to measure. Even 20 to 50 honest test cases will teach you more than a week of vibes-based comparison.

Section 1

What an eval is

An eval is a test case: an input, an expected behavior, and an assertion.

Simple evals check for exact matches or keywords. More complex ones use another LLM as a judge, or check statistical properties of outputs at scale.
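The three parts of an eval (input, expected behavior, assertion) map naturally onto a small data structure. The following is a minimal sketch, not a reference implementation: EvalCase, run_case, and fake_model are hypothetical names chosen for illustration, and fake_model stands in for a real model call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One eval: an input, a note on expected behavior, and an assertion."""
    name: str
    input_text: str
    expected: str                  # human-readable description of "good"
    check: Callable[[str], bool]   # assertion applied to the model output


def run_case(case: EvalCase, model: Callable[[str], str]) -> bool:
    """Call the model on the case input and apply the assertion."""
    output = model(case.input_text)
    return case.check(output)


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return "The capital of France is Paris."


case = EvalCase(
    name="capital-of-france",
    input_text="What is the capital of France?",
    expected="Mentions Paris",
    check=lambda out: "paris" in out.lower(),
)

print(run_case(case, fake_model))
```

Keeping the assertion as a plain callable means the same structure can hold a keyword check, a regex, or a call out to a judge model.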

Deterministic evals

What they do: exact match, substring match, regex, or JSON schema validation.

Why they matter: they are fast, reproducible, and effectively free to run. They are a great fit for classification, structured output, and routing tasks.
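As a rough illustration, each of the deterministic checks listed above fits in a few lines of standard-library Python. The function names are assumptions made for this sketch, and has_required_fields is a simplified stand-in for full JSON schema validation (a dedicated library such as jsonschema would be the usual choice for real schemas).

```python
import json
import re


def exact_match(output: str, expected: str) -> bool:
    """Exact match, ignoring leading and trailing whitespace."""
    return output.strip() == expected.strip()


def contains(output: str, needle: str) -> bool:
    """Case-insensitive substring match."""
    return needle.lower() in output.lower()


def matches_regex(output: str, pattern: str) -> bool:
    """True if the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


def has_required_fields(output: str, required: dict) -> bool:
    """Parse the output as JSON and check required keys and their Python types.

    Simplified stand-in for schema validation: `required` maps key -> type.
    """
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        key in data and isinstance(data[key], expected_type)
        for key, expected_type in required.items()
    )


reply = '{"label": "billing", "confidence": 0.93}'
print(has_required_fields(reply, {"label": str, "confidence": float}))
```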

Heuristic evals

What they do: length checks, readability checks, and format validators.

Why they matter: they are good at catching obvious failures even when exact answers are not the point.
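Heuristic checks can be sketched the same way. The helpers below are illustrative assumptions, not a standard: the word-count bounds, the words-per-sentence readability proxy, and the "every line starts with a dash" format rule are all placeholders to tune for your own task.

```python
import re


def within_length(output: str, min_words: int, max_words: int) -> bool:
    """Length check on whitespace-separated words."""
    n = len(output.split())
    return min_words <= n <= max_words


def average_sentence_length(output: str) -> float:
    """Crude readability proxy: mean number of words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", output) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def is_bulleted_list(output: str, min_items: int = 2) -> bool:
    """Format validator: every non-empty line starts with '- '."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    return len(lines) >= min_items and all(ln.startswith("- ") for ln in lines)


summary = "- Refund issued\n- Customer notified"
print(within_length(summary, 2, 50), is_bulleted_list(summary))
```

None of these prove an answer is good. They flag outputs that are clearly too long, too dense, or in the wrong shape, which is often where obvious failures show up first.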

LLM-as-judge

What it does: uses a second, usually stronger, model to score outputs against a rubric.

Why it matters: it costs more and runs slower than a programmatic check, but it can capture quality dimensions that are hard to assert programmatically.
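One possible shape for a judge-based check is shown below. Everything here is an assumption for illustration: the rubric text, the "SCORE: n" reply format, the pass threshold of 4, and the call_judge callable, which you would replace with a call to whichever judge model you use. stub_judge returns a canned reply so the sketch runs without network access.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric; the reply format is what makes the score parseable.
RUBRIC = (
    "Score the answer from 1 to 5 for factual accuracy and clarity.\n"
    "Reply with a line of the form: SCORE: <number>"
)


def build_judge_prompt(question: str, answer: str) -> str:
    return f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"


def parse_score(judge_reply: str) -> Optional[int]:
    """Extract the 1-5 score; return None if the judge went off format."""
    match = re.search(r"SCORE:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None


def judge(question: str, answer: str,
          call_judge: Callable[[str], str], threshold: int = 4) -> bool:
    """Pass if the judge returns a parseable score at or above the threshold."""
    score = parse_score(call_judge(build_judge_prompt(question, answer)))
    return score is not None and score >= threshold


def stub_judge(prompt: str) -> str:
    """Canned judge reply standing in for a real model call."""
    return "The answer is correct and clear.\nSCORE: 5"


print(judge("What is 2 + 2?", "4", stub_judge))
```

Treating an unparseable reply as a failure is a deliberate choice in this sketch: it keeps a misbehaving judge from silently passing cases.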

Section 2

Benchmark evals vs task-specific evals

Benchmarks

MMLU, HumanEval, MATH, and HellaSwag are standardized test suites used to compare models head to head. They are useful for model selection, not for judging your exact application.

A model can score 90% on MMLU and still fail your use case.

Task-specific evals

These are cases built from your real inputs and your actual expectations. They are what really matter once you are building something concrete.

Build 20 to 50 of them before you ship anything.

Section 3

Building an eval suite

Collect real inputs from users and logs, then add edge cases you can think of.
Define what good looks like for each one: exact answer, key concepts present, or format constraints.
Write assertions: exact match, contains, not-contains, regex, or schema.
Run against your current prompt baseline. That score is the thing to beat.
Make a change, re-run, and compare.
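The steps above can be sketched as a baseline-versus-candidate comparison. This is a toy example under stated assumptions: the four routing cases are invented, and baseline_prompt and candidate_prompt are keyword-based stand-ins for "the model running your current prompt" and "the model running your changed prompt".

```python
from typing import Callable

# Hypothetical cases: (input, assertion applied to the output).
CASES = [
    ("Refund for order 1182?", lambda out: out.strip() == "billing"),
    ("App crashes on login", lambda out: out.strip() == "technical"),
    ("Where is my invoice?", lambda out: out.strip() == "billing"),
    ("Cancel my subscription", lambda out: out.strip() == "account"),
]


def score(model: Callable[[str], str]) -> float:
    """Fraction of cases whose assertion passes."""
    passed = sum(1 for text, check in CASES if check(model(text)))
    return passed / len(CASES)


def baseline_prompt(text: str) -> str:
    """Stand-in for the current prompt: it has no 'account' route."""
    lowered = text.lower()
    if "refund" in lowered or "invoice" in lowered:
        return "billing"
    return "technical"


def candidate_prompt(text: str) -> str:
    """Stand-in for the changed prompt: adds the missing route."""
    lowered = text.lower()
    if "cancel" in lowered or "subscription" in lowered:
        return "account"
    return baseline_prompt(text)


baseline = score(baseline_prompt)
candidate = score(candidate_prompt)
print(f"baseline={baseline:.2f} candidate={candidate:.2f} "
      f"delta={candidate - baseline:+.2f}")
```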

Section 4

Eval-driven development

This is basically test-driven development (TDD) for prompts.

Write the eval first and describe the expected behavior.
Confirm it fails so you know you are testing the right thing.
Improve the prompt until it passes.
Do not break the evals that already pass when you add new behavior.
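The last rule is easy to enforce mechanically if you keep per-case results from each run. A minimal sketch, assuming results are stored as a mapping from case name to pass/fail (the case names below are made up):

```python
def regressions(before: dict, after: dict) -> list:
    """Names of cases that passed before the change and do not pass after it."""
    return sorted(
        name for name, ok in before.items()
        if ok and not after.get(name, False)
    )


def newly_passing(before: dict, after: dict) -> list:
    """Names of cases that pass after the change but did not pass before."""
    return sorted(
        name for name, ok in after.items()
        if ok and not before.get(name, False)
    )


# Hypothetical results from two runs of the same suite.
before = {"greets-user": True, "cites-source": True, "refuses-pii": False}
after = {"greets-user": True, "cites-source": False, "refuses-pii": True}

print(regressions(before, after))    # ['cites-source']
print(newly_passing(before, after))  # ['refuses-pii']
```

Note that the overall pass rate is identical in both runs here (two of three), which is why comparing per-case results is more informative than comparing a single score.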

Section 5

Tools

promptfoo

CLI and CI eval runner. You write cases in YAML and compare providers or prompts side by side.

Braintrust

Eval runs, traces, and datasets in one place. More end-to-end quality tracking than a simple local runner.

Langfuse

Useful for online evals on production traffic, where the interesting failures only show up after real usage starts.

Custom scripts

Often the right answer for task-specific evals. If your checks are simple and local, a small Python runner is usually enough.

If you want the small, concrete version, go to Lab 16. If you want the governance-flavored ancestor of the same idea, look at Lab 10 and its eval_runner pattern.
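For a sense of scale, here is roughly what a small custom runner can look like. This is an independent sketch, not the Lab 16 or Lab 10 code: the JSON Lines case format, the assertion names, and canned_model are all assumptions made for illustration. It returns a process-style exit code so it could gate a CI job.

```python
import json
import re
from typing import Callable

# Hypothetical case format: one JSON object per line.
CASES_JSONL = """
{"input": "2+2", "assert": "exact", "value": "4"}
{"input": "greet", "assert": "contains", "value": "hello"}
{"input": "date", "assert": "regex", "value": "^[0-9]{4}-[0-9]{2}-[0-9]{2}$"}
"""

# Assertion name -> function(output, value) -> bool.
CHECKS = {
    "exact": lambda out, value: out.strip() == value,
    "contains": lambda out, value: value.lower() in out.lower(),
    "regex": lambda out, value: re.search(value, out) is not None,
}


def run_suite(cases_jsonl: str, model: Callable[[str], str]) -> int:
    """Run every case, print failures, and return 0 on success or 1 on failure."""
    failures = 0
    for line in cases_jsonl.splitlines():
        if not line.strip():
            continue  # skip blank lines
        case = json.loads(line)
        output = model(case["input"])
        if not CHECKS[case["assert"]](output, case["value"]):
            failures += 1
            print(f"FAIL {case['input']!r}: got {output!r}")
    return 1 if failures else 0


def canned_model(prompt: str) -> str:
    """Stand-in for a real model call, with fixed replies."""
    replies = {"2+2": "4", "greet": "Hello there", "date": "2024-01-31"}
    return replies.get(prompt, "")


exit_code = run_suite(CASES_JSONL, canned_model)
print("exit code:", exit_code)
```

In a real script you would read the cases from a file and pass the result to sys.exit so that a failing suite fails the build.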

Section 6

What evals don't solve

Evals test what you thought to test. They will not catch failure modes you did not anticipate.

A passing eval suite is a floor, not a ceiling. Combine evals with production monitoring, using a tool such as Langfuse or Helicone, if you want to catch the failures that only appear in real traffic.