Most LLM features ship on vibes. Someone tries a few prompts, the output looks good, and it goes live. The first regression — a prompt tweak that quietly breaks an edge case — is usually found by a customer, not a test. An evaluation harness is how you change that.
What a harness is
An evaluation harness is a fixed set of inputs, expected behaviours, and scorers that you run on every change. It is the LLM equivalent of a unit-test suite: cheap to run, honest about regressions, and version-controlled alongside the prompt it protects.
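As a rough illustration, the three ingredients (fixed inputs, expected behaviours, scorers) fit in a few dozen lines. This is a minimal sketch, not a prescribed design: `Case`, `contains_expected`, `run_harness`, and `fake_model` are hypothetical names invented for this example, and `fake_model` stands in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Case:
    """One fixed input plus the behaviour we expect from the model."""
    case_id: str
    prompt: str
    expected: str
    critical: bool = False


# A scorer maps (model output, case) to a score between 0.0 and 1.0.
Scorer = Callable[[str, Case], float]


def contains_expected(output: str, case: Case) -> float:
    """Score 1.0 if the expected text appears in the output, else 0.0."""
    return 1.0 if case.expected.lower() in output.lower() else 0.0


def run_harness(model: Callable[[str], str], cases: List[Case], scorer: Scorer) -> Dict[str, float]:
    """Run every case through the model and return a score per case id."""
    return {case.case_id: scorer(model(case.prompt), case) for case in cases}


def fake_model(prompt: str) -> str:
    """Stand-in for illustration only; a real harness would call the LLM here."""
    if "refund" in prompt.lower():
        return "Refunds are accepted within 30 days."
    return "I am not sure."


# Hypothetical cases; in practice these live in version control next to the prompt.
CASES = [
    Case("refund-window", "What is the refund window?", "30 days", critical=True),
    Case("shipping-time", "How long does shipping take?", "5 business days"),
]

scores = run_harness(fake_model, CASES, contains_expected)
aggregate = sum(scores.values()) / len(scores)
```

With the stand-in model, the refund case passes and the shipping case fails, so the aggregate is 0.5. Replace `fake_model` with the real call and the rest of the loop stays the same.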
The three scorers worth having
- Exact / structured: for outputs that must match a schema, value, or format.
- Model-graded: a second model judges correctness against a written rubric.
- Human-in-the-loop: a small sample reviewed by a person each release to anchor the rest.
Most teams start with one and add the others as the feature matures. The combination matters: structured checks catch format breaks instantly, model grading scales judgement cheaply, and the human sample keeps the grader honest.
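The three scorer types could be sketched as follows. All of this is assumption for illustration: the function names are hypothetical, the structured check assumes the output is meant to be a JSON object, the judge is passed in as a plain callable rather than tied to any particular model API, and the PASS/FAIL protocol is just one simple way to phrase a rubric verdict.

```python
import json
import random
from typing import Callable, List


def structured_score(output: str, required_keys: List[str]) -> float:
    """1.0 if the output is a JSON object containing every required key, else 0.0."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    return 1.0 if all(key in data for key in required_keys) else 0.0


def model_graded_score(output: str, rubric: str, judge: Callable[[str], str]) -> float:
    """Ask a second model (the judge callable) for a PASS or FAIL verdict."""
    verdict = judge(
        f"Rubric:\n{rubric}\n\nAnswer:\n{output}\n\nReply with PASS or FAIL only."
    )
    return 1.0 if verdict.strip().upper().startswith("PASS") else 0.0


def human_sample(case_ids: List[str], k: int, seed: int) -> List[str]:
    """Pick a reproducible random sample of case ids for a person to review."""
    rng = random.Random(seed)
    return sorted(rng.sample(case_ids, min(k, len(case_ids))))
```

Keeping the judge as a parameter means the grader can be swapped or stubbed in tests, and fixing the seed per release makes the human sample reproducible.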

Wiring it into the release
Treat the harness like CI for behaviour, not just code:
- Run the full set on every prompt or model change.
- Gate the deploy on two conditions: the aggregate score stays above a threshold, and no critical case scores worse than it did in the last accepted run.
- Store each run so you can see the trend, not just the latest number.
A model change you cannot score is a model change you cannot safely ship.
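One way the gate and the run history might look in code is sketched below. The names `gate` and `store_run`, the 0.9 default threshold, and the JSON Lines file layout are all assumptions made for this example; pick whatever threshold and storage suit your pipeline.

```python
import json
import time
from pathlib import Path
from typing import Dict, List, Set


def gate(
    current: Dict[str, float],
    baseline: Dict[str, float],
    critical_ids: Set[str],
    min_aggregate: float = 0.9,  # assumed threshold, tune per feature
) -> List[str]:
    """Return the reasons to block a deploy; an empty list means it may ship."""
    if not current:
        return ["no cases were scored"]
    reasons = []
    aggregate = sum(current.values()) / len(current)
    if aggregate < min_aggregate:
        reasons.append(f"aggregate {aggregate:.2f} is below {min_aggregate:.2f}")
    # No-regression rule: a critical case may never score lower than the baseline.
    for case_id in sorted(critical_ids):
        before = baseline.get(case_id)
        after = current.get(case_id, 0.0)
        if before is not None and after < before:
            reasons.append(f"critical case {case_id} regressed: {before:.2f} -> {after:.2f}")
    return reasons


def store_run(path: Path, label: str, scores: Dict[str, float]) -> None:
    """Append one run as a JSON line so the trend stays visible over time."""
    record = {"label": label, "timestamp": time.time(), "scores": scores}
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
```

In a CI job, a non-empty result from `gate` would fail the build, and `store_run` would be called either way so that failed runs also show up in the trend.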
Common mistakes
The two failure modes are opposite extremes. Some teams write hundreds of synthetic cases that do not resemble real traffic, then feel safe while the suite drifts away from reality. Others never write the first twenty and fly blind. Both are avoidable.
- Start with ~20 cases drawn from real usage: early on, realism beats volume.
- Add a new case every time a bug teaches you a new failure mode.
- Re-validate model-graded scores against the human sample periodically to catch grader drift.
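The grader re-validation in the last bullet can be as simple as an agreement rate over the cases a person has reviewed. This is a sketch under stated assumptions: `grader_agreement` and `grader_has_drifted` are hypothetical helpers, scores are treated as pass or fail around a 0.5 mark, and the 0.9 agreement floor is an arbitrary example value.

```python
from typing import Dict


def grader_agreement(
    model_scores: Dict[str, float],
    human_scores: Dict[str, float],
    pass_mark: float = 0.5,  # assumed cut-off between pass and fail
) -> float:
    """Fraction of human-reviewed cases where the model grader gives the same verdict."""
    shared = sorted(set(model_scores) & set(human_scores))
    if not shared:
        raise ValueError("no cases were scored by both the grader and a human")
    matches = sum(
        (model_scores[case_id] >= pass_mark) == (human_scores[case_id] >= pass_mark)
        for case_id in shared
    )
    return matches / len(shared)


def grader_has_drifted(agreement: float, min_agreement: float = 0.9) -> bool:
    """Flag the grader for review when agreement falls below the chosen floor."""
    return agreement < min_agreement
```

For example, if the grader and the reviewer agree on two of three shared cases, the agreement is about 0.67 and the check flags the grader. Only cases scored by both sides are compared, so the human sample can stay small.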
Twenty good cases that mirror production will protect you far better than two hundred invented ones. The suite then grows naturally, one real failure at a time, into the safety net every LLM feature should have.



