Stealthy Good

Evaluations are the product.

If you can’t tell whether the model got better, the model didn’t get better.


An engineering manager at a mid-market SaaS company told me recently that the team had shipped a big model upgrade. I asked how they knew it was better. “It feels better,” he said. It did not feel better to the six customers who had emailed support that week. It felt, to them, noticeably worse.

In traditional software, you ship a change. The change has a defined behavior. If the behavior is wrong, you write a failing test, fix it, and the test guards against regression. This is boring. It is also why modern software works as well as it does.

AI features don’t have that property by default. You ship a change, and the behavior distribution moves. Some outputs get better, some get worse, and without a systematic way to measure that, the only signal you have is the emails from customers who noticed.

The eval harness comes first

If you are building any product surface that uses a language model, an image model, or any other generative system, the first thing to build is the evaluation harness. Not the feature. The harness.

A minimum-viable evaluation harness has three parts:

  1. A fixed set of real inputs. Sampled from production, representative of the distribution you actually see, not cherry-picked success cases. Fifty real inputs beat five hundred synthetic ones.
  2. A scoring function you trust. For some tasks that’s a clean automated metric. For most, it’s a rubric plus a human, or a rubric plus a strong LLM-as-judge that you have spot-checked against human scores.
  3. A baseline you can beat. The current production behavior, scored on the same set, with the same rubric.

With those three in place, every change becomes a comparable experiment. Without them, you’re shipping on vibes. Vibes are the reason your release notes read “improved accuracy.”

Evals change what you ship

The most interesting thing about a real evaluation harness is that it changes the roadmap. Features that looked exciting in the planning meeting frequently score badly once you measure them. Features that seemed boring turn out to be high-impact once you can demonstrate a twelve-point improvement against a baseline.

If you can’t tell whether the model got better, the model didn’t get better. If you can’t tell whether your feature works, it doesn’t.

What to ask at the next sprint review

Two questions. They will rearrange the conversation:

  1. “What’s the eval score on main, and what was it last week?”
  2. “Which cases are failing that weren’t failing before?”
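Answering the second question is a set difference over two saved eval runs. A minimal sketch, assuming each run is stored as a `{case_id: score}` mapping and a pass threshold (the `0.5` here is an arbitrary placeholder):

```python
PASS = 0.5  # hypothetical pass/fail threshold for a case

def regressions(last_week: dict, this_week: dict) -> list[str]:
    """Case IDs that passed last week but fail this week."""
    return sorted(
        cid for cid, s in this_week.items()
        if s < PASS and last_week.get(cid, 0.0) >= PASS
    )

last_week = {"c1": 1.0, "c2": 1.0, "c3": 0.0}
this_week = {"c1": 1.0, "c2": 0.0, "c3": 1.0}

print(regressions(last_week, this_week))  # → ['c2']
```

Note that `c3` improving does not cancel out `c2` regressing; an aggregate score can go up while individual customers get a worse product, which is exactly why the second question exists.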

If nobody has an answer, the eval is the next thing to build. Everything else can wait a week. A team that can answer those two questions on demand has a different kind of product than a team that can’t.