What are LLM evals? - Agent Month

LLM evals

LLM evals are systematic tests that measure the quality of a model’s outputs against defined criteria, so changes can be validated instead of guessed.

An eval (evaluation) is a repeatable test for an LLM-powered feature. It pairs representative inputs with a way to score the outputs — exact match, a rubric, or an LLM-as-judge — so you can tell whether a prompt change, model swap, or retrieval tweak made quality go up or down.
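As a rough illustration of that pairing, here is a minimal sketch of an offline eval using exact-match scoring. Everything in it is hypothetical and chosen for the example: `EvalCase`, `exact_match`, `run_eval`, and the `stub_model` function that stands in for a real LLM call. A rubric or LLM-as-judge scorer could be swapped in wherever `exact_match` is passed.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalCase:
    """One representative input paired with the answer we expect."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 if the output matches the expected text, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    model: Callable[[str], str],
    cases: Sequence[EvalCase],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Run every case through the model and return the mean score."""
    if not cases:
        raise ValueError("an eval needs at least one case")
    scores = [scorer(model(case.input), case.expected) for case in cases]
    return sum(scores) / len(scores)


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call, so the sketch runs offline."""
    canned = {"Capital of France?": "Paris", "2 + 2 = ?": "4"}
    return canned.get(prompt, "I don't know")


CASES = [
    EvalCase("Capital of France?", "paris"),
    EvalCase("2 + 2 = ?", "4"),
    EvalCase("Largest planet?", "Jupiter"),
]

score = run_eval(stub_model, CASES)
print(f"exact-match score: {score:.2f}")
```

Because the dataset and scorer are fixed, rerunning this after a prompt change, model swap, or retrieval tweak gives a directly comparable number.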

Evals run offline (against a fixed dataset, ideally in CI) and online (sampling real production traffic). Together they turn quality from a vibe into a number you can gate deploys on.
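One way the two modes might look in code is sketched below, under stated assumptions: `gate_deploy` is a hypothetical CI check that blocks a deploy when the candidate's eval score drops more than a tolerance below the baseline, and `should_sample` is a hypothetical helper that picks a fraction of production requests for online scoring. The 0.02 tolerance and 5% sample rate are illustrative values, not recommendations.

```python
import random


def gate_deploy(
    candidate_score: float,
    baseline_score: float,
    max_regression: float = 0.02,  # illustrative tolerance, an assumption
) -> bool:
    """Offline gate: allow the deploy only if quality did not regress past the tolerance."""
    return candidate_score >= baseline_score - max_regression


def should_sample(rng: random.Random, sample_rate: float) -> bool:
    """Online sampling: decide whether this production request gets scored."""
    return rng.random() < sample_rate


# Offline: compare a candidate's score against the current baseline.
print("deploy allowed:", gate_deploy(candidate_score=0.90, baseline_score=0.91))

# Online: score roughly 5% of simulated production requests.
rng = random.Random(0)
sampled = sum(should_sample(rng, sample_rate=0.05) for _ in range(1000))
print(f"sampled {sampled} of 1000 requests for online evaluation")
```

In a CI job, a `False` result from the gate would typically be turned into a non-zero exit code so the pipeline stops before the deploy step.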

Many teams ship AI features with no evals, which makes every change a blind deploy. Adding even a small eval suite to the highest-stakes route is often the highest-leverage reliability investment a team can make.