LLM-as-judge
LLM-as-judge is an evaluation technique where a language model scores another model’s output against criteria you define.
The judge is typically a capable model grading outputs that have no single correct answer: summaries, explanations, generated code reviews. You give the judge a rubric, and it returns a score or a pass/fail verdict per criterion.
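A minimal sketch of the rubric-in, verdict-per-criterion flow might look like the following. The rubric contents, the prompt wording, and the `call_model` parameter are all assumptions made for illustration. `fake_model` is a stand-in that returns a canned reply, so the example runs without any real model API.

```python
import json
from typing import Callable, Dict

# Hypothetical rubric: criterion name -> what the judge should check.
RUBRIC = {
    "faithful": "The summary makes no claims absent from the source.",
    "concise": "The summary is at most three sentences.",
}

def build_prompt(source: str, output: str, rubric: Dict[str, str]) -> str:
    """Render the rubric and the material to grade into one judge prompt."""
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in rubric.items())
    return (
        "Grade the OUTPUT against each criterion. "
        'Reply with JSON mapping each criterion name to "pass" or "fail".\n'
        f"CRITERIA:\n{criteria}\n\nSOURCE:\n{source}\n\nOUTPUT:\n{output}"
    )

def judge(
    source: str,
    output: str,
    rubric: Dict[str, str],
    call_model: Callable[[str], str],
) -> Dict[str, bool]:
    """Ask the judge model for a verdict and return pass/fail per criterion."""
    raw = call_model(build_prompt(source, output, rubric))
    verdicts = json.loads(raw)
    # A missing or unexpected verdict counts as a failure rather than a guess.
    return {name: verdicts.get(name) == "pass" for name in rubric}

def fake_model(prompt: str) -> str:
    # Stand-in for a real model call (assumption): always returns this reply.
    return json.dumps({"faithful": "pass", "concise": "fail"})

result = judge("source text", "candidate summary", RUBRIC, fake_model)
```

Passing the model call in as a function keeps the grading logic independent of any particular provider and makes it easy to test with a stub.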
It’s well suited to judging nuanced, open-ended quality, but the judge itself must be validated against human labels so you’re not just measuring one model’s opinion of another. Calibrating the judge is part of building a trustworthy eval.
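One way to check a judge against human labels is to compare its pass/fail verdicts with human verdicts on the same items. The sketch below computes raw agreement and Cohen's kappa, which corrects agreement for chance. The label lists are made-up illustration data, not results from any real eval, and the choice of kappa as the metric is an assumption rather than something the text prescribes.

```python
import math
from typing import Sequence

def agreement(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Fraction of items where the judge's verdict matches the human label."""
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def cohens_kappa(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Agreement corrected for chance, for two raters and binary labels."""
    n = len(human)
    p_observed = agreement(human, judge)
    p_human_pass = sum(human) / n
    p_judge_pass = sum(judge) / n
    # Chance agreement: both say pass, or both say fail, independently.
    p_expected = (
        p_human_pass * p_judge_pass
        + (1 - p_human_pass) * (1 - p_judge_pass)
    )
    if p_expected == 1.0:
        # Both raters gave one identical constant label; kappa is undefined.
        return math.nan
    return (p_observed - p_expected) / (1 - p_expected)

# Made-up labels for eight items (True = pass).
human_labels = [True, True, False, False, True, False, True, True]
judge_labels = [True, True, False, True, True, False, False, True]

raw = agreement(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
```

Raw agreement alone can look high when most items pass, which is why a chance-corrected measure is a useful companion.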
In production eval systems, LLM-as-judge is usually combined with cheaper exact-match and rubric checks, each applied to the routes where it fits best.
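As a rough illustration of mixing cheap checks with a judge, the sketch below sends cases that have a reference answer to exact-match and everything else to the judge. The `Case` type, the route names, and the routing rule are hypothetical; a real system would likely route on richer metadata. `stub_judge` stands in for a real judge call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Case:
    output: str
    # Set only when a single correct answer exists (assumption of this sketch).
    reference: Optional[str] = None

def exact_match(case: Case) -> bool:
    """Cheap check: output equals the reference, ignoring outer whitespace."""
    return case.output.strip() == (case.reference or "").strip()

def evaluate(case: Case, judge: Callable[[Case], bool]) -> Tuple[str, bool]:
    """Return (route used, verdict), calling the judge only when needed."""
    if case.reference is not None:
        return "exact_match", exact_match(case)
    return "judge", judge(case)

judge_calls: List[Case] = []

def stub_judge(case: Case) -> bool:
    # Stand-in for an LLM judge (assumption): records the call and passes.
    judge_calls.append(case)
    return True

closed = evaluate(Case(output="42 ", reference="42"), stub_judge)
open_ended = evaluate(Case(output="A short summary."), stub_judge)
```

Here the judge is invoked once, for the open-ended case only, so the more expensive check is reserved for outputs the cheap check cannot grade.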