How do I detect LLM model regression in production?

Question

Accepted Answer

Detecting regression requires three things you almost never have during a firefight: a frozen evaluation set, dated baseline scores, and statistical significance testing on the deltas. The mechanics are: (1) curate 50–500 prompts that represent your task surface — include adversarial, edge, and easy cases; (2) score each prompt with a deterministic judge (string match, regex, JSON-schema validator, or rubric-based LLM-as-judge with a frozen judge model); (3) run the suite on a fixed schedule against the same model alias; (4) compute pass-rate, refusal-rate, format-validity, p50/p95 latency, and per-1K-token cost; (5) alert when any metric moves beyond a control-chart threshold (e.g., 3 sigma over the trailing 14-day window) or fails a Mann-Whitney U test against last week. Open frameworks that do parts of this: EleutherAI's `lm-evaluation-harness`, Stanford's HELM, Google's BIG-Bench, OpenAI Evals. None of them run continuously against provider APIs with alerting. ModelWatch is the continuous-monitoring layer on top — golden-prompt suite, daily run, public scorecard, Slack/webhook alerts on regression.