How do I interpret a benchmark drop — signal or noise?

Question

Accepted Answer

Three tests separate signal from noise.

**(1) Magnitude vs. typical noise floor.** Compute the trailing 14-day mean and standard deviation for that metric. A drop within 1 sigma is noise; around 2 sigma is suspicious; 3 sigma or more is alert-worthy. For a 200-item MMLU sample at ~88 percent, typical day-over-day SD is ~0.8 percentage points, so a 2-point drop is roughly 2.5 sigma (2 / 0.8): investigate.

**(2) Statistical significance, not just magnitude.** For pass-rate metrics, test against the rolling baseline with McNemar's test (when the same items are scored in both runs, so results are paired) or a chi-squared test (when the samples are independent). For latency, use Mann-Whitney U. For continuous metrics like cost-per-call, use Welch's t-test. Treat anything with p > 0.05 as noise even if the point estimate looks scary.

**(3) Cross-metric and cross-model consistency.** A single benchmark moving alone is often noise (or a contamination artifact). A move in the same direction on 2–3 independent benchmarks (e.g., MMLU + GSM8K + HumanEval) on the same snapshot is much stronger evidence. Conversely, if every model on every provider drops on the same day, it's probably your eval harness: check your judge, your prompt template, and your tokenizer.

Operational rule of thumb: alert on confluence of magnitude, significance, and cross-metric agreement. ModelWatch's default alert policy fires only when at least 2 of 3 conditions hold, which cuts false-positive pages by roughly 80 percent versus single-metric thresholds.
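To make the three checks concrete, here is a minimal sketch in plain Python (standard library only). The function names, parameters, and default thresholds are hypothetical and chosen for illustration; the thresholds mirror the numbers in this answer. The significance check uses an exact McNemar test on discordant pairs, which is one reasonable choice for paired pass/fail results, and the 2-of-3 vote is a sketch of the policy described above, not ModelWatch's actual implementation.

```python
import math
import statistics


def sigma_drop(history, today):
    """Size of today's drop in units of the trailing standard deviation.

    `history` is the trailing window of daily scores (e.g. 14 days).
    Positive values mean today is below the trailing mean.
    """
    mean = statistics.mean(history)
    sd = statistics.stdev(history)  # sample standard deviation
    return (mean - today) / sd


def mcnemar_exact_p(baseline_only, current_only):
    """Two-sided exact McNemar p-value from discordant pair counts.

    baseline_only: items that passed in the baseline run but fail now.
    current_only:  items that failed in the baseline run but pass now.
    Under the null hypothesis each discordant item is a fair coin flip.
    """
    n = baseline_only + current_only
    if n == 0:
        return 1.0  # no item changed outcome, so there is no evidence of change
    k = min(baseline_only, current_only)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def should_alert(sigmas, p_value, benchmarks_moving_same_way, *,
                 sigma_threshold=2.0, alpha=0.05,
                 min_benchmarks=2, min_conditions=2):
    """Hypothetical 2-of-3 confluence rule over the three tests.

    benchmarks_moving_same_way counts independent benchmarks that dropped
    on the same snapshot, including the one under investigation.
    """
    conditions = [
        sigmas >= sigma_threshold,                      # (1) magnitude
        p_value <= alpha,                               # (2) significance
        benchmarks_moving_same_way >= min_benchmarks,   # (3) consistency
    ]
    return sum(conditions) >= min_conditions
```

For example, a drop of about 2.4 sigma that is also significant (12 newly failing items against 3 newly passing ones) would trigger an alert under this sketch even if no other benchmark moved, because two of the three conditions hold. For unpaired samples or latency data you would swap in a chi-squared or Mann-Whitney U test from a statistics library instead.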