How do I prove to my CEO the model got worse?

Question

Accepted Answer

You need three artifacts. **(1) A frozen eval set with timestamps**: 50–200 prompts, scored daily, ideally going back at least 30 days. Without dated baselines, "the model got worse" is unfalsifiable. **(2) A specific, measurable metric that moved**: for example, pass rate dropped from 87% to 71% on coding-suite, refusal rate doubled on customer-support intent, or p95 latency went from 4.2s to 7.8s. **(3) Statistical significance**: a chi-squared test on pass rate, or McNemar's test when the same prompts are scored before and after (which a frozen eval set gives you), and Mann-Whitney U on latency. Report a confidence interval, not just point estimates. Bonus credibility: cross-check against a public benchmark. If Aider's leaderboard or Artificial Analysis shows a move in the same direction on the same snapshot, that is external validation an exec will find hard to dismiss. ModelWatch produces all three artifacts automatically and exposes a shareable public scorecard URL per model, designed to be the link you paste into a Slack thread or board deck.
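To make artifact (3) concrete, here is a minimal sketch of an exact McNemar test plus an approximate 95% confidence interval for the change in pass rate, using only the Python standard library. The function names are hypothetical, and the data is made up for illustration: it matches the 87% to 71% example on a 100-prompt set, but the split of 18 regressed and 2 fixed prompts is an assumption, not a real result.

```python
from math import comb, sqrt


def discordant_counts(before, after):
    """Count prompts whose pass/fail result changed between two runs.

    before[i] and after[i] are booleans for the same prompt i.
    Returns (regressed, fixed): passed-then-failed and failed-then-passed.
    """
    if len(before) != len(after):
        raise ValueError("paired results must have the same length")
    regressed = sum(1 for x, y in zip(before, after) if x and not y)
    fixed = sum(1 for x, y in zip(before, after) if y and not x)
    return regressed, fixed


def mcnemar_exact_p(regressed, fixed):
    """Two-sided exact McNemar p-value.

    Under the null hypothesis, each changed prompt is equally likely to be
    a regression or a fix, so the smaller count follows Binomial(n, 0.5).
    """
    n = regressed + fixed
    if n == 0:
        return 1.0
    k = min(regressed, fixed)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def pass_rate_change_ci(regressed, fixed, total, z=1.96):
    """Wald interval for the paired change in pass rate (after minus before).

    This is a normal approximation; it is rough for small eval sets.
    """
    diff = (fixed - regressed) / total
    se = sqrt((regressed + fixed) - (fixed - regressed) ** 2 / total) / total
    return diff, diff - z * se, diff + z * se


# Illustrative data only: 100 prompts, 87 passing before, 71 passing after.
before = [True] * 87 + [False] * 13
after = [False] * 18 + [True] * 69 + [True] * 2 + [False] * 11

regressed, fixed = discordant_counts(before, after)
p_value = mcnemar_exact_p(regressed, fixed)
diff, ci_low, ci_high = pass_rate_change_ci(regressed, fixed, len(before))

print(f"regressed={regressed} fixed={fixed} p={p_value:.4f}")
print(f"pass-rate change: {diff:+.1%} (95% CI {ci_low:+.1%} to {ci_high:+.1%})")
```

McNemar's test only looks at prompts whose result changed, which is why it needs the same frozen prompts in both runs. If the two runs used different prompts, a chi-squared test on the 2x2 pass/fail table is the appropriate substitute.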