How to handle a silent model degradation in production

Question

Accepted Answer

Incident playbook in five steps.

**(1) Confirm it's the model, not you.** Re-run your golden-prompt suite against the current snapshot, the previous snapshot (if still available), and the `*-latest` alias. If the old snapshot's scores match the historical baseline and the new snapshot or alias regressed, the model changed. If both regressed equally, your eval pipeline changed: check the judge, tokenizer, and prompt templates. Document the deltas with a timestamp before doing anything else.

**(2) Triage by user impact.** Which production code paths use this model? Which user-facing metrics moved (latency, error rate, user-visible refusal rate, format failures)? If user impact is severe, immediately fall back to the previous pinned snapshot. Every responsible team should keep the last-known-good snapshot as a tested fallback, switchable by config flag.

**(3) Communicate.** File an internal incident with the eval data attached. If the regression is provider-side, file a bug with the provider (OpenAI, Anthropic, and Google all have support channels), attaching reproducible prompts and per-snapshot scores. If you're in a Discord or Slack community, share quietly first to check whether other teams are seeing the same shift before going public.

**(4) Investigate root cause.** Was it an alias re-point? A new dated snapshot? A silent serving change on a pinned snapshot? The diagnosis dictates the fix: re-pin to a dated snapshot (for alias re-points), defer adoption of the new snapshot (for new dated versions), or escalate to the provider (for silent changes on a pinned snapshot, which technically should not happen).

**(5) Update monitoring and runbook.** Add the failed prompts as named regression canaries in your suite so this specific failure mode is detected next time.

ModelWatch automates steps 1 and 4 and the canary-update piece of step 5; you still own steps 2 and 3.
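The comparison in step 1 can be sketched as a small decision function. This is a minimal illustration, not ModelWatch's implementation: `SuiteScores`, `classify_regression`, and the `tolerance` value are hypothetical names and numbers chosen for the example, and the scores are assumed to be aggregate pass rates between 0 and 1.

```python
from dataclasses import dataclass


@dataclass
class SuiteScores:
    """Aggregate golden-prompt scores (assumed to be pass rates in [0, 1])."""
    baseline: float      # historical score recorded for the previous snapshot
    old_snapshot: float  # fresh run against the previous pinned snapshot
    new_snapshot: float  # fresh run against the current snapshot or alias


def classify_regression(s: SuiteScores, tolerance: float = 0.02) -> str:
    """Apply the step 1 rule of thumb to one set of suite scores.

    `tolerance` is an assumed noise band; tune it to your suite's variance.
    """
    old_drop = s.baseline - s.old_snapshot
    new_drop = s.baseline - s.new_snapshot
    old_ok = old_drop <= tolerance
    new_ok = new_drop <= tolerance

    if old_ok and new_ok:
        return "no_regression"
    if old_ok and not new_ok:
        # Old snapshot still matches baseline, new one does not.
        return "model_changed"
    if not old_ok and not new_ok and abs(old_drop - new_drop) <= tolerance:
        # Both dropped by about the same amount: suspect judge,
        # tokenizer, or prompt templates rather than the model.
        return "eval_pipeline_changed"
    # Anything else (e.g. unequal drops) needs a human to look at it.
    return "inconclusive"


if __name__ == "__main__":
    print(classify_regression(SuiteScores(0.90, 0.90, 0.80)))
    print(classify_regression(SuiteScores(0.90, 0.80, 0.80)))
```

Returning an explicit `"inconclusive"` label, instead of forcing every case into one of the two diagnoses, keeps ambiguous results in front of a person before anyone re-pins or files a provider bug.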