ModelWatch

How to handle a silent model degradation in production

Incident playbook in five steps. (1) Confirm it's the model, not you. Re-run your golden-prompt suite against the current snapshot, the previous snapshot (if still available), and the *-latest alias. If the old snapshot's scores match the historical baseline and the new snapshot or alias has regressed, the model changed. If both regressed equally, your eval pipeline changed: check the judge, tokenizer, and prompt templates. Document the deltas with a timestamp before doing anything else.
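The decision rule in step 1 can be sketched as a small function. This is an illustration only, not ModelWatch's implementation: the names SuiteScores and diagnose, the 0-to-1 score scale, and the 0.02 tolerance are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SuiteScores:
    """Golden-prompt suite scores on an assumed 0-to-1 scale."""
    baseline: float      # historical baseline for the suite
    old_snapshot: float  # previous snapshot, re-run now
    new_snapshot: float  # current snapshot or *-latest alias, re-run now


def diagnose(scores: SuiteScores, tolerance: float = 0.02) -> str:
    """Apply the step 1 rule. The tolerance is an assumed noise band."""
    old_drop = scores.baseline - scores.old_snapshot
    new_drop = scores.baseline - scores.new_snapshot
    old_ok = old_drop <= tolerance
    new_ok = new_drop <= tolerance

    if old_ok and new_ok:
        return "no_regression"
    if old_ok and not new_ok:
        # Old snapshot still matches baseline, new one does not.
        return "model_changed"
    if abs(old_drop - new_drop) <= tolerance:
        # Both regressed by about the same amount: suspect the eval
        # pipeline (judge, tokenizer, prompt templates).
        return "eval_pipeline_changed"
    # Anything else needs a human to look at the per-prompt deltas.
    return "inconclusive"
```

For example, diagnose(SuiteScores(0.90, 0.90, 0.80)) returns "model_changed", while diagnose(SuiteScores(0.90, 0.80, 0.80)) returns "eval_pipeline_changed". The tolerance should be set from the run-to-run variance you actually observe in your suite.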

(2) Triage by user impact. Which production code paths use this model? Which user-facing metrics moved (latency, error rate, refusal rate visible to users, format failures)? If user impact is severe, fall back immediately to the previous pinned snapshot. Every responsible team should keep the last-known-good snapshot as a tested fallback, switchable by config flag.
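A config-flag fallback can be as small as the sketch below. The environment variable MODEL_FALLBACK_ENABLED and both model identifiers are hypothetical placeholders, not real provider model names; substitute whatever your deployment actually uses.

```python
import os
from typing import Mapping, Optional

# Hypothetical identifiers, for illustration only.
PRIMARY_MODEL = "example-model-latest"
FALLBACK_MODEL = "example-model-2024-01-01"  # last-known-good pinned snapshot


def resolve_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the model ID to call, honoring the fallback flag."""
    env = os.environ if env is None else env
    flag = env.get("MODEL_FALLBACK_ENABLED", "").strip().lower()
    # Any of these values flips traffic to the last-known-good snapshot.
    if flag in {"1", "true", "yes"}:
        return FALLBACK_MODEL
    return PRIMARY_MODEL
```

Reading the flag on every call, rather than once at startup, lets an operator flip it without a redeploy. The fallback path only counts as tested if your golden-prompt suite also runs against FALLBACK_MODEL on a schedule.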

(3) Communicate. File an internal incident with the eval data attached. If the regression is provider-side, file a bug with the provider (OpenAI, Anthropic, and Google all have support channels), attaching reproducible prompts and per-snapshot scores. If you are in a Discord or Slack community, share quietly first to check whether other teams are seeing the same shift before going public.

(4) Investigate root cause. Was it an alias re-point? A new dated snapshot? A silent serving change on a pinned snapshot? The diagnosis dictates the fix: re-pin to a dated snapshot (for alias re-points), defer adoption of the new snapshot (for new dated versions), or escalate to the provider (for silent changes on a pinned snapshot, which in principle should not happen).
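The diagnosis-to-fix mapping in step 4 can be written down as a lookup table. The classify heuristic here is an assumption made for illustration: it infers the cause from whether production calls an alias and whether the configured snapshot ID was recently changed. A real investigation would also check provider changelogs.

```python
from enum import Enum


class RootCause(Enum):
    ALIAS_REPOINT = "alias_repoint"
    NEW_DATED_SNAPSHOT = "new_dated_snapshot"
    SILENT_CHANGE_ON_PIN = "silent_change_on_pin"


# Fix for each root cause, as listed in step 4.
REMEDIATION = {
    RootCause.ALIAS_REPOINT: "re-pin to a dated snapshot",
    RootCause.NEW_DATED_SNAPSHOT: "defer adoption of the new snapshot",
    RootCause.SILENT_CHANGE_ON_PIN: "escalate to the provider",
}


def classify(uses_alias: bool, configured_id_changed: bool) -> RootCause:
    """Assumed heuristic: guess the root cause from deployment facts."""
    if uses_alias:
        # Production calls an alias, so the provider can move it.
        return RootCause.ALIAS_REPOINT
    if configured_id_changed:
        # You recently switched to a newer dated snapshot yourself.
        return RootCause.NEW_DATED_SNAPSHOT
    # Pinned, unchanged ID, yet behavior shifted.
    return RootCause.SILENT_CHANGE_ON_PIN
```

For instance, REMEDIATION[classify(uses_alias=True, configured_id_changed=False)] gives "re-pin to a dated snapshot".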

(5) Update monitoring and runbook. Add the failed prompts as named regression canaries in your suite so this specific failure mode is detected next time. ModelWatch automates steps 1 and 4 and the canary-update piece of step 5; you still own steps 2 and 3.
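If you maintain the suite by hand, the canary update in step 5 might look like the sketch below. The suite layout (a dict keyed by canary name), the "prompt" and "expected" fields, and the naming scheme are assumptions for illustration, not ModelWatch's storage format.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


def add_canaries(
    suite: Dict[str, dict],
    failures: List[dict],
    incident_id: str,
    now: Optional[datetime] = None,
) -> Dict[str, dict]:
    """Return a copy of the suite with each failed prompt added as a named canary."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    updated = dict(suite)  # leave the caller's suite untouched
    for index, failure in enumerate(failures, start=1):
        name = f"{incident_id}-canary-{index:02d}"
        if name in updated:
            continue  # already registered during an earlier run
        updated[name] = {
            "prompt": failure["prompt"],
            "expected": failure["expected"],
            "added_at": stamp,
        }
    return updated
```

Naming canaries after the incident keeps a later failure traceable to the original report.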