How to alert on LLM refusal-rate spikes

Question

Accepted Answer

Refusal-rate spikes are usually the first sign that a provider has tightened a safety classifier. Build the alert in three steps:

1. **Define refusal deterministically.** Use a substring match on "I cannot," "I'm sorry," and "as an AI," plus a small regex or classifier for hedged refusals.
2. **Track refusal per category.** Split traffic into coding, medical, legal, persona, and creative. Tightening rarely hits all categories uniformly, so the category-level signal is much cleaner than the aggregate.
3. **Use a control chart.** Keep a 14-day rolling mean per category and alert when the rate rises more than 3 sigma above it or more than 5 absolute percentage points day-over-day, whichever threshold is crossed first. Require a minimum 50-sample window to avoid false alarms.

In ModelWatch, refusal rate is a first-class metric on every model's scorecard, with per-category breakdowns. Slack/webhook alerts fire when any category crosses its threshold and link to the specific prompts that newly refused, so you can diff prompt to response.
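If you want to prototype the three steps before wiring them into a dashboard, here is a minimal Python sketch for a single category. It is not ModelWatch's implementation. The function names, the hedged-refusal regex, and the handling of edge cases are my own assumptions. In particular, I read "minimum 50-sample window" as "at least 50 responses in today's sample", use the sample standard deviation of the last 14 daily rates for sigma, and only alert on upward moves.

```python
import re
from statistics import mean, stdev

# Substrings come from the answer above; matching is case-insensitive.
REFUSAL_SUBSTRINGS = ("i cannot", "i'm sorry", "as an ai")
# Hypothetical example pattern for hedged refusals; tune it on your own traffic.
HEDGED_REFUSAL = re.compile(r"\b(?:unable|not able) to (?:help|assist)\b", re.IGNORECASE)


def is_refusal(response: str) -> bool:
    """Step 1: deterministic refusal check (substring match plus one regex)."""
    text = response.lower().replace("\u2019", "'")  # normalise curly apostrophes
    if any(marker in text for marker in REFUSAL_SUBSTRINGS):
        return True
    return HEDGED_REFUSAL.search(text) is not None


def refusal_rate(responses: list[str]) -> float:
    """Step 2: fraction of refusals in one category's responses for one day."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(is_refusal(r) for r in responses) / len(responses)


def should_alert(
    history: list[float],   # daily refusal rates for one category, oldest first
    today_rate: float,
    today_n: int,           # number of responses behind today_rate
    *,
    window: int = 14,
    sigma: float = 3.0,
    abs_jump: float = 0.05,  # 5 absolute percentage points
    min_samples: int = 50,
) -> bool:
    """Step 3: control-chart check; fires if either threshold is crossed."""
    recent = history[-window:]
    # Assumption: skip the check when today's sample or the baseline is too thin.
    if today_n < min_samples or len(recent) < 2:
        return False
    baseline = mean(recent)
    spread = stdev(recent)
    # With a perfectly flat baseline (spread == 0) the sigma rule is undefined,
    # so only the day-over-day rule can fire in that case.
    sigma_hit = spread > 0 and (today_rate - baseline) > sigma * spread
    jump_hit = (today_rate - recent[-1]) > abs_jump
    return sigma_hit or jump_hit
```

Run `should_alert` once per category per day, passing that category's own history. A substring list like this will produce some false positives (for example, "I'm sorry for the delay"), so spot-check the matches before trusting the rate.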