How to alert on LLM refusal-rate spikes
Refusal-rate spikes are usually the first sign that a provider has tightened a safety classifier. Build the alert in three steps. (1) Define refusal deterministically: substring match on "I cannot," "I'm sorry," and "as an AI," plus a small regex or classifier for hedged refusals. (2) Track refusal rate per category (coding, medical, legal, persona, creative); tightening rarely hits all categories uniformly, and the category-level signal is much cleaner than the aggregate. (3) Use a control chart: keep a 14-day rolling mean per category and alert when the rate exceeds that mean by more than 3 sigma or rises more than 5 absolute percentage points day-over-day, whichever triggers first. Require a window of at least 50 samples before alerting to avoid false alarms.
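The three steps can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the hedged-refusal regex, the DayStats type, and the function names are all hypothetical. It also assumes that the 50-sample minimum applies per category per day, that sigma is the sample standard deviation of the daily rates in the 14-day window, and that "whichever is stricter" means either rule on its own is enough to alert.

```python
import re
import statistics
from dataclasses import dataclass

# Marker phrases from the text, lowercased for case-insensitive matching.
# Plain substring matching can produce false positives, so review the hits.
REFUSAL_SUBSTRINGS = ("i cannot", "i'm sorry", "as an ai")

# Hypothetical pattern for hedged refusals; tune it against your own traffic.
HEDGED_REFUSAL = re.compile(
    r"\b(?:i am|i'm) (?:unable|not able) to\b|\bi won't be able to\b"
)


def is_refusal(response: str) -> bool:
    """Step 1: deterministic refusal check (substrings plus a small regex)."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    if any(marker in text for marker in REFUSAL_SUBSTRINGS):
        return True
    return HEDGED_REFUSAL.search(text) is not None


@dataclass
class DayStats:
    """Refusal counts for one category on one day (hypothetical container)."""
    refusals: int
    total: int

    @property
    def rate(self) -> float:
        return self.refusals / self.total if self.total else 0.0


def should_alert(history, today, min_samples=50, sigma=3.0, abs_jump=0.05, window=14):
    """Step 3: control-chart check for one category.

    history holds earlier days, oldest first, and excludes today.
    """
    if today.total < min_samples:
        return False  # too few samples today to trust the rate
    # Keep only days with enough samples, then take the most recent `window`.
    usable = [d for d in history if d.total >= min_samples][-window:]
    if len(usable) < 2:
        return False  # a standard deviation needs at least two points
    rates = [d.rate for d in usable]
    mean = statistics.fmean(rates)
    std = statistics.stdev(rates)  # 0.0 if every day had the same rate
    sigma_breach = today.rate > mean + sigma * std
    jump_breach = (today.rate - usable[-1].rate) > abs_jump
    return sigma_breach or jump_breach


def alerting_categories(history_by_category, today_by_category):
    """Step 2: run the check per category and return the ones that alert."""
    return sorted(
        category
        for category, today in today_by_category.items()
        if should_alert(history_by_category.get(category, []), today)
    )
```

Note that a perfectly flat history gives a standard deviation of zero, so any increase at all trips the sigma rule; in practice you may want a small floor on sigma.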
In ModelWatch, refusal rate is a first-class metric on every model's scorecard, with per-category breakdowns. Slack and webhook alerts fire when any category crosses its threshold, and each alert links to the specific prompts that newly drew refusals so you can compare each prompt with its response.