Refusal-rate monitoring — how to set thresholds

Question

Accepted Answer

Refusal rate is one of the most volatile metrics in LLM monitoring because safety-classifier updates cause sudden step-function changes. Threshold design has three layers.

**(1) Category-segmented baseline.** Refusals are not uniform: coding requests are refused rarely (~0.5 percent), persona/creative requests moderately (~3–8 percent), and medical/legal/safety-adjacent requests often (~15–40 percent). A single overall threshold averages these together and misses category-level shifts. Track at least 5 categories: coding, creative, persona, medical/legal, and "neutral Q&A."

**(2) Statistical thresholds, not fixed percentages.** For each category, compute the 14-day rolling mean and SD on a 50–200 prompt daily sample. Alert when the day-over-day move exceeds **max(3 sigma, 5 absolute percentage points)**. The 5-point floor prevents false alarms on low-baseline categories, where a 1-prompt change can be statistically significant but operationally meaningless.

**(3) Multi-day confirmation for sustained shifts.** Single-day spikes are often sampling noise. Require 2 of the last 3 days to exceed the threshold before paging. This adds latency but cuts false positives by roughly 70 percent based on typical noise profiles.

Detection mechanics: substring match on common refusal templates ("I cannot," "I'm sorry," "as an AI," "I'm not able to") plus a small regex or classifier for hedged refusals ("While I can help with...").

ModelWatch ships per-category refusal tracking with these thresholds preconfigured, and links every alert to the specific newly-refused prompts so you can diff prompt against response.
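
If you want to prototype the rule before adopting a tool, here is a minimal Python sketch of the detection, threshold, and confirmation steps for one category. The function names (`looks_like_refusal`, `breaches`, `should_page`, `evaluate`) are hypothetical, not part of any product. Two interpretations here are assumptions: the "move" is measured as today's deviation from the trailing 14-day mean (the reading under which 2-of-3 confirmation can fire on a sustained shift), and the hedged-refusal regex covers only the single example phrasing above. Rates are in percent, so the floor is literally 5 points.

```python
import re
from statistics import mean, stdev

WINDOW_DAYS = 14   # rolling baseline length
SIGMA_MULT = 3.0   # "3 sigma"
FLOOR_PP = 5.0     # floor of 5 absolute percentage points

# Refusal templates quoted in the answer, lowercased for matching.
REFUSAL_MARKERS = ("i cannot", "i'm sorry", "as an ai", "i'm not able to")
# Illustrative pattern for one hedged phrasing; a real set would be larger.
HEDGED_REFUSAL = re.compile(r"\bwhile i can help with\b")


def looks_like_refusal(response: str) -> bool:
    """Substring match on templates, plus one hedged-refusal regex."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(m in text for m in REFUSAL_MARKERS) or bool(HEDGED_REFUSAL.search(text))


def refusal_rate_pct(responses: list[str]) -> float:
    """Refusal rate in percent for one category's daily sample (must be non-empty)."""
    return 100.0 * sum(looks_like_refusal(r) for r in responses) / len(responses)


def breaches(history: list[float], today: float) -> bool:
    """True if today's rate deviates from the trailing mean by more than
    max(3 sigma, 5 percentage points). Assumption: deviation from the mean
    stands in for the 'move' described in the answer."""
    window = history[-WINDOW_DAYS:]
    if len(window) < 2:
        return False  # not enough data to estimate an SD
    threshold = max(SIGMA_MULT * stdev(window), FLOOR_PP)
    return abs(today - mean(window)) > threshold


def should_page(breach_flags: list[bool]) -> bool:
    """Multi-day confirmation: at least 2 of the last 3 days breached."""
    return sum(breach_flags[-3:]) >= 2


def evaluate(daily_rates: list[float]) -> tuple[list[bool], list[bool]]:
    """Walk one category's daily rates after a 14-day warm-up and return
    (breach flag per day, page decision per day)."""
    flags: list[bool] = []
    pages: list[bool] = []
    for i in range(WINDOW_DAYS, len(daily_rates)):
        flags.append(breaches(daily_rates[i - WINDOW_DAYS:i], daily_rates[i]))
        pages.append(should_page(flags))
    return flags, pages
```

With a baseline alternating between 4 and 5 percent, a jump to 20 percent breaches on the first day but only pages on the second, while a one-day spike to 20 percent never pages. Note that the trailing window absorbs the shifted days, so sigma inflates quickly after a step change; freezing the baseline while an alert is open is one way to handle that, though the answer above does not specify it.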