How to A/B test prompts after a model update
When a provider ships a new snapshot, your existing prompts may behave differently: sometimes better, sometimes worse. A disciplined A/B process looks like this:
(1) Freeze your eval suite first. Don't change prompts and the model at the same time, or you won't be able to attribute the deltas. Run the *exact same* prompts on the old and new snapshots, score both, and document the per-prompt deltas.
(2) Identify prompts that regressed. Sort by score drop and investigate the top 10–20 regressions manually by reading the actual outputs side by side. Often you'll find the new snapshot responds differently to specific phrasings (e.g., "be concise" now produces 2-sentence answers where before it produced 5).
(3) Generate prompt variants for regressions. For each regressing prompt, draft 2–3 alternative phrasings. Score each variant against the new snapshot using the same frozen eval suite.
(4) Run a statistical A/B per variant. Run each variant N=50–200 times against the new snapshot, compute pass-rate confidence intervals, and pick the winner. Use sequential testing (mSPRT) if you want to peek at interim results without inflating the false-positive rate.
(5) Ship the winning variants in a versioned prompt registry (LangSmith, PromptLayer, or your own). Log prompt-version + model-snapshot on every production request.
(6) Continue daily monitoring on the new prompt + new snapshot combination.
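Steps (1), (2), and (4) reduce to a little arithmetic, sketched below in plain Python. This is an illustration under assumptions, not a prescribed implementation: the function names `rank_regressions` and `wilson_interval` are hypothetical, scores are assumed to be per-prompt floats keyed by a prompt ID, and the Wilson score interval is one reasonable choice of pass-rate confidence interval (the text above does not name a specific method).

```python
import math
from typing import Dict, List, Tuple


def rank_regressions(
    old: Dict[str, float], new: Dict[str, float], top_k: int = 20
) -> List[Tuple[str, float]]:
    """Return up to top_k (prompt_id, delta) pairs, largest score drop first.

    Only prompts scored on both snapshots are compared, so the eval suite
    must be identical across the two runs (step 1).
    """
    deltas = {pid: new[pid] - old[pid] for pid in old if pid in new}
    regressed = [(pid, d) for pid, d in deltas.items() if d < 0]
    # Most negative delta (biggest regression) sorts first.
    return sorted(regressed, key=lambda item: item[1])[:top_k]


def wilson_interval(passes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Made-up scores purely for illustration.
    old_scores = {"summarize": 0.90, "classify": 0.80, "extract": 0.50}
    new_scores = {"summarize": 0.60, "classify": 0.85, "extract": 0.40}
    for prompt_id, delta in rank_regressions(old_scores, new_scores):
        print(f"{prompt_id}: {delta:+.2f}")

    # Hypothetical pass counts for two variants, 100 runs each.
    for name, (passes, runs) in {"variant_a": (78, 100), "variant_b": (91, 100)}.items():
        low, high = wilson_interval(passes, runs)
        print(f"{name}: {passes / runs:.2f} [{low:.3f}, {high:.3f}]")
```

With N near the low end of the 50–200 range the intervals are wide, so two variants with visibly different pass rates can still have overlapping intervals; that is a signal to collect more runs before declaring a winner. Note that this fixed-N interval does not cover the sequential (mSPRT) case, which needs its own procedure.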
Tools: Promptfoo or DeepEval for the offline A/B harness; LangSmith or PromptLayer for prompt versioning; ModelWatch for post-rollout drift monitoring of the model itself.
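The per-request logging from step (5) can be as simple as one structured log line. The sketch below uses only the Python standard library; the `RequestTag` dataclass, its field names, the `log_request` helper, and the snapshot string are all hypothetical placeholders, not the schema of any registry named above.

```python
import json
import logging
from dataclasses import asdict, dataclass

logger = logging.getLogger("prompt_audit")


@dataclass(frozen=True)
class RequestTag:
    """Identifies which prompt version ran against which model snapshot."""
    prompt_id: str
    prompt_version: str
    model_snapshot: str


def log_request(tag: RequestTag, request_id: str) -> str:
    """Emit one JSON log line per production request and return it."""
    record = {"request_id": request_id, **asdict(tag)}
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Placeholder values; substitute your registry's version ID and the
    # provider's exact snapshot identifier.
    tag = RequestTag("summarize", "v3", "example-model-2024-01-01")
    log_request(tag, request_id="req-0001")
```

Keeping both identifiers on every record means a later quality change can be attributed to either the prompt or the snapshot by filtering the logs, which is the same attribution concern that motivates freezing the eval suite in step (1).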