How to A/B test prompts after a model update

Question

Accepted Answer

When a provider ships a new snapshot, your existing prompts may behave differently: sometimes better, sometimes worse. The disciplined A/B process:

**(1) Freeze your eval suite first.** Don't change prompts and the model at the same time; you won't be able to attribute the deltas. Run the *exact same* prompts on the old and new snapshots, score both, and document the per-prompt deltas.

**(2) Identify prompts that regressed.** Sort by score drop. Investigate the top 10–20 regressions manually: read the actual outputs side by side. Often you'll find the new snapshot responds differently to specific phrasings (e.g., "be concise" now produces 2-sentence answers where before it produced 5).

**(3) Generate prompt variants for regressions.** For each regressed prompt, draft 2–3 alternative phrasings. Score each variant against the new snapshot.

**(4) Statistical A/B per variant.** Run each variant N=50–200 times against the new snapshot, compute pass-rate confidence intervals, and pick the winner. Use sequential testing (mSPRT, the mixture sequential probability ratio test) if you want to peek at results mid-run without inflating the false-positive rate.

**(5) Ship the winning variants in a versioned prompt registry** (LangSmith, PromptLayer, or your own). Log prompt-version + model-snapshot on every production request.

**(6) Continue daily monitoring on the new prompt + new snapshot combination.**

Tools: Promptfoo or DeepEval for the offline A/B harness. LangSmith or PromptLayer for prompt versioning. ModelWatch for post-rollout drift monitoring on the model itself.
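The delta ranking in step (2) and the pass-rate intervals in step (4) can be sketched in plain Python. This is a minimal illustration, not the API of any tool named above: the function names `top_regressions` and `wilson_interval`, the dict-of-scores input format, and the choice of the Wilson score interval (one common option for binomial pass rates) are all assumptions made for the example.

```python
import math


def top_regressions(old_scores, new_scores, k=20):
    """Return up to k (prompt_id, delta) pairs, largest score drop first.

    old_scores / new_scores are hypothetical dicts mapping prompt id to a
    numeric eval score for the old and new snapshot respectively.
    """
    deltas = {
        pid: new_scores[pid] - old_scores[pid]
        for pid in old_scores
        if pid in new_scores
    }
    # Keep only prompts whose score went down, most negative delta first.
    regressed = [(pid, d) for pid, d in deltas.items() if d < 0]
    return sorted(regressed, key=lambda item: item[1])[:k]


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Made-up scores purely for illustration.
    old = {"summarize": 0.90, "extract": 0.80, "classify": 0.50}
    new = {"summarize": 0.40, "extract": 0.85, "classify": 0.45}
    for pid, delta in top_regressions(old, new):
        print(f"{pid}: {delta:+.2f}")

    # Two hypothetical variants, each run 100 times against the new snapshot.
    for name, passes in [("variant_a", 62), ("variant_b", 81)]:
        low, high = wilson_interval(passes, 100)
        print(f"{name}: pass rate {passes / 100:.2f}, CI [{low:.3f}, {high:.3f}]")
```

If two variants' intervals overlap heavily, N is probably too small to call a winner. Note that checking intervals after a fixed N is not the same as the sequential testing mentioned in step (4); peeking repeatedly at these intervals would still inflate the false-positive rate.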