What is a golden-prompt eval suite?

Question

Accepted Answer

A golden-prompt eval suite is a curated, frozen set of input prompts, each paired with either an expected output or a deterministic grading rule, that you run on a regular schedule to detect changes in model behavior. "Golden" means the inputs and grading are immutable within a given version of the suite: you do not edit them in response to model changes, because then a score change could come from the suite rather than from the model.

A typical composition is 30 percent representative production-style prompts, 30 percent academic benchmark items (subsets of MMLU, HumanEval, GSM8K), 20 percent adversarial and edge cases, 10 percent regression-canary prompts that previously broke, and 10 percent format-strictness prompts (JSON schema, exact string output, structured extraction).

Best practices: version-control the suite, hash it, and include the hash in every eval run record so that scores are only compared across runs of the same suite version. Avoid contaminated prompts: items the model has plausibly trained on inflate scores and don't catch regressions. Refresh the adversarial slice quarterly with newly observed failures, releasing each refresh as a new suite version with a new hash rather than editing the current one in place.

ModelWatch ships a default 300-item suite spanning these categories, plus a UI for uploading your own private golden prompts that never leave your account.
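To make the "hash it and record the hash" advice concrete, here is a minimal Python sketch. The item layout (`id`, `category`, `prompt`, `grader`), the two grader types, and every function name are assumptions made for illustration; none of this is ModelWatch's format or API.

```python
import hashlib
import json
from collections import Counter

# Hypothetical suite layout: each item has an id, a category, a prompt,
# and a deterministic grading rule. Real suites will differ.
SUITE = [
    {"id": "canary-001", "category": "regression_canary",
     "prompt": "What is 2 + 2? Answer with the number only.",
     "grader": {"type": "exact", "value": "4"}},
    {"id": "format-001", "category": "format_strictness",
     "prompt": "Return a JSON object with keys 'name' and 'age'.",
     "grader": {"type": "json_keys", "keys": ["name", "age"]}},
    {"id": "prod-001", "category": "production",
     "prompt": "Reply with exactly the word OK.",
     "grader": {"type": "exact", "value": "OK"}},
]


def suite_hash(items):
    """SHA-256 of a canonical JSON encoding of the suite.

    Items are sorted by id and keys are sorted, so the hash depends only
    on content, not on list order or dict key order.
    """
    ordered = sorted(items, key=lambda item: item["id"])
    canonical = json.dumps(ordered, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def grade(item, output):
    """Apply the item's deterministic grading rule to a model output."""
    rule = item["grader"]
    if rule["type"] == "exact":
        return output.strip() == rule["value"]
    if rule["type"] == "json_keys":
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return False
        return isinstance(parsed, dict) and set(rule["keys"]) <= set(parsed)
    raise ValueError(f"unknown grader type: {rule['type']}")


def composition(items):
    """Fraction of the suite in each category, to check the intended mix."""
    counts = Counter(item["category"] for item in items)
    return {category: n / len(items) for category, n in counts.items()}


def make_run_record(items, model_id, outputs):
    """Build a run record that pins the results to one suite version.

    `outputs` maps item id to the model's raw output string.
    """
    passed = sum(grade(item, outputs[item["id"]]) for item in items)
    return {
        "model_id": model_id,
        "suite_hash": suite_hash(items),
        "n_items": len(items),
        "pass_rate": passed / len(items),
    }
```

Because the hash covers prompts and grading rules together, any edit to either yields a different `suite_hash`, so two run records are comparable only when their hashes match. When you refresh the adversarial slice, the new hash marks where the score history starts over.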