How to build a golden-prompt eval suite from scratch

Question

Accepted Answer

Six steps.

**(1) Define your task surface.** Audit production traffic for 1–2 weeks. Cluster prompts by intent (coding, summarization, structured extraction, Q&A, etc.). Pick the 5–10 categories that together cover about 80 percent of traffic.

**(2) Sample 30–50 prompts per category.** Aim for 200–500 total. Include easy cases (the model should never fail), hard cases (the model usually struggles), and adversarial cases (designed to trigger known failure modes).

**(3) Define deterministic grading.** Use exact match for closed-ended tasks, JSON-schema validation for structured output, AST parse plus unit tests for code, and regex for format constraints. Fall back to LLM-as-judge only when nothing else works, and even then freeze the judge model and give it a written rubric.

**(4) Add academic-benchmark slices.** 100 MMLU items (stratified across 5–10 subjects), 50 HumanEval problems, 50 GSM8K problems, and 50 TruthfulQA-MC1 items. These give you industry-comparable signal.

**(5) Freeze and version.** Commit the suite to a private repo. Hash the contents. Include the hash in every eval run record. Resist the urge to edit items in response to model changes, because that destroys longitudinal signal.

**(6) Run on a schedule.** Run daily, ideally at the same time of day to control for serving-load variation. Log per-prompt scores plus aggregate metrics. Store at least 90 days of history.

Budget: 1–2 engineering weeks to build, plus roughly $50–$300/month in API costs depending on suite size and model coverage. Alternatively, use ModelWatch's pre-built 300-item default suite plus your private golden-prompt upload, which takes about 5 minutes to set up.
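To make steps 3, 5, and 6 more concrete, here is a minimal Python sketch using only the standard library. The item layout (`id`, `prompt`, `grader`, `expected`), the function names, and the `generate` callable standing in for your model call are all assumptions made for illustration, not part of any particular tool. The JSON grader only checks for required keys, as a simplified stand-in for real JSON-schema validation, and the code grader only checks that the output parses (unit tests would come on top).

```python
import ast
import hashlib
import json
import re
from typing import Callable


def grade_exact(output: str, expected: str) -> bool:
    """Exact match after trimming surrounding whitespace."""
    return output.strip() == expected.strip()


def grade_json_keys(output: str, expected: list) -> bool:
    """Simplified stand-in for schema validation: output must be a JSON
    object containing every key listed in `expected`."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in expected)


def grade_regex(output: str, expected: str) -> bool:
    """The whole (trimmed) output must match the regex in `expected`."""
    return re.fullmatch(expected, output.strip()) is not None


def grade_python_parses(output: str, expected=None) -> bool:
    """Output must parse as Python source; `expected` is unused."""
    try:
        ast.parse(output)
    except (SyntaxError, ValueError):
        return False
    return True


# Hypothetical grader names; each item picks one via its "grader" field.
GRADERS = {
    "exact": grade_exact,
    "json_keys": grade_json_keys,
    "regex": grade_regex,
    "python_parses": grade_python_parses,
}


def suite_hash(items: list) -> str:
    """SHA-256 over a canonical JSON dump, so the hash does not depend
    on item order or dict key order, only on the suite contents."""
    canonical = json.dumps(
        sorted(items, key=lambda item: item["id"]),
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_suite(items: list, generate: Callable[[str], str]) -> dict:
    """Grade every item and return a run record that carries the suite
    hash, per-prompt scores, and an aggregate pass rate."""
    scores = {}
    for item in items:
        output = generate(item["prompt"])
        scores[item["id"]] = GRADERS[item["grader"]](output, item["expected"])
    pass_rate = sum(scores.values()) / len(scores) if scores else 0.0
    return {
        "suite_hash": suite_hash(items),
        "scores": scores,
        "pass_rate": pass_rate,
    }
```

A scheduled job would call `run_suite(items, generate)` once per model and append the returned record, with a timestamp, to whatever store holds your history. Because the hash is stored with each record, any edit to the suite shows up as a hash change rather than as an unexplained score shift.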