How to build a golden-prompt eval suite from scratch
Six steps. (1) Define your task surface. Audit production traffic for 1–2 weeks. Cluster prompts by intent (coding, summarization, structured extraction, Q&A, etc.). Pick the 5–10 categories that together cover about 80 percent of traffic. (2) Sample 30–50 prompts per category. Aim for 200–500 total. Include easy cases (the model should never fail), hard cases (the model usually struggles), and adversarial cases (designed to trigger known failure modes). (3) Define deterministic grading. Use exact match for closed-ended tasks, JSON-schema validation for structured output, AST parsing plus unit tests for code, and regex for format constraints. Fall back to LLM-as-judge only when nothing else works, and even then freeze the judge model version and give it an explicit rubric.
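The grading options in step (3) can be sketched as a handful of small Python functions. This is a minimal illustration under stated assumptions, not a reference implementation: every function name here is hypothetical, the JSON check only verifies required keys and value types (full JSON-schema validation would need a dedicated validator library), and the code grader calls exec() directly, which is only acceptable inside a sandbox.

```python
import ast
import json
import re


def grade_exact(output: str, expected: str) -> bool:
    """Exact match after trimming surrounding whitespace."""
    return output.strip() == expected.strip()


def grade_json_keys(output: str, required: dict[str, type]) -> bool:
    """Parse the output as JSON and check required keys and value types.

    Simplified stand-in for JSON-schema validation (an assumption of this sketch).
    """
    try:
        data = json.loads(output)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in required.items())


def grade_regex(output: str, pattern: str) -> bool:
    """The whole trimmed output must match the format pattern."""
    return re.fullmatch(pattern, output.strip()) is not None


def grade_code(output: str, func_name: str, cases: list[tuple[tuple, object]]) -> bool:
    """Check that the code parses, then run it against (args, expected) pairs.

    WARNING: exec() on model output is unsafe; run this only in a sandbox.
    """
    try:
        ast.parse(output)
    except (SyntaxError, ValueError):
        return False
    namespace: dict = {}
    try:
        exec(output, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Any runtime failure in the generated code counts as a failed item.
        return False
```

Each grader returns a plain boolean, so per-prompt results are reproducible from run to run and can be averaged into category-level pass rates.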
(4) Add academic-benchmark slices. 100 MMLU items (stratified across 5–10 subjects), 50 HumanEval problems, 50 GSM8K problems, and 50 TruthfulQA-MC1 items. These give you a signal you can set against publicly reported numbers, keeping in mind that a score on a subset is noisier than a score on the full benchmark. (5) Freeze and version. Commit the suite to a private repo. Hash the contents. Include the hash in every eval run record. Resist the urge to edit items in response to model changes, because that destroys the longitudinal signal. (6) Run on a schedule. Run daily, ideally at the same time of day to control for serving-load variation. Log per-prompt scores plus aggregate metrics. Store at least 90 days of history.
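Steps (5) and (6) can be sketched as follows: hash a canonical encoding of the suite and stamp that hash on every run record. This is a rough sketch under assumptions that are not in the text above: the suite is a list of dicts with a unique "id" field, SHA-256 is the chosen hash, and the record layout and the names suite_hash and run_record are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def suite_hash(items: list[dict]) -> str:
    """SHA-256 over a canonical JSON encoding of the suite.

    Items are sorted by id and keys are sorted, so the hash depends only on
    the suite contents, not on file order or dict ordering.
    """
    ordered = sorted(items, key=lambda item: item["id"])
    canonical = json.dumps(
        ordered, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_record(items: list[dict], scores: dict[str, float], model: str) -> dict:
    """Build one eval run record with per-prompt scores and an aggregate."""
    if not scores:
        raise ValueError("scores must contain at least one prompt result")
    return {
        "suite_hash": suite_hash(items),  # ties the run to an exact suite version
        "model": model,
        "run_at": datetime.now(timezone.utc).isoformat(),
        "per_prompt": scores,
        "mean_score": sum(scores.values()) / len(scores),
    }
```

When comparing runs over time, filter on suite_hash first: two records with different hashes were scored against different suites and should not share a trend line.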
Budget: 1–2 engineering weeks to build, plus roughly $50–$300/month in API costs depending on suite size and model coverage. Alternatively, use ModelWatch's pre-built 300-item default suite plus your private golden-prompt upload, which takes about 5 minutes to set up.