How to run daily evals on GPT-4, Claude, and Gemini

Question

Accepted Answer

Build the eval loop in five layers:

1. **Prompts**: 50–500 dated, frozen golden prompts plus an academic slice (e.g., 200 MMLU items, the full 164 HumanEval problems, 100 GSM8K problems, 100 SimpleBench items if licensed).
2. **Adapters**: one thin client per provider with retry, timeout, and explicit `model=` snapshot pinning.
3. **Graders**: deterministic checks (exact match, regex, AST parse for code, `json.loads` for structured output) plus an optional LLM-as-judge with a frozen judge model.
4. **Storage**: an append-only time series of `(timestamp, provider, model, snapshot, prompt_id, score, latency_ms, input_tokens, output_tokens, refused, format_valid)` rows.
5. **Alerting**: a control chart on each metric (control limits at the rolling 14-day mean ± 3 sigma) plus statistical tests (chi-squared on pass rate, Mann-Whitney U on latency) before paging.

This is roughly what ModelWatch ships out of the box. If you build it yourself, budget 2–3 engineering weeks and ~$50–$300/month in API costs, depending on suite size and model coverage.
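To make the alerting layer concrete, here is a minimal stdlib-only sketch of the control-chart check combined with a chi-squared test on pass rate. The function names (`control_limits`, `chi2_pass_rate`, `should_page`), the `alpha=0.01` threshold, and the rule that both signals must agree before paging are assumptions for illustration, not a prescribed design. The latency test is left out; in practice you would probably reach for `scipy.stats.mannwhitneyu` there and `scipy.stats.chi2_contingency` instead of the hand-rolled statistic.

```python
import math
import statistics


def control_limits(history, window=14, sigmas=3.0):
    """Return (lower, upper) control limits from the last `window` daily values."""
    recent = history[-window:]
    if len(recent) < 2:
        raise ValueError("need at least two days of history")
    mean = statistics.fmean(recent)
    sd = statistics.stdev(recent)  # sample standard deviation
    return mean - sigmas * sd, mean + sigmas * sd


def chi2_pass_rate(base_pass, base_n, today_pass, today_n):
    """Pearson chi-squared test on a 2x2 pass/fail table (1 degree of freedom,
    no continuity correction). Returns (statistic, p_value)."""
    table = [
        [base_pass, base_n - base_pass],
        [today_pass, today_n - today_pass],
    ]
    row_totals = [base_n, today_n]
    col_totals = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    total = base_n + today_n
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected == 0:
                # Degenerate table (all passes or all fails): no evidence of change.
                return 0.0, 1.0
            stat += (table[i][j] - expected) ** 2 / expected
    # Survival function of chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


def should_page(history, base_pass, base_n, today_pass, today_n, alpha=0.01):
    """Page only if today's pass rate is outside the control limits AND the
    chi-squared test rejects 'same pass rate as baseline' (assumed policy)."""
    lower, upper = control_limits(history)
    today_rate = today_pass / today_n
    out_of_control = not (lower <= today_rate <= upper)
    _, p_value = chi2_pass_rate(base_pass, base_n, today_pass, today_n)
    return out_of_control and p_value < alpha


# Hypothetical data: 14 days of daily pass rates on a 500-prompt suite.
history = [0.90, 0.91, 0.89, 0.90, 0.92, 0.88, 0.90,
           0.91, 0.89, 0.90, 0.91, 0.90, 0.89, 0.90]
print(control_limits(history))
print(should_page(history, base_pass=6300, base_n=7000, today_pass=400, today_n=500))
```

Requiring both signals is one way to keep a single noisy day from paging anyone: the control chart catches the excursion, and the chi-squared test checks that the drop is larger than sampling noise for a suite of that size.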