What benchmarks should I track for LLM regression (MMLU, HumanEval, GSM8K)?

Question

Accepted Answer

The right benchmark depends on your workload:

- **General knowledge / reasoning**: MMLU (57 subjects, multiple choice), MMLU-Pro for a harder variant, ARC-Challenge for science reasoning, HellaSwag for commonsense.
- **Coding**: HumanEval and HumanEval+ (pass@1), MBPP, SWE-Bench Verified for repo-level tasks, LiveCodeBench for contamination-resistant scoring.
- **Math**: GSM8K (grade-school word problems), MATH (competition problems), AIME for frontier models.
- **Reasoning under distractors**: SimpleBench, BIG-Bench Hard, GPQA Diamond.
- **Instruction following**: IFEval, MT-Bench.
- **Long context**: RULER, ZeroSCROLLS.
- **Safety / refusal**: a private suite is best, because public ones leak into training data.

For regression *monitoring* specifically, you want benchmarks where (a) you can run a stable subset daily within a reasonable cost budget and (b) the scoring is deterministic (exact-match answers or unit tests, rather than an LLM judge). A common starter pack is a 200-question MMLU sample drawn with a fixed random seed, the full 164-problem HumanEval set, and a 100-problem GSM8K sample. ModelWatch ships all of these with frozen seeds plus user-uploaded golden prompts.
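If you want to roll the "frozen subset plus deterministic scoring" idea yourself, here is a minimal sketch. The helper names (`frozen_subset`, `exact_match_accuracy`, `is_regression`), the seed, and the 2-point tolerance are all hypothetical choices for illustration, not part of any particular tool. It assumes each benchmark item has a stable string ID and a single reference answer.

```python
import random


def frozen_subset(item_ids, k, seed):
    """Pick the same k items on every run.

    Sorting first makes the choice independent of the order in which
    the IDs were loaded; the seeded RNG makes it repeatable.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(item_ids), k))


def exact_match_accuracy(predictions, references):
    """Deterministic scoring: share of items whose normalized answer matches."""
    if not references:
        raise ValueError("no references to score against")
    hits = sum(
        predictions.get(item_id, "").strip().lower() == answer.strip().lower()
        for item_id, answer in references.items()
    )
    return hits / len(references)


def is_regression(baseline, current, tolerance=0.02):
    """Flag a drop larger than the tolerance (hypothetical default: 2 points)."""
    return (baseline - current) > tolerance


if __name__ == "__main__":
    # Hypothetical IDs standing in for a real benchmark's test split.
    all_ids = [f"gsm8k-{i}" for i in range(1000)]
    subset = frozen_subset(all_ids, k=100, seed=1234)
    print(len(subset), "items selected")
```

In practice it is safer to write the selected IDs to a file once and load that file on later runs, since the exact output of `random.sample` for a given seed is not guaranteed to stay the same across Python versions. Pair this with greedy decoding (temperature 0) so that score changes reflect the model rather than sampling noise.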