What benchmarks should I track for LLM regression (MMLU, HumanEval, GSM8K)?
The right benchmark depends on your workload. General knowledge / reasoning: MMLU (57 subjects, multiple choice), MMLU-Pro for a harder variant, ARC-Challenge for science reasoning, HellaSwag for commonsense. Coding: HumanEval and HumanEval+ (pass@1), MBPP, SWE-Bench Verified for repo-level tasks, LiveCodeBench for contamination-resistant scoring. Math: GSM8K (grade-school word problems), MATH (competition problems), AIME for frontier models. Harder reasoning: BIG-Bench Hard, GPQA Diamond (graduate-level science questions), SimpleBench (trick questions with distractors). Instruction following: IFEval, MT-Bench. Long context: RULER, ZeroSCROLLS. Safety / refusal: a private suite is best, because public ones leak into training data.
For regression *monitoring* specifically, you want benchmarks where (a) you can run a stable subset daily within a reasonable cost budget and (b) the scoring is deterministic. A common starter pack is MMLU-200-random-seed (a 200-question MMLU subset drawn with a fixed random seed), HumanEval-164 (the full 164-problem set), and GSM8K-100 (a 100-problem subset). ModelWatch ships all of these with frozen seeds and also supports user-uploaded golden prompts.
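To make "stable subset" and "deterministic scoring" concrete, here is a minimal sketch in plain Python. It is an illustration only and not how ModelWatch is implemented: the function names, the exact-match scoring rule, and the 0.02 tolerance are all assumptions chosen for the example.

```python
import random


def frozen_subset(example_ids, k, seed=0):
    """Pick the same k example IDs on every run for a given seed (hypothetical helper)."""
    # Sort first so the sample does not depend on the order the IDs arrive in.
    ordered = sorted(example_ids)
    # A private Random instance keeps the draw independent of global RNG state.
    return sorted(random.Random(seed).sample(ordered, k))


def exact_match_accuracy(predictions, references):
    """Deterministic scoring: fraction of answers that match exactly after stripping whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def is_regression(score, baseline, tolerance=0.02):
    """Flag a run whose score falls more than `tolerance` below the baseline (assumed threshold)."""
    return score < baseline - tolerance


# Usage: draw a 200-question subset once, then reuse the same IDs every day.
all_ids = [f"q{i}" for i in range(1000)]
daily_ids = frozen_subset(all_ids, 200, seed=1234)
```

The subset is only reproducible if the seed, the ID list, and ideally the Python version stay fixed, so it is safer to store the sampled IDs alongside the seed. Scoring is only deterministic end to end if the model is also decoded deterministically (for example, greedy decoding).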