MMLU vs HumanEval — which benchmark should I use for monitoring?

Question

Accepted Answer

They measure different capabilities, and you generally want both rather than one.

**MMLU** (Massive Multitask Language Understanding, Hendrycks et al. 2020) is a 57-subject multiple-choice exam covering humanities, STEM, social sciences, and professional domains (law, medicine, accounting). It captures broad knowledge and reasoning, so a regression here suggests the model's general competence shifted. Scoring is a deterministic A/B/C/D match against the answer key, which makes it well suited to daily monitoring. Frontier models cluster in the 85–90 percent range, so movement of 1–2 points is meaningful.

**HumanEval** (Chen et al. 2021, OpenAI) is 164 hand-written Python programming problems scored by `pass@1`: does the generated function pass the held-out unit tests on the first attempt? It measures code generation specifically and gives little direct signal on general knowledge or refusals. Frontier models score 85–95 percent on HumanEval; because of that saturation, many teams have moved to HumanEval+, MBPP, or LiveCodeBench for stricter or more contamination-resistant signal.

Use MMLU if your workload is knowledge-heavy Q&A, RAG, or general assistant tasks. Use HumanEval if you're shipping a coding product. Use both for portfolio coverage.

For monitoring specifically, a 200-item MMLU subset plus the full 164 HumanEval problems runs cheaply (under $5 per provider per day on most APIs) and gives you two independent signals. On a 200-item subset, though, each question is worth half a point, so a 1–2 point move is only 2–4 questions; confirm it with a rerun or the full set before treating it as a regression. ModelWatch runs both daily on every tracked model.
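To make the two scoring rules concrete, here is a minimal sketch rather than any harness's actual code. The function names `mmlu_accuracy`, `pass_at_k`, and `binomial_stderr` are hypothetical. `pass_at_k` uses the unbiased estimator from Chen et al. 2021, which reduces to the plain pass rate when `k = 1`. `binomial_stderr` is a rough noise estimate that assumes items are independent draws and ignores run-to-run API nondeterminism.

```python
import math
from typing import Sequence


def mmlu_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of items whose predicted letter matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one item")
    # Compare letters case-insensitively, ignoring surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: samples generated, c: samples that passed the tests, k: budget (k <= n).
    Formula: 1 - C(n - c, k) / C(n, k). Average over problems for a benchmark score.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def binomial_stderr(accuracy: float, n_items: int) -> float:
    """Standard error of an accuracy estimated from n_items independent items."""
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)
```

As a usage example, `binomial_stderr(0.87, 200)` comes out at roughly 0.024, or about 2.4 points, which is why a single-day move of 1–2 points on a 200-item subset deserves a rerun before anyone is paged.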