HellaSwag for common sense — is it still meaningful in 2026?

Question

Accepted Answer

HellaSwag (Zellers et al. 2019) is a 70K-item commonsense-inference benchmark: given a context, pick the most plausible sentence continuation from four options. The "Hella" stands for "Harder Endings, Longer contexts, and Low-shot Activities." When it was introduced, state-of-the-art models scored ~48 percent while humans hit 95 percent.

**Is it still meaningful?** Largely saturated. Frontier models (GPT-4, Claude 3.5 Sonnet, Gemini 1.5 Pro, Llama 3.1 405B) all score 90–95+ percent. The benchmark is also heavily contaminated: many open models have plausibly seen the test set during pretraining. For frontier-model regression monitoring, HellaSwag rarely moves enough to be a useful signal; a 1-point drop is well within day-over-day sampling noise on a 200-item daily sample.

Where it's still useful: (a) **smaller open-weight models** (7B–13B class), where HellaSwag is not yet saturated and a regression after a quantization change or fine-tune is detectable; (b) as part of a **multi-benchmark portfolio score** (e.g., HELM's accuracy aggregate), where it contributes alongside non-saturated benchmarks; (c) **historical comparisons** when looking at progress curves over 2019–2024.

For day-to-day drift monitoring of a frontier API model, deprecate HellaSwag in favor of MMLU-Pro, GPQA Diamond, or LiveBench. ModelWatch tracks HellaSwag for open-weight coverage but doesn't alert on it for frontier closed models.
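To put a rough number on the "within noise" claim for a 200-item daily sample, here is a minimal sketch using the binomial standard error of an accuracy estimate. The 0.93 baseline accuracy is an illustrative value inside the 90–95 percent range quoted above, not a measured score. The function names, the 1.96 threshold, and the assumption that each day's sample is drawn independently are mine and not part of any particular monitoring tool.

```python
import math


def accuracy_standard_error(accuracy: float, n_items: int) -> float:
    """Binomial standard error of an accuracy measured on n_items independent items."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


def drop_is_detectable(accuracy: float, n_items: int, drop: float, z: float = 1.96) -> bool:
    """Return True if `drop` exceeds z standard errors of the difference
    between two independent samples of n_items each.

    Assumption: both days are sampled independently and have roughly the
    same underlying accuracy, so the difference has SE = sqrt(2) * SE(single day).
    """
    se_diff = math.sqrt(2.0) * accuracy_standard_error(accuracy, n_items)
    return drop > z * se_diff


if __name__ == "__main__":
    baseline = 0.93   # hypothetical frontier-model accuracy (illustrative)
    daily_n = 200     # daily sample size mentioned in the answer

    se = accuracy_standard_error(baseline, daily_n)
    print(f"standard error on one day: {se * 100:.2f} points")
    print("1-point drop detectable:", drop_is_detectable(baseline, daily_n, 0.01))
```

Under these assumptions the single-day standard error is about 1.8 accuracy points, so a 1-point day-over-day move is smaller than the noise of either measurement. Raising `n_items` or averaging over several days shrinks the standard error, which is the lever to pull if HellaSwag has to stay in a monitoring suite.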