SimpleBench — what does it actually measure?
SimpleBench (Philip / AI Explained, 2024) is a private 200-item benchmark of trick questions that humans solve easily but frontier LLMs often get wrong. The questions exploit spatial reasoning, social and temporal common sense, and adversarial framings designed to trigger pattern-matching errors. A typical SimpleBench question reads like a riddle whose correct answer is obvious to a 10-year-old but which trips up models that pattern-match on surface features of the prompt.
Why it matters for monitoring: many public benchmarks have leaked into pretraining data, which can inflate scores. SimpleBench questions are deliberately held private and rotated, so a score is much less likely to reflect memorization and more likely to reflect reasoning. As of late 2024, even frontier models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) scored under 50 percent, while humans average ~84 percent. The full item set isn't public, but the methodology is, and AI Explained publishes periodic leaderboards.
How to use it: you cannot easily reproduce SimpleBench in your own harness because the items are gated. Treat the public leaderboard as a *secondary external signal*: if the SimpleBench score for a snapshot moves sharply, that is evidence of a possible reasoning shift, and your own in-house MMLU/HumanEval monitoring should corroborate it. ModelWatch cross-references published SimpleBench scores against its own daily eval deltas on the same snapshots.
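
The cross-referencing idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ModelWatch's actual implementation: the `SnapshotScores` schema, the `cross_reference` function, and both thresholds (`sharp_move`, `min_internal_move`) are hypothetical names and values chosen for the example, and the scores below are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotScores:
    """Scores in percentage points for one model snapshot (hypothetical schema)."""
    snapshot: str
    simplebench: float          # score copied from the published leaderboard
    internal: dict[str, float]  # in-house evals, e.g. {"mmlu": 80.0}


def cross_reference(prev, curr, sharp_move=5.0, min_internal_move=1.0):
    """Classify a SimpleBench move between two snapshots.

    Both thresholds are illustrative assumptions, not calibrated values.
    """
    sb_delta = curr.simplebench - prev.simplebench
    if abs(sb_delta) < sharp_move:
        return "no_sharp_move"
    # Internal evals that moved in the same direction as SimpleBench
    # by at least min_internal_move points.
    agreeing = [
        name
        for name, score in curr.internal.items()
        if name in prev.internal
        and (score - prev.internal[name]) * sb_delta > 0
        and abs(score - prev.internal[name]) >= min_internal_move
    ]
    return "corroborated" if agreeing else "uncorroborated"


# Made-up numbers, for illustration only.
old = SnapshotScores("snapshot-a", 30.0, {"mmlu": 80.0, "humaneval": 70.0})
new = SnapshotScores("snapshot-b", 38.0, {"mmlu": 82.0, "humaneval": 70.2})
print(cross_reference(old, new))  # prints "corroborated"
```

An "uncorroborated" result is the interesting case: the external signal moved sharply but the in-house evals did not, which may mean the in-house suite is saturated or the leaderboard change has a different cause.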