ARC-AGI for reasoning — is it useful for production monitoring?
ARC-AGI (the Abstraction and Reasoning Corpus, introduced by François Chollet in 2019; ARC-AGI-2 followed in 2025) is a benchmark of visual-grid puzzles in which a model must infer a transformation rule from a few input-output examples and apply it to a new input. It's the canonical "fluid intelligence" benchmark, designed to resist training-data memorization. The ARC Prize 2024 competition put $1M+ on the table, and OpenAI's o3 model (December 2024) scored 75.7 percent on the semi-private set at low compute and 87.5 percent at high compute, the first reported result above the competition's 85 percent target.
Useful for production monitoring? Generally no. ARC-AGI is designed to measure frontier-research progress, not to detect day-to-day serving drift. Three problems get in the way: (a) cost: high-compute runs cost thousands of dollars per evaluation; (b) a low baseline: most production models score in the single digits or low teens, so a real drift signal is hard to distinguish from sampling noise; (c) contamination and access: the public set is openly published and likely contaminated, while the private set can only be scored through a Kaggle submission.
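To make point (b) concrete, here is a minimal sketch of the sampling noise on a low baseline score. It assumes each task is an independent pass/fail trial and uses a normal approximation, which is rough at small sample sizes and low accuracies. The 10 percent baseline and the sample sizes are illustrative values, not measurements, and the function names are hypothetical.

```python
import math


def accuracy_standard_error(p: float, n: int) -> float:
    """Binomial standard error of an accuracy p measured on n tasks."""
    return math.sqrt(p * (1.0 - p) / n)


def min_detectable_drop(p: float, n: int, z: float = 1.96) -> float:
    """Approximate smallest change between two independent runs of n tasks
    each that a two-sided z-test would flag at roughly 95% confidence.

    Assumption: both runs share the baseline variance p * (1 - p).
    """
    return z * math.sqrt(2.0) * accuracy_standard_error(p, n)


# Illustrative baseline of 10% accuracy at three hypothetical sample sizes.
for n in (20, 100, 400):
    se = accuracy_standard_error(0.10, n)
    drop = min_detectable_drop(0.10, n)
    print(f"n={n:4d}  standard error={se:.3f}  min detectable drop={drop:.3f}")
```

Under these assumptions, a 100-task run at a 10 percent baseline cannot reliably flag a change smaller than about 8 percentage points, which is most of the score itself.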
If you're explicitly benchmarking a reasoning-heavy product (math tutoring, research assistant, agentic planner), a small 20-item ARC-AGI public-set sample can be useful as a *capability tier* indicator: it confirms you're on a frontier-class model, but with so few items it cannot resolve small changes, so it does not detect drift. For routine drift monitoring, stick with MMLU + HumanEval + GSM8K + SimpleBench (via external leaderboard).
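A capability-tier check on a 20-item sample could be sketched as follows. It uses a Wilson score interval so that a small sample yields "inconclusive" rather than a false tier assignment. The `frontier_floor` threshold of 0.30 and the tier labels are hypothetical placeholders; choose a floor from your own baseline runs.

```python
from __future__ import annotations

import math


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate of `correct` out of `n` tasks."""
    if n <= 0 or not 0 <= correct <= n:
        raise ValueError("need n > 0 and 0 <= correct <= n")
    p = correct / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def capability_tier(correct: int, n: int, frontier_floor: float = 0.30) -> str:
    """Classify a run against a hypothetical frontier-class accuracy floor."""
    low, high = wilson_interval(correct, n)
    if low >= frontier_floor:
        return "frontier-class"
    if high < frontier_floor:
        return "below-frontier"
    return "inconclusive"


# Three hypothetical outcomes on a 20-item sample.
for solved in (1, 5, 12):
    print(solved, capability_tier(solved, 20))
```

With 20 items the interval is wide (about ±20 percentage points near the middle of the range), which is why this works for separating tiers but not for tracking week-to-week movement.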