ModelWatch

GSM8K for math reasoning monitoring: what it is and how to use it

GSM8K (Cobbe et al. 2021, OpenAI) is a dataset of roughly 8,500 grade-school math word problems, each requiring 2–8 reasoning steps to solve. The test set has 1,319 problems; for daily monitoring a stratified 100-item sample is sufficient and runs in under a minute per provider. Scoring is deterministic: each reference solution ends with a line in the format #### <number>, so you extract the model's final numeric answer and compare it to that value. Frontier models score 90–97 percent on GSM8K when using chain-of-thought prompting, which has made the benchmark partially saturated for top models. It remains useful as a regression canary, though, because *drops* are meaningful even when the absolute ceiling is high.
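The extract-and-compare scoring described above can be sketched as follows. This is a minimal illustration, not ModelWatch's actual scorer: the function names (`gold_answer`, `predicted_answer`, `is_correct`, `accuracy`) are hypothetical, and falling back to the last number in the completion when the model does not emit a `####` marker is an assumption about how to handle free-form output.

```python
import re
from decimal import Decimal

# Matches the dataset's final-answer marker, e.g. "#### 72" or "#### 1,000".
MARKER_RE = re.compile(r"####\s*(-?\d[\d,]*(?:\.\d+)?)")
# Matches any standalone number, allowing thousands separators and decimals.
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def parse_number(text):
    """Strip thousands separators and parse exactly (18 == 18.0)."""
    return Decimal(text.replace(",", ""))


def gold_answer(solution):
    """Pull the reference answer from a GSM8K solution string."""
    match = MARKER_RE.search(solution)
    return parse_number(match.group(1)) if match else None


def predicted_answer(completion):
    """Assumption: prefer a '####' marker, else use the last number seen."""
    match = MARKER_RE.search(completion)
    if match:
        return parse_number(match.group(1))
    numbers = NUMBER_RE.findall(completion)
    return parse_number(numbers[-1]) if numbers else None


def is_correct(completion, solution):
    gold = gold_answer(solution)
    return gold is not None and predicted_answer(completion) == gold


def accuracy(pairs):
    """pairs: iterable of (model_completion, reference_solution)."""
    results = [is_correct(completion, solution) for completion, solution in pairs]
    return sum(results) / len(results) if results else 0.0
```

Comparing parsed `Decimal` values rather than raw strings keeps formatting differences such as `1,000` versus `1000.0` from being counted as errors.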

What GSM8K catches: multi-step arithmetic reasoning, intermediate-state tracking, and chain-of-thought adherence. What it misses: advanced math (use MATH or AIME for that), real-world quantitative reasoning under ambiguity, and reasoning over numerical context. If you see GSM8K drop while MMLU stays flat, the model's chain-of-thought handling probably changed; likely causes are a sampler tweak, a system-prompt shift, or a speculative-decoding regression. ModelWatch runs the standard 100-item GSM8K slice daily and flags drops of more than 2 percentage points as significant, a threshold set relative to the typical day-over-day noise floor.
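The flagging rule can be sketched as below. The 2-percentage-point threshold comes from the text; everything else is an assumption for illustration, including the function name `flag_drop` and the choice of a trailing 7-day mean as the baseline (the text does not say what a drop is measured against).

```python
from statistics import mean

DROP_THRESHOLD_PP = 2.0  # from the text: flag drops of more than 2 points


def flag_drop(history, today, window=7, threshold_pp=DROP_THRESHOLD_PP):
    """Return True if today's score fell more than threshold_pp below baseline.

    history: earlier daily scores in percent, oldest first.
    today:   today's score in percent.
    Assumption: baseline is the mean of the last `window` days.
    """
    if not history:
        return False  # nothing to compare against yet
    baseline = mean(history[-window:])
    return (baseline - today) > threshold_pp
```

On a 100-item slice each problem is worth exactly one percentage point, so relative to a single prior day a flagged drop means at least three more problems were answered incorrectly.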