ModelWatch

Is GPT-4 getting worse over time?

There is published evidence that GPT-4's behavior shifts measurably between snapshots. The most cited reference is Chen, Zaharia, and Zou's 2023 paper "How Is ChatGPT's Behavior Changing over Time?" (Stanford/UC Berkeley, arXiv:2307.09009), which compared the March and June 2023 snapshots of gpt-4 on four tasks: prime-number identification, sensitive-question answering, code generation, and visual reasoning. They reported accuracy on prime identification falling from 97.6 percent to 2.4 percent, a drop of 95.2 percentage points, and found that the June snapshot answered fewer sensitive questions than the March one. OpenAI's own documentation confirms snapshots (e.g., gpt-4-0613 vs gpt-4-1106-preview) can differ in tool-use, JSON-mode strictness, and refusal calibration.

What this means in practice: even when you pin a dated snapshot rather than a floating alias, the underlying serving stack can still change (system-prompt defaults, safety classifiers, speculative decoding, MoE routing, quantization). The only way to *prove* the model got worse for your workload is a stable, dated eval baseline. ModelWatch runs that baseline every 24 hours against published benchmarks (MMLU, HumanEval, GSM8K, SimpleBench) plus user-supplied golden prompts, and alerts on statistically significant deltas.
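To make "statistically significant delta" concrete, here is a minimal sketch of one way to compare a dated baseline run with a later run: a two-proportion z-test on pass counts. This is an illustration under stated assumptions, not a description of ModelWatch's internal method. The names EvalRun, two_proportion_z_test, and is_regression, the 0.01 threshold, and the pass counts in the usage lines are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class EvalRun:
    """One dated run of a fixed prompt set (hypothetical structure)."""
    date: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def two_proportion_z_test(baseline: EvalRun, current: EvalRun) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in pass rates."""
    pooled = (baseline.passed + current.passed) / (baseline.total + current.total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / baseline.total + 1 / current.total))
    if se == 0:
        # Both runs passed everything or failed everything: no measurable delta.
        return 0.0, 1.0
    z = (current.pass_rate - baseline.pass_rate) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def is_regression(baseline: EvalRun, current: EvalRun, alpha: float = 0.01) -> bool:
    """Flag only drops (z < 0) that clear the assumed significance threshold."""
    z, p_value = two_proportion_z_test(baseline, current)
    return z < 0 and p_value < alpha


# Invented counts on a 164-problem set (the size of HumanEval).
baseline = EvalRun("2024-01-01", passed=140, total=164)
small_dip = EvalRun("2024-01-02", passed=138, total=164)
large_dip = EvalRun("2024-01-03", passed=110, total=164)

print(is_regression(baseline, small_dip))
print(is_regression(baseline, large_dip))
```

With these made-up counts, a two-problem dip is not flagged while a thirty-problem dip is. Two caveats for a daily check: running the test every 24 hours multiplies the chances of a false alarm, so a stricter alpha or a correction for repeated testing is worth considering, and the test assumes each prompt is scored independently as pass or fail.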