ModelWatch

GPT-4 drift paper: what did the Stanford/Berkeley study actually show?

The widely cited paper is Lingjiao Chen, Matei Zaharia, and James Zou (2023), "How Is ChatGPT's Behavior Changing over Time?" (arXiv:2307.09009). Its first version compared the March 2023 and June 2023 snapshots of GPT-4 and GPT-3.5 on four tasks. Headline findings: in the revised version of the paper, which tested on a mix of prime and composite numbers, GPT-4's accuracy fell from about 84 percent in March to about 51 percent in June. The often-quoted drop from 97.6 percent to 2.4 percent comes from the first version of the paper, whose test set contained only prime numbers. GPT-4's refusals on sensitive questions became dramatically more terse. Both models' code-generation outputs became less directly executable, with more markdown formatting wrapped around the code. GPT-3.5 actually *improved* on primes in the same window.
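The "less directly executable" finding is easy to reproduce in miniature. The sketch below is not the paper's evaluation harness. It is a minimal illustration, assuming the model reply is a plain string and that "directly executable" is approximated by "parses as Python". The helper names (`extract_code`, `parses_as_python`) are hypothetical.

```python
import ast
import re

# Build the fence marker programmatically so this example can itself sit in a fenced block.
FENCE = "`" * 3
# Matches an opening fence with an optional language tag, then captures the body lazily.
FENCE_RE = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(reply: str) -> str:
    """Return the body of the first fenced block, or the reply unchanged if unfenced."""
    match = FENCE_RE.search(reply)
    return match.group(1) if match else reply


def parses_as_python(source: str) -> bool:
    """Rough proxy for 'directly executable': does the text parse as Python?"""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# Illustrative reply, not real model output.
fenced_reply = FENCE + "python\nprint('hello')\n" + FENCE

raw_ok = parses_as_python(fenced_reply)                     # fence markers break parsing
stripped_ok = parses_as_python(extract_code(fenced_reply))  # same code, fence removed
print(raw_ok, stripped_ok)
```

Scoring the raw reply and the stripped reply side by side separates "the code got worse" from "the wrapper changed", which is the distinction the critiques below turn on.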

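Critiques (notably from Princeton's Arvind Narayanan and Sayash Kapoor) argued that the results showed a change in behavior rather than a loss of capability. The original prime test set contained only primes, so a model that shifted from mostly answering "prime" to mostly answering "not prime" looked as if its reasoning had collapsed, and the code-generation drop largely reflected markdown formatting rather than worse code. The broader takeaway holds: same model name, materially different behavior, no version bump. That's exactly the failure mode ModelWatch is built to catch in real time.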
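One simple way to flag this kind of shift is to rerun a fixed prompt set against the same model name on two dates and compare pass rates. The sketch below uses a pooled two-proportion z-test. It is an illustration under stated assumptions, not a description of how ModelWatch or the paper's authors measure drift. The function names, the threshold of 3.0, and the counts are all hypothetical, with the counts chosen only to roughly mirror the percentages quoted above.

```python
import math


def two_proportion_z(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """Pooled two-proportion z-statistic for the pass rates of two runs."""
    p_a = pass_a / n_a
    p_b = pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both runs passed everything or failed everything: no measurable difference.
        return 0.0
    return (p_a - p_b) / se


def drifted(pass_a: int, n_a: int, pass_b: int, n_b: int, threshold: float = 3.0) -> bool:
    """Flag drift when the z-statistic exceeds a (hypothetical) alert threshold."""
    return abs(two_proportion_z(pass_a, n_a, pass_b, n_b)) >= threshold


# Illustrative counts only: 840/1000 passes in one run, 510/1000 in a later run.
z = two_proportion_z(840, 1000, 510, 1000)
print(round(z, 2), drifted(840, 1000, 510, 1000))
```

A statistic like this only says that behavior changed on the chosen prompts. As the critiques show, deciding whether the change is a capability loss or a formatting shift still depends on how the test set and the scoring are constructed.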