Has GPT-5 gotten worse since launch?

Question

Accepted Answer

The honest answer requires dated eval data on dated snapshots. The catch with any "has model X gotten worse since launch" question is that it conflates three different phenomena:

**First**, explicit snapshot upgrades: OpenAI ships new dated versions (e.g., a hypothetical `gpt-5-2025-08-15` followed by `gpt-5-2025-11-01`) where the deltas are intended and documented. These are usually improvements on average, but some slices almost always regress.

**Second**, silent serving drift on the `gpt-5` alias: the alias is re-pointed and behavior changes overnight without a new snapshot string.

**Third**, user-side distribution drift: your prompts changed, your team grew, your task surface expanded.

What independent data sources show for frontier OpenAI models in general: launch-day evals tend to be optimistically reported, and 2–6 weeks post-launch the model is typically retuned for cost/latency, often with measurable regressions on some slices while average performance stays flat. Aider's coding leaderboard, LiveCodeBench, Artificial Analysis, and LMSYS Chatbot Arena all routinely show score deltas of 2–5 points within months of launch on the same alias.

The only way to answer the question for your workload is to have a frozen eval set with daily runs going back to the launch date. ModelWatch maintains exactly that: every tracked frontier model gets a daily golden-prompt suite run from launch day onward, plotted as a time series at modelwatch.app/gpt-5 (with equivalent URLs for Claude, Gemini, and Llama).
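If you want to run the same kind of check on your own workload, here is a minimal sketch of the comparison step, assuming you already store one pass/fail record per golden prompt per day. Everything in it is hypothetical: the `EvalResult` record, the slice names, the dates, and the synthetic outcomes are illustrative only, not real GPT-5 measurements and not ModelWatch's implementation. It shows how a per-slice regression can hide behind a flat average.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    run_date: str    # ISO date of the eval run (hypothetical dates below)
    model: str       # model string that was requested (alias or dated snapshot)
    slice_name: str  # hypothetical task slice, e.g. "sql" or "refactor"
    passed: bool     # whether the golden prompt's check passed


def overall_pass_rate(results, run_date):
    """Fraction of all golden prompts that passed on one run date."""
    flags = [r.passed for r in results if r.run_date == run_date]
    return sum(flags) / len(flags)


def slice_pass_rates(results, run_date):
    """Per-slice pass rate for one run date."""
    by_slice = {}
    for r in results:
        if r.run_date == run_date:
            by_slice.setdefault(r.slice_name, []).append(r.passed)
    return {name: sum(flags) / len(flags) for name, flags in by_slice.items()}


def slice_regressions(results, baseline_date, current_date, threshold=0.05):
    """Slices whose pass rate dropped by more than `threshold` since baseline."""
    base = slice_pass_rates(results, baseline_date)
    cur = slice_pass_rates(results, current_date)
    return {
        name: (base[name], cur[name])
        for name in base
        if name in cur and base[name] - cur[name] > threshold
    }


def make_run(run_date, model, outcomes):
    """Expand {slice: [bool, ...]} into a flat list of EvalResult records."""
    return [
        EvalResult(run_date, model, slice_name, passed)
        for slice_name, flags in outcomes.items()
        for passed in flags
    ]


# Synthetic data: same frozen prompt set, same alias, two hypothetical dates.
RESULTS = (
    make_run("2025-08-15", "gpt-5", {
        "sql": [True, True, True, False],
        "refactor": [True, True, True, False],
    })
    + make_run("2025-09-15", "gpt-5", {
        "sql": [True, True, True, True],
        "refactor": [True, True, False, False],
    })
)

if __name__ == "__main__":
    print("baseline overall:", overall_pass_rate(RESULTS, "2025-08-15"))
    print("current overall: ", overall_pass_rate(RESULTS, "2025-09-15"))
    print("regressed slices:", slice_regressions(RESULTS, "2025-08-15", "2025-09-15"))
```

In this synthetic example the overall pass rate is identical on both dates, yet the `refactor` slice drops, which is the pattern an average-only dashboard would miss. Running the same frozen set against both the alias and a pinned dated snapshot would additionally help separate silent alias drift from changes on your side.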