LLM cost and latency monitoring across providers
Cost and latency are the metrics most likely to drift unnoticed, because a regression in either raises no error; it just slowly erodes margin and UX. Track per snapshot: input tokens per call, output tokens per call, $/1K input tokens, $/1K output tokens, p50, p95, and p99 latency, time-to-first-token (TTFT) for streaming, and request error rate. Watch for three gotchas: tokenizer changes (GPT-4 and GPT-4o use different tokenizers, so the same prompt yields a different token count and therefore a different cost), price re-tiering (OpenAI cut gpt-4o pricing mid-2024), and serving-side latency regressions during peak hours, which are invisible if you only sample off-peak.
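As a rough illustration, the metrics listed above can be derived from raw call records along these lines. This is a minimal sketch, not any tool's actual implementation: `CallRecord`, `call_cost`, and `summarize` are hypothetical names, the percentile uses the simple nearest-rank method, and prices are passed in as $/1K tokens because providers change them.

```python
import math
from dataclasses import dataclass


@dataclass
class CallRecord:
    """One observed model call (hypothetical schema for this sketch)."""
    input_tokens: int
    output_tokens: int
    latency_s: float   # total wall-clock time for the call
    ttft_s: float      # time to first token, for streaming calls
    ok: bool           # False if the request errored


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def call_cost(rec, usd_per_1k_in, usd_per_1k_out):
    """Dollar cost of one call given $/1K input and $/1K output prices."""
    return (rec.input_tokens / 1000 * usd_per_1k_in
            + rec.output_tokens / 1000 * usd_per_1k_out)


def summarize(records, usd_per_1k_in, usd_per_1k_out):
    """Aggregate one snapshot's records into the tracked metrics."""
    if not records:
        raise ValueError("no records to summarize")
    n = len(records)
    latencies = [r.latency_s for r in records]
    return {
        "mean_input_tokens": sum(r.input_tokens for r in records) / n,
        "mean_output_tokens": sum(r.output_tokens for r in records) / n,
        "mean_cost_usd": sum(call_cost(r, usd_per_1k_in, usd_per_1k_out)
                             for r in records) / n,
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "p99_latency_s": percentile(latencies, 99),
        "p50_ttft_s": percentile([r.ttft_s for r in records], 50),
        "error_rate": sum(not r.ok for r in records) / n,
    }
```

Storing token counts separately from prices means a tokenizer change shows up in `mean_input_tokens` and a price change shows up only in `mean_cost_usd`, so the two gotchas can be told apart.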
Helicone and similar proxy tools handle this for *your* traffic. ModelWatch handles it for the *model itself*: because it runs a fixed eval suite from a fixed region, a latency move is a provider-side signal, not a "your data center moved" signal. Both layers are useful; they answer different questions.
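To make the fixed-suite idea concrete, here is a small sketch of comparing two runs of the same suite and flagging metrics that got worse. It is an assumption-laden illustration, not ModelWatch's actual method: `flag_regressions` is a hypothetical helper, the 20% threshold is arbitrary, and it assumes every metric passed in is one where higher means worse (latency, TTFT, cost, error rate).

```python
def relative_change(baseline, current):
    """Fractional change from baseline to current, e.g. 0.5 means +50%."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline


def flag_regressions(baseline, current, threshold=0.20):
    """Return metrics whose value rose by more than `threshold` (assumed default: 20%).

    `baseline` and `current` are dicts of metric name -> value from two runs
    of the same fixed suite. Metrics missing from `current`, or with a zero
    baseline, are skipped rather than guessed at.
    """
    flagged = {}
    for name, base_value in baseline.items():
        if name not in current or base_value == 0:
            continue
        change = relative_change(base_value, current[name])
        if change > threshold:
            flagged[name] = change
    return flagged
```

Because the prompts and the region are held constant between runs, a flagged metric here points at the provider side rather than at a change in your own traffic mix.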