Latency monitoring for LLM endpoints

Question

Accepted Answer

Four metrics matter.

**(1) Time-to-first-token (TTFT).** The most important metric for streaming UIs. It captures queueing delay plus initial inference. Frontier APIs typically run 200–800 ms TTFT at p50 and 1.5–4 s at p95.

**(2) Tokens-per-second (TPS) during streaming.** Captures generation throughput once output has started. Typical values: 50–150 TPS on frontier APIs, 200–500+ TPS on smaller models or hardware-optimized providers (Groq, Cerebras).

**(3) End-to-end latency for non-streaming calls.** Approximately TTFT + (output_tokens / TPS). Watch p50, p95, and p99; p99 captures tail latency that the median misses.

**(4) Error rate by status code.** 429 (rate limit), 500/502/503 (provider issues), 408 (timeout).

Two collection caveats. First, **run from a fixed region**: if you monitor from your own production infra, regional network blips show up as "provider latency" issues when they are really network issues on your side. Use a dedicated monitoring location. Second, **time-of-day matters**: most providers see noticeable peak-hour latency (US business hours and the 9pm EU/12pm US overlap). Schedule monitoring runs at fixed hours so you're comparing apples to apples, or sample uniformly across 24h and aggregate by hour-of-day.

Helicone and similar proxies capture this for *your traffic*. For provider-side monitoring independent of your app, ModelWatch runs a fixed eval suite from a fixed region at fixed times, tracking TTFT, TPS, and p50/p95/p99 per snapshot, and alerts on sustained shifts (not single outlier spikes).
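If you want to collect the latency metrics above yourself, here is a minimal sketch of the arithmetic. It is not any provider's or tool's implementation. The names `StreamTiming`, `time_stream`, `estimate_e2e`, `percentile`, and `p95_by_hour` are made up for illustration, and it assumes each item yielded by the stream is one token (many SDKs yield chunks that can hold several tokens, so adjust the count accordingly).

```python
import math
import time
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class StreamTiming:
    ttft_s: float       # request start -> first token
    tps: float          # tokens per second after the first token
    total_s: float      # request start -> last token
    output_tokens: int


def time_stream(tokens: Iterable[str],
                clock: Callable[[], float] = time.perf_counter) -> StreamTiming:
    """Consume a token stream and record TTFT and streaming TPS.

    Assumption: one yielded item == one token.
    """
    start = clock()
    first = None
    last = start
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    gen_time = last - first
    # The first token is already accounted for by TTFT, so TPS is
    # measured over the remaining count - 1 tokens.
    tps = (count - 1) / gen_time if gen_time > 0 else float("nan")
    return StreamTiming(first - start, tps, last - start, count)


def estimate_e2e(ttft_s: float, output_tokens: int, tps: float) -> float:
    """End-to-end estimate from the answer: TTFT + output_tokens / TPS."""
    return ttft_s + output_tokens / tps


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile (one of several common definitions)."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def p95_by_hour(samples: List[Tuple[int, float]]) -> Dict[int, float]:
    """Group (hour_of_day, latency_s) samples by hour and take p95 of each."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for hour, latency in samples:
        buckets[hour].append(latency)
    return {hour: percentile(vals, 95) for hour, vals in buckets.items()}
```

In practice `tokens` would be whatever iterator your provider's streaming client returns, and the default `time.perf_counter` clock is a reasonable choice because it is monotonic. The injectable `clock` is only there so the timing logic can be tested without a real network call.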