ModelWatch

Latency monitoring for LLM endpoints

Four metrics matter.

(1) Time-to-first-token (TTFT): the most important metric for streaming UIs. It captures queueing delay plus initial inference. Frontier APIs typically run 200–800 ms TTFT at p50 and 1.5–4 s at p95.

(2) Tokens per second (TPS) during streaming: generation throughput once output has started. Typical values are 50–150 TPS on frontier APIs and 200–500+ TPS on smaller models or hardware-optimized providers (Groq, Cerebras).

(3) End-to-end latency for non-streaming calls: roughly TTFT + (output_tokens / TPS). Watch p50, p95, and p99; p99 captures the tail latency that the median misses.

(4) Error rate by status code: 429 (rate limit), 500/502/503 (provider issues), 408 (timeout).
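As a rough illustration of how these numbers can be derived from a single streamed response, here is a minimal sketch. The helper names (`measure_stream`, `percentile`, `estimate_end_to_end`) are hypothetical and not part of any provider SDK. The sketch assumes the stream is a lazy iterable that yields one token per item (real APIs often yield multi-token chunks, so count tokens separately if you need exact TPS), and it uses a simple nearest-rank percentile.

```python
import math
import time
from typing import Callable, Iterable, Sequence, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttft_seconds, tokens_per_second, token_count) for one stream.

    Assumes `chunks` is lazy, so the request effectively starts when
    iteration starts, and that each item is one token.
    """
    start = clock()
    first = None
    last = start
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now  # arrival time of the first token
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no chunks")
    ttft = first - start
    gen_time = last - first
    # Throughput is measured over the tokens that arrive after the first one.
    tps = (count - 1) / gen_time if gen_time > 0 else float("nan")
    return ttft, tps, count


def percentile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank percentile: q=50 gives p50, q=99 gives p99."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def estimate_end_to_end(ttft: float, output_tokens: int, tps: float) -> float:
    """Approximate non-streaming latency as TTFT + output_tokens / TPS."""
    return ttft + output_tokens / tps
```

Collect one `ttft` and `tps` value per request, then call `percentile` on the collected lists to get p50, p95, and p99. With small sample counts the p99 is dominated by the single worst request, so it only becomes meaningful once you have a few hundred samples.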

Two collection caveats. First, run from a fixed region. If you monitor from your own production infra, regional network blips show up as "provider latency" issues even though the provider is not at fault. Use a dedicated monitoring location. Second, time of day matters. Most providers see noticeable peak-hour latency (US business hours, plus the overlap around 9pm EU / 12pm US). Schedule monitoring runs at fixed hours so you're comparing apples to apples, or sample uniformly across 24h and aggregate by hour of day.
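The hour-of-day aggregation can be sketched as follows. This is a minimal example, assuming samples arrive as `(unix_timestamp_seconds, latency_seconds)` pairs and that bucketing in UTC is acceptable; the function name `median_by_hour` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import median
from typing import Dict, Iterable, Tuple


def median_by_hour(samples: Iterable[Tuple[float, float]]) -> Dict[int, float]:
    """Group latency samples by UTC hour (0-23) and return the median per hour."""
    buckets = defaultdict(list)
    for ts, latency in samples:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        buckets[hour].append(latency)
    # Sort by hour so the result reads as a 24-hour profile.
    return {hour: median(vals) for hour, vals in sorted(buckets.items())}
```

Comparing today's 14:00 bucket against last week's 14:00 bucket avoids mistaking an ordinary peak-hour slowdown for a regression.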

Helicone and similar proxies capture this for *your traffic*. For provider-side monitoring independent of your app, ModelWatch runs a fixed eval suite from a fixed region at fixed times, tracks TTFT, TPS, and p50/p95/p99 latency per snapshot, and alerts on sustained shifts (not single outlier spikes).
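One simple way to tell a sustained shift from a single spike is to require several consecutive snapshots above a threshold. The sketch below illustrates that idea only; it is not ModelWatch's actual alerting logic. The function name, the 1.5x ratio, and the three-snapshot window are all assumptions chosen for illustration.

```python
from typing import Sequence


def sustained_shift(
    values: Sequence[float],
    baseline: float,
    threshold_ratio: float = 1.5,  # assumed: alert at 1.5x baseline
    min_consecutive: int = 3,      # assumed: require 3 snapshots in a row
) -> bool:
    """Return True if the most recent `min_consecutive` snapshots all exceed
    baseline * threshold_ratio. A single outlier never triggers."""
    if len(values) < min_consecutive:
        return False
    recent = values[-min_consecutive:]
    limit = baseline * threshold_ratio
    return all(v > limit for v in recent)
```

For example, with a p95 TTFT baseline of 0.4 s, the series `[0.4, 2.0, 0.4, 0.4]` does not alert (one spike), while `[0.4, 0.7, 0.8, 0.9]` does (three snapshots in a row above 0.6 s).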