ModelWatch

Cost monitoring for LLM APIs over time

Three layers of cost surveillance. (1) Per-call accounting. Log input tokens, output tokens, cached tokens (if applicable), model snapshot, provider, and cost per call. Most providers return token counts in the response; otherwise tokenize locally with tiktoken (OpenAI) or the provider's tokenizer. Multiply by the current published per-1K rate. (2) Aggregate trend monitoring. Daily and weekly totals broken down by snapshot. Watch for: average tokens per call drifting up (the model got more verbose), cache-hit rate drifting down (your prompts changed or provider tightened caching), per-request cost on a fixed eval suite drifting up (provider raised prices or tokenizer changed).

(3) Provider-side benchmark. Run a fixed eval suite against each snapshot daily and track cost-per-suite-run as a metric. This isolates provider-side changes from your-side changes. Anthropic and OpenAI have both adjusted prices mid-snapshot historically — gpt-4o got cheaper in Aug 2024; Claude prompt caching was added and altered pricing math. Without a fixed baseline you cannot detect these.

Tokenizer changes are a sneaky cost issue. Claude 3 and Claude 3.5 use slightly different tokenization; same English text can tokenize to different counts, affecting both cost (per-token billing) and context-window usage. ModelWatch tracks per-snapshot cost-per-call on its standard eval suite and alerts when the unit cost moves >5 percent week-over-week.