fix(usage): read top-level Anthropic cache fields from OAI-compatible proxies

Port from cline/cline#10266. When OpenAI-compatible proxies (OpenRouter, Vercel AI Gateway, Cline) route Claude models, they sometimes surface the Anthropic-native cache counters (`cache_read_input_tokens`, `cache_creation_input_tokens`) at the top level of the `usage` object instead of nesting them inside `prompt_tokens_details`. Our chat-completions branch of `normalize_usage()` only read the nested `prompt_tokens_details` fields, so those responses: - reported `cache_write_tokens = 0` even when the model actually did a prompt-cache write, - reported only some of the cache-read tokens when the proxy exposed them top-level only, - overstated `input_tokens` by the missed cache-write amount, which in turn made cost estimation and the status-bar cache-hit percentage wrong for Claude traffic going through these gateways. Now the chat-completions branch tries the OpenAI-standard `prompt_tokens_details` first and falls back to the top-level Anthropic-shape fields only if the nested values are absent/zero. The Anthropic and Codex Responses branches are unchanged. Regression guards added for three shapes: top-level write + nested read, top-level-only, and both-present (nested wins).
2026-04-25 00:51:20 +00:00 · 2026-04-22 17:03:35 -07:00 · 2026-04-22 17:03:35 -07:00 · b9463e32c6
commit b9463e32c6
parent 75221db967
2 changed files with 79 additions and 0 deletions
--- a/agent/usage_pricing.py
+++ b/agent/usage_pricing.py
@ -533,10 +533,22 @@ def normalize_usage(
        prompt_total = _to_int(getattr(response_usage, "prompt_tokens", 0))
        output_tokens = _to_int(getattr(response_usage, "completion_tokens", 0))
        details = getattr(response_usage, "prompt_tokens_details", None)
+        # Primary: OpenAI-style prompt_tokens_details. Fallback: Anthropic-style
+        # top-level fields that some OpenAI-compatible proxies (OpenRouter, Vercel
+        # AI Gateway, Cline) expose when routing Claude models — without this
+        # fallback, cache writes are undercounted as 0 and cache reads can be
+        # missed when the proxy only surfaces them at the top level.
+        # Port of cline/cline#10266.
        cache_read_tokens = _to_int(getattr(details, "cached_tokens", 0) if details else 0)
+        if not cache_read_tokens:
+            cache_read_tokens = _to_int(getattr(response_usage, "cache_read_input_tokens", 0))
        cache_write_tokens = _to_int(
            getattr(details, "cache_write_tokens", 0) if details else 0
        )
+        if not cache_write_tokens:
+            cache_write_tokens = _to_int(
+                getattr(response_usage, "cache_creation_input_tokens", 0)
+            )
        input_tokens = max(0, prompt_total - cache_read_tokens - cache_write_tokens)

    reasoning_tokens = 0