Port from cline/cline#10266.
When OpenAI-compatible proxies (OpenRouter, Vercel AI Gateway, Cline)
route Claude models, they sometimes surface the Anthropic-native cache
counters (`cache_read_input_tokens`, `cache_creation_input_tokens`) at
the top level of the `usage` object instead of nesting them inside
`prompt_tokens_details`. Our chat-completions branch of
`normalize_usage()` only read the nested `prompt_tokens_details` fields,
so those responses:
- reported `cache_write_tokens = 0` even when the model actually did a
prompt-cache write,
- reported only some of the cache-read tokens when the proxy exposed them
top-level only,
- overstated `input_tokens` by the missed cache-write amount, which in
turn made cost estimation and the status-bar cache-hit percentage wrong
for Claude traffic going through these gateways.
Now the chat-completions branch tries the OpenAI-standard
`prompt_tokens_details` first and falls back to the top-level
Anthropic-shape fields only if the nested values are absent/zero. The
Anthropic and Codex Responses branches are unchanged.
Regression guards added for three shapes: top-level write + nested read,
top-level-only, and both-present (nested wins).
Automated dead code audit using vulture + coverage.py + ast-grep intersection,
confirmed by Opus deep verification pass. Every symbol verified to have zero
production callers (test imports excluded from reachability analysis).
Removes ~1,534 lines of dead production code across 46 files and ~1,382 lines
of stale test code. 3 entire files deleted (agent/builtin_memory_provider.py,
hermes_cli/checklist.py, tests/hermes_cli/test_setup_model_selection.py).
Co-authored-by: alt-glitch <balyan.sid@gmail.com>
format_token_count_compact() used unconditional rstrip("0") to clean up
decimal trailing zeros (e.g. "1.50" → "1.5"), but this also stripped
meaningful trailing zeros from whole numbers ("260" → "26", "100" → "1").
Guard the strip behind a decimal-point check.
Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
* perf: cache base_url.lower() via property, consolidate triple load_config(), hoist set constant
run_agent.py:
- Add base_url property that auto-caches _base_url_lower on every
assignment, eliminating 12+ redundant .lower() calls per API cycle
across __init__, _build_api_kwargs, _supports_reasoning_extra_body,
and the main conversation loop
- Consolidate three separate load_config() disk reads in __init__
(memory, skills, compression) into a single call, reusing the
result dict for all three config sections
model_tools.py:
- Hoist _READ_SEARCH_TOOLS set to module level (was rebuilt inside
handle_function_call on every tool invocation)
* Use endpoint metadata for custom model context and pricing
---------
Co-authored-by: kshitij <82637225+kshitijk4poor@users.noreply.github.com>
Salvaged from PR #1104 by kshitijk4poor. Closes#683.
Adds a persistent status bar to the CLI showing model name, context
window usage with visual bar, estimated cost, and session duration.
Responsive layout degrades gracefully for narrow terminals.
Changes:
- agent/usage_pricing.py: shared pricing table, cost estimation with
Decimal arithmetic, duration/token formatting helpers
- agent/insights.py: refactored to reuse usage_pricing (eliminates
duplicate pricing table and formatting logic)
- cli.py: status bar with FormattedTextControl fragments, color-coded
context thresholds (green/yellow/orange/red), enhanced /usage with
cost breakdown, 1Hz idle refresh for status bar updates
- tests/test_cli_status_bar.py: status bar snapshot, width collapsing,
usage report with/without pricing, zero-priced model handling
- tests/test_insights.py: verify zero-priced providers show as unknown
Salvage fixes:
- Resolved conflict with voice status bar (both coexist in layout)
- Import _format_context_length from hermes_cli.banner (moved since PR)
Co-authored-by: kshitijk4poor <kshitijk4poor@users.noreply.github.com>