Salvaged from PR #1563 by @kshitijk4poor. Cherry-picked with authorship preserved. - Route-aware pricing architecture replacing static MODEL_PRICING + heuristics - Canonical usage normalization (Anthropic/OpenAI/Codex API shapes) - Cache-aware billing (separate cache_read/cache_write rates) - Cost status tracking (estimated/included/unknown/actual) - OpenRouter live pricing via models API - Schema migration v4→v5 with billing metadata columns - Removed speculative forward-looking entries - Removed cost display from CLI status bar - Threaded OpenRouter metadata pre-warm Co-authored-by: kshitij <82637225+kshitijk4poor@users.noreply.github.com>
# Pricing Accuracy Architecture

Date: 2026-03-16

## Goal

Hermes should only show dollar costs when they are backed by an official source for the user's actual billing path.

This design replaces the current static, heuristic pricing flow in:

- `run_agent.py`
- `agent/usage_pricing.py`
- `agent/insights.py`
- `cli.py`

with a provider-aware pricing system that:

- handles cache billing correctly
- distinguishes `actual` vs `estimated` vs `included` vs `unknown`
- reconciles post-hoc costs when providers expose authoritative billing data
- supports direct providers, OpenRouter, subscriptions, enterprise pricing, and custom endpoints

## Problems In The Current Design

Current Hermes behavior has four structural issues:

1. It stores only `prompt_tokens` and `completion_tokens`, which is insufficient for providers that bill cache reads and cache writes separately.
2. It uses a static model price table and fuzzy heuristics, which can drift from current official pricing.
3. It assumes public API list pricing matches the user's real billing path.
4. It has no distinction between live estimates and reconciled billed cost.

## Design Principles

1. Normalize usage before pricing.
2. Never fold cached tokens into plain input cost.
3. Track certainty explicitly.
4. Treat the billing path as part of the model identity.
5. Prefer official machine-readable sources over scraped docs.
6. Use post-hoc provider cost APIs when available.
7. Show `n/a` rather than inventing precision.

## High-Level Architecture

The new system has four layers:

1. `usage_normalization`
   Converts raw provider usage into a canonical usage record.
2. `pricing_source_resolution`
   Determines the billing path, source of truth, and applicable pricing source.
3. `cost_estimation_and_reconciliation`
   Produces an immediate estimate when possible, then replaces or annotates it with actual billed cost later.
4. `presentation`
   `/usage`, `/insights`, and the status bar display cost with certainty metadata.

## Canonical Usage Record

Add a canonical usage model that every provider path maps into before any pricing math happens.

Suggested structure:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class CanonicalUsage:
    # Identity of the call and the path it is billed through.
    provider: str
    billing_provider: str
    model: str
    billing_route: str

    # Token buckets; cache buckets are never merged into input_tokens.
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0
    reasoning_tokens: int = 0
    request_count: int = 1

    # Provenance of each value.
    raw_usage: dict[str, Any] | None = None
    raw_usage_fields: dict[str, str] | None = None
    computed_fields: set[str] | None = None

    # Ids used later for reconciliation.
    provider_request_id: str | None = None
    provider_generation_id: str | None = None
    provider_response_id: str | None = None
```

Rules:

- `input_tokens` means non-cached input only.
- `cache_read_tokens` and `cache_write_tokens` are never merged into `input_tokens`.
- `output_tokens` excludes cache metrics.
- `reasoning_tokens` is telemetry unless a provider officially bills it separately.

This is the same normalization pattern used by `opencode`, extended with provenance and reconciliation ids.

## Provider Normalization Rules

### OpenAI Direct

Source usage fields:

- `prompt_tokens`
- `completion_tokens`
- `prompt_tokens_details.cached_tokens`

Normalization:

- `cache_read_tokens = cached_tokens`
- `input_tokens = prompt_tokens - cached_tokens`
- `cache_write_tokens = 0` unless OpenAI exposes it in the relevant route
- `output_tokens = completion_tokens`

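The OpenAI rules above can be sketched as a small function. This is a minimal illustration, not the Hermes implementation: `NormalizedTokens` and `normalize_openai_usage` are hypothetical names, the usage payload is assumed to arrive as a plain dict, and the clamp on `cached_tokens` is an added safety assumption rather than something the design specifies.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class NormalizedTokens:
    """Hypothetical subset of CanonicalUsage holding only the token buckets."""
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0


def normalize_openai_usage(usage: dict[str, Any]) -> NormalizedTokens:
    """Split an OpenAI-style usage payload into canonical buckets."""
    prompt = int(usage.get("prompt_tokens") or 0)
    completion = int(usage.get("completion_tokens") or 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = int(details.get("cached_tokens") or 0)
    # Assumption: clamp so a malformed payload cannot yield negative input.
    cached = min(cached, prompt)
    return NormalizedTokens(
        input_tokens=prompt - cached,   # non-cached input only
        output_tokens=completion,
        cache_read_tokens=cached,
        cache_write_tokens=0,           # not exposed on this route
    )
```

With `prompt_tokens=1000` and `cached_tokens=600`, only 400 tokens land in `input_tokens`, so the cached portion can be priced at its own rate.
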
### Anthropic Direct

Source usage fields:

- `input_tokens`
- `output_tokens`
- `cache_read_input_tokens`
- `cache_creation_input_tokens`

Normalization:

- `input_tokens = input_tokens`
- `output_tokens = output_tokens`
- `cache_read_tokens = cache_read_input_tokens`
- `cache_write_tokens = cache_creation_input_tokens`

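Because the Anthropic fields listed above already separate cache traffic from plain input, the mapping is a rename with no subtraction. A minimal sketch, assuming the usage payload is available as a dict; `normalize_anthropic_usage` is a hypothetical helper name:

```python
from typing import Any


def normalize_anthropic_usage(usage: dict[str, Any]) -> dict[str, int]:
    """Rename Anthropic usage fields to the canonical bucket names."""
    return {
        "input_tokens": int(usage.get("input_tokens") or 0),
        "output_tokens": int(usage.get("output_tokens") or 0),
        "cache_read_tokens": int(usage.get("cache_read_input_tokens") or 0),
        "cache_write_tokens": int(usage.get("cache_creation_input_tokens") or 0),
    }
```

Missing or null fields are treated as zero here, which is an assumption of the sketch.
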
### OpenRouter

Estimate-time usage normalization should use the response usage payload with the same rules as the underlying provider when possible.

Reconciliation-time records should also store:

- OpenRouter generation id
- native token fields when available
- `total_cost`
- `cache_discount`
- `upstream_inference_cost`
- `is_byok`

### Gemini / Vertex

Use official Gemini or Vertex usage fields where available.

If cached content tokens are exposed:

- map them to `cache_read_tokens`

If a route exposes no cache creation metric:

- store `cache_write_tokens = 0`
- preserve the raw usage payload for later extension

### DeepSeek And Other Direct Providers

Normalize only the fields that are officially exposed.

If a provider does not expose cache buckets:

- do not infer them unless the provider explicitly documents how to derive them

### Subscription / Included-Cost Routes

These still use the canonical usage model.

Tokens are tracked normally. Cost depends on billing mode, not on whether usage exists.

## Billing Route Model

Hermes must stop keying pricing solely by `model`.

Introduce a billing route descriptor:

```python
from dataclasses import dataclass


@dataclass
class BillingRoute:
    provider: str
    base_url: str | None
    model: str
    billing_mode: str  # one of the `billing_mode` values listed below
    organization_hint: str | None = None
```

`billing_mode` values:

- `official_cost_api`
- `official_generation_api`
- `official_models_api`
- `official_docs_snapshot`
- `subscription_included`
- `user_override`
- `custom_contract`
- `unknown`

Examples:

- OpenAI direct API with Costs API access: `official_cost_api`
- Anthropic direct API with Usage & Cost API access: `official_cost_api`
- OpenRouter request before reconciliation: `official_models_api`
- OpenRouter request after generation lookup: `official_generation_api`
- GitHub Copilot style subscription route: `subscription_included`
- local OpenAI-compatible server: `unknown`
- enterprise contract with configured rates: `custom_contract`

## Cost Status Model

Every displayed cost should have:

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Literal


@dataclass
class CostResult:
    amount_usd: Decimal | None  # None when no dollar figure can be shown
    status: Literal["actual", "estimated", "included", "unknown"]
    source: Literal[
        "provider_cost_api",
        "provider_generation_api",
        "provider_models_api",
        "official_docs_snapshot",
        "user_override",
        "custom_contract",
        "none",
    ]
    label: str
    fetched_at: datetime | None
    pricing_version: str | None
    notes: list[str]
```

Presentation rules:

- `actual`: show dollar amount as final
- `estimated`: show dollar amount with estimate labeling
- `included`: show `included` or `$0.00 (included)` depending on UX choice
- `unknown`: show `n/a`

## Official Source Hierarchy

Resolve cost using this order:

1. Request-level or account-level official billed cost
2. Official machine-readable model pricing
3. Official docs snapshot
4. User override or custom contract
5. Unknown

The system must never skip to a lower level if a higher-confidence source exists for the current billing route.

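One way to enforce the ordering above is to rank every candidate source and always take the best-ranked one. The sketch below is illustrative only: `SOURCE_RANK`, `Candidate`, and `pick_cost_source` are hypothetical names, and the mapping of `source` strings onto the five levels (for example, treating both the cost API and the generation API as level 1) is an assumption read off the list, not something the design fixes.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

# Lower rank = higher confidence, mirroring the order of the list above.
# Assumption: both billed-cost APIs sit at the top level.
SOURCE_RANK = {
    "provider_cost_api": 0,
    "provider_generation_api": 0,
    "provider_models_api": 1,
    "official_docs_snapshot": 2,
    "user_override": 3,
    "custom_contract": 3,
}


@dataclass
class Candidate:
    source: str
    amount_usd: Decimal


def pick_cost_source(candidates: list[Candidate]) -> Optional[Candidate]:
    """Return the highest-confidence candidate, or None for level 5 (unknown)."""
    ranked = [c for c in candidates if c.source in SOURCE_RANK]
    if not ranked:
        return None
    return min(ranked, key=lambda c: SOURCE_RANK[c.source])
```

Given a docs-snapshot estimate and a generation-API cost for the same request, the function returns the generation-API candidate regardless of the order in which they were supplied.
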
## Provider-Specific Truth Rules

### OpenAI Direct

Preferred truth:

1. Costs API for reconciled spend
2. Official pricing page for live estimate

### Anthropic Direct

Preferred truth:

1. Usage & Cost API for reconciled spend
2. Official pricing docs for live estimate

### OpenRouter

Preferred truth:

1. `GET /api/v1/generation` for reconciled `total_cost`
2. `GET /api/v1/models` pricing for live estimate

Do not use underlying provider public pricing as the source of truth for OpenRouter billing.

### Gemini / Vertex

Preferred truth:

1. official billing export or billing API for reconciled spend when available for the route
2. official pricing docs for estimate

### DeepSeek

Preferred truth:

1. official machine-readable cost source if available in the future
2. official pricing docs snapshot today

### Subscription-Included Routes

Preferred truth:

1. explicit route config marking the model as included in subscription

These should display `included`, not an API list-price estimate.

### Custom Endpoint / Local Model

Preferred truth:

1. user override
2. custom contract config
3. unknown

These should default to `unknown`.

## Pricing Catalog

Replace the current `MODEL_PRICING` dict with a richer pricing catalog.

Suggested record:

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal


@dataclass
class PricingEntry:
    provider: str
    route_pattern: str
    model_pattern: str

    # Rates are optional: a source may not publish every bucket.
    input_cost_per_million: Decimal | None = None
    output_cost_per_million: Decimal | None = None
    cache_read_cost_per_million: Decimal | None = None
    cache_write_cost_per_million: Decimal | None = None
    request_cost: Decimal | None = None
    image_cost: Decimal | None = None

    # Where the rates came from and when they were fetched.
    source: str = "official_docs_snapshot"
    source_url: str | None = None
    fetched_at: datetime | None = None
    pricing_version: str | None = None
```

The catalog should be route-aware:

- `openai:gpt-5`
- `anthropic:claude-opus-4-6`
- `openrouter:anthropic/claude-opus-4.6`
- `copilot:gpt-4o`

This avoids conflating direct-provider billing with aggregator billing.

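To illustrate why the route key matters, the sketch below prices the same token counts under two route keys with separate cache rates. Everything here is hypothetical: `Rates`, `CATALOG`, and `estimate_usd` are illustrative names, and the per-million rates are placeholder numbers chosen for easy arithmetic, not real prices from any provider.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

MILLION = Decimal(1_000_000)


@dataclass(frozen=True)
class Rates:
    """USD per million tokens for each bucket (placeholder values below)."""
    input: Decimal
    output: Decimal
    cache_read: Decimal
    cache_write: Decimal


# Placeholder rates only; real entries would come from a synced catalog.
CATALOG = {
    "anthropic:claude-opus-4-6": Rates(
        Decimal("10"), Decimal("50"), Decimal("1"), Decimal("12.5")
    ),
    "openrouter:anthropic/claude-opus-4.6": Rates(
        Decimal("11"), Decimal("55"), Decimal("1.1"), Decimal("13.75")
    ),
}


def estimate_usd(
    route_key: str,
    input_tokens: int,
    output_tokens: int,
    cache_read_tokens: int = 0,
    cache_write_tokens: int = 0,
) -> Optional[Decimal]:
    """Price each bucket at its own rate; None means no catalog entry."""
    rates = CATALOG.get(route_key)
    if rates is None:
        return None
    total = (
        input_tokens * rates.input
        + output_tokens * rates.output
        + cache_read_tokens * rates.cache_read
        + cache_write_tokens * rates.cache_write
    )
    return total / MILLION
```

A route with no entry returns `None` instead of falling back to a similar-looking model, which matches the principle of showing `n/a` rather than inventing precision.
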
## Pricing Sync Architecture

Introduce a pricing sync subsystem instead of manually maintaining a single hardcoded table.

Suggested modules:

- `agent/pricing/catalog.py`
- `agent/pricing/sources.py`
- `agent/pricing/sync.py`
- `agent/pricing/reconcile.py`
- `agent/pricing/types.py`

### Sync Sources

- OpenRouter models API
- official provider docs snapshots where no API exists
- user overrides from config

### Sync Output

Cache pricing entries locally with:

- source URL
- fetch timestamp
- version/hash
- confidence/source type

### Sync Frequency

- startup warm cache
- background refresh every 6 to 24 hours depending on source
- manual `hermes pricing sync`

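The background refresh rule can be reduced to a staleness check on the cached entry's fetch timestamp. A minimal sketch, assuming timestamps are timezone-aware UTC datetimes; `needs_refresh` is a hypothetical helper and the interval is passed in by the caller, since the design only gives a 6 to 24 hour range:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def needs_refresh(
    fetched_at: Optional[datetime],
    interval_hours: float,
    now: Optional[datetime] = None,
) -> bool:
    """True when a cached pricing entry is missing or older than the interval."""
    if fetched_at is None:
        return True  # never fetched: always refresh
    current = now if now is not None else datetime.now(timezone.utc)
    return current - fetched_at >= timedelta(hours=interval_hours)
```

Accepting `now` as a parameter keeps the check deterministic in tests.
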
## Reconciliation Architecture

Live requests may produce only an estimate initially. Hermes should reconcile them later when a provider exposes actual billed cost.

Suggested flow:

1. Agent call completes.
2. Hermes stores canonical usage plus reconciliation ids.
3. Hermes computes an immediate estimate if a pricing source exists.
4. A reconciliation worker fetches actual cost when supported.
5. Session and message records are updated with `actual` cost.

This can run:

- inline for cheap lookups
- asynchronously for delayed provider accounting

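The flow above amounts to a small state transition on each stored record. The sketch below is a simplified illustration under stated assumptions: `CostRecord`, `record_estimate`, and `reconcile` are hypothetical names, and `fetch_actual` stands in for whatever provider lookup is available (it returns `None` when no billed cost exists yet).

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, Optional


@dataclass
class CostRecord:
    request_id: str
    estimated_cost_usd: Optional[Decimal] = None
    actual_cost_usd: Optional[Decimal] = None
    cost_status: str = "unknown"


def record_estimate(record: CostRecord, estimate: Optional[Decimal]) -> None:
    """Step 3: attach an estimate, never downgrading an already-actual record."""
    if estimate is not None and record.cost_status != "actual":
        record.estimated_cost_usd = estimate
        record.cost_status = "estimated"


def reconcile(
    record: CostRecord,
    fetch_actual: Callable[[str], Optional[Decimal]],
) -> bool:
    """Steps 4-5: replace the status with `actual` when the provider has a cost."""
    actual = fetch_actual(record.request_id)
    if actual is None:
        return False  # leave the estimate in place and retry later
    record.actual_cost_usd = actual
    record.cost_status = "actual"
    return True
```

The estimate is kept alongside the actual cost rather than overwritten, so both can be displayed or compared later.
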
## Persistence Changes

Session storage should stop storing only aggregate prompt/completion totals.

Add fields for both usage and cost certainty:

- `input_tokens`
- `output_tokens`
- `cache_read_tokens`
- `cache_write_tokens`
- `reasoning_tokens`
- `estimated_cost_usd`
- `actual_cost_usd`
- `cost_status`
- `cost_source`
- `pricing_version`
- `billing_provider`
- `billing_mode`

If schema expansion is too large for one PR, add a new pricing events table:

```text
session_cost_events
  id
  session_id
  request_id
  provider
  model
  billing_mode
  input_tokens
  output_tokens
  cache_read_tokens
  cache_write_tokens
  estimated_cost_usd
  actual_cost_usd
  cost_status
  cost_source
  pricing_version
  created_at
  updated_at
```

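As a concrete illustration of the events table, the sketch below creates it with Python's standard `sqlite3` module. The storage engine, column types, defaults, and the choice to store dollar amounts as text are all assumptions of this example; the design only lists the column names.

```python
import sqlite3

# Assumed types and defaults; only the column names come from the design.
SCHEMA = """
CREATE TABLE IF NOT EXISTS session_cost_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT NOT NULL,
    request_id TEXT,
    provider TEXT NOT NULL,
    model TEXT NOT NULL,
    billing_mode TEXT NOT NULL DEFAULT 'unknown',
    input_tokens INTEGER NOT NULL DEFAULT 0,
    output_tokens INTEGER NOT NULL DEFAULT 0,
    cache_read_tokens INTEGER NOT NULL DEFAULT 0,
    cache_write_tokens INTEGER NOT NULL DEFAULT 0,
    estimated_cost_usd TEXT,  -- decimal kept as text to avoid float rounding
    actual_cost_usd TEXT,
    cost_status TEXT NOT NULL DEFAULT 'unknown',
    cost_source TEXT NOT NULL DEFAULT 'none',
    pricing_version TEXT,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def create_cost_events_table(conn: sqlite3.Connection) -> None:
    """Create the events table if it does not exist yet."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_cost_events_table(conn)
conn.execute(
    "INSERT INTO session_cost_events (session_id, provider, model) VALUES (?, ?, ?)",
    ("s1", "openrouter", "anthropic/claude-opus-4.6"),
)
row = conn.execute(
    "SELECT cost_status, cache_read_tokens, actual_cost_usd FROM session_cost_events"
).fetchone()
print(row)  # ('unknown', 0, None)
```

A freshly inserted event starts as `unknown` with no actual cost, which leaves room for the estimate and reconciliation steps to fill it in.
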
## Hermes Touchpoints

### `run_agent.py`

Current responsibility:

- parse raw provider usage
- update session token counters

New responsibility:

- build `CanonicalUsage`
- update canonical counters
- store reconciliation ids
- emit usage event to pricing subsystem

### `agent/usage_pricing.py`

Current responsibility:

- static lookup table
- direct cost arithmetic

New responsibility:

- move or replace with pricing catalog facade
- no fuzzy model-family heuristics
- no direct pricing without billing-route context

### `cli.py`

Current responsibility:

- compute session cost directly from prompt/completion totals

New responsibility:

- display `CostResult`
- show status badges:
  - `actual`
  - `estimated`
  - `included`
  - `n/a`

### `agent/insights.py`

Current responsibility:

- recompute historical estimates from static pricing

New responsibility:

- aggregate stored pricing events
- prefer actual cost over estimate
- surface estimates only when reconciliation is unavailable

## UX Rules

### Status Bar

Show one of:

- `$1.42`
- `~$1.42`
- `included`
- `cost n/a`

Where:

- `$1.42` means `actual`
- `~$1.42` means `estimated`
- `included` means subscription-backed or explicitly zero-cost route
- `cost n/a` means unknown

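The four status bar forms map directly onto the cost status. A minimal sketch, assuming the status and amount are passed as separate values; `format_status_bar_cost` is a hypothetical function name, and treating a missing amount as `cost n/a` is an assumption of the example:

```python
from decimal import Decimal
from typing import Optional


def format_status_bar_cost(status: str, amount_usd: Optional[Decimal] = None) -> str:
    """Render a cost for the status bar according to its certainty."""
    if status == "included":
        return "included"
    if status in ("actual", "estimated") and amount_usd is not None:
        text = f"${amount_usd:.2f}"
        # Estimates carry a leading tilde; actual costs are shown bare.
        return text if status == "actual" else f"~{text}"
    # Unknown status, or a status with no amount attached.
    return "cost n/a"
```

For example, `format_status_bar_cost("estimated", Decimal("1.42"))` yields `~$1.42`, while any unknown status yields `cost n/a`.
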
### `/usage`

Show:

- token buckets
- estimated cost
- actual cost if available
- cost status
- pricing source

### `/insights`

Aggregate:

- actual cost totals
- estimated-only totals
- unknown-cost sessions count
- included-cost sessions count

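The `/insights` aggregation can be sketched as a single pass that keeps actual and estimated-only dollars in separate totals and counts the rest. This is an illustration under assumptions: each record is taken to be a dict with `cost_status`, `actual_cost_usd`, and `estimated_cost_usd` keys (one record per session), and `summarize_costs` is a hypothetical name.

```python
from decimal import Decimal
from typing import Any


def summarize_costs(records: list[dict[str, Any]]) -> dict[str, Any]:
    """Aggregate per-session cost records without mixing certainty levels."""
    actual_total = Decimal("0")
    estimated_only_total = Decimal("0")
    included_count = 0
    unknown_count = 0
    for rec in records:
        status = rec.get("cost_status", "unknown")
        if status == "actual" and rec.get("actual_cost_usd") is not None:
            actual_total += rec["actual_cost_usd"]
        elif status == "estimated" and rec.get("estimated_cost_usd") is not None:
            estimated_only_total += rec["estimated_cost_usd"]
        elif status == "included":
            included_count += 1
        else:
            # Unknown status, or a status whose amount is missing.
            unknown_count += 1
    return {
        "actual_usd": actual_total,
        "estimated_only_usd": estimated_only_total,
        "included_count": included_count,
        "unknown_count": unknown_count,
    }
```

Keeping the two dollar totals apart lets the UI report reconciled spend without silently blending in estimates.
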
## Config And Overrides

Add user-configurable pricing overrides in config:

```yaml
pricing:
  mode: hybrid
  sync_on_startup: true
  sync_interval_hours: 12
  overrides:
    - provider: openrouter
      model: anthropic/claude-opus-4.6
      billing_mode: custom_contract
      input_cost_per_million: 4.25
      output_cost_per_million: 22.0
      cache_read_cost_per_million: 0.5
      cache_write_cost_per_million: 6.0
  included_routes:
    - provider: copilot
      model: "*"
    - provider: codex-subscription
      model: "*"
```

Overrides must win over catalog defaults for the matching billing route.

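A sketch of how the config above might be matched against a route before the catalog is consulted. The names `resolve_route_config` and `_matches` are hypothetical, the config is assumed to be already parsed into a dict, `"*"` is interpreted as a shell-style wildcard via `fnmatch`, and checking `included_routes` before `overrides` is an assumption of this example rather than a rule stated in the design.

```python
from fnmatch import fnmatchcase
from typing import Any, Optional


def _matches(rule: dict[str, Any], provider: str, model: str) -> bool:
    """Exact provider match plus wildcard-capable model match."""
    return rule.get("provider") == provider and fnmatchcase(
        model, rule.get("model", "*")
    )


def resolve_route_config(
    pricing_cfg: dict[str, Any], provider: str, model: str
) -> tuple[str, Optional[dict[str, Any]]]:
    """Return ("included" | "override" | "catalog", matching override or None)."""
    for rule in pricing_cfg.get("included_routes", []):
        if _matches(rule, provider, model):
            return "included", None
    for rule in pricing_cfg.get("overrides", []):
        if _matches(rule, provider, model):
            return "override", rule
    # No config rule applies: fall through to the synced catalog.
    return "catalog", None
```

A caller would use the returned override rates in place of the catalog entry for that route, and would skip pricing entirely for `included` routes.
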
## Rollout Plan

### Phase 1

- add canonical usage model
- split cache token buckets in `run_agent.py`
- stop pricing cache-inflated prompt totals
- preserve current UI with improved backend math

### Phase 2

- add route-aware pricing catalog
- integrate OpenRouter models API sync
- add `estimated` vs `included` vs `unknown`

### Phase 3

- add reconciliation for OpenRouter generation cost
- add actual cost persistence
- update `/insights` to prefer actual cost

### Phase 4

- add direct OpenAI and Anthropic reconciliation paths
- add user overrides and contract pricing
- add pricing sync CLI command

## Testing Strategy

Add tests for:

- OpenAI cached token subtraction
- Anthropic cache read/write separation
- OpenRouter estimated vs actual reconciliation
- subscription-backed models showing `included`
- custom endpoints showing `n/a`
- override precedence
- stale catalog fallback behavior

Current tests that assume heuristic pricing should be replaced with route-aware expectations.

## Non-Goals

- exact enterprise billing reconstruction without an official source or user override
- backfilling perfect historical cost for old sessions that lack cache bucket data
- scraping arbitrary provider web pages at request time

## Recommendation

Do not expand the existing `MODEL_PRICING` dict.

That path cannot satisfy the product requirement. Hermes should instead migrate to:

- canonical usage normalization
- route-aware pricing sources
- estimate-then-reconcile cost lifecycle
- explicit certainty states in the UI

This is the minimum architecture that makes the statement "Hermes pricing is backed by official sources where possible, and otherwise clearly labeled" defensible.