hermes-agent/docs/plans/2026-03-16-pricing-accuracy-architecture-design.md

# Pricing Accuracy Architecture

Date: 2026-03-16

## Goal

Hermes should only show dollar costs when they are backed by an official source for the user's actual billing path.

This design replaces the current static, heuristic pricing flow in:

- `run_agent.py`
- `agent/usage_pricing.py`
- `agent/insights.py`
- `cli.py`

with a provider-aware pricing system that:

- handles cache billing correctly
- distinguishes `actual` vs `estimated` vs `included` vs `unknown`
- reconciles post-hoc costs when providers expose authoritative billing data
- supports direct providers, OpenRouter, subscriptions, enterprise pricing, and custom endpoints

## Problems In The Current Design

Current Hermes behavior has four structural issues:

1. It stores only `prompt_tokens` and `completion_tokens`, which is insufficient for providers that bill cache reads and cache writes separately.
2. It uses a static model price table and fuzzy heuristics, which can drift from current official pricing.
3. It assumes public API list pricing matches the user's real billing path.
4. It has no distinction between live estimates and reconciled billed cost.

## Design Principles

1. Normalize usage before pricing.
2. Never fold cached tokens into plain input cost.
3. Track certainty explicitly.
4. Treat the billing path as part of the model identity.
5. Prefer official machine-readable sources over scraped docs.
6. Use post-hoc provider cost APIs when available.
7. Show `n/a` rather than inventing precision.

## High-Level Architecture

The new system has four layers:

1. `usage_normalization`
   Converts raw provider usage into a canonical usage record.
2. `pricing_source_resolution`
   Determines the billing path, source of truth, and applicable pricing source.
3. `cost_estimation_and_reconciliation`
   Produces an immediate estimate when possible, then replaces or annotates it with actual billed cost later.
4. `presentation`
   `/usage`, `/insights`, and the status bar display cost with certainty metadata.

## Canonical Usage Record

Add a canonical usage model that every provider path maps into before any pricing math happens.

Suggested structure:

```python
@dataclass
class CanonicalUsage:
    provider: str
    billing_provider: str
    model: str
    billing_route: str

    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0
    reasoning_tokens: int = 0
    request_count: int = 1

    raw_usage: dict[str, Any] | None = None
    raw_usage_fields: dict[str, str] | None = None
    computed_fields: set[str] | None = None

    provider_request_id: str | None = None
    provider_generation_id: str | None = None
    provider_response_id: str | None = None
```

Rules:

- `input_tokens` means non-cached input only.
- `cache_read_tokens` and `cache_write_tokens` are never merged into `input_tokens`.
- `output_tokens` excludes cache metrics.
- `reasoning_tokens` is telemetry unless a provider officially bills it separately.

This is the same normalization pattern used by `opencode`, extended with provenance and reconciliation ids.

## Provider Normalization Rules

### OpenAI Direct

Source usage fields:

- `prompt_tokens`
- `completion_tokens`
- `prompt_tokens_details.cached_tokens`

Normalization:

- `cache_read_tokens = cached_tokens`
- `input_tokens = prompt_tokens - cached_tokens`
- `cache_write_tokens = 0` unless OpenAI exposes it in the relevant route
- `output_tokens = completion_tokens`

### Anthropic Direct

Source usage fields:

- `input_tokens`
- `output_tokens`
- `cache_read_input_tokens`
- `cache_creation_input_tokens`

Normalization:

- `input_tokens = input_tokens`
- `output_tokens = output_tokens`
- `cache_read_tokens = cache_read_input_tokens`
- `cache_write_tokens = cache_creation_input_tokens`

### OpenRouter

Estimate-time usage normalization should use the response usage payload with the same rules as the underlying provider when possible.

Reconciliation-time records should also store:

- OpenRouter generation id
- native token fields when available
- `total_cost`
- `cache_discount`
- `upstream_inference_cost`
- `is_byok`

### Gemini / Vertex

Use official Gemini or Vertex usage fields where available.

If cached content tokens are exposed:

- map them to `cache_read_tokens`

If a route exposes no cache creation metric:

- store `cache_write_tokens = 0`
- preserve the raw usage payload for later extension

### DeepSeek And Other Direct Providers

Normalize only the fields that are officially exposed.

If a provider does not expose cache buckets:

- do not infer them unless the provider explicitly documents how to derive them

### Subscription / Included-Cost Routes

These still use the canonical usage model.

Tokens are tracked normally. Cost depends on billing mode, not on whether usage exists.

## Billing Route Model

Hermes must stop keying pricing solely by `model`.

Introduce a billing route descriptor:

```python
@dataclass
class BillingRoute:
    provider: str
    base_url: str | None
    model: str
    billing_mode: str
    organization_hint: str | None = None
```

`billing_mode` values:

- `official_cost_api`
- `official_generation_api`
- `official_models_api`
- `official_docs_snapshot`
- `subscription_included`
- `user_override`
- `custom_contract`
- `unknown`

Examples:

- OpenAI direct API with Costs API access: `official_cost_api`
- Anthropic direct API with Usage & Cost API access: `official_cost_api`
- OpenRouter request before reconciliation: `official_models_api`
- OpenRouter request after generation lookup: `official_generation_api`
- GitHub Copilot style subscription route: `subscription_included`
- local OpenAI-compatible server: `unknown`
- enterprise contract with configured rates: `custom_contract`

## Cost Status Model

Every displayed cost should have:

```python
@dataclass
class CostResult:
    amount_usd: Decimal | None
    status: Literal["actual", "estimated", "included", "unknown"]
    source: Literal[
        "provider_cost_api",
        "provider_generation_api",
        "provider_models_api",
        "official_docs_snapshot",
        "user_override",
        "custom_contract",
        "none",
    ]
    label: str
    fetched_at: datetime | None
    pricing_version: str | None
    notes: list[str]
```

Presentation rules:

- `actual`: show dollar amount as final
- `estimated`: show dollar amount with estimate labeling
- `included`: show `included` or `$0.00 (included)` depending on UX choice
- `unknown`: show `n/a`

## Official Source Hierarchy

Resolve cost using this order:

1. Request-level or account-level official billed cost
2. Official machine-readable model pricing
3. Official docs snapshot
4. User override or custom contract
5. Unknown

The system must never skip to a lower level if a higher-confidence source exists for the current billing route.

## Provider-Specific Truth Rules

### OpenAI Direct

Preferred truth:

1. Costs API for reconciled spend
2. Official pricing page for live estimate

### Anthropic Direct

Preferred truth:

1. Usage & Cost API for reconciled spend
2. Official pricing docs for live estimate

### OpenRouter

Preferred truth:

1. `GET /api/v1/generation` for reconciled `total_cost`
2. `GET /api/v1/models` pricing for live estimate

Do not use underlying provider public pricing as the source of truth for OpenRouter billing.

### Gemini / Vertex

Preferred truth:

1. official billing export or billing API for reconciled spend when available for the route
2. official pricing docs for estimate

### DeepSeek

Preferred truth:

1. official machine-readable cost source if available in the future
2. official pricing docs snapshot today

### Subscription-Included Routes

Preferred truth:

1. explicit route config marking the model as included in subscription

These should display `included`, not an API list-price estimate.

### Custom Endpoint / Local Model

Preferred truth:

1. user override
2. custom contract config
3. unknown

These should default to `unknown`.

## Pricing Catalog

Replace the current `MODEL_PRICING` dict with a richer pricing catalog.

Suggested record:

```python
@dataclass
class PricingEntry:
    provider: str
    route_pattern: str
    model_pattern: str

    input_cost_per_million: Decimal | None = None
    output_cost_per_million: Decimal | None = None
    cache_read_cost_per_million: Decimal | None = None
    cache_write_cost_per_million: Decimal | None = None
    request_cost: Decimal | None = None
    image_cost: Decimal | None = None

    source: str = "official_docs_snapshot"
    source_url: str | None = None
    fetched_at: datetime | None = None
    pricing_version: str | None = None
```

The catalog should be route-aware:

- `openai:gpt-5`
- `anthropic:claude-opus-4-6`
- `openrouter:anthropic/claude-opus-4.6`
- `copilot:gpt-4o`

This avoids conflating direct-provider billing with aggregator billing.

## Pricing Sync Architecture

Introduce a pricing sync subsystem instead of manually maintaining a single hardcoded table.

Suggested modules:

- `agent/pricing/catalog.py`
- `agent/pricing/sources.py`
- `agent/pricing/sync.py`
- `agent/pricing/reconcile.py`
- `agent/pricing/types.py`

### Sync Sources

- OpenRouter models API
- official provider docs snapshots where no API exists
- user overrides from config

### Sync Output

Cache pricing entries locally with:

- source URL
- fetch timestamp
- version/hash
- confidence/source type

### Sync Frequency

- startup warm cache
- background refresh every 6 to 24 hours depending on source
- manual `hermes pricing sync`

## Reconciliation Architecture

Live requests may produce only an estimate initially. Hermes should reconcile them later when a provider exposes actual billed cost.

Suggested flow:

1. Agent call completes.
2. Hermes stores canonical usage plus reconciliation ids.
3. Hermes computes an immediate estimate if a pricing source exists.
4. A reconciliation worker fetches actual cost when supported.
5. Session and message records are updated with `actual` cost.

This can run:

- inline for cheap lookups
- asynchronously for delayed provider accounting

## Persistence Changes

Session storage should stop storing only aggregate prompt/completion totals.

Add fields for both usage and cost certainty:

- `input_tokens`
- `output_tokens`
- `cache_read_tokens`
- `cache_write_tokens`
- `reasoning_tokens`
- `estimated_cost_usd`
- `actual_cost_usd`
- `cost_status`
- `cost_source`
- `pricing_version`
- `billing_provider`
- `billing_mode`

If schema expansion is too large for one PR, add a new pricing events table:

```text
session_cost_events
  id
  session_id
  request_id
  provider
  model
  billing_mode
  input_tokens
  output_tokens
  cache_read_tokens
  cache_write_tokens
  estimated_cost_usd
  actual_cost_usd
  cost_status
  cost_source
  pricing_version
  created_at
  updated_at
```

## Hermes Touchpoints

### `run_agent.py`

Current responsibility:

- parse raw provider usage
- update session token counters

New responsibility:

- build `CanonicalUsage`
- update canonical counters
- store reconciliation ids
- emit usage event to pricing subsystem

### `agent/usage_pricing.py`

Current responsibility:

- static lookup table
- direct cost arithmetic

New responsibility:

- move or replace with pricing catalog facade
- no fuzzy model-family heuristics
- no direct pricing without billing-route context

### `cli.py`

Current responsibility:

- compute session cost directly from prompt/completion totals

New responsibility:

- display `CostResult`
- show status badges:
  - `actual`
  - `estimated`
  - `included`
  - `n/a`

### `agent/insights.py`

Current responsibility:

- recompute historical estimates from static pricing

New responsibility:

- aggregate stored pricing events
- prefer actual cost over estimate
- surface estimates only when reconciliation is unavailable

## UX Rules

### Status Bar

Show one of:

- `$1.42`
- `~$1.42`
- `included`
- `cost n/a`

Where:

- `$1.42` means `actual`
- `~$1.42` means `estimated`
- `included` means subscription-backed or explicitly zero-cost route
- `cost n/a` means unknown

### `/usage`

Show:

- token buckets
- estimated cost
- actual cost if available
- cost status
- pricing source

### `/insights`

Aggregate:

- actual cost totals
- estimated-only totals
- unknown-cost sessions count
- included-cost sessions count

## Config And Overrides

Add user-configurable pricing overrides in config:

```yaml
pricing:
  mode: hybrid
  sync_on_startup: true
  sync_interval_hours: 12
  overrides:
    - provider: openrouter
      model: anthropic/claude-opus-4.6
      billing_mode: custom_contract
      input_cost_per_million: 4.25
      output_cost_per_million: 22.0
      cache_read_cost_per_million: 0.5
      cache_write_cost_per_million: 6.0
  included_routes:
    - provider: copilot
      model: "*"
    - provider: codex-subscription
      model: "*"
```

Overrides must win over catalog defaults for the matching billing route.

## Rollout Plan

### Phase 1

- add canonical usage model
- split cache token buckets in `run_agent.py`
- stop pricing cache-inflated prompt totals
- preserve current UI with improved backend math

### Phase 2

- add route-aware pricing catalog
- integrate OpenRouter models API sync
- add `estimated` vs `included` vs `unknown`

### Phase 3

- add reconciliation for OpenRouter generation cost
- add actual cost persistence
- update `/insights` to prefer actual cost

### Phase 4

- add direct OpenAI and Anthropic reconciliation paths
- add user overrides and contract pricing
- add pricing sync CLI command

## Testing Strategy

Add tests for:

- OpenAI cached token subtraction
- Anthropic cache read/write separation
- OpenRouter estimated vs actual reconciliation
- subscription-backed models showing `included`
- custom endpoints showing `n/a`
- override precedence
- stale catalog fallback behavior

Current tests that assume heuristic pricing should be replaced with route-aware expectations.

## Non-Goals

- exact enterprise billing reconstruction without an official source or user override
- backfilling perfect historical cost for old sessions that lack cache bucket data
- scraping arbitrary provider web pages at request time

## Recommendation

Do not expand the existing `MODEL_PRICING` dict.

That path cannot satisfy the product requirement. Hermes should instead migrate to:

- canonical usage normalization
- route-aware pricing sources
- estimate-then-reconcile cost lifecycle
- explicit certainty states in the UI

This is the minimum architecture that makes the statement "Hermes pricing is backed by official sources where possible, and otherwise clearly labeled" defensible.