hermes-agent/tests/agent/test_compressor_image_tokens.py
Teknium ec671c4154
feat(image-input): native multimodal routing based on model vision capability (#16506)
* feat(image-input): native multimodal routing based on model vision capability

Attach user-sent images as OpenAI-style content parts on the user turn when
the active model supports native vision, so vision-capable models see real
pixels instead of a lossy text description from vision_analyze.

Routing decision (agent/image_routing.py::decide_image_input_mode; sketched after this list):

  agent.image_input_mode = auto | native | text  (default: auto)

In auto mode:
  - If auxiliary.vision.provider/model is explicitly configured, keep the
    text pipeline (user paid for a dedicated vision backend).
  - Else if models.dev reports supports_vision=True for the active
    provider/model, attach natively.
  - Else fall back to text (current behaviour).
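
A minimal sketch of that decision, assuming a dict-like config; the
models.dev lookup is abstracted behind a supports_vision callable, and
neither matches the real signatures exactly:

    def decide_image_input_mode(cfg, provider, model, supports_vision):
        # Sketch only: config access and the capability lookup are assumptions.
        mode = cfg.get("agent.image_input_mode", "auto")
        if mode in ("native", "text"):
            return mode  # explicit override wins
        # auto: a dedicated vision backend keeps the text pipeline
        if cfg.get("auxiliary.vision.provider") or cfg.get("auxiliary.vision.model"):
            return "text"
        if supports_vision(provider, model):  # models.dev supports_vision flag
            return "native"
        return "text"  # fall back to current behaviour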

Call sites updated: gateway/run.py (all messaging platforms), tui_gateway
(dashboard/Ink), cli.py (interactive /attach + drag-drop).

run_agent.py changes:
  - _prepare_anthropic_messages_for_api now passes image parts through
    unchanged when the model supports vision — the Anthropic adapter
    translates them to native image blocks. Previous behaviour
    (vision_analyze → text) only runs for non-vision Anthropic models.
  - New _prepare_messages_for_non_vision_model mirrors the same contract
    for chat.completions and codex_responses paths, so non-vision models
    on any provider get text fallback instead of failing at the provider
    (walk sketched after this list).
  - New _model_supports_vision() helper reads models.dev caps.
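
A sketch of the non-vision fallback walk, assuming OpenAI-style content
parts; describe_image is a hypothetical stand-in for the vision_analyze
text pipeline:

    def _prepare_messages_for_non_vision_model(messages, describe_image):
        # Sketch: replace image parts so a non-vision model never sees them.
        out = []
        for msg in messages:
            content = msg.get("content")
            if not isinstance(content, list):
                out.append(msg)  # plain-string content passes through
                continue
            parts = []
            for part in content:
                if isinstance(part, dict) and part.get("type") in (
                    "image", "image_url", "input_image",
                ):
                    parts.append({"type": "text", "text": describe_image(part)})
                else:
                    parts.append(part)
            out.append({**msg, "content": parts})
        return out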

vision_analyze description rewritten: positions it as a tool for images
NOT already visible in the conversation (URLs, tool output, deeper
inspection). Prevents the model from redundantly calling it on images
already attached natively.

Config default: agent.image_input_mode = auto.

Tests: 35 new (test_image_routing.py + test_vision_aware_preprocessing.py),
all existing tests that reference _prepare_anthropic_messages_for_api
still pass (198 targeted + new tests green).

* feat(image-input): size-cap + resize oversized images, charge image tokens in compressor

Two follow-ups that make the native image routing safer for long / heavy
sessions:

1) Oversize handling in build_native_content_parts (sketched below):
   - 20 MB ceiling per image (matches vision_tools._MAX_BASE64_BYTES,
     the most restrictive provider — Gemini inline data).
   - Delegates to vision_tools._resize_image_for_vision (Pillow-based,
     already battle-tested) to downscale to 5 MB first-try.
   - If Pillow is missing or resize still overshoots, the image is
     dropped and reported back in skipped[]; caller falls back to text
     enrichment for that image.

2) Image-token accounting in context_compressor (sketched below):
   - New _IMAGE_TOKEN_ESTIMATE = 1600 (matches Claude Code's constant;
     within the realistic range for Anthropic/GPT-4o/Gemini billing).
   - _content_length_for_budget() helper: sums text-part lengths and
     charges _IMAGE_CHAR_EQUIVALENT (1600 * 4 chars) per image/image_url/
     input_image part.  Base64 payload inside image_url is NOT counted
     as chars — dimensions don't matter, only image-presence.
   - Both tail-cut sites (_prune_old_tool_results L527 and
     _find_tail_cut_by_tokens L1126) now call the helper so multi-image
     conversations don't slip past compression budget.
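
A sketch of the oversize branch; _attach_or_skip is a hypothetical name
for the per-image logic inside build_native_content_parts, and
resize_image stands in for vision_tools._resize_image_for_vision (whose
real signature may differ):

    _MAX_BASE64_BYTES = 20 * 1024 * 1024     # matches vision_tools (Gemini inline data)
    _RESIZE_TARGET_BYTES = 5 * 1024 * 1024   # first-try downscale target

    def _attach_or_skip(path, b64, parts, skipped, resize_image):
        # Sketch only; image/png is assumed for brevity.
        if len(b64) > _MAX_BASE64_BYTES:
            b64 = resize_image(path, _RESIZE_TARGET_BYTES)  # Pillow-based
            if b64 is None or len(b64) > _MAX_BASE64_BYTES:
                skipped.append(path)  # caller falls back to text enrichment
                return
        parts.append({
            "type": "image_url",
            "image_url": {"url": "data:image/png;base64," + b64},
        })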
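The accounting helper's contract is pinned by the tests in this file; a
sketch consistent with them (the shipped code may differ in detail):

    _CHARS_PER_TOKEN = 4
    _IMAGE_TOKEN_ESTIMATE = 1600
    _IMAGE_CHAR_EQUIVALENT = _IMAGE_TOKEN_ESTIMATE * _CHARS_PER_TOKEN

    def _content_length_for_budget(content):
        # Sum text-part lengths; charge a flat char-equivalent per image part.
        if content is None:
            return 0
        if isinstance(content, str):
            return len(content)
        total = 0
        for part in content:
            if isinstance(part, str):
                total += len(part)  # bare strings in mixed lists
            elif part.get("type") in ("image", "image_url", "input_image"):
                total += _IMAGE_CHAR_EQUIVALENT  # base64 payload is NOT counted
            else:
                total += len(part.get("text", ""))  # text / input_text parts
        return total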

Tests: 9 new in test_image_routing.py (oversize triggers resize,
resize-fails-returns-None, oversize-skipped-reported), 11 new in
test_compressor_image_tokens.py (flat charge per image, multiple images,
Responses-API / Anthropic-native / OpenAI-chat shapes, no-inflation on
raw base64, bounds-check on the constant, integration test that an
image-heavy tail actually gets trimmed).

* fix(image-input): replace blanket 20MB ceiling with empirically-verified per-provider limits

The previous commit imposed a hardcoded 20 MB base64 ceiling on all
providers, triggering auto-resize on anything larger. This was wrong in
both directions:

  * Too loose for Anthropic — actual limit is 5 MB (returns HTTP 400
    'image exceeds 5 MB maximum' above that).
  * Too strict for OpenAI / Codex / OpenRouter — these accept 49 MB+
    without complaint (empirically verified April 2026 with progressive
    PNG sizes).

New behaviour:

  * _PROVIDER_BASE64_CEILING table: only anthropic and bedrock have a
    ceiling (5 MB, since bedrock-on-Claude shares Anthropic's decoder);
    the lookup is sketched after this list.
  * Providers NOT in the table get no ceiling — images attach at native
    size and we trust the provider to return its own error if it
    disagrees. A provider-specific 400 message is clearer than us
    guessing wrong and silently degrading image quality.
  * build_native_content_parts() gains a keyword-only provider arg;
    gateway/CLI/TUI pass the active provider so Anthropic users get
    auto-resize protection while OpenAI users don't pay it.
  * Resize target dropped from 5 MB to 4 MB to slide safely under
    Anthropic's boundary with header overhead.
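
A sketch of the lookup (values as measured; the helper name is
illustrative):

    _PROVIDER_BASE64_CEILING = {
        "anthropic": 5 * 1024 * 1024,  # HTTP 400 'image exceeds 5 MB maximum'
        "bedrock": 5 * 1024 * 1024,    # bedrock-on-Claude shares the decoder
    }

    def _ceiling_for(provider):
        # None means no ceiling: attach at native size and let the provider
        # return its own, clearer error if it disagrees.
        return _PROVIDER_BASE64_CEILING.get((provider or "").lower())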

Empirical measurements (direct API, no Hermes in the loop):

    image b64     anthropic   openrouter/gpt5.5   codex-oauth/gpt5.5
    0.19 MB       ✓           ✓                   ✓
    12.37 MB      ✗ 400 5MB   ✓                   ✓
    23.85 MB      ✗ 400 5MB   ✓                   ✓
    49.46 MB      ✗ 413       ✓                   ✓

Tests: rewrote TestOversizeHandling (5 tests): no-ceiling pass-through,
Anthropic resize fires, Anthropic skip on resize-fail, build_native_parts
routes ceiling by provider, unknown provider gets no ceiling. All 52
targeted tests pass.

* refactor(image-input): attempt native, shrink-and-retry on provider reject

Replace proactive per-provider size ceilings with a reactive shrink path
on the provider's actual rejection. All providers now attempt native
full-size attachment first; if the provider returns an image-too-large
error, the agent silently shrinks and retries once.

Why the previous design was wrong: hardcoding provider ceilings
(anthropic=5MB, others=unlimited) meant OpenAI users on a 10MB image
paid no tax, but Anthropic users lost quality on anything >5MB even
though the empirical behaviour at provider-reject time is the same
(shrink + retry). Baking the table into the routing layer also
requires updating Hermes every time a provider's limit changes.

Reactive design:
  - image_routing.py: _file_to_data_url encodes native size, no ceiling.
    build_native_content_parts drops its provider kwarg.
  - error_classifier.py: new FailoverReason.image_too_large + pattern
    match ("image exceeds", "image too large", etc.) checked BEFORE
    context_overflow so Anthropic's 5MB rejection lands in the right
    bucket.
  - run_agent.py: new _try_shrink_image_parts_in_messages walks api
    messages in-place, re-encodes oversized data: URL image parts
    through vision_tools._resize_image_for_vision to fit under 4MB,
    handles both chat.completions (dict image_url) and Responses
    (string image_url) shapes, ignores http URLs (provider-fetched).
    New image_shrink_retry_attempted flag in the retry loop fires the
    shrink exactly once per turn after credential-pool recovery but
    before auth retries (walk sketched after this list).
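
A sketch of the in-place walk, assuming the two part shapes named above;
shrink_data_url is a hypothetical stand-in for decode, resize to fit
under 4 MB via vision_tools._resize_image_for_vision, and re-encode:

    def _try_shrink_image_parts_in_messages(messages, shrink_data_url):
        # Returns True if any data: URL image part was re-encoded in place.
        changed = False
        for msg in messages:
            content = msg.get("content")
            if not isinstance(content, list):
                continue
            for part in content:
                if not isinstance(part, dict):
                    continue
                img = part.get("image_url")
                if part.get("type") == "image_url" and isinstance(img, dict):
                    url = img.get("url", "")  # chat.completions shape
                    if url.startswith("data:"):
                        img["url"] = shrink_data_url(url)
                        changed = True
                elif part.get("type") == "input_image" and isinstance(img, str):
                    if img.startswith("data:"):  # Responses shape
                        part["image_url"] = shrink_data_url(img)
                        changed = True
                # http(s) URLs are provider-fetched and left alone
        return changed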

E2E verified live against Anthropic claude-sonnet-4-6:
  - 17.9MB PNG (23.9MB b64) attached at native size
  - Anthropic returns 400 "image exceeds 5 MB maximum"
  - Agent logs '📐 Image(s) exceeded provider size limit — shrank and
    retrying...'
  - Retry succeeds, correct response delivered in 6.8s total.

Tests: 12 new (8 shrink-helper shapes + 4 classifier signals),
replaces 5 proactive-ceiling tests with 3 simpler 'native attach works'
tests. 181 targeted tests pass. test_enum_members_exist in
test_error_classifier.py updated for the new enum value.
2026-04-27 06:27:59 -07:00


"""Tests for image-token accounting in the context compressor.
Covers the native-image-routing PR's companion change: the compressor's
multimodal message length counter now charges ~1600 tokens per attached
image part instead of 0, so tail-cut / prune decisions are accurate for
creative workflows that iterate on images across many turns.
"""
from __future__ import annotations
import pytest
from agent.context_compressor import (
_CHARS_PER_TOKEN,
_IMAGE_CHAR_EQUIVALENT,
_IMAGE_TOKEN_ESTIMATE,
_content_length_for_budget,
)
class TestContentLengthForBudget:
def test_plain_string(self):
assert _content_length_for_budget("hello world") == 11
def test_empty_string(self):
assert _content_length_for_budget("") == 0
def test_none_coerces_to_zero(self):
assert _content_length_for_budget(None) == 0
def test_text_only_list(self):
content = [
{"type": "text", "text": "first"},
{"type": "text", "text": "second"},
]
assert _content_length_for_budget(content) == 5 + 6
def test_single_image_part_charges_fixed_budget(self):
content = [
{"type": "text", "text": "look"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,XXXX"}},
]
# 4 chars of text + 1 image at fixed char-equivalent
assert _content_length_for_budget(content) == 4 + _IMAGE_CHAR_EQUIVALENT
def test_image_url_raw_base64_is_not_counted_as_chars(self):
"""A 1MB base64 blob inside an image_url must NOT inflate token count.
The flat image estimate is what the provider actually bills; the raw
base64 is transport payload, not context tokens.
"""
huge_url = "data:image/png;base64," + ("A" * 1_000_000)
content = [
{"type": "image_url", "image_url": {"url": huge_url}},
]
# Exactly one image's worth, not 1M + something.
assert _content_length_for_budget(content) == _IMAGE_CHAR_EQUIVALENT
def test_multiple_image_parts(self):
content = [
{"type": "text", "text": "compare"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,AAA"}},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,BBB"}},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,CCC"}},
]
assert _content_length_for_budget(content) == 7 + 3 * _IMAGE_CHAR_EQUIVALENT
def test_openai_responses_input_image_shape(self):
"""Responses API uses type=input_image with top-level image_url string."""
content = [
{"type": "input_text", "text": "hey"},
{"type": "input_image", "image_url": "data:image/png;base64,XX"},
]
# input_text has .text "hey" (3 chars) + 1 image
assert _content_length_for_budget(content) == 3 + _IMAGE_CHAR_EQUIVALENT
def test_anthropic_native_image_shape(self):
"""Anthropic native shape: {type: image, source: {...}}."""
content = [
{"type": "text", "text": "hi"},
{"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "XX"}},
]
assert _content_length_for_budget(content) == 2 + _IMAGE_CHAR_EQUIVALENT
def test_bare_string_part_in_list(self):
"""Older code paths sometimes produce mixed list-of-strings content."""
content = ["hello", {"type": "text", "text": "world"}]
assert _content_length_for_budget(content) == 5 + 5
def test_image_estimate_constant_is_reasonable(self):
"""Sanity-check the estimate aligns with real provider billing.
Anthropic ≈ width*height/750 → ~1600 for 1000×1200.
OpenAI GPT-4o high-detail 2048×2048 ≈ 1445.
Gemini 258/tile × 6 tiles for a 2048×2048 ≈ 1548.
Anything in the 800-2000 range is defensible. Enforce bounds so an
accidental edit doesn't drop it to e.g. 16.
"""
assert 800 <= _IMAGE_TOKEN_ESTIMATE <= 2500
assert _IMAGE_CHAR_EQUIVALENT == _IMAGE_TOKEN_ESTIMATE * _CHARS_PER_TOKEN
class TestTokenBudgetWithImages:
"""Integration: the compressor's tail-cut decision now respects image cost."""
def test_image_heavy_turns_count_toward_budget(self):
"""A tail with 5 image-bearing turns should blow past a 5K token budget."""
from agent.context_compressor import ContextCompressor
# Minimal compressor fixture — just enough to call _find_tail_cut_by_tokens
cc = object.__new__(ContextCompressor)
cc.tail_token_budget = 5000
# Build 10 messages: 5 with images, 5 with short text. Without the
# image-tokens fix, the compressor would think all 10 fit in 5K and
# protect them all. With the fix, images alone cost 5 × 1600 = 8K,
# so the tail should be trimmed.
messages = [{"role": "system", "content": "sys"}]
for i in range(5):
messages.append({
"role": "user",
"content": [
{"type": "text", "text": f"turn {i}"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,AAA"}},
],
})
messages.append({
"role": "assistant",
"content": f"response {i}",
})
cut = cc._find_tail_cut_by_tokens(messages, head_end=0, token_budget=5000)
# Budget is 5K, soft ceiling 7.5K. 5 images alone = 8000 image-tokens.
# Walking backward, the compressor should stop before including all 5.
# Exact cut depends on text lengths and min_tail, but it MUST be > 1
# (at least some head-side messages should be compressible).
assert cut > 1, (
f"Expected image-heavy tail to be trimmed; compressor placed cut at "
f"{cut} out of {len(messages)} (image tokens were likely ignored)."
)