Mirror of https://github.com/NousResearch/hermes-agent.git (synced 2026-04-30 01:41:43 +00:00)
* feat(image-input): native multimodal routing based on model vision capability
Attach user-sent images as OpenAI-style content parts on the user turn when
the active model supports native vision, so vision-capable models see real
pixels instead of a lossy text description from vision_analyze.
Routing decision (agent/image_routing.py::decide_image_input_mode):
agent.image_input_mode = auto | native | text (default: auto)
In auto mode:
- If auxiliary.vision.provider/model is explicitly configured, keep the
text pipeline (user paid for a dedicated vision backend).
- Else if models.dev reports supports_vision=True for the active
provider/model, attach natively.
- Else fall back to text (current behaviour).
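
A minimal sketch of the decision above (the config access and the exact
signature are illustrative, not the shipped code):

def decide_image_input_mode(config, provider: str, model: str) -> str:
    """Return "native" or "text" for user-sent images (sketch)."""
    mode = config.get("agent.image_input_mode", "auto")
    if mode in ("native", "text"):
        return mode  # explicit override wins
    # auto: an explicitly configured vision backend keeps the text pipeline
    if config.get("auxiliary.vision.provider") or config.get("auxiliary.vision.model"):
        return "text"
    # auto: attach natively only when models.dev reports vision support
    if _model_supports_vision(provider, model):
        return "native"
    return "text"  # previous behaviour: vision_analyze text enrichment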
Call sites updated: gateway/run.py (all messaging platforms), tui_gateway
(dashboard/Ink), cli.py (interactive /attach + drag-drop).
run_agent.py changes:
- _prepare_anthropic_messages_for_api now passes image parts through
unchanged when the model supports vision — the Anthropic adapter
translates them to native image blocks. Previous behaviour
(vision_analyze → text) only runs for non-vision Anthropic models.
- New _prepare_messages_for_non_vision_model mirrors the same contract
for chat.completions and codex_responses paths, so non-vision models
on any provider get text-fallback instead of failing at the provider.
- New _model_supports_vision() helper reads models.dev caps.
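
A sketch of that capability check, assuming a cached models.dev catalog
keyed by provider and model (the loader name is hypothetical):

import functools

@functools.lru_cache(maxsize=None)
def _model_supports_vision(provider: str, model: str) -> bool:
    # Unknown models default to False and take the safe text-fallback path.
    catalog = _load_models_dev_catalog()  # hypothetical cached models.dev fetch
    entry = catalog.get(provider, {}).get(model, {})
    return bool(entry.get("supports_vision"))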
vision_analyze description rewritten: positions it as a tool for images
NOT already visible in the conversation (URLs, tool output, deeper
inspection). Prevents the model from redundantly calling it on images
already attached natively.
Config default: agent.image_input_mode = auto.
Tests: 35 new (test_image_routing.py + test_vision_aware_preprocessing.py),
all existing tests that reference _prepare_anthropic_messages_for_api
still pass (198 targeted + new tests green).
* feat(image-input): size-cap + resize oversized images, charge image tokens in compressor
Two follow-ups that make the native image routing safer for long / heavy
sessions:
1) Oversize handling in build_native_content_parts (sketched after this
   list):
- 20 MB ceiling per image (matches vision_tools._MAX_BASE64_BYTES,
  sized for the most restrictive provider, Gemini inline data).
- Delegates to vision_tools._resize_image_for_vision (Pillow-based,
  already battle-tested) to downscale to a 5 MB target on the first
  attempt.
- If Pillow is missing or the resize still overshoots, the image is
  dropped and reported back in skipped[]; the caller falls back to text
  enrichment for that image.
2) Image-token accounting in context_compressor (helper sketched after
   this list):
- New _IMAGE_TOKEN_ESTIMATE = 1600 (matches Claude Code's constant;
within the realistic range for Anthropic/GPT-4o/Gemini billing).
- _content_length_for_budget() helper: sums text-part lengths and
charges _IMAGE_CHAR_EQUIVALENT (1600 * 4 chars) per image/image_url/
  input_image part. The base64 payload inside image_url is NOT counted
  as chars: only image presence matters, not payload size or dimensions.
- Both tail-cut sites (_prune_old_tool_results L527 and
_find_tail_cut_by_tokens L1126) now call the helper so multi-image
conversations don't slip past compression budget.
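
For item 1, the flow is roughly this (a sketch; the wrapper name and the
resize signature are assumptions, while _resize_image_for_vision and the
skipped[] contract come from this commit):

_MAX_BASE64_BYTES = 20 * 1024 * 1024  # per-image ceiling at this commit
_RESIZE_TARGET = 5 * 1024 * 1024      # first-attempt downscale target

def _fit_image_b64(path: str, b64: str, skipped: list) -> str | None:
    if len(b64) <= _MAX_BASE64_BYTES:
        return b64  # small enough: attach as-is
    resized = _resize_image_for_vision(path, target_bytes=_RESIZE_TARGET)
    if resized is None or len(resized) > _MAX_BASE64_BYTES:
        skipped.append(path)  # Pillow missing or resize still overshoots;
        return None           # caller falls back to text enrichment
    return resized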
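
For item 2, the helper's contract can be reconstructed from the test file
further down (a sketch; the shipped module may differ in detail):

_CHARS_PER_TOKEN = 4                   # compressor's chars-per-token heuristic
_IMAGE_TOKEN_ESTIMATE = 1600
_IMAGE_CHAR_EQUIVALENT = _IMAGE_TOKEN_ESTIMATE * _CHARS_PER_TOKEN
_IMAGE_PART_TYPES = {"image_url", "input_image", "image"}

def _content_length_for_budget(content) -> int:
    """Char-equivalent length of message content for budget decisions."""
    if content is None:
        return 0
    if isinstance(content, str):
        return len(content)
    total = 0
    for part in content:
        if isinstance(part, str):            # older mixed list-of-strings shape
            total += len(part)
        elif part.get("type") in _IMAGE_PART_TYPES:
            total += _IMAGE_CHAR_EQUIVALENT  # flat charge; base64 size ignored
        else:
            total += len(part.get("text") or "")
    return total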
Tests: 9 new in test_image_routing.py (oversize triggers resize,
resize-fails-returns-None, oversize-skipped-reported), 11 new in
test_compressor_image_tokens.py (flat charge per image, multiple images,
Responses-API / Anthropic-native / OpenAI-chat shapes, no-inflation on
raw base64, bounds-check on the constant, integration test that an
image-heavy tail actually gets trimmed).
* fix(image-input): replace blanket 20MB ceiling with empirically-verified per-provider limits
The previous commit imposed a hardcoded 20 MB base64 ceiling on all
providers, triggering auto-resize on anything larger. This was wrong in
both directions:
* Too loose for Anthropic — actual limit is 5 MB (returns HTTP 400
'image exceeds 5 MB maximum' above that).
* Too strict for OpenAI / Codex / OpenRouter — these accept 49 MB+
  without complaint (empirically verified April 2026 with progressively
  larger PNGs).
New behaviour:
* _PROVIDER_BASE64_CEILING table: only anthropic and bedrock have a
ceiling (5 MB, since bedrock-on-Claude shares Anthropic's decoder).
* Providers NOT in the table get no ceiling — images attach at native
size and we trust the provider to return its own error if it
disagrees. A provider-specific 400 message is clearer than us
guessing wrong and silently degrading image quality.
* build_native_content_parts() gains a keyword-only provider arg;
gateway/CLI/TUI pass the active provider so Anthropic users get
auto-resize protection while OpenAI users don't pay it.
* Resize target dropped from 5 MB to 4 MB to slide safely under
Anthropic's boundary with header overhead.
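
In sketch form (ceiling values are from the bullets above; the lookup
helper is illustrative):

_PROVIDER_BASE64_CEILING = {
    "anthropic": 5 * 1024 * 1024,  # HTTP 400 above 5 MB
    "bedrock": 5 * 1024 * 1024,    # bedrock-on-Claude shares Anthropic's decoder
}
_RESIZE_TARGET = 4 * 1024 * 1024   # slide under 5 MB with header overhead

def _ceiling_for(provider: str | None) -> int | None:
    # Providers not in the table get no ceiling: attach at native size and
    # let the provider return its own, clearer error if it disagrees.
    return _PROVIDER_BASE64_CEILING.get((provider or "").lower())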
Empirical measurements (direct API, no Hermes in the loop):
image b64    anthropic      openrouter/gpt5.5    codex-oauth/gpt5.5
0.19 MB      ✓              ✓                    ✓
12.37 MB     ✗ 400 (5 MB)   ✓                    ✓
23.85 MB     ✗ 400 (5 MB)   ✓                    ✓
49.46 MB     ✗ 413          ✓                    ✓
Tests: rewrote TestOversizeHandling (5 tests): no-ceiling pass-through,
Anthropic resize fires, Anthropic skip on resize-fail, build_native_parts
routes ceiling by provider, unknown provider gets no ceiling. All 52
targeted tests pass.
* refactor(image-input): attempt native, shrink-and-retry on provider reject
Replace proactive per-provider size ceilings with a reactive shrink path
on the provider's actual rejection. All providers now attempt native
full-size attachment first; if the provider returns an image-too-large
error, the agent silently shrinks and retries once.
Why the previous design was wrong: hardcoding provider ceilings
(anthropic=5MB, others=unlimited) meant OpenAI users on a 10MB image
paid no tax, but Anthropic users lost quality on anything >5MB even
though the empirical behaviour at provider-reject time is the same
(shrink + retry). Baking the table into the routing layer also
requires updating Hermes every time a provider's limit changes.
Reactive design:
- image_routing.py: _file_to_data_url encodes native size, no ceiling.
build_native_content_parts drops its provider kwarg.
- error_classifier.py: new FailoverReason.image_too_large + pattern
match ("image exceeds", "image too large", etc.) checked BEFORE
context_overflow so Anthropic's 5MB rejection lands in the right
bucket.
- run_agent.py: new _try_shrink_image_parts_in_messages walks api
messages in-place, re-encodes oversized data: URL image parts
through vision_tools._resize_image_for_vision to fit under 4MB,
handles both chat.completions (dict image_url) and Responses
(string image_url) shapes, ignores http URLs (provider-fetched).
New image_shrink_retry_attempted flag in the retry loop fires the
shrink exactly once per turn after credential-pool recovery but
before auth retries.
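
The classifier signal, in sketch form (the pattern list is abridged from
this commit; the helper name is hypothetical, while
FailoverReason.image_too_large is the real new enum member):

_IMAGE_TOO_LARGE_PATTERNS = ("image exceeds", "image too large")

def _is_image_too_large(error_text: str) -> bool:
    """Run BEFORE the context_overflow check, so Anthropic's
    'image exceeds 5 MB maximum' lands in the image_too_large bucket."""
    lowered = error_text.lower()
    return any(pat in lowered for pat in _IMAGE_TOO_LARGE_PATTERNS)

# Matches the rejection seen in the live run below:
assert _is_image_too_large("400: image exceeds 5 MB maximum")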
E2E verified live against Anthropic claude-sonnet-4-6:
- 17.9MB PNG (23.9MB b64) attached at native size
- Anthropic returns 400 "image exceeds 5 MB maximum"
- Agent logs '📐 Image(s) exceeded provider size limit — shrank and
retrying...'
- Retry succeeds, correct response delivered in 6.8s total.
Tests: 12 new (8 shrink-helper shapes + 4 classifier signals),
replaces 5 proactive-ceiling tests with 3 simpler 'native attach works'
tests. 181 targeted tests pass. test_enum_members_exist in
test_error_classifier.py updated for the new enum value.
test_compressor_image_tokens.py (141 lines, 5.8 KiB, Python)
"""Tests for image-token accounting in the context compressor.
|
||
|
||
Covers the native-image-routing PR's companion change: the compressor's
|
||
multimodal message length counter now charges ~1600 tokens per attached
|
||
image part instead of 0, so tail-cut / prune decisions are accurate for
|
||
creative workflows that iterate on images across many turns.
|
||
"""
|
||
|
||
from __future__ import annotations
|
||
|
||
import pytest
|
||
|
||
from agent.context_compressor import (
|
||
_CHARS_PER_TOKEN,
|
||
_IMAGE_CHAR_EQUIVALENT,
|
||
_IMAGE_TOKEN_ESTIMATE,
|
||
_content_length_for_budget,
|
||
)
|
||
|
||
|
||
class TestContentLengthForBudget:
|
||
def test_plain_string(self):
|
||
assert _content_length_for_budget("hello world") == 11
|
||
|
||
def test_empty_string(self):
|
||
assert _content_length_for_budget("") == 0
|
||
|
||
def test_none_coerces_to_zero(self):
|
||
assert _content_length_for_budget(None) == 0
|
||
|
||
def test_text_only_list(self):
|
||
content = [
|
||
{"type": "text", "text": "first"},
|
||
{"type": "text", "text": "second"},
|
||
]
|
||
assert _content_length_for_budget(content) == 5 + 6
|
||
|
||
def test_single_image_part_charges_fixed_budget(self):
|
||
content = [
|
||
{"type": "text", "text": "look"},
|
||
{"type": "image_url", "image_url": {"url": "data:image/png;base64,XXXX"}},
|
||
]
|
||
# 4 chars of text + 1 image at fixed char-equivalent
|
||
assert _content_length_for_budget(content) == 4 + _IMAGE_CHAR_EQUIVALENT
|
||
|
||
def test_image_url_raw_base64_is_not_counted_as_chars(self):
|
||
"""A 1MB base64 blob inside an image_url must NOT inflate token count.
|
||
|
||
The flat image estimate is what the provider actually bills; the raw
|
||
base64 is transport payload, not context tokens.
|
||
"""
|
||
huge_url = "data:image/png;base64," + ("A" * 1_000_000)
|
||
content = [
|
||
{"type": "image_url", "image_url": {"url": huge_url}},
|
||
]
|
||
# Exactly one image's worth, not 1M + something.
|
||
assert _content_length_for_budget(content) == _IMAGE_CHAR_EQUIVALENT
|
||
|
||
def test_multiple_image_parts(self):
|
||
content = [
|
||
{"type": "text", "text": "compare"},
|
||
{"type": "image_url", "image_url": {"url": "data:image/png;base64,AAA"}},
|
||
{"type": "image_url", "image_url": {"url": "data:image/png;base64,BBB"}},
|
||
{"type": "image_url", "image_url": {"url": "data:image/png;base64,CCC"}},
|
||
]
|
||
assert _content_length_for_budget(content) == 7 + 3 * _IMAGE_CHAR_EQUIVALENT
|
||
|
||
def test_openai_responses_input_image_shape(self):
|
||
"""Responses API uses type=input_image with top-level image_url string."""
|
||
content = [
|
||
{"type": "input_text", "text": "hey"},
|
||
{"type": "input_image", "image_url": "data:image/png;base64,XX"},
|
||
]
|
||
# input_text has .text "hey" (3 chars) + 1 image
|
||
assert _content_length_for_budget(content) == 3 + _IMAGE_CHAR_EQUIVALENT
|
||
|
||
def test_anthropic_native_image_shape(self):
|
||
"""Anthropic native shape: {type: image, source: {...}}."""
|
||
content = [
|
||
{"type": "text", "text": "hi"},
|
||
{"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "XX"}},
|
||
]
|
||
assert _content_length_for_budget(content) == 2 + _IMAGE_CHAR_EQUIVALENT
|
||
|
||
def test_bare_string_part_in_list(self):
|
||
"""Older code paths sometimes produce mixed list-of-strings content."""
|
||
content = ["hello", {"type": "text", "text": "world"}]
|
||
assert _content_length_for_budget(content) == 5 + 5
|
||
|
||
def test_image_estimate_constant_is_reasonable(self):
|
||
"""Sanity-check the estimate aligns with real provider billing.
|
||
|
||
Anthropic ≈ width*height/750 → ~1600 for 1000×1200.
|
||
OpenAI GPT-4o high-detail 2048×2048 ≈ 1445.
|
||
Gemini 258/tile × 6 tiles for a 2048×2048 ≈ 1548.
|
||
Anything in the 800-2000 range is defensible. Enforce bounds so an
|
||
accidental edit doesn't drop it to e.g. 16.
|
||
"""
|
||
assert 800 <= _IMAGE_TOKEN_ESTIMATE <= 2500
|
||
assert _IMAGE_CHAR_EQUIVALENT == _IMAGE_TOKEN_ESTIMATE * _CHARS_PER_TOKEN
|
||
|
||
|
||
class TestTokenBudgetWithImages:
|
||
"""Integration: the compressor's tail-cut decision now respects image cost."""
|
||
|
||
def test_image_heavy_turns_count_toward_budget(self):
|
||
"""A tail with 5 image-bearing turns should blow past a 5K token budget."""
|
||
from agent.context_compressor import ContextCompressor
|
||
|
||
# Minimal compressor fixture — just enough to call _find_tail_cut_by_tokens
|
||
cc = object.__new__(ContextCompressor)
|
||
cc.tail_token_budget = 5000
|
||
|
||
# Build 10 messages: 5 with images, 5 with short text. Without the
|
||
# image-tokens fix, the compressor would think all 10 fit in 5K and
|
||
# protect them all. With the fix, images alone cost 5 × 1600 = 8K,
|
||
# so the tail should be trimmed.
|
||
messages = [{"role": "system", "content": "sys"}]
|
||
for i in range(5):
|
||
messages.append({
|
||
"role": "user",
|
||
"content": [
|
||
{"type": "text", "text": f"turn {i}"},
|
||
{"type": "image_url", "image_url": {"url": "data:image/png;base64,AAA"}},
|
||
],
|
||
})
|
||
messages.append({
|
||
"role": "assistant",
|
||
"content": f"response {i}",
|
||
})
|
||
|
||
cut = cc._find_tail_cut_by_tokens(messages, head_end=0, token_budget=5000)
|
||
|
||
# Budget is 5K, soft ceiling 7.5K. 5 images alone = 8000 image-tokens.
|
||
# Walking backward, the compressor should stop before including all 5.
|
||
# Exact cut depends on text lengths and min_tail, but it MUST be > 1
|
||
# (at least some head-side messages should be compressible).
|
||
assert cut > 1, (
|
||
f"Expected image-heavy tail to be trimmed; compressor placed cut at "
|
||
f"{cut} out of {len(messages)} (image tokens were likely ignored)."
|
||
)
|