mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-18 04:41:56 +00:00
feat(vision): vision_analyze returns pixels to vision-capable models, not aux text (#22955)
When the active main model has native vision and the provider supports multimodal tool results (Anthropic, OpenAI Chat, Codex Responses, Gemini 3, OpenRouter, Nous), vision_analyze loads the image bytes and returns them to the model as a multimodal tool-result envelope. The model then sees the pixels directly on its next turn instead of receiving a lossy text description from an auxiliary LLM. Falls back to the legacy aux-LLM text path for non-vision models and unverified providers. Mirrors the architecture used in OpenCode, Claude Code, Codex CLI, and Cline. All four converge on the same pattern: tool results carry image content blocks for vision-capable provider/model combinations. Changes - tools/vision_tools.py: _vision_analyze_native fast path + provider capability table (_supports_media_in_tool_results). Schema description updated to reflect new behaviour. - agent/codex_responses_adapter.py: function_call_output.output now accepts the array form for multimodal tool results (was string-only). Preflight validates input_text/input_image parts. - agent/auxiliary_client.py: _RUNTIME_MAIN_PROVIDER/_MODEL globals so tools see the live CLI/gateway override, not the stale config.yaml default. set_runtime_main()/clear_runtime_main() helpers. - run_agent.py: AIAgent.run_conversation calls set_runtime_main at turn start so vision_analyze's fast-path check sees the actual runtime. - tests/conftest.py: clear runtime-main override between tests. Tests - tests/tools/test_vision_native_fast_path.py: provider capability table, envelope shape, fast-path gating (vision-capable model uses fast path; non-vision model falls through to aux). - tests/run_agent/test_codex_multimodal_tool_result.py: list tool content becomes function_call_output.output array; preflight preserves arrays and drops unknown part types. Live verified - Opus 4.6 + Sonnet 4.6 on OpenRouter: model calls vision_analyze on a typed filepath, gets pixels back, reads exact text from images that no aux description could capture (font color irony, multi-line fruit-count list, etc.). PR replaces the closed prior efforts (#16506 shipped the inbound user- attached path; this PR closes the gap for tool-discovered images).
This commit is contained in:
parent
e62250453b
commit
3800972dd0
7 changed files with 757 additions and 10 deletions
14
run_agent.py
14
run_agent.py
|
|
@ -11119,6 +11119,20 @@ class AIAgent:
|
|||
|
||||
self._ensure_db_session()
|
||||
|
||||
# Tell auxiliary_client what the live main provider/model are for
|
||||
# this turn. Used by tools whose behaviour depends on the active
|
||||
# main model (e.g. vision_analyze's native fast path) so they see
|
||||
# the CLI/gateway override instead of the stale config.yaml
|
||||
# default. Idempotent — fine to call every turn.
|
||||
try:
|
||||
from agent.auxiliary_client import set_runtime_main
|
||||
set_runtime_main(
|
||||
getattr(self, "provider", "") or "",
|
||||
getattr(self, "model", "") or "",
|
||||
)
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Tag all log records on this thread with the session ID so
|
||||
# ``hermes logs --session <id>`` can filter a single conversation.
|
||||
from hermes_logging import set_session_context
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue