feat(vision): vision_analyze returns pixels to vision-capable models, not aux text (#22955)

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-18 04:41:56 +00:00

When the active main model has native vision and the provider supports
multimodal tool results (Anthropic, OpenAI Chat, Codex Responses, Gemini
3, OpenRouter, Nous), vision_analyze loads the image bytes and returns
them to the model as a multimodal tool-result envelope. The model then
sees the pixels directly on its next turn instead of receiving a lossy
text description from an auxiliary LLM.
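
As a rough illustration of what "returning pixels" can look like, the sketch below packs image bytes into tool-result content blocks. The actual envelope produced by vision_analyze is not shown in this commit message; the block layout here follows the base64 image block of the Anthropic Messages API, and build_image_tool_result is a hypothetical name.

```python
import base64


def build_image_tool_result(image_bytes: bytes, media_type: str, note: str) -> list:
    """Return tool-result content blocks: a short text note plus the image itself.

    Hypothetical helper; the real envelope in tools/vision_tools.py may differ.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [
        {"type": "text", "text": note},
        {
            "type": "image",
            "source": {"type": "base64", "media_type": media_type, "data": encoded},
        },
    ]


# Example: a few placeholder bytes standing in for a real PNG file.
blocks = build_image_tool_result(b"\x89PNG\r\n\x1a\n", "image/png", "Loaded screenshot.png")
```

Other providers would need a different part layout (for example input_image parts for Codex Responses), so a per-provider conversion step is assumed downstream.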

Falls back to the legacy aux-LLM text path for non-vision models and
unverified providers.
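
The gating described above can be sketched as follows. This is a minimal sketch under stated assumptions: the provider keys and the choose_vision_path helper are hypothetical, and the real capability table (_supports_media_in_tool_results) may use different identifiers or finer-grained checks (for example, by Gemini version).

```python
# Hypothetical capability table; the real one lives in tools/vision_tools.py
# and its exact keys are not shown in this commit message.
PROVIDERS_WITH_MEDIA_TOOL_RESULTS = {
    "anthropic",
    "openai",
    "codex",
    "gemini",
    "openrouter",
    "nous",
}


def supports_media_in_tool_results(provider: str) -> bool:
    """True if the provider is verified to accept images inside tool results."""
    return provider.strip().lower() in PROVIDERS_WITH_MEDIA_TOOL_RESULTS


def choose_vision_path(provider: str, model_has_vision: bool) -> str:
    """Pick the native fast path only when both conditions hold; else fall back."""
    if model_has_vision and supports_media_in_tool_results(provider):
        return "native"
    return "aux"
```

Defaulting to the aux path for anything not in the table keeps unverified providers on the legacy behaviour.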

Mirrors the architecture used in OpenCode, Claude Code, Codex CLI, and
Cline. All four converge on the same pattern: tool results carry image
content blocks for vision-capable provider/model combinations.

Changes
- tools/vision_tools.py: _vision_analyze_native fast path + provider
  capability table (_supports_media_in_tool_results). Schema description
  updated to reflect new behaviour.
- agent/codex_responses_adapter.py: function_call_output.output now
  accepts the array form for multimodal tool results (was string-only).
  Preflight validates input_text/input_image parts.
- agent/auxiliary_client.py: _RUNTIME_MAIN_PROVIDER/_MODEL globals so
  tools see the live CLI/gateway override, not the stale config.yaml
  default. set_runtime_main()/clear_runtime_main() helpers.
- run_agent.py: AIAgent.run_conversation calls set_runtime_main at turn
  start so vision_analyze's fast-path check sees the actual runtime.
- tests/conftest.py: clear runtime-main override between tests.
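
The runtime-main override listed above might look roughly like this. The helper names set_runtime_main and clear_runtime_main come from the commit message, but their signatures are assumed here, and resolve_main is a hypothetical reader added only to show the precedence over the config default.

```python
from typing import Optional, Tuple

# Module-level override, set per turn and cleared between tests.
_RUNTIME_MAIN_PROVIDER: Optional[str] = None
_RUNTIME_MAIN_MODEL: Optional[str] = None


def set_runtime_main(provider: str, model: str) -> None:
    """Record the provider/model actually in use for this turn (assumed signature)."""
    global _RUNTIME_MAIN_PROVIDER, _RUNTIME_MAIN_MODEL
    _RUNTIME_MAIN_PROVIDER = provider
    _RUNTIME_MAIN_MODEL = model


def clear_runtime_main() -> None:
    """Drop the override so callers fall back to the config default."""
    global _RUNTIME_MAIN_PROVIDER, _RUNTIME_MAIN_MODEL
    _RUNTIME_MAIN_PROVIDER = None
    _RUNTIME_MAIN_MODEL = None


def resolve_main(config_provider: str, config_model: str) -> Tuple[str, str]:
    """Hypothetical reader: prefer the live override, else the config values."""
    provider = _RUNTIME_MAIN_PROVIDER or config_provider
    model = _RUNTIME_MAIN_MODEL or config_model
    return provider, model
```

Calling set_runtime_main at turn start, as run_agent.py is described as doing, means a tool invoked mid-turn resolves the same provider and model the user selected on the CLI or gateway.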

Tests
- tests/tools/test_vision_native_fast_path.py: provider capability
  table, envelope shape, fast-path gating (vision-capable model uses
  fast path; non-vision model falls through to aux).
- tests/run_agent/test_codex_multimodal_tool_result.py: list tool
  content becomes function_call_output.output array; preflight
  preserves arrays and drops unknown part types.
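
The preflight behaviour exercised by the Codex test can be sketched as below. This is an illustrative sketch, not the adapter's code: preflight_tool_output and ALLOWED_PART_TYPES are hypothetical names, and only the two part types named in the commit message are allowed.

```python
from typing import Union

# Part types the commit message says the preflight validates.
ALLOWED_PART_TYPES = {"input_text", "input_image"}


def preflight_tool_output(output: Union[str, list]) -> Union[str, list]:
    """Keep string outputs as-is; for arrays, keep known parts and drop the rest."""
    if isinstance(output, str):
        return output
    return [
        part
        for part in output
        if isinstance(part, dict) and part.get("type") in ALLOWED_PART_TYPES
    ]


parts = [
    {"type": "input_text", "text": "Loaded screenshot.png"},
    {"type": "input_image", "image_url": "data:image/png;base64,AAAA"},
    {"type": "mystery_part", "payload": 1},
]
cleaned = preflight_tool_output(parts)
```

Here cleaned keeps the first two parts in order and drops the unknown one, while a plain string output passes through unchanged.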

Live verified
- Opus 4.6 + Sonnet 4.6 on OpenRouter: model calls vision_analyze on a
  typed filepath, gets pixels back, reads exact text from images that
  no aux description could capture (font color irony, multi-line
  fruit-count list, etc.).

This PR replaces the closed prior efforts (#16506 shipped the inbound
user-attached path; this PR closes the gap for tool-discovered images).


Author: Teknium
Date: 2026-05-09 21:06:19 -07:00
Committed by: GitHub
Parent: e62250453b
Commit: 3800972dd0

No known key found for this signature in database

GPG key ID: B5690EEEBB952194

7 changed files with 757 additions and 10 deletions

									
tests/conftest.py
@@ -427,6 +427,15 @@ def _reset_module_state():
     except Exception:
         pass
 
+    # --- agent.auxiliary_client — runtime main provider/model override ---
+    # Set per-turn by AIAgent.run_conversation; tests that import it must
+    # see a clean state so config.yaml fallback works as expected.
+    try:
+        from agent import auxiliary_client as _aux_mod
+        _aux_mod.clear_runtime_main()
+    except Exception:
+        pass
+
     # --- tools.file_tools — per-task read history + file-ops cache ---
     # _read_tracker accumulates per-task_id read history for loop detection,
     # capped by _READ_HISTORY_CAP. If entries from a prior test persist, the
