mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-18 04:41:56 +00:00
When the active main model has native vision and the provider supports multimodal tool results (Anthropic, OpenAI Chat, Codex Responses, Gemini 3, OpenRouter, Nous), vision_analyze loads the image bytes and returns them to the model as a multimodal tool-result envelope. The model then sees the pixels directly on its next turn instead of receiving a lossy text description from an auxiliary LLM. Falls back to the legacy aux-LLM text path for non-vision models and unverified providers. Mirrors the architecture used in OpenCode, Claude Code, Codex CLI, and Cline. All four converge on the same pattern: tool results carry image content blocks for vision-capable provider/model combinations. Changes - tools/vision_tools.py: _vision_analyze_native fast path + provider capability table (_supports_media_in_tool_results). Schema description updated to reflect new behaviour. - agent/codex_responses_adapter.py: function_call_output.output now accepts the array form for multimodal tool results (was string-only). Preflight validates input_text/input_image parts. - agent/auxiliary_client.py: _RUNTIME_MAIN_PROVIDER/_MODEL globals so tools see the live CLI/gateway override, not the stale config.yaml default. set_runtime_main()/clear_runtime_main() helpers. - run_agent.py: AIAgent.run_conversation calls set_runtime_main at turn start so vision_analyze's fast-path check sees the actual runtime. - tests/conftest.py: clear runtime-main override between tests. Tests - tests/tools/test_vision_native_fast_path.py: provider capability table, envelope shape, fast-path gating (vision-capable model uses fast path; non-vision model falls through to aux). - tests/run_agent/test_codex_multimodal_tool_result.py: list tool content becomes function_call_output.output array; preflight preserves arrays and drops unknown part types. Live verified - Opus 4.6 + Sonnet 4.6 on OpenRouter: model calls vision_analyze on a typed filepath, gets pixels back, reads exact text from images that no aux description could capture (font color irony, multi-line fruit-count list, etc.). PR replaces the closed prior efforts (#16506 shipped the inbound user- attached path; this PR closes the gap for tool-discovered images). |
||
|---|---|---|
| .. | ||
| transports | ||
| __init__.py | ||
| account_usage.py | ||
| anthropic_adapter.py | ||
| auxiliary_client.py | ||
| bedrock_adapter.py | ||
| codex_responses_adapter.py | ||
| context_compressor.py | ||
| context_engine.py | ||
| context_references.py | ||
| copilot_acp_client.py | ||
| credential_pool.py | ||
| credential_sources.py | ||
| curator.py | ||
| curator_backup.py | ||
| display.py | ||
| error_classifier.py | ||
| file_safety.py | ||
| gemini_cloudcode_adapter.py | ||
| gemini_native_adapter.py | ||
| gemini_schema.py | ||
| google_code_assist.py | ||
| google_oauth.py | ||
| i18n.py | ||
| image_gen_provider.py | ||
| image_gen_registry.py | ||
| image_routing.py | ||
| insights.py | ||
| lmstudio_reasoning.py | ||
| manual_compression_feedback.py | ||
| memory_manager.py | ||
| memory_provider.py | ||
| model_metadata.py | ||
| models_dev.py | ||
| moonshot_schema.py | ||
| nous_rate_guard.py | ||
| onboarding.py | ||
| prompt_builder.py | ||
| prompt_caching.py | ||
| rate_limit_tracker.py | ||
| redact.py | ||
| retry_utils.py | ||
| shell_hooks.py | ||
| skill_commands.py | ||
| skill_preprocessing.py | ||
| skill_utils.py | ||
| subdirectory_hints.py | ||
| think_scrubber.py | ||
| title_generator.py | ||
| tool_guardrails.py | ||
| trajectory.py | ||
| usage_pricing.py | ||