mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
feat(api-server): inline image inputs on /v1/chat/completions and /v1/responses (#12969)
OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision
requests to the API server. Both endpoints accept the canonical OpenAI
multimodal shape:
Chat Completions: {type: text|image_url, image_url: {url, detail?}}
Responses: {type: input_text|input_image, image_url: <str>, detail?}
The server validates and converts both into a single internal shape that the
existing agent pipeline already handles (Anthropic adapter converts,
OpenAI-wire providers pass through). Remote http(s) URLs and data:image/*
URLs are supported.
Uploaded files (file, input_file, file_id) and non-image data: URLs are
rejected with 400 unsupported_content_type.
Changes:
- gateway/platforms/api_server.py
- _normalize_multimodal_content(): validates + normalizes both Chat and
Responses content shapes. Returns a plain string for text-only content
(preserves prompt-cache behavior on existing callers) or a canonical
[{type:text|image_url,...}] list when images are present.
- _content_has_visible_payload(): replaces the bare truthy check so a
user turn with only an image no longer rejects as 'No user message'.
- _handle_chat_completions and _handle_responses both call the new helper
for user/assistant content; system messages continue to flatten to text.
- Codex conversation_history, input[], and inline history paths all share
the same validator. No duplicated normalizers.
- run_agent.py
- _summarize_user_message_for_log(): produces a short string summary
('[1 image] describe this') from list content for logging, spinner
previews, and trajectory writes. Fixes AttributeError when list
user_message hit user_message[:80] + '...' / .replace().
- _chat_content_to_responses_parts(): module-level helper that converts
chat-style multimodal content to Responses 'input_text'/'input_image'
parts. Used in _chat_messages_to_responses_input for Codex routing.
- _preflight_codex_input_items() now validates and passes through list
content parts for user/assistant messages instead of stringifying.
- tests/gateway/test_api_server_multimodal.py (new, 38 tests)
- Unit coverage for _normalize_multimodal_content, including both part
formats, data URL gating, and all reject paths.
- Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses
verifying multimodal payloads reach _run_agent intact.
- 400 coverage for file / input_file / non-image data URL.
- tests/run_agent/test_run_agent_multimodal_prologue.py (new)
- Regression coverage for the prologue no-crash contract.
- _chat_content_to_responses_parts round-trip coverage.
- website/docs/user-guide/features/api-server.md
- Inline image examples for both endpoints.
- Updated Limitations: files still unsupported, images now supported.
Validated live against openrouter/anthropic/claude-opus-4.6:
POST /v1/chat/completions → 200, vision-accurate description
POST /v1/responses → 200, same image, clean output_text
POST /v1/chat/completions [file] → 400 unsupported_content_type
POST /v1/responses [input_file] → 400 unsupported_content_type
POST /v1/responses [non-image data URL] → 400 unsupported_content_type
Closes #5621, #8253, #4046, #6632.
Co-authored-by: Paul Bergeron <paul@gamma.app>
Co-authored-by: zhangxicen <zhangxicen@example.com>
Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com>
Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>
This commit is contained in:
parent
3218d58fc5
commit
f683132c1d
5 changed files with 776 additions and 20 deletions
|
|
@ -83,6 +83,25 @@ Standard OpenAI Chat Completions format. Stateless — the full conversation is
|
|||
}
|
||||
```
|
||||
|
||||
**Inline image input:** user messages may send `content` as an array of `text` and `image_url` parts. Both remote `http(s)` URLs and `data:image/...` URLs are supported:
|
||||
|
||||
```json
|
||||
{
|
||||
"model": "hermes-agent",
|
||||
"messages": [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "text", "text": "What is in this image?"},
|
||||
{"type": "image_url", "image_url": {"url": "https://example.com/cat.png", "detail": "high"}}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Uploaded files (`file` / `input_file` / `file_id`) and non-image `data:` URLs return `400 unsupported_content_type`.
|
||||
|
||||
**Streaming** (`"stream": true`): Returns Server-Sent Events (SSE) with token-by-token response chunks. For **Chat Completions**, the stream uses standard `chat.completion.chunk` events plus Hermes' custom `hermes.tool.progress` event for tool-start UX. For **Responses**, the stream uses OpenAI Responses event types such as `response.created`, `response.output_text.delta`, `response.output_item.added`, `response.output_item.done`, and `response.completed`.
|
||||
|
||||
**Tool progress in streams**:
|
||||
|
|
@ -119,6 +138,25 @@ OpenAI Responses API format. Supports server-side conversation state via `previo
|
|||
}
|
||||
```
|
||||
|
||||
**Inline image input:** `input[].content` can contain `input_text` and `input_image` parts. Both remote URLs and `data:image/...` URLs are supported:
|
||||
|
||||
```json
|
||||
{
|
||||
"model": "hermes-agent",
|
||||
"input": [
|
||||
{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "input_text", "text": "Describe this screenshot."},
|
||||
{"type": "input_image", "image_url": "data:image/png;base64,iVBORw0K..."}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Uploaded files (`input_file` / `file_id`) and non-image `data:` URLs return `400 unsupported_content_type`.
|
||||
|
||||
#### Multi-turn with previous_response_id
|
||||
|
||||
Chain responses to maintain full context (including tool calls) across turns:
|
||||
|
|
@ -330,7 +368,7 @@ In Open WebUI, add each as a separate connection. The model dropdown shows `alic
|
|||
## Limitations
|
||||
|
||||
- **Response storage** — stored responses (for `previous_response_id`) are persisted in SQLite and survive gateway restarts. Max 100 stored responses (LRU eviction).
|
||||
- **No file upload** — vision/document analysis via uploaded files is not yet supported through the API.
|
||||
- **No file upload** — inline images are supported on both `/v1/chat/completions` and `/v1/responses`, but uploaded files (`file`, `input_file`, `file_id`) and non-image document inputs are not supported through the API.
|
||||
- **Model field is cosmetic** — the `model` field in requests is accepted but the actual LLM model used is configured server-side in config.yaml.
|
||||
|
||||
## Proxy Mode
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue