mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-27 11:22:03 +00:00
feat(image-gen): add image-to-image / editing to image_generate (#48705)
* feat(image-gen): add image-to-image / editing to image_generate

Brings image generation to parity with video generation: the unified image_generate tool now edits/transforms a source image (image-to-image) when given image_url / reference_image_urls, routing to each backend's edit endpoint, exactly as video_generate routes to image-to-video.

- ImageGenProvider ABC: generate() gains keyword-only image_url + reference_image_urls; new capabilities() declares modalities + max_reference_images (defaults to text-only, backward compatible). success_response gains a modality field; adds normalize_reference_images.
- image_generate tool: schema exposes image_url + reference_image_urls; dynamic schema reflects the active model's actual edit capability so the agent knows when image_url is honored. Handler + plugin dispatch forward the new inputs; legacy/text-only providers get a clear modality_unsupported error instead of silently dropping the source image.
- In-tree FAL: 7 models gain edit endpoints (flux-2-klein, flux-2-pro, nano-banana-pro, gpt-image-1.5, gpt-image-2, ideogram/v3, qwen-image) with per-model edit_supports whitelists + reference caps; routes to the /edit endpoint and skips the upscaler for edits.
- Plugins: openai (images.edit, 16 refs), xai (/v1/images/edits via grok-imagine-image-quality, JSON body per xAI docs), krea (image_style_references, 10 refs). openai-codex stays text-only and rejects edits with an actionable error.
- Tests: 15 new (payload, routing, dispatch forwarding, dynamic schema, capabilities); updated 2 change-detector/lambda tests for the new schema.
- Docs: image-generation feature page, image-gen provider plugin guide, tools reference.

* fix(image-gen): preserve legacy passthrough in fal/krea plugin tests

Two existing plugin tests asserted pre-image-to-image behavior:

- fal: forward image_url/reference_image_urls only when supplied, so a text-to-image delegation stays byte-identical (no None kwargs).
- krea: keep dict-shaped image_style_references refs verbatim (the unified string refs go through normalize_reference_images; legacy non-string ref objects pass through unchanged) — fixes KeyError when callers pass the richer Krea ref-object shape.

* fix(image-gen): clearer not-capable message for text-to-image-only models

When a text-to-image-only model (incl. gpt-image-2 on the Codex OAuth path, which can't do editing through the Responses image_generation tool) gets a source image, say 'this model is not capable of image-to-image / editing — provide a text-only prompt' rather than sending the user shopping for other backends. Applies to the openai-codex guard, the in-tree FAL no-edit-endpoint error, and the dynamic tool-schema text-only line.
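
The provider-interface change described in the first commit can be pictured roughly as below. This is a minimal sketch, not the repository's code: the class name `ImageGenProvider`, the method names `generate()` / `capabilities()`, the keyword-only `image_url` / `reference_image_urls`, and the `modality_unsupported` error name come from the commit message, while the exact signatures, the modality strings, the dict shapes, and the `TextOnlyProvider` subclass are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class ImageGenProvider(ABC):
    """Illustrative base class; signatures are assumed, not copied from the repo."""

    def capabilities(self) -> Dict[str, Any]:
        # Assumed default: text-only, so providers written before this change
        # keep working without overriding anything.
        return {"modalities": ["text-to-image"], "max_reference_images": 0}

    @abstractmethod
    def generate(
        self,
        prompt: str,
        *,
        image_url: Optional[str] = None,
        reference_image_urls: Optional[List[str]] = None,
    ) -> Dict[str, Any]:
        """Return a result dict (shape is hypothetical in this sketch)."""


class TextOnlyProvider(ImageGenProvider):
    """Hypothetical legacy-style provider that cannot edit images."""

    def generate(self, prompt, *, image_url=None, reference_image_urls=None):
        if image_url or reference_image_urls:
            # Reject explicitly instead of silently dropping the source image.
            return {"success": False, "error_type": "modality_unsupported"}
        return {"success": True, "modality": "text-to-image", "prompt": prompt}
```

Making the new inputs keyword-only with `None` defaults is what keeps existing text-to-image call sites valid: a caller that passes only a prompt never has to mention the new parameters.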
parent cfb55de5ea
commit c02192ff6a
13 changed files with 1239 additions and 106 deletions
@@ -86,6 +86,46 @@ Create a square portrait of a wise old owl — use the typography model
Make me a futuristic cityscape, landscape orientation
```

## Image-to-Image / Editing

The same `image_generate` tool also **edits existing images** when the active
model supports it — pass a source image and the backend routes to its editing
endpoint automatically (mirrors how `video_generate` handles image-to-video).
Omit the source image and it's plain text-to-image.

```
Take this photo and make it a rainy Tokyo street at night → <image>
```

```
Blend these two product shots into one hero image → <image1> <image2>
```

Two inputs drive the edit:

- **`image_url`** — the primary source image to edit/transform (public URL or local path).
- **`reference_image_urls`** — additional style/composition references (capped per-model).
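
As a rough illustration of how these two inputs might be assembled into tool arguments, the sketch below forwards each one only when it is supplied and trims the reference list to a per-model cap. The helper name `build_edit_args`, the `max_refs` parameter, and the choice to truncate (rather than reject) an over-long list are assumptions for illustration; only the argument names `image_url` and `reference_image_urls` come from the text above.

```python
from typing import Dict, List, Optional


def build_edit_args(
    prompt: str,
    image_url: Optional[str] = None,
    reference_image_urls: Optional[List[str]] = None,
    max_refs: int = 0,  # hypothetical per-model reference cap
) -> Dict[str, object]:
    """Assemble tool arguments, omitting inputs that were not supplied."""
    args: Dict[str, object] = {"prompt": prompt}
    if image_url:
        args["image_url"] = image_url

    # Drop blanks and duplicates while preserving order.
    refs: List[str] = []
    for ref in reference_image_urls or []:
        ref = ref.strip()
        if ref and ref not in refs:
            refs.append(ref)

    capped = refs[:max_refs]  # assumption: extra references are truncated
    if capped:
        args["reference_image_urls"] = capped
    return args
```

A plain text-to-image call produces only `{"prompt": ...}`, so nothing edit-related reaches the backend unless the caller asked for an edit.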

### Which backends support editing

| Backend | Image-to-image | Reference cap | How |
|---|---|---|---|
| **FAL.ai** (edit-capable models below) | ✓ | up to 9 | routes to the model's `/edit` endpoint |
| **OpenAI** (`gpt-image-2`) | ✓ | up to 16 | `images.edit()` |
| **xAI** (Grok Imagine) | ✓ | 1 | `/v1/images/edits` (`grok-imagine-image-quality`) |
| **Krea** (`Krea 2`) | ✓ | up to 10 | reference-guided generation (`image_style_references`) |
| **OpenAI (Codex auth)** | ✗ | — | text-to-image only |

FAL models with an editing endpoint: `flux-2/klein/9b`, `flux-2-pro`,
`nano-banana-pro`, `gpt-image-1.5`, `gpt-image-2`, `ideogram/v3`, and
`qwen-image`. Pure text-to-image FAL models (`z-image/turbo`, `recraft`,
`krea/*`) reject image inputs with a clear error pointing you at an
edit-capable model.
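
The routing rule in this paragraph can be sketched as a small lookup: with no source image the model's normal endpoint is used, with a source image an edit-capable model switches to its `/edit` endpoint, and any other model raises. The function name `resolve_endpoint`, the use of the model name as the endpoint path, and the error wording are assumptions for illustration; the real endpoint identifiers are not shown in this document.

```python
from typing import Optional

# Model names taken from the paragraph above.
EDIT_CAPABLE = {
    "flux-2/klein/9b",
    "flux-2-pro",
    "nano-banana-pro",
    "gpt-image-1.5",
    "gpt-image-2",
    "ideogram/v3",
    "qwen-image",
}


def resolve_endpoint(model: str, image_url: Optional[str] = None) -> str:
    """Pick the endpoint for a request (endpoint strings are hypothetical)."""
    if image_url is None:
        return model  # plain text-to-image
    if model not in EDIT_CAPABLE:
        # Fail loudly rather than silently ignoring the source image.
        raise ValueError(
            f"{model} is not capable of image-to-image / editing; "
            "use a text-only prompt or switch to an edit-capable model"
        )
    return f"{model}/edit"
```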

The active model's editing capability is surfaced in the tool description at
runtime, so the agent knows whether `image_url` will be honored before it
calls the tool.
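
One way to picture this runtime behaviour is a description builder that reads the active model's declared capabilities. The sketch below is an assumption-heavy illustration: the function name `build_tool_description`, the capability keys, the modality string `"image-to-image"`, and the description wording are all hypothetical rather than taken from the project.

```python
from typing import Any, Dict


def build_tool_description(capabilities: Dict[str, Any]) -> str:
    """Render a tool description that reflects what the active model can do."""
    base = "Generate an image from a text prompt."
    if "image-to-image" in capabilities.get("modalities", []):
        cap = capabilities.get("max_reference_images", 0)
        return (
            f"{base} The active model also edits images: pass image_url, "
            f"plus up to {cap} reference_image_urls."
        )
    return f"{base} The active model is text-to-image only; image_url is not supported."
```

Because the description is rebuilt from the active model's capabilities, switching models changes what the agent is told without changing the tool's name or argument list.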

## Aspect Ratios

Every model accepts the same three aspect ratios from the agent's perspective. Internally, each model's native size spec is filled in automatically:

@@ -152,7 +192,7 @@ Debug logs go to `./logs/image_tools_debug_<session_id>.json` with per-call deta

## Limitations

-- **Requires FAL credentials** (direct `FAL_KEY` or Nous Subscription)
-- **Text-to-image only** — no inpainting, img2img, or editing via this tool
-- **Temporary URLs** — FAL returns hosted URLs that expire after hours/days; save locally if needed
-- **Per-model constraints** — some models don't support `seed`, `num_inference_steps`, etc. The `supports` filter silently drops unsupported params; this is expected behavior
+- **Requires credentials** for the active backend (FAL `FAL_KEY` / Nous Subscription, `OPENAI_API_KEY`, xAI OAuth, `KREA_API_KEY`)
+- **Editing is model-dependent** — image-to-image works only on edit-capable models (see the table above); text-to-image-only models reject image inputs with a clear error
+- **Temporary URLs** — backends return hosted URLs that expire after hours/days; Hermes materializes them to the local cache so delivery still works after expiry
+- **Per-model constraints** — some models don't support `seed`, `num_inference_steps`, etc. The `supports` / `edit_supports` filter silently drops unsupported params; this is expected behavior
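
The whitelist filtering mentioned in the last bullet can be sketched as follows. This is a minimal illustration under stated assumptions: the key names `supports` and `edit_supports` come from the text, while the function names, the shape of the model entry, and the example parameter sets are hypothetical.

```python
from typing import Any, Dict, Iterable


def filter_params(params: Dict[str, Any], supports: Iterable[str]) -> Dict[str, Any]:
    """Keep only whitelisted parameters; everything else is dropped silently."""
    allowed = set(supports)
    return {key: value for key, value in params.items() if key in allowed}


def build_payload(params: Dict[str, Any], model: Dict[str, Any], is_edit: bool) -> Dict[str, Any]:
    """Use the edit whitelist for edits and the regular one otherwise."""
    whitelist = model["edit_supports"] if is_edit else model["supports"]
    return filter_params(params, whitelist)


# Hypothetical model entry; real whitelists differ per model.
EXAMPLE_MODEL = {
    "supports": {"prompt", "seed"},
    "edit_supports": {"prompt", "image_url"},
}
```

With this shape, a request carrying `seed` and `image_url` yields different payloads for the two paths: the text-to-image payload keeps `seed` and drops `image_url`, and the edit payload does the reverse.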