feat(image-gen): add image-to-image / editing to image_generate (#48705)

* feat(image-gen): add image-to-image / editing to image_generate

Brings image generation to parity with video generation: the unified
image_generate tool now edits/transforms a source image (image-to-image)
when given image_url / reference_image_urls, routing to each backend's
edit endpoint, exactly as video_generate routes to image-to-video.

- ImageGenProvider ABC: generate() gains keyword-only image_url +
  reference_image_urls; new capabilities() declares modalities +
  max_reference_images (defaults to text-only, backward compatible).
  success_response gains a modality field; adds normalize_reference_images.
- image_generate tool: schema exposes image_url + reference_image_urls;
  dynamic schema reflects the active model's actual edit capability so the
  agent knows when image_url is honored. Handler + plugin dispatch forward
  the new inputs; legacy/text-only providers get a clear modality_unsupported
  error instead of silently dropping the source image.
- In-tree FAL: 7 models gain edit endpoints (flux-2-klein, flux-2-pro,
  nano-banana-pro, gpt-image-1.5, gpt-image-2, ideogram/v3, qwen-image)
  with per-model edit_supports whitelists + reference caps; routes to the
  /edit endpoint and skips the upscaler for edits.
- Plugins: openai (images.edit, 16 refs), xai (/v1/images/edits via
  grok-imagine-image-quality, JSON body per xAI docs), krea
  (image_style_references, 10 refs). openai-codex stays text-only and
  rejects edits with an actionable error.
- Tests: 15 new (payload, routing, dispatch forwarding, dynamic schema,
  capabilities); updated 2 change-detector/lambda tests for the new schema.
- Docs: image-generation feature page, image-gen provider plugin guide,
  tools reference.
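
A minimal sketch of the capability check described above, assuming the dispatch layer inspects `capabilities()` before forwarding a source image. The class names, the `check_modality` helper, and the exact shape of the error dict are hypothetical; only the default capability values and the `modality_unsupported` error name come from this commit message.

```python
from typing import Any, Dict, List, Optional


class TextOnlyProvider:
    """Hypothetical legacy provider that keeps the text-only default."""

    name = "legacy"

    def capabilities(self) -> Dict[str, Any]:
        # Default stated in the commit: text-only, zero reference images.
        return {"modalities": ["text"], "max_reference_images": 0}


class EditCapableProvider(TextOnlyProvider):
    """Hypothetical provider that declares image-to-image support."""

    name = "editor"

    def capabilities(self) -> Dict[str, Any]:
        return {"modalities": ["text", "image"], "max_reference_images": 4}


def check_modality(
    provider: TextOnlyProvider,
    image_url: Optional[str] = None,
    reference_image_urls: Optional[List[str]] = None,
) -> Optional[Dict[str, Any]]:
    """Return an error dict when an edit is requested but unsupported, else None."""
    wants_edit = bool(image_url) or bool(reference_image_urls)
    if not wants_edit:
        return None  # plain text-to-image always passes
    if "image" in provider.capabilities().get("modalities", []):
        return None  # provider declared image-to-image support
    # Assumed error shape; the real error_response() fields may differ.
    return {
        "success": False,
        "error_type": "modality_unsupported",
        "error": f"{provider.name} does not support image-to-image / editing",
    }
```

Failing fast here, rather than inside each provider, is what keeps a text-only backend from silently ignoring the source image.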

* fix(image-gen): preserve legacy passthrough in fal/krea plugin tests

Two existing plugin tests asserted pre-image-to-image behavior:
- fal: forward image_url/reference_image_urls only when supplied, so a
  text-to-image delegation stays byte-identical (no None kwargs).
- krea: keep dict-shaped image_style_references refs verbatim (the unified
  string refs go through normalize_reference_images; legacy non-string ref
  objects pass through unchanged) — fixes KeyError when callers pass the
  richer Krea ref-object shape.
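
The two fixes above can be sketched roughly as follows. Both helper names are hypothetical, "supplied" is assumed to mean "not `None`", and the dict keys in the usage example are illustrative rather than the actual Krea ref-object schema.

```python
from typing import Any, Dict, List, Optional, Union

Ref = Union[str, Dict[str, Any]]


def build_forward_kwargs(
    image_url: Optional[str] = None,
    reference_image_urls: Optional[List[Ref]] = None,
) -> Dict[str, Any]:
    """Include the new inputs only when supplied, so a text-to-image
    delegation carries no extra None-valued kwargs."""
    kwargs: Dict[str, Any] = {}
    if image_url is not None:
        kwargs["image_url"] = image_url
    if reference_image_urls is not None:
        kwargs["reference_image_urls"] = reference_image_urls
    return kwargs


def clean_refs(refs: Optional[List[Ref]]) -> List[Ref]:
    """Tidy string refs; pass legacy dict-shaped refs through unchanged."""
    out: List[Ref] = []
    for ref in refs or []:
        if isinstance(ref, str):
            stripped = ref.strip()
            if stripped:  # drop empty strings
                out.append(stripped)
        else:
            out.append(ref)  # legacy ref object: keep verbatim
    return out
```

For example, `clean_refs([" a.png ", {"url": "b.png", "strength": 0.5}])` keeps the dict untouched and only strips the string entry.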

* fix(image-gen): clearer not-capable message for text-to-image-only models

When a text-to-image-only model (incl. gpt-image-2 on the Codex OAuth path,
which can't do editing through the Responses image_generation tool) gets a
source image, say 'this model is not capable of image-to-image / editing —
provide a text-only prompt' rather than sending the user shopping for other
backends. Applies to the openai-codex guard, the in-tree FAL no-edit-endpoint
error, and the dynamic tool-schema text-only line.
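
As an illustration of the dynamic tool-schema line, a description builder could branch on the declared capabilities. This is a sketch under stated assumptions: `describe_tool`, `BASE_DESCRIPTION`, and the wording of both strings are hypothetical and not taken from the actual tool code.

```python
from typing import Any, Dict

# Hypothetical base text; the real tool description is longer.
BASE_DESCRIPTION = "Generate an image from a text prompt."


def describe_tool(caps: Dict[str, Any]) -> str:
    """Append a line telling the agent whether image_url will be honored."""
    modalities = caps.get("modalities", ["text"])
    if "image" in modalities:
        cap = int(caps.get("max_reference_images", 0))
        return (
            f"{BASE_DESCRIPTION} The active model supports image-to-image: "
            f"pass image_url to edit (up to {cap} reference images)."
        )
    return (
        f"{BASE_DESCRIPTION} The active model is not capable of "
        "image-to-image / editing; provide a text-only prompt."
    )
```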
Teknium 2026-06-18 22:13:07 -07:00 committed by GitHub
parent cfb55de5ea
commit c02192ff6a
GPG key ID: B5690EEEBB952194
13 changed files with 1239 additions and 106 deletions


@@ -47,6 +47,7 @@ from agent.image_gen_provider import (
     DEFAULT_ASPECT_RATIO,
     ImageGenProvider,
     error_response,
+    normalize_reference_images,
     resolve_aspect_ratio,
     save_b64_image,
     success_response,
@@ -112,10 +113,20 @@ class MyBackendImageGenProvider(ImageGenProvider):
             ],
         }

+    def capabilities(self) -> Dict[str, Any]:
+        # Declare whether this backend supports image-to-image / editing.
+        # The tool layer surfaces this in the dynamic schema so the model
+        # knows when `image_url` is honored. Default (if you omit this) is
+        # text-only: {"modalities": ["text"], "max_reference_images": 0}.
+        return {"modalities": ["text", "image"], "max_reference_images": 4}
+
     def generate(
         self,
         prompt: str,
         aspect_ratio: str = DEFAULT_ASPECT_RATIO,
+        *,
+        image_url: Optional[str] = None,
+        reference_image_urls: Optional[List[str]] = None,
         **kwargs: Any,
     ) -> Dict[str, Any]:
         prompt = (prompt or "").strip()
@@ -130,6 +141,15 @@ class MyBackendImageGenProvider(ImageGenProvider):
                 aspect_ratio=aspect_ratio,
             )

+        # Routing: if image_url (or reference_image_urls) is set, the call is
+        # an image-to-image / edit request; otherwise text-to-image. Report
+        # which path you took via the `modality` field of success_response.
+        sources = []
+        if image_url:
+            sources.append(image_url)
+        sources.extend(normalize_reference_images(reference_image_urls) or [])
+        modality = "image" if sources else "text"
+
         # Model selection precedence: env var → config → default. The helper
         # _resolve_model() in the built-in openai plugin is a good reference.
         model_id = kwargs.get("model") or self.default_model() or "my-model-fast"
@@ -137,11 +157,18 @@ class MyBackendImageGenProvider(ImageGenProvider):
         try:
             import my_backend_sdk
             client = my_backend_sdk.Client(api_key=os.environ["MY_BACKEND_API_KEY"])
-            result = client.generate(
-                prompt=prompt,
-                model=model_id,
-                aspect_ratio=aspect_ratio,
-            )
+            if modality == "image":
+                result = client.edit(
+                    prompt=prompt,
+                    model=model_id,
+                    image_urls=sources,
+                )
+            else:
+                result = client.generate(
+                    prompt=prompt,
+                    model=model_id,
+                    aspect_ratio=aspect_ratio,
+                )

             # Two shapes supported:
             # - URL string: return it as `image`
@@ -162,6 +189,7 @@ class MyBackendImageGenProvider(ImageGenProvider):
                 prompt=prompt,
                 aspect_ratio=aspect_ratio,
                 provider=self.name,
+                modality=modality,
             )
         except Exception as exc:
             return error_response(


@@ -114,7 +114,7 @@ Scoped to the Feishu document-comment handler. Drives comment read/write operati

 | Tool | Description | Requires environment |
 |------|-------------|----------------------|
-| `image_generate` | Generate high-quality images from text prompts using FAL.ai. The underlying model is user-configured (default: FLUX 2 Klein 9B, sub-1s generation) and is not selectable by the agent. Returns a single image URL. Display it using… | FAL_KEY |
+| `image_generate` | Generate images from text prompts (text-to-image) or edit/transform an existing image (image-to-image) via the user-configured backend (FAL.ai, OpenAI, xAI, Krea). Pass `image_url` to edit an image and `reference_image_urls` for style references; omit both for text-to-image. The model is user-configured and not selectable by the agent. Returns a single image URL or local path. | FAL_KEY / OPENAI_API_KEY / xAI OAuth / KREA_API_KEY |

 ## `kanban` toolset


@@ -86,6 +86,46 @@ Create a square portrait of a wise old owl — use the typography model
 Make me a futuristic cityscape, landscape orientation
 ```

+## Image-to-Image / Editing
+
+The same `image_generate` tool also **edits existing images** when the active
+model supports it — pass a source image and the backend routes to its editing
+endpoint automatically (mirrors how `video_generate` handles image-to-video).
+Omit the source image and it's plain text-to-image.
+
+```
+Take this photo and make it a rainy Tokyo street at night → <image>
+```
+
+```
+Blend these two product shots into one hero image → <image1> <image2>
+```
+
+Two inputs drive the edit:
+
+- **`image_url`** — the primary source image to edit/transform (public URL or local path).
+- **`reference_image_urls`** — additional style/composition references (capped per-model).
+
+### Which backends support editing
+
+| Backend | Image-to-image | Reference cap | How |
+|---|---|---|---|
+| **FAL.ai** (edit-capable models below) | ✓ | up to 9 | routes to the model's `/edit` endpoint |
+| **OpenAI** (`gpt-image-2`) | ✓ | up to 16 | `images.edit()` |
+| **xAI** (Grok Imagine) | ✓ | 1 | `/v1/images/edits` (`grok-imagine-image-quality`) |
+| **Krea** (`Krea 2`) | ✓ | up to 10 | reference-guided generation (`image_style_references`) |
+| **OpenAI (Codex auth)** | ✗ | — | text-to-image only |
+
+FAL models with an editing endpoint: `flux-2/klein/9b`, `flux-2-pro`,
+`nano-banana-pro`, `gpt-image-1.5`, `gpt-image-2`, `ideogram/v3`, and
+`qwen-image`. Pure text-to-image FAL models (`z-image/turbo`, `recraft`,
+`krea/*`) reject image inputs with a clear error pointing you at an
+edit-capable model.
+
+The active model's editing capability is surfaced in the tool description at
+runtime, so the agent knows whether `image_url` will be honored before it
+calls the tool.
+
 ## Aspect Ratios

 Every model accepts the same three aspect ratios from the agent's perspective. Internally, each model's native size spec is filled in automatically:
@@ -152,7 +192,7 @@ Debug logs go to `./logs/image_tools_debug_<session_id>.json` with per-call deta

 ## Limitations

-- **Requires FAL credentials** (direct `FAL_KEY` or Nous Subscription)
-- **Text-to-image only** — no inpainting, img2img, or editing via this tool
-- **Temporary URLs** — FAL returns hosted URLs that expire after hours/days; save locally if needed
-- **Per-model constraints** — some models don't support `seed`, `num_inference_steps`, etc. The `supports` filter silently drops unsupported params; this is expected behavior
+- **Requires credentials** for the active backend (FAL `FAL_KEY` / Nous Subscription, `OPENAI_API_KEY`, xAI OAuth, `KREA_API_KEY`)
+- **Editing is model-dependent** — image-to-image works only on edit-capable models (see the table above); text-to-image-only models reject image inputs with a clear error
+- **Temporary URLs** — backends return hosted URLs that expire after hours/days; Hermes materializes them to the local cache so delivery still works after expiry
+- **Per-model constraints** — some models don't support `seed`, `num_inference_steps`, etc. The `supports` / `edit_supports` filter silently drops unsupported params; this is expected behavior
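
The whitelist filtering mentioned in the last bullet might look roughly like this. It is a sketch only: the `EXAMPLE_MODEL` entry, its parameter names, and the `filter_payload` helper are hypothetical, and the real per-model tables in the FAL integration may be structured differently.

```python
from typing import Any, Dict

# Hypothetical per-model entry with separate whitelists for the
# text-to-image endpoint and the edit endpoint.
EXAMPLE_MODEL: Dict[str, Any] = {
    "supports": {"prompt", "image_size", "seed", "num_inference_steps"},
    "edit_supports": {"prompt", "image_urls", "seed"},
}


def filter_payload(
    payload: Dict[str, Any], model: Dict[str, Any], editing: bool
) -> Dict[str, Any]:
    """Keep only the keys whitelisted for the endpoint in use.

    Keys outside the whitelist are dropped without raising, which matches
    the "silently drops unsupported params" behavior described above.
    """
    allowed = model["edit_supports"] if editing else model["supports"]
    return {key: value for key, value in payload.items() if key in allowed}
```

With the example entry, an edit request carrying `num_inference_steps` would simply lose that key, while the same key survives on the text-to-image path.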