feat(computer-use): background focus-safe backend — set_value, structured windows, MIME detection

Extends the cua-driver computer-use backend to drive backgrounded macOS
windows without stealing keyboard or mouse focus from the foreground app.
All changes target the cua-driver MCP backend and the shared dispatcher.

## cua_backend.py

**Window-aware capture**: capture() now calls list_windows + get_window_state
instead of the removed capture tool. Prefers structuredContent.windows
(MCP 2024-11-05+ cua-driver) for zero-parse window enumeration; falls back
to regex-parsed text for older builds. Stores the selected (pid, window_id)
as sticky context so subsequent action calls do not need a redundant round-trip.

**Action routing**: click/scroll/type_text/key all carry the sticky pid
(and window_id for element-indexed clicks). type_text routes through
type_text_chars (individual key events) rather than AX attribute write --
WebKit AXTextFields reject attribute writes from backgrounded processes.

**Key parsing**: _parse_key_combo splits cmd+s-style strings into
(key, [modifiers]) and routes to hotkey (modifier present) or
press_key (bare key) -- cua-driver actual tool names.

**set_value method**: new set_value(value, element) calls the cua-driver
set_value MCP tool. For AXPopUpButton / HTML select in a backgrounded Safari,
AXPress opens the native macOS popup which closes immediately when the app is
non-frontmost; set_value AX-presses the matching child option directly
(no menu required, no focus steal).

**focus_app**: reimplemented as a pure window-selector (enumerates
list_windows, sets sticky pid/window_id) without ever raising the window
or stealing focus.

**list_apps**: fixed tool name from listApps to list_apps; handles plain-text
response via regex when structured data is absent.

**Structured-content extraction**: _extract_tool_result now surfaces
structuredContent from MCP results, enabling the list_windows window array
without text parsing.

**Helpers**: _parse_windows_from_text, _parse_elements_from_tree,
_split_tree_text, _parse_key_combo extracted as module-level functions.

## schema.py

Added set_value to the action enum with a description explaining when to
prefer it over click (select/popup elements, sliders, no focus steal).
Added value field for set_value payloads.

## tool.py

Routed set_value action through _dispatch to backend.set_value.
Added set_value to _DESTRUCTIVE_ACTIONS (approval-gated).
Fixed MIME-type detection in _capture_response: cua-driver may return
JPEG; detect from base64 magic bytes (/9j/ -> image/jpeg, else image/png)
rather than hardcoding image/png.

## agent/display.py + run_agent.py

Guard _detect_tool_failure and result-preview logic against non-string
function_result values: multimodal tool results (dicts with _multimodal=True)
are not string-sliceable; treat them as successes and fall back to str()
for length/preview.
This commit is contained in:
ddupont 2026-04-24 13:39:12 -07:00 committed by Teknium
parent dad10a78d0
commit 413ee1a286
5 changed files with 363 additions and 75 deletions

View file

@ -74,7 +74,7 @@ _SAFE_ACTIONS = frozenset({"capture", "wait", "list_apps"})
# Actions that mutate user-visible state. Go through approval.
_DESTRUCTIVE_ACTIONS = frozenset({
"click", "double_click", "right_click", "middle_click",
"drag", "scroll", "type", "key", "focus_app",
"drag", "scroll", "type", "key", "set_value", "focus_app",
})
# Hard-blocked key combinations. Mirrored from #4562 — these are destructive
@ -387,6 +387,13 @@ def _dispatch(backend: ComputerUseBackend, action: str, args: Dict[str, Any]) ->
res = backend.key(args.get("keys", ""))
return _maybe_follow_capture(backend, res, capture_after)
if action == "set_value":
value = args.get("value")
if value is None:
return json.dumps({"error": "set_value requires `value`"})
res = backend.set_value(value=str(value), element=args.get("element"))
return _maybe_follow_capture(backend, res, capture_after)
return json.dumps({"error": f"unknown action {action!r}"})
@ -416,12 +423,17 @@ def _capture_response(cap: CaptureResult) -> Any:
summary = "\n".join(summary_lines)
if cap.png_b64 and cap.mode != "ax":
# Detect actual image format from base64 magic bytes so the MIME type
# matches what the data contains (cua-driver may return JPEG or PNG).
# JPEG: base64 starts with /9j/ PNG: starts with iVBOR
_b64_prefix = cap.png_b64[:8]
_mime = "image/jpeg" if _b64_prefix.startswith("/9j/") else "image/png"
return {
"_multimodal": True,
"content": [
{"type": "text", "text": summary},
{"type": "image_url",
"image_url": {"url": f"data:image/png;base64,{cap.png_b64}"}},
"image_url": {"url": f"data:{_mime};base64,{cap.png_b64}"}},
],
"text_summary": summary,
"meta": {"mode": cap.mode, "width": cap.width, "height": cap.height,