feat(computer-use): background focus-safe backend — set_value, structured windows, MIME detection

Extends the cua-driver computer-use backend to drive backgrounded macOS windows without stealing keyboard or mouse focus from the foreground app. All changes target the cua-driver MCP backend and the shared dispatcher. ## cua_backend.py **Window-aware capture**: capture() now calls list_windows + get_window_state instead of the removed capture tool. Prefers structuredContent.windows (MCP 2024-11-05+ cua-driver) for zero-parse window enumeration; falls back to regex-parsed text for older builds. Stores the selected (pid, window_id) as sticky context so subsequent action calls do not need a redundant round-trip. **Action routing**: click/scroll/type_text/key all carry the sticky pid (and window_id for element-indexed clicks). type_text routes through type_text_chars (individual key events) rather than AX attribute write -- WebKit AXTextFields reject attribute writes from backgrounded processes. **Key parsing**: _parse_key_combo splits cmd+s-style strings into (key, [modifiers]) and routes to hotkey (modifier present) or press_key (bare key) -- cua-driver actual tool names. **set_value method**: new set_value(value, element) calls the cua-driver set_value MCP tool. For AXPopUpButton / HTML select in a backgrounded Safari, AXPress opens the native macOS popup which closes immediately when the app is non-frontmost; set_value AX-presses the matching child option directly (no menu required, no focus steal). **focus_app**: reimplemented as a pure window-selector (enumerates list_windows, sets sticky pid/window_id) without ever raising the window or stealing focus. **list_apps**: fixed tool name from listApps to list_apps; handles plain-text response via regex when structured data is absent. **Structured-content extraction**: _extract_tool_result now surfaces structuredContent from MCP results, enabling the list_windows window array without text parsing. **Helpers**: _parse_windows_from_text, _parse_elements_from_tree, _split_tree_text, _parse_key_combo extracted as module-level functions. ## schema.py Added set_value to the action enum with a description explaining when to prefer it over click (select/popup elements, sliders, no focus steal). Added value field for set_value payloads. ## tool.py Routed set_value action through _dispatch to backend.set_value. Added set_value to _DESTRUCTIVE_ACTIONS (approval-gated). Fixed MIME-type detection in _capture_response: cua-driver may return JPEG; detect from base64 magic bytes (/9j/ -> image/jpeg, else image/png) rather than hardcoding image/png. ## agent/display.py + run_agent.py Guard _detect_tool_failure and result-preview logic against non-string function_result values: multimodal tool results (dicts with _multimodal=True) are not string-sliceable; treat them as successes and fall back to str() for length/preview.
2026-04-30 01:41:43 +00:00 · 2026-04-24 13:39:12 -07:00 · 2026-04-24 13:39:12 -07:00 · 413ee1a286
commit 413ee1a286
parent dad10a78d0
5 changed files with 363 additions and 75 deletions
--- a/tools/computer_use/tool.py
+++ b/tools/computer_use/tool.py
@ -74,7 +74,7 @@ _SAFE_ACTIONS = frozenset({"capture", "wait", "list_apps"})
 # Actions that mutate user-visible state. Go through approval.
 _DESTRUCTIVE_ACTIONS = frozenset({
    "click", "double_click", "right_click", "middle_click",
-    "drag", "scroll", "type", "key", "focus_app",
+    "drag", "scroll", "type", "key", "set_value", "focus_app",
 })

 # Hard-blocked key combinations. Mirrored from #4562 — these are destructive
@ -387,6 +387,13 @@ def _dispatch(backend: ComputerUseBackend, action: str, args: Dict[str, Any]) ->
        res = backend.key(args.get("keys", ""))
        return _maybe_follow_capture(backend, res, capture_after)

+    if action == "set_value":
+        value = args.get("value")
+        if value is None:
+            return json.dumps({"error": "set_value requires `value`"})
+        res = backend.set_value(value=str(value), element=args.get("element"))
+        return _maybe_follow_capture(backend, res, capture_after)
+
    return json.dumps({"error": f"unknown action {action!r}"})


@ -416,12 +423,17 @@ def _capture_response(cap: CaptureResult) -> Any:
    summary = "\n".join(summary_lines)

    if cap.png_b64 and cap.mode != "ax":
+        # Detect actual image format from base64 magic bytes so the MIME type
+        # matches what the data contains (cua-driver may return JPEG or PNG).
+        # JPEG: base64 starts with /9j/   PNG: starts with iVBOR
+        _b64_prefix = cap.png_b64[:8]
+        _mime = "image/jpeg" if _b64_prefix.startswith("/9j/") else "image/png"
        return {
            "_multimodal": True,
            "content": [
                {"type": "text", "text": summary},
                {"type": "image_url",
-                 "image_url": {"url": f"data:image/png;base64,{cap.png_b64}"}},
+                 "image_url": {"url": f"data:{_mime};base64,{cap.png_b64}"}},
            ],
            "text_summary": summary,
            "meta": {"mode": cap.mode, "width": cap.width, "height": cap.height,