mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-30 01:41:43 +00:00
feat(computer-use): background focus-safe backend — set_value, structured windows, MIME detection
Extends the cua-driver computer-use backend to drive backgrounded macOS windows without stealing keyboard or mouse focus from the foreground app. All changes target the cua-driver MCP backend and the shared dispatcher. ## cua_backend.py **Window-aware capture**: capture() now calls list_windows + get_window_state instead of the removed capture tool. Prefers structuredContent.windows (MCP 2024-11-05+ cua-driver) for zero-parse window enumeration; falls back to regex-parsed text for older builds. Stores the selected (pid, window_id) as sticky context so subsequent action calls do not need a redundant round-trip. **Action routing**: click/scroll/type_text/key all carry the sticky pid (and window_id for element-indexed clicks). type_text routes through type_text_chars (individual key events) rather than AX attribute write -- WebKit AXTextFields reject attribute writes from backgrounded processes. **Key parsing**: _parse_key_combo splits cmd+s-style strings into (key, [modifiers]) and routes to hotkey (modifier present) or press_key (bare key) -- cua-driver actual tool names. **set_value method**: new set_value(value, element) calls the cua-driver set_value MCP tool. For AXPopUpButton / HTML select in a backgrounded Safari, AXPress opens the native macOS popup which closes immediately when the app is non-frontmost; set_value AX-presses the matching child option directly (no menu required, no focus steal). **focus_app**: reimplemented as a pure window-selector (enumerates list_windows, sets sticky pid/window_id) without ever raising the window or stealing focus. **list_apps**: fixed tool name from listApps to list_apps; handles plain-text response via regex when structured data is absent. **Structured-content extraction**: _extract_tool_result now surfaces structuredContent from MCP results, enabling the list_windows window array without text parsing. **Helpers**: _parse_windows_from_text, _parse_elements_from_tree, _split_tree_text, _parse_key_combo extracted as module-level functions. ## schema.py Added set_value to the action enum with a description explaining when to prefer it over click (select/popup elements, sliders, no focus steal). Added value field for set_value payloads. ## tool.py Routed set_value action through _dispatch to backend.set_value. Added set_value to _DESTRUCTIVE_ACTIONS (approval-gated). Fixed MIME-type detection in _capture_response: cua-driver may return JPEG; detect from base64 magic bytes (/9j/ -> image/jpeg, else image/png) rather than hardcoding image/png. ## agent/display.py + run_agent.py Guard _detect_tool_failure and result-preview logic against non-string function_result values: multimodal tool results (dicts with _multimodal=True) are not string-sliceable; treat them as successes and fall back to str() for length/preview.
This commit is contained in:
parent
dad10a78d0
commit
413ee1a286
5 changed files with 363 additions and 75 deletions
|
|
@ -74,7 +74,7 @@ _SAFE_ACTIONS = frozenset({"capture", "wait", "list_apps"})
|
|||
# Actions that mutate user-visible state. Go through approval.
|
||||
_DESTRUCTIVE_ACTIONS = frozenset({
|
||||
"click", "double_click", "right_click", "middle_click",
|
||||
"drag", "scroll", "type", "key", "focus_app",
|
||||
"drag", "scroll", "type", "key", "set_value", "focus_app",
|
||||
})
|
||||
|
||||
# Hard-blocked key combinations. Mirrored from #4562 — these are destructive
|
||||
|
|
@ -387,6 +387,13 @@ def _dispatch(backend: ComputerUseBackend, action: str, args: Dict[str, Any]) ->
|
|||
res = backend.key(args.get("keys", ""))
|
||||
return _maybe_follow_capture(backend, res, capture_after)
|
||||
|
||||
if action == "set_value":
|
||||
value = args.get("value")
|
||||
if value is None:
|
||||
return json.dumps({"error": "set_value requires `value`"})
|
||||
res = backend.set_value(value=str(value), element=args.get("element"))
|
||||
return _maybe_follow_capture(backend, res, capture_after)
|
||||
|
||||
return json.dumps({"error": f"unknown action {action!r}"})
|
||||
|
||||
|
||||
|
|
@ -416,12 +423,17 @@ def _capture_response(cap: CaptureResult) -> Any:
|
|||
summary = "\n".join(summary_lines)
|
||||
|
||||
if cap.png_b64 and cap.mode != "ax":
|
||||
# Detect actual image format from base64 magic bytes so the MIME type
|
||||
# matches what the data contains (cua-driver may return JPEG or PNG).
|
||||
# JPEG: base64 starts with /9j/ PNG: starts with iVBOR
|
||||
_b64_prefix = cap.png_b64[:8]
|
||||
_mime = "image/jpeg" if _b64_prefix.startswith("/9j/") else "image/png"
|
||||
return {
|
||||
"_multimodal": True,
|
||||
"content": [
|
||||
{"type": "text", "text": summary},
|
||||
{"type": "image_url",
|
||||
"image_url": {"url": f"data:image/png;base64,{cap.png_b64}"}},
|
||||
"image_url": {"url": f"data:{_mime};base64,{cap.png_b64}"}},
|
||||
],
|
||||
"text_summary": summary,
|
||||
"meta": {"mode": cap.mode, "width": cap.width, "height": cap.height,
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue