mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-30 01:41:43 +00:00
feat(computer-use): background focus-safe backend — set_value, structured windows, MIME detection
Extends the cua-driver computer-use backend to drive backgrounded macOS windows without stealing keyboard or mouse focus from the foreground app. All changes target the cua-driver MCP backend and the shared dispatcher. ## cua_backend.py **Window-aware capture**: capture() now calls list_windows + get_window_state instead of the removed capture tool. Prefers structuredContent.windows (MCP 2024-11-05+ cua-driver) for zero-parse window enumeration; falls back to regex-parsed text for older builds. Stores the selected (pid, window_id) as sticky context so subsequent action calls do not need a redundant round-trip. **Action routing**: click/scroll/type_text/key all carry the sticky pid (and window_id for element-indexed clicks). type_text routes through type_text_chars (individual key events) rather than AX attribute write -- WebKit AXTextFields reject attribute writes from backgrounded processes. **Key parsing**: _parse_key_combo splits cmd+s-style strings into (key, [modifiers]) and routes to hotkey (modifier present) or press_key (bare key) -- cua-driver actual tool names. **set_value method**: new set_value(value, element) calls the cua-driver set_value MCP tool. For AXPopUpButton / HTML select in a backgrounded Safari, AXPress opens the native macOS popup which closes immediately when the app is non-frontmost; set_value AX-presses the matching child option directly (no menu required, no focus steal). **focus_app**: reimplemented as a pure window-selector (enumerates list_windows, sets sticky pid/window_id) without ever raising the window or stealing focus. **list_apps**: fixed tool name from listApps to list_apps; handles plain-text response via regex when structured data is absent. **Structured-content extraction**: _extract_tool_result now surfaces structuredContent from MCP results, enabling the list_windows window array without text parsing. **Helpers**: _parse_windows_from_text, _parse_elements_from_tree, _split_tree_text, _parse_key_combo extracted as module-level functions. ## schema.py Added set_value to the action enum with a description explaining when to prefer it over click (select/popup elements, sliders, no focus steal). Added value field for set_value payloads. ## tool.py Routed set_value action through _dispatch to backend.set_value. Added set_value to _DESTRUCTIVE_ACTIONS (approval-gated). Fixed MIME-type detection in _capture_response: cua-driver may return JPEG; detect from base64 magic bytes (/9j/ -> image/jpeg, else image/png) rather than hardcoding image/png. ## agent/display.py + run_agent.py Guard _detect_tool_failure and result-preview logic against non-string function_result values: multimodal tool results (dicts with _multimodal=True) are not string-sliceable; treat them as successes and fall back to str() for length/preview.
This commit is contained in:
parent
dad10a78d0
commit
413ee1a286
5 changed files with 363 additions and 75 deletions
|
|
@ -40,6 +40,7 @@ COMPUTER_USE_SCHEMA: Dict[str, Any] = {
|
|||
"scroll",
|
||||
"type",
|
||||
"key",
|
||||
"set_value",
|
||||
"wait",
|
||||
"list_apps",
|
||||
"focus_app",
|
||||
|
|
@ -47,7 +48,9 @@ COMPUTER_USE_SCHEMA: Dict[str, Any] = {
|
|||
"description": (
|
||||
"Which action to perform. `capture` is free (no side "
|
||||
"effects). All other actions require approval unless "
|
||||
"auto-approved."
|
||||
"auto-approved. Use `set_value` for select/popup elements "
|
||||
"and sliders — it selects the matching option directly "
|
||||
"without opening the native menu (no focus steal)."
|
||||
),
|
||||
},
|
||||
# ── capture ────────────────────────────────────────────
|
||||
|
|
@ -132,6 +135,16 @@ COMPUTER_USE_SCHEMA: Dict[str, Any] = {
|
|||
"type": "integer",
|
||||
"description": "Scroll wheel ticks. Default 3.",
|
||||
},
|
||||
# ── set_value ──────────────────────────────────────────
|
||||
"value": {
|
||||
"type": "string",
|
||||
"description": (
|
||||
"For action='set_value': the value to set on the element. "
|
||||
"For AXPopUpButton / select dropdowns, pass the option's "
|
||||
"display label (e.g. 'Blue'). For sliders and other "
|
||||
"AXValue-settable elements, pass the numeric or string value."
|
||||
),
|
||||
},
|
||||
# ── type / key / wait ──────────────────────────────────
|
||||
"text": {
|
||||
"type": "string",
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue