mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-02 02:01:47 +00:00
feat(computer-use): cua-driver backend, universal any-model schema
Background macOS desktop control via cua-driver MCP. It does NOT steal the user's cursor or keyboard focus, and it works with any tool-capable model. Replaces the Anthropic-native `computer_20251124` approach from the abandoned #4562 with a generic OpenAI function-calling schema plus SOM (set-of-mark) captures so Claude, GPT, Gemini, and open models can all drive the desktop via numbered element indices.

## What this adds

- `tools/computer_use/` package: swappable ComputerUseBackend ABC + CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
- Universal `computer_use` tool with one schema for all providers. Actions: capture (som/vision/ax), click, double_click, right_click, middle_click, drag, scroll, type, key, wait, list_apps, focus_app.
- Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style `content: [text, image_url]` parts) that flows through handle_function_call into the tool message. Anthropic adapter converts into native `tool_result` image blocks; OpenAI-compatible providers get the parts list directly.
- Image eviction in convert_messages_to_anthropic: only the 3 most recent screenshots carry real image data; older ones become text placeholders to cap per-turn token cost.
- Context compressor image pruning: old multimodal tool results have their image parts stripped instead of being skipped.
- Image-aware token estimation: each image counts as a flat 1500 tokens instead of its base64 char length (~1MB would have registered as ~250K tokens before).
- COMPUTER_USE_GUIDANCE system-prompt block, injected when the toolset is active.
- Session DB persistence strips base64 from multimodal tool messages.
- Trajectory saver normalises multimodal messages to text-only.
- `hermes tools` post-setup installs cua-driver via the upstream script and prints permission-grant instructions.
- CLI approval callback wired so destructive computer_use actions go through the same prompt_toolkit approval dialog as terminal commands.
- Hard safety guards at the tool level: blocked type patterns (curl|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash, force delete, lock screen, log out).
- Skill `apple/macos-computer-use/SKILL.md`: universal (model-agnostic) workflow guide.
- Docs: `user-guide/features/computer-use.md` plus reference catalog entries.

## Tests

44 new tests in tests/tools/test_computer_use.py covering schema shape (universal, not Anthropic-native), dispatch routing, safety guards, multimodal envelope, Anthropic adapter conversion, screenshot eviction, context compressor pruning, image-aware token estimation, run_agent helpers, and universality guarantees. 469/469 pass across tests/tools/test_computer_use.py + the affected agent/ test suites.

## Not in this PR

- `model_tools.py` provider-gating: the tool is available to every provider. Providers without multi-part tool message support will see text-only tool results (graceful degradation via `text_summary`).
- Anthropic server-side `clear_tool_uses_20250919`: deferred; client-side eviction + compressor pruning cover the same cost ceiling without a beta header.

## Caveats

- macOS only. cua-driver uses private SkyLight SPIs (SLEventPostToPid, SLPSPostEventRecordTo, _AXObserverAddNotificationAndCheckRemote) that can break on any macOS update. Pin with HERMES_CUA_DRIVER_VERSION.
- Requires Accessibility + Screen Recording permissions; the post-setup prints the Settings path.

Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic-native schema). Credit @0xbyt4 for the original #3816 groundwork whose context/eviction/token design is preserved here in generic form.
This commit is contained in:
parent
24f139e16a
commit
b07791db05
23 changed files with 2861 additions and 27 deletions
@ -18,6 +18,7 @@ Apple/macOS-specific skills — iMessage, Reminders, Notes, FindMy, and macOS au
| `apple-reminders` | Manage Apple Reminders via remindctl CLI (list, add, complete, delete). | `apple/apple-reminders` |
| `findmy` | Track Apple devices and AirTags via FindMy.app on macOS using AppleScript and screen capture. | `apple/findmy` |
| `imessage` | Send and receive iMessages/SMS via the imsg CLI on macOS. | `apple/imessage` |
| `macos-computer-use` | Drive the macOS desktop in the background via the `computer_use` tool (screenshots, mouse, keyboard, scroll, drag) without stealing the user's cursor or keyboard focus. Works with any tool-capable model. | `apple/macos-computer-use` |

## autonomous-ai-agents

@ -91,6 +91,13 @@ Scoped to the Feishu document-comment handler. Drives comment read/write operati
| `ha_list_entities` | List Home Assistant entities. Optionally filter by domain (light, switch, climate, sensor, binary_sensor, cover, fan, etc.) or by area name (living room, kitchen, bedroom, etc.). | — |
| `ha_list_services` | List available Home Assistant services (actions) for device control. Shows what actions can be performed on each device type and what parameters they accept. Use this to discover how to control devices found via ha_list_entities. | — |

## `computer_use` toolset

| Tool | Description | Requires environment |
|------|-------------|----------------------|
| `computer_use` | Background macOS desktop control via cua-driver: screenshots (SOM / vision / AX), click / drag / scroll / type / key / wait, list_apps, focus_app. Does NOT steal the user's cursor or keyboard focus. Works with any tool-capable model. macOS only. | `cua-driver` on `$PATH` (install via `hermes tools`). |

:::note
**Honcho tools** (`honcho_profile`, `honcho_search`, `honcho_context`, `honcho_reasoning`, `honcho_conclude`) are no longer built-in. They are available via the Honcho memory provider plugin at `plugins/memory/honcho/`. See [Memory Providers](../user-guide/features/memory-providers.md) for installation and usage.
:::

@ -61,6 +61,7 @@ Or in-session:
| `feishu_drive` | `feishu_drive_add_comment`, `feishu_drive_list_comments`, `feishu_drive_list_comment_replies`, `feishu_drive_reply_comment` | Feishu/Lark drive comment operations. Scoped to the comment agent; not exposed on `hermes-cli` or other messaging toolsets. |
| `file` | `patch`, `read_file`, `search_files`, `write_file` | File reading, writing, searching, and editing. |
| `homeassistant` | `ha_call_service`, `ha_get_state`, `ha_list_entities`, `ha_list_services` | Smart home control via Home Assistant. Only available when `HASS_TOKEN` is set. |
| `computer_use` | `computer_use` | Background macOS desktop control via cua-driver; does not steal cursor/focus. Works with any tool-capable model. macOS only; requires `cua-driver` on `$PATH`. |
| `image_gen` | `image_generate` | Text-to-image generation via FAL.ai. |
| `memory` | `memory` | Persistent cross-session memory management. |
| `messaging` | `send_message` | Send messages to other platforms (Telegram, Discord, etc.) from within a session. |

163
website/docs/user-guide/features/computer-use.md
Normal file
@ -0,0 +1,163 @@
# Computer Use (macOS)

Hermes Agent can drive your Mac's desktop (clicking, typing, scrolling,
dragging) in the **background**. Your cursor doesn't move, keyboard focus
doesn't change, and macOS doesn't switch Spaces on you. You and the agent
co-work on the same machine.

Unlike most computer-use integrations, this works with **any tool-capable
model**: Claude, GPT, Gemini, or an open model on a local vLLM endpoint.
There's no Anthropic-native schema to worry about.

## How it works

The `computer_use` toolset speaks MCP over stdio to [`cua-driver`](https://github.com/trycua/cua),
a macOS driver that uses SkyLight private SPIs (`SLEventPostToPid`,
`SLPSPostEventRecordTo`) and the `_AXObserverAddNotificationAndCheckRemote`
accessibility SPI to:

- Post synthesized events directly to target processes, with no HID event
  tap and no cursor warp.
- Flip AppKit active-state without raising windows, so there is no Space
  switching.
- Keep Chromium/Electron accessibility trees alive when windows are
  occluded.

That combination is what OpenAI's Codex "background computer-use" ships.
cua-driver is the open-source equivalent.
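
To make "MCP over stdio" concrete, here is a minimal sketch of how one tool invocation could be framed on the wire. MCP uses JSON-RPC 2.0 with a `tools/call` method, and the stdio transport sends one JSON message per line. The helper name `mcp_tool_call` and the driver-side tool name `click` are assumptions for illustration; the actual tool names exposed by cua-driver are not stated in this document.

```python
import json
from itertools import count

# Monotonic request ids, as JSON-RPC expects each request to carry a unique id.
_ids = count(1)


def mcp_tool_call(tool_name: str, arguments: dict) -> str:
    """Encode one MCP `tools/call` request as a single newline-terminated line.

    `tool_name` is whatever the MCP server advertises; the value used in the
    example below is hypothetical.
    """
    request = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # json.dumps emits no raw newlines by default, so the message stays on one line.
    return json.dumps(request) + "\n"


# Hypothetical driver-side tool name; a real client would first list the server's tools.
line = mcp_tool_call("click", {"element": 14})
```

A real client would write `line` to the driver process's stdin and read the matching response (same `id`) from its stdout.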

## Enabling

1. Run `hermes tools`, pick `🖱️ Computer Use (macOS)` → `cua-driver (background)`.
2. The setup runs the upstream installer:
   `curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/cua-driver/scripts/install.sh`.
3. Grant macOS permissions when prompted:
   - **System Settings → Privacy & Security → Accessibility** → allow the
     terminal (or Hermes app).
   - **System Settings → Privacy & Security → Screen Recording** → allow
     the same.
4. Start a session with the toolset enabled:
   ```
   hermes -t computer_use chat
   ```
   or add `computer_use` to your enabled toolsets in `~/.hermes/config.yaml`.

## Quick example

User prompt: *"Find my latest email from Stripe and summarise what they want me to do."*

The agent's plan:

1. `computer_use(action="capture", mode="som", app="Mail")` gets a
   screenshot of Mail with every sidebar item, toolbar button, and message
   row numbered.
2. `computer_use(action="click", element=14)` clicks the search field
   (element #14 from the capture).
3. `computer_use(action="type", text="from:stripe")` types the search query.
4. `computer_use(action="key", keys="return", capture_after=True)` submits
   the search and returns the new screenshot.
5. Click the top result, read the body, summarise.

During all of this, your cursor stays wherever you left it and Mail never
comes to the front.
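
The first four steps of that plan can be written down as plain data and replayed against any callable with the `computer_use` keyword interface. This is only a sketch: the argument names come from the example above, while `stripe_email_plan`, `run_plan`, and the recording stub are hypothetical helpers, not part of Hermes.

```python
from typing import Any, Callable

Call = dict[str, Any]


def stripe_email_plan() -> list[Call]:
    """Steps 1-4 of the example, expressed as keyword arguments for computer_use."""
    return [
        {"action": "capture", "mode": "som", "app": "Mail"},
        {"action": "click", "element": 14},
        {"action": "type", "text": "from:stripe"},
        {"action": "key", "keys": "return", "capture_after": True},
    ]


def run_plan(plan: list[Call], computer_use: Callable[..., Any]) -> list[Any]:
    """Invoke the tool once per step, in order, and collect the results."""
    return [computer_use(**call) for call in plan]


# Stand-in for the real tool: records each call instead of touching the desktop.
recorded: list[Call] = []


def fake_computer_use(**kwargs: Any) -> dict:
    recorded.append(kwargs)
    return {"ok": True}


results = run_plan(stripe_email_plan(), fake_computer_use)
```

In a real session the model chooses each step after looking at the previous capture, so element numbers such as `14` are only valid for the capture that produced them.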

## Provider compatibility

| Provider | Vision? | Works? | Notes |
|---|---|---|---|
| Anthropic (Claude Sonnet/Opus 3+) | ✅ | ✅ | Best overall; SOM + raw coordinates. |
| OpenRouter (any vision model) | ✅ | ✅ | Multi-part tool messages supported. |
| OpenAI (GPT-4+, GPT-5) | ✅ | ✅ | Same as above. |
| Local vLLM / LM Studio (vision model) | ✅ | ✅ | If the model supports multi-part tool content. |
| Text-only models | ❌ | ✅ (degraded) | Use `mode="ax"` for accessibility-tree-only operation. |

Screenshots are sent inline with tool results as OpenAI-style `image_url`
parts. For Anthropic, the adapter converts them into native `tool_result`
image blocks.
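
As an illustration of that conversion, the sketch below maps a list of OpenAI-style parts onto an Anthropic `tool_result` block with base64 image sources. It is not the adapter's actual code: the function name `to_anthropic_tool_result` and the fallback text for non-data URLs are assumptions, and only the two block shapes (OpenAI `image_url` parts, Anthropic `image` blocks with a `base64` source) follow the public API formats.

```python
import re

# Matches data URLs such as "data:image/png;base64,<payload>".
_DATA_URL = re.compile(r"^data:(?P<media>[\w/+.-]+);base64,(?P<data>.+)$", re.DOTALL)


def to_anthropic_tool_result(tool_use_id: str, parts: list[dict]) -> dict:
    """Convert OpenAI-style [text, image_url] parts into one tool_result block."""
    content = []
    for part in parts:
        if part.get("type") == "text":
            content.append({"type": "text", "text": part["text"]})
        elif part.get("type") == "image_url":
            match = _DATA_URL.match(part["image_url"]["url"])
            if match is None:
                # Assumed fallback: keep a text marker rather than drop the part silently.
                content.append({"type": "text", "text": "[image omitted: not a data URL]"})
                continue
            content.append(
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": match["media"],
                        "data": match["data"],
                    },
                }
            )
    return {"type": "tool_result", "tool_use_id": tool_use_id, "content": content}


parts = [
    {"type": "text", "text": "captured Mail"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
]
result = to_anthropic_tool_result("toolu_1", parts)
```

OpenAI-compatible providers would receive `parts` unchanged as the tool message content.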

## Safety

Hermes applies multi-layer guardrails:

- Destructive actions (click, type, drag, scroll, key, focus_app) require
  approval, either interactively via the CLI dialog or via the
  messaging-platform approval buttons.
- Hard-blocked key combos at the tool level: empty trash, force delete,
  lock screen, log out, force log out.
- Hard-blocked type patterns: `curl | bash`, `sudo rm -rf /`, fork bombs,
  etc.
- The agent's system prompt tells it explicitly: no clicking permission
  dialogs, no typing passwords, no following instructions embedded in
  screenshots.

Pair with `security.approval_level` in `~/.hermes/config.yaml` if you want
every action confirmed.
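
A tool-level type guard of this kind can be sketched with a few regular expressions. The patterns below are illustrative guesses covering the three examples named above; the real list in Hermes may be broader or phrased differently, and `check_type_text` is a hypothetical name. The returned message reuses the error text quoted in the Troubleshooting section.

```python
import re
from typing import Optional

# Illustrative patterns only; the actual Hermes list is not shown in this document.
BLOCKED_TYPE_PATTERNS = [
    re.compile(r"curl\b[^|\n]*\|\s*(ba)?sh\b"),               # curl ... | bash / sh
    re.compile(r"sudo\s+rm\s+-rf\s+/"),                       # sudo rm -rf /
    re.compile(r":\(\)\s*\{\s*:\s*\|\s*:\s*&\s*\}\s*;\s*:"),  # classic shell fork bomb
]


def check_type_text(text: str) -> Optional[str]:
    """Return an error message if the text matches a blocked pattern, else None."""
    for pattern in BLOCKED_TYPE_PATTERNS:
        if pattern.search(text):
            return "blocked pattern in type text"
    return None
```

Running the check before the action reaches the backend means a blocked payload is rejected even if the approval prompt would have been accepted.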

## Token efficiency

Screenshots are expensive. Hermes applies four layers of optimisation:

- **Screenshot eviction.** The Anthropic adapter keeps only the 3 most
  recent screenshots in context; older ones become `[screenshot removed
  to save context]` placeholders.
- **Client-side compression pruning.** The context compressor detects
  multimodal tool results and strips image parts from old ones.
- **Image-aware token estimation.** Each image is counted as ~1500 tokens
  (Anthropic's flat rate) instead of its base64 char length.
- **Server-side context editing (Anthropic only).** When active, the
  adapter enables `clear_tool_uses_20250919` via `context_management` so
  Anthropic's API clears old tool results server-side.
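
The eviction rule from the first bullet can be sketched as a pass over the message list from newest to oldest. This is a simplified stand-in, not the adapter's implementation: it assumes messages carry OpenAI-style part lists, and the name `evict_old_screenshots` is hypothetical. The placeholder text and the limit of 3 are taken from the description above.

```python
import copy

PLACEHOLDER = "[screenshot removed to save context]"


def evict_old_screenshots(messages: list[dict], keep: int = 3) -> list[dict]:
    """Return a copy of messages where only the `keep` newest images survive."""
    out = copy.deepcopy(messages)  # leave the caller's history untouched
    remaining = keep
    # Walk newest-first so the most recent screenshots are the ones retained.
    for message in reversed(out):
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content has no image parts
        for i in range(len(content) - 1, -1, -1):
            if content[i].get("type") == "image_url":
                if remaining > 0:
                    remaining -= 1
                else:
                    content[i] = {"type": "text", "text": PLACEHOLDER}
    return out
```

Working on a copy keeps the stored conversation intact, so eviction only affects what is sent on a given turn.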

A 20-action session on a 1568×900 display typically costs ~30K tokens
of screenshot context, not ~600K.
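
The arithmetic behind the flat-rate estimate is easy to check with a small sketch. The 1500-token image cost comes from the text above; the 4-characters-per-token heuristic and both function names are assumptions made for illustration, not the estimator Hermes actually uses.

```python
IMAGE_TOKENS = 1500      # flat per-image cost described above
CHARS_PER_TOKEN = 4      # assumed rough heuristic for text


def estimate_tokens(parts: list[dict]) -> int:
    """Image-aware estimate: flat cost per image, length-based cost for text."""
    total = 0
    for part in parts:
        if part.get("type") == "image_url":
            total += IMAGE_TOKENS
        else:
            total += len(part.get("text", "")) // CHARS_PER_TOKEN
    return total


def naive_estimate(parts: list[dict]) -> int:
    """Length-only estimate that counts base64 image payloads as if they were text."""
    total = 0
    for part in parts:
        if part.get("type") == "image_url":
            total += len(part["image_url"]["url"]) // CHARS_PER_TOKEN
        else:
            total += len(part.get("text", "")) // CHARS_PER_TOKEN
    return total


# One screenshot with roughly 1 MB of base64 payload.
screenshot = {"type": "image_url", "image_url": {"url": "data:image/png;base64," + "A" * 1_000_000}}
session = [screenshot] * 20
```

Under these assumptions, 20 screenshots estimate to 20 × 1500 = 30,000 tokens, which matches the ~30K figure, while the length-only estimate puts a single 1 MB screenshot at about 250K tokens.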

## Limitations

- **macOS only.** cua-driver uses private Apple SPIs that don't exist on
  Linux or Windows. For cross-platform GUI automation, use the `browser`
  toolset.
- **Private SPI risk.** Apple can change SkyLight's symbol surface in any
  OS update. Pin the driver version with the `HERMES_CUA_DRIVER_VERSION`
  env var if you want reproducibility across a macOS bump.
- **Performance.** Background mode is slower than foreground:
  SkyLight-routed events take ~5-20ms vs direct HID posting. Not
  noticeable for agent-speed clicking; noticeable if you try to record a
  speed-run.
- **No keyboard password entry.** The system prompt tells the agent not to
  type passwords, and `type` hard-blocks command-shell payloads; for
  passwords, use the system's autofill.

## Configuration

Override the driver binary path (tests / CI):

```
HERMES_CUA_DRIVER_CMD=/opt/homebrew/bin/cua-driver
HERMES_CUA_DRIVER_VERSION=0.5.0 # optional pin
```

Swap the backend entirely (for testing):

```
HERMES_COMPUTER_USE_BACKEND=noop # records calls, no side effects
```
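
One plausible way these variables could be resolved at startup is sketched below. Only the three variable names and the `noop` value come from the text above; the function `resolve_backend`, the default backend label `"cua"`, and the fallback to a `$PATH` lookup are assumptions for illustration.

```python
import os
import shutil
from typing import Mapping, Optional


def resolve_backend(env: Optional[Mapping[str, str]] = None) -> dict:
    """Pick a backend and driver command from environment-style settings."""
    env = os.environ if env is None else env
    # "cua" as the default label is an assumption, not a documented value.
    backend = env.get("HERMES_COMPUTER_USE_BACKEND", "cua")
    if backend == "noop":
        # The noop backend records calls and needs no driver binary.
        return {"backend": "noop", "command": None, "version": None}
    # An explicit override wins; otherwise look for cua-driver on $PATH.
    command = env.get("HERMES_CUA_DRIVER_CMD") or shutil.which("cua-driver")
    return {
        "backend": backend,
        "command": command,
        "version": env.get("HERMES_CUA_DRIVER_VERSION"),
    }
```

Passing a plain dict as `env` makes the resolution easy to exercise in tests without touching the real environment.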

## Troubleshooting

**`computer_use backend unavailable: cua-driver is not installed`.** Run
`hermes tools` and enable Computer Use.

**Clicks seem to have no effect.** Capture and verify. A modal you
didn't see may be blocking input. Dismiss it with `escape` or the close
button.

**Element indices are stale.** SOM indices are only valid until the
next `capture`. Re-capture after any state-changing action.

**"blocked pattern in type text".** The text you tried to `type`
matches the dangerous-shell-pattern list. Break the command up or
reconsider.

## See also

- [Universal skill: `macos-computer-use`](https://github.com/NousResearch/hermes-agent/blob/main/skills/apple/macos-computer-use/SKILL.md)
- [cua-driver source (trycua/cua)](https://github.com/trycua/cua)
- [Browser automation](./browser-use.md) for cross-platform web tasks.