Commit graph

24 commits

Author SHA1 Message Date
Brooklyn Nicholson
833710d33e Merge remote-tracking branch 'origin/main' into pr-50994
# Conflicts:
#	tools/computer_use/cua_backend.py
2026-06-22 18:48:07 -05:00
Brooklyn Nicholson
807b696295 fix(computer-use): vision capture returns an image on cua-driver >=0.5.x
Vision mode called a `screenshot` MCP tool that cua-driver dropped in
0.5.x (full-window PNG capture was folded into `get_window_state`). The
driver replied "Unknown tool: screenshot", so `images` came back empty,
`png_b64` stayed None, and capture returned a 0x0 result with no image
on every call. `som`/`ax` were unaffected because they already use
`get_window_state`, which masked the regression.

Route vision by capability:
- driver advertises `screenshot` (older builds) -> use it (no AX walk)
- otherwise -> call `get_window_state` but discard the AX tree/elements,
  returning only the PNG so vision stays free of element noise
- capabilities not yet discovered -> try `screenshot`, fall back to
  `get_window_state` on an empty image, so the path self-heals

Add `_image_from_tool_result` to pull the PNG from either an MCP image
content-part or `structuredContent.screenshot_png_b64`, and use it on
the som path too so the image won't silently drop on driver builds that
deliver it via structuredContent instead of a content part.

Verified live (vision: 1568x954, 0 elements; som: image + 527 elements)
and with unit coverage of all four routing cases.
2026-06-22 17:41:42 -05:00
Teknium
30e5d0092d
feat(computer-use): add whole-screen/desktop capture target
capture(app='screen'|'desktop') now resolves to the OS shell/desktop
window (Windows Progman/WorkerW desktop or Shell_TrayWnd taskbar, macOS
Finder/Dock) so 'show me my screen' and 'click the taskbar' work.
Previously capture() only matched application windows, and the schema
advertised 'or the whole screen' without any code path delivering it.

cua-driver is window-oriented (no virtual-desktop or per-monitor MCP
tool), so a single image still cannot span multiple monitors — the
schema now states this and the no-desktop-window path returns a clear
message instead of silently grabbing the frontmost app.
2026-06-22 12:21:58 -07:00
jeeves-assistant
5250335863
fix(computer-use): route CuaDriver vision capture via get_window_state
cua-driver 0.6.x removed the standalone screenshot MCP tool, so
capture(mode='vision') hit 'Unknown tool: screenshot' and returned a
0x0 image with no PNG while som/ax (which use get_window_state) still
worked. Route vision through get_window_state(capture_mode='vision').

Salvaged from PR #50771; same fix submitted earlier as #39262 by
@Tranquil-Flow.
2026-06-22 12:21:58 -07:00
Teknium
f1e6d39a74
feat(computer_use): disable cua-driver telemetry by default, add opt-in (#50842)
* feat(computer_use): disable cua-driver telemetry by default, add opt-in

cua-driver ships anonymous PostHog usage telemetry ENABLED by default
upstream (fires cua_driver_install / cua_driver_doctor events to
eu.i.posthog.com). Hermes now disables it for our users unless they
explicitly opt in.

- New config key `computer_use.cua_telemetry` (default false) in
  DEFAULT_CONFIG.
- `cua_backend.cua_driver_child_env()` injects
  `CUA_DRIVER_RS_TELEMETRY_ENABLED=0` into the child env when telemetry is
  disabled (the default); leaves the var untouched on opt-in so the driver
  uses its own default. Reads config fail-safe — any error defaults to
  telemetry off.
- Routed every cua-driver spawn site through the policy: MCP backend
  (StdioServerParameters env), `cua_driver_update_check`, doctor's
  health_report Popen, the install.sh/install.ps1 runner, and the
  `--version` / status probes.
- Docs: new Telemetry subsection in computer-use.md (EN).
- Tests: tests/computer_use/test_cua_telemetry.py — default disables,
  explicit-false disables, opt-in leaves var untouched, config-failure
  fails safe, inherited-enabled is overridden off.

Verified live on Linux against the real cua-driver-rs 0.6.0 binary: with
the var=0 the driver reports "telemetry: disabled via
CUA_DRIVER_RS_TELEMETRY_ENABLED" and sends no event; with it unset it logs
"sending event: cua_driver_doctor". 213 computer_use + install tests green.

* fix(dashboard): fold computer_use config category into agent tab

The new computer_use.cua_telemetry key created a single-field dashboard
config category, tripping test_no_single_field_categories (web_server's
invariant that categories with <2 fields must be merged to avoid tab
sprawl). Add computer_use -> agent to _CATEGORY_MERGE, matching the
existing onboarding/telegram single-field folds.
2026-06-22 09:57:16 -07:00
teknium1
e3505c7f73 fix(computer_use): reconcile Linux gate with stale "gated off" comments
The runtime gate (check_computer_use_requirements) and the hermes tools
platform_gate both enable linux alongside darwin/win32, but several
docstrings/comments still described Linux as "alpha, gated off until it
flips upstream" — contradicting the code that ships it. Bring the prose in
line with the gate that's actually live:

- tool.py / cua_backend.py module docstrings: Linux is enabled (X11 today,
  Wayland via XWayland), not gated off.
- toolsets.py description and hermes tools display name: (macOS/Windows) ->
  (macOS/Windows/Linux).

No behavior change — the gate already allowed all three platforms.
2026-06-22 06:42:30 -07:00
Francesco Bonacci
f2e37549c6 feat(computer_use): cross-platform cua-driver (macOS/Windows/Linux)
Make the computer_use toolset platform-agnostic by driving cua-driver on
macOS, Windows, and Linux. Consumes the 8 cua-driver decoupling surfaces
(capability discovery, structuredContent AX tree, opaque element_token,
click button enum, explicit mimeType, machine-readable manifest,
structured list_windows, structured health_report), each degrading
gracefully on older drivers.

Adds `hermes computer-use doctor` (drives cua-driver health_report with a
per-OS check matrix and an exit 0/1/2 ok/degraded/blocked contract), full
typed wrappers for the previously-uncovered cua-driver tools plus a generic
call_tool escape hatch, per-session agent-cursor lifecycle, platform-aware
system-prompt guidance (host-deterministic, cache-safe), and honors
HERMES_CUA_DRIVER_CMD end-to-end.

Replaces the macOS-only skills/apple/macos-computer-use skill with a
cross-platform skills/computer-use skill, and refreshes the EN + zh-Hans
docs.

Supersedes #44221 (Windows-enablement salvage of #30660).

Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
2026-06-22 06:42:30 -07:00
Luke The Dev
2e5c04aaf7 fix(#37878): scrub operator environment before launching cua-driver MCP
- Use _sanitize_subprocess_env() to filter Hermes-managed credentials
  from the cua-driver subprocess environment (issue #37878)
- Prevents credential exfiltration to the third-party cua-driver binary
- Aligns with existing pattern used by browser-tool and other tools
- Add regression test to verify environment sanitization

The cua-driver is a lower-trust MCP subprocess per SECURITY.md §2.3.
Its inherited environment is now scrubbed by default, removing provider
API keys, gateway tokens, and platform credentials that should not leak
to third-party binaries.

Fixes #37878
2026-06-18 08:53:31 -07:00
helix4u
591e6fb8f4 fix(computer_use): honor custom vision routing 2026-06-07 02:09:20 -07:00
Teknium
365437e4aa
fix(cua-driver): reconnect MCP stdio session once on ClosedResourceError after daemon restart (#40570)
Salvaged from #40282; cleaned up, re-verified against main, tests added.

Co-authored-by: jeeves-assistant <jeeves-assistant@users.noreply.github.com>
2026-06-06 18:35:12 -07:00
teknium1
ddaf2f6712 style: restore PEP8 blank-line separation after dead-code removal
The deletions in the salvaged commit left some top-level defs/classes
separated by a single blank line. Restore the 2-blank-line separation.
2026-05-29 04:22:27 -07:00
kshitijk4poor
dc235e93cb chore: remove dead code — 28 unused functions/classes across 16 files
Vulture + per-symbol verification (whole-repo grep incl. tests, string
literals, getattr, decorator/registry/argparse dispatch) confirmed each of
these has zero callers anywhere — not reachable via any dynamic-dispatch path,
not referenced by tests, not re-exported.

Removed:
- acp_adapter/tools.py: _build_patch_mode_content
- agent/anthropic_adapter.py: read_claude_managed_key (diagnostics-only, never called)
- agent/bedrock_adapter.py: get_bedrock_model_ids
- agent/browser_registry.py: get_active_browser_provider
- agent/chat_completion_helpers.py: _take_request_client (x2 nested closures, never invoked)
- gateway/platforms/weixin.py: _rewrite_headers_for_weixin, _rewrite_table_block_for_weixin
- hermes_cli/banner.py: _skin_branding
- hermes_cli/debug.py: _delete_hint
- hermes_cli/gateway.py: _setup_email, _setup_sms, _setup_yuanbao
  (platform keys absent from the _builtin_setup_fn dispatch dict; handled by
  the _setup_standard_platform fallback)
- hermes_cli/kanban_db.py: set_max_runtime, active_run
- hermes_cli/kanban_diagnostics.py: severity_of_highest, _latest_clean_event_ts
- hermes_cli/main.py: _build_provider_choices, cmd_portal
  (portal subcommand is wired via portal_cli.add_parser, not this wrapper)
- hermes_cli/model_switch.py: CustomAutoResult (orphaned by the switch_model() extraction)
- hermes_cli/models.py: format_model_pricing_table, fetch_nous_account_tier
- hermes_cli/portal_cli.py: _nous_portal_base_url
- hermes_cli/proxy/server.py: handle_models_fallback (defined but never registered on the router)
- tools/computer_use/cua_backend.py: _parse_element, _is_arm_mac
- tools/file_operations.py: _get_safe_write_root (prod uses the imported
  agent.file_safety.get_safe_write_root directly)
- tools/skills_tool.py: _load_category_description

Also dropped two imports left unused by the removals:
- tools/file_operations.py: get_safe_write_root alias
- tools/computer_use/cua_backend.py: import platform

Pure deletion: -551 LOC. No behavior change. Test files covering the edited
modules pass (640/640); the broader suite's pre-existing/env-dependent
failures reproduce unchanged on origin/main.
2026-05-29 04:22:27 -07:00
kshitijk4poor
66827f8947 chore: prune unused imports and duplicate import redefinitions
Remove unused imports (F401) and duplicate/shadowed import
redefinitions (F811) across the codebase using ruff's safe
autofixes. No behavioral changes -- imports only.

- ~1400 safe autofixes applied across 644 files (net -1072 lines)
- __init__.py re-exports preserved (excluded from F401 removal so
  public re-export surfaces stay intact)
- Re-exports that are imported or monkeypatched by tests but look
  unused in their defining module are kept with explicit # noqa:
  F401 (gateway/run.py load_dotenv; run_agent re-exports from
  agent.message_sanitization, agent.context_compressor,
  agent.retry_utils, agent.prompt_builder, agent.process_bootstrap,
  agent.codex_responses_adapter)
- Unsafe F841 (unused-variable) fixes deliberately skipped -- those
  can change behavior when the RHS has side effects
- ruff lints remain disabled in pyproject.toml (only PLW1514 is
  selected); this is a one-time cleanup, not a config change

Verification:
- python -m compileall: clean
- pytest --collect-only: all 27161 tests collect (zero import errors)
- core entry points import clean (run_agent, model_tools, cli,
  toolsets, hermes_state, batch_runner, gateway)
- static scan: every name any test imports directly from an edited
  module still resolves
2026-05-28 22:26:25 -07:00
briandevans
5aa4727f34 fix(computer-use): surface app=… filter no-match instead of silently using frontmost (#24170 bug 1)
`CuaDriverBackend.capture(app=X)` and `focus_app(app=X)` silently fell back
to the frontmost on-screen window when X matched no app — typically a
menu-bar utility (e.g. "Fuwari" in the bug reporter's case) rather than
the requested app. The agent then received UI elements for the wrong app
and clicked / typed into it.

The root cause is a localized macOS app name mismatch: `list_windows`
returns the localized `app_name` (e.g. "計算機" on a Japanese/Chinese
system) but callers naturally pass the English name ("Calculator"). The
substring filter doesn't match, and the code falls through to picking the
frontmost window with no signal that the filter was effectively dropped.

Fix:

- `capture(app=…)`: when the filter matches nothing, return a
  `CaptureResult` with empty `app`/`elements` and a diagnostic
  `window_title` pointing the caller at `list_apps` and noting the
  localized-name convention. `_active_pid` / `_active_window_id` are left
  untouched so a subsequent action doesn't inadvertently hit the wrong
  process.
- `focus_app(app=…)`: when the filter matches nothing, set `target = None`
  and let the existing `return ActionResult(ok=False, …, "No on-screen
  window found for app …")` path fire instead of falsely reporting success
  on the frontmost window.

This addresses bug 1 only from #24170. Bugs 2 & 5 are addressed in #30046;
bugs 3 & 4 in #30032.
2026-05-21 17:15:35 -07:00
Bartok9
4cc18877c6 fix(computer_use): preserve app context for capture_after; fix element label parsing (#24170 bugs 2 & 5)
Bug 2 (capture_after=True loses app context):
_maybe_follow_capture called backend.capture(mode='som') with no app=,
causing cua-driver to capture the frontmost window instead of the app
targeted by the preceding capture/focus_app. Fix: track _last_app on
CuaDriverBackend and thread it through the follow-up capture call so
the same app is re-captured regardless of which window has OS focus.

Bug 5 (element labels stripped in capture results):
_ELEMENT_LINE_RE matched the classic '  - [N] AXRole "label"' format
but not the '[N] AXRole (order) id=Label' format introduced in
cua-driver v0.1.6. All element labels were silently dropped as empty
strings, making element identification impossible.

Fix: extend regex to capture both group(3) (quoted label) and group(4)
(id= label), and update _parse_elements_from_tree to use group(4) as
fallback. Both old and new cua-driver output now produce populated
UIElement.label values.

focus_app() now also sets _last_app so that capture_after= on any
subsequent action re-targets the focused app.

5 new regression tests added.

Part of #24170 (bugs 1 and 3/4 addressed separately).
2026-05-21 14:19:09 -07:00
liuhao1024
18cd1e5c72 fix(computer_use): correct type_text MCP tool name and implement drag action
Bug 3: The cua_backend type_text() method called MCP tool 'type_text_chars'
which does not exist in current cua-driver. Changed to 'type_text' which is
the correct MCP tool name.

Bug 4: The drag() method returned a hardcoded 'not supported' error even
though cua-driver exposes a 'drag' MCP tool. Implemented proper drag
dispatching with coordinate-based and element-based targeting.

Added dispatch-level validation for drag to ensure from/to coordinates
or elements are provided before calling any backend.

Fixes #24170 (bugs 3 and 4)
2026-05-21 14:08:28 -07:00
Teknium
4e89c53082
fix(async): close unscheduled coroutines in all threadsafe bridges (#26584)
Wraps every sync->async coroutine-scheduling site in the codebase with a
new agent.async_utils.safe_schedule_threadsafe() helper that closes the
coroutine on scheduling failure (closed loop, shutdown race, etc.)
instead of leaking it as 'coroutine was never awaited' RuntimeWarnings
plus reference leaks.

22 production call sites migrated across the codebase:
- acp_adapter/events.py, acp_adapter/permissions.py
- agent/lsp/manager.py
- cron/scheduler.py (media + text delivery paths)
- gateway/platforms/feishu.py (5 sites, via existing _submit_on_loop helper
  which now delegates to safe_schedule_threadsafe)
- gateway/run.py (10 sites: telegram rename, agent:step hook, status
  callback, interim+bg-review, clarify send, exec-approval button+text,
  temp-bubble cleanup, channel-directory refresh)
- plugins/memory/hindsight, plugins/platforms/google_chat
- tools/browser_supervisor.py (3), browser_cdp_tool.py,
  computer_use/cua_backend.py, slash_confirm.py
- tools/environments/modal.py (_AsyncWorker)
- tools/mcp_tool.py (2 + 8 _run_on_mcp_loop callers converted to
  factory-style so the coroutine is never constructed on a dead loop)
- tui_gateway/ws.py

Tests: new tests/agent/test_async_utils.py covers helper behavior under
live loop, dead loop, None loop, and scheduling exceptions. Regression
tests added at three PR-original sites (acp events, acp permissions,
mcp loop runner) mirroring contributor's intent.

Live-tested end-to-end:
- Helper stress test: 1500 schedules across live/dead/race scenarios,
  zero leaked coroutines
- Race exercised: 5000 schedules with loop killed mid-flight, 100 ok /
  4900 None returns, zero leaks
- hermes chat -q with terminal tool call (exercises step_callback bridge)
- MCP probe against failing subprocess servers + factory path
- Real gateway daemon boot + SIGINT shutdown across multiple platform
  adapter inits
- WSTransport 100 live + 50 dead-loop writes
- Cron delivery path live + dead loop

Salvages PR #2657 — adopts contributor's intent over a much wider site
list and a single centralized helper instead of inline try/except at
each site. 3 of the original PR's 6 sites no longer exist on main
(environments/patches.py deleted, DingTalk refactored to native async);
the equivalent fix lives in tools/environments/modal.py instead.

Co-authored-by: JithendraNara <jithendranaidunara@gmail.com>
2026-05-15 14:00:01 -07:00
kshitij
2ec8d2b42f
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937)
Replace  with  for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.

608 instances fixed via `ruff --fix --unsafe-fixes`, 0 remaining.
133 files, +626/-626 (net zero).
2026-05-11 11:13:25 -07:00
Teknium
8f711f79a4
fix(tools): install cua-driver when Computer Use is enabled via 'hermes tools' (#22765)
Returning users who enabled '🖱️ Computer Use (macOS)' via 'hermes tools'
saw '✓ Saved configuration' but no install — cua-driver was never on
PATH and the toolset failed at first use. Two compounding causes:

1. _toolset_needs_configuration_prompt fell through to _toolset_has_keys,
   which returned True for any provider with empty env_vars. cua-driver
   has no env vars, so the gate skipped _configure_toolset entirely and
   _run_post_setup('cua_driver') never ran.

2. No stable CLI entry-point existed for re-running the install when
   the picker no-op'd it (e.g. when toggling the toolset off+on inside
   one picker session, where 'added' is empty).

Changes:

- hermes_cli/tools_config.py: add _POST_SETUP_INSTALLED registry
  mapping post_setup keys to installed-state predicates. The gate
  now returns True when any visible provider has a registered
  post_setup whose predicate fails. cua_driver is the only opt-in
  for now; other post_setup hooks keep their existing behaviour.
- hermes_cli/main.py: add 'hermes computer-use install' and
  'hermes computer-use status' as a stable docs target. install
  reuses the same _run_post_setup('cua_driver') path that the
  picker invokes; status reports whether cua-driver is on PATH.
- tools/computer_use/cua_backend.py: install hint now points users
  at 'hermes computer-use install' first.
- website/docs/user-guide/features/computer-use.md: document the
  new command as the primary install path.
- website/docs/reference/cli-commands.md: catalog 'hermes
  computer-use' alongside 'hermes tools'.
- tests/hermes_cli/test_post_setup_gating.py: regression coverage
  for the gate predicate (missing -> setup forced, installed ->
  setup skipped, broken predicate -> non-blocking, unregistered
  keys -> behaviour unchanged).

Fixes #22737. Reported by @f-trycua.
2026-05-09 13:02:25 -07:00
ddupont
e31f3b3c56 feat(computer-use): background focus-safe backend — set_value, structured windows, MIME detection
Extends the cua-driver computer-use backend to drive backgrounded macOS
windows without stealing keyboard or mouse focus from the foreground app.
All changes target the cua-driver MCP backend and the shared dispatcher.

## cua_backend.py

**Window-aware capture**: capture() now calls list_windows + get_window_state
instead of the removed capture tool. Prefers structuredContent.windows
(MCP 2024-11-05+ cua-driver) for zero-parse window enumeration; falls back
to regex-parsed text for older builds. Stores the selected (pid, window_id)
as sticky context so subsequent action calls do not need a redundant round-trip.

**Action routing**: click/scroll/type_text/key all carry the sticky pid
(and window_id for element-indexed clicks). type_text routes through
type_text_chars (individual key events) rather than AX attribute write --
WebKit AXTextFields reject attribute writes from backgrounded processes.

**Key parsing**: _parse_key_combo splits cmd+s-style strings into
(key, [modifiers]) and routes to hotkey (modifier present) or
press_key (bare key) -- cua-driver actual tool names.

**set_value method**: new set_value(value, element) calls the cua-driver
set_value MCP tool. For AXPopUpButton / HTML select in a backgrounded Safari,
AXPress opens the native macOS popup which closes immediately when the app is
non-frontmost; set_value AX-presses the matching child option directly
(no menu required, no focus steal).

**focus_app**: reimplemented as a pure window-selector (enumerates
list_windows, sets sticky pid/window_id) without ever raising the window
or stealing focus.

**list_apps**: fixed tool name from listApps to list_apps; handles plain-text
response via regex when structured data is absent.

**Structured-content extraction**: _extract_tool_result now surfaces
structuredContent from MCP results, enabling the list_windows window array
without text parsing.

**Helpers**: _parse_windows_from_text, _parse_elements_from_tree,
_split_tree_text, _parse_key_combo extracted as module-level functions.

## schema.py

Added set_value to the action enum with a description explaining when to
prefer it over click (select/popup elements, sliders, no focus steal).
Added value field for set_value payloads.

## tool.py

Routed set_value action through _dispatch to backend.set_value.
Added set_value to _DESTRUCTIVE_ACTIONS (approval-gated).
Fixed MIME-type detection in _capture_response: cua-driver may return
JPEG; detect from base64 magic bytes (/9j/ -> image/jpeg, else image/png)
rather than hardcoding image/png.

## agent/display.py + run_agent.py

Guard _detect_tool_failure and result-preview logic against non-string
function_result values: multimodal tool results (dicts with _multimodal=True)
are not string-sliceable; treat them as successes and fall back to str()
for length/preview.
2026-05-08 11:07:38 -07:00
Teknium
850413f120 feat(computer-use): cua-driver backend, universal any-model schema
Background macOS desktop control via cua-driver MCP — does NOT steal the
user's cursor or keyboard focus, works with any tool-capable model.

Replaces the Anthropic-native `computer_20251124` approach from the
abandoned #4562 with a generic OpenAI function-calling schema plus SOM
(set-of-mark) captures so Claude, GPT, Gemini, and open models can all
drive the desktop via numbered element indices.

- `tools/computer_use/` package — swappable ComputerUseBackend ABC +
  CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
- Universal `computer_use` tool with one schema for all providers.
  Actions: capture (som/vision/ax), click, double_click, right_click,
  middle_click, drag, scroll, type, key, wait, list_apps, focus_app.
- Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style
  `content: [text, image_url]` parts) that flows through
  handle_function_call into the tool message. Anthropic adapter converts
  into native `tool_result` image blocks; OpenAI-compatible providers
  get the parts list directly.
- Image eviction in convert_messages_to_anthropic: only the 3 most
  recent screenshots carry real image data; older ones become text
  placeholders to cap per-turn token cost.
- Context compressor image pruning: old multimodal tool results have
  their image parts stripped instead of being skipped.
- Image-aware token estimation: each image counts as a flat 1500 tokens
  instead of its base64 char length (~1MB would have registered as
  ~250K tokens before).
- COMPUTER_USE_GUIDANCE system-prompt block — injected when the toolset
  is active.
- Session DB persistence strips base64 from multimodal tool messages.
- Trajectory saver normalises multimodal messages to text-only.
- `hermes tools` post-setup installs cua-driver via the upstream script
  and prints permission-grant instructions.
- CLI approval callback wired so destructive computer_use actions go
  through the same prompt_toolkit approval dialog as terminal commands.
- Hard safety guards at the tool level: blocked type patterns
  (curl|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash,
  force delete, lock screen, log out).
- Skill `apple/macos-computer-use/SKILL.md` — universal (model-agnostic)
  workflow guide.
- Docs: `user-guide/features/computer-use.md` plus reference catalog
  entries.

44 new tests in tests/tools/test_computer_use.py covering schema
shape (universal, not Anthropic-native), dispatch routing, safety
guards, multimodal envelope, Anthropic adapter conversion, screenshot
eviction, context compressor pruning, image-aware token estimation,
run_agent helpers, and universality guarantees.

469/469 pass across tests/tools/test_computer_use.py + the affected
agent/ test suites.

- `model_tools.py` provider-gating: the tool is available to every
  provider. Providers without multi-part tool message support will see
  text-only tool results (graceful degradation via `text_summary`).
- Anthropic server-side `clear_tool_uses_20250919` — deferred;
  client-side eviction + compressor pruning cover the same cost ceiling
  without a beta header.

- macOS only. cua-driver uses private SkyLight SPIs
  (SLEventPostToPid, SLPSPostEventRecordTo,
  _AXObserverAddNotificationAndCheckRemote) that can break on any macOS
  update. Pin with HERMES_CUA_DRIVER_VERSION.
- Requires Accessibility + Screen Recording permissions — the post-setup
  prints the Settings path.

Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic-
native schema). Credit @0xbyt4 for the original #3816 groundwork whose
context/eviction/token design is preserved here in generic form.
2026-05-08 11:07:38 -07:00
Teknium
e63364b8df
revert: computer-use cua-driver (PR #16919) (#16927)
Reverts PR #16919 (commits dad10a78d, 413ee1a28, b4a8031b2, afb958829)
which was merged prematurely. Restoring the pre-merge state so #14817
and #15328 can be revisited as standing PRs.

Reverted commits:
- afb958829 fix(computer-use): harden image-rejection fallback + AUTHOR_MAP
- b4a8031b2 fix(computer-use): unwrap _multimodal tool results
- 413ee1a28 feat(computer-use): background focus-safe backend
- dad10a78d feat(computer-use): cua-driver backend, universal any-model schema

Co-authored-by: teknium1 <teknium@users.noreply.github.com>
2026-04-28 01:57:21 -07:00
ddupont
413ee1a286 feat(computer-use): background focus-safe backend — set_value, structured windows, MIME detection
Extends the cua-driver computer-use backend to drive backgrounded macOS
windows without stealing keyboard or mouse focus from the foreground app.
All changes target the cua-driver MCP backend and the shared dispatcher.

## cua_backend.py

**Window-aware capture**: capture() now calls list_windows + get_window_state
instead of the removed capture tool. Prefers structuredContent.windows
(MCP 2024-11-05+ cua-driver) for zero-parse window enumeration; falls back
to regex-parsed text for older builds. Stores the selected (pid, window_id)
as sticky context so subsequent action calls do not need a redundant round-trip.

**Action routing**: click/scroll/type_text/key all carry the sticky pid
(and window_id for element-indexed clicks). type_text routes through
type_text_chars (individual key events) rather than AX attribute write --
WebKit AXTextFields reject attribute writes from backgrounded processes.

**Key parsing**: _parse_key_combo splits cmd+s-style strings into
(key, [modifiers]) and routes to hotkey (modifier present) or
press_key (bare key) -- cua-driver actual tool names.

**set_value method**: new set_value(value, element) calls the cua-driver
set_value MCP tool. For AXPopUpButton / HTML select in a backgrounded Safari,
AXPress opens the native macOS popup which closes immediately when the app is
non-frontmost; set_value AX-presses the matching child option directly
(no menu required, no focus steal).

**focus_app**: reimplemented as a pure window-selector (enumerates
list_windows, sets sticky pid/window_id) without ever raising the window
or stealing focus.

**list_apps**: fixed tool name from listApps to list_apps; handles plain-text
response via regex when structured data is absent.

**Structured-content extraction**: _extract_tool_result now surfaces
structuredContent from MCP results, enabling the list_windows window array
without text parsing.

**Helpers**: _parse_windows_from_text, _parse_elements_from_tree,
_split_tree_text, _parse_key_combo extracted as module-level functions.

## schema.py

Added set_value to the action enum with a description explaining when to
prefer it over click (select/popup elements, sliders, no focus steal).
Added value field for set_value payloads.

## tool.py

Routed set_value action through _dispatch to backend.set_value.
Added set_value to _DESTRUCTIVE_ACTIONS (approval-gated).
Fixed MIME-type detection in _capture_response: cua-driver may return
JPEG; detect from base64 magic bytes (/9j/ -> image/jpeg, else image/png)
rather than hardcoding image/png.

## agent/display.py + run_agent.py

Guard _detect_tool_failure and result-preview logic against non-string
function_result values: multimodal tool results (dicts with _multimodal=True)
are not string-sliceable; treat them as successes and fall back to str()
for length/preview.
2026-04-28 01:46:36 -07:00
Teknium
dad10a78d0 feat(computer-use): cua-driver backend, universal any-model schema
Background macOS desktop control via cua-driver MCP — does NOT steal the
user's cursor or keyboard focus, works with any tool-capable model.

Replaces the Anthropic-native `computer_20251124` approach from the
abandoned #4562 with a generic OpenAI function-calling schema plus SOM
(set-of-mark) captures so Claude, GPT, Gemini, and open models can all
drive the desktop via numbered element indices.

- `tools/computer_use/` package — swappable ComputerUseBackend ABC +
  CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
- Universal `computer_use` tool with one schema for all providers.
  Actions: capture (som/vision/ax), click, double_click, right_click,
  middle_click, drag, scroll, type, key, wait, list_apps, focus_app.
- Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style
  `content: [text, image_url]` parts) that flows through
  handle_function_call into the tool message. Anthropic adapter converts
  into native `tool_result` image blocks; OpenAI-compatible providers
  get the parts list directly.
- Image eviction in convert_messages_to_anthropic: only the 3 most
  recent screenshots carry real image data; older ones become text
  placeholders to cap per-turn token cost.
- Context compressor image pruning: old multimodal tool results have
  their image parts stripped instead of being skipped.
- Image-aware token estimation: each image counts as a flat 1500 tokens
  instead of its base64 char length (~1MB would have registered as
  ~250K tokens before).
- COMPUTER_USE_GUIDANCE system-prompt block — injected when the toolset
  is active.
- Session DB persistence strips base64 from multimodal tool messages.
- Trajectory saver normalises multimodal messages to text-only.
- `hermes tools` post-setup installs cua-driver via the upstream script
  and prints permission-grant instructions.
- CLI approval callback wired so destructive computer_use actions go
  through the same prompt_toolkit approval dialog as terminal commands.
- Hard safety guards at the tool level: blocked type patterns
  (curl|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash,
  force delete, lock screen, log out).
- Skill `apple/macos-computer-use/SKILL.md` — universal (model-agnostic)
  workflow guide.
- Docs: `user-guide/features/computer-use.md` plus reference catalog
  entries.

44 new tests in tests/tools/test_computer_use.py covering schema
shape (universal, not Anthropic-native), dispatch routing, safety
guards, multimodal envelope, Anthropic adapter conversion, screenshot
eviction, context compressor pruning, image-aware token estimation,
run_agent helpers, and universality guarantees.

469/469 pass across tests/tools/test_computer_use.py + the affected
agent/ test suites.

- `model_tools.py` provider-gating: the tool is available to every
  provider. Providers without multi-part tool message support will see
  text-only tool results (graceful degradation via `text_summary`).
- Anthropic server-side `clear_tool_uses_20250919` — deferred;
  client-side eviction + compressor pruning cover the same cost ceiling
  without a beta header.

- macOS only. cua-driver uses private SkyLight SPIs
  (SLEventPostToPid, SLPSPostEventRecordTo,
  _AXObserverAddNotificationAndCheckRemote) that can break on any macOS
  update. Pin with HERMES_CUA_DRIVER_VERSION.
- Requires Accessibility + Screen Recording permissions — the post-setup
  prints the Settings path.

Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic-
native schema). Credit @0xbyt4 for the original #3816 groundwork whose
context/eviction/token design is preserved here in generic form.
2026-04-28 01:46:36 -07:00