mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-26 01:01:40 +00:00
feat(computer-use): cua-driver backend, universal any-model schema
Background macOS desktop control via cua-driver MCP. Does NOT steal the user's cursor or keyboard focus; works with any tool-capable model. Replaces the Anthropic-native `computer_20251124` approach from the abandoned #4562 with a generic OpenAI function-calling schema plus SOM (set-of-mark) captures so Claude, GPT, Gemini, and open models can all drive the desktop via numbered element indices.

## What this adds

- `tools/computer_use/` package: swappable ComputerUseBackend ABC + CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
- Universal `computer_use` tool with one schema for all providers. Actions: capture (som/vision/ax), click, double_click, right_click, middle_click, drag, scroll, type, key, wait, list_apps, focus_app.
- Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style `content: [text, image_url]` parts) that flows through handle_function_call into the tool message. Anthropic adapter converts into native `tool_result` image blocks; OpenAI-compatible providers get the parts list directly.
- Image eviction in convert_messages_to_anthropic: only the 3 most recent screenshots carry real image data; older ones become text placeholders to cap per-turn token cost.
- Context compressor image pruning: old multimodal tool results have their image parts stripped instead of being skipped.
- Image-aware token estimation: each image counts as a flat 1500 tokens instead of its base64 char length (~1MB would have registered as ~250K tokens before).
- COMPUTER_USE_GUIDANCE system-prompt block, injected when the toolset is active.
- Session DB persistence strips base64 from multimodal tool messages.
- Trajectory saver normalises multimodal messages to text-only.
- `hermes tools` post-setup installs cua-driver via the upstream script and prints permission-grant instructions.
- CLI approval callback wired so destructive computer_use actions go through the same prompt_toolkit approval dialog as terminal commands.
- Hard safety guards at the tool level: blocked type patterns (curl|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash, force delete, lock screen, log out).
- Skill `apple/macos-computer-use/SKILL.md`: universal (model-agnostic) workflow guide.
- Docs: `user-guide/features/computer-use.md` plus reference catalog entries.

## Tests

44 new tests in tests/tools/test_computer_use.py covering schema shape (universal, not Anthropic-native), dispatch routing, safety guards, multimodal envelope, Anthropic adapter conversion, screenshot eviction, context compressor pruning, image-aware token estimation, run_agent helpers, and universality guarantees. 469/469 pass across tests/tools/test_computer_use.py + the affected agent/ test suites.

## Not in this PR

- `model_tools.py` provider-gating: the tool is available to every provider. Providers without multi-part tool message support will see text-only tool results (graceful degradation via `text_summary`).
- Anthropic server-side `clear_tool_uses_20250919`: deferred; client-side eviction + compressor pruning cover the same cost ceiling without a beta header.

## Caveats

- macOS only. cua-driver uses private SkyLight SPIs (SLEventPostToPid, SLPSPostEventRecordTo, _AXObserverAddNotificationAndCheckRemote) that can break on any macOS update. Pin with HERMES_CUA_DRIVER_VERSION.
- Requires Accessibility + Screen Recording permissions; the post-setup prints the Settings path.

Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic-native schema). Credit @0xbyt4 for the original #3816 groundwork whose context/eviction/token design is preserved here in generic form.
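
The screenshot-eviction rule described in the commit message (only the 3 most recent screenshots keep real image data, older ones become text placeholders) can be sketched roughly as below. This is an illustration under assumptions, not the code in `convert_messages_to_anthropic`: the function name `evict_old_screenshots`, the placeholder wording, and the OpenAI-style list-shaped `content` are all hypothetical stand-ins.

```python
from typing import Any, Dict, List

# Hypothetical placeholder text; the real wording is not shown in this commit page.
PLACEHOLDER = "[screenshot removed to save context]"


def evict_old_screenshots(
    messages: List[Dict[str, Any]], keep: int = 3
) -> List[Dict[str, Any]]:
    """Return a copy of `messages` in which only the `keep` newest images survive.

    Assumes OpenAI-style messages whose `content` is either a string or a
    list of parts such as {"type": "text", ...} / {"type": "image_url", ...}.
    """
    remaining = keep
    out: List[Dict[str, Any]] = []
    # Walk newest-to-oldest so the most recent screenshots claim the budget.
    for msg in reversed(messages):
        content = msg.get("content")
        if isinstance(content, list):
            new_parts = []
            for part in reversed(content):
                is_image = isinstance(part, dict) and part.get("type") == "image_url"
                if is_image and remaining > 0:
                    remaining -= 1
                    new_parts.append(part)
                elif is_image:
                    new_parts.append({"type": "text", "text": PLACEHOLDER})
                else:
                    new_parts.append(part)
            # Shallow-copy the message so the caller's list is left untouched.
            msg = {**msg, "content": list(reversed(new_parts))}
        out.append(msg)
    out.reverse()
    return out
```

Walking the history from the newest message backwards keeps the budget check to a single counter; the input list is not mutated, so the full-resolution history stays available to other consumers.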
parent 24f139e16a
commit b07791db05
23 changed files with 2861 additions and 27 deletions
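
The image-aware token estimation mentioned in the commit message (a flat 1500 tokens per image instead of the base64 length) might look something like the sketch below. The helper name `estimate_tokens` and the 4-characters-per-token heuristic are assumptions; the heuristic is chosen only because it is consistent with the "~1MB would have registered as ~250K tokens" figure quoted above.

```python
from typing import Any, Dict, List, Union

IMAGE_TOKENS = 1500   # flat per-image cost quoted in the commit message
CHARS_PER_TOKEN = 4   # assumed heuristic, not confirmed by the source


def estimate_tokens(content: Union[str, List[Dict[str, Any]]]) -> int:
    """Rough token estimate for a message `content` (string or list of parts)."""
    if isinstance(content, str):
        return len(content) // CHARS_PER_TOKEN
    total = 0
    for part in content:
        if part.get("type") == "image_url":
            # Ignore the base64 payload entirely; charge a fixed cost.
            total += IMAGE_TOKENS
        else:
            total += len(str(part.get("text", ""))) // CHARS_PER_TOKEN
    return total
```

With this shape, a screenshot costs the same whether its base64 payload is 100 KB or 1 MB, which keeps compression thresholds from firing on a single capture.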
tools/computer_use/tool.py
Normal file, 509 lines added
@@ -0,0 +1,509 @@
"""Entry point for the `computer_use` tool.

Universal (any-model) macOS desktop control via cua-driver's background
computer-use primitive. Replaces #4562's Anthropic-native `computer_20251124`
approach — the schema here is standard OpenAI function-calling so every
tool-capable model can drive it.

Return contract
---------------
For text-only results (wait, key, list_apps, focus_app, failures, etc.):
    JSON string.

For captures / actions with `capture_after=True`:
    A dict wrapped as the OpenAI-style multi-part tool-message content:

        {
            "_multimodal": True,
            "content": [
                {"type": "text", "text": "<human-readable summary + SOM index>"},
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64,<b64>"}},
            ],
            "text_summary": "<text used for fallback string content>",
        }

run_agent.py's tool-message builder inspects `_multimodal` and emits a
list-shaped `content` for OpenAI-compatible providers. The Anthropic
adapter splices the base64 image into a `tool_result` block (see
`agent/anthropic_adapter.py`). Every provider that supports multi-part
tool content gets the image; text-only providers see the summary only.
"""

from __future__ import annotations

import json
import logging
import os
import re
import sys
import threading
from typing import Any, Dict, List, Optional, Tuple

from tools.computer_use.backend import (
    ActionResult,
    CaptureResult,
    ComputerUseBackend,
    UIElement,
)

logger = logging.getLogger(__name__)


# ---------------------------------------------------------------------------
# Approval & safety
# ---------------------------------------------------------------------------

_approval_callback = None


def set_approval_callback(cb) -> None:
    """Register a callback for computer_use approval prompts (used by CLI).

    Matches the terminal_tool._approval_callback pattern. The callback
    receives (action, args, summary) and returns one of:
      "approve_once" | "approve_session" | "always_approve" | "deny".
    """
    global _approval_callback
    _approval_callback = cb


# Actions that read, not mutate. Always allowed.
_SAFE_ACTIONS = frozenset({"capture", "wait", "list_apps"})

# Actions that mutate user-visible state. Go through approval.
_DESTRUCTIVE_ACTIONS = frozenset({
    "click", "double_click", "right_click", "middle_click",
    "drag", "scroll", "type", "key", "focus_app",
})

# Hard-blocked key combinations. Mirrored from #4562 — these are destructive
# regardless of approval level (e.g. logout kills the session Hermes runs in).
_BLOCKED_KEY_COMBOS = {
    frozenset({"cmd", "shift", "backspace"}),  # empty trash
    frozenset({"cmd", "option", "backspace"}),  # force delete
    frozenset({"cmd", "ctrl", "q"}),  # lock screen
    frozenset({"cmd", "shift", "q"}),  # log out
    frozenset({"cmd", "option", "shift", "q"}),  # force log out
}

_KEY_ALIASES = {"command": "cmd", "control": "ctrl", "alt": "option", "⌘": "cmd", "⌥": "option"}


def _canon_key_combo(keys: str) -> frozenset:
    parts = [p.strip().lower() for p in re.split(r"\s*\+\s*", keys) if p.strip()]
    parts = [_KEY_ALIASES.get(p, p) for p in parts]
    return frozenset(parts)


# Dangerous text patterns for the `type` action. Same list as #4562.
_BLOCKED_TYPE_PATTERNS = [
    re.compile(r"curl\s+[^|]*\|\s*bash", re.IGNORECASE),
    re.compile(r"curl\s+[^|]*\|\s*sh", re.IGNORECASE),
    re.compile(r"wget\s+[^|]*\|\s*bash", re.IGNORECASE),
    re.compile(r"\bsudo\s+rm\s+-[rf]", re.IGNORECASE),
    re.compile(r"\brm\s+-rf\s+/\s*$", re.IGNORECASE),
    re.compile(r":\s*\(\)\s*\{\s*:\|:\s*&\s*\}", re.IGNORECASE),  # fork bomb
]


def _is_blocked_type(text: str) -> Optional[str]:
    for pat in _BLOCKED_TYPE_PATTERNS:
        if pat.search(text):
            return pat.pattern
    return None


# ---------------------------------------------------------------------------
# Backend selection — env-swappable for tests
# ---------------------------------------------------------------------------

# Per-process cached backend; lazily instantiated on first call.
_backend_lock = threading.Lock()
_backend: Optional[ComputerUseBackend] = None
# Session-scoped approval state.
_session_auto_approve = False
_always_allow: set = set()  # action names the user unlocked for the session


def _get_backend() -> ComputerUseBackend:
    global _backend
    with _backend_lock:
        if _backend is None:
            backend_name = os.environ.get("HERMES_COMPUTER_USE_BACKEND", "cua").lower()
            if backend_name in ("cua", "cua-driver", ""):
                from tools.computer_use.cua_backend import CuaDriverBackend
                _backend = CuaDriverBackend()
            elif backend_name == "noop":  # pragma: no cover
                _backend = _NoopBackend()
            else:
                raise RuntimeError(f"Unknown HERMES_COMPUTER_USE_BACKEND={backend_name!r}")
            _backend.start()
        return _backend


def reset_backend_for_tests() -> None:  # pragma: no cover
    """Test helper — tear down the cached backend."""
    global _backend, _session_auto_approve, _always_allow
    with _backend_lock:
        if _backend is not None:
            try:
                _backend.stop()
            except Exception:
                pass
        _backend = None
        _session_auto_approve = False
        _always_allow = set()


class _NoopBackend(ComputerUseBackend):  # pragma: no cover
    """Test/CI stub. Records calls; returns trivial results."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []
        self._started = False

    def start(self) -> None: self._started = True
    def stop(self) -> None: self._started = False
    def is_available(self) -> bool: return True

    def capture(self, mode: str = "som", app: Optional[str] = None) -> CaptureResult:
        self.calls.append(("capture", {"mode": mode, "app": app}))
        return CaptureResult(mode=mode, width=1024, height=768, png_b64=None,
                             elements=[], app=app or "", window_title="")

    def click(self, **kw) -> ActionResult:
        self.calls.append(("click", kw))
        return ActionResult(ok=True, action="click")

    def drag(self, **kw) -> ActionResult:
        self.calls.append(("drag", kw))
        return ActionResult(ok=True, action="drag")

    def scroll(self, **kw) -> ActionResult:
        self.calls.append(("scroll", kw))
        return ActionResult(ok=True, action="scroll")

    def type_text(self, text: str) -> ActionResult:
        self.calls.append(("type", {"text": text}))
        return ActionResult(ok=True, action="type")

    def key(self, keys: str) -> ActionResult:
        self.calls.append(("key", {"keys": keys}))
        return ActionResult(ok=True, action="key")

    def list_apps(self) -> List[Dict[str, Any]]:
        self.calls.append(("list_apps", {}))
        return []

    def focus_app(self, app: str, raise_window: bool = False) -> ActionResult:
        self.calls.append(("focus_app", {"app": app, "raise": raise_window}))
        return ActionResult(ok=True, action="focus_app")


# ---------------------------------------------------------------------------
# Dispatch
# ---------------------------------------------------------------------------

def handle_computer_use(args: Dict[str, Any], **kwargs) -> Any:
    """Main entry point — dispatched by tools.registry.

    Returns either a JSON string (text-only) or a dict marked `_multimodal`
    (image + summary) which run_agent.py wraps into the tool message.
    """
    action = (args.get("action") or "").strip().lower()
    if not action:
        return json.dumps({"error": "missing `action`"})

    # Safety: validate actions before approval prompt.
    if action == "type":
        text = args.get("text", "")
        pat = _is_blocked_type(text)
        if pat:
            return json.dumps({
                "error": f"blocked pattern in type text: {pat!r}",
                "hint": "Dangerous shell patterns cannot be typed via computer_use.",
            })

    if action == "key":
        keys = args.get("keys", "")
        combo = _canon_key_combo(keys)
        for blocked in _BLOCKED_KEY_COMBOS:
            if blocked.issubset(combo) and len(blocked) <= len(combo):
                return json.dumps({
                    "error": f"blocked key combo: {sorted(blocked)}",
                    "hint": "Destructive system shortcuts are hard-blocked.",
                })

    # Approval gate (destructive actions only).
    if action in _DESTRUCTIVE_ACTIONS:
        err = _request_approval(action, args)
        if err is not None:
            return err

    # Dispatch to backend.
    try:
        backend = _get_backend()
    except Exception as e:
        return json.dumps({
            "error": f"computer_use backend unavailable: {e}",
            "hint": "Run `hermes tools` and enable Computer Use to install cua-driver.",
        })

    try:
        return _dispatch(backend, action, args)
    except Exception as e:
        logger.exception("computer_use %s failed", action)
        return json.dumps({"error": f"{action} failed: {e}"})


def _request_approval(action: str, args: Dict[str, Any]) -> Optional[str]:
    """Return None if approved, or a JSON error string if denied."""
    global _session_auto_approve, _always_allow
    if _session_auto_approve:
        return None
    if action in _always_allow:
        return None
    cb = _approval_callback
    if cb is None:
        # No CLI approval wired — default allow. Gateway approval is handled
        # one layer out via the normal tool-approval infra.
        return None
    summary = _summarize_action(action, args)
    try:
        verdict = cb(action, args, summary)
    except Exception as e:
        logger.warning("approval callback failed: %s", e)
        verdict = "deny"
    if verdict == "approve_once":
        return None
    if verdict == "approve_session" or verdict == "always_approve":
        _always_allow.add(action)
        if verdict == "always_approve":
            _session_auto_approve = True
        return None
    return json.dumps({"error": "denied by user", "action": action})


def _summarize_action(action: str, args: Dict[str, Any]) -> str:
    if action in ("click", "double_click", "right_click", "middle_click"):
        if args.get("element") is not None:
            return f"{action} element #{args['element']}"
        coord = args.get("coordinate")
        if coord:
            return f"{action} at {tuple(coord)}"
        return action
    if action == "drag":
        src = args.get("from_element") or args.get("from_coordinate")
        dst = args.get("to_element") or args.get("to_coordinate")
        return f"drag {src} → {dst}"
    if action == "scroll":
        return f"scroll {args.get('direction', '?')} x{args.get('amount', 3)}"
    if action == "type":
        text = args.get("text", "")
        return f"type {text[:60]!r}" + ("..." if len(text) > 60 else "")
    if action == "key":
        return f"key {args.get('keys', '')!r}"
    if action == "focus_app":
        return f"focus {args.get('app', '')!r}" + (" (raise)" if args.get("raise_window") else "")
    return action


def _dispatch(backend: ComputerUseBackend, action: str, args: Dict[str, Any]) -> Any:
    capture_after = bool(args.get("capture_after"))

    if action == "capture":
        mode = str(args.get("mode", "som"))
        if mode not in ("som", "vision", "ax"):
            return json.dumps({"error": f"bad mode {mode!r}; use som|vision|ax"})
        cap = backend.capture(mode=mode, app=args.get("app"))
        return _capture_response(cap)

    if action == "wait":
        seconds = float(args.get("seconds", 1.0))
        res = backend.wait(seconds)
        return _text_response(res)

    if action == "list_apps":
        apps = backend.list_apps()
        return json.dumps({"apps": apps, "count": len(apps)})

    if action == "focus_app":
        app = args.get("app")
        if not app:
            return json.dumps({"error": "focus_app requires `app`"})
        res = backend.focus_app(app, raise_window=bool(args.get("raise_window")))
        return _maybe_follow_capture(backend, res, capture_after)

    if action in ("click", "double_click", "right_click", "middle_click"):
        button = args.get("button")
        click_count = 1
        if action == "double_click":
            click_count = 2
        elif action == "right_click":
            button = "right"
        elif action == "middle_click":
            button = "middle"
        else:
            button = button or "left"
        element = args.get("element")
        coord = args.get("coordinate") or (None, None)
        x, y = (coord[0], coord[1]) if coord and coord[0] is not None else (None, None)
        res = backend.click(
            element=element if element is not None else None,
            x=x, y=y, button=button or "left", click_count=click_count,
            modifiers=args.get("modifiers"),
        )
        return _maybe_follow_capture(backend, res, capture_after)

    if action == "drag":
        res = backend.drag(
            from_element=args.get("from_element"),
            to_element=args.get("to_element"),
            from_xy=tuple(args["from_coordinate"]) if args.get("from_coordinate") else None,
            to_xy=tuple(args["to_coordinate"]) if args.get("to_coordinate") else None,
            button=args.get("button", "left"),
            modifiers=args.get("modifiers"),
        )
        return _maybe_follow_capture(backend, res, capture_after)

    if action == "scroll":
        coord = args.get("coordinate") or (None, None)
        res = backend.scroll(
            direction=args.get("direction", "down"),
            amount=int(args.get("amount", 3)),
            element=args.get("element"),
            x=coord[0] if coord and coord[0] is not None else None,
            y=coord[1] if coord and coord[1] is not None else None,
            modifiers=args.get("modifiers"),
        )
        return _maybe_follow_capture(backend, res, capture_after)

    if action == "type":
        res = backend.type_text(args.get("text", ""))
        return _maybe_follow_capture(backend, res, capture_after)

    if action == "key":
        res = backend.key(args.get("keys", ""))
        return _maybe_follow_capture(backend, res, capture_after)

    return json.dumps({"error": f"unknown action {action!r}"})


# ---------------------------------------------------------------------------
# Response shaping
# ---------------------------------------------------------------------------

def _text_response(res: ActionResult) -> str:
    payload: Dict[str, Any] = {"ok": res.ok, "action": res.action}
    if res.message:
        payload["message"] = res.message
    if res.meta:
        payload["meta"] = res.meta
    return json.dumps(payload)


def _capture_response(cap: CaptureResult) -> Any:
    element_index = _format_elements(cap.elements)
    summary_lines = [
        f"capture mode={cap.mode} {cap.width}x{cap.height}"
        + (f" app={cap.app}" if cap.app else "")
        + (f" window={cap.window_title!r}" if cap.window_title else ""),
        f"{len(cap.elements)} interactable element(s):",
    ]
    if element_index:
        summary_lines.extend(element_index)
    summary = "\n".join(summary_lines)

    if cap.png_b64 and cap.mode != "ax":
        return {
            "_multimodal": True,
            "content": [
                {"type": "text", "text": summary},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{cap.png_b64}"}},
            ],
            "text_summary": summary,
            "meta": {"mode": cap.mode, "width": cap.width, "height": cap.height,
                     "elements": len(cap.elements), "png_bytes": cap.png_bytes_len},
        }
    # AX-only (or image missing): text path.
    return json.dumps({
        "mode": cap.mode,
        "width": cap.width,
        "height": cap.height,
        "app": cap.app,
        "window_title": cap.window_title,
        "elements": [_element_to_dict(e) for e in cap.elements],
        "summary": summary,
    })


def _maybe_follow_capture(
    backend: ComputerUseBackend, res: ActionResult, do_capture: bool,
) -> Any:
    if not do_capture:
        return _text_response(res)
    try:
        cap = backend.capture(mode="som")
    except Exception as e:
        logger.warning("follow-up capture failed: %s", e)
        return _text_response(res)
    # Combine action summary with the capture.
    resp = _capture_response(cap)
    if isinstance(resp, dict) and resp.get("_multimodal"):
        prefix = f"[{res.action}] ok={res.ok}" + (f" — {res.message}" if res.message else "")
        resp["content"][0]["text"] = prefix + "\n\n" + resp["content"][0]["text"]
        resp["text_summary"] = prefix + "\n\n" + resp["text_summary"]
        return resp
    # Fallback: action + text capture merged.
    try:
        data = json.loads(resp)
    except (TypeError, json.JSONDecodeError):
        data = {"capture": resp}
    data["action"] = res.action
    data["ok"] = res.ok
    if res.message:
        data["message"] = res.message
    return json.dumps(data)


def _format_elements(elements: List[UIElement], max_lines: int = 40) -> List[str]:
    out: List[str] = []
    for e in elements[:max_lines]:
        label = e.label.replace("\n", " ")[:60]
        out.append(f" #{e.index} {e.role} {label!r} @ {e.bounds}"
                   + (f" [{e.app}]" if e.app else ""))
    if len(elements) > max_lines:
        out.append(f" ... +{len(elements) - max_lines} more (call capture with app= to narrow)")
    return out


def _element_to_dict(e: UIElement) -> Dict[str, Any]:
    return {
        "index": e.index,
        "role": e.role,
        "label": e.label,
        "bounds": list(e.bounds),
        "app": e.app,
    }


# ---------------------------------------------------------------------------
# Availability check (used by the tool registry check_fn)
# ---------------------------------------------------------------------------

def check_computer_use_requirements() -> bool:
    """Return True iff computer_use can run on this host.

    Conditions: macOS + cua-driver binary installed (or override via env).
    """
    if sys.platform != "darwin":
        return False
    from tools.computer_use.cua_backend import cua_driver_binary_available
    return cua_driver_binary_available()


def get_computer_use_schema() -> Dict[str, Any]:
    from tools.computer_use.schema import COMPUTER_USE_SCHEMA
    return COMPUTER_USE_SCHEMA
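
The hard block on key combos in `handle_computer_use` canonicalises the requested keys and then checks whether any blocked set is contained in the result. A self-contained sketch of that check follows, using a trimmed copy of the alias and block tables from the file. `is_blocked_combo` is a hypothetical helper name introduced here for illustration; it does not exist in `tool.py`.

```python
import re
from typing import FrozenSet, Optional

# Trimmed copies of the tables in tool.py, for illustration only.
KEY_ALIASES = {"command": "cmd", "control": "ctrl", "alt": "option",
               "⌘": "cmd", "⌥": "option"}
BLOCKED_KEY_COMBOS = {
    frozenset({"cmd", "shift", "backspace"}),  # empty trash
    frozenset({"cmd", "ctrl", "q"}),           # lock screen
    frozenset({"cmd", "shift", "q"}),          # log out
}


def canon_key_combo(keys: str) -> FrozenSet[str]:
    """Split on '+', lower-case, and map aliases to canonical names."""
    parts = [p.strip().lower() for p in re.split(r"\s*\+\s*", keys) if p.strip()]
    return frozenset(KEY_ALIASES.get(p, p) for p in parts)


def is_blocked_combo(keys: str) -> Optional[FrozenSet[str]]:
    """Return the blocked set contained in `keys`, or None if it is allowed."""
    combo = canon_key_combo(keys)
    for blocked in BLOCKED_KEY_COMBOS:
        if blocked <= combo:  # subset test, as in handle_computer_use
            return blocked
    return None
```

Because the comparison is a subset test on a set, key order and spelling variants ("Command + Shift + Q", "⌘+shift+q") are treated alike, and adding extra modifiers to a blocked combo does not bypass the guard.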
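
The return contract in the module docstring (JSON string for text-only results, a `_multimodal` dict for captures) implies a small branching step on the consumer side. The sketch below shows one way a caller could turn either shape into an OpenAI-style tool message. It is not the builder in `run_agent.py`; `build_tool_message` and the `supports_multipart` flag are hypothetical names used only for this example.

```python
import json
from typing import Any, Dict, Union


def build_tool_message(
    tool_call_id: str,
    result: Union[str, Dict[str, Any]],
    supports_multipart: bool,
) -> Dict[str, Any]:
    """Wrap a computer_use result into a tool message (illustrative only)."""
    if isinstance(result, dict) and result.get("_multimodal"):
        if supports_multipart:
            # Provider accepts list-shaped content: pass text + image parts.
            content: Any = result["content"]
        else:
            # Graceful degradation: fall back to the plain-text summary.
            content = result.get("text_summary", "")
    elif isinstance(result, str):
        content = result  # already a JSON string
    else:
        content = json.dumps(result)
    return {"role": "tool", "tool_call_id": tool_call_id, "content": content}
```

For example, a capture result sent to a text-only provider would carry only `text_summary`, while the same result sent to a multi-part provider would carry the `[text, image_url]` list unchanged.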