hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-30 06:41:51 +00:00

Author	SHA1	Message	Date
Rodrigo	fbdca64f73	fix(computer-use): skip capture_after when action failed (ok=False) _maybe_follow_capture() issued a follow-up screenshot unconditionally when capture_after=True, even when res.ok=False. The model then received a normal-looking screenshot alongside an error message, and in practice it often ignored ok=False and proceeded as if the action had succeeded. Fix: return _text_response(res) early when res.ok is False so the model receives only the error and can decide how to recover. Tests added: - test_capture_after_skipped_when_action_failed: patches click to return ok=False and asserts no capture call is issued. - test_capture_after_fires_when_action_succeeds: ensures the happy path still triggers the follow-up capture.	2026-05-22 01:19:01 -07:00
Rodrigo	c52cd48e25	fix(computer-use): add set_value to ComputerUseBackend ABC and _NoopBackend stub _dispatch() routes action="set_value" to backend.set_value(), but: - ComputerUseBackend did not declare set_value as @abstractmethod, so subclasses could silently omit it without a TypeError at class load time. - _NoopBackend (the test/CI stub) had no set_value method at all, causing AttributeError in any test that exercises the set_value action path. Fix: - Add set_value as @abstractmethod to ComputerUseBackend in backend.py. - Add a recording stub in _NoopBackend in tool.py. - Add two TestDispatch cases: one verifying the call reaches the backend, one verifying the missing-value guard returns a clean error.	2026-05-22 01:14:15 -07:00
briandevans	280dd4513a	fix(computer-use): address Copilot review on max_elements cap Four findings from Copilot's review on PR #22891, all in the AX elements-array cap added by 22fa1ed: 1. The truncation note ("response truncated to N of M elements") was appended unconditionally — including in the som/vision multimodal path, whose response carries a screenshot rather than an `elements` array. The note described a payload field that wasn't present. Moved the note into the AX-text branch where the array actually appears. 2. `_format_elements(cap.elements)` ran on the full untrimmed list with its own `max_lines=40` cap, so a caller passing `max_elements=10` would see summary lines referencing `#11..#40` even though the JSON `elements` array only held #1..#10. Format on `visible_elements` instead so the summary indices always exist in the response. 3. `_coerce_max_elements` enforced a lower bound but no upper bound, so `max_elements=10_000_000` silently disabled the safeguard and reintroduced the original context-blow-up. Added a hard cap (`_MAX_ALLOWED_MAX_ELEMENTS = 1000`) that clamps oversized values. 4. The schema string said "Default 100" but the property carried no `default` field, and claimed `max_elements` had no effect on som/ vision while the image-missing fallback path can still return an elements array. Added `"default": 100`, `"maximum": 1000`, and clarified the fallback-path wording. Each finding gets a regression test: - test_capture_ax_clamps_oversized_max_elements_to_hard_cap - test_capture_ax_summary_indices_match_returned_elements - test_capture_multimodal_summary_omits_truncation_note - test_schema_max_elements_documents_default_and_upper_bound Verified with `pytest tests/tools/test_computer_use.py` (53 passed, including the 5 new cases). Confirmed each new test fails on the pre-fix code path before applying the production change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 19:07:32 -07:00
briandevans	bb694bad42	fix(computer-use): cap AX `elements` array to prevent context blowup (#22865 ) `computer_use(action='capture', mode='ax')` returned the full AX element list verbatim in the JSON response. Dense Electron / Obsidian / JetBrains UIs publish 500+ AX nodes (one reproduction in #22865 returned 597 elements against Obsidian), so a single capture could consume enough context to trigger compression failures or render the session unusable. The human-readable `_format_elements` summary is already capped at 40 lines, so the truncation gap was invisible to anyone reading the summary output. Add a `max_elements` argument to the tool schema, default 100, that trims the AX `elements` array. When the cap fires, the response surfaces `total_elements` and `truncated_elements` and appends a "raise max_elements or pass app= to narrow" hint to the summary so the model knows the JSON view is partial and can re-issue with a tighter scope. Validation is centralized in `_coerce_max_elements`: missing / non-integer / sub-1 inputs fall back to the default cap, so the protection can never be silently disabled by a malformed tool-call argument. The cap only affects AX-mode JSON; `mode='som'` and `mode='vision'` keep returning a screenshot + image-aware summary unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 19:07:32 -07:00
briandevans	5aa4727f34	fix(computer-use): surface app=… filter no-match instead of silently using frontmost (#24170 bug 1) `CuaDriverBackend.capture(app=X)` and `focus_app(app=X)` silently fell back to the frontmost on-screen window when X matched no app — typically a menu-bar utility (e.g. "Fuwari" in the bug reporter's case) rather than the requested app. The agent then received UI elements for the wrong app and clicked / typed into it. The root cause is a localized macOS app name mismatch: `list_windows` returns the localized `app_name` (e.g. "計算機" on a Japanese/Chinese system) but callers naturally pass the English name ("Calculator"). The substring filter doesn't match, and the code falls through to picking the frontmost window with no signal that the filter was effectively dropped. Fix: - `capture(app=…)`: when the filter matches nothing, return a `CaptureResult` with empty `app`/`elements` and a diagnostic `window_title` pointing the caller at `list_apps` and noting the localized-name convention. `_active_pid` / `_active_window_id` are left untouched so a subsequent action doesn't inadvertently hit the wrong process. - `focus_app(app=…)`: when the filter matches nothing, set `target = None` and let the existing `return ActionResult(ok=False, …, "No on-screen window found for app …")` path fire instead of falsely reporting success on the frontmost window. This addresses bug 1 only from #24170. Bugs 2 & 5 are addressed in #30046; bugs 3 & 4 in #30032.	2026-05-21 17:15:35 -07:00
Bartok9	4cc18877c6	fix(computer_use): preserve app context for capture_after; fix element label parsing (#24170 bugs 2 & 5) Bug 2 (capture_after=True loses app context): _maybe_follow_capture called backend.capture(mode='som') with no app=, causing cua-driver to capture the frontmost window instead of the app targeted by the preceding capture/focus_app. Fix: track _last_app on CuaDriverBackend and thread it through the follow-up capture call so the same app is re-captured regardless of which window has OS focus. Bug 5 (element labels stripped in capture results): _ELEMENT_LINE_RE matched the classic ' - [N] AXRole "label"' format but not the '[N] AXRole (order) id=Label' format introduced in cua-driver v0.1.6. All element labels were silently dropped as empty strings, making element identification impossible. Fix: extend regex to capture both group(3) (quoted label) and group(4) (id= label), and update _parse_elements_from_tree to use group(4) as fallback. Both old and new cua-driver output now produce populated UIElement.label values. focus_app() now also sets _last_app so that capture_after= on any subsequent action re-targets the focused app. 5 new regression tests added. Part of #24170 (bugs 1 and 3/4 addressed separately).	2026-05-21 14:19:09 -07:00
liuhao1024	18cd1e5c72	fix(computer_use): correct type_text MCP tool name and implement drag action Bug 3: The cua_backend type_text() method called MCP tool 'type_text_chars' which does not exist in current cua-driver. Changed to 'type_text' which is the correct MCP tool name. Bug 4: The drag() method returned a hardcoded 'not supported' error even though cua-driver exposes a 'drag' MCP tool. Implemented proper drag dispatching with coordinate-based and element-based targeting. Added dispatch-level validation for drag to ensure from/to coordinates or elements are provided before calling any backend. Fixes #24170 (bugs 3 and 4)	2026-05-21 14:08:28 -07:00
helix4u	a28add199d	fix(agent): keep image tool results from poisoning text-only sessions	2026-05-14 14:52:15 -07:00
Teknium	850413f120	feat(computer-use): cua-driver backend, universal any-model schema Background macOS desktop control via cua-driver MCP — does NOT steal the user's cursor or keyboard focus, works with any tool-capable model. Replaces the Anthropic-native `computer_20251124` approach from the abandoned #4562 with a generic OpenAI function-calling schema plus SOM (set-of-mark) captures so Claude, GPT, Gemini, and open models can all drive the desktop via numbered element indices. - `tools/computer_use/` package — swappable ComputerUseBackend ABC + CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary). - Universal `computer_use` tool with one schema for all providers. Actions: capture (som/vision/ax), click, double_click, right_click, middle_click, drag, scroll, type, key, wait, list_apps, focus_app. - Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style `content: [text, image_url]` parts) that flows through handle_function_call into the tool message. Anthropic adapter converts into native `tool_result` image blocks; OpenAI-compatible providers get the parts list directly. - Image eviction in convert_messages_to_anthropic: only the 3 most recent screenshots carry real image data; older ones become text placeholders to cap per-turn token cost. - Context compressor image pruning: old multimodal tool results have their image parts stripped instead of being skipped. - Image-aware token estimation: each image counts as a flat 1500 tokens instead of its base64 char length (~1MB would have registered as ~250K tokens before). - COMPUTER_USE_GUIDANCE system-prompt block — injected when the toolset is active. - Session DB persistence strips base64 from multimodal tool messages. - Trajectory saver normalises multimodal messages to text-only. - `hermes tools` post-setup installs cua-driver via the upstream script and prints permission-grant instructions. - CLI approval callback wired so destructive computer_use actions go through the same prompt_toolkit approval dialog as terminal commands. - Hard safety guards at the tool level: blocked type patterns (curl\|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash, force delete, lock screen, log out). - Skill `apple/macos-computer-use/SKILL.md` — universal (model-agnostic) workflow guide. - Docs: `user-guide/features/computer-use.md` plus reference catalog entries. 44 new tests in tests/tools/test_computer_use.py covering schema shape (universal, not Anthropic-native), dispatch routing, safety guards, multimodal envelope, Anthropic adapter conversion, screenshot eviction, context compressor pruning, image-aware token estimation, run_agent helpers, and universality guarantees. 469/469 pass across tests/tools/test_computer_use.py + the affected agent/ test suites. - `model_tools.py` provider-gating: the tool is available to every provider. Providers without multi-part tool message support will see text-only tool results (graceful degradation via `text_summary`). - Anthropic server-side `clear_tool_uses_20250919` — deferred; client-side eviction + compressor pruning cover the same cost ceiling without a beta header. - macOS only. cua-driver uses private SkyLight SPIs (SLEventPostToPid, SLPSPostEventRecordTo, _AXObserverAddNotificationAndCheckRemote) that can break on any macOS update. Pin with HERMES_CUA_DRIVER_VERSION. - Requires Accessibility + Screen Recording permissions — the post-setup prints the Settings path. Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic- native schema). Credit @0xbyt4 for the original #3816 groundwork whose context/eviction/token design is preserved here in generic form.	2026-05-08 11:07:38 -07:00
Teknium	e63364b8df	revert: computer-use cua-driver (PR #16919 ) (#16927 ) Reverts PR #16919 (commits `dad10a78d`, `413ee1a28`, `b4a8031b2`, `afb958829`) which was merged prematurely. Restoring the pre-merge state so #14817 and #15328 can be revisited as standing PRs. Reverted commits: - `afb958829` fix(computer-use): harden image-rejection fallback + AUTHOR_MAP - `b4a8031b2` fix(computer-use): unwrap _multimodal tool results - `413ee1a28` feat(computer-use): background focus-safe backend - `dad10a78d` feat(computer-use): cua-driver backend, universal any-model schema Co-authored-by: teknium1 <teknium@users.noreply.github.com>	2026-04-28 01:57:21 -07:00
Teknium	dad10a78d0	feat(computer-use): cua-driver backend, universal any-model schema Background macOS desktop control via cua-driver MCP — does NOT steal the user's cursor or keyboard focus, works with any tool-capable model. Replaces the Anthropic-native `computer_20251124` approach from the abandoned #4562 with a generic OpenAI function-calling schema plus SOM (set-of-mark) captures so Claude, GPT, Gemini, and open models can all drive the desktop via numbered element indices. - `tools/computer_use/` package — swappable ComputerUseBackend ABC + CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary). - Universal `computer_use` tool with one schema for all providers. Actions: capture (som/vision/ax), click, double_click, right_click, middle_click, drag, scroll, type, key, wait, list_apps, focus_app. - Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style `content: [text, image_url]` parts) that flows through handle_function_call into the tool message. Anthropic adapter converts into native `tool_result` image blocks; OpenAI-compatible providers get the parts list directly. - Image eviction in convert_messages_to_anthropic: only the 3 most recent screenshots carry real image data; older ones become text placeholders to cap per-turn token cost. - Context compressor image pruning: old multimodal tool results have their image parts stripped instead of being skipped. - Image-aware token estimation: each image counts as a flat 1500 tokens instead of its base64 char length (~1MB would have registered as ~250K tokens before). - COMPUTER_USE_GUIDANCE system-prompt block — injected when the toolset is active. - Session DB persistence strips base64 from multimodal tool messages. - Trajectory saver normalises multimodal messages to text-only. - `hermes tools` post-setup installs cua-driver via the upstream script and prints permission-grant instructions. - CLI approval callback wired so destructive computer_use actions go through the same prompt_toolkit approval dialog as terminal commands. - Hard safety guards at the tool level: blocked type patterns (curl\|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash, force delete, lock screen, log out). - Skill `apple/macos-computer-use/SKILL.md` — universal (model-agnostic) workflow guide. - Docs: `user-guide/features/computer-use.md` plus reference catalog entries. 44 new tests in tests/tools/test_computer_use.py covering schema shape (universal, not Anthropic-native), dispatch routing, safety guards, multimodal envelope, Anthropic adapter conversion, screenshot eviction, context compressor pruning, image-aware token estimation, run_agent helpers, and universality guarantees. 469/469 pass across tests/tools/test_computer_use.py + the affected agent/ test suites. - `model_tools.py` provider-gating: the tool is available to every provider. Providers without multi-part tool message support will see text-only tool results (graceful degradation via `text_summary`). - Anthropic server-side `clear_tool_uses_20250919` — deferred; client-side eviction + compressor pruning cover the same cost ceiling without a beta header. - macOS only. cua-driver uses private SkyLight SPIs (SLEventPostToPid, SLPSPostEventRecordTo, _AXObserverAddNotificationAndCheckRemote) that can break on any macOS update. Pin with HERMES_CUA_DRIVER_VERSION. - Requires Accessibility + Screen Recording permissions — the post-setup prints the Settings path. Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic- native schema). Credit @0xbyt4 for the original #3816 groundwork whose context/eviction/token design is preserved here in generic form.	2026-04-28 01:46:36 -07:00

11 commits