hermes-agent/website/docs/reference/toolsets-reference.md
Teknium b07791db05
feat(computer-use): cua-driver backend, universal any-model schema
Background macOS desktop control via cua-driver MCP — does NOT steal the
user's cursor or keyboard focus, works with any tool-capable model.

Replaces the Anthropic-native `computer_20251124` approach from the
abandoned #4562 with a generic OpenAI function-calling schema plus SOM
(set-of-mark) captures so Claude, GPT, Gemini, and open models can all
drive the desktop via numbered element indices.

## What this adds

- `tools/computer_use/` package — swappable ComputerUseBackend ABC +
  CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
- Universal `computer_use` tool with one schema for all providers.
  Actions: capture (som/vision/ax), click, double_click, right_click,
  middle_click, drag, scroll, type, key, wait, list_apps, focus_app.
- Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style
  `content: [text, image_url]` parts) that flows through
  handle_function_call into the tool message. Anthropic adapter converts
  into native `tool_result` image blocks; OpenAI-compatible providers
  get the parts list directly.
- Image eviction in convert_messages_to_anthropic: only the 3 most
  recent screenshots carry real image data; older ones become text
  placeholders to cap per-turn token cost.
- Context compressor image pruning: old multimodal tool results have
  their image parts stripped instead of being skipped.
- Image-aware token estimation: each image counts as a flat 1500 tokens
  instead of its base64 char length (~1MB would have registered as
  ~250K tokens before).
- COMPUTER_USE_GUIDANCE system-prompt block — injected when the toolset
  is active.
- Session DB persistence strips base64 from multimodal tool messages.
- Trajectory saver normalises multimodal messages to text-only.
- `hermes tools` post-setup installs cua-driver via the upstream script
  and prints permission-grant instructions.
- CLI approval callback wired so destructive computer_use actions go
  through the same prompt_toolkit approval dialog as terminal commands.
- Hard safety guards at the tool level: blocked type patterns
  (curl|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash,
  force delete, lock screen, log out).
- Skill `apple/macos-computer-use/SKILL.md` — universal (model-agnostic)
  workflow guide.
- Docs: `user-guide/features/computer-use.md` plus reference catalog
  entries.

## Tests

44 new tests in tests/tools/test_computer_use.py covering schema
shape (universal, not Anthropic-native), dispatch routing, safety
guards, multimodal envelope, Anthropic adapter conversion, screenshot
eviction, context compressor pruning, image-aware token estimation,
run_agent helpers, and universality guarantees.

469/469 pass across tests/tools/test_computer_use.py + the affected
agent/ test suites.

## Not in this PR

- `model_tools.py` provider-gating: the tool is available to every
  provider. Providers without multi-part tool message support will see
  text-only tool results (graceful degradation via `text_summary`).
- Anthropic server-side `clear_tool_uses_20250919` — deferred;
  client-side eviction + compressor pruning cover the same cost ceiling
  without a beta header.

## Caveats

- macOS only. cua-driver uses private SkyLight SPIs
  (SLEventPostToPid, SLPSPostEventRecordTo,
  _AXObserverAddNotificationAndCheckRemote) that can break on any macOS
  update. Pin with HERMES_CUA_DRIVER_VERSION.
- Requires Accessibility + Screen Recording permissions — the post-setup
  prints the Settings path.

Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic-
native schema). Credit @0xbyt4 for the original #3816 groundwork whose
context/eviction/token design is preserved here in generic form.
2026-04-23 16:44:24 -07:00


sidebar_position: 4
title: Toolsets Reference
description: Reference for Hermes core, composite, platform, and dynamic toolsets

Toolsets Reference

Toolsets are named bundles of tools that control what the agent can do. They're the primary mechanism for configuring tool availability per platform, per session, or per task.

How Toolsets Work

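Every tool belongs to at least one toolset, and a few belong to several (web_search, for example, appears in browser, search, and web). When you enable a toolset, all tools in that bundle become available to the agent. Built-in toolsets come in three kinds: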

  • Core — A single logical group of related tools (e.g., file bundles read_file, write_file, patch, search_files)
  • Composite — Combines multiple core toolsets for a common scenario (e.g., debugging bundles file, terminal, and web tools)
  • Platform — A complete tool configuration for a specific deployment context (e.g., hermes-cli is the default for interactive CLI sessions)

Configuring Toolsets

Per-session (CLI)

hermes chat --toolsets web,file,terminal
hermes chat --toolsets debugging        # composite — expands to file + terminal + web
hermes chat --toolsets all              # everything

Per-platform (config.yaml)

toolsets:
  - hermes-cli          # default for CLI
  # - hermes-telegram   # override for Telegram gateway

Interactive management

hermes tools                            # curses UI to enable/disable per platform

Or in-session:

/tools list
/tools disable browser
/tools enable rl

Core Toolsets

| Toolset | Tools | Purpose |
|---|---|---|
| browser | browser_back, browser_cdp, browser_click, browser_console, browser_get_images, browser_navigate, browser_press, browser_scroll, browser_snapshot, browser_type, browser_vision, web_search | Full browser automation. Includes web_search as a fallback for quick lookups. browser_cdp is a raw CDP passthrough gated on a reachable CDP endpoint; it only appears when /browser connect is active or browser.cdp_url is set. |
| clarify | clarify | Ask the user a question when the agent needs clarification. |
| code_execution | execute_code | Run Python scripts that call Hermes tools programmatically. |
| cronjob | cronjob | Schedule and manage recurring tasks. |
| delegation | delegate_task | Spawn isolated subagent instances for parallel work. |
| feishu_doc | feishu_doc_read | Read Feishu/Lark document content. Used by the Feishu document-comment intelligent-reply handler. |
| feishu_drive | feishu_drive_add_comment, feishu_drive_list_comments, feishu_drive_list_comment_replies, feishu_drive_reply_comment | Feishu/Lark drive comment operations. Scoped to the comment agent; not exposed on hermes-cli or other messaging toolsets. |
| file | patch, read_file, search_files, write_file | File reading, writing, searching, and editing. |
| homeassistant | ha_call_service, ha_get_state, ha_list_entities, ha_list_services | Smart home control via Home Assistant. Only available when HASS_TOKEN is set. |
| computer_use | computer_use | Background macOS desktop control via cua-driver; does not steal cursor/focus. Works with any tool-capable model. macOS only; requires cua-driver on $PATH. |
| image_gen | image_generate | Text-to-image generation via FAL.ai. |
| memory | memory | Persistent cross-session memory management. |
| messaging | send_message | Send messages to other platforms (Telegram, Discord, etc.) from within a session. |
| moa | mixture_of_agents | Multi-model consensus via Mixture of Agents. |
| rl | rl_check_status, rl_edit_config, rl_get_current_config, rl_get_results, rl_list_environments, rl_list_runs, rl_select_environment, rl_start_training, rl_stop_training, rl_test_inference | RL training environment management (Atropos). |
| search | web_search | Web search only (without extract). |
| session_search | session_search | Search past conversation sessions. |
| skills | skill_manage, skill_view, skills_list | Skill CRUD and browsing. |
| terminal | process, terminal | Shell command execution and background process management. |
| todo | todo | Task list management within a session. |
| tts | text_to_speech | Text-to-speech audio generation. |
| vision | vision_analyze | Image analysis via vision-capable models. |
| web | web_extract, web_search | Web search and page content extraction. |

Composite Toolsets

These expand to multiple core toolsets, providing a convenient shorthand for common scenarios:

| Toolset | Expands to | Use case |
|---|---|---|
| debugging | file + terminal + web (via includes); effectively patch, process, read_file, search_files, terminal, web_extract, web_search, write_file | Debug sessions: file access, terminal, and web research without browser or delegation overhead. |
| safe | image_generate, vision_analyze, web_extract, web_search | Read-only research and media generation. No file writes, no terminal access, no code execution. Good for untrusted or constrained environments. |
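
The expansion of a composite toolset can be pictured as a recursive union over the toolsets it includes. The following is a minimal sketch, not Hermes's actual resolver: the `TOOLSETS` dictionary layout, its `tools` and `includes` keys, and the `resolve_toolset` function are hypothetical names chosen for illustration. Only the toolset and tool names are taken from the tables on this page.

```python
# Hypothetical registry layout: each toolset lists its own tools and/or
# the other toolsets it includes. Names are copied from the tables above.
TOOLSETS = {
    "file": {"tools": ["patch", "read_file", "search_files", "write_file"]},
    "terminal": {"tools": ["process", "terminal"]},
    "web": {"tools": ["web_extract", "web_search"]},
    "debugging": {"includes": ["file", "terminal", "web"]},
}


def resolve_toolset(name, registry=TOOLSETS, _seen=None):
    """Return the sorted list of tool names a toolset expands to."""
    seen = _seen if _seen is not None else set()
    if name in seen:  # guard against include cycles
        return []
    seen.add(name)
    entry = registry[name]
    tools = set(entry.get("tools", []))
    for included in entry.get("includes", []):
        tools.update(resolve_toolset(included, registry, seen))
    return sorted(tools)


print(resolve_toolset("debugging"))
```

Under these assumptions, resolving debugging yields the same eight tools listed in the "effectively" part of its table row.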

Platform Toolsets

Platform toolsets define the complete tool configuration for a deployment target. Most messaging platforms use the same set as hermes-cli:

| Toolset | Differences from hermes-cli |
|---|---|
| hermes-cli | Full toolset: all 36 core tools including clarify. The default for interactive CLI sessions. |
| hermes-acp | Drops clarify, cronjob, image_generate, send_message, text_to_speech, and the homeassistant tools. Focused on coding tasks in IDE context. |
| hermes-api-server | Drops clarify, send_message, and text_to_speech. Keeps everything else; suitable for programmatic access where user interaction isn't possible. |
| hermes-telegram | Same as hermes-cli. |
| hermes-discord | Same as hermes-cli. |
| hermes-slack | Same as hermes-cli. |
| hermes-whatsapp | Same as hermes-cli. |
| hermes-signal | Same as hermes-cli. |
| hermes-matrix | Same as hermes-cli. |
| hermes-mattermost | Same as hermes-cli. |
| hermes-email | Same as hermes-cli. |
| hermes-sms | Same as hermes-cli. |
| hermes-bluebubbles | Same as hermes-cli. |
| hermes-dingtalk | Same as hermes-cli. |
| hermes-feishu | Same as hermes-cli. Note: the feishu_doc / feishu_drive toolsets are used only by the document-comment handler, not by the regular Feishu chat adapter. |
| hermes-qqbot | Same as hermes-cli. |
| hermes-wecom | Same as hermes-cli. |
| hermes-wecom-callback | Same as hermes-cli. |
| hermes-weixin | Same as hermes-cli. |
| hermes-homeassistant | Same as hermes-cli, with the homeassistant toolset always enabled. |
| hermes-webhook | Same as hermes-cli. |
| hermes-gateway | Internal gateway orchestrator toolset: the union of the broadest possible tool set, used when the gateway needs to accept any message source. |
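
Because the table describes each platform as a difference from hermes-cli, one way to read it is as a set subtraction. This is only a reading aid under stated assumptions: `HERMES_CLI_SUBSET` is a deliberately small, illustrative slice of the real tool list, and `derive_platform` is a hypothetical helper, not a Hermes function.

```python
# Illustrative subset of the hermes-cli tool list; the real platform
# toolset is larger. The dropped set mirrors the hermes-api-server row.
HERMES_CLI_SUBSET = {
    "clarify", "send_message", "text_to_speech",
    "read_file", "terminal", "web_search",
}
DROPPED_BY_API_SERVER = {"clarify", "send_message", "text_to_speech"}


def derive_platform(base, dropped):
    """Build a platform tool set by removing tools from a base set."""
    return base - dropped


api_server_tools = derive_platform(HERMES_CLI_SUBSET, DROPPED_BY_API_SERVER)
print(sorted(api_server_tools))
```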

Dynamic Toolsets

MCP server toolsets

Each configured MCP server generates an mcp-<server> toolset at runtime. For example, if you configure a github MCP server, an mcp-github toolset is created containing all tools that server exposes.

# config.yaml
mcp_servers:
  github:
    command: npx
    args: ["-y", "@modelcontextprotocol/server-github"]

This creates an mcp-github toolset you can reference in --toolsets or platform configs.
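
The naming rule can be sketched in a few lines of Python. This assumes the YAML has already been parsed into a dictionary; `mcp_toolset_names` is a hypothetical helper written for this page, and the sketch covers only the name derivation, not how Hermes discovers the tools each server exposes.

```python
def mcp_toolset_names(config):
    """List the toolset names implied by the mcp_servers section."""
    servers = config.get("mcp_servers") or {}
    return [f"mcp-{server}" for server in servers]


# Parsed equivalent of the config.yaml fragment above.
config = {
    "mcp_servers": {
        "github": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-github"],
        }
    }
}

print(mcp_toolset_names(config))
```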

Plugin toolsets

Plugins can register their own toolsets via ctx.register_tool() during plugin initialization. These appear alongside built-in toolsets and can be enabled/disabled the same way.

Custom toolsets

Define custom toolsets in config.yaml to create project-specific bundles:

toolsets:
  - hermes-cli
custom_toolsets:
  data-science:
    - file
    - terminal
    - code_execution
    - web
    - vision
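
A custom bundle is only useful if every member names a registered toolset, so a quick sanity check can help before relying on it. The sketch below is an assumption-laden illustration: `KNOWN_TOOLSETS` is a hand-written subset of the core table, `unknown_members` is a hypothetical helper, and the config is the parsed form of the YAML above. Whether Hermes itself validates custom toolsets this way is not stated here.

```python
# Toolset names assumed to be registered; a subset of the core table.
KNOWN_TOOLSETS = {"file", "terminal", "code_execution", "web", "vision"}


def unknown_members(config, known=KNOWN_TOOLSETS):
    """Map each custom toolset to the member names that are not registered."""
    problems = {}
    for name, members in (config.get("custom_toolsets") or {}).items():
        missing = [m for m in members if m not in known]
        if missing:
            problems[name] = missing
    return problems


config = {
    "toolsets": ["hermes-cli"],
    "custom_toolsets": {
        "data-science": ["file", "terminal", "code_execution", "web", "vision"],
    },
}
print(unknown_members(config))
```

An empty result means every member of every custom toolset was found in the known set.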

Wildcards

  • all or * — expands to every registered toolset (built-in + dynamic + plugin)
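
As a rough illustration of how a comma-separated `--toolsets` value and the wildcard might interact, consider the sketch below. `expand_selection` and the `registered` set are hypothetical; the actual CLI parsing in Hermes may differ, for example in how it orders or de-duplicates names.

```python
def expand_selection(arg, registered):
    """Expand a comma-separated toolsets value, honouring all and *."""
    names = [part.strip() for part in arg.split(",") if part.strip()]
    if any(name in ("all", "*") for name in names):
        # The wildcard stands for every registered toolset.
        return sorted(registered)
    return names


registered = {"web", "file", "terminal", "mcp-github"}
print(expand_selection("web,file,terminal", registered))
print(expand_selection("all", registered))
```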

Relationship to hermes tools

The hermes tools command provides a curses-based UI for toggling individual tools on or off per platform. This operates at the tool level (finer than toolsets) and persists to config.yaml. Disabled tools are filtered out even if their toolset is enabled.
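
The precedence described here (a disabled tool stays hidden even when its toolset is enabled) amounts to a union followed by a subtraction. The following sketch assumes a simple mapping from toolset name to tool names; `effective_tools` and its parameters are hypothetical and only illustrate the filtering order.

```python
def effective_tools(enabled_toolsets, toolset_tools, disabled_tools):
    """Union the tools of enabled toolsets, then drop disabled tools."""
    tools = set()
    for name in enabled_toolsets:
        tools.update(toolset_tools.get(name, []))
    return sorted(tools - set(disabled_tools))


toolset_tools = {
    "file": ["patch", "read_file", "search_files", "write_file"],
    "web": ["web_extract", "web_search"],
}
# write_file is disabled at the tool level, so it is filtered out even
# though the file toolset is enabled.
print(effective_tools(["file", "web"], toolset_tools, ["write_file"]))
```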

See also: Tools Reference for the complete list of individual tools and their parameters.