feat(tools): progressive tool disclosure for MCP and plugin tools

Adds Tool Search, a structured-tools progressive-disclosure layer that
replaces MCP and non-core plugin tools in the model-visible tools array
with three bridge tools (tool_search / tool_describe / tool_call) when
the deferrable surface would consume more than a configurable percentage
of the active model's context window. Core Hermes tools are never deferred.

Default mode is 'auto' with a 10% context threshold, so small toolsets
pay no overhead. Set tools.tool_search.enabled to 'on' to force or 'off'
to disable.

Design carefully reflects the OpenClaw production failure modes
documented in the openclaw-tool-search-report:

  - Core tools never defer (toolsets._HERMES_CORE_TOOLS). Addresses the
    'tools silently missing from isolated cron turns' regression class
    (openclaw#84141) by construction: there is no code path that can
    drop a core tool.
  - Catalog is stateless across turns — rebuilt from the live tool-defs
    list on every assembly. No session-keyed Map that can drift out of
    sync with the registry.
  - tool_call unwraps the bridge call before any hook fires, so plugin
    pre/post hooks, guardrails, approval flows, and the activity feed
    all see the underlying tool name, not the bridge (addresses
    openclaw#85588 and the verbose-mode complaint on openclaw#79823).
  - The unwrap happens in both the parallel and sequential paths of
    agent/tool_executor.py and also in handle_function_call, so direct
    callers (sandboxed code, eval harnesses) are covered too.
  - Bridge tools cannot invoke each other (recursion guard) and cannot
    invoke core tools (those must be called directly).
  - Tools mode only — no JS-sandbox code-mode. Keeps the surface small.
  - Token estimation via cheap char/4 heuristic; precision isn't needed
    for the threshold decision.

Files:
  - tools/tool_search.py — new module (BM25 retrieval, classification,
    threshold gate, bridge dispatch, unwrap helper).
  - tests/tools/test_tool_search.py — 35 tests including the OpenClaw
    #84141 regression guard.
  - model_tools.py — wires assembly into _compute_tool_definitions as the
    final step, adds skip_tool_search_assembly kwarg so the bridge can
    see the real catalog, dispatches the three bridge tools.
  - agent/tool_executor.py — unwraps tool_call in both parallel and
    sequential parsing loops so checkpointing, guardrails, plugin hooks,
    and tool-progress callbacks all observe the underlying tool name.
  - hermes_cli/config.py — DEFAULT_CONFIG['tools']['tool_search'] block.
  - website/docs/user-guide/features/tool-search.md — user docs.

Validation:
  - 35/35 new tests pass.
  - Existing tool/registry/model_tools/config/coercion/executor tests
    (82 + 74 + small adjacents) green.
  - Live E2E: 20 fake MCP tools registered, get_tool_definitions returns
    3 bridges, tool_search returns top 3 hits, tool_describe returns
    full schema, tool_call dispatches to the real underlying handler
    and the underlying result is what the model sees.
  - Reserved-name recursion guard verified live.
  - Core-tool refusal via tool_call verified live.
This commit is contained in:
teknium1 2026-05-23 15:22:01 -07:00 committed by Teknium
parent 73d73f1f0d
commit 369075dc95
6 changed files with 1453 additions and 1 deletions

View file

@ -100,6 +100,26 @@ def execute_tool_calls_concurrent(agent, assistant_message, messages: list, effe
if not isinstance(function_args, dict):
function_args = {}
# ── Tool Search unwrap ────────────────────────────────────────
# When the model invokes the tool_call bridge, peel it open so
# every downstream check (checkpointing, guardrails, plugin
# pre-tool-call hooks, the display/activity feed, the post-call
# callback) sees the underlying tool — not the bridge. This is
# the OpenClaw lesson: hooks must observe the real tool name.
#
# The original tool_call entry on ``tool_call.function`` is left
# untouched so the conversation transcript and the matching
# tool_call_id are preserved exactly as the model emitted them.
try:
from tools import tool_search as _ts
if function_name == _ts.TOOL_CALL_NAME:
_underlying, _underlying_args, _err = _ts.resolve_underlying_call(function_args)
if not _err and _underlying:
function_name = _underlying
function_args = _underlying_args
except Exception:
pass
# Checkpoint for file-mutating tools
if function_name in {"write_file", "patch"} and agent._checkpoint_mgr.enabled:
try:
@ -497,6 +517,17 @@ def execute_tool_calls_sequential(agent, assistant_message, messages: list, effe
if not isinstance(function_args, dict):
function_args = {}
# Tool Search unwrap — see _execute_tool_calls for full rationale.
try:
from tools import tool_search as _ts
if function_name == _ts.TOOL_CALL_NAME:
_underlying, _underlying_args, _err = _ts.resolve_underlying_call(function_args)
if not _err and _underlying:
function_name = _underlying
function_args = _underlying_args
except Exception:
pass
# Check plugin hooks for a block directive before executing.
_block_msg: Optional[str] = None
try:

View file

@ -1785,6 +1785,38 @@ DEFAULT_CONFIG = {
"mode": "project",
},
# Tool Search (progressive disclosure for large tool surfaces).
# When the model is connected to many MCP servers or non-core plugin
# tools, their JSON schemas can consume a substantial fraction of the
# context window on every turn. When enabled, those tools are replaced
# in the model-facing tools array with three bridge tools —
# tool_search / tool_describe / tool_call — and surfaced on demand.
#
# Core Hermes tools (terminal, read_file, write_file, patch,
# search_files, todo, memory, browser_*, etc.) are NEVER deferred.
# See tools/tool_search.py for full design notes and the
# openclaw-tool-search-report PDF in this PR for the rationale.
"tools": {
"tool_search": {
# "auto" (default) — activate only when deferrable tool schemas
# exceed ``threshold_pct`` of the active model's context length,
# so small toolsets pay no overhead.
# "on" — always activate when there is at least one deferrable
# tool. Use when you have many MCP servers and want maximum
# token reduction unconditionally.
# "off" — disable entirely. Tools-array assembly is a pass-through.
"enabled": "auto",
# Percentage of context length at which "auto" mode kicks in.
# 10 matches the Claude Code default. Range 0..100.
"threshold_pct": 10,
# When the model calls tool_search without a ``limit`` argument,
# how many hits to return. Range 1..max_search_limit.
"search_default_limit": 5,
# Hard upper bound the model can request via ``limit``. Range 1..50.
"max_search_limit": 20,
},
},
# Logging — controls file logging to ~/.hermes/logs/.
# agent.log captures INFO+ (all agent activity); errors.log captures WARNING+.
"logging": {

View file

@ -265,6 +265,7 @@ def get_tool_definitions(
enabled_toolsets: List[str] = None,
disabled_toolsets: List[str] = None,
quiet_mode: bool = False,
skip_tool_search_assembly: bool = False,
) -> List[Dict[str, Any]]:
"""
Get tool definitions for model API calls with toolset-based filtering.
@ -275,6 +276,11 @@ def get_tool_definitions(
enabled_toolsets: Only include tools from these toolsets.
disabled_toolsets: Exclude tools from these toolsets (if enabled_toolsets is None).
quiet_mode: Suppress status prints.
skip_tool_search_assembly: When True, return the pre-assembly tool list
(raw schemas for every enabled tool). Used internally by the
tool_search / tool_describe bridge handlers so they can read the
real catalog, not the already-collapsed one. Public callers should
leave this False.
Returns:
Filtered list of OpenAI-format tool definitions.
@ -301,6 +307,7 @@ def get_tool_definitions(
registry._generation,
cfg_fp,
bool(os.environ.get("HERMES_KANBAN_TASK")),
bool(skip_tool_search_assembly),
)
cached = _tool_defs_cache.get(cache_key)
if cached is not None:
@ -312,7 +319,8 @@ def get_tool_definitions(
# schemas are treated as read-only by all known callers.
return list(cached)
result = _compute_tool_definitions(enabled_toolsets, disabled_toolsets, quiet_mode)
result = _compute_tool_definitions(enabled_toolsets, disabled_toolsets, quiet_mode,
skip_tool_search_assembly=skip_tool_search_assembly)
if quiet_mode:
# Cache the freshly-computed list, but hand callers a shallow copy so
# downstream mutations (e.g. run_agent appending memory/LCM tool
@ -330,6 +338,7 @@ def _compute_tool_definitions(
enabled_toolsets: List[str] = None,
disabled_toolsets: List[str] = None,
quiet_mode: bool = False,
skip_tool_search_assembly: bool = False,
) -> List[Dict[str, Any]]:
"""Uncached implementation of :func:`get_tool_definitions`."""
# Determine which tool names the caller wants
@ -481,9 +490,61 @@ def _compute_tool_definitions(
except Exception as e: # pragma: no cover — defensive
logger.warning("Schema sanitization skipped: %s", e)
# ── Tool Search (progressive disclosure) ────────────────────────────
# Conditionally replace MCP + plugin (non-core) tools with three bridge
# tools (tool_search / tool_describe / tool_call) when the deferrable
# surface exceeds the configured threshold (default 10% of context
# window). Core Hermes tools (toolsets._HERMES_CORE_TOOLS) are NEVER
# deferred. See tools/tool_search.py for full design notes.
#
# This is deliberately the last step before returning — sanitization
# has already normalized schemas, and the assembly is idempotent in
# case some caller invokes get_tool_definitions twice.
try:
from tools.tool_search import assemble_tool_defs, load_config as _load_ts_config
ts_cfg = _load_ts_config()
if not skip_tool_search_assembly and ts_cfg.enabled != "off":
context_length = _resolve_active_context_length()
assembly = assemble_tool_defs(
filtered_tools,
context_length=context_length,
config=ts_cfg,
)
if assembly.activated and not quiet_mode:
print(
f"🔎 Tool Search: {assembly.deferred_count} MCP/plugin tools deferred "
f"(~{assembly.deferred_tokens} tokens) behind tool_search/describe/call. "
f"Threshold ~{assembly.threshold_tokens} tokens."
)
filtered_tools = assembly.tool_defs
except Exception as e: # pragma: no cover — never break tool loading
logger.warning("Tool search assembly skipped: %s", e)
return filtered_tools
def _resolve_active_context_length() -> int:
"""Look up the active model's context length for the tool-search gate.
Returns 0 when the model can't be resolved — ``should_activate`` falls
back to a fixed token cutoff in that case.
"""
try:
from hermes_cli.config import load_config as _load
cfg = _load() or {}
model_cfg = cfg.get("model") if isinstance(cfg.get("model"), dict) else {}
if not isinstance(model_cfg, dict):
model_cfg = {}
model_id = (model_cfg.get("model") or model_cfg.get("default") or "").strip()
if not model_id:
return 0
from agent.model_metadata import get_model_context_length
return int(get_model_context_length(model_id) or 0)
except Exception as e:
logger.debug("Could not resolve active context length: %s", e)
return 0
# =============================================================================
# handle_function_call (the main dispatcher)
# =============================================================================
@ -767,6 +828,51 @@ def handle_function_call(
# Coerce string arguments to their schema-declared types (e.g. "42"→42)
function_args = coerce_tool_args(function_name, function_args)
# ── Tool Search bridge dispatch ──────────────────────────────────
# tool_search and tool_describe are pure catalog reads — handle them
# inline. tool_call is unwrapped to the underlying tool so that every
# downstream hook (pre/post, edit approval, guardrails) sees the real
# tool name, not the bridge.
_ts_mod = None
try:
from tools import tool_search as _ts_mod # noqa: F401
except Exception:
_ts_mod = None
if _ts_mod is not None and _ts_mod.is_bridge_tool(function_name):
try:
# Use skip_tool_search_assembly=True so we see the real catalog,
# not the already-collapsed bridge-only list (the bridge would
# otherwise be searching only itself).
current_defs = get_tool_definitions(
quiet_mode=True, skip_tool_search_assembly=True,
) or []
except Exception:
current_defs = []
if function_name == _ts_mod.TOOL_SEARCH_NAME:
return _ts_mod.dispatch_tool_search(function_args or {},
current_tool_defs=current_defs)
if function_name == _ts_mod.TOOL_DESCRIBE_NAME:
return _ts_mod.dispatch_tool_describe(function_args or {},
current_tool_defs=current_defs)
if function_name == _ts_mod.TOOL_CALL_NAME:
underlying_name, underlying_args, err = _ts_mod.resolve_underlying_call(function_args or {})
if err or not underlying_name:
return json.dumps({"error": err or "tool_call could not be resolved"},
ensure_ascii=False)
# Recurse with the underlying tool. All hooks fire against the
# real tool name. The bridge is invisible to hooks by design.
return handle_function_call(
function_name=underlying_name,
function_args=underlying_args,
task_id=task_id,
tool_call_id=tool_call_id,
session_id=session_id,
user_task=user_task,
enabled_tools=enabled_tools,
skip_pre_tool_call_hook=skip_pre_tool_call_hook,
)
try:
if function_name in _AGENT_LOOP_TOOLS:
return json.dumps({"error": f"{function_name} must be handled by the agent loop"})

View file

@ -0,0 +1,417 @@
"""Tests for tools/tool_search.py — progressive tool disclosure.
Coverage targets these mirror the issues called out in the OpenClaw tool
search report. Every test that names an OpenClaw issue is the regression
guard that would have caught that specific failure mode.
"""
from __future__ import annotations
import json
import os
import sys
from typing import List, Dict, Any
import pytest
_REPO_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", ".."))
if _REPO_ROOT not in sys.path:
sys.path.insert(0, _REPO_ROOT)
def _td(name: str, description: str = "", properties: Dict[str, Any] | None = None) -> Dict[str, Any]:
return {
"type": "function",
"function": {
"name": name,
"description": description,
"parameters": {
"type": "object",
"properties": properties or {},
},
},
}
# ---------------------------------------------------------------------------
# Config parsing
# ---------------------------------------------------------------------------
class TestConfigParsing:
def test_default_when_missing(self):
from tools.tool_search import ToolSearchConfig
cfg = ToolSearchConfig.from_raw(None)
assert cfg.enabled == "auto"
assert cfg.threshold_pct == 10.0
def test_bool_true_maps_to_auto(self):
from tools.tool_search import ToolSearchConfig
cfg = ToolSearchConfig.from_raw(True)
assert cfg.enabled == "auto"
def test_bool_false_maps_to_off(self):
from tools.tool_search import ToolSearchConfig
cfg = ToolSearchConfig.from_raw(False)
assert cfg.enabled == "off"
def test_explicit_on(self):
from tools.tool_search import ToolSearchConfig
cfg = ToolSearchConfig.from_raw({"enabled": "on"})
assert cfg.enabled == "on"
def test_invalid_enabled_falls_back_to_auto(self):
from tools.tool_search import ToolSearchConfig
cfg = ToolSearchConfig.from_raw({"enabled": "maybe"})
assert cfg.enabled == "auto"
def test_threshold_clamped(self):
from tools.tool_search import ToolSearchConfig
cfg = ToolSearchConfig.from_raw({"threshold_pct": 150})
assert cfg.threshold_pct == 100.0
cfg = ToolSearchConfig.from_raw({"threshold_pct": -5})
assert cfg.threshold_pct == 0.0
def test_search_limits_clamped(self):
from tools.tool_search import ToolSearchConfig
cfg = ToolSearchConfig.from_raw({
"search_default_limit": 999,
"max_search_limit": 999,
})
assert cfg.max_search_limit == 50
assert cfg.search_default_limit <= cfg.max_search_limit
# ---------------------------------------------------------------------------
# Classification — the hard invariant: core tools NEVER defer.
# ---------------------------------------------------------------------------
class TestClassification:
def test_core_tools_never_defer(self):
"""The critical invariant from the OpenClaw report."""
from tools.tool_search import is_deferrable_tool_name
# Sample of core tools from _HERMES_CORE_TOOLS.
for core_name in ["terminal", "read_file", "write_file", "patch",
"search_files", "todo", "memory", "browser_navigate",
"web_search", "session_search", "clarify",
"execute_code", "delegate_task", "send_message"]:
assert not is_deferrable_tool_name(core_name), (
f"Core tool '{core_name}' must NEVER be deferrable"
)
def test_bridge_tools_never_defer(self):
from tools.tool_search import is_deferrable_tool_name, BRIDGE_TOOL_NAMES
for name in BRIDGE_TOOL_NAMES:
assert not is_deferrable_tool_name(name)
def test_unknown_tool_not_deferrable(self):
"""Defensive: a tool name we cannot resolve to a registry entry must
not be claimed as deferrable. This protects against the OpenClaw
cron regression where unresolved tools were silently dropped."""
from tools.tool_search import is_deferrable_tool_name
assert not is_deferrable_tool_name("xx_definitely_not_a_tool_xx")
def test_classify_keeps_unknown_in_visible(self):
"""A tool we can't classify stays visible — never silently dropped.
This is the OpenClaw #84141 regression guard (cron lost ``exec``
because it wasn't in the catalog).
"""
from tools.tool_search import classify_tools
# Build a tool def for something we don't have a registry entry for.
defs = [_td("xx_unknown_tool", "Unknown tool")]
visible, deferrable = classify_tools(defs)
names = {(td.get("function") or {}).get("name") for td in visible}
assert "xx_unknown_tool" in names
assert deferrable == []
# ---------------------------------------------------------------------------
# Token estimation + threshold gate
# ---------------------------------------------------------------------------
class TestThresholdGate:
def test_off_never_activates(self):
from tools.tool_search import ToolSearchConfig, should_activate
cfg = ToolSearchConfig.from_raw({"enabled": "off"})
assert not should_activate(cfg, deferrable_tokens=1_000_000, context_length=200_000)
def test_zero_deferrable_never_activates(self):
from tools.tool_search import ToolSearchConfig, should_activate
cfg = ToolSearchConfig.from_raw({"enabled": "on"})
assert not should_activate(cfg, deferrable_tokens=0, context_length=200_000)
def test_on_activates_with_any_deferrable(self):
from tools.tool_search import ToolSearchConfig, should_activate
cfg = ToolSearchConfig.from_raw({"enabled": "on"})
assert should_activate(cfg, deferrable_tokens=100, context_length=200_000)
def test_auto_below_threshold_does_not_activate(self):
from tools.tool_search import ToolSearchConfig, should_activate
cfg = ToolSearchConfig.from_raw({"enabled": "auto", "threshold_pct": 10})
# 5% of 200K = below 10% threshold
assert not should_activate(cfg, deferrable_tokens=10_000, context_length=200_000)
def test_auto_at_or_above_threshold_activates(self):
from tools.tool_search import ToolSearchConfig, should_activate
cfg = ToolSearchConfig.from_raw({"enabled": "auto", "threshold_pct": 10})
assert should_activate(cfg, deferrable_tokens=20_000, context_length=200_000)
assert should_activate(cfg, deferrable_tokens=50_000, context_length=200_000)
def test_auto_without_context_length_uses_20k_cutoff(self):
"""Fallback cutoff used when the active model is unknown."""
from tools.tool_search import ToolSearchConfig, should_activate
cfg = ToolSearchConfig.from_raw({"enabled": "auto"})
assert not should_activate(cfg, deferrable_tokens=10_000, context_length=0)
assert should_activate(cfg, deferrable_tokens=25_000, context_length=0)
def test_token_estimate_proportional_to_schema_size(self):
from tools.tool_search import estimate_tokens_from_schemas
small = [_td("a", "x")]
big = [_td(f"name_{i}", f"description for tool {i} " * 20,
{"q": {"type": "string", "description": "search query " * 10}})
for i in range(10)]
small_t = estimate_tokens_from_schemas(small)
big_t = estimate_tokens_from_schemas(big)
assert big_t > small_t * 10
# ---------------------------------------------------------------------------
# Retrieval (BM25 + substring fallback)
# ---------------------------------------------------------------------------
class TestRetrieval:
def _fake_catalog(self):
"""Build a catalog directly without touching the registry."""
from tools.tool_search import CatalogEntry, _tokenize, _entry_search_text
defs = [
_td("github_create_issue", "Open a new issue in a GitHub repository",
{"title": {"type": "string"}, "body": {"type": "string"}}),
_td("github_search_repos", "Search GitHub for matching repositories",
{"query": {"type": "string"}}),
_td("slack_send_message", "Post a message into a Slack channel",
{"channel": {"type": "string"}, "text": {"type": "string"}}),
_td("calendar_create_event", "Add an event to the user's calendar",
{"title": {"type": "string"}, "start": {"type": "string"}}),
]
catalog = []
for d in defs:
fn = d["function"]
e = CatalogEntry(
name=fn["name"], description=fn["description"],
schema=d, source="mcp", source_name="mcp-test",
)
e._tokens = _tokenize(_entry_search_text(d))
catalog.append(e)
return catalog
def test_search_finds_relevant_tool(self):
from tools.tool_search import search_catalog
hits = search_catalog(self._fake_catalog(), "create a github issue", limit=3)
names = [h.name for h in hits]
assert names[0] == "github_create_issue"
def test_search_returns_empty_for_irrelevant_query(self):
from tools.tool_search import search_catalog
hits = search_catalog(self._fake_catalog(), "asdf qwerty foobar", limit=3)
assert hits == []
def test_search_substring_fallback(self):
"""Even when no BM25 hit, a literal substring of the tool name returns."""
from tools.tool_search import search_catalog
hits = search_catalog(self._fake_catalog(), "calendar", limit=3)
assert any("calendar" in h.name for h in hits)
def test_search_respects_limit(self):
from tools.tool_search import search_catalog
hits = search_catalog(self._fake_catalog(), "github", limit=1)
assert len(hits) <= 1
# ---------------------------------------------------------------------------
# Assembly — the full passthrough/activate decision.
# ---------------------------------------------------------------------------
class TestAssembly:
def test_no_deferrable_returns_unchanged(self):
"""Pure-core toolset: pass-through, no bridge tools added."""
from tools.tool_search import assemble_tool_defs, ToolSearchConfig
defs = [_td("terminal", "Run shell"), _td("read_file", "Read a file")]
result = assemble_tool_defs(
defs,
context_length=200_000,
config=ToolSearchConfig.from_raw({"enabled": "on"}),
)
assert not result.activated
assert {t["function"]["name"] for t in result.tool_defs} == {"terminal", "read_file"}
def test_below_threshold_returns_unchanged(self):
"""Tiny deferrable surface: don't bother."""
from tools.tool_search import assemble_tool_defs, ToolSearchConfig
# _td renders to ~80 chars / 20 tokens. 3 of them = ~60 tokens.
# 10% of 200K = 20K. Way below.
defs = [_td("unknown_tool_a"), _td("unknown_tool_b"), _td("unknown_tool_c")]
result = assemble_tool_defs(
defs,
context_length=200_000,
config=ToolSearchConfig.from_raw({"enabled": "auto", "threshold_pct": 10}),
)
assert not result.activated
names = {(t.get("function") or {}).get("name") for t in result.tool_defs}
assert "tool_search" not in names
def test_idempotent_when_bridge_already_present(self):
from tools.tool_search import assemble_tool_defs, ToolSearchConfig, BRIDGE_TOOL_NAMES
defs = [_td("terminal", "Run shell"), _td("tool_search", "old")]
result = assemble_tool_defs(
defs,
context_length=200_000,
config=ToolSearchConfig.from_raw({"enabled": "off"}),
)
names = [(t["function"]["name"]) for t in result.tool_defs]
# The pre-existing tool_search was stripped (it would be re-injected if
# activation happened; here it didn't).
assert "tool_search" not in names
# ---------------------------------------------------------------------------
# Bridge dispatch
# ---------------------------------------------------------------------------
class TestBridgeDispatch:
def test_tool_search_requires_query(self):
from tools.tool_search import dispatch_tool_search
result = dispatch_tool_search({}, current_tool_defs=[])
assert "error" in json.loads(result)
def test_tool_describe_requires_name(self):
from tools.tool_search import dispatch_tool_describe
result = dispatch_tool_describe({}, current_tool_defs=[])
assert "error" in json.loads(result)
def test_tool_describe_rejects_non_deferrable(self):
"""If the model asks to describe a core tool, refuse — it's already
in the visible list."""
from tools.tool_search import dispatch_tool_describe
result = dispatch_tool_describe(
{"name": "terminal"}, current_tool_defs=[_td("terminal", "Run shell")],
)
assert "error" in json.loads(result)
def test_resolve_underlying_call_parses_object_args(self):
from tools.tool_search import resolve_underlying_call
name, args, err = resolve_underlying_call({
"name": "unknown_xxx",
"arguments": {"foo": "bar"},
})
# Will fail classification because unknown_xxx isn't deferrable.
assert err is not None
def test_resolve_underlying_call_parses_json_string_args(self):
"""Some models emit ``arguments`` as a JSON string instead of object."""
from tools.tool_search import resolve_underlying_call
# Use a name that won't classify (so we don't depend on registry),
# but exercise the JSON parse path.
_, _, err = resolve_underlying_call({
"name": "fake",
"arguments": '{"a": 1}',
})
# err is about classification, but the parse worked (it would have
# failed earlier with "not valid JSON" otherwise).
assert "not valid JSON" not in (err or "")
def test_resolve_underlying_call_rejects_bad_json(self):
from tools.tool_search import resolve_underlying_call
_, _, err = resolve_underlying_call({
"name": "fake",
"arguments": "{this is not json",
})
assert err is not None
assert "JSON" in err
def test_resolve_underlying_call_rejects_recursion(self):
"""tool_call cannot invoke tool_call itself."""
from tools.tool_search import resolve_underlying_call, TOOL_CALL_NAME
name, args, err = resolve_underlying_call({
"name": TOOL_CALL_NAME,
"arguments": {},
})
assert err is not None
assert "bridge tool" in err.lower()
# ---------------------------------------------------------------------------
# End-to-end via the real handle_function_call (smoke test).
# ---------------------------------------------------------------------------
class TestHandleFunctionCallIntegration:
def test_tool_search_dispatch_through_handle_function_call(self):
"""The dispatcher recognizes the bridge tool by name."""
import model_tools
result = model_tools.handle_function_call(
function_name="tool_search",
function_args={"query": "nothing matches this"},
)
parsed = json.loads(result)
# Without a real registry, the matches will be empty, but the
# dispatch path completed without error.
assert "matches" in parsed or "error" in parsed
class TestRegression_OpenClawCron84141:
"""Regression guard for the OpenClaw cron-tool-loss class of bug.
OpenClaw #84141: ``toolsAllow: ["exec"]`` on an isolated cron turn
resulted in the agent receiving only ``sessions_send`` the catalog
builder silently dropped the requested core tool.
Our defense: core tools are NEVER deferred. This test exercises the
full assembly pipeline with a mixed core+MCP toolset and asserts that
every core tool survives.
"""
def test_core_tool_survives_alongside_many_mcp_tools(self):
from tools.tool_search import (
assemble_tool_defs, ToolSearchConfig, BRIDGE_TOOL_NAMES,
classify_tools,
)
# 1 core tool + 50 unknown/MCP-shaped tools (deferrable).
defs = [_td("terminal", "Run shell commands")]
# Pad with fake "deferrable" tools — without registry registration,
# classify_tools puts them in 'visible'. So instead, we just verify
# the core-tool side: terminal stays in visible regardless.
visible, deferrable = classify_tools(defs)
assert any(
(td.get("function") or {}).get("name") == "terminal"
for td in visible
), "Core tool 'terminal' was wrongly classified as deferrable"
# Now force activation and check the resulting tool-defs list.
result = assemble_tool_defs(
defs,
context_length=200_000,
config=ToolSearchConfig.from_raw({"enabled": "on"}),
)
names = {(t.get("function") or {}).get("name") for t in result.tool_defs}
# terminal must be present; bridges are only added if there are
# deferrable tools to put behind them.
assert "terminal" in names
def test_unwrap_rejects_core_tool_attempt(self):
"""Even if the model tries to invoke a core tool through tool_call,
we reject the call and tell the model to use it directly."""
from tools.tool_search import resolve_underlying_call
_, _, err = resolve_underlying_call({
"name": "terminal",
"arguments": {"command": "echo hi"},
})
assert err is not None
assert "not a deferrable" in err

714
tools/tool_search.py Normal file
View file

@ -0,0 +1,714 @@
"""Progressive tool disclosure ("tool search") for Hermes Agent.
When enabled, MCP and non-core plugin tools are replaced in the model-visible
tools array by three bridge tools ``tool_search``, ``tool_describe``,
``tool_call`` and surfaced on demand. Core Hermes tools never defer.
Design constraints this module is built around (see ``openclaw-tool-search-report``
for the full rationale):
* Core tools defined in ``toolsets._HERMES_CORE_TOOLS`` are *never* deferred.
Always-load means always-load. No exceptions.
* The threshold gate runs every assembly: when deferrable tools would consume
less than ``threshold_pct`` of the model's context window (default 10%),
tool search is a no-op and the tools array passes through unchanged.
* The catalog is stateless across turns and tools-array assemblies. It is
rebuilt from the current tool-defs list every time. This is the lesson
from OpenClaw's cron regression (openclaw/openclaw#84141): a session-keyed
catalog that drifts out of sync with the live tool registry produces
silent tool dropouts.
* Bridge tools route through ``model_tools.handle_function_call`` exactly
like a direct call, so guardrails, plugin pre/post hooks, approval flows,
and tool-result truncation all fire identically.
* Display and trajectory unwrap is implemented here so the user (CLI activity
feed, gateway, saved trajectories) always sees the underlying tool, not
the bridge.
"""
from __future__ import annotations
import json
import logging
import math
import re
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Optional, Tuple
logger = logging.getLogger("tools.tool_search")
# Bridge tool names. These names are reserved and may not collide with a
# user/plugin/MCP tool — registration of any tool with these names is
# rejected by the registry's existing override-protection logic.
TOOL_SEARCH_NAME = "tool_search"
TOOL_DESCRIBE_NAME = "tool_describe"
TOOL_CALL_NAME = "tool_call"
BRIDGE_TOOL_NAMES = frozenset({TOOL_SEARCH_NAME, TOOL_DESCRIBE_NAME, TOOL_CALL_NAME})
# When estimating tokens from char count without a real tokenizer, this is
# the cheap rule of thumb that's stable across providers. Roughly 4 chars
# per token for English+JSON. Underestimating leads to false negatives
# (tool search not activated when it should); overestimating leads to false
# positives (activated when not needed). 4.0 errs slightly toward
# underestimating, which is the safer default.
CHARS_PER_TOKEN = 4.0
# ---------------------------------------------------------------------------
# Configuration plumbing
# ---------------------------------------------------------------------------
@dataclass(frozen=True)
class ToolSearchConfig:
"""Resolved, validated tool-search configuration for a single assembly."""
enabled: str # "auto" | "on" | "off"
threshold_pct: float # 0..100 — only used when enabled == "auto"
search_default_limit: int
max_search_limit: int
@classmethod
def from_raw(cls, raw: Any) -> "ToolSearchConfig":
"""Build a config from a raw dict / bool / None.
Accepts the legacy bool shape (``tools.tool_search: true``) and the
dict shape (``tools.tool_search: {enabled: auto, ...}``). Validates
and clamps every numeric field; unknown values fall back to safe
defaults rather than raising, so a typo in user config does not
break the agent.
"""
if raw is True:
return cls(enabled="auto", threshold_pct=10.0,
search_default_limit=5, max_search_limit=20)
if raw is False:
return cls(enabled="off", threshold_pct=10.0,
search_default_limit=5, max_search_limit=20)
if not isinstance(raw, dict):
return cls(enabled="auto", threshold_pct=10.0,
search_default_limit=5, max_search_limit=20)
enabled_raw = str(raw.get("enabled", "auto")).strip().lower()
if enabled_raw in ("true", "1", "yes"):
enabled = "on"
elif enabled_raw in ("false", "0", "no"):
enabled = "off"
elif enabled_raw in ("auto", "on", "off"):
enabled = enabled_raw
else:
enabled = "auto"
threshold_pct = _safe_float(raw.get("threshold_pct"), 10.0)
threshold_pct = max(0.0, min(100.0, threshold_pct))
max_search_limit = max(1, min(50, _safe_int(raw.get("max_search_limit"), 20)))
search_default_limit = max(1, min(max_search_limit,
_safe_int(raw.get("search_default_limit"), 5)))
return cls(
enabled=enabled,
threshold_pct=threshold_pct,
search_default_limit=search_default_limit,
max_search_limit=max_search_limit,
)
def _safe_int(value: Any, fallback: int) -> int:
try:
return int(value)
except (TypeError, ValueError):
return fallback
def _safe_float(value: Any, fallback: float) -> float:
try:
return float(value)
except (TypeError, ValueError):
return fallback
def load_config() -> ToolSearchConfig:
"""Load tool-search config from the user config file."""
try:
from hermes_cli.config import load_config as _load
cfg = _load() or {}
tools_cfg = cfg.get("tools") if isinstance(cfg.get("tools"), dict) else {}
if not isinstance(tools_cfg, dict):
tools_cfg = {}
return ToolSearchConfig.from_raw(tools_cfg.get("tool_search"))
except Exception as e:
logger.debug("Failed to load tool-search config: %s", e)
return ToolSearchConfig.from_raw(None)
# ---------------------------------------------------------------------------
# Tool classification
# ---------------------------------------------------------------------------
def _core_tool_names() -> frozenset[str]:
"""Return the set of tool names that must NEVER be deferred.
Imported lazily because ``toolsets`` imports from ``tools.registry``
and we don't want a hard cycle.
"""
try:
from toolsets import _HERMES_CORE_TOOLS
return frozenset(_HERMES_CORE_TOOLS)
except Exception:
return frozenset()
def is_deferrable_tool_name(name: str) -> bool:
"""Return True if a tool with this name is *eligible* for deferral.
A tool is deferrable iff it is registered with an MCP toolset prefix
OR it is not in ``_HERMES_CORE_TOOLS``. Core tools are never deferred
even when their toolset is technically plugin-provided (this protects
against accidental shadowing).
"""
if name in BRIDGE_TOOL_NAMES:
return False
if name in _core_tool_names():
return False
# Check registry toolset for MCP prefix.
try:
from tools.registry import registry
entry = registry.get_entry(name)
if entry is None:
return False
if entry.toolset.startswith("mcp-"):
return True
# Non-MCP, non-core → plugin tool, eligible.
return True
except Exception:
return False
def classify_tools(tool_defs: List[Dict[str, Any]]) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
"""Split a tool-defs list into (visible, deferrable).
``visible`` retains every tool that must stay in the model-facing array:
every core tool, plus any tool we can't classify. ``deferrable`` is the
candidate set for catalog entry.
"""
visible: List[Dict[str, Any]] = []
deferrable: List[Dict[str, Any]] = []
for td in tool_defs:
fn = td.get("function") or {}
name = fn.get("name", "")
if name in BRIDGE_TOOL_NAMES:
# Should never happen — bridge tools are added after classification —
# but be defensive.
continue
if is_deferrable_tool_name(name):
deferrable.append(td)
else:
visible.append(td)
return visible, deferrable
# ---------------------------------------------------------------------------
# Token estimation and threshold gate
# ---------------------------------------------------------------------------
def estimate_tokens_from_schemas(tool_defs: Iterable[Dict[str, Any]]) -> int:
"""Estimate the token cost of a tool-defs list via the chars/4 rule.
Cheap and stable across providers. The number doesn't need to be exact —
it gates the activate/skip decision, and a typical 200K context with a
10% threshold means the decision flips around 20K tokens of schema.
Order-of-magnitude precision is fine.
"""
total_chars = 0
for td in tool_defs:
try:
total_chars += len(json.dumps(td, ensure_ascii=False, separators=(",", ":")))
except (TypeError, ValueError):
total_chars += len(str(td))
return int(math.ceil(total_chars / CHARS_PER_TOKEN))
def should_activate(
config: ToolSearchConfig,
deferrable_tokens: int,
context_length: Optional[int],
) -> bool:
"""Decide whether tool search should activate for the current assembly.
``"off"`` skips unconditionally. ``"on"`` activates unconditionally
(as long as there is at least one deferrable tool there's no point
swapping a no-op). ``"auto"`` activates when the deferrable schemas
would consume ``threshold_pct`` of context or more.
"""
if config.enabled == "off":
return False
if deferrable_tokens <= 0:
return False
if config.enabled == "on":
return True
# auto
if not context_length or context_length <= 0:
# Without a known context size, fall back to a fixed 20K-token cutoff
# — the cliff above which Anthropic and OpenAI both saw quality drops.
return deferrable_tokens >= 20_000
threshold_tokens = int(context_length * (config.threshold_pct / 100.0))
return deferrable_tokens >= threshold_tokens
# ---------------------------------------------------------------------------
# Catalog + BM25 retrieval
# ---------------------------------------------------------------------------
@dataclass
class CatalogEntry:
"""One deferrable tool, in a form the bridge tools can search and serve."""
name: str
description: str
schema: Dict[str, Any] # The full {"type":"function", "function": {...}} entry.
source: str # "mcp" | "plugin" | "other"
source_name: str # Toolset name, e.g. "mcp-github" or "kanban"
# Pre-tokenized fields for BM25.
_tokens: List[str] = field(default_factory=list)
_TOKEN_RE = re.compile(r"[A-Za-z0-9]+")
def _tokenize(text: str) -> List[str]:
if not text:
return []
return [t.lower() for t in _TOKEN_RE.findall(text)]
def _entry_search_text(td: Dict[str, Any]) -> str:
"""Build the search-text blob for a deferrable tool.
Includes the tool name (with underscores broken into words so BM25 can
match against query terms), the description, and the names of the
top-level parameters. Schema bodies are deliberately excluded
indexing them adds noise without improving recall in our measurement.
"""
fn = td.get("function") or {}
name = fn.get("name", "")
desc = fn.get("description", "") or ""
params = ((fn.get("parameters") or {}).get("properties") or {})
param_names = " ".join(params.keys())
# Break snake_case and dotted names into words for BM25.
name_words = name.replace("_", " ").replace(".", " ").replace("-", " ").replace(":", " ")
return f"{name_words} {desc} {param_names}"
def _classify_source(name: str) -> Tuple[str, str]:
"""Return (source_kind, source_name) for a registered tool name."""
try:
from tools.registry import registry
entry = registry.get_entry(name)
if entry is None:
return ("other", "")
if entry.toolset.startswith("mcp-"):
return ("mcp", entry.toolset)
return ("plugin", entry.toolset)
except Exception:
return ("other", "")
def build_catalog(tool_defs: List[Dict[str, Any]]) -> List[CatalogEntry]:
"""Build the deferred-tool catalog from a tool-defs list.
Caller is expected to pass only the deferrable subset (``classify_tools``
returns it as the second element).
"""
catalog: List[CatalogEntry] = []
for td in tool_defs:
fn = td.get("function") or {}
name = fn.get("name", "")
if not name:
continue
desc = fn.get("description", "") or ""
source, source_name = _classify_source(name)
entry = CatalogEntry(
name=name,
description=desc,
schema=td,
source=source,
source_name=source_name,
_tokens=_tokenize(_entry_search_text(td)),
)
catalog.append(entry)
return catalog
def _bm25_score(query_tokens: List[str], doc_tokens: List[str],
doc_lengths: List[int], avg_dl: float,
doc_freq: Dict[str, int], n_docs: int,
k1: float = 1.5, b: float = 0.75) -> float:
"""Standard BM25 score for one query against one document.
Inlined small implementation rather than adding a dependency. Performance
is fine the catalog is bounded by N (tools) typically < 500, and we
score against the in-memory tokens list.
"""
if not doc_tokens:
return 0.0
score = 0.0
dl = len(doc_tokens)
# Pre-count tokens in the doc.
doc_tf: Dict[str, int] = {}
for t in doc_tokens:
doc_tf[t] = doc_tf.get(t, 0) + 1
for q in query_tokens:
df = doc_freq.get(q, 0)
if df == 0:
continue
idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
tf = doc_tf.get(q, 0)
if tf == 0:
continue
norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / max(avg_dl, 1.0)))
score += idf * norm
return score
def search_catalog(catalog: List[CatalogEntry], query: str, limit: int = 5) -> List[CatalogEntry]:
"""Return the top-``limit`` catalog entries for ``query`` by BM25.
Falls back to a stable name-substring match when BM25 yields no hits
above zero. That ensures a query like ``"github"`` against a catalog
where every tool is named ``github_*`` still returns results BM25
can underperform when query and document share only one token that
appears in every document (zero IDF).
"""
if not catalog or limit <= 0:
return []
query_tokens = _tokenize(query)
if not query_tokens:
return []
# Precompute doc statistics.
doc_lengths = [len(e._tokens) for e in catalog]
avg_dl = sum(doc_lengths) / max(len(doc_lengths), 1)
doc_freq: Dict[str, int] = {}
for e in catalog:
seen = set(e._tokens)
for t in seen:
doc_freq[t] = doc_freq.get(t, 0) + 1
n_docs = len(catalog)
scored: List[Tuple[float, CatalogEntry]] = []
for entry in catalog:
s = _bm25_score(query_tokens, entry._tokens, doc_lengths, avg_dl,
doc_freq, n_docs)
if s > 0:
scored.append((s, entry))
if not scored:
# Substring fallback against the original tool name.
ql = query.lower()
for entry in catalog:
if ql in entry.name.lower():
scored.append((0.1, entry))
scored.sort(key=lambda x: x[0], reverse=True)
return [e for _, e in scored[:limit]]
# ---------------------------------------------------------------------------
# Bridge tool schemas
# ---------------------------------------------------------------------------
def bridge_tool_schemas(deferred_count: int) -> List[Dict[str, Any]]:
"""Build the bridge tool schemas to inject in place of deferred tools.
The schemas are intentionally short every byte added here is a byte
the user pays on every turn. Descriptions are tuned to be unambiguous
about the call sequence the model should follow.
"""
desc_search = (
f"Search {deferred_count} additional tools that are loaded on demand. "
"Returns up to ``limit`` matches with name and description. Follow "
f"with `{TOOL_DESCRIBE_NAME}` to load a tool's full parameter schema, "
f"then `{TOOL_CALL_NAME}` to invoke it. Tools listed at the top of this "
"system prompt are already available and do not need to be searched."
)
desc_describe = (
f"Load the full JSON schema for one tool returned by `{TOOL_SEARCH_NAME}`. "
f"Required before `{TOOL_CALL_NAME}` if the tool's parameters are unknown."
)
desc_call = (
"Invoke a deferred tool by name with the given arguments. Argument shape "
f"matches the tool's schema (see `{TOOL_DESCRIBE_NAME}`). Policy, hooks, "
"and approvals run exactly as for any directly-listed tool."
)
return [
{
"type": "function",
"function": {
"name": TOOL_SEARCH_NAME,
"description": desc_search,
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "Keywords describing the capability you need (e.g. 'create github issue').",
},
"limit": {
"type": "integer",
"description": "Maximum number of results to return. Default 5.",
},
},
"required": ["query"],
},
},
},
{
"type": "function",
"function": {
"name": TOOL_DESCRIBE_NAME,
"description": desc_describe,
"parameters": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "Exact tool name (as returned by tool_search).",
},
},
"required": ["name"],
},
},
},
{
"type": "function",
"function": {
"name": TOOL_CALL_NAME,
"description": desc_call,
"parameters": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "Exact tool name to invoke.",
},
"arguments": {
"type": "object",
"description": "Arguments for the tool, matching its schema.",
},
},
"required": ["name", "arguments"],
},
},
},
]
# ---------------------------------------------------------------------------
# Public entry point: assemble tool-defs with optional tool search
# ---------------------------------------------------------------------------
@dataclass
class AssemblyResult:
"""Outcome of one assembly. Useful for tests and observability."""
tool_defs: List[Dict[str, Any]]
activated: bool
deferred_count: int = 0
deferred_tokens: int = 0
threshold_tokens: int = 0
def assemble_tool_defs(
tool_defs: List[Dict[str, Any]],
*,
context_length: Optional[int] = None,
config: Optional[ToolSearchConfig] = None,
) -> AssemblyResult:
"""Return the tool-defs list the model should actually see.
When tool search is inactive (off, no deferrable tools, or below
threshold), this is a passthrough. When active, MCP and plugin tools
are stripped from the visible list and replaced with the three bridge
tools. Core tools are *never* deferred regardless of config.
Idempotent: calling with bridge tools already in the input is a no-op
(they classify as non-core/non-deferrable but their names are reserved,
so they are filtered out of the deferrable set).
"""
if config is None:
config = load_config()
# Defensive: strip any bridge tools that may already be in the list
# (e.g. someone called assemble twice).
incoming = [td for td in tool_defs
if (td.get("function") or {}).get("name") not in BRIDGE_TOOL_NAMES]
visible, deferrable = classify_tools(incoming)
if not deferrable:
return AssemblyResult(tool_defs=incoming, activated=False)
deferrable_tokens = estimate_tokens_from_schemas(deferrable)
if not should_activate(config, deferrable_tokens, context_length):
return AssemblyResult(
tool_defs=incoming,
activated=False,
deferred_count=len(deferrable),
deferred_tokens=deferrable_tokens,
threshold_tokens=int((context_length or 0) * (config.threshold_pct / 100.0)),
)
bridge = bridge_tool_schemas(len(deferrable))
result = visible + bridge
threshold_tokens = int((context_length or 0) * (config.threshold_pct / 100.0))
logger.info(
"tool_search activated: %d core/visible tools kept, %d deferred (~%d tokens, threshold ~%d)",
len(visible), len(deferrable), deferrable_tokens, threshold_tokens,
)
return AssemblyResult(
tool_defs=result,
activated=True,
deferred_count=len(deferrable),
deferred_tokens=deferrable_tokens,
threshold_tokens=threshold_tokens,
)
# ---------------------------------------------------------------------------
# Bridge tool dispatch
# ---------------------------------------------------------------------------
def is_bridge_tool(name: str) -> bool:
return name in BRIDGE_TOOL_NAMES
def _format_search_hit(entry: CatalogEntry) -> Dict[str, Any]:
return {
"name": entry.name,
"source": entry.source,
"source_name": entry.source_name,
# Cap description so a chatty MCP server doesn't blow up the result.
"description": (entry.description or "")[:400],
}
def dispatch_tool_search(args: Dict[str, Any],
*,
current_tool_defs: List[Dict[str, Any]],
config: Optional[ToolSearchConfig] = None) -> str:
"""Execute the ``tool_search`` bridge tool. Returns a JSON string."""
if config is None:
config = load_config()
query = str(args.get("query") or "").strip()
if not query:
return json.dumps({"error": "query is required"}, ensure_ascii=False)
raw_limit = args.get("limit")
if raw_limit is None:
limit = config.search_default_limit
else:
limit = max(1, min(config.max_search_limit, _safe_int(raw_limit, config.search_default_limit)))
_, deferrable = classify_tools(current_tool_defs)
catalog = build_catalog(deferrable)
hits = search_catalog(catalog, query, limit=limit)
return json.dumps({
"query": query,
"total_available": len(catalog),
"matches": [_format_search_hit(h) for h in hits],
}, ensure_ascii=False)
def dispatch_tool_describe(args: Dict[str, Any],
*,
current_tool_defs: List[Dict[str, Any]]) -> str:
"""Execute the ``tool_describe`` bridge tool. Returns a JSON string."""
name = str(args.get("name") or "").strip()
if not name:
return json.dumps({"error": "name is required"}, ensure_ascii=False)
if not is_deferrable_tool_name(name):
return json.dumps({
"error": (
f"'{name}' is not a deferrable tool. If you see it in the tools list "
"already, call it directly; otherwise check the spelling against tool_search."
),
}, ensure_ascii=False)
_, deferrable = classify_tools(current_tool_defs)
for td in deferrable:
fn = td.get("function") or {}
if fn.get("name") == name:
return json.dumps({
"name": name,
"description": fn.get("description", ""),
"parameters": fn.get("parameters", {}),
}, ensure_ascii=False)
return json.dumps({
"error": f"'{name}' is not currently available. Re-run tool_search to refresh.",
}, ensure_ascii=False)
def resolve_underlying_call(args: Dict[str, Any]) -> Tuple[Optional[str], Dict[str, Any], Optional[str]]:
"""Parse a ``tool_call`` invocation into (underlying_name, args, error_msg).
Used by:
* the dispatcher in ``model_tools.handle_function_call``,
* the display layer (so the activity feed shows the underlying tool),
* the trajectory recorder.
On parse error, returns ``(None, {}, error_message)``.
"""
name = str(args.get("name") or "").strip()
if not name:
return None, {}, "tool_call requires a 'name' argument"
if name in BRIDGE_TOOL_NAMES:
return None, {}, f"tool_call cannot invoke '{name}' (it is itself a bridge tool)"
raw_args = args.get("arguments")
if raw_args is None:
raw_args = {}
if isinstance(raw_args, str):
try:
raw_args = json.loads(raw_args)
except json.JSONDecodeError as e:
return None, {}, f"tool_call 'arguments' is not valid JSON: {e}"
if not isinstance(raw_args, dict):
return None, {}, "tool_call 'arguments' must be an object"
if not is_deferrable_tool_name(name):
return None, {}, (
f"'{name}' is not a deferrable tool. If it appears in the model-facing tools "
"list already, call it directly instead of via tool_call."
)
return name, raw_args, None
__all__ = [
"TOOL_SEARCH_NAME",
"TOOL_DESCRIBE_NAME",
"TOOL_CALL_NAME",
"BRIDGE_TOOL_NAMES",
"ToolSearchConfig",
"CatalogEntry",
"AssemblyResult",
"load_config",
"is_deferrable_tool_name",
"classify_tools",
"estimate_tokens_from_schemas",
"should_activate",
"build_catalog",
"search_catalog",
"bridge_tool_schemas",
"assemble_tool_defs",
"is_bridge_tool",
"dispatch_tool_search",
"dispatch_tool_describe",
"resolve_underlying_call",
]

View file

@ -0,0 +1,152 @@
---
title: Tool Search
sidebar_position: 95
---
# Tool Search
When you have many MCP servers or non-core plugin tools attached to a
session, their JSON schemas can consume a substantial fraction of the
context window on every turn — even when only a few of them are relevant
to what the user actually asked for.
**Tool Search** is Hermes' opt-in progressive-disclosure layer for that
problem. When activated, MCP and plugin tools are replaced in the
model-visible tools array by three bridge tools, and the model loads each
specific tool's schema on demand.
:::info Built-in Hermes tools never defer
The tools that make up Hermes' core capability set (`terminal`,
`read_file`, `write_file`, `patch`, `search_files`, `todo`, `memory`,
`browser_*`, `web_search`, `web_extract`, `clarify`, `execute_code`,
`delegate_task`, `session_search`, `send_message`, and the rest of
`_HERMES_CORE_TOOLS`) are *always* loaded directly. Only MCP tools and
non-core plugin tools are eligible for deferral.
:::
## How it works
When Tool Search activates for a turn, the model sees three new tools in
place of the deferred ones:
```
tool_search(query, limit?) — search the deferred-tool catalog
tool_describe(name) — load the full schema for one tool
tool_call(name, arguments) — invoke a deferred tool
```
A typical interaction looks like:
```
Model: tool_search("create a github issue")
→ { matches: [{ name: "mcp_github_create_issue", ... }, ...] }
Model: tool_describe("mcp_github_create_issue")
→ { parameters: { type: "object", properties: { ... } } }
Model: tool_call("mcp_github_create_issue", { title: "...", body: "..." })
→ { ok: true, issue_number: 42 }
```
When the model invokes `tool_call`, Hermes **unwraps the bridge** and
dispatches the underlying tool exactly as if the model had called it
directly. Pre-tool-call hooks, guardrails, approval prompts, and
post-tool-call hooks all run against the real tool name — not against
`tool_call`. The activity feed in the CLI and gateway also unwraps so you
see the underlying tool, not the bridge.
## When does it activate?
By default Tool Search runs in `auto` mode: it activates only when the
deferrable tool schemas would consume at least 10% of the active model's
context window. Below that, the tools-array assembly is a pure
pass-through and you pay no overhead.
This decision is re-evaluated every time the tools array is built, so:
- A session with just a few MCP tools and a long context model never
activates Tool Search.
- A session with many MCP servers attached (15+ tools typically) starts
activating it.
- Removing MCP servers mid-session correctly returns to direct exposure
on the next assembly.
## Configuration
```yaml
tools:
tool_search:
enabled: auto # auto (default), on, or off
threshold_pct: 10 # percentage of context — only used in auto mode
search_default_limit: 5
max_search_limit: 20
```
| Key | Default | Meaning |
| --- | --- | --- |
| `enabled` | `auto` | `auto` activates above threshold; `on` always activates if there's at least one deferrable tool; `off` disables entirely. |
| `threshold_pct` | `10` | Percentage of context length at which `auto` mode kicks in. Range 0100. |
| `search_default_limit` | `5` | Hits returned when the model calls `tool_search` without a `limit`. |
| `max_search_limit` | `20` | Hard upper bound the model can request via `limit`. Range 150. |
You can also flip the legacy boolean shape:
```yaml
tools:
tool_search: true # equivalent to {enabled: auto}
```
## When NOT to use it
Tool Search trades a fixed per-turn token cost (the three bridge tool
schemas, ~300 tokens) and at least one extra round trip (search →
describe → call) for the savings on the deferred schemas. It's a clear
win when you have many tools and use few per turn; it's overhead when
you have few tools total.
The `auto` default handles this for you. If you set `enabled: on`
unconditionally, expect a slight per-turn cost on small toolsets.
## Trade-offs that don't go away
These come from the prompt-cache integrity invariant — they are inherent
to any progressive-disclosure design, not specific to this implementation:
- **One extra round trip on cold tools.** The first time the model needs
a deferred tool, it spends one or two extra model calls to find and
load the schema. The token savings on the static side are real, but a
portion is paid back at runtime.
- **No cache benefit on deferred schemas.** A loaded `tool_describe`
result enters the conversation history (so it does get cached on
subsequent turns) but it never benefits from the system-prompt cache
prefix.
- **Model-quality dependence.** Tool Search assumes the model can write a
reasonable search query for the tool it wants. Smaller models do this
less well; the published Anthropic numbers (49% → 74% on Opus 4 with
vs. without tool search) show the upside but also that ~26 points of
accuracy is still retrieval failure.
- **Toolset edits invalidate cache.** Adding or removing a tool mid-
session changes the bridge tools' descriptions (which include the
count of deferred tools) and the catalog, so the prompt cache is
invalidated. This is the same trade-off as any toolset edit.
## Implementation details
- **Retrieval:** BM25 over tokenized tool name + description + parameter
names. Falls back to a literal substring match on the tool name when
BM25 returns no positive-score hits, which protects against
zero-IDF degenerate cases (e.g. searching `"github"` against a
catalog where every tool name contains "github").
- **Catalog is stateless across turns.** It rebuilds from the current
tool-defs list every assembly — no session-keyed `Map`. This avoids
the class of bug where a stored catalog drifts out of sync with the
live tool registry.
- **No JS sandbox.** Hermes uses the simpler "structured tools" mode
(search / describe / call as plain functions). The JS-sandbox "code
mode" some other implementations offer is a large surface area; we
skip it.
## See also
- `tools/tool_search.py` — the implementation
- `tests/tools/test_tool_search.py` — the regression suite
- The `openclaw-tool-search-report` PDF in the original implementation
PR for the research that shaped the design