fix(tools): isolate get_tool_definitions quiet_mode cache + dedup LCM injection (#17335)

Long-lived Gateway processes were sending duplicate tool names to
providers that enforce uniqueness:

  - DeepSeek:        'Tool names must be unique.'
  - Xiaomi MiMo:     'tools contains duplicate names: lcm_expand'
  - Moonshot/Kimi:   'function name lcm_grep is duplicated'

TUI was unaffected because TUI runs with quiet_mode=False and skips the
cache entirely.

Root cause (two layered bugs)
- model_tools.get_tool_definitions(quiet_mode=True) memoizes its result
  in _tool_defs_cache. The cache-hit path returned list(cached) (safe),
  but the FIRST uncached call stored and returned the SAME object.
  run_agent.py mutates self.tools (memory + LCM context-engine schemas)
  in-place, so the very first agent init in a Gateway process
  poisoned the cache, and every subsequent init appended LCM schemas
  again on top of the already-polluted list.
- run_agent.py's context-engine injection (lcm_grep / lcm_describe /
  lcm_expand) had no dedup, unlike the memory-tools injection right
  above it which already skips already-present names.

Fix (defense in depth, per the issue's suggested fix)
- model_tools.get_tool_definitions: on the uncached branch, cache the
  computed list but return list(result) to the caller. Same pattern as
  the cache-hit path.
- run_agent.py: build _existing_tool_names from self.tools and skip
  schemas whose names are already present, mirroring the memory-tools
  block. This also defends against plugin paths that may register the
  same schemas via ctx.register_tool().

Tests (tests/test_get_tool_definitions_cache_isolation.py)
- test_first_uncached_call_returns_fresh_list \u2014 pins the fix; without
  it, first-call alias caused all the symptoms.
- test_cache_hit_returns_fresh_list \u2014 pre-existing behavior stays.
- test_caller_mutation_does_not_poison_cache \u2014 simulates run_agent
  appending lcm_grep / lcm_expand to the returned list and asserts the
  next call doesn't see them.
- test_repeated_caller_mutation_does_not_accumulate \u2014 reproduces the
  long-lived Gateway accumulation pattern across 5 agent inits.
- test_non_quiet_mode_does_not_use_cache \u2014 sanity, explains why TUI
  was fine.

5/5 pass on the new file; 23/23 still pass on tests/test_model_tools.py.
This commit is contained in:
Sanjays2402 2026-04-29 00:50:32 -07:00 committed by Teknium
parent 70ae678af1
commit e0fa2cf972
3 changed files with 119 additions and 2 deletions

View file

@ -320,7 +320,15 @@ def get_tool_definitions(
result = _compute_tool_definitions(enabled_toolsets, disabled_toolsets, quiet_mode)
if quiet_mode:
# Cache the freshly-computed list, but hand callers a shallow copy so
# downstream mutations (e.g. run_agent appending memory/LCM tool
# schemas to self.tools) don't poison the cache. Without this, a
# long-lived Gateway process accumulates duplicate tool names across
# agent inits and providers that enforce unique tool names
# (DeepSeek, Xiaomi MiMo, Moonshot Kimi) reject the request with
# HTTP 400. Mirrors the cache-hit path above. (issue #17335)
_tool_defs_cache[cache_key] = result
return list(result)
return result