Commit graph

2 commits

Author SHA1 Message Date
Michael Steuer
3d029a53ec fix(gateway): close residual memory-leak sites under heavy scheduled workload
Long-lived gateways under heavy cron/build workloads grow steadily (~18 MB/hr
post-phantom-dispatch-fix) and eventually need a restart-or-OOM. Four retention
sites, all confirmed live on current main:

1. _evict_cached_agent() (/model, /reasoning, codex-runtime, /undo, etc.) popped
   the cache entry without releasing the agent's OpenAI client, httpx transport,
   SSL context, or conversation history. Only /new cleaned up first. Now releases
   clients on a daemon thread, matching _enforce_agent_cache_cap.

2. _release_evicted_agent_soft() now clears _session_messages after
   release_clients() — tool outputs (file reads, terminal output, search results)
   can be tens of MB per 100+-tool-call session; the list is rebuilt from
   persisted session JSON on resume, so dropping it on soft eviction is safe.

3. The session-expiry watcher (permanent finalization) now drops the session's
   per-session control dicts (_session_model_overrides, _session_reasoning_overrides,
   _pending_approvals, _update_prompt_pending, _pending_model_notes). These leaked
   one entry per session per gateway lifetime. NOTE: this is the session-finalize
   path, NOT idle agent-cache eviction — an idle-evicted session is still alive and
   rebuilds its agent from these overrides, so pruning them there would silently
   reset a user's /model choice.

4. _tool_defs_cache is now bounded (_TOOL_DEFS_CACHE_MAX=8) with oldest-first
   eviction instead of growing unboundedly across the distinct toolset/config
   fingerprints a gateway sees over its lifetime.

Salvaged from #25318 by Michael Steuer (@mssteuer); fix 3 redirected from the
idle-sweep to the session-finalize lifecycle, magic number 8 lifted to a named
constant, test ported.

Fixes #19251
Co-authored-by: Michael Steuer <michael@make.software>
2026-06-08 06:32:42 -07:00
Sanjays2402
e0fa2cf972 fix(tools): isolate get_tool_definitions quiet_mode cache + dedup LCM injection (#17335)
Long-lived Gateway processes were sending duplicate tool names to
providers that enforce uniqueness:

  - DeepSeek:        'Tool names must be unique.'
  - Xiaomi MiMo:     'tools contains duplicate names: lcm_expand'
  - Moonshot/Kimi:   'function name lcm_grep is duplicated'

TUI was unaffected because TUI runs with quiet_mode=False and skips the
cache entirely.

Root cause (two layered bugs)
- model_tools.get_tool_definitions(quiet_mode=True) memoizes its result
  in _tool_defs_cache. The cache-hit path returned list(cached) (safe),
  but the FIRST uncached call stored and returned the SAME object.
  run_agent.py mutates self.tools (memory + LCM context-engine schemas)
  in-place, so the very first agent init in a Gateway process
  poisoned the cache, and every subsequent init appended LCM schemas
  again on top of the already-polluted list.
- run_agent.py's context-engine injection (lcm_grep / lcm_describe /
  lcm_expand) had no dedup, unlike the memory-tools injection right
  above it which already skips already-present names.

Fix (defense in depth, per the issue's suggested fix)
- model_tools.get_tool_definitions: on the uncached branch, cache the
  computed list but return list(result) to the caller. Same pattern as
  the cache-hit path.
- run_agent.py: build _existing_tool_names from self.tools and skip
  schemas whose names are already present, mirroring the memory-tools
  block. This also defends against plugin paths that may register the
  same schemas via ctx.register_tool().

Tests (tests/test_get_tool_definitions_cache_isolation.py)
- test_first_uncached_call_returns_fresh_list \u2014 pins the fix; without
  it, first-call alias caused all the symptoms.
- test_cache_hit_returns_fresh_list \u2014 pre-existing behavior stays.
- test_caller_mutation_does_not_poison_cache \u2014 simulates run_agent
  appending lcm_grep / lcm_expand to the returned list and asserts the
  next call doesn't see them.
- test_repeated_caller_mutation_does_not_accumulate \u2014 reproduces the
  long-lived Gateway accumulation pattern across 5 agent inits.
- test_non_quiet_mode_does_not_use_cache \u2014 sanity, explains why TUI
  was fine.

5/5 pass on the new file; 23/23 still pass on tests/test_model_tools.py.
2026-04-30 04:32:06 -07:00