hermes-agent/gateway/platforms
Fearvox 4b06c98fe4 fix(gateway): close ResponseStore + dispose unowned adapter on reconnect failure
Three separate code paths in the gateway's platform reconnect loop
leaked file descriptors every retry, exhausting the default 2560-fd
ulimit in ~12 hours of continuous failure and turning the gateway
into a zombie that raises OSError: [Errno 24] on every open() (#37011).

Root cause:
  * APIServerAdapter.__init__ opens a ResponseStore SQLite connection
    that holds 2 fds (db file + WAL sidecar).
  * APIServerAdapter.disconnect() previously only stopped the aiohttp
    web server — the ResponseStore connection was never closed.
  * The reconnect watcher in _platform_reconnect_watcher constructs a
    fresh adapter on every retry attempt. When the connect call fails
    (3 paths: non-retryable error, retryable error, exception during
    connect) the adapter is dropped without ever being installed on
    self.adapters, so nothing else calls its disconnect(). Result: the
    2 ResponseStore fds stay open until GC sweeps the unreachable
    object, which Python's cyclic GC does not do promptly for
    asyncio-bound native handles.

  2 fds × 1 retry × (3600s / 300s backoff cap) ≈ 12 fds/hour.
  2560 fds / 12 fds/hr ≈ 12h to ulimit exhaustion.

Fix:

  * APIServerAdapter.disconnect() now also calls
    self._response_store.close() (with a try/except so a SQLite
    close failure doesn't abort the aiohttp teardown).
  * New module-level helper _dispose_unused_adapter(adapter) in
    gateway/run.py that calls adapter.disconnect() and swallows
    any exception (so half-constructed adapters whose __init__
    crashed don't kill the watcher loop).
  * _platform_reconnect_watcher calls _dispose_unused_adapter() in
    all three failure paths: non-retryable, retryable, and the
    except Exception arm. adapter = None is initialized
    before the try so the except arm can see the partial
    construction.

Tests:

  * New file tests/gateway/test_platform_reconnect_fd_leak.py with
    7 regression tests covering all three failure paths, the
    _dispose_unused_adapter helper (None + raising-disconnect cases),
    and the APIServerAdapter ResponseStore close behavior (success +
    close-exception cases). The _CountingAdapter fixture tracks
    disconnect() invocations and an _open_fds counter that is
    decremented on dispose, so the assertion is the literal
    observable behavior of the leak.

Refs:
  - Closes #37011 (the original fd-leak report)
  - Supersedes #37018, #37110, #37238, #37260, #37394 (7 competing
    open PRs all addressing the same root cause from different angles;
    none of them rebased cleanly against current main, and none
    covered all three failure paths in one fix with regression tests
    for both the watcher and the platform-level close behavior)
2026-06-02 17:27:44 -07:00
..
qqbot fix(gateway): trust adapter-owned access policy over env default-deny (#34515) 2026-05-29 04:22:41 -07:00
__init__.py perf(gateway): defer QQAdapter and YuanbaoAdapter imports via PEP 562 (#22790) 2026-05-09 13:17:48 -07:00
_http_client_limits.py fix(gateway): tighten httpx keepalive and close whatsapp typing-response leak (#18451) 2026-05-02 02:23:37 -07:00
ADDING_A_PLATFORM.md refactor(plugins): add apply_yaml_config_fn registry hook 2026-05-13 22:20:30 -07:00
api_server.py fix(gateway): close ResponseStore + dispose unowned adapter on reconnect failure 2026-06-02 17:27:44 -07:00
base.py feat(gateway): structured stream-event protocol + Telegram draft formatting parity (#37250) 2026-06-02 00:33:50 -07:00
bluebubbles.py refactor(bluebubbles): simplify mention-gating helpers 2026-06-01 18:52:05 -07:00
dingtalk.py fix(dingtalk): finalize open streaming cards before disconnect 2026-05-23 20:48:56 -07:00
email.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
feishu.py fix(gateway): detach pending_watchers batch + normalize LRU caches + align test fixtures + AUTHOR_MAP 2026-05-31 00:50:19 -07:00
feishu_comment.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
feishu_comment_rules.py chore: ruff auto-fix C401, C416, C408, PLR1722 (#23940) 2026-05-11 11:20:58 -07:00
helpers.py fix(gateway): preserve underscores in plain-text identifiers 2026-05-16 23:11:43 -07:00
homeassistant.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
matrix.py fix(matrix): fail-closed approval reaction auth when MATRIX_ALLOWED_USERS is empty 2026-05-29 03:58:45 -07:00
msgraph_webhook.py fix(security): require source CIDR allowlisting for public msgraph webhook binds 2026-05-28 01:26:18 -07:00
signal.py Add Hermes desktop app (#20059) 2026-05-31 17:46:56 -05:00
signal_rate_limit.py feat(gateway/signal): add support for multiple images sending 2026-04-30 04:28:08 -07:00
slack.py Add Hermes desktop app (#20059) 2026-05-31 17:46:56 -05:00
sms.py fix(gateway): add trust_env=True to aiohttp sessions in SMS, Slack, Teams, Google Chat adapters 2026-05-16 23:11:43 -07:00
telegram.py feat(gateway): structured stream-event protocol + Telegram draft formatting parity (#37250) 2026-06-02 00:33:50 -07:00
telegram_network.py fix(telegram): reset sticky fallback IP on connect failure, retry primary DNS 2026-05-18 22:14:45 -07:00
webhook.py feat(dashboard): complete admin panel — MCP catalog, enable/disable toggles, hook creation, system stats (#36736) 2026-06-02 00:16:11 -04:00
wecom.py fix(gateway): honor WECOM_ALLOWED_USERS in env-only WeCom DM allowlist 2026-06-01 19:20:36 -07:00
wecom_callback.py chore(wecom): make defusedxml dep acquireable and tolerant of absence 2026-05-25 23:30:43 -07:00
wecom_crypto.py feat(gateway): add WeCom callback-mode adapter for self-built apps 2026-04-11 15:22:49 -07:00
weixin.py fix(weixin): replace aiohttp ClientTimeout with asyncio.wait_for in _api_post/_api_get 2026-06-01 17:31:40 -07:00
whatsapp.py fix(whatsapp): honor dm_policy and group_policy open at the gateway 2026-06-01 19:51:21 -07:00
yuanbao.py fix(gateway): trust adapter-owned access policy over env default-deny (#34515) 2026-05-29 04:22:41 -07:00
yuanbao_media.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
yuanbao_proto.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
yuanbao_sticker.py yuanbao platform (#16298) 2026-04-26 18:50:49 -07:00