mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-11 08:42:11 +00:00
Three separate code paths in the gateway's platform reconnect loop leaked file descriptors every retry, exhausting the default 2560-fd ulimit in ~12 hours of continuous failure and turning the gateway into a zombie that raises OSError: [Errno 24] on every open() (#37011). Root cause: * APIServerAdapter.__init__ opens a ResponseStore SQLite connection that holds 2 fds (db file + WAL sidecar). * APIServerAdapter.disconnect() previously only stopped the aiohttp web server — the ResponseStore connection was never closed. * The reconnect watcher in _platform_reconnect_watcher constructs a fresh adapter on every retry attempt. When the connect call fails (3 paths: non-retryable error, retryable error, exception during connect) the adapter is dropped without ever being installed on self.adapters, so nothing else calls its disconnect(). Result: the 2 ResponseStore fds stay open until GC sweeps the unreachable object, which Python's cyclic GC does not do promptly for asyncio-bound native handles. 2 fds × 1 retry × (3600s / 300s backoff cap) ≈ 12 fds/hour. 2560 fds / 12 fds/hr ≈ 12h to ulimit exhaustion. Fix: * APIServerAdapter.disconnect() now also calls self._response_store.close() (with a try/except so a SQLite close failure doesn't abort the aiohttp teardown). * New module-level helper _dispose_unused_adapter(adapter) in gateway/run.py that calls adapter.disconnect() and swallows any exception (so half-constructed adapters whose __init__ crashed don't kill the watcher loop). * _platform_reconnect_watcher calls _dispose_unused_adapter() in all three failure paths: non-retryable, retryable, and the except Exception arm. adapter = None is initialized before the try so the except arm can see the partial construction. Tests: * New file tests/gateway/test_platform_reconnect_fd_leak.py with 7 regression tests covering all three failure paths, the _dispose_unused_adapter helper (None + raising-disconnect cases), and the APIServerAdapter ResponseStore close behavior (success + close-exception cases). The _CountingAdapter fixture tracks disconnect() invocations and an _open_fds counter that is decremented on dispose, so the assertion is the literal observable behavior of the leak. Refs: - Closes #37011 (the original fd-leak report) - Supersedes #37018, #37110, #37238, #37260, #37394 (7 competing open PRs all addressing the same root cause from different angles; none of them rebased cleanly against current main, and none covered all three failure paths in one fix with regression tests for both the watcher and the platform-level close behavior) |
||
|---|---|---|
| .. | ||
| qqbot | ||
| __init__.py | ||
| _http_client_limits.py | ||
| ADDING_A_PLATFORM.md | ||
| api_server.py | ||
| base.py | ||
| bluebubbles.py | ||
| dingtalk.py | ||
| email.py | ||
| feishu.py | ||
| feishu_comment.py | ||
| feishu_comment_rules.py | ||
| helpers.py | ||
| homeassistant.py | ||
| matrix.py | ||
| msgraph_webhook.py | ||
| signal.py | ||
| signal_rate_limit.py | ||
| slack.py | ||
| sms.py | ||
| telegram.py | ||
| telegram_network.py | ||
| webhook.py | ||
| wecom.py | ||
| wecom_callback.py | ||
| wecom_crypto.py | ||
| weixin.py | ||
| whatsapp.py | ||
| yuanbao.py | ||
| yuanbao_media.py | ||
| yuanbao_proto.py | ||
| yuanbao_sticker.py | ||