From b49a3f84749926066511fa32571b6201026e7c0d Mon Sep 17 00:00:00 2001
From: Sonic Chang <265632032+sonic-netizen@users.noreply.github.com>
Date: Thu, 7 May 2026 19:39:18 +0800
Subject: [PATCH] fix(kanban): reap completed worker children in dispatch_once
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The gateway-embedded dispatcher (default since
`kanban.dispatch_in_gateway = true`) is the parent of every spawned
kanban worker. `_default_spawn` calls
`subprocess.Popen(..., start_new_session=True)` and returns the pid.
`start_new_session` detaches the controlling tty but does not reparent
the child to init, so the gateway remains the parent of each worker and
must `wait()` for it before the kernel can release its process-table
entry.

Nothing in the dispatch loop ever calls `waitpid`. Result: every
completed worker becomes a `<defunct>` zombie that lingers until the
gateway exits. We hit ~430 zombies on a single hermes-agent container
after ~40 days of steady kanban traffic, approaching process-table
exhaustion on the host.

Fix: add a non-blocking reap loop at the top of `dispatch_once`, so
every dispatcher tick (default 60s) drains the zombies that accumulated
since the last tick. WNOHANG keeps the call non-blocking;
ChildProcessError means there are no children left to reap.

Why here, not a SIGCHLD handler:

- signal.signal requires the main thread; the gateway threading model
  makes that placement non-trivial.
- Bounded staleness: at the default interval=60s, the maximum live
  zombie count is one tick's worth of worker completions.
- No interaction with detect_crashed_workers: that function only
  inspects rows where status='running', and rows reach 'done' (and stop
  being inspected) before their workers exit.
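
The reap loop described above can be tried in isolation. The following
is a minimal, POSIX-only sketch, not the gateway code: `reap_children`
and `spawn_and_reap` are hypothetical names used for illustration, and
the short-lived `python -c pass` children stand in for kanban workers.

```python
import os
import subprocess
import sys
import time


def reap_children() -> list[int]:
    """Collect every already-exited child without blocking (POSIX only)."""
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG returns immediately.
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            # This process has no children at all.
            break
        if pid == 0:
            # Children exist, but none has exited yet.
            break
        reaped.append(pid)
    return reaped


def spawn_and_reap(n: int = 3, timeout: float = 10.0) -> tuple[set[int], set[int]]:
    """Spawn n short-lived children, then poll the reap loop until all are collected."""
    # Stand-ins for workers: same Popen flag as described for _default_spawn.
    procs = [
        subprocess.Popen([sys.executable, "-c", "pass"], start_new_session=True)
        for _ in range(n)
    ]
    spawned = {p.pid for p in procs}
    reaped: set[int] = set()
    deadline = time.monotonic() + timeout
    # Each iteration plays the role of one dispatcher tick.
    while not spawned <= reaped and time.monotonic() < deadline:
        reaped.update(reap_children())
        time.sleep(0.05)
    return spawned, reaped


if __name__ == "__main__":
    spawned, reaped = spawn_and_reap()
    print("all spawned children reaped:", spawned <= reaped)
```

One design consequence worth keeping in mind: `waitpid(-1, ...)` collects
any exited child of the process, not only kanban workers, so code that
holds its own `Popen` handle in the same process may no longer see that
child's real exit status.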
---
 hermes_cli/kanban_db.py | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/hermes_cli/kanban_db.py b/hermes_cli/kanban_db.py
index 3b38c124e3..d61eba1916 100644
--- a/hermes_cli/kanban_db.py
+++ b/hermes_cli/kanban_db.py
@@ -3224,6 +3224,25 @@ def dispatch_once(
     ``board`` pins workspace/log/db resolution for this tick to a specific
     board. When omitted, the current-board resolution chain is used.
     """
+    # Reap zombie children from previously spawned workers.
+    # The gateway-embedded dispatcher is the parent of every worker spawned
+    # via _default_spawn (start_new_session=True only detaches the
+    # controlling tty, not the parent). Without an explicit waitpid, each
+    # completed worker becomes a <defunct> entry that lingers until gateway
+    # exit. WNOHANG keeps this non-blocking; ChildProcessError means no
+    # children to reap. Bounded: at most one tick's worth of completions
+    # can be in <defunct> at once.
+    try:
+        while True:
+            try:
+                _pid, _status = os.waitpid(-1, os.WNOHANG)
+            except ChildProcessError:
+                break
+            if _pid == 0:
+                break
+    except Exception:
+        pass
+
     result = DispatchResult()
     result.reclaimed = release_stale_claims(conn)
     result.crashed = detect_crashed_workers(conn)