fix(kanban): reap completed worker children in dispatch_once

The gateway-embedded dispatcher (default since `kanban.dispatch_in_gateway
= true`) is the parent of every spawned kanban worker. `_default_spawn`
calls `subprocess.Popen(..., start_new_session=True)` and returns the
pid — `start_new_session` detaches the controlling tty but does not
reparent to init, so the gateway keeps each worker as a child until it
`wait()`s for them.

Nothing in the dispatch loop ever calls `waitpid`. Result: every
completed worker becomes a `<defunct>` zombie that lingers until the
gateway exits. We hit ~430 zombies on a single hermes-agent container
after ~40 days of steady kanban traffic, approaching process-table
exhaustion on the host.

Fix: add a non-blocking reap loop at the top of `dispatch_once`, so
every dispatcher tick (default 60s) drains zombies that accumulated
since the last tick. WNOHANG keeps the call non-blocking; ChildProcessError
means no children to reap.

Why here, not a SIGCHLD handler:
- signal.signal requires the main thread; gateway threading model makes
  that placement non-trivial.
- Bounded staleness: at default interval=60s the maximum live zombie
  count is one tick's worth of worker completions.
- No interaction with detect_crashed_workers: that function only inspects
  rows where status='running', and rows reach 'done' (and stop being
  inspected) before their workers exit.
This commit is contained in:
Sonic Chang 2026-05-07 19:39:18 +08:00 committed by Teknium
parent 06f24351c5
commit b49a3f8474

View file

@ -3224,6 +3224,25 @@ def dispatch_once(
``board`` pins workspace/log/db resolution for this tick to a specific
board. When omitted, the current-board resolution chain is used.
"""
# Reap zombie children from previously spawned workers.
# The gateway-embedded dispatcher is the parent of every worker spawned
# via _default_spawn (start_new_session=True only detaches the
# controlling tty, not the parent). Without an explicit waitpid, each
# completed worker becomes a <defunct> entry that lingers until gateway
# exit. WNOHANG keeps this non-blocking; ChildProcessError means no
# children to reap. Bounded: at most one tick's worth of completions
# can be in <defunct> at once.
try:
while True:
try:
_pid, _status = os.waitpid(-1, os.WNOHANG)
except ChildProcessError:
break
if _pid == 0:
break
except Exception:
pass
result = DispatchResult()
result.reclaimed = release_stale_claims(conn)
result.crashed = detect_crashed_workers(conn)