mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-08 03:01:47 +00:00
fix(kanban): reap completed worker children in dispatch_once
The gateway-embedded dispatcher (default since `kanban.dispatch_in_gateway = true`) is the parent of every spawned kanban worker. `_default_spawn` calls `subprocess.Popen(..., start_new_session=True)` and returns the pid — `start_new_session` detaches the controlling tty but does not reparent to init, so the gateway keeps each worker as a child until it `wait()`s for them. Nothing in the dispatch loop ever calls `waitpid`. Result: every completed worker becomes a `<defunct>` zombie that lingers until the gateway exits. We hit ~430 zombies on a single hermes-agent container after ~40 days of steady kanban traffic, approaching process-table exhaustion on the host. Fix: add a non-blocking reap loop at the top of `dispatch_once`, so every dispatcher tick (default 60s) drains zombies that accumulated since the last tick. WNOHANG keeps the call non-blocking; ChildProcessError means no children to reap. Why here, not a SIGCHLD handler: - signal.signal requires the main thread; gateway threading model makes that placement non-trivial. - Bounded staleness: at default interval=60s the maximum live zombie count is one tick's worth of worker completions. - No interaction with detect_crashed_workers: that function only inspects rows where status='running', and rows reach 'done' (and stop being inspected) before their workers exit.
This commit is contained in:
parent
06f24351c5
commit
b49a3f8474
1 changed files with 19 additions and 0 deletions
|
|
@ -3224,6 +3224,25 @@ def dispatch_once(
|
|||
``board`` pins workspace/log/db resolution for this tick to a specific
|
||||
board. When omitted, the current-board resolution chain is used.
|
||||
"""
|
||||
# Reap zombie children from previously spawned workers.
|
||||
# The gateway-embedded dispatcher is the parent of every worker spawned
|
||||
# via _default_spawn (start_new_session=True only detaches the
|
||||
# controlling tty, not the parent). Without an explicit waitpid, each
|
||||
# completed worker becomes a <defunct> entry that lingers until gateway
|
||||
# exit. WNOHANG keeps this non-blocking; ChildProcessError means no
|
||||
# children to reap. Bounded: at most one tick's worth of completions
|
||||
# can be in <defunct> at once.
|
||||
try:
|
||||
while True:
|
||||
try:
|
||||
_pid, _status = os.waitpid(-1, os.WNOHANG)
|
||||
except ChildProcessError:
|
||||
break
|
||||
if _pid == 0:
|
||||
break
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
result = DispatchResult()
|
||||
result.reclaimed = release_stale_claims(conn)
|
||||
result.crashed = detect_crashed_workers(conn)
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue