mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-15 04:12:25 +00:00
fix(kanban): reap completed worker children in dispatch_once
The gateway-embedded dispatcher (default since `kanban.dispatch_in_gateway = true`) is the parent of every spawned kanban worker. `_default_spawn` calls `subprocess.Popen(..., start_new_session=True)` and returns the pid. `start_new_session` detaches the controlling tty but does not reparent to init, so the gateway keeps each worker as a child until it `wait()`s for them. Nothing in the dispatch loop ever calls `waitpid`.

Result: every completed worker becomes a `<defunct>` zombie that lingers until the gateway exits. We hit ~430 zombies on a single hermes-agent container after ~40 days of steady kanban traffic, approaching process-table exhaustion on the host.

Fix: add a non-blocking reap loop at the top of `dispatch_once`, so every dispatcher tick (default 60s) drains zombies that accumulated since the last tick. WNOHANG keeps the call non-blocking; ChildProcessError means no children to reap.

Why here, not a SIGCHLD handler:

- signal.signal requires the main thread; the gateway threading model makes that placement non-trivial.
- Bounded staleness: at default interval=60s the maximum live zombie count is one tick's worth of worker completions.
- No interaction with detect_crashed_workers: that function only inspects rows where status='running', and rows reach 'done' (and stop being inspected) before their workers exit.
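The main-thread restriction on signal.signal can be checked in isolation. The following is a minimal sketch, not gateway code: `try_install_sigchld_handler` and `run_in_worker_thread` are hypothetical names introduced for illustration, and the sketch assumes a POSIX platform where `signal.SIGCHLD` exists.

```python
import signal
import threading


def try_install_sigchld_handler() -> str:
    """Try to install a no-op SIGCHLD handler and report the outcome."""
    try:
        previous = signal.signal(signal.SIGCHLD, lambda signum, frame: None)
    except ValueError as exc:
        # CPython raises ValueError when signal.signal is called
        # from any thread other than the main thread.
        return f"rejected: {exc}"
    if previous is not None:
        # Restore whatever handler was installed before this probe.
        signal.signal(signal.SIGCHLD, previous)
    return "installed"


def run_in_worker_thread() -> str:
    """Run the probe on a non-main thread and return its outcome."""
    outcome = []
    worker = threading.Thread(
        target=lambda: outcome.append(try_install_sigchld_handler())
    )
    worker.start()
    worker.join()
    return outcome[0]


if __name__ == "__main__":
    print("worker thread:", run_in_worker_thread())
```

A dispatcher that runs on a background thread would therefore need the handler installed elsewhere, by whatever code owns the main thread, which is the placement problem the first bullet refers to.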
parent 06f24351c5
commit b49a3f8474
1 changed file with 19 additions and 0 deletions
@@ -3224,6 +3224,25 @@ def dispatch_once(
     ``board`` pins workspace/log/db resolution for this tick to a specific
     board. When omitted, the current-board resolution chain is used.
     """
+    # Reap zombie children from previously spawned workers.
+    # The gateway-embedded dispatcher is the parent of every worker spawned
+    # via _default_spawn (start_new_session=True only detaches the
+    # controlling tty, not the parent). Without an explicit waitpid, each
+    # completed worker becomes a <defunct> entry that lingers until gateway
+    # exit. WNOHANG keeps this non-blocking; ChildProcessError means no
+    # children to reap. Bounded: at most one tick's worth of completions
+    # can be in <defunct> at once.
+    try:
+        while True:
+            try:
+                _pid, _status = os.waitpid(-1, os.WNOHANG)
+            except ChildProcessError:
+                break
+            if _pid == 0:
+                break
+    except Exception:
+        pass
+
     result = DispatchResult()
     result.reclaimed = release_stale_claims(conn)
     result.crashed = detect_crashed_workers(conn)
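The reap loop in this hunk can be exercised outside the gateway. The sketch below is an illustration under stated assumptions, not the project's code: `reap_finished_children` and `spawn_and_reap` are hypothetical names, the children are trivial `python -c pass` processes standing in for kanban workers, and the `Popen` objects are kept in a list only so that `subprocess` itself does not collect the children before the loop does.

```python
import os
import subprocess
import sys
import time


def reap_finished_children() -> list[int]:
    """Collect every already-exited child without blocking; return their pids."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process currently has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def spawn_and_reap(count: int = 3, timeout: float = 10.0) -> tuple[set[int], set[int]]:
    """Spawn short-lived children in new sessions, then reap them by polling."""
    # Holding the Popen objects prevents subprocess from polling (and
    # reaping) these children on its own while the demo runs.
    procs = [
        subprocess.Popen([sys.executable, "-c", "pass"], start_new_session=True)
        for _ in range(count)
    ]
    spawned = {proc.pid for proc in procs}
    reaped: set[int] = set()
    deadline = time.monotonic() + timeout
    # Stand-in for successive dispatcher ticks: poll until every child is reaped.
    while spawned - reaped and time.monotonic() < deadline:
        reaped.update(reap_finished_children())
        time.sleep(0.05)
    return spawned, reaped


if __name__ == "__main__":
    spawned, reaped = spawn_and_reap()
    print("all spawned children reaped:", spawned <= reaped)
```

That the loop collects these children at all confirms they stayed children of the spawning process despite `start_new_session=True`. Note that `os.waitpid(-1, ...)` collects any exited child of the process, not only kanban workers, so code elsewhere in the same process that later polls its own `Popen` object may no longer see that child's real exit status.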