mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-03 07:21:54 +00:00
fix(kanban): SIGTERM on worker must terminate the process (#28181)
The single-query signal handler in cli.py raises KeyboardInterrupt on SIGTERM/SIGHUP. For interactive 'hermes chat -q' that unwinds the main thread cleanly. For kanban workers spawned by the dispatcher, the worker process is likely to have a non-daemon thread alive (terminal _wait_for_process, custom plugins, etc.). With KeyboardInterrupt only the main thread unwinds; the non-daemon thread keeps the process alive, the gateway has already restarted, and the dispatcher's _pid_alive check returns True forever, leaving the task stuck in 'running' indefinitely.

When HERMES_KANBAN_TASK is set (dispatcher-spawned worker), flush logging + stdout/stderr, then os._exit(0) instead of raising KeyboardInterrupt. The kernel reclaims the PID immediately, and the existing zombie-state detection in _pid_alive flips the task to crashed on the next dispatcher tick. detect_crashed_workers then re-spawns it on the following tick; no manual recovery needed.

A SIGALRM(2s) deadman is armed before the flush so a pathological blocking-I/O flush can't wedge the worker forever. In practice the reporter measured flush in <1ms; the alarm is a failsafe, never the common path.

Interactive (non-kanban) chat -q is unchanged: the env-gated branch only fires for dispatcher-spawned workers.

Live verification on this machine:
- Without HERMES_KANBAN_TASK + non-daemon thread alive: process hangs alive 4+ seconds after SIGTERM. Dispatcher's _pid_alive returns True → task stuck.
- With HERMES_KANBAN_TASK + same non-daemon thread: process exits in 0.10s via os._exit(0). Dispatcher reclaims on next tick.

Tests:
- tests/hermes_cli/test_signal_handler_kanban_worker.py (3 cases): end-to-end subprocess test with a non-daemon thread, HERMES_KANBAN_TASK env, SIGTERM, dispatcher-style _pid_alive check. Plus a source-level invariant test catching future refactors that drop the env-gated exit.
- 452/452 kanban tests pass.

Co-authored-by: andrewhosf <andrewho.sf@gmail.com>
parent 3a9bc9d88a
commit f30db14ced
2 changed files with 263 additions and 0 deletions
33
cli.py
@@ -14980,6 +14980,39 @@ def main(
                 time.sleep(_grace)
             except Exception:
                 pass  # never block signal handling
+            # Kanban worker exit path (#28181): SIGTERM hits a dispatcher-spawned
+            # worker that's likely in a non-daemon thread waiting on a child
+            # subprocess in _wait_for_process. Raising KeyboardInterrupt only
+            # unwinds the main thread; the worker thread keeps running, the
+            # process gets reparented to init, and the dispatcher's _pid_alive
+            # check returns True forever, so the task is stuck in 'running' indefinitely.
+            # Skip the controlled-unwind dance and call os._exit(0) so the kernel
+            # reclaims the PID immediately and detect_crashed_workers can reclaim
+            # the stale claim on the next tick. Flush logging + stdout/stderr
+            # first so the final debug trace isn't lost; SIGALRM deadman guards
+            # the flush against any rare blocking-I/O case (the reporter measured
+            # flush in <1ms; the alarm is a failsafe, not the common path).
+            if os.environ.get("HERMES_KANBAN_TASK"):
+                try:
+                    import signal as _sig_mod
+                    if hasattr(_sig_mod, "SIGALRM"):
+                        # Cancel any pre-existing alarm to avoid colliding with
+                        # caller-installed timers.
+                        _sig_mod.signal(_sig_mod.SIGALRM, lambda *_: os._exit(0))
+                        _sig_mod.alarm(2)
+                except Exception:
+                    pass
+                try:
+                    import logging as _lg
+                    _lg.shutdown()
+                except Exception:
+                    pass
+                for _stream in (sys.stdout, sys.stderr):
+                    try:
+                        _stream.flush()
+                    except Exception:
+                        pass
+                os._exit(0)
             raise KeyboardInterrupt()
         try:
             import signal as _signal
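The commit message relies on the dispatcher's _pid_alive treating an exited-but-unreaped (zombie) worker as dead. The dispatcher's real implementation is not shown on this page, so the following is only a hypothetical sketch of such a check, assuming Linux-style /proc; the function name pid_alive and its logic are illustrative, not the project's code.

```python
import os


def pid_alive(pid: int) -> bool:
    """Hypothetical liveness probe: True only for a running, non-zombie process."""
    try:
        os.kill(pid, 0)  # signal 0 checks existence without delivering a signal
    except ProcessLookupError:
        return False  # no such PID
    except PermissionError:
        pass  # process exists but belongs to another user
    try:
        with open(f"/proc/{pid}/stat", "rb") as fh:
            # The state field follows the parenthesised command name.
            state = fh.read().rpartition(b")")[2].split()[0]
    except FileNotFoundError:
        # Either the process vanished in the meantime, or this platform has
        # no /proc; in the latter case fall back to the kill(0) result.
        return not os.path.isdir("/proc/self")
    except OSError:
        return True  # cannot inspect the state; trust the kill(0) result
    return state != b"Z"  # a zombie has exited and only awaits reaping
```

A plain os.kill(pid, 0) check would still report a zombie as alive, which is why a state check of this kind matters for a worker that has called os._exit(0) but has not yet been reaped by its parent.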