mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
fix(interrupt): propagate to concurrent-tool workers + opt-in debug trace (#11907)
* fix(interrupt): propagate to concurrent-tool workers + opt-in debug trace
interrupt() previously only flagged the agent's _execution_thread_id.
Tools running inside _execute_tool_calls_concurrent execute on
ThreadPoolExecutor worker threads whose tids are distinct from the
agent's, so is_interrupted() inside those tools returned False no matter
how many times the gateway called .interrupt() — hung ssh / curl / long
make-builds ran to their own timeout.
Changes:
- run_agent.py: track concurrent-tool worker tids in a per-agent set,
fan interrupt()/clear_interrupt() out to them, and handle the
register-after-interrupt race at _run_tool entry. getattr fallback
for the tracker so test stubs built via object.__new__ keep working.
- tools/environments/base.py: opt-in _wait_for_process trace (ENTER,
per-30s HEARTBEAT with interrupt+activity-cb state, INTERRUPT
DETECTED, TIMEOUT, EXIT) behind HERMES_DEBUG_INTERRUPT=1.
- tools/interrupt.py: opt-in set_interrupt() trace (caller tid, target
tid, set snapshot) behind the same env flag.
- tests: new regression test runs a polling tool on a concurrent worker
and asserts is_interrupted() flips to True within ~1s of interrupt().
Second new test guards clear_interrupt() clearing tracked worker bits.
Validation: tests/run_agent/ all 762 pass; tests/tools/ interrupt+env
subset 216 pass.
* fix(interrupt-debug): bypass quiet_mode logger filter so trace reaches agent.log
AIAgent.__init__ sets logging.getLogger('tools').setLevel(ERROR) when
quiet_mode=True (the CLI default). This would silently swallow every
INFO-level trace line from the HERMES_DEBUG_INTERRUPT=1 instrumentation
added in the parent commit — confirmed by running hermes chat -q with
the flag and finding zero trace lines in agent.log even though
_wait_for_process was clearly executing (subprocess pid existed).
Fix: when HERMES_DEBUG_INTERRUPT=1, each traced module explicitly sets
its own logger level to INFO at import time, overriding the 'tools'
parent-level filter. Scoped to the opt-in case only, so production
(quiet_mode default) logs stay quiet as designed.
Validation: hermes chat -q with HERMES_DEBUG_INTERRUPT=1 now writes
'_wait_for_process ENTER/EXIT' lines to agent.log as expected.
* fix(cli): SIGTERM/SIGHUP no longer orphans tool subprocesses
Tool subprocesses spawned by the local environment backend use
os.setsid so they run in their own process group. Before this fix,
SIGTERM/SIGHUP to the hermes CLI killed the main thread via
KeyboardInterrupt but the worker thread running _wait_for_process
never got a chance to call _kill_process — Python exited, the child
was reparented to init (PPID=1), and the subprocess ran to its
natural end (confirmed live: sleep 300 survived 4+ min after SIGTERM
to the agent until manual cleanup).
Changes:
- cli.py _signal_handler (interactive) + _signal_handler_q (-q mode):
route SIGTERM/SIGHUP through agent.interrupt() so the worker's poll
loop sees the per-thread interrupt flag and calls _kill_process
(os.killpg) on the subprocess group. HERMES_SIGTERM_GRACE (default
1.5s) gives the worker time to complete its SIGTERM+SIGKILL
escalation before KeyboardInterrupt unwinds main.
- tools/environments/base.py _wait_for_process: wrap the poll loop in
try/except (KeyboardInterrupt, SystemExit) so the cleanup fires
even on paths the signal handlers don't cover (direct sys.exit,
unhandled KI from nested code, etc.). Emits EXCEPTION_EXIT trace
line when HERMES_DEBUG_INTERRUPT=1.
- New regression test: injects KeyboardInterrupt into a running
_wait_for_process via PyThreadState_SetAsyncExc, verifies the
subprocess process group is dead within 3s of the exception and
that KeyboardInterrupt re-raises cleanly afterward.
Validation:
| Before | After |
|---------------------------------------------------------|--------------------|
| sleep 300 survives 4+ min as PPID=1 orphan after SIGTERM | dies within 2 s |
| No INTERRUPT DETECTED in trace | INTERRUPT DETECTED fires + killing process group |
| tests/tools/test_local_interrupt_cleanup | 1/1 pass |
| tests/run_agent/test_concurrent_interrupt | 4/4 pass |
This commit is contained in:
parent
607be54a24
commit
20f2258f34
6 changed files with 551 additions and 22 deletions
69
cli.py
69
cli.py
|
|
@ -10067,8 +10067,36 @@ class HermesCLI:
|
|||
|
||||
# Register signal handlers for graceful shutdown on SSH disconnect / SIGTERM
|
||||
def _signal_handler(signum, frame):
|
||||
"""Handle SIGHUP/SIGTERM by triggering graceful cleanup."""
|
||||
"""Handle SIGHUP/SIGTERM by triggering graceful cleanup.
|
||||
|
||||
Calls ``self.agent.interrupt()`` first so the agent daemon
|
||||
thread's poll loop sees the per-thread interrupt and kills the
|
||||
tool's subprocess group via ``_kill_process`` (os.killpg).
|
||||
Without this, the main thread dies from KeyboardInterrupt and
|
||||
the daemon thread is killed with it — before it can run one
|
||||
more poll iteration to clean up the subprocess, which was
|
||||
spawned with ``os.setsid`` and therefore survives as an orphan
|
||||
with PPID=1.
|
||||
|
||||
Grace window (``HERMES_SIGTERM_GRACE``, default 1.5 s) gives
|
||||
the daemon time to: detect the interrupt (next 200 ms poll) →
|
||||
call _kill_process (SIGTERM + 1 s wait + SIGKILL if needed) →
|
||||
return from _wait_for_process. ``time.sleep`` releases the
|
||||
GIL so the daemon actually runs during the window.
|
||||
"""
|
||||
logger.debug("Received signal %s, triggering graceful shutdown", signum)
|
||||
try:
|
||||
if getattr(self, "agent", None) and getattr(self, "_agent_running", False):
|
||||
self.agent.interrupt(f"received signal {signum}")
|
||||
import time as _t
|
||||
try:
|
||||
_grace = float(os.getenv("HERMES_SIGTERM_GRACE", "1.5"))
|
||||
except (TypeError, ValueError):
|
||||
_grace = 1.5
|
||||
if _grace > 0:
|
||||
_t.sleep(_grace)
|
||||
except Exception:
|
||||
pass # never block signal handling
|
||||
raise KeyboardInterrupt()
|
||||
|
||||
try:
|
||||
|
|
@ -10371,6 +10399,45 @@ def main(
|
|||
|
||||
# Register cleanup for single-query mode (interactive mode registers in run())
|
||||
atexit.register(_run_cleanup)
|
||||
|
||||
# Also install signal handlers in single-query / `-q` mode. Interactive
|
||||
# mode registers its own inside HermesCLI.run(), but `-q` runs
|
||||
# cli.agent.run_conversation() below and AIAgent spawns worker threads
|
||||
# for tools — so when SIGTERM arrives on the main thread, raising
|
||||
# KeyboardInterrupt only unwinds the main thread, not the worker
|
||||
# running _wait_for_process. Python then exits, the child subprocess
|
||||
# (spawned with os.setsid, its own process group) is reparented to
|
||||
# init and keeps running as an orphan.
|
||||
#
|
||||
# Fix: route SIGTERM/SIGHUP through agent.interrupt() which sets the
|
||||
# per-thread interrupt flag the worker's poll loop checks every 200 ms.
|
||||
# Give the worker a grace window to call _kill_process (SIGTERM to the
|
||||
# process group, then SIGKILL after 1 s), then raise KeyboardInterrupt
|
||||
# so main unwinds normally. HERMES_SIGTERM_GRACE overrides the 1.5 s
|
||||
# default for debugging.
|
||||
def _signal_handler_q(signum, frame):
|
||||
logger.debug("Received signal %s in single-query mode", signum)
|
||||
try:
|
||||
_agent = getattr(cli, "agent", None)
|
||||
if _agent is not None:
|
||||
_agent.interrupt(f"received signal {signum}")
|
||||
import time as _t
|
||||
try:
|
||||
_grace = float(os.getenv("HERMES_SIGTERM_GRACE", "1.5"))
|
||||
except (TypeError, ValueError):
|
||||
_grace = 1.5
|
||||
if _grace > 0:
|
||||
_t.sleep(_grace)
|
||||
except Exception:
|
||||
pass # never block signal handling
|
||||
raise KeyboardInterrupt()
|
||||
try:
|
||||
import signal as _signal
|
||||
_signal.signal(_signal.SIGTERM, _signal_handler_q)
|
||||
if hasattr(_signal, "SIGHUP"):
|
||||
_signal.signal(_signal.SIGHUP, _signal_handler_q)
|
||||
except Exception:
|
||||
pass # signal handler may fail in restricted environments
|
||||
|
||||
# Handle single query mode
|
||||
if query or image:
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue