fix(interrupt): propagate to concurrent-tool workers + opt-in debug trace (#11907)

* fix(interrupt): propagate to concurrent-tool workers + opt-in debug trace

interrupt() previously only flagged the agent's _execution_thread_id.
Tools running inside _execute_tool_calls_concurrent execute on
ThreadPoolExecutor worker threads whose tids are distinct from the
agent's, so is_interrupted() inside those tools returned False no matter
how many times the gateway called .interrupt() — hung ssh / curl / long
make-builds ran to their own timeout.

Changes:
- run_agent.py: track concurrent-tool worker tids in a per-agent set,
  fan interrupt()/clear_interrupt() out to them, and handle the
  register-after-interrupt race at _run_tool entry.  getattr fallback
  for the tracker so test stubs built via object.__new__ keep working.
- tools/environments/base.py: opt-in _wait_for_process trace (ENTER,
  per-30s HEARTBEAT with interrupt+activity-cb state, INTERRUPT
  DETECTED, TIMEOUT, EXIT) behind HERMES_DEBUG_INTERRUPT=1.
- tools/interrupt.py: opt-in set_interrupt() trace (caller tid, target
  tid, set snapshot) behind the same env flag.
- tests: new regression test runs a polling tool on a concurrent worker
  and asserts is_interrupted() flips to True within ~1s of interrupt().
  Second new test guards clear_interrupt() clearing tracked worker bits.

Validation: tests/run_agent/ all 762 pass; tests/tools/ interrupt+env
subset 216 pass.

* fix(interrupt-debug): bypass quiet_mode logger filter so trace reaches agent.log

AIAgent.__init__ sets logging.getLogger('tools').setLevel(ERROR) when
quiet_mode=True (the CLI default). This would silently swallow every
INFO-level trace line from the HERMES_DEBUG_INTERRUPT=1 instrumentation
added in the parent commit — confirmed by running hermes chat -q with
the flag and finding zero trace lines in agent.log even though
_wait_for_process was clearly executing (subprocess pid existed).

Fix: when HERMES_DEBUG_INTERRUPT=1, each traced module explicitly sets
its own logger level to INFO at import time, overriding the 'tools'
parent-level filter. Scoped to the opt-in case only, so production
(quiet_mode default) logs stay quiet as designed.

Validation: hermes chat -q with HERMES_DEBUG_INTERRUPT=1 now writes
'_wait_for_process ENTER/EXIT' lines to agent.log as expected.

* fix(cli): SIGTERM/SIGHUP no longer orphans tool subprocesses

Tool subprocesses spawned by the local environment backend use
os.setsid so they run in their own process group. Before this fix,
SIGTERM/SIGHUP to the hermes CLI killed the main thread via
KeyboardInterrupt but the worker thread running _wait_for_process
never got a chance to call _kill_process — Python exited, the child
was reparented to init (PPID=1), and the subprocess ran to its
natural end (confirmed live: sleep 300 survived 4+ min after SIGTERM
to the agent until manual cleanup).

Changes:
- cli.py _signal_handler (interactive) + _signal_handler_q (-q mode):
  route SIGTERM/SIGHUP through agent.interrupt() so the worker's poll
  loop sees the per-thread interrupt flag and calls _kill_process
  (os.killpg) on the subprocess group. HERMES_SIGTERM_GRACE (default
  1.5s) gives the worker time to complete its SIGTERM+SIGKILL
  escalation before KeyboardInterrupt unwinds main.
- tools/environments/base.py _wait_for_process: wrap the poll loop in
  try/except (KeyboardInterrupt, SystemExit) so the cleanup fires
  even on paths the signal handlers don't cover (direct sys.exit,
  unhandled KI from nested code, etc.). Emits EXCEPTION_EXIT trace
  line when HERMES_DEBUG_INTERRUPT=1.
- New regression test: injects KeyboardInterrupt into a running
  _wait_for_process via PyThreadState_SetAsyncExc, verifies the
  subprocess process group is dead within 3s of the exception and
  that KeyboardInterrupt re-raises cleanly afterward.

Validation:
| Before                                                  | After              |
|---------------------------------------------------------|--------------------|
| sleep 300 survives 4+ min as PPID=1 orphan after SIGTERM | dies within 2 s   |
| No INTERRUPT DETECTED in trace                          | INTERRUPT DETECTED fires + killing process group |
| tests/tools/test_local_interrupt_cleanup                | 1/1 pass          |
| tests/run_agent/test_concurrent_interrupt               | 4/4 pass          |
This commit is contained in:
Teknium 2026-04-17 20:39:25 -07:00 committed by GitHub
parent 607be54a24
commit 20f2258f34
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
6 changed files with 551 additions and 22 deletions

69
cli.py
View file

@ -10067,8 +10067,36 @@ class HermesCLI:
# Register signal handlers for graceful shutdown on SSH disconnect / SIGTERM
def _signal_handler(signum, frame):
"""Handle SIGHUP/SIGTERM by triggering graceful cleanup."""
"""Handle SIGHUP/SIGTERM by triggering graceful cleanup.
Calls ``self.agent.interrupt()`` first so the agent daemon
thread's poll loop sees the per-thread interrupt and kills the
tool's subprocess group via ``_kill_process`` (os.killpg).
Without this, the main thread dies from KeyboardInterrupt and
the daemon thread is killed with it before it can run one
more poll iteration to clean up the subprocess, which was
spawned with ``os.setsid`` and therefore survives as an orphan
with PPID=1.
Grace window (``HERMES_SIGTERM_GRACE``, default 1.5 s) gives
the daemon time to: detect the interrupt (next 200 ms poll)
call _kill_process (SIGTERM + 1 s wait + SIGKILL if needed)
return from _wait_for_process. ``time.sleep`` releases the
GIL so the daemon actually runs during the window.
"""
logger.debug("Received signal %s, triggering graceful shutdown", signum)
try:
if getattr(self, "agent", None) and getattr(self, "_agent_running", False):
self.agent.interrupt(f"received signal {signum}")
import time as _t
try:
_grace = float(os.getenv("HERMES_SIGTERM_GRACE", "1.5"))
except (TypeError, ValueError):
_grace = 1.5
if _grace > 0:
_t.sleep(_grace)
except Exception:
pass # never block signal handling
raise KeyboardInterrupt()
try:
@ -10371,6 +10399,45 @@ def main(
# Register cleanup for single-query mode (interactive mode registers in run())
atexit.register(_run_cleanup)
# Also install signal handlers in single-query / `-q` mode. Interactive
# mode registers its own inside HermesCLI.run(), but `-q` runs
# cli.agent.run_conversation() below and AIAgent spawns worker threads
# for tools — so when SIGTERM arrives on the main thread, raising
# KeyboardInterrupt only unwinds the main thread, not the worker
# running _wait_for_process. Python then exits, the child subprocess
# (spawned with os.setsid, its own process group) is reparented to
# init and keeps running as an orphan.
#
# Fix: route SIGTERM/SIGHUP through agent.interrupt() which sets the
# per-thread interrupt flag the worker's poll loop checks every 200 ms.
# Give the worker a grace window to call _kill_process (SIGTERM to the
# process group, then SIGKILL after 1 s), then raise KeyboardInterrupt
# so main unwinds normally. HERMES_SIGTERM_GRACE overrides the 1.5 s
# default for debugging.
def _signal_handler_q(signum, frame):
logger.debug("Received signal %s in single-query mode", signum)
try:
_agent = getattr(cli, "agent", None)
if _agent is not None:
_agent.interrupt(f"received signal {signum}")
import time as _t
try:
_grace = float(os.getenv("HERMES_SIGTERM_GRACE", "1.5"))
except (TypeError, ValueError):
_grace = 1.5
if _grace > 0:
_t.sleep(_grace)
except Exception:
pass # never block signal handling
raise KeyboardInterrupt()
try:
import signal as _signal
_signal.signal(_signal.SIGTERM, _signal_handler_q)
if hasattr(_signal, "SIGHUP"):
_signal.signal(_signal.SIGHUP, _signal_handler_q)
except Exception:
pass # signal handler may fail in restricted environments
# Handle single query mode
if query or image: