fix(interrupt): propagate to concurrent-tool workers + opt-in debug trace (#11907)

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-04-29 01:31:41 +00:00

* fix(interrupt): propagate to concurrent-tool workers + opt-in debug trace

interrupt() previously flagged only the agent's _execution_thread_id.
Tools running inside _execute_tool_calls_concurrent execute on
ThreadPoolExecutor worker threads whose tids are distinct from the
agent's, so is_interrupted() inside those tools returned False no matter
how many times the gateway called .interrupt(). As a result, hung ssh /
curl / long make builds ran until their own timeouts.

Changes:
- run_agent.py: track concurrent-tool worker tids in a per-agent set,
  fan interrupt()/clear_interrupt() out to them, and handle the
  register-after-interrupt race at _run_tool entry.  getattr fallback
  for the tracker so test stubs built via object.__new__ keep working.
- tools/environments/base.py: opt-in _wait_for_process trace (ENTER,
  per-30s HEARTBEAT with interrupt+activity-cb state, INTERRUPT
  DETECTED, TIMEOUT, EXIT) behind HERMES_DEBUG_INTERRUPT=1.
- tools/interrupt.py: opt-in set_interrupt() trace (caller tid, target
  tid, set snapshot) behind the same env flag.
- tests: new regression test runs a polling tool on a concurrent worker
  and asserts is_interrupted() flips to True within ~1s of interrupt().
  Second new test guards clear_interrupt() clearing tracked worker bits.

Validation: tests/run_agent/ all 762 pass; tests/tools/ interrupt+env
subset 216 pass.
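
The fan-out and the register-after-interrupt check described above can be sketched roughly as follows. This is a minimal illustration, not the run_agent.py code: SketchAgent, _worker_tids, _worker_lock, _interrupt_requested and polling_tool are hypothetical names, and set_interrupt()/is_interrupted() are simplified stand-ins for tools/interrupt.py.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

# Simplified stand-ins for tools/interrupt.py (assumption: a per-thread set).
_interrupted_threads = set()
_lock = threading.Lock()


def set_interrupt(active: bool, thread_id: Optional[int] = None) -> None:
    tid = thread_id if thread_id is not None else threading.get_ident()
    with _lock:
        if active:
            _interrupted_threads.add(tid)
        else:
            _interrupted_threads.discard(tid)


def is_interrupted() -> bool:
    with _lock:
        return threading.get_ident() in _interrupted_threads


class SketchAgent:
    """Illustrative only; attribute names other than _execution_thread_id are made up."""

    def __init__(self) -> None:
        self._execution_thread_id = threading.get_ident()
        self._interrupt_requested = False
        self._worker_tids = set()           # hypothetical per-agent tracker
        self._worker_lock = threading.Lock()

    def _fan_out(self, active: bool) -> None:
        # Flag the agent thread plus every worker registered right now.
        with self._worker_lock:
            self._interrupt_requested = active
            for tid in {self._execution_thread_id, *self._worker_tids}:
                set_interrupt(active, tid)

    def interrupt(self) -> None:
        self._fan_out(True)

    def clear_interrupt(self) -> None:
        self._fan_out(False)

    def _run_tool(self, tool):
        tid = threading.get_ident()
        with self._worker_lock:
            self._worker_tids.add(tid)
            # Register-after-interrupt race: interrupt() may have run before
            # this worker was tracked, so apply the pending flag here.
            if self._interrupt_requested:
                set_interrupt(True, tid)
        try:
            return tool()
        finally:
            with self._worker_lock:
                self._worker_tids.discard(tid)
                set_interrupt(False, tid)   # pool threads are reused


def polling_tool(timeout: float = 5.0) -> str:
    """Stand-in for a long-running tool that polls is_interrupted()."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_interrupted():
            return "interrupted"
        time.sleep(0.01)
    return "timeout"


def demo() -> str:
    agent = SketchAgent()
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(agent._run_tool, polling_tool)
        time.sleep(0.1)       # let the worker register and start polling
        agent.interrupt()     # issued from the agent's own thread
        return future.result(timeout=2)
```

In this sketch the flag is set while holding the tracker lock, so a worker that exits concurrently cannot be left with a stale bit; whether the real code orders things this way is not stated in the commit message.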

* fix(interrupt-debug): bypass quiet_mode logger filter so trace reaches agent.log

AIAgent.__init__ sets logging.getLogger('tools').setLevel(ERROR) when
quiet_mode=True (the CLI default). That silently swallowed every
INFO-level trace line from the HERMES_DEBUG_INTERRUPT=1 instrumentation
added in the parent commit. Confirmed by running hermes chat -q with
the flag and finding zero trace lines in agent.log even though
_wait_for_process was clearly executing (subprocess pid existed).

Fix: when HERMES_DEBUG_INTERRUPT=1, each traced module explicitly sets
its own logger level to INFO at import time, overriding the 'tools'
parent-level filter. Scoped to the opt-in case only, so production
(quiet_mode default) logs stay quiet as designed.

Validation: hermes chat -q with HERMES_DEBUG_INTERRUPT=1 now writes
'_wait_for_process ENTER/EXIT' lines to agent.log as expected.
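
Why a per-module setLevel is enough: a logger with an explicit level stops inheriting its parent's level. A small sketch using the standard logging module; the logger names demo_tools.* are hypothetical stand-ins for the real 'tools' hierarchy.

```python
import logging

# Hypothetical names mirroring the 'tools' -> 'tools.interrupt' hierarchy.
parent = logging.getLogger("demo_tools")
child = logging.getLogger("demo_tools.interrupt")

# What quiet_mode does to the parent: children with no explicit level
# inherit ERROR as their effective level.
parent.setLevel(logging.ERROR)
quiet = child.isEnabledFor(logging.INFO)

# What a traced module does under the debug flag: an explicit level on the
# child takes precedence over the inherited one.
child.setLevel(logging.INFO)
traced = child.isEnabledFor(logging.INFO)

# Sibling loggers that did not opt in stay quiet.
sibling = logging.getLogger("demo_tools.other").isEnabledFor(logging.INFO)
```

Handler levels still apply on top of this; the sketch only covers the logger-level filter.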

* fix(cli): SIGTERM/SIGHUP no longer orphans tool subprocesses

Tool subprocesses spawned by the local environment backend use
os.setsid so they run in their own process group. Before this fix,
SIGTERM/SIGHUP to the hermes CLI killed the main thread via
KeyboardInterrupt but the worker thread running _wait_for_process
never got a chance to call _kill_process: Python exited, the child
was reparented to init (PPID=1), and the subprocess ran to its
natural end (confirmed live: sleep 300 survived 4+ min after SIGTERM
to the agent until manual cleanup).
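
A POSIX-only sketch of the process-group mechanics involved. It uses start_new_session=True, which makes the child call setsid() and is assumed here to be equivalent to the backend's os.setsid usage; spawn_in_own_group and kill_group are hypothetical helpers, not the project's _kill_process.

```python
import os
import signal
import subprocess


def spawn_in_own_group(cmd: list) -> subprocess.Popen:
    # The child becomes a session and process-group leader, so its pgid
    # equals its pid and differs from the parent's group.
    return subprocess.Popen(cmd, start_new_session=True)


def kill_group(proc: subprocess.Popen, grace: float = 1.0) -> None:
    """SIGTERM the whole group, then SIGKILL if it outlives the grace period."""
    try:
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        return  # group already gone
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass
        proc.wait()
```

Because the child sits in its own group, a signal delivered only to the parent never reaches it; someone has to call os.killpg explicitly, which is what the worker thread was not getting the chance to do.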

Changes:
- cli.py _signal_handler (interactive) + _signal_handler_q (-q mode):
  route SIGTERM/SIGHUP through agent.interrupt() so the worker's poll
  loop sees the per-thread interrupt flag and calls _kill_process
  (os.killpg) on the subprocess group. HERMES_SIGTERM_GRACE (default
  1.5s) gives the worker time to complete its SIGTERM+SIGKILL
  escalation before KeyboardInterrupt unwinds main.
- tools/environments/base.py _wait_for_process: wrap the poll loop in
  try/except (KeyboardInterrupt, SystemExit) so the cleanup fires
  even on paths the signal handlers don't cover (direct sys.exit,
  unhandled KI from nested code, etc.). Emits EXCEPTION_EXIT trace
  line when HERMES_DEBUG_INTERRUPT=1.
- New regression test: injects KeyboardInterrupt into a running
  _wait_for_process via PyThreadState_SetAsyncExc, verifies the
  subprocess process group is dead within 3s of the exception and
  that KeyboardInterrupt re-raises cleanly afterward.

Validation:
| Before                                                   | After                                            |
|----------------------------------------------------------|--------------------------------------------------|
| sleep 300 survives 4+ min as PPID=1 orphan after SIGTERM | dies within 2 s                                  |
| No INTERRUPT DETECTED in trace                           | INTERRUPT DETECTED fires + killing process group |

| Test                                      | Result   |
|-------------------------------------------|----------|
| tests/tools/test_local_interrupt_cleanup  | 1/1 pass |
| tests/run_agent/test_concurrent_interrupt | 4/4 pass |

Author: Teknium, 2026-04-17 20:39:25 -07:00 (committed by GitHub)
parent 607be54a24
commit 20f2258f34
GPG key ID: B5690EEEBB952194 (no known key found for this signature in database)

6 changed files with 551 additions and 22 deletions

tools/interrupt.py (22 changed lines)

@@ -14,8 +14,23 @@ Usage in tools:
        return {"output": "[interrupted]", "returncode": 130}
"""

import logging
import os
import threading

logger = logging.getLogger(__name__)

# Opt-in debug tracing — pairs with HERMES_DEBUG_INTERRUPT in
# tools/environments/base.py.  Enables per-call logging of set/check so the
# caller thread, target thread, and current state are visible when
# diagnosing "interrupt signaled but tool never saw it" reports.
_DEBUG_INTERRUPT = bool(os.getenv("HERMES_DEBUG_INTERRUPT"))

if _DEBUG_INTERRUPT:
    # AIAgent's quiet_mode path forces `tools` logger to ERROR on CLI startup.
    # Force our own logger back to INFO so the trace is visible in agent.log.
    logger.setLevel(logging.INFO)

# Set of thread idents that have been interrupted.
_interrupted_threads: set[int] = set()
_lock = threading.Lock()

@@ -35,6 +50,13 @@ def set_interrupt(active: bool, thread_id: int | None = None) -> None:
            _interrupted_threads.add(tid)
        else:
            _interrupted_threads.discard(tid)
        _snapshot = set(_interrupted_threads) if _DEBUG_INTERRUPT else None
    if _DEBUG_INTERRUPT:
        logger.info(
            "[interrupt-debug] set_interrupt(active=%s, target_tid=%s) "
            "called_from_tid=%s current_set=%s",
            active, tid, threading.current_thread().ident, _snapshot,
        )


def is_interrupted() -> bool:
