fix(interrupt): propagate to concurrent-tool workers + opt-in debug trace (#11907)

* fix(interrupt): propagate to concurrent-tool workers + opt-in debug trace interrupt() previously only flagged the agent's _execution_thread_id. Tools running inside _execute_tool_calls_concurrent execute on ThreadPoolExecutor worker threads whose tids are distinct from the agent's, so is_interrupted() inside those tools returned False no matter how many times the gateway called .interrupt() — hung ssh / curl / long make-builds ran to their own timeout. Changes: - run_agent.py: track concurrent-tool worker tids in a per-agent set, fan interrupt()/clear_interrupt() out to them, and handle the register-after-interrupt race at _run_tool entry. getattr fallback for the tracker so test stubs built via object.__new__ keep working. - tools/environments/base.py: opt-in _wait_for_process trace (ENTER, per-30s HEARTBEAT with interrupt+activity-cb state, INTERRUPT DETECTED, TIMEOUT, EXIT) behind HERMES_DEBUG_INTERRUPT=1. - tools/interrupt.py: opt-in set_interrupt() trace (caller tid, target tid, set snapshot) behind the same env flag. - tests: new regression test runs a polling tool on a concurrent worker and asserts is_interrupted() flips to True within ~1s of interrupt(). Second new test guards clear_interrupt() clearing tracked worker bits. Validation: tests/run_agent/ all 762 pass; tests/tools/ interrupt+env subset 216 pass. * fix(interrupt-debug): bypass quiet_mode logger filter so trace reaches agent.log AIAgent.__init__ sets logging.getLogger('tools').setLevel(ERROR) when quiet_mode=True (the CLI default). This would silently swallow every INFO-level trace line from the HERMES_DEBUG_INTERRUPT=1 instrumentation added in the parent commit — confirmed by running hermes chat -q with the flag and finding zero trace lines in agent.log even though _wait_for_process was clearly executing (subprocess pid existed). Fix: when HERMES_DEBUG_INTERRUPT=1, each traced module explicitly sets its own logger level to INFO at import time, overriding the 'tools' parent-level filter. Scoped to the opt-in case only, so production (quiet_mode default) logs stay quiet as designed. Validation: hermes chat -q with HERMES_DEBUG_INTERRUPT=1 now writes '_wait_for_process ENTER/EXIT' lines to agent.log as expected. * fix(cli): SIGTERM/SIGHUP no longer orphans tool subprocesses Tool subprocesses spawned by the local environment backend use os.setsid so they run in their own process group. Before this fix, SIGTERM/SIGHUP to the hermes CLI killed the main thread via KeyboardInterrupt but the worker thread running _wait_for_process never got a chance to call _kill_process — Python exited, the child was reparented to init (PPID=1), and the subprocess ran to its natural end (confirmed live: sleep 300 survived 4+ min after SIGTERM to the agent until manual cleanup). Changes: - cli.py _signal_handler (interactive) + _signal_handler_q (-q mode): route SIGTERM/SIGHUP through agent.interrupt() so the worker's poll loop sees the per-thread interrupt flag and calls _kill_process (os.killpg) on the subprocess group. HERMES_SIGTERM_GRACE (default 1.5s) gives the worker time to complete its SIGTERM+SIGKILL escalation before KeyboardInterrupt unwinds main. - tools/environments/base.py _wait_for_process: wrap the poll loop in try/except (KeyboardInterrupt, SystemExit) so the cleanup fires even on paths the signal handlers don't cover (direct sys.exit, unhandled KI from nested code, etc.). Emits EXCEPTION_EXIT trace line when HERMES_DEBUG_INTERRUPT=1. - New regression test: injects KeyboardInterrupt into a running _wait_for_process via PyThreadState_SetAsyncExc, verifies the subprocess process group is dead within 3s of the exception and that KeyboardInterrupt re-raises cleanly afterward. Validation: | Before | After | |---------------------------------------------------------|--------------------| | sleep 300 survives 4+ min as PPID=1 orphan after SIGTERM | dies within 2 s | | No INTERRUPT DETECTED in trace | INTERRUPT DETECTED fires + killing process group | | tests/tools/test_local_interrupt_cleanup | 1/1 pass | | tests/run_agent/test_concurrent_interrupt | 4/4 pass |
2026-04-25 00:51:20 +00:00 · 2026-04-17 20:39:25 -07:00 · 2026-04-17 20:39:25 -07:00 · 20f2258f34
commit 20f2258f34
parent 607be54a24
6 changed files with 551 additions and 22 deletions
--- a/tests/run_agent/test_concurrent_interrupt.py
+++ b/tests/run_agent/test_concurrent_interrupt.py
@ -23,6 +23,10 @@ def _make_agent(monkeypatch):

    class _Stub:
        _interrupt_requested = False
+        _interrupt_message = None
+        # Bind to this thread's ident so interrupt() targets a real tid.
+        _execution_thread_id = threading.current_thread().ident
+        _interrupt_thread_signal_pending = False
        log_prefix = ""
        quiet_mode = True
        verbose_logging = False
@ -40,6 +44,15 @@ def _make_agent(monkeypatch):
        _current_tool = None
        _last_activity = 0
        _print_fn = print
+        # Worker-thread tracking state mirrored from AIAgent.__init__ so the
+        # real interrupt() method can fan out to concurrent-tool workers.
+        _active_children: list = []
+
+        def __init__(self):
+            # Instance-level (not class-level) so each test gets a fresh set.
+            self._tool_worker_threads: set = set()
+            self._tool_worker_threads_lock = threading.Lock()
+            self._active_children_lock = threading.Lock()

        def _touch_activity(self, desc):
            self._last_activity = time.time()
@ -60,8 +73,10 @@ def _make_agent(monkeypatch):
            return False

    stub = _Stub()
-    # Bind the real methods
+    # Bind the real methods under test
    stub._execute_tool_calls_concurrent = _ra.AIAgent._execute_tool_calls_concurrent.__get__(stub)
+    stub.interrupt = _ra.AIAgent.interrupt.__get__(stub)
+    stub.clear_interrupt = _ra.AIAgent.clear_interrupt.__get__(stub)
    stub._invoke_tool = MagicMock(side_effect=lambda *a, **kw: '{"ok": true}')
    return stub

@ -137,3 +152,109 @@ def test_concurrent_preflight_interrupt_skips_all(monkeypatch):
    assert "skipped due to user interrupt" in messages[1]["content"]
    # _invoke_tool should never have been called
    agent._invoke_tool.assert_not_called()
+
+
+def test_running_concurrent_worker_sees_is_interrupted(monkeypatch):
+    """Regression guard for the "interrupt-doesn't-reach-hung-tool" class of
+    bug Physikal reported in April 2026.
+
+    Before this fix, `AIAgent.interrupt()` called `_set_interrupt(True,
+    _execution_thread_id)` — which only flagged the agent's *main* thread.
+    Tools running inside `_execute_tool_calls_concurrent` execute on
+    ThreadPoolExecutor worker threads whose tids are NOT the agent's, so
+    `is_interrupted()` (which checks the *current* thread's tid) returned
+    False inside those tools no matter how many times the gateway called
+    `.interrupt()`.  Hung ssh / long curl / big make-build tools would run
+    to their own timeout.
+
+    This test runs a fake tool in the concurrent path that polls
+    `is_interrupted()` like a real terminal command does, then calls
+    `agent.interrupt()` from another thread, and asserts the poll sees True
+    within one second.
+    """
+    from tools.interrupt import is_interrupted
+
+    agent = _make_agent(monkeypatch)
+
+    # Counter plus observation hooks so we can prove the worker saw the flip.
+    observed = {"saw_true": False, "poll_count": 0, "worker_tid": None}
+    worker_started = threading.Event()
+
+    def polling_tool(name, args, task_id, call_id=None):
+        observed["worker_tid"] = threading.current_thread().ident
+        worker_started.set()
+        deadline = time.monotonic() + 5.0
+        while time.monotonic() < deadline:
+            observed["poll_count"] += 1
+            if is_interrupted():
+                observed["saw_true"] = True
+                return '{"interrupted": true}'
+            time.sleep(0.05)
+        return '{"timed_out": true}'
+
+    agent._invoke_tool = MagicMock(side_effect=polling_tool)
+
+    tc1 = _FakeToolCall("hung_fake_tool_1", call_id="tc1")
+    tc2 = _FakeToolCall("hung_fake_tool_2", call_id="tc2")
+    msg = _FakeAssistantMsg([tc1, tc2])
+    messages = []
+
+    def _interrupt_after_start():
+        # Wait until at least one worker is running so its tid is tracked.
+        worker_started.wait(timeout=2.0)
+        time.sleep(0.2)  # let the other worker enter too
+        agent.interrupt("stop requested by test")
+
+    t = threading.Thread(target=_interrupt_after_start)
+    t.start()
+    start = time.monotonic()
+    agent._execute_tool_calls_concurrent(msg, messages, "test_task")
+    elapsed = time.monotonic() - start
+    t.join(timeout=2.0)
+
+    # The worker must have actually polled is_interrupted — otherwise the
+    # test isn't exercising what it claims to.
+    assert observed["poll_count"] > 0, (
+        "polling_tool never ran — test scaffold issue"
+    )
+    # The worker must see the interrupt within ~1 s of agent.interrupt()
+    # being called.  Before the fix this loop ran until its 5 s own-timeout.
+    assert observed["saw_true"], (
+        f"is_interrupted() never returned True inside the concurrent worker "
+        f"after agent.interrupt() — interrupt-propagation hole regressed. "
+        f"worker_tid={observed['worker_tid']!r} poll_count={observed['poll_count']}"
+    )
+    assert elapsed < 3.0, (
+        f"concurrent execution took {elapsed:.2f}s after interrupt — the fan-out "
+        f"to worker tids didn't shortcut the tool's poll loop as expected"
+    )
+    # Also verify cleanup: no stale worker tids should remain after all
+    # tools finished.
+    assert agent._tool_worker_threads == set(), (
+        f"worker tids leaked after run: {agent._tool_worker_threads}"
+    )
+
+
+def test_clear_interrupt_clears_worker_tids(monkeypatch):
+    """After clear_interrupt(), stale worker-tid bits must be cleared so the
+    next turn's tools — which may be scheduled onto recycled tids — don't
+    see a false interrupt."""
+    from tools.interrupt import is_interrupted, set_interrupt
+
+    agent = _make_agent(monkeypatch)
+    # Simulate a worker having registered but not yet exited cleanly (e.g. a
+    # hypothetical bug in the tear-down).  Put a fake tid in the set and
+    # flag it interrupted.
+    fake_tid = threading.current_thread().ident  # use real tid so is_interrupted can see it
+    with agent._tool_worker_threads_lock:
+        agent._tool_worker_threads.add(fake_tid)
+    set_interrupt(True, fake_tid)
+    assert is_interrupted() is True  # sanity
+
+    agent.clear_interrupt()
+
+    assert is_interrupted() is False, (
+        "clear_interrupt() did not clear the interrupt bit for a tracked "
+        "worker tid — stale interrupt can leak into the next turn"
+    )
+