fix(gateway): drain on Windows hermes gateway stop so sessions survive restart (#33798)

Sessions now survive `hermes gateway stop` / `restart` on native Windows.
Previously the gateway died on schtasks `/End` + os.kill SIGTERM without
ever running the drain loop, so the v0.13.0 session-resume feature (#21192)
silently broke on Windows: `resume_pending=True` was never written, and
the next boot started with a blank conversation history (issue #33778).

Root cause is twofold and the reporter only identified half of it:

1. `hermes_cli/gateway_windows.py::stop()` did not write the
   `planned_stop_marker` before signalling. The reporter caught this.

2. The bigger reason: `asyncio.add_signal_handler` raises
   NotImplementedError for SIGTERM/SIGINT on Windows, so even if the
   marker had been written, the gateway's existing SIGTERM handler
   (which is what calls `runner.stop()` and the `mark_resume_pending`
   loop) was never invoked. Writing the marker would have been
   necessary-but-insufficient.

The fix has two parts:

* gateway/run.py: new `_run_planned_stop_watcher` daemon thread polls
  for the planned-stop marker file every 0.5s. When the marker appears
  it `loop.call_soon_threadsafe(shutdown_signal_handler, None)` — the
  same shutdown path a real SIGTERM would have driven, including the
  pre-drain `mark_resume_pending` writes (run.py:5977) and graceful
  drain wait. The existing signal handler already accepts
  `received_signal=None` and falls through to
  `consume_planned_stop_marker_for_self()`, so no handler changes
  needed. Runs on every platform as cheap belt-and-suspenders.

* hermes_cli/gateway_windows.py: `stop()` now writes the marker for
  the running gateway PID and waits up to `agent.restart_drain_timeout`
  (default 30s) for the PID to exit cleanly. On clean drain, the kill
  sweep is non-forceful; on timeout, escalates to
  `kill_gateway_processes(force=True)` which routes to taskkill /T /F
  per `references/windows-native-support.md`.

Validation:

* 7 new tests in tests/gateway/test_planned_stop_watcher.py covering:
  marker→handler dispatch, no-marker idle, already-draining skip,
  not-yet-running skip, stop_event responsiveness, fire-once
  semantics, error tolerance.
* 8 new tests in tests/hermes_cli/test_gateway_windows.py covering:
  marker-before-kill ordering, clean-drain skips force-kill,
  drain-timeout escalates to force=True, no-pid-skips-drain,
  invalid-pid handling, fast-exit success, timeout failure,
  marker-write-failure tolerance.
* E2E (Linux, detached orphan): write_planned_stop_marker(pid) +
  `_drain_gateway_pid(pid, 5.0)` returns True in 0.5s after the
  victim sees the marker and exits. Tested with a double-forked
  subprocess so the test parent isn't holding it as a zombie.
* Targeted: tests/gateway/{restart_drain,restart_resume_pending,
  signal,signal_format,status,shutdown_forensics,approve_deny_commands,
  planned_stop_watcher} + tests/hermes_cli/{gateway_windows,
  gateway_service} → 519/519.

What was wrong with the reporter's claim (for future archaeology): they
described the symptom as "no `resume_pending=True` written to
`sessions.json`" — but Hermes uses `state.db` (SQLite), not
`sessions.json`, and `mark_resume_pending` is called regardless of
the marker (the marker only affects exit code 0 vs 1 for systemd
revival semantics). The real session-loss path is the missing drain
on Windows, not a missing marker. Both halves are fixed here.

Closes #33778.
This commit is contained in:
Teknium 2026-05-28 03:25:32 -07:00 committed by GitHub
parent f8896dedc8
commit 10ee4a729b
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
4 changed files with 649 additions and 9 deletions

View file

@ -18193,6 +18193,72 @@ class GatewayRunner:
return response
def _run_planned_stop_watcher(
stop_event: threading.Event,
runner,
loop: asyncio.AbstractEventLoop,
shutdown_handler,
*,
poll_interval: float = 0.5,
) -> None:
"""Poll for the planned-stop marker and trigger graceful shutdown.
On Windows, ``asyncio.add_signal_handler`` raises NotImplementedError
for SIGTERM/SIGINT, so the standard signal-driven shutdown path
never runs when ``hermes gateway stop`` signals the gateway. The
consequence is that the drain loop is skipped in-flight agent
sessions are killed mid-turn and ``resume_pending`` is never set,
so the next gateway boot has no idea those sessions need to be
auto-resumed (issue #33778, v0.13.0 session-resume feature broken
on native Windows).
This watcher runs on every platform (cheap, defensive) and bridges
the gap on Windows by translating a filesystem marker into the
same shutdown-handler invocation a real SIGTERM would have produced
on POSIX. The CLI's ``hermes_cli.gateway_windows.stop()`` writes
the marker via ``write_planned_stop_marker(pid)`` and then waits
for the gateway PID to exit; this watcher is what makes that
exit happen cleanly.
On POSIX this is a no-op safety net the signal handler always
races us to consuming the marker file because it fires synchronously
from the kernel's signal delivery.
Args:
stop_event: cleared by start_gateway() during normal shutdown
to tell the watcher to exit.
runner: the GatewayRunner instance; we check ``_running`` and
``_draining`` to avoid triggering shutdown if the gateway
is already in one of those states.
loop: the asyncio event loop the shutdown handler must run on.
shutdown_handler: same callable that's wired to SIGTERM —
tolerates a ``None`` signal argument (planned stop case)
and consumes the marker via
``consume_planned_stop_marker_for_self()``.
poll_interval: seconds between marker checks. 0.5s gives a
responsive shutdown without burning CPU.
"""
from gateway.status import _get_planned_stop_marker_path
marker_path = _get_planned_stop_marker_path()
while not stop_event.is_set():
try:
if (
marker_path.exists()
and not getattr(runner, "_draining", False)
and getattr(runner, "_running", False)
):
# Drive the same path as a real signal handler.
# Pass signal=None — the handler tolerates that and consumes
# the marker via consume_planned_stop_marker_for_self,
# which also validates target_pid + start_time match us.
loop.call_soon_threadsafe(shutdown_handler, None)
# Done — the handler will set _draining; we exit on next tick.
break
except Exception as _e:
logger.debug("Planned-stop watcher tick error: %s", _e)
stop_event.wait(poll_interval)
def _start_cron_ticker(stop_event: threading.Event, adapters=None, loop=None, interval: int = 60):
"""
Background thread that ticks the cron scheduler at a regular interval.
@ -18597,7 +18663,28 @@ async def start_gateway(config: Optional[GatewayConfig] = None, replace: bool =
pass
else:
logger.info("Skipping signal handlers (not running in main thread).")
# Windows fallback: asyncio.add_signal_handler raises NotImplementedError
# on Windows, so `hermes gateway stop`'s SIGTERM (which Python maps to
# TerminateProcess on Windows) never invokes shutdown_signal_handler.
# That means the drain loop never runs, mark_resume_pending never fires,
# and sessions are silently lost across restarts (issue #33778).
#
# The fix is a marker-polling thread: `hermes gateway stop` writes the
# planned-stop marker BEFORE killing, and this thread notices it and
# drives the same shutdown path the signal handler would have. Runs
# on every platform (cheap, defensive) so non-signal-bearing
# environments (Windows native, sandboxed CI runners that mask
# SIGTERM) still get a clean drain.
_planned_stop_watcher_stop = threading.Event()
_planned_stop_watcher_thread = threading.Thread(
target=_run_planned_stop_watcher,
args=(_planned_stop_watcher_stop, runner, loop, shutdown_signal_handler),
daemon=True,
name="planned-stop-watcher",
)
_planned_stop_watcher_thread.start()
# Claim the PID file BEFORE bringing up any platform adapters.
# This closes the --replace race window: two concurrent `gateway run
# --replace` invocations both pass the termination-wait above, but
@ -18675,6 +18762,10 @@ async def start_gateway(config: Optional[GatewayConfig] = None, replace: bool =
cron_stop.set()
cron_thread.join(timeout=5)
# Stop the planned-stop watcher (daemon=True so this is belt-and-suspenders).
_planned_stop_watcher_stop.set()
_planned_stop_watcher_thread.join(timeout=2)
# Close MCP server connections
try:
from tools.mcp_tool import shutdown_mcp_servers

View file

@ -1014,12 +1014,70 @@ def start() -> None:
_report_gateway_start(f"direct spawn (PID {pid})")
def stop() -> None:
"""Stop the gateway. Tries /End on the scheduled task, then kills any stragglers."""
_assert_windows()
from hermes_cli.gateway import kill_gateway_processes
def _drain_gateway_pid(pid: int, drain_timeout: float) -> bool:
"""Write the planned-stop marker and wait for the gateway PID to exit.
stopped_any = False
Windows cannot deliver POSIX signals to a Python asyncio loop
(``loop.add_signal_handler`` raises NotImplementedError), so writing
the marker is the ONLY way to ask a running gateway to drain
in-flight agents and persist ``resume_pending`` before exit. The
gateway's planned-stop watcher thread (gateway/run.py) polls for
the marker and drives the same shutdown path the SIGTERM handler
would have on POSIX.
Returns True if the PID exited within the timeout, False if it
didn't (caller should escalate to schtasks /End + taskkill).
"""
if pid <= 0:
return False
try:
from gateway.status import write_planned_stop_marker, _pid_exists
except ImportError:
return False
try:
write_planned_stop_marker(pid)
except Exception:
# Best-effort: if the marker can't be written, we have no choice
# but to fall through to a hard kill. Caller decides escalation.
pass
deadline = time.monotonic() + max(drain_timeout, 1.0)
while time.monotonic() < deadline:
if not _pid_exists(pid):
return True
time.sleep(0.5)
return False
def stop() -> None:
"""Stop the gateway.
Writes the planned-stop marker first so the gateway can drain
in-flight agents and persist ``resume_pending`` before exit (the
gateway's marker-watcher thread picks this up — Windows asyncio
can't deliver SIGTERM to the loop, so the marker is our only IPC).
Then escalates: ``schtasks /End`` (kills the scheduled-task tree)
+ ``kill_gateway_processes(force=True)`` for any strays.
"""
_assert_windows()
from hermes_cli.gateway import kill_gateway_processes, _get_restart_drain_timeout
from gateway.status import get_running_pid
# Phase 1: ask the running gateway (if any) to drain itself by writing
# the planned-stop marker, then wait briefly for it to exit cleanly.
# On clean exit, sessions land with resume_pending=True and the next
# boot will auto-resume them.
pid = get_running_pid()
drained = False
if pid is not None:
try:
drain_timeout = float(_get_restart_drain_timeout() or 30.0)
except Exception:
drain_timeout = 30.0
drained = _drain_gateway_pid(pid, drain_timeout)
stopped_any = drained
if is_task_registered():
code, _out, err = _exec_schtasks(["/End", "/TN", get_task_name()])
# schtasks returns nonzero when the task isn't currently running — don't treat that as an error.
@ -1028,12 +1086,19 @@ def stop() -> None:
elif "not running" not in (err or "").lower():
print(f"⚠ schtasks /End returned code {code}: {err.strip()}")
killed = kill_gateway_processes(all_profiles=False)
# Phase 3: hard-kill any strays. When drain succeeded this is a no-op;
# when drain timed out this is the escalation that ensures the PID
# actually exits. Use force=True on Windows so taskkill /T /F walks
# the descendant tree (browser helpers, etc.).
killed = kill_gateway_processes(all_profiles=False, force=not drained)
if killed:
stopped_any = True
print(f"✓ Killed {killed} gateway process(es)")
if stopped_any:
print("✓ Gateway stopped")
if drained:
print("✓ Gateway stopped (drained cleanly)")
else:
print("✓ Gateway stopped")
else:
print("✗ No gateway was running")

View file

@ -0,0 +1,267 @@
"""Tests for the planned-stop marker watcher thread (gateway/run.py).
The watcher is the Windows-fallback path for the v0.13.0 session-resume
feature on Windows ``asyncio.add_signal_handler`` raises
NotImplementedError, so the SIGTERM signal handler never runs and the
shutdown drain (which writes ``resume_pending=True``) is skipped. The
watcher closes this gap by polling for the planned-stop marker file
and translating its existence into the same shutdown-handler call a
real SIGTERM would have produced.
See issue #33778 for the original Windows session-loss bug report.
"""
import asyncio
import threading
import time
from types import SimpleNamespace
from unittest.mock import MagicMock
import pytest
from gateway.run import _run_planned_stop_watcher
class _FakeRunner:
"""Stand-in for GatewayRunner — only exposes the two flags the watcher reads."""
def __init__(self, *, running: bool = True, draining: bool = False):
self._running = running
self._draining = draining
def _make_loop_capturing_calls():
"""Build a fake asyncio loop whose call_soon_threadsafe records its args."""
loop = MagicMock(spec=asyncio.AbstractEventLoop)
loop._captured = []
def fake_call_soon_threadsafe(fn, *args):
loop._captured.append((fn, args))
loop.call_soon_threadsafe = fake_call_soon_threadsafe
return loop
def test_watcher_fires_shutdown_when_marker_appears(tmp_path, monkeypatch):
"""When the marker file exists, the watcher must call the shutdown handler."""
marker = tmp_path / ".gateway-planned-stop.json"
# Patch the marker-path resolver so the watcher polls our temp location.
from gateway import status as status_mod
monkeypatch.setattr(status_mod, "_get_planned_stop_marker_path", lambda: marker)
runner = _FakeRunner(running=True, draining=False)
loop = _make_loop_capturing_calls()
shutdown_handler = MagicMock(name="shutdown_signal_handler")
stop_event = threading.Event()
# Drop the marker before the thread starts.
marker.write_text('{"target_pid": 1234}', encoding="utf-8")
watcher = threading.Thread(
target=_run_planned_stop_watcher,
args=(stop_event, runner, loop, shutdown_handler),
kwargs={"poll_interval": 0.05},
daemon=True,
)
watcher.start()
watcher.join(timeout=2.0)
assert not watcher.is_alive(), "Watcher should exit after firing"
assert len(loop._captured) == 1, (
f"Expected exactly one shutdown invocation, got {loop._captured}"
)
fn, args = loop._captured[0]
assert fn is shutdown_handler
# The handler must be called with signal=None (planned stop sentinel).
assert args == (None,)
def test_watcher_does_not_fire_when_marker_absent(tmp_path, monkeypatch):
"""No marker = no shutdown call. Watcher just spins until stop_event."""
marker = tmp_path / ".gateway-planned-stop.json"
# Deliberately do NOT create the marker.
from gateway import status as status_mod
monkeypatch.setattr(status_mod, "_get_planned_stop_marker_path", lambda: marker)
runner = _FakeRunner(running=True, draining=False)
loop = _make_loop_capturing_calls()
shutdown_handler = MagicMock()
stop_event = threading.Event()
watcher = threading.Thread(
target=_run_planned_stop_watcher,
args=(stop_event, runner, loop, shutdown_handler),
kwargs={"poll_interval": 0.05},
daemon=True,
)
watcher.start()
time.sleep(0.3) # let it poll a few times
stop_event.set()
watcher.join(timeout=2.0)
assert not watcher.is_alive()
assert loop._captured == [], (
f"No marker present, but watcher fired shutdown: {loop._captured}"
)
shutdown_handler.assert_not_called()
def test_watcher_skips_when_runner_already_draining(tmp_path, monkeypatch):
"""If shutdown is already in progress, don't re-fire the handler.
This prevents a race where the SIGTERM handler is mid-drain and the
watcher would double-tap the shutdown path. We check ``_draining``
so the watcher backs off once any shutdown is in flight.
"""
marker = tmp_path / ".gateway-planned-stop.json"
marker.write_text('{"target_pid": 1234}', encoding="utf-8")
from gateway import status as status_mod
monkeypatch.setattr(status_mod, "_get_planned_stop_marker_path", lambda: marker)
# Already draining — watcher should be a no-op.
runner = _FakeRunner(running=False, draining=True)
loop = _make_loop_capturing_calls()
shutdown_handler = MagicMock()
stop_event = threading.Event()
watcher = threading.Thread(
target=_run_planned_stop_watcher,
args=(stop_event, runner, loop, shutdown_handler),
kwargs={"poll_interval": 0.05},
daemon=True,
)
watcher.start()
time.sleep(0.2)
stop_event.set()
watcher.join(timeout=2.0)
assert loop._captured == [], "Watcher fired while runner was already draining"
def test_watcher_skips_when_runner_not_started(tmp_path, monkeypatch):
"""If the runner hasn't started, the marker is for a previous instance —
we shouldn't shutdown a not-yet-running gateway.
"""
marker = tmp_path / ".gateway-planned-stop.json"
marker.write_text('{"target_pid": 9999}', encoding="utf-8")
from gateway import status as status_mod
monkeypatch.setattr(status_mod, "_get_planned_stop_marker_path", lambda: marker)
runner = _FakeRunner(running=False, draining=False)
loop = _make_loop_capturing_calls()
shutdown_handler = MagicMock()
stop_event = threading.Event()
watcher = threading.Thread(
target=_run_planned_stop_watcher,
args=(stop_event, runner, loop, shutdown_handler),
kwargs={"poll_interval": 0.05},
daemon=True,
)
watcher.start()
time.sleep(0.2)
stop_event.set()
watcher.join(timeout=2.0)
assert loop._captured == [], "Watcher fired before runner was running"
def test_watcher_responds_to_stop_event_promptly(tmp_path, monkeypatch):
"""Setting stop_event must exit the watcher within ~poll_interval seconds."""
marker = tmp_path / ".gateway-planned-stop.json"
from gateway import status as status_mod
monkeypatch.setattr(status_mod, "_get_planned_stop_marker_path", lambda: marker)
runner = _FakeRunner(running=True, draining=False)
loop = _make_loop_capturing_calls()
stop_event = threading.Event()
watcher = threading.Thread(
target=_run_planned_stop_watcher,
args=(stop_event, runner, loop, MagicMock()),
kwargs={"poll_interval": 0.1},
daemon=True,
)
watcher.start()
time.sleep(0.05)
started_stop = time.monotonic()
stop_event.set()
watcher.join(timeout=2.0)
elapsed = time.monotonic() - started_stop
assert not watcher.is_alive()
assert elapsed < 0.5, f"Watcher took {elapsed:.2f}s to honour stop_event"
def test_watcher_fires_only_once_when_marker_persists(tmp_path, monkeypatch):
"""Marker file existing for multiple polls must NOT spam the handler.
The watcher fires once and exits its loop (the shutdown handler is
responsible for consuming the marker on its own thread). If we
re-fired on every tick, the handler would be invoked dozens of
times before the gateway actually shuts down.
"""
marker = tmp_path / ".gateway-planned-stop.json"
marker.write_text('{"target_pid": 1234}', encoding="utf-8")
from gateway import status as status_mod
monkeypatch.setattr(status_mod, "_get_planned_stop_marker_path", lambda: marker)
runner = _FakeRunner(running=True, draining=False)
loop = _make_loop_capturing_calls()
stop_event = threading.Event()
watcher = threading.Thread(
target=_run_planned_stop_watcher,
args=(stop_event, runner, loop, MagicMock()),
kwargs={"poll_interval": 0.05},
daemon=True,
)
watcher.start()
# Let the watcher tick several times — but it should exit after the first fire.
watcher.join(timeout=1.0)
assert not watcher.is_alive()
assert len(loop._captured) == 1, (
f"Watcher fired {len(loop._captured)} times; should fire once "
f"and exit (events={loop._captured})"
)
def test_watcher_tolerates_marker_path_resolution_errors(tmp_path, monkeypatch, caplog):
"""If _get_planned_stop_marker_path() raises, the watcher logs and continues."""
from gateway import status as status_mod
call_count = [0]
def explode():
call_count[0] += 1
# First call (the one outside the loop, at thread start) is fine —
# but subsequent .exists() calls on a corrupt Path could explode.
if call_count[0] == 1:
return tmp_path / "nonexistent"
raise OSError("filesystem failed")
monkeypatch.setattr(status_mod, "_get_planned_stop_marker_path", explode)
runner = _FakeRunner(running=True, draining=False)
loop = _make_loop_capturing_calls()
stop_event = threading.Event()
watcher = threading.Thread(
target=_run_planned_stop_watcher,
args=(stop_event, runner, loop, MagicMock()),
kwargs={"poll_interval": 0.05},
daemon=True,
)
watcher.start()
time.sleep(0.2)
stop_event.set()
watcher.join(timeout=2.0)
assert not watcher.is_alive(), "Watcher should still honour stop_event after errors"
# No shutdown fired because the marker never reported existence.
assert loop._captured == []

View file

@ -481,4 +481,221 @@ def test_uninstall_access_denied_declined_keeps_task_and_cleans_files(monkeypatc
out = capsys.readouterr().out
assert "Skipped elevation" in out
assert "UAC is Windows' admin approval prompt" in out
assert "Scheduled Task still registered" in out
assert "Scheduled Task still registered" in out
# ---------------------------------------------------------------------------
# stop() drain semantics — issue #33778
#
# Background: on Windows, asyncio.add_signal_handler raises NotImplementedError,
# so the gateway's SIGTERM handler (which drains in-flight agents and writes
# resume_pending=True) never fires when `hermes gateway stop` kills the
# process. The fix: stop() writes the planned_stop_marker first, waits for
# the gateway's marker-watcher thread to drain + exit cleanly, then escalates
# to taskkill if drain times out.
# ---------------------------------------------------------------------------
def test_stop_writes_planned_stop_marker_before_killing(monkeypatch):
"""stop() must write the planned-stop marker BEFORE any kill signal.
Without this, the gateway's drain loop never runs on Windows and
sessions silently lose context across restarts.
"""
pid = 99999
events = []
monkeypatch.setattr(gateway_windows, "_assert_windows", lambda: None)
monkeypatch.setattr(gateway_windows, "is_task_registered", lambda: False)
# Stub the marker write so we can record the order of operations.
from gateway import status as status_mod
def fake_write_marker(target_pid):
events.append(("write_marker", target_pid))
return True
def fake_pid_exists(check_pid):
# Drain succeeds: pid "exits" right after the marker write.
return ("write_marker", pid) not in events
monkeypatch.setattr(status_mod, "write_planned_stop_marker", fake_write_marker)
monkeypatch.setattr(status_mod, "_pid_exists", fake_pid_exists)
monkeypatch.setattr(status_mod, "get_running_pid", lambda: pid)
def fake_kill(**kwargs):
events.append(("kill", kwargs.get("force", False)))
return 0
monkeypatch.setattr("hermes_cli.gateway.kill_gateway_processes", fake_kill)
monkeypatch.setattr("hermes_cli.gateway._get_restart_drain_timeout", lambda: 5.0)
gateway_windows.stop()
# Marker MUST be written before any kill.
kinds = [e[0] for e in events]
assert "write_marker" in kinds, "stop() never wrote the planned-stop marker"
marker_idx = kinds.index("write_marker")
kill_idx = kinds.index("kill") if "kill" in kinds else len(kinds)
assert marker_idx < kill_idx, (
f"stop() killed before writing the marker (events={events})"
)
def test_stop_waits_for_graceful_drain_before_force_kill(monkeypatch):
"""When drain succeeds, stop() should NOT force-kill the gateway.
drained=True means the gateway exited cleanly after seeing the
marker escalating to taskkill /F afterwards would be wasted
work and may emit confusing "killed N processes" output.
"""
pid = 88888
events = []
monkeypatch.setattr(gateway_windows, "_assert_windows", lambda: None)
monkeypatch.setattr(gateway_windows, "is_task_registered", lambda: False)
from gateway import status as status_mod
monkeypatch.setattr(status_mod, "write_planned_stop_marker", lambda p: True)
# Simulate the gateway exiting cleanly after one poll tick.
poll_count = [0]
def fake_pid_exists(check_pid):
poll_count[0] += 1
return poll_count[0] < 2 # alive on first poll, gone on second
monkeypatch.setattr(status_mod, "_pid_exists", fake_pid_exists)
monkeypatch.setattr(status_mod, "get_running_pid", lambda: pid)
def fake_kill(**kwargs):
events.append(("kill", kwargs.get("force", False)))
return 0
monkeypatch.setattr("hermes_cli.gateway.kill_gateway_processes", fake_kill)
monkeypatch.setattr("hermes_cli.gateway._get_restart_drain_timeout", lambda: 5.0)
gateway_windows.stop()
# kill_gateway_processes is still called as the no-op sweep, but
# NOT with force=True — drain succeeded, gateway is already gone.
assert events == [("kill", False)], (
f"After clean drain, force kill should be disabled (events={events})"
)
def test_stop_escalates_to_force_kill_when_drain_times_out(monkeypatch):
"""When drain times out, stop() MUST escalate to force=True.
Drain timeout = gateway is stuck or unresponsive. Without the
taskkill /T /F escalation, the gateway stays alive and the next
`hermes gateway start` fails with "another instance is running".
"""
pid = 77777
events = []
monkeypatch.setattr(gateway_windows, "_assert_windows", lambda: None)
monkeypatch.setattr(gateway_windows, "is_task_registered", lambda: False)
from gateway import status as status_mod
monkeypatch.setattr(status_mod, "write_planned_stop_marker", lambda p: True)
# PID never exits — drain times out.
monkeypatch.setattr(status_mod, "_pid_exists", lambda check_pid: True)
monkeypatch.setattr(status_mod, "get_running_pid", lambda: pid)
def fake_kill(**kwargs):
events.append(("kill", kwargs.get("force", False)))
return 1
monkeypatch.setattr("hermes_cli.gateway.kill_gateway_processes", fake_kill)
# Tiny drain timeout to keep the test fast.
monkeypatch.setattr("hermes_cli.gateway._get_restart_drain_timeout", lambda: 1.0)
gateway_windows.stop()
# When drain times out, kill is invoked with force=True so taskkill /T /F
# walks the process tree.
assert events == [("kill", True)], (
f"After drain timeout, kill must use force=True (events={events})"
)
def test_stop_no_running_gateway_skips_drain(monkeypatch):
"""When no gateway is running, skip the drain wait entirely."""
events = []
monkeypatch.setattr(gateway_windows, "_assert_windows", lambda: None)
monkeypatch.setattr(gateway_windows, "is_task_registered", lambda: False)
from gateway import status as status_mod
monkeypatch.setattr(status_mod, "get_running_pid", lambda: None)
def fake_write_marker(target_pid):
events.append(("write_marker", target_pid))
return True
monkeypatch.setattr(status_mod, "write_planned_stop_marker", fake_write_marker)
monkeypatch.setattr(status_mod, "_pid_exists", lambda check_pid: False)
def fake_kill(**kwargs):
events.append(("kill", kwargs.get("force", False)))
return 0
monkeypatch.setattr("hermes_cli.gateway.kill_gateway_processes", fake_kill)
monkeypatch.setattr("hermes_cli.gateway._get_restart_drain_timeout", lambda: 5.0)
gateway_windows.stop()
# With no PID to drain, no marker is written. Kill sweep still runs
# (defensive — covers the case where a stray gateway is alive without
# a PID file). force=True because drained=False.
assert ("write_marker", None) not in events
assert all(e[0] != "write_marker" for e in events), (
f"Should not write marker when no PID is running (events={events})"
)
assert events == [("kill", True)]
def test_drain_helper_handles_invalid_pid(monkeypatch):
"""_drain_gateway_pid returns False for invalid PIDs without crashing."""
assert gateway_windows._drain_gateway_pid(0, 5.0) is False
assert gateway_windows._drain_gateway_pid(-1, 5.0) is False
def test_drain_helper_returns_true_when_pid_exits_quickly(monkeypatch):
"""_drain_gateway_pid polls _pid_exists until it returns False."""
pid = 66666
poll_count = [0]
def fake_pid_exists(check_pid):
poll_count[0] += 1
return poll_count[0] < 3 # alive twice, then gone
from gateway import status as status_mod
monkeypatch.setattr(status_mod, "write_planned_stop_marker", lambda p: True)
monkeypatch.setattr(status_mod, "_pid_exists", fake_pid_exists)
assert gateway_windows._drain_gateway_pid(pid, drain_timeout=5.0) is True
def test_drain_helper_returns_false_on_timeout(monkeypatch):
"""_drain_gateway_pid returns False when the PID never exits."""
from gateway import status as status_mod
monkeypatch.setattr(status_mod, "write_planned_stop_marker", lambda p: True)
monkeypatch.setattr(status_mod, "_pid_exists", lambda check_pid: True)
assert gateway_windows._drain_gateway_pid(55555, drain_timeout=1.0) is False
def test_drain_helper_still_waits_if_marker_write_fails(monkeypatch):
"""Marker-write failures are swallowed; drain still polls for PID exit.
If the marker can't be written (disk full, permission error), the
gateway can't drain — but the wait still happens so a slow-shutdown
gateway from a different code path (e.g. SIGTERM working on this
platform after all) still gets observed cleanly.
"""
pid = 44444
def fake_write(target_pid):
raise OSError("disk full")
from gateway import status as status_mod
monkeypatch.setattr(status_mod, "write_planned_stop_marker", fake_write)
monkeypatch.setattr(status_mod, "_pid_exists", lambda check_pid: False)
# Returns True because _pid_exists immediately says "gone".
assert gateway_windows._drain_gateway_pid(pid, drain_timeout=5.0) is True