fix(gateway): drain on Windows hermes gateway stop so sessions survive restart (#33798)

Sessions now survive `hermes gateway stop` / `restart` on native Windows. Previously the gateway died on schtasks `/End` + os.kill SIGTERM without ever running the drain loop, so the v0.13.0 session-resume feature (#21192) silently broke on Windows: `resume_pending=True` was never written, and the next boot started with a blank conversation history (issue #33778). Root cause is twofold and the reporter only identified half of it: 1. `hermes_cli/gateway_windows.py::stop()` did not write the `planned_stop_marker` before signalling. The reporter caught this. 2. The bigger reason: `asyncio.add_signal_handler` raises NotImplementedError for SIGTERM/SIGINT on Windows, so even if the marker had been written, the gateway's existing SIGTERM handler (which is what calls `runner.stop()` and the `mark_resume_pending` loop) was never invoked. Writing the marker would have been necessary-but-insufficient. The fix has two parts: * gateway/run.py: new `_run_planned_stop_watcher` daemon thread polls for the planned-stop marker file every 0.5s. When the marker appears it `loop.call_soon_threadsafe(shutdown_signal_handler, None)` — the same shutdown path a real SIGTERM would have driven, including the pre-drain `mark_resume_pending` writes (run.py:5977) and graceful drain wait. The existing signal handler already accepts `received_signal=None` and falls through to `consume_planned_stop_marker_for_self()`, so no handler changes needed. Runs on every platform as cheap belt-and-suspenders. * hermes_cli/gateway_windows.py: `stop()` now writes the marker for the running gateway PID and waits up to `agent.restart_drain_timeout` (default 30s) for the PID to exit cleanly. On clean drain, the kill sweep is non-forceful; on timeout, escalates to `kill_gateway_processes(force=True)` which routes to taskkill /T /F per `references/windows-native-support.md`. Validation: * 7 new tests in tests/gateway/test_planned_stop_watcher.py covering: marker→handler dispatch, no-marker idle, already-draining skip, not-yet-running skip, stop_event responsiveness, fire-once semantics, error tolerance. * 8 new tests in tests/hermes_cli/test_gateway_windows.py covering: marker-before-kill ordering, clean-drain skips force-kill, drain-timeout escalates to force=True, no-pid-skips-drain, invalid-pid handling, fast-exit success, timeout failure, marker-write-failure tolerance. * E2E (Linux, detached orphan): write_planned_stop_marker(pid) + `_drain_gateway_pid(pid, 5.0)` returns True in 0.5s after the victim sees the marker and exits. Tested with a double-forked subprocess so the test parent isn't holding it as a zombie. * Targeted: tests/gateway/{restart_drain,restart_resume_pending, signal,signal_format,status,shutdown_forensics,approve_deny_commands, planned_stop_watcher} + tests/hermes_cli/{gateway_windows, gateway_service} → 519/519. What was wrong with the reporter's claim (for future archaeology): they described the symptom as "no `resume_pending=True` written to `sessions.json`" — but Hermes uses `state.db` (SQLite), not `sessions.json`, and `mark_resume_pending` is called regardless of the marker (the marker only affects exit code 0 vs 1 for systemd revival semantics). The real session-loss path is the missing drain on Windows, not a missing marker. Both halves are fixed here. Closes #33778.
2026-07-13 14:02:16 +00:00 · 2026-05-28 03:25:32 -07:00 · 2026-05-28 03:25:32 -07:00 · 10ee4a729b
commit 10ee4a729b
parent f8896dedc8
4 changed files with 649 additions and 9 deletions
--- a/gateway/run.py
+++ b/gateway/run.py
@ -18193,6 +18193,72 @@ class GatewayRunner:
        return response


+def _run_planned_stop_watcher(
+    stop_event: threading.Event,
+    runner,
+    loop: asyncio.AbstractEventLoop,
+    shutdown_handler,
+    *,
+    poll_interval: float = 0.5,
+) -> None:
+    """Poll for the planned-stop marker and trigger graceful shutdown.
+
+    On Windows, ``asyncio.add_signal_handler`` raises NotImplementedError
+    for SIGTERM/SIGINT, so the standard signal-driven shutdown path
+    never runs when ``hermes gateway stop`` signals the gateway. The
+    consequence is that the drain loop is skipped — in-flight agent
+    sessions are killed mid-turn and ``resume_pending`` is never set,
+    so the next gateway boot has no idea those sessions need to be
+    auto-resumed (issue #33778, v0.13.0 session-resume feature broken
+    on native Windows).
+
+    This watcher runs on every platform (cheap, defensive) and bridges
+    the gap on Windows by translating a filesystem marker into the
+    same shutdown-handler invocation a real SIGTERM would have produced
+    on POSIX. The CLI's ``hermes_cli.gateway_windows.stop()`` writes
+    the marker via ``write_planned_stop_marker(pid)`` and then waits
+    for the gateway PID to exit; this watcher is what makes that
+    exit happen cleanly.
+
+    On POSIX this is a no-op safety net — the signal handler always
+    races us to consuming the marker file because it fires synchronously
+    from the kernel's signal delivery.
+
+    Args:
+        stop_event: cleared by start_gateway() during normal shutdown
+            to tell the watcher to exit.
+        runner: the GatewayRunner instance; we check ``_running`` and
+            ``_draining`` to avoid triggering shutdown if the gateway
+            is already in one of those states.
+        loop: the asyncio event loop the shutdown handler must run on.
+        shutdown_handler: same callable that's wired to SIGTERM —
+            tolerates a ``None`` signal argument (planned stop case)
+            and consumes the marker via
+            ``consume_planned_stop_marker_for_self()``.
+        poll_interval: seconds between marker checks. 0.5s gives a
+            responsive shutdown without burning CPU.
+    """
+    from gateway.status import _get_planned_stop_marker_path
+    marker_path = _get_planned_stop_marker_path()
+    while not stop_event.is_set():
+        try:
+            if (
+                marker_path.exists()
+                and not getattr(runner, "_draining", False)
+                and getattr(runner, "_running", False)
+            ):
+                # Drive the same path as a real signal handler.
+                # Pass signal=None — the handler tolerates that and consumes
+                # the marker via consume_planned_stop_marker_for_self,
+                # which also validates target_pid + start_time match us.
+                loop.call_soon_threadsafe(shutdown_handler, None)
+                # Done — the handler will set _draining; we exit on next tick.
+                break
+        except Exception as _e:
+            logger.debug("Planned-stop watcher tick error: %s", _e)
+        stop_event.wait(poll_interval)
+
+
 def _start_cron_ticker(stop_event: threading.Event, adapters=None, loop=None, interval: int = 60):
    """
    Background thread that ticks the cron scheduler at a regular interval.
@ -18597,7 +18663,28 @@ async def start_gateway(config: Optional[GatewayConfig] = None, replace: bool =
                pass
    else:
        logger.info("Skipping signal handlers (not running in main thread).")
-    
+
+    # Windows fallback: asyncio.add_signal_handler raises NotImplementedError
+    # on Windows, so `hermes gateway stop`'s SIGTERM (which Python maps to
+    # TerminateProcess on Windows) never invokes shutdown_signal_handler.
+    # That means the drain loop never runs, mark_resume_pending never fires,
+    # and sessions are silently lost across restarts (issue #33778).
+    #
+    # The fix is a marker-polling thread: `hermes gateway stop` writes the
+    # planned-stop marker BEFORE killing, and this thread notices it and
+    # drives the same shutdown path the signal handler would have.  Runs
+    # on every platform (cheap, defensive) so non-signal-bearing
+    # environments (Windows native, sandboxed CI runners that mask
+    # SIGTERM) still get a clean drain.
+    _planned_stop_watcher_stop = threading.Event()
+    _planned_stop_watcher_thread = threading.Thread(
+        target=_run_planned_stop_watcher,
+        args=(_planned_stop_watcher_stop, runner, loop, shutdown_signal_handler),
+        daemon=True,
+        name="planned-stop-watcher",
+    )
+    _planned_stop_watcher_thread.start()
+
    # Claim the PID file BEFORE bringing up any platform adapters.
    # This closes the --replace race window: two concurrent `gateway run
    # --replace` invocations both pass the termination-wait above, but
@ -18675,6 +18762,10 @@ async def start_gateway(config: Optional[GatewayConfig] = None, replace: bool =
    cron_stop.set()
    cron_thread.join(timeout=5)

+    # Stop the planned-stop watcher (daemon=True so this is belt-and-suspenders).
+    _planned_stop_watcher_stop.set()
+    _planned_stop_watcher_thread.join(timeout=2)
+
    # Close MCP server connections
    try:
        from tools.mcp_tool import shutdown_mcp_servers
--- a/hermes_cli/gateway_windows.py
+++ b/hermes_cli/gateway_windows.py
@ -1014,12 +1014,70 @@ def start() -> None:
    _report_gateway_start(f"direct spawn (PID {pid})")


-def stop() -> None:
-    """Stop the gateway. Tries /End on the scheduled task, then kills any stragglers."""
-    _assert_windows()
-    from hermes_cli.gateway import kill_gateway_processes
+def _drain_gateway_pid(pid: int, drain_timeout: float) -> bool:
+    """Write the planned-stop marker and wait for the gateway PID to exit.

-    stopped_any = False
+    Windows cannot deliver POSIX signals to a Python asyncio loop
+    (``loop.add_signal_handler`` raises NotImplementedError), so writing
+    the marker is the ONLY way to ask a running gateway to drain
+    in-flight agents and persist ``resume_pending`` before exit. The
+    gateway's planned-stop watcher thread (gateway/run.py) polls for
+    the marker and drives the same shutdown path the SIGTERM handler
+    would have on POSIX.
+
+    Returns True if the PID exited within the timeout, False if it
+    didn't (caller should escalate to schtasks /End + taskkill).
+    """
+    if pid <= 0:
+        return False
+    try:
+        from gateway.status import write_planned_stop_marker, _pid_exists
+    except ImportError:
+        return False
+
+    try:
+        write_planned_stop_marker(pid)
+    except Exception:
+        # Best-effort: if the marker can't be written, we have no choice
+        # but to fall through to a hard kill.  Caller decides escalation.
+        pass
+
+    deadline = time.monotonic() + max(drain_timeout, 1.0)
+    while time.monotonic() < deadline:
+        if not _pid_exists(pid):
+            return True
+        time.sleep(0.5)
+    return False
+
+
+def stop() -> None:
+    """Stop the gateway.
+
+    Writes the planned-stop marker first so the gateway can drain
+    in-flight agents and persist ``resume_pending`` before exit (the
+    gateway's marker-watcher thread picks this up — Windows asyncio
+    can't deliver SIGTERM to the loop, so the marker is our only IPC).
+    Then escalates: ``schtasks /End`` (kills the scheduled-task tree)
+    + ``kill_gateway_processes(force=True)`` for any strays.
+    """
+    _assert_windows()
+    from hermes_cli.gateway import kill_gateway_processes, _get_restart_drain_timeout
+    from gateway.status import get_running_pid
+
+    # Phase 1: ask the running gateway (if any) to drain itself by writing
+    # the planned-stop marker, then wait briefly for it to exit cleanly.
+    # On clean exit, sessions land with resume_pending=True and the next
+    # boot will auto-resume them.
+    pid = get_running_pid()
+    drained = False
+    if pid is not None:
+        try:
+            drain_timeout = float(_get_restart_drain_timeout() or 30.0)
+        except Exception:
+            drain_timeout = 30.0
+        drained = _drain_gateway_pid(pid, drain_timeout)
+
+    stopped_any = drained
    if is_task_registered():
        code, _out, err = _exec_schtasks(["/End", "/TN", get_task_name()])
        # schtasks returns nonzero when the task isn't currently running — don't treat that as an error.
@ -1028,12 +1086,19 @@ def stop() -> None:
        elif "not running" not in (err or "").lower():
            print(f"⚠ schtasks /End returned code {code}: {err.strip()}")

-    killed = kill_gateway_processes(all_profiles=False)
+    # Phase 3: hard-kill any strays.  When drain succeeded this is a no-op;
+    # when drain timed out this is the escalation that ensures the PID
+    # actually exits.  Use force=True on Windows so taskkill /T /F walks
+    # the descendant tree (browser helpers, etc.).
+    killed = kill_gateway_processes(all_profiles=False, force=not drained)
    if killed:
        stopped_any = True
        print(f"✓ Killed {killed} gateway process(es)")
    if stopped_any:
-        print("✓ Gateway stopped")
+        if drained:
+            print("✓ Gateway stopped (drained cleanly)")
+        else:
+            print("✓ Gateway stopped")
    else:
        print("✗ No gateway was running")

--- a/tests/gateway/test_planned_stop_watcher.py
+++ b/tests/gateway/test_planned_stop_watcher.py
@ -0,0 +1,267 @@
+"""Tests for the planned-stop marker watcher thread (gateway/run.py).
+
+The watcher is the Windows-fallback path for the v0.13.0 session-resume
+feature — on Windows ``asyncio.add_signal_handler`` raises
+NotImplementedError, so the SIGTERM signal handler never runs and the
+shutdown drain (which writes ``resume_pending=True``) is skipped. The
+watcher closes this gap by polling for the planned-stop marker file
+and translating its existence into the same shutdown-handler call a
+real SIGTERM would have produced.
+
+See issue #33778 for the original Windows session-loss bug report.
+"""
+
+import asyncio
+import threading
+import time
+from types import SimpleNamespace
+from unittest.mock import MagicMock
+
+import pytest
+
+from gateway.run import _run_planned_stop_watcher
+
+
+class _FakeRunner:
+    """Stand-in for GatewayRunner — only exposes the two flags the watcher reads."""
+
+    def __init__(self, *, running: bool = True, draining: bool = False):
+        self._running = running
+        self._draining = draining
+
+
+def _make_loop_capturing_calls():
+    """Build a fake asyncio loop whose call_soon_threadsafe records its args."""
+    loop = MagicMock(spec=asyncio.AbstractEventLoop)
+    loop._captured = []
+
+    def fake_call_soon_threadsafe(fn, *args):
+        loop._captured.append((fn, args))
+
+    loop.call_soon_threadsafe = fake_call_soon_threadsafe
+    return loop
+
+
+def test_watcher_fires_shutdown_when_marker_appears(tmp_path, monkeypatch):
+    """When the marker file exists, the watcher must call the shutdown handler."""
+    marker = tmp_path / ".gateway-planned-stop.json"
+
+    # Patch the marker-path resolver so the watcher polls our temp location.
+    from gateway import status as status_mod
+    monkeypatch.setattr(status_mod, "_get_planned_stop_marker_path", lambda: marker)
+
+    runner = _FakeRunner(running=True, draining=False)
+    loop = _make_loop_capturing_calls()
+    shutdown_handler = MagicMock(name="shutdown_signal_handler")
+    stop_event = threading.Event()
+
+    # Drop the marker before the thread starts.
+    marker.write_text('{"target_pid": 1234}', encoding="utf-8")
+
+    watcher = threading.Thread(
+        target=_run_planned_stop_watcher,
+        args=(stop_event, runner, loop, shutdown_handler),
+        kwargs={"poll_interval": 0.05},
+        daemon=True,
+    )
+    watcher.start()
+    watcher.join(timeout=2.0)
+
+    assert not watcher.is_alive(), "Watcher should exit after firing"
+    assert len(loop._captured) == 1, (
+        f"Expected exactly one shutdown invocation, got {loop._captured}"
+    )
+    fn, args = loop._captured[0]
+    assert fn is shutdown_handler
+    # The handler must be called with signal=None (planned stop sentinel).
+    assert args == (None,)
+
+
+def test_watcher_does_not_fire_when_marker_absent(tmp_path, monkeypatch):
+    """No marker = no shutdown call. Watcher just spins until stop_event."""
+    marker = tmp_path / ".gateway-planned-stop.json"
+    # Deliberately do NOT create the marker.
+
+    from gateway import status as status_mod
+    monkeypatch.setattr(status_mod, "_get_planned_stop_marker_path", lambda: marker)
+
+    runner = _FakeRunner(running=True, draining=False)
+    loop = _make_loop_capturing_calls()
+    shutdown_handler = MagicMock()
+    stop_event = threading.Event()
+
+    watcher = threading.Thread(
+        target=_run_planned_stop_watcher,
+        args=(stop_event, runner, loop, shutdown_handler),
+        kwargs={"poll_interval": 0.05},
+        daemon=True,
+    )
+    watcher.start()
+    time.sleep(0.3)  # let it poll a few times
+    stop_event.set()
+    watcher.join(timeout=2.0)
+
+    assert not watcher.is_alive()
+    assert loop._captured == [], (
+        f"No marker present, but watcher fired shutdown: {loop._captured}"
+    )
+    shutdown_handler.assert_not_called()
+
+
+def test_watcher_skips_when_runner_already_draining(tmp_path, monkeypatch):
+    """If shutdown is already in progress, don't re-fire the handler.
+
+    This prevents a race where the SIGTERM handler is mid-drain and the
+    watcher would double-tap the shutdown path. We check ``_draining``
+    so the watcher backs off once any shutdown is in flight.
+    """
+    marker = tmp_path / ".gateway-planned-stop.json"
+    marker.write_text('{"target_pid": 1234}', encoding="utf-8")
+
+    from gateway import status as status_mod
+    monkeypatch.setattr(status_mod, "_get_planned_stop_marker_path", lambda: marker)
+
+    # Already draining — watcher should be a no-op.
+    runner = _FakeRunner(running=False, draining=True)
+    loop = _make_loop_capturing_calls()
+    shutdown_handler = MagicMock()
+    stop_event = threading.Event()
+
+    watcher = threading.Thread(
+        target=_run_planned_stop_watcher,
+        args=(stop_event, runner, loop, shutdown_handler),
+        kwargs={"poll_interval": 0.05},
+        daemon=True,
+    )
+    watcher.start()
+    time.sleep(0.2)
+    stop_event.set()
+    watcher.join(timeout=2.0)
+
+    assert loop._captured == [], "Watcher fired while runner was already draining"
+
+
+def test_watcher_skips_when_runner_not_started(tmp_path, monkeypatch):
+    """If the runner hasn't started, the marker is for a previous instance —
+    we shouldn't shutdown a not-yet-running gateway.
+    """
+    marker = tmp_path / ".gateway-planned-stop.json"
+    marker.write_text('{"target_pid": 9999}', encoding="utf-8")
+
+    from gateway import status as status_mod
+    monkeypatch.setattr(status_mod, "_get_planned_stop_marker_path", lambda: marker)
+
+    runner = _FakeRunner(running=False, draining=False)
+    loop = _make_loop_capturing_calls()
+    shutdown_handler = MagicMock()
+    stop_event = threading.Event()
+
+    watcher = threading.Thread(
+        target=_run_planned_stop_watcher,
+        args=(stop_event, runner, loop, shutdown_handler),
+        kwargs={"poll_interval": 0.05},
+        daemon=True,
+    )
+    watcher.start()
+    time.sleep(0.2)
+    stop_event.set()
+    watcher.join(timeout=2.0)
+
+    assert loop._captured == [], "Watcher fired before runner was running"
+
+
+def test_watcher_responds_to_stop_event_promptly(tmp_path, monkeypatch):
+    """Setting stop_event must exit the watcher within ~poll_interval seconds."""
+    marker = tmp_path / ".gateway-planned-stop.json"
+    from gateway import status as status_mod
+    monkeypatch.setattr(status_mod, "_get_planned_stop_marker_path", lambda: marker)
+
+    runner = _FakeRunner(running=True, draining=False)
+    loop = _make_loop_capturing_calls()
+    stop_event = threading.Event()
+
+    watcher = threading.Thread(
+        target=_run_planned_stop_watcher,
+        args=(stop_event, runner, loop, MagicMock()),
+        kwargs={"poll_interval": 0.1},
+        daemon=True,
+    )
+    watcher.start()
+    time.sleep(0.05)
+    started_stop = time.monotonic()
+    stop_event.set()
+    watcher.join(timeout=2.0)
+    elapsed = time.monotonic() - started_stop
+
+    assert not watcher.is_alive()
+    assert elapsed < 0.5, f"Watcher took {elapsed:.2f}s to honour stop_event"
+
+
+def test_watcher_fires_only_once_when_marker_persists(tmp_path, monkeypatch):
+    """Marker file existing for multiple polls must NOT spam the handler.
+
+    The watcher fires once and exits its loop (the shutdown handler is
+    responsible for consuming the marker on its own thread). If we
+    re-fired on every tick, the handler would be invoked dozens of
+    times before the gateway actually shuts down.
+    """
+    marker = tmp_path / ".gateway-planned-stop.json"
+    marker.write_text('{"target_pid": 1234}', encoding="utf-8")
+
+    from gateway import status as status_mod
+    monkeypatch.setattr(status_mod, "_get_planned_stop_marker_path", lambda: marker)
+
+    runner = _FakeRunner(running=True, draining=False)
+    loop = _make_loop_capturing_calls()
+    stop_event = threading.Event()
+
+    watcher = threading.Thread(
+        target=_run_planned_stop_watcher,
+        args=(stop_event, runner, loop, MagicMock()),
+        kwargs={"poll_interval": 0.05},
+        daemon=True,
+    )
+    watcher.start()
+    # Let the watcher tick several times — but it should exit after the first fire.
+    watcher.join(timeout=1.0)
+
+    assert not watcher.is_alive()
+    assert len(loop._captured) == 1, (
+        f"Watcher fired {len(loop._captured)} times; should fire once "
+        f"and exit (events={loop._captured})"
+    )
+
+
+def test_watcher_tolerates_marker_path_resolution_errors(tmp_path, monkeypatch, caplog):
+    """If _get_planned_stop_marker_path() raises, the watcher logs and continues."""
+    from gateway import status as status_mod
+
+    call_count = [0]
+    def explode():
+        call_count[0] += 1
+        # First call (the one outside the loop, at thread start) is fine —
+        # but subsequent .exists() calls on a corrupt Path could explode.
+        if call_count[0] == 1:
+            return tmp_path / "nonexistent"
+        raise OSError("filesystem failed")
+
+    monkeypatch.setattr(status_mod, "_get_planned_stop_marker_path", explode)
+
+    runner = _FakeRunner(running=True, draining=False)
+    loop = _make_loop_capturing_calls()
+    stop_event = threading.Event()
+
+    watcher = threading.Thread(
+        target=_run_planned_stop_watcher,
+        args=(stop_event, runner, loop, MagicMock()),
+        kwargs={"poll_interval": 0.05},
+        daemon=True,
+    )
+    watcher.start()
+    time.sleep(0.2)
+    stop_event.set()
+    watcher.join(timeout=2.0)
+
+    assert not watcher.is_alive(), "Watcher should still honour stop_event after errors"
+    # No shutdown fired because the marker never reported existence.
+    assert loop._captured == []
--- a/tests/hermes_cli/test_gateway_windows.py
+++ b/tests/hermes_cli/test_gateway_windows.py
@ -481,4 +481,221 @@ def test_uninstall_access_denied_declined_keeps_task_and_cleans_files(monkeypatc
    out = capsys.readouterr().out
    assert "Skipped elevation" in out
    assert "UAC is Windows' admin approval prompt" in out
-    assert "Scheduled Task still registered" in out
+    assert "Scheduled Task still registered" in out
+
+
+# ---------------------------------------------------------------------------
+# stop() drain semantics — issue #33778
+#
+# Background: on Windows, asyncio.add_signal_handler raises NotImplementedError,
+# so the gateway's SIGTERM handler (which drains in-flight agents and writes
+# resume_pending=True) never fires when `hermes gateway stop` kills the
+# process. The fix: stop() writes the planned_stop_marker first, waits for
+# the gateway's marker-watcher thread to drain + exit cleanly, then escalates
+# to taskkill if drain times out.
+# ---------------------------------------------------------------------------
+
+
+def test_stop_writes_planned_stop_marker_before_killing(monkeypatch):
+    """stop() must write the planned-stop marker BEFORE any kill signal.
+
+    Without this, the gateway's drain loop never runs on Windows and
+    sessions silently lose context across restarts.
+    """
+    pid = 99999
+    events = []
+
+    monkeypatch.setattr(gateway_windows, "_assert_windows", lambda: None)
+    monkeypatch.setattr(gateway_windows, "is_task_registered", lambda: False)
+
+    # Stub the marker write so we can record the order of operations.
+    from gateway import status as status_mod
+
+    def fake_write_marker(target_pid):
+        events.append(("write_marker", target_pid))
+        return True
+
+    def fake_pid_exists(check_pid):
+        # Drain succeeds: pid "exits" right after the marker write.
+        return ("write_marker", pid) not in events
+
+    monkeypatch.setattr(status_mod, "write_planned_stop_marker", fake_write_marker)
+    monkeypatch.setattr(status_mod, "_pid_exists", fake_pid_exists)
+    monkeypatch.setattr(status_mod, "get_running_pid", lambda: pid)
+
+    def fake_kill(**kwargs):
+        events.append(("kill", kwargs.get("force", False)))
+        return 0
+
+    monkeypatch.setattr("hermes_cli.gateway.kill_gateway_processes", fake_kill)
+    monkeypatch.setattr("hermes_cli.gateway._get_restart_drain_timeout", lambda: 5.0)
+
+    gateway_windows.stop()
+
+    # Marker MUST be written before any kill.
+    kinds = [e[0] for e in events]
+    assert "write_marker" in kinds, "stop() never wrote the planned-stop marker"
+    marker_idx = kinds.index("write_marker")
+    kill_idx = kinds.index("kill") if "kill" in kinds else len(kinds)
+    assert marker_idx < kill_idx, (
+        f"stop() killed before writing the marker (events={events})"
+    )
+
+
+def test_stop_waits_for_graceful_drain_before_force_kill(monkeypatch):
+    """When drain succeeds, stop() should NOT force-kill the gateway.
+
+    drained=True means the gateway exited cleanly after seeing the
+    marker — escalating to taskkill /F afterwards would be wasted
+    work and may emit confusing "killed N processes" output.
+    """
+    pid = 88888
+    events = []
+
+    monkeypatch.setattr(gateway_windows, "_assert_windows", lambda: None)
+    monkeypatch.setattr(gateway_windows, "is_task_registered", lambda: False)
+
+    from gateway import status as status_mod
+    monkeypatch.setattr(status_mod, "write_planned_stop_marker", lambda p: True)
+
+    # Simulate the gateway exiting cleanly after one poll tick.
+    poll_count = [0]
+    def fake_pid_exists(check_pid):
+        poll_count[0] += 1
+        return poll_count[0] < 2  # alive on first poll, gone on second
+    monkeypatch.setattr(status_mod, "_pid_exists", fake_pid_exists)
+    monkeypatch.setattr(status_mod, "get_running_pid", lambda: pid)
+
+    def fake_kill(**kwargs):
+        events.append(("kill", kwargs.get("force", False)))
+        return 0
+    monkeypatch.setattr("hermes_cli.gateway.kill_gateway_processes", fake_kill)
+    monkeypatch.setattr("hermes_cli.gateway._get_restart_drain_timeout", lambda: 5.0)
+
+    gateway_windows.stop()
+
+    # kill_gateway_processes is still called as the no-op sweep, but
+    # NOT with force=True — drain succeeded, gateway is already gone.
+    assert events == [("kill", False)], (
+        f"After clean drain, force kill should be disabled (events={events})"
+    )
+
+
+def test_stop_escalates_to_force_kill_when_drain_times_out(monkeypatch):
+    """When drain times out, stop() MUST escalate to force=True.
+
+    Drain timeout = gateway is stuck or unresponsive. Without the
+    taskkill /T /F escalation, the gateway stays alive and the next
+    `hermes gateway start` fails with "another instance is running".
+    """
+    pid = 77777
+    events = []
+
+    monkeypatch.setattr(gateway_windows, "_assert_windows", lambda: None)
+    monkeypatch.setattr(gateway_windows, "is_task_registered", lambda: False)
+
+    from gateway import status as status_mod
+    monkeypatch.setattr(status_mod, "write_planned_stop_marker", lambda p: True)
+    # PID never exits — drain times out.
+    monkeypatch.setattr(status_mod, "_pid_exists", lambda check_pid: True)
+    monkeypatch.setattr(status_mod, "get_running_pid", lambda: pid)
+
+    def fake_kill(**kwargs):
+        events.append(("kill", kwargs.get("force", False)))
+        return 1
+    monkeypatch.setattr("hermes_cli.gateway.kill_gateway_processes", fake_kill)
+    # Tiny drain timeout to keep the test fast.
+    monkeypatch.setattr("hermes_cli.gateway._get_restart_drain_timeout", lambda: 1.0)
+
+    gateway_windows.stop()
+
+    # When drain times out, kill is invoked with force=True so taskkill /T /F
+    # walks the process tree.
+    assert events == [("kill", True)], (
+        f"After drain timeout, kill must use force=True (events={events})"
+    )
+
+
+def test_stop_no_running_gateway_skips_drain(monkeypatch):
+    """When no gateway is running, skip the drain wait entirely."""
+    events = []
+
+    monkeypatch.setattr(gateway_windows, "_assert_windows", lambda: None)
+    monkeypatch.setattr(gateway_windows, "is_task_registered", lambda: False)
+
+    from gateway import status as status_mod
+    monkeypatch.setattr(status_mod, "get_running_pid", lambda: None)
+
+    def fake_write_marker(target_pid):
+        events.append(("write_marker", target_pid))
+        return True
+    monkeypatch.setattr(status_mod, "write_planned_stop_marker", fake_write_marker)
+    monkeypatch.setattr(status_mod, "_pid_exists", lambda check_pid: False)
+
+    def fake_kill(**kwargs):
+        events.append(("kill", kwargs.get("force", False)))
+        return 0
+    monkeypatch.setattr("hermes_cli.gateway.kill_gateway_processes", fake_kill)
+    monkeypatch.setattr("hermes_cli.gateway._get_restart_drain_timeout", lambda: 5.0)
+
+    gateway_windows.stop()
+
+    # With no PID to drain, no marker is written.  Kill sweep still runs
+    # (defensive — covers the case where a stray gateway is alive without
+    # a PID file).  force=True because drained=False.
+    assert ("write_marker", None) not in events
+    assert all(e[0] != "write_marker" for e in events), (
+        f"Should not write marker when no PID is running (events={events})"
+    )
+    assert events == [("kill", True)]
+
+
+def test_drain_helper_handles_invalid_pid(monkeypatch):
+    """_drain_gateway_pid returns False for invalid PIDs without crashing."""
+    assert gateway_windows._drain_gateway_pid(0, 5.0) is False
+    assert gateway_windows._drain_gateway_pid(-1, 5.0) is False
+
+
+def test_drain_helper_returns_true_when_pid_exits_quickly(monkeypatch):
+    """_drain_gateway_pid polls _pid_exists until it returns False."""
+    pid = 66666
+    poll_count = [0]
+
+    def fake_pid_exists(check_pid):
+        poll_count[0] += 1
+        return poll_count[0] < 3  # alive twice, then gone
+
+    from gateway import status as status_mod
+    monkeypatch.setattr(status_mod, "write_planned_stop_marker", lambda p: True)
+    monkeypatch.setattr(status_mod, "_pid_exists", fake_pid_exists)
+
+    assert gateway_windows._drain_gateway_pid(pid, drain_timeout=5.0) is True
+
+
+def test_drain_helper_returns_false_on_timeout(monkeypatch):
+    """_drain_gateway_pid returns False when the PID never exits."""
+    from gateway import status as status_mod
+    monkeypatch.setattr(status_mod, "write_planned_stop_marker", lambda p: True)
+    monkeypatch.setattr(status_mod, "_pid_exists", lambda check_pid: True)
+
+    assert gateway_windows._drain_gateway_pid(55555, drain_timeout=1.0) is False
+
+
+def test_drain_helper_still_waits_if_marker_write_fails(monkeypatch):
+    """Marker-write failures are swallowed; drain still polls for PID exit.
+
+    If the marker can't be written (disk full, permission error), the
+    gateway can't drain — but the wait still happens so a slow-shutdown
+    gateway from a different code path (e.g. SIGTERM working on this
+    platform after all) still gets observed cleanly.
+    """
+    pid = 44444
+    def fake_write(target_pid):
+        raise OSError("disk full")
+
+    from gateway import status as status_mod
+    monkeypatch.setattr(status_mod, "write_planned_stop_marker", fake_write)
+    monkeypatch.setattr(status_mod, "_pid_exists", lambda check_pid: False)
+
+    # Returns True because _pid_exists immediately says "gone".
+    assert gateway_windows._drain_gateway_pid(pid, drain_timeout=5.0) is True