fix(gateway,windows): reliability — JOB breakaway + status --deep probes + test-leak fix (#40909)

* fix(gateway,windows): reliability — supervisor task, JOB breakaway, status --deep

Three coordinated fixes for the Windows gateway reliability story:

1. CREATE_BREAKAWAY_FROM_JOB on every detached spawn

   The 'hermes update' triggered from the Electron Desktop GUI ran inside
   Electron's job object. Without breakaway, the post-update gateway
   watcher spawned by update — already DETACHED_PROCESS — was still
   reaped when Electron's job tore down, so the gateway never came back
   after a GUI-initiated update. Adds CREATE_BREAKAWAY_FROM_JOB (0x01000000)
   to:
     - hermes_cli/_subprocess_compat.py::windows_detach_flags() — used by
       every helper that calls windows_detach_popen_kwargs(), including
       launch_detached_profile_gateway_restart()
     - The watcher subprocess's own respawn snippet in
       hermes_cli/gateway.py (inlined flags so the watcher's child
       respawn also breaks away)

   _spawn_detached() in gateway_windows.py already had the flag; this
   change brings the rest of the codebase to parity.

2. Per-minute supervisor Scheduled Task — Windows equivalent of
   systemd Restart=always

   Introduces hermes_cli/gateway_supervisor.py and registers it as a
   second Scheduled Task ('Hermes_Gateway_Supervisor', SC MINUTE /MO 1,
   LIMITED rights) alongside the existing ONLOGON task. Every minute,
   the supervisor uses the same gateway.status.get_running_pid() probe
   as 'hermes gateway status' and, if no gateway is alive, calls
   gateway_windows._spawn_detached() (which now includes BREAKAWAY) to
   bring one back.

   Covers every crash mode, not just 'machine rebooted': taskkill,
   OOM, GUI update SIGTERM, parent job teardown. Cheap — one pythonw
   startup per minute when down, one PID-existence check per minute
   when up.

   Wired into both the schtasks-success and Startup-folder-fallback
   install paths via _install_supervisor_best_effort(), and removed in
   uninstall(). Best-effort: a failing supervisor install logs a
   warning but doesn't roll back the primary install.

3. 'hermes gateway status --deep' shows per-probe PASS/FAIL

   Replaces the existing terse '--deep' output (which only printed
   paths) with an actual diagnostic table:
     [1] PID file present
     [2] Lock file held by a live process
     [3] get_running_pid() result
     [4] _pid_exists(pid) — OS-level liveness
     [5] gateway_state.json (state + age)
     [6] Last lifecycle event from gateway-exit-diag.log

   When the high-level summary disagrees with reality, the user can
   see exactly which signal is lying.

Test-leak fix
-------------

tests/hermes_cli/test_gateway_wsl.py::TestGatewayCommandWSLMessages
monkey-patched is_linux/is_wsl/supports_systemd_services to simulate
WSL but did NOT stub is_windows(). On a Windows host, the dispatcher
in _gateway_command_inner takes the is_windows() branch BEFORE the
WSL guidance branch, so the test invoked gateway_windows.install()
for real. install() writes to %APPDATA%\...\Startup\Hermes_Gateway.cmd
— the REAL user Startup folder, never sandboxed by tmp_path — pointing
at the test's pytest-of-<user>/pytest-<N>/.../gateway-service/ wrapper.
When pytest tore down the tmp_path, every subsequent Windows login
flashed a cmd.exe window that failed to find the missing target.

Stubs is_windows=False on all four affected tests:
  test_install_wsl_no_systemd
  test_start_wsl_no_systemd
  test_status_wsl_running_manual
  test_status_wsl_not_running

Defense-in-depth: _build_startup_launcher() now prefixes the launcher
with 'if not exist <target> exit /b 0', so any future stale Startup
entry silently no-ops instead of flashing a console window.

Status enhancements
-------------------

- status() now reports supervisor task presence alongside the existing
  schtasks/Startup info, and nudges the user to reinstall if the
  supervisor isn't registered.
- Deep mode dumps both the supervisor task name + script path.

* fix(gateway,windows): drop the per-minute supervisor task — keep breakaway + deep probes

Earlier in this branch we added a per-minute schtasks-based supervisor to
respawn the gateway after crashes / GUI-update SIGTERMs. The implementation
flashed a brief console window on every firing, which stole window focus.
We tried several variants:

  - cmd.exe wrapper invoking pythonw  -> flashes (cmd.exe is console-subsystem)
  - schtasks /TR pointing at pythonw  -> flashes (uv venv launcher pythonw is
    actually subsystem=Console, not GUI; it respawns the real pythonw)
  - schtasks /TR pointing at base uv  -> still flashes (Task Scheduler-side
    conhost preallocation; documented Windows quirk)
  - XML registration with <Hidden>true>  -> still flashes (<Hidden> only hides
    the task in the Task Scheduler UI, not the spawned window)

Researched what leading projects do:

  - Ollama: GUI-subsystem tray exe + Startup-folder shortcut. No supervisor.
  - Tailscale: real Windows Service via SCM. Session 0, no console possible.
  - Syncthing: --no-console flag inside the binary + Startup folder.
  - openclaw: VBS Run(..., 0, False) wrapper. Suppresses the *window* but
    Super User Q971162 confirms focus-steal still occurs in some cases.

None of these use a per-minute polling scheduled task. The 'auto-restart on
crash' responsibility belongs INSIDE the daemon (Tailscale's in-process
recovery / Ollama's monitor+worker pair) OR is delegated to the Windows
Service Control Manager — not Task Scheduler.

So this commit drops the supervisor entirely. The CREATE_BREAKAWAY_FROM_JOB
fix in _subprocess_compat.py (from commit c1e5fa433) survives — that is the
*real* fix for problem #2 (GUI-update kills gateway): the post-update
watcher in launch_detached_profile_gateway_restart() now breaks out of
Electron's job object, so the gateway respawn watcher survives the GUI
quit and successfully respawns the gateway.

Surviving from c1e5fa433:
  * CREATE_BREAKAWAY_FROM_JOB in hermes_cli/_subprocess_compat.py (fixes #2)
  * Inlined breakaway flag in the watcher respawn snippet in gateway.py
  * hermes gateway status --deep PASS/FAIL probes (fixes #1 — visibility)
  * 'if not exist <target> exit /b 0' guard in _build_startup_launcher
    (fixes #3 — silent no-op for stale Startup entries)
  * tests/hermes_cli/test_gateway_wsl.py is_windows=False stubs (root cause
    of #3 — pytest WSL tests no longer leak Startup entries on Win hosts)

Removed in this commit:
  * hermes_cli/gateway_supervisor.py (entire file)
  * Supervisor section in hermes_cli/gateway_windows.py (~180 lines):
      get_supervisor_task_name, get_supervisor_script_path,
      _build_supervisor_cmd_script, _write_supervisor_script,
      _install_supervisor_task, is_supervisor_task_registered,
      _install_supervisor_best_effort
  * _install_supervisor_best_effort() calls in install() (3 spots)
  * supervisor cleanup block in uninstall()
  * supervisor display lines in status() / status(deep=True)

Future direction (out of scope for this PR): the right place for Windows
'Restart=always' semantics is a real Windows Service installed via
pywin32's win32serviceutil.ServiceFramework — session-0 isolation, SCM
auto-restart, no console window possible. That's a meaningful next-PR
project, not a band-aid.

Tests: 51 pass / 2 pre-existing failures in
tests/hermes_cli/test_gateway_{windows,wsl}.py (the 2 failures are
TestSupportsSystemdServicesWSL cases that fail on origin/main too —
unrelated to this PR).
This commit is contained in:
Teknium 2026-06-06 19:53:58 -07:00 committed by GitHub
parent 40cea4d58d
commit fc086da8bd
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
4 changed files with 217 additions and 8 deletions

View file

@ -97,6 +97,16 @@ def resolve_node_command(name: str, argv: Sequence[str]) -> list[str]:
_CREATE_NEW_PROCESS_GROUP = 0x00000200
_DETACHED_PROCESS = 0x00000008
_CREATE_NO_WINDOW = 0x08000000
# Escape any Win32 job object the parent process belongs to. Without this,
# a detached child still inherits its parent's job object membership, and
# when that parent (Electron, Tauri, Windows Terminal, the Desktop GUI's
# bootstrap-installer) dies, the OS tears down the whole job — taking the
# "detached" child with it. Critical for the post-update gateway watcher:
# Electron spawns the Tauri updater inside its own job, the updater spawns
# the watcher subprocess; without BREAKAWAY the watcher dies the instant
# Electron exits, so the gateway never gets respawned after a `hermes
# update` triggered from the GUI. See fix/windows-gateway-reliability.
_CREATE_BREAKAWAY_FROM_JOB = 0x01000000
def windows_detach_flags() -> int:
@ -116,10 +126,30 @@ def windows_detach_flags() -> int:
- ``CREATE_NO_WINDOW`` suppress the brief cmd flash that would
otherwise appear when launching a console app. Redundant with
DETACHED_PROCESS but explicit for clarity.
- ``CREATE_BREAKAWAY_FROM_JOB`` escape any job object the parent is
in. Electron (Desktop app) and Tauri (bootstrap installer) wrap
their children in job objects; without breakaway, those children
die when the parent process exits even if they were spawned with
DETACHED_PROCESS. This was the missing flag that made the
post-update gateway respawn watcher silently die alongside the
Tauri updater after the Electron Desktop's update flow finished.
If a process is in a job that disallows breakaway (rare
JOB_OBJECT_LIMIT_BREAKAWAY_OK isn't set), CreateProcess returns
ERROR_ACCESS_DENIED. Python surfaces that as ``PermissionError``
on the ``subprocess.Popen`` call. Callers in this codebase already
wrap detached spawns in ``try/except OSError`` and fall back to a
cmd.exe wrapper, so the breakaway-denied case degrades gracefully
rather than crashing.
"""
if not IS_WINDOWS:
return 0
return _CREATE_NEW_PROCESS_GROUP | _DETACHED_PROCESS | _CREATE_NO_WINDOW
return (
_CREATE_NEW_PROCESS_GROUP
| _DETACHED_PROCESS
| _CREATE_NO_WINDOW
| _CREATE_BREAKAWAY_FROM_JOB
)
def windows_hide_flags() -> int:

View file

@ -664,6 +664,10 @@ def launch_detached_profile_gateway_restart(profile: str, old_pid: int) -> bool:
# Platform-appropriate detach for the respawned gateway. On POSIX
# start_new_session=True maps to os.setsid; on Windows we need
# explicit creationflags because start_new_session is a no-op there.
# CREATE_BREAKAWAY_FROM_JOB is critical: the watcher itself may have
# been spawned inside a job object (Electron/Tauri parent), and
# without breakaway the respawned gateway would die when that job
# tears down. See _subprocess_compat.windows_detach_flags().
_popen_kwargs = {
"stdout": subprocess.DEVNULL,
"stderr": subprocess.DEVNULL,
@ -672,8 +676,12 @@ def launch_detached_profile_gateway_restart(profile: str, old_pid: int) -> bool:
_CREATE_NEW_PROCESS_GROUP = 0x00000200
_DETACHED_PROCESS = 0x00000008
_CREATE_NO_WINDOW = 0x08000000
_CREATE_BREAKAWAY_FROM_JOB = 0x01000000
_popen_kwargs["creationflags"] = (
_CREATE_NEW_PROCESS_GROUP | _DETACHED_PROCESS | _CREATE_NO_WINDOW
_CREATE_NEW_PROCESS_GROUP
| _DETACHED_PROCESS
| _CREATE_NO_WINDOW
| _CREATE_BREAKAWAY_FROM_JOB
)
else:
_popen_kwargs["start_new_session"] = True

View file

@ -380,13 +380,26 @@ def _build_gateway_cmd_script(
def _build_startup_launcher(script_path: Path) -> str:
"""The tiny .cmd that goes in the Startup folder. Just minimizes and chains."""
"""The tiny .cmd that goes in the Startup folder. Just minimizes and chains.
Defense-in-depth: bail out silently if the target script is gone. Test
fixtures historically wrote Startup entries pointing at pytest tmp_path
directories that vanish after the test session. Without the existence
guard, every subsequent Windows login flashes a cmd.exe window that
fails to find the target. The check + ``exit /b 0`` keeps that case
silent.
"""
quoted_target = _quote_cmd_script_arg(str(script_path))
lines = [
"@echo off",
f"rem {_TASK_DESCRIPTION}",
# If the wrapper script is gone (typical for stale entries from
# uninstalled/migrated installs), silently no-op instead of
# flashing a cmd window with a "file not found" error.
f"if not exist {quoted_target} exit /b 0",
# ``start "" /min`` detaches with a minimized console window.
# ``/d /c`` on cmd.exe skips AUTORUN and runs the target script once.
f'start "" /min cmd.exe /d /c {_quote_cmd_script_arg(str(script_path))}',
f'start "" /min cmd.exe /d /c {quoted_target}',
]
return "\r\n".join(lines) + "\r\n"
@ -478,6 +491,8 @@ def _install_scheduled_task(task_name: str, script_path: Path) -> tuple[bool, st
return (False, f"schtasks /Create failed (code {last_code}): {last_err.strip()}")
def _install_startup_entry(script_path: Path) -> Path:
"""Write the Startup-folder fallback launcher. Returns its path."""
entry = get_startup_entry_path()
@ -926,7 +941,10 @@ def uninstall() -> None:
else:
print(f"⚠ schtasks /Delete returned code {code}: {detail}")
for path, label in [(startup_entry, "Windows login item"), (script_path, "Task script")]:
for path, label in [
(startup_entry, "Windows login item"),
(script_path, "Task script"),
]:
try:
path.unlink()
print(f"✓ Removed {label}: {path}")
@ -984,6 +1002,139 @@ def _gateway_pids() -> list[int]:
return list(find_gateway_pids())
def _print_deep_probes() -> None:
"""Print PASS/FAIL per individual probe of gateway liveness.
The default ``status`` output collapses several signals into one
/ line, which is great when they agree and confusing when they
don't. The deep-probe block shows each underlying check independently
so the user can see exactly which signal is wrong.
Probes:
[1] PID file present
[2] Lock file present and held by some process
[3] gateway.status.get_running_pid() returns a PID
[4] _pid_exists(pid) OS confirms the process is alive
[5] gateway_state.json exists and parses (and is fresh-ish)
[6] Last lifecycle event in gateway-exit-diag.log
"""
import json
from datetime import datetime, timezone
from hermes_cli.config import get_hermes_home
home = Path(get_hermes_home()).resolve()
pid_path = home / "gateway.pid"
lock_path = home / "gateway.lock"
state_path = home / "gateway_state.json"
diag_path = home / "logs" / "gateway-exit-diag.log"
print()
print("Deep probes:")
def _mark(ok: bool) -> str:
return "PASS" if ok else "FAIL"
# [1] PID file
pid_exists = pid_path.exists()
pid_value: int | None = None
if pid_exists:
try:
data = json.loads(pid_path.read_text(encoding="utf-8"))
pid_value = int(data.get("pid")) if data.get("pid") is not None else None
print(f" [1] {_mark(True):4s} PID file present: {pid_path} (pid={pid_value})")
except Exception as exc:
print(f" [1] {_mark(False):4s} PID file present but unreadable: {exc}")
else:
print(f" [1] {_mark(False):4s} PID file missing: {pid_path}")
# [2] Lock file present + held
lock_held = False
lock_present = lock_path.exists()
if lock_present:
try:
from gateway.status import is_gateway_runtime_lock_active
lock_held = is_gateway_runtime_lock_active(lock_path)
print(f" [2] {_mark(lock_held):4s} Lock file held by a live process: {lock_path}")
except Exception as exc:
print(f" [2] {_mark(False):4s} Could not probe lock: {exc}")
else:
print(f" [2] {_mark(False):4s} Lock file missing: {lock_path}")
# [3] get_running_pid()
running_pid: int | None = None
try:
from gateway.status import get_running_pid
running_pid = get_running_pid(cleanup_stale=False)
print(f" [3] {_mark(running_pid is not None):4s} get_running_pid() => {running_pid}")
except Exception as exc:
print(f" [3] {_mark(False):4s} get_running_pid() raised: {exc!r}")
# [4] _pid_exists() on the probed PID
candidate_pid = running_pid if running_pid is not None else pid_value
if candidate_pid is not None:
try:
from gateway.status import _pid_exists
alive = bool(_pid_exists(candidate_pid))
print(f" [4] {_mark(alive):4s} _pid_exists({candidate_pid}) => {alive}")
except Exception as exc:
print(f" [4] {_mark(False):4s} _pid_exists raised: {exc!r}")
else:
print(f" [4] {_mark(False):4s} No candidate PID to verify")
# [5] runtime status file
if state_path.exists():
try:
state_data = json.loads(state_path.read_text(encoding="utf-8"))
gateway_state = state_data.get("gateway_state")
updated_at = state_data.get("updated_at")
age_str = ""
if updated_at:
try:
updated_dt = datetime.fromisoformat(updated_at.replace("Z", "+00:00"))
now = datetime.now(timezone.utc)
age_seconds = int((now - updated_dt).total_seconds())
age_str = f" (updated {age_seconds}s ago)"
except Exception:
pass
ok = gateway_state == "running"
print(f" [5] {_mark(ok):4s} gateway_state.json state={gateway_state!r}{age_str}")
except Exception as exc:
print(f" [5] {_mark(False):4s} gateway_state.json present but unreadable: {exc}")
else:
print(f" [5] {_mark(False):4s} gateway_state.json missing: {state_path}")
# [6] Last lifecycle event from the exit-diag log
if diag_path.exists():
try:
with open(diag_path, "rb") as fh:
# Read last ~4KB; one event is well under 500 bytes.
fh.seek(0, 2)
size = fh.tell()
fh.seek(max(0, size - 4096))
tail = fh.read().decode("utf-8", errors="replace").splitlines()
last_event = next((ln for ln in reversed(tail) if ln.strip()), "")
if last_event:
try:
event = json.loads(last_event)
tag = event.get("tag", "?")
pid = event.get("pid", "?")
ts = event.get("ts", "?")
healthy = tag in ("gateway.start",)
print(f" [6] {_mark(healthy):4s} Last lifecycle event: tag={tag} pid={pid} ts={ts}")
except Exception:
print(f" [6] {_mark(False):4s} Last lifecycle line not JSON: {last_event[:120]}")
else:
print(f" [6] {_mark(False):4s} exit-diag log empty: {diag_path}")
except Exception as exc:
print(f" [6] {_mark(False):4s} exit-diag log unreadable: {exc}")
else:
print(f" [6] {_mark(False):4s} exit-diag log missing: {diag_path}")
def status(deep: bool = False) -> None:
"""Print a status report for the Windows gateway service."""
_assert_windows()
@ -1011,9 +1162,12 @@ def status(deep: bool = False) -> None:
if deep:
print()
print(f" Task name: {task_name}")
print(f" Task script: {get_task_script_path()}")
print(f" Startup entry: {get_startup_entry_path()}")
print(f" Task name: {task_name}")
print(f" Task script: {get_task_script_path()}")
print(f" Startup entry: {get_startup_entry_path()}")
# Surface the per-probe truth so the user can see *which* signal
# is lying when the high-level summary disagrees with reality.
_print_deep_probes()
if not task_installed and not startup_installed and not pids:
print()

View file

@ -167,6 +167,14 @@ class TestGatewayCommandWSLMessages:
monkeypatch.setattr(gateway, "supports_systemd_services", lambda: False)
monkeypatch.setattr(gateway, "is_macos", lambda: False)
monkeypatch.setattr(gateway, "is_managed", lambda: False)
# CRITICAL: also stub is_windows. Without this, running this test on a
# real Windows host falls through to the is_windows() branch *before*
# the WSL guidance branch, invoking gateway_windows.install() which
# writes a Startup-folder .cmd into the real user's Startup folder
# (NOT tmp_path) pointing at a now-vanished pytest fixture path.
# The user then sees a broken Hermes_Gateway.cmd flash a cmd.exe
# window on every login. See fix/windows-gateway-reliability.
monkeypatch.setattr(gateway, "is_windows", lambda: False)
args = SimpleNamespace(
gateway_command="install", force=False, system=False,
@ -189,6 +197,10 @@ class TestGatewayCommandWSLMessages:
monkeypatch.setattr(gateway, "is_wsl", lambda: True)
monkeypatch.setattr(gateway, "supports_systemd_services", lambda: False)
monkeypatch.setattr(gateway, "is_macos", lambda: False)
# See test_install_wsl_no_systemd: stub is_windows so a Windows host
# running this test does NOT actually spawn a detached gateway via
# gateway_windows.start().
monkeypatch.setattr(gateway, "is_windows", lambda: False)
args = SimpleNamespace(gateway_command="start", system=False)
with pytest.raises(SystemExit) as exc_info:
@ -206,6 +218,9 @@ class TestGatewayCommandWSLMessages:
monkeypatch.setattr(gateway, "is_macos", lambda: False)
monkeypatch.setattr(gateway, "is_termux", lambda: False)
monkeypatch.setattr(gateway, "is_wsl", lambda: True)
# Stub is_windows so a Windows host running this test does NOT take
# the Windows status branch (which reads gateway_windows.is_installed()).
monkeypatch.setattr(gateway, "is_windows", lambda: False)
monkeypatch.setattr(gateway, "find_gateway_pids", lambda: [12345])
monkeypatch.setattr(gateway, "_runtime_health_lines", lambda: [])
# Stub out the systemd unit path check
@ -231,6 +246,8 @@ class TestGatewayCommandWSLMessages:
monkeypatch.setattr(gateway, "is_macos", lambda: False)
monkeypatch.setattr(gateway, "is_termux", lambda: False)
monkeypatch.setattr(gateway, "is_wsl", lambda: True)
# See test_status_wsl_running_manual.
monkeypatch.setattr(gateway, "is_windows", lambda: False)
monkeypatch.setattr(gateway, "find_gateway_pids", lambda: [])
monkeypatch.setattr(gateway, "_runtime_health_lines", lambda: [])
monkeypatch.setattr(