mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-09 08:21:50 +00:00
* fix(gateway,windows): reliability — supervisor task, JOB breakaway, status --deep
Three coordinated fixes for the Windows gateway reliability story:
1. CREATE_BREAKAWAY_FROM_JOB on every detached spawn
The 'hermes update' triggered from the Electron Desktop GUI ran inside
Electron's job object. Without breakaway, the post-update gateway
watcher spawned by update — already DETACHED_PROCESS — was still
reaped when Electron's job tore down, so the gateway never came back
after a GUI-initiated update. Adds CREATE_BREAKAWAY_FROM_JOB (0x01000000)
to:
- hermes_cli/_subprocess_compat.py::windows_detach_flags() — used by
every helper that calls windows_detach_popen_kwargs(), including
launch_detached_profile_gateway_restart()
- The watcher subprocess's own respawn snippet in
hermes_cli/gateway.py (inlined flags so the watcher's child
respawn also breaks away)
_spawn_detached() in gateway_windows.py already had the flag; this
change brings the rest of the codebase to parity.
2. Per-minute supervisor Scheduled Task — Windows equivalent of
systemd Restart=always
Introduces hermes_cli/gateway_supervisor.py and registers it as a
second Scheduled Task ('Hermes_Gateway_Supervisor', SC MINUTE /MO 1,
LIMITED rights) alongside the existing ONLOGON task. Every minute,
the supervisor uses the same gateway.status.get_running_pid() probe
as 'hermes gateway status' and, if no gateway is alive, calls
gateway_windows._spawn_detached() (which now includes BREAKAWAY) to
bring one back.
Covers every crash mode, not just 'machine rebooted': taskkill,
OOM, GUI update SIGTERM, parent job teardown. Cheap — one pythonw
startup per minute when down, one PID-existence check per minute
when up.
Wired into both the schtasks-success and Startup-folder-fallback
install paths via _install_supervisor_best_effort(), and removed in
uninstall(). Best-effort: a failing supervisor install logs a
warning but doesn't roll back the primary install.
3. 'hermes gateway status --deep' shows per-probe PASS/FAIL
Replaces the existing terse '--deep' output (which only printed
paths) with an actual diagnostic table:
[1] PID file present
[2] Lock file held by a live process
[3] get_running_pid() result
[4] _pid_exists(pid) — OS-level liveness
[5] gateway_state.json (state + age)
[6] Last lifecycle event from gateway-exit-diag.log
When the high-level summary disagrees with reality, the user can
see exactly which signal is lying.
Test-leak fix
-------------
tests/hermes_cli/test_gateway_wsl.py::TestGatewayCommandWSLMessages
monkey-patched is_linux/is_wsl/supports_systemd_services to simulate
WSL but did NOT stub is_windows(). On a Windows host, the dispatcher
in _gateway_command_inner takes the is_windows() branch BEFORE the
WSL guidance branch, so the test invoked gateway_windows.install()
for real. install() writes to %APPDATA%\...\Startup\Hermes_Gateway.cmd
— the REAL user Startup folder, never sandboxed by tmp_path — pointing
at the test's pytest-of-<user>/pytest-<N>/.../gateway-service/ wrapper.
When pytest tore down the tmp_path, every subsequent Windows login
flashed a cmd.exe window that failed to find the missing target.
Stubs is_windows=False on all four affected tests:
test_install_wsl_no_systemd
test_start_wsl_no_systemd
test_status_wsl_running_manual
test_status_wsl_not_running
Defense-in-depth: _build_startup_launcher() now prefixes the launcher
with 'if not exist <target> exit /b 0', so any future stale Startup
entry silently no-ops instead of flashing a console window.
Status enhancements
-------------------
- status() now reports supervisor task presence alongside the existing
schtasks/Startup info, and nudges the user to reinstall if the
supervisor isn't registered.
- Deep mode dumps both the supervisor task name + script path.
* fix(gateway,windows): drop the per-minute supervisor task — keep breakaway + deep probes
Earlier in this branch we added a per-minute schtasks-based supervisor to
respawn the gateway after crashes / GUI-update SIGTERMs. The implementation
flashed a brief console window on every firing, which stole window focus.
We tried several variants:
- cmd.exe wrapper invoking pythonw -> flashes (cmd.exe is console-subsystem)
- schtasks /TR pointing at pythonw -> flashes (uv venv launcher pythonw is
actually subsystem=Console, not GUI; it respawns the real pythonw)
- schtasks /TR pointing at base uv -> still flashes (Task Scheduler-side
conhost preallocation; documented Windows quirk)
- XML registration with <Hidden>true> -> still flashes (<Hidden> only hides
the task in the Task Scheduler UI, not the spawned window)
Researched what leading projects do:
- Ollama: GUI-subsystem tray exe + Startup-folder shortcut. No supervisor.
- Tailscale: real Windows Service via SCM. Session 0, no console possible.
- Syncthing: --no-console flag inside the binary + Startup folder.
- openclaw: VBS Run(..., 0, False) wrapper. Suppresses the *window* but
Super User Q971162 confirms focus-steal still occurs in some cases.
None of these use a per-minute polling scheduled task. The 'auto-restart on
crash' responsibility belongs INSIDE the daemon (Tailscale's in-process
recovery / Ollama's monitor+worker pair) OR is delegated to the Windows
Service Control Manager — not Task Scheduler.
So this commit drops the supervisor entirely. The CREATE_BREAKAWAY_FROM_JOB
fix in _subprocess_compat.py (from commit c1e5fa433) survives — that is the
*real* fix for problem #2 (GUI-update kills gateway): the post-update
watcher in launch_detached_profile_gateway_restart() now breaks out of
Electron's job object, so the gateway respawn watcher survives the GUI
quit and successfully respawns the gateway.
Surviving from c1e5fa433:
* CREATE_BREAKAWAY_FROM_JOB in hermes_cli/_subprocess_compat.py (fixes #2)
* Inlined breakaway flag in the watcher respawn snippet in gateway.py
* hermes gateway status --deep PASS/FAIL probes (fixes #1 — visibility)
* 'if not exist <target> exit /b 0' guard in _build_startup_launcher
(fixes #3 — silent no-op for stale Startup entries)
* tests/hermes_cli/test_gateway_wsl.py is_windows=False stubs (root cause
of #3 — pytest WSL tests no longer leak Startup entries on Win hosts)
Removed in this commit:
* hermes_cli/gateway_supervisor.py (entire file)
* Supervisor section in hermes_cli/gateway_windows.py (~180 lines):
get_supervisor_task_name, get_supervisor_script_path,
_build_supervisor_cmd_script, _write_supervisor_script,
_install_supervisor_task, is_supervisor_task_registered,
_install_supervisor_best_effort
* _install_supervisor_best_effort() calls in install() (3 spots)
* supervisor cleanup block in uninstall()
* supervisor display lines in status() / status(deep=True)
Future direction (out of scope for this PR): the right place for Windows
'Restart=always' semantics is a real Windows Service installed via
pywin32's win32serviceutil.ServiceFramework — session-0 isolation, SCM
auto-restart, no console window possible. That's a meaningful next-PR
project, not a band-aid.
Tests: 51 pass / 2 pre-existing failures in
tests/hermes_cli/test_gateway_{windows,wsl}.py (the 2 failures are
TestSupportsSystemdServicesWSL cases that fail on origin/main too —
unrelated to this PR).
203 lines
8.1 KiB
Python
203 lines
8.1 KiB
Python
"""Windows subprocess compatibility helpers.
|
|
|
|
Hermes is developed on Linux / macOS and tested natively on Windows too.
|
|
Several common subprocess patterns break silently-or-loudly on Windows:
|
|
|
|
* ``["npm", "install", ...]`` — on Windows ``npm`` is ``npm.cmd``, a batch
|
|
shim. ``subprocess.Popen(["npm", ...])`` fails with WinError 193
|
|
("not a valid Win32 application") because CreateProcessW can't run a
|
|
``.cmd`` file without ``shell=True`` or PATHEXT resolution.
|
|
|
|
* ``start_new_session=True`` — on POSIX, this maps to ``os.setsid()`` and
|
|
actually detaches the child. On Windows it's silently ignored; the
|
|
Windows equivalent is ``CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS``
|
|
creationflags, which Python only applies when you pass them explicitly.
|
|
|
|
* Console-window flashes — every ``subprocess.Popen`` of a ``.exe`` on
|
|
Windows spawns a cmd window briefly unless ``CREATE_NO_WINDOW`` is
|
|
passed. Cosmetic but jarring for background daemons.
|
|
|
|
This module centralizes the platform-branching logic so the rest of the
|
|
codebase doesn't sprinkle ``if sys.platform == "win32":`` everywhere.
|
|
|
|
**All helpers are no-ops on non-Windows** — calling them in Linux/macOS
|
|
code paths is safe by design. That's the "do no damage on POSIX"
|
|
guarantee.
|
|
"""
|
|
|
|
from __future__ import annotations
|
|
|
|
import shutil
|
|
import sys
|
|
from typing import Sequence
|
|
|
|
__all__ = [
|
|
"IS_WINDOWS",
|
|
"resolve_node_command",
|
|
"windows_detach_flags",
|
|
"windows_hide_flags",
|
|
"windows_detach_popen_kwargs",
|
|
]
|
|
|
|
|
|
IS_WINDOWS = sys.platform == "win32"
|
|
|
|
|
|
# -----------------------------------------------------------------------------
|
|
# Node ecosystem launcher resolution
|
|
# -----------------------------------------------------------------------------
|
|
|
|
|
|
def resolve_node_command(name: str, argv: Sequence[str]) -> list[str]:
|
|
"""Resolve a Node-ecosystem command name to an absolute-path argv.
|
|
|
|
On Windows, commands like ``npm``, ``npx``, ``yarn``, ``pnpm``,
|
|
``playwright``, ``prettier`` ship as ``.cmd`` files (batch shims).
|
|
``subprocess.Popen(["npm", "install"])`` fails with WinError 193
|
|
because CreateProcessW doesn't execute batch files directly.
|
|
|
|
``shutil.which(name)`` *does* resolve ``.cmd`` via PATHEXT and returns
|
|
the fully-qualified path — which CreateProcessW accepts because the
|
|
extension tells Windows to route through ``cmd.exe /c``.
|
|
|
|
On POSIX ``shutil.which`` also returns a fully-qualified path when
|
|
found. That's a small change from bare-name resolution (the OS does
|
|
its own PATH search) but functionally identical and has the side
|
|
benefit of making the argv reproducible in logs.
|
|
|
|
Behavior when the command is not on PATH:
|
|
- On Windows: return the bare name — caller can still try with
|
|
``shell=True`` as a last resort, OR the subsequent Popen will
|
|
raise FileNotFoundError with a readable error we want to surface.
|
|
- On POSIX: same. Bare ``npm`` on a Linux box without npm installed
|
|
fails the same way it did before this function existed.
|
|
|
|
Args:
|
|
name: The command name to resolve (``npm``, ``npx``, ``node`` …).
|
|
argv: The remaining arguments. Must NOT include ``name`` itself —
|
|
this function builds the full argv list.
|
|
|
|
Returns:
|
|
A list suitable for passing to subprocess.Popen/run/call.
|
|
"""
|
|
resolved = shutil.which(name)
|
|
if resolved:
|
|
return [resolved, *argv]
|
|
return [name, *argv]
|
|
|
|
|
|
# -----------------------------------------------------------------------------
|
|
# Detached / hidden process creation
|
|
# -----------------------------------------------------------------------------
|
|
|
|
|
|
# Win32 CreationFlags — defined here rather than imported from subprocess
|
|
# because CREATE_NO_WINDOW and DETACHED_PROCESS aren't guaranteed to be
|
|
# present on stdlib subprocess on older Pythons or non-Windows builds.
|
|
_CREATE_NEW_PROCESS_GROUP = 0x00000200
|
|
_DETACHED_PROCESS = 0x00000008
|
|
_CREATE_NO_WINDOW = 0x08000000
|
|
# Escape any Win32 job object the parent process belongs to. Without this,
|
|
# a detached child still inherits its parent's job object membership, and
|
|
# when that parent (Electron, Tauri, Windows Terminal, the Desktop GUI's
|
|
# bootstrap-installer) dies, the OS tears down the whole job — taking the
|
|
# "detached" child with it. Critical for the post-update gateway watcher:
|
|
# Electron spawns the Tauri updater inside its own job, the updater spawns
|
|
# the watcher subprocess; without BREAKAWAY the watcher dies the instant
|
|
# Electron exits, so the gateway never gets respawned after a `hermes
|
|
# update` triggered from the GUI. See fix/windows-gateway-reliability.
|
|
_CREATE_BREAKAWAY_FROM_JOB = 0x01000000
|
|
|
|
|
|
def windows_detach_flags() -> int:
|
|
"""Return Win32 creationflags that detach a child from the parent
|
|
console and process group. 0 on non-Windows.
|
|
|
|
Pair with ``start_new_session=False`` (default) when calling
|
|
subprocess.Popen — on POSIX use ``start_new_session=True`` instead,
|
|
which maps to ``os.setsid()`` in the child.
|
|
|
|
Rationale:
|
|
- ``CREATE_NEW_PROCESS_GROUP`` — child has its own process group so
|
|
Ctrl+C in the parent console doesn't propagate.
|
|
- ``DETACHED_PROCESS`` — child has no console at all. Necessary for
|
|
background daemons (gateway watchers, update respawners) because
|
|
without it, closing the console kills the child.
|
|
- ``CREATE_NO_WINDOW`` — suppress the brief cmd flash that would
|
|
otherwise appear when launching a console app. Redundant with
|
|
DETACHED_PROCESS but explicit for clarity.
|
|
- ``CREATE_BREAKAWAY_FROM_JOB`` — escape any job object the parent is
|
|
in. Electron (Desktop app) and Tauri (bootstrap installer) wrap
|
|
their children in job objects; without breakaway, those children
|
|
die when the parent process exits even if they were spawned with
|
|
DETACHED_PROCESS. This was the missing flag that made the
|
|
post-update gateway respawn watcher silently die alongside the
|
|
Tauri updater after the Electron Desktop's update flow finished.
|
|
|
|
If a process is in a job that disallows breakaway (rare —
|
|
JOB_OBJECT_LIMIT_BREAKAWAY_OK isn't set), CreateProcess returns
|
|
ERROR_ACCESS_DENIED. Python surfaces that as ``PermissionError``
|
|
on the ``subprocess.Popen`` call. Callers in this codebase already
|
|
wrap detached spawns in ``try/except OSError`` and fall back to a
|
|
cmd.exe wrapper, so the breakaway-denied case degrades gracefully
|
|
rather than crashing.
|
|
"""
|
|
if not IS_WINDOWS:
|
|
return 0
|
|
return (
|
|
_CREATE_NEW_PROCESS_GROUP
|
|
| _DETACHED_PROCESS
|
|
| _CREATE_NO_WINDOW
|
|
| _CREATE_BREAKAWAY_FROM_JOB
|
|
)
|
|
|
|
|
|
def windows_hide_flags() -> int:
|
|
"""Return Win32 creationflags that merely hide the child's console
|
|
window without detaching the child. 0 on non-Windows.
|
|
|
|
Use for short-lived console apps spawned as part of a larger
|
|
operation (``taskkill``, ``where``, version probes) where we want no
|
|
flash but also want to collect stdout/exit code synchronously.
|
|
|
|
The key difference from :func:`windows_detach_flags`: NO
|
|
``DETACHED_PROCESS`` — the child still inherits stdio handles so
|
|
``capture_output=True`` works. ``DETACHED_PROCESS`` would sever
|
|
stdio and break stdout capture.
|
|
"""
|
|
if not IS_WINDOWS:
|
|
return 0
|
|
return _CREATE_NO_WINDOW
|
|
|
|
|
|
def windows_detach_popen_kwargs() -> dict:
|
|
"""Return a dict of Popen kwargs that detach a child on Windows and
|
|
fall back to the POSIX equivalent (``start_new_session=True``) on
|
|
Linux/macOS.
|
|
|
|
Usage pattern:
|
|
|
|
.. code-block:: python
|
|
|
|
subprocess.Popen(
|
|
argv,
|
|
stdout=subprocess.DEVNULL,
|
|
stderr=subprocess.DEVNULL,
|
|
stdin=subprocess.DEVNULL,
|
|
close_fds=True,
|
|
**windows_detach_popen_kwargs(),
|
|
)
|
|
|
|
This replaces the unsafe-on-Windows pattern:
|
|
|
|
.. code-block:: python
|
|
|
|
subprocess.Popen(..., start_new_session=True)
|
|
|
|
which silently fails to detach on Windows (the flag is accepted but
|
|
has no effect — the child stays attached to the parent's console
|
|
and dies when the console closes).
|
|
"""
|
|
if IS_WINDOWS:
|
|
return {"creationflags": windows_detach_flags()}
|
|
return {"start_new_session": True}
|