mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-10 08:32:09 +00:00
Three coordinated fixes for the Windows gateway reliability story:
1. CREATE_BREAKAWAY_FROM_JOB on every detached spawn
The 'hermes update' triggered from the Electron Desktop GUI ran inside
Electron's job object. Without breakaway, the post-update gateway
watcher spawned by update — already DETACHED_PROCESS — was still
reaped when Electron's job tore down, so the gateway never came back
after a GUI-initiated update. Adds CREATE_BREAKAWAY_FROM_JOB (0x01000000)
to:
- hermes_cli/_subprocess_compat.py::windows_detach_flags() — used by
every helper that calls windows_detach_popen_kwargs(), including
launch_detached_profile_gateway_restart()
- The watcher subprocess's own respawn snippet in
hermes_cli/gateway.py (inlined flags so the watcher's child
respawn also breaks away)
_spawn_detached() in gateway_windows.py already had the flag; this
change brings the rest of the codebase to parity.
2. Per-minute supervisor Scheduled Task — Windows equivalent of
systemd Restart=always
Introduces hermes_cli/gateway_supervisor.py and registers it as a
second Scheduled Task ('Hermes_Gateway_Supervisor', SC MINUTE /MO 1,
LIMITED rights) alongside the existing ONLOGON task. Every minute,
the supervisor uses the same gateway.status.get_running_pid() probe
as 'hermes gateway status' and, if no gateway is alive, calls
gateway_windows._spawn_detached() (which now includes BREAKAWAY) to
bring one back.
Covers every crash mode, not just 'machine rebooted': taskkill,
OOM, GUI update SIGTERM, parent job teardown. Cheap — one pythonw
startup per minute when down, one PID-existence check per minute
when up.
Wired into both the schtasks-success and Startup-folder-fallback
install paths via _install_supervisor_best_effort(), and removed in
uninstall(). Best-effort: a failing supervisor install logs a
warning but doesn't roll back the primary install.
3. 'hermes gateway status --deep' shows per-probe PASS/FAIL
Replaces the existing terse '--deep' output (which only printed
paths) with an actual diagnostic table:
[1] PID file present
[2] Lock file held by a live process
[3] get_running_pid() result
[4] _pid_exists(pid) — OS-level liveness
[5] gateway_state.json (state + age)
[6] Last lifecycle event from gateway-exit-diag.log
When the high-level summary disagrees with reality, the user can
see exactly which signal is lying.
Test-leak fix
-------------
tests/hermes_cli/test_gateway_wsl.py::TestGatewayCommandWSLMessages
monkey-patched is_linux/is_wsl/supports_systemd_services to simulate
WSL but did NOT stub is_windows(). On a Windows host, the dispatcher
in _gateway_command_inner takes the is_windows() branch BEFORE the
WSL guidance branch, so the test invoked gateway_windows.install()
for real. install() writes to %APPDATA%\...\Startup\Hermes_Gateway.cmd
— the REAL user Startup folder, never sandboxed by tmp_path — pointing
at the test's pytest-of-<user>/pytest-<N>/.../gateway-service/ wrapper.
When pytest tore down the tmp_path, every subsequent Windows login
flashed a cmd.exe window that failed to find the missing target.
Stubs is_windows=False on all four affected tests:
test_install_wsl_no_systemd
test_start_wsl_no_systemd
test_status_wsl_running_manual
test_status_wsl_not_running
Defense-in-depth: _build_startup_launcher() now prefixes the launcher
with 'if not exist <target> exit /b 0', so any future stale Startup
entry silently no-ops instead of flashing a console window.
Status enhancements
-------------------
- status() now reports supervisor task presence alongside the existing
schtasks/Startup info, and nudges the user to reinstall if the
supervisor isn't registered.
- Deep mode dumps both the supervisor task name + script path.
203 lines
8.1 KiB
Python
203 lines
8.1 KiB
Python
"""Windows subprocess compatibility helpers.
|
|
|
|
Hermes is developed on Linux / macOS and tested natively on Windows too.
|
|
Several common subprocess patterns break silently-or-loudly on Windows:
|
|
|
|
* ``["npm", "install", ...]`` — on Windows ``npm`` is ``npm.cmd``, a batch
|
|
shim. ``subprocess.Popen(["npm", ...])`` fails with WinError 193
|
|
("not a valid Win32 application") because CreateProcessW can't run a
|
|
``.cmd`` file without ``shell=True`` or PATHEXT resolution.
|
|
|
|
* ``start_new_session=True`` — on POSIX, this maps to ``os.setsid()`` and
|
|
actually detaches the child. On Windows it's silently ignored; the
|
|
Windows equivalent is ``CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS``
|
|
creationflags, which Python only applies when you pass them explicitly.
|
|
|
|
* Console-window flashes — every ``subprocess.Popen`` of a ``.exe`` on
|
|
Windows spawns a cmd window briefly unless ``CREATE_NO_WINDOW`` is
|
|
passed. Cosmetic but jarring for background daemons.
|
|
|
|
This module centralizes the platform-branching logic so the rest of the
|
|
codebase doesn't sprinkle ``if sys.platform == "win32":`` everywhere.
|
|
|
|
**All helpers are no-ops on non-Windows** — calling them in Linux/macOS
|
|
code paths is safe by design. That's the "do no damage on POSIX"
|
|
guarantee.
|
|
"""
|
|
|
|
from __future__ import annotations
|
|
|
|
import shutil
|
|
import sys
|
|
from typing import Sequence
|
|
|
|
__all__ = [
|
|
"IS_WINDOWS",
|
|
"resolve_node_command",
|
|
"windows_detach_flags",
|
|
"windows_hide_flags",
|
|
"windows_detach_popen_kwargs",
|
|
]
|
|
|
|
|
|
IS_WINDOWS = sys.platform == "win32"
|
|
|
|
|
|
# -----------------------------------------------------------------------------
|
|
# Node ecosystem launcher resolution
|
|
# -----------------------------------------------------------------------------
|
|
|
|
|
|
def resolve_node_command(name: str, argv: Sequence[str]) -> list[str]:
|
|
"""Resolve a Node-ecosystem command name to an absolute-path argv.
|
|
|
|
On Windows, commands like ``npm``, ``npx``, ``yarn``, ``pnpm``,
|
|
``playwright``, ``prettier`` ship as ``.cmd`` files (batch shims).
|
|
``subprocess.Popen(["npm", "install"])`` fails with WinError 193
|
|
because CreateProcessW doesn't execute batch files directly.
|
|
|
|
``shutil.which(name)`` *does* resolve ``.cmd`` via PATHEXT and returns
|
|
the fully-qualified path — which CreateProcessW accepts because the
|
|
extension tells Windows to route through ``cmd.exe /c``.
|
|
|
|
On POSIX ``shutil.which`` also returns a fully-qualified path when
|
|
found. That's a small change from bare-name resolution (the OS does
|
|
its own PATH search) but functionally identical and has the side
|
|
benefit of making the argv reproducible in logs.
|
|
|
|
Behavior when the command is not on PATH:
|
|
- On Windows: return the bare name — caller can still try with
|
|
``shell=True`` as a last resort, OR the subsequent Popen will
|
|
raise FileNotFoundError with a readable error we want to surface.
|
|
- On POSIX: same. Bare ``npm`` on a Linux box without npm installed
|
|
fails the same way it did before this function existed.
|
|
|
|
Args:
|
|
name: The command name to resolve (``npm``, ``npx``, ``node`` …).
|
|
argv: The remaining arguments. Must NOT include ``name`` itself —
|
|
this function builds the full argv list.
|
|
|
|
Returns:
|
|
A list suitable for passing to subprocess.Popen/run/call.
|
|
"""
|
|
resolved = shutil.which(name)
|
|
if resolved:
|
|
return [resolved, *argv]
|
|
return [name, *argv]
|
|
|
|
|
|
# -----------------------------------------------------------------------------
|
|
# Detached / hidden process creation
|
|
# -----------------------------------------------------------------------------
|
|
|
|
|
|
# Win32 CreationFlags — defined here rather than imported from subprocess
|
|
# because CREATE_NO_WINDOW and DETACHED_PROCESS aren't guaranteed to be
|
|
# present on stdlib subprocess on older Pythons or non-Windows builds.
|
|
_CREATE_NEW_PROCESS_GROUP = 0x00000200
|
|
_DETACHED_PROCESS = 0x00000008
|
|
_CREATE_NO_WINDOW = 0x08000000
|
|
# Escape any Win32 job object the parent process belongs to. Without this,
|
|
# a detached child still inherits its parent's job object membership, and
|
|
# when that parent (Electron, Tauri, Windows Terminal, the Desktop GUI's
|
|
# bootstrap-installer) dies, the OS tears down the whole job — taking the
|
|
# "detached" child with it. Critical for the post-update gateway watcher:
|
|
# Electron spawns the Tauri updater inside its own job, the updater spawns
|
|
# the watcher subprocess; without BREAKAWAY the watcher dies the instant
|
|
# Electron exits, so the gateway never gets respawned after a `hermes
|
|
# update` triggered from the GUI. See fix/windows-gateway-reliability.
|
|
_CREATE_BREAKAWAY_FROM_JOB = 0x01000000
|
|
|
|
|
|
def windows_detach_flags() -> int:
|
|
"""Return Win32 creationflags that detach a child from the parent
|
|
console and process group. 0 on non-Windows.
|
|
|
|
Pair with ``start_new_session=False`` (default) when calling
|
|
subprocess.Popen — on POSIX use ``start_new_session=True`` instead,
|
|
which maps to ``os.setsid()`` in the child.
|
|
|
|
Rationale:
|
|
- ``CREATE_NEW_PROCESS_GROUP`` — child has its own process group so
|
|
Ctrl+C in the parent console doesn't propagate.
|
|
- ``DETACHED_PROCESS`` — child has no console at all. Necessary for
|
|
background daemons (gateway watchers, update respawners) because
|
|
without it, closing the console kills the child.
|
|
- ``CREATE_NO_WINDOW`` — suppress the brief cmd flash that would
|
|
otherwise appear when launching a console app. Redundant with
|
|
DETACHED_PROCESS but explicit for clarity.
|
|
- ``CREATE_BREAKAWAY_FROM_JOB`` — escape any job object the parent is
|
|
in. Electron (Desktop app) and Tauri (bootstrap installer) wrap
|
|
their children in job objects; without breakaway, those children
|
|
die when the parent process exits even if they were spawned with
|
|
DETACHED_PROCESS. This was the missing flag that made the
|
|
post-update gateway respawn watcher silently die alongside the
|
|
Tauri updater after the Electron Desktop's update flow finished.
|
|
|
|
If a process is in a job that disallows breakaway (rare —
|
|
JOB_OBJECT_LIMIT_BREAKAWAY_OK isn't set), CreateProcess returns
|
|
ERROR_ACCESS_DENIED. Python surfaces that as ``PermissionError``
|
|
on the ``subprocess.Popen`` call. Callers in this codebase already
|
|
wrap detached spawns in ``try/except OSError`` and fall back to a
|
|
cmd.exe wrapper, so the breakaway-denied case degrades gracefully
|
|
rather than crashing.
|
|
"""
|
|
if not IS_WINDOWS:
|
|
return 0
|
|
return (
|
|
_CREATE_NEW_PROCESS_GROUP
|
|
| _DETACHED_PROCESS
|
|
| _CREATE_NO_WINDOW
|
|
| _CREATE_BREAKAWAY_FROM_JOB
|
|
)
|
|
|
|
|
|
def windows_hide_flags() -> int:
|
|
"""Return Win32 creationflags that merely hide the child's console
|
|
window without detaching the child. 0 on non-Windows.
|
|
|
|
Use for short-lived console apps spawned as part of a larger
|
|
operation (``taskkill``, ``where``, version probes) where we want no
|
|
flash but also want to collect stdout/exit code synchronously.
|
|
|
|
The key difference from :func:`windows_detach_flags`: NO
|
|
``DETACHED_PROCESS`` — the child still inherits stdio handles so
|
|
``capture_output=True`` works. ``DETACHED_PROCESS`` would sever
|
|
stdio and break stdout capture.
|
|
"""
|
|
if not IS_WINDOWS:
|
|
return 0
|
|
return _CREATE_NO_WINDOW
|
|
|
|
|
|
def windows_detach_popen_kwargs() -> dict:
|
|
"""Return a dict of Popen kwargs that detach a child on Windows and
|
|
fall back to the POSIX equivalent (``start_new_session=True``) on
|
|
Linux/macOS.
|
|
|
|
Usage pattern:
|
|
|
|
.. code-block:: python
|
|
|
|
subprocess.Popen(
|
|
argv,
|
|
stdout=subprocess.DEVNULL,
|
|
stderr=subprocess.DEVNULL,
|
|
stdin=subprocess.DEVNULL,
|
|
close_fds=True,
|
|
**windows_detach_popen_kwargs(),
|
|
)
|
|
|
|
This replaces the unsafe-on-Windows pattern:
|
|
|
|
.. code-block:: python
|
|
|
|
subprocess.Popen(..., start_new_session=True)
|
|
|
|
which silently fails to detach on Windows (the flag is accepted but
|
|
has no effect — the child stays attached to the parent's console
|
|
and dies when the console closes).
|
|
"""
|
|
if IS_WINDOWS:
|
|
return {"creationflags": windows_detach_flags()}
|
|
return {"start_new_session": True}
|