mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-09 08:21:50 +00:00
#40909 added `CREATE_BREAKAWAY_FROM_JOB` to `windows_detach_flags()`, which fixed the headline bug (gateway dies after Desktop GUI update and never comes back). The flag's own docstring acknowledges that restrictive parent job objects can still refuse breakaway with `ERROR_ACCESS_DENIED`, surfacing as `OSError` on the `subprocess.Popen` call:

> Callers in this codebase already wrap detached spawns in try/except OSError and fall back to a cmd.exe wrapper, so the breakaway-denied case degrades gracefully rather than crashing.

That's true for `_spawn_detached` in `gateway_windows.py` (the `hermes gateway start` path), which has both the breakaway bit AND a retry-without-breakaway fallback. It's NOT true for the post-update watcher path in `launch_detached_profile_gateway_restart` (`hermes_cli/gateway.py`), which only has `except OSError: return False` and gives up entirely. If a user's shell/terminal/container wraps Hermes in a breakaway-denying job, the gateway-respawn watcher silently fails to launch instead of trying again without breakaway.

This PR closes that gap and adds the regression tests that were missing from the original fix.

## Changes

### `hermes_cli/_subprocess_compat.py`

Adds a sibling helper `windows_detach_flags_without_breakaway()` so callers can express the fallback symbolically (via the helper) rather than coding the magic `& ~0x01000000` mask at every site. The recommended try/except pattern is documented on `windows_detach_flags` and `windows_detach_flags_without_breakaway`.

### `hermes_cli/gateway.py::launch_detached_profile_gateway_restart`

Two changes, both aligned with the canonical pattern in `gateway_windows._spawn_detached`:

1. The outer watcher Popen is now wrapped in `try/except OSError`, and on failure retries with `windows_detach_flags_without_breakaway()`. (POSIX never reaches this branch: `start_new_session=True` can't raise OSError.)
2. The inlined respawn payload (the `python -c` watcher) also wraps its CreateProcess in try/except OSError and retries with `_flags & ~_CREATE_BREAKAWAY_FROM_JOB` on failure. This matters because the watcher's job-object inheritance is independent of the outer process's: even if the outer Popen succeeds with breakaway, the respawned gateway might inherit a job that doesn't allow breakaway.

### Regression tests in `tests/tools/test_windows_native_support.py`

#40909 shipped the fix without any test that the breakaway bit is present (the existing `test_windows_detach_flags_has_expected_win32_bits` asserted only the three legacy bits). Four new tests close that gap:

- `test_windows_detach_flags_includes_breakaway_from_job`: explicit assertion that the breakaway bit is in the default bundle, with the rationale spelled out in the docstring so a future maintainer staring at this test understands why removing it would resurrect the gateway-dies-after-GUI-update bug.
- `test_windows_detach_flags_without_breakaway_drops_only_that_bit`: the fallback payload keeps the other three detach bits intact.
- `test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway`: static-text check on the stringified watcher payload. The inlined Python program isn't reachable via normal import-time inspection because it lives in a `textwrap.dedent("""...""")` literal that gets passed to a separate `python -c` interpreter. Asserting that both `_CREATE_BREAKAWAY_FROM_JOB` (symbolic) and `0x01000000` (hex literal) appear inside the dedent block is a sufficient regression guard against accidental refactors.
- `test_launch_detached_profile_gateway_restart_outer_popen_has_access_denied_fallback`: static check that this PR's fallback retry is wired up symbolically. Without standing up a real Windows job object that refuses breakaway, we can't trigger the OSError in a unit test; the text guard catches the case where a future refactor removes the helper import or the `& ~_CREATE_BREAKAWAY_FROM_JOB` retry.

Also extends `test_windows_detach_flags_has_expected_win32_bits` to include the breakaway bit assertion and updates `test_windows_flags_zero_on_posix` to cover the new helper.

## Tests

Locally on Windows: 8/8 in the `-k "detach or breakaway or popen_kwargs or launch_detached or gateway_run_update or hermes_cli_gateway"` slice pass.

Broader `tests/hermes_cli/test_gateway*.py + test_windows_native_support.py`: 172 passed, 10 failed. All 10 failures are pre-existing POSIX-only tests running on a Windows host (os.geteuid, SIGKILL fallback, is_linux fixture mismatches). Stashing this PR and re-running on bare post-#40909 main reproduces all 10 identically, so none are regressions.

POSIX paths are unchanged: `windows_detach_flags()` and `windows_detach_flags_without_breakaway()` both return 0 off Windows, and `windows_detach_popen_kwargs()` still yields `{"start_new_session": True}`.

## Out of scope

- The other detached-spawn site in `hermes_cli/gateway.py` (around line 3068) also uses `windows_detach_popen_kwargs()` + `except OSError`. It deserves the same fallback treatment, but the codepath is different enough (not the update-flow watcher) that it warrants a separate PR with its own scrutiny.
- `gateway/run.py` has Windows branches with `windows_detach_popen_kwargs` too; the same reasoning applies.

## Context

Follow-up to #40909 (merged). I had a parallel PR (#40934, closed) that duplicated the core breakaway fix; the bits unique to that PR that #40909 didn't cover are the contents of this one. Closing #40934 and opening this slimmed-down version as the focused follow-up.
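
The retry-without-breakaway pattern described above can be sketched in isolation. This is a minimal illustration, not the code in `hermes_cli/gateway.py` or `gateway_windows.py`: the helper name `spawn_detached_with_fallback` and its `popen` and `is_windows` parameters are hypothetical, introduced only so the fallback can be exercised without a real Windows job object. The flag values are the ones defined in `_subprocess_compat.py`.

```python
import subprocess
import sys
from typing import Any, Callable, Optional, Sequence

# Win32 creation flag values, as defined in _subprocess_compat.py.
CREATE_NEW_PROCESS_GROUP = 0x00000200
DETACHED_PROCESS = 0x00000008
CREATE_NO_WINDOW = 0x08000000
CREATE_BREAKAWAY_FROM_JOB = 0x01000000

DETACH_FLAGS = (
    CREATE_NEW_PROCESS_GROUP
    | DETACHED_PROCESS
    | CREATE_NO_WINDOW
    | CREATE_BREAKAWAY_FROM_JOB
)
# Clearing one bit leaves the other three detach bits untouched.
DETACH_FLAGS_NO_BREAKAWAY = DETACH_FLAGS & ~CREATE_BREAKAWAY_FROM_JOB


def spawn_detached_with_fallback(
    argv: Sequence[str],
    popen: Callable[..., Any] = subprocess.Popen,  # injectable for tests (assumption)
    is_windows: bool = sys.platform == "win32",
) -> Optional[Any]:
    """Hypothetical helper: try breakaway first, then retry without it.

    Returns the spawned process object, or None if every attempt raised
    OSError.
    """
    if not is_windows:
        # POSIX: no creationflags, detach via a new session instead.
        try:
            return popen(list(argv), start_new_session=True)
        except OSError:
            return None

    for flags in (DETACH_FLAGS, DETACH_FLAGS_NO_BREAKAWAY):
        try:
            return popen(list(argv), creationflags=flags)
        except OSError:
            # A job object that denies breakaway surfaces here as
            # PermissionError, a subclass of OSError; try the next flag set.
            continue
    return None
```

Injecting `popen` keeps the sketch testable on any platform: a fake that raises `PermissionError` whenever the breakaway bit is set stands in for a breakaway-denying job object, which is the situation the static-text tests above cannot trigger directly.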
234 lines
9.3 KiB
Python
"""Windows subprocess compatibility helpers.

Hermes is developed on Linux / macOS and tested natively on Windows too.
Several common subprocess patterns break silently-or-loudly on Windows:

* ``["npm", "install", ...]``: on Windows ``npm`` is ``npm.cmd``, a batch
  shim. ``subprocess.Popen(["npm", ...])`` fails with WinError 193
  ("not a valid Win32 application") because CreateProcessW can't run a
  ``.cmd`` file without ``shell=True`` or PATHEXT resolution.

* ``start_new_session=True``: on POSIX, this maps to ``os.setsid()`` and
  actually detaches the child. On Windows it's silently ignored; the
  Windows equivalent is ``CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS``
  creationflags, which Python only applies when you pass them explicitly.

* Console-window flashes: every ``subprocess.Popen`` of a ``.exe`` on
  Windows spawns a cmd window briefly unless ``CREATE_NO_WINDOW`` is
  passed. Cosmetic but jarring for background daemons.

This module centralizes the platform-branching logic so the rest of the
codebase doesn't sprinkle ``if sys.platform == "win32":`` everywhere.

**All helpers are no-ops on non-Windows**: calling them in Linux/macOS
code paths is safe by design. That's the "do no damage on POSIX"
guarantee.
"""

from __future__ import annotations

import shutil
import sys
from typing import Sequence

__all__ = [
    "IS_WINDOWS",
    "resolve_node_command",
    "windows_detach_flags",
    "windows_detach_flags_without_breakaway",
    "windows_hide_flags",
    "windows_detach_popen_kwargs",
]


IS_WINDOWS = sys.platform == "win32"


# -----------------------------------------------------------------------------
# Node ecosystem launcher resolution
# -----------------------------------------------------------------------------


def resolve_node_command(name: str, argv: Sequence[str]) -> list[str]:
    """Resolve a Node-ecosystem command name to an absolute-path argv.

    On Windows, commands like ``npm``, ``npx``, ``yarn``, ``pnpm``,
    ``playwright``, ``prettier`` ship as ``.cmd`` files (batch shims).
    ``subprocess.Popen(["npm", "install"])`` fails with WinError 193
    because CreateProcessW doesn't execute batch files directly.

    ``shutil.which(name)`` *does* resolve ``.cmd`` via PATHEXT and returns
    the fully-qualified path, which CreateProcessW accepts because the
    extension tells Windows to route through ``cmd.exe /c``.

    On POSIX ``shutil.which`` also returns a fully-qualified path when
    found. That's a small change from bare-name resolution (the OS does
    its own PATH search) but functionally identical and has the side
    benefit of making the argv reproducible in logs.

    Behavior when the command is not on PATH:
      - On Windows: return the bare name. The caller can still try with
        ``shell=True`` as a last resort, OR the subsequent Popen will
        raise FileNotFoundError with a readable error we want to surface.
      - On POSIX: same. Bare ``npm`` on a Linux box without npm installed
        fails the same way it did before this function existed.

    Args:
        name: The command name to resolve (``npm``, ``npx``, ``node`` …).
        argv: The remaining arguments. Must NOT include ``name`` itself;
            this function builds the full argv list.

    Returns:
        A list suitable for passing to subprocess.Popen/run/call.
    """
    resolved = shutil.which(name)
    if resolved:
        return [resolved, *argv]
    return [name, *argv]


# -----------------------------------------------------------------------------
# Detached / hidden process creation
# -----------------------------------------------------------------------------


# Win32 CreationFlags: defined here rather than imported from subprocess
# because CREATE_NO_WINDOW and DETACHED_PROCESS aren't guaranteed to be
# present on stdlib subprocess on older Pythons or non-Windows builds.
_CREATE_NEW_PROCESS_GROUP = 0x00000200
_DETACHED_PROCESS = 0x00000008
_CREATE_NO_WINDOW = 0x08000000
# Escape any Win32 job object the parent process belongs to. Without this,
# a detached child still inherits its parent's job object membership, and
# when that parent (Electron, Tauri, Windows Terminal, the Desktop GUI's
# bootstrap-installer) dies, the OS tears down the whole job, taking the
# "detached" child with it. Critical for the post-update gateway watcher:
# Electron spawns the Tauri updater inside its own job, the updater spawns
# the watcher subprocess; without BREAKAWAY the watcher dies the instant
# Electron exits, so the gateway never gets respawned after a `hermes
# update` triggered from the GUI. See fix/windows-gateway-reliability.
_CREATE_BREAKAWAY_FROM_JOB = 0x01000000


def windows_detach_flags() -> int:
    """Return Win32 creationflags that detach a child from the parent
    console and process group. 0 on non-Windows.

    Pair with ``start_new_session=False`` (default) when calling
    subprocess.Popen. On POSIX use ``start_new_session=True`` instead,
    which maps to ``os.setsid()`` in the child.

    Rationale:
      - ``CREATE_NEW_PROCESS_GROUP``: child has its own process group so
        Ctrl+C in the parent console doesn't propagate.
      - ``DETACHED_PROCESS``: child has no console at all. Necessary for
        background daemons (gateway watchers, update respawners) because
        without it, closing the console kills the child.
      - ``CREATE_NO_WINDOW``: suppress the brief cmd flash that would
        otherwise appear when launching a console app. Redundant with
        DETACHED_PROCESS but explicit for clarity.
      - ``CREATE_BREAKAWAY_FROM_JOB``: escape any job object the parent is
        in. Electron (Desktop app) and Tauri (bootstrap installer) wrap
        their children in job objects; without breakaway, those children
        die when the parent process exits even if they were spawned with
        DETACHED_PROCESS. This was the missing flag that made the
        post-update gateway respawn watcher silently die alongside the
        Tauri updater after the Electron Desktop's update flow finished.

    If a process is in a job that disallows breakaway (rare:
    JOB_OBJECT_LIMIT_BREAKAWAY_OK isn't set), CreateProcess returns
    ERROR_ACCESS_DENIED. Python surfaces that as ``PermissionError``
    on the ``subprocess.Popen`` call. Callers in this codebase already
    wrap detached spawns in ``try/except OSError`` and fall back to a
    cmd.exe wrapper, so the breakaway-denied case degrades gracefully
    rather than crashing.
    """
    if not IS_WINDOWS:
        return 0
    return (
        _CREATE_NEW_PROCESS_GROUP
        | _DETACHED_PROCESS
        | _CREATE_NO_WINDOW
        | _CREATE_BREAKAWAY_FROM_JOB
    )


def windows_detach_flags_without_breakaway() -> int:
    """Same as :func:`windows_detach_flags` minus ``CREATE_BREAKAWAY_FROM_JOB``.

    The docstring on :func:`windows_detach_flags` notes that a process in
    a job which disallows breakaway (no ``JOB_OBJECT_LIMIT_BREAKAWAY_OK``)
    will see ``ERROR_ACCESS_DENIED`` from CreateProcess, surfacing as
    ``OSError`` (``PermissionError``) on the ``subprocess.Popen`` call.
    Callers that want to recover, by retrying without the breakaway
    bit, can pair the two helpers symbolically rather than coding the
    ``& ~0x01000000`` magic at every site:

    .. code-block:: python

        try:
            subprocess.Popen(argv, creationflags=windows_detach_flags(), …)
        except OSError:
            subprocess.Popen(
                argv,
                creationflags=windows_detach_flags_without_breakaway(),
                …,
            )

    See ``gateway_windows.py::_spawn_detached`` for the canonical
    implementation of this pattern. Returns 0 on non-Windows.
    """
    if not IS_WINDOWS:
        return 0
    return _CREATE_NEW_PROCESS_GROUP | _DETACHED_PROCESS | _CREATE_NO_WINDOW


def windows_hide_flags() -> int:
    """Return Win32 creationflags that merely hide the child's console
    window without detaching the child. 0 on non-Windows.

    Use for short-lived console apps spawned as part of a larger
    operation (``taskkill``, ``where``, version probes) where we want no
    flash but also want to collect stdout/exit code synchronously.

    The key difference from :func:`windows_detach_flags`: NO
    ``DETACHED_PROCESS``, so the child still inherits stdio handles and
    ``capture_output=True`` works. ``DETACHED_PROCESS`` would sever
    stdio and break stdout capture.
    """
    if not IS_WINDOWS:
        return 0
    return _CREATE_NO_WINDOW


def windows_detach_popen_kwargs() -> dict:
    """Return a dict of Popen kwargs that detach a child on Windows and
    fall back to the POSIX equivalent (``start_new_session=True``) on
    Linux/macOS.

    Usage pattern:

    .. code-block:: python

        subprocess.Popen(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            stdin=subprocess.DEVNULL,
            close_fds=True,
            **windows_detach_popen_kwargs(),
        )

    This replaces the unsafe-on-Windows pattern:

    .. code-block:: python

        subprocess.Popen(..., start_new_session=True)

    which silently fails to detach on Windows (the flag is accepted but
    has no effect: the child stays attached to the parent's console
    and dies when the console closes).
    """
    if IS_WINDOWS:
        return {"creationflags": windows_detach_flags()}
    return {"start_new_session": True}