mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-09 08:21:50 +00:00
fix(windows): retry watcher Popen without breakaway when parent job denies it, plus regression tests for the breakaway bit (#40956)
#40909 added `CREATE_BREAKAWAY_FROM_JOB` to `windows_detach_flags()`, which fixed the headline bug (gateway dies after Desktop GUI update and never comes back). The flag's own docstring acknowledges that restrictive parent job objects can still refuse breakaway with `ERROR_ACCESS_DENIED`, surfacing as `OSError` on the `subprocess.Popen` call: "Callers in this codebase already wrap detached spawns in try/except OSError and fall back to a cmd.exe wrapper, so the breakaway-denied case degrades gracefully rather than crashing." That's true for `_spawn_detached` in `gateway_windows.py` (the `hermes gateway start` path), which has both the breakaway bit AND a retry-without-breakaway fallback. It's NOT true for the post-update watcher path in `launch_detached_profile_gateway_restart` (`hermes_cli/gateway.py`), which only has `except OSError: return False` and gives up entirely. If a user's shell/terminal/container wraps Hermes in a breakaway-denying job, the gateway-respawn watcher silently fails to launch instead of trying again without breakaway. This PR closes that gap and adds the regression tests that were missing from the original fix. ## Changes ### `hermes_cli/_subprocess_compat.py` Adds a sibling helper `windows_detach_flags_without_breakaway()` so callers can express the fallback symbolically (via the helper) rather than coding the magic `& ~0x01000000` mask at every site. Documented on `windows_detach_flags` and `windows_detach_flags_without_breakaway` with the recommended try/except pattern. ### `hermes_cli/gateway.py::launch_detached_profile_gateway_restart` Two changes, both aligned with the canonical pattern in `gateway_windows._spawn_detached`: 1. The outer watcher Popen now wraps in `try/except OSError`, and on failure retries with `windows_detach_flags_without_breakaway()` (POSIX never reaches this branch — `start_new_session=True` can't raise OSError). 2. The inlined respawn payload (the `python -c` watcher) also wraps its CreateProcess in try/except OSError and retries with `_flags & ~_CREATE_BREAKAWAY_FROM_JOB` on failure. This matters because the watcher's job-object inheritance is independent of the outer process's — even if the outer Popen succeeds with breakaway, the respawned gateway might inherit a job that doesn't. ### Regression tests in `tests/tools/test_windows_native_support.py` #40909 shipped the fix without any test that the breakaway bit is present (the existing `test_windows_detach_flags_has_expected_win32_bits` asserted only the three legacy bits). Four new tests close that: - `test_windows_detach_flags_includes_breakaway_from_job` — explicit assertion that the breakaway bit is in the default bundle, with the rationale spelled out in the docstring so a future maintainer staring at this test understands why removing it would resurrect the gateway-dies-after-GUI-update bug. - `test_windows_detach_flags_without_breakaway_drops_only_that_bit` — fallback payload keeps the other three detach bits intact. - `test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway` — static-text check on the stringified watcher payload. The inlined Python program isn't reachable via normal import-time inspection because it lives in a `textwrap.dedent("""...""")` literal that gets passed to a separate `python -c` interpreter. Asserting that both `_CREATE_BREAKAWAY_FROM_JOB` (symbolic) and `0x01000000` (hex literal) appear inside the dedent block is a sufficient regression guard against accidental refactors. - `test_launch_detached_profile_gateway_restart_outer_popen_has_access_denied_fallback` — static check that this PR's fallback retry is wired up symbolically. Without standing up a real Windows job object that refuses breakaway, we can't trigger the OSError in a unit test; the text guard catches the case where a future refactor removes the helper import or the `& ~_CREATE_BREAKAWAY_FROM_JOB` retry. Also extends `test_windows_detach_flags_has_expected_win32_bits` to include the breakaway bit assertion and updates `test_windows_flags_zero_on_posix` to cover the new helper. ## Tests Locally on Windows: 8/8 in the `-k "detach or breakaway or popen_kwargs or launch_detached or gateway_run_update or hermes_cli_gateway"` slice pass. Broader `tests/hermes_cli/test_gateway*.py + test_windows_native_support.py`: 172 passed, 10 failed. All 10 failures are pre-existing POSIX-only tests running on a Windows host (os.geteuid, SIGKILL fallback, is_linux fixture mismatches). Stashing this PR and re-running on bare post-#40909 main reproduces all 10 identically — none are regressions. POSIX paths unchanged: `windows_detach_flags()` and `windows_detach_flags_without_breakaway()` both return 0 off Windows, `windows_detach_popen_kwargs()` still yields `{"start_new_session": True}`. ## Out of scope - The other detached-spawn site in `hermes_cli/gateway.py` (around line 3068) also uses `windows_detach_popen_kwargs()` + `except OSError`. It deserves the same fallback treatment but the codepath is different enough (not the update-flow watcher) that it warrants a separate PR with its own scrutiny. - `gateway/run.py` has Windows branches with `windows_detach_popen_kwargs` too — same reasoning. ## Context Follow-up to #40909 (merged). I had a parallel PR (#40934, closed) that duplicated the core breakaway fix; the bits unique to that PR that #40909 didn't cover are the contents of this one. Closing #40934 and opening this slimmed-down version as the focused follow-up.
This commit is contained in:
parent
44c0c2d4ac
commit
fe0b3f2338
3 changed files with 205 additions and 15 deletions
|
|
@ -35,6 +35,7 @@ __all__ = [
|
|||
"IS_WINDOWS",
|
||||
"resolve_node_command",
|
||||
"windows_detach_flags",
|
||||
"windows_detach_flags_without_breakaway",
|
||||
"windows_hide_flags",
|
||||
"windows_detach_popen_kwargs",
|
||||
]
|
||||
|
|
@ -152,6 +153,36 @@ def windows_detach_flags() -> int:
|
|||
)
|
||||
|
||||
|
||||
def windows_detach_flags_without_breakaway() -> int:
|
||||
"""Same as :func:`windows_detach_flags` minus ``CREATE_BREAKAWAY_FROM_JOB``.
|
||||
|
||||
The docstring on :func:`windows_detach_flags` notes that a process in
|
||||
a job which disallows breakaway (no ``JOB_OBJECT_LIMIT_BREAKAWAY_OK``)
|
||||
will see ``ERROR_ACCESS_DENIED`` from CreateProcess, surfacing as
|
||||
``OSError`` (``PermissionError``) on the ``subprocess.Popen`` call.
|
||||
Callers that want to recover — by retrying without the breakaway
|
||||
bit — can pair the two helpers symbolically rather than coding the
|
||||
``& ~0x01000000`` magic at every site:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
try:
|
||||
subprocess.Popen(argv, creationflags=windows_detach_flags(), …)
|
||||
except OSError:
|
||||
subprocess.Popen(
|
||||
argv,
|
||||
creationflags=windows_detach_flags_without_breakaway(),
|
||||
…,
|
||||
)
|
||||
|
||||
See ``gateway_windows.py::_spawn_detached`` for the canonical
|
||||
implementation of this pattern. Returns 0 on non-Windows.
|
||||
"""
|
||||
if not IS_WINDOWS:
|
||||
return 0
|
||||
return _CREATE_NEW_PROCESS_GROUP | _DETACHED_PROCESS | _CREATE_NO_WINDOW
|
||||
|
||||
|
||||
def windows_hide_flags() -> int:
|
||||
"""Return Win32 creationflags that merely hide the child's console
|
||||
window without detaching the child. 0 on non-Windows.
|
||||
|
|
|
|||
|
|
@ -641,7 +641,10 @@ def launch_detached_profile_gateway_restart(profile: str, old_pid: int) -> bool:
|
|||
#
|
||||
# ``windows_detach_popen_kwargs()`` returns the right kwargs for the
|
||||
# host platform and is a no-op on POSIX (just ``start_new_session=True``).
|
||||
from hermes_cli._subprocess_compat import windows_detach_popen_kwargs
|
||||
from hermes_cli._subprocess_compat import (
|
||||
windows_detach_flags_without_breakaway,
|
||||
windows_detach_popen_kwargs,
|
||||
)
|
||||
|
||||
watcher = textwrap.dedent(
|
||||
"""
|
||||
|
|
@ -677,35 +680,66 @@ def launch_detached_profile_gateway_restart(profile: str, old_pid: int) -> bool:
|
|||
_DETACHED_PROCESS = 0x00000008
|
||||
_CREATE_NO_WINDOW = 0x08000000
|
||||
_CREATE_BREAKAWAY_FROM_JOB = 0x01000000
|
||||
_popen_kwargs["creationflags"] = (
|
||||
_flags = (
|
||||
_CREATE_NEW_PROCESS_GROUP
|
||||
| _DETACHED_PROCESS
|
||||
| _CREATE_NO_WINDOW
|
||||
| _CREATE_BREAKAWAY_FROM_JOB
|
||||
)
|
||||
try:
|
||||
_popen_kwargs["creationflags"] = _flags
|
||||
subprocess.Popen(cmd, **_popen_kwargs)
|
||||
except OSError:
|
||||
# CREATE_BREAKAWAY_FROM_JOB can be rejected with
|
||||
# ERROR_ACCESS_DENIED when the parent's job object refuses
|
||||
# breakaway. Retry without it — DETACHED_PROCESS et al.
|
||||
# alone are enough in most setups. Mirrors the canonical
|
||||
# fallback in gateway_windows._spawn_detached.
|
||||
_popen_kwargs["creationflags"] = _flags & ~_CREATE_BREAKAWAY_FROM_JOB
|
||||
subprocess.Popen(cmd, **_popen_kwargs)
|
||||
else:
|
||||
_popen_kwargs["start_new_session"] = True
|
||||
subprocess.Popen(cmd, **_popen_kwargs)
|
||||
subprocess.Popen(cmd, **_popen_kwargs)
|
||||
"""
|
||||
).strip()
|
||||
|
||||
watcher_argv = [
|
||||
sys.executable,
|
||||
"-c",
|
||||
watcher,
|
||||
str(old_pid),
|
||||
*_gateway_run_args_for_profile(profile),
|
||||
]
|
||||
|
||||
# Same platform-aware detach for the watcher process itself — so
|
||||
# closing the user's terminal doesn't kill the watcher.
|
||||
try:
|
||||
# Same platform-aware detach for the watcher process itself — so
|
||||
# closing the user's terminal doesn't kill the watcher.
|
||||
subprocess.Popen(
|
||||
[
|
||||
sys.executable,
|
||||
"-c",
|
||||
watcher,
|
||||
str(old_pid),
|
||||
*_gateway_run_args_for_profile(profile),
|
||||
],
|
||||
watcher_argv,
|
||||
stdout=subprocess.DEVNULL,
|
||||
stderr=subprocess.DEVNULL,
|
||||
**windows_detach_popen_kwargs(),
|
||||
)
|
||||
except OSError:
|
||||
return False
|
||||
# CREATE_BREAKAWAY_FROM_JOB rejected by the parent job object
|
||||
# (Electron, Windows Terminal with restrictive job settings, …).
|
||||
# Retry without it. POSIX never reaches this branch — there
|
||||
# ``start_new_session=True`` cannot raise OSError — so the
|
||||
# fallback is only meaningful on Windows.
|
||||
try:
|
||||
fallback_kwargs: dict = (
|
||||
{"creationflags": windows_detach_flags_without_breakaway()}
|
||||
if sys.platform == "win32"
|
||||
else {"start_new_session": True}
|
||||
)
|
||||
subprocess.Popen(
|
||||
watcher_argv,
|
||||
stdout=subprocess.DEVNULL,
|
||||
stderr=subprocess.DEVNULL,
|
||||
**fallback_kwargs,
|
||||
)
|
||||
except OSError:
|
||||
return False
|
||||
return True
|
||||
|
||||
|
||||
|
|
|
|||
|
|
@ -541,10 +541,12 @@ class TestSubprocessCompatHelpers:
|
|||
def test_windows_flags_zero_on_posix(self):
|
||||
from hermes_cli._subprocess_compat import (
|
||||
windows_detach_flags,
|
||||
windows_detach_flags_without_breakaway,
|
||||
windows_hide_flags,
|
||||
)
|
||||
if sys.platform != "win32":
|
||||
assert windows_detach_flags() == 0
|
||||
assert windows_detach_flags_without_breakaway() == 0
|
||||
assert windows_hide_flags() == 0
|
||||
|
||||
def test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix(self):
|
||||
|
|
@ -556,7 +558,9 @@ class TestSubprocessCompatHelpers:
|
|||
# branch behaviour. Do NOT break Linux/macOS here.
|
||||
assert kwargs == {"start_new_session": True}
|
||||
else:
|
||||
# Windows path must include creationflags with all 3 bits set.
|
||||
# Windows path must include creationflags with all 4 bits set
|
||||
# (including CREATE_BREAKAWAY_FROM_JOB — see the dedicated
|
||||
# breakaway test below for the rationale).
|
||||
assert "creationflags" in kwargs
|
||||
assert kwargs["creationflags"] != 0
|
||||
# No start_new_session on Windows (silently no-op there).
|
||||
|
|
@ -567,10 +571,57 @@ class TestSubprocessCompatHelpers:
|
|||
from hermes_cli import _subprocess_compat as sc
|
||||
monkeypatch.setattr(sc, "IS_WINDOWS", True)
|
||||
flags = sc.windows_detach_flags()
|
||||
# CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW
|
||||
# CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW |
|
||||
# CREATE_BREAKAWAY_FROM_JOB
|
||||
assert flags & 0x00000200, "missing CREATE_NEW_PROCESS_GROUP"
|
||||
assert flags & 0x00000008, "missing DETACHED_PROCESS"
|
||||
assert flags & 0x08000000, "missing CREATE_NO_WINDOW"
|
||||
assert flags & 0x01000000, "missing CREATE_BREAKAWAY_FROM_JOB"
|
||||
|
||||
def test_windows_detach_flags_includes_breakaway_from_job(self, monkeypatch):
|
||||
"""CREATE_BREAKAWAY_FROM_JOB is load-bearing for the GUI-driven update path.
|
||||
|
||||
Without it, the gateway-respawn watcher spawned by ``hermes update``
|
||||
(which runs under hermes-setup.exe, itself a grandchild of the
|
||||
Electron Desktop app) gets reaped when Electron exits and its
|
||||
Win32 job object is torn down by the OS. Result: gateway dies
|
||||
during update and never comes back.
|
||||
|
||||
Regression guard against accidentally dropping the breakaway bit
|
||||
from the default detach bundle. This was fixed in
|
||||
``fix/windows-gateway-reliability`` (PR #40909) and the bit must
|
||||
stay in the default bundle going forward.
|
||||
"""
|
||||
from hermes_cli import _subprocess_compat as sc
|
||||
monkeypatch.setattr(sc, "IS_WINDOWS", True)
|
||||
assert sc.windows_detach_flags() & 0x01000000, (
|
||||
"CREATE_BREAKAWAY_FROM_JOB (0x01000000) must remain in the "
|
||||
"default detach flag bundle so the Desktop GUI update flow "
|
||||
"can respawn the gateway after Electron exits."
|
||||
)
|
||||
|
||||
def test_windows_detach_flags_without_breakaway_drops_only_that_bit(
|
||||
self, monkeypatch
|
||||
):
|
||||
"""Fallback retry payload for restrictive job objects.
|
||||
|
||||
Some Windows Terminal / container / kiosk configurations refuse
|
||||
CREATE_BREAKAWAY_FROM_JOB with ERROR_ACCESS_DENIED. Callers
|
||||
catch ``OSError`` and retry with this payload (see
|
||||
``gateway_windows._spawn_detached`` for the canonical pattern).
|
||||
It must drop ONLY the breakaway bit — DETACHED_PROCESS et al.
|
||||
are still required for the child to survive the parent's exit.
|
||||
"""
|
||||
from hermes_cli import _subprocess_compat as sc
|
||||
monkeypatch.setattr(sc, "IS_WINDOWS", True)
|
||||
full = sc.windows_detach_flags()
|
||||
fallback = sc.windows_detach_flags_without_breakaway()
|
||||
# Fallback equals full minus the breakaway bit, nothing else changed.
|
||||
assert fallback == full & ~0x01000000
|
||||
# And the three "detach" bits we still need are present.
|
||||
assert fallback & 0x00000200, "fallback missing CREATE_NEW_PROCESS_GROUP"
|
||||
assert fallback & 0x00000008, "fallback missing DETACHED_PROCESS"
|
||||
assert fallback & 0x08000000, "fallback missing CREATE_NO_WINDOW"
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
|
|
@ -880,3 +931,77 @@ class TestGatewayDetachedWatcherWindowsFlags:
|
|||
assert 'if sys.platform == "win32":' in source
|
||||
# Windows branch uses windows_detach_popen_kwargs
|
||||
assert "windows_detach_popen_kwargs" in source
|
||||
|
||||
def test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway(self):
|
||||
"""The inlined respawn script (stringified Python passed to ``python -c``)
|
||||
must include CREATE_BREAKAWAY_FROM_JOB so the *respawned gateway* also
|
||||
breaks away from any job-object the watcher itself inherits.
|
||||
|
||||
Static check — the watcher source is built at import time and embedded
|
||||
verbatim in the module text. Parsing it for an exact AST node would be
|
||||
brittle; the textual presence of the hex flag plus the symbolic name is
|
||||
a sufficient regression guard.
|
||||
|
||||
The bit was added to the inlined payload by PR #40909. This test
|
||||
ensures a future refactor of the dedent block doesn't silently drop it.
|
||||
"""
|
||||
root = Path(__file__).resolve().parents[2]
|
||||
text = (root / "hermes_cli" / "gateway.py").read_text(encoding="utf-8")
|
||||
marker = "watcher = textwrap.dedent("
|
||||
idx = text.find(marker)
|
||||
assert idx != -1, "watcher block not found in gateway.py"
|
||||
end = text.find(").strip()", idx)
|
||||
assert end != -1, "watcher block end not found"
|
||||
block = text[idx:end]
|
||||
assert "0x01000000" in block, (
|
||||
"Inlined respawn watcher must set CREATE_BREAKAWAY_FROM_JOB "
|
||||
"(0x01000000) on the respawned gateway — without it, the new "
|
||||
"gateway is reaped when the parent job is torn down."
|
||||
)
|
||||
assert "_CREATE_BREAKAWAY_FROM_JOB" in block, (
|
||||
"Inlined respawn watcher must name CREATE_BREAKAWAY_FROM_JOB "
|
||||
"symbolically so the intent is greppable."
|
||||
)
|
||||
|
||||
def test_launch_detached_profile_gateway_restart_outer_popen_has_access_denied_fallback(
|
||||
self,
|
||||
):
|
||||
"""When the outer watcher Popen raises OSError (breakaway denied by
|
||||
the parent job object), the watcher launch must retry without the
|
||||
breakaway bit instead of giving up.
|
||||
|
||||
This mirrors the canonical pattern in
|
||||
``gateway_windows._spawn_detached`` and brings the post-update
|
||||
watcher path into parity with the gateway-start path: a
|
||||
breakaway-denied job object on the parent process (rare but
|
||||
possible on Windows Terminal with restrictive job settings,
|
||||
containers, kiosk-mode shells) shouldn't take out the entire
|
||||
gateway-respawn chain.
|
||||
|
||||
Static check — without standing up a real Windows job object
|
||||
with breakaway forbidden, we can't trigger the OSError in a unit
|
||||
test. The textual presence of the fallback helper import +
|
||||
``windows_detach_flags_without_breakaway`` in the fallback path
|
||||
is the regression guard.
|
||||
"""
|
||||
root = Path(__file__).resolve().parents[2]
|
||||
text = (root / "hermes_cli" / "gateway.py").read_text(encoding="utf-8")
|
||||
assert "windows_detach_flags_without_breakaway" in text, (
|
||||
"launch_detached_profile_gateway_restart must import "
|
||||
"windows_detach_flags_without_breakaway so it can retry a "
|
||||
"breakaway-denied Popen without giving up on the watcher."
|
||||
)
|
||||
# And the inlined watcher's respawn must also handle the denial —
|
||||
# check the symbol is referenced INSIDE the watcher block (not
|
||||
# just at module scope).
|
||||
marker = "watcher = textwrap.dedent("
|
||||
idx = text.find(marker)
|
||||
end = text.find(").strip()", idx)
|
||||
block = text[idx:end]
|
||||
# The inlined script catches OSError on the respawn and retries
|
||||
# with breakaway cleared via ``& ~_CREATE_BREAKAWAY_FROM_JOB``.
|
||||
assert "~_CREATE_BREAKAWAY_FROM_JOB" in block, (
|
||||
"Inlined respawn must catch OSError on the breakaway-denied "
|
||||
"CreateProcess and retry without the breakaway bit, matching "
|
||||
"gateway_windows._spawn_detached's fallback pattern."
|
||||
)
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue