From fe0b3f233832c074aca5bb1dc86615190a9782ba Mon Sep 17 00:00:00 2001 From: Teknium <127238744+teknium1@users.noreply.github.com> Date: Sun, 7 Jun 2026 01:21:58 -0700 Subject: [PATCH] fix(windows): retry watcher Popen without breakaway when parent job denies it, plus regression tests for the breakaway bit (#40956) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit #40909 added `CREATE_BREAKAWAY_FROM_JOB` to `windows_detach_flags()`, which fixed the headline bug (gateway dies after Desktop GUI update and never comes back). The flag's own docstring acknowledges that restrictive parent job objects can still refuse breakaway with `ERROR_ACCESS_DENIED`, surfacing as `OSError` on the `subprocess.Popen` call: "Callers in this codebase already wrap detached spawns in try/except OSError and fall back to a cmd.exe wrapper, so the breakaway-denied case degrades gracefully rather than crashing." That's true for `_spawn_detached` in `gateway_windows.py` (the `hermes gateway start` path), which has both the breakaway bit AND a retry-without-breakaway fallback. It's NOT true for the post-update watcher path in `launch_detached_profile_gateway_restart` (`hermes_cli/gateway.py`), which only has `except OSError: return False` and gives up entirely. If a user's shell/terminal/container wraps Hermes in a breakaway-denying job, the gateway-respawn watcher silently fails to launch instead of trying again without breakaway. This PR closes that gap and adds the regression tests that were missing from the original fix. ## Changes ### `hermes_cli/_subprocess_compat.py` Adds a sibling helper `windows_detach_flags_without_breakaway()` so callers can express the fallback symbolically (via the helper) rather than coding the magic `& ~0x01000000` mask at every site. Documented on `windows_detach_flags` and `windows_detach_flags_without_breakaway` with the recommended try/except pattern. ### `hermes_cli/gateway.py::launch_detached_profile_gateway_restart` Two changes, both aligned with the canonical pattern in `gateway_windows._spawn_detached`: 1. The outer watcher Popen now wraps in `try/except OSError`, and on failure retries with `windows_detach_flags_without_breakaway()` (POSIX never reaches this branch — `start_new_session=True` can't raise OSError). 2. The inlined respawn payload (the `python -c` watcher) also wraps its CreateProcess in try/except OSError and retries with `_flags & ~_CREATE_BREAKAWAY_FROM_JOB` on failure. This matters because the watcher's job-object inheritance is independent of the outer process's — even if the outer Popen succeeds with breakaway, the respawned gateway might inherit a job that doesn't. ### Regression tests in `tests/tools/test_windows_native_support.py` #40909 shipped the fix without any test that the breakaway bit is present (the existing `test_windows_detach_flags_has_expected_win32_bits` asserted only the three legacy bits). Four new tests close that: - `test_windows_detach_flags_includes_breakaway_from_job` — explicit assertion that the breakaway bit is in the default bundle, with the rationale spelled out in the docstring so a future maintainer staring at this test understands why removing it would resurrect the gateway-dies-after-GUI-update bug. - `test_windows_detach_flags_without_breakaway_drops_only_that_bit` — fallback payload keeps the other three detach bits intact. - `test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway` — static-text check on the stringified watcher payload. The inlined Python program isn't reachable via normal import-time inspection because it lives in a `textwrap.dedent("""...""")` literal that gets passed to a separate `python -c` interpreter. Asserting that both `_CREATE_BREAKAWAY_FROM_JOB` (symbolic) and `0x01000000` (hex literal) appear inside the dedent block is a sufficient regression guard against accidental refactors. - `test_launch_detached_profile_gateway_restart_outer_popen_has_access_denied_fallback` — static check that this PR's fallback retry is wired up symbolically. Without standing up a real Windows job object that refuses breakaway, we can't trigger the OSError in a unit test; the text guard catches the case where a future refactor removes the helper import or the `& ~_CREATE_BREAKAWAY_FROM_JOB` retry. Also extends `test_windows_detach_flags_has_expected_win32_bits` to include the breakaway bit assertion and updates `test_windows_flags_zero_on_posix` to cover the new helper. ## Tests Locally on Windows: 8/8 in the `-k "detach or breakaway or popen_kwargs or launch_detached or gateway_run_update or hermes_cli_gateway"` slice pass. Broader `tests/hermes_cli/test_gateway*.py + test_windows_native_support.py`: 172 passed, 10 failed. All 10 failures are pre-existing POSIX-only tests running on a Windows host (os.geteuid, SIGKILL fallback, is_linux fixture mismatches). Stashing this PR and re-running on bare post-#40909 main reproduces all 10 identically — none are regressions. POSIX paths unchanged: `windows_detach_flags()` and `windows_detach_flags_without_breakaway()` both return 0 off Windows, `windows_detach_popen_kwargs()` still yields `{"start_new_session": True}`. ## Out of scope - The other detached-spawn site in `hermes_cli/gateway.py` (around line 3068) also uses `windows_detach_popen_kwargs()` + `except OSError`. It deserves the same fallback treatment but the codepath is different enough (not the update-flow watcher) that it warrants a separate PR with its own scrutiny. - `gateway/run.py` has Windows branches with `windows_detach_popen_kwargs` too — same reasoning. ## Context Follow-up to #40909 (merged). I had a parallel PR (#40934, closed) that duplicated the core breakaway fix; the bits unique to that PR that #40909 didn't cover are the contents of this one. Closing #40934 and opening this slimmed-down version as the focused follow-up. --- hermes_cli/_subprocess_compat.py | 31 +++++ hermes_cli/gateway.py | 60 +++++++--- tests/tools/test_windows_native_support.py | 129 ++++++++++++++++++++- 3 files changed, 205 insertions(+), 15 deletions(-) diff --git a/hermes_cli/_subprocess_compat.py b/hermes_cli/_subprocess_compat.py index b6bce13ad1d..607a9a3e6a4 100644 --- a/hermes_cli/_subprocess_compat.py +++ b/hermes_cli/_subprocess_compat.py @@ -35,6 +35,7 @@ __all__ = [ "IS_WINDOWS", "resolve_node_command", "windows_detach_flags", + "windows_detach_flags_without_breakaway", "windows_hide_flags", "windows_detach_popen_kwargs", ] @@ -152,6 +153,36 @@ def windows_detach_flags() -> int: ) +def windows_detach_flags_without_breakaway() -> int: + """Same as :func:`windows_detach_flags` minus ``CREATE_BREAKAWAY_FROM_JOB``. + + The docstring on :func:`windows_detach_flags` notes that a process in + a job which disallows breakaway (no ``JOB_OBJECT_LIMIT_BREAKAWAY_OK``) + will see ``ERROR_ACCESS_DENIED`` from CreateProcess, surfacing as + ``OSError`` (``PermissionError``) on the ``subprocess.Popen`` call. + Callers that want to recover — by retrying without the breakaway + bit — can pair the two helpers symbolically rather than coding the + ``& ~0x01000000`` magic at every site: + + .. code-block:: python + + try: + subprocess.Popen(argv, creationflags=windows_detach_flags(), …) + except OSError: + subprocess.Popen( + argv, + creationflags=windows_detach_flags_without_breakaway(), + …, + ) + + See ``gateway_windows.py::_spawn_detached`` for the canonical + implementation of this pattern. Returns 0 on non-Windows. + """ + if not IS_WINDOWS: + return 0 + return _CREATE_NEW_PROCESS_GROUP | _DETACHED_PROCESS | _CREATE_NO_WINDOW + + def windows_hide_flags() -> int: """Return Win32 creationflags that merely hide the child's console window without detaching the child. 0 on non-Windows. diff --git a/hermes_cli/gateway.py b/hermes_cli/gateway.py index 03228004053..d1339444800 100644 --- a/hermes_cli/gateway.py +++ b/hermes_cli/gateway.py @@ -641,7 +641,10 @@ def launch_detached_profile_gateway_restart(profile: str, old_pid: int) -> bool: # # ``windows_detach_popen_kwargs()`` returns the right kwargs for the # host platform and is a no-op on POSIX (just ``start_new_session=True``). - from hermes_cli._subprocess_compat import windows_detach_popen_kwargs + from hermes_cli._subprocess_compat import ( + windows_detach_flags_without_breakaway, + windows_detach_popen_kwargs, + ) watcher = textwrap.dedent( """ @@ -677,35 +680,66 @@ def launch_detached_profile_gateway_restart(profile: str, old_pid: int) -> bool: _DETACHED_PROCESS = 0x00000008 _CREATE_NO_WINDOW = 0x08000000 _CREATE_BREAKAWAY_FROM_JOB = 0x01000000 - _popen_kwargs["creationflags"] = ( + _flags = ( _CREATE_NEW_PROCESS_GROUP | _DETACHED_PROCESS | _CREATE_NO_WINDOW | _CREATE_BREAKAWAY_FROM_JOB ) + try: + _popen_kwargs["creationflags"] = _flags + subprocess.Popen(cmd, **_popen_kwargs) + except OSError: + # CREATE_BREAKAWAY_FROM_JOB can be rejected with + # ERROR_ACCESS_DENIED when the parent's job object refuses + # breakaway. Retry without it — DETACHED_PROCESS et al. + # alone are enough in most setups. Mirrors the canonical + # fallback in gateway_windows._spawn_detached. + _popen_kwargs["creationflags"] = _flags & ~_CREATE_BREAKAWAY_FROM_JOB + subprocess.Popen(cmd, **_popen_kwargs) else: _popen_kwargs["start_new_session"] = True - subprocess.Popen(cmd, **_popen_kwargs) + subprocess.Popen(cmd, **_popen_kwargs) """ ).strip() + watcher_argv = [ + sys.executable, + "-c", + watcher, + str(old_pid), + *_gateway_run_args_for_profile(profile), + ] + + # Same platform-aware detach for the watcher process itself — so + # closing the user's terminal doesn't kill the watcher. try: - # Same platform-aware detach for the watcher process itself — so - # closing the user's terminal doesn't kill the watcher. subprocess.Popen( - [ - sys.executable, - "-c", - watcher, - str(old_pid), - *_gateway_run_args_for_profile(profile), - ], + watcher_argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, **windows_detach_popen_kwargs(), ) except OSError: - return False + # CREATE_BREAKAWAY_FROM_JOB rejected by the parent job object + # (Electron, Windows Terminal with restrictive job settings, …). + # Retry without it. POSIX never reaches this branch — there + # ``start_new_session=True`` cannot raise OSError — so the + # fallback is only meaningful on Windows. + try: + fallback_kwargs: dict = ( + {"creationflags": windows_detach_flags_without_breakaway()} + if sys.platform == "win32" + else {"start_new_session": True} + ) + subprocess.Popen( + watcher_argv, + stdout=subprocess.DEVNULL, + stderr=subprocess.DEVNULL, + **fallback_kwargs, + ) + except OSError: + return False return True diff --git a/tests/tools/test_windows_native_support.py b/tests/tools/test_windows_native_support.py index baba1985950..661fefb6555 100644 --- a/tests/tools/test_windows_native_support.py +++ b/tests/tools/test_windows_native_support.py @@ -541,10 +541,12 @@ class TestSubprocessCompatHelpers: def test_windows_flags_zero_on_posix(self): from hermes_cli._subprocess_compat import ( windows_detach_flags, + windows_detach_flags_without_breakaway, windows_hide_flags, ) if sys.platform != "win32": assert windows_detach_flags() == 0 + assert windows_detach_flags_without_breakaway() == 0 assert windows_hide_flags() == 0 def test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix(self): @@ -556,7 +558,9 @@ class TestSubprocessCompatHelpers: # branch behaviour. Do NOT break Linux/macOS here. assert kwargs == {"start_new_session": True} else: - # Windows path must include creationflags with all 3 bits set. + # Windows path must include creationflags with all 4 bits set + # (including CREATE_BREAKAWAY_FROM_JOB — see the dedicated + # breakaway test below for the rationale). assert "creationflags" in kwargs assert kwargs["creationflags"] != 0 # No start_new_session on Windows (silently no-op there). @@ -567,10 +571,57 @@ class TestSubprocessCompatHelpers: from hermes_cli import _subprocess_compat as sc monkeypatch.setattr(sc, "IS_WINDOWS", True) flags = sc.windows_detach_flags() - # CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW + # CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW | + # CREATE_BREAKAWAY_FROM_JOB assert flags & 0x00000200, "missing CREATE_NEW_PROCESS_GROUP" assert flags & 0x00000008, "missing DETACHED_PROCESS" assert flags & 0x08000000, "missing CREATE_NO_WINDOW" + assert flags & 0x01000000, "missing CREATE_BREAKAWAY_FROM_JOB" + + def test_windows_detach_flags_includes_breakaway_from_job(self, monkeypatch): + """CREATE_BREAKAWAY_FROM_JOB is load-bearing for the GUI-driven update path. + + Without it, the gateway-respawn watcher spawned by ``hermes update`` + (which runs under hermes-setup.exe, itself a grandchild of the + Electron Desktop app) gets reaped when Electron exits and its + Win32 job object is torn down by the OS. Result: gateway dies + during update and never comes back. + + Regression guard against accidentally dropping the breakaway bit + from the default detach bundle. This was fixed in + ``fix/windows-gateway-reliability`` (PR #40909) and the bit must + stay in the default bundle going forward. + """ + from hermes_cli import _subprocess_compat as sc + monkeypatch.setattr(sc, "IS_WINDOWS", True) + assert sc.windows_detach_flags() & 0x01000000, ( + "CREATE_BREAKAWAY_FROM_JOB (0x01000000) must remain in the " + "default detach flag bundle so the Desktop GUI update flow " + "can respawn the gateway after Electron exits." + ) + + def test_windows_detach_flags_without_breakaway_drops_only_that_bit( + self, monkeypatch + ): + """Fallback retry payload for restrictive job objects. + + Some Windows Terminal / container / kiosk configurations refuse + CREATE_BREAKAWAY_FROM_JOB with ERROR_ACCESS_DENIED. Callers + catch ``OSError`` and retry with this payload (see + ``gateway_windows._spawn_detached`` for the canonical pattern). + It must drop ONLY the breakaway bit — DETACHED_PROCESS et al. + are still required for the child to survive the parent's exit. + """ + from hermes_cli import _subprocess_compat as sc + monkeypatch.setattr(sc, "IS_WINDOWS", True) + full = sc.windows_detach_flags() + fallback = sc.windows_detach_flags_without_breakaway() + # Fallback equals full minus the breakaway bit, nothing else changed. + assert fallback == full & ~0x01000000 + # And the three "detach" bits we still need are present. + assert fallback & 0x00000200, "fallback missing CREATE_NEW_PROCESS_GROUP" + assert fallback & 0x00000008, "fallback missing DETACHED_PROCESS" + assert fallback & 0x08000000, "fallback missing CREATE_NO_WINDOW" # --------------------------------------------------------------------------- @@ -880,3 +931,77 @@ class TestGatewayDetachedWatcherWindowsFlags: assert 'if sys.platform == "win32":' in source # Windows branch uses windows_detach_popen_kwargs assert "windows_detach_popen_kwargs" in source + + def test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway(self): + """The inlined respawn script (stringified Python passed to ``python -c``) + must include CREATE_BREAKAWAY_FROM_JOB so the *respawned gateway* also + breaks away from any job-object the watcher itself inherits. + + Static check — the watcher source is built at import time and embedded + verbatim in the module text. Parsing it for an exact AST node would be + brittle; the textual presence of the hex flag plus the symbolic name is + a sufficient regression guard. + + The bit was added to the inlined payload by PR #40909. This test + ensures a future refactor of the dedent block doesn't silently drop it. + """ + root = Path(__file__).resolve().parents[2] + text = (root / "hermes_cli" / "gateway.py").read_text(encoding="utf-8") + marker = "watcher = textwrap.dedent(" + idx = text.find(marker) + assert idx != -1, "watcher block not found in gateway.py" + end = text.find(").strip()", idx) + assert end != -1, "watcher block end not found" + block = text[idx:end] + assert "0x01000000" in block, ( + "Inlined respawn watcher must set CREATE_BREAKAWAY_FROM_JOB " + "(0x01000000) on the respawned gateway — without it, the new " + "gateway is reaped when the parent job is torn down." + ) + assert "_CREATE_BREAKAWAY_FROM_JOB" in block, ( + "Inlined respawn watcher must name CREATE_BREAKAWAY_FROM_JOB " + "symbolically so the intent is greppable." + ) + + def test_launch_detached_profile_gateway_restart_outer_popen_has_access_denied_fallback( + self, + ): + """When the outer watcher Popen raises OSError (breakaway denied by + the parent job object), the watcher launch must retry without the + breakaway bit instead of giving up. + + This mirrors the canonical pattern in + ``gateway_windows._spawn_detached`` and brings the post-update + watcher path into parity with the gateway-start path: a + breakaway-denied job object on the parent process (rare but + possible on Windows Terminal with restrictive job settings, + containers, kiosk-mode shells) shouldn't take out the entire + gateway-respawn chain. + + Static check — without standing up a real Windows job object + with breakaway forbidden, we can't trigger the OSError in a unit + test. The textual presence of the fallback helper import + + ``windows_detach_flags_without_breakaway`` in the fallback path + is the regression guard. + """ + root = Path(__file__).resolve().parents[2] + text = (root / "hermes_cli" / "gateway.py").read_text(encoding="utf-8") + assert "windows_detach_flags_without_breakaway" in text, ( + "launch_detached_profile_gateway_restart must import " + "windows_detach_flags_without_breakaway so it can retry a " + "breakaway-denied Popen without giving up on the watcher." + ) + # And the inlined watcher's respawn must also handle the denial — + # check the symbol is referenced INSIDE the watcher block (not + # just at module scope). + marker = "watcher = textwrap.dedent(" + idx = text.find(marker) + end = text.find(").strip()", idx) + block = text[idx:end] + # The inlined script catches OSError on the respawn and retries + # with breakaway cleared via ``& ~_CREATE_BREAKAWAY_FROM_JOB``. + assert "~_CREATE_BREAKAWAY_FROM_JOB" in block, ( + "Inlined respawn must catch OSError on the breakaway-denied " + "CreateProcess and retry without the breakaway bit, matching " + "gateway_windows._spawn_detached's fallback pattern." + )