fix(windows): retry watcher Popen without breakaway when parent job denies it, plus regression tests for the breakaway bit (#40956)

#40909 added `CREATE_BREAKAWAY_FROM_JOB` to `windows_detach_flags()`,
which fixed the headline bug (gateway dies after Desktop GUI update
and never comes back). The flag's own docstring acknowledges that
restrictive parent job objects can still refuse breakaway with
`ERROR_ACCESS_DENIED`, surfacing as `OSError` on the `subprocess.Popen`
call:

  "Callers in this codebase already wrap detached spawns in
  try/except OSError and fall back to a cmd.exe wrapper, so the
  breakaway-denied case degrades gracefully rather than crashing."

That's true for `_spawn_detached` in `gateway_windows.py` (the
`hermes gateway start` path), which has both the breakaway bit AND a
retry-without-breakaway fallback. It's NOT true for the post-update
watcher path in `launch_detached_profile_gateway_restart`
(`hermes_cli/gateway.py`), which only has `except OSError: return
False` and gives up entirely. If a user's shell/terminal/container
wraps Hermes in a breakaway-denying job, the gateway-respawn watcher
silently fails to launch instead of trying again without breakaway.

This PR closes that gap and adds the regression tests that were
missing from the original fix.

## Changes

### `hermes_cli/_subprocess_compat.py`

Adds a sibling helper `windows_detach_flags_without_breakaway()` so
callers can express the fallback symbolically (via the helper) rather
than coding the magic `& ~0x01000000` mask at every site. Documented
on `windows_detach_flags` and `windows_detach_flags_without_breakaway`
with the recommended try/except pattern.

### `hermes_cli/gateway.py::launch_detached_profile_gateway_restart`

Two changes, both aligned with the canonical pattern in
`gateway_windows._spawn_detached`:

1. The outer watcher Popen now wraps in `try/except OSError`, and on
   failure retries with `windows_detach_flags_without_breakaway()`
   (POSIX never reaches this branch — `start_new_session=True` can't
   raise OSError).
2. The inlined respawn payload (the `python -c` watcher) also
   wraps its CreateProcess in try/except OSError and retries with
   `_flags & ~_CREATE_BREAKAWAY_FROM_JOB` on failure. This matters
   because the watcher's job-object inheritance is independent of the
   outer process's — even if the outer Popen succeeds with breakaway,
   the respawned gateway might inherit a job that doesn't.

### Regression tests in `tests/tools/test_windows_native_support.py`

#40909 shipped the fix without any test that the breakaway bit is
present (the existing `test_windows_detach_flags_has_expected_win32_bits`
asserted only the three legacy bits). Four new tests close that:

- `test_windows_detach_flags_includes_breakaway_from_job` — explicit
  assertion that the breakaway bit is in the default bundle, with the
  rationale spelled out in the docstring so a future maintainer
  staring at this test understands why removing it would resurrect
  the gateway-dies-after-GUI-update bug.
- `test_windows_detach_flags_without_breakaway_drops_only_that_bit`
  — fallback payload keeps the other three detach bits intact.
- `test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway`
  — static-text check on the stringified watcher payload. The inlined
  Python program isn't reachable via normal import-time inspection
  because it lives in a `textwrap.dedent("""...""")` literal that
  gets passed to a separate `python -c` interpreter. Asserting that
  both `_CREATE_BREAKAWAY_FROM_JOB` (symbolic) and `0x01000000` (hex
  literal) appear inside the dedent block is a sufficient regression
  guard against accidental refactors.
- `test_launch_detached_profile_gateway_restart_outer_popen_has_access_denied_fallback`
  — static check that this PR's fallback retry is wired up
  symbolically. Without standing up a real Windows job object that
  refuses breakaway, we can't trigger the OSError in a unit test;
  the text guard catches the case where a future refactor removes
  the helper import or the `& ~_CREATE_BREAKAWAY_FROM_JOB` retry.

Also extends `test_windows_detach_flags_has_expected_win32_bits` to
include the breakaway bit assertion and updates
`test_windows_flags_zero_on_posix` to cover the new helper.

## Tests

Locally on Windows: 8/8 in the `-k "detach or breakaway or
popen_kwargs or launch_detached or gateway_run_update or
hermes_cli_gateway"` slice pass.

Broader `tests/hermes_cli/test_gateway*.py + test_windows_native_support.py`:
172 passed, 10 failed. All 10 failures are pre-existing POSIX-only
tests running on a Windows host (os.geteuid, SIGKILL fallback,
is_linux fixture mismatches). Stashing this PR and re-running on bare
post-#40909 main reproduces all 10 identically — none are regressions.

POSIX paths unchanged: `windows_detach_flags()` and
`windows_detach_flags_without_breakaway()` both return 0 off Windows,
`windows_detach_popen_kwargs()` still yields `{"start_new_session": True}`.

## Out of scope

- The other detached-spawn site in `hermes_cli/gateway.py` (around
  line 3068) also uses `windows_detach_popen_kwargs()` + `except
  OSError`. It deserves the same fallback treatment but the codepath
  is different enough (not the update-flow watcher) that it warrants
  a separate PR with its own scrutiny.
- `gateway/run.py` has Windows branches with `windows_detach_popen_kwargs`
  too — same reasoning.

## Context

Follow-up to #40909 (merged). I had a parallel PR (#40934, closed)
that duplicated the core breakaway fix; the bits unique to that PR
that #40909 didn't cover are the contents of this one. Closing #40934
and opening this slimmed-down version as the focused follow-up.
This commit is contained in:
Teknium 2026-06-07 01:21:58 -07:00 committed by GitHub
parent 44c0c2d4ac
commit fe0b3f2338
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
3 changed files with 205 additions and 15 deletions

View file

@ -35,6 +35,7 @@ __all__ = [
"IS_WINDOWS",
"resolve_node_command",
"windows_detach_flags",
"windows_detach_flags_without_breakaway",
"windows_hide_flags",
"windows_detach_popen_kwargs",
]
@ -152,6 +153,36 @@ def windows_detach_flags() -> int:
)
def windows_detach_flags_without_breakaway() -> int:
"""Same as :func:`windows_detach_flags` minus ``CREATE_BREAKAWAY_FROM_JOB``.
The docstring on :func:`windows_detach_flags` notes that a process in
a job which disallows breakaway (no ``JOB_OBJECT_LIMIT_BREAKAWAY_OK``)
will see ``ERROR_ACCESS_DENIED`` from CreateProcess, surfacing as
``OSError`` (``PermissionError``) on the ``subprocess.Popen`` call.
Callers that want to recover by retrying without the breakaway
bit can pair the two helpers symbolically rather than coding the
``& ~0x01000000`` magic at every site:
.. code-block:: python
try:
subprocess.Popen(argv, creationflags=windows_detach_flags(), )
except OSError:
subprocess.Popen(
argv,
creationflags=windows_detach_flags_without_breakaway(),
,
)
See ``gateway_windows.py::_spawn_detached`` for the canonical
implementation of this pattern. Returns 0 on non-Windows.
"""
if not IS_WINDOWS:
return 0
return _CREATE_NEW_PROCESS_GROUP | _DETACHED_PROCESS | _CREATE_NO_WINDOW
def windows_hide_flags() -> int:
"""Return Win32 creationflags that merely hide the child's console
window without detaching the child. 0 on non-Windows.

View file

@ -641,7 +641,10 @@ def launch_detached_profile_gateway_restart(profile: str, old_pid: int) -> bool:
#
# ``windows_detach_popen_kwargs()`` returns the right kwargs for the
# host platform and is a no-op on POSIX (just ``start_new_session=True``).
from hermes_cli._subprocess_compat import windows_detach_popen_kwargs
from hermes_cli._subprocess_compat import (
windows_detach_flags_without_breakaway,
windows_detach_popen_kwargs,
)
watcher = textwrap.dedent(
"""
@ -677,35 +680,66 @@ def launch_detached_profile_gateway_restart(profile: str, old_pid: int) -> bool:
_DETACHED_PROCESS = 0x00000008
_CREATE_NO_WINDOW = 0x08000000
_CREATE_BREAKAWAY_FROM_JOB = 0x01000000
_popen_kwargs["creationflags"] = (
_flags = (
_CREATE_NEW_PROCESS_GROUP
| _DETACHED_PROCESS
| _CREATE_NO_WINDOW
| _CREATE_BREAKAWAY_FROM_JOB
)
try:
_popen_kwargs["creationflags"] = _flags
subprocess.Popen(cmd, **_popen_kwargs)
except OSError:
# CREATE_BREAKAWAY_FROM_JOB can be rejected with
# ERROR_ACCESS_DENIED when the parent's job object refuses
# breakaway. Retry without it — DETACHED_PROCESS et al.
# alone are enough in most setups. Mirrors the canonical
# fallback in gateway_windows._spawn_detached.
_popen_kwargs["creationflags"] = _flags & ~_CREATE_BREAKAWAY_FROM_JOB
subprocess.Popen(cmd, **_popen_kwargs)
else:
_popen_kwargs["start_new_session"] = True
subprocess.Popen(cmd, **_popen_kwargs)
subprocess.Popen(cmd, **_popen_kwargs)
"""
).strip()
watcher_argv = [
sys.executable,
"-c",
watcher,
str(old_pid),
*_gateway_run_args_for_profile(profile),
]
# Same platform-aware detach for the watcher process itself — so
# closing the user's terminal doesn't kill the watcher.
try:
# Same platform-aware detach for the watcher process itself — so
# closing the user's terminal doesn't kill the watcher.
subprocess.Popen(
[
sys.executable,
"-c",
watcher,
str(old_pid),
*_gateway_run_args_for_profile(profile),
],
watcher_argv,
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
**windows_detach_popen_kwargs(),
)
except OSError:
return False
# CREATE_BREAKAWAY_FROM_JOB rejected by the parent job object
# (Electron, Windows Terminal with restrictive job settings, …).
# Retry without it. POSIX never reaches this branch — there
# ``start_new_session=True`` cannot raise OSError — so the
# fallback is only meaningful on Windows.
try:
fallback_kwargs: dict = (
{"creationflags": windows_detach_flags_without_breakaway()}
if sys.platform == "win32"
else {"start_new_session": True}
)
subprocess.Popen(
watcher_argv,
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
**fallback_kwargs,
)
except OSError:
return False
return True

View file

@ -541,10 +541,12 @@ class TestSubprocessCompatHelpers:
def test_windows_flags_zero_on_posix(self):
from hermes_cli._subprocess_compat import (
windows_detach_flags,
windows_detach_flags_without_breakaway,
windows_hide_flags,
)
if sys.platform != "win32":
assert windows_detach_flags() == 0
assert windows_detach_flags_without_breakaway() == 0
assert windows_hide_flags() == 0
def test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix(self):
@ -556,7 +558,9 @@ class TestSubprocessCompatHelpers:
# branch behaviour. Do NOT break Linux/macOS here.
assert kwargs == {"start_new_session": True}
else:
# Windows path must include creationflags with all 3 bits set.
# Windows path must include creationflags with all 4 bits set
# (including CREATE_BREAKAWAY_FROM_JOB — see the dedicated
# breakaway test below for the rationale).
assert "creationflags" in kwargs
assert kwargs["creationflags"] != 0
# No start_new_session on Windows (silently no-op there).
@ -567,10 +571,57 @@ class TestSubprocessCompatHelpers:
from hermes_cli import _subprocess_compat as sc
monkeypatch.setattr(sc, "IS_WINDOWS", True)
flags = sc.windows_detach_flags()
# CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW
# CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW |
# CREATE_BREAKAWAY_FROM_JOB
assert flags & 0x00000200, "missing CREATE_NEW_PROCESS_GROUP"
assert flags & 0x00000008, "missing DETACHED_PROCESS"
assert flags & 0x08000000, "missing CREATE_NO_WINDOW"
assert flags & 0x01000000, "missing CREATE_BREAKAWAY_FROM_JOB"
def test_windows_detach_flags_includes_breakaway_from_job(self, monkeypatch):
"""CREATE_BREAKAWAY_FROM_JOB is load-bearing for the GUI-driven update path.
Without it, the gateway-respawn watcher spawned by ``hermes update``
(which runs under hermes-setup.exe, itself a grandchild of the
Electron Desktop app) gets reaped when Electron exits and its
Win32 job object is torn down by the OS. Result: gateway dies
during update and never comes back.
Regression guard against accidentally dropping the breakaway bit
from the default detach bundle. This was fixed in
``fix/windows-gateway-reliability`` (PR #40909) and the bit must
stay in the default bundle going forward.
"""
from hermes_cli import _subprocess_compat as sc
monkeypatch.setattr(sc, "IS_WINDOWS", True)
assert sc.windows_detach_flags() & 0x01000000, (
"CREATE_BREAKAWAY_FROM_JOB (0x01000000) must remain in the "
"default detach flag bundle so the Desktop GUI update flow "
"can respawn the gateway after Electron exits."
)
def test_windows_detach_flags_without_breakaway_drops_only_that_bit(
self, monkeypatch
):
"""Fallback retry payload for restrictive job objects.
Some Windows Terminal / container / kiosk configurations refuse
CREATE_BREAKAWAY_FROM_JOB with ERROR_ACCESS_DENIED. Callers
catch ``OSError`` and retry with this payload (see
``gateway_windows._spawn_detached`` for the canonical pattern).
It must drop ONLY the breakaway bit DETACHED_PROCESS et al.
are still required for the child to survive the parent's exit.
"""
from hermes_cli import _subprocess_compat as sc
monkeypatch.setattr(sc, "IS_WINDOWS", True)
full = sc.windows_detach_flags()
fallback = sc.windows_detach_flags_without_breakaway()
# Fallback equals full minus the breakaway bit, nothing else changed.
assert fallback == full & ~0x01000000
# And the three "detach" bits we still need are present.
assert fallback & 0x00000200, "fallback missing CREATE_NEW_PROCESS_GROUP"
assert fallback & 0x00000008, "fallback missing DETACHED_PROCESS"
assert fallback & 0x08000000, "fallback missing CREATE_NO_WINDOW"
# ---------------------------------------------------------------------------
@ -880,3 +931,77 @@ class TestGatewayDetachedWatcherWindowsFlags:
assert 'if sys.platform == "win32":' in source
# Windows branch uses windows_detach_popen_kwargs
assert "windows_detach_popen_kwargs" in source
def test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway(self):
"""The inlined respawn script (stringified Python passed to ``python -c``)
must include CREATE_BREAKAWAY_FROM_JOB so the *respawned gateway* also
breaks away from any job-object the watcher itself inherits.
Static check the watcher source is built at import time and embedded
verbatim in the module text. Parsing it for an exact AST node would be
brittle; the textual presence of the hex flag plus the symbolic name is
a sufficient regression guard.
The bit was added to the inlined payload by PR #40909. This test
ensures a future refactor of the dedent block doesn't silently drop it.
"""
root = Path(__file__).resolve().parents[2]
text = (root / "hermes_cli" / "gateway.py").read_text(encoding="utf-8")
marker = "watcher = textwrap.dedent("
idx = text.find(marker)
assert idx != -1, "watcher block not found in gateway.py"
end = text.find(").strip()", idx)
assert end != -1, "watcher block end not found"
block = text[idx:end]
assert "0x01000000" in block, (
"Inlined respawn watcher must set CREATE_BREAKAWAY_FROM_JOB "
"(0x01000000) on the respawned gateway — without it, the new "
"gateway is reaped when the parent job is torn down."
)
assert "_CREATE_BREAKAWAY_FROM_JOB" in block, (
"Inlined respawn watcher must name CREATE_BREAKAWAY_FROM_JOB "
"symbolically so the intent is greppable."
)
def test_launch_detached_profile_gateway_restart_outer_popen_has_access_denied_fallback(
self,
):
"""When the outer watcher Popen raises OSError (breakaway denied by
the parent job object), the watcher launch must retry without the
breakaway bit instead of giving up.
This mirrors the canonical pattern in
``gateway_windows._spawn_detached`` and brings the post-update
watcher path into parity with the gateway-start path: a
breakaway-denied job object on the parent process (rare but
possible on Windows Terminal with restrictive job settings,
containers, kiosk-mode shells) shouldn't take out the entire
gateway-respawn chain.
Static check without standing up a real Windows job object
with breakaway forbidden, we can't trigger the OSError in a unit
test. The textual presence of the fallback helper import +
``windows_detach_flags_without_breakaway`` in the fallback path
is the regression guard.
"""
root = Path(__file__).resolve().parents[2]
text = (root / "hermes_cli" / "gateway.py").read_text(encoding="utf-8")
assert "windows_detach_flags_without_breakaway" in text, (
"launch_detached_profile_gateway_restart must import "
"windows_detach_flags_without_breakaway so it can retry a "
"breakaway-denied Popen without giving up on the watcher."
)
# And the inlined watcher's respawn must also handle the denial —
# check the symbol is referenced INSIDE the watcher block (not
# just at module scope).
marker = "watcher = textwrap.dedent("
idx = text.find(marker)
end = text.find(").strip()", idx)
block = text[idx:end]
# The inlined script catches OSError on the respawn and retries
# with breakaway cleared via ``& ~_CREATE_BREAKAWAY_FROM_JOB``.
assert "~_CREATE_BREAKAWAY_FROM_JOB" in block, (
"Inlined respawn must catch OSError on the breakaway-denied "
"CreateProcess and retry without the breakaway bit, matching "
"gateway_windows._spawn_detached's fallback pattern."
)