test: migrate stale os.kill monkeypatches to gateway.status._pid_exists

PR #21561 migrated liveness probes across 14 call sites from
`os.kill(pid, 0)` to `gateway.status._pid_exists` (psutil-first) so
the gateway doesn't send Ctrl+C to itself on Windows, where signal 0
is `CTRL_C_EVENT` (bpo-14484). A handful of tests still patched the
old `os.kill` seam and either happened to pass on POSIX (when PID 12345
incidentally wasn't alive on the CI worker) or failed outright. On CI
runs they surfaced as 7 flaky/stable failures.
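
For orientation, a psutil-first liveness probe might look roughly like the sketch below. This is not the project's `_pid_exists`; the name `pid_exists_sketch`, the handling of non-positive PIDs, and the treatment of unexpected `OSError`s are assumptions made for illustration.

```python
import os

try:
    import psutil  # optional third-party dependency
except ImportError:  # fall back to the POSIX probe below
    psutil = None


def pid_exists_sketch(pid: int) -> bool:
    """Hypothetical stand-in for gateway.status._pid_exists."""
    if pid <= 0:
        # os.kill treats 0 and negative PIDs as process groups; never probe them.
        return False
    if psutil is not None:
        return psutil.pid_exists(pid)
    if os.name == "nt":
        # On Windows signal 0 is CTRL_C_EVENT, so os.kill(pid, 0) is not a safe probe.
        raise RuntimeError("psutil is required for liveness probes on Windows")
    try:
        os.kill(pid, 0)  # POSIX only: signal 0 checks existence without signalling
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    except OSError:
        return False  # ASSUMPTION: any other OSError is treated as "not alive"
    return True
```

Calling `pid_exists_sketch(os.getpid())` should report the current process as alive on either branch.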

Migrate each affected test to patch the correct seam:

- tests/tools/test_browser_orphan_reaper.py (5 tests)
    Patch `gateway.status._pid_exists` instead of `os.kill`.
    Rename test_permission_error_on_kill_check_skips to
    test_alive_legacy_daemon_is_reaped — the old assertion was
    "PermissionError on sig 0 → skip dir"; post-migration the
    untracked-alive-daemon path always reaps the dir after SIGTERM
    (best-effort semantics were preserved).

- tests/tools/test_windows_native_support.py (4 tests)
    Replace tests that asserted `os.kill` seam behavior with tests
    that exercise `ProcessRegistry._is_host_pid_alive` as a
    delegator and split out a new TestPidExistsOSErrorWidening class
    that hits `gateway.status._pid_exists` directly via the POSIX
    fallback branch (so Windows-style `OSError(WinError 87)` + `PermissionError`
    widening is still covered on Linux CI).

- tests/tools/test_process_registry.py (1 test)
    Mock `psutil.Process` + `_pid_exists` instead of `os.kill`
    for the detached-session kill path.

- tests/tools/test_mcp_stability.py::test_kill_orphaned_uses_sigkill_when_available
    SIGTERM → alive-check → SIGKILL flow now uses `_pid_exists`
    for the middle step; assertion count drops from 3 to 2.

- tests/gateway/test_status.py::TestScopedLocks (2 tests)
    `acquire_scoped_lock` consults `_pid_exists`; patch that
    seam directly instead of trying to control the nested psutil
    call via os.kill monkeypatch.

- tests/hermes_cli/test_gateway.py::test_stop_profile_gateway_keeps_pid_file_when_process_still_running
    The stop loop sends one SIGTERM via os.kill then polls 20x via
    _pid_exists; instrument both separately. Old assertion
    `calls["kill"] == 21` split into `kill == 1` + `alive_probes == 20`.

- tests/hermes_cli/test_auth_toctou_file_modes.py::test_shared_nous_store_writes_0o600_with_0o700_parent
    Commit c34884ea2 switched the pytest seat-belt guard in
    `_nous_shared_store_path()` from `Path.home() / ".hermes"`
    to `get_default_hermes_root()`, which honors HERMES_HOME. The
    test sets both HERMES_HOME and HERMES_SHARED_AUTH_DIR to paths
    rooted at the same tmp_path, so the override now resolves to the
    very path the guard refuses. Rename the override subdirectory so
    the two paths diverge: the guard passes and the test runs.

All 21 original CI failures and their local-flaky siblings now pass
(278 tests across the touched files, 0 failures).
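
Why a stale `os.kill` patch stops having any effect can be shown with a small self-contained sketch. The `status` namespace and `is_host_pid_alive` below are hypothetical stand-ins for `gateway.status` and a delegating caller, not the real modules.

```python
import os
import types
from unittest import mock


def _default_pid_exists(pid):
    # Pretend the PID is not alive on this host.
    return False


# Hypothetical stand-in for the gateway.status module.
status = types.SimpleNamespace(_pid_exists=_default_pid_exists)


def is_host_pid_alive(pid):
    # Delegator: looks up the seam at call time and never touches os.kill.
    return status._pid_exists(pid)


def demo():
    results = {}
    # Stale seam: patching os.kill changes nothing, the probe ignores it.
    with mock.patch.object(os, "kill", side_effect=PermissionError):
        results["stale_patch"] = is_host_pid_alive(12345)
    # Correct seam: the test now controls the liveness answer.
    with mock.patch.object(status, "_pid_exists", return_value=True):
        results["seam_patch"] = is_host_pid_alive(12345)
    return results
```

With the stale patch the result depends on whatever the real probe reports, which is how a test can pass by accident when the PID happens not to exist.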
Teknium 2026-05-08 14:18:41 -07:00
parent 291a158441
commit f5ee780124
7 changed files with 160 additions and 80 deletions

tests/hermes_cli/test_auth_toctou_file_modes.py

@@ -116,8 +116,12 @@ def test_shared_nous_store_writes_0o600_with_0o700_parent(tmp_path, monkeypatch)
     """The Nous shared-credential store must land at 0o600 / parent 0o700."""
     monkeypatch.setenv("HERMES_HOME", str(tmp_path))
     # _nous_shared_store_path() refuses to touch the real shared store during
-    # pytest runs; redirect it into tmp_path explicitly.
-    monkeypatch.setenv("HERMES_SHARED_AUTH_DIR", str(tmp_path / "shared"))
+    # pytest runs; redirect it into tmp_path explicitly. Use a distinct
+    # subdirectory name (``shared_override``) so the guard's "real user
+    # home" reference — which currently tracks HERMES_HOME via
+    # get_default_hermes_root() — can't collide with our override and
+    # falsely claim we're writing to the real user's shared store.
+    monkeypatch.setenv("HERMES_SHARED_AUTH_DIR", str(tmp_path / "shared_override"))
     old_umask = os.umask(0o022)
     try:
         from hermes_cli import auth as auth_mod
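
The collision this hunk works around can be illustrated with a rough sketch. Everything here is hypothetical: the function name, the `under_pytest` flag, and the assumption that the default store lives in a `shared` subdirectory of the Hermes root are inferred from the commit message, not taken from `hermes_cli.auth`.

```python
from pathlib import Path
from typing import Optional


def shared_store_path_sketch(
    hermes_root: Path, override: Optional[str], under_pytest: bool
) -> Path:
    """Hypothetical model of the pytest seat-belt guard."""
    default = hermes_root / "shared"  # ASSUMPTION: default store location
    path = Path(override) if override is not None else default
    if under_pytest and path == default:
        # The guard cannot tell a colliding override from the real store.
        raise RuntimeError("refusing to touch the real shared store under pytest")
    return path


def demo(tmp_root: Path):
    outcomes = {}
    try:
        shared_store_path_sketch(tmp_root, str(tmp_root / "shared"), True)
        outcomes["shared"] = "allowed"
    except RuntimeError:
        outcomes["shared"] = "refused"
    allowed = shared_store_path_sketch(tmp_root, str(tmp_root / "shared_override"), True)
    outcomes["shared_override"] = allowed.name
    return outcomes
```

Under these assumptions, once the root follows HERMES_HOME, an override named `shared` lands exactly on the guarded default, while `shared_override` does not.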

tests/hermes_cli/test_gateway.py

@@ -450,14 +450,21 @@ class TestWaitForGatewayExit:
 class TestStopProfileGateway:
     def test_stop_profile_gateway_keeps_pid_file_when_process_still_running(self, monkeypatch):
-        calls = {"kill": 0, "remove": 0}
+        calls = {"kill": 0, "alive_probes": 0, "remove": 0}
         monkeypatch.setattr("gateway.status.get_running_pid", lambda: 12345)
+        # Post-#21561: the stop loop sends one SIGTERM via ``os.kill`` then
+        # polls liveness via ``gateway.status._pid_exists`` (safe on
+        # Windows — bpo-14484). Instrument both seams separately.
         monkeypatch.setattr(
             gateway.os,
             "kill",
             lambda pid, sig: calls.__setitem__("kill", calls["kill"] + 1),
         )
+        monkeypatch.setattr(
+            "gateway.status._pid_exists",
+            lambda pid: calls.__setitem__("alive_probes", calls["alive_probes"] + 1) or True,
+        )
         monkeypatch.setattr("time.sleep", lambda _: None)
         monkeypatch.setattr(
             "gateway.status.remove_pid_file",
@@ -465,5 +472,6 @@ class TestStopProfileGateway:
         )
         assert gateway.stop_profile_gateway() is True
-        assert calls["kill"] == 21
+        assert calls["kill"] == 1  # one SIGTERM
+        assert calls["alive_probes"] == 20  # 20 liveness polls over the 2s window
         assert calls["remove"] == 0
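
The counts asserted above follow from a "signal once, then poll" loop. A minimal sketch of that shape, with injected callables so it can be instrumented without monkeypatching, is given below. The function name, the 0.1 s interval, the probe-then-sleep order, and the return value are assumptions; the real `stop_profile_gateway` may differ (in the test above it returns True even though the process never exits).

```python
import signal


def stop_and_poll_sketch(pid, send_signal, pid_exists, sleep, attempts=20, interval=0.1):
    """Hypothetical stop loop: one SIGTERM, then up to `attempts` liveness polls."""
    send_signal(pid, signal.SIGTERM)  # the only call through the kill seam
    for _ in range(attempts):
        if not pid_exists(pid):  # liveness goes through the probe seam
            return True  # process exited
        sleep(interval)
    return False  # still running after attempts * interval seconds


def demo_counts():
    calls = {"kill": 0, "alive_probes": 0}

    def fake_kill(pid, sig):
        calls["kill"] += 1

    def fake_alive(pid):
        calls["alive_probes"] += 1
        return True  # the process never exits

    exited = stop_and_poll_sketch(12345, fake_kill, fake_alive, lambda _: None)
    return exited, calls
```

Before the migration both steps went through `os.kill`, which is why the old test saw a single counter reach 21.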