test: migrate stale os.kill monkeypatches to gateway.status._pid_exists

PR #21561 migrated liveness probes across 14 call sites from
`os.kill(pid, 0)` to `gateway.status._pid_exists` (psutil-first) so
the gateway doesn't Ctrl+C-itself on Windows via bpo-14484. A handful of
tests still patched the old `os.kill` seam and either happened to pass
on POSIX (when PID 12345 incidentally wasn't alive on the CI worker) or
failed outright — on CI runs they surfaced as 7 flaky/stable failures.

Migrate each affected test to patch the correct seam:

- tests/tools/test_browser_orphan_reaper.py (5 tests)
    Patch `gateway.status._pid_exists` instead of `os.kill`.
    Rename test_permission_error_on_kill_check_skips to
    test_alive_legacy_daemon_is_reaped — the old assertion was
    "PermissionError on sig 0 → skip dir"; post-migration the
    untracked-alive-daemon path always reaps the dir after SIGTERM
    (best-effort semantics were preserved).

- tests/tools/test_windows_native_support.py (4 tests)
    Replace tests that asserted `os.kill` seam behavior with tests
    that exercise `ProcessRegistry._is_host_pid_alive` as a
    delegator and split out a new TestPidExistsOSErrorWidening class
    that hits `gateway.status._pid_exists` directly via the POSIX
    fallback branch (so Windows-style `OSError(WinError 87)` + `PermissionError`
    widening is still covered on Linux CI).

- tests/tools/test_process_registry.py (1 test)
    Mock `psutil.Process` + `_pid_exists` instead of `os.kill`
    for the detached-session kill path.

- tests/tools/test_mcp_stability.py::test_kill_orphaned_uses_sigkill_when_available
    SIGTERM → alive-check → SIGKILL flow now uses `_pid_exists`
    for the middle step; assertion count drops from 3 to 2.

- tests/gateway/test_status.py::TestScopedLocks (2 tests)
    `acquire_scoped_lock` consults `_pid_exists`; patch that
    seam directly instead of trying to control the nested psutil
    call via os.kill monkeypatch.

- tests/hermes_cli/test_gateway.py::test_stop_profile_gateway_keeps_pid_file_when_process_still_running
    The stop loop sends one SIGTERM via os.kill then polls 20x via
    _pid_exists; instrument both separately. Old assertion
    `calls["kill"] == 21` split into `kill == 1` + `alive_probes == 20`.

- tests/hermes_cli/test_auth_toctou_file_modes.py::test_shared_nous_store_writes_0o600_with_0o700_parent
    Commit c34884ea2 switched the pytest seat-belt guard in
    `_nous_shared_store_path()` from `Path.home() / ".hermes"`
    to `get_default_hermes_root()`, which honors HERMES_HOME. The
    test sets both HERMES_HOME and HERMES_SHARED_AUTH_DIR to
    subpaths of the same tmp_path, and the override now collapses
    onto the same path the guard is refusing. Renamed the override
    subdirectory so the two paths diverge — guard passes, test runs.

All 21 original CI failures and their local-flaky siblings now pass
(278 tests across the touched files, 0 failures).
This commit is contained in:
Teknium 2026-05-08 14:18:41 -07:00
parent 291a158441
commit f5ee780124
7 changed files with 160 additions and 80 deletions

View file

@ -728,18 +728,30 @@ class TestKillProcess:
s.detached = True
registry._running[s.id] = s
calls = []
terminate_calls = []
def fake_kill(pid, sig):
calls.append((pid, sig))
class FakeProcess:
def __init__(self, pid):
self.pid = pid
def children(self, recursive=False):
return []
def terminate(self):
terminate_calls.append(("terminate", self.pid))
import psutil as _psutil
try:
with patch("tools.process_registry.os.kill", side_effect=fake_kill):
# Post-#21561: liveness probe routes through
# ``ProcessRegistry._is_host_pid_alive`` (→
# ``gateway.status._pid_exists``), and the actual kill on POSIX
# routes through ``psutil.Process(pid).terminate()``. Neither
# touches ``os.kill`` directly. Mock both seams.
with patch("gateway.status._pid_exists", return_value=True), \
patch.object(_psutil, "Process", side_effect=lambda pid: FakeProcess(pid)):
result = registry.kill_process(s.id)
assert result["status"] == "killed"
assert (424242, 0) in calls
assert (424242, signal.SIGTERM) in calls
assert ("terminate", 424242) in terminate_calls
finally:
registry._running.pop(s.id, None)