Commit graph

17 commits

Author SHA1 Message Date
Teknium
f5ee780124 test: migrate stale os.kill monkeypatches to gateway.status._pid_exists
PR #21561 migrated liveness probes across 14 call sites from
`os.kill(pid, 0)` to `gateway.status._pid_exists` (psutil-first) so
the gateway doesn't Ctrl+C-itself on Windows via bpo-14484. A handful of
tests still patched the old `os.kill` seam and either happened to pass
on POSIX (when PID 12345 incidentally wasn't alive on the CI worker) or
failed outright — on CI runs they surfaced as 7 flaky/stable failures.

Migrate each affected test to patch the correct seam:

- tests/tools/test_browser_orphan_reaper.py (5 tests)
    Patch `gateway.status._pid_exists` instead of `os.kill`.
    Rename test_permission_error_on_kill_check_skips to
    test_alive_legacy_daemon_is_reaped — the old assertion was
    "PermissionError on sig 0 → skip dir"; post-migration the
    untracked-alive-daemon path always reaps the dir after SIGTERM
    (best-effort semantics were preserved).

- tests/tools/test_windows_native_support.py (4 tests)
    Replace tests that asserted `os.kill` seam behavior with tests
    that exercise `ProcessRegistry._is_host_pid_alive` as a
    delegator and split out a new TestPidExistsOSErrorWidening class
    that hits `gateway.status._pid_exists` directly via the POSIX
    fallback branch (so Windows-style `OSError(WinError 87)` + `PermissionError`
    widening is still covered on Linux CI).

- tests/tools/test_process_registry.py (1 test)
    Mock `psutil.Process` + `_pid_exists` instead of `os.kill`
    for the detached-session kill path.

- tests/tools/test_mcp_stability.py::test_kill_orphaned_uses_sigkill_when_available
    SIGTERM → alive-check → SIGKILL flow now uses `_pid_exists`
    for the middle step; assertion count drops from 3 to 2.

- tests/gateway/test_status.py::TestScopedLocks (2 tests)
    `acquire_scoped_lock` consults `_pid_exists`; patch that
    seam directly instead of trying to control the nested psutil
    call via os.kill monkeypatch.

- tests/hermes_cli/test_gateway.py::test_stop_profile_gateway_keeps_pid_file_when_process_still_running
    The stop loop sends one SIGTERM via os.kill then polls 20x via
    _pid_exists; instrument both separately. Old assertion
    `calls["kill"] == 21` split into `kill == 1` + `alive_probes == 20`.

- tests/hermes_cli/test_auth_toctou_file_modes.py::test_shared_nous_store_writes_0o600_with_0o700_parent
    Commit c34884ea2 switched the pytest seat-belt guard in
    `_nous_shared_store_path()` from `Path.home() / ".hermes"`
    to `get_default_hermes_root()`, which honors HERMES_HOME. The
    test sets both HERMES_HOME and HERMES_SHARED_AUTH_DIR to
    subpaths of the same tmp_path, and the override now collapses
    onto the same path the guard is refusing. Renamed the override
    subdirectory so the two paths diverge — guard passes, test runs.

All 21 original CI failures and their local-flaky siblings now pass
(278 tests across the touched files, 0 failures).
2026-05-08 14:27:40 -07:00
LeonSGP43
84287b0de8 fix(docker): refuse root gateway runs in official image 2026-05-07 05:59:25 -07:00
Teknium
06031229e8
fix(tests): tolerate ps ancestor-walk in find_gateway_pids fallback test (#19590)
Follow-up to #19586 (@cixuuz salvage): _get_ancestor_pids walks ps -o ppid=
up the process tree, which the pre-existing mock in
test_find_gateway_pids_falls_back_to_pid_file_when_process_scan_fails didn't
expect. Return empty stdout so the ancestor loop terminates cleanly and the
original fallback assertion still passes.
2026-05-04 01:40:39 -07:00
Rylen Anil
3858f9419e fix: handle gateway Ctrl+C shutdown cleanly 2026-04-30 03:29:57 -07:00
helix4u
b52123eb15 fix(gateway): recover stale pid and planned restart state 2026-04-22 16:33:46 -07:00
helix4u
47010e0757 fix(gateway): allow systemd-backed distrobox services 2026-04-17 19:24:30 -07:00
Sara Reynolds
8ab1aa2efc fix(gateway): fix discrepancies in gateway status 2026-04-17 18:58:29 -07:00
Teknium
d82580b25b fix: add all_profiles param + narrow exception handling
- add all_profiles=False to find_gateway_pids() and
  kill_gateway_processes() so hermes update and gateway stop --all
  can still discover processes across all profiles
- narrow bare 'except Exception' to (OSError, subprocess.TimeoutExpired)
- update test mocks to match new signatures
2026-04-11 14:44:29 -07:00
H-5-Isminiz
00dd5cc491 fix(gateway): implement platform-aware PID termination 2026-04-10 03:52:00 -07:00
adybag14-cyber
3878495972 fix(termux): disable gateway service flows on android 2026-04-09 16:24:53 -07:00
Teknium
786970925e
fix(cli): add missing subprocess.run() timeouts in gateway CLI (#5424)
All 35 subprocess.run() calls in hermes_cli/gateway.py lacked timeout
parameters. If systemctl, launchctl, loginctl, wmic, or ps blocks,
hermes gateway start/stop/restart/status/install/uninstall hangs
indefinitely with no feedback.

Timeouts tiered by operation type:
- 10s: instant queries (is-active, status, list, ps, tail, journalctl)
- 30s: fast lifecycle (daemon-reload, enable, start, bootstrap, kickstart)
- 90s: graceful shutdown (stop, restart, bootout, kickstart -k) — exceeds
  our TimeoutStopSec=60 to avoid premature timeout during shutdown

Special handling: _is_service_running() and launchd_status() catch
TimeoutExpired and treat it as not-running/not-loaded, consistent with
how non-zero return codes are already handled.

Inspired by PR #3732 (dlkakbs) and issue #4057 (SHL0MS).
Reimplemented on current main which has significantly changed launchctl
handling (bootout/bootstrap/kickstart vs legacy load/unload/start/stop).
2026-04-05 22:41:42 -07:00
Test
ace2cc6257 fix(gateway): PID-based wait with force-kill for gateway restart
Add _wait_for_gateway_exit() that polls get_running_pid() to confirm
the old gateway process has actually exited before starting a new one.
If the process doesn't exit within 5s, sends SIGKILL to the specific
PID. Uses the saved PID from gateway.pid (not launchd labels) so it
works correctly with multiple gateway instances under separate
HERMES_HOME directories.

Applied to both launchd_restart() and the manual restart path (replaces
the blind time.sleep(2)).

Inspired by PR #1881 by @AzothZephyr (race condition diagnosis).
Adds 4 tests.
2026-03-18 02:54:18 -07:00
teknium1
30da22e1c1 feat(gateway): scope systemd service name to HERMES_HOME
Multiple Hermes installations on the same machine now get unique
systemd service names:
- Default ~/.hermes → hermes-gateway (backward compatible)
- Custom HERMES_HOME → hermes-gateway-<8-char-hash>

Changes:
- Add get_service_name() in hermes_cli/gateway.py that derives a
  deterministic service name from HERMES_HOME via SHA256
- Replace all hardcoded 'hermes-gateway' systemd references with
  get_service_name() across gateway.py, main.py, status.py, uninstall.py
- Add HERMES_HOME env var to both user and system systemd unit templates
  so the gateway process uses the correct installation
- Update tests to use get_service_name() in assertions
2026-03-16 04:42:46 -07:00
Teknium
168a8e2e35
feat: add gateway install scope prompts (#1374) 2026-03-14 21:06:52 -07:00
Teknium
6c24d76533
feat: add system gateway service mode (#1371) 2026-03-14 20:54:51 -07:00
teyrebaz33
f10e26f731 fix: auto-enable systemd linger during gateway install on headless servers
Fixes #1005

Without linger, user-level systemd services stop when the SSH session
ends — even though systemctl --user status shows active (running).

Changes to systemd_install():
- Try loginctl enable-linger automatically (succeeds when the process
  has the required privileges)
- If loginctl fails (no privileges), print a clear, copy-pasteable
  warning with the exact command the user must run

New helper: _ensure_linger_enabled()
- Fast path: checks /var/lib/systemd/linger/<user> (no subprocess)
- Auto-enable: loginctl enable-linger <user>
- Fallback: actionable warning with sudo command + restart instructions

Tests: 4 new tests in TestEnsureLingerEnabled, 205 passed total
2026-03-14 11:46:59 -07:00
Teknium
fb3c163612
fix(gateway): surface missing linger in status and doctor (#1296)
* fix(gateway): surface missing linger in status and doctor

Warn when a systemd user gateway service has linger disabled so users can
spot the common 'gateway sleeps after logout' deployment issue from both
hermes doctor and hermes gateway status.

* fix(gateway): check linger status after install

After installing the systemd user service, report whether linger is
already enabled instead of always printing the generic hint. This makes
post-install guidance match the user's actual deployment state.
2026-03-14 06:11:33 -07:00