``hermes cron list`` falsely reports "Gateway is not running" on macOS
even when launchd has the service loaded and cron jobs fire correctly.
Two independent bugs in ``find_gateway_pids()`` each drop macOS into
an empty-result path; fixing either alone would clear the warning, but
both are real and both are worth fixing.
### Bug 1 — ``_get_service_pids()`` mis-parses ``launchctl list <label>``
``launchctl list`` returns two different formats depending on whether
a label is passed:
* No label → tab-separated table ``PID\tStatus\tLabel``
* With label → plist-dict dump, e.g. ``"PID" = 855;``
The old code always called ``launchctl list <label>`` (the plist-dict
path) but parsed with ``string.split()`` expecting the tab-separated
format. ``parts[2]`` on ``"Label" = "ai.hermes.gateway";`` is
``'"ai.hermes.gateway";'`` — quoted, semicolon'd — so the ``== label``
comparison never matched and no PID was ever extracted.
Fix: extracted a ``_parse_launchd_list_output(stdout, label)`` helper
that tries the plist-dict ``"PID" = N;`` regex first and falls back to
the tab-separated path when no plist matches were found. Handles both
formats so a future change to the caller (passing or not passing the
label) can't re-break detection. Regex anchored to the ``"PID"`` key
so sibling fields like ``LastExitStatus`` can't match; PID 0 rejected
so downstream ``os.kill(0, ...)`` never sees it.
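
The two-format parsing described above can be sketched roughly as follows. This is an illustration only, not the project's helper: the function name (no leading underscore), the exact regex, and the ``set[int]`` return type are assumptions made for the example.

```python
from __future__ import annotations

import re

# Anchored to the quoted "PID" key at the start of a line, so sibling
# fields such as "LastExitStatus" cannot match (regex is an assumption).
_PLIST_PID_RE = re.compile(r'^\s*"PID"\s*=\s*(\d+)\s*;', re.MULTILINE)


def parse_launchd_list_output(stdout: str, label: str) -> set[int]:
    """Extract PIDs from either `launchctl list` output format."""
    # 1. Plist-dict dump, as produced by `launchctl list <label>`.
    pids = {int(match) for match in _PLIST_PID_RE.findall(stdout)}

    # 2. Fall back to the tab-separated table only when no plist match was found.
    if not pids:
        for line in stdout.splitlines():
            parts = line.split()
            # Unloaded services show "-" in the PID column, which isdigit() rejects.
            if len(parts) >= 3 and parts[2] == label and parts[0].isdigit():
                pids.add(int(parts[0]))

    # PID 0 must never reach os.kill(), where it means "my own process group".
    pids.discard(0)
    return pids
```

Trying the plist form first keeps the fallback from running on output that already yielded a PID.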
### Bug 2 — ``_scan_gateway_pids()`` passes ``eww`` to ``ps``
The old invocation ``ps -A eww -o pid=,command=`` has two problems:
* **Darwin** rejects ``eww`` as "illegal argument" and exits 1 —
``stdout`` is empty, the parse loop iterates nothing, and no PID is
extracted (#15225).
* **FreeBSD** accepts ``eww`` but the ``e`` attaches environment
variables to the command column, so ``split(None, 1)`` picks up the
first env var as the command (#9069). The env vars can include API
keys and tokens, which also leaked them into any log line that
echoed the command.
Fix: replaced with ``ps -A -ww -o pid=,command=`` — portable across
Linux (procps), Darwin, FreeBSD, and busybox. Drops env-var leakage
as a side benefit.
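
A rough sketch of the portable scan, assuming the parse loop splits each line into a PID and a command string. The function names and the substring match on ``needle`` are hypothetical stand-ins for whatever matching the real ``_scan_gateway_pids()`` performs; only the ``ps`` argv comes from this change.

```python
from __future__ import annotations

import subprocess

# The portable invocation from this change: no "eww", so no env vars in the output.
PS_ARGV = ["ps", "-A", "-ww", "-o", "pid=,command="]


def parse_ps_output(stdout: str) -> list[tuple[int, str]]:
    """Split each `pid command...` line into (pid, command)."""
    rows = []
    for line in stdout.splitlines():
        parts = line.strip().split(None, 1)  # PID, then the rest of the line
        if len(parts) == 2 and parts[0].isdigit():
            rows.append((int(parts[0]), parts[1]))
    return rows


def scan_pids(needle: str) -> set[int]:
    """Return PIDs whose command line contains `needle` (hypothetical matcher)."""
    try:
        result = subprocess.run(PS_ARGV, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return set()
    if result.returncode != 0:
        # An empty result lets the caller fall back to other detection paths.
        return set()
    return {pid for pid, command in parse_ps_output(result.stdout) if needle in command}
```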
### Tests (19 new + 1 updated, all passing)
``tests/hermes_cli/test_gateway_pid_detection_macos.py``:
* **``TestParseLaunchdListOutput``** (8 cases) — exercises the new
helper directly with the exact plist-dict output from the #15225
repro, the tab-separated format, the dash-PID unloaded case,
mixed-shape defensive input, PID=0 rejection, and 4
whitespace-variant tolerance cases for the PID regex (``"PID" =``,
``"PID"=``, ``"PID" = ``, leading tabs).
* **``TestGetServicePidsMacOS``** (3 cases) — end-to-end macOS branch
with ``subprocess.run`` patched to return the plist-dict payload;
asserts the single-PID set is returned, a non-zero launchctl exit
returns empty, and a missing ``launchctl`` binary (FileNotFoundError)
falls through cleanly.
* **``TestPsInvocationPortability``** (4 cases) — captures the exact
argv the production code passes to ``ps``, asserts ``"eww"`` is
never on the command line, pins the ``["ps", "-A", "-ww", "-o",
"pid=,command="]`` shape, parses a realistic Darwin-style ps sample
end-to-end, and exercises the nonzero-returncode fallback.
Also updated ``test_find_gateway_pids_falls_back_to_pid_file_when_process_scan_fails``
in ``tests/hermes_cli/test_gateway.py`` to match the new portable
``-ww`` invocation; added an inline comment pointing at #15225 history.
**Verified tests are real regression guards**: temporarily reverted
the fixes for Bug 1 and Bug 2 independently; in each case the relevant
test classes failed with clear messages pointing at the regressed invariant.
Closes #15225
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- add all_profiles=False to find_gateway_pids() and
kill_gateway_processes() so hermes update and gateway stop --all
can still discover processes across all profiles
- narrow bare 'except Exception' to (OSError, subprocess.TimeoutExpired)
- update test mocks to match new signatures
All 35 subprocess.run() calls in hermes_cli/gateway.py lacked timeout
parameters. If systemctl, launchctl, loginctl, wmic, or ps blocks,
hermes gateway start/stop/restart/status/install/uninstall hangs
indefinitely with no feedback.
Timeouts tiered by operation type:
- 10s: instant queries (is-active, status, list, ps, tail, journalctl)
- 30s: fast lifecycle (daemon-reload, enable, start, bootstrap, kickstart)
- 90s: graceful shutdown (stop, restart, bootout, kickstart -k) — exceeds
our TimeoutStopSec=60 to avoid premature timeout during shutdown
Special handling: _is_service_running() and launchd_status() catch
TimeoutExpired and treat it as not-running/not-loaded, consistent with
how non-zero return codes are already handled.
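
The tiering and the timeout handling above might look roughly like this. The constant names and the ``is_active()`` helper are hypothetical; only the 10s/30s/90s values and the rule that a timeout counts as not-running come from this change.

```python
from __future__ import annotations

import subprocess
import sys

# Tiers from the description above (constant names are assumptions).
QUERY_TIMEOUT = 10      # instant queries: is-active, status, list, ps
LIFECYCLE_TIMEOUT = 30  # fast lifecycle: daemon-reload, enable, start
SHUTDOWN_TIMEOUT = 90   # graceful shutdown; exceeds TimeoutStopSec=60


def is_active(argv: list[str], timeout: float = QUERY_TIMEOUT) -> bool:
    """Run a status query; a hung command is reported as not running."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Same outcome as a non-zero return code.
        return False
    return result.returncode == 0


# Usage: a child that sleeps past the deadline is treated as not running.
hung = [sys.executable, "-c", "import time; time.sleep(5)"]
print(is_active(hung, timeout=0.5))  # False
```

``subprocess.run`` kills the child when the timeout expires, so the hung command does not linger after the check returns.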
Inspired by PR #3732 (dlkakbs) and issue #4057 (SHL0MS).
Reimplemented on current main which has significantly changed launchctl
handling (bootout/bootstrap/kickstart vs legacy load/unload/start/stop).
Add _wait_for_gateway_exit() that polls get_running_pid() to confirm
the old gateway process has actually exited before starting a new one.
If the process has not exited within 5s, it sends SIGKILL to that
specific PID. It uses the saved PID from gateway.pid (not launchd
labels), so it works correctly with multiple gateway instances under
separate HERMES_HOME directories.
Applied to both launchd_restart() and the manual restart path (replaces
the blind time.sleep(2)).
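
A minimal sketch of the poll-then-kill pattern described above. The signature is an assumption: here get_running_pid is passed in as a callable returning the PID or None, and the kill parameter exists only so the example can be exercised without a real process. The poll interval is also made up; only the 5s limit and the SIGKILL escalation come from this change.

```python
from __future__ import annotations

import os
import signal
import time
from typing import Callable, Optional


def wait_for_exit(
    get_running_pid: Callable[[], Optional[int]],
    timeout: float = 5.0,
    poll_interval: float = 0.1,  # assumed value
    kill: Callable[[int, int], None] = os.kill,
) -> bool:
    """Return True on a clean exit, False if SIGKILL had to be sent."""
    deadline = time.monotonic() + timeout
    pid = get_running_pid()
    while pid is not None and time.monotonic() < deadline:
        time.sleep(poll_interval)
        pid = get_running_pid()

    if pid is None:
        return True

    try:
        # Target the saved PID only, never a process group or a service label.
        kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the process exited between the last poll and the kill
    return False
```

Unlike a fixed sleep, the loop returns as soon as the old process is gone and never waits longer than the timeout.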
Inspired by PR #1881 by @AzothZephyr (race condition diagnosis).
Adds 4 tests.
Multiple Hermes installations on the same machine now get unique
systemd service names:
- Default ~/.hermes → hermes-gateway (backward compatible)
- Custom HERMES_HOME → hermes-gateway-<8-char-hash>
Changes:
- Add get_service_name() in hermes_cli/gateway.py that derives a
deterministic service name from HERMES_HOME via SHA256
- Replace all hardcoded 'hermes-gateway' systemd references with
get_service_name() across gateway.py, main.py, status.py, uninstall.py
- Add HERMES_HOME env var to both user and system systemd unit templates
so the gateway process uses the correct installation
- Update tests to use get_service_name() in assertions
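
One way the naming scheme could be derived, as a sketch. What exactly gets hashed (here the resolved path string), the use of the first 8 hex characters, and the explicit parameters are assumptions; the real get_service_name() may differ in all three.

```python
import hashlib
from pathlib import Path


def service_name(hermes_home: Path, default_home: Path) -> str:
    """Derive a deterministic systemd service name for an installation."""
    home = hermes_home.resolve()
    if home == default_home.resolve():
        # The default install keeps the historical name for backward compatibility.
        return "hermes-gateway"
    digest = hashlib.sha256(str(home).encode("utf-8")).hexdigest()
    return f"hermes-gateway-{digest[:8]}"
```

Because the suffix depends only on the path, repeated calls for the same installation always yield the same unit name.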
Fixes #1005
Without linger, user-level systemd services stop when the SSH session
ends — even though systemctl --user status shows active (running).
Changes to systemd_install():
- Try loginctl enable-linger automatically (succeeds when the process
has the required privileges)
- If loginctl fails (no privileges), print a clear, copy-pasteable
warning with the exact command the user must run
New helper: _ensure_linger_enabled()
- Fast path: checks /var/lib/systemd/linger/<user> (no subprocess)
- Auto-enable: loginctl enable-linger <user>
- Fallback: actionable warning with sudo command + restart instructions
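
The three steps above can be sketched as follows. The signature, the boolean return value, the 30s timeout, and the warning wording are assumptions for illustration; the linger marker path and the loginctl command are the ones named in this change.

```python
import subprocess
from pathlib import Path


def ensure_linger_enabled(
    user: str,
    linger_dir: Path = Path("/var/lib/systemd/linger"),
    run=subprocess.run,  # injectable so the example can be tested without systemd
) -> bool:
    """Return True if linger is (now) enabled for `user`."""
    # Fast path: systemd keeps a marker file per lingering user; no subprocess needed.
    if (linger_dir / user).exists():
        return True

    # Auto-enable: succeeds only when the process has the required privileges.
    try:
        result = run(
            ["loginctl", "enable-linger", user],
            capture_output=True, text=True, timeout=30,
        )
        if result.returncode == 0:
            return True
    except (OSError, subprocess.TimeoutExpired):
        pass  # loginctl missing or hung: fall through to the warning

    # Fallback: tell the user exactly what to run (wording is illustrative).
    print(f"Linger is not enabled for {user}; the gateway will stop at logout.")
    print(f"Run: sudo loginctl enable-linger {user}")
    return False
```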
Tests: 4 new tests in TestEnsureLingerEnabled, 205 passed total
* fix(gateway): surface missing linger in status and doctor
Warn when a systemd user gateway service has linger disabled so users can
spot the common 'gateway sleeps after logout' deployment issue from both
hermes doctor and hermes gateway status.
* fix(gateway): check linger status after install
After installing the systemd user service, report whether linger is
already enabled instead of always printing the generic hint. This makes
post-install guidance match the user's actual deployment state.