fix(gateway): detect legacy hermes.service + mark --replace SIGTERM as planned (#11909)

* fix(gateway): detect legacy hermes.service units from pre-rename installs

Older Hermes installs used a different service name (hermes.service) before
the rename to hermes-gateway.service. When both units remain installed, they
fight over the same bot token — after PR #5646's signal-recovery change,
this manifests as a 30-second SIGTERM flap loop between the two services.

Detection is an explicit allowlist (no globbing) plus an ExecStart content
check, so profile units (hermes-gateway-<profile>.service) and unrelated
third-party services named 'hermes' are never matched.

Wired into systemd_install, systemd_status, gateway_setup wizard, and the
main hermes setup flow — anywhere we already warn about scope conflicts now
also warns about legacy units.

* feat(gateway): add migrate-legacy command + install-time removal prompt

- New hermes_cli.gateway.remove_legacy_hermes_units() removes legacy
  unit files with stop → disable → unlink → daemon-reload. Handles user
  and system scopes separately; system scope returns path list when not
  running as root so the caller can tell the user to re-run with sudo.
- New 'hermes gateway migrate-legacy' subcommand (with --dry-run and -y)
  routes to remove_legacy_hermes_units via gateway_command dispatch.
- systemd_install now offers to remove legacy units BEFORE installing
  the new hermes-gateway.service, preventing the SIGTERM flap loop that
  hits users who still have pre-rename hermes.service around.

Profile units (hermes-gateway-<profile>.service) remain untouched in
all paths — the legacy allowlist is explicit (_LEGACY_SERVICE_NAMES)
and the ExecStart content check further narrows matches.

* fix(gateway): mark --replace SIGTERM as planned so target exits 0

PR #5646 made SIGTERM exit the gateway with code 1 so systemd's
Restart=on-failure revives it after unexpected kills. But when a user has
two gateway units fighting for the same bot token (e.g. legacy
hermes.service + hermes-gateway.service from a pre-rename install), the
--replace takeover itself becomes the 'unexpected' SIGTERM — the loser
exits 1, systemd revives it 30s later, and the cycle flaps indefinitely.

Before calling terminate_pid(), --replace now writes a short-lived marker
file naming the target PID + start_time. The target's shutdown_signal_handler
consumes the marker and, when it names this process, leaves
_signal_initiated_shutdown=False so the final exit code stays 0.

Staleness defences:
- PID + start_time combo prevents PID reuse matching an old marker
- Marker older than 60s is treated as stale and discarded
- Marker is unlinked on first read even if it doesn't match this process
- Replacer clears the marker post-loop + on permission-denied give-up
This commit is contained in:
Teknium 2026-04-17 19:27:58 -07:00 committed by GitHub
parent 38436eb4e3
commit 07db20c72d
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
8 changed files with 1273 additions and 2 deletions

View file

@ -2242,8 +2242,10 @@ def setup_gateway(config: dict):
_is_service_running,
supports_systemd_services,
has_conflicting_systemd_units,
has_legacy_hermes_units,
install_linux_gateway_from_setup,
print_systemd_scope_conflict_warning,
print_legacy_unit_warning,
systemd_start,
systemd_restart,
launchd_install,
@ -2261,6 +2263,10 @@ def setup_gateway(config: dict):
print_systemd_scope_conflict_warning()
print()
if supports_systemd and has_legacy_hermes_units():
print_legacy_unit_warning()
print()
if service_running:
if prompt_yes_no(" Restart the gateway to pick up changes?", True):
try: