mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-11 03:31:55 +00:00
fix(update): bypass systemd RestartSec after graceful drain (#22101)
After a clean SIGUSR1 drain, cmd_update passively polled for systemd's auto-restart to fire. Our unit file sets RestartSec=60 (a crash-loop guard), so the voluntary-restart path waited a full minute of dead air before the gateway came back; the user saw 'draining (up to 75s)...' and stared at it.

Change: after the drain exits with code 75, call 'reset-failed' + 'start' explicitly. A manual start bypasses RestartSec entirely (RestartSec only governs systemd's own auto-restart logic). The restart now takes about as long as the gateway needs to come up (~1-3s on a warm box) instead of ~60s.

The RestartSec=60 default stays: it is the right crash-loop guard for actual crashes. This change only short-circuits the voluntary-restart path.

Matches the pattern already used in 'hermes gateway restart' (systemd_restart() in hermes_cli/gateway.py, PR #20949).

Tests:
- tests/hermes_cli/test_update_gateway_restart.py: new test_update_bypasses_restartsec_after_graceful_drain asserts that both 'reset-failed hermes-gateway' AND 'start hermes-gateway' (NOT 'restart') are issued after a successful graceful drain.
- All existing tests in the affected classes still pass (TestCmdUpdateLaunchdRestart and TestCmdUpdateResetFailedBeforeRestart are green; one pre-existing flake in the latter is unrelated).
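The 'reset-failed' + 'start' sequence described above can be sketched in isolation. This is a minimal illustration, not the code in cmd_update: the helper name start_after_drain and the injectable run parameter are hypothetical, and the example assumes scope_cmd is something like ["systemctl", "--user"], which the commit does not spell out. The timeouts mirror the ones in the diff.

```python
import subprocess
from typing import Callable, List, Sequence


def start_after_drain(
    scope_cmd: Sequence[str],
    svc_name: str,
    run: Callable[..., object] = subprocess.run,
) -> List[List[str]]:
    """Issue 'reset-failed' and then 'start' for a unit; return the commands run.

    Hypothetical helper for illustration. 'start' is used instead of
    'restart' because the drained gateway has already exited.
    """
    issued = []
    # (verb, timeout in seconds); the timeouts match the diff (10s and 15s).
    for verb, timeout in (("reset-failed", 10), ("start", 15)):
        cmd = list(scope_cmd) + [verb, svc_name]
        # Results are not inspected here; a later poll decides whether the
        # unit actually came up.
        run(cmd, capture_output=True, text=True, timeout=timeout)
        issued.append(cmd)
    return issued
```

Passing a recording stub as run makes it possible to assert on the issued commands without touching systemd, which is roughly what the new test in the commit message checks.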
parent 5089596685
commit d971b26bfd
2 changed files with 121 additions and 8 deletions
@@ -7756,14 +7756,56 @@ def _cmd_update_impl(args, gateway_mode: bool):
 )
 
 if _graceful_ok:
-    # Gateway exited 75; systemd should relaunch
-    # via Restart=on-failure. The unit's
-    # RestartSec (default 30s on ours) gates the
-    # respawn — poll past that + slack so we
-    # don't give up mid-cooldown and falsely
-    # print "drained but didn't relaunch". For
-    # units without RestartSec set we fall back
-    # to the original 10s budget.
+    # Gateway exited 75. ``Restart=always`` +
+    # ``RestartForceExitStatus=75`` means systemd
+    # WILL respawn the unit — but only after
+    # ``RestartSec`` (default 60s on our unit
+    # file). That 60s wait is a crash-loop guard,
+    # and is the right default when the gateway
+    # dies unexpectedly. For a voluntary restart
+    # on update, it's dead time the user watches.
+    #
+    # Shortcut it: ``reset-failed`` + ``start``
+    # skips RestartSec entirely (we're manually
+    # initiating the unit, not waiting for
+    # systemd's auto-restart logic). Takes about
+    # as long as the process takes to come up
+    # (~1-3s on a warm box).
+    #
+    # If the unit is already active because
+    # RestartSec elapsed while we were draining,
+    # ``start`` is a no-op and we fall through to
+    # the poll below. Either way we collapse the
+    # 60s+ delay to a ~5s one.
+    subprocess.run(
+        scope_cmd + ["reset-failed", svc_name],
+        capture_output=True,
+        text=True,
+        timeout=10,
+    )
+    subprocess.run(
+        scope_cmd + ["start", svc_name],
+        capture_output=True,
+        text=True,
+        timeout=15,
+    )
+    # Short poll: the gateway should be up within
+    # a few seconds now that we bypassed
+    # RestartSec. Fall back to the longer
+    # RestartSec + slack budget ONLY if the
+    # explicit start failed and we need to rely
+    # on systemd's auto-restart.
+    if _wait_for_service_active(
+        scope_cmd,
+        svc_name,
+        timeout=10.0,
+    ):
+        restarted_services.append(svc_name)
+        continue
+    # Explicit start didn't take. Fall back to
+    # the original passive poll (systemd's
+    # auto-restart WILL fire after RestartSec
+    # regardless).
     _restart_sec = _service_restart_sec(
         scope_cmd,
         svc_name,
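The two-stage wait in the hunk above (a short poll after the explicit start, then a longer RestartSec + slack budget) can be sketched as follows. This is an illustration under stated assumptions, not the bodies of _wait_for_service_active or _service_restart_sec, which this hunk does not show. The names is_active, wait_for_active and wait_with_fallback are hypothetical, the 10s slack and 0.5s poll interval are assumed values, and the check relies on `systemctl is-active` printing "active" for a running unit.

```python
import subprocess
import time
from typing import Callable, Sequence


def is_active(scope_cmd: Sequence[str], svc_name: str,
              run: Callable[..., object] = subprocess.run) -> bool:
    """Return True if 'systemctl is-active' reports the unit as active."""
    result = run(list(scope_cmd) + ["is-active", svc_name],
                 capture_output=True, text=True, timeout=5)
    return result.stdout.strip() == "active"


def wait_for_active(check: Callable[[], bool], timeout: float,
                    interval: float = 0.5,
                    clock: Callable[[], float] = time.monotonic,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() until it returns True or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def wait_with_fallback(check: Callable[[], bool], restart_sec: float,
                       short_timeout: float = 10.0, slack: float = 10.0,
                       **poll_kwargs) -> str:
    """Short poll first; only then wait out RestartSec plus slack.

    The slack default is an assumption made for this sketch.
    """
    if wait_for_active(check, short_timeout, **poll_kwargs):
        return "explicit-start"   # the manual start took effect quickly
    if wait_for_active(check, restart_sec + slack, **poll_kwargs):
        return "auto-restart"     # systemd's own restart fired later
    return "not-restarted"
```

Injecting clock and sleep keeps the sketch testable without real waiting; in production use the defaults would apply.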