mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-10 08:32:09 +00:00
* fix(gateway): auto-start after container restart via planned-stop marker
On Docker (s6-overlay), the gateway runs as a dynamically-registered s6
service. When the container stops/restarts/upgrades, s6 sends the gateway
a plain SIGTERM. The shutdown path (_stop_impl) ended with an
unconditional _update_runtime_status("stopped"), persisting
gateway_state=stopped to the volume. container_boot.py reads that on the
next boot and only auto-starts gateways whose last state was "running"
(_AUTOSTART_STATES) — so after a routine `docker compose up
--force-recreate` the gateway stays down and messaging channels silently
go dark, with no error surfaced (issue #42675).
The codebase already distinguishes intentional stops from unexpected
signals via the planned-stop marker (write_planned_stop_marker /
consume_planned_stop_marker_for_self): `hermes gateway stop`,
systemd/launchd ExecStop, and Ctrl+C write a marker before signalling,
so the handler classifies them as planned. An unmarked SIGTERM
(container/s6 restart, OOM, bare kill) is signal-initiated.
This wires that existing classification through to the state persist,
rather than adding unreliable signal-source inference:
- run.py: GatewayRunner._signal_initiated_shutdown, set in
shutdown_signal_handler's unmarked-signal branch. In _stop_impl, a
signal-initiated (non-restart) teardown now persists "running" instead
of "stopped" — preserving the operator's run-intent and overwriting the
mid-shutdown "draining" marker so _AUTOSTART_STATES matches on reboot.
Operator stops and restarts persist "stopped" as before.
- service_manager.py: S6ServiceManager.stop() now writes the planned-stop
marker for the supervised PID (read from s6-svstat) before `s6-svc -d`,
so an in-container `hermes gateway stop` is correctly classified as
intentional (parity with the systemd/launchd/host stop paths, which
already mark). Best-effort: a marker-write failure falls back to the
safe signal-initiated path.
Tests: shutdown persist-decision table (signal→running, operator→stopped,
restart→stopped), s6 stop marker write + svstat PID parse + failure
tolerance. The signal→running and s6-marker tests fail without the
respective source change. Verified end-to-end against a container built
from this branch: an unmarked SIGTERM to the live gateway leaves
gateway_state=running (shutdown-context log confirms signal path);
existing real container-restart suite still green.
* docs(docker): clarify gateway autostart distinguishes operator-stop from container-kill
The per-profile-supervision section described the autostart-across-restart
contract as "running gateways come back, stopped stay stopped" without
spelling out what records 'stopped'. That contract was the source of
#42675 confusion: users expected a restart to bring the gateway back and
it didn't. With the write-side fix, only an explicit `hermes gateway stop`
records 'stopped'; container/s6 restart SIGTERMs (incl. image upgrades and
unexpected exits) leave the state 'running' so the gateway auto-starts.
Make that distinction explicit in both the multi-profile and
per-profile-supervision sections.
* test(docker): real-restart autostart E2E for #42675
Adds test_live_gateway_autostarts_after_real_restart_without_manual_state_stamp:
a live s6-supervised gateway is killed by an actual `docker restart`
SIGTERM (no manual gateway_state stamp, no planned-stop marker) and must
auto-start on the next boot. Exercises the WRITE side of the fix that the
existing stamp-based tests bypass.
Verified to FAIL against an origin/main image (reconciler logs
prior_state=stopped action=registered — the #42675 bug) and PASS against
the fixed image (prior_state=running action=started).
This commit is contained in:
parent
b4170f3ac2
commit
5cf6e28a2f
7 changed files with 363 additions and 3 deletions
|
|
@ -1953,6 +1953,16 @@ class GatewayRunner(GatewayAuthorizationMixin, GatewayKanbanWatchersMixin, Gatew
|
|||
self._exit_code: Optional[int] = None
|
||||
self._draining = False
|
||||
self._restart_requested = False
|
||||
# Set by shutdown_signal_handler when a SIGTERM/SIGINT arrived
|
||||
# WITHOUT a planned-stop / takeover marker — i.e. an unexpected
|
||||
# external signal (container/s6 SIGTERM on `docker restart` or
|
||||
# image upgrade, OOM-killer, bare `kill`). Distinct from an
|
||||
# operator-requested stop, which writes a marker first. Used by
|
||||
# _stop_impl to decide whether to persist gateway_state=stopped
|
||||
# (see issue #42675): an unexpected signal must NOT persist
|
||||
# "stopped", or container_boot refuses to auto-start the gateway
|
||||
# on the next boot.
|
||||
self._signal_initiated_shutdown = False
|
||||
self._restart_task_started = False
|
||||
self._restart_detached = False
|
||||
self._restart_via_service = False
|
||||
|
|
@ -5952,7 +5962,36 @@ class GatewayRunner(GatewayAuthorizationMixin, GatewayKanbanWatchersMixin, Gatew
|
|||
self._exit_reason = self._exit_reason or "Gateway restart requested"
|
||||
|
||||
self._draining = False
|
||||
self._update_runtime_status("stopped", self._exit_reason)
|
||||
# Persist the terminal gateway_state. The default is "stopped",
|
||||
# but when this teardown was triggered by an UNEXPECTED external
|
||||
# signal (container/s6 SIGTERM on `docker restart` or image
|
||||
# upgrade, OOM-killer, bare `kill`) we instead persist "running"
|
||||
# to preserve the operator's run-intent across the restart.
|
||||
#
|
||||
# On Docker (s6-overlay), container_boot.py reads gateway_state
|
||||
# on the next boot and only auto-starts gateways whose last
|
||||
# state was "running" (_AUTOSTART_STATES). Persisting "stopped"
|
||||
# — or leaving the mid-shutdown "draining" marker in place — for
|
||||
# a routine `docker compose up --force-recreate` permanently
|
||||
# suppresses auto-start, so the messaging channels silently stay
|
||||
# dark until the operator manually restarts (issue #42675).
|
||||
#
|
||||
# An operator-initiated stop (`hermes gateway stop`,
|
||||
# systemd/launchd ExecStop, the s6 stop path, Ctrl+C) writes a
|
||||
# planned-stop marker BEFORE signalling, so it is classified as
|
||||
# a planned stop (not signal-initiated) and correctly persists
|
||||
# "stopped" — respecting the explicit intent. A restart also
|
||||
# persists "stopped" here; the restarting process brings the
|
||||
# gateway back up itself.
|
||||
if getattr(self, "_signal_initiated_shutdown", False) and not self._restart_requested:
|
||||
logger.info(
|
||||
"Gateway stopped by an unexpected signal — persisting "
|
||||
"gateway_state=running so container_boot auto-starts on "
|
||||
"the next boot (issue #42675)"
|
||||
)
|
||||
self._update_runtime_status("running", self._exit_reason)
|
||||
else:
|
||||
self._update_runtime_status("stopped", self._exit_reason)
|
||||
logger.info("Gateway stopped (total teardown %.2fs)", _phase_elapsed())
|
||||
|
||||
self._stop_task = asyncio.create_task(_stop_impl())
|
||||
|
|
@ -15711,6 +15750,13 @@ async def start_gateway(config: Optional[GatewayConfig] = None, replace: bool =
|
|||
)
|
||||
else:
|
||||
_signal_initiated_shutdown = True
|
||||
# Mirror onto the runner so _stop_impl can suppress the
|
||||
# gateway_state=stopped persist for unexpected signals
|
||||
# (container/s6 SIGTERM on restart, OOM, bare kill) — see
|
||||
# issue #42675. Operator-initiated stops set a planned-stop
|
||||
# marker first, land in the `planned_stop` branch above, and
|
||||
# leave this flag False so they DO persist "stopped".
|
||||
runner._signal_initiated_shutdown = True
|
||||
logger.info(
|
||||
"Received %s — initiating shutdown",
|
||||
_shutdown_ctx["signal"] if _shutdown_ctx else "SIGTERM/SIGINT",
|
||||
|
|
|
|||
|
|
@ -739,13 +739,55 @@ class S6ServiceManager:
|
|||
"""
|
||||
self._run_svc("-u", "start", name)
|
||||
|
||||
def _supervised_pid(self, name: str) -> int | None:
|
||||
"""Return the PID of the supervised gateway process, or None.
|
||||
|
||||
Parses ``s6-svstat`` output (``up (pid NNNN) ...``). Used to
|
||||
mark an operator-initiated stop with the planned-stop marker so
|
||||
the gateway's shutdown handler classifies the incoming SIGTERM
|
||||
as intentional rather than an unexpected kill (issue #42675).
|
||||
Best-effort: any parse/exec failure returns None.
|
||||
"""
|
||||
import subprocess
|
||||
|
||||
try:
|
||||
result = subprocess.run(
|
||||
[f"{_S6_BIN_DIR}/s6-svstat", str(self.scandir / name)],
|
||||
capture_output=True, text=True, timeout=5,
|
||||
)
|
||||
except (OSError, subprocess.SubprocessError):
|
||||
return None
|
||||
if result.returncode != 0:
|
||||
return None
|
||||
m = re.search(r"\(pid (\d+)\)", result.stdout)
|
||||
return int(m.group(1)) if m else None
|
||||
|
||||
def stop(self, name: str) -> None:
|
||||
"""Bring down a registered service (``s6-svc -d``).
|
||||
|
||||
Writes a planned-stop marker naming the supervised gateway PID
|
||||
BEFORE sending the down command, so the gateway's shutdown
|
||||
handler recognises this SIGTERM as an operator-initiated stop
|
||||
and persists ``gateway_state=stopped`` (respecting the explicit
|
||||
intent). Without the marker, an intentional ``hermes gateway
|
||||
stop`` is indistinguishable from the container/s6 SIGTERM sent on
|
||||
``docker restart``; the latter must NOT persist ``stopped`` or
|
||||
container_boot refuses to auto-start on the next boot (#42675).
|
||||
The marker write is best-effort — a failure only means the stop
|
||||
is treated as signal-initiated, which is the safe fallback.
|
||||
|
||||
Raises:
|
||||
GatewayNotRegisteredError: no service directory for ``name``.
|
||||
S6CommandError: s6-svc exited non-zero for any other reason.
|
||||
"""
|
||||
pid = self._supervised_pid(name)
|
||||
if pid is not None:
|
||||
try:
|
||||
from gateway.status import write_planned_stop_marker
|
||||
|
||||
write_planned_stop_marker(pid)
|
||||
except Exception:
|
||||
pass
|
||||
self._run_svc("-d", "stop", name)
|
||||
|
||||
def restart(self, name: str) -> None:
|
||||
|
|
|
|||
|
|
@ -250,3 +250,79 @@ def test_stale_gateway_pid_cleaned_up_on_restart(restart_container: str) -> None
|
|||
assert r.returncode != 0, "stale gateway.pid survived restart"
|
||||
r = _sh(container, "test -f /opt/data/profiles/ghost/processes.json")
|
||||
assert r.returncode != 0, "stale processes.json survived restart"
|
||||
|
||||
|
||||
def test_live_gateway_autostarts_after_real_restart_without_manual_state_stamp(
|
||||
restart_container: str,
|
||||
) -> None:
|
||||
"""End-to-end guard for issue #42675.
|
||||
|
||||
The other tests in this module stamp gateway_state.json directly to
|
||||
exercise the reconciler's READ side. This one exercises the WRITE
|
||||
side: a real, live gateway is killed by the container/s6 SIGTERM that
|
||||
`docker restart` sends — no manual state stamp — and must come back up
|
||||
on the next boot.
|
||||
|
||||
Before the fix, the shutdown handler unconditionally persisted
|
||||
gateway_state=stopped on that SIGTERM, so the reconciler saw 'stopped'
|
||||
and registered the slot DOWN — the gateway silently stayed dark after
|
||||
every container restart. The fix classifies an unmarked SIGTERM as
|
||||
signal-initiated and persists 'running' instead, so auto-start works.
|
||||
"""
|
||||
container = restart_container
|
||||
|
||||
_exec(container, "hermes", "profile", "create", "live").check_returncode()
|
||||
r = _exec(container, "hermes", "-p", "live", "gateway", "start", timeout=60)
|
||||
assert r.returncode == 0, f"gateway start failed: {r.stderr}"
|
||||
|
||||
# Wait for the gateway to actually come up under supervision AND write
|
||||
# its own gateway_state=running (we do NOT stamp it ourselves).
|
||||
deadline = time.monotonic() + 20.0
|
||||
while time.monotonic() < deadline:
|
||||
r = _sh(container, "/command/s6-svstat /run/service/gateway-live")
|
||||
if r.returncode == 0 and "up " in r.stdout:
|
||||
break
|
||||
time.sleep(0.5)
|
||||
assert "up " in r.stdout, f"gateway never came up pre-restart: {r.stdout!r}"
|
||||
|
||||
# Confirm the gateway persisted its own 'running' state (sanity: we're
|
||||
# testing the real write path, not a stamped fixture).
|
||||
deadline = time.monotonic() + 15.0
|
||||
state = ""
|
||||
while time.monotonic() < deadline:
|
||||
r = _sh(
|
||||
container,
|
||||
"cat /opt/data/profiles/live/gateway_state.json 2>/dev/null",
|
||||
)
|
||||
if r.returncode == 0 and '"gateway_state"' in r.stdout:
|
||||
state = r.stdout
|
||||
break
|
||||
time.sleep(0.5)
|
||||
assert '"running"' in state, (
|
||||
f"gateway never persisted running state pre-restart: {state!r}"
|
||||
)
|
||||
|
||||
# Real restart — Docker sends SIGTERM to PID 1; s6 propagates it to the
|
||||
# supervised gateway. No planned-stop marker is written (this is not an
|
||||
# operator `hermes gateway stop`), so the shutdown is signal-initiated.
|
||||
_docker("restart", container, timeout=60).check_returncode()
|
||||
|
||||
log = _wait_for_reconcile_log_mention(container, "live", deadline_s=30.0)
|
||||
assert "profile=live" in log, (
|
||||
f"reconciler never logged live after restart: {log!r}"
|
||||
)
|
||||
# The crux: the reconciler must AUTO-START it, not register it down.
|
||||
assert "action=started" in log, (
|
||||
f"gateway did NOT auto-start after a real restart (issue #42675 "
|
||||
f"regression): {log!r}"
|
||||
)
|
||||
|
||||
# Slot recreated, and NO down marker (we expect auto-start).
|
||||
assert _wait_for_path(
|
||||
container, "/run/service/gateway-live", kind="d", deadline_s=10.0,
|
||||
), "slot not recreated after restart"
|
||||
r = _sh(container, "test -f /run/service/gateway-live/down")
|
||||
assert r.returncode != 0, (
|
||||
"down marker present despite a live gateway being restarted — "
|
||||
"the signal-initiated shutdown wrongly persisted 'stopped' (#42675)"
|
||||
)
|
||||
|
|
|
|||
|
|
@ -66,6 +66,7 @@ def make_restart_runner(
|
|||
runner._background_tasks = set()
|
||||
runner._draining = False
|
||||
runner._restart_requested = False
|
||||
runner._signal_initiated_shutdown = False
|
||||
runner._restart_task_started = False
|
||||
runner._restart_detached = False
|
||||
runner._restart_via_service = False
|
||||
|
|
|
|||
|
|
@ -358,3 +358,90 @@ async def test_gateway_stop_kills_tool_subprocesses_on_graceful_path(monkeypatch
|
|||
|
||||
# Only the final catch-all fires on the graceful path.
|
||||
assert kill_count == 1
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# gateway_state persistence on shutdown (issue #42675)
|
||||
#
|
||||
# On Docker/s6, container_boot.py only auto-starts gateways whose last
|
||||
# persisted gateway_state was "running". An unexpected external signal
|
||||
# (the SIGTERM s6/Docker sends on `docker compose up --force-recreate`,
|
||||
# OOM, bare kill) must NOT persist "stopped" — otherwise the gateway
|
||||
# stays down after every container restart. An operator-initiated stop
|
||||
# writes a planned-stop marker first, so it is NOT signal-initiated and
|
||||
# DOES persist "stopped", respecting the explicit intent.
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _persisted_states(runner) -> list:
|
||||
"""All gateway_state values passed to _update_runtime_status, in order."""
|
||||
states = []
|
||||
for call in runner._update_runtime_status.call_args_list:
|
||||
args, kwargs = call
|
||||
state = kwargs.get("gateway_state", args[0] if args else None)
|
||||
states.append(state)
|
||||
return states
|
||||
|
||||
|
||||
def _stopped_state_persisted(runner) -> bool:
|
||||
"""True iff _update_runtime_status was called with gateway_state='stopped'."""
|
||||
return "stopped" in _persisted_states(runner)
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_signal_initiated_shutdown_persists_running_not_stopped(tmp_path, monkeypatch):
|
||||
"""Unexpected SIGTERM (container restart / OOM / kill) must persist
|
||||
gateway_state=running — NOT stopped, and NOT leave the mid-shutdown
|
||||
'draining' marker — so container_boot auto-starts on next boot (#42675)."""
|
||||
monkeypatch.setattr(gateway_run, "_hermes_home", tmp_path)
|
||||
runner, adapter = make_restart_runner()
|
||||
adapter.disconnect = AsyncMock()
|
||||
runner._signal_initiated_shutdown = True # set by handler on unmarked signal
|
||||
|
||||
with patch("gateway.status.remove_pid_file"), patch("gateway.status.write_runtime_status"):
|
||||
await runner.stop()
|
||||
|
||||
assert not _stopped_state_persisted(runner), (
|
||||
"signal-initiated shutdown must NOT persist gateway_state=stopped"
|
||||
)
|
||||
# The FINAL terminal write must be 'running' so container_boot's
|
||||
# _AUTOSTART_STATES check passes (it only auto-starts 'running').
|
||||
assert _persisted_states(runner)[-1] == "running", (
|
||||
f"final state must be 'running', got: {_persisted_states(runner)}"
|
||||
)
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_operator_initiated_stop_persists_stopped(tmp_path, monkeypatch):
|
||||
"""A planned stop (marker written → not signal-initiated) must persist
|
||||
gateway_state=stopped so an explicit `hermes gateway stop` stays down."""
|
||||
monkeypatch.setattr(gateway_run, "_hermes_home", tmp_path)
|
||||
runner, adapter = make_restart_runner()
|
||||
adapter.disconnect = AsyncMock()
|
||||
runner._signal_initiated_shutdown = False # planned stop classification
|
||||
|
||||
with patch("gateway.status.remove_pid_file"), patch("gateway.status.write_runtime_status"):
|
||||
await runner.stop()
|
||||
|
||||
assert _stopped_state_persisted(runner), (
|
||||
"operator-initiated stop must persist gateway_state=stopped"
|
||||
)
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_signal_initiated_restart_still_persists_stopped(tmp_path, monkeypatch):
|
||||
"""A restart is not a 'stay down' — it persists normally (the new
|
||||
process/container brings the gateway back up itself). The suppression
|
||||
only applies to a terminal signal-initiated stop, not a restart."""
|
||||
monkeypatch.setattr(gateway_run, "_hermes_home", tmp_path)
|
||||
runner, adapter = make_restart_runner()
|
||||
adapter.disconnect = AsyncMock()
|
||||
runner._signal_initiated_shutdown = True
|
||||
runner._launch_systemd_restart_shortcut = MagicMock()
|
||||
|
||||
with patch("gateway.status.remove_pid_file"), patch("gateway.status.write_runtime_status"):
|
||||
await runner.stop(restart=True, service_restart=True)
|
||||
|
||||
assert _stopped_state_persisted(runner), (
|
||||
"a restart must persist gateway_state=stopped via the normal path"
|
||||
)
|
||||
|
|
|
|||
|
|
@ -799,3 +799,111 @@ def test_s6_is_running_parses_svstat(
|
|||
return _sp.CompletedProcess(cmd, 0, "", "")
|
||||
monkeypatch.setattr("subprocess.run", _svstat_down)
|
||||
assert S6ServiceManager(scandir=s6_scandir).is_running("gateway-coder") is False
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# S6 stop writes a planned-stop marker (issue #42675)
|
||||
#
|
||||
# `hermes gateway stop` inside a container dispatches through
|
||||
# S6ServiceManager.stop() -> `s6-svc -d`, which SIGTERMs the gateway.
|
||||
# That SIGTERM is indistinguishable from the one s6/Docker sends on a
|
||||
# container restart unless we mark the intentional stop first. Without
|
||||
# the marker, the gateway's shutdown handler can't tell an operator
|
||||
# stop from a restart kill, and the gateway_state=stopped suppression
|
||||
# (run.py) would never engage for explicit stops.
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_s6_supervised_pid_parses_svstat(monkeypatch, s6_scandir):
|
||||
"""_supervised_pid extracts the PID from `up (pid NNNN) ...`."""
|
||||
import subprocess as _sp
|
||||
|
||||
def _fake(cmd, **kw):
|
||||
return _sp.CompletedProcess(cmd, 0, "up (pid 4242) 17 seconds\n", "")
|
||||
|
||||
monkeypatch.setattr("subprocess.run", _fake)
|
||||
mgr = S6ServiceManager(scandir=s6_scandir)
|
||||
assert mgr._supervised_pid("gateway-coder") == 4242
|
||||
|
||||
|
||||
def test_s6_supervised_pid_none_when_down(monkeypatch, s6_scandir):
|
||||
"""A down service (`s6-svstat` rc!=0 or no pid) yields None."""
|
||||
import subprocess as _sp
|
||||
|
||||
def _fake(cmd, **kw):
|
||||
return _sp.CompletedProcess(cmd, 0, "down (exitcode 0) 3 seconds\n", "")
|
||||
|
||||
monkeypatch.setattr("subprocess.run", _fake)
|
||||
mgr = S6ServiceManager(scandir=s6_scandir)
|
||||
assert mgr._supervised_pid("gateway-coder") is None
|
||||
|
||||
|
||||
def test_s6_stop_writes_planned_stop_marker(monkeypatch, s6_scandir):
|
||||
"""stop() must mark the supervised PID before `s6-svc -d` so the
|
||||
gateway recognises the SIGTERM as an intentional stop (#42675)."""
|
||||
import subprocess as _sp
|
||||
|
||||
svc_dir = s6_scandir / "gateway-coder"
|
||||
svc_dir.mkdir() # so _run_svc doesn't raise GatewayNotRegisteredError
|
||||
|
||||
svc_calls: list[list[str]] = []
|
||||
|
||||
def _fake(cmd, **kw):
|
||||
seq = list(cmd) if isinstance(cmd, (list, tuple)) else [str(cmd)]
|
||||
if seq and seq[0].startswith("/command/"):
|
||||
seq[0] = seq[0][len("/command/"):]
|
||||
svc_calls.append(seq)
|
||||
if seq and seq[0] == "s6-svstat":
|
||||
return _sp.CompletedProcess(cmd, 0, "up (pid 9090) 5 seconds\n", "")
|
||||
return _sp.CompletedProcess(cmd, 0, "", "")
|
||||
|
||||
monkeypatch.setattr("subprocess.run", _fake)
|
||||
|
||||
marked: list[int] = []
|
||||
monkeypatch.setattr(
|
||||
"gateway.status.write_planned_stop_marker",
|
||||
lambda pid: marked.append(pid) or True,
|
||||
)
|
||||
|
||||
mgr = S6ServiceManager(scandir=s6_scandir)
|
||||
mgr.stop("gateway-coder")
|
||||
|
||||
assert marked == [9090], (
|
||||
f"stop() must write the planned-stop marker for the supervised PID; "
|
||||
f"marked={marked}"
|
||||
)
|
||||
# And it must still issue the down command.
|
||||
assert any(
|
||||
cmd[0] == "s6-svc" and "-d" in cmd for cmd in svc_calls
|
||||
), f"s6-svc -d not invoked; saw: {svc_calls}"
|
||||
|
||||
|
||||
def test_s6_stop_tolerates_marker_write_failure(monkeypatch, s6_scandir):
|
||||
"""A marker-write failure must not block the stop (best-effort)."""
|
||||
import subprocess as _sp
|
||||
|
||||
svc_dir = s6_scandir / "gateway-coder"
|
||||
svc_dir.mkdir()
|
||||
|
||||
svc_calls: list[list[str]] = []
|
||||
|
||||
def _fake(cmd, **kw):
|
||||
seq = list(cmd) if isinstance(cmd, (list, tuple)) else [str(cmd)]
|
||||
if seq and seq[0].startswith("/command/"):
|
||||
seq[0] = seq[0][len("/command/"):]
|
||||
svc_calls.append(seq)
|
||||
if seq and seq[0] == "s6-svstat":
|
||||
return _sp.CompletedProcess(cmd, 0, "up (pid 9090) 5 seconds\n", "")
|
||||
return _sp.CompletedProcess(cmd, 0, "", "")
|
||||
|
||||
monkeypatch.setattr("subprocess.run", _fake)
|
||||
|
||||
def _boom(pid):
|
||||
raise OSError("disk full")
|
||||
|
||||
monkeypatch.setattr("gateway.status.write_planned_stop_marker", _boom)
|
||||
|
||||
mgr = S6ServiceManager(scandir=s6_scandir)
|
||||
mgr.stop("gateway-coder") # must not raise
|
||||
|
||||
assert any(cmd[0] == "s6-svc" and "-d" in cmd for cmd in svc_calls)
|
||||
|
|
|
|||
|
|
@ -183,7 +183,7 @@ Each profile created with `hermes profile create <name>` gets:
|
|||
- A dedicated s6 service slot at `/run/service/gateway-<name>/`, registered dynamically by the runtime — no container rebuild required.
|
||||
- Auto-restart on crash, backoff-managed by `s6-supervise`.
|
||||
- Per-profile rotated logs at `${HERMES_HOME}/logs/gateways/<name>/current` (10 archives × 1 MB each).
|
||||
- State persistence across container restarts: the boot-time reconciler reads `gateway_state.json` from each profile directory and brings the slot back up only for profiles whose last recorded state was `running`. Stopped profiles stay stopped.
|
||||
- State persistence across container restarts: the boot-time reconciler reads `gateway_state.json` from each profile directory and brings the slot back up only for profiles whose last recorded state was `running`. Only a gateway you explicitly stopped (`hermes gateway stop`) stays down across a restart — a container restart, image upgrade, or unexpected exit leaves the recorded state as `running`, so the gateway auto-starts on the next boot.
|
||||
|
||||
The lifecycle commands you'd run on the host work the same way from inside the container:
|
||||
|
||||
|
|
@ -473,7 +473,7 @@ Each profile created with `hermes profile create <name>` automatically gets an s
|
|||
|
||||
- Gateway crashes are auto-restarted by `s6-supervise` after a ~1s backoff.
|
||||
- Dashboard, when enabled with `HERMES_DASHBOARD=1`, is supervised on the same supervision tree and gets the same auto-restart treatment.
|
||||
- `docker restart` preserves running gateways: the cont-init reconciler reads `$HERMES_HOME/profiles/<name>/gateway_state.json` and brings the slot back up if the last recorded state was `running`. Stopped gateways stay stopped.
|
||||
- `docker restart`, image upgrades (`docker compose up -d --force-recreate`), and unexpected exits preserve running gateways: the cont-init reconciler reads `$HERMES_HOME/profiles/<name>/gateway_state.json` and brings the slot back up if the last recorded state was `running`. Only an explicit `hermes gateway stop` records `stopped` and keeps the gateway down across the restart; the container/s6 SIGTERM sent on a restart or upgrade is treated as "still running" and auto-starts.
|
||||
- Per-profile gateway logs persist under `$HERMES_HOME/logs/gateways/<profile>/current` (rotated by `s6-log`), and the reconciler's actions are appended to `$HERMES_HOME/logs/container-boot.log` per boot. See [Where the logs go](#where-the-logs-go) for the full routing map.
|
||||
|
||||
`hermes status` inside the container reports `Manager: s6 (container supervisor)`. Use `/command/s6-svstat /run/service/gateway-<name>` for the raw supervisor view (note `/command/` is on PATH for supervision-tree processes only; pass the absolute path when calling from `docker exec`).
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue