mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-27 11:22:03 +00:00
Tasks 2.1 + 2.2 + 2.3 of the safe-shutdown plan — the reversible
quiesce-without-restart machinery NAS drives during a lifecycle action (D4a).
These ship together because the endpoint, the control channel, and the gateway
state machine are one coherent slice.
2.2 — control channel (gateway/drain_control.py, new):
The dashboard has no HTTP path into a running gateway (guardrails: "there is NO
external control channel into a running gateway"); restart/drain is driven only
by markers the gateway reacts to. So begin/cancel-drain writes/removes a
presence-based marker .drain_request.json (HERMES_HOME-scoped, atomic write,
never-raises read; a corrupt marker reads as present-contentless → fail-safe
toward quiescing). This is Q-B option A.
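To make the marker contract above concrete, here is a minimal sketch of an atomic JSON write plus a presence-only check. The real writer is `utils.atomic_json_write`, whose implementation is not shown in this commit; `atomic_json_write_sketch` and `marker_present` are hypothetical names, and the temp-file-plus-`os.replace` approach is an assumption about how such a helper is commonly built, not a description of the actual one.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_json_write_sketch(path: Path, payload: dict) -> None:
    """Write JSON so a reader sees the old file or the new one, never a partial one.

    Hypothetical stand-in for utils.atomic_json_write (assumed behaviour).
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())
        # Rename over the destination; the swap is atomic on the same filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename completed.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def marker_present(path: Path) -> bool:
    """Presence-based check: the content is irrelevant, so a corrupt file still counts."""
    return path.exists()
```

Because the check keys off presence alone, a marker that was truncated or corrupted after being written still reads as "drain requested", which is the fail-safe direction described above.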
2.3 — gateway state machine (gateway/run.py):
- _external_drain_active flag, DISTINCT from the shutdown _draining flag: this
one does NOT exit the process and is fully reversible.
- _enter_external_drain / _exit_external_drain: idempotent transitions that
flip gateway_state→draining / →running via _update_runtime_status (preserving
the live active_agents count). exit refuses to revert to running during a
real shutdown or after the loop stops (shutdown wins).
- _drain_control_watcher: 1s background task (modelled on _handoff_watcher)
reconciling accept-state with the marker; honours a marker that survived a
restart on its first tick. Registered alongside the other watchers in start.
- New-turn accept gate in _handle_message, placed BEFORE the session-slot
claim: when draining, refuse to START a new turn (so active_agents can only
fall → no TOCTOU race), while in-flight turns finish untouched. Internal/
system events (restart-recovery replays, bg-process completions) bypass it.
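The state-machine bullets above can be sketched as a small self-contained class. This is an illustration only: `ExternalDrainSketch` and all of its attribute and method names are hypothetical, the real logic lives in gateway/run.py and goes through `_update_runtime_status`, and the marker check is injected as a plain callable so the sketch stays runnable on its own.

```python
import asyncio
from typing import Callable


class ExternalDrainSketch:
    """Hypothetical stand-in for the gateway's reversible external-drain state."""

    def __init__(self, marker_present: Callable[[], bool]) -> None:
        self._marker_present = marker_present
        self.gateway_state = "running"
        self.external_drain_active = False  # reversible; never exits the process
        self.shutting_down = False          # the real shutdown flag; it always wins
        self.active_agents = 0

    def enter_external_drain(self) -> None:
        if self.external_drain_active:
            return  # idempotent: already draining
        self.external_drain_active = True
        self.gateway_state = "draining"

    def exit_external_drain(self) -> None:
        if not self.external_drain_active:
            return  # idempotent: nothing to revert
        if self.shutting_down:
            return  # refuse to revert to running during a real shutdown
        self.external_drain_active = False
        self.gateway_state = "running"

    def reconcile(self) -> None:
        """One watcher tick: make the accept-state match the marker."""
        if self._marker_present():
            self.enter_external_drain()
        else:
            self.exit_external_drain()

    def try_start_turn(self, internal: bool = False) -> bool:
        # The gate runs before any slot is claimed, so while draining
        # active_agents can only fall for externally triggered turns.
        if self.external_drain_active and not internal:
            return False
        self.active_agents += 1
        return True

    def finish_turn(self) -> None:
        self.active_agents -= 1

    async def watch(self, interval: float = 1.0) -> None:
        """Background loop; the first tick also honours a marker left by a restart."""
        while not self.shutting_down:
            self.reconcile()
            await asyncio.sleep(interval)
```

In-flight turns are untouched in this sketch because draining only affects `try_start_turn`; nothing cancels work that has already claimed a slot.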
2.1 — endpoint (hermes_cli/web_server.py):
POST /api/gateway/drain {action: drain|cancel}. Authenticated by the Task-2.0a
token seam (the drain plugin registered this exact path as a token route);
attributes the request to the verified token principal. Begin writes the
marker, cancel removes it — the gateway process owns the actual transition.
Force-override (D6) is NOT here; it maps onto the existing immediate
/api/gateway/restart force path.
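The endpoint behaviour described above (begin by default, idempotent cancel, 400 on an unknown action) can be sketched independently of any web framework. `handle_drain_action` and the response fields are hypothetical; the actual handler in hermes_cli/web_server.py, its framework, and its response shape are not shown here, so treat this as an assumption-laden outline rather than the real code.

```python
from typing import Callable, Optional, Tuple


def handle_drain_action(
    body: Optional[dict],
    principal: str,
    write_marker: Callable[[str], dict],
    clear_marker: Callable[[], bool],
) -> Tuple[int, dict]:
    """Map a POST body onto marker operations; returns (status_code, json_body).

    write_marker / clear_marker stand in for write_drain_request and
    clear_drain_request; the gateway process owns the actual state transition.
    """
    # A missing body or missing "action" defaults to beginning a drain.
    action = (body or {}).get("action", "drain")
    if action == "drain":
        payload = write_marker(principal)
        return 200, {"ok": True, "action": "drain", "marker": payload}
    if action == "cancel":
        # Cancel is idempotent: a missing marker is still a success.
        existed = clear_marker()
        return 200, {"ok": True, "action": "cancel", "cleared": existed}
    return 400, {"ok": False, "error": f"unknown action: {action!r}"}
```

Keeping the handler to marker writes and removals means the endpoint never needs to reach into the gateway process, which matches the "no external control channel" guardrail.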
Tests (mocked — necessary-not-sufficient; the HARD live gate Q-B is next):
- tests/gateway/test_external_drain_control.py — marker contract (write/clear/
read/corrupt/atomic), state machine (enter/exit/idempotency/shutdown-wins/
loop-stopped), watcher reconcile-enter-then-exit, new-turn refusal, and
in-flight-not-interrupted. 15 tests.
- tests/hermes_cli/test_web_server.py — /api/gateway/drain begin/default-begin/
cancel/cancel-idempotent/bad-action-400. 6 tests.
- dashboard.drain_auth config section already added in 2.0b commit.
All touched suites green: 301 (gateway+auth) + 9 (web_server endpoints) passed.
Intentionally deferred:
- HARD live-validation gate (Q-B): real isolated `hermes gateway run`, drive a
real begin-drain marker, prove the 5-point checklist a–e.
- Spec-doc status flip + Phase-2 PR.
Build status: external-drain, restart-drain, status, dashboard-auth, drain-plugin,
token-auth, and web_server-endpoint suites green.
109 lines
4.1 KiB
Python
"""External drain-control marker contract (dashboard → gateway).

Task 2.2 of the safe-shutdown plan (decisions.md Q-B, option A): the dashboard
has no way to call into a running gateway; there is no HTTP control channel
into the gateway process (guardrails: "there is NO external control channel
into a running gateway"). Restart/drain is driven only by the gateway reacting
to its own inputs: slash commands, process signals, and file markers it writes
itself (``.restart_notify.json``).

So the begin/cancel-drain dashboard endpoint communicates with the running
gateway the same way: it writes (or removes) a marker file, and a gateway
background watcher reacts to it. This module owns that marker contract so both
sides, the dashboard endpoint (writer) and the gateway watcher (reader),
share one definition and can never disagree.

Contract (presence-based, mirroring ``.restart_notify.json``):

* begin-drain  → write ``{HERMES_HOME}/.drain_request.json`` with
  ``{"action": "drain", "requested_at": <iso>, "principal": <str>}``.
* cancel-drain → remove the marker.
* The gateway watcher treats **presence** of the marker as "external drain
  active": flip ``gateway_state -> "draining"`` and stop accepting new turns.
  Absence means "not draining" (revert to ``running`` if we had flipped it).

Reading the marker never raises: a malformed/half-written file reads as
"present but contentless", which the watcher still treats as drain-active
(fail-safe toward quiescing: a corrupt begin marker must not be ignored).
"""
from __future__ import annotations

import json
import logging
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Optional

from hermes_constants import get_hermes_home
from utils import atomic_json_write

_log = logging.getLogger(__name__)

_DRAIN_REQUEST_FILENAME = ".drain_request.json"


def drain_request_path(home: Optional[Path] = None) -> Path:
    """Absolute path to the drain-request marker, respecting HERMES_HOME."""
    base = home if home is not None else get_hermes_home()
    return Path(base) / _DRAIN_REQUEST_FILENAME


def write_drain_request(
    *, principal: str = "drain-control", home: Optional[Path] = None
) -> dict[str, Any]:
    """Write the begin-drain marker. Returns the payload written.

    Atomic write so the gateway watcher never reads a half-written file.
    Idempotent: re-writing while a drain is already in progress just refreshes
    ``requested_at`` (harmless: the watcher keys off presence, not content).
    """
    payload = {
        "action": "drain",
        "requested_at": datetime.now(timezone.utc).isoformat(),
        "principal": principal,
    }
    atomic_json_write(drain_request_path(home), payload)
    return payload


def clear_drain_request(*, home: Optional[Path] = None) -> bool:
    """Remove the drain marker (cancel-drain). Returns True if one existed.

    Best-effort: a missing file is not an error (cancel is idempotent).
    """
    path = drain_request_path(home)
    try:
        path.unlink()
        return True
    except FileNotFoundError:
        return False
    except OSError as e:
        _log.warning("drain-control: failed to remove %s: %s", path, e)
        return False


def drain_requested(*, home: Optional[Path] = None) -> bool:
    """True iff the begin-drain marker is present (external drain active)."""
    return drain_request_path(home).exists()


def read_drain_request(*, home: Optional[Path] = None) -> Optional[dict[str, Any]]:
    """Return the marker payload, or ``None`` if absent.

    A present-but-unparseable marker returns ``{}`` (truthy-presence preserved
    via :func:`drain_requested`; callers that need the body get an empty dict
    rather than an exception). Never raises.
    """
    path = drain_request_path(home)
    try:
        raw = path.read_text(encoding="utf-8")
    except FileNotFoundError:
        return None
    except OSError as e:
        _log.warning("drain-control: failed to read %s: %s", path, e)
        return None
    try:
        data = json.loads(raw)
    except (ValueError, TypeError):
        return {}
    return data if isinstance(data, dict) else {}