hermes-agent/hermes_cli/service_manager.py
Ben c41f908ad4 fix(docker): make s6 lifecycle work for the unprivileged hermes user
Resolves the explicit "Known follow-up" left by commit 2f8ceeab9 and
the resulting CI failures in tests/docker/test_dashboard.py and
tests/docker/test_s6_profile_gateway_integration.py.

The product gap
---------------
Every hermes runtime operation inside the container runs as the
hermes user (UID 10000) via s6-setuidgid. But s6-supervise — spawned
by s6-svscan running as PID 1 — creates each service's supervise/
and top-level event/ directories with mode 0700 owned by its
effective UID (root). That left every s6-svc / s6-svstat / s6-svwait
call from hermes hitting EACCES on the supervise/control FIFO and
supervise/status — i.e. the entire S6ServiceManager lifecycle
(register, start, stop, unregister) was inert in production.

The 2f8ceeab9 commit message called this out and deferred the fix.
The audit changes that landed alongside it (defaulting docker_exec
to -u hermes) made the integration tests reproduce the bug
deterministically; the fix below resolves it.

The fix: pre-create the supervise/ skeleton hermes-owned
----------------------------------------------------------
Reading s6's source (src/supervision/s6-supervise.c::trymkdir +
control_init), the mkdir and mkfifo calls that build the supervise
tree are EEXIST-safe: if the directory or FIFO is already present,
s6-supervise reuses it and skips the chown/chmod fix-up that would
normally make event/ 03730 root:root. So if we lay the skeleton
down with hermes ownership before triggering s6-svscanctl -a,
s6-supervise inherits our layout and never touches it. The
death_tally / lock / status regular files written later by
s6-supervise (still as root) land mode 0644 — world-readable —
which is all s6-svstat needs.
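
The EEXIST behavior described above can be illustrated with a small
Python analogue. This is a sketch of the pattern only, not a
translation of s6-supervise.c; the helper name mkdir_unless_present
and the modes used are assumptions made for illustration.

```python
import os
import stat
import tempfile
from pathlib import Path


def mkdir_unless_present(path: Path, mode: int) -> bool:
    """Create ``path`` with ``mode``; reuse it untouched if it already exists.

    Returns True when the directory was created here, False when a
    pre-existing entry was reused (the EEXIST branch).
    """
    try:
        os.mkdir(path, 0o700)
    except FileExistsError:
        # Someone seeded the directory first: keep their ownership and
        # mode, and skip the fix-up below.
        return False
    os.chmod(path, mode)  # fix-up runs only when we created the directory
    return True


with tempfile.TemporaryDirectory() as root:
    seeded = Path(root) / "supervise"
    seeded.mkdir()
    seeded.chmod(0o755)  # stands in for the hermes-owned pre-seed

    created = mkdir_unless_present(seeded, 0o700)
    seeded_mode = stat.S_IMODE(seeded.stat().st_mode)
```

Because the pre-seeded directory is reused, its 0755 mode survives the
call, which is the property the pre-creation approach depends on.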

New module-level helper _seed_supervise_skeleton(svc_dir) in
hermes_cli/service_manager.py lays down:
  svc_dir/event/                       hermes:hermes 03730
  svc_dir/supervise/                   hermes:hermes 0755
  svc_dir/supervise/event/             hermes:hermes 03730
  svc_dir/supervise/control            hermes:hermes 0660 (FIFO)
  svc_dir/log/event/                   hermes:hermes 03730  (if log/ present)
  svc_dir/log/supervise/               hermes:hermes 0755
  svc_dir/log/supervise/event/         hermes:hermes 03730
  svc_dir/log/supervise/control        hermes:hermes 0660 (FIFO)

The log/ branch matters because the logger is a second
s6-supervise instance — without it, unregister rmtree races on
the logger's root-owned supervise dir even after the parent
slot's supervise/ is hermes-owned. The helper is idempotent and
swallows PermissionError on chown so it works equally well when
called from root (cont-init.d) or hermes (runtime register).
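
One way to check the layout above on a live slot is to walk the
skeleton entries and report their type and mode. The function below
is a hypothetical diagnostic, not part of hermes_cli; it covers the
main service entries only, and the log/ branch would repeat the same
names under log/.

```python
import os
import stat
import tempfile
from pathlib import Path

SKELETON = ("event", "supervise", "supervise/event", "supervise/control")


def describe_skeleton(svc_dir: Path) -> dict:
    """Map each skeleton entry to (kind, octal mode); missing entries are skipped."""
    report = {}
    for rel in SKELETON:
        try:
            st = os.lstat(svc_dir / rel)
        except FileNotFoundError:
            continue
        if stat.S_ISFIFO(st.st_mode):
            kind = "fifo"
        elif stat.S_ISDIR(st.st_mode):
            kind = "dir"
        else:
            kind = "other"
        report[rel] = (kind, format(stat.S_IMODE(st.st_mode), "04o"))
    return report


# Demo against a throwaway directory holding only supervise/ and its FIFO.
with tempfile.TemporaryDirectory() as root:
    svc = Path(root)
    (svc / "supervise").mkdir()
    (svc / "supervise").chmod(0o755)
    os.mkfifo(svc / "supervise" / "control")
    (svc / "supervise" / "control").chmod(0o660)
    layout = describe_skeleton(svc)
```

The explicit chmod calls in the demo sidestep the process umask, the
same reason the helper chmods after mkdir and mkfifo.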

Wiring
------
1. S6ServiceManager.register_profile_gateway calls
   _seed_supervise_skeleton(tmp_dir) just before publishing the
   slot via Path.rename. Runtime-registered profile gateways are
   therefore seeded by the hermes user.

2. container_boot._register_service does the same in the cont-init.d
   reconciliation path so boot-time-restored profile slots inherit
   the same layout.

3. New cont-init.d/015-supervise-perms script chowns the supervise/
   and event/ trees for STATIC s6-rc services (dashboard,
   main-hermes). These are spawned by s6-rc before cont-init.d
   gets to run, so the EEXIST-trick doesn't apply; we chown the
   already-existing tree instead. s6-supervise keeps using the
   same files; it never re-asserts ownership on a running service.
   The script skips s6-overlay internal services (s6rc-*,
   s6-linux-*) so the supervision tree itself stays root-only.
   The 015- prefix is intentional: it lex-sorts between
   01-hermes-setup and 02-reconcile-profiles in the container's
   C locale, so the chown finishes before the reconciler walks
   the scandir.
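
The ordering claim for the 015- prefix can be checked with a plain
byte-order sort. The three script names come from this message; that
cont-init.d runs scripts in this sorted order is taken from the text
above rather than verified here.

```python
init_scripts = [
    "02-reconcile-profiles",
    "015-supervise-perms",
    "01-hermes-setup",
]

# Python compares str by code point, which matches byte order (and so the
# C locale) for these ASCII names: "-" (0x2D) sorts before "5" (0x35).
run_order = sorted(init_scripts)
print(run_order)
```

The sort places 015-supervise-perms after 01-hermes-setup and before
02-reconcile-profiles.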

Unregister teardown reordering
------------------------------
S6ServiceManager.unregister_profile_gateway now fires
s6-svscanctl -an BEFORE rmtree (with a 200ms grace), so
s6-svscan reaps the supervise child and releases its file
handles on supervise/lock + supervise/status before we try to
remove the directory. Previously rmtree raced s6-supervise on a
set of files inside the supervise dir, and even with the parent
supervise/ now hermes-owned, the contained files (death_tally,
lock, status, written by root) could still be in use.
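
A minimal sketch of the reordered teardown, assuming only what this
section states: the rescan is triggered first, a short grace period
follows, and rmtree runs last. The function name teardown_slot and the
injectable run callable are hypothetical, and any stop step the real
method performs beforehand is omitted.

```python
import shutil
import tempfile
import time
from pathlib import Path


def teardown_slot(svc_dir, scandir, run, grace_s=0.2):
    """Trigger the s6-svscan rescan, wait briefly, then delete the slot."""
    run(["/command/s6-svscanctl", "-an", str(scandir)])
    time.sleep(grace_s)  # give the supervisor time to release its file handles
    shutil.rmtree(svc_dir, ignore_errors=True)


# Dry run with a recording stub in place of subprocess.run.
calls = []
with tempfile.TemporaryDirectory() as root:
    scandir = Path(root)
    slot = scandir / "gateway-demo"
    (slot / "supervise").mkdir(parents=True)

    def fake_run(argv):
        # Record the command and whether the slot still existed when it ran.
        calls.append((argv, slot.exists()))

    teardown_slot(slot, scandir, fake_run, grace_s=0.0)
    slot_gone = not slot.exists()
```

Passing the runner in as a parameter keeps the ordering testable
without a live s6-svscan.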

Dashboard down-state redesign
-----------------------------
The original PR #30136 review fix wrote a 'down' marker file
into /run/service/dashboard/ via cont-init.d/03-dashboard-toggle.
That approach was broken in two ways:

  (a) /run/service/dashboard is a symlink to a TRANSIENT
      /run/s6-rc:s6-rc-init:<tmpdir>/ directory while s6-rc is
      mid-transaction; the touch landed in a soon-to-be-discarded
      tmp.

  (b) Even when written to the final /run/s6-rc/servicedirs/
      location, the 'down' file is only consulted by s6-supervise
      at slot startup. s6-rc's user-bundle explicitly transitions
      'dashboard' to 'up' on every boot, overriding any down
      marker.

The right fix is the canonical s6 pattern: when HERMES_DASHBOARD
is unset, the dashboard run script exits 0 and a companion
finish script exits 125. Per s6-supervise(8), exit code 125 from
the finish script is the 'permanent failure, do not restart'
marker — equivalent to s6-svc -O. The slot reports as 'down' to
s6-svstat, matching the reality that no dashboard process is
running. When HERMES_DASHBOARD IS truthy, finish exits 0 and
restart-on-crash semantics apply.
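
The run/finish decision reduces to a small truth table. The real
scripts are shell; the Python below only models the exit-code choice,
and the set of values treated as truthy is an assumption, since this
message does not list them.

```python
TRUTHY = {"1", "true", "yes", "on"}  # assumed; the real script may differ


def finish_exit_code(env) -> int:
    """Exit code the finish script would return for this environment."""
    value = env.get("HERMES_DASHBOARD", "").strip().lower()
    # 125 tells s6-supervise not to restart; 0 keeps restart-on-crash.
    return 0 if value in TRUTHY else 125


disabled = finish_exit_code({})
enabled = finish_exit_code({"HERMES_DASHBOARD": "1"})
```

With the variable unset the model returns 125, and with it set to "1"
it returns 0.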

03-dashboard-toggle is removed (its function is now subsumed by
the run/finish pair).

Tests
-----
Adds four unit tests for _seed_supervise_skeleton covering the
produced layout, the log/ subservice case, the skip-when-no-log
case, and idempotency. The live-container verification continues
to live in tests/docker/test_s6_profile_gateway_integration.py and
tests/docker/test_dashboard.py — both now pass against the
rebuilt image.

References
----------
* Skarnet skaware mailing list 2020-02-02 (Laurent Bercot
  + Guillermo Diaz Hartusch) on unprivileged s6 tool semantics:
  http://skarnet.org/lists/skaware/1424.html
* just-containers/s6-overlay#130 — same EEXIST-preseed pattern,
  community-validated 2016 onward
* https://skarnet.org/software/s6/servicedir.html — exit-code 125
  semantics in finish scripts
2026-05-25 11:21:31 +10:00


"""Abstract service manager interface.

Wraps the existing systemd (Linux host), launchd (macOS host), Windows
Scheduled Task (native Windows host), and s6 (container) backends behind
a common Protocol. Only the s6 backend supports runtime registration
(for per-profile gateways) — host backends raise NotImplementedError
from those methods, and callers MUST check supports_runtime_registration()
before invoking them.

Host-side call sites (setup wizard, uninstall, status) continue to use
the existing module-level functions in hermes_cli.gateway and
hermes_cli.gateway_windows directly. This protocol is a thin facade
used by new code that needs to be backend-agnostic — specifically the
profile create/delete hooks (Phase 4) and the s6 dispatch path in
``hermes gateway start/stop/restart`` when running inside a container.
"""

from __future__ import annotations

import re
from pathlib import Path
from typing import Literal, Protocol, runtime_checkable

ServiceManagerKind = Literal["systemd", "launchd", "windows", "s6", "none"]

# Profile name → service directory mapping. Profile names must be safe
# as filesystem directory names because the s6 backend creates a service
# directory at ``<scandir>/gateway-<profile>/``. We reject anything that
# could traverse paths, span filesystems, or break s6's own naming rules.
_VALID_PROFILE_RE = re.compile(r"^[a-z0-9][a-z0-9_-]*$")
_MAX_PROFILE_LEN = 251  # s6-svscan default name_max


def validate_profile_name(name: str) -> None:
    """Raise ValueError if ``name`` is not usable as a profile name.

    Profile names are used as s6 service directory names, so they must
    match a conservative subset of filesystem-safe characters. Reject
    empty strings, uppercase, path-traversal sequences, and anything
    longer than s6's default ``name_max``.
    """
    if not name:
        raise ValueError("profile name must not be empty")
    if len(name) > _MAX_PROFILE_LEN:
        raise ValueError(
            f"profile name too long ({len(name)} > {_MAX_PROFILE_LEN})"
        )
    if not _VALID_PROFILE_RE.match(name):
        raise ValueError(
            f"profile name must match [a-z0-9][a-z0-9_-]*, got {name!r}"
        )


@runtime_checkable
class ServiceManager(Protocol):
    """Abstract interface for init-system-specific service operations.

    Lifecycle methods (start / stop / restart / is_running) are
    implemented by every backend. Runtime registration
    (register_profile_gateway / unregister_profile_gateway /
    list_profile_gateways) is implemented only by the s6 backend —
    callers MUST check ``supports_runtime_registration()`` before
    invoking the registration methods.
    """

    kind: ServiceManagerKind

    # Lifecycle of a pre-declared service.
    def start(self, name: str) -> None: ...
    def stop(self, name: str) -> None: ...
    def restart(self, name: str) -> None: ...
    def is_running(self, name: str) -> bool: ...

    # Runtime registration (s6 only).
    def supports_runtime_registration(self) -> bool: ...
    def register_profile_gateway(
        self,
        profile: str,
        *,
        extra_env: dict[str, str] | None = None,
    ) -> None: ...
    def unregister_profile_gateway(self, profile: str) -> None: ...
    def list_profile_gateways(self) -> list[str]: ...


def detect_service_manager() -> ServiceManagerKind:
    """Detect which service manager is available in this environment.

    Returns:
        "s6"      — inside a container when /init is s6-svscan (Phase 2+)
        "windows" — native Windows host
        "launchd" — macOS host
        "systemd" — Linux host with a working user/system bus
        "none"    — anything else (Termux, sandbox shells, etc.)

    This function does NOT replace ``supports_systemd_services()`` —
    host call sites continue to use that. It exists for new backend-
    agnostic code (profile create/delete hooks, the s6 dispatch path
    in ``hermes gateway start/stop/restart``).
    """
    # Imports deferred so importing this module doesn't drag in the
    # whole gateway dependency graph for callers that only need the
    # Protocol type or validate_profile_name().
    from hermes_constants import is_container
    from hermes_cli.gateway import (
        is_macos,
        is_windows,
        supports_systemd_services,
    )

    if is_container() and _s6_running():
        return "s6"
    if is_windows():
        return "windows"
    if is_macos():
        return "launchd"
    if supports_systemd_services():
        return "systemd"
    return "none"


def _s6_running() -> bool:
    """True when s6-svscan is running as PID 1 in this container.

    Detection has to work for **both** root and the unprivileged hermes
    user (UID 10000). The obvious probe — ``Path('/proc/1/exe').resolve()``
    — only works as root: for any other UID, the symlink at
    ``/proc/1/exe`` is unreadable and ``resolve()`` silently returns the
    path unchanged, so the resolved name is the literal ``"exe"`` and
    detection always fails. Since every Hermes runtime call inside the
    container drops to hermes via ``s6-setuidgid``, that silent failure
    made the entire service-manager runtime-registration path inert in
    production (PR #30136 review).

    Probe instead via:

    * ``/proc/1/comm`` — world-readable, contains the process comm
      (``s6-svscan`` when s6-overlay is PID 1).
    * ``/run/s6/basedir`` — s6-overlay-specific directory created by
      stage1. World-readable. More specific than ``/run/s6`` (which
      other tools occasionally create).

    Both signals are required; either alone could false-positive
    (e.g. a container with the s6 binaries installed but a different
    init, or an unrelated process named ``s6-svscan``).
    """
    try:
        comm = Path("/proc/1/comm").read_text(encoding="utf-8").strip()
    except OSError:
        return False
    if comm != "s6-svscan":
        return False
    return Path("/run/s6/basedir").is_dir()


# ---------------------------------------------------------------------------
# Backend wrappers
#
# These adapters are thin facades over the existing module-level functions
# in ``hermes_cli.gateway`` (systemd/launchd) and ``hermes_cli.gateway_windows``
# (Windows Scheduled Tasks). The protocol's ``name`` parameter is currently
# unused for host backends — they operate on whichever profile is currently
# active (set via the ``hermes -p <profile>`` flag before the call). This
# matches existing host-side semantics; the parameter shape is designed
# for s6 where each profile maps to a distinct service directory.
# ---------------------------------------------------------------------------


class _RegistrationUnsupportedMixin:
    """Mixin for host backends that don't support runtime registration."""

    def supports_runtime_registration(self) -> bool:
        return False

    def register_profile_gateway(
        self,
        profile: str,
        *,
        extra_env: dict[str, str] | None = None,
    ) -> None:
        raise NotImplementedError(
            f"{type(self).__name__} does not support runtime profile "
            "gateway registration (container-only feature)"
        )

    def unregister_profile_gateway(self, profile: str) -> None:
        raise NotImplementedError(
            f"{type(self).__name__} does not support runtime profile "
            "gateway unregistration (container-only feature)"
        )

    def list_profile_gateways(self) -> list[str]:
        return []


class SystemdServiceManager(_RegistrationUnsupportedMixin):
    """Thin wrapper around the ``systemd_*`` functions in hermes_cli.gateway.

    Existing host call sites continue to use those functions directly;
    this wrapper exists for new code that needs to be backend-agnostic
    (the Phase 4 profile create/delete hooks).
    """

    kind: ServiceManagerKind = "systemd"

    def start(self, name: str) -> None:
        from hermes_cli.gateway import systemd_start
        systemd_start()

    def stop(self, name: str) -> None:
        from hermes_cli.gateway import systemd_stop
        systemd_stop()

    def restart(self, name: str) -> None:
        from hermes_cli.gateway import systemd_restart
        systemd_restart()

    def is_running(self, name: str) -> bool:
        from hermes_cli.gateway import _probe_systemd_service_running
        _, running = _probe_systemd_service_running()
        return running


class LaunchdServiceManager(_RegistrationUnsupportedMixin):
    """Thin wrapper around the ``launchd_*`` functions in hermes_cli.gateway."""

    kind: ServiceManagerKind = "launchd"

    def start(self, name: str) -> None:
        from hermes_cli.gateway import launchd_start
        launchd_start()

    def stop(self, name: str) -> None:
        from hermes_cli.gateway import launchd_stop
        launchd_stop()

    def restart(self, name: str) -> None:
        from hermes_cli.gateway import launchd_restart
        launchd_restart()

    def is_running(self, name: str) -> bool:
        from hermes_cli.gateway import _probe_launchd_service_running
        return _probe_launchd_service_running()


class WindowsServiceManager(_RegistrationUnsupportedMixin):
    """Thin wrapper around ``hermes_cli.gateway_windows`` (Scheduled Task /
    Startup-folder fallback).

    The native Windows backend uses a Scheduled Task rather than a true
    init-system service, but for protocol purposes the lifecycle is the
    same: start / stop / restart / is_running. ``install`` accepts a
    handful of Windows-specific kwargs (start_now, start_on_login,
    elevated_handoff) that are passed straight through — non-Windows
    callers should never invoke ``install`` on this wrapper.
    """

    kind: ServiceManagerKind = "windows"

    def install(
        self,
        *,
        force: bool = False,
        start_now: bool | None = None,
        start_on_login: bool | None = None,
        elevated_handoff: bool = False,
    ) -> None:
        from hermes_cli import gateway_windows
        gateway_windows.install(
            force=force,
            start_now=start_now,
            start_on_login=start_on_login,
            elevated_handoff=elevated_handoff,
        )

    def start(self, name: str) -> None:
        from hermes_cli import gateway_windows
        gateway_windows.start()

    def stop(self, name: str) -> None:
        from hermes_cli import gateway_windows
        gateway_windows.stop()

    def restart(self, name: str) -> None:
        from hermes_cli import gateway_windows
        gateway_windows.restart()

    def is_running(self, name: str) -> bool:
        from hermes_cli import gateway_windows
        from hermes_cli.gateway import find_gateway_pids
        if not gateway_windows.is_installed():
            return False
        return bool(find_gateway_pids())


def get_service_manager() -> ServiceManager:
    """Return the ServiceManager instance for the current environment.

    Raises:
        RuntimeError: when no supported backend is available.
    """
    kind = detect_service_manager()
    if kind == "systemd":
        return SystemdServiceManager()
    if kind == "launchd":
        return LaunchdServiceManager()
    if kind == "windows":
        return WindowsServiceManager()
    if kind == "s6":
        return S6ServiceManager()
    raise RuntimeError("no supported service manager detected")

# ---------------------------------------------------------------------------
# S6ServiceManager (container-only)
#
# Per-profile gateways are registered dynamically when `hermes profile create`
# runs inside the container (Phase 4). Static services (main-hermes, dashboard)
# live in /etc/s6-overlay/s6-rc.d/ and are NOT managed by this class — they're
# part of the image, not runtime-created.
# ---------------------------------------------------------------------------
# s6-overlay's dynamic scandir for runtime-registered services. Lives on
# tmpfs and is the directory s6-svscan watches. Writes here trigger
# automatic supervision on the next rescan.
S6_DYNAMIC_SCANDIR = Path("/run/service")
S6_SERVICE_PREFIX = "gateway-"
# s6-overlay installs its binaries under /command/ and only adds that
# directory to PATH for processes started under the supervision tree
# (services started by s6-svscan, cont-init.d scripts, etc.). Code
# that runs via `docker exec` or any other out-of-tree entry point —
# notably our Phase 4 profile create/delete hooks — inherits the
# container's base PATH which does NOT include /command/.
#
# Rather than asking every caller to fix up its environment, the
# S6ServiceManager calls s6-* binaries by absolute path via this
# constant. We don't use `/usr/bin/s6-…` symlinks because the
# s6-overlay-symlinks-noarch tarball only links a subset, and we
# want every s6 invocation to be guaranteed-findable.
_S6_BIN_DIR = "/command"
# UID/GID of the in-image ``hermes`` user. Hardcoded to match what
# ``stage2-hook.sh`` enforces (the runtime invariant — see also
# tests/docker/test_uid_remap.py). The container starts s6-supervise
# under root and immediately drops to this UID via ``s6-setuidgid``.
_HERMES_UID = 10000
_HERMES_GID = 10000

def _seed_supervise_skeleton(svc_dir: Path) -> None:
    """Pre-create the ``supervise/`` and top-level ``event/`` skeleton
    inside a service directory, owned by the hermes user.

    Why this exists
    ---------------
    When s6-supervise spawns a service it tries to ``mkdir`` two
    directories: ``<svc>/event`` and ``<svc>/supervise``, both with mode
    ``0700``. It also ``mkfifo``s ``<svc>/supervise/control`` with mode
    ``0600``. Because s6-supervise runs as PID 1's effective UID (root)
    these dirs end up root-owned mode 0700, and an unprivileged client
    (the ``hermes`` user — UID 10000 — running every Hermes runtime
    operation via ``s6-setuidgid``) gets ``EACCES`` on any ``s6-svc``,
    ``s6-svstat``, or ``s6-svwait`` invocation against the slot.

    The PR #30136 review surfaced this as a real product gap: the
    entire S6ServiceManager lifecycle (``register/start/stop/unregister
    _profile_gateway``) was inert in production because every operation
    is dispatched as the hermes user.

    Why this works
    --------------
    Reading s6's source (src/supervision/s6-supervise.c::trymkdir +
    control_init): the ``mkdir`` and ``mkfifo`` calls both treat
    ``EEXIST`` as success. If the directory is already present, the
    chown/chmod fix-up that would normally make event/ ``03730
    root:root`` is **skipped** entirely — s6-supervise just opens the
    pre-existing FIFOs and proceeds. So if we lay the skeleton down
    with hermes ownership before triggering ``s6-svscanctl -a``,
    s6-supervise inherits our layout and never touches it.

    Layout produced
    ---------------
    ``svc_dir/``                   hermes:hermes, 0755 (parent must already exist)
    ``svc_dir/event/``             hermes:hermes, 03730 (setgid + sticky; owner rwx, group wx)
    ``svc_dir/supervise/``         hermes:hermes, 0755
    ``svc_dir/supervise/event/``   hermes:hermes, 03730
    ``svc_dir/supervise/control``  hermes:hermes, 0660 (FIFO)

    The ``death_tally``, ``lock``, and ``status`` regular files end up
    written by s6-supervise itself (as root), but those land mode 0644 —
    world-readable — and ``s6-svstat`` only needs read access, so the
    hermes user reads them fine.

    If ``svc_dir/log/`` is present (the canonical s6 logger pattern —
    one s6-supervise instance per service, plus a second for its
    logger), the same skeleton is seeded under ``log/`` as well:
    ``log/event/``, ``log/supervise/``, ``log/supervise/event/``,
    ``log/supervise/control``. Without this, unregister teardown
    would EACCES on the logger's supervise dir even after the parent
    slot's supervise/ was hermes-owned.

    Idempotency
    -----------
    Safe to call against a directory where the skeleton already exists.
    Existing entries are left untouched (the helper doesn't try to
    re-chown / re-chmod live FIFOs that s6-supervise may have already
    opened).

    Reference
    ---------
    Discussed at length on the skarnet `skaware` mailing list in 2020
    (`<http://skarnet.org/lists/skaware/1424.html>`_); see also
    just-containers/s6-overlay#130. The pre-creation pattern was
    historically called out as forward-compatibility-fragile, but the
    EEXIST handling in s6-supervise has been stable since 2015 — it's
    the same pattern ``s6-svperms`` and ``fix-attrs.d`` rely on.
    """
    import os

    def _mkdir_owned(path: Path, mode: int) -> None:
        if path.exists():
            return
        path.mkdir(parents=False, exist_ok=False)
        path.chmod(mode)
        try:
            os.chown(path, _HERMES_UID, _HERMES_GID)
        except PermissionError:
            # Running as the hermes user already — directory is hermes-
            # owned by default. The chown is a no-op in that case, so
            # swallowing this keeps both root and unprivileged callers
            # on one code path.
            pass

    # Top-level event/ dir (this is the s6-svlisten1 event-subscription
    # dir at the service root, distinct from supervise/event/).
    _mkdir_owned(svc_dir / "event", 0o3730)

    # supervise/ dir + its inner event/ dir.
    supervise = svc_dir / "supervise"
    _mkdir_owned(supervise, 0o755)
    _mkdir_owned(supervise / "event", 0o3730)

    # supervise/control FIFO. Same EEXIST-safe pattern: if it's already
    # there (s6-supervise has already started against this slot), leave
    # it alone. The explicit chmod after mkfifo is required because
    # mkfifo honors the process umask, which can strip group-write
    # (e.g. the default 0022 on most dev hosts → 0o660 becomes 0o640).
    # The container runs with umask 0 inside s6-overlay's stage2, but
    # being defensive here keeps the helper consistent under any
    # invocation context.
    control = supervise / "control"
    if not control.exists():
        os.mkfifo(control, 0o660)
        control.chmod(0o660)
        try:
            os.chown(control, _HERMES_UID, _HERMES_GID)
        except PermissionError:
            pass

    # If a log/ subdir is present (the canonical s6 logger pattern —
    # see servicedir(7)), it gets its own s6-supervise instance and
    # needs the same skeleton. Without this, unregister teardown
    # would EACCES on the logger's root-owned supervise/ dir even
    # when the parent slot's supervise/ is hermes-owned.
    log_dir = svc_dir / "log"
    if log_dir.is_dir():
        _mkdir_owned(log_dir / "event", 0o3730)
        log_supervise = log_dir / "supervise"
        _mkdir_owned(log_supervise, 0o755)
        _mkdir_owned(log_supervise / "event", 0o3730)
        log_control = log_supervise / "control"
        if not log_control.exists():
            os.mkfifo(log_control, 0o660)
            log_control.chmod(0o660)
            try:
                os.chown(log_control, _HERMES_UID, _HERMES_GID)
            except PermissionError:
                pass


class S6Error(RuntimeError):
    """Base error for S6ServiceManager lifecycle failures.

    Concrete subclasses carry the slot name (and, where useful, the
    underlying subprocess output) so the CLI can render an actionable
    message instead of leaking a raw ``CalledProcessError`` traceback.
    """

    def __init__(self, message: str, *, service: str | None = None) -> None:
        super().__init__(message)
        self.service = service


class GatewayNotRegisteredError(S6Error):
    """Raised when a lifecycle method targets a slot that doesn't exist.

    Most commonly: ``hermes -p typo gateway start`` when no profile
    ``typo`` exists. Carries the unprefixed profile name (not the
    full ``gateway-<profile>`` service-dir name) so callers can phrase
    a user-facing message like "no such gateway 'typo'".
    """

    def __init__(self, profile: str) -> None:
        self.profile = profile
        super().__init__(
            f"no such gateway {profile!r}: register it with "
            f"`hermes profile create {profile}` first, or pass "
            "an existing profile name via `-p <name>`",
            service=f"gateway-{profile}",
        )


class S6CommandError(S6Error):
    """Raised when an s6 command fails for a reason other than a
    missing slot — e.g. permission denied on the supervise control
    FIFO, or s6-svc returning a non-zero exit for an unexpected
    reason. Carries the stderr from the failing command so callers
    can surface it.
    """

    def __init__(
        self, *, service: str, action: str, returncode: int, stderr: str,
    ) -> None:
        self.action = action
        self.returncode = returncode
        self.stderr = stderr
        message = (
            f"s6-svc {action} on {service!r} failed (rc={returncode})"
        )
        if stderr.strip():
            message += f": {stderr.strip()}"
        super().__init__(message, service=service)


class S6ServiceManager:
    """Per-profile gateway supervision via s6-overlay.

    Only handles runtime-registered services under
    ``S6_DYNAMIC_SCANDIR``. Static services (main-hermes, dashboard)
    are managed by s6-rc at image-build time and are out of scope.
    """

    kind: ServiceManagerKind = "s6"

    def __init__(self, scandir: Path = S6_DYNAMIC_SCANDIR) -> None:
        self.scandir = scandir

    # -- internal helpers --------------------------------------------------

    def _service_dir(self, profile: str) -> Path:
        validate_profile_name(profile)
        return self.scandir / f"{S6_SERVICE_PREFIX}{profile}"

    def _service_name(self, profile: str) -> str:
        return f"{S6_SERVICE_PREFIX}{profile}"

    @staticmethod
    def _render_run_script(
        profile: str,
        extra_env: dict[str, str],
    ) -> str:
        """Generate the run script for a profile-gateway s6 service.

        The script:

        1. Sources HERMES_HOME (and any extra env) via with-contenv —
           so e.g. ``-e HERMES_HOME=/data/hermes`` is honored at run
           time, not Python-substituted at registration time (OQ8-C).
        2. Activates the bundled venv.
        3. Drops to the hermes user and exec's
           ``hermes -p <profile> gateway run`` (or just ``hermes
           gateway run`` for the default profile — see below).

        Special case: ``profile == "default"`` emits ``hermes gateway
        run`` with **no** ``-p`` flag. This is the sentinel for "the
        root HERMES_HOME profile" (the implicit profile that exists at
        the top of $HERMES_HOME, not under profiles/). It must be
        spelled this way because ``_profile_suffix()`` returns the
        empty string for the root profile, and the dispatcher in
        ``hermes_cli.gateway`` maps that empty string to the
        ``gateway-default`` service slot. Passing ``-p default`` here
        would instead look up ``$HERMES_HOME/profiles/default/`` — a
        completely different (and almost always nonexistent) profile.

        Port selection: the gateway picks its bind port from the
        profile's ``config.yaml`` (``[gateway] port = ...``) — that
        is the single source of truth. Previously this method took a
        ``port`` parameter that was passed in but never substituted
        into the rendered script (it was carried in for "API parity"
        with a deterministic SHA-256 allocator in
        ``hermes_cli.profiles._allocate_gateway_port``). PR #30136
        review item I5 retired both the allocator and the parameter
        because they were dead code through the entire stack.
        """
        import shlex

        lines = [
            "#!/command/with-contenv sh",
            "# shellcheck shell=sh",
            "set -e",
            "cd /opt/data",
            ". /opt/hermes/.venv/bin/activate",
        ]
        for k, v in sorted(extra_env.items()):
            lines.append(f"export {k}={shlex.quote(v)}")
        if profile == "default":
            lines.append("exec s6-setuidgid hermes hermes gateway run")
        else:
            lines.append(
                f"exec s6-setuidgid hermes hermes -p {shlex.quote(profile)} gateway run"
            )
        return "\n".join(lines) + "\n"

    @staticmethod
    def _render_log_run(profile: str) -> str:
        """Generate the log/run script for a profile-gateway service.

        OQ8-C: persist to ``${HERMES_HOME}/logs/gateways/<profile>/``.

        CRITICAL: the HERMES_HOME path is sourced from the runtime env
        via with-contenv — NOT Python-substituted at registration time
        — so a container started with ``-e HERMES_HOME=/data/hermes``
        gets its logs under /data/hermes/logs/..., not the build-time
        default.
        """
        import shlex

        prof = shlex.quote(profile)
        return (
            f"#!/command/with-contenv sh\n"
            f"# shellcheck shell=sh\n"
            f': "${{HERMES_HOME:=/opt/data}}"\n'
            f'log_dir="$HERMES_HOME/logs/gateways/{prof}"\n'
            f'mkdir -p "$log_dir"\n'
            f'chown -R hermes:hermes "$log_dir" 2>/dev/null || true\n'
            f'exec s6-setuidgid hermes s6-log n10 s1000000 T "$log_dir"\n'
        )

    # -- lifecycle ---------------------------------------------------------

    def _run_svc(self, action_flag: str, action_label: str, name: str) -> None:
        """Shared lifecycle dispatch for start / stop / restart.

        Translates the two failure modes operators care about into
        named errors:

        * ``GatewayNotRegisteredError`` — the service directory at
          ``<scandir>/<name>/`` doesn't exist. ``s6-svc`` would
          exit non-zero with a fairly opaque message; we pre-empt
          it with a clear "no such gateway 'X'" tied to the profile
          name (without the ``gateway-`` prefix).
        * ``S6CommandError`` — anything else (EACCES on the
          supervise control FIFO, timeout, etc.). Carries the
          subprocess return code and stderr so callers can render
          them inline.

        ``action_flag`` is the ``s6-svc`` flag (``-u`` / ``-d`` /
        ``-t``); ``action_label`` is the human verb (``start`` /
        ``stop`` / ``restart``) used in error messages.
        """
        import subprocess

        service_dir = self.scandir / name
        if not service_dir.is_dir():
            # Strip the gateway- prefix back off so the message
            # matches what the user typed on the CLI (``-p <profile>``).
            profile = (
                name[len(S6_SERVICE_PREFIX):]
                if name.startswith(S6_SERVICE_PREFIX)
                else name
            )
            raise GatewayNotRegisteredError(profile)
        try:
            subprocess.run(
                [f"{_S6_BIN_DIR}/s6-svc", action_flag, str(service_dir)],
                check=True, capture_output=True, text=True, timeout=5,
            )
        except subprocess.CalledProcessError as exc:
            raise S6CommandError(
                service=name,
                action=action_label,
                returncode=exc.returncode,
                stderr=exc.stderr or "",
            ) from exc

    def start(self, name: str) -> None:
        """Bring up a registered service (``s6-svc -u``).

        Raises:
            GatewayNotRegisteredError: no service directory for ``name``.
            S6CommandError: s6-svc exited non-zero for any other reason
                (permission denied on the supervise FIFO, timeout, etc.).
        """
        self._run_svc("-u", "start", name)

    def stop(self, name: str) -> None:
        """Bring down a registered service (``s6-svc -d``).

        Raises:
            GatewayNotRegisteredError: no service directory for ``name``.
            S6CommandError: s6-svc exited non-zero for any other reason.
        """
        self._run_svc("-d", "stop", name)

    def restart(self, name: str) -> None:
        """Restart a registered service (``s6-svc -t`` = SIGTERM).

        Raises:
            GatewayNotRegisteredError: no service directory for ``name``.
            S6CommandError: s6-svc exited non-zero for any other reason.
        """
        self._run_svc("-t", "restart", name)

    def is_running(self, name: str) -> bool:
        """True iff ``s6-svstat`` reports the service as up."""
        import subprocess

        result = subprocess.run(
            [f"{_S6_BIN_DIR}/s6-svstat", str(self.scandir / name)],
            capture_output=True, text=True, timeout=5,
        )
        return result.returncode == 0 and "up " in result.stdout

    # -- runtime registration ---------------------------------------------

    def supports_runtime_registration(self) -> bool:
        return True


    def register_profile_gateway(
        self,
        profile: str,
        *,
        extra_env: dict[str, str] | None = None,
    ) -> None:
        """Create the s6 service directory for a profile gateway.

        Triggers ``s6-svscanctl -a`` so s6-svscan picks the new directory
        up immediately. The service is created in the *up* state. To
        register without auto-starting, follow up with ``stop(profile)``.
        A ``start_now=False`` argument is planned for this; until it
        lands, the Phase 4 reconciliation path gets the same effect by
        writing a ``down`` marker file directly.

        Raises:
            ValueError: if the profile name is invalid or the service
                directory already exists.
            RuntimeError: if ``s6-svscanctl`` fails.
        """
        import shutil
        import subprocess

        svc_dir = self._service_dir(profile)
        if svc_dir.exists():
            raise ValueError(
                f"profile gateway {profile!r} already registered at {svc_dir}"
            )

        # Build the service directory atomically: write to a sibling
        # temp dir, then rename. Avoids s6-svscan observing a half-
        # populated directory on a fast rescan.
        tmp_dir = svc_dir.with_name(svc_dir.name + ".tmp")
        if tmp_dir.exists():
            shutil.rmtree(tmp_dir, ignore_errors=True)
        tmp_dir.mkdir(parents=True)
        try:
            (tmp_dir / "type").write_text("longrun\n")
            run_script = self._render_run_script(profile, extra_env or {})
            run_path = tmp_dir / "run"
            run_path.write_text(run_script)
            run_path.chmod(0o755)

            # Persistent log rotation (OQ8-C).
            log_subdir = tmp_dir / "log"
            log_subdir.mkdir()
            log_run = log_subdir / "run"
            log_run.write_text(self._render_log_run(profile))
            log_run.chmod(0o755)

            # Pre-create the supervise/ skeleton with hermes ownership
            # BEFORE we publish the slot. s6-supervise's mkdir/mkfifo
            # calls are EEXIST-safe, so it reuses our dirs/FIFOs and
            # leaves their ownership alone. The runtime s6-svc /
            # s6-svstat / s6-svwait calls (all dispatched as the hermes
            # user) then won't hit EACCES on root-owned 0700 dirs. See
            # ``_seed_supervise_skeleton`` for the full rationale.
            _seed_supervise_skeleton(tmp_dir)

            tmp_dir.rename(svc_dir)
        except Exception:
            shutil.rmtree(tmp_dir, ignore_errors=True)
            raise

        # Trigger a rescan so s6-svscan picks up the new service.
        result = subprocess.run(
            [f"{_S6_BIN_DIR}/s6-svscanctl", "-a", str(self.scandir)],
            capture_output=True, text=True, timeout=5,
        )
        if result.returncode != 0:
            # Clean up: the rescan failed, and leaving the directory in
            # place would be confusing (no supervisor watching it).
            shutil.rmtree(svc_dir, ignore_errors=True)
            raise RuntimeError(
                f"s6-svscanctl failed: {result.stderr or result.stdout}"
            )

    def unregister_profile_gateway(self, profile: str) -> None:
        """Stop the profile gateway service and remove its directory.

        Idempotent: absent services are a no-op. Best-effort stop +
        wait-for-down before removal so the running gateway process
        gets a chance to shut down cleanly before its service dir
        disappears.

        Teardown ordering matters: ``s6-svscanctl -an`` is fired
        **before** ``rmtree`` so s6-svscan reaps the supervise child
        process (releasing its handle on ``supervise/lock`` and the
        regular files inside the supervise dir), giving us a clean
        directory to remove. Without the reap-first ordering, the
        rmtree races s6-supervise on a set of root-owned files inside
        the supervise dir and the dir is left half-removed.
        """
        import shutil
        import subprocess
        import time

        svc_dir = self._service_dir(profile)
        if not svc_dir.exists():
            return

        # Stop the service (best effort — service may already be down).
        subprocess.run(
            [f"{_S6_BIN_DIR}/s6-svc", "-d", str(svc_dir)],
            capture_output=True, text=True, timeout=5,
            check=False,
        )
        # Wait for it to actually go down (up to 10s).
        subprocess.run(
            [f"{_S6_BIN_DIR}/s6-svwait", "-D", "-t", "10000", str(svc_dir)],
            capture_output=True, text=True, timeout=15,
            check=False,
        )

        # Reap the supervise child FIRST: -n tells s6-svscan to drop
        # any supervise processes whose service dir is gone (which
        # includes any service dir we're about to remove). This
        # releases the file handles s6-supervise holds against the
        # supervise/lock + supervise/status + supervise/death_tally
        # files inside the slot, so the upcoming rmtree doesn't race.
        subprocess.run(
            [f"{_S6_BIN_DIR}/s6-svscanctl", "-an", str(self.scandir)],
            capture_output=True, text=True, timeout=5,
            check=False,
        )
        # Give s6-svscan a moment to reap. There's no synchronous
        # "scan completed" handshake — the -a/-n trigger just sets a
        # flag s6-svscan reads on its next loop iteration. 200ms is
        # comfortably above the loop's resolution but well under any
        # user-perceived latency.
        time.sleep(0.2)

        # Now the supervise dir's files are no longer held open by a
        # live s6-supervise, so rmtree can remove them. Files inside
        # supervise/ are root-owned (death_tally, lock, status, written
        # by s6-supervise itself) — but the parent supervise/ directory
        # is hermes-owned (see ``_seed_supervise_skeleton``), and on
        # POSIX you only need write+execute on the parent to remove
        # contained files regardless of file ownership.
        shutil.rmtree(svc_dir, ignore_errors=True)

    def list_profile_gateways(self) -> list[str]:
        """Return the profile names of all currently-registered gateway services.

        Filters the scandir to entries that match the ``gateway-`` prefix.
        Other services (e.g. ``s6-linux-init-shutdownd``) are ignored.
        """
        if not self.scandir.exists():
            return []
        profiles: list[str] = []
        for entry in self.scandir.iterdir():
            if entry.name.startswith("."):
                continue
            if not entry.is_dir():
                continue
            if not entry.name.startswith(S6_SERVICE_PREFIX):
                continue
            profiles.append(entry.name[len(S6_SERVICE_PREFIX):])
        return profiles