fix(docker): drop docker exec to hermes uid before invoking the CLI

When operators ran `docker exec <c> hermes login` (or anything else
that wrote under $HERMES_HOME) they defaulted to root, leaving
/opt/data/auth.json root:root mode 0600. The supervised gateway
(UID 10000) then couldn't read its own credentials and returned
"Provider authentication failed: Hermes is not logged into Nous
Portal" on every Telegram/Discord/etc. message — even though
`docker exec <c> hermes chat -q ping` (also root) succeeded because
root could read its own root-owned file. _load_auth_store swallowed
PermissionError as a parse failure and copied the file aside as
auth.json.corrupt, making the diagnostic more misleading.

Fix: install a privilege-drop shim at /opt/hermes/bin/hermes,
prepended ahead of the venv on PATH. When invoked as root the shim
exec's the real venv binary via `s6-setuidgid hermes` — so any file
the docker-exec session writes is uid-aligned with the supervised
processes. Non-root callers (the supervised processes themselves,
`docker exec --user hermes`, kanban subagents, anything inside the
container that's not coming through docker-exec) hit a single exec
to the absolute venv path with no privilege change.

Recursion is impossible: the shim exec's the venv binary by
absolute path (/opt/hermes/.venv/bin/hermes), so the second hop
cannot re-enter the shim regardless of PATH state. No sentinel env
var needed (unlike #33583's gateway-run redirect which DOES need
HERMES_S6_SUPERVISED_CHILD because there's no absolute-path
equivalent for the s6 dispatch).

Opt-out: `docker exec -e HERMES_DOCKER_EXEC_AS_ROOT=1 …` for
diagnostic sessions where the operator deliberately wants root.
Strict truthiness (1/true/yes case-insensitive); typos like `=0`
do not silently opt out, mirroring HERMES_GATEWAY_NO_SUPERVISE in
#33583.

If `s6-setuidgid` is missing (someone stripped s6-overlay in a
downstream fork), the shim exits 126 with a remediation message
pointing at `--user hermes` and the opt-out — never silently runs
as root.

Test plan:
- tests/docker/test_docker_exec_privilege_drop.py — 11 tests
  - shim drops root to hermes uid (file ownership check)
  - shim short-circuits for non-root docker exec
  - HERMES_DOCKER_EXEC_AS_ROOT=1 keeps root
  - strict-truthiness parametrization (5 falsy values reject)
  - main CMD path unaffected (recursion guard)
  - E2E: every file written by docker-exec is readable by uid 10000
- Full tests/docker/ harness: 32/32 pass against fresh image build
- shellcheck --severity=error: clean
- hadolint: clean
- Manual: reproduced the original symptom (root-owned auth.json)
  by bypassing the shim; confirmed default docker-exec produces
  hermes-owned files; confirmed opt-out env keeps root semantics.

Known follow-up: this prevents NEW instances of the bug. Volumes
that already have root:root /opt/data/auth.json from a pre-shim
image need a one-time `chown hermes:hermes` before rebooting onto
the new image. A stage2-hook chown sweep can self-heal that, but
is deferred per scope decision.
This commit is contained in:
Ben Barclay 2026-05-28 13:27:26 +10:00 committed by Ben Barclay
parent b345323195
commit aeb992d343
3 changed files with 397 additions and 1 deletions

View file

@ -213,13 +213,32 @@ COPY --chmod=0755 docker/cont-init.d/02-reconcile-profiles /etc/cont-init.d/02-r
# ---------- Runtime ----------
ENV HERMES_WEB_DIST=/opt/hermes/hermes_cli/web_dist
ENV HERMES_HOME=/opt/data
# `docker exec` privilege-drop shim. When operators run
# `docker exec <c> hermes ...` they default to root, and any file the
# command writes under $HERMES_HOME (auth.json, .env, config.yaml) ends
# up root-owned and unreadable to the supervised gateway (UID 10000).
# The shim lives at /opt/hermes/bin/hermes, sits earliest on PATH, and
# transparently re-exec's the real venv binary via `s6-setuidgid hermes`
# when invoked as root. Non-root callers (supervised processes,
# `--user hermes`, etc.) hit the short-circuit path with no overhead.
# Recursion is impossible because the shim exec's the venv binary by
# absolute path (/opt/hermes/.venv/bin/hermes). See the shim source for
# the opt-out env var (HERMES_DOCKER_EXEC_AS_ROOT=1).
COPY --chmod=0755 docker/hermes-exec-shim.sh /opt/hermes/bin/hermes
# Pre-s6 entrypoint.sh did `source .venv/bin/activate` which exported
# the venv bin onto PATH; Architecture B's main-wrapper.sh does the
# same for the container's main process, but `docker exec` and our
# cont-init.d scripts don't pass through the wrapper. Expose the venv
# bin globally so `docker exec <container> hermes ...` and any
# subprocess that doesn't activate the venv first still find hermes.
ENV PATH="/opt/hermes/.venv/bin:/opt/data/.local/bin:${PATH}"
#
# /opt/hermes/bin is prepended ahead of the venv so the privilege-drop
# shim wins PATH resolution. The shim's last act is to exec the venv
# binary by absolute path, so this PATH ordering is transparent to
# every other consumer.
ENV PATH="/opt/hermes/bin:/opt/hermes/.venv/bin:/opt/data/.local/bin:${PATH}"
RUN mkdir -p /opt/data
VOLUME [ "/opt/data" ]

View file

@ -0,0 +1,87 @@
#!/bin/sh
# shellcheck shell=sh
# /opt/hermes/bin/hermes — `docker exec` privilege-drop shim.
#
# Background
# ----------
# The s6 image runs the supervised gateway/main process as the unprivileged
# `hermes` user (UID 10000). When an operator runs `docker exec <c> hermes ...`
# the default UID is root (0), and any file the command writes under
# $HERMES_HOME — auth.json, .env, config.yaml — ends up root-owned and
# unreadable to the supervised gateway. The most common manifestation: the
# user runs `docker exec <c> hermes login`, this writes
# /opt/data/auth.json as root:root mode 0600, and from then on the gateway
# returns "Provider authentication failed: Hermes is not logged into Nous
# Portal" on every incoming message — even though `docker exec <c> hermes
# chat -q ping` (also running as root) succeeds because root happens to be
# able to read its own root-owned file. See systematic-debugging skill
# notes attached to this fix.
#
# Fix
# ---
# This shim sits at /opt/hermes/bin/hermes and is placed earliest on PATH.
# When invoked as root, it drops to the hermes user (via s6-setuidgid)
# before exec'ing the real venv binary, so anything that writes under
# $HERMES_HOME is uid-aligned with the supervised processes. When invoked
# as any non-root UID — including the supervised processes themselves,
# `docker exec --user hermes`, kanban subagents, etc. — it short-circuits
# straight to the venv binary with no privilege change. Net: one extra
# fork on the docker-exec-as-root path, zero behavioral change on every
# other path.
#
# Recursion safety: the shim exec's the venv binary by *absolute path*
# (/opt/hermes/.venv/bin/hermes), so the second hop cannot re-enter this
# shim regardless of PATH state. No sentinel env var needed.
#
# Opt-out: set HERMES_DOCKER_EXEC_AS_ROOT=1 (1/true/yes, case-insensitive)
# to keep running as root. Reserved for diagnostic sessions where the
# operator deliberately wants root semantics — e.g. inspecting root-only
# state via the hermes CLI. Default is to drop.
set -e
REAL=/opt/hermes/.venv/bin/hermes
# Defensive: if the venv binary is missing (corrupted image, partial
# install), fail loudly rather than silently masking it.
if [ ! -x "$REAL" ]; then
echo "hermes-shim: $REAL not found or not executable" >&2
exit 127
fi
# Already non-root? Just exec the real binary. This is the hot path for
# supervised processes (uid 10000) and for `docker exec --user hermes`.
if [ "$(id -u)" != "0" ]; then
exec "$REAL" "$@"
fi
# Root, with opt-out set? Honor it.
case "${HERMES_DOCKER_EXEC_AS_ROOT:-}" in
1|true|TRUE|True|yes|YES|Yes)
exec "$REAL" "$@"
;;
esac
# Root, no opt-out. Drop to the hermes user.
#
# s6-setuidgid lives under /command/ which is NOT on `docker exec`'s PATH
# (s6-overlay only puts /command/ on PATH for supervision-tree children).
# Reference it by absolute path so the drop is robust against PATH
# manipulation.
S6_SUID=/command/s6-setuidgid
if [ ! -x "$S6_SUID" ]; then
# Non-s6 image (someone stripped s6-overlay, or a hand-built variant).
# Fail loud rather than silently re-execing as root and leaking the
# bug this shim exists to prevent.
echo "hermes-shim: $S6_SUID not found; refusing to silently run as root." >&2
echo "hermes-shim: re-run with --user hermes or set HERMES_DOCKER_EXEC_AS_ROOT=1." >&2
exit 126
fi
# Reset HOME to the hermes user's home before dropping privileges. Without
# this, $HOME stays /root and any library that resolves paths off $HOME
# (XDG caches, lockfiles, .config writes) will try to write to /root and
# fail with EACCES. Mirrors main-wrapper.sh.
export HOME=/opt/data
exec "$S6_SUID" hermes "$REAL" "$@"

View file

@ -0,0 +1,290 @@
"""Regression tests for the docker-exec privilege-drop shim.
The shim (docker/hermes-exec-shim.sh, installed at /opt/hermes/bin/hermes)
exists to prevent the auth.json ownership-mismatch bug where
`docker exec <c> hermes login` would write /opt/data/auth.json as
root:root mode 0600, leaving the supervised gateway (UID 10000) unable
to read its own credentials and returning "Provider authentication
failed: Hermes is not logged into Nous Portal" on every message.
These tests verify:
1. ``docker exec <c> hermes `` (defaulting to root) gets dropped to the
hermes user before the real binary runs.
2. ``docker exec --user hermes <c> hermes `` (already non-root) short-
circuits and doesn't try to drop again.
3. Files written under $HERMES_HOME from a ``docker exec`` session land
as hermes:hermes the actual user-visible invariant.
4. The HERMES_DOCKER_EXEC_AS_ROOT opt-out lets diagnostic sessions keep
running as root deliberately.
5. The main CMD path (``docker run <image> ``) is unaffected by the
PATH-shim ordering no recursion, no behavior change.
"""
from __future__ import annotations
import subprocess
import time
from collections.abc import Iterator
import pytest
# How long to give a `docker run -d` container before declaring it not ready.
_RUN_READY_TIMEOUT_S = 20
def _wait_for_init(container: str) -> None:
"""Block until /init is up enough that `docker exec` is responsive."""
deadline = time.time() + _RUN_READY_TIMEOUT_S
while time.time() < deadline:
r = subprocess.run(
["docker", "exec", container, "true"],
capture_output=True, timeout=5,
)
if r.returncode == 0:
return
time.sleep(0.2)
pytest.fail(f"container {container} not responsive to docker exec within {_RUN_READY_TIMEOUT_S}s")
@pytest.fixture
def sleep_container(built_image: str, container_name: str) -> Iterator[str]:
"""Long-lived container running `sleep infinity` so we can docker exec into it."""
subprocess.run(
["docker", "rm", "-f", container_name],
capture_output=True, check=False,
)
r = subprocess.run(
["docker", "run", "-d", "--name", container_name, built_image,
"sleep", "infinity"],
capture_output=True, text=True, timeout=30,
)
assert r.returncode == 0, f"docker run failed: {r.stderr}"
try:
_wait_for_init(container_name)
yield container_name
finally:
subprocess.run(
["docker", "rm", "-f", container_name],
capture_output=True, check=False,
)
def test_shim_drops_root_to_hermes_uid(sleep_container: str) -> None:
"""docker exec defaults to root; the shim should drop to uid 10000.
We invoke `hermes` with a Python-style `-c` shim equivalent there's no
pure-hermes "print my uid" command, so we use the venv's python directly
via the shim's PATH lookup: `python -c 'print(os.getuid())'` is resolved
through the venv. But that bypasses the shim. Instead, we exploit the
fact that the venv's `hermes` is a console_scripts entry — under the
hood it's a tiny Python wrapper. We can't easily inject "print my uid"
into it without forking subcommands. Simplest approach: have `hermes`
do anything that writes to disk, then check the file's owner.
Use `hermes config set` which writes config.yaml under HERMES_HOME.
The resulting file ownership tells us what UID the shim ended up at.
"""
# Wipe any prior state.
subprocess.run(
["docker", "exec", "--user", "root", sleep_container,
"rm", "-f", "/opt/data/config.yaml"],
capture_output=True, check=False,
)
# Default docker exec (root) — should be dropped by the shim.
r = subprocess.run(
["docker", "exec", sleep_container,
"hermes", "config", "set", "_test.shim_marker", "1"],
capture_output=True, text=True, timeout=30,
)
assert r.returncode == 0, f"config set failed: stdout={r.stdout!r} stderr={r.stderr!r}"
# The written file must be owned by hermes, not root.
r = subprocess.run(
["docker", "exec", sleep_container,
"stat", "-c", "%U:%G", "/opt/data/config.yaml"],
capture_output=True, text=True, timeout=10,
)
assert r.returncode == 0, f"stat failed: {r.stderr}"
assert r.stdout.strip() == "hermes:hermes", (
f"config.yaml owned by {r.stdout.strip()!r}, expected hermes:hermes. "
"The shim did not drop privileges before invoking hermes."
)
def test_shim_short_circuits_for_non_root_exec(sleep_container: str) -> None:
"""docker exec --user hermes already runs as 10000; shim should be a no-op.
Verified indirectly: the command must still succeed end-to-end. If the
shim incorrectly tried to drop privileges a second time (e.g. by
invoking s6-setuidgid which requires root), it would fail with
EPERM. A clean success proves the short-circuit fired.
"""
subprocess.run(
["docker", "exec", "--user", "root", sleep_container,
"rm", "-f", "/opt/data/config.yaml"],
capture_output=True, check=False,
)
r = subprocess.run(
["docker", "exec", "--user", "hermes", sleep_container,
"hermes", "config", "set", "_test.shim_short_circuit", "1"],
capture_output=True, text=True, timeout=30,
)
assert r.returncode == 0, (
f"docker exec --user hermes failed: {r.stderr!r} stdout={r.stdout!r}. "
"If the shim mis-handled the non-root path, this would fail with EPERM."
)
# File still ends up hermes:hermes — orthogonally confirms uid.
r = subprocess.run(
["docker", "exec", sleep_container,
"stat", "-c", "%U:%G", "/opt/data/config.yaml"],
capture_output=True, text=True, timeout=10,
)
assert r.stdout.strip() == "hermes:hermes"
def test_shim_opt_out_keeps_root(sleep_container: str) -> None:
"""HERMES_DOCKER_EXEC_AS_ROOT=1 should suppress the privilege drop.
Reserved for diagnostic sessions where the operator deliberately
wants root semantics. Verified by writing a file and checking its
owner.
"""
subprocess.run(
["docker", "exec", "--user", "root", sleep_container,
"rm", "-f", "/opt/data/config.yaml"],
capture_output=True, check=False,
)
r = subprocess.run(
["docker", "exec",
"-e", "HERMES_DOCKER_EXEC_AS_ROOT=1",
sleep_container,
"hermes", "config", "set", "_test.opt_out", "1"],
capture_output=True, text=True, timeout=30,
)
assert r.returncode == 0, f"opt-out invocation failed: {r.stderr}"
r = subprocess.run(
["docker", "exec", sleep_container,
"stat", "-c", "%U:%G", "/opt/data/config.yaml"],
capture_output=True, text=True, timeout=10,
)
assert r.stdout.strip() == "root:root", (
f"With HERMES_DOCKER_EXEC_AS_ROOT=1, expected root:root, "
f"got {r.stdout.strip()!r}"
)
@pytest.mark.parametrize("falsy_value", ["0", "false", "no", "", "garbage", "2"])
def test_shim_opt_out_strict_truthiness(
sleep_container: str, falsy_value: str,
) -> None:
"""Anything other than 1/true/yes (case-insensitive) does NOT opt out.
Strict truthiness so a typo (``HERMES_DOCKER_EXEC_AS_ROOT=0``) doesn't
silently keep the user as root. Mirrors the policy used by
``HERMES_GATEWAY_NO_SUPERVISE`` in #33583.
"""
subprocess.run(
["docker", "exec", "--user", "root", sleep_container,
"rm", "-f", "/opt/data/config.yaml"],
capture_output=True, check=False,
)
r = subprocess.run(
["docker", "exec",
"-e", f"HERMES_DOCKER_EXEC_AS_ROOT={falsy_value}",
sleep_container,
"hermes", "config", "set", "_test.falsy", "1"],
capture_output=True, text=True, timeout=30,
)
assert r.returncode == 0, f"falsy value {falsy_value!r} caused failure: {r.stderr}"
r = subprocess.run(
["docker", "exec", sleep_container,
"stat", "-c", "%U:%G", "/opt/data/config.yaml"],
capture_output=True, text=True, timeout=10,
)
assert r.stdout.strip() == "hermes:hermes", (
f"falsy opt-out value {falsy_value!r} unexpectedly suppressed the drop; "
f"file owner is {r.stdout.strip()!r}, expected hermes:hermes"
)
def test_main_cmd_path_unaffected(built_image: str) -> None:
"""The CMD path (docker run <image> <args>) must still work.
The shim sits at /opt/hermes/bin earliest on PATH; main-wrapper.sh
invokes `s6-setuidgid hermes hermes <args>` which resolves `hermes`
through PATH. With the shim in the way, this could regress if the
shim recurses or interferes with TTY/exit-code propagation.
`chat --help` is cheap and exercises the full subcommand
passthrough path. The duplicate of test_main_invocation's
pre-existing test is intentional that one would have passed
pre-shim too; this one specifically guards against shim regressions
in the CMD-as-main-program codepath.
"""
r = subprocess.run(
["docker", "run", "--rm", built_image, "chat", "--help"],
capture_output=True, text=True, timeout=60,
)
assert r.returncode == 0, f"CMD path broken by shim: stderr={r.stderr!r}"
assert "Traceback" not in r.stderr
def test_e2e_login_then_supervised_gateway_can_read_auth(
sleep_container: str,
) -> None:
"""End-to-end regression for the original bug.
Pre-shim: ``docker exec <c> hermes login`` (root) wrote
/opt/data/auth.json as root:root 0600. The supervised gateway (UID
10000) couldn't read it, _load_auth_store swallowed PermissionError
as a parse failure, and resolve_nous_runtime_credentials raised
"Hermes is not logged into Nous Portal" on every message.
We can't do a real OAuth login in a unit test, but we can stand in
for it by writing the same file shape via `hermes config set`-style
writes what matters is the *file ownership invariant* downstream
of `_save_auth_store`. If the shim works, every file the
`docker exec` path produces is hermes-readable.
Specifically: pretend the operator ran `hermes login` (writes
auth.json) and verify (a) the file exists and (b) it's readable by
the hermes UID. We use `hermes auth list` since that touches the
auth store on the read side and would fail with the same
'not logged in' shape if the file was unreadable to uid 10000.
"""
# Have the shim-protected `docker exec` write the auth store.
# `hermes auth list` is read-only but still exercises _load_auth_store
# under the shim's UID. We invoke `hermes config set` first to
# provoke a write into HERMES_HOME so we have something concrete to
# owner-check.
r = subprocess.run(
["docker", "exec", sleep_container,
"hermes", "config", "set", "_test.e2e_marker", "1"],
capture_output=True, text=True, timeout=30,
)
assert r.returncode == 0, f"config set failed: {r.stderr}"
# The supervised UID (10000) must be able to read everything under
# HERMES_HOME that docker exec just wrote.
r = subprocess.run(
["docker", "exec", "--user", "hermes", sleep_container,
"find", "/opt/data", "-maxdepth", "2", "-type", "f",
"!", "-readable", "-print"],
capture_output=True, text=True, timeout=15,
)
assert r.returncode == 0, f"find failed: {r.stderr}"
unreadable = [ln for ln in r.stdout.splitlines() if ln.strip()]
assert not unreadable, (
"Files written by `docker exec` are unreadable to the hermes user "
f"(supervised gateway UID): {unreadable}. The shim failed to drop "
"privileges before the write."
)