From aeb992d34374bf07f34811a3e8a11e623d1cc16b Mon Sep 17 00:00:00 2001 From: Ben Barclay Date: Thu, 28 May 2026 13:27:26 +1000 Subject: [PATCH] fix(docker): drop `docker exec` to hermes uid before invoking the CLI MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When operators ran `docker exec hermes login` (or anything else that wrote under $HERMES_HOME) they defaulted to root, leaving /opt/data/auth.json root:root mode 0600. The supervised gateway (UID 10000) then couldn't read its own credentials and returned "Provider authentication failed: Hermes is not logged into Nous Portal" on every Telegram/Discord/etc. message — even though `docker exec hermes chat -q ping` (also root) succeeded because root could read its own root-owned file. _load_auth_store swallowed PermissionError as a parse failure and copied the file aside as auth.json.corrupt, making the diagnostic more misleading. Fix: install a privilege-drop shim at /opt/hermes/bin/hermes, prepended ahead of the venv on PATH. When invoked as root the shim exec's the real venv binary via `s6-setuidgid hermes` — so any file the docker-exec session writes is uid-aligned with the supervised processes. Non-root callers (the supervised processes themselves, `docker exec --user hermes`, kanban subagents, anything inside the container that's not coming through docker-exec) hit a single exec to the absolute venv path with no privilege change. Recursion is impossible: the shim exec's the venv binary by absolute path (/opt/hermes/.venv/bin/hermes), so the second hop cannot re-enter the shim regardless of PATH state. No sentinel env var needed (unlike #33583's gateway-run redirect which DOES need HERMES_S6_SUPERVISED_CHILD because there's no absolute-path equivalent for the s6 dispatch). Opt-out: `docker exec -e HERMES_DOCKER_EXEC_AS_ROOT=1 …` for diagnostic sessions where the operator deliberately wants root. Strict truthiness (1/true/yes case-insensitive); typos like `=0` do not silently opt out, mirroring HERMES_GATEWAY_NO_SUPERVISE in #33583. If `s6-setuidgid` is missing (someone stripped s6-overlay in a downstream fork), the shim exits 126 with a remediation message pointing at `--user hermes` and the opt-out — never silently runs as root. Test plan: - tests/docker/test_docker_exec_privilege_drop.py — 11 tests - shim drops root to hermes uid (file ownership check) - shim short-circuits for non-root docker exec - HERMES_DOCKER_EXEC_AS_ROOT=1 keeps root - strict-truthiness parametrization (5 falsy values reject) - main CMD path unaffected (recursion guard) - E2E: every file written by docker-exec is readable by uid 10000 - Full tests/docker/ harness: 32/32 pass against fresh image build - shellcheck --severity=error: clean - hadolint: clean - Manual: reproduced the original symptom (root-owned auth.json) by bypassing the shim; confirmed default docker-exec produces hermes-owned files; confirmed opt-out env keeps root semantics. Known follow-up: this prevents NEW instances of the bug. Volumes that already have root:root /opt/data/auth.json from a pre-shim image need a one-time `chown hermes:hermes` before rebooting onto the new image. A stage2-hook chown sweep can self-heal that, but is deferred per scope decision. --- Dockerfile | 21 +- docker/hermes-exec-shim.sh | 87 ++++++ .../docker/test_docker_exec_privilege_drop.py | 290 ++++++++++++++++++ 3 files changed, 397 insertions(+), 1 deletion(-) create mode 100644 docker/hermes-exec-shim.sh create mode 100644 tests/docker/test_docker_exec_privilege_drop.py diff --git a/Dockerfile b/Dockerfile index 0a1ed56b47f..ce41fb69ab7 100644 --- a/Dockerfile +++ b/Dockerfile @@ -213,13 +213,32 @@ COPY --chmod=0755 docker/cont-init.d/02-reconcile-profiles /etc/cont-init.d/02-r # ---------- Runtime ---------- ENV HERMES_WEB_DIST=/opt/hermes/hermes_cli/web_dist ENV HERMES_HOME=/opt/data + +# `docker exec` privilege-drop shim. When operators run +# `docker exec hermes ...` they default to root, and any file the +# command writes under $HERMES_HOME (auth.json, .env, config.yaml) ends +# up root-owned and unreadable to the supervised gateway (UID 10000). +# The shim lives at /opt/hermes/bin/hermes, sits earliest on PATH, and +# transparently re-exec's the real venv binary via `s6-setuidgid hermes` +# when invoked as root. Non-root callers (supervised processes, +# `--user hermes`, etc.) hit the short-circuit path with no overhead. +# Recursion is impossible because the shim exec's the venv binary by +# absolute path (/opt/hermes/.venv/bin/hermes). See the shim source for +# the opt-out env var (HERMES_DOCKER_EXEC_AS_ROOT=1). +COPY --chmod=0755 docker/hermes-exec-shim.sh /opt/hermes/bin/hermes + # Pre-s6 entrypoint.sh did `source .venv/bin/activate` which exported # the venv bin onto PATH; Architecture B's main-wrapper.sh does the # same for the container's main process, but `docker exec` and our # cont-init.d scripts don't pass through the wrapper. Expose the venv # bin globally so `docker exec hermes ...` and any # subprocess that doesn't activate the venv first still find hermes. -ENV PATH="/opt/hermes/.venv/bin:/opt/data/.local/bin:${PATH}" +# +# /opt/hermes/bin is prepended ahead of the venv so the privilege-drop +# shim wins PATH resolution. The shim's last act is to exec the venv +# binary by absolute path, so this PATH ordering is transparent to +# every other consumer. +ENV PATH="/opt/hermes/bin:/opt/hermes/.venv/bin:/opt/data/.local/bin:${PATH}" RUN mkdir -p /opt/data VOLUME [ "/opt/data" ] diff --git a/docker/hermes-exec-shim.sh b/docker/hermes-exec-shim.sh new file mode 100644 index 00000000000..7f4c5c3c0a0 --- /dev/null +++ b/docker/hermes-exec-shim.sh @@ -0,0 +1,87 @@ +#!/bin/sh +# shellcheck shell=sh +# /opt/hermes/bin/hermes — `docker exec` privilege-drop shim. +# +# Background +# ---------- +# The s6 image runs the supervised gateway/main process as the unprivileged +# `hermes` user (UID 10000). When an operator runs `docker exec hermes ...` +# the default UID is root (0), and any file the command writes under +# $HERMES_HOME — auth.json, .env, config.yaml — ends up root-owned and +# unreadable to the supervised gateway. The most common manifestation: the +# user runs `docker exec hermes login`, this writes +# /opt/data/auth.json as root:root mode 0600, and from then on the gateway +# returns "Provider authentication failed: Hermes is not logged into Nous +# Portal" on every incoming message — even though `docker exec hermes +# chat -q ping` (also running as root) succeeds because root happens to be +# able to read its own root-owned file. See systematic-debugging skill +# notes attached to this fix. +# +# Fix +# --- +# This shim sits at /opt/hermes/bin/hermes and is placed earliest on PATH. +# When invoked as root, it drops to the hermes user (via s6-setuidgid) +# before exec'ing the real venv binary, so anything that writes under +# $HERMES_HOME is uid-aligned with the supervised processes. When invoked +# as any non-root UID — including the supervised processes themselves, +# `docker exec --user hermes`, kanban subagents, etc. — it short-circuits +# straight to the venv binary with no privilege change. Net: one extra +# fork on the docker-exec-as-root path, zero behavioral change on every +# other path. +# +# Recursion safety: the shim exec's the venv binary by *absolute path* +# (/opt/hermes/.venv/bin/hermes), so the second hop cannot re-enter this +# shim regardless of PATH state. No sentinel env var needed. +# +# Opt-out: set HERMES_DOCKER_EXEC_AS_ROOT=1 (1/true/yes, case-insensitive) +# to keep running as root. Reserved for diagnostic sessions where the +# operator deliberately wants root semantics — e.g. inspecting root-only +# state via the hermes CLI. Default is to drop. + +set -e + +REAL=/opt/hermes/.venv/bin/hermes + +# Defensive: if the venv binary is missing (corrupted image, partial +# install), fail loudly rather than silently masking it. +if [ ! -x "$REAL" ]; then + echo "hermes-shim: $REAL not found or not executable" >&2 + exit 127 +fi + +# Already non-root? Just exec the real binary. This is the hot path for +# supervised processes (uid 10000) and for `docker exec --user hermes`. +if [ "$(id -u)" != "0" ]; then + exec "$REAL" "$@" +fi + +# Root, with opt-out set? Honor it. +case "${HERMES_DOCKER_EXEC_AS_ROOT:-}" in + 1|true|TRUE|True|yes|YES|Yes) + exec "$REAL" "$@" + ;; +esac + +# Root, no opt-out. Drop to the hermes user. +# +# s6-setuidgid lives under /command/ which is NOT on `docker exec`'s PATH +# (s6-overlay only puts /command/ on PATH for supervision-tree children). +# Reference it by absolute path so the drop is robust against PATH +# manipulation. +S6_SUID=/command/s6-setuidgid +if [ ! -x "$S6_SUID" ]; then + # Non-s6 image (someone stripped s6-overlay, or a hand-built variant). + # Fail loud rather than silently re-execing as root and leaking the + # bug this shim exists to prevent. + echo "hermes-shim: $S6_SUID not found; refusing to silently run as root." >&2 + echo "hermes-shim: re-run with --user hermes or set HERMES_DOCKER_EXEC_AS_ROOT=1." >&2 + exit 126 +fi + +# Reset HOME to the hermes user's home before dropping privileges. Without +# this, $HOME stays /root and any library that resolves paths off $HOME +# (XDG caches, lockfiles, .config writes) will try to write to /root and +# fail with EACCES. Mirrors main-wrapper.sh. +export HOME=/opt/data + +exec "$S6_SUID" hermes "$REAL" "$@" diff --git a/tests/docker/test_docker_exec_privilege_drop.py b/tests/docker/test_docker_exec_privilege_drop.py new file mode 100644 index 00000000000..745848938a3 --- /dev/null +++ b/tests/docker/test_docker_exec_privilege_drop.py @@ -0,0 +1,290 @@ +"""Regression tests for the docker-exec privilege-drop shim. + +The shim (docker/hermes-exec-shim.sh, installed at /opt/hermes/bin/hermes) +exists to prevent the auth.json ownership-mismatch bug where +`docker exec hermes login` would write /opt/data/auth.json as +root:root mode 0600, leaving the supervised gateway (UID 10000) unable +to read its own credentials and returning "Provider authentication +failed: Hermes is not logged into Nous Portal" on every message. + +These tests verify: + +1. ``docker exec hermes …`` (defaulting to root) gets dropped to the + hermes user before the real binary runs. +2. ``docker exec --user hermes hermes …`` (already non-root) short- + circuits and doesn't try to drop again. +3. Files written under $HERMES_HOME from a ``docker exec`` session land + as hermes:hermes — the actual user-visible invariant. +4. The HERMES_DOCKER_EXEC_AS_ROOT opt-out lets diagnostic sessions keep + running as root deliberately. +5. The main CMD path (``docker run …``) is unaffected by the + PATH-shim ordering — no recursion, no behavior change. +""" + +from __future__ import annotations + +import subprocess +import time +from collections.abc import Iterator + +import pytest + + +# How long to give a `docker run -d` container before declaring it not ready. +_RUN_READY_TIMEOUT_S = 20 + + +def _wait_for_init(container: str) -> None: + """Block until /init is up enough that `docker exec` is responsive.""" + deadline = time.time() + _RUN_READY_TIMEOUT_S + while time.time() < deadline: + r = subprocess.run( + ["docker", "exec", container, "true"], + capture_output=True, timeout=5, + ) + if r.returncode == 0: + return + time.sleep(0.2) + pytest.fail(f"container {container} not responsive to docker exec within {_RUN_READY_TIMEOUT_S}s") + + +@pytest.fixture +def sleep_container(built_image: str, container_name: str) -> Iterator[str]: + """Long-lived container running `sleep infinity` so we can docker exec into it.""" + subprocess.run( + ["docker", "rm", "-f", container_name], + capture_output=True, check=False, + ) + r = subprocess.run( + ["docker", "run", "-d", "--name", container_name, built_image, + "sleep", "infinity"], + capture_output=True, text=True, timeout=30, + ) + assert r.returncode == 0, f"docker run failed: {r.stderr}" + try: + _wait_for_init(container_name) + yield container_name + finally: + subprocess.run( + ["docker", "rm", "-f", container_name], + capture_output=True, check=False, + ) + + +def test_shim_drops_root_to_hermes_uid(sleep_container: str) -> None: + """docker exec defaults to root; the shim should drop to uid 10000. + + We invoke `hermes` with a Python-style `-c` shim equivalent — there's no + pure-hermes "print my uid" command, so we use the venv's python directly + via the shim's PATH lookup: `python -c 'print(os.getuid())'` is resolved + through the venv. But that bypasses the shim. Instead, we exploit the + fact that the venv's `hermes` is a console_scripts entry — under the + hood it's a tiny Python wrapper. We can't easily inject "print my uid" + into it without forking subcommands. Simplest approach: have `hermes` + do anything that writes to disk, then check the file's owner. + + Use `hermes config set` which writes config.yaml under HERMES_HOME. + The resulting file ownership tells us what UID the shim ended up at. + """ + # Wipe any prior state. + subprocess.run( + ["docker", "exec", "--user", "root", sleep_container, + "rm", "-f", "/opt/data/config.yaml"], + capture_output=True, check=False, + ) + + # Default docker exec (root) — should be dropped by the shim. + r = subprocess.run( + ["docker", "exec", sleep_container, + "hermes", "config", "set", "_test.shim_marker", "1"], + capture_output=True, text=True, timeout=30, + ) + assert r.returncode == 0, f"config set failed: stdout={r.stdout!r} stderr={r.stderr!r}" + + # The written file must be owned by hermes, not root. + r = subprocess.run( + ["docker", "exec", sleep_container, + "stat", "-c", "%U:%G", "/opt/data/config.yaml"], + capture_output=True, text=True, timeout=10, + ) + assert r.returncode == 0, f"stat failed: {r.stderr}" + assert r.stdout.strip() == "hermes:hermes", ( + f"config.yaml owned by {r.stdout.strip()!r}, expected hermes:hermes. " + "The shim did not drop privileges before invoking hermes." + ) + + +def test_shim_short_circuits_for_non_root_exec(sleep_container: str) -> None: + """docker exec --user hermes already runs as 10000; shim should be a no-op. + + Verified indirectly: the command must still succeed end-to-end. If the + shim incorrectly tried to drop privileges a second time (e.g. by + invoking s6-setuidgid which requires root), it would fail with + EPERM. A clean success proves the short-circuit fired. + """ + subprocess.run( + ["docker", "exec", "--user", "root", sleep_container, + "rm", "-f", "/opt/data/config.yaml"], + capture_output=True, check=False, + ) + + r = subprocess.run( + ["docker", "exec", "--user", "hermes", sleep_container, + "hermes", "config", "set", "_test.shim_short_circuit", "1"], + capture_output=True, text=True, timeout=30, + ) + assert r.returncode == 0, ( + f"docker exec --user hermes failed: {r.stderr!r} stdout={r.stdout!r}. " + "If the shim mis-handled the non-root path, this would fail with EPERM." + ) + + # File still ends up hermes:hermes — orthogonally confirms uid. + r = subprocess.run( + ["docker", "exec", sleep_container, + "stat", "-c", "%U:%G", "/opt/data/config.yaml"], + capture_output=True, text=True, timeout=10, + ) + assert r.stdout.strip() == "hermes:hermes" + + +def test_shim_opt_out_keeps_root(sleep_container: str) -> None: + """HERMES_DOCKER_EXEC_AS_ROOT=1 should suppress the privilege drop. + + Reserved for diagnostic sessions where the operator deliberately + wants root semantics. Verified by writing a file and checking its + owner. + """ + subprocess.run( + ["docker", "exec", "--user", "root", sleep_container, + "rm", "-f", "/opt/data/config.yaml"], + capture_output=True, check=False, + ) + + r = subprocess.run( + ["docker", "exec", + "-e", "HERMES_DOCKER_EXEC_AS_ROOT=1", + sleep_container, + "hermes", "config", "set", "_test.opt_out", "1"], + capture_output=True, text=True, timeout=30, + ) + assert r.returncode == 0, f"opt-out invocation failed: {r.stderr}" + + r = subprocess.run( + ["docker", "exec", sleep_container, + "stat", "-c", "%U:%G", "/opt/data/config.yaml"], + capture_output=True, text=True, timeout=10, + ) + assert r.stdout.strip() == "root:root", ( + f"With HERMES_DOCKER_EXEC_AS_ROOT=1, expected root:root, " + f"got {r.stdout.strip()!r}" + ) + + +@pytest.mark.parametrize("falsy_value", ["0", "false", "no", "", "garbage", "2"]) +def test_shim_opt_out_strict_truthiness( + sleep_container: str, falsy_value: str, +) -> None: + """Anything other than 1/true/yes (case-insensitive) does NOT opt out. + + Strict truthiness so a typo (``HERMES_DOCKER_EXEC_AS_ROOT=0``) doesn't + silently keep the user as root. Mirrors the policy used by + ``HERMES_GATEWAY_NO_SUPERVISE`` in #33583. + """ + subprocess.run( + ["docker", "exec", "--user", "root", sleep_container, + "rm", "-f", "/opt/data/config.yaml"], + capture_output=True, check=False, + ) + + r = subprocess.run( + ["docker", "exec", + "-e", f"HERMES_DOCKER_EXEC_AS_ROOT={falsy_value}", + sleep_container, + "hermes", "config", "set", "_test.falsy", "1"], + capture_output=True, text=True, timeout=30, + ) + assert r.returncode == 0, f"falsy value {falsy_value!r} caused failure: {r.stderr}" + + r = subprocess.run( + ["docker", "exec", sleep_container, + "stat", "-c", "%U:%G", "/opt/data/config.yaml"], + capture_output=True, text=True, timeout=10, + ) + assert r.stdout.strip() == "hermes:hermes", ( + f"falsy opt-out value {falsy_value!r} unexpectedly suppressed the drop; " + f"file owner is {r.stdout.strip()!r}, expected hermes:hermes" + ) + + +def test_main_cmd_path_unaffected(built_image: str) -> None: + """The CMD path (docker run ) must still work. + + The shim sits at /opt/hermes/bin earliest on PATH; main-wrapper.sh + invokes `s6-setuidgid hermes hermes ` which resolves `hermes` + through PATH. With the shim in the way, this could regress if the + shim recurses or interferes with TTY/exit-code propagation. + + `chat --help` is cheap and exercises the full subcommand + passthrough path. The duplicate of test_main_invocation's + pre-existing test is intentional — that one would have passed + pre-shim too; this one specifically guards against shim regressions + in the CMD-as-main-program codepath. + """ + r = subprocess.run( + ["docker", "run", "--rm", built_image, "chat", "--help"], + capture_output=True, text=True, timeout=60, + ) + assert r.returncode == 0, f"CMD path broken by shim: stderr={r.stderr!r}" + assert "Traceback" not in r.stderr + + +def test_e2e_login_then_supervised_gateway_can_read_auth( + sleep_container: str, +) -> None: + """End-to-end regression for the original bug. + + Pre-shim: ``docker exec hermes login`` (root) wrote + /opt/data/auth.json as root:root 0600. The supervised gateway (UID + 10000) couldn't read it, _load_auth_store swallowed PermissionError + as a parse failure, and resolve_nous_runtime_credentials raised + "Hermes is not logged into Nous Portal" on every message. + + We can't do a real OAuth login in a unit test, but we can stand in + for it by writing the same file shape via `hermes config set`-style + writes — what matters is the *file ownership invariant* downstream + of `_save_auth_store`. If the shim works, every file the + `docker exec` path produces is hermes-readable. + + Specifically: pretend the operator ran `hermes login` (writes + auth.json) and verify (a) the file exists and (b) it's readable by + the hermes UID. We use `hermes auth list` since that touches the + auth store on the read side and would fail with the same + 'not logged in' shape if the file was unreadable to uid 10000. + """ + # Have the shim-protected `docker exec` write the auth store. + # `hermes auth list` is read-only but still exercises _load_auth_store + # under the shim's UID. We invoke `hermes config set` first to + # provoke a write into HERMES_HOME so we have something concrete to + # owner-check. + r = subprocess.run( + ["docker", "exec", sleep_container, + "hermes", "config", "set", "_test.e2e_marker", "1"], + capture_output=True, text=True, timeout=30, + ) + assert r.returncode == 0, f"config set failed: {r.stderr}" + + # The supervised UID (10000) must be able to read everything under + # HERMES_HOME that docker exec just wrote. + r = subprocess.run( + ["docker", "exec", "--user", "hermes", sleep_container, + "find", "/opt/data", "-maxdepth", "2", "-type", "f", + "!", "-readable", "-print"], + capture_output=True, text=True, timeout=15, + ) + assert r.returncode == 0, f"find failed: {r.stderr}" + unreadable = [ln for ln in r.stdout.splitlines() if ln.strip()] + assert not unreadable, ( + "Files written by `docker exec` are unreadable to the hermes user " + f"(supervised gateway UID): {unreadable}. The shim failed to drop " + "privileges before the write." + )