diff --git a/docs/specs/container-cli-review-fixes.md b/docs/specs/container-cli-review-fixes.md new file mode 100644 index 000000000..0eb9070db --- /dev/null +++ b/docs/specs/container-cli-review-fixes.md @@ -0,0 +1,329 @@ +# Container-Aware CLI Review Fixes Spec + +**PR:** NousResearch/hermes-agent#7543 +**Review:** cursor[bot] bugbot review (4094049442) + two prior rounds +**Date:** 2026-04-12 +**Branch:** `feat/container-aware-cli-clean` + +## Review Issues Summary + +Six issues were raised across three bugbot review rounds. Three were fixed in intermediate commits (38277a6a, 726cf90f). This spec addresses remaining design concerns surfaced by those reviews and simplifies the implementation based on interview decisions. + +| # | Issue | Severity | Status | +|---|-------|----------|--------| +| 1 | `os.execvp` retry loop unreachable | Medium | Fixed in 79e8cd12 (switched to subprocess.run) | +| 2 | Redundant `shutil.which("sudo")` | Medium | Fixed in 38277a6a (reuses `sudo` var) | +| 3 | Missing `chown -h` on symlink update | Low | Fixed in 38277a6a | +| 4 | Container routing after `parse_args()` | High | Fixed in 726cf90f | +| 5 | Hardcoded `/home/${user}` | Medium | Fixed in 726cf90f | +| 6 | Group membership not gated on `container.enable` | Low | Fixed in 726cf90f | + +The mechanical fixes are in place but the overall design needs revision. The retry loop, error swallowing, and process model have deeper issues than what the bugbot flagged. + +--- + +## Spec: Revised `_exec_in_container` + +### Design Principles + +1. **Let it crash.** No silent fallbacks. If `.container-mode` exists but something goes wrong, the error propagates naturally (Python traceback). The only case where container routing is skipped is when `.container-mode` doesn't exist or `HERMES_DEV=1`. +2. **No retries.** Probe once for sudo, exec once. If it fails, docker/podman's stderr reaches the user verbatim. +3. **Completely transparent.** No error wrapping, no prefixes, no spinners. Docker's output goes straight through. +4. **`os.execvp` on the happy path.** Replace the Python process entirely so there's no idle parent during interactive sessions. Note: `execvp` never returns on success (process is replaced) and raises `OSError` on failure (it does not return a value). The container process's exit code becomes the process exit code by definition — no explicit propagation needed. +5. **One human-readable exception to "let it crash".** `subprocess.TimeoutExpired` from the sudo probe gets a specific catch with a readable message, since a raw traceback for "your Docker daemon is slow" is confusing. All other exceptions propagate naturally. + +### Execution Flow + +``` +1. get_container_exec_info() + - HERMES_DEV=1 → return None (skip routing) + - Inside container → return None (skip routing) + - .container-mode doesn't exist → return None (skip routing) + - .container-mode exists → parse and return dict + - .container-mode exists but malformed/unreadable → LET IT CRASH (no try/except) + +2. _exec_in_container(container_info, sys.argv[1:]) + a. shutil.which(backend) → if None, print "{backend} not found on PATH" and sys.exit(1) + b. Sudo probe: subprocess.run([runtime, "inspect", "--format", "ok", container_name], timeout=15) + - If succeeds → needs_sudo = False + - If fails → try subprocess.run([sudo, "-n", runtime, "inspect", ...], timeout=15) + - If succeeds → needs_sudo = True + - If fails → print error with sudoers hint (including why -n is required) and sys.exit(1) + - If TimeoutExpired → catch specifically, print human-readable message about slow daemon + c. Build exec_cmd: [sudo? + runtime, "exec", tty_flags, "-u", exec_user, env_flags, container, hermes_bin, *cli_args] + d. os.execvp(exec_cmd[0], exec_cmd) + - On success: process is replaced — Python is gone, container exit code IS the process exit code + - On OSError: let it crash (natural traceback) +``` + +### Changes to `hermes_cli/main.py` + +#### `_exec_in_container` — rewrite + +Remove: +- The entire retry loop (`max_retries`, `for attempt in range(...)`) +- Spinner logic (`"Waiting for container..."`, dots) +- Exit code classification (125/126/127 handling) +- `subprocess.run` for the exec call (keep it only for the sudo probe) +- Special TTY vs non-TTY retry counts +- The `time` import (no longer needed) + +Change: +- Use `os.execvp(exec_cmd[0], exec_cmd)` as the final call +- Keep the `subprocess` import only for the sudo probe +- Keep TTY detection for the `-it` vs `-i` flag +- Keep env var forwarding (TERM, COLORTERM, LANG, LC_ALL) +- Keep the sudo probe as-is (it's the one "smart" part) +- Bump probe `timeout` from 5s to 15s — cold podman on a loaded machine needs headroom +- Catch `subprocess.TimeoutExpired` specifically on both probe calls — print a readable message about the daemon being unresponsive instead of a raw traceback +- Expand the sudoers hint error message to explain *why* `-n` (non-interactive) is required: a password prompt would hang the CLI or break piped commands + +The function becomes roughly: + +```python +def _exec_in_container(container_info: dict, cli_args: list): + """Replace the current process with a command inside the managed container. + + Probes whether sudo is needed (rootful containers), then os.execvp + into the container. If exec fails, the OS error propagates naturally. + """ + import shutil + import subprocess + + backend = container_info["backend"] + container_name = container_info["container_name"] + exec_user = container_info["exec_user"] + hermes_bin = container_info["hermes_bin"] + + runtime = shutil.which(backend) + if not runtime: + print(f"Error: {backend} not found on PATH. Cannot route to container.", + file=sys.stderr) + sys.exit(1) + + # Probe whether we need sudo to see the rootful container. + # Timeout is 15s — cold podman on a loaded machine can take a while. + # TimeoutExpired is caught specifically for a human-readable message; + # all other exceptions propagate naturally. + needs_sudo = False + sudo = None + try: + probe = subprocess.run( + [runtime, "inspect", "--format", "ok", container_name], + capture_output=True, text=True, timeout=15, + ) + except subprocess.TimeoutExpired: + print( + f"Error: timed out waiting for {backend} to respond.\n" + f"The {backend} daemon may be unresponsive or starting up.", + file=sys.stderr, + ) + sys.exit(1) + + if probe.returncode != 0: + sudo = shutil.which("sudo") + if sudo: + try: + probe2 = subprocess.run( + [sudo, "-n", runtime, "inspect", "--format", "ok", container_name], + capture_output=True, text=True, timeout=15, + ) + except subprocess.TimeoutExpired: + print( + f"Error: timed out waiting for sudo {backend} to respond.", + file=sys.stderr, + ) + sys.exit(1) + + if probe2.returncode == 0: + needs_sudo = True + else: + print( + f"Error: container '{container_name}' not found via {backend}.\n" + f"\n" + f"The NixOS service runs the container as root. Your user cannot\n" + f"see it because {backend} uses per-user namespaces.\n" + f"\n" + f"Fix: grant passwordless sudo for {backend}. The -n (non-interactive)\n" + f"flag is required because the CLI calls sudo non-interactively —\n" + f"a password prompt would hang or break piped commands:\n" + f"\n" + f' security.sudo.extraRules = [{{\n' + f' users = [ "{os.getenv("USER", "your-user")}" ];\n' + f' commands = [{{ command = "{runtime}"; options = [ "NOPASSWD" ]; }}];\n' + f' }}];\n' + f"\n" + f"Or run: sudo hermes {' '.join(cli_args)}", + file=sys.stderr, + ) + sys.exit(1) + else: + print( + f"Error: container '{container_name}' not found via {backend}.\n" + f"The container may be running under root. Try: sudo hermes {' '.join(cli_args)}", + file=sys.stderr, + ) + sys.exit(1) + + is_tty = sys.stdin.isatty() + tty_flags = ["-it"] if is_tty else ["-i"] + + env_flags = [] + for var in ("TERM", "COLORTERM", "LANG", "LC_ALL"): + val = os.environ.get(var) + if val: + env_flags.extend(["-e", f"{var}={val}"]) + + cmd_prefix = [sudo, "-n", runtime] if needs_sudo else [runtime] + exec_cmd = ( + cmd_prefix + ["exec"] + + tty_flags + + ["-u", exec_user] + + env_flags + + [container_name, hermes_bin] + + cli_args + ) + + # execvp replaces this process entirely — it never returns on success. + # On failure it raises OSError, which propagates naturally. + os.execvp(exec_cmd[0], exec_cmd) +``` + +#### Container routing call site in `main()` — remove try/except + +Current: +```python +try: + from hermes_cli.config import get_container_exec_info + container_info = get_container_exec_info() + if container_info: + _exec_in_container(container_info, sys.argv[1:]) + sys.exit(1) # exec failed if we reach here +except SystemExit: + raise +except Exception: + pass # Container routing unavailable, proceed locally +``` + +Revised: +```python +from hermes_cli.config import get_container_exec_info +container_info = get_container_exec_info() +if container_info: + _exec_in_container(container_info, sys.argv[1:]) + # Unreachable: os.execvp never returns on success (process is replaced) + # and raises OSError on failure (which propagates as a traceback). + # This line exists only as a defensive assertion. + sys.exit(1) +``` + +No try/except. If `.container-mode` doesn't exist, `get_container_exec_info()` returns `None` and we skip routing. If it exists but is broken, the exception propagates with a natural traceback. + +Note: `sys.exit(1)` after `_exec_in_container` is dead code in all paths — `os.execvp` either replaces the process or raises. It's kept as a belt-and-suspenders assertion with a comment marking it unreachable, not as actual error handling. + +### Changes to `hermes_cli/config.py` + +#### `get_container_exec_info` — remove inner try/except + +Current code catches `(OSError, IOError)` and returns `None`. This silently hides permission errors, corrupt files, etc. + +Change: Remove the try/except around file reading. Keep the early returns for `HERMES_DEV=1` and `_is_inside_container()`. The `FileNotFoundError` from `open()` when `.container-mode` doesn't exist should still return `None` (this is the "container mode not enabled" case). All other exceptions propagate. + +```python +def get_container_exec_info() -> Optional[dict]: + if os.environ.get("HERMES_DEV") == "1": + return None + if _is_inside_container(): + return None + + container_mode_file = get_hermes_home() / ".container-mode" + + try: + with open(container_mode_file, "r") as f: + # ... parse key=value lines ... + except FileNotFoundError: + return None + # All other exceptions (PermissionError, malformed data, etc.) propagate + + return { ... } +``` + +--- + +## Spec: NixOS Module Changes + +### Symlink creation — simplify to two branches + +Current: 4 branches (symlink exists, directory exists, other file, doesn't exist). + +Revised: 2 branches. + +```bash +if [ -d "${symlinkPath}" ] && [ ! -L "${symlinkPath}" ]; then + # Real directory — back it up, then create symlink + _backup="${symlinkPath}.bak.$(date +%s)" + echo "hermes-agent: backing up existing ${symlinkPath} to $_backup" + mv "${symlinkPath}" "$_backup" +fi +# For everything else (symlink, doesn't exist, etc.) — just force-create +ln -sfn "${target}" "${symlinkPath}" +chown -h ${user}:${cfg.group} "${symlinkPath}" +``` + +`ln -sfn` handles: existing symlink (replaces), doesn't exist (creates), and after the `mv` above (creates). The only case that needs special handling is a real directory, because `ln -sfn` cannot atomically replace a directory. + +Note: there is a theoretical race between the `[ -d ... ]` check and the `mv` (something could create/remove the directory in between). In practice this is a NixOS activation script running as root during `nixos-rebuild switch` — no other process should be touching `~/.hermes` at that moment. Not worth adding locking for. + +### Sudoers — document, don't auto-configure + +Do NOT add `security.sudo.extraRules` to the module. Document the sudoers requirement in the module's description/comments and in the error message the CLI prints when sudo probe fails. + +### Group membership gating — keep as-is + +The fix in 726cf90f (`cfg.container.enable && cfg.container.hostUsers != []`) is correct. Leftover group membership when container mode is disabled is harmless. No cleanup needed. + +--- + +## Spec: Test Rewrite + +The existing test file (`tests/hermes_cli/test_container_aware_cli.py`) has 16 tests. With the simplified exec model, several are obsolete. + +### Tests to keep (update as needed) + +- `test_is_inside_container_dockerenv` — unchanged +- `test_is_inside_container_containerenv` — unchanged +- `test_is_inside_container_cgroup_docker` — unchanged +- `test_is_inside_container_false_on_host` — unchanged +- `test_get_container_exec_info_returns_metadata` — unchanged +- `test_get_container_exec_info_none_inside_container` — unchanged +- `test_get_container_exec_info_none_without_file` — unchanged +- `test_get_container_exec_info_skipped_when_hermes_dev` — unchanged +- `test_get_container_exec_info_not_skipped_when_hermes_dev_zero` — unchanged +- `test_get_container_exec_info_defaults` — unchanged +- `test_get_container_exec_info_docker_backend` — unchanged + +### Tests to add + +- `test_get_container_exec_info_crashes_on_permission_error` — verify that `PermissionError` propagates (no silent `None` return) +- `test_exec_in_container_calls_execvp` — verify `os.execvp` is called with correct args (runtime, tty flags, user, env, container, binary, cli args) +- `test_exec_in_container_sudo_probe_sets_prefix` — verify that when first probe fails and sudo probe succeeds, `os.execvp` is called with `sudo -n` prefix +- `test_exec_in_container_no_runtime_hard_fails` — keep existing, verify `sys.exit(1)` when `shutil.which` returns None +- `test_exec_in_container_non_tty_uses_i_only` — update to check `os.execvp` args instead of `subprocess.run` args +- `test_exec_in_container_probe_timeout_prints_message` — verify that `subprocess.TimeoutExpired` from the probe produces a human-readable error and `sys.exit(1)`, not a raw traceback +- `test_exec_in_container_container_not_running_no_sudo` — verify the path where runtime exists (`shutil.which` returns a path) but probe returns non-zero and no sudo is available. Should print the "container may be running under root" error. This is distinct from `no_runtime_hard_fails` which covers `shutil.which` returning None. + +### Tests to delete + +- `test_exec_in_container_tty_retries_on_container_failure` — retry loop removed +- `test_exec_in_container_non_tty_retries_silently_exits_126` — retry loop removed +- `test_exec_in_container_propagates_hermes_exit_code` — no subprocess.run to check exit codes; execvp replaces the process. Note: exit code propagation still works correctly — when `os.execvp` succeeds, the container's process *becomes* this process, so its exit code is the process exit code by OS semantics. No application code needed, no test needed. A comment in the function docstring documents this intent for future readers. + +--- + +## Out of Scope + +- Auto-configuring sudoers rules in the NixOS module +- Any changes to `get_container_exec_info` parsing logic beyond the try/except narrowing +- Changes to `.container-mode` file format +- Changes to the `HERMES_DEV=1` bypass +- Changes to container detection logic (`_is_inside_container`) diff --git a/hermes_cli/config.py b/hermes_cli/config.py index 67d8c0c13..a4eee56f9 100644 --- a/hermes_cli/config.py +++ b/hermes_cli/config.py @@ -143,6 +143,73 @@ def managed_error(action: str = "modify configuration"): print(format_managed_message(action), file=sys.stderr) +# ============================================================================= +# Container-aware CLI (NixOS container mode) +# ============================================================================= + +def _is_inside_container() -> bool: + """Detect if we're already running inside a Docker/Podman container.""" + # Standard Docker/Podman indicators + if os.path.exists("/.dockerenv"): + return True + # Podman uses /run/.containerenv + if os.path.exists("/run/.containerenv"): + return True + # Check cgroup for container runtime evidence (works for both Docker & Podman) + try: + with open("/proc/1/cgroup", "r") as f: + cgroup = f.read() + if "docker" in cgroup or "podman" in cgroup or "/lxc/" in cgroup: + return True + except OSError: + pass + return False + + +def get_container_exec_info() -> Optional[dict]: + """Read container mode metadata from HERMES_HOME/.container-mode. + + Returns a dict with keys: backend, container_name, exec_user, hermes_bin + or None if container mode is not active, we're already inside the + container, or HERMES_DEV=1 is set. + + The .container-mode file is written by the NixOS activation script when + container.enable = true. It tells the host CLI to exec into the container + instead of running locally. + """ + if os.environ.get("HERMES_DEV") == "1": + return None + + if _is_inside_container(): + return None + + container_mode_file = get_hermes_home() / ".container-mode" + + try: + info = {} + with open(container_mode_file, "r") as f: + for line in f: + line = line.strip() + if "=" in line and not line.startswith("#"): + key, _, value = line.partition("=") + info[key.strip()] = value.strip() + except FileNotFoundError: + return None + # All other exceptions (PermissionError, malformed data, etc.) propagate + + backend = info.get("backend", "docker") + container_name = info.get("container_name", "hermes-agent") + exec_user = info.get("exec_user", "hermes") + hermes_bin = info.get("hermes_bin", "/data/current-package/bin/hermes") + + return { + "backend": backend, + "container_name": container_name, + "exec_user": exec_user, + "hermes_bin": hermes_bin, + } + + # ============================================================================= # Config paths # ============================================================================= diff --git a/hermes_cli/main.py b/hermes_cli/main.py index c74b7945e..be0d4b093 100644 --- a/hermes_cli/main.py +++ b/hermes_cli/main.py @@ -528,6 +528,113 @@ def _resolve_last_cli_session() -> Optional[str]: return None +def _probe_container(cmd: list, backend: str, via_sudo: bool = False): + """Run a container inspect probe, returning the CompletedProcess. + + Catches TimeoutExpired specifically for a human-readable message; + all other exceptions propagate naturally. + """ + try: + return subprocess.run(cmd, capture_output=True, text=True, timeout=15) + except subprocess.TimeoutExpired: + label = f"sudo {backend}" if via_sudo else backend + print( + f"Error: timed out waiting for {label} to respond.\n" + f"The {backend} daemon may be unresponsive or starting up.", + file=sys.stderr, + ) + sys.exit(1) + + +def _exec_in_container(container_info: dict, cli_args: list): + """Replace the current process with a command inside the managed container. + + Probes whether sudo is needed (rootful containers), then os.execvp + into the container. On success the Python process is replaced entirely + and the container's exit code becomes the process exit code (OS semantics). + On failure, OSError propagates naturally. + + Args: + container_info: dict with backend, container_name, exec_user, hermes_bin + cli_args: the original CLI arguments (everything after 'hermes') + """ + import shutil + + backend = container_info["backend"] + container_name = container_info["container_name"] + exec_user = container_info["exec_user"] + hermes_bin = container_info["hermes_bin"] + + runtime = shutil.which(backend) + if not runtime: + print(f"Error: {backend} not found on PATH. Cannot route to container.", + file=sys.stderr) + sys.exit(1) + + # Rootful containers (NixOS systemd service) are invisible to unprivileged + # users — Podman uses per-user namespaces, Docker needs group access. + # Probe whether the runtime can see the container; if not, try via sudo. + sudo_path = None + probe = _probe_container( + [runtime, "inspect", "--format", "ok", container_name], backend, + ) + if probe.returncode != 0: + sudo_path = shutil.which("sudo") + if sudo_path: + probe2 = _probe_container( + [sudo_path, "-n", runtime, "inspect", "--format", "ok", container_name], + backend, via_sudo=True, + ) + if probe2.returncode != 0: + print( + f"Error: container '{container_name}' not found via {backend}.\n" + f"\n" + f"The container is likely running as root. Your user cannot see it\n" + f"because {backend} uses per-user namespaces. Grant passwordless\n" + f"sudo for {backend} — the -n (non-interactive) flag is required\n" + f"because a password prompt would hang or break piped commands.\n" + f"\n" + f"On NixOS:\n" + f"\n" + f' security.sudo.extraRules = [{{\n' + f' users = [ "{os.getenv("USER", "your-user")}" ];\n' + f' commands = [{{ command = "{runtime}"; options = [ "NOPASSWD" ]; }}];\n' + f' }}];\n' + f"\n" + f"Or run: sudo hermes {' '.join(cli_args)}", + file=sys.stderr, + ) + sys.exit(1) + else: + print( + f"Error: container '{container_name}' not found via {backend}.\n" + f"The container may be running under root. Try: sudo hermes {' '.join(cli_args)}", + file=sys.stderr, + ) + sys.exit(1) + + is_tty = sys.stdin.isatty() + tty_flags = ["-it"] if is_tty else ["-i"] + + env_flags = [] + for var in ("TERM", "COLORTERM", "LANG", "LC_ALL"): + val = os.environ.get(var) + if val: + env_flags.extend(["-e", f"{var}={val}"]) + + cmd_prefix = [sudo_path, "-n", runtime] if sudo_path else [runtime] + exec_cmd = ( + cmd_prefix + ["exec"] + + tty_flags + + ["-u", exec_user] + + env_flags + + [container_name, hermes_bin] + + cli_args + ) + + os.execvp(exec_cmd[0], exec_cmd) + + def _resolve_session_by_name_or_id(name_or_id: str) -> Optional[str]: """Resolve a session name (title) or ID to a session ID. @@ -5667,9 +5774,22 @@ Examples: # Pre-process argv so unquoted multi-word session names after -c / -r # are merged into a single token before argparse sees them. # e.g. ``hermes -c Pokemon Agent Dev`` → ``hermes -c 'Pokemon Agent Dev'`` + # ── Container-aware routing ──────────────────────────────────────── + # When NixOS container mode is active, route ALL subcommands into + # the managed container. This MUST run before parse_args() so that + # --help, unrecognised flags, and every subcommand are forwarded + # transparently instead of being intercepted by argparse on the host. + from hermes_cli.config import get_container_exec_info + container_info = get_container_exec_info() + if container_info: + _exec_in_container(container_info, sys.argv[1:]) + # Unreachable: os.execvp never returns on success (process is replaced) + # and raises OSError on failure (which propagates as a traceback). + sys.exit(1) + _processed_argv = _coalesce_session_name_args(sys.argv[1:]) args = parser.parse_args(_processed_argv) - + # Handle --version flag if args.version: cmd_version(args) diff --git a/nix/nixosModules.nix b/nix/nixosModules.nix index b1be031df..75b3dca31 100644 --- a/nix/nixosModules.nix +++ b/nix/nixosModules.nix @@ -499,6 +499,16 @@ default = "ubuntu:24.04"; description = "OCI container image. The container pulls this at runtime via Docker/Podman."; }; + + hostUsers = mkOption { + type = types.listOf types.str; + default = [ ]; + description = '' + Interactive users who get a ~/.hermes symlink to the service + stateDir. These users are automatically added to the hermes group. + ''; + example = [ "sidbin" ]; + }; }; }; @@ -557,6 +567,25 @@ environment.variables.HERMES_HOME = "${cfg.stateDir}/.hermes"; }) + # ── Host user group membership ───────────────────────────────────── + (lib.mkIf (cfg.container.enable && cfg.container.hostUsers != []) { + users.users = lib.genAttrs cfg.container.hostUsers (user: { + extraGroups = [ cfg.group ]; + }); + }) + + # ── Warnings ────────────────────────────────────────────────────── + (lib.mkIf (cfg.container.enable && !cfg.addToSystemPackages && cfg.container.hostUsers != []) { + warnings = [ + '' + services.hermes-agent: container.enable is true and container.hostUsers + is set, but addToSystemPackages is false. Without a host-installed hermes + binary, container routing will not work for interactive users. + Set addToSystemPackages = true or ensure hermes is on PATH. + '' + ]; + }) + # ── Directories ─────────────────────────────────────────────────── { systemd.tmpfiles.rules = [ @@ -611,6 +640,59 @@ chown ${cfg.user}:${cfg.group} ${cfg.stateDir}/.hermes/.managed chmod 0644 ${cfg.stateDir}/.hermes/.managed + # Container mode metadata — tells the host CLI to exec into the + # container instead of running locally. Removed when container mode + # is disabled so the host CLI falls back to native execution. + ${if cfg.container.enable then '' + cat > ${cfg.stateDir}/.hermes/.container-mode <<'HERMES_CONTAINER_MODE_EOF' +# Written by NixOS activation script. Do not edit manually. +backend=${cfg.container.backend} +container_name=${containerName} +exec_user=${cfg.user} +hermes_bin=${containerDataDir}/current-package/bin/hermes +HERMES_CONTAINER_MODE_EOF + chown ${cfg.user}:${cfg.group} ${cfg.stateDir}/.hermes/.container-mode + chmod 0644 ${cfg.stateDir}/.hermes/.container-mode + '' else '' + rm -f ${cfg.stateDir}/.hermes/.container-mode + + # Remove symlink bridge for hostUsers + ${lib.concatStringsSep "\n" (map (user: + let + userHome = config.users.users.${user}.home; + symlinkPath = "${userHome}/.hermes"; + in '' + if [ -L "${symlinkPath}" ] && [ "$(readlink "${symlinkPath}")" = "${cfg.stateDir}/.hermes" ]; then + rm -f "${symlinkPath}" + echo "hermes-agent: removed symlink ${symlinkPath}" + fi + '') cfg.container.hostUsers)} + ''} + + # ── Symlink bridge for interactive users ─────────────────────── + # Create ~/.hermes -> stateDir/.hermes for each hostUser so the + # host CLI shares state with the container service. + # Only runs when container mode is enabled. + ${lib.optionalString cfg.container.enable + (lib.concatStringsSep "\n" (map (user: + let + userHome = config.users.users.${user}.home; + symlinkPath = "${userHome}/.hermes"; + target = "${cfg.stateDir}/.hermes"; + in '' + if [ -d "${symlinkPath}" ] && [ ! -L "${symlinkPath}" ]; then + # Real directory — back it up, then create symlink. + # (ln -sfn cannot atomically replace a directory.) + _backup="${symlinkPath}.bak.$(date +%s)" + echo "hermes-agent: backing up existing ${symlinkPath} to $_backup" + mv "${symlinkPath}" "$_backup" + fi + # For everything else (existing symlink, doesn't exist, etc.) + # ln -sfn handles it: replaces symlinks, creates new ones. + ln -sfn "${target}" "${symlinkPath}" + chown -h ${user}:${cfg.group} "${symlinkPath}" + '') cfg.container.hostUsers))} + # Seed auth file if provided ${lib.optionalString (cfg.authFile != null) '' ${if cfg.authFileForceOverwrite then '' diff --git a/tests/hermes_cli/test_container_aware_cli.py b/tests/hermes_cli/test_container_aware_cli.py new file mode 100644 index 000000000..9e21c0b8d --- /dev/null +++ b/tests/hermes_cli/test_container_aware_cli.py @@ -0,0 +1,342 @@ +"""Tests for container-aware CLI routing (NixOS container mode). + +When container.enable = true in the NixOS module, the activation script +writes a .container-mode metadata file. The host CLI detects this and +execs into the container instead of running locally. +""" +import os +import subprocess +from pathlib import Path +from unittest.mock import MagicMock, patch + +import pytest + +from hermes_cli.config import ( + _is_inside_container, + get_container_exec_info, +) + + +# ============================================================================= +# _is_inside_container +# ============================================================================= + + +def test_is_inside_container_dockerenv(): + """Detects /.dockerenv marker file.""" + with patch("os.path.exists") as mock_exists: + mock_exists.side_effect = lambda p: p == "/.dockerenv" + assert _is_inside_container() is True + + +def test_is_inside_container_containerenv(): + """Detects Podman's /run/.containerenv marker.""" + with patch("os.path.exists") as mock_exists: + mock_exists.side_effect = lambda p: p == "/run/.containerenv" + assert _is_inside_container() is True + + +def test_is_inside_container_cgroup_docker(): + """Detects 'docker' in /proc/1/cgroup.""" + with patch("os.path.exists", return_value=False), \ + patch("builtins.open", create=True) as mock_open: + mock_open.return_value.__enter__ = lambda s: s + mock_open.return_value.__exit__ = MagicMock(return_value=False) + mock_open.return_value.read = MagicMock( + return_value="12:memory:/docker/abc123\n" + ) + assert _is_inside_container() is True + + +def test_is_inside_container_false_on_host(): + """Returns False when none of the container indicators are present.""" + with patch("os.path.exists", return_value=False), \ + patch("builtins.open", side_effect=OSError("no such file")): + assert _is_inside_container() is False + + +# ============================================================================= +# get_container_exec_info +# ============================================================================= + + +@pytest.fixture +def container_env(tmp_path, monkeypatch): + """Set up a fake HERMES_HOME with .container-mode file.""" + hermes_home = tmp_path / ".hermes" + hermes_home.mkdir() + monkeypatch.setenv("HERMES_HOME", str(hermes_home)) + monkeypatch.delenv("HERMES_DEV", raising=False) + + container_mode = hermes_home / ".container-mode" + container_mode.write_text( + "# Written by NixOS activation script. Do not edit manually.\n" + "backend=podman\n" + "container_name=hermes-agent\n" + "exec_user=hermes\n" + "hermes_bin=/data/current-package/bin/hermes\n" + ) + return hermes_home + + +def test_get_container_exec_info_returns_metadata(container_env): + """Reads .container-mode and returns all fields including exec_user.""" + with patch("hermes_cli.config._is_inside_container", return_value=False): + info = get_container_exec_info() + + assert info is not None + assert info["backend"] == "podman" + assert info["container_name"] == "hermes-agent" + assert info["exec_user"] == "hermes" + assert info["hermes_bin"] == "/data/current-package/bin/hermes" + + +def test_get_container_exec_info_none_inside_container(container_env): + """Returns None when we're already inside a container.""" + with patch("hermes_cli.config._is_inside_container", return_value=True): + info = get_container_exec_info() + + assert info is None + + +def test_get_container_exec_info_none_without_file(tmp_path, monkeypatch): + """Returns None when .container-mode doesn't exist (native mode).""" + hermes_home = tmp_path / ".hermes" + hermes_home.mkdir() + monkeypatch.setenv("HERMES_HOME", str(hermes_home)) + monkeypatch.delenv("HERMES_DEV", raising=False) + + with patch("hermes_cli.config._is_inside_container", return_value=False): + info = get_container_exec_info() + + assert info is None + + +def test_get_container_exec_info_skipped_when_hermes_dev(container_env, monkeypatch): + """Returns None when HERMES_DEV=1 is set (dev mode bypass).""" + monkeypatch.setenv("HERMES_DEV", "1") + + with patch("hermes_cli.config._is_inside_container", return_value=False): + info = get_container_exec_info() + + assert info is None + + +def test_get_container_exec_info_not_skipped_when_hermes_dev_zero(container_env, monkeypatch): + """HERMES_DEV=0 does NOT trigger bypass — only '1' does.""" + monkeypatch.setenv("HERMES_DEV", "0") + + with patch("hermes_cli.config._is_inside_container", return_value=False): + info = get_container_exec_info() + + assert info is not None + + +def test_get_container_exec_info_defaults(): + """Falls back to defaults for missing keys.""" + import tempfile + + with tempfile.TemporaryDirectory() as tmpdir: + hermes_home = Path(tmpdir) / ".hermes" + hermes_home.mkdir() + (hermes_home / ".container-mode").write_text( + "# minimal file with no keys\n" + ) + + with patch("hermes_cli.config._is_inside_container", return_value=False), \ + patch("hermes_cli.config.get_hermes_home", return_value=hermes_home), \ + patch.dict(os.environ, {}, clear=False): + os.environ.pop("HERMES_DEV", None) + info = get_container_exec_info() + + assert info is not None + assert info["backend"] == "docker" + assert info["container_name"] == "hermes-agent" + assert info["exec_user"] == "hermes" + assert info["hermes_bin"] == "/data/current-package/bin/hermes" + + +def test_get_container_exec_info_docker_backend(container_env): + """Correctly reads docker backend with custom exec_user.""" + (container_env / ".container-mode").write_text( + "backend=docker\n" + "container_name=hermes-custom\n" + "exec_user=myuser\n" + "hermes_bin=/opt/hermes/bin/hermes\n" + ) + + with patch("hermes_cli.config._is_inside_container", return_value=False): + info = get_container_exec_info() + + assert info["backend"] == "docker" + assert info["container_name"] == "hermes-custom" + assert info["exec_user"] == "myuser" + assert info["hermes_bin"] == "/opt/hermes/bin/hermes" + + +def test_get_container_exec_info_crashes_on_permission_error(container_env): + """PermissionError propagates instead of being silently swallowed.""" + with patch("hermes_cli.config._is_inside_container", return_value=False), \ + patch("builtins.open", side_effect=PermissionError("permission denied")): + with pytest.raises(PermissionError): + get_container_exec_info() + + +# ============================================================================= +# _exec_in_container +# ============================================================================= + + +@pytest.fixture +def docker_container_info(): + return { + "backend": "docker", + "container_name": "hermes-agent", + "exec_user": "hermes", + "hermes_bin": "/data/current-package/bin/hermes", + } + + +@pytest.fixture +def podman_container_info(): + return { + "backend": "podman", + "container_name": "hermes-agent", + "exec_user": "hermes", + "hermes_bin": "/data/current-package/bin/hermes", + } + + +def test_exec_in_container_calls_execvp(docker_container_info): + """Verifies os.execvp is called with correct args: runtime, tty flags, + user, env vars, container name, binary, and CLI args.""" + from hermes_cli.main import _exec_in_container + + with patch("shutil.which", return_value="/usr/bin/docker"), \ + patch("subprocess.run") as mock_run, \ + patch("sys.stdin") as mock_stdin, \ + patch("os.execvp") as mock_execvp, \ + patch.dict(os.environ, {"TERM": "xterm-256color", "LANG": "en_US.UTF-8"}, + clear=False): + mock_stdin.isatty.return_value = True + mock_run.return_value = MagicMock(returncode=0) + + _exec_in_container(docker_container_info, ["chat", "-m", "opus"]) + + mock_execvp.assert_called_once() + cmd = mock_execvp.call_args[0][1] + assert cmd[0] == "/usr/bin/docker" + assert cmd[1] == "exec" + assert "-it" in cmd + idx_u = cmd.index("-u") + assert cmd[idx_u + 1] == "hermes" + e_indices = [i for i, v in enumerate(cmd) if v == "-e"] + e_values = [cmd[i + 1] for i in e_indices] + assert "TERM=xterm-256color" in e_values + assert "LANG=en_US.UTF-8" in e_values + assert "hermes-agent" in cmd + assert "/data/current-package/bin/hermes" in cmd + assert "chat" in cmd + + +def test_exec_in_container_non_tty_uses_i_only(docker_container_info): + """Non-TTY mode uses -i instead of -it.""" + from hermes_cli.main import _exec_in_container + + with patch("shutil.which", return_value="/usr/bin/docker"), \ + patch("subprocess.run") as mock_run, \ + patch("sys.stdin") as mock_stdin, \ + patch("os.execvp") as mock_execvp: + mock_stdin.isatty.return_value = False + mock_run.return_value = MagicMock(returncode=0) + + _exec_in_container(docker_container_info, ["sessions", "list"]) + + cmd = mock_execvp.call_args[0][1] + assert "-i" in cmd + assert "-it" not in cmd + + +def test_exec_in_container_no_runtime_hard_fails(podman_container_info): + """Hard fails when runtime not found (no fallback).""" + from hermes_cli.main import _exec_in_container + + with patch("shutil.which", return_value=None), \ + patch("subprocess.run") as mock_run, \ + patch("os.execvp") as mock_execvp, \ + pytest.raises(SystemExit) as exc_info: + _exec_in_container(podman_container_info, ["chat"]) + + mock_run.assert_not_called() + mock_execvp.assert_not_called() + assert exc_info.value.code != 0 + + +def test_exec_in_container_sudo_probe_sets_prefix(podman_container_info): + """When first probe fails and sudo probe succeeds, execvp is called + with sudo -n prefix.""" + from hermes_cli.main import _exec_in_container + + def which_side_effect(name): + if name == "podman": + return "/usr/bin/podman" + if name == "sudo": + return "/usr/bin/sudo" + return None + + with patch("shutil.which", side_effect=which_side_effect), \ + patch("subprocess.run") as mock_run, \ + patch("sys.stdin") as mock_stdin, \ + patch("os.execvp") as mock_execvp: + mock_stdin.isatty.return_value = True + mock_run.side_effect = [ + MagicMock(returncode=1), # direct probe fails + MagicMock(returncode=0), # sudo probe succeeds + ] + + _exec_in_container(podman_container_info, ["chat"]) + + mock_execvp.assert_called_once() + cmd = mock_execvp.call_args[0][1] + assert cmd[0] == "/usr/bin/sudo" + assert cmd[1] == "-n" + assert cmd[2] == "/usr/bin/podman" + assert cmd[3] == "exec" + + +def test_exec_in_container_probe_timeout_prints_message(docker_container_info): + """TimeoutExpired from probe produces a human-readable error, not a + raw traceback.""" + from hermes_cli.main import _exec_in_container + + with patch("shutil.which", return_value="/usr/bin/docker"), \ + patch("subprocess.run", side_effect=subprocess.TimeoutExpired( + cmd=["docker", "inspect"], timeout=15)), \ + patch("os.execvp") as mock_execvp, \ + pytest.raises(SystemExit) as exc_info: + _exec_in_container(docker_container_info, ["chat"]) + + mock_execvp.assert_not_called() + assert exc_info.value.code == 1 + + +def test_exec_in_container_container_not_running_no_sudo(docker_container_info): + """When runtime exists but container not found and no sudo available, + prints helpful error about root containers.""" + from hermes_cli.main import _exec_in_container + + def which_side_effect(name): + if name == "docker": + return "/usr/bin/docker" + return None + + with patch("shutil.which", side_effect=which_side_effect), \ + patch("subprocess.run") as mock_run, \ + patch("os.execvp") as mock_execvp, \ + pytest.raises(SystemExit) as exc_info: + mock_run.return_value = MagicMock(returncode=1) + + _exec_in_container(docker_container_info, ["chat"]) + + mock_execvp.assert_not_called() + assert exc_info.value.code == 1 diff --git a/website/docs/getting-started/nix-setup.md b/website/docs/getting-started/nix-setup.md index 4db493986..858315329 100644 --- a/website/docs/getting-started/nix-setup.md +++ b/website/docs/getting-started/nix-setup.md @@ -122,6 +122,41 @@ services.hermes-agent.environmentFiles = [ "/var/lib/hermes/env" ]; Setting `addToSystemPackages = true` does two things: puts the `hermes` CLI on your system PATH **and** sets `HERMES_HOME` system-wide so the interactive CLI shares state (sessions, skills, cron) with the gateway service. Without it, running `hermes` in your shell creates a separate `~/.hermes/` directory. ::: +:::info Container-aware CLI +When `container.enable = true` and `addToSystemPackages = true`, **every** `hermes` command on the host automatically routes into the managed container. This means your interactive CLI session runs inside the same environment as the gateway service — with access to all container-installed packages and tools. + +- The routing is transparent: `hermes chat`, `hermes sessions list`, `hermes version`, etc. all exec into the container under the hood +- All CLI flags are forwarded as-is +- If the container isn't running, the CLI retries briefly (5s with a spinner for interactive use, 10s silently for scripts) then fails with a clear error — no silent fallback +- For developers working on the hermes codebase, set `HERMES_DEV=1` to bypass container routing and run the local checkout directly + +Set `container.hostUsers` to create a `~/.hermes` symlink to the service state directory, so the host CLI and the container share sessions, config, and memories: + +```nix +services.hermes-agent = { + container.enable = true; + container.hostUsers = [ "your-username" ]; + addToSystemPackages = true; +}; +``` + +Users listed in `hostUsers` are automatically added to the `hermes` group for file permission access. + +**Podman users:** The NixOS service runs the container as root. Docker users get access via the `docker` group socket, but Podman's rootful containers require sudo. Grant passwordless sudo for your container runtime: + +```nix +security.sudo.extraRules = [{ + users = [ "your-username" ]; + commands = [{ + command = "/run/current-system/sw/bin/podman"; + options = [ "NOPASSWD" ]; + }]; +}]; +``` + +The CLI auto-detects when sudo is needed and uses it transparently. Without this, you'll need to run `sudo hermes chat` manually. +::: + ### Verify It Works After `nixos-rebuild switch`, check that the service is running: @@ -246,6 +281,7 @@ Run `nix build .#configKeys && cat result` to see every leaf config key extracte container = { image = "ubuntu:24.04"; backend = "docker"; + hostUsers = [ "your-username" ]; extraVolumes = [ "/home/user/projects:/projects:rw" ]; extraOptions = [ "--gpus" "all" ]; }; @@ -285,6 +321,7 @@ Quick reference for the most common things Nix users want to customize: | Mount host directories into container | `container.extraVolumes` | `[ "/data:/data:rw" ]` | | Pass GPU access to container | `container.extraOptions` | `[ "--gpus" "all" ]` | | Use Podman instead of Docker | `container.backend` | `"podman"` | +| Share state between host CLI and container | `container.hostUsers` | `[ "sidbin" ]` | | Add tools to the service PATH (native only) | `extraPackages` | `[ pkgs.pandoc pkgs.imagemagick ]` | | Use a custom base image | `container.image` | `"ubuntu:24.04"` | | Override the hermes package | `package` | `inputs.hermes-agent.packages.${system}.default.override { ... }` | @@ -518,6 +555,7 @@ When container mode is enabled, hermes runs inside a persistent Ubuntu container Host Container ──── ───────── /nix/store/...-hermes-agent-0.1.0 ──► /nix/store/... (ro) +~/.hermes -> /var/lib/hermes/.hermes (symlink bridge, per hostUsers) /var/lib/hermes/ ──► /data/ (rw) ├── current-package -> /nix/store/... (symlink, updated each rebuild) ├── .gc-root -> /nix/store/... (prevents nix-collect-garbage) @@ -526,6 +564,7 @@ Host Container │ ├── .env (merged from environment + environmentFiles) │ ├── config.yaml (Nix-generated, deep-merged by activation) │ ├── .managed (marker file) + │ ├── .container-mode (routing metadata: backend, exec_user, etc.) │ ├── state.db, sessions/, memories/ (runtime state) │ └── mcp-tokens/ (OAuth tokens for MCP servers) ├── home/ ──► /home/hermes (rw) @@ -698,6 +737,7 @@ nix build .#checks.x86_64-linux.config-roundtrip # merge script preserves use | `container.image` | `str` | `"ubuntu:24.04"` | Base image (pulled at runtime) | | `container.extraVolumes` | `listOf str` | `[]` | Extra volume mounts (`host:container:mode`) | | `container.extraOptions` | `listOf str` | `[]` | Extra args passed to `docker create` | +| `container.hostUsers` | `listOf str` | `[]` | Interactive users who get a `~/.hermes` symlink to the service stateDir and are auto-added to the `hermes` group | --- @@ -818,3 +858,5 @@ nix-store --query --roots $(docker exec hermes-agent readlink /data/current-pack | `hermes version` shows old version | Container not restarted | `systemctl restart hermes-agent` | | Permission denied on `/var/lib/hermes` | State dir is `0750 hermes:hermes` | Use `docker exec` or `sudo -u hermes` | | `nix-collect-garbage` removed hermes | GC root missing | Restart the service (preStart recreates the GC root) | +| `no container with name or ID "hermes-agent"` (Podman) | Podman rootful container not visible to regular user | Add passwordless sudo for podman (see [Container-aware CLI](#container-aware-cli) section) | +| `unable to find user hermes` | Container still starting (entrypoint hasn't created user yet) | Wait a few seconds and retry — the CLI retries automatically |