mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-09 03:11:58 +00:00
## Why
Hermes supports Linux, macOS, and native Windows, but the codebase grew up
POSIX-first and has accumulated patterns that silently break (or worse,
silently kill!) on Windows:
- `os.kill(pid, 0)` as a liveness probe — on Windows this maps to
CTRL_C_EVENT and broadcasts Ctrl+C to the target's entire console
process group (bpo-14484, open since 2012).
- `os.killpg` — doesn't exist on Windows at all (AttributeError).
- `os.setsid` / `os.getuid` / `os.geteuid` — same.
- `signal.SIGKILL` / `signal.SIGHUP` / `signal.SIGUSR1`: these attributes
are not defined on Windows, so any reference raises AttributeError at runtime.
- `open(path)` / `open(path, "r")` without explicit encoding= — inherits
the platform default, which is cp1252/mbcs on Windows (UTF-8 on POSIX),
causing mojibake round-tripping between hosts.
- `wmic`: deprecated as of Windows 10 21H1 and absent by default on recent
Windows 11 builds.
This commit does three things:
1. Makes `psutil` a core dependency and migrates critical callsites to it.
2. Adds a grep-based CI gate (`scripts/check-windows-footguns.py`) that
blocks new instances of any of the above patterns.
3. Fixes every existing instance in the codebase so the baseline is clean.
## What changed
### 1. psutil as a core dependency (pyproject.toml)
Added `psutil>=5.9.0,<8` to core deps. psutil is the canonical
cross-platform answer for "is this PID alive" and "kill this process
tree" — its `pid_exists()` uses `OpenProcess + GetExitCodeProcess` on
Windows (NOT a signal call), and its `Process.children(recursive=True)`
+ `.kill()` combo replaces `os.killpg()` portably.
### 2. `gateway/status.py::_pid_exists`
Rewrote to call `psutil.pid_exists()` first, falling back to the
hand-rolled ctypes `OpenProcess + WaitForSingleObject` dance on Windows
(and `os.kill(pid, 0)` on POSIX) only if psutil is somehow missing —
e.g. during the scaffold phase of a fresh install before pip finishes.
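
The fallback order described above can be sketched as follows. This is an illustration under stated assumptions, not the actual `gateway/status.py` code: the names `pid_alive` and `_pid_alive_win32` are hypothetical, and treating "access denied" as "alive" on Windows is a choice made for this sketch.

```python
from __future__ import annotations

import os
import sys


def _pid_alive_win32(pid: int) -> bool:
    """Hypothetical ctypes fallback: OpenProcess + WaitForSingleObject."""
    import ctypes
    from ctypes import wintypes

    SYNCHRONIZE = 0x00100000
    PROCESS_QUERY_LIMITED_INFORMATION = 0x1000
    WAIT_TIMEOUT = 0x00000102
    ERROR_ACCESS_DENIED = 5

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.OpenProcess.restype = wintypes.HANDLE
    kernel32.OpenProcess.argtypes = (wintypes.DWORD, wintypes.BOOL, wintypes.DWORD)
    kernel32.WaitForSingleObject.restype = wintypes.DWORD
    kernel32.WaitForSingleObject.argtypes = (wintypes.HANDLE, wintypes.DWORD)
    kernel32.CloseHandle.argtypes = (wintypes.HANDLE,)

    handle = kernel32.OpenProcess(
        SYNCHRONIZE | PROCESS_QUERY_LIMITED_INFORMATION, False, pid
    )
    if not handle:
        # Assumption: a process we are not allowed to open still exists.
        return ctypes.get_last_error() == ERROR_ACCESS_DENIED
    try:
        # A zero-timeout wait that times out means the process has not exited.
        return kernel32.WaitForSingleObject(handle, 0) == WAIT_TIMEOUT
    finally:
        kernel32.CloseHandle(handle)


def pid_alive(pid: int) -> bool:
    """Liveness probe that never sends a signal on Windows."""
    if pid <= 0:
        return False
    try:
        import psutil  # preferred path
    except ImportError:
        psutil = None
    if psutil is not None:
        return psutil.pid_exists(pid)
    if sys.platform == "win32":
        return _pid_alive_win32(pid)
    try:
        os.kill(pid, 0)  # POSIX only: signal 0 checks existence, delivers nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True
```

The `os.kill(pid, 0)` call is reachable only on POSIX, which is the kind of guarded use the checker is meant to allow.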
### 3. `os.killpg` migration to psutil (7 callsites, 5 files)
- `tools/code_execution_tool.py`
- `tools/process_registry.py`
- `tools/tts_tool.py`
- `tools/environments/local.py` (3 sites kept as-is and suppressed with
`# windows-footgun: ok`: they rely on pgid semantics that psutil can't
replicate, and the calls are already Windows-guarded at the outer branch)
- `gateway/platforms/whatsapp.py`
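
A minimal sketch of the `os.killpg` replacement pattern, assuming the migrated callsites follow the `Process.children(recursive=True)` + `.kill()` combination named in §1. The helper name `kill_process_tree` and the 3-second default timeout are hypothetical, and the import is guarded only so the sketch loads where psutil is not installed.

```python
from __future__ import annotations

try:
    import psutil
except ImportError:  # guarded only so this sketch imports without psutil
    psutil = None


def kill_process_tree(pid: int, timeout: float = 3.0) -> int:
    """Kill `pid` and all of its descendants; return how many were targeted."""
    if psutil is None:
        raise RuntimeError("psutil is required for this sketch")
    try:
        parent = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return 0
    # Snapshot descendants first: once the parent dies, its children are
    # reparented and can no longer be found through it.
    procs = parent.children(recursive=True) + [parent]
    for proc in procs:
        try:
            proc.kill()
        except psutil.NoSuchProcess:
            pass  # already gone
    # Wait for the kills to land so callers do not race a dying process.
    psutil.wait_procs(procs, timeout=timeout)
    return len(procs)
```

Unlike `os.killpg`, this walks the parent/child tree rather than a process group, so a descendant that was already orphaned before the snapshot is not found. That difference is presumably why the pgid-dependent sites in `tools/environments/local.py` were left alone.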
### 4. `scripts/check-windows-footguns.py` (NEW, 500 lines)
Grep-based checker with 12 rules covering every Windows cross-platform
footgun we've hit so far:
1. `os.kill(pid, 0)` — the silent killer
2. `os.setsid` without guard
3. `os.killpg` (recommends psutil)
4. `os.getuid` / `os.geteuid` / `os.getgid`
5. `os.fork`
6. `signal.SIGKILL`
7. `signal.SIGHUP/SIGUSR1/SIGUSR2/SIGALRM/SIGCHLD/SIGPIPE/SIGQUIT`
8. `subprocess` shebang script invocation
9. `wmic` without `shutil.which` guard
10. Hardcoded `~/Desktop` (OneDrive trap)
11. `asyncio.add_signal_handler` without try/except
12. `open()` without `encoding=` on text mode
Features:
- Triple-quoted-docstring aware (won't flag prose inside docstrings)
- Trailing-comment aware (won't flag mentions in `# os.kill(pid, 0)` comments)
- Guard-hint aware (skips lines with `hasattr(os, ...)`,
`shutil.which(...)`, `if platform.system() != 'Windows'`, etc.)
- Inline suppression with `# windows-footgun: ok — <reason>`
- `--list` to print all rules with fixes
- `--all` / `--diff <ref>` / staged-files (default) modes
- Scans 380 files in under 2 seconds
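
To illustrate how the features above fit together, here is a heavily simplified sketch of a line-based scanner. It is not the contents of `scripts/check-windows-footguns.py`: the two rules, their regexes, the rule names, and the `scan` function are assumptions made for illustration, and the comment stripping is naive (it would misread a `#` inside a string literal).

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    name: str
    pattern: re.Pattern
    fix: str


# Hypothetical subset of rules; the real checker defines its own.
RULES = [
    Rule("os-kill-zero", re.compile(r"\bos\.kill\(\s*[^,()]+,\s*0\s*\)"),
         "use psutil.pid_exists(pid)"),
    Rule("os-killpg", re.compile(r"\bos\.killpg\b"),
         "use psutil.Process(pid).children(recursive=True) + kill()"),
]
SUPPRESS = "# windows-footgun: ok"
GUARD_HINTS = ("hasattr(os,", "shutil.which(", "platform.system()")


def scan(source: str) -> list:
    """Return (line number, rule name) pairs for unguarded matches."""
    findings = []
    in_docstring = False
    for lineno, line in enumerate(source.splitlines(), start=1):
        delimiters = line.count('"""') + line.count("'''")
        if in_docstring or delimiters:
            # An odd number of triple quotes opens or closes a docstring.
            if delimiters % 2 == 1:
                in_docstring = not in_docstring
            continue
        if SUPPRESS in line:
            continue
        code = line.split("#", 1)[0]  # naive trailing-comment strip
        if any(hint in code for hint in GUARD_HINTS):
            continue
        for rule in RULES:
            if rule.pattern.search(code):
                findings.append((lineno, rule.name))
    return findings
```

Working line by line keeps the scan fast and dependency-free, at the cost of guards that must sit on the same line as the call to be recognized.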
### 5. CI integration
A GitHub Actions workflow that runs the checker on every PR and push is
staged at `/tmp/hermes-stash/windows-footguns.yml` — not included in this
commit because the GH token on the push machine lacks `workflow` scope.
A maintainer with `workflow` permissions should add it as
`.github/workflows/windows-footguns.yml` in a follow-up. Content:
```yaml
name: Windows footgun check
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: "3.11"}
      - run: python scripts/check-windows-footguns.py --all
```
### 6. CONTRIBUTING.md — "Cross-Platform Compatibility" expansion
Expanded from 5 to 16 rules, each with message, example, and fix.
Recommends psutil as the preferred API for PID / process-tree operations.
### 7. Baseline cleanup (91 → 0 findings)
- 14 `open()` sites → added `encoding='utf-8'` (internal logs/caches) or
`encoding='utf-8-sig'` (user-editable files that Notepad may BOM)
- 23 POSIX-only callsites in systemd helpers, pty_bridge, and plugin
tool subprocess management → annotated with
`# windows-footgun: ok — <reason>`
- 7 `os.killpg` sites → migrated to psutil (see §3 above)
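
The difference between the two encodings chosen for the `open()` sites can be shown with a small sketch. The helper `read_text`, its `user_editable` flag, and the file name are hypothetical; the point is only that `utf-8-sig` drops a leading byte-order mark while plain `utf-8` keeps it as U+FEFF.

```python
from __future__ import annotations

import os
import tempfile


def read_text(path: str, *, user_editable: bool = False) -> str:
    """Read text with an explicit encoding instead of the platform default."""
    # utf-8-sig tolerates the BOM some editors prepend; utf-8 does not strip it.
    encoding = "utf-8-sig" if user_editable else "utf-8"
    with open(path, encoding=encoding) as f:
        return f.read()


def demo() -> tuple[str, str]:
    """Write a BOM-prefixed file and read it back both ways."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "settings.env")
        with open(path, "wb") as f:
            f.write(b"\xef\xbb\xbfKEY=value\n")
        return read_text(path), read_text(path, user_editable=True)
```

With plain `utf-8`, the first key would read as `"\ufeffKEY"` and silently fail to match `"KEY"`, which is why user-editable files get `utf-8-sig`.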
## Verification
```
$ python scripts/check-windows-footguns.py --all
✓ No Windows footguns found (380 file(s) scanned).
$ python -c "from gateway.status import _pid_exists; import os
> print('self:', _pid_exists(os.getpid())); print('bogus:', _pid_exists(999999))"
self: True
bogus: False
```
For a reproduction showing that `os.kill(pid, 0)` was actually killing
processes before this fix, see commit `1cbe39914` and bpo-14484. This commit
removes the last hand-rolled ctypes path from the hot liveness-check
path and defers to the best-maintained cross-platform answer.
175 lines
6.5 KiB
Python
"""Helpers for loading Hermes .env files consistently across entrypoints."""

from __future__ import annotations

import os
import sys
from pathlib import Path

from dotenv import load_dotenv
from utils import atomic_replace


# Env var name suffixes that indicate credential values. These are the
# only env vars whose values we sanitize on load — we must not silently
# alter arbitrary user env vars, but credentials are known to require
# pure ASCII (they become HTTP header values).
_CREDENTIAL_SUFFIXES = ("_API_KEY", "_TOKEN", "_SECRET", "_KEY")

# Names we've already warned about during this process, so repeated
# load_hermes_dotenv() calls (user env + project env, gateway hot-reload,
# tests) don't spam the same warning multiple times.
_WARNED_KEYS: set[str] = set()


def _format_offending_chars(value: str, limit: int = 3) -> str:
    """Return a compact 'U+XXXX ('c'), ...' summary of non-ASCII codepoints."""
    seen: list[str] = []
    for ch in value:
        if ord(ch) > 127:
            label = f"U+{ord(ch):04X}"
            if ch.isprintable():
                label += f" ({ch!r})"
            if label not in seen:
                seen.append(label)
                if len(seen) >= limit:
                    break
    return ", ".join(seen)


def _sanitize_loaded_credentials() -> None:
    """Strip non-ASCII characters from credential env vars in os.environ.

    Called after dotenv loads so the rest of the codebase never sees
    non-ASCII API keys. Only touches env vars whose names end with
    known credential suffixes (``_API_KEY``, ``_TOKEN``, etc.).

    Emits a one-line warning to stderr when characters are stripped.
    Silent stripping would mask copy-paste corruption (Unicode lookalike
    glyphs from PDFs / rich-text editors, ZWSP from web pages) as opaque
    provider-side "invalid API key" errors (see #6843).
    """
    for key, value in list(os.environ.items()):
        if not any(key.endswith(suffix) for suffix in _CREDENTIAL_SUFFIXES):
            continue
        try:
            value.encode("ascii")
            continue
        except UnicodeEncodeError:
            pass
        cleaned = value.encode("ascii", errors="ignore").decode("ascii")
        os.environ[key] = cleaned
        if key in _WARNED_KEYS:
            continue
        _WARNED_KEYS.add(key)
        stripped = len(value) - len(cleaned)
        detail = _format_offending_chars(value) or "non-printable"
        print(
            f" Warning: {key} contained {stripped} non-ASCII character"
            f"{'s' if stripped != 1 else ''} ({detail}) — stripped so the "
            f"key can be sent as an HTTP header.",
            file=sys.stderr,
        )
        print(
            " This usually means the key was copy-pasted from a PDF, "
            "rich-text editor, or web page that substituted lookalike\n"
            " Unicode glyphs for ASCII letters. If authentication fails "
            "(e.g. \"API key not valid\"), re-copy the key from the\n"
            " provider's dashboard and run `hermes setup` (or edit the "
            ".env file in a plain-text editor).",
            file=sys.stderr,
        )


def _load_dotenv_with_fallback(path: Path, *, override: bool) -> None:
    try:
        load_dotenv(dotenv_path=path, override=override, encoding="utf-8")
    except UnicodeDecodeError:
        load_dotenv(dotenv_path=path, override=override, encoding="latin-1")
    # Strip non-ASCII characters from credential env vars that were just
    # loaded. API keys must be pure ASCII since they're sent as HTTP
    # header values (httpx encodes headers as ASCII). Non-ASCII chars
    # typically come from copy-pasting keys from PDFs or rich-text editors
    # that substitute Unicode lookalike glyphs (e.g. ʋ U+028B for v).
    _sanitize_loaded_credentials()


def _sanitize_env_file_if_needed(path: Path) -> None:
    """Pre-sanitize a .env file before python-dotenv reads it.

    python-dotenv does not handle corrupted lines where multiple
    KEY=VALUE pairs are concatenated on a single line (missing newline).
    This produces mangled values — e.g. a bot token duplicated 8×
    (see #8908).

    We delegate to ``hermes_cli.config._sanitize_env_lines`` which
    already knows all valid Hermes env-var names and can split
    concatenated lines correctly.
    """
    if not path.exists():
        return
    try:
        from hermes_cli.config import _sanitize_env_lines
    except ImportError:
        return  # early bootstrap — config module not available yet

    read_kw = {"encoding": "utf-8-sig", "errors": "replace"}
    try:
        with open(path, **read_kw) as f:
            original = f.readlines()
        sanitized = _sanitize_env_lines(original)
        if sanitized != original:
            import tempfile
            fd, tmp = tempfile.mkstemp(
                dir=str(path.parent), suffix=".tmp", prefix=".env_"
            )
            try:
                with os.fdopen(fd, "w", encoding="utf-8") as f:
                    f.writelines(sanitized)
                    f.flush()
                    os.fsync(f.fileno())
                atomic_replace(tmp, path)
            except BaseException:
                try:
                    os.unlink(tmp)
                except OSError:
                    pass
                raise
    except Exception:
        pass  # best-effort — don't block gateway startup


def load_hermes_dotenv(
    *,
    hermes_home: str | os.PathLike | None = None,
    project_env: str | os.PathLike | None = None,
) -> list[Path]:
    """Load Hermes environment files with user config taking precedence.

    Behavior:
    - `~/.hermes/.env` overrides stale shell-exported values when present.
    - project `.env` acts as a dev fallback and only fills missing values when
      the user env exists.
    - if no user env exists, the project `.env` also overrides stale shell vars.
    """
    loaded: list[Path] = []

    home_path = Path(hermes_home or os.getenv("HERMES_HOME", Path.home() / ".hermes"))
    user_env = home_path / ".env"
    project_env_path = Path(project_env) if project_env else None

    # Fix corrupted .env files before python-dotenv parses them (#8908).
    if user_env.exists():
        _sanitize_env_file_if_needed(user_env)
    if project_env_path and project_env_path.exists():
        _sanitize_env_file_if_needed(project_env_path)

    if user_env.exists():
        _load_dotenv_with_fallback(user_env, override=True)
        loaded.append(user_env)

    if project_env_path and project_env_path.exists():
        _load_dotenv_with_fallback(project_env_path, override=not loaded)
        loaded.append(project_env_path)

    return loaded
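
The precedence rules in the `load_hermes_dotenv` docstring can be checked with a small dictionary-based simulation. This sketch does not call python-dotenv: the helpers `apply_env` and `resolve` are hypothetical stand-ins that mimic only the `override` semantics of `load_dotenv` (with `override=False`, variables that are already set are left untouched).

```python
from __future__ import annotations


def apply_env(environ: dict[str, str], values: dict[str, str], *, override: bool) -> None:
    """Mimic load_dotenv: only overwrite existing keys when override is True."""
    for key, value in values.items():
        if override or key not in environ:
            environ[key] = value


def resolve(
    shell: dict[str, str],
    user_env: dict[str, str] | None,
    project_env: dict[str, str] | None,
) -> dict[str, str]:
    """Combine shell, user, and project values in the documented order."""
    environ = dict(shell)
    loaded_user = user_env is not None
    if loaded_user:
        # The user env always wins over stale shell exports.
        apply_env(environ, user_env, override=True)
    if project_env is not None:
        # The project env overrides only when no user env was loaded.
        apply_env(environ, project_env, override=not loaded_user)
    return environ
```

For example, with a stale shell value, a user env, and a project env all defining the same key, `resolve` returns the user value, and the project env contributes only keys the user env lacks.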