## Why
Hermes supports Linux, macOS, and native Windows, but the codebase grew up
POSIX-first and has accumulated patterns that silently break (or worse,
silently kill!) on Windows:
- `os.kill(pid, 0)` as a liveness probe — on Windows this maps to
CTRL_C_EVENT and broadcasts Ctrl+C to the target's entire console
process group (bpo-14484, open since 2012).
- `os.killpg` — doesn't exist on Windows at all (AttributeError).
- `os.setsid` / `os.getuid` / `os.geteuid` — same.
- `signal.SIGKILL` / `signal.SIGHUP` / `signal.SIGUSR1` — module-attr
errors at runtime on Windows.
- `open(path)` / `open(path, "r")` without explicit encoding= — inherits
the platform default, which is cp1252/mbcs on Windows (UTF-8 on POSIX),
causing mojibake round-tripping between hosts.
- `wmic` — removed from Windows 10 21H1+.
This commit does three things:
1. Makes `psutil` a core dependency and migrates critical callsites to it.
2. Adds a grep-based CI gate (`scripts/check-windows-footguns.py`) that
blocks new instances of any of the above patterns.
3. Fixes every existing instance in the codebase so the baseline is clean.
## What changed
### 1. psutil as a core dependency (pyproject.toml)
Added `psutil>=5.9.0,<8` to core deps. psutil is the canonical
cross-platform answer for "is this PID alive" and "kill this process
tree" — its `pid_exists()` uses `OpenProcess + GetExitCodeProcess` on
Windows (NOT a signal call), and its `Process.children(recursive=True)`
+ `.kill()` combo replaces `os.killpg()` portably.
### 2. `gateway/status.py::_pid_exists`
Rewrote to call `psutil.pid_exists()` first, falling back to the
hand-rolled ctypes `OpenProcess + WaitForSingleObject` dance on Windows
(and `os.kill(pid, 0)` on POSIX) only if psutil is somehow missing —
e.g. during the scaffold phase of a fresh install before pip finishes.
### 3. `os.killpg` migration to psutil (7 callsites, 5 files)
- `tools/code_execution_tool.py`
- `tools/process_registry.py`
- `tools/tts_tool.py`
- `tools/environments/local.py` (3 sites kept as-is, suppressed with
`# windows-footgun: ok` — the pgid semantics psutil can't replicate,
and the calls are already Windows-guarded at the outer branch)
- `gateway/platforms/whatsapp.py`
### 4. `scripts/check-windows-footguns.py` (NEW, 500 lines)
Grep-based checker with 11 rules covering every Windows cross-platform
footgun we've hit so far:
1. `os.kill(pid, 0)` — the silent killer
2. `os.setsid` without guard
3. `os.killpg` (recommends psutil)
4. `os.getuid` / `os.geteuid` / `os.getgid`
5. `os.fork`
6. `signal.SIGKILL`
7. `signal.SIGHUP/SIGUSR1/SIGUSR2/SIGALRM/SIGCHLD/SIGPIPE/SIGQUIT`
8. `subprocess` shebang script invocation
9. `wmic` without `shutil.which` guard
10. Hardcoded `~/Desktop` (OneDrive trap)
11. `asyncio.add_signal_handler` without try/except
12. `open()` without `encoding=` on text mode
Features:
- Triple-quoted-docstring aware (won't flag prose inside docstrings)
- Trailing-comment aware (won't flag mentions in `# os.kill(pid, 0)` comments)
- Guard-hint aware (skips lines with `hasattr(os, ...)`,
`shutil.which(...)`, `if platform.system() != 'Windows'`, etc.)
- Inline suppression with `# windows-footgun: ok — <reason>`
- `--list` to print all rules with fixes
- `--all` / `--diff <ref>` / staged-files (default) modes
- Scans 380 files in under 2 seconds
### 5. CI integration
A GitHub Actions workflow that runs the checker on every PR and push is
staged at `/tmp/hermes-stash/windows-footguns.yml` — not included in this
commit because the GH token on the push machine lacks `workflow` scope.
A maintainer with `workflow` permissions should add it as
`.github/workflows/windows-footguns.yml` in a follow-up. Content:
```yaml
name: Windows footgun check
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with: {python-version: "3.11"}
- run: python scripts/check-windows-footguns.py --all
```
### 6. CONTRIBUTING.md — "Cross-Platform Compatibility" expansion
Expanded from 5 to 16 rules, each with message, example, and fix.
Recommends psutil as the preferred API for PID / process-tree operations.
### 7. Baseline cleanup (91 → 0 findings)
- 14 `open()` sites → added `encoding='utf-8'` (internal logs/caches) or
`encoding='utf-8-sig'` (user-editable files that Notepad may BOM)
- 23 POSIX-only callsites in systemd helpers, pty_bridge, and plugin
tool subprocess management → annotated with
`# windows-footgun: ok — <reason>`
- 7 `os.killpg` sites → migrated to psutil (see §3 above)
## Verification
```
$ python scripts/check-windows-footguns.py --all
✓ No Windows footguns found (380 file(s) scanned).
$ python -c "from gateway.status import _pid_exists; import os
> print('self:', _pid_exists(os.getpid())); print('bogus:', _pid_exists(999999))"
self: True
bogus: False
```
Proof-of-repro that `os.kill(pid, 0)` was actually killing processes
before this fix — see commit `1cbe39914` and bpo-14484. This commit
removes the last hand-rolled ctypes path from the hot liveness-check
path and defers to the best-maintained cross-platform answer.
Closes the last Python-on-Windows UTF-8 exposure by making every
text-mode open() call explicit about its encoding.
Before: on Windows, bare open(path, 'r') defaults to the system
locale encoding (cp1252 on US-locale installs). That means reading
any config/yaml/markdown/json file with non-ASCII content either
crashes with UnicodeDecodeError or silently mis-decodes bytes.
After: all 89 affected call sites in production code now pass
encoding='utf-8' explicitly. Works identically on every platform
and every locale, no surprise behavior.
Mechanical sweep via:
ruff check --preview --extend-select PLW1514 --unsafe-fixes --fix --exclude 'tests,venv,.venv,node_modules,website,optional-skills, skills,tinker-atropos,plugins' .
All 89 fixes have the same shape: open(x) or open(x, mode) became
open(x, encoding='utf-8') or open(x, mode, encoding='utf-8'). Nothing
else changed. Every modified file still parses and the Windows/sandbox
test suite is still green (85 passed, 14 skipped, 0 failed across
tests/tools/test_code_execution_windows_env.py +
tests/tools/test_code_execution_modes.py + tests/tools/test_env_passthrough.py +
tests/test_hermes_bootstrap.py).
Scope notes:
- tests/ excluded: test fixtures can use locale encoding intentionally
(exercising edge cases). If we want to tighten tests later that's
a separate PR.
- plugins/ excluded: plugin-specific conventions may differ; plugin
authors own their code.
- optional-skills/ and skills/ excluded: skill scripts are user-authored
and we don't want to mass-edit them.
- website/ and tinker-atropos/ excluded: vendored / generated content.
46 files touched, 89 +/- lines (symmetric replacement). No behavior
change on POSIX or on Windows when the file is ASCII; bug fix on
Windows when the file contains non-ASCII.
Four callsites hardcoded Path.home() / '.hermes' with no HERMES_HOME
check, breaking Docker deployments and profile isolation (hermes -p):
- plugins/hermes-achievements/dashboard/plugin_api.py:
state_path(), snapshot_path(), checkpoint_path() bare-literal paths
- scripts/profile-tui.py:
DEFAULT_STATE_DB and DEFAULT_LOG defaults ignored HERMES_HOME
- hermes_cli/slack_cli.py:
except-Exception fallback for slack-manifest.json dump
- optional-skills/migration/openclaw-migration/scripts/openclaw_to_hermes.py:
--target argparse default
Use get_hermes_home() (with an ImportError shim for the standalone
scripts) or 'os.environ.get("HERMES_HOME") or str(Path.home()/".hermes")'
where importing hermes_constants is impractical.
E2E-verified: with HERMES_HOME=/tmp/x all three achievements paths and
both profile-tui defaults route under /tmp/x.
Salvaged from #18068 (original scope was broader mechanical cleanup
claiming 23 callsites were buggy; most were already respecting
HERMES_HOME via os.environ.get(key, default) — only these 4 had no env
check at all). Credit: @web-dev0521.
- drop unused TUI helpers, test-only layout scaffolding, and stale public debug exports
- remove an unused profiler import and trim test-only coverage for deleted helpers
Before: change code → build → run profile → manually compare to
mental model of last run. After: `--loop` watches ui-tui/src and
packages/hermes-ink/src for .ts(x) changes, rebuilds on change,
re-runs the same scenario, prints a side-by-side A/B diff against
the previous iteration — so each edit's impact is quantified
instantly. Ctrl+C to stop.
Also added:
--save LABEL saves metrics snapshot to /tmp/perf-<LABEL>.json
--compare LABEL diffs the current run vs that snapshot
--extra-flag X pass-through to node dist/entry.js (prepping for
--no-fullscreen below)
key_metrics() flattens a full run into scalar numbers across
frames, React commits, and per-phase timings. format_diff() prints
a table with ↑/↓ markers denoting regressions vs improvements based
on whether the metric is lower-is-better (p99, max, patches, drain)
or higher-is-better (fps, gaps_under_16ms).
Run-to-run noise on static code is ~5-15% on most metrics — big
signal (>30% change on renderer_p99 / fps) cuts through cleanly.
Useful both for validating a single fix and for detecting subtle
regressions during the wheel-accel port.
Usage during the next perf session:
# one-shot with a baseline for later comparison
scripts/profile-tui.py --seconds 6 --hold wheel_up --save pre-accel
# after porting the wheel handler
scripts/profile-tui.py --seconds 6 --hold wheel_up --compare pre-accel
# continuous iteration
scripts/profile-tui.py --seconds 6 --hold wheel_up --loop
Adds four fields to FrameEvent.phases and the matching profile
summary:
optimizedPatches post-optimize patch count (what's actually
written to stdout; the .patches field is
pre-optimize)
writeBytes UTF-8 byte count of the write this frame
backpressure true when Node's stdout.write returned false
(Writable buffer full — outer terminal can't
keep up)
prevFrameDrainMs end-to-end drain time of the PREVIOUS frame's
write, captured from stdout.write's 2-arg
callback. Reported on the next frame so the
measurement reflects "time until OS flushed
the bytes to the terminal fd", not "time until
queued in Node".
writeDiffToTerminal() now returns { bytes, backpressure } and
accepts an optional onDrain callback. Only attached on TTY with
diff; piped/non-TTY stdout bypasses flow control so the callback
would fire synchronously anyway.
Initial measurements under hold-wheel_up against 1106-msg session
(30Hz for 6s):
patches total 28,888
optimized total 16,700 (ratio 0.58 — optimizer cuts ~42%)
writeBytes 42 KB / 10s = 4.2 KB/s throughput
drainMs p50 0.14 ms terminal accepts bytes instantly
drainMs p99 0.85 ms
backpressure 0% of frames
This rules out the terminal-parse hypothesis — Cursor's xterm.js
drains our output in sub-millisecond time at only 4 KB/s. The
remaining lag has to be in the render pipeline, not the wire.
Profile output now includes the bytes+drain+backpressure lines to
keep this visible on every subsequent iteration.
Extends HERMES_DEV_PERF to capture the complete render pipeline, not
just React commits. Adds scripts/profile-tui.py to drive repeatable
hold-PageUp stress tests against a real long session.
perfPane.tsx:
Wires ink's onFrame callback (already plumbed through the fork) into
the same perf.log as the React.Profiler samples. Captures per-phase
timing (yoga calculateLayout, renderNodeToOutput, screen diff, patch
optimize, stdout write) plus yoga counters (visited/measured/cache-
Hits/live) and patch counts per frame. Events are tagged
{src: 'react'|'frame'} so jq can split them. logFrameEvent is
undefined when HERMES_DEV_PERF is unset, so ink doesn't even attach
the callback.
entry.tsx:
Passes logFrameEvent into render().
types/hermes-ink.d.ts:
Declares FrameEvent + onFrame on RenderOptions so the ui-tui side
type-checks against the plumbed-through ink option.
scripts/profile-tui.py:
New harness. Launches the built TUI under a PTY with the longest
session in state.db resumed, holds PageUp/PageDown/etc at a
configurable Hz for N seconds, then parses perf.log and prints
per-phase p50/p95/p99/max plus yoga-counter summaries. Zero deps
beyond stdlib. Exit 2 if nothing was captured (wiring broken).
Initial findings (1106-msg session, 6s PageUp hold at 30Hz):
- Steady state: 10 fps; renderer phase p99=63ms, write p99=0.2ms
- 4/107 heavy frames (>=16ms), all dominated by renderNodeToOutput
- One pathological 97ms frame with yoga measuring 70,415 text cells
and Yoga visiting 225k nodes — the cold-unmeasured-region hit
- Ink's scroll fast-path (DECSTBM blit from prevScreen) is
disqualified because our spacer-based virtual history doesn't
keep heightDelta in sync with scroll.delta, so every PageUp step
falls through to a full 2000-4800 patch re-render instead of ~40