fix(windows): gateway status dedup + install.ps1 platform-SDK bootstrap

## Two residual Windows fixes that were hanging from earlier commits.

### 1. `hermes gateway status` reported 2 PIDs per gateway — TWO bugs compounded

Diagnosed with psutil parent/child walk against live gateway PIDs:

**Bug A (the real one): `_get_parent_pid` silently failed on Windows.**
The helper shelled out to `ps -o ppid= -p <pid>`, which doesn't exist
on Windows — `FileNotFoundError` → returns `None` → the ancestor walk
terminated at `os.getpid()` alone.  Consequence: the PID table scan in
`_scan_gateway_pids` couldn't filter out `hermes gateway status`'s own
launcher stub (a venv `pythonw.exe`/`python.exe` that matches the same
`-m hermes_cli.main gateway` pattern as the gateway).  Every status
call saw "itself" as a second gateway.

Fix: `_get_parent_pid` now calls `psutil.Process(pid).ppid()` first
(psutil is a core dependency since 3dfb35700) and falls back to `ps`
only when `shutil.which("ps")` succeeds — matching the Windows-footgun
checker's "always guard `ps` / `wmic` / etc. with `shutil.which`" rule.

Before: `Gateway process running (PID: 21952, 46880)` — 46880 changing
on every call (the status invocation's own launcher, which died by the
time the next status call looked).

After (5 consecutive calls):
```
✓ Gateway process running (PID: 21952)
✓ Gateway process running (PID: 21952)
✓ Gateway process running (PID: 21952)
✓ Gateway process running (PID: 21952)
✓ Gateway process running (PID: 21952)
```

Ancestor walk on the fix: 14 PIDs (full chain through bash/explorer)
instead of the broken 1-PID set.

**Bug B (the cosmetic one): venv-launcher dedup.** Standard Windows
CPython venv behaviour is that `<venv>/Scripts/pythonw.exe` is a ~5 MB
launcher stub that spawns the base Python (`C:\\Program Files\\Python311
\\pythonw.exe`) with the same command line and waits.  Our process
scanner sees two PIDs for every gateway: launcher + interpreter, same
cmdline.  Bug A masked this by accidentally counting the status call
AS one of them; with Bug A fixed, we see both the real launcher and
real interpreter for the gateway process itself.

Fix: `_filter_venv_launcher_stubs` at the tail of `_scan_gateway_pids`
walks each matched PID's ppid via psutil.  Any PID that's the PARENT
of another matched PID is a launcher stub — drop it, keep the child.
Scoped to Windows (`is_windows() and len(pids) > 1`) and no-ops when
psutil isn't importable.

Net effect: `gateway status` now reports one PID per gateway — the
interpreter — matching POSIX behaviour and user expectations.

### 2. `install.ps1`: bootstrap pip + auto-install platform SDKs

New `Install-PlatformSdks` function wired between `Invoke-SetupWizard`
and `Start-GatewayIfConfigured`.  Fixes two related issues on fresh
Windows installs:

1. The tiered `uv pip install` cascade (introduced in 87fca8342)
   correctly falls through when tier 1 `.[all]` fails on the RL git
   deps, but the fallback tiers can silently skip SDKs from `[messaging]`
   when there's a partial-resolve.  Result: user sets `DISCORD_BOT_TOKEN`
   in `.env`, fires up gateway, hits "discord module not installed".

2. `uv` creates venvs WITHOUT pip by default, so the user's escape
   hatch (`pip install discord.py` in the venv) doesn't exist either.

The new function:
- Skips if `-NoVenv` (nothing to bootstrap into).
- Scans `~/.hermes/.env` for messaging tokens (TELEGRAM_BOT_TOKEN,
  DISCORD_BOT_TOKEN, SLACK_BOT_TOKEN, SLACK_APP_TOKEN, WHATSAPP_ENABLED),
  filtering placeholder values.
- For each token that's set, runs `python -c "import <sdk>"` to verify.
- If any import fails: runs `python -m ensurepip --upgrade` to bootstrap
  pip into the venv (idempotent — no-ops if pip is already present),
  then `pip install <spec>` for each missing SDK with specs mirroring
  pyproject.toml's `[messaging]` extra to avoid version drift.

The `$ErrorActionPreference = "SilentlyContinue"` spans are not
cosmetic — PowerShell wraps native-stderr from a non-zero-exit
subprocess as a `NativeCommandError` that prints even through
`*> $null` / `2>$null`.  Save + restore EAP over the import-probe
and pip-install blocks keeps the output clean.

Verified on this Windows 10 box:
- Initial state: telegram+fastapi+psutil present, discord+slack_sdk
  missing (tier 1 `.[all]` had failed — `.tirith-install-failed`
  marker in `%LOCALAPPDATA%\\hermes`).
- First run with discord+slack tokens in .env: detects both missing,
  ensurepip (skipped — pip was already bootstrapped earlier this
  session for telegram), installs `discord.py[voice]==2.7.1` +
  `PyNaCl` + `davey`, installs `slack-sdk==3.41.0`. All imports
  succeed on verify.
- Second run: all three SDKs report OK, function no-ops.

Pip spec strings mirror pyproject.toml's `[messaging]` extra verbatim
so a bump to the extra picks up here automatically — no drift.

### Files

- `hermes_cli/gateway.py`: `_get_parent_pid` rewritten (psutil-first);
  `_filter_venv_launcher_stubs` added; `_scan_gateway_pids` dedups
  launchers on Windows when it finds >1 match.
- `scripts/install.ps1`: new `Install-PlatformSdks` function (~85
  lines); wired into the main flow at line 1438.

### Verification

- `venv/Scripts/python.exe scripts/check-windows-footguns.py --all`
  → `✓ No Windows footguns found (380 file(s) scanned).`
- `ast.parse` passes on gateway.py.
- `[System.Management.Automation.Language.Parser]::ParseFile` passes
  on install.ps1.
- Live gateway (PID 21952, running since 12:33 today) survived 5x
  stress loop of `hermes gateway status` without dying.
This commit is contained in:
Teknium 2026-05-08 13:10:34 -07:00
parent cc38282b04
commit 0548facc50
2 changed files with 172 additions and 1 deletions

View file

@ -131,9 +131,26 @@ def _get_service_pids() -> set:
def _get_parent_pid(pid: int) -> int | None:
"""Return the parent PID for ``pid``, or ``None`` when unavailable."""
"""Return the parent PID for ``pid``, or ``None`` when unavailable.
Uses psutil (core dependency) which works on every platform. The
older implementation shelled out to ``ps -o ppid= -p <pid>``, which
silently fails on Windows (no ``ps``) so the ancestor walk terminated
at self the caller's dedup / exclude logic then couldn't distinguish
"hermes CLI that invoked this scan" from "real gateway process".
"""
if pid <= 1:
return None
try:
import psutil # type: ignore
return psutil.Process(pid).ppid() or None
except ImportError:
pass
except Exception:
return None
# Fallback: shell out to ps (POSIX only — bare ``ps`` doesn't exist on Windows).
if not shutil.which("ps"):
return None
try:
result = subprocess.run(
["ps", "-o", "ppid=", "-p", str(pid)],
@ -416,9 +433,53 @@ def _scan_gateway_pids(exclude_pids: set[int], all_profiles: bool = False) -> li
except (OSError, subprocess.TimeoutExpired):
return []
# Windows-specific: collapse venv launcher stubs. A venv-built
# ``pythonw.exe`` in ``<venv>/Scripts/`` is a ~100 KB launcher exe
# that spawns the base Python (e.g. ``C:\Program Files\Python311\
# pythonw.exe``) with the same command line, preserving the venv's
# ``pyvenv.cfg`` context. This is standard Windows CPython venv
# behaviour — BUT it means every gateway run produces two pythonw
# PIDs with identical command lines (one launcher stub, one actual
# interpreter) which is confusing in ``gateway status`` output.
# Filter the stub: if a PID in our result is the PARENT of another
# PID in our result, and both are pythonw.exe, the parent is the
# launcher stub — drop it, keep the child.
if is_windows() and len(pids) > 1:
pids = _filter_venv_launcher_stubs(pids)
return pids
def _filter_venv_launcher_stubs(pids: list[int]) -> list[int]:
"""Drop venv-launcher ``pythonw.exe`` stubs that are parents of the real
interpreter process. See comment at the tail of ``_scan_gateway_pids``.
Uses ``psutil`` (core dependency). Safe on any platform; only invoked
on Windows by the caller because the stub pattern is Windows-specific.
"""
try:
import psutil # type: ignore
except ImportError:
return pids
pid_set = set(pids)
# Collect each PID's parent so we can flag "child of another matched PID".
parent_of: dict[int, int | None] = {}
for pid in pids:
try:
parent_of[pid] = psutil.Process(pid).ppid()
except (psutil.NoSuchProcess, psutil.AccessDenied):
parent_of[pid] = None
# For each child whose parent is also in our set, drop the parent.
drop: set[int] = set()
for pid, ppid in parent_of.items():
if ppid is not None and ppid in pid_set:
drop.add(ppid)
return [p for p in pids if p not in drop]
def find_gateway_pids(exclude_pids: set | None = None, all_profiles: bool = False) -> list:
"""Find PIDs of running gateway processes.

View file

@ -1161,6 +1161,115 @@ function Install-NodeDeps {
}
}
function Install-PlatformSdks {
# Ensure messaging-platform SDKs matching tokens the user added to
# ~/.hermes/.env are importable. Two problems this solves:
#
# 1. The tiered `uv pip install` cascade above can fall through to a
# lower tier when the first fails (common when RL git deps choke),
# which silently skips some messaging SDKs from [messaging].
# 2. `uv` creates the venv without pip. If a messaging SDK ends up
# missing, the user can't `pip install python-telegram-bot` to
# recover — pip simply isn't in their venv.
#
# Strategy: bootstrap pip via `python -m ensurepip` (idempotent), then
# for each token set in .env, verify the matching SDK imports. If not,
# run one targeted `pip install` as last-chance recovery. Keeps fresh
# Windows installs from hitting silent "python-telegram-bot not installed"
# at runtime.
if ($NoVenv) {
Write-Info "Skipping platform-SDK verification (-NoVenv: no venv to bootstrap)"
return
}
$pythonExe = "$InstallDir\venv\Scripts\python.exe"
if (-not (Test-Path $pythonExe)) {
Write-Warn "Skipping platform-SDK verification: $pythonExe not found"
return
}
$envPath = "$HermesHome\.env"
if (-not (Test-Path $envPath)) { return }
$envLines = Get-Content $envPath -ErrorAction SilentlyContinue
# Map: env var set in .env -> (import name, pip spec matching [messaging] extra).
# Specs mirror pyproject.toml to avoid version drift.
$sdkMap = @(
@{ Var = "TELEGRAM_BOT_TOKEN"; Import = "telegram"; Spec = "python-telegram-bot[webhooks]>=22.6,<23" },
@{ Var = "DISCORD_BOT_TOKEN"; Import = "discord"; Spec = "discord.py[voice]>=2.7.1,<3" },
@{ Var = "SLACK_BOT_TOKEN"; Import = "slack_sdk"; Spec = "slack-sdk>=3.27.0,<4" },
@{ Var = "SLACK_APP_TOKEN"; Import = "slack_bolt";Spec = "slack-bolt>=1.18.0,<2" },
@{ Var = "WHATSAPP_ENABLED"; Import = "qrcode"; Spec = "qrcode>=7.0,<8" }
)
# Which tokens are actually set (not placeholder)?
$needed = @()
foreach ($sdk in $sdkMap) {
$match = $envLines | Where-Object {
$_ -match ("^" + [regex]::Escape($sdk.Var) + "=.+") `
-and $_ -notmatch "your-token-here" `
-and $_ -notmatch "^\s*#"
}
if ($match) { $needed += $sdk }
}
if ($needed.Count -eq 0) { return }
Write-Host ""
Write-Info "Verifying platform SDKs for tokens found in $envPath ..."
# Verify each SDK's import without triggering side-effect imports.
# Quirk: PowerShell wraps non-zero-exit native stderr as a
# NativeCommandError that prints even with `2>$null` / `*> $null`
# unless we set $ErrorActionPreference to SilentlyContinue for the
# span. Save + restore rather than nuking globally.
$prevEAP = $ErrorActionPreference
$ErrorActionPreference = "SilentlyContinue"
try {
$missing = @()
foreach ($sdk in $needed) {
& $pythonExe -c "import $($sdk.Import)" 2>&1 | Out-Null
if ($LASTEXITCODE -ne 0) {
$missing += $sdk
Write-Warn " $($sdk.Import) NOT importable (needed for $($sdk.Var))"
} else {
Write-Success " $($sdk.Import) OK"
}
}
} finally {
$ErrorActionPreference = $prevEAP
}
if ($missing.Count -eq 0) { return }
# Bootstrap pip into the venv if it isn't there. `uv` creates venvs
# without pip; ensurepip is the stdlib-blessed way to add it.
$prevEAP = $ErrorActionPreference
$ErrorActionPreference = "SilentlyContinue"
try {
& $pythonExe -m pip --version 2>&1 | Out-Null
if ($LASTEXITCODE -ne 0) {
Write-Info "Bootstrapping pip into venv (uv doesn't ship pip)..."
& $pythonExe -m ensurepip --upgrade 2>&1 | Out-Null
if ($LASTEXITCODE -ne 0) {
Write-Warn "ensurepip failed — can't auto-install missing SDKs."
Write-Info "Manual recovery: $UvCmd pip install `"$($missing[0].Spec)`""
return
}
}
foreach ($sdk in $missing) {
Write-Info " Installing $($sdk.Spec) ..."
& $pythonExe -m pip install $sdk.Spec 2>&1 | ForEach-Object { Write-Host " $_" }
if ($LASTEXITCODE -eq 0) {
Write-Success " Installed $($sdk.Import)"
} else {
Write-Warn " Failed to install $($sdk.Spec). Recover manually: $pythonExe -m pip install `"$($sdk.Spec)`""
}
}
} finally {
$ErrorActionPreference = $prevEAP
}
}
function Invoke-SetupWizard {
if ($SkipSetup) {
Write-Info "Skipping setup wizard (-SkipSetup)"
@ -1343,6 +1452,7 @@ function Main {
Set-PathVariable
Copy-ConfigTemplates
Invoke-SetupWizard
Install-PlatformSdks
Start-GatewayIfConfigured
Write-Completion