From 0548facc506ff6d19044be28a10879c188b55087 Mon Sep 17 00:00:00 2001
From: Teknium <127238744+teknium1@users.noreply.github.com>
Date: Fri, 8 May 2026 13:10:34 -0700
Subject: [PATCH] fix(windows): gateway status dedup + install.ps1 platform-SDK bootstrap
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Two residual Windows fixes that were left over from earlier commits.

### 1. `hermes gateway status` reported 2 PIDs per gateway: two bugs compounded

Diagnosed with a psutil parent/child walk against live gateway PIDs:

**Bug A (the real one): `_get_parent_pid` silently failed on Windows.**
The helper shelled out to `ps -o ppid= -p `, which doesn't exist on
Windows: `FileNotFoundError` → returns `None` → the ancestor walk
terminated at `os.getpid()` alone.

Consequence: the PID table scan in `_scan_gateway_pids` couldn't filter
out `hermes gateway status`'s own launcher stub (a venv
`pythonw.exe`/`python.exe` that matches the same
`-m hermes_cli.main gateway` pattern as the gateway). Every status call
saw "itself" as a second gateway.

Fix: `_get_parent_pid` now calls `psutil.Process(pid).ppid()` first
(psutil is a core dependency since 3dfb35700) and falls back to `ps`
only when `shutil.which("ps")` succeeds, matching the Windows-footgun
checker's "always guard `ps` / `wmic` / etc. with `shutil.which`" rule.

Before: `Gateway process running (PID: 21952, 46880)`, with 46880
changing on every call (it was the status invocation's own launcher,
which had exited by the time the next status call looked).

After (5 consecutive calls):

```
✓ Gateway process running (PID: 21952)
✓ Gateway process running (PID: 21952)
✓ Gateway process running (PID: 21952)
✓ Gateway process running (PID: 21952)
✓ Gateway process running (PID: 21952)
```

Ancestor walk with the fix: 14 PIDs (full chain through bash/explorer)
instead of the broken 1-PID set.

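A minimal sketch of the psutil-first lookup and the ancestor walk it feeds, for illustration only. `parent_pid` and `ancestor_pids` are hypothetical names, not the functions in `hermes_cli/gateway.py`, and the walk shown here is an assumption about how the caller collects the exclude set.

```python
import os
import shutil
import subprocess
from typing import Callable, Optional

try:
    import psutil  # optional in this sketch; the patch treats it as a core dependency
except ImportError:
    psutil = None


def parent_pid(pid: int) -> Optional[int]:
    """Hypothetical helper: psutil first, then `ps` only if it is on PATH."""
    if pid <= 1:
        return None
    if psutil is not None:
        try:
            return psutil.Process(pid).ppid() or None
        except psutil.Error:
            return None
    # Guard the POSIX-only fallback so Windows never tries to spawn `ps`.
    if not shutil.which("ps"):
        return None
    try:
        result = subprocess.run(
            ["ps", "-o", "ppid=", "-p", str(pid)],
            capture_output=True, text=True, timeout=5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    text = result.stdout.strip()
    return int(text) if text.isdigit() else None


def ancestor_pids(
    start: int, lookup: Callable[[int], Optional[int]] = parent_pid
) -> set[int]:
    """Collect `start` and every ancestor the lookup can resolve."""
    seen: set[int] = set()
    pid: Optional[int] = start
    # The `seen` check also stops the walk if PID reuse creates a cycle.
    while pid is not None and pid not in seen:
        seen.add(pid)
        pid = lookup(pid)
    return seen


if __name__ == "__main__":
    print(sorted(ancestor_pids(os.getpid())))
```

With a lookup that always returns `None` (the pre-fix behaviour on Windows) the walk yields only the starting PID, which is why the status call's own launcher could not be excluded.
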
**Bug B (the cosmetic one): venv-launcher dedup.** Standard Windows
CPython venv behaviour is that `/Scripts/pythonw.exe` is a ~5 MB
launcher stub that spawns the base Python
(`C:\\Program Files\\Python311\\pythonw.exe`) with the same command line
and waits. Our process scanner sees two PIDs for every gateway:
launcher + interpreter, same cmdline. Bug A masked this by accidentally
counting the status call as one of them; with Bug A fixed, we see both
the real launcher and the real interpreter for the gateway process
itself.

Fix: `_filter_venv_launcher_stubs` at the tail of `_scan_gateway_pids`
walks each matched PID's ppid via psutil. Any PID that's the parent of
another matched PID is a launcher stub: drop it, keep the child. Scoped
to Windows (`is_windows() and len(pids) > 1`) and no-ops when psutil
isn't importable.

Net effect: `gateway status` now reports one PID per gateway (the
interpreter), matching POSIX behaviour and user expectations.

### 2. `install.ps1`: bootstrap pip + auto-install platform SDKs

New `Install-PlatformSdks` function wired between `Invoke-SetupWizard`
and `Start-GatewayIfConfigured`. Fixes two related issues on fresh
Windows installs:

1. The tiered `uv pip install` cascade (introduced in 87fca8342)
   correctly falls through when tier 1 `.[all]` fails on the RL git
   deps, but the fallback tiers can silently skip SDKs from
   `[messaging]` when there's a partial resolve. Result: user sets
   `DISCORD_BOT_TOKEN` in `.env`, fires up the gateway, hits "discord
   module not installed".
2. `uv` creates venvs WITHOUT pip by default, so the user's escape
   hatch (`pip install discord.py` in the venv) doesn't exist either.

The new function:

- Skips if `-NoVenv` (nothing to bootstrap into).
- Scans `~/.hermes/.env` for messaging tokens (TELEGRAM_BOT_TOKEN,
  DISCORD_BOT_TOKEN, SLACK_BOT_TOKEN, SLACK_APP_TOKEN,
  WHATSAPP_ENABLED), filtering placeholder values.
- For each token that's set, runs `python -c "import "` to verify.
- If any import fails: runs `python -m ensurepip --upgrade` to bootstrap
  pip into the venv (idempotent: no-ops if pip is already present), then
  `pip install ` for each missing SDK with specs mirroring
  pyproject.toml's `[messaging]` extra to avoid version drift.

The `$ErrorActionPreference = "SilentlyContinue"` spans are not
cosmetic: PowerShell wraps native stderr from a non-zero-exit subprocess
as a `NativeCommandError` that prints even through `*> $null` /
`2>$null`. Saving and restoring EAP around the import-probe and
pip-install blocks keeps the output clean.

Verified on this Windows 10 box:

- Initial state: telegram+fastapi+psutil present, discord+slack_sdk
  missing (tier 1 `.[all]` had failed; `.tirith-install-failed` marker
  in `%LOCALAPPDATA%\\hermes`).
- First run with discord+slack tokens in .env: detects both missing,
  skips ensurepip (pip was already bootstrapped earlier this session for
  telegram), installs `discord.py[voice]==2.7.1` + `PyNaCl` + `davey`,
  installs `slack-sdk==3.41.0`. All imports succeed on verify.
- Second run: all three SDKs report OK, function no-ops.

Pip spec strings mirror pyproject.toml's `[messaging]` extra verbatim.
They are hard-coded in `$sdkMap`, so a bump to the extra has to be
copied here to avoid drift.

### Files

- `hermes_cli/gateway.py`: `_get_parent_pid` rewritten (psutil-first);
  `_filter_venv_launcher_stubs` added; `_scan_gateway_pids` dedups
  launchers on Windows when it finds >1 match.
- `scripts/install.ps1`: new `Install-PlatformSdks` function (~85
  lines); wired into the main flow at line 1438.

### Verification

- `venv/Scripts/python.exe scripts/check-windows-footguns.py --all` →
  `✓ No Windows footguns found (380 file(s) scanned).`
- `ast.parse` passes on gateway.py.
- `[System.Management.Automation.Language.Parser]::ParseFile` passes on
  install.ps1.
- Live gateway (PID 21952, running since 12:33 today) survived 5x stress
  loop of `hermes gateway status` without dying.

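For readers who don't read PowerShell, the `.env` scan and import probe from section 2 can be sketched in Python. This is an illustration under stated assumptions, not the installer's code: `SDK_IMPORTS`, `tokens_set`, and `missing_imports` are hypothetical names, the probe uses `importlib.util.find_spec` in the current interpreter instead of spawning the venv's `python -c "import ..."`, and matching is case-sensitive here whereas PowerShell's `-match` is not.

```python
import importlib.util
import re
from typing import Callable, Iterable, Optional

# Hypothetical mirror of the installer's env-var -> import-name table.
SDK_IMPORTS = {
    "TELEGRAM_BOT_TOKEN": "telegram",
    "DISCORD_BOT_TOKEN": "discord",
    "SLACK_BOT_TOKEN": "slack_sdk",
    "SLACK_APP_TOKEN": "slack_bolt",
    "WHATSAPP_ENABLED": "qrcode",
}


def tokens_set(env_lines: Iterable[str]) -> list[str]:
    """Return the env vars that have a non-empty, non-placeholder value."""
    lines = list(env_lines)
    found = []
    for var in SDK_IMPORTS:
        # Anchored at line start, so commented-out entries never match.
        pattern = re.compile("^" + re.escape(var) + "=.+")
        if any(pattern.match(ln) and "your-token-here" not in ln for ln in lines):
            found.append(var)
    return found


def missing_imports(
    env_lines: Iterable[str],
    is_importable: Optional[Callable[[str], bool]] = None,
) -> list[str]:
    """Return import names needed by the set tokens but not importable."""
    if is_importable is None:
        # find_spec locates a top-level module without executing it.
        is_importable = lambda name: importlib.util.find_spec(name) is not None
    return [
        SDK_IMPORTS[var]
        for var in tokens_set(env_lines)
        if not is_importable(SDK_IMPORTS[var])
    ]


if __name__ == "__main__":
    sample = ["DISCORD_BOT_TOKEN=abc123", "TELEGRAM_BOT_TOKEN=your-token-here"]
    print(missing_imports(sample))
```

Passing the probe in as a callable keeps the scan logic testable without any SDK installed; the real script instead checks `$LASTEXITCODE` after each import attempt.
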
---
 hermes_cli/gateway.py |  63 +++++++++++++++++++++++-
 scripts/install.ps1   | 110 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 172 insertions(+), 1 deletion(-)

diff --git a/hermes_cli/gateway.py b/hermes_cli/gateway.py
index 01fbf9e745..8f020f17d5 100644
--- a/hermes_cli/gateway.py
+++ b/hermes_cli/gateway.py
@@ -131,9 +131,26 @@ def _get_service_pids() -> set:
 
 
 def _get_parent_pid(pid: int) -> int | None:
-    """Return the parent PID for ``pid``, or ``None`` when unavailable."""
+    """Return the parent PID for ``pid``, or ``None`` when unavailable.
+
+    Uses psutil (core dependency) which works on every platform. The
+    older implementation shelled out to ``ps -o ppid= -p ``, which
+    silently fails on Windows (no ``ps``) so the ancestor walk terminated
+    at self — the caller's dedup / exclude logic then couldn't distinguish
+    "hermes CLI that invoked this scan" from "real gateway process".
+    """
     if pid <= 1:
         return None
+    try:
+        import psutil  # type: ignore
+        return psutil.Process(pid).ppid() or None
+    except ImportError:
+        pass
+    except Exception:
+        return None
+    # Fallback: shell out to ps (POSIX only — bare ``ps`` doesn't exist on Windows).
+    if not shutil.which("ps"):
+        return None
     try:
         result = subprocess.run(
             ["ps", "-o", "ppid=", "-p", str(pid)],
@@ -416,9 +433,53 @@ def _scan_gateway_pids(exclude_pids: set[int], all_profiles: bool = False) -> li
     except (OSError, subprocess.TimeoutExpired):
         return []
 
+    # Windows-specific: collapse venv launcher stubs. A venv-built
+    # ``pythonw.exe`` in ``/Scripts/`` is a ~100 KB launcher exe
+    # that spawns the base Python (e.g. ``C:\Program Files\Python311\
+    # pythonw.exe``) with the same command line, preserving the venv's
+    # ``pyvenv.cfg`` context. This is standard Windows CPython venv
+    # behaviour — BUT it means every gateway run produces two pythonw
+    # PIDs with identical command lines (one launcher stub, one actual
+    # interpreter) which is confusing in ``gateway status`` output.
+    # Filter the stub: if a PID in our result is the PARENT of another
+    # PID in our result, and both are pythonw.exe, the parent is the
+    # launcher stub — drop it, keep the child.
+    if is_windows() and len(pids) > 1:
+        pids = _filter_venv_launcher_stubs(pids)
+
     return pids
 
 
+def _filter_venv_launcher_stubs(pids: list[int]) -> list[int]:
+    """Drop venv-launcher ``pythonw.exe`` stubs that are parents of the real
+    interpreter process. See comment at the tail of ``_scan_gateway_pids``.
+
+    Uses ``psutil`` (core dependency). Safe on any platform; only invoked
+    on Windows by the caller because the stub pattern is Windows-specific.
+    """
+    try:
+        import psutil  # type: ignore
+    except ImportError:
+        return pids
+
+    pid_set = set(pids)
+    # Collect each PID's parent so we can flag "child of another matched PID".
+    parent_of: dict[int, int | None] = {}
+    for pid in pids:
+        try:
+            parent_of[pid] = psutil.Process(pid).ppid()
+        except (psutil.NoSuchProcess, psutil.AccessDenied):
+            parent_of[pid] = None
+
+    # For each child whose parent is also in our set, drop the parent.
+    drop: set[int] = set()
+    for pid, ppid in parent_of.items():
+        if ppid is not None and ppid in pid_set:
+            drop.add(ppid)
+
+    return [p for p in pids if p not in drop]
+
+
 def find_gateway_pids(exclude_pids: set | None = None, all_profiles: bool = False) -> list:
     """Find PIDs of running gateway processes.
 
diff --git a/scripts/install.ps1 b/scripts/install.ps1
index 2f24ea8970..e16d083f15 100644
--- a/scripts/install.ps1
+++ b/scripts/install.ps1
@@ -1161,6 +1161,115 @@ function Install-NodeDeps {
     }
 }
 
+function Install-PlatformSdks {
+    # Ensure messaging-platform SDKs matching tokens the user added to
+    # ~/.hermes/.env are importable. Two problems this solves:
+    #
+    # 1. The tiered `uv pip install` cascade above can fall through to a
+    #    lower tier when the first fails (common when RL git deps choke),
+    #    which silently skips some messaging SDKs from [messaging].
+    # 2. `uv` creates the venv without pip. If a messaging SDK ends up
+    #    missing, the user can't `pip install python-telegram-bot` to
+    #    recover — pip simply isn't in their venv.
+    #
+    # Strategy: bootstrap pip via `python -m ensurepip` (idempotent), then
+    # for each token set in .env, verify the matching SDK imports. If not,
+    # run one targeted `pip install` as last-chance recovery. Keeps fresh
+    # Windows installs from hitting silent "python-telegram-bot not installed"
+    # at runtime.
+    if ($NoVenv) {
+        Write-Info "Skipping platform-SDK verification (-NoVenv: no venv to bootstrap)"
+        return
+    }
+
+    $pythonExe = "$InstallDir\venv\Scripts\python.exe"
+    if (-not (Test-Path $pythonExe)) {
+        Write-Warn "Skipping platform-SDK verification: $pythonExe not found"
+        return
+    }
+
+    $envPath = "$HermesHome\.env"
+    if (-not (Test-Path $envPath)) { return }
+    $envLines = Get-Content $envPath -ErrorAction SilentlyContinue
+
+    # Map: env var set in .env -> (import name, pip spec matching [messaging] extra).
+    # Specs mirror pyproject.toml to avoid version drift.
+    $sdkMap = @(
+        @{ Var = "TELEGRAM_BOT_TOKEN"; Import = "telegram";  Spec = "python-telegram-bot[webhooks]>=22.6,<23" },
+        @{ Var = "DISCORD_BOT_TOKEN";  Import = "discord";   Spec = "discord.py[voice]>=2.7.1,<3" },
+        @{ Var = "SLACK_BOT_TOKEN";    Import = "slack_sdk"; Spec = "slack-sdk>=3.27.0,<4" },
+        @{ Var = "SLACK_APP_TOKEN";    Import = "slack_bolt";Spec = "slack-bolt>=1.18.0,<2" },
+        @{ Var = "WHATSAPP_ENABLED";   Import = "qrcode";    Spec = "qrcode>=7.0,<8" }
+    )
+
+    # Which tokens are actually set (not placeholder)?
+    $needed = @()
+    foreach ($sdk in $sdkMap) {
+        $match = $envLines | Where-Object {
+            $_ -match ("^" + [regex]::Escape($sdk.Var) + "=.+") `
+                -and $_ -notmatch "your-token-here" `
+                -and $_ -notmatch "^\s*#"
+        }
+        if ($match) { $needed += $sdk }
+    }
+    if ($needed.Count -eq 0) { return }
+
+    Write-Host ""
+    Write-Info "Verifying platform SDKs for tokens found in $envPath ..."
+
+    # Verify each SDK's import without triggering side-effect imports.
+    # Quirk: PowerShell wraps non-zero-exit native stderr as a
+    # NativeCommandError that prints even with `2>$null` / `*> $null`
+    # unless we set $ErrorActionPreference to SilentlyContinue for the
+    # span. Save + restore rather than nuking globally.
+    $prevEAP = $ErrorActionPreference
+    $ErrorActionPreference = "SilentlyContinue"
+    try {
+        $missing = @()
+        foreach ($sdk in $needed) {
+            & $pythonExe -c "import $($sdk.Import)" 2>&1 | Out-Null
+            if ($LASTEXITCODE -ne 0) {
+                $missing += $sdk
+                Write-Warn " $($sdk.Import) NOT importable (needed for $($sdk.Var))"
+            } else {
+                Write-Success " $($sdk.Import) OK"
+            }
+        }
+    } finally {
+        $ErrorActionPreference = $prevEAP
+    }
+    if ($missing.Count -eq 0) { return }
+
+    # Bootstrap pip into the venv if it isn't there. `uv` creates venvs
+    # without pip; ensurepip is the stdlib-blessed way to add it.
+    $prevEAP = $ErrorActionPreference
+    $ErrorActionPreference = "SilentlyContinue"
+    try {
+        & $pythonExe -m pip --version 2>&1 | Out-Null
+        if ($LASTEXITCODE -ne 0) {
+            Write-Info "Bootstrapping pip into venv (uv doesn't ship pip)..."
+            & $pythonExe -m ensurepip --upgrade 2>&1 | Out-Null
+            if ($LASTEXITCODE -ne 0) {
+                Write-Warn "ensurepip failed — can't auto-install missing SDKs."
+                Write-Info "Manual recovery: $UvCmd pip install `"$($missing[0].Spec)`""
+                return
+            }
+        }
+
+        foreach ($sdk in $missing) {
+            Write-Info " Installing $($sdk.Spec) ..."
+            & $pythonExe -m pip install $sdk.Spec 2>&1 | ForEach-Object { Write-Host " $_" }
+            if ($LASTEXITCODE -eq 0) {
+                Write-Success " Installed $($sdk.Import)"
+            } else {
+                Write-Warn " Failed to install $($sdk.Spec). Recover manually: $pythonExe -m pip install `"$($sdk.Spec)`""
+            }
+        }
+    } finally {
+        $ErrorActionPreference = $prevEAP
+    }
+}
+
 function Invoke-SetupWizard {
     if ($SkipSetup) {
         Write-Info "Skipping setup wizard (-SkipSetup)"
@@ -1343,6 +1452,7 @@ function Main {
     Set-PathVariable
     Copy-ConfigTemplates
     Invoke-SetupWizard
+    Install-PlatformSdks
     Start-GatewayIfConfigured
 
     Write-Completion