mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-10 08:32:09 +00:00
178 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
c9b62061d4 |
fix(cli): launchd KeepAlive unconditional restart (#37388)
Replace KeepAlive.SuccessfulExit=false dict with <key>KeepAlive</key><true/> so launchd restarts hermes-gateway on any exit, matching the documented drain-then-exit restart protocol used by --graceful-restart. |
||
|
|
d4b533de4e |
fix: batch of small robustness/correctness fixes from @kyssta-exe
Salvages 8 distinct fixes from a batch of PRs by @kyssta-exe, reapplied
onto current main (original branches were stale) with a few refinements.
- cron(jobs.py): load_jobs() validates top-level JSON shape — a bare
list auto-repairs into the {"jobs": [...]} dict; scalars/null raise a
clear RuntimeError instead of an uncaught AttributeError that took
down the whole cron subsystem (#37065, closes #36867).
- web(web_server.py): close the per-action log file handle after Popen
so the parent stops leaking one fd per spawned action (#36843).
- web(web_server.py): DELETE /api/env returns 400 for invalid key names
instead of a misleading 500, mirroring PUT /api/env (#36840).
- gateway(gateway.py): read /proc/<pid>/cmdline inside a with-block so
the fd is released immediately instead of relying on GC (#36804).
- web-tools(web_tools.py): include "xai" in check_web_api_key() so a
configured X.AI web backend reports as available (#36802).
- compression(conversation_compression.py): mark the feasibility check
done only after it completes, and default the gate to "not checked"
if the attribute is missing (#36803).
- completion(completion.py): replace `ls` with directory globbing in the
generated bash/zsh/fish profile listers — handles names with spaces
and skips non-directory entries (#36806).
- terminal-tool(terminal_tool.py): drop a duplicate `import threading`
(#36808).
- claw(claw.py): the migrate recommendation now points at the real
`hermes gateway stop` command instead of the non-existent
`hermes stop` (#36795, #36796, closes #36771).
- tests: guard against a leaked HERMES_CRON_SESSION breaking gateway
approval tests — add it to the hermetic conftest unset list (root
cause, protects every test) and pop it in the affected test's
setup_method (#36796).
Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>
|
||
|
|
eee32cdd52
|
fix(gateway): fall back to in-process heartbeat when s6 sleep is missing (#36208) (#37120)
Inside an s6 container, `gateway run` redirects to the supervised
gateway and then keeps the CMD process alive as a no-op heartbeat so
/init doesn't start stage-3 shutdown. That heartbeat is
`os.execvp("sleep", ["sleep", "infinity"])`, which does a PATH lookup
for the `sleep` binary. When PATH was empty/truncated/clobbered at that
point — e.g. after user customizations rewrote PATH, or on a minimal
image without `sleep` on PATH — the exec raised FileNotFoundError,
killing the CMD process and causing /init to tear down every service:
the container failed to start (issue #36208, a regression in the s6
image from 2026.5.28).
Wrap the exec in try/except OSError: on success it still replaces the
process with the cheap `sleep` heartbeat (no resident Python
interpreter, and the existing process-tree/recursion contract is
preserved); on failure it falls back to `_block_until_terminated()` —
a SIGTERM handler (clean 128+signum exit on `docker stop`) plus a
signal.pause() loop, which needs no external binary and so can't fail
on PATH state. A threading.Event().wait() fallback covers platforms
without signal.pause().
Keeping execvp as the primary path (rather than replacing it outright)
preserves the `sleep infinity` heartbeat that the docker integration
tests assert (test_gateway_run_supervised.py) and avoids leaving a
full Python interpreter resident for the container's lifetime.
Verified end-to-end on a built image: with execvp forced to fail,
_block_until_terminated() blocks cleanly instead of raising
FileNotFoundError; normal boot still runs the cheap `sleep infinity`
heartbeat; the 6 test_gateway_run_supervised.py integration tests pass.
Salvages the two community fixes for this issue — the fallback design
from #36221 (@Pluviobyte) and the signal.pause() heartbeat from #36267
(@karmeleon) — and adds regression tests for both the normal and
sleep-missing paths.
Co-authored-by: Pluviobyte <Pluviobyte@users.noreply.github.com>
Co-authored-by: karmeleon <karmeleon@users.noreply.github.com>
Closes #36208.
|
||
|
|
b14e15c48e | fix(gateway): clean service restart notifications | ||
|
|
51c68d4ab1
|
Add Hermes desktop app (#20059)
* feat: better composer etc * docs: add desktop and dashboard run instructions * fix(desktop): address security scan findings * fix(dashboard): resolve @nous-research/ui path under npm workspaces The sync-assets prebuild step shelled out to 'cp -r node_modules/@nous-research/ui/dist/fonts ...' with a path relative to apps/dashboard/. That works only when the dep is installed locally in the dashboard workspace, but 'npm install' at the repo root (the documented setup — see apps/desktop/README.md) hoists shared deps to the root node_modules under npm workspaces. The relative cp then fails with 'No such file or directory', sync-assets exits 1, the Vite build aborts, and 'hermes dashboard' surfaces a generic 'Web UI build failed' message. Replace the shell one-liner with scripts/sync-assets.cjs, which walks up from the dashboard directory looking for node_modules/ @nous-research/ui — working in both the hoisted (workspaces) and co-located (standalone) layouts. Also guards against a missing dist/fonts or dist/assets with a clearer error pointing at a rebuild of the UI package rather than silently copying nothing. * feat(desktop): support connecting to a remote Hermes backend Add HERMES_DESKTOP_REMOTE_URL and HERMES_DESKTOP_REMOTE_TOKEN env vars that, when set, short-circuit the local-child spawn in startHermes() and connect the Electron renderer to an already- running 'hermes dashboard' server reachable over the network. Motivating use case: WSL2 users who want to run the Hermes core (agent loop, tools, filesystem access) inside their WSL distribution while rendering the Electron GUI on native Windows. Before this change, the desktop app always spawned a local Python child on the same host as the renderer, which doesn't cross the WSL/Windows boundary. The remote path reuses waitForHermes() as a liveness probe (/api/status is in the backend's public endpoint allowlist), so the connection is only returned once the backend is actually ready. WebSocket URL derivation picks ws:// or wss:// based on the input scheme. URL validation rejects non-http(s) schemes and requires both env vars together to avoid a half-configured connection that would silently fall through to the spawn path. No behaviour change when the env vars are unset — the default local-spawn flow is untouched. Typical usage: # in WSL2 hermes dashboard --tui --no-open --host 0.0.0.0 --port 9119 --insecure # on Windows set HERMES_DESKTOP_REMOTE_URL=http://localhost:9119 set HERMES_DESKTOP_REMOTE_TOKEN=<session token> set HERMES_DESKTOP_IGNORE_EXISTING=1 (launch Hermes desktop) * ci(desktop): automate desktop releases Add GitHub Actions release channels for signed desktop installers and document the stable/nightly download paths. * feat: file tabs * refactor(desktop): tighten right-rail tab close API Promote closeRightRailTab/closeActiveRightRailTab as the single public entry point. Drops the activeTabRef + handleCloseDocument indirection in ChatPreviewRail, the unused $rightRailHasContent atom, and the legacy dismissFilePreviewTarget alias. -70 LOC. * feat(desktop): polish composer pill toward reference look Solid foreground-on-background send/voice-conversation circle (black-on-white in light, white-on-black in dark) anchors the right edge as the primary CTA instead of the orange theme primary. Bumps the primary control to 2.125rem so it visually outranks the ghost mic/plus controls. Opens up the surface padding (0.625rem x / 0.5rem y) so the input row breathes around its controls, and nudges the corner radius from 20 to 24px for a slightly pill-ier silhouette. LiquidGlass distortion is preserved. * feat(desktop): add startup and onboarding flow Add phase-based desktop boot progress, fresh-install sandbox testing, and first-run provider credential onboarding so packaged installs can start cleanly without manual settings detours. * fix(desktop): gate prompts on provider setup Show the desktop provider onboarding flow before prompt submission when no inference provider is configured, preventing fresh installs from falling through to backend credential errors. * fix(desktop): surface provider onboarding from session warnings Propagate credential warnings through session runtime info and open desktop onboarding whenever a session reports no usable provider, so unconfigured installs cannot fall through to prompt errors. * fix(desktop): route gateway provider errors to onboarding The "No inference provider configured" auth error reaches the renderer through gateway error events, not the prompt.submit promise; the previous patch only caught the latter, so the error toast still surfaced and onboarding never opened. Also strip credential-shaped env vars from the test:desktop:fresh sandbox so the packaged backend can't see provider keys leaking from the launching shell. * fix(desktop): use strict runtime check to drive onboarding setup.status returned True whenever any provider auth state was discoverable, including indirect fallbacks like a gh-CLI Copilot token. That made desktop think the user was set up while the agent's actual resolve_runtime_provider call still raised AuthError, leaving the user with a useless toast and no onboarding. Add a setup.runtime_check gateway method that runs the same resolver the agent uses on session creation, and switch the desktop onboarding overlay and prompt precheck to use it. * feat(desktop): OAuth-first onboarding using existing dashboard provider API Replace the engineer-flavored API key form with a Sign-in-first onboarding overlay that uses the dashboard's existing /api/providers/oauth catalog and PKCE/device-code endpoints (Anthropic, Nous, OpenAI Codex, etc.). API key entry is now a fallback tab with friendly provider names instead of env var prefixes, and the loud raw resolver error is gone in favor of a one-line welcome message. * fix(desktop): polish onboarding provider list Reorder OAuth providers so Nous Portal is first, give the segmented Sign in / API key control equal column widths, and replace the engineer-flavored backend names like "Anthropic (Claude API)" / "MiniMax (OAuth)" with friendlier in-app titles. External-CLI providers now show a softer subtitle and an external-link icon instead of a chevron. * refactor(desktop): split onboarding overlay into store + view Move the OAuth state machine, runtime check, copy-to-clipboard, and api-key save into store/onboarding.ts (matching the boot.ts pattern), leaving the overlay as a presentation layer that subscribes via useStore. Tabs are now table-driven, child panels read flow from the store instead of prop-drilling, and the polling/PKCE/error/success branches share a small Status atom. * fix(desktop): external CLI providers + center mode tabs External-CLI providers (Claude Code, Qwen Code) now open an in-overlay panel with the CLI command, copy button, and an "I've signed in" recheck instead of firing an invisible toast. Center the Sign in / API key tab control so it sits under the heading instead of hugging the left edge. * fix(desktop): drop onboarding tabs for an inline link, group device-code waiting state Replace the Sign in / API key tab pair with an "I have an API key" footer link under the OAuth provider list, with a "Back to sign in" affordance inside the API key form. Group the device-code "Waiting for you to authorize..." status next to the Cancel button so the alignment matches the action. * refactor(desktop): tighten onboarding store + overlay Drop the dead isOnboardingBusy/BUSY set, factor the catch-fallback dance into safeReq, and share a single reloadAndConnect helper between PKCE submit, device-code success, external recheck, and api-key save. In the overlay, extract Step / CodeBlock / FlowFooter / CancelBtn / DocsLink atoms so the four sign-in panels share the same chrome instead of repeating it inline. Net effect: fewer literal divs, one place to touch the spacing, and the code-block + footer rows are reusable across future flows. * fix(desktop): mount onboarding from frame 1 to kill the FOUT Default onboarding.configured to null (unknown until the runtime check resolves) and have the onboarding overlay render whenever it's not yet confirmed true. The boot overlay now yields to it, so the very first paint is the Welcome card with a "While we get you set up..." progress strip instead of a flash of the chat shell between boot dismiss and onboarding mount. The picker swaps in cleanly once the gateway opens and the runtime check confirms the user is not configured. Already-configured users see the same prep card briefly while their existing runtime warms up, then the overlay dismisses without touching the chat shell. * fix(desktop): top-align empty sessions placeholder The "Start a chat to build your history." empty state used a min-h-35 grid place-items-center container, which floated the text in a tall dead zone. Render it as a flat paragraph that sits right under the section header like the empty pinned state does. * refactor(desktop): drop dead boot overlay Onboarding overlay subsumes the boot card now that it mounts from frame 1 and renders boot progress inline. The standalone DesktopBootOverlay is unreachable in every flow (yields whenever onboarding has not confirmed configured, dismisses once it has). * fix(desktop): hide pinned/recents sections until first session A fresh sidebar showed the Pinned and Recent chats headers with floating empty-state copy underneath. Drop both sections (and the now-orphan SidebarEmptySessionState) when there are no sessions yet — they reappear after the first chat. Skeletons during initial load are unchanged. * feat(gui): route embedded TUI through dashboard gateway (#21979) Inject HERMES_TUI_GATEWAY_URL into dashboard PTY sessions so embedded ui-tui instances attach to the in-process websocket gateway, with coverage for the new env wiring. * Add desktop remote gateway settings Make the desktop gateway connection configurable from settings so local remains the default while remote backends can be saved, tested, and applied without environment variables. * feat(gui): first-class Messaging page + gateway menu redesign - Add Messaging page to the desktop app with per-platform setup, status, and inline guidance. Catalog derives from gateway.config Platform enum + plugin registry, so every messaging adapter the CLI supports (Telegram, Discord, Slack, Mattermost, Matrix, WhatsApp, Signal, BlueBubbles, Home Assistant, Email, SMS, DingTalk, Feishu, WeCom, Weixin, QQ, Yuanbao, API server, Webhooks, plugins) shows up without per-platform code. - New REST endpoints: GET /api/messaging/platforms, PUT and POST /test on the same path. Secrets go through the existing .env pipeline; enable/disable writes config.yaml. - Replace gateway statusbar dropdown with a richer panel: status row, icon-only restart + system-panel actions, recent activity (with timestamps trimmed in display, full text on hover), platform list. - Auto-poll the messaging page every 6s (paused when hidden) so status updates without a manual check. - Drop Settings / Command Center from the sidebar nav (still reachable via shortcuts and the titlebar cog). - Flatten top corners on Messaging/Skills/Artifacts/Chat panes. - Share new StatusDot component across messaging + gateway menu. - Fix gateway/config.py so an explicit platforms.<name>.enabled=false in config.yaml is honored when env tokens are present. - pb-9 on the chat content area for breathing room above the composer. * Potential fix for pull request finding 'CodeQL / Clear-text logging of sensitive information' Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> * pin electron version * hide application menu on non-mac systems * interpret compactPreview for non-string vlaues as JSON or an empty string * fix(desktop): keep composer contenteditable mounted across stacked toggle The composer rendered {input} inside two different parent fragments depending on `stacked`. When auto-expand flipped `stacked` (e.g. the moment typed text wrapped past two lines), React reconciled the two branches as different positions and unmounted/remounted the contenteditable. The fresh mount started empty, so any in-flight characters — most reliably reproduced by holding a key — were lost. Replace the conditional with a single CSS Grid whose template-areas swap on `stacked`. The three children (menu, input, controls) keep stable identities across the toggle; only their grid placement changes, which the browser handles without React tearing down the editor. * refactor(desktop): align install layout with install.ps1 / install.sh Make the desktop app's runtime layout match what scripts/install.ps1 and scripts/install.sh produce, so a desktop-only user and a CLI-only user end up with the same files in the same places and can share one install. Layout - ACTIVE_HERMES_ROOT = HERMES_HOME/hermes-agent (was: process.resourcesPath/hermes-agent, read-only) - VENV_ROOT = HERMES_HOME/hermes-agent/venv (was: userData/hermes-runtime) - desktop.log = HERMES_HOME/logs/desktop.log (was: userData/desktop.log) - HERMES_HOME default: %LOCALAPPDATA%\hermes on Windows, ~/.hermes elsewhere The packaged .app/.exe still ships a read-only payload at process.resourcesPath/hermes-agent (FACTORY_HERMES_ROOT). On first launch or after an installer-driven upgrade we sync factory -> active, then provision the venv and run pip install -e . against the active root. Key behaviors - Pin HERMES_HOME in the spawned Python's env so get_hermes_home() resolves to the same path resolveHermesHome() picked. Without this, Python falls back to ~/.hermes on every platform - fine on mac/linux, a split-state bug on Windows where our default is %LOCALAPPDATA%\hermes. - Detect developer installs by .git presence at ACTIVE; never overwrite a user's checkout via factory sync. - Marker at ACTIVE/.hermes-desktop-runtime.json (schema v4) tracks pyproject hash + factory version + runtime schema version. depsFresh fast-paths when nothing changed. - Dev (npm run dev) prefers SOURCE_REPO_ROOT over ACTIVE so devs run their local edits, not whatever's under HERMES_HOME. - Better error messages distinguish "no payload" from "no Python". - Preserve a legacy ~/.hermes on Windows when no %LOCALAPPDATA%\hermes exists, so users with prior pip/manual installs aren't orphaned. pyproject.toml - Promote fastapi, uvicorn[standard], ptyprocess (non-Windows), and pywinpty (Windows) to main dependencies. The dashboard backend (hermes dashboard) needs them at runtime; the previous lazy-import fallback was a footgun for fresh installs. - Empty the [pty] optional-extra; kept as a no-op back-compat alias for any existing pip install hermes-agent[pty] invocations. Drops the hardcoded BUNDLED_RUNTIME_REQUIREMENTS list in main.cjs - the desktop now installs whatever pyproject.toml says, single source of truth. Files - apps/desktop/electron/main.cjs: runtime layout, HERMES_HOME pin, factory->active sync, marker v4 - apps/desktop/scripts/test-desktop.mjs: track new venv location - apps/desktop/README.md: new Setup, Runtime Bootstrap, and Debugging sections - pyproject.toml: fastapi/uvicorn/pty backends in main dependencies; [pty] extra emptied Tested locally on Windows: npm run dev boots cleanly, sessions land at the new location, type-check + lint + test:desktop:platforms all pass. Verified end-to-end on a fresh Win11 VM via dist:win installer. Known gaps (filed as follow-ups, not in this PR): - Skills not seeded on packaged installs (sync_skills only runs in cmd_chat, not cmd_dashboard). Need to move to shared pre-dispatch. - Git Bash not bundled or detected; agent's terminal tool errors out with a useful message but desktop bootstrapper should pre-flight it. - install.ps1 / install.sh should be decomposed into composable phase libraries so the desktop bootstrapper can reuse them as a single source of truth across all install surfaces. * feat(desktop): theme polish, prose chat typography, composer chrome - DS tokens/midground, Backdrop, scoped scrollbars, typography plugin + prose - Composer liquid/radius utilities, thread font parity, tool/thinking cues - File tree label scale, preview flex, thread retry loading + streaming tests * feat(desktop): NSIS prereq detection page + auto-install via winget The packaged Windows installer now detects Python 3.11+ and Git for Windows at install time and offers to install missing prereqs via winget. Mirrors the prereq logic scripts/install.ps1 already runs for CLI installs, so desktop installer users get the same out-of-the-box experience as install.ps1 users. Why - Hermes' terminal tool calls bash.exe directly (tools/environments/ local.py); on Windows that's Git Bash from Git for Windows. Without it, the agent fails on the first terminal() call. - Hermes' Python runtime needs 3.11+. Without it, the desktop bootstrapper errors out at venv creation. - Both gaps surfaced on a fresh Windows 11 VM smoke test: VM had Python pre-installed but no Git, so the agent's first terminal call failed with "Git Bash isn't installed." - install.ps1 has had Install-Git + Install-Uv functions for ages. The desktop installer was the asymmetric outlier. How — NSIS prereq page - New file: apps/desktop/installer/prereq-check.nsh (plugged into electron-builder via build.nsis.include) - Real Wizard page using nsDialogs, inserted via customPageAfterChangeDir hook (between the Directory page and InstFiles). - Group boxes for Python and Git, each showing detection status. - Pre-checked install checkboxes when winget is available. - Auto-skips silently if both prereqs are already installed. - Falls back to manual download URLs when winget itself is missing. - Detection: - Python: probes `py -3.11`/`-3.12`/`-3.13`/`-3.14` via the Python launcher. Microsoft Store "Python stub" (no py.exe) is correctly classified as not-installed. - Git: `where git`. - winget: `where winget` (Win10 1809+ / Win11 with App Installer). - Install execution (in customInstall macro): - Python: nsExec::ExecToLog with `--scope user --silent`. Per-user install, no UAC prompt, output streams to install log. - Git: ExecShellWait via Windows ShellExecute. Critical because Git always installs per-machine and triggers UAC; ShellExecute preserves the foreground focus chain across non-elevated → elevated process spawns, so UAC actually comes to the foreground. nsExec::ExecToLog breaks the chain because winget runs hidden. - Both pass `--disable-interactivity --accept-package-agreements --accept-source-agreements` to suppress winget's own dialogs. - Verification: probes Git's standard install locations via FileExists rather than `where git`. NSIS's process inherits PATH at startup, so a freshly-installed Git won't be visible to `where` until restart. - Silent installs (/S) skip the prompts; managed deploys handle prereqs out-of-band via Group Policy / Intune. How — Electron-side safety net - New findGitBash() in main.cjs, parallel to findSystemPython(). Probes the same locations as tools/environments/local.py:_find_bash() so a positive result here means the agent's terminal tool will work. - ensureRuntime now throws a clear, actionable error on Windows when Git Bash isn't found, matching the existing "Python 3.11+ is required" error path. - Catches users the NSIS page doesn't: .msi installer users (NSIS prereq page doesn't run for MSI), `npm run dev` users, manual installers, anyone who unchecked the install boxes on the NSIS prereq page. - All gated on `IS_WINDOWS`; macOS / Linux unaffected. NSIS build issue (resolved) - electron-builder defaults to `-WX` (warnings as errors). NSIS optimizer emits "warning 6010: function not referenced" for our page functions because Page custom directives don't count as references in its static-analysis pass. The functions ARE called at runtime when NSIS invokes the page; the optimizer just can't see it statically. - Set `build.nsis.warningsAsErrors=false` in package.json so this spurious warning doesn't fail the build. (Documented option from electron-builder's nsisOptions.) Out of scope (filed for future work) - MSI prereq detection: Windows Installer custom actions are a different mechanism. Enterprise deploys typically handle prereqs via GP/Intune. - Bundle PortableGit + python-build-standalone in extraResources for zero-network installs. ~80MB increase. - Mac / Linux GUI prereq flows (different installer formats; Xcode CLT covers most macOS prereqs already; Linux is per-distro hard). Files - apps/desktop/installer/prereq-check.nsh (new, ~290 lines NSIS) - apps/desktop/package.json (build.nsis.include + warningsAsErrors) - apps/desktop/electron/main.cjs (findGitBash + preflight) - apps/desktop/README.md (Runtime prerequisites section) Cross-platform impact - macOS / Linux builds (dist:mac, dist:mac:dmg, dist:mac:zip): nsis config is ignored entirely; .nsh is dormant. - npm run dev: .nsh dormant; main.cjs preflight gated on IS_WINDOWS. - scripts/install.ps1, scripts/install.sh: no reference to any new files; CLI install paths untouched. - Hermes CLI / dashboard / gateway: no reference; runtime untouched. - All checks: node --check on main.cjs and test-desktop.mjs pass; npm run test:desktop:platforms 4/4 passing; node --test green. Tested - npm run dist:win produces signed .exe and .msi without errors. - Fresh Win11 VM (Python pre-installed, no Git): prereq page renders, Python check shows detected, Git checkbox pre-checked. Click Next → Git installs via winget with UAC prompt in foreground. - After install completes, Hermes launches and the agent's terminal tool can run bash commands. Verified Git Bash is detected at `C:\Program Files\Git\bin\bash.exe` by ensureRuntime's preflight. * feat: theme changes, composer tweaks, in app update ux, finesse * fix(cli): seed bundled skills on dashboard + gateway entrypoints `sync_skills(quiet=True)` was only being called from inside `cmd_chat`, which meant `hermes dashboard` (the desktop GUI's backend) and `hermes gateway` (Telegram/Discord/Slack/etc daemons) never seeded the bundled skill library into ~/.hermes/skills/. This surfaced as "No skills found" in the desktop GUI's skills panel on fresh installs, despite the agent having access to the full bundled library when invoked via `hermes chat`. scripts/install.ps1 worked around it by running skills_sync.py as part of Copy-ConfigTemplates, but that's not part of the desktop installer's bootstrap chain. Fix - Extract the skills-sync block from cmd_chat into a module-level `_sync_bundled_skills_quietly()` helper. - Call the helper from cmd_chat (preserving existing behavior), cmd_dashboard (after the --status/--stop early-return paths and fastapi import check, so we don't run skills_sync on management commands or when deps aren't installed), and cmd_gateway. Why these three entrypoints - cmd_chat: the user's primary CLI entrypoint - cmd_dashboard: the desktop GUI's backend; this is what `hermes dashboard --tui` invokes when the desktop bootstrapper spawns Hermes - cmd_gateway: long-running daemons where the user expects the agent to have full skill access Other entrypoints (cmd_config, cmd_doctor, cmd_login, cmd_status, etc.) are management commands that don't need skill discovery and were never running skills_sync in the first place — leaving them alone. Idempotence - tools/skills_sync.py is manifest-based: skipped skills cost milliseconds. Calling it from multiple entrypoints adds no real cost, and users running `hermes chat` then `hermes dashboard` get two fast no-ops on the second call. Failure handling - Helper wraps skills_sync in try/except. Skills are an enhancement, not a hard dependency — Hermes runs fine with an empty skills/ dir. Files - hermes_cli/main.py: + new helper `_sync_bundled_skills_quietly()` at module level + cmd_chat: replace inline block with helper call + cmd_dashboard: add helper call after fastapi import succeeds + cmd_gateway: add helper call before delegating to gateway_command * feat(desktop): hoisted todo widget, JSON tool summaries, history grouping & timer fixes - Hoist todo to first-class widget (shadcn checkboxes, brand colors, no tool-accordion). Header derives label from active task; non-active rows fade. - Replace raw JSON dumps with structured key/value summaries via formatToolResultSummary; nested error extraction for clearer failures. - Fix loaded-session grouping: stitch interleaved assistant/tool iterations into one bubble instead of orphaned synthetic messages. - Stable tool/thinking timers via keyed registry so unmount/scroll doesn't reset elapsed counts; gate "running" on real live thread state. - Reorganize chat-only assistant-ui components under components/chat/. * fix(desktop): address CodeQL alerts on PR #20059 - settings/helpers.ts: harden setNested against prototype pollution. POLLUTING_PATH_PARTS check is now applied at every assignment site (loop + leaf) and uses Object.defineProperty so CodeQL can see the guard inline rather than via a helper function call. - lib/markdown-preprocess.ts: rebuild the dangling-fence close regex from a fence-char + length instead of marker.replace(...). The marker is captured by `(`{3,}|~{3,})` so it can only be backticks or tildes, but CodeQL was tracing tainted input text into the RegExp source and flagging hostname dots from input as part of the pattern (false positive js/incomplete-hostname-regexp on the test fixture URLs). Reconstructing from a literal char breaks the dataflow. - scripts/notarize-artifact.cjs: drop args from the run() rejection message. Args carry --key-id / --issuer / key file path; the existing outer catch already squashes errors to a generic line, but CodeQL was flagging the args.join(' ') as clear-text logging of APPLE_API_KEY_ID. Composer DOM-text-as-HTML alerts (composer/index.tsx:379, :547) are already addressed in |
||
|
|
bd72d333dc |
fix(gateway,cron): reuse existing _HERMES_GATEWAY marker; tighten cron regex
Some checks are pending
Deploy Site / deploy-vercel (push) Waiting to run
Deploy Site / deploy-docs (push) Waiting to run
Docker Build and Publish / build-amd64 (push) Waiting to run
Docker Build and Publish / build-arm64 (push) Waiting to run
Docker Build and Publish / merge (push) Blocked by required conditions
Lint (ruff + ty) / ruff + ty diff (push) Waiting to run
Lint (ruff + ty) / ruff enforcement (blocking) (push) Waiting to run
Lint (ruff + ty) / Windows footguns (blocking) (push) Waiting to run
Nix / nix (macos-latest) (push) Waiting to run
Nix / nix (ubuntu-latest) (push) Waiting to run
OSV-Scanner / Scan lockfiles (push) Waiting to run
Tests / test (1) (push) Waiting to run
Tests / test (2) (push) Waiting to run
Tests / test (3) (push) Waiting to run
Tests / test (4) (push) Waiting to run
Tests / test (5) (push) Waiting to run
Tests / test (6) (push) Waiting to run
Tests / save-durations (push) Blocked by required conditions
Tests / e2e (push) Waiting to run
uv.lock check / uv lock --check (push) Waiting to run
Follow-up to the salvaged #30728: - Gateway already exports _HERMES_GATEWAY=1 at startup (gateway/run.py) and cli.py already keys off it. Drop the redundant new HERMES_IN_GATEWAY var; guard stop/restart on _HERMES_GATEWAY instead. One marker for one fact. - Drop the greedy \bgateway.*restart alternation from the cron lifecycle filter — it false-positived on legit prompts that merely mention an unrelated gateway + a restart (API/payment gateway monitoring). The specific 'hermes gateway (restart|stop|start)' pattern already covers the real command. - Rework the two negative guard tests to sentinel the first downstream call so they don't drive real signal delivery (tripped the live-system guard). - Add false-positive regression cases to test_safe_commands. |
||
|
|
5cd6c1717d |
fix(gateway,cron): prevent agent restart loops via self-targeting gateway commands (#30719)
Three defenses against SIGTERM-respawn loops when agent schedules its own gateway restart under launchd/systemd KeepAlive: 1. HERMES_IN_GATEWAY env var: gateway sets it at startup; stop/restart subcommands refuse to run when set (exit 1 with clear message). 2. Cron create payload filter: regex pre-flight rejects prompts/scripts containing hermes gateway restart/stop, launchctl kickstart/unload, systemctl restart/stop, and pkill patterns. 3. 30 new tests: pattern matching (14), cron block (5), gateway guard (4), safe command negatives (7). |
||
|
|
a1cb5fa2c7 |
fix(gateway): anchor service WorkingDirectory at HERMES_HOME, not the source checkout
The systemd unit (and launchd plist) pinned WorkingDirectory to PROJECT_ROOT (the checkout the unit was generated from). When that checkout is transient — a git worktree, or a clone hermes update later relocates/removes — the path rots. systemd then fails the start at the CHDIR step (status=200/CHDIR) BEFORE Python loads, so the on-boot refresh_systemd_unit_if_needed() self-heal never runs and Restart=always crash-loops forever on a dead directory. Observed in the wild: a gateway that crash-looped 153 times overnight, bot offline until a manual 'hermes gateway restart' regenerated the unit. Anchor cwd at HERMES_HOME instead — it never moves, always exists, and the gateway never needed cwd to be the checkout (ExecStart uses an absolute python + -m hermes_cli.main). Existing broken units now differ from the generated unit and self-heal on the next start/restart/update. |
||
|
|
ddaf2f6712 |
style: restore PEP8 blank-line separation after dead-code removal
The deletions in the salvaged commit left some top-level defs/classes separated by a single blank line. Restore the 2-blank-line separation. |
||
|
|
dc235e93cb |
chore: remove dead code — 28 unused functions/classes across 16 files
Vulture + per-symbol verification (whole-repo grep incl. tests, string literals, getattr, decorator/registry/argparse dispatch) confirmed each of these has zero callers anywhere — not reachable via any dynamic-dispatch path, not referenced by tests, not re-exported. Removed: - acp_adapter/tools.py: _build_patch_mode_content - agent/anthropic_adapter.py: read_claude_managed_key (diagnostics-only, never called) - agent/bedrock_adapter.py: get_bedrock_model_ids - agent/browser_registry.py: get_active_browser_provider - agent/chat_completion_helpers.py: _take_request_client (x2 nested closures, never invoked) - gateway/platforms/weixin.py: _rewrite_headers_for_weixin, _rewrite_table_block_for_weixin - hermes_cli/banner.py: _skin_branding - hermes_cli/debug.py: _delete_hint - hermes_cli/gateway.py: _setup_email, _setup_sms, _setup_yuanbao (platform keys absent from the _builtin_setup_fn dispatch dict; handled by the _setup_standard_platform fallback) - hermes_cli/kanban_db.py: set_max_runtime, active_run - hermes_cli/kanban_diagnostics.py: severity_of_highest, _latest_clean_event_ts - hermes_cli/main.py: _build_provider_choices, cmd_portal (portal subcommand is wired via portal_cli.add_parser, not this wrapper) - hermes_cli/model_switch.py: CustomAutoResult (orphaned by the switch_model() extraction) - hermes_cli/models.py: format_model_pricing_table, fetch_nous_account_tier - hermes_cli/portal_cli.py: _nous_portal_base_url - hermes_cli/proxy/server.py: handle_models_fallback (defined but never registered on the router) - tools/computer_use/cua_backend.py: _parse_element, _is_arm_mac - tools/file_operations.py: _get_safe_write_root (prod uses the imported agent.file_safety.get_safe_write_root directly) - tools/skills_tool.py: _load_category_description Also dropped two imports left unused by the removals: - tools/file_operations.py: get_safe_write_root alias - tools/computer_use/cua_backend.py: import platform Pure deletion: -551 LOC. No behavior change. Test files covering the edited modules pass (640/640); the broader suite's pre-existing/env-dependent failures reproduce unchanged on origin/main. |
||
|
|
0927fb5584 |
feat(docker): auto-redirect gateway run to supervised mode inside s6 image
Pre-s6, `docker run nousresearch/hermes-agent gateway run` was the
standard invocation: gateway ran as the container's main process,
tini reaped zombies, container exit code matched gateway exit code,
no supervision. With s6-overlay as PID 1, the same invocation now
auto-upgrades to supervised semantics — auto-restart on crash,
dashboard supervised alongside (when HERMES_DASHBOARD=1 is set),
multiple profile gateways under the same /init.
Users get the new behavior with zero changes to their docker run
command. A loud one-line breadcrumb on stderr explains the upgrade
and points at the opt-out for users who genuinely want pre-s6
foreground semantics.
How it works:
1. `_gateway_command_inner` (the `gateway run` handler) checks if
we're inside a container with s6 as PID 1.
2. If yes, dispatches `start` to the s6 service manager (registers
and starts gateway-default), then `exec sleep infinity` to keep
the CMD process alive without binding container lifetime to
gateway PID lifetime. The supervised gateway can flap freely;
`docker stop` still tears everything down via /init stage 3.
3. If no, falls through to the existing foreground code path
unchanged. Host runs of `hermes gateway run` are unaffected.
Three gates make the redirect inert outside the intended scope:
* `detect_service_manager() != "s6"` — host/non-s6-container runs.
* `HERMES_S6_SUPERVISED_CHILD=1` env var (recursion guard) —
exported by `S6ServiceManager._render_run_script` for the
s6-supervised invocation itself. Without this guard, the
supervised `gateway run --replace` would re-enter the redirect
and recurse (run → start → run → start → ...) infinitely.
* `--no-supervise` CLI flag OR `HERMES_GATEWAY_NO_SUPERVISE=1` env
var — explicit user opt-out for CI smoke tests, debugging the
foreground startup path, or any case wanting "CMD exit =
container exit" semantics. Strict truthiness (1/true/yes,
case-insensitive); typos like `=0` do NOT silently opt out.
Tests:
* Unit tests in tests/hermes_cli/test_gateway_s6_dispatch.py
cover all five paths (host no-op, supervised fire, sentinel
recursion guard, CLI flag, env var truthy + falsy). The two
load-bearing gates (sentinel + opt-out) were mutation-tested
by removing each gate in isolation and confirming the dedicated
test fails with the expected error.
* Docker harness tests in tests/docker/test_gateway_run_supervised.py
cover the round trips end-to-end against a built image: redirect
fires (sleep-infinity heartbeat + supervised gateway-default
slot + breadcrumb), --no-supervise opt-out (foreground gateway,
no want-up on the slot), HERMES_GATEWAY_NO_SUPERVISE env var
works identically, recursion is impossible (≤1 supervised
python gateway-run + exactly 1 sleep-infinity parented to the
CMD wrapper), and HERMES_DASHBOARD=1 produces both supervised
gateway and supervised dashboard.
Docs:
* Added a `:::tip Gateway runs supervised` admonition near the
main docker.md example explaining the upgrade and pointing at
the opt-out. Pre-s6 (tini-based) images still run gateway run
as the foreground main process, so the note is scoped to the
s6 image only.
Trade-off documented in the helper docstring: container exit code
under the redirect is sleep's exit code (always 0 on SIGTERM), not
the gateway's. That was an explicit design call — the supervised
gateway is allowed to flap without taking the container with it,
which is what "supervision" means. CI users who want exit-code
forwarding can pass --no-supervise.
|
||
|
|
7e165e843d
|
Merge pull request #31760 from NousResearch/hermes/hermes-bf5898da
feat(docker)!: s6-overlay container supervision (salvage of #30136) |
||
|
|
af973e4071 |
refactor(gateway): migrate Mattermost adapter to bundled plugin
Second migration of an existing built-in platform adapter after Discord (PR #30591) — follows the same shape established by IRC / Teams / LINE / Google Chat / SimpleX and the playbook in `references/platform-plugin-migration.md`. Advances the umbrella refactor in #3823. Matches Discord's parity bar — adapter under `plugins/platforms/mattermost/` with the standard `__init__.py` / `adapter.py` / `plugin.yaml` shell, `register(ctx)` entry point, **no back-compat shim** at the old import path, and full parity for all five hooks Discord uses plus the `apply_yaml_config_fn` hook (mattermost is the second consumer of #25443 after Discord): * `standalone_sender_fn` — out-of-process cron delivery via Mattermost REST API. Picks up the thread_id + media_files capabilities the legacy `_send_mattermost` lacked (parity with Discord's `_standalone_send`). * `setup_fn` — interactive `hermes setup gateway` wizard. * `apply_yaml_config_fn` — translates `config.yaml` `mattermost:` keys (`require_mention`, `free_response_channels`, `allowed_channels`) into `MATTERMOST_*` env vars (replaces the hardcoded block in `gateway/config.py`). * `is_connected` — declares connection state from `MATTERMOST_TOKEN` + `MATTERMOST_URL`. * `check_fn` — verifies aiohttp is installed and both required env vars are set. * plus `allowed_users_env`, `allow_all_env`, `cron_deliver_env_var`, `max_message_length` (4000 — Mattermost practical limit), `emoji`, `required_env`, `install_hint`. Files ----- * `gateway/platforms/mattermost.py` (873 LOC) → `plugins/platforms/mattermost/adapter.py` (git rename, R071) + appended `register()` block, hook helpers, and `_standalone_send` with media upload + thread_id support. * New `plugins/platforms/mattermost/{__init__.py, plugin.yaml}` with `requires_env` / `optional_env` declarations covering MATTERMOST_URL, MATTERMOST_TOKEN, MATTERMOST_ALLOWED_USERS, MATTERMOST_ALLOW_ALL_USERS, MATTERMOST_HOME_CHANNEL, MATTERMOST_REPLY_MODE, MATTERMOST_REQUIRE_MENTION, MATTERMOST_FREE_RESPONSE_CHANNELS, MATTERMOST_ALLOWED_CHANNELS. * `gateway/config.py`: delete 17-LOC `mattermost_cfg` YAML→env bridge (moved into plugin's `_apply_yaml_config`). * `gateway/run.py::_create_adapter`: delete `Platform.MATTERMOST elif` — replaced by the existing generic plugin-registry-first dispatch. * `tools/send_message_tool.py`: delete `_send_mattermost` (22 LOC) + `Platform.MATTERMOST elif` in `_send_to_platform` — the `else` branch already routes plugin platforms through `_send_via_adapter`, which hits the registry's `standalone_sender_fn`. * `hermes_cli/setup.py`: delete `_setup_mattermost` (44 LOC) — replaced by the plugin's `interactive_setup`. * `hermes_cli/gateway.py`: delete `_PLATFORMS["mattermost"]` dict entry (3 LOC) — plugin's `setup_fn` is dispatched via the plugin path in `_configure_platform`. * Consumer rewrite: 5 test files (test_mattermost.py, test_media_download_retry.py, test_send_multiple_images.py, test_stream_consumer.py, test_ws_auth_retry.py) get `gateway.platforms.mattermost` → `plugins.platforms.mattermost.adapter` with the bulk-rewrite recipe from the platform-plugin-migration playbook. Single `mock.patch` string in test_stream_consumer.py also repointed. * `tests/tools/test_send_message_missing_platforms.py`: thin `(token, extra, chat_id, message)` compat shim around the plugin's `_standalone_send(pconfig, …)` so existing test bodies continue to work without rewriting every signature. Validation ---------- * Plugin discovery: mattermost registers from `plugins/platforms/mattermost/` alongside discord / teams / irc / line / google_chat / simplex. All 9 hooks present (setup_fn, standalone_sender_fn, apply_yaml_config_fn, is_connected, check_fn, allowed_users_env, allow_all_env, cron_deliver_env_var, max_message_length=4000). * Mattermost-touching tests: 62/62 pass (`test_mattermost.py` + `test_send_message_missing_platforms.py`). * Targeted selectors (mattermost or platform_registry or stream_consumer or ws_auth_retry or media_download_retry or send_multiple_images or send_message_tool or platform_connected): 433/433 pass. * Full sweep (`scripts/run_tests.sh tests/gateway/ tests/cron/ tests/tools/test_send_message_tool.py tests/tools/test_send_message_missing_platforms.py tests/integration/`): **6220/6220 pass in 47.8s, 0 failures**. * Lint: ruff clean on all touched files. * Git identity verified: kshitijk4poor. * Rename detection: R071 (similarity dropped from a hypothetical R09x by the ~320-line appended register block — ~36% growth over the 873-LoC base, vs Discord's 5101 LoC base which kept R091). Closes part of #3823. |
||
|
|
b28b3f51d3
|
fix(service_manager): friendly errors for missing slots and s6-svc failures
PR #30136 review caught: `S6ServiceManager.start/stop/restart` called `subprocess.run(check=True)` on `s6-svc`, so any failure surfaced as a raw `CalledProcessError` traceback. The two cases operators actually hit are: 1. The service slot doesn't exist — most commonly because the user typed a profile name wrong (`hermes -p typo gateway start`). 2. s6-svc itself fails — most commonly EACCES on the supervise control FIFO when running unprivileged. Both deserve named errors with actionable messages, not stacktraces. Changes: * Add `S6Error` base + two concrete errors in `hermes_cli.service_manager`: - `GatewayNotRegisteredError(profile)` — carries the unprefixed profile name; message: `no such gateway 'typo': register it with `hermes profile create typo` first, or pass an existing profile name via `-p <name>``. - `S6CommandError(service, action, returncode, stderr)` — carries the s6-svc rc and stderr; message: `s6-svc start on 'gateway-coder' failed (rc=111): <stderr>`. * Factor lifecycle dispatch through `_run_svc(flag, label, name)`: pre-checks that the service directory exists (raises GatewayNotRegisteredError before invoking s6-svc), then runs s6-svc and translates any CalledProcessError into S6CommandError. * `_dispatch_via_service_manager_if_s6` in `hermes_cli.gateway` catches both errors and prints `✗ <message>` + `sys.exit(1)` instead of letting the exception bubble. The dispatch path that used to dump a traceback at the user now gives an actionable one-liner. Tests: 6 new tests for the error types and their CLI rendering; existing lifecycle test pre-seeds the slot directory before calling `mgr.start` etc. |
||
|
|
6dedaa4846
|
fix(gateway): route --all stop/restart through s6 under container
PR #30136 review caught that `hermes gateway stop --all` and `... restart --all` were broken under s6. The Phase 4 dispatcher was gated on `not stop_all` (and the symmetric restart_all), so `--all` fell through to `kill_gateway_processes(all_profiles=True)`. pkill SIGTERMed every gateway, s6-supervise observed the crashes, and restarted every gateway ~1s later — net effect: `--all` *kicked* gateways instead of *stopping* them. Add `_dispatch_all_via_service_manager_if_s6(action)` that iterates `mgr.list_profile_gateways()` and routes stop/restart through each service slot. s6's `want up`/`want down` flips correctly, so a stop persists. Partial failures are surfaced per-profile with a running success count; the host pkill path is only reached when s6 isn't in play. `start --all` isn't a CLI surface — the helper rejects it and returns False (host code path can take over). |
||
|
|
a36221ed91
|
docs(s6): document container supervision; doctor + skill + user-guide updates
Phase 5 of the s6-overlay supervision plan. Documentation + small
diagnostic cleanups; no behavior changes.
website/docs/user-guide/docker.md:
- Replace the old 'entrypoint script does the bootstrap' section
with the s6-overlay boot flow (cont-init.d/01-hermes-setup,
cont-init.d/02-reconcile-profiles, static main-hermes + dashboard
services, ENTRYPOINT-as-main-program pattern).
- Add a 'Per-profile gateway supervision' subsection covering the
new lifecycle commands, restart semantics, log persistence, and
'Manager: s6 (container supervisor)' status reporting.
- Add 'Breaking change vs. pre-s6 images' callout naming the
/init ENTRYPOINT and pointing affected wrappers at the pin
workaround.
website/docs/user-guide/profiles.md:
- Add a note under 'Persistent services' pointing container users
at the docker.md section explaining s6 supervision inside the
image. Host-side systemd/launchd documentation is unchanged.
skills/software-development/hermes-s6-container-supervision/SKILL.md:
- New maintainer skill covering the supervision-tree map, file
layout, the Architecture B rationale (cont-init.d args + halt
exit-code propagation), quick recipes, and the 8 pitfalls we hit
while implementing the plan (PATH-without-/command, root-owned
profile dirs, SOUL.md as marker, the '143' anti-pattern, etc.).
hermes_cli/doctor.py:
- _check_gateway_service_linger skips on s6 (the linger concept
doesn't apply inside the container).
- New _check_s6_supervision section reports main-hermes/dashboard
state and per-profile-gateway count (registered vs supervised
up), only inside the s6 container. Host doctor output unchanged.
- External Tools / Docker check no longer emits a 'docker not
found' warning inside the container; prints an explanatory
info line instead. Still respects an explicit TERMINAL_ENV=docker
(in case the user mounted /var/run/docker.sock).
hermes_cli/gateway.py:
- Document _container_systemd_operational more precisely: it's
NOT for our Hermes Docker image (s6-overlay handles that via
detect_service_manager() == 's6'). It still covers
systemd-nspawn / k8s-with-systemd-init cases, so leaving it in
place is correct; the docstring just makes that explicit.
Test harness (verification, no test changes in this commit):
19 passed, 0 xfailed. 66 service-manager / container-boot /
profiles-s6-hooks / gateway-s6-dispatch unit tests still green.
61 doctor tests still green. Hadolint + shellcheck clean.
Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
|
||
|
|
2afefc501c
|
feat(docker): per-profile s6 supervision + container-restart reconciliation
Phase 4 of the s6-overlay supervision plan. Activates the Phase 3
S6ServiceManager by hooking it into the profile lifecycle and the
`hermes gateway start/stop/restart` dispatcher, and adds a cont-
init.d-time reconciliation pass that survives `docker restart`.
Task 4.0 — container-boot reconciliation:
/run/service/ is tmpfs, so every `docker restart` wipes every
per-profile gateway slot. /etc/cont-init.d/02-reconcile-profiles
invokes hermes_cli.container_boot.reconcile_profile_gateways() on
every boot, which walks $HERMES_HOME/profiles/<name>/, reads each
gateway_state.json, recreates the s6 service slot, and auto-starts
only those whose last state was 'running'. Other states
(stopped, starting, startup_failed, missing) register the slot
in the down state — avoiding crash-loops across restarts for a
gateway that was broken last boot. Per-profile outcome is recorded
to $HERMES_HOME/logs/container-boot.log.
Implementation: hermes_cli/container_boot.py + 12 unit tests.
Profile-marker is SOUL.md, not config.yaml, because `hermes profile
create` only seeds SOUL.md by default (config.yaml comes from
`hermes setup`).
Task 4.1 / 4.2 — profile create/delete hooks:
hermes_cli/profiles.py::create_profile now calls
_maybe_register_gateway_service(<canon>) at the end, which routes
through ServiceManager.register_profile_gateway when running on s6
and no-ops on host backends. delete_profile mirrors with
_maybe_unregister_gateway_service. _allocate_gateway_port produces
a deterministic SHA-256-derived port in [9200, 9800).
Task 4.3 — gateway dispatch + remove rejection arms:
_dispatch_via_service_manager_if_s6(action) intercepts
start/stop/restart at the top of each subcommand and routes them
through S6ServiceManager.{start,stop,restart}. The pre-Phase-4
`elif is_container():` rejection arms are kept as fallback for
pre-s6 containers / unsupported runtimes, but only ever fire when
detect_service_manager() != 's6'. install/uninstall under s6
print informational guidance pointing users at profile create/delete.
Removed the two xfail(strict=True) markers from
tests/docker/test_profile_gateway.py — both tests now pass strictly.
Task 4.4 — status reporting:
get_gateway_runtime_snapshot() reports
Manager: 's6 (container supervisor)' inside an s6 container instead
of 'docker (foreground)'.
Plan-vs-reality drift fixed in this commit:
- Plan's S6ServiceManager._render_run_script used
`gateway start --foreground --port {port}` — invented args; the
real CLI is `gateway run`. Switched accordingly. port arg
retained for API parity but now documented as 'currently ignored'.
- Plan's reconciler keyed on config.yaml; switched to SOUL.md
(config.yaml is created by hermes setup, not by hermes profile
create, so the original gate caught nothing).
- The plan's _dispatch helper used _profile_arg() which returns
'--profile <name>' (i.e. with the flag prefix). Switched to
_profile_suffix() which returns the bare name.
- Architecture B's docker exec doesn't get /command on PATH or
the venv on PATH; Dockerfile's runtime PATH now includes
/opt/hermes/.venv/bin so 'docker exec <c> hermes ...' works
without sourcing the venv.
- stage2-hook now chowns $HERMES_HOME/profiles to hermes on every
boot, not just on the UID-remap path. Without this, files created
by docker-exec-as-root accumulate and the next reconciler run
fails with PermissionError reading SOUL.md.
Test harness:
19 passed, 0 xfailed (the two pre-Phase-4 xfail targets flip to
passing). 78 unit tests across service_manager + container_boot +
profiles_s6_hooks + gateway_s6_dispatch. Hadolint + shellcheck
pass cleanly.
Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
|
||
|
|
1f897b0dc9
|
fix(gateway): stop enabling dingtalk allow-all during setup (#30743) | ||
|
|
7849a3d73f |
fix(gateway,discord-plugin): _platform_status must respect is_connected=False, not silently fall back to check_fn
Two bugs surfaced by PR #24356 migrating Discord into the registry: 1. plugins/platforms/discord/adapter.py::_is_connected — read DISCORD_BOT_TOKEN via hermes_cli.gateway.get_env_value (the abstraction tests patch) instead of os.getenv directly. The legacy non-registry path used get_env_value; bypassing it broke test_setup_openclaw_migration which patches gateway_mod.get_env_value to simulate a hermetic env. 2. hermes_cli/gateway.py::_platform_status — when entry.is_connected is defined and returns False, return 'not configured' immediately. Don't fall back to entry.check_fn(), which would let 'SDK is installed' override 'no token configured' and incorrectly report the platform as ready. The fallback to check_fn is the right behaviour only when is_connected is None (not registered). Fixes 5 test failures observed on CI for PR #24356: - tests/hermes_cli/test_setup.py::test_setup_gateway_skips_service_install_when_systemctl_missing - tests/hermes_cli/test_setup.py::test_setup_gateway_in_container_shows_docker_guidance - tests/hermes_cli/test_setup_irc.py::TestIRCGatewaySetupFreshInstall::test_setup_gateway_irc_counts_as_messaging_platform - tests/hermes_cli/test_setup_openclaw_migration.py::TestGetSectionConfigSummary::test_gateway_returns_none_without_tokens - tests/hermes_cli/test_setup_openclaw_migration.py::TestSetupWizardSkipsConfiguredSections::test_sections_skipped_when_migration_imported_settings Same _platform_status bug exists for sibling plugin platforms (teams, google_chat) whose check_fn returns true on SDK install alone; their tests just never exercised the registry path before. The bug only became test-visible when Discord migrated into the registry. Validation: 11,167 tests across tests/gateway/ + tests/cron/ + tests/tools/test_send_message_tool.py + tests/hermes_cli/ pass with zero failures. |
||
|
|
cc8e5ec2af |
refactor(gateway): migrate Discord adapter to bundled plugin (full Teams parity)
First migration of an existing built-in platform adapter to the plugin system established by IRC / Teams / LINE / Google Chat. Closes #24325; advances the umbrella refactor in #3823. Matches Teams' shape exactly — adapter under ``plugins/platforms/discord/`` with the standard ``__init__.py`` / ``adapter.py`` / ``plugin.yaml`` shell, ``register(ctx)`` entry point, **no back-compat shim** at the old import path, and full parity for the four hooks Teams uses plus the ``apply_yaml_config_fn`` hook that landed in #25443 (the Discord plugin is the first consumer of that hook): * ``standalone_sender_fn`` — out-of-process cron delivery via REST API * ``setup_fn`` — interactive ``hermes setup gateway`` wizard * ``apply_yaml_config_fn`` — translate ``config.yaml`` ``discord:`` keys into ``DISCORD_*`` env vars (replaces the hardcoded block in ``gateway/config.py``) * ``is_connected`` — declares connection state from ``DISCORD_BOT_TOKEN`` * ``check_fn`` — lazy-installs ``discord.py`` on demand * plus ``allowed_users_env``, ``allow_all_env``, ``cron_deliver_env_var``, ``max_message_length``, ``emoji``, ``required_env``, ``install_hint`` * ``gateway/platforms/discord.py`` (5,101 LOC) → ``plugins/platforms/discord/adapter.py`` (git rename, R090). * New ``plugins/platforms/discord/{__init__.py, plugin.yaml}`` with ``requires_env`` / ``optional_env`` declarations. * Append ``register(ctx)`` block + new hook implementations (``_standalone_send``, ``interactive_setup``, ``_apply_yaml_config``, ``_clean_discord_user_ids``, ``_is_connected``, ``_build_adapter``, plus helpers ``_DISCORD_CHANNEL_TYPE_PROBE_CACHE`` etc.) to the adapter. * Replace the ``Platform.DISCORD elif`` branch in ``GatewayRunner._create_adapter()`` (−9 LOC) with a generic post-creation hook (+6 LOC) in the registry path: any plugin adapter that declares a ``gateway_runner`` attribute now gets it auto-injected. Webhook's built-in branch is unchanged (it doesn't go through the registry path). * Move ``_send_discord`` (190 LOC) and helpers (``_DISCORD_CHANNEL_TYPE_PROBE_CACHE``, ``_remember_channel_is_forum``, ``_probe_is_forum_cached``, ``_derive_forum_thread_name``) from ``tools/send_message_tool.py`` into the plugin as ``_standalone_send``. * Wire via ``standalone_sender_fn=_standalone_send`` (Teams pattern; same gap fixed in #21804 for other plugin platforms). * Replace the Discord ``elif`` in ``tools/send_message_tool.py`` ``_send_to_platform`` with a 10-line registry-hook dispatch. * Drop the ``DiscordAdapter`` import and the ``Platform.DISCORD: DiscordAdapter.MAX_MESSAGE_LENGTH`` ``_MAX_LENGTHS`` entry — the registry's ``max_message_length=2000`` covers it. * Move ``_setup_discord`` and ``_clean_discord_user_ids`` (68 LOC) from ``hermes_cli/setup.py`` into the plugin as ``interactive_setup``. * Wire via ``setup_fn=interactive_setup``. CLI helpers (``prompt``, ``print_info``, etc.) are lazy-imported so the plugin's module-load surface stays minimal. * Remove ``"discord": _s._setup_discord`` from ``hermes_cli/gateway.py::_builtin_setup_fn``. * Remove the entire 32-line ``_PLATFORMS["discord"]`` static dict entry — Discord's setup metadata is now discovered dynamically via ``_all_platforms()`` from the registry entry. * Move the 59-line ``discord_cfg`` YAML→env bridge from ``gateway/config.py::load_gateway_config()`` into the plugin as ``_apply_yaml_config``. Covers ``require_mention``, ``thread_require_mention``, ``free_response_channels``, ``auto_thread``, ``reactions``, ``ignored_channels``, ``allowed_channels``, ``no_thread_channels``, ``allow_mentions.{everyone,roles,users, replied_user}``, and ``reply_to_mode`` (including the YAML 1.1 ``off``-as-False coercion and the ``extra.reply_to_mode`` fallback). * Wire via ``apply_yaml_config_fn=_apply_yaml_config``. * The hook runs BEFORE ``_apply_env_overrides`` and after the generic shared-key loop, exactly as documented in ``website/docs/developer-guide/adding-platform-adapters.md``. * Behavior is preserved exactly — every assignment still uses ``not os.getenv(...)`` guards so env vars take precedence over YAML. All 78 references to the old import path are rewritten — no back-compat shim: * 51 ``from gateway.platforms.discord import X`` → ``from plugins.platforms.discord.adapter import X`` * 5 ``import gateway.platforms.discord as discord_platform`` → ``import plugins.platforms.discord.adapter as discord_platform`` * 1 ``from gateway.platforms import discord as discord_mod`` → ``from plugins.platforms.discord import adapter as discord_mod`` * 21 ``mock.patch("gateway.platforms.discord.X")`` strings → ``mock.patch("plugins.platforms.discord.adapter.X")`` * 1 docstring reference in ``hermes_cli/commands.py`` * 1 import in ``tools/send_message_tool.py`` (now removed entirely) The import-safety test in ``tests/gateway/test_discord_imports.py`` is updated to purge the new canonical module name from ``sys.modules``. **38 files changed, +621 / −473** — net positive due to the YAML hook implementation (89 new LOC in the plugin trading for 59 deleted in core), but every line moved has a clear plugin home now. The git rename is detected at R090 because the adapter gained ~340 LOC of moved-in hook implementations (``_standalone_send`` + ``interactive_setup`` + ``_apply_yaml_config`` + helpers). * All 568 Discord-specific tests pass across 25 ``test_discord_*.py`` files plus voice/send/text-batching/reload-skills/stream-consumer/ integration tests. * All 147 tests in the YAML-touching subset (``test_discord_reply_mode``, ``test_discord_free_response``, ``test_discord_allowed_channels``, ``test_discord_allowed_mentions``, ``test_discord_channel_controls``, ``test_discord_reactions``, ``test_discord_thread_persistence``, ``test_runtime_footer``) pass — this is the strongest signal that the YAML→env hook behaves identically to the legacy block. * Broader gateway/cron/integration sweep (1297 tests) introduces zero new failures vs ``main``. Pre-existing failures in ``tests/gateway/test_tts_media_routing.py`` and ``tests/e2e/test_platform_commands.py`` reproduce identically on the unchanged ``main`` revision. * Plugin discovery sanity check confirms Discord registers alongside the other four platform plugins: Registered platforms: ['discord', 'google_chat', 'irc', 'line', 'teams'] These Discord-shaped tendrils in core were **deliberately not moved** — they are generic platform-registry concerns affecting every platform, not Discord-specific: * ``gateway/config.py:1205`` ``DISCORD_BOT_TOKEN → config.token`` env enablement — same shape Telegram has. The existing ``env_enablement_fn`` registry hook only seeds ``extra``, not ``.token``, so it can't replace this without an adapter refactor to read from ``extra["bot_token"]``. * ``gateway/run.py`` voice-mode hooks (``self.adapters.get(Platform.DISCORD)`` for ``start_voice_mode``/``stop_voice_mode``), role-based auth, ``DISCORD_ALLOW_BOTS`` branch in ``_is_user_authorized``, ``_UPDATE_ALLOWED_PLATFORMS`` frozenset, and the per-platform allowlist maps — generic platform-registry concerns. * ``Platform.DISCORD`` enum literal — stable identifier used as dict keys throughout the codebase; removing it is a separate refactor with no real benefit. * ``tools/discord_tool.py`` and ``tools/environments/local.py`` — first-class agent tools and env-passthrough config, neither is the gateway adapter. Each of these is worth its own scoping issue when the time comes. |
||
|
|
d948de39e9 |
fix(gateway): harden Windows gateway install lifecycle
Preserve Windows profile install decisions across UAC handoff, avoid visible console windows by launching via pythonw, make repeated install/start idempotent, recreate stale Scheduled Tasks, and separate start-now from login auto-start behavior. Add Windows gateway regression coverage and systemd setup tests for the shared install flow. |
||
|
|
417a653d9e | fix(gateway): prevent Windows Telegram /restart leaving gateway stopped | ||
|
|
bc7c608d54 | fix(gateway): ignore inaccessible service path dirs | ||
|
|
4e9cedcd4c |
fix(gateway): add missing logger definition to prevent NameError in _all_platforms
hermes_cli/gateway.py:3702 referenced logger.debug() but 'logger' was never defined in the module, causing a NameError at runtime if the try/except around discover_plugins() caught an exception. Added import logging and logger = logging.getLogger(__name__) at module level to resolve the undefined name. |
||
|
|
d69eab1efd |
fix(gateway): build service PATH from existing dirs only, include ~/.hermes/node_modules
Extract PATH building into _build_service_path_dirs() that skips directories which don't exist on disk (e.g. node_modules/.bin for pip installs) and also includes ~/.hermes/node/bin and ~/.hermes/node_modules/.bin for agent-browser. |
||
|
|
0bc5f7b235 | fix(gateway): reduce systemd restart delay | ||
|
|
3955aefced
|
fix(install): use --extra all not --all-extras; drop lazy-covered extras from [all] (#24515)
* fix(install): use `--extra all` not `--all-extras`; drop lazy-covered extras from [all]
Two coupled fixes for the Windows install hang where uv sync built
python-olm from sdist and failed on missing make.
# Root cause: --all-extras vs --extra all (credit: ethernet)
`uv sync --all-extras` installs every key in [project.optional-
dependencies], bypassing the curated [all] extra entirely. So even
when [all] excluded [matrix], [rl], [yc-bench], etc., the installer
pulled them anyway because they were still defined as extras. On
Windows that meant python-olm (no wheel, needs make to build from
sdist) and the install died there.
The right flag is `--extra all` — install just the [all] extra's
contents, respecting curation. Empirically verified via dry-run:
--all-extras: pulls python-olm, mautrix, ctranslate2, onnxruntime,
atroposlib, tinker, wandb, modal, daytona, vercel,
python-telegram-bot, discord.py, slack-bolt,
dingtalk-stream, lark-oapi, anthropic, boto3,
edge-tts, elevenlabs, exa-py, fal-client, faster-
whisper, firecrawl-py, honcho-ai, parallel-web
--extra all: pulls none of those — just [all]'s curated set
Dockerfile already uses `--extra all` (with comment explaining the
gotcha) — knowledge existed; the gap was install.sh / install.ps1 /
setup-hermes.sh.
Sites fixed: scripts/install.sh L1118, scripts/install.ps1 L809,
setup-hermes.sh L245.
# Companion fix: drop lazy-covered extras from [all]
`tools/lazy_deps.py` already covers anthropic, bedrock, exa,
firecrawl, parallel-web, fal, edge-tts, elevenlabs, modal, daytona,
vercel, all messaging platforms (telegram/discord/slack/matrix/
dingtalk/feishu), honcho, and faster-whisper. They were ALSO in
[all], which defeats the whole point of lazy-install — fresh
installs eager-pulled them and inherited whatever was broken
upstream (the matrix → python-olm → no Windows wheel chain being
the proximate symptom).
[all] now contains only what genuinely can't be lazy-installed:
cron, cli, dev, pty, mcp, homeassistant, sms, acp, google, web,
youtube. Same trim applied to [termux-all]. New regression test
asserts the contract: every extra in LAZY_DEPS must NOT also appear
in [all].
# Companion fix: surface uv progress + errors
setup-hermes.sh's hash-verified path swallowed uv's stderr to a
tempfile, identical to the install.sh bug fixed in PR #24504. Same
fix applied: stream stderr through directly so users see live
progress instead of staring at a frozen prompt.
# Files
- pyproject.toml: trim [all] and [termux-all] to non-lazy extras only.
- scripts/install.sh: --all-extras → --extra all; trim _ALL_EXTRAS /
_PYPI_EXTRAS to match.
- scripts/install.ps1: --all-extras → --extra all; trim $allExtras /
$pypiExtras to match.
- setup-hermes.sh: --all-extras → --extra all; stream stderr.
- tests/test_project_metadata.py: invert matrix-in-[all] assertion;
add lazy-coverage contract test.
- uv.lock: regenerated.
# Validation
5/5 metadata tests pass. 37/37 in update_autostash + tool_token_
estimation. `uv lock --check` passes. Empirical dry-run confirms
`--extra all` excludes python-olm + RL chain on the new lockfile.
* fix(install): parse [all] from pyproject.toml instead of mirroring it
ethernet's review point: the previous patch left two hand-mirrored
copies of [all]'s contents (in install.sh's $_ALL_EXTRAS and
install.ps1's $allExtras). That guarantees future drift the next
time pyproject.toml's [all] changes.
Now both scripts parse pyproject.toml at install time using stdlib
tomllib (Python 3.11+, which the bootstrap step already requires).
Single source of truth. The only purpose of the parsed list is to
build the 'Tier 2: [all] minus broken extras' fallback spec — so we
parse, filter against $brokenExtras, and rebuild the .[a,b,c] spec.
Also: removed redundant fallback tiers.
Before: Tier 1 [all]
Tier 2 [all] minus broken
Tier 3 PyPI-only extras (no git deps)
Tier 4 [web,mcp,cron,cli,messaging,dev]
Tier 5 .
After: Tier 1 [all]
Tier 2 [all] minus broken
Tier 3 .
Tier 3 (PyPI-only) and Tier 4 (dashboard+core) used to dodge the [rl]
git+sdist deps and the [matrix] python-olm build. Both are no longer
in [all] post-2026-05-12 lazy-install migration, so the carve-out
tiers had no remaining content. Tier 4 also referenced [messaging],
which is now lazy-installed — the hardcoded fallback was actually
inconsistent with the new policy.
Defensive fallback: if tomllib parse fails (corrupted pyproject,
unexpected schema), Tier 2 collapses to '.[all]' (same as Tier 1) so
the broken-extras path becomes a no-op rather than crashing.
* fix(gateway): hide Matrix from setup picker on Windows
Matrix is the one messaging platform that has no working install path
on Windows: [matrix] -> mautrix[encryption] -> python-olm, which has
Linux-only wheels and needs make + libolm to build from sdist. The
[all] cleanup in this PR keeps mautrix out of fresh installs, but a
user who picked Matrix in 'hermes setup gateway' would still walk
into the same sdist build failure when the wizard tried to install
the extra.
Hide the option at the picker so users never get the chance to try.
The gate lives in _all_platforms() — single source of truth for the
setup wizard, the curses gateway-config menu, and any future picker.
Adapter loading at runtime is intentionally NOT gated: users who
already have MATRIX_* env vars set (e.g. config copied from a Linux
install) keep working if they somehow have python-olm available.
This is the lowest-friction fix — picker visibility only.
Tests cover linux/darwin/win32 and verify other platforms aren't
collateral damage.
|
||
|
|
2ec8d2b42f
|
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937)
Replace with for all literal-tuple membership tests. Set lookup is O(1) vs O(n) for tuple — consistent micro-optimization across the codebase. 608 instances fixed via `ruff --fix --unsafe-fixes`, 0 remaining. 133 files, +626/-626 (net zero). |
||
|
|
657874460f
|
chore: ruff auto-fixes — collapsible-else-if, if-stmt-min-max, dict.fromkeys (#23926)
PLR5501 (collapsible-else-if): 28 instances — else: if: → elif: PLR1730 (if-stmt-min-max): 15 instances — if x<y: x=y → x=max(x,y) C420 (dict.fromkeys): 2 instances — dictcomp → dict.fromkeys PLR1704 (redefined-argument): 1 instance — reason → err_msg (shadow fix) C414 (unnecessary-list): 1 instance — sorted(list(x)) → sorted(x) 28 files, -44 net lines. All mechanical, zero logic changes. 17,211 tests pass, zero regressions. |
||
|
|
6bf7ac3185 |
fix(gateway): detect gateway process via /proc in Docker without procps
Salvage of NousResearch/hermes-agent#7622. Docker images often lack procps so `ps` is unavailable. Try reading /proc/*/cmdline first (works in any Linux container) and fall back to `ps -A eww` only when /proc is not present. PermissionError on individual PIDs is silently skipped. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
2ffef15675
|
fix(test_gateway): stop run_gateway() tests from rewriting the dev's installed systemd unit (#22900)
run_gateway() calls refresh_systemd_unit_if_needed() on every invocation so restart settings stay current after exit-code-75 respawns. The user-scope unit path resolves under Path.home() (NOT sandboxed by conftest, only HERMES_HOME is), and generate_systemd_unit() bakes the current HERMES_HOME into the unit's Environment= line. Result: any test that exercises run_gateway() end-to-end on a real Linux dev box silently rewrites the developer's installed ~/.config/systemd/user/hermes-gateway.service with a polluted HERMES_HOME pointing at /tmp/pytest-of-<user>/.../hermes_test. On the next reboot, systemd loads that unit, the gateway starts looking at an empty tmp dir, and Telegram/Discord/etc. all show as 'No messaging platforms enabled' even though the user's real config is fine. Three tests in tests/hermes_cli/test_gateway.py hit this path: test_run_gateway_exits_cleanly_on_keyboard_interrupt, test_run_gateway_exits_nonzero_when_start_gateway_reports_failure, and test_run_gateway_root_guard_has_escape_hatch. Two-layer fix: 1. _install_fake_gateway_run helper (covers all four run_gateway() call sites in test_gateway.py and any future ones) now also stubs supports_systemd_services and refresh_systemd_unit_if_needed. 2. refresh_systemd_unit_if_needed() itself sniffs the generated unit body for /pytest-of- and /hermes_test markers and refuses to write when present. Defense in depth so a future test that bypasses the helper still can't corrupt the dev's gateway. Tests that legitimately exercise the refresh flow (test_run_gateway_refreshes_outdated_unit_on_boot) patch generate_systemd_unit to return synthetic content that doesn't carry those markers, so they keep working. Adds test_refresh_refuses_to_bake_pytest_tmpdir_into_real_user_unit as a regression test for the source-side guard. |
||
|
|
9c26297c80 | fix(gateway): preserve Ctrl+C for Windows foreground runs | ||
|
|
1508dcb9c2 |
fix(gateway): adopt unit's HERMES_HOME for --system CLI ops
When systemd_restart / systemd_status / systemd_stop run under sudo, HERMES_HOME is stripped and HOME=/root, so get_hermes_home() resolves to /root/.hermes instead of the unit's pinned home. read_runtime_status and get_running_pid then look at the wrong gateway_state.json — the 60s status poll never sees "running", times out, and forces another systemctl restart that SIGTERMs the in-progress new gateway. Read the unit's pinned HERMES_HOME from `systemctl show -p Environment` and mirror it into os.environ before any HERMES_HOME-derived read. Early-out when system=False (user-scope inherits naturally). Errors swallowed so a transient systemctl failure doesn't break unrelated CLI ops. Closes #22035. |
||
|
|
78b0008f44 |
fix(gateway): also catch restart TimeoutExpired; friendly message
Some checks failed
Deploy Site / deploy-vercel (push) Waiting to run
Deploy Site / deploy-docs (push) Waiting to run
Docker Build and Publish / build-amd64 (push) Waiting to run
Docker Build and Publish / build-arm64 (push) Waiting to run
Docker Build and Publish / merge (push) Blocked by required conditions
Docker Build and Publish / move-latest (push) Blocked by required conditions
Lint (ruff + ty) / ruff + ty diff (push) Waiting to run
Lint (ruff + ty) / ruff enforcement (blocking) (push) Waiting to run
Lint (ruff + ty) / Windows footguns (blocking) (push) Waiting to run
Nix / nix (macos-latest) (push) Waiting to run
Nix / nix (ubuntu-latest) (push) Waiting to run
OSV-Scanner / Scan lockfiles (push) Waiting to run
Tests / test (push) Waiting to run
Tests / e2e (push) Waiting to run
uv.lock check / uv lock --check (push) Waiting to run
Build Skills Index / build-index (push) Has been cancelled
Build Skills Index / deploy-with-index (push) Has been cancelled
Extends #19994 to the restart path. Dashboard spawns 'hermes gateway restart' in the background; when a wedged adapter websocket pushes drain past the 90s CLI timeout, the dashboard previously surfaced a raw subprocess.TimeoutExpired traceback. Mirror systemd_stop()'s TimeoutExpired catch onto both forcing-restart sites in systemd_restart(). Adds a test that exercises the no-active-pid branch end-to-end. |
||
|
|
dccf1fb6e0 | fix(gateway): cap adapter disconnect during stop | ||
|
|
a54cae60d4
|
fix(setup): offer gateway service install on Windows (#22099)
Both setup wizards (hermes setup and hermes gateway setup) gated the service install/start/restart prompts behind 'supports_systemd or is_macos()' and fell through to 'run in foreground' on Windows, even though _is_service_installed() / _is_service_running() already call gateway_windows.is_installed() and the Windows backend has a full install/start/stop/restart contract. Wire the Windows branch into both wizards: - supports_service_manager now includes is_windows(). - Install offer reads 'Scheduled Task service' on Windows. - install() on Windows starts the task inline via schtasks /Run (or direct-spawn fallback) so the separate 'Start the service now?' prompt is skipped. - Start and Restart delegate to gateway_windows.start() / .restart(). hermes_cli/setup.py +30 -4 hermes_cli/gateway.py +28 -4 |
||
|
|
0548facc50 |
fix(windows): gateway status dedup + install.ps1 platform-SDK bootstrap
## Two residual Windows fixes that were hanging from earlier commits. ### 1. `hermes gateway status` reported 2 PIDs per gateway — TWO bugs compounded Diagnosed with psutil parent/child walk against live gateway PIDs: **Bug A (the real one): `_get_parent_pid` silently failed on Windows.** The helper shelled out to `ps -o ppid= -p <pid>`, which doesn't exist on Windows — `FileNotFoundError` → returns `None` → the ancestor walk terminated at `os.getpid()` alone. Consequence: the PID table scan in `_scan_gateway_pids` couldn't filter out `hermes gateway status`'s own launcher stub (a venv `pythonw.exe`/`python.exe` that matches the same `-m hermes_cli.main gateway` pattern as the gateway). Every status call saw "itself" as a second gateway. Fix: `_get_parent_pid` now calls `psutil.Process(pid).ppid()` first (psutil is a core dependency since |
||
|
|
cc38282b04 |
feat(cross-platform): psutil for PID/process management + Windows footgun checker
## Why
Hermes supports Linux, macOS, and native Windows, but the codebase grew up
POSIX-first and has accumulated patterns that silently break (or worse,
silently kill!) on Windows:
- `os.kill(pid, 0)` as a liveness probe — on Windows this maps to
CTRL_C_EVENT and broadcasts Ctrl+C to the target's entire console
process group (bpo-14484, open since 2012).
- `os.killpg` — doesn't exist on Windows at all (AttributeError).
- `os.setsid` / `os.getuid` / `os.geteuid` — same.
- `signal.SIGKILL` / `signal.SIGHUP` / `signal.SIGUSR1` — module-attr
errors at runtime on Windows.
- `open(path)` / `open(path, "r")` without explicit encoding= — inherits
the platform default, which is cp1252/mbcs on Windows (UTF-8 on POSIX),
causing mojibake round-tripping between hosts.
- `wmic` — removed from Windows 10 21H1+.
This commit does three things:
1. Makes `psutil` a core dependency and migrates critical callsites to it.
2. Adds a grep-based CI gate (`scripts/check-windows-footguns.py`) that
blocks new instances of any of the above patterns.
3. Fixes every existing instance in the codebase so the baseline is clean.
## What changed
### 1. psutil as a core dependency (pyproject.toml)
Added `psutil>=5.9.0,<8` to core deps. psutil is the canonical
cross-platform answer for "is this PID alive" and "kill this process
tree" — its `pid_exists()` uses `OpenProcess + GetExitCodeProcess` on
Windows (NOT a signal call), and its `Process.children(recursive=True)`
+ `.kill()` combo replaces `os.killpg()` portably.
### 2. `gateway/status.py::_pid_exists`
Rewrote to call `psutil.pid_exists()` first, falling back to the
hand-rolled ctypes `OpenProcess + WaitForSingleObject` dance on Windows
(and `os.kill(pid, 0)` on POSIX) only if psutil is somehow missing —
e.g. during the scaffold phase of a fresh install before pip finishes.
### 3. `os.killpg` migration to psutil (7 callsites, 5 files)
- `tools/code_execution_tool.py`
- `tools/process_registry.py`
- `tools/tts_tool.py`
- `tools/environments/local.py` (3 sites kept as-is, suppressed with
`# windows-footgun: ok` — the pgid semantics psutil can't replicate,
and the calls are already Windows-guarded at the outer branch)
- `gateway/platforms/whatsapp.py`
### 4. `scripts/check-windows-footguns.py` (NEW, 500 lines)
Grep-based checker with 11 rules covering every Windows cross-platform
footgun we've hit so far:
1. `os.kill(pid, 0)` — the silent killer
2. `os.setsid` without guard
3. `os.killpg` (recommends psutil)
4. `os.getuid` / `os.geteuid` / `os.getgid`
5. `os.fork`
6. `signal.SIGKILL`
7. `signal.SIGHUP/SIGUSR1/SIGUSR2/SIGALRM/SIGCHLD/SIGPIPE/SIGQUIT`
8. `subprocess` shebang script invocation
9. `wmic` without `shutil.which` guard
10. Hardcoded `~/Desktop` (OneDrive trap)
11. `asyncio.add_signal_handler` without try/except
12. `open()` without `encoding=` on text mode
Features:
- Triple-quoted-docstring aware (won't flag prose inside docstrings)
- Trailing-comment aware (won't flag mentions in `# os.kill(pid, 0)` comments)
- Guard-hint aware (skips lines with `hasattr(os, ...)`,
`shutil.which(...)`, `if platform.system() != 'Windows'`, etc.)
- Inline suppression with `# windows-footgun: ok — <reason>`
- `--list` to print all rules with fixes
- `--all` / `--diff <ref>` / staged-files (default) modes
- Scans 380 files in under 2 seconds
### 5. CI integration
A GitHub Actions workflow that runs the checker on every PR and push is
staged at `/tmp/hermes-stash/windows-footguns.yml` — not included in this
commit because the GH token on the push machine lacks `workflow` scope.
A maintainer with `workflow` permissions should add it as
`.github/workflows/windows-footguns.yml` in a follow-up. Content:
```yaml
name: Windows footgun check
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with: {python-version: "3.11"}
- run: python scripts/check-windows-footguns.py --all
```
### 6. CONTRIBUTING.md — "Cross-Platform Compatibility" expansion
Expanded from 5 to 16 rules, each with message, example, and fix.
Recommends psutil as the preferred API for PID / process-tree operations.
### 7. Baseline cleanup (91 → 0 findings)
- 14 `open()` sites → added `encoding='utf-8'` (internal logs/caches) or
`encoding='utf-8-sig'` (user-editable files that Notepad may BOM)
- 23 POSIX-only callsites in systemd helpers, pty_bridge, and plugin
tool subprocess management → annotated with
`# windows-footgun: ok — <reason>`
- 7 `os.killpg` sites → migrated to psutil (see §3 above)
## Verification
```
$ python scripts/check-windows-footguns.py --all
✓ No Windows footguns found (380 file(s) scanned).
$ python -c "from gateway.status import _pid_exists; import os
> print('self:', _pid_exists(os.getpid())); print('bogus:', _pid_exists(999999))"
self: True
bogus: False
```
Proof-of-repro that `os.kill(pid, 0)` was actually killing processes
before this fix — see commit `
|
||
|
|
324567c936 |
fix(windows): os.kill(pid, 0) is NOT a no-op on Windows — route through new _pid_exists helper
On Windows, Python's ``os.kill(pid, 0)`` is NOT a no-op. CPython's
implementation (``Modules/posixmodule.c::os_kill_impl``) treats sig=0
as ``CTRL_C_EVENT`` because the two integer values collide at the C
layer, and routes it through ``GenerateConsoleCtrlEvent(0, pid)`` —
which sends a Ctrl+C to the ENTIRE console process group containing
the target PID, not just the PID itself. Any caller that wanted to
check "is PID X alive" via the classic POSIX ``os.kill(pid, 0)``
idiom was silently killing that process (and often unrelated
processes in the same console group) on Windows. Long-standing
Python Windows quirk; see bpo-14484 (open since 2012).
This manifested in Hermes as: every ``hermes gateway status``
invocation would read the gateway's PID from the PID file, call
``os.kill(pid, 0)`` via ``gateway.status.get_running_pid()`` as a
"liveness check", and instantly terminate the gateway it was trying
to report on. No shutdown log, no traceback, no atexit hook fire,
no exit-diag entry — just silent termination of the detached pythonw
process. "Bot answered one message then stopped typing" was the
characteristic end-user symptom because `os.kill(pid, 0)` fires
mid-response-send and kills the gateway between logs.
Reproduction (verified in this branch before the fix):
$ hermes gateway start # gateway alive, PID 37520
$ hermes gateway status # reports "No gateway process detected"
$ tasklist /FI "PID eq 37520" # INFO: No tasks are running
# — gateway terminated silently
Root-cause fix is a new ``gateway.status._pid_exists(pid)`` helper:
- On Windows: Win32 ``OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION |
SYNCHRONIZE, False, pid)`` + ``WaitForSingleObject(handle, 0)``
via ctypes. Zero signal delivery, zero console-group side effects.
Pins ctypes return types to avoid DWORD-vs-signed-int parse bugs
on WAIT_TIMEOUT (0x102). Distinguishes ERROR_INVALID_PARAMETER
(PID gone) from ERROR_ACCESS_DENIED (alive but another user).
- On POSIX: the canonical ``os.kill(pid, 0)`` idiom that actually is
a no-op there.
Then patch every ``os.kill(pid, 0)`` liveness-check callsite to
route through ``_pid_exists`` instead. Total 14 callsites across
11 files; every single one was a latent silent-kill on Windows:
gateway/run.py:2810 — /restart watcher (inline subprocess)
gateway/run.py:15195 — --replace wait loop
gateway/status.py:572 — acquire_gateway_runtime_lock stale check
gateway/status.py:828 — get_running_pid (THE killer for status)
gateway/platforms/whatsapp.py:111
hermes_cli/gateway.py:228, 522, 1012 — gateway-related drain loops
hermes_cli/kanban_db.py:2826 — _pid_alive was claiming to
be cross-platform but used
os.kill(pid, 0) on Windows
hermes_cli/main.py:5792 — CLI process-kill polling
hermes_cli/profiles.py:782 — profile stop wait loop
plugins/google_meet/process_manager.py:74
tools/browser_tool.py:1215, 1255 — browser daemon ownership probes
tools/mcp_tool.py:1255, 3374 — MCP stdio orphan tracking
The watcher source in gateway/run.py:2810 is a multi-line string
that gets spawned as an inline ``python -c "..."`` subprocess, so
it can't import gateway.status. The fix for that callsite inlines
the same ctypes probe directly into the watcher source.
Tested on Windows 10 with the hermes gateway + Telegram bot:
- gateway start → alive
- 5 consecutive ``hermes gateway status`` invocations → gateway
alive after every one, same PID reported each time (37520, 21952)
- gateway.log shows uninterrupted operation; no spurious shutdown
entries; cron ticker and kanban dispatcher still running on
their 60-second cadence
- bot continues answering Telegram messages throughout
Ships alongside an exit-path diagnostic wrapper in
``hermes_cli/gateway.py::run_gateway()`` that captures every way
``asyncio.run(start_gateway(...))`` can return (success, SystemExit,
KeyboardInterrupt, BaseException, atexit) with full traceback to
``logs/gateway-exit-diag.log``. This was used to prove the gateway
was being hard-killed externally (no exit event fired) and should
be kept for future Windows debugging.
Refs: https://bugs.python.org/issue14484
See also: references/windows-subprocess-sigint-storm.md in
the hermes-agent skill.
|
||
|
|
9c263fbf8a |
feat(windows): gateway as a Scheduled Task + Startup-folder fallback
Hermes gateway now installs as a real Windows service via
`hermes gateway install`, auto-starts on user logon, and stays running
across reboots. Mirrors the launchd (macOS) / systemd (Linux) contract
so the rest of the CLI dispatcher just plugs into the same `install /
uninstall / start / stop / restart / status` entrypoints.
Primary implementation is the new `hermes_cli/gateway_windows.py`:
- `schtasks /Create /SC ONLOGON /RL LIMITED /RU <user> /NP /IT` creates
a per-user Scheduled Task running as the current user at next logon,
with no UAC prompt and no stored password. Same pattern OpenClaw uses.
- When `schtasks /Create` returns "Access is denied" or times out
(locked-down corporate boxes, 15s/30s hard + no-output cutoffs),
fall back to writing a `.cmd` file into
`%APPDATA%\Microsoft\Windows\Start Menu\Programs\Startup\`, which
Windows Explorer fires at every logon. Either path produces the same
end-user experience.
- `_spawn_detached()` launches `pythonw.exe -m hermes_cli.main gateway
run --replace` directly with `DETACHED_PROCESS |
CREATE_NEW_PROCESS_GROUP | CREATE_NO_WINDOW |
CREATE_BREAKAWAY_FROM_JOB` + DEVNULL stdio + sidecar
`logs/gateway-stdio.log`. Going through pythonw.exe (no console)
instead of a cmd.exe shim is what lets the gateway survive the
spawning shell's exit on Windows — documented in
`references/windows-subprocess-sigint-storm.md`.
- Two separate quoting helpers for cmd.exe vs schtasks (`/TR` argument)
— they're different parsers and mixing breaks both. Same split
OpenClaw documents in src/daemon/schtasks.ts.
- `_wait_for_gateway_ready()` + `_report_gateway_start()` poll for a
live gateway process after spawn and report the PID, so install
doesn't lie about success.
Dispatcher wiring in `hermes_cli/gateway.py`:
- `_gateway_command_inner()` gets Windows branches for install /
uninstall / start / stop / restart / status + `_is_service_installed`
+ `_is_service_running`. `gateway status` output + suggested
commands now mention `hermes gateway install` instead of
`sudo hermes gateway install --system` on Windows.
Two separable Windows fixes that only matter for a working
detached gateway, bundled here because shipping them independently
leaves install broken:
(1) Spurious CTRL_C_EVENT on detached pythonw runs. When the gateway
is launched detached on Windows, something on the boot path (HTTPX /
python-telegram-bot / asyncio ProactorEventLoop subprocess plumbing)
synthesizes a Ctrl+C within ~60-90 seconds. Python 3.11 translates it
into KeyboardInterrupt inside `asyncio.run(start_gateway(...))`, the
outer `except KeyboardInterrupt: return` exits cleanly, and the
process dies with no shutdown log — "bot started typing, then
stopped" is the fingerprint because the interrupt fires mid-send.
Fix in `run_gateway()`: when `is_windows()` and stdin is not a TTY,
install `signal.signal(SIGINT, SIG_IGN)` + same for SIGBREAK. Real
console runs have a TTY and skip the absorber, so user Ctrl+C still
works interactively. Same family as commit 449ad952b's browser-tool
SIGINT absorber; cross-referenced in the ref doc.
(2) `wmic process get` is the process-list path used by
`_scan_gateway_pids()` / `find_gateway_pids()`, which power status,
stop, and restart on Windows. `C:\Windows\System32\wbem\WMIC.exe` has
been deprecated since Windows 10 21H1 and is not installed on modern
Win 10/11 boxes, so `find_gateway_pids()` silently returns [] — status
sees no gateway even when one is running. Fix: `shutil.which("wmic")`
first, fall back to PowerShell's `Get-CimInstance Win32_Process`
emitting the same LIST-style `CommandLine=...` / `ProcessId=...` pairs
the downstream parser already handles. Zero behavior change on boxes
where wmic still works.
Verified end-to-end on Windows 10 (Delta-1):
- `hermes gateway install` → falls back to Startup folder (access
denied on schtasks for this user) + detached pythonw spawn, PID
reported correctly.
- Gateway connects to Telegram, answers messages, stays alive past
2min (previously died at ~85s with no shutdown log).
- `hermes gateway stop` + `uninstall` both clean up both tracks.
Refs: openclaw/openclaw src/daemon/schtasks.ts for the ONLOGON +
startup-folder-fallback pattern. skill hermes-agent
references/windows-subprocess-sigint-storm.md for the deeper
CTRL_C_EVENT / ProactorEventLoop background.
|
||
|
|
e93bfc6c93 |
feat(windows): close remaining POSIX-only landmines — TUI crash, kanban waitpid, AF_UNIX sandbox, /bin/bash, npm .cmd shims, cwd tracking, detach flags
Second pass on native Windows support, driven by a systematic audit across
five areas: POSIX-only primitives (signal.SIGKILL/SIGHUP/SIGPIPE, os.WNOHANG,
os.setsid), path translation bugs (/c/Users → C:\Users), subprocess patterns
(npm.cmd batch shims, start_new_session no-op on Windows), subsystem health
(cron, gateway daemon, update flow), and module-level import guards.
Every change is platform-gated — POSIX (Linux/macOS) behaviour is preserved
bit-identical. Explicit "do no harm" test: test_posix_path_preserved_on_linux,
test_posix_noop, test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix.
## New module
- hermes_cli/_subprocess_compat.py — shared helpers (resolve_node_command,
windows_detach_flags, windows_hide_flags, windows_detach_popen_kwargs).
All no-ops on non-Windows.
## CRITICAL fixes (would crash or silently break on Windows)
- tui_gateway/entry.py: SIGPIPE/SIGHUP referenced at module top level would
AttributeError on import on Windows, breaking `hermes --tui` entirely (it
spawns this module as a subprocess). Guard each signal.signal() call with
hasattr() and add SIGBREAK as Windows' SIGHUP equivalent.
- hermes_cli/kanban_db.py: os.waitpid(-1, os.WNOHANG) in dispatcher tick was
unguarded. os.WNOHANG doesn't exist on Windows. Gate the whole reap loop
behind `os.name != "nt"` — Windows has no zombies anyway.
- tools/code_execution_tool.py: AF_UNIX socket for execute_code RPC fails on
most Windows builds. Fall back to loopback TCP (AF_INET on 127.0.0.1:0
ephemeral port) when _IS_WINDOWS. HERMES_RPC_SOCKET env var now accepts
either a filesystem path (POSIX) or `tcp://127.0.0.1:<port>` (Windows).
Generated sandbox client parses both.
- cron/scheduler.py: `argv = ["/bin/bash", str(path)]` hardcoded. Use
shutil.which("bash") so Windows (Git Bash via MinGit) works, with a
readable error when bash is genuinely absent.
- 6 bare npm/npx spawn sites: tools_config.py x2, doctor.py, whatsapp.py
(npm install + node version probe), browser_tool.py x2. On Windows npm
is npm.cmd / npx is npx.cmd (batch shims); subprocess.Popen(["npm", ...])
fails with WinError 193. shutil.which(...) returns the absolute .cmd
path which CreateProcessW accepts because the extension routes through
cmd.exe /c. POSIX behaviour unchanged (shutil.which still returns the
same path subprocess would resolve itself).
## HIGH fixes (silent misbehaviour on Windows)
- tools/environments/local.py get_temp_dir: hardcoded /tmp returned on
Windows meant `_cwd_file = "/tmp/hermes-cwd-*.txt"`, which bash wrote
via MSYS2's virtual /tmp but native Python couldn't open. Result: cwd
tracking silently broken — `cd` in terminal tool did nothing. Windows
branch now returns `%HERMES_HOME%/cache/terminal` with forward slashes
(works in both bash and Python, guaranteed no spaces).
- tools/environments/local.py _make_run_env PATH injection: `/usr/bin not
in split(":")` heuristic mangles Windows PATH (";" separator). Gate
the injection behind `not _IS_WINDOWS`.
- hermes_cli/gateway.py launch_detached_profile_gateway_restart: outer
Popen + watcher-script Popen both used start_new_session=True, which
Windows silently ignores. Watcher stayed attached to CLI's console,
died when user closed terminal after `hermes update`, left gateway
stale. Now branches through windows_detach_popen_kwargs() helper
(CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW on
Windows, start_new_session=True on POSIX — identical to main).
## MEDIUM fixes
- gateway/run.py /restart and /update handlers: hardcoded bash/setsid
chain crashes on Windows when user triggers /update in-gateway. Now
has sys.platform=="win32" branch using sys.executable + a tiny
Python watcher with proper detach flags. POSIX path is unchanged.
- cli.py _git_repo_root: Git on Windows sometimes returns /c/Users/...
style paths that break subprocess.Popen(cwd=...) and Path().resolve().
Added _normalize_git_bash_path() helper that translates /c/Users,
/cygdrive/c, /mnt/c variants to native C:\Users form. POSIX no-op.
_git_repo_root() now routes every result through it.
- cli.py worktree .worktreeinclude: os.symlink on directories failed
hard on Windows (requires admin or Developer Mode). Falls back to
shutil.copytree with a warning log.
## Tests
- 29 new tests in tests/tools/test_windows_native_support.py covering:
subprocess_compat helpers, TUI entry signal guards, kanban waitpid
guard, code_execution TCP fallback source-level invariants, cron bash
resolution, npm/npx bare-spawn lint per-file, local env Windows temp
dir, PATH injection gating, git bash path normalization, symlink
fallback, gateway detached watcher flags.
- One existing test assertion adjusted in test_browser_homebrew_paths:
it compared captured Popen argv to the BARE `"npx"` literal; after the
shutil.which() change argv[0] is the absolute path. New assertion
checks the shape (two items, second is `agent-browser`) rather than
the exact first-item string. Behaviour unchanged; test was too strict.
All 56 tests pass on Linux (30 from previous commits + 26 new).
267 tests from the affected files/dirs (browser, code_exec, local_env,
process_registry, kanban_db, windows_compat) all pass — zero regressions.
tests/hermes_cli/ (3909 pass) and tests/gateway/ (5021 pass) unchanged;
all pre-existing test failures confirmed unrelated via `git stash` re-run.
## What's still deferred (LOW priority)
- Visible cmd-window flashes on short-lived console apps (~14 sites) —
cosmetic, needs a follow-up pass once we have user reports.
- agent/file_safety.py POSIX-only security deny patterns — separate
hardening task.
- tools/process_registry.py returning "/tmp" as fallback — theoretical;
reachable only when all env-var candidates fail.
|
||
|
|
9de893e3b0 |
feat(windows): close native-Windows install gaps — crash-free startup, UTF-8 stdio, tzdata dep, docs
Native Windows (with Git for Windows installed) can now run the Hermes CLI and gateway end-to-end without crashing. install.ps1 already existed and the Git Bash terminal backend was already wired up — this PR fills the remaining gaps discovered by auditing every Windows-unsafe primitive (`signal.SIGKILL`, `os.kill(pid, 0)` probes, bare `fcntl`/`termios` imports) and by comparing hermes against how Claude Code, OpenCode, Codex, and Cline handle native Windows. ## What changed ### UTF-8 stdio (new module) - `hermes_cli/stdio.py` — single `configure_windows_stdio()` entry point. Flips the console code page to CP_UTF8 (65001), reconfigures `sys.stdout`/`stderr`/`stdin` to UTF-8, sets `PYTHONIOENCODING` + `PYTHONUTF8` for subprocesses. No-op on non-Windows. Opt out via `HERMES_DISABLE_WINDOWS_UTF8=1`. - Called early in `cli.py::main`, `hermes_cli/main.py::main`, and `gateway/run.py::main` so Unicode banners (box-drawing, geometric symbols, non-Latin chat text) don't `UnicodeEncodeError` on cp1252 consoles. ### Crash sites fixed - `hermes_cli/main.py:7970` (hermes update → stuck gateway sweep): raw `os.kill(pid, _signal.SIGKILL)` → `gateway.status.terminate_pid(pid, force=True)` which routes through `taskkill /T /F` on Windows. - `hermes_cli/profiles.py::_stop_gateway_process`: same fix — also converted SIGTERM path to `terminate_pid()` and widened OSError catch on the intermediate `os.kill(pid, 0)` probe. - `hermes_cli/kanban_db.py:2914, 3041`: raw `signal.SIGKILL` → `getattr(signal, "SIGKILL", signal.SIGTERM)` fallback (matches the pattern already used in `gateway/status.py`). ### OSError widening on `os.kill(pid, 0)` probes Windows raises `OSError` (WinError 87) for a gone PID instead of `ProcessLookupError`. Widened the catch at: - `gateway/run.py:15101` (`--replace` wait-for-exit loop — without this, the loop busy-spins the full 10s every Windows gateway start) - `hermes_cli/gateway.py:228, 460, 940` - `hermes_cli/profiles.py:777` - `tools/process_registry.py::_is_host_pid_alive` - `tools/browser_tool.py:1170, 1206` ### Dashboard PTY graceful degradation `hermes_cli/pty_bridge.py` depends on `fcntl`/`termios`/`ptyprocess`, none of which exist on native Windows. Previously a Windows dashboard would crash on `import hermes_cli.web_server` because of a top-level import. Now: - `hermes_cli/web_server.py` wraps the pty_bridge import in `try/except ImportError` and sets `_PTY_BRIDGE_AVAILABLE=False`. - The `/api/pty` WebSocket handler returns a friendly "use WSL2 for this tab" message instead of exploding. - Every other dashboard feature (sessions, jobs, metrics, config editor) runs natively on Windows. ### Dependency - `pyproject.toml`: add `tzdata>=2023.3; sys_platform == 'win32'` so Python's `zoneinfo` works on Windows (which has no IANA tzdata shipped with the OS). Credits @sprmn24 (PR #13182). ### Docs - README.md: removed "Native Windows is not supported"; added PowerShell one-liner and Git-for-Windows prerequisite note. - `website/docs/getting-started/installation.md`: new Windows section with capability matrix (everything native except the dashboard `/chat` PTY tab, which is WSL2-only). - `website/docs/user-guide/windows-wsl-quickstart.md`: reframed as "WSL2 as an alternative to native" rather than "the only way". - `website/docs/developer-guide/contributing.md`: updated cross-platform guidance with the `signal.SIGKILL` / `OSError` rules we enforce now. - `website/docs/user-guide/features/web-dashboard.md`: acknowledged native Windows works for everything except the embedded PTY pane. ## Why this shape Pulled from a survey of how other agent codebases handle native Windows (Claude Code, OpenCode, Codex, Cline): - All four treat Git Bash as the canonical shell on Windows, same as hermes already does in `tools/environments/local.py::_find_bash()`. - None of them force `SetConsoleOutputCP` — but they don't have to, Node/Rust write UTF-16 to the Win32 console API. Python does not get that for free, so we flip CP_UTF8 via ctypes. - None of them ship PowerShell-as-primary-shell (Claude Code exposes PS as a secondary tool; scope creep for this PR). - All of them use `taskkill /T /F` for force-kill on Windows, which is exactly what `gateway.status.terminate_pid(force=True)` does. ## Non-goals (deliberate scope limits) - No PowerShell-as-a-second-shell tool — worth designing separately. - No terminal routing rewrite (#12317, #15461, #19800 cluster) — that's the hardest design call and needs a separate doc. - No wholesale `open()` → `open(..., encoding="utf-8")` sweep (Tianworld cluster) — will do as follow-up if users hit actual breakage; most modern code already specifies it. ## Validation - 28 new tests in `tests/tools/test_windows_native_support.py` — all platform-mocked, pass on Linux CI. Cover: - `configure_windows_stdio` idempotency, opt-out, env-preservation - `terminate_pid` taskkill routing, failure → OSError, FileNotFoundError fallback - `getattr(signal, "SIGKILL", …)` fallback shape - `_is_host_pid_alive` OSError widening (Windows-gone-PID behavior) - Source-level checks that all entry points call `configure_windows_stdio` - pty_bridge import-guard present in `web_server.py` - README no longer says "not supported" - 12 pre-existing tests in `tests/tools/test_windows_compat.py` still pass. - `tests/hermes_cli/` ran fully (3909 passed, 9 failures — all confirmed pre-existing on main by stash-test). - `tests/gateway/` ran fully (5021 passed, 1 pre-existing failure). - `tests/tools/test_process_registry.py` + `test_browser_*` pass. - Manual smoke: `import hermes_cli.stdio; import gateway.run; import hermes_cli.web_server` — all clean, `_PTY_BRIDGE_AVAILABLE=True` on Linux (as expected). ## Files - New: `hermes_cli/stdio.py`, `tests/tools/test_windows_native_support.py` - Modified: `cli.py`, `gateway/run.py`, `hermes_cli/main.py`, `hermes_cli/profiles.py`, `hermes_cli/gateway.py`, `hermes_cli/kanban_db.py`, `hermes_cli/pty_bridge.py`, `hermes_cli/web_server.py`, `tools/browser_tool.py`, `tools/process_registry.py`, `pyproject.toml`, `README.md`, and 4 docs pages. Credits to everyone whose prior PR work informed these fixes — see the co-author trailers. All of the PRs listed in `~/.hermes/plans/windows-support-prs.md` fixing `os.kill` / `signal.SIGKILL` / UTF-8 stdio / tzdata / README patterns found the same issues; this PR consolidates them. Co-authored-by: Philip D'Souza <9472774+PhilipAD@users.noreply.github.com> Co-authored-by: Arecanon <42595053+ArecaNon@users.noreply.github.com> Co-authored-by: XiaoXiao0221 <263113677+XiaoXiao0221@users.noreply.github.com> Co-authored-by: Lars Hagen <1360677+lars-hagen@users.noreply.github.com> Co-authored-by: Luan Dias <65574834+luandiasrj@users.noreply.github.com> Co-authored-by: Ruzzgar <ruzzgarcn@gmail.com> Co-authored-by: sprmn24 <oncuevtv@gmail.com> Co-authored-by: adybag14-cyber <252811164+adybag14-cyber@users.noreply.github.com> Co-authored-by: Prasanna28Devadiga <54196612+Prasanna28Devadiga@users.noreply.github.com> |
||
|
|
84287b0de8 | fix(docker): refuse root gateway runs in official image | ||
|
|
3c439ec681 |
feat(gateway): add hermes gateway list to show all profiles' gateway status
Add a new `hermes gateway list` subcommand that shows the running
status of gateways across all profiles in a single view:
Gateways:
✓ default (current) — PID 155469
✓ wx1 — PID 166893
✗ dev — not running
Also includes `_print_other_profiles_gateway_status()` which appends
an "Other profiles" section to `hermes gateway status` output when
other profile gateways are running.
Both use existing `list_profiles()` and `find_profile_gateway_processes()`
— no new dependencies.
Closes #19127
Related: #19113, #4402, #4587
|
||
|
|
2e00bcaaab |
fix(oauth,gateway): monotonic deadlines for polling/timeout loops
Widen PR #20314's fix to the other timeout-polling sites in the codebase that share the same wall-clock-jump bug class. All of these measure elapsed timeout duration, not civil time, so they belong on time.monotonic(). - hermes_cli/auth.py: auth-store file-lock timeout, Spotify OAuth callback wait, Nous portal device-auth token poll. - hermes_cli/copilot_auth.py: Copilot OAuth device-flow token poll. - hermes_cli/gateway.py: gateway systemd restart wait. - hermes_cli/web_server.py: dashboard Codex device-auth user_code wait, dashboard Nous device-auth token poll. (sess["expires_at"] stays on time.time() — it's a persisted absolute timestamp, not a local deadline-polling variable.) - agent/copilot_acp_client.py: Copilot ACP JSON-RPC request timeout. |
||
|
|
d797755a1c | fix(gateway): wait for systemd restart readiness | ||
|
|
3cdbf334d5 |
fix(gateway): don't dead-end setup wizard when only system-scope unit is installed
The setup wizard dropped non-root users at a bare shell prompt when trying to start a system-scope gateway service. Previously _require_root_for_system_service called sys.exit(1), which the wizard's `except Exception` guards cannot catch (SystemExit is a BaseException). Users with a pre-existing /etc/systemd/system unit (e.g. from an earlier `sudo hermes setup` run) hit this whenever they re-ran `hermes setup` as a regular user. - Convert _require_root_for_system_service to raise a typed SystemScopeRequiresRootError (RuntimeError subclass) instead of sys.exit(1). The direct CLI path (`hermes gateway install|start|stop| restart|uninstall` without sudo) still exits 1 cleanly via a new catch at the top of gateway_command, matching the existing UserSystemdUnavailableError pattern. - Add _system_scope_wizard_would_need_root() pre-check and _print_system_scope_remediation() helper. Both setup wizards (hermes_cli/setup.py and hermes_cli/gateway.py::gateway_setup) now detect the dead-end before prompting and print actionable guidance: either `sudo systemctl start <service>` this time, or uninstall the system unit and install a per-user one. - Defense-in-depth: all 5 wizard prompt sites also catch SystemScopeRequiresRootError and fall back to the remediation helper if the pre-check is bypassed (race, etc.). Tests: 12 new tests in TestSystemScopeRequiresRootError, TestSystemScopeWizardPreCheck, TestSystemScopeRemediationOutput, and TestGatewayCommandCatchesSystemScopeError covering the exception contract, pre-check matrix (root vs non-root, system-only vs user-present vs none vs explicit system=True), remediation output for each action, and the direct-CLI exit-1 path. |
||
|
|
b632290166 | fix(gateway): handle planned service stops | ||
|
|
8ab9f61dcf | fix(gateway): preserve WSL interop PATH in systemd units | ||
|
|
8c8f95bc8e |
fix(gateway): show friendly error when service is not installed
Instead of an unhelpful CalledProcessError traceback when running `hermes gateway start/stop/restart` without first installing the service, check for the unit file and exit with an actionable install hint. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |