hermes-agent/hermes_cli/gateway.py
brooklyn! 51c68d4ab1
Add Hermes desktop app (#20059)
* feat: better composer etc

* docs: add desktop and dashboard run instructions

* fix(desktop): address security scan findings

* fix(dashboard): resolve @nous-research/ui path under npm workspaces

The sync-assets prebuild step shelled out to 'cp -r
node_modules/@nous-research/ui/dist/fonts ...' with a path relative
to apps/dashboard/. That works only when the dep is installed
locally in the dashboard workspace, but 'npm install' at the repo
root (the documented setup — see apps/desktop/README.md) hoists
shared deps to the root node_modules under npm workspaces. The
relative cp then fails with 'No such file or directory', sync-assets
exits 1, the Vite build aborts, and 'hermes dashboard' surfaces a
generic 'Web UI build failed' message.

Replace the shell one-liner with scripts/sync-assets.cjs, which
walks up from the dashboard directory looking for node_modules/
@nous-research/ui — working in both the hoisted (workspaces) and
co-located (standalone) layouts. Also guards against a missing
dist/fonts or dist/assets with a clearer error pointing at a
rebuild of the UI package rather than silently copying nothing.

* feat(desktop): support connecting to a remote Hermes backend

Add HERMES_DESKTOP_REMOTE_URL and HERMES_DESKTOP_REMOTE_TOKEN env
vars that, when set, short-circuit the local-child spawn in
startHermes() and connect the Electron renderer to an already-
running 'hermes dashboard' server reachable over the network.

Motivating use case: WSL2 users who want to run the Hermes core
(agent loop, tools, filesystem access) inside their WSL
distribution while rendering the Electron GUI on native Windows.
Before this change, the desktop app always spawned a local Python
child on the same host as the renderer, which doesn't cross the
WSL/Windows boundary.

The remote path reuses waitForHermes() as a liveness probe
(/api/status is in the backend's public endpoint allowlist), so
the connection is only returned once the backend is actually
ready. WebSocket URL derivation picks ws:// or wss:// based on
the input scheme. URL validation rejects non-http(s) schemes and
requires both env vars together to avoid a half-configured
connection that would silently fall through to the spawn path.

No behaviour change when the env vars are unset — the default
local-spawn flow is untouched.

Typical usage:

  # in WSL2
  hermes dashboard --tui --no-open --host 0.0.0.0 --port 9119 --insecure

  # on Windows
  set HERMES_DESKTOP_REMOTE_URL=http://localhost:9119
  set HERMES_DESKTOP_REMOTE_TOKEN=<session token>
  set HERMES_DESKTOP_IGNORE_EXISTING=1
  (launch Hermes desktop)

* ci(desktop): automate desktop releases

Add GitHub Actions release channels for signed desktop installers and document the stable/nightly download paths.

* feat: file tabs

* refactor(desktop): tighten right-rail tab close API

Promote closeRightRailTab/closeActiveRightRailTab as the single
public entry point. Drops the activeTabRef + handleCloseDocument
indirection in ChatPreviewRail, the unused $rightRailHasContent
atom, and the legacy dismissFilePreviewTarget alias. -70 LOC.

* feat(desktop): polish composer pill toward reference look

Solid foreground-on-background send/voice-conversation circle (black-on-white
in light, white-on-black in dark) anchors the right edge as the primary CTA
instead of the orange theme primary. Bumps the primary control to 2.125rem so
it visually outranks the ghost mic/plus controls. Opens up the surface padding
(0.625rem x / 0.5rem y) so the input row breathes around its controls, and
nudges the corner radius from 20 to 24px for a slightly pill-ier silhouette.
LiquidGlass distortion is preserved.

* feat(desktop): add startup and onboarding flow

Add phase-based desktop boot progress, fresh-install sandbox testing, and first-run provider credential onboarding so packaged installs can start cleanly without manual settings detours.

* fix(desktop): gate prompts on provider setup

Show the desktop provider onboarding flow before prompt submission when no inference provider is configured, preventing fresh installs from falling through to backend credential errors.

* fix(desktop): surface provider onboarding from session warnings

Propagate credential warnings through session runtime info and open desktop onboarding whenever a session reports no usable provider, so unconfigured installs cannot fall through to prompt errors.

* fix(desktop): route gateway provider errors to onboarding

The "No inference provider configured" auth error reaches the renderer through gateway error events, not the prompt.submit promise; the previous patch only caught the latter, so the error toast still surfaced and onboarding never opened.

Also strip credential-shaped env vars from the test:desktop:fresh sandbox so the packaged backend can't see provider keys leaking from the launching shell.

* fix(desktop): use strict runtime check to drive onboarding

setup.status returned True whenever any provider auth state was discoverable, including indirect fallbacks like a gh-CLI Copilot token. That made desktop think the user was set up while the agent's actual resolve_runtime_provider call still raised AuthError, leaving the user with a useless toast and no onboarding.

Add a setup.runtime_check gateway method that runs the same resolver the agent uses on session creation, and switch the desktop onboarding overlay and prompt precheck to use it.

* feat(desktop): OAuth-first onboarding using existing dashboard provider API

Replace the engineer-flavored API key form with a Sign-in-first onboarding overlay that uses the dashboard's existing /api/providers/oauth catalog and PKCE/device-code endpoints (Anthropic, Nous, OpenAI Codex, etc.). API key entry is now a fallback tab with friendly provider names instead of env var prefixes, and the loud raw resolver error is gone in favor of a one-line welcome message.

* fix(desktop): polish onboarding provider list

Reorder OAuth providers so Nous Portal is first, give the segmented Sign in / API key control equal column widths, and replace the engineer-flavored backend names like "Anthropic (Claude API)" / "MiniMax (OAuth)" with friendlier in-app titles. External-CLI providers now show a softer subtitle and an external-link icon instead of a chevron.

* refactor(desktop): split onboarding overlay into store + view

Move the OAuth state machine, runtime check, copy-to-clipboard, and api-key save into store/onboarding.ts (matching the boot.ts pattern), leaving the overlay as a presentation layer that subscribes via useStore. Tabs are now table-driven, child panels read flow from the store instead of prop-drilling, and the polling/PKCE/error/success branches share a small Status atom.

* fix(desktop): external CLI providers + center mode tabs

External-CLI providers (Claude Code, Qwen Code) now open an in-overlay panel with the CLI command, copy button, and an "I've signed in" recheck instead of firing an invisible toast. Center the Sign in / API key tab control so it sits under the heading instead of hugging the left edge.

* fix(desktop): drop onboarding tabs for an inline link, group device-code waiting state

Replace the Sign in / API key tab pair with an "I have an API key" footer link under the OAuth provider list, with a "Back to sign in" affordance inside the API key form. Group the device-code "Waiting for you to authorize..." status next to the Cancel button so the alignment matches the action.

* refactor(desktop): tighten onboarding store + overlay

Drop the dead isOnboardingBusy/BUSY set, factor the catch-fallback dance into safeReq, and share a single reloadAndConnect helper between PKCE submit, device-code success, external recheck, and api-key save.

In the overlay, extract Step / CodeBlock / FlowFooter / CancelBtn / DocsLink atoms so the four sign-in panels share the same chrome instead of repeating it inline. Net effect: fewer literal divs, one place to touch the spacing, and the code-block + footer rows are reusable across future flows.

* fix(desktop): mount onboarding from frame 1 to kill the FOUT

Default onboarding.configured to null (unknown until the runtime check resolves) and have the onboarding overlay render whenever it's not yet confirmed true. The boot overlay now yields to it, so the very first paint is the Welcome card with a "While we get you set up..." progress strip instead of a flash of the chat shell between boot dismiss and onboarding mount.

The picker swaps in cleanly once the gateway opens and the runtime check confirms the user is not configured. Already-configured users see the same prep card briefly while their existing runtime warms up, then the overlay dismisses without touching the chat shell.

* fix(desktop): top-align empty sessions placeholder

The "Start a chat to build your history." empty state used a min-h-35 grid place-items-center container, which floated the text in a tall dead zone. Render it as a flat paragraph that sits right under the section header like the empty pinned state does.

* refactor(desktop): drop dead boot overlay

Onboarding overlay subsumes the boot card now that it mounts from frame 1 and renders boot progress inline. The standalone DesktopBootOverlay is unreachable in every flow (yields whenever onboarding has not confirmed configured, dismisses once it has).

* fix(desktop): hide pinned/recents sections until first session

A fresh sidebar showed the Pinned and Recent chats headers with floating empty-state copy underneath. Drop both sections (and the now-orphan SidebarEmptySessionState) when there are no sessions yet — they reappear after the first chat. Skeletons during initial load are unchanged.

* feat(gui): route embedded TUI through dashboard gateway (#21979)

Inject HERMES_TUI_GATEWAY_URL into dashboard PTY sessions so embedded ui-tui instances attach to the in-process websocket gateway, with coverage for the new env wiring.

* Add desktop remote gateway settings

Make the desktop gateway connection configurable from settings so local remains the default while remote backends can be saved, tested, and applied without environment variables.

* feat(gui): first-class Messaging page + gateway menu redesign

- Add Messaging page to the desktop app with per-platform setup,
  status, and inline guidance. Catalog derives from gateway.config
  Platform enum + plugin registry, so every messaging adapter the CLI
  supports (Telegram, Discord, Slack, Mattermost, Matrix, WhatsApp,
  Signal, BlueBubbles, Home Assistant, Email, SMS, DingTalk, Feishu,
  WeCom, Weixin, QQ, Yuanbao, API server, Webhooks, plugins) shows up
  without per-platform code.
- New REST endpoints: GET /api/messaging/platforms, PUT and POST
  /test on the same path. Secrets go through the existing .env
  pipeline; enable/disable writes config.yaml.
- Replace gateway statusbar dropdown with a richer panel: status row,
  icon-only restart + system-panel actions, recent activity (with
  timestamps trimmed in display, full text on hover), platform list.
- Auto-poll the messaging page every 6s (paused when hidden) so
  status updates without a manual check.
- Drop Settings / Command Center from the sidebar nav (still
  reachable via shortcuts and the titlebar cog).
- Flatten top corners on Messaging/Skills/Artifacts/Chat panes.
- Share new StatusDot component across messaging + gateway menu.
- Fix gateway/config.py so an explicit platforms.<name>.enabled=false
  in config.yaml is honored when env tokens are present.
- pb-9 on the chat content area for breathing room above the composer.

* Potential fix for pull request finding 'CodeQL / Clear-text logging of sensitive information'

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

* pin electron version

* hide application menu on non-mac systems

* interpret compactPreview for non-string vlaues as JSON or an empty string

* fix(desktop): keep composer contenteditable mounted across stacked toggle

The composer rendered {input} inside two different parent fragments
depending on `stacked`. When auto-expand flipped `stacked` (e.g. the
moment typed text wrapped past two lines), React reconciled the two
branches as different positions and unmounted/remounted the
contenteditable. The fresh mount started empty, so any in-flight
characters — most reliably reproduced by holding a key — were lost.

Replace the conditional with a single CSS Grid whose template-areas
swap on `stacked`. The three children (menu, input, controls) keep
stable identities across the toggle; only their grid placement
changes, which the browser handles without React tearing down the
editor.

* refactor(desktop): align install layout with install.ps1 / install.sh

Make the desktop app's runtime layout match what scripts/install.ps1 and
scripts/install.sh produce, so a desktop-only user and a CLI-only user end
up with the same files in the same places and can share one install.

Layout
- ACTIVE_HERMES_ROOT = HERMES_HOME/hermes-agent  (was: process.resourcesPath/hermes-agent, read-only)
- VENV_ROOT          = HERMES_HOME/hermes-agent/venv  (was: userData/hermes-runtime)
- desktop.log        = HERMES_HOME/logs/desktop.log  (was: userData/desktop.log)
- HERMES_HOME default: %LOCALAPPDATA%\hermes on Windows, ~/.hermes elsewhere

The packaged .app/.exe still ships a read-only payload at
process.resourcesPath/hermes-agent (FACTORY_HERMES_ROOT). On first launch
or after an installer-driven upgrade we sync factory -> active, then
provision the venv and run pip install -e . against the active root.

Key behaviors
- Pin HERMES_HOME in the spawned Python's env so get_hermes_home() resolves
  to the same path resolveHermesHome() picked. Without this, Python falls
  back to ~/.hermes on every platform - fine on mac/linux, a split-state
  bug on Windows where our default is %LOCALAPPDATA%\hermes.
- Detect developer installs by .git presence at ACTIVE; never overwrite
  a user's checkout via factory sync.
- Marker at ACTIVE/.hermes-desktop-runtime.json (schema v4) tracks
  pyproject hash + factory version + runtime schema version. depsFresh
  fast-paths when nothing changed.
- Dev (npm run dev) prefers SOURCE_REPO_ROOT over ACTIVE so devs run
  their local edits, not whatever's under HERMES_HOME.
- Better error messages distinguish "no payload" from "no Python".
- Preserve a legacy ~/.hermes on Windows when no %LOCALAPPDATA%\hermes
  exists, so users with prior pip/manual installs aren't orphaned.

pyproject.toml
- Promote fastapi, uvicorn[standard], ptyprocess (non-Windows), and
  pywinpty (Windows) to main dependencies. The dashboard backend
  (hermes dashboard) needs them at runtime; the previous lazy-import
  fallback was a footgun for fresh installs.
- Empty the [pty] optional-extra; kept as a no-op back-compat alias for
  any existing pip install hermes-agent[pty] invocations.

Drops the hardcoded BUNDLED_RUNTIME_REQUIREMENTS list in main.cjs - the
desktop now installs whatever pyproject.toml says, single source of truth.

Files
- apps/desktop/electron/main.cjs:    runtime layout, HERMES_HOME pin,
                                      factory->active sync, marker v4
- apps/desktop/scripts/test-desktop.mjs:  track new venv location
- apps/desktop/README.md:            new Setup, Runtime Bootstrap, and
                                      Debugging sections
- pyproject.toml:                    fastapi/uvicorn/pty backends in main
                                      dependencies; [pty] extra emptied

Tested locally on Windows: npm run dev boots cleanly, sessions land at
the new location, type-check + lint + test:desktop:platforms all pass.
Verified end-to-end on a fresh Win11 VM via dist:win installer.

Known gaps (filed as follow-ups, not in this PR):
- Skills not seeded on packaged installs (sync_skills only runs in
  cmd_chat, not cmd_dashboard). Need to move to shared pre-dispatch.
- Git Bash not bundled or detected; agent's terminal tool errors out
  with a useful message but desktop bootstrapper should pre-flight it.
- install.ps1 / install.sh should be decomposed into composable phase
  libraries so the desktop bootstrapper can reuse them as a single
  source of truth across all install surfaces.

* feat(desktop): theme polish, prose chat typography, composer chrome

- DS tokens/midground, Backdrop, scoped scrollbars, typography plugin + prose
- Composer liquid/radius utilities, thread font parity, tool/thinking cues
- File tree label scale, preview flex, thread retry loading + streaming tests

* feat(desktop): NSIS prereq detection page + auto-install via winget

The packaged Windows installer now detects Python 3.11+ and Git for Windows
at install time and offers to install missing prereqs via winget. Mirrors
the prereq logic scripts/install.ps1 already runs for CLI installs, so
desktop installer users get the same out-of-the-box experience as
install.ps1 users.

Why
- Hermes' terminal tool calls bash.exe directly (tools/environments/
  local.py); on Windows that's Git Bash from Git for Windows. Without it,
  the agent fails on the first terminal() call.
- Hermes' Python runtime needs 3.11+. Without it, the desktop bootstrapper
  errors out at venv creation.
- Both gaps surfaced on a fresh Windows 11 VM smoke test: VM had Python
  pre-installed but no Git, so the agent's first terminal call failed
  with "Git Bash isn't installed."
- install.ps1 has had Install-Git + Install-Uv functions for ages. The
  desktop installer was the asymmetric outlier.

How — NSIS prereq page
- New file: apps/desktop/installer/prereq-check.nsh (plugged into
  electron-builder via build.nsis.include)
- Real Wizard page using nsDialogs, inserted via customPageAfterChangeDir
  hook (between the Directory page and InstFiles).
  - Group boxes for Python and Git, each showing detection status.
  - Pre-checked install checkboxes when winget is available.
  - Auto-skips silently if both prereqs are already installed.
  - Falls back to manual download URLs when winget itself is missing.
- Detection:
  - Python: probes `py -3.11`/`-3.12`/`-3.13`/`-3.14` via the Python
    launcher. Microsoft Store "Python stub" (no py.exe) is correctly
    classified as not-installed.
  - Git: `where git`.
  - winget: `where winget` (Win10 1809+ / Win11 with App Installer).
- Install execution (in customInstall macro):
  - Python: nsExec::ExecToLog with `--scope user --silent`. Per-user
    install, no UAC prompt, output streams to install log.
  - Git: ExecShellWait via Windows ShellExecute. Critical because Git
    always installs per-machine and triggers UAC; ShellExecute preserves
    the foreground focus chain across non-elevated → elevated process
    spawns, so UAC actually comes to the foreground. nsExec::ExecToLog
    breaks the chain because winget runs hidden.
  - Both pass `--disable-interactivity --accept-package-agreements
    --accept-source-agreements` to suppress winget's own dialogs.
- Verification: probes Git's standard install locations via FileExists
  rather than `where git`. NSIS's process inherits PATH at startup, so
  a freshly-installed Git won't be visible to `where` until restart.
- Silent installs (/S) skip the prompts; managed deploys handle prereqs
  out-of-band via Group Policy / Intune.

How — Electron-side safety net
- New findGitBash() in main.cjs, parallel to findSystemPython(). Probes
  the same locations as tools/environments/local.py:_find_bash() so a
  positive result here means the agent's terminal tool will work.
- ensureRuntime now throws a clear, actionable error on Windows when Git
  Bash isn't found, matching the existing "Python 3.11+ is required"
  error path.
- Catches users the NSIS page doesn't: .msi installer users (NSIS prereq
  page doesn't run for MSI), `npm run dev` users, manual installers,
  anyone who unchecked the install boxes on the NSIS prereq page.
- All gated on `IS_WINDOWS`; macOS / Linux unaffected.

NSIS build issue (resolved)
- electron-builder defaults to `-WX` (warnings as errors). NSIS optimizer
  emits "warning 6010: function not referenced" for our page functions
  because Page custom directives don't count as references in its
  static-analysis pass. The functions ARE called at runtime when NSIS
  invokes the page; the optimizer just can't see it statically.
- Set `build.nsis.warningsAsErrors=false` in package.json so this
  spurious warning doesn't fail the build. (Documented option from
  electron-builder's nsisOptions.)

Out of scope (filed for future work)
- MSI prereq detection: Windows Installer custom actions are a different
  mechanism. Enterprise deploys typically handle prereqs via GP/Intune.
- Bundle PortableGit + python-build-standalone in extraResources for
  zero-network installs. ~80MB increase.
- Mac / Linux GUI prereq flows (different installer formats; Xcode CLT
  covers most macOS prereqs already; Linux is per-distro hard).

Files
- apps/desktop/installer/prereq-check.nsh   (new, ~290 lines NSIS)
- apps/desktop/package.json                 (build.nsis.include +
                                              warningsAsErrors)
- apps/desktop/electron/main.cjs            (findGitBash + preflight)
- apps/desktop/README.md                    (Runtime prerequisites
                                              section)

Cross-platform impact
- macOS / Linux builds (dist:mac, dist:mac:dmg, dist:mac:zip): nsis
  config is ignored entirely; .nsh is dormant.
- npm run dev: .nsh dormant; main.cjs preflight gated on IS_WINDOWS.
- scripts/install.ps1, scripts/install.sh: no reference to any new
  files; CLI install paths untouched.
- Hermes CLI / dashboard / gateway: no reference; runtime untouched.
- All checks: node --check on main.cjs and test-desktop.mjs pass;
  npm run test:desktop:platforms 4/4 passing; node --test green.

Tested
- npm run dist:win produces signed .exe and .msi without errors.
- Fresh Win11 VM (Python pre-installed, no Git): prereq page renders,
  Python check shows detected, Git checkbox pre-checked. Click Next →
  Git installs via winget with UAC prompt in foreground.
- After install completes, Hermes launches and the agent's terminal
  tool can run bash commands. Verified Git Bash is detected at
  `C:\Program Files\Git\bin\bash.exe` by ensureRuntime's preflight.

* feat: theme changes, composer tweaks, in app update ux, finesse

* fix(cli): seed bundled skills on dashboard + gateway entrypoints

`sync_skills(quiet=True)` was only being called from inside `cmd_chat`,
which meant `hermes dashboard` (the desktop GUI's backend) and `hermes
gateway` (Telegram/Discord/Slack/etc daemons) never seeded the bundled
skill library into ~/.hermes/skills/.

This surfaced as "No skills found" in the desktop GUI's skills panel on
fresh installs, despite the agent having access to the full bundled
library when invoked via `hermes chat`. scripts/install.ps1 worked
around it by running skills_sync.py as part of Copy-ConfigTemplates,
but that's not part of the desktop installer's bootstrap chain.

Fix
- Extract the skills-sync block from cmd_chat into a module-level
  `_sync_bundled_skills_quietly()` helper.
- Call the helper from cmd_chat (preserving existing behavior),
  cmd_dashboard (after the --status/--stop early-return paths and
  fastapi import check, so we don't run skills_sync on management
  commands or when deps aren't installed), and cmd_gateway.

Why these three entrypoints
- cmd_chat: the user's primary CLI entrypoint
- cmd_dashboard: the desktop GUI's backend; this is what `hermes
  dashboard --tui` invokes when the desktop bootstrapper spawns Hermes
- cmd_gateway: long-running daemons where the user expects the agent
  to have full skill access

Other entrypoints (cmd_config, cmd_doctor, cmd_login, cmd_status,
etc.) are management commands that don't need skill discovery and were
never running skills_sync in the first place — leaving them alone.

Idempotence
- tools/skills_sync.py is manifest-based: skipped skills cost
  milliseconds. Calling it from multiple entrypoints adds no real
  cost, and users running `hermes chat` then `hermes dashboard` get
  two fast no-ops on the second call.

Failure handling
- Helper wraps skills_sync in try/except. Skills are an enhancement,
  not a hard dependency — Hermes runs fine with an empty skills/ dir.

Files
- hermes_cli/main.py:
  + new helper `_sync_bundled_skills_quietly()` at module level
  + cmd_chat: replace inline block with helper call
  + cmd_dashboard: add helper call after fastapi import succeeds
  + cmd_gateway: add helper call before delegating to gateway_command

* feat(desktop): hoisted todo widget, JSON tool summaries, history grouping & timer fixes

- Hoist todo to first-class widget (shadcn checkboxes, brand colors, no
  tool-accordion). Header derives label from active task; non-active rows fade.
- Replace raw JSON dumps with structured key/value summaries via
  formatToolResultSummary; nested error extraction for clearer failures.
- Fix loaded-session grouping: stitch interleaved assistant/tool iterations
  into one bubble instead of orphaned synthetic messages.
- Stable tool/thinking timers via keyed registry so unmount/scroll doesn't
  reset elapsed counts; gate "running" on real live thread state.
- Reorganize chat-only assistant-ui components under components/chat/.

* fix(desktop): address CodeQL alerts on PR #20059

- settings/helpers.ts: harden setNested against prototype pollution.
  POLLUTING_PATH_PARTS check is now applied at every assignment site
  (loop + leaf) and uses Object.defineProperty so CodeQL can see the
  guard inline rather than via a helper function call.

- lib/markdown-preprocess.ts: rebuild the dangling-fence close regex
  from a fence-char + length instead of marker.replace(...). The marker
  is captured by `(`{3,}|~{3,})` so it can only be backticks or tildes,
  but CodeQL was tracing tainted input text into the RegExp source and
  flagging hostname dots from input as part of the pattern (false
  positive js/incomplete-hostname-regexp on the test fixture URLs).
  Reconstructing from a literal char breaks the dataflow.

- scripts/notarize-artifact.cjs: drop args from the run() rejection
  message. Args carry --key-id / --issuer / key file path; the existing
  outer catch already squashes errors to a generic line, but CodeQL was
  flagging the args.join(' ') as clear-text logging of APPLE_API_KEY_ID.

Composer DOM-text-as-HTML alerts (composer/index.tsx:379, :547) are
already addressed in 4dd9732a9 — innerHTML assignment was replaced with
renderComposerContents which builds DOM via replaceChildren / append
text nodes (no HTML interpretation).

* fix(desktop): inline prototype-pollution guard so CodeQL sees it

CodeQL's dataflow doesn't follow the helper-function guard inside
`safeSet`, so it kept flagging Object.defineProperty as prototype-
polluting. Inline the literal `__proto__`/`constructor`/`prototype`
check at the assignment site to break the dataflow.

Behavior unchanged — same set of disallowed keys, same throw.

* feat(ui-tui): resolve links to readable page titles

Mirror desktop pretty-link behavior in the TUI by resolving HTTP links to page titles with shared caching and safe fetch filters, plus slug-based fallbacks so chat links stay readable even when title fetch fails.

* fix(desktop): drop RegExp from dangling-fence close detection

Previous attempt tried to break the dataflow by reconstructing the
close-fence regex from a literal char + marker.length, but CodeQL still
traced marker.length back to input and kept flagging the test-fixture
URLs as hostname-regex sources (js/incomplete-hostname-regexp).

Replace `new RegExp(...)` + `closeRe.test(body)` with a string-only
hasCloseFenceLine() helper that splits on '\n' and uses ===. No regex
on this path now, so input data can no longer reach a RegExp source.

Behavior preserved: matches lines that are (whitespace + marker +
whitespace), which is what the original `\n[ \t]*${marker}[ \t]*(?=\n|$)`
matched. All 12 markdown-text tests still pass.

* fix(process-registry): suppress windows-footgun false positive on guarded killpg

Keep the existing POSIX-only process-group teardown path, but make the
signal selection explicit via getattr and add an inline windows-footgun
suppression marker on the guarded os.killpg line so the Windows footgun
check no longer blocks CI on this intentionally platform-gated code.

* feat(desktop): reconcile live tool events, polish thread chrome, harden boot

- chat-messages: match tool rows by overlapping query/context/preview values
  so preview-first `tool.progress` rows reliably adopt later stable-id
  `tool.start` payloads instead of spawning ghost rows or mis-merging
  parallel same-name calls; preserve prior args/result across phases.
- tui_gateway: emit full args + parsed result on `tool.start` / `tool.complete`,
  drop redundant `tool.started` re-emit from `tool.progress`.
- electron/main: prefer SOURCE_REPO_ROOT before PATH `hermes` in dev so
  local backend edits actually run; split hardening helpers into
  `electron/hardening.cjs` with tests.
- thread/tool UI: one-shot enter animation keyed by stable ids, braille
  spinner for running rows, Cursor-like disclosure rows, drill-down +
  duration/count formatting via new tool-fallback-model.
- composer: extract `text-utils`, drop liquid-glass overrides.
- right-rail: split preview-pane into preview-console / preview-file.
- runtime: incremental external-store runtime + runtime-readiness gate;
  onboarding store + tests; route-resume hook test.
- regression tests for live tool reconciliation (parallel tools, id-less
  progress, preview-first rows, structured args/results).

* feat(desktop): add ripgrep to NSIS prereq page + polish layout

Add ripgrep as a third (recommended) prereq alongside Python and Git in
the NSIS prereq detection page, and clean up the page layout based on
on-VM testing.

Why ripgrep
- Hermes' search_files tool calls `rg` directly for content + filename
  search (tools/file_operations.py:1382). Falls back to grep/find from
  Git Bash when missing — works but slower and noisier (no .gitignore
  awareness).
- ~5MB winget install via `BurntSushi.ripgrep.MSVC --scope user` — no
  UAC prompt, parallel to how Python installs.
- scripts/install.ps1 already installs ripgrep as part of
  Install-SystemPackages; this brings the desktop installer to parity.

Why "recommended" not "required"
- Python and Git are hard requirements: without them the agent runtime
  or terminal tool refuses to start. The bootstrapper preflight throws.
- ripgrep is a performance enhancement: missing it just means slower
  searches. Page wording reflects this; failure to install is logged
  but doesn't show a MessageBox or block.

Layout polish (response to on-VM screenshot review)
- Wizard header now correctly reads "System Requirements" instead of
  the leftover "Choose Install Location" from the previous page. Set
  via `GetDlgItem $HWNDPARENT 1037/1038` + WM_SETTEXT — the standard
  NSIS pattern for overriding the page header on a custom Page.
- Removed redundant in-body title + verbose intro paragraph; the
  wizard header IS the title now. Body has one short intro line.
- Group boxes tightened to 26u with content positioned just below the
  groupbox title (not top-anchored status + bottom-anchored checkbox
  with empty space in the middle). All three panels + footer fit
  comfortably in 126u, well under the 140u page limit.
- Checkbox labels simplified: dropped "(per-user, no admin prompt)"
  and "(administrator approval required)" suffixes. The footer note
  still calls out UAC for Git when relevant.
- Footer text trimmed to fit cleanly without clipping.

Install order (in customInstall macro)
- Python → ripgrep → Git
- Python and ripgrep are silent and run first; Git's UAC prompt comes
  last so the user's approval interaction isn't interrupted by silent
  activity afterwards.

Skip behavior unchanged
- All three detected → page auto-skips via Abort
- Silent install (/S) → customInstall winget block skips
- User unchecks all → page advances without running winget

Files
- apps/desktop/installer/prereq-check.nsh: ripgrep detection block,
  ripgrep page panel + checkbox, ripgrep customInstall block,
  GetDlgItem header override, layout reflow
- apps/desktop/README.md: Runtime prerequisites section updated to
  list ripgrep as recommended, with manual winget command

* feat(desktop): add model-confirmation step to onboarding

After OAuth/API-key login completes, onboarding now shows a confirmation
card with the curated default model and a Change button before dropping
the user into chat. Closes the gap where the desktop's `model.default`
was empty after first launch and the agent had to fall back to whatever
heuristic happened to fire — leaving users wondering "why am I getting
sonnet-4 when I logged into Nous Portal?"

Why
- Desktop onboarding only persisted credentials, never `model.default`.
  The CLI's `hermes model` command pairs provider + model selection,
  but the desktop's onboarding skipped the model step entirely.
- Result: users saw whichever model the agent's auto-fallback picked,
  unpredictably and undocumented.
- For the BUILD demo we want users to land on the model they expect
  for their provider, with a clear "this is what you're getting" UI
  and a one-click path to change it before chatting.

How
- New `confirming_model` flow status carries the just-authenticated
  provider slug, current default model, label, and a saving flag.
- `completeWithModelConfirm()` runs after credentials succeed: reloads
  env, verifies runtime, fetches /api/model/options to find the curated
  first-model for the provider, persists it via /api/model/set, then
  transitions into `confirming_model`.
- If anything fails (no providers returned, network error), falls
  through to the previous behaviour — onboarding completes without
  the confirm step. Polish, not a hard requirement.
- All four credential paths (device_code OAuth, PKCE OAuth, external
  CLI flow, API key) now use completeWithModelConfirm instead of
  reloadAndConnect.

UI
- `ConfirmingModelPanel` shows: green "<provider> connected" banner,
  card with "Default model: <name>" + Change button, and a "Start
  chatting" CTA that finalises onboarding.
- Reuses the existing `ModelPickerDialog` (the same picker available
  from the chat shell) for the change-model UX. Search, filtering,
  multi-provider listing — all already built.
- Stacking: ModelPickerDialog defaults to z-130, which renders UNDER
  the onboarding overlay (z-1300) and breaks pointer events. Added
  optional `contentClassName` prop to ModelPickerDialog so callers
  can override; onboarding passes `z-[1310]`.

Provider-slug matching
- For OAuth flows: pass `provider.id` directly as the preferred slug.
- For API-key flows: `OPENROUTER_API_KEY` → "openrouter" via env-key
  prefix strip. Also includes the user-visible label as a fallback
  candidate.
- fetchProviderDefaultModel falls back to the first authenticated
  provider in the response if no preferred slug matches — so even a
  miss still surfaces a reasonable default.

Files
- apps/desktop/src/store/onboarding.ts:
  + new `confirming_model` flow variant
  + fetchProviderDefaultModel + completeWithModelConfirm helpers
  + setOnboardingModel (optimistic update + revert on failure)
  + confirmOnboardingModel (finalises onboarding from the card)
  - reloadAndConnect (replaced; the four call sites now go through
    completeWithModelConfirm)
- apps/desktop/src/components/desktop-onboarding-overlay.tsx:
  + ConfirmingModelPanel component
  + new branch in FlowPanel for status `confirming_model`
  + ModelPickerDialog usage with z-[1310] content class
- apps/desktop/src/components/model-picker.tsx:
  + optional `contentClassName` prop on ModelPickerDialog so the
    dialog can be stacked on top of other fixed overlays

Tested
- `npm run type-check` passes
- `npx eslint` clean on touched files
- Live test in `npm run dev`: cleared onboarding cache, walked
  through Nous device-code flow, saw confirm card with curated
  default, clicked Change → ModelPickerDialog rendered above the
  onboarding overlay with working pointer events, picked a different
  model, "Start chatting" persisted to ~/.hermes/config.yaml.

* fix(desktop): suppress generic provider warning in onboarding

Hide the red setup notice when the message is the generic missing-provider guidance, since onboarding already presents provider auth actions. Centralize provider-setup matching across desktop hooks and add coverage for the matcher.

* fix(desktop): add 2u clearance below prereq checkboxes

Group box bottom border was clipping the checkboxes by 1-2px.
Bumped each box height 26u→30u; checkboxes now sit 2u above the bottom border.

* fix(nix): refresh dashboard lockfile hash

Update the web npm deps hash in nix/web.nix to match the committed apps/dashboard/package-lock.json so bb/gui passes the nix lockfile check.

* fix(desktop): install TUI deps in release workflow

Ensure desktop release builds install the standalone ui-tui package before bundling the TUI payload.

* fix(desktop): run release builder from app package

Invoke the desktop builder through the package script so electron-builder uses apps/desktop/package.json.

* fix(desktop): expand release artifact names safely

Build desktop artifact names from workflow version/channel while preserving electron-builder platform macros.

* fix(desktop): use package artifact naming in release workflow

Let electron-builder's desktop package config provide platform-specific artifact extensions while the workflow injects the release version/channel metadata.

* fix(nix): fetch dashboard npm deps from package root

Point the dashboard npm dependency fetch at apps/dashboard so Nix can find the package lockfile after the dashboard move.

* fix(nix): build dashboard from package directory

Set the web package source root to apps/dashboard so npm patch/build phases run beside the dashboard lockfile while keeping apps/shared available as a sibling.

* feat(desktop): render LaTeX math via KaTeX after streaming completes

Add @streamdown/math plugin to the chat markdown renderer.
Inline ($x^2$) and block ($$...$$) math both supported with
singleDollarTextMath enabled. Plugin is gated to non-streaming state
to match the existing pattern for syntax highlighting — math renders
when the message completes, avoiding KaTeX re-render churn during
streaming. KaTeX CSS is imported in styles.css; ~30KB CSS + ~430KB
JS added to the bundle. Smoothness improvements during streaming
deferred to a follow-up.

* perf(desktop): memoize KaTeX renders so math streams without re-rendering

Wrap rehype-katex with a per-equation LRU cache (keyed by
displayMode + source text) and re-enable math during streaming.

Stock @streamdown/math runs rehype-katex on every markdown commit,
so each new token re-katexes every equation in the message. For
math-heavy responses (an equation derived step-by-step) that's
hundreds of ms of wasted work per token and the streaming UI
chokes. With memoization, each equation pays katex.renderToString
exactly once; subsequent tokens re-walk the tree but hit cache for
unchanged equations.

The wrapper mirrors rehype-katex's semantics exactly: same class
detection (language-math, math-inline, math-display), same
<pre>-walk-up for fenced math blocks, same parent.children.splice
replacement, same SKIP traversal, same strict-then-lenient render
strategy with VFile message reporting.

Cached children are structuredCloned on each splice so downstream
rehype plugins or toJsxRuntime can't mutate the cache.

* fix(desktop): declare katex-memo deps directly + drop per-app lockfile

katex-memo.ts (added in 112cad59b) imports hast-util-from-html-isomorphic,
hast-util-to-text, remark-math, katex, and unist-util-visit-parents but
those were never added to apps/desktop/package.json. They were silently
resolving via @streamdown/math at the workspace root, which broke the
moment `npm i --prefix apps/desktop` ran with the per-workspace lockfile
because that install only consults apps/desktop/package.json. Add them
as direct deps, plus unified/vfile/@types/hast for the type imports.

Also delete apps/desktop/package-lock.json — root package.json declares
workspaces: ["apps/*"], so npm manages all lockfile state at the root.
The stale per-app lockfile is what made `npm i --prefix apps/desktop`
diverge from the workspace install in the first place and left an empty
apps/desktop/node_modules/@assistant-ui/ stub that Vite's dep optimizer
then tried (and failed) to open at @assistant-ui/core/dist/internal.js.

* feat(desktop): disable Backdrop noise overlay by default

The noise overlay defaulted to on, which adds a busy speckle layer over
the whole window for every new user. Flip the Leva default to off; the
toggle stays in Backdrop / Noise for anyone who wants it back.

* fix(desktop): polish LaTeX rendering — currency, code blocks, brackets

Five distinct bugs surfaced from a math-heavy stress test:

1. Adjacent code fences glued together. scrubBacktickNoise's
   second-pass regex /``\s*``/g matched the LAST 2 backticks of
   one fence + whitespace + FIRST 2 backticks of the next, collapsing
   two blocks into one. Fixed with lookbehind/lookahead so we only
   match exactly 2 backticks not part of a longer run.

2. Whitespace eaten between fences and following content.
   stripPreviewTargets internally calls .trim() which strips leading/
   trailing whitespace from each split-segment. For segments between
   two fences this collapsed \n\n to '', gluing fence close to next
   block. Fixed by capturing leading/trailing whitespace at the call
   site and restoring it after the transform.

3. Currency dollar signs eaten as math. With singleDollarTextMath:true
   remark-math greedy-matched any pair of $, so '$5 ... $10' became
   one inline math span. Added escapeCurrencyDollars to escape $<digit>
   patterns to \$<digit> in prose segments (not in code). Trade-off:
   math expressions starting with a digit (rare — '$5x = 10$') get
   escaped too. Mirrors the convention in ChatGPT/Claude's UIs.

4. \(...\) and \[...\] LaTeX brackets unsupported. Models often
   emit these instead of $...$ / $$...$$. Added
   rewriteLatexBracketDelimiters preprocessor pass.

5. ```latex / ```tex blocks were being routed to KaTeX via a
   rewrite to ```math. Aligns with GitHub markdown convention:
   ```math = render as math; ```latex / ```tex = LaTeX/TeX
   source code (syntax highlighted, not rendered). Conflating them
   broke teaching/showing-source use cases. MATH_FENCE_LANGUAGES
   pruned to {'math'} only.

Also flipped parseIncompleteMarkdown to true (was !isStreaming) so
the math parser can't see $ inside streaming-but-not-yet-closed code
fences. Shiki was already deferred via defer={isStreaming} so this
doesn't introduce new tokenization cost.

Test: 18/18 existing tests still pass; one test updated to expect
escaped \$ in currency-prose-with-URL case.

* fix(desktop): detect Python via registry/filesystem; pin to 3.11–3.13

Two related fixes for Python detection on Windows:

1. py.exe (Python launcher) is missing from per-user installs that
   didn't check the launcher option, so 'py -3.X --version' alone
   misses real Python installs. User-reported case: clean Win11 +
   official Python.org 3.14 install -> 'where py' returned nothing,
   our installer offered to install Python again. Both NSIS prereq
   page and main.cjs now probe in this order:
     1. py.exe launcher (when present)
     2. PEP 514 registry: HKLM/HKCU\SOFTWARE\Python\PythonCore\<v>\InstallPath
     3. Filesystem: %ProgramFiles%\Python<v>, %LocalAppData%\Programs\Python\Python<v>
   Crucially, we never fall back to running 'python.exe' from PATH
   on Windows — the WindowsApps stub at %LOCALAPPDATA%\Microsoft\
   WindowsApps\python.exe is a redirector that opens the Microsoft
   Store window if no Store Python is installed. Triggering that
   during boot would be terrible UX. Registry/filesystem probes
   never execute the binary.

2. Drop 3.14 from the supported version set. Several Hermes deps
   (notably pywinpty, which carries Rust crates like
   windows_x86_64_msvc) don't yet publish 3.14 wheels. With wheels
   missing, 'pip install -e .' falls back to building from sdist,
   which needs a Rust toolchain — users see 'could not compile
   windows_x86_64_msvc build script' on first run. install.ps1
   sidesteps this by pinning to 3.11 via uv; the desktop installer
   doesn't yet have the same uv-managed-Python pathway, so for now
   we accept 3.11/3.12/3.13 and tell winget to install 3.11 if
   none of those are present. Revisit when the wheel ecosystem
   catches up to 3.14 (~early 2026).

* feat(desktop): Cron, Profiles, usage analytics, and titlebar fixes

- Add Cron and Profiles sidebar routes with full CRUD-style flows and API wiring.
- Extend Command Center with auxiliary task overrides and a Usage panel (7d/30d/90d).
- Fix titlebar geometry for WSL/Windows (native overlay width, tool spacing).
- Remove stray merge conflict markers from pyproject.toml optional deps.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(title-bar): position sidebar toggle button

* feat(desktop): composer queue — queue many, edit/delete/cancel-edit, Cursor-style

Press Enter while busy with a draft to queue it; with no draft to interrupt
and send the next queued turn. Auto-drains one queued turn each time the
session settles, same as Cursor. Queue persists across reloads so an
interrupted-and-queued turn isn't lost on refresh.

Each queued row supports edit-in-composer (with explicit Save/Cancel),
send-now (↑), and delete. Drain skips only the entry currently being
edited so the rest of the queue keeps flowing.

Queue dequeue is transactional — an entry only leaves the queue after
`prompt.submit` is accepted, so a rejected submit doesn't drop the turn.

Also shrinks the `[interrupted]` marker to a muted one-liner and drops
its assistant footer so it stops looking like a real reply.

* fix(desktop): handle empty usage analytics totals

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(desktop): address PR review titlebar and usage races

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(desktop): add MCP settings and live subagent tree

Surface configured MCP servers in Settings with JSON edit/save and a gateway-backed reload action so users can manage tool servers without falling back to slash commands.

Track live subagent gateway events in a desktop store, show active subagent counts in the Agents statusbar item, and replace the Agents overlay stub with a live spawn tree for the active session.

* fix(desktop): move power-user views out of sidebar

Keep Cron and Profiles available through lower-prominence chrome entry points so the workspace sidebar stays focused on core chat navigation.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(desktop): subagent overlay reads like a live transcript, not a dashboard

Strip the card chrome and rewire /agents to feel like peeking into the
child agent's stream:

- subagents store: single `stream` of typed entries (thinking/tool/progress/
  summary) replaces the parallel notes/thinking/tools arrays. Drop unused
  fields (toolsets, depth, apiCalls, reasoningTokens, sessionId).
- agents view: no OverlayCards, no boxed stream, no per-row borders. Goal +
  status pill + indented stream lines, full row width.
- Group root spawns into "Delegation N" sections when batch shape + spawn
  time match — hides task-index interleaving and makes hierarchy obvious.
- Sort tree by spawn time, then task_index. Step indicator is one colored
  pill (primary while running, emerald when done) inside the row, not a
  trailing pill that wrapped under the chevron.
- Tree picks up `subagent.start` (not only `spawn_requested`) and prunes
  delegate-tool fallback rows once native subagent events land for the
  session — fixes duplicate "Delegated task" rows alongside the real ones.

* feat(desktop): Esc closes every OverlayView-based overlay

Lift the keyboard handler into the shared OverlayView so Agents, Settings,
Command Center — and anything we build on top of it later — all dismiss on
Esc by default. Nested Radix dialogs stop propagation themselves, so a
modal opened inside an overlay (e.g. model picker inside Settings) still
closes the modal first, not the overlay underneath.

Drop the now-redundant Esc handlers in Settings (kept Cmd/Ctrl+P) and
Command Center.

* fix(desktop): drop numbered step pill on subagent rows

The pill was getting clipped at the overlay edge anyway. Just use the
status glyph (●/✓/✗/■/○) — the delegation header already conveys
"3 workers, 3 active", and order in the list implies which step you're
looking at.

* fix(desktop): drop noisy "returned N items / empty object" stub strings

When a tool returns nothing useful, the row should be silent — the title
("Search Files", etc.) already tells the user what happened. Counting the
fields in an opaque payload is engineer-noise.

`formatToolResultSummary` and `minimalValueSummary` now return '' for
empty arrays / records / unrecognized values; tool-fallback already hides
the detail section when its body is empty.

* refactor(desktop): subagent rows borrow chat tool patterns (fade-in, lucide glyphs, shimmer)

Pull the agents view closer to how chat tool blocks render:
- statusGlyph() returns the same lucide BrailleSpinner / CheckCircle2 /
  AlertCircle vocabulary as tool-fallback's statusGlyph
- Stream lines fade-in via useEnterAnimation (one-shot WAAPI), keyed per
  entry so streamed deltas settle in instead of popping
- Subagent rows fade in too, and pick up the existing data-slot=tool-block
  spacing rules between blocks
- Active stream line trails a BrailleSpinner instead of a hand-rolled
  pulsing rectangle
- Goal text drops FadeText (which forces nowrap); keep FadeText only for
  the single-line meta subtitle
- Running rows shimmer the title — same affordance the chat thinking row
  uses

* refactor(desktop): make /agents subagent-only, drop sidebar + dead sections

Activity rail and History stub were both noise. Strip the split layout,
sidebar, route enum, and the rail/stub helpers — the overlay is now just
the spawn tree, centered in a max-w-3xl column so it stops claiming the
whole screen for one section's worth of content.

* feat: update cron modals

* Add dedicated GUI log stream for dashboard debugging.

Capture dashboard and PTY websocket lifecycle failures in gui.log and expose it via hermes logs.

* Improve desktop runtime UX by surfacing inference readiness in gateway status and hardening WSL link opening.

This also stabilizes markdown code/table block spacing and adds root-install guards so desktop dev runs use a healthy workspace dependency tree.

* Log detailed GUI websocket failure metadata.

Capture richer reject/disconnect/send/parse context for dashboard gateway websocket flows so GUI connection failures are diagnosable from logs.

* Default dashboard startup logging to GUI mode.

Detect the dashboard subcommand during early CLI bootstrap so gui.log is attached from process start and GUI startup failures are always captured.

* Clean up gateway status conditionals and logging bootstrap mode detection.

Simplify nested dashboard gateway status branches for readability and use a concise first-subcommand check when selecting early GUI logging mode.

* add logging to nsis installer

* feat: glass ui pass

* fix(desktop): persist inline assistant errors across hydrate/resume

- Detect provider failure text arriving via message.complete
  (HTTP 4xx, "API call failed after N retries", Provider/Gateway
  error: ...) and persist as an inline assistant error instead of
  regular completion text, blocking the hydrate that was wiping it.
- preserveLocalAssistantErrors: merge by id so same-id hydrated
  messages keep their local error, and preserve the optimistic
  user+error pair as a unit (with tail-user dedupe).
- Hook all hydrate/resume writers (use-session-actions resume +
  fallback, hydrateFromStoredSession, syncSessionStateToView) into
  the merge so stale snapshots can't clobber a failed turn.
- Add error to chatMessagesEquivalent so the resume diff actually
  sees error-only changes and paints them.
- editMessage on a failed turn now submits a plain resend (no
  truncate_before_user_ordinal) and retries plainly on the
  "no longer in session history" race.

Style polish on touched files:
- Inline error: text-only treatment (no card).
- User stop / edit-composer send: shared Tabler IconPlayerStopFilled
  glyph + shared icon-button class slot for parity.

* feat(desktop): theme xterm with active light/dark mode

The right-sidebar terminal hardcoded a light palette, which read poorly
on the dark glass surface. Subscribe to `useTheme().resolvedMode` and
hot-swap `term.options.theme` so Shift+X (and any other mode change)
updates the terminal in place without tearing down the PTY session.

Dark mode uses xterm's built-in defaults (white fg/cursor + vivid ANSI
16) with just a transparent background so the glass shows through;
light mode keeps the existing hand-tuned overrides for legibility on a
bright surface.

* feat(sidebar): right-click + drag-reorder sessions and workspaces

- Wire right-click on session rows to open the same actions menu;
  suppresses the OS-native context menu so Windows stops looking awful.
- Share dropdown + context menu items via useSessionActions() driving
  a single declarative ItemSpec[]; render polymorphic over MenuItem.
- New shadcn ContextMenu primitive mirroring DropdownMenu styling.
- Restore drag-and-drop reordering for Agents (lost during the cwd
  cleanup) and add reordering of workspace groups via a right-side
  grab handle. Pinned reorder unchanged.
- Generic orderByIds<T> replaces the duplicated session/group orderers;
  useSortableBindings() hook collapses the two Sortable wrappers.
- cursor-pointer on every actionable element; cursor-grab on handles.
- KISS pass: baseName() helper, AGE_TICKS table, single WORKSPACE_PAGE
  constant, flatter SidebarSessionsSection render.

* feat(desktop): solarize the xterm palette in both light & dark

xterm's default ANSI 16 is tuned for dark and reads candy-bright on the
light glass surface (vivid cyans/greens). Ship the canonical Solarized
palette (Schoonover) for both modes — same 16 accents either way, only
fg/cursor swap between `base00/01` (light) and `base0/1` (dark), so a
prompt's colors look uniform across a Shift+X toggle.

Background stays transparent in both modes — Solarized's cream/slate
backgrounds would fight the glass.

* feat(desktop): virtualize chat thread + sidebar via TanStack Virtual

Replaces `use-stick-to-bottom` and per-row session rendering with
`@tanstack/react-virtual`, matching what Cursor uses.

Chat thread (`thread-virtualizer.tsx`):
- Natural-flow virtualization (padding spacers, not absolute items) so
  `position: sticky` on the human bubble still resolves cleanly against
  the scroller.
- Custom at-bottom anchor: pins when armed, disarms on user-driven
  upward scroll, re-arms at bottom, jumps on session switch +
  `thread.runStart`.
- Loading indicator and `--thread-last-message-clearance` move to a
  real `[data-slot=aui_composer-clearance]` node; drops the brittle
  `:nth-last-child(1 of …)` rule that can't fire reliably under
  virtualization.

Sidebar (`virtual-session-list.tsx`):
- Flat agents list virtualizes at >=25 rows; pinned and
  workspace-grouped paths stay direct-render.
- `SortableContext` keeps all IDs; only the window mounts; dnd-kit's
  `setNodeRef` is merged with `virtualizer.measureElement` so rows
  participate in both DnD hit-testing and TanStack measurement.

Drops `use-stick-to-bottom`. Streaming test gets a global
`offsetWidth/offsetHeight` stub so the virtualizer's viewport sizing
works in jsdom; the scroll-up-doesn't-pull-back invariant still passes.

* feat: more ui qa

* fix(desktop): trim sidebar terminal startup spacer

Drop zsh's initial spacer row before writing the first terminal prompt so new sidebar terminal sessions do not open with a selectable blank line.

* chore: uptick

* feat(desktop): thin installer + first-launch install.ps1 bootstrap

Converges the Windows packaged desktop installer onto a single canonical
install topology: drop the Electron shell only (~80MB instead of ~500MB),
clone Hermes Agent at a build-time-pinned commit on first launch via
install.ps1's stage protocol, and treat the resulting git checkout at
%LOCALAPPDATA%\hermes\hermes-agent\ as the canonical install location
(same path the CLI installer uses).  Future updates flow through the
existing applyUpdates() git-pull path.

Replaces the previous fat-installer architecture where the .exe bundled
a pre-staged hermes-agent source tree under resources/hermes-agent/ that
was then sync'd into ACTIVE_HERMES_ROOT at launch -- a complicated
factory-vs-active dance with several footguns (FACTORY_HERMES_ROOT
mismatch on path resolve, isGitCheckout guard regressions, pyproject
hash drift detection inside the sync loop).

Architecture overview
---------------------

  Build time
    apps/desktop/scripts/write-build-stamp.cjs writes
    apps/desktop/build/install-stamp.json with {commit, branch, builtAt,
    dirty}.  Honours $GITHUB_SHA / $GITHUB_REF_NAME in CI, falls back to
    `git rev-parse HEAD` locally.

    apps/desktop/scripts/stage-native-deps.cjs copies the runtime subset
    of @homebridge/node-pty-prebuilt-multiarch from the workspace-root
    node_modules into apps/desktop/build/native-deps/.  Workspace dedup
    hoists this dep to the root, out of reach of electron-builder's
    `files:`-restricted collector; staging gives us a deterministic
    path to extraResources.

    electron-builder ships both into resources/install-stamp.json and
    resources/native-deps/ respectively.

  Boot resolver (electron/main.cjs)
    Resolver order:
      1. HERMES_DESKTOP_HERMES_ROOT override
      2. SOURCE_REPO_ROOT (dev mode)
      3. ACTIVE_HERMES_ROOT git checkout WITH .hermes-bootstrap-complete
         marker -- the post-install fast path
      4. `hermes` on PATH (CLI-installed user adding the desktop)
      5. pip-installed hermes_cli via system Python
      6. bootstrap-needed sentinel -> hand off to runBootstrap

    Deletes the entire FACTORY_HERMES_ROOT / RUNTIME_MARKER /
    syncTreeExcludingVenv machinery (-200 lines).  The isGitCheckout
    guard that bit us in the install.ps1 PR is gone.

  First-launch bootstrap (electron/bootstrap-runner.cjs)
    1. Resolve install.ps1: prefer SOURCE_REPO_ROOT/scripts (dev), else
       download from GitHub raw at INSTALL_STAMP.commit (cached at
       HERMES_HOME\bootstrap-cache\install-<sha>.ps1).
    2. Fetch the stage manifest via install.ps1 -Manifest -Commit X
       -Branch Y.
    3. Iterate stages: install.ps1 -Stage <name> -NonInteractive -Json
       -Commit X -Branch Y per stage.
    4. On all stages green: write the .hermes-bootstrap-complete
       marker with {schemaVersion, pinnedCommit, pinnedBranch,
       completedAt, desktopVersion}.

    Per-run log to HERMES_HOME\logs\bootstrap-<ts>.log.  Cancellation
    via AbortSignal.  Manifest cache so retries don't re-download.

  Install overlay (src/components/desktop-install-overlay.tsx)
    Mounted alongside the existing onboarding overlay; flexbox card
    with header (static) + middle (scrollable) + footer (failure-only,
    static).  Subscribes to hermes:bootstrap:event IPC + resyncs from
    hermes:bootstrap:get on mount/reload.  Renders:
      - 14-stage checklist with per-stage state icons
      - Overall progress bar + current-stage spotlight
      - Auto-expanded installer-output panel on failure
      - "Copy output" button (full ring buffer + error to clipboard)
      - "Reload and retry" wired through hermes:bootstrap:reset to
        clear main.cjs's latched failure
    Synthetic empty-manifest event from main.cjs flips the overlay to
    'active' immediately so the slow install.ps1 download doesn't
    leave the user staring at the generic Preparing splash.

  Failure latching (main.cjs)
    bootstrapFailure module-scope variable holds the rejection after
    install.ps1 fails.  startHermes() throws the latched error
    immediately when set, bypassing the entire ensureRuntime +
    runBootstrap chain.  Without this, the renderer's ensureGatewayOpen
    retries would re-run install.ps1 in a 5-10 min hot loop while the
    user was still reading the failure overlay.  Cleared via
    hermes:bootstrap:reset on user-driven retry.

  Unsupported-platform overlay (1F)
    macOS / Linux packaged builds (no install.sh stage protocol yet)
    emit an unsupported-platform event with a copy-pasteable install
    command + docs URL.  Dedicated overlay branch with "Copy command"
    + "I've run it -- retry" buttons.

install.ps1 additions (Phase 1F.3 + 1F.5)
-----------------------------------------

  New -Commit and -Tag string params.  Precedence Commit > Tag >
  Branch.  Honoured by all three code paths (update / fresh clone /
  ZIP fallback), with archive URL selection that handles each
  ref-type variant.  Detached-HEAD checkouts intentionally -- they're
  pins, not branches the user pulls into.

  EAP=Continue wrap around the new pin-step git invocations.  `git
  fetch origin <commit>` writes the routine 'From <url>' info line to
  stderr; under the script's global EAP=Stop that terminates the
  script even though fetch+checkout succeed.  Matches the established
  pattern in Install-Uv, Test-Python, _Run-NpmInstall.

Backend fix (hermes_cli/web_server.py)
--------------------------------------

  CORS allow_origin_regex now accepts Origin: 'null'.  Packaged
  Electron loads index.html via file://; Chromium sets the WebSocket
  upgrade Origin header to the opaque origin 'null', which the old
  regex rejected with HTTP 403 before gateway_ws() ever ran.  This
  failure mode was masked in the older FACTORY_HERMES_ROOT
  architecture because the resolver often found an existing hermes
  on PATH with different binding behavior.

  Security maintained: localhost-only bind keeps cross-machine pages
  out; per-process session token still gates every authenticated
  /api/ endpoint regardless of Origin.

Desktop QoL
-----------

  DevTools is now enabled in packaged builds (F12 / Cmd+Opt+I).
  Field-debugging trade-off: tiny attack surface increase versus
  a much better support story when CSP / WS / theme issues surface.

  NSIS prereq-check page deleted (-767 lines).  The standard
  Welcome -> License -> Directory -> InstallFiles -> Finish wizard
  now installs without custom Python/Git/ripgrep detection -- those
  prereqs are install.ps1's job at first launch.

Test infrastructure (Phase 1G)
------------------------------

  apps/desktop/scripts/test-desktop.mjs rewritten as a cross-platform
  bundle validator (was darwin-only and asserted on dead factory-
  payload paths):
    NEGATIVE: hermes_cli/main.py is NOT shipped (regression guard)
    POSITIVE: install-stamp.json carries a real commit + branch
    POSITIVE: node-pty native deps shipped under resources/native-deps
    POSITIVE: renderer dist/index.html reachable (asar or unpacked)
  New nsis mode and npm run test:desktop:nsis script.

Validated end-to-end on clean Win10 VM
--------------------------------------

  Confirmed: NSIS installer drops Electron shell, app launches,
  install overlay shows progress, install.ps1 clones the pinned
  commit, 14 stages run to completion, marker written, backend
  spawns, WebSocket connects, onboarding overlay asks for API key,
  main UI loads, integrated terminal works.

  Failures handled: bootstrap stays failed (no hot-loop retry),
  "Copy output" gives actionable transcript, "Reload and retry"
  explicitly re-runs install.ps1.

What's deferred
---------------

  - MSIX wrapping (Phase 2): same Electron .exe under MSIX manifest
    with runFullTrust, signed and submitted to Microsoft Store.
  - install.sh stage protocol parity (Phase 2): once shipped, the
    unsupported-platform overlay becomes drive-it-yourself and
    macOS/Linux packaged installers gain feature parity with Windows.

* feat(desktop): persistent terminal pane + fullscreen takeover

Adds a VSCode-style "focus terminal" toggle to the right sidebar's Terminal
tab that takes over the chat pane area without unmounting the shell. The
xterm host is mounted once at the layout root and CSS-overlayed onto
whichever <TerminalSlot /> is currently active, so the PTY session,
scrollback, selection, focus, and WebGL renderer survive every toggle.

Also:
- WebGL renderer (matching dashboard ChatPage) so Hermes' TUI skins paint
  faithfully instead of muting through xterm's default DOM renderer
- File drag/drop from the project tree or OS into xterm — paths are
  shell-quoted (zsh/bash/pwsh/cmd) and written straight into the PTY
- Solarized dark canvas with brights promoted to real accent variants
  (Schoonover's UI-gray brights washed out every TUI accent)
- Strip NO_COLOR/FORCE_COLOR/COLORFGBG/TERM=dumb leaking from non-tty
  parents (CI runners, Cursor's agent shell) so the embedded shell gets
  truecolor regardless of how Electron was launched
- rAF-debounced ResizeObserver — running fit.fit() synchronously during
  sibling pane transitions crashed the WebGL texture-atlas rebuild

* fix(install.ps1): strip UTF-8 BOM regression that broke 'irm | iex'

The canonical install flow

    irm https://raw.githubusercontent.com/.../scripts/install.ps1 | iex

fails on PowerShell 5.1 with a cascade of 'The assignment expression
is not valid' errors at every param() default value:

    [string]$Branch = 'main',
                      ~~~~~~
    The assignment expression is not valid. The input to an assignment
    operator must be an object that is able to accept assignments...

Root cause: scripts/install.ps1 carries a UTF-8 BOM (0xEF 0xBB 0xBF)
as its first three bytes. 'irm' returns the response body as a string;
on PS 5.1 the BOM survives into that string as a leading \ufeff
character. 'iex' then evaluates the string and PS's parser chokes
on the invisible character before param() -- error recovery proceeds
into the body but every assignment is reported as broken.

This was the exact failure mode the install.ps1 hardening pass (PR
#27224) deliberately fixed by stripping the BOM and ensuring the
file body is pure ASCII. Commit 4279da4db ('fix(windows): make
PowerShell installer parse in 5.1') re-introduced the BOM later,
unintentionally undoing the irm|iex compatibility fix; the merge
that brought it into bb/gui carried it forward.

Fix: strip the three BOM bytes. File body is verified pure ASCII
(any-byte > 127 returns false), so PS 5.1 with no BOM falls back to
Windows-1252 decoding which is identical to ASCII for our content.
Both install paths now work:
  - 'irm ... | iex' (canonical CLI)
  - 'powershell -File install.ps1' (programmatic / desktop bootstrap)

* install.ps1: detect ARM64 Windows reliably for Node and Git stages

Add a Get-WindowsArch helper that reads Win32_Processor.Architecture
via CIM (invariant to PowerShell host bitness) with PROCESSOR_ARCHITEW6432
fallback. Use it in:

- Install-Git: previously only triggered the arm64 PortableGit asset
  when invoked from a native-ARM64 PowerShell host. WoW64 / emulated
  x64 hosts (the default powershell.exe on Windows-on-ARM) saw
  PROCESSOR_ARCHITECTURE=AMD64 and fell through to the x64 PortableGit
  build, leaving ARM64 users on emulated Git for Windows.

- Test-Node: previously hardcoded the Node download to win-x64 on any
  64-bit OS, so ARM64 users always got x64 Node under Prism emulation
  even though Node ships an arm64 build for Windows. The winget
  fallback now also passes --architecture arm64 on ARM64.

Python remains x86_64 by design: uv intentionally prefers
windows-x86_64 cpython on ARM64 hosts for ecosystem (wheel)
compatibility (see astral-sh/uv#19015).

* install.ps1: harden Install-SystemPackages against winget msstore failures

The previous winget invocation discarded stdout/stderr and trusted no
signal at all -- not the exit code (winget exits 0 even when it bails
"please specify --source"), not output (sent to Out-Null), not the
catch handler (winget returning 0 means no exception fires). The only
trust signal was a post-install Get-Command rg / Get-Command ffmpeg
check, which would also miss the package because %LOCALAPPDATA%\
Microsoft\WinGet\Links (where winget puts command aliases) is added to
PATH by AppExecutionAlias machinery only in fresh shells. End result on
machines where the msstore source has a cert problem (0x8a15005e --
common on Windows-on-ARM and some corporate networks): silent failure,
no log, no breadcrumb, and the user is told the install succeeded.

Specifically:

- Pin --source winget on every winget install call. Defeats the broken-
  msstore-source path. We ship nothing from msstore so this is safe and
  forward-compatible.

- Add --exact --id for a tighter package match.

- Capture each winget invocation's combined stdout/stderr + exit code to
  %TEMP%\hermes-winget-<pkg>-<n>.log instead of Out-Null. On the happy
  path the log is deleted after the post-install check confirms the
  binary is on PATH; on failure the log is kept and its path is named in
  a Write-Warn so the user has something to grep.

- Refresh PATH to include %LOCALAPPDATA%\Microsoft\WinGet\Links in
  addition to the User/Machine env-var hives, so Get-Command sees newly-
  installed winget aliases in the same process.

- No behavior change on the happy path. Same Write-Info/Success/Warn
  cadence, same fallback order (winget -> choco -> scoop -> manual),
  same $script:HasRipgrep / $script:HasFfmpeg outputs.

Verified end-to-end on a real Snapdragon ARM64 Windows host: ripgrep
uninstalled, stage re-run, [OK] ripgrep installed in 1.4s, ok:true.

* desktop: swap node-pty fork for upstream microsoft/node-pty 1.1.0

The previous dependency, @homebridge/node-pty-prebuilt-multiarch@0.13.1,
publishes no win32-arm64 prebuilds on its v0.13.x line, and its v0.14.x
betas (which do add an arm64 Windows build) ship no electron-vXXX-win32-
arm64 prebuilds at all -- so packaged Electron 40 builds (NMV 143) would
fail at runtime even on a successful npm install. Net effect: the
desktop's integrated terminal was unbuildable on Windows-on-ARM, in
both dev (npm install fails: 404 fetching the node-vXXX-win32-arm64
prebuilt) and packaged builds (no Electron-ABI prebuilt exists).

The homebridge fork was originally created because upstream node-pty
shipped no prebuilds at all. That hasn't been true since node-pty@1.0
(April 2024), which:

- bundles prebuilts for mac (arm64+x64) and Windows (arm64+x64) directly
  inside the npm tarball -- no GitHub-Releases fetch, no missing-binary
  failure mode
- uses N-API (node-addon-api) for ABI stability across Node and Electron
  major versions, so the same pty.node binary loads under Node 22 (dev)
  and Electron 40+ (packaged) without per-ABI rebuilds
- is what VS Code, Hyper, and Theia actually ship

API surface is identical (spawn / onData / onExit / write / resize /
kill) -- no call-site changes needed.

Specifically:

- apps/desktop/package.json: replace the @homebridge fork with
  node-pty@1.1.0 (exact pin). Widen `asarUnpack` from `["**/*.node"]`
  to also unpack `**/prebuilds/**`, because node-pty ships runtime-
  execed helpers alongside its .node files (darwin spawn-helper has no
  extension and would not be matched by `**/*.node`; conpty.dll,
  OpenConsole.exe, winpty.dll, winpty-agent.exe on Windows are also
  exec'd at runtime and cannot live inside asar).

- apps/desktop/electron/main.cjs: update both require() strings to
  match the new package name and the new staged path under
  resources/native-deps/node-pty/.

- apps/desktop/scripts/stage-native-deps.cjs: point at node_modules/
  node-pty. node-pty's prebuilts live under prebuilds/<plat>-<arch>/
  (not build/Release/), so update the include glob to copy that dir.
  Per-arch staging keeps the resource bundle small (target arch comes
  from npm_config_arch when electron-builder cross-builds, else
  process.arch). Explicitly enumerate file types in the prebuilds glob
  so the ~25 MB of .pdb debug symbols that prebuild-install bundles
  for Windows crash analysis don't bloat the installer (29 MB -> 2.6 MB
  staged on win32-arm64). Re-assert +x on the darwin spawn-helper
  defensively, since a stripped mode bit would manifest as a silent
  ENOENT at first pty.spawn().

- apps/desktop/scripts/test-desktop.mjs: update expectedNativeDepPaths()
  and its assertion site to look at prebuilds/<plat>-<arch>/ instead of
  build/Release/. Add an explicit spawn-helper-exists check on darwin
  so a regression in the asarUnpack glob would fail loudly in CI rather
  than at first PTY spawn.

Trade-off: Linux end-users lose prebuilts and fall back to building
node-pty from source on `npm install`. Acceptable because Hermes
ships no Linux desktop builds (desktop-release.yml matrix is mac + win
only, package.json declares no `linux` target), and Linux developers
hacking on the desktop already need a C++ toolchain for the rest of
the stack.

Verified on Windows 11 ARM64 (Snapdragon):
  npm install                                          -> exit 0
  node -e "require('node-pty').spawn(...)" round-trip  -> OK
  stage-native-deps                                    -> 27 files, 2.6 MB
  load from staged tree (simulates packaged fallback)  -> ConPTY
                                                           round-trip OK

* desktop+gateway: harden Slack socket recovery and Windows restart dedupe (#28873)

* desktop+gateway: harden Slack socket recovery and Windows restart dedupe

Fix Slack Socket Mode reliability by adding a watchdog/reconnect path so silent socket task drops no longer leave the adapter stuck. Harden Windows gateway lifecycle by avoiding desktop-binary path collisions, making gateway PID scans case/extension tolerant, and reusing in-flight restart actions to prevent duplicate gateway spawns.

* test(slack): add Socket Mode watchdog/reconnect behavioural coverage

Drive the new Slack Socket Mode self-healing logic through a fake AsyncSocketModeHandler so we can simulate the P0 silent-hang failure mode (task exit, transport disconnected, intentional shutdown, concurrent reconnect attempts) without touching real Slack.

* fix(slack,desktop): address Copilot review on watchdog races and path normalization

- connect(): explicitly cancel + await the prior socket watchdog before flipping _running, so an old monitor cannot exit between teardown and respawn (Copilot #1)
- _socket_watchdog_loop: wrap the body in try/except + add a done-callback that respawns on unexpected crash, so a transient bug cannot permanently disable self-healing (Copilot #2)
- normalizeExecutablePathForCompare: use the resolved path for realpathSync so non-string inputs cannot leak through (Copilot #3)
- Add tests for crash-recovery and atomic watchdog replacement across reconnects

* fix(slack): tighten connect() error path and clarify watchdog test intent

Address Copilot review round 2.

- connect(): wrap _start_socket_mode_handler/_ensure_socket_watchdog in a focused try/except so any failure rolls back partially-started handler/task state and leaves _running=False, ensuring the platform lock is always released by the outer finally
- Defer _running=True until after the handler is actually started so the watchdog observes a live socket task immediately and never spins against a half-built adapter
- Rename test_watchdog_self_restarts_after_unexpected_crash to test_watchdog_cancellation_does_not_respawn (matches what it actually asserts) and add test_watchdog_unexpected_exit_respawns_via_done_callback that drives a real RuntimeError through _on_socket_watchdog_done and verifies a fresh task replaces the crashed one

* fix(web_server): serialize action spawn check+store under a threading lock

Address Copilot review round 3.

FastAPI runs sync handlers on its threadpool, so two near-simultaneous /api/gateway/restart (or /api/hermes/update) requests could both observe "no live process" in _spawn_hermes_action's poll-based dedupe and double-spawn. Add a module-level _ACTION_SPAWN_LOCK around the entire check + Popen + _ACTION_PROCS store sequence so the dedupe is atomic across threads.

* fix: address Copilot review round 4

- slack.disconnect(): mirror connect()'s defensive cleanup — catch the broad Exception path on watchdog await so handler shutdown and lock release still run if the watchdog raised before cancellation took effect
- web_server._spawn_hermes_action: wrap subprocess.Popen in try/except so a missing executable / permission error closes the log file handle, writes a failure marker, and re-raises instead of leaking a file descriptor
- gateway._scan_gateway_pids: drop the over-broad "hermes.exe --profile" / "hermes.exe -p" patterns that would match any Hermes CLI subcommand using a profile flag (e.g. `hermes.exe --profile foo dashboard`); rely on the "hermes.exe gateway" + "hermes-gateway.exe" tokens instead
- tests: tighten _fake_create_task to assert coroutine input and return a real asyncio.Task that stays pending until pytest teardown, and update the three callsites whose mocked AsyncSocketModeHandler.start_async returned a non-coroutine value

* fix(slack): reset multi-workspace state on reconnect

Address Copilot review round 5.

connect() is reentrant (gateway restart, in-process reconnect), but it was leaving _bot_user_id / _team_clients / _team_bot_user_ids populated from the previous session. A reconnect that rotated the primary token or dropped a workspace would silently keep the stale bot user id and stale workspace client maps, leading to dispatch against gone workspaces.

Clear these three pieces of state right after _stop_socket_mode_handler() and before the auth_test loop, then let the loop repopulate from the current tokens. Add test_reconnect_refreshes_multi_workspace_state to lock it in.

* nix: package apps/desktop as .#desktop (#28964)

Adds nix/desktop.nix building the Electron renderer with buildNpmPackage
and wrapping nixpkgs' electron binary.  Reuses .#default by setting
HERMES_DESKTOP_HERMES to its hermes binary, so the desktop's resolver
picks up the fully-wired nix hermes (venv, bundled skills/plugins,
runtime PATH) without reimplementing agent resolution.

- nix/desktop.nix: renderer + electron wrapper
- nix/hermes-agent.nix: finalAttrs form, exposes hermesDesktop in passthru
- nix/packages.nix: exposes .#desktop + adds to fix-lockfiles
- apps/desktop/package-lock.json: standalone hermetic lockfile

nix build .#desktop && nix run .#desktop both clean.

* fix(desktop): probe steps 4 & 5 of resolveHermesBackend before trusting

A user-reported failure on Windows-on-ARM: a pre-installed Python 3.13
on PATH makes findSystemPython() succeed, so resolveHermesBackend
returns a backend pointing at it -- but hermes_cli isn't in that
interpreter's site-packages. The spawn dies with ModuleNotFoundError
and the user sees a dead GUI instead of the first-launch installer.

Same shape can hit step 4 (existing `hermes` on PATH) when a stale
shim survives a partial uninstall.

Add cheap exit-code probes -- `python -c "import hermes_cli"` for
step 5, `<hermes> --version` for step 4 -- and fall through to step 6
(bootstrap-needed) on failure. install.ps1 then runs as if on a clean
box and the venv gets built.

Probes live in a standalone electron/backend-probes.cjs module so they
can be unit-tested with node --test, same pattern as bootstrap-platform.cjs
and hardening.cjs. New test file wired into test:desktop:platforms.

* test(desktop): allow `node-pty` bare-require in packaged entrypoints

Pre-existing failure on bb/gui since c858484b4 swapped the node-pty
fork for upstream microsoft/node-pty 1.1.0. main.cjs intentionally
bare-requires node-pty (it's hoisted by workspace dedup in dev, and
staged to resources/native-deps via scripts/stage-native-deps.cjs +
extraResources for packaged builds, with a try/catch fallback at
line ~38). The allowlist hadn't been updated to match -- same shape
as `electron`, which was already allowed.

* chore(deps): refresh root lockfile for dashboard @nous-research/ui 0.14.0

apps/dashboard/package.json was bumped to @nous-research/ui 0.14.0 (+
flag-icons ^7.5.0, motion ^12.38.0) but the root package-lock.json was
never refreshed. Running `npm install` from the repo root now
materialises 0.14.0's transitive closure (launder, bumps for
@nanostores/react, nanostores, sanitize-html, tailwind-merge).

No code changes; purely a lockfile catch-up so fresh checkouts on bb/gui
get a working dashboard install.

* chore(desktop): bump version to 0.0.1

First non-placeholder version so electron-builder's artifactName template
produces `Hermes-0.0.1-win-x64.exe` instead of the obviously-unreleased
`Hermes-0.0.0-...`. No release process yet; this just stops the artifact
filename from telling users "you got a debug build."

Bumped in three slots that all carry the desktop app's version:
- apps/desktop/package.json (source of truth)
- apps/desktop/package-lock.json (per-app lockfile, kept for CI parity)
- root package-lock.json's apps/desktop workspace entry

Identity-of-build for first-launch bootstrap continues to come from
build/install-stamp.json (commit SHA + builtAt), unchanged.

* fix: fs icon color

* perf(desktop): cut per-keystroke layout + listener churn in chat composer

Empirical work via CDP harnesses under apps/desktop/scripts/ (see
profile-typing-lag.md):

  jsListeners growth (per round of 200 chars + GC):
    before: +35  (verified leak — listeners stuck after 1st trigger popover use)
    after:  +0

Four narrow edits in src/app/chat/composer/index.tsx:

1. Drop the per-keystroke `editorRef.current.scrollHeight` read used to
   decide composer expansion. Replace with `draft.length > 60` heuristic;
   the existing ResizeObserver still catches edge cases. `scrollHeight`
   is a forced-layout call and was firing on every char until the first
   wrap.

2. Bucket measured composer height to 8px before writing
   `--composer-measured-height` / `--composer-surface-measured-height`
   on `documentElement`. Without this, the editor grows ~1px per char,
   setProperty fires every keystroke, computed style is invalidated tree-
   wide.

3. Remove the dead `$composerDraft` two-way sync. Nothing outside the
   composer subscribed to that atom (verified via grep). Two useEffects
   on `[draft]` were pushing draft→atom and atom→aui per keystroke for
   no consumer. Also drop the per-keystroke
   `reconcileComposerTerminalSelections` call; it was pruning stale
   labels for `terminalContextBlocksFromDraft`, but that helper already
   ignores labels not in the current submitted text, so pruning per
   keystroke was just bookkeeping.

4. `refreshTrigger` fast-bails when the draft contains neither `@` nor
   `/`. Previously `textBeforeCaret(editor)` ran on every input/keyup
   regardless; `range.toString()` inside is O(n) over draft length.

Synthetic typing latency p50/p90/p99 is similar before vs after on a
freshly-loaded session (Blink can already handle ~30cps typing into a
contentEditable on its own); the real win is the listener leak being
gone and the global computed-style invalidations dropping ~8× when the
composer is sitting at a fixed height row.

The `Enter → stall` follow-up (see profile-typing-lag.md §"Submit /
TTFT stall") is unmeasured here — needs a throwaway session because
the harness fires a real prompt. Not blocking this commit.

* perf(desktop): cut FadeText forced layouts during streaming

The slowest user-felt path is typing into the composer while the
assistant is streaming. Profile (scripts/profile-under-stream.mjs):

  FadeText measureOverflow self time:  35.8 ms → 18.1 ms  (-50%)
  total active CPU during 7s window:   ~150 ms → ~50 ms

Two changes in src/components/ui/fade-text.tsx:

1. Drop the `useEffect([children])` that re-ran `measureOverflow`
   (reads scrollWidth + clientWidth — forced layout) on every parent
   re-render. `useResizeObserver` already fires the same callback on
   mount and whenever the host span's box size changes; that covers
   the only case where overflow state can legitimately change. The
   previous explicit useEffect was a forced-layout flush on every
   parent render, which during streaming meant every token tick.

2. Wrap the component in `memo` with a custom comparator that
   short-circuits the entire render when scalar string `children` and
   the className/fadeWidth/style props are unchanged. The hot path
   was tool-fallback's title chips being re-rendered by parent
   streaming updates even though their text was stable; memo+
   comparator skips that.

Also adds two harness scripts under apps/desktop/scripts/:
  - latency-under-stream.mjs (key→paint latency while a turn streams)
  - profile-under-stream.mjs (CPU profile while a turn streams)

Updates profile-typing-lag.md with the streaming numbers and confirms
the Enter→paint submit path is already fast (≤320ms on the populated
session; the 2s "stall after Enter" the user noticed once was a
one-time cold-start, not reproducible at the UI layer).

I'd guess the felt jank in real use is fast-burst typing during a
long-form streaming reply (code blocks + markdown lists multiply the
per-token render cost). The CPU savings here scale linearly with
token volume.

* chore(desktop): drop diag scratch scripts no longer needed

* docs(desktop): correct leak-typing numbers on a real session

Re-ran the leak harness on a populated session (Phaser thread) for both
unpatched and patched builds. The original 'listener leak' was transient
warm-up cost, not a steady-state leak — both versions show 0 listener
growth/round in steady state.

The load-bearing number is forced layouts per character:
  unpatched (HEAD~2):  7.02 layouts/char
  patched   (HEAD):    2.35 layouts/char  (3× fewer)

The patches reduce per-char forced-layout work to Blink's natural floor.
Document node count and heap are flat in both builds.

* perf(desktop): fix "Enter jumps up" on long threads

User reported: after pressing Enter on a long thread, the view jumps up
— the just-submitted message disappears below the fold. Confirmed via
apps/desktop/scripts/measure-jump.mjs:

  before:  distFromBottom 0 → 49.5px, sticks there permanently
  after:   distFromBottom 0 → ~0 (worst case 4px for one frame)

Root cause in useThreadScrollAnchor (thread-virtualizer.tsx):

1. The sticky-bottom logic disarmed on any scroll event where
   `scrollTop < lastTopRef.current`. That check can't distinguish a
   user scrolling up from a programmatic `pinToBottom` write that
   the browser clamped short of bottom (because content also grew in
   the same frame, so `scrollTop = scrollHeight` lands at
   `scrollHeight - clientHeight` for the OLD scrollHeight, which is
   now below the NEW scrollHeight). Result: sticky-bottom disarmed
   permanently on the user's first submit.

2. There was no synchronous pin tied to React's commit phase. By the
   time the ResizeObserver fired and re-pinned, the user had already
   seen ~50ms of "message below the fold" — visually that reads as the
   view jumping up.

Fix:

- `programmaticScrollPendingRef` counter tracks scroll events we
  expect to be ours (one per `pinToBottom` write). The scroll handler
  skips the disarm check when consuming a pending tick, keeps the
  arm bit true, and re-pins synchronously if the browser clamped us
  short of bottom. A depth cap (8) breaks runaway loops in
  pathological streaming-burst layouts.

- `useLayoutEffect` on `groupCount` increase pins BEFORE the browser
  paints, eliminating the visible ~50ms window between optimistic
  user-message insert and the RO/scroll-event chain firing.

Verified on the long Cloud Shadows thread (7-8 turns, ~11k px tall):
all three repro runs now hold within 0–4 px of bottom across the
post-Enter transition. Submit latency unchanged (paint 77–107 ms),
streaming-typing latency unchanged.

Also adds three debug harnesses:
  - measure-jump.mjs   — sample thread scroll across Enter
  - probe-thread.mjs   — dump current thread / scroll state
  - diag-jump.mjs      — intercept scrollTop + RO + mutations across Enter

* perf(desktop): rate-limit thread auto-pin during streaming

Follow-up to the Enter-jump fix. The first version did a synchronous
re-pin loop inside the on-scroll handler when the browser clamped our
`scrollTop = scrollHeight` write short of the new bottom; that gave a
tight 4 px visible jump on Enter, but during streaming the
ResizeObserver fires many times per second as content grows, and each
RO callback re-entered the pin loop. CPU profile showed
`Virtualizer.getMaxScrollOffset` climbing to 22 ms self over a typing-
during-streaming window — the sync re-pin path was paying tanstack-
virtual's recompute cost ~3× per token.

Re-architect:

- RO callback coalesces to one pin per animation frame. Streaming-rate
  RO bursts now cost the same as a single per-frame pin.
- The on-scroll programmatic-counter guard remains (it's what prevents
  the false-disarm bug when the browser clamps a write). It no longer
  does sync re-pins; the next RO/rAF will catch up.
- The useLayoutEffect on groupCount (the path that fires on user
  submit / new turn arrival) ALSO schedules one rAF pin in addition to
  the synchronous pin. This catches the case where React mounts the
  new message in a second commit (after our layout effect ran), which
  grows scrollHeight again. Two pins instead of a tight loop, paid only
  once per turn change.

Net effect on the Cloud Shadows long thread:

  enter-jump transient:   12–20 px for 1 frame (was 49 px permanent)
  CPU during stream+type: `getMaxScrollOffset` dropped out of top-5
                          self-time list
  typing-during-stream:   p50 ~10 ms paint, p99 ~20 ms (1 frame),
                          occasional 40 ms+ outliers during burst
                          token arrivals

Also adds scripts/profile-long-stream.mjs: 20-second streaming profile
with per-500ms FPS histogram + content-length tracking, so we can see
whether streaming render cost grows with message length (it doesn't —
sustained 60 fps).

* perf(desktop): use textContent for trigger precondition

Replace composerPlainText() call inside refreshTrigger's no-trigger
fast-bail with a textContent check. textContent is a browser-native
flat traversal; composerPlainText walks recursively with chip-aware
logic. We only need to know if @ or / appears; either way the trigger
char will be in textContent because chips contain @ in their refText.

Profile shows composerPlainText was ~18ms self over a 12s typing-during-
stream window, called from refreshTrigger on every keystroke. Most of
that was the precondition check (the trigger detection path is the
slow path but only runs when a trigger char is present).

* Revert "perf(desktop): use textContent for trigger precondition"

This reverts commit a6a78ff08a.

* Revert "perf(desktop): cut FadeText forced layouts during streaming"

This reverts commit 88e7d7537c.

* Revert "perf(desktop): cut per-keystroke layout + listener churn in chat composer"

This reverts commit bff1b3261d.

* Revert "Revert "perf(desktop): cut per-keystroke layout + listener churn in chat composer""

This reverts commit b7b378e3a4.

* Revert "Revert "perf(desktop): use textContent for trigger precondition""

This reverts commit 0739588f48.

* chore(desktop): synthetic-stream perf harness + scripts

Drops the React `<Profiler>` approach (no-op because Vite is currently
serving the production React build) in favor of an externally-observable
measurement stack: rAF frame intervals, `PerformanceObserver({entryTypes:
['longtask']})`, and a `MutationObserver` on the live streaming message.

Adds a synthetic stream driver — `window.__PERF_DRIVE__.stream({...})` —
that pushes tokens through the live `$messages` atom at a controlled rate,
so the assistant-ui runtime, incremental repository, and Streamdown
markdown pipeline see the same workload they'd see during a real LLM
stream, without the LLM cost.

The driver lives in `src/app/chat/perf-probe.tsx`; `main.tsx` side-imports
it under `import.meta.env.MODE !== 'production'` so it tree-shakes out of
prod builds. (Using `MODE` rather than `DEV` because our Vite setup
currently reports `DEV=false` even under `vite dev` — see the dev-build
note in `profile-typing-lag.md`.)

Scripts:
  - measure-synthetic-stream.mjs  drive synthetic + record frame/longtask/mutation
  - profile-synth-stream.mjs      CPU profile + top self-time during synthetic
  - measure-real-stream.mjs       same harness, real LLM stream
  - profile-real-stream.mjs       CPU profile bracketing the real stream window
  - eval.mjs / reload.mjs         small CDP helpers

A real-LLM measurement on Cloud Shadows (gpt-4o-mini, 39 s window) showed
12 longtasks in the same 75-127 ms range the synthetic predicted, so the
synthetic is a faithful proxy.

* perf(desktop): memo FadeText so it skips re-renders when text unchanged

FadeText is used 110+ times inside `tool-fallback.tsx` on a tool-heavy
thread. During streaming each parent re-render previously triggered the
component's `useEffect([children])`, which forced a `scrollWidth` layout
read even when the title text was unchanged. The `useResizeObserver` was
already covering the genuine resize case, so that effect was strictly
redundant work.

Drops the effect and wraps the component in `React.memo` with a custom
comparator that field-compares `className`, `fadeWidth`, and `style`,
plus identity-compares `children` (scalar fast-path; correct for JSX
nodes too since a new node should force a re-render).

Verified via temporary render counter on the 34 MB
`session_20260514_215353_fe0ac8` thread (110 FadeText instances): a
2 s synthetic stream went from ~11k FadeText render calls to 122 —
roughly one render per truly-new instance instead of one per parent
commit per instance.

Doesn't move the longtask needle on its own (Streamdown's markdown
re-parse dwarfs it) but eliminates a steady CPU floor and a class of
forced layouts during streaming. Profile-typing-lag.md documents the
full investigation, including the remaining Streamdown cost as the
real source of the perceived "5 fps moment" hitches.

* perf(desktop): memoize MarkdownText plugins to stop churning Streamdown

The inline `plugins={{ math: mathPlugin, ...(isStreaming ? {} : { code }) }}`
on `<StreamdownTextPrimitive>` constructed a new object literal on every
parent render. That broke `<Streamdown>`'s outer memo and forced its
internal `rehypePlugins` / `remarkPlugins` array useMemos to rebuild,
which propagates a new identity into every `<Block>` and defeats Block's
memoization for stable historical blocks.

After memoizing on `[isStreaming]` (the only real dimension of variance),
CPU profile during a 5 s synthetic stream on the 34 MB session shows
`parser` self-time dropping out of the top 10, `compile` cut roughly in
half, and `bn$1` / `m$1` (micromark internals) leaving the top entries.

Doesn't move the visible longtask count on its own — Streamdown's
per-Block parse cost still dominates whenever the last block's content
changes — but it removes a class of unnecessary re-parses for historical
blocks during streaming. See `scripts/profile-typing-lag.md` for the
full investigation.

* perf(desktop): floor assistant-text flush gap to 33ms for predictable batching

`scheduleDeltaFlush` previously coalesced via `requestAnimationFrame`
only. The "at most one flush per frame" guarantee that gives you is fine
for fast streams (>~80 tok/sec) where multiple tokens arrive within a
single frame, but breaks down at typical LLM token rates (30-80 tok/sec)
where each token arrives slower than the rAF cadence and triggers its
own React commit + Streamdown markdown re-parse.

Track `lastFlushAt` and require at least 33 ms between two flushes.
React 18+ auto-batching probabilistically already collapsed some of
these, but the floor makes it deterministic.

A/B on the 34 MB session, 300 tokens at 50 tok/sec (markdown chunks):

| | avgFps | p99 frame | LTs / 5 s | max LT |
|---|---|---|---|---|
| no floor (current rAF) | 54.0 | 38 ms | 2.0 | 145 ms |
| 33 ms floor (this PR) | 54.3 | 41 ms | 1.7 | 110 ms |

`inter-mutation` p50 also tightens from 22-28 ms to a clean 33 ms,
which is the expected signature of a deterministic floor. Doesn't fully
solve the user's perceived hitches — Streamdown's per-Block parse cost
when the last block grows past ~2 k chars is still the elephant — but
it consistently shaves the worst-case longtask and makes the streaming
cadence visibly steadier.

Also threads a matching `flushMinMs` option through the synthetic
stream driver in `perf-probe.tsx` + `scripts/measure-synthetic-stream.mjs`
so the harness can A/B both regimes without spending LLM credits.

See `scripts/profile-typing-lag.md` for the full investigation.

* perf(desktop): useDeferredValue for streaming markdown so parses don't block input

Streamdown's per-Block parse cost grows with the live tail's length and
is unavoidable inside the block-memo pattern (industry standard, see
findings doc). The fix is to stop having that work block the main thread.

`<DeferStreamingText>` is a 12-line wrapper that reads message-part state
via `useMessagePartText`, runs it through `useDeferredValue`, and
re-publishes via assistant-ui's `<TextMessagePartProvider>`. The inner
`<StreamdownTextPrimitive>` reads the deferred value through the normal
`useMessagePartText` hook — no fork, no internal-path imports, fully on
assistant-ui's public API. React's concurrent scheduler then:

  - abandons in-flight deferred renders when a newer token arrives, so
    intermediate states get skipped under fast streams
  - deprioritises the markdown render when the main thread has urgent
    work (typing, scroll), so input stays responsive even while a
    100ms parse is queued

Streamdown already uses `useTransition` for its block-array setState;
this lifts the deferral up to the consumer boundary so it covers the
whole pipeline (preprocess → split → repair → parse → render).

A/B on the 34 MB session, 300 tokens at 50 tok/sec, markdown chunks
(four trials each, with the 33ms flush throttle on for both):

| | avgFps | p99 frame | LTs/5s | max LT | typing-while-stream p95 |
|---|---|---|---|---|---|
| pre  | 54.3 | 41 ms | 1.7 | 110 ms | ~17 ms |
| post | 58.5 | 31 ms | 2.0 | 117 ms | 14-18 ms |

Longtask count + max LT unchanged — useDeferredValue doesn't reduce
CPU, only its priority. The avgFps lift and p99 frame drop are the
proof that the existing CPU is no longer blocking 60 fps cadence. One
clean run logged MUTATIONS=0 — React skipped every intermediate text
state and only committed the final one (textbook deferred-value
behaviour).

The actually-reduce-CPU path is replacing the parser with a state
machine like Flowdown — left for a future PR; see
`apps/desktop/scripts/profile-typing-lag.md` for the full investigation.

* feat(desktop): add hermes gui launcher

* feat(desktop): launch packaged gui builds by default

* bump gui version to 0.0.2

* fix(dashboard): allow file:// origin on loopback WS + diagnostic logging

Upstream commit 2e66eefbc ("fix(dashboard): validate WebSocket Host
and Origin") added a WebSocket Host/Origin guard to block DNS
rebinding against the dashboard.  The guard rejects any Origin whose
scheme is not http/https or whose netloc is empty — which includes
Electron's renderer Origin: file:// when the desktop app loads its
bundle from disk in production mode.

That makes the bb/gui Electron desktop unable to open the gateway
WebSocket against the embedded backend on Windows / macOS prod
builds.  The renderer reports "Desktop boot failed" and the backend
logs:

  WARNING hermes_cli.web_server: gateway-ws reject
      peer=127.0.0.1:NNNN reason=non_loopback_or_bad_origin
      bound_host=127.0.0.1 close_code=4403

DNS-rebinding requires a DNS-resolvable hostname; file:// has no
host component and therefore cannot be the attack vector this guard
exists to block.  When bound to a loopback interface (127.0.0.1 /
::1 / localhost), accept file:// origins so desktop wrappers can
attach.  Non-loopback binds (operator opted into network exposure)
keep rejecting file:// — the loose policy doesn't apply.

Also adds per-reason diagnostic logging in
_ws_host_origin_is_allowed, so future ws-guard rejections name the
specific clause that fired (bad_host / bad_origin_scheme /
origin_host_mismatch) instead of the opaque
"non_loopback_or_bad_origin" surfaced at the call site.

Verified against tests/hermes_cli/test_web_server_host_header.py
(all 11 upstream tests still pass) and hand-tested by opening the
bb/gui Electron desktop dev build against the patched backend.

* fix(tui_gateway): restore _content_display_text helper

Bb/gui had dropped the helper but the orchestrator code merged from main
still calls it (_inflight_text, _message_preview). Re-add the definition
verbatim from main so session.create / _start_inflight_turn don't crash
with NameError on first prompt submit.

* fix(tui-gateway): restore _content_display_text helper lost in main merge

The May 27 merge of origin/main into bb/gui re-introduced two callers of
_content_display_text (in _inflight_text and _history_to_messages) but
dropped the helper definition itself, leaving an unresolved reference.

NameError fires on every user message via _start_inflight_turn ->
_inflight_text, taking down both the TUI and the desktop (which share
this gateway backend) the moment input is dispatched.

Restores the helper verbatim from main (commit 36c99af37) -- pure
structured-content text extractor, no other dependencies.

* fix(telegram): import Set for _dm_topic_chat_ids annotation

self._dm_topic_chat_ids: Set[str] = {...} at line 460 references Set
but only Dict, List, Optional, Any are imported from typing. The file
has no 'from __future__ import annotations', so the annotation is
evaluated at runtime and raises NameError on TelegramAdapter
construction.

* fix(setup): drop shadowing inner importlib.util re-imports

_print_setup_summary and _setup_tts_provider each had 'import
importlib.util' inside a try: block nested deeper in the function
body. Python flips importlib to function-local for the whole scope,
so earlier references in the same function (the neutts branches at
lines 493 / 1109) hit UnboundLocalError before the late import can
run.

The top-of-module 'import importlib.util' at line 14 already covers
both call sites, so dropping the redundant inner imports restores
the intended behavior.

* feat(install.ps1): add -IncludeDesktop switch + Stage-Desktop

The new Hermes-Setup.exe (Tauri bootstrap installer) passes -IncludeDesktop
so users who install via the GUI end up with a launchable Hermes.exe at
apps/desktop/release/<os>-unpacked/. Existing flows are unchanged:

  * The 'irm install.ps1 | iex' CLI one-liner omits the flag — terminal
    users don't need a prebuilt desktop binary; 'hermes desktop' builds
    on demand.
  * The Electron desktop's bootstrap-runner.cjs also omits the flag —
    rebuilding apps/desktop from inside a running Hermes.exe would try
    to overwrite the live binary on disk and fail.

Stage-Desktop runs after Stage-NodeDeps so workspace npm is already
installed when electron-builder fires. It does:
  1. 'npm install' at repo root so apps/* workspaces resolve their deps
     (Electron itself arrives via npm here, ~150MB)
  2. 'npm run pack' in apps/desktop (tsc + vite + electron-builder --dir)
  3. Probes apps/desktop/release/{win-unpacked,win-arm64-unpacked}/Hermes.exe

The --dir mode produces an unpacked launchable binary without an NSIS/MSI
installer artifact — we don't need one because Hermes-Setup.exe spawns the
unpacked binary directly via launch_hermes_desktop.

* feat(installer): Tauri bootstrap installer for first-time onboarding

Hermes-Setup.exe is a small signed Rust+Tauri binary that drives
scripts/install.ps1 stage-by-stage with a native UI matching the
desktop's design language. Replaces the chicken-and-egg pattern of
shipping a 200MB Electron app whose first launch existed only to
run install.ps1.

The architecture:

  Rust backend (src-tauri/):
    bootstrap.rs        orchestrator -- Tauri commands, stage iteration
    install_script.rs   resolve install.ps1 (dev checkout, cache, GitHub raw)
    powershell.rs       spawn powershell, line-stream stdout/stderr, parse JSON
    events.rs           BootstrapEvent types -- mirror bootstrap-runner.cjs
    paths.rs            HERMES_HOME resolution + tracing log setup
    build.rs            bakes BUILD_PIN_COMMIT / BUILD_PIN_BRANCH from
                        'git rev-parse HEAD' at compile time

  React frontend (src/):
    Tauri webview rendering 4 screens (welcome / progress / success /
    failure), driven by nanostores subscribing to the Rust event stream.
    Visual layer reuses the desktop's styles.css wholesale via @import
    so the installer and desktop never drift visually.

  Distribution:
    targets = ['app', 'dmg', 'appimage'] -- no NSIS/MSI wrapper. The
    raw target/release/Hermes-Setup.exe IS the artifact on Windows;
    .dmg + .app on macOS; AppImage on Linux. One file, double-click,
    no installer-installing-an-installer pattern.

  Compile-time pinning:
    build.rs reads 'git rev-parse HEAD' and emits
    cargo:rustc-env=BUILD_PIN_COMMIT=<sha> + BUILD_PIN_BRANCH=<branch>.
    bootstrap.rs's option_env!() picks these up so the binary fetches
    install.ps1 from the exact SHA it was tested against. CI / release
    builds can override via HERMES_BUILD_PIN_COMMIT env var.

  Windows manifest:
    hermes-setup.manifest declares level='asInvoker' so the
    productName 'Hermes Setup' doesn't trip Windows's installer-
    detection heuristic and refuse to launch without elevation.
    Also declares PerMonitorV2 DPI + UTF-8 active code page + Common
    Controls v6.

Limitations of this initial version:

  * No code signing -- Windows SmartScreen will warn once on Hermes-Setup.exe
    ('More info -> Run anyway'). The downstream binaries it produces
    (Hermes.exe in win-unpacked/, the hermes CLI) are locally-built and
    therefore don't carry MOTW, so they launch without SmartScreen
    intervention. Cert procurement tracked separately.

  * macOS and Linux build paths defined but untested -- Windows-only V1.

* fix(installer): pass -IncludeDesktop to manifest, surface launch errors, alias hermes desktop

Three bugs found in the first VM end-to-end test:

1. install.ps1 -Manifest was called WITHOUT -IncludeDesktop, so the
   manifest came back with the 14-stage list (no desktop stage), the
   UI showed '14 steps' and Stage-Desktop never ran. Pass the flag to
   both the manifest fetch and the per-stage runs — install.ps1 gates
   the desktop stage's inclusion on the flag.

2. The Success screen's Launch button silently swallowed the Tauri
   error when no Hermes.exe existed (e.g. Stage-Desktop was skipped).
   Wire the error through to inline UI with an alert callout, so the
   user gets actionable text ('Hermes.exe missing, run hermes desktop
   from a terminal') instead of an unresponsive button.

3. The Success screen tells users to run 'hermes desktop' from a
   terminal but the CLI only accepted 'hermes gui' — invalid choice
   for 'desktop'. Rename the subcommand canonically to 'desktop' with
   'gui' as a backwards-compatible alias. Update the _SUBCOMMANDS sets
   used by session-flag arg parsing + logging-mode probe so both names
   route to the same logic.

* fix(install.ps1): pre-warm electron-builder winCodeSign cache + fix Stage-Desktop $HasNode false-skip

Two bugs caught in the second VM end-to-end run:

1. electron-builder's winCodeSign extraction fails on grandma-class
   Windows boxes because the .7z archive contains macOS symlinks
   (darwin/10.12/lib/libcrypto.dylib and libssl.dylib pointing at
   versioned siblings). Creating symlinks on Windows requires
   SeCreateSymbolicLinkPrivilege, a per-user right that non-admin
   accounts don't have on stock Windows. Result: every fresh install
   on a non-admin user fails Stage-Desktop with a 7-Zip 'cannot create
   symbolic link' error, retried four times, then bails.

   Fix: Initialize-ElectronBuilderCache pre-extracts winCodeSign-2.6.0.7z
   ourselves with -snl (don't preserve symlinks, store as resolved file
   content) AND -x!darwin (skip the entire macOS subtree — irrelevant
   on Windows). Writes to electron-builder's expected cache dir before
   electron-builder gets a chance to try its own broken extraction.
   Idempotent — fast-paths via signtool.exe sentinel check.

2. Install-Desktop's first guard was 'if (-not $HasNode) skip'.
   $HasNode is set by Stage-Node into $script:HasNode, but in
   cross-process driver mode (each -Stage NAME is a fresh powershell.exe
   spawned by Hermes-Setup.exe), that script-scope variable from the
   PREVIOUS process is invisible — so the guard always fired and
   Install-Desktop returned in 900ms with a misleading
   'Node.js not available' reason. The real npm probe below it never
   got to run. Fix: re-probe npm directly via Get-Command when $HasNode
   is empty/false, since by that point Stage-Node has already verified
   Node is installed and the only question is whether *this* process
   can see it on PATH (it can — installer-wide PATH update from Stage-Node).

* fix(install.ps1): tell electron-builder we're NOT signing instead of pre-extracting winCodeSign

The previous commit (c7e46f9f3) worked around the winCodeSign-symlinks-
on-Windows extraction crash by pre-extracting the archive ourselves with
-snl + -x!darwin. That fix was correct but addressed the wrong layer.

The deeper question: why was electron-builder fetching winCodeSign at all
when we have no signing cert configured? Answer: electron-builder
unconditionally pre-warms the toolchain assuming any build MIGHT sign.
The cert auto-discovery never finds anything (we never set CSC_LINK
or anything else), so the signing never happens — but the 100MB fetch
of winCodeSign and its broken-on-Windows symlink extraction does.

Set CSC_IDENTITY_AUTO_DISCOVERY=false (with WIN_CSC_LINK and
WIN_CSC_KEY_PASSWORD also explicitly cleared as belt-and-suspenders)
before invoking npm run pack, and electron-builder skips the entire
winCodeSign apparatus. No download, no extraction, no privilege check.
Env vars are saved/restored around the invocation so we don't leak
the override into Stage-PlatformSdks etc.

Net: removes the 100-line Initialize-ElectronBuilderCache helper that
manually downloaded + extracted winCodeSign-2.6.0.7z. Replaced with
3 env-var assignments. The produced Hermes.exe is functionally
identical — just no longer carries a code-signing-machinery dependency
we never used.

* fix(installer): bump bootstrap-installer.log to capture stage transitions + every install.ps1 line

Diagnosing the second VM failure was impossible because bootstrap-installer.log
contained only the 'starting' banner. Two causes:

1. emit_log() inside run_bootstrap() was tracing::debug! — dropped on the
   floor under the default INFO env-filter.

2. The per-stage sink callbacks (on_stdout_line / on_stderr_line) only
   emitted Tauri events to the frontend; they never tee'd to the log file
   at all. When the failure route mounts, the Tauri event stream is the
   only place the script output lived, and it gets discarded.

3. The Failed / Stage / Manifest / Complete lifecycle frames in emit_event()
   were also Tauri-only — so even the 'which stage failed' frame never
   reached the log.

Fixes:
  * emit_log() → tracing::info!
  * Sink callbacks tee stdout to info!, stderr to warn!, with stage label
    as a structured field for grep'ability
  * emit_event() now matches on the variant and logs each lifecycle frame
    at the right level: Failed → tracing::error!, others → info!

Result: a failing install leaves a complete forensic trail in
bootstrap-installer.log — manifest stage list, every install.ps1
stdout/stderr line tagged by stage, the stage transitions, and the
final error. Same path as before so nothing the user does changes.

* fix(install.ps1): Stage-NodeDeps cross-process $HasNode + stream npm install output to bootstrap log

VM run 3 diagnosis: node-deps stage skipped on the VM (logged
'Skipping Node.js dependencies (Node not installed)') and then
desktop's npm install failed with exit 1 and zero diagnostic detail.

Two root causes:

1. $HasNode false-skip in Stage-NodeDeps — same cross-process bug
   pattern we fixed for Stage-Desktop in c7e46f9f3. Stage-Node ran
   in process A and set $script:HasNode = $true, then exited. Stage-
   NodeDeps ran in fresh process B (Hermes-Setup.exe -Stage NAME
   spawns each stage independently), where that variable doesn't
   exist. Re-probe via Get-Command npm instead of trusting the
   stale script-scope global. The previous stage already verified
   Node so the re-probe succeeds.

2. npm install --silent + Tee to TEMP file hid the real error.
   When the workspace install failed on the VM, the actual reason
   was buffered in $env:TEMP\hermes-npm-desktop-install-*.log and
   the user saw only 'exit 1'. Drop --silent so npm streams its
   full output, drop the TEMP-file dance — the Tauri installer's
   streaming sink already tees every stdout/stderr line to the
   rolling bootstrap-installer.log, so a side log file is dead
   weight that hides the very error we need.

After this, the bootstrap log on a failure will contain npm's full
output (deprecation warnings, ETARGET, native-module compile errors,
whatever) tagged with stage=desktop, making the actual cause
diagnosable instead of an opaque exit code.

* fix(install.ps1): restore Initialize-ElectronBuilderCache (CSC env vars alone aren't enough)

VM run 4 diagnosis: even with CSC_IDENTITY_AUTO_DISCOVERY=false set,
electron-builder still fetches winCodeSign and signs bundled binaries.
The log shows the signing happens BEFORE the cache extraction:

  • signing with signtool.exe  ...\winpty-agent.exe
  • signing with signtool.exe  ...\OpenConsole.exe
  • downloading winCodeSign-2.6.0.7z
  • <symlink privilege error>

Cause: node-pty's bundled prebuilds are listed in apps/desktop's
asarUnpack ['**/*.node', '**/prebuilds/**']. electron-builder
re-signs anything unpacked from asar, regardless of whether OUR
binary gets signed. The signtool invocation needs winCodeSign on
disk, which needs the .7z extracted, which hits the macOS-symlink
crash on non-admin Windows.

The CSC env vars I added in d5fe46727 only kill IDENTITY DISCOVERY
(so OUR Hermes.exe stays unsigned, which is fine — we have no cert).
They don't prevent the toolchain fetch for the bundled-prebuild
re-sign. I removed the pre-extract in d5fe46727 thinking the env
vars subsumed it; that was wrong. Both are needed.

Restoring Initialize-ElectronBuilderCache verbatim from c7e46f9f3
and keeping the CSC env vars. Wrote a clearer doc-comment at the
call site explaining the two-knob interaction so future maintainers
don't drop one half again.

* fix(desktop): disable signtool via signtoolOptions.sign=null, drop dead winCodeSign pre-extract

VM run 5 diagnosis: the pre-extract from 3b29e65c1 ran (extracted 83
files, 24MB) but produced ZERO files at the expected sentinel path
'/winCodeSign-2.6.0/windows-10/x64/signtool.exe'.

Cause: the .7z archive's root entries are 'windows-10/', 'darwin/',
'linux/', etc. — not 'winCodeSign-2.6.0/<arch>'. Extracting with
'-o$cacheRoot' put files at $cacheRoot/windows-10/..., NOT at
$cacheRoot/winCodeSign-2.6.0/windows-10/.... I had the directory
nesting wrong from the start.

And then we observed: electron-builder downloads winCodeSign-2.6.0.7z
under a random numeric filename ('384387955.7z') regardless of what's
already extracted in the parent dir. The cache key isn't the dirname;
it's content-addressed. So the pre-extract approach was doomed even
if the path nesting had been right.

Actual fix: signtoolOptions.sign=null in apps/desktop/package.json's
win build config. electron-builder honors this and skips the bundled-
prebuild signing entirely — no signtool invocation, no winCodeSign
fetch, no symlink-privilege crash. The previous failures all stemmed
from electron-builder pre-signing node-pty's bundled .exes
(winpty-agent.exe, OpenConsole.exe) which are already author-signed
upstream; re-signing with our nonexistent cert was overwriting good
sigs with nothing useful anyway.

Cost: when we DO get a real cert later, we'll add it back with the
sign function pointing at the cert chain. Until then, all-null is
the correct config and unblocks every non-admin Windows user.

Removed Initialize-ElectronBuilderCache (the dead pre-extract).
Removed the call site. Kept the CSC_IDENTITY_AUTO_DISCOVERY env
vars as belt-and-suspenders against a future electron-builder
change that might revive cert auto-discovery.

* fix(desktop): use no-op sign function instead of sign=null

VM run 6 still hit the symlink crash even with signtoolOptions.sign=null.
electron-builder 26.8.1 treats null as 'use the default signtool path'
rather than 'skip signing', so the winCodeSign fetch + extraction still
fired for the bundled prebuild re-sign.

The Electron docs (electronjs.org/docs/latest/tutorial/code-signing)
make it clear signing is OPTIONAL and unsigned apps work fine — users
just see SmartScreen on first launch. The electron-builder mechanism
for 'don't actually sign anything' is to supply a custom sign function
(via signtoolOptions.sign: '<path-to-cjs-module>') that resolves
without invoking signtool.

build-noop-sign.cjs is that module — a 5-line async function that
returns undefined. electron-builder calls it for every binary it would
have signed, gets back a resolved promise, and considers each binary
'signed.' No signtool spawn, no winCodeSign fetch, no symlink crash.

When Nous's cert arrives, replace this file with a real signing hook
(@electron/windows-sign-based or a direct signtool invocation). The
architecture's signing-ready and the cutover is a one-file edit.

* fix(desktop): signAndEditExecutable=false to skip signtool path entirely

After reading app-builder-lib/winPackager.js line 216 + 231 directly:
signAndEditExecutable is the ACTUAL hardcoded gate that short-circuits
both signApp() (which signs Hermes.exe + every shouldSignFile match
including bundled prebuilds) AND createTransformerForExtraFiles().
None of signtoolOptions.sign / sign:null / sign:<custom-fn> gate the
winCodeSign download — that happens before they're consulted.

What we lose: rcedit also runs through signAndEditResources, so
disabling this drops PE metadata (file properties showing 'Hermes' /
'Nous Research' / file description). Cost is real but bounded:
  * Hermes.exe filename, icon, asar contents, app identity intact
  * Task Manager shows 'Hermes.exe' (the filename) not 'Hermes' (PE
    description) — minor downgrade
  * Start menu, taskbar, window title all work normally
  * SmartScreen will warn once (unsigned, same as before)

When the cert lands, flip signAndEditExecutable back to default true,
both signing AND rcedit return, PE metadata is restored.

Removes the no-op sign function (build-noop-sign.cjs) since
signAndEditExecutable=false prevents signtool from being invoked at
all — the custom hook never gets called either.

* feat(install.ps1): write .hermes-bootstrap-complete marker at end of install

The desktop app's main.cjs resolver ladder has a 'bootstrap-needed' rung
that fires when .hermes-bootstrap-complete is missing from
ACTIVE_HERMES_ROOT. Pre-Hermes-Setup, this marker was written by the
packaged-desktop's own bootstrap-runner.cjs at the end of its install
flow. Now that Hermes-Setup.exe runs install.ps1 directly, install.ps1
needs to own the marker — otherwise the desktop sees no marker on first
launch and triggers its legacy first-launch bootstrap (re-running
install.ps1 from inside Electron, the exact recursion Hermes-Setup.exe
was supposed to obviate).

Implementation:
  * New Stage-BootstrapMarker (worker) → Write-BootstrapMarker (helper)
  * Slotted in the manifest right after platform-sdks, before the
    interactive configure/gateway stages, so it runs unconditionally
    when the install reaches the finalize phase
  * Schema mirrors apps/desktop/electron/main.cjs writeBootstrapMarker /
    isBootstrapComplete EXACTLY: {schemaVersion: 1, pinnedCommit,
    pinnedBranch, completedAt}. Schema version stays at 1 so old
    desktops that read marker files written by future install.ps1s
    can still parse them.
  * pinnedCommit comes from -Commit flag (Hermes-Setup.exe passes it)
    or falls back to 'git rev-parse HEAD' in InstallDir
  * pinnedBranch from -Branch flag, defaults to 'main' matching
    install.ps1's own param default

Two PS-5.1 gotchas baked into comments:
  * The ?. null-conditional operator doesn't exist pre-PS7; use
    explicit if-checks on Get-Command results
  * Set-Content -Encoding UTF8 emits a BOM in 5.1 and Node's plain
    JSON.parse rejects BOM — write via .NET's UTF8Encoding(false)
    to produce BOM-less JSON the desktop's readJson() can parse

* feat(installer): drive in-app updates through the Tauri installer

Converge update on the same principle as bootstrap: one driver owns all
repo mutation. The desktop becomes a pure consumer that hands off to
Hermes-Setup.exe --update instead of re-implementing git/pip in Electron.

- hermes desktop --build-only: build without launching, so the installer
  owns the post-update launch (CLI keeps build logic single-sourced).
- Installer AppMode {Install,Update} from argv; get_mode exposed to the UI.
- Installer self-copies to HERMES_HOME/hermes-setup.exe on install success
  (no-op guard during --update re-invocation to avoid the locked-exe copy).
- Installer --update flow (update.rs): wait for the desktop to release the
  venv shim, run 'hermes update --yes --gateway' (branch on exit 0/2/other),
  then 'hermes desktop --build-only', then launch the rebuilt desktop. Reuses
  the bootstrap event channel + progress UI via a synthetic two-stage manifest.
- Desktop applyUpdates() gutted (~105 lines of git/stash/pull/pyproject/pip
  removed) -> thin handoff: spawn updater, app.quit() to free the shim.
  Detection (checkUpdates, commit changelog, behind-count) kept intact.
- install.ps1 creates Start Menu + Desktop shortcuts to the packed Hermes.exe
  (never bare 'hermes desktop', which would rebuild every launch).

* test update

* fix(installer): pass --branch to hermes update in the --update flow

The install is a detached-HEAD checkout of a pinned commit. Without
--branch, 'hermes update' fell back to its default (main) and switched
the checkout to main — a divergent branch that lacks the desktop CLI
command — so the update targeted the wrong branch and the rebuild stage
failed with 'invalid choice: desktop'.

Thread BUILD_PIN_BRANCH (the branch this installer was built against,
and the same branch the desktop detected the update on) into
'hermes update --branch <b>' so update + rebuild stay on-branch.

* test update

* fix(installer): stamp Hermes icon onto Hermes.exe via rcedit (no winCodeSign)

The unpacked Hermes.exe showed the stock Electron icon + name in the
taskbar because build.win.signAndEditExecutable=false disables BOTH
electron-builder's signing AND its rcedit metadata/icon stamping. That
flag is load-bearing: enabling it re-triggers signtool -> winCodeSign,
whose macOS symlinks crash 7-Zip on non-admin Windows (unfixable dead end).

Decouple identity-stamping from signing entirely: after npm run pack,
run rcedit ourselves on the produced exe.
- Add rcedit as a direct devDependency of apps/desktop (the transitive
  electron-winstaller copy is fragile).
- apps/desktop/scripts/set-exe-identity.cjs: Node helper that calls
  rcedit's named export to set icon + ProductName/FileDescription/
  CompanyName. Node builds argv natively — avoids the PowerShell->exe
  ->JSON double-escaping that broke the app-builder rcedit path.
- install.ps1 Set-DesktopExeIdentity invokes the script after the build,
  before shortcuts. Best-effort: failure keeps the stock icon, never
  fails the install. rcedit is a pure PE editor — no signtool, no
  winCodeSign, no symlinks.

Verified locally: stamping a copy of the built Hermes.exe embeds the
32x32 icon and sets ProductName=Hermes.

Also fix update-path success-screen flash: in update mode the installer
hands off + exits in ~600ms, so don't route to the 'launch Hermes'
success view (it flashed before the window closed).

* update test

* fix(desktop): show 'hermes update' guidance for CLI installs instead of dead-end error

A user who installed via the CLI (irm|iex / install.sh) then ran
`hermes desktop` has no staged hermes-setup.exe, so clicking Update
in-app hit resolveUpdaterBinary()=null and showed a misleading error
('re-run the Hermes installer') with a Try-again button that could
never succeed — a dead loop for a perfectly valid install.

Treat the no-updater case as an intentional outcome, not a failure:
- main.cjs applyUpdates returns { ok:true, manual:true, command:'hermes update' }
  (no throw, no 'error' stage) when no updater binary exists.
- New 'manual' update stage + apply-state.command thread the command to the UI.
- updates-overlay ManualView: a polished terminal-native card with the
  exact command and a copy button, framed as the correct path for a CLI
  user rather than an error.

GUI-installer users are unaffected — hermes-setup.exe present => seamless
auto-update runs as before. Zero new process orchestration; can't fail
the update demo.

* update test

* fix(gui): pin /api/hermes/update to the current branch

The desktop command-center 'update' action hits POST /api/hermes/update,
which spawned bare `hermes update` with no --branch. cmd_update then
falls back to its default (main) and checks the working tree OUT of the
tracked branch — a bb/gui install silently jumped to main and lost the
desktop CLI.

Resolve the checkout's current branch and pass --branch <current> from
this endpoint only. The engine default (main) is DELIBERATELY unchanged:
bare `hermes update` from a terminal, the gateway /update bot command,
and the CLI/TUI relaunch path all keep their long-standing 'update against
main' contract for the existing user base. Only the GUI button is scoped
to update-the-branch-you're-on. Detached HEAD / git failure falls back to
the bare default.

* update test

* fix(desktop): branch-pin the CLI manual-update command card

The 'Update from your terminal' card (shown to CLI installs with no staged
updater) hardcoded bare `hermes update` — which defaults to main and would
switch a bb/gui (or any non-main) checkout off-branch. Same bug we fixed for
the GUI button, leaked into the card's copy text.

Resolve the checkout's current branch and show `hermes update --branch
<current>` for non-main checkouts; keep it bare for main so the card stays
clean. Best-effort: bare fallback if branch detection fails. Matches the
GUI button + installer --update contract; bare terminal/bot/TUI update
paths still default to main, unchanged.

* docs: phragg was here

* feat(desktop): lead onboarding with Nous Portal + fix fresh-install detection (#34970)

- Feature Nous Portal as the primary onboarding card (Recommended tag,
  app logo, single pitch line); collapse other OAuth providers behind an
  "Other providers" disclosure whose open/closed state persists.
- Surface OpenRouter as a one-click API-key option inside the disclosure;
  move "I have an API key" to a quiet bottom-right link.
- Treat "no provider configured" as a normal onboarding state, not a red
  error banner (provider-setup-errors copy match).
- Fix setup.runtime_check: it reported ready when the resolved runtime had
  an empty credential or only implicit Bedrock/IAM, so fresh installs never
  saw onboarding. Now requires a usable credential.
- Auto-wire Windows fonts for WSL2 users so the renderer renders real
  Segoe UI instead of the DejaVu fallback; make WSL detection env-independent
  via the /proc kernel marker.

* feat(desktop): live elapsed timer on install bootstrap steps

The first-launch install overlay showed a static "Installing" with no
motion, so long steps (notably the repo clone) looked frozen. Stamp each
stage's start time on the running transition and tick once a second so the
active step shows live elapsed (e.g. "Installing · 1:23"), plus elapsed on
the overall current-step line. Completed steps keep their final duration.

* fix(desktop): resolve PortableGit for update checks + reserve titlebar tools space

- runGit() hardcoded spawn('git'), which ENOENTs on fresh installer-driven
  Windows installs (git is PortableGit under %LOCALAPPDATA%\hermes\git, never
  on PATH) — so "Check for updates" failed with "Couldn't check for updates".
  Add resolveGitBinary() mirroring findGitBash (PortableGit → Git-for-Windows
  → PATH) and use it in runGit.
- PageSearchShell rendered a full-width search input in the titlebar row, so
  on Windows its right edge slid under the fixed top-right tools + native
  window controls. Reserve that footprint via --titlebar-tools-* vars.

* fix(desktop): stop streaming caret from shifting layout on completion

The streaming caret (::after on the running message's last child) was an
in-flow inline-block adding ~0.78em of inline width, which could wrap the
last line mid-stream; when the caret is removed on completion the line
un-wraps and reflows — the visible post-response layout shift. Net-zero its
inline advance with a compensating negative margin so it paints at the text
end without consuming layout width.

* fix(desktop): stop completed-message layout shift while streaming

The assistant message action bar used `hideWhenRunning`, which unmounts it
whenever the thread is streaming. Since the bar reserves vertical space in
each completed assistant message's footer (it's invisible-until-hover via
opacity, not via mount), unmounting it collapsed every prior turn by the
bar's height — then remounting on resolve grew them back, shifting the whole
conversation (visible as "padding appears above the last user message").
Drop hideWhenRunning so the footer height is constant; the bar stays
invisible during streaming via its existing opacity/pointer-events gating.

* fix(merge): keep windows-footgun suppressions inline

* fix(merge): keep remaining gateway footgun suppressions inline

* fix(merge): restore contracts caught by main-target CI

* fix(dashboard): honor injected HERMES_DASHBOARD_SESSION_TOKEN

The desktop shell mints a session token and signs its /api + /api/ws
calls with it via HERMES_DASHBOARD_SESSION_TOKEN, but the main-merge
restored a web_server.py that ignored the env var and minted its own
random _SESSION_TOKEN -- so every desktop request 401'd and the UI
reported "gateway offline". Read the injected token (fall back to a
fresh random one) so loopback HTTP + WS auth line up.

Adds a regression test so a future merge can't silently drop the read.

* fix(desktop): align fresh-install home so upgraders don't brick

Two related first-launch bugs on machines with a legacy ~/.hermes:

- install.ps1 hardcoded $HermesHome/$InstallDir to %LOCALAPPDATA%\hermes
  and ignored the HERMES_HOME the desktop passes through. The desktop
  freezes HERMES_HOME at module load and prefers a legacy ~/.hermes when
  %LOCALAPPDATA%\hermes is absent, so the installer wrote to a different
  home than the shell read -> "Could not connect to Hermes gateway". Honor
  $env:HERMES_HOME in the param defaults.

- isBootstrapComplete() trusted the marker + checkout without verifying a
  runnable venv, so an interrupted/split install spawned a dead backend
  instead of re-bootstrapping. Also require the venv python to exist.

* fix(dashboard): allow packaged desktop file:// origin on loopback WS

The packaged Electron desktop loads its renderer over file://, so its
/api/ws handshake carries Origin: file:// (or null). The DNS-rebinding
WebSocket Origin guard only accepted http(s) origins matching the bound
host, so it rejected the desktop's own renderer with 4403 -> "Could not
connect to Hermes gateway" on macOS.

A browser DNS-rebinding attacker can only ever present an http(s) origin
(the site hosting the malicious page); it cannot forge file://, null, or
a custom app scheme AND hold the loopback session token. So on loopback
binds we now trust non-web origins -- the token in _ws_auth_ok remains
the real authenticator. Public/gated binds still reject them, and
cross-site http(s) origins are still rejected everywhere.

* fix(desktop): resolve renderer assets relative to BASE_URL

Absolute public asset paths (/apple-touch-icon.png, /ds-assets/...) work
under the dev server but break in the packaged app, where the renderer is
loaded from file://.../index.html and a leading slash resolves to the
filesystem root -> broken onboarding provider icon and backdrop image on
macOS. Prefix these with import.meta.env.BASE_URL so they resolve next to
the bundled index.html in both dev and packaged builds.

* feat(desktop): automate first-launch bootstrap on macOS/Linux

Previously a packaged macOS/Linux app with no Hermes install hit a
dead-end ("first-launch install is not yet automated -- run install.sh
manually") because install.sh lacked the staged protocol install.ps1
exposes. Now both platforms bootstrap on first launch with the same
structured, per-step progress UI as Windows.

- install.sh: add --manifest / --stage / --json / --non-interactive plus
  a stage dispatcher (prerequisites, repository, venv, python-deps,
  node-deps, path, config, setup, gateway, complete). User-input stages
  (setup, gateway) are skipped under --non-interactive; the in-app
  onboarding overlay owns API keys/model, matching the Windows flow.
  Each stage runs inside the install dir (its own process) and a new
  --commit flag pins the checkout to the build-stamp SHA.
- bootstrap-runner.cjs: drive the staged manifest/stage/JSON protocol for
  both install.ps1 (PowerShell) and install.sh (bash), selected by
  installer kind; removed the single-blob POSIX shim.
- main.cjs: drop the macOS/Linux unsupported-platform dead-end so the
  bootstrap-needed path runs the installer on every platform.

* fix(dashboard): return 404 JSON for unmatched /api paths instead of SPA HTML

The SPA catch-all (serve_spa) served index.html for any unmatched GET,
including unregistered /api/* endpoints. A missing API route therefore
came back as <!doctype html> with status 200, and JSON clients (the
desktop app's fetchJson) crashed with an opaque
'SyntaxError: Unexpected token <' instead of a clear error.

- web_server.py: unmatched /api or /api/... now returns 404 JSON
  ('No such API endpoint'); non-api paths still serve the SPA for
  client-side routing.
- main.cjs fetchJson: detect an HTML body / text/html content-type on a
  2xx response and reject with a clear message naming the URL, rather
  than a raw JSON.parse SyntaxError. Empty bodies resolve to null;
  malformed JSON reports the URL plus a snippet.

* say 'OS appearance' instead of 'macOS appearance'

* feat(install): add --include-desktop stage + PowerShell-style flags to install.sh

Brings install.sh to parity with install.ps1's bootstrap surface so the
shared Rust/Tauri bootstrapper (apps/bootstrap-installer) can drive a
macOS/Linux install the same way it drives Windows.

- Accept the PowerShell-style aliases the bootstrapper emits to both
  installers: -Commit / -Branch (alongside existing -Manifest / -Stage /
  -Json / -NonInteractive).
- Add --include-desktop / -IncludeDesktop. When set, the manifest gains a
  'desktop' stage (immediately before 'complete'), and a new install_desktop
  runs a root workspace `npm install` + `npm run pack` (electron-builder
  --dir, signing auto-discovery disabled) to produce release/mac*/Hermes.app
  -- mirroring install.ps1's Install-Desktop / Stage-Desktop.
- The flag is opt-in, exactly like Windows: the signed bootstrap installer
  passes it; the Electron app's own first-launch bootstrap and the CLI
  one-liner omit it (building the desktop from inside the running app would
  clobber it).

* fix: tts endpoints

* macOS desktop: install + in-app self-update (#35607)

* fix(installer): align macOS HERMES_HOME with the rest of the stack

paths.rs computed the macOS Hermes home as ~/Library/Application Support/
hermes, but nothing else does: hermes_constants.get_hermes_home() (Python),
scripts/install.sh, and the Electron desktop's resolveHermesHome() all use
~/.hermes on macOS. The drift meant the Tauri installer wrote the install to
one directory and the desktop looked for it in another, so a fresh GUI
install never found its backend (the file's own comment warned this exact
drift would break things). Use ~/.hermes on macOS to match.

* fix(install.sh): always emit a stage result frame on failure

Stage helpers (clone_repo, install_deps, check_python, …) were written for
the monolithic flow and call `exit 1` on failure. Under `--stage`, that
terminated the process before the JSON result frame was printed, so the
installer's parse_stage_result saw "no frame" instead of a clean
{ok:false,...} contract response. Run the stage body in a subshell so an
`exit` only unwinds the subshell and the parent still emits the frame.

* feat(install.sh): auto-provision git on macOS/Linux (parity with install.ps1)

install.ps1 downloads PortableGit on Windows, but install.sh just printed a
"please install git" hint and exited — so a fresh Mac with no developer tools
(no Xcode CLT → no git) couldn't get past the clone step. check_git now tries
to install git before bailing:
  - macOS: Homebrew if present (headless), else `xcode-select --install`
    (the CLT prompt also provides the compiler some wheels need), polling for
    git to appear.
  - Linux: apt/dnf/pacman via sudo when available.
Falls back to the manual instructions only if auto-provision fails.

* feat(desktop): in-app GUI+backend self-update on macOS/Linux

On Windows the staged Hermes-Setup binary drives updates (quit → hermes
update → hermes desktop --build-only → relaunch). The mac drag-install has no
such binary, so "Update now" previously just printed `hermes update`.

Since there's no venv-shim file lock on POSIX, the desktop can drive the whole
update itself. applyUpdates now, when no staged updater exists on mac/linux:
  1. runs `hermes update --yes [--branch <current>]` (backend git pull + deps),
  2. runs `hermes desktop --build-only` (OS-aware GUI rebuild) with the
     Hermes-managed Node + venv on PATH,
  3. spawns a detached swapper that waits for this process to exit, dittos the
     freshly built Hermes.app over the running bundle, clears quarantine, and
     relaunches.
Degrades to "backend updated — restart to load the new GUI" if the rebuild
fails or there's no .app bundle to swap (dev run, Linux AppImage).

* chore: uptick

* chore: uptick

* chore: linux build

* fix(install): detect xcode-select git stub on fresh macOS

* chore: bump

* fix(desktop): repair voice dictation on Windows

Voice dictation was broken on Windows in two ways:

1. Mic access was denied. The Electron permission request handler only
   granted 'media' requests whose details.mediaTypes included 'audio',
   but Chromium on Windows frequently fires the mic request with an empty
   mediaTypes array, so getUserMedia threw NotAllowedError. The handler
   now grants audio-capture when mediaTypes includes 'audio' OR is
   empty/absent, handles the 'audioCapture' permission name, and adds a
   setPermissionCheckHandler (the synchronous path Chromium also consults
   for getUserMedia on Windows). Video is still denied.

2. Transcripts went nowhere. The composer's insertText handler (used by
   dictation and other inserts) only updated the assistant-ui composer
   store via setText, never the contentEditable editor DOM. The
   draft->editor sync effect only re-renders the editor when it is NOT
   focused, and dictation runs while the editor has/regains focus, so the
   transcript was stored but never shown and could not be sent. insertText
   now renders into the editor DOM and places the caret, mirroring
   appendExternalText.

Also hardens fetchJson: a 2xx response with an HTML body (or text/html
content-type) now rejects with a clear message naming the URL instead of
an opaque JSON.parse 'Unexpected token <' error.

* feat(desktop): route Nous subscribers onto the Tool Gateway from the GUI

When the GUI sets the main provider to Nous via POST /api/model/set, call
the same apply_nous_managed_defaults the CLI uses after model selection, so
GUI/onboarding users land on the Nous Tool Gateway the same way CLI users do
— no separate prompt, no duplicated logic.

Purely additive: apply_nous_managed_defaults skips any tool where the user
has a direct key (FIRECRAWL_API_KEY, FAL_KEY, etc.) or explicit config, so it
never overwrites a user's own setup. Only unconfigured tools get routed.

- web_server.py: in set_model_assignment (scope=main, provider=nous), resolve
  enabled toolsets and apply managed defaults; guarded so a Portal hiccup never
  blocks saving the model. Returns routed tools as gateway_tools.
- onboarding.ts: surface a 'Tool Gateway enabled' toast listing routed tools.
- types/hermes.ts: add gateway_tools to ModelAssignmentResponse.
- tests: cover nous-applies, non-nous-skips, and failure-doesnt-block-save.

* feat(desktop): mirror hermes model free/paid curation in GUI onboarding

GUI onboarding picked models[0] from /api/model/options, which ignores the
Nous free/paid tier — a free user could land on a paid default (e.g.
anthropic/claude-opus-4). Now the recommended default mirrors what `hermes
model` does.

- web_server.py: new GET /api/model/recommended-default?provider=<slug>. For
  Nous it runs the same curation as the CLI (get_curated_nous_model_ids +
  pricing + check_nous_free_tier + union_with_portal_{free,paid}_recommendations
  + partition_nous_models_by_tier) so free users get a free model and paid users
  get the curated default. Other providers fall back to the first curated model.
  Never 500s — returns empty model on error so onboarding degrades gracefully.
- hermes.ts: getRecommendedDefaultModel client + RecommendedDefaultModel type.
- onboarding.ts: fetchProviderDefaultModel prefers the recommended endpoint,
  falls back to models[0] when unavailable.
- tests: free-tier picks free model, paid-tier picks curated default, failure
  returns empty without 500.

* feat(desktop): show model pricing + free/paid tier gating in GUI picker

The CLI `hermes model` picker shows per-model $/Mtok pricing and gates paid
models on free Nous accounts. The GUI picker showed bare model names. Bring it
to parity across both the model-picker dialog and onboarding confirm card.

Backend:
- inventory.build_models_payload gains a pricing=True flag → _apply_pricing
  enriches each provider row with formatted per-model pricing
  ({input,output,cache,free}) via the same _format_price_per_mtok the CLI uses,
  and for Nous adds free_tier + unavailable_models (paid models a free user
  can't select) via check_nous_free_tier + partition_nous_models_by_tier.
  Best-effort: any pricing/tier failure is swallowed and fails open (no gating).
- /api/model/options and TUI model.options now pass pricing=True so the
  global picker and in-session picker both carry pricing.

Frontend:
- ModelOptionProvider gains pricing/free_tier/unavailable_models; new
  ModelPricing type.
- model-picker dialog renders In/Out $/Mtok (or a Free pill) per model, a
  Free tier/Pro badge on the Nous heading, and disables + grays unavailable
  paid models for free users with a 'Pro models need a paid subscription' note.
- onboarding confirm card shows the chosen model's price + tier badge.

Tests: test_inventory_pricing covers price formatting, free-tier gating,
paid no-gating, providers without pricing, and swallowed failures.

* fix(desktop): GUI model picker shows curated Nous list in curated order

Two bugs made the GUI Nous model list diverge from the `hermes model` CLI picker:

1. Backend (model_switch.py): the Nous row in list_authenticated_providers
   fell through to cached_provider_model_ids("nous"), dumping the full live
   /v1/models catalog (~50 vendor-prefixed models, alphabetical). Now it uses
   the curated list AND applies the Portal free/paid recommendation union —
   exactly like _model_flow_nous in main.py — so newly-launched models such as
   stepfun/step-3.7-flash:free surface in curated order. Best-effort: falls
   back to the curated list alone if the Portal fetch fails.

2. Frontend (model-picker.tsx): cmdk's Command had shouldFilter on (default),
   which re-sorts items by fuzzy-match score (≈alphabetical) and ignores array
   order. Set shouldFilter={false} + own the search term and do an
   order-preserving substring filter, so the backend's curated order is shown
   verbatim.

* feat(desktop): add/switch providers from the model picker via onboarding reuse

The model picker could only select models from already-authenticated
providers. Switching to a new provider had no in-app path. Rather than
duplicate provider UI, reuse the existing onboarding provider selector
(featured Nous + other providers + API-key form + device-code/PKCE flow +
model-confirm with pricing/tier).

- onboarding store: add a 'manual' flag with startManualOnboarding() /
  closeManualOnboarding(). Manual mode forces the onboarding overlay to show
  even when configured===true and refreshOnboarding no longer auto-dismisses
  on runtime-ready (the app is already working — the user is just adding or
  switching a provider).
- onboarding overlay: render when manual even if configured; show a Close
  button (the first-run flow has none since the app can't run yet).
- model picker: 'Add provider' footer button opens the onboarding selector;
  ModelResults lists only configured (model-bearing) providers.

* feat(desktop): add PUT /api/tools/toolsets/{name} enable/disable endpoint

* feat(desktop): add toggleToolset RPC binding

* feat(desktop): toolset enable/disable switch in Tools settings

* feat(desktop): tool configuration parity in GUI Tools settings

Bring the desktop GUI Tools settings to parity with the CLI `hermes tools`
for provider selection and API-key configuration.

Backend (hermes_cli/web_server.py):
- GET  /api/tools/toolsets/{name}/config  - provider matrix + key status
- PUT  /api/tools/toolsets/{name}/provider - persist provider selection

Shared core (hermes_cli/tools_config.py):
- Extract apply_provider_selection / _write_provider_config from the
  interactive _configure_provider so the CLI and GUI write identical
  config keys (web.backend, tts.provider, browser.cloud_provider, plugin
  image/video providers, use_gateway flags) through one code path.

Desktop UI:
- ToolsetConfigPanel: provider list with select, per-provider API-key
  entry (set/replace/clear/reveal via the shared env RPCs), Ready/Needs
  keys state, guidance for Nous-auth and post-setup providers.
- Wire the Configured/Needs keys pill to expand the panel inline; refresh
  the toolset list after key changes so the pill updates live.
- Add getToolsetConfig / selectToolsetProvider RPC bindings + types.

Post-setup (OAuth/install) flows still defer to the CLI; see
docs spike findings for the planned /api/tools/setup/* endpoint family.

Tests: backend round-trip + 400 cases for the new endpoints and
apply_provider_selection; desktop vitest coverage for the config panel
(provider render, select, key save). No change-detector tests.

Also removes three stale completed plan docs.

* fix(desktop): show real Hermes version + sync package.json on release

The desktop app version was disconnected from the Hermes version: the
release script bumped pyproject.toml + hermes_cli/__init__.py but never
touched apps/desktop/package.json, which sat stale at 0.0.2 (lockfile at
0.0.1).

- main.cjs: hermes:version IPC now resolves __version__ from
  hermes_cli/__init__.py (the canonical source release.py bumps) via a new
  resolveHermesVersion() helper, falling back to app.getVersion() when the
  source tree isn't readable. The About panel now always shows the live
  Hermes version and can't drift.
- release.py: update_version_files() also bumps apps/desktop/package.json
  in lockstep with pyproject (top-level version only; dep specs untouched).
- One-time catch-up: package.json 0.0.2 -> 0.15.1 and the lockfile root
  mirrors 0.0.1 -> 0.15.1.

* fix(desktop): stamp exe identity in afterPack hook so updates stay branded

The packed Hermes.exe reverted to the stock Electron icon + "Electron" name
after an in-app update. The icon/identity stamp (rcedit) lived only in
install.ps1, but the installer's --update path rebuilds the desktop via
`hermes desktop --build-only` -> `npm run pack`, which never ran install.ps1
and so never stamped the rebuilt exe.

Move the stamp into an electron-builder afterPack hook so it runs for EVERY
packed build regardless of caller (first install, hermes desktop, the update
rebuild, or a manual npm run pack):

- set-exe-identity.cjs: refactor to export stampExeIdentity(exe, desktopRoot);
  still runnable as a standalone CLI.
- after-pack.cjs (new): afterPack hook calling stampExeIdentity. Windows-only
  guard; best-effort (logs + resolves on failure, never fails the build).
- package.json: register build.afterPack.
- install.ps1: remove the now-redundant Set-DesktopExeIdentity function + call;
  the hook handles it during npm run pack.

electron-builder's own rcedit step stays disabled (signAndEditExecutable=false)
to avoid the signtool -> winCodeSign -> 7-Zip macOS-symlink crash on non-admin
Windows; the hook runs rcedit directly (pure PE resource edit, no signing).

* fix(desktop): export afterPack hook as exports.default so electron-builder runs it

The afterPack hook used `module.exports = fn`, which electron-builder's hook
loader doesn't pick up — it expects the function as the module's default
export (the same shape afterSign/notarize.cjs uses). The hook silently never
ran, so even first install shipped the stock "Electron" exe.

Switch to `exports.default = async function afterPack(...)`. Verified with a
real `npm run pack`: electron-builder now invokes the hook and the produced
release/win-unpacked/Hermes.exe carries ProductName/FileDescription=Hermes.

* chore(desktop): drop auto-build release CI in favor of manual build + upload

Remove desktop-release.yml (nightly-on-main + stable publish). Installers
are now built locally per platform and uploaded to a GitHub Release by hand;
the website points at them via NEXT_PUBLIC_HERMES_DL_* env. Update README +
docs and drop the dead desktop-nightly channel links.

* fix(desktop): stable shortcut icon + bust icon cache so updates repaint

Symptom on a freshly-installed laptop: Hermes.exe itself shows the correct
Hermes icon (Explorer reads the live exe's stamped PE resource), but the
desktop shortcut still draws the stock Electron icon.

Cause: New-DesktopShortcuts set IconLocation to "<exe>,0", so Windows cached
the icon it extracted from the exe at shortcut-creation time. On an update the
exe gets re-stamped, but the shortcut keeps rendering the stale cached bitmap.

- package.json: ship assets/icon.ico beside the exe via extraResources
  (-> resources/icon.ico). Verified with a real npm run pack.
- install.ps1 New-DesktopShortcuts: point IconLocation at resources/icon.ico
  (fallback to <exe>,0 if absent) — a dedicated .ico is cache-stable and skips
  the per-exe extraction that goes stale. Then run `ie4uinit.exe -show` to bust
  the shell icon cache so the shortcut repaints immediately instead of showing
  the old Electron icon until reboot.

Both best-effort; never fail an otherwise-good install.

* dummy update

* feat(desktop): self-heal update branch + backend contract guard

Two fixes for the bb/gui→main transition:

- Self-update self-heals: if the tracked branch (e.g. bb/gui) no longer
  exists on origin (merged + deleted), the desktop updater falls back to
  main and persists it. Read-only ls-remote probe that only flips on a
  definitive "ref absent" (exit 2), never on a transient network error, so
  already-installed clients migrate themselves with no manual flip.
- Backend contract guard: tui_gateway reports DESKTOP_BACKEND_CONTRACT in
  session runtime info; the desktop warns with a one-click "Update Hermes"
  when the backend predates the GUI's required contract (e.g. a bb/gui app
  pointed at a main checkout) instead of failing cryptically downstream.

* docs(desktop): rewrite README to match current install/update/build flow

The old README contradicted itself (claimed a bundled Python payload while
also saying it no longer bundles source) and predated cross-platform support.
Rewrite for accuracy: Linux is a first-class build target, install.sh/install.ps1
both drive the staged bootstrap, the real self-update handoff (Windows
Hermes-Setup vs in-app macOS/Linux), and the bb/gui→main self-heal + backend
contract guard.

* docs(desktop): rewrite README as a real product readme

Lead with what the app is and how to get it (download an installer, or
`hermes desktop` for existing CLI users) plus a plain-language feature list,
then keep contributor/build/internals as a clearly separated secondary section.

* docs(desktop): fix install framing — releases no longer auto-build installers

Lead with the install-with-Hermes path (`--include-desktop` / `hermes desktop`),
which always works, and describe prebuilt installers as manually published when
a release ships them rather than implying CI attaches them to every release.

* docs(desktop): match base repo README style

Adopt the root README's conventions: centered title + badge row, bold
one-liner intro, a feature <table> grid, --- section dividers, and a
Community / License footer.

* feat(desktop): recover from gateway boot failures + validate API keys on entry (#35864)

Fresh installs that hit a gateway boot failure had no recovery path: the
shell rendered dead ("gateway offline"), logs were undiscoverable, and a
mistyped API key was accepted because onboarding only checked credential
presence, not validity.

- Add BootFailureOverlay: a top-level recovery surface (Retry, Repair
  install, Use local gateway, Open logs + inline recent logs) that mounts
  on any hard boot failure, including post-install. Trims the now-redundant
  recovery button from the onboarding Preparing panel.
- Add hermes:logs:reveal / :recent IPC (reveal desktop.log) and a
  hermes:bootstrap:repair IPC that drops the bootstrap marker to force a
  clean reinstall. Surface "Open logs" in Gateway settings too.
- Add POST /api/providers/validate: a live per-provider probe
  (OpenRouter/OpenAI/xAI/Gemini key check, local endpoint connectivity)
  wired into saveOnboardingApiKey so a rejected key blocks before it's
  persisted, while an unreachable probe falls through (offline-safe).

* test(model-catalog): fix stale nous picker test after curated-list change

ac2e48907 made the GUI/picker Nous row use the curated list (curated["nous"]
= get_curated_nous_model_ids()) + Portal union, matching the `hermes model`
CLI — but test_picker_nous_row_uses_manifest still asserted the old 2-model
manifest snapshot, breaking the test shard.

Rewrite it as an invariant: stub the Portal union to passthrough and assert the
row equals get_curated_nous_model_ids() computed under the same conditions, so
it tracks the real contract instead of a hardcoded model list that rots on every
catalog update.

---------

Co-authored-by: emozilla <emozilla@nousresearch.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Austin Pickett <pickett.austin@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: ethernet <arilotter@gmail.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
2026-05-31 17:46:56 -05:00

6468 lines
244 KiB
Python

"""
Gateway subcommand for hermes CLI.
Handles: hermes gateway [run|start|stop|restart|status|install|uninstall|setup]
"""
import asyncio
import logging
import os
import shutil
import signal
import subprocess
import sys
import textwrap
from dataclasses import dataclass
from pathlib import Path
PROJECT_ROOT = Path(__file__).parent.parent.resolve()
from gateway.status import terminate_pid
from gateway.restart import (
DEFAULT_GATEWAY_RESTART_DRAIN_TIMEOUT,
GATEWAY_SERVICE_RESTART_EXIT_CODE,
parse_restart_drain_timeout,
)
from hermes_cli.config import (
get_env_value,
get_hermes_home,
is_managed,
managed_error,
read_raw_config,
save_env_value,
)
# display_hermes_home is imported lazily at call sites to avoid ImportError
# when hermes_constants is cached from a pre-update version during `hermes update`.
from hermes_cli.setup import (
print_header,
print_info,
print_success,
print_warning,
print_error,
prompt,
prompt_choice,
prompt_yes_no,
)
from hermes_cli.colors import Colors, color
logger = logging.getLogger(__name__)
# =============================================================================
# Process Management (for manual gateway runs)
# =============================================================================
@dataclass(frozen=True)
class GatewayRuntimeSnapshot:
manager: str
service_installed: bool = False
service_running: bool = False
gateway_pids: tuple[int, ...] = ()
service_scope: str | None = None
@property
def running(self) -> bool:
return self.service_running or bool(self.gateway_pids)
@property
def has_process_service_mismatch(self) -> bool:
return self.service_installed and self.running and not self.service_running
@dataclass(frozen=True)
class ProfileGatewayProcess:
profile: str
path: Path
pid: int
def _get_service_pids() -> set:
"""Return PIDs currently managed by systemd or launchd gateway services.
Used to avoid killing freshly-restarted service processes when sweeping
for stale manual gateway processes after a service restart. Relies on the
service manager having committed the new PID before the restart command
returns (true for both systemd and launchd in practice).
"""
pids: set = set()
# --- systemd (Linux): user and system scopes ---
if supports_systemd_services():
for scope_args in [["systemctl", "--user"], ["systemctl"]]:
try:
result = subprocess.run(
scope_args
+ [
"list-units",
"hermes-gateway*",
"--plain",
"--no-legend",
"--no-pager",
],
capture_output=True,
text=True,
timeout=5,
)
for line in result.stdout.strip().splitlines():
parts = line.split()
if not parts or not parts[0].endswith(".service"):
continue
svc = parts[0]
try:
show = subprocess.run(
scope_args + ["show", svc, "--property=MainPID", "--value"],
capture_output=True,
text=True,
timeout=5,
)
pid = int(show.stdout.strip())
if pid > 0:
pids.add(pid)
except (ValueError, subprocess.TimeoutExpired):
pass
except (FileNotFoundError, subprocess.TimeoutExpired):
pass
# --- launchd (macOS) ---
if is_macos():
try:
label = get_launchd_label()
result = subprocess.run(
["launchctl", "list", label],
capture_output=True,
text=True,
timeout=5,
)
if result.returncode == 0:
# Output: "PID\tStatus\tLabel" header, then one data line
for line in result.stdout.strip().splitlines():
parts = line.split()
if len(parts) >= 3 and parts[2] == label:
try:
pid = int(parts[0])
if pid > 0:
pids.add(pid)
except ValueError:
pass
except (FileNotFoundError, subprocess.TimeoutExpired):
pass
return pids
def _get_parent_pid(pid: int) -> int | None:
"""Return the parent PID for ``pid``, or ``None`` when unavailable.
Uses psutil (core dependency) which works on every platform. The
older implementation shelled out to ``ps -o ppid= -p <pid>``, which
silently fails on Windows (no ``ps``) so the ancestor walk terminated
at self — the caller's dedup / exclude logic then couldn't distinguish
"hermes CLI that invoked this scan" from "real gateway process".
"""
if pid <= 1:
return None
try:
import psutil # type: ignore
return psutil.Process(pid).ppid() or None
except ImportError:
pass
except Exception:
return None
# Fallback: shell out to ps (POSIX only — bare ``ps`` doesn't exist on Windows).
if not shutil.which("ps"):
return None
try:
result = subprocess.run(
["ps", "-o", "ppid=", "-p", str(pid)],
capture_output=True,
text=True,
timeout=5,
)
except (FileNotFoundError, subprocess.TimeoutExpired):
return None
if result.returncode != 0:
return None
raw = result.stdout.strip()
if not raw:
return None
try:
parent_pid = int(raw.splitlines()[-1].strip())
except ValueError:
return None
return parent_pid if parent_pid > 0 else None
def _is_pid_ancestor_of_current_process(target_pid: int) -> bool:
"""Return True when ``target_pid`` is this process or one of its ancestors."""
if target_pid <= 0:
return False
pid = os.getpid()
seen: set[int] = set()
while pid and pid not in seen:
if pid == target_pid:
return True
seen.add(pid)
pid = _get_parent_pid(pid) or 0
return False
def _request_gateway_self_restart(pid: int) -> bool:
"""Ask a running gateway ancestor to restart itself asynchronously."""
if not hasattr(signal, "SIGUSR1"):
return False
if not _is_pid_ancestor_of_current_process(pid):
return False
try:
os.kill(pid, signal.SIGUSR1) # windows-footgun: ok — POSIX signal, guarded by hasattr(signal, 'SIGUSR1') above
except (ProcessLookupError, PermissionError, OSError):
return False
return True
def _graceful_restart_via_sigusr1(pid: int, drain_timeout: float) -> bool:
"""Send SIGUSR1 to a gateway PID and wait for it to exit gracefully.
SIGUSR1 is wired in gateway/run.py to ``request_restart(via_service=True)``
which drains in-flight agent runs (up to ``agent.restart_drain_timeout``
seconds), then exits with code 75. Both systemd (``Restart=always``
+ ``RestartForceExitStatus=75``) and launchd (``KeepAlive.SuccessfulExit
= false``) relaunch the process after the graceful exit.
This is the drain-aware alternative to ``systemctl restart`` / ``SIGTERM``,
which SIGKILL in-flight agents after a short timeout.
Args:
pid: Gateway process PID (systemd MainPID, launchd PID, or bare
process PID).
drain_timeout: Seconds to wait for the process to exit after sending
SIGUSR1. Should be slightly larger than the gateway's
``agent.restart_drain_timeout`` to allow the drain loop to
finish cleanly.
Returns:
True if the PID was signalled and exited within the timeout.
False if SIGUSR1 couldn't be sent or the process didn't exit in
time (caller should fall back to a harder restart path).
"""
if not hasattr(signal, "SIGUSR1"):
return False
if pid <= 0:
return False
try:
os.kill(pid, signal.SIGUSR1) # windows-footgun: ok — POSIX signal, guarded by hasattr(signal, 'SIGUSR1') above
except ProcessLookupError:
# Already gone — nothing to drain.
return True
except (PermissionError, OSError):
return False
import time as _time
deadline = _time.monotonic() + max(drain_timeout, 1.0)
# IMPORTANT Windows note: ``os.kill(pid, 0)`` is NOT a no-op on
# Windows — Python's implementation calls ``TerminateProcess(handle, 0)``
# for sig=0, hard-killing the target. Use the cross-platform
# ``_pid_exists`` helper in gateway.status which does OpenProcess +
# WaitForSingleObject on Windows.
from gateway.status import _pid_exists
while _time.monotonic() < deadline:
if not _pid_exists(pid):
return True
_time.sleep(0.5)
# Drain didn't finish in time.
return False
def _get_ancestor_pids() -> set[int]:
"""Return the set of PIDs in the current process's ancestor chain.
Walks from the current PID up to PID 1 (init) so that process-table scans
never match the calling CLI process or any of its parents. This prevents
``hermes gateway status`` from falsely counting the ``hermes`` CLI that
invoked it as a running gateway instance (see #13242).
"""
ancestors: set[int] = set()
pid = os.getpid()
# Cap iterations to avoid infinite loops on exotic platforms.
for _ in range(64):
ancestors.add(pid)
parent = _get_parent_pid(pid)
if parent is None or parent <= 0 or parent in ancestors:
break
pid = parent
return ancestors
def _append_unique_pid(
pids: list[int], pid: int | None, exclude_pids: set[int]
) -> None:
if pid is None or pid <= 0:
return
if pid == os.getpid() or pid in exclude_pids or pid in pids:
return
pids.append(pid)
def _scan_gateway_pids(exclude_pids: set[int], all_profiles: bool = False) -> list[int]:
"""Best-effort process-table scan for gateway PIDs.
This supplements the profile-scoped PID file so status views can still spot
a live gateway when the PID file is stale/missing, and ``--all`` sweeps can
discover gateways outside the current profile.
"""
# Exclude the entire ancestor chain so the CLI process that invoked this
# scan (e.g. ``hermes gateway status``) is never mistaken for a running
# gateway. See #13242.
exclude_pids = exclude_pids | _get_ancestor_pids()
pids: list[int] = []
patterns = [
"hermes_cli.main gateway",
"hermes_cli.main --profile",
"hermes_cli.main -p",
"hermes_cli/main.py gateway",
"hermes_cli/main.py --profile",
"hermes_cli/main.py -p",
"hermes gateway",
# Windows: only match invocations that actually carry the ``gateway``
# subcommand or the gateway-dedicated console-script shim. Bare
# ``hermes.exe --profile`` / ``hermes.exe -p`` would also match
# ``hermes.exe --profile foo dashboard`` and other CLI subcommands,
# producing false-positive gateway PIDs (Copilot review).
"hermes.exe gateway",
"hermes-gateway.exe",
"gateway/run.py",
]
current_home = str(get_hermes_home().resolve())
current_home_lc = current_home.lower()
current_profile_arg = _profile_arg(current_home)
current_profile_name = (
current_profile_arg.split()[-1] if current_profile_arg else ""
)
current_profile_name_lc = current_profile_name.lower()
def _matches_current_profile(command: str) -> bool:
command_lc = command.lower()
if current_profile_name:
return (
f"--profile {current_profile_name_lc}" in command_lc
or f"-p {current_profile_name_lc}" in command_lc
or f"hermes_home={current_home_lc}" in command_lc
)
# Default-profile case: no profile flag in argv. Accept as long as
# the command doesn't advertise *some other* profile. HERMES_HOME
# may be passed via env (not visible in wmic/CIM command line) so
# its absence is NOT disqualifying — only a non-matching explicit
# HERMES_HOME= in argv is.
if "--profile " in command_lc or " -p " in command_lc:
return False
if (
"hermes_home=" in command_lc
and f"hermes_home={current_home_lc}" not in command_lc
):
return False
return True
try:
if is_windows():
# Prefer wmic when present (fast, stable output format). On
# modern Windows 11 / Win 10 late builds, wmic has been
# removed as part of the WMIC deprecation — fall back to
# PowerShell's Get-CimInstance. Any OSError here (FileNotFoundError
# on missing wmic) trips the fallback.
wmic_path = shutil.which("wmic")
used_fallback = False
result = None
if wmic_path is not None:
try:
result = subprocess.run(
[
wmic_path,
"process",
"get",
"ProcessId,CommandLine",
"/FORMAT:LIST",
],
capture_output=True,
text=True,
encoding="utf-8",
errors="ignore",
timeout=10,
)
except (OSError, subprocess.TimeoutExpired):
result = None
if result is None or result.returncode != 0 or not (result.stdout or ""):
# Fallback: PowerShell Get-CimInstance, emit LIST-style output
# so the downstream parser below doesn't need to branch.
powershell = shutil.which("powershell") or shutil.which("pwsh")
if powershell is None:
return []
ps_cmd = (
"Get-CimInstance Win32_Process | "
"ForEach-Object { "
" 'CommandLine=' + ($_.CommandLine -replace \"`r`n\",' ' -replace \"`n\",' '); "
" 'ProcessId=' + $_.ProcessId; "
" '' "
"}"
)
try:
result = subprocess.run(
[powershell, "-NoProfile", "-Command", ps_cmd],
capture_output=True,
text=True,
encoding="utf-8",
errors="ignore",
timeout=15,
)
except (OSError, subprocess.TimeoutExpired):
return []
used_fallback = True
if result.returncode != 0 or result.stdout is None:
return []
current_cmd = ""
for line in result.stdout.split("\n"):
line = line.strip()
if line.startswith("CommandLine="):
current_cmd = line[len("CommandLine=") :]
elif line.startswith("ProcessId="):
pid_str = line[len("ProcessId=") :]
current_cmd_lc = current_cmd.lower()
if any(p in current_cmd_lc for p in patterns) and (
all_profiles or _matches_current_profile(current_cmd)
):
try:
_append_unique_pid(pids, int(pid_str), exclude_pids)
except ValueError:
pass
current_cmd = ""
else:
# Try /proc first (works in Docker without procps installed),
# fall back to ps -A eww.
_found_via_proc = False
if os.path.isdir("/proc"):
try:
my_pid = os.getpid()
for entry in os.listdir("/proc"):
if not entry.isdigit():
continue
pid = int(entry)
if pid == my_pid or pid in exclude_pids:
continue
try:
cmdline = (
open(f"/proc/{pid}/cmdline", "rb")
.read()
.decode("utf-8", errors="replace")
)
cmdline = cmdline.replace("\x00", " ")
cmdline_lc = cmdline.lower()
if any(p in cmdline_lc for p in patterns) and (
all_profiles or _matches_current_profile(cmdline)
):
_append_unique_pid(pids, pid, exclude_pids)
except (OSError, PermissionError):
continue
_found_via_proc = True
except Exception:
pass
if not _found_via_proc:
result = subprocess.run(
["ps", "-A", "eww", "-o", "pid=,command="],
capture_output=True,
text=True,
timeout=10,
)
if result.returncode != 0:
return []
for line in result.stdout.split("\n"):
stripped = line.strip()
if not stripped or "grep" in stripped:
continue
pid = None
command = ""
parts = stripped.split(None, 1)
if len(parts) == 2:
try:
pid = int(parts[0])
command = parts[1]
except ValueError:
pid = None
if pid is None:
aux_parts = stripped.split()
if len(aux_parts) > 10 and aux_parts[1].isdigit():
pid = int(aux_parts[1])
command = " ".join(aux_parts[10:])
if pid is None:
continue
command_lc = command.lower()
if any(pattern in command_lc for pattern in patterns) and (
all_profiles or _matches_current_profile(command)
):
_append_unique_pid(pids, pid, exclude_pids)
except (OSError, subprocess.TimeoutExpired):
return []
# Windows-specific: collapse venv launcher stubs. A venv-built
# ``pythonw.exe`` in ``<venv>/Scripts/`` is a ~100 KB launcher exe
# that spawns the base Python (e.g. ``C:\Program Files\Python311\
# pythonw.exe``) with the same command line, preserving the venv's
# ``pyvenv.cfg`` context. This is standard Windows CPython venv
# behaviour — BUT it means every gateway run produces two pythonw
# PIDs with identical command lines (one launcher stub, one actual
# interpreter) which is confusing in ``gateway status`` output.
# Filter the stub: if a PID in our result is the PARENT of another
# PID in our result, and both are pythonw.exe, the parent is the
# launcher stub — drop it, keep the child.
if is_windows() and len(pids) > 1:
pids = _filter_venv_launcher_stubs(pids)
return pids
def _filter_venv_launcher_stubs(pids: list[int]) -> list[int]:
"""Drop venv-launcher ``pythonw.exe`` stubs that are parents of the real
interpreter process. See comment at the tail of ``_scan_gateway_pids``.
Uses ``psutil`` (core dependency). Safe on any platform; only invoked
on Windows by the caller because the stub pattern is Windows-specific.
"""
try:
import psutil # type: ignore
except ImportError:
return pids
pid_set = set(pids)
# Collect each PID's parent so we can flag "child of another matched PID".
parent_of: dict[int, int | None] = {}
for pid in pids:
try:
parent_of[pid] = psutil.Process(pid).ppid()
except (psutil.NoSuchProcess, psutil.AccessDenied):
parent_of[pid] = None
# For each child whose parent is also in our set, drop the parent.
drop: set[int] = set()
for pid, ppid in parent_of.items():
if ppid is not None and ppid in pid_set:
drop.add(ppid)
return [p for p in pids if p not in drop]
def find_gateway_pids(
exclude_pids: set | None = None, all_profiles: bool = False
) -> list:
"""Find PIDs of running gateway processes.
Args:
exclude_pids: PIDs to exclude from the result (e.g. service-managed
PIDs that should not be killed during a stale-process sweep).
all_profiles: When ``True``, return gateway PIDs across **all**
profiles (the pre-7923 global behaviour). ``hermes update``
needs this because a code update affects every profile.
When ``False`` (default), only PIDs belonging to the current
Hermes profile are returned.
"""
_exclude = set(exclude_pids or set())
pids: list[int] = []
if not all_profiles:
try:
from gateway.status import get_running_pid
_append_unique_pid(pids, get_running_pid(), _exclude)
except Exception:
pass
for pid in _get_service_pids():
_append_unique_pid(pids, pid, _exclude)
for pid in _scan_gateway_pids(_exclude, all_profiles=all_profiles):
_append_unique_pid(pids, pid, _exclude)
return pids
def find_profile_gateway_processes(
exclude_pids: set | None = None,
) -> list[ProfileGatewayProcess]:
"""Return running gateway PIDs mapped to Hermes profiles via PID files."""
_exclude = set(exclude_pids or set())
processes: list[ProfileGatewayProcess] = []
try:
from gateway.status import get_running_pid
from hermes_cli.profiles import list_profiles
except Exception:
return processes
seen: set[int] = set()
for profile in list_profiles():
try:
pid = get_running_pid(profile.path / "gateway.pid", cleanup_stale=False)
except Exception:
continue
if pid is None or pid <= 0 or pid in _exclude or pid in seen:
continue
seen.add(pid)
processes.append(
ProfileGatewayProcess(profile=profile.name, path=profile.path, pid=pid)
)
return processes
def _gateway_run_args_for_profile(profile: str) -> list[str]:
args = [get_python_path(), "-m", "hermes_cli.main"]
if profile != "default":
args.extend(["--profile", profile])
args.extend(["gateway", "run", "--replace"])
return args
def launch_detached_profile_gateway_restart(profile: str, old_pid: int) -> bool:
"""Relaunch a manually-run profile gateway after its current PID exits."""
if old_pid <= 0:
return False
# The watcher is a tiny Python subprocess that polls the old PID and
# respawns the gateway once it's gone. Both legs of the chain need
# platform-appropriate detach semantics:
#
# POSIX — ``start_new_session=True`` (os.setsid in the child) detaches
# from the parent's process group so Ctrl+C in the CLI doesn't
# propagate and the watcher/gateway survive the CLI exiting.
#
# Windows — ``start_new_session`` is silently accepted but does NOT
# detach. The watcher stays attached to the CLI's console and dies
# when the user closes the terminal, leaving ``hermes update`` users
# with no running gateway until they re-invoke ``hermes gateway``
# manually. The Win32 equivalent is the ``CREATE_NEW_PROCESS_GROUP |
# DETACHED_PROCESS | CREATE_NO_WINDOW`` creationflags bundle.
#
# ``windows_detach_popen_kwargs()`` returns the right kwargs for the
# host platform and is a no-op on POSIX (just ``start_new_session=True``).
from hermes_cli._subprocess_compat import windows_detach_popen_kwargs
watcher = textwrap.dedent(
"""
import os
import subprocess
import sys
import time
pid = int(sys.argv[1])
cmd = sys.argv[2:]
deadline = time.monotonic() + 120
while time.monotonic() < deadline:
# ``os.kill(pid, 0)`` is not a no-op on Windows — use the
# cross-platform existence check.
from gateway.status import _pid_exists
if not _pid_exists(pid):
break
time.sleep(0.2)
# Platform-appropriate detach for the respawned gateway. On POSIX
# start_new_session=True maps to os.setsid; on Windows we need
# explicit creationflags because start_new_session is a no-op there.
_popen_kwargs = {
"stdout": subprocess.DEVNULL,
"stderr": subprocess.DEVNULL,
}
if sys.platform == "win32":
_CREATE_NEW_PROCESS_GROUP = 0x00000200
_DETACHED_PROCESS = 0x00000008
_CREATE_NO_WINDOW = 0x08000000
_popen_kwargs["creationflags"] = (
_CREATE_NEW_PROCESS_GROUP | _DETACHED_PROCESS | _CREATE_NO_WINDOW
)
else:
_popen_kwargs["start_new_session"] = True
subprocess.Popen(cmd, **_popen_kwargs)
"""
).strip()
try:
# Same platform-aware detach for the watcher process itself — so
# closing the user's terminal doesn't kill the watcher.
subprocess.Popen(
[
sys.executable,
"-c",
watcher,
str(old_pid),
*_gateway_run_args_for_profile(profile),
],
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
**windows_detach_popen_kwargs(),
)
except OSError:
return False
return True
def _probe_systemd_service_running(system: bool = False) -> tuple[bool, bool]:
selected_system = _select_systemd_scope(system)
unit_exists = get_systemd_unit_path(system=selected_system).exists()
if not unit_exists:
return selected_system, False
try:
result = _run_systemctl(
["is-active", get_service_name()],
system=selected_system,
capture_output=True,
text=True,
timeout=10,
)
except (RuntimeError, subprocess.TimeoutExpired):
return selected_system, False
return selected_system, result.stdout.strip() == "active"
def _read_systemd_unit_environment(system: bool = False) -> dict[str, str]:
"""Parse the gateway unit's ``Environment=`` directives.
``systemctl show -p Environment`` returns a single line of
space-separated ``KEY=VALUE`` pairs; values are not quoted in the output
even when the unit file quoted them. We split on whitespace and ``=``.
"""
selected_system = _select_systemd_scope(system)
try:
result = _run_systemctl(
[
"show",
get_service_name(),
"--no-pager",
"--property",
"Environment",
],
system=selected_system,
capture_output=True,
text=True,
timeout=10,
)
except (RuntimeError, subprocess.TimeoutExpired, OSError):
return {}
if result.returncode != 0:
return {}
parsed: dict[str, str] = {}
for line in result.stdout.splitlines():
if not line.startswith("Environment="):
continue
body = line[len("Environment=") :].strip()
for token in body.split():
if "=" not in token:
continue
key, value = token.split("=", 1)
parsed[key] = value
return parsed
def _sync_hermes_home_from_systemd_unit(system: bool) -> None:
"""When acting on a system-scope unit, adopt its ``HERMES_HOME``.
Under ``sudo``, ``HERMES_HOME`` is stripped and ``HOME=/root``, so
:func:`get_hermes_home` falls back to ``/root/.hermes`` — the wrong
profile. The unit file pins ``HERMES_HOME`` for the actual gateway
process, so we mirror that into our own environment to make
``read_runtime_status`` / ``get_running_pid`` read the correct files.
"""
if not system:
return
env = _read_systemd_unit_environment(system=True)
unit_home = env.get("HERMES_HOME", "").strip()
if not unit_home:
return
current = os.environ.get("HERMES_HOME", "").strip()
if current == unit_home:
return
os.environ["HERMES_HOME"] = unit_home
def _read_systemd_unit_properties(
system: bool = False,
properties: tuple[str, ...] = (
"ActiveState",
"SubState",
"Result",
"ExecMainStatus",
"MainPID",
),
) -> dict[str, str]:
"""Return selected ``systemctl show`` properties for the gateway unit."""
selected_system = _select_systemd_scope(system)
try:
result = _run_systemctl(
[
"show",
get_service_name(),
"--no-pager",
"--property",
",".join(properties),
],
system=selected_system,
capture_output=True,
text=True,
timeout=10,
)
except (RuntimeError, subprocess.TimeoutExpired, OSError):
return {}
if result.returncode != 0:
return {}
parsed: dict[str, str] = {}
for line in result.stdout.splitlines():
if "=" not in line:
continue
key, value = line.split("=", 1)
parsed[key] = value.strip()
return parsed
def _systemd_main_pid_from_props(props: dict[str, str]) -> int | None:
try:
pid = int(props.get("MainPID", "0") or "0")
except (TypeError, ValueError):
return None
return pid if pid > 0 else None
def _systemd_main_pid(system: bool = False) -> int | None:
return _systemd_main_pid_from_props(_read_systemd_unit_properties(system=system))
def _read_gateway_runtime_status() -> dict | None:
try:
from gateway.status import read_runtime_status
state = read_runtime_status()
except Exception:
return None
return state if isinstance(state, dict) else None
def _gateway_runtime_status_for_pid(pid: int | None) -> dict | None:
if not pid:
return None
state = _read_gateway_runtime_status()
if not state:
return None
try:
state_pid = int(state.get("pid", 0) or 0)
except (TypeError, ValueError):
return None
return state if state_pid == pid else None
def _wait_for_systemd_service_restart(
*,
system: bool = False,
previous_pid: int | None = None,
timeout: float = 60.0,
) -> bool:
"""Wait for the gateway service to become active after a restart handoff."""
import time
svc = get_service_name()
scope_label = _service_scope_label(system).capitalize()
deadline = time.monotonic() + timeout
printed_runtime_wait = False
while time.monotonic() < deadline:
props = _read_systemd_unit_properties(system=system)
active_state = props.get("ActiveState", "")
sub_state = props.get("SubState", "")
new_pid = None
try:
from gateway.status import get_running_pid
new_pid = get_running_pid()
except Exception:
new_pid = None
if not new_pid:
new_pid = _systemd_main_pid_from_props(props)
if active_state == "active":
if new_pid and (previous_pid is None or new_pid != previous_pid):
runtime_state = _gateway_runtime_status_for_pid(new_pid)
gateway_state = (runtime_state or {}).get("gateway_state")
if gateway_state == "running":
print(f"{scope_label} service restarted (PID {new_pid})")
return True
if gateway_state == "startup_failed":
reason = (runtime_state or {}).get(
"exit_reason"
) or "startup failed"
print(
f"{scope_label} service process restarted (PID {new_pid}), but gateway startup failed: {reason}"
)
return False
if not printed_runtime_wait:
print(
f"{scope_label} service process started (PID {new_pid}); waiting for gateway runtime..."
)
printed_runtime_wait = True
if active_state == "activating" and sub_state == "auto-restart":
time.sleep(1)
continue
if _systemd_unit_is_start_limited(props):
_print_systemd_start_limit_wait(system=system)
return False
time.sleep(2)
print(
f"{scope_label} service did not become active within {int(timeout)}s.\n"
f" Check status: {'sudo ' if system else ''}hermes gateway status\n"
f" Check logs: journalctl {'--user ' if not system else ''}-u {svc} -l --since '2 min ago'"
)
return False
def _systemd_unit_is_start_limited(props: dict[str, str]) -> bool:
result = props.get("Result", "").lower()
sub_state = props.get("SubState", "").lower()
return result == "start-limit-hit" or sub_state == "start-limit-hit"
def _systemd_error_indicates_start_limit(exc: subprocess.CalledProcessError) -> bool:
parts: list[str] = []
for attr in ("stderr", "stdout", "output"):
value = getattr(exc, attr, None)
if not value:
continue
if isinstance(value, bytes):
value = value.decode(errors="replace")
parts.append(str(value))
text = "\n".join(parts).lower()
return (
"start-limit-hit" in text
or "start request repeated too quickly" in text
or "start-limit" in text
)
def _systemd_service_is_start_limited(system: bool = False) -> bool:
return _systemd_unit_is_start_limited(_read_systemd_unit_properties(system=system))
def _print_systemd_start_limit_wait(system: bool = False) -> None:
svc = get_service_name()
scope_label = _service_scope_label(system).capitalize()
scope_flag = " --system" if system else ""
systemctl_prefix = "systemctl " if system else "systemctl --user "
journal_prefix = "journalctl " if system else "journalctl --user "
print(f"{scope_label} service is temporarily rate-limited by systemd.")
print(" systemd is refusing another immediate start after repeated exits.")
print(
f" Wait for the start-limit window to expire, then run: {'sudo ' if system else ''}hermes gateway restart{scope_flag}"
)
print(f" Or clear the failed state manually: {systemctl_prefix}reset-failed {svc}")
print(f" Check logs: {journal_prefix}-u {svc} -l --since '5 min ago'")
def _recover_pending_systemd_restart(
system: bool = False, previous_pid: int | None = None
) -> bool:
"""Recover a planned service restart that is stuck in systemd state."""
props = _read_systemd_unit_properties(system=system)
if not props:
return False
try:
from gateway.status import read_runtime_status
except Exception:
return False
runtime_state = read_runtime_status() or {}
if not runtime_state.get("restart_requested"):
return False
active_state = props.get("ActiveState", "")
sub_state = props.get("SubState", "")
exec_main_status = props.get("ExecMainStatus", "")
result = props.get("Result", "")
if active_state == "activating" and sub_state == "auto-restart":
print("⏳ Service restart already pending — waiting for systemd relaunch...")
return _wait_for_systemd_service_restart(
system=system,
previous_pid=previous_pid,
)
if active_state == "failed" and (
exec_main_status == str(GATEWAY_SERVICE_RESTART_EXIT_CODE)
or result == "exit-code"
):
svc = get_service_name()
scope_label = _service_scope_label(system).capitalize()
print(
f"↻ Clearing failed state for pending {scope_label.lower()} service restart..."
)
_run_systemctl(
["reset-failed", svc],
system=system,
check=False,
timeout=30,
)
_run_systemctl(
["start", svc],
system=system,
check=False,
timeout=90,
)
return _wait_for_systemd_service_restart(
system=system,
previous_pid=previous_pid,
)
return False
def _probe_launchd_service_running() -> bool:
if not get_launchd_plist_path().exists():
return False
try:
result = subprocess.run(
["launchctl", "list", get_launchd_label()],
capture_output=True,
text=True,
timeout=10,
)
except subprocess.TimeoutExpired:
return False
return result.returncode == 0
def get_gateway_runtime_snapshot(system: bool = False) -> GatewayRuntimeSnapshot:
"""Return a unified view of gateway liveness for the current profile."""
gateway_pids = tuple(find_gateway_pids())
if is_termux():
return GatewayRuntimeSnapshot(
manager="Termux / manual process",
gateway_pids=gateway_pids,
)
from hermes_constants import is_container
if is_linux() and is_container():
# Phase 4: report s6 supervision when running under our /init.
# Other container runtimes (or containers built before Phase 2)
# still get the original "docker (foreground)" label.
try:
from hermes_cli.service_manager import detect_service_manager
if detect_service_manager() == "s6":
return GatewayRuntimeSnapshot(
manager="s6 (container supervisor)",
gateway_pids=gateway_pids,
)
except Exception:
pass # Fall through to the legacy label on any detection error.
return GatewayRuntimeSnapshot(
manager="docker (foreground)",
gateway_pids=gateway_pids,
)
if supports_systemd_services():
selected_system, service_running = _probe_systemd_service_running(system=system)
scope_label = _service_scope_label(selected_system)
return GatewayRuntimeSnapshot(
manager=f"systemd ({scope_label})",
service_installed=get_systemd_unit_path(system=selected_system).exists(),
service_running=service_running,
gateway_pids=gateway_pids,
service_scope=scope_label,
)
if is_macos():
return GatewayRuntimeSnapshot(
manager="launchd",
service_installed=get_launchd_plist_path().exists(),
service_running=_probe_launchd_service_running(),
gateway_pids=gateway_pids,
service_scope="launchd",
)
return GatewayRuntimeSnapshot(
manager="manual process",
gateway_pids=gateway_pids,
)
def _format_gateway_pids(
pids: tuple[int, ...] | list[int], *, limit: int | None = 3
) -> str:
rendered = (
[str(pid) for pid in pids[:limit] if pid > 0]
if limit is not None
else [str(pid) for pid in pids if pid > 0]
)
if limit is not None and len(pids) > limit:
rendered.append("...")
return ", ".join(rendered)
def _print_gateway_process_mismatch(snapshot: GatewayRuntimeSnapshot) -> None:
if not snapshot.has_process_service_mismatch:
return
print()
print(
"⚠ Gateway process is running for this profile, but the service is not active"
)
print(f" PID(s): {_format_gateway_pids(snapshot.gateway_pids, limit=None)}")
print(" This is usually a manual foreground/tmux/nohup run, so `hermes gateway`")
print(" can refuse to start another copy until this process stops.")
def _print_other_profiles_gateway_status() -> None:
"""Print a summary of gateway status across all profiles.
Shown at the bottom of ``hermes gateway status`` output so users with
multiple profiles can tell at a glance which gateways are running and
avoid confusing another profile's process with the current one.
"""
try:
from hermes_cli.profiles import get_active_profile_name
current = get_active_profile_name()
other_processes = [
p for p in find_profile_gateway_processes() if p.profile != current
]
if not other_processes:
return
print()
print("Other profiles:")
for proc in other_processes:
print(f"{proc.profile:<16s} — PID {proc.pid}")
except Exception:
pass
def _gateway_list() -> None:
"""List all profiles and their gateway running status.
Provides a single-command overview of every known profile and whether
its gateway is currently running, so multi-profile users don't have to
check each profile individually.
"""
try:
from hermes_cli.profiles import list_profiles, get_active_profile_name
except Exception:
print("Unable to list profiles.")
return
profiles = list_profiles()
if not profiles:
print("No profiles found.")
return
current = get_active_profile_name()
print("Gateways:")
for prof in profiles:
marker = "" if prof.gateway_running else ""
label = prof.name
if prof.name == current:
label += " (current)"
parts = [f" {marker} {label:<24s}"]
if prof.gateway_running:
try:
from gateway.status import get_running_pid
pid = get_running_pid(prof.path / "gateway.pid", cleanup_stale=False)
if pid:
parts.append(f"PID {pid}")
except Exception:
pass
else:
parts.append("not running")
print("".join(parts))
def kill_gateway_processes(
force: bool = False, exclude_pids: set | None = None, all_profiles: bool = False
) -> int:
"""Kill any running gateway processes. Returns count killed.
Args:
force: Use the platform's force-kill mechanism instead of graceful terminate.
exclude_pids: PIDs to skip (e.g. service-managed PIDs that were just
restarted and should not be killed).
all_profiles: When ``True``, kill across all profiles. Passed
through to :func:`find_gateway_pids`.
"""
pids = find_gateway_pids(exclude_pids=exclude_pids, all_profiles=all_profiles)
killed = 0
for pid in pids:
try:
terminate_pid(pid, force=force)
killed += 1
except ProcessLookupError:
# Process already gone
pass
except PermissionError:
print(f"⚠ Permission denied to kill PID {pid}")
except OSError as exc:
print(f"Failed to kill PID {pid}: {exc}")
return killed
def stop_profile_gateway() -> bool:
"""Stop only the gateway for the current profile (HERMES_HOME-scoped).
Uses the PID file written by start_gateway(), so it only kills the
gateway belonging to this profile — not gateways from other profiles.
Returns True if a process was stopped, False if none was found.
"""
try:
from gateway.status import get_running_pid, remove_pid_file
except ImportError:
return False
pid = get_running_pid()
if pid is None:
return False
try:
from gateway.status import write_planned_stop_marker
write_planned_stop_marker(pid)
except Exception:
pass
try:
os.kill(pid, signal.SIGTERM)
except ProcessLookupError:
pass # Already gone
except PermissionError:
print(f"⚠ Permission denied to kill PID {pid}")
return False
# Wait briefly for it to exit. On Windows, os.kill(pid, 0) is NOT
# a no-op — route through the cross-platform existence check.
import time as _time
from gateway.status import _pid_exists
for _ in range(20):
if not _pid_exists(pid):
break
_time.sleep(0.5)
if get_running_pid() is None:
remove_pid_file()
return True
def is_linux() -> bool:
return sys.platform.startswith("linux")
from hermes_constants import is_container, is_termux, is_wsl
def _wsl_systemd_operational() -> bool:
"""Check if systemd is actually running as PID 1 on WSL.
WSL2 with ``systemd=true`` in wsl.conf has working systemd.
WSL2 without it (or WSL1) does not — systemctl commands fail.
"""
return _systemd_operational(system=True)
def _systemd_operational(system: bool = False) -> bool:
"""Return True when the requested systemd scope is usable."""
try:
result = _run_systemctl(
["is-system-running"],
system=system,
capture_output=True,
text=True,
timeout=5,
)
# "running", "degraded", "starting" all mean systemd is PID 1
status = result.stdout.strip().lower()
return status in {"running", "degraded", "starting", "initializing"}
except (RuntimeError, subprocess.TimeoutExpired, OSError):
return False
def _container_systemd_operational() -> bool:
"""Return True when a container exposes working user or system systemd.
This is NOT our Hermes Docker image — that one runs s6-overlay as
PID 1 (since Phase 2 of the s6-overlay supervision plan) and is
detected via ``service_manager.detect_service_manager() == "s6"``.
This function handles the "container managed by something else"
case: systemd-nspawn, certain k8s pods, containers built FROM
systemd-bearing distros where the user has wired systemd as their
init. In those environments systemctl behaves identically to the
host case, so we fall through to the normal systemd code paths.
"""
if _systemd_operational(system=False):
return True
if _systemd_operational(system=True):
return True
return False
def supports_systemd_services() -> bool:
if not is_linux() or is_termux():
return False
if shutil.which("systemctl") is None:
return False
if is_wsl():
return _wsl_systemd_operational()
if is_container():
return _container_systemd_operational()
return True
def is_macos() -> bool:
return sys.platform == "darwin"
def is_windows() -> bool:
return sys.platform == "win32"
def _windows_gateway_should_absorb_console_controls() -> bool:
"""Return True for detached Windows gateway runs that should ignore Ctrl+C.
Foreground ``hermes gateway run`` must remain interruptible from
PowerShell/CMD. Detached service-style launches opt in via
``HERMES_GATEWAY_DETACHED=1``; older wrappers without the env marker are
treated as detached when no interactive stdin is attached.
"""
if not is_windows():
return False
detached = os.getenv("HERMES_GATEWAY_DETACHED", "").strip().lower()
if detached in {"1", "true", "yes", "on"}:
return True
try:
return not bool(sys.stdin and sys.stdin.isatty())
except (ValueError, OSError):
return True
# =============================================================================
# Service Configuration
# =============================================================================
_SERVICE_BASE = "hermes-gateway"
SERVICE_DESCRIPTION = "Hermes Agent Gateway - Messaging Platform Integration"
def _profile_suffix() -> str:
"""Derive a service-name suffix from the current HERMES_HOME.
Returns ``""`` for the default root, the profile name for
``<root>/profiles/<name>``, or a short hash for any other path.
Works correctly in Docker (HERMES_HOME=/opt/data) and standard deployments.
"""
import hashlib
import re
from hermes_constants import get_default_hermes_root
home = get_hermes_home().resolve()
default = get_default_hermes_root().resolve()
if home == default:
return ""
# Detect <root>/profiles/<name> pattern → use the profile name
profiles_root = (default / "profiles").resolve()
try:
rel = home.relative_to(profiles_root)
parts = rel.parts
if len(parts) == 1 and re.match(r"^[a-z0-9][a-z0-9_-]{0,63}$", parts[0]):
return parts[0]
except ValueError:
pass
# Fallback: short hash for arbitrary HERMES_HOME paths
return hashlib.sha256(str(home).encode()).hexdigest()[:8]
def _profile_arg(hermes_home: str | None = None) -> str:
"""Return ``--profile <name>`` only when HERMES_HOME is a named profile.
For ``~/.hermes/profiles/<name>``, returns ``"--profile <name>"``.
For the default profile or hash-based custom paths, returns the empty string.
Args:
hermes_home: Optional explicit HERMES_HOME path. Defaults to the current
``get_hermes_home()`` value. Should be passed when generating a
service definition for a different user (e.g. system service).
"""
import re
from hermes_constants import get_default_hermes_root
home = Path(hermes_home or str(get_hermes_home())).resolve()
default = get_default_hermes_root().resolve()
if home == default:
return ""
profiles_root = (default / "profiles").resolve()
try:
rel = home.relative_to(profiles_root)
parts = rel.parts
if len(parts) == 1 and re.match(r"^[a-z0-9][a-z0-9_-]{0,63}$", parts[0]):
return f"--profile {parts[0]}"
except ValueError:
pass
return ""
def get_service_name() -> str:
"""Derive a systemd service name scoped to this HERMES_HOME.
Default ``~/.hermes`` returns ``hermes-gateway`` (backward compatible).
Profile ``~/.hermes/profiles/coder`` returns ``hermes-gateway-coder``.
Any other HERMES_HOME appends a short hash for uniqueness.
"""
suffix = _profile_suffix()
if not suffix:
return _SERVICE_BASE
return f"{_SERVICE_BASE}-{suffix}"
def get_systemd_unit_path(system: bool = False) -> Path:
name = get_service_name()
if system:
return Path("/etc/systemd/system") / f"{name}.service"
return Path.home() / ".config" / "systemd" / "user" / f"{name}.service"
class UserSystemdUnavailableError(RuntimeError):
"""Raised when ``systemctl --user`` cannot reach the user D-Bus session.
Typically hit on fresh RHEL/Debian SSH sessions where linger is disabled
and no user@.service is running, so ``/run/user/$UID/bus`` never exists.
Carries a user-facing remediation message in ``args[0]``.
"""
class SystemScopeRequiresRootError(RuntimeError):
"""Raised when a system-scope gateway operation is attempted as non-root.
System-scope units live in ``/etc/systemd/system/`` and require root for
install / uninstall / start / stop / restart via ``systemctl``. The
previous behavior was ``sys.exit(1)`` which blew past the wizard's
``except Exception`` guards and dumped the user at a bare shell prompt
with no guidance. Raising a typed exception lets callers that can
recover (the setup wizard) print actionable remediation instead, while
``gateway_command`` still exits 1 with the same message for the direct
CLI path.
``args[0]`` carries the user-facing message, ``args[1]`` the action name.
``str(e)`` returns only the message (not the tuple repr) so format
strings like ``f"Failed: {e}"`` render cleanly.
"""
def __str__(self) -> str:
return self.args[0] if self.args else ""
def _user_dbus_socket_path() -> Path:
"""Return the expected per-user D-Bus socket path (regardless of existence)."""
xdg = os.environ.get("XDG_RUNTIME_DIR") or f"/run/user/{os.getuid()}" # windows-footgun: ok — POSIX systemd helper, never invoked on Windows
return Path(xdg) / "bus"
def _user_systemd_private_socket_path() -> Path:
"""Return the per-user systemd private socket path (regardless of existence)."""
xdg = os.environ.get("XDG_RUNTIME_DIR") or f"/run/user/{os.getuid()}" # windows-footgun: ok — POSIX systemd helper, never invoked on Windows
return Path(xdg) / "systemd" / "private"
def _user_systemd_socket_ready() -> bool:
"""Return True when user-scope systemd has a reachable control socket.
Some distros expose only the per-user systemd private socket even when the
D-Bus session bus socket is absent. ``systemctl --user`` can still work in
that configuration, so preflight checks must treat either socket as valid.
"""
return (
_user_dbus_socket_path().exists()
or _user_systemd_private_socket_path().exists()
)
def _ensure_user_systemd_env() -> None:
"""Ensure DBUS_SESSION_BUS_ADDRESS and XDG_RUNTIME_DIR are set for systemctl --user.
On headless servers (SSH sessions), these env vars may be missing even when
the user's systemd instance is running (via linger). Without them,
``systemctl --user`` fails with "Failed to connect to bus: No medium found".
We detect the standard socket path and set the vars so all subsequent
subprocess calls inherit them.
"""
uid = os.getuid() # windows-footgun: ok — POSIX systemd helper, never invoked on Windows
if "XDG_RUNTIME_DIR" not in os.environ:
runtime_dir = f"/run/user/{uid}"
if Path(runtime_dir).exists():
os.environ["XDG_RUNTIME_DIR"] = runtime_dir
if "DBUS_SESSION_BUS_ADDRESS" not in os.environ:
xdg_runtime = os.environ.get("XDG_RUNTIME_DIR", f"/run/user/{uid}")
bus_path = Path(xdg_runtime) / "bus"
if bus_path.exists():
os.environ["DBUS_SESSION_BUS_ADDRESS"] = f"unix:path={bus_path}"
def _wait_for_user_dbus_socket(timeout: float = 3.0) -> bool:
"""Poll for the user systemd runtime socket(s), up to ``timeout`` seconds.
Linger-enabled user@.service can take a second or two to spawn its control
socket(s) after ``loginctl enable-linger`` runs. Returns True once either
the user D-Bus socket or the per-user systemd private socket exists.
"""
import time
deadline = time.monotonic() + timeout
while time.monotonic() < deadline:
if _user_systemd_socket_ready():
_ensure_user_systemd_env()
return True
time.sleep(0.2)
return _user_systemd_socket_ready()
def _preflight_user_systemd(*, auto_enable_linger: bool = True) -> None:
"""Ensure ``systemctl --user`` will reach the user-scope systemd instance.
No-op when the user D-Bus socket or per-user systemd private socket is
already there (the common case on desktops and linger-enabled servers). On
fresh SSH sessions where both are missing:
* If linger is already enabled, wait briefly for user@.service to spawn
the socket.
* If linger is disabled and ``auto_enable_linger`` is True, try
``loginctl enable-linger $USER`` (works as non-root when polkit permits
it, otherwise needs sudo).
* If the socket is still missing afterwards, raise
:class:`UserSystemdUnavailableError` with a precise remediation message.
Callers should treat the exception as a terminal condition for user-scope
systemd operations and surface the message to the user.
"""
_ensure_user_systemd_env()
if _user_systemd_socket_ready():
return
import getpass
username = getpass.getuser()
linger_enabled, linger_detail = get_systemd_linger_status()
if linger_enabled is True:
if _wait_for_user_dbus_socket(timeout=3.0):
return
# Linger is on but socket still missing — unusual; fall through to error.
_raise_user_systemd_unavailable(
username,
reason="User systemd control sockets are missing even though linger is enabled.",
fix_hint=(
f" systemctl start user@{os.getuid()}.service\n" # windows-footgun: ok — POSIX systemd helper, never invoked on Windows
" (may require sudo; try again after the command succeeds)"
),
)
if auto_enable_linger and shutil.which("loginctl"):
try:
result = subprocess.run(
["loginctl", "enable-linger", username],
capture_output=True,
text=True,
check=False,
timeout=30,
)
except Exception as exc:
_raise_user_systemd_unavailable(
username,
reason=f"loginctl enable-linger failed ({exc}).",
fix_hint=f" sudo loginctl enable-linger {username}",
)
else:
if result.returncode == 0:
if _wait_for_user_dbus_socket(timeout=5.0):
print(f"✓ Enabled linger for {username} — user D-Bus now available")
return
# enable-linger succeeded but the socket never appeared.
_raise_user_systemd_unavailable(
username,
reason="Linger was enabled, but the user D-Bus socket did not appear.",
fix_hint=(
" Log out and log back in, then re-run the command.\n"
f" Or reboot and run: systemctl --user start {get_service_name()}"
),
)
detail = (
result.stderr or result.stdout or f"exit {result.returncode}"
).strip()
_raise_user_systemd_unavailable(
username,
reason=f"loginctl enable-linger was denied: {detail}",
fix_hint=f" sudo loginctl enable-linger {username}",
)
_raise_user_systemd_unavailable(
username,
reason=(
"User D-Bus session is not available "
f"({linger_detail or 'linger disabled'})."
),
fix_hint=f" sudo loginctl enable-linger {username}",
)
def _raise_user_systemd_unavailable(
username: str, *, reason: str, fix_hint: str
) -> None:
"""Build a user-facing error message and raise UserSystemdUnavailableError."""
msg = (
f"{reason}\n"
" systemctl --user cannot reach the user D-Bus session in this shell.\n"
"\n"
" To fix:\n"
f"{fix_hint}\n"
"\n"
" Alternative: run the gateway in the foreground (stays up until\n"
" you exit / close the terminal):\n"
" hermes gateway run"
)
raise UserSystemdUnavailableError(msg)
def _systemctl_cmd(system: bool = False) -> list[str]:
if not system:
_ensure_user_systemd_env()
return ["systemctl"] if system else ["systemctl", "--user"]
def _journalctl_cmd(system: bool = False) -> list[str]:
return ["journalctl"] if system else ["journalctl", "--user"]
def _run_systemctl(
args: list[str], *, system: bool = False, **kwargs
) -> subprocess.CompletedProcess:
"""Run a systemctl command, raising RuntimeError if systemctl is missing.
Defense-in-depth: callers are gated by ``supports_systemd_services()``,
but this ensures any future caller that bypasses the gate still gets a
clear error instead of a raw ``FileNotFoundError`` traceback.
"""
try:
return subprocess.run(_systemctl_cmd(system) + args, **kwargs)
except FileNotFoundError:
raise RuntimeError("systemctl is not available on this system") from None
def _service_scope_label(system: bool = False) -> str:
return "system" if system else "user"
def get_installed_systemd_scopes() -> list[str]:
scopes = []
seen_paths: set[Path] = set()
for system, label in ((False, "user"), (True, "system")):
unit_path = get_systemd_unit_path(system=system)
if unit_path in seen_paths:
continue
if unit_path.exists():
scopes.append(label)
seen_paths.add(unit_path)
return scopes
def has_conflicting_systemd_units() -> bool:
return len(get_installed_systemd_scopes()) > 1
# Legacy service names from older Hermes installs that predate the
# hermes-gateway rename. Kept as an explicit allowlist (NOT a glob) so
# profile units (hermes-gateway-*.service) and unrelated third-party
# "hermes" units are never matched.
_LEGACY_SERVICE_NAMES: tuple[str, ...] = ("hermes.service",)
# ExecStart content markers that identify a unit as running our gateway.
# A legacy unit is only flagged when its file contains one of these.
_LEGACY_UNIT_EXECSTART_MARKERS: tuple[str, ...] = (
"hermes_cli.main gateway",
"hermes_cli/main.py gateway",
"gateway/run.py",
" hermes gateway ",
"/hermes gateway ",
)
def _legacy_unit_search_paths() -> list[tuple[bool, Path]]:
"""Return ``[(is_system, base_dir), ...]`` — directories to scan for legacy units.
Factored out so tests can monkeypatch the search roots without touching
real filesystem paths.
"""
return [
(False, Path.home() / ".config" / "systemd" / "user"),
(True, Path("/etc/systemd/system")),
]
def _find_legacy_hermes_units() -> list[tuple[str, Path, bool]]:
"""Return ``[(unit_name, unit_path, is_system)]`` for legacy Hermes gateway units.
Detects unit files installed by older Hermes versions that used a
different service name (e.g. ``hermes.service`` before the rename to
``hermes-gateway.service``). When both a legacy unit and the current
``hermes-gateway.service`` are active, they fight over the same bot
token — the PR #5646 signal-recovery change turns this into a 30-second
SIGTERM flap loop.
Safety guards:
* Explicit allowlist of legacy names (no globbing). Profile units such
as ``hermes-gateway-coder.service`` and unrelated third-party
``hermes-*`` services are never matched.
* ExecStart content check — only flag units that invoke our gateway
entrypoint. A user-created ``hermes.service`` running an unrelated
binary is left untouched.
* Results are returned purely for caller inspection; this function
never mutates or removes anything.
"""
results: list[tuple[str, Path, bool]] = []
for is_system, base in _legacy_unit_search_paths():
for name in _LEGACY_SERVICE_NAMES:
unit_path = base / name
try:
if not unit_path.exists():
continue
text = unit_path.read_text(encoding="utf-8", errors="ignore")
except (OSError, PermissionError):
continue
if not any(marker in text for marker in _LEGACY_UNIT_EXECSTART_MARKERS):
# Not our gateway — leave alone
continue
results.append((name, unit_path, is_system))
return results
def has_legacy_hermes_units() -> bool:
"""Return True when any legacy Hermes gateway unit files exist."""
return bool(_find_legacy_hermes_units())
def print_legacy_unit_warning() -> None:
"""Warn about legacy Hermes gateway unit files if any are installed.
Idempotent: prints nothing when no legacy units are detected. Safe to
call from any status/install/setup path.
"""
legacy = _find_legacy_hermes_units()
if not legacy:
return
print_warning("Legacy Hermes gateway unit(s) detected from an older install:")
for name, path, is_system in legacy:
scope = "system" if is_system else "user"
print_info(f" {path} ({scope} scope)")
print_info(" These run alongside the current hermes-gateway service and")
print_info(" cause SIGTERM flap loops — both try to use the same bot token.")
print_info(" Remove them with:")
print_info(" hermes gateway migrate-legacy")
def remove_legacy_hermes_units(
interactive: bool = True,
dry_run: bool = False,
) -> tuple[int, list[Path]]:
"""Stop, disable, and remove legacy Hermes gateway unit files.
Iterates over whatever ``_find_legacy_hermes_units()`` returns — which is
an explicit allowlist of legacy names (not a glob). Profile units and
unrelated third-party services are never touched.
Args:
interactive: When True, prompt before removing. When False, remove
without asking (used when another prompt has already confirmed,
e.g. from the install flow).
dry_run: When True, list what would be removed and return.
Returns:
``(removed_count, remaining_paths)`` — remaining includes units we
couldn't remove (typically system-scope when not running as root).
"""
legacy = _find_legacy_hermes_units()
if not legacy:
print("No legacy Hermes gateway units found.")
return 0, []
user_units = [(n, p) for n, p, is_sys in legacy if not is_sys]
system_units = [(n, p) for n, p, is_sys in legacy if is_sys]
print()
print("Legacy Hermes gateway unit(s) found:")
for name, path, is_system in legacy:
scope = "system" if is_system else "user"
print(f" {path} ({scope} scope)")
print()
if dry_run:
print("(dry-run — nothing removed)")
return 0, [p for _, p, _ in legacy]
if interactive and not prompt_yes_no("Remove these legacy units?", True):
print("Skipped. Run again with: hermes gateway migrate-legacy")
return 0, [p for _, p, _ in legacy]
removed = 0
remaining: list[Path] = []
# User-scope removal
for name, path in user_units:
try:
_run_systemctl(["stop", name], system=False, check=False, timeout=90)
_run_systemctl(["disable", name], system=False, check=False, timeout=30)
path.unlink(missing_ok=True)
print(f" ✓ Removed {path}")
removed += 1
except (OSError, RuntimeError) as e:
print(f" ⚠ Could not remove {path}: {e}")
remaining.append(path)
if user_units:
try:
_run_systemctl(["daemon-reload"], system=False, check=False, timeout=30)
except RuntimeError:
pass
# System-scope removal (needs root)
if system_units:
if os.geteuid() != 0: # windows-footgun: ok — Linux systemd removal path, guarded by `if system == "Linux"` / systemd-only branch
print()
print_warning("System-scope legacy units require root to remove.")
print_info(" Re-run with: sudo hermes gateway migrate-legacy")
for _, path in system_units:
remaining.append(path)
else:
for name, path in system_units:
try:
_run_systemctl(["stop", name], system=True, check=False, timeout=90)
_run_systemctl(
["disable", name], system=True, check=False, timeout=30
)
path.unlink(missing_ok=True)
print(f" ✓ Removed {path}")
removed += 1
except (OSError, RuntimeError) as e:
print(f" ⚠ Could not remove {path}: {e}")
remaining.append(path)
try:
_run_systemctl(["daemon-reload"], system=True, check=False, timeout=30)
except RuntimeError:
pass
print()
if remaining:
print_warning(
f"{len(remaining)} legacy unit(s) still present — see messages above."
)
else:
print_success(f"Removed {removed} legacy unit(s).")
return removed, remaining
def print_systemd_scope_conflict_warning() -> None:
scopes = get_installed_systemd_scopes()
if len(scopes) < 2:
return
rendered_scopes = " + ".join(scopes)
print_warning(
f"Both user and system gateway services are installed ({rendered_scopes})."
)
print_info(" This is confusing and can make start/stop/status behavior ambiguous.")
print_info(
" Default gateway commands target the user service unless you pass --system."
)
print_info(" Keep one of these:")
print_info(" hermes gateway uninstall")
print_info(" sudo hermes gateway uninstall --system")
def _require_root_for_system_service(action: str) -> None:
if os.geteuid() != 0: # windows-footgun: ok — POSIX systemd helper, never invoked on Windows
raise SystemScopeRequiresRootError(
f"System gateway {action} requires root. Re-run with sudo.",
action,
)
def _system_service_identity(run_as_user: str | None = None) -> tuple[str, str, str]:
import getpass
import grp
import pwd
username = (
run_as_user
or os.getenv("SUDO_USER")
or os.getenv("USER")
or os.getenv("LOGNAME")
or getpass.getuser()
).strip()
if not username:
raise ValueError(
"Could not determine which user the gateway service should run as"
)
if username == "root" and not run_as_user:
raise ValueError(
"Refusing to install the gateway system service as root; pass --run-as-user root to override (e.g. in LXC containers)"
)
if username == "root":
print_warning("Installing gateway service to run as root.")
print_info(
" This is fine for LXC/container environments but not recommended on bare-metal hosts."
)
try:
user_info = pwd.getpwnam(username)
except KeyError as e:
raise ValueError(f"Unknown user: {username}") from e
group_name = grp.getgrgid(user_info.pw_gid).gr_name
return username, group_name, user_info.pw_dir
def _read_systemd_user_from_unit(unit_path: Path) -> str | None:
if not unit_path.exists():
return None
for line in unit_path.read_text(encoding="utf-8").splitlines():
if line.startswith("User="):
value = line.split("=", 1)[1].strip()
return value or None
return None
def _default_system_service_user() -> str | None:
for candidate in (os.getenv("SUDO_USER"), os.getenv("USER"), os.getenv("LOGNAME")):
if candidate and candidate.strip() and candidate.strip() != "root":
return candidate.strip()
return None
def prompt_linux_gateway_install_scope() -> str | None:
choice = prompt_choice(
" Choose how the gateway should run in the background:",
[
"User service (no sudo; best for laptops/dev boxes; may need linger after logout)",
"System service (starts on boot; requires sudo; still runs as your user)",
"Skip service install for now",
],
default=0,
)
return {0: "user", 1: "system", 2: None}[choice]
def install_linux_gateway_from_setup(force: bool = False, enable_on_startup: bool = True) -> tuple[str | None, bool]:
scope = prompt_linux_gateway_install_scope()
if scope is None:
return None, False
if scope == "system":
run_as_user = _default_system_service_user()
if os.geteuid() != 0: # windows-footgun: ok — Linux systemd install wizard, never invoked on Windows
print_warning(
" System service install requires sudo, so Hermes can't create it from this user session."
)
if run_as_user:
print_info(
f" After setup, run: sudo hermes gateway install --system --run-as-user {run_as_user}"
)
else:
print_info(
" After setup, run: sudo hermes gateway install --system --run-as-user <your-user>"
)
print_info(" Then start it with: sudo hermes gateway start --system")
return scope, False
if not run_as_user:
while True:
run_as_user = prompt(
" Run the system gateway service as which user?", default=""
)
run_as_user = (run_as_user or "").strip()
if run_as_user:
break
print_error(" Enter a username.")
systemd_install(force=force, system=True, run_as_user=run_as_user, enable_on_startup=enable_on_startup)
return scope, True
systemd_install(force=force, system=False, enable_on_startup=enable_on_startup)
return scope, True
def get_systemd_linger_status() -> tuple[bool | None, str]:
"""Return systemd linger status for the current user.
Returns:
(True, "") when linger is enabled.
(False, "") when linger is disabled.
(None, detail) when the status could not be determined.
"""
if is_termux():
return None, "not supported in Termux"
if not is_linux():
return None, "not supported on this platform"
if not shutil.which("loginctl"):
return None, "loginctl not found"
username = os.getenv("USER") or os.getenv("LOGNAME")
if not username:
try:
import pwd
username = pwd.getpwuid(os.getuid()).pw_name # windows-footgun: ok — POSIX loginctl helper, never invoked on Windows
except Exception:
return None, "could not determine current user"
try:
result = subprocess.run(
["loginctl", "show-user", username, "--property=Linger", "--value"],
capture_output=True,
text=True,
check=False,
timeout=10,
)
except Exception as e:
return None, str(e)
if result.returncode != 0:
detail = (result.stderr or result.stdout or f"exit {result.returncode}").strip()
return None, detail or "loginctl query failed"
value = (result.stdout or "").strip().lower()
if value in {"yes", "true", "1"}:
return True, ""
if value in {"no", "false", "0"}:
return False, ""
rendered = value or "<empty>"
return None, f"unexpected loginctl output: {rendered}"
def print_systemd_linger_guidance() -> None:
"""Print the current linger status and the fix when it is disabled."""
linger_enabled, linger_detail = get_systemd_linger_status()
if linger_enabled is True:
print("✓ Systemd linger is enabled (service survives logout)")
elif linger_enabled is False:
print("⚠ Systemd linger is disabled (gateway may stop when you log out)")
print(" Run: sudo loginctl enable-linger $USER")
else:
print(f"⚠ Could not verify systemd linger ({linger_detail})")
print(" If you want the gateway user service to survive logout, run:")
print(" sudo loginctl enable-linger $USER")
def _launchd_user_home() -> Path:
"""Return the real macOS user home for launchd artifacts.
Profile-mode Hermes often sets ``HOME`` to a profile-scoped directory, but
launchd user agents still live under the actual account home.
"""
import pwd
return Path(pwd.getpwuid(os.getuid()).pw_dir) # windows-footgun: ok — POSIX launchd (macOS) helper, never invoked on Windows
def get_launchd_plist_path() -> Path:
"""Return the launchd plist path, scoped per profile.
Default ``~/.hermes`` → ``ai.hermes.gateway.plist`` (backward compatible).
Profile ``~/.hermes/profiles/coder`` → ``ai.hermes.gateway-coder.plist``.
"""
suffix = _profile_suffix()
name = f"ai.hermes.gateway-{suffix}" if suffix else "ai.hermes.gateway"
return _launchd_user_home() / "Library" / "LaunchAgents" / f"{name}.plist"
def _detect_venv_dir() -> Path | None:
"""Detect the active virtualenv directory.
Checks ``sys.prefix`` first (works regardless of the directory name),
then ``VIRTUAL_ENV`` env var (covers uv-managed environments where
sys.prefix == sys.base_prefix), then falls back to probing common
directory names under PROJECT_ROOT.
Returns ``None`` when no virtualenv can be found.
"""
# If we're running inside a virtualenv, sys.prefix points to it.
if sys.prefix != sys.base_prefix:
venv = Path(sys.prefix)
if venv.is_dir():
return venv
# uv and some other tools set VIRTUAL_ENV without changing sys.prefix.
# This catches `uv run` where sys.prefix == sys.base_prefix but the
# environment IS a venv. (#8620)
_virtual_env = os.environ.get("VIRTUAL_ENV")
if _virtual_env:
venv = Path(_virtual_env)
if venv.is_dir():
return venv
# Fallback: check common virtualenv directory names under the project root.
for candidate in (".venv", "venv"):
venv = PROJECT_ROOT / candidate
if venv.is_dir():
return venv
return None
def get_python_path() -> str:
venv = _detect_venv_dir()
if venv is not None:
if is_windows():
venv_python = venv / "Scripts" / "python.exe"
else:
venv_python = venv / "bin" / "python"
if venv_python.exists():
return str(venv_python)
return sys.executable
# =============================================================================
# Systemd (Linux)
# =============================================================================
def _build_user_local_paths(home: Path, path_entries: list[str]) -> list[str]:
"""Return user-local bin dirs that exist and aren't already in *path_entries*."""
candidates = [
str(home / ".local" / "bin"), # uv, uvx, pip-installed CLIs
str(home / ".cargo" / "bin"), # Rust/cargo tools
str(home / "go" / "bin"), # Go tools
str(home / ".npm-global" / "bin"), # npm global packages
]
return [p for p in candidates if p not in path_entries and Path(p).exists()]
def _build_wsl_interop_paths(path_entries: list[str]) -> list[str]:
"""Return WSL Windows interop PATH entries for generated systemd units.
WSL shells normally inherit Windows PATH entries such as
``/mnt/c/WINDOWS/System32``. systemd user services do not, so gateway tools
that call ``powershell.exe``/``cmd.exe`` work in a terminal but fail in the
background service unless we persist the relevant entries at install time.
"""
if not is_wsl():
return []
candidates: list[str] = []
for entry in os.environ.get("PATH", "").split(os.pathsep):
if entry.startswith("/mnt/"):
candidates.append(entry)
for executable in ("powershell.exe", "cmd.exe", "explorer.exe", "wsl.exe"):
resolved = shutil.which(executable)
if resolved:
candidates.append(str(Path(resolved).parent))
for entry in (
"/mnt/c/WINDOWS/system32",
"/mnt/c/WINDOWS",
"/mnt/c/WINDOWS/System32/Wbem",
"/mnt/c/WINDOWS/System32/WindowsPowerShell/v1.0/",
"/mnt/c/WINDOWS/System32/OpenSSH/",
):
if Path(entry).exists():
candidates.append(entry)
result: list[str] = []
seen = set(path_entries)
for entry in candidates:
if entry and entry not in seen:
seen.add(entry)
result.append(entry)
return result
def _remap_path_for_user(path: str, target_home_dir: str) -> str:
"""Remap *path* from the current user's home to *target_home_dir*.
If *path* lives under ``Path.home()`` the corresponding prefix is swapped
to *target_home_dir*; otherwise the path is returned unchanged.
/root/.hermes/hermes-agent -> /home/alice/.hermes/hermes-agent
/opt/hermes -> /opt/hermes (kept as-is)
Note: this function intentionally does NOT resolve symlinks. A venv's
``bin/python`` is typically a symlink to the base interpreter (e.g. a
uv-managed CPython at ``~/.local/share/uv/python/.../python3.11``);
resolving that symlink swaps the unit's ``ExecStart`` to a bare Python
that has none of the venv's site-packages, so the service crashes on
the first ``import``. Keep the symlinked path so the venv activates
its own environment. Lexical expansion only via ``expanduser``.
"""
current_home = Path.home()
p = Path(path).expanduser()
try:
relative = p.relative_to(current_home)
return str(Path(target_home_dir) / relative)
except ValueError:
return str(p)
def _hermes_home_for_target_user(target_home_dir: str) -> str:
"""Remap the current HERMES_HOME to the equivalent under a target user's home.
When installing a system service via sudo, get_hermes_home() resolves to
root's home. This translates it to the target user's equivalent path:
/root/.hermes → /home/alice/.hermes
/root/.hermes/profiles/coder → /home/alice/.hermes/profiles/coder
/opt/custom-hermes → /opt/custom-hermes (kept as-is)
"""
current_hermes = get_hermes_home().resolve()
current_default = (Path.home() / ".hermes").resolve()
target_default = Path(target_home_dir) / ".hermes"
# Default ~/.hermes → remap to target user's default
if current_hermes == current_default:
return str(target_default)
# Profile or subdir of ~/.hermes → preserve the relative structure
try:
relative = current_hermes.relative_to(current_default)
return str(target_default / relative)
except ValueError:
# Completely custom path (not under ~/.hermes) — keep as-is
return str(current_hermes)
def _build_service_path_dirs(project_root: Path | None = None) -> list[str]:
"""Build PATH directory list for service units, excluding non-existent dirs."""
if project_root is None:
project_root = PROJECT_ROOT
def _is_dir(path: Path) -> bool:
try:
return path.is_dir()
except OSError:
return False
candidates = []
venv_bin = project_root / "venv" / "bin"
if _is_dir(venv_bin):
candidates.append(str(venv_bin))
elif sys.prefix != sys.base_prefix:
candidates.append(str(Path(sys.prefix) / "bin"))
node_bin = project_root / "node_modules" / ".bin"
if _is_dir(node_bin):
candidates.append(str(node_bin))
hermes_home = get_hermes_home()
hermes_node = hermes_home / "node" / "bin"
if _is_dir(hermes_node):
candidates.append(str(hermes_node))
hermes_nm = hermes_home / "node_modules" / ".bin"
if _is_dir(hermes_nm):
candidates.append(str(hermes_nm))
return candidates
def _stable_service_working_dir() -> str:
"""Return a WorkingDirectory that will not disappear out from under systemd.
The gateway does NOT need its cwd to be the source checkout — ``ExecStart``
uses an absolute python interpreter and ``-m hermes_cli.main``, so module
resolution does not depend on cwd. Pinning ``WorkingDirectory`` to
``PROJECT_ROOT`` (``Path(__file__).parent.parent``) is actively harmful:
when the unit is generated from a transient checkout — a ``.worktrees/``
dir, or a clone that ``hermes update`` later relocates/removes — the path
rots. systemd then fails the start at the CHDIR step (``status=200/CHDIR``,
"Changing to the requested working directory failed") *before* Python
loads, so the on-boot ``refresh_systemd_unit_if_needed()`` self-heal never
runs and ``Restart=always`` crash-loops forever on a dead directory.
``HERMES_HOME`` is the stable anchor: it is where config/state/logs live,
it never moves, and it is guaranteed to exist whenever the gateway is
meaningfully installed. Fall back to ``PROJECT_ROOT`` only if HERMES_HOME
cannot be resolved (it always can in practice).
"""
try:
home = get_hermes_home()
if home and Path(home).is_dir():
return str(Path(home).resolve())
except Exception:
pass
return str(PROJECT_ROOT)
def generate_systemd_unit(system: bool = False, run_as_user: str | None = None) -> str:
python_path = get_python_path()
working_dir = _stable_service_working_dir()
detected_venv = _detect_venv_dir()
venv_dir = str(detected_venv) if detected_venv else str(PROJECT_ROOT / "venv")
path_entries = _build_service_path_dirs()
resolved_node = shutil.which("node")
if resolved_node:
resolved_node_dir = str(Path(resolved_node).resolve().parent)
if resolved_node_dir not in path_entries:
path_entries.append(resolved_node_dir)
common_bin_paths = [
"/usr/local/sbin",
"/usr/local/bin",
"/usr/sbin",
"/usr/bin",
"/sbin",
"/bin",
]
# systemd's TimeoutStopSec must exceed the gateway's drain_timeout so
# there's budget left for post-interrupt cleanup (tool subprocess kill,
# adapter disconnect, session DB close) before systemd escalates to
# SIGKILL on the cgroup — otherwise bash/sleep tool-call children left
# by a force-interrupted agent get reaped by systemd instead of us
# (#8202). 30s of headroom covers the worst case we've observed.
_drain_timeout = int(_get_restart_drain_timeout() or 0)
restart_timeout = max(60, _drain_timeout) + 30
if system:
username, group_name, home_dir = _system_service_identity(run_as_user)
hermes_home = _hermes_home_for_target_user(home_dir)
profile_arg = _profile_arg(hermes_home)
# Remap all paths that may resolve under the calling user's home
# (e.g. /root/) to the target user's home so the service can
# actually access them.
python_path = _remap_path_for_user(python_path, home_dir)
# Anchor cwd to the target user's HERMES_HOME (stable, always exists)
# rather than a remapped source-checkout path that can rot. See
# _stable_service_working_dir() for the full rationale.
working_dir = str(hermes_home) if hermes_home else _remap_path_for_user(working_dir, home_dir)
venv_dir = _remap_path_for_user(venv_dir, home_dir)
path_entries = [_remap_path_for_user(p, home_dir) for p in path_entries]
path_entries.extend(_build_user_local_paths(Path(home_dir), path_entries))
path_entries.extend(_build_wsl_interop_paths(path_entries))
path_entries.extend(common_bin_paths)
sane_path = ":".join(path_entries)
return f"""[Unit]
Description={SERVICE_DESCRIPTION}
After=network-online.target
Wants=network-online.target
StartLimitIntervalSec=0
[Service]
Type=simple
User={username}
Group={group_name}
ExecStart={python_path} -m hermes_cli.main{f" {profile_arg}" if profile_arg else ""} gateway run --replace
WorkingDirectory={working_dir}
Environment="HOME={home_dir}"
Environment="USER={username}"
Environment="LOGNAME={username}"
Environment="PATH={sane_path}"
Environment="VIRTUAL_ENV={venv_dir}"
Environment="HERMES_HOME={hermes_home}"
Restart=always
RestartSec=5
RestartMaxDelaySec=300
RestartSteps=5
RestartForceExitStatus={GATEWAY_SERVICE_RESTART_EXIT_CODE}
KillMode=mixed
KillSignal=SIGTERM
ExecReload=/bin/kill -USR1 $MAINPID
TimeoutStopSec={restart_timeout}
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
"""
hermes_home = str(get_hermes_home().resolve())
profile_arg = _profile_arg(hermes_home)
path_entries.extend(_build_user_local_paths(Path.home(), path_entries))
path_entries.extend(_build_wsl_interop_paths(path_entries))
path_entries.extend(common_bin_paths)
sane_path = ":".join(path_entries)
return f"""[Unit]
Description={SERVICE_DESCRIPTION}
After=network-online.target
Wants=network-online.target
StartLimitIntervalSec=0
[Service]
Type=simple
ExecStart={python_path} -m hermes_cli.main{f" {profile_arg}" if profile_arg else ""} gateway run --replace
WorkingDirectory={working_dir}
Environment="PATH={sane_path}"
Environment="VIRTUAL_ENV={venv_dir}"
Environment="HERMES_HOME={hermes_home}"
Restart=always
RestartSec=5
RestartMaxDelaySec=300
RestartSteps=5
RestartForceExitStatus={GATEWAY_SERVICE_RESTART_EXIT_CODE}
KillMode=mixed
KillSignal=SIGTERM
ExecReload=/bin/kill -USR1 $MAINPID
TimeoutStopSec={restart_timeout}
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=default.target
"""
def _normalize_service_definition(text: str) -> str:
return "\n".join(line.rstrip() for line in text.strip().splitlines())
def _normalize_launchd_plist_for_comparison(text: str) -> str:
"""Normalize launchd plist text for staleness checks.
The generated plist intentionally captures a broad PATH assembled from the
invoking shell so user-installed tools remain reachable under launchd.
That makes raw text comparison unstable across shells, so ignore the PATH
payload when deciding whether the installed plist is stale.
"""
import re
normalized = _normalize_service_definition(text)
return re.sub(
r"(<key>PATH</key>\s*<string>)(.*?)(</string>)",
r"\1__HERMES_PATH__\3",
normalized,
flags=re.S,
)
def systemd_unit_is_current(system: bool = False) -> bool:
unit_path = get_systemd_unit_path(system=system)
if not unit_path.exists():
return False
installed = unit_path.read_text(encoding="utf-8")
expected_user = _read_systemd_user_from_unit(unit_path) if system else None
expected = generate_systemd_unit(system=system, run_as_user=expected_user)
return _normalize_service_definition(installed) == _normalize_service_definition(
expected
)
def refresh_systemd_unit_if_needed(system: bool = False) -> bool:
"""Rewrite the installed systemd unit when the generated definition has changed."""
unit_path = get_systemd_unit_path(system=system)
if not unit_path.exists() or systemd_unit_is_current(system=system):
return False
expected_user = _read_systemd_user_from_unit(unit_path) if system else None
new_unit = generate_systemd_unit(system=system, run_as_user=expected_user)
# ── Test-environment safety belt ─────────────────────────────────────
# The user-scope unit path resolves under ``Path.home()``, which is NOT
# sandboxed by the test conftest (only HERMES_HOME is). If a test
# exercises ``run_gateway()`` with a pytest-tmp HERMES_HOME, the freshly
# generated unit bakes that ``/tmp/pytest-of-.../hermes_test`` path into
# ``Environment="HERMES_HOME=..."``. Writing that to the developer's
# real user systemd unit file silently breaks their gateway on the next
# reboot (systemd loads the polluted env, the gateway looks at an empty
# tmp dir, and Telegram/Discord/etc. all show as "not configured").
# Refuse to write when the generated unit references a pytest tmpdir.
# Detection sniffs the unit body — tests that legitimately exercise the
# refresh flow patch ``generate_systemd_unit`` to return synthetic
# content (``"new unit\n"``) which doesn't contain these markers and
# still works.
if not system and (
"/pytest-of-" in new_unit
or '/hermes_test"' in new_unit
or "/hermes_test/" in new_unit
):
return False
unit_path.write_text(new_unit, encoding="utf-8")
_run_systemctl(["daemon-reload"], system=system, check=True, timeout=30)
print(
f"↻ Updated gateway {_service_scope_label(system)} service definition to match the current Hermes install"
)
return True
def _print_linger_enable_warning(username: str, detail: str | None = None) -> None:
print()
print("⚠ Linger not enabled — gateway may stop when you close this terminal.")
if detail:
print(f" Auto-enable failed: {detail}")
print()
print(" On headless servers (VPS, cloud instances) run:")
print(f" sudo loginctl enable-linger {username}")
print()
print(" Then restart the gateway:")
print(f" systemctl --user restart {get_service_name()}.service")
print()
def _ensure_linger_enabled() -> None:
"""Enable linger when possible so the user gateway survives logout."""
if is_termux() or not is_linux():
return
import getpass
username = getpass.getuser()
linger_file = Path(f"/var/lib/systemd/linger/{username}")
if linger_file.exists():
print("✓ Systemd linger is enabled (service survives logout)")
return
linger_enabled, linger_detail = get_systemd_linger_status()
if linger_enabled is True:
print("✓ Systemd linger is enabled (service survives logout)")
return
if not shutil.which("loginctl"):
_print_linger_enable_warning(username, linger_detail or "loginctl not found")
return
print("Enabling linger so the gateway survives SSH logout...")
try:
result = subprocess.run(
["loginctl", "enable-linger", username],
capture_output=True,
text=True,
check=False,
timeout=30,
)
except Exception as e:
_print_linger_enable_warning(username, str(e))
return
if result.returncode == 0:
print("✓ Linger enabled — gateway will persist after logout")
return
detail = (result.stderr or result.stdout or f"exit {result.returncode}").strip()
_print_linger_enable_warning(username, detail or linger_detail)
def _select_systemd_scope(system: bool = False) -> bool:
if system:
return True
return (
get_systemd_unit_path(system=True).exists()
and not get_systemd_unit_path(system=False).exists()
)
def _system_scope_wizard_would_need_root(system: bool = False) -> bool:
"""True when the setup wizard is about to trigger a system-scope operation
as a non-root user.
Replicates the decision ``_select_systemd_scope`` makes inside
``systemd_start`` / ``systemd_restart`` / ``systemd_stop`` so the wizard
can detect the dead-end BEFORE prompting, rather than letting
``SystemScopeRequiresRootError`` propagate out and leave the user
staring at a bare shell.
"""
if os.geteuid() == 0: # windows-footgun: ok — systemd scope wizard decision, never invoked on Windows
return False
return _select_systemd_scope(system=system)
def _print_system_scope_remediation(action: str) -> None:
"""Print actionable remediation when the wizard skips a system-scope
prompt because the user isn't root. Keeps the wizard flowing instead of
aborting.
"""
svc = get_service_name()
print_warning(
f"Gateway is installed as a system-wide service — " f"{action} requires root."
)
print_info(" Options:")
print_info(f" 1. {action.capitalize()} it this time:")
if action == "start":
print_info(f" sudo systemctl start {svc}")
elif action == "stop":
print_info(f" sudo systemctl stop {svc}")
elif action == "restart":
print_info(f" sudo systemctl restart {svc}")
else:
print_info(f" sudo systemctl {action} {svc}")
print_info(" 2. Switch to a per-user service (recommended for personal use):")
print_info(" sudo hermes gateway uninstall --system")
print_info(" hermes gateway install")
print_info(" hermes gateway start")
def _get_restart_drain_timeout() -> float:
"""Return the configured gateway restart drain timeout in seconds."""
raw = os.getenv("HERMES_RESTART_DRAIN_TIMEOUT", "").strip()
if not raw:
cfg = read_raw_config()
agent_cfg = cfg.get("agent", {}) if isinstance(cfg, dict) else {}
raw = str(
agent_cfg.get(
"restart_drain_timeout", DEFAULT_GATEWAY_RESTART_DRAIN_TIMEOUT
)
)
return parse_restart_drain_timeout(raw)
def systemd_install(
force: bool = False,
system: bool = False,
run_as_user: str | None = None,
enable_on_startup: bool = True,
):
if system:
_require_root_for_system_service("install")
# Offer to remove legacy units (hermes.service from pre-rename installs)
# before installing the new hermes-gateway.service. If both remain, they
# flap-fight for the Telegram bot token on every gateway startup.
# Only removes units matching _LEGACY_SERVICE_NAMES + our ExecStart
# signature — profile units are never touched.
if has_legacy_hermes_units():
print()
print_legacy_unit_warning()
print()
if prompt_yes_no("Remove the legacy unit(s) before installing?", True):
remove_legacy_hermes_units(interactive=False)
print()
unit_path = get_systemd_unit_path(system=system)
scope_flag = " --system" if system else ""
if unit_path.exists() and not force:
if not systemd_unit_is_current(system=system):
print(
f"↻ Repairing outdated {_service_scope_label(system)} systemd service at: {unit_path}"
)
refresh_systemd_unit_if_needed(system=system)
if enable_on_startup:
_run_systemctl(["enable", get_service_name()], system=system, check=True, timeout=30)
print(f"{_service_scope_label(system).capitalize()} service definition updated")
return
print(f"Service already installed at: {unit_path}")
print("Use --force to reinstall")
return
unit_path.parent.mkdir(parents=True, exist_ok=True)
print(f"Installing {_service_scope_label(system)} systemd service to: {unit_path}")
unit_path.write_text(
generate_systemd_unit(system=system, run_as_user=run_as_user), encoding="utf-8"
)
_run_systemctl(["daemon-reload"], system=system, check=True, timeout=30)
if enable_on_startup:
_run_systemctl(["enable", get_service_name()], system=system, check=True, timeout=30)
print()
enable_label = "installed and enabled" if enable_on_startup else "installed"
print(f"{_service_scope_label(system).capitalize()} service {enable_label}!")
print()
print("Next steps:")
print(
f" {'sudo ' if system else ''}hermes gateway start{scope_flag} # Start the service"
)
print(
f" {'sudo ' if system else ''}hermes gateway status{scope_flag} # Check status"
)
print(
f" {'journalctl' if system else 'journalctl --user'} -u {get_service_name()} -f # View logs"
)
print()
if system:
configured_user = _read_systemd_user_from_unit(unit_path)
if configured_user:
print(f"Configured to run as: {configured_user}")
else:
_ensure_linger_enabled()
print_systemd_scope_conflict_warning()
print_legacy_unit_warning()
def systemd_uninstall(system: bool = False):
system = _select_systemd_scope(system)
if system:
_require_root_for_system_service("uninstall")
_run_systemctl(["stop", get_service_name()], system=system, check=False, timeout=90)
_run_systemctl(
["disable", get_service_name()], system=system, check=False, timeout=30
)
unit_path = get_systemd_unit_path(system=system)
if unit_path.exists():
unit_path.unlink()
print(f"✓ Removed {unit_path}")
_run_systemctl(["daemon-reload"], system=system, check=True, timeout=30)
print(f"{_service_scope_label(system).capitalize()} service uninstalled")
def _require_service_installed(action: str, system: bool = False) -> None:
unit_path = get_systemd_unit_path(system=system)
if not unit_path.exists():
scope_flag = " --system" if system else ""
print(f"✗ Gateway service is not installed")
print(f" Run: {'sudo ' if system else ''}hermes gateway install{scope_flag}")
sys.exit(1)
def systemd_start(system: bool = False):
system = _select_systemd_scope(system)
if system:
_require_root_for_system_service("start")
else:
# Fail fast with actionable guidance if the user D-Bus session is not
# reachable (common on fresh RHEL/Debian SSH sessions without linger).
# Raises UserSystemdUnavailableError with a remediation message.
_preflight_user_systemd()
_require_service_installed("start", system=system)
refresh_systemd_unit_if_needed(system=system)
_run_systemctl(["start", get_service_name()], system=system, check=True, timeout=30)
print(f"{_service_scope_label(system).capitalize()} service started")
def systemd_stop(system: bool = False):
system = _select_systemd_scope(system)
if system:
_require_root_for_system_service("stop")
_require_service_installed("stop", system=system)
_sync_hermes_home_from_systemd_unit(system=system)
try:
from gateway.status import get_running_pid, write_planned_stop_marker
pid = get_running_pid(cleanup_stale=False)
if pid is not None:
write_planned_stop_marker(pid)
except Exception:
pass
try:
_run_systemctl(
["stop", get_service_name()], system=system, check=True, timeout=90
)
except subprocess.TimeoutExpired:
label = _service_scope_label(system)
print(
f"Gateway {label} service is still stopping after 90s; "
"check `hermes gateway status` or logs for final shutdown state."
)
return
print(f"{_service_scope_label(system).capitalize()} service stopped")
def systemd_restart(system: bool = False):
system = _select_systemd_scope(system)
if system:
_require_root_for_system_service("restart")
else:
_preflight_user_systemd()
_require_service_installed("restart", system=system)
refresh_systemd_unit_if_needed(system=system)
_sync_hermes_home_from_systemd_unit(system=system)
from gateway.status import get_running_pid
pid = get_running_pid() or _systemd_main_pid(system=system)
if pid is not None:
scope_label = _service_scope_label(system).capitalize()
svc = get_service_name()
drain_timeout = _get_restart_drain_timeout()
print(f"{scope_label} service restarting gracefully (PID {pid})...")
if _graceful_restart_via_sigusr1(pid, drain_timeout + 5):
# The gateway exits with code 75 for a planned service restart.
# RestartSec can otherwise delay the relaunch even though the
# operator asked for an immediate restart, so kick the unit once
# the old PID has exited and then wait for the replacement PID.
_run_systemctl(
["reset-failed", svc],
system=system,
check=False,
timeout=30,
)
_run_systemctl(
["restart", svc],
system=system,
check=False,
timeout=90,
)
if _wait_for_systemd_service_restart(system=system, previous_pid=pid):
return
if _systemd_service_is_start_limited(system=system):
return
print(
f"⚠ Graceful restart did not complete within {int(drain_timeout + 5)}s; "
"forcing a service restart..."
)
_run_systemctl(
["reset-failed", svc],
system=system,
check=False,
timeout=30,
)
try:
_run_systemctl(["restart", svc], system=system, check=True, timeout=90)
except subprocess.CalledProcessError as exc:
if _systemd_error_indicates_start_limit(
exc
) or _systemd_service_is_start_limited(system=system):
_print_systemd_start_limit_wait(system=system)
return
raise
except subprocess.TimeoutExpired:
label = _service_scope_label(system)
print(
f"Gateway {label} service is still restarting after 90s; "
"check `hermes gateway status` or logs for final state."
)
return
_wait_for_systemd_service_restart(system=system, previous_pid=pid)
return
if _recover_pending_systemd_restart(system=system, previous_pid=pid):
return
_run_systemctl(
["reset-failed", get_service_name()],
system=system,
check=False,
timeout=30,
)
try:
_run_systemctl(
["restart", get_service_name()], system=system, check=True, timeout=90
)
except subprocess.CalledProcessError as exc:
if _systemd_error_indicates_start_limit(
exc
) or _systemd_service_is_start_limited(system=system):
_print_systemd_start_limit_wait(system=system)
return
raise
except subprocess.TimeoutExpired:
label = _service_scope_label(system)
print(
f"Gateway {label} service is still restarting after 90s; "
"check `hermes gateway status` or logs for final state."
)
return
_wait_for_systemd_service_restart(system=system, previous_pid=pid)
def systemd_status(deep: bool = False, system: bool = False, full: bool = False):
system = _select_systemd_scope(system)
unit_path = get_systemd_unit_path(system=system)
scope_flag = " --system" if system else ""
if not unit_path.exists():
print("✗ Gateway service is not installed")
print(f" Run: {'sudo ' if system else ''}hermes gateway install{scope_flag}")
return
_sync_hermes_home_from_systemd_unit(system=system)
if has_conflicting_systemd_units():
print_systemd_scope_conflict_warning()
print()
if has_legacy_hermes_units():
print_legacy_unit_warning()
print()
if not systemd_unit_is_current(system=system):
print("⚠ Installed gateway service definition is outdated")
print(
f" Run: {'sudo ' if system else ''}hermes gateway restart{scope_flag} # auto-refreshes the unit"
)
print()
status_cmd = ["status", get_service_name(), "--no-pager"]
if full:
status_cmd.append("-l")
_run_systemctl(
status_cmd,
system=system,
capture_output=False,
timeout=10,
)
result = _run_systemctl(
["is-active", get_service_name()],
system=system,
capture_output=True,
text=True,
timeout=10,
)
status = result.stdout.strip()
if status == "active":
print(
f"{_service_scope_label(system).capitalize()} gateway service is running"
)
else:
print(
f"{_service_scope_label(system).capitalize()} gateway service is stopped"
)
print(f" Run: {'sudo ' if system else ''}hermes gateway start{scope_flag}")
configured_user = _read_systemd_user_from_unit(unit_path) if system else None
if configured_user:
print(f"Configured to run as: {configured_user}")
runtime_lines = _runtime_health_lines()
if runtime_lines:
print()
print("Recent gateway health:")
for line in runtime_lines:
print(f" {line}")
unit_props = _read_systemd_unit_properties(system=system)
active_state = unit_props.get("ActiveState", "")
sub_state = unit_props.get("SubState", "")
exec_main_status = unit_props.get("ExecMainStatus", "")
result_code = unit_props.get("Result", "")
if active_state == "activating" and sub_state == "auto-restart":
print(" ⏳ Restart pending: systemd is waiting to relaunch the gateway")
elif _systemd_unit_is_start_limited(unit_props):
print(" ⏳ Restart pending: systemd is temporarily rate-limiting starts")
print(
f" Run after the start-limit window expires: {'sudo ' if system else ''}hermes gateway restart{scope_flag}"
)
print(
f" Or clear it manually: systemctl {'--user ' if not system else ''}reset-failed {get_service_name()}"
)
elif active_state == "failed" and exec_main_status == str(
GATEWAY_SERVICE_RESTART_EXIT_CODE
):
print(" ⚠ Planned restart is stuck in systemd failed state (exit 75)")
print(
f" Run: systemctl {'--user ' if not system else ''}reset-failed {get_service_name()} && {'sudo ' if system else ''}hermes gateway start{scope_flag}"
)
elif active_state == "failed" and result_code:
print(f" ⚠ Systemd unit result: {result_code}")
if system:
print("✓ System service starts at boot without requiring systemd linger")
elif deep:
print_systemd_linger_guidance()
else:
linger_enabled, _ = get_systemd_linger_status()
if linger_enabled is True:
print("✓ Systemd linger is enabled (service survives logout)")
elif linger_enabled is False:
print("⚠ Systemd linger is disabled (gateway may stop when you log out)")
print(" Run: sudo loginctl enable-linger $USER")
if deep:
print()
print("Recent logs:")
log_cmd = _journalctl_cmd(system) + [
"-u",
get_service_name(),
"-n",
"20",
"--no-pager",
]
if full:
log_cmd.append("-l")
subprocess.run(log_cmd, timeout=10)
# =============================================================================
# Launchd (macOS)
# =============================================================================
def get_launchd_label() -> str:
"""Return the launchd service label, scoped per profile."""
suffix = _profile_suffix()
return f"ai.hermes.gateway-{suffix}" if suffix else "ai.hermes.gateway"
def _launchd_domain() -> str:
return f"gui/{os.getuid()}" # windows-footgun: ok — POSIX launchd (macOS) helper, never invoked on Windows
def generate_launchd_plist() -> str:
python_path = get_python_path()
# Stable cwd anchor — never the volatile source checkout. See
# _stable_service_working_dir() for the rationale (same rot risk applies
# to launchd's WorkingDirectory as to systemd's).
working_dir = _stable_service_working_dir()
hermes_home = str(get_hermes_home().resolve())
log_dir = get_hermes_home() / "logs"
log_dir.mkdir(parents=True, exist_ok=True)
label = get_launchd_label()
profile_arg = _profile_arg(hermes_home)
# Build a sane PATH for the launchd plist. launchd provides only a
# minimal default (/usr/bin:/bin:/usr/sbin:/sbin) which misses Homebrew,
# nvm, cargo, etc. We prepend venv/bin and node_modules/.bin (matching
# the systemd unit), then capture the user's full shell PATH so every
# user-installed tool (node, ffmpeg, …) is reachable.
detected_venv = _detect_venv_dir()
venv_dir = str(detected_venv) if detected_venv else str(PROJECT_ROOT / "venv")
# Resolve the directory containing the node binary (e.g. Homebrew, nvm)
# so it's explicitly in PATH even if the user's shell PATH changes later.
priority_dirs = _build_service_path_dirs()
resolved_node = shutil.which("node")
if resolved_node:
resolved_node_dir = str(Path(resolved_node).resolve().parent)
if resolved_node_dir not in priority_dirs:
priority_dirs.append(resolved_node_dir)
sane_path = ":".join(
dict.fromkeys(
priority_dirs + [p for p in os.environ.get("PATH", "").split(":") if p]
)
)
# Build ProgramArguments array, including --profile when using a named profile
prog_args = [
f"<string>{python_path}</string>",
"<string>-m</string>",
"<string>hermes_cli.main</string>",
]
if profile_arg:
for part in profile_arg.split():
prog_args.append(f"<string>{part}</string>")
prog_args.extend(
[
"<string>gateway</string>",
"<string>run</string>",
"<string>--replace</string>",
]
)
prog_args_xml = "\n ".join(prog_args)
return f"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>{label}</string>
<key>ProgramArguments</key>
<array>
{prog_args_xml}
</array>
<key>WorkingDirectory</key>
<string>{working_dir}</string>
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key>
<string>{sane_path}</string>
<key>VIRTUAL_ENV</key>
<string>{venv_dir}</string>
<key>HERMES_HOME</key>
<string>{hermes_home}</string>
</dict>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<dict>
<key>SuccessfulExit</key>
<false/>
</dict>
<key>StandardOutPath</key>
<string>{log_dir}/gateway.log</string>
<key>StandardErrorPath</key>
<string>{log_dir}/gateway.error.log</string>
</dict>
</plist>
"""
def launchd_plist_is_current() -> bool:
"""Check if the installed launchd plist matches the currently generated one."""
plist_path = get_launchd_plist_path()
if not plist_path.exists():
return False
installed = plist_path.read_text(encoding="utf-8")
expected = generate_launchd_plist()
return _normalize_launchd_plist_for_comparison(
installed
) == _normalize_launchd_plist_for_comparison(expected)
def refresh_launchd_plist_if_needed() -> bool:
"""Rewrite the installed launchd plist when the generated definition has changed.
Unlike systemd, launchd picks up plist changes on the next ``launchctl kill``/
``launchctl kickstart`` cycle — no daemon-reload is needed. We still bootout/
bootstrap to make launchd re-read the updated plist immediately.
"""
plist_path = get_launchd_plist_path()
if not plist_path.exists() or launchd_plist_is_current():
return False
plist_path.write_text(generate_launchd_plist(), encoding="utf-8")
label = get_launchd_label()
# Bootout/bootstrap so launchd picks up the new definition
subprocess.run(
["launchctl", "bootout", f"{_launchd_domain()}/{label}"],
check=False,
timeout=90,
)
subprocess.run(
["launchctl", "bootstrap", _launchd_domain(), str(plist_path)],
check=False,
timeout=30,
)
print(
"↻ Updated gateway launchd service definition to match the current Hermes install"
)
return True
def launchd_install(force: bool = False):
plist_path = get_launchd_plist_path()
if plist_path.exists() and not force:
if not launchd_plist_is_current():
print(f"↻ Repairing outdated launchd service at: {plist_path}")
refresh_launchd_plist_if_needed()
print("✓ Service definition updated")
return
print(f"Service already installed at: {plist_path}")
print("Use --force to reinstall")
return
plist_path.parent.mkdir(parents=True, exist_ok=True)
print(f"Installing launchd service to: {plist_path}")
plist_path.write_text(generate_launchd_plist())
subprocess.run(
["launchctl", "bootstrap", _launchd_domain(), str(plist_path)],
check=True,
timeout=30,
)
print()
print("✓ Service installed and loaded!")
print()
print("Next steps:")
print(" hermes gateway status # Check status")
from hermes_constants import display_hermes_home as _dhh
print(f" tail -f {_dhh()}/logs/gateway.log # View logs")
def launchd_uninstall():
plist_path = get_launchd_plist_path()
label = get_launchd_label()
subprocess.run(
["launchctl", "bootout", f"{_launchd_domain()}/{label}"],
check=False,
timeout=90,
)
if plist_path.exists():
plist_path.unlink()
print(f"✓ Removed {plist_path}")
print("✓ Service uninstalled")
def launchd_start():
plist_path = get_launchd_plist_path()
label = get_launchd_label()
# Self-heal if the plist is missing entirely (e.g., manual cleanup, failed upgrade)
if not plist_path.exists():
print("↻ launchd plist missing; regenerating service definition")
plist_path.parent.mkdir(parents=True, exist_ok=True)
plist_path.write_text(generate_launchd_plist(), encoding="utf-8")
subprocess.run(
["launchctl", "bootstrap", _launchd_domain(), str(plist_path)],
check=True,
timeout=30,
)
subprocess.run(
["launchctl", "kickstart", f"{_launchd_domain()}/{label}"],
check=True,
timeout=30,
)
print("✓ Service started")
return
refresh_launchd_plist_if_needed()
try:
subprocess.run(
["launchctl", "kickstart", f"{_launchd_domain()}/{label}"],
check=True,
timeout=30,
)
except subprocess.CalledProcessError as e:
if e.returncode not in {3, 113}:
raise
print("↻ launchd job was unloaded; reloading service definition")
subprocess.run(
["launchctl", "bootstrap", _launchd_domain(), str(plist_path)],
check=True,
timeout=30,
)
subprocess.run(
["launchctl", "kickstart", f"{_launchd_domain()}/{label}"],
check=True,
timeout=30,
)
print("✓ Service started")
def launchd_stop():
label = get_launchd_label()
target = f"{_launchd_domain()}/{label}"
try:
from gateway.status import get_running_pid, write_planned_stop_marker
pid = get_running_pid(cleanup_stale=False)
if pid is not None:
write_planned_stop_marker(pid)
except Exception:
pass
# bootout unloads the service definition so KeepAlive doesn't respawn
# the process. A plain `kill SIGTERM` only signals the process — launchd
# immediately restarts it because KeepAlive.SuccessfulExit = false.
# `hermes gateway start` re-bootstraps when it detects the job is unloaded.
try:
subprocess.run(["launchctl", "bootout", target], check=True, timeout=90)
except subprocess.CalledProcessError as e:
if e.returncode in {3, 113}:
pass # Already unloaded — nothing to stop.
else:
raise
_wait_for_gateway_exit(timeout=10.0, force_after=5.0)
print("✓ Service stopped")
def _wait_for_gateway_exit(
timeout: float = 10.0, force_after: float | None = 5.0
) -> bool:
"""Wait for the gateway process (by saved PID) to exit.
Uses the PID from the gateway.pid file — not launchd labels — so this
works correctly when multiple gateway instances run under separate
HERMES_HOME directories.
Args:
timeout: Total seconds to wait before giving up.
force_after: Seconds of graceful waiting before escalating to force-kill.
"""
import time
from gateway.status import get_running_pid
deadline = time.monotonic() + timeout
force_deadline = (
(time.monotonic() + force_after) if force_after is not None else None
)
force_sent = False
while time.monotonic() < deadline:
pid = get_running_pid()
if pid is None:
return True # Process exited cleanly.
if (
force_after is not None
and not force_sent
and time.monotonic() >= force_deadline
):
# Grace period expired — force-kill the specific PID.
try:
terminate_pid(pid, force=True)
print(f"⚠ Gateway PID {pid} did not exit gracefully; sent SIGKILL")
except (ProcessLookupError, PermissionError, OSError):
return True # Already gone or we can't touch it.
force_sent = True
time.sleep(0.3)
# Timed out even after force-kill.
remaining_pid = get_running_pid()
if remaining_pid is not None:
print(
f"⚠ Gateway PID {remaining_pid} still running after {timeout}s — restart may fail"
)
return False
return True
def launchd_restart():
label = get_launchd_label()
target = f"{_launchd_domain()}/{label}"
drain_timeout = _get_restart_drain_timeout()
from gateway.status import get_running_pid
try:
pid = get_running_pid()
if pid is not None and _request_gateway_self_restart(pid):
print("✓ Service restart requested")
return
if pid is not None:
try:
terminate_pid(pid, force=False)
except (ProcessLookupError, PermissionError, OSError):
pid = None
if pid is not None:
exited = _wait_for_gateway_exit(timeout=drain_timeout, force_after=None)
if not exited:
print(
f"⚠ Gateway drain timed out after {drain_timeout:.0f}s — forcing launchd restart"
)
subprocess.run(["launchctl", "kickstart", "-k", target], check=True, timeout=90)
print("✓ Service restarted")
except subprocess.CalledProcessError as e:
if e.returncode not in {3, 113}:
raise
# Job not loaded — bootstrap and start fresh
print("↻ launchd job was unloaded; reloading")
plist_path = get_launchd_plist_path()
subprocess.run(
["launchctl", "bootstrap", _launchd_domain(), str(plist_path)],
check=True,
timeout=30,
)
subprocess.run(["launchctl", "kickstart", target], check=True, timeout=30)
print("✓ Service restarted")
def launchd_status(deep: bool = False):
plist_path = get_launchd_plist_path()
label = get_launchd_label()
try:
result = subprocess.run(
["launchctl", "list", label],
capture_output=True,
text=True,
timeout=10,
)
loaded = result.returncode == 0
loaded_output = result.stdout
except subprocess.TimeoutExpired:
loaded = False
loaded_output = ""
print(f"Launchd plist: {plist_path}")
if launchd_plist_is_current():
print("✓ Service definition matches the current Hermes install")
else:
print("⚠ Service definition is stale relative to the current Hermes install")
print(" Run: hermes gateway start")
if loaded:
print("✓ Gateway service is loaded")
print(loaded_output)
else:
print("✗ Gateway service is not loaded")
print(" Service definition exists locally but launchd has not loaded it.")
print(" Run: hermes gateway start")
if deep:
log_file = get_hermes_home() / "logs" / "gateway.log"
if log_file.exists():
print()
print("Recent logs:")
subprocess.run(["tail", "-20", str(log_file)], timeout=10)
# =============================================================================
# Gateway Runner
# =============================================================================
def _truthy_env(value: str | None) -> bool:
return str(value or "").strip().lower() in {"1", "true", "yes", "on"}
def _is_official_docker_checkout() -> bool:
return (
str(PROJECT_ROOT) == "/opt/hermes"
and (PROJECT_ROOT / "docker" / "entrypoint.sh").is_file()
)
def _guard_official_docker_root_gateway() -> None:
"""Refuse gateway startup when the official Docker privilege drop was bypassed."""
if not hasattr(os, "geteuid") or os.geteuid() != 0:
return
if _truthy_env(os.getenv("HERMES_ALLOW_ROOT_GATEWAY")):
return
if not _is_official_docker_checkout():
return
print_error(
"Refusing to run the Hermes gateway as root inside the official Docker image."
)
print(
" The image entrypoint normally drops privileges to the 'hermes' user. "
"If you override entrypoint in Docker Compose, include "
"/opt/hermes/docker/entrypoint.sh before the Hermes command."
)
print(
" Running the gateway as root can leave root-owned files in "
"$HERMES_HOME and break later non-root dashboard/gateway runs."
)
print(
" Set HERMES_ALLOW_ROOT_GATEWAY=1 only if you intentionally accept this risk."
)
sys.exit(1)
def run_gateway(verbose: int = 0, quiet: bool = False, replace: bool = False):
"""Run the gateway in foreground.
Args:
verbose: Stderr log verbosity count added on top of default WARNING (0=WARNING, 1=INFO, 2+=DEBUG).
quiet: Suppress all stderr log output.
replace: If True, kill any existing gateway instance before starting.
This prevents systemd restart loops when the old process
hasn't fully exited yet.
"""
_guard_official_docker_root_gateway()
sys.path.insert(0, str(PROJECT_ROOT))
# Detached Windows gateway runs must ignore console-control broadcasts
# from sibling CLI processes, but foreground `hermes gateway run` still
# needs to obey the banner's "Press Ctrl+C to stop" contract.
# Service-style launchers set HERMES_GATEWAY_DETACHED=1; older wrappers
# without the marker are handled by the non-TTY fallback.
try:
_stdin_is_tty = bool(sys.stdin and sys.stdin.isatty())
except (ValueError, OSError):
_stdin_is_tty = False
_absorb_windows_console_controls = _windows_gateway_should_absorb_console_controls()
if _absorb_windows_console_controls:
try:
signal.signal(signal.SIGINT, signal.SIG_IGN)
if hasattr(signal, "SIGBREAK"):
signal.signal(signal.SIGBREAK, signal.SIG_IGN)
except (OSError, ValueError):
# SetConsoleCtrlHandler not available (rare on Windows) —
# best-effort, proceed either way.
pass
# Python's signal module only hooks SIGINT/SIGBREAK. To also
# absorb CTRL_CLOSE_EVENT / CTRL_LOGOFF_EVENT and any other
# console control signals Windows may broadcast to the console
# process group, call the native SetConsoleCtrlHandler(NULL, TRUE)
# — this tells the kernel to IGNORE all console control events
# for this process entirely, which is what background services
# are supposed to do. Belt-and-braces over the Python-level
# handlers above.
try:
import ctypes
kernel32 = ctypes.windll.kernel32 # type: ignore[attr-defined]
# BOOL SetConsoleCtrlHandler(NULL, Add) — Add=TRUE means
# "install the NULL handler", which has the documented
# effect of ignoring Ctrl+C. Called twice for defense in
# depth: once before any Python import could have flipped
# our disposition, once as our last word.
kernel32.SetConsoleCtrlHandler(None, 1)
except (OSError, AttributeError):
pass
# Refresh the systemd unit definition on every boot so that restart
# settings (RestartSec, StartLimitIntervalSec, etc.) stay current even
# when the process was respawned via exit-code-75 (stale-code or
# /restart) rather than through `hermes gateway restart` which already
# calls refresh_systemd_unit_if_needed(). Without this, a code update
# that ships new unit settings won't take effect until the next manual
# `hermes gateway start/restart` — leaving the gateway vulnerable to
# the exact failure mode the new settings were meant to prevent.
if supports_systemd_services():
try:
refresh_systemd_unit_if_needed(system=False)
except Exception:
pass # best-effort; don't block gateway startup
from gateway.run import start_gateway
print("┌─────────────────────────────────────────────────────────┐")
print("│ ⚕ Hermes Gateway Starting... │")
print("├─────────────────────────────────────────────────────────┤")
print("│ Messaging platforms + cron scheduler │")
print("│ Press Ctrl+C to stop │")
print("└─────────────────────────────────────────────────────────┘")
print()
# Exit with code 1 if gateway fails to connect any platform,
# so systemd Restart=always will retry on transient errors
verbosity = None if quiet else verbose
# ── Exit-path diagnostics ────────────────────────────────────────────
# When the gateway dies silently on Windows (no shutdown log, no
# traceback in gateway.log / errors.log), we're usually blind to the
# cause. The code below captures *every* way the asyncio.run() call
# below can return, with full context dumped to a dedicated log so
# the next silent death yields evidence instead of a mystery. This
# is diagnostic scaffolding; cheap to keep on, costs nothing during
# normal operation, and the emitted lines are opt-in via the
# HERMES_GATEWAY_EXIT_DIAG env var (default: on while we're still
# chasing the Windows lifecycle bug).
import atexit as _atexit
import traceback as _traceback
from datetime import datetime as _dt, timezone as _tz
def _exit_diag(tag: str, **extra: object) -> None:
if os.environ.get("HERMES_GATEWAY_EXIT_DIAG", "1") != "1":
return
try:
from hermes_constants import get_hermes_home as _ghh
log_dir = _ghh() / "logs"
log_dir.mkdir(parents=True, exist_ok=True)
ts = _dt.now(_tz.utc).isoformat()
line = {
"ts": ts,
"tag": tag,
"pid": os.getpid(),
"python": sys.version.split()[0],
"platform": sys.platform,
**extra,
}
import json as _json
with open(log_dir / "gateway-exit-diag.log", "a", encoding="utf-8") as f:
f.write(_json.dumps(line, default=str) + "\n")
except Exception:
pass # never let the diagnostic itself crash the gateway
_exit_diag(
"gateway.start",
replace=replace,
argv=sys.argv,
stdin_is_tty=_stdin_is_tty,
absorb_windows_console_controls=_absorb_windows_console_controls,
)
def _atexit_hook() -> None:
_exit_diag("atexit.hook", sys_exc=repr(sys.exc_info()))
_atexit.register(_atexit_hook)
success = False
try:
success = asyncio.run(start_gateway(replace=replace, verbosity=verbosity))
_exit_diag("asyncio.run.returned", success=success)
except KeyboardInterrupt:
# On Windows-detached runs this shouldn't fire (we absorb SIGINT above),
# but keep the handler for console runs.
_exit_diag(
"asyncio.run.KeyboardInterrupt",
traceback=_traceback.format_exc(),
)
print("\nGateway stopped.")
return
except SystemExit as e:
_exit_diag(
"asyncio.run.SystemExit",
code=getattr(e, "code", None),
traceback=_traceback.format_exc(),
)
raise
except BaseException as e:
# Absolutely everything else: Exception, asyncio.CancelledError,
# even exotic BaseException subclasses. We want the cause logged.
_exit_diag(
"asyncio.run.exception",
exc_type=type(e).__name__,
exc_repr=repr(e),
traceback=_traceback.format_exc(),
)
raise
if not success:
_exit_diag("gateway.exit_nonzero")
sys.exit(1)
_exit_diag("gateway.exit_clean")
# =============================================================================
# Gateway Setup (Interactive Messaging Platform Configuration)
# =============================================================================
# Per-platform config: each entry defines the env vars, setup instructions,
# and prompts needed to configure a messaging platform.
_PLATFORMS = [
{
"key": "telegram",
"label": "Telegram",
"emoji": "📱",
"token_var": "TELEGRAM_BOT_TOKEN",
"setup_instructions": [
"1. Open Telegram and message @BotFather",
"2. Send /newbot and follow the prompts to create your bot",
"3. Copy the bot token BotFather gives you",
"4. To find your user ID: message @userinfobot — it replies with your numeric ID",
],
"vars": [
{
"name": "TELEGRAM_BOT_TOKEN",
"prompt": "Bot token",
"password": True,
"help": "Paste the token from @BotFather (step 3 above).",
},
{
"name": "TELEGRAM_ALLOWED_USERS",
"prompt": "Allowed user IDs (comma-separated)",
"password": False,
"is_allowlist": True,
"help": "Paste your user ID from step 4 above.",
},
{
"name": "TELEGRAM_HOME_CHANNEL",
"prompt": "Home channel ID (for cron/notification delivery, or empty to set later with /set-home)",
"password": False,
"help": "For DMs, this is your user ID. You can set it later by typing /set-home in chat.",
},
],
},
# Discord moved to plugins/platforms/discord/ — its setup metadata is
# discovered dynamically via _all_platforms() from the platform registry
# entry registered by plugins/platforms/discord/adapter.py::register().
{
"key": "slack",
"label": "Slack",
"emoji": "💼",
"token_var": "SLACK_BOT_TOKEN",
"setup_instructions": [
"1. Go to https://api.slack.com/apps → Create New App → From Scratch",
"2. Enable Socket Mode: Settings → Socket Mode → Enable",
" Create an App-Level Token with scope: connections:write → copy xapp-... token",
"3. Add Bot Token Scopes: Features → OAuth & Permissions → Scopes",
" Required: chat:write, app_mentions:read, channels:history, channels:read,",
" groups:history, im:history, im:read, im:write, users:read, files:read, files:write",
"4. Subscribe to Events: Features → Event Subscriptions → Enable",
" Required events: message.im, message.channels, app_mention",
" Optional: message.groups (for private channels)",
" ⚠ Without message.channels the bot will ONLY work in DMs!",
"5. Install to Workspace: Settings → Install App → copy xoxb-... token",
"6. Reinstall the app after any scope or event changes",
"7. Find your user ID: click your profile → three dots → Copy member ID",
"8. Invite the bot to channels: /invite @YourBot",
],
"vars": [
{
"name": "SLACK_BOT_TOKEN",
"prompt": "Bot Token (xoxb-...)",
"password": True,
"help": "Paste the bot token from step 3 above.",
},
{
"name": "SLACK_APP_TOKEN",
"prompt": "App Token (xapp-...)",
"password": True,
"help": "Paste the app-level token from step 4 above.",
},
{
"name": "SLACK_ALLOWED_USERS",
"prompt": "Allowed user IDs (comma-separated)",
"password": False,
"is_allowlist": True,
"help": "Paste your member ID from step 7 above.",
},
],
},
{
"key": "matrix",
"label": "Matrix",
"emoji": "🔐",
"token_var": "MATRIX_ACCESS_TOKEN",
"setup_instructions": [
"1. Works with any Matrix homeserver (self-hosted Synapse/Conduit/Dendrite or matrix.org)",
"2. Create a bot user on your homeserver, or use your own account",
"3. Get an access token: Element → Settings → Help & About → Access Token",
" Or via API: curl -X POST https://your-server/_matrix/client/v3/login \\",
' -d \'{"type":"m.login.password","user":"@bot:server","password":"..."}\'',
"4. Alternatively, provide user ID + password and Hermes will log in directly",
"5. For E2EE: set MATRIX_ENCRYPTION=true (requires pip install 'mautrix[encryption]')",
"6. To find your user ID: it's @username:your-server (shown in Element profile)",
],
"vars": [
{
"name": "MATRIX_HOMESERVER",
"prompt": "Homeserver URL (e.g. https://matrix.example.org)",
"password": False,
"help": "Your Matrix homeserver URL. Works with any self-hosted instance.",
},
{
"name": "MATRIX_ACCESS_TOKEN",
"prompt": "Access token (leave empty to use password login instead)",
"password": True,
"help": "Paste your access token, or leave empty and provide user ID + password below.",
},
{
"name": "MATRIX_USER_ID",
"prompt": "User ID (@bot:server — required for password login)",
"password": False,
"help": "Full Matrix user ID, e.g. @hermes:matrix.example.org",
},
{
"name": "MATRIX_ALLOWED_USERS",
"prompt": "Allowed user IDs (comma-separated, e.g. @you:server)",
"password": False,
"is_allowlist": True,
"help": "Matrix user IDs who can interact with the bot.",
},
{
"name": "MATRIX_HOME_ROOM",
"prompt": "Home room ID (for cron/notification delivery, or empty to set later with /set-home)",
"password": False,
"help": "Room ID (e.g. !abc123:server) for delivering cron results and notifications.",
},
],
},
{
"key": "mattermost",
"label": "Mattermost",
"emoji": "💬",
"token_var": "MATTERMOST_TOKEN",
"setup_instructions": [
"1. In Mattermost: Integrations → Bot Accounts → Add Bot Account",
" (System Console → Integrations → Bot Accounts must be enabled)",
"2. Give it a username (e.g. hermes) and copy the bot token",
"3. Works with any self-hosted Mattermost instance — enter your server URL",
"4. To find your user ID: click your avatar (top-left) → Profile",
" Your user ID is displayed there — click it to copy.",
" ⚠ This is NOT your username — it's a 26-character alphanumeric ID.",
"5. To get a channel ID: click the channel name → View Info → copy the ID",
],
"vars": [
{
"name": "MATTERMOST_URL",
"prompt": "Server URL (e.g. https://mm.example.com)",
"password": False,
"help": "Your Mattermost server URL. Works with any self-hosted instance.",
},
{
"name": "MATTERMOST_TOKEN",
"prompt": "Bot token",
"password": True,
"help": "Paste the bot token from step 2 above.",
},
{
"name": "MATTERMOST_ALLOWED_USERS",
"prompt": "Allowed user IDs (comma-separated)",
"password": False,
"is_allowlist": True,
"help": "Your Mattermost user ID from step 4 above.",
},
{
"name": "MATTERMOST_HOME_CHANNEL",
"prompt": "Home channel ID (for cron/notification delivery, or empty to set later with /set-home)",
"password": False,
"help": "Channel ID where Hermes delivers cron results and notifications.",
},
{
"name": "MATTERMOST_REPLY_MODE",
"prompt": "Reply mode — 'off' for flat messages, 'thread' for threaded replies (default: off)",
"password": False,
"help": "off = flat channel messages, thread = replies nest under your message.",
},
],
},
{
"key": "whatsapp",
"label": "WhatsApp",
"emoji": "📲",
"token_var": "WHATSAPP_ENABLED",
},
{
"key": "signal",
"label": "Signal",
"emoji": "📡",
"token_var": "SIGNAL_HTTP_URL",
},
{
"key": "email",
"label": "Email",
"emoji": "📧",
"token_var": "EMAIL_ADDRESS",
"setup_instructions": [
"1. Use a dedicated email account for your Hermes agent",
"2. For Gmail: enable 2FA, then create an App Password at",
" https://myaccount.google.com/apppasswords",
"3. For other providers: use your email password or app-specific password",
"4. IMAP must be enabled on your email account",
],
"vars": [
{
"name": "EMAIL_ADDRESS",
"prompt": "Email address",
"password": False,
"help": "The email address Hermes will use (e.g., hermes@gmail.com).",
},
{
"name": "EMAIL_PASSWORD",
"prompt": "Email password (or app password)",
"password": True,
"help": "For Gmail, use an App Password (not your regular password).",
},
{
"name": "EMAIL_IMAP_HOST",
"prompt": "IMAP host",
"password": False,
"help": "e.g., imap.gmail.com for Gmail, outlook.office365.com for Outlook.",
},
{
"name": "EMAIL_SMTP_HOST",
"prompt": "SMTP host",
"password": False,
"help": "e.g., smtp.gmail.com for Gmail, smtp.office365.com for Outlook.",
},
{
"name": "EMAIL_ALLOWED_USERS",
"prompt": "Allowed sender emails (comma-separated)",
"password": False,
"is_allowlist": True,
"help": "Only emails from these addresses will be processed.",
},
],
},
{
"key": "sms",
"label": "SMS (Twilio)",
"emoji": "📱",
"token_var": "TWILIO_ACCOUNT_SID",
"setup_instructions": [
"1. Create a Twilio account at https://www.twilio.com/",
"2. Get your Account SID and Auth Token from the Twilio Console dashboard",
"3. Buy or configure a phone number capable of sending SMS",
"4. Set up your webhook URL for inbound SMS:",
" Twilio Console → Phone Numbers → Active Numbers → your number",
" → Messaging → A MESSAGE COMES IN → Webhook → https://your-server:8080/webhooks/twilio",
],
"vars": [
{
"name": "TWILIO_ACCOUNT_SID",
"prompt": "Twilio Account SID",
"password": False,
"help": "Found on the Twilio Console dashboard.",
},
{
"name": "TWILIO_AUTH_TOKEN",
"prompt": "Twilio Auth Token",
"password": True,
"help": "Found on the Twilio Console dashboard (click to reveal).",
},
{
"name": "TWILIO_PHONE_NUMBER",
"prompt": "Twilio phone number (E.164 format, e.g. +15551234567)",
"password": False,
"help": "The Twilio phone number to send SMS from.",
},
{
"name": "SMS_ALLOWED_USERS",
"prompt": "Allowed phone numbers (comma-separated, E.164 format)",
"password": False,
"is_allowlist": True,
"help": "Only messages from these phone numbers will be processed.",
},
{
"name": "SMS_HOME_CHANNEL",
"prompt": "Home channel phone number (for cron/notification delivery, or empty)",
"password": False,
"help": "Phone number to deliver cron job results and notifications to.",
},
],
},
{
"key": "dingtalk",
"label": "DingTalk",
"emoji": "💬",
"token_var": "DINGTALK_CLIENT_ID",
"setup_instructions": [
"1. Go to https://open-dev.dingtalk.com → Create Application",
"2. Under 'Credentials', copy the AppKey (Client ID) and AppSecret (Client Secret)",
"3. Enable 'Stream Mode' under the bot settings",
"4. Add the bot to a group chat or message it directly",
],
"vars": [
{
"name": "DINGTALK_CLIENT_ID",
"prompt": "AppKey (Client ID)",
"password": False,
"help": "The AppKey from your DingTalk application credentials.",
},
{
"name": "DINGTALK_CLIENT_SECRET",
"prompt": "AppSecret (Client Secret)",
"password": True,
"help": "The AppSecret from your DingTalk application credentials.",
},
],
},
{
"key": "feishu",
"label": "Feishu / Lark",
"emoji": "🪽",
"token_var": "FEISHU_APP_ID",
"setup_instructions": [
"1. Go to https://open.feishu.cn/ (or https://open.larksuite.com/ for Lark)",
"2. Create an app and copy the App ID and App Secret",
"3. Enable the Bot capability for the app",
"4. Choose WebSocket (recommended) or Webhook connection mode",
"5. Add the bot to a group chat or message it directly",
"6. Restrict access with FEISHU_ALLOWED_USERS for production use",
],
"vars": [
{
"name": "FEISHU_APP_ID",
"prompt": "App ID",
"password": False,
"help": "The App ID from your Feishu/Lark application.",
},
{
"name": "FEISHU_APP_SECRET",
"prompt": "App Secret",
"password": True,
"help": "The App Secret from your Feishu/Lark application.",
},
{
"name": "FEISHU_DOMAIN",
"prompt": "Domain — feishu or lark (default: feishu)",
"password": False,
"help": "Use 'feishu' for Feishu China, or 'lark' for Lark international.",
},
{
"name": "FEISHU_CONNECTION_MODE",
"prompt": "Connection mode — websocket or webhook (default: websocket)",
"password": False,
"help": "websocket is recommended unless you specifically need webhook mode.",
},
{
"name": "FEISHU_ALLOWED_USERS",
"prompt": "Allowed user IDs (comma-separated, or empty)",
"password": False,
"is_allowlist": True,
"help": "Restrict which Feishu/Lark users can interact with the bot.",
},
{
"name": "FEISHU_HOME_CHANNEL",
"prompt": "Home chat ID (optional, for cron/notifications)",
"password": False,
"help": "Chat ID for scheduled results and notifications.",
},
],
},
{
"key": "wecom",
"label": "WeCom (Enterprise WeChat)",
"emoji": "💬",
"token_var": "WECOM_BOT_ID",
"setup_instructions": [
"1. Go to WeCom Admin Console → Applications → Create AI Bot",
"2. Copy the Bot ID and Secret from the bot's credentials page",
"3. The bot connects via WebSocket — no public endpoint needed",
"4. Add the bot to a group chat or message it directly in WeCom",
"5. Restrict access with WECOM_ALLOWED_USERS for production use",
],
"vars": [
{
"name": "WECOM_BOT_ID",
"prompt": "Bot ID",
"password": False,
"help": "The Bot ID from your WeCom AI Bot.",
},
{
"name": "WECOM_SECRET",
"prompt": "Secret",
"password": True,
"help": "The secret from your WeCom AI Bot.",
},
{
"name": "WECOM_ALLOWED_USERS",
"prompt": "Allowed user IDs (comma-separated, or empty)",
"password": False,
"is_allowlist": True,
"help": "Restrict which WeCom users can interact with the bot.",
},
{
"name": "WECOM_HOME_CHANNEL",
"prompt": "Home chat ID (optional, for cron/notifications)",
"password": False,
"help": "Chat ID for scheduled results and notifications.",
},
],
},
{
"key": "wecom_callback",
"label": "WeCom Callback (Self-Built App)",
"emoji": "💬",
"token_var": "WECOM_CALLBACK_CORP_ID",
"setup_instructions": [
"1. Go to WeCom Admin Console → Applications → Create Self-Built App",
"2. Note the Corp ID (top of admin console) and create a Corp Secret",
"3. Under Receive Messages, configure the callback URL to point to your server",
"4. Copy the Token and EncodingAESKey from the callback configuration",
"5. The adapter runs an HTTP server — ensure the port is reachable from WeCom",
"6. Restrict access with WECOM_CALLBACK_ALLOWED_USERS for production use",
],
"vars": [
{
"name": "WECOM_CALLBACK_CORP_ID",
"prompt": "Corp ID",
"password": False,
"help": "Your WeCom enterprise Corp ID.",
},
{
"name": "WECOM_CALLBACK_CORP_SECRET",
"prompt": "Corp Secret",
"password": True,
"help": "The secret for your self-built application.",
},
{
"name": "WECOM_CALLBACK_AGENT_ID",
"prompt": "Agent ID",
"password": False,
"help": "The Agent ID of your self-built application.",
},
{
"name": "WECOM_CALLBACK_TOKEN",
"prompt": "Callback Token",
"password": True,
"help": "The Token from your WeCom callback configuration.",
},
{
"name": "WECOM_CALLBACK_ENCODING_AES_KEY",
"prompt": "Encoding AES Key",
"password": True,
"help": "The EncodingAESKey from your WeCom callback configuration.",
},
{
"name": "WECOM_CALLBACK_PORT",
"prompt": "Callback server port (default: 8645)",
"password": False,
"help": "Port for the HTTP callback server.",
},
{
"name": "WECOM_CALLBACK_ALLOWED_USERS",
"prompt": "Allowed user IDs (comma-separated, or empty)",
"password": False,
"is_allowlist": True,
"help": "Restrict which WeCom users can interact with the app.",
},
],
},
{
"key": "weixin",
"label": "Weixin / WeChat",
"emoji": "💬",
"token_var": "WEIXIN_ACCOUNT_ID",
},
{
"key": "bluebubbles",
"label": "BlueBubbles (iMessage)",
"emoji": "💬",
"token_var": "BLUEBUBBLES_SERVER_URL",
"setup_instructions": [
"1. Install BlueBubbles on a Mac that will act as your iMessage server:",
" https://bluebubbles.app/",
"2. Complete the BlueBubbles setup wizard — sign in with your Apple ID",
"3. In BlueBubbles Settings → API, note the Server URL and password",
"4. The server URL is typically http://<your-mac-ip>:1234",
"5. Hermes connects via the BlueBubbles REST API and receives",
" incoming messages via a local webhook",
"6. To authorize users, use DM pairing: hermes pairing generate bluebubbles",
" Share the code — the user sends it via iMessage to get approved",
],
"vars": [
{
"name": "BLUEBUBBLES_SERVER_URL",
"prompt": "BlueBubbles server URL (e.g. http://192.168.1.10:1234)",
"password": False,
"help": "The URL shown in BlueBubbles Settings → API.",
},
{
"name": "BLUEBUBBLES_PASSWORD",
"prompt": "BlueBubbles server password",
"password": True,
"help": "The password shown in BlueBubbles Settings → API.",
},
{
"name": "BLUEBUBBLES_ALLOWED_USERS",
"prompt": "Pre-authorized phone numbers or iMessage IDs (comma-separated, or leave empty for DM pairing)",
"password": False,
"is_allowlist": True,
"help": "Optional — pre-authorize specific users. Leave empty to use DM pairing instead (recommended).",
},
{
"name": "BLUEBUBBLES_HOME_CHANNEL",
"prompt": "Home channel (phone number or iMessage ID for cron/notifications, or empty)",
"password": False,
"help": "Phone number or Apple ID to deliver cron results and notifications to.",
},
],
},
{
"key": "qqbot",
"label": "QQ Bot",
"emoji": "🐧",
"token_var": "QQ_APP_ID",
"setup_instructions": [
"1. Register a QQ Bot application at q.qq.com",
"2. Note your App ID and App Secret from the application page",
"3. Enable the required intents (C2C, Group, Guild messages)",
"4. Configure sandbox or publish the bot",
],
"vars": [
{
"name": "QQ_APP_ID",
"prompt": "QQ Bot App ID",
"password": False,
"help": "Your QQ Bot App ID from q.qq.com.",
},
{
"name": "QQ_CLIENT_SECRET",
"prompt": "QQ Bot App Secret",
"password": True,
"help": "Your QQ Bot App Secret from q.qq.com.",
},
{
"name": "QQ_ALLOWED_USERS",
"prompt": "Allowed user OpenIDs (comma-separated, leave empty for open access)",
"password": False,
"is_allowlist": True,
"help": "Optional — restrict DM access to specific user OpenIDs.",
},
{
"name": "QQBOT_HOME_CHANNEL",
"prompt": "Home channel (user/group OpenID for cron delivery, or empty)",
"password": False,
"help": "OpenID to deliver cron results and notifications to.",
},
],
},
{
"key": "yuanbao",
"label": "Yuanbao",
"emoji": "💎",
"token_var": "YUANBAO_APP_ID",
"setup_instructions": [
"1. Download the Yuanbao app from https://yuanbao.tencent.com/",
"2. In the app, go to PAI → My Bot and create a new bot",
"3. After the bot is created, copy the App ID and App Secret",
"4. Enter them below and Hermes will connect automatically over WebSocket",
],
"vars": [
{
"name": "YUANBAO_APP_ID",
"prompt": "App ID",
"password": False,
"help": "The App ID from your Yuanbao IM Bot credentials.",
},
{
"name": "YUANBAO_APP_SECRET",
"prompt": "App Secret",
"password": True,
"help": "The App Secret (used for HMAC signing) from your Yuanbao IM Bot.",
},
],
},
]
def _all_platforms() -> list[dict]:
"""Return the full list of platforms for setup menus.
Combines the built-in ``_PLATFORMS`` with plugin platforms registered via
``platform_registry``. Plugins are discovered on first call so bundled
platforms (like IRC, which auto-load via ``kind: platform``) appear in
``hermes setup gateway`` without needing the gateway to be running.
Built-ins keep their dict shape; plugin entries are adapted to the same
shape with ``_registry_entry`` holding the source.
Platform-specific gating: some platforms can't be configured on
every host. Currently:
- Matrix is hidden on Windows. The [matrix] extra pulls
``mautrix[encryption]`` -> ``python-olm``, which has no Windows
wheel and needs ``make`` + libolm to build from sdist. There's
no native Windows path that works, so we don't offer it in the
picker. Users who want Matrix on Windows can run hermes under
WSL.
"""
# Populate the registry so plugin platforms are visible. Idempotent.
# Bundled platform plugins (``kind: platform``) auto-load unconditionally,
# so every shipped messaging channel appears in the setup menu by default.
# User-installed platform plugins under ~/.hermes/plugins/ still require
# opt-in via ``plugins.enabled`` (untrusted code).
try:
from hermes_cli.plugins import discover_plugins
discover_plugins()
except Exception as e:
logger.debug("plugin discovery failed during platform enumeration: %s", e)
platforms = [dict(p) for p in _PLATFORMS]
# Drop platforms that can't function on this host. See docstring.
if sys.platform == "win32":
platforms = [p for p in platforms if p.get("key") != "matrix"]
by_key = {p["key"]: p for p in platforms}
try:
from gateway.platform_registry import platform_registry
except Exception:
return platforms
for entry in platform_registry.all_entries():
if entry.name in by_key:
continue # built-in already covers it
platforms.append(
{
"key": entry.name,
"label": entry.label,
"emoji": entry.emoji,
"token_var": entry.required_env[0] if entry.required_env else "",
"install_hint": entry.install_hint,
"_registry_entry": entry,
}
)
return platforms
def _platform_status(platform: dict) -> str:
"""Return a plain-text status string for a platform.
Returns uncolored text so it can safely be embedded in
curses menu items (ANSI codes break width calculation).
"""
entry = platform.get("_registry_entry")
if entry is not None:
configured = False
# Prefer is_connected (checks both env and config.yaml) over
# check_fn (typically just dependency / env presence).
if entry.is_connected is not None:
try:
from gateway.config import PlatformConfig
synthetic = PlatformConfig(enabled=True)
configured = bool(entry.is_connected(synthetic))
except Exception:
configured = False
else:
# No is_connected hook — fall back to check_fn as a coarse
# "are deps present" gate. Don't fall back when is_connected
# is defined and returned False; that would let "SDK is
# installed" override "no token configured" and incorrectly
# report the platform as ready.
try:
configured = bool(entry.check_fn())
except Exception:
configured = False
return "configured" if configured else "not configured"
token_var = platform.get("token_var", "")
if not token_var:
return "not configured"
val = get_env_value(token_var)
if token_var == "WHATSAPP_ENABLED":
if val and val.lower() == "true":
session_file = get_hermes_home() / "whatsapp" / "session" / "creds.json"
if session_file.exists():
return "configured + paired"
return "enabled, not paired"
return "not configured"
if platform.get("key") == "signal":
account = get_env_value("SIGNAL_ACCOUNT")
if val and account:
return "configured"
if val or account:
return "partially configured"
return "not configured"
if platform.get("key") == "email":
pwd = get_env_value("EMAIL_PASSWORD")
imap = get_env_value("EMAIL_IMAP_HOST")
smtp = get_env_value("EMAIL_SMTP_HOST")
if all([val, pwd, imap, smtp]):
return "configured"
if any([val, pwd, imap, smtp]):
return "partially configured"
return "not configured"
if platform.get("key") == "matrix":
homeserver = get_env_value("MATRIX_HOMESERVER")
password = get_env_value("MATRIX_PASSWORD")
if (val or password) and homeserver:
e2ee = get_env_value("MATRIX_ENCRYPTION")
suffix = " + E2EE" if e2ee and e2ee.lower() in {"true", "1", "yes"} else ""
return f"configured{suffix}"
if val or password or homeserver:
return "partially configured"
return "not configured"
if platform.get("key") == "weixin":
token = get_env_value("WEIXIN_TOKEN")
if val and token:
return "configured"
if val or token:
return "partially configured"
return "not configured"
if val:
return "configured"
return "not configured"
def _runtime_health_lines() -> list[str]:
"""Summarize the latest persisted gateway runtime health state."""
try:
from gateway.status import read_runtime_status
except Exception:
return []
state = read_runtime_status()
if not state:
return []
lines: list[str] = []
gateway_state = state.get("gateway_state")
exit_reason = state.get("exit_reason")
active_agents = state.get("active_agents")
restart_requested = state.get("restart_requested")
platforms = state.get("platforms", {}) or {}
for platform, pdata in platforms.items():
if pdata.get("state") == "fatal":
message = pdata.get("error_message") or "unknown error"
lines.append(f"{platform}: {message}")
if gateway_state == "startup_failed" and exit_reason:
lines.append(f"⚠ Last startup issue: {exit_reason}")
elif gateway_state == "draining":
action = "restart" if restart_requested else "shutdown"
count = int(active_agents or 0)
lines.append(f"⏳ Gateway draining for {action} ({count} active agent(s))")
elif gateway_state == "stopped" and exit_reason:
lines.append(f"⚠ Last shutdown reason: {exit_reason}")
return lines
def _setup_standard_platform(platform: dict):
"""Interactive setup for Telegram, Discord, or Slack."""
emoji = platform["emoji"]
label = platform["label"]
token_var = platform["token_var"]
print()
print(color(f" ─── {emoji} {label} Setup ───", Colors.CYAN))
# Show step-by-step setup instructions if this platform has them
instructions = platform.get("setup_instructions")
if instructions:
print()
for line in instructions:
print_info(f" {line}")
existing_token = get_env_value(token_var)
if existing_token:
print()
print_success(f"{label} is already configured.")
if not prompt_yes_no(f" Reconfigure {label}?", False):
return
allowed_val_set = None # Track if user set an allowlist (for home channel offer)
for var in platform["vars"]:
print()
print_info(f" {var['help']}")
existing = get_env_value(var["name"])
if existing and var["name"] != token_var:
print_info(f" Current: {existing}")
# Allowlist fields get special handling for the deny-by-default security model
if var.get("is_allowlist"):
print_info(" The gateway DENIES all users by default for security.")
print_info(" Enter user IDs to create an allowlist, or leave empty")
print_info(" and you'll be asked about open access next.")
value = prompt(f" {var['prompt']}", password=False)
if value:
cleaned = value.replace(" ", "")
# For Discord, strip common prefixes (user:123, <@123>, <@!123>)
if "DISCORD" in var["name"]:
parts = []
for uid in cleaned.split(","):
uid = uid.strip()
if uid.startswith("<@") and uid.endswith(">"):
uid = uid.lstrip("<@!").rstrip(">")
if uid.lower().startswith("user:"):
uid = uid[5:]
if uid:
parts.append(uid)
cleaned = ",".join(parts)
save_env_value(var["name"], cleaned)
print_success(" Saved — only these users can interact with the bot.")
allowed_val_set = cleaned
else:
# No allowlist — ask about open access vs DM pairing
print()
access_choices = [
"Enable open access (anyone can message the bot)",
"Use DM pairing (unknown users request access, you approve with 'hermes pairing approve')",
"Skip for now (bot will deny all users until configured)",
]
access_idx = prompt_choice(
" How should unauthorized users be handled?", access_choices, 1
)
if access_idx == 0:
save_env_value("GATEWAY_ALLOW_ALL_USERS", "true")
print_warning(" Open access enabled — anyone can use your bot!")
elif access_idx == 1:
print_success(
" DM pairing mode — users will receive a code to request access."
)
print_info(
" Approve with: hermes pairing approve <platform> <code>"
)
else:
print_info(
" Skipped — configure later with 'hermes gateway setup'"
)
continue
value = prompt(f" {var['prompt']}", password=var.get("password", False))
if value:
save_env_value(var["name"], value)
print_success(f" Saved {var['name']}")
elif var["name"] == token_var:
print_warning(f" Skipped — {label} won't work without this.")
return
else:
print_info(" Skipped (can configure later)")
# If an allowlist was set and home channel wasn't, offer to reuse
# the first user ID (common for Telegram DMs).
home_var = f"{label.upper()}_HOME_CHANNEL"
home_val = get_env_value(home_var)
if allowed_val_set and not home_val and label == "Telegram":
first_id = allowed_val_set.split(",")[0].strip()
if first_id and prompt_yes_no(
f" Use your user ID ({first_id}) as the home channel?", True
):
save_env_value(home_var, first_id)
print_success(f" Home channel set to {first_id}")
print()
print_success(f"{emoji} {label} configured!")
def _setup_whatsapp():
"""Delegate to the existing WhatsApp setup flow."""
from hermes_cli.main import cmd_whatsapp
import argparse
cmd_whatsapp(argparse.Namespace())
def _setup_dingtalk():
"""Configure DingTalk — QR scan (recommended) or manual credential entry."""
from hermes_cli.setup import (
prompt_choice,
prompt_yes_no,
print_success,
print_warning,
)
dingtalk_platform = next(p for p in _PLATFORMS if p["key"] == "dingtalk")
emoji = dingtalk_platform["emoji"]
label = dingtalk_platform["label"]
print()
print(color(f" ─── {emoji} {label} Setup ───", Colors.CYAN))
existing = get_env_value("DINGTALK_CLIENT_ID")
if existing:
print()
print_success(f"{label} is already configured (Client ID: {existing}).")
if not prompt_yes_no(f" Reconfigure {label}?", False):
return
print()
method = prompt_choice(
" Choose setup method",
[
"QR Code Scan (Recommended, auto-obtain Client ID and Client Secret)",
"Manual Input (Client ID and Client Secret)",
],
default=0,
)
if method == 0:
# ── QR-code device-flow authorization ──
try:
from hermes_cli.dingtalk_auth import dingtalk_qr_auth
except ImportError as exc:
print_warning(
f" QR auth module failed to load ({exc}), falling back to manual input."
)
_setup_standard_platform(dingtalk_platform)
return
result = dingtalk_qr_auth()
if result is None:
print_warning(" QR auth incomplete, falling back to manual input.")
_setup_standard_platform(dingtalk_platform)
return
client_id, client_secret = result
save_env_value("DINGTALK_CLIENT_ID", client_id)
save_env_value("DINGTALK_CLIENT_SECRET", client_secret)
print()
print_success(f"{emoji} {label} configured via QR scan!")
else:
# ── Manual entry ──
_setup_standard_platform(dingtalk_platform)
def _setup_wecom():
"""Interactive setup for WeCom — scan QR code or manual credential input."""
print()
print(color(" ─── 💬 WeCom (Enterprise WeChat) Setup ───", Colors.CYAN))
existing_bot_id = get_env_value("WECOM_BOT_ID")
existing_secret = get_env_value("WECOM_SECRET")
if existing_bot_id and existing_secret:
print()
print_success("WeCom is already configured.")
if not prompt_yes_no(" Reconfigure WeCom?", False):
return
# ── Choose setup method ──
print()
method_choices = [
"Scan QR code to obtain Bot ID and Secret automatically (recommended)",
"Enter existing Bot ID and Secret manually",
]
method_idx = prompt_choice(
" How would you like to set up WeCom?", method_choices, 0
)
bot_id = None
secret = None
if method_idx == 0:
# ── QR scan flow ──
try:
from gateway.platforms.wecom import qr_scan_for_bot_info
except Exception as exc:
print_error(f" WeCom QR scan import failed: {exc}")
qr_scan_for_bot_info = None
if qr_scan_for_bot_info is not None:
try:
credentials = qr_scan_for_bot_info()
except KeyboardInterrupt:
print()
print_warning(" WeCom setup cancelled.")
return
except Exception as exc:
print_warning(f" QR scan failed: {exc}")
credentials = None
if credentials:
bot_id = credentials.get("bot_id", "")
secret = credentials.get("secret", "")
print_success(" ✔ QR scan successful! Bot ID and Secret obtained.")
if not bot_id or not secret:
print_info(" QR scan did not complete. Continuing with manual input.")
bot_id = None
secret = None
# ── Manual credential input ──
if not bot_id or not secret:
print()
print_info(
" 1. Go to WeCom Application → Workspace → Smart Robot -> Create smart robots"
)
print_info(" 2. Select API Mode")
print_info(" 3. Copy the Bot ID and Secret from the bot's credentials info")
print_info(" 4. The bot connects via WebSocket — no public endpoint needed")
print()
bot_id = prompt(" Bot ID", password=False)
if not bot_id:
print_warning(" Skipped — WeCom won't work without a Bot ID.")
return
secret = prompt(" Secret", password=True)
if not secret:
print_warning(" Skipped — WeCom won't work without a Secret.")
return
# ── Save core credentials ──
save_env_value("WECOM_BOT_ID", bot_id)
save_env_value("WECOM_SECRET", secret)
# ── Allowed users (deny-by-default security) ──
print()
print_info(" The gateway DENIES all users by default for security.")
print_info(" Enter user IDs to create an allowlist, or leave empty.")
allowed = prompt(" Allowed user IDs (comma-separated, or empty)", password=False)
if allowed:
cleaned = allowed.replace(" ", "")
save_env_value("WECOM_ALLOWED_USERS", cleaned)
print_success(" Saved — only these users can interact with the bot.")
else:
print()
access_choices = [
"Enable open access (anyone can message the bot)",
"Use DM pairing (unknown users request access, you approve with 'hermes pairing approve')",
"Disable direct messages",
"Skip for now (bot will deny all users until configured)",
]
access_idx = prompt_choice(
" How should unauthorized users be handled?", access_choices, 1
)
if access_idx == 0:
save_env_value("WECOM_DM_POLICY", "open")
save_env_value("GATEWAY_ALLOW_ALL_USERS", "true")
print_warning(" Open access enabled — anyone can use your bot!")
elif access_idx == 1:
save_env_value("WECOM_DM_POLICY", "pairing")
print_success(
" DM pairing mode — users will receive a code to request access."
)
print_info(" Approve with: hermes pairing approve <platform> <code>")
elif access_idx == 2:
save_env_value("WECOM_DM_POLICY", "disabled")
print_warning(" Direct messages disabled.")
else:
print_info(" Skipped — configure later with 'hermes gateway setup'")
# ── Home channel (optional) ──
print()
print_info(" Chat ID for scheduled results and notifications.")
home = prompt(" Home chat ID (optional, for cron/notifications)", password=False)
if home:
save_env_value("WECOM_HOME_CHANNEL", home)
print_success(f" Home channel set to {home}")
print()
print_success("💬 WeCom configured!")
def _is_service_installed() -> bool:
"""Check if the gateway is installed as a system service."""
if supports_systemd_services():
return (
get_systemd_unit_path(system=False).exists()
or get_systemd_unit_path(system=True).exists()
)
elif is_macos():
return get_launchd_plist_path().exists()
elif is_windows():
from hermes_cli import gateway_windows
return gateway_windows.is_installed()
return False
def _is_service_running() -> bool:
"""Check if the gateway service is currently running."""
if supports_systemd_services():
user_unit_exists = get_systemd_unit_path(system=False).exists()
system_unit_exists = get_systemd_unit_path(system=True).exists()
if user_unit_exists:
try:
result = _run_systemctl(
["is-active", get_service_name()],
system=False,
capture_output=True,
text=True,
timeout=10,
)
if result.stdout.strip() == "active":
return True
except (RuntimeError, subprocess.TimeoutExpired):
pass
if system_unit_exists:
try:
result = _run_systemctl(
["is-active", get_service_name()],
system=True,
capture_output=True,
text=True,
timeout=10,
)
if result.stdout.strip() == "active":
return True
except (RuntimeError, subprocess.TimeoutExpired):
pass
return False
elif is_macos() and get_launchd_plist_path().exists():
try:
result = subprocess.run(
["launchctl", "list", get_launchd_label()],
capture_output=True,
text=True,
timeout=10,
)
return result.returncode == 0
except subprocess.TimeoutExpired:
return False
elif is_windows():
from hermes_cli import gateway_windows
if gateway_windows.is_installed():
# "installed" doesn't necessarily mean "running" on Windows. The
# canonical check is whether a gateway process actually exists.
return len(find_gateway_pids()) > 0
# Check for manual processes
return len(find_gateway_pids()) > 0
def _setup_weixin():
"""Interactive setup for Weixin / WeChat personal accounts."""
print()
print(color(" ─── 💬 Weixin / WeChat Setup ───", Colors.CYAN))
print()
print_info(" 1. Hermes will open Tencent iLink QR login in this terminal.")
print_info(" 2. Use WeChat to scan and confirm the QR code.")
print_info(
" 3. Hermes will store the returned account_id/token in ~/.hermes/.env."
)
print_info(
" 4. This adapter supports native text, image, video, and document delivery."
)
existing_account = get_env_value("WEIXIN_ACCOUNT_ID")
existing_token = get_env_value("WEIXIN_TOKEN")
if existing_account and existing_token:
print()
print_success("Weixin is already configured.")
if not prompt_yes_no(" Reconfigure Weixin?", False):
return
try:
from gateway.platforms.weixin import check_weixin_requirements, qr_login
except Exception as exc:
print_error(f" Weixin adapter import failed: {exc}")
print_info(" Install gateway dependencies first, then retry.")
return
if not check_weixin_requirements():
print_error(" Missing dependencies: Weixin needs aiohttp and cryptography.")
print_info(" Install them, then rerun `hermes gateway setup`.")
return
print()
if not prompt_yes_no(" Start QR login now?", True):
print_info(" Cancelled.")
return
import asyncio
try:
credentials = asyncio.run(qr_login(str(get_hermes_home())))
except KeyboardInterrupt:
print()
print_warning(" Weixin setup cancelled.")
return
except Exception as exc:
print_error(f" QR login failed: {exc}")
return
if not credentials:
print_warning(" QR login did not complete.")
return
account_id = credentials.get("account_id", "")
token = credentials.get("token", "")
base_url = credentials.get("base_url", "")
user_id = credentials.get("user_id", "")
save_env_value("WEIXIN_ACCOUNT_ID", account_id)
save_env_value("WEIXIN_TOKEN", token)
if base_url:
save_env_value("WEIXIN_BASE_URL", base_url)
save_env_value(
"WEIXIN_CDN_BASE_URL",
get_env_value("WEIXIN_CDN_BASE_URL") or "https://novac2c.cdn.weixin.qq.com/c2c",
)
print()
access_choices = [
"Use DM pairing approval (recommended)",
"Allow all direct messages",
"Only allow listed user IDs",
"Disable direct messages",
]
access_idx = prompt_choice(
" How should direct messages be authorized?", access_choices, 0
)
if access_idx == 0:
save_env_value("WEIXIN_DM_POLICY", "pairing")
save_env_value("WEIXIN_ALLOW_ALL_USERS", "false")
save_env_value("WEIXIN_ALLOWED_USERS", "")
print_success(" DM pairing enabled.")
print_info(
" Unknown DM users can request access and you approve them with `hermes pairing approve`."
)
elif access_idx == 1:
save_env_value("WEIXIN_DM_POLICY", "open")
save_env_value("WEIXIN_ALLOW_ALL_USERS", "true")
save_env_value("WEIXIN_ALLOWED_USERS", "")
print_warning(" Open DM access enabled for Weixin.")
elif access_idx == 2:
default_allow = user_id or ""
allowlist = prompt(
" Allowed Weixin user IDs (comma-separated)", default_allow, password=False
).replace(" ", "")
save_env_value("WEIXIN_DM_POLICY", "allowlist")
save_env_value("WEIXIN_ALLOW_ALL_USERS", "false")
save_env_value("WEIXIN_ALLOWED_USERS", allowlist)
print_success(" Weixin allowlist saved.")
else:
save_env_value("WEIXIN_DM_POLICY", "disabled")
save_env_value("WEIXIN_ALLOW_ALL_USERS", "false")
save_env_value("WEIXIN_ALLOWED_USERS", "")
print_warning(" Direct messages disabled.")
print()
print_info(
" Note: QR login connects an iLink bot identity (e.g. ...@im.bot), not a"
)
print_info(
" scriptable personal WeChat account. Ordinary WeChat groups typically cannot"
)
print_info(
" invite an @im.bot identity, and iLink does not deliver ordinary-group events"
)
print_info(
" to most bot accounts. The settings below only apply when iLink actually"
)
print_info(
" delivers group events for your account type — otherwise DM remains the only"
)
print_info(" working channel regardless of this choice.")
group_choices = [
"Disable group chats (recommended)",
"Allow all group chats",
"Only allow listed group chat IDs",
]
group_idx = prompt_choice(" How should group chats be handled?", group_choices, 0)
if group_idx == 0:
save_env_value("WEIXIN_GROUP_POLICY", "disabled")
save_env_value("WEIXIN_GROUP_ALLOWED_USERS", "")
print_info(" Group chats disabled.")
elif group_idx == 1:
save_env_value("WEIXIN_GROUP_POLICY", "open")
save_env_value("WEIXIN_GROUP_ALLOWED_USERS", "")
print_warning(
" All group chats enabled (only takes effect if iLink delivers group events)."
)
else:
allow_groups = prompt(
" Allowed group chat IDs (comma-separated, not member user IDs)",
"",
password=False,
).replace(" ", "")
save_env_value("WEIXIN_GROUP_POLICY", "allowlist")
save_env_value("WEIXIN_GROUP_ALLOWED_USERS", allow_groups)
print_success(
" Group allowlist saved (only takes effect if iLink delivers group events)."
)
if user_id:
print()
if prompt_yes_no(
f" Use your Weixin user ID ({user_id}) as the home channel?", True
):
save_env_value("WEIXIN_HOME_CHANNEL", user_id)
print_success(f" Home channel set to {user_id}")
print()
print_success("Weixin configured!")
print_info(f" Account ID: {account_id}")
if user_id:
print_info(f" User ID: {user_id}")
def _setup_feishu():
"""Interactive setup for Feishu / Lark — scan-to-create or manual credentials."""
print()
print(color(" ─── 🪽 Feishu / Lark Setup ───", Colors.CYAN))
existing_app_id = get_env_value("FEISHU_APP_ID")
existing_secret = get_env_value("FEISHU_APP_SECRET")
if existing_app_id and existing_secret:
print()
print_success("Feishu / Lark is already configured.")
if not prompt_yes_no(" Reconfigure Feishu / Lark?", False):
return
# ── Choose setup method ──
print()
method_choices = [
"Scan QR code to create a new bot automatically (recommended)",
"Enter existing App ID and App Secret manually",
]
method_idx = prompt_choice(
" How would you like to set up Feishu / Lark?", method_choices, 0
)
credentials = None
used_qr = False
if method_idx == 0:
# ── QR scan-to-create ──
try:
from gateway.platforms.feishu import qr_register
except Exception as exc:
print_error(f" Feishu / Lark onboard import failed: {exc}")
qr_register = None
if qr_register is not None:
try:
credentials = qr_register()
except KeyboardInterrupt:
print()
print_warning(" Feishu / Lark setup cancelled.")
return
except Exception as exc:
print_warning(f" QR registration failed: {exc}")
if credentials:
used_qr = True
if not credentials:
print_info(" QR setup did not complete. Continuing with manual input.")
# ── Manual credential input ──
if not credentials:
print()
print_info(
" Go to https://open.feishu.cn/ (or https://open.larksuite.com/ for Lark)"
)
print_info(
" Create an app, enable the Bot capability, and copy the credentials."
)
print()
app_id = prompt(" App ID", password=False)
if not app_id:
print_warning(" Skipped — Feishu / Lark won't work without an App ID.")
return
app_secret = prompt(" App Secret", password=True)
if not app_secret:
print_warning(" Skipped — Feishu / Lark won't work without an App Secret.")
return
domain_choices = ["feishu (China)", "lark (International)"]
domain_idx = prompt_choice(" Domain", domain_choices, 0)
domain = "lark" if domain_idx == 1 else "feishu"
# Try to probe the bot with manual credentials
bot_name = None
try:
from gateway.platforms.feishu import probe_bot
bot_info = probe_bot(app_id, app_secret, domain)
if bot_info:
bot_name = bot_info.get("bot_name")
print_success(f" Credentials verified — bot: {bot_name or 'unnamed'}")
else:
print_warning(
" Could not verify bot connection. Credentials saved anyway."
)
except Exception as exc:
print_warning(f" Credential verification skipped: {exc}")
credentials = {
"app_id": app_id,
"app_secret": app_secret,
"domain": domain,
"open_id": None,
"bot_name": bot_name,
}
# ── Save core credentials ──
app_id = credentials["app_id"]
app_secret = credentials["app_secret"]
domain = credentials.get("domain", "feishu")
open_id = credentials.get("open_id")
bot_name = credentials.get("bot_name")
save_env_value("FEISHU_APP_ID", app_id)
save_env_value("FEISHU_APP_SECRET", app_secret)
save_env_value("FEISHU_DOMAIN", domain)
# Bot identity is resolved at runtime via _hydrate_bot_identity().
# ── Connection mode ──
if used_qr:
connection_mode = "websocket"
else:
print()
mode_choices = [
"WebSocket (recommended — no public URL needed)",
"Webhook (requires a reachable HTTP endpoint)",
]
mode_idx = prompt_choice(" Connection mode", mode_choices, 0)
connection_mode = "webhook" if mode_idx == 1 else "websocket"
if connection_mode == "webhook":
print_info(" Webhook defaults: 127.0.0.1:8765/feishu/webhook")
print_info(
" Override with FEISHU_WEBHOOK_HOST / FEISHU_WEBHOOK_PORT / FEISHU_WEBHOOK_PATH"
)
print_info(
" For signature verification, set FEISHU_ENCRYPT_KEY and FEISHU_VERIFICATION_TOKEN"
)
save_env_value("FEISHU_CONNECTION_MODE", connection_mode)
if bot_name:
print()
print_success(f" Bot created: {bot_name}")
# ── DM security policy ──
print()
access_choices = [
"Use DM pairing approval (recommended)",
"Allow all direct messages",
"Only allow listed user IDs",
]
access_idx = prompt_choice(
" How should direct messages be authorized?", access_choices, 0
)
if access_idx == 0:
save_env_value("FEISHU_ALLOW_ALL_USERS", "false")
save_env_value("FEISHU_ALLOWED_USERS", "")
print_success(" DM pairing enabled.")
print_info(
" Unknown users can request access; approve with `hermes pairing approve`."
)
elif access_idx == 1:
save_env_value("FEISHU_ALLOW_ALL_USERS", "true")
save_env_value("FEISHU_ALLOWED_USERS", "")
print_warning(" Open DM access enabled for Feishu / Lark.")
else:
save_env_value("FEISHU_ALLOW_ALL_USERS", "false")
default_allow = open_id or ""
allowlist = prompt(
" Allowed user IDs (comma-separated)", default_allow, password=False
).replace(" ", "")
save_env_value("FEISHU_ALLOWED_USERS", allowlist)
print_success(" Allowlist saved.")
# ── Group policy ──
print()
group_choices = [
"Respond only when @mentioned in groups (recommended)",
"Disable group chats",
]
group_idx = prompt_choice(" How should group chats be handled?", group_choices, 0)
if group_idx == 0:
save_env_value("FEISHU_GROUP_POLICY", "open")
print_info(" Group chats enabled (bot must be @mentioned).")
else:
save_env_value("FEISHU_GROUP_POLICY", "disabled")
print_info(" Group chats disabled.")
# ── Home channel ──
print()
home_channel = prompt(
" Home chat ID (optional, for cron/notifications)", password=False
)
if home_channel:
save_env_value("FEISHU_HOME_CHANNEL", home_channel)
print_success(f" Home channel set to {home_channel}")
print()
print_success("🪽 Feishu / Lark configured!")
print_info(f" App ID: {app_id}")
print_info(f" Domain: {domain}")
if bot_name:
print_info(f" Bot: {bot_name}")
def _setup_qqbot():
"""Interactive setup for QQ Bot — scan-to-configure or manual credentials."""
print()
print(color(" ─── 🐧 QQ Bot Setup ───", Colors.CYAN))
existing_app_id = get_env_value("QQ_APP_ID")
existing_secret = get_env_value("QQ_CLIENT_SECRET")
if existing_app_id and existing_secret:
print()
print_success("QQ Bot is already configured.")
if not prompt_yes_no(" Reconfigure QQ Bot?", False):
return
# ── Choose setup method ──
print()
method_choices = [
"Scan QR code to add bot automatically (recommended)",
"Enter existing App ID and App Secret manually",
]
method_idx = prompt_choice(
" How would you like to set up QQ Bot?", method_choices, 0
)
credentials = None
if method_idx == 0:
# ── QR scan-to-configure ──
try:
from gateway.platforms.qqbot import qr_register
credentials = qr_register()
except KeyboardInterrupt:
print()
print_warning(" QQ Bot setup cancelled.")
return
if not credentials:
print_info(" QR setup did not complete. Continuing with manual input.")
# ── Manual credential input ──
if not credentials:
print()
print_info(" Go to https://q.qq.com to register a QQ Bot application.")
print_info(" Note your App ID and App Secret from the application page.")
print()
app_id = prompt(" App ID", password=False)
if not app_id:
print_warning(" Skipped — QQ Bot won't work without an App ID.")
return
app_secret = prompt(" App Secret", password=True)
if not app_secret:
print_warning(" Skipped — QQ Bot won't work without an App Secret.")
return
credentials = {
"app_id": app_id.strip(),
"client_secret": app_secret.strip(),
"user_openid": "",
}
# ── Save core credentials ──
save_env_value("QQ_APP_ID", credentials["app_id"])
save_env_value("QQ_CLIENT_SECRET", credentials["client_secret"])
user_openid = credentials.get("user_openid", "")
# ── DM security policy ──
print()
access_choices = [
"Use DM pairing approval (recommended)",
"Allow all direct messages",
"Only allow listed user OpenIDs",
]
access_idx = prompt_choice(
" How should direct messages be authorized?", access_choices, 0
)
if access_idx == 0:
save_env_value("QQ_ALLOW_ALL_USERS", "false")
if user_openid:
print()
if prompt_yes_no(
f" Add yourself ({user_openid}) to the allow list?", True
):
save_env_value("QQ_ALLOWED_USERS", user_openid)
print_success(f" Allow list set to {user_openid}")
else:
save_env_value("QQ_ALLOWED_USERS", "")
else:
save_env_value("QQ_ALLOWED_USERS", "")
print_success(" DM pairing enabled.")
print_info(
" Unknown users can request access; approve with `hermes pairing approve`."
)
elif access_idx == 1:
save_env_value("QQ_ALLOW_ALL_USERS", "true")
save_env_value("QQ_ALLOWED_USERS", "")
print_warning(" Open DM access enabled for QQ Bot.")
else:
default_allow = user_openid or ""
allowlist = prompt(
" Allowed user OpenIDs (comma-separated)", default_allow, password=False
).replace(" ", "")
save_env_value("QQ_ALLOW_ALL_USERS", "false")
save_env_value("QQ_ALLOWED_USERS", allowlist)
print_success(" Allowlist saved.")
# ── Home channel ──
if user_openid:
print()
if prompt_yes_no(
f" Use your QQ user ID ({user_openid}) as the home channel?", True
):
save_env_value("QQBOT_HOME_CHANNEL", user_openid)
print_success(f" Home channel set to {user_openid}")
else:
print()
home_channel = prompt(
" Home channel OpenID (for cron/notifications, or empty)", password=False
)
if home_channel:
save_env_value("QQBOT_HOME_CHANNEL", home_channel.strip())
print_success(f" Home channel set to {home_channel.strip()}")
print()
print_success("🐧 QQ Bot configured!")
print_info(f" App ID: {credentials['app_id']}")
def _setup_signal():
"""Interactive setup for Signal messenger."""
import shutil
print()
print(color(" ─── 📡 Signal Setup ───", Colors.CYAN))
existing_url = get_env_value("SIGNAL_HTTP_URL")
existing_account = get_env_value("SIGNAL_ACCOUNT")
if existing_url and existing_account:
print()
print_success("Signal is already configured.")
if not prompt_yes_no(" Reconfigure Signal?", False):
return
# Check if signal-cli is available
print()
if shutil.which("signal-cli"):
print_success("signal-cli found on PATH.")
else:
print_warning("signal-cli not found on PATH.")
print_info(" Signal requires signal-cli running as an HTTP daemon.")
print_info(" Install options:")
print_info(
" Linux: download from https://github.com/AsamK/signal-cli/releases"
)
print_info(" macOS: brew install signal-cli")
print_info(" Docker: bbernhard/signal-cli-rest-api")
print()
print_info(" After installing, link your account and start the daemon:")
print_info(' signal-cli link -n "HermesAgent"')
print_info(" signal-cli --account +YOURNUMBER daemon --http 127.0.0.1:8080")
print()
# HTTP URL
print()
print_info(" Enter the URL where signal-cli HTTP daemon is running.")
default_url = existing_url or "http://127.0.0.1:8080"
try:
url = input(f" HTTP URL [{default_url}]: ").strip() or default_url
except (EOFError, KeyboardInterrupt):
print("\n Setup cancelled.")
return
# Test connectivity
print_info(" Testing connection...")
try:
import httpx
resp = httpx.get(f"{url.rstrip('/')}/api/v1/check", timeout=10.0)
if resp.status_code == 200:
print_success(" signal-cli daemon is reachable!")
else:
print_warning(f" signal-cli responded with status {resp.status_code}.")
if not prompt_yes_no(" Continue anyway?", False):
return
except Exception as e:
print_warning(f" Could not reach signal-cli at {url}: {e}")
if not prompt_yes_no(
" Save this URL anyway? (you can start signal-cli later)", True
):
return
save_env_value("SIGNAL_HTTP_URL", url)
# Account phone number
print()
print_info(" Enter your Signal account phone number in E.164 format.")
print_info(" Example: +15551234567")
default_account = existing_account or ""
try:
account = input(
f" Account number{f' [{default_account}]' if default_account else ''}: "
).strip()
if not account:
account = default_account
except (EOFError, KeyboardInterrupt):
print("\n Setup cancelled.")
return
if not account:
print_error(" Account number is required.")
return
save_env_value("SIGNAL_ACCOUNT", account)
# Allowed users
print()
print_info(" The gateway DENIES all users by default for security.")
print_info(" Enter phone numbers or UUIDs of allowed users (comma-separated).")
existing_allowed = get_env_value("SIGNAL_ALLOWED_USERS") or ""
default_allowed = existing_allowed or account
try:
allowed = (
input(f" Allowed users [{default_allowed}]: ").strip() or default_allowed
)
except (EOFError, KeyboardInterrupt):
print("\n Setup cancelled.")
return
save_env_value("SIGNAL_ALLOWED_USERS", allowed)
# Group messaging
print()
if prompt_yes_no(
" Enable group messaging? (disabled by default for security)", False
):
print()
print_info(" Enter group IDs to allow, or * for all groups.")
existing_groups = get_env_value("SIGNAL_GROUP_ALLOWED_USERS") or ""
try:
groups = (
input(f" Group IDs [{existing_groups or '*'}]: ").strip()
or existing_groups
or "*"
)
except (EOFError, KeyboardInterrupt):
print("\n Setup cancelled.")
return
save_env_value("SIGNAL_GROUP_ALLOWED_USERS", groups)
print()
print_success("Signal configured!")
print_info(f" URL: {url}")
print_info(f" Account: {account}")
print_info(" DM auth: via SIGNAL_ALLOWED_USERS + DM pairing")
print_info(
f" Groups: {'enabled' if get_env_value('SIGNAL_GROUP_ALLOWED_USERS') else 'disabled'}"
)
def _builtin_setup_fn(key: str):
"""Resolve the interactive setup function for a built-in platform key.
Late-bound to avoid a circular import with ``hermes_cli.setup`` (which
imports from this module for the remaining bespoke flows).
"""
from hermes_cli import setup as _s
return {
"telegram": _s._setup_telegram,
# discord moved into the plugin: setup_fn is registered by
# plugins/platforms/discord/adapter.py::register() and dispatched
# via the plugin path in _configure_platform().
"slack": _s._setup_slack,
"matrix": _s._setup_matrix,
# mattermost moved into the plugin: setup_fn is registered by
# plugins/platforms/mattermost/adapter.py::register() and dispatched
# via the plugin path in _configure_platform().
"bluebubbles": _s._setup_bluebubbles,
"webhooks": _s._setup_webhooks,
"signal": _setup_signal,
"whatsapp": _setup_whatsapp,
"weixin": _setup_weixin,
"dingtalk": _setup_dingtalk,
"feishu": _setup_feishu,
"wecom": _setup_wecom,
"qqbot": _setup_qqbot,
}.get(key)
def _configure_platform(platform: dict) -> None:
"""Run the interactive setup flow for a single platform.
Dispatch order:
1. Plugin-provided ``setup_fn`` on the registry entry.
2. Built-in setup function matched by platform key.
3. ``_setup_standard_platform`` when the entry has a ``vars`` schema.
4. Env-var hint fallback for plugins that offer no setup helper.
Bundled platform plugins (e.g. IRC) auto-load, so no plugin enable step
is needed here. User-installed platform plugins under ~/.hermes/plugins/
must already be in ``plugins.enabled`` before they appear in this menu.
"""
entry = platform.get("_registry_entry")
if entry is not None and entry.setup_fn is not None:
entry.setup_fn()
return
fn = _builtin_setup_fn(platform["key"])
if fn is not None:
fn()
return
if platform.get("vars"):
_setup_standard_platform(platform)
return
# Plugin with no setup helper — show env-var instructions.
label = platform.get("label", platform["key"])
emoji = platform.get("emoji", "🔌")
print()
print(color(f" ─── {emoji} {label} Setup ───", Colors.CYAN))
required = entry.required_env if entry else []
if required:
print_info(f" Set these env vars in ~/.hermes/.env: {', '.join(required)}")
else:
print_info(
f" Configure {label} in config.yaml under gateway.platforms.{platform['key']}"
)
if platform.get("install_hint"):
print_info(f" {platform['install_hint']}")
def gateway_setup():
"""Interactive setup for messaging platforms + gateway service."""
if is_managed():
managed_error("run gateway setup")
return
print()
print(
color(
"┌─────────────────────────────────────────────────────────┐",
Colors.MAGENTA,
)
)
print(
color(
"│ ⚕ Gateway Setup │", Colors.MAGENTA
)
)
print(
color(
"├─────────────────────────────────────────────────────────┤",
Colors.MAGENTA,
)
)
print(
color(
"│ Configure messaging platforms and the gateway service. │",
Colors.MAGENTA,
)
)
print(
color(
"│ Press Ctrl+C at any time to exit. │", Colors.MAGENTA
)
)
print(
color(
"└─────────────────────────────────────────────────────────┘",
Colors.MAGENTA,
)
)
# ── Gateway service status ──
print()
service_installed = _is_service_installed()
service_running = _is_service_running()
if supports_systemd_services() and has_conflicting_systemd_units():
print_systemd_scope_conflict_warning()
print()
if supports_systemd_services() and has_legacy_hermes_units():
print_legacy_unit_warning()
print()
if service_installed and service_running:
print_success("Gateway service is installed and running.")
elif service_installed:
print_warning("Gateway service is installed but not running.")
if supports_systemd_services() and _system_scope_wizard_would_need_root():
_print_system_scope_remediation("start")
elif prompt_yes_no(" Start it now?", True):
try:
if supports_systemd_services():
systemd_start()
elif is_macos():
launchd_start()
except UserSystemdUnavailableError as e:
print_error(" Failed to start — user systemd not reachable:")
for line in str(e).splitlines():
print(f" {line}")
except SystemScopeRequiresRootError as e:
# Defense in depth: the pre-check above should have caught
# this, but handle the race/edge case gracefully instead of
# letting the exception escape the wizard.
print_error(f" Failed to start: {e}")
_print_system_scope_remediation("start")
except subprocess.CalledProcessError as e:
print_error(f" Failed to start: {e}")
else:
print_info("Gateway service is not installed yet.")
print_info("You'll be offered to install it after configuring platforms.")
# ── Platform configuration loop ──
while True:
print()
print_header("Messaging Platforms")
platforms = _all_platforms()
menu_items = [
f"{p['emoji']} {p['label']} ({_platform_status(p)})" for p in platforms
]
menu_items.append("Done")
choice = prompt_choice(
"Select a platform to configure:", menu_items, len(menu_items) - 1
)
if choice == len(platforms):
break
_configure_platform(platforms[choice])
# ── Post-setup: offer to install/restart gateway ──
# Consider any platform (built-in or plugin) where the user has made
# meaningful progress. ``_platform_status`` already handles plugin
# entries via their check_fn and per-platform dual-states like
# WhatsApp's "enabled, not paired".
def _is_progress(status: str) -> bool:
s = status.lower()
return not (
s == "not configured"
or s.startswith("partially")
or s.startswith("plugin disabled")
)
any_configured = any(_is_progress(_platform_status(p)) for p in _all_platforms())
if any_configured:
print()
print(color("" * 58, Colors.DIM))
service_installed = _is_service_installed()
service_running = _is_service_running()
if service_running:
if supports_systemd_services() and _system_scope_wizard_would_need_root():
_print_system_scope_remediation("restart")
elif prompt_yes_no(" Restart the gateway to pick up changes?", True):
try:
if supports_systemd_services():
systemd_restart()
elif is_macos():
launchd_restart()
elif is_windows():
from hermes_cli import gateway_windows
gateway_windows.restart()
else:
stop_profile_gateway()
print_info("Start manually: hermes gateway")
except UserSystemdUnavailableError as e:
print_error(" Restart failed — user systemd not reachable:")
for line in str(e).splitlines():
print(f" {line}")
except SystemScopeRequiresRootError as e:
print_error(f" Restart failed: {e}")
_print_system_scope_remediation("restart")
except subprocess.CalledProcessError as e:
print_error(f" Restart failed: {e}")
elif service_installed:
if supports_systemd_services() and _system_scope_wizard_would_need_root():
_print_system_scope_remediation("start")
elif prompt_yes_no(" Start the gateway service?", True):
try:
if supports_systemd_services():
systemd_start()
elif is_macos():
launchd_start()
elif is_windows():
from hermes_cli import gateway_windows
gateway_windows.start()
except UserSystemdUnavailableError as e:
print_error(" Start failed — user systemd not reachable:")
for line in str(e).splitlines():
print(f" {line}")
except SystemScopeRequiresRootError as e:
print_error(f" Start failed: {e}")
_print_system_scope_remediation("start")
except subprocess.CalledProcessError as e:
print_error(f" Start failed: {e}")
else:
print()
if supports_systemd_services() or is_macos() or is_windows():
if supports_systemd_services():
platform_name = "systemd"
elif is_macos():
platform_name = "launchd"
else:
platform_name = "Scheduled Task"
wsl_note = " (note: services may not survive WSL restarts)" if is_wsl() else ""
start_now = prompt_yes_no(" Start the gateway now?", True)
start_on_login = prompt_yes_no(
f" Start the gateway automatically on login/boot as a {platform_name} service?{wsl_note}",
True,
)
if start_now or start_on_login:
try:
installed_scope = None
did_install = False
if supports_systemd_services():
installed_scope, did_install = install_linux_gateway_from_setup(
force=False,
enable_on_startup=start_on_login,
)
elif is_macos():
launchd_install(force=False)
did_install = True
else:
from hermes_cli import gateway_windows
gateway_windows.install(force=False)
did_install = True
print()
if did_install and start_now:
try:
if supports_systemd_services():
systemd_start(system=installed_scope == "system")
elif is_macos():
launchd_start()
elif is_windows():
from hermes_cli import gateway_windows
gateway_windows.start()
except UserSystemdUnavailableError as e:
print_error(
" Start failed — user systemd not reachable:"
)
for line in str(e).splitlines():
print(f" {line}")
except subprocess.CalledProcessError as e:
print_error(f" Start failed: {e}")
except subprocess.CalledProcessError as e:
print_error(f" Install failed: {e}")
print_info(" You can try manually: hermes gateway install")
else:
print_info(" Skipped start and auto-start setup.")
print_info(" You can install later: hermes gateway install")
if supports_systemd_services():
print_info(
" Or as a boot-time service: sudo hermes gateway install --system"
)
print_info(" Or run in foreground: hermes gateway run")
elif is_wsl():
print_info(" WSL detected but systemd is not running.")
print_info(" Run in foreground: hermes gateway run")
print_info(
" For persistence: tmux new -s hermes 'hermes gateway run'"
)
print_info(
" To enable systemd: add systemd=true to /etc/wsl.conf, then 'wsl --shutdown'"
)
elif is_termux():
from hermes_constants import display_hermes_home as _dhh
print_info(" Termux does not use systemd/launchd services.")
print_info(" Run in foreground: hermes gateway run")
print_info(
f" Or start it manually in the background (best effort): nohup hermes gateway run >{_dhh()}/logs/gateway.log 2>&1 &"
)
else:
print_info(" Service install not supported on this platform.")
print_info(" Run in foreground: hermes gateway run")
else:
print()
print_info("No platforms configured. Run 'hermes gateway setup' when ready.")
print()
# =============================================================================
# Main Command Handler
# =============================================================================
def _dispatch_via_service_manager_if_s6(
action: str, profile: str | None = None,
) -> bool:
"""If we're in a container with s6, dispatch gateway lifecycle via s6.
Returns True iff dispatched (caller should ``return``); False
otherwise — caller continues with the host-side code path.
``action`` is one of ``start`` / ``stop`` / ``restart``. The
profile defaults to the current one (resolved via ``_profile_arg``).
The s6 service slot was created either by the Phase 4 profile-create
hook or by the container-boot reconciler (cont-init.d/02-…). If it
doesn't exist or s6 returns an error, the named errors from
:mod:`hermes_cli.service_manager` are caught and surfaced as
actionable CLI messages (no raw ``CalledProcessError`` traceback).
"""
from hermes_cli.service_manager import (
GatewayNotRegisteredError,
S6CommandError,
detect_service_manager,
get_service_manager,
)
if detect_service_manager() != "s6":
return False
if profile is None:
# _profile_suffix() returns the bare profile name for
# HERMES_HOME=<root>/profiles/<name>, "" for the default root,
# or a hash for unrelated paths. Map "" → "default" so the
# default-profile gateway is reachable as gateway-default.
profile = _profile_suffix() or "default"
mgr = get_service_manager()
service_name = f"gateway-{profile}"
try:
if action == "start":
mgr.start(service_name)
elif action == "stop":
mgr.stop(service_name)
elif action == "restart":
mgr.restart(service_name)
else:
return False
except GatewayNotRegisteredError as exc:
print(f"{exc}")
sys.exit(1)
except S6CommandError as exc:
print(f"{exc}")
sys.exit(1)
return True
def _dispatch_all_via_service_manager_if_s6(action: str) -> bool:
"""Inside a container with s6, dispatch ``--all`` lifecycle to every
registered profile gateway.
Returns True iff dispatched (caller should ``return``); False
otherwise — caller continues with the host-side code path.
Without this, ``hermes gateway stop --all`` and ``... restart --all``
fall through to ``kill_gateway_processes(all_profiles=True)``, which
just ``pkill``s every gateway process. s6-supervise observes the
crash and restarts each one ~1s later — so ``--all`` ends up
*kicking* every gateway instead of *stopping* it. By iterating
``list_profile_gateways()`` and sending the lifecycle command
through the service manager we get the intended semantics (s6's
``want up``/``want down`` flips correctly so supervise stays down
after a stop).
``action`` is one of ``stop`` / ``restart`` (``start --all`` isn't
a supported CLI surface).
"""
from hermes_cli.service_manager import (
detect_service_manager,
get_service_manager,
)
if detect_service_manager() != "s6":
return False
if action not in ("stop", "restart"):
return False
mgr = get_service_manager()
profiles = mgr.list_profile_gateways()
if not profiles:
print("✗ No profile gateways registered under s6")
return True
fn = mgr.stop if action == "stop" else mgr.restart
errors: list[tuple[str, Exception]] = []
for profile in profiles:
service_name = f"gateway-{profile}"
try:
fn(service_name)
except Exception as exc: # noqa: BLE001 — report and continue
errors.append((profile, exc))
succeeded = len(profiles) - len(errors)
verb = "stopped" if action == "stop" else "restarted"
if succeeded:
print(f"{verb.capitalize()} {succeeded} profile gateway(s) under s6")
for profile, exc in errors:
print(f"✗ Could not {action} gateway-{profile}: {exc}")
return True
def gateway_command(args):
"""Handle gateway subcommands."""
try:
return _gateway_command_inner(args)
except UserSystemdUnavailableError as e:
# Clean, actionable message instead of a traceback when the user D-Bus
# session is unreachable (fresh SSH shell, no linger, container, etc.).
print_error("User systemd not reachable:")
for line in str(e).splitlines():
print(f" {line}")
sys.exit(1)
except SystemScopeRequiresRootError as e:
# The direct ``hermes gateway install|uninstall|start|stop|restart``
# path lands here when the user typed a system-scope action without
# sudo. Same exit code as before — just gives the wizard a way to
# intercept the same condition with friendlier guidance before the
# error is raised.
print(str(e))
sys.exit(1)
def _maybe_redirect_run_to_s6_supervision(args) -> bool:
"""Inside an s6 container, redirect bare ``gateway run`` to the
supervised path.
Background. Before the s6 image landed, ``docker run <image> gateway
run`` was the standard way to start a containerized gateway: the
gateway was the container's main process, tini reaped zombies, and
container exit code == gateway exit code. With s6-overlay as PID 1,
we'd much rather have the gateway run as a supervised s6 longrun
(auto-restart on crash, dashboard supervised alongside, multiple
profile gateways under the same /init). This redirect upgrades the
old invocation transparently — the user gets the new behavior
without changing their docker run command.
Three gates make this a no-op outside the intended scope:
1. ``_dispatch_via_service_manager_if_s6`` returns False unless
we're in a container with s6 as PID 1. Host runs of
``hermes gateway run`` are unaffected.
2. ``HERMES_S6_SUPERVISED_CHILD`` is exported by
``S6ServiceManager._render_run_script`` for the supervised
process itself — i.e. when s6-supervise execs ``hermes gateway
run --replace`` as a longrun, this guard short-circuits the
redirect so the supervised gateway actually runs in
foreground (otherwise we'd recurse: run → start → run → start
→ ...).
3. ``--no-supervise`` (or ``HERMES_GATEWAY_NO_SUPERVISE=1``) opts
out for users who genuinely want pre-s6 semantics — CI smoke
tests, debugging the foreground startup path, etc.
Returns True iff dispatched (caller should ``return``).
"""
no_supervise = getattr(args, "no_supervise", False) or \
os.environ.get("HERMES_GATEWAY_NO_SUPERVISE", "").lower() in ("1", "true", "yes")
if no_supervise:
return False
if os.environ.get("HERMES_S6_SUPERVISED_CHILD"):
# We ARE the supervised child s6-supervise is running. Fall
# through to the foreground code path so the gateway actually
# starts.
return False
if not _dispatch_via_service_manager_if_s6("start"):
return False
# Loud breadcrumb: explain the upgrade and how to opt out. Print to
# stderr so it doesn't pollute stdout-parsing scripts. The
# supervised gateway's own logs are routed by s6-log to both
# `docker logs` and ${HERMES_HOME}/logs/gateways/<profile>/current,
# so the user sees a clear sequence: this banner first, then the
# gateway's own stdout/stderr from the supervisor.
print(
"→ gateway is now running under s6 supervision (auto-restart on crash,\n"
" dashboard supervised alongside if HERMES_DASHBOARD is set).\n"
" This is the recommended setup for the s6 container image — the\n"
" gateway will keep running even if it crashes.\n"
" Use `--no-supervise` (or HERMES_GATEWAY_NO_SUPERVISE=1) to opt out\n"
" and get the pre-s6 foreground behavior instead.",
file=sys.stderr,
flush=True,
)
# Block until the container is signalled. The supervised gateway's
# lifetime is independent of this process — s6-supervise restarts
# it on crash, and we don't want the container to exit when the
# gateway flaps. `sleep infinity` matches the static main-hermes
# service's pattern (see docker/s6-rc.d/main-hermes/run): the CMD
# process is a no-op heartbeat that keeps /init alive until
# `docker stop` sends SIGTERM, at which point /init runs stage 3
# shutdown (which tears down the supervised gateway cleanly).
os.execvp("sleep", ["sleep", "infinity"])
def _gateway_command_inner(args):
subcmd = getattr(args, "gateway_command", None)
# Default to run if no subcommand
if subcmd is None or subcmd == "run":
if _maybe_redirect_run_to_s6_supervision(args):
return # unreachable; execvp doesn't return
verbose = getattr(args, "verbose", 0)
quiet = getattr(args, "quiet", False)
replace = getattr(args, "replace", False)
run_gateway(verbose, quiet=quiet, replace=replace)
return
if subcmd == "setup":
gateway_setup()
return
# Service management commands
if subcmd == "install":
if is_managed():
managed_error("install gateway service (managed by NixOS)")
return
force = getattr(args, "force", False)
system = getattr(args, "system", False)
run_as_user = getattr(args, "run_as_user", None)
if is_termux():
print("Gateway service installation is not supported on Termux.")
print("Run manually: hermes gateway")
sys.exit(1)
if supports_systemd_services():
if is_wsl():
print_warning(
"WSL detected — systemd services may not survive WSL restarts."
)
print_info(
" Consider running in foreground instead: hermes gateway run"
)
print_info(
" Or use tmux/screen for persistence: tmux new -s hermes 'hermes gateway run'"
)
print()
start_now = prompt_yes_no("Start the gateway now after installing the service?", True)
start_on_login = prompt_yes_no("Start the gateway automatically on login/boot with systemd?", True)
systemd_install(
force=force,
system=system,
run_as_user=run_as_user,
enable_on_startup=start_on_login,
)
if start_now:
systemd_start(system=system)
elif is_macos():
launchd_install(force)
elif is_windows():
from hermes_cli import gateway_windows
gateway_windows.install(
force=force,
start_now=getattr(args, 'start_now', None),
start_on_login=getattr(args, 'start_on_login', None),
elevated_handoff=getattr(args, 'elevated_handoff', False),
)
elif is_wsl():
print("WSL detected but systemd is not running.")
print(
"Either enable systemd (add systemd=true to /etc/wsl.conf and restart WSL)"
)
print("or run the gateway in foreground mode:")
print()
print(
" hermes gateway run # direct foreground"
)
print(
" tmux new -s hermes 'hermes gateway run' # persistent via tmux"
)
print(
" nohup hermes gateway run > ~/.hermes/logs/gateway.log 2>&1 & # background"
)
sys.exit(1)
elif is_container():
# Phase 4: inside a container with s6 the gateway service is
# auto-registered when the profile is created (and reconciled
# at every container boot). `install` is therefore informational.
from hermes_cli.service_manager import detect_service_manager
if detect_service_manager() == "s6":
print("Per-profile gateways are auto-registered when you create a profile.")
print()
print(" hermes profile create <name> # creates the s6 service slot")
print(" hermes -p <name> gateway start # bring it up via s6")
print(" hermes status # see currently-supervised gateways")
return
# Fallback for pre-s6 containers or other container runtimes
# we haven't taught about supervision (Podman without our
# /init, k8s plain runs, etc.) — the historical guidance still
# applies.
print("Service installation is not needed inside a Docker container.")
print(
"The container runtime is your service manager — use Docker restart policies instead:"
)
print()
print(
" docker run --restart unless-stopped ... # auto-restart on crash/reboot"
)
print(" docker restart <container> # manual restart")
print()
print("To run the gateway: hermes gateway run")
sys.exit(0)
else:
print("Service installation not supported on this platform.")
print("Run manually: hermes gateway run")
sys.exit(1)
elif subcmd == "uninstall":
if is_managed():
managed_error("uninstall gateway service (managed by NixOS)")
return
system = getattr(args, "system", False)
if is_termux():
print(
"Gateway service uninstall is not supported on Termux because there is no managed service to remove."
)
print("Stop manual runs with: hermes gateway stop")
sys.exit(1)
if supports_systemd_services():
systemd_uninstall(system=system)
elif is_macos():
launchd_uninstall()
elif is_windows():
from hermes_cli import gateway_windows
gateway_windows.uninstall()
elif is_container():
from hermes_cli.service_manager import detect_service_manager
if detect_service_manager() == "s6":
print("Per-profile gateways are auto-unregistered when you delete the profile.")
print()
print(" hermes profile delete <name> # tears down the s6 service slot")
print(" hermes -p <name> gateway stop # stop without deleting the profile")
return
print("Service uninstall is not applicable inside a Docker container.")
print("To stop the gateway, stop or remove the container:")
print()
print(" docker stop <container>")
print(" docker rm <container>")
sys.exit(0)
else:
print("Not supported on this platform.")
sys.exit(1)
elif subcmd == "start":
system = getattr(args, "system", False)
start_all = getattr(args, "all", False)
# Phase 4: inside a container with s6, dispatch via the service
# manager instead of falling through to systemd/launchd/windows.
# `--all` isn't meaningful here (each profile has its own service
# slot — start them individually via `hermes -p <name> gateway
# start`), so just bring up the current profile's slot.
if not start_all and _dispatch_via_service_manager_if_s6("start"):
return
if start_all:
# Kill all stale gateway processes across all profiles before starting
killed = kill_gateway_processes(all_profiles=True)
if killed:
print(
f"✓ Killed {killed} stale gateway process(es) across all profiles"
)
_wait_for_gateway_exit(timeout=10.0, force_after=5.0)
if is_termux():
print(
"Gateway service start is not supported on Termux because there is no system service manager."
)
print("Run manually: hermes gateway")
sys.exit(1)
if supports_systemd_services():
systemd_start(system=system)
elif is_macos():
launchd_start()
elif is_windows():
from hermes_cli import gateway_windows
gateway_windows.start()
elif is_wsl():
print("WSL detected but systemd is not available.")
print("Run the gateway in foreground mode instead:")
print()
print(
" hermes gateway run # direct foreground"
)
print(
" tmux new -s hermes 'hermes gateway run' # persistent via tmux"
)
print(
" nohup hermes gateway run > ~/.hermes/logs/gateway.log 2>&1 & # background"
)
print()
print(
"To enable systemd: add systemd=true to /etc/wsl.conf and run 'wsl --shutdown' from PowerShell."
)
sys.exit(1)
elif is_container():
# Reached only when s6 ISN'T running (the early dispatch
# above handles the s6 case). Pre-s6 containers or other
# container runtimes that don't ship our /init get the
# historical guidance: the gateway is the container's main
# process, so use docker lifecycle commands.
print("Service start is not applicable inside a Docker container.")
print("The gateway runs as the container's main process.")
print()
print(" docker start <container> # start a stopped container")
print(" docker restart <container> # restart a running container")
print()
print("Or run the gateway directly: hermes gateway run")
sys.exit(0)
else:
print("Not supported on this platform.")
sys.exit(1)
elif subcmd == "stop":
# Defense: refuse self-targeting gateway stop from inside the gateway.
# Prevents agent-initiated kill loops when combined with supervisor KeepAlive.
if os.getenv("_HERMES_GATEWAY") == "1":
print_error(
"Refusing to stop the gateway from inside the gateway process.\n"
"This command was blocked to prevent restart loops.\n"
"Use `hermes gateway stop` from a shell outside the running gateway."
)
sys.exit(1)
stop_all = getattr(args, "all", False)
system = getattr(args, "system", False)
# Phase 4: inside a container with s6, dispatch via the service
# manager. ``--all`` iterates every registered profile gateway
# through s6 (otherwise it would fall through to ``pkill``,
# which s6-supervise observes as a crash and immediately restarts).
if stop_all and _dispatch_all_via_service_manager_if_s6("stop"):
return
if not stop_all and _dispatch_via_service_manager_if_s6("stop"):
return
if stop_all:
# --all: kill every gateway process on the machine
service_available = False
if supports_systemd_services() and (
get_systemd_unit_path(system=False).exists()
or get_systemd_unit_path(system=True).exists()
):
try:
systemd_stop(system=system)
service_available = True
except subprocess.CalledProcessError:
pass
elif is_macos() and get_launchd_plist_path().exists():
try:
launchd_stop()
service_available = True
except subprocess.CalledProcessError:
pass
elif is_windows():
from hermes_cli import gateway_windows
if gateway_windows.is_installed():
try:
gateway_windows.stop()
service_available = True
except (subprocess.CalledProcessError, RuntimeError):
pass
killed = kill_gateway_processes(all_profiles=True)
total = killed + (1 if service_available else 0)
if total:
print(f"✓ Stopped {total} gateway process(es) across all profiles")
else:
print("✗ No gateway processes found")
else:
# Default: stop only the current profile's gateway
service_available = False
if supports_systemd_services() and (
get_systemd_unit_path(system=False).exists()
or get_systemd_unit_path(system=True).exists()
):
try:
systemd_stop(system=system)
service_available = True
except subprocess.CalledProcessError:
pass
elif is_macos() and get_launchd_plist_path().exists():
try:
launchd_stop()
service_available = True
except subprocess.CalledProcessError:
pass
elif is_windows():
from hermes_cli import gateway_windows
if gateway_windows.is_installed():
try:
gateway_windows.stop()
service_available = True
except (subprocess.CalledProcessError, RuntimeError):
pass
if not service_available:
# No systemd/launchd/schtasks service — use profile-scoped PID file
if stop_profile_gateway():
print("✓ Stopped gateway for this profile")
else:
print("✗ No gateway running for this profile")
else:
print(f"✓ Stopped {get_service_name()} service")
elif subcmd == "restart":
# Defense: refuse self-targeting gateway restart from inside the gateway.
# Prevents agent-initiated kill loops when combined with supervisor KeepAlive.
if os.getenv("_HERMES_GATEWAY") == "1":
print_error(
"Refusing to restart the gateway from inside the gateway process.\n"
"This command was blocked to prevent restart loops.\n"
"Use `hermes gateway restart` from a shell outside the running gateway."
)
sys.exit(1)
# Try service first, fall back to killing and restarting
service_available = False
system = getattr(args, "system", False)
restart_all = getattr(args, "all", False)
service_configured = False
# Phase 4: inside a container with s6, dispatch via the service
# manager (s6-svc -t restarts the supervised process). ``--all``
# iterates every registered profile gateway through s6; without
# this it would fall through to ``pkill``, which s6-supervise
# would observe as a crash and immediately restart anyway.
if restart_all and _dispatch_all_via_service_manager_if_s6("restart"):
return
if not restart_all and _dispatch_via_service_manager_if_s6("restart"):
return
if restart_all:
# --all: stop every gateway process across all profiles, then start fresh
service_stopped = False
if supports_systemd_services() and (
get_systemd_unit_path(system=False).exists()
or get_systemd_unit_path(system=True).exists()
):
try:
systemd_stop(system=system)
service_stopped = True
except subprocess.CalledProcessError:
pass
elif is_macos() and get_launchd_plist_path().exists():
try:
launchd_stop()
service_stopped = True
except subprocess.CalledProcessError:
pass
elif is_windows():
from hermes_cli import gateway_windows
if gateway_windows.is_installed():
try:
gateway_windows.stop()
service_stopped = True
except (subprocess.CalledProcessError, RuntimeError):
pass
killed = kill_gateway_processes(all_profiles=True)
total = killed + (1 if service_stopped else 0)
if total:
print(f"✓ Stopped {total} gateway process(es) across all profiles")
_wait_for_gateway_exit(timeout=10.0, force_after=5.0)
# Start the current profile's service fresh
print("Starting gateway...")
if supports_systemd_services() and (
get_systemd_unit_path(system=False).exists()
or get_systemd_unit_path(system=True).exists()
):
systemd_start(system=system)
elif is_macos() and get_launchd_plist_path().exists():
launchd_start()
elif is_windows():
from hermes_cli import gateway_windows
# On Windows, even without a registered Scheduled Task / Startup
# entry, gateway_windows.start() uses the safe detached
# pythonw.exe launcher. Do not fall back to run_gateway() here:
# when invoked from a gateway-hosted agent/tool call, foreground
# run_gateway() is tied to the very gateway process we just
# stopped and can die before the replacement is stable.
gateway_windows.start()
else:
run_gateway(verbose=0)
return
if supports_systemd_services() and (
get_systemd_unit_path(system=False).exists()
or get_systemd_unit_path(system=True).exists()
):
service_configured = True
try:
systemd_restart(system=system)
service_available = True
except subprocess.CalledProcessError:
pass
elif is_macos() and get_launchd_plist_path().exists():
service_configured = True
try:
launchd_restart()
service_available = True
except subprocess.CalledProcessError:
pass
elif is_windows():
from hermes_cli import gateway_windows
# Prefer the Windows-specific restart path: it supports both
# registered Scheduled Task / Startup installs and no-service
# detached restarts. In the normal successful Telegram-triggered
# restart flow, this avoids the generic foreground run_gateway()
# path that can be reaped with the old gateway process. If the
# Windows backend raises, intentionally preserve the existing
# generic failure fallback below.
service_configured = gateway_windows.is_installed()
try:
gateway_windows.restart()
return
except (subprocess.CalledProcessError, RuntimeError, OSError):
pass
if not service_available:
# systemd/launchd restart failed — check if linger is the issue
if supports_systemd_services():
linger_ok, _detail = get_systemd_linger_status()
if linger_ok is not True:
import getpass
_username = getpass.getuser()
print()
print(
"⚠ Cannot restart gateway as a service — linger is not enabled."
)
print(
" The gateway user service requires linger to function on headless servers."
)
print()
print(f" Run: sudo loginctl enable-linger {_username}")
print()
print(" Then restart the gateway:")
print(" hermes gateway restart")
return
if service_configured:
print()
print("✗ Gateway service restart failed.")
print(
" The service definition exists, but the service manager did not recover it."
)
print(" Fix the service, then retry: hermes gateway start")
sys.exit(1)
# Manual restart: stop only this profile's gateway
if stop_profile_gateway():
print("✓ Stopped gateway for this profile")
_wait_for_gateway_exit(timeout=10.0, force_after=5.0)
# Start fresh
print("Starting gateway...")
run_gateway(verbose=0)
elif subcmd == "status":
deep = getattr(args, "deep", False)
full = getattr(args, "full", False)
system = getattr(args, "system", False)
snapshot = get_gateway_runtime_snapshot(system=system)
# Check for service first
_windows_service_installed = False
if is_windows():
from hermes_cli import gateway_windows
_windows_service_installed = gateway_windows.is_installed()
if supports_systemd_services() and (
get_systemd_unit_path(system=False).exists()
or get_systemd_unit_path(system=True).exists()
):
systemd_status(deep, system=system, full=full)
_print_gateway_process_mismatch(snapshot)
elif is_macos() and get_launchd_plist_path().exists():
launchd_status(deep)
_print_gateway_process_mismatch(snapshot)
elif _windows_service_installed:
from hermes_cli import gateway_windows
gateway_windows.status(deep=deep)
_print_gateway_process_mismatch(snapshot)
else:
# Check for manually running processes
pids = list(snapshot.gateway_pids)
if pids:
print(f"✓ Gateway is running (PID: {', '.join(map(str, pids))})")
print(" (Running manually, not as a system service)")
runtime_lines = _runtime_health_lines()
if runtime_lines:
print()
print("Recent gateway health:")
for line in runtime_lines:
print(f" {line}")
print()
if is_termux():
print("Termux note:")
print(" Android may stop background jobs when Termux is suspended")
elif is_wsl():
print("WSL note:")
print(
" The gateway is running in foreground/manual mode (recommended for WSL)."
)
print(
" Use tmux or screen for persistence across terminal closes."
)
elif is_windows():
print(
"To install as a Windows Scheduled Task (auto-start on login):"
)
print(" hermes gateway install")
else:
print("To install as a service:")
print(" hermes gateway install")
print(" sudo hermes gateway install --system")
else:
print("✗ Gateway is not running")
runtime_lines = _runtime_health_lines()
if runtime_lines:
print()
print("Recent gateway health:")
for line in runtime_lines:
print(f" {line}")
print()
print("To start:")
print(" hermes gateway run # Run in foreground")
if is_termux():
print(
" nohup hermes gateway run > ~/.hermes/logs/gateway.log 2>&1 & # Best-effort background start"
)
elif is_wsl():
print(
" tmux new -s hermes 'hermes gateway run' # persistent via tmux"
)
print(
" nohup hermes gateway run > ~/.hermes/logs/gateway.log 2>&1 & # background"
)
elif is_windows():
print(
" hermes gateway install # Install as Windows Scheduled Task (auto-start on login)"
)
else:
print(" hermes gateway install # Install as user service")
print(
" sudo hermes gateway install --system # Install as boot-time system service"
)
# Show other profiles' gateway status for multi-profile awareness
_print_other_profiles_gateway_status()
elif subcmd == "list":
_gateway_list()
elif subcmd == "migrate-legacy":
# Stop, disable, and remove legacy Hermes gateway unit files from
# pre-rename installs (e.g. hermes.service). Profile units and
# unrelated third-party services are never touched.
dry_run = getattr(args, "dry_run", False)
yes = getattr(args, "yes", False)
if not supports_systemd_services() and not is_macos():
print("Legacy unit migration only applies to systemd-based Linux hosts.")
return
remove_legacy_hermes_units(interactive=not yes, dry_run=dry_run)