hermes-agent/hermes_cli
Teknium fc086da8bd
fix(gateway,windows): reliability — JOB breakaway + status --deep probes + test-leak fix (#40909)
* fix(gateway,windows): reliability — supervisor task, JOB breakaway, status --deep

Three coordinated fixes for the Windows gateway reliability story:

1. CREATE_BREAKAWAY_FROM_JOB on every detached spawn

   The 'hermes update' triggered from the Electron Desktop GUI ran inside
   Electron's job object. Without breakaway, the post-update gateway
   watcher spawned by update — already DETACHED_PROCESS — was still
   reaped when Electron's job tore down, so the gateway never came back
   after a GUI-initiated update. Adds CREATE_BREAKAWAY_FROM_JOB (0x01000000)
   to:
     - hermes_cli/_subprocess_compat.py::windows_detach_flags() — used by
       every helper that calls windows_detach_popen_kwargs(), including
       launch_detached_profile_gateway_restart()
     - The watcher subprocess's own respawn snippet in
       hermes_cli/gateway.py (inlined flags so the watcher's child
       respawn also breaks away)

   _spawn_detached() in gateway_windows.py already had the flag; this
   change brings the rest of the codebase to parity.

2. Per-minute supervisor Scheduled Task — Windows equivalent of
   systemd Restart=always

   Introduces hermes_cli/gateway_supervisor.py and registers it as a
   second Scheduled Task ('Hermes_Gateway_Supervisor', SC MINUTE /MO 1,
   LIMITED rights) alongside the existing ONLOGON task. Every minute,
   the supervisor uses the same gateway.status.get_running_pid() probe
   as 'hermes gateway status' and, if no gateway is alive, calls
   gateway_windows._spawn_detached() (which now includes BREAKAWAY) to
   bring one back.

   Covers every crash mode, not just 'machine rebooted': taskkill,
   OOM, GUI update SIGTERM, parent job teardown. Cheap — one pythonw
   startup per minute when down, one PID-existence check per minute
   when up.

   Wired into both the schtasks-success and Startup-folder-fallback
   install paths via _install_supervisor_best_effort(), and removed in
   uninstall(). Best-effort: a failing supervisor install logs a
   warning but doesn't roll back the primary install.

3. 'hermes gateway status --deep' shows per-probe PASS/FAIL

   Replaces the existing terse '--deep' output (which only printed
   paths) with an actual diagnostic table:
     [1] PID file present
     [2] Lock file held by a live process
     [3] get_running_pid() result
     [4] _pid_exists(pid) — OS-level liveness
     [5] gateway_state.json (state + age)
     [6] Last lifecycle event from gateway-exit-diag.log

   When the high-level summary disagrees with reality, the user can
   see exactly which signal is lying.

Test-leak fix
-------------

tests/hermes_cli/test_gateway_wsl.py::TestGatewayCommandWSLMessages
monkey-patched is_linux/is_wsl/supports_systemd_services to simulate
WSL but did NOT stub is_windows(). On a Windows host, the dispatcher
in _gateway_command_inner takes the is_windows() branch BEFORE the
WSL guidance branch, so the test invoked gateway_windows.install()
for real. install() writes to %APPDATA%\...\Startup\Hermes_Gateway.cmd
— the REAL user Startup folder, never sandboxed by tmp_path — pointing
at the test's pytest-of-<user>/pytest-<N>/.../gateway-service/ wrapper.
When pytest tore down the tmp_path, every subsequent Windows login
flashed a cmd.exe window that failed to find the missing target.

Stubs is_windows=False on all four affected tests:
  test_install_wsl_no_systemd
  test_start_wsl_no_systemd
  test_status_wsl_running_manual
  test_status_wsl_not_running

Defense-in-depth: _build_startup_launcher() now prefixes the launcher
with 'if not exist <target> exit /b 0', so any future stale Startup
entry silently no-ops instead of flashing a console window.

Status enhancements
-------------------

- status() now reports supervisor task presence alongside the existing
  schtasks/Startup info, and nudges the user to reinstall if the
  supervisor isn't registered.
- Deep mode dumps both the supervisor task name + script path.

* fix(gateway,windows): drop the per-minute supervisor task — keep breakaway + deep probes

Earlier in this branch we added a per-minute schtasks-based supervisor to
respawn the gateway after crashes / GUI-update SIGTERMs. The implementation
flashed a brief console window on every firing, which stole window focus.
We tried several variants:

  - cmd.exe wrapper invoking pythonw  -> flashes (cmd.exe is console-subsystem)
  - schtasks /TR pointing at pythonw  -> flashes (uv venv launcher pythonw is
    actually subsystem=Console, not GUI; it respawns the real pythonw)
  - schtasks /TR pointing at base uv  -> still flashes (Task Scheduler-side
    conhost preallocation; documented Windows quirk)
  - XML registration with <Hidden>true>  -> still flashes (<Hidden> only hides
    the task in the Task Scheduler UI, not the spawned window)

Researched what leading projects do:

  - Ollama: GUI-subsystem tray exe + Startup-folder shortcut. No supervisor.
  - Tailscale: real Windows Service via SCM. Session 0, no console possible.
  - Syncthing: --no-console flag inside the binary + Startup folder.
  - openclaw: VBS Run(..., 0, False) wrapper. Suppresses the *window* but
    Super User Q971162 confirms focus-steal still occurs in some cases.

None of these use a per-minute polling scheduled task. The 'auto-restart on
crash' responsibility belongs INSIDE the daemon (Tailscale's in-process
recovery / Ollama's monitor+worker pair) OR is delegated to the Windows
Service Control Manager — not Task Scheduler.

So this commit drops the supervisor entirely. The CREATE_BREAKAWAY_FROM_JOB
fix in _subprocess_compat.py (from commit c1e5fa433) survives — that is the
*real* fix for problem #2 (GUI-update kills gateway): the post-update
watcher in launch_detached_profile_gateway_restart() now breaks out of
Electron's job object, so the gateway respawn watcher survives the GUI
quit and successfully respawns the gateway.

Surviving from c1e5fa433:
  * CREATE_BREAKAWAY_FROM_JOB in hermes_cli/_subprocess_compat.py (fixes #2)
  * Inlined breakaway flag in the watcher respawn snippet in gateway.py
  * hermes gateway status --deep PASS/FAIL probes (fixes #1 — visibility)
  * 'if not exist <target> exit /b 0' guard in _build_startup_launcher
    (fixes #3 — silent no-op for stale Startup entries)
  * tests/hermes_cli/test_gateway_wsl.py is_windows=False stubs (root cause
    of #3 — pytest WSL tests no longer leak Startup entries on Win hosts)

Removed in this commit:
  * hermes_cli/gateway_supervisor.py (entire file)
  * Supervisor section in hermes_cli/gateway_windows.py (~180 lines):
      get_supervisor_task_name, get_supervisor_script_path,
      _build_supervisor_cmd_script, _write_supervisor_script,
      _install_supervisor_task, is_supervisor_task_registered,
      _install_supervisor_best_effort
  * _install_supervisor_best_effort() calls in install() (3 spots)
  * supervisor cleanup block in uninstall()
  * supervisor display lines in status() / status(deep=True)

Future direction (out of scope for this PR): the right place for Windows
'Restart=always' semantics is a real Windows Service installed via
pywin32's win32serviceutil.ServiceFramework — session-0 isolation, SCM
auto-restart, no console window possible. That's a meaningful next-PR
project, not a band-aid.

Tests: 51 pass / 2 pre-existing failures in
tests/hermes_cli/test_gateway_{windows,wsl}.py (the 2 failures are
TestSupportsSystemdServicesWSL cases that fail on origin/main too —
unrelated to this PR).
2026-06-06 19:53:58 -07:00
..
dashboard_auth fix(desktop): gate OAuth remote connect on AT-or-RT, not access token alone 2026-06-04 22:18:46 -07:00
proxy chore: remove dead code — 28 unused functions/classes across 16 files 2026-05-29 04:22:27 -07:00
__init__.py chore: release v0.16.0 (2026.6.5) (#40206) 2026-06-05 17:55:43 -07:00
_parser.py feat(cli): configurable default interface (cli vs tui) 2026-06-02 20:49:44 -05:00
_subprocess_compat.py fix(gateway,windows): reliability — JOB breakaway + status --deep probes + test-leak fix (#40909) 2026-06-06 19:53:58 -07:00
auth.py revert: keep Google Chat OAuth secret + active_provider profile-scoped (#39398) 2026-06-04 16:54:40 -07:00
auth_commands.py fix(auth): set active_provider after hermes auth add qwen-oauth 2026-06-04 05:58:33 -07:00
azure_detect.py feat(azure-foundry): add Microsoft Entra ID auth 2026-05-18 10:14:38 -07:00
backup.py fix(cron): restore jobs.json emptied by config migration on update 2026-05-29 13:22:54 -07:00
banner.py fix(update-check): stop reporting phantom "N commits behind" inside Docker (#39559) 2026-06-05 15:37:19 +10:00
browser_connect.py feat: auto-launch Chromium-family browser for CDP 2026-05-19 22:34:05 -07:00
build_info.py fix(docker): bake build-time git SHA into the image 2026-05-28 15:14:05 +10:00
bundles.py chore: prune unused imports and duplicate import redefinitions 2026-05-28 22:26:25 -07:00
callbacks.py fix(cli): show masked feedback for secret prompts 2026-05-25 01:20:33 -07:00
checkpoints.py chore: prune unused imports and duplicate import redefinitions 2026-05-28 22:26:25 -07:00
claw.py fix: batch of small robustness/correctness fixes from @kyssta-exe 2026-06-01 19:51:03 -07:00
cli_output.py fix(cli): show masked feedback for secret prompts 2026-05-25 01:20:33 -07:00
clipboard.py fix(clipboard): only read PNG signature bytes, not entire file 2026-05-13 22:54:21 -07:00
codex_models.py fix(codex): drop dead model slugs that HTTP 400 on ChatGPT Pro (#33424) 2026-05-27 12:16:15 -07:00
codex_runtime_plugin_migration.py fix(codex-runtime): de-dup [plugins.X] tables and stop leaking HERMES_HOME into config.toml 2026-05-15 02:31:30 -07:00
codex_runtime_switch.py chore: ruff auto-fix PLR6201 resweep — tuple → set in membership tests (#27355) 2026-05-17 02:29:41 -07:00
colors.py feat: respect NO_COLOR env var and TERM=dumb (#4079) 2026-03-30 17:07:21 -07:00
commands.py Add /version slash command across CLI, gateway, TUI, and desktop. 2026-06-05 18:05:05 -07:00
completion.py fix: batch of small robustness/correctness fixes from @kyssta-exe 2026-06-01 19:51:03 -07:00
config.py docs(kanban): clarify decomposer profile roles 2026-06-06 19:29:00 -07:00
container_boot.py fix(docker): seed s6 gateway state for legacy run cmd (#34829) 2026-06-01 11:28:56 +10:00
copilot_auth.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
cron.py fix(cron): don't crash on cron list when a job's repeat is null 2026-06-05 00:19:45 -07:00
curator.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
curses_ui.py feat(cli): ranked fuzzy search in the curses model picker 2026-06-01 16:58:58 -07:00
dashboard_register.py fix(dashboard): honor --portal-url / HERMES_DASHBOARD_PORTAL_URL override in register 2026-06-04 00:17:57 -07:00
debug.py feat(dashboard): add Debug Share to the System page (#38600) 2026-06-03 19:37:04 -07:00
default_soul.py fix: reset default SOUL.md to baseline identity text (#3159) 2026-03-26 01:34:27 -07:00
dep_ensure.py feat(dep_ensure): complete Windows bootstrap — dep_ensure + install.ps1 + detection (#27845) 2026-05-18 16:34:24 +05:30
dingtalk_auth.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
doctor.py fix(install): correct check_dir tautology and add --workspace web test 2026-06-06 18:22:20 -07:00
dump.py fix(docker): bake build-time git SHA into the image 2026-05-28 15:14:05 +10:00
env_loader.py fix(secrets): only apply external secrets once per HERMES_HOME per process (#32271) 2026-05-25 15:18:55 -07:00
fallback_cmd.py fix(fallback): merge fallback_providers with legacy fallback_model configurations 2026-05-23 05:24:57 -07:00
fallback_config.py fix(fallback): merge fallback_providers with legacy fallback_model configurations 2026-05-23 05:24:57 -07:00
gateway.py fix(gateway,windows): reliability — JOB breakaway + status --deep probes + test-leak fix (#40909) 2026-06-06 19:53:58 -07:00
gateway_windows.py fix(gateway,windows): reliability — JOB breakaway + status --deep probes + test-leak fix (#40909) 2026-06-06 19:53:58 -07:00
goals.py feat(kanban): goal_mode cards run workers in a /goal loop (#35710) 2026-05-31 01:16:33 -07:00
gui_uninstall.py feat: uninstall the Chat GUI without removing the agent (CLI + desktop UI) (#40355) 2026-06-06 18:22:38 -07:00
hooks.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
inventory.py feat(desktop): inline model picker in the status bar 2026-06-02 19:09:41 -05:00
kanban.py fix(kanban): isolate board override per concurrent call 2026-06-04 07:39:53 -07:00
kanban_db.py fix(kanban): isolate board override per concurrent call 2026-06-04 07:39:53 -07:00
kanban_decompose.py docs(kanban): clarify decomposer profile roles 2026-06-06 19:29:00 -07:00
kanban_diagnostics.py chore: remove dead code — 28 unused functions/classes across 16 files 2026-05-29 04:22:27 -07:00
kanban_specify.py fix(kanban): close kanban.db FD after every connect() in long-lived processes 2026-05-27 22:07:49 -07:00
kanban_swarm.py fix(kanban): CLI dispatch honors max_in_progress/max_spawn from config; swap missing 'avoid-ai-writing' skill for bundled humanizer (#33488, #29415) (#34337) 2026-05-28 21:00:46 -07:00
logs.py feat(debug): include desktop.log in hermes debug share / /debug / hermes logs (#38203) 2026-06-03 05:41:35 -07:00
main.py feat: uninstall the Chat GUI without removing the agent (CLI + desktop UI) (#40355) 2026-06-06 18:22:38 -07:00
managed_uv.py fix(update/windows): don't return _UvResult on Windows (subprocess argv crash) (#39820) 2026-06-05 07:54:08 -05:00
mcp_catalog.py chore: prune unused imports and duplicate import redefinitions 2026-05-28 22:26:25 -07:00
mcp_config.py fix(mcp): ensure server.shutdown() on probe iteration failure 2026-06-04 17:11:17 -07:00
mcp_picker.py feat(mcp): Nous-approved MCP catalog with interactive picker (#30870) 2026-05-26 12:48:14 -07:00
mcp_startup.py perf(cli): stop eager MCP discovery from blocking agent-capable startup 2026-05-30 07:45:26 -07:00
memory_setup.py fix(memory): fall back to pip when uv is unavailable (salvage #5954) (#38668) 2026-06-04 14:03:02 +10:00
middleware.py fix(middleware): single-use next_call guard + deepcopy-safe request copies 2026-06-06 23:07:25 +05:30
migrate.py feat(cli): hermes migrate xai [--apply] [--no-backup] 2026-05-20 09:18:23 -07:00
model_catalog.py feat(models): refresh model catalog hourly instead of daily (#35756) 2026-05-31 00:29:40 -07:00
model_normalize.py remove Vercel AI Gateway and Vercel Sandbox (#33067) 2026-05-27 00:43:32 -07:00
model_switch.py feat(model_switch): honor discover_models in custom_providers section 4 2026-06-06 01:04:13 +05:30
models.py fix(models): use deepseek-v4-flash as Nous silent default 2026-06-05 02:54:34 -07:00
nous_account.py feat(credits): usage-aware credits — in-session notices, /usage view, dev readout (#40011) 2026-06-06 13:18:18 +05:30
nous_subscription.py fix(cli): require Chromium for local browser readiness in setup/status surfaces 2026-06-05 04:06:17 -07:00
oneshot.py fix(cli): surface oneshot agent exceptions to stderr with rc=1 2026-05-30 07:31:48 -07:00
pairing.py fix(pairing): enforce lockout on approve_code, not just generate_code (#10195) (#21325) 2026-05-07 07:18:21 -07:00
partial_compress.py Inspired by Claude Code: /compress here [N] — boundary-aware 'summarize up to here' (#35048) 2026-05-29 17:49:15 -07:00
platforms.py feat: complete plugin platform parity — all 12 integration points 2026-04-29 21:56:51 -07:00
plugins.py feat(middleware): add adaptive execution intercepts 2026-06-03 11:22:06 -07:00
plugins_cmd.py Add Hermes desktop app (#20059) 2026-05-31 17:46:56 -05:00
portal_cli.py feat(cli): make hermes portal run the full quick-setup Nous flow (model picker) 2026-06-04 02:20:31 +05:30
profile_describer.py chore: prune unused imports and duplicate import redefinitions 2026-05-28 22:26:25 -07:00
profile_distribution.py fix(profile): reject symlinks in distributions (#25292) 2026-05-25 05:07:58 -07:00
profiles.py feat(cli): display custom profile alias names in profile list/show (#40371) 2026-06-06 08:08:07 +00:00
prompt_size.py feat(cli): add hermes prompt-size diagnostic (#35276) 2026-05-30 02:53:42 -07:00
providers.py fix(model-picker): OpenAI shows curated models; OpenRouter no longer phantom-shows (#37404) 2026-06-02 06:31:37 -07:00
psutil_android.py fix(android): reject unsafe tar members in psutil compatibility installer 2026-05-28 02:36:09 -07:00
pt_input_extras.py fix(cli): ignore terminal focus reports (salvage of #16780) 2026-05-29 00:31:44 -07:00
pty_bridge.py fix(dashboard): clamp PTY resize dimensions for WSL2 winsize garbage (#38200) 2026-06-03 09:00:16 -07:00
relaunch.py fix(windows): prefer npm.cmd over npm.ps1, skip .py argv0 in relaunch 2026-05-08 14:27:40 -07:00
runtime_provider.py fix(gateway): honor per-provider max_output_tokens in max_tokens chain 2026-06-05 09:10:26 -07:00
secret_prompt.py fix(cli): show masked feedback for secret prompts 2026-05-25 01:20:33 -07:00
secrets_cli.py fix(secrets): fail early with clear error when bitwarden setup runs without TTY (#40571) 2026-06-06 18:36:40 -07:00
security_advisories.py fix(stt,tts): restore mistralai — 2.4.8 is clean, ban lifted (#34841) 2026-05-29 13:24:12 -07:00
security_audit.py chore: prune unused imports and duplicate import redefinitions 2026-05-28 22:26:25 -07:00
send_cmd.py fix(review): address Copilot follow-up on sanitizer and file decode errors 2026-05-16 23:00:58 -05:00
service_manager.py Remove prviliges drop when you never ran as root (#34837) 2026-06-01 13:54:18 +10:00
session_recap.py chore: ruff auto-fix PLR6201 resweep — tuple → set in membership tests (#27355) 2026-05-17 02:29:41 -07:00
setup.py fix(cli): require Chromium for local browser readiness in setup/status surfaces 2026-06-05 04:06:17 -07:00
skills_config.py refactor(config): migrate remaining 33 cfg_get call sites (#17311) 2026-04-29 04:03:03 -07:00
skills_hub.py feat(skills): fix browse cap, add source links + copy buttons + category cleanup (#37143) 2026-06-01 19:52:28 -07:00
skin_engine.py fix(tui): improve charizard completion menu contrast 2026-05-18 20:05:23 -07:00
slack_cli.py fix(slack): enable writable app home DMs in manifest 2026-05-08 17:01:12 -07:00
status.py feat(cli): make hermes portal the human-readable Portal onboarding alias 2026-06-04 01:19:28 +05:30
stdio.py chore: prune unused imports and duplicate import redefinitions 2026-05-28 22:26:25 -07:00
telegram_managed_bot.py Add CLI Telegram QR onboarding 2026-06-05 03:20:10 -07:00
timeouts.py perf(agent-loop): cut 47% of per-conversation function calls via 3 targeted hot-path optimizations (#28866) 2026-05-19 14:25:10 -07:00
tips.py docs: align runtime footer field docs 2026-06-06 11:20:40 -06:00
tools_config.py fix(install): scope npm installs/audits to avoid pulling in apps/desktop 2026-06-06 18:22:20 -07:00
uninstall.py feat: uninstall the Chat GUI without removing the agent (CLI + desktop UI) (#40355) 2026-06-06 18:22:38 -07:00
voice.py fix(tui): restore voice push-to-talk parity (#20897) 2026-05-06 15:49:59 -07:00
web_server.py feat(desktop): surface every provider + models from hermes model in the GUI menus (#40563) 2026-06-06 16:31:34 +00:00
webhook.py fix(state): restrict sensitive store file permissions 2026-05-24 04:55:18 -07:00
xai_retirement.py fix(xai): align migrate retirement map with docs 2026-05-20 09:18:23 -07:00