hermes-agent/hermes_cli
Teknium 10ee4a729b
fix(gateway): drain on Windows hermes gateway stop so sessions survive restart (#33798)
Sessions now survive `hermes gateway stop` / `restart` on native Windows.
Previously the gateway died on schtasks `/End` + os.kill SIGTERM without
ever running the drain loop, so the v0.13.0 session-resume feature (#21192)
silently broke on Windows: `resume_pending=True` was never written, and
the next boot started with a blank conversation history (issue #33778).

Root cause is twofold and the reporter only identified half of it:

1. `hermes_cli/gateway_windows.py::stop()` did not write the
   `planned_stop_marker` before signalling. The reporter caught this.

2. The bigger reason: `asyncio.add_signal_handler` raises
   NotImplementedError for SIGTERM/SIGINT on Windows, so even if the
   marker had been written, the gateway's existing SIGTERM handler
   (which is what calls `runner.stop()` and the `mark_resume_pending`
   loop) was never invoked. Writing the marker would have been
   necessary-but-insufficient.

The fix has two parts:

* gateway/run.py: new `_run_planned_stop_watcher` daemon thread polls
  for the planned-stop marker file every 0.5s. When the marker appears
  it `loop.call_soon_threadsafe(shutdown_signal_handler, None)` — the
  same shutdown path a real SIGTERM would have driven, including the
  pre-drain `mark_resume_pending` writes (run.py:5977) and graceful
  drain wait. The existing signal handler already accepts
  `received_signal=None` and falls through to
  `consume_planned_stop_marker_for_self()`, so no handler changes
  needed. Runs on every platform as cheap belt-and-suspenders.

* hermes_cli/gateway_windows.py: `stop()` now writes the marker for
  the running gateway PID and waits up to `agent.restart_drain_timeout`
  (default 30s) for the PID to exit cleanly. On clean drain, the kill
  sweep is non-forceful; on timeout, escalates to
  `kill_gateway_processes(force=True)` which routes to taskkill /T /F
  per `references/windows-native-support.md`.

Validation:

* 7 new tests in tests/gateway/test_planned_stop_watcher.py covering:
  marker→handler dispatch, no-marker idle, already-draining skip,
  not-yet-running skip, stop_event responsiveness, fire-once
  semantics, error tolerance.
* 8 new tests in tests/hermes_cli/test_gateway_windows.py covering:
  marker-before-kill ordering, clean-drain skips force-kill,
  drain-timeout escalates to force=True, no-pid-skips-drain,
  invalid-pid handling, fast-exit success, timeout failure,
  marker-write-failure tolerance.
* E2E (Linux, detached orphan): write_planned_stop_marker(pid) +
  `_drain_gateway_pid(pid, 5.0)` returns True in 0.5s after the
  victim sees the marker and exits. Tested with a double-forked
  subprocess so the test parent isn't holding it as a zombie.
* Targeted: tests/gateway/{restart_drain,restart_resume_pending,
  signal,signal_format,status,shutdown_forensics,approve_deny_commands,
  planned_stop_watcher} + tests/hermes_cli/{gateway_windows,
  gateway_service} → 519/519.

What was wrong with the reporter's claim (for future archaeology): they
described the symptom as "no `resume_pending=True` written to
`sessions.json`" — but Hermes uses `state.db` (SQLite), not
`sessions.json`, and `mark_resume_pending` is called regardless of
the marker (the marker only affects exit code 0 vs 1 for systemd
revival semantics). The real session-loss path is the missing drain
on Windows, not a missing marker. Both halves are fixed here.

Closes #33778.
2026-05-28 03:25:32 -07:00
..
dashboard_auth feat(dashboard-auth): HERMES_DASHBOARD_PUBLIC_URL / dashboard.public_url override 2026-05-27 02:12:27 -07:00
proxy fix(xai-proxy): handle 429 rate-limit responses in proxy retry path 2026-05-28 02:36:37 -07:00
__init__.py chore: release v0.14.0 (2026.5.16) (#26862) 2026-05-16 02:58:57 -07:00
_parser.py Fix CLI verbose tool progress config fallback 2026-05-23 21:03:51 -07:00
_subprocess_compat.py feat(windows): close remaining POSIX-only landmines — TUI crash, kanban waitpid, AF_UNIX sandbox, /bin/bash, npm .cmd shims, cwd tracking, detach flags 2026-05-08 14:27:40 -07:00
auth.py fix(auth): sync manual:device_code Codex pool entries on re-auth (#33744) 2026-05-28 01:33:10 -07:00
auth_commands.py fix(cli): show masked feedback for secret prompts 2026-05-25 01:20:33 -07:00
azure_detect.py feat(azure-foundry): add Microsoft Entra ID auth 2026-05-18 10:14:38 -07:00
backup.py fix: limit pre-update state snapshots 2026-05-28 02:45:25 -07:00
banner.py fix(docker): bake build-time git SHA into the image 2026-05-28 15:14:05 +10:00
browser_connect.py feat: auto-launch Chromium-family browser for CDP 2026-05-19 22:34:05 -07:00
build_info.py fix(docker): bake build-time git SHA into the image 2026-05-28 15:14:05 +10:00
bundles.py feat(skills): add skill bundles — alias /<name> loads multiple skills (#28373) 2026-05-18 21:38:05 -07:00
callbacks.py fix(cli): show masked feedback for secret prompts 2026-05-25 01:20:33 -07:00
checkpoints.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
claw.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
cli_output.py fix(cli): show masked feedback for secret prompts 2026-05-25 01:20:33 -07:00
clipboard.py fix(clipboard): only read PNG signature bytes, not entire file 2026-05-13 22:54:21 -07:00
codex_models.py fix(codex): drop dead model slugs that HTTP 400 on ChatGPT Pro (#33424) 2026-05-27 12:16:15 -07:00
codex_runtime_plugin_migration.py fix(codex-runtime): de-dup [plugins.X] tables and stop leaking HERMES_HOME into config.toml 2026-05-15 02:31:30 -07:00
codex_runtime_switch.py chore: ruff auto-fix PLR6201 resweep — tuple → set in membership tests (#27355) 2026-05-17 02:29:41 -07:00
colors.py feat: respect NO_COLOR env var and TERM=dumb (#4079) 2026-03-30 17:07:21 -07:00
commands.py fix: ignore Telegram start pings 2026-05-27 02:41:24 -07:00
completion.py test(cli): strengthen zsh completion regression coverage 2026-05-13 09:34:15 -07:00
config.py fix(security): require API_SERVER_KEY before dispatching API server work 2026-05-28 00:25:08 -07:00
container_boot.py fix(docker): make s6 lifecycle work for the unprivileged hermes user 2026-05-25 12:23:23 +10:00
copilot_auth.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
cron.py feat: add cron job profile support 2026-05-18 17:39:50 +00:00
curator.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
curses_ui.py fix(cli): clamp curses color 8 for 8-color terminals (Docker) 2026-05-21 23:40:58 -07:00
debug.py fix(debug): redact BlueBubbles webhook secrets 2026-05-24 15:43:48 -07:00
default_soul.py fix: reset default SOUL.md to baseline identity text (#3159) 2026-03-26 01:34:27 -07:00
dep_ensure.py feat(dep_ensure): complete Windows bootstrap — dep_ensure + install.ps1 + detection (#27845) 2026-05-18 16:34:24 +05:30
dingtalk_auth.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
doctor.py remove Vercel AI Gateway and Vercel Sandbox (#33067) 2026-05-27 00:43:32 -07:00
dump.py fix(docker): bake build-time git SHA into the image 2026-05-28 15:14:05 +10:00
env_loader.py fix(secrets): only apply external secrets once per HERMES_HOME per process (#32271) 2026-05-25 15:18:55 -07:00
fallback_cmd.py fix(fallback): merge fallback_providers with legacy fallback_model configurations 2026-05-23 05:24:57 -07:00
fallback_config.py fix(fallback): merge fallback_providers with legacy fallback_model configurations 2026-05-23 05:24:57 -07:00
gateway.py feat(docker): auto-redirect gateway run to supervised mode inside s6 image 2026-05-28 12:42:13 +10:00
gateway_windows.py fix(gateway): drain on Windows hermes gateway stop so sessions survive restart (#33798) 2026-05-28 03:25:32 -07:00
goals.py feat: inject current time into goal judge prompt 2026-05-16 23:05:27 -07:00
hooks.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
inventory.py refactor(inventory): extract shared ConfigContext + build_models_payload 2026-05-13 22:31:11 -07:00
kanban.py fix(kanban): close kanban.db FD after every connect() in long-lived processes 2026-05-27 22:07:49 -07:00
kanban_db.py test(auth): update entitlement CI expectations 2026-05-28 00:19:31 -07:00
kanban_decompose.py fix(kanban): close kanban.db FD after every connect() in long-lived processes 2026-05-27 22:07:49 -07:00
kanban_diagnostics.py fix(kanban): honor severity thresholds in diagnostics 2026-05-18 20:47:01 -07:00
kanban_specify.py fix(kanban): close kanban.db FD after every connect() in long-lived processes 2026-05-27 22:07:49 -07:00
kanban_swarm.py feat(cli): add kanban swarm topology helper 2026-05-18 21:10:12 -07:00
logs.py feat: component-separated logging with session context and filtering (#7991) 2026-04-11 17:23:36 -07:00
main.py fix: limit pre-update state snapshots 2026-05-28 02:45:25 -07:00
mcp_catalog.py feat(mcp): Nous-approved MCP catalog with interactive picker (#30870) 2026-05-26 12:48:14 -07:00
mcp_config.py feat(mcp): Nous-approved MCP catalog with interactive picker (#30870) 2026-05-26 12:48:14 -07:00
mcp_picker.py feat(mcp): Nous-approved MCP catalog with interactive picker (#30870) 2026-05-26 12:48:14 -07:00
memory_setup.py fix(cli): show masked feedback for secret prompts 2026-05-25 01:20:33 -07:00
migrate.py feat(cli): hermes migrate xai [--apply] [--no-backup] 2026-05-20 09:18:23 -07:00
model_catalog.py codebase: add encoding='utf-8' to all bare open() calls (PLW1514) 2026-05-08 14:27:40 -07:00
model_normalize.py remove Vercel AI Gateway and Vercel Sandbox (#33067) 2026-05-27 00:43:32 -07:00
model_switch.py fix(model-switch): mark bare custom provider as current 2026-05-19 10:57:35 -07:00
models.py feat(auth) normalise the way in which we check whether a user has free/paid access to nous portal so we can expose behaviour and error messages accordingly. 2026-05-28 00:19:31 -07:00
nous_account.py feat(auth) normalise the way in which we check whether a user has free/paid access to nous portal so we can expose behaviour and error messages accordingly. 2026-05-28 00:19:31 -07:00
nous_subscription.py fix(auth): refresh Nous entitlement in tool menus 2026-05-28 00:19:31 -07:00
oneshot.py fix(provider): make config.yaml model.provider the single source of truth (#31222) 2026-05-23 18:18:41 -07:00
pairing.py fix(pairing): enforce lockout on approve_code, not just generate_code (#10195) (#21325) 2026-05-07 07:18:21 -07:00
platforms.py feat: complete plugin platform parity — all 12 integration points 2026-04-29 21:56:51 -07:00
plugins.py feat(plugins): add register_dashboard_auth_provider hook on PluginContext 2026-05-27 02:12:27 -07:00
plugins_cmd.py feat(context-engine): host contract for external context engines 2026-05-28 01:45:30 -07:00
portal_cli.py feat(portal): one-shot setup, status CLI, and Nous-included markers (#30860) 2026-05-23 02:39:09 -07:00
profile_describer.py fix(skills): prune dependency/venv dirs from all skill scanners (#30042) 2026-05-21 14:18:02 -07:00
profile_distribution.py fix(profile): reject symlinks in distributions (#25292) 2026-05-25 05:07:58 -07:00
profiles.py fix(security): tighten .env file permissions to 0600 at all creation sites 2026-05-25 03:40:47 -07:00
providers.py remove Vercel AI Gateway and Vercel Sandbox (#33067) 2026-05-27 00:43:32 -07:00
psutil_android.py fix(android): reject unsafe tar members in psutil compatibility installer 2026-05-28 02:36:09 -07:00
pt_input_extras.py fix(cli): make Ctrl+Enter insert newline on WSL/SSH/Windows Terminal (#22777) 2026-05-09 12:48:14 -07:00
pty_bridge.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
relaunch.py fix(windows): prefer npm.cmd over npm.ps1, skip .py argv0 in relaunch 2026-05-08 14:27:40 -07:00
runtime_provider.py fix(custom): pass custom provider extra body 2026-05-21 07:48:53 -07:00
secret_prompt.py fix(cli): show masked feedback for secret prompts 2026-05-25 01:20:33 -07:00
secrets_cli.py fix(cli): show masked feedback for secret prompts 2026-05-25 01:20:33 -07:00
security_advisories.py feat(security): supply-chain advisory checker + lazy-install framework + tiered install fallback (#24220) 2026-05-12 01:02:25 -07:00
security_audit.py feat(security): on-demand supply-chain audit via OSV.dev (#31460) 2026-05-24 15:15:16 -07:00
send_cmd.py fix(review): address Copilot follow-up on sanitizer and file decode errors 2026-05-16 23:00:58 -05:00
service_manager.py fix(docker): align HOME for dashboard and s6 gateway services (#33481) 2026-05-28 13:42:27 +10:00
session_recap.py chore: ruff auto-fix PLR6201 resweep — tuple → set in membership tests (#27355) 2026-05-17 02:29:41 -07:00
setup.py remove Vercel AI Gateway and Vercel Sandbox (#33067) 2026-05-27 00:43:32 -07:00
skills_config.py refactor(config): migrate remaining 33 cfg_get call sites (#17311) 2026-04-29 04:03:03 -07:00
skills_hub.py fix: backfill official optional skill provenance 2026-05-27 13:39:58 -07:00
skin_engine.py fix(tui): improve charizard completion menu contrast 2026-05-18 20:05:23 -07:00
slack_cli.py fix(slack): enable writable app home DMs in manifest 2026-05-08 17:01:12 -07:00
status.py feat(auth) normalise the way in which we check whether a user has free/paid access to nous portal so we can expose behaviour and error messages accordingly. 2026-05-28 00:19:31 -07:00
stdio.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
timeouts.py perf(agent-loop): cut 47% of per-conversation function calls via 3 targeted hot-path optimizations (#28866) 2026-05-19 14:25:10 -07:00
tips.py docs(auth): replace stale 'hermes login' references with 'hermes auth add' 2026-05-26 15:41:11 -07:00
tools_config.py fix: expose context engine tools with saved toolsets 2026-05-28 00:28:42 -07:00
uninstall.py docs(windows): avoid piping installer directly into iex 2026-05-18 20:05:47 -07:00
voice.py fix(tui): restore voice push-to-talk parity (#20897) 2026-05-06 15:49:59 -07:00
web_server.py feat(dashboard-auth): honour X-Forwarded-Prefix + __Host-/__Secure- cookies 2026-05-27 02:12:27 -07:00
webhook.py fix(state): restrict sensitive store file permissions 2026-05-24 04:55:18 -07:00
xai_retirement.py fix(xai): align migrate retirement map with docs 2026-05-20 09:18:23 -07:00