hermes-agent/gateway
Teknium 10ee4a729b
fix(gateway): drain on Windows hermes gateway stop so sessions survive restart (#33798)
Sessions now survive `hermes gateway stop` / `restart` on native Windows.
Previously the gateway died on schtasks `/End` + os.kill SIGTERM without
ever running the drain loop, so the v0.13.0 session-resume feature (#21192)
silently broke on Windows: `resume_pending=True` was never written, and
the next boot started with a blank conversation history (issue #33778).

Root cause is twofold and the reporter only identified half of it:

1. `hermes_cli/gateway_windows.py::stop()` did not write the
   `planned_stop_marker` before signalling. The reporter caught this.

2. The bigger reason: `asyncio.add_signal_handler` raises
   NotImplementedError for SIGTERM/SIGINT on Windows, so even if the
   marker had been written, the gateway's existing SIGTERM handler
   (which is what calls `runner.stop()` and the `mark_resume_pending`
   loop) was never invoked. Writing the marker would have been
   necessary-but-insufficient.

The fix has two parts:

* gateway/run.py: new `_run_planned_stop_watcher` daemon thread polls
  for the planned-stop marker file every 0.5s. When the marker appears
  it `loop.call_soon_threadsafe(shutdown_signal_handler, None)` — the
  same shutdown path a real SIGTERM would have driven, including the
  pre-drain `mark_resume_pending` writes (run.py:5977) and graceful
  drain wait. The existing signal handler already accepts
  `received_signal=None` and falls through to
  `consume_planned_stop_marker_for_self()`, so no handler changes
  needed. Runs on every platform as cheap belt-and-suspenders.

* hermes_cli/gateway_windows.py: `stop()` now writes the marker for
  the running gateway PID and waits up to `agent.restart_drain_timeout`
  (default 30s) for the PID to exit cleanly. On clean drain, the kill
  sweep is non-forceful; on timeout, escalates to
  `kill_gateway_processes(force=True)` which routes to taskkill /T /F
  per `references/windows-native-support.md`.

Validation:

* 7 new tests in tests/gateway/test_planned_stop_watcher.py covering:
  marker→handler dispatch, no-marker idle, already-draining skip,
  not-yet-running skip, stop_event responsiveness, fire-once
  semantics, error tolerance.
* 8 new tests in tests/hermes_cli/test_gateway_windows.py covering:
  marker-before-kill ordering, clean-drain skips force-kill,
  drain-timeout escalates to force=True, no-pid-skips-drain,
  invalid-pid handling, fast-exit success, timeout failure,
  marker-write-failure tolerance.
* E2E (Linux, detached orphan): write_planned_stop_marker(pid) +
  `_drain_gateway_pid(pid, 5.0)` returns True in 0.5s after the
  victim sees the marker and exits. Tested with a double-forked
  subprocess so the test parent isn't holding it as a zombie.
* Targeted: tests/gateway/{restart_drain,restart_resume_pending,
  signal,signal_format,status,shutdown_forensics,approve_deny_commands,
  planned_stop_watcher} + tests/hermes_cli/{gateway_windows,
  gateway_service} → 519/519.

What was wrong with the reporter's claim (for future archaeology): they
described the symptom as "no `resume_pending=True` written to
`sessions.json`" — but Hermes uses `state.db` (SQLite), not
`sessions.json`, and `mark_resume_pending` is called regardless of
the marker (the marker only affects exit code 0 vs 1 for systemd
revival semantics). The real session-loss path is the missing drain
on Windows, not a missing marker. Both halves are fixed here.

Closes #33778.
2026-05-28 03:25:32 -07:00
..
assets fix: improve telegram topic mode setup 2026-05-04 12:07:17 -07:00
builtin_hooks remove: BOOT.md built-in hook (#17093) 2026-04-28 09:50:27 -07:00
platforms fix(security): require source CIDR allowlisting for public msgraph webhook binds 2026-05-28 01:26:18 -07:00
__init__.py docs(gateway): mention Weixin in gateway help and docstrings 2026-05-12 17:08:51 -07:00
channel_directory.py refactor(ntfy): convert built-in adapter to platform plugin 2026-05-23 16:13:01 -07:00
config.py refactor(gateway): migrate Mattermost adapter to bundled plugin 2026-05-24 18:05:33 -07:00
delivery.py refactor: simplify Telegram DM topic refresh 2026-05-25 14:54:02 -07:00
display_config.py fix(gateway): keep Telegram heartbeat + interim commentary on; edit heartbeat in place (#33187) 2026-05-27 05:21:53 -07:00
hooks.py fix(plugins): register dynamically-loaded modules in sys.modules before exec 2026-04-29 23:34:35 -07:00
memory_monitor.py Port from cline/cline#10343: periodic gateway memory logging (#27102) 2026-05-16 12:55:23 -07:00
mirror.py refactor(gateway): drop _append_to_jsonl from mirror 2026-05-20 13:00:57 -07:00
pairing.py fix(gateway): preserve WhatsApp pairing approvals across JID/LID alias flips 2026-05-23 01:46:34 -07:00
platform_registry.py refactor(plugins): add apply_yaml_config_fn registry hook 2026-05-13 22:20:30 -07:00
restart.py fix(gateway): address restart review feedback 2026-04-10 21:18:34 -07:00
run.py fix(gateway): drain on Windows hermes gateway stop so sessions survive restart (#33798) 2026-05-28 03:25:32 -07:00
runtime_footer.py feat(gateway): opt-in runtime-metadata footer on final replies (#17026) 2026-04-28 06:50:04 -07:00
session.py fix(gateway): separate observed Telegram group context 2026-05-23 01:33:42 -07:00
session_context.py fix(cli): synchronize HERMES_SESSION_ID across environment and contextvar during session switches 2026-05-23 17:46:55 -07:00
shutdown_forensics.py chore: ruff auto-fixes — collapsible-else-if, if-stmt-min-max, dict.fromkeys (#23926) 2026-05-11 11:03:29 -07:00
slash_access.py feat(gateway): per-platform admin/user split for slash commands (salvage of #4443) (#23373) 2026-05-10 12:33:54 -07:00
status.py fix: gateway PID detection fails on Windows (two issues) 2026-05-13 23:10:57 -07:00
sticker_cache.py fix: guard yaml.safe_load, flock unlock, TOCTOU races, and atomic writes 2026-05-19 00:12:41 -07:00
stream_consumer.py fix(stream-consumer): only set _final_content_delivered when final response confirmed delivered 2026-05-28 03:15:19 -07:00
whatsapp_identity.py fix(whatsapp_identity): pin identifier regex to ASCII, clarify it's defense-in-depth 2026-04-26 20:48:31 -07:00