Three fixes for gateway lifecycle stability:
1. Notify active sessions before shutdown (#new)
When the gateway receives SIGTERM or /restart, it now sends a
notification to every chat with an active agent BEFORE starting
the drain. Users see:
- Shutdown: 'Gateway shutting down — your task will be interrupted.'
- Restart: 'Gateway restarting — use /retry after restart to continue.'
Deduplicates per-chat so group sessions with multiple users get
one notification. Best-effort: send failures are logged and swallowed.
2. Skip .clean_shutdown marker when drain timed out
Previously, a graceful SIGTERM always wrote .clean_shutdown, even if
agents were force-interrupted when the drain timed out. This meant
the next startup skipped session suspension, leaving interrupted
sessions in a broken state (trailing tool response, no final message).
Now the marker is only written if the drain completed without timeout,
so interrupted sessions get properly suspended on next startup.
3. Post-restart health check for hermes update (#6631)
cmd_update() now verifies the gateway actually survived after
systemctl restart (sleep 3s + is-active check). If the service
crashed immediately, it retries once. If still dead, prints
actionable diagnostics (journalctl command, manual restart hint).
Also closes#8104 — already fixed on main (the /restart handler
correctly detects systemd via INVOCATION_ID and uses via_service=True).
Test plan:
- 6 new tests for shutdown notifications (dedup, restart vs shutdown
messaging, sentinel filtering, send failure resilience)
- Existing restart drain + update tests pass (47 total)
Production fixes:
- Add clear_session_context() to hermes_logging.py (fixes 48 teardown errors)
- Add clear_session() to tools/approval.py (fixes 9 setup errors)
- Add SyncError M_UNKNOWN_TOKEN check to Matrix _sync_loop (bug fix)
- Fall back to inline api_key in named custom providers when key_env
is absent (runtime_provider.py)
Test fixes:
- test_memory_user_id: use builtin+external provider pair, fix honcho
peer_name override test to match production behavior
- test_display_config: remove TestHelpers for non-existent functions
- test_auxiliary_client: fix OAuth tokens to match _is_oauth_token
patterns, replace get_vision_auxiliary_client with resolve_vision_provider_client
- test_cli_interrupt_subagent: add missing _execution_thread_id attr
- test_compress_focus: add model/provider/api_key/base_url/api_mode
to mock compressor
- test_auth_provider_gate: add autouse fixture to clean Anthropic env
vars that leak from CI secrets
- test_opencode_go_in_model_list: accept both 'built-in' and 'hermes'
source (models.dev API unavailable in CI)
- test_email: verify email Platform enum membership instead of source
inspection (build_channel_directory now uses dynamic enum loop)
- test_feishu: add bot_added/bot_deleted handler mocks to _Builder
- test_ws_auth_retry: add AsyncMock for sync_store.get_next_batch,
add _pending_megolm and _joined_rooms to Matrix adapter mocks
- test_restart_drain: monkeypatch-delete INVOCATION_ID (systemd sets
this in CI, changing the restart call signature)
- test_session_hygiene: add user_id to SessionSource
- test_session_env: use relative baseline for contextvar clear check
(pytest-xdist workers share context)