hermes-agent/hermes_cli
Teknium 1dce908930
fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log) (#18761)
* fix(gateway): config.yaml wins over .env for agent/display/timezone settings

Regression from the silent config→env bridge. The bridge at module import
time is correct for max_turns (unconditional overwrite), but every other
agent.*, display.*, timezone, and security bridge key was guarded by
'if X not in os.environ' — so a stale .env entry from an old 'hermes setup'
run would shadow the user's current config.yaml indefinitely.

Symptom: agent.max_turns: 500 in config.yaml, HERMES_MAX_ITERATIONS=60
in .env from an old setup, and the gateway silently capped at 60
iterations per turn. Gateway logs confirmed api_calls never exceeded 60.

Three changes:

1. gateway/run.py: drop the 'not in os.environ' guards for all agent.*,
   display.*, timezone, and security.* bridge keys. config.yaml is now
   authoritative for these settings — same semantics already in place
   for max_turns, terminal.*, and auxiliary.*. Also surface the bridge
   failure (previously 'except Exception: pass') to stderr so operators
   see bridge errors instead of silently falling back to .env.

2. gateway/run.py: INFO-log the resolved max_iterations at gateway
   start so operators can verify the config→env bridge did the right
   thing instead of chasing a phantom budget ceiling.

3. hermes_cli/setup.py: stop writing HERMES_MAX_ITERATIONS to .env in
   the setup wizard. config.yaml is the single source of truth. Also
   clean up any stale .env entry left behind by pre-fix setups.

Regression tests in tests/gateway/test_config_env_bridge_authority.py
guard each config→env key against the 'stale .env shadows config' bug.

* fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log)

Three issues observed in production gateway.log during a rapid restart
chain on 2026-05-02, all fixed here.

1. _send_restart_notification logged unconditional success
   adapter.send() catches provider errors (e.g. Telegram 'Chat not found')
   and returns SendResult(success=False); it never raises. The caller
   ignored the return value and always logged 'Sent restart notification
   to <chat>' at INFO, producing a misleading success line directly
   below the 'Failed to send Telegram message' traceback on every boot.
   Now inspects result.success and logs WARNING with the error otherwise.

2. WhatsApp bridge SIGTERM on shutdown classified as fatal error
   _check_managed_bridge_exit() saw the bridge's returncode -15 (our own
   SIGTERM from disconnect()) and fired the full fatal-error path,
   producing 'ERROR ... WhatsApp bridge process exited unexpectedly' plus
   'Fatal whatsapp adapter error (whatsapp_bridge_exited)' on every
   planned shutdown, immediately before the normal '✓ whatsapp
   disconnected'. Adds a _shutting_down flag that disconnect() sets
   before the terminate, and _check_managed_bridge_exit() returns None
   for returncode in {0, -2, -15} while shutting down. OOM-kill (137)
   and other non-signal exits still hit the fatal path.

3. restart_drain_timeout default 60s → 180s
   On 2026-05-02 01:43:27 a user /restart fired while three agents were
   mid-API-call (82s, 112s, 154s into their turns). The 60s drain budget
   expired and all three were force-interrupted. 180s covers realistic
   in-flight agent turns; users on very-long-reasoning models can still
   raise it further via agent.restart_drain_timeout in config.yaml.
   Existing explicit user values are preserved by deep-merge.

Tests
- tests/gateway/test_restart_notification.py: two new tests assert INFO
  is only logged on SendResult(success=True) and WARNING with the error
  string is logged on SendResult(success=False).
- tests/gateway/test_whatsapp_connect.py: parametrized test for
  returncode in {0, -2, -15} proves shutdown-time exits are suppressed;
  separate test proves returncode 137 (SIGKILL/OOM) still surfaces as
  fatal even when _shutting_down is set.
- _check_managed_bridge_exit() reads _shutting_down via getattr-with-
  default so existing _make_adapter() test helpers that bypass __init__
  (pitfall #17 in AGENTS.md) keep working unmodified.
2026-05-02 02:08:06 -07:00
..
__init__.py chore: release v0.12.0 (2026.4.30) (#18057) 2026-04-30 11:31:01 -07:00
_parser.py refactor(cli): derive relaunch flag table from argparse introspection 2026-04-29 20:33:29 -07:00
auth.py fix(auth): make provider config writes atomic 2026-04-30 20:39:41 -07:00
auth_commands.py feat(cli): add minimax-oauth provider with PKCE browser flow 2026-04-29 09:53:42 -07:00
azure_detect.py chore: remove unused imports and dead locals (ruff F401, F841) (#17010) 2026-04-28 06:46:45 -07:00
backup.py feat(claw-migrate): harden OpenClaw import with plan-first apply, redaction, and pre-migration backup (#16911) 2026-04-28 01:50:23 -07:00
banner.py fix(banner): show correct update status on nix-built hermes (#17550) 2026-04-30 07:03:00 +05:30
browser_connect.py fix(browser): address Copilot review on /browser connect 2026-04-28 22:11:10 -07:00
callbacks.py fix: ESC cancels secret/sudo prompts, clearer skip messaging (#9902) 2026-04-14 16:11:37 -07:00
claw.py feat(claw-migrate): harden OpenClaw import with plan-first apply, redaction, and pre-migration backup (#16911) 2026-04-28 01:50:23 -07:00
cli_output.py refactor: remove dead code — 1,784 lines across 77 files (#9180) 2026-04-13 16:32:04 -07:00
clipboard.py feat: fix img pasting in new ink plus newline after tools 2026-04-11 13:14:32 -05:00
codex_models.py feat(codex): add gpt-5.5 and wire live model discovery into picker (#14720) 2026-04-23 13:32:43 -07:00
colors.py feat: respect NO_COLOR env var and TERM=dumb (#4079) 2026-03-30 17:07:21 -07:00
commands.py fix(discord): warn on 32-char clamp collisions in the /skill collector (#18759) 2026-05-02 02:05:01 -07:00
completion.py fix: preserve profile name completion in dynamic shell completion 2026-04-14 10:45:42 -07:00
config.py fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log) (#18761) 2026-05-02 02:08:06 -07:00
copilot_auth.py fix(copilot): exchange raw GitHub token for Copilot API JWT 2026-04-24 05:09:08 -07:00
cron.py feat(cron): per-job workdir for project-aware cron runs (#15110) 2026-04-24 05:07:01 -07:00
curator.py fix(curator): authoritative absorbed_into on delete + restore cron skill links on rollback (#18671) (#18731) 2026-05-02 01:29:57 -07:00
curses_ui.py feat: ungate Tool Gateway — subscription-based access with per-tool opt-in 2026-04-16 12:36:49 -07:00
debug.py chore: remove unused imports and dead locals (ruff F401, F841) (#17010) 2026-04-28 06:46:45 -07:00
default_soul.py fix: reset default SOUL.md to baseline identity text (#3159) 2026-03-26 01:34:27 -07:00
dingtalk_auth.py chore: remove unused imports and dead locals (ruff F401, F841) (#17010) 2026-04-28 06:46:45 -07:00
doctor.py fix(cli): decode .env as UTF-8 to avoid GBK crash on Windows 2026-05-02 01:40:31 -07:00
dump.py refactor(redact): canonical mask_secret helper; fix status.py DIM drift (#17207) 2026-04-28 21:04:35 -07:00
env_loader.py refactor: consolidate symlink-safe atomic replace into shared helper 2026-04-28 04:58:22 -07:00
fallback_cmd.py feat(cli): add 'hermes fallback' command to manage fallback providers (#16052) 2026-04-26 06:19:04 -07:00
gateway.py fix: gateway systemd unit now retries indefinitely with backoff (#18639) 2026-05-02 08:51:30 +05:30
goals.py feat: /goal — persistent cross-turn goals (Ralph loop) (#18262) 2026-04-30 23:10:20 -07:00
hooks.py chore: remove unused imports and dead locals (ruff F401, F841) (#17010) 2026-04-28 06:46:45 -07:00
kanban.py feat(kanban): durable multi-profile collaboration board (#17805) 2026-04-30 13:36:47 -07:00
kanban_db.py feat(kanban): durable multi-profile collaboration board (#17805) 2026-04-30 13:36:47 -07:00
logs.py feat: component-separated logging with session context and filtering (#7991) 2026-04-11 17:23:36 -07:00
main.py fix(cli): decode .env as UTF-8 to avoid GBK crash on Windows 2026-05-02 01:40:31 -07:00
mcp_config.py refactor(config): migrate remaining 33 cfg_get call sites (#17311) 2026-04-29 04:03:03 -07:00
memory_setup.py fix(cli): decode .env as UTF-8 to avoid GBK crash on Windows 2026-05-02 01:40:31 -07:00
model_catalog.py chore: remove unused imports and dead locals (ruff F401, F841) (#17010) 2026-04-28 06:46:45 -07:00
model_normalize.py feat(minimax-oauth): full integration with peer OAuth providers 2026-04-29 09:53:42 -07:00
model_switch.py fix(model_switch): correct user_providers override for private models 2026-04-30 19:44:26 -07:00
models.py chore(models): move Vercel AI Gateway to bottom of provider picker (#18112) 2026-04-30 19:34:19 -07:00
nous_subscription.py fix(cli): coerce use_gateway config flags in tool routing 2026-04-26 19:02:55 -07:00
oneshot.py fix(tui): honor launch toolsets (#17623) 2026-04-29 16:55:27 -07:00
pairing.py fix(pairing): handle null user_name in pairing list display 2026-04-23 02:34:11 -07:00
platforms.py feat: complete plugin platform parity — all 12 integration points 2026-04-29 21:56:51 -07:00
plugins.py fix(plugins): bound async plugin command await with 30s timeout 2026-04-30 19:56:18 -07:00
plugins_cmd.py feat(dashboard): add Plugins page with enable/disable, auth status, install/remove 2026-04-30 20:29:37 -04:00
profiles.py Merge upstream/main and address Copilot review feedback 2026-04-30 06:43:22 -04:00
providers.py fix: prevent bare 'custom' slug in model.provider (#17478) 2026-04-30 04:32:11 -07:00
pty_bridge.py fix: mobile chat in new layout 2026-04-24 12:07:46 -04:00
relaunch.py remove relaunch_chat 2026-04-29 20:33:29 -07:00
runtime_provider.py fix(fallback): let custom_providers shadow built-in aliases 2026-04-30 20:18:44 -07:00
setup.py fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log) (#18761) 2026-05-02 02:08:06 -07:00
skills_config.py refactor(config): migrate remaining 33 cfg_get call sites (#17311) 2026-04-29 04:03:03 -07:00
skills_hub.py feat(skills): install skills from a direct HTTP(S) URL (#16323) 2026-04-26 20:57:10 -07:00
skin_engine.py fix(tui): restore macOS copy behavior and theme polish (#17131) 2026-04-28 18:47:14 -05:00
slack_cli.py fix(paths): route achievements plugin + profile-tui through HERMES_HOME 2026-04-30 23:21:54 -07:00
status.py fix(status): add NVIDIA_API_KEY to hermes status API keys display 2026-04-30 19:46:06 -07:00
timeouts.py refactor(timeouts): drop redundant ImportError in except clause 2026-04-26 20:48:20 -07:00
tips.py feat(tips): add cost-saving tips from April 30 tip-of-the-day (#17841) 2026-04-30 02:30:36 -07:00
tools_config.py feat(tts): add Piper as a native local TTS provider (closes #8508) (#17885) 2026-04-30 02:53:20 -07:00
uninstall.py feat(uninstall): offer to remove named profiles when uninstalling from default 2026-04-18 19:18:13 -07:00
vercel_auth.py feat: add Vercel Sandbox backend 2026-04-29 07:22:33 -07:00
voice.py fix(tui): ignore SIGPIPE so stderr back-pressure can't kill the gateway 2026-04-23 16:18:15 -07:00
web_server.py fix: allow WebSocket connections from non-loopback IPs in --insecure mode (#18633) 2026-05-02 08:17:45 +05:30
webhook.py refactor(config): migrate remaining 33 cfg_get call sites (#17311) 2026-04-29 04:03:03 -07:00