hermes-agent/gateway/platforms
Teknium 518f39557b
fix(gateway): keep running when platforms fail; add per-platform circuit breaker + /platform (#26600)
Stop the gateway from exiting (or systemd-restart-looping) when a single
messaging adapter fails at startup or runtime.  A misconfigured WhatsApp
(npm install timeout, unpaired bridge, missing creds.json) used to take
the entire gateway down, killing cron jobs and any other connected
platforms with it.

Changes:

  • Startup (gateway/run.py): when connected_count==0 but the only
    errors are retryable, log a degraded-state warning and keep the
    gateway alive instead of returning False.  Reconnect watcher then
    recovers platforms as their underlying problem clears.

  • Runtime (gateway/run.py _handle_adapter_fatal_error): when the last
    adapter goes down with a retryable error and is queued for
    reconnection, stay alive instead of exit-with-failure.  Previously
    this triggered systemd Restart=on-failure, which created infinite
    restart loops on persistent retryable failures (proxy outage,
    repeated bridge crashes).

  • Reconnect watcher (gateway/run.py _platform_reconnect_watcher):
    replace the 20-attempt hard drop with a circuit-breaker pause.
    After _PAUSE_AFTER_FAILURES (10) consecutive retryable failures, the
    platform stays in _failed_platforms with paused=True so the watcher
    skips it but the operator can still see and resume it.  Non-retryable
    errors still drop out of the queue immediately.  Resolves #17063
    (gateway giving up on Telegram after 20 attempts).

  • WhatsApp preflight (gateway/platforms/whatsapp.py): refuse to start
    the Node bridge when creds.json is missing.  Sets a non-retryable
    whatsapp_not_paired fatal error so the watcher drops it cleanly
    with a single 'run hermes whatsapp' log line instead of paying the
    30s bridge bootstrap timeout on every gateway start.

  • WhatsApp setup ordering (hermes_cli/main.py cmd_whatsapp): only set
    WHATSAPP_ENABLED=true once pairing actually succeeds.  Previously
    the wizard wrote the env var at step 2 (before npm install and QR
    pairing), so any Ctrl+C left .env claiming WhatsApp was ready when
    the bridge had no creds.json.  Also propagate the env var when the
    user keeps an existing pairing on a re-run.

  • /platform slash command (hermes_cli/commands.py + gateway/run.py):
    new gateway-only command for manual circuit-breaker control.
      /platform list           — show connected + failed/paused platforms
      /platform pause <name>   — silence a known-broken platform
      /platform resume <name>  — re-queue a paused platform

Tests:

  • New: pause/resume helpers, /platform list|pause|resume command,
    WhatsApp creds.json preflight, WhatsApp setup ordering.
  • Updated: stale assertions that codified the old 'exit and let
    systemd restart' behavior in test_runner_fatal_adapter.py,
    test_runner_startup_failures.py, and test_platform_reconnect.py
    (the 20-attempt give-up test became a circuit-breaker pause test).

5488 tests pass in tests/gateway/.
2026-05-15 14:32:14 -07:00
..
qqbot fix(gateway): keep QQBot reconnect loop alive 2026-05-13 23:13:25 -07:00
__init__.py perf(gateway): defer QQAdapter and YuanbaoAdapter imports via PEP 562 (#22790) 2026-05-09 13:17:48 -07:00
_http_client_limits.py fix(gateway): tighten httpx keepalive and close whatsapp typing-response leak (#18451) 2026-05-02 02:23:37 -07:00
ADDING_A_PLATFORM.md refactor(plugins): add apply_yaml_config_fn registry hook 2026-05-13 22:20:30 -07:00
api_server.py fix: clean stale conversation mappings on response eviction/deletion 2026-05-15 01:27:43 -07:00
base.py feat(discord): channel history backfill for multi-user sessions 2026-05-14 15:50:57 -07:00
bluebubbles.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
dingtalk.py fix(gateway): add lazy_deps.ensure() to slack, matrix, dingtalk, feishu adapters (#25014) 2026-05-13 19:28:50 +05:30
discord.py feat(discord): default history backfill on, expand to per-user + threads 2026-05-14 15:50:57 -07:00
email.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
feishu.py fix(async): close unscheduled coroutines in all threadsafe bridges (#26584) 2026-05-15 14:00:01 -07:00
feishu_comment.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
feishu_comment_rules.py chore: ruff auto-fix C401, C416, C408, PLR1722 (#23940) 2026-05-11 11:20:58 -07:00
helpers.py chore: ruff auto-fixes — collapsible-else-if, if-stmt-min-max, dict.fromkeys (#23926) 2026-05-11 11:03:29 -07:00
homeassistant.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
matrix.py fix(gateway): complete lazy-install rebind for slack/feishu/matrix + add ensure_and_bind helper (#25038) 2026-05-14 10:41:46 +05:30
mattermost.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
msgraph_webhook.py fix(msgraph_webhook): harden auth surface + IP allowlisting + response hygiene 2026-05-08 10:29:58 -07:00
signal.py fix(signal): handle group messages from linked devices in syncMessage path 2026-05-12 18:43:26 -07:00
signal_rate_limit.py feat(gateway/signal): add support for multiple images sending 2026-04-30 04:28:08 -07:00
slack.py fix(slack): guard split()[0] against whitespace-only command text 2026-05-15 01:50:56 -07:00
sms.py test(sms): use clear=True in test_missing_phone_number_is_non_retryable 2026-05-04 05:25:09 -07:00
telegram.py fix(telegram): set REQUIRES_EDIT_FINALIZE so final MarkdownV2 edit is not skipped 2026-05-14 14:51:07 -07:00
telegram_network.py chore: ruff auto-fix C401, C416, C408, PLR1722 (#23940) 2026-05-11 11:20:58 -07:00
webhook.py fix(webhook): widen INSECURE_NO_AUTH loopback check + tests + docs 2026-05-07 07:38:43 -07:00
wecom.py fix(wecom): update connection status after WebSocket reconnection 2026-05-12 18:48:17 -07:00
wecom_callback.py fix(gateway): tighten httpx keepalive and close whatsapp typing-response leak (#18451) 2026-05-02 02:23:37 -07:00
wecom_crypto.py feat(gateway): add WeCom callback-mode adapter for self-built apps 2026-04-11 15:22:49 -07:00
weixin.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
whatsapp.py fix(gateway): keep running when platforms fail; add per-platform circuit breaker + /platform (#26600) 2026-05-15 14:32:14 -07:00
yuanbao.py refactor(yuanbao): improve quote media fallback — move to DispatchMiddleware, tighten conditions 2026-05-15 01:17:50 -07:00
yuanbao_media.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
yuanbao_proto.py chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937) 2026-05-11 11:13:25 -07:00
yuanbao_sticker.py yuanbao platform (#16298) 2026-04-26 18:50:49 -07:00