mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-18 04:41:56 +00:00
fix(gateway): keep running when platforms fail; add per-platform circuit breaker + /platform (#26600)
Stop the gateway from exiting (or systemd-restart-looping) when a single
messaging adapter fails at startup or runtime. A misconfigured WhatsApp
(npm install timeout, unpaired bridge, missing creds.json) used to take
the entire gateway down, killing cron jobs and any other connected
platforms with it.
Changes:
• Startup (gateway/run.py): when connected_count==0 but the only
errors are retryable, log a degraded-state warning and keep the
gateway alive instead of returning False. Reconnect watcher then
recovers platforms as their underlying problem clears.
• Runtime (gateway/run.py _handle_adapter_fatal_error): when the last
adapter goes down with a retryable error and is queued for
reconnection, stay alive instead of exit-with-failure. Previously
this triggered systemd Restart=on-failure, which created infinite
restart loops on persistent retryable failures (proxy outage,
repeated bridge crashes).
• Reconnect watcher (gateway/run.py _platform_reconnect_watcher):
replace the 20-attempt hard drop with a circuit-breaker pause.
After _PAUSE_AFTER_FAILURES (10) consecutive retryable failures, the
platform stays in _failed_platforms with paused=True so the watcher
skips it but the operator can still see and resume it. Non-retryable
errors still drop out of the queue immediately. Resolves #17063
(gateway giving up on Telegram after 20 attempts).
• WhatsApp preflight (gateway/platforms/whatsapp.py): refuse to start
the Node bridge when creds.json is missing. Sets a non-retryable
whatsapp_not_paired fatal error so the watcher drops it cleanly
with a single 'run hermes whatsapp' log line instead of paying the
30s bridge bootstrap timeout on every gateway start.
• WhatsApp setup ordering (hermes_cli/main.py cmd_whatsapp): only set
WHATSAPP_ENABLED=true once pairing actually succeeds. Previously
the wizard wrote the env var at step 2 (before npm install and QR
pairing), so any Ctrl+C left .env claiming WhatsApp was ready when
the bridge had no creds.json. Also propagate the env var when the
user keeps an existing pairing on a re-run.
• /platform slash command (hermes_cli/commands.py + gateway/run.py):
new gateway-only command for manual circuit-breaker control.
/platform list — show connected + failed/paused platforms
/platform pause <name> — silence a known-broken platform
/platform resume <name> — re-queue a paused platform
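For readers unfamiliar with the pattern, the circuit-breaker pause and the manual resume described above can be sketched roughly as follows. This is a minimal illustration and not the code in gateway/run.py: only the `_PAUSE_AFTER_FAILURES` threshold (10) and the `paused` flag come from the commit message, while `FailedPlatform`, `record_failure`, `due_for_retry`, and `resume` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

# Threshold named in the commit message; everything else below is hypothetical.
_PAUSE_AFTER_FAILURES = 10


@dataclass
class FailedPlatform:
    """Hypothetical bookkeeping record for one platform in the retry queue."""
    name: str
    attempts: int = 0
    paused: bool = False


def record_failure(failed: dict[str, FailedPlatform], name: str, retryable: bool) -> None:
    """Update the queue after a failed connect attempt."""
    if not retryable:
        # Non-retryable errors drop out of the queue immediately.
        failed.pop(name, None)
        return
    entry = failed.setdefault(name, FailedPlatform(name))
    entry.attempts += 1
    if entry.attempts >= _PAUSE_AFTER_FAILURES:
        # Circuit breaker: keep the entry visible, but stop retrying it.
        entry.paused = True


def due_for_retry(failed: dict[str, FailedPlatform]) -> list[str]:
    """Names the watcher should retry on this pass (paused entries are skipped)."""
    return [name for name, entry in failed.items() if not entry.paused]


def resume(failed: dict[str, FailedPlatform], name: str) -> bool:
    """Re-queue a paused platform; returns False if the name is not in the queue."""
    entry = failed.get(name)
    if entry is None:
        return False
    entry.paused = False
    entry.attempts = 0
    return True
```

The point of pausing rather than dropping is that a paused entry stays in the queue, so a list command can still report it and a resume command can still find it.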
Tests:
• New: pause/resume helpers, /platform list|pause|resume command,
WhatsApp creds.json preflight, WhatsApp setup ordering.
• Updated: stale assertions that codified the old 'exit and let
systemd restart' behavior in test_runner_fatal_adapter.py,
test_runner_startup_failures.py, and test_platform_reconnect.py
(the 20-attempt give-up test became a circuit-breaker pause test).
5488 tests pass in tests/gateway/.
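The WhatsApp creds.json preflight listed under Changes can be sketched as below. This is an assumption-laden illustration, not the adapter code in gateway/platforms/whatsapp.py: the `whatsapp_not_paired` error code and the creds.json check come from the commit message, while the function name `whatsapp_preflight` and its `(error_code, retryable)` return shape are hypothetical.

```python
from pathlib import Path
from typing import Optional, Tuple


def whatsapp_preflight(session_dir: Path) -> Optional[Tuple[str, bool]]:
    """Return (error_code, retryable) if the bridge should not start, else None.

    Hypothetical helper: checking before launching the Node bridge avoids
    waiting out the bridge bootstrap timeout when no pairing exists.
    """
    if not (session_dir / "creds.json").exists():
        # Non-retryable: retrying cannot help until the user pairs the device.
        return ("whatsapp_not_paired", False)
    return None
```

Marking the error as non-retryable is what lets the reconnect watcher drop the platform after a single log line instead of queueing it for repeated attempts.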
parent 3b9368a0c4
commit 518f39557b
9 changed files with 745 additions and 62 deletions
@@ -198,6 +198,8 @@ COMMAND_REGISTRY: list[CommandDef] = [
                args_hint="[days]"),
     CommandDef("platforms", "Show gateway/messaging platform status", "Info",
                cli_only=True, aliases=("gateway",)),
+    CommandDef("platform", "Pause, resume, or list a failing gateway platform", "Info",
+               gateway_only=True, args_hint="<pause|resume|list> [name]"),
     CommandDef("copy", "Copy the last assistant response to clipboard", "Info",
                cli_only=True, args_hint="[number]"),
     CommandDef("paste", "Attach clipboard image from your clipboard", "Info",
@@ -1522,14 +1522,18 @@ def cmd_whatsapp(args):
     )
     print(f"\n✓ Mode: {mode_label}")
 
-    # ── Step 2: Enable WhatsApp ──────────────────────────────────────────
+    # ── Step 2: Mode is selected, will enable WhatsApp only after pairing ──
+    # We intentionally don't write WHATSAPP_ENABLED=true here. If the user
+    # aborts the wizard later (Ctrl+C, failed npm install, missed QR scan),
+    # we'd otherwise leave .env claiming WhatsApp is ready when the bridge
+    # has no creds.json. Every subsequent `hermes gateway` then paid a 30s
+    # bridge-bootstrap timeout and queued WhatsApp for indefinite retries.
+    # Now: aborted setup leaves WHATSAPP_ENABLED unset → gateway skips it.
+    # Re-runs that already have WHATSAPP_ENABLED=true (from a prior
+    # successful pairing) stay enabled — we just don't write it pre-emptively.
     print()
-    current = get_env_value("WHATSAPP_ENABLED")
-    if current and current.lower() == "true":
+    if (get_env_value("WHATSAPP_ENABLED") or "").lower() == "true":
         print("✓ WhatsApp is already enabled")
-    else:
-        save_env_value("WHATSAPP_ENABLED", "true")
-        print("✓ WhatsApp enabled")
 
     # ── Step 3: Allowed users ────────────────────────────────────────────
     current_users = get_env_value("WHATSAPP_ALLOWED_USERS") or ""
@@ -1619,6 +1623,12 @@ def cmd_whatsapp(args):
             session_dir.mkdir(parents=True, exist_ok=True)
             print("  ✓ Session cleared")
         else:
+            # Existing pairing — ensure WHATSAPP_ENABLED reflects that.
+            # (Older installs may have lost the env var; covers re-runs
+            # where the user picked "no, keep my session" but the var
+            # was never set or got removed.)
+            if (get_env_value("WHATSAPP_ENABLED") or "").lower() != "true":
+                save_env_value("WHATSAPP_ENABLED", "true")
             print("\n✓ WhatsApp is configured and paired!")
             print("  Start the gateway with: hermes gateway")
             return
@@ -1647,6 +1657,11 @@ def cmd_whatsapp(args):
     # ── Step 7: Post-pairing ─────────────────────────────────────────────
     print()
     if (session_dir / "creds.json").exists():
+        # Only enable WhatsApp now that pairing actually succeeded. If the
+        # user Ctrl+C'd at any earlier step, WHATSAPP_ENABLED stays unset
+        # and `hermes gateway` skips it cleanly instead of paying a 30s
+        # bridge timeout + queueing the platform for indefinite retries.
+        save_env_value("WHATSAPP_ENABLED", "true")
         print("✓ WhatsApp paired successfully!")
         print()
         if wa_mode == "bot":