fix(gateway): keep running when platforms fail; add per-platform circuit breaker + /platform (#26600)

Stop the gateway from exiting (or systemd-restart-looping) when a single
messaging adapter fails at startup or runtime.  A misconfigured WhatsApp
(npm install timeout, unpaired bridge, missing creds.json) used to take
the entire gateway down, killing cron jobs and any other connected
platforms with it.

Changes:

  • Startup (gateway/run.py): when connected_count==0 but the only
    errors are retryable, log a degraded-state warning and keep the
    gateway alive instead of returning False.  The reconnect watcher
    then recovers each platform once its underlying problem clears.

  • Runtime (gateway/run.py _handle_adapter_fatal_error): when the last
    adapter goes down with a retryable error and is queued for
    reconnection, stay alive instead of exiting with a failure status.
    Previously the exit triggered systemd Restart=on-failure, which
    created infinite restart loops on persistent retryable failures
    (proxy outage, repeated bridge crashes).

  • Reconnect watcher (gateway/run.py _platform_reconnect_watcher):
    replace the 20-attempt hard drop with a circuit-breaker pause.
    After _PAUSE_AFTER_FAILURES (10) consecutive retryable failures, the
    platform stays in _failed_platforms with paused=True so the watcher
    skips it but the operator can still see and resume it.  Non-retryable
    errors still drop out of the queue immediately.  Resolves #17063
    (gateway giving up on Telegram after 20 attempts).

  • WhatsApp preflight (gateway/platforms/whatsapp.py): refuse to start
    the Node bridge when creds.json is missing.  Sets a non-retryable
    whatsapp_not_paired fatal error so the watcher drops it cleanly
    with a single 'run hermes whatsapp' log line instead of paying the
    30s bridge bootstrap timeout on every gateway start.

  • WhatsApp setup ordering (hermes_cli/main.py cmd_whatsapp): only set
    WHATSAPP_ENABLED=true once pairing actually succeeds.  Previously
    the wizard wrote the env var at step 2 (before npm install and QR
    pairing), so any Ctrl+C left .env claiming WhatsApp was ready when
    the bridge had no creds.json.  Also propagate the env var when the
    user keeps an existing pairing on a re-run.

  • /platform slash command (hermes_cli/commands.py + gateway/run.py):
    new gateway-only command for manual circuit-breaker control.
      /platform list           — show connected + failed/paused platforms
      /platform pause <name>   — silence a known-broken platform
      /platform resume <name>  — re-queue a paused platform
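
The circuit-breaker behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in gateway/run.py: the ReconnectQueue and FailedPlatform names and all method names are hypothetical, and resetting the failure counter on resume is an assumption. Only the threshold of 10 and the paused flag come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass

# Mirrors _PAUSE_AFTER_FAILURES (10) from the commit message.
PAUSE_AFTER_FAILURES = 10


@dataclass
class FailedPlatform:
    """Hypothetical record for one entry in the failed-platform queue."""

    consecutive_failures: int = 0
    paused: bool = False


class ReconnectQueue:
    """Hypothetical stand-in for the gateway's _failed_platforms bookkeeping."""

    def __init__(self) -> None:
        self.failed: dict[str, FailedPlatform] = {}

    def record_failure(self, name: str, retryable: bool) -> None:
        if not retryable:
            # Non-retryable errors drop out of the queue immediately.
            self.failed.pop(name, None)
            return
        entry = self.failed.setdefault(name, FailedPlatform())
        entry.consecutive_failures += 1
        if entry.consecutive_failures >= PAUSE_AFTER_FAILURES:
            # Trip the breaker: the entry stays visible but is no longer retried.
            entry.paused = True

    def record_success(self, name: str) -> None:
        # A successful reconnect removes the platform from the queue.
        self.failed.pop(name, None)

    def due_for_retry(self) -> list[str]:
        # The watcher skips paused platforms.
        return [name for name, entry in self.failed.items() if not entry.paused]

    def pause(self, name: str) -> bool:
        # Manual pause, as in "/platform pause <name>".
        entry = self.failed.get(name)
        if entry is None:
            return False
        entry.paused = True
        return True

    def resume(self, name: str) -> bool:
        # Manual resume, as in "/platform resume <name>".
        entry = self.failed.get(name)
        if entry is None or not entry.paused:
            return False
        entry.paused = False
        entry.consecutive_failures = 0  # assumption: resume starts a fresh count
        return True
```

Keeping a paused platform in the queue, rather than deleting it after a fixed number of attempts, is what lets an operator still list and resume it later.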

Tests:

  • New: pause/resume helpers, /platform list|pause|resume command,
    WhatsApp creds.json preflight, WhatsApp setup ordering.
  • Updated: stale assertions that codified the old 'exit and let
    systemd restart' behavior in test_runner_fatal_adapter.py,
    test_runner_startup_failures.py, and test_platform_reconnect.py
    (the 20-attempt give-up test became a circuit-breaker pause test).

5488 tests pass in tests/gateway/.
Teknium 2026-05-15 14:32:14 -07:00 committed by GitHub
parent 3b9368a0c4
commit 518f39557b
9 changed files with 745 additions and 62 deletions


@@ -611,3 +611,93 @@ class TestHttpSessionLifecycle:
        mock_task.cancel.assert_not_called()
        assert adapter._poll_task is None


# ---------------------------------------------------------------------------
# Pre-flight: refuse to start the bridge when creds.json is missing
# ---------------------------------------------------------------------------
class TestNoCredsPreflight:
    """Verify ``connect()`` fast-fails as non-retryable when WhatsApp is
    enabled but the user never finished pairing (no ``creds.json``).

    Without this guard, every gateway boot:
      - spawned the bridge subprocess (npm install if needed)
      - waited 30s for status:connected (never happens without creds)
      - queued WhatsApp for indefinite retries that would just repeat

    With the guard, ``connect()`` returns False immediately with a
    non-retryable fatal error so the reconnect watcher drops the platform
    and the gateway gets a single clear log line telling the user to run
    ``hermes whatsapp``.
    """

    @pytest.mark.asyncio
    async def test_connect_returns_false_when_no_creds(self, tmp_path):
        from gateway.platforms.whatsapp import WhatsAppAdapter

        adapter = WhatsAppAdapter.__new__(WhatsAppAdapter)
        adapter.platform = Platform.WHATSAPP
        adapter.config = MagicMock()
        adapter._bridge_port = 19876
        # Point bridge_script at a real existing file so the earlier
        # bridge-missing check doesn't trip; we want to exercise the
        # creds.json check specifically.
        bridge = tmp_path / "bridge.js"
        bridge.write_text("// stub")
        adapter._bridge_script = str(bridge)
        adapter._session_path = tmp_path / "session"  # no creds.json inside
        adapter._session_path.mkdir()
        adapter._bridge_log_fh = None
        adapter._fatal_error_code = None
        adapter._fatal_error_message = None
        adapter._fatal_error_retryable = True

        with patch(
            "gateway.platforms.whatsapp.check_whatsapp_requirements",
            return_value=True,
        ):
            result = await adapter.connect()

        assert result is False
        # Non-retryable so the reconnect watcher drops it cleanly
        assert adapter._fatal_error_code == "whatsapp_not_paired"
        assert adapter._fatal_error_retryable is False

    @pytest.mark.asyncio
    async def test_connect_proceeds_when_creds_present(self, tmp_path):
        """When creds.json exists, the preflight check is bypassed and
        connect() proceeds to the bridge bootstrap path.  We don't fully
        simulate the bridge here; we just verify no fast-fail occurs.
        """
        from gateway.platforms.whatsapp import WhatsAppAdapter

        adapter = WhatsAppAdapter.__new__(WhatsAppAdapter)
        adapter.platform = Platform.WHATSAPP
        adapter.config = MagicMock()
        adapter._bridge_port = 19877
        bridge = tmp_path / "bridge.js"
        bridge.write_text("// stub")
        adapter._bridge_script = str(bridge)
        session_dir = tmp_path / "session"
        session_dir.mkdir()
        (session_dir / "creds.json").write_text("{}")
        adapter._session_path = session_dir
        adapter._bridge_log_fh = None
        adapter._fatal_error_code = None
        adapter._fatal_error_message = None
        adapter._fatal_error_retryable = True
        # Stub _acquire_platform_lock to return False so connect() exits
        # cleanly *after* the preflight, without spawning subprocesses.
        adapter._acquire_platform_lock = MagicMock(return_value=False)

        with patch(
            "gateway.platforms.whatsapp.check_whatsapp_requirements",
            return_value=True,
        ):
            result = await adapter.connect()

        # Preflight passed: connect() exits because we faked a failed lock
        # acquisition, but the fatal-error code is NOT the "not paired" one.
        assert result is False
        assert adapter._fatal_error_code != "whatsapp_not_paired"
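
For orientation, the preflight that these two tests exercise could look something like the sketch below. It is an assumption-laden illustration rather than the actual code in gateway/platforms/whatsapp.py: the whatsapp_preflight function, the FatalError tuple, and the message text are hypothetical. Only the creds.json file name, the whatsapp_not_paired code, and the non-retryable flag are taken from the tests above.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class FatalError(NamedTuple):
    """Hypothetical container for the adapter's fatal-error fields."""

    code: str
    message: str
    retryable: bool


def whatsapp_preflight(session_path: Path) -> Optional[FatalError]:
    """Return a non-retryable error when creds.json is missing, else None.

    Hypothetical helper: the real adapter sets its own _fatal_error_*
    attributes inside connect() instead of returning a tuple.
    """
    if (session_path / "creds.json").is_file():
        # Pairing finished at some point; let the bridge bootstrap proceed.
        return None
    return FatalError(
        code="whatsapp_not_paired",
        message="WhatsApp is enabled but not paired; run `hermes whatsapp`.",
        retryable=False,  # lets the reconnect watcher drop the platform
    )
```

Running the check before spawning the bridge is what avoids the 30s bootstrap timeout on every gateway start.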