hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-13 09:01:54 +00:00

Author	SHA1	Message	Date
Frowtek	71a9f44e80	fix(gateway): retry startup auto-resume when a failed platform reconnects	2026-06-04 05:56:45 -07:00
Teknium	45465b0d5d	fix(gateway): never auto-pause platforms on transient network/DNS failures (#35387) The per-platform reconnect watcher auto-paused a platform after 10 consecutive reconnect failures, setting next_retry=inf and requiring a manual /platform resume to recover. But both pause sites only ever fire on retryable failures — non-retryable errors (bad auth) already drop out of the retry queue earlier. So a transient DNS outage that spanned the watcher's backoff window would silently park the bot forever, even after connectivity returned. The watcher's own docstring already promised 'retryable failures keep retrying at the backoff cap indefinitely' — the code contradicted it. Remove the auto-pause from both reconnect-failure branches. Retryable failures now retry at the 5-min backoff cap forever and self-heal once the network recovers. The circuit breaker (_pause_failed_platform / _resume_paused_platform) stays for manual /platform pause|resume. Fixes #35284.	2026-05-30 07:33:34 -07:00
kshitijk4poor	66827f8947	chore: prune unused imports and duplicate import redefinitions Remove unused imports (F401) and duplicate/shadowed import redefinitions (F811) across the codebase using ruff's safe autofixes. No behavioral changes -- imports only. - ~1400 safe autofixes applied across 644 files (net -1072 lines) - __init__.py re-exports preserved (excluded from F401 removal so public re-export surfaces stay intact) - Re-exports that are imported or monkeypatched by tests but look unused in their defining module are kept with explicit # noqa: F401 (gateway/run.py load_dotenv; run_agent re-exports from agent.message_sanitization, agent.context_compressor, agent.retry_utils, agent.prompt_builder, agent.process_bootstrap, agent.codex_responses_adapter) - Unsafe F841 (unused-variable) fixes deliberately skipped -- those can change behavior when the RHS has side effects - ruff lints remain disabled in pyproject.toml (only PLW1514 is selected); this is a one-time cleanup, not a config change Verification: - python -m compileall: clean - pytest --collect-only: all 27161 tests collect (zero import errors) - core entry points import clean (run_agent, model_tools, cli, toolsets, hermes_state, batch_runner, gateway) - static scan: every name any test imports directly from an edited module still resolves	2026-05-28 22:26:25 -07:00
Teknium	518f39557b	fix(gateway): keep running when platforms fail; add per-platform circuit breaker + /platform (#26600) Stop the gateway from exiting (or systemd-restart-looping) when a single messaging adapter fails at startup or runtime. A misconfigured WhatsApp (npm install timeout, unpaired bridge, missing creds.json) used to take the entire gateway down, killing cron jobs and any other connected platforms with it. Changes: • Startup (gateway/run.py): when connected_count==0 but the only errors are retryable, log a degraded-state warning and keep the gateway alive instead of returning False. Reconnect watcher then recovers platforms as their underlying problem clears. • Runtime (gateway/run.py _handle_adapter_fatal_error): when the last adapter goes down with a retryable error and is queued for reconnection, stay alive instead of exit-with-failure. Previously this triggered systemd Restart=on-failure, which created infinite restart loops on persistent retryable failures (proxy outage, repeated bridge crashes). • Reconnect watcher (gateway/run.py _platform_reconnect_watcher): replace the 20-attempt hard drop with a circuit-breaker pause. After _PAUSE_AFTER_FAILURES (10) consecutive retryable failures, the platform stays in _failed_platforms with paused=True so the watcher skips it but the operator can still see and resume it. Non-retryable errors still drop out of the queue immediately. Resolves #17063 (gateway giving up on Telegram after 20 attempts). • WhatsApp preflight (gateway/platforms/whatsapp.py): refuse to start the Node bridge when creds.json is missing. Sets a non-retryable whatsapp_not_paired fatal error so the watcher drops it cleanly with a single 'run hermes whatsapp' log line instead of paying the 30s bridge bootstrap timeout on every gateway start. • WhatsApp setup ordering (hermes_cli/main.py cmd_whatsapp): only set WHATSAPP_ENABLED=true once pairing actually succeeds. Previously the wizard wrote the env var at step 2 (before npm install and QR pairing), so any Ctrl+C left .env claiming WhatsApp was ready when the bridge had no creds.json. Also propagate the env var when the user keeps an existing pairing on a re-run. • /platform slash command (hermes_cli/commands.py + gateway/run.py): new gateway-only command for manual circuit-breaker control. /platform list — show connected + failed/paused platforms /platform pause <name> — silence a known-broken platform /platform resume <name> — re-queue a paused platform Tests: • New: pause/resume helpers, /platform list|pause|resume command, WhatsApp creds.json preflight, WhatsApp setup ordering. • Updated: stale assertions that codified the old 'exit and let systemd restart' behavior in test_runner_fatal_adapter.py, test_runner_startup_failures.py, and test_platform_reconnect.py (the 20-attempt give-up test became a circuit-breaker pause test). 5488 tests pass in tests/gateway/.	2026-05-15 14:32:14 -07:00
tmimmanuel	3606414ec7	fix(gateway): isolate platform connect failures with per-platform timeout Wrap each adapter.connect() in asyncio.wait_for() so one platform hanging during startup or reconnect cannot block the others. Telegram's 8-retry connect loop (~140s worst case) previously prevented Feishu from ever starting when Telegram was network-restricted — common for users in regions where Telegram is blocked. Default timeout is 30s; override via HERMES_GATEWAY_PLATFORM_CONNECT_TIMEOUT (0 disables). Applied to both startup and the reconnect watcher so a platform that hangs mid-retry also does not stall retries for others. Fixes #17242	2026-04-29 05:00:37 -07:00
Teknium	caded0a5e7	fix: repair 57 failing CI tests across 14 files (#5823) * fix: repair 57 failing CI tests across 14 files Categories of fixes: Test isolation under xdist (-n auto): - test_hermes_logging: Strip ALL RotatingFileHandlers before each test to prevent handlers leaked from other xdist workers from polluting counts - test_code_execution: Force TERMINAL_ENV=local in setUp — prevents Modal AuthError when another test leaks TERMINAL_ENV=modal - test_timezone: Same TERMINAL_ENV fix for execute_code timezone tests - test_codex_execution_paths: Mock _resolve_turn_agent_config to ensure model resolution works regardless of xdist worker state Matrix adapter tests (nio not installed in CI): - Add _make_fake_nio() helper with real response classes for isinstance() checks in production code - Replace MagicMock(spec=nio.XxxResponse) with fake_nio instances - Wrap production method calls with patch.dict('sys.modules', {'nio': ...}) so import nio succeeds in method bodies - Use try/except instead of pytest.importorskip for nio.crypto imports (importorskip can be fooled by MagicMock in sys.modules) - test_matrix_voice: Skip entire file if nio is a mock, not just missing Stale test expectations: - test_cli_provider_resolution: _prompt_provider_choice now takes kwargs (default param added); mock getpass.getpass alongside input - test_anthropic_oauth_flow: Mock getpass.getpass (code switched from input) - test_gemini_provider: Mock models.dev + OpenRouter API lookups to test hardcoded defaults without external API variance - test_code_execution: Add notify_on_complete to blocked terminal params - test_setup_openclaw_migration: Mock prompt_choice to select 'Full setup' (new quick-setup path leads to _require_tty → sys.exit in CI) - test_skill_manager_tool: Patch get_all_skills_dirs alongside SKILLS_DIR so _find_skill searches tmp_path, not real ~/.hermes/skills/ Missing attributes in object.__new__ test runners: - test_platform_reconnect: Add session_store to _make_runner() - test_session_race_guard: Add hooks, _running_agents_ts, session_store, delivery_router to _make_runner() Production bug fix (gateway/run.py): - Fix sentinel eviction race: _AGENT_PENDING_SENTINEL was immediately evicted by the stale-detection logic because sentinels have no get_activity_summary() method, causing _stale_idle=inf >= timeout. Guard _should_evict with 'is not _AGENT_PENDING_SENTINEL'. * fix: address remaining CI failures - test_setup_openclaw_migration: Also mock _offer_launch_chat (called at end of both quick and full setup paths) - test_code_execution: Move TERMINAL_ENV=local to module level to protect ALL test classes (TestEnvVarFiltering, TestExecuteCodeEdgeCases, TestInterruptHandling, TestHeadTailTruncation) from xdist env leaks - test_matrix: Use try/except for nio.crypto imports (importorskip can be fooled by MagicMock in sys.modules under xdist)	2026-04-07 09:58:45 -07:00
Teknium	708f187549	fix(gateway): exit with failure when all platforms fail with retryable errors (#3592) When all messaging platforms exhaust retries and get queued for background reconnection, exit with code 1 so systemd Restart=on-failure can restart the process. Previously the gateway stayed alive as a zombie with no connected platforms and exit code 0. Salvaged from PR #3567 by kelsia14. Test updates added. Co-authored-by: kelsia14 <kelsia14@users.noreply.github.com>	2026-03-28 14:25:12 -07:00
Teknium	3b509da571	feat: auto-reconnect failed gateway platforms with exponential backoff (#2584) When a messaging platform fails to connect at startup (e.g. transient DNS failure) or disconnects at runtime with a retryable error, the gateway now queues it for background reconnection instead of giving up permanently. - New _platform_reconnect_watcher background task runs alongside the existing session expiry watcher - Exponential backoff: 30s, 60s, 120s, 240s, 300s cap - Max 20 retry attempts before giving up on a platform - Non-retryable errors (bad auth token, etc.) are not retried - Runtime disconnections via _handle_adapter_fatal_error now queue retryable failures instead of triggering gateway shutdown - On successful reconnect, adapter is wired up and channel directory is rebuilt automatically Fixes the case where a DNS blip during gateway startup caused Telegram and Discord to be permanently unavailable until manual restart.	2026-03-22 23:48:24 -07:00

8 commits
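
Illustrative sketch (not taken from the repository): the commit messages above describe two mechanisms that are easy to picture in code. One is a reconnect backoff of 30s, 60s, 120s, 240s, then a 300s cap. The other is a per-platform connect timeout built on asyncio.wait_for(), defaulting to 30s with 0 meaning disabled. The names backoff_delay, connect_with_timeout, BACKOFF_BASE_S and BACKOFF_CAP_S are hypothetical and chosen only for this example; the actual code in gateway/run.py may be structured quite differently.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical constants mirroring the schedule quoted in commit 3b509da571.
BACKOFF_BASE_S = 30
BACKOFF_CAP_S = 300


def backoff_delay(attempt: int) -> int:
    """Seconds to wait before retry number `attempt` (1-based).

    Doubles from the base on each attempt and is clamped at the cap:
    30, 60, 120, 240, 300, 300, ...
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(BACKOFF_BASE_S * 2 ** (attempt - 1), BACKOFF_CAP_S)


async def connect_with_timeout(
    connect: Callable[[], Awaitable[bool]],
    timeout_s: float = 30.0,
) -> bool:
    """Run a single connect attempt, treating a hang as a failure.

    `connect` stands in for an adapter's connect coroutine (an assumption).
    A timeout of 0 disables the limit, as the commit message describes
    for HERMES_GATEWAY_PLATFORM_CONNECT_TIMEOUT.
    """
    try:
        if timeout_s > 0:
            return bool(await asyncio.wait_for(connect(), timeout=timeout_s))
        return bool(await connect())
    except asyncio.TimeoutError:
        # The attempt did not finish in time; the caller can requeue it
        # without having blocked other platforms for longer than timeout_s.
        return False
```

With this shape, a watcher loop would call connect_with_timeout() for each failed platform and, on a False result, schedule the next attempt backoff_delay(n) seconds later, so one hanging platform delays the others by at most its own timeout.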