hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-20 05:01:30 +00:00

Author	SHA1	Message	Date
Teknium	518f39557b	fix(gateway): keep running when platforms fail; add per-platform circuit breaker + /platform (#26600 ) Stop the gateway from exiting (or systemd-restart-looping) when a single messaging adapter fails at startup or runtime. A misconfigured WhatsApp (npm install timeout, unpaired bridge, missing creds.json) used to take the entire gateway down, killing cron jobs and any other connected platforms with it. Changes: • Startup (gateway/run.py): when connected_count==0 but the only errors are retryable, log a degraded-state warning and keep the gateway alive instead of returning False. Reconnect watcher then recovers platforms as their underlying problem clears. • Runtime (gateway/run.py _handle_adapter_fatal_error): when the last adapter goes down with a retryable error and is queued for reconnection, stay alive instead of exit-with-failure. Previously this triggered systemd Restart=on-failure, which created infinite restart loops on persistent retryable failures (proxy outage, repeated bridge crashes). • Reconnect watcher (gateway/run.py _platform_reconnect_watcher): replace the 20-attempt hard drop with a circuit-breaker pause. After _PAUSE_AFTER_FAILURES (10) consecutive retryable failures, the platform stays in _failed_platforms with paused=True so the watcher skips it but the operator can still see and resume it. Non-retryable errors still drop out of the queue immediately. Resolves #17063 (gateway giving up on Telegram after 20 attempts). • WhatsApp preflight (gateway/platforms/whatsapp.py): refuse to start the Node bridge when creds.json is missing. Sets a non-retryable whatsapp_not_paired fatal error so the watcher drops it cleanly with a single 'run hermes whatsapp' log line instead of paying the 30s bridge bootstrap timeout on every gateway start. • WhatsApp setup ordering (hermes_cli/main.py cmd_whatsapp): only set WHATSAPP_ENABLED=true once pairing actually succeeds. Previously the wizard wrote the env var at step 2 (before npm install and QR pairing), so any Ctrl+C left .env claiming WhatsApp was ready when the bridge had no creds.json. Also propagate the env var when the user keeps an existing pairing on a re-run. • /platform slash command (hermes_cli/commands.py + gateway/run.py): new gateway-only command for manual circuit-breaker control. /platform list — show connected + failed/paused platforms /platform pause <name> — silence a known-broken platform /platform resume <name> — re-queue a paused platform Tests: • New: pause/resume helpers, /platform list\|pause\|resume command, WhatsApp creds.json preflight, WhatsApp setup ordering. • Updated: stale assertions that codified the old 'exit and let systemd restart' behavior in test_runner_fatal_adapter.py, test_runner_startup_failures.py, and test_platform_reconnect.py (the 20-attempt give-up test became a circuit-breaker pause test). 5488 tests pass in tests/gateway/.	2026-05-15 14:32:14 -07:00
Wesley Simplicio	246c676c2b	fix(gateway): degrade gracefully when all platform adapters are missing When connected_count == 0 AND enabled_platform_count > 0, the gateway treated 'all adapters returned None' identically to 'all adapters failed to connect' — both as fatal startup errors. The 'returned None' case happens when imports fail silently or when adapters are present in config but their dependencies aren't installed (e.g. discord.py missing). Cron jobs and other gateway-runtime work would unnecessarily fail to start. Split: only return False when startup_retryable_errors is non-empty (real connection attempt failed). When the list is empty AND enabled > 0, log a warning and continue running, matching the 'no platforms enabled' cron path. Salvage of #22642's gateway slice. Drops the bundled run_agent.py memory-nudge counter hydration block (issue #22357 territory) which wasn't mentioned in the PR description. Closes #5196.	2026-05-09 17:53:46 -07:00
hharry11	d0821b0573	fix(gateway): only clear locks belonging to the replaced process	2026-04-23 15:07:06 -07:00
Teknium	ce9c91c8f7	fix(gateway): close --replace race completely by claiming PID before adapter startup Follow-up on top of opriz's atomic PID file fix. The prior change caught the race AFTER runner.start(), so the loser still opened Telegram polling and Discord gateway sockets before detecting the conflict and exiting. Hoist the PID-claim block to BEFORE runner.start(). Now the loser of the O_CREAT\|O_EXCL race returns from start_gateway() without ever bringing up any platform adapter — no Telegram conflict, no Discord duplicate session. Also add regression tests: - test_write_pid_file_is_atomic_against_concurrent_writers: second write_pid_file() raises FileExistsError rather than clobbering. - Two existing replace-path tests updated to stateful mocks since the real post-kill state (get_running_pid None after remove_pid_file) is now exercised by the hoisted re-check.	2026-04-21 00:43:50 -07:00
kshitijk4poor	ff63e2e005	fix: tighten telegram docker-media salvage follow-ups Follow-up on top of the helix4u #6392 cherry-pick: - reuse one helper for actionable Docker-local file-not-found errors across document/image/video/audio local-media send paths - include /outputs/... alongside /output/... in the container-local path hint - soften the gateway startup warning so it does not imply custom host-visible mounts are broken; the warning now targets the specific risky pattern of emitting container-local MEDIA paths without an explicit export mount - add focused regressions for /outputs/... and non-document media hint coverage This keeps the salvage aligned with the actual MEDIA delivery problem on current main while reducing false-positive operator messaging.	2026-04-19 01:55:33 -07:00
helix4u	588333908c	fix(telegram): warn on docker-only media paths	2026-04-19 01:55:33 -07:00
Teknium	07db20c72d	fix(gateway): detect legacy hermes.service + mark --replace SIGTERM as planned (#11909 ) * fix(gateway): detect legacy hermes.service units from pre-rename installs Older Hermes installs used a different service name (hermes.service) before the rename to hermes-gateway.service. When both units remain installed, they fight over the same bot token — after PR #5646's signal-recovery change, this manifests as a 30-second SIGTERM flap loop between the two services. Detection is an explicit allowlist (no globbing) plus an ExecStart content check, so profile units (hermes-gateway-<profile>.service) and unrelated third-party services named 'hermes' are never matched. Wired into systemd_install, systemd_status, gateway_setup wizard, and the main hermes setup flow — anywhere we already warn about scope conflicts now also warns about legacy units. * feat(gateway): add migrate-legacy command + install-time removal prompt - New hermes_cli.gateway.remove_legacy_hermes_units() removes legacy unit files with stop → disable → unlink → daemon-reload. Handles user and system scopes separately; system scope returns path list when not running as root so the caller can tell the user to re-run with sudo. - New 'hermes gateway migrate-legacy' subcommand (with --dry-run and -y) routes to remove_legacy_hermes_units via gateway_command dispatch. - systemd_install now offers to remove legacy units BEFORE installing the new hermes-gateway.service, preventing the SIGTERM flap loop that hits users who still have pre-rename hermes.service around. Profile units (hermes-gateway-<profile>.service) remain untouched in all paths — the legacy allowlist is explicit (_LEGACY_SERVICE_NAMES) and the ExecStart content check further narrows matches. * fix(gateway): mark --replace SIGTERM as planned so target exits 0 PR #5646 made SIGTERM exit the gateway with code 1 so systemd's Restart=on-failure revives it after unexpected kills. But when a user has two gateway units fighting for the same bot token (e.g. legacy hermes.service + hermes-gateway.service from a pre-rename install), the --replace takeover itself becomes the 'unexpected' SIGTERM — the loser exits 1, systemd revives it 30s later, and the cycle flaps indefinitely. Before calling terminate_pid(), --replace now writes a short-lived marker file naming the target PID + start_time. The target's shutdown_signal_handler consumes the marker and, when it names this process, leaves _signal_initiated_shutdown=False so the final exit code stays 0. Staleness defences: - PID + start_time combo prevents PID reuse matching an old marker - Marker older than 60s is treated as stale and discarded - Marker is unlinked on first read even if it doesn't match this process - Replacer clears the marker post-loop + on permission-denied give-up	2026-04-17 19:27:58 -07:00
Tom Qiao	8a48c58bd3	fix(gateway): add missing RedactingFormatter import The gateway startup path references RedactingFormatter without importing it, causing a NameError crash when launched with a verbosity flag (e.g. via launchd --replace). Fixes #8044 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-11 19:38:05 -07:00
helix4u	cfbfc4c3f1	fix(discord): decouple readiness from slash sync	2026-04-11 19:22:14 -07:00
H-5-Isminiz	00dd5cc491	fix(gateway): implement platform-aware PID termination	2026-04-10 03:52:00 -07:00
JP Lew	17e87478d2	fix(gateway): restart on retryable startup failures (#1517 )	2026-03-16 05:26:31 -07:00

11 commits