hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-06 02:41:48 +00:00

Author	SHA1	Message	Date
briandevans	9e2628ee7c	test(discord): annotate make_attachment content_type as Optional[str] Copilot review: the helper accepted None in one test but was annotated str. Matches actual usage where no-content-type attachments are a tested scenario. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 12:36:47 -07:00
Albert.Zhou	fd9c32c0f2	fix(email): drop non-allowlisted senders before dispatch to prevent mail loops Add EMAIL_ALLOWED_USERS check in EmailAdapter._dispatch_message() to silently discard emails from senders not in the allowlist. This prevents the adapter from creating thread context and dispatching a MessageEvent for unauthorized senders, which could race with the gateway authorization check and result in SMTP replies being sent despite the handler returning None. Test: tests/gateway/test_email.py::TestDispatchMessage::test_non_allowlisted_sender_dropped Test: tests/gateway/test_email.py::TestDispatchMessage::test_allowlisted_sender_proceeds Test: tests/gateway/test_email.py::TestDispatchMessage::test_empty_allowlist_allows_all	2026-05-04 12:35:22 -07:00
Teknium	d90f73bcec	fix(gateway): use git HEAD SHA, not file mtimes, for stale-code check (#19740 ) The stale-code self-check (Issue #17648) used sentinel-file mtimes to decide whether the gateway survived a `hermes update` with stale `sys.modules`. That signal false-positives on any write to the sentinel files — including agent-driven edits during Hermes-on-Hermes dev sessions. Telling the agent to patch `run_agent.py` would flip the check to True on the next user message and force a gateway restart even though no update happened. Switch the signal to `git rev-parse HEAD`. Agent file edits don't move HEAD; `hermes update` (git pull) always does. Reading .git/HEAD directly (no subprocess) with a 5s cache keeps the overhead negligible on bursty chats. Non-git installs short-circuit to False — the stale-modules class can't occur without a git-backed update path, so there's nothing to detect. The legacy `_compute_repo_mtime` helper is kept but unused by detection, reserved as a fallback hook for future pip-install update paths. - _read_git_head_sha(): resolves HEAD across main checkout, worktree (follows `gitdir:` + `commondir` pointers), and packed-refs layouts. - _current_git_sha_cached(): per-runner 5s SHA cache. - _detect_stale_code(): boot SHA vs current SHA, returns False when either is unavailable. - Tests cover all four layouts, the agent-edits-don't-trigger regression, and cache behavior. Refs #17648.	2026-05-04 12:33:21 -07:00
teknium1	d35efb9898	feat(telegram): /topic off + help + auth gate + screenshot debounce Four production-readiness additions to topic mode: 1. /topic off — clean disable path. Flips telegram_dm_topic_mode.enabled to 0 and clears telegram_dm_topic_bindings for this chat. Previously users had to edit state.db with sqlite3 to turn the feature off. Idempotent: calling /topic off when the chat was never enabled returns a friendly no-op message. 2. /topic help — inline usage printed in the DM so users don't have to visit docs to discover /topic off, /topic <session-id>, etc. 3. Authorization gate. /topic mutates SQLite side tables and flips the root DM into a lobby, so the action must be authorized. Now calls self._is_user_authorized(source); unauthorized DMs get a refusal instead of activation. Defense in depth on top of the gateway's existing pre-route auth. 4. BotFather screenshot debounce. A user repeatedly running /topic while Threads Settings is still disabled would previously re-upload the same screenshot every time. Now rate-limited to one send per 5 minutes per chat. /topic off resets the counter so re-enabling starts fresh. Command-def args hint updated: /topic [off\|help\|session-id]. Docs: - New /topic subcommands table at the top of the multi-session section - Disable instructions updated to recommend /topic off first, with the raw SQL fallback kept for bulk cleanup - Under-the-hood list extended with the capability-hint debounce and the authorization gate Tests (6 new): - /topic help returns usage and doesn't create topic tables - /topic off disables mode AND clears bindings - /topic off is idempotent when never enabled - Unauthorized users get refusal, no tables created - Capability-hint debounce is per-chat - /topic off resets both lobby and capability debounce counters All 402 targeted tests pass. Full gateway sweep: 4809/4810 (pre-existing test_teams::test_send_typing unrelated).	2026-05-04 12:07:17 -07:00
teknium1	1381c89e56	fix(telegram): polish topic mode — CASCADE, General-topic handling, rename guard, debounce Five follow-ups to topic mode based on integration audit: 1. ON DELETE CASCADE on telegram_dm_topic_bindings.session_id. Session pruning (manual /delete, auto-cleanup, any future prune job) would have thrown 'FOREIGN KEY constraint failed' for sessions bound to a topic. Migration bumped to v2, rebuilds the bindings table in place if FK lacks CASCADE. Idempotent; only runs once per DB. 2. Never auto-rename operator-declared topics. If an operator has extra.dm_topics configured AND a user runs /topic, messages in those pre-declared topics would previously trigger auto-rename and silently mutate operator config. _rename_telegram_topic_for_session_title now early-returns when _get_dm_topic_info returns a dict for this (chat_id, thread_id). Uses class-based lookup (not hasattr) so MagicMock test fixtures don't accidentally trip the guard. 3. General topic handling. Telegram's General (pinned top) topic in a forum-enabled private chat may send messages with message_thread_id=1 or omit thread_id entirely depending on client. Both are now treated as the root lobby, not a topic lane. Prevents users from accidentally burning a session on the General topic. 4. Debounce the root-lobby reminder. 30-second cooldown per chat so a user who forgets topic mode is enabled and types ten messages in the root gets one reminder, not ten. Explicit command replies (/new-in-lobby, /topic <session-id>) still land every time. 5. Docs: added under-the-hood invariants for the above, plus a Downgrade section explaining that rolling back to a pre-/topic Hermes build leaves the DB tables orphaned but harmless — DMs just revert to native per-thread isolation. Tests: - test_operator_declared_topic_is_not_auto_renamed - test_general_topic_is_treated_as_root_lobby - test_lobby_reminder_is_debounced_per_chat - test_binding_survives_session_deletion_via_cascade - test_migration_rebuilds_v1_binding_table_with_cascade_fk Validated: 4803/4804 tests pass (tests/gateway/ + tests/test_hermes_state.py). Sole failure is a pre-existing test_teams::test_send_typing flake unrelated to this PR.	2026-05-04 12:07:17 -07:00
teknium1	a7683d04a9	fix(telegram): harden DM topic binding — persist through switch_session, rebind on /new Follow-up on @EmelyanenkoK's feat: add Telegram DM topic-mode sessions. Three issues: 1. Split-brain session state. After get_or_create_session() returned a SessionEntry for a topic lane, the handler was mutating .session_id in place to the binding's target, but never persisting the switch through SessionStore. The sessions.json session_key → session_id map kept pointing at the lane's natural id; any reader that reloaded from disk saw the wrong id. Fixed by routing through SessionStore.switch_session(), which _save()s the mapping and ends the old session in SQLite like /resume does. 2. /new inside a topic was a one-message no-op. Reset created a new session but left the telegram_dm_topic_bindings row pointing at the old session_id, so the next message's binding lookup switched right back. Now _handle_reset_command rebinds the topic to the new session_id after reset. 3. is_telegram_session_linked_to_topic and list_unlinked_telegram_sessions_for_user both called apply_telegram_topic_migration() on read, contradicting the PR's own invariant that migration only runs on explicit /topic opt-in. They now tolerate missing topic tables and return empty/False. Also: _telegram_topic_mode_enabled() now only treats True as enabled (not any truthy return), so test fixtures with MagicMock session_db don't accidentally flip every DM into lobby mode — this was breaking 4 pre-existing test_status_command tests. Tests: - New regression: /new inside a topic must update the binding row (test_new_inside_telegram_topic_rewrites_binding_to_new_session). - _make_runner now stubs switch_session so existing restore tests still exercise the new code path. Validated end-to-end with real SessionDB + SessionStore: readers on fresh DB don't create topic tables; enable creates them; binding override persists across SessionStore restart; /new rebinds and the new id survives a restart. Co-authored-by: EmelyanenkoK <emelyanenko.kirill@gmail.com>	2026-05-04 12:07:17 -07:00
EmelyanenkoK	25065283b3	fix: improve telegram topic mode setup	2026-05-04 12:07:17 -07:00
EmelyanenkoK	d6615d8ec7	feat: add Telegram DM topic-mode sessions	2026-05-04 12:07:17 -07:00
kshitijk4poor	54e78cadb2	test: add regression test for Teams interactive_setup import fix Adapted from PR #19188 by @LeonSGP43 — mocks cli_output helpers and verifies interactive_setup persists credentials to .env without crashing. Also adds megastary to AUTHOR_MAP.	2026-05-04 06:54:27 -07:00
briandevans	ce22301dc6	test(sms): use clear=True in test_missing_phone_number_is_non_retryable Prevents pre-existing TWILIO_PHONE_NUMBER or SMS_WEBHOOK_URL values in the outer test environment from leaking into the assertion context. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 05:25:09 -07:00
Pratik Rai	7a8ee8b29d	fix(gateway): deduplicate Weixin messages by content fingerprint	2026-05-04 05:20:13 -07:00
OpenClaw Bot	0443484115	fix(qqbot): honor proxy env vars for websocket	2026-05-04 05:06:09 -07:00
Teknium	5b6d413476	fix(cli,gateway): surface title errors from /new <name> The contributor's PR silently swallowed ValueError from SessionDB.set_session_title() with bare except Exception: pass. Users typing /new <title> with an already-in-use title got an untitled session and no feedback. Changes: - cli.py: catch ValueError from both sanitize_title() and set_session_title(); print the error and mark the session untitled in the banner (never echo the rejected title back). - gateway/run.py: append a warning note to the reset reply on title rejection; reflect the accepted title in the header. - Add regression tests for the duplicate-title path in CLI and gateway. Also map exx@example.com -> @exxmen in scripts/release.py.	2026-05-04 03:14:50 -07:00
Exx	f720751d79	feat(cli,gateway): /new accepts optional session name argument Allow users to start a fresh session and immediately set its title by passing a name to /new (or /reset): /new Refactor auth module Changes: - hermes_cli/commands.py: add args_hint='[name]' to /new command - cli.py: parse title argument in process_command(), pass to new_session() - cli.py: new_session() accepts title=None, sets title via SessionDB - gateway/run.py: _handle_reset_command() parses title, sets on new entry - gateway/session.py: reset_session() accepts optional display_name - tests: add test_new_session_with_title, test_reset_command_with_title, test_new_command_in_help_output All 36 affected tests pass.	2026-05-04 03:14:50 -07:00
DaniuXie	a45bd28598	fix(wecom): set SUPPORTS_MESSAGE_EDITING=False to prevent broken streaming	2026-05-04 03:10:36 -07:00
Asunfly	8a364df2c8	fix: inherit reasoning config in API server runs	2026-05-04 01:44:16 -07:00
molvikar	74636f9c4a	fix(gateway): clear queued reload-skills notes on new/resume/branch	2026-05-03 17:00:31 -07:00
Kenny Wang	222767e5e8	fix: sanitize Telegram help command mentions	2026-05-03 17:00:09 -07:00
konsisumer	6fda92aa7f	fix(gateway): bridge top-level require_mention to Telegram config Users commonly place `require_mention: true` at the top level of config.yaml alongside `group_sessions_per_user`, expecting it to gate Telegram group messages. The key was silently ignored because the config loader only checked `yaml_cfg["telegram"]["require_mention"]`. When `require_mention` is found at the top level and no telegram-specific value is set, the fix now: - adds it to platforms_data["telegram"]["extra"] so _telegram_require_mention() picks it up via the primary config.extra path - sets TELEGRAM_REQUIRE_MENTION env var for the secondary fallback path A telegram-specific value (telegram.require_mention) still takes precedence over the top-level shorthand. Also corrects telegram.md: bare /cmd without @botname is rejected when require_mention is enabled; only /cmd@botname (bot-menu form) passes. Fixes #3979	2026-05-03 16:59:46 -07:00
clawbot	1bd975c0ba	fix(gateway): suppress duplicate voice transcripts Deduplicate exact and near-exact Discord voice STT transcripts per guild/user over a short window to avoid duplicate delayed agent replies. Adds regression tests for exact and near-duplicate voice transcript suppression.	2026-05-03 16:59:21 -07:00
Zyproth	a5cae16496	fix(api_server): fall back to default port on malformed API_SERVER_PORT	2026-05-03 15:27:03 -07:00
leprincep35700	b59bb4e351	fix(gateway): preserve home-channel thread targets across restart notifications	2026-05-03 08:47:49 -07:00
Teknium	d87fd9f039	fix(goals): make /goal work in TUI and fix gateway verdict delivery (#19209 ) /goal was silently broken outside the classic CLI. TUI: /goal was routed through the HermesCLI slash-worker subprocess, which set the goal row in SessionDB but then called _pending_input.put(state.goal) — the subprocess has no reader for that queue, so the kickoff message was discarded. No post-turn judge was wired into prompt.submit either, so even a manual kickoff would not continue the goal loop. Intercept /goal in command.dispatch instead, drive GoalManager directly, and return {type: send, notice, message} so the TUI client renders the Goal-set notice and fires the kickoff. Run the judge in _run_prompt_submit after message.complete, surface the verdict via status.update {kind: goal}, and chain the continuation turn after the running guard is released. Gateway: _post_turn_goal_continuation was gated on hasattr(adapter, 'send_message'), but adapters only expose send(). That branch was dead on every platform — users never saw '✓ Goal achieved', 'Continuing toward goal', or budget-exhausted messages. Replace the dead call with adapter.send(chat_id, content, metadata) and drop a broken reference to self._loop. Tests: - tests/tui_gateway/test_goal_command.py — full /goal dispatch matrix (set / status / pause / resume / clear / stop / done / whitespace) plus regressions for slash.exec → 4018 and 'goal' staying in _PENDING_INPUT_COMMANDS. - tests/gateway/test_goal_verdict_send.py — locks in the adapter.send path for done / continue / budget-exhausted and verifies the hook no-ops when no goal is set or the adapter lacks send().	2026-05-03 05:49:12 -07:00
kshitijk4poor	6f2dab248a	fix: update tests for resume_pending semantics + add AUTHOR_MAP entries Tests updated to reflect suspend_recently_active now setting resume_pending=True (preserves session) instead of suspended=True (wipes session history). AUTHOR_MAP entries: millerc79 (#19033), shellybotmoyer (#18915)	2026-05-03 03:54:03 -07:00
nftpoetrist	6c1322b997	fix(slack): close previous handler in connect() to prevent zombie Socket Mode connections SlackAdapter.connect() overwrote self._handler, self._app, and self._socket_mode_task without closing the prior AsyncSocketModeHandler first. If connect() was called a second time on the same adapter (e.g. during a gateway restart or in-process reconnect attempt), the old Socket Mode websocket stayed alive. Both the old and new connections received every Slack event and dispatched it twice — producing double responses with different wording, the same bug that affected DiscordAdapter (#18187, fixed in #18758). Fix: add a close-before-reassign guard at the start of the connection setup path, mirroring the guard DiscordAdapter.connect() already has. When self._handler is None (fresh adapter, first connect()) the block is a harmless no-op. Scoped to the handler/app fields only — no behavior change for any path that does not call connect() twice. Fixes #18980	2026-05-03 03:47:49 -07:00
0xyg3n	19ba9e43b6	fix(gateway/discord): require allowlist auth on slash commands Slash commands (_run_simple_slash, _handle_thread_create_slash) bypassed every DISCORD_ALLOWED_* gate enforced by on_message. Any guild member could invoke /background (RCE via terminal), /restart, /model, /skill, etc. CVSS 9.8 Critical. - _evaluate_slash_authorization mirrors on_message gates (user, role, channel, ignored channel) with fail-closed semantics - _check_slash_authorization sends ephemeral reject + logs + admin alert - Auth gate runs before defer() so rejections are ephemeral - /skill autocomplete returns [] for unauthorized users (no catalog leak) - Component views (ExecApproval, SlashConfirm, UpdatePrompt, ModelPicker) now honor role allowlists via shared _component_check_auth helper - Optional DISCORD_HIDE_SLASH_COMMANDS defense-in-depth - Cross-platform admin alert (Telegram/Slack fallback) on unauthorized attempts Based on PR #18125 by @0xyg3n.	2026-05-03 03:44:55 -07:00
teknium1	762eb79f1e	fix(gateway): tighten httpx keepalive and close whatsapp typing-response leak (#18451 ) Two mitigations for the CLOSE_WAIT accumulation reported against QQ Bot + Feishu on macOS behind Cloudflare Warp. 1. Shared httpx.Limits helper (gateway/platforms/_http_client_limits.py). Every long-lived platform adapter now constructs httpx.AsyncClient with max_keepalive_connections=10 and keepalive_expiry=2.0, vs httpx's default of unbounded keepalive pool and 5.0s expiry. On macOS/Warp the default 5s window let idle keepalive sockets sit in CLOSE_WAIT long enough for seven persistent adapters (QQ Bot, WeCom, DingTalk, Signal, BlueBubbles, WeCom-callback, plus the transient Feishu helper) to compound to the 256-fd ulimit. Tunable via HERMES_GATEWAY_HTTPX_KEEPALIVE_EXPIRY and HERMES_GATEWAY_HTTPX_MAX_KEEPALIVE env vars. 2. whatsapp.send_typing aiohttp leak. The call was 'await self._http_session.post(...)' with no 'async with' and no variable capture — the ClientResponse went out of scope unclosed, holding its TCP socket in CLOSE_WAIT until GC. Fixed by wrapping in 'async with'. This was the only bare-await aiohttp leak in the gateway/tools/plugins tree per audit; all other aiohttp sites use the context-manager pattern correctly. The underlying reporter also saw Feishu SDK (lark-oapi) connections in CLOSE_WAIT — those are inside the SDK and out of our direct control, but tightening httpx keepalive across adapters reduces the aggregate pool pressure regardless of which individual adapter leaks.	2026-05-02 02:23:37 -07:00
beibi9966	38dd057e91	fix(feishu): finalize remote document downloads inside httpx.AsyncClient context (#18502 ) Snapshot Content-Type and body while the client context is still active so pooled connections fully release on exit. Previously the read happened after `async with httpx.AsyncClient(...)` returned — which works today only because httpx eagerly buffers non-streaming responses; a future refactor to `.stream()` would silently read- after-close. Part of the #18451 connection-hygiene audit. Salvage of #18502.	2026-05-02 02:23:37 -07:00
Teknium	1dce908930	fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log) (#18761 ) * fix(gateway): config.yaml wins over .env for agent/display/timezone settings Regression from the silent config→env bridge. The bridge at module import time is correct for max_turns (unconditional overwrite), but every other agent., display., timezone, and security bridge key was guarded by 'if X not in os.environ' — so a stale .env entry from an old 'hermes setup' run would shadow the user's current config.yaml indefinitely. Symptom: agent.max_turns: 500 in config.yaml, HERMES_MAX_ITERATIONS=60 in .env from an old setup, and the gateway silently capped at 60 iterations per turn. Gateway logs confirmed api_calls never exceeded 60. Three changes: 1. gateway/run.py: drop the 'not in os.environ' guards for all agent., display., timezone, and security.* bridge keys. config.yaml is now authoritative for these settings — same semantics already in place for max_turns, terminal., and auxiliary.. Also surface the bridge failure (previously 'except Exception: pass') to stderr so operators see bridge errors instead of silently falling back to .env. 2. gateway/run.py: INFO-log the resolved max_iterations at gateway start so operators can verify the config→env bridge did the right thing instead of chasing a phantom budget ceiling. 3. hermes_cli/setup.py: stop writing HERMES_MAX_ITERATIONS to .env in the setup wizard. config.yaml is the single source of truth. Also clean up any stale .env entry left behind by pre-fix setups. Regression tests in tests/gateway/test_config_env_bridge_authority.py guard each config→env key against the 'stale .env shadows config' bug. * fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log) Three issues observed in production gateway.log during a rapid restart chain on 2026-05-02, all fixed here. 1. _send_restart_notification logged unconditional success adapter.send() catches provider errors (e.g. Telegram 'Chat not found') and returns SendResult(success=False); it never raises. The caller ignored the return value and always logged 'Sent restart notification to <chat>' at INFO, producing a misleading success line directly below the 'Failed to send Telegram message' traceback on every boot. Now inspects result.success and logs WARNING with the error otherwise. 2. WhatsApp bridge SIGTERM on shutdown classified as fatal error _check_managed_bridge_exit() saw the bridge's returncode -15 (our own SIGTERM from disconnect()) and fired the full fatal-error path, producing 'ERROR ... WhatsApp bridge process exited unexpectedly' plus 'Fatal whatsapp adapter error (whatsapp_bridge_exited)' on every planned shutdown, immediately before the normal '✓ whatsapp disconnected'. Adds a _shutting_down flag that disconnect() sets before the terminate, and _check_managed_bridge_exit() returns None for returncode in {0, -2, -15} while shutting down. OOM-kill (137) and other non-signal exits still hit the fatal path. 3. restart_drain_timeout default 60s → 180s On 2026-05-02 01:43:27 a user /restart fired while three agents were mid-API-call (82s, 112s, 154s into their turns). The 60s drain budget expired and all three were force-interrupted. 180s covers realistic in-flight agent turns; users on very-long-reasoning models can still raise it further via agent.restart_drain_timeout in config.yaml. Existing explicit user values are preserved by deep-merge. Tests - tests/gateway/test_restart_notification.py: two new tests assert INFO is only logged on SendResult(success=True) and WARNING with the error string is logged on SendResult(success=False). - tests/gateway/test_whatsapp_connect.py: parametrized test for returncode in {0, -2, -15} proves shutdown-time exits are suppressed; separate test proves returncode 137 (SIGKILL/OOM) still surfaces as fatal even when _shutting_down is set. - _check_managed_bridge_exit() reads _shutting_down via getattr-with- default so existing _make_adapter() test helpers that bypass __init__ (pitfall #17 in AGENTS.md) keep working unmodified.	2026-05-02 02:08:06 -07:00
teknium1	e363ced3c3	test(discord): regression coverage for zombie-websocket guard in connect() Covers PR #18224 fix for issue #18187 — when DiscordAdapter.connect() is called a second time without an intervening disconnect(), the previous commands.Bot must be closed before a new one is created. Otherwise both websockets stay connected to Discord's gateway and both fire on_message, producing double responses with different wording.	2026-05-02 02:04:14 -07:00
Teknium	10297fa23c	fix(discord): `/reload-skills` now refreshes the `/skill` autocomplete live (#18754 ) `_register_skill_group` captured the skill catalog in closure variables (`entries` and `skill_lookup`) so the single `tree.add_command` call at startup owned the only live copy. The closure is never re-entered after startup, so `/reload-skills` — which rescans the on-disk skills dir and refreshes the in-process `_skill_commands` registry — had no way to propagate results into the `/skill` autocomplete on Discord. New skills stayed invisible in the dropdown, and deleted skills returned "Unknown skill" when the stale autocomplete entry was clicked. The fix is purely a dataflow change: promote `entries` and `skill_lookup` to instance attributes (`_skill_entries`, `_skill_lookup`), split the collector-driven rebuild into a helper (`_refresh_skill_catalog_state`), and add a public `refresh_skill_group()` method that re-runs the helper and is safe to call at any point after the initial registration. The gateway's `_handle_reload_skills_command` then iterates `self.adapters` and calls `refresh_skill_group()` on any adapter that exposes it (currently only Discord). Both sync and async implementations are supported; adapters that don't override the method (Telegram's BotCommand menu, Slack subcommand map, etc.) are silently skipped — the in-process `reload_skills()` call covers them. No `tree.sync()` is required because Discord fetches autocomplete options dynamically on every keystroke — mutating the instance state the callbacks already read from is sufficient. That sidesteps the per-app command-bucket rate limit (~5 writes / 20 s) that made the previous bulk-sync-on-reload approach unusable (#16713 context). Tests: tests/gateway/test_reload_skills_discord_resync.py — five cases covering (1) refresh replaces entries, (2) entries stay sorted after refresh, (3) collector exception leaves cached state intact, (4) `_refresh_skill_catalog_state` populates the instance attrs, (5) orchestrator calls `refresh_skill_group()` on sync + async adapters and skips adapters that don't expose it.	2026-05-02 02:00:11 -07:00
Teknium	6ec74aec07	fix(gateway): match disabled/optional skills by frontmatter slug, not dir name (#18753 ) _check_unavailable_skill is meant to turn a typed "/foo" command that doesn't resolve into a specific hint — "disabled, enable with hermes skills config" or "available but not installed, install with hermes skills install …" — instead of the generic "unknown command" reply. It was doing the match with `skill_md.parent.name.lower().replace("_", "-")`, comparing that to the typed command. For every skill whose directory name drifted from its declared frontmatter `name:`, that comparison failed and the user got the unhelpful generic path. On a standard install today 19 skills have this drift, e.g.: dir: mlops/stable-diffusion frontmatter: name: Stable Diffusion Image Generation registered slug (what the user types): /stable-diffusion-image-generation dir: mlops/qdrant frontmatter: name: Qdrant Vector Search registered slug: /qdrant-vector-search dir: mlops/flash-attention frontmatter: name: Optimizing Attention Flash registered slug: /optimizing-attention-flash In every case, _check_unavailable_skill would fall through because "stable-diffusion" != "stable-diffusion-image-generation", even with the skill sitting right there on disk. Fix: extract a small `_skill_slug_from_frontmatter` helper that reads the SKILL.md frontmatter and normalizes exactly like scan_skill_commands (lower, spaces/underscores → hyphens, strip non-[a-z0-9-], collapse runs of hyphens, strip edges). Use it in both the disabled-skills branch and the optional-skills branch. The disabled-set membership check now uses the declared frontmatter name (which is what `hermes skills config` writes into skills.disabled / platform_disabled), not the slug. Tests: five cases in tests/gateway/test_unavailable_skill_hint.py — the drift case for the disabled branch, unknown-command negative, matched-but-not-disabled negative, non-alnum stripping, and the drift case for the optional-skills branch. All five fail against main and pass with the fix.	2026-05-02 02:00:09 -07:00
Jacob Lizarraga	2470434d60	fix(telegram): probe polling liveness after reconnect to detect wedged Updater After a transient Telegram 502, _handle_polling_network_error's stop()+start_polling() cycle can leave PTB's Updater with `running=True` but a wedged consumer task that never makes progress. No error_callback fires in that state, so the reconnect ladder never advances past attempt 1, the MAX_NETWORK_RETRIES fatal-error path is never reached, and the gateway sits silent indefinitely. Schedule a heartbeat probe (60s after a successful reconnect) that verifies Updater.running is still True and bot.get_me() responds within a tight asyncio.wait_for timeout. Either failure feeds back into the reconnect ladder so the existing escalation path fires. No PTB-internal coupling, no Application rebuild — minimal additive defense inside the existing reconnect abstraction. Tests cover healthy / Updater non-running / probe timeout / probe network error / already-fatal cases, plus an integration check that the probe is actually scheduled after a successful start_polling(). Closes the silent-wedge case observed in the wild after a transient Telegram 502; existing reconnect tests updated to mock bot.get_me() now that the success path schedules a heartbeat probe.	2026-05-02 01:55:04 -07:00
Amr Essam	d05a87e686	fix(gateway): clear slack assistant thread status	2026-05-01 14:01:26 -07:00
hinotoi-agent	a147164d3c	fix(slack): preserve per-user slash-command session isolation	2026-05-01 14:01:26 -07:00
nightq	5cdc39e29a	fix(gateway): preserve case-sensitive chat IDs in DeliveryTarget.parse Fixes NousResearch/hermes-agent#11768 Root cause: target.strip().lower() was lowercasing the entire target string, corrupting case-sensitive chat IDs like Slack C123ABC and Matrix !RoomABC. Fix: Only lowercase the platform prefix for case-insensitive matching; preserve the original case for chat_id and thread_id values.	2026-05-01 14:01:26 -07:00
YAMAGUCHI Seiji	2b3923ff13	fix(gateway): coerce scalar free_response_channels to str before split YAML loads a bare numeric value such as discord: free_response_channels: 1491973769726791812 as an int. _discord_free_response_channels() / _slack_free_response_channels() checked `isinstance(raw, list)` and `isinstance(raw, str)` in that order and then fell through to `return set()`, so a single-channel config that happened to be unquoted was silently dropped with no log line — the bot kept demanding @mentions even though the channel was configured to free-response. A multi-channel value like `1234567890,9876543210` does not trip this because the comma forces YAML to parse it as a string. Single-channel configs are the only case that breaks, which is exactly the footgun that's hardest to diagnose (the config "looks right" and the feature just doesn't activate). Note that the old-schema env-var bridge at gateway/config.py:614+ already runs `str(frc)` when forwarding to SLACK_/DISCORD_FREE_RESPONSE_CHANNELS, so the env-var fallback worked. The bug only surfaces on the `config.extra["free_response_channels"]` path populated by the `platforms:` bridge at gateway/config.py:576, which passes the raw YAML value through unchanged. Fix at the reader: treat any non-list value as a scalar, coerce with str(), then apply the same CSV split semantics. This keeps the public contract stable (list or str-like continues to work identically) while accepting the ints that the YAML loader is free to hand us. Added tests for both Discord and Slack covering: - bare int value in config.extra - list of ints in config.extra	2026-05-01 14:01:26 -07:00
kshitijk4poor	8fcc160f6b	fix(gateway/slack): review fixes — scope ephemeral to commands, user isolation Self-review fixes for the slash ephemeral ack: - Only stash response_url when text starts with '/' (gateway command). Free-form questions via '/hermes <question>' must produce public agent replies visible to the whole channel, not ephemeral. - Use a ContextVar (_slash_user_id) to thread the invoking user's ID from _handle_slash_command through to send(). _pop_slash_context now matches the exact (channel_id, user_id) key when the ContextVar is set, preventing concurrent users on the same channel from stealing each other's ephemeral context. ContextVars propagate to child asyncio.Tasks, so the value survives through handle_message → _process_message_background → _send_with_retry → send(). - Add truncate_message() in _send_slash_ephemeral to prevent silent failures on long responses (response_url has the same ~40k limit). - Log send_private_notice failures at debug level instead of bare except/pass — aids diagnostics without spamming. - Document app_mention dedup dependency on shared event ts. - Add tests: free-form question must NOT stash context, concurrent users on the same channel get isolated contexts, non-slash send() path fallback behavior.	2026-05-01 13:33:06 -07:00
probepark	0ab2d752ff	feat(gateway): private notice delivery and Slack format_message fixes Adds platform-level private notice delivery abstraction so operational messages (e.g. sethome prompt) can be sent ephemerally on Slack when configured with `slack.notice_delivery: private`. Changes: - gateway/config.py: _normalize_notice_delivery() + GatewayConfig.get_notice_delivery() with per-platform config bridging - gateway/platforms/base.py: send_private_notice() default implementation (falls through to send()) - gateway/platforms/slack.py: send_private_notice() via chat_postEphemeral - gateway/run.py: _deliver_platform_notice() helper replaces direct adapter.send() for the sethome notice, with private→public fallback - gateway/platforms/slack.py: app_mention handler now forwards to _handle_slack_message (safe due to ts-based dedup) instead of no-op pass, fixing edge-case Slack configs where mentions arrive only as app_mention - gateway/platforms/slack.py format_message: negative lookbehind prevents markdown images (![]()) from becoming broken Slack links; italic regex now requires non-whitespace boundaries so 'a * b * c' stays literal Based on PR #9340 by @probepark.	2026-05-01 13:33:06 -07:00
kshitijk4poor	7cda0e5224	fix(gateway/slack): ephemeral ack and routing for slash commands Slack slash commands (/q, /btw, /stop, /model, etc.) previously showed no user-visible acknowledgement and posted command replies as public channel messages. This diverged from Discord, which uses ephemeral deferred responses for slash commands. Changes: - handle_hermes_command now passes response_type='ephemeral' and a 'Running /cmd…' text to ack(), giving the user immediate 'Only visible to you' feedback when they invoke any native slash command. - _handle_slash_command stashes the Slack response_url from the command payload in a per-channel context dict before dispatching to handle_message. - send() checks for a pending slash context and, when found, POSTs to the response_url with replace_original=true to swap the initial ack with the real command reply (e.g. 'Queued for the next turn.'), keeping it ephemeral. - Stale slash contexts are garbage-collected on lookup (120s TTL). - The response_url POST is non-fatal: if it fails, the user already saw the initial ack, and send() returns success=True. Fixes #18182	2026-05-01 13:33:06 -07:00
Teknium	f99676e315	fix(gateway): auto-restart when source files change out from under us (#17648 ) (#18409 ) Long-running gateway processes that survive 'hermes update' keep pre-update modules cached in sys.modules. When new tool files on disk then try to 'from hermes_cli.config import cfg_get' (added in PR #17304), the import resolves against the stale module object and raises ImportError — hitting users on Matrix, Telegram, Feishu, and other platforms. Two defenses: 1. Gateway self-check (gateway/run.py). On __init__, snapshot the newest mtime across sentinel source files (hermes_cli/config.py, run_agent.py, gateway/run.py, etc.). On every inbound message, re-read those mtimes; if any is newer than boot time + 2s slack, request a graceful restart via the normal drain path and return a one-line ack to the user. Idempotent, works regardless of how the update happened (hermes update, manual git pull, installer). 2. Post-restart survivor sweep ('hermes update'). After the existing restart loop, sleep 3s, rescan for gateway PIDs we already tried to kill, and SIGKILL any survivors. The detached profile watchers and systemd then relaunch with fresh code instead of waiting out the 120s watcher timeout. Closes #17648.	2026-05-01 09:50:08 -07:00
Mikey O'Brien	1be3b74cfb	fix(gateway): honor MATRIX_HOME_ROOM in onboarding	2026-04-30 23:13:34 -07:00
Teknium	4caad285a6	feat(gateway): auto-delete slash-command system notices after TTL (#18266 ) Adds opt-in auto-deletion for slash-command reply messages like "New session started!", "Restarting gateway…", "Stopped.", and YOLO toggles. After the TTL elapses the gateway calls the adapter's delete_message; on platforms without a delete API (everything except Telegram today) the TTL is silently ignored and the message stays. Requested on Twitter by @charlesmcdowell — tool-call bubbles are useful real-time, but system notices clutter the thread once the agent finishes. Implementation: - EphemeralReply(str) sentinel in gateway/platforms/base.py. Subclasses str so existing 'X' in response / response.startswith(...) checks in tests and call sites keep working unchanged; isinstance() still distinguishes it for the send path. - _process_message_background and both busy-session bypass paths (in base.py) call _unwrap_ephemeral() on the handler return, send the unwrapped text, and schedule a detached delete task when the TTL > 0 AND the adapter class overrides delete_message. - display.ephemeral_system_ttl (default 0 = disabled) in DEFAULT_CONFIG. Handler can pass ttl_seconds explicitly to override. - Wrapped the highest-noise return sites: /new, /reset, /stop, /yolo on/off, /restart success + "already in progress". Draining notices and /help output left as plain strings — those are informational and users want to read them. Backward-compat: default TTL 0 → no scheduling, no behavior change for existing users. Platforms without delete_message silently no-op.	2026-04-30 23:05:48 -07:00
Teknium	f0dc919f92	fix(compression): include system prompt + tool schemas in token estimates (#18265 ) The user-visible /compress banner and the post-compression last_prompt_tokens writeback both counted only the raw message transcript (chars/4). With a 15KB system prompt and 30 tool schemas (~26KB), a 4-message transcript that looks like ~45 tokens to the transcript-only estimator is really ~10.5K tokens of request pressure — a 234x gap. Two user-facing consequences: - Banner shows 'Compressing … (~45 tokens)…' while compression is actually firing on 10K+ tokens of real pressure, confusing users about why compression triggered (reported by @codecovenant on X; #6217). - Post-compression last_prompt_tokens writeback omits tool schemas, so the next should_compress() check compares real usage against a stale underestimate — compression triggers late, potentially past the model's context limit on small-context models (#14695). Swap estimate_messages_tokens_rough() for estimate_request_tokens_rough() at every user-visible banner and at the post-compression writeback. estimate_request_tokens_rough() already existed for exactly this purpose and includes system prompt + tool schemas. Touched call sites: - run_agent.py: post-compression last_prompt_tokens writeback, post-tool call should_compress() fallback when provider usage is missing - cli.py: /compress banner + summary - gateway/run.py: gateway /compress banner + summary - tui_gateway/server.py: TUI /compress status + summary - acp_adapter/server.py: ACP /compact before/after Left intentionally alone: - Session-hygiene fallback and the 'no agent' /status path in gateway/run.py — no agent instance is in scope to query for system prompt/tools, and the existing 30-50% overestimate wobble on hygiene is safety-accepted. - Verbose-mode 'Request size' logging — informational only, already counts system prompt via api_messages[0]. Also relabels the feedback line from 'Rough transcript estimate' to 'Approx request size' so the metric label matches what it actually measures. Credits: diagnoses from @devilardis (#14695) and @Jackten (#6217); user report @codecovenant on X (2026-04-30). Closes #14695 Closes #6217	2026-04-30 23:03:54 -07:00
Oxidane-bot	8d7500d80d	fix(gateway): snapshot callback generation after agent binds it, not before _process_message_background snapshotted callback_generation from the interrupt event at the TOP of the task — before the handler ran. _hermes_run_generation is only set on the event by GatewayRunner._bind_adapter_run_generation during _handle_message_with_agent, which runs DURING the handler await. The early snapshot always captured None, which then flowed into pop_post_delivery_callback(..., generation=None) in the finally block. In pop_post_delivery_callback, generation=None with a tuple-registered entry (generation, callback) bypasses the ownership check — it pops and fires the callback regardless of which run owns it. Result: a stale run could fire a fresher run's post-delivery callback (e.g. a background-review notification attributed to the wrong turn). Fix: move the snapshot into the finally block, after the handler has run and _hermes_run_generation has been bound to the current run. Regression test added: simulates a stale handler at generation=1 and a fresher callback registered at generation=2. Pre-fix: snapshot=None → pop fires the generation=2 callback under generation=1's ownership ("newer" fires). Post-fix: snapshot=1 → pop skips the mismatched entry, callback stays in the dict for the correct run to claim. Verified: test FAILS on current main (captures "newer" in fired list), PASSES with this fix. Salvaged from PR #12565 (the callback-ownership portion only; the /status totals portion was already fixed on main in `7abc9ce4d` via #17158). Co-authored-by: Oxidane-bot <1317078257maroon@gmail.com>	2026-04-30 20:41:18 -07:00
Teknium	27ec74c68a	fix: coerce show_reasoning and guard_agent_created config bools Widens #16528 to two sibling sites that had the same quoted-boolean bug: a YAML string "false" (or "0", "no", "off") silently evaluated truthy under bool() / if-check. - gateway/run.py _load_show_reasoning: is_truthy_value wrap - tools/skill_manager_tool.py _guard_agent_created_enabled: is_truthy_value wrap - regression tests for both	2026-04-30 20:40:46 -07:00
johnncenae	bb706c3f38	fix(gateway): coerce tool_progress_command as a real boolean	2026-04-30 20:40:46 -07:00
simbam99	7ba1a2b3df	fix(gateway): preserve assistant metadata when branching sessions	2026-04-30 20:40:28 -07:00
simbam99	ccfe6a47c3	fix(gateway): coerce StreamingConfig booleans and malformed numerics safely	2026-04-30 20:37:49 -07:00
hharry11	158eb32686	fix(gateway): preserve document type when merging queued events	2026-04-30 20:37:27 -07:00

1 2 3 4 5 ...

827 commits