Previously a message like `<@&1490963422786093149> help` would spawn a
thread literally named `<@&1490963422786093149> help`, exposing raw
Discord mention markers in the thread list. Only user mentions
(`<@id>`) were being stripped upstream — role mentions (`<@&id>`) and
channel mentions (`<#id>`) leaked through.
Fix: strip all three mention patterns in `_auto_create_thread` before
building the thread name. Collapse runs of whitespace left by the
removal. If the entire content was mention-only, fall back to 'Hermes'
instead of an empty title.
Fixes#6336.
Tests: two new regression guards in test_discord_slash_commands.py
covering mixed-mention content and mention-only content.
Free-response channels already bypassed the @mention gate so users could
chat inline with the bot, but auto-threading still fired on every
message — spinning off a thread per message and defeating the
lightweight-chat purpose.
Fix: fold `is_free_channel` into `skip_thread` so threading is skipped
whenever the channel is in DISCORD_FREE_RESPONSE_CHANNELS (via env or
discord.free_response_channels in config.yaml).
Net change: one line in _handle_message + one regression test.
Partially addresses #9399. Authored by @Hypn0sis (salvaged from PR #9650;
the bundled 'smart' auto-thread mode from that PR was dropped in favor
of deterministic true/false semantics).
* fix(gateway): bound _agent_cache with LRU cap + idle TTL eviction
The per-session AIAgent cache was unbounded. Each cached AIAgent holds
LLM clients, tool schemas, memory providers, and a conversation buffer.
In a long-lived gateway serving many chats/threads, cached agents
accumulated indefinitely — entries were only evicted on /new, /model,
or session reset.
Changes:
- Cache is now an OrderedDict so we can pop least-recently-used entries.
- _enforce_agent_cache_cap() pops entries beyond _AGENT_CACHE_MAX_SIZE=64
when a new agent is inserted. LRU order is refreshed via move_to_end()
on cache hits.
- _sweep_idle_cached_agents() evicts entries whose AIAgent has been idle
longer than _AGENT_CACHE_IDLE_TTL_SECS=3600s. Runs from the existing
_session_expiry_watcher so no new background task is created.
- The expiry watcher now also pops the cache entry after calling
_cleanup_agent_resources on a flushed session — previously the agent
was shut down but its reference stayed in the cache dict.
- Evicted agents have _cleanup_agent_resources() called on a daemon
thread so the cache lock isn't held during slow teardown.
Both tuning constants live at module scope so tests can monkeypatch
them without touching class state.
Tests: 7 new cases in test_agent_cache.py covering LRU eviction,
move_to_end refresh, cleanup thread dispatch, idle TTL sweep,
defensive handling of agents without _last_activity_ts, and plain-dict
test fixture tolerance.
* tweak: bump _AGENT_CACHE_MAX_SIZE 64 -> 128
* fix(gateway): never evict mid-turn agents; live spillover tests
The prior commit could tear down an active agent if its session_key
happened to be LRU when the cap was exceeded. AIAgent.close() kills
process_registry entries for the task, tears down the terminal
sandbox, closes the OpenAI client (sets self.client = None), and
cascades .close() into any active child subagents — all fatal if
the agent is still processing a turn.
Changes:
- _enforce_agent_cache_cap and _sweep_idle_cached_agents now look at
GatewayRunner._running_agents and skip any entry whose AIAgent
instance is present (identity via id(), so MagicMock doesn't
confuse lookup in tests). _AGENT_PENDING_SENTINEL is treated
as 'not active' since no real agent exists yet.
- Eviction only considers the LRU-excess window (first size-cap
entries). If an excess slot is held by a mid-turn agent, we skip
it WITHOUT compensating by evicting a newer entry. A freshly
inserted session (zero cache history) shouldn't be punished to
protect a long-lived one that happens to be busy.
- Cache may therefore stay transiently over cap when load spikes;
a WARNING is logged so operators can see it, and the next insert
re-runs the check after some turns have finished.
New tests (TestAgentCacheActiveSafety + TestAgentCacheSpilloverLive):
- Active LRU entry is skipped; no newer entry compensated
- Mixed active/idle excess window: only idle slots go
- All-active cache: no eviction, WARNING logged, all clients intact
- _AGENT_PENDING_SENTINEL doesn't block other evictions
- Idle-TTL sweep skips active agents
- End-to-end: active agent's .client survives eviction attempt
- Live fill-to-cap with real AIAgents, then spillover
- Live: CAP=4 all active + 1 newcomer — cache grows to 5, no teardown
- Live: 8 threads racing 160 inserts into CAP=16 — settles at 16
- Live: evicted session's next turn gets a fresh agent that works
30 tests pass (13 pre-existing + 17 new). Related gateway suites
(model switch, session reset, proxy, etc.) all green.
* fix(gateway): cache eviction preserves per-task state for session resume
The prior commits called AIAgent.close() on cache-evicted agents, which
tears down process_registry entries, terminal sandbox, and browser
daemon for that task_id — permanently. Fine for session-expiry (session
ended), wrong for cache eviction (session may resume).
Real-world scenario: a user leaves a Telegram session open for 2+ hours,
idle TTL evicts the cached AIAgent, user returns and sends a message.
Conversation history is preserved via SessionStore, but their terminal
sandbox (cwd, env vars, bg shells) and browser state were destroyed.
Fix: split the two cleanup modes.
close() Full teardown — session ended. Kills bg procs,
tears down terminal sandbox + browser daemon,
closes LLM client. Used by session-expiry,
/new, /reset (unchanged).
release_clients() Soft cleanup — session may resume. Closes
LLM client only. Leaves process_registry,
terminal sandbox, browser daemon intact
for the resuming agent to inherit via
shared task_id.
Gateway cache eviction (_enforce_agent_cache_cap, _sweep_idle_cached_agents)
now dispatches _release_evicted_agent_soft on the daemon thread instead
of _cleanup_agent_resources. All session-expiry call sites of
_cleanup_agent_resources are unchanged.
Tests (TestAgentCacheIdleResume, 5 new cases):
- release_clients does NOT call process_registry.kill_all
- release_clients does NOT call cleanup_vm / cleanup_browser
- release_clients DOES close the LLM client (agent.client is None after)
- close() vs release_clients() — semantic contract pinned
- Idle-evicted session's rebuild with same session_id gets same task_id
Updated test_cap_triggers_cleanup_thread to assert the soft path fires
and the hard path does NOT.
35 tests pass in test_agent_cache.py; 67 related tests green.
- gateway/platforms/weixin.py:
- Split aiohttp.ClientSession into _poll_session and _send_session
- Add _LIVE_ADAPTERS registry so send_weixin_direct() reuses the connected gateway adapter instead of creating a competing session
- Fixes silent message loss when gateway is running (iLink token contention)
- cron/scheduler.py:
- Support comma-separated deliver values (e.g. 'feishu,weixin') for multi-target delivery
- Delay pconfig/enabled check until standalone fallback so live adapters work even when platform is not in gateway config
- tools/send_message_tool.py:
- Synthesize PlatformConfig from WEIXIN_* env vars when gateway config lacks a weixin entry
- Fall back to WEIXIN_HOME_CHANNEL env var for home channel resolution
- tests/gateway/test_weixin.py:
- Update mocks to include _send_session
Follow-ups to the salvaged commits in this PR:
* gateway/config.py — strip trailing whitespace from youngDoo's diff
(line 315 had ~140 trailing spaces).
* hermes_cli/tools_config.py — replace `config.get("platform_toolsets", {})`
with `config.get("platform_toolsets") or {}`. Handles the case where the
YAML key is present but explicitly null (parses as None, previously
crashed with AttributeError on the next line's .get(platform)).
Cherry-picked from yyq4193's #9003 with attribution.
* tests/gateway/test_config.py — 4 new tests for TestGetConnectedPlatforms
covering DingTalk via extras, via env vars, disabled, and missing creds.
* tests/hermes_cli/test_tools_config.py — regression test for the null
platform_toolsets edge case.
* scripts/release.py — add kagura-agent, youngDoo, yyq4193 to AUTHOR_MAP.
Co-authored-by: yyq4193 <39405770+yyq4193@users.noreply.github.com>
Fixes#11463: DingTalk channel receives messages but fails to reply
with 'No session_webhook available'.
Two changes:
1. **Fire-and-forget message processing**: process() now dispatches
_on_message as a background task via asyncio.create_task instead of
awaiting it. This ensures the SDK ACK is returned immediately,
preventing heartbeat timeouts and disconnections when message
processing takes longer than the SDK's ACK deadline.
2. **session_webhook extraction fallback**: If ChatbotMessage.from_dict()
fails to map the sessionWebhook field (possible across SDK versions),
the handler now falls back to extracting it directly from the raw
callback data dict using both 'sessionWebhook' and 'session_webhook'
key variants.
Added 3 tests covering webhook extraction, fallback behavior, and
fire-and-forget ACK timing.
Two follow-ups to the cherry-picked PR #9873 (`e3bcc819`):
1. `_is_allowed_user` now uses `getattr(self, '_allowed_*_ids', set())`
so test fixtures that build the adapter via `object.__new__`
(skipping __init__) don't crash with AttributeError.
See AGENTS.md pitfall #17 — same pattern as gateway.run.
2. New 3-case regression coverage in test_discord_bot_auth_bypass.py:
- role-only config bypasses the gateway 'no allowlists' branch
- roles + users combined still authorizes user-allowlist matches
- the role bypass does NOT leak to other platforms (Telegram, etc.)
3. Autouse fixture in test_discord_bot_auth_bypass.py clears all Discord
auth env vars before each test so DISCORD_ALLOWED_ROLES leakage from
a previous test in the session can't flip later 'should-reject' tests
into false-pass.
Required because the bare cherry-pick of #9873 only added the adapter-
level role check — it didn't cover the gateway-level _is_user_authorized,
which still rejected role-only setups via the 'no allowlists configured'
branch.
Adds a new DISCORD_ALLOWED_ROLES environment variable that allows filtering
bot interactions by Discord role ID. Uses OR semantics with the existing
DISCORD_ALLOWED_USERS - if a user matches either allowlist, they're permitted.
Changes:
- Parse DISCORD_ALLOWED_ROLES comma-separated role IDs on connect
- Enable members intent when roles are configured (needed for role lookup)
- Update _is_allowed_user() to accept optional author param for direct role check
- Fallback to scanning mutual guilds when author object lacks roles (DMs, voice)
- Fully backwards compatible: no behavior change when env var is unset
Fixes#4466.
Root cause: two sequential authorization gates both independently rejected
bot messages, making DISCORD_ALLOW_BOTS completely ineffective.
Gate 1 — `discord.py` `on_message`:
_is_allowed_user ran BEFORE the bot filter, so bot senders were dropped
before the DISCORD_ALLOW_BOTS policy was ever evaluated.
Gate 2 — `gateway/run.py` _is_user_authorized:
The gateway-level allowlist check rejected bot IDs with 'Unauthorized
user: <bot_id>' even if they passed Gate 1.
Fix:
gateway/platforms/discord.py — reorder on_message so DISCORD_ALLOW_BOTS
runs BEFORE _is_allowed_user. Bots permitted by the filter skip the
user allowlist; non-bots are still checked.
gateway/session.py — add is_bot: bool = False to SessionSource so the
gateway layer can distinguish bot senders.
gateway/platforms/base.py — expose is_bot parameter in build_source.
gateway/platforms/discord.py _handle_message — set is_bot=True when
building the SessionSource for bot authors.
gateway/run.py _is_user_authorized — when source.is_bot is True AND
DISCORD_ALLOW_BOTS is 'mentions' or 'all', return True early. Platform
filter already validated the message at on_message; don't re-reject.
Behavior matrix:
| Config | Before | After |
| DISCORD_ALLOW_BOTS=none (default) | Blocked | Blocked |
| DISCORD_ALLOW_BOTS=all | Blocked | Allowed |
| DISCORD_ALLOW_BOTS=mentions + @mention | Blocked | Allowed |
| DISCORD_ALLOW_BOTS=mentions, no mention | Blocked | Blocked |
| Human in DISCORD_ALLOWED_USERS | Allowed | Allowed |
| Human NOT in DISCORD_ALLOWED_USERS | Blocked | Blocked |
Co-authored-by: Hermes Maintainer <hermes@nousresearch.com>
Closes#11321, closes#10259.
## Problem
The nested /skill command group (category subcommand groups + skill
subcommands) serialized to ~14KB with the default 75-skill catalog,
exceeding Discord's ~8000-byte per-command registration payload. The
entire tree.sync() rejected with error 50035 — ALL slash commands
including the 27 base commands failed to register.
## Fix
Replace the nested Group layout with a single flat Command:
/skill name:<autocomplete> args:<optional string>
Autocomplete options are fetched dynamically by Discord when the user
types — they do NOT count against the per-command registration budget.
So this single command registers at ~200 bytes regardless of how many
skills exist. Scales to thousands of skills with no size calculations,
no splitting, no hidden skills.
UX improvements:
- Discord live-filters by user's typed prefix against BOTH name and
description, so '/skill pdf' finds 'ocr-and-documents' via its
description. More discoverable than clicking through category menus.
- Unknown skill name → ephemeral error pointing user at autocomplete.
- Stable alphabetical ordering across restarts.
## Why not the other proposed approaches
Three prior PRs tried to fit within the 8KB limit by modifying the
nested layout:
- #10214 (njiangk): truncated all descriptions to 'Run <name>' and
category descriptions to 'Skills'. Works but destroys slash picker UX.
- #11385 (LeonSGP43): 40-char description clamp + iterative
trim-largest-category fallback. Works but HIDES skills the user can
no longer invoke via slash — functional regression.
- #10261 (zeapsu): adaptive split into /skill-<cat> top-level groups.
Preserves all skills but pollutes the slash namespace with 20
top-level commands.
All three work around the symptom. The flat autocomplete design
dissolves the problem — there is no payload-size pressure to manage.
## Tests
tests/gateway/test_discord_slash_commands.py — 5 new test cases replace
the 3 old nested-structure tests:
- flat-not-nested structure assertion
- empty skills → no command registered
- callback dispatches the right cmd_key by name
- unknown name → ephemeral error, no dispatch
- large-catalog regression guard (500 skills) — command payload stays
under 500 bytes regardless
E2E validated against real discord.py 2.7.1:
- Command registers as discord.app_commands.Command (not Group).
- Autocomplete filters by name AND description (verified across several
queries including description-only matches like 'pdf' → OCR skill).
- 500-skill catalog returns max 25 results per autocomplete query
(Discord's hard cap), filtered correctly.
- Choice labels formatted as 'name — description' clamped to 100 chars.
- feat: support one-click QR scan to create DingTalk bot and establish connection
- fix(gateway): wrap blocking DingTalkStreamClient.start() with asyncio.to_thread()
- fix(gateway): extract message fields from CallbackMessage payload instead of ChatbotMessage
- fix(gateway): add oapi.dingtalk.com to allowed webhook URL domains
- stop rewriting markdown tables, headings, and links before delivery
- keep markdown table blocks and headings together during chunking
- update Weixin tests and docs for native markdown rendering
Closes#10308
The Weixin adapter's send() method previously split and delivered the
raw response text without first extracting MEDIA: tags or bare local
file paths. This meant images, documents, and voice files referenced
by the agent were silently dropped in normal (non-streaming,
non-background) conversations.
Changes:
- In WeixinAdapter.send(), call extract_media() and
extract_local_files() before formatting/splitting text.
- Deliver extracted files via send_image_file(), send_document(),
send_voice(), or send_video() prior to sending text chunks.
- Also fix two minor typing issues in gateway/run.py where
extract_media() tuples were not unpacked correctly in background
and /btw task handlers.
Fixes missing media delivery on Weixin personal accounts.
Three open issues — #8242, #6587, #11345 — all trace to the same root
cause: the image / audio / document download paths in
`DiscordAdapter._handle_message` used plain, unauthenticated HTTP to
fetch `att.url`. That broke in three independent ways:
#8242 cdn.discordapp.com attachment URLs increasingly require the
bot session to download; unauthenticated httpx sees 403
Forbidden, image/voice analysis fail silently.
#6587 Some user environments (VPNs, corporate DNS, tunnels) resolve
cdn.discordapp.com to private-looking IPs. Our is_safe_url()
guard correctly blocks them as SSRF risks, but the user
environment is legitimate — image analysis and voice STT die.
#11345 The document download path skipped is_safe_url() entirely —
raw aiohttp.ClientSession.get(att.url) with no SSRF check,
inconsistent with the image/audio branches.
Unified fix: use `discord.Attachment.read()` as the primary download
path on all three branches. `att.read()` routes through discord.py's
own authenticated HTTPClient, so:
- Discord CDN auth is handled (#8242 resolved).
- Our is_safe_url() gate isn't consulted for the attachment path at
all — the bot session handles networking internally (#6587 resolved).
- All three branches now share the same code path, eliminating the
document-path SSRF gap (#11345 resolved).
Falls back to the existing cache_*_from_url helpers (image/audio) or an
SSRF-gated aiohttp fetch (documents) when `att.read()` is unavailable
or fails — preserves defense-in-depth for any future payload-schema
drift that could slip a non-CDN URL into att.url.
New helpers on DiscordAdapter:
- _read_attachment_bytes(att) — safe att.read() wrapper
- _cache_discord_image(att, ext) — primary + URL fallback
- _cache_discord_audio(att, ext) — primary + URL fallback
- _cache_discord_document(att, ext) — primary + SSRF-gated aiohttp fallback
Tests:
- tests/gateway/test_discord_attachment_download.py — 12 new cases
covering all three helpers: primary path, fallback on missing
.read(), fallback on validator rejection, SSRF guard on document
fallback, aiohttp fallback happy-path, and an E2E case via
_handle_message confirming cache_image_from_url is never invoked
when att.read() succeeds.
- All 11 existing document-handling tests continue to pass via the
aiohttp fallback path (their SimpleNamespace attachments have no
.read(), which triggers the fallback — now SSRF-gated).
Closes#8242, closes#6587, closes#11345.
When a WebSocket-based platform adapter (e.g. QQ Bot) temporarily
loses its connection, send() now polls is_connected for up to 15s
instead of immediately returning a non-retryable failure. If the
auto-reconnect completes within the window, the message is delivered
normally. On timeout, the SendResult is marked retryable=True so the
base class retry mechanism can attempt re-delivery.
Same treatment applied to _send_media().
Adds 4 async tests covering:
- Successful send after simulated reconnection
- Retryable failure on timeout
- Immediate success when already connected
- _send_media reconnection wait
Fixes#11163
DingTalk was the only messaging platform without group-mention gating or a
per-user allowlist. Slack, Telegram, Discord, WhatsApp, Matrix, and Mattermost
all support these via config.yaml + matching env vars; this change closes the
gap for DingTalk using the same surface:
Config:
platforms.dingtalk.require_mention: bool (env: DINGTALK_REQUIRE_MENTION)
platforms.dingtalk.mention_patterns: list (env: DINGTALK_MENTION_PATTERNS)
platforms.dingtalk.free_response_chats: list (env: DINGTALK_FREE_RESPONSE_CHATS)
platforms.dingtalk.allowed_users: list (env: DINGTALK_ALLOWED_USERS)
Semantics mirror Telegram's implementation:
- DMs are always accepted (subject to allowed_users).
- Group messages are accepted only when the chat is allowlisted, mention is
not required, the bot was @mentioned (dingtalk_stream sets is_in_at_list),
or the text matches a configured regex wake-word.
- allowed_users matches sender_id / sender_staff_id case-insensitively;
a single "*" disables the check.
Rationale: without this, any DingTalk user in a group chat can trigger the
bot, which makes DingTalk less safe to deploy than the other platforms. A
user's config.yaml already accepts require_mention for dingtalk but the value
was silently ignored.
The send_image_file method in WeixinAdapter used 'path' as parameter
name, but BasePlatformAdapter and gateway callers use 'image_path'.
This mismatch caused image sending to fail when called through the
gateway's extract_media path.
Changed parameter name from 'path' to 'image_path' to match the
interface defined in base.py and the calls in gateway/run.py.
discord.py does not apply a default AllowedMentions to the client, so any
reply whose content contains @everyone/@here or a role mention would ping
the whole server — including verbatim echoes of user input or LLM output
that happens to contain those tokens.
Set a safe default on commands.Bot: everyone=False, roles=False,
users=True, replied_user=True. Operators can opt back in via four
DISCORD_ALLOW_MENTION_* env vars or discord.allow_mentions.* in
config.yaml. No behavior change for normal user/reply pings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cherry-picked SDK compat fix (previous commit) wired process() to
parse CallbackMessage.data into a ChatbotMessage, but _extract_text()
was still written against the pre-0.20 payload shape:
* message.text changed from dict {content: ...} → TextContent object.
The old code's str(text) fallback produced 'TextContent(content=...)'
as the agent's input, so every received message came in mangled.
* rich_text moved from message.rich_text (list) to
message.rich_text_content.rich_text_list.
This preserves legacy fallbacks (dict-shaped text, bare rich_text list)
while handling the current SDK layout via hasattr(text, 'content').
Adds regression tests covering:
* webhook domain allowlist (api.*, oapi.*, and hostile lookalikes)
* _IncomingHandler.process is a coroutine function
* _extract_text against TextContent object, dict, rich_text_content,
legacy rich_text, and empty-message cases
Also adds kevinskysunny to scripts/release.py AUTHOR_MAP (release CI
blocks unmapped emails).
Inbound Feishu messages arriving during brief windows when the adapter
loop is unavailable (startup/restart transitions, network-flap reconnect)
were silently dropped with a WARNING log. This matches the symptom in
issue #5499 — and users have reported seeing only a subset of their
messages reach the agent.
Fix: queue pending events in a thread-safe list and spawn a single
drainer thread that replays them once the loop becomes ready. Covers
these scenarios:
* Queue events instead of dropping when loop is None/closed
* Single drainer handles the full queue (not thread-per-event)
* Thread-safe with threading.Lock on the queue and schedule flag
* Handles mid-drain bursts (new events arrive while drainer is working)
* Handles RuntimeError if loop closes between check and submit
* Depth cap (1000) prevents unbounded growth during extended outages
* Drops queue cleanly on disconnect rather than holding forever
* Safety timeout (120s) prevents infinite retention on broken adapters
Based on the approach proposed in #4789 by milkoor, rewritten for
thread-safety and correctness.
Test plan:
* 5 new unit tests (TestPendingInboundQueue) — all passing
* E2E test with real asyncio loop + fake WS thread: 10-event burst
before loop ready → all 10 delivered in order
* E2E concurrent burst test: 20 events queued, 20 more arrive during
drainer dispatch → all 40 delivered, no loss, no duplicates
* All 111 existing feishu tests pass
Related: #5499, #4789
Co-authored-by: milkoor <milkoor@users.noreply.github.com>
All 61 TUI-related tests green across 3 consecutive xdist runs.
tests/tui_gateway/test_protocol.py:
- rename `get_messages` → `get_messages_as_conversation` on mock DB (method
was renamed in the real backend, test was still stubbing the old name)
- update tool-message shape expectation: `{role, name, context}` matches
current `_history_to_messages` output, not the legacy `{role, text}`
tests/hermes_cli/test_tui_resume_flow.py:
- `cmd_chat` grew a first-run provider-gate that bailed to "Run: hermes
setup" before `_launch_tui` was ever reached; 3 tests stubbed
`_resolve_last_session` + `_launch_tui` but not the gate
- factored a `main_mod` fixture that stubs `_has_any_provider_configured`,
reused by all three tests
tests/test_tui_gateway_server.py:
- `test_config_set_personality_resets_history_and_returns_info` was flaky
under xdist because the real `_write_config_key` touches
`~/.hermes/config.yaml`, racing with any other worker that writes
config. Stub it in the test.
The Discord voice receive path skipped RFC 3550 §5.1 padding handling,
passing padding-contaminated payloads into DAVE E2EE decrypt and Opus
decode. Symptoms in live VC sessions: deaf inbound speech, intermittent
empty STT results, "corrupted stream" decode errors — especially on the
first reply after join.
When the P bit is set in the RTP header, the last payload byte holds the
count of trailing padding bytes (including itself) that must be removed.
Receive pipeline now follows the spec order:
1. RTP header parse
2. NaCl transport decrypt (aead_xchacha20_poly1305_rtpsize)
3. strip encrypted RTP extension data from start
4. strip RTP padding from end if P bit set ← was missing
5. DAVE inner media decrypt
6. Opus decode
Drops malformed packets where pad_len is 0 or exceeds payload length.
Adds 7 integration tests covering valid padded packets, the X+P combined
case, padding under DAVE passthrough, and three malformed-padding paths.
Closes#11267
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* - make buffered streaming
- fix path naming to expand `~` for agent.
- fix stripping of matrix ID to not remove other mentions / localports.
* fix(matrix): register MembershipEventDispatcher for invite auto-join
The mautrix migration (#7518) broke auto-join because InternalEventType.INVITE
events are only dispatched when MembershipEventDispatcher is registered on the
client. Without it, _on_invite is dead code and the bot silently ignores all
room invites.
Closes#10094Closes#10725
Refs: PR #10135 (digging-airfare-4u), PR #10732 (fxfitz)
* fix(matrix): preserve _joined_rooms reference for CryptoStateStore
connect() reassigned self._joined_rooms = set(...) after initial sync,
orphaning the reference captured by _CryptoStateStore at init time.
find_shared_rooms() returned [] forever, breaking Megolm session rotation
on membership changes.
Mutate in place with clear() + update() so the CryptoStateStore reference
stays valid.
Refs #8174, PR #8215
* fix(matrix): remove dual ROOM_ENCRYPTED handler to fix dedup race
mautrix auto-registers DecryptionDispatcher when client.crypto is set.
The adapter also registered _on_encrypted_event for the same event type.
_on_encrypted_event had zero awaits and won the race to mark event IDs
in the dedup set, causing _on_room_message to drop successfully decrypted
events from DecryptionDispatcher. The retry loop masked this by re-decrypting
every message ~4 seconds later.
Remove _on_encrypted_event entirely. DecryptionDispatcher handles decryption;
genuinely undecryptable events are logged by mautrix and retried on next
key exchange.
Refs #8174, PR #8215
* fix(matrix): re-verify device keys after share_keys() upload
Matrix homeservers treat ed25519 identity keys as immutable per device.
share_keys() can return 200 but silently ignore new keys if the device
already exists with different identity keys. The bot would proceed with
shared=True while peers encrypt to the old (unreachable) keys.
Now re-queries the server after share_keys() and fails closed if keys
don't match, with an actionable error message.
Refs #8174, PR #8215
* fix(matrix): encrypt outbound attachments in E2EE rooms
_upload_and_send() uploaded raw bytes and used the 'url' key for all
rooms. In E2EE rooms, media must be encrypted client-side with
encrypt_attachment(), the ciphertext uploaded, and the 'file' key
(with key/iv/hashes) used instead of 'url'.
Now detects encrypted rooms via state_store.is_encrypted() and
branches to the encrypted upload path.
Refs: PR #9822 (charles-brooks)
* fix(matrix): add stop_typing to clear typing indicator after response
The adapter set a 30-second typing timeout but never cleared it.
The base class stop_typing() is a no-op, so the typing indicator
lingered for up to 30 seconds after each response.
Closes#6016
Refs: PR #6020 (r266-tech)
* fix(matrix): cache all media types locally, not just photos/voice
should_cache_locally only covered PHOTO, VOICE, and encrypted media.
Unencrypted audio/video/documents in plaintext rooms were passed as MXC
URLs that require authentication the agent doesn't have, resulting
in 401 errors.
Refs #3487, #3806
* fix(matrix): detect stale OTK conflict on startup and fail closed
When crypto state is wiped but the same device ID is reused, the
homeserver may still hold one-time keys signed with the previous
identity key. Identity key re-upload succeeds but OTK uploads fail
with "already exists" and a signature mismatch. Peers cannot
establish new Olm sessions, so all new messages are undecryptable.
Now proactively flushes OTKs via share_keys() during connect() and
catches the "already exists" error with an actionable log message
telling the operator to purge the device from the homeserver or
generate a fresh device ID.
Also documents the crypto store recovery procedure in the Matrix
setup guide.
Refs #8174
* docs(matrix): improve crypto recovery docs per review
- Put easy path (fresh access token) first, manual purge second
- URL-encode user ID in Synapse admin API example
- Note that device deletion may invalidate the access token
- Add "stop Synapse first" caveat for direct SQLite approach
- Mention the fail-closed startup detection behavior
- Add back-reference from upgrade section to OTK warning
* refactor(matrix): cleanup from code review
- Extract _extract_server_ed25519() and _reverify_keys_after_upload()
to deduplicate the re-verification block (was copy-pasted in two
places, three copies of ed25519 key extraction total)
- Remove dead code: _pending_megolm, _retry_pending_decryptions,
_MAX_PENDING_EVENTS, _PENDING_EVENT_TTL — all orphaned after
removing _on_encrypted_event
- Remove tautological TestMediaCacheGate (tested its own predicate,
not production code)
- Remove dead TestMatrixMegolmEventHandling and
TestMatrixRetryPendingDecryptions (tested removed methods)
- Merge duplicate TestMatrixStopTyping into TestMatrixTypingIndicator
- Trim comment to just the "why"
Users (Teknium) report missing debug reports before the 1-hour auto-delete
fires. 6 hours gives enough window for async bug-report triage without
leaving sensitive log data on public paste services indefinitely.
Applies to both the CLI (hermes debug share) and gateway (/debug) paths.
Initialize next_channel_prompt before the pending_event check and use
getattr with None default, matching the existing pattern for
next_source/next_message/next_message_id. Prevents AttributeError
when pending_event is None (interrupt path).
Cherry-picked from #10953 by @jackjin1997.
Switch from fragile Markdown V1 to HTML parse mode with html.escape()
for exec approval messages. Add fallback to text-based approval when
the formatted send fails.
Cherry-picked from #10999 by @danieldoderlein.
config.yaml terminal.cwd is now the single source of truth for working
directory. MESSAGING_CWD and TERMINAL_CWD in .env are deprecated with a
migration warning.
Changes:
1. config.py: Remove MESSAGING_CWD from OPTIONAL_ENV_VARS (setup wizard
no longer prompts for it). Add warn_deprecated_cwd_env_vars() that
prints a migration hint when deprecated env vars are detected.
2. gateway/run.py: Replace all MESSAGING_CWD reads with TERMINAL_CWD
(which is bridged from config.yaml terminal.cwd). MESSAGING_CWD is
still accepted as a backward-compat fallback with deprecation warning.
Config bridge skips cwd placeholder values so they don't clobber
the resolved TERMINAL_CWD.
3. cli.py: Guard against lazy-import clobbering — when cli.py is
imported lazily during gateway runtime (via delegate_tool), don't
let load_cli_config() overwrite an already-resolved TERMINAL_CWD
with os.getcwd() of the service's working directory. (#10817)
4. hermes_cli/main.py: Add 'hermes memory reset' command with
--target all/memory/user and --yes flags. Profile-scoped via
HERMES_HOME.
Migration path for users with .env settings:
Remove MESSAGING_CWD / TERMINAL_CWD from .env
Add to config.yaml:
terminal:
cwd: /your/project/path
Addresses: #10225, #4672, #10817, #7663
When execute_code times out, the result JSON had status="timeout" and an
error field, but the output field was empty. Many models treat empty
output as "nothing happened" and produce an empty/minimal response. The
gateway stream consumer then considers the response "already sent" (from
pre-tool streaming) and silently drops it — leaving the user staring at
silence.
Three changes:
1. Include the timeout message in the output field (both local and remote
paths) so the model always has visible content to relay to the user.
2. Add periodic activity callbacks to the local execution polling loop so
the gateway's inactivity monitor knows execute_code is alive during
long runs.
3. Fix stream_consumer._send_fallback_final to not silently drop content
when the continuation appears empty but the final text differs from
what was previously streamed (e.g. after a tool boundary reset).
When the LLM returns an empty completion, gateway/run.py replaced
final_response with the literal string '(No response generated)'.
This defeated cron/scheduler.py's empty-response skip guard, causing
the placeholder to be delivered to home channels.
Changes:
- gateway/run.py: return empty string instead of placeholder when
there is no error and no response content
- cron/scheduler.py: defensively strip the placeholder text in case
any upstream path still produces it
FixesNousResearch/hermes-agent#9270
The cancellation handler previously promoted any partial send
(already_sent=True) to final_response_sent=True unconditionally.
This meant if intermediate text (e.g. 'Let me search…') was streamed
and the consumer was cancelled before delivering the actual answer,
the gateway's suppression check would still prevent the fallback send.
Now final_response_sent is only set in the cancellation path when:
- The best-effort send of accumulated content actually succeeded, OR
- It was already confirmed before cancellation
Companion fix for PR #11000's run.py changes — closes the
cancellation-path loophole that would otherwise let partial streams
suppress final delivery during queued follow-ups.