Matrix gateway: fix sync loop never dispatching events (#5819)
- _sync_loop() called client.sync() but never called handle_sync()
to dispatch events to registered callbacks — _on_room_message was
registered but never fired for new messages
- Store next_batch token from initial sync and pass as since= to
subsequent incremental syncs (was doing full initial sync every time)
- 17 comments, confirmed by multiple users on matrix.org
Feishu docs: add interactive card configuration for approvals (#6893)
- Error 200340 is a Feishu Developer Console configuration issue,
not a code bug — users need to enable Interactive Card capability
and configure Card Request URL
- Added required 3-step setup instructions to feishu.md
- Added troubleshooting entry for error 200340
- 17 comments from Feishu users
Copilot provider drift: detect GPT-5.x Responses API requirement (#3388)
- GPT-5.x models are rejected on /v1/chat/completions by both OpenAI
and OpenRouter (unsupported_api_for_model error)
- Added _model_requires_responses_api() to detect models needing
Responses API regardless of provider
- Applied in __init__ (covers OpenRouter primary users) and in
_try_activate_fallback() (covers Copilot->OpenRouter drift)
- Fixed stale comment claiming gateway creates fresh agents per message
(it caches them via _agent_cache since the caching was added)
- 7 comments, reported on Copilot+Telegram gateway
* fix(matrix): pass required args to MemoryCryptoStore for mautrix ≥0.21
MemoryCryptoStore.__init__() now requires account_id and pickle_key
positional arguments as of mautrix 0.21. The migration from matrix-nio
(commit 1850747) didn't account for this, causing E2EE initialization
to fail with:
MemoryCryptoStore.__init__() missing 2 required positional arguments:
'account_id' and 'pickle_key'
Pass self._user_id as account_id and derive pickle_key from the same
user_id:device_id pair already used for the on-disk HMAC signature.
Update the test stub to accept the new parameters.
Fixes#7803
* fix: use consistent fallback for pickle_key derivation
Address review: _pickle_key now uses _acct_id (which has the 'hermes'
fallback) instead of raw self._user_id, so both values stay consistent
when user_id is empty.
---------
Co-authored-by: Hermes Agent <hermes@nousresearch.com>
Telegram flood control during streaming caused messages to be cut off
mid-response. The old behavior permanently disabled edits after a single
flood-control failure, losing the remainder of the response.
Changes:
- Adaptive backoff: on flood-control edit failures, double the edit interval
instead of immediately disabling edits. Only permanently disable after 3
consecutive failures (_MAX_FLOOD_STRIKES).
- Cursor strip: when entering fallback mode, best-effort edit to remove the
cursor (▉) from the last visible message so it doesn't appear stuck.
- Fallback send retry: _send_fallback_final retries each chunk once on
flood-control failures (3s delay) before giving up.
- Default edit_interval increased from 0.3s to 1.0s. Telegram rate-limits
edits at ~1/s per message; 0.3s was virtually guaranteed to trigger flood
control on any non-trivial response.
- _send_or_edit returns bool so the overflow split loop knows not to
truncate accumulated text when an edit fails (prevents content loss).
Fixes: messages cutting/stopping mid-response on Telegram, especially
with streaming enabled.
* feat: add watch_patterns to background processes for output monitoring
Adds a new 'watch_patterns' parameter to terminal(background=true) that
lets the agent specify strings to watch for in process output. When a
matching line appears, a notification is queued and injected as a
synthetic message — triggering a new agent turn, similar to
notify_on_complete but mid-process.
Implementation:
- ProcessSession gets watch_patterns field + rate-limit state
- _check_watch_patterns() in ProcessRegistry scans new output chunks
from all three reader threads (local, PTY, env-poller)
- Rate limited: max 8 notifications per 10s window
- Sustained overload (45s) permanently disables watching for that process
- watch_queue alongside completion_queue, same consumption pattern
- CLI drains watch_queue in both idle loop and post-turn drain
- Gateway drains after agent runs via _inject_watch_notification()
- Checkpoint persistence + crash recovery includes watch_patterns
- Blocked in execute_code sandbox (like other bg params)
- 20 new tests covering matching, rate limiting, overload kill,
checkpoint persistence, schema, and handler passthrough
Usage:
terminal(
command='npm run dev',
background=true,
watch_patterns=['ERROR', 'WARN', 'listening on port']
)
* refactor: merge watch_queue into completion_queue
Unified queue with 'type' field distinguishing 'completion',
'watch_match', and 'watch_disabled' events. Extracted
_format_process_notification() in CLI and gateway to handle
all event types in a single drain loop. Removes duplication
across both CLI drain sites and the gateway.
When the stream consumer has sent at least one message (already_sent=True),
the gateway skips sending the final response to avoid duplicates. But this
also suppressed error messages when the agent failed mid-loop — rate limit
exhaustion, context overflow, compression failure, etc.
The user would see the last streamed content and then nothing: no error
message, no explanation. The agent appeared to 'stop responding.'
Fix: check the 'failed' flag at both the producer (_run_agent marks
already_sent) and consumer (_handle_message_with_agent checks it) sites.
Error messages are always delivered regardless of streaming state.
- Add agent.close() call to _finalize_shutdown_agents() to prevent
zombie processes (terminal sandboxes, browser daemons, httpx clients)
- Global cleanup (process_registry, environments, browsers) preserved
in _stop_impl() during conflict resolution
- Move /restart CommandDef from 'Info' to 'Session' category to match
/stop and /status
* fix: circuit breaker stops CPU-burning restart loops on persistent errors
When a gateway session hits a non-retryable error (e.g. invalid model
ID → HTTP 400), the agent fails and returns. But if the session keeps
receiving messages (or something periodically recreates agents), each
attempt spawns a new AIAgent — reinitializing MCP server connections,
burning CPU — only to hit the same 400 error again. On a 4-core server,
this pegs an entire core per stuck session and accumulates 300+ minutes
of CPU time over hours.
Fix: add a per-session consecutive failure counter in the gateway runner.
- Track consecutive non-retryable failures per session key
- After 3 consecutive failures (_MAX_CONSECUTIVE_FAILURES), block
further agent creation for that session and notify the user:
'⚠️ This session has failed N times in a row with a non-retryable
error. Use /reset to start a new session.'
- Evict the cached agent when the circuit breaker engages to prevent
stale state from accumulating
- Reset the counter on successful agent runs
- Clear the counter on /reset and /new so users can recover
- Uses getattr() pattern so bare GatewayRunner instances (common in
tests using object.__new__) don't crash
Tests:
- 8 new tests in test_circuit_breaker.py covering counter behavior,
threshold, reset, session isolation, and bare-runner safety
Addresses #7130.
* Revert "fix: circuit breaker stops CPU-burning restart loops on persistent errors"
This reverts commit d848ea7109.
* fix: don't evict cached agent on failed runs — prevents MCP restart loop
When a run fails (e.g. invalid model ID → 400) and fallback activated,
the gateway was evicting the cached agent to 'retry primary next time.'
But evicting a failed agent forces a full AIAgent recreation on the next
message — reinitializing MCP server connections, spawning stdio
processes — only to hit the same 400 again. This created a CPU-burning
loop (91%+ for hours, #7130).
The fix: add `and not _run_failed` to the fallback-eviction check.
Failed runs keep the cached agent. The next message reuses it (no MCP
reinit), hits the same error, returns it to the user quickly. The user
can /reset or /model to fix their config.
Successful fallback runs still evict as before so the next message
retries the primary model.
Addresses #7130.
Follow-up fixes for the matrix-nio → mautrix migration:
1. Module-level mautrix.types import now wrapped in try/except with
proper stub classes. Without this, importing gateway.platforms.matrix
crashes the entire gateway when mautrix isn't installed — even for
users who don't use Matrix. The stubs mirror mautrix's real attribute
names so tests that exercise adapter methods (send, reactions, etc.)
work without the real SDK.
2. Removed _ensure_mautrix_mock() from test_matrix_mention.py — it
permanently installed MagicMock modules in sys.modules via setdefault(),
polluting later tests in the suite. No longer needed since the module
imports cleanly without mautrix.
3. Fixed thread persistence tests to use direct class reference in
monkeypatch.setattr() instead of string-based paths, which broke
when the module was reimported by other tests.
4. Moved the module-importability test to a subprocess to prevent it
from polluting sys.modules (reimporting creates a second module object
with different __dict__, breaking patch.object in subsequent tests).
The old nio code only handled RoomMessageText (m.text). The mautrix
rewrite dispatched both m.text and m.notice, which would cause infinite
loops between bots since m.notice is the conventional msgtype for bot
responses in the Matrix ecosystem.
- Add api.session.close() on E2EE dep check and E2EE setup failure
paths (two missing cleanup points from the mautrix migration)
- Replace raw pickle.load/dump with HMAC-SHA256 signed payloads to
prevent arbitrary code execution from a tampered store file
- Extract _resolve_message_context() to deduplicate ~40 lines of
mention/thread/DM gating logic between text and media handlers
- Move mautrix.types imports to module level (16 scattered local
imports consolidated)
- Parse mention/thread env vars once in __init__ instead of per-message
- Cache _is_bot_mentioned() result instead of calling 3x per event
- Consolidate send_emote/send_notice into shared _send_simple_message()
- Use _is_dm_room() in get_chat_info() instead of inline duplication
- Add _CRYPTO_PICKLE_PATH constant (was duplicated in 2 locations)
- Fix fragile event_ts extraction (double getattr, None safety)
- Clean up leaked aiohttp session on auth failure paths
- Remove redundant trailing _track_thread() calls
Address two bugs found by code review:
1. MemoryCryptoStore loses all E2EE keys on restart — now pickle the
store to disk on disconnect and restore on connect, preserving
Megolm sessions across restarts.
2. Encrypted events buffered for retry were silently dropped after
decryption because _on_encrypted_event registered the event ID
in the dedup set, then _on_room_message rejected it as a
duplicate. Now clear the dedup entry before routing decrypted
events.
Translate all nio SDK calls to mautrix equivalents while preserving the
adapter structure, business logic, and all features (E2EE, reactions,
threading, mention gating, text batching, media caching, voice MSC3245).
Key changes:
- nio.AsyncClient -> mautrix.client.Client + HTTPAPI + MemoryStateStore
- Manual E2EE key management -> OlmMachine with auto key lifecycle
- isinstance(resp, nio.XxxResponse) -> mautrix returns values directly
- add_event_callback per type -> single ROOM_MESSAGE handler with
msgtype dispatch
- Room state (member_count, display_name) via async state store lookups
- Upload/download return ContentURI/bytes directly (no wrapper objects)
Tool progress markers (e.g. `⏰ list`) were injected directly into
SSE delta.content chunks. OpenAI-compatible frontends (Open WebUI,
LobeChat, etc.) store delta.content verbatim as the assistant message
and send it back on subsequent requests. After enough turns, the model
learns to emit these markers as plain text instead of issuing real tool
calls — silently hallucinating tool results without ever running them.
Fix: Send tool progress as a custom `event: hermes.tool.progress` SSE
event instead of mixing it into delta.content. Per the SSE spec, clients
that don't understand a custom event type silently ignore it, so this is
backward-compatible. Frontends that want to render progress indicators
can listen for the custom event without persisting it to conversation
history.
The /v1/runs endpoint already uses structured events — this aligns the
/v1/chat/completions streaming path with the same principle.
Closes#6972
Six platforms (matrix, mattermost, dingtalk, feishu, wecom, homeassistant)
were missing from the session-based discovery loop, causing /channels and
send_message to return empty results on those platforms.
Instead of adding them to the hardcoded tuple (which would break again when
new platforms are added), derive the list dynamically from the Platform enum.
Only infrastructure entries (local, api_server, webhook) are excluded;
Discord and Slack are skipped automatically because their direct builders
already populate the platforms dict.
Reported by sprmn24 in PR #7416.
When two gateway messages arrived concurrently, _set_session_env wrote
HERMES_SESSION_PLATFORM/CHAT_ID/CHAT_NAME/THREAD_ID into the process-global
os.environ. Because asyncio tasks share the same process, Message B would
overwrite Message A's values mid-flight, causing background-task notifications
and tool calls to route to the wrong thread/chat.
Replace os.environ with Python's contextvars.ContextVar. Each asyncio task
(and any run_in_executor thread it spawns) gets its own copy, so concurrent
messages never interfere.
Changes:
- New gateway/session_context.py with ContextVar definitions, set/clear/get
helpers, and os.environ fallback for CLI/cron/test backward compatibility
- gateway/run.py: _set_session_env returns reset tokens, _clear_session_env
accepts them for proper cleanup in finally blocks
- All tool consumers updated: cronjob_tools, send_message_tool, skills_tool,
terminal_tool (both notify_on_complete AND check_interval blocks), tts_tool,
agent/skill_utils, agent/prompt_builder
- Tests updated for new contextvar-based API
Fixes#7358
Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
Two fixes from PR review:
1. Session expiry was looking in _running_agents for the cached agent,
but idle expired sessions live in _agent_cache. Now checks
_agent_cache first, falls back to _running_agents.
2. Global cleanup in stop() was missing process_registry.kill_all(),
so background processes from agents evicted without close() (branch,
fallback) survived shutdown.
Use getattr guard for _agent_cache_lock in _handle_reset_command
because test fixtures may create GatewayRunner without calling
__init__, leaving the attribute unset.
Fixes e2e test failure: test_new_resets_session,
test_new_then_status_reflects_reset, test_new_is_idempotent.
Add 9 tests covering the full zombie process prevention chain:
- TestZombieReproduction: demonstrates that processes survive when
references are dropped without explicit cleanup (the original bug)
- TestAgentCloseMethod: verifies close() calls all cleanup functions,
is idempotent, propagates to children, and continues cleanup even
when individual steps fail
- TestGatewayCleanupWiring: verifies stop() calls close() and that
_evict_cached_agent() does NOT call close() (since it's also used
for non-destructive cache refreshes)
- TestDelegationCleanup: calls the real _run_single_child function and
verifies close() is called on the child agent
Ref: #7131
Wire AIAgent.close() into every gateway code path where an agent's
session is actually ending:
- stop(): close all running agents after interrupt + memory shutdown,
then call cleanup_all_environments() and cleanup_all_browsers() as
a global catch-all
- _session_expiry_watcher(): close agents when sessions expire after
the 5-minute idle timeout
- _handle_reset_command(): close the old agent before evicting it from
cache on /new or /reset
Note: _evict_cached_agent() intentionally does NOT call close() because
it is also used for non-destructive cache refreshes (model switch,
branch, fallback) where tool resources should persist.
Ref: #7131
Add is_network_accessible() helper using Python's ipaddress module to
robustly classify bind addresses (IPv4/IPv6 loopback, wildcards,
mapped addresses, hostname resolution with DNS-failure-fails-closed).
The API server connect() now refuses to start when the bind address is
network-accessible and no API_SERVER_KEY is set, preventing RCE from
other machines on the network.
Co-authored-by: entropidelic <entropidelic@users.noreply.github.com>
When enabled, @mentioning the bot in a DM creates a thread (default:
false). Supports both env var and YAML config (matrix.dm_mention_threads).
6 new tests, docs updated.
From #6957
Bot-added and bot-removed events were silently dropped because
_on_bot_added_to_chat and _on_bot_removed_from_chat were not
registered in _build_event_handler().
From #6975
Replace the simple DISCORD_IGNORE_NO_MENTION check with bot-aware
multi-agent filtering. When multiple agents share a channel:
- If other bots are @mentioned but this bot is not → stay silent
- If only humans are mentioned but not this bot → stay silent
- Messages with no mentions still flow to _handle_message for the
existing DISCORD_REQUIRE_MENTION check
- DMs are unaffected (always handled)
This prevents both agents from responding when only one is addressed.
- Remove sys.path.insert hack (leftover from standalone dev)
- Add token lock (acquire_scoped_lock/release_scoped_lock) in
connect()/disconnect() to prevent duplicate pollers across profiles
- Fix get_connected_platforms: WEIXIN check must precede generic
token/api_key check (requires both token AND account_id)
- Add WEIXIN_HOME_CHANNEL_NAME to _EXTRA_ENV_KEYS
- Add gateway setup wizard with QR login flow
- Add platform status check for partially configured state
- Add weixin.md docs page with full adapter documentation
- Update environment-variables.md reference with all 11 env vars
- Update sidebars.ts to include weixin docs page
- Wire all gateway integration points onto current main
Salvaged from PR #6747 by Zihan Huang.
Telegram's Bot API only allows a specific set of emoji for bot reactions
(the ReactionEmoji enum). ✅ (U+2705) and ❌ (U+274C) are not in that
set, causing on_processing_complete reactions to silently fail with
REACTION_INVALID (caught at debug log level).
Replace with 👍 (U+1F44D) / 👎 (U+1F44E) which are always available in
Telegram's allowed reaction list. The 👀 (eyes) reaction used by
on_processing_start was already valid.
Based on the fix by @ppdng in PR #6685.
Fixes#6068
Simplified implementation of the feature from PR #6842 (RunzhouLi).
Allows Discord channels/forum threads to auto-bind skills via config:
discord:
channel_skill_bindings:
- id: "123456"
skills: ["skill-a", "skill-b"]
The run.py auto-skill loader now handles both str and list[str],
loading multiple skills in order and concatenating their payloads.
Forum threads inherit their parent channel's bindings.
Co-authored-by: RunzhouLi <RunzhouLi@users.noreply.github.com>
Add debug logging when eyes reaction redaction fails, and add tests
for the success=False path and the no-pending-reaction edge case.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The on_processing_complete handler was never removing the eyes reaction because
_send_reaction didn't return the reaction event_id.
Fix:
- _send_reaction returns Optional[str] event_id
- on_processing_start stores it in _pending_reactions dict
- on_processing_complete redacts the eyes reaction before adding completion emoji
When extra.base_url is set in the Telegram platform config, use it as
the base URL for all Telegram API requests instead of api.telegram.org.
This allows agents to route Telegram traffic through the credential
proxy, which injects the real bot token — the VM never sees it.
Also supports extra.base_file_url for file downloads (defaults to
base_url if not set separately).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Follow-up to Dusk1e's PR #7120 (Slack send_image redirect guard):
- Rename _safe_url_for_log -> safe_url_for_log (drop underscore) since
it is now imported cross-module by the Slack adapter
- Add _ssrf_redirect_guard httpx event hook to cache_image_from_url()
and cache_audio_from_url() in base.py — same pattern as vision_tools
and the Slack adapter fix
- Update url_safety.py docstring to reflect broader coverage
- Add regression tests for image/audio redirect blocking + safe passthrough
The API server's _run_agent() was not passing task_id to
run_conversation(), causing a fresh random UUID per request. This meant
every Open WebUI message spun up a new Docker container and tore it down
afterward — making persistent filesystem state impossible.
Two fixes:
1. Pass task_id="default" so all API server conversations share the same
Docker container (matching the design intent: one configured Docker
environment, always the same container).
2. Derive a stable session_id from the system prompt + first user message
hash instead of uuid4(). This stops hermes sessions list from being
polluted with single-message throwaway sessions.
Fixes#3438.
Slack may return an HTML sign-in/redirect page instead of actual media
bytes (e.g. expired token, restricted file access). This adds two layers
of defense:
1. Content-Type check in slack.py rejects text/html responses early
2. Magic-byte validation in base.py's cache_image_from_bytes() rejects
non-image data regardless of source platform
Also adds ValueError guards in wecom.py and email.py so the new
validation doesn't crash those adapters.
Closes#6829
Platforms that don't return a message_id after the first send (Signal,
GitHub webhooks) were causing GatewayStreamConsumer to re-enter the
"first send" path on every tool boundary, posting one platform message
per tool call (observed as 155 PR comments on a single response).
Fix: treat _message_id == "__no_edit__" as a sentinel meaning "platform
accepted the send but cannot be edited". When a tool boundary arrives
in that state, skip the message_id/accumulated/last_sent_text reset so
all continuation text is delivered once via _send_fallback_final rather
than re-posted per segment.
Also make prompt_toolkit imports in hermes_cli/commands.py optional so
gateway and test environments that lack the package can still import
resolve_command, gateway_help_lines, and COMMAND_REGISTRY.