Sweep ~74 redundant local imports across 21 files where the same module
was already imported at the top level. Also includes type fixes and lint
cleanups on the same branch.
Follow-up on top of opriz's atomic PID file fix. The prior change caught
the race AFTER runner.start(), so the loser still opened Telegram polling
and Discord gateway sockets before detecting the conflict and exiting.
Hoist the PID-claim block to BEFORE runner.start(). Now the loser of the
O_CREAT|O_EXCL race returns from start_gateway() without ever bringing up
any platform adapter — no Telegram conflict, no Discord duplicate session.
Also add regression tests:
- test_write_pid_file_is_atomic_against_concurrent_writers: second
write_pid_file() raises FileExistsError rather than clobbering.
- Two existing replace-path tests updated to stateful mocks since the
real post-kill state (get_running_pid None after remove_pid_file)
is now exercised by the hoisted re-check.
If the old process crashed without firing its atexit handler,
remove_pid_file() is a no-op. Force-unlink the stale gateway.pid
so write_pid_file() (O_CREAT|O_EXCL) does not hit FileExistsError.
When starting the gateway with --replace, concurrent invocations could
leave multiple instances running simultaneously. This happened because
write_pid_file() used a plain overwrite, so the second racer would
silently replace the first process's PID record.
Changes:
- gateway/status.py: write_pid_file() now uses atomic O_CREAT|O_EXCL
creation. If the file already exists, it raises FileExistsError,
allowing exactly one process to win the race.
- gateway/run.py: before writing the PID file, re-check get_running_pid()
and catch FileExistsError from write_pid_file(). In both cases, stop
the runner and return False so the process exits cleanly.
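The exclusive-create behaviour can be sketched as follows. This is a minimal illustration, not the actual gateway/status.py code: the function body, the `path` and `pid` parameters, and the 0o644 mode are assumptions. Only the O_CREAT|O_EXCL semantics and the FileExistsError outcome come from the description above.

```python
import os
from typing import Optional


def write_pid_file(path: str, pid: Optional[int] = None) -> None:
    """Create the PID file atomically; raise FileExistsError if it already exists."""
    pid = os.getpid() if pid is None else pid
    # O_EXCL makes creation fail when the file exists, so exactly one racer wins.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    with os.fdopen(fd, "w") as fh:
        fh.write(str(pid))
```

A caller that loses the race sees FileExistsError and can exit before starting any adapter, instead of overwriting the winner's record.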
Fixes #11718
Users can declare shell scripts in config.yaml under a hooks: block that
fire on plugin-hook events (pre_tool_call, post_tool_call, pre_llm_call,
subagent_stop, etc.). Scripts receive JSON on stdin and can return JSON on
stdout to block tool calls or inject context before the LLM call.
Key design:
- Registers closures on existing PluginManager._hooks dict — zero changes
to invoke_hook() call sites
- subprocess.run(shell=False) via shlex.split — no shell injection
- First-use consent per (event, command) pair, persisted to allowlist JSON
- Bypass via --accept-hooks, HERMES_ACCEPT_HOOKS=1, or hooks_auto_accept
- hermes hooks list/test/revoke/doctor CLI subcommands
- Adds subagent_stop hook event fired after delegate_task children exit
- Claude Code compatible response shapes accepted
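The stdin/stdout contract and the no-shell invocation could look roughly like the sketch below. The function name `run_hook`, the timeout, and the rule that non-JSON or empty output is ignored are assumptions for illustration. Only subprocess.run(shell=False) with shlex.split and the JSON-in/JSON-out flow come from the design notes above.

```python
import json
import shlex
import subprocess
from typing import Any, Optional


def run_hook(command: str, payload: dict, timeout: float = 10.0) -> Optional[dict]:
    """Run one hook command with the event payload as JSON on stdin."""
    argv = shlex.split(command)  # argv list, so no shell interprets the string
    proc = subprocess.run(
        argv,
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        timeout=timeout,
        shell=False,
    )
    out = proc.stdout.strip()
    if proc.returncode != 0 or not out:
        return None  # hook failed or had nothing to say
    try:
        result: Any = json.loads(out)
    except json.JSONDecodeError:
        return None  # non-JSON output is ignored in this sketch
    return result if isinstance(result, dict) else None
```

Because the command is split into an argv list, metacharacters such as `;` or `$(...)` in the configured string reach the program as literal arguments rather than being executed.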
Cherry-picked from PR #13143 by @pefontana.
Pass the user's configured api_key through local-server detection and
context-length probes (detect_local_server_type, _query_local_context_length,
query_ollama_num_ctx) and use LM Studio's native /api/v1/models endpoint in
fetch_endpoint_model_metadata when a loaded instance is present — so the
probed context length is the actual runtime value the user loaded the model
at, not just the model's theoretical max.
Helps local-LLM users whose auto-detected context length was wrong, causing
compression failures and context-overrun crashes.
Prefer session_store origin over _parse_session_key() for shutdown
notifications. Fixes misrouting when chat identifiers contain colons
(e.g. Matrix room IDs like !room123:example.org).
Falls back to session-key parsing when no persisted origin exists.
Co-authored-by: Ruzzgar <ruzzgarcn@gmail.com>
Ref: #12766
Previously, /steer text was only injected after an entire tool batch
completed (_execute_tool_calls_sequential/concurrent returned). If the
batch had a long-running tool (delegate_task, terminal build), the
steer waited for ALL tools to finish before landing — functionally
identical to /queue from the user's perspective.
Now _apply_pending_steer_to_tool_results() is called after EACH
individual tool result is appended to messages, in both the sequential
and concurrent paths. A steer arriving during Tool 1 lands in Tool 1's
result before Tool 2 starts executing.
Also handles leftover steers in the gateway: if a steer arrives during
the final API call (no tool batch to drain into), it's now delivered as
the next user turn instead of being silently dropped.
Fixes user report from Utku.
/yolo and /verbose are safe to dispatch while an agent is running:
/yolo can unblock a pending approval prompt, /verbose cycles the
tool-progress display for the ongoing stream. Both modify session
state without needing agent interaction. Previously they fell through
to the running-agent catch-all (PR #12334) and returned the generic
busy message.
/fast and /reasoning stay on the catch-all — their handlers explicitly
say 'takes effect on next message', so nothing is gained by dispatching
them mid-turn.
Salvaged from #10116 (elkimek), scoped down.
Follow-up to 40164ba1.
- _handle_voice_channel_join/leave now use event.source.platform instead of
hardcoded Platform.DISCORD (consistent with other voice handlers).
- Update tests/gateway/test_voice_command.py to use 'platform:chat_id' keys
matching the new _voice_key() format.
- Add platform isolation regression test for the bug in #12542.
- Drop decorative test_legacy_key_collision_bug (the fix makes the
collision impossible; the test mutated a single key twice, not a
real scenario).
- Adapter mocks in _sync_voice_mode_state_to_adapter tests now set
adapter.platform = Platform.* (required by new isinstance check).
Follow-up to #9337: _is_user_authorized maps Platform.QQBOT to
QQ_ALLOWED_USERS, but the new platform_env_map inside
_get_unauthorized_dm_behavior omitted it. A QQ operator with a strict
user allowlist would therefore still have the gateway send pairing
codes to strangers.
Adds QQBOT to the env map and a regression test.
When SIGNAL_ALLOWED_USERS (or any platform-specific or global allowlist)
is set, the gateway was still sending automated pairing-code messages to
every unauthorized sender. This forced pairing-code spam onto personal
contacts of anyone running Hermes on a primary personal account with an
allowlist, and exposed information about the bot's existence.
Root cause
----------
_get_unauthorized_dm_behavior() fell through to the global default
('pair') even when an explicit allowlist was configured. An allowlist
signals that the operator has deliberately restricted access; offering
pairing codes to unknown senders contradicts that intent.
Fix
---
Extend _get_unauthorized_dm_behavior() to inspect the active per-platform
and global allowlist env vars. When any allowlist is set and the operator
has not written an explicit per-platform unauthorized_dm_behavior override,
the method now returns 'ignore' instead of 'pair'.
Resolution order (highest → lowest priority):
1. Explicit per-platform unauthorized_dm_behavior in config — always wins.
2. Explicit global unauthorized_dm_behavior != 'pair' in config — wins.
3. Any platform or global allowlist env var present → 'ignore'.
4. No allowlist, no override → 'pair' (open-gateway default preserved).
This fixes the spam for Signal, Telegram, WhatsApp, Slack, and all other
platforms with per-platform allowlist env vars.
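The resolution order above can be written as a small pure function. This is a sketch under assumptions: the function name, its parameters, and passing the environment as a mapping are illustrative and not the actual _get_unauthorized_dm_behavior() signature.

```python
from typing import Mapping, Optional, Sequence


def resolve_unauthorized_dm_behavior(
    platform_override: Optional[str],
    global_setting: Optional[str],
    allowlist_env_vars: Sequence[str],
    env: Mapping[str, str],
) -> str:
    # 1. Explicit per-platform override always wins.
    if platform_override:
        return platform_override
    # 2. An explicit global value other than 'pair' wins next.
    if global_setting and global_setting != "pair":
        return global_setting
    # 3. Any non-empty allowlist env var means the operator restricted access.
    if any(env.get(name, "").strip() for name in allowlist_env_vars):
        return "ignore"
    # 4. Open gateway: keep the pairing default.
    return "pair"
```

For example, with SIGNAL_ALLOWED_USERS set and no config override, the function returns 'ignore'; a per-platform 'pair' override still returns 'pair'.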
Testing
-------
6 new tests added to tests/gateway/test_unauthorized_dm_behavior.py:
- test_signal_with_allowlist_ignores_unauthorized_dm (primary #9337 case)
- test_telegram_with_allowlist_ignores_unauthorized_dm (same for Telegram)
- test_global_allowlist_ignores_unauthorized_dm (GATEWAY_ALLOWED_USERS)
- test_no_allowlist_still_pairs_by_default (open-gateway regression guard)
- test_explicit_pair_config_overrides_allowlist_default (operator opt-in)
- test_get_unauthorized_dm_behavior_no_allowlist_returns_pair (unit)
All 15 tests in the file pass.
Fixes #9337
Smart model routing (auto-routing short/simple turns to a cheap model
across providers) was opt-in and disabled by default. This removes the
feature wholesale: the routing module, its config keys, docs, tests, and
the orchestration scaffolding it required in cli.py / gateway/run.py /
cron/scheduler.py.
The /fast (Priority Processing / Anthropic fast mode) feature kept its
hooks into _resolve_turn_agent_config — those still build a route dict
and attach request_overrides when the model supports it; the route now
just always uses the session's primary model/provider rather than
running prompts through choose_cheap_model_route() first.
Also removed:
- DEFAULT_CONFIG['smart_model_routing'] block and matching commented-out
example sections in hermes_cli/config.py and cli-config.yaml.example
- _load_smart_model_routing() / self._smart_model_routing on GatewayRunner
- self._smart_model_routing / self._active_agent_route_signature on
HermesCLI (signature kept; just no longer initialised through the
smart-routing pipeline)
- route_label parameter on HermesCLI._init_agent (only set by smart
routing; never read elsewhere)
- 'Smart Model Routing' section in website/docs/integrations/providers.md
- tip in hermes_cli/tips.py
- entries in hermes_cli/dump.py + hermes_cli/web_server.py
- row in skills/autonomous-ai-agents/hermes-agent/SKILL.md
Tests:
- Deleted tests/agent/test_smart_model_routing.py
- Rewrote tests/agent/test_credential_pool_routing.py to target the
simplified _resolve_turn_agent_config directly (preserves credential
pool propagation + 429 rotation coverage)
- Dropped 'cheap model' test from test_cli_provider_resolution.py
- Dropped resolve_turn_route patches from cli + gateway test_fast_command
— they now exercise the real method end-to-end
- Removed _smart_model_routing stub assignments from gateway/cron test
helpers
Targeted suites: 74/74 in the directly affected test files;
tests/agent + tests/cron + tests/cli pass except 5 failures that
already exist on main (cron silent-delivery + alias quick-command).
Follow-up on top of the helix4u #12388 cherry-picks:
- make deferred post-delivery callbacks generation-aware end-to-end so
stale runs cannot clear callbacks registered by a fresher run for the
same session
- bind callback ownership to the active session event at run start and
snapshot that generation inside base adapter processing so later event
mutation cannot retarget cleanup
- pass run_generation through proxy mode and drop stale proxy streams /
final results the same way local runs are dropped
- centralize stop/new interrupt cleanup into one helper and replace the
open-coded branches with shared logic
- unify internal control interrupt reason strings via shared constants
- remove the return from base.py's finally block so cleanup no longer
swallows cancellation/exception flow
- add focused regressions for generation forwarding, proxy stale
suppression, and newer-callback preservation
This addresses all review findings from the initial #12388 review while
keeping the fix scoped to stale-output/typing-loop interrupt handling.
Follow-up on top of the helix4u #6392 cherry-pick:
- reuse one helper for actionable Docker-local file-not-found errors
across document/image/video/audio local-media send paths
- include /outputs/... alongside /output/... in the container-local
path hint
- soften the gateway startup warning so it does not imply custom
host-visible mounts are broken; the warning now targets the specific
risky pattern of emitting container-local MEDIA paths without an
explicit export mount
- add focused regressions for /outputs/... and non-document media hint
coverage
This keeps the salvage aligned with the actual MEDIA delivery problem on
current main while reducing false-positive operator messaging.
Gateway startup leaks aiohttp.ClientSession (and other partial-init
resources) when an adapter's connect() returns False or raises. The
adapter is never added to self.adapters, so the shutdown path at
gateway/run.py:2426 never calls disconnect() on it — Python GC later
logs 'Unclosed client session' at process exit.
Seen on 2026-04-18 18:08:16 during a double --replace takeover cycle:
one of the partial-init sessions survived past shutdown and emitted
the warning right before status=75/TEMPFAIL.
Fix:
- New GatewayRunner._safe_adapter_disconnect() helper — calls
adapter.disconnect() and swallows any exception. Used on error paths.
- Connect loop calls it in both failure branches: success=False and
except Exception.
- Adapter disconnect() implementations are already expected to be
idempotent and tolerate partial-init state (they all guard on
self._http_session / self._bridge_process before touching them).
Tests: tests/gateway/test_safe_adapter_disconnect.py — 3 cases verify
the helper forwards to disconnect, swallows exceptions, and tolerates
platform=None.
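A best-effort disconnect helper of this kind might look like the sketch below. It assumes disconnect() is a coroutine and uses a free function with a debug-level log line. Both are illustrative choices, not the actual GatewayRunner._safe_adapter_disconnect() body.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def safe_adapter_disconnect(adapter, platform=None) -> None:
    """Best-effort disconnect for an adapter that failed to connect."""
    try:
        await adapter.disconnect()
    except Exception as exc:  # cleanup must never mask the original failure
        logger.debug("disconnect failed for %s: %s", platform, exc)
```

The connect loop would call this in both failure branches, so a half-initialised HTTP session gets closed even though the adapter never reached the registered-adapters map.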
Any recognized slash command now bypasses the Level-1 active-session
guard instead of queueing + interrupting. A mid-run /model (or
/reasoning, /voice, /insights, /title, /resume, /retry, /undo,
/compress, /usage, /provider, /reload-mcp, /sethome, /reset) used to
interrupt the agent AND get silently discarded by the slash-command
safety net — zero-char response, dropped tool calls.
Root cause:
- Discord registers 41 native slash commands via tree.command().
- Only 14 were in ACTIVE_SESSION_BYPASS_COMMANDS.
- The other ~15 user-facing ones fell through base.py:handle_message
to the busy-session handler, which calls running_agent.interrupt()
AND queues the text.
- After the aborted run, gateway/run.py:9912 correctly identifies the
queued text as a slash command and discards it — but the damage
(interrupt + zero-char response) already happened.
Fix:
- should_bypass_active_session() now returns True for any resolvable
slash command. ACTIVE_SESSION_BYPASS_COMMANDS stays as the subset
with dedicated Level-2 handlers (documentation + tests).
- gateway/run.py adds a catch-all after the dedicated handlers that
returns a user-visible "agent busy — wait or /stop first" response
for any other resolvable command.
- Unknown text / file-path-like messages are unchanged — they still
queue.
Also:
- gateway/platforms/discord.py logs the invoker identity on every
slash command (user id + name + channel + guild) so future
ghost-command reports can be triaged without guessing.
Tests:
- 15 new parametrized cases in test_command_bypass_active_session.py
cover every previously-broken Discord slash command.
- Existing tests for /stop, /new, /approve, /deny, /help, /status,
/agents, /background, /steer, /update, /queue still pass.
- test_steer.py's ACTIVE_SESSION_BYPASS_COMMANDS check still passes.
Fixes #5057. Related: #6252, #10370, #4665.
Follow-up to #12301.
The drain-timeout branch of _stop_impl() was iterating the drain-start
snapshot (active_agents) when marking sessions resume_pending. That
snapshot can include sessions that finished gracefully during the drain
window — marking them would give their next turn a stray
'your previous turn was interrupted by a gateway restart' system note
even though the prior turn actually completed cleanly.
Iterate self._running_agents at timeout time instead, mirroring
_interrupt_running_agents() exactly:
- only sessions still blocking the shutdown get marked
- pending sentinels (AIAgent construction not yet complete) are skipped
Changes:
- gateway/run.py: swap active_agents.keys() for filtered
self._running_agents.items() iteration in the drain-timeout mark loop.
- tests/gateway/test_restart_resume_pending.py: two regression tests —
finisher-during-drain not marked, pending sentinel not marked.
The shutdown banner promised "send any message after restart to resume
where you left off" but the code did the opposite: a drain-timeout
restart skipped the .clean_shutdown marker, which made the next startup
call suspend_recently_active(), which marked the session suspended,
which made get_or_create_session() spawn a fresh session_id with a
'Session automatically reset. Use /resume...' notice — contradicting
the banner.
Introduce a resume_pending state on SessionEntry that is distinct from
suspended. Drain-timeout shutdown flags active sessions resume_pending
instead of letting startup-wide suspension destroy them. The next
message on the same session_key preserves the session_id, reloads the
transcript, and the agent receives a reason-aware restart-resume
system note that subsumes the existing tool-tail auto-continue note
(PR #9934).
Terminal escalation still flows through the existing
.restart_failure_counts stuck-loop counter (PR #7536, threshold 3) —
no parallel counter on SessionEntry. suspended still wins over
resume_pending in get_or_create_session() so genuinely stuck sessions
converge to a clean slate.
Spec: PR #11852 (BrennerSpear). Implementation follows the spec with
the approved correction (reuse .restart_failure_counts rather than
adding a resume_attempts field).
Changes:
- gateway/session.py: SessionEntry.resume_pending/resume_reason/
last_resume_marked_at + to_dict/from_dict; SessionStore
.mark_resume_pending()/clear_resume_pending(); get_or_create_session()
returns existing entry when resume_pending (suspended still wins);
suspend_recently_active() skips resume_pending entries.
- gateway/run.py: _stop_impl() drain-timeout branch marks active
sessions resume_pending before _interrupt_running_agents();
_run_agent() injects reason-aware restart-resume system note that
subsumes the tool-tail case; successful-turn cleanup also clears
resume_pending next to _clear_restart_failure_count();
_notify_active_sessions_of_shutdown() softens the restart banner to
'I'll try to resume where you left off' (honest about stuck-loop
escalation).
- tests/gateway/test_restart_resume_pending.py: 29 new tests covering
SessionEntry roundtrip, mark/clear helpers, get_or_create_session
precedence (suspended > resume_pending), suspend_recently_active
skip, drain-timeout mark reason (restart vs shutdown), system-note
injection decision tree (including tool-tail subsumption), banner
wording, and stuck-loop escalation override.
* feat(steer): /steer <prompt> injects a mid-run note after the next tool call
Adds a new slash command that sits between /queue (turn boundary) and
interrupt. /steer <text> stashes the message on the running agent and
the agent loop appends it to the LAST tool result's content once the
current tool batch finishes. The model sees it as part of the tool
output on its next iteration.
No interrupt is fired, no new user turn is inserted, and no prompt
cache invalidation happens beyond the normal per-turn tool-result
churn. Message-role alternation is preserved — we only modify an
existing role:"tool" message's content.
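A rough sketch of the injection step, assuming OpenAI-style message dicts. The function name, the "[User steer]" marker text, and the return value are hypothetical. The only behaviour taken from the description is that an existing tool message is modified and no new message is inserted.

```python
from typing import Optional


def apply_pending_steer(messages: list, pending: Optional[str]) -> bool:
    """Append steer text to the most recent tool message; return True if applied."""
    if not pending:
        return False
    for msg in reversed(messages):
        if msg.get("role") == "tool":
            content = msg.get("content")
            note = f"[User steer]: {pending}"
            if isinstance(content, list):
                # Multimodal content: add a text part instead of concatenating.
                content.append({"type": "text", "text": note})
            else:
                msg["content"] = f"{content or ''}\n\n{note}"
            return True
    return False  # no tool result yet; the caller keeps the steer pending
```

Since the message count and roles are unchanged, role alternation stays valid and only the tail of the transcript differs from what the model would have seen anyway.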
Wiring
------
- hermes_cli/commands.py: register /steer + add to ACTIVE_SESSION_BYPASS_COMMANDS.
- run_agent.py: add _pending_steer state, AIAgent.steer(), _drain_pending_steer(),
_apply_pending_steer_to_tool_results(); drain at end of both parallel and
sequential tool executors; clear on interrupt; return leftover as
result['pending_steer'] if the agent exits before another tool batch.
- cli.py: /steer handler — route to agent.steer() when running, fall back to
the regular queue otherwise; deliver result['pending_steer'] as next turn.
- gateway/run.py: running-agent intercept calls running_agent.steer(); idle-agent
path strips the prefix and forwards as a regular user message.
- tui_gateway/server.py: new session.steer JSON-RPC method.
- ui-tui: SessionSteerResponse type + local /steer slash command that calls
session.steer when ui.busy, otherwise enqueues for the next turn.
Fallbacks
---------
- Agent exits mid-steer → surfaces in run_conversation result as pending_steer
so CLI/gateway deliver it as the next user turn instead of silently dropping it.
- All tools skipped after interrupt → re-stashes pending_steer for the caller.
- No active agent → /steer reduces to sending the text as a normal message.
Tests
-----
- tests/run_agent/test_steer.py — accept/reject, concatenation, drain,
last-tool-result injection, multimodal list content, thread safety,
cleared-on-interrupt, registry membership, bypass-set membership.
- tests/gateway/test_steer_command.py — running agent, pending sentinel,
missing steer() method, rejected payload, empty payload.
- tests/gateway/test_command_bypass_active_session.py — /steer bypasses
the Level-1 base adapter guard.
- tests/test_tui_gateway_server.py — session.steer RPC paths.
72/72 targeted tests pass under scripts/run_tests.sh.
* feat(steer): register /steer in Discord's native slash tree
Discord's app_commands tree is a curated subset of slash commands (not
derived from COMMAND_REGISTRY like Telegram/Slack). /steer already
works there as plain text (routes through handle_message → base
adapter bypass → runner), but registering it here adds Discord's
native autocomplete + argument hint UI so users can discover and
type it like any other first-class command.
When a Telegram /restart fires and PTB's graceful-shutdown `get_updates`
ACK call times out ("When polling for updates is restarted, updates may
be received twice" in gateway.log), the new gateway receives the same
/restart again and restarts a second time — a self-perpetuating loop.
Record the triggering update_id in `.restart_last_processed.json` when
handling /restart. On the next process, reject a /restart whose
update_id <= the recorded one as a stale redelivery. A 5-minute staleness
guard ensures an orphaned marker can't block a legitimately new /restart.
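The dedup check reduces to a comparison plus an age test. In this sketch the marker is passed in as an already-parsed dict, and the field names `update_id` and `timestamp` are assumptions about the marker layout, not the real `.restart_last_processed.json` schema.

```python
import time
from typing import Optional

STALENESS_WINDOW_SECS = 5 * 60


def is_stale_restart_redelivery(
    marker: Optional[dict],
    update_id: Optional[int],
    now: Optional[float] = None,
) -> bool:
    """True when this /restart is a redelivery of one already handled."""
    if marker is None or update_id is None:
        return False  # no marker, or a platform without update ids
    now = time.time() if now is None else now
    try:
        recorded_id = int(marker["update_id"])
        recorded_at = float(marker["timestamp"])
    except (KeyError, TypeError, ValueError):
        return False  # malformed marker never blocks a restart
    if now - recorded_at > STALENESS_WINDOW_SECS:
        return False  # orphaned marker: too old to trust
    return update_id <= recorded_id
```

Every uncertain case returns False, so the failure mode of the guard is an extra restart rather than a /restart that can never be issued.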
- gateway/platforms/base.py: add `platform_update_id` to MessageEvent
- gateway/platforms/telegram.py: propagate `update.update_id` through
_build_message_event for text/command/location/media handlers
- gateway/run.py: write dedup marker in _handle_restart_command;
_is_stale_restart_redelivery checks it before processing /restart
- tests/gateway/test_restart_redelivery_dedup.py: 9 new tests covering
fresh restart, redelivery, staleness window, cross-platform,
malformed-marker resilience, and no-update_id (CLI) bypass
Only active for Telegram today (the one platform with monotonic
cross-session update ordering); other platforms return False from
_is_stale_restart_redelivery and proceed normally.
Error messages that tell users to install optional extras now use
{sys.executable} -m pip install ... instead of a bare 'pip install
hermes-agent[extra]' string. Under the curl installer, bare 'pip'
resolves to system pip, which either fails with PEP 668
externally-managed-environment or installs into the wrong Python.
Affects: hermes dashboard, hermes web server startup, mcp_serve,
hermes doctor Bedrock check, CLI voice mode, voice_mode tool runtime
error, Discord voice-channel join failure message.
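A message builder along these lines illustrates the change. The helper name `install_hint` and the quoting of the requirement are assumptions. The point taken from the description is only that the hint names the running interpreter rather than a bare `pip`.

```python
import sys


def install_hint(extra: str) -> str:
    """Build an install command that targets the interpreter running Hermes."""
    return f'{sys.executable} -m pip install "hermes-agent[{extra}]"'
```

Running pip via `sys.executable -m pip` installs into the same environment that raised the error, whichever `pip` happens to be first on PATH.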
* fix(gateway): detect legacy hermes.service units from pre-rename installs
Older Hermes installs used a different service name (hermes.service) before
the rename to hermes-gateway.service. When both units remain installed, they
fight over the same bot token — after PR #5646's signal-recovery change,
this manifests as a 30-second SIGTERM flap loop between the two services.
Detection is an explicit allowlist (no globbing) plus an ExecStart content
check, so profile units (hermes-gateway-<profile>.service) and unrelated
third-party services named 'hermes' are never matched.
Wired into systemd_install, systemd_status, gateway_setup wizard, and the
main hermes setup flow — anywhere we already warn about scope conflicts now
also warns about legacy units.
* feat(gateway): add migrate-legacy command + install-time removal prompt
- New hermes_cli.gateway.remove_legacy_hermes_units() removes legacy
unit files with stop → disable → unlink → daemon-reload. Handles user
and system scopes separately; system scope returns path list when not
running as root so the caller can tell the user to re-run with sudo.
- New 'hermes gateway migrate-legacy' subcommand (with --dry-run and -y)
routes to remove_legacy_hermes_units via gateway_command dispatch.
- systemd_install now offers to remove legacy units BEFORE installing
the new hermes-gateway.service, preventing the SIGTERM flap loop that
hits users who still have pre-rename hermes.service around.
Profile units (hermes-gateway-<profile>.service) remain untouched in
all paths — the legacy allowlist is explicit (_LEGACY_SERVICE_NAMES)
and the ExecStart content check further narrows matches.
* fix(gateway): mark --replace SIGTERM as planned so target exits 0
PR #5646 made SIGTERM exit the gateway with code 1 so systemd's
Restart=on-failure revives it after unexpected kills. But when a user has
two gateway units fighting for the same bot token (e.g. legacy
hermes.service + hermes-gateway.service from a pre-rename install), the
--replace takeover itself becomes the 'unexpected' SIGTERM — the loser
exits 1, systemd revives it 30s later, and the cycle flaps indefinitely.
Before calling terminate_pid(), --replace now writes a short-lived marker
file naming the target PID + start_time. The target's shutdown_signal_handler
consumes the marker and, when it names this process, leaves
_signal_initiated_shutdown=False so the final exit code stays 0.
Staleness defences:
- PID + start_time combo prevents PID reuse matching an old marker
- Marker older than 60s is treated as stale and discarded
- Marker is unlinked on first read even if it doesn't match this process
- Replacer clears the marker post-loop + on permission-denied give-up
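The marker handshake and its staleness defences can be sketched as below. The JSON layout, the field names, and both function names are hypothetical. The sketch only mirrors the rules listed above: match on PID plus start_time, a 60s age limit, and unlink on first read.

```python
import json
import os
import time

MARKER_MAX_AGE_SECS = 60


def write_planned_stop_marker(path, target_pid, target_start_time):
    """Replacer side: announce that the upcoming SIGTERM is intentional."""
    record = {"pid": target_pid, "start_time": target_start_time, "written_at": time.time()}
    with open(path, "w") as fh:
        json.dump(record, fh)


def consume_planned_stop_marker(path, my_pid, my_start_time, now=None):
    """Target side: True if the marker names this process. Always unlinks it."""
    now = time.time() if now is None else now
    try:
        with open(path) as fh:
            marker = json.load(fh)
    except (OSError, ValueError):
        marker = None
    try:
        os.unlink(path)  # consumed on first read, match or not
    except OSError:
        pass
    if not isinstance(marker, dict):
        return False
    try:
        age = now - float(marker.get("written_at", 0))
    except (TypeError, ValueError):
        return False
    if age > MARKER_MAX_AGE_SECS:
        return False  # stale marker from an earlier takeover
    return marker.get("pid") == my_pid and marker.get("start_time") == my_start_time
```

A signal handler would call the consume function and keep the exit code at 0 only when it returns True.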
Three closely-related fixes for shutdown / lifecycle hygiene.
1. _release_running_agent_state(session_key) helper
----------------------------------------------------
Per-running-agent state lived in three dicts that drifted out of sync
across cleanup sites:
self._running_agents — AIAgent per session_key
self._running_agents_ts — start timestamp per session_key
self._busy_ack_ts — last busy-ack timestamp per session_key
Inventory before this PR:
8 sites: del self._running_agents[key]
— only 1 (stale-eviction) cleaned all three
— 1 cleaned _running_agents + _running_agents_ts only
— 6 cleaned _running_agents only
Each missed entry was a (str, float) tuple per session per gateway
lifetime: small but persistent, accumulating across thousands of
sessions over months. Per-platform leaks compounded.
This change adds a single helper that pops all three dicts in
lockstep, and replaces every bare 'del self._running_agents[key]'
site with it. Per-session state that PERSISTS across turns
(_session_model_overrides, _voice_mode, _pending_approvals,
_update_prompt_pending) is intentionally NOT touched here — those
have their own lifecycles tied to user actions, not turn boundaries.
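In outline, the helper pops the same key from all three dicts and tolerates anything that is missing. The stand-in class below is a sketch, not the real GatewayRunner. The three attribute names come from the description, and the getattr guard for runners without _busy_ack_ts is an assumption based on the test list.

```python
class RunnerState:
    """Minimal stand-in holding the three per-running-agent dicts."""

    def __init__(self):
        self._running_agents = {}
        self._running_agents_ts = {}
        self._busy_ack_ts = {}

    def _release_running_agent_state(self, session_key):
        """Drop all per-turn state for one session; safe to call repeatedly."""
        if not session_key:
            return
        for name in ("_running_agents", "_running_agents_ts", "_busy_ack_ts"):
            table = getattr(self, name, None)  # older runners may lack a dict
            if table is not None:
                table.pop(session_key, None)
```

Using pop(key, None) rather than del keeps the helper idempotent, so cleanup sites that may run twice for the same session do not raise KeyError.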
2. _running_agents_ts cleared in _stop_impl
----------------------------------------
Was being missed alongside _running_agents.clear(); now included.
3. SessionDB close() in _stop_impl
---------------------------------
The SQLite WAL write lock stayed held by the old gateway connection
until Python actually exited — causing 'database is locked' errors
when --replace launched a new gateway against the same file. We
now explicitly close both self._db and self.session_store._db
inside _stop_impl, with try/except so a flaky close on one doesn't
block the other.
Tests
-----
tests/gateway/test_session_state_cleanup.py — 10 cases covering:
* helper pops all three dicts atomically
* idempotent on missing/empty keys
* preserves other sessions
* tolerates older runners without _busy_ack_ts attribute
* thread-safe under concurrent release
* regression guard: scans gateway/run.py and fails if a future
contributor reintroduces 'del self._running_agents[...]'
outside docstrings
* SessionDB close called on both holders during shutdown
* shutdown tolerates missing session_store
* shutdown tolerates close() raising on one db (other still closes)
Broader gateway suite: 3108 passed (vs 3100 on baseline), a net gain of
8 passes. The 10 remaining failures are pre-existing (cross-test
pollution or missing optional deps: matrix needs olm, signal/telegram
approval flake, dingtalk Mock wiring) and all reproduce on the stashed
baseline.
SessionStore._entries grew unbounded. Every unique
(platform, chat_id, thread_id, user_id) tuple ever seen was kept in
RAM and rewritten to sessions.json on every message. A Discord bot
in 100 servers x 100 channels x ~100 rotating users accumulates on
the order of 10^5 entries after a few months; each sessions.json
write becomes an O(n) fsync. Nothing trimmed this — there was no
TTL, no cap, no eviction path.
Changes
-------
* SessionStore.prune_old_entries(max_age_days) — drops entries whose
updated_at is older than the cutoff. Preserves:
- suspended entries (user paused them via /stop for later resume)
- entries with an active background process attached
Pruning is functionally identical to a natural reset-policy expiry:
SQLite transcript stays, session_key -> session_id mapping dropped,
returning user gets a fresh session.
* GatewayConfig.session_store_max_age_days (default 90; 0 disables).
Serialized in to_dict/from_dict, coerced from bad types / negatives
to safe defaults. No migration needed — missing field -> 90 days.
* _session_expiry_watcher calls prune_old_entries once per hour
(first tick is immediate). Uses the existing watcher loop so no
new background task is created.
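The pruning rule can be illustrated with plain dicts. This is a sketch under assumptions: entries are modelled as dicts with `updated_at` and `suspended` keys, and the active-process check is injected as a callable, none of which is the real SessionStore API.

```python
from datetime import datetime, timedelta


def prune_old_entries(entries, max_age_days, now=None, has_active_process=lambda key: False):
    """Remove entries idle longer than max_age_days; return how many were dropped."""
    if max_age_days <= 0:
        return 0  # 0 disables pruning
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    stale = [
        key
        for key, entry in entries.items()
        if entry["updated_at"] < cutoff
        and not entry.get("suspended", False)  # kept for later resume
        and not has_active_process(key)  # kept while a background process runs
    ]
    for key in stale:
        del entries[key]
    return len(stale)
```

The returned count lets the caller persist to disk only when something was actually removed.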
Why not more aggressive
-----------------------
90 days is long enough that legitimate long-idle users (seasonal,
vacation, etc.) aren't surprised — pruning just means they get a
fresh session on return, same outcome they'd get from any other
reset-policy trigger. Admins can lower it via config; 0 disables.
Tests
-----
tests/gateway/test_session_store_prune.py — 17 cases covering:
* entry age based on updated_at, not created_at
* max_age_days=0 disables; negative coerces to 0
* suspended + active-process entries are skipped
* _save fires iff something was removed
* disk JSON reflects post-prune state
* thread safety against concurrent readers
* config field roundtrips + graceful fallback on bad values
* watcher gate logic (first tick prunes, subsequent within 1h don't)
119 broader session/gateway tests remain green.
* fix(gateway): bound _agent_cache with LRU cap + idle TTL eviction
The per-session AIAgent cache was unbounded. Each cached AIAgent holds
LLM clients, tool schemas, memory providers, and a conversation buffer.
In a long-lived gateway serving many chats/threads, cached agents
accumulated indefinitely — entries were only evicted on /new, /model,
or session reset.
Changes:
- Cache is now an OrderedDict so we can pop least-recently-used entries.
- _enforce_agent_cache_cap() pops entries beyond _AGENT_CACHE_MAX_SIZE=64
when a new agent is inserted. LRU order is refreshed via move_to_end()
on cache hits.
- _sweep_idle_cached_agents() evicts entries whose AIAgent has been idle
longer than _AGENT_CACHE_IDLE_TTL_SECS=3600s. Runs from the existing
_session_expiry_watcher so no new background task is created.
- The expiry watcher now also pops the cache entry after calling
_cleanup_agent_resources on a flushed session — previously the agent
was shut down but its reference stayed in the cache dict.
- Evicted agents have _cleanup_agent_resources() called on a daemon
thread so the cache lock isn't held during slow teardown.
Both tuning constants live at module scope so tests can monkeypatch
them without touching class state.
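A compact sketch of the two eviction paths, using an OrderedDict as described. The class `AgentCache`, its method names, and the per-entry `last_used` timestamp are illustrative assumptions. Locking and the daemon-thread cleanup mentioned above are left out to keep the example short.

```python
import time
from collections import OrderedDict

AGENT_CACHE_MAX_SIZE = 64
AGENT_CACHE_IDLE_TTL_SECS = 3600


class AgentCache:
    def __init__(self, max_size=AGENT_CACHE_MAX_SIZE):
        self._entries = OrderedDict()
        self._max_size = max_size

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        self._entries.move_to_end(key)  # refresh LRU position on a hit
        entry["last_used"] = time.monotonic()
        return entry["agent"]

    def put(self, key, agent):
        """Insert an agent and return the agents evicted to respect the cap."""
        self._entries[key] = {"agent": agent, "last_used": time.monotonic()}
        self._entries.move_to_end(key)
        evicted = []
        while len(self._entries) > self._max_size:
            _, old = self._entries.popitem(last=False)  # least recently used
            evicted.append(old["agent"])
        return evicted

    def sweep_idle(self, ttl=AGENT_CACHE_IDLE_TTL_SECS, now=None):
        """Remove and return agents idle for longer than ttl seconds."""
        now = time.monotonic() if now is None else now
        idle = [k for k, e in self._entries.items() if now - e["last_used"] > ttl]
        return [self._entries.pop(k)["agent"] for k in idle]
```

Both methods return the evicted agents instead of tearing them down inline, which leaves the caller free to run slow cleanup outside any lock.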
Tests: 7 new cases in test_agent_cache.py covering LRU eviction,
move_to_end refresh, cleanup thread dispatch, idle TTL sweep,
defensive handling of agents without _last_activity_ts, and plain-dict
test fixture tolerance.
* tweak: bump _AGENT_CACHE_MAX_SIZE 64 -> 128
* fix(gateway): never evict mid-turn agents; live spillover tests
The prior commit could tear down an active agent if its session_key
happened to be LRU when the cap was exceeded. AIAgent.close() kills
process_registry entries for the task, tears down the terminal
sandbox, closes the OpenAI client (sets self.client = None), and
cascades .close() into any active child subagents — all fatal if
the agent is still processing a turn.
Changes:
- _enforce_agent_cache_cap and _sweep_idle_cached_agents now look at
GatewayRunner._running_agents and skip any entry whose AIAgent
instance is present (identity via id(), so MagicMock doesn't
confuse lookup in tests). _AGENT_PENDING_SENTINEL is treated
as 'not active' since no real agent exists yet.
- Eviction only considers the LRU-excess window (first size-cap
entries). If an excess slot is held by a mid-turn agent, we skip
it WITHOUT compensating by evicting a newer entry. A freshly
inserted session (zero cache history) shouldn't be punished to
protect a long-lived one that happens to be busy.
- Cache may therefore stay transiently over cap when load spikes;
a WARNING is logged so operators can see it, and the next insert
re-runs the check after some turns have finished.
New tests (TestAgentCacheActiveSafety + TestAgentCacheSpilloverLive):
- Active LRU entry is skipped; no newer entry compensated
- Mixed active/idle excess window: only idle slots go
- All-active cache: no eviction, WARNING logged, all clients intact
- _AGENT_PENDING_SENTINEL doesn't block other evictions
- Idle-TTL sweep skips active agents
- End-to-end: active agent's .client survives eviction attempt
- Live fill-to-cap with real AIAgents, then spillover
- Live: CAP=4 all active + 1 newcomer — cache grows to 5, no teardown
- Live: 8 threads racing 160 inserts into CAP=16 — settles at 16
- Live: evicted session's next turn gets a fresh agent that works
30 tests pass (13 pre-existing + 17 new). Related gateway suites
(model switch, session reset, proxy, etc.) all green.
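The eviction rule from this commit (only the LRU-excess window is considered, active agents are skipped, and no newer entry is evicted to compensate) can be sketched roughly as follows. The function name `pick_evictions` and its arguments are hypothetical; the real logic lives in _enforce_agent_cache_cap and may differ in detail.

```python
from collections import OrderedDict


def pick_evictions(cache, running_agents, cap):
    """Return session keys to evict from an LRU-ordered cache.

    cache: OrderedDict of session_key -> agent, oldest first.
    running_agents: mapping of session_key -> agent currently mid-turn.
    """
    excess = len(cache) - cap
    if excess <= 0:
        return []
    # Compare by identity so mock objects with loose __eq__ cannot match.
    active_ids = {id(agent) for agent in running_agents.values()}
    victims = []
    for key in list(cache)[:excess]:  # only the LRU-excess window
        if id(cache[key]) in active_ids:
            continue  # busy agent: skip it, do not evict a newer one instead
        victims.append(key)
    return victims
```

A pending placeholder stored in `running_agents` never shares identity with a cached agent, so it does not block eviction in this sketch.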
* fix(gateway): cache eviction preserves per-task state for session resume
The prior commits called AIAgent.close() on cache-evicted agents, which
tears down process_registry entries, terminal sandbox, and browser
daemon for that task_id — permanently. Fine for session-expiry (session
ended), wrong for cache eviction (session may resume).
Real-world scenario: a user leaves a Telegram session open for 2+ hours,
idle TTL evicts the cached AIAgent, user returns and sends a message.
Conversation history is preserved via SessionStore, but their terminal
sandbox (cwd, env vars, bg shells) and browser state were destroyed.
Fix: split the two cleanup modes.
  close()            Full teardown: session ended. Kills bg procs,
                     tears down terminal sandbox + browser daemon,
                     closes LLM client. Used by session-expiry,
                     /new, /reset (unchanged).
  release_clients()  Soft cleanup: session may resume. Closes
                     LLM client only. Leaves process_registry,
                     terminal sandbox, browser daemon intact
                     for the resuming agent to inherit via
                     shared task_id.
Gateway cache eviction (_enforce_agent_cache_cap, _sweep_idle_cached_agents)
now dispatches _release_evicted_agent_soft on the daemon thread instead
of _cleanup_agent_resources. All session-expiry call sites of
_cleanup_agent_resources are unchanged.
Tests (TestAgentCacheIdleResume, 5 new cases):
- release_clients does NOT call process_registry.kill_all
- release_clients does NOT call cleanup_vm / cleanup_browser
- release_clients DOES close the LLM client (agent.client is None after)
- close() vs release_clients() — semantic contract pinned
- Idle-evicted session's rebuild with same session_id gets same task_id
Updated test_cap_triggers_cleanup_thread to assert the soft path fires
and the hard path does NOT.
35 tests pass in test_agent_cache.py; 67 related tests green.
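The split between the two cleanup modes can be pictured with a toy class. `SketchAgent` and its attributes are hypothetical stand-ins; only the method names close() and release_clients() come from the commit.

```python
class SketchAgent:
    """Toy agent illustrating hard vs. soft cleanup."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.client = object()              # stands in for the LLM client
        self.sandbox_alive = True           # stands in for the terminal sandbox
        self.background_procs = ["bg-shell"]

    def release_clients(self):
        # Soft cleanup: drop only the LLM client; per-task state survives
        # so a rebuilt agent with the same task_id can inherit it.
        self.client = None

    def close(self):
        # Full teardown: the session has ended, so per-task state goes too.
        self.background_procs.clear()
        self.sandbox_alive = False
        self.release_clients()
```

Cache eviction would call `release_clients()`, while session expiry would call `close()`.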
Adds a new DISCORD_ALLOWED_ROLES environment variable that allows filtering
bot interactions by Discord role ID. Uses OR semantics with the existing
DISCORD_ALLOWED_USERS - if a user matches either allowlist, they're permitted.
Changes:
- Parse DISCORD_ALLOWED_ROLES comma-separated role IDs on connect
- Enable members intent when roles are configured (needed for role lookup)
- Update _is_allowed_user() to accept optional author param for direct role check
- Fallback to scanning mutual guilds when author object lacks roles (DMs, voice)
- Fully backwards compatible: no behavior change when env var is unset
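A minimal sketch of the OR semantics, assuming IDs are compared as strings. The helper names `parse_id_list` and `is_allowed` are hypothetical, and the case where neither allowlist is configured is deliberately left out because the commit does not describe it.

```python
def parse_id_list(raw):
    """Parse a comma-separated env value into a set of ID strings."""
    return {part.strip() for part in (raw or "").split(",") if part.strip()}


def is_allowed(user_id, role_ids, allowed_users, allowed_roles):
    """True if the user matches either allowlist (OR semantics)."""
    if str(user_id) in allowed_users:
        return True
    return any(str(role_id) in allowed_roles for role_id in role_ids)
```

For example, with `DISCORD_ALLOWED_USERS="111, 222"` and `DISCORD_ALLOWED_ROLES="900,901"`, a user with ID 333 holding role 901 would pass, and so would user 111 with no roles.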
Fixes #4466.
Root cause: two sequential authorization gates both independently rejected
bot messages, making DISCORD_ALLOW_BOTS completely ineffective.
Gate 1 — `discord.py` `on_message`:
_is_allowed_user ran BEFORE the bot filter, so bot senders were dropped
before the DISCORD_ALLOW_BOTS policy was ever evaluated.
Gate 2 — `gateway/run.py` _is_user_authorized:
The gateway-level allowlist check rejected bot IDs with 'Unauthorized
user: <bot_id>' even if they passed Gate 1.
Fix:
gateway/platforms/discord.py — reorder on_message so DISCORD_ALLOW_BOTS
runs BEFORE _is_allowed_user. Bots permitted by the filter skip the
user allowlist; non-bots are still checked.
gateway/session.py — add is_bot: bool = False to SessionSource so the
gateway layer can distinguish bot senders.
gateway/platforms/base.py — expose is_bot parameter in build_source.
gateway/platforms/discord.py _handle_message — set is_bot=True when
building the SessionSource for bot authors.
gateway/run.py _is_user_authorized — when source.is_bot is True AND
DISCORD_ALLOW_BOTS is 'mentions' or 'all', return True early. Platform
filter already validated the message at on_message; don't re-reject.
Behavior matrix:
| Config                                  | Before  | After   |
|-----------------------------------------|---------|---------|
| DISCORD_ALLOW_BOTS=none (default)       | Blocked | Blocked |
| DISCORD_ALLOW_BOTS=all                  | Blocked | Allowed |
| DISCORD_ALLOW_BOTS=mentions + @mention  | Blocked | Allowed |
| DISCORD_ALLOW_BOTS=mentions, no mention | Blocked | Blocked |
| Human in DISCORD_ALLOWED_USERS          | Allowed | Allowed |
| Human NOT in DISCORD_ALLOWED_USERS      | Blocked | Blocked |
Co-authored-by: Hermes Maintainer <hermes@nousresearch.com>
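The "After" column of the behavior matrix can be expressed as a single decision function. This is a sketch only: `is_sender_allowed` and its parameters are hypothetical, and it covers just the rows listed in the matrix.

```python
def is_sender_allowed(is_bot, allow_bots, mentioned, in_user_allowlist):
    """Reproduce the 'After' column of the behavior matrix.

    allow_bots is the DISCORD_ALLOW_BOTS value: 'none', 'mentions' or 'all'.
    """
    if is_bot:
        # The bot filter runs first; permitted bots skip the user allowlist.
        if allow_bots == "all":
            return True
        if allow_bots == "mentions":
            return mentioned
        return False
    # Non-bots are still checked against the user allowlist.
    return in_user_allowlist
```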
The Weixin adapter's send() method previously split and delivered the
raw response text without first extracting MEDIA: tags or bare local
file paths. This meant images, documents, and voice files referenced
by the agent were silently dropped in normal (non-streaming,
non-background) conversations.
Changes:
- In WeixinAdapter.send(), call extract_media() and
extract_local_files() before formatting/splitting text.
- Deliver extracted files via send_image_file(), send_document(),
send_voice(), or send_video() prior to sending text chunks.
- Also fix two minor typing issues in gateway/run.py where
extract_media() tuples were not unpacked correctly in background
and /btw task handlers.
Fixes missing media delivery on Weixin personal accounts.
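To make the ordering concrete (extract media first, then send the remaining text), here is a rough sketch. The tag syntax `MEDIA:<path>`, the extension tables, and the helpers `split_media` and `pick_sender` are all assumptions made for illustration; the real extract_media() and extract_local_files() may behave differently. Only the send_* method names come from the commit.

```python
import os
import re

_MEDIA_TAG = re.compile(r"MEDIA:(\S+)")  # assumed tag syntax

_IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp"}
_VOICE_EXTS = {".ogg", ".mp3", ".wav", ".m4a"}
_VIDEO_EXTS = {".mp4", ".mov"}


def split_media(text):
    """Return (media_paths, text_without_tags)."""
    paths = _MEDIA_TAG.findall(text)
    cleaned = _MEDIA_TAG.sub("", text)
    cleaned = re.sub(r"[ \t]+\n", "\n", cleaned).strip()
    return paths, cleaned


def pick_sender(path):
    """Choose which adapter method would deliver this file."""
    ext = os.path.splitext(path)[1].lower()
    if ext in _IMAGE_EXTS:
        return "send_image_file"
    if ext in _VOICE_EXTS:
        return "send_voice"
    if ext in _VIDEO_EXTS:
        return "send_video"
    return "send_document"
```

An adapter following the commit would deliver each extracted file through the chosen method and only then format and split the cleaned text.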
* - make buffered streaming
- fix path naming to expand `~` for agent.
- fix stripping of matrix ID to not remove other mentions / localports.
* fix(matrix): register MembershipEventDispatcher for invite auto-join
The mautrix migration (#7518) broke auto-join because InternalEventType.INVITE
events are only dispatched when MembershipEventDispatcher is registered on the
client. Without it, _on_invite is dead code and the bot silently ignores all
room invites.
Closes #10094
Closes #10725
Refs: PR #10135 (digging-airfare-4u), PR #10732 (fxfitz)
* fix(matrix): preserve _joined_rooms reference for CryptoStateStore
connect() reassigned self._joined_rooms = set(...) after initial sync,
orphaning the reference captured by _CryptoStateStore at init time.
find_shared_rooms() returned [] forever, breaking Megolm session rotation
on membership changes.
Mutate in place with clear() + update() so the CryptoStateStore reference
stays valid.
Refs #8174, PR #8215
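The aliasing bug behind the _joined_rooms fix is easy to reproduce in isolation. In this sketch `StoreSketch` and `AdapterSketch` are hypothetical stand-ins for the crypto state store and the adapter.

```python
class StoreSketch:
    """Captures a reference to the adapter's room set at init time."""

    def __init__(self, joined_rooms):
        self._joined_rooms = joined_rooms

    def find_shared_rooms(self):
        return sorted(self._joined_rooms)


class AdapterSketch:
    def __init__(self):
        self.joined_rooms = set()
        self.store = StoreSketch(self.joined_rooms)

    def sync_by_rebinding(self, rooms):
        # Bug: creates a new set; the store still holds the old, empty one.
        self.joined_rooms = set(rooms)

    def sync_in_place(self, rooms):
        # Fix: mutate the shared set so the store's reference stays valid.
        self.joined_rooms.clear()
        self.joined_rooms.update(rooms)
```

After `sync_by_rebinding`, the store keeps returning an empty list; after `sync_in_place`, it sees the synced rooms.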
* fix(matrix): remove dual ROOM_ENCRYPTED handler to fix dedup race
mautrix auto-registers DecryptionDispatcher when client.crypto is set.
The adapter also registered _on_encrypted_event for the same event type.
_on_encrypted_event had zero awaits and won the race to mark event IDs
in the dedup set, causing _on_room_message to drop successfully decrypted
events from DecryptionDispatcher. The retry loop masked this by re-decrypting
every message ~4 seconds later.
Remove _on_encrypted_event entirely. DecryptionDispatcher handles decryption;
genuinely undecryptable events are logged by mautrix and retried on next
key exchange.
Refs #8174, PR #8215
* fix(matrix): re-verify device keys after share_keys() upload
Matrix homeservers treat ed25519 identity keys as immutable per device.
share_keys() can return 200 but silently ignore new keys if the device
already exists with different identity keys. The bot would proceed with
shared=True while peers encrypt to the old (unreachable) keys.
Now re-queries the server after share_keys() and fails closed if keys
don't match, with an actionable error message.
Refs #8174, PR #8215
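A sketch of the fail-closed check, assuming the server reply has the shape of a Matrix /keys/query response. The function names and the error text are illustrative, not the adapter's actual code.

```python
def extract_server_ed25519(query_response, user_id, device_id):
    """Pull the device's ed25519 key out of a /keys/query style reply."""
    device = (
        query_response.get("device_keys", {})
        .get(user_id, {})
        .get(device_id, {})
    )
    return device.get("keys", {}).get(f"ed25519:{device_id}")


def verify_uploaded_keys(query_response, user_id, device_id, local_ed25519):
    """Raise if the server does not hold the key we just uploaded."""
    server_key = extract_server_ed25519(query_response, user_id, device_id)
    if server_key != local_ed25519:
        # Fail closed: a missing or different key means peers would
        # encrypt to an identity this process cannot decrypt for.
        raise RuntimeError(
            f"Device {device_id} identity key mismatch on homeserver; "
            "use a fresh device ID or purge the old device."
        )
    return True
```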
* fix(matrix): encrypt outbound attachments in E2EE rooms
_upload_and_send() uploaded raw bytes and used the 'url' key for all
rooms. In E2EE rooms, media must be encrypted client-side with
encrypt_attachment(), the ciphertext uploaded, and the 'file' key
(with key/iv/hashes) used instead of 'url'.
Now detects encrypted rooms via state_store.is_encrypted() and
branches to the encrypted upload path.
Refs: PR #9822 (charles-brooks)
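The difference between the two upload paths shows up in the event content. The sketch below only builds the content dict; `build_media_content` is a hypothetical helper, and `encrypted_file` is assumed to be the key/iv/hashes metadata produced when the attachment was encrypted.

```python
def build_media_content(body, msgtype, mxc_url, encrypted_file=None):
    """Build message content for a plaintext or an encrypted room.

    encrypted_file: dict with 'key', 'iv', 'hashes' and 'v' for E2EE
    rooms, or None for plaintext rooms.
    """
    content = {"msgtype": msgtype, "body": body}
    if encrypted_file is None:
        content["url"] = mxc_url  # plaintext room: bare MXC URL
    else:
        # Encrypted room: the MXC URL of the ciphertext goes inside 'file'
        content["file"] = {**encrypted_file, "url": mxc_url}
    return content
```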
* fix(matrix): add stop_typing to clear typing indicator after response
The adapter set a 30-second typing timeout but never cleared it.
The base class stop_typing() is a no-op, so the typing indicator
lingered for up to 30 seconds after each response.
Closes #6016
Refs: PR #6020 (r266-tech)
* fix(matrix): cache all media types locally, not just photos/voice
should_cache_locally only covered PHOTO, VOICE, and encrypted media.
Unencrypted audio/video/documents in plaintext rooms were passed as MXC
URLs that require authentication the agent doesn't have, resulting
in 401 errors.
Refs #3487, #3806
* fix(matrix): detect stale OTK conflict on startup and fail closed
When crypto state is wiped but the same device ID is reused, the
homeserver may still hold one-time keys signed with the previous
identity key. Identity key re-upload succeeds but OTK uploads fail
with "already exists" and a signature mismatch. Peers cannot
establish new Olm sessions, so all new messages are undecryptable.
Now proactively flushes OTKs via share_keys() during connect() and
catches the "already exists" error with an actionable log message
telling the operator to purge the device from the homeserver or
generate a fresh device ID.
Also documents the crypto store recovery procedure in the Matrix
setup guide.
Refs #8174
* docs(matrix): improve crypto recovery docs per review
- Put easy path (fresh access token) first, manual purge second
- URL-encode user ID in Synapse admin API example
- Note that device deletion may invalidate the access token
- Add "stop Synapse first" caveat for direct SQLite approach
- Mention the fail-closed startup detection behavior
- Add back-reference from upgrade section to OTK warning
* refactor(matrix): cleanup from code review
- Extract _extract_server_ed25519() and _reverify_keys_after_upload()
to deduplicate the re-verification block (was copy-pasted in two
places, three copies of ed25519 key extraction total)
- Remove dead code: _pending_megolm, _retry_pending_decryptions,
_MAX_PENDING_EVENTS, _PENDING_EVENT_TTL — all orphaned after
removing _on_encrypted_event
- Remove tautological TestMediaCacheGate (tested its own predicate,
not production code)
- Remove dead TestMatrixMegolmEventHandling and
TestMatrixRetryPendingDecryptions (tested removed methods)
- Merge duplicate TestMatrixStopTyping into TestMatrixTypingIndicator
- Trim comment to just the "why"
Raise the auto-delete window for shared debug reports from 1 hour to
6 hours. Users (Teknium) report that reports are deleted before anyone
can read them. 6 hours gives enough window for async bug-report triage
without leaving sensitive log data on public paste services indefinitely.
Applies to both the CLI (hermes debug share) and gateway (/debug) paths.
Initialize next_channel_prompt before the pending_event check and use
getattr with None default, matching the existing pattern for
next_source/next_message/next_message_id. Prevents AttributeError
when pending_event is None (interrupt path).
Cherry-picked from #10953 by @jackjin1997.
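The pattern in question looks roughly like this. The function `next_turn_fields` and the attribute names on the event are hypothetical; the point is that every variable is bound before the branch and read with a None default.

```python
from types import SimpleNamespace


def next_turn_fields(pending_event):
    """Read optional fields without raising when the event is missing."""
    next_message = None
    next_channel_prompt = None  # initialised before the pending_event check
    if pending_event is not None:
        next_message = getattr(pending_event, "text", None)
        next_channel_prompt = getattr(pending_event, "channel_prompt", None)
    return next_message, next_channel_prompt
```

With `pending_event=None` (the interrupt path) both values stay None instead of raising AttributeError.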
Switch from fragile Markdown V1 to HTML parse mode with html.escape()
for exec approval messages. Add fallback to text-based approval when
the formatted send fails.
Cherry-picked from #10999 by @danieldoderlein.
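A minimal sketch of the escape-then-fallback approach. The message wording and the two sender callables are placeholders; only html.escape() and the idea of falling back to plain text come from the commit.

```python
import html


def format_approval_html(command):
    """Escape the command so characters like < and & cannot break parsing."""
    return f"<b>Approval required</b>\n<pre>{html.escape(command)}</pre>"


def send_approval(send_formatted, send_plain, command):
    """Try the HTML message first; fall back to plain text on failure."""
    try:
        send_formatted(format_approval_html(command))
        return "html"
    except Exception:
        send_plain(f"Approval required:\n{command}")
        return "text"
```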
config.yaml terminal.cwd is now the single source of truth for working
directory. MESSAGING_CWD and TERMINAL_CWD in .env are deprecated with a
migration warning.
Changes:
1. config.py: Remove MESSAGING_CWD from OPTIONAL_ENV_VARS (setup wizard
no longer prompts for it). Add warn_deprecated_cwd_env_vars() that
prints a migration hint when deprecated env vars are detected.
2. gateway/run.py: Replace all MESSAGING_CWD reads with TERMINAL_CWD
(which is bridged from config.yaml terminal.cwd). MESSAGING_CWD is
still accepted as a backward-compat fallback with deprecation warning.
Config bridge skips cwd placeholder values so they don't clobber
the resolved TERMINAL_CWD.
3. cli.py: Guard against lazy-import clobbering — when cli.py is
imported lazily during gateway runtime (via delegate_tool), don't
let load_cli_config() overwrite an already-resolved TERMINAL_CWD
with os.getcwd() of the service's working directory. (#10817)
4. hermes_cli/main.py: Add 'hermes memory reset' command with
--target all/memory/user and --yes flags. Profile-scoped via
HERMES_HOME.
Migration path for users with .env settings:
  Remove MESSAGING_CWD / TERMINAL_CWD from .env
  Add to config.yaml:
    terminal:
      cwd: /your/project/path
Addresses: #10225, #4672, #10817, #7663
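The resulting precedence can be sketched as below. This is an approximation under stated assumptions: the function `resolve_cwd` is hypothetical, the placeholder set is a guess (the commit does not list the placeholder values), and the warning text is illustrative.

```python
ASSUMED_PLACEHOLDERS = frozenset({"", "."})  # assumption, not from the commit


def resolve_cwd(config, env):
    """Return (cwd, deprecation_warning) using config.yaml first."""
    configured = str((config.get("terminal") or {}).get("cwd") or "")
    if configured not in ASSUMED_PLACEHOLDERS:
        return configured, None  # config.yaml is the source of truth
    for name in ("TERMINAL_CWD", "MESSAGING_CWD"):
        value = env.get(name)
        if value:
            hint = f"{name} in .env is deprecated; set terminal.cwd in config.yaml"
            return value, hint
    return None, None
```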
When the LLM returns an empty completion, gateway/run.py replaced
final_response with the literal string '(No response generated)'.
This defeated cron/scheduler.py's empty-response skip guard, causing
the placeholder to be delivered to home channels.
Changes:
- gateway/run.py: return empty string instead of placeholder when
there is no error and no response content
- cron/scheduler.py: defensively strip the placeholder text in case
any upstream path still produces it
Fixes NousResearch/hermes-agent#9270
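The defensive strip on the scheduler side might look like the following sketch. The helper names are hypothetical; the placeholder string is the one quoted above.

```python
PLACEHOLDER = "(No response generated)"


def normalize_cron_output(text):
    """Drop the legacy placeholder so an empty reply stays empty."""
    return (text or "").replace(PLACEHOLDER, "").strip()


def should_deliver(text):
    """Skip delivery when nothing is left after normalisation."""
    return bool(normalize_cron_output(text))
```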
All 10 call sites in gateway/run.py and gateway/platforms/api_server.py
are inside async functions where a loop is guaranteed to be running.
Calling get_event_loop() without a running loop has been deprecated since
Python 3.10: it can silently create a new loop, masking bugs.
get_running_loop() raises RuntimeError instead, which is safer.
Surfaced during review of PRs #10533 and #10647.
Co-authored-by: kshitijk4poor <kshitijk4poor@users.noreply.github.com>
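The behavioral difference is easy to demonstrate with the standard library alone. The two function names below are illustrative.

```python
import asyncio


async def inside_coroutine():
    # Inside a coroutine a loop is always running, so this cannot fail.
    loop = asyncio.get_running_loop()
    return loop.is_running()


def outside_coroutine():
    # With no running loop, get_running_loop() raises instead of
    # quietly creating a loop nobody will ever run.
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return "no running loop"
    return "loop found"
```

Calling `asyncio.run(inside_coroutine())` succeeds, while `outside_coroutine()` called from synchronous code reports the missing loop.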