Extend the gateway_restart_notification flag to cover
_notify_active_sessions_of_shutdown — the message that fires just
before drain ("⚠️ Gateway restarting — Your current task will be
interrupted. Send any message after restart and I'll try to resume
where you left off.") sent to active sessions and home channels.
Same operator/end-user reasoning: on a Slack workspace shared with
end users, "Gateway restarting" reads as "the bot is broken" — the
operator should be able to suppress it consistently with the other
two lifecycle pings rather than having a partial opt-out.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-out toggle on PlatformConfig that gates both restart
lifecycle pings: the "♻ Gateway restarted" message sent to the chat
that issued /restart, and the "♻️ Gateway online" home-channel
startup notification. Defaults to True so existing deployments are
unaffected.
The motivating split is operator vs. end-user surfaces: a back-channel
like Telegram should keep these pings, while a Slack workspace shared
with end users should not surface gateway lifecycle noise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Route Feishu topic progress, status, approval, stream, and fallback messages through threaded replies by preserving the originating message id as the reply target. Add regressions for tool progress topic metadata and Feishu metadata-driven reply routing.
Mirror _message_thread_id_for_typing() with _message_thread_id_for_send():
both now map the General forum topic (thread id "1") to None upfront.
That removes the need for the retry-without-thread fallback in send_typing()
entirely — if _message_thread_id_for_typing() returns a non-None value, it's
a real user-created topic and falling back to the root chat is never correct.
If Telegram rejects the typing action (e.g. topic deleted mid-session), we
swallow it at debug level instead of bleeding the indicator into All Messages.
Updates the General-topic typing regression test to assert the new single-call
contract.
Salvages @Es1la's PR #13632 — a non-numeric timestamp in the persisted
feishu dedup state crashed adapter startup with ValueError/TypeError
from the unguarded float() call. Wrap the float() conversion in
try/except; skip the bad key and keep loading the rest.
The original PR also restructured existing TestDedupTTL tests to use
tempfile.TemporaryDirectory + HERMES_HOME patching — that was
test-hygiene scope creep unrelated to the bug. Kept only the
malformed-timestamp fix and added a focused regression test.
Widens @Krionex's PR #16933 fix to cover the second bug class at the sibling
site. natural mode used to pass env values through int() before the PR
caught mis-typed values crashing the gateway; custom mode had the exact
same bug one branch away (HERMES_HUMAN_DELAY_MIN_MS=oops in custom mode
still crashed). Same try/except/fallback pattern, scoped to the two
int() calls that feed random.uniform().
* feat(api-server): X-Hermes-Session-Key header for long-term memory scoping
API Server integrations (Open WebUI, custom web UIs) can now pass a stable
per-channel identifier via X-Hermes-Session-Key that scopes long-term memory
(Honcho, etc.) independently of the transcript-scoped X-Hermes-Session-Id.
This matches the native gateway's session_key / session_id split: one stable
key per assistant channel, many independent transcripts that rotate on /new.
- _create_agent and _run_agent accept gateway_session_key and pass it to
AIAgent(gateway_session_key=...), which is already honored by the Honcho
memory provider (plugins/memory/honcho/client.py resolve_session_name).
- New shared helper _parse_session_key_header applies the same API-key
gate, control-character sanitization, and a 256-char length cap as the
existing session-id header.
- All three agent endpoints honor the header: /v1/chat/completions,
/v1/responses, /v1/runs. JSON and SSE responses echo it back.
- /v1/capabilities advertises session_key_header so clients can
feature-detect.
Closes#20060.
Co-authored-by: Andy Stewart <lazycat.manatee@gmail.com>
* chore: AUTHOR_MAP entry for manateelazycat
---------
Co-authored-by: Andy Stewart <lazycat.manatee@gmail.com>
The YAML-to-env-var bridge in load_gateway_config() mapped every Discord
and Telegram config key (require_mention, auto_thread, reactions, etc.)
except reply_to_mode. Users setting discord.reply_to_mode or
telegram.reply_to_mode in ~/.hermes/config.yaml got no effect — the
adapter only read the env var, which nothing populated from YAML.
Add the missing bridge for both platforms, following the existing pattern.
Top-level <platform>.reply_to_mode preferred, falls back to
<platform>.extra.reply_to_mode, env var never overwritten. Handles YAML
1.1 bare `off` → Python False coercion.
This is a re-submission of the work from #9837 and #13930, which both
implemented the same fix but neither landed (see co-authors below).
Co-authored-by: Matteo De Agazio <hypnosis.mda@gmail.com>
Co-authored-by: ishardo <239075732+ishardo@users.noreply.github.com>
discover_fallback_ips() filtered out any DoH-resolved IP that also appeared
in the system resolver's answer set, on the assumption that the system IP
was unreachable. When DoH and system DNS agreed (a common case), the
function returned the hardcoded _SEED_FALLBACK_IPS list instead — and on
networks where those seed addresses are not routable, the Telegram fallback
transport had nothing usable to retry against and polling failed.
Drop the system_ips exclusion so DoH-confirmed IPs are preserved regardless
of system DNS overlap. The TelegramFallbackTransport already tries the
primary path first via system DNS, then falls through to the IP-rewrite
path on connect failure; including the same IP in both lanes lets a
transient primary failure recover via the explicit IP route instead of
escalating to seed addresses.
Update the two tests that codified the old exclusion to reflect the new,
inclusion-by-default behaviour.
Fixes#14520
* revert(gateway): remove stale-code self-check and auto-restart
Removes the _detect_stale_code / _trigger_stale_code_restart mechanism
introduced in #17648 and iterated in #19740. On every incoming message
the gateway compared the boot-time git HEAD SHA to the current SHA on
disk, and if they differed it would reply with
Gateway code was updated in the background --
restarting this gateway so your next message runs
on the new code. Please retry in a moment.
and then kick off a graceful restart. This is unwanted behaviour:
users who run a long-lived gateway and do their own ad-hoc git
operations on the checkout end up with their chat interrupted and
the current message dropped every time HEAD moves, with no way to
opt out.
If an operator really needs the old protection against stale
sys.modules after "hermes update", the SIGKILL-survivor sweep in
hermes update (hermes_cli/main.py, also tagged #17648) already
handles the supervisor-respawn case on its own.
Removed:
gateway/run.py:
- _STALE_CODE_SENTINELS, _GIT_SHA_CACHE_TTL_SECS
- _read_git_head_sha(), _compute_repo_mtime() module helpers
- class-level _boot_wall_time / _boot_repo_mtime / _boot_git_sha /
_stale_code_restart_triggered defaults
- __init__ boot-snapshot block (_boot_*, _cached_current_sha*,
_repo_root_for_staleness, _stale_code_notified)
- _current_git_sha_cached(), _detect_stale_code(),
_trigger_stale_code_restart() methods
- stale-code check + user-facing restart notice at the top of
_handle_message()
tests/gateway/test_stale_code_self_check.py (deleted, 412 lines)
No new logic added. Zero remaining references to any removed
symbol. Gateway test suite passes the same 4589 tests it passed
before; the 3 pre-existing unrelated failures (discord free-channel,
feishu bot admission, teams typing) are unchanged by this commit.
* fix(agent): stateful streaming scrubber for reasoning-block leaks (#17924)
Per-delta _strip_think_blocks ran at _fire_stream_delta and destroyed
downstream state. When MiniMax-M2.7 / DeepSeek / Qwen3 streamed a tag
split across deltas (delta1='<think>', delta2='Let me check'), the
regex case-2 match erased delta1 entirely, so CLI/gateway state
machines never learned a block was open and leaked delta2 as content.
Raw consumers (ACP, api_server, TTS) had no downstream defense at all.
Replace the per-delta regex with a stateful StreamingThinkScrubber
that survives delta boundaries:
- Closed <tag>X</tag> pairs always stripped (matches _strip_think_blocks
case 1).
- Unterminated open at block boundary enters a block; content
discarded until close tag arrives. At end-of-stream, held
content is dropped.
- Orphan close tags stripped without boundary gating.
- Partial tags at delta boundaries held back until resolved.
- Block-boundary rule (start-of-stream, after \n, or
whitespace-only since last \n) preserves prose that mentions
tag names.
Reset at turn start alongside the existing context scrubber; flush at
turn end so a benign '<' held back at end-of-stream reaches the UI.
E2E-verified on live OpenRouter->MiniMax-m2 streams: closed pairs
strip cleanly, first word of post-block content is preserved, pure
content passes through unchanged. Stefan's screenshot case (#17924)
— 'Let me check' getting chopped to ' me check' — no longer happens.
Final _strip_think_blocks calls on completed strings (final_response,
replay, compression) are preserved; only the streaming per-delta call
site switched to the scrubber.
The SDK requires Python >=3.12 so CI (3.11) falls to the except
ImportError branch, leaving TypingActivityInput=None. After loading
the adapter module, explicitly restore it from the mock so
test_send_typing doesn't silently no-op.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot review: the helper accepted None in one test but was annotated str.
Matches actual usage where no-content-type attachments are a tested scenario.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add EMAIL_ALLOWED_USERS check in EmailAdapter._dispatch_message()
to silently discard emails from senders not in the allowlist. This
prevents the adapter from creating thread context and dispatching a
MessageEvent for unauthorized senders, which could race with the
gateway authorization check and result in SMTP replies being sent
despite the handler returning None.
Test: tests/gateway/test_email.py::TestDispatchMessage::test_non_allowlisted_sender_dropped
Test: tests/gateway/test_email.py::TestDispatchMessage::test_allowlisted_sender_proceeds
Test: tests/gateway/test_email.py::TestDispatchMessage::test_empty_allowlist_allows_all
The stale-code self-check (Issue #17648) used sentinel-file mtimes to
decide whether the gateway survived a `hermes update` with stale
`sys.modules`. That signal false-positives on any write to the
sentinel files — including agent-driven edits during Hermes-on-Hermes
dev sessions. Telling the agent to patch `run_agent.py` would flip
the check to True on the next user message and force a gateway
restart even though no update happened.
Switch the signal to `git rev-parse HEAD`. Agent file edits don't
move HEAD; `hermes update` (git pull) always does. Reading .git/HEAD
directly (no subprocess) with a 5s cache keeps the overhead negligible
on bursty chats. Non-git installs short-circuit to False — the
stale-modules class can't occur without a git-backed update path, so
there's nothing to detect.
The legacy `_compute_repo_mtime` helper is kept but unused by
detection, reserved as a fallback hook for future pip-install update
paths.
- _read_git_head_sha(): resolves HEAD across main checkout, worktree
(follows `gitdir:` + `commondir` pointers), and packed-refs layouts.
- _current_git_sha_cached(): per-runner 5s SHA cache.
- _detect_stale_code(): boot SHA vs current SHA, returns False when
either is unavailable.
- Tests cover all four layouts, the agent-edits-don't-trigger
regression, and cache behavior.
Refs #17648.
Four production-readiness additions to topic mode:
1. /topic off — clean disable path. Flips telegram_dm_topic_mode.enabled
to 0 and clears telegram_dm_topic_bindings for this chat. Previously
users had to edit state.db with sqlite3 to turn the feature off.
Idempotent: calling /topic off when the chat was never enabled
returns a friendly no-op message.
2. /topic help — inline usage printed in the DM so users don't have to
visit docs to discover /topic off, /topic <session-id>, etc.
3. Authorization gate. /topic mutates SQLite side tables and flips the
root DM into a lobby, so the action must be authorized. Now calls
self._is_user_authorized(source); unauthorized DMs get a refusal
instead of activation. Defense in depth on top of the gateway's
existing pre-route auth.
4. BotFather screenshot debounce. A user repeatedly running /topic
while Threads Settings is still disabled would previously re-upload
the same screenshot every time. Now rate-limited to one send per
5 minutes per chat. /topic off resets the counter so re-enabling
starts fresh.
Command-def args hint updated: /topic [off|help|session-id].
Docs:
- New /topic subcommands table at the top of the multi-session section
- Disable instructions updated to recommend /topic off first, with the
raw SQL fallback kept for bulk cleanup
- Under-the-hood list extended with the capability-hint debounce and
the authorization gate
Tests (6 new):
- /topic help returns usage and doesn't create topic tables
- /topic off disables mode AND clears bindings
- /topic off is idempotent when never enabled
- Unauthorized users get refusal, no tables created
- Capability-hint debounce is per-chat
- /topic off resets both lobby and capability debounce counters
All 402 targeted tests pass. Full gateway sweep: 4809/4810
(pre-existing test_teams::test_send_typing unrelated).
Five follow-ups to topic mode based on integration audit:
1. ON DELETE CASCADE on telegram_dm_topic_bindings.session_id. Session
pruning (manual /delete, auto-cleanup, any future prune job) would
have thrown 'FOREIGN KEY constraint failed' for sessions bound to a
topic. Migration bumped to v2, rebuilds the bindings table in place
if FK lacks CASCADE. Idempotent; only runs once per DB.
2. Never auto-rename operator-declared topics. If an operator has
extra.dm_topics configured AND a user runs /topic, messages in those
pre-declared topics would previously trigger auto-rename and silently
mutate operator config. _rename_telegram_topic_for_session_title now
early-returns when _get_dm_topic_info returns a dict for this
(chat_id, thread_id). Uses class-based lookup (not hasattr) so
MagicMock test fixtures don't accidentally trip the guard.
3. General topic handling. Telegram's General (pinned top) topic in a
forum-enabled private chat may send messages with message_thread_id=1
or omit thread_id entirely depending on client. Both are now treated
as the root lobby, not a topic lane. Prevents users from
accidentally burning a session on the General topic.
4. Debounce the root-lobby reminder. 30-second cooldown per chat so a
user who forgets topic mode is enabled and types ten messages in the
root gets one reminder, not ten. Explicit command replies
(/new-in-lobby, /topic <session-id>) still land every time.
5. Docs: added under-the-hood invariants for the above, plus a
Downgrade section explaining that rolling back to a pre-/topic
Hermes build leaves the DB tables orphaned but harmless — DMs just
revert to native per-thread isolation.
Tests:
- test_operator_declared_topic_is_not_auto_renamed
- test_general_topic_is_treated_as_root_lobby
- test_lobby_reminder_is_debounced_per_chat
- test_binding_survives_session_deletion_via_cascade
- test_migration_rebuilds_v1_binding_table_with_cascade_fk
Validated: 4803/4804 tests pass (tests/gateway/ + tests/test_hermes_state.py).
Sole failure is a pre-existing test_teams::test_send_typing flake
unrelated to this PR.
Follow-up on @EmelyanenkoK's feat: add Telegram DM topic-mode sessions.
Three issues:
1. Split-brain session state. After get_or_create_session() returned a
SessionEntry for a topic lane, the handler was mutating
.session_id in place to the binding's target, but never persisting
the switch through SessionStore. The sessions.json session_key →
session_id map kept pointing at the lane's natural id; any reader
that reloaded from disk saw the wrong id. Fixed by routing through
SessionStore.switch_session(), which _save()s the mapping and ends
the old session in SQLite like /resume does.
2. /new inside a topic was a one-message no-op. Reset created a new
session but left the telegram_dm_topic_bindings row pointing at the
old session_id, so the next message's binding lookup switched right
back. Now _handle_reset_command rebinds the topic to the new
session_id after reset.
3. is_telegram_session_linked_to_topic and
list_unlinked_telegram_sessions_for_user both called
apply_telegram_topic_migration() on read, contradicting the PR's
own invariant that migration only runs on explicit /topic opt-in.
They now tolerate missing topic tables and return empty/False.
Also: _telegram_topic_mode_enabled() now only treats True as enabled
(not any truthy return), so test fixtures with MagicMock session_db
don't accidentally flip every DM into lobby mode — this was breaking
4 pre-existing test_status_command tests.
Tests:
- New regression: /new inside a topic must update the binding row
(test_new_inside_telegram_topic_rewrites_binding_to_new_session).
- _make_runner now stubs switch_session so existing restore tests
still exercise the new code path.
Validated end-to-end with real SessionDB + SessionStore:
readers on fresh DB don't create topic tables; enable creates them;
binding override persists across SessionStore restart; /new rebinds
and the new id survives a restart.
Co-authored-by: EmelyanenkoK <emelyanenko.kirill@gmail.com>
Adapted from PR #19188 by @LeonSGP43 — mocks cli_output helpers and
verifies interactive_setup persists credentials to .env without
crashing. Also adds megastary to AUTHOR_MAP.
Prevents pre-existing TWILIO_PHONE_NUMBER or SMS_WEBHOOK_URL values in
the outer test environment from leaking into the assertion context.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The contributor's PR silently swallowed ValueError from
SessionDB.set_session_title() with bare except Exception: pass.
Users typing /new <title> with an already-in-use title got an
untitled session and no feedback.
Changes:
- cli.py: catch ValueError from both sanitize_title() and
set_session_title(); print the error and mark the session
untitled in the banner (never echo the rejected title back).
- gateway/run.py: append a warning note to the reset reply on
title rejection; reflect the accepted title in the header.
- Add regression tests for the duplicate-title path in CLI and
gateway.
Also map exx@example.com -> @exxmen in scripts/release.py.
Allow users to start a fresh session and immediately set its title by
passing a name to /new (or /reset):
/new Refactor auth module
Changes:
- hermes_cli/commands.py: add args_hint='[name]' to /new command
- cli.py: parse title argument in process_command(), pass to new_session()
- cli.py: new_session() accepts title=None, sets title via SessionDB
- gateway/run.py: _handle_reset_command() parses title, sets on new entry
- gateway/session.py: reset_session() accepts optional display_name
- tests: add test_new_session_with_title, test_reset_command_with_title,
test_new_command_in_help_output
All 36 affected tests pass.
Users commonly place `require_mention: true` at the top level of
config.yaml alongside `group_sessions_per_user`, expecting it to gate
Telegram group messages. The key was silently ignored because the
config loader only checked `yaml_cfg["telegram"]["require_mention"]`.
When `require_mention` is found at the top level and no telegram-specific
value is set, the fix now:
- adds it to platforms_data["telegram"]["extra"] so _telegram_require_mention()
picks it up via the primary config.extra path
- sets TELEGRAM_REQUIRE_MENTION env var for the secondary fallback path
A telegram-specific value (telegram.require_mention) still takes
precedence over the top-level shorthand.
Also corrects telegram.md: bare /cmd without @botname is rejected when
require_mention is enabled; only /cmd@botname (bot-menu form) passes.
Fixes#3979
Deduplicate exact and near-exact Discord voice STT transcripts per guild/user over a short window to avoid duplicate delayed agent replies.
Adds regression tests for exact and near-duplicate voice transcript suppression.
/goal was silently broken outside the classic CLI.
TUI: /goal was routed through the HermesCLI slash-worker subprocess,
which set the goal row in SessionDB but then called
_pending_input.put(state.goal) — the subprocess has no reader for that
queue, so the kickoff message was discarded. No post-turn judge was
wired into prompt.submit either, so even a manual kickoff would not
continue the goal loop. Intercept /goal in command.dispatch instead,
drive GoalManager directly, and return {type: send, notice, message}
so the TUI client renders the Goal-set notice and fires the kickoff.
Run the judge in _run_prompt_submit after message.complete, surface
the verdict via status.update {kind: goal}, and chain the continuation
turn after the running guard is released.
Gateway: _post_turn_goal_continuation was gated on
hasattr(adapter, 'send_message'), but adapters only expose send().
That branch was dead on every platform — users never saw
'✓ Goal achieved', 'Continuing toward goal', or budget-exhausted
messages. Replace the dead call with adapter.send(chat_id, content,
metadata) and drop a broken reference to self._loop.
Tests:
- tests/tui_gateway/test_goal_command.py — full /goal dispatch matrix
(set / status / pause / resume / clear / stop / done / whitespace)
plus regressions for slash.exec → 4018 and 'goal' staying in
_PENDING_INPUT_COMMANDS.
- tests/gateway/test_goal_verdict_send.py — locks in the adapter.send
path for done / continue / budget-exhausted and verifies the hook
no-ops when no goal is set or the adapter lacks send().
SlackAdapter.connect() overwrote self._handler, self._app, and
self._socket_mode_task without closing the prior AsyncSocketModeHandler
first. If connect() was called a second time on the same adapter (e.g.
during a gateway restart or in-process reconnect attempt), the old Socket
Mode websocket stayed alive. Both the old and new connections received
every Slack event and dispatched it twice — producing double responses
with different wording, the same bug that affected DiscordAdapter (#18187,
fixed in #18758).
Fix: add a close-before-reassign guard at the start of the connection
setup path, mirroring the guard DiscordAdapter.connect() already has.
When self._handler is None (fresh adapter, first connect()) the block is
a harmless no-op. Scoped to the handler/app fields only — no behavior
change for any path that does not call connect() twice.
Fixes#18980
Two mitigations for the CLOSE_WAIT accumulation reported against QQ Bot
+ Feishu on macOS behind Cloudflare Warp.
1. Shared httpx.Limits helper (gateway/platforms/_http_client_limits.py).
Every long-lived platform adapter now constructs httpx.AsyncClient
with max_keepalive_connections=10 and keepalive_expiry=2.0, vs httpx's
default of unbounded keepalive pool and 5.0s expiry. On macOS/Warp the
default 5s window let idle keepalive sockets sit in CLOSE_WAIT long
enough for seven persistent adapters (QQ Bot, WeCom, DingTalk, Signal,
BlueBubbles, WeCom-callback, plus the transient Feishu helper) to
compound to the 256-fd ulimit. Tunable via
HERMES_GATEWAY_HTTPX_KEEPALIVE_EXPIRY and
HERMES_GATEWAY_HTTPX_MAX_KEEPALIVE env vars.
2. whatsapp.send_typing aiohttp leak. The call was
'await self._http_session.post(...)' with no 'async with' and no
variable capture — the ClientResponse went out of scope unclosed,
holding its TCP socket in CLOSE_WAIT until GC. Fixed by wrapping in
'async with'. This was the only bare-await aiohttp leak in the
gateway/tools/plugins tree per audit; all other aiohttp sites use
the context-manager pattern correctly.
The underlying reporter also saw Feishu SDK (lark-oapi) connections in
CLOSE_WAIT — those are inside the SDK and out of our direct control, but
tightening httpx keepalive across adapters reduces the aggregate pool
pressure regardless of which individual adapter leaks.
Snapshot Content-Type and body while the client context is still
active so pooled connections fully release on exit. Previously the
read happened after `async with httpx.AsyncClient(...)` returned —
which works today only because httpx eagerly buffers non-streaming
responses; a future refactor to `.stream()` would silently read-
after-close.
Part of the #18451 connection-hygiene audit. Salvage of #18502.
* fix(gateway): config.yaml wins over .env for agent/display/timezone settings
Regression from the silent config→env bridge. The bridge at module import
time is correct for max_turns (unconditional overwrite), but every other
agent.*, display.*, timezone, and security bridge key was guarded by
'if X not in os.environ' — so a stale .env entry from an old 'hermes setup'
run would shadow the user's current config.yaml indefinitely.
Symptom: agent.max_turns: 500 in config.yaml, HERMES_MAX_ITERATIONS=60
in .env from an old setup, and the gateway silently capped at 60
iterations per turn. Gateway logs confirmed api_calls never exceeded 60.
Three changes:
1. gateway/run.py: drop the 'not in os.environ' guards for all agent.*,
display.*, timezone, and security.* bridge keys. config.yaml is now
authoritative for these settings — same semantics already in place
for max_turns, terminal.*, and auxiliary.*. Also surface the bridge
failure (previously 'except Exception: pass') to stderr so operators
see bridge errors instead of silently falling back to .env.
2. gateway/run.py: INFO-log the resolved max_iterations at gateway
start so operators can verify the config→env bridge did the right
thing instead of chasing a phantom budget ceiling.
3. hermes_cli/setup.py: stop writing HERMES_MAX_ITERATIONS to .env in
the setup wizard. config.yaml is the single source of truth. Also
clean up any stale .env entry left behind by pre-fix setups.
Regression tests in tests/gateway/test_config_env_bridge_authority.py
guard each config→env key against the 'stale .env shadows config' bug.
* fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log)
Three issues observed in production gateway.log during a rapid restart
chain on 2026-05-02, all fixed here.
1. _send_restart_notification logged unconditional success
adapter.send() catches provider errors (e.g. Telegram 'Chat not found')
and returns SendResult(success=False); it never raises. The caller
ignored the return value and always logged 'Sent restart notification
to <chat>' at INFO, producing a misleading success line directly
below the 'Failed to send Telegram message' traceback on every boot.
Now inspects result.success and logs WARNING with the error otherwise.
2. WhatsApp bridge SIGTERM on shutdown classified as fatal error
_check_managed_bridge_exit() saw the bridge's returncode -15 (our own
SIGTERM from disconnect()) and fired the full fatal-error path,
producing 'ERROR ... WhatsApp bridge process exited unexpectedly' plus
'Fatal whatsapp adapter error (whatsapp_bridge_exited)' on every
planned shutdown, immediately before the normal '✓ whatsapp
disconnected'. Adds a _shutting_down flag that disconnect() sets
before the terminate, and _check_managed_bridge_exit() returns None
for returncode in {0, -2, -15} while shutting down. OOM-kill (137)
and other non-signal exits still hit the fatal path.
3. restart_drain_timeout default 60s → 180s
On 2026-05-02 01:43:27 a user /restart fired while three agents were
mid-API-call (82s, 112s, 154s into their turns). The 60s drain budget
expired and all three were force-interrupted. 180s covers realistic
in-flight agent turns; users on very-long-reasoning models can still
raise it further via agent.restart_drain_timeout in config.yaml.
Existing explicit user values are preserved by deep-merge.
Tests
- tests/gateway/test_restart_notification.py: two new tests assert INFO
is only logged on SendResult(success=True) and WARNING with the error
string is logged on SendResult(success=False).
- tests/gateway/test_whatsapp_connect.py: parametrized test for
returncode in {0, -2, -15} proves shutdown-time exits are suppressed;
separate test proves returncode 137 (SIGKILL/OOM) still surfaces as
fatal even when _shutting_down is set.
- _check_managed_bridge_exit() reads _shutting_down via getattr-with-
default so existing _make_adapter() test helpers that bypass __init__
(pitfall #17 in AGENTS.md) keep working unmodified.
Covers PR #18224 fix for issue #18187 — when DiscordAdapter.connect() is
called a second time without an intervening disconnect(), the previous
commands.Bot must be closed before a new one is created. Otherwise both
websockets stay connected to Discord's gateway and both fire on_message,
producing double responses with different wording.
`_register_skill_group` captured the skill catalog in closure variables
(`entries` and `skill_lookup`) so the single `tree.add_command` call at
startup owned the only live copy. The closure is never re-entered after
startup, so `/reload-skills` — which rescans the on-disk skills dir and
refreshes the in-process `_skill_commands` registry — had no way to
propagate results into the `/skill` autocomplete on Discord. New skills
stayed invisible in the dropdown, and deleted skills returned
"Unknown skill" when the stale autocomplete entry was clicked.
The fix is purely a dataflow change: promote `entries` and `skill_lookup`
to instance attributes (`_skill_entries`, `_skill_lookup`), split the
collector-driven rebuild into a helper (`_refresh_skill_catalog_state`),
and add a public `refresh_skill_group()` method that re-runs the helper
and is safe to call at any point after the initial registration.
The gateway's `_handle_reload_skills_command` then iterates
`self.adapters` and calls `refresh_skill_group()` on any adapter that
exposes it (currently only Discord). Both sync and async implementations
are supported; adapters that don't override the method (Telegram's
BotCommand menu, Slack subcommand map, etc.) are silently skipped — the
in-process `reload_skills()` call covers them.
No `tree.sync()` is required because Discord fetches autocomplete
options dynamically on every keystroke — mutating the instance state the
callbacks already read from is sufficient. That sidesteps the per-app
command-bucket rate limit (~5 writes / 20 s) that made the previous
bulk-sync-on-reload approach unusable (#16713 context).
Tests: tests/gateway/test_reload_skills_discord_resync.py — five cases
covering (1) refresh replaces entries, (2) entries stay sorted after
refresh, (3) collector exception leaves cached state intact, (4)
`_refresh_skill_catalog_state` populates the instance attrs, (5)
orchestrator calls `refresh_skill_group()` on sync + async adapters and
skips adapters that don't expose it.