Follow-up on top of @kyan12's PR #20888 — same feature, cleaner shape,
wider coverage.
Changes:
- Drop the synthetic '[System note: ...]' in the internal MessageEvent.
The existing _is_resume_pending branch in _handle_message_with_agent
(run.py ~L13738) already injects a reason-aware recovery system note
on the next turn. With kyan's text in place the model saw two stacked
system notes. Now the event text is empty and the existing injection
path owns the wording.
- Drop SessionStore.list_resume_pending() as a new public method. The
filter is 8 lines inline in _schedule_resume_pending_sessions() —
one caller, no other pluggability need.
- Add 'restart_interrupted' to the auto-resume reason set. That's the
reason SessionStore.suspend_recently_active() stamps on sessions
recovered from a crash/OOM/SIGKILL (no .clean_shutdown marker).
Previously those sessions had to wait for a real user message to
auto-resume; now they continue automatically at startup like
drain-timeout interruptions do.
- Reasons live in a _AUTO_RESUME_REASONS frozenset at class scope so
future reasons (e.g. 'manual_resume_request') can be opted in with
one line.
Test coverage added:
- drain-timeout + crash-recovery both scheduled
- stale entries skipped (outside freshness window)
- suspended entries skipped (suspended > resume_pending)
- originless entries skipped (no routing target)
- disallowed reasons skipped (graceful forward-compat)
E2E verified end-to-end with a real on-disk SessionStore: 2 eligible
sessions scheduled, 2 ineligible skipped, empty-text internal events
delivered to the adapter.
Co-authored-by: Kevin Yan <kevyan1998@gmail.com>
When display.cleanup_progress (or display.platforms.<plat>.cleanup_progress)
is true, the gateway deletes tool-progress bubbles, long-running '⏳ Still
working...' notices, and status-callback messages after the final response
is delivered successfully. Currently effective on adapters that implement
delete_message (Telegram); silently no-ops elsewhere. Off by default.
Failed runs skip cleanup so bubbles stay as breadcrumbs.
Minimal plumbing: base.py's existing post_delivery_callback slot now chains
new registrations onto any existing callback (with per-callback exception
isolation) rather than clobbering. Stale-generation registrations are
rejected so they can't step on a fresher run's callbacks. This lets the
cleanup callback coexist with the background-review release hook already
registered on the same slot.
Co-authored-by: mrcharlesiv <Mrcharlesiv@gmail.com>
When terminal.backend is docker, inbound documents uploaded via messaging
platforms (Telegram, Slack, Discord, Feishu, Email, etc.) are cached at a host
path under ~/.hermes/cache/documents, but the container sandbox only sees them
at the auto-mounted /root/.hermes/cache/documents path.
This PR adds to_agent_visible_cache_path() in tools/credential_files.py (the
natural sibling to get_cache_directory_mounts()) and calls it at the
document-context-injection site in gateway/run.py so the agent always receives
a path it can open directly, matching the mount layout already established
by get_cache_directory_mounts() (#4846).
Scope: only Docker backend for now; other backends use different mount
semantics and are left unchanged until verified.
Fixes#18787
Two follow-ups on top of helix4u's slash-command sync hardening:
- Only suppress exceptions that are actually Discord 429 rate limits
(discord.RateLimited, HTTPException with status 429, or a clearly
rate-limit-named duck type). Arbitrary failures that happen to expose
a retry_after attribute now re-raise to the outer handler instead of
silently swallowing a cooldown.
- Move the sync-state JSON under $HERMES_HOME/gateway/ so the home root
stops collecting ad-hoc runtime files.
Added a test verifying unrelated exceptions don't get misclassified as
rate limits.
Extend the gateway_restart_notification flag to cover
_notify_active_sessions_of_shutdown — the message that fires just
before drain ("⚠️ Gateway restarting — Your current task will be
interrupted. Send any message after restart and I'll try to resume
where you left off.") sent to active sessions and home channels.
Same operator/end-user reasoning: on a Slack workspace shared with
end users, "Gateway restarting" reads as "the bot is broken" — the
operator should be able to suppress it consistently with the other
two lifecycle pings rather than having a partial opt-out.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-out toggle on PlatformConfig that gates both restart
lifecycle pings: the "♻ Gateway restarted" message sent to the chat
that issued /restart, and the "♻️ Gateway online" home-channel
startup notification. Defaults to True so existing deployments are
unaffected.
The motivating split is operator vs. end-user surfaces: a back-channel
like Telegram should keep these pings, while a Slack workspace shared
with end users should not surface gateway lifecycle noise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Remove dead metadata.get('reply_to') fallback in _send_raw_message;
nothing in the codebase ever sets 'reply_to' inside a metadata dict —
the key only appears as a top-level send_voice() keyword argument
- Simplify _status_thread_metadata construction in run.py to use a
single dict literal instead of create-then-mutate pattern; the
or-{} guard was dead since source.thread_id implies _progress_thread_id
is also set for Feishu
- Add yuqian@zmetasoft.com to AUTHOR_MAP for contributor attribution
Route Feishu topic progress, status, approval, stream, and fallback messages through threaded replies by preserving the originating message id as the reply target. Add regressions for tool progress topic metadata and Feishu metadata-driven reply routing.
Replaces the per-directory shadow-repo design with a single shared shadow
git store at ~/.hermes/checkpoints/store/. Object DB is now deduplicated
across every working directory the agent has ever touched; a dozen
worktrees of the same project cost near-zero in additional disk.
Why
---
Pre-v2 design had three compounding problems that let ~/.hermes/checkpoints/
grow to multi-GB on active machines:
1. Each working directory got its own full shadow git repo — no object
dedup across projects or across worktrees of the same project.
2. _prune() was a documented no-op: max_snapshots only limited the
/rollback listing. Loose objects accumulated forever.
3. Defaults: enabled=True, auto_prune=False — users paid the disk cost
without ever asking for /rollback.
Field report on a single workstation: 847 MB across 47 shadow repos,
mostly redundant clones of the hermes-agent source tree.
Changes
-------
- tools/checkpoint_manager.py: full rewrite. Single bare store, per-project
refs (refs/hermes/<hash>), per-project indexes (store/indexes/<hash>),
per-project metadata (store/projects/<hash>.json with workdir +
created_at + last_touch). On first v2 init, any pre-v2 per-directory
shadow repos are auto-migrated into legacy-<timestamp>/ so the new
store starts clean. _prune() now actually rewrites the per-project ref
to the last max_snapshots commits and runs git gc --prune=now. New
_enforce_size_cap() drops oldest commits round-robin across projects
when the store exceeds max_total_size_mb. _drop_oversize_from_index()
filters any single file larger than max_file_size_mb out of the snapshot.
- hermes_cli/checkpoints.py: new 'hermes checkpoints' CLI
(status / list / prune / clear / clear-legacy) for managing the store
outside a session.
- hermes_cli/config.py: flipped defaults — enabled=False, max_snapshots=20,
auto_prune=True. Added max_total_size_mb=500, max_file_size_mb=10.
Tightened DEFAULT_EXCLUDES (added target/, *.so/*.dylib/*.dll,
*.mp4/*.mov, *.zip/*.tar.gz, .worktrees/, .mypy_cache/, etc.).
- run_agent.py / cli.py / gateway/run.py: thread the new kwargs through
AIAgent and the startup auto_prune hooks.
- Tests rewritten to match v2 storage while keeping backwards-compat
coverage for the pre-v2 prune path (per-directory shadow repos under
base/ are still swept correctly for anyone mid-migration).
- Docs updated: user-guide/checkpoints-and-rollback.md explains the
shared store, new defaults, migration, and the new CLI;
reference/cli-commands.md documents 'hermes checkpoints'.
E2E validated
-------------
- Legacy migration: pre-v2 shadow repos auto-archived into legacy-<ts>/.
- Object dedup: two projects with an identical shared.py blob resolve to
7 total objects in the store (v1 would have stored the blob twice).
- max_snapshots=3 actually enforced: after 6 commits, list shows 3.
- Orphan prune: deleting a project's workdir + 'hermes checkpoints prune
--retention-days 0' removes its ref, index, and metadata; GC reclaims
the objects.
- max_file_size_mb=1 excludes a 2 MB weights.bin while keeping the
tracked source code files.
- hermes checkpoints {status,prune,clear,clear-legacy} all work from the
CLI without an agent running.
Breaking / migration
--------------------
No in-place data migration — legacy per-directory shadow repos are moved
into legacy-<timestamp>/ on first run. Old /rollback history is still
accessible by inspecting the archive with git; run
'hermes checkpoints clear-legacy' to reclaim the space when ready. Users
relying on /rollback must now set checkpoints.enabled=true (or pass
--checkpoints) explicitly.
- Fix /compact → /compress in context-overflow tips (closes#20020)
- Evict cached agent after session hygiene and /compress so system
prompt refreshes with current SOUL.md, memory, and skills
- Restore memory authority across compaction: change 'informational
background data' to 'authoritative reference data' in memory block
and SUMMARY_PREFIX, with backward-compatible regex
Based on:
- PR #20027 by @LeonSGP43
- PR #18767 by @MacroAnarchy
- PR #17380 by @vominh1919
PR #17121 boundary marker fix already merged to main (2eef395e1).
PR #9262 user-message anchoring already on main via _ensure_last_user_message_in_tail().
Reduces SSE event rate ~500/turn → ~20/turn via 50ms text-delta batching in
_dispatch(), which eliminates markdown re-render storms on Open WebUI. Also:
- Trim tool_call.arguments in the response.completed event to 100KB
(prevents silent hangs on 848KB+ single-line SSE events).
- Catch-all exception handlers in _write_sse_responses() + _write_sse_chat_completion()
emit a proper error chunk instead of TransferEncodingError from incomplete
chunked encoding when the agent crashes mid-stream.
- MAX_REQUEST_BYTES 1MB → 10MB; pass client_max_size to aiohttp Application to
avoid silent 400s on truncated request bodies for long conversations.
Salvage of #17552 (api_server portion only). The contrib/openwebui-filter/
payload from that PR — Open WebUI Filter Function + benchmark writeup — is
a client-side user-installable add-on and doesn't need to live in the repo;
dropped here. Closes#17537.
Co-authored-by: bogerman1 <93757150+bogerman1@users.noreply.github.com>
The dispatch_once function already accepts a max_spawn parameter but the
gateway was calling it without passing any value, effectively ignoring
the configuration. This change reads kanban.max_spawn from config.yaml
and passes it through, allowing users to limit concurrent kanban tasks.
This prevents resource exhaustion scenarios where kanban dispatcher
spawns too many parallel workers on constrained hardware.
Salvage of #11350. Kept:
- Code: add an explicit /voice join Choice in the slash UI (runner accepts both 'join' and 'channel' but only 'channel' was in autocomplete).
- Docs: Server Members Intent is conditional (only needed if DISCORD_ALLOWED_USERS contains usernames); SSRC → user_id mapping uses the voice websocket SPEAKING opcode, not the Members intent.
Dropped from the original PR:
- HERMES_DISCORD_VOICE_PACKET_DUMP — this env var doesn't exist on main (it was in a different PR that isn't merged).
- DISCORD_PROXY docs — already documented on current main.
- DISCORD_ALLOW_MENTION_* docs — already on main.
- "barge-in mode" rewrite — current main actually does pause the listener during TTS (VoiceReceiver.pause() at discord.py:192); there is no barge_in_guard/barge_in_rms on main.
Co-authored-by: Michel Belleau <michel.belleau@malaiwah.com>
Mirror _message_thread_id_for_typing() with _message_thread_id_for_send():
both now map the General forum topic (thread id "1") to None upfront.
That removes the need for the retry-without-thread fallback in send_typing()
entirely — if _message_thread_id_for_typing() returns a non-None value, it's
a real user-created topic and falling back to the root chat is never correct.
If Telegram rejects the typing action (e.g. topic deleted mid-session), we
swallow it at debug level instead of bleeding the indicator into All Messages.
Updates the General-topic typing regression test to assert the new single-call
contract.
Fix three regressions introduced by PR #18370 (lazy session creation):
1. _finalize_session() uses stale session_key after compression (#20001)
2. session_key not synced after auto-compression in run_conversation (#20001)
3. pending_title ValueError leaves title wedged forever (#19029)
4. Gateway silently swallows null responses when agent did work (#18765)
5. One-time cleanup for accumulated ghost compression continuations (#20001)
Changes:
- tui_gateway/server.py: _finalize_session() now uses agent.session_id
(falls back to session_key when agent is None). Refactor
_sync_session_key_after_compress() with clear_pending_title and
restart_slash_worker policy flags. Call it post-run_conversation()
to sync session_key after auto-compression. Add ValueError handler
to pending_title flush.
- gateway/run.py: Extract _normalize_empty_agent_response() helper that
consolidates failed/partial/null response handling. Surfaces user-facing
error when agent did work (api_calls > 0) but returned no text.
- hermes_state.py: Add finalize_orphaned_compression_sessions() — marks
ghost continuation sessions as ended (non-destructive, preserves data).
- cli.py: One-time startup migration for orphaned compression sessions.
Test changes:
- tests/test_tui_gateway_server.py: Update pending_title ValueError test
for post-#18370 architecture (title applied post-message, not at create).
- tests/test_lazy_session_regressions.py: 14 new regression tests covering
all fixed paths.
The Telegram/Discord /model pickers currently call
list_authenticated_providers(), which returns every provider whose
credentials resolve locally and every model in its curated snapshot.
Two failure modes fall out:
- OpenRouter rows can include IDs the live catalog no longer carries.
- Provider rows can surface with zero callable models (e.g. a slug
whose credential pool entry exists but has nothing behind it).
list_picker_providers() wraps the base function and post-processes the
result so the interactive picker only shows models the user can
actually select:
- OpenRouter's models come from fetch_openrouter_models() (live-catalog
filtered against the curated OPENROUTER_MODELS snapshot).
- Rows with an empty models list are dropped, except custom endpoints
(is_user_defined=True with an api_url) where the user may enter
model ids manually.
- All other fields pass through unchanged.
The gateway /model handler switches to the new helper for the
interactive picker payload only. Typed /model <name> and the text
fallback list stay on list_authenticated_providers() so nothing is
hidden from power users or platforms without a picker.
Covered by nine focused unit tests in
tests/hermes_cli/test_list_picker_providers.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Salvages @Es1la's PR #13632 — a non-numeric timestamp in the persisted
feishu dedup state crashed adapter startup with ValueError/TypeError
from the unguarded float() call. Wrap the float() conversion in
try/except; skip the bad key and keep loading the rest.
The original PR also restructured existing TestDedupTTL tests to use
tempfile.TemporaryDirectory + HERMES_HOME patching — that was
test-hygiene scope creep unrelated to the bug. Kept only the
malformed-timestamp fix and added a focused regression test.
Feishu post-type 'md' elements do not render markdown tables.
When table content is sent as post (triggered by **bold** matching
_MARKDOWN_HINT_RE), the message appears blank on the client.
Add _MARKDOWN_TABLE_RE to detect markdown table syntax and force
text mode for table content, ensuring it is visible as plain text.
After PR #13725 replaced the module-level _LOCK_DIR/_LOCK_FILE constants
with a dynamic _get_lock_paths() helper, the xdist-isolation fixture
needs to patch the function instead of the removed constants.
- scheduler.py: Replace static _hermes_home with dynamic _get_hermes_home() function
to support profile switching at runtime (HERMES_HOME override)
- scheduler.py: Replace static _LOCK_DIR/_LOCK_FILE with _get_lock_paths() function
for profile-aware lock path resolution
- feishu.py: Add receive_id_type detection (oc_/ou_ -> open_id, else chat_id)
to fix Feishu API '[230001] ext=invalid receive_id' error for user DMs
* revert(gateway): remove stale-code self-check and auto-restart
Removes the _detect_stale_code / _trigger_stale_code_restart mechanism
introduced in #17648 and iterated in #19740. On every incoming message
the gateway compared the boot-time git HEAD SHA to the current SHA on
disk, and if they differed it would reply with
Gateway code was updated in the background --
restarting this gateway so your next message runs
on the new code. Please retry in a moment.
and then kick off a graceful restart. This is unwanted behaviour:
users who run a long-lived gateway and do their own ad-hoc git
operations on the checkout end up with their chat interrupted and
the current message dropped every time HEAD moves, with no way to
opt out.
If an operator really needs the old protection against stale
sys.modules after "hermes update", the SIGKILL-survivor sweep in
hermes update (hermes_cli/main.py, also tagged #17648) already
handles the supervisor-respawn case on its own.
Removed:
gateway/run.py:
- _STALE_CODE_SENTINELS, _GIT_SHA_CACHE_TTL_SECS
- _read_git_head_sha(), _compute_repo_mtime() module helpers
- class-level _boot_wall_time / _boot_repo_mtime / _boot_git_sha /
_stale_code_restart_triggered defaults
- __init__ boot-snapshot block (_boot_*, _cached_current_sha*,
_repo_root_for_staleness, _stale_code_notified)
- _current_git_sha_cached(), _detect_stale_code(),
_trigger_stale_code_restart() methods
- stale-code check + user-facing restart notice at the top of
_handle_message()
tests/gateway/test_stale_code_self_check.py (deleted, 412 lines)
No new logic added. Zero remaining references to any removed
symbol. Gateway test suite passes the same 4589 tests it passed
before; the 3 pre-existing unrelated failures (discord free-channel,
feishu bot admission, teams typing) are unchanged by this commit.
* feat(i18n): add display.language for static message translation (zh/ja/de/es)
Adds a thin-slice i18n layer covering the highest-impact static user-facing
messages: the CLI dangerous-command approval prompt and a handful of gateway
slash-command replies (restart-drain, goal cleared, approval expired, config
read/save errors).
Out of scope (stays English): agent responses, log lines, tool outputs,
slash-command descriptions, error tracebacks.
Infrastructure:
- agent/i18n.py: catalog loader, t() helper, language resolution
(HERMES_LANGUAGE env var > display.language config > en)
- locales/{en,zh,ja,de,es}.yaml: ~19 translated strings per language
- display.language in DEFAULT_CONFIG (hermes_cli/config.py)
Tests:
- tests/agent/test_i18n.py: 21 tests covering catalog parity, placeholder
parity across locales, fallback behavior, env-var override, alias
normalization, missing-key graceful degradation.
Docs:
- website/docs/user-guide/configuration.md: display.language entry plus a
short section explaining scope so users don't expect agent responses to
translate via this knob.
Widens @Krionex's PR #16933 fix to cover the second bug class at the sibling
site. natural mode used to pass env values through int() before the PR
caught mis-typed values crashing the gateway; custom mode had the exact
same bug one branch away (HERMES_HUMAN_DELAY_MIN_MS=oops in custom mode
still crashed). Same try/except/fallback pattern, scoped to the two
int() calls that feed random.uniform().
When context compression rotates the agent's session_id to a new
child session, the API server was still returning the stale parent
session_id in the X-Hermes-Session-Id response header.
This caused external clients to keep sending the old session_id,
loading uncompressed parent history instead of the compressed
continuation.
Fix: _run_agent() now includes the effective session_id in its
result dict, and the response header uses it instead of the
original provided session_id.
* feat(api-server): X-Hermes-Session-Key header for long-term memory scoping
API Server integrations (Open WebUI, custom web UIs) can now pass a stable
per-channel identifier via X-Hermes-Session-Key that scopes long-term memory
(Honcho, etc.) independently of the transcript-scoped X-Hermes-Session-Id.
This matches the native gateway's session_key / session_id split: one stable
key per assistant channel, many independent transcripts that rotate on /new.
- _create_agent and _run_agent accept gateway_session_key and pass it to
AIAgent(gateway_session_key=...), which is already honored by the Honcho
memory provider (plugins/memory/honcho/client.py resolve_session_name).
- New shared helper _parse_session_key_header applies the same API-key
gate, control-character sanitization, and a 256-char length cap as the
existing session-id header.
- All three agent endpoints honor the header: /v1/chat/completions,
/v1/responses, /v1/runs. JSON and SSE responses echo it back.
- /v1/capabilities advertises session_key_header so clients can
feature-detect.
Closes#20060.
Co-authored-by: Andy Stewart <lazycat.manatee@gmail.com>
* chore: AUTHOR_MAP entry for manateelazycat
---------
Co-authored-by: Andy Stewart <lazycat.manatee@gmail.com>
WeCom doesn't pad base64 aeskey, causing Python strict mode decode failure
on media/image/file messages. Add automatic padding before base64 decode:
aes_key + '=' * ((4 - len(aes_key) % 4) % 4).
Salvages the AES padding fix from @chengoak's PR #17040. The SSRF whitelist
entry for a private COS bucket hostname was dropped as it belongs in user
config, not the built-in trusted-private-IP-hosts list. The debug-level
full-body info log was dropped to avoid logging potentially sensitive
message content at INFO level.
The YAML-to-env-var bridge in load_gateway_config() mapped every Discord
and Telegram config key (require_mention, auto_thread, reactions, etc.)
except reply_to_mode. Users setting discord.reply_to_mode or
telegram.reply_to_mode in ~/.hermes/config.yaml got no effect — the
adapter only read the env var, which nothing populated from YAML.
Add the missing bridge for both platforms, following the existing pattern.
Top-level <platform>.reply_to_mode preferred, falls back to
<platform>.extra.reply_to_mode, env var never overwritten. Handles YAML
1.1 bare `off` → Python False coercion.
This is a re-submission of the work from #9837 and #13930, which both
implemented the same fix but neither landed (see co-authors below).
Co-authored-by: Matteo De Agazio <hypnosis.mda@gmail.com>
Co-authored-by: ishardo <239075732+ishardo@users.noreply.github.com>
discover_fallback_ips() filtered out any DoH-resolved IP that also appeared
in the system resolver's answer set, on the assumption that the system IP
was unreachable. When DoH and system DNS agreed (a common case), the
function returned the hardcoded _SEED_FALLBACK_IPS list instead — and on
networks where those seed addresses are not routable, the Telegram fallback
transport had nothing usable to retry against and polling failed.
Drop the system_ips exclusion so DoH-confirmed IPs are preserved regardless
of system DNS overlap. The TelegramFallbackTransport already tries the
primary path first via system DNS, then falls through to the IP-rewrite
path on connect failure; including the same IP in both lanes lets a
transient primary failure recover via the explicit IP route instead of
escalating to seed addresses.
Update the two tests that codified the old exclusion to reflect the new,
inclusion-by-default behaviour.
Fixes#14520
* revert(gateway): remove stale-code self-check and auto-restart
Removes the _detect_stale_code / _trigger_stale_code_restart mechanism
introduced in #17648 and iterated in #19740. On every incoming message
the gateway compared the boot-time git HEAD SHA to the current SHA on
disk, and if they differed it would reply with
Gateway code was updated in the background --
restarting this gateway so your next message runs
on the new code. Please retry in a moment.
and then kick off a graceful restart. This is unwanted behaviour:
users who run a long-lived gateway and do their own ad-hoc git
operations on the checkout end up with their chat interrupted and
the current message dropped every time HEAD moves, with no way to
opt out.
If an operator really needs the old protection against stale
sys.modules after "hermes update", the SIGKILL-survivor sweep in
hermes update (hermes_cli/main.py, also tagged #17648) already
handles the supervisor-respawn case on its own.
Removed:
gateway/run.py:
- _STALE_CODE_SENTINELS, _GIT_SHA_CACHE_TTL_SECS
- _read_git_head_sha(), _compute_repo_mtime() module helpers
- class-level _boot_wall_time / _boot_repo_mtime / _boot_git_sha /
_stale_code_restart_triggered defaults
- __init__ boot-snapshot block (_boot_*, _cached_current_sha*,
_repo_root_for_staleness, _stale_code_notified)
- _current_git_sha_cached(), _detect_stale_code(),
_trigger_stale_code_restart() methods
- stale-code check + user-facing restart notice at the top of
_handle_message()
tests/gateway/test_stale_code_self_check.py (deleted, 412 lines)
No new logic added. Zero remaining references to any removed
symbol. Gateway test suite passes the same 4589 tests it passed
before; the 3 pre-existing unrelated failures (discord free-channel,
feishu bot admission, teams typing) are unchanged by this commit.
* fix(agent): stateful streaming scrubber for reasoning-block leaks (#17924)
Per-delta _strip_think_blocks ran at _fire_stream_delta and destroyed
downstream state. When MiniMax-M2.7 / DeepSeek / Qwen3 streamed a tag
split across deltas (delta1='<think>', delta2='Let me check'), the
regex case-2 match erased delta1 entirely, so CLI/gateway state
machines never learned a block was open and leaked delta2 as content.
Raw consumers (ACP, api_server, TTS) had no downstream defense at all.
Replace the per-delta regex with a stateful StreamingThinkScrubber
that survives delta boundaries:
- Closed <tag>X</tag> pairs always stripped (matches _strip_think_blocks
case 1).
- Unterminated open at block boundary enters a block; content
discarded until close tag arrives. At end-of-stream, held
content is dropped.
- Orphan close tags stripped without boundary gating.
- Partial tags at delta boundaries held back until resolved.
- Block-boundary rule (start-of-stream, after \n, or
whitespace-only since last \n) preserves prose that mentions
tag names.
Reset at turn start alongside the existing context scrubber; flush at
turn end so a benign '<' held back at end-of-stream reaches the UI.
E2E-verified on live OpenRouter->MiniMax-m2 streams: closed pairs
strip cleanly, first word of post-block content is preserved, pure
content passes through unchanged. Stefan's screenshot case (#17924)
— 'Let me check' getting chopped to ' me check' — no longer happens.
Final _strip_think_blocks calls on completed strings (final_response,
replay, compression) are preserved; only the streaming per-delta call
site switched to the scrubber.
After PR #20105 (dispatcher skips ready tasks whose assignee fails
``profile_exists()`` to prevent the orion-cc/orion-research crash
loop), the gateway and CLI emit a spurious "kanban dispatcher stuck:
ready queue non-empty for N consecutive ticks but 0 workers spawned"
warning every 5 minutes on multi-lane setups where the queue is
steadily full of human-pulled work assigned to terminal lanes.
The warn is intended to catch real failure modes (broken PATH,
missing venv, credential loss for a real Hermes profile). On a
multi-lane host it fires forever even though everything is healthy:
the dispatcher correctly chose not to spawn, and there is nothing
for the operator to fix.
Changes:
* ``DispatchResult`` gains a ``skipped_nonspawnable`` field
(separate from ``skipped_unassigned``) so callers can distinguish
"task missing an owner — operator should route it" from "task
owned by a control-plane lane — terminal will pull it".
* ``dispatch_once`` routes the ``not profile_exists(assignee)`` skip
into the new bucket (was lumped into ``skipped_unassigned``).
* New helper ``has_spawnable_ready(conn)`` returns True iff at least
one ready+assigned+unclaimed task in the DB has an assignee that
maps to a real Hermes profile. Falls back to legacy "any
ready+assigned" when ``profile_exists`` is unimportable so degraded
installs still surface the original warn.
* The gateway dispatcher (``gateway/run.py``) and the CLI standalone
daemon (``hermes_cli/kanban.py``) both swap their cheap
``ready_nonempty`` probe to use ``has_spawnable_ready``. Stuck-warn
now fires only when there is genuine spawnable work the dispatcher
failed to start.
* CLI dispatch output prints ``Skipped (non-spawnable assignee —
terminal lane, OK)`` for visibility without alarm.
Tests:
* New ``has_spawnable_ready`` cases (empty queue, terminal-lane
only, mixed real+terminal).
* New ``test_dispatch_skips_nonspawnable_into_separate_bucket``
verifies the bucketing change.
* Updated ``test_dispatch_skips_unassigned`` to assert no
cross-leak.
* Added ``all_assignees_spawnable`` fixture in
``tests/hermes_cli/conftest.py`` and threaded it through dispatcher
tests that use synthetic assignees ("alice", "bob"). PR #20105
(the parent commit) silently broke 8 such tests by routing those
assignees into ``skipped_nonspawnable`` instead of spawning; this
PR repairs them as part of the same code area.
Verified locally: 246/246 kanban-suite tests pass.
Stacks on top of fix/kanban-dispatcher-skip-missing-profile-2026-05-05
(PR #20105). Reviewer: this PR is meant to merge AFTER #20105.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add EMAIL_ALLOWED_USERS check in EmailAdapter._dispatch_message()
to silently discard emails from senders not in the allowlist. This
prevents the adapter from creating thread context and dispatching a
MessageEvent for unauthorized senders, which could race with the
gateway authorization check and result in SMTP replies being sent
despite the handler returning None.
Test: tests/gateway/test_email.py::TestDispatchMessage::test_non_allowlisted_sender_dropped
Test: tests/gateway/test_email.py::TestDispatchMessage::test_allowlisted_sender_proceeds
Test: tests/gateway/test_email.py::TestDispatchMessage::test_empty_allowlist_allows_all
The stale-code self-check (Issue #17648) used sentinel-file mtimes to
decide whether the gateway survived a `hermes update` with stale
`sys.modules`. That signal false-positives on any write to the
sentinel files — including agent-driven edits during Hermes-on-Hermes
dev sessions. Telling the agent to patch `run_agent.py` would flip
the check to True on the next user message and force a gateway
restart even though no update happened.
Switch the signal to `git rev-parse HEAD`. Agent file edits don't
move HEAD; `hermes update` (git pull) always does. Reading .git/HEAD
directly (no subprocess) with a 5s cache keeps the overhead negligible
on bursty chats. Non-git installs short-circuit to False — the
stale-modules class can't occur without a git-backed update path, so
there's nothing to detect.
The legacy `_compute_repo_mtime` helper is kept but unused by
detection, reserved as a fallback hook for future pip-install update
paths.
- _read_git_head_sha(): resolves HEAD across main checkout, worktree
(follows `gitdir:` + `commondir` pointers), and packed-refs layouts.
- _current_git_sha_cached(): per-runner 5s SHA cache.
- _detect_stale_code(): boot SHA vs current SHA, returns False when
either is unavailable.
- Tests cover all four layouts, the agent-edits-don't-trigger
regression, and cache behavior.
Refs #17648.
Four production-readiness additions to topic mode:
1. /topic off — clean disable path. Flips telegram_dm_topic_mode.enabled
to 0 and clears telegram_dm_topic_bindings for this chat. Previously
users had to edit state.db with sqlite3 to turn the feature off.
Idempotent: calling /topic off when the chat was never enabled
returns a friendly no-op message.
2. /topic help — inline usage printed in the DM so users don't have to
visit docs to discover /topic off, /topic <session-id>, etc.
3. Authorization gate. /topic mutates SQLite side tables and flips the
root DM into a lobby, so the action must be authorized. Now calls
self._is_user_authorized(source); unauthorized DMs get a refusal
instead of activation. Defense in depth on top of the gateway's
existing pre-route auth.
4. BotFather screenshot debounce. A user repeatedly running /topic
while Threads Settings is still disabled would previously re-upload
the same screenshot every time. Now rate-limited to one send per
5 minutes per chat. /topic off resets the counter so re-enabling
starts fresh.
Command-def args hint updated: /topic [off|help|session-id].
Docs:
- New /topic subcommands table at the top of the multi-session section
- Disable instructions updated to recommend /topic off first, with the
raw SQL fallback kept for bulk cleanup
- Under-the-hood list extended with the capability-hint debounce and
the authorization gate
Tests (6 new):
- /topic help returns usage and doesn't create topic tables
- /topic off disables mode AND clears bindings
- /topic off is idempotent when never enabled
- Unauthorized users get refusal, no tables created
- Capability-hint debounce is per-chat
- /topic off resets both lobby and capability debounce counters
All 402 targeted tests pass. Full gateway sweep: 4809/4810
(pre-existing test_teams::test_send_typing unrelated).
Five follow-ups to topic mode based on integration audit:
1. ON DELETE CASCADE on telegram_dm_topic_bindings.session_id. Session
pruning (manual /delete, auto-cleanup, any future prune job) would
have thrown 'FOREIGN KEY constraint failed' for sessions bound to a
topic. Migration bumped to v2, rebuilds the bindings table in place
if FK lacks CASCADE. Idempotent; only runs once per DB.
2. Never auto-rename operator-declared topics. If an operator has
extra.dm_topics configured AND a user runs /topic, messages in those
pre-declared topics would previously trigger auto-rename and silently
mutate operator config. _rename_telegram_topic_for_session_title now
early-returns when _get_dm_topic_info returns a dict for this
(chat_id, thread_id). Uses class-based lookup (not hasattr) so
MagicMock test fixtures don't accidentally trip the guard.
3. General topic handling. Telegram's General (pinned top) topic in a
forum-enabled private chat may send messages with message_thread_id=1
or omit thread_id entirely depending on client. Both are now treated
as the root lobby, not a topic lane. Prevents users from
accidentally burning a session on the General topic.
4. Debounce the root-lobby reminder. 30-second cooldown per chat so a
user who forgets topic mode is enabled and types ten messages in the
root gets one reminder, not ten. Explicit command replies
(/new-in-lobby, /topic <session-id>) still land every time.
5. Docs: added under-the-hood invariants for the above, plus a
Downgrade section explaining that rolling back to a pre-/topic
Hermes build leaves the DB tables orphaned but harmless — DMs just
revert to native per-thread isolation.
Tests:
- test_operator_declared_topic_is_not_auto_renamed
- test_general_topic_is_treated_as_root_lobby
- test_lobby_reminder_is_debounced_per_chat
- test_binding_survives_session_deletion_via_cascade
- test_migration_rebuilds_v1_binding_table_with_cascade_fk
Validated: 4803/4804 tests pass (tests/gateway/ + tests/test_hermes_state.py).
Sole failure is a pre-existing test_teams::test_send_typing flake
unrelated to this PR.
Follow-up on @EmelyanenkoK's feat: add Telegram DM topic-mode sessions.
Three issues:
1. Split-brain session state. After get_or_create_session() returned a
SessionEntry for a topic lane, the handler was mutating
.session_id in place to the binding's target, but never persisting
the switch through SessionStore. The sessions.json session_key →
session_id map kept pointing at the lane's natural id; any reader
that reloaded from disk saw the wrong id. Fixed by routing through
SessionStore.switch_session(), which _save()s the mapping and ends
the old session in SQLite like /resume does.
2. /new inside a topic was a one-message no-op. Reset created a new
session but left the telegram_dm_topic_bindings row pointing at the
old session_id, so the next message's binding lookup switched right
back. Now _handle_reset_command rebinds the topic to the new
session_id after reset.
3. is_telegram_session_linked_to_topic and
list_unlinked_telegram_sessions_for_user both called
apply_telegram_topic_migration() on read, contradicting the PR's
own invariant that migration only runs on explicit /topic opt-in.
They now tolerate missing topic tables and return empty/False.
Also: _telegram_topic_mode_enabled() now only treats True as enabled
(not any truthy return), so test fixtures with MagicMock session_db
don't accidentally flip every DM into lobby mode — this was breaking
4 pre-existing test_status_command tests.
Tests:
- New regression: /new inside a topic must update the binding row
(test_new_inside_telegram_topic_rewrites_binding_to_new_session).
- _make_runner now stubs switch_session so existing restore tests
still exercise the new code path.
Validated end-to-end with real SessionDB + SessionStore:
readers on fresh DB don't create topic tables; enable creates them;
binding override persists across SessionStore restart; /new rebinds
and the new id survives a restart.
Co-authored-by: EmelyanenkoK <emelyanenko.kirill@gmail.com>