When a kanban worker subprocess exits rc=0 but its task is still in
status='running', the agent almost certainly answered the task
conversationally without calling kanban_complete or kanban_block. The
dispatcher used to classify this as a generic crash and respawn, which
loops forever on small local models (gemma4-e2b q4 etc.) that keep
returning clean but unproductive output.
Dispatcher changes:
- The waitpid reap loop at the top of dispatch_once now records each
reaped child's raw exit status in a bounded module registry
(_recent_worker_exits, TTL 600s, size cap 4096).
- _classify_worker_exit distinguishes clean_exit / nonzero_exit /
signaled / unknown using os.WIFEXITED / WIFSIGNALED.
- detect_crashed_workers consults the classification when a worker
is found dead. clean_exit → protocol_violation event + immediate
circuit-breaker trip (failure_limit=1). Everything else keeps the
existing crashed-event + counter behavior.
- DispatchResult.auto_blocked now includes protocol-violation trips.
Gateway fix (Bug A in #20894):
- gateway.run._notify_active_sessions_of_shutdown snapshots
self.adapters with list(...) before iterating. adapter.send() can
hit a fatal-error path that pops the adapter from the dict, which
was raising 'RuntimeError: dictionary changed size during iteration'
during shutdown.
Regression tests:
- test_detect_crashed_workers_protocol_violation_auto_blocks verifies
rc=0 + still-running → status=blocked on first occurrence with
protocol_violation + gave_up events and NO crashed event.
- test_detect_crashed_workers_nonzero_exit_uses_default_limit verifies
non-zero exits keep the existing 2-strike behavior.
Closes#20894.
Skills that produce large/lossless images (e.g. info-graph, where a
rendered JPG is 1-2 MB) currently lose quality in Telegram delivery
because `_IMAGE_EXTS` membership routes the file through
`send_multiple_images` → `sendMediaGroup`, which Telegram's server
re-encodes to JPEG @ 1280px max edge. The original bytes only survive
when the file goes through `send_document`, which the dispatch tables
in three places (`_process_message_background`, `_deliver_media_from_response`,
and the `send_message` tool's telegram path) only reach for files
whose extension is NOT in `_IMAGE_EXTS`.
This commit adds an `[[as_document]]` directive that mirrors the
existing `[[audio_as_voice]]` shape: a skill emits the directive once
in its response, and every image-extension MEDIA: file in that response
is delivered via `send_document` instead of `send_multiple_images` /
`sendPhoto`. The directive is detected at the dispatch sites (which see
the raw response) and the directive string is stripped from the
user-visible cleaned text in `extract_media` so it never leaks.
Granularity is intentionally all-or-nothing per response, matching
[[audio_as_voice]]'s scope. Skills that need fine control can split into
two responses.
Verified the targeted use case: info-graph emits
信息图已生成(...)
[[as_document]]
MEDIA:/tmp/info-graph-x/infographic.jpg
→ Telegram receives `infographic.jpg` via sendDocument, original 1MB
JPEG bytes preserved, no recompression. Forwarding and download
filenames stay clean (`infographic.jpg`).
Tests: +3 cases in TestExtractMedia covering directive strip, isolation
from voice flag, and coexistence with [[audio_as_voice]]. All
113 pre-existing media/extract/send tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up on top of Zyproth's session-source cache: swap the unbounded
dict for an OrderedDict with a 512-entry LRU cap so long-running
gateways can't accumulate stale entries for dead sessions forever.
- self._session_sources is now an OrderedDict
- _cache_session_source() move_to_end + popitem(last=False) above cap
- _get_cached_session_source() move_to_end on hit (LRU read bump)
- restart_test_helpers.py wires OrderedDict + _session_sources_max
Flip the default for HERMES_REDACT_SECRETS from off to on so the redactor
already wired into send_message_tool, logs, and tool output actually runs
on a fresh install.
- agent/redact.py: env-var default "" → "true"
- hermes_cli/config.py: DEFAULT_CONFIG security.redact_secrets True;
two config-template comments rewritten
- gateway/run.py + cli.py: startup log / banner warning when the user
has explicitly opted out, so the downgrade is visible in agent.log
and at CLI banner time
- docs/reference/environment-variables.md: description reconciled
- tests: flipped the default-pin, restructured the force=True
regression test to explicit-false instead of unset
Users who need raw credential values (redactor development) can still
opt out via security.redact_secrets: false in config.yaml or
HERMES_REDACT_SECRETS=false in .env.
Closes#17691.
Addresses #20785 (short-term output-pipeline recommendation).
Follow-up on top of @kyan12's PR #20888 — same feature, cleaner shape,
wider coverage.
Changes:
- Drop the synthetic '[System note: ...]' in the internal MessageEvent.
The existing _is_resume_pending branch in _handle_message_with_agent
(run.py ~L13738) already injects a reason-aware recovery system note
on the next turn. With kyan's text in place the model saw two stacked
system notes. Now the event text is empty and the existing injection
path owns the wording.
- Drop SessionStore.list_resume_pending() as a new public method. The
filter is 8 lines inline in _schedule_resume_pending_sessions() —
one caller, no other pluggability need.
- Add 'restart_interrupted' to the auto-resume reason set. That's the
reason SessionStore.suspend_recently_active() stamps on sessions
recovered from a crash/OOM/SIGKILL (no .clean_shutdown marker).
Previously those sessions had to wait for a real user message to
auto-resume; now they continue automatically at startup like
drain-timeout interruptions do.
- Reasons live in a _AUTO_RESUME_REASONS frozenset at class scope so
future reasons (e.g. 'manual_resume_request') can be opted in with
one line.
Test coverage added:
- drain-timeout + crash-recovery both scheduled
- stale entries skipped (outside freshness window)
- suspended entries skipped (suspended > resume_pending)
- originless entries skipped (no routing target)
- disallowed reasons skipped (graceful forward-compat)
E2E verified end-to-end with a real on-disk SessionStore: 2 eligible
sessions scheduled, 2 ineligible skipped, empty-text internal events
delivered to the adapter.
Co-authored-by: Kevin Yan <kevyan1998@gmail.com>
When display.cleanup_progress (or display.platforms.<plat>.cleanup_progress)
is true, the gateway deletes tool-progress bubbles, long-running '⏳ Still
working...' notices, and status-callback messages after the final response
is delivered successfully. Currently effective on adapters that implement
delete_message (Telegram); silently no-ops elsewhere. Off by default.
Failed runs skip cleanup so bubbles stay as breadcrumbs.
Minimal plumbing: base.py's existing post_delivery_callback slot now chains
new registrations onto any existing callback (with per-callback exception
isolation) rather than clobbering. Stale-generation registrations are
rejected so they can't step on a fresher run's callbacks. This lets the
cleanup callback coexist with the background-review release hook already
registered on the same slot.
Co-authored-by: mrcharlesiv <Mrcharlesiv@gmail.com>
When terminal.backend is docker, inbound documents uploaded via messaging
platforms (Telegram, Slack, Discord, Feishu, Email, etc.) are cached at a host
path under ~/.hermes/cache/documents, but the container sandbox only sees them
at the auto-mounted /root/.hermes/cache/documents path.
This PR adds to_agent_visible_cache_path() in tools/credential_files.py (the
natural sibling to get_cache_directory_mounts()) and calls it at the
document-context-injection site in gateway/run.py so the agent always receives
a path it can open directly, matching the mount layout already established
by get_cache_directory_mounts() (#4846).
Scope: only Docker backend for now; other backends use different mount
semantics and are left unchanged until verified.
Fixes#18787
Extend the gateway_restart_notification flag to cover
_notify_active_sessions_of_shutdown — the message that fires just
before drain ("⚠️ Gateway restarting — Your current task will be
interrupted. Send any message after restart and I'll try to resume
where you left off.") sent to active sessions and home channels.
Same operator/end-user reasoning: on a Slack workspace shared with
end users, "Gateway restarting" reads as "the bot is broken" — the
operator should be able to suppress it consistently with the other
two lifecycle pings rather than having a partial opt-out.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-out toggle on PlatformConfig that gates both restart
lifecycle pings: the "♻ Gateway restarted" message sent to the chat
that issued /restart, and the "♻️ Gateway online" home-channel
startup notification. Defaults to True so existing deployments are
unaffected.
The motivating split is operator vs. end-user surfaces: a back-channel
like Telegram should keep these pings, while a Slack workspace shared
with end users should not surface gateway lifecycle noise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Remove dead metadata.get('reply_to') fallback in _send_raw_message;
nothing in the codebase ever sets 'reply_to' inside a metadata dict —
the key only appears as a top-level send_voice() keyword argument
- Simplify _status_thread_metadata construction in run.py to use a
single dict literal instead of create-then-mutate pattern; the
or-{} guard was dead since source.thread_id implies _progress_thread_id
is also set for Feishu
- Add yuqian@zmetasoft.com to AUTHOR_MAP for contributor attribution
Route Feishu topic progress, status, approval, stream, and fallback messages through threaded replies by preserving the originating message id as the reply target. Add regressions for tool progress topic metadata and Feishu metadata-driven reply routing.
Replaces the per-directory shadow-repo design with a single shared shadow
git store at ~/.hermes/checkpoints/store/. Object DB is now deduplicated
across every working directory the agent has ever touched; a dozen
worktrees of the same project cost near-zero in additional disk.
Why
---
Pre-v2 design had three compounding problems that let ~/.hermes/checkpoints/
grow to multi-GB on active machines:
1. Each working directory got its own full shadow git repo — no object
dedup across projects or across worktrees of the same project.
2. _prune() was a documented no-op: max_snapshots only limited the
/rollback listing. Loose objects accumulated forever.
3. Defaults: enabled=True, auto_prune=False — users paid the disk cost
without ever asking for /rollback.
Field report on a single workstation: 847 MB across 47 shadow repos,
mostly redundant clones of the hermes-agent source tree.
Changes
-------
- tools/checkpoint_manager.py: full rewrite. Single bare store, per-project
refs (refs/hermes/<hash>), per-project indexes (store/indexes/<hash>),
per-project metadata (store/projects/<hash>.json with workdir +
created_at + last_touch). On first v2 init, any pre-v2 per-directory
shadow repos are auto-migrated into legacy-<timestamp>/ so the new
store starts clean. _prune() now actually rewrites the per-project ref
to the last max_snapshots commits and runs git gc --prune=now. New
_enforce_size_cap() drops oldest commits round-robin across projects
when the store exceeds max_total_size_mb. _drop_oversize_from_index()
filters any single file larger than max_file_size_mb out of the snapshot.
- hermes_cli/checkpoints.py: new 'hermes checkpoints' CLI
(status / list / prune / clear / clear-legacy) for managing the store
outside a session.
- hermes_cli/config.py: flipped defaults — enabled=False, max_snapshots=20,
auto_prune=True. Added max_total_size_mb=500, max_file_size_mb=10.
Tightened DEFAULT_EXCLUDES (added target/, *.so/*.dylib/*.dll,
*.mp4/*.mov, *.zip/*.tar.gz, .worktrees/, .mypy_cache/, etc.).
- run_agent.py / cli.py / gateway/run.py: thread the new kwargs through
AIAgent and the startup auto_prune hooks.
- Tests rewritten to match v2 storage while keeping backwards-compat
coverage for the pre-v2 prune path (per-directory shadow repos under
base/ are still swept correctly for anyone mid-migration).
- Docs updated: user-guide/checkpoints-and-rollback.md explains the
shared store, new defaults, migration, and the new CLI;
reference/cli-commands.md documents 'hermes checkpoints'.
E2E validated
-------------
- Legacy migration: pre-v2 shadow repos auto-archived into legacy-<ts>/.
- Object dedup: two projects with an identical shared.py blob resolve to
7 total objects in the store (v1 would have stored the blob twice).
- max_snapshots=3 actually enforced: after 6 commits, list shows 3.
- Orphan prune: deleting a project's workdir + 'hermes checkpoints prune
--retention-days 0' removes its ref, index, and metadata; GC reclaims
the objects.
- max_file_size_mb=1 excludes a 2 MB weights.bin while keeping the
tracked source code files.
- hermes checkpoints {status,prune,clear,clear-legacy} all work from the
CLI without an agent running.
Breaking / migration
--------------------
No in-place data migration — legacy per-directory shadow repos are moved
into legacy-<timestamp>/ on first run. Old /rollback history is still
accessible by inspecting the archive with git; run
'hermes checkpoints clear-legacy' to reclaim the space when ready. Users
relying on /rollback must now set checkpoints.enabled=true (or pass
--checkpoints) explicitly.
- Fix /compact → /compress in context-overflow tips (closes#20020)
- Evict cached agent after session hygiene and /compress so system
prompt refreshes with current SOUL.md, memory, and skills
- Restore memory authority across compaction: change 'informational
background data' to 'authoritative reference data' in memory block
and SUMMARY_PREFIX, with backward-compatible regex
Based on:
- PR #20027 by @LeonSGP43
- PR #18767 by @MacroAnarchy
- PR #17380 by @vominh1919
PR #17121 boundary marker fix already merged to main (2eef395e1).
PR #9262 user-message anchoring already on main via _ensure_last_user_message_in_tail().
The dispatch_once function already accepts a max_spawn parameter but the
gateway was calling it without passing any value, effectively ignoring
the configuration. This change reads kanban.max_spawn from config.yaml
and passes it through, allowing users to limit concurrent kanban tasks.
This prevents resource exhaustion scenarios where kanban dispatcher
spawns too many parallel workers on constrained hardware.
Fix three regressions introduced by PR #18370 (lazy session creation):
1. _finalize_session() uses stale session_key after compression (#20001)
2. session_key not synced after auto-compression in run_conversation (#20001)
3. pending_title ValueError leaves title wedged forever (#19029)
4. Gateway silently swallows null responses when agent did work (#18765)
5. One-time cleanup for accumulated ghost compression continuations (#20001)
Changes:
- tui_gateway/server.py: _finalize_session() now uses agent.session_id
(falls back to session_key when agent is None). Refactor
_sync_session_key_after_compress() with clear_pending_title and
restart_slash_worker policy flags. Call it post-run_conversation()
to sync session_key after auto-compression. Add ValueError handler
to pending_title flush.
- gateway/run.py: Extract _normalize_empty_agent_response() helper that
consolidates failed/partial/null response handling. Surfaces user-facing
error when agent did work (api_calls > 0) but returned no text.
- hermes_state.py: Add finalize_orphaned_compression_sessions() — marks
ghost continuation sessions as ended (non-destructive, preserves data).
- cli.py: One-time startup migration for orphaned compression sessions.
Test changes:
- tests/test_tui_gateway_server.py: Update pending_title ValueError test
for post-#18370 architecture (title applied post-message, not at create).
- tests/test_lazy_session_regressions.py: 14 new regression tests covering
all fixed paths.
The Telegram/Discord /model pickers currently call
list_authenticated_providers(), which returns every provider whose
credentials resolve locally and every model in its curated snapshot.
Two failure modes fall out:
- OpenRouter rows can include IDs the live catalog no longer carries.
- Provider rows can surface with zero callable models (e.g. a slug
whose credential pool entry exists but has nothing behind it).
list_picker_providers() wraps the base function and post-processes the
result so the interactive picker only shows models the user can
actually select:
- OpenRouter's models come from fetch_openrouter_models() (live-catalog
filtered against the curated OPENROUTER_MODELS snapshot).
- Rows with an empty models list are dropped, except custom endpoints
(is_user_defined=True with an api_url) where the user may enter
model ids manually.
- All other fields pass through unchanged.
The gateway /model handler switches to the new helper for the
interactive picker payload only. Typed /model <name> and the text
fallback list stay on list_authenticated_providers() so nothing is
hidden from power users or platforms without a picker.
Covered by nine focused unit tests in
tests/hermes_cli/test_list_picker_providers.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* revert(gateway): remove stale-code self-check and auto-restart
Removes the _detect_stale_code / _trigger_stale_code_restart mechanism
introduced in #17648 and iterated in #19740. On every incoming message
the gateway compared the boot-time git HEAD SHA to the current SHA on
disk, and if they differed it would reply with
Gateway code was updated in the background --
restarting this gateway so your next message runs
on the new code. Please retry in a moment.
and then kick off a graceful restart. This is unwanted behaviour:
users who run a long-lived gateway and do their own ad-hoc git
operations on the checkout end up with their chat interrupted and
the current message dropped every time HEAD moves, with no way to
opt out.
If an operator really needs the old protection against stale
sys.modules after "hermes update", the SIGKILL-survivor sweep in
hermes update (hermes_cli/main.py, also tagged #17648) already
handles the supervisor-respawn case on its own.
Removed:
gateway/run.py:
- _STALE_CODE_SENTINELS, _GIT_SHA_CACHE_TTL_SECS
- _read_git_head_sha(), _compute_repo_mtime() module helpers
- class-level _boot_wall_time / _boot_repo_mtime / _boot_git_sha /
_stale_code_restart_triggered defaults
- __init__ boot-snapshot block (_boot_*, _cached_current_sha*,
_repo_root_for_staleness, _stale_code_notified)
- _current_git_sha_cached(), _detect_stale_code(),
_trigger_stale_code_restart() methods
- stale-code check + user-facing restart notice at the top of
_handle_message()
tests/gateway/test_stale_code_self_check.py (deleted, 412 lines)
No new logic added. Zero remaining references to any removed
symbol. Gateway test suite passes the same 4589 tests it passed
before; the 3 pre-existing unrelated failures (discord free-channel,
feishu bot admission, teams typing) are unchanged by this commit.
* feat(i18n): add display.language for static message translation (zh/ja/de/es)
Adds a thin-slice i18n layer covering the highest-impact static user-facing
messages: the CLI dangerous-command approval prompt and a handful of gateway
slash-command replies (restart-drain, goal cleared, approval expired, config
read/save errors).
Out of scope (stays English): agent responses, log lines, tool outputs,
slash-command descriptions, error tracebacks.
Infrastructure:
- agent/i18n.py: catalog loader, t() helper, language resolution
(HERMES_LANGUAGE env var > display.language config > en)
- locales/{en,zh,ja,de,es}.yaml: ~19 translated strings per language
- display.language in DEFAULT_CONFIG (hermes_cli/config.py)
Tests:
- tests/agent/test_i18n.py: 21 tests covering catalog parity, placeholder
parity across locales, fallback behavior, env-var override, alias
normalization, missing-key graceful degradation.
Docs:
- website/docs/user-guide/configuration.md: display.language entry plus a
short section explaining scope so users don't expect agent responses to
translate via this knob.
* revert(gateway): remove stale-code self-check and auto-restart
Removes the _detect_stale_code / _trigger_stale_code_restart mechanism
introduced in #17648 and iterated in #19740. On every incoming message
the gateway compared the boot-time git HEAD SHA to the current SHA on
disk, and if they differed it would reply with
Gateway code was updated in the background --
restarting this gateway so your next message runs
on the new code. Please retry in a moment.
and then kick off a graceful restart. This is unwanted behaviour:
users who run a long-lived gateway and do their own ad-hoc git
operations on the checkout end up with their chat interrupted and
the current message dropped every time HEAD moves, with no way to
opt out.
If an operator really needs the old protection against stale
sys.modules after "hermes update", the SIGKILL-survivor sweep in
hermes update (hermes_cli/main.py, also tagged #17648) already
handles the supervisor-respawn case on its own.
Removed:
gateway/run.py:
- _STALE_CODE_SENTINELS, _GIT_SHA_CACHE_TTL_SECS
- _read_git_head_sha(), _compute_repo_mtime() module helpers
- class-level _boot_wall_time / _boot_repo_mtime / _boot_git_sha /
_stale_code_restart_triggered defaults
- __init__ boot-snapshot block (_boot_*, _cached_current_sha*,
_repo_root_for_staleness, _stale_code_notified)
- _current_git_sha_cached(), _detect_stale_code(),
_trigger_stale_code_restart() methods
- stale-code check + user-facing restart notice at the top of
_handle_message()
tests/gateway/test_stale_code_self_check.py (deleted, 412 lines)
No new logic added. Zero remaining references to any removed
symbol. Gateway test suite passes the same 4589 tests it passed
before; the 3 pre-existing unrelated failures (discord free-channel,
feishu bot admission, teams typing) are unchanged by this commit.
* fix(agent): stateful streaming scrubber for reasoning-block leaks (#17924)
Per-delta _strip_think_blocks ran at _fire_stream_delta and destroyed
downstream state. When MiniMax-M2.7 / DeepSeek / Qwen3 streamed a tag
split across deltas (delta1='<think>', delta2='Let me check'), the
regex case-2 match erased delta1 entirely, so CLI/gateway state
machines never learned a block was open and leaked delta2 as content.
Raw consumers (ACP, api_server, TTS) had no downstream defense at all.
Replace the per-delta regex with a stateful StreamingThinkScrubber
that survives delta boundaries:
- Closed <tag>X</tag> pairs always stripped (matches _strip_think_blocks
case 1).
- Unterminated open at block boundary enters a block; content
discarded until close tag arrives. At end-of-stream, held
content is dropped.
- Orphan close tags stripped without boundary gating.
- Partial tags at delta boundaries held back until resolved.
- Block-boundary rule (start-of-stream, after \n, or
whitespace-only since last \n) preserves prose that mentions
tag names.
Reset at turn start alongside the existing context scrubber; flush at
turn end so a benign '<' held back at end-of-stream reaches the UI.
E2E-verified on live OpenRouter->MiniMax-m2 streams: closed pairs
strip cleanly, first word of post-block content is preserved, pure
content passes through unchanged. Stefan's screenshot case (#17924)
— 'Let me check' getting chopped to ' me check' — no longer happens.
Final _strip_think_blocks calls on completed strings (final_response,
replay, compression) are preserved; only the streaming per-delta call
site switched to the scrubber.
After PR #20105 (dispatcher skips ready tasks whose assignee fails
``profile_exists()`` to prevent the orion-cc/orion-research crash
loop), the gateway and CLI emit a spurious "kanban dispatcher stuck:
ready queue non-empty for N consecutive ticks but 0 workers spawned"
warning every 5 minutes on multi-lane setups where the queue is
steadily full of human-pulled work assigned to terminal lanes.
The warn is intended to catch real failure modes (broken PATH,
missing venv, credential loss for a real Hermes profile). On a
multi-lane host it fires forever even though everything is healthy:
the dispatcher correctly chose not to spawn, and there is nothing
for the operator to fix.
Changes:
* ``DispatchResult`` gains a ``skipped_nonspawnable`` field
(separate from ``skipped_unassigned``) so callers can distinguish
"task missing an owner — operator should route it" from "task
owned by a control-plane lane — terminal will pull it".
* ``dispatch_once`` routes the ``not profile_exists(assignee)`` skip
into the new bucket (was lumped into ``skipped_unassigned``).
* New helper ``has_spawnable_ready(conn)`` returns True iff at least
one ready+assigned+unclaimed task in the DB has an assignee that
maps to a real Hermes profile. Falls back to legacy "any
ready+assigned" when ``profile_exists`` is unimportable so degraded
installs still surface the original warn.
* The gateway dispatcher (``gateway/run.py``) and the CLI standalone
daemon (``hermes_cli/kanban.py``) both swap their cheap
``ready_nonempty`` probe to use ``has_spawnable_ready``. Stuck-warn
now fires only when there is genuine spawnable work the dispatcher
failed to start.
* CLI dispatch output prints ``Skipped (non-spawnable assignee —
terminal lane, OK)`` for visibility without alarm.
Tests:
* New ``has_spawnable_ready`` cases (empty queue, terminal-lane
only, mixed real+terminal).
* New ``test_dispatch_skips_nonspawnable_into_separate_bucket``
verifies the bucketing change.
* Updated ``test_dispatch_skips_unassigned`` to assert no
cross-leak.
* Added ``all_assignees_spawnable`` fixture in
``tests/hermes_cli/conftest.py`` and threaded it through dispatcher
tests that use synthetic assignees ("alice", "bob"). PR #20105
(the parent commit) silently broke 8 such tests by routing those
assignees into ``skipped_nonspawnable`` instead of spawning; this
PR repairs them as part of the same code area.
Verified locally: 246/246 kanban-suite tests pass.
Stacks on top of fix/kanban-dispatcher-skip-missing-profile-2026-05-05
(PR #20105). Reviewer: this PR is meant to merge AFTER #20105.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The stale-code self-check (Issue #17648) used sentinel-file mtimes to
decide whether the gateway survived a `hermes update` with stale
`sys.modules`. That signal false-positives on any write to the
sentinel files — including agent-driven edits during Hermes-on-Hermes
dev sessions. Telling the agent to patch `run_agent.py` would flip
the check to True on the next user message and force a gateway
restart even though no update happened.
Switch the signal to `git rev-parse HEAD`. Agent file edits don't
move HEAD; `hermes update` (git pull) always does. Reading .git/HEAD
directly (no subprocess) with a 5s cache keeps the overhead negligible
on bursty chats. Non-git installs short-circuit to False — the
stale-modules class can't occur without a git-backed update path, so
there's nothing to detect.
The legacy `_compute_repo_mtime` helper is kept but unused by
detection, reserved as a fallback hook for future pip-install update
paths.
- _read_git_head_sha(): resolves HEAD across main checkout, worktree
(follows `gitdir:` + `commondir` pointers), and packed-refs layouts.
- _current_git_sha_cached(): per-runner 5s SHA cache.
- _detect_stale_code(): boot SHA vs current SHA, returns False when
either is unavailable.
- Tests cover all four layouts, the agent-edits-don't-trigger
regression, and cache behavior.
Refs #17648.
Four production-readiness additions to topic mode:
1. /topic off — clean disable path. Flips telegram_dm_topic_mode.enabled
to 0 and clears telegram_dm_topic_bindings for this chat. Previously
users had to edit state.db with sqlite3 to turn the feature off.
Idempotent: calling /topic off when the chat was never enabled
returns a friendly no-op message.
2. /topic help — inline usage printed in the DM so users don't have to
visit docs to discover /topic off, /topic <session-id>, etc.
3. Authorization gate. /topic mutates SQLite side tables and flips the
root DM into a lobby, so the action must be authorized. Now calls
self._is_user_authorized(source); unauthorized DMs get a refusal
instead of activation. Defense in depth on top of the gateway's
existing pre-route auth.
4. BotFather screenshot debounce. A user repeatedly running /topic
while Threads Settings is still disabled would previously re-upload
the same screenshot every time. Now rate-limited to one send per
5 minutes per chat. /topic off resets the counter so re-enabling
starts fresh.
Command-def args hint updated: /topic [off|help|session-id].
Docs:
- New /topic subcommands table at the top of the multi-session section
- Disable instructions updated to recommend /topic off first, with the
raw SQL fallback kept for bulk cleanup
- Under-the-hood list extended with the capability-hint debounce and
the authorization gate
Tests (6 new):
- /topic help returns usage and doesn't create topic tables
- /topic off disables mode AND clears bindings
- /topic off is idempotent when never enabled
- Unauthorized users get refusal, no tables created
- Capability-hint debounce is per-chat
- /topic off resets both lobby and capability debounce counters
All 402 targeted tests pass. Full gateway sweep: 4809/4810
(pre-existing test_teams::test_send_typing unrelated).
Five follow-ups to topic mode based on integration audit:
1. ON DELETE CASCADE on telegram_dm_topic_bindings.session_id. Session
pruning (manual /delete, auto-cleanup, any future prune job) would
have thrown 'FOREIGN KEY constraint failed' for sessions bound to a
topic. Migration bumped to v2, rebuilds the bindings table in place
if FK lacks CASCADE. Idempotent; only runs once per DB.
2. Never auto-rename operator-declared topics. If an operator has
extra.dm_topics configured AND a user runs /topic, messages in those
pre-declared topics would previously trigger auto-rename and silently
mutate operator config. _rename_telegram_topic_for_session_title now
early-returns when _get_dm_topic_info returns a dict for this
(chat_id, thread_id). Uses class-based lookup (not hasattr) so
MagicMock test fixtures don't accidentally trip the guard.
3. General topic handling. Telegram's General (pinned top) topic in a
forum-enabled private chat may send messages with message_thread_id=1
or omit thread_id entirely depending on client. Both are now treated
as the root lobby, not a topic lane. Prevents users from
accidentally burning a session on the General topic.
4. Debounce the root-lobby reminder. 30-second cooldown per chat so a
user who forgets topic mode is enabled and types ten messages in the
root gets one reminder, not ten. Explicit command replies
(/new-in-lobby, /topic <session-id>) still land every time.
5. Docs: added under-the-hood invariants for the above, plus a
Downgrade section explaining that rolling back to a pre-/topic
Hermes build leaves the DB tables orphaned but harmless — DMs just
revert to native per-thread isolation.
Tests:
- test_operator_declared_topic_is_not_auto_renamed
- test_general_topic_is_treated_as_root_lobby
- test_lobby_reminder_is_debounced_per_chat
- test_binding_survives_session_deletion_via_cascade
- test_migration_rebuilds_v1_binding_table_with_cascade_fk
Validated: 4803/4804 tests pass (tests/gateway/ + tests/test_hermes_state.py).
Sole failure is a pre-existing test_teams::test_send_typing flake
unrelated to this PR.
Follow-up on @EmelyanenkoK's feat: add Telegram DM topic-mode sessions.
Three issues:
1. Split-brain session state. After get_or_create_session() returned a
SessionEntry for a topic lane, the handler was mutating
.session_id in place to the binding's target, but never persisting
the switch through SessionStore. The sessions.json session_key →
session_id map kept pointing at the lane's natural id; any reader
that reloaded from disk saw the wrong id. Fixed by routing through
SessionStore.switch_session(), which _save()s the mapping and ends
the old session in SQLite like /resume does.
2. /new inside a topic was a one-message no-op. Reset created a new
session but left the telegram_dm_topic_bindings row pointing at the
old session_id, so the next message's binding lookup switched right
back. Now _handle_reset_command rebinds the topic to the new
session_id after reset.
3. is_telegram_session_linked_to_topic and
list_unlinked_telegram_sessions_for_user both called
apply_telegram_topic_migration() on read, contradicting the PR's
own invariant that migration only runs on explicit /topic opt-in.
They now tolerate missing topic tables and return empty/False.
Also: _telegram_topic_mode_enabled() now only treats True as enabled
(not any truthy return), so test fixtures with MagicMock session_db
don't accidentally flip every DM into lobby mode — this was breaking
4 pre-existing test_status_command tests.
Tests:
- New regression: /new inside a topic must update the binding row
(test_new_inside_telegram_topic_rewrites_binding_to_new_session).
- _make_runner now stubs switch_session so existing restore tests
still exercise the new code path.
Validated end-to-end with real SessionDB + SessionStore:
readers on fresh DB don't create topic tables; enable creates them;
binding override persists across SessionStore restart; /new rebinds
and the new id survives a restart.
Co-authored-by: EmelyanenkoK <emelyanenko.kirill@gmail.com>
Adds first-class board support to kanban so users can separate unrelated
streams of work (projects, repos, domains) into isolated queues. Single-
project users stay on the 'default' board and see no UI change.
Isolation model
---------------
- Each board is a directory at `~/.hermes/kanban/boards/<slug>/` with
its own `kanban.db`, `workspaces/`, and `logs/`. The 'default' board
keeps its legacy path (`~/.hermes/kanban.db`) for back-compat — fresh
installs and pre-boards users get zero migration.
- Workers spawned by the dispatcher have `HERMES_KANBAN_BOARD` pinned in
their env alongside the existing `HERMES_KANBAN_DB` /
`HERMES_KANBAN_WORKSPACES_ROOT` pins, so workers physically cannot see
other boards' tasks.
- The gateway's single dispatcher loop now sweeps every board per tick;
per-tick cost is a few extra filesystem stats.
- CAS concurrency guarantees are preserved per-board (each board is its
own SQLite DB, same WAL+IMMEDIATE machinery as before).
CLI
---
hermes kanban boards list|create|switch|show|rename|rm
hermes kanban --board <slug> <any-subcommand>
Board resolution order: `--board` flag → `HERMES_KANBAN_BOARD` env →
`~/.hermes/kanban/current` file → `default`. Slug validation is strict:
lowercase alphanumerics + hyphens + underscores, 1-64 chars, starts with
alphanumeric. Uppercase is auto-downcased; slashes / dots / `..` /
control chars are rejected so boards can't name their way out of the
boards/ directory.
Passive discoverability: when more than one board exists, `hermes kanban
list` prints a one-line header ("Board: foo (2 other boards …)") so
users who stumble across multi-project never have to hunt for the
feature. Invisible for single-board installs.
Dashboard
---------
- New `BoardSwitcher` component at the top of the Kanban tab: dropdown
with all boards + task counts, `+ New board` button, `Archive`
button (non-default only). Hidden entirely when only `default` exists
and is empty — single-project users never see it.
- New `NewBoardDialog` modal: slug / display name / description / icon
+ "switch to this board after creating" checkbox.
- Selected board persists to `localStorage` so browser users don't
shift the CLI's active board out from under a terminal they left open.
- New `?board=<slug>` query param on every existing endpoint plus a
new `/boards` CRUD surface (`GET /boards`, `POST /boards`,
`PATCH /boards/<slug>`, `DELETE /boards/<slug>`,
`POST /boards/<slug>/switch`).
- Events WebSocket is pinned to a board at connection time; switching
opens a fresh WS against the new board.
Also fixes a pre-existing bug in the plugin's tenant / assignee
filters: the SDK's `Select` uses `onValueChange(value)`, not
native `onChange(event)`, so those filters silently didn't work.
New `selectChangeHandler` helper wires both signatures.
Tests
-----
49 new tests in `tests/hermes_cli/test_kanban_boards.py` covering:
slug validation (valid / invalid / auto-downcase), path resolution
(default = legacy path, named = `boards/<slug>/`, env var override),
current-board resolution chain (env > file > default), board CRUD +
archive / hard-delete, per-board connection isolation (tasks don't
leak), worker spawn env injection (`HERMES_KANBAN_BOARD`,
`HERMES_KANBAN_DB`, `HERMES_KANBAN_WORKSPACES_ROOT` all point at the
right board), and end-to-end CLI surface.
Regression surface: all 264 pre-existing kanban tests continue to pass.
Live-tested via the dashboard: created 3 boards (default,
hermes-agent, atm10-server), created tasks on each via both CLI
(`--board <slug> create`) and dashboard (inline create on the Ready
column), confirmed zero cross-board leakage, confirmed `BoardSwitcher`
+ `NewBoardDialog` work end-to-end in the browser.
The contributor's PR silently swallowed ValueError from
SessionDB.set_session_title() with bare except Exception: pass.
Users typing /new <title> with an already-in-use title got an
untitled session and no feedback.
Changes:
- cli.py: catch ValueError from both sanitize_title() and
set_session_title(); print the error and mark the session
untitled in the banner (never echo the rejected title back).
- gateway/run.py: append a warning note to the reset reply on
title rejection; reflect the accepted title in the header.
- Add regression tests for the duplicate-title path in CLI and
gateway.
Also map exx@example.com -> @exxmen in scripts/release.py.
Allow users to start a fresh session and immediately set its title by
passing a name to /new (or /reset):
/new Refactor auth module
Changes:
- hermes_cli/commands.py: add args_hint='[name]' to /new command
- cli.py: parse title argument in process_command(), pass to new_session()
- cli.py: new_session() accepts title=None, sets title via SessionDB
- gateway/run.py: _handle_reset_command() parses title, sets on new entry
- gateway/session.py: reset_session() accepts optional display_name
- tests: add test_new_session_with_title, test_reset_command_with_title,
test_new_command_in_help_output
All 36 affected tests pass.
Quick commands of type "alias" that target built-in slash commands
(e.g. /h -> /model) were processed too late in _handle_message — after
the if-canonical=="model" checks. This meant alias expansion never
reached the target handler and fell through to the LLM as raw text.
Two fixes:
1. Move the quick_commands block before built-in dispatch so alias
targets (like /model) hit the correct handler after expansion.
2. Extract bare command name from target_command via .split()[0] to
feed _resolve_cmd() correctly (was using the full arg-string).
The old CWD heuristic was fooled by:
1. TERMINAL_CWD persisted to .env by `hermes config set terminal.cwd`
2. Inherited TERMINAL_CWD from parent hermes processes
3. Only resolved when config had a placeholder value (not explicit paths)
Fix:
- load_cli_config() unconditionally uses os.getcwd() for local backend
- TERMINAL_CWD always force-exported in CLI mode (overrides stale values)
- Gateway sets _HERMES_GATEWAY=1 marker so lazy cli.py imports don't clobber
- Remove terminal.cwd from config-set .env sync map (prevents re-poisoning)
- Clarify setup wizard label as 'Gateway working directory'
Closes#19214
Deduplicate exact and near-exact Discord voice STT transcripts per guild/user over a short window to avoid duplicate delayed agent replies.
Adds regression tests for exact and near-duplicate voice transcript suppression.
/goal was silently broken outside the classic CLI.
TUI: /goal was routed through the HermesCLI slash-worker subprocess,
which set the goal row in SessionDB but then called
_pending_input.put(state.goal) — the subprocess has no reader for that
queue, so the kickoff message was discarded. No post-turn judge was
wired into prompt.submit either, so even a manual kickoff would not
continue the goal loop. Intercept /goal in command.dispatch instead,
drive GoalManager directly, and return {type: send, notice, message}
so the TUI client renders the Goal-set notice and fires the kickoff.
Run the judge in _run_prompt_submit after message.complete, surface
the verdict via status.update {kind: goal}, and chain the continuation
turn after the running guard is released.
Gateway: _post_turn_goal_continuation was gated on
hasattr(adapter, 'send_message'), but adapters only expose send().
That branch was dead on every platform — users never saw
'✓ Goal achieved', 'Continuing toward goal', or budget-exhausted
messages. Replace the dead call with adapter.send(chat_id, content,
metadata) and drop a broken reference to self._loop.
Tests:
- tests/tui_gateway/test_goal_command.py — full /goal dispatch matrix
(set / status / pause / resume / clear / stop / done / whitespace)
plus regressions for slash.exec → 4018 and 'goal' staying in
_PENDING_INPUT_COMMANDS.
- tests/gateway/test_goal_verdict_send.py — locks in the adapter.send
path for done / continue / budget-exhausted and verifies the hook
no-ops when no goal is set or the adapter lacks send().
suspend_recently_active() was unconditionally setting suspended=True on
startup, causing get_or_create_session() to wipe conversation history on
every restart. Change to set resume_pending=True instead, so sessions
auto-resume while still allowing stuck-loop escalation after 3 failures.
* fix(gateway): config.yaml wins over .env for agent/display/timezone settings
Regression from the silent config→env bridge. The bridge at module import
time is correct for max_turns (unconditional overwrite), but every other
agent.*, display.*, timezone, and security bridge key was guarded by
'if X not in os.environ' — so a stale .env entry from an old 'hermes setup'
run would shadow the user's current config.yaml indefinitely.
Symptom: agent.max_turns: 500 in config.yaml, HERMES_MAX_ITERATIONS=60
in .env from an old setup, and the gateway silently capped at 60
iterations per turn. Gateway logs confirmed api_calls never exceeded 60.
Three changes:
1. gateway/run.py: drop the 'not in os.environ' guards for all agent.*,
display.*, timezone, and security.* bridge keys. config.yaml is now
authoritative for these settings — same semantics already in place
for max_turns, terminal.*, and auxiliary.*. Also surface the bridge
failure (previously 'except Exception: pass') to stderr so operators
see bridge errors instead of silently falling back to .env.
2. gateway/run.py: INFO-log the resolved max_iterations at gateway
start so operators can verify the config→env bridge did the right
thing instead of chasing a phantom budget ceiling.
3. hermes_cli/setup.py: stop writing HERMES_MAX_ITERATIONS to .env in
the setup wizard. config.yaml is the single source of truth. Also
clean up any stale .env entry left behind by pre-fix setups.
Regression tests in tests/gateway/test_config_env_bridge_authority.py
guard each config→env key against the 'stale .env shadows config' bug.
* fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log)
Three issues observed in production gateway.log during a rapid restart
chain on 2026-05-02, all fixed here.
1. _send_restart_notification logged unconditional success
adapter.send() catches provider errors (e.g. Telegram 'Chat not found')
and returns SendResult(success=False); it never raises. The caller
ignored the return value and always logged 'Sent restart notification
to <chat>' at INFO, producing a misleading success line directly
below the 'Failed to send Telegram message' traceback on every boot.
Now inspects result.success and logs WARNING with the error otherwise.
2. WhatsApp bridge SIGTERM on shutdown classified as fatal error
_check_managed_bridge_exit() saw the bridge's returncode -15 (our own
SIGTERM from disconnect()) and fired the full fatal-error path,
producing 'ERROR ... WhatsApp bridge process exited unexpectedly' plus
'Fatal whatsapp adapter error (whatsapp_bridge_exited)' on every
planned shutdown, immediately before the normal '✓ whatsapp
disconnected'. Adds a _shutting_down flag that disconnect() sets
before the terminate, and _check_managed_bridge_exit() returns None
for returncode in {0, -2, -15} while shutting down. OOM-kill (137)
and other non-signal exits still hit the fatal path.
3. restart_drain_timeout default 60s → 180s
On 2026-05-02 01:43:27 a user /restart fired while three agents were
mid-API-call (82s, 112s, 154s into their turns). The 60s drain budget
expired and all three were force-interrupted. 180s covers realistic
in-flight agent turns; users on very-long-reasoning models can still
raise it further via agent.restart_drain_timeout in config.yaml.
Existing explicit user values are preserved by deep-merge.
Tests
- tests/gateway/test_restart_notification.py: two new tests assert INFO
is only logged on SendResult(success=True) and WARNING with the error
string is logged on SendResult(success=False).
- tests/gateway/test_whatsapp_connect.py: parametrized test for
returncode in {0, -2, -15} proves shutdown-time exits are suppressed;
separate test proves returncode 137 (SIGKILL/OOM) still surfaces as
fatal even when _shutting_down is set.
- _check_managed_bridge_exit() reads _shutting_down via getattr-with-
default so existing _make_adapter() test helpers that bypass __init__
(pitfall #17 in AGENTS.md) keep working unmodified.
`_register_skill_group` captured the skill catalog in closure variables
(`entries` and `skill_lookup`) so the single `tree.add_command` call at
startup owned the only live copy. The closure is never re-entered after
startup, so `/reload-skills` — which rescans the on-disk skills dir and
refreshes the in-process `_skill_commands` registry — had no way to
propagate results into the `/skill` autocomplete on Discord. New skills
stayed invisible in the dropdown, and deleted skills returned
"Unknown skill" when the stale autocomplete entry was clicked.
The fix is purely a dataflow change: promote `entries` and `skill_lookup`
to instance attributes (`_skill_entries`, `_skill_lookup`), split the
collector-driven rebuild into a helper (`_refresh_skill_catalog_state`),
and add a public `refresh_skill_group()` method that re-runs the helper
and is safe to call at any point after the initial registration.
The gateway's `_handle_reload_skills_command` then iterates
`self.adapters` and calls `refresh_skill_group()` on any adapter that
exposes it (currently only Discord). Both sync and async implementations
are supported; adapters that don't override the method (Telegram's
BotCommand menu, Slack subcommand map, etc.) are silently skipped — the
in-process `reload_skills()` call covers them.
No `tree.sync()` is required because Discord fetches autocomplete
options dynamically on every keystroke — mutating the instance state the
callbacks already read from is sufficient. That sidesteps the per-app
command-bucket rate limit (~5 writes / 20 s) that made the previous
bulk-sync-on-reload approach unusable (#16713 context).
Tests: tests/gateway/test_reload_skills_discord_resync.py — five cases
covering (1) refresh replaces entries, (2) entries stay sorted after
refresh, (3) collector exception leaves cached state intact, (4)
`_refresh_skill_catalog_state` populates the instance attrs, (5)
orchestrator calls `refresh_skill_group()` on sync + async adapters and
skips adapters that don't expose it.
_check_unavailable_skill is meant to turn a typed "/foo" command that
doesn't resolve into a specific hint — "disabled, enable with hermes
skills config" or "available but not installed, install with hermes
skills install …" — instead of the generic "unknown command" reply.
It was doing the match with `skill_md.parent.name.lower().replace("_", "-")`,
comparing that to the typed command. For every skill whose directory name
drifted from its declared frontmatter `name:`, that comparison failed and
the user got the unhelpful generic path. On a standard install today 19
skills have this drift, e.g.:
dir: mlops/stable-diffusion
frontmatter: name: Stable Diffusion Image Generation
registered slug (what the user types): /stable-diffusion-image-generation
dir: mlops/qdrant
frontmatter: name: Qdrant Vector Search
registered slug: /qdrant-vector-search
dir: mlops/flash-attention
frontmatter: name: Optimizing Attention Flash
registered slug: /optimizing-attention-flash
In every case, _check_unavailable_skill would fall through because
"stable-diffusion" != "stable-diffusion-image-generation", even with the
skill sitting right there on disk.
Fix: extract a small `_skill_slug_from_frontmatter` helper that reads the
SKILL.md frontmatter and normalizes exactly like scan_skill_commands
(lower, spaces/underscores → hyphens, strip non-[a-z0-9-], collapse
runs of hyphens, strip edges). Use it in both the
disabled-skills branch and the optional-skills branch. The disabled-set
membership check now uses the declared frontmatter name (which is what
`hermes skills config` writes into skills.disabled / platform_disabled),
not the slug.
Tests: five cases in tests/gateway/test_unavailable_skill_hint.py —
the drift case for the disabled branch, unknown-command negative,
matched-but-not-disabled negative, non-alnum stripping, and the drift
case for the optional-skills branch. All five fail against main and
pass with the fix.