* feat(kanban): add `specify` — auxiliary LLM fleshes out triage tasks
The Triage column shipped with a placeholder 'a specifier will flesh
out the spec', but the specifier itself was never built. This wires
it up as a dedicated CLI verb.
`hermes kanban specify <id>` calls the auxiliary LLM (configured under
`auxiliary.triage_specifier`) to expand a rough one-liner into a
concrete spec — tightened title plus a body with Goal / Approach /
Acceptance criteria / Out-of-scope sections — then atomically flips
`status: triage -> todo` and recomputes ready so parent-free tasks
go straight to the dispatcher on the same tick.
Surface:
hermes kanban specify <task_id> # single task
hermes kanban specify --all [--tenant T] # sweep triage column
hermes kanban specify ... --author NAME # audit-comment author
hermes kanban specify ... --json # one JSON line per task
Design choices:
- Parent gating is preserved. specify_triage_task flips to 'todo',
then recompute_ready promotes to 'ready' only when parents are
done — same rule as a normal parent-gated todo.
- No daemon, no background watcher. Every invocation is explicit —
keeps cost predictable and doesn't fight the dispatcher loop.
- Response parse is lenient: strict JSON preferred, markdown-fence
tolerated, raw-body fallback on malformed JSON so the LLM can't
strand a task in triage.
- All failure modes (no aux client, API error, task moved out of
triage mid-call) return SpecifyOutcome(ok=False, reason=...) so
--all continues past individual failures.
Changes:
hermes_cli/kanban_db.py + specify_triage_task()
hermes_cli/kanban_specify.py NEW (~220 LOC — prompt, parse, call)
hermes_cli/kanban.py + specify subcommand + _cmd_specify
hermes_cli/config.py + auxiliary.triage_specifier task slot
website/docs/user-guide/features/kanban.md specify + config notes
website/docs/reference/cli-commands.md CLI reference entry
tests/hermes_cli/test_kanban_specify_db.py NEW (10 tests)
tests/hermes_cli/test_kanban_specify.py NEW (20 tests)
Validation: 30/30 targeted tests pass. E2E: triage task -> specify ->
ends in 'ready' with events [created, specified, promoted] and the
audit comment recorded under the configured author.
* feat(kanban): wire specifier into dashboard and gateway slash
Follow-ups to the initial PR #21435 — closes the two gaps I'd left as
post-merge: dashboard button and first-class gateway surface.
Dashboard (plugins/kanban/dashboard/)
- POST /tasks/:id/specify NEW endpoint. Thin wrapper around
kanban_specify.specify_task(). Returns the CLI outcome shape
({ok, task_id, reason, new_title}); ok=false with a human reason
is a 200, not a 4xx, so the UI can render it inline without
treating 'no aux client configured' as a crash.
- Runs sync in FastAPI's threadpool because the LLM call can take
tens of seconds on reasoning models.
- Pins HERMES_KANBAN_BOARD around the specify call so the module's
argless kb.connect() lands on the right board.
- dist/index.js: doSpecify callback threaded through the drawer →
TaskDetail → StatusActions prop chain. ✨ Specify button appears
ONLY when task.status === 'triage' (elsewhere the backend would
reject anyway — hide the button to keep the action row clean).
Busy state (Specifying…) + inline success/error banner under the
button using the response.reason text.
- dist/style.css: tiny hermes-kanban-msg-ok / -err classes using
existing --color vars so themes reskin cleanly.
Gateway slash (/kanban specify)
- Already works via the existing run_slash → build_parser →
kanban_command pipeline. No code change needed — slash commands
inherit the argparse tree automatically. Added coverage:
test_run_slash_specify_end_to_end (create --triage, specify, verify
promotion + retitle) and test_run_slash_specify_help_is_reachable.
Tests
- tests/plugins/test_kanban_dashboard_plugin.py: 3 new tests for the
REST endpoint — happy path, non-triage rejection as ok=false 200,
missing aux client as ok=false 200.
- tests/hermes_cli/test_kanban_cli.py: 2 new slash-surface tests.
Docs
- website/docs/user-guide/features/kanban.md: dashboard action row
description mentions ✨ Specify + all three surfaces. REST table
gains /tasks/:id/specify. Slash examples include /kanban specify.
Validation: 340/340 targeted tests pass. E2E via TestClient: create a
triage task over REST → POST /specify with mocked aux client → task
moves to 'ready' column on /board with new title and body applied.
Authenticated remote OpenViking servers derive tenancy from the Bearer
key, but the client was always sending X-OpenViking-Account and
X-OpenViking-User — defaulted to the literal string "default" — which
overrode the key-derived tenant and broke auth.
- _headers(): skip X-OpenViking-Account/-User when blank or "default"
(treats the legacy default value as unset, so existing installs don't
need to touch their .env)
- _headers(): send Authorization: Bearer <key> alongside X-API-Key for
standard HTTP auth compatibility
- health(): include auth headers so /health works against servers that
require authentication
Tests cover bearer emission, legacy "default" suppression, empty
suppression, real tenant passthrough, and authenticated health checks.
Fixes the same user report as #20695 (from @ZaynJarvis); that PR could
not be merged because its branch was stale against main and would have
reverted recent OpenViking work (#15696, local resource uploads, summary
URI normalization, fs-stat pre-check).
Drives stream_events directly and cancels the task while it is sleeping
in the poll loop, asserting the coroutine returns cleanly instead of
letting CancelledError bubble. Regression coverage for the Uvicorn
application traceback on dashboard Ctrl-C fixed by the preceding commit.
Add parent dependency guard to _set_status_direct so dragging
a task to the ready column is rejected (409) when its parents
are not all done. Previously the guard only existed in
recompute_ready, allowing direct status writes via the
dashboard API to bypass the dependency engine.
Root cause: after reclaiming stale workers, both T3 and T4
were set to ready via dashboard status writes in quick
succession, causing the writer to be spawned while the analyst
was blocked — upstream work wasn't done yet.
Mirrors the pattern already shipping in hindsight-integrations/openclaw:
probe `<api_url>/version` once per process, gate on Hindsight ≥ 0.5.0.
When supported, retains use a stable session-scoped `document_id`
(`session_id`) plus `update_mode='append'` so cross-process retains for
the same session merge into one document instead of producing
N-different-process-stamped duplicates. When unsupported (or probe
fails), fall back to the existing per-process unique
`f"{session_id}-{start_ts}"` document_id with no `update_mode` — the
resume-overwrite fix (#6654) keeps working unchanged on legacy servers.
Closes the dedup half of #20115. The proposed `document_id_strategy`
config knob isn't needed: auto-detection via the same /version probe
the OpenClaw plugin already uses gives the same outcome with no extra
config burden, and the choice is purely a function of what the server
can do.
Plumbing
--------
- Module-level helpers (`_meets_minimum_version`, `_fetch_hindsight_api_version`,
`_check_api_supports_update_mode_append`) cache the result per api_url
so every provider in the process gets one /version round-trip.
- One-time WARN logged when the API is older than 0.5.0, telling the
user to upgrade for cross-session deduplication.
- New instance helper `_resolve_retain_target(fallback_doc_id)` returns
`(document_id, update_mode)` based on cached capability. Wired into
`sync_turn` and the `on_session_switch` flush path.
- For local_embedded mode, the probe URL is taken from the running
client (`client.url`) so we hit the actual daemon port rather than
the configured default.
- `update_mode` is set on the per-item dict; `aretain_batch` already
threads `item['update_mode']` into the API call.
Tests
-----
- `TestUpdateModeAppendCapability` (5 cases): legacy fallback, modern
stable+append, per-url cache, one-time warn, flush-on-switch resolves
against the OLD session.
- Existing `_make_hindsight_provider` factory in the manager-side test
file extended to seed `_mode`/`_api_url`/`_api_key`/`_client` and stub
`_resolve_retain_target` so the bypass-init pattern keeps working.
E2E verified against installed `~/.hermes/hermes-agent`:
- Legacy probe (unreachable host) → `legacy-session-<ts>` doc_id,
no `update_mode`.
- Modern probe (live local_embedded 0.5.6 daemon) → stable
`modern-session` doc_id + `update_mode='append'`.
- `test_hermes_embedded_smoke.py` passes (90s).
The dispatcher's circuit breaker only protected against spawn-side
failures (profile missing, workspace mount error, exec failure).
Workers that successfully spawned but then timed out or crashed
re-queued to ``ready`` with no counter increment, so the next tick
re-spawned them — loops forever until someone noticed. Reported
externally on Twitter (Forbidden Seeds) and confirmed by walking the
kernel: ``enforce_max_runtime`` flipped the task back to ready, emitted
a ``timed_out`` event, and never touched ``spawn_failures``; same for
``detect_crashed_workers``.
Fix: unify the counter across all non-success outcomes.
Schema
------
* ``tasks.spawn_failures`` → ``tasks.consecutive_failures``
* ``tasks.last_spawn_error`` → ``tasks.last_failure_error``
* Migration renames the columns in-place on existing DBs (``ALTER
TABLE RENAME COLUMN`` — SQLite >= 3.25) so historical counter
values are preserved. Row mappers fall through to the legacy names
if both column renames and a migration somehow got out of sync.
Counter lifecycle
-----------------
New helper ``_record_task_failure(conn, task_id, error, *, outcome,
release_claim, end_run, event_payload_extra)`` is the single point
every non-success outcome funnels through:
* ``spawn_failed`` → ``_record_spawn_failure`` (kept as alias)
calls it with ``release_claim=True, end_run=True`` — transitions
running→ready, clears claim, closes run.
* ``timed_out`` → ``enforce_max_runtime`` already does the status
transition + run close + event emission, then calls
``_record_task_failure`` with ``release_claim=False, end_run=False``
just to bump the counter (and trip the breaker if needed).
* ``crashed`` → ``detect_crashed_workers`` same pattern, but the
counter increment runs after the main write_txn closes (SQLite
doesn't nest write transactions).
If the counter hits the breaker threshold (``DEFAULT_FAILURE_LIMIT=5``,
same as before), the task transitions to ``blocked`` with a ``gave_up``
event on top of whatever outcome-specific event was already emitted.
Reset semantics changed: the counter now clears only on successful
``complete_task`` (and operator ``reclaim_task`` — an explicit "I've
looked at this, try again with a fresh budget"). Previously
``_clear_spawn_failures`` ran on every successful spawn, which would
have wiped the counter before a timeout could accumulate past threshold
— exactly the loop this fix prevents.
Diagnostics
-----------
* ``_rule_repeated_spawn_failures`` → ``_rule_repeated_failures``. Now
fires regardless of which outcome is at fault. Classifies the most
recent failure (spawn_failed / timed_out / crashed) from the run
history so the title ("Agent timeout x3", "Agent crash x4", "Agent
spawn x5") and suggested action (``doctor`` for spawn, ``log`` for
timeout/crash) stay outcome-specific without N duplicate rules.
* ``_rule_repeated_crashes`` kept as a narrower early-warning at
threshold 2 (vs 3 for the unified rule), but now suppresses itself
when the unified rule would also fire — avoids double-flagging.
* Diagnostic ``data`` payload now carries
``{consecutive_failures, most_recent_outcome, last_error}`` instead
of spawn-specific keys.
CLI
---
* ``Task.consecutive_failures`` / ``Task.last_failure_error`` are the
public fields now. Existing callers that referenced the old names
get migrated (tests updated in this commit).
* Backward-compat: ``DEFAULT_SPAWN_FAILURE_LIMIT``,
``_clear_spawn_failures``, ``_record_spawn_failure`` stay as aliases.
Tests
-----
* 6 new kernel tests: timeout increments counter, 3 consecutive
timeouts trip the breaker (was the reported gap), crash increments
counter, reclaim clears counter, completion clears counter, spawn
success does NOT clear counter.
* Diagnostic tests: updated ``repeated_spawn_failures`` cases to use
the new kind name and add a timeout-loop test.
* Dashboard API test: spawn_failures column update → consecutive_failures.
389/389 kanban-suite tests pass.
Live verification
-----------------
Seeded 4 tasks in an isolated HERMES_HOME: 3 timeouts, 4 crashes,
2-spawn-failed + 2-timed-out, and a task that had prior failures but
completed successfully. Board correctly shows "!! 3 tasks need
attention" (the successful one has no badge because the counter
reset). Drawer for the timeout-loop task renders "Agent timeout x3"
with most_recent_outcome=timed_out and the "Check logs" suggested
action (not the spawn-flavoured "Verify profile"). The successful
task has zero diagnostics.
Closes the Forbidden-Seeds-reported gap.
* feat(kanban): generic diagnostics engine for task distress signals
Replaces the hallucination-specific ``warnings`` / ``RecoverySection``
surface (shipped in PR #20232) with a reusable diagnostic-rule engine
that covers five distress kinds in v1 and can be extended without
touching UI code. The "something's wrong with this task" signal is
no longer limited to phantom card ids.
Closes the follow-up from #20232 discussion.
New module
----------
``hermes_cli/kanban_diagnostics.py`` — stateless, no-side-effect rule
engine. Each rule is a pure function of
``(task, events, runs, now, config) -> list[Diagnostic]``. Registry
is a simple list; adding a new distress kind is one function + one
import, no UI or API changes required.
v1 rule set
-----------
* ``hallucinated_cards`` (error) — folds the existing
``completion_blocked_hallucination`` event into the new surface.
* ``prose_phantom_refs`` (warning) — folds
``suspected_hallucinated_references``.
* ``repeated_spawn_failures`` (error → critical at 2x threshold) —
fires when ``tasks.spawn_failures >= 3``; suggests
``hermes -p <profile> doctor`` / ``auth``.
* ``repeated_crashes`` (error → critical) — fires after N consecutive
``crashed`` run outcomes with no successful completion between;
suggests ``hermes kanban log <id>``.
* ``stuck_in_blocked`` (warning) — fires after 24h in ``blocked``
state with no comments / unblock attempts; suggests commenting.
Every diagnostic carries structured ``actions`` (reclaim, reassign,
unblock, cli_hint, comment, open_docs) that render consistently in
both CLI and dashboard. Suggested actions are highlighted; generic
recovery actions (reclaim / reassign) are available on every kind as
fallbacks.
Diagnostics auto-clear when the underlying failure resolves — a
clean ``completed``/``edited`` event drops hallucination diagnostics,
a successful run drops crash diagnostics, a comment drops
stuck-blocked diagnostics. Audit events persist; the badge goes away.
API
---
``plugin_api.py``:
* ``/board`` now attaches ``diagnostics`` (full list) and
``warnings`` (compact summary with ``highest_severity``) per task.
* ``/tasks/{id}`` attaches diagnostics so the drawer's Diagnostics
section auto-opens on flagged tasks.
* NEW ``/diagnostics`` endpoint — fleet-wide listing, filterable by
severity, sorted critical-first.
CLI
---
* NEW ``hermes kanban diagnostics [--severity X] [--task id]
[--json]`` — fleet view or single-task view, matches dashboard rule
output so CLI users see the same picture.
* ``hermes kanban show <id>`` now renders a Diagnostics section near
the top with severity markers + suggested actions.
Dashboard
---------
* Card badge is severity-coloured (⚠ amber warning, !! orange error,
!!! red critical) using ``warnings.highest_severity``.
* Attention strip above the toolbar counts EVERY task with active
diagnostics (not just hallucinations), severity-coloured, lists
affected tasks with Open buttons when expanded.
* Drawer's old ``RecoverySection`` replaced with generic
``DiagnosticsSection`` rendering a card per active diagnostic:
title + detail + structured data (task-id chips when payload keys
look like id lists) + action buttons. Reassign profile picker is
inline per-diagnostic. Clipboard fallback uses ``.catch()`` for
environments where writeText rejects.
* Three-rung severity palette; amber for warning, orange for error,
red for critical. Uses CSS variables so theming is straightforward.
Tests
-----
* NEW ``tests/hermes_cli/test_kanban_diagnostics.py`` — 14 unit tests
covering each rule's positive/negative/threshold paths, severity
sorting, broken-rule isolation, and sqlite3.Row integration.
* Dashboard plugin tests extended: ``/diagnostics`` endpoint (empty,
populated, severity-filtered), ``/board`` exposes both diagnostic
list and compact summary with ``highest_severity``.
* Existing hallucination-specific test (``test_board_surfaces_
warnings_field_for_hallucinated_completions``) updated to reflect
the new contract: warning summary keys by diagnostic kind
(``hallucinated_cards``) not event kind.
379 kanban-suite tests pass (+16 net from this PR).
Live verification
-----------------
Seeded all 5 diagnostic kinds + one clean + one plain-running task
(7 total) into an isolated HERMES_HOME, spun up the dashboard, and
verified:
* Attention strip: shows ``!! 5 tasks need attention`` in the
error-severity orange; Show expands to a list of 5 rows ordered
critical > error > warning.
* Card badges: error tasks render ``!!`` orange, warning tasks
render ``⚠`` amber, clean and plain-running tasks render no badge.
* Each of the 5 rules opens a correctly-coloured, correctly-styled
diagnostic card in the drawer with its specific suggested action.
* Live reassign from a diagnostic card flipped
``broken-ml-worker → alice`` and the drawer refreshed with the
new assignee + the same diagnostic still firing (correct:
spawn_failures counter hasn't reset yet).
* CLI ``hermes kanban diagnostics`` prints all 5 in severity order;
``--severity error`` narrows to 3; ``kanban show <id>`` includes
the Diagnostics block at the top with suggested action hint.
Migration note
--------------
The old ``warnings`` shape (``{count, kinds, latest_at}``) is
preserved on the API but ``kinds`` now keys by diagnostic kind
(``hallucinated_cards``) instead of event kind
(``completion_blocked_hallucination``). ``highest_severity`` is a
new required field. The dashboard was the only consumer and has
been updated in the same commit; external API consumers of the
``warnings`` field will need to update their kind-match logic.
* feat(kanban/diagnostics): lead titles with the actual error text
The generic 'Worker crashed N runs in a row' / 'Worker failed to spawn
N times' titles buried the actual cause in the data section. Operators
had to open logs or expand the diagnostic to see WHY the worker is
stuck — rate-limit vs insufficient quota vs bad auth vs context
overflow vs network blip all looked identical at a glance.
New titles:
Agent crashed 3x: openai: 429 Too Many Requests - rate limit reached
Agent crashed 3x: anthropic: 402 insufficient_quota - credit balance
Agent crashed 3x: provider auth error: 401 Unauthorized
Agent spawn failed 4x: insufficient_quota: You exceeded your current
Detail keeps the full error snippet (capped at 500 chars + ellipsis
for tracebacks). Title takes the first line capped at 160 chars.
Fallback title if no error recorded stays honest ('no error recorded').
Tests: 4 new cases covering 429/billing/spawn/truncation. 383 total
pass (+4).
Live-verified on dashboard with 6 seeded scenarios
(rate-limit, billing, auth, context, network, spawn-billing) —
each card title leads with the actionable error text.
Workers completing a kanban task can now claim the ids of cards they
created via an optional ``created_cards`` field on ``kanban_complete``.
The kernel verifies each id exists and was created by the completing
worker's profile; any phantom id blocks the completion with a
``HallucinatedCardsError`` and records a
``completion_blocked_hallucination`` event on the task so the rejected
attempt is auditable. Successful completions also get a non-blocking
prose-scan pass over their ``summary`` + ``result`` that emits a
``suspected_hallucinated_references`` event for any ``t_<hex>``
reference that doesn't resolve.
Closes#20017.
Recovery UX (kernel + CLI + dashboard)
--------------------------------------
A structural gate alone isn't enough — operators also need to see and
act on stuck workers, especially when a profile's model is the root
cause. This PR ships the full loop:
* ``kanban_db.reclaim_task(task_id)`` — operator-driven reclaim that
releases an active worker claim immediately (unlike
``release_stale_claims`` which only acts after claim_expires has
passed). Emits a ``reclaimed`` event with ``manual: True`` payload.
* ``kanban_db.reassign_task(task_id, profile, reclaim_first=...)`` —
switch a task to a different profile, optionally reclaiming a stuck
running worker in the same call.
* ``hermes kanban reclaim <id> [--reason ...]`` and
``hermes kanban reassign <id> <profile> [--reclaim] [--reason ...]``
CLI subcommands wired through to the same helpers.
* ``POST /api/plugins/kanban/tasks/{id}/reclaim`` and
``POST /api/plugins/kanban/tasks/{id}/reassign`` endpoints on the
dashboard plugin.
Dashboard surfacing
-------------------
* ⚠ **warning badge** on cards with active hallucination events.
* **attention strip** at the top of the board listing all flagged
tasks; dismissible per session.
* **events callout** in the task drawer — hallucination events render
with a red left border, amber icon, and phantom ids as styled chips.
* **recovery section** in the task drawer with three actions: Reclaim,
Reassign (with profile picker + reclaim-first checkbox), and a
copy-to-clipboard hint for ``hermes -p <profile> model`` since
profile config lives on disk and can't be edited from the browser.
Auto-opens when the task has warnings, collapsed otherwise.
Keyed by task id so state doesn't leak between drawers.
Active-vs-stale rule: warnings clear when a clean ``completed`` or
``edited`` event supersedes the hallucination, so recovery is never
permanently stigmatising — the audit events persist for debugging but
the badge goes away once the worker succeeds.
Skill updates
-------------
* ``skills/devops/kanban-worker/SKILL.md`` documents the
``created_cards`` contract with good/bad examples.
* ``skills/devops/kanban-orchestrator/SKILL.md`` gains a "Recovering
stuck workers" section with the three actions and when to use each.
Tests
-----
* Kernel gate: verified-cards manifest, phantom rejection + audit
event, cross-worker rejection, prose scan positive + negative.
* Recovery helpers: reclaim on running task, reclaim on non-running
returns False, reassign refuses running without reclaim_first,
reassign with reclaim_first succeeds on running.
* API endpoints: warnings field present on /board and /tasks/:id,
warnings cleared after clean completion, reclaim 200 + 409 paths,
reassign 200 + 409 + reclaim_first paths.
* CLI smoke: reclaim + reassign subcommands.
Live-verified end-to-end on a dashboard with seeded scenarios:
attention strip renders, badges land on the right cards, drawer
callout shows phantom chips, Reclaim on a running task flips status to
ready + emits manual reclaimed event + refreshes the drawer,
Reassign swaps the assignee and triggers board refresh.
359/359 kanban-suite tests pass
(test_kanban_{db,cli,boards,core_functionality} + dashboard + tools).
* revert: auto-subscribe gateway chat on tool-driven kanban_create (#19718)
Reverts ff3d2773e2. Teknium reviewed the merged PR and decided this
behavior isn't wanted — tool-driven kanban_create should not mirror
the slash-command path's auto-subscribe. Orchestrators that want
their originating chat notified can call kanban_notify-subscribe
explicitly; we're not going to make it implicit.
* feat(kanban-dashboard): per-platform home-channel notification toggles
Adds a "Notify home channels" section to the task drawer in the kanban
dashboard plugin. Each platform where the user has set a home channel
(/sethome, TELEGRAM_HOME_CHANNEL env var, gateway.platforms.<p>.home_channel
in config.yaml) gets a toggle pill. Toggling on writes a kanban_notify_subs
row keyed to that platform's home (chat_id + thread_id); toggling off
removes it. The existing gateway notifier watcher delivers completed /
blocked / gave_up events without any new plumbing — this is purely a GUI
surface over existing machinery.
Replaces the reverted auto-subscribe behavior from #19718 with an explicit,
per-task, per-platform, user-controlled opt-in. No implicit subscription
on tool-driven kanban_create; no CLI commands; no slash commands. Just a
toggle in the drawer.
Backend (plugins/kanban/dashboard/plugin_api.py):
- GET /api/plugins/kanban/home-channels[?task_id=X]
Returns every platform with a configured home, plus a per-entry
subscribed: bool relative to task_id (false when task_id omitted).
Reads the live GatewayConfig via load_gateway_config() so env-var
overlays stay honored.
- POST /api/plugins/kanban/tasks/:id/home-subscribe/:platform
Idempotent add_notify_sub keyed to the platform's home.
- DELETE /api/plugins/kanban/tasks/:id/home-subscribe/:platform
remove_notify_sub for the same tuple.
- 404 when the platform has no home configured, or task_id doesn't
exist (POST only).
Frontend (plugins/kanban/dashboard/dist/index.js):
- TaskDrawer fetches /home-channels on open, keyed on task_id.
- HomeSubsSection renders nothing when zero platforms have a home (so
users who haven't set one up don't see an empty UI block).
- Optimistic toggle with busy flag + revert-on-failure. One pill per
platform; ✓ prefix and --on class indicate the subscribed state.
CSS (plugins/kanban/dashboard/dist/style.css):
- .hermes-kanban-home-subs flex row + .hermes-kanban-home-sub pill
style + --on subscribed variant (subtle ring-colored background).
Live-tested against a dashboard with TELEGRAM + DISCORD_BOT_TOKEN /
HOME_CHANNEL env vars set: drawer shows both pills, toggling each
flips its visual state AND writes/removes the correct kanban_notify_subs
row (verified via direct DB read).
Tests (tests/plugins/test_kanban_dashboard_plugin.py, 11 new, 53/53
pass total):
- home-channels lists only platforms with a home (slack with a
token but no home is excluded)
- no task_id -> all subscribed=false
- subscribe creates notify_sub row with correct chat/thread/platform
- subscribed=true reflected in subsequent GET
- idempotent re-subscribe
- unknown platform -> 404
- unknown task -> 404
- unsubscribe removes the row
- telegram + discord subscribe/unsubscribe independent
- zero homes -> empty list
Reporter of #19535 explicitly asked for a regression test — covers it
here so a future refactor of _set_status_direct can't silently re-enable
the direct ready/todo -> running bypass.
Asserts both: (a) HTTP 400 with 'running' in the detail message, and
(b) the task's status is unchanged after the rejected PATCH (pre-request
status preserved, no partial mutation).
Three narrow fixes targeting the remaining red checks after #17828:
1. ui-tui/src/app/slash/commands/ops.ts (Docker Build):
/reload-mcp's local params type annotated session_id: string
while ctx.sid is string | null. Widen to string | null —
matches every other rpc call site and the test harness which passes
{ session_id: null }. Fixes TS2322 on line 86. The rpc signature
itself is Record<string, unknown>, so this is purely a local
typing fix, no behavioral change.
2. tests/plugins/test_achievements_plugin.py (13 cascading test failures):
_install_fake_session_db did a raw sys.modules['hermes_state'] =
fake_module without restoration, leaking the fake across xdist
worker boundaries. Downstream tests doing from hermes_state import
SessionDB got a module whose SessionDB was lambda: fake_db
— 6 test_hermes_state.py tests failed with AttributeError: 'function'
object has no attribute '_sanitize_fts5_query' / _contains_cjk,
and 7 test_860_dedup.py tests failed with TypeError: got unexpected
keyword argument 'db_path' (real code calls SessionDB(db_path=...)).
Fix: stash monkeypatch on the plugin_api module object in the
fixture, and have the helper do monkeypatch.setitem(sys.modules,
'hermes_state', fake_module) for auto-restoration at test teardown.
3. tests/hermes_cli/test_web_server.py (WS race):
TestPtyWebSocket::test_pub_broadcasts_to_events_subscribers hit the
30s test timeout on CI. websocket_connect returns after
ws.accept() — but /api/events registers the subscriber in
_event_channels on the NEXT await (inside _event_lock). A
publish immediately after connect could race ahead of registration
and be dropped, and the subsequent receive_text() blocked until
SIGALRM killed the test. Fix: poll _event_channels after the
subscriber connects, before publishing.
Validation:
scripts/run_tests.sh tests/plugins/test_achievements_plugin.py
tests/run_agent/test_860_dedup.py
tests/test_hermes_state.py
tests/hermes_cli/test_web_server.py 338 passed
cd ui-tui && npm run type-check clean
cd ui-tui && npm run build clean
Remaining red checks are pure infra (Nix ubuntu hits
TwirpErrorResponse ResourceExhausted on the GH Actions cache API; Nix
macos bounces between npm build openssl-legacy and cache rate-limits)
and cannot be fixed in the codebase.
* feat(plugins): bundle hermes-achievements, scan full session history
Ships @PCinkusz's hermes-achievements dashboard plugin (https://github.com/PCinkusz/hermes-achievements) as a bundled plugin at plugins/hermes-achievements/ and fixes a bug in the scan path that made the plugin only see the first 200 sessions — making lifetime badges (50k tool calls, 75k errors, etc.) unreachable on long-running installs.
Changes:
- plugins/hermes-achievements/: vendor v0.3.1 verbatim (manifest, dist/, plugin_api.py, tests, docs, README).
- plugins/hermes-achievements/dashboard/plugin_api.py:
* scan_sessions(): limit=None now scans ALL sessions via SQLite LIMIT -1. Previously capped at 200, so users with 8000+ sessions saw ~2% of their history.
* evaluate_all(): first-ever scans run in a background thread so the dashboard request path never blocks. Stale snapshots serve immediately while a background refresh runs. force=True still blocks synchronously for manual /rescan.
* _build_pending_snapshot(), _start_background_scan(), _run_scan_and_update_cache(): supporting plumbing + idempotent thread spawn.
- tests/plugins/test_achievements_plugin.py: new tests covering the 200-cap regression, the background-scan first-run flow, stale-serve-plus-background-refresh, forced sync rescan, and scan-thread idempotency.
- website/docs/user-guide/features/built-in-plugins.md: lists hermes-achievements in the bundled-plugins table and documents API endpoints, state files, and performance characteristics.
E2E validated against a real 8564-session ~6.4GB state.db:
* Cold scan: 13m 19s (one-time, backgrounded — UI never blocks)
* Warm rescan: 1.47s (8563/8564 sessions reused from checkpoint cache)
* 57/60 achievements unlocked, 3 discovered — aggregates like total_tool_calls=259958, total_errors=164213, skill_events=368243 correctly surface lifetime badges that the 200-cap made unreachable.
Original credit: @PCinkusz (MIT-licensed). Upstream repo remains the staging ground for new badges; this bundle keeps the dashboard feature parity with Hermes core changes.
* feat(achievements): publish partial snapshots during cold scan
Previously a cold scan on a large session DB (13min on 8564 sessions)
showed zero badges for the entire duration, then every badge at once
when the scan completed. A dashboard refresh mid-scan was indistinguishable
from a fresh install with no history.
Now the scanner publishes a partial snapshot to _SNAPSHOT_CACHE every
250 sessions, so each refresh during a cold scan surfaces more badges
incrementally.
Mechanism:
- scan_sessions() takes an optional progress_callback fired every
progress_every sessions with (sessions_so_far, scanned, total).
- _compute_from_scan() is extracted from compute_all() and gains an
is_partial flag that skips writing to state.json — we don't want
to record unlocked_at based on a half-complete aggregate that a
later session might rebalance.
- _run_scan_and_update_cache() installs a publisher callback that
builds a partial snapshot, marks it mode='in_progress', and writes
it to the cache with age=0 so the UI keeps polling /scan-status
and picks up the final snapshot when the scan completes.
- Manual /rescan (force=True) disables partial publishing — the
caller is blocking on the final result anyway.
E2E against real 8564-session state.db (polled cache every 10s):
t=10s: cache empty
t=20s: 250/8564 scanned, 35 unlocked, 25 discovered
t=40s: 500/8564 scanned, 42 unlocked, 18 discovered
t=60s: 1000/8564 scanned, 49 unlocked, 11 discovered
...
Tests: 9/9 pass (2 new — partial snapshot publication + no-persist-on-partial).
Upstream unittest suite: 10/10 pass.
* feat(achievements): in-progress scan banner with live % progress
Previously the dashboard showed zero badges silently during long cold
scans (13min on 8564 sessions). The backend was publishing partial
snapshots every 250 sessions, but the bundled UI didn't surface any
indicator that a scan was running — it just rendered the main page
with whatever counts were currently published and no way for the user
to know more progress was coming.
UI changes (dist/index.js, dist/style.css):
- Added a scan-in-progress banner rendered between the hero and stats
when scan_meta.mode is 'pending' or 'in_progress'. Shows:
BUILDING ACHIEVEMENT PROFILE…
Scanned 1,750 of 8,564 sessions · 20%. Badges unlock as more history streams in.
with a pulsing teal indicator and a filling teal/cyan progress bar.
Disappears the moment the backend flips to 'full' or 'incremental'.
- Added an auto-poller via useEffect — while scanInFlight is true the
page re-fetches /achievements every 4s WITHOUT toggling the loading
skeleton, so unlock counts tick up visibly without the user refreshing.
The effect cleans itself up when the scan finishes.
- Added refresh() (re-fetch, no loading flip) alongside the existing
load() (full reload, used by the Rescan button).
Attribution preserved:
- Added a header comment to index.js crediting @PCinkusz
(https://github.com/PCinkusz/hermes-achievements, MIT) as the
original author, noting the banner is a layered addition on top
of the original dist bundle.
- Matching header comment in style.css, flagging the new
.ha-scan-banner* rules as the local addition.
Live-verified end to end:
- Spun up `hermes dashboard --port 9229 --no-open` against a fresh
HERMES_HOME symlinked to the real 8564-session state.db.
- Opened /achievements in a browser, confirmed the banner renders with
live progress: 'Scanned 1,000 of 8,564 sessions · 11%' → updates to
'1,250 ... · 14%' → '1,750 ... · 20%' without user interaction,
matching the backend's partial publications.
- Stats row simultaneously climbed from 35 → 49 → 53 unlocked as
more history streamed in.
- Vision analysis of the rendered page confirms the banner styling
matches the rest of the dashboard (dark card bg, teal accent, same
small-caps typography, pulsing indicator reusing ha-pulse keyframes).
feat(gateway): refine Platform._missing_ and platform-connected dispatch
Restricts plugin-name acceptance to bundled plugin scan + registry
(no arbitrary string -> enum-pollution), pulls per-platform connectivity
checks into a _PLATFORM_CONNECTED_CHECKERS lambda map with a clean
_is_platform_connected method, and adds tests covering the checker map,
plugin platform interface, and IRC setup wizard.
Follow-up to the cherry-picked PR #17447. The original flush spawned a
bare threading.Thread for the buffer-flush path, overwriting
self._sync_thread — which is aliased to the long-lived writer thread.
Two consequences:
1. No serialization with the writer queue. If old-session retains were
still queued in _retain_queue, the flush ran concurrently with the
writer and both threads could call aretain_batch against the same
document_id.
2. The pre-spawn 'self._sync_thread.join(timeout=5.0)' tried to join the
long-lived writer, which never exits, so the join was a no-op that
just timed out — never actually serialized anything.
Fix: enqueue the flush closure on _retain_queue via _ensure_writer +
put(). Natural FIFO ordering behind any pending retains, no new thread,
no broken join. Shutdown-aware so it doesn't enqueue after teardown.
Tests updated to drain via _retain_queue.join() instead of the stale
_sync_thread.join(). Added regression guard
test_flush_serializes_behind_pending_retains_via_writer_queue that
blocks the writer mid-retain to prove the flush waits in FIFO behind
the old retain.
Also seeds _retain_queue / _shutting_down / stubbed _ensure_writer on
the bare-object test helper in test_memory_session_switch.py so that
path doesn't blow up under the new queue-enqueue.
tests/plugins/memory/test_hindsight_provider.py + tests/agent/test_memory_session_switch.py: 103/103 passing.
Two data-loss / leak gaps in HindsightMemoryProvider.on_session_switch
introduced by #17409.
1. Buffered turns silently lost when retain_every_n_turns > 1.
on_session_switch unconditionally cleared _session_turns without
flushing. Users who batched every N>1 turns and switched mid-batch
(/reset, /new, /resume, /branch, or context compression) had those
buffered turns disappear. Same data-loss class as the shutdown race,
different lifecycle event.
Note commit_memory_session() -> on_session_end() runs *before*
on_session_switch on /reset, but Hindsight doesn't implement
on_session_end so the buffer survives that step and dies at clear
time. /resume, /branch, and compression skip commit_memory_session
entirely so an on_session_end impl wouldn't help them anyway.
Fix: snapshot the old _session_id, _document_id, _parent_session_id,
_turn_index, and _session_turns; spawn one final retain that lands
under the OLD document_id; then rotate state. Metadata is built
synchronously against the old self._* so session_id / lineage tags
on the flushed item all reference the prior session consistently.
2. Stale _prefetch_result leaks across switch.
If queue_prefetch ran in the old session and the result hadn't been
consumed by prefetch() yet, on_session_switch left the cached recall
text in place. The next session's first prefetch() call would return
text mined from the prior session's bank/query.
Fix: join any in-flight _prefetch_thread (3s bounded — matches
shutdown()), then clear _prefetch_result under _prefetch_lock before
rotating session_id.
Tests
-----
- tests/plugins/memory/test_hindsight_provider.py (TestSessionSwitchBufferFlush):
- buffered turns flushed under OLD document_id with OLD lineage tags
- empty buffer => no spurious retain
- _prefetch_result cleared on switch
- in-flight prefetch thread is awaited before clear (no race)
- tests/agent/test_memory_session_switch.py: factory extended to seed the
attrs the new flush path reads (_retain_source, _platform, _bank_id,
prefetch state, etc.) and stub _run_hindsight_operation so existing
switch-state assertions keep passing without network setup.
The plugin used to spawn one daemon thread per sync_turn() to do the
aretain_batch network write. On CLI exit, that pattern raced interpreter
shutdown — the last retain could reach aiohttp after asyncio's
"cannot schedule new futures" guard had fired, producing noisy logs and
silently losing the final unsaved turn:
WARNING ... Hindsight sync failed: cannot schedule new futures after
interpreter shutdown
ERROR asyncio: Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x...>
Switch to a single-writer model: each provider owns one long-lived
writer thread plus a queue. sync_turn() snapshots state and enqueues a
job; the writer drains sequentially. Once shutdown() is called:
- new sync_turn() / queue_prefetch() calls are dropped, not enqueued
- a sentinel wakes the writer so it finishes in-flight work
- shutdown joins the writer (10s) before nulling the client
Also register an idempotent atexit hook from the first sync_turn(), so
exit paths that don't go through MemoryManager.shutdown_all() (Ctrl-C,
abrupt exit) still get a chance to drain.
Tests: keep _sync_thread as a legacy alias to the writer, swap join()
calls to _retain_queue.join() (canonical wait-for-drain), add a new
TestShutdownRace suite covering single-writer reuse, post-shutdown drop,
queue draining, and shutdown idempotency.
Opt-in Langfuse tracing for Hermes conversations — LLM calls, tool
usage, usage/cost breakdown per span. Hooks into pre/post_api_request,
pre/post_llm_call, pre/post_tool_call. SDK is optional; missing SDK or
credentials renders the plugin inert.
Salvaged from PR #16845 by @kshitijk4poor, who wrote the plugin
(~875 LOC, 6 hooks, Langfuse usage-details/cost-details normalization,
read_file payload summarization).
Salvage scope (why this isn't PR #16845 as-authored):
- Lives at plugins/observability/langfuse/ (standalone kind, opt-in via
plugins.enabled) instead of a new parallel optional-plugins/
directory. Standalone bundled plugins are already opt-in — only their
plugin.yaml is scanned at startup; the Python module is not imported
unless the user enables it. The premise of optional-plugins/ (avoid
import cost for users who don't want it) is already solved by the
existing plugin system.
- Dropped the triple activation gate (plugins.enabled +
plugins.langfuse.enabled + HERMES_LANGFUSE_ENABLED). The Hermes plugin
system's own enable/disable is authoritative; runtime credentials
gate whether the hook actually traces.
- Rewrote _is_enabled() → cached _get_langfuse() with an _INIT_FAILED
sentinel. The original called hermes_cli.config.load_config() from
every hook invocation (full yaml parse + deep merge + env expansion
on every pre/post_tool_call, potentially 100+ times per turn). The
cached version reads env once and returns the cached client or None
on every subsequent call with zero further work.
- hermes tools → Langfuse Observability post-setup adds
observability/langfuse to plugins.enabled directly (via
_save_enabled_set) instead of going through an install-copy flow.
Enable:
hermes tools # interactive
hermes plugins enable observability/langfuse # manual
Required env (set by `hermes tools` or in ~/.hermes/.env):
HERMES_LANGFUSE_PUBLIC_KEY
HERMES_LANGFUSE_SECRET_KEY
HERMES_LANGFUSE_BASE_URL # optional
Co-authored-by: kshitijk4poor <kshitijk4poor@gmail.com>
* feat(plugins): google_meet — bundled plugin for join+transcribe Meet calls
v1 shipping transcribe-only. Spawns headless Chromium via Playwright,
joins an explicit https://meet.google.com/ URL, enables live captions,
and scrapes them into a transcript file the agent can read across turns.
The agent then has the meeting content in context and can do followup
work (send recap, file issues, schedule followups) with its regular tools.
Surface:
- Tools: meet_join, meet_status, meet_transcript, meet_leave, meet_say
(meet_say is a v1 stub — returns not-implemented; v2 will wire
realtime duplex audio via OpenAI Realtime / Gemini Live +
BlackHole / PulseAudio null-sink.)
- CLI: hermes meet setup | auth | join | status | transcript | stop
- Lifecycle: on_session_end auto-leaves any still-running bot.
Safety:
- URL regex rejects anything that isn't https://meet.google.com/...
- No calendar scanning, no auto-dial, no auto-consent announcement.
- Single active meeting per install; a second meet_join leaves the first.
- Platform-gated to Linux + macOS (Windows audio routing for v2 untested).
- Opt-in: standalone plugin, user must add 'google_meet' to
plugins.enabled in config.yaml.
Zero core changes. Plugin uses existing register_tool /
register_cli_command / register_hook surfaces. 21 new unit tests cover the
URL safety gate, transcript dedup + status round-trip, process-manager
refusals/start/stop paths, tool-handler JSON shape under each branch,
session-end cleanup, and platform-gated register().
* feat(plugins/google_meet): v2 realtime audio + v3 remote node host
v2 \u2014 agent speaks in-meeting
audio_bridge.py: PulseAudio null-sink (Linux) + BlackHole probe (macOS).
On Linux we load pactl module-null-sink + module-virtual-source, track
module ids for teardown; Chrome gets PULSE_SOURCE=<virt src> env so its
fake mic reads what we write to the sink. macOS just probes BlackHole
2ch and returns its device name \u2014 the plugin refuses to switch the
user's default audio input (that would surprise them).
realtime/openai_client.py: sync WebSocket client for the OpenAI Realtime
API. RealtimeSession.speak(text) sends conversation.item.create +
response.create, accumulates response.audio.delta PCM bytes, appends
them to a file. RealtimeSpeaker runs a JSONL-queue loop consuming
meet_say calls. 'websockets' is an optional dep imported lazily.
meet_bot.py: when HERMES_MEET_MODE=realtime, provisions AudioBridge,
starts RealtimeSession + speaker thread, spawns paplay to pump PCM
into the null-sink, then cleans everything up on SIGTERM. If any
realtime setup step fails, falls back cleanly to transcribe mode
with an error flagged in status.json.
process_manager.enqueue_say(): writes a JSONL line to say_queue.jsonl;
refuses when no active meeting or active meeting is transcribe-only.
tools.meet_say: real implementation; requires active mode='realtime'.
meet_join: adds mode='transcribe'|'realtime' param.
v3 \u2014 remote node host
node/protocol.py: JSON envelope (type, id, token, payload) + validate.
node/registry.py: $HERMES_HOME/workspace/meetings/nodes.json, with
resolve() auto-selecting the sole registered node when name is None.
node/server.py: NodeServer \u2014 websockets.serve, bearer-token auth,
dispatches start_bot/stop/status/transcript/say/ping onto the local
process_manager. Token auto-generated + persisted on first run.
node/client.py: NodeClient \u2014 short-lived sync WS per RPC, raises
RuntimeError on error envelopes, clean API matching the server.
node/cli.py: 'hermes meet node {run,list,approve,remove,status,ping}'
subtree; wired into the main meet CLI by cli.py so 'hermes meet node'
Just Works.
tools.py: every meet_* tool accepts node='<name>'|'auto'; when set,
routes through NodeClient to the remote bot instead of running
locally. Unknown node \u2192 clear 'no registered meet node matches ...'
error.
cli.py: 'hermes meet join --node my-mac --mode realtime' and
'hermes meet say "..." --node my-mac' route to the node; 'hermes
meet node approve <name> <url> <token>' registers one.
Tests
21 v1 tests updated (meet_say is no longer a stub; active-record now
carries mode).
20 new audio_bridge + realtime tests.
42 new node tests (protocol/registry/server/client/cli).
17 new v1/v2/v3 integration tests at the plugin level covering
enqueue_say edge cases, env var passthrough, mode validation, node
routing (known/unknown/auto/ambiguous), and argparse wiring for
`hermes meet say` + `hermes meet node` + --mode/--node flags.
Total: 100 plugin tests + 58 plugin-system tests = 158 passing.
E2E verified on Linux with fresh HERMES_HOME: plugin loads, 5 tools
register, on_session_end hook wires, 'hermes meet' CLI tree wires
including the node subtree, NodeRegistry round-trips, meet_join routes
correctly to NodeClient under node='my-mac' with mode='realtime',
enqueue_say accepts realtime/rejects transcribe, argparse parses every
new flag cleanly.
Zero changes to core. All new code lives under plugins/google_meet/.
* feat(plugins/google_meet): auto-install, admission detect, mac PCM pump, barge-in, richer status
Ready-for-live-test follow-up on PR #16364. Five additions that matter for
the first live run on a real Meet, in priority order:
1. hermes meet install [--realtime] [--yes]
pip install playwright websockets + python -m playwright install chromium
--realtime: installs platform audio deps (pulseaudio-utils on Linux via
sudo apt, blackhole-2ch + ffmpeg on macOS via brew). Prompts before
sudo/brew unless --yes. Refuses on Windows. Refuses to auto-flip the
macOS default input — user still selects BlackHole in System Settings
(deliberate; surprise audio rerouting is worse than a manual step).
2. Admission detection
_detect_admission(page): Leave-button visible OR caption region
attached OR participants list present → we're in-call.
_detect_denied(page): 'You can\'t join this video call' / 'You were
removed' / 'No one responded to your request' → bail out.
HERMES_MEET_LOBBY_TIMEOUT (default 300s) caps how long we sit in
the lobby before giving up. in_call stays False until admitted.
Status surfaces leaveReason: duration_expired | lobby_timeout |
denied | page_closed.
3. macOS PCM pump
ffmpeg reads speaker.pcm (24kHz s16le mono) and writes to the
BlackHole AVFoundation output via -f audiotoolbox
-audio_device_index <N>. _mac_audio_device_index() probes
ffmpeg -f avfoundation -list_devices true to resolve 'BlackHole 2ch'
→ numeric index. Falls back to index 0 on probe failure. Linux
paplay pump unchanged.
4. Richer status dict
_BotState now tracks realtime, realtimeReady, realtimeDevice,
audioBytesOut, lastAudioOutAt, lastBargeInAt, joinAttemptedAt,
leaveReason. RealtimeSession.audio_bytes_out / last_audio_out_at
counters fold into the status file once a second so meet_status()
can show the agent's voice activity in near-real-time.
5. Barge-in
RealtimeSession.cancel_response() sends type='response.cancel' over
the same WS (lock-guarded so it's safe to call from the caption
thread while speak() is reading frames). Handles response.cancelled
as a terminal frame type. _looks_like_human_speaker() gates triggers
so the bot's own name, 'You', 'Unknown', and blanks don't self-cancel.
Called from the caption drain loop: when a new caption arrives
attributed to a real participant while rt.session exists, we fire
cancel_response() and stamp lastBargeInAt.
Tests: 20 new unit tests across _BotState telemetry, barge-in gating,
admission/denied probe error handling, cancel_response with and without
a connected WS, and `hermes meet install` CLI wiring (flag parsing +
end-to-end subprocess.run verification + Linux-already-installed fast
path). Total 171 passing across all google_meet test files + the
plugin-system regression suite.
E2E verified on Linux: plugin loads, all 5 tools register,
`hermes meet install --realtime --yes` parses, fresh-bot status.json
has every new telemetry key, cancel_response on a disconnected session
returns False without raising, barge-in helper gates the bot's own
name correctly.
Still out of scope (for a future PR, not blocking live test):
mic → Realtime duplex (the agent listening to meeting audio via
WebRTC), node-host TLS/pairing UX, Windows audio, Meet create+Twilio.
Docs updated: SKILL.md now lists the installer subcommand, lobby
timeout, barge-in caveat, and the full status-dict reference table.
README.md quick-start uses hermes meet install.
HindsightEmbedded.close() delegates to its sync client.close(). When Hermes
created/used that client on the shared async loop, closing it from the main
thread raises 'attached to a different loop' before aiohttp releases the
session — so the ClientSession / TCPConnector leak past provider teardown.
Close the embedded inner async client on the shared loop first via
_run_sync(inner_client.aclose()), then let the wrapper's sync close()
do its daemon/UI bookkeeping.
Salvage of #14605: test placement rebased — appended TestShutdown class
after TestSharedEventLoopLifecycle (which landed on main after the PR was
written). Original author attribution preserved.
Adds an optional bank_id_template config that derives the bank name at
initialize() time from runtime context. Existing users with a static
bank_id keep the current behavior (template is empty by default).
Supported placeholders:
{profile} — active Hermes profile (agent_identity kwarg)
{workspace} — Hermes workspace (agent_workspace kwarg)
{platform} — cli, telegram, discord, etc.
{user} — platform user id (gateway sessions)
{session} — session id
Unsafe characters in placeholder values are sanitized, and empty
placeholders collapse cleanly (e.g. "hermes-{user}" with no user
becomes "hermes"). If the template renders empty, the static bank_id
is used as a fallback.
Common uses:
bank_id_template: hermes-{profile} # isolate per Hermes profile
bank_id_template: {workspace}-{profile} # workspace + profile scoping
bank_id_template: hermes-{user} # per-user banks for gateway
Reusing session_id as document_id caused data loss on /resume: when
the session is loaded again, _session_turns starts empty and the next
retain replaces the entire previously stored content.
Now each process lifecycle gets its own document_id formed as
{session_id}-{startup_timestamp}, so:
- Same session, same process: turns accumulate into one document (existing behavior)
- Resume (new process, same session): writes a new document, old one preserved
- Forks: child process gets its own document; parent's doc is untouched
Also adds session lineage tags so all processes for the same session
(or its parent) can still be filtered together via recall:
- session:<session_id> on every retain
- parent:<parent_session_id> when initialized with parent_session_id
Closes#6602
The existing test_local_embedded_setup_materializes_profile_env expected
exact equality on ~/.hermes/.env content; the new HINDSIGHT_TIMEOUT=120
line from the timeout feature now appears in that file. Append it to the
expected string so the test reflects the new post_setup output.
The module-global `_loop` / `_loop_thread` pair is shared across every
`HindsightMemoryProvider` instance in the process — the plugin loader
creates one provider per `AIAgent`, and the gateway creates one `AIAgent`
per concurrent chat session (Telegram/Discord/Slack/CLI).
`HindsightMemoryProvider.shutdown()` stopped the shared loop when any one
session ended. That stranded the aiohttp `ClientSession` and `TCPConnector`
owned by every sibling provider on a now-dead loop — they were never
reachable for close and surfaced as the `Unclosed client session` /
`Unclosed connector` warnings reported in #11923.
Fix: stop stopping the shared loop in `shutdown()`. Per-provider cleanup
still closes that provider's own client via `self._client.aclose()`. The
loop runs on a daemon thread and is reclaimed on process exit; keeping
it alive between provider shutdowns means sibling providers can drain
their own sessions cleanly.
Regression tests in `tests/plugins/memory/test_hindsight_provider.py`
(`TestSharedEventLoopLifecycle`):
- `test_shutdown_does_not_stop_shared_event_loop` — two providers share
the loop; shutting down one leaves the loop live for the other. This
test reproduces the #11923 leak on `main` and passes with the fix.
- `test_client_aclose_called_on_cloud_mode_shutdown` — each provider's
own aiohttp session is still closed via `aclose()`.
Fixes#11923.
The agent-facing image_generate tool only passes prompt + aspect_ratio to
provider.generate() (see tools/image_generation_tool.py:953). The editing
block (reference_images / edit_image kwargs) could never fire from the
tool surface, and the xAI edits endpoint is /images/edits with a
different payload shape anyway — not /images/generations as submitted.
- Remove reference_images / edit_image kwargs handling from generate()
- Remove matching test_with_reference_images case
- Update docstring + plugin.yaml description to text-to-image only
- Surface resolution in the success extras
Follow-up to PR #14547. Tests: 18/18 pass.
New built-in image_gen backend at plugins/image_gen/openai-codex/ that
exposes the same gpt-image-2 low/medium/high tier catalog as the
existing 'openai' plugin, but routes generation through the ChatGPT/
Codex Responses image_generation tool path. Available whenever the user
has Codex OAuth signed in; no OPENAI_API_KEY required.
The two plugins are independent — users select between them via
'hermes tools' → Image Generation, and image_gen.provider in
config.yaml. The existing 'openai' (API-key) plugin is unchanged.
Reuses _read_codex_access_token() and _codex_cloudflare_headers() from
agent.auxiliary_client so token expiry / cred-pool / Cloudflare
originator handling stays in one place.
Inspired by #14047 by @Hygaard, but re-implemented as a separate
plugin instead of an in-place fork of the openai plugin.
Closes#11195
- Add configurable retain_tags / retain_source / retain_user_prefix /
retain_assistant_prefix knobs for native Hindsight.
- Thread gateway session identity (user_name, chat_id, chat_name,
chat_type, thread_id) through AIAgent and MemoryManager into
MemoryProvider.initialize kwargs so providers can scope and tag
retained memories.
- Hindsight attaches the new identity fields as retain metadata,
merges per-call tool tags with configured default tags, and uses
the configurable transcript labels for auto-retained turns.
Co-authored-by: Abner <abner.the.foreman@agentmail.to>
* feat(plugins): pluggable image_gen backends + OpenAI provider
Adds a ImageGenProvider ABC so image generation backends register as
bundled plugins under `plugins/image_gen/<name>/`. The plugin scanner
gains three primitives to make this work generically:
- `kind:` manifest field (`standalone` | `backend` | `exclusive`).
Bundled `kind: backend` plugins auto-load — no `plugins.enabled`
incantation. User-installed backends stay opt-in.
- Path-derived keys: `plugins/image_gen/openai/` gets key
`image_gen/openai`, so a future `tts/openai` cannot collide.
- Depth-2 recursion into category namespaces (parent dirs without a
`plugin.yaml` of their own).
Includes `OpenAIImageGenProvider` as the first consumer (gpt-image-1.5
default, plus gpt-image-1, gpt-image-1-mini, DALL-E 3/2). Base64
responses save to `$HERMES_HOME/cache/images/`; URL responses pass
through.
FAL stays in-tree for this PR — a follow-up ports it into
`plugins/image_gen/fal/` so the in-tree `image_generation_tool.py`
slims down. The dispatch shim in `_handle_image_generate` only fires
when `image_gen.provider` is explicitly set to a non-FAL value, so
existing FAL setups are untouched.
- 41 unit tests (scanner recursion, kind parsing, gate logic,
registry, OpenAI payload shapes)
- E2E smoke verified: bundled plugin autoloads, registers, and
`_handle_image_generate` routes to OpenAI when configured
* fix(image_gen/openai): don't send response_format to gpt-image-*
The live API rejects it: 'Unknown parameter: response_format'
(verified 2026-04-21 with gpt-image-1.5). gpt-image-* models return
b64_json unconditionally, so the parameter was both unnecessary and
actively broken.
* feat(image_gen/openai): gpt-image-2 only, drop legacy catalog
gpt-image-2 is the latest/best OpenAI image model (released 2026-04-21)
and there's no reason to expose the older gpt-image-1.5 / gpt-image-1 /
dall-e-3 / dall-e-2 alongside it — slower, lower quality, or awkward
(dall-e-2 squares only). Trim the catalog down to a single model.
Live-verified end-to-end: landscape 1536x1024 render of a Moog-style
synth matches prompt exactly, 2.4MB PNG saved to cache.
* feat(image_gen/openai): expose gpt-image-2 as three quality tiers
Users pick speed/fidelity via the normal model picker instead of a
hidden quality knob. All three tier IDs resolve to the single underlying
gpt-image-2 API model with a different quality parameter:
gpt-image-2-low ~15s fast iteration
gpt-image-2-medium ~40s default
gpt-image-2-high ~2min highest fidelity
Live-measured on OpenAI's API today: 15.4s / 40.8s / 116.9s for the
same 1024x1024 prompt.
Config:
image_gen.openai.model: gpt-image-2-high
# or
image_gen.model: gpt-image-2-low
# or env var for scripts/tests
OPENAI_IMAGE_MODEL=gpt-image-2-medium
Live-verified end-to-end with the low tier: 18.8s landscape render of a
golden retriever in wildflowers, vision-confirmed exact match.
* feat(tools_config): plugin image_gen providers inject themselves into picker
'hermes tools' → Image Generation now shows plugin-registered backends
alongside Nous Subscription and FAL.ai without tools_config.py needing
to know about them. OpenAI appears as a third option today; future
backends appear automatically as they're added.
Mechanism:
- ImageGenProvider gains an optional get_setup_schema() hook
(name, badge, tag, env_vars). Default derived from display_name.
- tools_config._plugin_image_gen_providers() pulls the schemas from
every registered non-FAL plugin provider.
- _visible_providers() appends those rows when rendering the Image
Generation category.
- _configure_provider() handles the new image_gen_plugin_name marker:
writes image_gen.provider and routes to the plugin's list_models()
catalog for the model picker.
- _toolset_needs_configuration_prompt('image_gen') stops demanding a
FAL key when any plugin provider reports is_available().
FAL is skipped in the plugin path because it already has hardcoded
TOOL_CATEGORIES rows — when it gets ported to a plugin in a follow-up
PR the hardcoded rows go away and it surfaces through the same path
as OpenAI.
Verified live: picker shows Nous Subscription / FAL.ai / OpenAI.
Picking OpenAI prompts for OPENAI_API_KEY, then shows the
gpt-image-2-low/medium/high model picker sourced from the plugin.
397 tests pass across plugins/, tools_config, registry, and picker.
* fix(image_gen): close final gaps for plugin-backend parity with FAL
Two small places that still hardcoded FAL:
- hermes_cli/setup.py status line: an OpenAI-only setup showed
'Image Generation: missing FAL_KEY'. Now probes plugin providers
and reports '(OpenAI)' when one is_available() — or falls back to
'missing FAL_KEY or OPENAI_API_KEY' if nothing is configured.
- image_generate tool schema description: said 'using FAL.ai, default
FLUX 2 Klein 9B'. Rewrote provider-neutral — 'backend and model are
user-configured' — and notes the 'image' field can be a URL or an
absolute path, which the gateway delivers either way via
extract_local_files().
Plugins now require explicit consent to load. Discovery still finds every
plugin — user-installed, bundled, and pip — so they all show up in
`hermes plugins` and `/plugins`, but the loader only instantiates
plugins whose name appears in `plugins.enabled` in config.yaml. This
removes the previous ambient-execution risk where a newly-installed or
bundled plugin could register hooks, tools, and commands on first run
without the user opting in.
The three-state model is now explicit:
enabled — in plugins.enabled, loads on next session
disabled — in plugins.disabled, never loads (wins over enabled)
not enabled — discovered but never opted in (default for new installs)
`hermes plugins install <repo>` prompts "Enable 'name' now? [y/N]"
(defaults to no). New `--enable` / `--no-enable` flags skip the prompt
for scripted installs. `hermes plugins enable/disable` manage both lists
so a disabled plugin stays explicitly off even if something later adds
it to enabled.
Config migration (schema v20 → v21): existing user plugins already
installed under ~/.hermes/plugins/ (minus anything in plugins.disabled)
are auto-grandfathered into plugins.enabled so upgrades don't silently
break working setups. Bundled plugins are NOT grandfathered — even
existing users have to opt in explicitly.
Also: HERMES_DISABLE_BUNDLED_PLUGINS env var removed (redundant with
opt-in default), cmd_list now shows bundled + user plugins together with
their three-state status, interactive UI tags bundled entries
[bundled], docs updated across plugins.md and built-in-plugins.md.
Validation: 442 plugin/config tests pass. E2E: fresh install discovers
disk-cleanup but does not load it; `hermes plugins enable disk-cleanup`
activates hooks; migration grandfathers existing user plugins correctly
while leaving bundled plugins off.
The original name was cute but non-obvious; disk-cleanup says what it
does. Plugin directory, script, state path, log lines, slash command,
and test module all renamed. No user-visible state exists yet, so no
migration path is needed.
New website page "Built-in Plugins" documents the <repo>/plugins/<name>/
source, how discovery interacts with user/project plugins, the
HERMES_DISABLE_BUNDLED_PLUGINS escape hatch, disk-cleanup's hook
behaviour and deletion rules, and guidance on when a plugin belongs
bundled vs. user-installable. Added to the Features → Core sidebar next
to the main Plugins page, with a cross-reference from plugins.md.
Rewires @LVT382009's disk-guardian (PR #12212) from a skill-plus-script
into a plugin that runs entirely via hooks — no agent compliance needed.
- post_tool_call hook auto-tracks files created by write_file / terminal
/ patch when they match test_/tmp_/*.test.* patterns under HERMES_HOME
- on_session_end hook runs cmd_quick cleanup when test files were
auto-tracked during the turn; stays quiet otherwise
- /disk-guardian slash command keeps status / dry-run / quick / deep /
track / forget for manual use
- Deterministic cleanup rules, path safety, atomic writes, and audit
logging preserved from the original contribution
- Protect well-known top-level state dirs (logs/, memories/, sessions/,
cron/, cache/, etc.) from empty-dir removal so fresh installs don't
get gutted on first session end
The plugin system gains a bundled-plugin discovery path (<repo>/plugins/
<name>/) alongside user/project/entry-point sources. Memory and
context_engine subdirs are skipped — they keep their own discovery
paths. HERMES_DISABLE_BUNDLED_PLUGINS=1 suppresses the scan; the test
conftest sets it by default so existing plugin tests stay clean.
Co-authored-by: LVT382009 <levantam.98.2324@gmail.com>
Cuts shard-3 local runtime in half by neutralizing real wall-clock
waits across three classes of slow test:
## 1. Retry backoff mocks
- tests/run_agent/conftest.py (NEW): autouse fixture mocks
jittered_backoff to 0.0 so the `while time.time() < sleep_end`
busy-loop exits immediately. No global time.sleep mock (would
break threading tests).
- test_anthropic_error_handling, test_413_compression,
test_run_agent_codex_responses, test_fallback_model: per-file
fixtures mock time.sleep / asyncio.sleep for retry / compression
paths.
- test_retaindb_plugin: cap the retaindb module's bound time.sleep
to 0.05s via a per-test shim (background writer-thread retries
sleep 2s after errors; tests don't care about exact duration).
Plus replace arbitrary time.sleep(N) waits with short polling
loops bounded by deadline.
## 2. Subprocess sleeps in production code
- test_update_gateway_restart: mock time.sleep. Production code
does time.sleep(3) after `systemctl restart` to verify the
service survived. Tests mock subprocess.run \u2014 nothing actually
restarts \u2014 so the wait is dead time.
## 3. Network / IMDS timeouts (biggest single win)
- tests/conftest.py: add AWS_EC2_METADATA_DISABLED=true plus
AWS_METADATA_SERVICE_TIMEOUT=1 and ATTEMPTS=1. boto3 falls back
to IMDS (169.254.169.254) when no AWS creds are set. Any test
hitting has_aws_credentials() / resolve_aws_auth_env_var() (e.g.
test_status, test_setup_copilot_acp, anything that touches
provider auto-detect) burned ~2-4s waiting for that to time out.
- test_exit_cleanup_interrupt: explicitly mock
resolve_runtime_provider which was doing real network auto-detect
(~4s). Tests don't care about provider resolution \u2014 the agent
is already mocked.
- test_timezone: collapse the 3-test "TZ env in subprocess" suite
into 2 tests by checking both injection AND no-leak in the same
subprocess spawn (was 3 \u00d7 3.2s, now 2 \u00d7 4s).
## Validation
| Test | Before | After |
|---|---|---|
| test_anthropic_error_handling (8 tests) | ~80s | ~15s |
| test_413_compression (14 tests) | ~18s | 2.3s |
| test_retaindb_plugin (67 tests) | ~13s | 1.3s |
| test_status_includes_tavily_key | 4.0s | 0.05s |
| test_setup_copilot_acp_skips_same_provider_pool_step | 8.0s | 0.26s |
| test_update_gateway_restart (5 tests) | ~18s total | ~0.35s total |
| test_exit_cleanup_interrupt (2 tests) | 8s | 1.5s |
| **Matrix shard 3 local** | **108s** | **50s** |
No behavioral contract changed \u2014 tests still verify retry happens,
service restart logic runs, etc.; they just don't burn real seconds
waiting for it.
Supersedes PR #11779 (those changes are included here).
First pass of test-suite reduction to address flaky CI and bloat.
Removed tests that fall into these change-detector patterns:
1. Source-grep tests (tests/gateway/test_feishu.py, test_email.py): tests
that call inspect.getsource() on production modules and grep for string
literals. Break on any refactor/rename even when behavior is correct.
2. Platform enum tautologies (every gateway/test_X.py): assertions like
`Platform.X.value == 'x'` duplicated across ~9 adapter test files.
3. Toolset/PLATFORM_HINTS/setup-wizard registry-presence checks: tests that
only verify a key exists in a dict. Data-layout tests, not behavior.
4. Argparse wiring tests (test_argparse_flag_propagation, test_subparser_routing
_fallback): tests that do parser.parse_args([...]) then assert args.field.
Tests Python's argparse, not our code.
5. Pure dispatch tests (test_plugins_cmd.TestPluginsCommandDispatch): patch
cmd_X, call plugins_command with matching action, assert mock called.
Tests the if/elif chain, not behavior.
6. Kwarg-to-mock verification (test_auxiliary_client ~45 tests,
test_web_tools_config, test_gemini_cloudcode, test_retaindb_plugin): tests
that mock the external API client, call our function, and assert exact
kwargs. Break on refactor even when behavior is preserved.
7. Schedule-internal "function-was-called" tests (acp/test_server scheduling
tests): tests that patch own helper method, then assert it was called.
Kept behavioral tests throughout: error paths (pytest.raises), security
tests (path traversal, SSRF, redaction), message alternation invariants,
provider API format conversion, streaming logic, memory contract, real
config load/merge tests.
Net reduction: 169 tests removed. 38 empty classes cleaned up.
Collected before: 12,522 tests
Collected after: 12,353 tests