For adapter plugins, ``PlatformEntry.check_fn`` doubles as a lazy installer:
calling it pip-installs the platform SDK as a side effect (see e.g.
``plugins/platforms/discord/adapter.py::check_discord_requirements``). The
enablement sweep in ``_apply_env_overrides`` called ``check_fn`` for every
registered plugin platform unconditionally, so a single
``load_gateway_config()`` — which the desktop/dashboard readiness probe
``GET /api/status`` awaits synchronously — pip-installed Discord, Telegram,
Slack, Feishu and Dingtalk even when the user configured none of them
(``platforms: none``). On a slow or restricted network the installs ran long
enough to block the event loop past the desktop's readiness timeouts, so the
app timed out, killed and re-spawned the backend, and boot-looped (stuck at
94%).
Consult the cheap ``is_connected`` credential check FIRST and only run the
install-triggering ``check_fn`` for platforms that are already enabled or
actually configured. Auto-enable-by-credentials is unchanged: a platform with
its token set still gets its SDK installed and enabled.
Follow-up to #50238/#50381. The restart-loop is now SAFE (marker + launch
gate), but the trigger that lured users into relaunching mid-update remained:
on the in-app update hand-off the desktop window vanished almost immediately
(app.quit() 600ms after spawning the detached updater), before the updater's
own window appeared — a blank-screen gap that looks like a crash.
- Linger on the update overlay for UPDATE_HANDOFF_DWELL_MS (2.5s, was 600ms)
before quitting, on BOTH hand-off paths (in-app update + Windows bootstrap
recovery), so the message lands and bridges to the updater window.
- Strengthen the restart-stage copy and the overlay's applyingBody/applyingClose
to explicitly tell the user the window will reopen automatically and NOT to
reopen Hermes themselves while it updates. All four locales (en/ja/zh/zh-hant)
updated in parity.
Pure UX; does not touch the #50381 marker/gate mutual-exclusion safety net.
The browser orphan reaper reads a daemon PID from a `.pid` file in a
world-writable, predictably-named temp dir (`/tmp/agent-browser-h_*`) it
does not write itself, then tree-kills that PID via `_terminate_host_pid`
after only a liveness check. A same-user actor could plant a fake socket
dir whose `.pid` points at an arbitrary victim process, and OS PID reuse
after the real daemon exits could land the recorded PID on an unrelated
process — either way an arbitrary same-user process (and its whole tree)
gets SIGTERMed. Local DoS.
Add `_verify_reapable_browser_daemon()`, gated before the kill: via psutil
(a hard dep, fine cross-platform for the same-user processes the reaper can
signal) require both (1) identity — `agent-browser` in the process
name/cmdline — and (2) binding — the live process references *this* session's
socket dir in its cmdline or `AGENT_BROWSER_SOCKET_DIR`. The binding check is
the real spoof defense: a planted/recycled PID won't embed our exact socket
path. Fail-closed on any ambiguity (unreadable cmdline, no match), leaving the
process and its socket dir untouched for a later sweep.
Builds on @sgaofen's fix in #14394 (cmdline identity check); rewritten to use
psutil instead of `/proc`+`ps` (cross-platform, Windows-covered) and to add
the session-socket-dir binding check for recycled-PID / spoof resistance.
Co-authored-by: sgaofen <135070653+sgaofen@users.noreply.github.com>
Follow-up to the salvaged #9560 fix:
- Replace the _TRAVERSAL_RE regex with an explicit _is_path_unsafe() helper
(drops the now-unused `import re`); catches a path separator ANYWHERE,
not just leading, so a non-leading Windows backslash can't slip through.
- Switch the per-entry skip in _ensure_loaded_locked from print() to
logger.warning to match the module's logging conventions.
- Add AUTHOR_MAP entry for the contributor.
- Add regression tests for the non-leading-separator case.
Extends the CWE-22 path traversal guard to cover Windows absolute paths
of the form C:/... and D:\... — previously only leading / and \ were
checked, which missed drive-letter prefixes. Replaces the inline
startswith check with a compiled module-level regex (_TRAVERSAL_RE) that
covers all three attack patterns: .., leading /\, and leading X: drives.
Adds two regression tests for C:/windows/system32 and D:\\path\\to\\file.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Addresses PR #9560 review comments: applies the CWE-22 fix to current main
(post-PR #458 rebase) and adds the requested regression tests.
- SessionEntry.from_dict now raises ValueError for session_key or session_id
containing '..' or starting with '/' or '\' (directory traversal guard)
- SessionStore._ensure_loaded moves per-entry validation inside the loop so
one malicious/corrupt entry is skipped with a warning instead of aborting
the entire sessions.json load
- Adds TestSessionEntryFromDictTraversalValidation (5 cases) and
TestEnsureLoadedSkipsInvalidEntries covering the skip-not-abort behavior
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Secret redaction only matched `Authorization: Bearer <token>`. Other auth
headers passed through verbatim into logs, tool output, and transcripts:
- `Authorization: Basic <base64>` — leaks base64(user:password)
- `Authorization: token <pat>` / any non-Bearer scheme
- `Proxy-Authorization: ...`
- `x-api-key: <key>` (Anthropic and many providers) and `api-key`,
`x-goog-api-key`, `x-auth-token`, `x-access-token`, ... — opaque values with
no known vendor prefix were caught by nothing
A logged request or an echoed `curl -H "x-api-key: ..."` command therefore
leaked live credentials.
Generalize the Authorization rule to mask the credential for any scheme (and
Proxy-Authorization) while preserving the header name and scheme word for
debuggability, and add an api-key header rule for the single-opaque-value
headers. Bearer behavior is unchanged; plain prose containing the word
"authorization" (no colon-delimited value) is left untouched.
Adds regression tests for Basic/token/Proxy auth and the x-api-key/api-key
headers, including inside a curl command.
Follow-up to the salvaged #25961 fix: regression tests asserting that
scope-bearing IPv6 addresses (fe80::1%eth0, ::1%lo) are blocked by
is_safe_url after the scope is stripped, that a still-unparseable address
fails closed, and that a scoped IPv4-mapped IMDS address is caught by the
always-blocked floor.
ipaddress.ip_address() raises ValueError on IPv6 addresses with scope
IDs (e.g. 'fe80::1%eth0'). Both is_always_blocked_url() and is_safe_url()
silently skipped these via `except ValueError: continue`.
If ALL resolved addresses for a hostname carry scope IDs, every address
is skipped and the URL passes all safety checks — a potential SSRF
bypass vector against link-local or metadata endpoints.
Fix:
- Strip the scope ID (%eth0) before parsing in both functions
- is_safe_url(): fail closed (return False) with a warning log if still
unparseable after stripping
- is_always_blocked_url(): use continue (not return False) to preserve
multi-address scanning, with a warning log
Affected: tools/url_safety.py — is_always_blocked_url(), is_safe_url()
When a streamed Telegram reply finalizes, the stream consumer could take
the fresh-final path (send a new sendRichMessage + best-effort delete the
preview) purely because the time-based _should_send_fresh_final()
threshold elapsed — even though Telegram's prefers_fresh_final_streaming
returns False. The fresh Rich Message then overlapped the legacy
MarkdownV2 preview already on screen, leaving both visible (the #47048
table + bullet double-render).
Honor the adapter's decision: when prefers_fresh_final_streaming exists
on the adapter (checked on the class + instance __dict__ so MagicMock
auto-attrs don't false-positive) and declines, the time threshold no
longer overrides it. Adapters without the hook keep the time-based
fresh-final for backward compat.
Fixes#47048
Fold in the #40715 blank-env OOM fix on top of the host-resolution change:
- connect() now sets a non-retryable fatal error when required settings are
missing, so the gateway stops reconnecting against an empty host instead of
looping forever and leaking memory until the host OOM-kills.
- check_email_requirements() treats blank/whitespace-only EMAIL_* values as
missing, so an abandoned setup with empty keys no longer enables the platform.
Credits the parallel fixes by zerone0x (#40745) and liuhao1024 (#40829).
The email adapter read address/host purely from env vars and never stripped
them, so a missing or whitespace-padded EMAIL_IMAP_HOST reached
imaplib.IMAP4_SSL("") and surfaced as the misleading
"[Errno 8] nodename nor servname provided, or not known" — sending users down a
DNS rabbit hole when the real problem was an empty/dirty host string. A
config.yaml-only setup also left the host empty because __init__ ignored
PlatformConfig.extra, even though the "connected" check, the send helper, and
`hermes config show` already read address/imap_host/smtp_host from it.
Resolve address/imap_host/smtp_host from the env var first, then fall back to
config.extra, and strip surrounding whitespace — matching the send helper's
existing pattern. Validate the required settings at the start of connect() and
return False with an actionable message instead of attempting a connection with
an empty host.
Adds regression tests for whitespace stripping, config.extra fallback, and the
no-IMAP-attempt-on-missing-host path.
- Add thread-scoped regression test: interrupt on the waiting thread resolves
the approval as deny well under the 300s timeout; a foreign-thread interrupt
does NOT release the wait (interrupts are per-thread).
- Add panghuer023 to AUTHOR_MAP for the salvaged #37994 fix.
A dangerous-command gateway approval blocks the agent's execution thread
inside _await_gateway_decision() on threading.Event.wait() until the user
responds or the 5-minute approval timeout fires. The poll loop never checked
is_interrupted(), so /stop (which flags the agent's execution thread via
AIAgent.interrupt()) was silently ignored — the session stayed wedged until
timeout, even though /stop reported the session unlocked.
Check is_interrupted() at the top of the poll loop. The wait runs on the
agent's execution thread, the exact thread interrupt() flags, so the check
sees the signal and resolves the pending approval as deny — the agent loop
receives a normal denial and unwinds cleanly. Covers /stop, /new, and the
gateway inactivity-timeout interrupt through the single shared wait loop used
by both the terminal and execute_code guards.
A bare custom provider configured via `model.api_base` (the intuitive name
OpenAI-SDK / LiteLLM users reach for) was silently ignored: `hermes config set`
accepts any dotted key, so `model.api_base` got written and confirmed, but the
runtime resolver reads only `model.base_url`. Requests fell back to OpenRouter
with an empty key -> 401, zero hits to the custom endpoint (issue #8919).
Now api_base is migrated to base_url at load time (fixes existing broken
configs) and at set time (with a notice), never overriding an explicit
base_url. Closes#8919.
On Windows, _pause_windows_gateways_for_update() force-kills every running
gateway before mutating the venv. Gateways mapped to a profile (via
profile.path/gateway.pid) were respawned afterward, but gateways with NO
profile mapping — e.g. a Windows Scheduled Task running
"pythonw.exe -m hermes_cli.main gateway run" — were force-killed and only
told to restart manually. After an auto-update/bootstrap the Telegram bot
stayed dead until manual intervention.
Now we snapshot each unmapped gateway's argv (psutil, guarded by
looks_like_gateway_command_line) before the kill and replay it through the
same detached watcher used for profile gateways, so unmapped gateways come
back automatically too.
Co-authored-by: Hermes Agent <agent@nousresearch.com>
When a /model switch resolves a valid model but the in-place agent swap
fails mid-conversation (expired key, unreachable base_url), the agent
rolls itself back to the old working model+client and re-raises. The
callers caught that re-raise, logged a warning, then committed the broken
switch anyway: wrote the failed model to the session DB, set
_session_model_overrides to the broken model/provider/key, and (gateway
direct path) evicted the working cached agent. The next message then
rebuilt a dead agent from the broken override -> permanently unusable
conversation (#50163).
Fix the whole caller class so a failed swap aborts the commit entirely:
- gateway/slash_commands.py (picker + direct /model paths): on swap
failure, early-return an error message; skip DB persist, session
override, cache eviction, and config write.
- cli.py (both /model handlers): snapshot CLI-level credential/runtime
fields before mutating, restore them on swap failure, and abort the
note + success print.
- tui_gateway/server.py: wrap the previously-unguarded swap; on failure
raise a clean error and skip worker restart, runtime persist, switch
marker, session model_override, and config persist.
The no-cached-agent path (apply-on-next-session) is unaffected.
Adds a gateway regression test that fails on the pre-fix behavior.
Per @egilewski's audit on this PR (#15544), the original fix was
correct but the file has refactored since: the four endpoint-local
empty-peer checks have been consolidated into _ws_client_is_allowed
and _ws_client_reason, but the helpers were left fail-open ('no peer
host known means allow' / 'no reason to block').
On a loopback-bound dashboard with auth disabled, an ASGI server
behind a misconfigured proxy or a unix-socket transport can deliver
ws.client == None or ws.client.host == ''. The helpers were treating
that as 'allowed', so the loopback-only peer gate could be bypassed
by anything that suppressed the client tuple in transit. All four
WebSocket endpoints (/api/pty, /api/ws, /api/pub, /api/events) route
through _ws_request_is_allowed -> _ws_client_is_allowed, so the gap
applied uniformly.
Fix:
* _ws_client_is_allowed: return False when client_host is empty
instead of True. Only reached on loopback bind with auth disabled
(auth_required=True and explicit non-loopback binds short-circuit
earlier), so the fail-closed behavior is scoped to the surface
that needs it.
* _ws_client_reason: return a 'missing_or_empty_peer bound=...'
block reason instead of None, so the dispatcher's existing
reason-based rejection path picks it up and the close gets logged
with a machine-parseable token for diagnosability.
Behavior unchanged for:
* gated mode (auth_required=True) — early-returns True before the
empty-peer check runs. The OAuth ticket is the auth at that point.
* explicit non-loopback bind (--host 0.0.0.0/::, or a specific LAN
address, always with --insecure) — early-returns True before the
empty-peer check runs. DNS-rebinding is still blocked by the
Host/Origin guard in _ws_host_origin_is_allowed.
* legitimate loopback peers (client_host == '127.0.0.1' / '::1') —
not affected by the empty-peer branch.
Regression tests added in tests/hermes_cli/test_dashboard_auth_ws_auth.py:
* test_empty_client_host_rejected_in_loopback_mode
* test_missing_client_object_rejected_in_loopback_mode
* test_empty_client_host_reason_is_block
Plus two regression guards to ensure the fix does not over-reach:
* test_empty_client_host_still_allowed_in_insecure_public_mode
* test_empty_client_host_still_allowed_in_gated_mode
All three new fail-closed tests fail without this patch (the helpers
return True / None for an empty peer) and pass with it. The 45
pre-existing tests in test_dashboard_auth_ws_auth.py continue to pass.
Baileys' jidDecode crashes ("Cannot destructure property 'user' of
jidDecode(...) as it is undefined") when handed a bare phone number, so
sending a WhatsApp message to +50766715226 / 50766715226 returned HTTP
500 and never delivered (#8637).
Add to_whatsapp_jid() to gateway/whatsapp_identity.py — the outbound
inverse of normalize_whatsapp_identifier: it builds the JID a send must
use (bare phone -> <digits>@s.whatsapp.net) and passes through already
qualified JIDs (@g.us, @lid, status@broadcast, @newsletter) unchanged.
Wire it at every outbound bridge call site in the WhatsApp adapter
(send, edit, media, typing, get_chat_info, and the standalone cron /
send_message sender).
Co-authored-by: Hermes Agent <noreply@nousresearch.com>
When a Windows user relaunches Hermes while an in-app update is still
running (the desktop vanished with no progress and looks crashed), the
fresh instance spawns its own dashboard backend. That backend re-locks
the venv shim, the updater's straggler cleanup (force_kill_other_hermes
-> taskkill /F /T /IM hermes.exe) kills it, the launch dies with the 45s
"backend didn't come up" timeout, and the user relaunches into the same
trap -- an infinite respawn/kill loop (#50238).
Root cause: no mutual exclusion between an applying update and a fresh
desktop spawning its own local backend.
Fix: the updater publishes a HERMES_HOME/.hermes-update-in-progress
marker (pid + start time) for the whole run via an RAII drop-guard that
removes it on every exit path (success, early return, panic). A
freshly-launched desktop checks the marker before spawning its local
backend and PARKS until the update finishes -- then brings the backend
up itself (it is the surviving instance; the updater's own relaunch hits
the single-instance lock and quits). A stale marker (dead pid or past a
20-minute ceiling) is pruned so a crashed updater can never strand
future launches. No rogue backend spawns mid-update, so
force_kill_other_hermes has nothing legitimate to kill.
Marker parse/staleness logic is extracted to update-marker.cjs and
unit-tested; the Rust guard has unit tests; the Rust-write <-> JS-read
contract is E2E-verified.
In Telegram streaming, the typing indicator persisted through the slow
final rich-text/MarkdownV2 finalize edit, so the '...typing' bubble
lingered for seconds after the last streamed token. Add a one-shot
on_before_finalize hook to GatewayStreamConsumer, fired once when the
stream transitions into its finalization path, and wire it on both
Telegram streaming call sites to call pause_typing_for_chat() before
the final edit. Cover hook ordering and once-only behavior in tests.
Fixes#49712
Root cause of #49145: the Windows ZIP-update path did rmtree(dst) then
copytree(src, dst). If the copy failed partway — common on that path,
which only runs because file I/O is already flaky on the machine — the
directory was left deleted with nothing copied back. ui-tui/ vanishing
is what broke 'hermes --tui' (WinError 267), but the bug hit every
top-level directory.
_atomic_replace_dir stages the new copy into a sibling temp dir and only
swaps it in on full success, restoring the original on failure. A failed
update now leaves the live tree untouched instead of half-deleted.
The Windows update path can leave tracked ui-tui/ files deleted in the
working tree (HEAD intact). The guard now self-heals: when ui-tui/ is
missing in a git checkout, run `git restore -- ui-tui` and continue,
falling back to the printed manual-recovery steps only when git can't
recover it (no checkout / restore failed).
Builds on konsisumer's missing-workspace guard.
Answers a recurring plugin-author question: how to read the active
profile and drive Hermes from inside a hook callback when ctx._cli_ref
is None (gateway, hermes chat -q, and kanban-spawned worker sessions).
- Adds a 'Act from inside a hook' section to the plugin guide covering
ctx.profile_name and ctx.dispatch_tool as the session-agnostic APIs,
with a kanban_task_blocked example, and notes there is no in-process
slash-command bridge for headless workers (shell out via the terminal
tool instead).
- Adds the three kanban lifecycle hooks to the hook reference table with
their process semantics.
- Pins the contract with a regression test: ctx.dispatch_tool invokes a
tool handler with _cli_ref=None (worker/hook context).
Requested by @Smithangshu on Discord.
Follow-up on salvaged #50347: the event surface table was missing the
billing.step_up.verification switch case, and the File map omitted
lib/perfPane.tsx.
After a worker crash + reclaim + respawn, the board could show a task in the
Ready lane while its task_run was 'running' and the new worker was actively
executing (#36910). The dispatcher could then treat live work as available and
double-assign.
Root cause: the three reclaim paths (detect_crashed_workers,
release_stale_claims heartbeat-stale backstop, enforce_max_runtime) each
snapshot a task's worker_pid/claim_lock, do liveness work, then reset
tasks.status back to 'ready' with only a 'WHERE status=running' guard. If the
task was reclaimed AND re-claimed by a NEW worker in between (new run, new
claim_lock, live pid), the stale UPDATE clobbered the live task: status flipped
to 'ready' while the fresh run stayed 'running'. claim_task is the only writer
that sets status='running', so nothing put it back — permanent desync.
Fix: gate each reset on the snapshot's claim_lock (and worker_pid where
available) so it only fires when the task is still owned by the worker the
reclaim was computed for. A stale reclaim now no-ops (rowcount 0) instead of
desyncing a re-claimed task. Genuine crashes (lock still matches) reclaim
exactly as before.
This is the same race class the in-gateway dispatch lock (single-writer ticks)
mitigates, closed at the row level so a single dispatcher's fast
reclaim->respawn across two ticks is also safe.
Closes#36910.
Per @egilewski's audit on this PR, the security fix is behaviorally
correct but lacks focused regression coverage for the two traversal
vectors it closes. Adding tests now so the path-traversal guard
cannot silently regress.
* test_restore_rejects_snapshot_id_traversal -- exercises the
snapshot_id input guard with seven hostile values (parent
traversal, single parent, bare '.', bare '..', forward slash,
backslash, empty string). Each must return False without touching
the filesystem.
* test_restore_rejects_manifest_rel_traversal -- exercises the
manifest rel guard by injecting '../../outside.txt' into a real
snapshot's manifest.json, seeding a source payload at the escaped
path, and asserting the destination outside HERMES_HOME does not
exist after restore. This is the higher-value test of the pair --
verified locally that it fails without the fix in
restore_quick_snapshot (the escape destination gets written) and
passes with the fix in place.
The 67 pre-existing tests in test_backup.py continue to pass.
The gateway dispatcher captured kanban.auto_decompose ONCE at boot, so a user
who flipped it to false to STOP auto-decompose had no way to make that take
effect short of restarting the gateway. Reported (#49638): auto-decompose
created and launched tasks the user never intended (while they were still
typing the task description), and 'even Hermes Agent couldn't disable this
feature' — because the live config edit was silently ignored.
Auto-decompose is a safety toggle; turning it off must halt fan-out on the
next tick. The dispatcher now re-reads the flag (and auto_decompose_per_tick)
from config every tick via the extracted _resolve_auto_decompose_settings(),
which fails SAFE (disabled) on a config read error so a transient failure can
never re-enable a feature the user turned off.
Closes#49638.
connect() wrapped its entire body in an unbounded blocking flock(LOCK_EX) on
every call (_cross_process_init_lock). A single process stalled inside the
critical section — or a stale lock held by a wedged worker — blocked every
other connect(), including the long-lived gateway dispatcher's next-tick
connect, forever. No timeout, no traceback, no recovery: the board silently
stopped being worked until a manual restart (issue #36644).
Two fixes:
1. Fast-path skip: once THIS process has initialized a path, the expensive
first-open work (header validation, integrity probe, schema + additive
migrations) is already cached in _INITIALIZED_PATHS. The steady-state
connect has nothing for the cross-process lock to protect, so it now opens
the connection (WAL + pragmas) under only the cheap in-process _INIT_LOCK
and never touches the file lock. This removes the lock from the dispatcher's
hot path entirely — a stalled external 'hermes kanban list' can no longer
block ticks.
2. Bounded acquire: even on first-init, _cross_process_init_lock now retries a
non-blocking acquire up to a 10s deadline, then logs a WARNING and proceeds
WITHOUT the cross-process lock. Safe because the in-process _INIT_LOCK still
serializes same-process threads and the init work is idempotent
(CREATE TABLE IF NOT EXISTS + additive migrations) — worst case is redundant
work, not corruption. A bounded 'proceed anyway' beats an unbounded hang.
Windows path switched LK_LOCK -> LK_NBLCK (non-blocking) to match.
Closes#36644.
_default_spawn launched the worker subprocess with cwd=workspace and set
HERMES_KANBAN_WORKSPACE, but never set TERMINAL_CWD — so the worker inherited
the dispatching gateway's TERMINAL_CWD. That value takes precedence over the
process cwd in two places:
- tools/file_tools.py::_resolve_base_dir — a relative write_file path resolved
against the gateway user's home instead of the workspace, so artifacts
silently landed outside the workspace (#41312).
- agent_init's context-file loader — AGENTS.md was discovered relative to the
gateway's cwd, so under multi-profile dispatch a worker loaded whichever
gateway won the claim race's AGENTS.md, not the task's (#34619).
Both are the same root cause. Pinning TERMINAL_CWD to the workspace (where the
task's work actually happens) fixes both. Guarded on an existing absolute dir
because file_tools rejects relative/sentinel TERMINAL_CWD values — a non-dir
workspace leaves the inherited value rather than writing a meaningless one.
Closes#34619, closes#41312.
hermes -w created the worktree branch from the standalone clone's HEAD, which
lags origin when the clone isn't freshly updated (it's only refreshed by
hermes update, not per session). Every worktree branch then rooted on a stale
base, so the PR diff GitHub computes against current main ballooned with
unrelated changes and the agent had to discover the staleness at push time and
rebase.
_resolve_worktree_base() now fetches and branches from the freshest available
ref: the current branch's upstream if it tracks one (so a deliberate
feature-branch worktree tracks its own remote), else the remote's default
branch (origin/HEAD), else local HEAD as a fail-soft fallback (offline / no
remote / detached). A bogus 'origin/(unknown)' default is guarded, and worktree
creation retries from HEAD if branching off the remote ref fails — so this is
never worse than the old behavior.
Gated by worktree_sync (default true); set worktree_sync: false to keep the
old branch-from-local-HEAD behavior. The resolved base is printed in the
session banner.
This is the follow-up to the #50319 session, where the standalone clone was
213 commits behind origin and the worktree inherited that stale base.
Plugins could observe session/tool/approval lifecycle but had no way to
observe kanban task transitions. Adds three observer hooks fired by the
board's claim/complete/block transitions:
- kanban_task_claimed (dispatcher process, before worker spawn)
- kanban_task_completed (worker process, carries summary)
- kanban_task_blocked (worker process, carries reason)
Each fires AFTER the DB write txn commits, so a plugin observes durable
state and a slow/hanging callback can never hold the SQLite write lock.
All firing is best-effort: a raising hook is logged and swallowed and
never breaks a board transition. profile_name is resolved from
HERMES_HOME so dispatcher- and worker-side hooks carry the right profile.
Requested by @Smithangshu on Discord.
Plugins previously had no way to read the active profile name from the
PluginContext. The workaround in the wild — reaching into
ctx._manager._cli_ref — only works in an interactive CLI session;
_cli_ref is None in the gateway and in kanban-spawned worker sessions
(hermes -p <profile> chat -q ...), so the workaround breaks exactly
where multi-profile awareness matters most.
ctx.profile_name wraps hermes_cli.profiles.get_active_profile_name(),
which derives the name from HERMES_HOME and therefore works in every
execution context with zero dependency on _cli_ref.
After the agent's final response, the '...typing' bubble persisted ~5s.
send() re-triggers send_typing() after every delivery so the bubble
survives intermediate progress messages (Telegram clears typing on each
delivered message). But that re-trigger also fired on the FINAL send,
re-arming Telegram's ~5s timer AFTER the gateway had already torn down
its typing-refresh loop — and Telegram exposes no stop-typing API, so
nothing cancelled it.
Gate the post-send re-trigger on the absence of metadata['notify'] (set
only on the final user-visible reply via _mark_notify_metadata). Both
the rich-message and legacy send paths are covered; intermediate
progress sends still re-trigger so the bubble stays alive mid-response.
Fixes#48678
Add a platform-neutral send-failure vocabulary so consumers can branch on a
typed category instead of substring-matching the raw provider message.
- base.py: SEND_ERROR_KINDS + classify_send_error() (too_long / bad_format /
forbidden / not_found / rate_limited / transient / unknown), and an optional
SendResult.error_kind field (defaults None — fully backward compatible).
- telegram.py: populate error_kind on send() failures; message_too_long keeps
its existing error token plus error_kind='too_long'.
Purely additive: no behavioral change to the existing degrade-and-deliver
paths (MarkdownV2->plain-text fallback, overflow split, retry classification
all untouched). 22 new tests + 210 adapter regression tests green.
The port-announcement clock in waitForDashboardPort starts the instant the
backend process is spawned — before uvicorn binds its socket. On a cold
install the child first compiles and imports the whole hermes_cli.main ->
web_server -> FastAPI/uvicorn chain, and on Windows real-time AV scans every
freshly written .pyc. That pre-bind cost can exceed the old hardcoded 45s
deadline, so the desktop killed a healthy-but-still-starting backend and
respawned it, piling up orphaned processes (#50209).
Raise the default to 90s and make it overridable via
HERMES_DESKTOP_PORT_ANNOUNCE_TIMEOUT_MS, clamped to a 45s floor so a bad
override can't reintroduce the loop. Warm starts still announce in well under
a second; both call sites inherit the new default with no change. Adds
backend-ready.test.cjs (wired into test:desktop:platforms).
Three tests covering the scenarios from issue #50209 that could not be
validated with real Defender on a fresh install:
1. test_lifespan_warmup_is_nonblocking
Patches _warm_gateway_module to sleep 3 s. Measures TestClient startup
time — must complete in < 1.5 s, proving the fire-and-forget
run_in_executor does not block the event loop before port binding
(HERMES_DASHBOARD_READY timing proxy).
2. test_get_status_does_not_block_event_loop
Patches _resolve_restart_drain_timeout to sleep 3 s. Fires concurrent
GET /api/status and GET /api/version requests. /api/version must
respond in < 3 s while /api/status waits — proving the event loop
stays free during the slow import (15 s socket timeout would not fire).
3. test_concurrent_status_probes_all_respond
Three simultaneous /api/status probes with the slow patch — all must
return HTTP 200 (no connection resets, no orphan accumulation).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes a regression introduced by the prior approach (synchronous import
hermes_cli.gateway inside _lifespan) that caused a new failure mode:
the blocking import stalled the asyncio event loop before uvicorn could
bind its port, pushing HERMES_DASHBOARD_READY past the desktop shell's
45 s announcement deadline and triggering a respawn loop that accumulated
orphaned backend processes.
Two-part fix:
_lifespan: replace the blocking import with a fire-and-forget
run_in_executor call (_warm_gateway_module). The import runs in a
worker thread while the server socket is already open, so
HERMES_DASHBOARD_READY fires without delay.
get_status: replace the inline lazy import with
await run_in_executor(None, _resolve_restart_drain_timeout). This is
the root fix for the original 15 s socket-timeout: the blocking
.pyc-compilation + Defender scan is offloaded to a thread, keeping the
event loop free for every /api/status probe. After the first call the
module is in sys.modules and the executor returns in microseconds.
Both helpers are extracted as module-level sync functions so they can
be unit-tested independently of FastAPI or uvicorn.
Closes#50209
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On resource-contended hosts the embedded Hindsight daemon can exceed a
single 2s /health check; upstream then waits a grace window before
treating it as stale and killing+restarting it (hindsight-embed reads
HINDSIGHT_EMBED_PORT_HEALTH_GRACE_TIMEOUT, default 30s, into a
module-level constant at import time). Users on busy boxes had no
Hermes-side way to raise it short of hand-setting an env var.
Add a 'port_health_grace_timeout' config.json option to the Hindsight
plugin. When set, initialize() exports it to the process env BEFORE
daemon_embed_manager is imported (the import-time read is the contract).
setdefault() so an explicit operator env override always wins. Exposed
in 'hermes memory setup' for local_embedded mode.
Follow-up to #50308 / issue #13125 comment thread.