Telegram polling entered a self-inflicted ~31s loop of 409 Conflict ->
retry -> resume -> Conflict. The error_callback PTB invokes synchronously
inside its internal network_retry_loop only scheduled our async recovery
task (loop.create_task) and returned, so PTB kept polling getUpdates on its
own while our handler concurrently ran stop -> sleep -> start_polling. The
two polling sessions overlapped and Telegram returned a fresh 409.
Fix: in the conflict branch of the error_callback, synchronously set PTB's
private polling stop_event before scheduling recovery. PTB's loop exits on
its next tick (it races that event in do_action), so our handler owns
polling alone. The handler's await updater.stop() drains the task and PTB
clears the event, so the subsequent start_polling() builds a fresh event
and is not poisoned.
Keeps the existing reconnect ladder intact (option B) — fixes only the
race. Defensive: probes mangled + unmangled stop_event spellings and no-ops
(prior behaviour) if neither exists; never flips _running, which would make
the handler skip stop() and leave the loop wedged.
* fix(agent): config-driven intent-ack continuation for all api_modes (#27881)
The agent could end a turn after only stating intent ('I will run a health
check...') without executing the announced tool call, forcing the user to
re-prompt. A continuation guard that catches this and nudges the model to
proceed already existed but was hard-gated to the codex_responses api_mode,
so Gemini/Claude/OpenRouter turns never benefited.
- New agent.intent_ack_continuation config (default 'auto' = codex-only,
byte-stable for existing conversations). 'true'/model-list opts every
api_mode in; 'false' disables. Mirrors agent.tool_use_enforcement's shape.
- looks_like_codex_intermediate_ack gains require_workspace (default True).
The opted-in path drops the codebase/filesystem requirement so general
autonomous workflows (server ops, deploys, API calls) are caught, not just
coding tasks. Future-ack + action-verb + short-content + no-prior-tool
guards still apply; the 2-nudge-per-turn cap is unchanged.
- Resolution centralized in intent_ack_continuation_mode (off/codex_only/all).
* docs(infographic): intent-ack continuation (#27881)
The curator's LLM consolidation pass could archive whole clusters of
active skills with zero verified consolidations (#29912): a bare prune
(skill_manage delete with absorbed_into empty/omitted) from the forked
review agent was accepted, removing the skill's name from lookup even
though counts.consolidated_this_run was 0.
- _delete_skill now fails closed during the curator/background-review
pass: a delete is only allowed when it declares a verified
consolidation (absorbed_into=<umbrella>, umbrella must exist). A prune
with no forwarding target is refused; the skill stays active. The
deterministic inactivity prune (archive_skill) is unaffected.
- A verified consolidation delete during the curator pass now routes
through the recoverable archive primitive instead of shutil.rmtree, so
a misjudged consolidation can be undone with hermes curator restore.
The usage record is kept (state=archived) rather than forgotten.
- Foreground, user-directed deletes keep their existing hard-delete
semantics.
Follow-up to the cherry-picked #29212 (#29177):
- Promote the 24h stale-process threshold to config.yaml
(session_reset.bg_process_max_age_hours) instead of a hardcoded
constant. 0 disables the cutoff (legacy: any live process blocks reset).
Wired through GatewayConfig.default_reset_policy in gateway/run.py.
- Bug 2: process(action=list) now resolves the gateway session_key from
the contextvar and surfaces session-scoped background processes (a
forgotten preview server under a different task), flagged
session_scoped — so the agent/user can discover and kill the blocker.
Previously the task-scoped list returned [] and the blocker was invisible.
- Tests: config round-trip for the new field, cross-task list visibility.
- Docs: messaging session-reset section.
Background processes (e.g. http.server preview) that Hermes starts and
forgets about previously blocked session idle/daily reset indefinitely.
The reset guard in session.py checked has_active_for_session() with no
max age — a 3-day-old preview server blocked reset the same as a task
started 30 seconds ago.
Changes:
- Add max_active_age parameter to has_active_for_session() in
process_registry.py. Processes older than this threshold are ignored.
- Add MAX_ACTIVE_PROCESS_AGE constant (24h / 86400s).
- Wire max_active_age into the gateway's session store callback in
run.py so stale processes no longer block session lifecycle.
- Add debug logging when reset is skipped due to active processes.
- Add 3 tests covering recent, stale, and legacy (None) max age.
Fixes#29177
The module-level import broke tests/tools/test_managed_browserbase_and_modal.py,
which loads browser_tool.py via spec_from_file_location against a stubbed
'tools' package that does not include tools.environments.local. Move the import
into a _build_browser_env() helper called at the two agent-browser spawn sites,
matching the lazy-import pattern already used by lazy_deps.py.
Subprocesses spawned outside the terminal/execute_code path (agent-browser,
copilot ACP, dep-ensure, lazy_deps uv install, TUI Node host, cli.exec)
inherited the operator's full credential environment via os.environ.copy().
The terminal path was already scrubbed by _HERMES_PROVIDER_ENV_BLOCKLIST
(#1002/#1264/#32314); these spawn sites bypassed it.
Adds hermes_subprocess_env(inherit_credentials=) in tools/environments/local.py
reusing the existing dynamic blocklist as the single source of truth:
- Tier 1 (_ALWAYS_STRIP_KEYS): gateway bot tokens, GitHub auth, infra
secrets -- stripped even for credential-inheriting children.
- Tier 2 (_HERMES_PROVIDER_ENV_BLOCKLIST): provider/tool keys -- stripped
unless inherit_credentials=True. The opt-in is grep-able for audit.
Browser worker keeps a _BROWSER_PASSTHROUGH_KEYS allowlist (BROWSERBASE/
FIRECRAWL) re-added after the strip. Model-driving children (ACP, TUI Node
host, cli.exec) use inherit_credentials=True so they still get provider keys
while losing Tier-1 secrets. Installers (dep-ensure, lazy_deps) inherit
nothing sensitive. cua_backend already routed through _sanitize_subprocess_env
on main -- left as-is. Gateway adapter utility spawns (gh pr comment, ffmpeg)
are left inheriting env: gh needs GH_TOKEN by design, ffmpeg is a trusted
system binary -- no untrusted-dependency exposure.
This is defense-in-depth (personal-assistant trust model: same-user spawns),
making the existing scrub policy uniform across the spawn surface; the main
real payoff is shrinking the blast radius if a transitive npm dep in
agent-browser is compromised.
Reconstructed on current main from the design in #31959 (Tranquil-Flow);
also credits #39003 (rodboev), #37843 (coygeek), #35769 (egilewski).
Co-authored-by: Tranquil-Flow <tranquil_flow@protonmail.com>
Co-authored-by: rodboev <rod.boev@gmail.com>
Co-authored-by: egilewski <egilewski@egilewski.com>
Three CI flakes hit while landing the credential-pool restore fix; all three
were timing/wall-clock races in the tests, not product bugs (each passes
locally and the assertions are correct):
- test_entire_tree_is_sigkilled_not_just_parent: _terminate_host_pid SIGKILLs
synchronously, but the test's 4s budget after a 1s in-function SIGTERM grace
left almost no slack for the kernel to tear down 3 processes + reparent the
children to zombies under loaded-CI scheduling. Widen the wait to 15s and
make the liveness predicate tolerant of vanished-pid / zombie races. The
assertion never weakens: every tree member must end up dead or zombie.
- test_session_resume_follows_compression_tip: appended messages got
time.time() timestamps (~now) while the test forced session started_at into
the past, so the get_compression_tip MAX(m.timestamp) tiebreaker depended on
wall-clock ordering. Pass explicit, well-separated message timestamps so the
chain resolution is deterministic by construction.
- test_non_retryable_exhaustion_arms_cooldown: asserted the short (5s)
exhaustion cooldown with a tight +1.0s slack, which false-fails when
wall-clock jitter between the 'before' snapshot and the cooldown computation
exceeds a second on a loaded runner. Widen to +30s — still cleanly below the
60s rate-limit window it must distinguish from.
_restore_primary_runtime restored the construction-time api_key snapshot and
never consulted the credential pool. After the pool rotated away from a
revoked/exhausted entry mid-session, every new turn restored the dead key,
re-failed instantly, burned the remaining entries, and fell through to
cross-provider fallback.
After restoring the snapshot, re-select the pool's current best entry and
swap the live credential in via _swap_credential (which already rebuilds the
OpenAI/Anthropic client, reapplies base-url headers, and carries the #33163
base_url / OAuth-detection fixes). Falls back to the snapshot key when the
pool is absent, empty, or the entry has no usable key.
Salvaged from #25206 onto current main: the original targeted the pre-refactor
monolithic method in run_agent.py; the logic now lives in
agent/agent_runtime_helpers.py and is collapsed onto _swap_credential instead
of re-inlining the client rebuild.
Fixes#25205
`resolve_provider("auto")` checked `auth.json` `active_provider` BEFORE the
config.yaml `model.provider` and env-var API-key checks. So a user who was
OAuth-logged-into one provider (e.g. Anthropic) but had set an explicit
`model.provider` or exported an API key (e.g. `OPENAI_API_KEY`) was silently
routed to the stale OAuth provider — the override was invisible and surprising.
Reorder the auto-path so explicit intent wins (the order the issue asks for):
1. explicit CLI api_key/base_url
2. config.yaml `model.provider` (safety net — see below)
3. OPENAI_API_KEY / OPENROUTER_API_KEY env
4. OpenRouter credential pool
5. provider-specific API-key env vars
6. auth.json `active_provider` (OAuth) ← demoted to last-resort
7. AWS Bedrock credential chain
8. error
`active_provider` is still honored — it's just a last-resort fallback chosen
only when the user expressed no other preference, instead of overriding one.
The normal chat/gateway/TUI/ACP/status path already resolves config.provider
upstream in `resolve_requested_provider()` before "auto" is reached, so this
duplicate config check is the safety net for the lone direct caller
(`main.py` `resolve_provider("auto")`) and any future bypass. Because every
surface funnels through this one resolver, the fix propagates everywhere with
a single edit — no sibling path re-implements precedence.
Also add a one-shot WARN when resolution lands on `active_provider` while a
populated `model` config dict lacks a `provider` key — surfacing the silent
override the issue reported without breaking first-install.
Synthesizes the two competing PRs: #29615 (LifeJiggy — config-before-auth +
the silent-override framing) and #29809 (Minksgo — the env-before-auth
reorder). #29809 could not be merged directly (bundled unrelated, un-opt-in
cost-tagging telemetry); its reorder idea is incorporated here and credited.
Tests: tests/hermes_cli/test_provider_precedence.py — config/env beat stale
OAuth, OAuth still used as last resort, explicit request short-circuits, WARN
fires on silent fall-through. Full provider-resolution suites: 374 passed.
Fixes#29285
Co-authored-by: LifeJiggy <141562589+LifeJiggy@users.noreply.github.com>
Co-authored-by: Minksgo <153416856+Minksgo@users.noreply.github.com>
Two profile gateway services sharing the default ~/.hermes resolve the
takeover marker to the same path. A --replace from profile B could land
in profile A's marker, match on PID + start_time by coincidence of a
shared PID namespace, and make profile A exit 0 — only to be revived by
systemd Restart=always, which races the replacer again, flapping
indefinitely.
write_takeover_marker now stamps replacer_hermes_home; the shared
consume path rejects markers written under a different HERMES_HOME and
leaves them in place for the correct profile. Absent field (older
markers) is treated as same-home, so single-profile and mixed old/new
deployments are unaffected.
Salvaged from #31414 by @CryptoByz onto current main (branch was ~3962
commits behind; the consume function had since been refactored for
issue #34597). Co-authored-by: CryptoByz.
The salvaged #27354 fix made save_config strip schema-default leaves by
default. Five migration sites added to main after the PR was authored
still called bare save_config(config) and intentionally materialize a
(often default-valued) key: model_catalog.ttl_hours, write_approval,
curator.consolidate, agent.verify_on_stop, and the suspicious-MCP-server
disable. Pass strip_defaults=False so those one-time deliberate writes
survive, matching the opt-out the PR applied to the other migrations.
Fixes#27354
Root cause: called during init (or by any code path
that saves ) wrote injected schema defaults into
config.yaml as if the user had authored them. Two fix layers:
1. now only injects
when the user actually set
somewhere (root or agent). A user who never set
keeps it absent, so 's explicit-path
detection won't treat it as user-authored.
2. gains a parameter and a
new pass that removes keys matching
unless those paths were explicitly present in the
**raw** (pre-normalization) config on disk. Explicit-path detection
uses on *before* any
normalisation runs — preventing injected-in defaults from being
mistaken for user-set values.
All migration and edit-config call sites pass
to preserve their intentional default-seeding behaviour.
New helpers:
- — collects leaf-key paths from a raw dict
- — removes keys matching schema defaults
Test coverage: 4 new regression tests (59 total, all passing).
The existing test_anthropic_stream_parser_valueerror_retries_before_delivery
asserted mock_replace.call_count == 1 — i.e. it passed precisely because the
buggy OpenAI rebuild was invoked on the Anthropic path. Repoint it to assert
the corrected close+rebuild-Anthropic behavior (#28161).
interruptible_streaming_api_call() has three connection-pool cleanup
sites that called _replace_primary_openai_client() unconditionally.
For api_mode=anthropic_messages this has two consequences:
1. _replace_primary_openai_client() fails (OPENAI_API_KEY unset on
Anthropic-only configs), so dead connections are never purged.
2. The stale-stream detector's outer-poll site (L1977) is the only
mechanism that can interrupt the worker thread while it blocks in
for event in stream:. Because the Anthropic client is never closed,
the thread stays blocked until the 900 s httpx read-timeout fires,
producing a visible 15-minute hang for Telegram/gateway users on
claude-opus-4-7.
Fix: mirror the existing interrupt-path pattern (L1989-1997) at all
three cleanup sites — if api_mode == "anthropic_messages", call
_anthropic_client.close() + _rebuild_anthropic_client() instead of
_replace_primary_openai_client(). _rebuild_anthropic_client() handles
both direct Anthropic and Bedrock-hosted Claude correctly, unlike the
inline build_anthropic_client() calls in open PR #14430.
PR #14430 (open) covers only the outer stale-detector site (L1977).
PR #23678 (open) covers only the inner retry sites (L1774, L1833).
This PR covers all three sites and uses _rebuild_anthropic_client()
for Bedrock parity.
Fixes#28161
The Discord adapter could enter a silent zombie state after a network
outage / proxy stall: the process is alive, _client looks open, but the
underlying socket is dead. discord.py's WebSocket reconnect never sees a
RST through a wedged proxy/NAT, so client.start() spins forever without
exiting — which means the bot-task done callback (which only fires on
task completion) never trips either. The bot stays "offline" in Discord
until a manual `hermes gateway restart`. Reported offline for 13-17h.
Adds an out-of-band REST liveness probe in DiscordAdapter. Every
`discord.liveness_interval_seconds` (default 60s) the adapter issues a
cheap fetch_user(bot_id) — the same REST path as message delivery, so it
fails when the proxy/NAT is wedged. After
`discord.liveness_failure_threshold` consecutive failures (default 3) the
probe closes the wedged client and surfaces a retryable fatal error,
which trips the gateway's existing _platform_reconnect_watcher and
rebuilds the adapter. Operators disable it by setting either knob to 0.
Config lives in config.yaml (discord.liveness_*) per the .env-is-secrets
policy; _apply_yaml_config bridges it to internal env vars the adapter
reads, matching the existing HERMES_DISCORD_TEXT_BATCH_* pattern.
Co-authored-by: Hermes Agent <agent@nousresearch.com>
The durable _last_known_cwd anchor is keyed by the shared 'default' container,
so a non-owning worktree session could inherit the owning session's cwd through
it — breaking the wrong-worktree-routing fix (test_file_tools_cwd_resolution::
test_resolution_routes_to_resolving_sessions_worktree).
Reorder _authoritative_workspace_root so the session-specific registered cwd
override (keyed by raw session id) is checked BEFORE the shared-container
_last_known_cwd fallback. A non-owning session now resolves into its own
registered worktree; the durable anchor only fills in when there's no
session-specific override (the #26211 single-session case). Adds a regression
test covering the owner-mirrors-then-other-session-resolves interaction.
Belt-and-suspenders on top of the cherry-picked cwd-preservation fix:
- Proactively mirror every live terminal cwd into _last_known_cwd on each
successful read, so the durable anchor survives even when the cleanup
thread pops both _file_ops_cache and _active_environments before
_get_file_ops' stale-cache save branch can fire.
- Fall back to _last_known_cwd in _authoritative_workspace_root. write_file_tool
resolves the path (via _resolve_path_for_task) BEFORE _get_file_ops rebuilds
the env, so restoring only the rebuilt env's cwd was insufficient — the
resolution that decides where the file lands runs first. This closes that gap.
The local env's persisted _cwd_file can't serve this role: it's keyed by a
random per-session uuid and deleted on cleanup (the same cleanup that triggers
the bug). The in-memory _last_known_cwd registry is the durable anchor instead.
Adds a real-IO E2E regression (TestSilentFileMisplacementE2E) exercising the
actual write_file_tool path after env cleanup.
Root cause: when the terminal environment (`_active_environments` entry) is
cleaned up and re-created during a long conversation, the new environment
always starts with the default config CWD (typically `~/.hermes/hermes-agent`)
instead of preserving the user's last-known working directory. Subsequent
relative-path writes (`write_file`, `execute_code`, shell commands) silently
land in the default CWD, making files appear to be "created but absent."
Fix: add `_last_known_cwd` dict that preserves the old environment's CWD
before the stale cache entry is invalidated. When a new environment is
created for the same task_id, we check `_last_known_cwd` first and use the
preserved CWD instead of the config default.
Changes:
- tools/file_tools.py: add `_last_known_cwd` dict, save CWD before stale
cache invalidation, restore CWD on env recreation
- tests/tools/test_file_tools.py: add `TestLastKnownCwd` with 2 tests
verifying CWD preservation and fallback behavior
Fixes#26211
A flaky external probe in a tool's check_fn (e.g. check_terminal_requirements
running `docker version` with a 5s timeout, momentarily timing out under load)
would return False for a single get_tool_definitions() call. Because file
tools delegate their check_fn to the terminal check, that one flake silently
stripped read_file/write_file/patch/search_files AND terminal from whatever
agent was being constructed at that instant — most visibly a delegate_task
subagent, which then reported "Tool read_file does not exist". This explains
both the intermittent (~80% success) user-session failures and the
deterministic cron failures in #21658 / #5304.
The existing _check_fn TTL cache made this worse: it cached the transient
False for the full 30s window, poisoning every subagent spawned in that span.
Fix: remember the last time each check_fn returned True; when a fresh probe
fails within a short grace window of that success, treat it as a flake —
serve the last-good True and do NOT cache the failure (so the next call
re-probes). A failure with no recent success, or past the grace window, is
honored normally so a backend that genuinely went down stops advertising its
tools. Probe failures now log at WARNING regardless of quiet mode, making the
previously-silent tool loss diagnosable in subagent (quiet) sessions.
Co-authored-by: Stuart Horner <5261694+djstunami@users.noreply.github.com>
A document attached alongside an image in the same Discord message was
swept into the vision pipeline and 400'd the whole turn ("Could not
process image"), and was simultaneously never surfaced to the agent as a
readable file. Restores the "any file type works" contract for mixed
messages and fixes the HTTP 400.
Bug 1 — mixed attachments: the inbound routing loop keyed image/audio/video
classification off the message-level type (PHOTO/VOICE/AUDIO), so a doc in
a PHOTO message landed in image_paths and poisoned the vision call. The
document context-note path was gated on message_type == DOCUMENT, so that
same doc never reached the agent at all. Now classification is
per-attachment (trust each attachment's own MIME; fall back to the
message-level type only when MIME is unknown), via shared _event_media_is_*
helpers used by both _build_media_placeholder and the main inbound loop.
The document note now fires for any non-image/audio/video attachment
regardless of message-level type.
Bug 2 — uncommon formats: AVIF/HEIC/BMP/TIFF/ICO produced the same generic
400 because providers only accept PNG/JPEG/GIF/WEBP. image_routing now
transcodes those to PNG via Pillow before declaring media_type, skipping
cleanly (logged) if Pillow/plugins are missing. SVG is vector — Pillow
can't rasterize it — so it's skipped rather than transcoded.
Closes#25935.
Co-authored-by: LeonSGP43 <cine.dreamer.one@gmail.com>
Co-authored-by: cypres0099 <74935762+cypres0099@users.noreply.github.com>
Same bug class as the Anthropic fix (#26293): the OpenAI/aggregator client is
built without max_retries, so the SDK default of 2 applies. The SDK's own 1-2s
backoff ignores Retry-After and retries inside hermes's outer conversation loop,
burning request slots against a rate-limited bucket. Set max_retries=0 at the
single create_openai_client chokepoint (covers init, switch_model, recovery,
restore, request-scoped). auxiliary_client builds its own clients and is not
wrapped by the loop, so it keeps SDK retries.
The Anthropic SDK clients were built without max_retries, so the SDK
default (max_retries=2) retried 429/5xx with its own backoff that ignores
Retry-After — double-retrying inside hermes's outer loop and burning
request slots against a bucket that won't refill for minutes. Set
max_retries=0 on all Anthropic/AnthropicBedrock client constructions so
the outer conversation loop (which already honors Retry-After) owns retry.
Also raise the Retry-After cap in the conversation loop from 120s to 600s.
Anthropic Tier 1 input-token buckets reset in ~171s, so the 120s cap made
hermes retry before the reset window and re-trip the limit.
Refs #26293
A persistent upstream 401 on a single-entry OAuth pool (common for Claude
Max subscribers) made the credential-pool recovery spin forever:
try_refresh_current() re-mints a fresh token and reports success on every
401, so recover_with_credential_pool returned True and the retry loop
continue'd without ever incrementing retry_count or reaching the
auth-failover block. The configured fallback_model never activated and the
agent appeared to hang.
Cap consecutive successful same-entry refreshes (keyed by provider +
pool-entry id) at 2; once exceeded, treat the credential as unrecoverable
and return not-recovered so the loop falls through to
_try_activate_fallback. The 429/billing paths already rotate-or-fall-through
correctly (mark_exhausted_and_rotate returns None on a single entry), so
only the auth-refresh branch needed the cap.
Co-authored-by: Hermes Agent <hermes@nousresearch.com>
When every provider in the fallback chain fails non-retryably back-to-back
(e.g. HTTP 400/402/429 across distinct providers), the within-turn walk is
already bounded — _fallback_index advances monotonically and the loop aborts
when the chain exhausts. The damaging mode is cross-turn: restore_primary_
runtime resets _fallback_index=0 every turn, so a client that re-submits
immediately replays the entire chain, re-marshaling the full (potentially
80k-token) context once per provider every turn with no throttle on the
non-rate-limit path. On constrained hosts this exhausts memory/swap.
Rate-limit/billing failures already arm a 60s cooldown via _rate_limited_until;
the gap was the non-rate-limit case. Now, when the chain exhausts on a non-
rate-limit failure with a non-empty chain, arm a short (5s) cooldown on the
same _rate_limited_until gate (max(), never shrinking an existing window).
The next turn's restore stays gated and does NOT reset the index, so the
chain isn't replayed until the cooldown clears. No new state, no thread sleep,
no false-trip on legitimately long chains (those walk normally within a turn).
Tests: tests/run_agent/test_24996_fallback_exhaustion_cooldown.py
Claude Code OAuth refresh tokens are single-use; Claude Code refreshes on
its own schedule, so by the time Hermes notices an expired token Claude
Code may have already rotated it. Re-read live credential sources first and
adopt a valid token rather than POSTing a possibly-stale refresh token.
Ports the _refresh_oauth_token hardening from PR #40107 (chazmaniandinkle)
on top of the keychain/file reconciliation from PR #21112 (nodejun).
Adds AUTHOR_MAP entry for nodejun.
read_claude_code_credentials() previously returned the macOS Keychain
entry as soon as one existed, even if its OAuth token was already
expired. Callers then ran is_claude_code_token_valid() on the result
and got False, so resolve_anthropic_token() returned None — surfacing
the misleading 'No Anthropic credentials found' error even when
~/.claude/.credentials.json held a perfectly valid token.
Now reads both sources and prefers the non-expired one. When both are
valid (or both expired), prefers the later expiresAt so any subsequent
refresh uses the freshest refresh_token.
Adds TestReadClaudeCodeCredentialsDesync covering the four reconciliation
cases. The existing 'keychain wins' priority test still passes because
both fixtures share the same expiresAt and the tiebreaker is >=.
When a Telegram attachment download/cache fails (typically a transient
httpx.ConnectError to Telegram's CDN), the except handler logged a warning
and fell through to handle_message() with empty media and no text — the user
thought the file was delivered, the agent saw a content-less turn with no
signal an attachment was attempted, and the only record was a buried log line.
Adds _surface_media_cache_failure(): replies to the user in Telegram so they
know to retry, and appends an agent-visible notice to event.text via the
existing _append_observed_note channel so the agent knows an attachment was
attempted and failed. No new event fields (structured-event refactor is out
of scope per #23045). Wired into all five cache-failure sites — photo, voice,
audio, video, document — since they shared the identical silent fall-through.
Bug 1 from #23045 (unsupported types routed as fake user messages) no longer
exists on main: the document handler now accepts any file type, so there is no
rejection branch to fix.
Closes#23045
Eager fallback previously fired only on rate_limit/billing. A stale-
detector-killed hung stream classifies as FailoverReason.timeout
(retryable=True) and the retry loop re-hit the same dead primary until
the budget exhausted -- 3 x ~180-300s stale kills compounding into a
15+ min silent hang while the configured fallback chain sat idle.
Extend the existing eager-fallback gate to also cover timeout and
overloaded, but only after one real retry (retry_count >= 2) so genuine
transient hiccups still recover on the primary. Reuses the same
pool-recovery guard and state-reset as the rate_limit branch -- no new
config flag, no change to the rate-limit intent.
Salvaged from PR #50228 by @linyubin. Closes#22277.
Co-authored-by: Hermes Agent <127238744+teknium1@users.noreply.github.com>
DISCORD_ALLOWED_USERS="*" now means "allow everyone", matching the
SIGNAL_ALLOWED_USERS / DISCORD_ALLOWED_CHANNELS wildcard convention and
the value `claw migrate` emits. Previously _is_allowed_user did exact
ID matching only, so "*" matched no user and blocked every non-self
sender — a P1 with no workaround.
Three sites, all required for the fix to hold at runtime:
- _is_allowed_user: short-circuit when "*" is in the allowlist.
- connect(): exclude "*" from the intents.members trigger so the
wildcard does not request the privileged Server Members intent
(which can block the bot from coming online).
- _resolve_allowed_usernames: preserve "*" verbatim; otherwise it lands
in the username-resolution bucket, matches no member, and is silently
dropped from the set and env var on the first on_ready — quietly
undoing the fix.
Slash auth delegates to _is_allowed_user (auto-covered); component auth
already honors "*" on main.
The gateway banner promises 'chat responses are scrubbed before delivery',
but _redact_gateway_user_facing_secrets used a divergent 6-pattern subset that
leaked credential shapes the comprehensive agent.redact catches — notably the
GitHub fine-grained PAT (github_pat_...) and the Telegram bot-token shape
(bot<digits>:<token>), the gateway's own credential type.
_redact_gateway_user_facing_secrets now delegates to
agent.redact.redact_sensitive_text(force=True) — the same Tirith-grade redactor
already applied to logs, tool output, and approval-command prompts — so the
outbound LLM-response path (final_response -> _sanitize_gateway_final_response)
masks the full credential set. The narrow local pattern set is kept as a
fail-soft second pass. force=True honors redaction even when
security.redact_secrets is off, matching _redact_approval_command.
Test: regression guard parametrizing all 5 issue shapes x every chat surface;
asserts secret body never reaches the user and surrounding prose survives. The
existing bearer-token test's marker assertion is loosened from the literal
'[REDACTED]' to mask-agnostic (the redactor masks as '***'/partial) — it
asserts the security invariant, not the implementation's mask string.
_try_openrouter() returned (None, None) whenever an OpenRouter credential
pool existed but was exhausted (_select_pool_entry -> (True, None)), making
the OPENROUTER_API_KEY env-var fallback unreachable. Auxiliary tasks
(compression, vision, web_extract) silently failed even with a valid env key.
Now the pool-present branch only returns early when it successfully builds a
client; an exhausted pool falls through to the env-var path. The final
failure (pool exhausted AND no env var) still marks the provider unhealthy.
Fixes#23452.
Co-authored-by: ambition0802 <noreply@github.com>
When the primary provider returns 401 and the auth-refresh path is
unavailable or fails, both call_llm() and async_call_llm() reached the
should_fallback gate without _is_auth_error in the condition, so the
auxiliary task (e.g. compression) was dropped silently — losing message
history. Add _is_auth_error to should_fallback (NOT is_capacity_error) in
both sync and async paths, plus an 'auth error' reason branch.
Auth stays a non-capacity error: it falls back in auto mode via the
is_auto gate, but on an explicitly-configured provider it still respects
the user's choice and raises rather than silently switching providers.
The agent's image-rejection fallback strips images and retries text-only when
a provider rejects image content, which is what lets the gateway drain its
queued messages. The fallback only fires on a hardcoded phrase list, and the
OpenRouter wording — HTTP 404 'No endpoints found that support image input' —
was missing. For OpenRouter-routed non-vision models the fallback never fired,
the retry loop re-sent the same rejected request until exhaustion, and every
subsequent message (including plain text) stayed queued behind the stuck turn.
Add the phrase to _IMAGE_REJECTION_PHRASES (the 404 already passes the 4xx
gate). Add a positive test and a guard test so the sibling OpenRouter
'no endpoints ... data policy / guardrail' 404s do NOT get their images
stripped.
Fixes#21160. Reported by @liu14goal14-ux; PR #21198 by @ygd58.
complete.path and complete.slash ran inline on the tui_gateway stdin
reader thread. complete.path spawns git ls-files and fuzzy-ranks the
whole repo; complete.slash does first-call prompt_toolkit imports plus a
skill-dir scan. While either ran, prompt.submit / session.interrupt sat
unread in the stdin pipe, freezing the TUI until the 120s RPC timeout
fired — most reliably reproduced by typing @ on a large repo / WSL2 mount.
Add both to _LONG_HANDLERS so completion runs on the existing thread
pool (write_json is already _stdout_lock-guarded). Root-cause fix:
covers any slow completion, not just the bare-@ trigger.
Fixes#21123
The main stop loop in _stop_impl() awaited adapter.cancel_background_tasks()
and adapter.disconnect() with no timeout, for both the primary and the
secondary-profile (multiplex) adapter maps. A half-dead platform — a wedged
Feishu/Lark WebSocket thread blocked on network I/O is the reported case —
makes one of those awaits block forever, so the process never exits. systemd
then SIGKILLs it after TimeoutStopSec, skipping atexit PID-file cleanup, and
the next start dies with 'PID file race lost' and enters a restart loop.
The per-adapter timeout infra already existed on main
(_adapter_disconnect_timeout_secs / HERMES_GATEWAY_ADAPTER_DISCONNECT_TIMEOUT,
default 5s) but was only wired into _safe_adapter_disconnect, which the
teardown path never calls.
Add _bounded_adapter_teardown(): wraps BOTH cancel_background_tasks() and
disconnect() in the existing timeout budget, logs and forces forward progress
on timeout, and never raises. Both teardown loops now route through it, so the
stop sequence always completes regardless of any adapter's internal behavior
and PID-file cleanup runs.
Original report + fix direction by @happy5318 (#14128, #14130); this widens it
to cover cancel_background_tasks(), the multiplex loop, and the config knob.
Co-authored-by: happy5318 <happy5318@users.noreply.github.com>
A user-added systemd drop-in like ExecStopPost=/bin/kill -9 $MAINPID fires
on every stop, including clean restarts — it SIGKILLs the freshly spawned
gateway before it stabilizes and Restart=always respawns it, producing an
infinite restart loop (issue #23272). The unit Hermes installs already shuts
down cleanly via KillMode=mixed + KillSignal=SIGTERM with Restart=always +
RestartForceExitStatus, so no extra kill is needed. Document this as a danger
callout in the gateway service-management section.
The MoA reference-block display (each reference model's output shown as a
labelled thinking block before the aggregator responds) previously existed
only in the classic CLI. The facade already emits moa.reference / moa.aggregating
through tool_progress_callback; this wires the TUI and desktop consumers.
- tui_gateway/server.py: _on_tool_progress relays moa.reference (label / text /
index / count) and moa.aggregating to the Ink/desktop client as their own
events.
- ui-tui: gatewayTypes adds the two event shapes; createGatewayEventHandler
routes them; turnController.recordMoaReference pushes a committed
thinking-style segment tagged with the source model. Shown regardless of
showReasoning — references ARE the mixture-of-agents process the user opted
into, not ordinary reasoning. moa.aggregating is a status-only transition
(no transcript entry).
- apps/desktop: use-message-stream appends each reference as a labelled
reasoning chunk via the existing reasoning disclosure; GatewayEventPayload
gains label/index/aggregator.
Tests: tui_gateway emit (3), Ink handler render + showReasoning-independence +
aggregating-no-segment (3). TUI typecheck/lint clean; desktop typecheck/lint
clean.
display.thinking_progress is documented as independent of tool_progress —
users can keep tool progress quiet while opting into mid-turn assistant
scratch-text bubbles. But two gates were keyed on tool_progress_enabled alone,
so with tool_progress:off the _thinking relay was silently dead even when
thinking_progress:true:
1. agent.tool_progress_callback was set to None unless tool_progress_enabled,
so the callback that queues _thinking text never fired.
2. The send_progress_messages drain task was only started when
tool_progress_enabled, so even queued messages had no consumer.
Both now gate on needs_progress_queue (tool_progress OR thinking_progress) —
the same condition that already decides whether to create the progress queue
at all. No effect when both are off (queue is None) or when tool_progress is
on (unchanged).
Tests: _thinking relays with thinking_progress:on/tool_progress:off, and is
suppressed when thinking_progress:off. Full progress-topics suite: 35 pass.
Follow-up to #53791 addressing review feedback: the footgun checker treated
capture_output=/stdout=/stderr=/check_output as proof a subprocess can't pop a
Windows console. That invariant is false — stream redirection controls where a
child's output goes, not whether a console is allocated. From a console-less
parent (Desktop/Electron, pythonw.exe, detached gateway/cron) a console-subsystem
child still flashes a window even when fully captured.
- check-windows-footguns.py: capture/redirect/check_output is no longer a blanket
safe-pass. Added _WINDOWS_FLASHING_PROGRAMS (git/gh/npm/node/python/uv/ffmpeg/
docker/powershell/…); calls to those are flagged even when captured. Non-flashing
programs keep the capture exemption (no 271-site noise). _subprocess_compat.run/
popen calls are inherently safe (wrapper injects CREATE_NO_WINDOW).
- Routed the 35 genuine flashing git/gh/npm/uv/ffmpeg/docker spawns through the
_subprocess_compat.run/popen chokepoint (Brooklyn's wrapper from #53810) — the
durable fix, not per-site annotations. cmd.exe /c start stays # ok (intentional).
- Updated tests + CONTRIBUTING.md rule #17 to the corrected invariant.
On a MoA session, auxiliary tasks (title generation, compression, vision, …)
ran through _resolve_auto with provider='moa' / model='<preset>', which sent
the preset name (e.g. 'opus-gpt') as the model id to resolve_provider_client —
producing 'HTTP 400: opus-gpt is not a valid model ID' on every turn (visible
as the title-generation warning).
MoA is a virtual provider with no real HTTP endpoint; aux tasks don't need the
reference fan-out. _resolve_auto now resolves a 'moa' main provider to the
preset's aggregator slot (its acting model) and continues Step 1 with that real
provider+model, dropping the virtual moa://local base_url + placeholder key so
the aggregator resolves via its own provider credentials. Mirrors the MoA
context-length resolution.
Verified live: a MoA turn no longer emits the 'not a valid model ID' warning.
Test: tests/agent/test_auxiliary_main_first.py (19 pass).