* feat(kanban): typed block reasons + unblock-loop breaker
Stops the kanban blocked-task loop: a worker blocks a task, a cron
unblocks it, the worker re-blocks for the same reason, repeat forever.
block_task now takes a typed kind and a persistent block_recurrences
counter on the tasks table:
- kind=dependency routes to todo (parent-gated, auto-resumed), never
the human 'blocked' bucket a cron would keep unblocking.
- needs_input/capability/transient/untyped land in blocked; each
same-cause re-block after an unblock increments block_recurrences,
and at BLOCK_RECURRENCE_LIMIT (default 2) the task routes to triage
for a human instead of blocked.
- unblock_task no longer resets block_recurrences (the amnesia that
let the loop run unbounded); complete_task clears it on success.
Wired through the worker kanban_block tool (new kind arg) and the
hermes kanban block --kind CLI flag, both reporting where the task
actually landed. Docs + 11 new tests; 536 existing kanban tests green.
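The routing rules above can be sketched as a small state machine. This is an illustration only, not the project's block_task: the Task fields last_block_kind and unblocked_since_block are hypothetical, the post-unblock status is assumed, and where the counter starts (and therefore on which re-block the limit trips) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

BLOCK_RECURRENCE_LIMIT = 2  # default quoted in the commit message


@dataclass
class Task:
    status: str = "running"
    block_recurrences: int = 0
    last_block_kind: Optional[str] = None   # hypothetical: cause of the previous block
    unblocked_since_block: bool = False     # hypothetical: set by unblock_task


def block_task(task: Task, kind: Optional[str] = None) -> str:
    """Block a task and return the status it actually landed in."""
    if kind == "dependency":
        # Parent-gated work waits in todo and resumes on its own.
        task.status = "todo"
        return task.status
    if task.unblocked_since_block and task.last_block_kind == kind:
        task.block_recurrences += 1         # same cause came back after an unblock
    task.last_block_kind = kind
    task.unblocked_since_block = False
    over_limit = task.block_recurrences >= BLOCK_RECURRENCE_LIMIT
    task.status = "triage" if over_limit else "blocked"
    return task.status


def unblock_task(task: Task) -> None:
    task.status = "todo"                    # assumed post-unblock status
    task.unblocked_since_block = True       # the counter is deliberately NOT reset


def complete_task(task: Task) -> None:
    task.status = "done"
    task.block_recurrences = 0              # success clears the counter
```

Because unblock_task leaves the counter alone, a cron that keeps unblocking the task can only repeat the cycle a bounded number of times before a human sees it in triage.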
* test(kanban): make second-block notify test use a distinct block cause
test_notifier_second_blocked_delivers blocked the same task twice with
the same (untyped) reason, which now trips the new unblock-loop breaker
and routes the second block to triage instead of blocked — so only one
'blocked' notification fired. The test's actual intent is that TWO
distinct block cycles each notify; give the two cycles different kinds
(needs_input then capability) so they're genuinely separate blocks. The
same-cause loop→triage path is covered by test_kanban_block_kinds.py.
After a prolonged outage the in-process network-error ladder escalates to
fatal and GatewayRunner._platform_reconnect_watcher rebuilds a fresh adapter
that reconnects through the bootstrap path. That path called
start_polling(drop_pending_updates=True), discarding every update Telegram
queued during the outage — all messages sent while the bot was down were
silently lost. The in-process ladder and 409-conflict handler already passed
drop_pending_updates=False; only bootstrap did not distinguish a cold first
boot from a reconnect.
Thread an is_reconnect signal from the watcher through
_connect_adapter_with_timeout into adapter.connect(). The base
BasePlatformAdapter.connect() gains a keyword-only is_reconnect=False so every
adapter inherits a tolerant signature (no per-platform breakage when the
runner forwards the kwarg). Telegram translates is_reconnect into
drop_pending_updates=not is_reconnect on both the polling and webhook bootstrap
calls. Cold boot still drops the stale queue; a watcher reconnect preserves it.
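A minimal sketch of the signature change, assuming simplified stand-in classes: the real adapters do much more, and the drop_pending_updates attribute here only records the value that would be passed to the bootstrap call.

```python
import asyncio


class BasePlatformAdapter:
    async def connect(self, *, is_reconnect: bool = False) -> bool:
        # Keyword-only with a default, so adapters that ignore the flag
        # still accept it when the runner forwards it.
        return True


class TelegramAdapter(BasePlatformAdapter):
    def __init__(self) -> None:
        self.drop_pending_updates = None  # illustrative: last value computed

    async def connect(self, *, is_reconnect: bool = False) -> bool:
        # Cold boot drops the stale queue; a watcher reconnect preserves it.
        self.drop_pending_updates = not is_reconnect
        # The real adapter would pass this value to its polling/webhook bootstrap.
        return True
```

Calling connect() with no arguments keeps the cold-boot behavior, so existing call sites do not need to change.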
Fixes #46621.
Co-authored-by: annguyenNous <annguyen@nousresearch.com>
Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>
Co-authored-by: Kewe63 <Kewe63@users.noreply.github.com>
When the primary provider raises AuthError (e.g. expired OAuth token),
_make_agent now walks the configured fallback_providers/fallback_model
chain before giving up — matching the behavior that cron/scheduler.py
and cli_agent_setup_mixin.py already have.
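The fallback walk can be sketched as follows. The function name, the build callable, and the local AuthError class are hypothetical stand-ins, not the signature of _make_agent.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


class AuthError(Exception):
    """Stand-in for the provider auth failure named above."""


def make_agent_with_fallback(
    primary: str,
    fallbacks: Iterable[str],
    build: Callable[[str], T],
) -> T:
    """Try the primary provider, then each fallback in order."""
    last_error = None
    for provider in [primary, *fallbacks]:
        try:
            return build(provider)
        except AuthError as exc:
            last_error = exc  # remember the failure, try the next provider
    # Every provider failed auth: surface the most recent error.
    raise last_error
```

Only AuthError triggers the walk in this sketch; any other exception propagates immediately from the provider that raised it.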
Fixes #47627
A readable state.db can still reject every message write through the
messages_fts* triggers when the FTS5 index is corrupt: base-table reads and
PRAGMA integrity_check pass, but INSERT INTO messages fails with 'database
disk image is malformed'. The gateway reloads conversation_history from disk
each turn, so a silently-failed write hands the next turn stale/empty history
even though the same cached AIAgent still holds the live transcript — causing
immediate same-session amnesia. (#50502)
- hermes_state.py: _db_opens_cleanly() now drives a rolled-back message write
through the FTS triggers, so write-only corruption (which the read-only
probe reported healthy) is detected. repair_state_db_schema() gains an
in-place FTS5 'rebuild' strategy (tier 0) before the dedup/drop tiers, plus
an already_healthy short-circuit. Both 'hermes sessions repair' and
'hermes doctor' route through these, so the fix covers the whole class.
- hermes_cli/doctor.py: the state.db check runs the write-health probe even on
the success (readable) path and repairs in place with --fix.
- gateway/run.py: _select_cached_agent_history() prefers the cached agent's
longer live _session_messages over a shorter persisted transcript, so an
FTS write failure can't wipe in-session context.
- tests: regressions for write-health detection, in-place repair preserving
rows + resuming writes, the already_healthy shortcut, and the gateway guard.
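The write-health probe idea can be sketched with the standard sqlite3 module. This is not the code in hermes_state.py: the function name and the single-column messages schema are assumptions, and the connection is assumed to have no transaction open when the probe runs.

```python
import sqlite3


def write_path_is_healthy(conn: sqlite3.Connection) -> bool:
    """Drive a throwaway INSERT through the table's triggers, then roll it back."""
    try:
        conn.execute("BEGIN")
        conn.execute("INSERT INTO messages (content) VALUES ('__probe__')")
        return True
    except sqlite3.DatabaseError:
        # Covers trigger failures as well as 'database disk image is malformed'.
        return False
    finally:
        if conn.in_transaction:
            conn.execute("ROLLBACK")  # the probe never leaves a row behind
```

A read-only check would report the database healthy in the failure mode described above, because only the trigger path is broken; the probe has to attempt a write to see it.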
Combines the approaches from #50504 (@0-CYBERDYNE-SYSTEMS-0, issue author),
#52165 (@davidgut1982), and #50576 (@trevorgordon981).
The email adapter authorized senders entirely off the From: header, which is
attacker-controlled and unauthenticated by IMAP. An attacker could forge
From: an-allowlisted-address and pass both the adapter's EMAIL_ALLOWED_USERS
pre-filter and the gateway's allowlist authz (both key on the same spoofable
sender_addr), getting unauthorized commands executed by the agent.
Verify the From: domain against the trusted Authentication-Results header the
receiving mail server stamps (SPF/DKIM/DMARC) before trusting it for
authorization. Enforced only when an allowlist is in effect and allow-all is
off — fail-closed. Operators whose server does not stamp the header can opt
out via platforms.email.require_authenticated_sender: false (or
EMAIL_TRUST_FROM_HEADER=true).
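As a rough illustration of the check, the sketch below accepts a sender only when the Authentication-Results value reports a DMARC pass aligned with the From: domain. It is deliberately narrower than the fix described above (which considers SPF/DKIM/DMARC), the function name is hypothetical, and it assumes the caller has already selected the header stamped by the trusted receiving server rather than one supplied by the sender.

```python
import re
from email.utils import parseaddr

# dmarc=pass followed (within the same result clause) by header.from=<domain>
_DMARC_PASS = re.compile(
    r"\bdmarc=pass\b[^;]*\bheader\.from=([^\s;]+)", re.IGNORECASE
)


def sender_is_authenticated(from_header: str, auth_results: str) -> bool:
    """Return True only if DMARC passed for the domain in the From: header."""
    _, addr = parseaddr(from_header)
    domain = addr.rpartition("@")[2].lower()
    if not domain:
        return False  # fail closed on an unparseable From:
    match = _DMARC_PASS.search(auth_results)
    return bool(match) and match.group(1).lower() == domain
```

A missing or failing result returns False, which matches the fail-closed behavior described for allowlisted deployments.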
The scale-to-zero idle watcher never started on a correctly-opted-in,
relay-only instance, so the gateway never ran its idle decision, never called
go_dormant(), and never sent going_idle to the connector. Fly's autostop still
suspended the machine on traffic-idle, but the connector never flipped the
instance to buffered-only — so an inbound DM took the live delivery path,
found no live session for the suspended machine, and was dropped fail-closed
with no wake poke. The machine slept and never woke.
Root cause: _scale_to_zero_should_arm() passed list(config.platforms.keys())
to messaging_is_relay_only_or_absent(). config.platforms is pre-seeded with a
DISABLED placeholder PlatformConfig for every known platform (telegram,
discord, slack, matrix, …), so the key set is always the full ~20-entry
catalog regardless of what the instance actually runs. The relay-only check
discarded "relay", saw the disabled placeholders as live direct-socket
platforms, and returned False — so should_arm() was False and the watcher was
never created. Verified live on a staging instance: config.platforms keys =
[telegram, discord, slack, mattermost, matrix, relay] with only relay
enabled=True; should_arm() = False.
Fix: filter config.platforms to ENABLED entries before the relay-only check,
mirroring the adapter-connect loop which already gates on
`if not platform_config.enabled: continue`. This arms off the same notion of
"active platform" the rest of start() already uses — no parallel concept.
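The difference between the two call shapes can be shown with a small sketch. PlatformConfig here is a one-field stand-in, and messaging_is_relay_only_or_absent is a simplified guess at the helper's semantics (ignore "relay", succeed when nothing else remains), not the real implementation.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class PlatformConfig:
    enabled: bool = False  # placeholders are seeded disabled


def messaging_is_relay_only_or_absent(platform_names: Iterable[str]) -> bool:
    # Simplified stand-in: true when no direct-socket platform is listed.
    return not [name for name in platform_names if name != "relay"]


def enabled_platform_names(platforms: Dict[str, PlatformConfig]) -> List[str]:
    # Same gate the adapter-connect loop uses: skip disabled entries.
    return [name for name, cfg in platforms.items() if cfg.enabled]


platforms = {
    "telegram": PlatformConfig(),
    "discord": PlatformConfig(),
    "relay": PlatformConfig(enabled=True),
}

# Before the fix: every placeholder key counts as a live platform.
armed_before = messaging_is_relay_only_or_absent(list(platforms.keys()))
# After the fix: only enabled entries are considered.
armed_after = messaging_is_relay_only_or_absent(enabled_platform_names(platforms))
```

With the relay-only config above, armed_before is False and armed_after is True, which is the behavior change the regression test pins down.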
Also add a one-line not-armed diagnostic: when an instance IS opted in (the
HERMES_SCALE_TO_ZERO stamp is set) but the watcher still doesn't arm, log why
(relay_only_or_absent, the enabled platforms, wake_url present/missing). A
non-opted instance stays silent. The arm path previously logged only on
success, so a failed arm was invisible.
Tests: the existing pure-helper tests passed bare names so they never
exercised the call site that feeds the placeholder-laden config. Add
behaviour-contract tests against the REAL _scale_to_zero_should_arm with a
realistic config.platforms (relay enabled + others disabled). The F25
regression test (relay-only + disabled placeholders must arm) and the
no-platform case are RED without this fix, GREEN with it; the
genuinely-enabled-direct-platform / not-opted-in / no-wake-url cases stay
correctly non-arming so the filter can't over-broaden.
Wake mechanism itself verified healthy independently (direct wakeUrl GET
resumed a suspended staging instance in 1.15s, clean resume signature).
CredentialPool._sync_device_code_entry_to_auth_store rotated single-use
OAuth refresh tokens but wrote the new chain only into the active profile
store. When a profile resolves a grant from the global-root fallback
(read_credential_pool, #18594) and the pool then refreshes it, root was
left holding a now-revoked refresh token — every other profile reading the
stale root grant subsequently died with refresh_token_reused / invalid_grant
once its access token expired.
This is the credential-pool analog of #43589 (which fixed the non-pool xAI
refresh path in _save_xai_oauth_tokens). Detect the read-from-root case
(profile lacks its own providers.<id> block) BEFORE the profile save and,
after it, write the rotated chain back to the global root via a best-effort,
seat-belted write-through. A profile that genuinely shadows root (owns the
block) is untouched; classic mode (profile == root) is a no-op; a failed root
write never breaks the profile's own save. Covers openai-codex (reported),
xai-oauth, and nous through the shared sync path.
Prevents stage2-hook.sh recursive chown from following a symlinked $HERMES_HOME/home (or profiles/cron) and destroying the host user's home directory. Also guards top-level state-file chowns and refuses first-boot seeding through symlinks. Fixes #52781.
Co-authored-by: harjoth <harjoth.khara@gmail.com>
Two-part fix:
Part 1 (classifier override at agent/error_classifier.py:720-738):
A transport disconnect on a reasoning model — even on a large session —
now routes to FailoverReason.timeout instead of context_overflow. Without
this, large-session reasoning-model disconnects route to the compression
branch and silently delete conversation history on a phantom
context-length error. The override is strictly targeted: non-reasoning
models (gpt-4o, claude-3-5-sonnet, llama-3.3-70b, etc.) still route to
context_overflow on large sessions — the existing intentional behavior
for chat models whose proxy doesn't idle-kill during prefill/generation.
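The Part 1 decision reduces to a small truth table, sketched below. The function name and the boolean inputs are hypothetical, the enum is a two-member stand-in for FailoverReason, and the non-reasoning small-session branch returning timeout is an assumption not spelled out above.

```python
from enum import Enum


class FailoverReason(Enum):
    timeout = "timeout"
    context_overflow = "context_overflow"


def classify_transport_disconnect(
    is_reasoning_model: bool, is_large_session: bool
) -> FailoverReason:
    """Classify a transport disconnect after the Part 1 override."""
    if is_reasoning_model:
        # Override: never blame context length for a reasoning-model disconnect.
        return FailoverReason.timeout
    if is_large_session:
        # Existing behavior for chat models on large sessions is unchanged.
        return FailoverReason.context_overflow
    return FailoverReason.timeout  # assumed default for small sessions
```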
Part 2 (new agent/thinking_timeout_guidance.py + integration at
agent/conversation_loop.py:3488-3567):
New is_thinking_timeout() and build_thinking_timeout_guidance() helpers.
When a known reasoning model (NVIDIA Nemotron 3 Ultra, OpenAI o1/o3,
Anthropic Opus 4.x thinking, DeepSeek R1, Qwen QwQ, xAI Grok reasoning)
hits a transport-kill on a small session (classifier says timeout
directly) or after Part 1 routes correctly (large session), the user
now sees reasoning-specific guidance with three actionable workarounds
in priority order:
1. Set providers.<provider>.models.<model>.stale_timeout_seconds: 900
in ~/.hermes/config.yaml (Hermes's built-in floor is already 600s
for known reasoning models; raise further if upstream is even
tighter).
2. Lower reasoning_budget or set reasoning_effort: medium on this
model if the provider supports it.
3. Use a smaller / faster reasoning model if the task doesn't
require deep thinking.
The new guidance takes precedence via if/elif over the existing
_is_stream_drop block, so a reasoning-model user with a transport-kill
message sees actionable advice instead of the misleading "try
execute_code with Python's open() for large files" advice (which is
correct for the unrelated large-file-write stream-drop case but
actively wrong for the thinking-timeout case).
Verified:
- 478 tests passing across 9 directly-relevant files (49 new + 429
existing, zero regressions).
- Ruff lint clean on all 4 modified/new files.
- Negative test: 6 parametrized regression guards confirm non-reasoning
models still route to context_overflow on large sessions; 4
parametrized gates confirm non-timeout classifier reasons never
trigger the guidance; 5 parametrized cases confirm non-transport
messages never trigger it.
- Regression guard: new guidance message does NOT contain
"execute_code" or "open()" — the misleading advice is fully
replaced, not appended alongside.
- Cross-vendor dual review via agy -p:
- Gemini 3.5 Flash (Medium) — passed: true, zero blockers, one
SHOULD-FIX (vprint block duplication — fixed by extracting
detection into a helper module).
- GPT-OSS 120B (Medium) — passed: true, zero blockers, two nits
(test placement — adopted at tests/agent/test_thinking_timeout_guidance.py;
primary-model capture — accepted as non-issue per Flash's nit).
Dependency note for maintainers:
This PR includes agent/reasoning_timeouts.py (the reasoning-model
allowlist module from PR #52238) because the Part 1 override is
load-bearing on get_reasoning_stale_timeout_floor(). After PR #52238
lands on main, this PR's duplicate agent/reasoning_timeouts.py should
be rebased away. Either PR can land first; rebasing the other is
mechanical.
Fixes #52271.
The #45966 cross-process coherence guard popped the stale cached agent
and then called the blocking _cleanup_agent_resources (memory-provider
shutdown, tool-resource teardown, async-client teardown) while still
holding _agent_cache_lock, on the gateway event-loop thread. While that
ran, _sweep_idle_cached_agents (driven by _session_expiry_watcher)
blocked acquiring the same lock and the asyncio loop stalled for minutes,
tripping repeated Discord 'heartbeat blocked' warnings.
Fix mirrors the cap-enforcer / idle-sweep paths: pop the stale entry
under the lock, release it, then schedule the SOFT release on a daemon
thread. The soft path (_release_evicted_agent_soft) is also more correct
here than the hard teardown the regression used — the same session
rebuilds a fresh agent immediately after invalidation, so its terminal
sandbox / browser / bg processes (keyed on task_id) must be preserved
for the rebuilt agent to inherit, not torn down.
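The pop-under-lock, release-outside-lock pattern looks roughly like this. AgentCache and its release callable are illustrative stand-ins for the gateway's cache and _release_evicted_agent_soft, not the actual code.

```python
import threading
from typing import Callable, Dict, Optional


class AgentCache:
    def __init__(self, release: Callable[[object], None]) -> None:
        self._lock = threading.Lock()
        self._agents: Dict[str, object] = {}
        self._release = release  # potentially slow cleanup

    def put(self, session_key: str, agent: object) -> None:
        with self._lock:
            self._agents[session_key] = agent

    def invalidate(self, session_key: str) -> Optional[threading.Thread]:
        # Hold the lock only for the dictionary mutation.
        with self._lock:
            agent = self._agents.pop(session_key, None)
        if agent is None:
            return None
        # Slow teardown runs off-thread, after the lock is released.
        worker = threading.Thread(target=self._release, args=(agent,), daemon=True)
        worker.start()
        return worker
```

Other code paths that need the lock (such as an idle sweep) can then acquire it immediately, regardless of how long the release takes.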
Verified the cross-process site was the only cleanup-under-lock instance;
the other _cleanup_agent_resources call sites run outside the lock.
CI shard test_telegram_conflict.py timed out (140s) because the new
_polling_heartbeat_loop, started by connect(), busy-spun under those
tests: they monkeypatch asyncio.sleep to instant and pass a bot double
with no get_me(), so the probe raised AttributeError (swallowed) and the
loop re-entered immediately with no real pacing, starving the event loop.
Guard the loop to return when bot.get_me is not callable — a real PTB Bot
always exposes it, so this only triggers on a torn-down app or a test
double, where there is nothing to probe. Also cancel the heartbeat task in
the conflict tests that call connect() without disconnect(), matching the
production disconnect() teardown.
Verified: test_telegram_conflict.py now runs in ~4.5s; the 22
heartbeat/reconnect tests still pass; E2E confirms a hanging get_me still
fires the reconnect ladder while a missing get_me exits without spinning.
When a Telegram long-poll TCP socket enters CLOSE-WAIT (remote sent FIN
but httpx hasn't noticed), epoll still reports it readable so no
exception is raised. PTB's error_callback never fires, the reconnect
ladder never engages, and the gateway silently stops receiving messages
while the process stays alive — until a manual systemctl restart.
The existing recovery only covers two cases: error_callback-driven
reconnects (which require an exception PTB never gets) and a one-shot
_verify_polling_after_reconnect probe (which runs only right after an
explicit reconnect). A socket that wedges during steady-state operation
is never detected.
Add _polling_heartbeat_loop: a background asyncio.Task started in
connect() (polling mode only) that probes get_me() every 90s on the
general request pool (not the getUpdates pool, so healthy long-polls are
never interrupted). On asyncio.TimeoutError/OSError it hands off to the
existing _handle_polling_network_error ladder; other errors are
swallowed. disconnect() cancels and awaits the task. Worst-case
detection window ~105s.
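A minimal asyncio sketch of such a heartbeat loop follows. The function signature and the on_network_error callback are hypothetical, and the 15s probe timeout is an assumption chosen so that interval plus timeout matches the ~105s window; the callable guard mirrors the one described for the CI timeout fix.

```python
import asyncio
from typing import Awaitable, Callable


async def polling_heartbeat_loop(
    bot: object,
    on_network_error: Callable[[Exception], Awaitable[None]],
    interval: float = 90.0,
    probe_timeout: float = 15.0,  # assumed value, not taken from the source
) -> None:
    get_me = getattr(bot, "get_me", None)
    if not callable(get_me):
        return  # torn-down app or test double: nothing to probe
    while True:
        await asyncio.sleep(interval)
        try:
            await asyncio.wait_for(get_me(), timeout=probe_timeout)
        except (asyncio.TimeoutError, OSError) as exc:
            # A wedged socket shows up as a timeout; hand off to the ladder.
            await on_network_error(exc)
        except Exception:
            pass  # other probe errors are not treated as a dead connection
```

Cancellation still works because asyncio.CancelledError is not an Exception subclass, so the broad except clause does not swallow it.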
Complementary to #51541 (general-pool keepalive limits / fd leak) — that
recycles idle pooled connections; this detects a wedged active read.
Fixes #48495
Co-authored-by: agt-user <267614622+agt-user@users.noreply.github.com>
The desktop scheduler can overwrite cron/jobs.json with its own small
set of internally-tracked crons after an update/restart, causing
partial loss of tool-created cron jobs. The previous guard only
checked for total loss (live_count == 0), missing the case where
live_count > 0 but less than the pre-update snapshot count.
Compare live_count against snap_count instead of checking for zero,
so both total loss (0 vs N) and partial loss (1 vs 19) trigger
restoration.
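The two guards compare as follows; the function names are hypothetical and the old guard is reduced to the single condition quoted above.

```python
def old_guard(live_count: int) -> bool:
    # Previous check: only a completely empty jobs file triggered restoration.
    return live_count == 0


def jobs_need_restore(live_count: int, snap_count: int) -> bool:
    # New check: any shrinkage relative to the pre-update snapshot counts as loss.
    return live_count < snap_count
```

With 19 jobs in the snapshot and 1 surviving, old_guard(1) is False while jobs_need_restore(1, 19) is True.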
Salvaged from #52161 by @liuhao1024.
Closes #52144
A top-level delegate_task dispatches in the background and re-enters as a
fresh turn when done. Print a one-line dispatch-time note — no spinner,
nothing to poll — so the idle prompt doesn't read as "nothing happened."
When idle with a background subagent still in flight, append a tail status
segment spelling out that the agent resumes on its own. Width-budgeted like
every tail segment, so it drops first on a tight terminal where the ⛓ count
already carries the signal.
When idle with a top-level delegate_task still in flight, render a static,
shimmering system-note at the transcript tail instead of a spinner (which
reads as "stuck"). Reuses the shared steer / slash-status chrome (centered,
0.6875rem, muted, Codicon) so it sits in the thread like every other meta
line, and mirrors the primary child's latest stream line, falling back to
generic copy. i18n across en/ja/zh/zh-hant; markdown prose/heading rhythm
tuned so a re-entered turn breathes.
Track top-level delegate_task work that dispatches in the background and
re-enters as a fresh turn. $backgroundResume returns {count, activity} for
the active session while idle — count of parked tasks plus the primary
child's latest stream line (tool/progress/thinking) when readable.
Use the app's amber warn color for the unsaved-edits tab dot (was inheriting
the label text color) and add a tab-bg ring + soft drop shadow so it stays
legible where it overlaps the filename.
Extends the pane store with heightOverride (alongside widthOverride) and a
get/set/clear API, and wires the pane shell + desktop controller so the
bottom-row terminal pane can be resized on the Y axis with its size persisted.
Adds a CodeMirror 6 spot editor to the right-rail file preview so users can
make quick edits in-app without leaving for an IDE. Entering edit mode is a
pure in-place swap of the read view — same fixed-height header, same gutter
geometry/typography (mirrors SourceView 1:1) so nothing shifts — toggled via
the Edit button, a bare `e` when the pane is hovered/focused, or the tab.
- Save path is transport-agnostic (writeDesktopFileText): local Electron IPC
or a new hardened POST /api/fs/write-text on the dashboard server (path
validation, parent-must-exist, regular-files-only, size cap, atomic
temp-file + os.replace), behind the existing auth middleware.
- Stale-on-disk guard re-reads before writing and offers overwrite vs
discard-and-reload instead of clobbering external/agent edits.
- VS Code-style modified dot on the tab; ⌘/Ctrl+S and ⌘/Ctrl+Enter save,
Esc cancels; GitHub highlight style matched to the read view's Shiki theme.
- Typing stays render-free (draft in a ref; dirty flips once at the boundary).
The detector folds absolute home / Hermes-home prefixes into their canonical
~/ and ~/.hermes/ forms so static patterns catch /home/alice/.bashrc the same
way they catch ~/.bashrc (abd69b81). On native Windows this fold never fired,
so terminal commands writing to shell startup files, ~/.ssh/authorized_keys,
or ~/.hermes/config.yaml / .env returned "safe" and skipped the approval
prompt — and config.yaml carries the approval policy itself.
Two compounding causes:
1. The fold ran after the backslash-escape strip (r\m -> rm), which dissolves
the backslash separators in a Windows path (C:\Users\alice\.bashrc ->
C:Usersalice...) before the fold could match. It now runs before the strip.
2. The fold only recognized POSIX absolute paths and only the home prefix,
leaving multi-segment backslash suffixes (\.ssh\authorized_keys) to be
mangled by the strip.
Consolidated into _home_prefix_fold_regex / _fold_home_prefixes: match a home
prefix with either separator, capture the rest of the path token, and
normalize its separators to / so multi-segment patterns match. The
degenerate-path guard generalizes count("/") >= 2 to "at least two components
below the root" (also rejecting a bare drive root C:\). HOME is consulted
directly because Windows' expanduser ignores it; the more specific Hermes home
is folded first, longest candidate first, so neither fold clobbers the other.
POSIX behavior unchanged; the r\m -> rm anti-obfuscation strip still runs.
Adds TestWindowsAbsolutePathFolding, which monkeypatches a Windows-style
HOME/HERMES_HOME so the behavior is also exercised on the CI runner.
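The separator-agnostic fold can be approximated with a short regex sketch. This is not _fold_home_prefixes: the function below handles a single home prefix, does not deal with quoting, and its token boundary rule (whitespace ends a path) is an assumption.

```python
import re

SEP = r"[\\/]"  # either path separator


def fold_home_prefix(command: str, home: str) -> str:
    """Rewrite paths under `home` to ~/... with forward slashes."""
    raw_parts = [p for p in re.split(r"[\\/]+", home) if p]
    below_root = [p for p in raw_parts if not re.fullmatch(r"[A-Za-z]:", p)]
    if len(below_root) < 2:
        return command  # degenerate home (e.g. a bare drive root): do nothing
    lead = SEP if home[:1] in ("/", "\\") else ""
    prefix = lead + SEP.join(re.escape(p) for p in raw_parts)
    # Capture the rest of the path token; the lookahead stops /home/alice
    # from matching inside /home/alicex.
    pattern = prefix + r"((?:[\\/][^\s\\/]+)*)(?![^\s\\/])"
    return re.sub(
        pattern,
        lambda m: "~" + m.group(1).replace("\\", "/"),
        command,
    )
```

Running this before any backslash-stripping step matters for the reason given above: once the separators are gone, there is no path left for the pattern to match.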
Follow-up to the salvage of #45035 + #48682. The two PRs touched different
functions (resolve_resume_session_id vs get_compression_tip) but #45035's
descendant walk followed ANY parent_session_id child, so a delegate/subagent
child could hijack the resume target. Apply the same _branched_from /
_delegate_from / source!='tool' exclusion the rest of hermes_state.py uses,
so the resume walk only follows genuine compression continuations.
Also updates the unrealistic delegation test fixture to carry the real
_delegate_from marker, and updates 3 list_sessions_rich test mocks for the
order_by_last_active kwarg #48682 added.
AUTHOR_MAP: map PINKIIILQWQ + ailang323 salvage authors.
After context compression, the parent session holds pre-compression messages
and a child (or deeper descendant) holds the continuation.
resolve_resume_session_id() short-circuited when the input session already
had messages (row is not None -> return session_id), causing REST API
endpoints, gateway resume, and CLI resume to serve stale parent messages.
Remove the early-return. Walk the full descendant chain, record the
deepest node that has messages (best), and return best if not None
else the original session_id (preserving the empty-chain fallback).
Callers (api_server.py, web_server.py, cli_agent_setup_mixin.py,
cli_commands_mixin.py) all use the resolved != input -> redirect pattern
and are transparent to this change.
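The walk described above can be sketched over an in-memory chain. The children mapping (parent to its single continuation child) and the has_messages callable are hypothetical stand-ins for the database lookups, and the sketch omits the delegate/branch exclusions the real code applies.

```python
from typing import Callable, Dict, Optional


def resolve_resume_session_id(
    session_id: str,
    children: Dict[str, str],
    has_messages: Callable[[str], bool],
) -> str:
    """Return the deepest session in the continuation chain that has messages."""
    best: Optional[str] = None
    current: Optional[str] = session_id
    seen = set()  # guard against a cyclic chain
    while current is not None and current not in seen:
        seen.add(current)
        if has_messages(current):
            best = current  # no early return: keep walking for a deeper match
        current = children.get(current)
    # Empty-chain fallback: nothing had messages, keep the caller's id.
    return best if best is not None else session_id
```

Callers that compare the result with their input still see "no redirect" when the input session is already the deepest one with messages.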
The pre-update HERMES_HOME zip shipped on by default (DEFAULT_CONFIG +
runtime fallback both True), so every `hermes update` zipped the entire
~/.hermes — sessions DB, caches, skills — adding minutes to each update.
The shipped cli-config.yaml.example, the --backup help, and the example
config all already said "off by default," so the live default
contradicted its own documentation.
Flip the default to off everywhere: DEFAULT_CONFIG, the runtime
`.get(..., False)` fallback in _run_pre_update_backup, and the stale
--backup help string. Users who want the #48200 safety net opt in via
updates.pre_update_backup: true or --backup for a single run.
Updated test_default_enabled_creates_backup -> test_default_disabled_is_silent
to assert the new default (silent no-op, no zip).
* fix(cron): add default retention to per-run job output to bound disk usage (#52383)
Per-run cron output (cron/output/<job>/<timestamp>.md) is written once
per execution and was never pruned, so a frequently-scheduled job on
a long-running deploy accumulates one file per run indefinitely and
can fill the volume ('no space left on device').
save_job_output() now keeps the most recent N output files per job and
removes older ones. N defaults to 50 and is configurable via
cron.output_retention; a non-positive value disables pruning for
operators who manage cleanup externally.
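A sketch of the pruning step, assuming output files are named so that a lexicographic sort matches chronological order; the function name is hypothetical and this is not the body of save_job_output().

```python
from pathlib import Path
from typing import List


def prune_job_output(job_dir: Path, retention: int = 50) -> List[Path]:
    """Delete all but the newest `retention` .md files; return what was removed."""
    if retention <= 0:
        return []  # non-positive value disables pruning
    newest_first = sorted(job_dir.glob("*.md"), key=lambda p: p.name, reverse=True)
    stale = newest_first[retention:]
    for path in stale:
        path.unlink(missing_ok=True)  # tolerate a concurrent cleanup
    return stale
```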
Salvaged from #52402 by @0xDevNinja.
Closes #52383
* fix(config): add cron.output_retention to DEFAULT_CONFIG
Follow-up to #52383: the retention config key was functional via
get()-with-default but missing from DEFAULT_CONFIG, so the deep-merge
wouldn't auto-populate it for new installs. Add it explicitly.
---------
Co-authored-by: 0xDevNinja <manmit0x@gmail.com>
Regression for the salvaged #48254 fix: billing route is first-writer-wins
via update_token_counts (COALESCE), so a mid-session provider switch left
the dashboard attributing cost to the original provider. Asserts the new
update_session_billing_route() overwrites unconditionally, nulls system_prompt
so the next turn rebuilds Model:/Provider:, and preserves billing_mode when
omitted (COALESCE on None).
The session database records billing_provider and billing_base_url using
COALESCE(column, ?) in update_token_counts(), making them write-once.
When a user switches models mid-session via /model, the runtime (agent.provider,
agent.base_url) updates correctly, but the session row never reflects the new
provider. This causes the dashboard Models page to display a stale provider
badge and misattributes token usage / cost analytics.
Fix: add update_session_billing_route() that unconditionally sets
billing_provider, billing_base_url, and billing_mode (no COALESCE), and call
it from switch_model() in agent_runtime_helpers.py after the swap succeeds.
This follows the same pattern as update_session_model() which already
unconditionally updates the model column (added for the identical COALESCE
problem on the model field).
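The write-once versus overwrite behavior can be reproduced with an in-memory SQLite table. The two-column schema and both function names are illustrative, not the project's; the billing_mode handling follows the regression description (kept when the argument is None).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (id TEXT PRIMARY KEY, billing_provider TEXT, billing_mode TEXT)"
)
conn.execute("INSERT INTO sessions (id) VALUES ('s1')")


def write_once_provider(sid: str, provider: str) -> None:
    # COALESCE(column, ?) keeps the existing value: first writer wins.
    conn.execute(
        "UPDATE sessions SET billing_provider = COALESCE(billing_provider, ?) WHERE id = ?",
        (provider, sid),
    )


def overwrite_billing_route(sid: str, provider: str, mode: str = None) -> None:
    # Provider is set unconditionally; COALESCE(?, column) keeps the old
    # billing_mode only when the caller passes None.
    conn.execute(
        "UPDATE sessions SET billing_provider = ?, "
        "billing_mode = COALESCE(?, billing_mode) WHERE id = ?",
        (provider, mode, sid),
    )


def provider_of(sid: str) -> str:
    return conn.execute(
        "SELECT billing_provider FROM sessions WHERE id = ?", (sid,)
    ).fetchone()[0]
```

Note the argument order: COALESCE(column, ?) makes a column write-once, while COALESCE(?, column) makes the parameter optional.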
Closes #48248