Free-tier users were seeing 'No free models currently available.' in the
`hermes model` and post-login pickers even though qwen/qwen3.6-plus is
free on the Portal right now. Three independent breakages compounded:
1. The docs-hosted catalog manifest at website/static/api/model-catalog.json
was not regenerated when _PROVIDER_MODELS['nous'] was updated, so users
fetching the manifest got a list that didn't include qwen/qwen3.6-plus.
2. _resolve_nous_pricing_credentials() returned ('', '') on any auth blip,
collapsing get_pricing_for_provider('nous') to {} and making every
curated model fall through the free-tier filter as 'paid'.
3. Even with healthy pricing, the picker only ever showed models from the
in-repo curated list intersected with live pricing — a Portal-flagged
free model not yet in the curated list could never appear.
Changes:
- hermes_cli/models.py: new union_with_portal_free_recommendations() that
augments the curated list with Portal freeRecommendedModels entries
(with synthetic free pricing so partition keeps them). The Portal's
/api/nous/recommended-models endpoint is now the source of truth for
free-tier surfacing — old Hermes builds will see new free models
without a CLI release.
- hermes_cli/models.py: _resolve_nous_pricing_credentials() falls back to
the public inference base URL when runtime cred resolution fails.
The /v1/models endpoint exposes pricing without auth, so silently
returning {} just because a refresh token expired was wrong.
- hermes_cli/auth.py + hermes_cli/main.py: both free-tier picker call
sites call union_with_portal_free_recommendations() before partition.
- tests/hermes_cli/test_models.py: 7 tests covering union behaviour
(prepend, dedup, end-to-end with stale pricing, empty/missing/error
payloads, invalid entries).
- tests/hermes_cli/test_model_catalog.py: drift guard
TestManifestMatchesInRepoLists fails CI when _PROVIDER_MODELS['nous']
or OPENROUTER_MODELS is edited without re-running
scripts/build_model_catalog.py. Verified empirically that removing a
manifest entry triggers an assertion with an actionable error message.
Validation:
- 133/133 targeted tests pass (test_models, test_model_catalog,
test_auth_nous_provider).
- Live E2E against the real Portal:
- Stale curated list ['claude-opus','claude-sonnet','gpt-5.4'] (no
qwen) → after union: ['qwen/qwen3.6-plus', ...] →
partition(free_tier=True): selectable=['qwen/qwen3.6-plus'].
- Simulated expired refresh token → anon fetch returns 403 pricing
entries including qwen/qwen3.6-plus -> {prompt:0, completion:0}.
- ruff: clean.
cua-driver was only installed once on toolset enable: `_run_post_setup` early-returns when the binary is already on PATH, so upstream fixes (e.g. v0.1.6 Safari window-focus fix) never reached existing users without manual reinstall.
Two refresh points now:
- `hermes update` re-runs the upstream installer at the end of the update if cua-driver is on PATH (macOS-only, no-op otherwise). Ties driver freshness to the user-controlled update cadence — no startup latency, no per-launch GitHub API call.
- `hermes computer-use install --upgrade` for manual force-refresh.
The upstream `install.sh` always pulls the latest release, so re-running is the canonical upgrade path. No version-comparison logic needed.
`hermes computer-use status` now shows the installed version, and points at `--upgrade` for refreshing.
Fixes#22832.
## Root cause
`hermes_cli/web_server.py:start_oauth_login` dispatched OAuth flows by
the catalog's `flow` field rather than provider id:
if catalog_entry["flow"] == "pkce":
return _start_anthropic_pkce()
The catalog had two `flow: "pkce"` entries — `anthropic` and
`minimax-oauth` — so clicking "Login" on MiniMax in the dashboard's
Keys tab unconditionally launched the Anthropic/Claude PKCE flow.
## Fix
Three changes in `hermes_cli/web_server.py`:
1. Catalog entry for `minimax-oauth` changed from `flow: "pkce"` to
`flow: "device_code"`. From a UX perspective MiniMax is a
verification-URI + user-code flow (open URL, enter code, backend
polls) — same shape as Nous's device-code flow. The PKCE bit
(verifier + challenge from `_minimax_pkce_pair`) is a security
extension that doesn't change the operator experience; the existing
dashboard modal already renders `device_code` correctly for this UX.
2. New MiniMax branch in `_start_device_code_flow`, mirroring the
existing Nous branch but calling MiniMax-specific helpers
(`_minimax_request_user_code`, `_minimax_pkce_pair`). Stashes
verifier + state in the session for the poller to consume. Handles
the overloaded `expired_in` field (could be unix-ms timestamp OR
seconds-from-now duration) the same way `_minimax_poll_token` does.
3. New `_minimax_poller` background thread mirroring `_nous_poller`.
Calls `_minimax_poll_token` → on success builds the same
`auth_state` dict the CLI flow (`_minimax_oauth_login`) builds, and
persists via `_minimax_save_auth_state` so the dashboard path leaves
the system in the same state as `hermes auth add minimax-oauth`.
Plus a dispatcher tightening to prevent regression: the `pkce` branch
now requires `provider_id == "anthropic"`, so any future PKCE provider
added without a proper start function gets a clean
`400 Unsupported flow` rather than silently launching Anthropic OAuth.
## Test
New `tests/hermes_cli/test_web_oauth_dispatch.py`:
- Regression test asserting MiniMax start does NOT return claude.ai
- Sanity test that Anthropic PKCE still works after the dispatcher
tightening
- Forward-looking test: a hypothetical pkce-flagged provider without
an explicit branch is rejected cleanly rather than misrouted
## Limitations
- The dashboard MiniMax path defaults to `region="global"`. CN-region
operators can still use the CLI flow which supports `--region cn`.
Adding a region toggle to the dashboard UI is a follow-up.
Follow-up to #23863 (CJK table alignment). The realigner was
correctly padding pipes to identical column offsets, but when a
table's natural width exceeds terminal cells it produced lines that
the terminal soft-wrapped mid-cell, destroying column alignment
visually even though the bytes were perfectly padded. Reported as
'columns are not aligned' on tables containing one long row alongside
several short rows.
Approach mirrors Claude Code's MarkdownTable.tsx narrow-terminal
fallback: when realign_markdown_tables is given an available_width
budget and the rebuilt horizontal table exceeds it, render each body
row as 'Header: value' lines separated by a thin ─ rule. Word-wraps
oversize values at the budget with a 2-space continuation indent.
- agent/markdown_tables.py: realign_markdown_tables(text, available_width=None);
threshold check at the top of _render_block flips into a new
_render_vertical fallback. Includes _wrap_to_width with hard-break
for tokens longer than the budget.
- cli.py: helper _terminal_width_for_streaming() returns
shutil.get_terminal_size().columns minus _STREAM_PAD and a 2-cell
safety margin; passed to all three realign call sites
(_render_final_assistant_content for strip+render Panel paths, and
the streaming flushers in _emit_stream_text / _flush_stream).
- tests/agent/test_markdown_tables.py: 4 new tests covering the
overflow-vertical fallback for ASCII + CJK content, the
'fits → keep horizontal' case, and the long-cell wrap with indent.
Live-verified: with COLUMNS=100, the user's reported 'long row in
ASCII table' case now renders as vertical key-value rows that all fit
the panel; the 6-column CJK comparison table still renders as an
aligned horizontal table because it fits inside 100 cols.
The old mtime-tracking staleness machinery (_tui_build_needed,
_hermes_ink_bundle_stale, _find_bundled_tui) tried to avoid rebuilding
by comparing source timestamps to dist/entry.js. This was fragile and
added ~100 lines of code. Replace with three clear paths:
1. HERMES_TUI_DIR set (prebuilt/nix): just node dist/entry.js, no build
2. --dev mode: tsx src/entry.tsx, no build, hot reload
3. Normal: always npm run build (esbuild is ~1s, correctness > caching)
Also error when HERMES_TUI_DIR is set with --dev (footgun: prebuilt
bundle has no source code to hot-reload).
Set HERMES_SESSION_ID using the existing session_context.py ContextVar
system for concurrency safety (multiple gateway sessions in one process
won't cross-talk). Also writes os.environ as fallback for CLI mode.
Touchpoints:
- gateway/session_context.py: Add _SESSION_ID ContextVar + _VAR_MAP entry
- run_agent.py: Set both ContextVar and os.environ at init and on
context-compression rotation
- tools/environments/local.py: Bridge ContextVars into subprocess env
in _make_run_env() (ContextVars don't propagate to child processes)
- tests/run_agent/test_session_id_env.py: 3 tests covering env, provided
ID, and ContextVar paths
execute_code subprocess already passes HERMES_* prefixed vars through
_scrub_child_env (line 82: _SAFE_ENV_PREFIXES includes 'HERMES_').
Primary use case: webhook-triggered agents that need to include a
`--resume <session_id>` takeover command in their output.
Cuts input cost for first-turn Claude requests by ~85-90% on subsequent
sessions within an hour. Tools array (~13k tokens for default toolset) +
stable system prefix (~5-8k tokens) get a 1h cache_control marker; the
volatile suffix (memory, USER profile, timestamp, session id) sits in a
separate non-cached block at the end so it doesn't poison the cross-session
prefix when it changes.
Provider gate: Claude on native Anthropic (incl. OAuth subscription),
OpenRouter, and Nous Portal (which proxies to OpenRouter). All other
providers keep today's system_and_3 layout unchanged.
Layout (4 cache_control breakpoints, Anthropic max):
1. tools[-1] -> 1h (cross-session)
2. system content[0] -> 1h (cross-session, stable prefix)
3. messages[-2] -> 5m (within-session rolling)
4. messages[-1] -> 5m (within-session rolling)
Within-session rolling shrinks from 3 messages to 2 to free the breakpoint
budget. On Claude with realistic tool loadouts the long-lived tier carries
the bulk of cross-session value anyway.
System prompt is now always assembled cache-friendly: stable identity /
guidance / skills / platform hints first, then session-stable context
files (AGENTS.md, .cursorrules), then per-call volatile content. Old
single-string callers see the same logical content (same join order),
just reordered so volatile lives at the end.
Config knobs (defaults shown):
prompt_caching:
cache_ttl: "5m" # rolling-window TTL (unchanged)
long_lived_prefix: true # opt-out switch
long_lived_ttl: "1h" # cross-session prefix TTL
Live E2E (tests/agent/test_prompt_caching_live.py, gated on
OPENROUTER_API_KEY) on anthropic/claude-haiku-4.5 with default toolset:
Call 1 (cold): cache_write=13,415 cache_read=0
Call 2 (NEW agent + msg): cache_write=391 cache_read=13,025
Cross-session reuse: 97.09%
Implementation:
* agent/prompt_caching.py: new apply_anthropic_cache_control_long_lived()
+ mark_tools_for_long_lived_cache(); existing apply_anthropic_cache_control()
preserved verbatim for the fallback path.
* agent/anthropic_adapter.py: convert_tools_to_anthropic() now forwards
cache_control onto each Anthropic-format tool dict.
* run_agent.py: _build_system_prompt_parts() returns the 3-tier dict;
_build_system_prompt() joins them (backward compatible).
_supports_long_lived_anthropic_cache() policy added next to the existing
_anthropic_prompt_cache_policy() (which now also recognises Nous Portal
Claude — pre-existing gap fixed in passing).
_build_api_kwargs() resolves tools_for_api once and propagates the
marker through all four build paths (anthropic_messages, bedrock,
codex_responses, profile/legacy chat completions).
Long-lived flag plumbed into the runtime snapshot/restore + model-switch
+ fallback-promotion paths.
Tests:
* tests/agent/test_prompt_caching.py: +8 tests (TestMarkToolsForLongLivedCache,
TestApplyAnthropicCacheControlLongLived).
* tests/run_agent/test_anthropic_prompt_cache_policy.py: +9 tests
(TestSupportsLongLivedAnthropicCache matrix across 8 endpoint classes
+ a fallback-target case).
* tests/agent/test_prompt_caching_live.py: new live E2E (skipif when
OPENROUTER_API_KEY is unset; runs outside the hermetic suite).
* Targeted suites: 327/327 pass (caching/adapter/policy/builder).
* tests/agent/ + tests/run_agent/: 3992 pass, 17 skip, 1 pre-existing
flake (test_async_httpx_del_neuter::test_same_key_replaces_stale_loop_entry,
verified failing on pristine origin/main).
#23482 fixed cache poisoning in the sync path: when a Codex auxiliary
timeout closes the underlying OpenAI client, _evict_cached_client_instance
walks CodexAuxiliaryClient wrappers via their _real_client attribute and
drops the cache entry so the next aux call rebuilds.
The cache key includes async_mode (see _client_cache_key), so the sync and
async clients for the same provider live in two distinct entries pointing
at the same underlying transport. The fix walked the sync wrapper's
_real_client correctly but the async wrappers
(AsyncCodexAuxiliaryClient, AsyncAnthropicAuxiliaryClient,
AsyncGeminiNativeClient) never exposed _real_client at all, so the async
entry survived eviction and kept handing out the poisoned client.
Effect on async aux callers: one timeout now poisons every subsequent
async aux call (compression, vision, session_search, title_generation)
with 'Connection error' until gateway restart -- even while the sync
route recovered as designed in #23482.
Mirror the sync wrapper's _real_client onto each async wrapper so the
existing eviction helper finds them. Three changes, one per wrapper:
- AsyncCodexAuxiliaryClient: self._real_client = sync_wrapper._real_client
(the underlying OpenAI client)
- AsyncAnthropicAuxiliaryClient: same shape
- AsyncGeminiNativeClient: self._real_client = sync_client (Gemini's
native facade is itself the leaf; no OpenAI client beneath it)
Update _evict_cached_client_instance docstring to reflect that it now
covers both sync and async wrappers via the same attribute walk.
Test: TestAuxiliaryClientPoisonedCacheEviction.test_evict_cached_client_instance_walks_async_wrapper
seeds both sync and async cache entries pointing at the same leaf and
asserts both are dropped on a single eviction call. Verified the test
fails without the wrapper changes ("async cache entry survived
eviction -- wrapper is missing _real_client") and passes with them.
Refs #23482, #23432
CJK and emoji glyphs render as two terminal cells but JS String#length
and the model's own padding count them as one, so any markdown table
with Chinese / Japanese / Korean cells drifts right per row when a
real terminal renders it. Both surfaces fix this with a display-cell
width measurement (wcswidth on the Python side, stringWidth on the
TUI side).
Changes:
- agent/markdown_tables.py: new helper. realign_markdown_tables(text)
detects markdown table blocks (header + |---| divider) and
rewrites the row padding using wcwidth.wcswidth so every pipe and
dash lines up across rows. No-op on text without tables.
- cli.py: hook the helper into _render_final_assistant_content for
strip / render modes (raw passes through untouched), and into the
streaming line emitter so live token-by-token rendering also
produces aligned tables. A small two-buffer state machine in
_emit_stream_text holds table rows until the block ends, then
flushes them through the realigner so all rows pad to a single
per-column width.
- ui-tui/src/components/markdown.tsx: renderTable now uses
stringWidth (Bun.stringWidth fast path + East-Asian-width-aware
fallback, already memoised in @hermes/ink) instead of UTF-16
String#length for both column-width measurement and per-cell
padding. Drops the comment that documented the bug as a deliberate
limitation.
Validation:
- New tests/agent/test_markdown_tables.py (11): every rebuilt block
shares pipe column offsets across rows for pure CJK, mixed
CJK+emoji, ragged-row, and multi-table inputs.
- Updated tests/cli/test_cli_markdown_rendering.py: the existing
strip-mode test asserted exact whitespace; rewritten to assert the
alignment contract (cell content survives + every rendered row
shares pipe offsets).
- New ui-tui markdown.test.ts case (1): rendered column-2 start
offset is identical for the header + every body row, including
the CJK row that drifted before the fix.
- Live: hermes chat -q with the user-reported screenshot prompt now
produces a perfectly aligned table on the wire (header, divider,
4 body rows including '通义千问', all pipes at identical columns).
The /model picker for Nous Portal users was returning the in-repo
_PROVIDER_MODELS["nous"] snapshot — which only updates on Hermes
releases — instead of the remote manifest published at
https://hermes-agent.nousresearch.com/docs/api/model-catalog.json.
OpenRouter already pulled from the manifest via fetch_openrouter_models;
"nous" was the only curated provider where the existing manifest
plumbing (get_curated_nous_model_ids → get_curated_nous_models) was
defined but not wired into the picker pipeline. Switch the curated
build in list_authenticated_providers to use it, with the same
graceful fallback to the in-repo snapshot when the manifest is
unreachable.
Test: tests/hermes_cli/test_model_catalog.py exercises the picker with
a patched manifest and asserts the manifest's nous list reaches
list_picker_providers. Falls-back-to-static path was already covered
by test_curated_nous_ids_falls_back_to_hardcoded_on_empty_catalog.
Follow-up to PR #23824. Adds two correctness fixes on top of the
contributor's salvaged commit:
1. Stale-dist fallback no longer gated on `fatal=False`. `cmd_dashboard`
passes `fatal=True` and is the primary scenario this fallback is for
(issue #23817 — Windows Scheduled Task at logon). The previous gate
meant the fallback never fired in the case it was designed for.
2. `--skip-build` now verifies the dist actually exists before starting
the server. Without this, a misconfigured pre-build would launch the
dashboard pointing at a missing dist and silently serve 404s. We now
exit 1 with a clear "pre-build first: cd web && npm run build"
message, and on success print which dist directory is being used.
Verified end-to-end on Linux:
- build fails + stale dist (fatal=True) -> fallback fires
- build fails + no dist (fatal=True) -> exit 1 with stderr surfaced
- build fails + stale dist (fatal=False) -> fallback fires
- --skip-build + missing dist -> exit 1 with clear guidance
- --skip-build + valid dist -> 'Skipping web UI build...'
On Windows systems using a Chinese GBK locale, `hermes update` could misreport the Web UI build as failed even when `npm run build` actually succeeded. The failure was caused by Python decoding captured npm output with the process locale inside a background subprocess reader thread. When npm emitted bytes such as `0x85`, decoding under GBK raised `UnicodeDecodeError`, and Hermes then surfaced a misleading "Web UI build failed" warning.
This change makes the npm install/npm ci path and the Web UI build step decode captured output explicitly as UTF-8 with `errors="replace"`. That keeps unexpected bytes from crashing output collection, preserves successful builds, and prevents false negatives during update on Windows.
The patch also adds regression tests that verify these subprocess calls always use explicit UTF-8 decoding with replacement semantics.
When the user's main provider is openai-codex on the ChatGPT-account
backend (https://chatgpt.com/backend-api/codex), sending a native image
attachment encodes it as data:image/...base64,... in the input_image
field. The OpenAI Responses API on the public endpoint accepts that, but
the ChatGPT-account variant rejects it with HTTP 400:
Invalid 'input[N].content[K].image_url'. Expected a valid URL, but got
a value with an invalid format.
Hermes' image-rejection phrase list didn't include this wording, so the
error escaped the strip-and-retry branch and fell through to the generic
recovery path: model fallback → context-too-large → compression cascade
→ auxiliary OpenRouter 402 spam (issue #23570).
Add a NARROW phrase keyed on the field-path apostrophe used by the Codex
Responses error format: "image_url'. expected". This matches the actual
error format without false-tripping on generic 'Expected a valid URL'
errors from unrelated tools (webhooks, redirect_uri, etc.). Once matched,
the existing branch strips images from history, sets _vision_supported=
False for the session, and retries text-only.
Refs #23570 (1 of 3 image-replay improvements; persistence rewrite to
store image PATHS instead of inlined base64 is a separate follow-up)
* Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (#23547)"
This reverts commit a63a2b7c78.
* Revert "fix(goals): forward standing /goal state on auto-compression session rotation (#23530)"
This reverts commit 4a080b1d5a.
* Revert "feat(goals): /goal checklist + /subgoal user controls (#23456)"
This reverts commit 404640a2b7.
Adds the only #17873 category not covered by the in-flight PRs #17962
(briandevans, reverse shell + download-execute) and #7993 (SHL0MS,
credential reads + curl/wget exfiltration): sudo invocations that an
LLM-driven agent can drive without TTY interaction.
The agent has no TTY, so the sudo forms that succeed without human
involvement are those reading the password from stdin (`-S` / `--stdin`)
or via an askpass helper (`-A` / `--askpass`). The shell-launch (`-s`)
and list-privileges (`-a`) flags are also gated since they are
privilege-relevant invocations the agent can chain after acquiring the
password (e.g. read SUDO_PASSWORD from .env -> sudo -S -s -> root shell).
Plain `sudo cmd` (no flag) is TTY-bound and excluded.
Two patterns:
1. Direct flag: `\bsudo\b[^;|&\n]*?\s+(?:-s\b|--stdin\b|-a\b|--askpass\b)`
The lazy `[^;|&\n]*?` consumes flag-arguments without spanning
command separators, so `sudo -u root -S whoami` matches (a textbook
offensive form that a strict `(?:\s+-[^\s]+)*` "leading flags only"
pattern would have missed because `root` is a flag-value not a flag).
2. Combined short flags: `\bsudo\b[^;|&\n]*?\s+-[a-z]*[sa][a-z]*\b`
Catches packed forms like `sudo -nS id` where multiple flags share
a single `-X` token.
`_normalize_command_for_detection` lowercases input before pattern
matching (tools/approval.py:340), so case variants of S/s and A/a
collapse — both letter-pairs are gated since each is a privilege-
relevant invocation.
Tests: 21 new cases in TestDetectSudoStdin (12 positive covering all
flag-order permutations including herestring source and printf-piped
forms; 9 negative including TTY-bound `sudo whoami`, interactive
`sudo -i`, env-var reference `$SUDO_USER`, doc lookup `man sudo`,
package install, and the `pseudosudo` word-boundary edge case).
Empirical coverage: 11/11 attacks matched, 0/10 false positives.
Refs: #17873 category 4. Adjacent: #17962 (reverse shell + download-
execute), #7993 (credential reads + curl/wget exfiltration).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes#9590: Block explicit sudo -S (stdin password mode) commands
when the SUDO_PASSWORD environment variable is not configured.
The attack vector: the LLM constructs 'echo guessedpass | sudo -S cmd'
to brute-force sudo passwords, iterates based on sudo's error output
('Sorry, try again'). The existing _transform_sudo_command only
injects -S when SUDO_PASSWORD exists; without it, the LLM's explicit
sudo -S must be treated as a guessing attempt.
Changes:
- Add _check_sudo_stdin_guard() in approval.py: detects sudo -S when
SUDO_PASSWORD is absent, anchored to command-start positions
(^ ; && || | etc.) to avoid false positives on literal text
- Integrate into check_all_command_guards() above yolo/mode=off so
the block is unconditional (like the hardline floor)
- Add 6 tests covering: detection, allow-list, SUDO_PASSWORD bypass,
integration with check_all_command_guards, yolo non-bypass,
container backend bypass
Adds a TestCheckSendMessage class with 7 focused tests pinning the
four passing conditions and the failure modes:
- HERMES_KANBAN_TASK grants access (the new branch)
- HERMES_KANBAN_TASK short-circuits before consulting
session_context or gateway.status (so workers don't depend on
those import paths being healthy)
- HERMES_SESSION_PLATFORM=telegram grants access
- HERMES_SESSION_PLATFORM=local falls through to gateway check
- is_gateway_running()=True grants access
- All signals absent → False
- gateway.status ImportError is swallowed → False
Pinning the short-circuit (test #2) is the load-bearing one — it
documents the contract that worker-side availability cannot regress
to depending on gateway-side state lookups.
When a kanban worker subprocess hits the iteration budget, the agent
loop strips tools and asks the model for a summary. The model cannot
call kanban_block itself at that point, so the process exits rc=0
without calling kanban_complete or kanban_block — a protocol violation
that the dispatcher detects as a fatal error, giving up after 1 failure
and stranding downstream tasks.
Fix: after _handle_max_iterations() returns, check HERMES_KANBAN_TASK
and call kanban_block with a reason describing the exhaustion. The
dispatcher then sees a clean block transition instead of a protocol
violation, and the task can be retried or escalated by a human.
Fixes [Bug] kanban-worker exits cleanly (rc=0) on iteration-budget
exhaustion without calling kanban_complete or kanban_block #23216
When an auxiliary provider returns HTTP 402 (credit / payment), every
subsequent compression / title-gen / session-search / vision call still
re-tried it as the FIRST entry in the chain — burning ~1 RTT to hit 402
again, then falling back. On a long Discord/LCM session that meant dozens
of doomed 402s per minute (issue #23570).
Add a per-process unhealthy-provider cache with a 10 min TTL. When any
caller observes a payment error against a provider, the label is marked
unhealthy and skipped by:
* _resolve_auto Step-1 (main provider use-as-aux path)
* _resolve_auto Step-2 (aggregator/fallback chain)
* _try_payment_fallback (used by call_llm/acall_llm on first 402)
Skip-logs are throttled to once per minute per label so a bursty session
doesn't spam agent.log. Entries auto-expire so a topped-up account
recovers without manual intervention. The cache is in-process only by
design — multi-profile users with different keys per profile must each
hit the 402 once.
Refs #23570
When the Discord typing API call fails (rate limit, network error, 403),
_typing_loop returns early but the stale task remains in _typing_tasks.
Subsequent send_typing calls see the stale entry and skip, leaving no
typing indicator for the rest of the agent invocation.
Add finally block to _typing_loop to always remove the task from
_typing_tasks on exit, whether from cancellation, error, or normal
completion. This allows send_typing to create a fresh task.
3 new tests in test_discord_send.py:
- Task removed after API error
- Typing restartable after failure
- stop_typing cleans up
A YAML parse error in ~/.hermes/config.yaml caused load_config() to print
one line to stdout (Warning: Failed to load config: ...) and silently fall
back to DEFAULT_CONFIG, dropping every user override (auxiliary providers,
fallback chain, model settings). Users only noticed when downstream
behavior misbehaved — see issue #23570 where a tab-indent error in the
auxiliary section caused aux fallback to use OpenRouter (depleted) instead
of the configured Codex/MiniMax chain.
Now: log at WARNING (so 'hermes logs' surfaces it), write a prominent line
to stderr, dedup on (path, mtime_ns, size) so concurrent loads don't spam,
and re-warn after the user edits the file. Both call sites (raw read +
merged load) route through the same helper.
Refs #23570
Salvages the three substantive low-severity fixes from Gutslabs' #1974
"misc bug fixes" bundle. The other 8 claims in that PR were either
already fixed on main with superior implementations (state lock,
firecrawl lazy import, fcntl/msvcrt guard, path normalization, schema
migrations) or did not survive review.
- run_agent: `_materialize_data_url_for_vision` uses
`NamedTemporaryFile(delete=False)`; if `base64.b64decode` raises on a
corrupt data URL the temp file would persist forever. Wrap the
write in try/except and `os.unlink` the temp on failure.
- gateway/session: `append_to_transcript` JSONL write had no error
handling, so disk-full / read-only-fs / permission errors crashed the
message handler. The SQLite write above is the primary store, so
swallow OSError on the JSONL fallback with a debug log.
- gateway/status: `_read_pid_record` reads `pid_path.read_text()` after
an `exists()` check; if the PID file is deleted between the two
calls (concurrent gateway restart) we hit an unhandled OSError.
Catch it and return None.
Adds a regression test for the tempfile cleanup; the other two paths
are defensive try/excepts on infrequent OSError that don't warrant
dedicated tests.
Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
- Implement tests for normalizing perpetual markets and DEXs.
- Validate JSON output for main commands including markets, candles, and review.
- Ensure environment variable resolution and dotenv file reading are covered.
- Test export functionality for market data with expected output structure.
agent.redact._REDACT_ENABLED is snapshotted at import time from
HERMES_REDACT_SECRETS env. Under xdist a prior test in the same worker
can flip it, so test_exec_command_output_is_redacted was order-dependent.
Pin it via monkeypatch like test_terminal_output_transform_still_runs_strip_and_redact does.
1. Quick command exec ran in the gateway process's full environment
without env sanitization or output redaction. A quick command like
"env" or "printenv" would leak all API keys, OAuth tokens, and
bot credentials to the messaging user.
Fix: apply _sanitize_subprocess_env() before exec and
redact_sensitive_text() on output before returning.
2. GatewayRunner._pending_messages was written on every interrupt
(lines 1331-1334) but never read or consumed anywhere. The actual
interrupt delivery uses adapter._pending_messages (a separate dict).
Removed the write-only accumulation to prevent unbounded growth.
Two new tests:
- tests/gateway/test_telegram_format.py
test_message_too_long_splits_into_continuations_not_silent_truncation:
asserts edit_message returns success=True with continuation_message_ids
populated and message_id pointing at the last continuation when
content exceeds MAX_MESSAGE_LENGTH (#19537). Replaces the original
fail-on-overflow assertion with the split-and-deliver contract.
- tests/gateway/test_stream_consumer.py
TestEditOverflowSplitAndDeliver.test_consumer_advances_message_id_on_split_and_deliver:
asserts the consumer side updates _message_id to the latest
continuation, clears _last_sent_text, and fires on_new_message when
the adapter reports a split-and-deliver result.
Surface ready tasks that nobody claims within a threshold (default
30 min) regardless of why. One identity-agnostic signal that catches:
- Operator typo'd the assignee
- Profile was deleted, leaving its tasks stranded
- External worker pool (Codex CLI lane, custom daemon) is down
- Dispatcher misconfigured (wrong board / wrong HERMES_HOME)
Today the dispatcher correctly skips these (no respawn loop, good)
but nothing surfaces the fact that operator-actionable work is
accumulating. The new `stranded_in_ready` rule does that without
requiring a manual lane registry — it reads the most recent ready-
transition event (`created` / `promoted` / `reclaimed` / `unblocked`)
and fires when (now - last_ready_ts) > threshold.
Severity escalates with age: warning at threshold, error at 2x,
critical at 6x. The cli_hint and reassign actions point operators
at the right next step.
Out of scope deliberately:
- Lane registry (#20157 closed) — this signal supersedes it.
- Pushing the diagnostic into messaging gateways — diagnostics
are pull-only via 'hermes kanban diagnostics' for now; gateway
push is a separate UX decision.
Tests: 10 new + 461 existing kanban tests pass. E2E verified end-
to-end via 'hermes kanban diagnostics --json' against a 2h-old
stranded task — surfaces as error severity with correct actions.
Live-tested on gemini-3-flash-preview the judge kept returning empty
or non-JSON content, tripping the consecutive-parse-failures auto-
pause. Free-form JSON output is hopeful; tool-call schemas are
enforced server-side by virtually every modern provider.
Two new tools the judge calls:
- submit_checklist(items) — Phase A, decompose
- update_checklist(updates, new_items, reason) — Phase B, evaluate
Both phases now call the auxiliary client with tool_choice forcing
the right tool. read_file remains for Phase B history inspection,
with the loop exiting only when update_checklist is called or the
read budget is exhausted (at which point read_file is dropped from
the toolbox and update_checklist is forced).
Robustness:
- _call_judge_with_tool_choice falls back tool_choice forced→required→
auto if the provider rejects a particular shape.
- If a fully-broken provider still returns content instead of a tool
call, the legacy JSON-text parsers stay around as a last-ditch
backstop so we never silently lose a checklist.
- _normalize_update_args replaces the JSON parser for the apply
layer; same 1-based→0-based conversion + terminal-status filter.
Live verification: same fizzbuzz goal that was hitting 'judge model
returned unparseable output 3 turns in a row' before now terminates
in 2 turns, all 11 items marked completed with item-specific
evidence, no auto-pause. Agent log shows
'produced 11 checklist items via tool call' instead of the JSON-
parse path.
Tests: 7 new cases for the tool-call path (Phase A success, Phase B
update only, Phase B read_file→update, JSON-content backstop,
empty-text item dropping, non-terminal status filter).
When run_agent's _compress_context fires mid-turn it ends the parent
session in SessionDB and creates a new continuation session with a
fresh session_id. The /goal state is keyed on session_id in
state_meta ("goal:<sid>"), so without forwarding the goal silently
disappears: _get_goal_manager() rebinds for the new session_id,
load_goal() returns None, mgr.is_active() is False, and the
continuation loop dies with no user-visible signal.
Fix: in the same SessionDB transaction block that creates the
continuation session, copy state_meta[goal:<old>] →
state_meta[goal:<new>] when present. No-op when the user has no
active goal. Logged at INFO so a stuck loop is debuggable.
Tests cover the round-trip via SessionDB and the no-op path.
Affects all three run-conversation surfaces (CLI, gateway, TUI
gateway) because _compress_context is the single rotation site.
Added tests/gateway/test_stream_consumer_draft.py with 11 tests
covering:
- Transport selection: auto+dm-supported -> draft; auto+group -> edit;
explicit edit; explicit draft on unsupported adapter -> edit;
MagicMock adapter -> edit (back-compat for the existing test suite).
- Happy path: DM stream animates draft frames with a single shared
draft_id, then finalizes via a regular adapter.send.
- Group fallback: drafts entirely skipped in non-DM chats.
- Failure fallback: send_draft returning success=False disables drafts
for the rest of the response.
- Draft_id lifecycle: consecutive responses use distinct ids; tool
boundaries bump the id so post-tool text animates fresh below the
tool-progress bubble (the openclaw #32535 leak guard).
- _already_sent contract: drafts must NOT set the flag so the gateway's
fallback final-send still fires (drafts have no message_id).
Updated website/docs/user-guide/messaging/telegram.md with a
'Streaming transport' section explaining auto|draft|edit|off, the
DM-only constraint, and the per-response fallback behaviour.
The existing _live_system_guard (PR #23397) blocked os.kill / os.killpg
and a narrow subset of subprocess invocations. Tests still SIGTERMed the
live gateway today (May 10) because the guard had structural holes.
Plug them all:
- subprocess: also wrap getoutput, getstatusoutput
- os.system, os.popen - completely unwrapped before
- pty.spawn - completely unwrapped before
- asyncio.create_subprocess_exec / create_subprocess_shell - bypassed
the subprocess module entirely; now wrapped
- Subprocess command inspection now looks at the WHOLE command string,
not just tokens[0]. Catches sudo systemctl, env systemctl, bash -c
'systemctl', setsid systemctl, /usr/bin/systemctl, etc.
- New process-killer block: pkill / killall / taskkill / fuser
targeting hermes/python patterns is now refused
- os.kill PID 0 (own group) allowed; PID -1 (every process we can
signal) refused
- subprocess.Popen wrapper preserves __class_getitem__ so third-party
packages that use Popen[bytes] as a type annotation still import
Coverage is locked in by tests/test_live_system_guard_self_test.py -
exercises every primitive against a guaranteed-foreign PID and asserts
the guard fires. Adding a new kill primitive without updating the guard
breaks CI.
scripts/run_tests.sh now also force-loads ~/.hermes/pytest_live_guard.py
when present (developer-machine convenience), so even worktrees that
predate this commit get the protection on subsequent test runs through
the canonical wrapper.
A Codex auxiliary timeout closes the underlying OpenAI client (so the
streaming hang doesn't sit until the user kills the session), but the
cached wrapper kept pointing at the now-dead transport. Subsequent
auxiliary calls (compression retry, memory flush, background review,
title generation routed via provider: main) reused that closed client
and failed fast with 'Connection error' until the gateway restarted —
even though the main agent route was healthy the whole time.
Sync `_get_cached_client` had no liveness check (async did, via loop
identity), and the connection-error fallback in `call_llm` only fired
on the auto provider path, so an explicit provider — including the
common `auxiliary.compression.provider: main` shape — never evicted.
Three fixes:
* New `_evict_cached_client_instance(target)` helper that drops the
cache entry whose stored client is target (or wraps it via
`_real_client`, for `CodexAuxiliaryClient`).
* `_CodexCompletionsAdapter._close_client_on_timeout` evicts the
wrapper after closing the inner OpenAI client.
* `call_llm` and `async_call_llm` evict on `_is_connection_error`
before re-raising, regardless of whether the provider is auto.
Net effect: one timeout costs one summary attempt + the existing 30s
compressor cooldown; the next compaction rebuilds the client and
works. Non-connection errors (4xx/5xx) do not evict, so cache hits
stay stable.
Closes#23432
Cherry-picked from PR #10371. Two-layer defense for the spurious-thread_id
issue (#3206):
1. _build_message_event filters DM thread_ids: only preserve thread_id
for real topic messages (is_topic_message=True). Telegram puts
message_thread_id on every DM that is a reply, but reply-chain ids
route to nonexistent threads on send.
2. _send_message_with_thread_fallback helper: control sends
(send_update_prompt, send_exec_approval / send_slash_confirm,
send_model_picker) retry once without message_thread_id when
Telegram returns BadRequest 'Message thread not found'. Mirrors
the pattern PR #3390 added for the streaming send path.
Salvage notes:
- Conflict 1 (line ~4099): merged the contributor's DM is_topic_message
filter with the existing forum General-topic default from #22423,
preserving both behaviors.
- Conflict 2 (line ~1664 / 1690): kept main's delete_message (PR #23416)
alongside the new helper. Tightened the helper's exception catch
from bare 'Exception' to use the existing _is_bad_request_error +
_is_thread_not_found_error helpers (line 484-496) for consistency
with the streaming send path.
- Widened the fix to send_update_prompt (was bare self._bot.send_message,
same bug class).
Authored by rahimsais via PR #10371 (re-attributed from donrhmexe@
local commit author).
* feat(goals): /goal checklist + /subgoal user controls
Two-phase judge for /goal — Phase A decomposes the goal into a detailed
checklist on first turn; Phase B evaluates each pending item harshly
against the agent's most recent response. The goal completes only when
every item is in a terminal status (completed or impossible). Adds
/subgoal so the user can append, complete, mark impossible, undo,
remove, or clear items the judge missed or got wrong.
Mechanics:
- GoalState gains `checklist` and `decomposed` fields, both backwards
compatible (old state_meta rows load unchanged).
- Phase A: aux call writes a harsh, exhaustive checklist; biased toward
more items not fewer. Falls through to legacy freeform judge when
decompose fails.
- Phase B: judge gets the checklist + last-response snippet + path to
a per-session conversation dump at <HERMES_HOME>/goals/<sid>.json.
A bounded read_file tool (max 5 calls per turn, restricted to that
one file) lets the judge inspect history when the snippet is
ambiguous. Stickiness in code: terminal items are frozen, only the
user can revert via /subgoal undo.
- Continuation prompt shows checklist progress when non-empty;
reverts to old prompt when empty.
- Status line shows M/N done counts.
CLI + gateway + TUI gateway all pass the agent reference into
evaluate_after_turn so the dump can be written. Gateway-side
/subgoal is allowed mid-run since it only modifies the checklist
the judge consults at turn boundaries.
Tests: 24 new cases — backcompat round-trip, Phase A decompose,
Phase B updates + new_items + stickiness, user override flows,
conversation dump (incl. unsafe-sid sanitization), judge read_file
restriction. Existing freeform-mode tests updated to patch the
renamed `judge_goal_freeform` and skip Phase A explicitly.
* fix(goals): off-by-one in judge index, message-list plumbing, prompt tuning
Three live-test findings from running /goal end-to-end against
gemini-3-flash-preview as the judge:
1. Off-by-one bug — the judge sees the checklist rendered with 1-based
indices ('1. [ ] foo, 2. [ ] bar') but the apply layer indexed
state.checklist as 0-based. Result: every judge update landed on
the wrong item, evidence got attached to neighbouring rows, and
the genuine 'first pending' item (usually #1) never got marked.
Fix: convert 1 → 0 in _parse_evaluate_response. Also tightened the
user prompt to call out the 1-based scheme explicitly. New tests
cover the parser conversion + an end-to-end fake-judge round-trip.
2. Conversation dump never happened — _extract_agent_messages tried
common AIAgent attribute names (.messages, .conversation_history,
etc.) but AIAgent doesn't expose the message list as an instance
attribute; it lives inside run_conversation()'s scope. Result: the
judge's read_file tool always saw history_path=unavailable. Fix:
added an explicit messages= kwarg to evaluate_after_turn that all
three call sites (CLI, gateway, TUI gateway) now pass directly.
Agent-attribute extraction kept as back-compat fallback.
3. Prompt was too harsh on simple goals. The original 'be HARSH,
default to leaving items pending' wording made the judge refuse
to mark 'file exists' completed even after the agent ran ls,
test -f, os.path.isfile, and find — burning the entire 8-turn
budget on a fizzbuzz task. Softened to 'strict but not absurd'
with explicit guidance on what counts as evidence and a directive
not to require re-proving items already established earlier.
Re-tested live with the same fizzbuzz goal: now terminates in 2
turns with all 8 checklist items correctly attributed to their
own evidence. /subgoal user-action flow (add / complete / undo /
impossible) verified live as well.
New TestUtf16OverflowDetection class covers two scenarios:
- test_emoji_text_exceeding_utf16_limit_triggers_overflow_split: feeds
2200 emoji codepoints (4400 UTF-16 units) — under Telegram's
codepoint-equivalent limit but over its UTF-16 limit. Asserts
truncate_message was called with len_fn=utf16_len, confirming the
consumer detected the overflow.
- test_codepoint_only_adapter_falls_back_to_len: documents that
adapters which don't subclass BasePlatformAdapter (or test MagicMocks)
fall back to plain len for backwards compat.
The contributor's PR shipped no tests for the UTF-16 path.
Slash commands (/clear, /new, /undo, /reload-mcp) are dispatched from the
process_loop daemon thread. prompt_toolkit.run_in_terminal returns a
coroutine that only the main-thread event loop can drive, so calling it
from a daemon thread orphans the coroutine — the input prompt never
renders and user keystrokes leak into the composer instead of the
confirmation prompt (issue #23185).
Mirror the thread-aware guard already in _run_curses_picker: when off the
main thread, fall back to a direct input() call. Also wrap
run_in_terminal in try/except so WSL / Warp / other emulators that
silently drop the scheduled coroutine fall back to input() too.
Tests: tests/cli/test_prompt_text_input_thread_safety.py covers main
thread (run_in_terminal path), daemon thread (direct input fallback),
no-app, run_in_terminal-raises, and EOF handling.
When kanban_complete rejects a created_cards list as hallucinated, the
task is intentionally left in-flight (the gate runs before the write
txn) so the worker can retry with a corrected list or pass
created_cards=[] to skip the check. The retry path already worked, but
the previous error wording read like a terminal failure and workers
were observed abandoning the run instead of trying again.
Spell out the recovery path explicitly in the tool_error response
("Your task is still in-flight ... Retry kanban_complete with ...") and
add regression coverage at both the kernel and tool layers so the
retry contract — and the wording the worker depends on to discover
it — is pinned.
Fixes#22923
Workers running slow models (e.g. kimi-k2.6) can spend longer than
DEFAULT_CLAIM_TTL_SECONDS inside a single tool-free LLM call, making
no tool calls and therefore not heartbeating. release_stale_claims
previously reclaimed these healthy workers, producing the
spawn-then-immediately-reclaim loop reported in #23025.
When a stale-by-TTL claim's host-local worker PID is still alive,
extend the claim (emit a claim_extended event) rather than killing
it. enforce_max_runtime / detect_crashed_workers remain the upper
bounds for genuinely wedged or dead workers. Reclaim events now also
record claim_expires, last_heartbeat_at, worker_pid, and host_local
so operators can see why a worker was killed.
xAI's Responses API returns HTTP 400 ("Model X does not support
parameter reasoningEffort") for grok-4, grok-4-0709, grok-4-fast-*,
grok-4-1-fast-*, grok-3, grok-4.20-0309-*, and grok-code-fast-1 — even
though those models reason natively. Hermes was unconditionally sending
`reasoning: {effort: 'medium'}` to xAI for every Grok model, breaking
direct `--provider xai` for the entire grok-4 line.
Add a substring allowlist predicate (verified live against api.x.ai
2026-05-10) covering the only Grok families that accept the effort dial:
grok-3-mini*, grok-4.20-multi-agent*, grok-4.3*. The Responses transport
omits the `reasoning` key entirely for everything else while still
including `reasoning.encrypted_content` so we capture native reasoning
tokens.
Verified end-to-end: `hermes chat -q hi --provider xai --model grok-4-0709`
went from HTTP 400 to a successful reply.
The contributor's regression test for Feishu fallback thread routing
asserted on attributes specific to the real lark SDK builder
(call_args.body, body.receive_id). In test environments without the
lark SDK installed, the in-tree fallback (gateway/platforms/feishu.py
_build_create_message_request) returns a SimpleNamespace using
.request_body instead of .body, causing AttributeError.
Now reads via getattr fallback and also verifies receive_id_type is
'thread_id' (not 'chat_id') as a stronger contract check.