* feat(video_gen): unified video_generate tool with pluggable provider backends
One core video_generate tool, every backend a plugin. Mirrors the
image_gen + memory_provider + context_engine architecture: ABC, registry,
plugin-context registration hook, and per-plugin model catalogs surfaced
through hermes tools.
Surface (one schema, every backend):
- operation: generate / edit / extend
- modalities: text-to-video (prompt only), image-to-video (prompt +
image_url), video edit (prompt + video_url), video extend (video_url)
- reference_image_urls, duration, aspect_ratio, resolution,
negative_prompt, audio, seed, model override
- Providers ignore unknown kwargs and declare what they support via
VideoGenProvider.capabilities() — backend-specific quirks stay in the
backend, the agent learns one tool
Backends shipped:
- plugins/video_gen/xai/ — Grok-Imagine, full generate/edit/extend +
image-to-video + reference images (salvaged from PR #10600 by
@Jaaneek, reshaped into the plugin interface)
- plugins/video_gen/fal/ — Veo 3.1 (t2v + i2v), Kling O3 i2v,
Pixverse v6 i2v with model-aware payload building that drops keys a
model doesn't declare
Wiring:
- agent/video_gen_provider.py — VideoGenProvider ABC, normalize_operation,
success_response / error_response, save_b64_video / save_bytes_video,
$HERMES_HOME/cache/videos/
- agent/video_gen_registry.py — thread-safe register/get/list +
get_active_provider() reading video_gen.provider from config.yaml
- hermes_cli/plugins.py — PluginContext.register_video_gen_provider()
- hermes_cli/tools_config.py — Video Generation category in
hermes tools, plugin-only providers list, model picker per plugin,
config write to video_gen.{provider,model}
- toolsets.py — new video_gen toolset
- tests: 31 new tests covering ABC, registry, tool dispatch, both plugins
- docs: developer-guide/video-gen-provider-plugin.md (parallel to the
image-gen guide), sidebar + toolsets-reference + plugin guides updated
Supersedes: #25035 (FAL), #17972 (FAL), #14543 (xAI), #13847 (HappyHorse),
#10458 (provider categories), #10786 (xAI media+search bundle), #2984
(FAL duplicate), #19086 (Google Veo standalone — easy port to plugin
interface).
Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>
* feat(video_gen): dynamic schema reflects active backend's capabilities
Address the 'capability variance' question — instead of one tool with a
static schema that lies about what every backend supports, the
video_generate tool now rebuilds its description at get_definitions()
time based on the configured video_gen.provider and video_gen.model.
The agent sees backend-specific guidance up-front:
- 'fal-ai/veo3.1/image-to-video': 'image-to-video only — image_url is
REQUIRED; text-only prompts will be rejected'
- 'fal-ai/veo3.1' (t2v): no image_url restriction shown
- xAI grok-imagine-video: 'operations: generate, edit, extend; up to 7
reference_image_urls'
- Backends without edit/extend: 'not supported on this backend — surface
that they need to switch backends via hermes tools'
This is the same pattern PR #22694 used for delegate_task self-capping —
documented in the dynamic-tool-schemas skill. Cache invalidation is
free: get_tool_definitions() already memoizes on config.yaml mtime, so a
mid-session backend swap rebuilds the schema automatically.
Tested:
- Empirical FAL OpenAPI schema check confirms image-to-video models
require image_url (FAL returns HTTP 422 otherwise) — client-side
rejection in FALVideoGenProvider.generate() now prevents the wasted
round-trip
- Live E2E: fal-ai/veo3.1/image-to-video + prompt-only → clean
missing_image_url error; fal-ai/veo3.1 + prompt-only → dispatches
- 6 new tests cover the builder (no config / image-only / full-surface /
text-only / unknown provider / registry wiring), all passing
- 37/37 in the slice, 134/134 in the broader regression set
* test(video_gen/xai): full surface integration tests + cleaner schema
Verified end-to-end that the xAI plugin handles every documented mode
from PR #10600's surface: text-to-video, image-to-video,
reference-images-to-video, video edit, video extend (with and without
prompt). All five modes route to the correct xAI endpoint
(/videos/generations, /videos/edits, /videos/extensions) with the right
payload shape (image / reference_images / video keys), and all five
client-side rejections fire before the network: edit-without-prompt,
extend-without-video_url, image+refs conflict, >7 references, and
duration/aspect_ratio clamping.
15 new integration tests grouped into four classes (endpoint routing,
modalities, validation, clamping). httpx is stubbed via a small fake
AsyncClient that records POSTs so the tests assert the actual payload
the plugin would send to xAI — not just the success/error envelope.
Also cleaned up a description redundancy: when a model's operations
match the backend's overall set, we no longer print the duplicate
'operations supported by this model' line. xAI's description now reads:
Active backend: xAI . model: grok-imagine-video
- operations supported by this backend: edit, extend, generate
- modalities supported by this backend: image, reference_images, text
- aspect_ratio choices: 16:9, 1:1, 2:3, 3:2, 3:4, 4:3, 9:16
- resolution choices: 480p, 720p
- duration range: 1-15s
- reference_image_urls: up to 7 images
Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>
* feat(video_gen): collapse surface to t2v + i2v, family-based auto-routing
Two design changes per Teknium:
1) Drop edit/extend from the tool surface entirely. Only text-to-video
and image-to-video remain. The agent sees a clean tool with two
modalities; backend-specific quirks like xAI's edit/extend endpoints
stay out of the unified schema.
2) FAL: pick a model FAMILY once, the plugin routes between the
family's text-to-video and image-to-video endpoints based on whether
image_url was passed. Users no longer pick 'fal-ai/veo3.1' AND
'fal-ai/veo3.1/image-to-video' as separate options — they pick
'veo3.1', and the plugin handles the rest.
Catalog rewritten as families:
veo3.1 fal-ai/veo3.1 / fal-ai/veo3.1/image-to-video
pixverse-v6 fal-ai/pixverse/v6/text-to-video / fal-ai/pixverse/v6/image-to-video
kling-o3-standard fal-ai/kling-video/o3/standard/text-to-video / fal-ai/kling-video/o3/standard/image-to-video
xAI uses a single endpoint (/videos/generations) for both modes,
routed by the presence of the 'image' field in the payload — no
edit/extend exposure.
Schema changes:
- VIDEO_GENERATE_SCHEMA: drop operation, drop video_url. Final params:
prompt (required), image_url, reference_image_urls, duration,
aspect_ratio, resolution, negative_prompt, audio, seed, model.
- VideoGenProvider ABC: drop normalize_operation, VALID_OPERATIONS,
DEFAULT_OPERATION. capabilities() drops 'operations' key.
- success_response: add 'modality' field ('text' | 'image') so the
agent and logs can see which endpoint was actually hit.
Dynamic schema builder simplified — no operations bullet, no
'switch backends if you need edit/extend' guidance. When the active
backend supports both modalities (the common case), description reads:
Active backend: FAL . model: pixverse-v6
- supports both text-to-video (omit image_url) and image-to-video
(pass image_url) - routes automatically
- aspect_ratio choices: 16:9, 9:16, 1:1
- resolution choices: 360p, 540p, 720p, 1080p
- duration range: 1-15s
- audio: pass audio=true to enable native audio (pricing tier)
- negative_prompt: supported
Tests: 51 in the video_gen slice, 216 across the broader image+video
sweep, all passing. New FAL routing tests prove pixverse-v6 + no image
hits text-to-video endpoint, pixverse-v6 + image_url hits
image-to-video endpoint, same for veo3.1 and kling-o3-standard.
Docs updated: developer-guide page rewrites the 'model families' pattern
as a first-class section so external plugin authors know the convention.
toolsets-reference and toolsets.py descriptions match the new surface.
Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>
* feat(video_gen/fal): expand catalog to 6 families, cheap + premium tiers
Catalog now covers everything Teknium specced from FAL:
Cheap tier:
ltx-2.3 fal-ai/ltx-2.3-22b/text-to-video / image-to-video
pixverse-v6 fal-ai/pixverse/v6/text-to-video / image-to-video
Premium tier:
veo3.1 fal-ai/veo3.1 / fal-ai/veo3.1/image-to-video
seedance-2.0 bytedance/seedance-2.0/text-to-video / image-to-video
kling-v3-4k fal-ai/kling-video/v3/4k/text-to-video / image-to-video
happy-horse fal-ai/happy-horse/text-to-video / image-to-video
DEFAULT_MODEL moved from veo3.1 (premium) to pixverse-v6 (cheap, sane
defaults, both modalities) — better first-run UX for users who haven't
explicitly picked a model.
New family-entry knob: image_param_key. Kling v3 4K's image-to-video
endpoint expects start_image_url instead of image_url; declaring
image_param_key='start_image_url' on the family lets _build_payload
remap correctly. Other families default to plain image_url.
Per-family capability flags reflect each model's docs:
- LTX 2.3 + Happy Horse: minimal payloads (no duration/aspect/resolution
enum exposed by FAL — let endpoint apply defaults)
- Seedance: 6 aspect ratios incl 21:9, durations 4-15, audio supported,
negative prompts NOT supported per docs
- Kling v3 4K: 16:9/9:16/1:1, 3-15s, audio + negative
- Veo 3.1: unchanged, 16:9/9:16, 4/6/8s
Tests: +5 covering the new families (full catalog, Kling 4K
start_image_url remap, Seedance routing, LTX payload minimality, Happy
Horse minimality). 56/56 in the slice green.
Note: I did NOT add the FAL-hosted xAI Grok-Imagine variant. Hermes
already has a direct xAI plugin that talks to xAI's own API; routing
the same model through FAL's wrapper would duplicate the surface
without adding capabilities. Users on FAL who want Grok-Imagine should
use the xAI plugin directly; flag if you want both routes available.
* test(video_gen): tool-surface routing matrix — every model x modality
End-to-end matrix test driven through _handle_video_generate() — the
actual function the agent's video_generate tool call lands in. Writes
config.yaml, invokes the registered handler with a raw args dict, then
asserts the outbound HTTP/SDK call hit the right endpoint with the right
payload shape.
Parametrized over FAL_FAMILIES.keys() so the matrix auto-discovers new
families as they're added (add a family to FAL_FAMILIES and you get
both modalities tested for free).
Coverage:
- All 6 FAL families x {text-only, text+image} = 12 cases
- xAI x {text-only, text+image} = 2 cases
- tool-level model= arg overrides config = 2 cases
For each case, verifies:
- result['success'] is True
- result['modality'] matches input shape ('text' if no image_url, 'image' otherwise)
- outbound endpoint URL matches the family's text_endpoint or image_endpoint
- text-only payloads carry no image-shaped keys
- text+image payloads carry the family's image key (image_url for most,
start_image_url for kling-v3-4k, wrapped 'image' object for xAI)
All 16 cases passing. Confirms the tool surface routes every
(provider, model, modality) combination correctly with zero leakage.
* feat(video_gen): keep video_gen out of first-run setup, surface in status
Two changes:
1. video_gen joins _DEFAULT_OFF_TOOLSETS, so it is NOT pre-selected in
the first-run toolset checklist. Video gen is niche, paid, and slow —
most users don't want it nagging them during initial setup. Anyone
who wants it opts in via 'hermes tools' -> Video Generation, which
already routes to the provider+model picker.
2. The 'hermes setup' status panel learns about video_gen — but only
shows the row when a plugin reports available. Users without
FAL_KEY/XAI_API_KEY see nothing about video gen; users with one of
those keys see 'Video Generation (FAL) ✓' as confirmation it's wired.
Verified live:
- Fresh install (no creds): zero video_gen mentions in wizard.
- With FAL_KEY: status row appears with active backend name.
- 160/160 in the setup + tools_config + video_gen test slice.
Rationale: image_gen is on by default because it's a featured creative
tool used in casual chat (telegrams, etc). Video gen is heavier — long
wait, paid per-second pricing. Default-off matches user intent better.
---------
Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>
Removes the 'Launch hermes chat now? (Y/n)' prompt at the end of
hermes setup. The summary already prints 'Ready to go! → hermes'
so the auto-launch was redundant, and on macOS 26+ it could crash
in prompt_toolkit when setup was invoked from the curl install
script with stdin redirected from /dev/tty (#5884, #6128).
After setup, users run 'hermes' themselves like every other CLI
tool. Same pattern applies to the Windows installer.
Closes#6128 (narrower env-var-guarded fix superseded by removing
the prompt outright).
Adds an explicit API compatibility mode prompt to the `hermes model -> custom`
flow so Codex-compatible third-party endpoints (and any other non-default
backend whose URL doesn't match the existing heuristics in
`_detect_api_mode_for_url`) can be selected explicitly instead of silently
falling back to chat_completions.
Choices: Auto-detect / chat_completions / codex_responses / anthropic_messages.
Persists `api_mode` to:
- `model.api_mode` (active session config)
- the matching `custom_providers[*]` entry (so re-activating the named
provider next time replays the same transport)
Salvaged from PR #6125 onto current main: kept the new prompt and the
`_save_custom_provider(api_mode=...)` plumbing; the named-custom flow
already extracts and applies `api_mode` from the saved entry on current
main so those changes are preserved as-is. Test fixtures updated for the
new prompt and the existing display-name prompt.
Co-authored-by: littlewwwhite <1095245867@qq.com>
- memory_setup.py: use shlex.split() for plugin dep checks instead of shell=True
- transcription_tools.py: avoid shell=True for auto-detected whisper commands
(user-provided templates via env var still use shell=True for compatibility)
- cli.py: add comment clarifying intentional shell=True for user quick_commands
- Add test verifying auto-detected template is shlex-safe
Addresses CONTRIBUTING.md Priority #3 (Security hardening — shell injection).
These two functions in hermes_cli/profiles.py have no callers — the live
`hermes completion {bash,zsh}` command uses hermes_cli/completion.py's
generate_bash() / generate_zsh() instead. Multiple PRs (incl. #6141) tried
to fix the trailing-`_hermes "$@"` zsh bug here, only to discover the
patch never reached users. Delete the dead code so future contributors
patch the right file.
The actual user-facing fix lives in the preceding cherry-picked commits
to hermes_cli/completion.py.
`lsp` is registered as a top-level subparser in `main()` (lines 9539-9545)
via `agent.lsp.cli.register_subparser`, so it shows up in `hermes --help`
output alongside the other built-ins. The `_BUILTIN_SUBCOMMANDS` set used
by `_plugin_cli_discovery_needed` to short-circuit the ~500-650ms plugin
import pass did not list it, so every `hermes lsp ...` invocation paid
the full discovery cost despite being a fully-built-in command.
This is also caught by the parity guard added in #22120:
`tests/hermes_cli/test_startup_plugin_gating.py::test_builtin_set_covers_every_registered_subcommand`
has been failing on clean origin/main with:
AssertionError: _BUILTIN_SUBCOMMANDS is missing these live
subcommands: ['lsp']. Add them to hermes_cli/main.py::_BUILTIN_SUBCOMMANDS
so plugin discovery can be skipped when the user targets them.
Fix: add `"lsp"` to the frozenset (alphabetical position between `logs`
and `mcp`). The accompanying `test_builtin_set_has_no_phantom_entries`
guard still passes because `lsp` is genuinely live — registered via the
guarded `try/except Exception` in main() since #24168.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Rename 'Alibaba Cloud (DashScope)' display label to 'Qwen Cloud'
in CANONICAL_PROVIDERS (model picker, /model, hermes model TUI) and
PROVIDER_REGISTRY (setup wizard prompts, status output).
- Move Qwen Cloud (alibaba) up to position 6 — directly below
OpenAI Codex and above Xiaomi MiMo.
- Move Qwen OAuth (Portal) (qwen-oauth) to the bottom of the
canonical provider list.
Provider slug 'alibaba' is unchanged — only the display label
moved. DashScope env var (DASHSCOPE_API_KEY) and base URL are
unchanged. The separate 'alibaba-coding-plan' plugin provider is
not affected.
* feat(nous): unified client=hermes-client-v<version> tag on every Portal request
Every Hermes request to Nous Portal now carries the same
client=hermes-client-v<__version__> tag (e.g. client=hermes-client-v0.13.0
on this release), sourced live from hermes_cli.__version__. The release
script's regex bump auto-aligns it on every release.
Centralized in agent/portal_tags.py and wired into all four call sites:
- NousProfile.build_extra_body (main agent loop, every chat completion)
- auxiliary_client.NOUS_EXTRA_BODY + _build_call_kwargs (aux client)
- run_agent.py compression-summary fallback path
- tools/web_tools.py web_extract fallback
Replaces the client=aux marker added in #24194 with the unified version
tag. Tests assert against the helper output (invariant) rather than the
literal string, so they don't need updating on every release.
* feat(nous): cover /goal judge and kanban specify aux paths
Two aux-using surfaces bypassed call_llm by invoking
client.chat.completions.create() directly without extra_body, so they
were missing the unified Portal client tag:
- hermes_cli/goals.py — /goal standing-goal judge
- hermes_cli/kanban_specify.py — kanban triage specifier
Both now pass extra_body=get_auxiliary_extra_body() or None so they
inherit the version tag when the aux client points at Nous Portal, and
emit nothing otherwise (no tag leak to OpenRouter/Anthropic auxes).
The long-lived prefix-cache layout split the system prompt into stable/
context/volatile blocks and re-derived them on every API call. The
volatile tier (timestamp + memory snapshot + USER profile) ticks per
turn, so the system message bytes mutated mid-conversation and broke
upstream prompt caches (OpenRouter, Nous Portal, Anthropic).
Diagnosed via live wire-format diffing: an 8-turn conversation showed
OLD layout flipping system block[1] sha mid-session at the minute
boundary, dropping cached_tokens to 0 on that turn (cumulative
66.6% vs 83.3% for the single-block layout). Hermes invariant:
history (system + all but the last 1-2 messages) must be static.
Fix: drop the long-lived layout entirely. Single layout everywhere —
system_and_3 with one cached system string built once on first turn,
replayed verbatim on every subsequent turn. Loses cross-session 1h
prefix caching for Claude (the feature that motivated the split), but
within-session caching now actually works on every provider.
Removed:
- run_agent.py: _use_long_lived_prefix_cache flag, _long_lived_cache_ttl,
_supports_long_lived_anthropic_cache method, the long-lived branch in
run_conversation, mark_tools_for_long_lived_cache call site
- agent/prompt_caching.py: apply_anthropic_cache_control_long_lived,
mark_tools_for_long_lived_cache, _mark_system_stable_block helper
- hermes_cli/config.py: prompt_caching.long_lived_prefix and
prompt_caching.long_lived_ttl config keys
- tests/agent/test_prompt_caching_live.py (entire file)
- tests/agent/test_prompt_caching.py: TestMarkToolsForLongLivedCache,
TestApplyAnthropicCacheControlLongLived
- tests/run_agent/test_anthropic_prompt_cache_policy.py:
TestSupportsLongLivedAnthropicCache
Targeted tests: 62/62 pass.
Xiaomi MiMo's /v1/models endpoint returns 401 even with a valid API key,
causing hermes doctor to falsely report 'invalid API key'.
Add a `supports_health_check` field to ProviderProfile (default True).
Providers whose /models endpoint doesn't support auth verification can
set it to False. The doctor's dynamic provider discovery now reads this
field instead of hardcoding True.
The xiaomi provider plugin sets supports_health_check=False.
Salvage of #21063 — adds 'Weixin, and more' to module-level docstrings
in gateway/__init__.py, gateway/config.py, gateway/platforms/base.py
and the 'hermes gateway' subparser description.
Co-authored-by: wuwuzhijing <chuang.guo@hopechart.com>
The fuzzy @-file completer shells out to 'rg --files' via subprocess.run
with text=True. On Windows, Python 3.13 decodes stdout using the system
ANSI codepage (cp1252), so any filename containing bytes like 0x81/0x8f
crashes the background reader thread with UnicodeDecodeError. The
exception is swallowed inside subprocess, leaving proc.stdout=None, and
the next line ('proc.stdout.strip()') blows up with:
AttributeError: 'NoneType' object has no attribute 'strip'
This takes down the prompt_toolkit event loop and forces 'Press ENTER to
continue' until the user clears the @-query.
Fix:
- Pass encoding='utf-8', errors='replace' so rg's UTF-8 output is decoded
consistently across platforms and unmappable bytes don't crash.
- Guard 'proc.stdout' with a None check before .strip(), so a future
reader-thread failure degrades gracefully instead of breaking input.
Fixes#24127
On headless Linux VPS (no DISPLAY or WAYLAND_DISPLAY), some Python
webbrowser backends register TUI programs such as links, lynx, or
www-browser. GenericBrowser.open() spawns these without redirecting
stdin/stdout, allowing them to take over the terminal. This can cause
the process to receive SIGHUP and exit immediately even though uvicorn
bound the port successfully, producing a misleading success message
followed by an empty --status.
Fix: detect headless Linux at startup and skip the auto-open when no
display server is available. On such systems the URL is still printed
so the user can open it manually or via an SSH tunnel. The webbrowser
call is also wrapped in a try/except so any unexpected failure on other
platforms is silently absorbed rather than surfacing as an unhandled
exception in the daemon thread.
Built-in commands with required args (e.g. /queue, /steer, /background)
were excluded from Telegram setMyCommands output, making them invisible
in the autocomplete menu. However, their handlers already return usage
text when invoked without arguments, so hiding them hurts discoverability.
This commit removes the _requires_argument filter for built-in commands
(COMMAND_REGISTRY) while keeping it for plugin-registered slash commands,
which may not provide a no-arg usage fallback.
Closes#24312
The clarify tool returned 'not available in this execution context' for
every gateway-mode agent because gateway/run.py never passed
clarify_callback into the AIAgent constructor. Schema actively encouraged
calling it; users never saw the question.
Changes:
- tools/clarify_gateway.py — new event-based primitive mirroring
tools/approval.py: register/wait_for_response/resolve_gateway_clarify
with per-session FIFO, threading.Event blocking with 1s heartbeat
slices (so the inactivity watchdog keeps ticking), and
clear_session for boundary cleanup.
- gateway/platforms/base.py — abstract send_clarify with a numbered-text
fallback so every adapter (Discord, Slack, WhatsApp, Signal, Matrix,
etc.) gets a working clarify out of the box. Plus an active-session
bypass: when the agent is blocked on a text-awaiting clarify, the next
non-command message routes inline to the runner's intercept instead
of being queued + triggering an interrupt. Same shape as the /approve
deadlock fix from PR #4926.
- gateway/platforms/telegram.py — concrete send_clarify renders one
inline button per choice plus '✏️ Other (type answer)'. cl: callback
handler resolves numeric choices immediately, flips to text-capture
mode for Other, with the same authorization guards as exec/slash
approvals.
- gateway/run.py — clarify_callback wired at the cached-agent per-turn
callback assignment site (only the user-facing agent path; cron and
hygiene-compress agents have no human attached). Bridges sync→async
via run_coroutine_threadsafe, blocks with the configured timeout, and
returns a '[user did not respond within Xm]' sentinel on timeout so
the agent adapts rather than pinning the running-agent guard. Text-
intercept added to _handle_message before slash-confirm intercept
(skipping slash commands). clear_session called in the run's finally
to cancel any orphan entries.
- hermes_cli/config.py — agent.clarify_timeout default 600s.
- website/docs/user-guide/messaging/telegram.md — Interactive Prompts
section.
Tests:
- tests/tools/test_clarify_gateway.py (14 tests) — full primitive
coverage: button resolve, open-ended auto-await, Other flip, timeout
None, unknown-id idempotency, clear_session cancellation, FIFO
ordering, register/unregister notify, config default.
- tests/gateway/test_telegram_clarify_buttons.py (12 tests) — render
paths (multi-choice/open-ended/long-label/HTML-escape/not-connected),
callback dispatch (numeric resolve/Other flip/already-resolved/
unauthorized/invalid-token), and base-adapter text fallback.
Out of scope: bot-to-bot, guest mode, checklists, poll media, live
photos. Closes#24191.
* feat(lsp): semantic diagnostics from real language servers in write_file/patch
Wire ~26 language servers (pyright, gopls, rust-analyzer, typescript-language-server,
clangd, bash-language-server, ...) into the post-write lint check used by write_file
and patch. The model now sees type errors, undefined names, missing imports, and
project-wide semantic issues introduced by its edits, not just syntax errors.
LSP is gated on git workspace detection: when the agent's cwd or the file being
edited is inside a git worktree, LSP runs against that workspace; otherwise the
existing in-process syntax checks are the only tier. This keeps users on
user-home cwds (Telegram/Discord gateway chats) from spawning daemons.
The post-write check is layered: in-process syntax check first (microseconds),
then LSP semantic diagnostics second when syntax is clean. Diagnostics are
delta-filtered against a baseline captured at write start, so the agent only
sees errors its edit introduced. A flaky/missing language server can never
break a write -- every LSP failure path falls back silently to the syntax-only
result.
New module agent/lsp/ split into:
- protocol.py: Content-Length JSON-RPC framer + envelope helpers
- client.py: async LSPClient (spawn, initialize, didOpen/didChange,
ContentModified retry, push/pull diagnostic stores)
- workspace.py: git worktree walk-up + per-server NearestRoot resolver
- servers.py: registry of 26 language servers (extension match,
root resolver, spawn builder per language)
- install.py: auto-install dispatch (npm install --prefix, go install
with GOBIN, pip install --target) into HERMES_HOME/lsp/bin/
- manager.py: LSPService (per-(server_id, root) client registry, lazy
spawn, broken-set, in-flight dedupe, sync facade for tools layer)
- reporter.py: <diagnostics> block formatter (severity-1-only, 20-per-file)
- cli.py: hermes lsp {status,list,install,install-all,restart,which}
Wired into tools/file_operations.py:
- write_file/patch_replace now call _snapshot_lsp_baseline before write
- _check_lint_delta gains a third tier: LSP semantic diagnostics when
syntax is clean
- All LSP code paths swallow exceptions; write_file's contract unchanged
Config: 'lsp' section in DEFAULT_CONFIG with enabled (default true),
wait_mode, wait_timeout, install_strategy (default 'auto'), and per-server
overrides (disabled, command, env, initialization_options).
Tests: tests/agent/lsp/ -- 49 tests covering protocol framing (encode and
read_message round-trip, EOF/truncation/missing Content-Length), workspace
gate (git walk-up, exclude markers, fallback to file location), reporter
(severity filter, max-per-file cap, truncation), service-level delta filter,
and an in-process mock LSP server that exercises the full client lifecycle
including didChange version bumps, dedup, crash recovery, and idempotent
teardown.
Live E2E verified end-to-end through ShellFileOperations: pyright
auto-installed via npm into HERMES_HOME, baseline captured, type error
introduced, single delta diagnostic surfaced with correct line/column/code/
source, then patch fix removes the diagnostic from the output.
Docs: new website/docs/user-guide/features/lsp.md page covering supported
languages, configuration knobs, performance characteristics, and
troubleshooting; cli-commands.md updated with the 'hermes lsp' reference;
sidebar updated.
* feat(lsp): structured logging, backend gate, defensive walk caps
Cherry-picks the substantive ideas from #24155 (different scope, same
problem space) onto our PR.
agent/lsp/eventlog.py (new): dedicated structured logger
``hermes.lint.lsp`` with steady-state silence. Module-level dedup sets
keep a 1000-write session at exactly ONE INFO line ("active for
<root>") at the default INFO threshold; clean writes log at DEBUG so
they never reach agent.log under normal config. State transitions
(server starts, no project root for a file, server unavailable) fire
at INFO/WARNING once per (server_id, key); novel events (timeouts,
unexpected errors) fire WARNING per call. Grep recipe: ``rg 'lsp\\['``.
agent/lsp/manager.py: wire the eventlog into _get_or_spawn and
get_diagnostics_sync so users can answer "did LSP fire on this edit?"
with a single grep, plus surface "binary not on PATH" warnings once
instead of silently retrying every write.
tools/file_operations.py: backend-type gate. ``_lsp_local_only()``
returns False for non-local backends (Docker / Modal / SSH /
Daytona); ``_snapshot_lsp_baseline`` and ``_maybe_lsp_diagnostics``
now skip entirely on remote envs. The host-side language server
can't see files inside a sandbox, so this prevents pretending to
lint a file the host process can't open.
agent/lsp/protocol.py: 8 KiB cap on the header block in
``read_message``. A pathological server that streams headers
without ever emitting CRLF-CRLF would have looped forever consuming
bytes; now raises ``LSPProtocolError`` instead.
agent/lsp/workspace.py: 64-step cap on ``find_git_worktree`` and
``nearest_root`` upward walks, plus try/except containment around
``Path(...).resolve()`` and child ``.exists()`` calls. Defensive
against pathological inputs (symlink loops, encoding errors,
permission failures mid-walk) — the lint hook is hot-path code and
must never raise.
Tests:
- tests/agent/lsp/test_eventlog.py: 18 tests covering steady-state
silence (clean writes stay DEBUG), state-transition INFO-once
semantics (active for, no project root), action-required
WARNING-once (server unavailable), per-call WARNING (timeouts,
spawn failures), and the "1000 clean writes => 1 INFO" contract.
- tests/agent/lsp/test_backend_gate.py: 5 tests verifying
_lsp_local_only / snapshot_baseline / maybe_lsp_diagnostics skip
the LSP layer for non-local backends and route correctly for
LocalEnvironment.
- tests/agent/lsp/test_protocol.py: new test_read_message_rejects_runaway_header
exercising the 8 KiB cap.
Validation:
- 73/73 LSP tests pass (49 original + 18 eventlog + 5 backend-gate + 1 framer cap)
- 198/198 pass when run alongside existing file_operations tests
- Live E2E re-run with pyright still surfaces "ERROR [2:12] Type
... reportReturnType (Pyright)" through the full path, then patch
fix removes it on the next call.
* feat(lsp): atexit cleanup + separate lsp_diagnostics JSON field
Two improvements salvaged from #24414's plugin-form alternative,
keeping our core-integrated design:
1. atexit cleanup of spawned language servers
----------------------------------------------------------------
``agent/lsp/__init__.get_service`` now registers an ``atexit``
handler on first creation that tears down the LSPService on
Python exit. Without this, every ``hermes chat`` exit was
leaking pyright/gopls/etc. processes for a few seconds while
their stdout buffers drained -- they got reaped by the kernel
eventually but a watchful ``ps aux`` would catch them.
The handler runs once per process (gated by
``_atexit_registered``); idempotent ``shutdown_service``
ensures double-fire is a no-op. Errors during shutdown are
swallowed at debug level since by the time atexit fires the
user has already seen the agent's final response.
2. Separate ``lsp_diagnostics`` field on WriteResult / PatchResult
----------------------------------------------------------------
Previously the LSP layer folded its diagnostic block into the
``lint.output`` string, conflating the syntax-check tier with
the semantic tier. The agent (and any downstream parsers) now
read syntax errors and semantic errors as independent signals:
{
"bytes_written": 42,
"lint": {"status": "ok", "output": ""},
"lsp_diagnostics": "<diagnostics file=...>\nERROR [2:12] ..."
}
``_check_lint_delta`` returns to its original two-tier shape
(syntax check + delta filter); ``write_file`` and
``patch_replace`` independently fetch LSP diagnostics via
``_maybe_lsp_diagnostics`` and pass them into the new field.
``patch_replace`` propagates the inner write_file's
``lsp_diagnostics`` so the outer PatchResult carries the patch's
delta correctly.
Tests: 19 new
- tests/agent/lsp/test_lifecycle.py (8 tests): atexit registration
fires once and only once across N get_service calls; the
registered callable is our internal shutdown wrapper;
shutdown_service is idempotent and safe when never started;
exceptions during shutdown are swallowed; inactive service is
cached so we don't rebuild on every check.
- tests/agent/lsp/test_diagnostics_field.py (11 tests): WriteResult
/ PatchResult dataclass shape, to_dict include/omit semantics,
channel separation (lint and lsp_diagnostics carry independent
signals), write_file populates the field via
_maybe_lsp_diagnostics only when the syntax tier is clean,
patch_replace propagates the field forward from its internal
write_file.
Validation:
- 92/92 LSP tests pass (73 prior + 8 lifecycle + 11 diagnostics field)
- 217/217 pass with file_operations + LSP combined
- Live E2E reverified: clean writes -> both fields empty/none; type
error introduced -> lint clean (parses), lsp_diagnostics carries
the pyright reportReturnType block; patch fix -> both fields
clean again.
* fix(lsp): broken-set short-circuit so a wedged server isn't paid every write
Discovered while auditing failure paths: a language server binary that
hangs (sleep forever, no LSP traffic on stdin/stdout) caused EVERY
subsequent write to re-pay the 8s snapshot_baseline timeout. Five
writes = ~64s of dead time.
The bug: ``_get_or_spawn`` adds the (server_id, root) pair to
``_broken`` inside its inner exception handler, but when the OUTER
``_loop.run`` timeout fires, it cancels the inner task before that
handler runs. The pair never makes it to broken-set, so the next
write re-enters the spawn path and re-pays the timeout.
Fix:
- New ``_mark_broken_for_file`` helper at the service layer marks
the (server_id, workspace_root) pair broken from the OUTSIDE when
the outer timeout fires. Called from the except branches in
``snapshot_baseline``, ``get_diagnostics_sync`` (asyncio.TimeoutError
+ generic Exception). Also kills any orphan client process that
survived the cancelled future, fire-and-forget with a 1s ceiling.
- ``enabled_for`` now consults the broken-set BEFORE returning True.
Files in already-broken (server_id, root) pairs short-circuit to
False, so the file_operations layer skips the LSP path entirely
with no spawn cost. Until the service is restarted (``hermes lsp
restart``) or the process exits.
- A single eventlog WARNING is emitted on first mark-broken so the
user knows which server gave up. Subsequent edits in the same
project stay silent.
Tests: 7 new in tests/agent/lsp/test_broken_set.py — covers the
key shape (server_id, per_server_root), enabled_for short-circuit,
sibling-file skip in same project, project isolation (broken in
A doesn't affect B), graceful no-op for missing-server / no-workspace,
and an end-to-end test that snapshots after a failure and verifies
the next ``enabled_for`` returns False.
Validation:
- Live retest of the wedged-binary scenario: 5 sequential writes,
first 8.88s (the one snapshot timeout), subsequent four ~0.84s
(no LSP cost). Down from 5x12.85s = 64s before this fix.
- 99/99 LSP tests pass (92 prior + 7 broken-set)
- 224/224 pass with file_operations + LSP combined
- Happy path E2E reverified — clean write, type error introduced,
patch fix all behave correctly with the new broken-set logic.
Note: the FIRST write to a wedged binary still pays 8s (the
snapshot_baseline timeout). We could shorten that, but pyright/
tsserver normally take 2-3s and slow CI rust-analyzer can need
5+ seconds, so 8s is the conservative ceiling. Subsequent writes
are instant.
Allow integrations to share a visible Camofox identity with Hermes and recover existing tabs without carrying local patches.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(install): use `--extra all` not `--all-extras`; drop lazy-covered extras from [all]
Two coupled fixes for the Windows install hang where uv sync built
python-olm from sdist and failed on missing make.
# Root cause: --all-extras vs --extra all (credit: ethernet)
`uv sync --all-extras` installs every key in [project.optional-
dependencies], bypassing the curated [all] extra entirely. So even
when [all] excluded [matrix], [rl], [yc-bench], etc., the installer
pulled them anyway because they were still defined as extras. On
Windows that meant python-olm (no wheel, needs make to build from
sdist) and the install died there.
The right flag is `--extra all` — install just the [all] extra's
contents, respecting curation. Empirically verified via dry-run:
--all-extras: pulls python-olm, mautrix, ctranslate2, onnxruntime,
atroposlib, tinker, wandb, modal, daytona, vercel,
python-telegram-bot, discord.py, slack-bolt,
dingtalk-stream, lark-oapi, anthropic, boto3,
edge-tts, elevenlabs, exa-py, fal-client, faster-
whisper, firecrawl-py, honcho-ai, parallel-web
--extra all: pulls none of those — just [all]'s curated set
Dockerfile already uses `--extra all` (with comment explaining the
gotcha) — knowledge existed; the gap was install.sh / install.ps1 /
setup-hermes.sh.
Sites fixed: scripts/install.sh L1118, scripts/install.ps1 L809,
setup-hermes.sh L245.
# Companion fix: drop lazy-covered extras from [all]
`tools/lazy_deps.py` already covers anthropic, bedrock, exa,
firecrawl, parallel-web, fal, edge-tts, elevenlabs, modal, daytona,
vercel, all messaging platforms (telegram/discord/slack/matrix/
dingtalk/feishu), honcho, and faster-whisper. They were ALSO in
[all], which defeats the whole point of lazy-install — fresh
installs eager-pulled them and inherited whatever was broken
upstream (the matrix → python-olm → no Windows wheel chain being
the proximate symptom).
[all] now contains only what genuinely can't be lazy-installed:
cron, cli, dev, pty, mcp, homeassistant, sms, acp, google, web,
youtube. Same trim applied to [termux-all]. New regression test
asserts the contract: every extra in LAZY_DEPS must NOT also appear
in [all].
# Companion fix: surface uv progress + errors
setup-hermes.sh's hash-verified path swallowed uv's stderr to a
tempfile, identical to the install.sh bug fixed in PR #24504. Same
fix applied: stream stderr through directly so users see live
progress instead of staring at a frozen prompt.
# Files
- pyproject.toml: trim [all] and [termux-all] to non-lazy extras only.
- scripts/install.sh: --all-extras → --extra all; trim _ALL_EXTRAS /
_PYPI_EXTRAS to match.
- scripts/install.ps1: --all-extras → --extra all; trim $allExtras /
$pypiExtras to match.
- setup-hermes.sh: --all-extras → --extra all; stream stderr.
- tests/test_project_metadata.py: invert matrix-in-[all] assertion;
add lazy-coverage contract test.
- uv.lock: regenerated.
# Validation
5/5 metadata tests pass. 37/37 in update_autostash + tool_token_
estimation. `uv lock --check` passes. Empirical dry-run confirms
`--extra all` excludes python-olm + RL chain on the new lockfile.
* fix(install): parse [all] from pyproject.toml instead of mirroring it
ethernet's review point: the previous patch left two hand-mirrored
copies of [all]'s contents (in install.sh's $_ALL_EXTRAS and
install.ps1's $allExtras). That guarantees future drift the next
time pyproject.toml's [all] changes.
Now both scripts parse pyproject.toml at install time using stdlib
tomllib (Python 3.11+, which the bootstrap step already requires).
Single source of truth. The only purpose of the parsed list is to
build the 'Tier 2: [all] minus broken extras' fallback spec — so we
parse, filter against $brokenExtras, and rebuild the .[a,b,c] spec.
Also: removed redundant fallback tiers.
Before: Tier 1 [all]
Tier 2 [all] minus broken
Tier 3 PyPI-only extras (no git deps)
Tier 4 [web,mcp,cron,cli,messaging,dev]
Tier 5 .
After: Tier 1 [all]
Tier 2 [all] minus broken
Tier 3 .
Tier 3 (PyPI-only) and Tier 4 (dashboard+core) used to dodge the [rl]
git+sdist deps and the [matrix] python-olm build. Both are no longer
in [all] post-2026-05-12 lazy-install migration, so the carve-out
tiers had no remaining content. Tier 4 also referenced [messaging],
which is now lazy-installed — the hardcoded fallback was actually
inconsistent with the new policy.
Defensive fallback: if tomllib parse fails (corrupted pyproject,
unexpected schema), Tier 2 collapses to '.[all]' (same as Tier 1) so
the broken-extras path becomes a no-op rather than crashing.
* fix(gateway): hide Matrix from setup picker on Windows
Matrix is the one messaging platform that has no working install path
on Windows: [matrix] -> mautrix[encryption] -> python-olm, which has
Linux-only wheels and needs make + libolm to build from sdist. The
[all] cleanup in this PR keeps mautrix out of fresh installs, but a
user who picked Matrix in 'hermes setup gateway' would still walk
into the same sdist build failure when the wizard tried to install
the extra.
Hide the option at the picker so users never get the chance to try.
The gate lives in _all_platforms() — single source of truth for the
setup wizard, the curses gateway-config menu, and any future picker.
Adapter loading at runtime is intentionally NOT gated: users who
already have MATRIX_* env vars set (e.g. config copied from a Linux
install) keep working if they somehow have python-olm available.
This is the lowest-friction fix — picker visibility only.
Tests cover linux/darwin/win32 and verify other platforms aren't
collateral damage.
Detect when write_file / patch calls fail during a turn and are never
superseded by a successful write to the same path. When the final
text response is delivered, append an advisory footer listing the
files that did NOT change — so models that over-claim 'patched 5 files'
after 4 silent failures can't hide the lie.
Catches the failure mode reported in Ben Eng's llm-wiki session:
grok-4.1-fast issued batches of parallel patches, half failed with
'Could not find old_string', and the agent summarised the turn
claiming every file was edited. The user had to manually run
'git status' each turn to catch it.
The verifier is a pure post-hoc check on tool results — no new LLM
calls, no synthetic messages injected into history (prompt cache
preserved), no changes to tool argument dispatch. Per-turn state is
keyed by path; a later successful write to the same path clears the
failure entry so single-file retry recovery is not flagged.
Wired into both _execute_tool_calls_concurrent and
_execute_tool_calls_sequential, so batched parallel patches and one-at-
a-time edits are both covered. Footer emission happens after the
agent loop exits, before transform_llm_output / post_llm_call plugin
hooks run, so plugins still see (and can modify) the augmented text.
Config: display.file_mutation_verifier (bool, default true) +
HERMES_FILE_MUTATION_VERIFIER env override.
31 unit tests in tests/run_agent/test_file_mutation_verifier.py cover
target extraction (write_file, patch-replace, patch-v4a single and
multi-file), error-preview extraction (JSON .error field and plain
string), per-turn state transitions (first-error-wins on repeated
failure, success supersedes failure), footer rendering (truncation
at 10 entries, user-actionable hint), and env/config precedence.
Companion docs updated: user-guide/configuration.md +
reference/environment-variables.md.
* feat(security): supply-chain advisory checker + lazy-install framework + tiered install fallback
Three coordinated mitigations for the Mini Shai-Hulud worm hitting
mistralai 2.4.6 on PyPI (2026-05-12) and for the next single-package
compromise that follows.
# What this PR makes true
1. Users with the poisoned mistralai 2.4.6 in their venv get a loud
detection banner with copy-pasteable remediation steps the moment
they run hermes (and on every gateway startup).
2. One quarantined / yanked PyPI package can no longer silently demote
a fresh install to 'core only' — the installer keeps every other
extra and tells the user which tier landed.
3. Future opt-in backends (Mistral, ElevenLabs, Honcho, etc.) can
lazy-install on first use under a strict allowlist, instead of
eagerly pulling everything at install time.
# Detection: hermes_cli/security_advisories.py
- ADVISORIES catalog (one entry currently: shai-hulud-2026-05 for
mistralai==2.4.6). Adding the next one is a single dataclass.
- detect_compromised() uses importlib.metadata.version() — no pip
dependency, works in uv venvs that lack pip.
- Banner cache (~/.hermes/cache/advisory_banner_seen) rate-limits
the startup banner to once per 24h per advisory.
- Acks persisted to security.acked_advisories in config.yaml; never
re-banner after ack.
- Wired into:
* hermes doctor — runs first, prints full remediation block
* hermes doctor --ack <id> — dismisses an advisory
* cli.py interactive run() and single-query branches — short
stderr banner pointing at hermes doctor
* gateway/run.py startup — operator-visible warning in gateway.log
# Lazy-install framework: tools/lazy_deps.py
- LAZY_DEPS allowlist maps namespaced feature keys (tts.elevenlabs,
memory.honcho, provider.bedrock, etc.) to pip specs.
- ensure(feature) installs missing deps in the active venv via the
uv → pip → ensurepip ladder (matches tools_config._pip_install).
- Strict spec safety regex rejects URLs, file paths, shell metas,
pip flag injection, control chars — only PyPI-by-name accepted.
- Gated on security.allow_lazy_installs (default true) plus the
HERMES_DISABLE_LAZY_INSTALLS env var for restricted/audited envs.
- Migrated three backends as proof of pattern:
* tools/tts_tool.py — _import_elevenlabs() calls ensure first
* plugins/memory/honcho/client.py — get_honcho_client lazy-installs
* tts.mistral / stt.mistral entries pre-registered for when PyPI
restores mistralai
# Installer fallback tiers
scripts/install.sh, scripts/install.ps1, setup-hermes.sh:
- Centralised _BROKEN_EXTRAS list (currently: mistral). Edit one
array when a transitive breaks; users keep every other extra.
- New 'all minus known-broken' tier between [all] and the existing
PyPI-only-extras tier. Only kicks in when [all] fails resolve.
- All three tiers explicit: every fallback announces which tier
landed and prints a re-run hint when not on Tier 1.
- install.ps1 and install.sh both regenerate their tier specs from
the same _BROKEN_EXTRAS array so updates stay in sync.
Side effect: install.ps1 Tier 2 spec previously hardcoded 'mistral'
in its extra list — bug fixed by the refactor (mistral is filtered
out).
# Config
hermes_cli/config.py — DEFAULT_CONFIG.security gains:
- acked_advisories: [] (advisory IDs the user has dismissed)
- allow_lazy_installs: True (security gate for ensure())
No config version bump needed — both keys nest under existing
security: block, and load_config's deep-merge picks up DEFAULT_CONFIG
defaults for users with older configs.
# Tests
tests/hermes_cli/test_security_advisories.py — 23 tests covering:
- detect_compromised matches/non-matches, wildcard frozenset
- ack persistence, idempotence, blank rejection, config-failure path
- banner cache rate limiting + 24h re-banner + ack-stops-banner
- short_banner_lines / full_remediation_text / render_doctor_section /
gateway_log_message
- shipped catalog well-formedness invariant
tests/tools/test_lazy_deps.py — 40 tests covering:
- spec safety: 11 safe parametrized + 18 unsafe parametrized
- allowlist: unknown-feature rejection, namespace.name shape,
every shipped spec passes the safety regex
- security gating: config flag, env var, default, fail-open
- ensure() happy/sad paths: already-satisfied, install success,
pip stderr surfaced on failure, install-succeeds-but-still-missing
- is_available, feature_install_command
Combined: 63 new tests, all passing under scripts/run_tests.sh.
# Validation
- scripts/run_tests.sh tests/hermes_cli/test_security_advisories.py
tests/tools/test_lazy_deps.py → 63/63 passing
- scripts/run_tests.sh tests/hermes_cli/test_doctor.py
tests/hermes_cli/test_doctor_command_install.py
tests/tools/test_tts_mistral.py tests/tools/test_transcription_tools.py
tests/tools/test_transcription_dotenv_fallback.py → 165/165 passing
- scripts/run_tests.sh tests/hermes_cli/ tests/tools/ →
9191 passed, 8 pre-existing failures (verified on origin/main
before this change)
- bash -n on install.sh and setup-hermes.sh → OK
- py_compile on all modified .py files → OK
- End-to-end smoke test of detect_compromised + render_doctor_section
+ gateway_log_message with mocked installed version → produces
copy-pasteable remediation output
# Community
Full advisory + remediation steps:
website/docs/community/security-advisories/shai-hulud-mistralai-2026-05.md
Short-form post drafts (Discord, GitHub pinned issue, README banner):
scripts/community-announcement-shai-hulud.md
Refs: PR #24205 (mistral disabled), Socket Security advisory
<https://socket.dev/blog/mini-shai-hulud-worm-pypi>
* build(deps): pin every direct dep to ==X.Y.Z (no ranges)
Companion to the supply-chain advisory work: replace every >=/</~= range
in pyproject.toml's [project.dependencies] and [project.optional-dependencies]
with an exact ==X.Y.Z pin sourced from uv.lock.
Why: ranges allow PyPI to ship a fresh version of any direct dep at any
time without a code review on our side. With ranges, the malicious
mistralai 2.4.6 release would have been pulled by every fresh
'pip install -e .[all]' for the hours between upload and PyPI's
quarantine — exactly the install window we got hit on. Exact pins close
that window: the only way a new package version reaches a user is via
an intentional update on our end.
What the user-facing change is: nothing, behavior-wise. Every package
resolves to the same version it was already resolving to via uv.lock —
the pins just remove the resolver's freedom to pick a different one.
Cost: any user installing Hermes alongside another package that requires
a newer pin gets a resolver conflict. Acceptable for our isolated-venv
install path; documented in the new comment block.
Build-system requires line (setuptools>=61.0) is intentionally left
as a range — pinning the build backend would block fresh pip from
bootstrapping the build on architectures where that exact wheel isn't
available.
mistral extra (mistralai==2.3.0) is pinned but stays out of [all]
(per PR #24205). 'uv lock' regeneration will fail until PyPI restores
mistralai; lockfile regeneration is gated behind that, NOT on every PR.
LAZY_DEPS in tools/lazy_deps.py also moved to exact pins so the lazy-
install pathway can never resolve a different version than the one
declared in pyproject.toml.
Validation:
- Cross-checked all 77 pinned direct deps in pyproject.toml against
uv.lock — every pin matches the resolved version exactly.
- Cross-checked all LAZY_DEPS specs against uv.lock — same.
- 'uv pip install -e .[all] --dry-run' resolves 205 packages cleanly.
- tests/tools/test_lazy_deps.py + tests/hermes_cli/test_security_advisories.py
→ 63/63 passing (every shipped spec passes the safety regex).
- Doctor + TTS + transcription targeted suite → 146/146 passing.
* build(deps): hash-verify transitives via uv.lock; remove unresolvable [mistral] extra
You asked: 'what about the dependencies the dependencies rely on?' —
correctly noting that exact-pinning direct deps in pyproject.toml does
NOT cover the transitive graph. `pip install` and `uv pip install` both
re-resolve transitives fresh from PyPI at install time, so a compromised
transitive (e.g. `httpcore` if it got worm-poisoned tomorrow) would
still hit our users even with every direct dep exact-pinned.
# What this commit fixes
1. **Both real installer scripts now prefer `uv sync --locked` as Tier 0.**
uv.lock records SHA256 hashes for every transitive — a compromised
package with a different hash gets REJECTED. Falls through to the
existing `uv pip install` cascade if the lockfile is missing or
stale, with a loud warning that the fallback path does NOT
hash-verify transitives. Previously only `setup-hermes.sh` (the dev
path) used the lockfile; `scripts/install.sh` and `scripts/install.ps1`
(the paths fresh users actually run) skipped it.
2. **Removed the `[mistral]` extra entirely.** The `mistralai` PyPI
project is fully quarantined right now — every version returns 404,
so any pin we wrote was unresolvable, which broke `uv lock --check`
in CI. Restoration is documented in pyproject.toml as a 5-step
checklist (verify, re-add extra, re-enable in 4 modules, regenerate
lock, optionally re-add to [all]).
3. **Regenerated uv.lock.** 262 packages, mistralai/eval-type-backport/
jsonpath-python pruned. `uv lock --check` now passes.
# Defense-in-depth view
| Layer | Where | Protects against |
|----------------------------|-------------------|-------------------------------------------|
| Exact pins in pyproject | direct deps | new mistralai 2.4.6-style direct compromise |
| uv.lock + `--locked` install | transitive graph | transitive worm injection |
| Tier-0 hash-verified path | install.sh / .ps1 | actually USE the lockfile in fresh installs |
| `uv lock --check` CI gate | every PR | drift between pyproject and lockfile |
| `hermes_cli/security_advisories.py` | runtime | cleanup for users who already got hit |
The exact pinning + hash verification together close the supply-chain
gap. Without the lockfile path, exact pins alone are theater.
# Validation
- `uv lock --check` → passes (262 packages resolved, no drift).
- `bash -n` on install.sh + setup-hermes.sh → OK.
- 209/209 tests passing across new + adjacent test files
(test_lazy_deps.py, test_security_advisories.py, test_doctor.py,
test_tts_mistral.py, test_transcription_tools.py).
- TOML parse OK.
* chore: remove community announcement drafts (PR body covers it)
* build(deps): lazy-install every opt-in backend (anthropic, search, terminal, platforms, dashboard)
Extends the lazy-install framework to cover everything that's not used by
every hermes session. Base install drops from ~60 packages to 45.
Moved out of core dependencies = []:
- anthropic (only when provider=anthropic native, not via aggregators)
- exa-py, firecrawl-py, parallel-web (search backends; only when picked)
- fal-client (image gen; only when picked)
- edge-tts (default TTS but still optional)
New extras in pyproject.toml: [anthropic] [exa] [firecrawl] [parallel-web]
[fal] [edge-tts]. All added to [all].
New LAZY_DEPS entries: provider.anthropic, search.{exa,firecrawl,parallel},
tts.edge, image.fal, memory.hindsight, platform.{telegram,discord,matrix},
terminal.{modal,daytona,vercel}, tool.dashboard.
Each import site now calls ensure() before importing the SDK. Where the
module had a top-level try/except (telegram, discord, fastapi), the
graceful-fallback pattern was extended to lazy-install on first
check_*_requirements() call and re-bind module globals.
Updated test_windows_native_support.py tzdata check from snapshot
(>=2023.3 literal) to invariant (any version + win32 marker).
Validation:
- Base install: 45 packages (was ~60); 6 newly-extracted packages absent
- uv lock --check: passes (262 packages, no drift)
- 209/209 lazy_deps + advisory + doctor + tts/transcription tests passing
- py_compile clean on all 12 modified modules
The `mistralai` PyPI package was quarantined on 2026-05-12 after a
malicious 2.4.6 release. Every fresh resolve (AUR makepkg, Docker build,
CI run, install.sh first-run) currently fails on
`mistralai>=2.3.0,<3` because PyPI returns zero candidates.
Existing users running `hermes update` mostly didn't notice — `hermes
update` falls back from `.[all]` to per-extra retries and silently
skips mistral with a warning that scrolls past. But fresh installs
hard-fail or lose every other extra.
Changes:
- pyproject.toml: drop `hermes-agent[mistral]` from `[all]` and
`[termux-all]`. The `mistral` extra itself is preserved so users
can opt back in once PyPI un-quarantines.
- hermes_cli/tools_config.py: hide Mistral Voxtral TTS from the
`hermes tools` provider picker until restored.
- hermes_cli/web_server.py: drop "mistral" from dashboard STT options.
- tools/transcription_tools.py: explicit `provider: mistral` returns
"none" with a clear status message; auto-detect skips mistral.
- tools/tts_tool.py: dispatcher returns a clear "temporarily disabled"
error before any SDK import attempt (avoids cached-stale-package
surprises).
- tests/tools/: update three test files to assert the new disabled
behavior. Each test docstring records why and points at the rollback
trigger (PyPI un-quarantines mistralai).
Restore plan: revert this commit once the package is available on PyPI
again. The behavior change is intentional and documented in code
comments + test docstrings to make the rollback trivial.
Validation:
- scripts/run_tests.sh tests/tools/ -k 'mistral or stt or tts' →
425/425 passing.
Refs: https://pypi.org/simple/mistralai/ (currently
"pypi:project-status: quarantined").
Handle MiniMax OAuth expiry values consistently across CLI and dashboard
flows, fix CLI status/add behavior, and force pooled OAuth runtime
requests through Anthropic Messages.
- web_server._minimax_poller: parse expired_in via the shared resolver
so unix-ms absolute timestamps stop landing as TTL seconds and crashing
with 'year 583911 is out of range' when a user connects MiniMax OAuth
from the dashboard.
- auth._minimax_oauth_login / _refresh_minimax_oauth_state: same fix on
the CLI login + refresh paths.
- auth.get_auth_status: dispatch minimax-oauth to its dedicated status
function instead of falling through.
- auth_commands.auth_add_command: 'hermes auth add minimax-oauth' now
starts the device-code login flow and persists a pool entry with the
access + refresh tokens, instead of requiring credentials to already
exist.
- runtime_provider._resolve_runtime_from_pool_entry: pin pooled
minimax-oauth credentials to anthropic_messages so a stale
model.api_mode: chat_completions can't send requests to
/anthropic/chat/completions and trigger MiniMax nginx 404s.
Co-authored-by: Cursor <cursoragent@cursor.com>
Free-tier users were seeing 'No free models currently available.' in the
`hermes model` and post-login pickers even though qwen/qwen3.6-plus is
free on the Portal right now. Three independent breakages compounded:
1. The docs-hosted catalog manifest at website/static/api/model-catalog.json
was not regenerated when _PROVIDER_MODELS['nous'] was updated, so users
fetching the manifest got a list that didn't include qwen/qwen3.6-plus.
2. _resolve_nous_pricing_credentials() returned ('', '') on any auth blip,
collapsing get_pricing_for_provider('nous') to {} and making every
curated model fall through the free-tier filter as 'paid'.
3. Even with healthy pricing, the picker only ever showed models from the
in-repo curated list intersected with live pricing — a Portal-flagged
free model not yet in the curated list could never appear.
Changes:
- hermes_cli/models.py: new union_with_portal_free_recommendations() that
augments the curated list with Portal freeRecommendedModels entries
(with synthetic free pricing so partition keeps them). The Portal's
/api/nous/recommended-models endpoint is now the source of truth for
free-tier surfacing — old Hermes builds will see new free models
without a CLI release.
- hermes_cli/models.py: _resolve_nous_pricing_credentials() falls back to
the public inference base URL when runtime cred resolution fails.
The /v1/models endpoint exposes pricing without auth, so silently
returning {} just because a refresh token expired was wrong.
- hermes_cli/auth.py + hermes_cli/main.py: both free-tier picker call
sites call union_with_portal_free_recommendations() before partition.
- tests/hermes_cli/test_models.py: 7 tests covering union behaviour
(prepend, dedup, end-to-end with stale pricing, empty/missing/error
payloads, invalid entries).
- tests/hermes_cli/test_model_catalog.py: drift guard
TestManifestMatchesInRepoLists fails CI when _PROVIDER_MODELS['nous']
or OPENROUTER_MODELS is edited without re-running
scripts/build_model_catalog.py. Verified empirically that removing a
manifest entry triggers an assertion with an actionable error message.
Validation:
- 133/133 targeted tests pass (test_models, test_model_catalog,
test_auth_nous_provider).
- Live E2E against the real Portal:
- Stale curated list ['claude-opus','claude-sonnet','gpt-5.4'] (no
qwen) → after union: ['qwen/qwen3.6-plus', ...] →
partition(free_tier=True): selectable=['qwen/qwen3.6-plus'].
- Simulated expired refresh token → anon fetch returns 403 pricing
entries including qwen/qwen3.6-plus -> {prompt:0, completion:0}.
- ruff: clean.
cua-driver was only installed once on toolset enable: `_run_post_setup` early-returns when the binary is already on PATH, so upstream fixes (e.g. v0.1.6 Safari window-focus fix) never reached existing users without manual reinstall.
Two refresh points now:
- `hermes update` re-runs the upstream installer at the end of the update if cua-driver is on PATH (macOS-only, no-op otherwise). Ties driver freshness to the user-controlled update cadence — no startup latency, no per-launch GitHub API call.
- `hermes computer-use install --upgrade` for manual force-refresh.
The upstream `install.sh` always pulls the latest release, so re-running is the canonical upgrade path. No version-comparison logic needed.
`hermes computer-use status` now shows the installed version, and points at `--upgrade` for refreshing.
Fixes#22832.
## Root cause
`hermes_cli/web_server.py:start_oauth_login` dispatched OAuth flows by
the catalog's `flow` field rather than provider id:
if catalog_entry["flow"] == "pkce":
return _start_anthropic_pkce()
The catalog had two `flow: "pkce"` entries — `anthropic` and
`minimax-oauth` — so clicking "Login" on MiniMax in the dashboard's
Keys tab unconditionally launched the Anthropic/Claude PKCE flow.
## Fix
Three changes in `hermes_cli/web_server.py`:
1. Catalog entry for `minimax-oauth` changed from `flow: "pkce"` to
`flow: "device_code"`. From a UX perspective MiniMax is a
verification-URI + user-code flow (open URL, enter code, backend
polls) — same shape as Nous's device-code flow. The PKCE bit
(verifier + challenge from `_minimax_pkce_pair`) is a security
extension that doesn't change the operator experience; the existing
dashboard modal already renders `device_code` correctly for this UX.
2. New MiniMax branch in `_start_device_code_flow`, mirroring the
existing Nous branch but calling MiniMax-specific helpers
(`_minimax_request_user_code`, `_minimax_pkce_pair`). Stashes
verifier + state in the session for the poller to consume. Handles
the overloaded `expired_in` field (could be unix-ms timestamp OR
seconds-from-now duration) the same way `_minimax_poll_token` does.
3. New `_minimax_poller` background thread mirroring `_nous_poller`.
Calls `_minimax_poll_token` → on success builds the same
`auth_state` dict the CLI flow (`_minimax_oauth_login`) builds, and
persists via `_minimax_save_auth_state` so the dashboard path leaves
the system in the same state as `hermes auth add minimax-oauth`.
Plus a dispatcher tightening to prevent regression: the `pkce` branch
now requires `provider_id == "anthropic"`, so any future PKCE provider
added without a proper start function gets a clean
`400 Unsupported flow` rather than silently launching Anthropic OAuth.
## Test
New `tests/hermes_cli/test_web_oauth_dispatch.py`:
- Regression test asserting MiniMax start does NOT return claude.ai
- Sanity test that Anthropic PKCE still works after the dispatcher
tightening
- Forward-looking test: a hypothetical pkce-flagged provider without
an explicit branch is rejected cleanly rather than misrouted
## Limitations
- The dashboard MiniMax path defaults to `region="global"`. CN-region
operators can still use the CLI flow which supports `--region cn`.
Adding a region toggle to the dashboard UI is a follow-up.
The old mtime-tracking staleness machinery (_tui_build_needed,
_hermes_ink_bundle_stale, _find_bundled_tui) tried to avoid rebuilding
by comparing source timestamps to dist/entry.js. This was fragile and
added ~100 lines of code. Replace with three clear paths:
1. HERMES_TUI_DIR set (prebuilt/nix): just node dist/entry.js, no build
2. --dev mode: tsx src/entry.tsx, no build, hot reload
3. Normal: always npm run build (esbuild is ~1s, correctness > caching)
Also error when HERMES_TUI_DIR is set with --dev (footgun: prebuilt
bundle has no source code to hot-reload).
Cuts input cost for first-turn Claude requests by ~85-90% on subsequent
sessions within an hour. Tools array (~13k tokens for default toolset) +
stable system prefix (~5-8k tokens) get a 1h cache_control marker; the
volatile suffix (memory, USER profile, timestamp, session id) sits in a
separate non-cached block at the end so it doesn't poison the cross-session
prefix when it changes.
Provider gate: Claude on native Anthropic (incl. OAuth subscription),
OpenRouter, and Nous Portal (which proxies to OpenRouter). All other
providers keep today's system_and_3 layout unchanged.
Layout (4 cache_control breakpoints, Anthropic max):
1. tools[-1] -> 1h (cross-session)
2. system content[0] -> 1h (cross-session, stable prefix)
3. messages[-2] -> 5m (within-session rolling)
4. messages[-1] -> 5m (within-session rolling)
Within-session rolling shrinks from 3 messages to 2 to free the breakpoint
budget. On Claude with realistic tool loadouts the long-lived tier carries
the bulk of cross-session value anyway.
System prompt is now always assembled cache-friendly: stable identity /
guidance / skills / platform hints first, then session-stable context
files (AGENTS.md, .cursorrules), then per-call volatile content. Old
single-string callers see the same logical content (same join order),
just reordered so volatile lives at the end.
Config knobs (defaults shown):
prompt_caching:
cache_ttl: "5m" # rolling-window TTL (unchanged)
long_lived_prefix: true # opt-out switch
long_lived_ttl: "1h" # cross-session prefix TTL
Live E2E (tests/agent/test_prompt_caching_live.py, gated on
OPENROUTER_API_KEY) on anthropic/claude-haiku-4.5 with default toolset:
Call 1 (cold): cache_write=13,415 cache_read=0
Call 2 (NEW agent + msg): cache_write=391 cache_read=13,025
Cross-session reuse: 97.09%
Implementation:
* agent/prompt_caching.py: new apply_anthropic_cache_control_long_lived()
+ mark_tools_for_long_lived_cache(); existing apply_anthropic_cache_control()
preserved verbatim for the fallback path.
* agent/anthropic_adapter.py: convert_tools_to_anthropic() now forwards
cache_control onto each Anthropic-format tool dict.
* run_agent.py: _build_system_prompt_parts() returns the 3-tier dict;
_build_system_prompt() joins them (backward compatible).
_supports_long_lived_anthropic_cache() policy added next to the existing
_anthropic_prompt_cache_policy() (which now also recognises Nous Portal
Claude — pre-existing gap fixed in passing).
_build_api_kwargs() resolves tools_for_api once and propagates the
marker through all four build paths (anthropic_messages, bedrock,
codex_responses, profile/legacy chat completions).
Long-lived flag plumbed into the runtime snapshot/restore + model-switch
+ fallback-promotion paths.
Tests:
* tests/agent/test_prompt_caching.py: +8 tests (TestMarkToolsForLongLivedCache,
TestApplyAnthropicCacheControlLongLived).
* tests/run_agent/test_anthropic_prompt_cache_policy.py: +9 tests
(TestSupportsLongLivedAnthropicCache matrix across 8 endpoint classes
+ a fallback-target case).
* tests/agent/test_prompt_caching_live.py: new live E2E (skipif when
OPENROUTER_API_KEY is unset; runs outside the hermetic suite).
* Targeted suites: 327/327 pass (caching/adapter/policy/builder).
* tests/agent/ + tests/run_agent/: 3992 pass, 17 skip, 1 pre-existing
flake (test_async_httpx_del_neuter::test_same_key_replaces_stale_loop_entry,
verified failing on pristine origin/main).
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via `ruff --fix --unsafe-fixes`, 0 remaining.
133 files, +626/-626 (net zero).
The /model picker for Nous Portal users was returning the in-repo
_PROVIDER_MODELS["nous"] snapshot — which only updates on Hermes
releases — instead of the remote manifest published at
https://hermes-agent.nousresearch.com/docs/api/model-catalog.json.
OpenRouter already pulled from the manifest via fetch_openrouter_models;
"nous" was the only curated provider where the existing manifest
plumbing (get_curated_nous_model_ids → get_curated_nous_models) was
defined but not wired into the picker pipeline. Switch the curated
build in list_authenticated_providers to use it, with the same
graceful fallback to the in-repo snapshot when the manifest is
unreachable.
Test: tests/hermes_cli/test_model_catalog.py exercises the picker with
a patched manifest and asserts the manifest's nous list reaches
list_picker_providers. Falls-back-to-static path was already covered
by test_curated_nous_ids_falls_back_to_hardcoded_on_empty_catalog.
Follow-up to PR #23824. Adds two correctness fixes on top of the
contributor's salvaged commit:
1. Stale-dist fallback no longer gated on `fatal=False`. `cmd_dashboard`
passes `fatal=True` and is the primary scenario this fallback is for
(issue #23817 — Windows Scheduled Task at logon). The previous gate
meant the fallback never fired in the case it was designed for.
2. `--skip-build` now verifies the dist actually exists before starting
the server. Without this, a misconfigured pre-build would launch the
dashboard pointing at a missing dist and silently serve 404s. We now
exit 1 with a clear "pre-build first: cd web && npm run build"
message, and on success print which dist directory is being used.
Verified end-to-end on Linux:
- build fails + stale dist (fatal=True) -> fallback fires
- build fails + no dist (fatal=True) -> exit 1 with stderr surfaced
- build fails + stale dist (fatal=False) -> fallback fires
- --skip-build + missing dist -> exit 1 with clear guidance
- --skip-build + valid dist -> 'Skipping web UI build...'
Three improvements for non-interactive contexts (Windows Scheduled
Tasks, CI/CD) where the web UI build may fail (issue #23817):
1. Retry build once after 3s — covers boot-time races (antivirus
scanning Node.js, npm cache not ready, transient disk I/O)
2. Fall back to existing dist when build fails (non-fatal mode) —
a stale UI is far better than no UI at all
3. Add --skip-build flag — lets callers pre-build in their wrapper
script and start the dashboard without internal build attempt
4. Surface npm stderr in build failure output for easier debugging
Fixes#23817
On Windows systems using a Chinese GBK locale, `hermes update` could misreport the Web UI build as failed even when `npm run build` actually succeeded. The failure was caused by Python decoding captured npm output with the process locale inside a background subprocess reader thread. When npm emitted bytes such as `0x85`, decoding under GBK raised `UnicodeDecodeError`, and Hermes then surfaced a misleading "Web UI build failed" warning.
This change makes the npm install/npm ci path and the Web UI build step decode captured output explicitly as UTF-8 with `errors="replace"`. That keeps unexpected bytes from crashing output collection, preserves successful builds, and prevents false negatives during update on Windows.
The patch also adds regression tests that verify these subprocess calls always use explicit UTF-8 decoding with replacement semantics.
* Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (#23547)"
This reverts commit a63a2b7c78.
* Revert "fix(goals): forward standing /goal state on auto-compression session rotation (#23530)"
This reverts commit 4a080b1d5a.
* Revert "feat(goals): /goal checklist + /subgoal user controls (#23456)"
This reverts commit 404640a2b7.
The _default_spawn HERMES_HOME injection (PR #23356) calls
resolve_profile_env which raises FileNotFoundError when the profile
dir doesn't exist. In production the profile always exists (workers are
only dispatched for live profiles), but tests with isolated HERMES_HOME
never create profile dirs. Catch FileNotFoundError and fall through —
HERMES_PROFILE is still set below, so the worker CLI resolves the
profile at startup.
When a parent task is archived, dependent child tasks were stuck in
todo forever because recompute_ready and claim_task only checked for
status == 'done'. Now both functions also treat 'archived' as a
terminal status, allowing children to proceed when their parent is
archived.
Fixes#23180.
Default spawn did not propagate HERMES_HOME when forking kanban workers.
The worker's env is copied from the parent via dict(os.environ), so
HERMES_HOME is absent. When the child then starts hermes -p <profile>,
the CLI's _apply_profile_override() runs before hermes_constants is
imported and get_hermes_home() falls back to ~/.hermes (the default
profile root), silently ignoring the profile's config.yaml. Profile-
scoped fallback_providers, toolsets, and agent settings are therefore
never applied to kanban workers.
The fix injects HERMES_HOME into the worker's env using
resolve_profile_env(profile_arg) so the child reads the correct profile
directory instead of the default root.
Two independent opt-in QoL toggles, both off by default.
terminal.docker_extra_args:
- List of extra flags appended verbatim to docker run after security
defaults. Useful for adding capabilities (e.g. --cap-add SETUID) or
other docker run options not exposed by existing config keys.
- Non-string entries are logged and skipped.
- Also available via TERMINAL_DOCKER_EXTRA_ARGS='[...]' env var.
display.timestamps:
- Appends [HH:MM] to user input bullet and the assistant response box
header. Single hub in _format_submitted_user_message_preview()
covers both single-line and multi-line user previews; assistant
response label gets the timestamp at box-open time.
Closes#1569 (timestamps).
Co-authored-by: Mibayy <Mibayy@users.noreply.github.com>
A YAML parse error in ~/.hermes/config.yaml caused load_config() to print
one line to stdout (Warning: Failed to load config: ...) and silently fall
back to DEFAULT_CONFIG, dropping every user override (auxiliary providers,
fallback chain, model settings). Users only noticed when downstream
behavior misbehaved — see issue #23570 where a tab-indent error in the
auxiliary section caused aux fallback to use OpenRouter (depleted) instead
of the configured Codex/MiniMax chain.
Now: log at WARNING (so 'hermes logs' surfaces it), write a prominent line
to stderr, dedup on (path, mtime_ns, size) so concurrent loads don't spam,
and re-warn after the user edits the file. Both call sites (raw read +
merged load) route through the same helper.
Refs #23570
Surface ready tasks that nobody claims within a threshold (default
30 min) regardless of why. One identity-agnostic signal that catches:
- Operator typo'd the assignee
- Profile was deleted, leaving its tasks stranded
- External worker pool (Codex CLI lane, custom daemon) is down
- Dispatcher misconfigured (wrong board / wrong HERMES_HOME)
Today the dispatcher correctly skips these (no respawn loop, good)
but nothing surfaces the fact that operator-actionable work is
accumulating. The new `stranded_in_ready` rule does that without
requiring a manual lane registry — it reads the most recent ready-
transition event (`created` / `promoted` / `reclaimed` / `unblocked`)
and fires when (now - last_ready_ts) > threshold.
Severity escalates with age: warning at threshold, error at 2x,
critical at 6x. The cli_hint and reassign actions point operators
at the right next step.
Out of scope deliberately:
- Lane registry (#20157 closed) — this signal supersedes it.
- Pushing the diagnostic into messaging gateways — diagnostics
are pull-only via 'hermes kanban diagnostics' for now; gateway
push is a separate UX decision.
Tests: 10 new + 461 existing kanban tests pass. E2E verified end-
to-end via 'hermes kanban diagnostics --json' against a 2h-old
stranded task — surfaces as error severity with correct actions.
Live-tested on gemini-3-flash-preview the judge kept returning empty
or non-JSON content, tripping the consecutive-parse-failures auto-
pause. Free-form JSON output is hopeful; tool-call schemas are
enforced server-side by virtually every modern provider.
Two new tools the judge calls:
- submit_checklist(items) — Phase A, decompose
- update_checklist(updates, new_items, reason) — Phase B, evaluate
Both phases now call the auxiliary client with tool_choice forcing
the right tool. read_file remains for Phase B history inspection,
with the loop exiting only when update_checklist is called or the
read budget is exhausted (at which point read_file is dropped from
the toolbox and update_checklist is forced).
Robustness:
- _call_judge_with_tool_choice falls back tool_choice forced→required→
auto if the provider rejects a particular shape.
- If a fully-broken provider still returns content instead of a tool
call, the legacy JSON-text parsers stay around as a last-ditch
backstop so we never silently lose a checklist.
- _normalize_update_args replaces the JSON parser for the apply
layer; same 1-based→0-based conversion + terminal-status filter.
Live verification: same fizzbuzz goal that was hitting 'judge model
returned unparseable output 3 turns in a row' before now terminates
in 2 turns, all 11 items marked completed with item-specific
evidence, no auto-pause. Agent log shows
'produced 11 checklist items via tool call' instead of the JSON-
parse path.
Tests: 7 new cases for the tool-call path (Phase A success, Phase B
update only, Phase B read_file→update, JSON-content backstop,
empty-text item dropping, non-terminal status filter).
Out-of-scope behavior change in #23521 — the kanban notifier-routing fix
also flipped the 'kanban create --created-by' default from 'user' to the
active profile name. Revert to keep PR scope focused on the notifier
ownership fix; the profile-aware author default can be its own change.