The 'sessions' command has been registered in the central command
registry since #20805 (May 2025) and surfaces in /help and tab-completion,
but the classic CLI's process_command() never had an elif branch for it.
The canonical name fell through and printed 'Unknown command: sessions'.
The TUI side was wired up correctly via the SessionPicker overlay; only
the legacy CLI was missing the dispatch.
Adds _handle_sessions_command() which mirrors /resume's no-arg behavior
inline (the CLI has no overlay primitive equivalent to the TUI picker):
- /sessions and /sessions list → print the recent-sessions table
- /sessions <id_or_title> → delegates to _handle_resume_command
Includes regression tests covering the dispatcher wiring (the original
bug) plus the three handler branches.
Pre-existing diagnostics below an edit point used to surface as 'LSP
diagnostics introduced by this edit' whenever the edit deleted or
inserted lines. The delta-filter key included the diagnostic's
range, so the same logical error reported at a different line in
the post-edit snapshot looked like a brand new diagnostic.
Concrete case: deleting 14 lines in cli.py caused Pyright errors at
lines 9873, 10590, 12413, 13004 (unrelated to the edit) to be
reported as introduced by it.
Fix: build a piecewise-linear line-shift map (via difflib's
SequenceMatcher) from pre and post content, and remap baseline
diagnostics into post-edit coordinates before the set-difference.
Diagnostics in deleted regions drop out cleanly; diagnostics below
the edit shift by the right amount; diagnostics above are untouched.
The strict (range-aware) equality key stays — so a genuinely new
instance of an identical error class at a different line still
surfaces as new.
Pieces:
- agent/lsp/range_shift.py — build_line_shift, shift_diagnostic_range,
shift_baseline. Pure functions, no LSP state.
- agent/lsp/manager.py — LSPService.get_diagnostics_sync gains an
optional line_shift kwarg; baseline is shift_baseline'd before
computing the seen-set. _diag_key keeps the strict range key.
- tools/file_operations.py — write_file captures pre_content for any
LSP-handled extension (not just LINTERS_INPROC) and passes pre/post
to _maybe_lsp_diagnostics, which builds the shift map.
- New _lsp_handles_extension helper guards the pre_content read.
Trade-offs preserved:
- Genuinely new same-class errors at different lines still surface
(content-only key would have swallowed them).
- Pre-existing errors at unshifted positions still get filtered
(covered by the strict-key path with no shift).
- Best-effort: when pre_content can't be captured (file didn't
exist, permissions), the unshifted comparison still catches
most pre-existing errors; the edge case it misses is a new file
with a non-empty baseline, which is structurally impossible.
Follow-up to snav's PR #25463 contribution: flip default to on, broaden
scope so backfill fires whenever require_mention gates the bot (not just
shared-session channels).
Why:
- The mention-gate creates a session-transcript gap regardless of whether
the channel is shared or per-user. In per-user sessions, Alice's session
is still missing other participants' messages and her own pre-mention
messages — backfill fills both gaps.
- Threads naturally scope to thread-only history because discord.py's
channel.history() on a thread returns only that thread's messages.
- DMs still skip — every DM triggers the bot, so the session transcript
is already complete.
Changes:
- hermes_cli/config.py: discord.history_backfill default → true
- gateway/platforms/discord.py: drop the _is_shared gate, keep _is_dm
skip and _needed_mention gate; env var DISCORD_HISTORY_BACKFILL
default → 'true'
- cli-config.yaml.example + website docs: update defaults and prose;
add the DISCORD_HISTORY_BACKFILL / _LIMIT env var rows that were
documented in the PR description but missing from the env-var table
- tests/gateway/test_discord_free_response.py:
- flip test_discord_per_user_channel_does_not_backfill →
test_discord_per_user_channel_backfills_too (new behavior)
- add test_discord_dm_does_not_backfill (DM skip is invariant)
- give FakeThread a no-op history() so existing thread tests don't hit
a fake discord.Forbidden when backfill now fires on threads too
Tests: 160/160 in target files; 400/400 across all tests/gateway/ -k discord.
Adds optional channel-context backfill for Discord shared-channel sessions
so the agent can see recent messages it missed between its own turns
(typically when require_mention=true filters out most traffic).
Previously the agent only saw the @mention message that triggered it, which
led to disorienting replies in active multi-user channels where the
conversation context was invisible. With backfill enabled, a configurable
number of recent messages are fetched per-turn and prepended to the trigger
message as a context block, kept separate from sender-prefix logic so
attribution remains clean.
This re-opens the work from #13063 (approved by @OutThisLife on 2026-04-20,
closed when I closed the branch to address the simpolism:main head-branch
issue plus an ordering bug I caught later in live use). Filing against the
freshly-rewritten problem statement in #13054 so the design is grounded in
the failure mode rather than the implementation shape.
The implementation follows the **push-mode last-self-anchored** design from
the two options laid out in #13054. See the issue for the trade-off
discussion vs pull-mode (#13120 was an earlier closed PR using that shape).
Treating this as a reference implementation — happy to rewrite as
last-trigger anchoring or as a hybrid with #13120 if maintainers prefer.
Changes:
- gateway/platforms/discord.py:
- new `_discord_history_backfill()` / `_discord_history_backfill_limit()`
helpers (config.extra > env > default), mirroring the existing
`_discord_require_mention()` shape
- new `_fetch_channel_context()` that scans `channel.history()` backwards
from the trigger to the bot's last message (or limit), formats as
`[Recent channel messages] / [name] msg / ...`, respects DISCORD_ALLOW_BOTS,
skips system messages
- per-channel `_last_self_message_id` cache to narrow the fetch window
on hot paths (avoids full history scan when the bot has spoken recently)
- **IMPORTANT**: passes `oldest_first=False` explicitly to `channel.history()`.
discord.py 2.x silently flips the default to True when `after=` is supplied,
which would select the EARLIEST N messages after our last response instead
of the LATEST N before the trigger. In high-traffic windows this would
return stale tool traces and drop the actual final answer the user is
asking about. See regression test below. Caught in live use during a
Codex tool-trace burst on May 13 2026.
- gateway/config.py: discord_history_backfill + discord_history_backfill_limit
settings + yaml→env bridge
- gateway/platforms/base.py: channel_context field on MessageEvent
- gateway/run.py: prepend channel_context after sender-prefix so the
[sender name] tag applies to the trigger message alone, not to the backfill
- hermes_cli/config.py: defaults for new discord.history_backfill and
discord.history_backfill_limit keys
- cli-config.yaml.example: documented defaults
- tests/gateway/test_discord_free_response.py: 7 new tests covering
cold-start backfill, self-message stop boundary, other-bot filtering,
cache hot-path narrowing, stale-cache fallback, shared-channel +
per-user backfill paths, and the ordering regression test
(`test_fetch_channel_context_cache_uses_latest_window_when_after_set`)
- tests/gateway/test_config.py: yaml→env bridge tests
- tests/gateway/test_session.py: prefix-order edge cases
- website/docs/user-guide/messaging/discord.md: env vars + config keys +
usage docs
Tested on Ubuntu 24.04 — empirically validated in my own multi-bot Discord
research server for the past three weeks.
Fixes#13054
Supersedes #13063 (closed)
Adds 'hermes proxy start' — a local HTTP server that lets external apps
(OpenViking, Karakeep, Open WebUI, ...) use a Hermes-managed provider
subscription as their LLM endpoint. The proxy attaches the user's real
OAuth-resolved credentials to each forwarded request, refreshing them
automatically; the client can send any bearer (it gets stripped).
Ships with one adapter — Nous Portal. The UpstreamAdapter ABC and
registry in hermes_cli/proxy/adapters/ are designed for additional
OAuth providers to plug in by name without server changes.
Commands:
hermes proxy start [--provider nous] [--host 127.0.0.1] [--port 8645]
hermes proxy status
hermes proxy providers
Allowed Portal paths: /v1/chat/completions, /v1/completions,
/v1/embeddings, /v1/models. Anything else returns 404 with a clear
error pointing at the allowed list.
aiohttp is gated like gateway/platforms/api_server.py (try-import,
clean runtime error if missing). No new core dependency.
Tests: 24 unit tests + 1 separate E2E that spawns the real subprocess
and verifies the upstream receives the right bearer with the client's
header stripped.
When the terminal shrinks, already-printed box-drawing rules (response,
reasoning, streaming TTS, background-task Panels) reflow into multiple
narrower rows — visible as duplicated horizontal separators / ghost
lines in scrollback. Similarly, prompt_toolkit redraws a fresh status
bar on SIGWINCH on top of one the terminal just reflowed, producing
double-bar artifacts on column shrink.
Two surgical changes:
1. Decorative scrollback boxes now use a new
`HermesCLI._scrollback_box_width()` helper that clamps to
`max(32, min(width, 56))`. The live TUI footer is unaffected and still
uses the full width. Covers: streaming response box (open + close),
reasoning box (open + close, both streaming and post-stream paths),
streaming-TTS box close, final-response Rich Panel, and the
background-task Rich Panel.
2. `_recover_after_resize()` now also sets a new
`_status_bar_suppressed_after_resize` flag so the dynamic status bar
and both input separator rules stay hidden until the next user input.
The flag is cleared in the process loop the moment the user submits
their next prompt, restoring chrome cleanly.
Tests:
- New `test_input_rules_hide_after_resize_until_next_input` covers the
flag's effect on rule heights.
- New `test_scrollback_box_width_caps_to_resize_safe_value` covers the
helper at floor / cap / mid-range / overflow.
- Existing resize-recovery test extended to assert the flag flips.
Refs: #18449#19280#22976
Salvage of #24403.
Co-authored-by: Szymonclawd <szymonclawd@mac.home>
When codex app-server fails outside the OAuth-classified path
(non-auth turn/start errors, plain TimeoutErrors, generic turn-ended
status, subprocess silently exits, hard deadline timeout), the user
got a bare 'Internal error' / 'turn/start failed: ...' with no
context. Diagnosing config/provider/auth-bridge issues forced a
re-run with verbose codex flags.
Add a _format_error_with_stderr helper that appends the last few
stderr lines via agent.redact.redact_sensitive_text(force=True),
and use it at every catch-all error site:
- ensure_started() failures (codex init / thread/start) now return
a TurnResult.error with should_retire=True instead of bubbling
- non-OAuth turn/start CodexAppServerError / TimeoutError
- subprocess-died branch (previously dumped raw stderr_blob[-300:]
with no redaction — a leak risk)
- turn ended with non-completed status
- hard turn-timeout deadline
OAuth-classified failures and the post-tool quiet watchdog already
produce clean hints and stay unchanged. The redactor catches sk-*,
gh*_*, Authorization: Bearer, query-string tokens, JWTs, private
keys, etc., so provider error payloads can't leak into chat output
or trajectories.
Inspired by openclaw#80718, adapted for our app-server transport.
When the stream consumer's got_done handler successfully delivers the
final response content via _send_or_edit but the subsequent edit
(e.g. cursor removal) fails, final_response_sent remains False even
though the user has already received the final answer. The gateway's
fallback send path then re-delivers the same content, causing the
user to see the response twice on Telegram.
Introduce a new _final_content_delivered flag on the stream consumer,
set by the got_done handler when the final content has reached the
user. The _run_agent suppression logic now treats this flag as an
additional signal (alongside final_response_sent and
response_previewed) that final delivery is already complete.
This preserves the existing behavior for intermediate-text-only
streams (where already_sent=True but no final content has been
delivered) — those still receive the gateway's fallback send, matching
the test expectation in test_partial_stream_output_does_not_set_already_sent.
Adds TestFinalContentDeliveredSuppression with two cases covering
both the suppression (content delivered + edit failed) and the
non-suppression (intermediate text only) branches.
`hermes config set gateway.streaming.*` writes the streaming block
nested under a `gateway:` key in config.yaml, but the config loader
only checked for a top-level `streaming:` key — silently ignoring
the nested variant.
Fall back to `yaml_cfg['gateway']['streaming']` when the top-level
key is absent, matching the pattern already used for other nested
config sections.
Closes#25676
WhatsApp pseudo-chats (Status updates / Stories, Channels / Newsletters,
broadcast lists) were being routed through the full agent pipeline. A
user's gateway.log showed the agent replying to a contact's Story
('status@broadcast') with 345 chars plus title-generation cost, which
also shows up in the contact's status feed.
Drop these JIDs at _should_process_message() before the policy gate so
they're filtered regardless of dm_policy or allowlist state. Covers:
- status@broadcast (Stories)
- *@newsletter (Channels)
- *@broadcast (broadcast lists, future-proofing)
The bridge.js already filters these on the fromMe outbound path, but
inbound events on self-chat mode skipped that check.
Tests:
- status@broadcast dropped on open policy
- broadcast filter wins over allowlisted senders
- real DMs still pass through
- helper unit cases (case-insensitive, whitespace-tolerant)
26/26 tests/gateway/test_whatsapp_group_gating.py pass; 59/59 adjacent
WhatsApp test suites pass.
The Debian/Ubuntu branch of install_node_deps() ran 'npx playwright install
--with-deps chromium' unconditionally. Playwright invokes sudo interactively
to apt-install Chromium's system libraries, which blocks the installer for
non-sudo users (systemd service accounts, unprivileged operator users) on
an unsatisfiable password prompt.
Changes:
- install.sh: gate --with-deps behind a sudo capability check on the apt
branch (matches the existing Arch/pacman branch pattern). Non-sudo users
fall back to 'npx playwright install chromium' alone and the installer
prints the exact 'sudo npx playwright install-deps chromium' command an
administrator can run separately.
- install.sh: add --skip-browser (alias --no-playwright) to skip the
Playwright step entirely for headless installs that don't need browser
automation. Mirrors the existing --no-venv / --skip-setup shape.
- installation.md: add a 'Non-Sudo / System Service User Installs' section
covering the admin/service-user split, the --skip-browser flag, and the
~/.local/bin PATH gotcha (the root cause of the 'No module named dotenv'
error users hit when running the repo source 'hermes' script with system
Python instead of the venv launcher).
- test_install_sh_browser_install.py: regression coverage for the
--skip-browser flag and the sudo-gate on the apt branch.
Reported by @ssilver in Discord.
_make_stream_chunk built delta_kwargs with only `role`, so a reasoning-only
chunk produced a SimpleNamespace without a `.content` attribute. Downstream
consumers that read `delta.content` then raised AttributeError on Gemini 2.5
Flash, where the thinking delta arrives before any content delta.
Seed `content`, `tool_calls`, `reasoning`, and `reasoning_content` as None
up front, matching the pattern already used in gemini_native_adapter.py.
Key-present arguments still override the defaults.
Fixes#24974
References: Related open PR #24984 (luyao618) applies the same 1-line fix; this PR adds a regression test that #24984 omits
Co-Authored-By: Claude <noreply@anthropic.com>
Pyproject's [all] extra was slimmed down in May 2026 — ~20 optional
backends moved to tools/lazy_deps.py and only install on first use.
hermes update runs uv pip install -e .[all] which doesn't touch any of
them, so pin bumps in LAZY_DEPS (CVE response, transitive fixes) were
silently ignored on already-activated backends.
Two changes:
1. _is_satisfied() now parses the spec and checks the installed version
against the constraint via packaging.specifiers. Previously it
returned True the moment the package name was importable, which made
ensure() a name-presence gate rather than a version-pin gate.
2. New active_features() / refresh_active_features() pair: lists every
feature with at least one of its packages currently installed, then
re-runs ensure() on each. Refresh is invoked at the end of
_cmd_update_impl, right after the [all] install completes. Cold
backends (never activated) stay quiet — no churn for them.
Output during update is one summary block:
→ Refreshing 4 active lazy backend(s)...
↑ 1 refreshed: provider.anthropic
✓ 3 already current
or
⚠ memory.honcho failed to refresh: <pip stderr>
Failures never raise out of update — backends keep their previously-
installed version and we tell the user to rerun once upstream is fixed.
security.allow_lazy_installs=false is honored: features get marked
"skipped" with the reason shown.
Tests: 18 new unit tests covering version-aware satisfaction (exact pin,
range, extras blocks, missing package, malformed spec), active feature
discovery, and refresh status reporting. All 61 lazy_deps tests pass.
Adds regression tests pinning web search into the WhatsApp and api-server
default platform-coverage toolsets. Pure test additions, no runtime change.
Salvage of the test-addition commit from #25692 by @wesleysimplicio.
(The AUTHOR_MAP fixup commit from the same PR landed separately as
529ec85c7.)
Mirrors openclaw beta.8's app-server resilience fixes so a stuck codex
subprocess can't burn the full turn deadline and so users get a
`codex login` pointer instead of raw RPC errors when their token expires.
- TurnResult.should_retire signals the caller to drop+respawn codex.
- Deadline-hit path and dead-subprocess detection set should_retire so
the next turn doesn't ride a CPU-spinning or auth-broken process.
- Post-tool watchdog (post_tool_quiet_timeout=90s): if a tool item
completes and codex goes silent past the threshold without further
output or turn/completed, fast-fail instead of waiting the full 600s.
Resets on any non-tool activity so normal think-after-tool flows are
not affected.
- <turn_aborted> and <turn_aborted/> in agent text are treated as
terminal — some codex builds tear down a turn that way without
emitting turn/completed.
- _classify_oauth_failure() inspects RPC error message + stderr tail
for invalid_grant / token refresh / 401 / etc. and rewrites
user-facing errors to 'run codex login'. Conservative: generic
failures still surface verbatim. Fires at turn/start failure,
turn/completed failure, and dead-subprocess paths.
- thread/start cross-fill: tolerate thread.id, thread.sessionId,
top-level sessionId/threadId so future codex schema drift doesn't
KeyError us at handshake.
- run_agent.py: when run_turn returns should_retire=True OR raises,
close + null self._codex_session so the next turn respawns.
Tests: +30 cases across session + integration suites.
tests/agent/transports/test_codex_app_server_session.py 50/50 pass
tests/run_agent/test_codex_app_server_integration.py 27/27 pass
Broader codex scope (transports + cli runtime/migration) 376/376 pass
The cherry-picked PR over-indented the edit_message_text block for
the mm: (model selected → switch) success path so the confirmation
edit lived inside the preceding 'except Exception as exc' branch and
only fired when the callback raised. Dedent the try/except back to
12-space indent so it runs after the callback succeeds, restoring
the original flow that removes the inline buttons and shows the
'Switched to ...' confirmation.
Add a regression test (test_model_selected_edits_message_on_success)
that asserts edit_message_text is awaited and the result text is
routed through format_message (MARKDOWN_V2 + backtick survival).
Add phuongvm to scripts/release.py AUTHOR_MAP.
Use MarkdownV2 formatting for Telegram callback follow-ups and interactive prompts where dynamic names or user text can break legacy Markdown parsing. Add regression coverage for reload-mcp, model picker, approval callbacks, and update prompts.
Brings Discord to parity with Telegram on the clarify tool's interactive
UX. Overrides BasePlatformAdapter.send_clarify on DiscordAdapter to attach
a button view when choices are present.
- ClarifyChoiceView: one discord.ui.Button per choice (max 24, Discord's
25-component view cap leaves one slot for Other) plus a final
'Other (type answer)' button.
- Numeric click -> tools.clarify_gateway.resolve_gateway_clarify(
clarify_id, choice_text) using the canonical choice text from the
gateway entry (falls back to the button label if the entry vanished).
- Other click -> tools.clarify_gateway.mark_awaiting_text(clarify_id) so
the gateway's text-intercept captures the next user message in this
session as the response.
- Auth via the shared _component_check_auth helper (same OR-semantics as
ExecApprovalView / SlashConfirmView / UpdatePromptView / ModelPickerView).
- Open-ended (no choices) path renders the prompt as a plain embed and
relies on the existing text-intercept resolution.
- Single-use: first valid click disables every button and updates the
embed footer with who answered and what they chose.
No changes to BasePlatformAdapter.send_clarify or the gateway's
clarify_callback wiring -- the existing scaffolding already drives all
adapters; Discord just inherits the default text fallback today and gains
buttons by virtue of this override.
Test conftest extended: _FakeEmbed gains add_field() / set_footer() stubs
so tests can construct embedded views without monkey-patching per-test.
Original PR: #19249 by @LeonSGP43. This is a reshape of the contributor's
work onto current main's clarify infrastructure (clarify_id + entry-based
resolution shared with Telegram, instead of a parallel on_answer-closure
mechanism). The button view structure and UX shape are preserved.
Tests: 14 new tests in tests/gateway/test_discord_clarify_buttons.py.
391/391 existing Discord gateway tests still pass.
Co-authored-by: LeonSGP43 <cine.dreamer.one@gmail.com>
setup_path() writes the user-facing hermes shim with `cat >`, which
follows existing symlinks. Older installs created
`$command_link_dir/hermes` as a symlink to `$HERMES_BIN`
(`venv/bin/hermes`), so re-running install.sh stomped the pip entry
point with a bash shim that exec'd itself in an infinite loop.
`rm -f` the link target before writing so the shim lands at
`$command_link_dir/hermes` and the venv entry point is left intact.
Adds a regression test that reproduces the symlink-stomp end-to-end
(creates the symlink, drives the real shim-write block from setup_path,
asserts the venv pip script body survives and the shim is now a regular
file). Both new assertions fail on origin/main and pass with the fix.
Closes#21454.
Follow-up to Alex-wuhu's NovitaAI provider commit. Adds:
- _pricing_cache hit/write in _fetch_novita_pricing (was missing — every
pricing fetch was re-hitting the network), mirroring the
fetch_ai_gateway_pricing pattern. force_refresh now also propagates
from get_pricing_for_provider.
- TestNovitaProvider in tests/hermes_cli/test_api_key_providers.py
covering profile load, alias resolution, registry auto-registration,
model list parity between main.py and models.py, _URL_TO_PROVIDER,
_PROVIDER_PREFIXES, context_size in _CONTEXT_LENGTH_KEYS, pricing
unit conversion, and pricing cache behavior.
- AUTHOR_MAP entry for yanglongwei06@gmail.com → @Alex-yang00.
Previously ACP dangerous-command approvals mixed an invalid ACP
payload shape with partial Hermes option mapping, and the callback
plumbing was shared across worker threads. This commit uses ACP
tool-call updates, preserves Hermes once/session/always semantics,
and scopes approval callbacks to the current worker thread.
- Build permission requests with `update_tool_call` and unique
`perm-check-*` ids in `acp_adapter/permissions.py`
- Keep ACP option mapping explicit and fail closed on unknown outcomes
or request failures
- Set approval callbacks inside the ACP executor worker and read them
from thread-local state in `tools/terminal_tool.py`
- Replace duplicated ACP bridge coverage with focused tests in
`tests/acp/test_permissions.py` and add a thread-local callback test
The salvaged regression test called skin.get_spinner_list() which
doesn't exist on SkinConfig. Replace with direct dict access on
skin.spinner — same intent (verify default empty spinner is preserved
when user override is invalid).
* feat(goals): /subgoal — user-added criteria appended to active /goal
Layers a /subgoal command on top of the existing freeform Ralph judge
loop. The user can append extra criteria mid-loop; the judge factors
them into its done/continue verdict and the continuation prompt
surfaces them to the agent. No new tool, no agent self-judging — the
existing judge model just sees a richer prompt.
Forms:
/subgoal show current subgoals
/subgoal <text> append a criterion
/subgoal remove <n> drop subgoal n (1-based)
/subgoal clear wipe all subgoals
How it integrates:
- GoalState gains `subgoals: List[str]` (default []), backwards-compat
for existing state_meta rows.
- judge_goal accepts an optional subgoals kwarg; non-empty switches to
JUDGE_USER_PROMPT_WITH_SUBGOALS_TEMPLATE which lists them as
numbered criteria and asks 'is the goal AND every additional
criterion satisfied?'
- next_continuation_prompt picks CONTINUATION_PROMPT_WITH_SUBGOALS_TEMPLATE
when non-empty so the agent sees what to target.
- /subgoal is allowed mid-run on the gateway since it only touches the
state the judge reads at turn boundary — no race with the running
turn.
- Status line shows '... , N subgoals' when present.
Surface:
- hermes_cli/goals.py — field, prompt blocks, manager methods, judge weave
- hermes_cli/commands.py — /subgoal CommandDef
- cli.py — _handle_subgoal_command
- gateway/run.py — _handle_subgoal_command + mid-run dispatch
- tests/hermes_cli/test_goals.py — 15 new tests (backcompat, mutation,
persistence, prompt template selection, judge-prompt content via mock,
status-line rendering)
77 goal-related tests passing across goals + cli + gateway + tui.
* fix(goals): slash commands don't preempt the goal-continuation hook
Two findings from live-testing /subgoal:
1. Slash commands queued while the agent is running landed in
_pending_input (same queue as real user messages). The goal hook's
'is a real user message pending?' check returned True and silently
skipped — but the slash command consumes its queue slot via
process_command() which never re-fires the goal hook, so the loop
stalls indefinitely. Now the hook peeks the queue and only defers
when a non-slash payload is present.
2. The with-subgoals judge prompt was too soft — opus 4.7 said 'done,
implying all requirements met' without verifying. Tightened to
demand specific per-criterion evidence (file contents, output line,
command result) and explicitly reject phrases like 'implying it was
done.'
Live verified: /subgoal injected mid-loop now correctly forces the
judge to refuse done until the new criterion is met. Agent gets the
continuation prompt with subgoals listed, updates the script, judge
confirms done with specific evidence cited.
The cherry-picked tests from #6173 set HERMES_HOME outside Path.home()/.hermes,
which forces get_default_hermes_root() down its Docker branch and returns
HERMES_HOME directly — so _get_default_hermes_home() never resolves to the
~/.hermes directory the tests were trying to assert about.
Rewire both tests to use the real profile layout (HERMES_HOME pointing at
~/.hermes/profiles/<name>) so _get_default_hermes_home() resolves back to
~/.hermes and the default-profile fallback is actually exercised.
Surfaced by local E2E behavior-parity testing of PR vs origin/main: the
plugin-migrated dispatchers were quietly changing the error envelope
shape returned to function-calling models on unconfigured systems.
Two findings, both from per-result error wrapping bleeding into the
pre-flight configuration error path:
1. **search**: ``firecrawl.search()`` caught the
``ValueError("Web tools are not configured...")`` from
``_get_firecrawl_client()`` and returned it as
``{"success": False, "error": ...}``, losing the legacy
``{"error": "Error searching web: ..."}`` envelope that
``tool_error()`` emits on main. Models that special-case the
``error`` key still detect the failure, but the prefix is part of
the legacy contract some users rely on.
2. **crawl**: ``firecrawl.crawl()`` caught the same pre-flight
``ValueError`` and wrapped it as a per-page error inside
``results[0]``. Main short-circuits on ``check_firecrawl_api_key()``
BEFORE dispatching, so its unconfigured response is
``{"success": False, "error": "web_crawl requires Firecrawl..."}``
at the top level. The PR's per-page burying hid the failure inside
``results[]`` where models that check ``result.get("error")`` would
miss it.
Fix:
- ``plugins/web/firecrawl/provider.py``: pull
``_get_firecrawl_client()`` outside the broad ``try`` in
``search()``. Pre-flight ``ValueError`` / ``ImportError`` propagate
to the dispatcher's top-level exception handler. In-flight SDK
errors still get wrapped as ``{"success": False, ...}``.
- ``tools/web_tools.py``: mirror main's upstream availability gate in
``web_crawl_tool``. When the resolved crawl provider is
``is_available()==False``, short-circuit BEFORE dispatching with the
same top-level error shape main emits.
- ``tests/tools/test_web_providers.py``: 2 regression tests
(``TestUnconfiguredErrorEnvelopeParity``) lock in the behavior so
future plugin work can't undo this.
Verified via local subprocess-based parity test (14/14 scenarios match
origin/main shape exactly) and full 210/210 web test suite green.
The web-provider migration originally left firecrawl crawl as the only
provider-specific code remaining inline in tools/web_tools.py (~250
lines of Firecrawl-specific crawl orchestration that didn't fit the
plugin's existing surface). This commit closes that gap.
What this adds
--------------
1. plugins/web/firecrawl/provider.py: implement async ``crawl(url, **kwargs)``
- Accepts the same kwargs as the dispatcher passes to any crawl
provider (``instructions``, ``depth``, ``limit``); Firecrawl's
/crawl endpoint ignores ``instructions`` and ``depth`` so we log
and drop with a clear info message.
- Wraps the sync SDK ``crawl()`` call in asyncio.to_thread so the
gateway event loop isn't blocked on a multi-page crawl.
- Preserves the response-shape normalization across pydantic /
typed-object / dict variants that the legacy inline code did.
- Preserves per-page website-policy re-check (catches blocked
redirects after the SDK returns).
- Returns the same {"results": [...]} shape so the dispatcher's
shared LLM-summarization post-processing path works unchanged.
- Sets supports_crawl() to True so the dispatcher routes through
the plugin instead of the legacy fallthrough.
2. tools/web_tools.py: delete the entire legacy firecrawl crawl block
that used to run after "No registered provider supports crawl" —
~270 lines including:
- check_firecrawl_api_key gate + typed error
- inline SSRF + website-policy seed-URL gate (dispatcher already
does this)
- Firecrawl client setup with crawl_params
- 100+ lines of pydantic/dict/typed-object normalization
- Per-page LLM-processing loop (kept in the dispatcher's shared
post-processing path; that's where it always belonged)
- trimming + base64 image cleanup (still done in the dispatcher's
shared path)
Replaced with a single typed-error branch when no crawl-capable
provider is available: "web_crawl has no available backend. Set
FIRECRAWL_API_KEY (or FIRECRAWL_API_URL for self-hosted), or set
TAVILY_API_KEY for Tavily."
Test updates
------------
- tests/tools/test_website_policy.py:
- test_web_crawl_short_circuits_blocked_url: dispatcher seed-URL
gate still runs on web_tools.check_website_access (no change to
that patch), but the firecrawl client lockdown moved to the
plugin module — patch firecrawl_provider._get_firecrawl_client
instead of web_tools._get_firecrawl_client. The dispatcher
short-circuits before the plugin runs, so the test still passes.
- test_web_crawl_blocks_redirected_final_url: patch the per-page
policy gate at plugins.web.firecrawl.provider.check_website_access
(where it now runs) AND on web_tools (where the seed-URL gate
still runs). Patch firecrawl_provider._get_firecrawl_client for
the FakeCrawlClient injection. Both checks flow through the same
fake_check function.
- tests/plugins/web/test_web_search_provider_plugins.py:
- Update parametrized capability-flag spec: firecrawl supports_crawl
is now True.
- Add test_firecrawl_crawl_returns_error_dict_when_unconfigured —
verifies inspect.iscoroutinefunction(p.crawl) is True and that
the async crawl returns a per-page error dict (not a raise) when
FIRECRAWL_API_KEY is missing.
Verified
--------
- 218/218 web tests pass (was 173, +44 plugin tests + 1 new firecrawl
crawl test from this commit = 218 with the test deduplication).
- Compile-clean (py_compile passes on both files).
- Provider capabilities matrix confirmed end-to-end:
name search extract crawl async-extract? async-crawl?
firecrawl True True True True True
tavily True True True False False
Both crawl-capable providers exercise the dispatcher's
inspect.iscoroutinefunction async-or-sync detection.
Net diff
--------
- tools/web_tools.py: -254 lines (legacy inline crawl gone)
- plugins/web/firecrawl/provider.py: +185 lines (crawl method)
- test_website_policy.py: +14/-9 lines (patch locations)
- test_web_search_provider_plugins.py: +22/-1 lines (capability flag
+ new firecrawl crawl test)
- Total: -32 net LoC; tools/web_tools.py is now 1509 lines (was 1763
before this commit, 2227 before the migration started).
Adds 44 focused tests under tests/plugins/web/ covering the surface that
the PR #25182 web-provider migration introduced. Complements the
existing tests/tools/ coverage which is dispatcher-centric; this file is
plugin-centric and tests each plugin + the registry directly.
Test classes (44 tests, ~1.1s on 4 workers)
-------------------------------------------
TestBundledPluginsRegister (16 tests)
- All seven plugins present in the registry after
_ensure_plugins_discovered()
- Per-plugin parametrized capability-flag assertions
(brave-free / ddgs / searxng: search-only;
exa / parallel / firecrawl: search + extract;
tavily: search + extract + crawl)
- Every plugin exposes name + display_name properties
- Every plugin returns a picker-compatible get_setup_schema() dict
TestIsAvailable (7 tests)
- Each premium plugin reports is_available()==False when its env var is
absent and True once set (brave-free / searxng / tavily / exa /
parallel)
- firecrawl recognizes either FIRECRAWL_API_KEY or FIRECRAWL_API_URL
as a "configured" signal
- ddgs is the always-on fallback and must not raise from is_available()
TestRegistryResolution (4 tests)
- Option B semantics validated end-to-end:
1. Explicit configured provider wins even when is_available()==False
(dispatcher surfaces typed credential errors, no silent switch)
2. Unknown/typo name falls back to first available legacy-preference
provider
3. Asking for extract via a search-only backend falls back to an
extract-capable available provider (capability-incompatible
branch in _resolve())
4. No config + no credentials → None (or ddgs if installed)
TestAsyncExtractDispatch (4 tests)
- parallel + firecrawl extract() are coroutine functions (async path
in dispatcher uses await)
- exa + tavily extract() are sync (dispatcher wraps in
asyncio.to_thread)
TestErrorResponseShapes (7 tests)
- Plugins return typed error dicts (success=False + "error" key) when
credentials are missing, never raise
- async extract() returns list of per-URL error dicts
- tavily crawl() returns {"results": [{"error": ...}]} on missing
credentials
Design notes
------------
- All tests use real imports of plugin modules — no mocking of provider
classes themselves — so they catch drift in the ABC, registry, and
glue layer simultaneously. Per the hermes-agent-dev skill's E2E
testing guidance.
- The autouse _isolate_env fixture clears every web-provider env var
before each test so is_available() reflects the test's setup.
- Resolution tests use the lower-level _resolve() directly rather than
rebuilding the HERMES_HOME config dance — same observable behavior,
no sys.modules.pop side-effects that would break the ABC isinstance
check inside ctx.register_web_search_provider().
Removes the legacy in-tree provider scaffolding that PR #25182 fully
replaced with the plugin architecture:
tools/web_providers/__init__.py (6 lines)
tools/web_providers/base.py (89 lines — old ABCs)
tools/web_providers/ARCHITECTURE.md (73 lines — old design doc)
These were the staging-ground ABCs and provider modules that the
plugin migration absorbed. All seven web providers now implement the
single :class:`agent.web_search_provider.WebSearchProvider` ABC and
live under ``plugins/web/<vendor>/``. Nothing else in the tree imports
``tools.web_providers`` — verified via grep before deletion.
Test migration (tests/tools/test_web_providers.py)
--------------------------------------------------
Rewrote ``TestWebProviderABCs`` to test the new unified ABC at
:mod:`agent.web_search_provider`:
- test_cannot_instantiate_abc_directly — abstract ``name`` + ``is_available``
- test_concrete_search_only_provider_works — exercise default
``supports_extract=False`` / ``supports_crawl=False`` flags
- test_concrete_multi_capability_provider_works — exercise all three
capabilities, async extract supported (declared sync here for
simplicity; real plugins like parallel + firecrawl use async)
- test_search_only_provider_skips_extract_and_crawl — verify
``supports_*()`` flags default to False so search-only providers
don't have to implement extract() or crawl()
The 9 other tests in the file (per-capability backend selection,
DEFAULT_CONFIG merge, dispatcher routing) test public helpers in
``tools.web_tools`` that still exist and pass unchanged.
agent/web_search_provider.py docstring updated to reflect that the
legacy ABCs no longer exist; the response-shape contract is preserved
bit-for-bit so external consumers see no behavioral change.
Net diff
--------
- tools/web_providers/ removed (-168 lines)
- tests/tools/test_web_providers.py rewritten ABC section (+78/-30 net,
same coverage, new API)
- agent/web_search_provider.py docstring (-3/+5 lines)
Verified
--------
- 173/173 targeted web tests pass
- 12/12 ABC contract tests pass with the new interface
- No remaining grep hits for ``tools.web_providers`` outside of
intentional historical references in plugin docstrings.
Two regressions discovered by running the full tests/tools/ suite after
the dispatcher cutover, both fixed in this commit:
1. web_crawl_tool incorrectly errored "search-only" for firecrawl
---------------------------------------------------------------------
The cutover treated any provider with supports_crawl()==False as a
search-only backend and returned the typed search-only error. But
firecrawl can crawl via the legacy multi-page-extract path inside
web_crawl_tool — it just doesn't expose supports_crawl on the plugin
(adding native firecrawl crawl is a clean follow-up).
Fix: only emit the search-only error when the provider supports
NEITHER crawl NOR extract (brave-free / ddgs / searxng). When the
provider supports extract but not crawl (firecrawl), fall through to
the legacy firecrawl-via-extract path below.
2. firecrawl plugin's check_website_access wasn't patchable
---------------------------------------------------------------------
The plugin imported `from tools.website_policy import check_website_access`
INSIDE the extract() function body, so monkeypatching the name on
plugins.web.firecrawl.provider had no effect — the inner import re-bound
the name on every call.
Fix: hoist the import to module level. Cheap (website_policy itself
has no heavy deps) and makes the standard
monkeypatch.setattr(firecrawl_provider, "check_website_access", ...)
pattern work.
Test updates (tests/tools/test_website_policy.py — 4 tests):
- test_web_extract_short_circuits_blocked_url
- test_web_extract_blocks_redirected_final_url
Both: patch the gate at plugins.web.firecrawl.provider (where it
runs after migration) and force the firecrawl plugin to be the
active extract provider via FIRECRAWL_API_KEY.
- test_web_crawl_short_circuits_blocked_url
- test_web_crawl_blocks_redirected_final_url
Both: unchanged — the dispatcher-level gate at tools.web_tools.py
line 1651 still uses the imported `check_website_access` name and
the firecrawl-fallthrough path is exercised as before.
Verified: 22/22 tests/tools/test_website_policy.py pass.
Cuts over web_search_tool, web_extract_tool, and web_crawl_tool in
tools/web_tools.py to dispatch through agent.web_search_registry
instead of the legacy hardcoded if-elif backend chains.
Per-tool changes:
web_search_tool (sync)
Replace 5 backend branches (parallel, exa, registry-3-providers,
tavily, firecrawl-fallthrough) with a single registry path:
1. _get_search_backend() resolves the configured name
2. _wsp_get_provider(name) for explicit-config-wins semantics
3. get_active_search_provider() fallback for typo / unknown name
4. provider.search(query, limit) — sync for all 7 providers
web_extract_tool (async)
Replace 4 backend branches (parallel-async, exa-sync, tavily-sync,
search-only-error, firecrawl-perurl-loop) with:
1. Same provider resolution as search.
2. When configured backend IS registered but doesn't support
extract (search-only providers like brave-free), surface a
typed "search-only" error matching the legacy text — tests
assert that wording.
3. inspect.iscoroutinefunction(provider.extract) detects sync vs
async: parallel + firecrawl are async; exa + tavily are sync.
Sync extracts run in asyncio.to_thread() so we don't block.
web_crawl_tool (async)
Replace tavily-specific branch + search-only-error block with:
1. _wsp_get_provider(backend) — explicit config first
2. Search-only typed error when the configured name doesn't
support crawl (matches legacy phrasing)
3. get_active_crawl_provider() fallback otherwise
4. provider.crawl(url, **kwargs) — async-or-sync dispatch as above
5. Response post-processing (LLM summarization, trimming) stays
unchanged — it's not provider-specific.
When no plugin advertises supports_crawl, falls through to the
existing Firecrawl-via-web-summarize path below (unchanged).
Test updates (2 tests in tests/tools/test_web_tools_config.py):
- test_web_search_clamps_limit_before_backend_call:
patch("tools.web_tools._parallel_search") -> patch the registry
provider returned by agent.web_search_registry.get_provider
- test_search_error_response_does_not_expose_diagnostics:
patch("tools.web_tools._get_firecrawl_client") -> same pattern
Tests unchanged (still pass):
- All TestXBackendWiring classes (test _get_backend / _is_backend_available
config-resolution, independent of dispatch)
- All TestXSearchOnlyErrors classes (test the search-only error path
via web_extract_tool / web_crawl_tool — error text preserved)
- 141 passing web tests total, 0 regressions.
Dead-code cleanup deferred to a follow-up commit so this diff stays
focused on the cutover. After this commit:
- tools.web_tools._exa_search / _exa_extract / _parallel_search /
_parallel_extract / _tavily_request / _normalize_tavily_* /
_get_firecrawl_client / _extract_web_search_results /
_extract_scrape_payload / _to_plain_object / _normalize_result_list
are no longer called by the dispatchers, but still exist.
- The config-resolution layer (_get_backend, _is_backend_available,
_is_tool_gateway_ready, _has_direct_firecrawl_config) IS still in
use and must stay.
- The Firecrawl proxy and check_firecrawl_api_key are still imported
by integration tests and patched by unit tests — must stay (or be
re-exported from the plugin).
Deletes tools/web_providers/{brave_free,ddgs,searxng}.py — the three
providers that moved to plugins/web/ in prior commits. tools/web_tools.py
no longer imports them (registry dispatch as of d8735963f), so removing
them is purely a cleanup pass.
Also migrates the existing tests to the new import paths:
tests/tools/test_web_providers_brave_free.py
tests/tools/test_web_providers_ddgs.py
tests/tools/test_web_providers_searxng.py
Mechanical rewrites:
- `from tools.web_providers.X import YSearchProvider`
-> `from plugins.web.X.provider import YWebSearchProvider`
- `.is_configured()` -> `.is_available()` (legacy method -> new method)
- `.provider_name()` -> `.name` (legacy method -> new property)
- `from tools.web_providers.base import WebSearchProvider`
-> `from agent.web_search_provider import WebSearchProvider`
(the subclass-check asserts membership in the new plugin-facing ABC)
- `sys.modules.delitem("tools.web_providers.ddgs")` updated to point at
`plugins.web.ddgs.provider` (cache-busting for lazy ddgs imports)
The TestXBackendWiring / TestXSearchOnlyErrors classes (covering
_is_backend_available, _get_backend, check_web_api_key, and the
"search-only" error paths in web_extract/web_crawl) are untouched —
those still test web_tools.py's backend-selection logic, which continues
to recognize the names "brave-free" / "ddgs" / "searxng" even after the
modules behind them moved to plugins.
tools/web_providers/base.py is intentionally NOT deleted by this commit
— it's the parent ABC of the legacy modules and shares its name with
agent/web_search_provider.py::WebSearchProvider. Removing it surfaces the
naming collision (see PR description Finding 0); the real migration PR
deletes it in the same commit that drops the _WEB_PLUGIN_SKIPLIST
guards in hermes_cli/tools_config.py.
Test results:
bash scripts/run_tests.sh tests/tools/test_web_providers_*.py
-> 65 passed in 3.41s (all rewritten unit tests + unchanged integration tests)
bash scripts/run_tests.sh tests/tools/test_web_*.py
-> 141 passed in 4.70s (full web test set, post-deletion)
Three call-sites in the codebase each duplicated the same config-slice
+ list_authenticated_providers + post-processing pattern:
- hermes_cli/web_server.py /api/model/options
- tui_gateway/server.py model.options JSON-RPC
- tui_gateway/server.py model.save_key JSON-RPC
This consolidates them onto hermes_cli/inventory.py:
load_picker_context() -> ConfigContext
Replaces the 17-LOC config-slice (model.{default,name,provider,
base_url}, providers:, custom_providers:) every consumer did
inline.
ConfigContext.with_overrides(*, current_provider=, current_model=,
current_base_url=) -> ConfigContext
Truthy-only overlay for TUI agent-session state on top of disk
config. Empty getattr(agent, ...) attrs MUST NOT clobber disk.
build_models_payload(ctx, *, include_unconfigured, picker_hints,
canonical_order, max_models) -> dict
Single payload builder. Delegates curation to
list_authenticated_providers (does not call provider_model_ids
per row \u2014 that pulls non-agentic models). picker_hints +
canonical_order produce the TUI ModelPickerDialog shape;
defaults match the dashboard's existing /api/model/options
contract.
Two latent bugs fixed by consolidation:
1. The dashboard read cfg.get('custom_providers') directly, missing
the v12+ keyed providers: form. Now both surfaces go through
get_compatible_custom_providers().
2. The TUI's canonical-merge keyed on is_user_defined to decide order.
Section 3 of list_authenticated_providers sets is_user_defined=True
on rows from the providers: config dict even when the slug is
canonical \u2014 that silently demoted them to the picker tail.
_reorder_canonical now keys on slug membership instead.
Stats: +666 / -145 (net +521). Module 240 LOC; 18 behavior tests.
This PR replaces the rejected #23369 (which bundled the consolidation
with new scriptable CLI surfaces \u2014 hermes models list/status, hermes
providers list \u2014 and a JSON contract that have no external user
demand). Just the refactor; the CLI surface is deferred to a separate
PR gated on actual demand.
Refs #23359.
Follow-up on the salvaged feat commit:
- Keep the constructor / config / yaml-example default at 3 so existing
gateway and CLI users see no behavioural change. PR #13754 (which this
builds on) had lowered the default to 2 to chase pre-feature parity in
the system-prompt-present case, at the cost of quietly halving the
protected head for the gateway path (which strips the system prompt
before calling compress()). With the new "system prompt is implicit"
semantics, default 3 gives every caller a stable head shape.
- agent/context_engine.py: bring the ABC's protect_first_n docstring in
line with the new semantics so plugin context engines interpret the
config key the same way the built-in compressor does.
- tests: adjust the default-value test (3, not 2) and a stale comment;
per-test protect_first_n=2/3/1 values added in PR #13754 stay as-is
since those tests fix concrete head shapes.
The number of head messages preserved verbatim across context compactions
was previously hardcoded to 3 in AIAgent.__init__. Expose it as
`compression.protect_first_n` in config, matching the existing
`protect_last_n` pattern.
Motivation: users who rely on rolling compaction for long-running sessions
had the opening user/assistant exchange pinned as head forever, which
doesn't always match how they want the session framed after many
compactions. Lowering to 1 preserves the system prompt + first non-system
message; lowering to 0 preserves only the system prompt and lets the
entire first exchange age out naturally through the summary.
Semantics: `protect_first_n` counts non-system head messages protected
**in addition to** the system prompt, which is always implicitly protected
when present. Same meaning across both code paths:
protect_first_n=0 → system prompt only (or nothing if no system message)
protect_first_n=2 → system prompt + first 2 non-system messages (default)
This unifies the CLI path (which reads messages with the system prompt at
position 0) and the gateway path (where the gateway /compress handler
strips the system prompt before calling compress() — see
gateway/run.py L9150-9154 on the parent fork). Previously these two paths
disagreed:
CLI path: protect_first_n=1 → protect system prompt only
Gateway path: protect_first_n=1 → protect first USER turn forever
In practice on long-running gateway sessions the old semantics pinned
whatever stale aside happened to be the first user message, reinserting
it into every compaction summary indefinitely.
Default chosen as 2 (not 3) so that the effective protected head count
remains 3 messages in the common case — assuming a system prompt is
present, default protection becomes system + 2 non-system = 3 total,
matching the pre-feature behaviour where `protect_first_n` was hardcoded
to protect 3 messages total. Sessions without a system prompt will see a
small behaviour change (2 protected head messages instead of 3), but this
is the rare path and the new semantics make the system-prompt-present
case the well-defined one.
Changes:
- agent/context_compressor.py: redefine protect_first_n as the count of
non-system head messages protected beyond the implicit system-prompt
guarantee; both paths converge. Constructor default updated to 2.
- hermes_cli/config.py: add `compression.protect_first_n` default (2),
matching the new semantics. `show_config` label tweaked to
'Protect first: N non-system head messages' for clarity.
- run_agent.py: read protect_first_n from config; 0 is now valid (system
prompt is always implicitly protected).
- cli-config.yaml.example: document the new key and rationale.
- tests/agent/test_context_compressor.py: cover default, override, the
end-to-end `protect_first_n=0` and `protect_first_n=1` behaviour,
the no-system-prompt (gateway) path, and the new shared-semantics
regression test.
Fixes#13751
Tested on Ubuntu 24.04.