Rewrite TestDiscordMentions as negative assertions (mentions survive the
redactor) and clean up the orphaned comment + dangling whitespace left by
removing _DISCORD_MENTION_RE. Follow-up to the salvaged #32259 fix for #35611.
A handoff persisted under an older SUMMARY_PREFIX can be inherited into a
resumed lineage. _strip_summary_prefix only matched the current/legacy
literal, so on re-compaction the old 'resume exactly from Active Task'
directive stayed embedded in the body and kept hijacking replies to new,
unrelated user messages.
- Add _HISTORICAL_SUMMARY_PREFIXES (pre-#35344 prefix) and strip/recognize
them in _strip_summary_prefix + _is_context_summary_content so resumed
stale handoffs are re-normalized to the current latest-message-wins prefix.
- Reconcile the overlapping Active Task template edits from the salvaged
#26290 (reverse-signal cancellation) and #32787 (capture open questions /
decisions, don't write None too eagerly) — both intents kept.
- Regression coverage in tests/agent/test_resume_stale_active_task.py.
- AUTHOR_MAP entries for both salvaged contributors.
SUMMARY_PREFIX previously contained two contradictory directives:
1. "treat it as background reference, NOT as active instructions"
   "Do NOT answer questions or fulfill requests mentioned in this summary"
   "Respond ONLY to the latest user message that appears AFTER this summary"
2. "Your current task is identified in the '## Active Task' section of the
summary — resume exactly from there."
When the latest user message contradicted Active Task (e.g. 'stop the
i18n refactor', 'never mind, look at grafana instead'), models tended to
follow (2) anyway because 'resume exactly' is a strong, unambiguous
directive — leading to repeated re-surfacing of already-cancelled work
across turns, even after explicit 'stop'/'don't keep bringing that up'
messages from the user.
This change:
- Removes the conflicting 'resume exactly from Active Task' clause.
- Makes the precedence explicit: latest user message is the single source
of truth; it WINS on conflict; cancelled Active Task / In Progress /
Pending User Asks / Remaining Work must be discarded entirely (no
'wrap up the old task first').
- Names canonical reverse signals (stop, undo, roll back, never mind,
just verify, topic change) so the model recognizes them as cancellation
triggers, not background context.
- Updates the summarizer template instruction so the LLM doesn't
mechanically copy a cancelled task into Active Task on the next
compaction (it's instructed to copy the reverse signal verbatim).
- Preserves: REFERENCE ONLY framing, MEMORY.md/USER.md authority, and
the 'don't repeat work already reflected in session state' clause.
Adds tests/agent/test_summary_prefix_semantics.py to pin invariants so
the conflict can't regress.
Tested:
- All compaction tests pass: tests/agent/test_context_compressor.py,
tests/agent/test_context_compressor_summary_continuity.py,
tests/run_agent/test_413_compression.py,
tests/run_agent/test_compression_persistence.py,
tests/run_agent/test_compression_boundary_hook.py,
tests/cli/test_manual_compress.py — 117/117 passing.
- Tested on macOS.
Adds two real-client tests on top of the salvaged #34783 fix:
- config-less custom:<name> endpoint routes via the carried live base_url
(guards the #34777 symptom directly, not just the wiring)
- named custom:<name> WITH a config entry still resolves via the
named-custom branch (regression guard against collapsing to bare custom)
When a user configures a custom: provider (e.g. custom:openclaw-router),
set_runtime_main() only stored provider and model in process-local globals.
_resolve_auto() then had no base_url or api_key for the custom endpoint,
causing Step 1 to fail and auxiliary tasks (approval, compression, title
generation) to fall through to the aggregator chain and route to wrong
providers.
Fix: extend set_runtime_main() to accept base_url, api_key, and api_mode
keyword arguments; store them in new globals alongside the existing provider
and model; fall back to these globals in _resolve_auto() when the main_runtime
dict is empty. The call site in conversation_loop.py now passes all five
fields from the agent object.
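The fix can be sketched roughly as follows. The module-level dict, the public function names, and the example URL are illustrative assumptions; only the five fields and the "fall back when main_runtime is empty" rule come from the description above.

```python
from typing import Dict, Optional

# Process-local record of the main agent's runtime (layout is an assumption).
_runtime_globals: Dict[str, Optional[str]] = {
    "provider": None, "model": None,
    "base_url": None, "api_key": None, "api_mode": None,
}


def set_runtime_main(provider, model, *, base_url=None, api_key=None, api_mode=None):
    """Store all five fields so auxiliary routing can reuse the live endpoint."""
    _runtime_globals.update(
        provider=provider, model=model,
        base_url=base_url, api_key=api_key, api_mode=api_mode,
    )


def resolve_auto(main_runtime=None):
    """Prefer an explicit main_runtime dict; fall back to the stored globals."""
    if main_runtime:
        return dict(main_runtime)
    return dict(_runtime_globals)


set_runtime_main(
    "custom:openclaw-router", "example-model",
    base_url="http://localhost:8080/v1",  # hypothetical endpoint
    api_key="example-key", api_mode="chat_completions",
)
```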
Fixes #34777
These tests asserted that hardcoded curated model lists/constants still
contained specific model strings (e.g. 'glm-5' in provider_model_ids('zai'),
exact context-length values per model key, PROVIDER_TO_MODELS_DEV entries).
They mirror a constant rather than exercise logic, so they only ever break
when models are added/retired and never catch a real bug.
Removed 22 such functions across 7 files (149 deletions, 0 additions).
Behavioral siblings are kept: live-catalog-wins, fallback ordering,
substring/longest-match resolution, normalization, credential discovery,
and probe-tier stepping all still tested.
* fix(auxiliary): stop capping output with max_tokens by default
Auxiliary LLM calls (compression, titles, vision, etc.) no longer send
max_tokens on the OpenAI-compatible chat-completions path. Most providers
treat an omitted max_tokens as "use the model max", which is what we want;
an explicit cap only risks truncation or a wire-format 400.
This was surfaced by GitHub Copilot / GPT-5 (#34530): those models reject
max_tokens and require max_completion_tokens, so compression 400'd and fell
back to a static context marker. Omitting the param sidesteps that quirk
(and ZAI vision's error 1210) entirely.
The Anthropic Messages wire (MiniMax + /anthropic endpoints) keeps
max_tokens because it is a mandatory field there.
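As a rough illustration of the rule (not the actual request builder), the kwargs assembly might look like this. The `wire` labels and the 4096 default are assumptions made for the sketch.

```python
def build_aux_kwargs(model, messages, *, wire, max_tokens=None):
    """Assemble request kwargs for an auxiliary call.

    wire: "chat_completions" (OpenAI-compatible) or "anthropic_messages".
    Both labels are hypothetical names for the two wire formats.
    """
    kwargs = {"model": model, "messages": messages}
    if wire == "anthropic_messages":
        # Mandatory on the Anthropic Messages wire; 4096 is an assumed default.
        kwargs["max_tokens"] = max_tokens or 4096
    # On the OpenAI-compatible path the key is omitted entirely, so providers
    # that reject max_tokens never see it.
    return kwargs
```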
* test(auxiliary): update temperature-retry assertions for omitted max_tokens
The temperature-retry tests asserted retry_kwargs["max_tokens"] == 500 on an
api.openai.com endpoint. Now that auxiliary calls omit max_tokens on
OpenAI-compatible endpoints (#34530), that key is absent. Assert it's absent
in both first and retry kwargs and use model as the survives-the-retry witness.
Add the same managed-gateway UX that image_gen already has:
- TOOL_CATEGORIES['video_gen'] gets a 'Nous Subscription' provider row
with managed_nous_feature='video_gen' + video_gen_plugin_name='fal'
- NousSubscriptionFeatures gains a video_gen property + feature state
computation (managed/active/available using the fal-queue gateway)
- _GATEWAY_TOOL_LABELS, _GATEWAY_DIRECT_LABELS, _ALL_GATEWAY_KEYS,
_get_gateway_direct_credentials, opted_in all include video_gen
- apply_nous_managed_defaults and apply_gateway_defaults handle video_gen
- _is_toolset_satisfied checks Nous features for video_gen
- _is_provider_active detects managed video_gen (use_gateway + fal provider)
- _select_plugin_video_gen_provider accepts use_gateway kwarg, propagated
from all 4 call sites in _configure_provider when managed_feature is set
- hermes setup status shows 'Video Generation (FAL via Nous subscription)'
Users on a Nous subscription can now pick 'Nous Subscription' under
hermes tools → Video Generation, which sets video_gen.provider=fal +
video_gen.use_gateway=true. The FAL plugin's _resolve_managed_fal_video_gateway
then routes through the managed queue gateway — no FAL_KEY needed.
* fix(security): block AWS SDK creds from subprocess env
* fix(security): narrow Bedrock subprocess strip to inference bearer token only
Scopes the AWS_SDK subprocess strip down from the full AWS credential chain
to just AWS_BEARER_TOKEN_BEDROCK — the only Hermes-managed *inference* secret
(analogous to OPENAI_API_KEY). The general AWS credential chain
(AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN / AWS_PROFILE
/ config + role pointers) is intentionally left inheritable.
Why: per SECURITY.md §3.2 the local terminal is the user's trusted operator
shell. Hard-blocklisting the general chain would (a) regress *every* user who
runs aws/terraform/cdk/boto3 in the agent terminal — not just Bedrock users,
since PROVIDER_REGISTRY is iterated unconditionally at import — and (b) be
unrecoverable, because env_passthrough.py refuses to re-allow anything in
_HERMES_PROVIDER_ENV_BLOCKLIST (GHSA-rhgp-j443-p4rf). The narrow strip closes
the reported leak (opencode enumerating the Bedrock catalog off the leaked
bearer token) with no capability loss.
Keeps zapabob's self-healing auth_type=="aws_sdk" mechanism so any future
SDK-cred provider is covered automatically.
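A sketch of the narrowed strip, assuming the subprocess environment is built by filtering a copy of the parent environment. The function and constant names are hypothetical.

```python
# Only the Hermes-managed inference secret is stripped (name from this commit).
_STRIPPED_FROM_SUBPROCESS = frozenset({"AWS_BEARER_TOKEN_BEDROCK"})


def subprocess_env(parent_env):
    """Return a copy of parent_env without the inference bearer token.

    The general AWS credential chain is deliberately left inheritable so that
    aws/terraform/cdk/boto3 keep working in the agent terminal.
    """
    return {k: v for k, v in parent_env.items() if k not in _STRIPPED_FROM_SUBPROCESS}
```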
Tests: bearer token stripped + general chain preserved (no-regression guard),
on both the runtime strip path and the blocklist-membership path.
Co-authored-by: zapabob <1920071390@campus.ouj.ac.jp>
* feat: embedder environment-hint hook for the system prompt
Adds HERMES_ENVIRONMENT_HINT env var (and config.yaml agent.environment_hint)
so a host wrapping Hermes (sandbox runner, managed platform) can describe the
runtime environment — proxy, credential handling, mount layout — in the system
prompt's environment-hints block, without editing the identity slot (SOUL.md).
Read once at prompt-build time, so it lands in the stable, cache-safe portion
of the system prompt. Env var overrides the config key (build-time/container
mechanism). Empty by default — no behavior change for existing installs.
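The precedence can be sketched as below. The function name is hypothetical, and treating an empty env var as "unset" is an assumption of this sketch rather than documented behavior.

```python
def resolve_environment_hint(config, env):
    """Env var wins over config.yaml agent.environment_hint; default is empty."""
    hint = (env.get("HERMES_ENVIRONMENT_HINT") or "").strip()
    if hint:
        return hint
    agent_cfg = config.get("agent") or {}
    return str(agent_cfg.get("environment_hint") or "").strip()
```

Callers would pass `os.environ` and the loaded config once, at prompt-build time, so the result lands in the cached part of the prompt.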
---------
Co-authored-by: zapabob <1920071390@campus.ouj.ac.jp>
The chatgpt.com/backend-api/codex endpoint can spend tens of seconds in
backend admission / prompt prefill before emitting its first SSE event. The
12s no-byte TTFB cutoff aborted those still-valid streams, surfacing as
'Codex stream produced no bytes within 12s' through all retries (Discord
reports). The OpenAI SDK's own streaming read timeout is 600s, so 12s was
~50x more aggressive than the transport layer would have tolerated.
Default the no-byte cutoff to 120s and raise the openai-codex MAX cap default
to 120s so it no longer clamps the new default back to 20s. Disabling stays
available via HERMES_CODEX_TTFB_TIMEOUT_SECONDS=0; the 25k-token auto-disable,
_STRICT override, and post-first-event idle watchdog are unchanged.
Co-authored-by: Gille <4317663+helix4u@users.noreply.github.com>
A process running mismatched module versions — conversation_compression.py
re-imported with the post-#34351 lock code while a long-lived
hermes_state.SessionDB stays bound to the pre-#34351 class in memory — has
the try_acquire_compression_lock call site but not the method. The
AttributeError it raises is NOT a sqlite3.Error, so the method's own
fail-open guard never runs; the exception escapes to the outer agent loop,
which prints the error and retries. Compression never succeeds, the token
count never drops, and the loop re-triggers compaction forever (the
'API call #47/#48/#49 ... has no attribute try_acquire_compression_lock'
spin a user hit after an update).
Wrap the lock acquire so any unexpected exception fails OPEN: skip locking
and proceed with compression. Skipping the lock risks a rare
concurrent-compression session fork; an infinite no-progress loop that never
compresses at all is strictly worse. The remediation hint in the log points
at the real fix (restart / hermes update to resync the stale module).
Also guards get_compression_lock_holder against the same skew.
Adds a regression test simulating the version skew (real SessionDB wrapped
so only the lock methods raise AttributeError) — asserts _compress_context
proceeds and rotates instead of raising.
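The fail-open wrapper can be sketched like this. The wrapper name and the `(session_id, holder)` argument shape of `try_acquire_compression_lock` are assumptions; the point is only that any unexpected exception, including AttributeError from a stale class, lets compression proceed.

```python
import logging

logger = logging.getLogger(__name__)


def acquire_lock_fail_open(session_db, session_id, holder):
    """Return True when compression may proceed.

    A clean "lock is held" answer (False) is respected; anything unexpected
    skips locking instead of feeding the retry loop.
    """
    try:
        return bool(session_db.try_acquire_compression_lock(session_id, holder))
    except Exception as exc:  # includes AttributeError from version skew
        logger.warning(
            "compression lock unavailable (%s); proceeding without it. "
            "Restart or run 'hermes update' to resync modules.", exc,
        )
        return True
```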
OAuth auto-open only checked _is_remote_session() (SSH + cloud-shell env
vars). On a headless/CLI-only Linux box with no GUI browser, none of those
trip, so webbrowser.open() resolved to a console browser (w3m/lynx/links)
and launched it INSIDE the terminal — hijacking the user's TTY with the
xAI 'Account Management' login page instead of letting them copy the URL.
Add _can_open_graphical_browser(): returns False when webbrowser would
resolve to a known console browser, when $BROWSER names one, when there's
no display server on Linux, or when no browser resolves at all. Gate all 5
OAuth auto-open callsites (xAI loopback, Spotify loopback, MiniMax device
code, Anthropic, Google) on it in addition to the existing remote check.
Headless boxes now print the URL / fall through to manual-paste instead.
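A simplified, pure-function sketch of the check. It takes the environment, platform string, and already-resolved browser command as inputs instead of calling `webbrowser` itself, and the console-browser set only lists the three names mentioned above; the real helper may differ on all of these points.

```python
import os

# Console browsers named in this commit; the real list may be longer.
_CONSOLE_BROWSERS = frozenset({"w3m", "lynx", "links"})


def _command_name(command):
    """Lowercased executable name of a browser command line ('' if empty)."""
    parts = (command or "").split()
    return os.path.basename(parts[0]).lower() if parts else ""


def can_open_graphical_browser(env, platform, resolved_browser):
    """Best-effort guess at whether auto-opening a URL is safe for the TTY."""
    if not resolved_browser:
        return False  # no browser resolves at all
    if _command_name(resolved_browser) in _CONSOLE_BROWSERS:
        return False
    for candidate in env.get("BROWSER", "").split(os.pathsep):
        if _command_name(candidate) in _CONSOLE_BROWSERS:
            return False
    if platform.startswith("linux"):
        # Assumed display check: an X11 or Wayland variable must be set.
        return bool(env.get("DISPLAY") or env.get("WAYLAND_DISPLAY"))
    return True
```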
xAI returns HTTP 403 (not 401) with unauthenticated:bad-credentials
when an OAuth2 access token has expired or is invalid. The existing
_is_auth_error() only checked for 401 status codes, so these tokens
were never refreshed and the 403 propagated as a generic permission
denied error.
Three fixes:
1. _is_auth_error: Recognize xAI's 403+bad-credentials pattern as
an auth failure, triggering token refresh instead of silent failure.
2. _refresh_provider_credentials: Add xai-oauth branch with
pool-level refresh (try_refresh_current with select to ensure
current entry), then fall back to the singleton resolver with
force_refresh=True.
3. _recoverable_pool_provider: Map api.x.ai host to xai-oauth
pool for auto-resolved providers, matching existing pattern for
openai-codex/openrouter/nous/anthropic.
Includes 14 tests covering the new detection logic, host mapping,
and graceful fallback behavior.
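Fixes 1 and 3 can be sketched as below. Function names and the exact substring check are illustrative assumptions; the 403 plus bad-credentials pattern and the api.x.ai to xai-oauth mapping come from the description above.

```python
from urllib.parse import urlparse

# Host-to-pool map; only the api.x.ai entry is taken from this commit.
_HOST_TO_POOL = {"api.x.ai": "xai-oauth"}


def is_auth_error(status_code, body_text=""):
    """401 is always an auth error; 403 only with xAI's bad-credentials marker."""
    if status_code == 401:
        return True
    return status_code == 403 and "bad-credentials" in (body_text or "").lower()


def pool_for_base_url(base_url):
    """Map an auto-resolved endpoint to its credential pool, if any."""
    host = urlparse(base_url).hostname or ""
    return _HOST_TO_POOL.get(host.lower())
```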
Signed-off-by: moikapy <moikapy@devmoi.com>
When OpenAI Codex returns 401 token_invalidated or token_revoked, the
credential is broken upstream — retrying after a TTL cooldown cannot
fix it. The existing code treated every 401/429 the same way:
STATUS_EXHAUSTED with a TTL cooldown (5 min for 401, 1 hour for 429).
After the TTL elapsed, the broken credential re-entered rotation and
immediately failed again with the same 401, surfacing as 'Failed to
generate context summary' on every context-compression cycle.
The reporter observed 7 separate 401 token_invalidated failures from the
same revoked credential in a single day; the only workaround was
removing it manually via 'hermes auth'.
Add a STATUS_DEAD terminal state. Only 401 responses whose
error.code/reason matches a known terminal OAuth state (token_invalidated,
token_revoked, invalid_token, invalid_grant, unauthorized_client,
refresh_token_reused) transition to DEAD. Everything else keeps the
existing TTL semantics — 429 rate limits are transient and should
recover.
DEAD entries are excluded from rotation unconditionally. They only
clear when an explicit write-side re-auth sync rewrites the tokens
(the existing _sync_codex_pool_entries / _sync_*_entry_from_auth_store
paths already clear last_status to None). The read-side
auth.json-sync paths also now fire on DEAD so an in-flight pool entry
can adopt fresh tokens written by another process without needing
explicit re-auth.
After 24 hours, DEAD manual entries (source='manual:*') are pruned
from the pool automatically so dead state doesn't accumulate forever.
Singleton-seeded DEAD entries (source='device_code' etc.) are kept
because _seed_from_singletons would recreate them on the next load
with the same stale tokens — pruning would be pointless. The audit
trail stays visible (label, last_error_reason, timestamps).
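The state transition and pruning rule might be sketched as follows. The status string values and function names are assumptions; the terminal codes, the TTLs (5 min for 401, 1 hour for 429), and the 24-hour manual-only pruning come from the text above.

```python
STATUS_EXHAUSTED = "exhausted"  # string values are illustrative
STATUS_DEAD = "dead"

_TERMINAL_OAUTH_CODES = frozenset({
    "token_invalidated", "token_revoked", "invalid_token",
    "invalid_grant", "unauthorized_client", "refresh_token_reused",
})

_DEAD_PRUNE_SECONDS = 24 * 3600


def classify_failure(status_code, error_code):
    """Return (status, cooldown_seconds); DEAD entries have no cooldown."""
    if status_code == 401 and error_code in _TERMINAL_OAUTH_CODES:
        return STATUS_DEAD, None
    cooldown = 300 if status_code == 401 else 3600
    return STATUS_EXHAUSTED, cooldown


def should_prune(source, dead_since, now):
    """Only manual entries are pruned; seeded ones would just be recreated."""
    return source.startswith("manual:") and (now - dead_since) >= _DEAD_PRUNE_SECONDS
```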
Closes #32849.
Remove unused imports (F401) and duplicate/shadowed import
redefinitions (F811) across the codebase using ruff's safe
autofixes. No behavioral changes -- imports only.
- ~1400 safe autofixes applied across 644 files (net -1072 lines)
- __init__.py re-exports preserved (excluded from F401 removal so
public re-export surfaces stay intact)
- Re-exports that are imported or monkeypatched by tests but look
unused in their defining module are kept with explicit # noqa:
F401 (gateway/run.py load_dotenv; run_agent re-exports from
agent.message_sanitization, agent.context_compressor,
agent.retry_utils, agent.prompt_builder, agent.process_bootstrap,
agent.codex_responses_adapter)
- Unsafe F841 (unused-variable) fixes deliberately skipped -- those
can change behavior when the RHS has side effects
- ruff lints remain disabled in pyproject.toml (only PLW1514 is
selected); this is a one-time cleanup, not a config change
Verification:
- python -m compileall: clean
- pytest --collect-only: all 27161 tests collect (zero import errors)
- core entry points import clean (run_agent, model_tools, cli,
toolsets, hermes_state, batch_runner, gateway)
- static scan: every name any test imports directly from an edited
module still resolves
* fix(codex): surface error code in Responses 'failed' status errors
When a Codex Responses turn ends with status=failed, the response carries
the failure details under `response.error` as
`{code, message, param, ...}`. The previous extractor pulled only
`message`, so users seeing a rate-limit failure got a bare "Slow down"
string indistinguishable from a generic stream truncation; an
internal_error with empty message degraded to a dict dump
("{'code': 'internal_error', 'message': ''}").
Extract a `_format_responses_error()` helper that:
- prefixes `code` when both code and message are present
(e.g. 'rate_limit_exceeded: Slow down')
- falls back to the bare `code` when message is empty
- accepts both dict and attribute-style payloads (SDK and JSON-RPC paths)
- preserves the prior status-only fallback when no error payload exists
Apply the same helper at the sibling site in
`codex_app_server_session.run_turn()` so codex-CLI subprocess turn
failures get the same treatment.
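A minimal sketch of the formatting rules listed above. The status-only fallback string is a placeholder (the real wording is not shown in this commit), and treating non-string fields as empty is an assumption.

```python
from types import SimpleNamespace


def format_responses_error(error, status="failed"):
    """Render a Responses error payload as 'code: message' where possible."""
    fallback = f"Responses turn ended with status={status}"  # placeholder text
    if error is None:
        return fallback

    def field(name):
        # Accept both dict payloads (JSON-RPC) and attribute-style SDK objects.
        value = error.get(name) if isinstance(error, dict) else getattr(error, name, None)
        return value.strip() if isinstance(value, str) else ""

    code, message = field("code"), field("message")
    if code and message:
        return f"{code}: {message}"
    return code or message or fallback


dict_style = format_responses_error({"code": "rate_limit_exceeded", "message": "Slow down"})
attr_style = format_responses_error(SimpleNamespace(code="internal_error", message=""))
```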
Tests:
- 8 new unit tests for `_format_responses_error` covering both shapes,
empty/missing fields, non-string fields, and the status-only fallback.
- 2 regression tests on `_normalize_codex_response` for failed status
with and without a code, asserting the exact RuntimeError message.
- All 3603 tests in tests/agent/ pass.
Adapted from anomalyco/opencode#28757.
* feat(prompt): universal task-completion guidance + local Python toolchain probe
Two cross-model failure modes get a single-line answer in the cached
system prompt. Both are gated by config (default on), add zero overhead
when not needed, and were verified via real AIAgent prompt builds.
## What changed
`TASK_COMPLETION_GUIDANCE` — short prompt block applied to ALL models.
Targets two failure modes observed on a real Sarasota real-estate build
task: (1) Opus stopped after writing an 85-byte stub and gave a prose
response with finish_reason=stop on call #3 of 90; (2) DeepSeek pushed
through a PEP-668 wall, then returned fabricated listings instead of
admitting the blocker. Both behaviors are model-family-agnostic, so the
guidance lives outside the existing tool_use_enforcement gate (~192
tokens, paid once per session via prefix cache).
`tools/env_probe.py` — local Python toolchain probe. Detects
python3/pip/uv/PEP-668 state and emits ONE short line in the system
prompt when something is non-default. Emits NOTHING when the env is
clean (zero token cost for normal users). Skipped entirely for remote
terminal backends (docker/modal/ssh) — they have their own probe.
Example output on a broken environment (the actual case):
Python toolchain: python3=3.11.15 (no pip module),
python=missing (use python3), pip→python3.12 (mismatch),
PEP 668=yes (use venv or uv).
## Config
Both flags live under `agent.` in config.yaml, default True:
    agent:
      task_completion_guidance: true   # universal "finish the job" block
      environment_probe: true          # local Python toolchain hints
Neither addition required a `_config_version` bump — deep-merge fills
defaults in for existing user configs.
## Validation
| Test surface | Result |
|---|---|
| tests/tools/test_env_probe.py | 10/10 pass (probe unit) |
| tests/run_agent/test_run_agent.py — new classes | 8/8 pass (integration) |
| TestToolUseEnforcementConfig | 17/17 pass (no regression) |
| TestBuildSystemPrompt | 9/9 pass (no regression) |
| TestInvalidateSystemPrompt | 2/2 pass (no regression) |
| tests/agent/test_prompt_builder.py | 124/124 pass (no regression) |
| tests/hermes_cli/ | 5662/5662 pass (config defaults) |
| E2E AIAgent build (broken env) | Both blocks present, 2,178 chars |
| E2E AIAgent build (clean env) | 771-char net overhead, env probe silent |
* fix(compression): prevent session-id fork from concurrent compressions
When two AIAgent instances share the same session_id (most commonly the
parent-turn agent and its background-review fork, which inherits
session_id verbatim via background_review.py L451), both can call
compress_context() on overlapping snapshots of the same conversation.
Each ends the parent and creates its own NEW child session in state.db,
both parented to the same old id. The gateway SessionEntry only catches
one rotation; the other becomes an orphan that silently accumulates
writes — Damien's incident shape (parent 20260527_234659_e65f0e → two
children, only one visible).
Adds a state.db-backed per-session compression lock. Acquired before
the rotation in conversation_compression.compress_context(); on
failure, the caller returns messages unchanged so the auto-compress
retry loop stops cleanly. TTL (5min default) reclaims locks abandoned
by crashed compressors. Lock holder identity (pid:tid:agent:nonce) is
preserved for diagnostics via get_compression_lock_holder().
Schema bumped 13 -> 14 to track the new compression_locks table.
Reconciled additively via the existing declarative-column pattern;
no data migration needed for existing DBs.
Regression test reproduces Damien's shape: two threads racing
_compress_context on a shared parent_sid. Without the lock the test
deterministically produces 2 child sessions; with the lock, exactly 1.
Covers all six compression entry points (preflight in conversation_loop,
mid-turn fallback, hygiene compression in gateway, /compact, CLI
/compress, TUI /compress). ACP /compress was already protected by
nulling out _session_db before its compress call.
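One way such a lock could work on SQLite is sketched below. The table columns, function names, and the delete-then-insert approach are assumptions for illustration, not the schema or code added by this commit; only the table name, the 5-minute TTL, and the pid:tid:agent:nonce holder shape are taken from the text.

```python
import sqlite3
import time


def init_lock_table(conn):
    # Hypothetical column layout for the compression_locks table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS compression_locks ("
        "session_id TEXT PRIMARY KEY, holder TEXT NOT NULL, acquired_at REAL NOT NULL)"
    )


def try_acquire(conn, session_id, holder, ttl=300.0, now=None):
    """Return True if this caller now holds the lock for session_id."""
    now = time.time() if now is None else now
    with conn:  # one transaction: reclaim an expired lock, then try to insert
        conn.execute(
            "DELETE FROM compression_locks WHERE session_id = ? AND acquired_at < ?",
            (session_id, now - ttl),
        )
        cur = conn.execute(
            "INSERT OR IGNORE INTO compression_locks (session_id, holder, acquired_at) "
            "VALUES (?, ?, ?)",
            (session_id, holder, now),
        )
        return cur.rowcount == 1  # 0 means another holder's row already exists


def get_lock_holder(conn, session_id):
    row = conn.execute(
        "SELECT holder FROM compression_locks WHERE session_id = ?", (session_id,)
    ).fetchone()
    return row[0] if row else None
```

The primary key does the mutual exclusion; the TTL delete is what lets a crashed compressor's lock be reclaimed.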
* ci: trigger rerun (transient GitHub API rate limit on CodeQL workflow)
Kanban workers now scan the task body for local image paths and
http(s) image URLs and attach them to the worker's first user turn —
matching the CLI/gateway behaviour for inbound images. Before, a
user pasting `/home/me/screenshot.png` or `https://example.com/img.png`
into a kanban task description had it sent to the model as plain
text and the pixels were never seen.
How it works:
* agent/image_routing.py gains extract_image_refs(text) → (paths, urls)
that mirrors gateway/platforms/base.py:extract_local_files (absolute /
~-relative paths, image extensions only, ignores fenced/inline code).
* build_native_content_parts() accepts an optional image_urls= kwarg
and emits passthrough image_url parts for remote URLs alongside the
base64 data: URLs used for local paths.
* cli.py (single-query/quiet branch — the path every dispatcher-spawned
worker takes) detects HERMES_KANBAN_TASK, reads the task body via
kanban_db.get_task, runs extract_image_refs, and threads the results
into the existing image-routing decision (native vs text). Best-effort:
enrichment failures never block worker startup.
Tested:
* tests/agent/test_image_routing.py — 22 new tests for extract_image_refs
and URL pass-through in build_native_content_parts.
* tests/hermes_cli/test_kanban_worker_image_extraction.py — 10 new tests
driving real kanban_db round-trip (create task → read body → extract
refs → build parts).
* E2E: created a fake kanban task with a body referencing both a local
PNG and an https URL; verified the worker pipeline produces a
multimodal user turn with 1 text part + 2 image_url parts (data URL
for the local file, passthrough URL for the remote).
Adds an optional `messages` keyword to the `MemoryProvider.sync_turn`
contract so external/community memory plugins can receive the OpenAI-style
conversation message list for the completed turn — including assistant tool
calls and tool result content — not just the final assistant text.
Dispatch uses signature inspection (`_provider_sync_accepts_messages`): only
providers that declare a `messages` parameter (or `**kwargs`) receive it; all
existing in-tree providers keep their legacy text-only signature and are
called unchanged. No structured-trace envelope is added to core — providers
reconstruct whatever they need from the standard message list.
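The signature-inspection dispatch can be sketched as below. The positional `(user_text, assistant_text)` shape of `sync_turn` and the `dispatch_sync_turn` helper are assumptions for illustration; `inspect.signature` and `Parameter.VAR_KEYWORD` are standard library.

```python
import inspect


def provider_sync_accepts_messages(sync_turn):
    """True if sync_turn declares `messages` or a **kwargs catch-all."""
    try:
        params = inspect.signature(sync_turn).parameters.values()
    except (TypeError, ValueError):
        return False  # uninspectable callable: treat as legacy
    return any(
        p.name == "messages" or p.kind is inspect.Parameter.VAR_KEYWORD
        for p in params
    )


def dispatch_sync_turn(provider, user_text, assistant_text, messages):
    if provider_sync_accepts_messages(provider.sync_turn):
        return provider.sync_turn(user_text, assistant_text, messages=messages)
    # Legacy text-only providers are called exactly as before.
    return provider.sync_turn(user_text, assistant_text)
```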
Also documents Memori as a standalone community memory provider.
Salvaged from #28065 — rebased onto current main.
Co-authored-by: Dave Heritage <david@memorilabs.ai>
* fix(redact): pass web URLs through unchanged
Magic-link checkout URLs, OAuth callbacks the agent is meant to follow,
and pre-signed share URLs were getting `?token=***` / `?code=***` /
`?signature=***` blanket-redacted by parameter NAME, which breaks any
skill that has to round-trip a URL through history (the model's tool
call arguments get sanitized before persistence — the live call fires
with the real URL, but the next turn sees `***`).
Joe Rinaldi Johnson hit this with a checkout-acceleration skill that
uses magic links in URLs.
Drops three call sites from `redact_sensitive_text`:
- `_redact_url_query_params` (was redacting `access_token`, `token`,
`api_key`, `code`, `signature`, `key`, `auth`, etc.)
- `_redact_url_userinfo` (was redacting `https://user:pass@host`)
- `_redact_http_request_target_query_params` (was redacting access-log
request targets like `"POST /hook?password=... HTTP/1.1"`)
The helpers themselves are kept in the module — still importable by
anything that wants to opt in explicitly.
Still redacted (unchanged):
- Vendor-prefix credential shapes (sk-, ghp_, AKIA, gAAAA, etc.)
anywhere they appear, including inside URLs — see the
`test_known_prefix_inside_url_still_redacted` case.
- JWTs (`eyJ...`)
- DB connection-string passwords (`postgres://admin:pw@host`) —
these are connection strings, not web URLs the agent navigates to.
- Authorization headers, ENV assignments, JSON `apiKey`/`token` fields,
Telegram bot tokens, private key blocks, Discord mentions, E.164
phone numbers, and form-urlencoded bodies (request bodies, not URLs).
Tests: replaces `TestUrlQueryParamRedaction` + `TestUrlUserinfoRedaction`
with `TestWebUrlsNotRedacted`, asserting representative URLs (OAuth
callback, magic link, S3 pre-signed, websocket, userinfo, access log)
pass through unchanged. Adds positive cases proving the prefix and DB
connstr nets still fire. 74 redact tests + 10 browser-exfil + 16 PII
redaction tests all pass.
* test(codex_app_server): drop URL-query assertion from stderr-tail redaction test
The test bundled (a) sk-live-* credential-prefix redaction with (b)
URL query-param redaction. (a) is still in effect via _PREFIX_RE;
(b) was the contract we just removed in the parent commit so the
'querysecret12345' assertion stopped holding. Keep the credential-shape
assertion, drop the URL-query one.
Send-message tool's local _URL_SECRET_QUERY_RE in tools/send_message_tool.py
is independent of agent/redact.py and unchanged — its tests
(test_top_level_send_failure_redacts_query_token,
test_http_error_redacts_access_token_in_exception_text) still pass.
Anthropic released Claude Opus 4.8 on 2026-05-27, available on
OpenRouter, Anthropic, Amazon Bedrock, and Claude Platform on AWS:
- https://openrouter.ai/anthropic/claude-opus-4.8
- https://openrouter.ai/anthropic/claude-opus-4.8-fast
The fast-mode variant is a separate model ID (anthropic/claude-opus-4.8-fast)
priced at 2x the base model, a notable improvement over the 6x premium
on older Opus generations (4.6/4.7). It is NOT a `speed: "fast"` request
parameter as on Opus 4.6; Anthropic's native fast-mode beta still only
covers Opus 4.6.
Changes:
hermes_cli/models.py
- Add anthropic/claude-opus-4.8 + anthropic/claude-opus-4.8-fast to
the OpenRouter fallback snapshot and the Nous Portal curated list
(live catalogs surface them automatically when reachable; the
fallback list matters when the manifest fetch fails).
- Add claude-opus-4-8 to the Anthropic-native picker list.
agent/model_metadata.py
- Register claude-opus-4-8 / claude-opus-4.8 in DEFAULT_CONTEXT_LENGTHS
with 1M tokens (matches 4.6/4.7).
agent/anthropic_adapter.py
- Extend _XHIGH_EFFORT_SUBSTRINGS, _ADAPTIVE_THINKING_SUBSTRINGS, and
_NO_SAMPLING_PARAMS_SUBSTRINGS with "4-8"/"4.8". 4.8 inherits the
Opus 4.7 API contract: adaptive thinking only, xhigh effort level
supported, sampling parameters (temperature/top_p/top_k) return 400.
- Add claude-opus-4-8 to _ANTHROPIC_OUTPUT_LIMITS (128k max output,
same as 4.7). Matches by substring so claude-opus-4-8-fast and
date-stamped variants resolve correctly.
agent/usage_pricing.py
- Add anthropic/claude-opus-4-8: $5/$25 per MTok input/output, $0.50
cache read, $6.25 cache write (same as 4.6/4.7).
- Add anthropic/claude-opus-4-8-fast: $10/$50 per MTok (2x), $1.00
cache read, $12.50 cache write. Per OpenRouter, the 2x premium is
the only differentiator from regular Opus 4.8.
- OpenRouter routes still pull pricing from the live /models API, so
no static OpenRouter entry is needed.
tests/agent/test_model_metadata.py
- Extend the Claude 4.6+ context-length tag list with 4.8/4-8.
website/static/api/model-catalog.json
- Regenerated via `python scripts/build_model_catalog.py` to pick up
the new entries in the OpenRouter and Nous Portal fallback lists.
E2E verification (isolated sys.path import against the worktree):
- _supports_adaptive_thinking, _supports_xhigh_effort, _forbids_sampling_params
all return True for claude-opus-4.8 and claude-opus-4.8-fast.
- _supports_fast_mode (the `speed: "fast"` request-parameter gate) stays
False for 4.8 — fast mode is a separate model ID on OpenRouter, not a
parameter Anthropic accepts on the base model.
- DEFAULT_CONTEXT_LENGTHS resolves 1M for both notations.
- resolve_billing_route + _lookup_official_docs_pricing resolve the
correct $5/$25 (regular) and $10/$50 (fast) pricing for both
dot-notation and dash-notation inputs.
- 4.7 and 4.6 regression: behavior unchanged.
Unit tests: 305 passed across tests/agent/test_usage_pricing.py,
test_model_metadata.py, tests/hermes_cli/test_model_catalog.py,
test_models.py, test_model_validation.py, test_models_dev_preferred_merge.py.
* fix(agent): fallback immediately on provider content-policy blocks
Provider safety-filter refusals (e.g. OpenAI Codex 'flagged for possible
cybersecurity risk', OpenAI moderation 'violates our usage policies',
Anthropic safety-system rejections, Azure content_filter) are
deterministic decisions about a specific prompt. Retrying the same
prompt up to api_max_retries times just reproduces the same refusal and
burns paid attempts before surfacing the generic 'API failed after 3
retries — <provider message>' to Telegram / cron with no indication that
the failure came from the model provider rather than Hermes itself.
Classify these as a new FailoverReason.content_policy_blocked
(non-retryable, should_fallback=True) and route them through the
existing is_client_error path so the loop:
- skips the 3x retry backoff
- activates a configured fallback model immediately
- emits a clear provider-safety message to the user (not the generic
'Non-retryable error (HTTP None)') and surfaces actionable guidance
when no fallback is configured (rephrase, narrow context, or set
fallback_model in hermes config)
- returns a final_response that explicitly tells the user this came
from the model provider, so gateway delivery is unambiguous and
cron last_status reflects the safety block rather than a vague
'agent reported failure'
Patterns are intentionally narrow — verbatim refusal phrasings keyed to
specific provider safety pipelines, not generic words like 'policy' or
'violation' that would collide with billing / format / auth errors.
Regression guards in test_18028_content_policy_blocked.py verify
billing 402s, generic 400s, and OpenRouter account-level
provider_policy_blocked remain distinct classifications.
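A narrow-pattern classifier along these lines could look like the sketch below. The enum is reduced to two members and the pattern tuple only contains phrasings quoted in this message; the real list and the classifier's signature are assumptions.

```python
from enum import Enum


class FailoverReason(Enum):
    content_policy_blocked = "content_policy_blocked"
    unknown = "unknown"  # stand-in for every other classification


# Verbatim refusal phrasings only; generic words like "policy" are avoided.
_CONTENT_POLICY_PATTERNS = (
    "flagged for possible cybersecurity risk",
    "violates our usage policies",
    "content_filter",
)


def classify_error_message(message):
    lowered = (message or "").lower()
    if any(pattern in lowered for pattern in _CONTENT_POLICY_PATTERNS):
        return FailoverReason.content_policy_blocked
    return FailoverReason.unknown
```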
Salvaged from #18164 onto current main (file restructure: loop logic
moved from run_agent.py to agent/conversation_loop.py, _emit_status →
_buffer_status), broadened patterns beyond the original OpenAI Codex
cybersecurity case to cover OpenAI moderation, Anthropic safety system,
and Azure content_filter; added user-actionable guidance and a clear
final_response so cron/gateway surfaces the policy block instead of a
generic non-retryable error, and added a regression-guard test module
mirroring the is_client_error predicate.
Addresses #18028.
Co-authored-by: Kuan-Chieh Huang <kchuang1015@users.noreply.github.com>
* chore: add kchuang1015 to AUTHOR_MAP
---------
Co-authored-by: Kuan-Chieh Huang <kchuang1015@users.noreply.github.com>
Users report that the CLI/gateway floods them with confusing retry chatter
during transient failures: a single 429 can produce 10+ "Provider/Endpoint/
Retrying in 5s..." lines before the request eventually succeeds. The same
firehose hits Telegram, Discord, Slack, etc. via _emit_status.
This patch defers all retry/fallback/compression status messages until we
know the outcome:
- if the turn ultimately succeeds (any path: primary recovers, fallback
activates, compression unsticks the request), the buffer is silently
dropped — the user sees nothing.
- if every retry and fallback exhausts and the turn fails, the buffer
is flushed at the terminal-failure return so the user sees the full
retry trace alongside the final error.
Backend logging (agent.log) is unchanged — every emission site still
writes to logger.warning/info, so post-mortem diagnosis is intact.
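A rough sketch of the defer-then-decide behavior described above. The
class name `StatusBuffer` and its constructor are assumptions made for
illustration; the real helpers are methods on AIAgent and may differ:

```python
from typing import Callable, List, Tuple


class StatusBuffer:
    """Hypothetical stand-in for the buffered status helpers."""

    def __init__(self, emit: Callable[[str], None], vprint: Callable[[str], None]):
        self._emit = emit
        self._vprint = vprint
        self._pending: List[Tuple[str, str]] = []

    def buffer_status(self, msg: str) -> None:
        self._pending.append(("status", msg))

    def buffer_vprint(self, msg: str) -> None:
        self._pending.append(("vprint", msg))

    def clear(self) -> None:
        # Success path: the user never sees the retry chatter.
        self._pending.clear()

    def flush(self) -> None:
        # Terminal-failure path: replay everything in original order.
        pending, self._pending = self._pending, []
        for kind, msg in pending:
            sink = self._emit if kind == "status" else self._vprint
            try:
                sink(msg)
            except Exception:
                pass  # a broken sink must not mask the real failure
```

Swapping the list out before replaying keeps a flush idempotent: a
second flush with nothing buffered is a no-op.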
## What changed
run_agent.py: four new methods on AIAgent:
_buffer_status(msg) — defer an _emit_status call
_buffer_vprint(msg) — defer a _vprint(force=True) line
_clear_status_buffer() — drop pending messages on success
_flush_status_buffer() — replay pending messages on terminal failure
agent/conversation_loop.py:
- converted ~30 mid-process emit/vprint sites in the retry, fallback,
compression, empty-response, and stream-watchdog paths to the buffered
helpers
- added _flush_status_buffer() at every terminal-failure return so users
still see the trace when it actually matters
- added _clear_status_buffer() at the "non-empty assistant content"
point (NOT at "API call returned bytes" — empty responses still loop
through the empty-retry path and would otherwise lose their trace
between iterations)
- silenced the two "(´;ω;`) oops, retrying..." / "(╥_╥) error,
retrying..." spinner final-frame messages — the spinner now stops
cleanly so retries leave no visible residue
agent/chat_completion_helpers.py: same conversion for codex TTFB / stale-
stream / fallback-activation status messages.
agent/stream_diag.py: _emit_stream_drop now buffers instead of emitting
directly.
## Tests
tests/run_agent/test_retry_status_buffer.py: 7 unit tests covering
accumulate→flush, clear-on-success, mixed kinds, empty-buffer no-op,
re-buffer after flush, exception swallowing.
Updated 3 existing tests that mocked _emit_status to also mock (or use)
_buffer_status:
- tests/run_agent/test_run_agent.py::test_empty_response_emits_status_for_gateway
- tests/run_agent/test_stream_drop_logging.py (2 tests)
- tests/agent/test_codex_ttfb_watchdog.py (TTFB hint test)
## Validation
Live test: hermes chat -q against an unreachable endpoint with no fallback
exhausts retries and prints the full trace at the end. Same flow against
a working endpoint prints zero retry chatter.
Condenses the substance of PRs #16453, #17453, #16451, #17600, and #13373
into a minimal generic host contract that external context engine plugins
(e.g. hermes-lcm) need to integrate cleanly. Drops scaffolding that
duplicated existing infrastructure or had marginal value.
Five concrete changes:
1. `_transition_context_engine_session()` on AIAgent — generic lifecycle
helper that fires on_session_end → on_session_reset → on_session_start
→ optional carry_over_new_session_context. Engines implement only the
hooks they need; missing hooks are skipped. Built-in compressor keeps
its existing reset-only behavior because callers default to no
metadata. `reset_session_state()` now optionally accepts
previous_messages / old_session_id / carry_over_context and delegates
to the transition helper when provided. (#16453)
2. `conversation_id` passed to `on_session_start()` — both the
agent-init call site and the compression-boundary call site now
forward `self._gateway_session_key` so plugin engines have a stable
conversation identity that survives session_id rotation (compression
splits, /new, resume). The key already existed on AIAgent; it just
wasn't reaching engines. (#16453)
3. Canonical cache buckets forwarded to engines — the usage dict passed
to `update_from_response()` now includes input_tokens, output_tokens,
cache_read_tokens, cache_write_tokens, and reasoning_tokens on top of
the legacy prompt/completion/total keys. Engines can make decisions on
cache-hit ratios and reasoning costs instead of only aggregates. ABC
docstring updated. (#17453)
4. Plugin-registered context engines visible in the picker —
`_discover_context_engines()` in plugins_cmd.py now also includes
engines registered via `ctx.register_context_engine()` from plugin
manifests, deduplicating by name so repo-shipped descriptions win on
collision. (#16451)
5. `_EngineCollector.register_command()` — context engines using the
standard `register(ctx)` pattern can now expose slash commands (e.g.
`/lcm`). Routes to the global plugin command registry with the same
conflict-rejection policy regular plugins use (no shadowing built-ins,
no clobbering other plugins). Previously these calls hit a no-op and
the slash commands silently never appeared. (#17600)
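To illustrate change 1, a minimal sketch of "fire the hooks in order,
skip the ones an engine does not implement". The function name
`run_transition` and the keyword-argument shape are assumptions; only
the hook names and their order come from the description above:

```python
HOOK_ORDER = ("on_session_end", "on_session_reset", "on_session_start")


def run_transition(engine, **context):
    """Call each lifecycle hook the engine defines, in a fixed order.

    Returns the names of the hooks that actually fired so callers (and
    tests) can see which ones were skipped.
    """
    fired = []
    for name in HOOK_ORDER:
        hook = getattr(engine, name, None)
        if callable(hook):
            hook(**context)
            fired.append(name)
    return fired
```

An engine that only defines `on_session_start` gets exactly one call,
and an engine with no hooks at all is left alone.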
Dropped from the original 5 PRs:
- Compression boundary signal (`boundary_reason="compression"`) from
#16453 — already on main at `agent/conversation_compression.py:412-424`,
landed via the bg-review extraction.
- `discover_plugins()` before fallback in run_agent.py from #16451 —
redundant: `get_plugin_context_engine()` already routes through
`_ensure_plugins_discovered()` which is idempotent.
- Runtime identity diagnostics method + helpers from #13373 (+251 LOC) —
operators can already read engine state via `engine.get_status()`;
the diagnostics view added marginal value relative to its surface area.
- The 553-LOC slash-command machinery from #17600 — replaced with a
20-LOC `register_command` method on the collector that reuses the
existing plugin command registry instead of building a parallel one.
Net: ~215 LOC of host-contract changes + 282 LOC of focused tests, vs
~1,176 LOC across the original 5 PRs.
Co-authored-by: Tosko4 <1294707+Tosko4@users.noreply.github.com>
Closes #16453.
Closes #17453.
Closes #16451.
Closes #17600.
Closes #13373.
Related: stephenschoettler/hermes-lcm#68.
Snapshot review_agent._session_messages before teardown so close() can
clean per-session state without dropping the user-visible
self-improvement summary. Adds two regressions:
- bg-review summarizer receives captured review-agent tool messages
after review_agent.close() runs
- context-compressor protected-head handoff rehydration populates
_previous_summary and keeps the old handoff out of newly summarized
turns
Salvaged from PR #26039 onto current main after agent/background_review.py
extraction. Original commit 63eaf6055; bg-review test updated to patch
the module-level summarize_background_review_actions in
agent.background_review instead of the now-forwarder
AIAgent._summarize_background_review_actions.
Closes #33368.
`_CodexCompletionsAdapter.create()` iterates `final.output` from the
Codex Responses stream. The event-driven consumer (introduced in #33042)
always sets `final.output` to a list, so this shape can't come from our
own code path. But:
- Mocked clients in tests can return a typed Response with `output=None`
- Third-party shims / compatibility layers that bypass the consumer can
do the same
- A future code path that wraps a different consumer could regress
The old code `getattr(final, "output", [])` returns `None` (not the
default `[]`) when the attribute EXISTS but is `None`. Iterating
`None` then raises `TypeError: 'NoneType' object is not iterable` —
the exact error logged by title-generation when this fires.
Fix: `getattr(final, "output", None) or []` — single-line defensive
coerce. Cheap; zero risk.
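The difference between the two forms can be reproduced in isolation.
`iter_output_items` is a hypothetical helper used only for this demo,
and `SimpleNamespace` stands in for the typed Response object:

```python
from types import SimpleNamespace


def iter_output_items(final):
    # `or []` covers both a missing attribute and an attribute set to None.
    return list(getattr(final, "output", None) or [])


present_but_none = SimpleNamespace(output=None)
missing = SimpleNamespace()
populated = SimpleNamespace(output=["item"])

# The old form only falls back when the attribute is absent, so an
# explicit None leaks through and would fail on iteration.
old_value = getattr(present_but_none, "output", [])
```

Here `old_value` is `None`, while `iter_output_items` yields an empty
list for both the missing and the `None` case.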
Regression test asserts the auxiliary path handles a final whose
`.output` is `None` (via monkey-patched consumer) without raising and
returns the expected chat.completions-shaped response.
Reporter: @pavegrid-1 (issue #33368).
Three additions on top of @Nami4D's salvage:
1. Gate the preflight slash-enum strip on the model name pattern
(grok-* / x-ai/grok-*). The original PR stripped slash-containing
enum values from every codex_responses request, but native Codex
(OpenAI) and GitHub Models DO accept slash enums — stripping them
there would silently degrade tool-schema constraints. xAI is the
only Responses-API surface that rejects the shape.
2. Resolve the merge conflict in agent/transports/codex.py by
preserving both the timeout-forwarding block that landed on main
between the PR's branch point and now AND the new service_tier
strip. Behavioural intent of both is preserved.
3. Six new tests in tests/agent/transports/test_codex_transport.py
covering:
- TestCodexTransportXaiServiceTierStrip (3 tests): xAI strips
service_tier from request_overrides; non-xAI codex_responses
and GitHub Models both KEEP service_tier (regression guards
so the strip stays xAI-only).
- TestPreflightSlashEnumStrip (3 tests): Grok and aggregator-
prefixed Grok model names both trigger the safety-net strip;
non-Grok models preserve slash enums as a regression guard
against the strip becoming too broad.
51/51 pass in tests/agent/transports/test_codex_transport.py.
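A sketch of the model-gated strip from item 1, under stated
assumptions: the regex, the function names, and the choice to simply
filter the offending enum values are illustrative and may not match
the transport's actual handling:

```python
import re

# Assumed pattern for native and aggregator-prefixed Grok model names.
_GROK_MODEL_RE = re.compile(r"^(?:x-ai/)?grok-", re.IGNORECASE)


def strip_slash_enums(schema):
    """Return a copy of a JSON-schema fragment without slash enum values."""
    if isinstance(schema, dict):
        out = {}
        for key, value in schema.items():
            if key == "enum" and isinstance(value, list):
                out[key] = [
                    v for v in value if not (isinstance(v, str) and "/" in v)
                ]
            else:
                out[key] = strip_slash_enums(value)
        return out
    if isinstance(schema, list):
        return [strip_slash_enums(v) for v in schema]
    return schema


def preflight(model, schema):
    # Only Grok-style models get the strip; everything else is untouched.
    if _GROK_MODEL_RE.match(model):
        return strip_slash_enums(schema)
    return schema
```

Returning the original object for non-Grok models makes the "strip
stays narrow" regression guard a simple identity check.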
Co-authored-by: Nami4D <hello@nami4d.tech>
* remove Vercel AI Gateway provider and Vercel Sandbox terminal backend
Both Vercel-hosted integrations are removed end-to-end. Users on the AI
Gateway should switch to OpenRouter or one of the other aggregators
(Nous Portal, Kilo Code). Users on the Vercel Sandbox backend should
switch to Docker, Modal, Daytona, or SSH.
What's removed:
- `plugins/model-providers/ai-gateway/` provider plugin
- `hermes_cli/vercel_auth.py` Vercel-Sandbox auth helper
- `tools/environments/vercel_sandbox.py` terminal backend
- `ai-gateway` provider wiring across auth, doctor, setup, models,
config, status, providers, main, web_server, model_normalize, dump
- `vercel_sandbox` backend wiring across terminal_tool, file_tools,
code_execution_tool, file_operations, approval, skills_tool,
environments/local, credential_files, lazy_deps, prompt_builder,
cli, gateway/run
- `AI_GATEWAY_BASE_URL` constant, `_AI_GATEWAY_HEADERS` auxiliary-client
header set, run_agent base-URL header/reasoning special-cases
- `[vercel]` pyproject extra and `vercel`/`vercel-workers` from uv.lock
- env vars: `AI_GATEWAY_API_KEY`, `AI_GATEWAY_BASE_URL`, `VERCEL_TOKEN`,
`VERCEL_PROJECT_ID`, `VERCEL_TEAM_ID`, `VERCEL_OIDC_TOKEN`,
`TERMINAL_VERCEL_RUNTIME`
- Tests: deletes test_ai_gateway_models.py and
test_vercel_sandbox_environment.py; scrubs references across 23
surviving test files (no entire tests deleted unless they were
dedicated to AI Gateway / Sandbox)
- Docs: provider tables, env-var reference, setup guides, security
notes, tool config, terminal-backend tables — English plus zh-Hans
i18n parity
- `hermes-agent` skill: provider table entry and remote-backend list
What stays (intentional):
- `popular-web-designs/templates/vercel.md` — CSS design reference,
unrelated to Vercel-the-AI-product
- `x-vercel-id` in `stream_diag.py` headers — generic Vercel CDN
response header, useful diag signal on any Vercel-hosted endpoint
- `vercel-labs/agent-browser` URL in browser config — lightpanda
browser project, different OSS effort
- `userStories.json` historical contributor entry mentioning Vercel
Sandbox — archive, not active docs
Validation:
- 1153 tests in the 22 targeted files pass (`scripts/run_tests.sh`)
- Full repo `py_compile` clean
- Live import of every touched module + invariant check (no
`ai-gateway` in `PROVIDER_REGISTRY`, no `_AI_GATEWAY_HEADERS`, no
`vercel_sandbox` in `_REMOTE_TERMINAL_BACKENDS`)
* test: convert profile-count check from change-detector to invariant
The hardcoded "== 34" assertion broke when ai-gateway was removed.
Per AGENTS.md change-detector-test guidance, assert the relationship
(registry count >= number of plugin dirs) instead of a literal count.
Counts shift when providers are added/removed; that's expected.
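The shape of such an invariant check, sketched with hypothetical
helper names and a plain dict standing in for the provider registry:

```python
from pathlib import Path


def count_plugin_dirs(root: Path) -> int:
    """Count immediate subdirectories; loose files are ignored."""
    return sum(1 for p in root.iterdir() if p.is_dir())


def check_registry_covers_plugins(registry: dict, root: Path) -> None:
    # Relationship, not a literal: adding or removing a provider changes
    # both sides together, so the assertion keeps holding.
    assert len(registry) >= count_plugin_dirs(root)
```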
* refactor(codex): drop SDK responses.stream() helper; consume events directly
The OpenAI Python SDK's high-level `client.responses.stream(...)` helper
does post-hoc typed reconstruction from the terminal
`response.completed.response.output` field. The chatgpt.com Codex
backend has been observed (today, gpt-5.5) to ship `response.output =
null` on terminal frames, which crashes the SDK with `TypeError:
'NoneType' object is not iterable` mid-iteration.
Carlton's #32963 patched the symptom by wrapping the helper in
try/except and recovering from the same per-event accumulator the SDK
was supposed to populate. This PR removes the helper from the call
path entirely: we now use `client.responses.create(stream=True)` (raw
AsyncIterable of SSE events) and assemble the final response object
ourselves from `response.output_item.done` events as they arrive. The
terminal event's `output` field is never read for content. Same
strategy OpenClaw uses for the same backend.
This makes Hermes structurally immune to the bug class rather than patched against one instance of it.
The next time OpenAI ships a shape change to chatgpt.com's terminal
frame, our consumer keeps working because it doesn't read that frame
for content — only for usage/status/id.
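A simplified sketch of that consumption strategy. Events are modeled
as plain dicts and `consume_events` is a hypothetical name; the real
consumer works on SDK event objects and handles more event types:

```python
def consume_events(events):
    """Assemble a response from per-item events.

    Content comes only from `response.output_item.done` events. The
    terminal frame contributes id/status/usage and its `output` field
    is never read.
    """
    items = []
    terminal = {}
    for event in events:
        etype = event.get("type")
        if etype == "response.output_item.done":
            items.append(event["item"])
        elif etype == "response.completed":
            terminal = event.get("response") or {}
    return {
        "id": terminal.get("id"),
        "status": terminal.get("status"),
        "usage": terminal.get("usage"),
        "output": items,
    }
```

With this shape, a terminal frame carrying `output: null` has no effect
on the assembled content.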
Changes
- `agent/codex_runtime.py`: new `_consume_codex_event_stream()` shared
consumer; `run_codex_stream()` uses `responses.create(stream=True)`;
`run_codex_create_stream_fallback()` collapses into a thin alias
since the primary path now does what the fallback used to do.
- `agent/auxiliary_client.py`: `_CodexCompletionsAdapter` uses the
same consumer; old null-output recovery helpers deleted as
unreferenced.
- Tests migrated: fixtures that mocked `responses.stream` now mock
`responses.create` returning a raw iterable. New regression test
asserts the auxiliary path returns streamed items even when the
terminal event's `output` is literally `null`.
Validation
- Live: tested against fresh OAuth on `chatgpt.com/backend-api/codex`
with `gpt-5.5` — response built correctly with `response.output=null`
on the terminal frame, all events consumed, usage/reasoning tokens
propagated.
- `tests/run_agent/test_run_agent_codex_responses.py` +
`tests/agent/test_auxiliary_client.py`: 242 passed.
* test+fix(codex): migrate streaming tests, raise on truncated streams
CI surfaced 10 test failures across tests/run_agent/test_streaming.py
and tests/run_agent/test_codex_xai_oauth_recovery.py — both files had
their own `responses.stream(...)` mocks I missed in the first sweep.
agent/codex_runtime.py: _consume_codex_event_stream() now raises
"Codex Responses stream did not emit a terminal response" when the
stream ends without any terminal frame AND no usable content. This
preserves the signal callers used to get from the SDK's high-level
helper, which they distinguished from "completed with empty body"
in error handling.
Tests migrated:
- test_streaming.py: text-delta callback, activity-touch, and
remote-protocol-error tests all switch from mocking responses.stream
to responses.create returning an iterable of events.
- test_codex_xai_oauth_recovery.py: prelude-error tests are recast as
wire-error-event tests (the new path raises _StreamErrorEvent
directly when the wire emits type=error, which is strictly better
than the old two-phase "SDK RuntimeError → retry → fallback"). The
retry-on-transport-error test moves from responses.stream side-effect
to responses.create side-effect.
Verified live against chatgpt.com Codex with gpt-5.5 — AIAgent.chat()
through the full codex_responses path returns correctly, 319/319
targeted tests passing.
* fix(codex-responses): gracefully recover from invalid_encrypted_content (salvage #10144)
When an OpenAI-compatible Responses API surface accepts an initial
request but later rejects the replayed `codex_reasoning_items`
encrypted blob with HTTP 400 `invalid_encrypted_content`, the
session previously got stuck retrying the same poisoned payload.
Recovery: classify the error as a dedicated FailoverReason, and on the
first hit disable encrypted reasoning replay for the rest of the
session, strip cached items from message history, and retry once.
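The once-per-session recovery can be sketched as follows. The `Session`
class and `try_recover` method are hypothetical; in the actual change
the flag lives on the agent and the branch sits in the retry loop:

```python
class Session:
    """Hypothetical holder for history plus the replay flag."""

    def __init__(self, messages):
        self.messages = messages
        self.replay_enabled = True

    def has_cached_reasoning(self):
        return any(
            m.get("role") == "assistant" and m.get("codex_reasoning_items")
            for m in self.messages
        )

    def try_recover(self, error_code):
        """Return True when the caller should retry exactly once."""
        if error_code != "invalid_encrypted_content":
            return False
        if not self.replay_enabled or not self.has_cached_reasoning():
            # Already recovered, or the 400 is unrelated to our cache.
            return False
        self.replay_enabled = False
        for m in self.messages:
            m.pop("codex_reasoning_items", None)
        return True
```

Flipping the flag before the retry is what bounds the recovery to a
single attempt per session.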
Changes:
* error_classifier: add FailoverReason.invalid_encrypted_content
branch in _classify_400 (before context_overflow so the messages
that mention 'encrypted content … could not be verified' don't trip
context heuristics), in _classify_by_error_code, and extend
_extract_error_code to peek inside wrapped JSON in error.message and
ignore the bare '400' as a code.
* agent_init: initialize `_codex_reasoning_replay_enabled = True` on
every agent.
* run_agent: add AIAgent._disable_codex_reasoning_replay() helper
that flips the flag and pops cached items.
* codex_responses_adapter: thread a `replay_encrypted_reasoning`
kwarg through _chat_messages_to_responses_input so that when the
flag is False we don't replay codex_reasoning_items.
* transports/codex.py: read `replay_encrypted_reasoning` from params,
thread it into the adapter, and gate the
`include=['reasoning.encrypted_content']` request hint on it.
* chat_completion_helpers: pass the agent's replay flag through to
the transport.
* conversation_loop: in the retry loop, add an
invalid_encrypted_content recovery branch that fires once per
session, only when api_mode == codex_responses, only when replay is
still enabled, and only when at least one assistant message in
history actually carries cached reasoning items (otherwise the 400
has nothing to do with our cache and the normal retry path handles
it).
Tests:
* test_error_classifier: new wrapped-JSON _extract_error_code case;
new TestClassifyApiError cases proving the 400 is retryable with
no fallback, that the broad message match doesn't catch a generic
'parsed' message, and that the error code match is
case-insensitive.
* test_run_agent_codex_responses: end-to-end test of the recovery
branch firing once and disabling replay, plus a sibling test that
proves the branch does *not* fire (and the flag stays True) when
history has no cached reasoning items.
Salvages PR #10144 onto the post-refactor module layout
(error_classifier / codex_responses_adapter / transports/codex /
conversation_loop / agent_init) since the original diff was written
against the pre-refactor monolithic run_agent.py.
* chore(release): map victorGPT in AUTHOR_MAP for #10144 salvage
---------
Co-authored-by: victorGPT <wuxuebin1993@gmail.com>
SubdirectoryHintTracker was scanning directories outside the active
working directory, allowing files like ~/.codex/AGENTS.md or
~/.claude/CLAUDE.md to be loaded and injected into the agent context.
This caused cross-agent context contamination and instruction mix-ups.
Add _is_ancestor_or_same() helper and a path boundary check in
_is_valid_subdir(): only directories within the working directory tree
(i.e. path.is_relative_to(working_dir)) are allowed.
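A minimal sketch of such a boundary check. `is_within_workdir` is a
hypothetical name, and the signature of the real
`_is_ancestor_or_same()` helper is not shown here:

```python
from pathlib import Path


def is_within_workdir(working_dir: Path, candidate: Path) -> bool:
    """True when `candidate` is the working dir or lives beneath it.

    Both paths are resolved first so `..` segments and symlinks cannot
    be used to step outside the tree.
    """
    working_dir = working_dir.resolve()
    candidate = candidate.resolve()
    return candidate == working_dir or working_dir in candidate.parents
```

Resolving before comparing matters: `repo/../sibling` looks like it
starts inside the workspace but resolves to a sibling directory.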
Also add exist_ok=True to mkdir() calls in new tests to prevent
pytest-xdist race conditions when workers share the same tmp_path parent.
Tests added:
- test_outside_working_dir_rejected: verifies sibling dirs are blocked
- test_outside_working_dir_absolute_path_rejected: verifies ~/.codex paths are blocked
- test_inside_workspace_subdir_allowed: verifies normal subdir access unaffected
- test_sibling_repo_not_loaded_via_ancestor_walk: ancestor walk stays within workspace
When the user picks 'Anthropic API key' at `hermes setup` (vs 'Claude
Pro/Max subscription'), `save_anthropic_api_key()` writes ANTHROPIC_API_KEY
to ~/.hermes/.env and zeros ANTHROPIC_TOKEN. That env-var pattern is the
user's explicit choice of auth method — API key, not OAuth.
But the anthropic credential pool's autodiscovery (_seed_from_singletons)
unconditionally read ~/.claude/.credentials.json from the Claude Code CLI
and any saved hermes_pkce creds, and added them to the SAME anthropic
pool as the user's API key. Two problems:
1. Even with the API key at higher priority, a 401/429 on the API key
would rotate the session onto an autodiscovered OAuth credential,
silently flipping the agent into the Claude Code masquerade
mid-conversation: 'You are Claude Code' system block, every tool
renamed to mcp_*, claude-cli User-Agent header.
2. Switching OAuth → API key at `hermes setup` cleared the env vars
but left previously-seeded OAuth entries dormant in auth.json,
where rotation could revive them.
The user picking the API-key path is explicitly opting OUT of the
masquerade. Mixing OAuth credentials into their pool defeats that
choice.
Fix: in `_seed_from_singletons` for provider='anthropic', detect the
API-key path (ANTHROPIC_API_KEY set in env, no OAuth env var set) and:
- Skip calling read_claude_code_credentials() and
read_hermes_oauth_credentials() entirely
- Prune any stale hermes_pkce / claude_code entries that may already
be in the on-disk pool
OAuth-path users (ANTHROPIC_TOKEN set) are unaffected — autodiscovery
continues to fire as before.
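The gating logic, sketched under assumptions: the pool is modeled as a
list of dicts with a `source` key, `seed_pool` and `discover` are
hypothetical names, and ANTHROPIC_TOKEN is treated as the only OAuth
env var, which may be narrower than the real check:

```python
OAUTH_SOURCES = {"hermes_pkce", "claude_code"}


def is_api_key_path(env):
    # API key present and no OAuth token configured.
    return bool(env.get("ANTHROPIC_API_KEY")) and not env.get("ANTHROPIC_TOKEN")


def seed_pool(env, pool, discover):
    """Return the credential pool after applying the API-key rule.

    `discover` is only called on the OAuth path, so the Claude Code
    credentials file is never read for API-key users.
    """
    if is_api_key_path(env):
        return [e for e in pool if e.get("source") not in OAUTH_SOURCES]
    return list(pool) + list(discover())
```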
Tests: 3 new regression tests (api-key skips autodiscovery, api-key
prunes stale entries, oauth path still autodiscovers). Full file 70/70.