In the concurrent tool-execution path, checkpoint preflight (write_file,
patch, destructive terminal) fired BEFORE plugin guardrail block_result
was computed. A blocked write_file could still dirty checkpoint state
(doc_modified_this_turn, _last_write_file_call_id, turn_counter).
Move checkpoint preflight to AFTER block_result computation, gated on
`if block_result is None:` — matching the invariant the sequential path
already enforces.
The max-iterations summary path (`handle_max_iterations`) hand-builds its
message list and calls `chat.completions.create()` directly, bypassing
`ChatCompletionsTransport.convert_messages()`. It only popped
("reasoning", "finish_reason", "_thinking_prefill"), so `tool_name` (SQLite
FTS bookkeeping), the `codex_*` reasoning carriers, and other internal
`_`-prefixed scaffolding leaked to the wire.
Strict OpenAI-compatible gateways (Fireworks-backed OpenCode Go, Mistral,
Moonshot/Kimi) reject these with HTTP 400 "Extra inputs are not permitted,
field: 'messages[N].tool_name'", so a long tool-using session that exhausts
the iteration budget fails to summarise instead of returning the result.
Mirror convert_messages() in this path: also drop tool_name,
codex_reasoning_items, codex_message_items, and every `_`-prefixed key.
Copy-on-write is already in place, so internal history keeps the fields for
FTS / Codex-fallback.
Adds a regression test to TestHandleMaxIterations asserting the summary
request carries none of the schema-foreign keys (fails on main, passes here).
Follow-up to the salvaged #34452 turn-completion explainer:
- Register display.turn_completion_explainer: True in DEFAULT_CONFIG so the
setting is discoverable, matching the file_mutation_verifier precedent.
- Shorten the repeated footer prefix from 'Turn ended without a usable
reply: ' to 'No reply: ' so the 10 reason variants don't all open with
the same 8-word boilerplate.
- Update the 7 assertions that referenced the old prefix.
PR #34470 adds an explainer suffix to abnormal turn endings (e.g.
max_iterations_reached) so users see why the response is short instead
of receiving a bare/blank reply. test_tool_call_validation_accepts_dict_arguments
runs the agent at max_iterations=3 which hits the explainer path; the
existing strict-equality assertion (== "done") no longer matches once
the suffix is appended.
Switch the assertion to .startswith("done") so the test continues to
verify that the models actual text survives intact while leaving the
explainer suffix wording owned by conversation_loop (where it belongs).
Test now passes (1 passed in 0.88s).
When a turn ends abnormally after substantive tool calls (empty content
after retries, a partial/truncated stream, exhausted retries, or an
iteration/budget limit), the CLI/TUI response area was left blank or
showed only a fragment (e.g. "The") with no consolidated reason. The
internal turn_exit_reason values (empty_response_exhausted,
partial_stream_recovery, etc.) were never surfaced to the user.
Add a turn-completion explainer that mirrors the existing file-mutation
verifier footer: at turn end, map an abnormal turn_exit_reason to a
short, actionable message and either replace the bare "(empty)"
sentinel or append the reason after a partial fragment. Normal
text_response exits (e.g. a terse "Done.") stay quiet.
Gated by display.turn_completion_explainer (default on) with
HERMES_TURN_COMPLETION_EXPLAINER env override, matching the
file-mutation verifier seam.
Closes#34452
* Inspired by Claude Code: /compress here [N] — boundary-aware 'summarize up to here'
Adds a user-chosen compression boundary to the existing /compress command.
/compress here [N] summarizes everything except the most recent N exchanges
(default 2), which are preserved verbatim — letting the user pick the
compression boundary instead of relying on the automatic token-budget heuristic.
Inspired by Claude Code's Rewind 'Summarize up to here' action (v2.1.139,
Week 20, May 2026): https://code.claude.com/docs/en/whats-new/2026-w20
- hermes_cli/partial_compress.py: pure split/parse helpers + seam-alternation
guard (shared by CLI and gateway).
- cli.py / gateway/run.py: route 'here [N]' / '--keep N' to partial compression;
compress only the head, re-append the verbatim tail through the seam guard.
- Preserves message-flow role alternation (seam guard merges any illegal
user->user / assistant->assistant adjacency).
- Reuses the existing _compress_context session-rotation/lock machinery — no
changes to the compression core.
- Bare /compress (full) and /compress <focus> behavior unchanged.
Tests: 12 helper unit tests + 5 CLI integration tests + E2E (interleaved
tool-call transcript, degenerate/multimodal seams, real handler path).
* fix: keep CLI context display in sync with preflight token estimate
The status bar reads compressor.last_prompt_tokens, which only updates
from a successful API response. When loaded history is oversized but
compression no-ops (e.g. the auxiliary summary model times out), no fresh
usage arrives and the bar stays frozen at the old, smaller value while the
preflight estimate reports a much larger number — looking permanently out
of sync (reported: 74.4K display vs ~144,669 preflight).
Seed last_prompt_tokens with the fresh preflight estimate (upward-only, so
a real usage figure is never clobbered and a successful compression's
downward correction still wins). Display-only; no behavioral change to
compression, caching, or the agent loop.
The rough-estimate mock supplied only 2 side_effect values but the
conversation loop calls estimate_request_tokens_rough a third time for
the post-response real-token estimate, exhausting the iterator. Use a
callable side_effect that returns 125k once (to fire preflight) then
sub-threshold values, independent of call count.
Worker threads that dispatch Hermes tools started with an empty contextvars.Context and no thread-local approval/sudo callbacks. Add tools/thread_context.propagate_context_to_thread factoring that capture/install/clear lifecycle (mirrors the GHSA-qg5c-hvr5-hjgr pattern), and refactor agent/tool_executor onto it so the security-critical logic lives in one audited place. Update the contextvar-propagation source guard for the new call shape.
Refs #33057
After the legacy session-key path was removed, two parameters became dead
surface on the Nous runtime-resolution chain:
- min_key_ttl_seconds: del'd inside refresh_nous_oauth_pure and pass-through /
telemetry-only in refresh_nous_oauth_from_state, _try_import_shared_nous_state,
_nous_device_code_login, and resolve_nous_runtime_credentials. It controlled the
now-deleted agent-key mint TTL and drives no behavior.
- inference_auth_mode: with the legacy mode gone, AUTO and FRESH are behaviorally
identical; the value only fed _normalize_nous_inference_auth_mode validation and
oauth trace output, never a branch.
Removing inference_auth_mode orphaned its whole supporting cluster
(NOUS_INFERENCE_AUTH_MODE_AUTO/FRESH, NOUS_INFERENCE_AUTH_MODES,
_normalize_nous_inference_auth_mode), and dropping min_key_ttl_seconds orphaned
DEFAULT_AGENT_KEY_MIN_TTL_SECONDS — all deleted here.
Updated every caller (run_agent, auxiliary_client, credential_pool, proxy adapter,
runtime_provider, web_server, main, auth_commands, setup) and pruned the matching
test kwargs. Deleted two tests that exercised the removed surface
(test_legacy_auth_mode_is_rejected, test_try_refresh_..._accepts_explicit_auth_mode).
No behavior change: net -134 LOC of dead code.
The scoping fix added enabled_toolsets/disabled_toolsets to the
agent_runtime_helpers sequential dispatch into handle_function_call, so
test_invoke_tool_dispatches_to_handle_function_call's assert_called_once_with
(exact match) needs the two new kwargs. Both are None for the default agent
fixture.
The two tests in TestRunConversation now verify the new behavior:
- test_kanban_block_called_on_iteration_exhaustion → verifies
_record_task_failure(outcome='timed_out') is called instead of
kanban_block
- test_no_kanban_block_when_not_in_kanban_mode → verifies the bridge
is a no-op when HERMES_KANBAN_TASK is unset
The function names are kept for diff stability; both assert against
_record_task_failure now, which is the correct contract per the gap-2
fix in this PR.
The xAI tool-schema sanitizers (strip_slash_enum, strip_pattern_and_format)
mutate their input in place — that's their documented contract. The two
call sites (chat_completion_helpers.build_api_kwargs and the auxiliary
client) were passing agent.tools straight through, so the first xAI
request would permanently strip slash-containing enum constraints and
pattern/format keywords from the per-agent tool registry.
Effect: any subsequent non-xAI call from the same agent (auxiliary task
routed to Anthropic, OpenRouter fallback, mid-session model switch) saw
the already-stripped schema with no way for the user to notice from
their config.
Fix: deepcopy tools_for_api before sanitizing at both call sites.
The slash-enum bug itself (xAI 400ing on enums with '/') was fixed
earlier by #32443 (Nami4D) — that PR landed the strip but used the
sanitizers directly without copying. This salvages #27907's correctness
contribution (the deepcopy) while skipping its redundant parallel
sanitizer (strip_xai_incompatible_enum_values is functionally
equivalent to the existing strip_slash_enum) and its preflight-
neutrality argument (we chose model-gated preflight in #32443).
3 new tests in tests/run_agent/test_run_agent_codex_responses.py:
- strips_slash_enum_from_outgoing_request — outgoing kwargs has no
slash-containing enum values (functional contract preserved).
- does_not_mutate_agent_tools — headline #27907 regression. Snapshot
agent.tools before build_api_kwargs, assert it survives intact
after. Pre-fix this assertion would have caught the mutation.
- is_idempotent_across_repeated_calls — three xAI requests in a row
each strip cleanly AND don't progressively erode the source schema.
344/344 across tests/agent/test_auxiliary_client.py,
tests/agent/transports/test_codex_transport.py,
tests/run_agent/test_run_agent_codex_responses.py, and
tests/tools/test_schema_sanitizer.py.
Co-authored-by: Gabor Barany <barany.gabor@gmail.com>
Remove unused imports (F401) and duplicate/shadowed import
redefinitions (F811) across the codebase using ruff's safe
autofixes. No behavioral changes -- imports only.
- ~1400 safe autofixes applied across 644 files (net -1072 lines)
- __init__.py re-exports preserved (excluded from F401 removal so
public re-export surfaces stay intact)
- Re-exports that are imported or monkeypatched by tests but look
unused in their defining module are kept with explicit # noqa:
F401 (gateway/run.py load_dotenv; run_agent re-exports from
agent.message_sanitization, agent.context_compressor,
agent.retry_utils, agent.prompt_builder, agent.process_bootstrap,
agent.codex_responses_adapter)
- Unsafe F841 (unused-variable) fixes deliberately skipped -- those
can change behavior when the RHS has side effects
- ruff lints remain disabled in pyproject.toml (only PLW1514 is
selected); this is a one-time cleanup, not a config change
Verification:
- python -m compileall: clean
- pytest --collect-only: all 27161 tests collect (zero import errors)
- core entry points import clean (run_agent, model_tools, cli,
toolsets, hermes_state, batch_runner, gateway)
- static scan: every name any test imports directly from an edited
module still resolves
* fix(codex): surface error code in Responses 'failed' status errors
When a Codex Responses turn ends with status=failed, the response carries
the failure details under `response.error` as
`{code, message, param, ...}`. The previous extractor pulled only
`message`, so users seeing a rate-limit failure got a bare "Slow down"
string indistinguishable from a generic stream truncation; an
internal_error with empty message degraded to a dict dump
("{'code': 'internal_error', 'message': ''}").
Extract a `_format_responses_error()` helper that:
- prefixes `code` when both code and message are present
(e.g. 'rate_limit_exceeded: Slow down')
- falls back to the bare `code` when message is empty
- accepts both dict and attribute-style payloads (SDK and JSON-RPC paths)
- preserves the prior status-only fallback when no error payload exists
Apply the same helper at the sibling site in
`codex_app_server_session.run_turn()` so codex-CLI subprocess turn
failures get the same treatment.
Tests:
- 8 new unit tests for `_format_responses_error` covering both shapes,
empty/missing fields, non-string fields, and the status-only fallback.
- 2 regression tests on `_normalize_codex_response` for failed status
with and without a code, asserting the exact RuntimeError message.
- All 3603 tests in tests/agent/ pass.
Adapted from anomalyco/opencode#28757.
* feat(prompt): universal task-completion guidance + local Python toolchain probe
Two cross-model failure modes get a single-line answer in the cached
system prompt. Both gated by config (default on), both add zero overhead
when not needed, both verified via real AIAgent prompt builds.
## What changed
`TASK_COMPLETION_GUIDANCE` — short prompt block applied to ALL models.
Targets two failure modes observed on a real Sarasota real-estate build
task: (1) Opus stopped after writing an 85-byte stub and gave a prose
response with finish_reason=stop on call #3 of 90; (2) DeepSeek pushed
through a PEP-668 wall, then returned fabricated listings instead of
admitting the blocker. Both behaviors are model-family-agnostic, so the
guidance lives outside the existing tool_use_enforcement gate (~192
tokens, paid once per session via prefix cache).
`tools/env_probe.py` — local Python toolchain probe. Detects
python3/pip/uv/PEP-668 state and emits ONE short line in the system
prompt when something is non-default. Emits NOTHING when the env is
clean (zero token cost for normal users). Skipped entirely for remote
terminal backends (docker/modal/ssh) — they have their own probe.
Example output on a broken environment (the actual case):
Python toolchain: python3=3.11.15 (no pip module),
python=missing (use python3), pip→python3.12 (mismatch),
PEP 668=yes (use venv or uv).
## Config
Both flags live under `agent.` in config.yaml, default True:
agent:
task_completion_guidance: true # universal "finish the job" block
environment_probe: true # local Python toolchain hints
Neither addition required a `_config_version` bump — deep-merge fills
defaults in for existing user configs.
## Validation
| Test surface | Result |
|---|---|
| tests/tools/test_env_probe.py | 10/10 pass (probe unit) |
| tests/run_agent/test_run_agent.py — new classes | 8/8 pass (integration) |
| TestToolUseEnforcementConfig | 17/17 pass (no regression) |
| TestBuildSystemPrompt | 9/9 pass (no regression) |
| TestInvalidateSystemPrompt | 2/2 pass (no regression) |
| tests/agent/test_prompt_builder.py | 124/124 pass (no regression) |
| tests/hermes_cli/ | 5662/5662 pass (config defaults) |
| E2E AIAgent build (broken env) | Both blocks present, 2,178 chars |
| E2E AIAgent build (clean env) | 771-char net overhead, env probe silent |
Adds an optional `messages` keyword to the `MemoryProvider.sync_turn`
contract so external/community memory plugins can receive the OpenAI-style
conversation message list for the completed turn — including assistant tool
calls and tool result content — not just the final assistant text.
Dispatch uses signature inspection (`_provider_sync_accepts_messages`): only
providers that declare a `messages` parameter (or `**kwargs`) receive it; all
existing in-tree providers keep their legacy text-only signature and are
called unchanged. No structured-trace envelope is added to core — providers
reconstruct whatever they need from the standard message list.
Also documents Memori as a standalone community memory provider.
Salvaged from #28065 — rebased onto current main.
Co-authored-by: Dave Heritage <david@memorilabs.ai>
The old test asserted that a non-MiniMax provider returning a generic
overflow (no provider-reported max) would step down to the 128K probe
tier. The salvaged fix from #33673 deliberately removes that step-down
because guessed tiers cause configured 1M sessions to silently shrink.
Update the test to assert the new contract: keep the configured 200K
window and rely on compression instead.
* fix(agent): fallback immediately on provider content-policy blocks
Provider safety-filter refusals (e.g. OpenAI Codex 'flagged for possible
cybersecurity risk', OpenAI moderation 'violates our usage policies',
Anthropic safety-system rejections, Azure content_filter) are
deterministic decisions about a specific prompt. Retrying the same
prompt up to api_max_retries times just reproduces the same refusal and
burns paid attempts before surfacing the generic 'API failed after 3
retries — <provider message>' to Telegram / cron with no indication that
the failure came from the model provider rather than Hermes itself.
Classify these as a new FailoverReason.content_policy_blocked
(non-retryable, should_fallback=True) and route them through the
existing is_client_error path so the loop:
- skips the 3x retry backoff
- activates a configured fallback model immediately
- emits a clear provider-safety message to the user (not the generic
'Non-retryable error (HTTP None)') and surfaces actionable guidance
when no fallback is configured (rephrase, narrow context, or set
fallback_model in hermes config)
- returns a final_response that explicitly tells the user this came
from the model provider, so gateway delivery is unambiguous and
cron last_status reflects the safety block rather than a vague
'agent reported failure'
Patterns are intentionally narrow — verbatim refusal phrasings keyed to
specific provider safety pipelines, not generic words like 'policy' or
'violation' that would collide with billing / format / auth errors.
Regression guards in test_18028_content_policy_blocked.py verify
billing 402s, generic 400s, and OpenRouter account-level
provider_policy_blocked remain distinct classifications.
Salvaged from #18164 onto current main (file restructure: loop logic
moved from run_agent.py to agent/conversation_loop.py, _emit_status →
_buffer_status), broadened patterns beyond the original OpenAI Codex
cybersecurity case to cover OpenAI moderation, Anthropic safety system,
and Azure content_filter; added user-actionable guidance and a clear
final_response so cron/gateway surfaces the policy block instead of a
generic non-retryable error, and added a regression-guard test module
mirroring the is_client_error predicate.
Addresses #18028.
Co-authored-by: Kuan-Chieh Huang <kchuang1015@users.noreply.github.com>
* chore: add kchuang1015 to AUTHOR_MAP
---------
Co-authored-by: Kuan-Chieh Huang <kchuang1015@users.noreply.github.com>
Users report that the CLI/gateway floods them with confusing retry chatter
during transient failures: a single 429 can produce 10+ "Provider/Endpoint/
Retrying in 5s..." lines before the request eventually succeeds. The same
firehose hits Telegram, Discord, Slack, etc. via _emit_status.
This patch defers all retry/fallback/compression status messages until we
know the outcome:
- if the turn ultimately succeeds (any path: primary recovers, fallback
activates, compression unsticks the request), the buffer is silently
dropped — the user sees nothing.
- if every retry and fallback exhausts and the turn fails, the buffer
is flushed at the terminal-failure return so the user sees the full
retry trace alongside the final error.
Backend logging (agent.log) is unchanged — every emission site still
writes to logger.warning/info, so post-mortem diagnosis is intact.
## What changed
run_agent.py: four new methods on AIAgent:
_buffer_status(msg) — defer an _emit_status call
_buffer_vprint(msg) — defer a _vprint(force=True) line
_clear_status_buffer() — drop pending messages on success
_flush_status_buffer() — replay pending messages on terminal failure
agent/conversation_loop.py:
- converted ~30 mid-process emit/vprint sites in the retry, fallback,
compression, empty-response, and stream-watchdog paths to the buffered
helpers
- added _flush_status_buffer() at every terminal-failure return so users
still see the trace when it actually matters
- added _clear_status_buffer() at the "non-empty assistant content"
point (NOT at "API call returned bytes" — empty responses still loop
through the empty-retry path and would otherwise lose their trace
between iterations)
- silenced the two "(´;ω;`) oops, retrying..." / "(╥_╥) error,
retrying..." spinner final-frame messages — the spinner now stops
cleanly so retries leave no visible residue
agent/chat_completion_helpers.py: same conversion for codex TTFB / stale-
stream / fallback-activation status messages.
agent/stream_diag.py: _emit_stream_drop now buffers instead of emitting
directly.
## Tests
tests/run_agent/test_retry_status_buffer.py: 7 unit tests covering
accumulate→flush, clear-on-success, mixed kinds, empty-buffer no-op,
re-buffer after flush, exception swallowing.
Updated 3 existing tests that mocked _emit_status to also mock (or use)
_buffer_status:
- tests/run_agent/test_run_agent.py::test_empty_response_emits_status_for_gateway
- tests/run_agent/test_stream_drop_logging.py (2 tests)
- tests/agent/test_codex_ttfb_watchdog.py (TTFB hint test)
## Validation
Live test: hermes chat -q against an unreachable endpoint with no fallback
exhausts retries and prints the full trace at the end. Same flow against
a working endpoint prints zero retry chatter.
api_messages is built once before the retry loop while the primary provider
is active. When a mid-conversation fallback switches to a require-side thinking
provider (DeepSeek/Kimi/MiMo), assistant turns built under a non-require primary
(e.g. Codex) go out without reasoning_content and the new provider rejects the
request with HTTP 400 ("reasoning_content must be passed back").
Re-apply the echo-back pad against the current provider immediately before
building the request kwargs. Idempotent and a no-op unless the active provider
enforces echo-back, so it covers all fallback paths without affecting normal or
reject-side operation.
Drafted by Claude (Opus 4.7) under human review while fixing a personal deployment.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Snapshot review_agent._session_messages before teardown so close() can
clean per-session state without dropping the user-visible
self-improvement summary. Adds two regressions:
- bg-review summarizer receives captured review-agent tool messages
after review_agent.close() runs
- context-compressor protected-head handoff rehydration populates
_previous_summary and keeps the old handoff out of newly summarized
turns
Salvaged from PR #26039 onto current main after agent/background_review.py
extraction. Original commit 63eaf6055; bg-review test updated to patch
the module-level summarize_background_review_actions in
agent.background_review instead of the now-forwarder
AIAgent._summarize_background_review_actions.
Salvages the transport-side fix from #32911 (@xxxigm). Closes#32892.
The openai SDK's responses.stream() / responses.parse() eagerly call
_make_tools(tools), which iterates tools without a None guard. Passing
tools=None raises TypeError: 'NoneType' object is not iterable before
any HTTP request is issued (openai==2.24.0).
PR #33042 already removed responses.stream() from our own Codex call
paths, so the specific iteration crash inside _make_tools is no longer
on the hot path. But the right API contract is to omit tools entirely
when there are no functions to expose — passing tools=None to the
backend is semantically wrong regardless of the SDK's iteration
behavior, and we'd hit it again on any future code path that hasn't
migrated off responses.stream().
This applies the transport-level part of @xxxigm's fix: move
'tools': response_tools into the if response_tools: branch so the
key is omitted when there are no tools, just like tool_choice and
parallel_tool_calls already are. Skips the run_agent.py-side
_strip_sdk_none_iterables helper from their PR — that path is now
obsolete because the SDK helper that needed defending is gone.
Tests
- tests/run_agent/test_codex_no_tools_nonetype.py: 6 tests trimmed
from @xxxigm's original 13-test file. Drops the obsolete tests for
_strip_sdk_none_iterables and _RecordingResponsesStream (helpers
that don't exist on main anymore), keeps the transport behavior
tests + the SDK contract sanity check that ensures we notice if
upstream ever fixes _make_tools(None).
- 6/6 passing locally.
Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com>
Salvages the intent of #33136 (@Brixyy) onto current main. The original PR
was written against the pre-refactor monolithic run_agent.py and added a
top-level _is_nonretryable_local_validation_error() helper. Both target
functions have since been extracted to agent/conversation_loop.py:2869,
so the salvage applies the equivalent guard inline at that canonical
location rather than reintroducing the helper.
## Why
After #33042 made our own Codex consumer structurally immune to NoneType
crashes, third-party shims, mocked clients, and any future code path that
hasn't migrated could still surface TypeError: 'NoneType' object is not
iterable as a wire-shape mismatch. The agent loop's classifier currently
treats ALL TypeError as a local programming bug and aborts non-retryable
— users on stale Telegram/gateway turns saw bare "Non-retryable error
(HTTP None)" with no recovery.
This is a provider/SDK shape mismatch, not a local programming bug. The
retry/fallback path should run, not be short-circuited.
## What
agent/conversation_loop.py: extend is_local_validation_error to exclude
TypeErrors whose message matches the NoneType-not-iterable shape (case-
insensitive, both "NoneType" and "not iterable" must appear).
tests/run_agent/test_jsondecodeerror_retryable.py:
- update the mirror predicate to match the production check
- add TestNoneTypeNotIterableIsRetryable class with 3 tests (the basic
shape, message variants, unrelated TypeErrors still abort)
- add TestAgentLoopSourceHasNoneTypeCarveOut to enforce the source-level
invariant matches the test mirror
## Validation
tests/run_agent/test_jsondecodeerror_retryable.py +
tests/run_agent/test_31273_402_not_retried.py → 14/14 passing
Co-authored-by: Brixyy <subrtt@gmail.com>
Closes#33163.
When _try_activate_fallback() switches from one provider to another (e.g.
openai-codex → openrouter), the credential pool still belongs to the
primary provider. This causes two compounding bugs:
1. The pool retains the primary's base_url. Downstream pool recovery
(rate_limit / billing / auth) calls _swap_credential() with a primary
entry which overwrites the agent's base_url back to the primary's
endpoint. Every fallback request then 404s against the wrong host.
2. Pool recovery acting on errors from the FALLBACK provider mutates the
PRIMARY's pool state (#33088 reported a related corruption pattern),
exhausting/rotating entries that have nothing to do with the failure.
Two layered fixes:
a) try_activate_fallback (agent/chat_completion_helpers.py): on fallback
activation, clear agent._credential_pool when the fallback provider
doesn't match the pool's provider. Pool is preserved when the fallback
shares the pool's provider (e.g. multiple openrouter entries).
b) recover_with_credential_pool (agent/agent_runtime_helpers.py):
defensive guard rejects any pool mutation when agent.provider doesn't
match pool.provider. Defense-in-depth — should never fire after (a)
is in place, but covers any future path that attaches a stale pool.
Salvaged from @zccyman's PR #33217. The original PR was written against
the pre-refactor monolithic run_agent.py; both target functions have
since been extracted to module-level helpers. Behavior is identical —
the guards live in the canonical extracted locations.
Tests
- New tests/run_agent/test_fallback_credential_isolation.py (7 tests
covering: fallback clears mismatched pool, fallback preserves matching
pool, recovery rejects mismatched pool, recovery accepts matching
pool, 429-from-z.ai-doesn't-exhaust-codex-pool, _client_kwargs
base_url survives pool clear, _swap_credential doesn't restore
primary URL after fallback).
- Cross-verified: 77/77 passing across fallback isolation tests +
agent/test_credential_pool.py — no regression.
Co-authored-by: zccyman <16263913+zccyman@users.noreply.github.com>
Closes#33175.
switch_model() in agent/agent_runtime_helpers.py mutated agent.model and
agent.provider before rebuilding the client, with no try/except to restore
them on failure. If the rebuild raised (bad API key, network error,
build_anthropic_client failure, etc.) the agent was left with the new
model+provider name paired with the OLD client — producing HTTP 400s like
"claude-sonnet-4-6 is not supported on openai-codex" on the next turn.
Callers in cli.py, gateway/run.py, and tui_gateway/server.py already catch
the exception and warn the user, but the warning was misleading because
the swap had partially succeeded; the agent's state was torn.
Snapshot every mutated field before the swap, wrap the swap+rebuild block
in try/except, and restore the snapshot on failure before re-raising so
the caller's warning surfaces.
Reported by @amirariff91. Tests cover both branches (chat_completions and
anthropic_messages) and the cross-branch case (anthropic -> openai).
reasoning.encrypted_content is sealed to the Responses endpoint that
minted it. When a session switches model providers mid-conversation —
say the user runs /model gpt-5.5 after several turns on grok-4.3, or
vice versa — the persisted codex_reasoning_items carry blobs the new
endpoint cannot decrypt, and every subsequent turn fails with HTTP 400
invalid_encrypted_content.
This is the cross-issuer prevention layer. Pairs with:
* PR #33035 — runtime recovery when the HTTP 400 fires anyway
* PR #33146 — prevention for transient rs_tmp_* items
Stamps each reasoning item with the issuer kind that minted it
(codex_backend / xai_responses / github_responses / other:<url>) at
normalize time, then drops items at replay time when the active
endpoint differs from the stamp. Unstamped (legacy) items pass
through for backwards compatibility.
Cherry-picked from @chaconne67's PR #31629. Conflict against current
main (#33035's replay_encrypted_reasoning parameter) resolved as
'keep both' — the two guards compose: replay_encrypted_reasoning=False
is the session-wide kill switch, current_issuer_kind is the per-item
filter that runs only when replay is still enabled.
* remove Vercel AI Gateway provider and Vercel Sandbox terminal backend
Both Vercel-hosted integrations are removed end-to-end. Users on the AI
Gateway should switch to OpenRouter or one of the other aggregators
(Nous Portal, Kilo Code). Users on the Vercel Sandbox backend should
switch to Docker, Modal, Daytona, or SSH.
What's removed:
- `plugins/model-providers/ai-gateway/` provider plugin
- `hermes_cli/vercel_auth.py` Vercel-Sandbox auth helper
- `tools/environments/vercel_sandbox.py` terminal backend
- `ai-gateway` provider wiring across auth, doctor, setup, models,
config, status, providers, main, web_server, model_normalize, dump
- `vercel_sandbox` backend wiring across terminal_tool, file_tools,
code_execution_tool, file_operations, approval, skills_tool,
environments/local, credential_files, lazy_deps, prompt_builder,
cli, gateway/run
- `AI_GATEWAY_BASE_URL` constant, `_AI_GATEWAY_HEADERS` auxiliary-client
header set, run_agent base-URL header/reasoning special-cases
- `[vercel]` pyproject extra and `vercel`/`vercel-workers` from uv.lock
- env vars: `AI_GATEWAY_API_KEY`, `AI_GATEWAY_BASE_URL`, `VERCEL_TOKEN`,
`VERCEL_PROJECT_ID`, `VERCEL_TEAM_ID`, `VERCEL_OIDC_TOKEN`,
`TERMINAL_VERCEL_RUNTIME`
- Tests: deletes test_ai_gateway_models.py and
test_vercel_sandbox_environment.py; scrubs references across 23
surviving test files (no entire tests deleted unless they were
dedicated to AI Gateway / Sandbox)
- Docs: provider tables, env-var reference, setup guides, security
notes, tool config, terminal-backend tables — English plus zh-Hans
i18n parity
- `hermes-agent` skill: provider table entry and remote-backend list
What stays (intentional):
- `popular-web-designs/templates/vercel.md` — CSS design reference,
unrelated to Vercel-the-AI-product
- `x-vercel-id` in `stream_diag.py` headers — generic Vercel CDN
response header, useful diag signal on any Vercel-hosted endpoint
- `vercel-labs/agent-browser` URL in browser config — lightpanda
browser project, different OSS effort
- `userStories.json` historical contributor entry mentioning Vercel
Sandbox — archive, not active docs
Validation:
- 1153 tests in the 22 targeted files pass (`scripts/run_tests.sh`)
- Full repo `py_compile` clean
- Live import of every touched module + invariant check (no
`ai-gateway` in `PROVIDER_REGISTRY`, no `_AI_GATEWAY_HEADERS`, no
`vercel_sandbox` in `_REMOTE_TERMINAL_BACKENDS`)
* test: convert profile-count check from change-detector to invariant
The hardcoded "== 34" assertion broke when ai-gateway was removed.
Per AGENTS.md change-detector-test guidance, assert the relationship
(registry count >= number of plugin dirs) instead of a literal count.
Counts shift when providers are added/removed; that's expected.
* refactor(codex): drop SDK responses.stream() helper; consume events directly
The OpenAI Python SDK's high-level `client.responses.stream(...)` helper
does post-hoc typed reconstruction from the terminal
`response.completed.response.output` field. The chatgpt.com Codex
backend has been observed (today, gpt-5.5) to ship `response.output =
null` on terminal frames, which crashes the SDK with `TypeError:
'NoneType' object is not iterable` mid-iteration.
Carlton's #32963 patched the symptom by wrapping the helper in
try/except and recovering from the same per-event accumulator the SDK
was supposed to populate. This PR removes the helper from the call
path entirely: we now use `client.responses.create(stream=True)` (raw
AsyncIterable of SSE events) and assemble the final response object
ourselves from `response.output_item.done` events as they arrive. The
terminal event's `output` field is never read for content. Same
strategy OpenClaw uses for the same backend.
This makes Hermes structurally immune to the bug class, not patched.
The next time OpenAI ships a shape change to chatgpt.com's terminal
frame, our consumer keeps working because it doesn't read that frame
for content — only for usage/status/id.
Changes
- `agent/codex_runtime.py`: new `_consume_codex_event_stream()` shared
consumer; `run_codex_stream()` uses `responses.create(stream=True)`;
`run_codex_create_stream_fallback()` collapses into a thin alias
since the primary path now does what the fallback used to do.
- `agent/auxiliary_client.py`: `_CodexCompletionsAdapter` uses the
same consumer; old null-output recovery helpers deleted as
unreferenced.
- Tests migrated: fixtures that mocked `responses.stream` now mock
`responses.create` returning a raw iterable. New regression test
asserts the auxiliary path returns streamed items even when the
terminal event's `output` is literally `null`.
Validation
- Live: tested against fresh OAuth on `chatgpt.com/backend-api/codex`
with `gpt-5.5` — response built correctly with `response.output=null`
on the terminal frame, all events consumed, usage/reasoning tokens
propagated.
- `tests/run_agent/test_run_agent_codex_responses.py` +
`tests/agent/test_auxiliary_client.py`: 242 passed.
* test+fix(codex): migrate streaming tests, raise on truncated streams
CI surfaced 10 test failures across tests/run_agent/test_streaming.py
and tests/run_agent/test_codex_xai_oauth_recovery.py — both files had
their own `responses.stream(...)` mocks I missed in the first sweep.
agent/codex_runtime.py: _consume_codex_event_stream() now raises
"Codex Responses stream did not emit a terminal response" when the
stream ends without any terminal frame AND no usable content. This
preserves the signal callers used to get from the SDK's high-level
helper, which they distinguished from "completed with empty body"
in error handling.
Tests migrated:
- test_streaming.py: text-delta callback, activity-touch, and
remote-protocol-error tests all switch from mocking responses.stream
to responses.create returning an iterable of events.
- test_codex_xai_oauth_recovery.py: prelude-error tests are recast as
wire-error-event tests (the new path raises _StreamErrorEvent
directly when the wire emits type=error, which is strictly better
than the old two-phase "SDK RuntimeError → retry → fallback"). The
retry-on-transport-error test moves from responses.stream side-effect
to responses.create side-effect.
Verified live against chatgpt.com Codex with gpt-5.5 — AIAgent.chat()
through the full codex_responses path returns correctly, 319/319
targeted tests passing.
* fix(codex-responses): gracefully recover from invalid_encrypted_content (salvage #10144)
When an OpenAI-compatible Responses API surface accepts an initial
request but later rejects the replayed `codex_reasoning_items`
encrypted blob with HTTP 400 `invalid_encrypted_content`, the
session previously got stuck retrying the same poisoned payload.
Recovery: classify the error as a dedicated FailoverReason, and on the
first hit disable encrypted reasoning replay for the rest of the
session, strip cached items from message history, and retry once.
Changes:
* error_classifier: add FailoverReason.invalid_encrypted_content
branch in _classify_400 (before context_overflow so the messages
that mention 'encrypted content … could not be verified' don't trip
context heuristics), in _classify_by_error_code, and extend
_extract_error_code to peek inside wrapped JSON in error.message and
ignore the bare '400' as a code.
* agent_init: initialize `_codex_reasoning_replay_enabled = True` on
every agent.
* run_agent: add AIAgent._disable_codex_reasoning_replay() helper
that flips the flag and pops cached items.
* codex_responses_adapter: thread a `replay_encrypted_reasoning`
kwarg through _chat_messages_to_responses_input so that when the
flag is False we don't replay codex_reasoning_items.
* transports/codex.py: read `replay_encrypted_reasoning` from params,
thread it into the adapter, and gate the
`include=['reasoning.encrypted_content']` request hint on it.
* chat_completion_helpers: pass the agent's replay flag through to
the transport.
* conversation_loop: in the retry loop, add an
invalid_encrypted_content recovery branch that fires once per
session, only when api_mode == codex_responses, only when replay is
still enabled, and only when at least one assistant message in
history actually carries cached reasoning items (otherwise the 400
has nothing to do with our cache and the normal retry path handles
it).
Tests:
* test_error_classifier: new wrapped-JSON _extract_error_code case;
new TestClassifyApiError cases proving the 400 is retryable with
no fallback, that the broad message match doesn't catch a generic
'parsed' message, and that the error code match is
case-insensitive.
* test_run_agent_codex_responses: end-to-end test of the recovery
branch firing once and disabling replay, plus a sibling test that
proves the branch does *not* fire (and the flag stays True) when
history has no cached reasoning items.
Salvages PR #10144 onto the post-refactor module layout
(error_classifier / codex_responses_adapter / transports/codex /
conversation_loop / agent_init) since the original diff was written
against the pre-refactor monolithic run_agent.py.
* chore(release): map victorGPT in AUTHOR_MAP for #10144 salvage
---------
Co-authored-by: victorGPT <wuxuebin1993@gmail.com>
After key #1 is marked exhausted the retry still called the API with key #1
due to env-var bias in _get_cached_client / resolve_api_key_provider_credentials.
Fix: peek the pool and pass the active entry's key as explicit_api_key.
Secondary: api_key_hint in mark_exhausted_and_rotate pins the correct entry
under concurrent CLI+gateway calls; _is_payment_error matches GoUsageLimitError;
extract_api_error_context parses "Resets in Xhr Ymin".
Closes#26145.
When the user interrupts the retry loop between two 429s (Ctrl-C in
interactive mode, /new, gateway disconnect), the local has_retried_429
flag dies with the recovery function. On the next user prompt the agent
restarts with has_retried_429=False, hits 429 on the exhausted credential,
sets the flag, returns 'retry once'. Repeat forever — the second 429 that
would trigger rotation is never reached, and healthy entries (priority>0
free/paid accounts) are never tried.
Fix: in recover_with_credential_pool's rate_limit branch, pre-check
pool.current().last_status before running the retry-once dance. If the
current entry is already STATUS_EXHAUSTED, rotate immediately. Uses
getattr() for the attribute read so existing tests with SimpleNamespace
mocks (which only set 'label') keep working.
Co-authored-by: zccyman <16263913+zccyman@users.noreply.github.com>
* fix(streaming): route mid-tool-call partial-stream-stub through length continuation (#31998)
When a stream stalls mid-tool-call (e.g. a large write_file), the
partial-stream-stub recovery used finish_reason='stop' which caused the
conversation loop to treat the turn as complete, returning only the
warning text. When users said 'continue', the model retried the same
large tool call, hit the same stale timeout, and looped indefinitely.
Changes:
- chat_completion_helpers.py: change _stub_finish_reason from 'stop' to
'length' for mid-tool-call partials. The stub still has tool_calls=None
so no tool auto-executes — the model gets a fresh API call through the
existing length-continuation machinery (bounded to 3 retries).
Also attach _dropped_tool_names to the stub for downstream use.
- conversation_loop.py: add a third continuation prompt branch for
partial-stream-stubs with dropped tool calls. Instead of the generic
'continue where you left off' (which would retry the same large call),
tell the model to break the output into smaller tool calls (~8K
tokens each) to avoid stream timeouts.
- test_partial_stream_finish_reason.py: update existing test from
finish_reason='stop' to 'length', add _dropped_tool_names assertion,
add new test_dropped_tool_call_uses_chunking_prompt for the 3-way
prompt branching.
Safety: tool_calls=None is preserved on the stub, so the conversation
loop enters the text-continuation branch (line 1513), NOT the tool-call
execution branch (line 3246). No tool auto-executes. The model simply
gets another API call with targeted guidance.
* refactor: extract constants and continuation prompt helper
- Move magic strings to hermes_constants.py (PARTIAL_STREAM_STUB_ID,
FINISH_REASON_LENGTH)
- Extract _get_continuation_prompt() in conversation_loop.py — DRYs the
3-way prompt branching and lets tests import the real function
- Trim verbose inline comments in chat_completion_helpers.py
- Tests import constants + helper instead of duplicating logic
---------
Co-authored-by: alt-glitch <balyan.sid@gmail.com>
The ChatGPT Codex backend (chatgpt.com/backend-api/codex) has historically
silently dropped certain model requests: the connection is accepted but no
stream events are emitted and no error is raised. PR #31967 lowered the
implicit stale-call default from 300s to 90s so fallbacks kick in faster,
but users still see an opaque "No response from provider for 90s
(non-streaming, ...)" message that gives no path forward.
This patch adds a narrow heuristic — gpt-5.5 family on the Codex backend
via codex_responses api_mode — that substitutes the generic timeout
message with actionable text naming the gpt-5.4-codex workaround and
pointing at #21444 for symptom history.
Changes:
- run_agent.py — new ``AIAgent._codex_silent_hang_hint(model=...)`` method.
Returns ``None`` for any request that does not match all three guards
(codex_responses api_mode, openai-codex provider or chatgpt.com Codex
base URL, gpt-5.5-family model name with word-boundary regex anchoring
to avoid false-positives on e.g. ``gpt-5.50``).
- agent/chat_completion_helpers.py — the non-stream stale-call site
consults the hint via ``getattr(...)`` so the call site stays robust
if the helper is ever removed or stubbed in tests. Hint is appended to
both the ``_emit_status`` warning and the ``TimeoutError`` message so
the user sees it in their terminal AND it lands in any retry-loop
diagnostics.
- tests/run_agent/test_codex_silent_hang_hint.py — 10 regression tests
covering positive cases (bare gpt-5.5, vendor-prefixed openai/gpt-5.5,
gpt-5.5-codex SKU, model=None fallback to self.model) and negative
cases (gpt-5.4-codex workaround, gpt-5.50 false-positive guard,
non-codex api_mode, non-codex provider, empty/None model, unrelated
models on Codex).
Does NOT fix the backend-side issue (that's an upstream OpenAI/ChatGPT
problem we cannot patch from here). Only converts an opaque timeout into
text that names the workaround so users do not have to dig through logs
or wait for a forum post to learn what to do.
Closes#22046
Codex / Responses-API requests had three latent timeout bugs that combined
into the long silent hangs reported on #21444:
1. The non-stream stale-call detector estimated context tokens from
``api_kwargs["messages"]`` only. Codex / Responses-API payloads carry
their conversational load in ``input`` (with ``instructions`` and
``tools``), so every Codex turn logged ``context=~0 tokens`` and the
detector never applied its >50k / >100k tier bumps.
2. ``providers.<id>.request_timeout_seconds`` was silently dropped on the
main Codex path. The chat_completions path and the auxiliary Codex
adapter both forwarded it; the main path skipped it through three
places (``build_api_kwargs``, ``ResponsesApiTransport.build_kwargs``,
``_preflight_codex_api_kwargs``).
3. The streaming stale detector had the same payload-shape bug for
``codex_responses`` requests, which route through the non-streaming
detector (it's the path that emits the user-facing
"No response from provider for 300s (non-streaming, ...)" warning that
reporters keep pasting).
This commit:
- Adds ``estimate_request_context_tokens`` in ``chat_completion_helpers``,
used by both the non-stream and stream detectors. Handles ``messages``
(Chat Completions), ``input + instructions + tools`` (Responses API),
bare lists, and an unknown-dict fallback.
- Forwards ``timeout`` through ``ResponsesApiTransport.build_kwargs``
and ``_preflight_codex_api_kwargs`` (with guards against
zero/negative/inf/bool values), and wires
``_resolved_api_call_timeout()`` into the Codex branch of
``build_api_kwargs``.
- Lowers the implicit non-stream stale defaults so fallback providers
kick in faster when upstream stalls:
* base 300s -> 90s
* >50k 450s -> 150s
* >100k 600s -> 240s
These only apply when the user has *not* set
``providers.<id>.stale_timeout_seconds`` or
``HERMES_API_CALL_STALE_TIMEOUT``. Explicit config still wins.
- Adds regression tests for the estimator shapes, the new defaults, the
context-tier scaling, transport timeout pass-through, and preflight
timeout pass-through / rejection of invalid values.
Closes#21444
Supersedes #21652#24126#31855
Co-authored-by: Hoang V. Pham <26063003+hehehe0803@users.noreply.github.com>
Follow-up to @someaka's fix.
Polish:
- Drop the redundant `_preflight_tokens >= threshold_tokens` clause.
`should_compress(tokens)` already short-circuits when tokens < threshold,
so the explicit comparison was dead code on the True branch.
Tests:
- Preflight: pin that should_compress() is called (anti-thrash has a vote).
Mocks should_compress to return False even with tokens past the raw
threshold and asserts no compression runs — exact bug shape from #29335.
- Gateway: AST scan of gateway/run.py asserts every
`session_entry.session_id = ...` assignment is followed by a
`session_store._save()` call within the same block. Three sites mutate
the session_id after compression; all three must persist or the next
turn loads the pre-compression transcript and re-loops. Empirically
verified the test catches the bug (drops the new _save() line → red).
AUTHOR_MAP:
- Map ed@bebop.crew -> someaka so the salvaged commit resolves to
@someaka in release notes.
Bring 313 commits of upstream main into the bb/gui dashboard
refactor branch. Eight conflicts resolved by hand, the rest
auto-merged. One missing class (_StreamErrorEvent) restored from
main after the auto-merger dropped it.
Conflict resolutions:
apps/dashboard/README.md take HEAD: main's text described
the pre-rename web/ layout that
bb/gui refactored away.
apps/dashboard/package.json combine: keep HEAD's @hermes/shared
workspace dep, take main's
@nous-research/ui 0.16.0 bump.
apps/dashboard/package-lock.json regenerate via
npm install --package-lock-only.
Root lock also regenerated; only
dashboard and apps/desktop entries
moved (apps/desktop version 0.0.1 →
0.0.2 to match bb/gui's
package.json bump).
apps/dashboard/src/pages/ take main (4 hunks): text-xs
EnvPage.tsx replaces text-[0.65rem] per the
typography rule HEAD's own README
documents.
hermes_cli/gateway.py take main (2 hunks): Discord
setup metadata moved to plugin
(architectural migration); s6
service-manager dispatch helpers
additive.
hermes_cli/main.py combine (2 hunks): take main's
Termux-aware
_sync_bundled_skills_for_startup;
combine gui + portal subcommands
in the known-subcommand list.
hermes_cli/web_server.py mixed (10 hunks):
- take main on _PUBLIC_API_PATHS
(bb/gui's own test asserts the
rescan endpoint must require auth)
- combine WS helpers: keep HEAD's
_ws_client_label + main's
Host/Origin guard + composing
_ws_request_is_allowed
- take HEAD's debug-level broadcast
drop log (matches the comment
"subscriber went away mid-send")
- take main's _safe_plugin_api_relpath
GHSA-5qr3-c538-wm9j fix and the
paired discovery-time validation
- take main's {name:path} route
converter for plugin visibility
tui_gateway/server.py take main: PR #31379's verbose-
args gating supersedes HEAD's
unconditional args dump on
tool.start.
Post-merge restoration:
run_agent.py restored class _StreamErrorEvent
(40 lines, from origin/main:288).
Auto-merge silently dropped it,
breaking imports in
agent/codex_runtime.py and three
test files
(test_codex_xai_oauth_recovery.py,
test_streaming.py). Restored
verbatim from main.
Sanity checks:
* git diff --check / --cached --check: clean (no stray markers)
* ast.parse + import on all touched .py files: clean
* targeted pytest on resolved files: 756 passed, 1 pre-existing
Windows-curses failure unrelated to the merge
* full pytest_parallel run: 105 files / 391 failures vs baseline
98 files / 346. Differential vs origin/bb/gui shows all 11
"new" failure files come from main's added tests/code and
reproduce identically against origin/main on the same Windows
host (pure Windows path-separator / perms / git-bash issues
in upstream tests, not merge regressions). 4 baseline
failures fixed: 3 in test_codex_xai_oauth_recovery (the
_StreamErrorEvent restoration), 1 each in test_pairing,
test_runner_startup_failures, test_stream_consumer.
* sentinel-token sweep on main's eight largest commits:
every audited symbol present in the merged tree at expected
counts (TTSProvider 61, NtfyAdapter 29, S6ServiceManager 70,
install_bws 12, security_audit 16, register_image_gen_provider
23, list_profile_gateways 22, DISCORD_FREE_RESPONSE_CHANNELS
48, …).
* byte-diff sweep: 30/30 sampled main-only-modified files
byte-identical to origin/main; the four bb/gui-only files
that drifted (i18n/types.ts, i18n/ru.ts, ThemeSwitcher.tsx,
ToolCall.tsx) correctly absorbed main's web/ → apps/dashboard/
edits through git's rename detection (main's added lines all
present, removed lines all absent).
Two-layer redaction at the persistence boundary so credentials never reach
state.db, session_*.json, or compression:
1. agent/chat_completion_helpers.py :: build_assistant_message
- Redact assistant content before the message dict is constructed
(catches PATs / API keys the model inlines into natural language)
- Redact tool_call.function.arguments at the same site (catches secrets
inlined into tool args, e.g. terminal command=curl -H 'Authorization: ...')
Tool execution uses the raw API response object, not this dict, so
redacting the persisted shape is safe.
2. run_agent.py :: _save_session_log
- Add _redact_message_content() static helper that handles both string
content and OpenAI/Anthropic multimodal list-of-parts (image parts
pass through untouched, only text/content fields are redacted)
- Apply to every message + the cached system prompt before writing
session_*.json
Both layers respect HERMES_REDACT_SECRETS via redact_sensitive_text —
no-op when disabled.
Tests (TestSaveSessionLogRedactsSecrets, 4 cases):
- api key in tool content
- api key in user message
- api key in system prompt
- multimodal list-of-parts (image part preserved, text redacted)
Tests use an autouse fixture to force _REDACT_ENABLED=True because the
hermetic conftest defaults the env var to false.
Salvaged from PR #24758 by @vgocoder (build_assistant_message + session_log)
+ PR #19855 by @liuhao1024 (multimodal list helper, system_prompt redaction).
Kept only the redaction concern from #19855; its unrelated whatsapp npm
timeout + PATCH_SCHEMA changes are out of scope and dropped.
Refs #19798 (PAT leak via assistant inline mention), #19845 (session capture
credential leak).
Co-authored-by: liuhao1024 <liuhao03@bilibili.com>
Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
Closes#31273.
HTTP 402 (insufficient credits) was retried up to agent.api_max_retries
times (default 3), burning paid requests against an exhausted balance.
Real-world impact: ~$40 in 48h on a 24/7 Telegram+Discord gateway.
Root cause: FailoverReason.billing was in the is_client_error
exclusion set in agent/conversation_loop.py, which prevents the
non-retryable-abort branch from firing.
By the time control reaches that predicate:
* credential-pool rotation has already run for billing and either
continued the loop or returned False (pool exhausted/absent)
* the eager-fallback branch has also fired on billing and either
continued the loop or fell through (no fallback configured)
Falling through to the backoff retry from here has no recovery
mechanism left — it just burns more paid requests. Removing billing
from the exclusion set makes 402 abort cleanly once pool+fallback
recovery has failed, mirroring how 401/403 (also should_fallback=True)
already behave.
Added tests/run_agent/test_31273_402_not_retried.py which mirrors the
is_client_error predicate shape from the source and asserts the
invariant (plus a source-inspection guard against accidental
re-introduction).
Regression guard for #30770 — verifies the guardrail-halt branch in
agent/conversation_loop.py pushes the synthesized halt message through
stream_delta_callback before breaking out of the loop. Without the
emit, chat-completions SSE writers drain an empty queue and clients
(Open WebUI, etc.) see a finish chunk with zero content delta —
indistinguishable from a crash.
Verified: the test fails when the production fix is reverted.