* perf(config): add load_config_readonly() fast path for hot agent loop
`load_config()` is called from the agent loop's per-API-call hot path via
`get_provider_request_timeout()` and `get_provider_stale_timeout()` —
both invoked once per turn from `_resolved_api_call_timeout()` in
run_agent.py.
Profiling a synthetic 20-tool-call agent run revealed:
- 21 invocations of `load_config()` cumulating 56ms (~17% of agent loop)
- 34,398 deepcopy calls totaling 37ms (config defensive deepcopy + chain)
- 8,652 `_expand_env_vars` invocations (~412 per turn)
Microbench (cache-hit, real config.yaml present):
load_config() 265us/call (125us deepcopy + 140us infra)
load_config_readonly() 138us/call (~48% faster)
`load_config_readonly()` returns the cached dict directly without the
defensive deepcopy. Documented contract: caller must not mutate. Returns
plain dict (not MappingProxyType) so downstream `isinstance(x, dict)`
guards keep working — caught during initial implementation when
MappingProxyType broke get_provider_request_timeout's guard logic.
Wired into hermes_cli/timeouts.py (the two functions called per agent
turn). load_config() is unchanged for the 263 other call sites that
mutate the result before save_config(), are not in the hot path, or
where the safety guarantee matters more than the perf.
Profile A/B (cached config, 21-turn agent loop):
BEFORE AFTER delta
get_provider_request_timeout 55ms 16ms -71%
total function calls 399k 160k -60%
deepcopy calls (in hotspots) 34,398 ~0 ~elim
Verified:
- isinstance(load_config_readonly(), dict) is True
- timeout/stale resolutions correct
- load_config() still returns isolated mutable deepcopies
- tests/hermes_cli/test_config*.py / test_timeouts.py: 102/102 pass
- tests/cli/ + tests/agent/test_auxiliary_client.py: 883/883 pass
* perf(redact): substring pre-screens skip non-matching regex chains
Every log record passes through `RedactingFormatter.format` which calls
`redact_sensitive_text`, which historically ran ALL 13 secret-pattern
regexes against every line — including DB connection strings, JWTs,
Discord mentions, Signal phone numbers, etc. — even for typical clean
log records like 'INFO run_agent: API call completed'.
Add cheap substring pre-checks before each regex pass. False positives
still run the regex (which then matches nothing); false negatives are
impossible because every pattern requires the gated substring to match
its leading anchor:
- `_PREFIX_RE` gated on any of 33 known credential prefix substrings
- `_ENV_ASSIGN_RE` gated on `=` in text
- `_JSON_FIELD_RE` gated on `:` and `"` in text
- `_AUTH_HEADER_RE` gated on `uthorization`/`UTHORIZATION` in text
- `_TELEGRAM_RE` gated on `:` in text
- `_PRIVATE_KEY_RE` gated on `BEGIN` and `-----`
- `_DB_CONNSTR_RE` gated on `://` in text
- `_JWT_RE` gated on `eyJ` in text
- URL userinfo/query gated on `://`
- `_redact_form_body` gated on `&` and `=`
- `_DISCORD_MENTION_RE` gated on `<@`
- `_SIGNAL_PHONE_RE` gated on `+`
Microbench (5 typical log records, 20k iterations each):
BEFORE AFTER delta
redact_sensitive_text per call 5.63us 1.79us -68%
Real-world impact: ~244 log records emitted in a 30-turn agent loop, so
the chain saves ~1ms of CPU per conversation. Bigger win is the
reduction in regex execution and GC pressure during heavy logging
sessions (verbose logging, gateway message processing).
Security regression test: 30 secret-containing inputs (sk-/ghp_/JWT/DB
connstr/Auth-Bearer/private key/URL userinfo/Discord/Signal/etc.)
verified to produce identical redacted output before/after. All 75
existing tests/agent/test_redact.py cases pass.
The `?access_token=foo&code=bar` (bare query string, no scheme) case
that 'leaks' is pre-existing behavior — the URL query redaction
requires a well-formed URL with scheme+host. Not a regression.
* perf(run_agent): cache _needs_thinking_reasoning_pad result per (provider, model, base_url)
Profile of a 31-turn synthetic agent run shows `_needs_thinking_reasoning_pad`
fires 495 times (~16 per turn) and each call ran 3 helper methods, each
hitting `base_url_host_matches` 1-4 times via `urlparse`. Total cost:
3,342 base_url_host_matches calls + 3,373 urlparse calls accounting for
~36ms of agent-loop overhead (~7% of the entire post-network work).
Provider / model / base_url don't change during a conversation except via
`switch_model` and fallback activation — both of which already overwrite
those attributes atomically. Cache the result on a tuple key; since the
key is derived from the very fields that would change, the cache
auto-invalidates on the next read after a switch. No manual invalidation
needed in switch_model / _try_activate_fallback.
Profile A/B (31-turn cached-config agent run):
BEFORE AFTER delta
_needs_thinking_reasoning_pad cum 18ms 1ms -94%
_copy_reasoning_content_for_api cum 17ms 1ms -94%
base_url_host_matches calls 3,342 372 -89%
urlparse calls 3,373 403 -88%
total function calls 296k 223k -25%
Verified:
- tests/run_agent/test_deepseek_reasoning_content_echo.py: 36/36 pass
- tests/run_agent/ (full): 1383/1383 pass + 3 skipped
GLM models via Ollama report finish_reason='stop' even when the
response was truncated by max_tokens. The continuation mechanism
uses _has_natural_response_ending() as one of the heuristics to
detect whether the response was genuinely finished.
Currently only ASCII punctuation and CJK punctuation are recognized.
This means any response ending with an emoji (e.g. ⚡, 👍) or the
caret character ^ (common in French ^^ smiley) is not recognized as
naturally ended, triggering a false-positive continuation where the
model receives 'Continue where you left off' and produces garbled
output.
Add:
- ^ (caret) to the punctuation set
- Unicode emoji range (codepoint >= 0x1F300) as natural ending
This only affects GLM/Ollama users but the fix is safe for all
backends since _has_natural_response_ending() is only consulted
inside the continuation flow.
When auxiliary compression's summary generation returns None (aux model
errored, returned non-JSON, timed out, etc.) the compressor previously
still dropped every middle message between compress_start..compress_end
and replaced them with a static 'Summary generation was unavailable'
placeholder. The session kept going but the user silently lost N turns
of context for nothing.
New behavior: on summary failure, compress() aborts entirely — returns
the input messages unchanged and sets _last_compress_aborted=True. The
existing _summary_failure_cooldown_until gate (30-60s) keeps the aux
model from being burned on every turn. Auto-compress callers detect
the no-op (len(after) == len(before)) and stop looping. The chat is
'frozen' at its current size until the next /compress or /new.
Manual /compress (CLI + gateway) now passes force=True which clears
the cooldown so users can retry immediately after an auto-abort. If
the manual retry also fails, the user gets a visible warning telling
them nothing was dropped and how to retry.
- agent/context_compressor.py: compress() gains force= kwarg; failure
branch sets _last_compress_aborted and returns messages unchanged
instead of inserting placeholder.
- run_agent.py: _compress_context() detects abort, surfaces warning,
skips session-rotation entirely, returns messages unchanged.
- cli.py + gateway/run.py: manual /compress paths pass force=True.
- gateway/run.py: hygiene + /compress handlers detect _last_compress_aborted
and emit the new 'Compression aborted' warning (gateway.compress.aborted)
instead of the old 'N historical messages were removed' message.
- locales/*.yaml: new gateway.compress.aborted key in all 16 locales.
- tests: updated to assert the abort contract (messages preserved,
compression_count not incremented, abort flag set, no placeholder
leaked). New test_force_true_bypasses_failure_cooldown covers the
manual-retry path.
Six days after #23937 (608 fixes) the codebase had accumulated 241 new
PLR6201 violations. Same mechanical `x in (...)` → `x in {...}` fix,
same zero-risk profile: set lookup is O(1) vs O(n) for tuple and the
two are semantically equivalent for hashable scalar membership tests.
All 241 instances fixed via `ruff check --select PLR6201 --fix
--unsafe-fixes`, zero remaining. Every changed value is a hashable
scalar (str/int/None/enum/signal); no risk of unhashable runtime
errors. No behavior change.
Test plan:
- 119 files changed, +244/-244 (net zero) — exactly one-line edits
- `ruff check` clean afterward
- Compile checks pass on the largest touched files (cli.py, run_agent.py,
gateway/run.py, gateway/platforms/discord.py, model_tools.py)
- Subset broad test run on tests/gateway/ tests/hermes_cli/ tests/agent/
tests/tools/: 18187 passed, 59 pre-existing failures (verified against
origin/main with the same shape — identical failure count, identical
category — all xdist test-order flakes unrelated to this change)
Follows the same template as PR #23937 ([tracker: #23972](https://github.com/NousResearch/hermes-agent/issues/23972)).
Original commit 2b193907d by Teknium added a new module-level
_StreamErrorEvent class and threaded its raise into
_run_codex_create_stream_fallback in pre-refactor run_agent.py.
- _StreamErrorEvent class → run_agent.py (module-level, next to
_qwen_portal_headers; class needs to be top-level for the codex
runtime to import it)
- The fallback event-loop's 'type=error' handler → agent/codex_runtime.py
where run_codex_create_stream_fallback now lives. Imports
_StreamErrorEvent lazily from run_agent to avoid circular import.
Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
Original commit 9c304a7f5 by helix4u targeted _flatten_exception_chain,
_summarize_api_error, and the _call streaming retry loop in pre-refactor
run_agent.py. Re-applied to:
- New _is_provider_stream_parse_error helper → run_agent.py (next
to _flatten_exception_chain in the AIAgent class)
- _summarize_api_error early-return for the malformed-streaming
ValueError → run_agent.py (kept method body)
- _call streaming retry: _is_stream_parse_err flag wired into
_is_transient AND the post-exhaustion branch + dedicated
malformed-streaming user-status string → agent/chat_completion_helpers.py
(the _call body now lives there)
Co-authored-by: helix4u <4317663+helix4u@users.noreply.github.com>
Collapses the four-commit xAI entitlement-403 chain to its final
on-main state, ported to the post-refactor module layout:
- Added _is_entitlement_failure on AIAgent (run_agent.py) — detects
Grok subscription-shape 403s on (401|403|None) status codes.
- Added entitlement-skip branch to recover_with_credential_pool
(agent/agent_runtime_helpers.py) — breaks the refresh-loop that
Don's 100-iteration trace exposed when a Premium+ user hit a real
entitlement issue.
- Removed _decorate_xai_entitlement_error and unwrapped its two
_summarize_api_error call sites — xAI's own body text already
points users at grok.com/?_s=usage so we surface that verbatim
(dffb602f3 reasoning: X Premium subs DO now work per xAI's
2026-05-16 announcement, so editorialising would misdirect).
- grok-4.3 1M context entry landed in agent/model_metadata.py
via the prior merge — no additional port needed.
Tests already on disk (tests/run_agent/test_codex_xai_oauth_recovery.py)
assert _is_entitlement_failure shape and verbatim body surfacing.
Closes#27110.
Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
Original commit 31ba2b0cb by Teknium targeted run_codex_stream() at
its pre-refactor location in run_agent.py. Re-applied:
- Prelude error retry/fallback → agent/codex_runtime.py (in
run_codex_stream where the body now lives)
- _decorate_xai_entitlement_error helper + _summarize_api_error
wrapping → run_agent.py (these methods remained on AIAgent
as @staticmethod's; cherry-pick applied them cleanly)
The xai-oauth provider gate, encrypted_content drop on replay, etc.
landed in agent/codex_responses_adapter.py via the prior merge from main.
Closes#8133, #14634
Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
Original commit 13c3d4b4e by kchantharuan touched __init__ and
_apply_client_headers_for_base_url in pre-refactor run_agent.py. Re-applied to:
- __init__: agent/agent_init.py (3 hunks — NVIDIA branch + _custom_headers
fallback in routed-client and fallback-client paths)
- _apply_client_headers_for_base_url: still in run_agent.py (1 hunk)
build_nvidia_nim_headers was already present in agent/auxiliary_client.py
from the prior merge — no additional port needed.
Co-authored-by: kchantharuan <kchantharuan@nvidia.com>
Original commit b62c99797 by Jaaneek targeted six locations in
pre-refactor run_agent.py. Re-applied to the extracted post-PR locations:
- api_mode dispatch → agent/agent_init.py
- is_xai_responses build_api_kwargs → agent/chat_completion_helpers.py
- codex_auth_retry block + 401 hint → agent/conversation_loop.py
- _try_refresh_codex_client_credentials body → run_agent.py (kept)
The non-run_agent.py portions of the commit (auxiliary_client, codex
transport, hermes_cli/auth, tools/xai_http, tests, docs) merged cleanly
from main via the prior merge commit.
Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>
previously only checked provider ID and
base URL. When kimi-k2.6 is served via ollama-cloud (or any third-party
provider), provider is not 'kimi-coding' and base URL is not
api.kimi.com — so reasoning_content pad was never injected. This caused
HTTP 400 from Ollama Cloud's Go backend: 'invalid message content type:
map[string]interface {}'.
Fix: add model-name detection ('kimi' in model.lower()) so any route
serving a kimi model gets the required reasoning_content echo-back.
Refs the 400/401 Telegram errors where kimi-k2.6 via ollama-cloud
consistently failed after tool-call turns.
(cherry picked from commit 9a9f8a6d99)
Four fixes from PR #27248 review:
1. **__init__ forwarder is now keyword-forwarded** (daimon-nous review).
Previously the run_agent.AIAgent.__init__ wrapper forwarded all 64
params positionally to agent.agent_init.init_agent, so adding a
65th param on main would require three lockstep edits (signature,
init_agent signature, forwarder call) or silently shift every value.
Keyword forwarding makes this trivially safe — adding a param now
only needs the two signatures and one extra keyword line.
2. **Drop dead _ra() in agent/codex_runtime.py** (daimon-nous + Copilot).
The lazy run_agent reference was defined but never called inside
this module — the codex paths use agent.* accessors only.
3. **Drop unused imports in agent/codex_runtime.py** (Copilot):
contextvars, threading, time, uuid, Optional. Carried over from
run_agent.py during the original extraction.
4. **Tighten three source-introspection test guards** (Copilot):
- test_memory_nudge_counter_hydration.py — was scanning the
concatenated source of run_agent.py + agent/conversation_loop.py
and matching self.X or agent.X form. Now asserts the
hydration block lives in agent/conversation_loop.py specifically
with the agent.X form — the body never moves back, so if it
ever drifts a future re-introduction fails the guard.
- test_run_agent.py::TestMemoryNudgeCounterPersistence — anchor on
agent.iteration_budget = IterationBudget exactly (was just
iteration_budget = IterationBudget) so an unrelated identifier
ending in iteration_budget can't match.
- test_run_agent.py::TestMemoryProviderTurnStart — assert the
agent._user_turn_count form directly (the extracted body uses
agent.X, not self.X — accepting either was a transitional fudge).
- test_jsondecodeerror_retryable.py — scan agent/conversation_loop.py
only, not the concatenation.
Not addressed in this commit:
* Pre-existing bugs in agent/tool_executor.py (heartbeat index
mismatch when calls are blocked, _current_tool clobber in result
loop, blocked-counted-as-completed in spinner summary, dead
result_preview computation). These were preserved byte-for-byte from
the original _execute_tool_calls_concurrent — worth a separate
follow-up PR with proper tests.
* _OpenAIProxy.__instancecheck__ concern — pre-existing, not flagged
by any of the original test patches (nothing actually does
isinstance(x, OpenAI) against the proxy instance).
* agent_init.py:949 mem_config potential NameError — pre-existing;
only triggers if _agent_cfg.get('memory', {}) itself raises, which
it can't with a stock dict.
tests/run_agent/ + tests/agent/: 4313 passed, 1 pre-existing
test_auxiliary_client failure (unchanged).
run_agent.py: 3821 -> 3937 lines (+116 from the keyword-forwarded
init call's verbosity). Final: 16083 -> 3937 (-12146, 75% reduction).
previously only checked provider ID and
base URL. When kimi-k2.6 is served via ollama-cloud (or any third-party
provider), provider is not 'kimi-coding' and base URL is not
api.kimi.com — so reasoning_content pad was never injected. This caused
HTTP 400 from Ollama Cloud's Go backend: 'invalid message content type:
map[string]interface {}'.
Fix: add model-name detection ('kimi' in model.lower()) so any route
serving a kimi model gets the required reasoning_content echo-back.
Refs the 400/401 Telegram errors where kimi-k2.6 via ollama-cloud
consistently failed after tool-call turns.
Pass skip_memory=True to the AIAgent constructor used by
_spawn_background_review() so the review fork's __init__ no longer
rebuilds a _memory_manager wired to honcho / mem0 / supermemory /
etc. under the parent's session_id.
Before this change, the review fork ingested its harness prompt
(the 'Review the conversation above and update the skill library...'
text) into the user's real memory namespace via three sites in
run_conversation():
- on_turn_start(turn_count, prompt) cadence + turn-message
- prefetch_all(prompt) recall query
- sync_all(prompt, review_output, ...) harness + review output
recorded as a
(user, assistant) pair
Built-in MEMORY.md / USER.md state is still rebound from the parent
right after construction, so memory(action='add') writes from the
review continue to land on disk; only the external-plugin side
effects are removed.
Reported by @Utku.
The same root cause as the auxiliary compression fix (commit 7becb19):
get_model_context_length() is called without custom_providers, so per-model
context_length overrides are silently skipped. The fallback activation path
(_try_activate_fallback) had the same missing parameter.
When the agent switches to a fallback provider, the fallback model would use
the models.dev value (e.g. 204800 for NVIDIA NIM minimax-m2.7) instead of
the user-configured one in custom_providers (e.g. 196608) — a subtle
discrepancy that could cause the fallback model to run with an incorrect
context window, leading to truncated messages or failed API requests when
the model does not support the detected length.
Fix: pass self._custom_providers to get_model_context_length() so the
fallback path sees the same per-model overrides as the main model path.
The largest method left on AIAgent (60+ parameters, the entire startup
sequence — credential resolution, provider auto-detection, context
engine bootstrap, memory store hydration, plugin lifecycle hooks)
moves into agent/agent_init.py.
AIAgent.__init__ is now a thin wrapper that calls
agent.agent_init.init_agent(self, ...) with the original full
parameter list preserved.
Module-level run_agent names referenced in the body (_openrouter_prewarm_done,
_qwen_portal_headers, _routermint_headers, _hermes_home, OpenAI,
get_tool_definitions, check_toolset_requirements) are resolved through
_ra() so test patches on those names keep working. agent_init's logger
warnings are routed via _ra().logger so tests patching run_agent.logger
capture them (TestStringKSuffixContextLengthWarns,
TestCustomProvidersInvalidContextLengthWarns).
Live E2E reconfirmed on three model paths (openai/gpt-5.4,
anthropic/claude-sonnet-4.6, moonshotai/kimi-k2-thinking).
tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing
test_auxiliary_client failure).
run_agent.py: 5944 -> 4564 lines (-1380).
Total reduction since baseline: 16083 -> 4564 (-11519, 72%).
The 3,877-line run_conversation body — the agent loop itself — moves out
of run_agent.py into a dedicated module. AIAgent.run_conversation is
now a thin forwarder that delegates to agent.conversation_loop.run_conversation
with the AIAgent instance as the first argument.
This is the largest single extraction in the run_agent.py refactor.
The body keeps all 163 self.X references intact (rewritten as agent.X),
all nested closures, all retry/backoff/compression machinery. Symbols
that tests or callers patch on run_agent (_set_interrupt,
handle_function_call, AIAgent class attrs) are resolved through _ra()
inside the extracted module so the patch surface is preserved.
Five tests doing inspect.getsource(AIAgent.run_conversation) updated to
scan agent.conversation_loop.run_conversation. Two source-introspection
tests (TestMemoryNudgeCounterPersistence, TestMemoryProviderTurnStart)
updated to accept either self.X (legacy) or agent.X (extracted
form) in the matched assertions.
Live E2E verified on three model paths:
* openai/gpt-5.4 (OpenAI chat completions via OpenRouter)
* anthropic/claude-sonnet-4.6 (Anthropic Messages via OpenRouter)
* moonshotai/kimi-k2-thinking (reasoning model, reasoning_content path)
Plus read_file tool execution, terminal tool, web_search.
tests/run_agent/ + tests/agent/: 4313 passed, 1 pre-existing failure
(test_auxiliary_client::test_custom_endpoint... — same as on main).
run_agent.py: 9800 -> 5944 lines (-3856).
Total reduction since baseline: 16083 -> 5944 (-10139, 63%).
The three big review-prompt strings (_MEMORY_REVIEW_PROMPT,
_SKILL_REVIEW_PROMPT, _COMBINED_REVIEW_PROMPT — 183 lines combined) move
out of the AIAgent class body and into agent/background_review.py where
they're consumed.
AIAgent re-exposes them as class attributes via 'from ... import' inside
the class body — Python binds those names into the class namespace so
existing AIAgent._MEMORY_REVIEW_PROMPT references keep working.
spawn_background_review_thread also falls back to the module-level
constants if an agent doesn't have the attribute (preserves the test
pattern of mocking these on the agent).
tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing
test_auxiliary_client failure).
run_agent.py: 9986 -> 9800 lines (-186).
Move _interruptible_streaming_api_call out of run_agent.py — the biggest
single method in the file. Body lives next to interruptible_api_call
in agent/chat_completion_helpers.py so streaming + non-streaming code
share one home.
Nested closures (_call_chat_completions, _call_anthropic, the codex
stream branch) all come along with the body and still capture the
parent function's locals as expected.
AIAgent keeps a thin forwarder method. is_local_endpoint added to
the import block (used by the stream stale-timeout disable logic).
One source-introspection test in TestAnthropicInterruptHandler is
updated to scan agent.chat_completion_helpers.interruptible_streaming_api_call
instead of AIAgent._interruptible_streaming_api_call.
tests/run_agent/ + tests/agent/: 4312 passed (same pre-existing
test_auxiliary_client failure).
run_agent.py: 12277 -> 11385 lines (-892).
Move the two big tool-dispatch methods out of run_agent.py:
* execute_tool_calls_concurrent — 408-line concurrent path (interrupt
pre-flight, guardrail+plugin block, callback fan-out, ContextVar-
preserving ThreadPoolExecutor, periodic heartbeats for the gateway
inactivity monitor, per-tool result handling with subdir hints +
guardrail observations + checkpoint, /steer drain)
* execute_tool_calls_sequential — 441-line sequential path (the
original behavior used for single-tool batches and interactive
tools)
Both take the parent AIAgent as their first argument; AIAgent keeps
thin forwarders so call sites unchanged. handle_function_call is
routed through _ra() so tests that patch run_agent.handle_function_call
keep working. _set_interrupt likewise.
The AST guard in test_tool_executor_contextvar_propagation.py is
updated to scan both run_agent.py AND agent/tool_executor.py so it
still catches the executor.submit(_run_tool, ...) regression
regardless of which file the body lives in.
tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing
test_auxiliary_client failure as before).
run_agent.py: 14309 -> 13461 lines (-848).
Move the background-review subsystem (the self-improvement loop — see the
README) out of run_agent.py into a dedicated module.
* summarize_background_review_actions — was the @staticmethod that builds
the user-facing action summary
* spawn_background_review_thread — builds the thread target + prompt;
the actual review loop body (forked AIAgent, runtime inheritance,
tool whitelist, suppression, teardown) lives in _run_review_in_thread
* build_memory_write_metadata — provenance for external memory mirrors
AIAgent keeps thin wrappers for backward compatibility AND because tests
patch run_agent.threading.Thread to assert lifecycle behavior — the
threading.Thread construction stays in AIAgent._spawn_background_review,
the inner work moves out.
tests/run_agent/ + tests/agent/: 4313 passed, 1 pre-existing failure
(test_auxiliary_client.py::test_custom_endpoint... — confirmed failing
on main before this change). 3 skipped.
run_agent.py: 15272 -> 14972 lines (-300).
Three small extractions into focused modules:
* agent/process_bootstrap.py — \_OpenAIProxy (lazy openai.OpenAI import),
\_SafeWriter (broken-pipe-resistant stdio wrapper), \_install_safe_stdio,
\_get_proxy_from_env, \_get_proxy_for_base_url. All process / IO bootstrap.
* agent/iteration_budget.py — IterationBudget class (thread-safe consume/
refund counter shared by parent agent and subagents).
run_agent re-exports every name so existing test patches like
patch('run_agent.OpenAI', ...) and 'from run_agent import IterationBudget'
keep working unchanged. Verified the patch-rebinding contract for OpenAI
explicitly.
tests/run_agent/ + tests/agent/test_gemini_fast_fallback.py:
1347 passed, 3 skipped.
run_agent.py: 15427 -> 15261 lines (-166).
Pull the 10 pure sanitization/repair helpers (\_sanitize_surrogates,
\_sanitize_structure_surrogates, \_sanitize_messages_surrogates,
\_escape_invalid_chars_in_json_strings, \_repair_tool_call_arguments,
\_strip_non_ascii, \_sanitize_messages_non_ascii, \_sanitize_tools_non_ascii,
\_strip_images_from_messages, \_sanitize_structure_non_ascii) and the
\_SURROGATE_RE constant out of run_agent.py into a new module.
These are stateless byte-walking helpers with no AIAgent dependency.
Backward compatibility: run_agent re-exports every name via a single
import block, so existing 'from run_agent import _sanitize_surrogates'
imports in tests and cli.py keep working unchanged. Same pattern the
file already uses for _summarize_user_message_for_log (codex_responses_adapter).
run_agent.py: 16077 -> 15682 lines (-395).
In long-lived interactive sessions, _try_activate_fallback() advances
_fallback_index before attempting client resolution. When resolution
fails (provider not configured, etc.) the function returns False without
ever setting _fallback_activated=True. _restore_primary_runtime() then
skips its reset block entirely (guarded by `if not _fallback_activated`),
leaving _fallback_index >= len(_fallback_chain) for all subsequent turns.
The eager-fallback guard at the top of the retry loop checks
`_fallback_index < len(_fallback_chain)`, so the condition fails silently
and no fallback is ever attempted again for that session.
Cron jobs spawn a fresh AIAgent per run and never hit this path, which is
why the same fallback chain works reliably for cron but not interactive.
Fix: reset _fallback_index=0 in the `not _fallback_activated` early-return
branch so every new turn starts with the full chain available.
Fixes#20465
xAI's Responses stream emits 'type=error' as the FIRST SSE frame when an
OAuth account is unsubscribed/exhausted or rejects the encrypted-reasoning
replay introduced in the May 2026 SuperGrok rollout. The SDK helper
raises RuntimeError(Expected to have received response.created before
error), which the caller correctly routes to
_run_codex_create_stream_fallback. The fallback then opens a new stream
that emits the same 'error' frame — but the fallback loop only handled
{response.completed, response.incomplete, response.failed} and silently
continue'd past 'error' events. Result: the loop fell off the end of
the stream and raised the useless 'fallback did not emit a terminal
response' RuntimeError, which the classifier marked retryable=True and
looped 3x before failing with no clue what went wrong.
Now: 'error' frames raise a synthesized _StreamErrorEvent with an OpenAI
SDK-shaped .body so _summarize_api_error, _extract_api_error_context,
_is_entitlement_failure, and classify_api_error all see the real
provider message. Users on unsubscribed accounts now see 'do not have
an active Grok subscription' once, not three RuntimeErrors.
Verified end-to-end: classifier returns reason=auth retryable=False;
entitlement detector matches even with status_code=None; summarizer
returns the full xAI message.
Tests: 4 new in TestCodexFallbackErrorEvent covering xAI subscription
message, dict-shaped events, summarizer integration, and the empty-stream
case (must still raise the original RuntimeError so 'truncated mid-flight'
stays distinguishable from 'provider rejected the call').
xAI announced on 2026-05-16 (https://x.ai/news/grok-hermes) that X Premium
subscriptions now work in Hermes Agent. The hint we shipped in PR #26644
asserted the opposite ("X Premium+ does NOT include xAI API access — only
standalone SuperGrok subscribers can use this provider"), which would now
misdirect Premium+ users who hit any other 403 (no Grok sub at all, wrong
tier, exhausted quota) into thinking they need to switch subscriptions
when their sub is in fact valid.
Remove _decorate_xai_entitlement_error and its two call sites in
_summarize_api_error. xAI's own body text already says "Manage subscriptions
at https://grok.com/?_s=usage" — surface that verbatim and let xAI's wording
do the diagnosis.
The _is_entitlement_failure guard (which prevents credential-pool refresh
loops on entitlement 403s) and the reasoning-replay gating for xai-oauth
are unrelated and untouched.
Update tests to assert the body still surfaces verbatim and that no
Hermes-side editorializing is appended.
Follow-up improvements on top of @konsisumer's cherry-picked fix for #10648:
1. Deprecation patterns required BOTH a product fingerprint ('gh-copilot') and
a deprecation marker. The previous list included 'copilot-cli' and bare
'deprecation', which would false-positive on stderr from the NEW
@github/copilot CLI — whose repo is literally github.com/github/copilot-cli
and which legitimately surfaces those substrings in its own messages.
2. Replace the deprecation hint. The user in #10648 installed
'gh extension install github/gh-copilot' (the deprecated extension)
thinking that's what ACP mode uses, when ACP actually spawns the new
'copilot' binary from '@github/copilot'. The hint now points users at the
correct install command ('npm install -g @github/copilot') with the new
CLI's repo URL, and demotes provider-switching to a fallback alternative.
3. Change _URL_TO_PROVIDER value for models.inference.ai.azure.com from the
'github-models' alias to the canonical 'copilot' provider id, matching the
convention used by every other entry in the table.
4. Sharpen the 413 hint message. The free tier's ~8K cap is below the
system-prompt floor, so this endpoint is fundamentally incompatible with
an agentic loop — not a 'use a different URL' problem.
Tests:
- New parametrized false-positive coverage for the new CLI's stderr shape.
- Updated assertion to require canonical 'copilot' provider mapping.
- All 14 deprecation/URL tests pass.
Address two blocking issues when using GitHub Copilot integrations:
1. ACP mode: detect the gh-copilot CLI deprecation error from stderr
and surface an actionable message with alternatives instead of
hanging or showing a cryptic error.
2. GitHub Models (Azure) 413: recognize models.inference.ai.azure.com
as a known GitHub Models URL, and print a targeted hint explaining
the hard 8K token limit that makes this endpoint incompatible with
Hermes' system prompt size.
Port from openai/codex#17667: MCP servers can now opt-in to parallel
tool execution by setting supports_parallel_tool_calls: true in their
config. This allows tools from the same server to run concurrently
within a single tool-call batch, matching the behavior already available
for built-in tools like web_search and read_file.
Previously all MCP tools were forced sequential because they weren't in
the _PARALLEL_SAFE_TOOLS set. Now _should_parallelize_tool_batch checks
is_mcp_tool_parallel_safe() which looks up the server's config flag.
Config example:
mcp_servers:
docs:
command: "docs-server"
supports_parallel_tool_calls: true
Changes:
- tools/mcp_tool.py: Track parallel-safe servers in _parallel_safe_servers
set, populated during register_mcp_servers(). Add is_mcp_tool_parallel_safe()
public API.
- run_agent.py: Add _is_mcp_tool_parallel_safe() lazy-import wrapper. Update
_should_parallelize_tool_batch() to check MCP tools against server config.
- 11 new tests covering the feature end-to-end.
- Updated MCP docs and config reference.
The #1 confusing cause of the xAI 403 (per Teknium): X Premium+
subscribers see Grok inside the X app and assume API access is
included. It is NOT — only standalone SuperGrok subscribers can use
xai-oauth with Hermes today. Without calling this out, every Premium+
user hits the 403 with no idea why.
PR #26666's neutral 4-cause list was correct but buried the most
common cause. Lead with the Premium+ gotcha, then list the other
possibilities (no subscription, wrong tier, exhausted quota) as
fallbacks. Same neutral framing — does not accuse anyone of being
unsubscribed.