The ChatGPT Codex backend (chatgpt.com/backend-api/codex) has historically
silently dropped certain model requests: the connection is accepted but no
stream events are emitted and no error is raised. PR #31967 lowered the
implicit stale-call default from 300s to 90s so fallbacks kick in faster,
but users still see an opaque "No response from provider for 90s
(non-streaming, ...)" message that gives no path forward.
This patch adds a narrow heuristic — gpt-5.5 family on the Codex backend
via codex_responses api_mode — that substitutes the generic timeout
message with actionable text naming the gpt-5.4-codex workaround and
pointing at #21444 for symptom history.
Changes:
- run_agent.py — new ``AIAgent._codex_silent_hang_hint(model=...)`` method.
Returns ``None`` for any request that does not match all three guards
(codex_responses api_mode, openai-codex provider or chatgpt.com Codex
base URL, gpt-5.5-family model name with word-boundary regex anchoring
to avoid false-positives on e.g. ``gpt-5.50``).
- agent/chat_completion_helpers.py — the non-stream stale-call site
consults the hint via ``getattr(...)`` so the call site stays robust
if the helper is ever removed or stubbed in tests. Hint is appended to
both the ``_emit_status`` warning and the ``TimeoutError`` message so
the user sees it in their terminal AND it lands in any retry-loop
diagnostics.
- tests/run_agent/test_codex_silent_hang_hint.py — 10 regression tests
covering positive cases (bare gpt-5.5, vendor-prefixed openai/gpt-5.5,
gpt-5.5-codex SKU, model=None fallback to self.model) and negative
cases (gpt-5.4-codex workaround, gpt-5.50 false-positive guard,
non-codex api_mode, non-codex provider, empty/None model, unrelated
models on Codex).
Does NOT fix the backend-side issue (that's an upstream OpenAI/ChatGPT
problem we cannot patch from here). Only converts an opaque timeout into
text that names the workaround so users do not have to dig through logs
or wait for a forum post to learn what to do.
Closes#22046
Codex / Responses-API requests had three latent timeout bugs that combined
into the long silent hangs reported on #21444:
1. The non-stream stale-call detector estimated context tokens from
``api_kwargs["messages"]`` only. Codex / Responses-API payloads carry
their conversational load in ``input`` (with ``instructions`` and
``tools``), so every Codex turn logged ``context=~0 tokens`` and the
detector never applied its >50k / >100k tier bumps.
2. ``providers.<id>.request_timeout_seconds`` was silently dropped on the
main Codex path. The chat_completions path and the auxiliary Codex
adapter both forwarded it; the main path skipped it through three
places (``build_api_kwargs``, ``ResponsesApiTransport.build_kwargs``,
``_preflight_codex_api_kwargs``).
3. The streaming stale detector had the same payload-shape bug for
``codex_responses`` requests, which route through the non-streaming
detector (it's the path that emits the user-facing
"No response from provider for 300s (non-streaming, ...)" warning that
reporters keep pasting).
This commit:
- Adds ``estimate_request_context_tokens`` in ``chat_completion_helpers``,
used by both the non-stream and stream detectors. Handles ``messages``
(Chat Completions), ``input + instructions + tools`` (Responses API),
bare lists, and an unknown-dict fallback.
- Forwards ``timeout`` through ``ResponsesApiTransport.build_kwargs``
and ``_preflight_codex_api_kwargs`` (with guards against
zero/negative/inf/bool values), and wires
``_resolved_api_call_timeout()`` into the Codex branch of
``build_api_kwargs``.
- Lowers the implicit non-stream stale defaults so fallback providers
kick in faster when upstream stalls:
* base 300s -> 90s
* >50k 450s -> 150s
* >100k 600s -> 240s
These only apply when the user has *not* set
``providers.<id>.stale_timeout_seconds`` or
``HERMES_API_CALL_STALE_TIMEOUT``. Explicit config still wins.
- Adds regression tests for the estimator shapes, the new defaults, the
context-tier scaling, transport timeout pass-through, and preflight
timeout pass-through / rejection of invalid values.
Closes#21444
Supersedes #21652#24126#31855
Co-authored-by: Hoang V. Pham <26063003+hehehe0803@users.noreply.github.com>
Two-layer redaction at the persistence boundary so credentials never reach
state.db, session_*.json, or compression:
1. agent/chat_completion_helpers.py :: build_assistant_message
- Redact assistant content before the message dict is constructed
(catches PATs / API keys the model inlines into natural language)
- Redact tool_call.function.arguments at the same site (catches secrets
inlined into tool args, e.g. terminal command=curl -H 'Authorization: ...')
Tool execution uses the raw API response object, not this dict, so
redacting the persisted shape is safe.
2. run_agent.py :: _save_session_log
- Add _redact_message_content() static helper that handles both string
content and OpenAI/Anthropic multimodal list-of-parts (image parts
pass through untouched, only text/content fields are redacted)
- Apply to every message + the cached system prompt before writing
session_*.json
Both layers respect HERMES_REDACT_SECRETS via redact_sensitive_text —
no-op when disabled.
Tests (TestSaveSessionLogRedactsSecrets, 4 cases):
- api key in tool content
- api key in user message
- api key in system prompt
- multimodal list-of-parts (image part preserved, text redacted)
Tests use an autouse fixture to force _REDACT_ENABLED=True because the
hermetic conftest defaults the env var to false.
Salvaged from PR #24758 by @vgocoder (build_assistant_message + session_log)
+ PR #19855 by @liuhao1024 (multimodal list helper, system_prompt redaction).
Kept only the redaction concern from #19855; its unrelated whatsapp npm
timeout + PATCH_SCHEMA changes are out of scope and dropped.
Refs #19798 (PAT leak via assistant inline mention), #19845 (session capture
credential leak).
Co-authored-by: liuhao1024 <liuhao03@bilibili.com>
Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
``_is_entitlement_failure`` over-matched on xAI 403s. xAI returns the
same permission-denied ``code`` text for two distinct conditions:
1. Unsubscribed account ("active Grok subscription. Manage at
https://grok.com" in the ``error`` field).
2. Stale OAuth access token ("OAuth2 access token could not be
validated. [WKE=unauthenticated:bad-credentials]" in the ``error``
field).
The classifier's "does not have permission + grok" substring heuristic
treated both identically, so the credential-pool refresh path was
short-circuited for case (2) — long-running TUI sessions stuck on a
stale OAuth token surfaced a non-retryable client error and the user
had to exit + reopen the TUI to recover (the startup-resolve path
bypasses the classifier entirely, which is why bridge adapters with
proactive refresh cadences didn't see this in practice).
This patch adopts the reporter's recommended fix (option 1, tightest):
honor xAI's explicit ``[WKE=unauthenticated:...]`` suffix and the
``OAuth2 access token could not be validated`` phrasing as
authoritative "this is auth, not entitlement" signals. When either
appears anywhere in the body's text fields, the classifier returns
False eagerly — *before* the entitlement keyword checks run — so the
refresh-on-401 path takes over and the existing loop-protection still
guards against runaway refresh storms if the refresh itself fails.
Two small adjustments fall out of this:
* The haystack now also covers ``code`` and ``error`` keys directly,
not just the ``message``/``reason`` shape ``_extract_api_error_context``
produces. Real runtime paths use the normalised shape, but the test
suite and any future call sites that pass raw bodies get the same
treatment. Backwards compatible: missing keys default to empty
strings, the haystack still skips when everything is blank.
* Both disambiguator checks fire BEFORE the entitlement keyword
checks. If a future xAI body somehow lands with both an entitlement
message AND the WKE suffix, the WKE suffix wins (correct — auth is
recoverable; entitlement is not, and a refreshed token will surface
the entitlement message on the next request anyway).
Existing tests (``test_is_entitlement_failure_matches_real_xai_bodies``,
``test_is_entitlement_failure_false_for_unrelated_auth_errors``,
``test_recover_with_credential_pool_skips_refresh_on_entitlement_403``,
``test_recover_with_credential_pool_still_refreshes_genuine_auth_failure``)
continue to pass unchanged — the unsubscribed-account path, the
generic auth-error path, and the refresh-on-401 path are all left
intact.
Layer-2 defense for the FD-recycling race: even with
``force_close_tcp_sockets`` reduced to shutdown-only, the followup
``client.close()`` in ``_close_openai_client`` still walks the httpx
pool and closes sockets — and if called from a stranger thread (the
interrupt-check loop, the stale-call detector) it has the same
FD-recycling exposure that wrote a TLS record on top of ``kanban.db``.
Stamp the request_client_holder with the owning thread's ident at
``_set_request_client`` time. In ``_close_request_client_once``:
* Owning thread (the worker's ``finally``) → pop + ``client.close()``
via ``_close_request_openai_client``, exactly as before.
* Stranger thread → ``_abort_request_openai_client`` (new): only
``shutdown(SHUT_RDWR)`` the pool sockets and log a deferred-close
marker. The holder stays populated so the worker's eventual
``finally`` performs the real close from its own thread context,
where the FD release races nothing.
Applied symmetrically to both the non-streaming
``interruptible_api_call`` and the streaming variant — both routinely
get hit by stranger-thread interrupts.
The log field ``tcp_force_closed=N`` keeps its existing shape; the new
abort path adds ``deferred_close=stranger_thread`` so production
triage can distinguish the two close kinds.
Some providers (Xiaomi MiMo, some Alibaba endpoints, a long tail of
OpenAI-compatible servers) follow the OpenAI spec strictly and require
tool message `content` to be a string — they reject our list-type
content (text + image_url parts) with HTTP 400 'text is not set' /
'tool message content must be a string'.
Instead of an allowlist of known-good providers (maintenance burden,
guaranteed to miss aggregators like OpenRouter where the underlying
model determines support, not the aggregator name), this lands a
reactive recovery:
1. New `FailoverReason.multimodal_tool_content_unsupported` with a
small pattern list covering the common 400 wordings.
2. `AIAgent._try_strip_image_parts_from_tool_messages` walks the API
message list, downgrades any `role:tool` message whose content is
list-with-image to a plain text summary (preserves text parts) in
place, AND records the active (provider, model) in a session-scoped
`_no_list_tool_content_models` set.
3. `_tool_result_content_for_active_model` short-circuits to a text
summary when (provider, model) is in the cache — so after the first
400 + retry, subsequent screenshots in the same session skip the
round trip entirely.
4. Retry hook in `agent.conversation_loop` mirrors the existing
`image_too_large` recovery: detect the reason, run the helper,
retry once, fall through to the normal error path if no list-type
tool content was actually present.
Cache is transient (per-session) by design — next session retries in
case the provider added support, no persistent state to maintain.
Fixes#27344. Closes#27351 (allowlist approach superseded by reactive
recovery).
The contributor PR (#17936) only patched the strip path in
`_model_supports_vision()`. The auto-mode router in
`agent/image_routing._lookup_supports_vision` still only read models.dev,
so a custom-provider model declared as vision-capable would still get its
images routed through vision_analyze in the default `agent.image_input_mode:
auto` setting. Users had to set both `supports_vision: true` AND
`image_input_mode: native` to bypass the text pipeline.
Single-knob behavior now: `supports_vision: true` alone is enough in auto
mode. The strip path and the routing path consult the same resolver.
- Extract override resolution into `_supports_vision_override()` in
agent/image_routing.py and wire it into `_lookup_supports_vision()`.
- Refactor `run_agent._model_supports_vision` to call the same helper
(DRY, single source of truth for the resolution order).
- Strict YAML boolean coercion: `supports_vision: "false"` (quoted —
a common YAML mistake) no longer coerces to True via bool() truthiness.
Recognised tokens: true/false/yes/no/on/off/1/0 plus real bools and 0/1.
Unrecognised values return None and fall through to models.dev.
- Add @CNSeniorious000 to AUTHOR_MAP for release attribution.
Tests: 26 new (TestCoerceCapabilityBool, TestSupportsVisionOverride,
TestLookupSupportsVisionOverride, TestAutoModeRespectsOverride). Existing
contributor tests + image_routing + vision_native_fast_path +
native_image_buffer_isolation all green (92/92).
Named custom providers are rewritten to provider="custom" at runtime
(hermes_cli/runtime_provider.py:_resolve_named_custom_runtime), so a
config under providers.my-vllm.models.my-llava.supports_vision was
unreachable via self.provider alone. Also try cfg.model.provider as a
candidate provider key, covering both runtime and config naming.
Adds a regression test for the named-provider path.
Custom/local provider models absent from models.dev get classified as
non-vision and have their image content stripped before reaching the
upstream API. Surface a user-facing override:
model:
supports_vision: true
providers:
my-vllm:
models:
my-llava:
supports_vision: true
The override short-circuits the models.dev lookup in
_model_supports_vision(), which is the single gate guarding image-strip
preprocessing on every transport path.
Refs #8731.
PR #29182 deleted the per-session JSON snapshot writer outright because
state.db is canonical and the snapshots had no in-tree consumer. Some
users have external tooling that reads `~/.hermes/sessions/session_{sid}.json`
directly, so reintroduce the writer behind a config flag that defaults
to off.
- Add `sessions.write_json_snapshots` (default False) to DEFAULT_CONFIG
- Restore `AIAgent._save_session_log` + `_clean_session_content` as
gated methods. When the flag is off the call is a fast no-op; when
on, the writer behaves as before (atomic write, truncation guard
preserved, REASONING_SCRATCHPAD → think tag normalization)
- Re-derive the target path from `agent.session_id` on each call so
`/branch` and `/compress` re-points happen automatically — no need
to restore the explicit re-point bookkeeping at call sites
- Wire the single call site in `_persist_session` (the cleanup-on-exit
hook). Did NOT restore the 7 intra-turn calls the original PR deleted
— those were redundant writes within the same turn that doubled disk
I/O without adding any persistence guarantee `_persist_session` does
not already provide
- Read the flag once at agent init via `load_config()`, cache as
`agent._session_json_enabled`
- Update `TestNoSessionJsonSnapshot` → `TestSessionJsonSnapshotOptIn`
to pin behavior: default off (no file), opt-in true (file written),
no-op method on default agents, logs_dir retained unconditionally
- Update CONTRIBUTING.md and the bundled `hermes-agent` skill to
document the flag and its default
Only caller was the removed _save_session_log. Also removes the unused
convert_scratchpad_to_think and has_incomplete_scratchpad imports from
run_agent.py (both still used elsewhere via their own imports).
state.db now stores every message field the JSON snapshot stored. Removed
the method, all 7 call-sites, and ~13 test stubs that suppressed its file I/O.
Body is in git history if it ever needs to come back.
* perf(config): add load_config_readonly() fast path for hot agent loop
`load_config()` is called from the agent loop's per-API-call hot path via
`get_provider_request_timeout()` and `get_provider_stale_timeout()` —
both invoked once per turn from `_resolved_api_call_timeout()` in
run_agent.py.
Profiling a synthetic 20-tool-call agent run revealed:
- 21 invocations of `load_config()` cumulating 56ms (~17% of agent loop)
- 34,398 deepcopy calls totaling 37ms (config defensive deepcopy + chain)
- 8,652 `_expand_env_vars` invocations (~412 per turn)
Microbench (cache-hit, real config.yaml present):
load_config() 265us/call (125us deepcopy + 140us infra)
load_config_readonly() 138us/call (~48% faster)
`load_config_readonly()` returns the cached dict directly without the
defensive deepcopy. Documented contract: caller must not mutate. Returns
plain dict (not MappingProxyType) so downstream `isinstance(x, dict)`
guards keep working — caught during initial implementation when
MappingProxyType broke get_provider_request_timeout's guard logic.
Wired into hermes_cli/timeouts.py (the two functions called per agent
turn). load_config() is unchanged for the 263 other call sites that
mutate the result before save_config(), are not in the hot path, or
where the safety guarantee matters more than the perf.
Profile A/B (cached config, 21-turn agent loop):
BEFORE AFTER delta
get_provider_request_timeout 55ms 16ms -71%
total function calls 399k 160k -60%
deepcopy calls (in hotspots) 34,398 ~0 ~elim
Verified:
- isinstance(load_config_readonly(), dict) is True
- timeout/stale resolutions correct
- load_config() still returns isolated mutable deepcopies
- tests/hermes_cli/test_config*.py / test_timeouts.py: 102/102 pass
- tests/cli/ + tests/agent/test_auxiliary_client.py: 883/883 pass
* perf(redact): substring pre-screens skip non-matching regex chains
Every log record passes through `RedactingFormatter.format` which calls
`redact_sensitive_text`, which historically ran ALL 13 secret-pattern
regexes against every line — including DB connection strings, JWTs,
Discord mentions, Signal phone numbers, etc. — even for typical clean
log records like 'INFO run_agent: API call completed'.
Add cheap substring pre-checks before each regex pass. False positives
still run the regex (which then matches nothing); false negatives are
impossible because every pattern requires the gated substring to match
its leading anchor:
- `_PREFIX_RE` gated on any of 33 known credential prefix substrings
- `_ENV_ASSIGN_RE` gated on `=` in text
- `_JSON_FIELD_RE` gated on `:` and `"` in text
- `_AUTH_HEADER_RE` gated on `uthorization`/`UTHORIZATION` in text
- `_TELEGRAM_RE` gated on `:` in text
- `_PRIVATE_KEY_RE` gated on `BEGIN` and `-----`
- `_DB_CONNSTR_RE` gated on `://` in text
- `_JWT_RE` gated on `eyJ` in text
- URL userinfo/query gated on `://`
- `_redact_form_body` gated on `&` and `=`
- `_DISCORD_MENTION_RE` gated on `<@`
- `_SIGNAL_PHONE_RE` gated on `+`
Microbench (5 typical log records, 20k iterations each):
BEFORE AFTER delta
redact_sensitive_text per call 5.63us 1.79us -68%
Real-world impact: ~244 log records emitted in a 30-turn agent loop, so
the chain saves ~1ms of CPU per conversation. Bigger win is the
reduction in regex execution and GC pressure during heavy logging
sessions (verbose logging, gateway message processing).
Security regression test: 30 secret-containing inputs (sk-/ghp_/JWT/DB
connstr/Auth-Bearer/private key/URL userinfo/Discord/Signal/etc.)
verified to produce identical redacted output before/after. All 75
existing tests/agent/test_redact.py cases pass.
The `?access_token=foo&code=bar` (bare query string, no scheme) case
that 'leaks' is pre-existing behavior — the URL query redaction
requires a well-formed URL with scheme+host. Not a regression.
* perf(run_agent): cache _needs_thinking_reasoning_pad result per (provider, model, base_url)
Profile of a 31-turn synthetic agent run shows `_needs_thinking_reasoning_pad`
fires 495 times (~16 per turn) and each call ran 3 helper methods, each
hitting `base_url_host_matches` 1-4 times via `urlparse`. Total cost:
3,342 base_url_host_matches calls + 3,373 urlparse calls accounting for
~36ms of agent-loop overhead (~7% of the entire post-network work).
Provider / model / base_url don't change during a conversation except via
`switch_model` and fallback activation — both of which already overwrite
those attributes atomically. Cache the result on a tuple key; since the
key is derived from the very fields that would change, the cache
auto-invalidates on the next read after a switch. No manual invalidation
needed in switch_model / _try_activate_fallback.
Profile A/B (31-turn cached-config agent run):
BEFORE AFTER delta
_needs_thinking_reasoning_pad cum 18ms 1ms -94%
_copy_reasoning_content_for_api cum 17ms 1ms -94%
base_url_host_matches calls 3,342 372 -89%
urlparse calls 3,373 403 -88%
total function calls 296k 223k -25%
Verified:
- tests/run_agent/test_deepseek_reasoning_content_echo.py: 36/36 pass
- tests/run_agent/ (full): 1383/1383 pass + 3 skipped
GLM models via Ollama report finish_reason='stop' even when the
response was truncated by max_tokens. The continuation mechanism
uses _has_natural_response_ending() as one of the heuristics to
detect whether the response was genuinely finished.
Currently only ASCII punctuation and CJK punctuation are recognized.
This means any response ending with an emoji (e.g. ⚡, 👍) or the
caret character ^ (common in French ^^ smiley) is not recognized as
naturally ended, triggering a false-positive continuation where the
model receives 'Continue where you left off' and produces garbled
output.
Add:
- ^ (caret) to the punctuation set
- Unicode emoji range (codepoint >= 0x1F300) as natural ending
This only affects GLM/Ollama users but the fix is safe for all
backends since _has_natural_response_ending() is only consulted
inside the continuation flow.
When auxiliary compression's summary generation returns None (aux model
errored, returned non-JSON, timed out, etc.) the compressor previously
still dropped every middle message between compress_start..compress_end
and replaced them with a static 'Summary generation was unavailable'
placeholder. The session kept going but the user silently lost N turns
of context for nothing.
New behavior: on summary failure, compress() aborts entirely — returns
the input messages unchanged and sets _last_compress_aborted=True. The
existing _summary_failure_cooldown_until gate (30-60s) keeps the aux
model from being burned on every turn. Auto-compress callers detect
the no-op (len(after) == len(before)) and stop looping. The chat is
'frozen' at its current size until the next /compress or /new.
Manual /compress (CLI + gateway) now passes force=True which clears
the cooldown so users can retry immediately after an auto-abort. If
the manual retry also fails, the user gets a visible warning telling
them nothing was dropped and how to retry.
- agent/context_compressor.py: compress() gains force= kwarg; failure
branch sets _last_compress_aborted and returns messages unchanged
instead of inserting placeholder.
- run_agent.py: _compress_context() detects abort, surfaces warning,
skips session-rotation entirely, returns messages unchanged.
- cli.py + gateway/run.py: manual /compress paths pass force=True.
- gateway/run.py: hygiene + /compress handlers detect _last_compress_aborted
and emit the new 'Compression aborted' warning (gateway.compress.aborted)
instead of the old 'N historical messages were removed' message.
- locales/*.yaml: new gateway.compress.aborted key in all 16 locales.
- tests: updated to assert the abort contract (messages preserved,
compression_count not incremented, abort flag set, no placeholder
leaked). New test_force_true_bypasses_failure_cooldown covers the
manual-retry path.
Six days after #23937 (608 fixes) the codebase had accumulated 241 new
PLR6201 violations. Same mechanical `x in (...)` → `x in {...}` fix,
same zero-risk profile: set lookup is O(1) vs O(n) for tuple and the
two are semantically equivalent for hashable scalar membership tests.
All 241 instances fixed via `ruff check --select PLR6201 --fix
--unsafe-fixes`, zero remaining. Every changed value is a hashable
scalar (str/int/None/enum/signal); no risk of unhashable runtime
errors. No behavior change.
Test plan:
- 119 files changed, +244/-244 (net zero) — exactly one-line edits
- `ruff check` clean afterward
- Compile checks pass on the largest touched files (cli.py, run_agent.py,
gateway/run.py, gateway/platforms/discord.py, model_tools.py)
- Subset broad test run on tests/gateway/ tests/hermes_cli/ tests/agent/
tests/tools/: 18187 passed, 59 pre-existing failures (verified against
origin/main with the same shape — identical failure count, identical
category — all xdist test-order flakes unrelated to this change)
Follows the same template as PR #23937 ([tracker: #23972](https://github.com/NousResearch/hermes-agent/issues/23972)).
Original commit 2b193907d by Teknium added a new module-level
_StreamErrorEvent class and threaded its raise into
_run_codex_create_stream_fallback in pre-refactor run_agent.py.
- _StreamErrorEvent class → run_agent.py (module-level, next to
_qwen_portal_headers; class needs to be top-level for the codex
runtime to import it)
- The fallback event-loop's 'type=error' handler → agent/codex_runtime.py
where run_codex_create_stream_fallback now lives. Imports
_StreamErrorEvent lazily from run_agent to avoid circular import.
Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
Original commit 9c304a7f5 by helix4u targeted _flatten_exception_chain,
_summarize_api_error, and the _call streaming retry loop in pre-refactor
run_agent.py. Re-applied to:
- New _is_provider_stream_parse_error helper → run_agent.py (next
to _flatten_exception_chain in the AIAgent class)
- _summarize_api_error early-return for the malformed-streaming
ValueError → run_agent.py (kept method body)
- _call streaming retry: _is_stream_parse_err flag wired into
_is_transient AND the post-exhaustion branch + dedicated
malformed-streaming user-status string → agent/chat_completion_helpers.py
(the _call body now lives there)
Co-authored-by: helix4u <4317663+helix4u@users.noreply.github.com>
Collapses the four-commit xAI entitlement-403 chain to its final
on-main state, ported to the post-refactor module layout:
- Added _is_entitlement_failure on AIAgent (run_agent.py) — detects
Grok subscription-shape 403s on (401|403|None) status codes.
- Added entitlement-skip branch to recover_with_credential_pool
(agent/agent_runtime_helpers.py) — breaks the refresh-loop that
Don's 100-iteration trace exposed when a Premium+ user hit a real
entitlement issue.
- Removed _decorate_xai_entitlement_error and unwrapped its two
_summarize_api_error call sites — xAI's own body text already
points users at grok.com/?_s=usage so we surface that verbatim
(dffb602f3 reasoning: X Premium subs DO now work per xAI's
2026-05-16 announcement, so editorialising would misdirect).
- grok-4.3 1M context entry landed in agent/model_metadata.py
via the prior merge — no additional port needed.
Tests already on disk (tests/run_agent/test_codex_xai_oauth_recovery.py)
assert _is_entitlement_failure shape and verbatim body surfacing.
Closes#27110.
Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
Original commit 31ba2b0cb by Teknium targeted run_codex_stream() at
its pre-refactor location in run_agent.py. Re-applied:
- Prelude error retry/fallback → agent/codex_runtime.py (in
run_codex_stream where the body now lives)
- _decorate_xai_entitlement_error helper + _summarize_api_error
wrapping → run_agent.py (these methods remained on AIAgent
as @staticmethod's; cherry-pick applied them cleanly)
The xai-oauth provider gate, encrypted_content drop on replay, etc.
landed in agent/codex_responses_adapter.py via the prior merge from main.
Closes#8133, #14634
Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
Original commit 13c3d4b4e by kchantharuan touched __init__ and
_apply_client_headers_for_base_url in pre-refactor run_agent.py. Re-applied to:
- __init__: agent/agent_init.py (3 hunks — NVIDIA branch + _custom_headers
fallback in routed-client and fallback-client paths)
- _apply_client_headers_for_base_url: still in run_agent.py (1 hunk)
build_nvidia_nim_headers was already present in agent/auxiliary_client.py
from the prior merge — no additional port needed.
Co-authored-by: kchantharuan <kchantharuan@nvidia.com>
Original commit b62c99797 by Jaaneek targeted six locations in
pre-refactor run_agent.py. Re-applied to the extracted post-PR locations:
- api_mode dispatch → agent/agent_init.py
- is_xai_responses build_api_kwargs → agent/chat_completion_helpers.py
- codex_auth_retry block + 401 hint → agent/conversation_loop.py
- _try_refresh_codex_client_credentials body → run_agent.py (kept)
The non-run_agent.py portions of the commit (auxiliary_client, codex
transport, hermes_cli/auth, tools/xai_http, tests, docs) merged cleanly
from main via the prior merge commit.
Co-authored-by: Jaaneek <Jaaneek@users.noreply.github.com>
previously only checked provider ID and
base URL. When kimi-k2.6 is served via ollama-cloud (or any third-party
provider), provider is not 'kimi-coding' and base URL is not
api.kimi.com — so reasoning_content pad was never injected. This caused
HTTP 400 from Ollama Cloud's Go backend: 'invalid message content type:
map[string]interface {}'.
Fix: add model-name detection ('kimi' in model.lower()) so any route
serving a kimi model gets the required reasoning_content echo-back.
Refs the 400/401 Telegram errors where kimi-k2.6 via ollama-cloud
consistently failed after tool-call turns.
(cherry picked from commit 9a9f8a6d99)
Four fixes from PR #27248 review:
1. **__init__ forwarder is now keyword-forwarded** (daimon-nous review).
Previously the run_agent.AIAgent.__init__ wrapper forwarded all 64
params positionally to agent.agent_init.init_agent, so adding a
65th param on main would require three lockstep edits (signature,
init_agent signature, forwarder call) or silently shift every value.
Keyword forwarding makes this trivially safe — adding a param now
only needs the two signatures and one extra keyword line.
2. **Drop dead _ra() in agent/codex_runtime.py** (daimon-nous + Copilot).
The lazy run_agent reference was defined but never called inside
this module — the codex paths use agent.* accessors only.
3. **Drop unused imports in agent/codex_runtime.py** (Copilot):
contextvars, threading, time, uuid, Optional. Carried over from
run_agent.py during the original extraction.
4. **Tighten three source-introspection test guards** (Copilot):
- test_memory_nudge_counter_hydration.py — was scanning the
concatenated source of run_agent.py + agent/conversation_loop.py
and matching self.X or agent.X form. Now asserts the
hydration block lives in agent/conversation_loop.py specifically
with the agent.X form — the body never moves back, so if it
ever drifts a future re-introduction fails the guard.
- test_run_agent.py::TestMemoryNudgeCounterPersistence — anchor on
agent.iteration_budget = IterationBudget exactly (was just
iteration_budget = IterationBudget) so an unrelated identifier
ending in iteration_budget can't match.
- test_run_agent.py::TestMemoryProviderTurnStart — assert the
agent._user_turn_count form directly (the extracted body uses
agent.X, not self.X — accepting either was a transitional fudge).
- test_jsondecodeerror_retryable.py — scan agent/conversation_loop.py
only, not the concatenation.
Not addressed in this commit:
* Pre-existing bugs in agent/tool_executor.py (heartbeat index
mismatch when calls are blocked, _current_tool clobber in result
loop, blocked-counted-as-completed in spinner summary, dead
result_preview computation). These were preserved byte-for-byte from
the original _execute_tool_calls_concurrent — worth a separate
follow-up PR with proper tests.
* _OpenAIProxy.__instancecheck__ concern — pre-existing, not flagged
by any of the original test patches (nothing actually does
isinstance(x, OpenAI) against the proxy instance).
* agent_init.py:949 mem_config potential NameError — pre-existing;
only triggers if _agent_cfg.get('memory', {}) itself raises, which
it can't with a stock dict.
tests/run_agent/ + tests/agent/: 4313 passed, 1 pre-existing
test_auxiliary_client failure (unchanged).
run_agent.py: 3821 -> 3937 lines (+116 from the keyword-forwarded
init call's verbosity). Final: 16083 -> 3937 (-12146, 75% reduction).
The largest method left on AIAgent (60+ parameters, the entire startup
sequence — credential resolution, provider auto-detection, context
engine bootstrap, memory store hydration, plugin lifecycle hooks)
moves into agent/agent_init.py.
AIAgent.__init__ is now a thin wrapper that calls
agent.agent_init.init_agent(self, ...) with the original full
parameter list preserved.
Module-level run_agent names referenced in the body (_openrouter_prewarm_done,
_qwen_portal_headers, _routermint_headers, _hermes_home, OpenAI,
get_tool_definitions, check_toolset_requirements) are resolved through
_ra() so test patches on those names keep working. agent_init's logger
warnings are routed via _ra().logger so tests patching run_agent.logger
capture them (TestStringKSuffixContextLengthWarns,
TestCustomProvidersInvalidContextLengthWarns).
Live E2E reconfirmed on three model paths (openai/gpt-5.4,
anthropic/claude-sonnet-4.6, moonshotai/kimi-k2-thinking).
tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing
test_auxiliary_client failure).
run_agent.py: 5944 -> 4564 lines (-1380).
Total reduction since baseline: 16083 -> 4564 (-11519, 72%).
The 3,877-line run_conversation body — the agent loop itself — moves out
of run_agent.py into a dedicated module. AIAgent.run_conversation is
now a thin forwarder that delegates to agent.conversation_loop.run_conversation
with the AIAgent instance as the first argument.
This is the largest single extraction in the run_agent.py refactor.
The body keeps all 163 self.X references intact (rewritten as agent.X),
all nested closures, all retry/backoff/compression machinery. Symbols
that tests or callers patch on run_agent (_set_interrupt,
handle_function_call, AIAgent class attrs) are resolved through _ra()
inside the extracted module so the patch surface is preserved.
Five tests doing inspect.getsource(AIAgent.run_conversation) updated to
scan agent.conversation_loop.run_conversation. Two source-introspection
tests (TestMemoryNudgeCounterPersistence, TestMemoryProviderTurnStart)
updated to accept either self.X (legacy) or agent.X (extracted
form) in the matched assertions.
Live E2E verified on three model paths:
* openai/gpt-5.4 (OpenAI chat completions via OpenRouter)
* anthropic/claude-sonnet-4.6 (Anthropic Messages via OpenRouter)
* moonshotai/kimi-k2-thinking (reasoning model, reasoning_content path)
Plus read_file tool execution, terminal tool, web_search.
tests/run_agent/ + tests/agent/: 4313 passed, 1 pre-existing failure
(test_auxiliary_client::test_custom_endpoint... — same as on main).
run_agent.py: 9800 -> 5944 lines (-3856).
Total reduction since baseline: 16083 -> 5944 (-10139, 63%).
The three big review-prompt strings (_MEMORY_REVIEW_PROMPT,
_SKILL_REVIEW_PROMPT, _COMBINED_REVIEW_PROMPT — 183 lines combined) move
out of the AIAgent class body and into agent/background_review.py where
they're consumed.
AIAgent re-exposes them as class attributes via 'from ... import' inside
the class body — Python binds those names into the class namespace so
existing AIAgent._MEMORY_REVIEW_PROMPT references keep working.
spawn_background_review_thread also falls back to the module-level
constants if an agent doesn't have the attribute (preserves the test
pattern of mocking these on the agent).
tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing
test_auxiliary_client failure).
run_agent.py: 9986 -> 9800 lines (-186).
Move _interruptible_streaming_api_call out of run_agent.py — the biggest
single method in the file. Body lives next to interruptible_api_call
in agent/chat_completion_helpers.py so streaming + non-streaming code
share one home.
Nested closures (_call_chat_completions, _call_anthropic, the codex
stream branch) all come along with the body and still capture the
parent function's locals as expected.
AIAgent keeps a thin forwarder method. is_local_endpoint added to
the import block (used by the stream stale-timeout disable logic).
One source-introspection test in TestAnthropicInterruptHandler is
updated to scan agent.chat_completion_helpers.interruptible_streaming_api_call
instead of AIAgent._interruptible_streaming_api_call.
tests/run_agent/ + tests/agent/: 4312 passed (same pre-existing
test_auxiliary_client failure).
run_agent.py: 12277 -> 11385 lines (-892).
Move the two big tool-dispatch methods out of run_agent.py:
* execute_tool_calls_concurrent — 408-line concurrent path (interrupt
pre-flight, guardrail+plugin block, callback fan-out, ContextVar-
preserving ThreadPoolExecutor, periodic heartbeats for the gateway
inactivity monitor, per-tool result handling with subdir hints +
guardrail observations + checkpoint, /steer drain)
* execute_tool_calls_sequential — 441-line sequential path (the
original behavior used for single-tool batches and interactive
tools)
Both take the parent AIAgent as their first argument; AIAgent keeps
thin forwarders so call sites unchanged. handle_function_call is
routed through _ra() so tests that patch run_agent.handle_function_call
keep working. _set_interrupt likewise.
The AST guard in test_tool_executor_contextvar_propagation.py is
updated to scan both run_agent.py AND agent/tool_executor.py so it
still catches the executor.submit(_run_tool, ...) regression
regardless of which file the body lives in.
tests/run_agent/ + tests/agent/: 4313 passed (same pre-existing
test_auxiliary_client failure as before).
run_agent.py: 14309 -> 13461 lines (-848).
Move the background-review subsystem (the self-improvement loop — see the
README) out of run_agent.py into a dedicated module.
* summarize_background_review_actions — was the @staticmethod that builds
the user-facing action summary
* spawn_background_review_thread — builds the thread target + prompt;
the actual review loop body (forked AIAgent, runtime inheritance,
tool whitelist, suppression, teardown) lives in _run_review_in_thread
* build_memory_write_metadata — provenance for external memory mirrors
AIAgent keeps thin wrappers for backward compatibility AND because tests
patch run_agent.threading.Thread to assert lifecycle behavior — the
threading.Thread construction stays in AIAgent._spawn_background_review,
the inner work moves out.
tests/run_agent/ + tests/agent/: 4313 passed, 1 pre-existing failure
(test_auxiliary_client.py::test_custom_endpoint... — confirmed failing
on main before this change). 3 skipped.
run_agent.py: 15272 -> 14972 lines (-300).
Three small extractions into focused modules:
* agent/process_bootstrap.py — \_OpenAIProxy (lazy openai.OpenAI import),
\_SafeWriter (broken-pipe-resistant stdio wrapper), \_install_safe_stdio,
\_get_proxy_from_env, \_get_proxy_for_base_url. All process / IO bootstrap.
* agent/iteration_budget.py — IterationBudget class (thread-safe consume/
refund counter shared by parent agent and subagents).
run_agent re-exports every name so existing test patches like
patch('run_agent.OpenAI', ...) and 'from run_agent import IterationBudget'
keep working unchanged. Verified the patch-rebinding contract for OpenAI
explicitly.
tests/run_agent/ + tests/agent/test_gemini_fast_fallback.py:
1347 passed, 3 skipped.
run_agent.py: 15427 -> 15261 lines (-166).
Pull the 10 pure sanitization/repair helpers (\_sanitize_surrogates,
\_sanitize_structure_surrogates, \_sanitize_messages_surrogates,
\_escape_invalid_chars_in_json_strings, \_repair_tool_call_arguments,
\_strip_non_ascii, \_sanitize_messages_non_ascii, \_sanitize_tools_non_ascii,
\_strip_images_from_messages, \_sanitize_structure_non_ascii) and the
\_SURROGATE_RE constant out of run_agent.py into a new module.
These are stateless byte-walking helpers with no AIAgent dependency.
Backward compatibility: run_agent re-exports every name via a single
import block, so existing 'from run_agent import _sanitize_surrogates'
imports in tests and cli.py keep working unchanged. Same pattern the
file already uses for _summarize_user_message_for_log (codex_responses_adapter).
run_agent.py: 16077 -> 15682 lines (-395).
Mirrors openclaw beta.8's app-server resilience fixes so a stuck codex
subprocess can't burn the full turn deadline and so users get a
`codex login` pointer instead of raw RPC errors when their token expires.
- TurnResult.should_retire signals the caller to drop+respawn codex.
- Deadline-hit path and dead-subprocess detection set should_retire so
the next turn doesn't ride a CPU-spinning or auth-broken process.
- Post-tool watchdog (post_tool_quiet_timeout=90s): if a tool item
completes and codex goes silent past the threshold without further
output or turn/completed, fast-fail instead of waiting the full 600s.
Resets on any non-tool activity so normal think-after-tool flows are
not affected.
- <turn_aborted> and <turn_aborted/> in agent text are treated as
terminal — some codex builds tear down a turn that way without
emitting turn/completed.
- _classify_oauth_failure() inspects RPC error message + stderr tail
for invalid_grant / token refresh / 401 / etc. and rewrites
user-facing errors to 'run codex login'. Conservative: generic
failures still surface verbatim. Fires at turn/start failure,
turn/completed failure, and dead-subprocess paths.
- thread/start cross-fill: tolerate thread.id, thread.sessionId,
top-level sessionId/threadId so future codex schema drift doesn't
KeyError us at handshake.
- run_agent.py: when run_turn returns should_retire=True OR raises,
close + null self._codex_session so the next turn respawns.
Tests: +30 cases across session + integration suites.
tests/agent/transports/test_codex_app_server_session.py 50/50 pass
tests/run_agent/test_codex_app_server_integration.py 27/27 pass
Broader codex scope (transports + cli runtime/migration) 376/376 pass
Background review fork redirected stdout/stderr around run_conversation()
so its iteration messages stay silent. But the memory-provider teardown
(shutdown_memory_provider() and review_agent.close()) fired in the outer
finally block AFTER the redirect_stdout context exited — so provider
teardown prints (Honcho disconnect, Hindsight sync, etc.) leaked into
the parent terminal at end of every turn.
Moves the teardown inside the redirect_stdout scope on the success path
(and nulls review_agent so the finally safety-net skips double-shutdown).
The finally block is rewritten as an exception-path safety net that
re-opens a devnull redirect, since the original 'with' context has
already exited by the time finally runs.
Salvage of #25342 by @ayushere (manually re-applied + merged conflict
with current main's set_thread_tool_whitelist wiring).
When auxiliary.compression.provider is "auto", the compression model
reuses the main model's provider and base_url. The main model's
context_length was correctly picking up custom_providers per-model
overrides (via _custom_providers stored during __init__), but the
auxiliary compression model's context-length detection path in
_check_compression_model_feasibility was not passing custom_providers,
causing it to skip step 0b and fall through to models.dev.
This meant that for providers like NVIDIA NIM where the user has a
per-model context_length in custom_providers (e.g. 196608 for
minimax-m2.7), the auxiliary model would use the models.dev value
(204800) instead of the user-configured one — a subtle discrepancy
that could lead to silent compression issues when the auxiliary model
doesn't actually support the detected context length.
Fix: pass self._custom_providers (already stored as an instance attr
during __init__) to the get_model_context_length() call for the
auxiliary compression model.