mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-18 04:41:56 +00:00
924 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
9c304a7f56 | fix(agent): retry malformed anthropic stream parser errors | ||
|
|
97a32afdc4 | fix(auxiliary): resolve xai oauth compression from pool | ||
|
|
6784c80794
|
fix(xai-oauth): lead entitlement-403 hint with X Premium+ gotcha (#26672)
The #1 confusing cause of the xAI 403 (per Teknium): X Premium+ subscribers see Grok inside the X app and assume API access is included. It is NOT — only standalone SuperGrok subscribers can use xai-oauth with Hermes today. Without calling this out, every Premium+ user hits the 403 with no idea why. PR #26666's neutral 4-cause list was correct but buried the most common cause. Lead with the Premium+ gotcha, then list the other possibilities (no subscription, wrong tier, exhausted quota) as fallbacks. Same neutral framing — does not accuse anyone of being unsubscribed. |
||
|
|
9818b9a1ac
|
fix(xai-oauth): rewrite entitlement-403 hint to not accuse subscribers (#26666)
PR #26644 confidently told users "xAI OAuth account lacks SuperGrok / X Premium entitlement" on any 403 from xAI's permission-denied surface. But that body is returned for at least four distinct causes that Hermes cannot distinguish from the wire: * Account has no Grok subscription at all * Account has SuperGrok but the tier doesn't include the requested model (e.g. grok-4.3 needs SuperGrok Heavy) * Monthly quota for the subscribed tier is exhausted * SuperGrok is active but the API access add-on isn't enabled Don Piedro pushed back that he IS subscribed yet still hit this. Picking the worst-case interpretation ("you're not subscribed") reads as wrong and insulting to subscribers, and points them at a fix they already did. New wording lists all 4 possibilities and points at https://grok.com/?_s=usage where the user can check which applies. The detection logic and credential-pool short-circuit (PR #26664) are unchanged — only the user-facing wording is rephrased. |
||
|
|
ce0e189d3e
|
fix(xai-oauth): break entitlement-403 credential-refresh loop, bump grok-4.3 context to 1M (#26664)
Don Piedro's 18-minute hang on grok-4.3 traced to two issues PR #26644 didn't cover: - _recover_with_credential_pool classifies 403 as FailoverReason.auth and calls pool.try_refresh_current(). For xAI OAuth on an unsubscribed account, refresh succeeds (mints a new token from the same account) but the next API call 403s with the same entitlement error. Result: infinite refresh → retry → 403 loop until Ctrl+C (1133s in Don's log). New _is_entitlement_failure(error_context, status_code) detects the subscription-shape body ("do not have an active Grok subscription" / "out of available resources" + grok / "does not have permission" + grok) and short-circuits recovery so _summarize_api_error surfaces PR #26644's friendly hint. - grok-4.3 resolved to 256k via the grok-4 catch-all in DEFAULT_CONTEXT_LENGTHS. Per docs.x.ai/developers/models/grok-4.3 the model ships with 1M context. Add explicit grok-4.3 entry before the grok-4 fallback (longest-first substring matching ensures grok-4.3 and grok-4.3-latest both land on the new value). Tests: 8 new (23 total in test_codex_xai_oauth_recovery.py). E2E verified Don's 100-iteration loop bails out with 0 refresh calls while genuine auth failures still refresh once and recover. |
||
|
|
cd9470f416 |
fix(deepseek): wire thinking-mode via DeepSeekProfile, not legacy fallback
The cherry-picked PR #15251 from @tw2818 correctly identified the DeepSeek 400 root cause but placed the fix in the legacy fallback path of `build_kwargs`, which DeepSeek never reaches — DeepSeek has a registered ProviderProfile and goes through `_build_kwargs_from_profile` instead. The legacy-path block was therefore dead code. This commit pivots the fix to where it actually fires: - New `DeepSeekProfile` in `plugins/model-providers/deepseek/__init__.py` overrides `build_api_kwargs_extras` to emit DeepSeek's expected wire format (mirrors `KimiProfile`): {"reasoning_effort": "<low|medium|high|max>", "extra_body": {"thinking": {"type": "enabled" | "disabled"}}} - Model gating: only `deepseek-v4-*` and `deepseek-reasoner` emit thinking control. `deepseek-chat` (V3) is untouched — current behavior. - Effort mapping: low/medium/high passthrough, xhigh/max → max, unset → omitted (DeepSeek server applies its own default). - Revert the legacy-path additions from PR #15251 — they were dead code, and the `_copy_reasoning_content_for_api` strip block specifically would have nullified the existing reasoning_content padding machinery (`_needs_deepseek_tool_reasoning` → space-pad on replay) that the active provider already relies on for replay correctness. - Unit tests pin the wire-shape contract and the model gating rules (26 tests, all passing). Existing transport + provider profile suites (321 tests) continue to pass. - AUTHOR_MAP: map twebefy@gmail.com → tw2818 for release notes credit. Closes #15700, #17212, #17825. Co-authored-by: tw2818 <twebefy@gmail.com> |
||
|
|
068c24f8a4 |
feat(deepseek): add thinking.type + reasoning_effort mapping for DeepSeek API
DeepSeek's thinking mode requires both: - extra_body.thinking.type: "enabled" to activate thinking mode - top-level reasoning_effort: "max" or "high" to control depth Previously, the ChatCompletionsTransport only handled Kimi's thinking mode — DeepSeek was left unmapped, so reasoning_effort config was silently dropped. This patch: 1. Adds is_deepseek: bool to the Params dataclass, detected by base_url matching api.deepseek.com 2. Maps Hermes effort levels (xhigh/max → "max", low/medium/high → themselves) to the top-level reasoning_effort parameter 3. Sets extra_body.thinking.type alongside the effort 4. Strips reasoning_content from assistant messages sent back to DeepSeek, preventing 400 errors when thinking was enabled |
||
|
|
31ba2b0cbc
|
fix(xai-oauth): recover from prelude SSE errors, gate reasoning replay, surface entitlement 403s (#26644)
Three fixes for the May 2026 xAI OAuth (SuperGrok / X Premium) rollout
failures:
- _run_codex_stream: when openai SDK raises RuntimeError("Expected to
have received `response.created` before `<type>`"), retry once then
fall back to responses.create(stream=True) — same path used for
missing-response.completed postlude. Fallback surfaces the real
provider error with body+status_code intact. Also fixes #8133
(response.in_progress prelude on custom relays) and #14634
(codex.rate_limits prelude on codex-lb).
- _summarize_api_error: when error body matches xAI's entitlement
shape, append a one-line hint pointing to https://grok.com and
/model. Once-only, applies to both auxiliary warnings and
main-loop error surfacing.
- _chat_messages_to_responses_input: new is_xai_responses kwarg
drops replayed codex_reasoning_items (encrypted_content) before
they reach xAI. Also drops reasoning.encrypted_content from the
xAI include array. Native Codex behavior unchanged. Grok still
reasons natively each turn; coherence rides on visible message
text alone.
Closes #8133, #14634.
|
||
|
|
13c3d4b4ef | feat(nvidia): add NIM billing origin header | ||
|
|
b62c997973 |
feat(xai-oauth): add xAI Grok OAuth (SuperGrok Subscription) provider
Adds a new authentication provider that lets SuperGrok subscribers sign in to Hermes with their xAI account via the standard OAuth 2.0 PKCE loopback flow, instead of pasting a raw API key from console.x.ai. Highlights ---------- * OAuth 2.0 PKCE loopback login against accounts.x.ai with discovery, state/nonce, and a strict CORS-origin allowlist on the callback. * Authorize URL carries `plan=generic` (required for non-allowlisted loopback clients) and `referrer=hermes-agent` for best-effort attribution in xAI's OAuth server logs. * Token storage in `auth.json` with file-locked atomic writes; JWT `exp`-based expiry detection with skew; refresh-token rotation synced both ways between the singleton store and the credential pool so multi-process / multi-profile setups don't tear each other's refresh tokens. * Reactive 401 retry: on a 401 from the xAI Responses API, the agent refreshes the token, swaps it back into `self.api_key`, and retries the call once. Guarded against silent account swaps when the active key was sourced from a different (manual) pool entry. * Auxiliary tasks (curator, vision, embeddings, etc.) route through a dedicated xAI Responses-mode auxiliary client instead of falling back to OpenRouter billing. * Direct HTTP tools (`tools/xai_http.py`, transcription, TTS, image-gen plugin) resolve credentials through a unified runtime → singleton → env-var fallback chain so xai-oauth users get them for free. * `hermes auth add xai-oauth` and `hermes auth remove xai-oauth N` are wired through the standard auth-commands surface; remove cleans up the singleton loopback_pkce entry so it doesn't silently reinstate. * `hermes model` provider picker shows "xAI Grok OAuth (SuperGrok Subscription)" and the model-flow falls back to pool credentials when the singleton is missing. Hardening --------- * Discovery and refresh responses validate the returned `token_endpoint` host against the same `*.x.ai` allowlist as the authorization endpoint, blocking MITM persistence of a hostile endpoint. * Discovery / refresh / token-exchange `response.json()` calls are wrapped to raise typed `AuthError` on malformed bodies (captive portals, proxy error pages) instead of leaking JSONDecodeError tracebacks. * `prompt_cache_key` is routed through `extra_body` on the codex transport (sending it as a top-level kwarg trips xAI's SDK with a TypeError). * Credential-pool sync-back preserves `active_provider` so refreshing an OAuth entry doesn't silently flip the active provider out from under the running agent. Testing ------- * New `tests/hermes_cli/test_auth_xai_oauth_provider.py` (~63 tests) covers JWT expiry, OAuth URL params (plan + referrer), CORS origins, redirect URI validation, singleton↔pool sync, concurrency races, refresh error paths, runtime resolution, and malformed-JSON guards. * Extended `test_credential_pool.py`, `test_codex_transport.py`, and `test_run_agent_codex_responses.py` cover the pool sync-back, `extra_body` routing, and 401 reactive refresh paths. * 165 tests passing on this branch via `scripts/run_tests.sh`. |
||
|
|
db84a78e61
|
fix(langfuse): complete observability fix — trace I/O, tool outputs, placeholder credentials (closes #22342, #22763) (#26320)
* fix(langfuse): reject placeholder credentials with one-shot warning When operators leave HERMES_LANGFUSE_PUBLIC_KEY / HERMES_LANGFUSE_SECRET_KEY at a template value like 'placeholder', 'test-key', or 'your-langfuse-key', the Langfuse SDK silently accepts the credentials at construction time and drops every trace at flush time. No warning, no error — just an empty Langfuse dashboard the operator only notices hours later. Add prefix-based validation in _get_langfuse() against the documented 'pk-lf-' / 'sk-lf-' prefixes that Langfuse always issues server-side. Anything else fires a single warning naming the offending env var(s) with a log-safe value preview (full string for short placeholders so the operator knows which template they left in place; truncated for long values so a real secret pasted into the wrong field never hits the log), then short-circuits via the existing _INIT_FAILED cache so the warning fires once per process, not once per hook invocation. The check sits after the 'Langfuse is None' SDK-installed guard so hosts without the optional langfuse SDK don't see misleading 'set real keys' hints when the actionable fix is 'pip install langfuse'. Missing credentials remains the documented opt-out path and stays silent — no log noise for unconfigured installs. Fixes #22763 Fixes #23823 * fix(langfuse): use actual API request messages for generation input on_pre_llm_request previously used the messages kwarg alone, which could be None when Hermes passes the payload via request_messages, conversation_history, or user_message instead. Add _coerce_request_messages to pick the first available list across all variants, falling back to a synthetic user message. Generations now show the real outbound payload rather than an empty input. * fix(langfuse): record tool call outputs in traces Tool observations showed input (arguments) but output was always undefined. Root cause: when tool_call_id is empty, pre_tool_call stored observations under a unique time-based key that post_tool_call could never reconstruct, so every tool span was closed without output by the _finish_trace sweep. Fix pre/post matching by routing empty-tool_call_id tools through a per-name FIFO queue (pending_tools_by_name) instead of the time-based key. Tools with a tool_call_id continue to use the id-keyed dict. Also: - Preserve OpenAI-style nested function shape in serialized tool calls so Langfuse renders name/arguments correctly - Keep name + tool_call_id on role:tool messages for proper pairing - Backfill tool results onto the matching turn_tool_calls entry so the generation's tool-call record carries the result alongside arguments - Coerce request messages from whichever field the runtime provides (request_messages, messages, conversation_history, user_message) * fix(langfuse): salvage-review polish — drop dead is_first_turn, shallow-copy request_messages, real threaded FIFO test Self-review of the combined #22345 + #23831 salvage surfaced three issues worth fixing in the same PR rather than as follow-ups: 1. Drop is_first_turn from the pre_api_request hook. The boolean expression `not bool(conversation_history)` was wrong: conversation_history is reassigned to None mid-run after compression (5 sites in run_agent.py), so the value flips False -> True mid-conversation on every post-compression API call. The langfuse plugin never consumed it, so the kwarg was both misleading AND dead. 2. Replace copy.deepcopy(request_messages) with shallow list() copy. The pre_api_request hook contract discards return values (invoke_hook never writes back to api_kwargs), and the langfuse plugin's _serialize_messages already builds its own snapshot dicts via _safe_value. A deepcopy on every API call would walk every tool result and base64 image — significant overhead for no real isolation benefit. Shallow copy of the outer list protects against later mutations of api_messages without paying for the inner-dict walk. 3. Rename test_empty_tool_call_id_concurrent_fifo_order -> test_empty_tool_call_id_observations_are_fifo_within_tool_name and add a real test_threaded_post_calls_preserve_fifo_under_lock that spawns 8 threads behind a barrier to actually exercise _STATE_LOCK on the pending_tools_by_name queue. The original test was sequential and only validated Python list semantics; this one validates the lock discipline. 4. Fix stale 'Cleared by reset_cache_for_tests()' comment on _INIT_FAILED — that function does not exist. Tests reload the module via sys.modules.pop + importlib.import_module instead. Tests: 37 langfuse plugin tests pass, 658 plugin tests overall pass. --------- Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com> Co-authored-by: Brian Conklin <brian@dralth.com> |
||
|
|
4f8aaf1046 |
perf(run_agent): accumulate length-continuation prefix via list+join
Replace O(n²) string concatenation of truncated_response_prefix in the length-continuation retry loop with a list + ''.join(). Functionally equivalent: same partial response on early return, same prepend on final assembly. The legacy retry path is capped at 3 iterations, so the practical wall-clock win is small, but the new idiom matches the rest of the codebase and removes a needless repeated allocation. Salvaged from PR #2717 (the run_conversation portion only — trajectory refactor dropped because it silently rewrote </tool_response> to </think>). Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com> |
||
|
|
1702a94c88
|
Merge pull request #25957 from stephenschoettler/fix/main-ci-unblocker-after-21012
fix(ci): stabilize shared test state after 21012 |
||
|
|
a28add199d | fix(agent): keep image tool results from poisoning text-only sessions | ||
|
|
5ce0067c08 | fix(ci): stabilize shared test state after 21012 | ||
|
|
12f755c9eb
|
fix(codex-runtime): retire wedged sessions + post-tool watchdog + OAuth refresh classify (#25769)
Mirrors openclaw beta.8's app-server resilience fixes so a stuck codex subprocess can't burn the full turn deadline and so users get a `codex login` pointer instead of raw RPC errors when their token expires. - TurnResult.should_retire signals the caller to drop+respawn codex. - Deadline-hit path and dead-subprocess detection set should_retire so the next turn doesn't ride a CPU-spinning or auth-broken process. - Post-tool watchdog (post_tool_quiet_timeout=90s): if a tool item completes and codex goes silent past the threshold without further output or turn/completed, fast-fail instead of waiting the full 600s. Resets on any non-tool activity so normal think-after-tool flows are not affected. - <turn_aborted> and <turn_aborted/> in agent text are treated as terminal — some codex builds tear down a turn that way without emitting turn/completed. - _classify_oauth_failure() inspects RPC error message + stderr tail for invalid_grant / token refresh / 401 / etc. and rewrites user-facing errors to 'run codex login'. Conservative: generic failures still surface verbatim. Fires at turn/start failure, turn/completed failure, and dead-subprocess paths. - thread/start cross-fill: tolerate thread.id, thread.sessionId, top-level sessionId/threadId so future codex schema drift doesn't KeyError us at handshake. - run_agent.py: when run_turn returns should_retire=True OR raises, close + null self._codex_session so the next turn respawns. Tests: +30 cases across session + integration suites. tests/agent/transports/test_codex_app_server_session.py 50/50 pass tests/run_agent/test_codex_app_server_integration.py 27/27 pass Broader codex scope (transports + cli runtime/migration) 376/376 pass |
||
|
|
55ba02befb |
fix(background-review): silence memory provider teardown output leak
Background review fork redirected stdout/stderr around run_conversation() so its iteration messages stay silent. But the memory-provider teardown (shutdown_memory_provider() and review_agent.close()) fired in the outer finally block AFTER the redirect_stdout context exited — so provider teardown prints (Honcho disconnect, Hindsight sync, etc.) leaked into the parent terminal at end of every turn. Moves the teardown inside the redirect_stdout scope on the success path (and nulls review_agent so the finally safety-net skips double-shutdown). The finally block is rewritten as an exception-path safety net that re-opens a devnull redirect, since the original 'with' context has already exited by the time finally runs. Salvage of #25342 by @ayushere (manually re-applied + merged conflict with current main's set_thread_tool_whitelist wiring). |
||
|
|
7becb19ea0 |
fix(auxiliary): forward custom_providers to compression model context-length detection
When auxiliary.compression.provider is "auto", the compression model reuses the main model's provider and base_url. The main model's context_length was correctly picking up custom_providers per-model overrides (via _custom_providers stored during __init__), but the auxiliary compression model's context-length detection path in _check_compression_model_feasibility was not passing custom_providers, causing it to skip step 0b and fall through to models.dev. This meant that for providers like NVIDIA NIM where the user has a per-model context_length in custom_providers (e.g. 196608 for minimax-m2.7), the auxiliary model would use the models.dev value (204800) instead of the user-configured one — a subtle discrepancy that could lead to silent compression issues when the auxiliary model doesn't actually support the detected context length. Fix: pass self._custom_providers (already stored as an instance attr during __init__) to the get_model_context_length() call for the auxiliary compression model. |
||
|
|
00ad3d3c9c | fix: show context compaction status | ||
|
|
efa97af7e2 |
fix(agent): add Xiaomi MiMo to reasoning_content echo-back providers
Xiaomi MiMo emits reasoning via OpenAI's reasoning_content field and requires reasoning_content on every assistant tool-call message when replaying history. Without echo-back, subsequent API calls fail with HTTP 400 — same shape as DeepSeek and Kimi/Moonshot thinking modes. Adds _needs_mimo_tool_reasoning() detection (provider == 'xiaomi', 'mimo' in model, or xiaomimimo.com base url) and wires it into the _needs_thinking_reasoning_pad() check. Salvage of #25358 by @ephron-ren (manually re-applied — original branch was severely stale against current main). |
||
|
|
4ceab16893 |
fix(compression): keep default protect_first_n at 3 + align ABC
Follow-up on the salvaged feat commit: - Keep the constructor / config / yaml-example default at 3 so existing gateway and CLI users see no behavioural change. PR #13754 (which this builds on) had lowered the default to 2 to chase pre-feature parity in the system-prompt-present case, at the cost of quietly halving the protected head for the gateway path (which strips the system prompt before calling compress()). With the new "system prompt is implicit" semantics, default 3 gives every caller a stable head shape. - agent/context_engine.py: bring the ABC's protect_first_n docstring in line with the new semantics so plugin context engines interpret the config key the same way the built-in compressor does. - tests: adjust the default-value test (3, not 2) and a stale comment; per-test protect_first_n=2/3/1 values added in PR #13754 stay as-is since those tests fix concrete head shapes. |
||
|
|
dee71a31e5 |
feat(compression): make protect_first_n configurable
The number of head messages preserved verbatim across context compactions was previously hardcoded to 3 in AIAgent.__init__. Expose it as `compression.protect_first_n` in config, matching the existing `protect_last_n` pattern. Motivation: users who rely on rolling compaction for long-running sessions had the opening user/assistant exchange pinned as head forever, which doesn't always match how they want the session framed after many compactions. Lowering to 1 preserves the system prompt + first non-system message; lowering to 0 preserves only the system prompt and lets the entire first exchange age out naturally through the summary. Semantics: `protect_first_n` counts non-system head messages protected **in addition to** the system prompt, which is always implicitly protected when present. Same meaning across both code paths: protect_first_n=0 → system prompt only (or nothing if no system message) protect_first_n=2 → system prompt + first 2 non-system messages (default) This unifies the CLI path (which reads messages with the system prompt at position 0) and the gateway path (where the gateway /compress handler strips the system prompt before calling compress() — see gateway/run.py L9150-9154 on the parent fork). Previously these two paths disagreed: CLI path: protect_first_n=1 → protect system prompt only Gateway path: protect_first_n=1 → protect first USER turn forever In practice on long-running gateway sessions the old semantics pinned whatever stale aside happened to be the first user message, reinserting it into every compaction summary indefinitely. Default chosen as 2 (not 3) so that the effective protected head count remains 3 messages in the common case — assuming a system prompt is present, default protection becomes system + 2 non-system = 3 total, matching the pre-feature behaviour where `protect_first_n` was hardcoded to protect 3 messages total. Sessions without a system prompt will see a small behaviour change (2 protected head messages instead of 3), but this is the rare path and the new semantics make the system-prompt-present case the well-defined one. Changes: - agent/context_compressor.py: redefine protect_first_n as the count of non-system head messages protected beyond the implicit system-prompt guarantee; both paths converge. Constructor default updated to 2. - hermes_cli/config.py: add `compression.protect_first_n` default (2), matching the new semantics. `show_config` label tweaked to 'Protect first: N non-system head messages' for clarity. - run_agent.py: read protect_first_n from config; 0 is now valid (system prompt is always implicitly protected). - cli-config.yaml.example: document the new key and rationale. - tests/agent/test_context_compressor.py: cover default, override, the end-to-end `protect_first_n=0` and `protect_first_n=1` behaviour, the no-system-prompt (gateway) path, and the new shared-semantics regression test. Fixes #13751 Tested on Ubuntu 24.04. |
||
|
|
07349ce4df |
fix(memory): pin session_start + session_id on background review fork
Belt-and-suspenders complement to the cached-system-prompt inheritance: pin session_start and session_id to the parent's so any code path that re-renders parts of the system prompt (compression, plugin hooks) still produces byte-identical output. The cached-prompt assignment already short-circuits the normal rebuild path, but these pins guarantee parity even if a future code path bypasses the cache. Idea from simpolism's reference PR #25427 for #25322. Co-Authored-By: simpolism <32201324+simpolism@users.noreply.github.com> |
||
|
|
5fe0672260 |
fix(memory): hit prefix cache in background review fork
Background review fork is supposed to hit Anthropic's prefix cache on the parent's messages_snapshot, but currently doesn't (cache_read=0 on every fork). Two root causes, fixed in this commit: 1. System prompt is rebuilt at fork time. _cached_system_prompt starts as None, so run_conversation calls _build_system_prompt, which embeds a minute-precision "Conversation started: ..." timestamp. Reviews fire 10+ turns after session start, so the minute differs from main's, producing a 1-character diff that invalidates the byte-exact cache key. Fix: inherit the parent's _cached_system_prompt directly (same idea as #17089, which was self-closed for only fixing this half). 2. Tools schema was narrowed via enabled_toolsets=["memory","skills"] for safety. Anthropic's cache key includes `tools`, which sits before `system` in the cache hierarchy, so even byte-identical `system` won't hit when `tools` differs from main's full set. Fix: drop the schema-level restriction so `tools` matches main, and deny non-whitelisted tools at runtime via the existing get_pre_tool_call_block_message gate (hermes_cli/plugins.py:1085, already called at all three dispatch sites). Install/clear a thread- local whitelist (added in the previous commit) on the daemon thread. Append a soft constraint to the review prompt so the model knows. Real E2E on Sonnet 4.5 (12-tool task + auto-triggered review): - Per review-call cost: $0.331 → $0.035 (~89% reduction) - End-to-end per run: $0.848 → $0.629 (~26% reduction) - Review fork cache_create / cache_read: 88,385 / 0 → 1,234 / 94,404 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
091d8e1030
|
feat(codex-runtime): optional codex app-server runtime for OpenAI/Codex models (#24182)
* feat(codex-runtime): scaffold optional codex app-server runtime
Foundational commit for an opt-in alternate runtime that hands OpenAI/Codex
turns to a 'codex app-server' subprocess instead of Hermes' tool dispatch.
Default behavior is unchanged.
Lands in three pieces:
1. agent/transports/codex_app_server.py — JSON-RPC 2.0 over stdio speaker
for codex's app-server protocol (codex-rs/app-server). Spawn, init
handshake, request/response, notification queue, server-initiated
request queue (for approval round-trips), interrupt-friendly blocking
reads. Tested against real codex 0.130.0 binary end-to-end during
development.
2. hermes_cli/runtime_provider.py:
- Adds 'codex_app_server' to _VALID_API_MODES.
- Adds _maybe_apply_codex_app_server_runtime() helper, called at the
end of _resolve_runtime_from_pool_entry(). Inert unless
'model.openai_runtime: codex_app_server' is set in config.yaml AND
provider in {openai, openai-codex}. Other providers cannot be
rerouted (anthropic, openrouter, etc. preserved).
3. tests/agent/transports/test_codex_app_server_runtime.py — 24 tests
covering api_mode registration, the rewriter helper (default-off,
case-insensitive, opt-in, non-eligible providers preserved), version
parser, missing-binary handling, error class. Does NOT require codex
CLI installed.
This commit is wire-only: the api_mode is recognized but AIAgent does
not yet branch on it. Followup commits add the session adapter, event
projector, approval bridge, transcript projection (so memory/skill
review still works), plugin migration, and slash command.
Existing tests remain green:
- tests/cli/test_cli_provider_resolution.py (29 passed)
- tests/agent/test_credential_pool_routing.py (included above)
* feat(codex-runtime): add codex item projector for memory/skill review
The translator that lets Hermes' self-improvement loop keep working under the
Codex runtime: converts codex 'item/*' notifications into Hermes' standard
{role, content, tool_calls, tool_call_id} message shape that
agent/curator.py already knows how to read.
Item taxonomy (matches codex-rs/app-server-protocol/src/protocol/v2/item.rs):
- userMessage → {role: user, content}
- agentMessage → {role: assistant, content: text}
- reasoning → stashed in next assistant's 'reasoning' field
- commandExecution → assistant tool_call(name='exec_command') + tool result
- fileChange → assistant tool_call(name='apply_patch') + tool result
- mcpToolCall → assistant tool_call(name='mcp.<server>.<tool>') + tool result
- dynamicToolCall → assistant tool_call(name=<tool>) + tool result
- plan/hookPrompt/etc → opaque assistant note, no fabricated tool_calls
Invariants preserved:
- Message role alternation never violated: each tool item produces at most
one assistant + one tool message in that order, correlated by call_id.
- Streaming deltas (item/<type>/outputDelta, item/agentMessage/delta)
don't materialize messages — only item/completed does. Mirrors how
Hermes already only writes the assistant message after streaming ends.
- Tool call ids are deterministic (codex item id-based) so replays produce
identical messages and prefix caches stay valid (AGENTS.md pitfall #16).
- JSON args use sorted_keys for the same reason.
Real wire formats verified against codex 0.130.0 by capturing live
notifications from thread/shellCommand and including one as a fixture
(COMMAND_EXEC_COMPLETED).
23 new tests, all green:
- Streaming deltas don't materialize (3 paths)
- Turn/thread frame events are silent
- commandExecution: 5 tests including non-zero exit annotation +
deterministic id stability across replays
- agentMessage + reasoning attachment + reasoning consumption
- fileChange: summary without inlined content
- mcpToolCall: namespaced naming + error surfacing
- userMessage: text fragments only (drops images/etc)
- opaque items: no fabricated tool_calls
- Helpers: deterministic id stability + sorted JSON args
- Role alternation invariant across all four tool-shaped item types
This commit is a pure addition. AIAgent integration (the wire that uses the
projector) is the next commit.
* feat(codex-runtime): add session adapter + approval bridge
The third self-contained module: CodexAppServerSession owns one Codex
thread per Hermes session, drives turn/start, consumes streaming
notifications via CodexEventProjector, handles server-initiated approval
requests, and translates cancellation into turn/interrupt.
The adapter has a single public per-turn method:
result = session.run_turn(user_input='...', turn_timeout=600)
# result.final_text → assistant text for the caller
# result.projected_messages → list ready to splice into AIAgent.messages
# result.tool_iterations → tick count for _iters_since_skill nudge
# result.interrupted → True on Ctrl+C / deadline / interrupt
# result.error → error string when the turn cannot complete
# result.turn_id, thread_id → for sessions DB / resume
Behavior:
- ensure_started() spawns codex, does the initialize handshake, and
issues thread/start with cwd + permissions profile. Idempotent.
- run_turn() blocks until turn/completed, drains server-initiated
requests (approvals) before reading notifications so codex never
deadlocks waiting for us, projects every item/completed via the
projector, and increments tool_iterations for the skill nudge gate.
- request_interrupt() is thread-safe (threading.Event); the next loop
iteration issues turn/interrupt and unwinds.
- turn_timeout deadlock guard issues turn/interrupt and records an
error if the turn never completes.
- close() escalates terminate → kill via the underlying client.
Approval bridge:
Codex emits server-initiated requests for execCommandApproval and
applyPatchApproval. The adapter translates Hermes' approval choice
vocabulary onto codex's decision vocabulary:
Hermes 'once' → codex 'approved'
Hermes 'session' or 'always' → codex 'approvedForSession'
Hermes 'deny' / anything else → codex 'denied'
Routing precedence:
1. _ServerRequestRouting.auto_approve_* flags (cron / non-interactive)
2. approval_callback wired by the CLI (defers to
tools.approval.prompt_dangerous_approval())
3. Fail-closed denial when neither is wired
Unknown server-request methods are answered with JSON-RPC error -32601
so codex doesn't hang waiting for us.
Permission profile mapping mirrors AGENTS.md:
Hermes 'auto' → codex 'workspace-write'
Hermes 'approval-required' → codex 'read-only-with-approval'
Hermes 'unrestricted/yolo' → codex 'full-access'
20 new tests, all green. Combined with prior commits this PR now has
67 tests across three modules:
- test_codex_app_server_runtime.py: 24 (api_mode + transport surface)
- test_codex_event_projector.py: 23 (item taxonomy projections)
- test_codex_app_server_session.py: 20 (turn loop + approvals + interrupts)
Full tests/agent/transports/ directory: 249/249 pass — no regressions
to existing transport tests.
Still no wire into AIAgent.run_conversation(); that integration commit
is small and goes next.
* feat(codex-runtime): wire codex_app_server runtime into AIAgent
The integration commit. AIAgent.run_conversation() now early-returns to a
new helper _run_codex_app_server_turn() when self.api_mode ==
'codex_app_server', bypassing the chat_completions tool loop entirely.
Three small surgical edits to run_agent.py (~105 LOC total):
1. Line ~1204 (constructor api_mode validation set):
Add 'codex_app_server' so an explicit api_mode='codex_app_server'
passed to AIAgent() isn't silently rewritten to 'chat_completions'.
2. Line ~12048 (run_conversation, just before the while loop):
Early-return to _run_codex_app_server_turn() when self.api_mode is
'codex_app_server'. Placed AFTER all standard pre-loop setup —
logging context, session DB, surrogate sanitization, _user_turn_count
and _turns_since_memory increments, _ext_prefetch_cache, memory
manager on_turn_start — so behavior outside the model-call loop is
identical between paths. Default Hermes flow is unchanged when the
flag is off.
3. End-of-class (line ~15497):
New method _run_codex_app_server_turn(). Lazy-instantiates one
CodexAppServerSession per AIAgent (reused across turns), runs the
turn, splices projected_messages into messages, increments
_iters_since_skill by tool_iterations (since the chat_completions
loop normally does that per iteration), fires
_spawn_background_review on the same cadence as the default path.
Counter accounting:
_turns_since_memory ← already incremented at run_conversation:11817
(gated on memory store configured) — codex
helper does NOT touch it (would double-count).
_user_turn_count ← already incremented at run_conversation:11793
— codex helper does NOT touch it.
_iters_since_skill ← incremented in the chat_completions loop per
tool iteration. Codex helper increments by
turn.tool_iterations since the loop is bypassed.
User message:
ALREADY appended to messages by run_conversation pre-loop (line 11823)
before the early-return reaches us. Helper does NOT append again.
Regression test test_user_message_not_duplicated guards this.
Approval callback wiring:
Lazy-fetches tools.terminal_tool._get_approval_callback at session
spawn time, passes to CodexAppServerSession. CLI threads with
prompt_toolkit get interactive approvals; gateway/cron contexts get
the codex-side fail-closed deny.
Error path:
Codex session exceptions become a 'partial' result with completed=False
and a final_response that explicitly tells the user how to switch back:
'Codex app-server turn failed: ... Fall back to default runtime with
/codex-runtime auto.' Same return-dict shape as the chat_completions
path so all callers (gateway, CLI, batch_runner, ACP) work unchanged.
9 new integration tests in tests/run_agent/test_codex_app_server_integration.py:
- api_mode='codex_app_server' is accepted on AIAgent construction
- run_conversation returns the expected codex shape
(final_response, codex_thread_id, codex_turn_id, completed, partial)
- Projected messages are spliced into messages list
- _iters_since_skill ticks per tool iteration
- _user_turn_count delegated to standard flow (not double-counted)
- User message appears exactly once (regression guard)
- _spawn_background_review IS invoked (memory/skill review keeps working)
- chat.completions.create is NEVER called (loop fully bypassed)
- Session exception → partial result with /codex-runtime auto hint
- Interrupted turn → partial result with error preserved
Adjacent test runs confirm no regressions:
- tests/run_agent/test_memory_nudge_counter_hydration.py: green
- tests/run_agent/test_background_review.py: green
- tests/run_agent/test_fallback_model.py: green
- tests/agent/transports/: 249/249 green
Still missing for full feature: /codex-runtime slash command, plugin
migration helper, docs page, live e2e test gated on codex binary. Those
are the remaining followup commits.
* feat(codex-runtime): add /codex-runtime slash command (CLI + gateway)
User-facing toggle for the optional codex app-server runtime. Follows the
'Adding a Slash Command (All Platforms)' pattern from AGENTS.md exactly:
single CommandDef in the central registry → CLI handler → gateway handler
→ running-agent guard → all surfaces (autocomplete, /help, Telegram menu,
Slack subcommands) update automatically.
Surface:
/codex-runtime — show current state + codex CLI status
/codex-runtime auto — Hermes default runtime
/codex-runtime codex_app_server — codex subprocess runtime
/codex-runtime on / off — synonyms
Files changed:
hermes_cli/codex_runtime_switch.py (new):
Pure-Python state machine shared by CLI and gateway. Parse args,
read/write model.openai_runtime in the config dict, gate enabling
behind a codex --version check (don't let users opt in to a runtime
they have no binary for; print npm install hint instead).
Returns a CodexRuntimeStatus dataclass that callers render however
suits their surface.
hermes_cli/commands.py:
Single CommandDef entry, no aliases (codex-runtime is its own thing).
cli.py:
Dispatch in process_command() + _handle_codex_runtime() handler that
delegates to the shared module and renders results via _cprint.
gateway/run.py:
Dispatch in _handle_message() + _handle_codex_runtime_command() that
returns a string (gateway sends as message). On a successful change
that requires a new session, _evict_cached_agent() forces the next
inbound message to construct a fresh AIAgent with the new api_mode —
avoids prompt-cache invalidation mid-session.
gateway/run.py running-agent guard:
/codex-runtime joins /model in the early-intercept block so a runtime
flip mid-turn can't split a turn across two transports.
Tests:
tests/hermes_cli/test_codex_runtime_switch.py — 25 tests covering the
state machine: arg parsing (10 cases incl. case-insensitive and
synonyms), reading current runtime (5 cases incl. malformed configs),
writing runtime (3 cases), apply() entry point covering read-only,
no-op, codex-missing-blocked, codex-present-success, disable-no-binary-check,
and persist-failure paths (8 cases). All green.
Adjacent test suites confirm no regressions:
- tests/hermes_cli/test_commands.py + test_codex_runtime_switch.py:
167/167 green
- tests/agent/transports/: 283/283 green when combined with prior commits
Still missing: plugin migration helper, docs page, live e2e test gated on
codex binary. Followup commits.
* feat(codex-runtime): auto-migrate Hermes MCP servers to ~/.codex/config.toml
Translates the user's mcp_servers config from ~/.hermes/config.yaml into
the TOML format codex's MCP client expects. Wired into the
/codex-runtime codex_app_server enable path so users get their MCP tool
surface in the spawned subprocess automatically.
The migration runs on every enable. Failures are non-fatal — the runtime
change still proceeds and the user gets a warning so they can fix the
codex config manually.
What translates (mapping verified against codex-rs/core/src/config/edit.rs):
Hermes mcp_servers.<n>.command/args/env → codex stdio transport
Hermes mcp_servers.<n>.url/headers → codex streamable_http transport
Hermes mcp_servers.<n>.timeout → codex tool_timeout_sec
Hermes mcp_servers.<n>.connect_timeout → codex startup_timeout_sec
Hermes mcp_servers.<n>.cwd → codex stdio cwd
Hermes mcp_servers.<n>.enabled: false → codex enabled = false
What does NOT translate (warned + skipped per server):
Hermes-specific keys (sampling, etc.) — codex's MCP client has no
equivalent. Listed in the per-server skipped[] field of the report.
What's NOT migrated (intentional):
AGENTS.md — codex respects this file natively in its cwd. Hermes' own
AGENTS.md (project-level) is already in the worktree, so codex picks
it up without translation. No code needed.
Idempotency design:
All managed content lives between a 'managed by hermes-agent' marker
and the next non-mcp_servers section header. _strip_existing_managed_block
removes the prior managed region cleanly, preserving any user-added
codex config (model, providers.openai, sandbox profiles, etc.) above
or below.
Files added:
hermes_cli/codex_runtime_plugin_migration.py — pure-Python migration
helper. Public API: migrate(hermes_config, codex_home=None,
dry_run=False) returns MigrationReport with .migrated/.errors/
.skipped_keys_per_server. No external TOML dependency — minimal
formatter handles strings/numbers/booleans/lists/inline-tables.
tests/hermes_cli/test_codex_runtime_plugin_migration.py — 39 tests
covering:
- per-server translation (12): stdio/http/sse, cwd, timeouts,
enabled flag, command+url precedence, sampling drop, unknown keys
- TOML formatter (8): types, escaping, inline tables, error case
- existing-block stripping (4): no marker, alone, with user content
above, with user content below
- end-to-end migrate() (8): empty, dry-run, round-trip, idempotent
re-run, preserves user config, error reporting, invalid input,
summary formatting
Files changed:
hermes_cli/codex_runtime_switch.py — apply() now calls migrate() in
the codex_app_server enable branch. Migration failure logs a warning
in the result message but does NOT fail the runtime change. Disable
path (auto) explicitly skips migration.
tests/hermes_cli/test_codex_runtime_switch.py — 3 new tests:
test_enable_triggers_mcp_migration, test_disable_does_not_trigger_migration,
test_migration_failure_does_not_block_enable.
All 325 feature tests green:
- tests/agent/transports/: 249 (incl. 67 new)
- tests/run_agent/test_codex_app_server_integration.py: 9
- tests/hermes_cli/test_codex_runtime_switch.py: 28 (3 new)
- tests/hermes_cli/test_codex_runtime_plugin_migration.py: 39 (new)
* perf(codex-runtime): cache codex --version check within apply()
Single /codex-runtime invocation could spawn 'codex --version' up to 3
times (state report, enable gate, success message). Each spawn is ~50ms,
so the cumulative cost wasn't a crisis, but it was wasteful and turned a
trivial slash command into something noticeably laggy on slower systems.
Refactored to lazy-once via a closure over a nonlocal cache. First call
spawns; subsequent calls in the same apply() reuse the result.
Behavior unchanged — same return shape, same error handling, same install
hint when codex is missing. Just one subprocess per call instead of three.
Two regression-guard tests added:
- test_binary_check_cached_within_apply: enable path → call_count == 1
- test_binary_check_cached_on_read_only_call: state-report path → call_count == 1
Total tests for /codex-runtime now 30 (was 28); all 143 codex-runtime
tests still green.
* fix(codex-runtime): correct protocol field names found via live e2e test
Three real bugs caught only by running a turn end-to-end against codex
0.130.0 with a real ChatGPT subscription. Unit tests passed because they
asserted on our own (incorrect) wire shapes; the wire format from
codex-rs/app-server-protocol/src/protocol/v2/* is the source of truth and
my initial reading of the README was incomplete.
Bug 1: thread/start.permissions wire format
Was sending {"profileId": "workspace-write"}.
Real format per PermissionProfileSelectionParams enum (tagged union):
{"type": "profile", "id": "workspace-write"}
AND requires the experimentalApi capability declared during initialize.
AND requires a matching [permissions] table in ~/.codex/config.toml or
codex fails the request with 'default_permissions requires a [permissions]
table'.
Fix: stop overriding permissions on thread/start. Codex picks its default
profile (read-only unless user configures otherwise), which matches what
codex CLI users expect — they configure their default permission profile
in ~/.codex/config.toml the standard way. Trying to be clever about
profile selection broke every turn we tested.
Live error before fix: 'Invalid request: missing field type' on every
turn/start, even though our turn/start payload was correct — the field
codex was complaining about was inside the permissions sub-object we
shouldn't have been sending.
Bug 2: server-request method names
Was matching 'execCommandApproval' and 'applyPatchApproval'.
Real names per common.rs ServerRequest enum:
item/commandExecution/requestApproval
item/fileChange/requestApproval
item/permissions/requestApproval (new third method)
Fix: match the documented names. Added handler for
item/permissions/requestApproval that always declines — codex sometimes
asks to escalate permissions mid-turn and silent acceptance would surprise
users.
Live symptom before fix: agent.log showed
'Unknown codex server request: item/commandExecution/requestApproval'
and codex stalled because we replied with -32601 (unsupported method)
instead of an approval decision. The agent reported back 'The write
command was rejected' even though Hermes never showed the user an
approval prompt.
Bug 3: approval decision values
Was sending decision strings 'approved'/'approvedForSession'/'denied'.
Real values per CommandExecutionApprovalDecision enum (camelCase):
accept, acceptForSession, decline, cancel
(also AcceptWithExecpolicyAmendment and ApplyNetworkPolicyAmendment
variants we don't currently use).
Fix: rename _approval_choice_to_codex_decision return values; update
auto_approve_* fallbacks; update fail-closed default from 'denied' to
'decline'. Test mapping table updated to match.
Live test verified after fixes:
$ hermes (with model.openai_runtime: codex_app_server)
> Run the shell command: echo hermes-codex-livetest > .../proof.txt
then read it back
Approval prompt fired with 'Codex requests exec in <cwd>'.
User chose 'Allow once'. Codex executed the command, wrote the file,
read it back. Final response: 'Read back from proof.txt:
hermes-codex-livetest'. File contents on disk match.
agent.log confirms:
codex app-server thread started: id=019e200e profile=workspace-write
cwd=/tmp/hermes-codex-livetest/workspace
All 20 session tests still green after wire-format updates.
* fix(codex-runtime): correct apply_patch approval params + ship docs
Live e2e revealed FileChangeRequestApprovalParams doesn't carry the
changeset (just itemId, threadId, turnId, reason, grantRoot) — Codex's
'reason' field describes what the patch wants to do. Test config and
display logic updated to use it. The first 'apply_patch (0 change(s))'
display from the live test is now 'apply_patch: <reason>'.
Adds website/docs/user-guide/features/codex-app-server-runtime.md
covering enable/disable, prerequisites, approval UX, MCP migration
behavior, permission profile delegation to ~/.codex/config.toml, known
limitations, and the architecture diagram. Wired into the Automation
category in sidebars.ts.
Live e2e validation across the path matrix:
✓ thread/start handshake
✓ turn/start with text input
✓ commandExecution items + projection
✓ item/commandExecution/requestApproval → Hermes UI → response
✓ Approve once → command runs
✓ Deny → command rejected, codex falls back to read-only message
✓ Multi-turn (codex remembers prior turn's results)
✓ apply_patch via Codex's fileChange path
✓ item/fileChange/requestApproval → Hermes UI
✓ MCP server migration loads inside spawned codex (verified via
'use the filesystem MCP tool' prompt)
✓ /codex-runtime auto → codex_app_server toggle cycle
✓ Disable doesn't trigger migration
✓ Enable with codex CLI present succeeds + migrates
✓ Hermes-side interrupt path (turn/interrupt request issued cleanly
even if codex finishes before the interrupt lands)
Known live-validated limitations now documented in the docs page:
- delegate_task subagents unavailable on this runtime
- permission profile selection delegated to ~/.codex/config.toml
- apply_patch approval prompt has no inline changeset (codex protocol
doesn't expose it)
145/145 codex-runtime tests still green.
* feat(codex-runtime): native plugin migration + UX polish (quirks 2/4/5/10/11)
Major: migrate native Codex plugins (#7 in OpenClaw's PR list)
Discovers installed curated plugins via codex's plugin/list RPC and
writes [plugins."<name>@<marketplace>"] entries to ~/.codex/config.toml
so they're enabled in the spawned Codex sessions. This is the
'YouTube-video-worthy' bit Pash highlighted: when a user has
google-calendar, github, etc. installed in their Codex CLI, those
plugins activate automatically when they enable Hermes' codex runtime.
Implementation:
- hermes_cli/codex_runtime_plugin_migration.py: new _query_codex_plugins()
helper spawns 'codex app-server' briefly and walks plugin/list. Returns
(plugins, error) — failures are non-fatal so MCP migration still works.
- render_codex_toml_section() now takes plugins + permissions args.
- migrate() defaults: discover_plugins=True, default_permission_profile=
'workspace-write'. Explicit None on either disables that side.
- _strip_existing_managed_block() now also strips [plugins.*] and
[permissions]/[permissions.*] sections inside the managed block, so
re-runs replace plugins cleanly without touching codex's own config.
Quirk fixes:
#2 Default permissions profile written on enable.
Without this, Codex's read-only default kicks in and EVERY write
triggers an approval prompt. Now writes [permissions] default =
'workspace-write' so the runtime feels normal out of the box. Set
default_permission_profile=None to opt out.
#4 apply_patch approval prompt now shows what's changing.
Codex's FileChangeRequestApprovalParams doesn't carry the changeset.
Session adapter now caches the fileChange item from item/started
notifications and looks it up by itemId when codex requests approval.
Prompt shows '1 add, 1 update: /tmp/new.py, /tmp/old.py' instead of
'apply_patch (0 change(s))'.
Side benefit: also drains pending notifications BEFORE handling a
server request, so the projector and per-turn caches are up to date
when the approval decision fires. Bounded to 8 notifications per
loop iter to avoid starving codex's response.
#5/#10 Exec approval prompt never shows empty cwd.
When codex omits cwd in CommandExecutionRequestApprovalParams, fall
back to the session's cwd. If somehow neither is available, show
'<unknown>' explicitly instead of an empty string.
Also surfaces 'reason' from the approval params when codex provides
it — gives users more context on why codex wants to run something.
#11 Banner indicates the codex_app_server runtime when active.
New 'Runtime: codex app-server (terminal/file ops/MCP run inside
codex)' line appears in the welcome banner only when the runtime is
on. Default banner is unchanged.
Tests:
- 7 new tests in test_codex_runtime_plugin_migration.py covering
plugin discovery (mocked), failure handling, dry-run skip, opt-out
flag, idempotent re-runs, and permissions writing.
- 3 new tests in test_codex_app_server_session.py covering the
enriched approval prompts: cwd fallback, change summary on
apply_patch, fallback when no item/started cache exists.
- All 26 session tests + 46 migration tests green; 153 total in PR.
* feat(codex-runtime): hermes-tools MCP callback + native plugin migration
The big architectural addition: when codex_app_server runtime is on,
Hermes registers its own tool surface as an MCP server in
~/.codex/config.toml so the codex subprocess can call back into Hermes
for tools codex doesn't ship with — web_search, browser_*, vision,
image_generate, skills, TTS.
Also: 'migrate native codex plugins' (Pash's YouTube-video-worthy bit) —
when the user has plugins like Linear, GitHub, Gmail, Calendar, Canva
installed via 'codex plugin', Hermes discovers them via plugin/list and
writes [plugins.<name>@openai-curated] entries so they activate
automatically.
New module: agent/transports/hermes_tools_mcp_server.py
FastMCP stdio server exposing 17 Hermes tools. Each call dispatches
through model_tools.handle_function_call() — same code path as the
Hermes default runtime. Run with:
python -m agent.transports.hermes_tools_mcp_server [--verbose]
Exposed: web_search, web_extract, browser_navigate / _click / _type /
_press / _snapshot / _scroll / _back / _get_images / _console /
_vision, vision_analyze, image_generate, skill_view, skills_list,
text_to_speech.
NOT exposed (deliberately):
- terminal/shell/read_file/write_file/patch — codex has built-ins
- delegate_task/memory/session_search/todo — _AGENT_LOOP_TOOLS in
model_tools.py:493, require running AIAgent context. Documented
as a limitation and surfaced in the slash command output.
Migration changes (hermes_cli/codex_runtime_plugin_migration.py):
- _query_codex_plugins() spawns 'codex app-server' briefly to walk
plugin/list and pull installed openai-curated plugins. Failures are
non-fatal — MCP migration still completes.
- render_codex_toml_section() now takes plugins + permissions args
AND wraps the managed block with a MIGRATION_END_MARKER comment so
the stripper can reliably find both ends, even when the block
contains top-level keys (default_permissions = ...).
- migrate() defaults: discover_plugins=True, expose_hermes_tools=True,
default_permission_profile=':workspace' (built-in codex profile name
— must be prefixed with ':'). All three opt-out via explicit args.
- _build_hermes_tools_mcp_entry() builds the codex stdio entry with
HERMES_HOME and PYTHONPATH passthrough so a worktree-launched
Hermes points the MCP subprocess at the same module layout.
Live-caught wire bugs fixed during this turn:
1. Permission profile config key is top-level , NOT a [permissions] table. The [permissions] table is
for *user-defined* profiles with structured fields. Built-in
profile names start with ':' (':workspace', ':read-only',
':danger-no-sandbox'). Was emitting
which codex rejected with 'invalid type: string "X", expected
struct PermissionProfileToml'.
2. Built-in profile is , NOT . Codex
rejected with 'unknown built-in profile'.
3. Codex's MCP layer sends for
tool-call confirmation. We weren't handling it, so codex stalled
and returned 'MCP tool call was rejected'. Now: auto-accept for
our own hermes-tools server (user already opted in by enabling
the runtime), decline for third-party servers.
Quirk fixes shipped (from the limitations list):
#2 default permissions: workspace profile written on enable. No more
approval prompt on every write.
#4 apply_patch approval shows what's changing: cache fileChange
items from item/started, look up by itemId when codex sends
item/fileChange/requestApproval. Prompt: '1 add, 1 update:
/tmp/new.py, /tmp/old.py' instead of '0 change(s)'.
#5/#10 exec approval cwd never empty: fall back to session cwd, then
'<unknown>'. Also surfaces 'reason' from codex when present.
#11 banner shows 'Runtime: codex app-server' line when active so
users understand why tool counts may not match what's reachable.
Tests:
- 5 new tests in test_codex_runtime_plugin_migration.py covering
plugin discovery, expose_hermes_tools entry generation, idempotent
re-runs, opt-out flag, permissions profile.
- 3 new tests in test_codex_app_server_session.py covering enriched
approval prompts (cwd fallback, fileChange summary).
- 2 new tests for mcpServer/elicitation/request handling (accept
hermes-tools, decline others).
- New test file test_hermes_tools_mcp_server.py covering module
surface, EXPOSED_TOOLS safety invariants (no shell/file_ops,
no agent-loop tools), and main() error paths.
- 166 codex-runtime tests total, all green.
Live e2e validated against codex 0.130.0 + ChatGPT subscription:
✓ /codex-runtime codex_app_server enables, migrates filesystem MCP,
registers hermes-tools, writes default_permissions = ':workspace'
✓ Banner shows 'Runtime: codex app-server' line in subsequent sessions
✓ Shell command runs without approval prompt (workspace profile works)
✓ Multi-turn — codex remembers prior turn's results
✓ apply_patch path via fileChange request approval
✓ web_search via hermes-tools MCP callback returns real Firecrawl
results: 'OpenAI Codex CLI – Getting Started' end-to-end in 13s
✓ Disable cycle clean
Docs updated: website/docs/user-guide/features/codex-app-server-runtime.md
Full re-write covering native plugin migration, the hermes-tools
callback architecture, the prerequisites change ('codex login is
separate from hermes auth login codex'), the trade-off table now
reflecting which Hermes tools work via callback, and the limitations
list updated with what's actually unavailable on this runtime.
* feat(codex-runtime): pin user-config preservation invariant for quirk #6
Quirk #6 from the limitations list — user MCP servers / overrides /
codex-only sections in ~/.codex/config.toml that live OUTSIDE the
hermes-managed block must survive re-migration verbatim.
This already worked thanks to the MIGRATION_MARKER + MIGRATION_END_MARKER
pair I added when fixing the default_permissions wire format (so the
strip can find both ends of the managed region even with top-level
keys like default_permissions). But it was an emergent property
without a test pinning it.
Now explicitly tested:
- User MCP server above the managed block survives migration
- User MCP server below the managed block survives migration
- Both above + below survive a second re-migration
- User content (model, providers, sandbox, otel, etc.) outside our
region is left untouched
Docs added a section "Editing ~/.codex/config.toml safely" explaining
the marker contract — so users know they can add their own MCP
servers, override permissions, configure codex-only options, etc.
without fear of Hermes overwriting their work.
167 codex-runtime tests, all green.
* docs(codex-runtime): clarify the actual tool surface — shell covers terminal/read/write/find
Previous docs and PR description undersold what codex's built-in
toolset actually provides. apply_patch alone made it sound like the
runtime could only edit files in patch format — implying you'd lose
terminal use, read_file, write_file, search/find. That was wrong.
Codex's 'shell' tool runs arbitrary shell commands inside the sandbox,
which covers everything you'd do in bash: cat/head/tail (read), echo>
or heredocs (write), find/rg/grep (search), ls/cd (navigate), build/
test/git/etc. apply_patch is for structured multi-file edits on top
of that. update_plan is its in-runtime todo. view_image loads images.
And codex has its own web_search built in (in addition to the
Firecrawl-backed one Hermes exposes via MCP callback).
Docs now have a 'What tools the model actually has' section right
after Why, breaking the surface into three clearly-labeled buckets:
1. Codex's built-in toolset (always on) — shell, apply_patch,
update_plan, view_image, web_search; covers everything terminal-
adjacent.
2. Native Codex plugins (auto-migrated from your codex plugin
install) — Linear, GitHub, Gmail, Calendar, Outlook, Canva, etc.
3. Hermes tool callback (MCP server in ~/.codex/config.toml) —
web_search/web_extract via Firecrawl, browser_*, vision_analyze,
image_generate, skill_view/skills_list, text_to_speech.
Plus a 'What's NOT available' callout listing the four agent-loop tools
(delegate_task, memory, session_search, todo) that need running
AIAgent context and can't reach the codex runtime.
Trade-offs table broken out: shell, apply_patch, update_plan,
view_image, sandbox each get their own row with a one-line description
so users can see at a glance what's available natively.
Architecture diagram updated to list the codex built-ins by name
instead of 'apply_patch + shell + sandbox'.
No code changes — purely docs clarification. 167 codex-runtime tests
still green.
* fix(codex-runtime): _spawn_background_review signature + review fork api_mode downgrade
Two real bugs in the self-improvement loop integration that the previous
test mocked away.
Bug 1: wrong call signature
The codex helper was calling self._spawn_background_review() with no
args after every turn. That function actually requires:
messages_snapshot=list (positional or keyword)
review_memory=bool (at least one trigger must be True)
review_skills=bool
So the call would have raised TypeError at runtime — except the only
test that exercised this path mocked _spawn_background_review entirely
and just asserted spawn.called, so the wrong-arg shape never surfaced.
Bug 2: review fork inherits codex_app_server api_mode
The review fork is constructed with:
api_mode = _parent_runtime.get('api_mode')
So when the parent is codex_app_server, the review fork ALSO runs as
codex_app_server. But the review fork's whole job is to call agent-loop
tools (memory, skill_manage) which require Hermes' own dispatch — they
short-circuit with 'must be handled by the agent loop' on the codex
runtime. So the review fork would have run, decided to save something,
called memory or skill_manage, and silently no-op'd.
Fixed in run_agent.py:_spawn_background_review() — when the parent
api_mode is 'codex_app_server', the review fork is downgraded to
'codex_responses' (same OAuth credentials, same openai-codex provider,
but talks to OpenAI's Responses API directly so Hermes owns the loop).
Also rewrote the codex helper's review wiring to match the
chat_completions path:
- Computes _should_review_memory in the pre-loop block (was already
being computed; now passed through to the helper as an arg).
- Computes _should_review_skills AFTER the codex turn returns +
counters tick (line ~15432 pattern in chat_completions).
- Calls _spawn_background_review(messages_snapshot=, review_memory=,
review_skills=) only when at least one trigger fires.
- Adds the external memory provider sync (_sync_external_memory_for_turn)
that the chat_completions path runs after every turn.
Tests:
Replaced the broken test_background_review_invoked (which only
asserted spawn.called) with three sharper tests:
- test_background_review_NOT_invoked_below_threshold:
single turn at default thresholds → no review fires (would have
caught the original 'every turn calls spawn with no args' bug)
- test_background_review_skill_trigger_fires_above_threshold:
10 tool_iterations at threshold=10 → review fires with
messages_snapshot=list, review_skills=True, counter resets
- test_background_review_signature_never_breaks: regression guard
asserting positional args are always empty and kwargs include
messages_snapshot
New TestReviewForkApiModeDowngrade class:
- test_codex_app_server_parent_downgrades_review_fork: drives the
real _spawn_background_review function (no mock at that level),
asserts the review_agent gets api_mode='codex_responses' when
the parent was codex_app_server.
Live-validated against real run_conversation:
- Counter ticked from 0 to 5 after a 5-tool-iteration turn
- _spawn_background_review fired exactly once with kwargs-only signature
- review_skills=True, review_memory=False
- messages_snapshot was 12 entries (5 assistant tool_calls + 5 tool
results + 1 final assistant + initial system/user)
- Counter reset to 0 after fire
170 codex-runtime tests, all green.
Docs: added a Self-improvement loop section to the codex runtime page
explaining both how the trigger logic stays equivalent and that the
review fork is auto-downgraded to codex_responses for the agent-loop
tools. Also clarified that apply_patch and update_plan ARE codex's
built-in tools (the previous version made it sound like they were
separate from 'codex's stuff' — they're not, all five tools listed
in 'What tools the model actually has' section 1 are codex built-ins).
* feat(codex-runtime): expose kanban tools through Hermes MCP callback
Kanban workers spawn as separate hermes chat -q subprocesses that read
the user's config.yaml. If model.openai_runtime: codex_app_server is set
globally (which is the whole point of opt-in), every dispatched worker
ALSO comes up on the codex runtime.
That mostly works — codex's built-in shell + apply_patch + update_plan
do the actual task work fine — but it had one critical break: the
worker handoff tools (kanban_complete, kanban_block, kanban_comment,
kanban_heartbeat) are Hermes-registered tools, not codex built-ins.
On the codex runtime, codex builds its own tool list and these never
reach the model, so the worker would do the work but not be able to
report back, hanging until the dispatcher's timeout escalates it as
zombie.
Fix: add all 9 kanban tools to the EXPOSED_TOOLS list in the Hermes
MCP callback. They dispatch statelessly through handle_function_call()
just like web_search and the others — they read HERMES_KANBAN_TASK
from env (set by the dispatcher), gate correctly (worker tools require
the env var, orchestrator tools require it unset), and write to
~/.hermes/kanban.db.
Why kanban tools work via stateless dispatch when delegate_task/memory/
session_search/todo don't: those four are listed in _AGENT_LOOP_TOOLS
(model_tools.py:493) and short-circuit in handle_function_call() with
'must be handled by the agent loop' — they need to mutate AIAgent's
mid-loop state. Kanban tools have no such requirement; they're pure
side-effect functions against the kanban.db plus state_meta.
Tools exposed:
Worker handoff (require HERMES_KANBAN_TASK):
kanban_complete, kanban_block, kanban_comment, kanban_heartbeat
Read-only board queries:
kanban_show, kanban_list
Orchestrator (require HERMES_KANBAN_TASK unset):
kanban_create, kanban_unblock, kanban_link
Tests:
- test_kanban_worker_tools_exposed: complete/block/comment/heartbeat
in EXPOSED_TOOLS (regression guard for the would-hang-worker bug)
- test_kanban_orchestrator_tools_exposed: create/show/list/unblock/link
Docs:
- New 'Workflow features' section in the docs page covering /goal,
kanban, and cron behavior on this runtime
- /goal: works fully via run_conversation feedback; only caveat is
approval-prompt noise on long writes-heavy goals (mitigated by
the default :workspace permission profile)
- Kanban: enumerated which tools are reachable via the callback and
why the env var propagates correctly through the codex subprocess
to the MCP server subprocess
- Cron: documented as 'not specifically tested' — same rules as the
CLI apply since cron runs through AIAgent.run_conversation
- Trade-offs table gained rows for /goal, kanban worker, kanban
orchestrator
172/172 codex-runtime tests green (+2 from kanban tests).
* docs(codex-runtime): wire /codex-runtime into slash-commands ref + flag aux token cost
Three docs gaps caught during a final audit:
1. /codex-runtime was only in the feature docs page, not in the
slash-commands reference. Added rows to both the CLI section and
the Messaging section so users discover it where they'd look for
slash command syntax.
2. CODEX_HOME and HERMES_KANBAN_TASK weren't in environment-variables.md.
CODEX_HOME lets users redirect Codex CLI's config dir (the migration
honors it). HERMES_KANBAN_TASK is set by the kanban dispatcher and
propagates to the codex subprocess + the hermes-tools MCP subprocess
so kanban worker tools gate correctly — documented as 'don't set
manually' since it's an internal handoff.
3. Aux client behavior on this runtime. When openai_runtime=
codex_app_server is on with the openai-codex provider, every aux
task (title generation, context compression, vision auto-detect,
session search summarization, the background self-improvement review
fork) flows through the user's ChatGPT subscription by default.
This is true for the existing codex_responses path too, but it's
more visible / important here because users explicitly opted in for
subscription billing. Added a 'Auxiliary tasks and ChatGPT
subscription token cost' section to the docs page with a YAML
example showing how to override specific aux tasks to a cheaper
model (typically google/gemini-3-flash-preview via OpenRouter).
Also documents how the self-improvement review fork gets
auto-downgraded from codex_app_server to codex_responses by the
fix earlier in this PR.
No code changes — pure docs. 172 codex-runtime tests still green.
* docs+test(codex-runtime): pin HOME passthrough, document multi-profile + CODEX_HOME
OpenClaw hit a real footgun in openclaw/openclaw#81562: when spawning
codex app-server they were synthesizing a per-agent HOME alongside
CODEX_HOME. That made every subprocess codex's shell tool launches
(gh, git, aws, npm, gcloud, ...) see a fake $HOME and miss the user's
real config files. They had to back it out in PR #81562 — keep
CODEX_HOME isolation, leave HOME alone.
Audit confirms Hermes' codex spawn doesn't have this problem. We do
os.environ.copy() and only overlay CODEX_HOME (when provided) and
RUST_LOG. HOME passes through unchanged. But it was an emergent
property without a test pinning it, so adding a regression guard:
test_spawn_env_preserves_HOME — confirms parent HOME survives intact
in the subprocess env
test_spawn_env_sets_CODEX_HOME_when_provided — confirms codex_home
arg still isolates
codex state correctly
Docs additions:
'HOME environment variable passthrough' section — calls out the
contract explicitly: CODEX_HOME isolates codex's own state, HOME
stays user-real so gh/git/aws/npm/etc. find their normal config.
Cites openclaw#81562 as the cautionary tale.
'Multi-profile / multi-tenant setups' section — addresses the
related concern: profiles share ~/.codex/ by default. For users who
want per-profile codex isolation (separate auth, separate plugins),
documents the manual CODEX_HOME=<profile-scoped-dir> approach.
Explains why we DON'T auto-scope CODEX_HOME per profile: doing so
would silently invalidate existing codex login state for anyone
upgrading to this PR with tokens already at ~/.codex/auth.json.
Opt-in is safer than surprising users.
174 codex-runtime tests (+2 from HOME guards), all green.
* fix(codex-runtime): TOML control-char escapes + atomic config.toml write
Two footguns caught in a final audit pass before merge.
Bug 1: TOML control characters not escaped
The _format_toml_value() helper escaped backslashes and double quotes
but passed literal control characters (\n, \t, \r, \f, \b) through
unchanged. TOML basic strings don't allow literal control characters
— a path or env var containing a newline would produce invalid TOML
that codex refuses to load.
Realistic exposure: pathological cases like a HERMES_HOME with a
trailing newline (env var concatenation accident), or a PYTHONPATH
with a tab from a multi-line shell heredoc.
Fix: escape all five TOML basic-string control sequences (\b \t \n
\f \r) in addition to \\ and \" that we already did. Order
matters — backslash must come first or the other escapes get
re-escaped.
Bug 2: config.toml write wasn't atomic
If the python process crashed between target.mkdir() and the
write_text() finishing, a half-written config.toml could be left
behind. On NFS / Windows / some FUSE mounts this is a real concern;
on ext4/APFS small writes are usually atomic in practice but not
guaranteed.
Fix: write to a tempfile.mkstemp() temp file in the same directory,
then Path.replace() (atomic same-dir rename on POSIX, ReplaceFile on
Windows). On rename failure, clean up the temp file so repeated
failed migrations don't pile up .config.toml.* files.
Tests:
- test_string_with_newline_escaped — \n in value → \n in output
- test_string_with_tab_escaped — \t in value → \t in output
- test_string_with_other_controls_escaped — \r, \f, \b
- test_windows_path_escaped_correctly — backslash doubling
- test_atomic_write_no_temp_leak_on_success — no .config.toml.*
left over after a successful write
- test_atomic_write_cleanup_on_rename_failure — temp file removed
when Path.replace raises (simulated disk full)
180 codex-runtime tests, all green (+6 from this commit).
Footguns audited but NOT fixed (with rationale):
- Concurrent migrations race. Two Hermes processes hitting
/codex-runtime codex_app_server within seconds of each other could
cause one writer to lose entries. Low probability (you'd have to
enable from two surfaces simultaneously) and low impact (just re-run
migration). Adding fcntl/msvcrt locking is more code than it's
worth here. The atomic rename above means each individual write is
consistent — only the merge step is racy.
- Codex protocol version drift. We pin MIN_CODEX_VERSION=0.125 and
check at runtime but don't reject too-new versions. Right call —
the protocol has been stable through 0.125 → 0.130. If OpenAI
breaks it later we'd see the error in test_codex_app_server_runtime
on CI before users hit it.
|
||
|
|
c3094b46e9 |
refactor: import FILE_MUTATING_TOOL_NAMES from shared module
Drops the duplicate _FILE_MUTATING_TOOLS frozenset in run_agent.py and
imports the canonical FILE_MUTATING_TOOL_NAMES from
agent/tool_result_classification.py (aliased as _FILE_MUTATING_TOOLS to
avoid renaming the existing call sites). Prevents future drift if
another file-mutating tool is added — only one set needs updating.
No behavior change: same frozenset({'write_file', 'patch'}), and the
117 PR-scoped tests still pass.
|
||
|
|
da0ddbf88a | fix: classify landed file mutations with diagnostics | ||
|
|
486b692ddd
|
feat(nous): unified client=hermes-client-v<version> tag on every Portal request (#24779)
Some checks failed
Deploy Site / deploy-vercel (push) Waiting to run
Deploy Site / deploy-docs (push) Waiting to run
Docker Build and Publish / build-amd64 (push) Waiting to run
Docker Build and Publish / build-arm64 (push) Waiting to run
Docker Build and Publish / merge (push) Blocked by required conditions
Docker Build and Publish / move-latest (push) Blocked by required conditions
Lint (ruff + ty) / ruff + ty diff (push) Waiting to run
Lint (ruff + ty) / ruff enforcement (blocking) (push) Waiting to run
Lint (ruff + ty) / Windows footguns (blocking) (push) Waiting to run
Nix / nix (macos-latest) (push) Waiting to run
Nix / nix (ubuntu-latest) (push) Waiting to run
Tests / test (push) Waiting to run
Tests / e2e (push) Waiting to run
OSV-Scanner / Scan lockfiles (push) Has been cancelled
uv.lock check / uv lock --check (push) Has been cancelled
* feat(nous): unified client=hermes-client-v<version> tag on every Portal request Every Hermes request to Nous Portal now carries the same client=hermes-client-v<__version__> tag (e.g. client=hermes-client-v0.13.0 on this release), sourced live from hermes_cli.__version__. The release script's regex bump auto-aligns it on every release. Centralized in agent/portal_tags.py and wired into all four call sites: - NousProfile.build_extra_body (main agent loop, every chat completion) - auxiliary_client.NOUS_EXTRA_BODY + _build_call_kwargs (aux client) - run_agent.py compression-summary fallback path - tools/web_tools.py web_extract fallback Replaces the client=aux marker added in #24194 with the unified version tag. Tests assert against the helper output (invariant) rather than the literal string, so they don't need updating on every release. * feat(nous): cover /goal judge and kanban specify aux paths Two aux-using surfaces bypassed call_llm by invoking client.chat.completions.create() directly without extra_body, so they were missing the unified Portal client tag: - hermes_cli/goals.py — /goal standing-goal judge - hermes_cli/kanban_specify.py — kanban triage specifier Both now pass extra_body=get_auxiliary_extra_body() or None so they inherit the version tag when the aux client points at Nous Portal, and emit nothing otherwise (no tag leak to OpenRouter/Anthropic auxes). |
||
|
|
b06e999302
|
fix(cache): kill long-lived prefix layout — system prompt is now byte-static within a session (#24778)
The long-lived prefix-cache layout split the system prompt into stable/ context/volatile blocks and re-derived them on every API call. The volatile tier (timestamp + memory snapshot + USER profile) ticks per turn, so the system message bytes mutated mid-conversation and broke upstream prompt caches (OpenRouter, Nous Portal, Anthropic). Diagnosed via live wire-format diffing: an 8-turn conversation showed OLD layout flipping system block[1] sha mid-session at the minute boundary, dropping cached_tokens to 0 on that turn (cumulative 66.6% vs 83.3% for the single-block layout). Hermes invariant: history (system + all but the last 1-2 messages) must be static. Fix: drop the long-lived layout entirely. Single layout everywhere — system_and_3 with one cached system string built once on first turn, replayed verbatim on every subsequent turn. Loses cross-session 1h prefix caching for Claude (the feature that motivated the split), but within-session caching now actually works on every provider. Removed: - run_agent.py: _use_long_lived_prefix_cache flag, _long_lived_cache_ttl, _supports_long_lived_anthropic_cache method, the long-lived branch in run_conversation, mark_tools_for_long_lived_cache call site - agent/prompt_caching.py: apply_anthropic_cache_control_long_lived, mark_tools_for_long_lived_cache, _mark_system_stable_block helper - hermes_cli/config.py: prompt_caching.long_lived_prefix and prompt_caching.long_lived_ttl config keys - tests/agent/test_prompt_caching_live.py (entire file) - tests/agent/test_prompt_caching.py: TestMarkToolsForLongLivedCache, TestApplyAnthropicCacheControlLongLived - tests/run_agent/test_anthropic_prompt_cache_policy.py: TestSupportsLongLivedAnthropicCache Targeted tests: 62/62 pass. |
||
|
|
8ac351407e |
fix(agent): clear stale config context_length on model switch
When switching models via /model, AIAgent._config_context_length was never cleared, so the new model inherited the previous model's context window instead of auto-detecting the correct one via get_model_context_length(). Clear _config_context_length to None before the runtime field swap so the full resolution chain (custom_providers per-model, endpoint probe, models.dev, etc.) is re-evaluated for the newly selected model. Closes #21509 |
||
|
|
4c825554c1 | fix(retry): use float() for Retry-After header to handle sub-second values | ||
|
|
2a18b6283b
|
fix(cache): drop ttl=1h on Portal Qwen — Alibaba upstream is 5m-only (#24702)
PR #24151 routed Portal Qwen (qwen3.6-plus) through the prefix_and_2 long-lived cache layout, attaching {"type":"ephemeral","ttl":"1h"} markers to the tools[-1] entry and the stable system-prefix block. That layout works for Portal Claude because Anthropic / OpenRouter on Anthropic routes honour 1h TTL — but Portal Qwen ultimately proxies to Alibaba DashScope, which documents a single "ephemeral" TTL of 5 minutes on its Context Cache. The ttl="1h" qualifier is silently dropped upstream, so the two highest-value breakpoints (tools array + system prefix) never land. Only the rolling-window 5m markers on the last 2 messages cache, which matches the observed ~25% read rate. Fix: keep Portal Qwen on cache_control via _anthropic_prompt_cache_policy returning (True, False), but drop it from _supports_long_lived_anthropic_cache so it rides the standard system_and_3 5m layout (system + last 3 messages, all at 5m). Same 4 breakpoints, all in a TTL the upstream actually honours. Refs: https://www.alibabacloud.com/help/en/model-studio/context-cache https://openrouter.ai/docs/features/prompt-caching (Alibaba Qwen section: "TTL: 5 minutes") - _supports_long_lived_anthropic_cache: Portal scope narrowed back to Claude - tests: flip the two qwen long-lived expectations to False, retitle non_claude_non_qwen_rejected -> non_claude_rejected |
||
|
|
c594a23047
|
feat(agent): per-turn file-mutation verifier footer (#24498)
Detect when write_file / patch calls fail during a turn and are never superseded by a successful write to the same path. When the final text response is delivered, append an advisory footer listing the files that did NOT change — so models that over-claim 'patched 5 files' after 4 silent failures can't hide the lie. Catches the failure mode reported in Ben Eng's llm-wiki session: grok-4.1-fast issued batches of parallel patches, half failed with 'Could not find old_string', and the agent summarised the turn claiming every file was edited. The user had to manually run 'git status' each turn to catch it. The verifier is a pure post-hoc check on tool results — no new LLM calls, no synthetic messages injected into history (prompt cache preserved), no changes to tool argument dispatch. Per-turn state is keyed by path; a later successful write to the same path clears the failure entry so single-file retry recovery is not flagged. Wired into both _execute_tool_calls_concurrent and _execute_tool_calls_sequential, so batched parallel patches and one-at- a-time edits are both covered. Footer emission happens after the agent loop exits, before transform_llm_output / post_llm_call plugin hooks run, so plugins still see (and can modify) the augmented text. Config: display.file_mutation_verifier (bool, default true) + HERMES_FILE_MUTATION_VERIFIER env override. 31 unit tests in tests/run_agent/test_file_mutation_verifier.py cover target extraction (write_file, patch-replace, patch-v4a single and multi-file), error-preview extraction (JSON .error field and plain string), per-turn state transitions (first-error-wins on repeated failure, success supersedes failure), footer rendering (truncation at 10 entries, user-actionable hint), and env/config precedence. Companion docs updated: user-guide/configuration.md + reference/environment-variables.md. |
||
|
|
7993e03c06
|
fix(cache): route Nous Portal Qwen through Portal-Claude cache pathway (#24151)
Qwen models on Nous Portal (e.g. qwen3.6-plus) now get the same envelope-layout cache_control markers and long-lived (1h cross-session) cache treatment as Portal Claude. Portal proxies to OpenRouter with identical wire-format and cache_control semantics, but the prior policy left Portal Qwen falling through to the alibaba-family branch (which only matches provider=opencode/alibaba), serving 0% cache hits and re-billing the full prompt every turn. Scope is narrow: Portal Claude OR Portal Qwen. Other models on Portal keep their existing behavior. - _anthropic_prompt_cache_policy: add (is_nous_portal and qwen) -> (True, False) - _supports_long_lived_anthropic_cache: drop Claude-only gate for Portal so Qwen also gets the validated 1h cross-session layout - tests cover both functions, both bare and vendored qwen slug forms, and the rejection of non-Claude non-Qwen Portal traffic |
||
|
|
271883447e
|
feat: expose HERMES_SESSION_ID to agent tools via ContextVar + env (#23847)
Set HERMES_SESSION_ID using the existing session_context.py ContextVar system for concurrency safety (multiple gateway sessions in one process won't cross-talk). Also writes os.environ as fallback for CLI mode. Touchpoints: - gateway/session_context.py: Add _SESSION_ID ContextVar + _VAR_MAP entry - run_agent.py: Set both ContextVar and os.environ at init and on context-compression rotation - tools/environments/local.py: Bridge ContextVars into subprocess env in _make_run_env() (ContextVars don't propagate to child processes) - tests/run_agent/test_session_id_env.py: 3 tests covering env, provided ID, and ContextVar paths execute_code subprocess already passes HERMES_* prefixed vars through _scrub_child_env (line 82: _SAFE_ENV_PREFIXES includes 'HERMES_'). Primary use case: webhook-triggered agents that need to include a `--resume <session_id>` takeover command in their output. |
||
|
|
7b76366552
|
feat(prompt-cache): cross-session 1h prefix cache for Claude on Anthropic / OpenRouter / Nous Portal (#23828)
Cuts input cost for first-turn Claude requests by ~85-90% on subsequent
sessions within an hour. Tools array (~13k tokens for default toolset) +
stable system prefix (~5-8k tokens) get a 1h cache_control marker; the
volatile suffix (memory, USER profile, timestamp, session id) sits in a
separate non-cached block at the end so it doesn't poison the cross-session
prefix when it changes.
Provider gate: Claude on native Anthropic (incl. OAuth subscription),
OpenRouter, and Nous Portal (which proxies to OpenRouter). All other
providers keep today's system_and_3 layout unchanged.
Layout (4 cache_control breakpoints, Anthropic max):
1. tools[-1] -> 1h (cross-session)
2. system content[0] -> 1h (cross-session, stable prefix)
3. messages[-2] -> 5m (within-session rolling)
4. messages[-1] -> 5m (within-session rolling)
Within-session rolling shrinks from 3 messages to 2 to free the breakpoint
budget. On Claude with realistic tool loadouts the long-lived tier carries
the bulk of cross-session value anyway.
System prompt is now always assembled cache-friendly: stable identity /
guidance / skills / platform hints first, then session-stable context
files (AGENTS.md, .cursorrules), then per-call volatile content. Old
single-string callers see the same logical content (same join order),
just reordered so volatile lives at the end.
Config knobs (defaults shown):
prompt_caching:
cache_ttl: "5m" # rolling-window TTL (unchanged)
long_lived_prefix: true # opt-out switch
long_lived_ttl: "1h" # cross-session prefix TTL
Live E2E (tests/agent/test_prompt_caching_live.py, gated on
OPENROUTER_API_KEY) on anthropic/claude-haiku-4.5 with default toolset:
Call 1 (cold): cache_write=13,415 cache_read=0
Call 2 (NEW agent + msg): cache_write=391 cache_read=13,025
Cross-session reuse: 97.09%
Implementation:
* agent/prompt_caching.py: new apply_anthropic_cache_control_long_lived()
+ mark_tools_for_long_lived_cache(); existing apply_anthropic_cache_control()
preserved verbatim for the fallback path.
* agent/anthropic_adapter.py: convert_tools_to_anthropic() now forwards
cache_control onto each Anthropic-format tool dict.
* run_agent.py: _build_system_prompt_parts() returns the 3-tier dict;
_build_system_prompt() joins them (backward compatible).
_supports_long_lived_anthropic_cache() policy added next to the existing
_anthropic_prompt_cache_policy() (which now also recognises Nous Portal
Claude — pre-existing gap fixed in passing).
_build_api_kwargs() resolves tools_for_api once and propagates the
marker through all four build paths (anthropic_messages, bedrock,
codex_responses, profile/legacy chat completions).
Long-lived flag plumbed into the runtime snapshot/restore + model-switch
+ fallback-promotion paths.
Tests:
* tests/agent/test_prompt_caching.py: +8 tests (TestMarkToolsForLongLivedCache,
TestApplyAnthropicCacheControlLongLived).
* tests/run_agent/test_anthropic_prompt_cache_policy.py: +9 tests
(TestSupportsLongLivedAnthropicCache matrix across 8 endpoint classes
+ a fallback-target case).
* tests/agent/test_prompt_caching_live.py: new live E2E (skipif when
OPENROUTER_API_KEY is unset; runs outside the hermetic suite).
* Targeted suites: 327/327 pass (caching/adapter/policy/builder).
* tests/agent/ + tests/run_agent/: 3992 pass, 17 skip, 1 pre-existing
flake (test_async_httpx_del_neuter::test_same_key_replaces_stale_loop_entry,
verified failing on pristine origin/main).
|
||
|
|
2ec8d2b42f
|
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937)
Replace with for all literal-tuple membership tests. Set lookup is O(1) vs O(n) for tuple — consistent micro-optimization across the codebase. 608 instances fixed via `ruff --fix --unsafe-fixes`, 0 remaining. 133 files, +626/-626 (net zero). |
||
|
|
657874460f
|
chore: ruff auto-fixes — collapsible-else-if, if-stmt-min-max, dict.fromkeys (#23926)
PLR5501 (collapsible-else-if): 28 instances — else: if: → elif: PLR1730 (if-stmt-min-max): 15 instances — if x<y: x=y → x=max(x,y) C420 (dict.fromkeys): 2 instances — dictcomp → dict.fromkeys PLR1704 (redefined-argument): 1 instance — reason → err_msg (shadow fix) C414 (unnecessary-list): 1 instance — sorted(list(x)) → sorted(x) 28 files, -44 net lines. All mechanical, zero logic changes. 17,211 tests pass, zero regressions. |
||
|
|
7026af4e23
|
fix(agent): catch ChatGPT-account Codex data-URL rejection so images are stripped instead of cascading to compression (#23602)
When the user's main provider is openai-codex on the ChatGPT-account backend (https://chatgpt.com/backend-api/codex), sending a native image attachment encodes it as data:image/...base64,... in the input_image field. The OpenAI Responses API on the public endpoint accepts that, but the ChatGPT-account variant rejects it with HTTP 400: Invalid 'input[N].content[K].image_url'. Expected a valid URL, but got a value with an invalid format. Hermes' image-rejection phrase list didn't include this wording, so the error escaped the strip-and-retry branch and fell through to the generic recovery path: model fallback → context-too-large → compression cascade → auxiliary OpenRouter 402 spam (issue #23570). Add a NARROW phrase keyed on the field-path apostrophe used by the Codex Responses error format: "image_url'. expected". This matches the actual error format without false-tripping on generic 'Expected a valid URL' errors from unrelated tools (webhooks, redirect_uri, etc.). Once matched, the existing branch strips images from history, sets _vision_supported= False for the session, and retries text-only. Refs #23570 (1 of 3 image-replay improvements; persistence rewrite to store image PATHS instead of inlined base64 is a separate follow-up) |
||
|
|
3e7145e0bb
|
revert: roll back /goal checklist + /subgoal feature stack (#23813)
* Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (#23547)" This reverts commit |
||
|
|
2b3bf17dfa |
fix(kanban): call kanban_block on iteration-budget exhaustion to prevent protocol violation
When a kanban worker subprocess hits the iteration budget, the agent loop strips tools and asks the model for a summary. The model cannot call kanban_block itself at that point, so the process exits rc=0 without calling kanban_complete or kanban_block — a protocol violation that the dispatcher detects as a fatal error, giving up after 1 failure and stranding downstream tasks. Fix: after _handle_max_iterations() returns, check HERMES_KANBAN_TASK and call kanban_block with a reason describing the exhaustion. The dispatcher then sees a clean block transition instead of a protocol violation, and the task can be retried or escalated by a human. Fixes [Bug] kanban-worker exits cleanly (rc=0) on iteration-budget exhaustion without calling kanban_complete or kanban_block #23216 |
||
|
|
3af3c4eb8c |
fix(misc): three small defensive fixes from PR #1974
Salvages the three substantive low-severity fixes from Gutslabs' #1974 "misc bug fixes" bundle. The other 8 claims in that PR were either already fixed on main with superior implementations (state lock, firecrawl lazy import, fcntl/msvcrt guard, path normalization, schema migrations) or did not survive review. - run_agent: `_materialize_data_url_for_vision` uses `NamedTemporaryFile(delete=False)`; if `base64.b64decode` raises on a corrupt data URL the temp file would persist forever. Wrap the write in try/except and `os.unlink` the temp on failure. - gateway/session: `append_to_transcript` JSONL write had no error handling, so disk-full / read-only-fs / permission errors crashed the message handler. The SQLite write above is the primary store, so swallow OSError on the JSONL fallback with a debug log. - gateway/status: `_read_pid_record` reads `pid_path.read_text()` after an `exists()` check; if the PID file is deleted between the two calls (concurrent gateway restart) we hit an unhandled OSError. Catch it and return None. Adds a regression test for the tempfile cleanup; the other two paths are defensive try/excepts on infrequent OSError that don't warrant dedicated tests. Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com> |
||
|
|
4a080b1d5a
|
fix(goals): forward standing /goal state on auto-compression session rotation (#23530)
When run_agent's _compress_context fires mid-turn it ends the parent
session in SessionDB and creates a new continuation session with a
fresh session_id. The /goal state is keyed on session_id in
state_meta ("goal:<sid>"), so without forwarding the goal silently
disappears: _get_goal_manager() rebinds for the new session_id,
load_goal() returns None, mgr.is_active() is False, and the
continuation loop dies with no user-visible signal.
Fix: in the same SessionDB transaction block that creates the
continuation session, copy state_meta[goal:<old>] →
state_meta[goal:<new>] when present. No-op when the user has no
active goal. Logged at INFO so a stuck loop is debuggable.
Tests cover the round-trip via SessionDB and the no-op path.
Affects all three run-conversation surfaces (CLI, gateway, TUI
gateway) because _compress_context is the single rotation site.
|
||
|
|
e5af1dd633
|
fix(review): tell background reviewer not to capture transient env failures as skills (#23004)
Closes #6051. Reported failure mode: agent migrated to WSL2, browser launch failed because Playwright wasn't installed yet. Background reviewer captured the failure as a durable skill (`browser-tool-launch-issue`) and the agent kept refusing the browser tool for weeks after Playwright was installed and verified working. Negative claims also propagated into unrelated skills ("browser tools do not work", "cannot use Y from execute_code"). Root cause: `_SKILL_REVIEW_PROMPT` and `_COMBINED_REVIEW_PROMPT` both lean hard on "be active, save things, a pass that does nothing is a missed learning opportunity." Neither distinguished durable knowledge from transient environment state. The reviewer was doing what it was told. Fix at the write site — both prompts now carry a "Do NOT capture" section calling out: • Environment-dependent failures (missing binaries, fresh-install errors, post-migration path mismatches, 'command not found', unconfigured credentials, uninstalled packages) • Negative claims about tools or features ("X does not work") that harden into self-cited refusals • Session-specific transient errors that resolved before the conversation ended • One-off task narratives ("summarize today's market", "analyze this PR") — also addresses the #12812 / #4538 family Plus a positive-reframing line: when a tool fails because of setup state, capture the FIX (install command, config step, env var) under an existing setup/troubleshooting skill — never "this tool doesn't work" as a standalone constraint. Targeted tests: 24/24 passing in tests/run_agent/test_review_prompt_class_first.py (2 new + all existing review-prompt assertions). Substring-based checks so future prompt edits don't false-fail. |
||
|
|
126cbffb8a
|
feat(stream-retry): add upstream + timing diagnostics to drop log (#23005)
The previous PR (#22993) gave us a structured WARNING per stream drop but the only diagnostic was 'error_type=APIError error=Network connection lost.' — same nothing the user started with. To actually diagnose why subagents drop streams disproportionately we need to know WHERE the drop happened. Adds three breadcrumbs to the agent.log WARNING: 1. Inner exception chain. openai SDK wraps httpx errors as APIConnectionError / APIError so the catch site only sees the wrapper. _flatten_exception_chain walks __cause__/__context__ up to 4 levels deep and renders 'Outer(msg) <- Inner(msg)' so we can tell ConnectError vs RemoteProtocolError vs ReadError vs ProxyError without enabling verbose mode. 2. Upstream HTTP headers. Snapshots cf-ray, x-openrouter-provider, x-openrouter-model, x-openrouter-id, x-request-id, server, via, etc. from stream.response immediately after open (so they survive even when the stream dies before the first chunk). These answer 'is one CF edge / one downstream provider responsible, or random?' 3. Per-attempt counters. bytes streamed, chunk count, elapsed time on the dying attempt, and time-to-first-byte. Distinguishes 'couldn't connect at all' (0s, 0 bytes) from 'died after 30s mid-stream' (very different root causes — first is auth/routing, second is upstream idle-kill or proxy timeout). Plumbing: - _stream_diag_init / _stream_diag_capture_response live on AIAgent and produce a per-attempt dict held on request_client_holder['diag'] for closure access from the retry block. - _call_chat_completions and _call_anthropic both initialize the diag and increment counters per chunk/event (best-effort, never raises in the streaming hot path). - _log_stream_retry / _emit_stream_drop accept an optional diag and render the new fields. Final-exhaustion log goes through the same helper so it gets the same diagnostic dump. - UI status line gains a brief 'after Xs' suffix when timing is available — distinguishes 'connect failed' from 'died mid-stream' at a glance without grepping logs. Sample WARNING after this change: Stream drop mid tool-call on attempt 2/3 — retrying. subagent_id=sa-2-cafef00d depth=1 provider=openrouter base_url=https://openrouter.ai/api/v1 error_type=APIError error=Connection error. chain=APIError(Connection error.) <- RemoteProtocolError(peer closed connection without sending complete message body) http_status=200 bytes=12400 chunks=47 elapsed=12.00s ttfb=0.83s upstream=[cf-ray=8f1a2b3c4d5e6f7g-LAX x-openrouter-provider=Anthropic x-openrouter-id=gen-abc123 server=cloudflare] Tests: 10 covering diag init, header capture (whitelist enforced for PII), exception-chain walking + depth cap, log content with full diag, log content without diag (placeholders), UI elapsed-suffix on/off. |
||
|
|
68e44642c8
|
fix(stream-retry): collapse two-line drop status, name provider, and let agent.log capture diagnostics (#22993)
Subagent stream drops were spamming the parent terminal with two lines
per blip ('Connection dropped...' + 'Reconnected...') while leaving zero
breadcrumb in agent.log to debug them.
Two underlying bugs, fixed together:
1. quiet_mode raised the run_agent/tools/etc. loggers to ERROR, which
filters records before root-logger file handlers see them. The comment
claimed 'File handlers still capture everything' — that was wrong.
Removed in both run_agent.py and cli.py; console quietness already
comes from hermes_logging not installing a console StreamHandler in
non-verbose mode.
2. The stream-retry blocks emitted two _emit_status calls per drop
('⚠️ Connection dropped... Reconnecting...' + '🔄 Reconnected —
resuming…') with no provider name, so multi-provider sessions had to
dig through agent.log to attribute a drop. Replaced both call sites
with a single _emit_stream_drop helper that emits ONE line naming the
provider and error class, and always writes a structured WARNING to
agent.log with subagent_id, depth, provider, base_url, error_type.
Net UX change: 6 lines per triple-subagent drop → 3 lines, each
naming the provider. agent.log now has a structured breadcrumb per
retry that didn't exist before.
Tests: 6 new tests in tests/run_agent/test_stream_drop_logging.py
covering the logger-level guard, structured WARNING content, single
status line per drop (no Reconnected follow-up), and provider naming.
|
||
|
|
3800972dd0
|
feat(vision): vision_analyze returns pixels to vision-capable models, not aux text (#22955)
When the active main model has native vision and the provider supports multimodal tool results (Anthropic, OpenAI Chat, Codex Responses, Gemini 3, OpenRouter, Nous), vision_analyze loads the image bytes and returns them to the model as a multimodal tool-result envelope. The model then sees the pixels directly on its next turn instead of receiving a lossy text description from an auxiliary LLM. Falls back to the legacy aux-LLM text path for non-vision models and unverified providers. Mirrors the architecture used in OpenCode, Claude Code, Codex CLI, and Cline. All four converge on the same pattern: tool results carry image content blocks for vision-capable provider/model combinations. Changes - tools/vision_tools.py: _vision_analyze_native fast path + provider capability table (_supports_media_in_tool_results). Schema description updated to reflect new behaviour. - agent/codex_responses_adapter.py: function_call_output.output now accepts the array form for multimodal tool results (was string-only). Preflight validates input_text/input_image parts. - agent/auxiliary_client.py: _RUNTIME_MAIN_PROVIDER/_MODEL globals so tools see the live CLI/gateway override, not the stale config.yaml default. set_runtime_main()/clear_runtime_main() helpers. - run_agent.py: AIAgent.run_conversation calls set_runtime_main at turn start so vision_analyze's fast-path check sees the actual runtime. - tests/conftest.py: clear runtime-main override between tests. Tests - tests/tools/test_vision_native_fast_path.py: provider capability table, envelope shape, fast-path gating (vision-capable model uses fast path; non-vision model falls through to aux). - tests/run_agent/test_codex_multimodal_tool_result.py: list tool content becomes function_call_output.output array; preflight preserves arrays and drops unknown part types. Live verified - Opus 4.6 + Sonnet 4.6 on OpenRouter: model calls vision_analyze on a typed filepath, gets pixels back, reads exact text from images that no aux description could capture (font color irony, multi-line fruit-count list, etc.). PR replaces the closed prior efforts (#16506 shipped the inbound user- attached path; this PR closes the gap for tool-discovered images). |
||
|
|
6ddc48b058 |
fix(fallback): resolve api_key_env in fallback chain entries (carve-out of #22665)
Fallback chain entries with 'api_key_env: ENV_VAR_NAME' weren't being
resolved by either the init-time fallback path (line ~1660) or the
runtime _try_activate_fallback path (line ~8045). Only literal
'api_key' was honored; the snake_case 'api_key_env' alias documented
elsewhere in the config was silently dropped, so a 'provider: custom'
fallback with base_url + api_key_env worked as primary but failed as
fallback with 'no endpoint credentials found' / 401.
Adds 'or fb.get("api_key_env")' to the existing 'key_env' lookup in
both call sites, with empty-string-to-None coercion so unset env vars
don't poison the resolver.
Salvage of #22665's fallback portion. The original PR also bundled
gateway-degrade-on-no-adapters changes (those land via the carve-out
in #22853 which is the same code) and run_agent.py memory-nudge
counter hydration (issue #22357 territory, not mentioned in the
title). Drops both bundled pieces; keeps just the api_key_env fix.
Closes #5392.
|
||
|
|
c7f0aab949
|
feat(openrouter): wire Pareto Code router with min_coding_score knob (#22838)
Pick openrouter/pareto-code as your model and OpenRouter auto-routes each
request to the cheapest model meeting your coding-quality bar (ranked by
Artificial Analysis). The new openrouter.min_coding_score config key (0.0-1.0,
default 0.65) tunes the floor.
- hermes_cli/models.py: add openrouter/pareto-code to OPENROUTER_MODELS so
it shows up in the picker with a description
- hermes_cli/config.py: add openrouter.min_coding_score (default 0.65 — lands
on a mid-tier coder on the current Pareto frontier)
- plugins/model-providers/openrouter: emit extra_body.plugins =
[{id: pareto-router, min_coding_score: X}] when model is openrouter/pareto-code
AND the score is a valid float in [0.0, 1.0]
- agent/transports/chat_completions.py: same emission on the legacy flag
path (when no provider profile is loaded)
- run_agent.py: openrouter_min_coding_score kwarg + storage; plumbed into
both build_kwargs() invocations and the context-summary extra_body path
- cli.py: read openrouter.min_coding_score once at init, validate float in
[0,1], pass to AIAgent constructions (CLI + background-task paths)
- cron/scheduler.py, batch_runner.py, tools/delegate_tool.py,
tui_gateway/server.py: propagate the kwarg (mirrors providers_order
plumbing — subagents inherit, cron/batch read from config)
- tests: profile-level + transport-level coverage of the model gating,
unset/empty/out-of-range handling, and the legacy flag path
- docs: new 'OpenRouter Pareto Code Router' section in providers.md
Verified end-to-end against api.openrouter.ai: at score=0.65 we land on a
mid-tier coder, at omission we get the strongest. Score is silently dropped
on any model other than openrouter/pareto-code, so it's safe to leave set.
|
||
|
|
840ebe063e | fix: make session search initialize session db |