Memory provider plugins (e.g. Mnemosyne) can register tools via two paths:
1. Plugin system (ctx.register_tool) → tool registry → get_tool_definitions()
2. Memory manager → get_all_tool_schemas() → direct append in AIAgent.__init__
Path 2 blindly appended without checking if path 1 already added the same
tool names. This created duplicate function names in the tools array sent
to the API. Most providers silently handle duplicates, but Xiaomi MiMo
(via Nous Portal) strictly rejects them with a 400 Bad Request.
Fix: build a set of existing tool names before memory manager injection
and skip any tool whose name is already present.
Confirmed via live testing against Nous Portal:
- Unique tool names → 200 OK
- Duplicate tool names → 400 'Provider returned error'
The on_memory_write bridge that notifies external memory providers
(ClawMem, retaindb, supermemory, etc.) of built-in memory writes was
only present in the concurrent tool execution path (_invoke_tool).
The sequential path (_execute_tool_calls_sequential) — which handles
all single tool calls, the common case — was missing it entirely.
This meant external memory providers silently missed every single-call
memory write, which is the vast majority of memory operations.
Fix: add the identical bridge block to the sequential path, right
after the memory_tool call returns.
Closes#10174
Multiple gaps in activity tracking could cause the gateway's inactivity
timeout to fire while the agent is actively working:
1. Streaming wait loop had no periodic heartbeat — the outer thread only
touched activity when the stale-stream detector fired (180-300s), and
for local providers (Ollama) the stale timeout was infinity, meaning
zero heartbeats. Now touches activity every 30s.
2. Concurrent tool execution never set the activity callback on worker
threads (threading.local invisible across threads) and never set
_current_tool. Workers now set the callback, and the concurrent wait
uses a polling loop with 30s heartbeats.
3. Modal backend's execute() override had its own polling loop without
any activity callback. Now matches _wait_for_process cadence (10s).
The _last_content_with_tools fallback was firing indiscriminately for ALL
content+tool turns, including mid-task narration alongside substantive
tools (terminal, search_files, etc.). This caused the agent to exit
the loop with 'I'll scan the directory...' as the final answer instead
of nudging the model to continue processing tool results.
The fix restricts the fallback to housekeeping-only turns (memory, todo,
skill_manage, session_search) where the content genuinely IS the final
answer. When substantive tools are present, the existing post-tool
nudge mechanism now fires instead, prompting the model to continue.
Affected models: xiaomi/mimo-v2-pro, GLM-5, and other weaker models
that intermittently return empty after tool results.
Reported by user Renaissance on Discord.
OV transparently handles message history across /new and /compress: old
messages stay in the same session and extraction is idempotent, so there's
no need to rebind providers to a new session_id. The only thing the
session boundary actually needs is to trigger extraction.
- MemoryProvider / MemoryManager: remove on_session_reset hook
- OpenViking: remove on_session_reset override (nothing to do)
- AIAgent: replace rotate_memory_session with commit_memory_session
(just calls on_session_end, no rebind)
- cli.py / run_agent.py: single commit_memory_session call at the
session boundary before session_id rotates
- tests: replace on_session_reset coverage with routing tests for
MemoryManager.on_session_end
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hasattr-forked OpenViking-specific paths with a proper base-class
hook. Collapse the two agent wrappers into a single rotate_memory_session
so callers don't orchestrate commit + rebind themselves.
- MemoryProvider: add on_session_reset(new_session_id) as a default no-op
- MemoryManager: on_session_reset fans out unconditionally (no hasattr,
no builtin skip — base no-op covers it)
- OpenViking: rename reset_session -> on_session_reset; drop the explicit
POST /api/v1/sessions (OV auto-creates on first message) and the two
debug raise_for_status wrappers
- AIAgent: collapse commit_memory_session + reinitialize_memory_session
into rotate_memory_session(new_sid, messages)
- cli.py / run_agent.py: replace hasattr blocks and the split calls with
a single unconditional rotate_memory_session call; compression path
now passes the real messages list instead of []
- tests: align with on_session_reset, assert reset does NOT POST /sessions
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The OpenViking memory provider extracts memories when its session is
committed (POST /api/v1/sessions/{id}/commit). Before this fix, the
CLI had two code paths that changed the active session_id without ever
committing the outgoing OpenViking session:
1. /new (new_session() in cli.py) — called flush_memories() to write
MEMORY.md, then immediately discarded the old session_id. The
accumulated OpenViking session was never committed, so all context
from that session was lost before extraction could run.
2. /compress and auto-compress (_compress_context() in run_agent.py) —
split the SQLite session (new session_id) but left the OpenViking
provider pointing at the old session_id with no commit, meaning all
messages synced to OpenViking were silently orphaned.
The gateway already handles session commit on /new and /reset via
shutdown_memory_provider() on the cached agent; the CLI path did not.
Fix: introduce a lightweight session-transition lifecycle alongside
the existing full shutdown path:
- OpenVikingMemoryProvider.reset_session(new_session_id): waits for
in-flight background threads, resets per-session counters, and
creates the new OV session via POST /api/v1/sessions — without
tearing down the HTTP client (avoids connection overhead on /new).
- MemoryManager.restart_session(new_session_id): calls reset_session()
on providers that implement it; falls back to initialize() for
providers that do not. Skips the builtin provider (no per-session
state).
- AIAgent.commit_memory_session(messages): wraps
memory_manager.on_session_end() without shutdown — commits OV session
for extraction but leaves the provider alive for the next session.
- AIAgent.reinitialize_memory_session(new_session_id): wraps
memory_manager.restart_session() — transitions all external providers
to the new session after session_id has been assigned.
Call sites:
- cli.py new_session(): commit BEFORE session_id changes, reinitialize
AFTER — ensuring OV extraction runs on the correct session and the
new session is immediately ready for the next turn.
- run_agent._compress_context(): same pattern, inside the
if self._session_db: block where the session_id split happens.
/compress and auto-compress are functionally identical at this layer:
both call _compress_context(), so both are fixed by the same change.
Tests added to tests/agent/test_memory_provider.py:
- TestMemoryManagerRestartSession: reset_session() routing, builtin
skip, initialize() fallback, failure tolerance, empty-manager noop.
- TestOpenVikingResetSession: session_id update, per-session state
clear, POST /api/v1/sessions call, API failure tolerance, no-client
noop.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
With store=False (our default for the Responses API), the API does not
persist response items. When reasoning items with 'id' fields were
replayed on subsequent turns, the API attempted a server-side lookup
for those IDs and returned 404:
Item with id 'rs_...' not found. Items are not persisted when store
is set to false.
The encrypted_content blob is self-contained for reasoning chain
continuity — the id field is unnecessary and triggers the failed lookup.
Fix: strip 'id' from reasoning items in both _chat_messages_to_responses_input
(message conversion) and _preflight_codex_input_items (normalization layer).
The id is still used for local deduplication but never sent to the API.
Reported by @zuogl448 on GPT-5.4.
The existing recovery block sanitized self.api_key and
self._client_kwargs['api_key'] but did not update self.client.api_key.
The OpenAI SDK stores its own copy of api_key and reads it dynamically
via the auth_headers property on every request. Without this fix, the
retry after sanitization would still send the corrupted key in the
Authorization header, causing the same UnicodeEncodeError.
The bug manifests when an API key contains Unicode lookalike characters
(e.g. ʋ U+028B instead of v) from copy-pasting out of PDFs, rich-text
editors, or web pages with decorative fonts. httpx hard-encodes all
HTTP headers as ASCII, so the non-ASCII char in the Authorization
header triggers the error.
Adds TestApiKeyClientSync with two tests verifying:
- All three key locations are synced after sanitization
- Recovery handles client=None (pre-init) without crashing
Three independent fixes:
1. Reset activity timestamp on cached agent reuse (#9051)
When the gateway reuses a cached AIAgent for a new turn, the
_last_activity_ts from the previous turn (possibly hours ago)
carried over. The inactivity timeout handler immediately saw
the agent as idle for hours and killed it.
Fix: reset _last_activity_ts, _last_activity_desc, and
_api_call_count when retrieving an agent from the cache.
2. Detect uv-managed virtual environments (#8620 sub-issue 1)
The systemd unit generator fell back to sys.executable (uv's
standalone Python) when running under 'uv run', because
sys.prefix == sys.base_prefix (uv doesn't set up traditional
venv activation). The generated ExecStart pointed to a Python
binary without site-packages, crashing the service on startup.
Fix: check VIRTUAL_ENV env var before falling back to
sys.executable. uv sets VIRTUAL_ENV even when sys.prefix
doesn't reflect the venv.
3. Nudge model to continue after empty post-tool response (#9400)
Weaker models (GLM-5, mimo-v2-pro) sometimes return empty
responses after tool calls instead of continuing to the next
step. The agent silently abandoned the remaining work with
'(empty)' or used prior-turn fallback text.
Fix: when the model returns empty after tool calls AND there's
no prior-turn content to fall back on, inject a one-time user
nudge message telling the model to process the tool results and
continue. The flag resets after each successful tool round so it
can fire again on later rounds.
Test plan: 97 gateway + CLI tests pass, 9 venv detection tests pass
Previously, non-integer context_length values (e.g. '256K') in
config.yaml were silently ignored, causing the agent to fall back
to 128K auto-detection with no user feedback. This was confusing
for users with custom LiteLLM endpoints expecting larger context.
Now prints a clear stderr warning and logs at WARNING level when
model.context_length or custom_providers[].models.<model>.context_length
cannot be parsed as an integer, telling users to use plain integers
(e.g. 256000 instead of '256K').
Reported by community user ChFarhan via Discord.
When compression fails after max attempts, the agent returns
{completed: False, partial: True} but was missing the 'failed' flag.
The gateway's agent_failed_early guard checked for 'failed' AND
'not final_response', but _run_agent_blocking always converts errors
to final_response — making the guard dead code. This caused the
oversized session to persist, creating an infinite fail loop where
every subsequent message hits the same compression failure.
Changes:
- run_agent.py: add 'failed: True' and 'compression_exhausted: True'
to all 5 compression-exhaustion return paths
- gateway/run.py (_run_agent_blocking): forward 'failed' and
'compression_exhausted' flags through to the caller
- gateway/run.py (_handle_message_with_agent): fix agent_failed_early
to check bool(failed) without the broken 'not final_response' clause;
auto-reset the session when compression is exhausted so the next
message starts fresh
- Update tests to match new guard logic and add
TestCompressionExhaustedFlag test class
Closes#9893
Three bugfixes in the agent loop:
1. Reset retry counters after context compression. Without this,
pre-compression retry counts carry over, causing the model to
hit empty-response recovery immediately after a compression-
induced context loss, wasting API calls on a now-valid context.
2. Unmute output in the final-response (no-tool-call) branch.
_mute_post_response could be left True from a prior housekeeping
turn, silently suppressing empty-response warnings and recovery
status that the user should see.
3. Stop injecting 'Calling the X tools...' into assistant message
content when falling back to prior-turn content. This mutated
conversation history with synthetic text that the model never
produced, poisoning subsequent turns.
API keys containing Unicode lookalike characters (e.g. ʋ U+028B instead
of v) cause UnicodeEncodeError when httpx encodes the Authorization
header as ASCII. This commonly happens when users copy-paste keys from
PDFs, rich-text editors, or web pages with decorative fonts.
Three layers of defense:
1. **Save-time validation** (hermes_cli/config.py):
_check_non_ascii_credential() strips non-ASCII from credential values
when saving to .env, with a clear warning explaining the issue.
2. **Load-time sanitization** (hermes_cli/env_loader.py):
_sanitize_loaded_credentials() strips non-ASCII from credential env
vars (those ending in _API_KEY, _TOKEN, _SECRET, _KEY) after dotenv
loads them, so the rest of the codebase never sees non-ASCII keys.
3. **Runtime recovery** (run_agent.py):
The UnicodeEncodeError recovery block now also sanitizes self.api_key
and self._client_kwargs['api_key'], fixing the gap where message/tool
sanitization succeeded but the API key still caused httpx to fail on
the Authorization header.
Also: hermes_logging.py RotatingFileHandler now explicitly sets
encoding='utf-8' instead of relying on locale default (defensive
hardening for ASCII-locale systems).
* feat(skills): add fitness-nutrition skill to optional-skills
Cherry-picked from PR #9177 by @haileymarshall.
Adds a fitness and nutrition skill for gym-goers and health-conscious users:
- Exercise search via wger API (690+ exercises, free, no auth)
- Nutrition lookup via USDA FoodData Central (380K+ foods, DEMO_KEY fallback)
- Offline body composition calculators (BMI, TDEE, 1RM, macros, body fat %)
- Pure stdlib Python, no pip dependencies
Changes from original PR:
- Moved from skills/ to optional-skills/health/ (correct location)
- Fixed BMR formula in FORMULAS.md (removed confusing -5+10, now just +5)
- Fixed author attribution to match PR submitter
- Marked USDA_API_KEY as optional (DEMO_KEY works without signup)
Also adds optional env var support to the skill readiness checker:
- New 'optional: true' field in required_environment_variables entries
- Optional vars are preserved in metadata but don't block skill readiness
- Optional vars skip the CLI capture prompt flow
- Skills with only optional missing vars show as 'available' not 'setup_needed'
* fix: increase CLI response text padding to 4-space tab indent
Increases horizontal padding on all response display paths:
- Rich Panel responses (main, background, /btw): padding (1,2) -> (1,4)
- Streaming text: add 4-space indent prefix to each line
- Streaming TTS: add 4-space indent prefix to sentences
Gives response text proper breathing room with a tab-width indent.
Rich Panel word wrapping automatically adjusts for the wider padding.
Requested by AriesTheCoder.
* fix: word-wrap verbose tool call args and results to terminal width
Verbose mode (tool_progress: verbose) printed tool args and results as
single unwrapped lines that could be thousands of characters long.
Adds _wrap_verbose() helper that:
- Pretty-prints JSON args with indent=2 instead of one-line dumps
- Splits text on existing newlines (preserves JSON/structured output)
- Wraps lines exceeding terminal width with 5-char continuation indent
- Uses break_long_words=True for URLs and paths without spaces
Applied to all 4 verbose print sites:
- Concurrent tool call args
- Concurrent tool results
- Sequential tool call args
- Sequential tool results
---------
Co-authored-by: haileymarshall <haileymarshall@users.noreply.github.com>
GPT-5.4 supports none/low/medium/high/xhigh but not 'minimal'.
Users may configure 'minimal' via OpenRouter conventions, which would
cause a 400 on native OpenAI. Clamp to 'low' in the codex_responses
path before sending.
Plugins can now return {"action": "block", "message": "reason"} from
their pre_tool_call hook to prevent a tool from executing. The error
message is returned to the model as a tool result so it can adjust.
Covers both execution paths: handle_function_call (model_tools.py) and
agent-level tools (run_agent.py _invoke_tool + sequential/concurrent).
Blocked tools skip all side effects (counter resets, checkpoints,
callbacks, read-loop tracker).
Adds skip_pre_tool_call_hook flag to avoid double-firing the hook when
run_agent.py already checked and then calls handle_function_call.
Salvaged from PR #5385 (gianfrancopiana) and PR #4610 (oredsecurity).
- Use isinstance() with try/except import for CopilotACPClient check
in _to_async_client instead of fragile __class__.__name__ string check
- Restore accurate comment: GPT-5.x models *require* (not 'often require')
the Responses API on OpenAI/OpenRouter; ACP is the exception, not a
softening of the requirement
- Add inline comment explaining the ACP exclusion rationale
Plugin context engines loaded via load_context_engine() were never
given context_length, causing the CLI status bar to show "ctx --"
with an empty progress bar. Call update_model() immediately after
loading the plugin engine, mirroring what switch_model() already does.
FixesNousResearch/hermes-agent#9071
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously, long-running streamed responses could be incorrectly treated
as idle by the gateway/cron inactivity timeout even while tokens were
actively arriving. The _touch_activity() call (which feeds
get_activity_summary() polled by the external timeout) was either called
only on the first chunk (chat completions) or not at all (Anthropic,
Codex, Codex fallback).
Add _touch_activity() on every chunk/event in all four streaming paths
so the inactivity monitor knows data is still flowing.
Fixes#8760
The v11→v12 migration converts custom_providers (list) into providers
(dict), then deletes the list. But all runtime resolvers read from
custom_providers — after migration, named custom endpoints silently stop
resolving and fallback chains fail with AuthError.
Add get_compatible_custom_providers() that reads from both config schemas
(legacy custom_providers list + v12+ providers dict), normalizes entries,
deduplicates, and returns a unified list. Update ALL consumers:
- hermes_cli/runtime_provider.py: _get_named_custom_provider() + key_env
- hermes_cli/auth_commands.py: credential pool provider names
- hermes_cli/main.py: model picker + _model_flow_named_custom()
- agent/auxiliary_client.py: key_env + custom_entry model fallback
- agent/credential_pool.py: _iter_custom_providers()
- cli.py + gateway/run.py: /model switch custom_providers passthrough
- run_agent.py + gateway/run.py: per-model context_length lookup
Also: use config.pop() instead of del for safer migration, fix stale
_config_version assertions in tests, add pool mock to codex test.
Co-authored-by: 墨綠BG <s5460703@gmail.com>
Closes#8776, salvaged from PR #8814
The existing ASCII codec handler only sanitized conversation messages,
leaving tool schemas, system prompts, ephemeral prompts, prefill messages,
and HTTP headers as unhandled sources of non-ASCII content. On systems
with LANG=C or non-UTF-8 locale, Unicode symbols in tool descriptions
(e.g. arrows, em-dashes from prompt_builder) and system prompt content
would cause UnicodeEncodeError that fell through to the error path.
Changes:
- Add _sanitize_structure_non_ascii() generic recursive walker for
nested dict/list payloads
- Add _sanitize_tools_non_ascii() thin wrapper for tool schemas
- Add _force_ascii_payload flag: once ASCII locale is detected, all
subsequent API calls get proactively sanitized (prevents recurring
failures from new tool results bringing fresh Unicode each turn)
- Extend the ASCII codec error handler to sanitize: prefill_messages,
tool schemas (self.tools), system prompt, ephemeral system prompt,
and default HTTP headers
- Update stale comment that acknowledged the gap
Cherry-picked from PR #8834 (credential pool changes dropped as
separate concern).
Remove the backward-compat code paths that read compression provider/model
settings from legacy config keys and env vars, which caused silent failures
when auto-detection resolved to incompatible backends.
What changed:
- Remove compression.summary_model, summary_provider, summary_base_url from
DEFAULT_CONFIG and cli.py defaults
- Remove backward-compat block in _resolve_task_provider_model() that read
from the legacy compression section
- Remove _get_auxiliary_provider() and _get_auxiliary_env_override() helper
functions (AUXILIARY_*/CONTEXT_* env var readers)
- Remove env var fallback chain for per-task overrides
- Update hermes config show to read from auxiliary.compression
- Add config migration (v16→17) that moves non-empty legacy values to
auxiliary.compression and strips the old keys
- Update example config and openclaw migration script
- Remove/update tests for deleted code paths
Compression model/provider is now configured exclusively via:
auxiliary.compression.provider / auxiliary.compression.model
Closes#8923
Add <thought>(.*?)</thought> to inline_patterns so Gemma 4
reasoning content is captured for /reasoning display, not just
stripped from visible output.
Closes#8891
Co-authored-by: RhushabhVaghela <rhushabhvaghela@users.noreply.github.com>
Three targeted changes to close the gaps between retry layers that
caused users to experience 'No response from provider for 580s' and
'No activity for 15 minutes' despite having 5 layers of retry:
1. Remove non-streaming fallback from streaming path
Previously, when all 3 stream retries exhausted, the code fell back
to _interruptible_api_call() which had no stale detection and no
activity tracking — a black hole that could hang for up to 1800s.
Now errors propagate to the main retry loop which has richer recovery
(credential rotation, provider fallback, backoff).
For 'stream not supported' errors, sets _disable_streaming flag so
the main retry loop automatically switches to non-streaming on the
next attempt.
2. Add _touch_activity to recovery dead zones
The gateway inactivity monitor relies on _touch_activity() to know
the agent is alive, but activity was never touched during:
- Stale stream detection/kill cycles (180-300s gaps)
- Stream retry connection rebuilds
- Main retry backoff sleeps (up to 120s)
- Error recovery classification
Now all these paths touch activity every ~30s, keeping the gateway
informed during recovery cycles.
3. Add stale-call detector to non-streaming path
_interruptible_api_call() now has the same stale detection pattern
as the streaming path: kills hung connections after 300s (default,
configurable via HERMES_API_CALL_STALE_TIMEOUT), scaled for large
contexts (450s for 50K+ tokens, 600s for 100K+ tokens), disabled
for local providers.
Also touches activity every ~30s during the wait so the gateway
monitor stays informed.
Env vars:
- HERMES_API_CALL_STALE_TIMEOUT: non-streaming stale timeout (default 300s)
- HERMES_STREAM_STALE_TIMEOUT: unchanged (default 180s)
Before: worst case ~2+ hours of sequential retries with no feedback
After: worst case bounded by gateway inactivity timeout (default 1800s)
with continuous activity reporting
The post-loop grace call mechanism was broken: it injected a user
message and set _budget_grace_call=True, but could never re-enter the
while loop (already exited). Worse, the flag blocked the fallback
_handle_max_iterations from running, so final_response stayed None.
Users saw empty/no response when the agent hit max iterations.
Fix: remove the dead grace block and let _handle_max_iterations handle
it directly — it already injects a summary request and makes one extra
toolless API call.
When streaming fails after partial content delivery (e.g. OpenRouter
timeout kills connection mid-response), the stub response now carries
the accumulated streamed text instead of content=None.
Two fixes:
1. The partial-stream stub response includes recovered content from
_current_streamed_assistant_text — the text that was already
delivered to the user via stream callbacks before the connection
died.
2. The empty response recovery chain now checks for partial stream
content BEFORE falling back to _last_content_with_tools (prior
turn content) or wasting API calls on retries. This prevents:
- Showing wrong content from a prior turn
- Burning 3+ unnecessary retry API calls
- Falling through to '(empty)' when the user already saw content
The root cause: OpenRouter has a ~125s inactivity timeout. When
Anthropic's SSE stream goes silent during extended reasoning, the
proxy kills the connection. The model's text was already partially
streamed but the stub discarded it, triggering the empty recovery
chain which would show stale prior-turn content or waste retries.
OpenCode Zen was in _DOT_TO_HYPHEN_PROVIDERS, causing all dotted model
names (minimax-m2.5-free, gpt-5.4, glm-5.1) to be mangled. The fix:
Layer 1 (model_normalize.py): Remove opencode-zen from the blanket
dot-to-hyphen set. Add an explicit block that preserves dots for
non-Claude models while keeping Claude hyphenated (Zen's Claude
endpoint uses anthropic_messages mode which expects hyphens).
Layer 2 (run_agent.py _anthropic_preserve_dots): Add opencode-zen and
zai to the provider allowlist. Broaden URL check from opencode.ai/zen/go
to opencode.ai/zen/ to cover both Go and Zen endpoints. Add bigmodel.cn
for ZAI URL detection.
Also adds glm-5.1 to ZAI model lists in models.py and setup.py.
Closes#7710
Salvaged from contributions by:
- konsisumer (PR #7739, #7719)
- DomGrieco (PR #8708)
- Esashiero (PR #7296)
- sharziki (PR #7497)
- XiaoYingGee (PR #8750)
- APTX4869-maker (PR #8752)
- kagura-agent (PR #7157)
_check_compression_model_feasibility() called get_model_context_length()
without passing config_context_length, so custom endpoints that do not
support /models API queries always fell through to the 128K default,
ignoring auxiliary.compression.context_length in config.yaml.
Fix: read auxiliary.compression.context_length from config and pass it
as config_context_length (highest-priority hint) so the user-configured
value is always respected regardless of API availability.
Fixes#8499
Three fixes for the (empty) response bug affecting open reasoning models:
1. Allow retries after prefill exhaustion — models like mimo-v2-pro always
populate reasoning fields via OpenRouter, so the old 'not _has_structured'
guard on the retry path blocked retries for EVERY reasoning model after
the 2 prefill attempts. Now: 2 prefills + 3 retries = 6 total attempts
before (empty).
2. Reset prefill/retry counters on tool-call recovery — the counters
accumulated across the entire conversation, never resetting during
tool-calling turns. A model cycling empty→prefill→tools→empty burned
both prefill attempts and the third empty got zero recovery. Now
counters reset when prefill succeeds with tool calls.
3. Strip think blocks before _truly_empty check — inline <think> content
made the string non-empty, skipping both retry paths.
Reported by users on Telegram with xiaomi/mimo-v2-pro and qwen3.5 models.
Reproduced: qwen3.5-9b emits tool calls as XML in reasoning field instead
of proper function calls, causing content=None + tool_calls=None + reasoning
with embedded <tool_call> XML. Prefill recovery works but counter
accumulation caused permanent (empty) in long sessions.
Previously, all invalid API responses (choices=None) were diagnosed
as 'fast response often indicates rate limiting' regardless of actual
response time or error code. A 738s Cloudflare 524 timeout was labeled
as 'fast response' and 'possible rate limit'.
Now extracts the error code from response.error and classifies:
- 524: upstream provider timed out (Cloudflare)
- 504: upstream gateway timeout
- 429: rate limited by upstream provider
- 500/502: upstream server error
- 503/529: upstream provider overloaded
- Other codes: shown with code number
- No code + <10s: likely rate limited (timing heuristic)
- No code + >60s: likely upstream timeout
- No code + 10-60s: neutral response time
All downstream messages (retry status, final error, interrupt message)
now use the classified hint instead of generic rate-limit language.
Reported by community member Lumen Radley (MiMo provider timeouts).
Gemma 4 (26B/31B) uses <thought>...</thought> to wrap its reasoning
output. This tag was not included in the existing list of reasoning tag
variants stripped by _strip_think_blocks(), causing raw thinking blocks
to leak into the visible response.
Added a new re.sub() line for <thought> and extended the cleanup regex
to include 'thought' alongside the existing variants.
Fixes#6148
When running inside WSL (Windows Subsystem for Linux), inject a hint into
the system prompt explaining that the Windows host filesystem is mounted
at /mnt/c/, /mnt/d/, etc. This lets the agent naturally translate Windows
paths (Desktop, Documents) to their /mnt/ equivalents without the user
needing to configure anything.
Uses the existing is_wsl() detection from hermes_constants (cached,
checks /proc/version for 'microsoft'). Adds build_environment_hints()
in prompt_builder.py — extensible for Termux, Docker, etc. later.
Closes the UX gap where WSL users had to manually explain path
translation to the agent every session.
Adds an optional focus topic to /compress: `/compress database schema`
guides the summariser to preserve information related to the focus topic
(60-70% of summary budget) while compressing everything else more aggressively.
Inspired by Claude Code's /compact <focus>.
Changes:
- context_compressor.py: focus_topic parameter on _generate_summary() and
compress(); appends FOCUS TOPIC guidance block to the LLM prompt
- run_agent.py: focus_topic parameter on _compress_context(), passed through
to the compressor
- cli.py: _manual_compress() extracts focus topic from command string,
preserves existing manual_compression_feedback integration (no regression)
- gateway/run.py: _handle_compress_command() extracts focus from event args
and passes through — full gateway parity
- commands.py: args_hint="[focus topic]" on /compress CommandDef
Salvaged from PR #7459 (CLI /compress focus only — /context command deferred).
15 new tests across CLI, compressor, and gateway.
* feat: component-separated logging with session context and filtering
Phase 1 — Gateway log isolation:
- gateway.log now only receives records from gateway.* loggers
(platform adapters, session management, slash commands, delivery)
- agent.log remains the catch-all (all components)
- errors.log remains WARNING+ catch-all
- Moved gateway.log handler creation from gateway/run.py into
hermes_logging.setup_logging(mode='gateway') with _ComponentFilter
Phase 2 — Session ID injection:
- Added set_session_context(session_id) / clear_session_context() API
using threading.local() for per-thread session tracking
- _SessionFilter enriches every log record with session_tag attribute
- Log format: '2026-04-11 10:23:45 INFO [session_id] logger.name: msg'
- Session context set at start of run_conversation() in run_agent.py
- Thread-isolated: gateway conversations on different threads don't leak
Phase 3 — Component filtering in hermes logs:
- Added --component flag: hermes logs --component gateway|agent|tools|cli|cron
- COMPONENT_PREFIXES maps component names to logger name prefixes
- Works with all existing filters (--level, --session, --since, -f)
- Logger name extraction handles both old and new log formats
Files changed:
- hermes_logging.py: _SessionFilter, _ComponentFilter, COMPONENT_PREFIXES,
set/clear_session_context(), gateway.log creation in setup_logging()
- gateway/run.py: removed redundant gateway.log handler (now in hermes_logging)
- run_agent.py: set_session_context() at start of run_conversation()
- hermes_cli/logs.py: --component filter, logger name extraction
- hermes_cli/main.py: --component argument on logs subparser
Addresses community request for component-separated, filterable logging.
Zero changes to existing logger names — __name__ already provides hierarchy.
* fix: use LogRecord factory instead of per-handler _SessionFilter
The _SessionFilter approach required attaching a filter to every handler
we create. Any handler created outside our _add_rotating_handler (like
the gateway stderr handler, or third-party handlers) would crash with
KeyError: 'session_tag' if it used our format string.
Replace with logging.setLogRecordFactory() which injects session_tag
into every LogRecord at creation time — process-global, zero per-handler
wiring needed. The factory is installed at import time (before
setup_logging) so session_tag is available from the moment hermes_logging
is imported.
- Idempotent: marker attribute prevents double-wrapping on module reload
- Chains with existing factory: won't break third-party record factories
- Removes _SessionFilter from _add_rotating_handler and setup_verbose_logging
- Adds tests: record factory injection, idempotency, arbitrary handler compat
The _get_budget_warning() method already returned None unconditionally —
the entire budget warning system was disabled. Remove all dead code:
- _BUDGET_WARNING_RE regex
- _strip_budget_warnings_from_history() function and its call site
- Both injection blocks (concurrent + sequential tool execution)
- _get_budget_warning() method
- 7 tests for the removed functions
The budget exhaustion grace call system (_budget_exhausted_injected,
_budget_grace_call) is a separate recovery mechanism and is preserved.
Normalize api_messages before each API call for consistent prefix
matching across turns:
1. Strip leading/trailing whitespace from system prompt parts
2. Strip leading/trailing whitespace from message content strings
3. Normalize tool-call arguments to compact sorted JSON
This enables KV cache reuse on local inference servers (llama.cpp,
vLLM, Ollama) and improves cache hit rates for cloud providers.
All normalization operates on the api_messages copy — the original
conversation history in messages is never mutated. Tool-call JSON
normalization creates new dicts via spread to avoid the shallow-copy
mutation bug in the original PR.
Salvaged from PR #7875 by @waxinz with mutation fix.
Switch estimate_tokens_rough(), estimate_messages_tokens_rough(), and
estimate_request_tokens_rough() from floor division (len // 4) to
ceiling division ((len + 3) // 4). Short texts (1-3 chars) previously
estimated as 0 tokens, causing the compressor and pre-flight checks to
systematically undercount when many short tool results are present.
Also replaced the inline duplicate formula in run_conversation()
(total_chars // 4) with a call to the shared
estimate_messages_tokens_rough() function.
Updated 4 tests that hardcoded floor-division expected values.
Related: issue #6217, PR #6629
Add display.interim_assistant_messages config (enabled by default) that
forwards completed assistant commentary between tool calls to the user
as separate chat messages. Models already emit useful status text like
'I'll inspect the repo first.' — this surfaces it on Telegram, Discord,
and other messaging platforms instead of swallowing it.
Independent from tool_progress and gateway streaming. Disabled for
webhooks. Uses GatewayStreamConsumer when available, falls back to
direct adapter send. Tracks response_previewed to prevent double-delivery
when interim message matches the final response.
Also fixes: cursor not stripped from fallback prefix in stream consumer
(affected continuation calculation on no-edit platforms like Signal).
Cherry-picked from PR #7885 by asheriif, default changed to enabled.
Fixes#5016
Three root causes of the 'agent stops mid-task' gateway bug:
1. Compression threshold floor (64K tokens minimum)
- The 50% threshold on a 100K-context model fired at 50K tokens,
causing premature compression that made models lose track of
multi-step plans. Now threshold_tokens = max(50% * context, 64K).
- Models with <64K context are rejected at startup with a clear error.
2. Budget warning removal — grace call instead
- Removed the 70%/90% iteration budget warnings entirely. These
injected '[BUDGET WARNING: Provide your final response NOW]' into
tool results, causing models to abandon complex tasks prematurely.
- Now: no warnings during normal execution. When the budget is
actually exhausted (90/90), inject a user message asking the model
to summarise, allow one grace API call, and only then fall back
to _handle_max_iterations.
3. Activity touches during long terminal execution
- _wait_for_process polls every 0.2s but never reported activity.
The gateway's inactivity timeout (default 1800s) would fire during
long-running commands that appeared 'idle.'
- Now: thread-local activity callback fires every 10s during the
poll loop, keeping the gateway's activity tracker alive.
- Agent wires _touch_activity into the callback before each tool call.
Also: docs update noting 64K minimum context requirement.
Closes#7915 (root cause was agent-loop termination, not Weixin delivery limits).
Replace the verbose_logging-gated logging.exception() with an
unconditional logger.debug(exc_info=True). The full traceback now
always lands in agent.log when debug logging is enabled, without
requiring the verbose_logging flag or spamming the console.
Previously, production errors in the 700-line response processing
block (normalization, tool dispatch, final response handling) were
logged as one-line messages with the traceback hidden behind
verbose_logging — making post-mortem debugging difficult.
All retry counters (_invalid_tool_retries, _invalid_json_retries,
_empty_content_retries, _incomplete_scratchpad_retries,
_codex_incomplete_retries) are initialized to 0 at the top of
run_conversation() (lines 7566-7570). The hasattr guards added before
the reset block existed are now dead code — the attributes always exist.
Removed 7 redundant hasattr checks (5 original targets + 2 bonus for
_codex_incomplete_retries found during cleanup).
When _try_activate_fallback() switches to a new provider, retry_count was
reset to 0 but compression_attempts and primary_recovery_attempted were
not. This meant a fallback provider that hit context overflow would only
get the leftover compression budget from the failed primary provider,
and transport recovery was blocked because the flag was still True from
the old provider's attempt.
Reset both counters at all 5 fallback activation sites inside the retry
loop so each fallback provider gets a fresh compression budget (3 attempts)
and its own transport recovery opportunity.
When replaying codex_reasoning_items from previous turns,
duplicate item IDs (rs_*) could appear in the input array,
causing HTTP 400 "Duplicate item found" errors from the
OpenAI Responses API.
Add seen_item_ids tracking in both _chat_messages_to_responses_input()
and _preflight_codex_input_items() to skip already-added reasoning
items by their ID.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The interrupt mechanism in tools/interrupt.py used a process-global
threading.Event. In the gateway, multiple agents run concurrently in
the same process via run_in_executor. When any agent was interrupted
(user sends a follow-up message), the global flag killed ALL agents'
running tools — terminal commands, browser ops, web requests — across
all sessions.
Changes:
- tools/interrupt.py: Replace single threading.Event with a set of
interrupted thread IDs. set_interrupt() targets a specific thread;
is_interrupted() checks the current thread. Includes a backward-
compatible _ThreadAwareEventProxy for legacy _interrupt_event usage.
- run_agent.py: Store execution thread ID at start of run_conversation().
interrupt() and clear_interrupt() pass it to set_interrupt() so only
this agent's thread is affected.
- tools/code_execution_tool.py: Use is_interrupted() instead of
directly checking _interrupt_event.is_set().
- tools/process_registry.py: Same — use is_interrupted().
- tests: Update interrupt tests for per-thread semantics. Add new
TestPerThreadInterruptIsolation with two tests verifying cross-thread
isolation.
Models that do not use <think> tags (e.g. GLM-4.7 on NVIDIA Build,
minimax) may return content=None or empty string when truncated. The
previous _thinking_exhausted check treated any None/empty content as
thinking-budget exhaustion, causing these models to always show the
'Thinking Budget Exhausted' error instead of attempting continuation.
Fix: gate the exhaustion check on _has_think_tags — only trigger the
exhaustion path when the model actually produced reasoning blocks
(<think>, <thinking>, <reasoning>, <REASONING_SCRATCHPAD>). Models
without think tags now fall through to the normal continuation retry
logic (up to 3 attempts).
Fixes#7729
When API routers rewrite finish_reason from "length" to "tool_calls",
truncated JSON arguments bypassed the length handler and wasted 3
retry attempts in the generic JSON validation loop. Now detects
truncation patterns in tool call arguments regardless of finish_reason.
Fixes#7680
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>