hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-04-25 00:51:20 +00:00

Author	SHA1	Message	Date
ygd58	151654851c	fix(agent): prevent false thinking-exhaustion for non-reasoning models Models that do not use <think> tags (e.g. GLM-4.7 on NVIDIA Build, minimax) may return content=None or empty string when truncated. The previous _thinking_exhausted check treated any None/empty content as thinking-budget exhaustion, causing these models to always show the 'Thinking Budget Exhausted' error instead of attempting continuation. Fix: gate the exhaustion check on _has_think_tags — only trigger the exhaustion path when the model actually produced reasoning blocks (<think>, <thinking>, <reasoning>, <REASONING_SCRATCHPAD>). Models without think tags now fall through to the normal continuation retry logic (up to 3 attempts). Fixes #7729	2026-04-11 13:47:25 -07:00
Tom Qiao	5910412002	fix: detect truncated tool_calls when finish_reason is not length When API routers rewrite finish_reason from "length" to "tool_calls", truncated JSON arguments bypassed the length handler and wasted 3 retry attempts in the generic JSON validation loop. Now detects truncation patterns in tool call arguments regardless of finish_reason. Fixes #7680 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-11 13:47:25 -07:00
Teknium	dafe443beb	feat: warn at session start when compression model context is too small (#7894 ) Two-phase design so the warning fires before the user's first message on every platform: Phase 1 (__init__): _check_compression_model_feasibility() runs during agent construction. Resolves the auxiliary compression model (same chain as call_llm with task='compression'), compares its context length to the main model's compression threshold. If too small, emits via _emit_status() (prints for CLI) and stores the warning in _compression_warning. Phase 2 (run_conversation, first call): _replay_compression_warning() re-sends the stored warning through status_callback — which the gateway wires AFTER construction. The warning is then cleared so it only fires once. This ensures: - CLI users see the warning immediately at startup (right after the context limit line) - Gateway users (Telegram, Discord, Slack, WhatsApp, Signal, Matrix, Mattermost, Home Assistant, DingTalk, etc.) receive it via status_callback('lifecycle', ...) on their first message - logger.warning() always hits agent.log regardless of platform Also warns when no auxiliary LLM provider is configured at all. Entire check wrapped in try/except — never blocks startup. 11 tests covering: core warning logic, boundary conditions, exception safety, two-phase store+replay, gateway callback wiring, and single-delivery guarantee.	2026-04-11 12:01:30 -07:00
Teknium	06e1d9cdd4	fix: resolve three high-impact community bugs (#5819 , #6893 , #3388 ) (#7881 ) Matrix gateway: fix sync loop never dispatching events (#5819) - _sync_loop() called client.sync() but never called handle_sync() to dispatch events to registered callbacks — _on_room_message was registered but never fired for new messages - Store next_batch token from initial sync and pass as since= to subsequent incremental syncs (was doing full initial sync every time) - 17 comments, confirmed by multiple users on matrix.org Feishu docs: add interactive card configuration for approvals (#6893) - Error 200340 is a Feishu Developer Console configuration issue, not a code bug — users need to enable Interactive Card capability and configure Card Request URL - Added required 3-step setup instructions to feishu.md - Added troubleshooting entry for error 200340 - 17 comments from Feishu users Copilot provider drift: detect GPT-5.x Responses API requirement (#3388) - GPT-5.x models are rejected on /v1/chat/completions by both OpenAI and OpenRouter (unsupported_api_for_model error) - Added _model_requires_responses_api() to detect models needing Responses API regardless of provider - Applied in __init__ (covers OpenRouter primary users) and in _try_activate_fallback() (covers Copilot->OpenRouter drift) - Fixed stale comment claiming gateway creates fresh agents per message (it caches them via _agent_cache since the caching was added) - 7 comments, reported on Copilot+Telegram gateway	2026-04-11 11:12:20 -07:00
kshitijk4poor	af9caec44f	fix(qwen): correct context lengths for qwen3-coder models and send max_tokens to portal Based on PR #7285 by @kshitijk4poor. Two bugs affecting Qwen OAuth users: 1. Wrong context window — qwen3-coder-plus showed 128K instead of 1M. Added specific entries before the generic qwen catch-all: - qwen3-coder-plus: 1,000,000 (corrected from PR's 1,048,576 per official Alibaba Cloud docs and OpenRouter) - qwen3-coder: 262,144 2. Random stopping — max_tokens was suppressed for Qwen Portal, so the server applied its own low default. Reasoning models exhaust that on thinking tokens. Now: honor explicit max_tokens, default to 65536 when unset. Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>	2026-04-11 03:29:31 -07:00
kshitijk4poor	d442f25a2f	fix: align MiniMax provider with official API docs Aligns MiniMax provider with official API documentation. Fixes 6 bugs: transport mismatch (openai_chat -> anthropic_messages), credential leak in switch_model(), prompt caching sent to non-Anthropic endpoints, dot-to-hyphen model name corruption, trajectory compressor URL routing, and stale doctor health check. Also corrects context window (204,800), thinking support (manual mode), max output (131,072), and model catalog (M2 family only on /anthropic). Source: https://platform.minimax.io/docs/api-reference/text-anthropic-api Co-authored-by: kshitijk4poor <kshitijk4poor@users.noreply.github.com>	2026-04-11 01:04:41 -07:00
Moris Chao	5b16f31702	feat(plugins): pass sender_id to pre_llm_call hook The pre_llm_call plugin hook receives session_id, user_message, conversation_history, is_first_turn, model, and platform — but not the sender's user_id. This means plugins cannot perform per-user access control (e.g. restricting knowledge base recall to authorized users). The gateway already passes source.user_id as user_id to AIAgent, which stores it in self._user_id. This change forwards it as sender_id in the pre_llm_call kwargs so plugins can use it for ACL decisions. For CLI sessions where no user_id exists, sender_id defaults to empty string. Plugins can treat empty sender_id as a trusted local call (the owner is at the terminal) or deny it depending on their ACL policy.	2026-04-11 00:43:20 -07:00
Teknium	caf371da18	fix: MiniMax/Alibaba incorrectly detected as Anthropic OAuth, causing mcp_ tool prefix (#7509 ) _is_oauth_token() returned True for any key not starting with 'sk-ant-api', which means MiniMax and Alibaba API keys were falsely treated as Anthropic OAuth tokens. This triggered the Claude Code compatibility path: - All tool names prefixed with mcp_ (e.g. mcp_terminal, mcp_web_search) - System prompt injected with 'You are Claude Code' identity - 'Hermes Agent' replaced with 'Claude Code' throughout Fix: Make _is_oauth_token() positively identify Anthropic OAuth tokens by their key format instead of using a broad catch-all: - sk-ant-* (but not sk-ant-api-) -> setup tokens, managed keys - eyJ -> JWTs from Anthropic OAuth flow - Everything else -> False (MiniMax, Alibaba, etc.) Reported by stefan171.	2026-04-11 00:43:01 -07:00
Teknium	436dfd5ab5	fix: no auto-activation + unified hermes plugins UI with provider categories - Remove auto-activation: when context.engine is 'compressor' (default), plugin-registered engines are NOT used. Users must explicitly set context.engine to a plugin name to activate it. - Add curses_radiolist() to curses_ui.py: single-select radio picker with keyboard nav + text fallback, matching curses_checklist pattern. - Rewrite cmd_toggle() as composite plugins UI: Top section: general plugins with checkboxes (existing behavior) Bottom section: provider plugin categories (Memory Provider, Context Engine) with current selection shown inline. ENTER/SPACE on a category opens a radiolist sub-screen for single-select configuration. - Add provider discovery helpers: _discover_memory_providers(), _discover_context_engines(), config read/save for memory.provider and context.engine. - Add tests: radiolist non-TTY fallback, provider config save/load, discovery error handling, auto-activation removal verification.	2026-04-10 19:15:50 -07:00
Teknium	3fe6938176	fix: robust context engine interface — config selection, plugin discovery, ABC completeness Follow-up fixes for the context engine plugin slot (PR #5700): - Enhance ContextEngine ABC: add threshold_percent, protect_first_n, protect_last_n as class attributes; complete update_model() default with threshold recalculation; clarify on_session_end() lifecycle docs - Add ContextCompressor.update_model() override for model/provider/ base_url/api_key updates - Replace all direct compressor internal access in run_agent.py with ABC interface: switch_model(), fallback restore, context probing all use update_model() now; _context_probed guarded with getattr/ hasattr for plugin engine compatibility - Create plugins/context_engine/ directory with discovery module (mirrors plugins/memory/ pattern) — discover_context_engines(), load_context_engine() - Add context.engine config key to DEFAULT_CONFIG (default: compressor) - Config-driven engine selection in run_agent.__init__: checks config, then plugins/context_engine/<name>/, then general plugin system, falls back to built-in ContextCompressor - Wire on_session_end() in shutdown_memory_provider() at real session boundaries (CLI exit, /reset, gateway expiry)	2026-04-10 19:15:50 -07:00
Stephen Schoettler	5d8dd622bc	feat: wire context engine tools, session lifecycle, and tool dispatch - Inject engine tool schemas into agent tool surface after compressor init - Call on_session_start() with session_id, hermes_home, platform, model - Dispatch engine tool calls (lcm_grep, etc.) before regular tool handler - 55/55 tests pass	2026-04-10 19:15:50 -07:00
Stephen Schoettler	92382fb00e	feat: wire context engine plugin slot into agent and plugin system - PluginContext.register_context_engine() lets plugins replace the built-in ContextCompressor with a custom ContextEngine implementation - PluginManager stores the registered engine; only one allowed - run_agent.py checks for a plugin engine at init before falling back to the default ContextCompressor - reset_session_state() now calls engine.on_session_reset() instead of poking internal attributes directly - ContextCompressor.on_session_reset() handles its own internals (_context_probed, _previous_summary, etc.) - 19 new tests covering ABC contract, defaults, plugin slot registration, rejection of duplicates/non-engines, and compressor reset behavior - All 34 existing compressor tests pass unchanged	2026-04-10 19:15:50 -07:00
Teknium	842e669a13	fix: activate fallback provider on repeated empty responses + user-visible status (#7505 ) When models return empty responses (no content, no tool calls, no reasoning), Hermes previously retried 3 times silently then fell through to '(empty)' — without ever trying the fallback provider chain. Users on GLM-4.5-Air and similar models experienced what appeared to be a complete hang, especially in gateway (Telegram/Discord) contexts where the silent retries produced zero feedback. Changes: - After exhausting 3 empty retries, attempt _try_activate_fallback() before giving up with '(empty)'. If fallback succeeds, reset retry counter and continue the conversation loop with the new provider. - Replace all _vprint() calls in recovery paths with _emit_status(), which surfaces messages through both CLI (_vprint with force=True) and gateway (status_callback -> adapter.send). Users now see: * '⚠️ Empty response from model — retrying (N/3)' during retries * '⚠️ Model returning empty responses — switching to fallback...' * '↻ Switched to fallback: <model> (<provider>)' on success * '❌ Model returned no content after all retries [and fallback]' - Add logger.warning() throughout empty response paths for log file visibility (model name, provider, retry counts). - Upgrade _last_content_with_tools fallback from logger.debug to logger.info + _emit_status so recovery is visible. - Upgrade thinking-only prefill continuation to use _emit_status. Tests: - test_empty_response_triggers_fallback_provider: verifies fallback activation after 3 empty retries produces content from fallback model - test_empty_response_fallback_also_empty_returns_empty: verifies graceful degradation when fallback also returns empty - test_empty_response_emits_status_for_gateway: verifies _emit_status is called during retries so gateway users see feedback Addresses #7180.	2026-04-10 19:15:41 -07:00
pefontana	5b42aecfa7	feat(agent): add AIAgent.close() for subprocess cleanup Add a close() method to AIAgent that acts as a single entry point for releasing all resources held by an agent instance. This prevents zombie process accumulation on long-running gateway deployments by explicitly cleaning up: - Background processes tracked in ProcessRegistry - Terminal sandbox environments - Browser daemon sessions - Active child agents (subagent delegation) - OpenAI/httpx client connections Each cleanup step is independently guarded so a failure in one does not prevent the rest. The method is idempotent and safe to call multiple times. Also simplifies the background review cleanup to use close() instead of manually closing the OpenAI client. Ref: #7131	2026-04-10 16:51:44 -07:00
Teknium	ea81aa2eec	fix: guard api_kwargs in except handler to prevent UnboundLocalError (#7376 ) When _build_api_kwargs() throws an exception, the except handler in the retry loop referenced api_kwargs before it was assigned. This caused an UnboundLocalError that masked the real error, making debugging impossible for the user. Two _dump_api_request_debug() calls in the except block (non-retryable client error path and max-retries-exhausted path) both accessed api_kwargs without checking if it was assigned. Fix: initialize api_kwargs = None before the retry loop and guard both dump calls. Now the real error surfaces instead of the masking UnboundLocalError. Reported by Discord user gruman0.	2026-04-10 15:12:00 -07:00
angelos	7ccdb74364	fix(delegate): make max_concurrent_children configurable + error on excess `delegate_task` silently truncated batch tasks to 3 — the model sends 5 tasks, gets results for 3, never told 2 were dropped. Now returns a clear tool_error explaining the limit and how to fix it. The limit is configurable via: - delegation.max_concurrent_children in config.yaml (priority 1) - DELEGATION_MAX_CONCURRENT_CHILDREN env var (priority 2) - default: 3 Uses the same _load_config() path as the rest of delegate_task for consistent config priority. Clamps to min 1, warns on non-integer config values. Also removes the hardcoded maxItems: 3 from the JSON schema — the schema was blocking the model from even attempting >3 tasks before the runtime check could fire. The runtime check gives a much more actionable error message. Backwards compatible: default remains 3, existing configs unchanged.	2026-04-10 13:38:14 -07:00
Tranquil-Flow	6c115440fd	fix(delegate): sync self.base_url with client_kwargs after credential resolution When delegation.base_url routes subagents to a different endpoint, the correct URL was passed through _resolve_delegation_credentials() and _build_child_agent() into AIAgent.__init__(), but self.base_url could fall out of sync with client_kwargs["base_url"] — the value the OpenAI client actually uses. This caused billing_base_url in session records to show the parent's endpoint while actual API calls went to the correct delegation target. Keep self.base_url in sync with client_kwargs after the credential resolution block, matching the existing pattern for self.api_key. Fixes #6825	2026-04-10 13:38:14 -07:00
0xbyt4	0bea603510	fix: handle NoneType request_overrides in fast_mode check (#7350 )	2026-04-10 13:07:25 -07:00
Hermes Audit	2c99b4e79b	fix(unicode): sanitize surrogate metadata and allow two-pass retry	2026-04-10 13:05:01 -07:00
Hermes Audit	71036a7a75	fix: handle UnicodeEncodeError with ASCII codec (#6843 ) Broaden the UnicodeEncodeError recovery to handle systems with ASCII-only locale (LANG=C, Chromebooks) where ANY non-ASCII character causes encoding failure, not just lone surrogates. Changes: - Add _strip_non_ascii() and _sanitize_messages_non_ascii() helpers that strip all non-ASCII characters from message content, name, and tool_calls - Update the UnicodeEncodeError handler to detect ASCII codec errors and fall back to non-ASCII sanitization after surrogate check fails - Sanitize tool_calls arguments and name fields (not just content) - Fix bare .encode() in cli.py suspend handler to use explicit utf-8 - Add comprehensive test suite (17 tests)	2026-04-10 13:05:01 -07:00
Kenny Xie	916fbf362c	fix(model): tighten direct-provider fallback normalization	2026-04-10 05:52:45 -07:00
Kenny Xie	b730c2955a	fix(model): normalize direct provider ids in auxiliary routing	2026-04-10 05:52:45 -07:00
Kenny Xie	fd5cc6e1b4	fix(model): normalize native provider-prefixed model ids	2026-04-10 05:52:45 -07:00
Ronald Reis	fd3e855d58	fix: pass config_context_length to switch_model context compressor When switching models at runtime, the config_context_length override was not being passed to the new context compressor instance. This meant the user-specified context length from config.yaml was lost after a model switch. - Store _config_context_length on AIAgent instance during __init__ - Pass _config_context_length when creating new ContextCompressor in switch_model - Add test to verify config_context_length is preserved across model switches Fixes: quando estamos alterando o modelo não está alterando o tamanho do contexto	2026-04-10 05:52:45 -07:00
alt-glitch	96c060018a	fix: remove 115 verified dead code symbols across 46 production files Automated dead code audit using vulture + coverage.py + ast-grep intersection, confirmed by Opus deep verification pass. Every symbol verified to have zero production callers (test imports excluded from reachability analysis). Removes ~1,534 lines of dead production code across 46 files and ~1,382 lines of stale test code. 3 entire files deleted (agent/builtin_memory_provider.py, hermes_cli/checklist.py, tests/hermes_cli/test_setup_model_selection.py). Co-authored-by: alt-glitch <balyan.sid@gmail.com>	2026-04-10 03:44:43 -07:00
Teknium	68528068ec	fix(streaming): update stale-stream timer during Anthropic native streaming (#7117 ) The _call_anthropic() streaming path never updated last_chunk_time during the event loop — only once at stream start. The stale stream detector in the outer poll loop uses this timer, so any Anthropic stream longer than 180s was killed even when events were actively arriving. This self-inflicted a RemoteProtocolError that users saw as: '⚠️ Connection to provider dropped (RemoteProtocolError). Reconnecting…' The _call_chat_completions() path already updates last_chunk_time on every chunk (line 4475). This brings _call_anthropic() to parity. Also adds deltas_were_sent tracking to the Anthropic text_delta path so the retry loop knows not to retry after partial delivery (prevents duplicated output on connection drops mid-stream). Reported-by: Discord users (Castellani, Codename_11)	2026-04-10 03:34:56 -07:00
helix4u	5a8b5f149d	fix(run-agent): rotate credential pool on billing-classified 400s	2026-04-10 03:27:19 -07:00
helix4u	9aedab00f4	fix(run_agent): recover primary client on openai transport errors	2026-04-10 03:21:24 -07:00
kshitijk4poor	9431f82aff	fix: update Kimi Coding User-Agent to KimiCLI/1.30.0 The hardcoded User-Agent 'KimiCLI/1.3' is outdated — Kimi CLI is now at v1.30.0. The stale version string causes intermittent 403 errors from Kimi's coding endpoint ('only available for Coding Agents'). Update all 8 occurrences across run_agent.py, auxiliary_client.py, and doctor.py to 'KimiCLI/1.30.0' to match the current official Kimi CLI.	2026-04-10 02:37:28 -07:00
Teknium	8779a268a7	feat: add Anthropic Fast Mode support to /fast command (#7037 ) Extends the /fast command to support Anthropic's Fast Mode beta in addition to OpenAI Priority Processing. When enabled on Claude Opus 4.6, adds speed:"fast" and the fast-mode-2026-02-01 beta header to API requests for ~2.5x faster output token throughput. Changes: - hermes_cli/models.py: Add _ANTHROPIC_FAST_MODE_MODELS registry, model_supports_fast_mode() now recognizes Claude Opus 4.6, resolve_fast_mode_overrides() returns {speed: fast} for Anthropic vs {service_tier: priority} for OpenAI - agent/anthropic_adapter.py: Add _FAST_MODE_BETA constant, build_anthropic_kwargs() accepts fast_mode=True which injects speed:fast + beta header via extra_headers (skipped for third-party Anthropic-compatible endpoints like MiniMax) - run_agent.py: Pass fast_mode to build_anthropic_kwargs in the anthropic_messages path of _build_api_kwargs() - cli.py: Update _handle_fast_command with provider-aware messaging (shows 'Anthropic Fast Mode' vs 'Priority Processing') - hermes_cli/commands.py: Update /fast description to mention both providers - tests: 13 new tests covering Anthropic model detection, override resolution, CLI availability, routing, adapter kwargs, and third-party endpoint safety	2026-04-10 02:32:15 -07:00
Teknium	871313ae2d	fix: clear conversation_history after mid-loop compression to prevent empty sessions (#7001 ) After mid-loop compression (triggered by 413, context_overflow, or Anthropic long-context tier errors), _compress_context() creates a new session in SQLite and resets _last_flushed_db_idx=0. However, conversation_history was not cleared, so _flush_messages_to_session_db() computed: flush_from = max(len(conversation_history=200), _last_flushed_db_idx=0) = 200 messages[200:] → empty (compressed messages < 200) This resulted in zero messages being written to the new session's SQLite store. On resume, the user would see 'Session found but has no messages.' The preflight compression path (line 7311) already had the fix: conversation_history = None This commit adds the same clearing to the three mid-loop compression sites: - Anthropic long-context tier overflow - HTTP 413 payload too large - Generic context_overflow error Reported by Aaryan (Nous community).	2026-04-10 00:14:59 -07:00
Teknium	f783986f5a	fix: increase stream read timeout default to 120s, auto-raise for local LLMs (#6967 ) Raise the default httpx stream read timeout from 60s to 120s for all providers. Additionally, auto-detect local LLM endpoints (Ollama, llama.cpp, vLLM) and raise the read timeout to HERMES_API_TIMEOUT (1800s) since local models can take minutes for prefill on large contexts before producing the first token. The stale stream timeout already had this local auto-detection pattern; the httpx read timeout was missing it — causing a hard 60s wall that users couldn't find (HERMES_STREAM_READ_TIMEOUT was undocumented). Changes: - Default HERMES_STREAM_READ_TIMEOUT: 60s -> 120s - Auto-detect local endpoints -> raise to 1800s (user override respected) - Document HERMES_STREAM_READ_TIMEOUT and HERMES_STREAM_STALE_TIMEOUT - Add 10 parametrized tests Reported-by: Pavan Srinivas (@pavanandums)	2026-04-09 22:35:30 -07:00
emozilla	bda9aa17cb	fix(streaming): prevent <think> in prose from suppressing response output When the model mentions <think> as literal text in its response (e.g. "(/think not producing <think> tags)"), the streaming display treated it as a reasoning block opener and suppressed everything after it. The response box would close with truncated content and no error — the API response was complete but the display ate it. Root cause: _stream_delta() matched <think> anywhere in the text stream regardless of position. Real reasoning blocks always start at the beginning of a line; mentions in prose appear mid-sentence. Fix: track line position across streaming deltas with a _stream_last_was_newline flag. Only enter reasoning suppression when the tag appears at a block boundary (start of stream, after a newline, or after only whitespace on the current line). Add a _flush_stream() safety net that recovers buffered content if no closing tag is found by end-of-stream. Also fixes three related issues discovered during investigation: - anthropic_adapter: _get_anthropic_max_output() now normalizes dots to hyphens so 'claude-opus-4.6' matches the 'claude-opus-4-6' table key (was returning 32K instead of 128K) - run_agent: send explicit max_tokens for Claude models on Nous Portal, same as OpenRouter — both proxy to Anthropic's API which requires it. Without it the backend defaults to a low limit that truncates responses. - run_agent: reset truncated_tool_call_retries after successful tool execution so a single truncation doesn't poison the entire conversation.	2026-04-09 22:16:36 -07:00
Teknium	8394b5ddd2	feat: expand /fast to all OpenAI Priority Processing models (#6960 ) Previously /fast only supported gpt-5.4 and forced a provider switch to openai-codex. Now supports all 13 models from OpenAI's Priority Processing pricing table (gpt-5.4, gpt-5.4-mini, gpt-5.2, gpt-5.1, gpt-5, gpt-5-mini, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-4o, gpt-4o-mini, o3, o4-mini). Key changes: - Replaced _FAST_MODE_BACKEND_CONFIG with _PRIORITY_PROCESSING_MODELS frozenset - Removed provider-forcing logic — service_tier is now injected into whatever API path the user is already on (Codex Responses, Chat Completions, or OpenRouter passthrough) - Added request_overrides support to chat_completions path in run_agent.py - Updated messaging from 'Codex inference tier' to 'Priority Processing' - Expanded test coverage for all supported models	2026-04-09 22:06:30 -07:00
g-guthrie	d416a69288	feat: add Codex fast mode toggle (/fast command) Add /fast slash command to toggle OpenAI Codex service_tier between normal and priority ('fast') inference. Only exposed for models registered in _FAST_MODE_BACKEND_CONFIG (currently gpt-5.4). - Registry-based backend config for extensibility - Dynamic command visibility (hidden from help/autocomplete for non-supported models) via command_filter on SlashCommandCompleter - service_tier flows through request_overrides from route resolution - Omit max_output_tokens for Codex backend (rejects it) - Persists to config.yaml under agent.service_tier Salvage cleanup: removed simple_term_menu/input() menu (banned), bare /fast now shows status like /reasoning. Removed redundant override resolution in _build_api_kwargs — single source of truth via request_overrides from route. Co-authored-by: Hermes Agent <hermes@nousresearch.com>	2026-04-09 21:54:32 -07:00
Teknium	b87d00288d	fix: add actionable hint for OpenRouter 'no tool endpoints' error When OpenRouter returns 'No endpoints found that support tool use' (HTTP 404), display a hint explaining that provider routing restrictions may be filtering out tool-capable providers. Links the user directly to the model's OpenRouter page to check which providers support tools. The hint fires in the error display block that runs regardless of whether fallback succeeds — so the user always understands WHY the model failed, not just that it fell back. Reported via Discord: GLM-5.1 on OpenRouter with US-based provider restrictions eliminated all 4 tool-supporting endpoints (DeepInfra, Z.AI, Friendli, Venice), leaving only 7 non-tool providers.	2026-04-09 18:03:09 -07:00
AIandI0x1	2d0d05a337	fix(agent): detect truncated streaming tool calls before execution When a streaming response is cut mid-tool-call (connection drop, timeout), the accumulated function.arguments is invalid JSON. The mock response builder defaulted finish_reason to 'stop', so the agent loop treated it as a valid completed turn and tried to execute tools with broken args. Fix: validate tool call arguments with json.loads() during mock response reconstruction. If any are invalid JSON, override finish_reason to 'length'. In the main loop's length handler, if tool calls are present, refuse to execute and return partial=True with a clear error instead of silently failing or wasting retries. Also fixes _thinking_exhausted to not short-circuit when tool calls are present — truncated tool calls are not thinking exhaustion. Original cherry-picked from PR #6776 by AIandI0x1. Closes #6638.	2026-04-09 17:03:54 -07:00
adybag14-cyber	a3aed1bd26	fix(termux): keep quiet chat output parseable	2026-04-09 16:24:53 -07:00
adybag14-cyber	4970705ed3	fix(termux): silence quiet chat tool previews	2026-04-09 16:24:53 -07:00
KUSH42	34d06a9802	fix(compaction): don't halve context_length on output-cap-too-large errors When the API returns "max_tokens too large given prompt" (input tokens are within the context window, but input + requested output > window), the old code incorrectly routed through the same handler as "prompt too long" errors, calling get_next_probe_tier() and permanently halving context_length. This made things worse: the window was fine, only the requested output size needed trimming for that one call. Two distinct error classes now handled separately: Prompt too long — input itself exceeds context window. Fix: compress history + halve context_length (existing behaviour, unchanged). Output cap too large — input OK, but input + max_tokens > window. Fix: parse available_tokens from the error message, set a one-shot _ephemeral_max_output_tokens override for the retry, and leave context_length completely untouched. Changes: - agent/model_metadata.py: add parse_available_output_tokens_from_error() that detects Anthropic's "available_tokens: N" error format and returns the available output budget, or None for all other error types. - run_agent.py: call the new parser first in the is_context_length_error block; if it fires, set _ephemeral_max_output_tokens (with a 64-token safety margin) and break to retry without touching context_length. _build_api_kwargs consumes the ephemeral value exactly once then clears it so subsequent calls use self.max_tokens normally. - agent/anthropic_adapter.py: expand build_anthropic_kwargs docstring to clearly document the max_tokens (output cap) vs context_length (total window) distinction, which is a persistent source of confusion due to the OpenAI-inherited "max_tokens" name. - cli-config.yaml.example: add inline comments explaining both keys side by side where users are most likely to look. - website/docs/integrations/providers.md: add a callout box at the top of "Context Length Detection" and clarify the troubleshooting entry. - tests/test_ctx_halving_fix.py: 24 tests across four classes covering the parser, build_anthropic_kwargs clamping, ephemeral one-shot consumption, and the invariant that context_length is never mutated on output-cap errors.	2026-04-09 11:27:41 -07:00
Yang Zhi	019c11d07e	fix(fallback): preserve provider-specific headers when activating fallback When _try_activate_fallback() swaps to a new provider (e.g. kimi-coding), resolve_provider_client() correctly injects provider-specific default_headers (like KimiCLI User-Agent) into the returned OpenAI client. However, _client_kwargs was saved with only api_key and base_url, dropping those headers. Every subsequent API call rebuilds the client from _client_kwargs via _create_request_openai_client(), producing a bare OpenAI client without the required headers. Kimi Coding rejects this with 403; Copilot would lose its auth headers similarly. This patch reads _custom_headers from the fallback client (where the OpenAI SDK stores the default_headers kwarg) and includes them in _client_kwargs so any client rebuild preserves provider-specific headers. Fixes #6075	2026-04-09 11:11:25 -07:00
Teknium	268ee6bdce	fix: add turn-exit diagnostic logging to agent loop (#6549 ) Every turn now logs WHY the agent loop ended to agent.log with a structured INFO line capturing: exit reason, model, api_calls/max, budget usage, tool turn count, last message role, response length, and session ID. When the last message is a tool result and the turn was NOT interrupted, emits WARNING level (visible in errors.log) — this is the 'just stops' scenario users report where a tool call completes but no continuation or final response follows. 10 tracked exit reasons: text_response, interrupted_by_user, interrupted_during_api_call, budget_exhausted, max_iterations_reached, all_retries_exhausted_no_response, fallback_prior_turn_content, empty_response_exhausted, error_near_max_iterations, unknown.	2026-04-09 04:15:22 -07:00
Teknium	1a3ae6ac6e	feat: structured API error classification for smart failover (#6514 ) Add agent/error_classifier.py with a priority-ordered classification pipeline that replaces scattered inline string-matching in the retry loop with structured error taxonomy and recovery hints. FailoverReason enum (14 categories): auth, auth_permanent, billing, rate_limit, overloaded, server_error, timeout, context_overflow, payload_too_large, model_not_found, format_error, thinking_signature, long_context_tier, unknown. ClassifiedError dataclass carries reason + recovery action hints (retryable, should_compress, should_rotate_credential, should_fallback). Key improvements over inline matching: - 402 disambiguation: 'insufficient credits' = billing (immediate rotate), 'usage limit, try again' = rate_limit (backoff first) - OpenRouter 403 'key limit exceeded' correctly classified as billing - Error cause chain walking (walks __cause__/__context__ up to 5 levels) - Body message included in pattern matching (SDK str() misses it) - Server disconnect + large session check ordered before generic transport catch so RemoteProtocolError triggers compression when appropriate - Chinese error message support for context overflow run_agent.py: replaced 6 inline detection blocks with classifier calls, net -55 lines. All recovery actions (pool rotation, fallback activation, compression, transport recovery) unchanged. 65 new unit tests + 10 E2E tests + live tests with real SDK error objects. Inspired by OpenClaw's failover error classification system.	2026-04-09 04:10:11 -07:00
Teknium	8dfc96dbbb	feat: capture provider rate limit headers and show in /usage (#6541 ) Parse x-ratelimit-* headers from inference API responses (Nous Portal, OpenRouter, OpenAI-compatible) and display them in the /usage command. - New agent/rate_limit_tracker.py: parse 12 rate limit headers (RPM/RPH/ TPM/TPH limits, remaining, reset timers), format as progress bars (CLI) or compact one-liner (gateway) - Hook into streaming path in run_agent.py: stream.response.headers is available on the OpenAI SDK Stream object before chunks are consumed - CLI /usage: appends rate limit section with progress bars + warnings when any bucket exceeds 80% - Gateway /usage: appends compact rate limit summary - 24 unit tests covering parsing, formatting, edge cases Headers captured per response: x-ratelimit-{limit,remaining,reset}-{requests,tokens}{,-1h} Example CLI display: Nous Rate Limits (captured just now): Requests/min [░░░░░░░░░░░░░░░░░░░░] 0.1% 1/800 used (799 left, resets in 59s) Tokens/hr [░░░░░░░░░░░░░░░░░░░░] 0.0% 49/336.0M (336.0M left, resets in 52m)	2026-04-09 03:43:14 -07:00
Teknium	1eabbe905e	fix: retry 3 times when model returns truly empty response (#6488 ) When a model returns no content, no structured reasoning, and no tool calls (common with open models), the agent now silently retries up to 3 times before falling through to (empty). Silent retry (no synthetic messages) keeps the conversation history clean, preserves prompt caching, and respects the no-synthetic-user- injection invariant. Most empty responses from open models are transient (provider hiccups, rate limits, sampling flukes) so a simple retry is sufficient. This fills the last gap in the empty-response recovery chain: 1. _last_content_with_tools fallback (prior tool turn had content) 2. Thinking-only prefill continuation (#5931 — structured reasoning) 3. Empty response silent retry (NEW — truly empty, no reasoning) 4. (empty) terminal (last resort after all retries exhausted) Inline <think> blocks are excluded — the model chose to reason, it just produced no visible text. That differs from truly empty. Tests: - Updated test_truly_empty to expect 4 API calls (1 + 3 retries) - Added test_truly_empty_response_succeeds_on_nudge	2026-04-09 02:06:12 -07:00
angelos	e7d3e9d767	fix(terminal): persistent sandbox envs survive between turns `_cleanup_task_resources` was unconditionally calling `cleanup_vm()` at the end of every `run_conversation` (i.e. every user turn), tearing down the docker/daytona/modal sandbox container regardless of its `persistent_filesystem` setting. This contradicted the documented intent of `terminal.lifetime_seconds` (idle reaper) and `container_persistent`, and caused per-turn loss of `/workspace`, `~/.config`, agent CLI auth state, and any other content living inside the sandbox. The unconditional teardown was introduced in `fbd3a2fd` ("prevent leakage of morph instances between tasks", 2025-11-04) to plug a Morph backend leak, two days after `lifetime_seconds` shipped in `faecbddd`. It was later refactored into `_cleanup_task_resources` in `70dd3a16` without changing semantics. Code and docs have disagreed since. Fix: introduce `terminal_tool.is_persistent_env(task_id)` and skip the per-turn `cleanup_vm` when the active env is persistent. The idle reaper (`_cleanup_inactive_envs`) still tears persistent envs down once `terminal.lifetime_seconds` is exceeded. Non-persistent backends (Morph) are unchanged — still torn down per turn, preserving the original leak-prevention intent.	2026-04-08 21:31:57 -07:00
Teknium	54db7cbbe1	fix(agent): tiered context pressure warnings + gateway dedup (#6411 ) Combines the approaches from PR #6309 (duan78) and PR #5963 (KUSH42): Tiered warnings (from #5963): - Replaces boolean _context_pressure_warned with float _context_pressure_warned_at - Fires at 85% (orange) and re-fires at 95% (red/critical) - Adds 'compacting context...' status message before compression Gateway dedup (from #6309): - Class-level dict _context_pressure_last_warned survives across AIAgent instances (gateway creates a new instance per message) - 5-minute cooldown per session prevents warning spam - Higher-tier warnings bypass the cooldown (85% → 95% always fires) - Compression reset clears the dedup entry for the session - Stale entries evicted (older than 2x cooldown) to prevent memory leak Does NOT inject into messages — purely user-facing via _safe_print (CLI) and status_callback (gateway). Zero prompt cache impact. Fixes #6309. Fixes #5963.	2026-04-08 21:31:44 -07:00
SHL0MS	8567031433	fix: improve context compression quality — named constants, tool tracking, degradation warning Three targeted improvements to the compression system: 1. Replace hardcoded truncation limits with named class constants (_CONTENT_MAX=6000, _CONTENT_HEAD=4000, _CONTENT_TAIL=1500, _TOOL_ARGS_MAX=1500, _TOOL_ARGS_HEAD=1200). Previous limits (3000/500) heavily truncated the summarizer's input — a 200-line edit got cut to 3000 chars before the summarizer ever saw it. 2. Add '## Tools & Patterns' section to both compression prompt templates (first-pass and iterative). Preserves working tool invocations, preferred flags, and tool-specific discoveries across compaction boundaries. 3. Warn users on 2nd+ compression: 'Session compressed N times — accuracy may degrade. Consider /new to start fresh.' Ref #499	2026-04-08 20:54:23 -07:00
Teknium	ae4a884e8d	fix(agent): disable stale stream timeout for local providers (#6368 ) Local inference providers (Ollama, oMLX, llama-cpp) can take 300+ seconds for prefill on large contexts. The 180s stale stream detector was killing these connections while the provider was still processing. Uses the existing is_local_endpoint() (proper URL parsing with RFC-1918, localhost, WSL detection) instead of ad-hoc substring matching. The stale timeout is only disabled when the user hasn't explicitly set HERMES_STREAM_STALE_TIMEOUT — explicit user config is always honored. Fixes #5889	2026-04-08 19:53:39 -07:00
konsisumer	42e366f27b	fix(agent): respect config timeout for flush_memories instead of hardcoded 30s The _call_llm() and direct OpenAI fallback paths in flush_memories() both hardcoded timeout=30.0, ignoring the user-configurable value at auxiliary.flush_memories.timeout in config.yaml. Remove the explicit timeout from the auxiliary _call_llm() call so that _get_task_timeout('flush_memories') reads from config. For the direct OpenAI fallback, import and use _get_task_timeout() instead of the hardcoded value. Add two regression tests verifying both code paths respect the config. Fixes #6154	2026-04-08 18:55:33 -07:00

1 2 3 4 5 ...

543 commits