Commit graph

703 commits

Author SHA1 Message Date
ygd58
151654851c fix(agent): prevent false thinking-exhaustion for non-reasoning models
Models that do not use <think> tags (e.g. GLM-4.7 on NVIDIA Build,
minimax) may return content=None or empty string when truncated. The
previous _thinking_exhausted check treated any None/empty content as
thinking-budget exhaustion, causing these models to always show the
'Thinking Budget Exhausted' error instead of attempting continuation.

Fix: gate the exhaustion check on _has_think_tags — only trigger the
exhaustion path when the model actually produced reasoning blocks
(<think>, <thinking>, <reasoning>, <REASONING_SCRATCHPAD>). Models
without think tags now fall through to the normal continuation retry
logic (up to 3 attempts).

Fixes #7729
2026-04-11 13:47:25 -07:00
Tom Qiao
5910412002 fix: detect truncated tool_calls when finish_reason is not length
When API routers rewrite finish_reason from "length" to "tool_calls",
truncated JSON arguments bypassed the length handler and wasted 3
retry attempts in the generic JSON validation loop. Now detects
truncation patterns in tool call arguments regardless of finish_reason.

Fixes #7680

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-11 13:47:25 -07:00
Brooklyn Nicholson
9ccb490cf3 Merge branch 'main' of github.com:NousResearch/hermes-agent into feat/ink-refactor 2026-04-11 15:30:23 -05:00
Teknium
dafe443beb
feat: warn at session start when compression model context is too small (#7894)
Two-phase design so the warning fires before the user's first message
on every platform:

Phase 1 (__init__):
  _check_compression_model_feasibility() runs during agent construction.
  Resolves the auxiliary compression model (same chain as call_llm with
  task='compression'), compares its context length to the main model's
  compression threshold. If too small, emits via _emit_status() (prints
  for CLI) and stores the warning in _compression_warning.

Phase 2 (run_conversation, first call):
  _replay_compression_warning() re-sends the stored warning through
  status_callback — which the gateway wires AFTER construction. The
  warning is then cleared so it only fires once.

This ensures:
- CLI users see the warning immediately at startup (right after the
  context limit line)
- Gateway users (Telegram, Discord, Slack, WhatsApp, Signal, Matrix,
  Mattermost, Home Assistant, DingTalk, etc.) receive it via
  status_callback('lifecycle', ...) on their first message
- logger.warning() always hits agent.log regardless of platform

Also warns when no auxiliary LLM provider is configured at all.
Entire check wrapped in try/except — never blocks startup.

11 tests covering: core warning logic, boundary conditions, exception
safety, two-phase store+replay, gateway callback wiring, and
single-delivery guarantee.
2026-04-11 12:01:30 -07:00
Brooklyn Nicholson
bf6af95ff5 Merge branch 'main' of github.com:NousResearch/hermes-agent into feat/ink-refactor 2026-04-11 13:14:36 -05:00
Teknium
06e1d9cdd4
fix: resolve three high-impact community bugs (#5819, #6893, #3388) (#7881)
Matrix gateway: fix sync loop never dispatching events (#5819)
- _sync_loop() called client.sync() but never called handle_sync()
  to dispatch events to registered callbacks — _on_room_message was
  registered but never fired for new messages
- Store next_batch token from initial sync and pass as since= to
  subsequent incremental syncs (was doing full initial sync every time)
- 17 comments, confirmed by multiple users on matrix.org

Feishu docs: add interactive card configuration for approvals (#6893)
- Error 200340 is a Feishu Developer Console configuration issue,
  not a code bug — users need to enable Interactive Card capability
  and configure Card Request URL
- Added required 3-step setup instructions to feishu.md
- Added troubleshooting entry for error 200340
- 17 comments from Feishu users

Copilot provider drift: detect GPT-5.x Responses API requirement (#3388)
- GPT-5.x models are rejected on /v1/chat/completions by both OpenAI
  and OpenRouter (unsupported_api_for_model error)
- Added _model_requires_responses_api() to detect models needing
  Responses API regardless of provider
- Applied in __init__ (covers OpenRouter primary users) and in
  _try_activate_fallback() (covers Copilot->OpenRouter drift)
- Fixed stale comment claiming gateway creates fresh agents per message
  (it caches them via _agent_cache since the caching was added)
- 7 comments, reported on Copilot+Telegram gateway
2026-04-11 11:12:20 -07:00
Brooklyn Nicholson
b04248f4d5 Merge branch 'main' of github.com:NousResearch/hermes-agent into feat/ink-refactor
# Conflicts:
#	gateway/platforms/base.py
#	gateway/run.py
#	tests/gateway/test_command_bypass_active_session.py
2026-04-11 11:39:47 -05:00
kshitijk4poor
af9caec44f fix(qwen): correct context lengths for qwen3-coder models and send max_tokens to portal
Based on PR #7285 by @kshitijk4poor.

Two bugs affecting Qwen OAuth users:

1. Wrong context window — qwen3-coder-plus showed 128K instead of 1M.
   Added specific entries before the generic qwen catch-all:
   - qwen3-coder-plus: 1,000,000 (corrected from PR's 1,048,576 per
     official Alibaba Cloud docs and OpenRouter)
   - qwen3-coder: 262,144

2. Random stopping — max_tokens was suppressed for Qwen Portal, so the
   server applied its own low default. Reasoning models exhaust that on
   thinking tokens. Now: honor explicit max_tokens, default to 65536
   when unset.

Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
2026-04-11 03:29:31 -07:00
kshitijk4poor
d442f25a2f fix: align MiniMax provider with official API docs
Aligns MiniMax provider with official API documentation. Fixes 6 bugs:
transport mismatch (openai_chat -> anthropic_messages), credential leak
in switch_model(), prompt caching sent to non-Anthropic endpoints,
dot-to-hyphen model name corruption, trajectory compressor URL routing,
and stale doctor health check.

Also corrects context window (204,800), thinking support (manual mode),
max output (131,072), and model catalog (M2 family only on /anthropic).

Source: https://platform.minimax.io/docs/api-reference/text-anthropic-api

Co-authored-by: kshitijk4poor <kshitijk4poor@users.noreply.github.com>
2026-04-11 01:04:41 -07:00
Moris Chao
5b16f31702 feat(plugins): pass sender_id to pre_llm_call hook
The pre_llm_call plugin hook receives session_id, user_message,
conversation_history, is_first_turn, model, and platform — but not
the sender's user_id. This means plugins cannot perform per-user
access control (e.g. restricting knowledge base recall to authorized
users).

The gateway already passes source.user_id as user_id to AIAgent,
which stores it in self._user_id. This change forwards it as
sender_id in the pre_llm_call kwargs so plugins can use it for
ACL decisions.

For CLI sessions where no user_id exists, sender_id defaults to
empty string. Plugins can treat empty sender_id as a trusted local
call (the owner is at the terminal) or deny it depending on their
ACL policy.
2026-04-11 00:43:20 -07:00
Teknium
caf371da18
fix: MiniMax/Alibaba incorrectly detected as Anthropic OAuth, causing mcp_ tool prefix (#7509)
_is_oauth_token() returned True for any key not starting with 'sk-ant-api',
which means MiniMax and Alibaba API keys were falsely treated as Anthropic
OAuth tokens. This triggered the Claude Code compatibility path:
- All tool names prefixed with mcp_ (e.g. mcp_terminal, mcp_web_search)
- System prompt injected with 'You are Claude Code' identity
- 'Hermes Agent' replaced with 'Claude Code' throughout

Fix: Make _is_oauth_token() positively identify Anthropic OAuth tokens by
their key format instead of using a broad catch-all:
- sk-ant-* (but not sk-ant-api-*) -> setup tokens, managed keys
- eyJ* -> JWTs from Anthropic OAuth flow
- Everything else -> False (MiniMax, Alibaba, etc.)

Reported by stefan171.
2026-04-11 00:43:01 -07:00
Teknium
436dfd5ab5 fix: no auto-activation + unified hermes plugins UI with provider categories
- Remove auto-activation: when context.engine is 'compressor' (default),
  plugin-registered engines are NOT used. Users must explicitly set
  context.engine to a plugin name to activate it.

- Add curses_radiolist() to curses_ui.py: single-select radio picker
  with keyboard nav + text fallback, matching curses_checklist pattern.

- Rewrite cmd_toggle() as composite plugins UI:
  Top section: general plugins with checkboxes (existing behavior)
  Bottom section: provider plugin categories (Memory Provider, Context Engine)
  with current selection shown inline. ENTER/SPACE on a category opens
  a radiolist sub-screen for single-select configuration.

- Add provider discovery helpers: _discover_memory_providers(),
  _discover_context_engines(), config read/save for memory.provider
  and context.engine.

- Add tests: radiolist non-TTY fallback, provider config save/load,
  discovery error handling, auto-activation removal verification.
2026-04-10 19:15:50 -07:00
Teknium
3fe6938176 fix: robust context engine interface — config selection, plugin discovery, ABC completeness
Follow-up fixes for the context engine plugin slot (PR #5700):

- Enhance ContextEngine ABC: add threshold_percent, protect_first_n,
  protect_last_n as class attributes; complete update_model() default
  with threshold recalculation; clarify on_session_end() lifecycle docs
- Add ContextCompressor.update_model() override for model/provider/
  base_url/api_key updates
- Replace all direct compressor internal access in run_agent.py with
  ABC interface: switch_model(), fallback restore, context probing
  all use update_model() now; _context_probed guarded with getattr/
  hasattr for plugin engine compatibility
- Create plugins/context_engine/ directory with discovery module
  (mirrors plugins/memory/ pattern) — discover_context_engines(),
  load_context_engine()
- Add context.engine config key to DEFAULT_CONFIG (default: compressor)
- Config-driven engine selection in run_agent.__init__: checks config,
  then plugins/context_engine/<name>/, then general plugin system,
  falls back to built-in ContextCompressor
- Wire on_session_end() in shutdown_memory_provider() at real session
  boundaries (CLI exit, /reset, gateway expiry)
2026-04-10 19:15:50 -07:00
Stephen Schoettler
5d8dd622bc feat: wire context engine tools, session lifecycle, and tool dispatch
- Inject engine tool schemas into agent tool surface after compressor init
- Call on_session_start() with session_id, hermes_home, platform, model
- Dispatch engine tool calls (lcm_grep, etc.) before regular tool handler
- 55/55 tests pass
2026-04-10 19:15:50 -07:00
Stephen Schoettler
92382fb00e feat: wire context engine plugin slot into agent and plugin system
- PluginContext.register_context_engine() lets plugins replace the
  built-in ContextCompressor with a custom ContextEngine implementation
- PluginManager stores the registered engine; only one allowed
- run_agent.py checks for a plugin engine at init before falling back
  to the default ContextCompressor
- reset_session_state() now calls engine.on_session_reset() instead of
  poking internal attributes directly
- ContextCompressor.on_session_reset() handles its own internals
  (_context_probed, _previous_summary, etc.)
- 19 new tests covering ABC contract, defaults, plugin slot registration,
  rejection of duplicates/non-engines, and compressor reset behavior
- All 34 existing compressor tests pass unchanged
2026-04-10 19:15:50 -07:00
Teknium
842e669a13
fix: activate fallback provider on repeated empty responses + user-visible status (#7505)
When models return empty responses (no content, no tool calls, no
reasoning), Hermes previously retried 3 times silently then fell through
to '(empty)' — without ever trying the fallback provider chain. Users on
GLM-4.5-Air and similar models experienced what appeared to be a
complete hang, especially in gateway (Telegram/Discord) contexts where
the silent retries produced zero feedback.

Changes:
- After exhausting 3 empty retries, attempt _try_activate_fallback()
  before giving up with '(empty)'. If fallback succeeds, reset retry
  counter and continue the conversation loop with the new provider.
- Replace all _vprint() calls in recovery paths with _emit_status(),
  which surfaces messages through both CLI (_vprint with force=True)
  and gateway (status_callback -> adapter.send). Users now see:
  * '⚠️ Empty response from model — retrying (N/3)' during retries
  * '⚠️ Model returning empty responses — switching to fallback...'
  * '↻ Switched to fallback: <model> (<provider>)' on success
  * ' Model returned no content after all retries [and fallback]'
- Add logger.warning() throughout empty response paths for log file
  visibility (model name, provider, retry counts).
- Upgrade _last_content_with_tools fallback from logger.debug to
  logger.info + _emit_status so recovery is visible.
- Upgrade thinking-only prefill continuation to use _emit_status.

Tests:
- test_empty_response_triggers_fallback_provider: verifies fallback
  activation after 3 empty retries produces content from fallback model
- test_empty_response_fallback_also_empty_returns_empty: verifies
  graceful degradation when fallback also returns empty
- test_empty_response_emits_status_for_gateway: verifies _emit_status
  is called during retries so gateway users see feedback

Addresses #7180.
2026-04-10 19:15:41 -07:00
pefontana
5b42aecfa7 feat(agent): add AIAgent.close() for subprocess cleanup
Add a close() method to AIAgent that acts as a single entry point for
releasing all resources held by an agent instance. This prevents zombie
process accumulation on long-running gateway deployments by explicitly
cleaning up:

- Background processes tracked in ProcessRegistry
- Terminal sandbox environments
- Browser daemon sessions
- Active child agents (subagent delegation)
- OpenAI/httpx client connections

Each cleanup step is independently guarded so a failure in one does not
prevent the rest. The method is idempotent and safe to call multiple
times.

Also simplifies the background review cleanup to use close() instead
of manually closing the OpenAI client.

Ref: #7131
2026-04-10 16:51:44 -07:00
Teknium
ea81aa2eec
fix: guard api_kwargs in except handler to prevent UnboundLocalError (#7376)
When _build_api_kwargs() throws an exception, the except handler in
the retry loop referenced api_kwargs before it was assigned. This
caused an UnboundLocalError that masked the real error, making
debugging impossible for the user.

Two _dump_api_request_debug() calls in the except block (non-retryable
client error path and max-retries-exhausted path) both accessed
api_kwargs without checking if it was assigned.

Fix: initialize api_kwargs = None before the retry loop and guard both
dump calls. Now the real error surfaces instead of the masking
UnboundLocalError.

Reported by Discord user gruman0.
2026-04-10 15:12:00 -07:00
angelos
7ccdb74364 fix(delegate): make max_concurrent_children configurable + error on excess
`delegate_task` silently truncated batch tasks to 3 — the model sends
5 tasks, gets results for 3, never told 2 were dropped. Now returns a
clear tool_error explaining the limit and how to fix it.

The limit is configurable via:
  - delegation.max_concurrent_children in config.yaml (priority 1)
  - DELEGATION_MAX_CONCURRENT_CHILDREN env var (priority 2)
  - default: 3

Uses the same _load_config() path as the rest of delegate_task for
consistent config priority. Clamps to min 1, warns on non-integer
config values.

Also removes the hardcoded maxItems: 3 from the JSON schema — the
schema was blocking the model from even attempting >3 tasks before
the runtime check could fire. The runtime check gives a much more
actionable error message.

Backwards compatible: default remains 3, existing configs unchanged.
2026-04-10 13:38:14 -07:00
Tranquil-Flow
6c115440fd fix(delegate): sync self.base_url with client_kwargs after credential resolution
When delegation.base_url routes subagents to a different endpoint, the
correct URL was passed through _resolve_delegation_credentials() and
_build_child_agent() into AIAgent.__init__(), but self.base_url could
fall out of sync with client_kwargs["base_url"] — the value the OpenAI
client actually uses.

This caused billing_base_url in session records to show the parent's
endpoint while actual API calls went to the correct delegation target.

Keep self.base_url in sync with client_kwargs after the credential
resolution block, matching the existing pattern for self.api_key.

Fixes #6825
2026-04-10 13:38:14 -07:00
0xbyt4
0bea603510
fix: handle NoneType request_overrides in fast_mode check (#7350) 2026-04-10 13:07:25 -07:00
Hermes Audit
2c99b4e79b fix(unicode): sanitize surrogate metadata and allow two-pass retry 2026-04-10 13:05:01 -07:00
Hermes Audit
71036a7a75 fix: handle UnicodeEncodeError with ASCII codec (#6843)
Broaden the UnicodeEncodeError recovery to handle systems with ASCII-only
locale (LANG=C, Chromebooks) where ANY non-ASCII character causes encoding
failure, not just lone surrogates.

Changes:
- Add _strip_non_ascii() and _sanitize_messages_non_ascii() helpers that
  strip all non-ASCII characters from message content, name, and tool_calls
- Update the UnicodeEncodeError handler to detect ASCII codec errors and
  fall back to non-ASCII sanitization after surrogate check fails
- Sanitize tool_calls arguments and name fields (not just content)
- Fix bare .encode() in cli.py suspend handler to use explicit utf-8
- Add comprehensive test suite (17 tests)
2026-04-10 13:05:01 -07:00
Kenny Xie
916fbf362c fix(model): tighten direct-provider fallback normalization 2026-04-10 05:52:45 -07:00
Kenny Xie
b730c2955a fix(model): normalize direct provider ids in auxiliary routing 2026-04-10 05:52:45 -07:00
Kenny Xie
fd5cc6e1b4 fix(model): normalize native provider-prefixed model ids 2026-04-10 05:52:45 -07:00
Ronald Reis
fd3e855d58 fix: pass config_context_length to switch_model context compressor
When switching models at runtime, the config_context_length override
was not being passed to the new context compressor instance. This
meant the user-specified context length from config.yaml was lost
after a model switch.

- Store _config_context_length on AIAgent instance during __init__
- Pass _config_context_length when creating new ContextCompressor in switch_model
- Add test to verify config_context_length is preserved across model switches

Fixes: quando estamos alterando o modelo não está alterando o tamanho do contexto
2026-04-10 05:52:45 -07:00
alt-glitch
96c060018a fix: remove 115 verified dead code symbols across 46 production files
Automated dead code audit using vulture + coverage.py + ast-grep intersection,
confirmed by Opus deep verification pass. Every symbol verified to have zero
production callers (test imports excluded from reachability analysis).

Removes ~1,534 lines of dead production code across 46 files and ~1,382 lines
of stale test code. 3 entire files deleted (agent/builtin_memory_provider.py,
hermes_cli/checklist.py, tests/hermes_cli/test_setup_model_selection.py).

Co-authored-by: alt-glitch <balyan.sid@gmail.com>
2026-04-10 03:44:43 -07:00
Teknium
68528068ec
fix(streaming): update stale-stream timer during Anthropic native streaming (#7117)
The _call_anthropic() streaming path never updated last_chunk_time during
the event loop — only once at stream start. The stale stream detector in
the outer poll loop uses this timer, so any Anthropic stream longer than
180s was killed even when events were actively arriving. This self-inflicted
a RemoteProtocolError that users saw as:

  '⚠️ Connection to provider dropped (RemoteProtocolError). Reconnecting…'

The _call_chat_completions() path already updates last_chunk_time on every
chunk (line 4475). This brings _call_anthropic() to parity.

Also adds deltas_were_sent tracking to the Anthropic text_delta path so
the retry loop knows not to retry after partial delivery (prevents
duplicated output on connection drops mid-stream).

Reported-by: Discord users (Castellani, Codename_11)
2026-04-10 03:34:56 -07:00
helix4u
5a8b5f149d fix(run-agent): rotate credential pool on billing-classified 400s 2026-04-10 03:27:19 -07:00
helix4u
9aedab00f4 fix(run_agent): recover primary client on openai transport errors 2026-04-10 03:21:24 -07:00
kshitijk4poor
9431f82aff fix: update Kimi Coding User-Agent to KimiCLI/1.30.0
The hardcoded User-Agent 'KimiCLI/1.3' is outdated — Kimi CLI is now at
v1.30.0. The stale version string causes intermittent 403 errors from
Kimi's coding endpoint ('only available for Coding Agents').

Update all 8 occurrences across run_agent.py, auxiliary_client.py, and
doctor.py to 'KimiCLI/1.30.0' to match the current official Kimi CLI.
2026-04-10 02:37:28 -07:00
Teknium
8779a268a7
feat: add Anthropic Fast Mode support to /fast command (#7037)
Extends the /fast command to support Anthropic's Fast Mode beta in addition
to OpenAI Priority Processing. When enabled on Claude Opus 4.6, adds
speed:"fast" and the fast-mode-2026-02-01 beta header to API requests for
~2.5x faster output token throughput.

Changes:
- hermes_cli/models.py: Add _ANTHROPIC_FAST_MODE_MODELS registry,
  model_supports_fast_mode() now recognizes Claude Opus 4.6,
  resolve_fast_mode_overrides() returns {speed: fast} for Anthropic
  vs {service_tier: priority} for OpenAI
- agent/anthropic_adapter.py: Add _FAST_MODE_BETA constant,
  build_anthropic_kwargs() accepts fast_mode=True which injects
  speed:fast + beta header via extra_headers (skipped for third-party
  Anthropic-compatible endpoints like MiniMax)
- run_agent.py: Pass fast_mode to build_anthropic_kwargs in the
  anthropic_messages path of _build_api_kwargs()
- cli.py: Update _handle_fast_command with provider-aware messaging
  (shows 'Anthropic Fast Mode' vs 'Priority Processing')
- hermes_cli/commands.py: Update /fast description to mention both
  providers
- tests: 13 new tests covering Anthropic model detection, override
  resolution, CLI availability, routing, adapter kwargs, and
  third-party endpoint safety
2026-04-10 02:32:15 -07:00
Teknium
871313ae2d
fix: clear conversation_history after mid-loop compression to prevent empty sessions (#7001)
After mid-loop compression (triggered by 413, context_overflow, or Anthropic
long-context tier errors), _compress_context() creates a new session in SQLite
and resets _last_flushed_db_idx=0. However, conversation_history was not cleared,
so _flush_messages_to_session_db() computed:

    flush_from = max(len(conversation_history=200), _last_flushed_db_idx=0) = 200
    messages[200:] → empty (compressed messages < 200)

This resulted in zero messages being written to the new session's SQLite store.
On resume, the user would see 'Session found but has no messages.'

The preflight compression path (line 7311) already had the fix:
    conversation_history = None

This commit adds the same clearing to the three mid-loop compression sites:
- Anthropic long-context tier overflow
- HTTP 413 payload too large
- Generic context_overflow error

Reported by Aaryan (Nous community).
2026-04-10 00:14:59 -07:00
Teknium
f783986f5a
fix: increase stream read timeout default to 120s, auto-raise for local LLMs (#6967)
Raise the default httpx stream read timeout from 60s to 120s for all
providers. Additionally, auto-detect local LLM endpoints (Ollama,
llama.cpp, vLLM) and raise the read timeout to HERMES_API_TIMEOUT
(1800s) since local models can take minutes for prefill on large
contexts before producing the first token.

The stale stream timeout already had this local auto-detection pattern;
the httpx read timeout was missing it — causing a hard 60s wall that
users couldn't find (HERMES_STREAM_READ_TIMEOUT was undocumented).

Changes:
- Default HERMES_STREAM_READ_TIMEOUT: 60s -> 120s
- Auto-detect local endpoints -> raise to 1800s (user override respected)
- Document HERMES_STREAM_READ_TIMEOUT and HERMES_STREAM_STALE_TIMEOUT
- Add 10 parametrized tests

Reported-by: Pavan Srinivas (@pavanandums)
2026-04-09 22:35:30 -07:00
emozilla
bda9aa17cb fix(streaming): prevent <think> in prose from suppressing response output
When the model mentions <think> as literal text in its response (e.g.
"(/think not producing <think> tags)"), the streaming display treated it
as a reasoning block opener and suppressed everything after it. The
response box would close with truncated content and no error — the API
response was complete but the display ate it.

Root cause: _stream_delta() matched <think> anywhere in the text stream
regardless of position. Real reasoning blocks always start at the
beginning of a line; mentions in prose appear mid-sentence.

Fix: track line position across streaming deltas with a
_stream_last_was_newline flag. Only enter reasoning suppression when
the tag appears at a block boundary (start of stream, after a newline,
or after only whitespace on the current line). Add a _flush_stream()
safety net that recovers buffered content if no closing tag is found
by end-of-stream.

Also fixes three related issues discovered during investigation:

- anthropic_adapter: _get_anthropic_max_output() now normalizes dots to
  hyphens so 'claude-opus-4.6' matches the 'claude-opus-4-6' table key
  (was returning 32K instead of 128K)

- run_agent: send explicit max_tokens for Claude models on Nous Portal,
  same as OpenRouter — both proxy to Anthropic's API which requires it.
  Without it the backend defaults to a low limit that truncates responses.

- run_agent: reset truncated_tool_call_retries after successful tool
  execution so a single truncation doesn't poison the entire conversation.
2026-04-09 22:16:36 -07:00
Teknium
8394b5ddd2
feat: expand /fast to all OpenAI Priority Processing models (#6960)
Previously /fast only supported gpt-5.4 and forced a provider switch to
openai-codex. Now supports all 13 models from OpenAI's Priority Processing
pricing table (gpt-5.4, gpt-5.4-mini, gpt-5.2, gpt-5.1, gpt-5, gpt-5-mini,
gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-4o, gpt-4o-mini, o3, o4-mini).

Key changes:
- Replaced _FAST_MODE_BACKEND_CONFIG with _PRIORITY_PROCESSING_MODELS frozenset
- Removed provider-forcing logic — service_tier is now injected into whatever
  API path the user is already on (Codex Responses, Chat Completions, or
  OpenRouter passthrough)
- Added request_overrides support to chat_completions path in run_agent.py
- Updated messaging from 'Codex inference tier' to 'Priority Processing'
- Expanded test coverage for all supported models
2026-04-09 22:06:30 -07:00
g-guthrie
d416a69288 feat: add Codex fast mode toggle (/fast command)
Add /fast slash command to toggle OpenAI Codex service_tier between
normal and priority ('fast') inference. Only exposed for models
registered in _FAST_MODE_BACKEND_CONFIG (currently gpt-5.4).

- Registry-based backend config for extensibility
- Dynamic command visibility (hidden from help/autocomplete for
  non-supported models) via command_filter on SlashCommandCompleter
- service_tier flows through request_overrides from route resolution
- Omit max_output_tokens for Codex backend (rejects it)
- Persists to config.yaml under agent.service_tier

Salvage cleanup: removed simple_term_menu/input() menu (banned),
bare /fast now shows status like /reasoning. Removed redundant
override resolution in _build_api_kwargs — single source of truth
via request_overrides from route.

Co-authored-by: Hermes Agent <hermes@nousresearch.com>
2026-04-09 21:54:32 -07:00
Teknium
b87d00288d
fix: add actionable hint for OpenRouter 'no tool endpoints' error
When OpenRouter returns 'No endpoints found that support tool use'
(HTTP 404), display a hint explaining that provider routing restrictions
may be filtering out tool-capable providers. Links the user directly
to the model's OpenRouter page to check which providers support tools.

The hint fires in the error display block that runs regardless of whether
fallback succeeds — so the user always understands WHY the model failed,
not just that it fell back.

Reported via Discord: GLM-5.1 on OpenRouter with US-based provider
restrictions eliminated all 4 tool-supporting endpoints (DeepInfra,
Z.AI, Friendli, Venice), leaving only 7 non-tool providers.
2026-04-09 18:03:09 -07:00
Brooklyn Nicholson
aa5b697a9d Merge branch 'main' of github.com:NousResearch/hermes-agent into feat/ink-refactor 2026-04-09 19:12:31 -05:00
Brooklyn Nicholson
aca479c1ae Merge branch 'feat/ink-refactor' of github.com:NousResearch/hermes-agent into feat/ink-refactor 2026-04-09 19:08:52 -05:00
Brooklyn Nicholson
b85ff282bc feat(ui-tui): slash command history/display, CoT fade, live skin switch, fix double reasoning 2026-04-09 19:08:47 -05:00
AIandI0x1
2d0d05a337 fix(agent): detect truncated streaming tool calls before execution
When a streaming response is cut mid-tool-call (connection drop, timeout),
the accumulated function.arguments is invalid JSON. The mock response
builder defaulted finish_reason to 'stop', so the agent loop treated it
as a valid completed turn and tried to execute tools with broken args.

Fix: validate tool call arguments with json.loads() during mock response
reconstruction. If any are invalid JSON, override finish_reason to
'length'. In the main loop's length handler, if tool calls are present,
refuse to execute and return partial=True with a clear error instead of
silently failing or wasting retries.

Also fixes _thinking_exhausted to not short-circuit when tool calls are
present — truncated tool calls are not thinking exhaustion.

Original cherry-picked from PR #6776 by AIandI0x1.
Closes #6638.
2026-04-09 17:03:54 -07:00
Austin Pickett
f805323517 chore: merge main 2026-04-09 20:00:34 -04:00
adybag14-cyber
a3aed1bd26 fix(termux): keep quiet chat output parseable 2026-04-09 16:24:53 -07:00
adybag14-cyber
4970705ed3 fix(termux): silence quiet chat tool previews 2026-04-09 16:24:53 -07:00
Brooklyn Nicholson
99fd3b518d feat: add /copy and /agents 2026-04-09 17:19:36 -05:00
KUSH42
34d06a9802 fix(compaction): don't halve context_length on output-cap-too-large errors
When the API returns "max_tokens too large given prompt" (input tokens
are within the context window, but input + requested output > window),
the old code incorrectly routed through the same handler as "prompt too
long" errors, calling get_next_probe_tier() and permanently halving
context_length. This made things worse: the window was fine, only the
requested output size needed trimming for that one call.

Two distinct error classes now handled separately:

  Prompt too long  — input itself exceeds context window.
    Fix: compress history + halve context_length (existing behaviour,
    unchanged).

  Output cap too large — input OK, but input + max_tokens > window.
    Fix: parse available_tokens from the error message, set a one-shot
    _ephemeral_max_output_tokens override for the retry, and leave
    context_length completely untouched.

Changes:
- agent/model_metadata.py: add parse_available_output_tokens_from_error()
  that detects Anthropic's "available_tokens: N" error format and returns
  the available output budget, or None for all other error types.
- run_agent.py: call the new parser first in the is_context_length_error
  block; if it fires, set _ephemeral_max_output_tokens (with a 64-token
  safety margin) and break to retry without touching context_length.
  _build_api_kwargs consumes the ephemeral value exactly once then clears
  it so subsequent calls use self.max_tokens normally.
- agent/anthropic_adapter.py: expand build_anthropic_kwargs docstring to
  clearly document the max_tokens (output cap) vs context_length (total
  window) distinction, which is a persistent source of confusion due to
  the OpenAI-inherited "max_tokens" name.
- cli-config.yaml.example: add inline comments explaining both keys side
  by side where users are most likely to look.
- website/docs/integrations/providers.md: add a callout box at the top
  of "Context Length Detection" and clarify the troubleshooting entry.
- tests/test_ctx_halving_fix.py: 24 tests across four classes covering
  the parser, build_anthropic_kwargs clamping, ephemeral one-shot
  consumption, and the invariant that context_length is never mutated
  on output-cap errors.
2026-04-09 11:27:41 -07:00
Yang Zhi
019c11d07e fix(fallback): preserve provider-specific headers when activating fallback
When _try_activate_fallback() swaps to a new provider (e.g.
kimi-coding), resolve_provider_client() correctly injects
provider-specific default_headers (like KimiCLI User-Agent) into the
returned OpenAI client. However, _client_kwargs was saved with only
api_key and base_url, dropping those headers.

Every subsequent API call rebuilds the client from _client_kwargs via
_create_request_openai_client(), producing a bare OpenAI client without
the required headers. Kimi Coding rejects this with 403; Copilot would
lose its auth headers similarly.

This patch reads _custom_headers from the fallback client (where the
OpenAI SDK stores the default_headers kwarg) and includes them in
_client_kwargs so any client rebuild preserves provider-specific headers.

Fixes #6075
2026-04-09 11:11:25 -07:00
Brooklyn Nicholson
b2ea9b4176 Merge branch 'main' of github.com:NousResearch/hermes-agent into feat/ink-refactor 2026-04-09 12:31:20 -05:00