Based on PR #23950 by @nicoechaniz.
- Add "kimi" and "moonshot" to PROVIDER_TO_MODELS_DEV → kimi-for-coding
- Gate OpenRouter metadata step behind "if not effective_provider":
known providers should not be overridden by community-maintained OR data
- Keep the targeted Kimi-family 32k guard as a secondary safety net
inside the OR gate (for unknown providers with Kimi models)
Co-authored-by: nicoechaniz <nicoechaniz@altermundi.net>
Kimi-k2.6 (which supports 262K context) was incorrectly resolved as 32K,
tripping the 64K minimum-context guard and preventing use of the model on
Ollama Cloud and Kimi Coding / Moonshot providers.
Three fixes in the context-length resolution chain:
1. Ollama Cloud native /api/show query: new _query_ollama_api_show()
queries the Ollama native API for authoritative GGUF model_info
context_length. For hosted Ollama, prefers model_info over num_ctx
since users can't set their own num_ctx on Cloud. Added at step 5e
in get_model_context_length(), before the models.dev fallback.
2. models.dev :cloud/-cloud suffix fallback: lookup_models_dev_context()
now also tries appending :cloud and -cloud suffixes when the bare
model name doesn't match. models.dev stores 'kimi-k2.6:cloud' but
users and the live API use bare 'kimi-k2.6'.
3. Kimi-family 32K guard: after the OpenRouter metadata step, reject
exactly 32768 for Kimi-named models (kimi-*, moonshot*) and fall
through to hardcoded defaults ('kimi': 262144). OpenRouter reports
32768 for moonshotai/kimi-k2.6 but the model actually supports 262K.
Narrow filter — only 32768, only Kimi-family — becomes dead code
when OpenRouter updates its metadata.
---
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via `ruff --fix --unsafe-fixes`, 0 remaining.
133 files, +626/-626 (net zero).
xAI's Responses API returns HTTP 400 ("Model X does not support
parameter reasoningEffort") for grok-4, grok-4-0709, grok-4-fast-*,
grok-4-1-fast-*, grok-3, grok-4.20-0309-*, and grok-code-fast-1 — even
though those models reason natively. Hermes was unconditionally sending
`reasoning: {effort: 'medium'}` to xAI for every Grok model, breaking
direct `--provider xai` for the entire grok-4 line.
Add a substring allowlist predicate (verified live against api.x.ai
2026-05-10) covering the only Grok families that accept the effort dial:
grok-3-mini*, grok-4.20-multi-agent*, grok-4.3*. The Responses transport
omits the `reasoning` key entirely for everything else while still
including `reasoning.encrypted_content` so we capture native reasoning
tokens.
Verified end-to-end: `hermes chat -q hi --provider xai --model grok-4-0709`
went from HTTP 400 to a successful reply.
Two follow-ups from self-review:
1. Add gpt-5.3-codex-spark to DEFAULT_CONTEXT_LENGTHS at 128k. The
primary resolution path for Spark goes through provider='openai-codex'
→ _CODEX_OAUTH_CONTEXT_FALLBACK (already correct). But if any future
code path resolves Spark's context with a different provider (custom
proxy, generic fallthrough), the longest-substring-first lookup in
step 8 would match 'gpt-5' and report 400k, which is wrong by ~3x.
Adding the explicit override is a cheap defensive correctness fix
matching how gpt-5.4-mini and gpt-5.4-nano already shadow the generic
gpt-5 entry.
2. Update test_openai_codex_model_validation_fallback.py docstring. The
bug it was originally written for (gpt-5.3-codex-spark missing from
listing) is now resolved by this PR's catalog restoration. The test
still validly exercises the soft-accept code path for any future
entitlement-gated Codex slug that ships before Hermes catalogs it,
but the framing was stale — clarified.
PR #12994 stripped gpt-5.3-codex-spark on the assumption that it was
unsupported. It's actually research-preview, ChatGPT-Pro-only, exposed
via the Codex OAuth backend at chatgpt.com/backend-api/codex/models —
not via the public OpenAI API.
Add explanatory comments in:
- DEFAULT_CODEX_MODELS / _FORWARD_COMPAT_TEMPLATE_MODELS (codex_models.py)
- _CODEX_OAUTH_CONTEXT_FALLBACK (model_metadata.py)
- list_authenticated_providers' live-discovery branch (model_switch.py)
so future maintainers don't strip the entry again. Also documents the
intentional asymmetry that Spark stays out of the "openai" provider
catalog (it isn't on the public API) and why the supported_in_api
filter is *not* applied for the openai-codex route.
Two co-located fixes:
1. agent/model_metadata.py: bump hy3-preview static fallback from
256000 to 262144 (256 * 1024) to match OpenRouter live metadata
so cache and offline both agree (issue #22268).
2. tests/hermes_cli/test_tencent_tokenhub_provider.py: replace the
exact-value change-detector (assert ctx == 256000) with an
invariant assertion (registered + >= 4096). Per AGENTS.md
'Don't write change-detector tests': pinning the upstream-controlled
context length is exactly the test class the rule forbids — it
breaks every time the provider bumps the published value, with
zero behavioral coverage gained.
Salvage of #22574 with a redirect on the test approach. The
contributor's diff bumped the integer and added a SECOND
change-detector pinning DEFAULT_CONTEXT_LENGTHS[hy3-preview] == 262144,
which would re-break on the next published bump. We instead delete
the change-detector entirely and assert the relationship.
Closes#22268.
Closes the last Python-on-Windows UTF-8 exposure by making every
text-mode open() call explicit about its encoding.
Before: on Windows, bare open(path, 'r') defaults to the system
locale encoding (cp1252 on US-locale installs). That means reading
any config/yaml/markdown/json file with non-ASCII content either
crashes with UnicodeDecodeError or silently mis-decodes bytes.
After: all 89 affected call sites in production code now pass
encoding='utf-8' explicitly. Works identically on every platform
and every locale, no surprise behavior.
Mechanical sweep via:
ruff check --preview --extend-select PLW1514 --unsafe-fixes --fix --exclude 'tests,venv,.venv,node_modules,website,optional-skills, skills,tinker-atropos,plugins' .
All 89 fixes have the same shape: open(x) or open(x, mode) became
open(x, encoding='utf-8') or open(x, mode, encoding='utf-8'). Nothing
else changed. Every modified file still parses and the Windows/sandbox
test suite is still green (85 passed, 14 skipped, 0 failed across
tests/tools/test_code_execution_windows_env.py +
tests/tools/test_code_execution_modes.py + tests/tools/test_env_passthrough.py +
tests/test_hermes_bootstrap.py).
Scope notes:
- tests/ excluded: test fixtures can use locale encoding intentionally
(exercising edge cases). If we want to tighten tests later that's
a separate PR.
- plugins/ excluded: plugin-specific conventions may differ; plugin
authors own their code.
- optional-skills/ and skills/ excluded: skill scripts are user-authored
and we don't want to mass-edit them.
- website/ and tinker-atropos/ excluded: vendored / generated content.
46 files touched, 89 +/- lines (symmetric replacement). No behavior
change on POSIX or on Windows when the file is ASCII; bug fix on
Windows when the file contains non-ASCII.
Background macOS desktop control via cua-driver MCP — does NOT steal the
user's cursor or keyboard focus, works with any tool-capable model.
Replaces the Anthropic-native `computer_20251124` approach from the
abandoned #4562 with a generic OpenAI function-calling schema plus SOM
(set-of-mark) captures so Claude, GPT, Gemini, and open models can all
drive the desktop via numbered element indices.
- `tools/computer_use/` package — swappable ComputerUseBackend ABC +
CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
- Universal `computer_use` tool with one schema for all providers.
Actions: capture (som/vision/ax), click, double_click, right_click,
middle_click, drag, scroll, type, key, wait, list_apps, focus_app.
- Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style
`content: [text, image_url]` parts) that flows through
handle_function_call into the tool message. Anthropic adapter converts
into native `tool_result` image blocks; OpenAI-compatible providers
get the parts list directly.
- Image eviction in convert_messages_to_anthropic: only the 3 most
recent screenshots carry real image data; older ones become text
placeholders to cap per-turn token cost.
- Context compressor image pruning: old multimodal tool results have
their image parts stripped instead of being skipped.
- Image-aware token estimation: each image counts as a flat 1500 tokens
instead of its base64 char length (~1MB would have registered as
~250K tokens before).
- COMPUTER_USE_GUIDANCE system-prompt block — injected when the toolset
is active.
- Session DB persistence strips base64 from multimodal tool messages.
- Trajectory saver normalises multimodal messages to text-only.
- `hermes tools` post-setup installs cua-driver via the upstream script
and prints permission-grant instructions.
- CLI approval callback wired so destructive computer_use actions go
through the same prompt_toolkit approval dialog as terminal commands.
- Hard safety guards at the tool level: blocked type patterns
(curl|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash,
force delete, lock screen, log out).
- Skill `apple/macos-computer-use/SKILL.md` — universal (model-agnostic)
workflow guide.
- Docs: `user-guide/features/computer-use.md` plus reference catalog
entries.
44 new tests in tests/tools/test_computer_use.py covering schema
shape (universal, not Anthropic-native), dispatch routing, safety
guards, multimodal envelope, Anthropic adapter conversion, screenshot
eviction, context compressor pruning, image-aware token estimation,
run_agent helpers, and universality guarantees.
469/469 pass across tests/tools/test_computer_use.py + the affected
agent/ test suites.
- `model_tools.py` provider-gating: the tool is available to every
provider. Providers without multi-part tool message support will see
text-only tool results (graceful degradation via `text_summary`).
- Anthropic server-side `clear_tool_uses_20250919` — deferred;
client-side eviction + compressor pruning cover the same cost ceiling
without a beta header.
- macOS only. cua-driver uses private SkyLight SPIs
(SLEventPostToPid, SLPSPostEventRecordTo,
_AXObserverAddNotificationAndCheckRemote) that can break on any macOS
update. Pin with HERMES_CUA_DRIVER_VERSION.
- Requires Accessibility + Screen Recording permissions — the post-setup
prints the Settings path.
Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic-
native schema). Credit @0xbyt4 for the original #3816 groundwork whose
context/eviction/token design is preserved here in generic form.
Introduces providers/ package — single source of truth for every
inference provider. Adding a simple api-key provider now requires one
providers/<name>.py file with zero edits anywhere else.
What this PR ships:
- providers/ package (ProviderProfile ABC + 33 profiles across 4 api_modes)
- ProviderProfile declarative fields: name, api_mode, aliases, display_name,
env_vars, base_url, models_url, auth_type, fallback_models, hostname,
default_headers, fixed_temperature, default_max_tokens, default_aux_model
- 4 overridable hooks: prepare_messages, build_extra_body,
build_api_kwargs_extras, fetch_models
- chat_completions.build_kwargs: profile path via _build_kwargs_from_profile,
legacy flag path retained for lmstudio/tencent-tokenhub (which have
session-aware reasoning probing that doesn't map cleanly to hooks yet)
- run_agent.py: profile path for all registered providers; legacy path
variable scoping fixed (all flags defined before branching)
- Auto-wires: auth.PROVIDER_REGISTRY, models.CANONICAL_PROVIDERS,
doctor health checks, config.OPTIONAL_ENV_VARS, model_metadata._URL_TO_PROVIDER
- GeminiProfile: thinking_config translation (native + openai-compat nested)
- New tests/providers/ (79 tests covering profile declarations, transport
parity, hook overrides, e2e kwargs assembly)
Deltas vs original PR (salvaged onto current main):
- Added profiles: alibaba-coding-plan, azure-foundry, minimax-oauth
(were added to main since original PR)
- Skipped profiles: lmstudio, tencent-tokenhub stay on legacy path (their
reasoning_effort probing has no clean hook equivalent yet)
- Removed lmstudio alias from custom profile (it's a separate provider now)
- Skipped openrouter/custom from PROVIDER_REGISTRY auto-extension
(resolve_provider special-cases them; adding breaks runtime resolution)
- runtime_provider: profile.api_mode only as fallback when URL detection
finds nothing (was breaking minimax /v1 override)
- Preserved main's legacy-path improvements: deepseek reasoning_content
preserve, gemini Gemma skip, OpenRouter response caching, Anthropic 1M
beta recovery, etc.
- Kept agent/copilot_acp_client.py in place (rejected PR's relocation —
main has 7 fixes landed since; relocation would revert them)
- _API_KEY_PROVIDER_AUX_MODELS alias kept for backward compat with existing
test imports
Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
Closes#14418
When a user sets model.context_length in config.yaml, the value was only
used for Hermes' internal compression decisions (context_compressor) but
NOT for Ollama's num_ctx parameter. Ollama auto-detects context from GGUF
metadata (often 256K+) and allocates that much VRAM regardless of the
user's config — causing OOM on smaller GPUs like the P100 (16GB).
Root cause: two separate context values existed independently:
- context_compressor.context_length = config value (e.g. 65536) ✓
- _ollama_num_ctx = GGUF metadata value (e.g. 256000) ✗ ignored config
Changes:
1. Cap Ollama num_ctx to config context_length (run_agent.py)
When model.context_length is explicitly set and no explicit
ollama_num_ctx override exists, cap the auto-detected GGUF value
to the user's context_length. This is the core fix — it prevents
Ollama from allocating more VRAM than the user budgeted.
2. Pass config_context_length through all secondary call sites
Several paths called get_model_context_length() without the config
override, falling through to the 256K default fallback:
- cli.py: @-reference expansion and /model switch display
- gateway/run.py: @-reference expansion and /model switch display
- tui_gateway/server.py: @-reference expansion
- hermes_cli/model_switch.py: resolve_display_context_length()
3. Normalize root-level context_length in config (hermes_cli/config.py)
_normalize_root_model_keys() now migrates root-level context_length
into the model section, matching existing behavior for provider and
base_url. Users who wrote `context_length: 65536` at the YAML root
instead of under `model:` had it silently ignored.
4. Fix misleading comments (agent/model_metadata.py)
DEFAULT_FALLBACK_CONTEXT is 256K (CONTEXT_PROBE_TIERS[0]), not 128K
as two comments stated.
Tests: 3 new tests for root-level context_length normalization.
All existing context_length tests pass (96 tests).
Wire MiniMax-M2.7 and MiniMax-M2.7-highspeed into the model catalog,
CLI model picker, and agent auxiliary/metadata subsystems.
Changes:
- hermes_cli/models.py:
- Add 'minimax-oauth' to _PROVIDER_MODELS with MiniMax-M2.7 and
MiniMax-M2.7-highspeed
- Add ProviderEntry('minimax-oauth', 'MiniMax (OAuth)', ...) to
CANONICAL_PROVIDERS near existing minimax entries
- Add aliases: minimax-portal, minimax-global, minimax_oauth in
_PROVIDER_ALIASES
- hermes_cli/main.py:
- Add 'minimax-oauth' to provider_labels dict
- Insert 'minimax-oauth' into providers list in
select_provider_and_model() near the other minimax entries
- Add 'minimax-oauth' to --provider argparse choices
- Add _model_flow_minimax_oauth() function: ensures login via
_login_minimax_oauth(), resolves runtime credentials, prompts for
model selection, saves model choice and config
- Add dispatch elif branch for selected_provider == 'minimax-oauth'
- agent/auxiliary_client.py:
- Add 'minimax-oauth': 'MiniMax-M2.7-highspeed' to
_API_KEY_PROVIDER_AUX_MODELS
- Add 'minimax-oauth' to _ANTHROPIC_COMPAT_PROVIDERS set
- agent/model_metadata.py:
- Add 'minimax-oauth' to _PROVIDER_PREFIXES frozenset
- MiniMax-M2.7 context length (200_000) already covered by the
existing 'minimax' substring match in DEFAULT_CONTEXT_LENGTHS
Registers tencent-tokenhub (https://tokenhub.tencentmaas.com/v1) as a
new API-key provider with model tencent/hy3-preview (256K context).
- PROVIDER_REGISTRY entry + TOKENHUB_API_KEY / TOKENHUB_BASE_URL env vars
- Aliases: tencent, tokenhub, tencent-cloud, tencentmaas
- openai_chat transport with is_tokenhub branch for top-level
reasoning_effort (Hy3 is a reasoning model)
- tencent/hy3-preview:free added to OpenRouter curated list
- 60+ tests (provider registry, aliases, runtime resolution,
credentials, model catalog, URL mapping, context length)
- Docs: integrations/providers.md, environment-variables.md,
model-catalog.json
Author: simonweng <simonweng@tencent.com>
Salvaged from PR #16860 onto current main (resolved conflicts with
#16935 Azure Anthropic env-var hint tests and the --provider choices=
list removal in chat_parser).
Background macOS desktop control via cua-driver MCP — does NOT steal the
user's cursor or keyboard focus, works with any tool-capable model.
Replaces the Anthropic-native `computer_20251124` approach from the
abandoned #4562 with a generic OpenAI function-calling schema plus SOM
(set-of-mark) captures so Claude, GPT, Gemini, and open models can all
drive the desktop via numbered element indices.
- `tools/computer_use/` package — swappable ComputerUseBackend ABC +
CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
- Universal `computer_use` tool with one schema for all providers.
Actions: capture (som/vision/ax), click, double_click, right_click,
middle_click, drag, scroll, type, key, wait, list_apps, focus_app.
- Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style
`content: [text, image_url]` parts) that flows through
handle_function_call into the tool message. Anthropic adapter converts
into native `tool_result` image blocks; OpenAI-compatible providers
get the parts list directly.
- Image eviction in convert_messages_to_anthropic: only the 3 most
recent screenshots carry real image data; older ones become text
placeholders to cap per-turn token cost.
- Context compressor image pruning: old multimodal tool results have
their image parts stripped instead of being skipped.
- Image-aware token estimation: each image counts as a flat 1500 tokens
instead of its base64 char length (~1MB would have registered as
~250K tokens before).
- COMPUTER_USE_GUIDANCE system-prompt block — injected when the toolset
is active.
- Session DB persistence strips base64 from multimodal tool messages.
- Trajectory saver normalises multimodal messages to text-only.
- `hermes tools` post-setup installs cua-driver via the upstream script
and prints permission-grant instructions.
- CLI approval callback wired so destructive computer_use actions go
through the same prompt_toolkit approval dialog as terminal commands.
- Hard safety guards at the tool level: blocked type patterns
(curl|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash,
force delete, lock screen, log out).
- Skill `apple/macos-computer-use/SKILL.md` — universal (model-agnostic)
workflow guide.
- Docs: `user-guide/features/computer-use.md` plus reference catalog
entries.
44 new tests in tests/tools/test_computer_use.py covering schema
shape (universal, not Anthropic-native), dispatch routing, safety
guards, multimodal envelope, Anthropic adapter conversion, screenshot
eviction, context compressor pruning, image-aware token estimation,
run_agent helpers, and universality guarantees.
469/469 pass across tests/tools/test_computer_use.py + the affected
agent/ test suites.
- `model_tools.py` provider-gating: the tool is available to every
provider. Providers without multi-part tool message support will see
text-only tool results (graceful degradation via `text_summary`).
- Anthropic server-side `clear_tool_uses_20250919` — deferred;
client-side eviction + compressor pruning cover the same cost ceiling
without a beta header.
- macOS only. cua-driver uses private SkyLight SPIs
(SLEventPostToPid, SLPSPostEventRecordTo,
_AXObserverAddNotificationAndCheckRemote) that can break on any macOS
update. Pin with HERMES_CUA_DRIVER_VERSION.
- Requires Accessibility + Screen Recording permissions — the post-setup
prints the Settings path.
Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic-
native schema). Credit @0xbyt4 for the original #3816 groundwork whose
context/eviction/token design is preserved here in generic form.
`_apply_model_switch_result` (the interactive `/model` picker's
confirmation path) printed `ModelInfo.context_window` straight from
models.dev, which reports the vendor-wide value (1.05M for gpt-5.5 on
openai). ChatGPT Codex OAuth caps the same slug at 272K, so the picker
showed 1M while the runtime (compressor, gateway `/model`, typed
`/model <name>`) correctly used 272K — the classic 'sometimes 1M,
sometimes 272K' mismatch on a single model.
Both display paths now go through `resolve_display_context_length()`,
matching the fix that `_handle_model_switch` received earlier.
Also bump the stale last-resort fallback in DEFAULT_CONTEXT_LENGTHS
(`gpt-5.5: 400000 -> 1050000`) to match the real OpenAI API value; the
272K Codex cap is already enforced via the Codex-OAuth branch, so the
fallback now reflects what every non-Codex probe-miss should see.
Tests: adds `test_apply_model_switch_result_context.py` with three
scenarios (Codex cap wins, OpenRouter shows 1.05M, resolver-empty falls
back to ModelInfo). Updates the existing non-Codex fallback test to
assert 1.05M (the correct value).
## Validation
| path | before | after |
|-------------------------------|-----------|-----------|
| picker -> gpt-5.5 on Codex | 1,050,000 | 272,000 |
| picker -> gpt-5.5 on OpenAI | 1,050,000 | 1,050,000 |
| picker -> gpt-5.5 on OpenRouter | 1,050,000 | 1,050,000 |
| typed /model gpt-5.5 on Codex | 272,000 | 272,000 |
#14934 added deepseek-v4-pro / deepseek-v4-flash to the DeepSeek native
provider but the context-window lookup still falls back to the existing
"deepseek" substring entry (128K). DeepSeek V4 ships with a 1M context
window, so any caller relying on get_model_context_length() for
pre-flight token budgeting (compression, context warnings) under-counts
by ~8x.
Add explicit lowercase entries for the four DeepSeek model ids that
ship 1M context:
- deepseek-v4-pro
- deepseek-v4-flash
- deepseek-chat (legacy alias, server-side maps to v4-flash non-thinking)
- deepseek-reasoner (legacy alias, server-side maps to v4-flash thinking)
Longest-key-first substring matching means these explicit entries also
cover the vendor-prefixed forms (deepseek/deepseek-v4-pro on OpenRouter
and Nous Portal) without regressing the existing 128K fallback for
older / unknown DeepSeek model ids on custom endpoints.
Source: https://api-docs.deepseek.com/zh-cn/quick_start/pricing
Fixes#15779. Custom-provider per-model context_length (`custom_providers[].models.<id>.context_length`) is now honored across every resolution path, not just agent startup. Also adds 256K as the top probe tier and default fallback.
## What changed
New helper `hermes_cli.config.get_custom_provider_context_length()` — single source of truth for the per-model override lookup, with trailing-slash-insensitive base-url matching.
`agent.model_metadata.get_model_context_length()` gains an optional `custom_providers=` kwarg (step 0b — runs after explicit `config_context_length` but before every other probe).
Wired through five call sites that previously either duplicated the lookup or ignored it entirely:
- `run_agent.py` startup — refactored to use the new helper (dedups legacy inline loop, keeps invalid-value warning)
- `AIAgent.switch_model()` — re-reads custom_providers from live config on every /model switch
- `hermes_cli.model_switch.resolve_display_context_length()` — new `custom_providers=` kwarg
- `gateway/run.py` /model confirmation (picker callback + text path)
- `gateway/run.py` `_format_session_info` (/info)
## Context probe tiers
`CONTEXT_PROBE_TIERS = [256_000, 128_000, 64_000, 32_000, 16_000, 8_000]` — was `[128_000, ...]`. `DEFAULT_FALLBACK_CONTEXT` follows tier[0], so unknown models now default to 256K. The stale `128000` literal in the OpenRouter metadata-miss path is replaced with `DEFAULT_FALLBACK_CONTEXT` for consistency.
## Repro (from #15779)
```yaml
custom_providers:
- name: my-custom-endpoint
base_url: https://example.invalid/v1
model: gpt-5.5
models:
gpt-5.5:
context_length: 1050000
```
`/model gpt-5.5 --provider custom:my-custom-endpoint` → previously "Context: 128,000", now "Context: 1,050,000".
## Tests
- `tests/hermes_cli/test_custom_provider_context_length.py` — new file, 19 tests covering the helper, step-0b integration, and the 256K tier invariants
- `tests/hermes_cli/test_model_switch_context_display.py` — added regression tests for #15779 through the display resolver
- `tests/gateway/test_session_info.py` — updated default-fallback assertion (128K → 256K)
- `tests/agent/test_model_metadata.py` — updated tier assertions for the new top tier
## Problem
`get_model_context_length()` in `agent/model_metadata.py` had a resolution
order bug that caused every Bedrock model to fall back to the 128K default
context length instead of reaching the static Bedrock table (200K for
Claude, etc.).
The root cause: `bedrock-runtime.<region>.amazonaws.com` is not listed in
`_URL_TO_PROVIDER`, so `_is_known_provider_base_url()` returned False.
The resolution order then ran the custom-endpoint probe (step 2) *before*
the Bedrock branch (step 4b), which:
1. Treated Bedrock as a custom endpoint (via `_is_custom_endpoint`).
2. Called `fetch_endpoint_model_metadata()` → `GET /models` on the
bedrock-runtime URL (Bedrock doesn't serve this shape).
3. Fell through to `return DEFAULT_FALLBACK_CONTEXT` (128K) at the
"probe-down" branch — never reaching the Bedrock static table.
Result: users on Bedrock saw 128K context for Claude models that
actually support 200K on Bedrock, causing premature auto-compression.
## Fix
Promote the Bedrock branch from step 4b to step 1b, so it runs *before*
the custom-endpoint probe at step 2. The static table in
`bedrock_adapter.py::get_bedrock_context_length()` is the authoritative
source for Bedrock (the ListFoundationModels API doesn't expose context
window sizes), so there's no reason to probe `/models` first.
The original step 4b is replaced with a one-line breadcrumb comment
pointing to the new location, to make the resolution-order docstring
accurate.
## Changes
- `agent/model_metadata.py`
- Add step 1b: Bedrock static-table branch (unchanged predicate, moved).
- Remove dead step 4b block, replace with breadcrumb comment.
- Update resolution-order docstring to include step 1b.
- `tests/agent/test_model_metadata.py`
- New `TestBedrockContextResolution` class (3 tests):
- `test_bedrock_provider_returns_static_table_before_probe`:
confirms `provider="bedrock"` hits the static table and does NOT
call `fetch_endpoint_model_metadata` (regression guard).
- `test_bedrock_url_without_provider_hint`: confirms the
`bedrock-runtime.*.amazonaws.com` host match works without an
explicit `provider=` hint.
- `test_non_bedrock_url_still_probes`: confirms the probe still
fires for genuinely-custom endpoints (no over-reach).
## Testing
pytest tests/agent/test_model_metadata.py -q
# 83 passed in 1.95s (3 new + 80 existing)
## Risk
Very low.
- Predicate is identical to the original step 4b — no behaviour change
for non-Bedrock paths.
- Original step 4b was dead code for the user-facing case (always hit
the 128K fallback first), so removing it cannot regress behaviour.
- Bedrock path now short-circuits before any network I/O — faster too.
- `ImportError` fall-through preserved so users without `boto3`
installed are unaffected.
## Related
- This is a prerequisite for accurate context-window accounting on
Bedrock — the fix for #14710 (stale-connection client eviction)
depends on correct context sizing to know when to compress.
Signed-off-by: Andre Kurait <andrekurait@gmail.com>
The Copilot provider resolved context windows via models.dev static data,
which does not include account-specific models (e.g. claude-opus-4.6-1m
with 1M context). This adds the live Copilot /models API as a higher-
priority source for copilot/copilot-acp/github-copilot providers.
New helper get_copilot_model_context() in hermes_cli/models.py extracts
capabilities.limits.max_prompt_tokens from the cached catalog. Results
are cached in-process for 1 hour.
In agent/model_metadata.py, step 5a queries the live API before falling
through to models.dev (step 5b). This ensures account-specific models
get correct context windows while standard models still have a fallback.
Part 1 of #7731.
Refs: #7272
PR #14935 added a Codex-aware context resolver but only new lookups
hit the live /models probe. Users who had run Hermes on gpt-5.5 / 5.4
BEFORE that PR already had the wrong value (e.g. 1,050,000 from
models.dev) persisted in ~/.hermes/context_length_cache.yaml, and the
cache-first lookup in get_model_context_length() returns it forever.
Symptom (reported in the wild by Ludwig, min heo, Gaoge on current
main at 6051fba9d, which is AFTER #14935):
* Startup banner shows context usage against 1M
* Compression fires late and then OpenAI hard-rejects with
'context length will be reduced from 1,050,000 to 128,000'
around the real 272k boundary.
Fix: when the step-1 cache returns a value for an openai-codex lookup,
check whether it's >= 400k. Codex OAuth caps every slug at 272k (live
probe values) so anything at or above 400k is definitionally a
pre-#14935 leftover. Drop that entry from the on-disk cache and fall
through to step 5, which runs the live /models probe and repersists
the correct value (or 272k from the hardcoded fallback if the probe
fails). Non-Codex providers and legitimately-cached Codex entries at
272k are untouched.
Changes:
- agent/model_metadata.py:
* _invalidate_cached_context_length() — drop a single entry from
context_length_cache.yaml and rewrite the file.
* Step-1 cache check in get_model_context_length() now gates
provider=='openai-codex' entries >= 400k through invalidation
instead of returning them.
Tests (3 new in TestCodexOAuthContextLength):
- stale 1.05M Codex entry is dropped from disk AND re-resolved
through the live probe to 272k; unrelated cache entries survive.
- fresh 272k Codex entry is respected (no probe call, no invalidation).
- non-Codex 1M entries (e.g. anthropic/claude-opus-4.6 on OpenRouter)
are unaffected — the guard is strictly scoped to openai-codex.
Full tests/agent/test_model_metadata.py: 88 passed.
Follow-up to PR #14533 — applies the same _resolve_requests_verify()
treatment to the one requests.get() site the PR missed (Codex OAuth
chatgpt.com /models probe). Keeps all seven requests.get() callsites
in model_metadata.py consistent so HERMES_CA_BUNDLE / REQUESTS_CA_BUNDLE /
SSL_CERT_FILE are honored everywhere.
Co-authored-by: teknium1 <teknium@hermes-agent>
- hermes_cli/auth.py: add _default_verify() with macOS Homebrew certifi
fallback (mirrors weixin 3a0ec1d93). Extend env var chain to include
REQUESTS_CA_BUNDLE so one env var works across httpx + requests paths.
- agent/model_metadata.py: add _resolve_requests_verify() reading
HERMES_CA_BUNDLE / REQUESTS_CA_BUNDLE / SSL_CERT_FILE in priority
order. Apply explicit verify= to all 6 requests.get callsites.
- Tests: 18 new unit tests + autouse platform pin on existing
TestResolveVerifyFallback to keep its "returns True" assertions
platform-independent.
Empirically verified against self-signed HTTPS server: requests honors
REQUESTS_CA_BUNDLE only; httpx honors SSL_CERT_FILE only. Hermes now
honors all three everywhere.
Triggered by Discord reports — Nous OAuth SSL failure on macOS
Homebrew Python; custom provider self-signed cert ignored despite
REQUESTS_CA_BUNDLE set in env.
On ChatGPT Codex OAuth every gpt-5.x slug actually caps at 272,000 tokens,
but Hermes was resolving gpt-5.5 / gpt-5.4 to 1,050,000 (from models.dev)
because openai-codex aliases to the openai entry there. At 1.05M the
compressor never fires and requests hard-fail with 'context window
exceeded' around the real 272k boundary.
Verified live against chatgpt.com/backend-api/codex/models:
gpt-5.5, gpt-5.4, gpt-5.4-mini, gpt-5.3-codex, gpt-5.2-codex,
gpt-5.2, gpt-5.1-codex-max → context_window = 272000
Changes:
- agent/model_metadata.py:
* _fetch_codex_oauth_context_lengths() — probe the Codex /models
endpoint with the OAuth bearer token and read context_window per
slug (1h in-memory TTL).
* _resolve_codex_oauth_context_length() — prefer the live probe,
fall back to hardcoded _CODEX_OAUTH_CONTEXT_FALLBACK (all 272k).
* Wire into get_model_context_length() when provider=='openai-codex',
running BEFORE the models.dev lookup (which returns 1.05M). Result
persists via save_context_length() so subsequent lookups skip the
probe entirely.
* Fixed the now-wrong comment on the DEFAULT_CONTEXT_LENGTHS gpt-5.5
entry (400k was never right for Codex; it's the catch-all for
providers we can't probe live).
Tests (4 new in TestCodexOAuthContextLength):
- fallback table used when no token is available (no models.dev leakage)
- live probe overrides the fallback
- probe failure (non-200) falls back to hardcoded 272k
- non-codex providers (openrouter, direct openai) unaffected
Non-codex context resolution is unchanged — the Codex branch only fires
when provider=='openai-codex'.
OpenAI launched GPT-5.5 on Codex today (Apr 23 2026). Adds it to the static
catalog and pipes the user's OAuth access token into the openai-codex path of
provider_model_ids() so /model mid-session and the gateway picker hit the
live ChatGPT codex/models endpoint — new models appear for each user
according to what ChatGPT actually lists for their account, without a Hermes
release.
Verified live: 'gpt-5.5' returns priority 0 (featured) from the endpoint,
400k context per OpenAI's launch article. 'hermes chat --provider
openai-codex --model gpt-5.5' completes end-to-end.
Changes:
- hermes_cli/codex_models.py: add gpt-5.5 to DEFAULT_CODEX_MODELS + forward-compat
- agent/model_metadata.py: 400k context length entry
- hermes_cli/models.py: resolve codex OAuth token before calling
get_codex_model_ids() in provider_model_ids('openai-codex')
## Merged
Adds MiMo v2.5-pro and v2.5 support to Xiaomi native provider, OpenCode Go, and setup wizard.
### Changes
- Context lengths: added v2.5-pro (1M) and v2.5 (1M), corrected existing MiMo entries to exact values (262144)
- Provider lists: xiaomi, opencode-go, setup wizard
- Vision: upgraded from mimo-v2-omni to mimo-v2.5 (omnimodal)
- Config description updated for XIAOMI_API_KEY
- Tests updated for new vision model preference
### Verification
- 4322 tests passed, 0 new regressions
- Live API tested on Xiaomi portal: basic, reasoning, tool calling, multi-tool, file ops, system prompt, vision — all pass
- Self-review found and fixed 2 issues (redundant vision check, stale HuggingFace context length)
Zhipu AI (智谱) serves both international users via api.z.ai and
China-based users via open.bigmodel.cn. The domestic endpoint was not
mapped in _URL_TO_PROVIDER, causing Hermes to treat it as an unknown
custom endpoint and fall back to the default 128K context length
instead of resolving the correct 200K+ context via models.dev or the
hardcoded GLM defaults.
This affects users of both the standard API
(https://open.bigmodel.cn/api/paas/v4) and the Coding Plan
(https://open.bigmodel.cn/api/coding/paas/v4).
- Adds 'ctx_size' field to _CONTEXT_LENGTH_KEYS tuple
- Enables hermes agent to correctly detect context size from custom LLMs
running on Lemonade server that use this field name instead of the
standard keys (max_seq_len, n_ctx_train, n_ctx)
Fixes#12976
The generic "gemma": 8192 fallback was incorrectly matching gemma4:31b-cloud
before the more specific Gemma 4 entries could match, causing Hermes to assign
only 8K context instead of 262K. Added "gemma-4" and "gemma4" entries before
the fallback to correctly handle Gemma 4 model naming conventions.
Replace xiaomi/mimo-v2-pro with xiaomi/mimo-v2.5-pro and xiaomi/mimo-v2.5
in the OpenRouter fallback catalog and the nous provider model list.
Add matching DEFAULT_CONTEXT_LENGTHS entries (1M tokens each).
`is_local_endpoint()` leaned on `ipaddress.is_private`, which classifies
RFC-1918 ranges and link-local as private but deliberately excludes the
RFC 6598 CGNAT block (100.64.0.0/10) — the range Tailscale uses for its
mesh IPs. As a result, Ollama reached over Tailscale (e.g.
`http://100.77.243.5:11434`) was treated as remote and missed the
automatic stream-read / stale-stream timeout bumps, so cold model load
plus long prefill would trip the 300 s watchdog before the first token.
Add a module-level `_TAILSCALE_CGNAT = ipaddress.IPv4Network("100.64.0.0/10")`
(built once) and extend `is_local_endpoint()` to match the block both
via the parsed-`IPv4Address` path and the existing bare-string fallback
(for symmetry with the 10/172/192 checks). Also hoist the previously
function-local `import ipaddress` to module scope now that it's used by
the constant.
Extend `TestIsLocalEndpoint` with a CGNAT positive set (lower bound,
representative host, MagicDNS anchor, upper bound) and a near-miss
negative set (just below 100.64.0.0, just above 100.127.255.255, well
outside the block, and first-octet-wrong).
Adds a first-class 'stepfun' API-key provider surfaced as Step Plan:
- Support Step Plan setup for both International and China regions
- Discover Step Plan models live from /step_plan/v1/models, with a
small coding-focused fallback catalog when discovery is unavailable
- Thread StepFun through provider metadata, setup persistence, status
and doctor output, auxiliary routing, and model normalization
- Add tests for provider resolution, model validation, metadata
mapping, and StepFun region/model persistence
Based on #6005 by @hengm3467.
Co-authored-by: hengm3467 <100685635+hengm3467@users.noreply.github.com>
Catalog snapshots, config version literals, and enumeration counts are data
that changes as designed. Tests that assert on those values add no
behavioral coverage — they just break CI on every routine update and cost
engineering time to 'fix.'
Replace with invariants where one exists, delete where none does.
Deleted (pure snapshots):
- TestMinimaxModelCatalog (3 tests): 'MiniMax-M2.7 in models' et al
- TestGeminiModelCatalog: 'gemini-2.5-pro in models', 'gemini-3.x in models'
- test_browser_camofox_state::test_config_version_matches_current_schema
(docstring literally said it would break on unrelated bumps)
Relaxed (keep plumbing check, drop snapshot):
- Xiaomi / Arcee / Kimi moonshot / Kimi coding / HuggingFace static lists:
now assert 'provider exists and has >= 1 entry' instead of specific names
- HuggingFace main/models.py consistency test: drop 'len >= 6' floor
Dynamicized (follow source, not a literal):
- 3x test_config.py migration tests: raw['_config_version'] ==
DEFAULT_CONFIG['_config_version'] instead of hardcoded 21
Fixed stale tests against intentional behavior changes:
- test_insights::test_gateway_format_hides_cost: name matches new behavior
(no dollar figures); remove contradicting '$' in text assertion
- test_config::prefers_api_then_url_then_base_url: flipped per PR #9332;
rename + update to base_url > url > api
- test_anthropic_adapter: relax assert_called_once() (xdist-flaky) to
assert called — contract is 'credential flowed through'
- test_interrupt_propagation: add provider/model/_base_url to bare-agent
fixture so the stale-timeout code path resolves
Fixed stale integration tests against opt-in plugin gate:
- transform_tool_result + transform_terminal_output: write plugins.enabled
allow-list to config.yaml and reset the plugin manager singleton
Source fix (real consistency invariant):
- agent/model_metadata.py: add moonshotai/Kimi-K2.6 context length
(262144, same as K2.5). test_model_metadata_has_context_lengths was
correctly catching the gap.
Policy:
- AGENTS.md Testing section: new subsection 'Don't write change-detector
tests' with do/don't examples. Reviewers should reject catalog-snapshot
assertions in new tests.
Covers every test that failed on the last completed main CI run
(24703345583) except test_modal_sandbox_fixes::test_terminal_tool_present
+ test_terminal_and_file_toolsets_resolve_all_tools, which now pass both
alone and with the full tests/tools/ directory (xdist ordering flake that
resolved itself).
Aslaaen's fix in the original PR covered _detect_api_mode_for_url and the
two openai/xai sites in run_agent.py. This finishes the sweep: the same
substring-match false-positive class (e.g. https://api.openai.com.evil/v1,
https://proxy/api.openai.com/v1, https://api.anthropic.com.example/v1)
existed in eight more call sites, and the hostname helper was duplicated
in two modules.
- utils: add shared base_url_hostname() (single source of truth).
- hermes_cli/runtime_provider, run_agent: drop local duplicates, import
from utils. Reuse the cached AIAgent._base_url_hostname attribute
everywhere it's already populated.
- agent/auxiliary_client: switch codex-wrap auto-detect, max_completion_tokens
gate (auxiliary_max_tokens_param), and custom-endpoint max_tokens kwarg
selection to hostname equality.
- run_agent: native-anthropic check in the Claude-style model branch
and in the AIAgent init provider-auto-detect branch.
- agent/model_metadata: Anthropic /v1/models context-length lookup.
- hermes_cli/providers.determine_api_mode: anthropic / openai URL
heuristics for custom/unknown providers (the /anthropic path-suffix
convention for third-party gateways is preserved).
- tools/delegate_tool: anthropic detection for delegated subagent
runtimes.
- hermes_cli/setup, hermes_cli/tools_config: setup-wizard vision-endpoint
native-OpenAI detection (paired with deduping the repeated check into
a single is_native_openai boolean per branch).
Tests:
- tests/test_base_url_hostname.py covers the helper directly
(path-containing-host, host-suffix, trailing dot, port, case).
- tests/hermes_cli/test_determine_api_mode_hostname.py adds the same
regression class for determine_api_mode, plus a test that the
/anthropic third-party gateway convention still wins.
Also: add asslaenn5@gmail.com → Aslaaen to scripts/release.py AUTHOR_MAP.
Pass the user's configured api_key through local-server detection and
context-length probes (detect_local_server_type, _query_local_context_length,
query_ollama_num_ctx) and use LM Studio's native /api/v1/models endpoint in
fetch_endpoint_model_metadata when a loaded instance is present — so the
probed context length is the actual runtime value the user loaded the model
at, not just the model's theoretical max.
Helps local-LLM users whose auto-detected context length was wrong, causing
compression failures and context-overrun crashes.
Google-side 429 Code Assist errors now flow through Hermes' normal rate-limit
path (status_code on the exception, Retry-After preserved via error.response)
instead of being opaque RuntimeErrors. User sees a one-line capacity message
instead of a 500-char JSON dump.
Changes
- CodeAssistError grows status_code / response / retry_after / details attrs.
_extract_status_code in error_classifier picks up status_code and classifies
429 as FailoverReason.rate_limit, so fallback_providers triggers the same
way it does for SDK errors. run_agent.py line ~10428 already walks
error.response.headers for Retry-After — preserving the response means that
path just works.
- _gemini_http_error parses the Google error envelope (error.status +
error.details[].reason from google.rpc.ErrorInfo, retryDelay from
google.rpc.RetryInfo). MODEL_CAPACITY_EXHAUSTED / RESOURCE_EXHAUSTED / 404
model-not-found each produce a human-readable message; unknown shapes fall
back to the previous raw-body format.
- Drop gemma-4-26b-it from hermes_cli/models.py, hermes_cli/setup.py, and
agent/model_metadata.py — Google returned 404 for it today in local repro.
Kept gemma-4-31b-it (capacity-constrained but not retired).
Validation
| | Before | After |
|---------------------------|--------------------------------|-------------------------------------------|
| Error message | 'Code Assist returned HTTP 429: {500 chars JSON}' | 'Gemini capacity exhausted for gemini-2.5-pro (Google-side throttle...)' |
| status_code on error | None (opaque RuntimeError) | 429 |
| Classifier reason | unknown (string-match fallback) | FailoverReason.rate_limit |
| Retry-After honored | ignored | extracted from RetryInfo or header |
| gemma-4-26b-it picker | advertised (404s on Google) | removed |
Unit + E2E tests cover non-streaming 429, streaming 429, 404 model-not-found,
Retry-After header fallback, malformed body, and classifier integration.
Targeted suites: tests/agent/test_gemini_cloudcode.py (81 tests), full
tests/hermes_cli (2203 tests) green.
Co-authored-by: teknium1 <teknium@nousresearch.com>
Follow-up on the native NVIDIA NIM provider salvage. The original PR wired
PROVIDER_REGISTRY + HERMES_OVERLAYS correctly but missed several touchpoints
required for full parity with other OpenAI-compatible providers (xai,
huggingface, deepseek, zai).
Gaps closed:
- hermes_cli/main.py:
- Add 'nvidia' to the _model_flow_api_key_provider dispatch tuple so
selecting 'NVIDIA NIM' in `hermes model` actually runs the api-key
provider flow (previously fell through silently).
- Add 'nvidia' to `hermes chat --provider` argparse choices so the
documented test command (`hermes chat --provider nvidia --model ...`)
parses successfully.
- hermes_cli/config.py: Register NVIDIA_API_KEY and NVIDIA_BASE_URL in
OPTIONAL_ENV_VARS so setup wizard can prompt for them and they're
auto-added to the subprocess env blocklist.
- hermes_cli/doctor.py: Add NVIDIA NIM row to `_apikey_providers` so
`hermes doctor` probes https://integrate.api.nvidia.com/v1/models.
- hermes_cli/dump.py: Add NVIDIA_API_KEY → 'nvidia' mapping for
`hermes dump` credential masking.
- tests/tools/test_local_env_blocklist.py: Extend registry_vars fixture
with NVIDIA_API_KEY to verify it's blocked from leaking into subprocesses.
- agent/model_metadata.py: Add 'nemotron' → 131072 context-length entry
so all Nemotron variants get 128K context via substring match (rather
than falling back to MINIMUM_CONTEXT_LENGTH).
- hermes_cli/models.py: Fix hallucinated model ID
'nvidia/nemotron-3-nano-8b-a4b' → 'nvidia/nemotron-3-nano-30b-a3b'
(verified against live integrate.api.nvidia.com/v1/models catalog).
Expand curated list from 5 to 9 agentic models mapping to OpenRouter
defaults per provider-guide convention: add qwen3.5-397b-a17b,
deepseek-v3.2, llama-3.3-nemotron-super-49b-v1.5, gpt-oss-120b.
- cli-config.yaml.example: Document 'nvidia' provider option.
- scripts/release.py: Map asurla@nvidia.com → anniesurla in AUTHOR_MAP
for CI attribution.
E2E verified: `hermes chat --provider nvidia ...` now reaches NVIDIA's
endpoint (returns 401 with bogus key instead of argparse error);
`hermes doctor` detects NVIDIA NIM when NVIDIA_API_KEY is set.
Adds NVIDIA NIM as a first-class provider: ProviderConfig in
auth.py, HermesOverlay in providers.py, curated models
(Nemotron plus other open source models hosted on
build.nvidia.com), URL mapping in model_metadata.py, aliases
(nim, nvidia-nim, build-nvidia, nemotron), and env var tests.
Docs updated: providers page, quickstart table, fallback
providers table, and README provider list.