fix(compaction): don't halve context_length on output-cap-too-large errors
When the API returns "max_tokens too large given prompt" (input tokens
are within the context window, but input + requested output > window),
the old code incorrectly routed through the same handler as "prompt too
long" errors, calling get_next_probe_tier() and permanently halving
context_length. This made things worse: the window was fine, only the
requested output size needed trimming for that one call.
Two distinct error classes are now handled separately (sketched below):

1. Prompt too long — the input itself exceeds the context window.
   Fix: compress history + halve context_length (existing behaviour,
   unchanged).

2. Output cap too large — input OK, but input + max_tokens > window.
   Fix: parse available_tokens from the error message, set a one-shot
   _ephemeral_max_output_tokens override for the retry, and leave
   context_length untouched.
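In sketch form, the routing decision looks like this (illustrative only; `route_context_error` is a hypothetical wrapper, but the two imported helpers and the 64-token margin are the real ones from the diffs below):

```python
from agent.model_metadata import (
    get_next_probe_tier,
    parse_available_output_tokens_from_error,
)

def route_context_error(error_msg: str, context_length: int) -> dict:
    """Hypothetical wrapper showing how the two error classes diverge."""
    available_out = parse_available_output_tokens_from_error(error_msg)
    if available_out is not None:
        # Output cap too large: retry once with a smaller output cap
        # (64-token safety margin); the context window stays as-is.
        return {"ephemeral_max_output_tokens": max(1, available_out - 64),
                "context_length": context_length}
    # Prompt too long: shrink the window; history compression runs in the
    # existing recovery path.
    return {"ephemeral_max_output_tokens": None,
            "context_length": get_next_probe_tier(context_length)}
```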
Changes:
- agent/model_metadata.py: add parse_available_output_tokens_from_error()
that detects Anthropic's "available_tokens: N" error format and returns
the available output budget, or None for all other error types.
- run_agent.py: call the new parser first in the is_context_length_error
block; if it fires, set _ephemeral_max_output_tokens (with a 64-token
safety margin) and break to retry without touching context_length.
_build_api_kwargs consumes the ephemeral value exactly once then clears
it so subsequent calls use self.max_tokens normally.
- agent/anthropic_adapter.py: expand build_anthropic_kwargs docstring to
clearly document the max_tokens (output cap) vs context_length (total
window) distinction, which is a persistent source of confusion due to
the OpenAI-inherited "max_tokens" name.
- cli-config.yaml.example: add inline comments explaining both keys side
by side where users are most likely to look.
- website/docs/integrations/providers.md: add a callout box at the top
of "Context Length Detection" and clarify the troubleshooting entry.
- tests/test_ctx_halving_fix.py: 24 tests across four classes covering
the parser, build_anthropic_kwargs clamping, ephemeral one-shot
consumption, and the invariant that context_length is never mutated
on output-cap errors.
This commit is contained in:
parent 2772d99085
commit 34d06a9802

6 changed files with 472 additions and 11 deletions
agent/anthropic_adapter.py

@@ -1238,10 +1238,27 @@ def build_anthropic_kwargs(
 ) -> Dict[str, Any]:
     """Build kwargs for anthropic.messages.create().
 
-    When *max_tokens* is None, the model's native output limit is used
-    (e.g. 128K for Opus 4.6, 64K for Sonnet 4.6). If *context_length*
-    is provided, the effective limit is clamped so it doesn't exceed
-    the context window.
+    Naming note — two distinct concepts, easily confused:
+      max_tokens     = OUTPUT token cap for a single response.
+                       Anthropic's API calls this "max_tokens" but it only
+                       limits the *output*. Anthropic's own native SDK
+                       renamed it "max_output_tokens" for clarity.
+      context_length = TOTAL context window (input tokens + output tokens).
+                       The API enforces: input_tokens + max_tokens ≤ context_length.
+                       Stored on the ContextCompressor; reduced on overflow errors.
+
+    When *max_tokens* is None the model's native output ceiling is used
+    (e.g. 128K for Opus 4.6, 64K for Sonnet 4.6).
+
+    When *context_length* is provided and the model's native output ceiling
+    exceeds it (e.g. a local endpoint with an 8K window), the output cap is
+    clamped to context_length − 1. This only kicks in for unusually small
+    context windows; for full-size models the native output cap is always
+    smaller than the context window so no clamping happens.
+    NOTE: this clamping does not account for prompt size — if the prompt is
+    large, Anthropic may still reject the request. The caller must detect
+    "max_tokens too large given prompt" errors and retry with a smaller cap
+    (see parse_available_output_tokens_from_error + _ephemeral_max_output_tokens).
 
     When *is_oauth* is True, applies Claude Code compatibility transforms:
     system prompt prefix, tool name prefixing, and prompt sanitization.

@@ -1256,10 +1273,14 @@ def build_anthropic_kwargs(
     anthropic_tools = convert_tools_to_anthropic(tools) if tools else []
 
     model = normalize_model_name(model, preserve_dots=preserve_dots)
+    # effective_max_tokens = output cap for this call (≠ total context window)
     effective_max_tokens = max_tokens or _get_anthropic_max_output(model)
 
-    # Clamp to context window if the user set a lower context_length
-    # (e.g. custom endpoint with limited capacity).
+    # Clamp output cap to fit inside the total context window.
+    # Only matters for small custom endpoints where context_length < native
+    # output ceiling. For standard Anthropic models context_length (e.g.
+    # 200K) is always larger than the output ceiling (e.g. 128K), so this
+    # branch is not taken.
     if context_length and effective_max_tokens > context_length:
         effective_max_tokens = max(context_length - 1, 1)
 
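To make the clamp concrete, a quick check mirroring the new tests (assumes the repo root is on sys.path; the expected numbers come from the test file below):

```python
from agent.anthropic_adapter import build_anthropic_kwargs

common = dict(
    model="claude-opus-4-6",
    messages=[{"role": "user", "content": "hi"}],
    tools=None,
    max_tokens=None,
    reasoning_config=None,
)

# 128K native output ceiling fits inside a 200K window: no clamping.
assert build_anthropic_kwargs(context_length=200_000, **common)["max_tokens"] == 128_000

# An 8K local window is smaller than the ceiling: clamp to context_length - 1.
assert build_anthropic_kwargs(context_length=8_000, **common)["max_tokens"] == 7_999
```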
agent/model_metadata.py

@@ -603,6 +603,49 @@ def parse_context_limit_from_error(error_msg: str) -> Optional[int]:
     return None
 
 
+def parse_available_output_tokens_from_error(error_msg: str) -> Optional[int]:
+    """Detect an "output cap too large" error and return how many output tokens are available.
+
+    Background — two distinct context errors exist:
+      1. "Prompt too long" — the INPUT itself exceeds the context window.
+         Fix: compress history and/or halve context_length.
+      2. "max_tokens too large" — input is fine, but input + requested_output > window.
+         Fix: reduce max_tokens (the output cap) for this call.
+         Do NOT touch context_length — the window hasn't shrunk.
+
+    Anthropic's API returns errors like:
+      "max_tokens: 32768 > context_window: 200000 - input_tokens: 190000 = available_tokens: 10000"
+
+    Returns the number of output tokens that would fit (e.g. 10000 above), or None if
+    the error does not look like a max_tokens-too-large error.
+    """
+    error_lower = error_msg.lower()
+
+    # Must look like an output-cap error, not a prompt-length error.
+    is_output_cap_error = (
+        "max_tokens" in error_lower
+        and ("available_tokens" in error_lower or "available tokens" in error_lower)
+    )
+    if not is_output_cap_error:
+        return None
+
+    # Extract the available_tokens figure.
+    # Anthropic format: "… = available_tokens: 10000"
+    patterns = [
+        r'available_tokens[:\s]+(\d+)',
+        r'available\s+tokens[:\s]+(\d+)',
+        # fallback: last number after "=" in expressions like "200000 - 190000 = 10000"
+        r'=\s*(\d+)\s*$',
+    ]
+    for pattern in patterns:
+        match = re.search(pattern, error_lower)
+        if match:
+            tokens = int(match.group(1))
+            if tokens >= 1:
+                return tokens
+    return None
+
+
 def _model_id_matches(candidate_id: str, lookup_model: str) -> bool:
     """Return True if *candidate_id* (from server) matches *lookup_model* (configured).
 
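Behaviour of the parser at a glance (examples lifted from the docstring above and the tests below):

```python
from agent.model_metadata import parse_available_output_tokens_from_error

# Output-cap error: the available output budget is extracted.
err = ("max_tokens: 32768 > context_window: 200000 "
       "- input_tokens: 190000 = available_tokens: 10000")
assert parse_available_output_tokens_from_error(err) == 10000

# Prompt-too-long error: returns None, so the existing halving path runs.
assert parse_available_output_tokens_from_error(
    "prompt is too long: 205000 tokens > 200000 maximum"
) is None
```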
cli-config.yaml.example

@@ -48,6 +48,25 @@ model:
   # api_key: "your-key-here" # Uncomment to set here instead of .env
   base_url: "https://openrouter.ai/api/v1"
 
+  # ── Token limits — two settings, easy to confuse ──────────────────────────
+  #
+  # context_length: TOTAL context window (input + output tokens combined).
+  #   Controls when Hermes compresses history and validates requests.
+  #   Leave unset — Hermes auto-detects the correct value from the provider.
+  #   Set manually only when auto-detection is wrong (e.g. a local server with
+  #   a custom num_ctx, or a proxy that doesn't expose /v1/models).
+  #
+  # context_length: 131072
+  #
+  # max_tokens: OUTPUT cap — maximum tokens the model may generate per response.
+  #   Unrelated to how long your conversation history can be.
+  #   The OpenAI-standard name "max_tokens" is a misnomer; Anthropic's native
+  #   API has since renamed it "max_output_tokens" for clarity.
+  #   Leave unset to use the model's native output ceiling (recommended).
+  #   Set only if you want to deliberately limit individual response length.
+  #
+  # max_tokens: 8192
+
 # =============================================================================
 # OpenRouter Provider Routing (only applies when using OpenRouter)
 # =============================================================================
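The invariant tying the two settings together, as a standalone sanity check (illustrative helper, not part of the codebase):

```python
def fits_in_window(input_tokens: int, max_tokens: int, context_length: int) -> bool:
    # The API enforces: input_tokens + max_tokens <= context_length.
    return input_tokens + max_tokens <= context_length

assert fits_in_window(190_000, 10_000, 200_000)      # exactly fills the window
assert not fits_in_window(190_000, 32_768, 200_000)  # "max_tokens too large"
```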
run_agent.py (56 changed lines)

@@ -87,6 +87,7 @@ from agent.model_metadata import (
     fetch_model_metadata,
     estimate_tokens_rough, estimate_messages_tokens_rough, estimate_request_tokens_rough,
     get_next_probe_tier, parse_context_limit_from_error,
+    parse_available_output_tokens_from_error,
     save_context_length, is_local_endpoint,
     query_ollama_num_ctx,
 )
@@ -5397,15 +5398,22 @@ class AIAgent:
         if self.api_mode == "anthropic_messages":
             from agent.anthropic_adapter import build_anthropic_kwargs
             anthropic_messages = self._prepare_anthropic_messages_for_api(api_messages)
-            # Pass context_length so the adapter can clamp max_tokens if the
-            # user configured a smaller context window than the model's output limit.
+            # Pass context_length (total input+output window) so the adapter can
+            # clamp max_tokens (output cap) when the user configured a smaller
+            # context window than the model's native output limit.
             ctx_len = getattr(self, "context_compressor", None)
             ctx_len = ctx_len.context_length if ctx_len else None
+            # _ephemeral_max_output_tokens is set for one call when the API
+            # returns "max_tokens too large given prompt" — it caps output to
+            # the available window space without touching context_length.
+            ephemeral_out = getattr(self, "_ephemeral_max_output_tokens", None)
+            if ephemeral_out is not None:
+                self._ephemeral_max_output_tokens = None  # consume immediately
             return build_anthropic_kwargs(
                 model=self.model,
                 messages=anthropic_messages,
                 tools=self.tools,
-                max_tokens=self.max_tokens,
+                max_tokens=ephemeral_out if ephemeral_out is not None else self.max_tokens,
                 reasoning_config=self.reasoning_config,
                 is_oauth=self._is_anthropic_oauth,
                 preserve_dots=self._anthropic_preserve_dots(),
@@ -8306,6 +8314,48 @@ class AIAgent:
         compressor = self.context_compressor
         old_ctx = compressor.context_length
 
+        # ── Distinguish two very different errors ───────────
+        # 1. "Prompt too long": the INPUT exceeds the context window.
+        #    Fix: reduce context_length + compress history.
+        # 2. "max_tokens too large": input is fine, but
+        #    input_tokens + requested max_tokens > context_window.
+        #    Fix: reduce max_tokens (the OUTPUT cap) for this call.
+        #    Do NOT shrink context_length — the window is unchanged.
+        #
+        # Note: max_tokens     = output token cap (one response).
+        #       context_length = total window (input + output combined).
+        available_out = parse_available_output_tokens_from_error(error_msg)
+        if available_out is not None:
+            # Error is purely about the output cap being too large.
+            # Cap output to the available space and retry without
+            # touching context_length or triggering compression.
+            safe_out = max(1, available_out - 64)  # small safety margin
+            self._ephemeral_max_output_tokens = safe_out
+            self._vprint(
+                f"{self.log_prefix}⚠️ Output cap too large for current prompt — "
+                f"retrying with max_tokens={safe_out:,} "
+                f"(available_tokens={available_out:,}; context_length unchanged at {old_ctx:,})",
+                force=True,
+            )
+            # Still count against compression_attempts so we don't
+            # loop forever if the error keeps recurring.
+            compression_attempts += 1
+            if compression_attempts > max_compression_attempts:
+                self._vprint(f"{self.log_prefix}❌ Max compression attempts ({max_compression_attempts}) reached.", force=True)
+                self._vprint(f"{self.log_prefix} 💡 Try /new to start a fresh conversation, or /compress to retry compression.", force=True)
+                logging.error(f"{self.log_prefix}Context compression failed after {max_compression_attempts} attempts.")
+                self._persist_session(messages, conversation_history)
+                return {
+                    "messages": messages,
+                    "completed": False,
+                    "api_calls": api_call_count,
+                    "error": f"Context length exceeded: max compression attempts ({max_compression_attempts}) reached.",
+                    "partial": True
+                }
+            restart_with_compressed_messages = True
+            break
+
+        # Error is about the INPUT being too large — reduce context_length.
         # Try to parse the actual limit from the error message
         parsed_limit = parse_context_limit_from_error(error_msg)
         if parsed_limit and parsed_limit < old_ctx:
tests/test_ctx_halving_fix.py (new file, 319 lines)

@@ -0,0 +1,319 @@
"""Tests for the context-halving bugfix.
|
||||
|
||||
Background
|
||||
----------
|
||||
When the API returns "max_tokens too large given prompt" (input is fine,
|
||||
but input_tokens + requested max_tokens > context_window), the old code
|
||||
incorrectly halved context_length via get_next_probe_tier().
|
||||
|
||||
The fix introduces:
|
||||
* parse_available_output_tokens_from_error() — detects this specific
|
||||
error class and returns the available output token budget.
|
||||
* _ephemeral_max_output_tokens on AIAgent — a one-shot override that
|
||||
caps the output for one retry without touching context_length.
|
||||
|
||||
Naming note
|
||||
-----------
|
||||
max_tokens = OUTPUT token cap (a single response).
|
||||
context_length = TOTAL context window (input + output combined).
|
||||
These are different and the old code conflated them; the fix keeps them
|
||||
separate.
|
||||
"""
|
||||
|
||||
import sys
|
||||
import os
|
||||
from unittest.mock import MagicMock, patch, PropertyMock
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
# parse_available_output_tokens_from_error — unit tests
# ---------------------------------------------------------------------------

class TestParseAvailableOutputTokens:
    """Pure-function tests; no I/O required."""

    def _parse(self, msg):
        from agent.model_metadata import parse_available_output_tokens_from_error
        return parse_available_output_tokens_from_error(msg)

    # ── Should detect and extract ────────────────────────────────────────

    def test_anthropic_canonical_format(self):
        """Canonical Anthropic error: max_tokens: X > context_window: Y - input_tokens: Z = available_tokens: W"""
        msg = (
            "max_tokens: 32768 > context_window: 200000 "
            "- input_tokens: 190000 = available_tokens: 10000"
        )
        assert self._parse(msg) == 10000

    def test_anthropic_format_large_numbers(self):
        msg = (
            "max_tokens: 128000 > context_window: 200000 "
            "- input_tokens: 180000 = available_tokens: 20000"
        )
        assert self._parse(msg) == 20000

    def test_available_tokens_variant_spacing(self):
        """Handles extra spaces around the colon."""
        msg = "max_tokens: 32768 > 200000 available_tokens : 5000"
        assert self._parse(msg) == 5000

    def test_available_tokens_natural_language(self):
        """'available tokens: N' wording (no underscore)."""
        msg = "max_tokens must be at most 10000 given your prompt (available tokens: 10000)"
        assert self._parse(msg) == 10000

    def test_single_token_available(self):
        """Edge case: only 1 token left."""
        msg = "max_tokens: 9999 > context_window: 10000 - input_tokens: 9999 = available_tokens: 1"
        assert self._parse(msg) == 1

    # ── Should NOT detect (returns None) ─────────────────────────────────

    def test_prompt_too_long_is_not_output_cap_error(self):
        """'prompt is too long' errors must NOT be caught — they need context halving."""
        msg = "prompt is too long: 205000 tokens > 200000 maximum"
        assert self._parse(msg) is None

    def test_generic_context_window_exceeded(self):
        """Generic context window errors without available_tokens should not match."""
        msg = "context window exceeded: maximum is 32768 tokens"
        assert self._parse(msg) is None

    def test_context_length_exceeded(self):
        msg = "context_length_exceeded: prompt has 131073 tokens, limit is 131072"
        assert self._parse(msg) is None

    def test_no_max_tokens_keyword(self):
        """Error not related to max_tokens at all."""
        msg = "invalid_api_key: the API key is invalid"
        assert self._parse(msg) is None

    def test_empty_string(self):
        assert self._parse("") is None

    def test_rate_limit_error(self):
        msg = "rate_limit_error: too many requests per minute"
        assert self._parse(msg) is None

# ---------------------------------------------------------------------------
# build_anthropic_kwargs — output cap clamping
# ---------------------------------------------------------------------------

class TestBuildAnthropicKwargsClamping:
    """The context_length clamp only fires when output ceiling > window.
    For standard Anthropic models (output ceiling < window) it must not fire.
    """

    def _build(self, model, max_tokens=None, context_length=None):
        from agent.anthropic_adapter import build_anthropic_kwargs
        return build_anthropic_kwargs(
            model=model,
            messages=[{"role": "user", "content": "hi"}],
            tools=None,
            max_tokens=max_tokens,
            reasoning_config=None,
            context_length=context_length,
        )

    def test_no_clamping_when_output_ceiling_fits_in_window(self):
        """Opus 4.6 native output (128K) < context window (200K) — no clamping."""
        kwargs = self._build("claude-opus-4-6", context_length=200_000)
        assert kwargs["max_tokens"] == 128_000

    def test_clamping_fires_for_tiny_custom_window(self):
        """When context_length is 8K (local model), output cap is clamped to 7999."""
        kwargs = self._build("claude-opus-4-6", context_length=8_000)
        assert kwargs["max_tokens"] == 7_999

    def test_explicit_max_tokens_respected_when_within_window(self):
        """Explicit max_tokens smaller than window passes through unchanged."""
        kwargs = self._build("claude-opus-4-6", max_tokens=4096, context_length=200_000)
        assert kwargs["max_tokens"] == 4096

    def test_explicit_max_tokens_clamped_when_exceeds_window(self):
        """Explicit max_tokens larger than a small window is clamped."""
        kwargs = self._build("claude-opus-4-6", max_tokens=32_768, context_length=16_000)
        assert kwargs["max_tokens"] == 15_999

    def test_no_context_length_uses_native_ceiling(self):
        """Without context_length the native output ceiling is used directly."""
        kwargs = self._build("claude-sonnet-4-6")
        assert kwargs["max_tokens"] == 64_000

# ---------------------------------------------------------------------------
# Ephemeral max_tokens mechanism — _build_api_kwargs
# ---------------------------------------------------------------------------

class TestEphemeralMaxOutputTokens:
    """_build_api_kwargs consumes _ephemeral_max_output_tokens exactly once
    and falls back to self.max_tokens on subsequent calls.
    """

    def _make_agent(self):
        """Return a minimal AIAgent with api_mode='anthropic_messages' and
        a stubbed context_compressor, bypassing full __init__ cost."""
        from run_agent import AIAgent
        agent = object.__new__(AIAgent)
        # Minimal attributes used by _build_api_kwargs
        agent.api_mode = "anthropic_messages"
        agent.model = "claude-opus-4-6"
        agent.tools = []
        agent.max_tokens = None
        agent.reasoning_config = None
        agent._is_anthropic_oauth = False
        agent._ephemeral_max_output_tokens = None

        compressor = MagicMock()
        compressor.context_length = 200_000
        agent.context_compressor = compressor

        # Stub out the internal message-preparation helper
        agent._prepare_anthropic_messages_for_api = MagicMock(
            return_value=[{"role": "user", "content": "hi"}]
        )
        agent._anthropic_preserve_dots = MagicMock(return_value=False)
        return agent

    def test_ephemeral_override_is_used_on_first_call(self):
        """When _ephemeral_max_output_tokens is set, it overrides self.max_tokens."""
        agent = self._make_agent()
        agent._ephemeral_max_output_tokens = 5_000

        kwargs = agent._build_api_kwargs([{"role": "user", "content": "hi"}])
        assert kwargs["max_tokens"] == 5_000

    def test_ephemeral_override_is_consumed_after_one_call(self):
        """After one call the ephemeral override is cleared to None."""
        agent = self._make_agent()
        agent._ephemeral_max_output_tokens = 5_000

        agent._build_api_kwargs([{"role": "user", "content": "hi"}])
        assert agent._ephemeral_max_output_tokens is None

    def test_subsequent_call_uses_self_max_tokens(self):
        """A second _build_api_kwargs call uses the normal max_tokens path."""
        agent = self._make_agent()
        agent._ephemeral_max_output_tokens = 5_000
        agent.max_tokens = None  # will resolve to native ceiling (128K for Opus 4.6)

        agent._build_api_kwargs([{"role": "user", "content": "hi"}])
        # Second call — ephemeral is gone
        kwargs2 = agent._build_api_kwargs([{"role": "user", "content": "hi"}])
        assert kwargs2["max_tokens"] == 128_000  # Opus 4.6 native ceiling

    def test_no_ephemeral_uses_self_max_tokens_directly(self):
        """Without an ephemeral override, self.max_tokens is used normally."""
        agent = self._make_agent()
        agent.max_tokens = 8_192

        kwargs = agent._build_api_kwargs([{"role": "user", "content": "hi"}])
        assert kwargs["max_tokens"] == 8_192

# ---------------------------------------------------------------------------
# Integration: error handler does NOT halve context_length for output-cap errors
# ---------------------------------------------------------------------------

class TestContextNotHalvedOnOutputCapError:
    """When the API returns 'max_tokens too large given prompt', the handler
    must set _ephemeral_max_output_tokens and NOT modify context_length.
    """

    def _make_agent_with_compressor(self, context_length=200_000):
        from run_agent import AIAgent
        from agent.context_compressor import ContextCompressor

        agent = object.__new__(AIAgent)
        agent.api_mode = "anthropic_messages"
        agent.model = "claude-opus-4-6"
        agent.base_url = "https://api.anthropic.com"
        agent.tools = []
        agent.max_tokens = None
        agent.reasoning_config = None
        agent._is_anthropic_oauth = False
        agent._ephemeral_max_output_tokens = None
        agent.log_prefix = ""
        agent.quiet_mode = True
        agent.verbose_logging = False

        compressor = MagicMock(spec=ContextCompressor)
        compressor.context_length = context_length
        compressor.threshold_percent = 0.75
        agent.context_compressor = compressor

        agent._prepare_anthropic_messages_for_api = MagicMock(
            return_value=[{"role": "user", "content": "hi"}]
        )
        agent._anthropic_preserve_dots = MagicMock(return_value=False)
        agent._vprint = MagicMock()
        return agent

    def test_output_cap_error_sets_ephemeral_not_context_length(self):
        """On 'max_tokens too large' error, _ephemeral_max_output_tokens is set
        and compressor.context_length is left unchanged."""
        from agent.model_metadata import parse_available_output_tokens_from_error

        error_msg = (
            "max_tokens: 128000 > context_window: 200000 "
            "- input_tokens: 180000 = available_tokens: 20000"
        )

        # Simulate the handler logic from run_agent.py
        agent = self._make_agent_with_compressor(context_length=200_000)
        old_ctx = agent.context_compressor.context_length

        available_out = parse_available_output_tokens_from_error(error_msg)
        assert available_out == 20_000, "parser must detect the error"

        # The fix: set ephemeral, skip context_length modification
        agent._ephemeral_max_output_tokens = max(1, available_out - 64)

        # context_length must be untouched
        assert agent.context_compressor.context_length == old_ctx
        assert agent._ephemeral_max_output_tokens == 19_936

    def test_prompt_too_long_still_triggers_probe_tier(self):
        """Genuine prompt-too-long errors must still use get_next_probe_tier."""
        from agent.model_metadata import parse_available_output_tokens_from_error
        from agent.model_metadata import get_next_probe_tier

        error_msg = "prompt is too long: 205000 tokens > 200000 maximum"

        available_out = parse_available_output_tokens_from_error(error_msg)
        assert available_out is None, "prompt-too-long must not be caught by output-cap parser"

        # The old halving path is still used for this class of error
        new_ctx = get_next_probe_tier(200_000)
        assert new_ctx == 128_000

    def test_output_cap_error_safety_margin(self):
        """The ephemeral value includes a 64-token safety margin below available_out."""
        from agent.model_metadata import parse_available_output_tokens_from_error

        error_msg = (
            "max_tokens: 32768 > context_window: 200000 "
            "- input_tokens: 190000 = available_tokens: 10000"
        )
        available_out = parse_available_output_tokens_from_error(error_msg)
        safe_out = max(1, available_out - 64)
        assert safe_out == 9_936

    def test_safety_margin_never_goes_below_one(self):
        """When available_out is very small, safe_out must be at least 1."""
        from agent.model_metadata import parse_available_output_tokens_from_error

        error_msg = (
            "max_tokens: 10 > context_window: 200000 "
            "- input_tokens: 199990 = available_tokens: 1"
        )
        available_out = parse_available_output_tokens_from_error(error_msg)
        safe_out = max(1, available_out - 64)
        assert safe_out == 1
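The suite can be run on its own (a standard pytest invocation; the programmatic call below is equivalent to `pytest tests/test_ctx_halving_fix.py -v` from the repo root):

```python
import pytest

# Run only the new suite, verbose.
pytest.main(["tests/test_ctx_halving_fix.py", "-v"])
```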
website/docs/integrations/providers.md

@@ -657,8 +657,8 @@ model:
 #### Responses get cut off mid-sentence
 
 **Possible causes:**
-1. **Low `max_tokens` on the server** — SGLang defaults to 128 tokens per response. Set `--default-max-tokens` on the server or configure Hermes with `model.max_tokens` in config.yaml.
-2. **Context exhaustion** — The model filled its context window. Increase context length or enable [context compression](/docs/user-guide/configuration#context-compression) in Hermes.
+1. **Low output cap (`max_tokens`) on the server** — SGLang defaults to 128 tokens per response. Set `--default-max-tokens` on the server or configure Hermes with `model.max_tokens` in config.yaml. Note: `max_tokens` controls response length only — it is unrelated to how long your conversation history can be (that is `context_length`).
+2. **Context exhaustion** — The model filled its context window. Increase `model.context_length` or enable [context compression](/docs/user-guide/configuration#context-compression) in Hermes.
 
 ---
 
@@ -751,6 +751,15 @@ model:
 
 ### Context Length Detection
 
+:::note Two settings, easy to confuse
+**`context_length`** is the **total context window** — the combined budget for input *and* output tokens (e.g. 200,000 for Claude Opus 4.6). Hermes uses this to decide when to compress history and to validate API requests.
+
+**`model.max_tokens`** is the **output cap** — the maximum number of tokens the model may generate in a *single response*. It has nothing to do with how long your conversation history can be. The industry-standard name `max_tokens` is a common source of confusion; Anthropic's native API has since renamed it `max_output_tokens` for clarity.
+
+Set `context_length` when auto-detection gets the window size wrong.
+Set `model.max_tokens` only when you need to limit how long individual responses can be.
+:::
+
 Hermes uses a multi-source resolution chain to detect the correct context window for your model and provider:
 
 1. **Config override** — `model.context_length` in config.yaml (highest priority)