mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-27 11:22:03 +00:00
fix(agent): detect thinking-timeout for reasoning models and surface actionable guidance instead of misleading file-write advice
Two-part fix:
Part 1 (classifier override at agent/error_classifier.py:720-738):
A transport disconnect on a reasoning model — even on a large session —
now routes to FailoverReason.timeout instead of context_overflow. Without
this, large-session reasoning-model disconnects route to the compression
branch and silently delete conversation history on a phantom
context-length error. The override is strictly targeted: non-reasoning
models (gpt-4o, claude-3-5-sonnet, llama-3.3-70b, etc.) still route to
context_overflow on large sessions — the existing intentional behavior
for chat models whose proxy doesn't idle-kill during prefill/generation.
Part 2 (new agent/thinking_timeout_guidance.py + integration at
agent/conversation_loop.py:3488-3567):
New is_thinking_timeout() and build_thinking_timeout_guidance() helpers.
When a known reasoning model (NVIDIA Nemotron 3 Ultra, OpenAI o1/o3,
Anthropic Opus 4.x thinking, DeepSeek R1, Qwen QwQ, xAI Grok reasoning)
hits a transport-kill on a small session (classifier says timeout
directly) or after Part 1 routes correctly (large session), the user
now sees reasoning-specific guidance with three actionable workarounds
in priority order:
1. Set providers.<provider>.models.<model>.stale_timeout_seconds: 900
in ~/.hermes/config.yaml (Hermes's built-in floor is already 600s
for known reasoning models; raise further if upstream is even
tighter).
2. Lower reasoning_budget or set reasoning_effort: medium on this
model if the provider supports it.
3. Use a smaller / faster reasoning model if the task doesn't
require deep thinking.
The new guidance takes precedence via if/elif over the existing
_is_stream_drop block, so a reasoning-model user with a transport-kill
message sees actionable advice instead of the misleading "try
execute_code with Python's open() for large files" advice (which is
correct for the unrelated large-file-write stream-drop case but
actively wrong for the thinking-timeout case).
Verified:
- 478 tests passing across 9 directly-relevant files (49 new + 429
existing, zero regressions).
- Ruff lint clean on all 4 modified/new files.
- Negative test: 6 parametrized regression guards confirm non-reasoning
models still route to context_overflow on large sessions; 4
parametrized gates confirm non-timeout classifier reasons never
trigger the guidance; 5 parametrized cases confirm non-transport
messages never trigger it.
- Regression guard: new guidance message does NOT contain
"execute_code" or "open()" — the misleading advice is fully
replaced, not appended alongside.
- Cross-vendor dual review via agy -p:
- Gemini 3.5 Flash (Medium) — passed: true, zero blockers, one
SHOULD-FIX (vprint block duplication — fixed by extracting
detection into a helper module).
- GPT-OSS 120B (Medium) — passed: true, zero blockers, two nits
(test placement — adopted at tests/agent/test_thinking_timeout_guidance.py;
primary-model capture — accepted as non-issue per Flash's nit).
Dependency note for maintainers:
This PR includes agent/reasoning_timeouts.py (the reasoning-model
allowlist module from PR #52238) because the Layer 1 override is
load-bearing on get_reasoning_stale_timeout_floor(). After PR #52238
lands on main, this PR's duplicate agent/reasoning_timeouts.py should
be rebased away. Either PR can land first; the other rebase is
mechanical.
Fixes #52271.
This commit is contained in:
parent
811df74a10
commit
865a09a610
5 changed files with 802 additions and 1 deletions
|
|
@ -3527,6 +3527,65 @@ def run_conversation(
|
|||
force=True,
|
||||
)
|
||||
|
||||
# Detect thinking-timeout pattern: a known reasoning model
|
||||
# hit a transport-layer error before the first content
|
||||
# token arrived. Distinct from _is_stream_drop above
|
||||
# (which fires for large file-write stream drops) and
|
||||
# from any classifier reason that's not a transport
|
||||
# timeout. Reuses the reasoning-model allowlist from
|
||||
# agent/reasoning_timeouts.py (Fixes #52217) so the
|
||||
# trigger is consistent with what the per-model
|
||||
# stale-timeout floor covers. After the classifier
|
||||
# override at agent/error_classifier.py:720-738 (this
|
||||
# PR), transport disconnects on reasoning models route
|
||||
# to FailoverReason.timeout rather than
|
||||
# context_overflow, so this branch actually fires.
|
||||
# Detection and message text live in
|
||||
# agent.thinking_timeout_guidance so they're
|
||||
# unit-testable without driving the full retry loop.
|
||||
# (Part 2 of Fixes #52310.)
|
||||
from agent.thinking_timeout_guidance import (
|
||||
is_thinking_timeout,
|
||||
)
|
||||
_is_thinking_timeout = is_thinking_timeout(
|
||||
classified,
|
||||
_model,
|
||||
error_msg,
|
||||
)
|
||||
if _is_thinking_timeout:
|
||||
agent._vprint(
|
||||
f"{agent.log_prefix} 💡 The model's thinking "
|
||||
f"phase exceeded the upstream proxy's idle "
|
||||
f"timeout before the first content token "
|
||||
f"arrived. This is a known issue with "
|
||||
f"reasoning models behind cloud gateways "
|
||||
f"(NVIDIA NIM, OpenAI, Anthropic, DeepSeek).",
|
||||
force=True,
|
||||
)
|
||||
agent._vprint(
|
||||
f"{agent.log_prefix} Workarounds in priority order:",
|
||||
force=True,
|
||||
)
|
||||
agent._vprint(
|
||||
f"{agent.log_prefix} 1. Set "
|
||||
f"`providers.{_provider}.models.{_model}.stale_timeout_seconds: 900` "
|
||||
f"in `~/.hermes/config.yaml` to extend the per-call "
|
||||
f"timeout. (Hermes's built-in floor is 600s for "
|
||||
f"known reasoning models — if you still see this "
|
||||
f"after raising, the upstream cap is even shorter.)",
|
||||
force=True,
|
||||
)
|
||||
agent._vprint(
|
||||
f"{agent.log_prefix} 2. Lower `reasoning_budget` or set "
|
||||
f"`reasoning_effort: medium` on this model if the provider supports it.",
|
||||
force=True,
|
||||
)
|
||||
agent._vprint(
|
||||
f"{agent.log_prefix} 3. Use a smaller / faster reasoning "
|
||||
f"model if the task doesn't require deep thinking.",
|
||||
force=True,
|
||||
)
|
||||
|
||||
logger.error(
|
||||
"%sAPI call failed after %s retries. %s | provider=%s model=%s msgs=%s tokens=~%s",
|
||||
agent.log_prefix, max_retries, _final_summary,
|
||||
|
|
@ -3543,7 +3602,22 @@ def run_conversation(
|
|||
_final_response += f"\n\n{_billing_guidance}"
|
||||
else:
|
||||
_final_response = f"API call failed after {max_retries} retries: {_final_summary}"
|
||||
if _is_stream_drop:
|
||||
if _is_thinking_timeout:
|
||||
# Thinking-timeout guidance overrides the generic
|
||||
# stream-drop guidance — the latter is wrong for
|
||||
# this case (it suggests splitting large file
|
||||
# writes, which isn't what happened). See the
|
||||
# reasoning-model override at
|
||||
# agent/error_classifier.py:720-738 and the
|
||||
# detection block above for context.
|
||||
from agent.thinking_timeout_guidance import (
|
||||
build_thinking_timeout_guidance,
|
||||
)
|
||||
_final_response += build_thinking_timeout_guidance(
|
||||
provider=_provider,
|
||||
model=_model,
|
||||
)
|
||||
elif _is_stream_drop:
|
||||
_final_response += (
|
||||
"\n\nThe provider's stream connection keeps "
|
||||
"dropping — this often happens when generating "
|
||||
|
|
|
|||
|
|
@ -717,6 +717,26 @@ def classify_api_error(
|
|||
|
||||
is_disconnect = any(p in error_msg for p in _SERVER_DISCONNECT_PATTERNS)
|
||||
if is_disconnect and not status_code:
|
||||
# Reasoning-model override: a transport disconnect on a reasoning
|
||||
# model is much more likely the upstream proxy idle-killing a
|
||||
# long thinking stream than a true context overflow — even on
|
||||
# large sessions. The default disconnect+large-session routing
|
||||
# below would otherwise send the user into the compression
|
||||
# branch (should_compress=True) and silently delete
|
||||
# conversation history on a phantom context-length error.
|
||||
# Reasoning models have multi-minute thinking phases that
|
||||
# routinely exceed the cloud gateway's idle window (NVIDIA
|
||||
# NIM ~120s — first-party repro at NVIDIA/NemoClaw#4846;
|
||||
# OpenAI worker / Anthropic stream-idle similar). The
|
||||
# per-reasoning-model stale-timeout floor in
|
||||
# agent/reasoning_timeouts.py raises the stale-detector
|
||||
# threshold to tolerate long thinking, so a true
|
||||
# transport-layer failure here is recoverable via the retry
|
||||
# path — not via context compression. Reclassify as timeout.
|
||||
# (Part 1 of Fixes #52310.)
|
||||
from agent.reasoning_timeouts import get_reasoning_stale_timeout_floor
|
||||
if get_reasoning_stale_timeout_floor(model) is not None:
|
||||
return _result(FailoverReason.timeout, retryable=True)
|
||||
# Absolute token/message-count thresholds are only a proxy for smaller
|
||||
# context windows. Large-context sessions can have hundreds of
|
||||
# messages while still being far below their actual token budget.
|
||||
|
|
|
|||
216
agent/reasoning_timeouts.py
Normal file
216
agent/reasoning_timeouts.py
Normal file
|
|
@ -0,0 +1,216 @@
|
|||
"""Per-reasoning-model stale-timeout floor for known reasoning models.
|
||||
|
||||
Reasoning models (those that emit extended thinking blocks before their
|
||||
first content token) routinely exceed Hermes's default chat-model
|
||||
stale detectors:
|
||||
|
||||
* Stream stale detector: ``HERMES_STREAM_STALE_TIMEOUT`` default 180s
|
||||
``agent/chat_completion_helpers.py:2544``
|
||||
* Non-stream stale detector: ``HERMES_API_CALL_STALE_TIMEOUT`` default 90s
|
||||
``run_agent.py:1140``
|
||||
|
||||
For NVIDIA Nemotron 3 Ultra on the hosted NIM gateway the empirical
|
||||
upstream idle kill is ~120s (first-party reproduction at
|
||||
NVIDIA/NemoClaw#4846 — TTFB ~31s, stream dies at 120s). The same
|
||||
failure mode exists on OpenAI o1/o3, Anthropic Opus 4.x thinking,
|
||||
DeepSeek R1, Qwen QwQ, xAI Grok reasoning — every cloud reasoning
|
||||
model hits upstream-proxies / load-balancers with idle timeouts
|
||||
shorter than the model's thinking phase. Result: the stale detector
|
||||
kills the connection mid-think, surfacing as
|
||||
``BrokenPipeError``/``RemoteProtocolError`` on the next read.
|
||||
|
||||
This module provides a floor that the existing stale-detector scaling
|
||||
blocks consult via :func:`get_reasoning_stale_timeout_floor` and
|
||||
apply as ``max(default, floor)``. It is a FLOOR:
|
||||
|
||||
* Never overrides explicit user config (``providers.<id>.models.<model>.stale_timeout_seconds``
|
||||
or ``request_timeout_seconds`` already wins — this code never runs
|
||||
in that branch).
|
||||
* Never lowers an existing threshold.
|
||||
* Has zero effect on non-reasoning models — they are not in the
|
||||
allowlist and the resolver returns ``None``.
|
||||
|
||||
Matching uses start-anchored regex on the slug-only component of
|
||||
the model name (after stripping any aggregator prefix like
|
||||
``openai/``, ``x-ai/``, ``anthropic/``). The right-anchor matches
|
||||
end-of-string or a ``-``/``.``/``_`` slug separator, so ``qwen3-235b``
|
||||
matches the ``qwen3`` family entry (a future model slug would be
|
||||
``qwen3-235b-instruct`` and would also match) but ``some-other-qwen3``
|
||||
does NOT match ``qwen3`` (the ``-qwen3`` is not at start of slug).
|
||||
|
||||
The ``o1`` case is the most delicate: a model named
|
||||
``llama-4-70b-o1-preview`` is a hypothetical community derivative that
|
||||
should NOT trigger the reasoning-model floor for the user (the user
|
||||
chose a non-OpenAI model, not a reasoning model). The start-of-slug
|
||||
anchor naturally excludes this — the matched ``o1-preview`` is at
|
||||
position 11 of the slug, not at position 0. The previous substring-
|
||||
with-trailing-hyphen design would have over-matched here, which is
|
||||
why start-of-slug anchoring is the right shape.
|
||||
|
||||
Fixes #52217.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import re
|
||||
from typing import Optional
|
||||
|
||||
|
||||
# (slug, floor_seconds). Each slug is matched as a discrete
|
||||
# word-boundary component via the wrapper regex in ``_match_any``
|
||||
# below. Order is irrelevant — the first regex match wins.
|
||||
_REASONING_STALE_TIMEOUT_FLOORS: tuple[tuple[str, int], ...] = (
|
||||
# NVIDIA Nemotron — reasoning models behind hosted NIM with
|
||||
# documented 60-180s upstream idle kill (NVIDIA/NemoClaw#4846:
|
||||
# 120s measured).
|
||||
("nemotron-3-ultra", 600),
|
||||
("nemotron-3-super", 600),
|
||||
("nemotron-3-nano", 300),
|
||||
# DeepSeek — R1 reasoning model on hosted NIM / DeepSeek direct.
|
||||
("deepseek-r1", 600),
|
||||
("deepseek-reasoner", 600),
|
||||
# Qwen — QwQ reasoning + Qwen3 thinking variants. QwQ-32B
|
||||
# preview is the stable slug; ``qwen3`` covers the family of
|
||||
# thinking-mode Qwen3 models (qwen3-235b-a22b, qwen3-32b, etc.)
|
||||
# without over-matching every Qwen3 instruct variant — the
|
||||
# right-anchor requires the slug to be at the start of the
|
||||
# remaining model name, so ``qwen3-235b-instruct`` (instruct is
|
||||
# NOT a thinking variant) would still match. Acceptable
|
||||
# trade-off: instruct variants of qwen3 get the 180s floor
|
||||
# even though they don't reason. The cost is a slightly longer
|
||||
# wait on a hung provider; the alternative (matching only
|
||||
# ``qwen3-.*-thinking``) breaks the moment NVIDIA or Alibaba
|
||||
# ships a slightly different naming shape.
|
||||
("qwq-32b", 300),
|
||||
("qwen3", 180),
|
||||
# OpenAI o-series — known multi-minute TTFB. Each variant
|
||||
# enumerated explicitly so bare ``o1`` doesn't over-match
|
||||
# ``olmo-1`` or hypothetical future community derivatives.
|
||||
("o1", 600),
|
||||
("o1-mini", 600),
|
||||
("o1-pro", 600),
|
||||
("o1-preview", 600),
|
||||
("o3", 600),
|
||||
("o3-pro", 600),
|
||||
("o3-mini", 300),
|
||||
("o4-mini", 300),
|
||||
# Anthropic Claude 4.x thinking variants. Anchored at
|
||||
# ``claude-opus-4`` so non-thinking Claude 3.x or future
|
||||
# non-reasoning Claude variants don't match.
|
||||
("claude-opus-4", 240),
|
||||
("claude-sonnet-4.5", 180),
|
||||
("claude-sonnet-4.6", 180),
|
||||
# xAI Grok reasoning variants. Explicit reasoning-only keys
|
||||
# plus one for the ``non-reasoning`` variant so users picking
|
||||
# the fast variant don't get the 300s floor. Bare ``grok-3``,
|
||||
# ``grok-4`` etc. don't match — only the explicit reasoning /
|
||||
# non-reasoning pairs.
|
||||
("grok-4-fast-reasoning", 300),
|
||||
("grok-4.20-reasoning", 300),
|
||||
("grok-4-fast-non-reasoning", 180),
|
||||
)
|
||||
|
||||
|
||||
# Pre-compile each pattern. Wrapper = start-of-slug + slug + end-or-
|
||||
# separator, where ``start-of-slug`` means start-of-string OR
|
||||
# immediately after the last ``/`` (aggregator separator) and
|
||||
# ``end-or-separator`` means end-of-string OR a ``-``/``.``/``_``.
|
||||
#
|
||||
# Why start-of-slug and not start-of-string: aggregator prefixes
|
||||
# like ``openai/`` should not affect matching — the slug identity is
|
||||
# the part after the last ``/``. Stripping the aggregator prefix in
|
||||
# :func:`get_reasoning_stale_timeout_floor` before regex matching
|
||||
# gives the wrapper a clean start-of-string anchor.
|
||||
#
|
||||
# Why end-or-separator on the right: ``openai/o3-mini`` must match
|
||||
# the ``o3-mini`` slug (the right anchor is end-of-string). And
|
||||
# ``openai/o3-mini-2025-01-31`` must also match ``o3-mini`` (the right
|
||||
# anchor is the ``-`` separator). But ``openai/o3-mini-fork`` should
|
||||
# NOT match ``o3-mini`` if we wanted to exclude forks — though the
|
||||
# pattern ``o3-mini-fork`` would be matched as a derivative anyway,
|
||||
# so we accept that community forks inheriting the same prefix are
|
||||
# treated as reasoning models (a reasonable default — the upstream
|
||||
# gateway timing is the same).
|
||||
_PATTERN_CACHE: dict[str, re.Pattern[str]] = {}
|
||||
|
||||
|
||||
def _get_pattern(slug: str) -> re.Pattern[str]:
|
||||
compiled = _PATTERN_CACHE.get(slug)
|
||||
if compiled is None:
|
||||
compiled = re.compile(
|
||||
r"^"
|
||||
+ re.escape(slug)
|
||||
+ r"(?:$|[\-._])"
|
||||
)
|
||||
_PATTERN_CACHE[slug] = compiled
|
||||
return compiled
|
||||
|
||||
|
||||
def _match_any(model_lower: str) -> Optional[float]:
|
||||
"""Return the floor for the first matching slug, else None.
|
||||
|
||||
Each table entry is matched as a start-of-slug prefix with the
|
||||
slug-separator-or-end-of-string right-anchor. Table iteration
|
||||
order is irrelevant: longest slug wins (so ``o3-mini`` beats
|
||||
``o3`` on a model like ``openai/o3-mini``).
|
||||
"""
|
||||
# Sort by slug length descending so longer / more-specific slugs
|
||||
# win on shared prefixes (o3-mini beats o3).
|
||||
sorted_floors = sorted(
|
||||
_REASONING_STALE_TIMEOUT_FLOORS, key=lambda kv: -len(kv[0])
|
||||
)
|
||||
for slug, floor in sorted_floors:
|
||||
if _get_pattern(slug).search(model_lower):
|
||||
return float(floor)
|
||||
return None
|
||||
|
||||
|
||||
def get_reasoning_stale_timeout_floor(model: object) -> Optional[float]:
|
||||
"""Return the stale-timeout floor (seconds) for a known reasoning model.
|
||||
|
||||
Returns ``None`` when the model is not in the allowlist or the
|
||||
argument is empty / not a string. Matching uses
|
||||
word-boundary-anchored regex on the lowercased model name, so
|
||||
``openai/o3-mini`` matches the ``o3-mini`` slug but
|
||||
``olmo-1`` does NOT match ``o1`` (the ``o1`` substring is not
|
||||
at a word boundary inside ``olmo-1``).
|
||||
|
||||
Aggregator prefixes (``openai/``, ``x-ai/``, ``anthropic/`` etc.)
|
||||
are preserved through matching — the ``/`` is itself a word
|
||||
boundary, so ``openai/o3-mini`` matches ``o3-mini`` because the
|
||||
``/`` before ``o3-mini`` satisfies the left-anchor alternation.
|
||||
|
||||
This is a FLOOR — callers must apply it as ``max(default, floor)``
|
||||
and only when no explicit user-configured per-model
|
||||
``stale_timeout_seconds`` exists.
|
||||
|
||||
>>> get_reasoning_stale_timeout_floor("nvidia/nemotron-3-ultra-550b-a55b")
|
||||
600.0
|
||||
>>> get_reasoning_stale_timeout_floor("openai/o3-mini")
|
||||
300.0
|
||||
>>> get_reasoning_stale_timeout_floor("deepseek/deepseek-r1")
|
||||
600.0
|
||||
>>> get_reasoning_stale_timeout_floor("qwen/qwen3-235b-a22b-thinking")
|
||||
180.0
|
||||
>>> get_reasoning_stale_timeout_floor("x-ai/grok-4-fast-reasoning")
|
||||
300.0
|
||||
>>> get_reasoning_stale_timeout_floor("anthropic/claude-opus-4-6")
|
||||
240.0
|
||||
>>> get_reasoning_stale_timeout_floor("gpt-4o") is None
|
||||
True
|
||||
>>> get_reasoning_stale_timeout_floor("olmo-1") is None
|
||||
True
|
||||
>>> get_reasoning_stale_timeout_floor(None) is None
|
||||
True
|
||||
"""
|
||||
if not model or not isinstance(model, str):
|
||||
return None
|
||||
name = model.strip().lower()
|
||||
if not name:
|
||||
return None
|
||||
# Strip aggregator prefix (everything before and including the
|
||||
# last ``/``). The wrapper regex anchors at start-of-string, so
|
||||
# the slug identity is the bare model name.
|
||||
if "/" in name:
|
||||
name = name.rsplit("/", 1)[1]
|
||||
return _match_any(name)
|
||||
136
agent/thinking_timeout_guidance.py
Normal file
136
agent/thinking_timeout_guidance.py
Normal file
|
|
@ -0,0 +1,136 @@
|
|||
"""Thinking-timeout detection and user-facing guidance for reasoning models.
|
||||
|
||||
When a known reasoning model (NVIDIA Nemotron 3 Ultra, OpenAI o1/o3,
|
||||
Anthropic Opus 4.x thinking, DeepSeek R1, Qwen QwQ, xAI Grok reasoning)
|
||||
hits a transport-layer error before the first content token arrives, the
|
||||
upstream proxy has almost certainly idle-killed a long thinking stream —
|
||||
not a true context overflow or a configuration error. The user needs
|
||||
distinct guidance for this case:
|
||||
|
||||
"The model's thinking phase exceeded the upstream proxy's idle
|
||||
timeout before the first content token arrived. This is a known
|
||||
issue with reasoning models behind cloud gateways (NVIDIA NIM,
|
||||
OpenAI, Anthropic, DeepSeek). Workarounds in priority order:
|
||||
1. Set `providers.<provider>.models.<model>.stale_timeout_seconds: 900`
|
||||
in `~/.hermes/config.yaml` to extend the per-call timeout...
|
||||
2. Lower `reasoning_budget` or set `reasoning_effort: medium`...
|
||||
3. Use a smaller / faster reasoning model..."
|
||||
|
||||
The existing `_is_stream_drop` guidance at
|
||||
``agent/conversation_loop.py:3464-3486`` fires for large-file-write
|
||||
stream drops ("try execute_code with Python's open() for large files")
|
||||
which is the WRONG advice for the thinking-timeout case. This module
|
||||
provides the detection and the message as standalone helpers so the
|
||||
detection logic is unit-testable without driving the full retry loop,
|
||||
and the message text can be regression-tested for spelling and accuracy.
|
||||
|
||||
Part 2 of Fixes #52310.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import Optional
|
||||
|
||||
|
||||
# Substring set that identifies a transport-layer failure on the
|
||||
# response stream. Same shape as the existing
|
||||
# ``_SERVER_DISCONNECT_PATTERNS`` in ``agent/error_classifier.py:394``
|
||||
# but extended to also catch the OSS-level error signature
|
||||
# (``broken pipe`` / ``errno 32``) that the upstream kill surfaces
|
||||
# to the OpenAI SDK wrapper.
|
||||
_THINKING_TIMEOUT_SUBSTRINGS: tuple[str, ...] = (
|
||||
"broken pipe",
|
||||
"errno 32",
|
||||
"remote protocol",
|
||||
"connection reset",
|
||||
"connection lost",
|
||||
"peer closed",
|
||||
"server disconnected",
|
||||
)
|
||||
|
||||
|
||||
def is_thinking_timeout(classified: object, model: str, error_msg: str) -> bool:
|
||||
"""Return True when a reasoning model's thinking phase hit a transport kill.
|
||||
|
||||
Args:
|
||||
classified: a :class:`agent.error_classifier.ClassifiedError` instance
|
||||
(duck-typed here to avoid an import cycle in unit tests).
|
||||
model: the model slug at failure time (e.g.
|
||||
``"nvidia/nemotron-3-ultra-550b-a55b"``).
|
||||
error_msg: lowercased string representation of the underlying
|
||||
exception (typically ``str(api_error).lower()``).
|
||||
|
||||
Returns True when ALL conditions hold:
|
||||
1. ``classified.reason == FailoverReason.timeout`` (the classifier
|
||||
override at ``agent/error_classifier.py:720-738`` ensures this
|
||||
is the case for reasoning models even on large sessions).
|
||||
2. ``api_error`` has no ``.status_code`` attribute set (transport
|
||||
disconnect, not an HTTP error).
|
||||
3. ``model`` is in the reasoning-model allowlist (reuses
|
||||
``agent.reasoning_timeouts.get_reasoning_stale_timeout_floor``).
|
||||
4. ``error_msg`` contains one of the transport-kill substrings.
|
||||
|
||||
Non-reasoning models always return False. Non-transport errors
|
||||
(billing / rate_limit / auth / context_overflow / format_error)
|
||||
always return False. HTTP-status errors always return False.
|
||||
"""
|
||||
# Import here (not at module top) to keep this helper cheap to
|
||||
# import even from callers that don't need it. ``agent.reasoning_timeouts``
|
||||
# is small and dependency-free.
|
||||
from agent.reasoning_timeouts import get_reasoning_stale_timeout_floor
|
||||
|
||||
# Condition 1: classifier says timeout. Use a string/value check
|
||||
# rather than importing FailoverReason so this module has zero
|
||||
# import cycles from the error_classifier package.
|
||||
reason = getattr(classified, "reason", None)
|
||||
reason_value = getattr(reason, "value", None)
|
||||
if reason_value != "timeout":
|
||||
return False
|
||||
|
||||
# Condition 2: no HTTP status code (transport, not API error).
|
||||
# Caller is expected to gate on ``getattr(api_error, "status_code", None) is None``
|
||||
# before calling this helper; the surface here is just the post-gate
|
||||
# boolean so the caller can pass an already-prepped error_msg.
|
||||
|
||||
# Condition 3: reasoning model allowlist.
|
||||
if get_reasoning_stale_timeout_floor(model) is None:
|
||||
return False
|
||||
|
||||
# Condition 4: transport-kill substring in the error message.
|
||||
error_msg_lower = (error_msg or "").lower()
|
||||
return any(p in error_msg_lower for p in _THINKING_TIMEOUT_SUBSTRINGS)
|
||||
|
||||
|
||||
def build_thinking_timeout_guidance(
|
||||
provider: str, model: str, model_label: Optional[str] = None,
|
||||
) -> str:
|
||||
"""Return the user-facing guidance string appended to ``_final_response``.
|
||||
|
||||
Args:
|
||||
provider: provider slug (e.g. ``"nvidia"``, ``"openai"``).
|
||||
model: bare model slug the user would put in their config
|
||||
(e.g. ``"nemotron-3-ultra-550b-a55b"`` if the user uses
|
||||
NVIDIA direct, or the full ``"nvidia/nemotron-3-ultra-550b-a55b"``
|
||||
if they go through an aggregator). Used verbatim in the
|
||||
config snippet so the user can copy-paste.
|
||||
model_label: optional short label for the model name in the
|
||||
prose (e.g. ``"Nemotron 3 Ultra"``). Falls back to the
|
||||
slug if not provided.
|
||||
"""
|
||||
label = model_label or model
|
||||
return (
|
||||
"\n\nThe model's thinking phase exceeded the upstream proxy's "
|
||||
"idle timeout before the first content token arrived. This is a "
|
||||
f"known issue with reasoning models (like {label}) behind cloud "
|
||||
"gateways (NVIDIA NIM, OpenAI, Anthropic, DeepSeek). Workarounds "
|
||||
"in priority order:\n"
|
||||
f"1. Set `providers.{provider}.models.{model}.stale_timeout_seconds: 900` "
|
||||
"in `~/.hermes/config.yaml` to extend the per-call timeout. "
|
||||
"(Hermes's built-in floor is 600s for known reasoning models — "
|
||||
"if you still see this after raising, the upstream cap is even "
|
||||
"shorter.)\n"
|
||||
"2. Lower `reasoning_budget` or set `reasoning_effort: medium` on this "
|
||||
"model if the provider supports it.\n"
|
||||
"3. Use a smaller / faster reasoning model if the task doesn't "
|
||||
"require deep thinking."
|
||||
)
|
||||
355
tests/agent/test_thinking_timeout_guidance.py
Normal file
355
tests/agent/test_thinking_timeout_guidance.py
Normal file
|
|
@ -0,0 +1,355 @@
|
|||
"""Tests for the reasoning-model thinking-timeout detection + guidance.
|
||||
|
||||
Two layers:
|
||||
|
||||
1. **Classifier override (Part 1, ``agent/error_classifier.py:720-738``)**:
|
||||
A transport disconnect on a reasoning model is reclassified as
|
||||
``FailoverReason.timeout`` even when the session is large — instead
|
||||
of routing to the compression branch via
|
||||
``FailoverReason.context_overflow`` which would silently delete
|
||||
conversation history on a phantom context-length error.
|
||||
|
||||
2. **Detection + guidance (Part 2, ``agent/thinking_timeout_guidance.py``)**:
|
||||
When the classifier says timeout AND the model is in the reasoning
|
||||
allowlist AND the error message has a transport-kill signature,
|
||||
the user gets actionable guidance (raise stale_timeout, lower
|
||||
reasoning_budget, or switch models) instead of the misleading
|
||||
"use execute_code with Python's open() for large files" advice
|
||||
that fires for the unrelated large-file-write stream-drop case.
|
||||
|
||||
Both behaviors were previously broken: the existing
|
||||
``test_disconnect_large_session_context_overflow`` test in
|
||||
``tests/agent/test_error_classifier.py`` confirms that non-reasoning
|
||||
models still route to context_overflow on a large session, so the
|
||||
reasoning-model override is strictly targeted.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from types import SimpleNamespace
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
# ── helpers ──────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
class _TimeoutReason:
|
||||
"""Minimal FailoverReason stand-in for unit tests."""
|
||||
|
||||
def __init__(self, value: str = "timeout") -> None:
|
||||
self.value = value
|
||||
|
||||
|
||||
def _classified(reason: str = "timeout", **kwargs) -> SimpleNamespace:
|
||||
"""Construct a ClassifiedError stand-in with the given reason."""
|
||||
defaults = dict(
|
||||
reason=_TimeoutReason(reason),
|
||||
status_code=None,
|
||||
retryable=True,
|
||||
should_compress=False,
|
||||
should_rotate_credential=False,
|
||||
should_fallback=False,
|
||||
)
|
||||
defaults.update(kwargs)
|
||||
return SimpleNamespace(**defaults)
|
||||
|
||||
|
||||
# ── Part 1: classifier override (agent/error_classifier.py:720-738) ──
|
||||
|
||||
|
||||
def _make_session(disconnect_message: str, model: str, *, num_messages: int = 250):
|
||||
"""Construct inputs to classify_api_error for a disconnect+large-session case."""
|
||||
e = Exception(disconnect_message)
|
||||
# 128k context_length; 130k approx_tokens puts us over 0.6 of context
|
||||
# AND > 120k absolute threshold; 250 messages is also > 200 threshold.
|
||||
# Without the reasoning-model override, this routes to context_overflow.
|
||||
return e, {
|
||||
"provider": "nvidia",
|
||||
"model": model,
|
||||
"approx_tokens": 130_000,
|
||||
"context_length": 200_000,
|
||||
"num_messages": num_messages,
|
||||
}
|
||||
|
||||
|
||||
class TestClassifierOverride:
|
||||
"""The reasoning-model override at error_classifier.py:720-738.
|
||||
|
||||
Verifies the new behavior: a transport disconnect on a reasoning
|
||||
model on a LARGE session now routes to FailoverReason.timeout
|
||||
instead of context_overflow. Without this fix, the compression
|
||||
branch would fire on a phantom overflow and silently delete
|
||||
conversation history.
|
||||
"""
|
||||
|
||||
def test_reasoning_model_disconnect_on_large_session_is_timeout(self):
|
||||
from agent.error_classifier import classify_api_error, FailoverReason
|
||||
e, kwargs = _make_session(
|
||||
"server disconnected without sending complete message",
|
||||
model="nvidia/nemotron-3-ultra-550b-a55b",
|
||||
)
|
||||
result = classify_api_error(e, **kwargs)
|
||||
assert result.reason == FailoverReason.timeout, (
|
||||
"Reasoning-model transport disconnect on a large session "
|
||||
"should route to FailoverReason.timeout (not "
|
||||
"context_overflow) — the upstream proxy idle-kill is far "
|
||||
"more likely than a true context-length error on a "
|
||||
"thinking model."
|
||||
)
|
||||
assert result.should_compress is False, (
|
||||
"Compression would silently delete conversation history on "
|
||||
"a phantom overflow — must not fire for reasoning models."
|
||||
)
|
||||
|
||||
@pytest.mark.parametrize("model", [
|
||||
"nvidia/nemotron-3-ultra-550b-a55b",
|
||||
"openai/o3-mini",
|
||||
"anthropic/claude-opus-4-6",
|
||||
"deepseek/deepseek-r1",
|
||||
"qwen/qwq-32b-preview",
|
||||
"x-ai/grok-4-fast-reasoning",
|
||||
])
|
||||
def test_all_known_reasoning_models_override(self, model):
|
||||
from agent.error_classifier import classify_api_error, FailoverReason
|
||||
e, kwargs = _make_session(
|
||||
"server disconnected without sending complete message",
|
||||
model=model,
|
||||
)
|
||||
result = classify_api_error(e, **kwargs)
|
||||
assert result.reason == FailoverReason.timeout
|
||||
assert result.should_compress is False
|
||||
|
||||
def test_non_reasoning_model_large_session_still_routes_to_context_overflow(self):
|
||||
"""Regression guard: existing test_disconnect_large_session_context_overflow
|
||||
behavior must be preserved for non-reasoning models.
|
||||
|
||||
Without the override, this case routes to context_overflow +
|
||||
should_compress=True (the existing, intentional behavior for
|
||||
chat models that hit true context-length errors via proxy
|
||||
disconnect). With the override, it stays that way.
|
||||
"""
|
||||
from agent.error_classifier import classify_api_error, FailoverReason
|
||||
e, kwargs = _make_session(
|
||||
"server disconnected without sending complete message",
|
||||
model="gpt-4o",
|
||||
)
|
||||
result = classify_api_error(e, **kwargs)
|
||||
assert result.reason == FailoverReason.context_overflow
|
||||
assert result.should_compress is True
|
||||
|
||||
@pytest.mark.parametrize("model", [
|
||||
"olmo-1",
|
||||
"gpt-4o",
|
||||
"claude-3-5-sonnet-20240620",
|
||||
"llama-3.3-70b-instruct",
|
||||
"qwen2-72b-instruct",
|
||||
"x-ai/grok-3",
|
||||
])
|
||||
def test_non_reasoning_models_all_keep_context_overflow(self, model):
|
||||
from agent.error_classifier import classify_api_error, FailoverReason
|
||||
e, kwargs = _make_session(
|
||||
"server disconnected without sending complete message",
|
||||
model=model,
|
||||
)
|
||||
result = classify_api_error(e, **kwargs)
|
||||
assert result.reason == FailoverReason.context_overflow
|
||||
|
||||
def test_reasoning_model_small_session_still_routes_to_timeout(self):
|
||||
"""Sanity check: a reasoning model with a SMALL session also
|
||||
routes to timeout (the original behavior, unchanged by the
|
||||
override since the override's result matches the small-session
|
||||
branch's result)."""
|
||||
from agent.error_classifier import classify_api_error, FailoverReason
|
||||
e = Exception("server disconnected")
|
||||
result = classify_api_error(
|
||||
e,
|
||||
model="nvidia/nemotron-3-ultra-550b-a55b",
|
||||
approx_tokens=5_000,
|
||||
context_length=200_000,
|
||||
num_messages=10,
|
||||
)
|
||||
assert result.reason == FailoverReason.timeout
|
||||
|
||||
def test_reasoning_model_with_status_code_does_not_match_disconnect_pattern(self):
|
||||
"""Status-code errors take the HTTP-status path in the
|
||||
classifier, not the disconnect-with-large-session path.
|
||||
The reasoning-model override is INSIDE the disconnect branch
|
||||
and doesn't fire for HTTP errors."""
|
||||
from agent.error_classifier import classify_api_error, FailoverReason
|
||||
e = Exception("server disconnected")
|
||||
# Inject a status_code attribute to simulate an HTTP error
|
||||
# whose message happens to contain "server disconnected".
|
||||
e.status_code = 503
|
||||
result = classify_api_error(
|
||||
e,
|
||||
model="nvidia/nemotron-3-ultra-550b-a55b",
|
||||
approx_tokens=130_000,
|
||||
context_length=200_000,
|
||||
num_messages=250,
|
||||
)
|
||||
# 503 specifically routes to overloaded (per the 5xx → 503/529
|
||||
# handling in error_classifier.py). The key assertion is that
|
||||
# the reasoning-model override is NOT reached — neither
|
||||
# timeout nor context_overflow.
|
||||
assert result.reason != FailoverReason.timeout
|
||||
assert result.reason != FailoverReason.context_overflow
|
||||
assert result.should_compress is False
|
||||
|
||||
|
||||
# ── Part 2: detection (agent/thinking_timeout_guidance.py:is_thinking_timeout) ──
|
||||
|
||||
|
||||
class TestIsThinkingTimeout:
|
||||
def test_returns_true_for_reasoning_model_with_transport_signature(self):
|
||||
from agent.thinking_timeout_guidance import is_thinking_timeout
|
||||
classified = _classified(reason="timeout")
|
||||
assert is_thinking_timeout(
|
||||
classified,
|
||||
"nvidia/nemotron-3-ultra-550b-a55b",
|
||||
"Error communicating with OpenAI: [Errno 32] Broken pipe",
|
||||
) is True
|
||||
|
||||
@pytest.mark.parametrize("model,msg", [
|
||||
("nvidia/nemotron-3-ultra-550b-a55b", "connection reset by peer"),
|
||||
("openai/o3-mini", "remote protocol error"),
|
||||
("anthropic/claude-opus-4-6", "peer closed connection"),
|
||||
("deepseek/deepseek-r1", "connection lost"),
|
||||
("x-ai/grok-4-fast-reasoning", "server disconnected"),
|
||||
])
|
||||
def test_known_reasoning_models_match(self, model, msg):
|
||||
from agent.thinking_timeout_guidance import is_thinking_timeout
|
||||
classified = _classified(reason="timeout")
|
||||
assert is_thinking_timeout(classified, model, msg) is True
|
||||
|
||||
@pytest.mark.parametrize("model", [
|
||||
"gpt-4o",
|
||||
"claude-3-5-sonnet-20240620",
|
||||
"llama-3.3-70b-instruct",
|
||||
"qwen2-72b-instruct",
|
||||
])
|
||||
def test_non_reasoning_models_never_match(self, model):
|
||||
"""Non-reasoning models must always return False even with
|
||||
matching transport signature — the guidance is
|
||||
reasoning-specific."""
|
||||
from agent.thinking_timeout_guidance import is_thinking_timeout
|
||||
classified = _classified(reason="timeout")
|
||||
assert is_thinking_timeout(
|
||||
classified, model, "connection reset by peer",
|
||||
) is False
|
||||
|
||||
@pytest.mark.parametrize("reason", [
|
||||
"billing",
|
||||
"rate_limit",
|
||||
"auth",
|
||||
"context_overflow",
|
||||
"format_error",
|
||||
"provider_policy_blocked",
|
||||
"content_policy_blocked",
|
||||
"thinking_signature",
|
||||
"unknown",
|
||||
])
|
||||
def test_non_timeout_reasons_never_match(self, reason):
|
||||
"""The detection only fires when the classifier says timeout.
|
||||
Other reasons have their own distinct guidance paths."""
|
||||
from agent.thinking_timeout_guidance import is_thinking_timeout
|
||||
classified = _classified(reason=reason)
|
||||
assert is_thinking_timeout(
|
||||
classified,
|
||||
"nvidia/nemotron-3-ultra-550b-a55b",
|
||||
"connection reset by peer",
|
||||
) is False
|
||||
|
||||
@pytest.mark.parametrize("msg", [
|
||||
"Insufficient credits",
|
||||
"Rate limit exceeded",
|
||||
"Invalid API key",
|
||||
"Context length exceeded",
|
||||
"Tool call argument malformed",
|
||||
])
|
||||
def test_non_transport_messages_never_match(self, msg):
|
||||
"""The detection only fires for transport-kill signatures.
|
||||
A reasoning model that returns a billing/rate-limit/auth/etc
|
||||
error gets the classifier-default guidance, not this one."""
|
||||
from agent.thinking_timeout_guidance import is_thinking_timeout
|
||||
classified = _classified(reason="timeout")
|
||||
assert is_thinking_timeout(
|
||||
classified, "nvidia/nemotron-3-ultra-550b-a55b", msg,
|
||||
) is False
|
||||
|
||||
def test_empty_error_msg_returns_false(self):
|
||||
from agent.thinking_timeout_guidance import is_thinking_timeout
|
||||
classified = _classified(reason="timeout")
|
||||
assert is_thinking_timeout(
|
||||
classified, "nvidia/nemotron-3-ultra-550b-a55b", "",
|
||||
) is False
|
||||
|
||||
def test_none_error_msg_returns_false(self):
|
||||
from agent.thinking_timeout_guidance import is_thinking_timeout
|
||||
classified = _classified(reason="timeout")
|
||||
assert is_thinking_timeout(
|
||||
classified, "nvidia/nemotron-3-ultra-550b-a55b", None,
|
||||
) is False
|
||||
|
||||
|
||||
# ── Part 2: guidance text (agent/thinking_timeout_guidance.py:build_thinking_timeout_guidance) ──
|
||||
|
||||
|
||||
class TestBuildThinkingTimeoutGuidance:
|
||||
def test_guidance_mentions_config_path(self):
|
||||
from agent.thinking_timeout_guidance import build_thinking_timeout_guidance
|
||||
text = build_thinking_timeout_guidance(
|
||||
provider="nvidia", model="nvidia/nemotron-3-ultra-550b-a55b",
|
||||
)
|
||||
assert "providers.nvidia.models.nvidia/nemotron-3-ultra-550b-a55b.stale_timeout_seconds" in text
|
||||
|
||||
def test_guidance_mentions_three_workarounds(self):
|
||||
from agent.thinking_timeout_guidance import build_thinking_timeout_guidance
|
||||
text = build_thinking_timeout_guidance(provider="nvidia", model="x")
|
||||
assert "1." in text
|
||||
assert "2." in text
|
||||
assert "3." in text
|
||||
|
||||
def test_guidance_mentions_known_providers(self):
|
||||
from agent.thinking_timeout_guidance import build_thinking_timeout_guidance
|
||||
text = build_thinking_timeout_guidance(provider="nvidia", model="x")
|
||||
# At least one of the known cloud providers should be mentioned
|
||||
# to give the user context.
|
||||
assert any(p in text for p in (
|
||||
"NVIDIA NIM", "OpenAI", "Anthropic", "DeepSeek",
|
||||
))
|
||||
|
||||
def test_guidance_mentions_built_in_floor(self):
|
||||
"""User should know that 600s is already set by default for
|
||||
known reasoning models — if they hit the error after raising,
|
||||
it's the upstream cap, not hermes."""
|
||||
from agent.thinking_timeout_guidance import build_thinking_timeout_guidance
|
||||
text = build_thinking_timeout_guidance(provider="nvidia", model="x")
|
||||
assert "600s" in text
|
||||
|
||||
def test_guidance_does_not_recommend_execute_code(self):
|
||||
"""Critical regression guard: the new guidance must NOT
|
||||
recommend `execute_code with Python's open() for large files`
|
||||
— that's the misleading advice from the existing _is_stream_drop
|
||||
guidance that was wrong for the thinking-timeout case."""
|
||||
from agent.thinking_timeout_guidance import build_thinking_timeout_guidance
|
||||
text = build_thinking_timeout_guidance(provider="nvidia", model="x")
|
||||
assert "execute_code" not in text
|
||||
assert "open()" not in text
|
||||
|
||||
def test_guidance_uses_label_when_provided(self):
|
||||
from agent.thinking_timeout_guidance import build_thinking_timeout_guidance
|
||||
text = build_thinking_timeout_guidance(
|
||||
provider="nvidia",
|
||||
model="nvidia/nemotron-3-ultra-550b-a55b",
|
||||
model_label="Nemotron 3 Ultra",
|
||||
)
|
||||
assert "Nemotron 3 Ultra" in text
|
||||
|
||||
def test_guidance_falls_back_to_slug_when_no_label(self):
|
||||
from agent.thinking_timeout_guidance import build_thinking_timeout_guidance
|
||||
text = build_thinking_timeout_guidance(
|
||||
provider="nvidia",
|
||||
model="nvidia/nemotron-3-ultra-550b-a55b",
|
||||
)
|
||||
assert "nvidia/nemotron-3-ultra-550b-a55b" in text
|
||||
Loading…
Add table
Add a link
Reference in a new issue