mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-27 11:22:03 +00:00
Two-part fix:
Part 1 (classifier override at agent/error_classifier.py:720-738):
A transport disconnect on a reasoning model — even on a large session —
now routes to FailoverReason.timeout instead of context_overflow. Without
this, large-session reasoning-model disconnects route to the compression
branch and silently delete conversation history on a phantom
context-length error. The override is strictly targeted: non-reasoning
models (gpt-4o, claude-3-5-sonnet, llama-3.3-70b, etc.) still route to
context_overflow on large sessions — the existing intentional behavior
for chat models whose proxy doesn't idle-kill during prefill/generation.
Part 2 (new agent/thinking_timeout_guidance.py + integration at
agent/conversation_loop.py:3488-3567):
New is_thinking_timeout() and build_thinking_timeout_guidance() helpers.
When a known reasoning model (NVIDIA Nemotron 3 Ultra, OpenAI o1/o3,
Anthropic Opus 4.x thinking, DeepSeek R1, Qwen QwQ, xAI Grok reasoning)
hits a transport-kill on a small session (classifier says timeout
directly) or after Part 1 routes correctly (large session), the user
now sees reasoning-specific guidance with three actionable workarounds
in priority order:
1. Set providers.<provider>.models.<model>.stale_timeout_seconds: 900
in ~/.hermes/config.yaml (Hermes's built-in floor is already 600s
for known reasoning models; raise further if upstream is even
tighter).
2. Lower reasoning_budget or set reasoning_effort: medium on this
model if the provider supports it.
3. Use a smaller / faster reasoning model if the task doesn't
require deep thinking.
The new guidance takes precedence via if/elif over the existing
_is_stream_drop block, so a reasoning-model user with a transport-kill
message sees actionable advice instead of the misleading "try
execute_code with Python's open() for large files" advice (which is
correct for the unrelated large-file-write stream-drop case but
actively wrong for the thinking-timeout case).
Verified:
- 478 tests passing across 9 directly-relevant files (49 new + 429
existing, zero regressions).
- Ruff lint clean on all 4 modified/new files.
- Negative test: 6 parametrized regression guards confirm non-reasoning
models still route to context_overflow on large sessions; 4
parametrized gates confirm non-timeout classifier reasons never
trigger the guidance; 5 parametrized cases confirm non-transport
messages never trigger it.
- Regression guard: new guidance message does NOT contain
"execute_code" or "open()" — the misleading advice is fully
replaced, not appended alongside.
- Cross-vendor dual review via agy -p:
- Gemini 3.5 Flash (Medium) — passed: true, zero blockers, one
SHOULD-FIX (vprint block duplication — fixed by extracting
detection into a helper module).
- GPT-OSS 120B (Medium) — passed: true, zero blockers, two nits
(test placement — adopted at tests/agent/test_thinking_timeout_guidance.py;
primary-model capture — accepted as non-issue per Flash's nit).
Dependency note for maintainers:
This PR includes agent/reasoning_timeouts.py (the reasoning-model
allowlist module from PR #52238) because the Layer 1 override is
load-bearing on get_reasoning_stale_timeout_floor(). After PR #52238
lands on main, this PR's duplicate agent/reasoning_timeouts.py should
be rebased away. Either PR can land first; the other rebase is
mechanical.
Fixes #52271.
136 lines
6.3 KiB
Python
136 lines
6.3 KiB
Python
"""Thinking-timeout detection and user-facing guidance for reasoning models.
|
|
|
|
When a known reasoning model (NVIDIA Nemotron 3 Ultra, OpenAI o1/o3,
|
|
Anthropic Opus 4.x thinking, DeepSeek R1, Qwen QwQ, xAI Grok reasoning)
|
|
hits a transport-layer error before the first content token arrives, the
|
|
upstream proxy has almost certainly idle-killed a long thinking stream —
|
|
not a true context overflow or a configuration error. The user needs
|
|
distinct guidance for this case:
|
|
|
|
"The model's thinking phase exceeded the upstream proxy's idle
|
|
timeout before the first content token arrived. This is a known
|
|
issue with reasoning models behind cloud gateways (NVIDIA NIM,
|
|
OpenAI, Anthropic, DeepSeek). Workarounds in priority order:
|
|
1. Set `providers.<provider>.models.<model>.stale_timeout_seconds: 900`
|
|
in `~/.hermes/config.yaml` to extend the per-call timeout...
|
|
2. Lower `reasoning_budget` or set `reasoning_effort: medium`...
|
|
3. Use a smaller / faster reasoning model..."
|
|
|
|
The existing `_is_stream_drop` guidance at
|
|
``agent/conversation_loop.py:3464-3486`` fires for large-file-write
|
|
stream drops ("try execute_code with Python's open() for large files")
|
|
which is the WRONG advice for the thinking-timeout case. This module
|
|
provides the detection and the message as standalone helpers so the
|
|
detection logic is unit-testable without driving the full retry loop,
|
|
and the message text can be regression-tested for spelling and accuracy.
|
|
|
|
Part 2 of Fixes #52310.
|
|
"""
|
|
|
|
from __future__ import annotations
|
|
|
|
from typing import Optional
|
|
|
|
|
|
# Substring set that identifies a transport-layer failure on the
|
|
# response stream. Same shape as the existing
|
|
# ``_SERVER_DISCONNECT_PATTERNS`` in ``agent/error_classifier.py:394``
|
|
# but extended to also catch the OSS-level error signature
|
|
# (``broken pipe`` / ``errno 32``) that the upstream kill surfaces
|
|
# to the OpenAI SDK wrapper.
|
|
_THINKING_TIMEOUT_SUBSTRINGS: tuple[str, ...] = (
|
|
"broken pipe",
|
|
"errno 32",
|
|
"remote protocol",
|
|
"connection reset",
|
|
"connection lost",
|
|
"peer closed",
|
|
"server disconnected",
|
|
)
|
|
|
|
|
|
def is_thinking_timeout(classified: object, model: str, error_msg: str) -> bool:
|
|
"""Return True when a reasoning model's thinking phase hit a transport kill.
|
|
|
|
Args:
|
|
classified: a :class:`agent.error_classifier.ClassifiedError` instance
|
|
(duck-typed here to avoid an import cycle in unit tests).
|
|
model: the model slug at failure time (e.g.
|
|
``"nvidia/nemotron-3-ultra-550b-a55b"``).
|
|
error_msg: lowercased string representation of the underlying
|
|
exception (typically ``str(api_error).lower()``).
|
|
|
|
Returns True when ALL conditions hold:
|
|
1. ``classified.reason == FailoverReason.timeout`` (the classifier
|
|
override at ``agent/error_classifier.py:720-738`` ensures this
|
|
is the case for reasoning models even on large sessions).
|
|
2. ``api_error`` has no ``.status_code`` attribute set (transport
|
|
disconnect, not an HTTP error).
|
|
3. ``model`` is in the reasoning-model allowlist (reuses
|
|
``agent.reasoning_timeouts.get_reasoning_stale_timeout_floor``).
|
|
4. ``error_msg`` contains one of the transport-kill substrings.
|
|
|
|
Non-reasoning models always return False. Non-transport errors
|
|
(billing / rate_limit / auth / context_overflow / format_error)
|
|
always return False. HTTP-status errors always return False.
|
|
"""
|
|
# Import here (not at module top) to keep this helper cheap to
|
|
# import even from callers that don't need it. ``agent.reasoning_timeouts``
|
|
# is small and dependency-free.
|
|
from agent.reasoning_timeouts import get_reasoning_stale_timeout_floor
|
|
|
|
# Condition 1: classifier says timeout. Use a string/value check
|
|
# rather than importing FailoverReason so this module has zero
|
|
# import cycles from the error_classifier package.
|
|
reason = getattr(classified, "reason", None)
|
|
reason_value = getattr(reason, "value", None)
|
|
if reason_value != "timeout":
|
|
return False
|
|
|
|
# Condition 2: no HTTP status code (transport, not API error).
|
|
# Caller is expected to gate on ``getattr(api_error, "status_code", None) is None``
|
|
# before calling this helper; the surface here is just the post-gate
|
|
# boolean so the caller can pass an already-prepped error_msg.
|
|
|
|
# Condition 3: reasoning model allowlist.
|
|
if get_reasoning_stale_timeout_floor(model) is None:
|
|
return False
|
|
|
|
# Condition 4: transport-kill substring in the error message.
|
|
error_msg_lower = (error_msg or "").lower()
|
|
return any(p in error_msg_lower for p in _THINKING_TIMEOUT_SUBSTRINGS)
|
|
|
|
|
|
def build_thinking_timeout_guidance(
|
|
provider: str, model: str, model_label: Optional[str] = None,
|
|
) -> str:
|
|
"""Return the user-facing guidance string appended to ``_final_response``.
|
|
|
|
Args:
|
|
provider: provider slug (e.g. ``"nvidia"``, ``"openai"``).
|
|
model: bare model slug the user would put in their config
|
|
(e.g. ``"nemotron-3-ultra-550b-a55b"`` if the user uses
|
|
NVIDIA direct, or the full ``"nvidia/nemotron-3-ultra-550b-a55b"``
|
|
if they go through an aggregator). Used verbatim in the
|
|
config snippet so the user can copy-paste.
|
|
model_label: optional short label for the model name in the
|
|
prose (e.g. ``"Nemotron 3 Ultra"``). Falls back to the
|
|
slug if not provided.
|
|
"""
|
|
label = model_label or model
|
|
return (
|
|
"\n\nThe model's thinking phase exceeded the upstream proxy's "
|
|
"idle timeout before the first content token arrived. This is a "
|
|
f"known issue with reasoning models (like {label}) behind cloud "
|
|
"gateways (NVIDIA NIM, OpenAI, Anthropic, DeepSeek). Workarounds "
|
|
"in priority order:\n"
|
|
f"1. Set `providers.{provider}.models.{model}.stale_timeout_seconds: 900` "
|
|
"in `~/.hermes/config.yaml` to extend the per-call timeout. "
|
|
"(Hermes's built-in floor is 600s for known reasoning models — "
|
|
"if you still see this after raising, the upstream cap is even "
|
|
"shorter.)\n"
|
|
"2. Lower `reasoning_budget` or set `reasoning_effort: medium` on this "
|
|
"model if the provider supports it.\n"
|
|
"3. Use a smaller / faster reasoning model if the task doesn't "
|
|
"require deep thinking."
|
|
)
|