mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-27 11:22:03 +00:00
Two-part fix:
Part 1 (classifier override at agent/error_classifier.py:720-738):
A transport disconnect on a reasoning model — even on a large session —
now routes to FailoverReason.timeout instead of context_overflow. Without
this, large-session reasoning-model disconnects route to the compression
branch and silently delete conversation history on a phantom
context-length error. The override is strictly targeted: non-reasoning
models (gpt-4o, claude-3-5-sonnet, llama-3.3-70b, etc.) still route to
context_overflow on large sessions — the existing intentional behavior
for chat models whose proxy doesn't idle-kill during prefill/generation.
Part 2 (new agent/thinking_timeout_guidance.py + integration at
agent/conversation_loop.py:3488-3567):
New is_thinking_timeout() and build_thinking_timeout_guidance() helpers.
When a known reasoning model (NVIDIA Nemotron 3 Ultra, OpenAI o1/o3,
Anthropic Opus 4.x thinking, DeepSeek R1, Qwen QwQ, xAI Grok reasoning)
hits a transport-kill on a small session (classifier says timeout
directly) or after Part 1 routes correctly (large session), the user
now sees reasoning-specific guidance with three actionable workarounds
in priority order:
1. Set providers.<provider>.models.<model>.stale_timeout_seconds: 900
in ~/.hermes/config.yaml (Hermes's built-in floor is already 600s
for known reasoning models; raise further if upstream is even
tighter).
2. Lower reasoning_budget or set reasoning_effort: medium on this
model if the provider supports it.
3. Use a smaller / faster reasoning model if the task doesn't
require deep thinking.
The new guidance takes precedence via if/elif over the existing
_is_stream_drop block, so a reasoning-model user with a transport-kill
message sees actionable advice instead of the misleading "try
execute_code with Python's open() for large files" advice (which is
correct for the unrelated large-file-write stream-drop case but
actively wrong for the thinking-timeout case).
Verified:
- 478 tests passing across 9 directly-relevant files (49 new + 429
existing, zero regressions).
- Ruff lint clean on all 4 modified/new files.
- Negative test: 6 parametrized regression guards confirm non-reasoning
models still route to context_overflow on large sessions; 4
parametrized gates confirm non-timeout classifier reasons never
trigger the guidance; 5 parametrized cases confirm non-transport
messages never trigger it.
- Regression guard: new guidance message does NOT contain
"execute_code" or "open()" — the misleading advice is fully
replaced, not appended alongside.
- Cross-vendor dual review via agy -p:
- Gemini 3.5 Flash (Medium) — passed: true, zero blockers, one
SHOULD-FIX (vprint block duplication — fixed by extracting
detection into a helper module).
- GPT-OSS 120B (Medium) — passed: true, zero blockers, two nits
(test placement — adopted at tests/agent/test_thinking_timeout_guidance.py;
primary-model capture — accepted as non-issue per Flash's nit).
Dependency note for maintainers:
This PR includes agent/reasoning_timeouts.py (the reasoning-model
allowlist module from PR #52238) because the Layer 1 override is
load-bearing on get_reasoning_stale_timeout_floor(). After PR #52238
lands on main, this PR's duplicate agent/reasoning_timeouts.py should
be rebased away. Either PR can land first; the other rebase is
mechanical.
Fixes #52271.
216 lines
9.2 KiB
Python
216 lines
9.2 KiB
Python
"""Per-reasoning-model stale-timeout floor for known reasoning models.
|
|
|
|
Reasoning models (those that emit extended thinking blocks before their
|
|
first content token) routinely exceed Hermes's default chat-model
|
|
stale detectors:
|
|
|
|
* Stream stale detector: ``HERMES_STREAM_STALE_TIMEOUT`` default 180s
|
|
``agent/chat_completion_helpers.py:2544``
|
|
* Non-stream stale detector: ``HERMES_API_CALL_STALE_TIMEOUT`` default 90s
|
|
``run_agent.py:1140``
|
|
|
|
For NVIDIA Nemotron 3 Ultra on the hosted NIM gateway the empirical
|
|
upstream idle kill is ~120s (first-party reproduction at
|
|
NVIDIA/NemoClaw#4846 — TTFB ~31s, stream dies at 120s). The same
|
|
failure mode exists on OpenAI o1/o3, Anthropic Opus 4.x thinking,
|
|
DeepSeek R1, Qwen QwQ, xAI Grok reasoning — every cloud reasoning
|
|
model hits upstream-proxies / load-balancers with idle timeouts
|
|
shorter than the model's thinking phase. Result: the stale detector
|
|
kills the connection mid-think, surfacing as
|
|
``BrokenPipeError``/``RemoteProtocolError`` on the next read.
|
|
|
|
This module provides a floor that the existing stale-detector scaling
|
|
blocks consult via :func:`get_reasoning_stale_timeout_floor` and
|
|
apply as ``max(default, floor)``. It is a FLOOR:
|
|
|
|
* Never overrides explicit user config (``providers.<id>.models.<model>.stale_timeout_seconds``
|
|
or ``request_timeout_seconds`` already wins — this code never runs
|
|
in that branch).
|
|
* Never lowers an existing threshold.
|
|
* Has zero effect on non-reasoning models — they are not in the
|
|
allowlist and the resolver returns ``None``.
|
|
|
|
Matching uses start-anchored regex on the slug-only component of
|
|
the model name (after stripping any aggregator prefix like
|
|
``openai/``, ``x-ai/``, ``anthropic/``). The right-anchor matches
|
|
end-of-string or a ``-``/``.``/``_`` slug separator, so ``qwen3-235b``
|
|
matches the ``qwen3`` family entry (a future model slug would be
|
|
``qwen3-235b-instruct`` and would also match) but ``some-other-qwen3``
|
|
does NOT match ``qwen3`` (the ``-qwen3`` is not at start of slug).
|
|
|
|
The ``o1`` case is the most delicate: a model named
|
|
``llama-4-70b-o1-preview`` is a hypothetical community derivative that
|
|
should NOT trigger the reasoning-model floor for the user (the user
|
|
chose a non-OpenAI model, not a reasoning model). The start-of-slug
|
|
anchor naturally excludes this — the matched ``o1-preview`` is at
|
|
position 11 of the slug, not at position 0. The previous substring-
|
|
with-trailing-hyphen design would have over-matched here, which is
|
|
why start-of-slug anchoring is the right shape.
|
|
|
|
Fixes #52217.
|
|
"""
|
|
|
|
from __future__ import annotations
|
|
|
|
import re
|
|
from typing import Optional
|
|
|
|
|
|
# (slug, floor_seconds). Each slug is matched as a discrete
|
|
# word-boundary component via the wrapper regex in ``_match_any``
|
|
# below. Order is irrelevant — the first regex match wins.
|
|
_REASONING_STALE_TIMEOUT_FLOORS: tuple[tuple[str, int], ...] = (
|
|
# NVIDIA Nemotron — reasoning models behind hosted NIM with
|
|
# documented 60-180s upstream idle kill (NVIDIA/NemoClaw#4846:
|
|
# 120s measured).
|
|
("nemotron-3-ultra", 600),
|
|
("nemotron-3-super", 600),
|
|
("nemotron-3-nano", 300),
|
|
# DeepSeek — R1 reasoning model on hosted NIM / DeepSeek direct.
|
|
("deepseek-r1", 600),
|
|
("deepseek-reasoner", 600),
|
|
# Qwen — QwQ reasoning + Qwen3 thinking variants. QwQ-32B
|
|
# preview is the stable slug; ``qwen3`` covers the family of
|
|
# thinking-mode Qwen3 models (qwen3-235b-a22b, qwen3-32b, etc.)
|
|
# without over-matching every Qwen3 instruct variant — the
|
|
# right-anchor requires the slug to be at the start of the
|
|
# remaining model name, so ``qwen3-235b-instruct`` (instruct is
|
|
# NOT a thinking variant) would still match. Acceptable
|
|
# trade-off: instruct variants of qwen3 get the 180s floor
|
|
# even though they don't reason. The cost is a slightly longer
|
|
# wait on a hung provider; the alternative (matching only
|
|
# ``qwen3-.*-thinking``) breaks the moment NVIDIA or Alibaba
|
|
# ships a slightly different naming shape.
|
|
("qwq-32b", 300),
|
|
("qwen3", 180),
|
|
# OpenAI o-series — known multi-minute TTFB. Each variant
|
|
# enumerated explicitly so bare ``o1`` doesn't over-match
|
|
# ``olmo-1`` or hypothetical future community derivatives.
|
|
("o1", 600),
|
|
("o1-mini", 600),
|
|
("o1-pro", 600),
|
|
("o1-preview", 600),
|
|
("o3", 600),
|
|
("o3-pro", 600),
|
|
("o3-mini", 300),
|
|
("o4-mini", 300),
|
|
# Anthropic Claude 4.x thinking variants. Anchored at
|
|
# ``claude-opus-4`` so non-thinking Claude 3.x or future
|
|
# non-reasoning Claude variants don't match.
|
|
("claude-opus-4", 240),
|
|
("claude-sonnet-4.5", 180),
|
|
("claude-sonnet-4.6", 180),
|
|
# xAI Grok reasoning variants. Explicit reasoning-only keys
|
|
# plus one for the ``non-reasoning`` variant so users picking
|
|
# the fast variant don't get the 300s floor. Bare ``grok-3``,
|
|
# ``grok-4`` etc. don't match — only the explicit reasoning /
|
|
# non-reasoning pairs.
|
|
("grok-4-fast-reasoning", 300),
|
|
("grok-4.20-reasoning", 300),
|
|
("grok-4-fast-non-reasoning", 180),
|
|
)
|
|
|
|
|
|
# Pre-compile each pattern. Wrapper = start-of-slug + slug + end-or-
|
|
# separator, where ``start-of-slug`` means start-of-string OR
|
|
# immediately after the last ``/`` (aggregator separator) and
|
|
# ``end-or-separator`` means end-of-string OR a ``-``/``.``/``_``.
|
|
#
|
|
# Why start-of-slug and not start-of-string: aggregator prefixes
|
|
# like ``openai/`` should not affect matching — the slug identity is
|
|
# the part after the last ``/``. Stripping the aggregator prefix in
|
|
# :func:`get_reasoning_stale_timeout_floor` before regex matching
|
|
# gives the wrapper a clean start-of-string anchor.
|
|
#
|
|
# Why end-or-separator on the right: ``openai/o3-mini`` must match
|
|
# the ``o3-mini`` slug (the right anchor is end-of-string). And
|
|
# ``openai/o3-mini-2025-01-31`` must also match ``o3-mini`` (the right
|
|
# anchor is the ``-`` separator). But ``openai/o3-mini-fork`` should
|
|
# NOT match ``o3-mini`` if we wanted to exclude forks — though the
|
|
# pattern ``o3-mini-fork`` would be matched as a derivative anyway,
|
|
# so we accept that community forks inheriting the same prefix are
|
|
# treated as reasoning models (a reasonable default — the upstream
|
|
# gateway timing is the same).
|
|
_PATTERN_CACHE: dict[str, re.Pattern[str]] = {}
|
|
|
|
|
|
def _get_pattern(slug: str) -> re.Pattern[str]:
|
|
compiled = _PATTERN_CACHE.get(slug)
|
|
if compiled is None:
|
|
compiled = re.compile(
|
|
r"^"
|
|
+ re.escape(slug)
|
|
+ r"(?:$|[\-._])"
|
|
)
|
|
_PATTERN_CACHE[slug] = compiled
|
|
return compiled
|
|
|
|
|
|
def _match_any(model_lower: str) -> Optional[float]:
|
|
"""Return the floor for the first matching slug, else None.
|
|
|
|
Each table entry is matched as a start-of-slug prefix with the
|
|
slug-separator-or-end-of-string right-anchor. Table iteration
|
|
order is irrelevant: longest slug wins (so ``o3-mini`` beats
|
|
``o3`` on a model like ``openai/o3-mini``).
|
|
"""
|
|
# Sort by slug length descending so longer / more-specific slugs
|
|
# win on shared prefixes (o3-mini beats o3).
|
|
sorted_floors = sorted(
|
|
_REASONING_STALE_TIMEOUT_FLOORS, key=lambda kv: -len(kv[0])
|
|
)
|
|
for slug, floor in sorted_floors:
|
|
if _get_pattern(slug).search(model_lower):
|
|
return float(floor)
|
|
return None
|
|
|
|
|
|
def get_reasoning_stale_timeout_floor(model: object) -> Optional[float]:
|
|
"""Return the stale-timeout floor (seconds) for a known reasoning model.
|
|
|
|
Returns ``None`` when the model is not in the allowlist or the
|
|
argument is empty / not a string. Matching uses
|
|
word-boundary-anchored regex on the lowercased model name, so
|
|
``openai/o3-mini`` matches the ``o3-mini`` slug but
|
|
``olmo-1`` does NOT match ``o1`` (the ``o1`` substring is not
|
|
at a word boundary inside ``olmo-1``).
|
|
|
|
Aggregator prefixes (``openai/``, ``x-ai/``, ``anthropic/`` etc.)
|
|
are preserved through matching — the ``/`` is itself a word
|
|
boundary, so ``openai/o3-mini`` matches ``o3-mini`` because the
|
|
``/`` before ``o3-mini`` satisfies the left-anchor alternation.
|
|
|
|
This is a FLOOR — callers must apply it as ``max(default, floor)``
|
|
and only when no explicit user-configured per-model
|
|
``stale_timeout_seconds`` exists.
|
|
|
|
>>> get_reasoning_stale_timeout_floor("nvidia/nemotron-3-ultra-550b-a55b")
|
|
600.0
|
|
>>> get_reasoning_stale_timeout_floor("openai/o3-mini")
|
|
300.0
|
|
>>> get_reasoning_stale_timeout_floor("deepseek/deepseek-r1")
|
|
600.0
|
|
>>> get_reasoning_stale_timeout_floor("qwen/qwen3-235b-a22b-thinking")
|
|
180.0
|
|
>>> get_reasoning_stale_timeout_floor("x-ai/grok-4-fast-reasoning")
|
|
300.0
|
|
>>> get_reasoning_stale_timeout_floor("anthropic/claude-opus-4-6")
|
|
240.0
|
|
>>> get_reasoning_stale_timeout_floor("gpt-4o") is None
|
|
True
|
|
>>> get_reasoning_stale_timeout_floor("olmo-1") is None
|
|
True
|
|
>>> get_reasoning_stale_timeout_floor(None) is None
|
|
True
|
|
"""
|
|
if not model or not isinstance(model, str):
|
|
return None
|
|
name = model.strip().lower()
|
|
if not name:
|
|
return None
|
|
# Strip aggregator prefix (everything before and including the
|
|
# last ``/``). The wrapper regex anchors at start-of-string, so
|
|
# the slug identity is the bare model name.
|
|
if "/" in name:
|
|
name = name.rsplit("/", 1)[1]
|
|
return _match_any(name)
|