hermes-agent/agent/reasoning_timeouts.py
DavidMetcalfe 865a09a610 fix(agent): detect thinking-timeout for reasoning models and surface actionable guidance instead of misleading file-write advice
Two-part fix:

Part 1 (classifier override at agent/error_classifier.py:720-738):
A transport disconnect on a reasoning model — even on a large session —
now routes to FailoverReason.timeout instead of context_overflow. Without
this, large-session reasoning-model disconnects route to the compression
branch and silently delete conversation history on a phantom
context-length error. The override is strictly targeted: non-reasoning
models (gpt-4o, claude-3-5-sonnet, llama-3.3-70b, etc.) still route to
context_overflow on large sessions — the existing intentional behavior
for chat models whose proxy doesn't idle-kill during prefill/generation.

Part 2 (new agent/thinking_timeout_guidance.py + integration at
agent/conversation_loop.py:3488-3567):
New is_thinking_timeout() and build_thinking_timeout_guidance() helpers.
When a known reasoning model (NVIDIA Nemotron 3 Ultra, OpenAI o1/o3,
Anthropic Opus 4.x thinking, DeepSeek R1, Qwen QwQ, xAI Grok reasoning)
hits a transport-kill on a small session (classifier says timeout
directly) or after Part 1 routes correctly (large session), the user
now sees reasoning-specific guidance with three actionable workarounds
in priority order:

  1. Set providers.<provider>.models.<model>.stale_timeout_seconds: 900
     in ~/.hermes/config.yaml (Hermes's built-in floor is already 600s
     for known reasoning models; raise further if upstream is even
     tighter).
  2. Lower reasoning_budget or set reasoning_effort: medium on this
     model if the provider supports it.
  3. Use a smaller / faster reasoning model if the task doesn't
     require deep thinking.

The new guidance takes precedence via if/elif over the existing
_is_stream_drop block, so a reasoning-model user with a transport-kill
message sees actionable advice instead of the misleading "try
execute_code with Python's open() for large files" advice (which is
correct for the unrelated large-file-write stream-drop case but
actively wrong for the thinking-timeout case).

Verified:
- 478 tests passing across 9 directly-relevant files (49 new + 429
  existing, zero regressions).
- Ruff lint clean on all 4 modified/new files.
- Negative test: 6 parametrized regression guards confirm non-reasoning
  models still route to context_overflow on large sessions; 4
  parametrized gates confirm non-timeout classifier reasons never
  trigger the guidance; 5 parametrized cases confirm non-transport
  messages never trigger it.
- Regression guard: new guidance message does NOT contain
  "execute_code" or "open()" — the misleading advice is fully
  replaced, not appended alongside.
- Cross-vendor dual review via agy -p:
  - Gemini 3.5 Flash (Medium) — passed: true, zero blockers, one
    SHOULD-FIX (vprint block duplication — fixed by extracting
    detection into a helper module).
  - GPT-OSS 120B (Medium) — passed: true, zero blockers, two nits
    (test placement — adopted at tests/agent/test_thinking_timeout_guidance.py;
    primary-model capture — accepted as non-issue per Flash's nit).

Dependency note for maintainers:
This PR includes agent/reasoning_timeouts.py (the reasoning-model
allowlist module from PR #52238) because the Layer 1 override is
load-bearing on get_reasoning_stale_timeout_floor(). After PR #52238
lands on main, this PR's duplicate agent/reasoning_timeouts.py should
be rebased away. Either PR can land first; the other rebase is
mechanical.

Fixes #52271.
2026-06-25 19:00:48 -07:00

216 lines
9.2 KiB
Python

"""Per-reasoning-model stale-timeout floor for known reasoning models.
Reasoning models (those that emit extended thinking blocks before their
first content token) routinely exceed Hermes's default chat-model
stale detectors:
* Stream stale detector: ``HERMES_STREAM_STALE_TIMEOUT`` default 180s
``agent/chat_completion_helpers.py:2544``
* Non-stream stale detector: ``HERMES_API_CALL_STALE_TIMEOUT`` default 90s
``run_agent.py:1140``
For NVIDIA Nemotron 3 Ultra on the hosted NIM gateway the empirical
upstream idle kill is ~120s (first-party reproduction at
NVIDIA/NemoClaw#4846 — TTFB ~31s, stream dies at 120s). The same
failure mode exists on OpenAI o1/o3, Anthropic Opus 4.x thinking,
DeepSeek R1, Qwen QwQ, xAI Grok reasoning — every cloud reasoning
model hits upstream-proxies / load-balancers with idle timeouts
shorter than the model's thinking phase. Result: the stale detector
kills the connection mid-think, surfacing as
``BrokenPipeError``/``RemoteProtocolError`` on the next read.
This module provides a floor that the existing stale-detector scaling
blocks consult via :func:`get_reasoning_stale_timeout_floor` and
apply as ``max(default, floor)``. It is a FLOOR:
* Never overrides explicit user config (``providers.<id>.models.<model>.stale_timeout_seconds``
or ``request_timeout_seconds`` already wins — this code never runs
in that branch).
* Never lowers an existing threshold.
* Has zero effect on non-reasoning models — they are not in the
allowlist and the resolver returns ``None``.
Matching uses start-anchored regex on the slug-only component of
the model name (after stripping any aggregator prefix like
``openai/``, ``x-ai/``, ``anthropic/``). The right-anchor matches
end-of-string or a ``-``/``.``/``_`` slug separator, so ``qwen3-235b``
matches the ``qwen3`` family entry (a future model slug would be
``qwen3-235b-instruct`` and would also match) but ``some-other-qwen3``
does NOT match ``qwen3`` (the ``-qwen3`` is not at start of slug).
The ``o1`` case is the most delicate: a model named
``llama-4-70b-o1-preview`` is a hypothetical community derivative that
should NOT trigger the reasoning-model floor for the user (the user
chose a non-OpenAI model, not a reasoning model). The start-of-slug
anchor naturally excludes this — the matched ``o1-preview`` is at
position 11 of the slug, not at position 0. The previous substring-
with-trailing-hyphen design would have over-matched here, which is
why start-of-slug anchoring is the right shape.
Fixes #52217.
"""
from __future__ import annotations
import re
from typing import Optional
# (slug, floor_seconds). Each slug is matched as a discrete
# word-boundary component via the wrapper regex in ``_match_any``
# below. Order is irrelevant — the first regex match wins.
_REASONING_STALE_TIMEOUT_FLOORS: tuple[tuple[str, int], ...] = (
# NVIDIA Nemotron — reasoning models behind hosted NIM with
# documented 60-180s upstream idle kill (NVIDIA/NemoClaw#4846:
# 120s measured).
("nemotron-3-ultra", 600),
("nemotron-3-super", 600),
("nemotron-3-nano", 300),
# DeepSeek — R1 reasoning model on hosted NIM / DeepSeek direct.
("deepseek-r1", 600),
("deepseek-reasoner", 600),
# Qwen — QwQ reasoning + Qwen3 thinking variants. QwQ-32B
# preview is the stable slug; ``qwen3`` covers the family of
# thinking-mode Qwen3 models (qwen3-235b-a22b, qwen3-32b, etc.)
# without over-matching every Qwen3 instruct variant — the
# right-anchor requires the slug to be at the start of the
# remaining model name, so ``qwen3-235b-instruct`` (instruct is
# NOT a thinking variant) would still match. Acceptable
# trade-off: instruct variants of qwen3 get the 180s floor
# even though they don't reason. The cost is a slightly longer
# wait on a hung provider; the alternative (matching only
# ``qwen3-.*-thinking``) breaks the moment NVIDIA or Alibaba
# ships a slightly different naming shape.
("qwq-32b", 300),
("qwen3", 180),
# OpenAI o-series — known multi-minute TTFB. Each variant
# enumerated explicitly so bare ``o1`` doesn't over-match
# ``olmo-1`` or hypothetical future community derivatives.
("o1", 600),
("o1-mini", 600),
("o1-pro", 600),
("o1-preview", 600),
("o3", 600),
("o3-pro", 600),
("o3-mini", 300),
("o4-mini", 300),
# Anthropic Claude 4.x thinking variants. Anchored at
# ``claude-opus-4`` so non-thinking Claude 3.x or future
# non-reasoning Claude variants don't match.
("claude-opus-4", 240),
("claude-sonnet-4.5", 180),
("claude-sonnet-4.6", 180),
# xAI Grok reasoning variants. Explicit reasoning-only keys
# plus one for the ``non-reasoning`` variant so users picking
# the fast variant don't get the 300s floor. Bare ``grok-3``,
# ``grok-4`` etc. don't match — only the explicit reasoning /
# non-reasoning pairs.
("grok-4-fast-reasoning", 300),
("grok-4.20-reasoning", 300),
("grok-4-fast-non-reasoning", 180),
)
# Pre-compile each pattern. Wrapper = start-of-slug + slug + end-or-
# separator, where ``start-of-slug`` means start-of-string OR
# immediately after the last ``/`` (aggregator separator) and
# ``end-or-separator`` means end-of-string OR a ``-``/``.``/``_``.
#
# Why start-of-slug and not start-of-string: aggregator prefixes
# like ``openai/`` should not affect matching — the slug identity is
# the part after the last ``/``. Stripping the aggregator prefix in
# :func:`get_reasoning_stale_timeout_floor` before regex matching
# gives the wrapper a clean start-of-string anchor.
#
# Why end-or-separator on the right: ``openai/o3-mini`` must match
# the ``o3-mini`` slug (the right anchor is end-of-string). And
# ``openai/o3-mini-2025-01-31`` must also match ``o3-mini`` (the right
# anchor is the ``-`` separator). But ``openai/o3-mini-fork`` should
# NOT match ``o3-mini`` if we wanted to exclude forks — though the
# pattern ``o3-mini-fork`` would be matched as a derivative anyway,
# so we accept that community forks inheriting the same prefix are
# treated as reasoning models (a reasonable default — the upstream
# gateway timing is the same).
_PATTERN_CACHE: dict[str, re.Pattern[str]] = {}
def _get_pattern(slug: str) -> re.Pattern[str]:
compiled = _PATTERN_CACHE.get(slug)
if compiled is None:
compiled = re.compile(
r"^"
+ re.escape(slug)
+ r"(?:$|[\-._])"
)
_PATTERN_CACHE[slug] = compiled
return compiled
def _match_any(model_lower: str) -> Optional[float]:
"""Return the floor for the first matching slug, else None.
Each table entry is matched as a start-of-slug prefix with the
slug-separator-or-end-of-string right-anchor. Table iteration
order is irrelevant: longest slug wins (so ``o3-mini`` beats
``o3`` on a model like ``openai/o3-mini``).
"""
# Sort by slug length descending so longer / more-specific slugs
# win on shared prefixes (o3-mini beats o3).
sorted_floors = sorted(
_REASONING_STALE_TIMEOUT_FLOORS, key=lambda kv: -len(kv[0])
)
for slug, floor in sorted_floors:
if _get_pattern(slug).search(model_lower):
return float(floor)
return None
def get_reasoning_stale_timeout_floor(model: object) -> Optional[float]:
"""Return the stale-timeout floor (seconds) for a known reasoning model.
Returns ``None`` when the model is not in the allowlist or the
argument is empty / not a string. Matching uses
word-boundary-anchored regex on the lowercased model name, so
``openai/o3-mini`` matches the ``o3-mini`` slug but
``olmo-1`` does NOT match ``o1`` (the ``o1`` substring is not
at a word boundary inside ``olmo-1``).
Aggregator prefixes (``openai/``, ``x-ai/``, ``anthropic/`` etc.)
are preserved through matching — the ``/`` is itself a word
boundary, so ``openai/o3-mini`` matches ``o3-mini`` because the
``/`` before ``o3-mini`` satisfies the left-anchor alternation.
This is a FLOOR — callers must apply it as ``max(default, floor)``
and only when no explicit user-configured per-model
``stale_timeout_seconds`` exists.
>>> get_reasoning_stale_timeout_floor("nvidia/nemotron-3-ultra-550b-a55b")
600.0
>>> get_reasoning_stale_timeout_floor("openai/o3-mini")
300.0
>>> get_reasoning_stale_timeout_floor("deepseek/deepseek-r1")
600.0
>>> get_reasoning_stale_timeout_floor("qwen/qwen3-235b-a22b-thinking")
180.0
>>> get_reasoning_stale_timeout_floor("x-ai/grok-4-fast-reasoning")
300.0
>>> get_reasoning_stale_timeout_floor("anthropic/claude-opus-4-6")
240.0
>>> get_reasoning_stale_timeout_floor("gpt-4o") is None
True
>>> get_reasoning_stale_timeout_floor("olmo-1") is None
True
>>> get_reasoning_stale_timeout_floor(None) is None
True
"""
if not model or not isinstance(model, str):
return None
name = model.strip().lower()
if not name:
return None
# Strip aggregator prefix (everything before and including the
# last ``/``). The wrapper regex anchors at start-of-string, so
# the slug identity is the bare model name.
if "/" in name:
name = name.rsplit("/", 1)[1]
return _match_any(name)