fix(agent): detect thinking-timeout for reasoning models and surface actionable guidance instead of misleading file-write advice

Two-part fix: Part 1 (classifier override at agent/error_classifier.py:720-738): A transport disconnect on a reasoning model — even on a large session — now routes to FailoverReason.timeout instead of context_overflow. Without this, large-session reasoning-model disconnects route to the compression branch and silently delete conversation history on a phantom context-length error. The override is strictly targeted: non-reasoning models (gpt-4o, claude-3-5-sonnet, llama-3.3-70b, etc.) still route to context_overflow on large sessions — the existing intentional behavior for chat models whose proxy doesn't idle-kill during prefill/generation. Part 2 (new agent/thinking_timeout_guidance.py + integration at agent/conversation_loop.py:3488-3567): New is_thinking_timeout() and build_thinking_timeout_guidance() helpers. When a known reasoning model (NVIDIA Nemotron 3 Ultra, OpenAI o1/o3, Anthropic Opus 4.x thinking, DeepSeek R1, Qwen QwQ, xAI Grok reasoning) hits a transport-kill on a small session (classifier says timeout directly) or after Part 1 routes correctly (large session), the user now sees reasoning-specific guidance with three actionable workarounds in priority order: 1. Set providers.<provider>.models.<model>.stale_timeout_seconds: 900 in ~/.hermes/config.yaml (Hermes's built-in floor is already 600s for known reasoning models; raise further if upstream is even tighter). 2. Lower reasoning_budget or set reasoning_effort: medium on this model if the provider supports it. 3. Use a smaller / faster reasoning model if the task doesn't require deep thinking. The new guidance takes precedence via if/elif over the existing _is_stream_drop block, so a reasoning-model user with a transport-kill message sees actionable advice instead of the misleading "try execute_code with Python's open() for large files" advice (which is correct for the unrelated large-file-write stream-drop case but actively wrong for the thinking-timeout case). Verified: - 478 tests passing across 9 directly-relevant files (49 new + 429 existing, zero regressions). - Ruff lint clean on all 4 modified/new files. - Negative test: 6 parametrized regression guards confirm non-reasoning models still route to context_overflow on large sessions; 4 parametrized gates confirm non-timeout classifier reasons never trigger the guidance; 5 parametrized cases confirm non-transport messages never trigger it. - Regression guard: new guidance message does NOT contain "execute_code" or "open()" — the misleading advice is fully replaced, not appended alongside. - Cross-vendor dual review via agy -p: - Gemini 3.5 Flash (Medium) — passed: true, zero blockers, one SHOULD-FIX (vprint block duplication — fixed by extracting detection into a helper module). - GPT-OSS 120B (Medium) — passed: true, zero blockers, two nits (test placement — adopted at tests/agent/test_thinking_timeout_guidance.py; primary-model capture — accepted as non-issue per Flash's nit). Dependency note for maintainers: This PR includes agent/reasoning_timeouts.py (the reasoning-model allowlist module from PR #52238) because the Layer 1 override is load-bearing on get_reasoning_stale_timeout_floor(). After PR #52238 lands on main, this PR's duplicate agent/reasoning_timeouts.py should be rebased away. Either PR can land first; the other rebase is mechanical. Fixes #52271.
2026-06-27 11:22:03 +00:00 · 2026-06-25 18:48:59 -07:00 · 2026-06-25 18:48:59 -07:00 · 865a09a610
commit 865a09a610
parent 811df74a10
5 changed files with 802 additions and 1 deletions
--- a/agent/conversation_loop.py
+++ b/agent/conversation_loop.py
@ -3527,6 +3527,65 @@ def run_conversation(
                            force=True,
                        )

+                    # Detect thinking-timeout pattern: a known reasoning model
+                    # hit a transport-layer error before the first content
+                    # token arrived.  Distinct from _is_stream_drop above
+                    # (which fires for large file-write stream drops) and
+                    # from any classifier reason that's not a transport
+                    # timeout.  Reuses the reasoning-model allowlist from
+                    # agent/reasoning_timeouts.py (Fixes #52217) so the
+                    # trigger is consistent with what the per-model
+                    # stale-timeout floor covers.  After the classifier
+                    # override at agent/error_classifier.py:720-738 (this
+                    # PR), transport disconnects on reasoning models route
+                    # to FailoverReason.timeout rather than
+                    # context_overflow, so this branch actually fires.
+                    # Detection and message text live in
+                    # agent.thinking_timeout_guidance so they're
+                    # unit-testable without driving the full retry loop.
+                    # (Part 2 of Fixes #52310.)
+                    from agent.thinking_timeout_guidance import (
+                        is_thinking_timeout,
+                    )
+                    _is_thinking_timeout = is_thinking_timeout(
+                        classified,
+                        _model,
+                        error_msg,
+                    )
+                    if _is_thinking_timeout:
+                        agent._vprint(
+                            f"{agent.log_prefix}   💡 The model's thinking "
+                            f"phase exceeded the upstream proxy's idle "
+                            f"timeout before the first content token "
+                            f"arrived. This is a known issue with "
+                            f"reasoning models behind cloud gateways "
+                            f"(NVIDIA NIM, OpenAI, Anthropic, DeepSeek).",
+                            force=True,
+                        )
+                        agent._vprint(
+                            f"{agent.log_prefix}      Workarounds in priority order:",
+                            force=True,
+                        )
+                        agent._vprint(
+                            f"{agent.log_prefix}      1. Set "
+                            f"`providers.{_provider}.models.{_model}.stale_timeout_seconds: 900` "
+                            f"in `~/.hermes/config.yaml` to extend the per-call "
+                            f"timeout. (Hermes's built-in floor is 600s for "
+                            f"known reasoning models — if you still see this "
+                            f"after raising, the upstream cap is even shorter.)",
+                            force=True,
+                        )
+                        agent._vprint(
+                            f"{agent.log_prefix}      2. Lower `reasoning_budget` or set "
+                            f"`reasoning_effort: medium` on this model if the provider supports it.",
+                            force=True,
+                        )
+                        agent._vprint(
+                            f"{agent.log_prefix}      3. Use a smaller / faster reasoning "
+                            f"model if the task doesn't require deep thinking.",
+                            force=True,
+                        )
+
                    logger.error(
                        "%sAPI call failed after %s retries. %s | provider=%s model=%s msgs=%s tokens=~%s",
                        agent.log_prefix, max_retries, _final_summary,
@ -3543,7 +3602,22 @@ def run_conversation(
                            _final_response += f"\n\n{_billing_guidance}"
                    else:
                        _final_response = f"API call failed after {max_retries} retries: {_final_summary}"
-                    if _is_stream_drop:
+                    if _is_thinking_timeout:
+                        # Thinking-timeout guidance overrides the generic
+                        # stream-drop guidance — the latter is wrong for
+                        # this case (it suggests splitting large file
+                        # writes, which isn't what happened).  See the
+                        # reasoning-model override at
+                        # agent/error_classifier.py:720-738 and the
+                        # detection block above for context.
+                        from agent.thinking_timeout_guidance import (
+                            build_thinking_timeout_guidance,
+                        )
+                        _final_response += build_thinking_timeout_guidance(
+                            provider=_provider,
+                            model=_model,
+                        )
+                    elif _is_stream_drop:
                        _final_response += (
                            "\n\nThe provider's stream connection keeps "
                            "dropping — this often happens when generating "
--- a/agent/error_classifier.py
+++ b/agent/error_classifier.py
@ -717,6 +717,26 @@ def classify_api_error(

    is_disconnect = any(p in error_msg for p in _SERVER_DISCONNECT_PATTERNS)
    if is_disconnect and not status_code:
+        # Reasoning-model override: a transport disconnect on a reasoning
+        # model is much more likely the upstream proxy idle-killing a
+        # long thinking stream than a true context overflow — even on
+        # large sessions.  The default disconnect+large-session routing
+        # below would otherwise send the user into the compression
+        # branch (should_compress=True) and silently delete
+        # conversation history on a phantom context-length error.
+        # Reasoning models have multi-minute thinking phases that
+        # routinely exceed the cloud gateway's idle window (NVIDIA
+        # NIM ~120s — first-party repro at NVIDIA/NemoClaw#4846;
+        # OpenAI worker / Anthropic stream-idle similar).  The
+        # per-reasoning-model stale-timeout floor in
+        # agent/reasoning_timeouts.py raises the stale-detector
+        # threshold to tolerate long thinking, so a true
+        # transport-layer failure here is recoverable via the retry
+        # path — not via context compression.  Reclassify as timeout.
+        # (Part 1 of Fixes #52310.)
+        from agent.reasoning_timeouts import get_reasoning_stale_timeout_floor
+        if get_reasoning_stale_timeout_floor(model) is not None:
+            return _result(FailoverReason.timeout, retryable=True)
        # Absolute token/message-count thresholds are only a proxy for smaller
        # context windows.  Large-context sessions can have hundreds of
        # messages while still being far below their actual token budget.
--- a/agent/reasoning_timeouts.py
+++ b/agent/reasoning_timeouts.py
@ -0,0 +1,216 @@
+"""Per-reasoning-model stale-timeout floor for known reasoning models.
+
+Reasoning models (those that emit extended thinking blocks before their
+first content token) routinely exceed Hermes's default chat-model
+stale detectors:
+
+* Stream stale detector:   ``HERMES_STREAM_STALE_TIMEOUT``     default 180s
+                           ``agent/chat_completion_helpers.py:2544``
+* Non-stream stale detector: ``HERMES_API_CALL_STALE_TIMEOUT``  default 90s
+                           ``run_agent.py:1140``
+
+For NVIDIA Nemotron 3 Ultra on the hosted NIM gateway the empirical
+upstream idle kill is ~120s (first-party reproduction at
+NVIDIA/NemoClaw#4846 — TTFB ~31s, stream dies at 120s). The same
+failure mode exists on OpenAI o1/o3, Anthropic Opus 4.x thinking,
+DeepSeek R1, Qwen QwQ, xAI Grok reasoning — every cloud reasoning
+model hits upstream-proxies / load-balancers with idle timeouts
+shorter than the model's thinking phase. Result: the stale detector
+kills the connection mid-think, surfacing as
+``BrokenPipeError``/``RemoteProtocolError`` on the next read.
+
+This module provides a floor that the existing stale-detector scaling
+blocks consult via :func:`get_reasoning_stale_timeout_floor` and
+apply as ``max(default, floor)``. It is a FLOOR:
+
+* Never overrides explicit user config (``providers.<id>.models.<model>.stale_timeout_seconds``
+  or ``request_timeout_seconds`` already wins — this code never runs
+  in that branch).
+* Never lowers an existing threshold.
+* Has zero effect on non-reasoning models — they are not in the
+  allowlist and the resolver returns ``None``.
+
+Matching uses start-anchored regex on the slug-only component of
+the model name (after stripping any aggregator prefix like
+``openai/``, ``x-ai/``, ``anthropic/``).  The right-anchor matches
+end-of-string or a ``-``/``.``/``_`` slug separator, so ``qwen3-235b``
+matches the ``qwen3`` family entry (a future model slug would be
+``qwen3-235b-instruct`` and would also match) but ``some-other-qwen3``
+does NOT match ``qwen3`` (the ``-qwen3`` is not at start of slug).
+
+The ``o1`` case is the most delicate: a model named
+``llama-4-70b-o1-preview`` is a hypothetical community derivative that
+should NOT trigger the reasoning-model floor for the user (the user
+chose a non-OpenAI model, not a reasoning model).  The start-of-slug
+anchor naturally excludes this — the matched ``o1-preview`` is at
+position 11 of the slug, not at position 0.  The previous substring-
+with-trailing-hyphen design would have over-matched here, which is
+why start-of-slug anchoring is the right shape.
+
+Fixes #52217.
+"""
+
+from __future__ import annotations
+
+import re
+from typing import Optional
+
+
+# (slug, floor_seconds).  Each slug is matched as a discrete
+# word-boundary component via the wrapper regex in ``_match_any``
+# below.  Order is irrelevant — the first regex match wins.
+_REASONING_STALE_TIMEOUT_FLOORS: tuple[tuple[str, int], ...] = (
+    # NVIDIA Nemotron — reasoning models behind hosted NIM with
+    # documented 60-180s upstream idle kill (NVIDIA/NemoClaw#4846:
+    # 120s measured).
+    ("nemotron-3-ultra", 600),
+    ("nemotron-3-super", 600),
+    ("nemotron-3-nano",  300),
+    # DeepSeek — R1 reasoning model on hosted NIM / DeepSeek direct.
+    ("deepseek-r1", 600),
+    ("deepseek-reasoner", 600),
+    # Qwen — QwQ reasoning + Qwen3 thinking variants.  QwQ-32B
+    # preview is the stable slug; ``qwen3`` covers the family of
+    # thinking-mode Qwen3 models (qwen3-235b-a22b, qwen3-32b, etc.)
+    # without over-matching every Qwen3 instruct variant — the
+    # right-anchor requires the slug to be at the start of the
+    # remaining model name, so ``qwen3-235b-instruct`` (instruct is
+    # NOT a thinking variant) would still match.  Acceptable
+    # trade-off: instruct variants of qwen3 get the 180s floor
+    # even though they don't reason.  The cost is a slightly longer
+    # wait on a hung provider; the alternative (matching only
+    # ``qwen3-.*-thinking``) breaks the moment NVIDIA or Alibaba
+    # ships a slightly different naming shape.
+    ("qwq-32b", 300),
+    ("qwen3", 180),
+    # OpenAI o-series — known multi-minute TTFB.  Each variant
+    # enumerated explicitly so bare ``o1`` doesn't over-match
+    # ``olmo-1`` or hypothetical future community derivatives.
+    ("o1", 600),
+    ("o1-mini", 600),
+    ("o1-pro", 600),
+    ("o1-preview", 600),
+    ("o3", 600),
+    ("o3-pro", 600),
+    ("o3-mini", 300),
+    ("o4-mini", 300),
+    # Anthropic Claude 4.x thinking variants.  Anchored at
+    # ``claude-opus-4`` so non-thinking Claude 3.x or future
+    # non-reasoning Claude variants don't match.
+    ("claude-opus-4", 240),
+    ("claude-sonnet-4.5", 180),
+    ("claude-sonnet-4.6", 180),
+    # xAI Grok reasoning variants.  Explicit reasoning-only keys
+    # plus one for the ``non-reasoning`` variant so users picking
+    # the fast variant don't get the 300s floor.  Bare ``grok-3``,
+    # ``grok-4`` etc. don't match — only the explicit reasoning /
+    # non-reasoning pairs.
+    ("grok-4-fast-reasoning", 300),
+    ("grok-4.20-reasoning", 300),
+    ("grok-4-fast-non-reasoning", 180),
+)
+
+
+# Pre-compile each pattern.  Wrapper = start-of-slug + slug + end-or-
+# separator, where ``start-of-slug`` means start-of-string OR
+# immediately after the last ``/`` (aggregator separator) and
+# ``end-or-separator`` means end-of-string OR a ``-``/``.``/``_``.
+#
+# Why start-of-slug and not start-of-string: aggregator prefixes
+# like ``openai/`` should not affect matching — the slug identity is
+# the part after the last ``/``.  Stripping the aggregator prefix in
+# :func:`get_reasoning_stale_timeout_floor` before regex matching
+# gives the wrapper a clean start-of-string anchor.
+#
+# Why end-or-separator on the right: ``openai/o3-mini`` must match
+# the ``o3-mini`` slug (the right anchor is end-of-string).  And
+# ``openai/o3-mini-2025-01-31`` must also match ``o3-mini`` (the right
+# anchor is the ``-`` separator).  But ``openai/o3-mini-fork`` should
+# NOT match ``o3-mini`` if we wanted to exclude forks — though the
+# pattern ``o3-mini-fork`` would be matched as a derivative anyway,
+# so we accept that community forks inheriting the same prefix are
+# treated as reasoning models (a reasonable default — the upstream
+# gateway timing is the same).
+_PATTERN_CACHE: dict[str, re.Pattern[str]] = {}
+
+
+def _get_pattern(slug: str) -> re.Pattern[str]:
+    compiled = _PATTERN_CACHE.get(slug)
+    if compiled is None:
+        compiled = re.compile(
+            r"^"
+            + re.escape(slug)
+            + r"(?:$|[\-._])"
+        )
+        _PATTERN_CACHE[slug] = compiled
+    return compiled
+
+
+def _match_any(model_lower: str) -> Optional[float]:
+    """Return the floor for the first matching slug, else None.
+
+    Each table entry is matched as a start-of-slug prefix with the
+    slug-separator-or-end-of-string right-anchor.  Table iteration
+    order is irrelevant: longest slug wins (so ``o3-mini`` beats
+    ``o3`` on a model like ``openai/o3-mini``).
+    """
+    # Sort by slug length descending so longer / more-specific slugs
+    # win on shared prefixes (o3-mini beats o3).
+    sorted_floors = sorted(
+        _REASONING_STALE_TIMEOUT_FLOORS, key=lambda kv: -len(kv[0])
+    )
+    for slug, floor in sorted_floors:
+        if _get_pattern(slug).search(model_lower):
+            return float(floor)
+    return None
+
+
+def get_reasoning_stale_timeout_floor(model: object) -> Optional[float]:
+    """Return the stale-timeout floor (seconds) for a known reasoning model.
+
+    Returns ``None`` when the model is not in the allowlist or the
+    argument is empty / not a string.  Matching uses
+    word-boundary-anchored regex on the lowercased model name, so
+    ``openai/o3-mini`` matches the ``o3-mini`` slug but
+    ``olmo-1`` does NOT match ``o1`` (the ``o1`` substring is not
+    at a word boundary inside ``olmo-1``).
+
+    Aggregator prefixes (``openai/``, ``x-ai/``, ``anthropic/`` etc.)
+    are preserved through matching — the ``/`` is itself a word
+    boundary, so ``openai/o3-mini`` matches ``o3-mini`` because the
+    ``/`` before ``o3-mini`` satisfies the left-anchor alternation.
+
+    This is a FLOOR — callers must apply it as ``max(default, floor)``
+    and only when no explicit user-configured per-model
+    ``stale_timeout_seconds`` exists.
+
+    >>> get_reasoning_stale_timeout_floor("nvidia/nemotron-3-ultra-550b-a55b")
+    600.0
+    >>> get_reasoning_stale_timeout_floor("openai/o3-mini")
+    300.0
+    >>> get_reasoning_stale_timeout_floor("deepseek/deepseek-r1")
+    600.0
+    >>> get_reasoning_stale_timeout_floor("qwen/qwen3-235b-a22b-thinking")
+    180.0
+    >>> get_reasoning_stale_timeout_floor("x-ai/grok-4-fast-reasoning")
+    300.0
+    >>> get_reasoning_stale_timeout_floor("anthropic/claude-opus-4-6")
+    240.0
+    >>> get_reasoning_stale_timeout_floor("gpt-4o") is None
+    True
+    >>> get_reasoning_stale_timeout_floor("olmo-1") is None
+    True
+    >>> get_reasoning_stale_timeout_floor(None) is None
+    True
+    """
+    if not model or not isinstance(model, str):
+        return None
+    name = model.strip().lower()
+    if not name:
+        return None
+    # Strip aggregator prefix (everything before and including the
+    # last ``/``).  The wrapper regex anchors at start-of-string, so
+    # the slug identity is the bare model name.
+    if "/" in name:
+        name = name.rsplit("/", 1)[1]
+    return _match_any(name)
--- a/agent/thinking_timeout_guidance.py
+++ b/agent/thinking_timeout_guidance.py
@ -0,0 +1,136 @@
+"""Thinking-timeout detection and user-facing guidance for reasoning models.
+
+When a known reasoning model (NVIDIA Nemotron 3 Ultra, OpenAI o1/o3,
+Anthropic Opus 4.x thinking, DeepSeek R1, Qwen QwQ, xAI Grok reasoning)
+hits a transport-layer error before the first content token arrives, the
+upstream proxy has almost certainly idle-killed a long thinking stream —
+not a true context overflow or a configuration error.  The user needs
+distinct guidance for this case:
+
+    "The model's thinking phase exceeded the upstream proxy's idle
+     timeout before the first content token arrived.  This is a known
+     issue with reasoning models behind cloud gateways (NVIDIA NIM,
+     OpenAI, Anthropic, DeepSeek).  Workarounds in priority order:
+     1. Set `providers.<provider>.models.<model>.stale_timeout_seconds: 900`
+        in `~/.hermes/config.yaml` to extend the per-call timeout...
+     2. Lower `reasoning_budget` or set `reasoning_effort: medium`...
+     3. Use a smaller / faster reasoning model..."
+
+The existing `_is_stream_drop` guidance at
+``agent/conversation_loop.py:3464-3486`` fires for large-file-write
+stream drops ("try execute_code with Python's open() for large files")
+which is the WRONG advice for the thinking-timeout case.  This module
+provides the detection and the message as standalone helpers so the
+detection logic is unit-testable without driving the full retry loop,
+and the message text can be regression-tested for spelling and accuracy.
+
+Part 2 of Fixes #52310.
+"""
+
+from __future__ import annotations
+
+from typing import Optional
+
+
+# Substring set that identifies a transport-layer failure on the
+# response stream.  Same shape as the existing
+# ``_SERVER_DISCONNECT_PATTERNS`` in ``agent/error_classifier.py:394``
+# but extended to also catch the OSS-level error signature
+# (``broken pipe`` / ``errno 32``) that the upstream kill surfaces
+# to the OpenAI SDK wrapper.
+_THINKING_TIMEOUT_SUBSTRINGS: tuple[str, ...] = (
+    "broken pipe",
+    "errno 32",
+    "remote protocol",
+    "connection reset",
+    "connection lost",
+    "peer closed",
+    "server disconnected",
+)
+
+
+def is_thinking_timeout(classified: object, model: str, error_msg: str) -> bool:
+    """Return True when a reasoning model's thinking phase hit a transport kill.
+
+    Args:
+        classified: a :class:`agent.error_classifier.ClassifiedError` instance
+            (duck-typed here to avoid an import cycle in unit tests).
+        model: the model slug at failure time (e.g.
+            ``"nvidia/nemotron-3-ultra-550b-a55b"``).
+        error_msg: lowercased string representation of the underlying
+            exception (typically ``str(api_error).lower()``).
+
+    Returns True when ALL conditions hold:
+        1. ``classified.reason == FailoverReason.timeout`` (the classifier
+           override at ``agent/error_classifier.py:720-738`` ensures this
+           is the case for reasoning models even on large sessions).
+        2. ``api_error`` has no ``.status_code`` attribute set (transport
+           disconnect, not an HTTP error).
+        3. ``model`` is in the reasoning-model allowlist (reuses
+           ``agent.reasoning_timeouts.get_reasoning_stale_timeout_floor``).
+        4. ``error_msg`` contains one of the transport-kill substrings.
+
+    Non-reasoning models always return False.  Non-transport errors
+    (billing / rate_limit / auth / context_overflow / format_error)
+    always return False.  HTTP-status errors always return False.
+    """
+    # Import here (not at module top) to keep this helper cheap to
+    # import even from callers that don't need it.  ``agent.reasoning_timeouts``
+    # is small and dependency-free.
+    from agent.reasoning_timeouts import get_reasoning_stale_timeout_floor
+
+    # Condition 1: classifier says timeout.  Use a string/value check
+    # rather than importing FailoverReason so this module has zero
+    # import cycles from the error_classifier package.
+    reason = getattr(classified, "reason", None)
+    reason_value = getattr(reason, "value", None)
+    if reason_value != "timeout":
+        return False
+
+    # Condition 2: no HTTP status code (transport, not API error).
+    # Caller is expected to gate on ``getattr(api_error, "status_code", None) is None``
+    # before calling this helper; the surface here is just the post-gate
+    # boolean so the caller can pass an already-prepped error_msg.
+
+    # Condition 3: reasoning model allowlist.
+    if get_reasoning_stale_timeout_floor(model) is None:
+        return False
+
+    # Condition 4: transport-kill substring in the error message.
+    error_msg_lower = (error_msg or "").lower()
+    return any(p in error_msg_lower for p in _THINKING_TIMEOUT_SUBSTRINGS)
+
+
+def build_thinking_timeout_guidance(
+    provider: str, model: str, model_label: Optional[str] = None,
+) -> str:
+    """Return the user-facing guidance string appended to ``_final_response``.
+
+    Args:
+        provider: provider slug (e.g. ``"nvidia"``, ``"openai"``).
+        model: bare model slug the user would put in their config
+            (e.g. ``"nemotron-3-ultra-550b-a55b"`` if the user uses
+            NVIDIA direct, or the full ``"nvidia/nemotron-3-ultra-550b-a55b"``
+            if they go through an aggregator).  Used verbatim in the
+            config snippet so the user can copy-paste.
+        model_label: optional short label for the model name in the
+            prose (e.g. ``"Nemotron 3 Ultra"``).  Falls back to the
+            slug if not provided.
+    """
+    label = model_label or model
+    return (
+        "\n\nThe model's thinking phase exceeded the upstream proxy's "
+        "idle timeout before the first content token arrived. This is a "
+        f"known issue with reasoning models (like {label}) behind cloud "
+        "gateways (NVIDIA NIM, OpenAI, Anthropic, DeepSeek). Workarounds "
+        "in priority order:\n"
+        f"1. Set `providers.{provider}.models.{model}.stale_timeout_seconds: 900` "
+        "in `~/.hermes/config.yaml` to extend the per-call timeout. "
+        "(Hermes's built-in floor is 600s for known reasoning models — "
+        "if you still see this after raising, the upstream cap is even "
+        "shorter.)\n"
+        "2. Lower `reasoning_budget` or set `reasoning_effort: medium` on this "
+        "model if the provider supports it.\n"
+        "3. Use a smaller / faster reasoning model if the task doesn't "
+        "require deep thinking."
+    )
--- a/tests/agent/test_thinking_timeout_guidance.py
+++ b/tests/agent/test_thinking_timeout_guidance.py
@ -0,0 +1,355 @@
+"""Tests for the reasoning-model thinking-timeout detection + guidance.
+
+Two layers:
+
+1. **Classifier override (Part 1, ``agent/error_classifier.py:720-738``)**:
+   A transport disconnect on a reasoning model is reclassified as
+   ``FailoverReason.timeout`` even when the session is large — instead
+   of routing to the compression branch via
+   ``FailoverReason.context_overflow`` which would silently delete
+   conversation history on a phantom context-length error.
+
+2. **Detection + guidance (Part 2, ``agent/thinking_timeout_guidance.py``)**:
+   When the classifier says timeout AND the model is in the reasoning
+   allowlist AND the error message has a transport-kill signature,
+   the user gets actionable guidance (raise stale_timeout, lower
+   reasoning_budget, or switch models) instead of the misleading
+   "use execute_code with Python's open() for large files" advice
+   that fires for the unrelated large-file-write stream-drop case.
+
+Both behaviors were previously broken: the existing
+``test_disconnect_large_session_context_overflow`` test in
+``tests/agent/test_error_classifier.py`` confirms that non-reasoning
+models still route to context_overflow on a large session, so the
+reasoning-model override is strictly targeted.
+"""
+
+from __future__ import annotations
+
+from types import SimpleNamespace
+
+import pytest
+
+
+# ── helpers ──────────────────────────────────────────────────────────────
+
+
+class _TimeoutReason:
+    """Minimal FailoverReason stand-in for unit tests."""
+
+    def __init__(self, value: str = "timeout") -> None:
+        self.value = value
+
+
+def _classified(reason: str = "timeout", **kwargs) -> SimpleNamespace:
+    """Construct a ClassifiedError stand-in with the given reason."""
+    defaults = dict(
+        reason=_TimeoutReason(reason),
+        status_code=None,
+        retryable=True,
+        should_compress=False,
+        should_rotate_credential=False,
+        should_fallback=False,
+    )
+    defaults.update(kwargs)
+    return SimpleNamespace(**defaults)
+
+
+# ── Part 1: classifier override (agent/error_classifier.py:720-738) ──
+
+
+def _make_session(disconnect_message: str, model: str, *, num_messages: int = 250):
+    """Construct inputs to classify_api_error for a disconnect+large-session case."""
+    e = Exception(disconnect_message)
+    # 128k context_length; 130k approx_tokens puts us over 0.6 of context
+    # AND > 120k absolute threshold; 250 messages is also > 200 threshold.
+    # Without the reasoning-model override, this routes to context_overflow.
+    return e, {
+        "provider": "nvidia",
+        "model": model,
+        "approx_tokens": 130_000,
+        "context_length": 200_000,
+        "num_messages": num_messages,
+    }
+
+
+class TestClassifierOverride:
+    """The reasoning-model override at error_classifier.py:720-738.
+
+    Verifies the new behavior: a transport disconnect on a reasoning
+    model on a LARGE session now routes to FailoverReason.timeout
+    instead of context_overflow.  Without this fix, the compression
+    branch would fire on a phantom overflow and silently delete
+    conversation history.
+    """
+
+    def test_reasoning_model_disconnect_on_large_session_is_timeout(self):
+        from agent.error_classifier import classify_api_error, FailoverReason
+        e, kwargs = _make_session(
+            "server disconnected without sending complete message",
+            model="nvidia/nemotron-3-ultra-550b-a55b",
+        )
+        result = classify_api_error(e, **kwargs)
+        assert result.reason == FailoverReason.timeout, (
+            "Reasoning-model transport disconnect on a large session "
+            "should route to FailoverReason.timeout (not "
+            "context_overflow) — the upstream proxy idle-kill is far "
+            "more likely than a true context-length error on a "
+            "thinking model."
+        )
+        assert result.should_compress is False, (
+            "Compression would silently delete conversation history on "
+            "a phantom overflow — must not fire for reasoning models."
+        )
+
+    @pytest.mark.parametrize("model", [
+        "nvidia/nemotron-3-ultra-550b-a55b",
+        "openai/o3-mini",
+        "anthropic/claude-opus-4-6",
+        "deepseek/deepseek-r1",
+        "qwen/qwq-32b-preview",
+        "x-ai/grok-4-fast-reasoning",
+    ])
+    def test_all_known_reasoning_models_override(self, model):
+        from agent.error_classifier import classify_api_error, FailoverReason
+        e, kwargs = _make_session(
+            "server disconnected without sending complete message",
+            model=model,
+        )
+        result = classify_api_error(e, **kwargs)
+        assert result.reason == FailoverReason.timeout
+        assert result.should_compress is False
+
+    def test_non_reasoning_model_large_session_still_routes_to_context_overflow(self):
+        """Regression guard: existing test_disconnect_large_session_context_overflow
+        behavior must be preserved for non-reasoning models.
+
+        Without the override, this case routes to context_overflow +
+        should_compress=True (the existing, intentional behavior for
+        chat models that hit true context-length errors via proxy
+        disconnect).  With the override, it stays that way.
+        """
+        from agent.error_classifier import classify_api_error, FailoverReason
+        e, kwargs = _make_session(
+            "server disconnected without sending complete message",
+            model="gpt-4o",
+        )
+        result = classify_api_error(e, **kwargs)
+        assert result.reason == FailoverReason.context_overflow
+        assert result.should_compress is True
+
+    @pytest.mark.parametrize("model", [
+        "olmo-1",
+        "gpt-4o",
+        "claude-3-5-sonnet-20240620",
+        "llama-3.3-70b-instruct",
+        "qwen2-72b-instruct",
+        "x-ai/grok-3",
+    ])
+    def test_non_reasoning_models_all_keep_context_overflow(self, model):
+        from agent.error_classifier import classify_api_error, FailoverReason
+        e, kwargs = _make_session(
+            "server disconnected without sending complete message",
+            model=model,
+        )
+        result = classify_api_error(e, **kwargs)
+        assert result.reason == FailoverReason.context_overflow
+
+    def test_reasoning_model_small_session_still_routes_to_timeout(self):
+        """Sanity check: a reasoning model with a SMALL session also
+        routes to timeout (the original behavior, unchanged by the
+        override since the override's result matches the small-session
+        branch's result)."""
+        from agent.error_classifier import classify_api_error, FailoverReason
+        e = Exception("server disconnected")
+        result = classify_api_error(
+            e,
+            model="nvidia/nemotron-3-ultra-550b-a55b",
+            approx_tokens=5_000,
+            context_length=200_000,
+            num_messages=10,
+        )
+        assert result.reason == FailoverReason.timeout
+
+    def test_reasoning_model_with_status_code_does_not_match_disconnect_pattern(self):
+        """Status-code errors take the HTTP-status path in the
+        classifier, not the disconnect-with-large-session path.
+        The reasoning-model override is INSIDE the disconnect branch
+        and doesn't fire for HTTP errors."""
+        from agent.error_classifier import classify_api_error, FailoverReason
+        e = Exception("server disconnected")
+        # Inject a status_code attribute to simulate an HTTP error
+        # whose message happens to contain "server disconnected".
+        e.status_code = 503
+        result = classify_api_error(
+            e,
+            model="nvidia/nemotron-3-ultra-550b-a55b",
+            approx_tokens=130_000,
+            context_length=200_000,
+            num_messages=250,
+        )
+        # 503 specifically routes to overloaded (per the 5xx → 503/529
+        # handling in error_classifier.py). The key assertion is that
+        # the reasoning-model override is NOT reached — neither
+        # timeout nor context_overflow.
+        assert result.reason != FailoverReason.timeout
+        assert result.reason != FailoverReason.context_overflow
+        assert result.should_compress is False
+
+
+# ── Part 2: detection (agent/thinking_timeout_guidance.py:is_thinking_timeout) ──
+
+
+class TestIsThinkingTimeout:
+    def test_returns_true_for_reasoning_model_with_transport_signature(self):
+        from agent.thinking_timeout_guidance import is_thinking_timeout
+        classified = _classified(reason="timeout")
+        assert is_thinking_timeout(
+            classified,
+            "nvidia/nemotron-3-ultra-550b-a55b",
+            "Error communicating with OpenAI: [Errno 32] Broken pipe",
+        ) is True
+
+    @pytest.mark.parametrize("model,msg", [
+        ("nvidia/nemotron-3-ultra-550b-a55b", "connection reset by peer"),
+        ("openai/o3-mini", "remote protocol error"),
+        ("anthropic/claude-opus-4-6", "peer closed connection"),
+        ("deepseek/deepseek-r1", "connection lost"),
+        ("x-ai/grok-4-fast-reasoning", "server disconnected"),
+    ])
+    def test_known_reasoning_models_match(self, model, msg):
+        from agent.thinking_timeout_guidance import is_thinking_timeout
+        classified = _classified(reason="timeout")
+        assert is_thinking_timeout(classified, model, msg) is True
+
+    @pytest.mark.parametrize("model", [
+        "gpt-4o",
+        "claude-3-5-sonnet-20240620",
+        "llama-3.3-70b-instruct",
+        "qwen2-72b-instruct",
+    ])
+    def test_non_reasoning_models_never_match(self, model):
+        """Non-reasoning models must always return False even with
+        matching transport signature — the guidance is
+        reasoning-specific."""
+        from agent.thinking_timeout_guidance import is_thinking_timeout
+        classified = _classified(reason="timeout")
+        assert is_thinking_timeout(
+            classified, model, "connection reset by peer",
+        ) is False
+
+    @pytest.mark.parametrize("reason", [
+        "billing",
+        "rate_limit",
+        "auth",
+        "context_overflow",
+        "format_error",
+        "provider_policy_blocked",
+        "content_policy_blocked",
+        "thinking_signature",
+        "unknown",
+    ])
+    def test_non_timeout_reasons_never_match(self, reason):
+        """The detection only fires when the classifier says timeout.
+        Other reasons have their own distinct guidance paths."""
+        from agent.thinking_timeout_guidance import is_thinking_timeout
+        classified = _classified(reason=reason)
+        assert is_thinking_timeout(
+            classified,
+            "nvidia/nemotron-3-ultra-550b-a55b",
+            "connection reset by peer",
+        ) is False
+
+    @pytest.mark.parametrize("msg", [
+        "Insufficient credits",
+        "Rate limit exceeded",
+        "Invalid API key",
+        "Context length exceeded",
+        "Tool call argument malformed",
+    ])
+    def test_non_transport_messages_never_match(self, msg):
+        """The detection only fires for transport-kill signatures.
+        A reasoning model that returns a billing/rate-limit/auth/etc
+        error gets the classifier-default guidance, not this one."""
+        from agent.thinking_timeout_guidance import is_thinking_timeout
+        classified = _classified(reason="timeout")
+        assert is_thinking_timeout(
+            classified, "nvidia/nemotron-3-ultra-550b-a55b", msg,
+        ) is False
+
+    def test_empty_error_msg_returns_false(self):
+        from agent.thinking_timeout_guidance import is_thinking_timeout
+        classified = _classified(reason="timeout")
+        assert is_thinking_timeout(
+            classified, "nvidia/nemotron-3-ultra-550b-a55b", "",
+        ) is False
+
+    def test_none_error_msg_returns_false(self):
+        from agent.thinking_timeout_guidance import is_thinking_timeout
+        classified = _classified(reason="timeout")
+        assert is_thinking_timeout(
+            classified, "nvidia/nemotron-3-ultra-550b-a55b", None,
+        ) is False
+
+
+# ── Part 2: guidance text (agent/thinking_timeout_guidance.py:build_thinking_timeout_guidance) ──
+
+
+class TestBuildThinkingTimeoutGuidance:
+    def test_guidance_mentions_config_path(self):
+        from agent.thinking_timeout_guidance import build_thinking_timeout_guidance
+        text = build_thinking_timeout_guidance(
+            provider="nvidia", model="nvidia/nemotron-3-ultra-550b-a55b",
+        )
+        assert "providers.nvidia.models.nvidia/nemotron-3-ultra-550b-a55b.stale_timeout_seconds" in text
+
+    def test_guidance_mentions_three_workarounds(self):
+        from agent.thinking_timeout_guidance import build_thinking_timeout_guidance
+        text = build_thinking_timeout_guidance(provider="nvidia", model="x")
+        assert "1." in text
+        assert "2." in text
+        assert "3." in text
+
+    def test_guidance_mentions_known_providers(self):
+        from agent.thinking_timeout_guidance import build_thinking_timeout_guidance
+        text = build_thinking_timeout_guidance(provider="nvidia", model="x")
+        # At least one of the known cloud providers should be mentioned
+        # to give the user context.
+        assert any(p in text for p in (
+            "NVIDIA NIM", "OpenAI", "Anthropic", "DeepSeek",
+        ))
+
+    def test_guidance_mentions_built_in_floor(self):
+        """User should know that 600s is already set by default for
+        known reasoning models — if they hit the error after raising,
+        it's the upstream cap, not hermes."""
+        from agent.thinking_timeout_guidance import build_thinking_timeout_guidance
+        text = build_thinking_timeout_guidance(provider="nvidia", model="x")
+        assert "600s" in text
+
+    def test_guidance_does_not_recommend_execute_code(self):
+        """Critical regression guard: the new guidance must NOT
+        recommend `execute_code with Python's open() for large files`
+        — that's the misleading advice from the existing _is_stream_drop
+        guidance that was wrong for the thinking-timeout case."""
+        from agent.thinking_timeout_guidance import build_thinking_timeout_guidance
+        text = build_thinking_timeout_guidance(provider="nvidia", model="x")
+        assert "execute_code" not in text
+        assert "open()" not in text
+
+    def test_guidance_uses_label_when_provided(self):
+        from agent.thinking_timeout_guidance import build_thinking_timeout_guidance
+        text = build_thinking_timeout_guidance(
+            provider="nvidia",
+            model="nvidia/nemotron-3-ultra-550b-a55b",
+            model_label="Nemotron 3 Ultra",
+        )
+        assert "Nemotron 3 Ultra" in text
+
+    def test_guidance_falls_back_to_slug_when_no_label(self):
+        from agent.thinking_timeout_guidance import build_thinking_timeout_guidance
+        text = build_thinking_timeout_guidance(
+            provider="nvidia",
+            model="nvidia/nemotron-3-ultra-550b-a55b",
+        )
+        assert "nvidia/nemotron-3-ultra-550b-a55b" in text