mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
docs: document streaming timeout auto-detection for local LLMs (#6990)
Add streaming timeout documentation to three pages:

- `guides/local-llm-on-mac.md`: new 'Timeouts' section with a table of all three timeouts, their defaults, local auto-adjustments, and env var overrides
- `reference/faq.md`: tip box in the local models FAQ section
- `user-guide/configuration.md`: 'Streaming Timeouts' subsection under the agent config section

Follow-up to #6967.
This commit is contained in:
parent
0602ff8f58
commit
d5023d36d8
3 changed files with 39 additions and 0 deletions
@@ -217,3 +217,24 @@ hermes model
```

Select **Custom endpoint** and follow the prompts. It will ask for the base URL and model name — use the values from whichever backend you set up above.

---
## Timeouts

Hermes automatically detects local endpoints (localhost, LAN IPs) and relaxes its streaming timeouts. No configuration needed for most setups.

If you still hit timeout errors (e.g. very large contexts on slow hardware), you can override the streaming read timeout:
```bash
# In your .env — raise from the 120s default to 30 minutes
HERMES_STREAM_READ_TIMEOUT=1800
```

| Timeout | Default | Local auto-adjustment | Env var override |
|---------|---------|----------------------|------------------|
| Stream read (socket-level) | 120s | Raised to 1800s | `HERMES_STREAM_READ_TIMEOUT` |
| Stale stream detection | 180s | Disabled entirely | `HERMES_STREAM_STALE_TIMEOUT` |
| API call (non-streaming) | 1800s | No change needed | `HERMES_API_TIMEOUT` |

The stream read timeout is the one most likely to cause issues — it's the socket-level deadline for receiving the next chunk of data. During prefill on large contexts, local models may produce no output for minutes while processing the prompt. The auto-detection handles this transparently.
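
As an illustration of the kind of check involved (the function name and exact rules here are assumptions, not Hermes's actual code), local-endpoint detection can be sketched in Python:

```python
# Hypothetical sketch of local-endpoint detection (not Hermes's actual code):
# loopback and private (LAN) addresses count as local, so streaming
# timeouts can be relaxed for them.
import ipaddress
from urllib.parse import urlparse

def is_local_endpoint(base_url: str) -> bool:
    """Return True for localhost and private (LAN) hosts."""
    host = urlparse(base_url).hostname or ""
    if host == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a public DNS name: treat as remote
    return ip.is_loopback or ip.is_private
```

Under these rules, `http://127.0.0.1:11434/v1` (Ollama's default) and `http://192.168.1.50:8000/v1` would be treated as local, while `https://api.openai.com/v1` would not.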
@@ -84,6 +84,10 @@ This works with Ollama, vLLM, llama.cpp server, SGLang, LocalAI, and others. See
If you set a custom `num_ctx` in Ollama (e.g., `ollama run --num_ctx 16384`), make sure to set the matching context length in Hermes — Ollama's `/api/show` reports the model's *maximum* context, not the effective `num_ctx` you configured.
:::

:::tip Timeouts with local models
Hermes auto-detects local endpoints and relaxes streaming timeouts (read timeout raised from 120s to 1800s, stale stream detection disabled). If you still hit timeouts on very large contexts, set `HERMES_STREAM_READ_TIMEOUT=1800` in your `.env`. See the [Local LLM guide](../guides/local-llm-on-mac.md#timeouts) for details.
:::
### How much does it cost?

Hermes Agent itself is **free and open-source** (MIT license). You pay only for the LLM API usage from your chosen provider. Local models are completely free to run.
@@ -500,6 +500,20 @@ agent:
Budget pressure is enabled by default. The agent sees warnings naturally as part of tool results, encouraging it to consolidate its work and deliver a response before running out of iterations.

### Streaming Timeouts
The LLM streaming connection has two timeout layers; the non-streaming API timeout is listed alongside them for comparison. Both streaming layers auto-adjust for local providers (localhost, LAN IPs) — no configuration needed for most setups.

| Timeout | Default | Local providers | Env var |
|---------|---------|----------------|---------|
| Socket read timeout | 120s | Auto-raised to 1800s | `HERMES_STREAM_READ_TIMEOUT` |
| Stale stream detection | 180s | Auto-disabled | `HERMES_STREAM_STALE_TIMEOUT` |
| API call (non-streaming) | 1800s | Unchanged | `HERMES_API_TIMEOUT` |
The **socket read timeout** controls how long httpx waits for the next chunk of data from the provider. Local LLMs can take minutes for prefill on large contexts before producing the first token, so Hermes raises this to 30 minutes when it detects a local endpoint. If you explicitly set `HERMES_STREAM_READ_TIMEOUT`, that value is always used regardless of endpoint detection.

The **stale stream detection** kills connections that receive SSE keep-alive pings but no actual content. This is disabled entirely for local providers since they don't send keep-alive pings during prefill.
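
The override precedence described above can be sketched as follows (a minimal sketch: the helper name is hypothetical, and the constants mirror the documented defaults rather than Hermes's actual code):

```python
# Minimal sketch of read-timeout selection (not Hermes's actual code).
# Precedence: explicit HERMES_STREAM_READ_TIMEOUT > local auto-raise > default.
import os

DEFAULT_READ_TIMEOUT = 120.0  # seconds, remote providers
LOCAL_READ_TIMEOUT = 1800.0   # seconds, auto-raised for local endpoints

def effective_read_timeout(is_local: bool) -> float:
    override = os.environ.get("HERMES_STREAM_READ_TIMEOUT")
    if override is not None:
        return float(override)  # an explicit setting always wins
    return LOCAL_READ_TIMEOUT if is_local else DEFAULT_READ_TIMEOUT
```

With httpx, a value like this would typically feed the `read` field of an `httpx.Timeout`, leaving the connect and write deadlines short.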
## Context Pressure Warnings

Separate from iteration budget pressure, context pressure tracks how close the conversation is to the **compaction threshold** — the point where context compression fires to summarize older messages. This helps both you and the agent understand when the conversation is getting long.
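
As a toy illustration (the function and the token numbers are made up, not Hermes's actual thresholds), context pressure can be thought of as the fraction of the compaction threshold already consumed:

```python
# Toy sketch (names and numbers are illustrative, not Hermes's actual code):
# context pressure as the fraction of the compaction threshold consumed.
def context_pressure(tokens_in_context: int, compaction_threshold: int) -> float:
    """0.0 means an empty conversation; 1.0 means compaction is about to fire."""
    return min(tokens_in_context / compaction_threshold, 1.0)
```

For example, 90,000 tokens against a 120,000-token threshold gives a pressure of 0.75, i.e. the conversation is three quarters of the way to compaction.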