docs: document streaming timeout auto-detection for local LLMs (#6990)

Add streaming timeout documentation to three pages:

- guides/local-llm-on-mac.md: New 'Timeouts' section with a table of all three timeouts, their defaults, local auto-adjustments, and env var overrides
- reference/faq.md: Tip box in the local models FAQ section
- user-guide/configuration.md: 'Streaming Timeouts' subsection under the agent config section

Follow-up to #6967.
2026-04-25 00:51:20 +00:00 · 2026-04-09 23:28:25 -07:00
commit d5023d36d8
parent 0602ff8f58
3 changed files with 39 additions and 0 deletions
--- a/website/docs/user-guide/configuration.md
+++ b/website/docs/user-guide/configuration.md
@@ -500,6 +500,20 @@ agent:

 Budget pressure is enabled by default. The agent sees warnings naturally as part of tool results, encouraging it to consolidate its work and deliver a response before running out of iterations.

+### Streaming Timeouts
+
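+The LLM streaming connection has two timeout layers: a socket read timeout and stale stream detection. Both auto-adjust for local providers (localhost, LAN IPs), so most setups need no configuration. The non-streaming API call timeout is listed for comparison and is not adjusted.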
+
+| Timeout | Default | Local providers | Env var |
+|---------|---------|----------------|---------|
+| Socket read timeout | 120s | Auto-raised to 1800s | `HERMES_STREAM_READ_TIMEOUT` |
+| Stale stream detection | 180s | Auto-disabled | `HERMES_STREAM_STALE_TIMEOUT` |
+| API call (non-streaming) | 1800s | Unchanged | `HERMES_API_TIMEOUT` |
+
+The **socket read timeout** controls how long httpx waits for the next chunk of data from the provider. Local LLMs can take minutes for prefill on large contexts before producing the first token, so Hermes raises this to 30 minutes when it detects a local endpoint. If you explicitly set `HERMES_STREAM_READ_TIMEOUT`, that value is always used regardless of endpoint detection.
+
+The **stale stream detection** kills connections that receive SSE keep-alive pings but no actual content. This is disabled entirely for local providers since they don't send keep-alive pings during prefill.
+
 ## Context Pressure Warnings

 Separate from iteration budget pressure, context pressure tracks how close the conversation is to the **compaction threshold** — the point where context compression fires to summarize older messages. This helps both you and the agent understand when the conversation is getting long.
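
As an illustration of the Streaming Timeouts rules added in this hunk, the selection logic might look roughly like the sketch below. This is not the Hermes implementation: the function names, the local-endpoint heuristic, and the choice to let an explicit `HERMES_STREAM_STALE_TIMEOUT` win for local providers are all assumptions made for the example. Only the environment variable names and the default values come from the table.

```python
import ipaddress
import os
from urllib.parse import urlparse

DEFAULT_READ_TIMEOUT = 120.0   # seconds, remote default from the table
LOCAL_READ_TIMEOUT = 1800.0    # 30 minutes for local endpoints
DEFAULT_STALE_TIMEOUT = 180.0  # seconds, remote default from the table


def is_local_endpoint(base_url: str) -> bool:
    """Hypothetical heuristic: localhost, loopback, or a private (LAN) IP."""
    host = urlparse(base_url).hostname or ""
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # any other DNS name is treated as remote
    return addr.is_loopback or addr.is_private


def resolve_stream_timeouts(base_url, env=None):
    """Return (read_timeout, stale_timeout); a stale_timeout of None means disabled."""
    env = os.environ if env is None else env
    local = is_local_endpoint(base_url)

    # An explicit read timeout is always used, regardless of endpoint detection.
    read_override = env.get("HERMES_STREAM_READ_TIMEOUT")
    if read_override:
        read_timeout = float(read_override)
    else:
        read_timeout = LOCAL_READ_TIMEOUT if local else DEFAULT_READ_TIMEOUT

    # Assumption: an explicit stale timeout also takes precedence over auto-disable.
    stale_override = env.get("HERMES_STREAM_STALE_TIMEOUT")
    if stale_override:
        stale_timeout = float(stale_override)
    else:
        stale_timeout = None if local else DEFAULT_STALE_TIMEOUT

    return read_timeout, stale_timeout
```

Under these assumptions, `resolve_stream_timeouts("http://localhost:11434/v1", {})` yields a 1800-second read timeout with stale detection disabled, while a remote HTTPS endpoint with no overrides keeps the 120-second and 180-second defaults.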