docs: document streaming timeout auto-detection for local LLMs (#6990)

Add streaming timeout documentation to three pages:

- guides/local-llm-on-mac.md: New 'Timeouts' section with table of all
  three timeouts, their defaults, local auto-adjustments, and env var
  overrides
- reference/faq.md: Tip box in the local models FAQ section
- user-guide/configuration.md: 'Streaming Timeouts' subsection under
  the agent config section

Follow-up to #6967.
This commit is contained in:
Teknium 2026-04-09 23:28:25 -07:00 committed by GitHub
parent 0602ff8f58
commit d5023d36d8
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
3 changed files with 39 additions and 0 deletions

View file

@ -500,6 +500,20 @@ agent:
Budget pressure is enabled by default. The agent sees warnings naturally as part of tool results, encouraging it to consolidate its work and deliver a response before running out of iterations.
### Streaming Timeouts
The LLM streaming connection has two timeout layers. Both auto-adjust for local providers (localhost, LAN IPs) — no configuration needed for most setups.
| Timeout | Default | Local providers | Env var |
|---------|---------|----------------|---------|
| Socket read timeout | 120s | Auto-raised to 1800s | `HERMES_STREAM_READ_TIMEOUT` |
| Stale stream detection | 180s | Auto-disabled | `HERMES_STREAM_STALE_TIMEOUT` |
| API call (non-streaming) | 1800s | Unchanged | `HERMES_API_TIMEOUT` |
The **socket read timeout** controls how long httpx waits for the next chunk of data from the provider. Local LLMs can take minutes for prefill on large contexts before producing the first token, so Hermes raises this to 30 minutes when it detects a local endpoint. If you explicitly set `HERMES_STREAM_READ_TIMEOUT`, that value is always used regardless of endpoint detection.
The **stale stream detection** kills connections that receive SSE keep-alive pings but no actual content. This is disabled entirely for local providers since they don't send keep-alive pings during prefill.
## Context Pressure Warnings
Separate from iteration budget pressure, context pressure tracks how close the conversation is to the **compaction threshold** — the point where context compression fires to summarize older messages. This helps both you and the agent understand when the conversation is getting long.