diff --git a/website/docs/guides/local-llm-on-mac.md b/website/docs/guides/local-llm-on-mac.md
index e0a82c7ff..975ba6b12 100644
--- a/website/docs/guides/local-llm-on-mac.md
+++ b/website/docs/guides/local-llm-on-mac.md
@@ -217,3 +217,24 @@
 hermes model
 ```
 Select **Custom endpoint** and follow the prompts. It will ask for the base URL and model name — use the values from whichever backend you set up above.
+
+---
+
+## Timeouts
+
+Hermes automatically detects local endpoints (localhost, LAN IPs) and relaxes its streaming timeouts. No configuration needed for most setups.
+
+If you still hit timeout errors (e.g. very large contexts on slow hardware), you can override the streaming read timeout:
+
+```bash
+# In your .env — raise from the 120s default to 30 minutes
+HERMES_STREAM_READ_TIMEOUT=1800
+```
+
+| Timeout | Default | Local auto-adjustment | Env var override |
+|---------|---------|----------------------|------------------|
+| Stream read (socket-level) | 120s | Raised to 1800s | `HERMES_STREAM_READ_TIMEOUT` |
+| Stale stream detection | 180s | Disabled entirely | `HERMES_STREAM_STALE_TIMEOUT` |
+| API call (non-streaming) | 1800s | No change needed | `HERMES_API_TIMEOUT` |
+
+The stream read timeout is the one most likely to cause issues — it's the socket-level deadline for receiving the next chunk of data. During prefill on large contexts, local models may produce no output for minutes while processing the prompt. The auto-detection handles this transparently.
diff --git a/website/docs/reference/faq.md b/website/docs/reference/faq.md
index 0ec0abd40..6db208718 100644
--- a/website/docs/reference/faq.md
+++ b/website/docs/reference/faq.md
@@ -84,6 +84,10 @@ This works with Ollama, vLLM, llama.cpp server, SGLang, LocalAI, and others.
 See
 If you set a custom `num_ctx` in Ollama (e.g., `ollama run --num_ctx 16384`), make sure to set the matching context length in Hermes — Ollama's `/api/show` reports the model's *maximum* context, not the effective `num_ctx` you configured.
 :::
+:::tip Timeouts with local models
+Hermes auto-detects local endpoints and relaxes streaming timeouts (read timeout raised from 120s to 1800s, stale stream detection disabled). If you still hit timeouts on very large contexts, set `HERMES_STREAM_READ_TIMEOUT=1800` in your `.env`. See the [Local LLM guide](../guides/local-llm-on-mac.md#timeouts) for details.
+:::
+
 ### How much does it cost?
 
 Hermes Agent itself is **free and open-source** (MIT license). You pay only for the LLM API usage from your chosen provider. Local models are completely free to run.
diff --git a/website/docs/user-guide/configuration.md b/website/docs/user-guide/configuration.md
index 819a379eb..48f6f554f 100644
--- a/website/docs/user-guide/configuration.md
+++ b/website/docs/user-guide/configuration.md
@@ -500,6 +500,20 @@ agent:
 
 Budget pressure is enabled by default. The agent sees warnings naturally as part of tool results, encouraging it to consolidate its work and deliver a response before running out of iterations.
 
+### Streaming Timeouts
+
+The LLM streaming connection has two timeout layers. Both auto-adjust for local providers (localhost, LAN IPs) — no configuration needed for most setups.
+
+| Timeout | Default | Local providers | Env var |
+|---------|---------|----------------|---------|
+| Socket read timeout | 120s | Auto-raised to 1800s | `HERMES_STREAM_READ_TIMEOUT` |
+| Stale stream detection | 180s | Auto-disabled | `HERMES_STREAM_STALE_TIMEOUT` |
+| API call (non-streaming) | 1800s | Unchanged | `HERMES_API_TIMEOUT` |
+
+The **socket read timeout** controls how long httpx waits for the next chunk of data from the provider. Local LLMs can take minutes for prefill on large contexts before producing the first token, so Hermes raises this to 30 minutes when it detects a local endpoint. If you explicitly set `HERMES_STREAM_READ_TIMEOUT`, that value is always used regardless of endpoint detection.
+
+**Stale stream detection** kills connections that receive SSE keep-alive pings but no actual content. This is disabled entirely for local providers, since they don't send keep-alive pings during prefill.
+
 ## Context Pressure Warnings
 
 Separate from iteration budget pressure, context pressure tracks how close the conversation is to the **compaction threshold** — the point where context compression fires to summarize older messages. This helps both you and the agent understand when the conversation is getting long.
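The detection-plus-override behavior these docs describe can be sketched as follows. This is a hypothetical illustration, not Hermes's actual code: only the `HERMES_STREAM_READ_TIMEOUT` name and the 120s/1800s values come from the docs above, while the helper names and the use of Python's `ipaddress` module to classify "LAN IPs" are assumptions.

```python
import ipaddress
import os
from urllib.parse import urlparse


def is_local_endpoint(base_url: str) -> bool:
    """Heuristic: loopback hostnames and private (RFC 1918) IPs count as local."""
    host = urlparse(base_url).hostname or ""
    if host == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a public DNS name, e.g. api.openai.com
    return ip.is_loopback or ip.is_private


def stream_read_timeout(base_url: str) -> float:
    """120s default, auto-raised to 1800s for local endpoints.

    An explicit HERMES_STREAM_READ_TIMEOUT override always wins,
    regardless of endpoint detection (the documented precedence).
    """
    override = os.environ.get("HERMES_STREAM_READ_TIMEOUT")
    if override is not None:
        return float(override)
    return 1800.0 if is_local_endpoint(base_url) else 120.0
```

Under this sketch, `stream_read_timeout("http://localhost:11434/v1")` picks the relaxed 1800s deadline, while a hosted endpoint keeps the 120s default unless the env var is set.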