docs: add context length detection references to FAQ and quickstart (#2179)

- quickstart.md: mention context length prompt for custom endpoints,
  link to configuration docs, add Ollama to provider table
- faq.md: rewrite local models section with hermes model flow and
  context length prompt example, add Ollama num_ctx tip, expand
  context-length-exceeded troubleshooting with detection override
  options and config.yaml examples

Co-authored-by: Test <test@test.com>
Teknium 2026-03-20 08:38:44 -07:00 committed by GitHub
parent c52353cf8a
commit 80e578d3e3
2 changed files with 44 additions and 8 deletions


@@ -42,18 +42,25 @@ API calls go **only to the LLM provider you configure** (e.g., OpenRouter, your
### Can I use it offline / with local models?
Yes. Point Hermes at any local OpenAI-compatible server:
Yes. Run `hermes model`, select **Custom endpoint**, and enter your server's URL:
```bash
hermes config set OPENAI_BASE_URL http://localhost:11434/v1 # Ollama
hermes config set OPENAI_API_KEY ollama # Any non-empty value
hermes config set HERMES_MODEL llama3.1
hermes model
# Select: Custom endpoint (enter URL manually)
# API base URL: http://localhost:11434/v1
# API key: ollama
# Model name: qwen3.5:27b
# Context length: 32768 ← set this to match your server's actual context window
```
You can also save the endpoint interactively with `hermes model`. Hermes persists that custom endpoint in `config.yaml`, and auxiliary tasks configured with provider `main` follow the same saved endpoint.
Hermes persists the endpoint in `config.yaml` and prompts for the context window size so compression triggers at the right time. If you leave context length blank, Hermes auto-detects it from the server's `/models` endpoint or [models.dev](https://models.dev).
This works with Ollama, vLLM, llama.cpp server, SGLang, LocalAI, and others. See the [Configuration guide](../user-guide/configuration.md) for details.
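If you're curious what auto-detection will see, you can query the server's `/models` endpoint yourself. A minimal sketch, assuming the Ollama URL from the example above (vLLM, for instance, reports a `max_model_len` field here, while Ollama's OpenAI-compatible endpoint does not, in which case Hermes falls back to [models.dev](https://models.dev)):
```bash
# Inspect what the local server exposes for auto-detection.
# URL is the Ollama example from above; adjust for your server.
curl -s http://localhost:11434/v1/models
```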
:::tip Ollama users
If you set a custom `num_ctx` in Ollama (e.g., `ollama run --num_ctx 16384`), make sure to set the matching context length in Hermes — Ollama's `/api/show` reports the model's *maximum* context, not the effective `num_ctx` you configured.
:::
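If you'd rather persist `num_ctx` on the Ollama side, one option is a Modelfile. A sketch, assuming the `qwen3.5:27b` model from the example above (the derived model name is illustrative):
```bash
# Bake a fixed context window into a derived Ollama model.
cat > Modelfile <<'EOF'
FROM qwen3.5:27b
PARAMETER num_ctx 16384
EOF
ollama create qwen3.5-16k -f Modelfile
# Enter the same value (16384) as the context length in `hermes model`.
```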
### How much does it cost?
Hermes Agent itself is **free and open-source** (MIT license). You pay only for the LLM API usage from your chosen provider. Local models are completely free to run.
@@ -200,7 +207,7 @@ hermes chat --model openrouter/meta-llama/llama-3.1-70b-instruct
#### Context length exceeded
**Cause:** The conversation has grown too long for the model's context window.
**Cause:** The conversation has grown too long for the model's context window, or Hermes detected the wrong context length for your model.
**Solution:**
```bash
@@ -214,6 +221,35 @@ hermes chat
hermes chat --model openrouter/google/gemini-2.0-flash-001
```
If this happens on the first long conversation, Hermes may have the wrong context length for your model. Check what it detected:
```bash
# Look at the status bar — it shows the detected context length
/context
```
To fix context detection, set it explicitly:
```yaml
# In ~/.hermes/config.yaml
model:
default: your-model-name
context_length: 131072 # your model's actual context window
```
Or for custom endpoints, add it per-model:
```yaml
custom_providers:
- name: "My Server"
base_url: "http://localhost:11434/v1"
models:
qwen3.5:27b:
context_length: 32768
```
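Either way, you can confirm the override took effect the same way you checked detection above (this assumes a fresh session picks up the edited `config.yaml`):
```bash
hermes chat
# In the new session, the status bar should now show your configured value
/context
```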
See [Context Length Detection](../user-guide/configuration.md#context-length-detection) for how auto-detection works and all override options.
---
### Terminal Issues