docs(web-search): explain auxiliary-model summarization for web_extract (#23211)

web_extract runs returned page content through the web_extract auxiliary model when pages exceed 5 000 chars (single-pass up to 500k, chunked up to 2M, refused above that). The user-guide page didn't mention this, so users were surprised that long-page extracts produced summaries instead of raw markdown, and that those summaries cost main-model tokens by default.

Adds:
- size-driven behavior table (under 5k / 5k–500k / 500k–2M / over 2M)
- which auxiliary task does the work (auxiliary.web_extract)
- how to route summaries to a cheap model regardless of main
- escape hatch: browser_navigate when you need raw content
- troubleshooting entry for summarization timeouts
2026-07-08 13:12:08 +00:00 · 2026-05-10 06:40:23 -07:00 · 2026-05-10 06:40:23 -07:00 · 9cdcf31cae
commit 9cdcf31cae
parent 3d4297a59a
1 changed file with 46 additions and 0 deletions
--- a/website/docs/user-guide/features/web-search.md
+++ b/website/docs/user-guide/features/web-search.md
@@ -32,6 +32,44 @@ If you have a paid [Nous Portal](https://portal.nousresearch.com) subscription,

 ---

+## How `web_extract` handles long pages
+
+Backends return raw page markdown, which can be huge (forum threads, docs sites, news articles with embedded comments). To keep your context window usable and your costs down, `web_extract` runs returned content through the **`web_extract` auxiliary model** before handing it to the agent. Behavior is purely size-driven:
+
+| Page size (characters) | What happens |
+|------------------------|--------------|
+| Under 5 000 | Returned as-is — no LLM call, full markdown reaches the agent |
+| 5 000 – 500 000 | Single-pass summary via the `web_extract` auxiliary model, capped at ~5 000 chars of output |
+| 500 000 – 2 000 000 | Chunked: split into 100 k-char chunks, each summarized in parallel, then synthesized into a final summary (~5 000 chars) |
+| Over 2 000 000 | Refused with a hint to use `web_crawl` with focused extraction instructions or a more specific source |
+
+The summary keeps quotes, code blocks, and key facts in their original formatting — it's a content compressor, not a paraphraser. If summarization fails or times out, Hermes falls back to the first ~5 000 chars of raw content rather than a useless error.
+
+### Which model does the summarizing?
+
+The `web_extract` auxiliary task. By default (`auxiliary.web_extract.provider: "auto"`), this is your **main chat model** — same provider, same model as `hermes model`. That's fine for most setups, but on expensive reasoning models (Opus, MiniMax M2.7, etc.) every long-page extract adds meaningful cost.
+
+To route extraction summaries to a cheap, fast model regardless of your main:
+
+```yaml
+# ~/.hermes/config.yaml
+auxiliary:
+  web_extract:
+    provider: openrouter
+    model: google/gemini-3-flash-preview
+    timeout: 360       # seconds; raise if you hit summarization timeouts
+```
+
+Or pick interactively: `hermes model` → **Configure auxiliary models** → `web_extract`.
+
+See [Auxiliary Models](/docs/user-guide/configuration#auxiliary-models) for the full reference and per-task override patterns.
+
+### When summarization gets in the way
+
+If you specifically need raw, unsummarized page content — for example, you're scraping a structured page where the LLM summary would drop important fields — use `browser_navigate` + `browser_snapshot` instead. The browser tool returns the live accessibility tree without auxiliary-model rewriting (subject to its own 8 000-char snapshot cap on huge pages).
+
+---
+
 ## Setup

 ### Quick setup via `hermes tools`
@@ -329,6 +367,14 @@ Some public instances disable certain search engines or categories. Try:

 Switch to a self-hosted instance (see [Option A](#option-a--self-host-with-docker-recommended) above). With Docker, your own instance has no rate limits.

+### `web_extract` returns truncated content with a "summarization timed out" note
+
+The auxiliary model didn't finish summarizing within the configured timeout. Try one of the following:
+
+- Raise `auxiliary.web_extract.timeout` in `config.yaml` (default 360s on fresh installs, 30s if the key is missing)
+- Switch the `web_extract` auxiliary task to a faster model (e.g. `google/gemini-3-flash-preview`) — see [How `web_extract` handles long pages](#how-web_extract-handles-long-pages)
+- For pages where summarization is the wrong tool, use `browser_navigate` + `browser_snapshot` instead
+
 ---

 ## Optional skill: `searxng-search`
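
Note on the size table added in the first hunk: the tiers can be read as a simple dispatch on character count. The sketch below is illustrative only and is not the Hermes implementation. The function name `plan_extract`, the returned field names, and the handling of the exact boundary values (5 000, 500 000, 2 000 000) are assumptions; the thresholds and the 100 k chunk size come from the table, and the call count for the chunked tier assumes one call per chunk plus one synthesis call.

```python
import math

# Thresholds (in characters) as stated in the size table of this commit.
PASSTHROUGH_LIMIT = 5_000      # below this, content is returned as-is
SINGLE_PASS_LIMIT = 500_000    # up to this, one summary call
CHUNKED_LIMIT = 2_000_000      # up to this, chunked summarization
CHUNK_SIZE = 100_000           # chunk size for the chunked tier


def plan_extract(page_chars: int) -> dict:
    """Hypothetical helper: describe how a page of this size would be handled.

    Boundary values are treated as belonging to the higher tier's lower
    bound (e.g. exactly 5 000 chars is summarized); that is an assumption.
    """
    if page_chars < PASSTHROUGH_LIMIT:
        return {"mode": "raw", "llm_calls": 0}
    if page_chars <= SINGLE_PASS_LIMIT:
        return {"mode": "single_pass", "llm_calls": 1}
    if page_chars <= CHUNKED_LIMIT:
        chunks = math.ceil(page_chars / CHUNK_SIZE)
        # Assumed: one summary call per chunk, plus one synthesis call.
        return {"mode": "chunked", "llm_calls": chunks + 1}
    return {"mode": "refused", "llm_calls": 0}


if __name__ == "__main__":
    for size in (1_200, 42_000, 750_000, 3_000_000):
        print(size, plan_extract(size))
```

Under these assumptions a 750 000-char page splits into 8 chunks, which is why routing `auxiliary.web_extract` to a cheap model matters most for the chunked tier.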