From 9cdcf31caef202555446c0e0b68e652bddcc211a Mon Sep 17 00:00:00 2001
From: Teknium <127238744+teknium1@users.noreply.github.com>
Date: Sun, 10 May 2026 06:40:23 -0700
Subject: [PATCH] docs(web-search): explain auxiliary-model summarization for
 web_extract (#23211)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

web_extract runs returned page content through the web_extract auxiliary
model when pages exceed 5 000 chars (single-pass up to 500k, chunked up
to 2M, refused above that). The user-guide page didn't mention this —
users were surprised that long-page extracts produced summaries instead
of raw markdown, and that those summaries cost main-model tokens by
default.

Adds:
- size-driven behavior table (under 5k / 5k–500k / 500k–2M / over 2M)
- which auxiliary task does the work (auxiliary.web_extract)
- how to route summaries to a cheap model regardless of main
- escape hatch: browser_navigate when you need raw content
- troubleshooting entry for summarization timeouts
---
 .../docs/user-guide/features/web-search.md    | 46 +++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/website/docs/user-guide/features/web-search.md b/website/docs/user-guide/features/web-search.md
index 7f06c8e0d4d..931b4ce9cef 100644
--- a/website/docs/user-guide/features/web-search.md
+++ b/website/docs/user-guide/features/web-search.md
@@ -32,6 +32,44 @@ If you have a paid [Nous Portal](https://portal.nousresearch.com) subscription,
 
 ---
 
+## How `web_extract` handles long pages
+
+Backends return raw page markdown, which can be huge (forum threads, docs sites, news articles with embedded comments). To keep your context window usable and your costs down, `web_extract` runs returned content through the **`web_extract` auxiliary model** before handing it to the agent. Behavior is purely size-driven:
+
+| Page size (characters) | What happens |
+|------------------------|--------------|
+| Under 5 000 | Returned as-is — no LLM call, full markdown reaches the agent |
+| 5 000 – 500 000 | Single-pass summary via the `web_extract` auxiliary model, capped at ~5 000 chars of output |
+| 500 000 – 2 000 000 | Chunked: split into 100 k-char chunks, summarize each in parallel, then synthesize a final summary (~5 000 chars) |
+| Over 2 000 000 | Refused with a hint to use `web_crawl` with focused extraction instructions or a more specific source |
+
+The summary keeps quotes, code blocks, and key facts in their original formatting — it's a content compressor, not a paraphraser. If summarization fails or times out, Hermes falls back to the first ~5 000 chars of raw content rather than a useless error.
+
+### Which model does the summarizing?
+
+The `web_extract` auxiliary task. By default (`auxiliary.web_extract.provider: "auto"`), this is your **main chat model** — same provider, same model as `hermes model`. That's fine for most setups, but on expensive reasoning models (Opus, MiniMax M2.7, etc.) every long-page extract adds meaningful cost.
+
+To route extraction summaries to a cheap, fast model regardless of your main:
+
+```yaml
+# ~/.hermes/config.yaml
+auxiliary:
+  web_extract:
+    provider: openrouter
+    model: google/gemini-3-flash-preview
+    timeout: 360       # seconds; raise if you hit summarization timeouts
+```
+
+Or pick interactively: `hermes model` → **Configure auxiliary models** → `web_extract`.
+
+See [Auxiliary Models](/docs/user-guide/configuration#auxiliary-models) for the full reference and per-task override patterns.
+
+### When summarization gets in the way
+
+If you specifically need raw, unsummarized page content — for example, you're scraping a structured page where the LLM summary would drop important fields — use `browser_navigate` + `browser_snapshot` instead. The browser tool returns the live accessibility tree without auxiliary-model rewriting (subject to its own 8 000-char snapshot cap on huge pages).
+
+---
+
 ## Setup
 
 ### Quick setup via `hermes tools`
@@ -329,6 +367,14 @@ Some public instances disable certain search engines or categories. Try:
 
 Switch to a self-hosted instance (see [Option A](#option-a--self-host-with-docker-recommended) above). With Docker, your own instance has no rate limits.
 
+### `web_extract` returns truncated content with a "summarization timed out" note
+
+The auxiliary model didn't finish summarizing within the configured timeout. Either:
+
+- Raise `auxiliary.web_extract.timeout` in `config.yaml` (default 360s on fresh installs, 30s if the key is missing)
+- Switch the `web_extract` auxiliary task to a faster model (e.g. `google/gemini-3-flash-preview`) — see [How `web_extract` handles long pages](#how-web_extract-handles-long-pages)
+- For pages where summarization is the wrong tool, use `browser_navigate` instead
+
 ---
 
 ## Optional skill: `searxng-search`