From 9cdcf31caef202555446c0e0b68e652bddcc211a Mon Sep 17 00:00:00 2001
From: Teknium <127238744+teknium1@users.noreply.github.com>
Date: Sun, 10 May 2026 06:40:23 -0700
Subject: [PATCH] docs(web-search): explain auxiliary-model summarization for
 web_extract (#23211)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

web_extract runs returned page content through the web_extract auxiliary
model when pages exceed 5 000 chars (single-pass up to 500k, chunked up
to 2M, refused above that). The user-guide page didn't mention this —
users were surprised that long-page extracts produced summaries instead
of raw markdown, and that those summaries cost main-model tokens by
default.

Adds:
- size-driven behavior table (under 5k / 5k–500k / 500k–2M / over 2M)
- which auxiliary task does the work (auxiliary.web_extract)
- how to route summaries to a cheap model regardless of main
- escape hatch: browser_navigate when you need raw content
- troubleshooting entry for summarization timeouts
---
 .../docs/user-guide/features/web-search.md | 46 +++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/website/docs/user-guide/features/web-search.md b/website/docs/user-guide/features/web-search.md
index 7f06c8e0d4d..931b4ce9cef 100644
--- a/website/docs/user-guide/features/web-search.md
+++ b/website/docs/user-guide/features/web-search.md
@@ -32,6 +32,44 @@ If you have a paid [Nous Portal](https://portal.nousresearch.com) subscription,
 
 ---
 
+## How `web_extract` handles long pages
+
+Backends return raw page markdown, which can be huge (forum threads, docs sites, news articles with embedded comments). To keep your context window usable and your costs down, `web_extract` runs returned content through the **`web_extract` auxiliary model** before handing it to the agent. Behavior is purely size-driven:
+
+| Page size (characters) | What happens |
+|------------------------|--------------|
+| Under 5 000 | Returned as-is — no LLM call, full markdown reaches the agent |
+| 5 000 – 500 000 | Single-pass summary via the `web_extract` auxiliary model, capped at ~5 000 chars of output |
+| 500 000 – 2 000 000 | Chunked: split into 100 k-char chunks, summarize each in parallel, then synthesize a final summary (~5 000 chars) |
+| Over 2 000 000 | Refused with a hint to use `web_crawl` with focused extraction instructions or a more specific source |
+
+The summary keeps quotes, code blocks, and key facts in their original formatting — it's a content compressor, not a paraphraser. If summarization fails or times out, Hermes falls back to the first ~5 000 chars of raw content rather than a useless error.
+
+### Which model does the summarizing?
+
+The `web_extract` auxiliary task. By default (`auxiliary.web_extract.provider: "auto"`), this is your **main chat model** — same provider, same model as `hermes model`. That's fine for most setups, but on expensive reasoning models (Opus, MiniMax M2.7, etc.) every long-page extract adds meaningful cost.
+
+To route extraction summaries to a cheap, fast model regardless of your main:
+
+```yaml
+# ~/.hermes/config.yaml
+auxiliary:
+  web_extract:
+    provider: openrouter
+    model: google/gemini-3-flash-preview
+    timeout: 360 # seconds; raise if you hit summarization timeouts
+```
+
+Or pick interactively: `hermes model` → **Configure auxiliary models** → `web_extract`.
+
+See [Auxiliary Models](/docs/user-guide/configuration#auxiliary-models) for the full reference and per-task override patterns.
+
+### When summarization gets in the way
+
+If you specifically need raw, unsummarized page content — for example, you're scraping a structured page where the LLM summary would drop important fields — use `browser_navigate` + `browser_snapshot` instead. The browser tool returns the live accessibility tree without auxiliary-model rewriting (subject to its own 8 000-char snapshot cap on huge pages).
+
+---
+
 ## Setup
 
 ### Quick setup via `hermes tools`
@@ -329,6 +367,14 @@ Some public instances disable certain search engines or categories. Try:
 
 Switch to a self-hosted instance (see [Option A](#option-a--self-host-with-docker-recommended) above). With Docker, your own instance has no rate limits.
 
+### `web_extract` returns truncated content with a "summarization timed out" note
+
+The auxiliary model didn't finish summarizing within the configured timeout. Either:
+
+- Raise `auxiliary.web_extract.timeout` in `config.yaml` (default 360s on fresh installs, 30s if the key is missing)
+- Switch the `web_extract` auxiliary task to a faster model (e.g. `google/gemini-3-flash-preview`) — see [How `web_extract` handles long pages](#how-web_extract-handles-long-pages)
+- For pages where summarization is the wrong tool, use `browser_navigate` instead
+
 ---
 
 ## Optional skill: `searxng-search`
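
The size tiers in the table this patch adds can be sketched as a small dispatch function. This is a minimal illustration, not Hermes's actual code: `Route`, `choose_route`, `split_into_chunks`, and the constant names are hypothetical. The handling of exact boundary values (for example, a page of exactly 5 000 characters) is an assumption, since the docs only give the ranges.

```python
from enum import Enum

# Thresholds taken from the behavior table; the names are hypothetical.
RAW_LIMIT = 5_000              # at or below this, content is returned as-is
SINGLE_PASS_LIMIT = 500_000    # up to this, one summarization call
CHUNKED_LIMIT = 2_000_000      # up to this, chunked summarization
CHUNK_SIZE = 100_000           # chunk width for the chunked tier


class Route(Enum):
    RAW = "raw"
    SINGLE_PASS = "single_pass"
    CHUNKED = "chunked"
    REFUSED = "refused"


def choose_route(n_chars: int) -> Route:
    """Pick a handling tier from the page size in characters.

    Assumption: each upper bound is inclusive. The real boundaries
    may differ by one character in either direction.
    """
    if n_chars <= RAW_LIMIT:
        return Route.RAW
    if n_chars <= SINGLE_PASS_LIMIT:
        return Route.SINGLE_PASS
    if n_chars <= CHUNKED_LIMIT:
        return Route.CHUNKED
    return Route.REFUSED


def split_into_chunks(text: str, size: int = CHUNK_SIZE) -> list[str]:
    """Split text into consecutive fixed-width chunks (the last may be shorter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]
```

Under these assumptions, a 1 500 000-character page maps to `Route.CHUNKED` and splits into 15 chunks of 100 000 characters each, which would then be summarized in parallel and merged into one final summary.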