docs: 30-day overhaul — correctness audit, PR coverage, Nous Portal weave, sidebar reorg (#33782)

* docs(audit): correctness pass across getting-started, reference, features, messaging, developer-guide, guides, integrations, user-guide

* docs: add PR coverage for last 30d + Nous Portal weave + nav reorg + build fixes

- Add docs for top user-visible PRs that shipped without docs (api-server
  session control, kanban features, telegram pin/edit, provider client tag,
  xAI retired-model migration, cron name lookup, --branch update flag, etc.)
- Apply Nous Portal weave across 23 pages (tasteful one-liners on
  getting-started/learning-path, configuration, overview, vision, x-search,
  credential-pools, provider-routing, cron, codex-runtime, profiles, docker,
  messaging/index, multiple guides, plus FAQ + index promotion)
- Reorganize sidebar: split Messaging into Popular/M365/Chinese/Other,
  Reference into Command/Configuration/Tools-Skills sub-categories, add
  orphan developer-guide pages (web-search-provider-plugin,
  browser-supervisor), move features from Integrations back to Features,
  fold lone spotify into Media & Web.
- Regenerate skill stubs + catalogs (kanban-codex-lane, hermes-s6-container-
  supervision, web-pentest)
- Fix broken anchor links (security/cron, configuration/fallback, telegram
  large-files, adding-platform-adapters step-by-step)
Teknium 2026-05-28 02:41:36 -07:00 committed by GitHub
parent c7f7783e5c
commit 8b6beaab5f
142 changed files with 1840 additions and 483 deletions


@ -1,57 +1,49 @@
# Browser CDP Supervisor — Design
---
sidebar_position: 18
title: "Browser CDP Supervisor"
description: "How Hermes detects and responds to native JS dialogs and interacts with cross-origin iframes via a persistent CDP connection."
---
**Status:** Shipped (PR 14540)
**Last updated:** 2026-04-23
**Author:** @teknium1
# Browser CDP Supervisor
## Problem
The CDP supervisor closes two long-standing gaps in Hermes' browser tooling:
Native JS dialogs (`alert`/`confirm`/`prompt`/`beforeunload`) and iframes are
the two biggest gaps in our browser tooling:
1. **Native JS dialogs** (`alert`/`confirm`/`prompt`/`beforeunload`) block the
page's JS thread. Without supervision, the agent has no way to know a
dialog is open — subsequent tool calls hang or throw opaque errors.
2. **Cross-origin iframes (OOPIFs)** are invisible to top-level
`Runtime.evaluate`. The agent can see iframe nodes in the DOM snapshot but
can't click, type, or eval inside them without a CDP session attached to
the child target.
1. **Dialogs block the JS thread.** Any operation on the page stalls until the
dialog is handled. Before this work, the agent had no way to know a dialog
was open — subsequent tool calls would hang or throw opaque errors.
2. **Iframes are invisible.** The agent could see iframe nodes in the DOM
snapshot but could not click, type, or eval inside them — especially
cross-origin (OOPIF) iframes that live in separate Chromium processes.
The supervisor solves both by holding a persistent WebSocket to the backend's
CDP endpoint per browser task, surfacing pending dialogs and frame structure
into `browser_snapshot`, and exposing a `browser_dialog` tool for explicit
responses.
[PR #12550](https://github.com/NousResearch/hermes-agent/pull/12550) proposed a
stateless `browser_dialog` wrapper. That doesn't solve detection — it's a
cleaner CDP call for when the agent already knows (via symptoms) that a dialog
is open. Closed as superseded.
## Backend capability matrix (verified live 2026-04-23)
Using throwaway probe scripts against a data-URL page that fires alerts in the
main frame and in a same-origin srcdoc iframe, plus a cross-origin
`https://example.com` iframe:
## Backend support
| Backend | Dialog detect | Dialog respond | Frame tree | OOPIF `Runtime.evaluate` via `browser_cdp(frame_id=...)` |
|---|---|---|---|---|
| Local Chrome (`--remote-debugging-port`) / `/browser connect` | ✓ | ✓ full workflow | ✓ | ✓ |
| Browserbase | ✓ (via bridge) | ✓ full workflow (via bridge) | ✓ | ✓ (`document.title = "Example Domain"` verified on real cross-origin iframe) |
| Browserbase | ✓ (via bridge) | ✓ full workflow (via bridge) | ✓ | ✓ |
| Camofox | ✗ no CDP (REST-only) | ✗ | partial via DOM snapshot | ✗ |
**How Browserbase respond works.** Browserbase's CDP proxy uses Playwright
internally and auto-dismisses native dialogs within ~10ms, so
`Page.handleJavaScriptDialog` can't keep up. To work around this, the
supervisor injects a bridge script via
**Browserbase quirk.** Browserbase's CDP proxy uses Playwright internally and
auto-dismisses native dialogs within ~10ms, so `Page.handleJavaScriptDialog`
can't keep up. The supervisor injects a bridge script via
`Page.addScriptToEvaluateOnNewDocument` that overrides
`window.alert`/`confirm`/`prompt` with a synchronous XHR to a magic host
(`hermes-dialog-bridge.invalid`). `Fetch.enable` intercepts those XHRs
before they touch the network — the dialog becomes a `Fetch.requestPaused`
event the supervisor captures, and `respond_to_dialog` fulfills via
(`hermes-dialog-bridge.invalid`). `Fetch.enable` intercepts those XHRs before
they touch the network — the dialog becomes a `Fetch.requestPaused` event the
supervisor captures, and `respond_to_dialog` fulfills via
`Fetch.fulfillRequest` with a JSON body the injected script decodes.
Net result: from the page's perspective, `prompt()` still returns the
agent-supplied string. From the agent's perspective, it's the same
`browser_dialog(action=...)` API either way. Tested end-to-end against
real Browserbase sessions — 4/4 (alert/prompt/confirm-accept/confirm-dismiss)
pass including value round-tripping back into page JS.
From the page's perspective, `prompt()` still returns the agent-supplied
string. From the agent's perspective, it's the same `browser_dialog(action=...)`
API either way.
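
As a rough illustration of the fulfill step, the sketch below builds the parameters for a `Fetch.fulfillRequest` call that answers a bridged dialog. The function name and the reply keys (`accepted`, `value`) are hypothetical; the JSON shape the injected script actually decodes is not specified here. The only protocol facts relied on are that `Fetch.fulfillRequest` takes `requestId`, `responseCode`, `responseHeaders`, and a base64-encoded `body`.

```python
import base64
import json
from typing import Optional


def build_fulfill_params(request_id: str, accepted: bool,
                         prompt_text: Optional[str] = None) -> dict:
    """Build params for a CDP Fetch.fulfillRequest answering a bridged dialog."""
    # Hypothetical reply shape; the real bridge script may expect other keys.
    reply = {"accepted": accepted, "value": prompt_text}
    body = base64.b64encode(json.dumps(reply).encode("utf-8")).decode("ascii")
    return {
        "requestId": request_id,  # taken from the Fetch.requestPaused event
        "responseCode": 200,
        "responseHeaders": [
            {"name": "Content-Type", "value": "application/json"},
        ],
        "body": body,  # CDP expects the response body base64-encoded
    }


params = build_fulfill_params("req-1", accepted=True, prompt_text="hello")
```

Because the XHR is synchronous, the page-side override can return the decoded value directly, which is what lets `prompt()` hand the agent-supplied string back to page JS.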
Camofox stays unsupported for this PR; follow-up upstream issue planned at
`jo-inc/camofox-browser` requesting a dialog polling endpoint.
Camofox is unsupported — no CDP surface, REST-only.
## Architecture
@ -63,9 +55,10 @@ Holds a persistent WebSocket to the backend's CDP endpoint. Maintains:
- **Dialog queue**: `List[PendingDialog]` with `{id, type, message, default_prompt, session_id, opened_at}`
- **Frame tree**: `Dict[frame_id, FrameInfo]` with parent relationships, URL, origin, and whether the frame has a cross-origin child session
- **Session map**: `Dict[session_id, SessionInfo]` so interaction tools can route to the right attached session for OOPIF operations
- **Recent console errors** — ring buffer of the last 50 (for PR 2 diagnostics)
- **Recent console errors** — ring buffer of the last 50 for diagnostics
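
The state listed above might be modeled roughly as follows. This is a minimal sketch, not the shipped classes: the `PendingDialog` fields come from the list above, while the `FrameInfo` field names and the `SupervisorState` container are illustrative assumptions (the session map is omitted for brevity).

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Optional


@dataclass
class PendingDialog:
    id: str
    type: str  # "alert", "confirm", "prompt", or "beforeunload"
    message: str
    default_prompt: str
    session_id: Optional[str]
    opened_at: float


@dataclass
class FrameInfo:
    # Field names are guesses based on the description, not the real class.
    frame_id: str
    parent_id: Optional[str]
    url: str
    origin: str
    is_oopif: bool = False


@dataclass
class SupervisorState:
    dialogs: List[PendingDialog] = field(default_factory=list)
    frames: Dict[str, FrameInfo] = field(default_factory=dict)
    # deque(maxlen=50) drops the oldest entry once 50 errors are stored.
    console_errors: Deque[str] = field(default_factory=lambda: deque(maxlen=50))
```

A bounded `deque` is one simple way to get ring-buffer behavior without manual index bookkeeping.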
Subscribes on attach:
- `Page.enable`: `javascriptDialogOpening`, `frameAttached`, `frameNavigated`, `frameDetached`
- `Runtime.enable`: `executionContextCreated`, `consoleAPICalled`, `exceptionThrown`
- `Target.setAutoAttach {autoAttach: true, flatten: true}` — surfaces child OOPIF targets; supervisor enables `Page`+`Runtime` on each
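
On the wire, those subscriptions are ordinary CDP JSON messages. The sketch below serializes them for one target; the helper name is hypothetical, and sending the same three commands to child sessions is an assumption (the text above only says `Page` and `Runtime` are enabled on each child). `waitForDebuggerOnStart` is included because the protocol requires it for `Target.setAutoAttach`.

```python
import itertools
import json
from typing import Iterator, List, Optional


def attach_commands(ids: Iterator[int], session_id: Optional[str] = None) -> List[str]:
    """Serialize the CDP commands sent when attaching to a target."""
    calls = [
        ("Page.enable", {}),
        ("Runtime.enable", {}),
        ("Target.setAutoAttach",
         {"autoAttach": True, "waitForDebuggerOnStart": False, "flatten": True}),
    ]
    messages = []
    for method, params in calls:
        msg = {"id": next(ids), "method": method, "params": params}
        if session_id is not None:
            # With flatten=True, child targets share the socket and are
            # addressed by a top-level sessionId on each message.
            msg["sessionId"] = session_id
        messages.append(json.dumps(msg))
    return messages


root_messages = attach_commands(itertools.count(1))
```

A single shared id counter keeps command ids unique across the root and every child session on the same WebSocket.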
@ -76,11 +69,13 @@ frozen snapshot without awaiting.
### Lifecycle
- **Start:** `SupervisorRegistry.get_or_start(task_id, cdp_url)` — called by
`browser_navigate`, Browserbase session create, `/browser connect`. Idempotent.
`browser_navigate`, Browserbase session create, `/browser connect`.
Idempotent.
- **Stop:** session teardown or `/browser disconnect`. Cancels the asyncio
task, closes the WebSocket, discards state.
- **Rebind:** if the CDP URL changes (user reconnects to a new Chrome), stop
the old supervisor and start fresh — never reuse state across endpoints.
- **Rebind:** if the CDP URL changes (user reconnects to a new Chrome), the
old supervisor is stopped and a fresh one started — state is never reused
across endpoints.
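
The start, stop, and rebind rules can be sketched with a small in-memory registry. This is an illustration of the semantics only: `Supervisor` is a stand-in that tracks an endpoint and a running flag, and the real `SupervisorRegistry` also manages the asyncio task and WebSocket.

```python
from typing import Dict


class Supervisor:
    """Stand-in for the real supervisor; tracks only endpoint and liveness."""

    def __init__(self, cdp_url: str) -> None:
        self.cdp_url = cdp_url
        self.running = True

    def stop(self) -> None:
        self.running = False


class Registry:
    def __init__(self) -> None:
        self._by_task: Dict[str, Supervisor] = {}

    def get_or_start(self, task_id: str, cdp_url: str) -> Supervisor:
        current = self._by_task.get(task_id)
        if current is not None and current.cdp_url == cdp_url:
            return current  # idempotent: same task, same endpoint
        if current is not None:
            current.stop()  # rebind: discard state tied to the old endpoint
        fresh = Supervisor(cdp_url)
        self._by_task[task_id] = fresh
        return fresh

    def stop(self, task_id: str) -> None:
        sup = self._by_task.pop(task_id, None)
        if sup is not None:
            sup.stop()
```

Keying on `task_id` and comparing `cdp_url` makes repeated calls from `browser_navigate` cheap while still guaranteeing a clean slate after a reconnect.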
### Dialog policy
@ -92,14 +87,14 @@ Configurable via `config.yaml` under `browser.dialog_policy`:
forever.
- `auto_dismiss` — record and dismiss immediately; agent sees it after the
fact via `browser_state` inside `browser_snapshot`.
- `auto_accept` — record and accept (useful for `beforeunload` where the user
wants to navigate away cleanly).
- `auto_accept` — record and accept (useful for `beforeunload` where the
workflow wants to navigate away cleanly).
Policy is per-task; no per-dialog overrides in v1.
Policy is per-task; no per-dialog overrides.
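
One way to picture the policy is as a mapping from the configured value to the action taken the moment a dialog opens. The helper below is a hypothetical sketch, not the shipped code; it only uses the three policy names that appear in this document.

```python
from typing import Optional


def immediate_action(policy: str) -> Optional[str]:
    """Return the action to apply as soon as a dialog opens, if any."""
    if policy == "auto_dismiss":
        return "dismiss"
    if policy == "auto_accept":
        return "accept"
    if policy == "must_respond":
        return None  # leave the dialog pending for the agent (or the watchdog)
    raise ValueError(f"unknown dialog policy: {policy!r}")
```

Rejecting unknown values early keeps a typo in `config.yaml` from silently behaving like one of the valid policies.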
## Agent surface (PR 1)
## Agent surface
### One new tool
### `browser_dialog` tool
```
browser_dialog(action, prompt_text=None, dialog_id=None)
@ -107,9 +102,9 @@ browser_dialog(action, prompt_text=None, dialog_id=None)
- `action="accept"` / `"dismiss"` (required) → responds to the specified dialog, or to the sole pending dialog when no `dialog_id` is given
- `prompt_text=...` → text to supply to a `prompt()` dialog
- `dialog_id=...` → disambiguate when multiple dialogs queued (rare)
- `dialog_id=...` → disambiguate when multiple dialogs are queued (rare)
Tool is response-only. Agent reads pending dialogs from `browser_snapshot`
Tool is response-only. The agent reads pending dialogs from `browser_snapshot`
output before calling.
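
The "specified or sole pending dialog" rule might be implemented along these lines. This is a sketch under assumptions: the function name, the dict-based dialog records, and the error type are all illustrative rather than taken from the handler.

```python
from typing import List, Optional


def select_dialog(pending: List[dict], dialog_id: Optional[str] = None) -> dict:
    """Pick the dialog a browser_dialog call should respond to."""
    if dialog_id is not None:
        for dialog in pending:
            if dialog["id"] == dialog_id:
                return dialog
        raise LookupError(f"no pending dialog with id {dialog_id!r}")
    if not pending:
        raise LookupError("no dialog is pending")
    if len(pending) > 1:
        raise LookupError("multiple dialogs pending; pass dialog_id")
    return pending[0]
```

Failing loudly in the ambiguous case is a deliberate choice in this sketch: guessing which dialog to accept could confirm an action the agent never looked at.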
### `browser_snapshot` extension
@ -137,72 +132,52 @@ is attached:
}
```
- **`pending_dialogs`**: dialogs currently blocking the page's JS thread.
- **`pending_dialogs`**: dialogs currently blocking the page's JS thread.
The agent must call `browser_dialog(action=...)` to respond. Empty on
Browserbase because their CDP proxy auto-dismisses within ~10ms.
- **`recent_dialogs`**: ring buffer of up to 20 recently-closed dialogs with
a `closed_by` tag `"agent"` (we responded), `"auto_policy"` (local
- **`recent_dialogs`**: ring buffer of up to 20 recently-closed dialogs with
a `closed_by` tag: `"agent"` (we responded), `"auto_policy"` (local
auto_dismiss/auto_accept), `"watchdog"` (must_respond timeout hit), or
`"remote"` (browser/backend closed it on us, e.g. Browserbase). This is
how agents on Browserbase still get visibility into what happened.
- **`frame_tree`**: frame structure including cross-origin (OOPIF) children.
- **`frame_tree`**: frame structure including cross-origin (OOPIF) children.
Capped at 30 entries + OOPIF depth 2 to bound snapshot size on ad-heavy
pages. `truncated: true` surfaces when limits were hit; agents needing
the full tree can use `browser_cdp` with `Page.getFrameTree`.
No new tool schema surface for any of these — the agent reads the snapshot
it already requests.
No new tool schema surface for any of these — the agent reads the snapshot it
already requests.
### Availability gating
Both surfaces gate on `_browser_cdp_check` (supervisor can only run when a CDP
endpoint is reachable). On Camofox / no-backend sessions, the dialog tool is
hidden and snapshot omits the new fields — no schema bloat.
hidden and the snapshot omits the new fields — no schema bloat.
## Cross-origin iframe interaction
Extending the dialog-detect work, `browser_cdp(frame_id=...)` routes CDP
calls (notably `Runtime.evaluate`) through the supervisor's already-connected
WebSocket using the OOPIF's child `sessionId`. Agents pick frame_ids out of
`browser_cdp(frame_id=...)` routes CDP calls (notably `Runtime.evaluate`)
through the supervisor's already-connected WebSocket using the OOPIF's child
`sessionId`. Agents pick frame_ids out of
`browser_snapshot.frame_tree.children[]` where `is_oopif=true` and pass them
to `browser_cdp`. For same-origin iframes (no dedicated CDP session), the
agent uses `contentWindow`/`contentDocument` from a top-level
`Runtime.evaluate` instead — supervisor surfaces an error pointing at that
`Runtime.evaluate` instead — the supervisor surfaces an error pointing at that
fallback when `frame_id` belongs to a non-OOPIF.
On Browserbase, this is the ONLY reliable path for iframe interaction —
On Browserbase, this is the only reliable path for iframe interaction —
stateless CDP connections (opened per `browser_cdp` call) hit signed-URL
expiry, while the supervisor's long-lived connection keeps a valid session.
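
Picking frame ids out of the snapshot could look like the helper below. It assumes each entry under `children` carries a `frame_id` key and may nest further `children`; only `is_oopif` and `frame_tree.children[]` are named in this document, so treat the rest of the shape as hypothetical.

```python
from typing import Iterator


def oopif_frame_ids(frame_tree: dict) -> Iterator[str]:
    """Yield ids of cross-origin (OOPIF) frames, depth-first."""
    for child in frame_tree.get("children", []):
        if child.get("is_oopif"):
            yield child["frame_id"]
        # Nested frames are visited too, in case an OOPIF sits below
        # a same-origin iframe.
        yield from oopif_frame_ids(child)


example_tree = {
    "children": [
        {"frame_id": "A", "is_oopif": False,
         "children": [{"frame_id": "B", "is_oopif": True}]},
        {"frame_id": "C", "is_oopif": True},
    ]
}
targets = list(oopif_frame_ids(example_tree))
```

Each id collected this way is a candidate for `browser_cdp(frame_id=...)`; frames without `is_oopif` fall back to the `contentWindow`/`contentDocument` route described above.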
## Camofox (follow-up)
Issue planned against `jo-inc/camofox-browser` adding:
- Playwright `page.on('dialog', handler)` per session
- `GET /tabs/:tabId/dialogs` polling endpoint
- `POST /tabs/:tabId/dialogs/:id` to accept/dismiss
- Frame-tree introspection endpoint
## Files touched (PR 1)
### New
## File layout
- `tools/browser_supervisor.py`: `CDPSupervisor`, `SupervisorRegistry`, `PendingDialog`, `FrameInfo`
- `tools/browser_dialog_tool.py`: `browser_dialog` tool handler
- `tests/tools/test_browser_supervisor.py` — mock CDP WebSocket server + lifecycle/state tests
- `website/docs/developer-guide/browser-supervisor.md` — this file
### Modified
- `toolsets.py` — register `browser_dialog` in `browser`, `hermes-acp`, `hermes-api-server`, core toolsets (gated on CDP reachability)
- `tools/browser_tool.py`
- `browser_navigate` start-hook: if CDP URL resolvable, `SupervisorRegistry.get_or_start(task_id, cdp_url)`
- `browser_snapshot` (at ~line 1536): merge supervisor state into return payload
- `/browser connect` handler: restart supervisor with new endpoint
- Session teardown hooks in `_cleanup_browser_session`
- `hermes_cli/config.py` — add `browser.dialog_policy` and `browser.dialog_timeout_s` to `DEFAULT_CONFIG`
- Docs: `website/docs/user-guide/features/browser.md`, `website/docs/reference/tools-reference.md`, `website/docs/reference/toolsets-reference.md`
- `tools/browser_tool.py`: `browser_navigate` start-hook, `browser_snapshot` merge, `/browser connect` reattach, `_cleanup_browser_session` teardown
- `toolsets.py` — registers `browser_dialog` in `browser`, `hermes-acp`, `hermes-api-server`, and core toolsets (gated on CDP reachability)
- `hermes_cli/config.py`: `browser.dialog_policy` and `browser.dialog_timeout_s` defaults
## Non-goals
@ -214,9 +189,10 @@ Issue planned against `jo-inc/camofox-browser` adding:
## Testing
Unit tests use an asyncio mock CDP server that speaks enough of the protocol
to exercise all state transitions: attach, enable, navigate, dialog fire,
dialog dismiss, frame attach/detach, child target attach, session teardown.
Real-backend E2E (Browserbase + local Chromium-family browser) is manual — exercise via
`/browser connect` to a live Chromium-family browser and run the dialog/frame
test cases described above.
Unit tests (`tests/tools/test_browser_supervisor.py`) use an asyncio mock CDP
server that speaks enough of the protocol to exercise all state transitions:
attach, enable, navigate, dialog fire, dialog dismiss, frame attach/detach,
child target attach, session teardown. Real-backend E2E (Browserbase + local
Chromium-family browser) is manual — exercise via `/browser connect` to a
live Chromium-family browser and run the dialog/frame test cases described
above.