* docs(audit): correctness pass across getting-started, reference, features, messaging, developer-guide, guides, integrations, user-guide * docs: add PR coverage for last 30d + Nous Portal weave + nav reorg + build fixes - Add docs for top user-visible PRs that shipped without docs (api-server session control, kanban features, telegram pin/edit, provider client tag, xAI retired-model migration, cron name lookup, --branch update flag, etc.) - Apply Nous Portal weave across 23 pages (tasteful one-liners on getting-started/learning-path, configuration, overview, vision, x-search, credential-pools, provider-routing, cron, codex-runtime, profiles, docker, messaging/index, multiple guides, plus FAQ + index promotion) - Reorganize sidebar: split Messaging into Popular/M365/Chinese/Other, Reference into Command/Configuration/Tools-Skills sub-categories, add orphan developer-guide pages (web-search-provider-plugin, browser-supervisor), move features from Integrations back to Features, fold lone spotify into Media & Web. - Regenerate skill stubs + catalogs (kanban-codex-lane, hermes-s6-container- supervision, web-pentest) - Fix broken anchor links (security/cron, configuration/fallback, telegram large-files, adding-platform-adapters step-by-step)
9 KiB
| sidebar_position | title | description |
|---|---|---|
| 18 | Browser CDP Supervisor | How Hermes detects and responds to native JS dialogs and interacts with cross-origin iframes via a persistent CDP connection. |
Browser CDP Supervisor
The CDP supervisor closes two long-standing gaps in Hermes' browser tooling:
- Native JS dialogs (
alert/confirm/prompt/beforeunload) block the page's JS thread. Without supervision, the agent has no way to know a dialog is open — subsequent tool calls hang or throw opaque errors. - Cross-origin iframes (OOPIFs) are invisible to top-level
Runtime.evaluate. The agent can see iframe nodes in the DOM snapshot but can't click, type, or eval inside them without a CDP session attached to the child target.
The supervisor solves both by holding a persistent WebSocket to the backend's
CDP endpoint per browser task, surfacing pending dialogs and frame structure
into browser_snapshot, and exposing a browser_dialog tool for explicit
responses.
Backend support
| Backend | Dialog detect | Dialog respond | Frame tree | OOPIF Runtime.evaluate via browser_cdp(frame_id=...) |
|---|---|---|---|---|
Local Chrome (--remote-debugging-port) / /browser connect |
✓ | ✓ full workflow | ✓ | ✓ |
| Browserbase | ✓ (via bridge) | ✓ full workflow (via bridge) | ✓ | ✓ |
| Camofox | ✗ no CDP (REST-only) | ✗ | partial via DOM snapshot | ✗ |
Browserbase quirk. Browserbase's CDP proxy uses Playwright internally and
auto-dismisses native dialogs within ~10ms, so Page.handleJavaScriptDialog
can't keep up. The supervisor injects a bridge script via
Page.addScriptToEvaluateOnNewDocument that overrides
window.alert/confirm/prompt with a synchronous XHR to a magic host
(hermes-dialog-bridge.invalid). Fetch.enable intercepts those XHRs before
they touch the network — the dialog becomes a Fetch.requestPaused event the
supervisor captures, and respond_to_dialog fulfills via
Fetch.fulfillRequest with a JSON body the injected script decodes.
From the page's perspective, prompt() still returns the agent-supplied
string. From the agent's perspective, it's the same browser_dialog(action=...)
API either way.
Camofox is unsupported — no CDP surface, REST-only.
Architecture
CDPSupervisor
One asyncio.Task running in a background daemon thread per Hermes task_id.
Holds a persistent WebSocket to the backend's CDP endpoint. Maintains:
- Dialog queue —
List[PendingDialog]with{id, type, message, default_prompt, session_id, opened_at} - Frame tree —
Dict[frame_id, FrameInfo]with parent relationships, URL, origin, whether cross-origin child session - Session map —
Dict[session_id, SessionInfo]so interaction tools can route to the right attached session for OOPIF operations - Recent console errors — ring buffer of the last 50 for diagnostics
Subscribes on attach:
Page.enable—javascriptDialogOpening,frameAttached,frameNavigated,frameDetachedRuntime.enable—executionContextCreated,consoleAPICalled,exceptionThrownTarget.setAutoAttach {autoAttach: true, flatten: true}— surfaces child OOPIF targets; supervisor enablesPage+Runtimeon each
Thread-safe state access via a snapshot lock; tool handlers (sync) read the frozen snapshot without awaiting.
Lifecycle
- Start:
SupervisorRegistry.get_or_start(task_id, cdp_url)— called bybrowser_navigate, Browserbase session create,/browser connect. Idempotent. - Stop: session teardown or
/browser disconnect. Cancels the asyncio task, closes the WebSocket, discards state. - Rebind: if the CDP URL changes (user reconnects to a new Chrome), the old supervisor is stopped and a fresh one started — state is never reused across endpoints.
Dialog policy
Configurable via config.yaml under browser.dialog_policy:
must_respond(default) — capture, surface inbrowser_snapshot, wait for explicitbrowser_dialog(action=...)call. After a 300s safety timeout with no response, auto-dismiss and log. Prevents a buggy agent from stalling forever.auto_dismiss— record and dismiss immediately; agent sees it after the fact viabrowser_stateinsidebrowser_snapshot.auto_accept— record and accept (useful forbeforeunloadwhere the workflow wants to navigate away cleanly).
Policy is per-task; no per-dialog overrides.
Agent surface
browser_dialog tool
browser_dialog(action, prompt_text=None, dialog_id=None)
action="accept"/"dismiss"→ responds to the specified or sole pending dialog (required)prompt_text=...→ text to supply to aprompt()dialogdialog_id=...→ disambiguate when multiple dialogs are queued (rare)
Tool is response-only. The agent reads pending dialogs from browser_snapshot
output before calling.
browser_snapshot extension
Adds three optional fields to the existing snapshot output when a supervisor is attached:
{
"pending_dialogs": [
{"id": "d-1", "type": "alert", "message": "Hello", "opened_at": 1650000000.0}
],
"recent_dialogs": [
{"id": "d-1", "type": "alert", "message": "...", "opened_at": 1650000000.0,
"closed_at": 1650000000.1, "closed_by": "remote"}
],
"frame_tree": {
"top": {"frame_id": "FRAME_A", "url": "https://example.com/", "origin": "https://example.com"},
"children": [
{"frame_id": "FRAME_B", "url": "about:srcdoc", "is_oopif": false},
{"frame_id": "FRAME_C", "url": "https://ads.example.net/", "is_oopif": true, "session_id": "SID_C"}
],
"truncated": false
}
}
-
pending_dialogs— dialogs currently blocking the page's JS thread. The agent must callbrowser_dialog(action=...)to respond. Empty on Browserbase because their CDP proxy auto-dismisses within ~10ms. -
recent_dialogs— ring buffer of up to 20 recently-closed dialogs with aclosed_bytag:"agent"(we responded),"auto_policy"(local auto_dismiss/auto_accept),"watchdog"(must_respond timeout hit), or"remote"(browser/backend closed it on us, e.g. Browserbase). This is how agents on Browserbase still get visibility into what happened. -
frame_tree— frame structure including cross-origin (OOPIF) children. Capped at 30 entries + OOPIF depth 2 to bound snapshot size on ad-heavy pages.truncated: truesurfaces when limits were hit; agents needing the full tree can usebrowser_cdpwithPage.getFrameTree.
No new tool schema surface for any of these — the agent reads the snapshot it already requests.
Availability gating
Both surfaces gate on _browser_cdp_check (supervisor can only run when a CDP
endpoint is reachable). On Camofox / no-backend sessions, the dialog tool is
hidden and the snapshot omits the new fields — no schema bloat.
Cross-origin iframe interaction
browser_cdp(frame_id=...) routes CDP calls (notably Runtime.evaluate)
through the supervisor's already-connected WebSocket using the OOPIF's child
sessionId. Agents pick frame_ids out of
browser_snapshot.frame_tree.children[] where is_oopif=true and pass them
to browser_cdp. For same-origin iframes (no dedicated CDP session), the
agent uses contentWindow/contentDocument from a top-level
Runtime.evaluate instead — the supervisor surfaces an error pointing at that
fallback when frame_id belongs to a non-OOPIF.
On Browserbase, this is the only reliable path for iframe interaction —
stateless CDP connections (opened per browser_cdp call) hit signed-URL
expiry, while the supervisor's long-lived connection keeps a valid session.
File layout
tools/browser_supervisor.py—CDPSupervisor,SupervisorRegistry,PendingDialog,FrameInfotools/browser_dialog_tool.py—browser_dialogtool handlertools/browser_tool.py—browser_navigatestart-hook,browser_snapshotmerge,/browser connectreattach,_cleanup_browser_sessionteardowntoolsets.py— registersbrowser_dialoginbrowser,hermes-acp,hermes-api-server, and core toolsets (gated on CDP reachability)hermes_cli/config.py—browser.dialog_policyandbrowser.dialog_timeout_sdefaults
Non-goals
- Detection/interaction for Camofox (upstream gap; tracked separately)
- Streaming dialog/frame events live to the user (would require gateway hooks)
- Persisting dialog history across sessions (in-memory only)
- Per-iframe dialog policies (agent can express this via
dialog_id) - Replacing
browser_cdp— it stays as the escape hatch for the long tail (cookies, viewport, network throttling)
Testing
Unit tests (tests/tools/test_browser_supervisor.py) use an asyncio mock CDP
server that speaks enough of the protocol to exercise all state transitions:
attach, enable, navigate, dialog fire, dialog dismiss, frame attach/detach,
child target attach, session teardown. Real-backend E2E (Browserbase + local
Chromium-family browser) is manual — exercise via /browser connect to a
live Chromium-family browser and run the dialog/frame test cases described
above.