* docs: browser CDP supervisor design (for upcoming PR) Design doc ahead of implementation — dialog + iframe detection/interaction via a persistent CDP supervisor. Covers backend capability matrix (verified live 2026-04-23), architecture, lifecycle, policy, agent surface, PR split, non-goals, and test plan. Supersedes #12550. No code changes in this commit. * feat(browser): add persistent CDP supervisor for dialog + frame detection Single persistent CDP WebSocket per Hermes task_id that subscribes to Page/Runtime/Target events and maintains thread-safe state for pending dialogs, frame tree, and console errors. Supervisor lives in its own daemon thread running an asyncio loop; external callers use sync API (snapshot(), respond_to_dialog()) that bridges onto the loop. Auto-attaches to OOPIF child targets via Target.setAutoAttach{flatten:true} and enables Page+Runtime on each so iframe-origin dialogs surface through the same supervisor. Dialog policies: must_respond (default, 300s safety timeout), auto_dismiss, auto_accept. Frame tree capped at 30 entries + OOPIF depth 2 to keep snapshot payloads bounded on ad-heavy pages. E2E verified against real Chrome via smoke test — detects + responds to main-frame alerts, iframe-contentWindow alerts, preserves frame tree, graceful no-dialog error path, clean shutdown. No agent-facing tool wiring in this commit (comes next). * feat(browser): add browser_dialog tool wired to CDP supervisor Agent-facing response-only tool. Schema: action: 'accept' | 'dismiss' (required) prompt_text: response for prompt() dialogs (optional) dialog_id: disambiguate when multiple dialogs queued (optional) Handler: SUPERVISOR_REGISTRY.get(task_id).respond_to_dialog(...) check_fn shares _browser_cdp_check with browser_cdp so both surface and hide together. When no supervisor is attached (Camofox, default Playwright, or no browser session started yet), tool is hidden; if somehow invoked it returns a clear error pointing the agent to browser_navigate / /browser connect. Registered in _HERMES_CORE_TOOLS and the browser / hermes-acp / hermes-api-server toolsets alongside browser_cdp. * feat(browser): wire CDP supervisor into session lifecycle + browser_snapshot Supervisor lifecycle: * _get_session_info lazy-starts the supervisor after a session row is materialized — covers every backend code path (Browserbase, cdp_url override, /browser connect, future providers) with one hook. * cleanup_browser(task_id) stops the supervisor for that task first (before the backend tears down CDP). * cleanup_all_browsers() calls SUPERVISOR_REGISTRY.stop_all(). * /browser connect eagerly starts the supervisor for task 'default' so the first snapshot already shows pending_dialogs. * /browser disconnect stops the supervisor. CDP URL resolution for the supervisor: 1. BROWSER_CDP_URL / browser.cdp_url override. 2. Fallback: session_info['cdp_url'] from cloud providers (Browserbase). browser_snapshot merges supervisor state (pending_dialogs + frame_tree) into its JSON output when a supervisor is active — the agent reads pending_dialogs from the snapshot it already requests, then calls browser_dialog to respond. No extra tool surface. Config defaults: * browser.dialog_policy: 'must_respond' (new) * browser.dialog_timeout_s: 300 (new) No version bump — new keys deep-merge into existing browser section. Deadlock fix in supervisor event dispatch: * _on_dialog_opening and _on_target_attached used to await CDP calls while the reader was still processing an event — but only the reader can set the response Future, so the call timed out. * Both now fire asyncio.create_task(...) so the reader stays pumping. * auto_dismiss/auto_accept now actually close the dialog immediately. Tests (tests/tools/test_browser_supervisor.py, 11 tests, real Chrome): * supervisor start/snapshot * main-frame alert detection + dismiss * iframe.contentWindow alert * prompt() with prompt_text reply * respond with no pending dialog -> clean error * auto_dismiss clears on event * registry idempotency * registry stop -> snapshot reports inactive * browser_dialog tool no-supervisor error * browser_dialog invalid action * browser_dialog end-to-end via tool handler xdist-safe: chrome_cdp fixture uses a per-worker port. Skipped when google-chrome/chromium isn't installed. * docs(browser): document browser_dialog tool + CDP supervisor - user-guide/features/browser.md: new browser_dialog section with workflow, availability gate, and dialog_policy table - reference/tools-reference.md: row for browser_dialog, tool count bumped 53 -> 54, browser tools count 11 -> 12 - reference/toolsets-reference.md: browser_dialog added to browser toolset row with note on pending_dialogs / frame_tree snapshot fields Full design doc lives at developer-guide/browser-supervisor.md (committed earlier). * fix(browser): reconnect loop + recent_dialogs for Browserbase visibility Found via Browserbase E2E test that revealed two production-critical issues: 1. **Supervisor WebSocket drops when other clients disconnect.** Browserbase's CDP proxy tears down our long-lived WebSocket whenever a short-lived client (e.g. agent-browser CLI's per-command CDP connection) disconnects. Fixed with a reconnecting _run loop that re-attaches with exponential backoff on drops. _page_session_id and _child_sessions are reset on each reconnect; pending_dialogs and frames are preserved across reconnects. 2. **Browserbase auto-dismisses dialogs server-side within ~10ms.** Their Playwright-based CDP proxy dismisses alert/confirm/prompt before our Page.handleJavaScriptDialog call can respond. So pending_dialogs is empty by the time the agent reads a snapshot on Browserbase. Added a recent_dialogs ring buffer (capacity 20) that retains a DialogRecord for every dialog that opened, with a closed_by tag: * 'agent' — agent called browser_dialog * 'auto_policy' — local auto_dismiss/auto_accept fired * 'watchdog' — must_respond timeout auto-dismissed (300s default) * 'remote' — browser/backend closed it on us (Browserbase) Agents on Browserbase now see the dialog history with closed_by='remote' so they at least know a dialog fired, even though they couldn't respond. 3. **Page.javascriptDialogClosed matching bug.** The event doesn't include a 'message' field (CDP spec has only 'result' and 'userInput') but our _on_dialog_closed was matching on message. Fixed to match by session_id + oldest-first, with a safety assumption that only one dialog is in flight per session (the JS thread is blocked while a dialog is up). Docs + tests updated: * browser.md: new availability matrix showing the three backends and which mode (pending / recent / response) each supports * developer-guide/browser-supervisor.md: three-field snapshot schema with closed_by semantics * test_browser_supervisor.py: +test_recent_dialogs_ring_buffer (12/12 passing against real Chrome) E2E verified both backends: * Local Chrome via /browser connect: detect + respond full workflow (smoke_supervisor.py all 7 scenarios pass) * Browserbase: detect via recent_dialogs with closed_by='remote' (smoke_supervisor_browserbase_v2.py passes) Camofox remains out of scope (REST-only, no CDP) — tracked for upstream PR 3. * feat(browser): XHR bridge for dialog response on Browserbase (FIXED) Browserbase's CDP proxy auto-dismisses native JS dialogs within ~10ms, so Page.handleJavaScriptDialog calls lose the race. Solution: bypass native dialogs entirely. The supervisor now injects Page.addScriptToEvaluateOnNewDocument with a JavaScript override for window.alert/confirm/prompt. Those overrides perform a synchronous XMLHttpRequest to a magic host ('hermes-dialog-bridge.invalid'). We intercept those XHRs via Fetch.enable with a requestStage=Request pattern. Flow when a page calls alert('hi'): 1. window.alert override intercepts, builds XHR GET to http://hermes-dialog-bridge.invalid/?kind=alert&message=hi 2. Sync XHR blocks the page's JS thread (mirrors real dialog semantics) 3. Fetch.requestPaused fires on our WebSocket; supervisor surfaces it as a pending dialog with bridge_request_id set 4. Agent reads pending_dialogs from browser_snapshot, calls browser_dialog 5. Supervisor calls Fetch.fulfillRequest with JSON body: {accept: true|false, prompt_text: '...', dialog_id: 'd-N'} 6. The injected script parses the body, returns the appropriate value from the override (undefined for alert, bool for confirm, string|null for prompt) This works identically on Browserbase AND local Chrome — no native dialog ever fires, so Browserbase's auto-dismiss has nothing to race. Dialog policies (must_respond / auto_dismiss / auto_accept) all still work. Bridge is installed on every attached session (main page + OOPIF child sessions) so iframe dialogs are captured too. Native-dialog path kept as a fallback for backends that don't auto-dismiss (so a page that somehow bypasses our override — e.g. iframes that load after Fetch.enable but before the init-script runs — still gets observed via Page.javascriptDialogOpening). E2E VERIFIED: * Local Chrome: 13/13 pytest tests green (12 original + new test_bridge_captures_prompt_and_returns_reply_text that asserts window.__ret === 'AGENT-SUPPLIED-REPLY' after agent responds) * Browserbase: smoke_bb_bridge_v2.py runs 4/4 PASS: - alert('BB-ALERT-MSG') dismiss → page.alert_ret = undefined ✓ - prompt('BB-PROMPT-MSG', 'default-xyz') accept with 'AGENT-REPLY' → page.prompt_ret === 'AGENT-REPLY' ✓ - confirm('BB-CONFIRM-MSG') accept → page.confirm_ret === true ✓ - confirm('BB-CONFIRM-MSG') dismiss → page.confirm_ret === false ✓ Docs updated in browser.md and developer-guide/browser-supervisor.md — availability matrix now shows Browserbase at full parity with local Chrome for both detection and response. * feat(browser): cross-origin iframe interaction via browser_cdp(frame_id=...) Adds iframe interaction to the CDP supervisor PR (was queued as PR 2). Design: browser_cdp gets an optional frame_id parameter. When set, the tool looks up the frame in the supervisor's frame_tree, grabs its child cdp_session_id (OOPIF session), and dispatches the CDP call through the supervisor's already-connected WebSocket via run_coroutine_threadsafe. Why not stateless: on Browserbase, each fresh browser_cdp WebSocket must re-negotiate against a signed connectUrl. The session info carries a specific URL that can expire while the supervisor's long-lived connection stays valid. Routing via the supervisor sidesteps this. Agent workflow: 1. browser_snapshot → frame_tree.children[] shows OOPIFs with is_oopif=true 2. browser_cdp(method='Runtime.evaluate', frame_id=<OOPIF frame_id>, params={'expression': 'document.title', 'returnByValue': True}) 3. Supervisor dispatches the call on the OOPIF's child session Supervisor state fixes needed along the way: * _on_frame_detached now skips reason='swap' (frame migrating processes) * _on_frame_detached also skips when the frame is an OOPIF with a live child session — Browserbase fires spurious remove events when a same-origin iframe gets promoted to OOPIF * _on_target_detached clears cdp_session_id but KEEPS the frame record so the agent still sees the OOPIF in frame_tree during transient session flaps E2E VERIFIED on Browserbase (smoke_bb_iframe_agent_path.py): browser_cdp(method='Runtime.evaluate', params={'expression': 'document.title', 'returnByValue': True}, frame_id=<OOPIF>) → {'success': True, 'result': {'value': 'Example Domain'}} The iframe is <iframe src='https://example.com/'> inside a top-level data: URL page on a real Browserbase session. The agent Runtime.evaluates INSIDE the cross-origin iframe and gets example.com's title back. Tests (tests/tools/test_browser_supervisor.py — 16 pass total): * test_browser_cdp_frame_id_routes_via_supervisor — injects fake OOPIF, verifies routing via supervisor, Runtime.evaluate returns 1+1=2 * test_browser_cdp_frame_id_missing_supervisor — clean error when no supervisor attached * test_browser_cdp_frame_id_not_in_frame_tree — clean error on bad frame_id Docs (browser.md and developer-guide/browser-supervisor.md) updated with the iframe workflow, availability matrix now shows OOPIF eval as shipped for local Chrome + Browserbase. * test(browser): real-OOPIF E2E verified manually + chrome_cdp uses --site-per-process When asked 'did you test the iframe stuff' I had only done a mocked pytest (fake injected OOPIF) plus a Browserbase E2E. Closed the local-Chrome real-OOPIF gap by writing /tmp/dialog-iframe-test/ smoke_local_oopif.py: * 2 http servers on different hostnames (localhost:18905 + 127.0.0.1:18906) * Chrome with --site-per-process so the cross-origin iframe becomes a real OOPIF in its own process * Navigate, find OOPIF in supervisor.frame_tree, call browser_cdp(method='Runtime.evaluate', frame_id=<OOPIF>) which routes through the supervisor's child session * Asserts iframe document.title === 'INNER-FRAME-XYZ' (from the inner page, retrieved via OOPIF eval) PASSED on 2026-04-23. Tried to embed this as a pytest but hit an asyncio version quirk between venv (3.11) and the system python (3.13) — Page.navigate hangs in the pytest harness but works in standalone. Left a self-documenting skip test that points to the smoke script + describes the verification. chrome_cdp fixture now passes --site-per-process so future iframe tests can rely on OOPIF behavior. Result: 16 pass + 1 documented-skip = 17 tests in tests/tools/test_browser_supervisor.py. * docs(browser): add dialog_policy + dialog_timeout_s to configuration.md, fix tool count Pre-merge docs audit revealed two gaps: 1. user-guide/configuration.md browser config example was missing the two new dialog_* knobs. Added with a short table explaining must_respond / auto_dismiss / auto_accept semantics and a link to the feature page for the full workflow. 2. reference/tools-reference.md header said '54 built-in tools' — real count on main is 54, this branch adds browser_dialog so it's 55. Fixed the header. (browser count was already correctly bumped 11 -> 12 in the earlier docs commit.) No code changes.
10 KiB
Browser CDP Supervisor — Design
Status: Shipped (PR 14540) Last updated: 2026-04-23 Author: @teknium1
Problem
Native JS dialogs (alert/confirm/prompt/beforeunload) and iframes are
the two biggest gaps in our browser tooling:
- Dialogs block the JS thread. Any operation on the page stalls until the dialog is handled. Before this work, the agent had no way to know a dialog was open — subsequent tool calls would hang or throw opaque errors.
- Iframes are invisible. The agent could see iframe nodes in the DOM snapshot but could not click, type, or eval inside them — especially cross-origin (OOPIF) iframes that live in separate Chromium processes.
PR #12550 proposed a
stateless browser_dialog wrapper. That doesn't solve detection — it's a
cleaner CDP call for when the agent already knows (via symptoms) that a dialog
is open. Closed as superseded.
Backend capability matrix (verified live 2026-04-23)
Using throwaway probe scripts against a data-URL page that fires alerts in the
main frame and in a same-origin srcdoc iframe, plus a cross-origin
https://example.com iframe:
| Backend | Dialog detect | Dialog respond | Frame tree | OOPIF Runtime.evaluate via browser_cdp(frame_id=...) |
|---|---|---|---|---|
Local Chrome (--remote-debugging-port) / /browser connect |
✓ | ✓ full workflow | ✓ | ✓ |
| Browserbase | ✓ (via bridge) | ✓ full workflow (via bridge) | ✓ | ✓ (document.title = "Example Domain" verified on real cross-origin iframe) |
| Camofox | ✗ no CDP (REST-only) | ✗ | partial via DOM snapshot | ✗ |
How Browserbase respond works. Browserbase's CDP proxy uses Playwright
internally and auto-dismisses native dialogs within ~10ms, so
Page.handleJavaScriptDialog can't keep up. To work around this, the
supervisor injects a bridge script via
Page.addScriptToEvaluateOnNewDocument that overrides
window.alert/confirm/prompt with a synchronous XHR to a magic host
(hermes-dialog-bridge.invalid). Fetch.enable intercepts those XHRs
before they touch the network — the dialog becomes a Fetch.requestPaused
event the supervisor captures, and respond_to_dialog fulfills via
Fetch.fulfillRequest with a JSON body the injected script decodes.
Net result: from the page's perspective, prompt() still returns the
agent-supplied string. From the agent's perspective, it's the same
browser_dialog(action=...) API either way. Tested end-to-end against
real Browserbase sessions — 4/4 (alert/prompt/confirm-accept/confirm-dismiss)
pass including value round-tripping back into page JS.
Camofox stays unsupported for this PR; follow-up upstream issue planned at
jo-inc/camofox-browser requesting a dialog polling endpoint.
Architecture
CDPSupervisor
One asyncio.Task running in a background daemon thread per Hermes task_id.
Holds a persistent WebSocket to the backend's CDP endpoint. Maintains:
- Dialog queue —
List[PendingDialog]with{id, type, message, default_prompt, session_id, opened_at} - Frame tree —
Dict[frame_id, FrameInfo]with parent relationships, URL, origin, whether cross-origin child session - Session map —
Dict[session_id, SessionInfo]so interaction tools can route to the right attached session for OOPIF operations - Recent console errors — ring buffer of the last 50 (for PR 2 diagnostics)
Subscribes on attach:
Page.enable—javascriptDialogOpening,frameAttached,frameNavigated,frameDetachedRuntime.enable—executionContextCreated,consoleAPICalled,exceptionThrownTarget.setAutoAttach {autoAttach: true, flatten: true}— surfaces child OOPIF targets; supervisor enablesPage+Runtimeon each
Thread-safe state access via a snapshot lock; tool handlers (sync) read the frozen snapshot without awaiting.
Lifecycle
- Start:
SupervisorRegistry.get_or_start(task_id, cdp_url)— called bybrowser_navigate, Browserbase session create,/browser connect. Idempotent. - Stop: session teardown or
/browser disconnect. Cancels the asyncio task, closes the WebSocket, discards state. - Rebind: if the CDP URL changes (user reconnects to a new Chrome), stop the old supervisor and start fresh — never reuse state across endpoints.
Dialog policy
Configurable via config.yaml under browser.dialog_policy:
must_respond(default) — capture, surface inbrowser_snapshot, wait for explicitbrowser_dialog(action=...)call. After a 300s safety timeout with no response, auto-dismiss and log. Prevents a buggy agent from stalling forever.auto_dismiss— record and dismiss immediately; agent sees it after the fact viabrowser_stateinsidebrowser_snapshot.auto_accept— record and accept (useful forbeforeunloadwhere the user wants to navigate away cleanly).
Policy is per-task; no per-dialog overrides in v1.
Agent surface (PR 1)
One new tool
browser_dialog(action, prompt_text=None, dialog_id=None)
action="accept"/"dismiss"→ responds to the specified or sole pending dialog (required)prompt_text=...→ text to supply to aprompt()dialogdialog_id=...→ disambiguate when multiple dialogs queued (rare)
Tool is response-only. Agent reads pending dialogs from browser_snapshot
output before calling.
browser_snapshot extension
Adds three optional fields to the existing snapshot output when a supervisor is attached:
{
"pending_dialogs": [
{"id": "d-1", "type": "alert", "message": "Hello", "opened_at": 1650000000.0}
],
"recent_dialogs": [
{"id": "d-1", "type": "alert", "message": "...", "opened_at": 1650000000.0,
"closed_at": 1650000000.1, "closed_by": "remote"}
],
"frame_tree": {
"top": {"frame_id": "FRAME_A", "url": "https://example.com/", "origin": "https://example.com"},
"children": [
{"frame_id": "FRAME_B", "url": "about:srcdoc", "is_oopif": false},
{"frame_id": "FRAME_C", "url": "https://ads.example.net/", "is_oopif": true, "session_id": "SID_C"}
],
"truncated": false
}
}
-
pending_dialogs: dialogs currently blocking the page's JS thread. The agent must callbrowser_dialog(action=...)to respond. Empty on Browserbase because their CDP proxy auto-dismisses within ~10ms. -
recent_dialogs: ring buffer of up to 20 recently-closed dialogs with aclosed_bytag —"agent"(we responded),"auto_policy"(local auto_dismiss/auto_accept),"watchdog"(must_respond timeout hit), or"remote"(browser/backend closed it on us, e.g. Browserbase). This is how agents on Browserbase still get visibility into what happened. -
frame_tree: frame structure including cross-origin (OOPIF) children. Capped at 30 entries + OOPIF depth 2 to bound snapshot size on ad-heavy pages.truncated: truesurfaces when limits were hit; agents needing the full tree can usebrowser_cdpwithPage.getFrameTree.
No new tool schema surface for any of these — the agent reads the snapshot it already requests.
Availability gating
Both surfaces gate on _browser_cdp_check (supervisor can only run when a CDP
endpoint is reachable). On Camofox / no-backend sessions, the dialog tool is
hidden and snapshot omits the new fields — no schema bloat.
Cross-origin iframe interaction
Extending the dialog-detect work, browser_cdp(frame_id=...) routes CDP
calls (notably Runtime.evaluate) through the supervisor's already-connected
WebSocket using the OOPIF's child sessionId. Agents pick frame_ids out of
browser_snapshot.frame_tree.children[] where is_oopif=true and pass them
to browser_cdp. For same-origin iframes (no dedicated CDP session), the
agent uses contentWindow/contentDocument from a top-level
Runtime.evaluate instead — supervisor surfaces an error pointing at that
fallback when frame_id belongs to a non-OOPIF.
On Browserbase, this is the ONLY reliable path for iframe interaction —
stateless CDP connections (opened per browser_cdp call) hit signed-URL
expiry, while the supervisor's long-lived connection keeps a valid session.
Camofox (follow-up)
Issue planned against jo-inc/camofox-browser adding:
- Playwright
page.on('dialog', handler)per session GET /tabs/:tabId/dialogspolling endpointPOST /tabs/:tabId/dialogs/:idto accept/dismiss- Frame-tree introspection endpoint
Files touched (PR 1)
New
tools/browser_supervisor.py—CDPSupervisor,SupervisorRegistry,PendingDialog,FrameInfotools/browser_dialog_tool.py—browser_dialogtool handlertests/tools/test_browser_supervisor.py— mock CDP WebSocket server + lifecycle/state testswebsite/docs/developer-guide/browser-supervisor.md— this file
Modified
toolsets.py— registerbrowser_dialoginbrowser,hermes-acp,hermes-api-server, core toolsets (gated on CDP reachability)tools/browser_tool.pybrowser_navigatestart-hook: if CDP URL resolvable,SupervisorRegistry.get_or_start(task_id, cdp_url)browser_snapshot(at ~line 1536): merge supervisor state into return payload/browser connecthandler: restart supervisor with new endpoint- Session teardown hooks in
_cleanup_browser_session
hermes_cli/config.py— addbrowser.dialog_policyandbrowser.dialog_timeout_stoDEFAULT_CONFIG- Docs:
website/docs/user-guide/features/browser.md,website/docs/reference/tools-reference.md,website/docs/reference/toolsets-reference.md
Non-goals
- Detection/interaction for Camofox (upstream gap; tracked separately)
- Streaming dialog/frame events live to the user (would require gateway hooks)
- Persisting dialog history across sessions (in-memory only)
- Per-iframe dialog policies (agent can express this via
dialog_id) - Replacing
browser_cdp— it stays as the escape hatch for the long tail (cookies, viewport, network throttling)
Testing
Unit tests use an asyncio mock CDP server that speaks enough of the protocol
to exercise all state transitions: attach, enable, navigate, dialog fire,
dialog dismiss, frame attach/detach, child target attach, session teardown.
Real-backend E2E (Browserbase + local Chrome) is manual; probe scripts from
the 2026-04-23 investigation kept in-repo under
scripts/browser_supervisor_e2e.py so anyone can re-verify on new backend
versions.