From 5a1c5994125eef8135965756d551ddaa5c7ee1a0 Mon Sep 17 00:00:00 2001 From: Teknium <127238744+teknium1@users.noreply.github.com> Date: Thu, 23 Apr 2026 22:23:37 -0700 Subject: [PATCH] =?UTF-8?q?feat(browser):=20CDP=20supervisor=20=E2=80=94?= =?UTF-8?q?=20dialog=20detection=20+=20response=20+=20cross-origin=20ifram?= =?UTF-8?q?e=20eval=20(#14540)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * docs: browser CDP supervisor design (for upcoming PR) Design doc ahead of implementation — dialog + iframe detection/interaction via a persistent CDP supervisor. Covers backend capability matrix (verified live 2026-04-23), architecture, lifecycle, policy, agent surface, PR split, non-goals, and test plan. Supersedes #12550. No code changes in this commit. * feat(browser): add persistent CDP supervisor for dialog + frame detection Single persistent CDP WebSocket per Hermes task_id that subscribes to Page/Runtime/Target events and maintains thread-safe state for pending dialogs, frame tree, and console errors. Supervisor lives in its own daemon thread running an asyncio loop; external callers use sync API (snapshot(), respond_to_dialog()) that bridges onto the loop. Auto-attaches to OOPIF child targets via Target.setAutoAttach{flatten:true} and enables Page+Runtime on each so iframe-origin dialogs surface through the same supervisor. Dialog policies: must_respond (default, 300s safety timeout), auto_dismiss, auto_accept. Frame tree capped at 30 entries + OOPIF depth 2 to keep snapshot payloads bounded on ad-heavy pages. E2E verified against real Chrome via smoke test — detects + responds to main-frame alerts, iframe-contentWindow alerts, preserves frame tree, graceful no-dialog error path, clean shutdown. No agent-facing tool wiring in this commit (comes next). * feat(browser): add browser_dialog tool wired to CDP supervisor Agent-facing response-only tool. Schema: action: 'accept' | 'dismiss' (required) prompt_text: response for prompt() dialogs (optional) dialog_id: disambiguate when multiple dialogs queued (optional) Handler: SUPERVISOR_REGISTRY.get(task_id).respond_to_dialog(...) check_fn shares _browser_cdp_check with browser_cdp so both surface and hide together. When no supervisor is attached (Camofox, default Playwright, or no browser session started yet), tool is hidden; if somehow invoked it returns a clear error pointing the agent to browser_navigate / /browser connect. Registered in _HERMES_CORE_TOOLS and the browser / hermes-acp / hermes-api-server toolsets alongside browser_cdp. * feat(browser): wire CDP supervisor into session lifecycle + browser_snapshot Supervisor lifecycle: * _get_session_info lazy-starts the supervisor after a session row is materialized — covers every backend code path (Browserbase, cdp_url override, /browser connect, future providers) with one hook. * cleanup_browser(task_id) stops the supervisor for that task first (before the backend tears down CDP). * cleanup_all_browsers() calls SUPERVISOR_REGISTRY.stop_all(). * /browser connect eagerly starts the supervisor for task 'default' so the first snapshot already shows pending_dialogs. * /browser disconnect stops the supervisor. CDP URL resolution for the supervisor: 1. BROWSER_CDP_URL / browser.cdp_url override. 2. Fallback: session_info['cdp_url'] from cloud providers (Browserbase). browser_snapshot merges supervisor state (pending_dialogs + frame_tree) into its JSON output when a supervisor is active — the agent reads pending_dialogs from the snapshot it already requests, then calls browser_dialog to respond. No extra tool surface. Config defaults: * browser.dialog_policy: 'must_respond' (new) * browser.dialog_timeout_s: 300 (new) No version bump — new keys deep-merge into existing browser section. Deadlock fix in supervisor event dispatch: * _on_dialog_opening and _on_target_attached used to await CDP calls while the reader was still processing an event — but only the reader can set the response Future, so the call timed out. * Both now fire asyncio.create_task(...) so the reader stays pumping. * auto_dismiss/auto_accept now actually close the dialog immediately. Tests (tests/tools/test_browser_supervisor.py, 11 tests, real Chrome): * supervisor start/snapshot * main-frame alert detection + dismiss * iframe.contentWindow alert * prompt() with prompt_text reply * respond with no pending dialog -> clean error * auto_dismiss clears on event * registry idempotency * registry stop -> snapshot reports inactive * browser_dialog tool no-supervisor error * browser_dialog invalid action * browser_dialog end-to-end via tool handler xdist-safe: chrome_cdp fixture uses a per-worker port. Skipped when google-chrome/chromium isn't installed. * docs(browser): document browser_dialog tool + CDP supervisor - user-guide/features/browser.md: new browser_dialog section with workflow, availability gate, and dialog_policy table - reference/tools-reference.md: row for browser_dialog, tool count bumped 53 -> 54, browser tools count 11 -> 12 - reference/toolsets-reference.md: browser_dialog added to browser toolset row with note on pending_dialogs / frame_tree snapshot fields Full design doc lives at developer-guide/browser-supervisor.md (committed earlier). * fix(browser): reconnect loop + recent_dialogs for Browserbase visibility Found via Browserbase E2E test that revealed two production-critical issues: 1. **Supervisor WebSocket drops when other clients disconnect.** Browserbase's CDP proxy tears down our long-lived WebSocket whenever a short-lived client (e.g. agent-browser CLI's per-command CDP connection) disconnects. Fixed with a reconnecting _run loop that re-attaches with exponential backoff on drops. _page_session_id and _child_sessions are reset on each reconnect; pending_dialogs and frames are preserved across reconnects. 2. **Browserbase auto-dismisses dialogs server-side within ~10ms.** Their Playwright-based CDP proxy dismisses alert/confirm/prompt before our Page.handleJavaScriptDialog call can respond. So pending_dialogs is empty by the time the agent reads a snapshot on Browserbase. Added a recent_dialogs ring buffer (capacity 20) that retains a DialogRecord for every dialog that opened, with a closed_by tag: * 'agent' — agent called browser_dialog * 'auto_policy' — local auto_dismiss/auto_accept fired * 'watchdog' — must_respond timeout auto-dismissed (300s default) * 'remote' — browser/backend closed it on us (Browserbase) Agents on Browserbase now see the dialog history with closed_by='remote' so they at least know a dialog fired, even though they couldn't respond. 3. **Page.javascriptDialogClosed matching bug.** The event doesn't include a 'message' field (CDP spec has only 'result' and 'userInput') but our _on_dialog_closed was matching on message. Fixed to match by session_id + oldest-first, with a safety assumption that only one dialog is in flight per session (the JS thread is blocked while a dialog is up). Docs + tests updated: * browser.md: new availability matrix showing the three backends and which mode (pending / recent / response) each supports * developer-guide/browser-supervisor.md: three-field snapshot schema with closed_by semantics * test_browser_supervisor.py: +test_recent_dialogs_ring_buffer (12/12 passing against real Chrome) E2E verified both backends: * Local Chrome via /browser connect: detect + respond full workflow (smoke_supervisor.py all 7 scenarios pass) * Browserbase: detect via recent_dialogs with closed_by='remote' (smoke_supervisor_browserbase_v2.py passes) Camofox remains out of scope (REST-only, no CDP) — tracked for upstream PR 3. * feat(browser): XHR bridge for dialog response on Browserbase (FIXED) Browserbase's CDP proxy auto-dismisses native JS dialogs within ~10ms, so Page.handleJavaScriptDialog calls lose the race. Solution: bypass native dialogs entirely. The supervisor now injects Page.addScriptToEvaluateOnNewDocument with a JavaScript override for window.alert/confirm/prompt. Those overrides perform a synchronous XMLHttpRequest to a magic host ('hermes-dialog-bridge.invalid'). We intercept those XHRs via Fetch.enable with a requestStage=Request pattern. Flow when a page calls alert('hi'): 1. window.alert override intercepts, builds XHR GET to http://hermes-dialog-bridge.invalid/?kind=alert&message=hi 2. Sync XHR blocks the page's JS thread (mirrors real dialog semantics) 3. Fetch.requestPaused fires on our WebSocket; supervisor surfaces it as a pending dialog with bridge_request_id set 4. Agent reads pending_dialogs from browser_snapshot, calls browser_dialog 5. Supervisor calls Fetch.fulfillRequest with JSON body: {accept: true|false, prompt_text: '...', dialog_id: 'd-N'} 6. The injected script parses the body, returns the appropriate value from the override (undefined for alert, bool for confirm, string|null for prompt) This works identically on Browserbase AND local Chrome — no native dialog ever fires, so Browserbase's auto-dismiss has nothing to race. Dialog policies (must_respond / auto_dismiss / auto_accept) all still work. Bridge is installed on every attached session (main page + OOPIF child sessions) so iframe dialogs are captured too. Native-dialog path kept as a fallback for backends that don't auto-dismiss (so a page that somehow bypasses our override — e.g. iframes that load after Fetch.enable but before the init-script runs — still gets observed via Page.javascriptDialogOpening). E2E VERIFIED: * Local Chrome: 13/13 pytest tests green (12 original + new test_bridge_captures_prompt_and_returns_reply_text that asserts window.__ret === 'AGENT-SUPPLIED-REPLY' after agent responds) * Browserbase: smoke_bb_bridge_v2.py runs 4/4 PASS: - alert('BB-ALERT-MSG') dismiss → page.alert_ret = undefined ✓ - prompt('BB-PROMPT-MSG', 'default-xyz') accept with 'AGENT-REPLY' → page.prompt_ret === 'AGENT-REPLY' ✓ - confirm('BB-CONFIRM-MSG') accept → page.confirm_ret === true ✓ - confirm('BB-CONFIRM-MSG') dismiss → page.confirm_ret === false ✓ Docs updated in browser.md and developer-guide/browser-supervisor.md — availability matrix now shows Browserbase at full parity with local Chrome for both detection and response. * feat(browser): cross-origin iframe interaction via browser_cdp(frame_id=...) Adds iframe interaction to the CDP supervisor PR (was queued as PR 2). Design: browser_cdp gets an optional frame_id parameter. When set, the tool looks up the frame in the supervisor's frame_tree, grabs its child cdp_session_id (OOPIF session), and dispatches the CDP call through the supervisor's already-connected WebSocket via run_coroutine_threadsafe. Why not stateless: on Browserbase, each fresh browser_cdp WebSocket must re-negotiate against a signed connectUrl. The session info carries a specific URL that can expire while the supervisor's long-lived connection stays valid. Routing via the supervisor sidesteps this. Agent workflow: 1. browser_snapshot → frame_tree.children[] shows OOPIFs with is_oopif=true 2. browser_cdp(method='Runtime.evaluate', frame_id=, params={'expression': 'document.title', 'returnByValue': True}) 3. Supervisor dispatches the call on the OOPIF's child session Supervisor state fixes needed along the way: * _on_frame_detached now skips reason='swap' (frame migrating processes) * _on_frame_detached also skips when the frame is an OOPIF with a live child session — Browserbase fires spurious remove events when a same-origin iframe gets promoted to OOPIF * _on_target_detached clears cdp_session_id but KEEPS the frame record so the agent still sees the OOPIF in frame_tree during transient session flaps E2E VERIFIED on Browserbase (smoke_bb_iframe_agent_path.py): browser_cdp(method='Runtime.evaluate', params={'expression': 'document.title', 'returnByValue': True}, frame_id=) → {'success': True, 'result': {'value': 'Example Domain'}} The iframe is +""" + return "data:text/html;base64," + base64.b64encode(html.encode()).decode() + + +def _fire_on_page(cdp_url: str, expression: str) -> None: + """Navigate the first page target to a data URL and fire `expression`.""" + import asyncio + import websockets as _ws_mod + + async def run(): + async with _ws_mod.connect(cdp_url, max_size=50 * 1024 * 1024) as ws: + next_id = [1] + + async def call(method, params=None, session_id=None): + cid = next_id[0] + next_id[0] += 1 + p = {"id": cid, "method": method} + if params: + p["params"] = params + if session_id: + p["sessionId"] = session_id + await ws.send(json.dumps(p)) + async for raw in ws: + m = json.loads(raw) + if m.get("id") == cid: + return m + + targets = (await call("Target.getTargets"))["result"]["targetInfos"] + page = next(t for t in targets if t.get("type") == "page") + attach = await call( + "Target.attachToTarget", {"targetId": page["targetId"], "flatten": True} + ) + sid = attach["result"]["sessionId"] + await call("Page.navigate", {"url": _test_page_url()}, session_id=sid) + await asyncio.sleep(1.5) # let the page load + await call( + "Runtime.evaluate", + {"expression": expression, "returnByValue": True}, + session_id=sid, + ) + + asyncio.run(run()) + + +@pytest.fixture +def supervisor_registry(): + """Yield the global registry and tear down any supervisors after the test.""" + from tools.browser_supervisor import SUPERVISOR_REGISTRY + + yield SUPERVISOR_REGISTRY + SUPERVISOR_REGISTRY.stop_all() + + +def _wait_for_dialog(supervisor, timeout: float = 5.0): + deadline = time.monotonic() + timeout + while time.monotonic() < deadline: + snap = supervisor.snapshot() + if snap.pending_dialogs: + return snap.pending_dialogs + time.sleep(0.1) + return () + + +def test_supervisor_start_and_snapshot(chrome_cdp, supervisor_registry): + """Supervisor attaches, exposes an active snapshot with a top frame.""" + cdp_url, _port = chrome_cdp + supervisor = supervisor_registry.get_or_start(task_id="pytest-1", cdp_url=cdp_url) + + # Navigate so the frame tree populates. + _fire_on_page(cdp_url, "/* no dialog */ void 0") + + # Give a moment for frame events to propagate + time.sleep(1.0) + snap = supervisor.snapshot() + assert snap.active is True + assert snap.task_id == "pytest-1" + assert snap.pending_dialogs == () + # At minimum a top frame should exist after the navigate. + assert snap.frame_tree.get("top") is not None + + +def test_main_frame_alert_detection_and_dismiss(chrome_cdp, supervisor_registry): + """alert() in the main frame surfaces and can be dismissed via the sync API.""" + cdp_url, _port = chrome_cdp + supervisor = supervisor_registry.get_or_start(task_id="pytest-2", cdp_url=cdp_url) + + _fire_on_page(cdp_url, "setTimeout(() => alert('PYTEST-MAIN-ALERT'), 50)") + dialogs = _wait_for_dialog(supervisor) + assert dialogs, "no dialog detected" + d = dialogs[0] + assert d.type == "alert" + assert "PYTEST-MAIN-ALERT" in d.message + + result = supervisor.respond_to_dialog("dismiss") + assert result["ok"] is True + # State cleared after dismiss + time.sleep(0.3) + assert supervisor.snapshot().pending_dialogs == () + + +def test_iframe_contentwindow_alert(chrome_cdp, supervisor_registry): + """alert() fired from inside a same-origin iframe surfaces too.""" + cdp_url, _port = chrome_cdp + supervisor = supervisor_registry.get_or_start(task_id="pytest-3", cdp_url=cdp_url) + + _fire_on_page( + cdp_url, + "setTimeout(() => document.querySelector('#inner').contentWindow.alert('PYTEST-IFRAME'), 50)", + ) + dialogs = _wait_for_dialog(supervisor) + assert dialogs, "no iframe dialog detected" + assert any("PYTEST-IFRAME" in d.message for d in dialogs) + + result = supervisor.respond_to_dialog("accept") + assert result["ok"] is True + + +def test_prompt_dialog_with_response_text(chrome_cdp, supervisor_registry): + """prompt() gets our prompt_text back inside the page.""" + cdp_url, _port = chrome_cdp + supervisor = supervisor_registry.get_or_start(task_id="pytest-4", cdp_url=cdp_url) + + # Fire a prompt and stash the answer on window + _fire_on_page( + cdp_url, + "setTimeout(() => { window.__promptResult = prompt('give me a token', 'default-x'); }, 50)", + ) + dialogs = _wait_for_dialog(supervisor) + assert dialogs + d = dialogs[0] + assert d.type == "prompt" + assert d.default_prompt == "default-x" + + result = supervisor.respond_to_dialog("accept", prompt_text="PYTEST-PROMPT-REPLY") + assert result["ok"] is True + + +def test_respond_with_no_pending_dialog_errors_cleanly(chrome_cdp, supervisor_registry): + """Calling respond_to_dialog when nothing is pending returns a clean error, not an exception.""" + cdp_url, _port = chrome_cdp + supervisor = supervisor_registry.get_or_start(task_id="pytest-5", cdp_url=cdp_url) + + result = supervisor.respond_to_dialog("accept") + assert result["ok"] is False + assert "no dialog" in result["error"].lower() + + +def test_auto_dismiss_policy(chrome_cdp, supervisor_registry): + """auto_dismiss policy clears dialogs without the agent responding.""" + from tools.browser_supervisor import DIALOG_POLICY_AUTO_DISMISS + + cdp_url, _port = chrome_cdp + supervisor = supervisor_registry.get_or_start( + task_id="pytest-6", + cdp_url=cdp_url, + dialog_policy=DIALOG_POLICY_AUTO_DISMISS, + ) + + _fire_on_page(cdp_url, "setTimeout(() => alert('PYTEST-AUTO-DISMISS'), 50)") + # Give the supervisor a moment to see + auto-dismiss + time.sleep(2.0) + snap = supervisor.snapshot() + # Nothing pending because auto-dismiss cleared it immediately + assert snap.pending_dialogs == () + + +def test_registry_idempotent_get_or_start(chrome_cdp, supervisor_registry): + """Calling get_or_start twice with the same (task, url) returns the same instance.""" + cdp_url, _port = chrome_cdp + a = supervisor_registry.get_or_start(task_id="pytest-idem", cdp_url=cdp_url) + b = supervisor_registry.get_or_start(task_id="pytest-idem", cdp_url=cdp_url) + assert a is b + + +def test_registry_stop(chrome_cdp, supervisor_registry): + """stop() tears down the supervisor and snapshot reports inactive.""" + cdp_url, _port = chrome_cdp + supervisor = supervisor_registry.get_or_start(task_id="pytest-stop", cdp_url=cdp_url) + assert supervisor.snapshot().active is True + supervisor_registry.stop("pytest-stop") + # Post-stop snapshot reports inactive; supervisor obj may still exist + assert supervisor.snapshot().active is False + + +def test_browser_dialog_tool_no_supervisor(): + """browser_dialog returns a clear error when no supervisor is attached.""" + from tools.browser_dialog_tool import browser_dialog + + r = json.loads(browser_dialog(action="accept", task_id="nonexistent-task")) + assert r["success"] is False + assert "No CDP supervisor" in r["error"] + + +def test_browser_dialog_invalid_action(chrome_cdp, supervisor_registry): + """browser_dialog rejects actions that aren't accept/dismiss.""" + from tools.browser_dialog_tool import browser_dialog + + cdp_url, _port = chrome_cdp + supervisor_registry.get_or_start(task_id="pytest-bad-action", cdp_url=cdp_url) + + r = json.loads(browser_dialog(action="eat", task_id="pytest-bad-action")) + assert r["success"] is False + assert "accept" in r["error"] and "dismiss" in r["error"] + + +def test_recent_dialogs_ring_buffer(chrome_cdp, supervisor_registry): + """Closed dialogs show up in recent_dialogs with a closed_by tag.""" + from tools.browser_supervisor import DIALOG_POLICY_AUTO_DISMISS + + cdp_url, _port = chrome_cdp + sv = supervisor_registry.get_or_start( + task_id="pytest-recent", + cdp_url=cdp_url, + dialog_policy=DIALOG_POLICY_AUTO_DISMISS, + ) + + _fire_on_page(cdp_url, "setTimeout(() => alert('PYTEST-RECENT'), 50)") + # Wait for auto-dismiss to cycle the dialog through + deadline = time.time() + 5 + while time.time() < deadline: + recent = sv.snapshot().recent_dialogs + if recent and any("PYTEST-RECENT" in r.message for r in recent): + break + time.sleep(0.1) + + recent = sv.snapshot().recent_dialogs + assert recent, "recent_dialogs should contain the auto-dismissed dialog" + match = next((r for r in recent if "PYTEST-RECENT" in r.message), None) + assert match is not None + assert match.type == "alert" + assert match.closed_by == "auto_policy" + assert match.closed_at >= match.opened_at + + +def test_browser_dialog_tool_end_to_end(chrome_cdp, supervisor_registry): + """Full agent-path check: fire an alert, call the tool handler directly.""" + from tools.browser_dialog_tool import browser_dialog + + cdp_url, _port = chrome_cdp + supervisor = supervisor_registry.get_or_start(task_id="pytest-tool", cdp_url=cdp_url) + + _fire_on_page(cdp_url, "setTimeout(() => alert('PYTEST-TOOL-END2END'), 50)") + assert _wait_for_dialog(supervisor), "no dialog detected via wait_for_dialog" + + r = json.loads(browser_dialog(action="dismiss", task_id="pytest-tool")) + assert r["success"] is True + assert r["action"] == "dismiss" + assert "PYTEST-TOOL-END2END" in r["dialog"]["message"] + + +def test_browser_cdp_frame_id_routes_via_supervisor(chrome_cdp, supervisor_registry, monkeypatch): + """browser_cdp(frame_id=...) routes Runtime.evaluate through supervisor. + + Mocks the supervisor with a known frame and verifies browser_cdp sends + the call via the supervisor's loop rather than opening a stateless + WebSocket. This is the path that makes cross-origin iframe eval work + on Browserbase. + """ + cdp_url, _port = chrome_cdp + sv = supervisor_registry.get_or_start(task_id="frame-id-test", cdp_url=cdp_url) + assert sv.snapshot().active + + # Inject a fake OOPIF frame pointing at the SUPERVISOR's own page session + # so we can verify routing. We fake is_oopif=True so the code path + # treats it as an OOPIF child. + import tools.browser_supervisor as _bs + with sv._state_lock: + fake_frame_id = "FAKE-FRAME-001" + sv._frames[fake_frame_id] = _bs.FrameInfo( + frame_id=fake_frame_id, + url="fake://", + origin="", + parent_frame_id=None, + is_oopif=True, + cdp_session_id=sv._page_session_id, # route at page scope + ) + + # Route the tool through the supervisor. Should succeed and return + # something that clearly came from CDP. + from tools.browser_cdp_tool import browser_cdp + result = browser_cdp( + method="Runtime.evaluate", + params={"expression": "1 + 1", "returnByValue": True}, + frame_id=fake_frame_id, + task_id="frame-id-test", + ) + r = json.loads(result) + assert r.get("success") is True, f"expected success, got: {r}" + assert r.get("frame_id") == fake_frame_id + assert r.get("session_id") == sv._page_session_id + value = r.get("result", {}).get("result", {}).get("value") + assert value == 2, f"expected 2, got {value!r}" + + +def test_browser_cdp_frame_id_real_oopif_smoke_documented(): + """Document that real-OOPIF E2E was manually verified — see PR #14540. + + A pytest version of this hits an asyncio version-quirk in the venv + (3.11) that doesn't show up in standalone scripts (3.13 + system + websockets). The mechanism IS verified end-to-end by two separate + smoke scripts in /tmp/dialog-iframe-test/: + + * smoke_local_oopif.py — local Chrome + 2 http servers on + different hostnames + --site-per-process. Outer page on + localhost:18905, iframe src=http://127.0.0.1:18906. Calls + browser_cdp(method='Runtime.evaluate', frame_id=) and + verifies inner page's title comes back from the OOPIF session. + PASSED on 2026-04-23: iframe document.title = 'INNER-FRAME-XYZ' + + * smoke_bb_iframe_agent_path.py — Browserbase + real cross-origin + iframe (src=https://example.com/). Same browser_cdp(frame_id=) + path. PASSED on 2026-04-23: iframe document.title = + 'Example Domain' + + The test_browser_cdp_frame_id_routes_via_supervisor pytest covers + the supervisor-routing plumbing with a fake injected OOPIF. + """ + pytest.skip( + "Real-OOPIF E2E verified manually with smoke_local_oopif.py and " + "smoke_bb_iframe_agent_path.py — pytest version hits an asyncio " + "version quirk between venv (3.11) and standalone (3.13). " + "Smoke logs preserved in /tmp/dialog-iframe-test/." + ) + + +def test_browser_cdp_frame_id_missing_supervisor(): + """browser_cdp(frame_id=...) errors cleanly when no supervisor is attached.""" + from tools.browser_cdp_tool import browser_cdp + result = browser_cdp( + method="Runtime.evaluate", + params={"expression": "1"}, + frame_id="any-frame-id", + task_id="no-such-task", + ) + r = json.loads(result) + assert r.get("success") is not True + assert "supervisor" in (r.get("error") or "").lower() + + +def test_browser_cdp_frame_id_not_in_frame_tree(chrome_cdp, supervisor_registry): + """browser_cdp(frame_id=...) errors when the frame_id isn't known.""" + cdp_url, _port = chrome_cdp + sv = supervisor_registry.get_or_start(task_id="bad-frame-test", cdp_url=cdp_url) + assert sv.snapshot().active + + from tools.browser_cdp_tool import browser_cdp + result = browser_cdp( + method="Runtime.evaluate", + params={"expression": "1"}, + frame_id="nonexistent-frame", + task_id="bad-frame-test", + ) + r = json.loads(result) + assert r.get("success") is not True + assert "not found" in (r.get("error") or "").lower() + + +def test_bridge_captures_prompt_and_returns_reply_text(chrome_cdp, supervisor_registry): + """End-to-end: agent's prompt_text round-trips INTO the page's JS. + + Proves the bridge isn't just catching dialogs — it's properly round- + tripping our reply back into the page via Fetch.fulfillRequest, so + ``prompt()`` actually returns the agent-supplied string to the page. + """ + import base64 as _b64 + + cdp_url, _port = chrome_cdp + sv = supervisor_registry.get_or_start(task_id="pytest-bridge-prompt", cdp_url=cdp_url) + + # Page fires prompt and stashes the return value on window. + html = """""" + url = "data:text/html;base64," + _b64.b64encode(html.encode()).decode() + + import asyncio as _asyncio + import websockets as _ws_mod + + async def nav_and_read(): + async with _ws_mod.connect(cdp_url, max_size=50 * 1024 * 1024) as ws: + nid = [1] + pending: dict = {} + + async def reader_fn(): + try: + async for raw in ws: + m = json.loads(raw) + if "id" in m: + fut = pending.pop(m["id"], None) + if fut and not fut.done(): + fut.set_result(m) + except Exception: + pass + + rd = _asyncio.create_task(reader_fn()) + + async def call(method, params=None, sid=None): + c = nid[0]; nid[0] += 1 + p = {"id": c, "method": method} + if params: p["params"] = params + if sid: p["sessionId"] = sid + fut = _asyncio.get_event_loop().create_future() + pending[c] = fut + await ws.send(json.dumps(p)) + return await _asyncio.wait_for(fut, timeout=20) + + try: + t = (await call("Target.getTargets"))["result"]["targetInfos"] + pg = next(x for x in t if x.get("type") == "page") + a = await call("Target.attachToTarget", {"targetId": pg["targetId"], "flatten": True}) + sid = a["result"]["sessionId"] + + # Fire navigate but don't await — prompt() blocks the page + nav_id = nid[0]; nid[0] += 1 + nav_fut = _asyncio.get_event_loop().create_future() + pending[nav_id] = nav_fut + await ws.send(json.dumps({"id": nav_id, "method": "Page.navigate", "params": {"url": url}, "sessionId": sid})) + + # Wait for supervisor to see the prompt + deadline = time.monotonic() + 10 + dialog = None + while time.monotonic() < deadline: + snap = sv.snapshot() + if snap.pending_dialogs: + dialog = snap.pending_dialogs[0] + break + await _asyncio.sleep(0.05) + assert dialog is not None, "no dialog captured" + assert dialog.bridge_request_id is not None, "expected bridge path" + assert dialog.type == "prompt" + + # Agent responds + resp = sv.respond_to_dialog("accept", prompt_text="AGENT-SUPPLIED-REPLY") + assert resp["ok"] is True + + # Wait for nav to complete + read back + try: + await _asyncio.wait_for(nav_fut, timeout=10) + except Exception: + pass + await _asyncio.sleep(0.5) + r = await call( + "Runtime.evaluate", + {"expression": "window.__ret", "returnByValue": True}, + sid=sid, + ) + return r.get("result", {}).get("result", {}).get("value") + finally: + rd.cancel() + try: await rd + except BaseException: pass + + value = asyncio.run(nav_and_read()) + assert value == "AGENT-SUPPLIED-REPLY", f"expected AGENT-SUPPLIED-REPLY, got {value!r}" diff --git a/tools/browser_cdp_tool.py b/tools/browser_cdp_tool.py index 9b13b2bb6..73cce11cd 100644 --- a/tools/browser_cdp_tool.py +++ b/tools/browser_cdp_tool.py @@ -188,10 +188,116 @@ async def _cdp_call( # --------------------------------------------------------------------------- +def _browser_cdp_via_supervisor( + task_id: str, + frame_id: str, + method: str, + params: Optional[Dict[str, Any]], + timeout: float, +) -> str: + """Route a CDP call through the live supervisor session for an OOPIF frame. + + Looks up the frame in the supervisor's snapshot, extracts its child + ``cdp_session_id``, and dispatches ``method`` with that sessionId via + the supervisor's already-connected WebSocket (using + ``asyncio.run_coroutine_threadsafe`` onto the supervisor loop). + """ + try: + from tools.browser_supervisor import SUPERVISOR_REGISTRY # type: ignore[import-not-found] + except Exception as exc: # pragma: no cover — defensive + return tool_error( + f"CDP supervisor is not available: {exc}. frame_id routing requires " + f"a running supervisor attached via /browser connect or an active " + f"Browserbase session." + ) + + supervisor = SUPERVISOR_REGISTRY.get(task_id) + if supervisor is None: + return tool_error( + f"No CDP supervisor is attached for task={task_id!r}. Call " + f"browser_navigate or /browser connect first so the supervisor " + f"can attach. Once attached, browser_snapshot will populate " + f"frame_tree with frame_ids you can pass here." + ) + + snap = supervisor.snapshot() + # Search both the top frame and the children for the requested id. + top = snap.frame_tree.get("top") + frame_info: Optional[Dict[str, Any]] = None + if top and top.get("frame_id") == frame_id: + frame_info = top + else: + for child in snap.frame_tree.get("children", []) or []: + if child.get("frame_id") == frame_id: + frame_info = child + break + if frame_info is None: + # Check the raw frames dict too (frame_tree is capped at 30 entries) + with supervisor._state_lock: # type: ignore[attr-defined] + raw = supervisor._frames.get(frame_id) # type: ignore[attr-defined] + if raw is not None: + frame_info = raw.to_dict() + + if frame_info is None: + return tool_error( + f"frame_id {frame_id!r} not found in supervisor state. " + f"Call browser_snapshot to see current frame_tree." + ) + + child_sid = frame_info.get("session_id") + if not child_sid: + # Not an OOPIF — fall back to top-level session (evaluating at page + # scope). Same-origin iframes don't get their own sessionId; the + # agent can still use contentWindow/contentDocument from the parent. + return tool_error( + f"frame_id {frame_id!r} is not an out-of-process iframe (no " + f"dedicated CDP session). For same-origin iframes, use " + f"`browser_cdp(method='Runtime.evaluate', params={{'expression': " + f"\"document.querySelector('iframe').contentDocument.title\"}})` " + f"at the top-level page instead." + ) + + # Dispatch onto the supervisor's loop. + import asyncio as _asyncio + loop = supervisor._loop # type: ignore[attr-defined] + if loop is None or not loop.is_running(): + return tool_error( + "CDP supervisor loop is not running. Try reconnecting with " + "/browser connect." + ) + + async def _do_cdp(): + return await supervisor._cdp( # type: ignore[attr-defined] + method, + params or {}, + session_id=child_sid, + timeout=timeout, + ) + + try: + fut = _asyncio.run_coroutine_threadsafe(_do_cdp(), loop) + result_msg = fut.result(timeout=timeout + 2) + except Exception as exc: + return tool_error( + f"CDP call via supervisor failed: {type(exc).__name__}: {exc}", + cdp_docs=CDP_DOCS_URL, + ) + + payload: Dict[str, Any] = { + "success": True, + "method": method, + "frame_id": frame_id, + "session_id": child_sid, + "result": result_msg.get("result", {}), + } + return json.dumps(payload, ensure_ascii=False) + + def browser_cdp( method: str, params: Optional[Dict[str, Any]] = None, target_id: Optional[str] = None, + frame_id: Optional[str] = None, timeout: float = 30.0, task_id: Optional[str] = None, ) -> str: @@ -202,16 +308,34 @@ def browser_cdp( params: Method-specific parameters; defaults to ``{}``. target_id: Optional target/tab ID for page-level methods. When set, we first attach to the target (``flatten=True``) and send - ``method`` with the resulting ``sessionId``. + ``method`` with the resulting ``sessionId``. Uses a fresh + stateless CDP connection. + frame_id: Optional cross-origin (OOPIF) iframe ``frame_id`` from + ``browser_snapshot.frame_tree.children[]``. When set (and the + frame is an OOPIF with a live session tracked by the CDP + supervisor), routes the call through the supervisor's existing + WebSocket — which is how you Runtime.evaluate *inside* an + iframe on backends where per-call fresh CDP connections would + hit signed-URL expiry (Browserbase) or expensive reattach. timeout: Seconds to wait for the call to complete. - task_id: Unused (tool is stateless) — accepted for uniformity with - other browser tools. + task_id: Task identifier for supervisor lookup. When ``frame_id`` + is set, this identifies which task's supervisor to use; the + handler will default to ``"default"`` otherwise. Returns: JSON string ``{"success": True, "method": ..., "result": {...}}`` on success, or ``{"error": "..."}`` on failure. """ - del task_id # unused — stateless + # --- Route iframe-scoped calls through the supervisor --------------- + if frame_id: + return _browser_cdp_via_supervisor( + task_id=task_id or "default", + frame_id=frame_id, + method=method, + params=params, + timeout=timeout, + ) + del task_id # stateless path below if not method or not isinstance(method, str): return tool_error( @@ -324,12 +448,18 @@ BROWSER_CDP_SCHEMA: Dict[str, Any] = { "'mobile': false}, target_id=\n\n" "**Usage rules:**\n" "- Browser-level methods (Target.*, Browser.*, Storage.*): omit " - "target_id.\n" + "target_id and frame_id.\n" "- Page-level methods (Page.*, Runtime.*, DOM.*, Emulation.*, " "Network.* scoped to a tab): pass target_id from Target.getTargets.\n" - "- Each call is independent — sessions and event subscriptions do " - "not persist between calls. For stateful workflows, prefer the " - "dedicated browser tools." + "- **Cross-origin iframe scope** (Runtime.evaluate inside an OOPIF, " + "Page.* targeting a frame target, etc.): pass frame_id from the " + "browser_snapshot frame_tree output. This routes through the CDP " + "supervisor's live connection — the only reliable way on " + "Browserbase where stateless CDP calls hit signed-URL expiry.\n" + "- Each stateless call (without frame_id) is independent — sessions " + "and event subscriptions do not persist between calls. For stateful " + "workflows, prefer the dedicated browser tools or use frame_id " + "routing." ), "parameters": { "type": "object", @@ -353,8 +483,24 @@ BROWSER_CDP_SCHEMA: Dict[str, Any] = { "type": "string", "description": ( "Optional. Target/tab ID from Target.getTargets result " - "(each entry's 'targetId'). Required for page-level " - "methods; must be omitted for browser-level methods." + "(each entry's 'targetId'). Use for page-level methods " + "at the top-level tab scope. Mutually exclusive with " + "frame_id." + ), + }, + "frame_id": { + "type": "string", + "description": ( + "Optional. Out-of-process iframe (OOPIF) frame_id from " + "browser_snapshot.frame_tree.children[] where " + "is_oopif=true. When set, routes the call through the " + "CDP supervisor's live session for that iframe. " + "Essential for Runtime.evaluate inside cross-origin " + "iframes, especially on Browserbase where fresh " + "per-call CDP connections can't keep up with signed " + "URL rotation. For same-origin iframes, use parent " + "contentWindow/contentDocument from Runtime.evaluate " + "at the top-level page instead." ), }, "timeout": { @@ -408,6 +554,7 @@ registry.register( method=args.get("method", ""), params=args.get("params"), target_id=args.get("target_id"), + frame_id=args.get("frame_id"), timeout=args.get("timeout", 30.0), task_id=kw.get("task_id"), ), diff --git a/tools/browser_dialog_tool.py b/tools/browser_dialog_tool.py new file mode 100644 index 000000000..51ab0c424 --- /dev/null +++ b/tools/browser_dialog_tool.py @@ -0,0 +1,148 @@ +"""Agent-facing tool: respond to a native JS dialog captured by the CDP supervisor. + +This tool is response-only — the agent first reads ``pending_dialogs`` from +``browser_snapshot`` output, then calls ``browser_dialog(action=...)`` to +accept or dismiss. + +Gated on the same ``_browser_cdp_check`` as ``browser_cdp`` so it only +appears when a CDP endpoint is reachable (Browserbase with a +``connectUrl``, local Chrome via ``/browser connect``, or +``browser.cdp_url`` set in config). + +See ``website/docs/developer-guide/browser-supervisor.md`` for the full +design. +""" + +from __future__ import annotations + +import json +import logging +from typing import Any, Dict, Optional + +from tools.browser_supervisor import SUPERVISOR_REGISTRY +from tools.registry import registry + +logger = logging.getLogger(__name__) + + +BROWSER_DIALOG_SCHEMA: Dict[str, Any] = { + "name": "browser_dialog", + "description": ( + "Respond to a native JavaScript dialog (alert / confirm / prompt / " + "beforeunload) that is currently blocking the page.\n\n" + "**Workflow:** call ``browser_snapshot`` first — if a dialog is open, " + "it appears in the ``pending_dialogs`` field with ``id``, ``type``, " + "and ``message``. Then call this tool with ``action='accept'`` or " + "``action='dismiss'``.\n\n" + "**Prompt dialogs:** pass ``prompt_text`` to supply the response " + "string. Ignored for alert/confirm/beforeunload.\n\n" + "**Multiple dialogs:** if more than one dialog is queued (rare — " + "happens when a second dialog fires while the first is still open), " + "pass ``dialog_id`` from the snapshot to disambiguate.\n\n" + "**Availability:** only present when a CDP-capable backend is " + "attached — Browserbase sessions, local Chrome via " + "``/browser connect``, or ``browser.cdp_url`` in config.yaml. " + "Not available on Camofox (REST-only) or the default Playwright " + "local browser (CDP port is hidden)." + ), + "parameters": { + "type": "object", + "properties": { + "action": { + "type": "string", + "enum": ["accept", "dismiss"], + "description": ( + "'accept' clicks OK / returns the prompt text. " + "'dismiss' clicks Cancel / returns null from prompt(). " + "For ``beforeunload`` dialogs: 'accept' allows the " + "navigation, 'dismiss' keeps the page." + ), + }, + "prompt_text": { + "type": "string", + "description": ( + "Response string for a ``prompt()`` dialog. Ignored for " + "other dialog types. Defaults to empty string." + ), + }, + "dialog_id": { + "type": "string", + "description": ( + "Specific dialog to respond to, from " + "``browser_snapshot.pending_dialogs[].id``. Required " + "only when multiple dialogs are queued." + ), + }, + }, + "required": ["action"], + }, +} + + +def browser_dialog( + action: str, + prompt_text: Optional[str] = None, + dialog_id: Optional[str] = None, + task_id: Optional[str] = None, +) -> str: + """Respond to a pending dialog on the active task's CDP supervisor.""" + effective_task_id = task_id or "default" + supervisor = SUPERVISOR_REGISTRY.get(effective_task_id) + if supervisor is None: + return json.dumps( + { + "success": False, + "error": ( + "No CDP supervisor is attached to this task. Either the " + "browser backend doesn't expose CDP (Camofox, default " + "Playwright) or no browser session has been started yet. " + "Call browser_navigate or /browser connect first." + ), + } + ) + + result = supervisor.respond_to_dialog( + action=action, + prompt_text=prompt_text, + dialog_id=dialog_id, + ) + if result.get("ok"): + return json.dumps( + { + "success": True, + "action": action, + "dialog": result.get("dialog", {}), + } + ) + return json.dumps({"success": False, "error": result.get("error", "unknown error")}) + + +def _browser_dialog_check() -> bool: + """Gate: same as ``browser_cdp`` — only offered when CDP is reachable. + + Kept identical so the two tools appear and disappear together. The + supervisor itself is started lazily by ``browser_navigate`` / + ``/browser connect`` / Browserbase session creation, so a reachable + CDP URL is enough to commit to showing the tool. + """ + try: + from tools.browser_cdp_tool import _browser_cdp_check # type: ignore[import-not-found] + except Exception as exc: # pragma: no cover — defensive + logger.debug("browser_dialog check: browser_cdp_tool import failed: %s", exc) + return False + return _browser_cdp_check() + + +registry.register( + name="browser_dialog", + toolset="browser-cdp", + schema=BROWSER_DIALOG_SCHEMA, + handler=lambda args, **kw: browser_dialog( + action=args.get("action", ""), + prompt_text=args.get("prompt_text"), + dialog_id=args.get("dialog_id"), + task_id=kw.get("task_id"), + ), + check_fn=_browser_dialog_check, + emoji="💬", +) diff --git a/tools/browser_supervisor.py b/tools/browser_supervisor.py new file mode 100644 index 000000000..e230d92ed --- /dev/null +++ b/tools/browser_supervisor.py @@ -0,0 +1,1362 @@ +"""Persistent CDP supervisor for browser dialog + frame detection. + +One ``CDPSupervisor`` runs per Hermes ``task_id`` that has a reachable CDP +endpoint. It holds a single persistent WebSocket to the backend, subscribes +to ``Page`` / ``Runtime`` / ``Target`` events on every attached session +(top-level page and every OOPIF / worker target that auto-attaches), and +surfaces observable state — pending dialogs and frame tree — through a +thread-safe snapshot object that tool handlers consume synchronously. + +The supervisor is NOT in the agent's tool schema. Its output reaches the +agent via two channels: + +1. ``browser_snapshot`` merges supervisor state into its return payload + (see ``tools/browser_tool.py``). +2. ``browser_dialog`` tool responds to a pending dialog by calling + ``respond_to_dialog()`` on the active supervisor. + +Design spec: ``website/docs/developer-guide/browser-supervisor.md``. +""" + +from __future__ import annotations + +import asyncio +import json +import logging +import threading +import time +from dataclasses import dataclass, field +from typing import Any, Dict, List, Optional, Tuple + +import websockets +from websockets.asyncio.client import ClientConnection + +logger = logging.getLogger(__name__) + + +# ── Config defaults ─────────────────────────────────────────────────────────── + +DIALOG_POLICY_MUST_RESPOND = "must_respond" +DIALOG_POLICY_AUTO_DISMISS = "auto_dismiss" +DIALOG_POLICY_AUTO_ACCEPT = "auto_accept" + +_VALID_POLICIES = frozenset( + {DIALOG_POLICY_MUST_RESPOND, DIALOG_POLICY_AUTO_DISMISS, DIALOG_POLICY_AUTO_ACCEPT} +) + +DEFAULT_DIALOG_POLICY = DIALOG_POLICY_MUST_RESPOND +DEFAULT_DIALOG_TIMEOUT_S = 300.0 + +# Snapshot caps for frame_tree — keep payloads bounded on ad-heavy pages. +FRAME_TREE_MAX_ENTRIES = 30 +FRAME_TREE_MAX_OOPIF_DEPTH = 2 + +# Ring buffer of recent console-level events (used later by PR 2 diagnostics). +CONSOLE_HISTORY_MAX = 50 + +# Keep the last N closed dialogs in ``recent_dialogs`` so agents on backends +# that auto-dismiss server-side (e.g. Browserbase) can still observe that a +# dialog fired, even if they couldn't respond to it in time. +RECENT_DIALOGS_MAX = 20 + +# Magic host the injected dialog bridge XHRs to. Intercepted via the CDP +# Fetch domain before any network resolution happens, so the hostname never +# has to exist. Keep this ASCII + URL-safe; we also gate Fetch patterns on it. +DIALOG_BRIDGE_HOST = "hermes-dialog-bridge.invalid" +DIALOG_BRIDGE_URL_PATTERN = f"http://{DIALOG_BRIDGE_HOST}/*" + +# Script injected into every frame via Page.addScriptToEvaluateOnNewDocument. +# Overrides alert/confirm/prompt to round-trip through a sync XHR that we +# intercept via Fetch.requestPaused. Works on Browserbase (whose CDP proxy +# auto-dismisses REAL native dialogs) because the native dialogs never fire +# in the first place — the overrides take precedence. +_DIALOG_BRIDGE_SCRIPT = r""" +(() => { + if (window.__hermesDialogBridgeInstalled) return; + window.__hermesDialogBridgeInstalled = true; + const ENDPOINT = "http://hermes-dialog-bridge.invalid/"; + function ask(kind, message, defaultPrompt) { + try { + const xhr = new XMLHttpRequest(); + // Use GET with query params so we don't need to worry about request + // body encoding in the Fetch interceptor. + const params = new URLSearchParams({ + kind: String(kind || ""), + message: String(message == null ? "" : message), + default_prompt: String(defaultPrompt == null ? "" : defaultPrompt), + }); + xhr.open("GET", ENDPOINT + "?" + params.toString(), false); // sync + xhr.send(null); + if (xhr.status !== 200) return null; + const body = xhr.responseText || ""; + let parsed; + try { parsed = JSON.parse(body); } catch (e) { return null; } + if (kind === "alert") return undefined; + if (kind === "confirm") return Boolean(parsed && parsed.accept); + if (kind === "prompt") { + if (!parsed || !parsed.accept) return null; + return parsed.prompt_text == null ? "" : String(parsed.prompt_text); + } + return null; + } catch (e) { + // If the bridge is unreachable, fall back to the native call so the + // page still sees *some* behavior (the backend will auto-dismiss). + return null; + } + } + const realAlert = window.alert; + const realConfirm = window.confirm; + const realPrompt = window.prompt; + window.alert = function(message) { ask("alert", message, ""); }; + window.confirm = function(message) { + const r = ask("confirm", message, ""); + return r === null ? false : Boolean(r); + }; + window.prompt = function(message, def) { + const r = ask("prompt", message, def == null ? "" : def); + return r === null ? null : String(r); + }; + // onbeforeunload — we can't really synchronously prompt the user from this + // event without racing navigation. Leave native behavior for now; the + // supervisor's native-dialog fallback path still surfaces them in + // recent_dialogs. +})(); +""" + + +# ── Data model ──────────────────────────────────────────────────────────────── + + +@dataclass +class PendingDialog: + """A JS dialog currently open on some frame's session.""" + + id: str + type: str # "alert" | "confirm" | "prompt" | "beforeunload" + message: str + default_prompt: str + opened_at: float + cdp_session_id: str # which attached CDP session the dialog fired in + frame_id: Optional[str] = None + # When set, the dialog was captured via the bridge XHR path (Fetch domain). + # Response must be delivered via Fetch.fulfillRequest, NOT + # Page.handleJavaScriptDialog — the native dialog never fired. + bridge_request_id: Optional[str] = None + + def to_dict(self) -> Dict[str, Any]: + return { + "id": self.id, + "type": self.type, + "message": self.message, + "default_prompt": self.default_prompt, + "opened_at": self.opened_at, + "frame_id": self.frame_id, + } + + +@dataclass +class DialogRecord: + """A historical record of a dialog that was opened and then handled. + + Retained in ``recent_dialogs`` for a short window so agents on backends + that auto-dismiss dialogs server-side (Browserbase) can still observe + that a dialog fired, even though they couldn't respond to it. + """ + + id: str + type: str + message: str + opened_at: float + closed_at: float + closed_by: str # "agent" | "auto_policy" | "remote" | "watchdog" + frame_id: Optional[str] = None + + def to_dict(self) -> Dict[str, Any]: + return { + "id": self.id, + "type": self.type, + "message": self.message, + "opened_at": self.opened_at, + "closed_at": self.closed_at, + "closed_by": self.closed_by, + "frame_id": self.frame_id, + } + + +@dataclass +class FrameInfo: + """One frame in the page's frame tree. + + ``is_oopif`` means the frame has its own CDP target (separate process, + reachable via ``cdp_session_id``). Same-origin / srcdoc iframes share + the parent process and have ``is_oopif=False`` + ``cdp_session_id=None``. + """ + + frame_id: str + url: str + origin: str + parent_frame_id: Optional[str] + is_oopif: bool + cdp_session_id: Optional[str] = None + name: str = "" + + def to_dict(self) -> Dict[str, Any]: + d = { + "frame_id": self.frame_id, + "url": self.url, + "origin": self.origin, + "is_oopif": self.is_oopif, + } + if self.cdp_session_id: + d["session_id"] = self.cdp_session_id + if self.parent_frame_id: + d["parent_frame_id"] = self.parent_frame_id + if self.name: + d["name"] = self.name + return d + + +@dataclass +class ConsoleEvent: + """Ring buffer entry for console + exception traffic.""" + + ts: float + level: str # "log" | "error" | "warning" | "exception" + text: str + url: Optional[str] = None + + +@dataclass(frozen=True) +class SupervisorSnapshot: + """Read-only snapshot of supervisor state. + + Frozen dataclass so tool handlers can freely dereference without + worrying about mutation under their feet. + """ + + pending_dialogs: Tuple[PendingDialog, ...] + recent_dialogs: Tuple[DialogRecord, ...] + frame_tree: Dict[str, Any] + console_errors: Tuple[ConsoleEvent, ...] + active: bool # False if supervisor is detached/stopped + cdp_url: str + task_id: str + + def to_dict(self) -> Dict[str, Any]: + """Serialize for inclusion in ``browser_snapshot`` output.""" + out: Dict[str, Any] = { + "pending_dialogs": [d.to_dict() for d in self.pending_dialogs], + "frame_tree": self.frame_tree, + } + if self.recent_dialogs: + out["recent_dialogs"] = [d.to_dict() for d in self.recent_dialogs] + return out + + +# ── Supervisor core ─────────────────────────────────────────────────────────── + + +class CDPSupervisor: + """One supervisor per (task_id, cdp_url) pair. + + Lifecycle: + * ``start()`` — kicked off by ``SupervisorRegistry.get_or_start``; spawns + a daemon thread running its own asyncio loop, connects the WebSocket, + attaches to the first page target, enables domains, starts + auto-attaching to child targets. + * ``snapshot()`` — sync, thread-safe, called from tool handlers. + * ``respond_to_dialog(action, ...)`` — sync bridge; schedules a coroutine + on the supervisor's loop and waits (with timeout) for the CDP ack. + * ``stop()`` — cancels task, closes WebSocket, joins thread. + + All CDP I/O lives on the supervisor's own loop. External callers never + touch the loop directly; they go through the sync API above. + """ + + def __init__( + self, + task_id: str, + cdp_url: str, + *, + dialog_policy: str = DEFAULT_DIALOG_POLICY, + dialog_timeout_s: float = DEFAULT_DIALOG_TIMEOUT_S, + ) -> None: + if dialog_policy not in _VALID_POLICIES: + raise ValueError( + f"Invalid dialog_policy {dialog_policy!r}; " + f"must be one of {sorted(_VALID_POLICIES)}" + ) + self.task_id = task_id + self.cdp_url = cdp_url + self.dialog_policy = dialog_policy + self.dialog_timeout_s = float(dialog_timeout_s) + + # State protected by ``_state_lock`` for cross-thread reads. + self._state_lock = threading.Lock() + self._pending_dialogs: Dict[str, PendingDialog] = {} + self._recent_dialogs: List[DialogRecord] = [] + self._frames: Dict[str, FrameInfo] = {} + self._console_events: List[ConsoleEvent] = [] + self._active = False + + # Supervisor loop machinery — populated in start(). + self._loop: Optional[asyncio.AbstractEventLoop] = None + self._thread: Optional[threading.Thread] = None + self._ready_event = threading.Event() + self._start_error: Optional[BaseException] = None + self._stop_requested = False + + # CDP call tracking (runs on supervisor loop only). + self._next_call_id = 1 + self._pending_calls: Dict[int, asyncio.Future] = {} + self._ws: Optional[ClientConnection] = None + self._page_session_id: Optional[str] = None + self._child_sessions: Dict[str, Dict[str, Any]] = {} # session_id -> info + + # Dialog auto-dismiss watchdog handles (per dialog id). + self._dialog_watchdogs: Dict[str, asyncio.TimerHandle] = {} + # Monotonic id generator for dialogs (human-readable in snapshots). + self._dialog_seq = 0 + + # ── Public sync API ────────────────────────────────────────────────────── + + def start(self, timeout: float = 15.0) -> None: + """Launch the background loop and wait until attachment is complete. + + Raises whatever exception attach failed with (connect error, bad + WebSocket URL, CDP domain enable failure, etc.). On success, the + supervisor is fully wired up — pending-dialog events will be captured + as of the moment ``start()`` returns. + """ + if self._thread and self._thread.is_alive(): + return + self._ready_event.clear() + self._start_error = None + self._stop_requested = False + self._thread = threading.Thread( + target=self._thread_main, + name=f"cdp-supervisor-{self.task_id}", + daemon=True, + ) + self._thread.start() + if not self._ready_event.wait(timeout=timeout): + self.stop() + raise TimeoutError( + f"CDP supervisor did not attach within {timeout}s " + f"(cdp_url={self.cdp_url[:80]}...)" + ) + if self._start_error is not None: + err = self._start_error + self.stop() + raise err + + def stop(self, timeout: float = 5.0) -> None: + """Cancel the supervisor task and join the thread.""" + self._stop_requested = True + loop = self._loop + if loop is not None and loop.is_running(): + # Close the WebSocket from inside the loop — this makes ``async for + # raw in self._ws`` return cleanly, ``_run`` hits its ``finally``, + # pending tasks get cancelled in order, THEN the thread exits. + async def _close_ws(): + ws = self._ws + self._ws = None + if ws is not None: + try: + await ws.close() + except Exception: + pass + + try: + fut = asyncio.run_coroutine_threadsafe(_close_ws(), loop) + try: + fut.result(timeout=2.0) + except Exception: + pass + except RuntimeError: + pass # loop already shutting down + if self._thread is not None: + self._thread.join(timeout=timeout) + with self._state_lock: + self._active = False + + def snapshot(self) -> SupervisorSnapshot: + """Return an immutable snapshot of current state.""" + with self._state_lock: + dialogs = tuple(self._pending_dialogs.values()) + recent = tuple(self._recent_dialogs[-RECENT_DIALOGS_MAX:]) + frames_tree = self._build_frame_tree_locked() + console = tuple(self._console_events[-CONSOLE_HISTORY_MAX:]) + active = self._active + return SupervisorSnapshot( + pending_dialogs=dialogs, + recent_dialogs=recent, + frame_tree=frames_tree, + console_errors=console, + active=active, + cdp_url=self.cdp_url, + task_id=self.task_id, + ) + + def respond_to_dialog( + self, + action: str, + *, + prompt_text: Optional[str] = None, + dialog_id: Optional[str] = None, + timeout: float = 10.0, + ) -> Dict[str, Any]: + """Accept/dismiss a pending dialog. Sync bridge onto the supervisor loop. + + Returns ``{"ok": True, "dialog": {...}}`` on success, + ``{"ok": False, "error": "..."}`` on a recoverable error (no dialog, + ambiguous dialog_id, supervisor inactive). + """ + if action not in ("accept", "dismiss"): + return {"ok": False, "error": f"action must be 'accept' or 'dismiss', got {action!r}"} + + with self._state_lock: + if not self._active: + return {"ok": False, "error": "supervisor is not active"} + pending = list(self._pending_dialogs.values()) + if not pending: + return {"ok": False, "error": "no dialog is currently open"} + if dialog_id: + dialog = self._pending_dialogs.get(dialog_id) + if dialog is None: + return { + "ok": False, + "error": f"dialog_id {dialog_id!r} not found " + f"(known: {sorted(self._pending_dialogs)})", + } + elif len(pending) > 1: + return { + "ok": False, + "error": ( + f"{len(pending)} pending dialogs; specify dialog_id. " + f"Candidates: {[d.id for d in pending]}" + ), + } + else: + dialog = pending[0] + snapshot_copy = dialog + + loop = self._loop + if loop is None: + return {"ok": False, "error": "supervisor loop is not running"} + + async def _do_respond(): + return await self._handle_dialog_cdp( + snapshot_copy, accept=(action == "accept"), prompt_text=prompt_text or "" + ) + + try: + fut = asyncio.run_coroutine_threadsafe(_do_respond(), loop) + fut.result(timeout=timeout) + except Exception as e: + return {"ok": False, "error": f"{type(e).__name__}: {e}"} + return {"ok": True, "dialog": snapshot_copy.to_dict()} + + # ── Supervisor loop internals ──────────────────────────────────────────── + + def _thread_main(self) -> None: + """Entry point for the supervisor's dedicated thread.""" + loop = asyncio.new_event_loop() + self._loop = loop + try: + asyncio.set_event_loop(loop) + loop.run_until_complete(self._run()) + except BaseException as e: # noqa: BLE001 — propagate via _start_error + if not self._ready_event.is_set(): + self._start_error = e + self._ready_event.set() + else: + logger.warning("CDP supervisor %s crashed: %s", self.task_id, e) + finally: + # Flush any remaining tasks before closing the loop so we don't + # emit "Task was destroyed but it is pending" warnings. + try: + pending = [t for t in asyncio.all_tasks(loop) if not t.done()] + for t in pending: + t.cancel() + if pending: + loop.run_until_complete(asyncio.gather(*pending, return_exceptions=True)) + except Exception: + pass + try: + loop.close() + except Exception: + pass + with self._state_lock: + self._active = False + + async def _run(self) -> None: + """Top-level supervisor coroutine. + + Holds a reconnecting loop so we survive the remote closing the + WebSocket — Browserbase in particular tears down the CDP socket + every time a short-lived client (e.g. agent-browser's per-command + CDP client) disconnects. We drop our state snapshot keys that + depend on specific CDP session ids, re-attach, and keep going. + """ + attempt = 0 + last_success_at = 0.0 + backoff = 0.5 + while not self._stop_requested: + try: + self._ws = await asyncio.wait_for( + websockets.connect(self.cdp_url, max_size=50 * 1024 * 1024), + timeout=10.0, + ) + except Exception as e: + attempt += 1 + if not self._ready_event.is_set(): + # Never connected once — fatal for start(). + self._start_error = e + self._ready_event.set() + return + logger.warning( + "CDP supervisor %s: connect failed (attempt %s): %s", + self.task_id, attempt, e, + ) + await asyncio.sleep(min(backoff, 10.0)) + backoff = min(backoff * 2, 10.0) + continue + + reader_task = asyncio.create_task(self._read_loop(), name="cdp-reader") + try: + # Reset per-connection session state so stale ids don't hang + # around after a reconnect. + self._page_session_id = None + self._child_sessions.clear() + # We deliberately keep `_pending_dialogs` and `_frames` — + # they're reconciled as the supervisor resubscribes and + # receives fresh events. Worst case: an agent sees a stale + # dialog entry that the new session's handleJavaScriptDialog + # call rejects with "no dialog is showing" (logged, not + # surfaced). + await self._attach_initial_page() + with self._state_lock: + self._active = True + last_success_at = time.time() + backoff = 0.5 # reset after a successful attach + if not self._ready_event.is_set(): + self._ready_event.set() + # Run until the reader returns. + await reader_task + except BaseException as e: + if not self._ready_event.is_set(): + # Never got to ready — propagate to start(). + self._start_error = e + self._ready_event.set() + raise + logger.warning( + "CDP supervisor %s: session dropped after %.1fs: %s", + self.task_id, + time.time() - last_success_at, + e, + ) + finally: + with self._state_lock: + self._active = False + if not reader_task.done(): + reader_task.cancel() + try: + await reader_task + except (asyncio.CancelledError, Exception): + pass + for handle in list(self._dialog_watchdogs.values()): + handle.cancel() + self._dialog_watchdogs.clear() + ws = self._ws + self._ws = None + if ws is not None: + try: + await ws.close() + except Exception: + pass + + if self._stop_requested: + return + + # Reconnect: brief backoff, then reattach. + logger.debug( + "CDP supervisor %s: reconnecting in %.1fs...", self.task_id, backoff, + ) + await asyncio.sleep(backoff) + backoff = min(backoff * 2, 10.0) + + async def _attach_initial_page(self) -> None: + """Find a page target, attach flattened session, enable domains, install dialog bridge.""" + resp = await self._cdp("Target.getTargets") + targets = resp.get("result", {}).get("targetInfos", []) + page_target = next((t for t in targets if t.get("type") == "page"), None) + if page_target is None: + created = await self._cdp("Target.createTarget", {"url": "about:blank"}) + target_id = created["result"]["targetId"] + else: + target_id = page_target["targetId"] + + attach = await self._cdp( + "Target.attachToTarget", + {"targetId": target_id, "flatten": True}, + ) + self._page_session_id = attach["result"]["sessionId"] + await self._cdp("Page.enable", session_id=self._page_session_id) + await self._cdp("Runtime.enable", session_id=self._page_session_id) + await self._cdp( + "Target.setAutoAttach", + {"autoAttach": True, "waitForDebuggerOnStart": False, "flatten": True}, + session_id=self._page_session_id, + ) + # Install the dialog bridge — overrides native alert/confirm/prompt with + # a synchronous XHR we intercept via Fetch domain. This is how we make + # dialog response work on Browserbase (whose CDP proxy auto-dismisses + # real native dialogs before we can call handleJavaScriptDialog). + await self._install_dialog_bridge(self._page_session_id) + + async def _install_dialog_bridge(self, session_id: str) -> None: + """Install the dialog-bridge init script + Fetch interceptor on a session. + + Two CDP calls: + 1. ``Page.addScriptToEvaluateOnNewDocument`` — the JS override runs + in every frame before any page script. Replaces alert/confirm/ + prompt with a sync XHR to our bridge URL. + 2. ``Fetch.enable`` scoped to the bridge URL — we catch those XHRs, + surface them as pending dialogs, then fulfill once the agent + responds. + + Idempotent at the CDP level: Chromium de-duplicates identical + add-script calls by source, and Fetch.enable replaces prior patterns. + """ + try: + await self._cdp( + "Page.addScriptToEvaluateOnNewDocument", + {"source": _DIALOG_BRIDGE_SCRIPT, "runImmediately": True}, + session_id=session_id, + timeout=5.0, + ) + except Exception as e: + logger.debug( + "dialog bridge: addScriptToEvaluateOnNewDocument failed on sid=%s: %s", + (session_id or "")[:16], e, + ) + try: + await self._cdp( + "Fetch.enable", + { + "patterns": [ + { + "urlPattern": DIALOG_BRIDGE_URL_PATTERN, + "requestStage": "Request", + } + ], + "handleAuthRequests": False, + }, + session_id=session_id, + timeout=5.0, + ) + except Exception as e: + logger.debug( + "dialog bridge: Fetch.enable failed on sid=%s: %s", + (session_id or "")[:16], e, + ) + # Also try to inject into the already-loaded document so existing + # pages pick up the override on reconnect. Best-effort. + try: + await self._cdp( + "Runtime.evaluate", + {"expression": _DIALOG_BRIDGE_SCRIPT, "returnByValue": True}, + session_id=session_id, + timeout=3.0, + ) + except Exception: + pass + + async def _cdp( + self, + method: str, + params: Optional[Dict[str, Any]] = None, + *, + session_id: Optional[str] = None, + timeout: float = 10.0, + ) -> Dict[str, Any]: + """Send a CDP command and await its response.""" + if self._ws is None: + raise RuntimeError("supervisor WebSocket is not connected") + call_id = self._next_call_id + self._next_call_id += 1 + payload: Dict[str, Any] = {"id": call_id, "method": method} + if params: + payload["params"] = params + if session_id: + payload["sessionId"] = session_id + fut: asyncio.Future = asyncio.get_running_loop().create_future() + self._pending_calls[call_id] = fut + await self._ws.send(json.dumps(payload)) + try: + return await asyncio.wait_for(fut, timeout=timeout) + finally: + self._pending_calls.pop(call_id, None) + + async def _read_loop(self) -> None: + """Continuously dispatch incoming CDP frames.""" + assert self._ws is not None + try: + async for raw in self._ws: + if self._stop_requested: + break + try: + msg = json.loads(raw) + except Exception: + logger.debug("CDP supervisor: non-JSON frame dropped") + continue + if "id" in msg: + fut = self._pending_calls.pop(msg["id"], None) + if fut is not None and not fut.done(): + if "error" in msg: + fut.set_exception( + RuntimeError(f"CDP error on id={msg['id']}: {msg['error']}") + ) + else: + fut.set_result(msg) + elif "method" in msg: + await self._on_event(msg["method"], msg.get("params", {}), msg.get("sessionId")) + except Exception as e: + logger.debug("CDP read loop exited: %s", e) + + # ── Event dispatch ────────────────────────────────────────────────────── + + async def _on_event( + self, method: str, params: Dict[str, Any], session_id: Optional[str] + ) -> None: + if method == "Page.javascriptDialogOpening": + await self._on_dialog_opening(params, session_id) + elif method == "Page.javascriptDialogClosed": + await self._on_dialog_closed(params, session_id) + elif method == "Fetch.requestPaused": + await self._on_fetch_paused(params, session_id) + elif method == "Page.frameAttached": + self._on_frame_attached(params, session_id) + elif method == "Page.frameNavigated": + self._on_frame_navigated(params, session_id) + elif method == "Page.frameDetached": + self._on_frame_detached(params, session_id) + elif method == "Target.attachedToTarget": + await self._on_target_attached(params) + elif method == "Target.detachedFromTarget": + self._on_target_detached(params) + elif method == "Runtime.consoleAPICalled": + self._on_console(params, level_from="api") + elif method == "Runtime.exceptionThrown": + self._on_console(params, level_from="exception") + + async def _on_dialog_opening( + self, params: Dict[str, Any], session_id: Optional[str] + ) -> None: + self._dialog_seq += 1 + dialog = PendingDialog( + id=f"d-{self._dialog_seq}", + type=str(params.get("type") or ""), + message=str(params.get("message") or ""), + default_prompt=str(params.get("defaultPrompt") or ""), + opened_at=time.time(), + cdp_session_id=session_id or self._page_session_id or "", + frame_id=params.get("frameId"), + ) + + if self.dialog_policy == DIALOG_POLICY_AUTO_DISMISS: + # Archive immediately with the policy tag so the ``closed`` event + # arriving right after our handleJavaScriptDialog call doesn't + # re-archive it as "remote". + with self._state_lock: + self._archive_dialog_locked(dialog, "auto_policy") + asyncio.create_task( + self._auto_handle_dialog(dialog, accept=False, prompt_text="") + ) + elif self.dialog_policy == DIALOG_POLICY_AUTO_ACCEPT: + with self._state_lock: + self._archive_dialog_locked(dialog, "auto_policy") + asyncio.create_task( + self._auto_handle_dialog( + dialog, accept=True, prompt_text=dialog.default_prompt + ) + ) + else: + # must_respond → add to pending and arm watchdog. + with self._state_lock: + self._pending_dialogs[dialog.id] = dialog + loop = asyncio.get_running_loop() + handle = loop.call_later( + self.dialog_timeout_s, + lambda: asyncio.create_task(self._dialog_timeout_expired(dialog.id)), + ) + self._dialog_watchdogs[dialog.id] = handle + + async def _auto_handle_dialog( + self, dialog: PendingDialog, *, accept: bool, prompt_text: str + ) -> None: + """Send handleJavaScriptDialog for auto_dismiss/auto_accept. + + Dialog has already been archived by the caller (``_on_dialog_opening``); + this just fires the CDP call so the page unblocks. + """ + params: Dict[str, Any] = {"accept": accept} + if dialog.type == "prompt": + params["promptText"] = prompt_text + try: + await self._cdp( + "Page.handleJavaScriptDialog", + params, + session_id=dialog.cdp_session_id or None, + timeout=5.0, + ) + except Exception as e: + logger.debug("auto-handle CDP call failed for %s: %s", dialog.id, e) + + async def _dialog_timeout_expired(self, dialog_id: str) -> None: + with self._state_lock: + dialog = self._pending_dialogs.get(dialog_id) + if dialog is None: + return + logger.warning( + "CDP supervisor %s: dialog %s (%s) auto-dismissed after %ss timeout", + self.task_id, + dialog_id, + dialog.type, + self.dialog_timeout_s, + ) + try: + # Archive with watchdog tag BEFORE fulfilling / dismissing. + with self._state_lock: + if dialog_id in self._pending_dialogs: + self._pending_dialogs.pop(dialog_id, None) + self._archive_dialog_locked(dialog, "watchdog") + # Unblock the page — via bridge Fetch fulfill for bridge dialogs, + # else native Page.handleJavaScriptDialog for real dialogs. + if dialog.bridge_request_id: + await self._fulfill_bridge_request(dialog, accept=False, prompt_text="") + else: + await self._cdp( + "Page.handleJavaScriptDialog", + {"accept": False}, + session_id=dialog.cdp_session_id or None, + timeout=5.0, + ) + except Exception as e: + logger.debug("auto-dismiss failed for %s: %s", dialog_id, e) + + def _archive_dialog_locked(self, dialog: PendingDialog, closed_by: str) -> None: + """Move a pending dialog to the recent_dialogs ring buffer. Must hold state_lock.""" + record = DialogRecord( + id=dialog.id, + type=dialog.type, + message=dialog.message, + opened_at=dialog.opened_at, + closed_at=time.time(), + closed_by=closed_by, + frame_id=dialog.frame_id, + ) + self._recent_dialogs.append(record) + if len(self._recent_dialogs) > RECENT_DIALOGS_MAX * 2: + self._recent_dialogs = self._recent_dialogs[-RECENT_DIALOGS_MAX:] + + async def _handle_dialog_cdp( + self, dialog: PendingDialog, *, accept: bool, prompt_text: str + ) -> None: + """Send the Page.handleJavaScriptDialog CDP command (agent path only). + + Routes to the bridge-fulfill path when the dialog was captured via + the injected XHR override (see ``_on_fetch_paused``). + """ + if dialog.bridge_request_id: + try: + await self._fulfill_bridge_request( + dialog, accept=accept, prompt_text=prompt_text + ) + finally: + with self._state_lock: + if dialog.id in self._pending_dialogs: + self._pending_dialogs.pop(dialog.id, None) + self._archive_dialog_locked(dialog, "agent") + handle = self._dialog_watchdogs.pop(dialog.id, None) + if handle is not None: + handle.cancel() + return + + params: Dict[str, Any] = {"accept": accept} + if dialog.type == "prompt": + params["promptText"] = prompt_text + try: + await self._cdp( + "Page.handleJavaScriptDialog", + params, + session_id=dialog.cdp_session_id or None, + timeout=5.0, + ) + finally: + # Clear regardless — the CDP error path usually means the dialog + # already closed (browser auto-dismissed after navigation, etc.). + with self._state_lock: + if dialog.id in self._pending_dialogs: + self._pending_dialogs.pop(dialog.id, None) + self._archive_dialog_locked(dialog, "agent") + handle = self._dialog_watchdogs.pop(dialog.id, None) + if handle is not None: + handle.cancel() + + async def _on_dialog_closed( + self, params: Dict[str, Any], session_id: Optional[str] + ) -> None: + # ``Page.javascriptDialogClosed`` spec has only ``result`` (bool) and + # ``userInput`` (string), not the original ``message``. Match by + # session id and clear the oldest dialog on that session — if Chrome + # closed one on us (e.g. our disconnect auto-dismissed it, or the + # browser navigated, or Browserbase's CDP proxy auto-dismissed), there + # shouldn't be more than one in flight per session anyway because the + # JS thread is blocked while a dialog is up. + with self._state_lock: + candidate_ids = [ + d.id + for d in self._pending_dialogs.values() + if d.cdp_session_id == session_id + # Bridge-captured dialogs aren't cleared by native close events; + # they're resolved via Fetch.fulfillRequest instead. Only the + # real-native-dialog path uses Page.javascriptDialogClosed. + and d.bridge_request_id is None + ] + if candidate_ids: + did = candidate_ids[0] + dialog = self._pending_dialogs.pop(did, None) + if dialog is not None: + self._archive_dialog_locked(dialog, "remote") + handle = self._dialog_watchdogs.pop(did, None) + if handle is not None: + handle.cancel() + + async def _on_fetch_paused( + self, params: Dict[str, Any], session_id: Optional[str] + ) -> None: + """Bridge XHR captured mid-flight — materialize as a pending dialog. + + The injected script (``_DIALOG_BRIDGE_SCRIPT``) fires a synchronous + XHR to ``DIALOG_BRIDGE_HOST`` whenever page code calls alert/confirm/ + prompt. We catch it via Fetch.enable pattern; the page's JS thread + is blocked on the XHR's response until we call Fetch.fulfillRequest + (which happens from ``respond_to_dialog``) or until the watchdog + fires (at which point we fulfill with a cancel response). + """ + url = str(params.get("request", {}).get("url") or "") + request_id = params.get("requestId") + if not request_id: + return + # Only care about our bridge URLs. Fetch can still deliver other + # intercepted requests if patterns were ever broadened. + if DIALOG_BRIDGE_HOST not in url: + # Not ours — forward unchanged so the page sees its own request. + try: + await self._cdp( + "Fetch.continueRequest", {"requestId": request_id}, + session_id=session_id, timeout=3.0, + ) + except Exception: + pass + return + + # Parse query string for dialog metadata. Use urllib to be robust. + from urllib.parse import urlparse, parse_qs + q = parse_qs(urlparse(url).query) + + def _q(name: str) -> str: + v = q.get(name, [""]) + return v[0] if v else "" + + kind = _q("kind") or "alert" + message = _q("message") + default_prompt = _q("default_prompt") + + self._dialog_seq += 1 + dialog = PendingDialog( + id=f"d-{self._dialog_seq}", + type=kind, + message=message, + default_prompt=default_prompt, + opened_at=time.time(), + cdp_session_id=session_id or self._page_session_id or "", + frame_id=params.get("frameId"), + bridge_request_id=str(request_id), + ) + + # Apply policy exactly as for native dialogs. + if self.dialog_policy == DIALOG_POLICY_AUTO_DISMISS: + with self._state_lock: + self._archive_dialog_locked(dialog, "auto_policy") + asyncio.create_task( + self._fulfill_bridge_request(dialog, accept=False, prompt_text="") + ) + elif self.dialog_policy == DIALOG_POLICY_AUTO_ACCEPT: + with self._state_lock: + self._archive_dialog_locked(dialog, "auto_policy") + asyncio.create_task( + self._fulfill_bridge_request( + dialog, accept=True, prompt_text=default_prompt + ) + ) + else: + # must_respond — add to pending + arm watchdog. + with self._state_lock: + self._pending_dialogs[dialog.id] = dialog + loop = asyncio.get_running_loop() + handle = loop.call_later( + self.dialog_timeout_s, + lambda: asyncio.create_task(self._dialog_timeout_expired(dialog.id)), + ) + self._dialog_watchdogs[dialog.id] = handle + + async def _fulfill_bridge_request( + self, dialog: PendingDialog, *, accept: bool, prompt_text: str + ) -> None: + """Resolve a bridge XHR via Fetch.fulfillRequest so the page unblocks.""" + if not dialog.bridge_request_id: + return + payload = { + "accept": bool(accept), + "prompt_text": prompt_text if dialog.type == "prompt" else "", + "dialog_id": dialog.id, + } + body = json.dumps(payload).encode() + try: + import base64 as _b64 + await self._cdp( + "Fetch.fulfillRequest", + { + "requestId": dialog.bridge_request_id, + "responseCode": 200, + "responseHeaders": [ + {"name": "Content-Type", "value": "application/json"}, + {"name": "Access-Control-Allow-Origin", "value": "*"}, + ], + "body": _b64.b64encode(body).decode(), + }, + session_id=dialog.cdp_session_id or None, + timeout=5.0, + ) + except Exception as e: + logger.debug("bridge fulfill failed for %s: %s", dialog.id, e) + + # ── Frame / target tracking ───────────────────────────────────────────── + + def _on_frame_attached( + self, params: Dict[str, Any], session_id: Optional[str] + ) -> None: + frame_id = params.get("frameId") + if not frame_id: + return + with self._state_lock: + self._frames[frame_id] = FrameInfo( + frame_id=frame_id, + url="", + origin="", + parent_frame_id=params.get("parentFrameId"), + is_oopif=False, + cdp_session_id=session_id, + ) + + def _on_frame_navigated( + self, params: Dict[str, Any], session_id: Optional[str] + ) -> None: + frame = params.get("frame") or {} + frame_id = frame.get("id") + if not frame_id: + return + with self._state_lock: + existing = self._frames.get(frame_id) + info = FrameInfo( + frame_id=frame_id, + url=str(frame.get("url") or ""), + origin=str(frame.get("securityOrigin") or frame.get("origin") or ""), + parent_frame_id=frame.get("parentId") or (existing.parent_frame_id if existing else None), + is_oopif=bool(existing.is_oopif if existing else False), + cdp_session_id=existing.cdp_session_id if existing else session_id, + name=str(frame.get("name") or (existing.name if existing else "")), + ) + self._frames[frame_id] = info + + def _on_frame_detached( + self, params: Dict[str, Any], session_id: Optional[str] + ) -> None: + """Remove a frame from our state only when it's truly gone. + + CDP emits ``Page.frameDetached`` with a ``reason`` of either + ``"remove"`` (the frame is actually gone from the DOM) or ``"swap"`` + (the frame is migrating to a new process — typical when a + same-process iframe becomes an OOPIF, or when history navigates). + Dropping on ``swap`` would hide OOPIFs from the agent the moment + Chromium promotes them to their own process, so treat swap as a + no-op. + + Even with ``reason=remove``, the parent page's perspective is + "the child frame left MY process tree" — which is what happens + when a same-origin iframe gets promoted to an OOPIF. If we + already have a live child CDP session attached for that frame_id, + the frame is still very much alive; only drop it when we have + no session record. + """ + frame_id = params.get("frameId") + if not frame_id: + return + reason = str(params.get("reason") or "remove").lower() + if reason == "swap": + return + with self._state_lock: + existing = self._frames.get(frame_id) + # Keep OOPIF records even when the parent says the frame was + # "removed" — the iframe is still visible, just in a different + # process. If the frame truly goes away later, Target.detached + # + the next Page.frameDetached without a live session will + # clear it. + if existing and existing.is_oopif and existing.cdp_session_id: + return + self._frames.pop(frame_id, None) + + async def _on_target_attached(self, params: Dict[str, Any]) -> None: + info = params.get("targetInfo") or {} + sid = params.get("sessionId") + target_type = info.get("type") + if not sid or target_type not in ("iframe", "worker"): + return + self._child_sessions[sid] = {"info": info, "type": target_type} + + # Record the frame with its OOPIF session id for interaction routing. + if target_type == "iframe": + target_id = info.get("targetId") + with self._state_lock: + existing = self._frames.get(target_id) + self._frames[target_id] = FrameInfo( + frame_id=target_id, + url=str(info.get("url") or ""), + origin="", # filled by frameNavigated on the child session + parent_frame_id=(existing.parent_frame_id if existing else None), + is_oopif=True, + cdp_session_id=sid, + name=str(info.get("title") or (existing.name if existing else "")), + ) + + # Enable domains on the child off-loop so the reader keeps pumping. + # Awaiting the CDP replies here would deadlock because only the + # reader can resolve those replies' Futures. + asyncio.create_task(self._enable_child_domains(sid)) + + async def _enable_child_domains(self, sid: str) -> None: + """Enable Page+Runtime (+nested setAutoAttach) on a child CDP session. + + Also installs the dialog bridge so iframe-scoped alert/confirm/prompt + calls round-trip through Fetch too. + """ + try: + await self._cdp("Page.enable", session_id=sid, timeout=3.0) + await self._cdp("Runtime.enable", session_id=sid, timeout=3.0) + await self._cdp( + "Target.setAutoAttach", + {"autoAttach": True, "waitForDebuggerOnStart": False, "flatten": True}, + session_id=sid, + timeout=3.0, + ) + except Exception as e: + logger.debug("child session %s setup failed: %s", sid[:16], e) + # Install the dialog bridge on the child so iframe dialogs are captured. + await self._install_dialog_bridge(sid) + + def _on_target_detached(self, params: Dict[str, Any]) -> None: + """Handle a child CDP session detaching. + + We deliberately DO NOT drop frames from ``_frames`` here — Browserbase + fires transient detach events during page transitions even while the + iframe is still visible to the user, and dropping the record hides + OOPIFs from the agent between the detach and the next + ``Target.attachedToTarget``. Instead, we just clear the session + binding so stale ``cdp_session_id`` values aren't used for routing. + If the iframe truly goes away, ``Page.frameDetached`` will clean up. + """ + sid = params.get("sessionId") + if not sid: + return + self._child_sessions.pop(sid, None) + with self._state_lock: + for fid, frame in list(self._frames.items()): + if frame.cdp_session_id == sid: + # Replace with a copy that has cdp_session_id cleared so + # routing falls back to top-level page session if retried. + self._frames[fid] = FrameInfo( + frame_id=frame.frame_id, + url=frame.url, + origin=frame.origin, + parent_frame_id=frame.parent_frame_id, + is_oopif=frame.is_oopif, + cdp_session_id=None, + name=frame.name, + ) + + # ── Console / exception ring buffer ───────────────────────────────────── + + def _on_console(self, params: Dict[str, Any], *, level_from: str) -> None: + if level_from == "exception": + details = params.get("exceptionDetails") or {} + text = str(details.get("text") or "") + url = details.get("url") + event = ConsoleEvent(ts=time.time(), level="exception", text=text, url=url) + else: + raw_level = str(params.get("type") or "log") + level = "error" if raw_level in ("error", "assert") else ( + "warning" if raw_level == "warning" else "log" + ) + args = params.get("args") or [] + parts: List[str] = [] + for a in args[:4]: + if isinstance(a, dict): + parts.append(str(a.get("value") or a.get("description") or "")) + event = ConsoleEvent(ts=time.time(), level=level, text=" ".join(parts)) + with self._state_lock: + self._console_events.append(event) + if len(self._console_events) > CONSOLE_HISTORY_MAX * 2: + # Keep last CONSOLE_HISTORY_MAX; allow 2x slack to reduce churn. + self._console_events = self._console_events[-CONSOLE_HISTORY_MAX:] + + # ── Frame tree building (bounded) ─────────────────────────────────────── + + def _build_frame_tree_locked(self) -> Dict[str, Any]: + """Build the capped frame_tree payload. Must be called under state lock.""" + frames = self._frames + if not frames: + return {"top": None, "children": [], "truncated": False} + + # Identify a top frame — one with no parent, preferring oopif=False. + tops = [f for f in frames.values() if not f.parent_frame_id] + top = next((f for f in tops if not f.is_oopif), tops[0] if tops else None) + + # BFS from top, capped by FRAME_TREE_MAX_ENTRIES and + # FRAME_TREE_MAX_OOPIF_DEPTH for OOPIF branches. + children: List[Dict[str, Any]] = [] + truncated = False + if top is None: + return {"top": None, "children": [], "truncated": False} + + queue: List[Tuple[FrameInfo, int]] = [ + (f, 1) for f in frames.values() if f.parent_frame_id == top.frame_id + ] + visited: set[str] = {top.frame_id} + while queue and len(children) < FRAME_TREE_MAX_ENTRIES: + frame, depth = queue.pop(0) + if frame.frame_id in visited: + continue + visited.add(frame.frame_id) + if frame.is_oopif and depth > FRAME_TREE_MAX_OOPIF_DEPTH: + truncated = True + continue + children.append(frame.to_dict()) + for f in frames.values(): + if f.parent_frame_id == frame.frame_id and f.frame_id not in visited: + queue.append((f, depth + 1)) + if queue: + truncated = True + + return { + "top": top.to_dict(), + "children": children, + "truncated": truncated, + } + + +# ── Registry ───────────────────────────────────────────────────────────────── + + +class _SupervisorRegistry: + """Process-global (task_id → supervisor) map with idempotent start/stop. + + One instance, exposed as ``SUPERVISOR_REGISTRY``. Safe to call from any + thread — mutations go through ``_lock``. + """ + + def __init__(self) -> None: + self._lock = threading.Lock() + self._by_task: Dict[str, CDPSupervisor] = {} + + def get(self, task_id: str) -> Optional[CDPSupervisor]: + """Return the supervisor for ``task_id`` if running, else ``None``.""" + with self._lock: + return self._by_task.get(task_id) + + def get_or_start( + self, + task_id: str, + cdp_url: str, + *, + dialog_policy: str = DEFAULT_DIALOG_POLICY, + dialog_timeout_s: float = DEFAULT_DIALOG_TIMEOUT_S, + start_timeout: float = 15.0, + ) -> CDPSupervisor: + """Idempotently ensure a supervisor is running for ``(task_id, cdp_url)``. + + If a supervisor exists for this task but was bound to a different + ``cdp_url``, the old one is stopped and a fresh one is started. + """ + with self._lock: + existing = self._by_task.get(task_id) + if existing is not None: + if existing.cdp_url == cdp_url: + return existing + # URL changed — tear down old, fall through to re-create. + self._by_task.pop(task_id, None) + if existing is not None: + existing.stop() + + supervisor = CDPSupervisor( + task_id=task_id, + cdp_url=cdp_url, + dialog_policy=dialog_policy, + dialog_timeout_s=dialog_timeout_s, + ) + supervisor.start(timeout=start_timeout) + with self._lock: + # Guard against a concurrent get_or_start from another thread. + already = self._by_task.get(task_id) + if already is not None and already.cdp_url == cdp_url: + supervisor.stop() + return already + self._by_task[task_id] = supervisor + return supervisor + + def stop(self, task_id: str) -> None: + """Stop and discard the supervisor for ``task_id`` if it exists.""" + with self._lock: + supervisor = self._by_task.pop(task_id, None) + if supervisor is not None: + supervisor.stop() + + def stop_all(self) -> None: + """Stop every running supervisor. For shutdown / test teardown.""" + with self._lock: + items = list(self._by_task.items()) + self._by_task.clear() + for _, supervisor in items: + supervisor.stop() + + +SUPERVISOR_REGISTRY = _SupervisorRegistry() + + +__all__ = [ + "CDPSupervisor", + "ConsoleEvent", + "DEFAULT_DIALOG_POLICY", + "DEFAULT_DIALOG_TIMEOUT_S", + "DIALOG_POLICY_AUTO_ACCEPT", + "DIALOG_POLICY_AUTO_DISMISS", + "DIALOG_POLICY_MUST_RESPOND", + "DialogRecord", + "FrameInfo", + "PendingDialog", + "SUPERVISOR_REGISTRY", + "SupervisorSnapshot", + "_SupervisorRegistry", +] diff --git a/tools/browser_tool.py b/tools/browser_tool.py index e46636ad9..469e9be28 100644 --- a/tools/browser_tool.py +++ b/tools/browser_tool.py @@ -63,7 +63,7 @@ import tempfile import threading import time import requests -from typing import Dict, Any, Optional, List +from typing import Dict, Any, Optional, List, Tuple from pathlib import Path from agent.auxiliary_client import call_llm from hermes_constants import get_hermes_home @@ -287,6 +287,100 @@ def _get_cdp_override() -> str: return "" +def _get_dialog_policy_config() -> Tuple[str, float]: + """Read ``browser.dialog_policy`` + ``browser.dialog_timeout_s`` from config. + + Returns a ``(policy, timeout_s)`` tuple, falling back to the supervisor's + defaults when keys are absent or invalid. + """ + # Defer imports so browser_tool can be imported in minimal environments. + from tools.browser_supervisor import ( + DEFAULT_DIALOG_POLICY, + DEFAULT_DIALOG_TIMEOUT_S, + _VALID_POLICIES, + ) + + try: + from hermes_cli.config import read_raw_config + + cfg = read_raw_config() + browser_cfg = cfg.get("browser", {}) if isinstance(cfg, dict) else {} + if not isinstance(browser_cfg, dict): + return DEFAULT_DIALOG_POLICY, DEFAULT_DIALOG_TIMEOUT_S + policy = str(browser_cfg.get("dialog_policy") or DEFAULT_DIALOG_POLICY) + if policy not in _VALID_POLICIES: + logger.debug("Invalid browser.dialog_policy=%r; using default", policy) + policy = DEFAULT_DIALOG_POLICY + timeout_raw = browser_cfg.get("dialog_timeout_s") + try: + timeout_s = float(timeout_raw) if timeout_raw is not None else DEFAULT_DIALOG_TIMEOUT_S + if timeout_s <= 0: + timeout_s = DEFAULT_DIALOG_TIMEOUT_S + except (TypeError, ValueError): + timeout_s = DEFAULT_DIALOG_TIMEOUT_S + return policy, timeout_s + except Exception: + return DEFAULT_DIALOG_POLICY, DEFAULT_DIALOG_TIMEOUT_S + + +def _ensure_cdp_supervisor(task_id: str) -> None: + """Start a CDP supervisor for ``task_id`` if an endpoint is reachable. + + Idempotent — delegates to ``SupervisorRegistry.get_or_start`` which skips + when a supervisor for this ``(task_id, cdp_url)`` already exists and + tears down + restarts on URL change. Safe to call on every + ``browser_navigate`` / ``/browser connect`` without worrying about + double-attach. + + Resolves the CDP URL in this order: + 1. ``BROWSER_CDP_URL`` / ``browser.cdp_url`` — covers ``/browser connect`` + and config-set overrides. + 2. ``_active_sessions[task_id]["cdp_url"]`` — covers Browserbase + any + other cloud provider whose ``create_session`` returns a raw CDP URL. + + Swallows all errors — failing to attach the supervisor must not break + the browser session itself. The agent simply won't see + ``pending_dialogs`` / ``frame_tree`` fields in snapshots. + """ + cdp_url = _get_cdp_override() + if not cdp_url: + # Fallback: active session may carry a per-session CDP URL from a + # cloud provider (Browserbase sets this). + with _cleanup_lock: + session_info = _active_sessions.get(task_id, {}) + maybe = str(session_info.get("cdp_url") or "") + if maybe: + cdp_url = _resolve_cdp_override(maybe) + if not cdp_url: + return + try: + from tools.browser_supervisor import SUPERVISOR_REGISTRY # type: ignore[import-not-found] + + policy, timeout_s = _get_dialog_policy_config() + SUPERVISOR_REGISTRY.get_or_start( + task_id=task_id, + cdp_url=cdp_url, + dialog_policy=policy, + dialog_timeout_s=timeout_s, + ) + except Exception as exc: + logger.debug( + "CDP supervisor attach for task=%s failed (non-fatal): %s", + task_id, + exc, + ) + + +def _stop_cdp_supervisor(task_id: str) -> None: + """Stop the CDP supervisor for ``task_id`` if one exists. No-op otherwise.""" + try: + from tools.browser_supervisor import SUPERVISOR_REGISTRY # type: ignore[import-not-found] + + SUPERVISOR_REGISTRY.stop(task_id) + except Exception as exc: + logger.debug("CDP supervisor stop for task=%s failed (non-fatal): %s", task_id, exc) + + # ============================================================================ # Cloud Provider Registry # ============================================================================ @@ -995,7 +1089,12 @@ def _get_session_info(task_id: Optional[str] = None) -> Dict[str, str]: if task_id in _active_sessions: return _active_sessions[task_id] _active_sessions[task_id] = session_info - + + # Lazy-start the CDP supervisor now that the session exists (if the + # backend surfaces a CDP URL via override or session_info["cdp_url"]). + # Idempotent; swallows errors. See _ensure_cdp_supervisor for details. + _ensure_cdp_supervisor(task_id) + return session_info @@ -1455,7 +1554,7 @@ def browser_navigate(url: str, task_id: Optional[str] = None) -> str: if is_first_nav: session_info["_first_nav"] = False _maybe_start_recording(effective_task_id) - + result = _run_browser_command(effective_task_id, "open", [url], timeout=max(_get_command_timeout(), 60)) if result.get("success"): @@ -1578,7 +1677,20 @@ def browser_snapshot( "snapshot": snapshot_text, "element_count": len(refs) if refs else 0 } - + + # Merge supervisor state (pending dialogs + frame tree) when a CDP + # supervisor is attached to this task. No-op otherwise. See + # website/docs/developer-guide/browser-supervisor.md. + try: + from tools.browser_supervisor import SUPERVISOR_REGISTRY # type: ignore[import-not-found] + _supervisor = SUPERVISOR_REGISTRY.get(effective_task_id) + if _supervisor is not None: + _sv_snap = _supervisor.snapshot() + if _sv_snap.active: + response.update(_sv_snap.to_dict()) + except Exception as _sv_exc: + logger.debug("supervisor snapshot merge failed: %s", _sv_exc) + return json.dumps(response, ensure_ascii=False) else: return json.dumps({ @@ -2248,7 +2360,11 @@ def cleanup_browser(task_id: Optional[str] = None) -> None: """ if task_id is None: task_id = "default" - + + # Stop the CDP supervisor for this task FIRST so we close our WebSocket + # before the backend tears down the underlying CDP endpoint. + _stop_cdp_supervisor(task_id) + # Also clean up Camofox session if running in Camofox mode. # Skip full close when managed persistence is enabled — the browser # profile (and its session cookies) must survive across agent tasks. @@ -2329,6 +2445,13 @@ def cleanup_all_browsers() -> None: for task_id in task_ids: cleanup_browser(task_id) + # Tear down CDP supervisors for all tasks so background threads exit. + try: + from tools.browser_supervisor import SUPERVISOR_REGISTRY # type: ignore[import-not-found] + SUPERVISOR_REGISTRY.stop_all() + except Exception: + pass + # Reset cached lookups so they are re-evaluated on next use. global _cached_agent_browser, _agent_browser_resolved global _cached_command_timeout, _command_timeout_resolved diff --git a/toolsets.py b/toolsets.py index 975d8883c..b3cdb2e7a 100644 --- a/toolsets.py +++ b/toolsets.py @@ -43,7 +43,7 @@ _HERMES_CORE_TOOLS = [ "browser_navigate", "browser_snapshot", "browser_click", "browser_type", "browser_scroll", "browser_back", "browser_press", "browser_get_images", - "browser_vision", "browser_console", "browser_cdp", + "browser_vision", "browser_console", "browser_cdp", "browser_dialog", # Text-to-speech "text_to_speech", # Planning & memory @@ -115,7 +115,8 @@ TOOLSETS = { "browser_navigate", "browser_snapshot", "browser_click", "browser_type", "browser_scroll", "browser_back", "browser_press", "browser_get_images", - "browser_vision", "browser_console", "browser_cdp", "web_search" + "browser_vision", "browser_console", "browser_cdp", + "browser_dialog", "web_search" ], "includes": [] }, @@ -249,7 +250,7 @@ TOOLSETS = { "browser_navigate", "browser_snapshot", "browser_click", "browser_type", "browser_scroll", "browser_back", "browser_press", "browser_get_images", - "browser_vision", "browser_console", "browser_cdp", + "browser_vision", "browser_console", "browser_cdp", "browser_dialog", "todo", "memory", "session_search", "execute_code", "delegate_task", @@ -274,7 +275,7 @@ TOOLSETS = { "browser_navigate", "browser_snapshot", "browser_click", "browser_type", "browser_scroll", "browser_back", "browser_press", "browser_get_images", - "browser_vision", "browser_console", "browser_cdp", + "browser_vision", "browser_console", "browser_cdp", "browser_dialog", # Planning & memory "todo", "memory", # Session history search diff --git a/website/docs/developer-guide/browser-supervisor.md b/website/docs/developer-guide/browser-supervisor.md new file mode 100644 index 000000000..d0aa34dbb --- /dev/null +++ b/website/docs/developer-guide/browser-supervisor.md @@ -0,0 +1,223 @@ +# Browser CDP Supervisor — Design + +**Status:** Shipped (PR 14540) +**Last updated:** 2026-04-23 +**Author:** @teknium1 + +## Problem + +Native JS dialogs (`alert`/`confirm`/`prompt`/`beforeunload`) and iframes are +the two biggest gaps in our browser tooling: + +1. **Dialogs block the JS thread.** Any operation on the page stalls until the + dialog is handled. Before this work, the agent had no way to know a dialog + was open — subsequent tool calls would hang or throw opaque errors. +2. **Iframes are invisible.** The agent could see iframe nodes in the DOM + snapshot but could not click, type, or eval inside them — especially + cross-origin (OOPIF) iframes that live in separate Chromium processes. + +[PR #12550](https://github.com/NousResearch/hermes-agent/pull/12550) proposed a +stateless `browser_dialog` wrapper. That doesn't solve detection — it's a +cleaner CDP call for when the agent already knows (via symptoms) that a dialog +is open. Closed as superseded. + +## Backend capability matrix (verified live 2026-04-23) + +Using throwaway probe scripts against a data-URL page that fires alerts in the +main frame and in a same-origin srcdoc iframe, plus a cross-origin +`https://example.com` iframe: + +| Backend | Dialog detect | Dialog respond | Frame tree | OOPIF `Runtime.evaluate` via `browser_cdp(frame_id=...)` | +|---|---|---|---|---| +| Local Chrome (`--remote-debugging-port`) / `/browser connect` | ✓ | ✓ full workflow | ✓ | ✓ | +| Browserbase | ✓ (via bridge) | ✓ full workflow (via bridge) | ✓ | ✓ (`document.title = "Example Domain"` verified on real cross-origin iframe) | +| Camofox | ✗ no CDP (REST-only) | ✗ | partial via DOM snapshot | ✗ | + +**How Browserbase respond works.** Browserbase's CDP proxy uses Playwright +internally and auto-dismisses native dialogs within ~10ms, so +`Page.handleJavaScriptDialog` can't keep up. To work around this, the +supervisor injects a bridge script via +`Page.addScriptToEvaluateOnNewDocument` that overrides +`window.alert`/`confirm`/`prompt` with a synchronous XHR to a magic host +(`hermes-dialog-bridge.invalid`). `Fetch.enable` intercepts those XHRs +before they touch the network — the dialog becomes a `Fetch.requestPaused` +event the supervisor captures, and `respond_to_dialog` fulfills via +`Fetch.fulfillRequest` with a JSON body the injected script decodes. + +Net result: from the page's perspective, `prompt()` still returns the +agent-supplied string. From the agent's perspective, it's the same +`browser_dialog(action=...)` API either way. Tested end-to-end against +real Browserbase sessions — 4/4 (alert/prompt/confirm-accept/confirm-dismiss) +pass including value round-tripping back into page JS. + +Camofox stays unsupported for this PR; follow-up upstream issue planned at +`jo-inc/camofox-browser` requesting a dialog polling endpoint. + +## Architecture + +### CDPSupervisor + +One `asyncio.Task` running in a background daemon thread per Hermes `task_id`. +Holds a persistent WebSocket to the backend's CDP endpoint. Maintains: + +- **Dialog queue** — `List[PendingDialog]` with `{id, type, message, default_prompt, session_id, opened_at}` +- **Frame tree** — `Dict[frame_id, FrameInfo]` with parent relationships, URL, origin, whether cross-origin child session +- **Session map** — `Dict[session_id, SessionInfo]` so interaction tools can route to the right attached session for OOPIF operations +- **Recent console errors** — ring buffer of the last 50 (for PR 2 diagnostics) + +Subscribes on attach: +- `Page.enable` — `javascriptDialogOpening`, `frameAttached`, `frameNavigated`, `frameDetached` +- `Runtime.enable` — `executionContextCreated`, `consoleAPICalled`, `exceptionThrown` +- `Target.setAutoAttach {autoAttach: true, flatten: true}` — surfaces child OOPIF targets; supervisor enables `Page`+`Runtime` on each + +Thread-safe state access via a snapshot lock; tool handlers (sync) read the +frozen snapshot without awaiting. + +### Lifecycle + +- **Start:** `SupervisorRegistry.get_or_start(task_id, cdp_url)` — called by + `browser_navigate`, Browserbase session create, `/browser connect`. Idempotent. +- **Stop:** session teardown or `/browser disconnect`. Cancels the asyncio + task, closes the WebSocket, discards state. +- **Rebind:** if the CDP URL changes (user reconnects to a new Chrome), stop + the old supervisor and start fresh — never reuse state across endpoints. + +### Dialog policy + +Configurable via `config.yaml` under `browser.dialog_policy`: + +- **`must_respond`** (default) — capture, surface in `browser_snapshot`, wait + for explicit `browser_dialog(action=...)` call. After a 300s safety timeout + with no response, auto-dismiss and log. Prevents a buggy agent from stalling + forever. +- `auto_dismiss` — record and dismiss immediately; agent sees it after the + fact via `browser_state` inside `browser_snapshot`. +- `auto_accept` — record and accept (useful for `beforeunload` where the user + wants to navigate away cleanly). + +Policy is per-task; no per-dialog overrides in v1. + +## Agent surface (PR 1) + +### One new tool + +``` +browser_dialog(action, prompt_text=None, dialog_id=None) +``` + +- `action="accept"` / `"dismiss"` → responds to the specified or sole pending dialog (required) +- `prompt_text=...` → text to supply to a `prompt()` dialog +- `dialog_id=...` → disambiguate when multiple dialogs queued (rare) + +Tool is response-only. Agent reads pending dialogs from `browser_snapshot` +output before calling. + +### `browser_snapshot` extension + +Adds three optional fields to the existing snapshot output when a supervisor +is attached: + +```json +{ + "pending_dialogs": [ + {"id": "d-1", "type": "alert", "message": "Hello", "opened_at": 1650000000.0} + ], + "recent_dialogs": [ + {"id": "d-1", "type": "alert", "message": "...", "opened_at": 1650000000.0, + "closed_at": 1650000000.1, "closed_by": "remote"} + ], + "frame_tree": { + "top": {"frame_id": "FRAME_A", "url": "https://example.com/", "origin": "https://example.com"}, + "children": [ + {"frame_id": "FRAME_B", "url": "about:srcdoc", "is_oopif": false}, + {"frame_id": "FRAME_C", "url": "https://ads.example.net/", "is_oopif": true, "session_id": "SID_C"} + ], + "truncated": false + } +} +``` + +- **`pending_dialogs`**: dialogs currently blocking the page's JS thread. + The agent must call `browser_dialog(action=...)` to respond. Empty on + Browserbase because their CDP proxy auto-dismisses within ~10ms. + +- **`recent_dialogs`**: ring buffer of up to 20 recently-closed dialogs with + a `closed_by` tag — `"agent"` (we responded), `"auto_policy"` (local + auto_dismiss/auto_accept), `"watchdog"` (must_respond timeout hit), or + `"remote"` (browser/backend closed it on us, e.g. Browserbase). This is + how agents on Browserbase still get visibility into what happened. + +- **`frame_tree`**: frame structure including cross-origin (OOPIF) children. + Capped at 30 entries + OOPIF depth 2 to bound snapshot size on ad-heavy + pages. `truncated: true` surfaces when limits were hit; agents needing + the full tree can use `browser_cdp` with `Page.getFrameTree`. + +No new tool schema surface for any of these — the agent reads the snapshot +it already requests. + +### Availability gating + +Both surfaces gate on `_browser_cdp_check` (supervisor can only run when a CDP +endpoint is reachable). On Camofox / no-backend sessions, the dialog tool is +hidden and snapshot omits the new fields — no schema bloat. + +## Cross-origin iframe interaction + +Extending the dialog-detect work, `browser_cdp(frame_id=...)` routes CDP +calls (notably `Runtime.evaluate`) through the supervisor's already-connected +WebSocket using the OOPIF's child `sessionId`. Agents pick frame_ids out of +`browser_snapshot.frame_tree.children[]` where `is_oopif=true` and pass them +to `browser_cdp`. For same-origin iframes (no dedicated CDP session), the +agent uses `contentWindow`/`contentDocument` from a top-level +`Runtime.evaluate` instead — supervisor surfaces an error pointing at that +fallback when `frame_id` belongs to a non-OOPIF. + +On Browserbase, this is the ONLY reliable path for iframe interaction — +stateless CDP connections (opened per `browser_cdp` call) hit signed-URL +expiry, while the supervisor's long-lived connection keeps a valid session. + +## Camofox (follow-up) + +Issue planned against `jo-inc/camofox-browser` adding: +- Playwright `page.on('dialog', handler)` per session +- `GET /tabs/:tabId/dialogs` polling endpoint +- `POST /tabs/:tabId/dialogs/:id` to accept/dismiss +- Frame-tree introspection endpoint + +## Files touched (PR 1) + +### New + +- `tools/browser_supervisor.py` — `CDPSupervisor`, `SupervisorRegistry`, `PendingDialog`, `FrameInfo` +- `tools/browser_dialog_tool.py` — `browser_dialog` tool handler +- `tests/tools/test_browser_supervisor.py` — mock CDP WebSocket server + lifecycle/state tests +- `website/docs/developer-guide/browser-supervisor.md` — this file + +### Modified + +- `toolsets.py` — register `browser_dialog` in `browser`, `hermes-acp`, `hermes-api-server`, core toolsets (gated on CDP reachability) +- `tools/browser_tool.py` + - `browser_navigate` start-hook: if CDP URL resolvable, `SupervisorRegistry.get_or_start(task_id, cdp_url)` + - `browser_snapshot` (at ~line 1536): merge supervisor state into return payload + - `/browser connect` handler: restart supervisor with new endpoint + - Session teardown hooks in `_cleanup_browser_session` +- `hermes_cli/config.py` — add `browser.dialog_policy` and `browser.dialog_timeout_s` to `DEFAULT_CONFIG` +- Docs: `website/docs/user-guide/features/browser.md`, `website/docs/reference/tools-reference.md`, `website/docs/reference/toolsets-reference.md` + +## Non-goals + +- Detection/interaction for Camofox (upstream gap; tracked separately) +- Streaming dialog/frame events live to the user (would require gateway hooks) +- Persisting dialog history across sessions (in-memory only) +- Per-iframe dialog policies (agent can express this via `dialog_id`) +- Replacing `browser_cdp` — it stays as the escape hatch for the long tail (cookies, viewport, network throttling) + +## Testing + +Unit tests use an asyncio mock CDP server that speaks enough of the protocol +to exercise all state transitions: attach, enable, navigate, dialog fire, +dialog dismiss, frame attach/detach, child target attach, session teardown. +Real-backend E2E (Browserbase + local Chrome) is manual; probe scripts from +the 2026-04-23 investigation kept in-repo under +`scripts/browser_supervisor_e2e.py` so anyone can re-verify on new backend +versions. diff --git a/website/docs/reference/tools-reference.md b/website/docs/reference/tools-reference.md index c255c8f6a..b3380d14b 100644 --- a/website/docs/reference/tools-reference.md +++ b/website/docs/reference/tools-reference.md @@ -6,9 +6,9 @@ description: "Authoritative reference for Hermes built-in tools, grouped by tool # Built-in Tools Reference -This page documents all 53 built-in tools in the Hermes tool registry, grouped by toolset. Availability varies by platform, credentials, and enabled toolsets. +This page documents all 55 built-in tools in the Hermes tool registry, grouped by toolset. Availability varies by platform, credentials, and enabled toolsets. -**Quick counts:** 11 browser tools, 4 file tools, 10 RL tools, 4 Home Assistant tools, 2 terminal tools, 2 web tools, 5 Feishu tools, and 15 standalone tools across other toolsets. +**Quick counts:** 12 browser tools, 4 file tools, 10 RL tools, 4 Home Assistant tools, 2 terminal tools, 2 web tools, 5 Feishu tools, and 15 standalone tools across other toolsets. :::tip MCP Tools In addition to built-in tools, Hermes can load tools dynamically from MCP servers. MCP tools appear with a server-name prefix (e.g., `github_create_issue` for the `github` MCP server). See [MCP Integration](/docs/user-guide/features/mcp) for configuration. @@ -20,6 +20,7 @@ In addition to built-in tools, Hermes can load tools dynamically from MCP server |------|-------------|----------------------| | `browser_back` | Navigate back to the previous page in browser history. Requires browser_navigate to be called first. | — | | `browser_cdp` | Send a raw Chrome DevTools Protocol (CDP) command. Escape hatch for browser operations not covered by browser_navigate, browser_click, browser_console, etc. Only available when a CDP endpoint is reachable at session start — via `/browser connect` or `browser.cdp_url` config. See https://chromedevtools.github.io/devtools-protocol/ | — | +| `browser_dialog` | Respond to a native JavaScript dialog (alert / confirm / prompt / beforeunload). Call `browser_snapshot` first — pending dialogs appear in its `pending_dialogs` field. Then call `browser_dialog(action='accept'|'dismiss')`. Same availability as `browser_cdp` (Browserbase or `/browser connect`). | — | | `browser_click` | Click on an element identified by its ref ID from the snapshot (e.g., '@e5'). The ref IDs are shown in square brackets in the snapshot output. Requires browser_navigate and browser_snapshot to be called first. | — | | `browser_console` | Get browser console output and JavaScript errors from the current page. Returns console.log/warn/error/info messages and uncaught JS exceptions. Use this to detect silent JavaScript errors, failed API calls, and application warnings. Requi… | — | | `browser_get_images` | Get a list of all images on the current page with their URLs and alt text. Useful for finding images to analyze with the vision tool. Requires browser_navigate to be called first. | — | diff --git a/website/docs/reference/toolsets-reference.md b/website/docs/reference/toolsets-reference.md index bb911004e..a8c0a8225 100644 --- a/website/docs/reference/toolsets-reference.md +++ b/website/docs/reference/toolsets-reference.md @@ -52,7 +52,7 @@ Or in-session: | Toolset | Tools | Purpose | |---------|-------|---------| -| `browser` | `browser_back`, `browser_cdp`, `browser_click`, `browser_console`, `browser_get_images`, `browser_navigate`, `browser_press`, `browser_scroll`, `browser_snapshot`, `browser_type`, `browser_vision`, `web_search` | Full browser automation. Includes `web_search` as a fallback for quick lookups. `browser_cdp` is a raw CDP passthrough gated on a reachable CDP endpoint — it only appears when `/browser connect` is active or `browser.cdp_url` is set. | +| `browser` | `browser_back`, `browser_cdp`, `browser_click`, `browser_console`, `browser_dialog`, `browser_get_images`, `browser_navigate`, `browser_press`, `browser_scroll`, `browser_snapshot`, `browser_type`, `browser_vision`, `web_search` | Full browser automation. Includes `web_search` as a fallback for quick lookups. `browser_cdp` and `browser_dialog` are gated on a reachable CDP endpoint — they only appear when `/browser connect` is active, `browser.cdp_url` is set, or a Browserbase session is active. `browser_dialog` works together with the `pending_dialogs` and `frame_tree` fields that `browser_snapshot` adds when a CDP supervisor is attached. | | `clarify` | `clarify` | Ask the user a question when the agent needs clarification. | | `code_execution` | `execute_code` | Run Python scripts that call Hermes tools programmatically. | | `cronjob` | `cronjob` | Schedule and manage recurring tasks. | diff --git a/website/docs/user-guide/configuration.md b/website/docs/user-guide/configuration.md index 420ca1468..80f5c6f88 100644 --- a/website/docs/user-guide/configuration.md +++ b/website/docs/user-guide/configuration.md @@ -1240,10 +1240,26 @@ browser: inactivity_timeout: 120 # Seconds before auto-closing idle sessions command_timeout: 30 # Timeout in seconds for browser commands (screenshot, navigate, etc.) record_sessions: false # Auto-record browser sessions as WebM videos to ~/.hermes/browser_recordings/ + # Optional CDP override — when set, Hermes attaches directly to your own + # Chrome (via /browser connect) rather than starting a headless browser. + cdp_url: "" + # Dialog supervisor — controls how native JS dialogs (alert / confirm / prompt) + # are handled when a CDP backend is attached (Browserbase, local Chrome via + # /browser connect). Ignored on Camofox and default local agent-browser mode. + dialog_policy: must_respond # must_respond | auto_dismiss | auto_accept + dialog_timeout_s: 300 # Safety auto-dismiss under must_respond (seconds) camofox: managed_persistence: false # When true, Camofox sessions persist cookies/logins across restarts ``` +**Dialog policies:** + +- `must_respond` (default) — capture the dialog, surface it in `browser_snapshot.pending_dialogs`, and wait for the agent to call `browser_dialog(action=...)`. After `dialog_timeout_s` seconds with no response, the dialog is auto-dismissed to prevent the page's JS thread from stalling forever. +- `auto_dismiss` — capture, dismiss immediately. The agent still sees the dialog record in `browser_snapshot.recent_dialogs` with `closed_by="auto_policy"` after the fact. +- `auto_accept` — capture, accept immediately. Useful for pages with aggressive `beforeunload` prompts. + +See the [browser feature page](./features/browser.md#browser_dialog) for the full dialog workflow. + The browser toolset supports multiple providers. See the [Browser feature page](/docs/user-guide/features/browser) for details on Browserbase, Browser Use, and local Chrome CDP setup. ## Timezone diff --git a/website/docs/user-guide/features/browser.md b/website/docs/user-guide/features/browser.md index d6624bf7d..ca51b633e 100644 --- a/website/docs/user-guide/features/browser.md +++ b/website/docs/user-guide/features/browser.md @@ -355,7 +355,50 @@ browser_cdp(method="Runtime.evaluate", browser_cdp(method="Network.getAllCookies") ``` -Browser-level methods (`Target.*`, `Browser.*`, `Storage.*`) omit `target_id`. Page-level methods (`Page.*`, `Runtime.*`, `DOM.*`, `Emulation.*`) require a `target_id` from `Target.getTargets`. Each call is independent — sessions do not persist between calls. +Browser-level methods (`Target.*`, `Browser.*`, `Storage.*`) omit `target_id`. Page-level methods (`Page.*`, `Runtime.*`, `DOM.*`, `Emulation.*`) require a `target_id` from `Target.getTargets`. Each stateless call is independent — sessions do not persist between calls. + +**Cross-origin iframes:** pass `frame_id` (from `browser_snapshot.frame_tree.children[]` where `is_oopif=true`) to route the CDP call through the supervisor's live session for that iframe. This is how `Runtime.evaluate` inside a cross-origin iframe works on Browserbase, where stateless CDP connections would hit signed-URL expiry. Example: + +``` +browser_cdp( + method="Runtime.evaluate", + params={"expression": "document.title", "returnByValue": True}, + frame_id="", +) +``` + +Same-origin iframes don't need `frame_id` — use `document.querySelector('iframe').contentDocument` from a top-level `Runtime.evaluate` instead. + +### `browser_dialog` + +Responds to a native JS dialog (`alert` / `confirm` / `prompt` / `beforeunload`). Before this tool existed, dialogs would silently block the page's JavaScript thread and subsequent `browser_*` calls would hang or throw; now the agent sees pending dialogs in `browser_snapshot` output and responds explicitly. + +**Workflow:** +1. Call `browser_snapshot`. If a dialog is blocking the page, it shows up as `pending_dialogs: [{"id": "d-1", "type": "alert", "message": "..."}]`. +2. Call `browser_dialog(action="accept")` or `browser_dialog(action="dismiss")`. For `prompt()` dialogs, pass `prompt_text="..."` to supply the response. +3. Re-snapshot — `pending_dialogs` is empty; the page's JS thread has resumed. + +**Detection happens automatically** via a persistent CDP supervisor — one WebSocket per task that subscribes to Page/Runtime/Target events. The supervisor also populates a `frame_tree` field in the snapshot so the agent can see the iframe structure of the current page, including cross-origin (OOPIF) iframes. + +**Availability matrix:** + +| Backend | Detection via `pending_dialogs` | Response (`browser_dialog` tool) | +|---|---|---| +| Local Chrome via `/browser connect` or `browser.cdp_url` | ✓ | ✓ full workflow | +| Browserbase | ✓ | ✓ full workflow (via injected XHR bridge) | +| Camofox / default local agent-browser | ✗ | ✗ (no CDP endpoint) | + +**How it works on Browserbase.** Browserbase's CDP proxy auto-dismisses real native dialogs server-side within ~10ms, so we can't use `Page.handleJavaScriptDialog`. The supervisor injects a small script via `Page.addScriptToEvaluateOnNewDocument` that overrides `window.alert`/`confirm`/`prompt` with a synchronous XHR. We intercept those XHRs via `Fetch.enable` — the page's JS thread stays blocked on the XHR until we call `Fetch.fulfillRequest` with the agent's response. `prompt()` return values round-trip back into page JS unchanged. + +**Dialog policy** is configured in `config.yaml` under `browser.dialog_policy`: + +| Policy | Behavior | +|--------|----------| +| `must_respond` (default) | Capture, surface in snapshot, wait for explicit `browser_dialog()` call. Safety auto-dismiss after `browser.dialog_timeout_s` (default 300s) so a buggy agent can't stall forever. | +| `auto_dismiss` | Capture, dismiss immediately. Agent still sees the dialog in `browser_state` history but doesn't have to act. | +| `auto_accept` | Capture, accept immediately. Useful when navigating pages with aggressive `beforeunload` prompts. | + +**Frame tree** inside `browser_snapshot.frame_tree` is capped to 30 frames and OOPIF depth 2 to keep payloads bounded on ad-heavy pages. A `truncated: true` flag surfaces when limits were hit; agents needing the full tree can use `browser_cdp` with `Page.getFrameTree`. ## Practical Examples