mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-04-25 00:51:20 +00:00

feat(browser): CDP supervisor — dialog detection + response + cross-origin iframe eval (#14540)

* docs: browser CDP supervisor design (for upcoming PR)

Design doc ahead of implementation — dialog + iframe detection/interaction
via a persistent CDP supervisor. Covers backend capability matrix (verified
live 2026-04-23), architecture, lifecycle, policy, agent surface, PR split,
non-goals, and test plan.

Supersedes #12550.

No code changes in this commit.

* feat(browser): add persistent CDP supervisor for dialog + frame detection

Single persistent CDP WebSocket per Hermes task_id that subscribes to
Page/Runtime/Target events and maintains thread-safe state for pending
dialogs, frame tree, and console errors.

Supervisor lives in its own daemon thread running an asyncio loop;
external callers use sync API (snapshot(), respond_to_dialog()) that
bridges onto the loop.
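
A minimal sketch of the thread-plus-loop bridge described above, assuming a hypothetical `LoopThread` helper and a stand-in coroutine `_fake_respond` (neither name is from the Hermes code). It shows only how a sync caller can hand a coroutine to a loop that lives in a daemon thread and block on the result:

```python
import asyncio
import threading


class LoopThread:
    """Hypothetical helper: an asyncio loop in a daemon thread with a sync bridge."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        asyncio.set_event_loop(self._loop)
        self._loop.run_forever()

    def call(self, coro, timeout: float = 5.0):
        """Block the calling thread until the coroutine finishes on the loop."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout)

    def stop(self) -> None:
        # Ask the loop to stop from outside its thread, then wait for the thread.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join(timeout=5.0)
        self._loop.close()


async def _fake_respond(action: str) -> dict:
    # Stand-in for a CDP round trip; a real supervisor would send a command here.
    await asyncio.sleep(0)
    return {"ok": True, "action": action}


bridge = LoopThread()
result = bridge.call(_fake_respond("dismiss"))
bridge.stop()
```

The timeout on `future.result` keeps a sync caller from hanging forever if the loop thread is wedged.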

Auto-attaches to OOPIF child targets via Target.setAutoAttach{flatten:true}
and enables Page+Runtime on each so iframe-origin dialogs surface through
the same supervisor.

Dialog policies: must_respond (default, 300s safety timeout),
auto_dismiss, auto_accept.

Frame tree capped at 30 entries + OOPIF depth 2 to keep snapshot
payloads bounded on ad-heavy pages.
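
One way the two caps could be applied, as a sketch only. The `oopif_depth` field and the `cap_frame_tree` name are assumptions made for illustration; the commit does not say how depth is tracked:

```python
from typing import Any

MAX_FRAMES = 30       # entry cap stated in the commit message
MAX_OOPIF_DEPTH = 2   # OOPIF depth cap stated in the commit message


def cap_frame_tree(frames: list[dict[str, Any]]) -> dict[str, Any]:
    """Keep at most MAX_FRAMES frames and skip OOPIFs nested too deeply.

    Each frame dict is assumed to carry 'frame_id', 'is_oopif' and a
    hypothetical 'oopif_depth' (1 for an OOPIF directly under the top frame).
    """
    kept: list[dict[str, Any]] = []
    truncated = False
    for frame in frames:
        if frame.get("is_oopif") and frame.get("oopif_depth", 1) > MAX_OOPIF_DEPTH:
            truncated = True   # too deep: dropped, and the caller is told
            continue
        if len(kept) >= MAX_FRAMES:
            truncated = True   # entry cap reached: stop collecting
            break
        kept.append(frame)
    return {"children": kept, "truncated": truncated}


many = cap_frame_tree([{"frame_id": f"F{i}", "is_oopif": False} for i in range(40)])
deep = cap_frame_tree([
    {"frame_id": "A", "is_oopif": True, "oopif_depth": 3},
    {"frame_id": "B", "is_oopif": True, "oopif_depth": 2},
])
```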

E2E verified against real Chrome via smoke test — detects + responds
to main-frame alerts, iframe-contentWindow alerts, preserves frame
tree, graceful no-dialog error path, clean shutdown.

No agent-facing tool wiring in this commit (comes next).

* feat(browser): add browser_dialog tool wired to CDP supervisor

Agent-facing response-only tool. Schema:
  action: 'accept' | 'dismiss' (required)
  prompt_text: response for prompt() dialogs (optional)
  dialog_id: disambiguate when multiple dialogs queued (optional)

Handler:
  SUPERVISOR_REGISTRY.get(task_id).respond_to_dialog(...)

check_fn shares _browser_cdp_check with browser_cdp so both surface and
hide together. When no supervisor is attached (Camofox, default
Playwright, or no browser session started yet), tool is hidden; if
somehow invoked it returns a clear error pointing the agent to
browser_navigate / /browser connect.

Registered in _HERMES_CORE_TOOLS and the browser / hermes-acp /
hermes-api-server toolsets alongside browser_cdp.

* feat(browser): wire CDP supervisor into session lifecycle + browser_snapshot

Supervisor lifecycle:
  * _get_session_info lazy-starts the supervisor after a session row is
    materialized — covers every backend code path (Browserbase, cdp_url
    override, /browser connect, future providers) with one hook.
  * cleanup_browser(task_id) stops the supervisor for that task first
    (before the backend tears down CDP).
  * cleanup_all_browsers() calls SUPERVISOR_REGISTRY.stop_all().
  * /browser connect eagerly starts the supervisor for task 'default'
    so the first snapshot already shows pending_dialogs.
  * /browser disconnect stops the supervisor.

CDP URL resolution for the supervisor:
  1. BROWSER_CDP_URL / browser.cdp_url override.
  2. Fallback: session_info['cdp_url'] from cloud providers (Browserbase).
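
The resolution order can be sketched as below. The function name `resolve_cdp_url` is hypothetical, and checking the environment variable before the config key is an assumption, since the commit lists the two overrides together without ranking them:

```python
import os
from typing import Any, Mapping, Optional


def resolve_cdp_url(
    config: Mapping[str, Any],
    session_info: Mapping[str, Any],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the CDP URL the supervisor should attach to, or None."""
    # 1. Explicit override (env first, then config: an assumed ordering).
    override = env.get("BROWSER_CDP_URL") or config.get("browser", {}).get("cdp_url")
    if override:
        return override
    # 2. Fallback: URL reported by a cloud provider session.
    return session_info.get("cdp_url") or None


from_env = resolve_cdp_url({}, {"cdp_url": "wss://cloud"}, env={"BROWSER_CDP_URL": "ws://local"})
from_session = resolve_cdp_url({}, {"cdp_url": "wss://cloud"}, env={})
nothing = resolve_cdp_url({}, {}, env={})
```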

browser_snapshot merges supervisor state (pending_dialogs + frame_tree)
into its JSON output when a supervisor is active — the agent reads
pending_dialogs from the snapshot it already requests, then calls
browser_dialog to respond. No extra tool surface.

Config defaults:
  * browser.dialog_policy: 'must_respond' (new)
  * browser.dialog_timeout_s: 300 (new)
No version bump — new keys deep-merge into existing browser section.
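
To illustrate why no version bump is needed, here is a generic deep-merge sketch. `deep_merge` is a hypothetical helper, not the Hermes config loader; only the two default values come from the commit:

```python
import copy
from typing import Any

# Defaults named in the commit message.
DEFAULT_CONFIG = {"browser": {"dialog_policy": "must_respond", "dialog_timeout_s": 300}}


def deep_merge(defaults: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Recursively overlay user values on defaults without mutating either."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# An older user config that predates the new keys still picks them up.
old_user_config = {"browser": {"cdp_url": "ws://127.0.0.1:9222"}}
merged = deep_merge(DEFAULT_CONFIG, old_user_config)
```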

Deadlock fix in supervisor event dispatch:
  * _on_dialog_opening and _on_target_attached used to await CDP calls
    while the reader was still processing an event — but only the reader
    can set the response Future, so the call timed out.
  * Both now fire asyncio.create_task(...) so the reader stays pumping.
  * auto_dismiss/auto_accept now actually close the dialog immediately.
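
The deadlock and its fix can be reproduced with a toy model. `MiniSupervisor` is a hypothetical stand-in whose in-memory queue plays the role of the WebSocket; it is a sketch of the pattern, not the real dispatch code:

```python
import asyncio
import itertools


class MiniSupervisor:
    """Toy reader loop: only the reader can resolve command futures."""

    def __init__(self) -> None:
        self.inbox: asyncio.Queue = asyncio.Queue()
        self._pending: dict[int, asyncio.Future] = {}
        self._ids = itertools.count(1)
        self._tasks: set[asyncio.Task] = set()
        self.handled: list[str] = []

    async def send(self, method: str) -> dict:
        msg_id = next(self._ids)
        fut = asyncio.get_running_loop().create_future()
        self._pending[msg_id] = fut
        # Fake backend: the reply arrives as a later message on the same socket.
        self.inbox.put_nowait({"id": msg_id, "result": {"method": method}})
        return await asyncio.wait_for(fut, timeout=1.0)

    async def _on_dialog_opening(self, event: dict) -> None:
        reply = await self.send("Page.handleJavaScriptDialog")
        self.handled.append(reply["method"])

    async def reader(self) -> None:
        while True:
            msg = await self.inbox.get()
            if msg is None:
                return
            if "id" in msg:
                self._pending.pop(msg["id"]).set_result(msg["result"])
            else:
                # Awaiting the handler here would stall the reader, so the
                # reply could never be read. Spawn a task instead.
                task = asyncio.create_task(self._on_dialog_opening(msg))
                self._tasks.add(task)
                task.add_done_callback(self._tasks.discard)


async def demo() -> list[str]:
    sup = MiniSupervisor()
    reader = asyncio.create_task(sup.reader())
    sup.inbox.put_nowait({"method": "Page.javascriptDialogOpening"})
    await asyncio.sleep(0.05)
    sup.inbox.put_nowait(None)  # sentinel: stop the reader
    await reader
    return sup.handled


handled = asyncio.run(demo())
```

Replacing the `create_task` call with a direct `await self._on_dialog_opening(msg)` makes `send` hit its one-second timeout, which is the failure mode described above.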

Tests (tests/tools/test_browser_supervisor.py, 11 tests, real Chrome):
  * supervisor start/snapshot
  * main-frame alert detection + dismiss
  * iframe.contentWindow alert
  * prompt() with prompt_text reply
  * respond with no pending dialog -> clean error
  * auto_dismiss clears on event
  * registry idempotency
  * registry stop -> snapshot reports inactive
  * browser_dialog tool no-supervisor error
  * browser_dialog invalid action
  * browser_dialog end-to-end via tool handler

xdist-safe: chrome_cdp fixture uses a per-worker port.
Skipped when google-chrome/chromium isn't installed.
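
A per-worker port can be derived from the `PYTEST_XDIST_WORKER` environment variable that pytest-xdist sets (values such as `gw0`, `gw1`). The base port and the `worker_port` helper below are assumptions for illustration, not the fixture's actual code:

```python
import os
from typing import Mapping

BASE_PORT = 9222  # hypothetical base; the real fixture may use another value


def worker_port(env: Mapping[str, str] = os.environ, base: int = BASE_PORT) -> int:
    """Map an xdist worker id to a distinct debugging port."""
    worker = env.get("PYTEST_XDIST_WORKER", "")
    digits = worker[2:] if worker.startswith("gw") else ""
    # No xdist (or an unexpected id): fall back to the base port.
    return base + (int(digits) + 1 if digits.isdigit() else 0)


solo = worker_port({})
first = worker_port({"PYTEST_XDIST_WORKER": "gw0"})
fourth = worker_port({"PYTEST_XDIST_WORKER": "gw3"})
```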

* docs(browser): document browser_dialog tool + CDP supervisor

- user-guide/features/browser.md: new browser_dialog section with
  workflow, availability gate, and dialog_policy table
- reference/tools-reference.md: row for browser_dialog, tool count
  bumped 53 -> 54, browser tools count 11 -> 12
- reference/toolsets-reference.md: browser_dialog added to browser
  toolset row with note on pending_dialogs / frame_tree snapshot fields

Full design doc lives at
developer-guide/browser-supervisor.md (committed earlier).

* fix(browser): reconnect loop + recent_dialogs for Browserbase visibility

Found via a Browserbase E2E test that revealed two production-critical issues and one event-matching bug:

1. **Supervisor WebSocket drops when other clients disconnect.** Browserbase's
   CDP proxy tears down our long-lived WebSocket whenever a short-lived
   client (e.g. agent-browser CLI's per-command CDP connection) disconnects.
   Fixed with a reconnecting _run loop that re-attaches with exponential
   backoff on drops. _page_session_id and _child_sessions are reset on each
   reconnect; pending_dialogs and frames are preserved across reconnects.

2. **Browserbase auto-dismisses dialogs server-side within ~10ms.** Their
   Playwright-based CDP proxy dismisses alert/confirm/prompt before our
   Page.handleJavaScriptDialog call can respond. So pending_dialogs is
   empty by the time the agent reads a snapshot on Browserbase.

   Added a recent_dialogs ring buffer (capacity 20) that retains a
   DialogRecord for every dialog that opened, with a closed_by tag:
     * 'agent'       — agent called browser_dialog
     * 'auto_policy' — local auto_dismiss/auto_accept fired
     * 'watchdog'    — must_respond timeout auto-dismissed (300s default)
     * 'remote'      — browser/backend closed it on us (Browserbase)

   Agents on Browserbase now see the dialog history with closed_by='remote'
   so they at least know a dialog fired, even though they couldn't respond.

3. **Page.javascriptDialogClosed matching bug.** The event doesn't include a
   'message' field (CDP spec has only 'result' and 'userInput') but our
   _on_dialog_closed was matching on message. Fixed to match by session_id
   + oldest-first, with a safety assumption that only one dialog is in
   flight per session (the JS thread is blocked while a dialog is up).
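
Items 2 and 3 combine naturally: a bounded history plus close-matching by session. The sketch below uses hypothetical names (`DialogState`, `on_opening`, `on_closed`) and a trimmed `DialogRecord`; only the capacity of 20, the `closed_by` tags, and the oldest-first rule come from the commit:

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class DialogRecord:
    id: str
    session_id: str
    message: str
    opened_at: float
    closed_at: Optional[float] = None
    closed_by: Optional[str] = None  # 'agent' | 'auto_policy' | 'watchdog' | 'remote'


class DialogState:
    def __init__(self, capacity: int = 20) -> None:
        self.pending: list[DialogRecord] = []
        # deque(maxlen=...) silently evicts the oldest record when full.
        self.recent: deque[DialogRecord] = deque(maxlen=capacity)
        self._next = 1

    def on_opening(self, session_id: str, message: str) -> DialogRecord:
        rec = DialogRecord(f"d-{self._next}", session_id, message, time.time())
        self._next += 1
        self.pending.append(rec)
        return rec

    def on_closed(self, session_id: str, closed_by: str = "remote") -> Optional[DialogRecord]:
        # The closed event carries no message, so match on session, oldest first.
        for rec in self.pending:
            if rec.session_id == session_id:
                self.pending.remove(rec)
                rec.closed_at = time.time()
                rec.closed_by = closed_by
                self.recent.append(rec)
                return rec
        return None


state = DialogState()
for i in range(25):
    state.on_opening("S1", f"msg {i}")
    state.on_closed("S1")
```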

Docs + tests updated:
  * browser.md: new availability matrix showing the three backends and
    which mode (pending / recent / response) each supports
  * developer-guide/browser-supervisor.md: three-field snapshot schema
    with closed_by semantics
  * test_browser_supervisor.py: +test_recent_dialogs_ring_buffer (12/12
    passing against real Chrome)

E2E verified both backends:
  * Local Chrome via /browser connect: detect + respond full workflow
    (smoke_supervisor.py all 7 scenarios pass)
  * Browserbase: detect via recent_dialogs with closed_by='remote'
    (smoke_supervisor_browserbase_v2.py passes)

Camofox remains out of scope (REST-only, no CDP) — tracked for
upstream PR 3.

* feat(browser): XHR bridge for dialog response on Browserbase (FIXED)

Browserbase's CDP proxy auto-dismisses native JS dialogs within ~10ms, so
Page.handleJavaScriptDialog calls lose the race. Solution: bypass native
dialogs entirely.

The supervisor now injects Page.addScriptToEvaluateOnNewDocument with a
JavaScript override for window.alert/confirm/prompt. Those overrides
perform a synchronous XMLHttpRequest to a magic host
('hermes-dialog-bridge.invalid'). We intercept those XHRs via Fetch.enable
with a requestStage=Request pattern.

Flow when a page calls alert('hi'):
  1. window.alert override intercepts, builds XHR GET to
     http://hermes-dialog-bridge.invalid/?kind=alert&message=hi
  2. Sync XHR blocks the page's JS thread (mirrors real dialog semantics)
  3. Fetch.requestPaused fires on our WebSocket; supervisor surfaces
     it as a pending dialog with bridge_request_id set
  4. Agent reads pending_dialogs from browser_snapshot, calls browser_dialog
  5. Supervisor calls Fetch.fulfillRequest with JSON body:
     {accept: true|false, prompt_text: '...', dialog_id: 'd-N'}
  6. The injected script parses the body, returns the appropriate value
     from the override (undefined for alert, bool for confirm, string|null
     for prompt)
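
Steps 3 and 5 can be sketched on the supervisor side as follows. The helper names are hypothetical and the response header is an assumption; the host, query keys, and JSON body shape come from the flow above. `Fetch.fulfillRequest` expects `body` as base64 when sent over JSON:

```python
import base64
import json
from typing import Any, Optional
from urllib.parse import parse_qs, urlsplit

BRIDGE_HOST = "hermes-dialog-bridge.invalid"


def parse_bridge_request(url: str) -> Optional[dict[str, str]]:
    """Extract dialog kind/message from a paused bridge XHR, else None."""
    parts = urlsplit(url)
    if parts.hostname != BRIDGE_HOST:
        return None
    query = parse_qs(parts.query, keep_blank_values=True)
    return {
        "kind": query.get("kind", ["alert"])[0],
        "message": query.get("message", [""])[0],
    }


def build_fulfill_params(
    request_id: str, accept: bool, dialog_id: str, prompt_text: str = ""
) -> dict[str, Any]:
    """Build params for Fetch.fulfillRequest answering the bridge XHR."""
    body = json.dumps({"accept": accept, "prompt_text": prompt_text, "dialog_id": dialog_id})
    return {
        "requestId": request_id,
        "responseCode": 200,
        # Header choice is an assumption; the commit does not list headers.
        "responseHeaders": [{"name": "Content-Type", "value": "application/json"}],
        "body": base64.b64encode(body.encode("utf-8")).decode("ascii"),
    }


parsed = parse_bridge_request("http://hermes-dialog-bridge.invalid/?kind=alert&message=hi")
params = build_fulfill_params("req-1", accept=True, dialog_id="d-1", prompt_text="AGENT-REPLY")
decoded = json.loads(base64.b64decode(params["body"]))
```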

This works identically on Browserbase AND local Chrome — no native dialog
ever fires, so Browserbase's auto-dismiss has nothing to race. Dialog
policies (must_respond / auto_dismiss / auto_accept) all still work.

Bridge is installed on every attached session (main page + OOPIF child
sessions) so iframe dialogs are captured too.

Native-dialog path kept as a fallback for backends that don't auto-dismiss
(so a page that somehow bypasses our override — e.g. iframes that load
after Fetch.enable but before the init-script runs — still gets observed
via Page.javascriptDialogOpening).

E2E VERIFIED:
  * Local Chrome: 13/13 pytest tests green (12 original + new
    test_bridge_captures_prompt_and_returns_reply_text that asserts
    window.__ret === 'AGENT-SUPPLIED-REPLY' after agent responds)
  * Browserbase: smoke_bb_bridge_v2.py runs 4/4 PASS:
    - alert('BB-ALERT-MSG') dismiss → page.alert_ret = undefined ✓
    - prompt('BB-PROMPT-MSG', 'default-xyz') accept with 'AGENT-REPLY'
      → page.prompt_ret === 'AGENT-REPLY' ✓
    - confirm('BB-CONFIRM-MSG') accept → page.confirm_ret === true ✓
    - confirm('BB-CONFIRM-MSG') dismiss → page.confirm_ret === false ✓

Docs updated in browser.md and developer-guide/browser-supervisor.md —
availability matrix now shows Browserbase at full parity with local
Chrome for both detection and response.

* feat(browser): cross-origin iframe interaction via browser_cdp(frame_id=...)

Adds iframe interaction to the CDP supervisor PR (was queued as PR 2).

Design: browser_cdp gets an optional frame_id parameter. When set, the
tool looks up the frame in the supervisor's frame_tree, grabs its child
cdp_session_id (OOPIF session), and dispatches the CDP call through the
supervisor's already-connected WebSocket via run_coroutine_threadsafe.

Why not stateless: on Browserbase, each fresh browser_cdp WebSocket
must re-negotiate against a signed connectUrl. The session info carries
a specific URL that can expire while the supervisor's long-lived
connection stays valid. Routing via the supervisor sidesteps this.

Agent workflow:
  1. browser_snapshot → frame_tree.children[] shows OOPIFs with is_oopif=true
  2. browser_cdp(method='Runtime.evaluate', frame_id=<OOPIF frame_id>,
                 params={'expression': 'document.title', 'returnByValue': True})
  3. Supervisor dispatches the call on the OOPIF's child session

Supervisor state fixes needed along the way:
  * _on_frame_detached now skips reason='swap' (frame migrating processes)
  * _on_frame_detached also skips when the frame is an OOPIF with a live
    child session — Browserbase fires spurious remove events when a
    same-origin iframe gets promoted to OOPIF
  * _on_target_detached clears cdp_session_id but KEEPS the frame record
    so the agent still sees the OOPIF in frame_tree during transient
    session flaps
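
These three rules reduce to two small predicates. This is a sketch with hypothetical function names and a plain dict standing in for the frame record; the `reason` values `'swap'` and `'remove'` are those of CDP's `Page.frameDetached`:

```python
from typing import Any


def should_drop_frame(frame: dict[str, Any], reason: str) -> bool:
    """Decide whether a Page.frameDetached event should remove the frame."""
    if reason == "swap":
        return False  # frame is migrating processes, not going away
    if frame.get("is_oopif") and frame.get("cdp_session_id"):
        return False  # OOPIF with a live child session: treat as spurious
    return True


def on_target_detached(frame: dict[str, Any]) -> dict[str, Any]:
    """Clear the session id but keep the frame record visible."""
    frame["cdp_session_id"] = None
    return frame


oopif = {"frame_id": "FRAME_C", "is_oopif": True, "cdp_session_id": "SID_C"}
plain = {"frame_id": "FRAME_B", "is_oopif": False, "cdp_session_id": None}

keep_on_swap = should_drop_frame(plain, "swap")
keep_live_oopif = should_drop_frame(oopif, "remove")
drop_plain = should_drop_frame(plain, "remove")
after_flap = on_target_detached(dict(oopif))
```

Once the child session is gone, a later `remove` event for the same OOPIF does drop it, because the live-session guard no longer applies.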

E2E VERIFIED on Browserbase (smoke_bb_iframe_agent_path.py):
  browser_cdp(method='Runtime.evaluate',
              params={'expression': 'document.title', 'returnByValue': True},
              frame_id=<OOPIF>)
  → {'success': True, 'result': {'value': 'Example Domain'}}

  The iframe is <iframe src='https://example.com/'> inside a top-level
  data: URL page on a real Browserbase session. The agent Runtime.evaluates
  INSIDE the cross-origin iframe and gets example.com's title back.

Tests (tests/tools/test_browser_supervisor.py — 16 pass total):
  * test_browser_cdp_frame_id_routes_via_supervisor — injects fake OOPIF,
    verifies routing via supervisor, Runtime.evaluate returns 1+1=2
  * test_browser_cdp_frame_id_missing_supervisor — clean error when no
    supervisor attached
  * test_browser_cdp_frame_id_not_in_frame_tree — clean error on bad
    frame_id

Docs (browser.md and developer-guide/browser-supervisor.md) updated with
the iframe workflow, availability matrix now shows OOPIF eval as shipped
for local Chrome + Browserbase.

* test(browser): real-OOPIF E2E verified manually + chrome_cdp uses --site-per-process

When asked 'did you test the iframe stuff' I had only done a mocked
pytest (fake injected OOPIF) plus a Browserbase E2E. Closed the
local-Chrome real-OOPIF gap by writing /tmp/dialog-iframe-test/
smoke_local_oopif.py:

  * 2 http servers on different hostnames (localhost:18905 + 127.0.0.1:18906)
  * Chrome with --site-per-process so the cross-origin iframe becomes a
    real OOPIF in its own process
  * Navigate, find OOPIF in supervisor.frame_tree, call
    browser_cdp(method='Runtime.evaluate', frame_id=<OOPIF>) which routes
    through the supervisor's child session
  * Asserts iframe document.title === 'INNER-FRAME-XYZ' (from the
    inner page, retrieved via OOPIF eval)

PASSED on 2026-04-23.

Tried to embed this as a pytest but hit an asyncio version quirk between
venv (3.11) and the system python (3.13) — Page.navigate hangs in the
pytest harness but works in standalone. Left a self-documenting skip
test that points to the smoke script + describes the verification.

chrome_cdp fixture now passes --site-per-process so future iframe tests
can rely on OOPIF behavior.

Result: 16 pass + 1 documented-skip = 17 tests in
tests/tools/test_browser_supervisor.py.

* docs(browser): add dialog_policy + dialog_timeout_s to configuration.md, fix tool count

Pre-merge docs audit revealed two gaps:

1. user-guide/configuration.md browser config example was missing the
   two new dialog_* knobs. Added with a short table explaining
   must_respond / auto_dismiss / auto_accept semantics and a link to
   the feature page for the full workflow.

2. reference/tools-reference.md header said '54 built-in tools' — real
   count on main is 54, this branch adds browser_dialog so it's 55.
   Fixed the header.  (browser count was already correctly bumped
   11 -> 12 in the earlier docs commit.)

No code changes.

2026-04-23 22:23:37 -07:00


Browser CDP Supervisor — Design

Status: Shipped (PR 14540)
Last updated: 2026-04-23
Author: @teknium1

Problem

Native JS dialogs (alert/confirm/prompt/beforeunload) and iframes are the two biggest gaps in our browser tooling:

Dialogs block the JS thread. Any operation on the page stalls until the dialog is handled. Before this work, the agent had no way to know a dialog was open — subsequent tool calls would hang or throw opaque errors.
Iframes are invisible. The agent could see iframe nodes in the DOM snapshot but could not click, type, or eval inside them — especially cross-origin (OOPIF) iframes that live in separate Chromium processes.

PR #12550 proposed a stateless browser_dialog wrapper. That doesn't solve detection — it's a cleaner CDP call for when the agent already knows (via symptoms) that a dialog is open. Closed as superseded.

Backend capability matrix (verified live 2026-04-23)

Using throwaway probe scripts against a data-URL page that fires alerts in the main frame and in a same-origin srcdoc iframe, plus a cross-origin https://example.com iframe:

| Backend | Dialog detect | Dialog respond | Frame tree | OOPIF `Runtime.evaluate` via `browser_cdp(frame_id=...)` |
| --- | --- | --- | --- | --- |
| Local Chrome (`--remote-debugging-port`) / `/browser connect` | ✓ | ✓ full workflow | ✓ | ✓ |
| Browserbase | ✓ (via bridge) | ✓ full workflow (via bridge) | ✓ | ✓ (`document.title = "Example Domain"` verified on real cross-origin iframe) |
| Camofox | ✗ no CDP (REST-only) | ✗ | partial via DOM snapshot | ✗ |

How responding works on Browserbase. Browserbase's CDP proxy uses Playwright internally and auto-dismisses native dialogs within ~10ms, so Page.handleJavaScriptDialog can't keep up. To work around this, the supervisor injects a bridge script via Page.addScriptToEvaluateOnNewDocument that overrides window.alert/confirm/prompt with a synchronous XHR to a magic host (hermes-dialog-bridge.invalid). Fetch.enable intercepts those XHRs before they touch the network: the dialog becomes a Fetch.requestPaused event that the supervisor captures, and respond_to_dialog fulfills it via Fetch.fulfillRequest with a JSON body that the injected script decodes.

Net result: from the page's perspective, prompt() still returns the agent-supplied string. From the agent's perspective, it's the same browser_dialog(action=...) API either way. Tested end-to-end against real Browserbase sessions — 4/4 (alert/prompt/confirm-accept/confirm-dismiss) pass including value round-tripping back into page JS.

Camofox stays unsupported for this PR; follow-up upstream issue planned at jo-inc/camofox-browser requesting a dialog polling endpoint.

Architecture

CDPSupervisor

One asyncio.Task running in a background daemon thread per Hermes task_id. Holds a persistent WebSocket to the backend's CDP endpoint. Maintains:

Dialog queue — List[PendingDialog] with {id, type, message, default_prompt, session_id, opened_at}
Frame tree — Dict[frame_id, FrameInfo] with parent relationships, URL, origin, and whether the frame has its own cross-origin (OOPIF) child session
Session map — Dict[session_id, SessionInfo] so interaction tools can route to the right attached session for OOPIF operations
Recent console errors — ring buffer of the last 50 (for PR 2 diagnostics)

Subscribes on attach:

Page.enable — javascriptDialogOpening, frameAttached, frameNavigated, frameDetached
Runtime.enable — executionContextCreated, consoleAPICalled, exceptionThrown
Target.setAutoAttach {autoAttach: true, flatten: true} — surfaces child OOPIF targets; supervisor enables Page+Runtime on each

Thread-safe state access via a snapshot lock; tool handlers (sync) read the frozen snapshot without awaiting.

Lifecycle

Start: SupervisorRegistry.get_or_start(task_id, cdp_url) — called by browser_navigate, Browserbase session create, /browser connect. Idempotent.
Stop: session teardown or /browser disconnect. Cancels the asyncio task, closes the WebSocket, discards state.
Rebind: if the CDP URL changes (user reconnects to a new Chrome), stop the old supervisor and start fresh — never reuse state across endpoints.

Dialog policy

Configurable via config.yaml under browser.dialog_policy:

must_respond (default) — capture, surface in browser_snapshot, wait for explicit browser_dialog(action=...) call. After a 300s safety timeout with no response, auto-dismiss and log. Prevents a buggy agent from stalling forever.
auto_dismiss — record and dismiss immediately; the agent sees it after the fact in the recent_dialogs field of browser_snapshot.
auto_accept — record and accept (useful for beforeunload where the user wants to navigate away cleanly).

Policy is per-task; no per-dialog overrides in v1.
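
The three policies can be read as a pure decision function. The sketch below is an illustration under assumptions: `decide` is a hypothetical name, and the returned pair reuses the `closed_by` tags (`auto_policy`, `watchdog`) that the snapshot reports:

```python
from typing import Optional

DEFAULT_TIMEOUT_S = 300.0  # browser.dialog_timeout_s default


def decide(
    policy: str, opened_at: float, now: float, timeout_s: float = DEFAULT_TIMEOUT_S
) -> Optional[tuple[str, str]]:
    """Return (action, closed_by) if the supervisor should act now, else None."""
    if policy == "auto_dismiss":
        return ("dismiss", "auto_policy")
    if policy == "auto_accept":
        return ("accept", "auto_policy")
    if policy == "must_respond":
        # Wait for the agent, but never longer than the safety timeout.
        if now - opened_at >= timeout_s:
            return ("dismiss", "watchdog")
        return None
    raise ValueError(f"unknown dialog_policy: {policy!r}")


waiting = decide("must_respond", opened_at=1000.0, now=1010.0)
expired = decide("must_respond", opened_at=1000.0, now=1300.0)
immediate = decide("auto_accept", opened_at=1000.0, now=1000.0)
```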

Agent surface (PR 1)

One new tool

browser_dialog(action, prompt_text=None, dialog_id=None)

action="accept" / "dismiss" → responds to the specified or sole pending dialog (required)
prompt_text=... → text to supply to a prompt() dialog
dialog_id=... → disambiguate when multiple dialogs queued (rare)

Tool is response-only. Agent reads pending dialogs from browser_snapshot output before calling.
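
A sketch of how a handler might validate arguments and pick the target dialog. `select_dialog` and its error strings are hypothetical, and returning an error when several dialogs are queued but no dialog_id is given is an assumption; the design only says the tool responds to "the specified or sole pending dialog":

```python
from typing import Any, Optional


def select_dialog(
    pending: list[dict[str, Any]], action: str, dialog_id: Optional[str] = None
) -> dict[str, Any]:
    """Validate browser_dialog arguments and choose the dialog to answer."""
    if action not in ("accept", "dismiss"):
        return {"success": False, "error": f"invalid action {action!r}"}
    if not pending:
        return {"success": False, "error": "no pending dialog"}
    if dialog_id is None:
        if len(pending) > 1:
            # Assumed behavior: refuse to guess when several are queued.
            ids = [d["id"] for d in pending]
            return {"success": False, "error": f"multiple dialogs pending, pass dialog_id: {ids}"}
        return {"success": True, "dialog": pending[0]}
    for dialog in pending:
        if dialog["id"] == dialog_id:
            return {"success": True, "dialog": dialog}
    return {"success": False, "error": f"unknown dialog_id {dialog_id!r}"}


queue = [{"id": "d-1", "type": "alert"}, {"id": "d-2", "type": "prompt"}]
sole = select_dialog(queue[:1], "dismiss")
picked = select_dialog(queue, "accept", dialog_id="d-2")
ambiguous = select_dialog(queue, "accept")
```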

`browser_snapshot` extension

Adds three optional fields to the existing snapshot output when a supervisor is attached:

{
  "pending_dialogs": [
    {"id": "d-1", "type": "alert", "message": "Hello", "opened_at": 1650000000.0}
  ],
  "recent_dialogs": [
    {"id": "d-1", "type": "alert", "message": "...", "opened_at": 1650000000.0,
     "closed_at": 1650000000.1, "closed_by": "remote"}
  ],
  "frame_tree": {
    "top": {"frame_id": "FRAME_A", "url": "https://example.com/", "origin": "https://example.com"},
    "children": [
      {"frame_id": "FRAME_B", "url": "about:srcdoc", "is_oopif": false},
      {"frame_id": "FRAME_C", "url": "https://ads.example.net/", "is_oopif": true, "session_id": "SID_C"}
    ],
    "truncated": false
  }
}

pending_dialogs: dialogs currently blocking the page's JS thread. The agent must call browser_dialog(action=...) to respond. On Browserbase, the CDP proxy auto-dismisses native dialogs within ~10ms, so entries appear here only through the XHR bridge.
recent_dialogs: ring buffer of up to 20 recently-closed dialogs with a closed_by tag — "agent" (we responded), "auto_policy" (local auto_dismiss/auto_accept), "watchdog" (must_respond timeout hit), or "remote" (browser/backend closed it on us, e.g. Browserbase). This is how agents on Browserbase still get visibility into what happened.
frame_tree: frame structure including cross-origin (OOPIF) children. Capped at 30 entries + OOPIF depth 2 to bound snapshot size on ad-heavy pages. truncated: true surfaces when limits were hit; agents needing the full tree can use browser_cdp with Page.getFrameTree.

No new tool schema surface for any of these — the agent reads the snapshot it already requests.
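
On the consuming side, an agent harness might reduce the three fields to what it needs next. The helper name `summarize_supervisor_state` is hypothetical; the field names follow the example schema above, and every field is treated as optional because the snapshot omits them when no supervisor is attached:

```python
import json
from typing import Any


def summarize_supervisor_state(snapshot_json: str) -> dict[str, Any]:
    """Pull dialog ids and OOPIF frame ids out of a browser_snapshot payload."""
    snap = json.loads(snapshot_json)
    tree = snap.get("frame_tree") or {}
    return {
        "pending_ids": [d["id"] for d in snap.get("pending_dialogs", [])],
        "remote_closed": [
            d["id"] for d in snap.get("recent_dialogs", []) if d.get("closed_by") == "remote"
        ],
        "oopif_frame_ids": [
            f["frame_id"] for f in tree.get("children", []) if f.get("is_oopif")
        ],
        "truncated": bool(tree.get("truncated", False)),
    }


example = json.dumps({
    "pending_dialogs": [{"id": "d-2", "type": "alert", "message": "Hello"}],
    "recent_dialogs": [{"id": "d-1", "type": "alert", "closed_by": "remote"}],
    "frame_tree": {
        "top": {"frame_id": "FRAME_A"},
        "children": [
            {"frame_id": "FRAME_B", "is_oopif": False},
            {"frame_id": "FRAME_C", "is_oopif": True, "session_id": "SID_C"},
        ],
        "truncated": False,
    },
})
summary = summarize_supervisor_state(example)
empty = summarize_supervisor_state("{}")
```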

Availability gating

Both surfaces gate on _browser_cdp_check (supervisor can only run when a CDP endpoint is reachable). On Camofox / no-backend sessions, the dialog tool is hidden and snapshot omits the new fields — no schema bloat.

Cross-origin iframe interaction

Extending the dialog-detect work, browser_cdp(frame_id=...) routes CDP calls (notably Runtime.evaluate) through the supervisor's already-connected WebSocket using the OOPIF's child sessionId. Agents pick frame_ids out of browser_snapshot.frame_tree.children[] where is_oopif=true and pass them to browser_cdp. For same-origin iframes (no dedicated CDP session), the agent uses contentWindow/contentDocument from a top-level Runtime.evaluate instead — supervisor surfaces an error pointing at that fallback when frame_id belongs to a non-OOPIF.

On Browserbase, this is the ONLY reliable path for iframe interaction — stateless CDP connections (opened per browser_cdp call) hit signed-URL expiry, while the supervisor's long-lived connection keeps a valid session.

Camofox (follow-up)

Issue planned against jo-inc/camofox-browser adding:

Playwright page.on('dialog', handler) per session
GET /tabs/:tabId/dialogs polling endpoint
POST /tabs/:tabId/dialogs/:id to accept/dismiss
Frame-tree introspection endpoint

Files touched (PR 1)

New

tools/browser_supervisor.py — CDPSupervisor, SupervisorRegistry, PendingDialog, FrameInfo
tools/browser_dialog_tool.py — browser_dialog tool handler
tests/tools/test_browser_supervisor.py — mock CDP WebSocket server + lifecycle/state tests
website/docs/developer-guide/browser-supervisor.md — this file

Modified

toolsets.py — register browser_dialog in browser, hermes-acp, hermes-api-server, core toolsets (gated on CDP reachability)
tools/browser_tool.py
- browser_navigate start-hook: if CDP URL resolvable, SupervisorRegistry.get_or_start(task_id, cdp_url)
- browser_snapshot (at ~line 1536): merge supervisor state into return payload
- /browser connect handler: restart supervisor with new endpoint
- Session teardown hooks in _cleanup_browser_session
hermes_cli/config.py — add browser.dialog_policy and browser.dialog_timeout_s to DEFAULT_CONFIG
Docs: website/docs/user-guide/features/browser.md, website/docs/reference/tools-reference.md, website/docs/reference/toolsets-reference.md

Non-goals

Detection/interaction for Camofox (upstream gap; tracked separately)
Streaming dialog/frame events live to the user (would require gateway hooks)
Persisting dialog history across sessions (in-memory only)
Per-iframe dialog policies (agent can express this via dialog_id)
Replacing browser_cdp — it stays as the escape hatch for the long tail (cookies, viewport, network throttling)

Testing

Unit tests use an asyncio mock CDP server that speaks enough of the protocol to exercise all state transitions: attach, enable, navigate, dialog fire, dialog dismiss, frame attach/detach, child target attach, session teardown. Real-backend E2E (Browserbase + local Chrome) is manual; probe scripts from the 2026-04-23 investigation kept in-repo under scripts/browser_supervisor_e2e.py so anyone can re-verify on new backend versions.
