hermes-agent/website/docs/developer-guide/browser-supervisor.md
Teknium 8b6beaab5f
docs: 30-day overhaul — correctness audit, PR coverage, Nous Portal weave, sidebar reorg (#33782)
* docs(audit): correctness pass across getting-started, reference, features, messaging, developer-guide, guides, integrations, user-guide

* docs: add PR coverage for last 30d + Nous Portal weave + nav reorg + build fixes

- Add docs for top user-visible PRs that shipped without docs (api-server
  session control, kanban features, telegram pin/edit, provider client tag,
  xAI retired-model migration, cron name lookup, --branch update flag, etc.)
- Apply Nous Portal weave across 23 pages (tasteful one-liners on
  getting-started/learning-path, configuration, overview, vision, x-search,
  credential-pools, provider-routing, cron, codex-runtime, profiles, docker,
  messaging/index, multiple guides, plus FAQ + index promotion)
- Reorganize sidebar: split Messaging into Popular/M365/Chinese/Other,
  Reference into Command/Configuration/Tools-Skills sub-categories, add
  orphan developer-guide pages (web-search-provider-plugin,
  browser-supervisor), move features from Integrations back to Features,
  fold lone spotify into Media & Web.
- Regenerate skill stubs + catalogs (kanban-codex-lane, hermes-s6-container-
  supervision, web-pentest)
- Fix broken anchor links (security/cron, configuration/fallback, telegram
  large-files, adding-platform-adapters step-by-step)
2026-05-28 02:41:36 -07:00

9 KiB

sidebar_position title description
18 Browser CDP Supervisor How Hermes detects and responds to native JS dialogs and interacts with cross-origin iframes via a persistent CDP connection.

Browser CDP Supervisor

The CDP supervisor closes two long-standing gaps in Hermes' browser tooling:

  1. Native JS dialogs (alert/confirm/prompt/beforeunload) block the page's JS thread. Without supervision, the agent has no way to know a dialog is open — subsequent tool calls hang or throw opaque errors.
  2. Cross-origin iframes (OOPIFs) are invisible to top-level Runtime.evaluate. The agent can see iframe nodes in the DOM snapshot but can't click, type, or eval inside them without a CDP session attached to the child target.

The supervisor solves both by holding a persistent WebSocket to the backend's CDP endpoint per browser task, surfacing pending dialogs and frame structure into browser_snapshot, and exposing a browser_dialog tool for explicit responses.

Backend support

Backend Dialog detect Dialog respond Frame tree OOPIF Runtime.evaluate via browser_cdp(frame_id=...)
Local Chrome (--remote-debugging-port) / /browser connect ✓ full workflow
Browserbase ✓ (via bridge) ✓ full workflow (via bridge)
Camofox ✗ no CDP (REST-only) partial via DOM snapshot

Browserbase quirk. Browserbase's CDP proxy uses Playwright internally and auto-dismisses native dialogs within ~10ms, so Page.handleJavaScriptDialog can't keep up. The supervisor injects a bridge script via Page.addScriptToEvaluateOnNewDocument that overrides window.alert/confirm/prompt with a synchronous XHR to a magic host (hermes-dialog-bridge.invalid). Fetch.enable intercepts those XHRs before they touch the network — the dialog becomes a Fetch.requestPaused event the supervisor captures, and respond_to_dialog fulfills via Fetch.fulfillRequest with a JSON body the injected script decodes.

From the page's perspective, prompt() still returns the agent-supplied string. From the agent's perspective, it's the same browser_dialog(action=...) API either way.

Camofox is unsupported — no CDP surface, REST-only.

Architecture

CDPSupervisor

One asyncio.Task running in a background daemon thread per Hermes task_id. Holds a persistent WebSocket to the backend's CDP endpoint. Maintains:

  • Dialog queueList[PendingDialog] with {id, type, message, default_prompt, session_id, opened_at}
  • Frame treeDict[frame_id, FrameInfo] with parent relationships, URL, origin, whether cross-origin child session
  • Session mapDict[session_id, SessionInfo] so interaction tools can route to the right attached session for OOPIF operations
  • Recent console errors — ring buffer of the last 50 for diagnostics

Subscribes on attach:

  • Page.enablejavascriptDialogOpening, frameAttached, frameNavigated, frameDetached
  • Runtime.enableexecutionContextCreated, consoleAPICalled, exceptionThrown
  • Target.setAutoAttach {autoAttach: true, flatten: true} — surfaces child OOPIF targets; supervisor enables Page+Runtime on each

Thread-safe state access via a snapshot lock; tool handlers (sync) read the frozen snapshot without awaiting.

Lifecycle

  • Start: SupervisorRegistry.get_or_start(task_id, cdp_url) — called by browser_navigate, Browserbase session create, /browser connect. Idempotent.
  • Stop: session teardown or /browser disconnect. Cancels the asyncio task, closes the WebSocket, discards state.
  • Rebind: if the CDP URL changes (user reconnects to a new Chrome), the old supervisor is stopped and a fresh one started — state is never reused across endpoints.

Dialog policy

Configurable via config.yaml under browser.dialog_policy:

  • must_respond (default) — capture, surface in browser_snapshot, wait for explicit browser_dialog(action=...) call. After a 300s safety timeout with no response, auto-dismiss and log. Prevents a buggy agent from stalling forever.
  • auto_dismiss — record and dismiss immediately; agent sees it after the fact via browser_state inside browser_snapshot.
  • auto_accept — record and accept (useful for beforeunload where the workflow wants to navigate away cleanly).

Policy is per-task; no per-dialog overrides.

Agent surface

browser_dialog tool

browser_dialog(action, prompt_text=None, dialog_id=None)
  • action="accept" / "dismiss" → responds to the specified or sole pending dialog (required)
  • prompt_text=... → text to supply to a prompt() dialog
  • dialog_id=... → disambiguate when multiple dialogs are queued (rare)

Tool is response-only. The agent reads pending dialogs from browser_snapshot output before calling.

browser_snapshot extension

Adds three optional fields to the existing snapshot output when a supervisor is attached:

{
  "pending_dialogs": [
    {"id": "d-1", "type": "alert", "message": "Hello", "opened_at": 1650000000.0}
  ],
  "recent_dialogs": [
    {"id": "d-1", "type": "alert", "message": "...", "opened_at": 1650000000.0,
     "closed_at": 1650000000.1, "closed_by": "remote"}
  ],
  "frame_tree": {
    "top": {"frame_id": "FRAME_A", "url": "https://example.com/", "origin": "https://example.com"},
    "children": [
      {"frame_id": "FRAME_B", "url": "about:srcdoc", "is_oopif": false},
      {"frame_id": "FRAME_C", "url": "https://ads.example.net/", "is_oopif": true, "session_id": "SID_C"}
    ],
    "truncated": false
  }
}
  • pending_dialogs — dialogs currently blocking the page's JS thread. The agent must call browser_dialog(action=...) to respond. Empty on Browserbase because their CDP proxy auto-dismisses within ~10ms.

  • recent_dialogs — ring buffer of up to 20 recently-closed dialogs with a closed_by tag: "agent" (we responded), "auto_policy" (local auto_dismiss/auto_accept), "watchdog" (must_respond timeout hit), or "remote" (browser/backend closed it on us, e.g. Browserbase). This is how agents on Browserbase still get visibility into what happened.

  • frame_tree — frame structure including cross-origin (OOPIF) children. Capped at 30 entries + OOPIF depth 2 to bound snapshot size on ad-heavy pages. truncated: true surfaces when limits were hit; agents needing the full tree can use browser_cdp with Page.getFrameTree.

No new tool schema surface for any of these — the agent reads the snapshot it already requests.

Availability gating

Both surfaces gate on _browser_cdp_check (supervisor can only run when a CDP endpoint is reachable). On Camofox / no-backend sessions, the dialog tool is hidden and the snapshot omits the new fields — no schema bloat.

Cross-origin iframe interaction

browser_cdp(frame_id=...) routes CDP calls (notably Runtime.evaluate) through the supervisor's already-connected WebSocket using the OOPIF's child sessionId. Agents pick frame_ids out of browser_snapshot.frame_tree.children[] where is_oopif=true and pass them to browser_cdp. For same-origin iframes (no dedicated CDP session), the agent uses contentWindow/contentDocument from a top-level Runtime.evaluate instead — the supervisor surfaces an error pointing at that fallback when frame_id belongs to a non-OOPIF.

On Browserbase, this is the only reliable path for iframe interaction — stateless CDP connections (opened per browser_cdp call) hit signed-URL expiry, while the supervisor's long-lived connection keeps a valid session.

File layout

  • tools/browser_supervisor.pyCDPSupervisor, SupervisorRegistry, PendingDialog, FrameInfo
  • tools/browser_dialog_tool.pybrowser_dialog tool handler
  • tools/browser_tool.pybrowser_navigate start-hook, browser_snapshot merge, /browser connect reattach, _cleanup_browser_session teardown
  • toolsets.py — registers browser_dialog in browser, hermes-acp, hermes-api-server, and core toolsets (gated on CDP reachability)
  • hermes_cli/config.pybrowser.dialog_policy and browser.dialog_timeout_s defaults

Non-goals

  • Detection/interaction for Camofox (upstream gap; tracked separately)
  • Streaming dialog/frame events live to the user (would require gateway hooks)
  • Persisting dialog history across sessions (in-memory only)
  • Per-iframe dialog policies (agent can express this via dialog_id)
  • Replacing browser_cdp — it stays as the escape hatch for the long tail (cookies, viewport, network throttling)

Testing

Unit tests (tests/tools/test_browser_supervisor.py) use an asyncio mock CDP server that speaks enough of the protocol to exercise all state transitions: attach, enable, navigate, dialog fire, dialog dismiss, frame attach/detach, child target attach, session teardown. Real-backend E2E (Browserbase + local Chromium-family browser) is manual — exercise via /browser connect to a live Chromium-family browser and run the dialog/frame test cases described above.