feat(browser): CDP supervisor — dialog detection + response + cross-origin iframe eval (#14540)

* docs: browser CDP supervisor design (for upcoming PR)

Design doc ahead of implementation — dialog + iframe detection/interaction
via a persistent CDP supervisor. Covers backend capability matrix (verified
live 2026-04-23), architecture, lifecycle, policy, agent surface, PR split,
non-goals, and test plan.

Supersedes #12550.

No code changes in this commit.

* feat(browser): add persistent CDP supervisor for dialog + frame detection

Single persistent CDP WebSocket per Hermes task_id that subscribes to
Page/Runtime/Target events and maintains thread-safe state for pending
dialogs, frame tree, and console errors.

Supervisor lives in its own daemon thread running an asyncio loop;
external callers use sync API (snapshot(), respond_to_dialog()) that
bridges onto the loop.
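
The sync-to-async bridge described above can be sketched roughly as follows. This is a minimal illustration, not the supervisor's actual code: `LoopThread` and `_fake_snapshot` are hypothetical names, and the only real API relied on is `asyncio.run_coroutine_threadsafe`.

```python
import asyncio
import threading


class LoopThread:
    """Owns an asyncio loop on a daemon thread (hypothetical helper)."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def call(self, coro, timeout: float = 5.0):
        """Schedule `coro` on the loop thread and block the caller for its result."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout)

    def stop(self) -> None:
        """Stop the loop from the outside, then join and close it."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join(timeout=5.0)
        self._loop.close()


async def _fake_snapshot() -> dict:
    # Stand-in for reading supervisor state on the loop thread.
    await asyncio.sleep(0)
    return {"active": True, "pending_dialogs": []}


bridge = LoopThread()
snapshot_result = bridge.call(_fake_snapshot())
bridge.stop()
```

Callers stay fully synchronous; only the loop thread ever touches the WebSocket.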

Auto-attaches to OOPIF child targets via Target.setAutoAttach{flatten:true}
and enables Page+Runtime on each so iframe-origin dialogs surface through
the same supervisor.

Dialog policies: must_respond (default, 300s safety timeout),
auto_dismiss, auto_accept.

Frame tree capped at 30 entries + OOPIF depth 2 to keep snapshot
payloads bounded on ad-heavy pages.

E2E verified against real Chrome via smoke test — detects + responds
to main-frame alerts, iframe-contentWindow alerts, preserves frame
tree, graceful no-dialog error path, clean shutdown.

No agent-facing tool wiring in this commit (comes next).

* feat(browser): add browser_dialog tool wired to CDP supervisor

Agent-facing response-only tool. Schema:
  action: 'accept' | 'dismiss' (required)
  prompt_text: response for prompt() dialogs (optional)
  dialog_id: disambiguate when multiple dialogs queued (optional)

Handler:
  SUPERVISOR_REGISTRY.get(task_id).respond_to_dialog(...)

check_fn shares _browser_cdp_check with browser_cdp so both surface and
hide together. When no supervisor is attached (Camofox, default
Playwright, or no browser session started yet), tool is hidden; if
somehow invoked it returns a clear error pointing the agent to
browser_navigate / /browser connect.

Registered in _HERMES_CORE_TOOLS and the browser / hermes-acp /
hermes-api-server toolsets alongside browser_cdp.

* feat(browser): wire CDP supervisor into session lifecycle + browser_snapshot

Supervisor lifecycle:
  * _get_session_info lazy-starts the supervisor after a session row is
    materialized — covers every backend code path (Browserbase, cdp_url
    override, /browser connect, future providers) with one hook.
  * cleanup_browser(task_id) stops the supervisor for that task first
    (before the backend tears down CDP).
  * cleanup_all_browsers() calls SUPERVISOR_REGISTRY.stop_all().
  * /browser connect eagerly starts the supervisor for task 'default'
    so the first snapshot already shows pending_dialogs.
  * /browser disconnect stops the supervisor.

CDP URL resolution for the supervisor:
  1. BROWSER_CDP_URL / browser.cdp_url override.
  2. Fallback: session_info['cdp_url'] from cloud providers (Browserbase).
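
A rough sketch of that resolution order. The function name and the relative precedence of the environment variable over the config key are assumptions; the commit only says both act as the override, ahead of the provider's session URL.

```python
import os
from typing import Mapping, Optional


def resolve_cdp_url(
    browser_config: Mapping[str, str],
    session_info: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the CDP URL the supervisor should attach to, or None."""
    env = os.environ if env is None else env
    # 1. Explicit override (assumed: env var wins over the config key).
    override = env.get("BROWSER_CDP_URL") or browser_config.get("cdp_url")
    if override:
        return override
    # 2. Fallback: URL handed back by a cloud provider such as Browserbase.
    return session_info.get("cdp_url") or None
```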

browser_snapshot merges supervisor state (pending_dialogs + frame_tree)
into its JSON output when a supervisor is active — the agent reads
pending_dialogs from the snapshot it already requests, then calls
browser_dialog to respond. No extra tool surface.

Config defaults:
  * browser.dialog_policy: 'must_respond' (new)
  * browser.dialog_timeout_s: 300 (new)
No version bump — new keys deep-merge into existing browser section.
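
To illustrate why no version bump is needed, here is a minimal deep-merge sketch. `deep_merge` is a hypothetical helper, not necessarily how the config loader is written; the two default values are the ones listed above.

```python
def deep_merge(defaults: dict, user: dict) -> dict:
    """Return defaults overlaid with user values, recursing into nested dicts."""
    merged = dict(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


DEFAULTS = {"browser": {"cdp_url": "", "dialog_policy": "must_respond", "dialog_timeout_s": 300}}
# An existing user config written before the new keys existed.
existing_user_config = {"browser": {"cdp_url": "ws://127.0.0.1:9222"}}
effective = deep_merge(DEFAULTS, existing_user_config)
```

The user's `cdp_url` survives and the two new keys pick up their defaults.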

Deadlock fix in supervisor event dispatch:
  * _on_dialog_opening and _on_target_attached used to await CDP calls
    while the reader was still processing an event — but only the reader
    can set the response Future, so the call timed out.
  * Both now fire asyncio.create_task(...) so the reader stays pumping.
  * auto_dismiss/auto_accept now actually close the dialog immediately.
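
The deadlock and its fix can be reproduced without a browser. The sketch below is a toy model under stated assumptions: `MiniClient` is hypothetical, the "browser" is an in-process queue, and the 0.2s timeout is arbitrary.

```python
import asyncio


class MiniClient:
    """Toy CDP-like client: one inbox, responses matched to calls by id."""

    def __init__(self) -> None:
        self.inbox: asyncio.Queue = asyncio.Queue()
        self.pending: dict = {}
        self.next_id = 0
        self.handled: list = []
        self.tasks: set = set()

    async def call(self, method: str) -> str:
        self.next_id += 1
        call_id = self.next_id
        future = asyncio.get_running_loop().create_future()
        self.pending[call_id] = future
        # Fake browser: the response lands in the inbox, where only the reader sees it.
        self.inbox.put_nowait({"id": call_id, "result": method})
        return await asyncio.wait_for(future, timeout=0.2)

    async def on_dialog(self) -> None:
        try:
            self.handled.append(await self.call("Page.handleJavaScriptDialog"))
        except asyncio.TimeoutError:
            self.handled.append("timeout")

    async def reader(self, spawn: bool) -> None:
        while True:
            msg = await self.inbox.get()
            if msg is None:  # shutdown sentinel
                return
            if "id" in msg:
                future = self.pending.pop(msg["id"], None)
                if future is not None and not future.done():
                    future.set_result(msg["result"])
            elif spawn:
                # Fixed behaviour: hand the handler to a task, keep pumping.
                task = asyncio.create_task(self.on_dialog())
                self.tasks.add(task)
                task.add_done_callback(self.tasks.discard)
            else:
                # Old behaviour: the reader blocks on a reply only it can deliver.
                await self.on_dialog()


async def run(spawn: bool) -> list:
    client = MiniClient()
    reader = asyncio.create_task(client.reader(spawn))
    client.inbox.put_nowait({"method": "Page.javascriptDialogOpening"})
    await asyncio.sleep(0.5)
    client.inbox.put_nowait(None)
    await reader
    return client.handled
```

With `spawn=False` the handler's call times out; with `spawn=True` the same call completes because the reader is free to deliver the response.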

Tests (tests/tools/test_browser_supervisor.py, 11 tests, real Chrome):
  * supervisor start/snapshot
  * main-frame alert detection + dismiss
  * iframe.contentWindow alert
  * prompt() with prompt_text reply
  * respond with no pending dialog -> clean error
  * auto_dismiss clears on event
  * registry idempotency
  * registry stop -> snapshot reports inactive
  * browser_dialog tool no-supervisor error
  * browser_dialog invalid action
  * browser_dialog end-to-end via tool handler

xdist-safe: chrome_cdp fixture uses a per-worker port.
Skipped when google-chrome/chromium isn't installed.
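
The per-worker port scheme amounts to the following. The base port 9225 and the "master" / "gwN" worker ids come from the fixture in this commit; `port_for_worker` itself is a hypothetical helper, and `str.removeprefix` assumes Python 3.9+.

```python
def port_for_worker(worker_id: str, base_port: int = 9225) -> int:
    """Map a pytest-xdist worker id to a distinct remote-debugging port."""
    if worker_id == "master":  # single-process run, no xdist workers
        return base_port
    return base_port + int(worker_id.removeprefix("gw"))
```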

* docs(browser): document browser_dialog tool + CDP supervisor

- user-guide/features/browser.md: new browser_dialog section with
  workflow, availability gate, and dialog_policy table
- reference/tools-reference.md: row for browser_dialog, tool count
  bumped 53 -> 54, browser tools count 11 -> 12
- reference/toolsets-reference.md: browser_dialog added to browser
  toolset row with note on pending_dialogs / frame_tree snapshot fields

Full design doc lives at
developer-guide/browser-supervisor.md (committed earlier).

* fix(browser): reconnect loop + recent_dialogs for Browserbase visibility

Found via a Browserbase E2E test, which revealed three issues (the first two production-critical):

1. **Supervisor WebSocket drops when other clients disconnect.** Browserbase's
   CDP proxy tears down our long-lived WebSocket whenever a short-lived
   client (e.g. agent-browser CLI's per-command CDP connection) disconnects.
   Fixed with a reconnecting _run loop that re-attaches with exponential
   backoff on drops. _page_session_id and _child_sessions are reset on each
   reconnect; pending_dialogs and frames are preserved across reconnects.

2. **Browserbase auto-dismisses dialogs server-side within ~10ms.** Their
   Playwright-based CDP proxy dismisses alert/confirm/prompt before our
   Page.handleJavaScriptDialog call can respond. So pending_dialogs is
   empty by the time the agent reads a snapshot on Browserbase.

   Added a recent_dialogs ring buffer (capacity 20) that retains a
   DialogRecord for every dialog that opened, with a closed_by tag:
     * 'agent'       — agent called browser_dialog
     * 'auto_policy' — local auto_dismiss/auto_accept fired
     * 'watchdog'    — must_respond timeout auto-dismissed (300s default)
     * 'remote'      — browser/backend closed it on us (Browserbase)

   Agents on Browserbase now see the dialog history with closed_by='remote'
   so they at least know a dialog fired, even though they couldn't respond.

3. **Page.javascriptDialogClosed matching bug.** The event doesn't include a
   'message' field (CDP spec has only 'result' and 'userInput') but our
   _on_dialog_closed was matching on message. Fixed to match by session_id
   + oldest-first, with a safety assumption that only one dialog is in
   flight per session (the JS thread is blocked while a dialog is up).
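
A minimal sketch of the ring buffer plus the session-based close matching from items 2 and 3. `DialogLog`, the `session_id` and `dialog_id` fields, and the method names are assumptions for illustration; `type`, `message`, `opened_at`, `closed_at`, `closed_by`, and the capacity of 20 are taken from this commit.

```python
import time
from collections import deque
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DialogRecord:
    dialog_id: str
    session_id: str
    type: str
    message: str
    opened_at: float
    closed_at: Optional[float] = None
    closed_by: Optional[str] = None  # 'agent' | 'auto_policy' | 'watchdog' | 'remote'


class DialogLog:
    """Pending dialogs (oldest first) plus a bounded history of closed ones."""

    def __init__(self, capacity: int = 20) -> None:
        self.pending: list = []
        self.recent: deque = deque(maxlen=capacity)

    def opened(self, record: DialogRecord) -> None:
        self.pending.append(record)

    def closed(self, session_id: str, closed_by: str) -> Optional[DialogRecord]:
        # The close event carries no message, so match the oldest pending
        # dialog on the same session instead.
        for index, record in enumerate(self.pending):
            if record.session_id == session_id:
                done = replace(
                    self.pending.pop(index), closed_at=time.time(), closed_by=closed_by
                )
                self.recent.append(done)
                return done
        return None
```

Because `deque(maxlen=20)` drops the oldest entry on overflow, the history stays bounded without any explicit trimming.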

Docs + tests updated:
  * browser.md: new availability matrix showing the three backends and
    which mode (pending / recent / response) each supports
  * developer-guide/browser-supervisor.md: three-field snapshot schema
    with closed_by semantics
  * test_browser_supervisor.py: +test_recent_dialogs_ring_buffer (12/12
    passing against real Chrome)

E2E verified both backends:
  * Local Chrome via /browser connect: detect + respond full workflow
    (smoke_supervisor.py all 7 scenarios pass)
  * Browserbase: detect via recent_dialogs with closed_by='remote'
    (smoke_supervisor_browserbase_v2.py passes)

Camofox remains out of scope (REST-only, no CDP) — tracked for
upstream PR 3.

* feat(browser): XHR bridge for dialog response on Browserbase (FIXED)

Browserbase's CDP proxy auto-dismisses native JS dialogs within ~10ms, so
Page.handleJavaScriptDialog calls lose the race. Solution: bypass native
dialogs entirely.

The supervisor now uses Page.addScriptToEvaluateOnNewDocument to inject a
JavaScript override for window.alert/confirm/prompt. Those overrides
perform a synchronous XMLHttpRequest to a magic host
('hermes-dialog-bridge.invalid'). We intercept those XHRs via Fetch.enable
with a requestStage=Request pattern.

Flow when a page calls alert('hi'):
  1. window.alert override intercepts, builds XHR GET to
     http://hermes-dialog-bridge.invalid/?kind=alert&message=hi
  2. Sync XHR blocks the page's JS thread (mirrors real dialog semantics)
  3. Fetch.requestPaused fires on our WebSocket; supervisor surfaces
     it as a pending dialog with bridge_request_id set
  4. Agent reads pending_dialogs from browser_snapshot, calls browser_dialog
  5. Supervisor calls Fetch.fulfillRequest with JSON body:
     {accept: true|false, prompt_text: '...', dialog_id: 'd-N'}
  6. The injected script parses the body, returns the appropriate value
     from the override (undefined for alert, bool for confirm, string|null
     for prompt)
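
Steps 3 and 5 of the flow can be sketched on the supervisor side as below. The helper names and the response headers are assumptions, not the committed implementation; the host, query keys, and JSON body shape follow the description above, and `Fetch.fulfillRequest` expects its `body` base64-encoded.

```python
import base64
import json
from typing import Optional
from urllib.parse import parse_qs, urlsplit

BRIDGE_HOST = "hermes-dialog-bridge.invalid"


def parse_bridge_request(url: str) -> Optional[dict]:
    """Extract dialog kind and message from a paused bridge XHR, else None."""
    parts = urlsplit(url)
    if parts.hostname != BRIDGE_HOST:
        return None
    query = parse_qs(parts.query, keep_blank_values=True)
    return {
        "kind": query.get("kind", ["alert"])[0],
        "message": query.get("message", [""])[0],
    }


def fulfill_params(request_id: str, accept: bool, prompt_text: str = "", dialog_id: str = "") -> dict:
    """Build params for Fetch.fulfillRequest carrying the agent's answer."""
    body = json.dumps({"accept": accept, "prompt_text": prompt_text, "dialog_id": dialog_id})
    return {
        "requestId": request_id,
        "responseCode": 200,
        # Headers are an assumption; a cross-origin sync XHR likely needs CORS.
        "responseHeaders": [
            {"name": "Content-Type", "value": "application/json"},
            {"name": "Access-Control-Allow-Origin", "value": "*"},
        ],
        "body": base64.b64encode(body.encode("utf-8")).decode("ascii"),
    }
```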

This works identically on Browserbase AND local Chrome — no native dialog
ever fires, so Browserbase's auto-dismiss has nothing to race. Dialog
policies (must_respond / auto_dismiss / auto_accept) all still work.

Bridge is installed on every attached session (main page + OOPIF child
sessions) so iframe dialogs are captured too.

The native-dialog path is kept as a fallback for backends that don't
auto-dismiss, so a page that somehow bypasses our override (e.g. iframes
that load after Fetch.enable but before the init script runs) is still
observed via Page.javascriptDialogOpening.

E2E VERIFIED:
  * Local Chrome: 13/13 pytest tests green (12 original + new
    test_bridge_captures_prompt_and_returns_reply_text that asserts
    window.__ret === 'AGENT-SUPPLIED-REPLY' after agent responds)
  * Browserbase: smoke_bb_bridge_v2.py runs 4/4 PASS:
    - alert('BB-ALERT-MSG') dismiss → page.alert_ret = undefined ✓
    - prompt('BB-PROMPT-MSG', 'default-xyz') accept with 'AGENT-REPLY'
      → page.prompt_ret === 'AGENT-REPLY' ✓
    - confirm('BB-CONFIRM-MSG') accept → page.confirm_ret === true ✓
    - confirm('BB-CONFIRM-MSG') dismiss → page.confirm_ret === false ✓

Docs updated in browser.md and developer-guide/browser-supervisor.md —
availability matrix now shows Browserbase at full parity with local
Chrome for both detection and response.

* feat(browser): cross-origin iframe interaction via browser_cdp(frame_id=...)

Adds iframe interaction to the CDP supervisor PR (was queued as PR 2).

Design: browser_cdp gets an optional frame_id parameter. When set, the
tool looks up the frame in the supervisor's frame_tree, grabs its child
cdp_session_id (OOPIF session), and dispatches the CDP call through the
supervisor's already-connected WebSocket via run_coroutine_threadsafe.

Why not stateless: on Browserbase, each fresh browser_cdp WebSocket
must re-negotiate against a signed connectUrl. The session info carries
a specific URL that can expire while the supervisor's long-lived
connection stays valid. Routing via the supervisor sidesteps this.

Agent workflow:
  1. browser_snapshot → frame_tree.children[] shows OOPIFs with is_oopif=true
  2. browser_cdp(method='Runtime.evaluate', frame_id=<OOPIF frame_id>,
                 params={'expression': 'document.title', 'returnByValue': True})
  3. Supervisor dispatches the call on the OOPIF's child session
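
The routing step boils down to attaching the OOPIF's session id to the outgoing CDP message. A minimal sketch, assuming the frame tree is a plain dict keyed by frame_id (`build_frame_call` is a hypothetical helper); the top-level `sessionId` field is how flattened CDP sessions are addressed.

```python
import json
from typing import Optional


def build_frame_call(frames: dict, frame_id: str, method: str, params: Optional[dict], call_id: int) -> str:
    """Serialize a CDP call addressed to the session that owns `frame_id`."""
    frame = frames.get(frame_id)
    if frame is None:
        raise KeyError(f"frame_id {frame_id!r} not found in frame_tree")
    session_id = frame.get("cdp_session_id")
    if not session_id:
        raise LookupError(f"frame {frame_id!r} has no live CDP session")
    return json.dumps(
        {"id": call_id, "method": method, "params": params or {}, "sessionId": session_id}
    )
```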

Supervisor state fixes needed along the way:
  * _on_frame_detached now skips reason='swap' (frame migrating processes)
  * _on_frame_detached also skips when the frame is an OOPIF with a live
    child session — Browserbase fires spurious remove events when a
    same-origin iframe gets promoted to OOPIF
  * _on_target_detached clears cdp_session_id but KEEPS the frame record
    so the agent still sees the OOPIF in frame_tree during transient
    session flaps
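
The two frame-detach rules above reduce to a small predicate. This is a sketch with a hypothetical function name; the `reason` values 'remove' and 'swap' are the ones Page.frameDetached reports.

```python
from typing import Optional


def should_remove_frame(reason: str, is_oopif: bool, cdp_session_id: Optional[str]) -> bool:
    """Decide whether a Page.frameDetached event should drop the frame record."""
    if reason == "swap":
        # Frame is migrating between processes, not going away.
        return False
    if is_oopif and cdp_session_id:
        # Spurious detach while the OOPIF's child session is still alive.
        return False
    return True
```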

E2E VERIFIED on Browserbase (smoke_bb_iframe_agent_path.py):
  browser_cdp(method='Runtime.evaluate',
              params={'expression': 'document.title', 'returnByValue': True},
              frame_id=<OOPIF>)
  → {'success': True, 'result': {'value': 'Example Domain'}}

  The iframe is <iframe src='https://example.com/'> inside a top-level
  data: URL page on a real Browserbase session. The agent Runtime.evaluates
  INSIDE the cross-origin iframe and gets example.com's title back.

Tests (tests/tools/test_browser_supervisor.py — 16 pass total):
  * test_browser_cdp_frame_id_routes_via_supervisor — injects fake OOPIF,
    verifies routing via supervisor, Runtime.evaluate returns 1+1=2
  * test_browser_cdp_frame_id_missing_supervisor — clean error when no
    supervisor attached
  * test_browser_cdp_frame_id_not_in_frame_tree — clean error on bad
    frame_id

Docs (browser.md and developer-guide/browser-supervisor.md) updated with
the iframe workflow, availability matrix now shows OOPIF eval as shipped
for local Chrome + Browserbase.

* test(browser): real-OOPIF E2E verified manually + chrome_cdp uses --site-per-process

When asked 'did you test the iframe stuff' I had only done a mocked
pytest (fake injected OOPIF) plus a Browserbase E2E. Closed the
local-Chrome real-OOPIF gap by writing /tmp/dialog-iframe-test/
smoke_local_oopif.py:

  * 2 http servers on different hostnames (localhost:18905 + 127.0.0.1:18906)
  * Chrome with --site-per-process so the cross-origin iframe becomes a
    real OOPIF in its own process
  * Navigate, find OOPIF in supervisor.frame_tree, call
    browser_cdp(method='Runtime.evaluate', frame_id=<OOPIF>) which routes
    through the supervisor's child session
  * Asserts iframe document.title === 'INNER-FRAME-XYZ' (from the
    inner page, retrieved via OOPIF eval)

PASSED on 2026-04-23.

Tried to embed this as a pytest but hit an asyncio version quirk between
venv (3.11) and the system python (3.13) — Page.navigate hangs in the
pytest harness but works in standalone. Left a self-documenting skip
test that points to the smoke script + describes the verification.

chrome_cdp fixture now passes --site-per-process so future iframe tests
can rely on OOPIF behavior.

Result: 16 pass + 1 documented-skip = 17 tests in
tests/tools/test_browser_supervisor.py.

* docs(browser): add dialog_policy + dialog_timeout_s to configuration.md, fix tool count

Pre-merge docs audit revealed two gaps:

1. user-guide/configuration.md browser config example was missing the
   two new dialog_* knobs. Added with a short table explaining
   must_respond / auto_dismiss / auto_accept semantics and a link to
   the feature page for the full workflow.

2. reference/tools-reference.md header said '54 built-in tools', but the
   real count on main is already 54 and this branch adds browser_dialog,
   so the correct count is 55. Fixed the header. (The browser count was
   already correctly bumped 11 -> 12 in the earlier docs commit.)

No code changes.
Teknium 2026-04-23 22:23:37 -07:00 committed by GitHub
parent 0f6eabb890
commit 5a1c599412
GPG key ID: B5690EEEBB952194
13 changed files with 2665 additions and 24 deletions

cli.py

@@ -6685,6 +6685,13 @@ class HermesCLI:
 print(f" ⚠ Port {_port} is not reachable at {cdp_url}")
 os.environ["BROWSER_CDP_URL"] = cdp_url
+# Eagerly start the CDP supervisor so pending_dialogs + frame_tree
+# show up in the next browser_snapshot. No-op if already started.
+try:
+    from tools.browser_tool import _ensure_cdp_supervisor  # type: ignore[import-not-found]
+    _ensure_cdp_supervisor("default")
+except Exception:
+    pass
 print()
 print("🌐 Browser connected to live Chrome via CDP")
 print(f" Endpoint: {cdp_url}")
@@ -6706,7 +6713,8 @@ class HermesCLI:
 if current:
     os.environ.pop("BROWSER_CDP_URL", None)
     try:
-        from tools.browser_tool import cleanup_all_browsers
+        from tools.browser_tool import cleanup_all_browsers, _stop_cdp_supervisor
+        _stop_cdp_supervisor("default")
         cleanup_all_browsers()
     except Exception:
         pass


@@ -466,6 +466,12 @@ DEFAULT_CONFIG = {
 "record_sessions": False,  # Auto-record browser sessions as WebM videos
 "allow_private_urls": False,  # Allow navigating to private/internal IPs (localhost, 192.168.x.x, etc.)
 "cdp_url": "",  # Optional persistent CDP endpoint for attaching to an existing Chromium/Chrome
+# CDP supervisor: dialog + frame detection via a persistent WebSocket.
+# Active only when a CDP-capable backend is attached (Browserbase or
+# local Chrome via /browser connect). See
+# website/docs/developer-guide/browser-supervisor.md.
+"dialog_policy": "must_respond",  # must_respond | auto_dismiss | auto_accept
+"dialog_timeout_s": 300,  # Safety auto-dismiss after N seconds under must_respond
 "camofox": {
     # When true, Hermes sends a stable profile-scoped userId to Camofox
     # so the server maps it to a persistent Firefox profile automatically.


@@ -0,0 +1,563 @@
"""Integration tests for tools.browser_supervisor.

Exercises the supervisor end-to-end against a real local Chrome
(``--remote-debugging-port``). Skipped when Chrome is not installed.
These are the tests that actually verify the CDP wire protocol
works, since mock-CDP unit tests can only prove the happy paths we
thought to model.

Run manually:
    scripts/run_tests.sh tests/tools/test_browser_supervisor.py

Automated: skipped in CI unless ``HERMES_E2E_BROWSER=1`` is set.
"""

from __future__ import annotations

import asyncio
import base64
import json
import os
import shutil
import subprocess
import tempfile
import time

import pytest

pytestmark = pytest.mark.skipif(
    not shutil.which("google-chrome") and not shutil.which("chromium"),
    reason="Chrome/Chromium not installed",
)


def _find_chrome() -> str:
    for candidate in ("google-chrome", "chromium", "chromium-browser"):
        path = shutil.which(candidate)
        if path:
            return path
    pytest.skip("no Chrome binary found")


@pytest.fixture
def chrome_cdp(worker_id):
    """Start a headless Chrome with --remote-debugging-port, yield its WS URL.

    Uses a unique port per xdist worker to avoid cross-worker collisions.
    Always launches with ``--site-per-process`` so cross-origin iframes
    become real OOPIFs (needed by the iframe interaction tests).
    """
    import socket

    # xdist worker_id is "master" in single-process mode or "gw0".."gwN" otherwise.
    if worker_id == "master":
        port_offset = 0
    else:
        port_offset = int(worker_id.lstrip("gw"))
    port = 9225 + port_offset
    profile = tempfile.mkdtemp(prefix="hermes-supervisor-test-")
    proc = subprocess.Popen(
        [
            _find_chrome(),
            f"--remote-debugging-port={port}",
            f"--user-data-dir={profile}",
            "--no-first-run",
            "--no-default-browser-check",
            "--headless=new",
            "--disable-gpu",
            "--site-per-process",  # force OOPIFs for cross-origin iframes
        ],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    ws_url = None
    deadline = time.monotonic() + 15
    while time.monotonic() < deadline:
        try:
            import urllib.request

            with urllib.request.urlopen(
                f"http://127.0.0.1:{port}/json/version", timeout=1
            ) as r:
                info = json.loads(r.read().decode())
                ws_url = info["webSocketDebuggerUrl"]
                break
        except Exception:
            time.sleep(0.25)
    if ws_url is None:
        proc.terminate()
        proc.wait(timeout=5)
        shutil.rmtree(profile, ignore_errors=True)
        pytest.skip("Chrome didn't expose CDP in time")
    yield ws_url, port
    proc.terminate()
    try:
        proc.wait(timeout=3)
    except Exception:
        proc.kill()
    shutil.rmtree(profile, ignore_errors=True)


def _test_page_url() -> str:
    html = """<!doctype html>
<html><head><title>Supervisor pytest</title></head><body>
<h1>Supervisor pytest</h1>
<iframe id="inner" srcdoc="<body><h2>frame-marker</h2></body>" width="400" height="100"></iframe>
</body></html>"""
    return "data:text/html;base64," + base64.b64encode(html.encode()).decode()


def _fire_on_page(cdp_url: str, expression: str) -> None:
    """Navigate the first page target to a data URL and fire `expression`."""
    import asyncio
    import websockets as _ws_mod

    async def run():
        async with _ws_mod.connect(cdp_url, max_size=50 * 1024 * 1024) as ws:
            next_id = [1]

            async def call(method, params=None, session_id=None):
                cid = next_id[0]
                next_id[0] += 1
                p = {"id": cid, "method": method}
                if params:
                    p["params"] = params
                if session_id:
                    p["sessionId"] = session_id
                await ws.send(json.dumps(p))
                async for raw in ws:
                    m = json.loads(raw)
                    if m.get("id") == cid:
                        return m

            targets = (await call("Target.getTargets"))["result"]["targetInfos"]
            page = next(t for t in targets if t.get("type") == "page")
            attach = await call(
                "Target.attachToTarget", {"targetId": page["targetId"], "flatten": True}
            )
            sid = attach["result"]["sessionId"]
            await call("Page.navigate", {"url": _test_page_url()}, session_id=sid)
            await asyncio.sleep(1.5)  # let the page load
            await call(
                "Runtime.evaluate",
                {"expression": expression, "returnByValue": True},
                session_id=sid,
            )

    asyncio.run(run())


@pytest.fixture
def supervisor_registry():
    """Yield the global registry and tear down any supervisors after the test."""
    from tools.browser_supervisor import SUPERVISOR_REGISTRY

    yield SUPERVISOR_REGISTRY
    SUPERVISOR_REGISTRY.stop_all()


def _wait_for_dialog(supervisor, timeout: float = 5.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        snap = supervisor.snapshot()
        if snap.pending_dialogs:
            return snap.pending_dialogs
        time.sleep(0.1)
    return ()


def test_supervisor_start_and_snapshot(chrome_cdp, supervisor_registry):
    """Supervisor attaches, exposes an active snapshot with a top frame."""
    cdp_url, _port = chrome_cdp
    supervisor = supervisor_registry.get_or_start(task_id="pytest-1", cdp_url=cdp_url)
    # Navigate so the frame tree populates.
    _fire_on_page(cdp_url, "/* no dialog */ void 0")
    # Give a moment for frame events to propagate
    time.sleep(1.0)
    snap = supervisor.snapshot()
    assert snap.active is True
    assert snap.task_id == "pytest-1"
    assert snap.pending_dialogs == ()
    # At minimum a top frame should exist after the navigate.
    assert snap.frame_tree.get("top") is not None


def test_main_frame_alert_detection_and_dismiss(chrome_cdp, supervisor_registry):
    """alert() in the main frame surfaces and can be dismissed via the sync API."""
    cdp_url, _port = chrome_cdp
    supervisor = supervisor_registry.get_or_start(task_id="pytest-2", cdp_url=cdp_url)
    _fire_on_page(cdp_url, "setTimeout(() => alert('PYTEST-MAIN-ALERT'), 50)")
    dialogs = _wait_for_dialog(supervisor)
    assert dialogs, "no dialog detected"
    d = dialogs[0]
    assert d.type == "alert"
    assert "PYTEST-MAIN-ALERT" in d.message
    result = supervisor.respond_to_dialog("dismiss")
    assert result["ok"] is True
    # State cleared after dismiss
    time.sleep(0.3)
    assert supervisor.snapshot().pending_dialogs == ()


def test_iframe_contentwindow_alert(chrome_cdp, supervisor_registry):
    """alert() fired from inside a same-origin iframe surfaces too."""
    cdp_url, _port = chrome_cdp
    supervisor = supervisor_registry.get_or_start(task_id="pytest-3", cdp_url=cdp_url)
    _fire_on_page(
        cdp_url,
        "setTimeout(() => document.querySelector('#inner').contentWindow.alert('PYTEST-IFRAME'), 50)",
    )
    dialogs = _wait_for_dialog(supervisor)
    assert dialogs, "no iframe dialog detected"
    assert any("PYTEST-IFRAME" in d.message for d in dialogs)
    result = supervisor.respond_to_dialog("accept")
    assert result["ok"] is True


def test_prompt_dialog_with_response_text(chrome_cdp, supervisor_registry):
    """prompt() gets our prompt_text back inside the page."""
    cdp_url, _port = chrome_cdp
    supervisor = supervisor_registry.get_or_start(task_id="pytest-4", cdp_url=cdp_url)
    # Fire a prompt and stash the answer on window
    _fire_on_page(
        cdp_url,
        "setTimeout(() => { window.__promptResult = prompt('give me a token', 'default-x'); }, 50)",
    )
    dialogs = _wait_for_dialog(supervisor)
    assert dialogs
    d = dialogs[0]
    assert d.type == "prompt"
    assert d.default_prompt == "default-x"
    result = supervisor.respond_to_dialog("accept", prompt_text="PYTEST-PROMPT-REPLY")
    assert result["ok"] is True


def test_respond_with_no_pending_dialog_errors_cleanly(chrome_cdp, supervisor_registry):
    """Calling respond_to_dialog when nothing is pending returns a clean error, not an exception."""
    cdp_url, _port = chrome_cdp
    supervisor = supervisor_registry.get_or_start(task_id="pytest-5", cdp_url=cdp_url)
    result = supervisor.respond_to_dialog("accept")
    assert result["ok"] is False
    assert "no dialog" in result["error"].lower()


def test_auto_dismiss_policy(chrome_cdp, supervisor_registry):
    """auto_dismiss policy clears dialogs without the agent responding."""
    from tools.browser_supervisor import DIALOG_POLICY_AUTO_DISMISS

    cdp_url, _port = chrome_cdp
    supervisor = supervisor_registry.get_or_start(
        task_id="pytest-6",
        cdp_url=cdp_url,
        dialog_policy=DIALOG_POLICY_AUTO_DISMISS,
    )
    _fire_on_page(cdp_url, "setTimeout(() => alert('PYTEST-AUTO-DISMISS'), 50)")
    # Give the supervisor a moment to see + auto-dismiss
    time.sleep(2.0)
    snap = supervisor.snapshot()
    # Nothing pending because auto-dismiss cleared it immediately
    assert snap.pending_dialogs == ()


def test_registry_idempotent_get_or_start(chrome_cdp, supervisor_registry):
    """Calling get_or_start twice with the same (task, url) returns the same instance."""
    cdp_url, _port = chrome_cdp
    a = supervisor_registry.get_or_start(task_id="pytest-idem", cdp_url=cdp_url)
    b = supervisor_registry.get_or_start(task_id="pytest-idem", cdp_url=cdp_url)
    assert a is b


def test_registry_stop(chrome_cdp, supervisor_registry):
    """stop() tears down the supervisor and snapshot reports inactive."""
    cdp_url, _port = chrome_cdp
    supervisor = supervisor_registry.get_or_start(task_id="pytest-stop", cdp_url=cdp_url)
    assert supervisor.snapshot().active is True
    supervisor_registry.stop("pytest-stop")
    # Post-stop snapshot reports inactive; supervisor obj may still exist
    assert supervisor.snapshot().active is False


def test_browser_dialog_tool_no_supervisor():
    """browser_dialog returns a clear error when no supervisor is attached."""
    from tools.browser_dialog_tool import browser_dialog

    r = json.loads(browser_dialog(action="accept", task_id="nonexistent-task"))
    assert r["success"] is False
    assert "No CDP supervisor" in r["error"]


def test_browser_dialog_invalid_action(chrome_cdp, supervisor_registry):
    """browser_dialog rejects actions that aren't accept/dismiss."""
    from tools.browser_dialog_tool import browser_dialog

    cdp_url, _port = chrome_cdp
    supervisor_registry.get_or_start(task_id="pytest-bad-action", cdp_url=cdp_url)
    r = json.loads(browser_dialog(action="eat", task_id="pytest-bad-action"))
    assert r["success"] is False
    assert "accept" in r["error"] and "dismiss" in r["error"]


def test_recent_dialogs_ring_buffer(chrome_cdp, supervisor_registry):
    """Closed dialogs show up in recent_dialogs with a closed_by tag."""
    from tools.browser_supervisor import DIALOG_POLICY_AUTO_DISMISS

    cdp_url, _port = chrome_cdp
    sv = supervisor_registry.get_or_start(
        task_id="pytest-recent",
        cdp_url=cdp_url,
        dialog_policy=DIALOG_POLICY_AUTO_DISMISS,
    )
    _fire_on_page(cdp_url, "setTimeout(() => alert('PYTEST-RECENT'), 50)")
    # Wait for auto-dismiss to cycle the dialog through
    deadline = time.time() + 5
    while time.time() < deadline:
        recent = sv.snapshot().recent_dialogs
        if recent and any("PYTEST-RECENT" in r.message for r in recent):
            break
        time.sleep(0.1)
    recent = sv.snapshot().recent_dialogs
    assert recent, "recent_dialogs should contain the auto-dismissed dialog"
    match = next((r for r in recent if "PYTEST-RECENT" in r.message), None)
    assert match is not None
    assert match.type == "alert"
    assert match.closed_by == "auto_policy"
    assert match.closed_at >= match.opened_at


def test_browser_dialog_tool_end_to_end(chrome_cdp, supervisor_registry):
    """Full agent-path check: fire an alert, call the tool handler directly."""
    from tools.browser_dialog_tool import browser_dialog

    cdp_url, _port = chrome_cdp
    supervisor = supervisor_registry.get_or_start(task_id="pytest-tool", cdp_url=cdp_url)
    _fire_on_page(cdp_url, "setTimeout(() => alert('PYTEST-TOOL-END2END'), 50)")
    assert _wait_for_dialog(supervisor), "no dialog detected via wait_for_dialog"
    r = json.loads(browser_dialog(action="dismiss", task_id="pytest-tool"))
    assert r["success"] is True
    assert r["action"] == "dismiss"
    assert "PYTEST-TOOL-END2END" in r["dialog"]["message"]


def test_browser_cdp_frame_id_routes_via_supervisor(chrome_cdp, supervisor_registry, monkeypatch):
    """browser_cdp(frame_id=...) routes Runtime.evaluate through supervisor.

    Mocks the supervisor with a known frame and verifies browser_cdp sends
    the call via the supervisor's loop rather than opening a stateless
    WebSocket. This is the path that makes cross-origin iframe eval work
    on Browserbase.
    """
    cdp_url, _port = chrome_cdp
    sv = supervisor_registry.get_or_start(task_id="frame-id-test", cdp_url=cdp_url)
    assert sv.snapshot().active
    # Inject a fake OOPIF frame pointing at the SUPERVISOR's own page session
    # so we can verify routing. We fake is_oopif=True so the code path
    # treats it as an OOPIF child.
    import tools.browser_supervisor as _bs

    with sv._state_lock:
        fake_frame_id = "FAKE-FRAME-001"
        sv._frames[fake_frame_id] = _bs.FrameInfo(
            frame_id=fake_frame_id,
            url="fake://",
            origin="",
            parent_frame_id=None,
            is_oopif=True,
            cdp_session_id=sv._page_session_id,  # route at page scope
        )
    # Route the tool through the supervisor. Should succeed and return
    # something that clearly came from CDP.
    from tools.browser_cdp_tool import browser_cdp

    result = browser_cdp(
        method="Runtime.evaluate",
        params={"expression": "1 + 1", "returnByValue": True},
        frame_id=fake_frame_id,
        task_id="frame-id-test",
    )
    r = json.loads(result)
    assert r.get("success") is True, f"expected success, got: {r}"
    assert r.get("frame_id") == fake_frame_id
    assert r.get("session_id") == sv._page_session_id
    value = r.get("result", {}).get("result", {}).get("value")
    assert value == 2, f"expected 2, got {value!r}"


def test_browser_cdp_frame_id_real_oopif_smoke_documented():
    """Document that real-OOPIF E2E was manually verified; see PR #14540.

    A pytest version of this hits an asyncio version-quirk in the venv
    (3.11) that doesn't show up in standalone scripts (3.13 + system
    websockets). The mechanism IS verified end-to-end by two separate
    smoke scripts in /tmp/dialog-iframe-test/:

    * smoke_local_oopif.py: local Chrome + 2 http servers on
      different hostnames + --site-per-process. Outer page on
      localhost:18905, iframe src=http://127.0.0.1:18906. Calls
      browser_cdp(method='Runtime.evaluate', frame_id=<OOPIF>) and
      verifies inner page's title comes back from the OOPIF session.
      PASSED on 2026-04-23: iframe document.title = 'INNER-FRAME-XYZ'
    * smoke_bb_iframe_agent_path.py: Browserbase + real cross-origin
      iframe (src=https://example.com/). Same browser_cdp(frame_id=)
      path. PASSED on 2026-04-23: iframe document.title =
      'Example Domain'

    The test_browser_cdp_frame_id_routes_via_supervisor pytest covers
    the supervisor-routing plumbing with a fake injected OOPIF.
    """
    pytest.skip(
        "Real-OOPIF E2E verified manually with smoke_local_oopif.py and "
        "smoke_bb_iframe_agent_path.py; pytest version hits an asyncio "
        "version quirk between venv (3.11) and standalone (3.13). "
        "Smoke logs preserved in /tmp/dialog-iframe-test/."
    )


def test_browser_cdp_frame_id_missing_supervisor():
    """browser_cdp(frame_id=...) errors cleanly when no supervisor is attached."""
    from tools.browser_cdp_tool import browser_cdp

    result = browser_cdp(
        method="Runtime.evaluate",
        params={"expression": "1"},
        frame_id="any-frame-id",
        task_id="no-such-task",
    )
    r = json.loads(result)
    assert r.get("success") is not True
    assert "supervisor" in (r.get("error") or "").lower()


def test_browser_cdp_frame_id_not_in_frame_tree(chrome_cdp, supervisor_registry):
    """browser_cdp(frame_id=...) errors when the frame_id isn't known."""
    cdp_url, _port = chrome_cdp
    sv = supervisor_registry.get_or_start(task_id="bad-frame-test", cdp_url=cdp_url)
    assert sv.snapshot().active
    from tools.browser_cdp_tool import browser_cdp

    result = browser_cdp(
        method="Runtime.evaluate",
        params={"expression": "1"},
        frame_id="nonexistent-frame",
        task_id="bad-frame-test",
    )
    r = json.loads(result)
    assert r.get("success") is not True
    assert "not found" in (r.get("error") or "").lower()
def test_bridge_captures_prompt_and_returns_reply_text(chrome_cdp, supervisor_registry):
    """End-to-end: agent's prompt_text round-trips INTO the page's JS.

    Proves the bridge isn't just catching dialogs — it's properly round-
    tripping our reply back into the page via Fetch.fulfillRequest, so
    ``prompt()`` actually returns the agent-supplied string to the page.
    """
    import base64 as _b64

    cdp_url, _port = chrome_cdp
    sv = supervisor_registry.get_or_start(task_id="pytest-bridge-prompt", cdp_url=cdp_url)

    # Page fires prompt and stashes the return value on window.
    html = """<!doctype html><html><body><script>
window.__ret = null;
setTimeout(() => { window.__ret = prompt('PROMPT-MSG', 'default'); }, 50);
</script></body></html>"""
    url = "data:text/html;base64," + _b64.b64encode(html.encode()).decode()

    import asyncio as _asyncio
    import websockets as _ws_mod

    async def nav_and_read():
        async with _ws_mod.connect(cdp_url, max_size=50 * 1024 * 1024) as ws:
            nid = [1]
            pending: dict = {}

            async def reader_fn():
                try:
                    async for raw in ws:
                        m = json.loads(raw)
                        if "id" in m:
                            fut = pending.pop(m["id"], None)
                            if fut and not fut.done():
                                fut.set_result(m)
                except Exception:
                    pass

            rd = _asyncio.create_task(reader_fn())

            async def call(method, params=None, sid=None):
                c = nid[0]; nid[0] += 1
                p = {"id": c, "method": method}
                if params: p["params"] = params
                if sid: p["sessionId"] = sid
                fut = _asyncio.get_event_loop().create_future()
                pending[c] = fut
                await ws.send(json.dumps(p))
                return await _asyncio.wait_for(fut, timeout=20)

            try:
                t = (await call("Target.getTargets"))["result"]["targetInfos"]
                pg = next(x for x in t if x.get("type") == "page")
                a = await call("Target.attachToTarget", {"targetId": pg["targetId"], "flatten": True})
                sid = a["result"]["sessionId"]

                # Fire navigate but don't await — prompt() blocks the page
                nav_id = nid[0]; nid[0] += 1
                nav_fut = _asyncio.get_event_loop().create_future()
                pending[nav_id] = nav_fut
                await ws.send(json.dumps({"id": nav_id, "method": "Page.navigate", "params": {"url": url}, "sessionId": sid}))

                # Wait for supervisor to see the prompt
                deadline = time.monotonic() + 10
                dialog = None
                while time.monotonic() < deadline:
                    snap = sv.snapshot()
                    if snap.pending_dialogs:
                        dialog = snap.pending_dialogs[0]
                        break
                    await _asyncio.sleep(0.05)
                assert dialog is not None, "no dialog captured"
                assert dialog.bridge_request_id is not None, "expected bridge path"
                assert dialog.type == "prompt"

                # Agent responds
                resp = sv.respond_to_dialog("accept", prompt_text="AGENT-SUPPLIED-REPLY")
                assert resp["ok"] is True

                # Wait for nav to complete + read back
                try:
                    await _asyncio.wait_for(nav_fut, timeout=10)
                except Exception:
                    pass
                await _asyncio.sleep(0.5)
                r = await call(
                    "Runtime.evaluate",
                    {"expression": "window.__ret", "returnByValue": True},
                    sid=sid,
                )
                return r.get("result", {}).get("result", {}).get("value")
            finally:
                rd.cancel()
                try: await rd
                except BaseException: pass

    value = asyncio.run(nav_and_read())
    assert value == "AGENT-SUPPLIED-REPLY", f"expected AGENT-SUPPLIED-REPLY, got {value!r}"


@ -188,10 +188,116 @@ async def _cdp_call(
# ---------------------------------------------------------------------------
def _browser_cdp_via_supervisor(
    task_id: str,
    frame_id: str,
    method: str,
    params: Optional[Dict[str, Any]],
    timeout: float,
) -> str:
    """Route a CDP call through the live supervisor session for an OOPIF frame.

    Looks up the frame in the supervisor's snapshot, extracts its child
    ``cdp_session_id``, and dispatches ``method`` with that sessionId via
    the supervisor's already-connected WebSocket (using
    ``asyncio.run_coroutine_threadsafe`` onto the supervisor loop).
    """
    try:
        from tools.browser_supervisor import SUPERVISOR_REGISTRY  # type: ignore[import-not-found]
    except Exception as exc:  # pragma: no cover — defensive
        return tool_error(
            f"CDP supervisor is not available: {exc}. frame_id routing requires "
            f"a running supervisor attached via /browser connect or an active "
            f"Browserbase session."
        )

    supervisor = SUPERVISOR_REGISTRY.get(task_id)
    if supervisor is None:
        return tool_error(
            f"No CDP supervisor is attached for task={task_id!r}. Call "
            f"browser_navigate or /browser connect first so the supervisor "
            f"can attach. Once attached, browser_snapshot will populate "
            f"frame_tree with frame_ids you can pass here."
        )

    snap = supervisor.snapshot()
    # Search both the top frame and the children for the requested id.
    top = snap.frame_tree.get("top")
    frame_info: Optional[Dict[str, Any]] = None
    if top and top.get("frame_id") == frame_id:
        frame_info = top
    else:
        for child in snap.frame_tree.get("children", []) or []:
            if child.get("frame_id") == frame_id:
                frame_info = child
                break
    if frame_info is None:
        # Check the raw frames dict too (frame_tree is capped at 30 entries)
        with supervisor._state_lock:  # type: ignore[attr-defined]
            raw = supervisor._frames.get(frame_id)  # type: ignore[attr-defined]
        if raw is not None:
            frame_info = raw.to_dict()
    if frame_info is None:
        return tool_error(
            f"frame_id {frame_id!r} not found in supervisor state. "
            f"Call browser_snapshot to see current frame_tree."
        )

    child_sid = frame_info.get("session_id")
    if not child_sid:
        # Not an OOPIF: same-origin iframes don't get their own sessionId,
        # so there is nothing to route through. Return an error that points
        # the agent at contentWindow/contentDocument from the parent page.
        return tool_error(
            f"frame_id {frame_id!r} is not an out-of-process iframe (no "
            f"dedicated CDP session). For same-origin iframes, use "
            f"`browser_cdp(method='Runtime.evaluate', params={{'expression': "
            f"\"document.querySelector('iframe').contentDocument.title\"}})` "
            f"at the top-level page instead."
        )

    # Dispatch onto the supervisor's loop.
    import asyncio as _asyncio

    loop = supervisor._loop  # type: ignore[attr-defined]
    if loop is None or not loop.is_running():
        return tool_error(
            "CDP supervisor loop is not running. Try reconnecting with "
            "/browser connect."
        )

    async def _do_cdp():
        return await supervisor._cdp(  # type: ignore[attr-defined]
            method,
            params or {},
            session_id=child_sid,
            timeout=timeout,
        )

    try:
        fut = _asyncio.run_coroutine_threadsafe(_do_cdp(), loop)
        result_msg = fut.result(timeout=timeout + 2)
    except Exception as exc:
        return tool_error(
            f"CDP call via supervisor failed: {type(exc).__name__}: {exc}",
            cdp_docs=CDP_DOCS_URL,
        )

    payload: Dict[str, Any] = {
        "success": True,
        "method": method,
        "frame_id": frame_id,
        "session_id": child_sid,
        "result": result_msg.get("result", {}),
    }
    return json.dumps(payload, ensure_ascii=False)

def browser_cdp(
    method: str,
    params: Optional[Dict[str, Any]] = None,
    target_id: Optional[str] = None,
    frame_id: Optional[str] = None,
    timeout: float = 30.0,
    task_id: Optional[str] = None,
) -> str:
@ -202,16 +308,34 @@ def browser_cdp(
params: Method-specific parameters; defaults to ``{}``. params: Method-specific parameters; defaults to ``{}``.
target_id: Optional target/tab ID for page-level methods. When set, target_id: Optional target/tab ID for page-level methods. When set,
we first attach to the target (``flatten=True``) and send we first attach to the target (``flatten=True``) and send
``method`` with the resulting ``sessionId``. ``method`` with the resulting ``sessionId``. Uses a fresh
stateless CDP connection.
frame_id: Optional cross-origin (OOPIF) iframe ``frame_id`` from
``browser_snapshot.frame_tree.children[]``. When set (and the
frame is an OOPIF with a live session tracked by the CDP
supervisor), routes the call through the supervisor's existing
WebSocket, which is how you Runtime.evaluate *inside* an
iframe on backends where per-call fresh CDP connections would
hit signed-URL expiry (Browserbase) or expensive reattach.
timeout: Seconds to wait for the call to complete. timeout: Seconds to wait for the call to complete.
task_id: Unused (tool is stateless) accepted for uniformity with task_id: Task identifier for supervisor lookup. When ``frame_id``
other browser tools. is set, this identifies which task's supervisor to use; the
handler will default to ``"default"`` otherwise.
Returns: Returns:
JSON string ``{"success": True, "method": ..., "result": {...}}`` on JSON string ``{"success": True, "method": ..., "result": {...}}`` on
success, or ``{"error": "..."}`` on failure. success, or ``{"error": "..."}`` on failure.
""" """
del task_id # unused — stateless # --- Route iframe-scoped calls through the supervisor ---------------
if frame_id:
return _browser_cdp_via_supervisor(
task_id=task_id or "default",
frame_id=frame_id,
method=method,
params=params,
timeout=timeout,
)
del task_id # stateless path below
if not method or not isinstance(method, str): if not method or not isinstance(method, str):
return tool_error( return tool_error(
@ -324,12 +448,18 @@ BROWSER_CDP_SCHEMA: Dict[str, Any] = {
"'mobile': false}, target_id=<tabId>\n\n" "'mobile': false}, target_id=<tabId>\n\n"
"**Usage rules:**\n" "**Usage rules:**\n"
"- Browser-level methods (Target.*, Browser.*, Storage.*): omit " "- Browser-level methods (Target.*, Browser.*, Storage.*): omit "
"target_id.\n" "target_id and frame_id.\n"
"- Page-level methods (Page.*, Runtime.*, DOM.*, Emulation.*, " "- Page-level methods (Page.*, Runtime.*, DOM.*, Emulation.*, "
"Network.* scoped to a tab): pass target_id from Target.getTargets.\n" "Network.* scoped to a tab): pass target_id from Target.getTargets.\n"
"- Each call is independent — sessions and event subscriptions do " "- **Cross-origin iframe scope** (Runtime.evaluate inside an OOPIF, "
"not persist between calls. For stateful workflows, prefer the " "Page.* targeting a frame target, etc.): pass frame_id from the "
"dedicated browser tools." "browser_snapshot frame_tree output. This routes through the CDP "
"supervisor's live connection — the only reliable way on "
"Browserbase where stateless CDP calls hit signed-URL expiry.\n"
"- Each stateless call (without frame_id) is independent — sessions "
"and event subscriptions do not persist between calls. For stateful "
"workflows, prefer the dedicated browser tools or use frame_id "
"routing."
), ),
"parameters": { "parameters": {
"type": "object", "type": "object",
@ -353,8 +483,24 @@ BROWSER_CDP_SCHEMA: Dict[str, Any] = {
"type": "string", "type": "string",
"description": ( "description": (
"Optional. Target/tab ID from Target.getTargets result " "Optional. Target/tab ID from Target.getTargets result "
"(each entry's 'targetId'). Required for page-level " "(each entry's 'targetId'). Use for page-level methods "
"methods; must be omitted for browser-level methods." "at the top-level tab scope. Mutually exclusive with "
"frame_id."
),
},
"frame_id": {
"type": "string",
"description": (
"Optional. Out-of-process iframe (OOPIF) frame_id from "
"browser_snapshot.frame_tree.children[] where "
"is_oopif=true. When set, routes the call through the "
"CDP supervisor's live session for that iframe. "
"Essential for Runtime.evaluate inside cross-origin "
"iframes, especially on Browserbase where fresh "
"per-call CDP connections can't keep up with signed "
"URL rotation. For same-origin iframes, use parent "
"contentWindow/contentDocument from Runtime.evaluate "
"at the top-level page instead."
), ),
}, },
"timeout": { "timeout": {
@ -408,6 +554,7 @@ registry.register(
        method=args.get("method", ""),
        params=args.get("params"),
        target_id=args.get("target_id"),
        frame_id=args.get("frame_id"),
        timeout=args.get("timeout", 30.0),
        task_id=kw.get("task_id"),
    ),


@ -0,0 +1,148 @@
"""Agent-facing tool: respond to a native JS dialog captured by the CDP supervisor.
This tool is response-only: the agent first reads ``pending_dialogs`` from
``browser_snapshot`` output, then calls ``browser_dialog(action=...)`` to
accept or dismiss.
Gated on the same ``_browser_cdp_check`` as ``browser_cdp`` so it only
appears when a CDP endpoint is reachable (Browserbase with a
``connectUrl``, local Chrome via ``/browser connect``, or
``browser.cdp_url`` set in config).
See ``website/docs/developer-guide/browser-supervisor.md`` for the full
design.
"""
from __future__ import annotations
import json
import logging
from typing import Any, Dict, Optional
from tools.browser_supervisor import SUPERVISOR_REGISTRY
from tools.registry import registry
logger = logging.getLogger(__name__)
BROWSER_DIALOG_SCHEMA: Dict[str, Any] = {
"name": "browser_dialog",
"description": (
"Respond to a native JavaScript dialog (alert / confirm / prompt / "
"beforeunload) that is currently blocking the page.\n\n"
"**Workflow:** call ``browser_snapshot`` first — if a dialog is open, "
"it appears in the ``pending_dialogs`` field with ``id``, ``type``, "
"and ``message``. Then call this tool with ``action='accept'`` or "
"``action='dismiss'``.\n\n"
"**Prompt dialogs:** pass ``prompt_text`` to supply the response "
"string. Ignored for alert/confirm/beforeunload.\n\n"
"**Multiple dialogs:** if more than one dialog is queued (rare — "
"happens when a second dialog fires while the first is still open), "
"pass ``dialog_id`` from the snapshot to disambiguate.\n\n"
"**Availability:** only present when a CDP-capable backend is "
"attached — Browserbase sessions, local Chrome via "
"``/browser connect``, or ``browser.cdp_url`` in config.yaml. "
"Not available on Camofox (REST-only) or the default Playwright "
"local browser (CDP port is hidden)."
),
"parameters": {
"type": "object",
"properties": {
"action": {
"type": "string",
"enum": ["accept", "dismiss"],
"description": (
"'accept' clicks OK / returns the prompt text. "
"'dismiss' clicks Cancel / returns null from prompt(). "
"For ``beforeunload`` dialogs: 'accept' allows the "
"navigation, 'dismiss' keeps the page."
),
},
"prompt_text": {
"type": "string",
"description": (
"Response string for a ``prompt()`` dialog. Ignored for "
"other dialog types. Defaults to empty string."
),
},
"dialog_id": {
"type": "string",
"description": (
"Specific dialog to respond to, from "
"``browser_snapshot.pending_dialogs[].id``. Required "
"only when multiple dialogs are queued."
),
},
},
"required": ["action"],
},
}
def browser_dialog(
    action: str,
    prompt_text: Optional[str] = None,
    dialog_id: Optional[str] = None,
    task_id: Optional[str] = None,
) -> str:
    """Respond to a pending dialog on the active task's CDP supervisor."""
    effective_task_id = task_id or "default"
    supervisor = SUPERVISOR_REGISTRY.get(effective_task_id)
    if supervisor is None:
        return json.dumps(
            {
                "success": False,
                "error": (
                    "No CDP supervisor is attached to this task. Either the "
                    "browser backend doesn't expose CDP (Camofox, default "
                    "Playwright) or no browser session has been started yet. "
                    "Call browser_navigate or /browser connect first."
                ),
            }
        )

    result = supervisor.respond_to_dialog(
        action=action,
        prompt_text=prompt_text,
        dialog_id=dialog_id,
    )
    if result.get("ok"):
        return json.dumps(
            {
                "success": True,
                "action": action,
                "dialog": result.get("dialog", {}),
            }
        )
    return json.dumps({"success": False, "error": result.get("error", "unknown error")})


def _browser_dialog_check() -> bool:
    """Gate: same as ``browser_cdp`` — only offered when CDP is reachable.

    Kept identical so the two tools appear and disappear together. The
    supervisor itself is started lazily by ``browser_navigate`` /
    ``/browser connect`` / Browserbase session creation, so a reachable
    CDP URL is enough to commit to showing the tool.
    """
    try:
        from tools.browser_cdp_tool import _browser_cdp_check  # type: ignore[import-not-found]
    except Exception as exc:  # pragma: no cover — defensive
        logger.debug("browser_dialog check: browser_cdp_tool import failed: %s", exc)
        return False
    return _browser_cdp_check()

registry.register(
name="browser_dialog",
toolset="browser-cdp",
schema=BROWSER_DIALOG_SCHEMA,
handler=lambda args, **kw: browser_dialog(
action=args.get("action", ""),
prompt_text=args.get("prompt_text"),
dialog_id=args.get("dialog_id"),
task_id=kw.get("task_id"),
),
check_fn=_browser_dialog_check,
emoji="💬",
)

1362
tools/browser_supervisor.py Normal file

File diff suppressed because it is too large


@ -63,7 +63,7 @@ import tempfile
import threading import threading
import time import time
import requests import requests
from typing import Dict, Any, Optional, List from typing import Dict, Any, Optional, List, Tuple
from pathlib import Path from pathlib import Path
from agent.auxiliary_client import call_llm from agent.auxiliary_client import call_llm
from hermes_constants import get_hermes_home from hermes_constants import get_hermes_home
@ -287,6 +287,100 @@ def _get_cdp_override() -> str:
return "" return ""
def _get_dialog_policy_config() -> Tuple[str, float]:
    """Read ``browser.dialog_policy`` + ``browser.dialog_timeout_s`` from config.

    Returns a ``(policy, timeout_s)`` tuple, falling back to the supervisor's
    defaults when keys are absent or invalid.
    """
    # Defer imports so browser_tool can be imported in minimal environments.
    from tools.browser_supervisor import (
        DEFAULT_DIALOG_POLICY,
        DEFAULT_DIALOG_TIMEOUT_S,
        _VALID_POLICIES,
    )

    try:
        from hermes_cli.config import read_raw_config

        cfg = read_raw_config()
        browser_cfg = cfg.get("browser", {}) if isinstance(cfg, dict) else {}
        if not isinstance(browser_cfg, dict):
            return DEFAULT_DIALOG_POLICY, DEFAULT_DIALOG_TIMEOUT_S
        policy = str(browser_cfg.get("dialog_policy") or DEFAULT_DIALOG_POLICY)
        if policy not in _VALID_POLICIES:
            logger.debug("Invalid browser.dialog_policy=%r; using default", policy)
            policy = DEFAULT_DIALOG_POLICY
        timeout_raw = browser_cfg.get("dialog_timeout_s")
        try:
            timeout_s = float(timeout_raw) if timeout_raw is not None else DEFAULT_DIALOG_TIMEOUT_S
            if timeout_s <= 0:
                timeout_s = DEFAULT_DIALOG_TIMEOUT_S
        except (TypeError, ValueError):
            timeout_s = DEFAULT_DIALOG_TIMEOUT_S
        return policy, timeout_s
    except Exception:
        return DEFAULT_DIALOG_POLICY, DEFAULT_DIALOG_TIMEOUT_S

def _ensure_cdp_supervisor(task_id: str) -> None:
    """Start a CDP supervisor for ``task_id`` if an endpoint is reachable.

    Idempotent: delegates to ``SupervisorRegistry.get_or_start`` which skips
    when a supervisor for this ``(task_id, cdp_url)`` already exists and
    tears down + restarts on URL change. Safe to call on every
    ``browser_navigate`` / ``/browser connect`` without worrying about
    double-attach.

    Resolves the CDP URL in this order:

    1. ``BROWSER_CDP_URL`` / ``browser.cdp_url``: covers ``/browser connect``
       and config-set overrides.
    2. ``_active_sessions[task_id]["cdp_url"]``: covers Browserbase + any
       other cloud provider whose ``create_session`` returns a raw CDP URL.

    Swallows all errors: failing to attach the supervisor must not break
    the browser session itself. The agent simply won't see
    ``pending_dialogs`` / ``frame_tree`` fields in snapshots.
    """
    cdp_url = _get_cdp_override()
    if not cdp_url:
        # Fallback: active session may carry a per-session CDP URL from a
        # cloud provider (Browserbase sets this).
        with _cleanup_lock:
            session_info = _active_sessions.get(task_id, {})
        maybe = str(session_info.get("cdp_url") or "")
        if maybe:
            cdp_url = _resolve_cdp_override(maybe)
    if not cdp_url:
        return
    try:
        from tools.browser_supervisor import SUPERVISOR_REGISTRY  # type: ignore[import-not-found]

        policy, timeout_s = _get_dialog_policy_config()
        SUPERVISOR_REGISTRY.get_or_start(
            task_id=task_id,
            cdp_url=cdp_url,
            dialog_policy=policy,
            dialog_timeout_s=timeout_s,
        )
    except Exception as exc:
        logger.debug(
            "CDP supervisor attach for task=%s failed (non-fatal): %s",
            task_id,
            exc,
        )


def _stop_cdp_supervisor(task_id: str) -> None:
    """Stop the CDP supervisor for ``task_id`` if one exists. No-op otherwise."""
    try:
        from tools.browser_supervisor import SUPERVISOR_REGISTRY  # type: ignore[import-not-found]

        SUPERVISOR_REGISTRY.stop(task_id)
    except Exception as exc:
        logger.debug("CDP supervisor stop for task=%s failed (non-fatal): %s", task_id, exc)

# ============================================================================
# Cloud Provider Registry
# ============================================================================
@ -996,6 +1090,11 @@ def _get_session_info(task_id: Optional[str] = None) -> Dict[str, str]:
            return _active_sessions[task_id]
        _active_sessions[task_id] = session_info

    # Lazy-start the CDP supervisor now that the session exists (if the
    # backend surfaces a CDP URL via override or session_info["cdp_url"]).
    # Idempotent; swallows errors. See _ensure_cdp_supervisor for details.
    _ensure_cdp_supervisor(task_id)
    return session_info
@ -1579,6 +1678,19 @@ def browser_snapshot(
            "element_count": len(refs) if refs else 0
        }

        # Merge supervisor state (pending dialogs + frame tree) when a CDP
        # supervisor is attached to this task. No-op otherwise. See
        # website/docs/developer-guide/browser-supervisor.md.
        try:
            from tools.browser_supervisor import SUPERVISOR_REGISTRY  # type: ignore[import-not-found]
            _supervisor = SUPERVISOR_REGISTRY.get(effective_task_id)
            if _supervisor is not None:
                _sv_snap = _supervisor.snapshot()
                if _sv_snap.active:
                    response.update(_sv_snap.to_dict())
        except Exception as _sv_exc:
            logger.debug("supervisor snapshot merge failed: %s", _sv_exc)

        return json.dumps(response, ensure_ascii=False)
    else:
        return json.dumps({
@ -2249,6 +2361,10 @@ def cleanup_browser(task_id: Optional[str] = None) -> None:
    if task_id is None:
        task_id = "default"

    # Stop the CDP supervisor for this task FIRST so we close our WebSocket
    # before the backend tears down the underlying CDP endpoint.
    _stop_cdp_supervisor(task_id)

    # Also clean up Camofox session if running in Camofox mode.
    # Skip full close when managed persistence is enabled — the browser
    # profile (and its session cookies) must survive across agent tasks.
@ -2329,6 +2445,13 @@ def cleanup_all_browsers() -> None:
    for task_id in task_ids:
        cleanup_browser(task_id)

    # Tear down CDP supervisors for all tasks so background threads exit.
    try:
        from tools.browser_supervisor import SUPERVISOR_REGISTRY  # type: ignore[import-not-found]
        SUPERVISOR_REGISTRY.stop_all()
    except Exception:
        pass

    # Reset cached lookups so they are re-evaluated on next use.
    global _cached_agent_browser, _agent_browser_resolved
    global _cached_command_timeout, _command_timeout_resolved


@ -43,7 +43,7 @@ _HERMES_CORE_TOOLS = [
"browser_navigate", "browser_snapshot", "browser_click", "browser_navigate", "browser_snapshot", "browser_click",
"browser_type", "browser_scroll", "browser_back", "browser_type", "browser_scroll", "browser_back",
"browser_press", "browser_get_images", "browser_press", "browser_get_images",
"browser_vision", "browser_console", "browser_cdp", "browser_vision", "browser_console", "browser_cdp", "browser_dialog",
# Text-to-speech # Text-to-speech
"text_to_speech", "text_to_speech",
# Planning & memory # Planning & memory
@ -115,7 +115,8 @@ TOOLSETS = {
"browser_navigate", "browser_snapshot", "browser_click", "browser_navigate", "browser_snapshot", "browser_click",
"browser_type", "browser_scroll", "browser_back", "browser_type", "browser_scroll", "browser_back",
"browser_press", "browser_get_images", "browser_press", "browser_get_images",
"browser_vision", "browser_console", "browser_cdp", "web_search" "browser_vision", "browser_console", "browser_cdp",
"browser_dialog", "web_search"
], ],
"includes": [] "includes": []
}, },
@ -249,7 +250,7 @@ TOOLSETS = {
"browser_navigate", "browser_snapshot", "browser_click", "browser_navigate", "browser_snapshot", "browser_click",
"browser_type", "browser_scroll", "browser_back", "browser_type", "browser_scroll", "browser_back",
"browser_press", "browser_get_images", "browser_press", "browser_get_images",
"browser_vision", "browser_console", "browser_cdp", "browser_vision", "browser_console", "browser_cdp", "browser_dialog",
"todo", "memory", "todo", "memory",
"session_search", "session_search",
"execute_code", "delegate_task", "execute_code", "delegate_task",
@ -274,7 +275,7 @@ TOOLSETS = {
"browser_navigate", "browser_snapshot", "browser_click", "browser_navigate", "browser_snapshot", "browser_click",
"browser_type", "browser_scroll", "browser_back", "browser_type", "browser_scroll", "browser_back",
"browser_press", "browser_get_images", "browser_press", "browser_get_images",
"browser_vision", "browser_console", "browser_cdp", "browser_vision", "browser_console", "browser_cdp", "browser_dialog",
# Planning & memory # Planning & memory
"todo", "memory", "todo", "memory",
# Session history search # Session history search


@ -0,0 +1,223 @@
# Browser CDP Supervisor — Design
**Status:** Shipped (PR 14540)
**Last updated:** 2026-04-23
**Author:** @teknium1
## Problem
Native JS dialogs (`alert`/`confirm`/`prompt`/`beforeunload`) and iframes are
the two biggest gaps in our browser tooling:
1. **Dialogs block the JS thread.** Any operation on the page stalls until the
dialog is handled. Before this work, the agent had no way to know a dialog
was open — subsequent tool calls would hang or throw opaque errors.
2. **Iframes are invisible.** The agent could see iframe nodes in the DOM
snapshot but could not click, type, or eval inside them — especially
cross-origin (OOPIF) iframes that live in separate Chromium processes.
[PR #12550](https://github.com/NousResearch/hermes-agent/pull/12550) proposed a
stateless `browser_dialog` wrapper. That doesn't solve detection — it's a
cleaner CDP call for when the agent already knows (via symptoms) that a dialog
is open. Closed as superseded.
## Backend capability matrix (verified live 2026-04-23)
Using throwaway probe scripts against a data-URL page that fires alerts in the
main frame and in a same-origin srcdoc iframe, plus a cross-origin
`https://example.com` iframe:
| Backend | Dialog detect | Dialog respond | Frame tree | OOPIF `Runtime.evaluate` via `browser_cdp(frame_id=...)` |
|---|---|---|---|---|
| Local Chrome (`--remote-debugging-port`) / `/browser connect` | ✓ | ✓ full workflow | ✓ | ✓ |
| Browserbase | ✓ (via bridge) | ✓ full workflow (via bridge) | ✓ | ✓ (`document.title = "Example Domain"` verified on real cross-origin iframe) |
| Camofox | ✗ no CDP (REST-only) | ✗ | partial via DOM snapshot | ✗ |
**How dialog response works on Browserbase.** Browserbase's CDP proxy uses Playwright
internally and auto-dismisses native dialogs within ~10ms, so
`Page.handleJavaScriptDialog` can't keep up. To work around this, the
supervisor injects a bridge script via
`Page.addScriptToEvaluateOnNewDocument` that overrides
`window.alert`/`confirm`/`prompt` with a synchronous XHR to a magic host
(`hermes-dialog-bridge.invalid`). `Fetch.enable` intercepts those XHRs
before they touch the network — the dialog becomes a `Fetch.requestPaused`
event the supervisor captures, and `respond_to_dialog` fulfills via
`Fetch.fulfillRequest` with a JSON body the injected script decodes.
Net result: from the page's perspective, `prompt()` still returns the
agent-supplied string. From the agent's perspective, it's the same
`browser_dialog(action=...)` API either way. Tested end-to-end against
real Browserbase sessions — 4/4 (alert/prompt/confirm-accept/confirm-dismiss)
pass including value round-tripping back into page JS.
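
As a rough illustration of the fulfill step described above, the sketch below builds the parameter dict for a `Fetch.fulfillRequest` call. The `build_fulfill_params` helper and the `{"accepted", "value"}` reply shape are hypothetical; the real bridge script defines its own wire format. Only the CDP field names (`requestId`, `responseCode`, `responseHeaders`, base64 `body`) come from the protocol itself.

```python
import base64
import json
from typing import Any, Dict, Optional


def build_fulfill_params(
    request_id: str, action: str, prompt_text: Optional[str] = None
) -> Dict[str, Any]:
    """Build params for CDP Fetch.fulfillRequest answering a bridged dialog.

    The reply body shape is an assumption made for this sketch.
    """
    if action not in ("accept", "dismiss"):
        raise ValueError(f"unknown action: {action!r}")
    accepted = action == "accept"
    reply = {
        "accepted": accepted,
        # prompt() yields the supplied string on accept and null on dismiss.
        "value": (prompt_text or "") if accepted else None,
    }
    body = base64.b64encode(json.dumps(reply).encode("utf-8")).decode("ascii")
    return {
        "requestId": request_id,
        "responseCode": 200,
        "responseHeaders": [
            {"name": "Content-Type", "value": "application/json"},
            # Assumed: the page-side XHR is cross-origin to the magic host.
            {"name": "Access-Control-Allow-Origin", "value": "*"},
        ],
        "body": body,  # CDP expects the response body base64-encoded
    }
```

The injected script would then decode the JSON body and return `value` from its `prompt()` override.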
Camofox stays unsupported for this PR; follow-up upstream issue planned at
`jo-inc/camofox-browser` requesting a dialog polling endpoint.
## Architecture
### CDPSupervisor
One `asyncio.Task` running in a background daemon thread per Hermes `task_id`.
Holds a persistent WebSocket to the backend's CDP endpoint. Maintains:
- **Dialog queue:** `List[PendingDialog]` with `{id, type, message, default_prompt, session_id, opened_at}`
- **Frame tree:** `Dict[frame_id, FrameInfo]` with parent relationships, URL, origin, and whether the frame is a cross-origin child session
- **Session map:** `Dict[session_id, SessionInfo]` so interaction tools can route to the right attached session for OOPIF operations
- **Recent console errors:** ring buffer of the last 50 (for PR 2 diagnostics)
Subscribes on attach:
- `Page.enable`: `javascriptDialogOpening`, `frameAttached`, `frameNavigated`, `frameDetached`
- `Runtime.enable`: `executionContextCreated`, `consoleAPICalled`, `exceptionThrown`
- `Target.setAutoAttach {autoAttach: true, flatten: true}`: surfaces child OOPIF targets; supervisor enables `Page`+`Runtime` on each
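
The attach sequence above could be expressed as raw CDP messages along these lines. This is a minimal sketch: `cdp_message` and `attach_messages` are hypothetical helpers, and `waitForDebuggerOnStart: False` is an assumed value (the protocol requires the field, but the design does not say which value the supervisor sends).

```python
import itertools
import json
from typing import Any, Dict, Iterator, List, Optional

_ids: Iterator[int] = itertools.count(1)


def cdp_message(
    method: str,
    params: Optional[Dict[str, Any]] = None,
    session_id: Optional[str] = None,
) -> str:
    """Serialize one CDP command, optionally scoped to a flattened session."""
    msg: Dict[str, Any] = {"id": next(_ids), "method": method}
    if params:
        msg["params"] = params
    if session_id:
        msg["sessionId"] = session_id
    return json.dumps(msg)


def attach_messages(session_id: Optional[str] = None) -> List[str]:
    """Commands to send once per session (page target or OOPIF child)."""
    return [
        cdp_message("Page.enable", session_id=session_id),
        cdp_message("Runtime.enable", session_id=session_id),
        cdp_message(
            "Target.setAutoAttach",
            # waitForDebuggerOnStart=False is an assumption of this sketch.
            {"autoAttach": True, "waitForDebuggerOnStart": False, "flatten": True},
            session_id=session_id,
        ),
    ]
```

With `flatten: true`, each auto-attached child arrives with its own `sessionId`, so the same three commands can be replayed for every OOPIF by passing that id.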
Thread-safe state access via a snapshot lock; tool handlers (sync) read the
frozen snapshot without awaiting.
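
The threading model can be sketched as follows, assuming a simplified state of one dialog list and one frame dict. `MiniSupervisor` and `inject_dialog` are hypothetical stand-ins, not the shipped class: the loop lives on a daemon thread, mutations happen on the loop under a lock, and sync callers either read a copied snapshot or hop onto the loop with `asyncio.run_coroutine_threadsafe`.

```python
import asyncio
import threading
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Snapshot:
    pending_dialogs: Tuple[dict, ...]
    frame_ids: Tuple[str, ...]


class MiniSupervisor:
    """Hypothetical stand-in for the supervisor's threading model."""

    def __init__(self) -> None:
        self._state_lock = threading.Lock()
        self._dialogs: List[dict] = []
        self._frames: Dict[str, dict] = {}
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    async def _on_dialog(self, dialog: dict) -> None:
        # Runs on the supervisor loop; state changes happen under the lock.
        with self._state_lock:
            self._dialogs.append(dialog)

    def inject_dialog(self, dialog: dict, timeout: float = 5.0) -> None:
        # Sync callers schedule work on the loop and block on the future.
        fut = asyncio.run_coroutine_threadsafe(self._on_dialog(dialog), self._loop)
        fut.result(timeout=timeout)

    def snapshot(self) -> Snapshot:
        # Copy under the lock so readers never observe a partial update.
        with self._state_lock:
            return Snapshot(
                pending_dialogs=tuple(dict(d) for d in self._dialogs),
                frame_ids=tuple(self._frames),
            )

    def stop(self) -> None:
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join(timeout=5)
        self._loop.close()
```

Returning an immutable copy keeps tool handlers free of `await` and means a slow reader cannot block event handling for longer than the copy itself.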
### Lifecycle
- **Start:** `SupervisorRegistry.get_or_start(task_id, cdp_url)` — called by
`browser_navigate`, Browserbase session create, `/browser connect`. Idempotent.
- **Stop:** session teardown or `/browser disconnect`. Cancels the asyncio
task, closes the WebSocket, discards state.
- **Rebind:** if the CDP URL changes (user reconnects to a new Chrome), stop
the old supervisor and start fresh — never reuse state across endpoints.
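
A minimal sketch of the start/stop/rebind rules, under the assumption that a supervisor is identified by its `(task_id, cdp_url)` pair. `MiniRegistry` and `FakeSupervisor` are hypothetical names used only for illustration.

```python
import threading
from typing import Dict, Optional


class FakeSupervisor:
    """Placeholder for a real supervisor; records only what the sketch needs."""

    def __init__(self, task_id: str, cdp_url: str) -> None:
        self.task_id = task_id
        self.cdp_url = cdp_url
        self.stopped = False

    def stop(self) -> None:
        self.stopped = True


class MiniRegistry:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._by_task: Dict[str, FakeSupervisor] = {}

    def get(self, task_id: str) -> Optional[FakeSupervisor]:
        with self._lock:
            return self._by_task.get(task_id)

    def get_or_start(self, task_id: str, cdp_url: str) -> FakeSupervisor:
        with self._lock:
            existing = self._by_task.get(task_id)
            if existing is not None and existing.cdp_url == cdp_url:
                return existing  # idempotent: same endpoint, reuse
            if existing is not None:
                existing.stop()  # rebind: never reuse state across endpoints
            fresh = FakeSupervisor(task_id, cdp_url)
            self._by_task[task_id] = fresh
            return fresh

    def stop(self, task_id: str) -> None:
        with self._lock:
            supervisor = self._by_task.pop(task_id, None)
        if supervisor is not None:
            supervisor.stop()
```

Calling `get_or_start` on every navigate is then safe: repeated calls with the same URL return the same object, and a changed URL replaces it.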
### Dialog policy
Configurable via `config.yaml` under `browser.dialog_policy`:
- **`must_respond`** (default): capture, surface in `browser_snapshot`, wait
  for explicit `browser_dialog(action=...)` call. After a 300s safety timeout
  with no response, auto-dismiss and log. Prevents a buggy agent from stalling
  forever.
- `auto_dismiss`: record and dismiss immediately; the agent sees it after the
  fact via `recent_dialogs` inside `browser_snapshot`.
- `auto_accept`: record and accept (useful for `beforeunload` where the user
  wants to navigate away cleanly).
Policy is per-task; no per-dialog overrides in v1.
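
The three policies reduce to a small decision function. The sketch below is an assumption about how such a decision could be structured, not the shipped code; `decide` is a hypothetical name, and the `closed_by` tags mirror the ones this document lists for `recent_dialogs`.

```python
from typing import Optional, Tuple

VALID_POLICIES = ("must_respond", "auto_dismiss", "auto_accept")
DEFAULT_TIMEOUT_S = 300.0


def decide(
    policy: str,
    opened_at: float,
    now: float,
    timeout_s: float = DEFAULT_TIMEOUT_S,
) -> Optional[Tuple[str, str]]:
    """Return ``(action, closed_by)`` to apply now, or ``None`` to keep waiting."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"invalid dialog policy: {policy!r}")
    if policy == "auto_dismiss":
        return ("dismiss", "auto_policy")
    if policy == "auto_accept":
        return ("accept", "auto_policy")
    # must_respond: wait for the agent, but never past the safety timeout.
    if now - opened_at >= timeout_s:
        return ("dismiss", "watchdog")
    return None
```

Passing `now` explicitly keeps the function pure, which makes the watchdog branch easy to exercise without sleeping for 300 seconds.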
## Agent surface (PR 1)
### One new tool
```
browser_dialog(action, prompt_text=None, dialog_id=None)
```
- `action="accept"` / `"dismiss"` → responds to the specified or sole pending dialog (required)
- `prompt_text=...` → text to supply to a `prompt()` dialog
- `dialog_id=...` → disambiguate when multiple dialogs queued (rare)
Tool is response-only. Agent reads pending dialogs from `browser_snapshot`
output before calling.
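
From the agent side, the read-then-respond flow might look like the sketch below. `plan_dialog_response` is a hypothetical helper that turns snapshot JSON into arguments for `browser_dialog`; answering the first queued dialog is an assumption of this example, not a documented rule.

```python
import json
from typing import Any, Dict, Optional


def plan_dialog_response(
    snapshot_json: str, action: str, prompt_text: Optional[str] = None
) -> Optional[Dict[str, Any]]:
    """Build ``browser_dialog`` arguments from a snapshot, or ``None`` if idle."""
    snap = json.loads(snapshot_json)
    pending = snap.get("pending_dialogs") or []
    if not pending:
        return None
    target = pending[0]  # assumed: respond to the first queued dialog
    args: Dict[str, Any] = {"action": action}
    if target.get("type") == "prompt" and prompt_text is not None:
        args["prompt_text"] = prompt_text
    if len(pending) > 1:
        # dialog_id is only needed to disambiguate multiple queued dialogs.
        args["dialog_id"] = target["id"]
    return args
```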
### `browser_snapshot` extension
Adds three optional fields to the existing snapshot output when a supervisor
is attached:
```json
{
"pending_dialogs": [
{"id": "d-1", "type": "alert", "message": "Hello", "opened_at": 1650000000.0}
],
"recent_dialogs": [
{"id": "d-1", "type": "alert", "message": "...", "opened_at": 1650000000.0,
"closed_at": 1650000000.1, "closed_by": "remote"}
],
"frame_tree": {
"top": {"frame_id": "FRAME_A", "url": "https://example.com/", "origin": "https://example.com"},
"children": [
{"frame_id": "FRAME_B", "url": "about:srcdoc", "is_oopif": false},
{"frame_id": "FRAME_C", "url": "https://ads.example.net/", "is_oopif": true, "session_id": "SID_C"}
],
"truncated": false
}
}
```
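- **`pending_dialogs`**: dialogs currently blocking the page's JS thread.
  The agent must call `browser_dialog(action=...)` to respond. On
  Browserbase, where the CDP proxy auto-dismisses native dialogs within
  ~10ms, entries reach this list through the bridge described above rather
  than through native dialog events.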
- **`recent_dialogs`**: ring buffer of up to 20 recently-closed dialogs with
a `closed_by` tag — `"agent"` (we responded), `"auto_policy"` (local
auto_dismiss/auto_accept), `"watchdog"` (must_respond timeout hit), or
`"remote"` (browser/backend closed it on us, e.g. Browserbase). This is
how agents on Browserbase still get visibility into what happened.
- **`frame_tree`**: frame structure including cross-origin (OOPIF) children.
Capped at 30 entries + OOPIF depth 2 to bound snapshot size on ad-heavy
pages. `truncated: true` surfaces when limits were hit; agents needing
the full tree can use `browser_cdp` with `Page.getFrameTree`.
No new tool schema surface for any of these — the agent reads the snapshot
it already requests.
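One way to back `pending_dialogs` and the 20-entry `recent_dialogs` ring buffer is a dict plus a bounded `collections.deque`. The sketch below assumes that layout; `DialogHistory` and its method names are hypothetical and are not taken from the supervisor code.

```python
import time
from collections import deque

RECENT_LIMIT = 20
CLOSED_BY = {"agent", "auto_policy", "watchdog", "remote"}


class DialogHistory:
    """Hypothetical in-memory store for pending and recently closed dialogs."""

    def __init__(self):
        self.pending = {}
        # deque(maxlen=N) silently drops the oldest record once full.
        self.recent = deque(maxlen=RECENT_LIMIT)

    def opened(self, dialog_id, dialog_type, message, now=None):
        self.pending[dialog_id] = {
            "id": dialog_id,
            "type": dialog_type,
            "message": message,
            "opened_at": time.time() if now is None else now,
        }

    def closed(self, dialog_id, closed_by, now=None):
        if closed_by not in CLOSED_BY:
            raise ValueError(f"unknown closed_by tag: {closed_by!r}")
        record = dict(self.pending.pop(dialog_id))
        record["closed_at"] = time.time() if now is None else now
        record["closed_by"] = closed_by
        self.recent.append(record)

    def snapshot_fields(self):
        return {
            "pending_dialogs": list(self.pending.values()),
            "recent_dialogs": list(self.recent),
        }
```

A real supervisor would also need a lock around these methods, since the event loop thread writes while sync callers read.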
### Availability gating
Both surfaces gate on `_browser_cdp_check` (supervisor can only run when a CDP
endpoint is reachable). On Camofox / no-backend sessions, the dialog tool is
hidden and snapshot omits the new fields — no schema bloat.
## Cross-origin iframe interaction
Extending the dialog-detect work, `browser_cdp(frame_id=...)` routes CDP
calls (notably `Runtime.evaluate`) through the supervisor's already-connected
WebSocket using the OOPIF's child `sessionId`. Agents pick frame_ids out of
`browser_snapshot.frame_tree.children[]` where `is_oopif=true` and pass them
to `browser_cdp`. For same-origin iframes (no dedicated CDP session), the
agent uses `contentWindow`/`contentDocument` from a top-level
`Runtime.evaluate` instead — supervisor surfaces an error pointing at that
fallback when `frame_id` belongs to a non-OOPIF.
On Browserbase, this is the ONLY reliable path for iframe interaction —
stateless CDP connections (opened per `browser_cdp` call) hit signed-URL
expiry, while the supervisor's long-lived connection keeps a valid session.
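As a sketch of the agent-side step described above, the helpers below pull OOPIF frame ids out of a snapshot and build the keyword arguments for a `browser_cdp` call. Both function names are hypothetical; the snapshot shape follows the `browser_snapshot` example in this document.

```python
from typing import Dict, List


def oopif_frame_ids(snapshot: Dict) -> List[str]:
    """Return frame ids of cross-origin (OOPIF) children, in snapshot order."""
    children = snapshot.get("frame_tree", {}).get("children", [])
    return [child["frame_id"] for child in children if child.get("is_oopif")]


def evaluate_in_frame_args(frame_id: str, expression: str) -> Dict:
    """Build browser_cdp keyword arguments for Runtime.evaluate in one frame."""
    return {
        "method": "Runtime.evaluate",
        "params": {"expression": expression, "returnByValue": True},
        "frame_id": frame_id,
    }


SAMPLE_SNAPSHOT = {
    "frame_tree": {
        "top": {"frame_id": "FRAME_A", "url": "https://example.com/"},
        "children": [
            {"frame_id": "FRAME_B", "url": "about:srcdoc", "is_oopif": False},
            {"frame_id": "FRAME_C", "url": "https://ads.example.net/", "is_oopif": True},
        ],
        "truncated": False,
    }
}
```

Same-origin children such as `FRAME_B` are filtered out on purpose, since they have no dedicated CDP session and are reached through `contentWindow` from the top frame instead.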
## Camofox (follow-up)
Issue planned against `jo-inc/camofox-browser` adding:
- Playwright `page.on('dialog', handler)` per session
- `GET /tabs/:tabId/dialogs` polling endpoint
- `POST /tabs/:tabId/dialogs/:id` to accept/dismiss
- Frame-tree introspection endpoint
## Files touched (PR 1)
### New
- `tools/browser_supervisor.py`: `CDPSupervisor`, `SupervisorRegistry`, `PendingDialog`, `FrameInfo`
- `tools/browser_dialog_tool.py`: `browser_dialog` tool handler
- `tests/tools/test_browser_supervisor.py`: mock CDP WebSocket server + lifecycle/state tests
- `website/docs/developer-guide/browser-supervisor.md`: this file
### Modified
- `toolsets.py` — register `browser_dialog` in `browser`, `hermes-acp`, `hermes-api-server`, core toolsets (gated on CDP reachability)
- `tools/browser_tool.py`
  - `browser_navigate` start-hook: if CDP URL resolvable, `SupervisorRegistry.get_or_start(task_id, cdp_url)`
  - `browser_snapshot` (at ~line 1536): merge supervisor state into return payload
  - `/browser connect` handler: restart supervisor with new endpoint
  - Session teardown hooks in `_cleanup_browser_session`
- `hermes_cli/config.py` — add `browser.dialog_policy` and `browser.dialog_timeout_s` to `DEFAULT_CONFIG`
- Docs: `website/docs/user-guide/features/browser.md`, `website/docs/reference/tools-reference.md`, `website/docs/reference/toolsets-reference.md`
## Non-goals
- Detection/interaction for Camofox (upstream gap; tracked separately)
- Streaming dialog/frame events live to the user (would require gateway hooks)
- Persisting dialog history across sessions (in-memory only)
- Per-iframe dialog policies (agent can express this via `dialog_id`)
- Replacing `browser_cdp` — it stays as the escape hatch for the long tail (cookies, viewport, network throttling)
## Testing
Unit tests use an asyncio mock CDP server that speaks enough of the protocol
to exercise all state transitions: attach, enable, navigate, dialog fire,
dialog dismiss, frame attach/detach, child target attach, session teardown.
Real-backend E2E (Browserbase + local Chrome) is manual; the probe scripts from
the 2026-04-23 investigation are kept in-repo under
`scripts/browser_supervisor_e2e.py` so anyone can re-verify on new backend
versions.

@ -6,9 +6,9 @@ description: "Authoritative reference for Hermes built-in tools, grouped by tool
# Built-in Tools Reference
-This page documents all 53 built-in tools in the Hermes tool registry, grouped by toolset. Availability varies by platform, credentials, and enabled toolsets.
+This page documents all 55 built-in tools in the Hermes tool registry, grouped by toolset. Availability varies by platform, credentials, and enabled toolsets.
-**Quick counts:** 11 browser tools, 4 file tools, 10 RL tools, 4 Home Assistant tools, 2 terminal tools, 2 web tools, 5 Feishu tools, and 15 standalone tools across other toolsets.
+**Quick counts:** 12 browser tools, 4 file tools, 10 RL tools, 4 Home Assistant tools, 2 terminal tools, 2 web tools, 5 Feishu tools, and 15 standalone tools across other toolsets.
:::tip MCP Tools
In addition to built-in tools, Hermes can load tools dynamically from MCP servers. MCP tools appear with a server-name prefix (e.g., `github_create_issue` for the `github` MCP server). See [MCP Integration](/docs/user-guide/features/mcp) for configuration.
@ -20,6 +20,7 @@ In addition to built-in tools, Hermes can load tools dynamically from MCP server
|------|-------------|----------------------|
| `browser_back` | Navigate back to the previous page in browser history. Requires browser_navigate to be called first. | — |
| `browser_cdp` | Send a raw Chrome DevTools Protocol (CDP) command. Escape hatch for browser operations not covered by browser_navigate, browser_click, browser_console, etc. Only available when a CDP endpoint is reachable at session start — via `/browser connect` or `browser.cdp_url` config. See https://chromedevtools.github.io/devtools-protocol/ | — |
| `browser_dialog` | Respond to a native JavaScript dialog (alert / confirm / prompt / beforeunload). Call `browser_snapshot` first — pending dialogs appear in its `pending_dialogs` field. Then call `browser_dialog(action='accept'|'dismiss')`. Same availability as `browser_cdp` (Browserbase or `/browser connect`). | — |
| `browser_click` | Click on an element identified by its ref ID from the snapshot (e.g., '@e5'). The ref IDs are shown in square brackets in the snapshot output. Requires browser_navigate and browser_snapshot to be called first. | — |
| `browser_console` | Get browser console output and JavaScript errors from the current page. Returns console.log/warn/error/info messages and uncaught JS exceptions. Use this to detect silent JavaScript errors, failed API calls, and application warnings. Requi… | — |
| `browser_get_images` | Get a list of all images on the current page with their URLs and alt text. Useful for finding images to analyze with the vision tool. Requires browser_navigate to be called first. | — |

@ -52,7 +52,7 @@ Or in-session:
| Toolset | Tools | Purpose |
|---------|-------|---------|
-| `browser` | `browser_back`, `browser_cdp`, `browser_click`, `browser_console`, `browser_get_images`, `browser_navigate`, `browser_press`, `browser_scroll`, `browser_snapshot`, `browser_type`, `browser_vision`, `web_search` | Full browser automation. Includes `web_search` as a fallback for quick lookups. `browser_cdp` is a raw CDP passthrough gated on a reachable CDP endpoint — it only appears when `/browser connect` is active or `browser.cdp_url` is set. |
+| `browser` | `browser_back`, `browser_cdp`, `browser_click`, `browser_console`, `browser_dialog`, `browser_get_images`, `browser_navigate`, `browser_press`, `browser_scroll`, `browser_snapshot`, `browser_type`, `browser_vision`, `web_search` | Full browser automation. Includes `web_search` as a fallback for quick lookups. `browser_cdp` and `browser_dialog` are gated on a reachable CDP endpoint — they only appear when `/browser connect` is active, `browser.cdp_url` is set, or a Browserbase session is active. `browser_dialog` works together with the `pending_dialogs` and `frame_tree` fields that `browser_snapshot` adds when a CDP supervisor is attached. |
| `clarify` | `clarify` | Ask the user a question when the agent needs clarification. |
| `code_execution` | `execute_code` | Run Python scripts that call Hermes tools programmatically. |
| `cronjob` | `cronjob` | Schedule and manage recurring tasks. |

@ -1240,10 +1240,26 @@ browser:
  inactivity_timeout: 120 # Seconds before auto-closing idle sessions
  command_timeout: 30 # Timeout in seconds for browser commands (screenshot, navigate, etc.)
  record_sessions: false # Auto-record browser sessions as WebM videos to ~/.hermes/browser_recordings/
  # Optional CDP override — when set, Hermes attaches directly to your own
  # Chrome (via /browser connect) rather than starting a headless browser.
  cdp_url: ""
  # Dialog supervisor — controls how native JS dialogs (alert / confirm / prompt)
  # are handled when a CDP backend is attached (Browserbase, local Chrome via
  # /browser connect). Ignored on Camofox and default local agent-browser mode.
  dialog_policy: must_respond # must_respond | auto_dismiss | auto_accept
  dialog_timeout_s: 300 # Safety auto-dismiss under must_respond (seconds)
  camofox:
    managed_persistence: false # When true, Camofox sessions persist cookies/logins across restarts
```
**Dialog policies:**
- `must_respond` (default) — capture the dialog, surface it in `browser_snapshot.pending_dialogs`, and wait for the agent to call `browser_dialog(action=...)`. After `dialog_timeout_s` seconds with no response, the dialog is auto-dismissed to prevent the page's JS thread from stalling forever.
- `auto_dismiss` — capture, dismiss immediately. The agent still sees the dialog record in `browser_snapshot.recent_dialogs` with `closed_by="auto_policy"` after the fact.
- `auto_accept` — capture, accept immediately. Useful for pages with aggressive `beforeunload` prompts.
See the [browser feature page](./features/browser.md#browser_dialog) for the full dialog workflow.
The browser toolset supports multiple providers. See the [Browser feature page](/docs/user-guide/features/browser) for details on Browserbase, Browser Use, and local Chrome CDP setup.
## Timezone

@ -355,7 +355,50 @@ browser_cdp(method="Runtime.evaluate",
browser_cdp(method="Network.getAllCookies")
```
-Browser-level methods (`Target.*`, `Browser.*`, `Storage.*`) omit `target_id`. Page-level methods (`Page.*`, `Runtime.*`, `DOM.*`, `Emulation.*`) require a `target_id` from `Target.getTargets`. Each call is independent — sessions do not persist between calls.
+Browser-level methods (`Target.*`, `Browser.*`, `Storage.*`) omit `target_id`. Page-level methods (`Page.*`, `Runtime.*`, `DOM.*`, `Emulation.*`) require a `target_id` from `Target.getTargets`. Each stateless call is independent — sessions do not persist between calls.
**Cross-origin iframes:** pass `frame_id` (from `browser_snapshot.frame_tree.children[]` where `is_oopif=true`) to route the CDP call through the supervisor's live session for that iframe. This is how `Runtime.evaluate` inside a cross-origin iframe works on Browserbase, where stateless CDP connections would hit signed-URL expiry. Example:
```
browser_cdp(
    method="Runtime.evaluate",
    params={"expression": "document.title", "returnByValue": True},
    frame_id="<frame_id from browser_snapshot>",
)
```
Same-origin iframes don't need `frame_id` — use `document.querySelector('iframe').contentDocument` from a top-level `Runtime.evaluate` instead.
### `browser_dialog`
Responds to a native JS dialog (`alert` / `confirm` / `prompt` / `beforeunload`). Before this tool existed, dialogs would silently block the page's JavaScript thread and subsequent `browser_*` calls would hang or throw; now the agent sees pending dialogs in `browser_snapshot` output and responds explicitly.
**Workflow:**
1. Call `browser_snapshot`. If a dialog is blocking the page, it shows up as `pending_dialogs: [{"id": "d-1", "type": "alert", "message": "..."}]`.
2. Call `browser_dialog(action="accept")` or `browser_dialog(action="dismiss")`. For `prompt()` dialogs, pass `prompt_text="..."` to supply the response.
3. Re-snapshot — `pending_dialogs` is empty; the page's JS thread has resumed.
**Detection happens automatically** via a persistent CDP supervisor — one WebSocket per task that subscribes to Page/Runtime/Target events. The supervisor also populates a `frame_tree` field in the snapshot so the agent can see the iframe structure of the current page, including cross-origin (OOPIF) iframes.
**Availability matrix:**
| Backend | Detection via `pending_dialogs` | Response (`browser_dialog` tool) |
|---|---|---|
| Local Chrome via `/browser connect` or `browser.cdp_url` | ✓ | ✓ full workflow |
| Browserbase | ✓ | ✓ full workflow (via injected XHR bridge) |
| Camofox / default local agent-browser | ✗ | ✗ (no CDP endpoint) |
**How it works on Browserbase.** Browserbase's CDP proxy auto-dismisses real native dialogs server-side within ~10ms, so we can't use `Page.handleJavaScriptDialog`. The supervisor injects a small script via `Page.addScriptToEvaluateOnNewDocument` that overrides `window.alert`/`confirm`/`prompt` with a synchronous XHR. We intercept those XHRs via `Fetch.enable` — the page's JS thread stays blocked on the XHR until we call `Fetch.fulfillRequest` with the agent's response. `prompt()` return values round-trip back into page JS unchanged.
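The bridge described above can be sketched as three CDP messages built in Python. This is an illustration under stated assumptions, not the shipped implementation: `BRIDGE_URL`, the JSON payload shape (`accepted`, `text`), the injected script body, and the helper names are all hypothetical. Only the CDP method and parameter names (`Page.addScriptToEvaluateOnNewDocument`, `Fetch.enable`, `Fetch.fulfillRequest`) come from the protocol.

```python
import base64
import json

# Hypothetical bridge endpoint; the real URL is not given in these docs.
BRIDGE_URL = "https://hermes-dialog-bridge.invalid/dialog"

# Injected page script (assumed shape): each dialog call blocks on a sync XHR.
_JS_TEMPLATE = """
(() => {
  const ask = (type, message, defaultValue) => {
    const xhr = new XMLHttpRequest();
    xhr.open("POST", __BRIDGE_URL__, false);  // false = synchronous
    xhr.send(JSON.stringify({type, message, defaultValue}));
    return JSON.parse(xhr.responseText);
  };
  window.alert = (m) => { ask("alert", String(m), null); };
  window.confirm = (m) => ask("confirm", String(m), null).accepted;
  window.prompt = (m, d) => {
    const r = ask("prompt", String(m), d === undefined ? "" : String(d));
    return r.accepted ? r.text : null;
  };
})();
"""


def setup_messages():
    """CDP messages that install the override and intercept bridge requests."""
    source = _JS_TEMPLATE.replace("__BRIDGE_URL__", json.dumps(BRIDGE_URL))
    return [
        {"id": 1, "method": "Page.addScriptToEvaluateOnNewDocument",
         "params": {"source": source}},
        {"id": 2, "method": "Fetch.enable",
         "params": {"patterns": [{"urlPattern": BRIDGE_URL + "*"}]}},
    ]


def fulfill_message(msg_id, request_id, accepted, text=""):
    """CDP message that unblocks the page with the agent's response."""
    body = json.dumps({"accepted": accepted, "text": text}).encode("utf-8")
    return {
        "id": msg_id,
        "method": "Fetch.fulfillRequest",
        "params": {
            "requestId": request_id,
            "responseCode": 200,
            "responseHeaders": [
                {"name": "Content-Type", "value": "application/json"},
                # Assumed necessary when the bridge URL is cross-origin to the page.
                {"name": "Access-Control-Allow-Origin", "value": "*"},
            ],
            # Fetch.fulfillRequest expects the body base64-encoded.
            "body": base64.b64encode(body).decode("ascii"),
        },
    }
```

The `requestId` would come from the `Fetch.requestPaused` event raised for the intercepted XHR; until `fulfill_message` is sent for it, the page's JS thread stays blocked.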
**Dialog policy** is configured in `config.yaml` under `browser.dialog_policy`:
| Policy | Behavior |
|--------|----------|
| `must_respond` (default) | Capture, surface in snapshot, wait for explicit `browser_dialog()` call. Safety auto-dismiss after `browser.dialog_timeout_s` (default 300s) so a buggy agent can't stall forever. |
| `auto_dismiss` | Capture, dismiss immediately. Agent still sees the dialog in `browser_snapshot.recent_dialogs` (with `closed_by="auto_policy"`) but doesn't have to act. |
| `auto_accept` | Capture, accept immediately. Useful when navigating pages with aggressive `beforeunload` prompts. |
**Frame tree** inside `browser_snapshot.frame_tree` is capped to 30 frames and OOPIF depth 2 to keep payloads bounded on ad-heavy pages. A `truncated: true` flag surfaces when limits were hit; agents needing the full tree can use `browser_cdp` with `Page.getFrameTree`.
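A capping pass of this kind might look like the sketch below. It is an assumption-laden illustration: `cap_frame_tree` is a hypothetical name, the input is assumed to be a nested dict with `frame_id` and `children` keys, and the depth limit is applied to every child frame rather than to OOPIFs only, which may differ from the supervisor's actual rule.

```python
from collections import deque


def cap_frame_tree(root, max_entries=30, max_depth=2):
    """Flatten child frames breadth-first, enforcing entry and depth limits."""
    kept = []
    truncated = False
    queue = deque((child, 1) for child in root.get("children", []))
    while queue:
        node, depth = queue.popleft()
        if depth > max_depth or len(kept) >= max_entries:
            # Anything dropped, for either reason, flips the truncated flag.
            truncated = True
            continue
        kept.append({"frame_id": node["frame_id"], "depth": depth})
        queue.extend((child, depth + 1) for child in node.get("children", []))
    return {"children": kept, "truncated": truncated}
```

Breadth-first order means that when the entry cap is hit, shallow frames are kept in preference to deeply nested ones.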
## Practical Examples