mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
feat(browser): CDP supervisor — dialog detection + response + cross-origin iframe eval (#14540)
* docs: browser CDP supervisor design (for upcoming PR)

Design doc ahead of implementation: dialog + iframe detection/interaction via a persistent CDP supervisor. Covers backend capability matrix (verified live 2026-04-23), architecture, lifecycle, policy, agent surface, PR split, non-goals, and test plan.

Supersedes #12550.

No code changes in this commit.

* feat(browser): add persistent CDP supervisor for dialog + frame detection

Single persistent CDP WebSocket per Hermes task_id that subscribes to Page/Runtime/Target events and maintains thread-safe state for pending dialogs, frame tree, and console errors.

Supervisor lives in its own daemon thread running an asyncio loop; external callers use sync API (snapshot(), respond_to_dialog()) that bridges onto the loop.

Auto-attaches to OOPIF child targets via Target.setAutoAttach{flatten:true} and enables Page+Runtime on each so iframe-origin dialogs surface through the same supervisor.

Dialog policies: must_respond (default, 300s safety timeout), auto_dismiss, auto_accept.

Frame tree capped at 30 entries + OOPIF depth 2 to keep snapshot payloads bounded on ad-heavy pages.

E2E verified against real Chrome via smoke test: detects + responds to main-frame alerts, iframe-contentWindow alerts, preserves frame tree, graceful no-dialog error path, clean shutdown.

No agent-facing tool wiring in this commit (comes next).

* feat(browser): add browser_dialog tool wired to CDP supervisor

Agent-facing response-only tool. Schema:
  action: 'accept' | 'dismiss' (required)
  prompt_text: response for prompt() dialogs (optional)
  dialog_id: disambiguate when multiple dialogs queued (optional)

Handler: SUPERVISOR_REGISTRY.get(task_id).respond_to_dialog(...)

check_fn shares _browser_cdp_check with browser_cdp so both surface and hide together. When no supervisor is attached (Camofox, default Playwright, or no browser session started yet), tool is hidden; if somehow invoked it returns a clear error pointing the agent to browser_navigate / /browser connect.

Registered in _HERMES_CORE_TOOLS and the browser / hermes-acp / hermes-api-server toolsets alongside browser_cdp.

* feat(browser): wire CDP supervisor into session lifecycle + browser_snapshot

Supervisor lifecycle:
  * _get_session_info lazy-starts the supervisor after a session row is materialized, which covers every backend code path (Browserbase, cdp_url override, /browser connect, future providers) with one hook.
  * cleanup_browser(task_id) stops the supervisor for that task first (before the backend tears down CDP).
  * cleanup_all_browsers() calls SUPERVISOR_REGISTRY.stop_all().
  * /browser connect eagerly starts the supervisor for task 'default' so the first snapshot already shows pending_dialogs.
  * /browser disconnect stops the supervisor.

CDP URL resolution for the supervisor:
  1. BROWSER_CDP_URL / browser.cdp_url override.
  2. Fallback: session_info['cdp_url'] from cloud providers (Browserbase).

browser_snapshot merges supervisor state (pending_dialogs + frame_tree) into its JSON output when a supervisor is active: the agent reads pending_dialogs from the snapshot it already requests, then calls browser_dialog to respond. No extra tool surface.

Config defaults:
  * browser.dialog_policy: 'must_respond' (new)
  * browser.dialog_timeout_s: 300 (new)

No version bump: new keys deep-merge into existing browser section.

Deadlock fix in supervisor event dispatch:
  * _on_dialog_opening and _on_target_attached used to await CDP calls while the reader was still processing an event, but only the reader can set the response Future, so the call timed out.
  * Both now fire asyncio.create_task(...) so the reader stays pumping.
  * auto_dismiss/auto_accept now actually close the dialog immediately.
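The deadlock described in the list above can be reproduced with a small, self-contained asyncio sketch. It does not use the real supervisor or a WebSocket: `FakeConnection`, `call`, `reader`, and `run` are hypothetical names, and an `asyncio.Queue` stands in for the CDP socket. It only illustrates why a handler awaited inside the reader times out, while the same handler spawned with `asyncio.create_task` completes.

```python
import asyncio


class FakeConnection:
    """Hypothetical stand-in for a CDP socket with a single reader loop."""

    def __init__(self):
        self.inbox = asyncio.Queue()   # messages "from the browser"
        self.pending = {}              # call id -> Future awaiting a response
        self.next_id = 0
        self.closed_dialogs = []
        self.tasks = set()             # keep references to spawned handlers

    async def call(self, method, timeout=0.2):
        """Send a request, then wait for the reader to deliver the response."""
        self.next_id += 1
        call_id = self.next_id
        fut = asyncio.get_running_loop().create_future()
        self.pending[call_id] = fut
        # The fake browser answers at once by queueing a response message.
        await self.inbox.put({"id": call_id, "result": {"method": method}})
        return await asyncio.wait_for(fut, timeout)

    async def on_dialog_opening(self, message):
        await self.call("Page.handleJavaScriptDialog")
        self.closed_dialogs.append(message)

    async def reader(self, spawn_handlers):
        """Only this loop resolves response futures, so it must not block on one."""
        while True:
            msg = await self.inbox.get()
            if msg is None:
                return
            if "id" in msg:
                self.pending.pop(msg["id"]).set_result(msg["result"])
            elif spawn_handlers:
                task = asyncio.create_task(self.on_dialog_opening(msg["message"]))
                self.tasks.add(task)
                task.add_done_callback(self.tasks.discard)
            else:
                # Old behaviour: the reader waits for a response only it can deliver.
                await self.on_dialog_opening(msg["message"])


async def run(spawn_handlers):
    conn = FakeConnection()
    reader = asyncio.create_task(conn.reader(spawn_handlers))
    await conn.inbox.put({"event": "Page.javascriptDialogOpening", "message": "hi"})
    await asyncio.sleep(0.05)
    await conn.inbox.put(None)  # ask the reader to stop
    try:
        await asyncio.wait_for(reader, 0.5)
    except asyncio.TimeoutError:
        pass  # the blocked reader's inner call timed out
    return conn.closed_dialogs
```

With `spawn_handlers=True` the reader keeps pumping and the dialog handler finishes; with `spawn_handlers=False` the inner call times out and no dialog is closed.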
Tests (tests/tools/test_browser_supervisor.py, 11 tests, real Chrome):
  * supervisor start/snapshot
  * main-frame alert detection + dismiss
  * iframe.contentWindow alert
  * prompt() with prompt_text reply
  * respond with no pending dialog -> clean error
  * auto_dismiss clears on event
  * registry idempotency
  * registry stop -> snapshot reports inactive
  * browser_dialog tool no-supervisor error
  * browser_dialog invalid action
  * browser_dialog end-to-end via tool handler

xdist-safe: chrome_cdp fixture uses a per-worker port. Skipped when google-chrome/chromium isn't installed.

* docs(browser): document browser_dialog tool + CDP supervisor

- user-guide/features/browser.md: new browser_dialog section with workflow, availability gate, and dialog_policy table
- reference/tools-reference.md: row for browser_dialog, tool count bumped 53 -> 54, browser tools count 11 -> 12
- reference/toolsets-reference.md: browser_dialog added to browser toolset row with note on pending_dialogs / frame_tree snapshot fields

Full design doc lives at developer-guide/browser-supervisor.md (committed earlier).

* fix(browser): reconnect loop + recent_dialogs for Browserbase visibility

Found via Browserbase E2E test that revealed two production-critical issues:

1. **Supervisor WebSocket drops when other clients disconnect.** Browserbase's CDP proxy tears down our long-lived WebSocket whenever a short-lived client (e.g. agent-browser CLI's per-command CDP connection) disconnects. Fixed with a reconnecting _run loop that re-attaches with exponential backoff on drops. _page_session_id and _child_sessions are reset on each reconnect; pending_dialogs and frames are preserved across reconnects.

2. **Browserbase auto-dismisses dialogs server-side within ~10ms.** Their Playwright-based CDP proxy dismisses alert/confirm/prompt before our Page.handleJavaScriptDialog call can respond. So pending_dialogs is empty by the time the agent reads a snapshot on Browserbase.

   Added a recent_dialogs ring buffer (capacity 20) that retains a DialogRecord for every dialog that opened, with a closed_by tag:
   * 'agent': agent called browser_dialog
   * 'auto_policy': local auto_dismiss/auto_accept fired
   * 'watchdog': must_respond timeout auto-dismissed (300s default)
   * 'remote': browser/backend closed it on us (Browserbase)

   Agents on Browserbase now see the dialog history with closed_by='remote' so they at least know a dialog fired, even though they couldn't respond.

3. **Page.javascriptDialogClosed matching bug.** The event doesn't include a 'message' field (CDP spec has only 'result' and 'userInput') but our _on_dialog_closed was matching on message. Fixed to match by session_id + oldest-first, with a safety assumption that only one dialog is in flight per session (the JS thread is blocked while a dialog is up).

Docs + tests updated:
  * browser.md: new availability matrix showing the three backends and which mode (pending / recent / response) each supports
  * developer-guide/browser-supervisor.md: three-field snapshot schema with closed_by semantics
  * test_browser_supervisor.py: +test_recent_dialogs_ring_buffer (12/12 passing against real Chrome)

E2E verified both backends:
  * Local Chrome via /browser connect: detect + respond full workflow (smoke_supervisor.py all 7 scenarios pass)
  * Browserbase: detect via recent_dialogs with closed_by='remote' (smoke_supervisor_browserbase_v2.py passes)

Camofox remains out of scope (REST-only, no CDP); tracked for upstream PR 3.

* feat(browser): XHR bridge for dialog response on Browserbase (FIXED)

Browserbase's CDP proxy auto-dismisses native JS dialogs within ~10ms, so Page.handleJavaScriptDialog calls lose the race. Solution: bypass native dialogs entirely.

The supervisor now injects Page.addScriptToEvaluateOnNewDocument with a JavaScript override for window.alert/confirm/prompt. Those overrides perform a synchronous XMLHttpRequest to a magic host ('hermes-dialog-bridge.invalid'). We intercept those XHRs via Fetch.enable with a requestStage=Request pattern.

Flow when a page calls alert('hi'):
  1. window.alert override intercepts, builds XHR GET to http://hermes-dialog-bridge.invalid/?kind=alert&message=hi
  2. Sync XHR blocks the page's JS thread (mirrors real dialog semantics)
  3. Fetch.requestPaused fires on our WebSocket; supervisor surfaces it as a pending dialog with bridge_request_id set
  4. Agent reads pending_dialogs from browser_snapshot, calls browser_dialog
  5. Supervisor calls Fetch.fulfillRequest with JSON body: {accept: true|false, prompt_text: '...', dialog_id: 'd-N'}
  6. The injected script parses the body, returns the appropriate value from the override (undefined for alert, bool for confirm, string|null for prompt)

This works identically on Browserbase AND local Chrome: no native dialog ever fires, so Browserbase's auto-dismiss has nothing to race. Dialog policies (must_respond / auto_dismiss / auto_accept) all still work.

Bridge is installed on every attached session (main page + OOPIF child sessions) so iframe dialogs are captured too.

Native-dialog path kept as a fallback for backends that don't auto-dismiss (so a page that somehow bypasses our override, e.g. iframes that load after Fetch.enable but before the init-script runs, still gets observed via Page.javascriptDialogOpening).
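Steps 3 and 5 of the bridge flow can be sketched as two small helpers. This is an illustration only: the supervisor source is suppressed in this diff, so `parse_bridge_request`, `build_fulfill_params`, the 200 status, and the response headers are assumptions rather than the shipped implementation. The query keys (`kind`, `message`), the JSON body fields, and the magic host come from the commit message; the base64-encoded `body` follows the CDP `Fetch.fulfillRequest` parameter format.

```python
import base64
import json
from urllib.parse import parse_qs, urlsplit

BRIDGE_HOST = "hermes-dialog-bridge.invalid"  # magic host named in the commit message


def parse_bridge_request(url):
    """Return {'kind', 'message'} for a bridge URL, or None for ordinary requests."""
    parts = urlsplit(url)
    if parts.hostname != BRIDGE_HOST:
        return None
    query = parse_qs(parts.query, keep_blank_values=True)
    return {
        "kind": query.get("kind", ["alert"])[0],
        "message": query.get("message", [""])[0],
    }


def build_fulfill_params(request_id, accept, prompt_text="", dialog_id="d-1"):
    """Build params for Fetch.fulfillRequest; CDP expects a base64-encoded body."""
    body = json.dumps(
        {"accept": accept, "prompt_text": prompt_text, "dialog_id": dialog_id}
    )
    return {
        "requestId": request_id,
        "responseCode": 200,
        "responseHeaders": [
            {"name": "Content-Type", "value": "application/json"},
            # Assumption: a permissive CORS header so the page script can read the reply.
            {"name": "Access-Control-Allow-Origin", "value": "*"},
        ],
        "body": base64.b64encode(body.encode("utf-8")).decode("ascii"),
    }
```

A supervisor built along these lines would call `parse_bridge_request` on the URL of each `Fetch.requestPaused` event and send the result of `build_fulfill_params` once the agent has answered.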
E2E VERIFIED:
  * Local Chrome: 13/13 pytest tests green (12 original + new test_bridge_captures_prompt_and_returns_reply_text that asserts window.__ret === 'AGENT-SUPPLIED-REPLY' after agent responds)
  * Browserbase: smoke_bb_bridge_v2.py runs 4/4 PASS:
    - alert('BB-ALERT-MSG') dismiss → page.alert_ret = undefined ✓
    - prompt('BB-PROMPT-MSG', 'default-xyz') accept with 'AGENT-REPLY' → page.prompt_ret === 'AGENT-REPLY' ✓
    - confirm('BB-CONFIRM-MSG') accept → page.confirm_ret === true ✓
    - confirm('BB-CONFIRM-MSG') dismiss → page.confirm_ret === false ✓

Docs updated in browser.md and developer-guide/browser-supervisor.md: availability matrix now shows Browserbase at full parity with local Chrome for both detection and response.

* feat(browser): cross-origin iframe interaction via browser_cdp(frame_id=...)

Adds iframe interaction to the CDP supervisor PR (was queued as PR 2).

Design: browser_cdp gets an optional frame_id parameter. When set, the tool looks up the frame in the supervisor's frame_tree, grabs its child cdp_session_id (OOPIF session), and dispatches the CDP call through the supervisor's already-connected WebSocket via run_coroutine_threadsafe.

Why not stateless: on Browserbase, each fresh browser_cdp WebSocket must re-negotiate against a signed connectUrl. The session info carries a specific URL that can expire while the supervisor's long-lived connection stays valid. Routing via the supervisor sidesteps this.

Agent workflow:
  1. browser_snapshot → frame_tree.children[] shows OOPIFs with is_oopif=true
  2. browser_cdp(method='Runtime.evaluate', frame_id=<OOPIF frame_id>, params={'expression': 'document.title', 'returnByValue': True})
  3. Supervisor dispatches the call on the OOPIF's child session

Supervisor state fixes needed along the way:
  * _on_frame_detached now skips reason='swap' (frame migrating processes)
  * _on_frame_detached also skips when the frame is an OOPIF with a live child session: Browserbase fires spurious remove events when a same-origin iframe gets promoted to OOPIF
  * _on_target_detached clears cdp_session_id but KEEPS the frame record so the agent still sees the OOPIF in frame_tree during transient session flaps

E2E VERIFIED on Browserbase (smoke_bb_iframe_agent_path.py):
  browser_cdp(method='Runtime.evaluate', params={'expression': 'document.title', 'returnByValue': True}, frame_id=<OOPIF>)
  → {'success': True, 'result': {'value': 'Example Domain'}}

The iframe is <iframe src='https://example.com/'> inside a top-level data: URL page on a real Browserbase session. The agent Runtime.evaluates INSIDE the cross-origin iframe and gets example.com's title back.

Tests (tests/tools/test_browser_supervisor.py, 16 pass total):
  * test_browser_cdp_frame_id_routes_via_supervisor: injects fake OOPIF, verifies routing via supervisor, Runtime.evaluate returns 1+1=2
  * test_browser_cdp_frame_id_missing_supervisor: clean error when no supervisor attached
  * test_browser_cdp_frame_id_not_in_frame_tree: clean error on bad frame_id

Docs (browser.md and developer-guide/browser-supervisor.md) updated with the iframe workflow, availability matrix now shows OOPIF eval as shipped for local Chrome + Browserbase.

* test(browser): real-OOPIF E2E verified manually + chrome_cdp uses --site-per-process

When asked 'did you test the iframe stuff' I had only done a mocked pytest (fake injected OOPIF) plus a Browserbase E2E. Closed the local-Chrome real-OOPIF gap by writing /tmp/dialog-iframe-test/smoke_local_oopif.py:
  * 2 http servers on different hostnames (localhost:18905 + 127.0.0.1:18906)
  * Chrome with --site-per-process so the cross-origin iframe becomes a real OOPIF in its own process
  * Navigate, find OOPIF in supervisor.frame_tree, call browser_cdp(method='Runtime.evaluate', frame_id=<OOPIF>) which routes through the supervisor's child session
  * Asserts iframe document.title === 'INNER-FRAME-XYZ' (from the inner page, retrieved via OOPIF eval)

PASSED on 2026-04-23.

Tried to embed this as a pytest but hit an asyncio version quirk between venv (3.11) and the system python (3.13): Page.navigate hangs in the pytest harness but works in standalone. Left a self-documenting skip test that points to the smoke script + describes the verification.

chrome_cdp fixture now passes --site-per-process so future iframe tests can rely on OOPIF behavior.

Result: 16 pass + 1 documented-skip = 17 tests in tests/tools/test_browser_supervisor.py.

* docs(browser): add dialog_policy + dialog_timeout_s to configuration.md, fix tool count

Pre-merge docs audit revealed two gaps:

1. user-guide/configuration.md browser config example was missing the two new dialog_* knobs. Added with a short table explaining must_respond / auto_dismiss / auto_accept semantics and a link to the feature page for the full workflow.

2. reference/tools-reference.md header said '54 built-in tools'; real count on main is 54, this branch adds browser_dialog so it's 55. Fixed the header. (browser count was already correctly bumped 11 -> 12 in the earlier docs commit.)

No code changes.
parent 0f6eabb890
commit 5a1c599412
13 changed files with 2665 additions and 24 deletions
@@ -188,10 +188,116 @@ async def _cdp_call(
 # ---------------------------------------------------------------------------


+def _browser_cdp_via_supervisor(
+    task_id: str,
+    frame_id: str,
+    method: str,
+    params: Optional[Dict[str, Any]],
+    timeout: float,
+) -> str:
+    """Route a CDP call through the live supervisor session for an OOPIF frame.
+
+    Looks up the frame in the supervisor's snapshot, extracts its child
+    ``cdp_session_id``, and dispatches ``method`` with that sessionId via
+    the supervisor's already-connected WebSocket (using
+    ``asyncio.run_coroutine_threadsafe`` onto the supervisor loop).
+    """
+    try:
+        from tools.browser_supervisor import SUPERVISOR_REGISTRY  # type: ignore[import-not-found]
+    except Exception as exc:  # pragma: no cover — defensive
+        return tool_error(
+            f"CDP supervisor is not available: {exc}. frame_id routing requires "
+            f"a running supervisor attached via /browser connect or an active "
+            f"Browserbase session."
+        )
+
+    supervisor = SUPERVISOR_REGISTRY.get(task_id)
+    if supervisor is None:
+        return tool_error(
+            f"No CDP supervisor is attached for task={task_id!r}. Call "
+            f"browser_navigate or /browser connect first so the supervisor "
+            f"can attach. Once attached, browser_snapshot will populate "
+            f"frame_tree with frame_ids you can pass here."
+        )
+
+    snap = supervisor.snapshot()
+    # Search both the top frame and the children for the requested id.
+    top = snap.frame_tree.get("top")
+    frame_info: Optional[Dict[str, Any]] = None
+    if top and top.get("frame_id") == frame_id:
+        frame_info = top
+    else:
+        for child in snap.frame_tree.get("children", []) or []:
+            if child.get("frame_id") == frame_id:
+                frame_info = child
+                break
+    if frame_info is None:
+        # Check the raw frames dict too (frame_tree is capped at 30 entries)
+        with supervisor._state_lock:  # type: ignore[attr-defined]
+            raw = supervisor._frames.get(frame_id)  # type: ignore[attr-defined]
+        if raw is not None:
+            frame_info = raw.to_dict()
+
+    if frame_info is None:
+        return tool_error(
+            f"frame_id {frame_id!r} not found in supervisor state. "
+            f"Call browser_snapshot to see current frame_tree."
+        )
+
+    child_sid = frame_info.get("session_id")
+    if not child_sid:
+        # Not an OOPIF — fall back to top-level session (evaluating at page
+        # scope). Same-origin iframes don't get their own sessionId; the
+        # agent can still use contentWindow/contentDocument from the parent.
+        return tool_error(
+            f"frame_id {frame_id!r} is not an out-of-process iframe (no "
+            f"dedicated CDP session). For same-origin iframes, use "
+            f"`browser_cdp(method='Runtime.evaluate', params={{'expression': "
+            f"\"document.querySelector('iframe').contentDocument.title\"}})` "
+            f"at the top-level page instead."
+        )
+
+    # Dispatch onto the supervisor's loop.
+    import asyncio as _asyncio
+    loop = supervisor._loop  # type: ignore[attr-defined]
+    if loop is None or not loop.is_running():
+        return tool_error(
+            "CDP supervisor loop is not running. Try reconnecting with "
+            "/browser connect."
+        )
+
+    async def _do_cdp():
+        return await supervisor._cdp(  # type: ignore[attr-defined]
+            method,
+            params or {},
+            session_id=child_sid,
+            timeout=timeout,
+        )
+
+    try:
+        fut = _asyncio.run_coroutine_threadsafe(_do_cdp(), loop)
+        result_msg = fut.result(timeout=timeout + 2)
+    except Exception as exc:
+        return tool_error(
+            f"CDP call via supervisor failed: {type(exc).__name__}: {exc}",
+            cdp_docs=CDP_DOCS_URL,
+        )
+
+    payload: Dict[str, Any] = {
+        "success": True,
+        "method": method,
+        "frame_id": frame_id,
+        "session_id": child_sid,
+        "result": result_msg.get("result", {}),
+    }
+    return json.dumps(payload, ensure_ascii=False)
+
+
 def browser_cdp(
     method: str,
     params: Optional[Dict[str, Any]] = None,
     target_id: Optional[str] = None,
+    frame_id: Optional[str] = None,
     timeout: float = 30.0,
     task_id: Optional[str] = None,
 ) -> str:
@@ -202,16 +308,34 @@ def browser_cdp(
         params: Method-specific parameters; defaults to ``{}``.
         target_id: Optional target/tab ID for page-level methods. When set,
             we first attach to the target (``flatten=True``) and send
-            ``method`` with the resulting ``sessionId``.
+            ``method`` with the resulting ``sessionId``. Uses a fresh
+            stateless CDP connection.
+        frame_id: Optional cross-origin (OOPIF) iframe ``frame_id`` from
+            ``browser_snapshot.frame_tree.children[]``. When set (and the
+            frame is an OOPIF with a live session tracked by the CDP
+            supervisor), routes the call through the supervisor's existing
+            WebSocket — which is how you Runtime.evaluate *inside* an
+            iframe on backends where per-call fresh CDP connections would
+            hit signed-URL expiry (Browserbase) or expensive reattach.
         timeout: Seconds to wait for the call to complete.
-        task_id: Unused (tool is stateless) — accepted for uniformity with
-            other browser tools.
+        task_id: Task identifier for supervisor lookup. When ``frame_id``
+            is set, this identifies which task's supervisor to use; the
+            handler will default to ``"default"`` otherwise.

     Returns:
         JSON string ``{"success": True, "method": ..., "result": {...}}`` on
         success, or ``{"error": "..."}`` on failure.
     """
-    del task_id  # unused — stateless
+    # --- Route iframe-scoped calls through the supervisor ---------------
+    if frame_id:
+        return _browser_cdp_via_supervisor(
+            task_id=task_id or "default",
+            frame_id=frame_id,
+            method=method,
+            params=params,
+            timeout=timeout,
+        )
+    del task_id  # stateless path below

     if not method or not isinstance(method, str):
         return tool_error(
@@ -324,12 +448,18 @@ BROWSER_CDP_SCHEMA: Dict[str, Any] = {
         "'mobile': false}, target_id=<tabId>\n\n"
         "**Usage rules:**\n"
         "- Browser-level methods (Target.*, Browser.*, Storage.*): omit "
-        "target_id.\n"
+        "target_id and frame_id.\n"
         "- Page-level methods (Page.*, Runtime.*, DOM.*, Emulation.*, "
         "Network.* scoped to a tab): pass target_id from Target.getTargets.\n"
-        "- Each call is independent — sessions and event subscriptions do "
-        "not persist between calls. For stateful workflows, prefer the "
-        "dedicated browser tools."
+        "- **Cross-origin iframe scope** (Runtime.evaluate inside an OOPIF, "
+        "Page.* targeting a frame target, etc.): pass frame_id from the "
+        "browser_snapshot frame_tree output. This routes through the CDP "
+        "supervisor's live connection — the only reliable way on "
+        "Browserbase where stateless CDP calls hit signed-URL expiry.\n"
+        "- Each stateless call (without frame_id) is independent — sessions "
+        "and event subscriptions do not persist between calls. For stateful "
+        "workflows, prefer the dedicated browser tools or use frame_id "
+        "routing."
     ),
     "parameters": {
         "type": "object",
@@ -353,8 +483,24 @@ BROWSER_CDP_SCHEMA: Dict[str, Any] = {
                 "type": "string",
                 "description": (
                     "Optional. Target/tab ID from Target.getTargets result "
-                    "(each entry's 'targetId'). Required for page-level "
-                    "methods; must be omitted for browser-level methods."
+                    "(each entry's 'targetId'). Use for page-level methods "
+                    "at the top-level tab scope. Mutually exclusive with "
+                    "frame_id."
                 ),
             },
+            "frame_id": {
+                "type": "string",
+                "description": (
+                    "Optional. Out-of-process iframe (OOPIF) frame_id from "
+                    "browser_snapshot.frame_tree.children[] where "
+                    "is_oopif=true. When set, routes the call through the "
+                    "CDP supervisor's live session for that iframe. "
+                    "Essential for Runtime.evaluate inside cross-origin "
+                    "iframes, especially on Browserbase where fresh "
+                    "per-call CDP connections can't keep up with signed "
+                    "URL rotation. For same-origin iframes, use parent "
+                    "contentWindow/contentDocument from Runtime.evaluate "
+                    "at the top-level page instead."
+                ),
+            },
             "timeout": {
@@ -408,6 +554,7 @@ registry.register(
         method=args.get("method", ""),
         params=args.get("params"),
         target_id=args.get("target_id"),
+        frame_id=args.get("frame_id"),
         timeout=args.get("timeout", 30.0),
         task_id=kw.get("task_id"),
     ),
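The frame_id routing added in the hunks above expects the caller to take a `frame_id` from the snapshot's `frame_tree`. A minimal caller-side sketch follows, assuming the snapshot shape implied by this diff (`children[]` entries carrying `frame_id` and `is_oopif`). `oopif_frame_ids` and `build_frame_eval_args` are hypothetical helper names, not part of the commit.

```python
def oopif_frame_ids(frame_tree):
    """Collect frame_ids of children flagged as out-of-process iframes."""
    children = frame_tree.get("children") or []
    return [
        child["frame_id"]
        for child in children
        if child.get("is_oopif") and child.get("frame_id")
    ]


def build_frame_eval_args(frame_tree, expression, index=0):
    """Build browser_cdp arguments that evaluate `expression` inside an OOPIF.

    Returns None when the snapshot has no OOPIF at the requested index.
    """
    frame_ids = oopif_frame_ids(frame_tree)
    if index >= len(frame_ids):
        return None
    return {
        "method": "Runtime.evaluate",
        "frame_id": frame_ids[index],
        "params": {"expression": expression, "returnByValue": True},
    }
```

The resulting dictionary mirrors the call shown in the commit message: `browser_cdp(method='Runtime.evaluate', frame_id=<OOPIF frame_id>, params={...})`.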
tools/browser_dialog_tool.py (normal file, 148 lines added)
@@ -0,0 +1,148 @@
+"""Agent-facing tool: respond to a native JS dialog captured by the CDP supervisor.
+
+This tool is response-only — the agent first reads ``pending_dialogs`` from
+``browser_snapshot`` output, then calls ``browser_dialog(action=...)`` to
+accept or dismiss.
+
+Gated on the same ``_browser_cdp_check`` as ``browser_cdp`` so it only
+appears when a CDP endpoint is reachable (Browserbase with a
+``connectUrl``, local Chrome via ``/browser connect``, or
+``browser.cdp_url`` set in config).
+
+See ``website/docs/developer-guide/browser-supervisor.md`` for the full
+design.
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+from typing import Any, Dict, Optional
+
+from tools.browser_supervisor import SUPERVISOR_REGISTRY
+from tools.registry import registry
+
+logger = logging.getLogger(__name__)
+
+
+BROWSER_DIALOG_SCHEMA: Dict[str, Any] = {
+    "name": "browser_dialog",
+    "description": (
+        "Respond to a native JavaScript dialog (alert / confirm / prompt / "
+        "beforeunload) that is currently blocking the page.\n\n"
+        "**Workflow:** call ``browser_snapshot`` first — if a dialog is open, "
+        "it appears in the ``pending_dialogs`` field with ``id``, ``type``, "
+        "and ``message``. Then call this tool with ``action='accept'`` or "
+        "``action='dismiss'``.\n\n"
+        "**Prompt dialogs:** pass ``prompt_text`` to supply the response "
+        "string. Ignored for alert/confirm/beforeunload.\n\n"
+        "**Multiple dialogs:** if more than one dialog is queued (rare — "
+        "happens when a second dialog fires while the first is still open), "
+        "pass ``dialog_id`` from the snapshot to disambiguate.\n\n"
+        "**Availability:** only present when a CDP-capable backend is "
+        "attached — Browserbase sessions, local Chrome via "
+        "``/browser connect``, or ``browser.cdp_url`` in config.yaml. "
+        "Not available on Camofox (REST-only) or the default Playwright "
+        "local browser (CDP port is hidden)."
+    ),
+    "parameters": {
+        "type": "object",
+        "properties": {
+            "action": {
+                "type": "string",
+                "enum": ["accept", "dismiss"],
+                "description": (
+                    "'accept' clicks OK / returns the prompt text. "
+                    "'dismiss' clicks Cancel / returns null from prompt(). "
+                    "For ``beforeunload`` dialogs: 'accept' allows the "
+                    "navigation, 'dismiss' keeps the page."
+                ),
+            },
+            "prompt_text": {
+                "type": "string",
+                "description": (
+                    "Response string for a ``prompt()`` dialog. Ignored for "
+                    "other dialog types. Defaults to empty string."
+                ),
+            },
+            "dialog_id": {
+                "type": "string",
+                "description": (
+                    "Specific dialog to respond to, from "
+                    "``browser_snapshot.pending_dialogs[].id``. Required "
+                    "only when multiple dialogs are queued."
+                ),
+            },
+        },
+        "required": ["action"],
+    },
+}
+
+
+def browser_dialog(
+    action: str,
+    prompt_text: Optional[str] = None,
+    dialog_id: Optional[str] = None,
+    task_id: Optional[str] = None,
+) -> str:
+    """Respond to a pending dialog on the active task's CDP supervisor."""
+    effective_task_id = task_id or "default"
+    supervisor = SUPERVISOR_REGISTRY.get(effective_task_id)
+    if supervisor is None:
+        return json.dumps(
+            {
+                "success": False,
+                "error": (
+                    "No CDP supervisor is attached to this task. Either the "
+                    "browser backend doesn't expose CDP (Camofox, default "
+                    "Playwright) or no browser session has been started yet. "
+                    "Call browser_navigate or /browser connect first."
+                ),
+            }
+        )
+
+    result = supervisor.respond_to_dialog(
+        action=action,
+        prompt_text=prompt_text,
+        dialog_id=dialog_id,
+    )
+    if result.get("ok"):
+        return json.dumps(
+            {
+                "success": True,
+                "action": action,
+                "dialog": result.get("dialog", {}),
+            }
+        )
+    return json.dumps({"success": False, "error": result.get("error", "unknown error")})
+
+
+def _browser_dialog_check() -> bool:
+    """Gate: same as ``browser_cdp`` — only offered when CDP is reachable.
+
+    Kept identical so the two tools appear and disappear together. The
+    supervisor itself is started lazily by ``browser_navigate`` /
+    ``/browser connect`` / Browserbase session creation, so a reachable
+    CDP URL is enough to commit to showing the tool.
+    """
+    try:
+        from tools.browser_cdp_tool import _browser_cdp_check  # type: ignore[import-not-found]
+    except Exception as exc:  # pragma: no cover — defensive
+        logger.debug("browser_dialog check: browser_cdp_tool import failed: %s", exc)
+        return False
+    return _browser_cdp_check()
+
+
+registry.register(
+    name="browser_dialog",
+    toolset="browser-cdp",
+    schema=BROWSER_DIALOG_SCHEMA,
+    handler=lambda args, **kw: browser_dialog(
+        action=args.get("action", ""),
+        prompt_text=args.get("prompt_text"),
+        dialog_id=args.get("dialog_id"),
+        task_id=kw.get("task_id"),
+    ),
+    check_fn=_browser_dialog_check,
+    emoji="💬",
+)
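The tool above is response-only, so a caller first has to pick a dialog out of `pending_dialogs`. The sketch below shows one way to turn a snapshot into `browser_dialog` arguments. It assumes the `id`, `type`, and `message` fields named in the schema description and that the first entry is the oldest dialog; `build_dialog_response` is a hypothetical helper, not part of this commit.

```python
def build_dialog_response(snapshot, accept=True, prompt_text=None):
    """Return browser_dialog arguments for the first pending dialog, or None."""
    pending = snapshot.get("pending_dialogs") or []
    if not pending:
        return None
    dialog = pending[0]  # assumption: entries are ordered oldest-first
    args = {"action": "accept" if accept else "dismiss"}
    if len(pending) > 1:
        # dialog_id is only needed to disambiguate when several dialogs are queued.
        args["dialog_id"] = dialog["id"]
    if accept and dialog.get("type") == "prompt" and prompt_text is not None:
        # prompt_text is ignored for alert/confirm/beforeunload, so only send it for prompt().
        args["prompt_text"] = prompt_text
    return args
```

For a single pending `prompt` dialog this yields `{"action": "accept", "prompt_text": ...}`, which matches the required and optional parameters of `BROWSER_DIALOG_SCHEMA`.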
tools/browser_supervisor.py (normal file, 1362 lines): file diff suppressed because it is too large
@@ -63,7 +63,7 @@ import tempfile
 import threading
 import time
 import requests
-from typing import Dict, Any, Optional, List
+from typing import Dict, Any, Optional, List, Tuple
 from pathlib import Path
 from agent.auxiliary_client import call_llm
 from hermes_constants import get_hermes_home
@@ -287,6 +287,100 @@ def _get_cdp_override() -> str:
     return ""


+def _get_dialog_policy_config() -> Tuple[str, float]:
+    """Read ``browser.dialog_policy`` + ``browser.dialog_timeout_s`` from config.
+
+    Returns a ``(policy, timeout_s)`` tuple, falling back to the supervisor's
+    defaults when keys are absent or invalid.
+    """
+    # Defer imports so browser_tool can be imported in minimal environments.
+    from tools.browser_supervisor import (
+        DEFAULT_DIALOG_POLICY,
+        DEFAULT_DIALOG_TIMEOUT_S,
+        _VALID_POLICIES,
+    )
+
+    try:
+        from hermes_cli.config import read_raw_config
+
+        cfg = read_raw_config()
+        browser_cfg = cfg.get("browser", {}) if isinstance(cfg, dict) else {}
+        if not isinstance(browser_cfg, dict):
+            return DEFAULT_DIALOG_POLICY, DEFAULT_DIALOG_TIMEOUT_S
+        policy = str(browser_cfg.get("dialog_policy") or DEFAULT_DIALOG_POLICY)
+        if policy not in _VALID_POLICIES:
+            logger.debug("Invalid browser.dialog_policy=%r; using default", policy)
+            policy = DEFAULT_DIALOG_POLICY
+        timeout_raw = browser_cfg.get("dialog_timeout_s")
+        try:
+            timeout_s = float(timeout_raw) if timeout_raw is not None else DEFAULT_DIALOG_TIMEOUT_S
+            if timeout_s <= 0:
+                timeout_s = DEFAULT_DIALOG_TIMEOUT_S
+        except (TypeError, ValueError):
+            timeout_s = DEFAULT_DIALOG_TIMEOUT_S
+        return policy, timeout_s
+    except Exception:
+        return DEFAULT_DIALOG_POLICY, DEFAULT_DIALOG_TIMEOUT_S
+
+
+def _ensure_cdp_supervisor(task_id: str) -> None:
+    """Start a CDP supervisor for ``task_id`` if an endpoint is reachable.
+
+    Idempotent — delegates to ``SupervisorRegistry.get_or_start`` which skips
+    when a supervisor for this ``(task_id, cdp_url)`` already exists and
+    tears down + restarts on URL change. Safe to call on every
+    ``browser_navigate`` / ``/browser connect`` without worrying about
+    double-attach.
+
+    Resolves the CDP URL in this order:
+      1. ``BROWSER_CDP_URL`` / ``browser.cdp_url`` — covers ``/browser connect``
+         and config-set overrides.
+      2. ``_active_sessions[task_id]["cdp_url"]`` — covers Browserbase + any
+         other cloud provider whose ``create_session`` returns a raw CDP URL.
+
+    Swallows all errors — failing to attach the supervisor must not break
+    the browser session itself. The agent simply won't see
+    ``pending_dialogs`` / ``frame_tree`` fields in snapshots.
+    """
+    cdp_url = _get_cdp_override()
+    if not cdp_url:
+        # Fallback: active session may carry a per-session CDP URL from a
+        # cloud provider (Browserbase sets this).
+        with _cleanup_lock:
+            session_info = _active_sessions.get(task_id, {})
+        maybe = str(session_info.get("cdp_url") or "")
+        if maybe:
+            cdp_url = _resolve_cdp_override(maybe)
+    if not cdp_url:
+        return
+    try:
+        from tools.browser_supervisor import SUPERVISOR_REGISTRY  # type: ignore[import-not-found]
+
+        policy, timeout_s = _get_dialog_policy_config()
+        SUPERVISOR_REGISTRY.get_or_start(
+            task_id=task_id,
+            cdp_url=cdp_url,
+            dialog_policy=policy,
+            dialog_timeout_s=timeout_s,
+        )
+    except Exception as exc:
+        logger.debug(
+            "CDP supervisor attach for task=%s failed (non-fatal): %s",
+            task_id,
+            exc,
+        )
+
+
+def _stop_cdp_supervisor(task_id: str) -> None:
+    """Stop the CDP supervisor for ``task_id`` if one exists. No-op otherwise."""
+    try:
+        from tools.browser_supervisor import SUPERVISOR_REGISTRY  # type: ignore[import-not-found]
+
+        SUPERVISOR_REGISTRY.stop(task_id)
+    except Exception as exc:
+        logger.debug("CDP supervisor stop for task=%s failed (non-fatal): %s", task_id, exc)
+
+
 # ============================================================================
 # Cloud Provider Registry
 # ============================================================================
|
|
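The fallback rules in `_get_dialog_policy_config` can be tried without a Hermes checkout. The sketch below is a standalone approximation, not the project's code: `parse_dialog_config` is a hypothetical helper, and the constant values are assumed from the commit message (`must_respond`, `auto_dismiss`, `auto_accept`, 300 s) rather than imported from `tools.browser_supervisor`.

```python
from typing import Any, Tuple

# Assumed values, taken from the commit message; the real constants live in
# tools.browser_supervisor and may differ.
DEFAULT_DIALOG_POLICY = "must_respond"
DEFAULT_DIALOG_TIMEOUT_S = 300.0
VALID_POLICIES = frozenset({"must_respond", "auto_dismiss", "auto_accept"})


def parse_dialog_config(cfg: Any) -> Tuple[str, float]:
    """Hypothetical helper mirroring the fallback rules shown in the diff."""
    browser_cfg = cfg.get("browser", {}) if isinstance(cfg, dict) else {}
    if not isinstance(browser_cfg, dict):
        return DEFAULT_DIALOG_POLICY, DEFAULT_DIALOG_TIMEOUT_S

    # Unknown or empty policy names fall back to the default.
    policy = str(browser_cfg.get("dialog_policy") or DEFAULT_DIALOG_POLICY)
    if policy not in VALID_POLICIES:
        policy = DEFAULT_DIALOG_POLICY

    # Non-numeric or non-positive timeouts fall back to the default.
    raw = browser_cfg.get("dialog_timeout_s")
    try:
        timeout_s = float(raw) if raw is not None else DEFAULT_DIALOG_TIMEOUT_S
    except (TypeError, ValueError):
        timeout_s = DEFAULT_DIALOG_TIMEOUT_S
    if timeout_s <= 0:
        timeout_s = DEFAULT_DIALOG_TIMEOUT_S
    return policy, timeout_s
```

For example, `parse_dialog_config({"browser": {"dialog_policy": "auto_accept", "dialog_timeout_s": "45"}})` yields the requested policy with a 45-second timeout, while a malformed `browser` section yields both defaults.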
@@ -995,7 +1089,12 @@ def _get_session_info(task_id: Optional[str] = None) -> Dict[str, str]:
        if task_id in _active_sessions:
            return _active_sessions[task_id]
        _active_sessions[task_id] = session_info

    # Lazy-start the CDP supervisor now that the session exists (if the
    # backend surfaces a CDP URL via override or session_info["cdp_url"]).
    # Idempotent; swallows errors. See _ensure_cdp_supervisor for details.
    _ensure_cdp_supervisor(task_id)

    return session_info
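Calling `_ensure_cdp_supervisor` on every session lookup is only safe because `get_or_start` is described as idempotent per `(task_id, cdp_url)`. The registry's implementation is not part of this hunk, so the following is a minimal sketch of that contract under stated assumptions: `SketchRegistry` and `FakeSupervisor` are hypothetical names, and the real supervisor owns a thread and a WebSocket rather than a boolean flag.

```python
import threading
from typing import Dict, Optional


class FakeSupervisor:
    """Hypothetical stand-in; only records lifecycle state."""

    def __init__(self, task_id: str, cdp_url: str) -> None:
        self.task_id = task_id
        self.cdp_url = cdp_url
        self.running = True

    def stop(self) -> None:
        self.running = False


class SketchRegistry:
    """Hypothetical registry, idempotent per (task_id, cdp_url)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._by_task: Dict[str, FakeSupervisor] = {}

    def get_or_start(self, task_id: str, cdp_url: str) -> FakeSupervisor:
        with self._lock:
            existing = self._by_task.get(task_id)
            if existing is not None and existing.cdp_url == cdp_url:
                return existing  # same endpoint: nothing to do
            if existing is not None:
                existing.stop()  # endpoint changed: tear down before restart
            sup = FakeSupervisor(task_id, cdp_url)
            self._by_task[task_id] = sup
            return sup

    def get(self, task_id: str) -> Optional[FakeSupervisor]:
        with self._lock:
            return self._by_task.get(task_id)

    def stop(self, task_id: str) -> None:
        # Stopping an unknown task is a no-op, matching the docstring of
        # _stop_cdp_supervisor in the diff.
        with self._lock:
            sup = self._by_task.pop(task_id, None)
        if sup is not None:
            sup.stop()
```

Under this contract, repeated calls with the same URL return the same object, and a changed URL replaces the old supervisor instead of attaching a second one.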
@@ -1455,7 +1554,7 @@ def browser_navigate(url: str, task_id: Optional[str] = None) -> str:
    if is_first_nav:
        session_info["_first_nav"] = False
        _maybe_start_recording(effective_task_id)

    result = _run_browser_command(effective_task_id, "open", [url], timeout=max(_get_command_timeout(), 60))

    if result.get("success"):
@@ -1578,7 +1677,20 @@ def browser_snapshot(
            "snapshot": snapshot_text,
            "element_count": len(refs) if refs else 0
        }

        # Merge supervisor state (pending dialogs + frame tree) when a CDP
        # supervisor is attached to this task. No-op otherwise. See
        # website/docs/developer-guide/browser-supervisor.md.
        try:
            from tools.browser_supervisor import SUPERVISOR_REGISTRY  # type: ignore[import-not-found]
            _supervisor = SUPERVISOR_REGISTRY.get(effective_task_id)
            if _supervisor is not None:
                _sv_snap = _supervisor.snapshot()
                if _sv_snap.active:
                    response.update(_sv_snap.to_dict())
        except Exception as _sv_exc:
            logger.debug("supervisor snapshot merge failed: %s", _sv_exc)

        return json.dumps(response, ensure_ascii=False)
    else:
        return json.dumps({
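The merge step in this hunk only adds keys to the snapshot response when the supervisor reports itself as active. A rough illustration of the resulting JSON shape follows. `SketchSnapshot` is a hypothetical stand-in for the object returned by `snapshot()`, and the dialog entry fields (`id`, `type`, `message`) are assumptions made for the example; only `active`, `to_dict()`, `pending_dialogs`, and `frame_tree` are named in the source.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SketchSnapshot:
    """Hypothetical stand-in for the supervisor snapshot object."""

    active: bool = False
    pending_dialogs: List[Dict[str, Any]] = field(default_factory=list)
    frame_tree: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        return {
            "pending_dialogs": self.pending_dialogs,
            "frame_tree": self.frame_tree,
        }


def merge_supervisor_state(response: Dict[str, Any], snap: SketchSnapshot) -> str:
    """Add supervisor fields to the response only when the supervisor is active."""
    if snap.active:
        response.update(snap.to_dict())
    return json.dumps(response, ensure_ascii=False)


base = {"success": True, "snapshot": "- button \"OK\" [ref=e1]", "element_count": 1}
dialog = {"id": "d-1", "type": "alert", "message": "hi"}  # assumed field names
merged = merge_supervisor_state(dict(base), SketchSnapshot(active=True, pending_dialogs=[dialog]))
```

With an inactive snapshot the response is returned unchanged, which matches the "No-op otherwise" comment in the hunk.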
@@ -2248,7 +2360,11 @@ def cleanup_browser(task_id: Optional[str] = None) -> None:
    """
    if task_id is None:
        task_id = "default"

    # Stop the CDP supervisor for this task FIRST so we close our WebSocket
    # before the backend tears down the underlying CDP endpoint.
    _stop_cdp_supervisor(task_id)

    # Also clean up Camofox session if running in Camofox mode.
    # Skip full close when managed persistence is enabled — the browser
    # profile (and its session cookies) must survive across agent tasks.
@@ -2329,6 +2445,13 @@ def cleanup_all_browsers() -> None:
    for task_id in task_ids:
        cleanup_browser(task_id)

    # Tear down CDP supervisors for all tasks so background threads exit.
    try:
        from tools.browser_supervisor import SUPERVISOR_REGISTRY  # type: ignore[import-not-found]
        SUPERVISOR_REGISTRY.stop_all()
    except Exception:
        pass

    # Reset cached lookups so they are re-evaluated on next use.
    global _cached_agent_browser, _agent_browser_resolved
    global _cached_command_timeout, _command_timeout_resolved