mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
* docs: browser CDP supervisor design (for upcoming PR) Design doc ahead of implementation — dialog + iframe detection/interaction via a persistent CDP supervisor. Covers backend capability matrix (verified live 2026-04-23), architecture, lifecycle, policy, agent surface, PR split, non-goals, and test plan. Supersedes #12550. No code changes in this commit. * feat(browser): add persistent CDP supervisor for dialog + frame detection Single persistent CDP WebSocket per Hermes task_id that subscribes to Page/Runtime/Target events and maintains thread-safe state for pending dialogs, frame tree, and console errors. Supervisor lives in its own daemon thread running an asyncio loop; external callers use sync API (snapshot(), respond_to_dialog()) that bridges onto the loop. Auto-attaches to OOPIF child targets via Target.setAutoAttach{flatten:true} and enables Page+Runtime on each so iframe-origin dialogs surface through the same supervisor. Dialog policies: must_respond (default, 300s safety timeout), auto_dismiss, auto_accept. Frame tree capped at 30 entries + OOPIF depth 2 to keep snapshot payloads bounded on ad-heavy pages. E2E verified against real Chrome via smoke test — detects + responds to main-frame alerts, iframe-contentWindow alerts, preserves frame tree, graceful no-dialog error path, clean shutdown. No agent-facing tool wiring in this commit (comes next). * feat(browser): add browser_dialog tool wired to CDP supervisor Agent-facing response-only tool. Schema: action: 'accept' | 'dismiss' (required) prompt_text: response for prompt() dialogs (optional) dialog_id: disambiguate when multiple dialogs queued (optional) Handler: SUPERVISOR_REGISTRY.get(task_id).respond_to_dialog(...) check_fn shares _browser_cdp_check with browser_cdp so both surface and hide together. 
When no supervisor is attached (Camofox, default Playwright, or no browser session started yet), tool is hidden; if somehow invoked it returns a clear error pointing the agent to browser_navigate / /browser connect. Registered in _HERMES_CORE_TOOLS and the browser / hermes-acp / hermes-api-server toolsets alongside browser_cdp. * feat(browser): wire CDP supervisor into session lifecycle + browser_snapshot Supervisor lifecycle: * _get_session_info lazy-starts the supervisor after a session row is materialized — covers every backend code path (Browserbase, cdp_url override, /browser connect, future providers) with one hook. * cleanup_browser(task_id) stops the supervisor for that task first (before the backend tears down CDP). * cleanup_all_browsers() calls SUPERVISOR_REGISTRY.stop_all(). * /browser connect eagerly starts the supervisor for task 'default' so the first snapshot already shows pending_dialogs. * /browser disconnect stops the supervisor. CDP URL resolution for the supervisor: 1. BROWSER_CDP_URL / browser.cdp_url override. 2. Fallback: session_info['cdp_url'] from cloud providers (Browserbase). browser_snapshot merges supervisor state (pending_dialogs + frame_tree) into its JSON output when a supervisor is active — the agent reads pending_dialogs from the snapshot it already requests, then calls browser_dialog to respond. No extra tool surface. Config defaults: * browser.dialog_policy: 'must_respond' (new) * browser.dialog_timeout_s: 300 (new) No version bump — new keys deep-merge into existing browser section. Deadlock fix in supervisor event dispatch: * _on_dialog_opening and _on_target_attached used to await CDP calls while the reader was still processing an event — but only the reader can set the response Future, so the call timed out. * Both now fire asyncio.create_task(...) so the reader stays pumping. * auto_dismiss/auto_accept now actually close the dialog immediately. 
Tests (tests/tools/test_browser_supervisor.py, 11 tests, real Chrome): * supervisor start/snapshot * main-frame alert detection + dismiss * iframe.contentWindow alert * prompt() with prompt_text reply * respond with no pending dialog -> clean error * auto_dismiss clears on event * registry idempotency * registry stop -> snapshot reports inactive * browser_dialog tool no-supervisor error * browser_dialog invalid action * browser_dialog end-to-end via tool handler xdist-safe: chrome_cdp fixture uses a per-worker port. Skipped when google-chrome/chromium isn't installed. * docs(browser): document browser_dialog tool + CDP supervisor - user-guide/features/browser.md: new browser_dialog section with workflow, availability gate, and dialog_policy table - reference/tools-reference.md: row for browser_dialog, tool count bumped 53 -> 54, browser tools count 11 -> 12 - reference/toolsets-reference.md: browser_dialog added to browser toolset row with note on pending_dialogs / frame_tree snapshot fields Full design doc lives at developer-guide/browser-supervisor.md (committed earlier). * fix(browser): reconnect loop + recent_dialogs for Browserbase visibility Found via Browserbase E2E test that revealed two production-critical issues: 1. **Supervisor WebSocket drops when other clients disconnect.** Browserbase's CDP proxy tears down our long-lived WebSocket whenever a short-lived client (e.g. agent-browser CLI's per-command CDP connection) disconnects. Fixed with a reconnecting _run loop that re-attaches with exponential backoff on drops. _page_session_id and _child_sessions are reset on each reconnect; pending_dialogs and frames are preserved across reconnects. 2. **Browserbase auto-dismisses dialogs server-side within ~10ms.** Their Playwright-based CDP proxy dismisses alert/confirm/prompt before our Page.handleJavaScriptDialog call can respond. So pending_dialogs is empty by the time the agent reads a snapshot on Browserbase. 
Added a recent_dialogs ring buffer (capacity 20) that retains a DialogRecord for every dialog that opened, with a closed_by tag: * 'agent' — agent called browser_dialog * 'auto_policy' — local auto_dismiss/auto_accept fired * 'watchdog' — must_respond timeout auto-dismissed (300s default) * 'remote' — browser/backend closed it on us (Browserbase) Agents on Browserbase now see the dialog history with closed_by='remote' so they at least know a dialog fired, even though they couldn't respond. 3. **Page.javascriptDialogClosed matching bug.** The event doesn't include a 'message' field (CDP spec has only 'result' and 'userInput') but our _on_dialog_closed was matching on message. Fixed to match by session_id + oldest-first, with a safety assumption that only one dialog is in flight per session (the JS thread is blocked while a dialog is up). Docs + tests updated: * browser.md: new availability matrix showing the three backends and which mode (pending / recent / response) each supports * developer-guide/browser-supervisor.md: three-field snapshot schema with closed_by semantics * test_browser_supervisor.py: +test_recent_dialogs_ring_buffer (12/12 passing against real Chrome) E2E verified both backends: * Local Chrome via /browser connect: detect + respond full workflow (smoke_supervisor.py all 7 scenarios pass) * Browserbase: detect via recent_dialogs with closed_by='remote' (smoke_supervisor_browserbase_v2.py passes) Camofox remains out of scope (REST-only, no CDP) — tracked for upstream PR 3. * feat(browser): XHR bridge for dialog response on Browserbase (FIXED) Browserbase's CDP proxy auto-dismisses native JS dialogs within ~10ms, so Page.handleJavaScriptDialog calls lose the race. Solution: bypass native dialogs entirely. The supervisor now injects Page.addScriptToEvaluateOnNewDocument with a JavaScript override for window.alert/confirm/prompt. Those overrides perform a synchronous XMLHttpRequest to a magic host ('hermes-dialog-bridge.invalid'). 
We intercept those XHRs via Fetch.enable with a requestStage=Request pattern. Flow when a page calls alert('hi'): 1. window.alert override intercepts, builds XHR GET to http://hermes-dialog-bridge.invalid/?kind=alert&message=hi 2. Sync XHR blocks the page's JS thread (mirrors real dialog semantics) 3. Fetch.requestPaused fires on our WebSocket; supervisor surfaces it as a pending dialog with bridge_request_id set 4. Agent reads pending_dialogs from browser_snapshot, calls browser_dialog 5. Supervisor calls Fetch.fulfillRequest with JSON body: {accept: true|false, prompt_text: '...', dialog_id: 'd-N'} 6. The injected script parses the body, returns the appropriate value from the override (undefined for alert, bool for confirm, string|null for prompt) This works identically on Browserbase AND local Chrome — no native dialog ever fires, so Browserbase's auto-dismiss has nothing to race. Dialog policies (must_respond / auto_dismiss / auto_accept) all still work. Bridge is installed on every attached session (main page + OOPIF child sessions) so iframe dialogs are captured too. Native-dialog path kept as a fallback for backends that don't auto-dismiss (so a page that somehow bypasses our override — e.g. iframes that load after Fetch.enable but before the init-script runs — still gets observed via Page.javascriptDialogOpening). 
E2E VERIFIED: * Local Chrome: 13/13 pytest tests green (12 original + new test_bridge_captures_prompt_and_returns_reply_text that asserts window.__ret === 'AGENT-SUPPLIED-REPLY' after agent responds) * Browserbase: smoke_bb_bridge_v2.py runs 4/4 PASS: - alert('BB-ALERT-MSG') dismiss → page.alert_ret = undefined ✓ - prompt('BB-PROMPT-MSG', 'default-xyz') accept with 'AGENT-REPLY' → page.prompt_ret === 'AGENT-REPLY' ✓ - confirm('BB-CONFIRM-MSG') accept → page.confirm_ret === true ✓ - confirm('BB-CONFIRM-MSG') dismiss → page.confirm_ret === false ✓ Docs updated in browser.md and developer-guide/browser-supervisor.md — availability matrix now shows Browserbase at full parity with local Chrome for both detection and response. * feat(browser): cross-origin iframe interaction via browser_cdp(frame_id=...) Adds iframe interaction to the CDP supervisor PR (was queued as PR 2). Design: browser_cdp gets an optional frame_id parameter. When set, the tool looks up the frame in the supervisor's frame_tree, grabs its child cdp_session_id (OOPIF session), and dispatches the CDP call through the supervisor's already-connected WebSocket via run_coroutine_threadsafe. Why not stateless: on Browserbase, each fresh browser_cdp WebSocket must re-negotiate against a signed connectUrl. The session info carries a specific URL that can expire while the supervisor's long-lived connection stays valid. Routing via the supervisor sidesteps this. Agent workflow: 1. browser_snapshot → frame_tree.children[] shows OOPIFs with is_oopif=true 2. browser_cdp(method='Runtime.evaluate', frame_id=<OOPIF frame_id>, params={'expression': 'document.title', 'returnByValue': True}) 3. 
Supervisor dispatches the call on the OOPIF's child session Supervisor state fixes needed along the way: * _on_frame_detached now skips reason='swap' (frame migrating processes) * _on_frame_detached also skips when the frame is an OOPIF with a live child session — Browserbase fires spurious remove events when a same-origin iframe gets promoted to OOPIF * _on_target_detached clears cdp_session_id but KEEPS the frame record so the agent still sees the OOPIF in frame_tree during transient session flaps E2E VERIFIED on Browserbase (smoke_bb_iframe_agent_path.py): browser_cdp(method='Runtime.evaluate', params={'expression': 'document.title', 'returnByValue': True}, frame_id=<OOPIF>) → {'success': True, 'result': {'value': 'Example Domain'}} The iframe is <iframe src='https://example.com/'> inside a top-level data: URL page on a real Browserbase session. The agent Runtime.evaluates INSIDE the cross-origin iframe and gets example.com's title back. Tests (tests/tools/test_browser_supervisor.py — 16 pass total): * test_browser_cdp_frame_id_routes_via_supervisor — injects fake OOPIF, verifies routing via supervisor, Runtime.evaluate returns 1+1=2 * test_browser_cdp_frame_id_missing_supervisor — clean error when no supervisor attached * test_browser_cdp_frame_id_not_in_frame_tree — clean error on bad frame_id Docs (browser.md and developer-guide/browser-supervisor.md) updated with the iframe workflow, availability matrix now shows OOPIF eval as shipped for local Chrome + Browserbase. * test(browser): real-OOPIF E2E verified manually + chrome_cdp uses --site-per-process When asked 'did you test the iframe stuff' I had only done a mocked pytest (fake injected OOPIF) plus a Browserbase E2E. 
Closed the local-Chrome real-OOPIF gap by writing /tmp/dialog-iframe-test/ smoke_local_oopif.py: * 2 http servers on different hostnames (localhost:18905 + 127.0.0.1:18906) * Chrome with --site-per-process so the cross-origin iframe becomes a real OOPIF in its own process * Navigate, find OOPIF in supervisor.frame_tree, call browser_cdp(method='Runtime.evaluate', frame_id=<OOPIF>) which routes through the supervisor's child session * Asserts iframe document.title === 'INNER-FRAME-XYZ' (from the inner page, retrieved via OOPIF eval) PASSED on 2026-04-23. Tried to embed this as a pytest but hit an asyncio version quirk between venv (3.11) and the system python (3.13) — Page.navigate hangs in the pytest harness but works in standalone. Left a self-documenting skip test that points to the smoke script + describes the verification. chrome_cdp fixture now passes --site-per-process so future iframe tests can rely on OOPIF behavior. Result: 16 pass + 1 documented-skip = 17 tests in tests/tools/test_browser_supervisor.py. * docs(browser): add dialog_policy + dialog_timeout_s to configuration.md, fix tool count Pre-merge docs audit revealed two gaps: 1. user-guide/configuration.md browser config example was missing the two new dialog_* knobs. Added with a short table explaining must_respond / auto_dismiss / auto_accept semantics and a link to the feature page for the full workflow. 2. reference/tools-reference.md header said '54 built-in tools' — real count on main is 54, this branch adds browser_dialog so it's 55. Fixed the header. (browser count was already correctly bumped 11 -> 12 in the earlier docs commit.) No code changes.
470 lines
20 KiB
Markdown
---
title: Browser Automation
description: Control browsers with multiple providers, local Chrome via CDP, or cloud browsers for web interaction, form filling, scraping, and more.
sidebar_label: Browser
sidebar_position: 5
---

# Browser Automation

Hermes Agent includes a full browser automation toolset with multiple backend options:

- **Browserbase cloud mode** via [Browserbase](https://browserbase.com) for managed cloud browsers and anti-bot tooling
- **Browser Use cloud mode** via [Browser Use](https://browser-use.com) as an alternative cloud browser provider
- **Firecrawl cloud mode** via [Firecrawl](https://firecrawl.dev) for cloud browsers with built-in scraping
- **Camofox local mode** via [Camofox](https://github.com/jo-inc/camofox-browser) for local anti-detection browsing (Firefox-based fingerprint spoofing)
- **Local Chrome via CDP** — connect browser tools to your own Chrome instance using `/browser connect`
- **Local browser mode** via the `agent-browser` CLI and a local Chromium installation

In all modes, the agent can navigate websites, interact with page elements, fill forms, and extract information.

## Overview

Pages are represented as **accessibility trees** (text-based snapshots), making them ideal for LLM agents. Interactive elements get ref IDs (like `@e1`, `@e2`) that the agent uses for clicking and typing.

Key capabilities:

- **Multi-provider cloud execution** — Browserbase, Browser Use, or Firecrawl — no local browser needed
- **Local Chrome integration** — attach to your running Chrome via CDP for hands-on browsing
- **Built-in stealth** — random fingerprints, CAPTCHA solving, residential proxies (Browserbase)
- **Session isolation** — each task gets its own browser session
- **Automatic cleanup** — inactive sessions are closed after a timeout
- **Vision analysis** — screenshot + AI analysis for visual understanding

## Setup

:::tip Nous Subscribers
If you have a paid [Nous Portal](https://portal.nousresearch.com) subscription, you can use browser automation through the **[Tool Gateway](tool-gateway.md)** without any separate API keys. Run `hermes model` or `hermes tools` to enable it.
:::

### Browserbase cloud mode

To use Browserbase-managed cloud browsers, add:

```bash
# Add to ~/.hermes/.env
BROWSERBASE_API_KEY=***
BROWSERBASE_PROJECT_ID=your-project-id-here
```

Get your credentials at [browserbase.com](https://browserbase.com).

### Browser Use cloud mode

To use Browser Use as your cloud browser provider, add:

```bash
# Add to ~/.hermes/.env
BROWSER_USE_API_KEY=***
```

Get your API key at [browser-use.com](https://browser-use.com). Browser Use provides a cloud browser via its REST API. If both Browserbase and Browser Use credentials are set, Browserbase takes priority.

### Firecrawl cloud mode

To use Firecrawl as your cloud browser provider, add:

```bash
# Add to ~/.hermes/.env
FIRECRAWL_API_KEY=fc-***
```

Get your API key at [firecrawl.dev](https://firecrawl.dev). Then select Firecrawl as your browser provider:

```bash
hermes setup tools
# → Browser Automation → Firecrawl
```

Optional settings:

```bash
# Self-hosted Firecrawl instance (default: https://api.firecrawl.dev)
FIRECRAWL_API_URL=http://localhost:3002

# Session TTL in seconds (default: 300)
FIRECRAWL_BROWSER_TTL=600
```

### Camofox local mode

[Camofox](https://github.com/jo-inc/camofox-browser) is a self-hosted Node.js server wrapping Camoufox (a Firefox fork with C++ fingerprint spoofing). It provides local anti-detection browsing without cloud dependencies.

```bash
# Install and run
git clone https://github.com/jo-inc/camofox-browser && cd camofox-browser
npm install && npm start  # downloads Camoufox (~300MB) on first run

# Or via Docker
docker run -d --network host -e CAMOFOX_PORT=9377 jo-inc/camofox-browser
```

Then set in `~/.hermes/.env`:

```bash
CAMOFOX_URL=http://localhost:9377
```

Or configure via `hermes tools` → Browser Automation → Camofox.

When `CAMOFOX_URL` is set, all browser tools automatically route through Camofox instead of Browserbase or agent-browser.

#### Persistent browser sessions

By default, each Camofox session gets a random identity — cookies and logins don't survive across agent restarts. To enable persistent browser sessions, add the following to `~/.hermes/config.yaml`:

```yaml
browser:
  camofox:
    managed_persistence: true
```

Then fully restart Hermes so the new config is picked up.

:::warning Nested path matters
Hermes reads `browser.camofox.managed_persistence`, **not** a top-level `managed_persistence`. A common mistake is writing:

```yaml
# ❌ Wrong — Hermes ignores this
managed_persistence: true
```

If the flag is placed at the wrong path, Hermes silently falls back to a random ephemeral `userId` and your login state will be lost on every session.
:::

##### What Hermes does

- Sends a deterministic profile-scoped `userId` to Camofox so the server can reuse the same Firefox profile across sessions.
- Skips server-side context destruction on cleanup, so cookies and logins survive between agent tasks.
- Scopes the `userId` to the active Hermes profile, so different Hermes profiles get different browser profiles (profile isolation).

##### What Hermes does not do

- It does not force persistence on the Camofox server. Hermes only sends a stable `userId`; the server must honor it by mapping that `userId` to a persistent Firefox profile directory.
- If your Camofox server build treats every request as ephemeral (e.g. always calls `browser.newContext()` without loading a stored profile), Hermes cannot make those sessions persist. Make sure you are running a Camofox build that implements userId-based profile persistence.

##### Verify it's working

1. Start Hermes and your Camofox server.
2. Open Google (or any login site) in a browser task and sign in manually.
3. End the browser task normally.
4. Start a new browser task.
5. Open the same site again — you should still be signed in.

If step 5 logs you out, the Camofox server isn't honoring the stable `userId`. Double-check your config path, confirm you fully restarted Hermes after editing `config.yaml`, and verify your Camofox server version supports persistent per-user profiles.

##### Where state lives

Hermes derives the stable `userId` from the profile-scoped directory `~/.hermes/browser_auth/camofox/` (or the equivalent under `$HERMES_HOME` for non-default profiles). The actual browser profile data lives on the Camofox server side, keyed by that `userId`. To fully reset a persistent profile, clear it on the Camofox server and remove the corresponding Hermes profile's state directory.

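One way to picture a deterministic, profile-scoped `userId` is to hash the profile's state directory. The sketch below is illustrative only: the function name `stable_user_id`, the `hermes-` prefix, and the use of SHA-256 are assumptions, not the derivation Hermes actually uses.

```python
import hashlib
from pathlib import Path


def stable_user_id(hermes_home: str) -> str:
    """Derive a repeatable id from a profile-scoped directory (hypothetical scheme)."""
    state_dir = Path(hermes_home) / "browser_auth" / "camofox"
    digest = hashlib.sha256(state_dir.as_posix().encode("utf-8")).hexdigest()
    # Same directory gives the same id; a different Hermes profile gives a different id.
    return f"hermes-{digest[:16]}"
```

The property that matters is stability: two browser tasks under the same Hermes profile produce the same value, so a Camofox server that maps `userId` to a profile directory can reuse it.
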
#### VNC live view

When Camofox runs in headed mode (with a visible browser window), it exposes a VNC port in its health check response. Hermes automatically discovers this and includes the VNC URL in navigation responses, so the agent can share a link for you to watch the browser live.

### Local Chrome via CDP (`/browser connect`)

Instead of a cloud provider, you can attach Hermes browser tools to your own running Chrome instance via the Chrome DevTools Protocol (CDP). This is useful when you want to see what the agent is doing in real-time, interact with pages that require your own cookies/sessions, or avoid cloud browser costs.

:::note
`/browser connect` is an **interactive-CLI slash command** — it is not dispatched by the gateway. If you try to run it inside a WebUI, Telegram, Discord, or other gateway chat, the message will be sent to the agent as plain text and the command will not execute. Start Hermes from the terminal (`hermes` or `hermes chat`) and issue `/browser connect` there.
:::

In the CLI, use:

```
/browser connect              # Connect to Chrome at ws://localhost:9222
/browser connect ws://host:port  # Connect to a specific CDP endpoint
/browser status               # Check current connection
/browser disconnect           # Detach and return to cloud/local mode
```

If Chrome isn't already running with remote debugging, Hermes will attempt to auto-launch it with `--remote-debugging-port=9222`.

:::tip
To start Chrome manually with CDP enabled, use a dedicated user-data-dir so the debug port actually comes up even if Chrome is already running with your normal profile:

```bash
# Linux
google-chrome \
  --remote-debugging-port=9222 \
  --user-data-dir=$HOME/.hermes/chrome-debug \
  --no-first-run \
  --no-default-browser-check &

# macOS
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" \
  --remote-debugging-port=9222 \
  --user-data-dir="$HOME/.hermes/chrome-debug" \
  --no-first-run \
  --no-default-browser-check &
```

Then launch the Hermes CLI and run `/browser connect`.

**Why `--user-data-dir`?** Without it, launching Chrome while a regular Chrome instance is already running typically opens a new window on the existing process — and that existing process was not started with `--remote-debugging-port`, so port 9222 never opens. A dedicated user-data-dir forces a fresh Chrome process where the debug port actually listens. `--no-first-run --no-default-browser-check` skips the first-launch wizard for the fresh profile.
:::

When connected via CDP, all browser tools (`browser_navigate`, `browser_click`, etc.) operate on your live Chrome instance instead of spinning up a cloud session.

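Before running `/browser connect`, it can help to confirm that the debug port is really listening. The sketch below queries Chrome's `/json/version` HTTP endpoint on the debugging port and extracts `webSocketDebuggerUrl`. The helper names `probe_cdp` and `parse_debugger_url` are illustrative and not part of Hermes.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def parse_debugger_url(body: str) -> Optional[str]:
    """Extract the browser-level WebSocket URL from a /json/version response body."""
    try:
        info = json.loads(body)
    except json.JSONDecodeError:
        return None
    if not isinstance(info, dict):
        return None
    return info.get("webSocketDebuggerUrl")


def probe_cdp(host: str = "localhost", port: int = 9222, timeout: float = 2.0) -> Optional[str]:
    """Return the debugger WebSocket URL if Chrome is listening, else None."""
    url = f"http://{host}:{port}/json/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_debugger_url(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return None  # port closed, wrong host, or Chrome started without the flag
```

If `probe_cdp()` returns `None` while Chrome is open, the likely cause is the one described in the tip: the running Chrome process was not started with `--remote-debugging-port`.
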
### Local browser mode

If you do **not** set any cloud credentials and don't use `/browser connect`, Hermes can still use the browser tools through a local Chromium install driven by `agent-browser`.

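The precedence rules stated in the setup sections above can be summarized in a small sketch. It models only what this page says about environment variables (Camofox overrides Browserbase and agent-browser, Browserbase wins over Browser Use, and the local mode is the fallback). The function `pick_backend` is hypothetical, requiring both Browserbase variables is an assumption, and Firecrawl selection via `hermes setup tools` and `/browser connect` are deliberately not modeled.

```python
from typing import Mapping


def pick_backend(env: Mapping[str, str]) -> str:
    """Pick a browser backend from environment variables (illustrative ordering only)."""
    if env.get("CAMOFOX_URL"):
        # Documented: CAMOFOX_URL routes all browser tools through Camofox.
        return "camofox"
    if env.get("BROWSERBASE_API_KEY") and env.get("BROWSERBASE_PROJECT_ID"):
        # Documented: Browserbase takes priority over Browser Use.
        return "browserbase"
    if env.get("BROWSER_USE_API_KEY"):
        return "browser-use"
    # No cloud credentials: local Chromium driven by agent-browser.
    return "local"
```
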
### Optional Environment Variables

```bash
# Residential proxies for better CAPTCHA solving (default: "true")
BROWSERBASE_PROXIES=true

# Advanced stealth with custom Chromium — requires Scale Plan (default: "false")
BROWSERBASE_ADVANCED_STEALTH=false

# Session reconnection after disconnects — requires paid plan (default: "true")
BROWSERBASE_KEEP_ALIVE=true

# Custom session timeout in milliseconds (default: project default)
# Examples: 600000 (10min), 1800000 (30min)
BROWSERBASE_SESSION_TIMEOUT=600000

# Inactivity timeout before auto-cleanup in seconds (default: 120)
BROWSER_INACTIVITY_TIMEOUT=120
```

### Install agent-browser CLI

```bash
npm install -g agent-browser
# Or install locally in the repo:
npm install
```

:::info
The `browser` toolset must be included in your config's `toolsets` list or enabled via `hermes config set toolsets '["hermes-cli", "browser"]'`.
:::

## Available Tools

### `browser_navigate`

Navigate to a URL. Must be called before any other browser tool. Initializes the browser session on whichever backend is configured.

```
Navigate to https://github.com/NousResearch
```

:::tip
For simple information retrieval, prefer `web_search` or `web_extract` — they are faster and cheaper. Use browser tools when you need to **interact** with a page (click buttons, fill forms, handle dynamic content).
:::

### `browser_snapshot`

Get a text-based snapshot of the current page's accessibility tree. Returns interactive elements with ref IDs like `@e1`, `@e2` for use with `browser_click` and `browser_type`.

- **`full=false`** (default): Compact view showing only interactive elements
- **`full=true`**: Complete page content

Snapshots over 8000 characters are automatically summarized by an LLM.

### `browser_click`

Click an element identified by its ref ID from the snapshot.

```
Click @e5 to press the "Sign In" button
```

### `browser_type`

Type text into an input field. Clears the field first, then types the new text.

```
Type "hermes agent" into the search field @e3
```

### `browser_scroll`

Scroll the page up or down to reveal more content.

```
Scroll down to see more results
```

### `browser_press`

Press a keyboard key. Useful for submitting forms or navigation.

```
Press Enter to submit the form
```

Supported keys: `Enter`, `Tab`, `Escape`, `ArrowDown`, `ArrowUp`, and more.

### `browser_back`

Navigate back to the previous page in browser history.

### `browser_get_images`

List all images on the current page with their URLs and alt text. Useful for finding images to analyze.

### `browser_vision`

Take a screenshot and analyze it with vision AI. Use this when text snapshots don't capture important visual information — especially useful for CAPTCHAs, complex layouts, or visual verification challenges.

The screenshot is saved persistently and the file path is returned alongside the AI analysis. On messaging platforms (Telegram, Discord, Slack, WhatsApp), you can ask the agent to share the screenshot — it will be sent as a native photo attachment via the `MEDIA:` mechanism.

```
What does the chart on this page show?
```

Screenshots are stored in `~/.hermes/cache/screenshots/` and automatically cleaned up after 24 hours.

### `browser_console`

Get browser console output (log/warn/error messages) and uncaught JavaScript exceptions from the current page. Essential for detecting silent JS errors that don't appear in the accessibility tree.

```
Check the browser console for any JavaScript errors
```

Use `clear=True` to clear the console after reading, so subsequent calls only show new messages.

### `browser_cdp`

Raw Chrome DevTools Protocol passthrough — the escape hatch for browser operations not covered by the other tools. Use for native dialog handling, iframe-scoped evaluation, cookie/network control, or any CDP verb the agent needs.

**Only available when a CDP endpoint is reachable at session start** — meaning `/browser connect` has attached to a running Chrome, or `browser.cdp_url` is set in `config.yaml`. The default local agent-browser mode, Camofox, and cloud providers (Browserbase, Browser Use, Firecrawl) do not currently expose CDP to this tool — cloud providers have per-session CDP URLs but live-session routing is a follow-up.

**CDP method reference:** https://chromedevtools.github.io/devtools-protocol/ — the agent can `web_extract` a specific method's page to look up parameters and return shape.

Common patterns:

```
# List tabs (browser-level, no target_id)
browser_cdp(method="Target.getTargets")

# Handle a native JS dialog on a tab
browser_cdp(method="Page.handleJavaScriptDialog",
            params={"accept": true, "promptText": ""},
            target_id="<tabId>")

# Evaluate JS in a specific tab
browser_cdp(method="Runtime.evaluate",
            params={"expression": "document.title", "returnByValue": true},
            target_id="<tabId>")

# Get all cookies
browser_cdp(method="Network.getAllCookies")
```

Browser-level methods (`Target.*`, `Browser.*`, `Storage.*`) omit `target_id`. Page-level methods (`Page.*`, `Runtime.*`, `DOM.*`, `Emulation.*`) require a `target_id` from `Target.getTargets`. Each stateless call is independent — sessions do not persist between calls.

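The `target_id` rule can be expressed as a small lookup keyed on the CDP domain prefix. This is a sketch covering only the domains named in this section; the helper `needs_target_id` is hypothetical, and domains not listed here (for example `Network`) are reported as unknown rather than guessed.

```python
from typing import Optional

BROWSER_LEVEL_DOMAINS = {"Target", "Browser", "Storage"}
PAGE_LEVEL_DOMAINS = {"Page", "Runtime", "DOM", "Emulation"}


def needs_target_id(method: str) -> Optional[bool]:
    """True/False for domains listed in this section, None when the domain is not listed."""
    domain = method.split(".", 1)[0]
    if domain in BROWSER_LEVEL_DOMAINS:
        return False
    if domain in PAGE_LEVEL_DOMAINS:
        return True
    return None
```
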
**Cross-origin iframes:** pass `frame_id` (from `browser_snapshot.frame_tree.children[]` where `is_oopif=true`) to route the CDP call through the supervisor's live session for that iframe. This is how `Runtime.evaluate` inside a cross-origin iframe works on Browserbase, where stateless CDP connections would hit signed-URL expiry. Example:

```
browser_cdp(
    method="Runtime.evaluate",
    params={"expression": "document.title", "returnByValue": True},
    frame_id="<frame_id from browser_snapshot>",
)
```

Same-origin iframes don't need `frame_id` — use `document.querySelector('iframe').contentDocument` from a top-level `Runtime.evaluate` instead.

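Finding the right `frame_id` means walking `frame_tree` and picking the nodes flagged `is_oopif`. A minimal sketch follows, assuming each node is a dict with `frame_id`, `is_oopif`, and an optional `children` list; the `url` key in the sample data and the helper name `iter_oopif_frames` are illustrative assumptions.

```python
from typing import Any, Dict, Iterator


def iter_oopif_frames(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every descendant frame flagged as an out-of-process iframe."""
    for child in node.get("children", []):
        if child.get("is_oopif"):
            yield child
        # Keep descending: an OOPIF can itself contain nested frames.
        yield from iter_oopif_frames(child)


# Sample data shaped like the snapshot field described above (values are made up).
frame_tree = {
    "frame_id": "MAIN",
    "is_oopif": False,
    "children": [
        {"frame_id": "F1", "is_oopif": False, "children": []},
        {"frame_id": "F2", "is_oopif": True, "url": "https://example.com/", "children": []},
    ],
}
oopif_ids = [frame["frame_id"] for frame in iter_oopif_frames(frame_tree)]
```

Each id in `oopif_ids` can then be passed as `frame_id` to `browser_cdp`.
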
### `browser_dialog`

Responds to a native JS dialog (`alert` / `confirm` / `prompt` / `beforeunload`). Before this tool existed, dialogs would silently block the page's JavaScript thread and subsequent `browser_*` calls would hang or throw; now the agent sees pending dialogs in `browser_snapshot` output and responds explicitly.

**Workflow:**
1. Call `browser_snapshot`. If a dialog is blocking the page, it shows up as `pending_dialogs: [{"id": "d-1", "type": "alert", "message": "..."}]`.
2. Call `browser_dialog(action="accept")` or `browser_dialog(action="dismiss")`. For `prompt()` dialogs, pass `prompt_text="..."` to supply the response.
3. Re-snapshot — `pending_dialogs` is empty; the page's JS thread has resumed.

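The workflow above can be sketched as a function that turns a snapshot into the arguments for each `browser_dialog` call. The helper `plan_dialog_calls` and its accept-everything behavior are assumptions made for illustration. The `dialog_id` argument is taken from the tool schema in the commit notes and only matters when several dialogs are queued.

```python
from typing import Any, Dict, List


def plan_dialog_calls(snapshot: Dict[str, Any], prompt_reply: str = "") -> List[Dict[str, str]]:
    """Build one browser_dialog argument dict per pending dialog (accepts everything)."""
    calls = []
    for dialog in snapshot.get("pending_dialogs", []):
        call = {"action": "accept", "dialog_id": dialog["id"]}
        if dialog.get("type") == "prompt":
            # Only prompt() dialogs carry a text response.
            call["prompt_text"] = prompt_reply
        calls.append(call)
    return calls


snapshot = {"pending_dialogs": [{"id": "d-1", "type": "prompt", "message": "Your name?"}]}
calls = plan_dialog_calls(snapshot, prompt_reply="Hermes")
```

A real agent would choose between `accept` and `dismiss` based on the dialog message, then take a fresh snapshot to confirm `pending_dialogs` is empty.
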
**Detection happens automatically** via a persistent CDP supervisor — one WebSocket per task that subscribes to Page/Runtime/Target events. The supervisor also populates a `frame_tree` field in the snapshot so the agent can see the iframe structure of the current page, including cross-origin (OOPIF) iframes.
|
|
|
|
**Availability matrix:**
|
|
|
|
| Backend | Detection via `pending_dialogs` | Response (`browser_dialog` tool) |
|
|
|---|---|---|
|
|
| Local Chrome via `/browser connect` or `browser.cdp_url` | ✓ | ✓ full workflow |
|
|
| Browserbase | ✓ | ✓ full workflow (via injected XHR bridge) |
|
|
| Camofox / default local agent-browser | ✗ | ✗ (no CDP endpoint) |
|
|
|
|
**How it works on Browserbase.** Browserbase's CDP proxy auto-dismisses real native dialogs server-side within ~10ms, so we can't use `Page.handleJavaScriptDialog`. The supervisor injects a small script via `Page.addScriptToEvaluateOnNewDocument` that overrides `window.alert`/`confirm`/`prompt` with a synchronous XHR. We intercept those XHRs via `Fetch.enable` — the page's JS thread stays blocked on the XHR until we call `Fetch.fulfillRequest` with the agent's response. `prompt()` return values round-trip back into page JS unchanged.
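To make the last step concrete, here is a minimal sketch of building the parameters for `Fetch.fulfillRequest`. The CDP field names (`requestId`, `responseCode`, `responseHeaders`, base64-encoded `body`) follow the protocol, but the JSON payload shape and the helper name `build_fulfill_params` are assumptions; the bridge's actual wire format is not described here.

```python
import base64
import json
from typing import Any, Optional


def build_fulfill_params(
    request_id: str, action: str, prompt_text: Optional[str] = None
) -> dict[str, Any]:
    """Build Fetch.fulfillRequest params carrying the agent's dialog answer."""
    if action not in ("accept", "dismiss"):
        raise ValueError(f"unknown action: {action!r}")

    # Assumed payload: the injected script would parse this JSON and return
    # the matching value from alert/confirm/prompt.
    payload = {
        "accepted": action == "accept",
        "prompt_text": prompt_text if action == "accept" else None,
    }
    # CDP expects the response body as a base64-encoded string.
    body = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
    return {
        "requestId": request_id,
        "responseCode": 200,
        "responseHeaders": [{"name": "Content-Type", "value": "application/json"}],
        "body": body,
    }


example_params = build_fulfill_params("req-1", "accept", "hello")
```

Because the page is blocked inside a synchronous XHR, nothing else runs in that frame until a response like this one is delivered.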

**Dialog policy** is configured in `config.yaml` under `browser.dialog_policy`:

| Policy | Behavior |
|--------|----------|
| `must_respond` (default) | Capture, surface in snapshot, wait for explicit `browser_dialog()` call. Safety auto-dismiss after `browser.dialog_timeout_s` (default 300s) so a buggy agent can't stall forever. |
| `auto_dismiss` | Capture, dismiss immediately. Agent still sees the dialog in `browser_state` history but doesn't have to act. |
| `auto_accept` | Capture, accept immediately. Useful when navigating pages with aggressive `beforeunload` prompts. |
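The policy table can be read as a small decision function. The sketch below is one way to express it, assuming the supervisor tracks how long each dialog has been open; the function name `resolve_dialog_action` and the `age_s` parameter are hypothetical.

```python
from typing import Optional


def resolve_dialog_action(
    policy: str, age_s: float, timeout_s: float = 300.0
) -> Optional[str]:
    """Return 'accept', 'dismiss', or None (keep waiting for the agent)."""
    if policy == "auto_dismiss":
        return "dismiss"
    if policy == "auto_accept":
        return "accept"
    if policy == "must_respond":
        # Safety net: dismiss once the dialog has been open for timeout_s.
        return "dismiss" if age_s >= timeout_s else None
    raise ValueError(f"unknown dialog policy: {policy!r}")
```

With the defaults, `resolve_dialog_action("must_respond", age_s=12.0)` returns `None`, so the dialog stays in `pending_dialogs` until the agent answers or the timeout elapses.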

**Frame tree** inside `browser_snapshot.frame_tree` is capped at 30 frames and OOPIF depth 2 to keep payloads bounded on ad-heavy pages. A `truncated: true` flag is set when either limit is hit; agents needing the full tree can use `browser_cdp` with `Page.getFrameTree`.
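A capping pass of this kind might look like the sketch below. It is a simplified illustration: it applies the depth limit to every frame rather than only to OOPIF nesting, and the node shape (`children` lists) and the name `cap_frame_tree` are assumptions rather than the supervisor's actual code.

```python
from typing import Any


def cap_frame_tree(
    root: dict[str, Any], max_frames: int = 30, max_depth: int = 2
) -> dict[str, Any]:
    """Return a pruned copy of `root` with a top-level `truncated` flag."""
    count = 0
    truncated = False

    def copy_node(node: dict[str, Any], depth: int) -> dict[str, Any]:
        nonlocal count, truncated
        count += 1
        pruned = {key: value for key, value in node.items() if key != "children"}
        pruned["children"] = []
        for child in node.get("children", []):
            # Stop descending once either the depth or the frame budget is spent.
            if depth + 1 > max_depth or count >= max_frames:
                truncated = True
                break
            pruned["children"].append(copy_node(child, depth + 1))
        return pruned

    capped = copy_node(root, 0)
    capped["truncated"] = truncated
    return capped
```

The input tree is left untouched, so a caller could still serialize the full tree elsewhere if needed.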

## Practical Examples

### Filling Out a Web Form

```
User: Sign up for an account on example.com with my email john@example.com

Agent workflow:
1. browser_navigate("https://example.com/signup")
2. browser_snapshot() → sees form fields with refs
3. browser_type(ref="@e3", text="john@example.com")
4. browser_type(ref="@e5", text="SecurePass123")
5. browser_click(ref="@e8") → clicks "Create Account"
6. browser_snapshot() → confirms success
```

### Researching Dynamic Content

```
User: What are the top trending repos on GitHub right now?

Agent workflow:
1. browser_navigate("https://github.com/trending")
2. browser_snapshot(full=true) → reads trending repo list
3. Returns formatted results
```

## Session Recording

Automatically record browser sessions as WebM video files:

```yaml
browser:
  record_sessions: true  # default: false
```

When enabled, recording starts automatically on the first `browser_navigate` and saves to `~/.hermes/browser_recordings/` when the session closes. Works in both local and cloud (Browserbase) modes. Recordings older than 72 hours are automatically cleaned up.

## Stealth Features

Browserbase provides automatic stealth capabilities:

| Feature | Default | Notes |
|---------|---------|-------|
| Basic Stealth | Always on | Random fingerprints, viewport randomization, CAPTCHA solving |
| Residential Proxies | On | Routes through residential IPs for better access |
| Advanced Stealth | Off | Custom Chromium build, requires Scale Plan |
| Keep Alive | On | Session reconnection after network hiccups |

:::note
If paid features aren't available on your plan, Hermes automatically falls back, first disabling `keepAlive`, then proxies, so browsing still works on free plans.
:::
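The fallback order in the note can be sketched as a list of progressively reduced session options to try in turn. This is only an illustration: `keepAlive` is named in the note, while the `proxies` key and the helper name `fallback_attempts` are assumptions about how the options might be represented.

```python
from typing import Any


def fallback_attempts(options: dict[str, Any]) -> list[dict[str, Any]]:
    """List session options to try, dropping paid features one at a time."""
    attempts = [dict(options)]
    current = dict(options)
    # Order matters: keepAlive is dropped first, then proxies.
    for paid_key in ("keepAlive", "proxies"):
        if paid_key in current:
            current = {key: value for key, value in current.items() if key != paid_key}
            attempts.append(current)
    return attempts


example_attempts = fallback_attempts({"keepAlive": True, "proxies": True})
```

A caller would try each entry in order and stop at the first one the provider accepts.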

## Session Management

- Each task gets an isolated browser session via Browserbase
- Sessions are automatically cleaned up after inactivity (default: 2 minutes)
- A background thread checks every 30 seconds for stale sessions
- Emergency cleanup runs on process exit to prevent orphaned sessions
- Sessions are released via the Browserbase API (`REQUEST_RELEASE` status)
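The inactivity check described in the list above could be sketched as follows. The 2-minute timeout and 30-second interval come from the list; the function name `find_stale_sessions` and the `last_activity` mapping of task IDs to timestamps are hypothetical.

```python
INACTIVITY_TIMEOUT_S = 120.0  # default: 2 minutes of inactivity
CHECK_INTERVAL_S = 30.0       # how often the background thread would run the check


def find_stale_sessions(
    last_activity: dict[str, float],
    now: float,
    inactivity_timeout_s: float = INACTIVITY_TIMEOUT_S,
) -> list[str]:
    """Return task IDs whose sessions have been idle for the timeout or longer."""
    return sorted(
        task_id
        for task_id, last_seen in last_activity.items()
        if now - last_seen >= inactivity_timeout_s
    )


# Hypothetical timestamps in seconds, for illustration only.
stale = find_stale_sessions({"task-a": 0.0, "task-b": 100.0}, now=130.0)
```

A background loop would call this every `CHECK_INTERVAL_S` seconds and release each returned session.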

## Limitations

- **Text-based interaction**: relies on accessibility tree, not pixel coordinates
- **Snapshot size**: large pages may be truncated or LLM-summarized at 8000 characters
- **Session timeout**: cloud sessions expire based on your provider's plan settings
- **Cost**: cloud sessions consume provider credits; sessions are automatically cleaned up when the conversation ends or after inactivity. Use `/browser connect` for free local browsing.
- **No file downloads**: cannot download files from the browser