feat(web): firecrawl plugin — largest migration (search + async extract + dual auth)

Migrates Firecrawl from inline code in tools/web_tools.py to a bundled plugin at plugins/web/firecrawl/. By line count this is the largest of the seven provider migrations: the firecrawl path captured most of the file's vendor-specific complexity. What moved into the plugin (all previously in tools/web_tools.py): Lazy Firecrawl SDK proxy - _load_firecrawl_cls() — caches the imported SDK class - _FirecrawlProxy + Firecrawl singleton — defers ~200ms of SDK imports until first construction or isinstance check. Client construction (dual auth) - _get_direct_firecrawl_config() — direct FIRECRAWL_API_KEY/URL path - _get_firecrawl_gateway_url() — managed Nous tool-gateway URL - _is_tool_gateway_ready() — gateway URL + Nous token check - _has_direct_firecrawl_config() — direct config present? - _get_firecrawl_client() — combined client construction honoring web.use_gateway - check_firecrawl_api_key() — top-level "is firecrawl usable" - _firecrawl_backend_help_suffix() — managed-gateway help string - _raise_web_backend_configuration_error() — typed misconfig error Response shape normalization (vendor-specific) - _to_plain_object(), _normalize_result_list() — SDK→dict helpers - _extract_web_search_results() — handles SDK/direct/gateway shapes - _extract_scrape_payload() — nested-data unwrap for scrape Per-URL extract loop - 60s asyncio.wait_for timeout per URL - Pre-scrape website-policy gate - Post-scrape redirect-aware SSRF re-check - Format-aware content selection (markdown / html / auto) - Per-URL errors returned as {"error": str} entries, no raises Extract is declared `async def` — each URL is scraped in asyncio.to_thread(...). This is the second async-extract plugin after parallel. The plugin re-exports `Firecrawl` (the lazy proxy) and `check_firecrawl_api_key()` so existing tests doing `patch("tools.web_tools.Firecrawl")` or `monkeypatch.setattr(web_tools, "check_firecrawl_api_key", ...)` keep working — tools/web_tools.py re-exports both names in the next dispatcher-cutover commit. Note: web_crawl_tool still has its own Firecrawl crawl path inline (separate from extract); the Firecrawl SDK supports /crawl but we don't expose supports_crawl=True on this plugin yet. Tavily handles crawl today. Adding Firecrawl crawl is a clean follow-up. Adds "firecrawl" to _WEB_PLUGIN_SKIPLIST. E2E verified: - All 7 providers register: brave-free, ddgs, exa, firecrawl, parallel, searxng, tavily - inspect.iscoroutinefunction(firecrawl.extract) -> True - Firecrawl proxy is a callable lazy proxy at module level - check_firecrawl_api_key reflects FIRECRAWL_API_KEY presence
2026-05-18 04:41:56 +00:00 · 2026-05-14 00:20:16 +05:30 · 2026-05-14 00:20:16 +05:30 · 143184e943
commit 143184e943
parent 31fcde876c
4 changed files with 603 additions and 1 deletions
--- a/plugins/web/firecrawl/init.py
+++ b/plugins/web/firecrawl/init.py
@ -0,0 +1,28 @@
+"""Firecrawl web search + extract plugin — bundled, auto-loaded.
+
+Largest single plugin in this PR. Captures everything the previous
+inline implementation in tools/web_tools.py did:
+
+  - Lazy import of the firecrawl SDK (~200ms cold-start cost) via a
+    callable proxy that defers the actual import to first use.
+  - Dual client paths: direct (FIRECRAWL_API_KEY / FIRECRAWL_API_URL)
+    OR Nous-hosted tool-gateway routing for subscribers, with
+    web.use_gateway as the tie-breaker.
+  - Per-URL scrape loop with 60s timeout, SSRF re-check after redirect,
+    website-policy gating, and format-aware content selection.
+  - Robust response shape normalization across SDK / direct API /
+    gateway variants (search returns differ by transport).
+
+The plugin re-exports ``Firecrawl`` (the lazy proxy) and
+``check_firecrawl_api_key`` for backward-compatibility with tests and
+external code that imports those names from ``tools.web_tools``.
+"""
+
+from __future__ import annotations
+
+from plugins.web.firecrawl.provider import FirecrawlWebSearchProvider
+
+
+def register(ctx) -> None:
+    """Register the Firecrawl provider with the plugin context."""
+    ctx.register_web_search_provider(FirecrawlWebSearchProvider())
--- a/plugins/web/firecrawl/plugin.yaml
+++ b/plugins/web/firecrawl/plugin.yaml
@ -0,0 +1,7 @@
+name: web-firecrawl
+version: 1.0.0
+description: "Firecrawl web search + content extraction. Supports direct API and Nous-hosted tool-gateway routing for subscribers. Requires FIRECRAWL_API_KEY (or FIRECRAWL_API_URL for self-hosted), or an active Nous subscription with FIRECRAWL_GATEWAY_URL."
+author: NousResearch
+kind: backend
+provides_web_providers:
+  - firecrawl
--- a/plugins/web/firecrawl/provider.py
+++ b/plugins/web/firecrawl/provider.py
@ -0,0 +1,565 @@
+"""Firecrawl web search + extract — plugin form.
+
+Subclasses :class:`agent.web_search_provider.WebSearchProvider`. This is
+the largest provider migrated in this PR; it captures the full inline
+firecrawl implementation that previously lived in tools/web_tools.py:
+
+  - :data:`Firecrawl` lazy proxy that defers the ~200ms SDK import to
+    first use (re-exported by tools.web_tools for backward compat with
+    existing tests that mock that name).
+  - :func:`_get_firecrawl_client` with direct + managed-gateway dual
+    mode, controlled by ``web.use_gateway`` config when both are
+    configured.
+  - :func:`check_firecrawl_api_key` re-exported (tests + tools_config
+    setup hint depend on this name living in tools.web_tools).
+  - :func:`_extract_web_search_results` / :func:`_extract_scrape_payload`
+    response-shape normalizers that handle SDK / direct API / gateway
+    response variants.
+  - Per-URL extract loop with 60s timeout, redirect-aware SSRF re-check,
+    website-policy gating, and format-aware content selection.
+
+Async note: the underlying SDK is sync. ``extract()`` is declared
+``async def`` because it performs per-URL I/O that benefits from
+running in an executor; the implementation wraps each scrape in
+:func:`asyncio.to_thread` with :func:`asyncio.wait_for(timeout=60)` to
+guard against hung fetches.
+
+Config keys this provider responds to::
+
+    web:
+      search_backend: "firecrawl"     # explicit per-capability
+      extract_backend: "firecrawl"    # explicit per-capability
+      backend: "firecrawl"            # shared fallback (default)
+      use_gateway: false              # prefer managed gateway when both
+                                      # direct + gateway credentials exist
+
+Env vars::
+
+    FIRECRAWL_API_KEY=...            # direct cloud auth
+    FIRECRAWL_API_URL=...            # self-hosted Firecrawl
+    FIRECRAWL_GATEWAY_URL=...        # Nous tool-gateway (subscribers)
+    TOOL_GATEWAY_DOMAIN=...          # alternate gateway env
+    TOOL_GATEWAY_SCHEME=...
+    TOOL_GATEWAY_USER_TOKEN=...
+"""
+
+from __future__ import annotations
+
+import asyncio
+import logging
+import os
+from typing import Any, Dict, List, Optional, TYPE_CHECKING
+
+from agent.web_search_provider import WebSearchProvider
+
+logger = logging.getLogger(__name__)
+
+
+# ---------------------------------------------------------------------------
+# Lazy Firecrawl SDK proxy
+# ---------------------------------------------------------------------------
+# The firecrawl SDK pulls ~200ms of imports (httpcore, firecrawl.v1/v2 type
+# trees) on a cold CLI. We only need it when the backend is actually
+# "firecrawl", so defer the import to first use via a callable proxy.
+#
+# Tests that do ``patch("tools.web_tools.Firecrawl", ...)`` continue to
+# work because tools/web_tools.py re-exports ``Firecrawl`` from this
+# module — so the patched name still references the same proxy instance.
+
+if TYPE_CHECKING:
+    from firecrawl import Firecrawl as FirecrawlSDK  # noqa: F401 — type hints only
+
+_FIRECRAWL_CLS_CACHE: Optional[type] = None
+
+
+def _load_firecrawl_cls() -> type:
+    """Import and cache ``firecrawl.Firecrawl``."""
+    global _FIRECRAWL_CLS_CACHE
+    if _FIRECRAWL_CLS_CACHE is None:
+        try:
+            from tools.lazy_deps import ensure as _lazy_ensure
+
+            _lazy_ensure("search.firecrawl", prompt=False)
+        except ImportError:
+            pass
+        except Exception as exc:  # noqa: BLE001 — surface install hint
+            raise ImportError(str(exc))
+        from firecrawl import Firecrawl as _cls  # noqa: WPS433 — deliberately lazy
+
+        _FIRECRAWL_CLS_CACHE = _cls
+    return _FIRECRAWL_CLS_CACHE
+
+
+class _FirecrawlProxy:
+    """Callable proxy that looks like ``firecrawl.Firecrawl`` but imports lazily."""
+
+    __slots__ = ()
+
+    def __call__(self, *args: Any, **kwargs: Any) -> Any:
+        return _load_firecrawl_cls()(*args, **kwargs)
+
+    def __instancecheck__(self, obj: Any) -> bool:
+        return isinstance(obj, _load_firecrawl_cls())
+
+    def __repr__(self) -> str:
+        return "<lazy firecrawl.Firecrawl proxy>"
+
+
+Firecrawl = _FirecrawlProxy()
+
+
+# ---------------------------------------------------------------------------
+# Client construction (direct vs managed-gateway)
+# ---------------------------------------------------------------------------
+
+_firecrawl_client: Any = None
+_firecrawl_client_config: Any = None
+
+
+def _get_direct_firecrawl_config() -> Optional[tuple]:
+    """Return explicit direct Firecrawl kwargs + cache key, or None when unset."""
+    api_key = os.getenv("FIRECRAWL_API_KEY", "").strip()
+    api_url = os.getenv("FIRECRAWL_API_URL", "").strip().rstrip("/")
+
+    if not api_key and not api_url:
+        return None
+
+    kwargs: Dict[str, str] = {}
+    if api_key:
+        kwargs["api_key"] = api_key
+    if api_url:
+        kwargs["api_url"] = api_url
+
+    return kwargs, ("direct", api_url or None, api_key or None)
+
+
+def _get_firecrawl_gateway_url() -> str:
+    """Return the configured Firecrawl gateway URL."""
+    from tools.tool_backend_helpers import build_vendor_gateway_url
+
+    return build_vendor_gateway_url("firecrawl")
+
+
+def _is_tool_gateway_ready() -> bool:
+    """Return True when gateway URL + Nous Subscriber token are available."""
+    from tools.managed_tool_gateway import (
+        read_nous_access_token,
+        resolve_managed_tool_gateway,
+    )
+
+    return resolve_managed_tool_gateway(
+        "firecrawl", token_reader=read_nous_access_token
+    ) is not None
+
+
+def _has_direct_firecrawl_config() -> bool:
+    """Return True when direct Firecrawl config is explicitly configured."""
+    return _get_direct_firecrawl_config() is not None
+
+
+def check_firecrawl_api_key() -> bool:
+    """Return True when Firecrawl backend (direct or gateway) is usable.
+
+    Re-exported by :mod:`tools.web_tools` for backward compatibility with
+    existing tests and the ``hermes tools`` setup flow.
+    """
+    return _has_direct_firecrawl_config() or _is_tool_gateway_ready()
+
+
+def _firecrawl_backend_help_suffix() -> str:
+    """Return optional managed-gateway guidance for Firecrawl help text."""
+    from tools.tool_backend_helpers import managed_nous_tools_enabled
+
+    if not managed_nous_tools_enabled():
+        return ""
+    return (
+        ", or use the Nous Tool Gateway via your subscription "
+        "(FIRECRAWL_GATEWAY_URL or TOOL_GATEWAY_DOMAIN)"
+    )
+
+
+def _raise_web_backend_configuration_error() -> None:
+    """Raise a clear error for unsupported web backend configuration."""
+    from tools.tool_backend_helpers import managed_nous_tools_enabled
+
+    message = (
+        "Web tools are not configured. "
+        "Set FIRECRAWL_API_KEY for cloud Firecrawl or set FIRECRAWL_API_URL "
+        "for a self-hosted Firecrawl instance."
+    )
+    if managed_nous_tools_enabled():
+        message += (
+            " With your Nous subscription you can also use the Tool Gateway — "
+            "run `hermes tools` and select Nous Subscription as the web provider."
+        )
+    raise ValueError(message)
+
+
+def _get_firecrawl_client() -> Any:
+    """Get or create the cached Firecrawl client.
+
+    When ``web.use_gateway`` is set in config, the managed Tool Gateway is
+    preferred even if direct Firecrawl credentials are present. Otherwise
+    direct Firecrawl takes precedence when explicitly configured.
+
+    Raises ValueError when neither path is usable.
+    """
+    global _firecrawl_client, _firecrawl_client_config
+
+    from tools.managed_tool_gateway import (
+        read_nous_access_token,
+        resolve_managed_tool_gateway,
+    )
+    from tools.tool_backend_helpers import prefers_gateway
+
+    direct_config = _get_direct_firecrawl_config()
+    if direct_config is not None and not prefers_gateway("web"):
+        kwargs, client_config = direct_config
+    else:
+        managed_gateway = resolve_managed_tool_gateway(
+            "firecrawl", token_reader=read_nous_access_token
+        )
+        if managed_gateway is None:
+            logger.error(
+                "Firecrawl client initialization failed: "
+                "missing direct config and tool-gateway auth."
+            )
+            _raise_web_backend_configuration_error()
+
+        kwargs = {
+            "api_key": managed_gateway.nous_user_token,
+            "api_url": managed_gateway.gateway_origin,
+        }
+        client_config = (
+            "tool-gateway",
+            kwargs["api_url"],
+            managed_gateway.nous_user_token,
+        )
+
+    if _firecrawl_client is not None and _firecrawl_client_config == client_config:
+        return _firecrawl_client
+
+    _firecrawl_client = Firecrawl(**kwargs)
+    _firecrawl_client_config = client_config
+    return _firecrawl_client
+
+
+def _reset_client_for_tests() -> None:
+    """Drop the cached Firecrawl client so tests can re-instantiate cleanly."""
+    global _firecrawl_client, _firecrawl_client_config
+    _firecrawl_client = None
+    _firecrawl_client_config = None
+
+
+# ---------------------------------------------------------------------------
+# Response shape normalization (SDK / direct / gateway differ)
+# ---------------------------------------------------------------------------
+
+
+def _to_plain_object(value: Any) -> Any:
+    """Convert SDK objects to plain python data structures when possible."""
+    if value is None:
+        return None
+
+    if isinstance(value, (dict, list, str, int, float, bool)):
+        return value
+
+    if hasattr(value, "model_dump"):
+        try:
+            return value.model_dump()
+        except Exception:  # noqa: BLE001
+            pass
+
+    if hasattr(value, "__dict__"):
+        try:
+            return {k: v for k, v in value.__dict__.items() if not k.startswith("_")}
+        except Exception:  # noqa: BLE001
+            pass
+
+    return value
+
+
+def _normalize_result_list(values: Any) -> List[Dict[str, Any]]:
+    """Normalize mixed SDK/list payloads into a list of dicts."""
+    if not isinstance(values, list):
+        return []
+
+    normalized: List[Dict[str, Any]] = []
+    for item in values:
+        plain = _to_plain_object(item)
+        if isinstance(plain, dict):
+            normalized.append(plain)
+    return normalized
+
+
+def _extract_web_search_results(response: Any) -> List[Dict[str, Any]]:
+    """Extract Firecrawl search results across SDK/direct/gateway response shapes."""
+    response_plain = _to_plain_object(response)
+
+    if isinstance(response_plain, dict):
+        data = response_plain.get("data")
+        if isinstance(data, list):
+            return _normalize_result_list(data)
+
+        if isinstance(data, dict):
+            data_web = _normalize_result_list(data.get("web"))
+            if data_web:
+                return data_web
+            data_results = _normalize_result_list(data.get("results"))
+            if data_results:
+                return data_results
+
+        top_web = _normalize_result_list(response_plain.get("web"))
+        if top_web:
+            return top_web
+
+        top_results = _normalize_result_list(response_plain.get("results"))
+        if top_results:
+            return top_results
+
+    if hasattr(response, "web"):
+        return _normalize_result_list(getattr(response, "web", []))
+
+    return []
+
+
+def _extract_scrape_payload(scrape_result: Any) -> Dict[str, Any]:
+    """Normalize Firecrawl scrape payload shape across SDK and gateway variants."""
+    result_plain = _to_plain_object(scrape_result)
+    if not isinstance(result_plain, dict):
+        return {}
+
+    nested = result_plain.get("data")
+    if isinstance(nested, dict):
+        return nested
+
+    return result_plain
+
+
+# ---------------------------------------------------------------------------
+# Provider class
+# ---------------------------------------------------------------------------
+
+
+class FirecrawlWebSearchProvider(WebSearchProvider):
+    """Firecrawl search + extract provider with dual auth paths."""
+
+    @property
+    def name(self) -> str:
+        return "firecrawl"
+
+    @property
+    def display_name(self) -> str:
+        return "Firecrawl"
+
+    def is_available(self) -> bool:
+        """Return True when direct Firecrawl OR managed-gateway path is configured."""
+        return check_firecrawl_api_key()
+
+    def supports_search(self) -> bool:
+        return True
+
+    def supports_extract(self) -> bool:
+        return True
+
+    def search(self, query: str, limit: int = 5) -> Dict[str, Any]:
+        """Execute a Firecrawl search.
+
+        Sync; matches the legacy ``_get_firecrawl_client().search(...)``
+        call directly. Normalizes the response across SDK/direct/gateway
+        shapes via :func:`_extract_web_search_results`.
+        """
+        try:
+            from tools.interrupt import is_interrupted
+
+            if is_interrupted():
+                return {"success": False, "error": "Interrupted"}
+
+            logger.info("Firecrawl search: '%s' (limit=%d)", query, limit)
+            response = _get_firecrawl_client().search(query=query, limit=limit)
+            web_results = _extract_web_search_results(response)
+            logger.info("Firecrawl: found %d search results", len(web_results))
+            return {"success": True, "data": {"web": web_results}}
+        except ValueError as exc:
+            return {"success": False, "error": str(exc)}
+        except ImportError as exc:
+            return {"success": False, "error": f"Firecrawl SDK not installed: {exc}"}
+        except Exception as exc:  # noqa: BLE001
+            logger.warning("Firecrawl search error: %s", exc)
+            return {"success": False, "error": f"Firecrawl search failed: {exc}"}
+
+    async def extract(self, urls: List[str], **kwargs: Any) -> List[Dict[str, Any]]:
+        """Extract content from one or more URLs via Firecrawl.
+
+        Async; each URL is scraped in a background thread with a 60s
+        timeout. After scraping, the final URL (post-redirect) is
+        re-checked against website-access policy.
+
+        Accepted kwargs (others ignored for forward compat):
+          - ``format``: ``"markdown"`` or ``"html"``; default is both
+            (request both, return markdown when available).
+
+        Returns the legacy per-URL list-of-results shape. Per-URL failures
+        (timeout, SSRF block, scrape error, policy block) become items
+        with an ``error`` field rather than raising.
+        """
+        from tools.interrupt import is_interrupted as _is_interrupted
+
+        if _is_interrupted():
+            return [{"url": u, "error": "Interrupted", "title": ""} for u in urls]
+
+        format = kwargs.get("format")
+        formats: List[str] = []
+        if format == "markdown":
+            formats = ["markdown"]
+        elif format == "html":
+            formats = ["html"]
+        else:
+            formats = ["markdown", "html"]
+
+        # check_website_access is the legacy policy gate; import inside
+        # the function so the plugin doesn't pay the cost when never used.
+        from tools.website_policy import check_website_access
+
+        results: List[Dict[str, Any]] = []
+
+        for url in urls:
+            if _is_interrupted():
+                results.append({"url": url, "error": "Interrupted", "title": ""})
+                continue
+
+            # Pre-scrape website policy gate
+            blocked = check_website_access(url)
+            if blocked:
+                logger.info(
+                    "Blocked web_extract for %s by rule %s",
+                    blocked["host"],
+                    blocked["rule"],
+                )
+                results.append(
+                    {
+                        "url": url,
+                        "title": "",
+                        "content": "",
+                        "error": blocked["message"],
+                        "blocked_by_policy": {
+                            "host": blocked["host"],
+                            "rule": blocked["rule"],
+                            "source": blocked["source"],
+                        },
+                    }
+                )
+                continue
+
+            try:
+                logger.info("Firecrawl scraping: %s", url)
+                try:
+                    scrape_result = await asyncio.wait_for(
+                        asyncio.to_thread(
+                            _get_firecrawl_client().scrape,
+                            url=url,
+                            formats=formats,
+                        ),
+                        timeout=60,
+                    )
+                except asyncio.TimeoutError:
+                    logger.warning("Firecrawl scrape timed out for %s", url)
+                    results.append(
+                        {
+                            "url": url,
+                            "title": "",
+                            "content": "",
+                            "error": (
+                                "Scrape timed out after 60s — page may be too large "
+                                "or unresponsive. Try browser_navigate instead."
+                            ),
+                        }
+                    )
+                    continue
+
+                scrape_payload = _extract_scrape_payload(scrape_result)
+                metadata = scrape_payload.get("metadata", {})
+                content_markdown = scrape_payload.get("markdown")
+                content_html = scrape_payload.get("html")
+
+                # Ensure metadata is a dict (SDK may return a typed object)
+                if not isinstance(metadata, dict):
+                    if hasattr(metadata, "model_dump"):
+                        metadata = metadata.model_dump()
+                    elif hasattr(metadata, "__dict__"):
+                        metadata = metadata.__dict__
+                    else:
+                        metadata = {}
+
+                title = metadata.get("title", "")
+                final_url = metadata.get("sourceURL", url)
+
+                # Re-check website-access policy after any redirect
+                final_blocked = check_website_access(final_url)
+                if final_blocked:
+                    logger.info(
+                        "Blocked redirected web_extract for %s by rule %s",
+                        final_blocked["host"],
+                        final_blocked["rule"],
+                    )
+                    results.append(
+                        {
+                            "url": final_url,
+                            "title": title,
+                            "content": "",
+                            "raw_content": "",
+                            "error": final_blocked["message"],
+                            "blocked_by_policy": {
+                                "host": final_blocked["host"],
+                                "rule": final_blocked["rule"],
+                                "source": final_blocked["source"],
+                            },
+                        }
+                    )
+                    continue
+
+                # Choose markdown vs html according to the requested format
+                if format == "markdown" or (format is None and content_markdown):
+                    chosen_content = content_markdown
+                else:
+                    chosen_content = content_html or content_markdown or ""
+
+                results.append(
+                    {
+                        "url": final_url,
+                        "title": title,
+                        "content": chosen_content,
+                        "raw_content": chosen_content,
+                        "metadata": metadata,
+                    }
+                )
+            except Exception as scrape_err:  # noqa: BLE001
+                logger.debug("Firecrawl scrape failed for %s: %s", url, scrape_err)
+                results.append(
+                    {
+                        "url": url,
+                        "title": "",
+                        "content": "",
+                        "raw_content": "",
+                        "error": str(scrape_err),
+                    }
+                )
+
+        return results
+
+    def get_setup_schema(self) -> Dict[str, Any]:
+        return {
+            "name": "Firecrawl",
+            "badge": "paid · optional gateway",
+            "tag": (
+                "Mainstream search + extract; supports direct API and Nous "
+                "tool-gateway routing."
+            ),
+            "env_vars": [
+                {
+                    "key": "FIRECRAWL_API_KEY",
+                    "prompt": "Firecrawl API key (or leave blank for self-hosted)",
+                    "url": "https://docs.firecrawl.dev/introduction",
+                },
+            ],
+        }