Mirror of https://github.com/NousResearch/hermes-agent.git, synced 2026-05-23 05:31:23 +00:00
feat(web): firecrawl plugin — largest migration (search + async extract + dual auth)
Migrates Firecrawl from inline code in tools/web_tools.py to a bundled
plugin at plugins/web/firecrawl/. By line count this is the largest of
the seven provider migrations, because the Firecrawl path held most of
the file's vendor-specific complexity.
What moved into the plugin (all previously in tools/web_tools.py):

Lazy Firecrawl SDK proxy
- _load_firecrawl_cls() — caches the imported SDK class
- _FirecrawlProxy + Firecrawl singleton — defers ~200ms of SDK
  imports until first construction or isinstance check.
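
A minimal sketch of the lazy-proxy idea, not the plugin's actual code. The class name `LazyClassProxy` and the `loaded` property are hypothetical, and the stdlib `fractions.Fraction` stands in for the Firecrawl SDK class so the example runs without the SDK installed. The import is deferred until the proxy is first called or used in an isinstance check.

```python
import importlib
from typing import Any, Optional


class LazyClassProxy:
    """Callable stand-in for a class whose module is imported on first use.

    Hypothetical illustration: the real plugin's proxy may differ.
    """

    def __init__(self, module_name: str, class_name: str) -> None:
        self._module_name = module_name
        self._class_name = class_name
        self._cls: Optional[type] = None  # cached after the first load

    def _load(self) -> type:
        # Import the module only once, then reuse the cached class.
        if self._cls is None:
            module = importlib.import_module(self._module_name)
            self._cls = getattr(module, self._class_name)
        return self._cls

    @property
    def loaded(self) -> bool:
        return self._cls is not None

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        # Construction triggers the deferred import.
        return self._load()(*args, **kwargs)

    def __instancecheck__(self, instance: Any) -> bool:
        # isinstance(x, proxy) looks this method up on type(proxy),
        # so defining it on the proxy class is enough.
        return isinstance(instance, self._load())


# Stand-in for the SDK class: nothing is imported by the proxy yet.
Fraction = LazyClassProxy("fractions", "Fraction")
```

Calling `Fraction(1, 2)` or evaluating `isinstance(x, Fraction)` is what triggers the import in this sketch; merely importing the module that defines the proxy does not.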
Client construction (dual auth)
- _get_direct_firecrawl_config() — direct FIRECRAWL_API_KEY/URL path
- _get_firecrawl_gateway_url() — managed Nous tool-gateway URL
- _is_tool_gateway_ready() — gateway URL + Nous token check
- _has_direct_firecrawl_config() — direct config present?
- _get_firecrawl_client() — combined client construction
  honoring web.use_gateway
- check_firecrawl_api_key() — top-level "is firecrawl usable"
- _firecrawl_backend_help_suffix() — managed-gateway help string
- _raise_web_backend_configuration_error() — typed misconfig error
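
The selection logic these helpers imply could look roughly like the sketch below. It is an illustration under assumptions, not the plugin's implementation: the function name, the exception class, and the `EXAMPLE_GATEWAY_URL` / `EXAMPLE_NOUS_TOKEN` keys are all hypothetical, and treating either FIRECRAWL_API_KEY or FIRECRAWL_API_URL as sufficient for the direct path is a guess. The only behaviour taken from the commit message is that `web.use_gateway` breaks the tie when both paths are available.

```python
from typing import Mapping


class WebBackendConfigError(RuntimeError):
    """Hypothetical typed error for a missing or incomplete backend config."""


def choose_firecrawl_backend(env: Mapping[str, str], use_gateway: bool = False) -> str:
    """Return "direct" or "gateway", or raise if neither path is usable."""
    # Assumption: a key or a self-hosted URL is enough for the direct path.
    has_direct = bool(env.get("FIRECRAWL_API_KEY") or env.get("FIRECRAWL_API_URL"))
    # Assumption: the gateway needs both a URL and a token (names are made up).
    gateway_ready = bool(env.get("EXAMPLE_GATEWAY_URL") and env.get("EXAMPLE_NOUS_TOKEN"))

    if has_direct and gateway_ready:
        # Both configured: the use_gateway setting decides.
        return "gateway" if use_gateway else "direct"
    if has_direct:
        return "direct"
    if gateway_ready:
        return "gateway"
    raise WebBackendConfigError(
        "Firecrawl is not configured: set FIRECRAWL_API_KEY or enable the gateway"
    )
```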
Response shape normalization (vendor-specific)
- _to_plain_object(), _normalize_result_list() — SDK→dict helpers
- _extract_web_search_results() — handles SDK/direct/gateway shapes
- _extract_scrape_payload() — nested-data unwrap for scrape
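
To illustrate what shape normalization of this kind involves, here is a hedged sketch. The helper names and the three response shapes it handles (an SDK-style object with a `web` attribute, a dict with a `data` list, and a dict with a nested `data.web` list) are assumptions chosen for the example, not a description of the real Firecrawl responses.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def to_plain(obj: Any) -> Any:
    """Recursively convert SDK-style objects into dicts, lists, and scalars."""
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj
    if isinstance(obj, dict):
        return {key: to_plain(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain(item) for item in obj]
    if hasattr(obj, "model_dump"):  # pydantic-style models
        return to_plain(obj.model_dump())
    if hasattr(obj, "__dict__"):  # plain attribute containers
        return {k: to_plain(v) for k, v in vars(obj).items() if not k.startswith("_")}
    return obj


def extract_results(response: Any) -> List[Dict[str, Any]]:
    """Return a flat list of result dicts from any of the assumed shapes."""
    plain = to_plain(response)
    if isinstance(plain, list):
        return [item for item in plain if isinstance(item, dict)]
    if not isinstance(plain, dict):
        return []
    # Try the nested "data" wrapper first, then the top-level object.
    for candidate in (plain.get("data"), plain):
        if isinstance(candidate, list):
            return [item for item in candidate if isinstance(item, dict)]
        if isinstance(candidate, dict) and isinstance(candidate.get("web"), list):
            return [item for item in candidate["web"] if isinstance(item, dict)]
    return []


# Three differently shaped inputs that should normalize to the same list.
sdk_style = SimpleNamespace(web=[SimpleNamespace(url="https://a.example", title="A")])
direct_style = {"data": [{"url": "https://a.example", "title": "A"}]}
gateway_style = {"success": True, "data": {"web": [{"url": "https://a.example", "title": "A"}]}}
```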
Per-URL extract loop
- 60s asyncio.wait_for timeout per URL
- Pre-scrape website-policy gate
- Post-scrape redirect-aware SSRF re-check
- Format-aware content selection (markdown / html / auto)
- Per-URL errors returned as {"error": str} entries, no raises
Extract is declared `async def` — each URL is scraped in
asyncio.to_thread(...). This is the second async-extract plugin after
parallel.
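
The per-URL loop described above can be sketched as follows. This is a simplified illustration, not the plugin's code: `fake_scrape`, the `is_allowed` hook, the `final_url` field, and the `url` key in each result are hypothetical, and format-aware content selection is omitted. What it does take from the description is the blocking call wrapped in `asyncio.to_thread`, the `asyncio.wait_for` timeout, the pre-scrape and post-redirect checks, and errors returned as entries rather than raised.

```python
import asyncio
import time
from typing import Any, Callable, Dict, List


def fake_scrape(url: str) -> Dict[str, Any]:
    """Hypothetical stand-in for a blocking SDK scrape call."""
    if "slow" in url:
        time.sleep(1.0)
    if "boom" in url:
        raise RuntimeError("scrape failed")
    return {"markdown": f"# {url}", "final_url": url}


async def extract(
    urls: List[str],
    scrape: Callable[[str], Dict[str, Any]] = fake_scrape,
    is_allowed: Callable[[str], bool] = lambda url: True,
    timeout: float = 60.0,
) -> List[Dict[str, Any]]:
    results: List[Dict[str, Any]] = []
    for url in urls:
        # Pre-scrape gate (website policy in the real plugin).
        if not is_allowed(url):
            results.append({"url": url, "error": "blocked by website policy"})
            continue
        try:
            # Run the blocking call in a worker thread; wait_for stops
            # awaiting after `timeout` but cannot stop the thread itself.
            payload = await asyncio.wait_for(asyncio.to_thread(scrape, url), timeout=timeout)
        except asyncio.TimeoutError:
            results.append({"url": url, "error": f"timed out after {timeout}s"})
            continue
        except Exception as exc:  # per-URL failures become entries, not raises
            results.append({"url": url, "error": str(exc)})
            continue
        # Post-scrape re-check of the final (possibly redirected) URL.
        if not is_allowed(payload.get("final_url", url)):
            results.append({"url": url, "error": "redirect target blocked"})
            continue
        results.append({"url": url, "content": payload.get("markdown", "")})
    return results
```

One failing or slow URL therefore costs at most one error entry and does not abort the rest of the batch.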
The plugin re-exports `Firecrawl` (the lazy proxy) and
`check_firecrawl_api_key()` so existing tests doing
`patch("tools.web_tools.Firecrawl")` or
`monkeypatch.setattr(web_tools, "check_firecrawl_api_key", ...)` keep
working — tools/web_tools.py re-exports both names in the next
dispatcher-cutover commit.
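
Why the re-export matters for those tests: `patch("tools.web_tools.Firecrawl")` replaces the name bound in the `tools.web_tools` namespace, so that name has to exist there. The sketch below reproduces the situation with two throwaway modules; `example_plugin` and `example_web_tools` are hypothetical stand-ins for the plugin and for tools/web_tools.py.

```python
import sys
import types
from unittest import mock

# Stand-in for the plugin module that now owns the real objects.
plugin = types.ModuleType("example_plugin")
plugin.Firecrawl = object()
plugin.check_firecrawl_api_key = lambda: False

# Stand-in for tools/web_tools.py: it only re-exports the two names.
facade = types.ModuleType("example_web_tools")
facade.Firecrawl = plugin.Firecrawl
facade.check_firecrawl_api_key = plugin.check_firecrawl_api_key

# Register both so mock.patch can resolve them by dotted path.
sys.modules["example_plugin"] = plugin
sys.modules["example_web_tools"] = facade


def patch_replaces_only_the_facade_name() -> bool:
    """Patch the re-exported name and report which bindings changed."""
    with mock.patch("example_web_tools.Firecrawl") as stub:
        return facade.Firecrawl is stub and plugin.Firecrawl is not stub
```

The patch swaps the binding in the re-exporting module only; code that reads the name directly from the plugin module still sees the original object, which is worth keeping in mind when deciding where a test should patch.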
Note: web_crawl_tool still has its own Firecrawl crawl path inline
(separate from extract); the Firecrawl SDK supports /crawl but we don't
expose supports_crawl=True on this plugin yet. Tavily handles crawl
today. Adding Firecrawl crawl is a clean follow-up.
Adds "firecrawl" to _WEB_PLUGIN_SKIPLIST.
E2E verified:
- All 7 providers register: brave-free, ddgs, exa, firecrawl,
parallel, searxng, tavily
- inspect.iscoroutinefunction(firecrawl.extract) -> True
- Firecrawl proxy is a callable lazy proxy at module level
- check_firecrawl_api_key reflects FIRECRAWL_API_KEY presence
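
The `inspect.iscoroutinefunction` check above is the kind of test a caller can use to support sync and async extract implementations side by side. The sketch below shows one way that could work; the provider classes and `run_extract` are hypothetical and do not describe the actual dispatcher.

```python
import asyncio
import inspect
from typing import Any, Dict, List


class SyncProvider:
    """Hypothetical provider with a blocking extract()."""

    def extract(self, urls: List[str]) -> List[Dict[str, Any]]:
        return [{"url": url, "mode": "sync"} for url in urls]


class AsyncProvider:
    """Hypothetical provider with a coroutine extract()."""

    async def extract(self, urls: List[str]) -> List[Dict[str, Any]]:
        return [{"url": url, "mode": "async"} for url in urls]


async def run_extract(provider: Any, urls: List[str]) -> List[Dict[str, Any]]:
    # Bound methods of `async def` functions are detected as coroutine functions.
    if inspect.iscoroutinefunction(provider.extract):
        return await provider.extract(urls)
    # Blocking implementations are pushed to a worker thread instead.
    return await asyncio.to_thread(provider.extract, urls)
```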
parent 31fcde876c
commit 143184e943
4 changed files with 603 additions and 1 deletions
28
plugins/web/firecrawl/__init__.py
Normal file
@@ -0,0 +1,28 @@
"""Firecrawl web search + extract plugin — bundled, auto-loaded.

Largest single plugin in this PR. Captures everything the previous
inline implementation in tools/web_tools.py did:

- Lazy import of the firecrawl SDK (~200ms cold-start cost) via a
  callable proxy that defers the actual import to first use.
- Dual client paths: direct (FIRECRAWL_API_KEY / FIRECRAWL_API_URL)
  OR Nous-hosted tool-gateway routing for subscribers, with
  web.use_gateway as the tie-breaker.
- Per-URL scrape loop with 60s timeout, SSRF re-check after redirect,
  website-policy gating, and format-aware content selection.
- Robust response shape normalization across SDK / direct API /
  gateway variants (search returns differ by transport).

The plugin re-exports ``Firecrawl`` (the lazy proxy) and
``check_firecrawl_api_key`` for backward-compatibility with tests and
external code that imports those names from ``tools.web_tools``.
"""

from __future__ import annotations

from plugins.web.firecrawl.provider import FirecrawlWebSearchProvider


def register(ctx) -> None:
    """Register the Firecrawl provider with the plugin context."""
    ctx.register_web_search_provider(FirecrawlWebSearchProvider())