hermes-agent/plugins
kshitijk4poor 21e3a863bb feat(web): firecrawl plugin natively supports crawl; delete legacy inline path
The web-provider migration originally left firecrawl crawl as the only
provider-specific code remaining inline in tools/web_tools.py (~250
lines of Firecrawl-specific crawl orchestration that didn't fit the
plugin's existing surface). This commit closes that gap.

What this adds
--------------
1. plugins/web/firecrawl/provider.py: implement async ``crawl(url, **kwargs)``
   - Accepts the same kwargs as the dispatcher passes to any crawl
     provider (``instructions``, ``depth``, ``limit``); Firecrawl's
     /crawl endpoint ignores ``instructions`` and ``depth`` so we log
     and drop with a clear info message.
   - Wraps the sync SDK ``crawl()`` call in asyncio.to_thread so the
     gateway event loop isn't blocked on a multi-page crawl.
   - Preserves the response-shape normalization across pydantic /
     typed-object / dict variants that the legacy inline code did.
   - Preserves per-page website-policy re-check (catches blocked
     redirects after the SDK returns).
   - Returns the same {"results": [...]} shape so the dispatcher's
     shared LLM-summarization post-processing path works unchanged.
   - Sets supports_crawl() to True so the dispatcher routes through
     the plugin instead of the legacy fallthrough.

2. tools/web_tools.py: delete the entire legacy firecrawl crawl block
   that used to run after "No registered provider supports crawl" —
   ~270 lines including:
   - check_firecrawl_api_key gate + typed error
   - inline SSRF + website-policy seed-URL gate (dispatcher already
     does this)
   - Firecrawl client setup with crawl_params
   - 100+ lines of pydantic/dict/typed-object normalization
   - Per-page LLM-processing loop (kept in the dispatcher's shared
     post-processing path; that's where it always belonged)
   - trimming + base64 image cleanup (still done in the dispatcher's
     shared path)

   Replaced with a single typed-error branch when no crawl-capable
   provider is available: "web_crawl has no available backend. Set
   FIRECRAWL_API_KEY (or FIRECRAWL_API_URL for self-hosted), or set
   TAVILY_API_KEY for Tavily."

Test updates
------------
- tests/tools/test_website_policy.py:
  - test_web_crawl_short_circuits_blocked_url: dispatcher seed-URL
    gate still runs on web_tools.check_website_access (no change to
    that patch), but the firecrawl client lockdown moved to the
    plugin module — patch firecrawl_provider._get_firecrawl_client
    instead of web_tools._get_firecrawl_client. The dispatcher
    short-circuits before the plugin runs, so the test still passes.
  - test_web_crawl_blocks_redirected_final_url: patch the per-page
    policy gate at plugins.web.firecrawl.provider.check_website_access
    (where it now runs) AND on web_tools (where the seed-URL gate
    still runs). Patch firecrawl_provider._get_firecrawl_client for
    the FakeCrawlClient injection. Both checks flow through the same
    fake_check function.
- tests/plugins/web/test_web_search_provider_plugins.py:
  - Update parametrized capability-flag spec: firecrawl supports_crawl
    is now True.
  - Add test_firecrawl_crawl_returns_error_dict_when_unconfigured —
    verifies inspect.iscoroutinefunction(p.crawl) is True and that
    the async crawl returns a per-page error dict (not a raise) when
    FIRECRAWL_API_KEY is missing.

Verified
--------
- 218/218 web tests pass (was 173, +44 plugin tests + 1 new firecrawl
  crawl test from this commit = 218 with the test deduplication).
- Compile-clean (py_compile passes on both files).
- Provider capabilities matrix confirmed end-to-end:
    name        search  extract  crawl   async-extract?  async-crawl?
    firecrawl   True    True     True    True            True
    tavily      True    True     True    False           False
  Both crawl-capable providers exercise the dispatcher's
  inspect.iscoroutinefunction async-or-sync detection.

Net diff
--------
- tools/web_tools.py: -254 lines (legacy inline crawl gone)
- plugins/web/firecrawl/provider.py: +185 lines (crawl method)
- test_website_policy.py: +14/-9 lines (patch locations)
- test_web_search_provider_plugins.py: +22/-1 lines (capability flag
  + new firecrawl crawl test)
- Total: -32 net LoC; tools/web_tools.py is now 1509 lines (was 1763
  before this commit, 2227 before the migration started).
2026-05-13 22:31:28 -07:00
..
context_engine feat(cross-platform): psutil for PID/process management + Windows footgun checker 2026-05-08 14:27:40 -07:00
disk-cleanup feat(cross-platform): psutil for PID/process management + Windows footgun checker 2026-05-08 14:27:40 -07:00
example-dashboard/dashboard fix(dashboard): UI polish — modals, layout, consistency, test fixes 2026-05-12 13:59:22 -04:00
google_meet feat(cross-platform): psutil for PID/process management + Windows footgun checker 2026-05-08 14:27:40 -07:00
hermes-achievements fix(achievements): use canonical X-Hermes-Session-Token header 2026-05-10 19:41:45 -07:00
image_gen fix: send correct resolution param to xAI image generation API 2026-05-09 13:07:46 -07:00
kanban fix(kanban): use localized column label in select-all aria label 2026-05-11 06:44:58 -07:00
memory feat(security): supply-chain advisory checker + lazy-install framework + tiered install fallback (#24220) 2026-05-12 01:02:25 -07:00
model-providers feat(nous): unified client=hermes-client-v<version> tag on every Portal request (#24779) 2026-05-12 20:49:20 -07:00
observability/langfuse feat(plugins): add bundled observability/langfuse plugin 2026-04-28 01:40:59 -07:00
platforms fix(line): use build_source instead of nonexistent create_source 2026-05-12 18:46:54 -07:00
spotify refactor(spotify): convert to built-in bundled plugin under plugins/spotify (#15174) 2026-04-24 07:06:11 -07:00
teams_pipeline feat(teams): add pipeline outbound delivery via existing adapter 2026-05-08 12:00:09 -07:00
video_gen feat(video_gen): unified video_generate tool with pluggable provider backends (#25126) 2026-05-13 16:39:41 -07:00
web feat(web): firecrawl plugin natively supports crawl; delete legacy inline path 2026-05-13 22:31:28 -07:00
__init__.py feat(memory): pluggable memory provider interface with profile isolation, review fixes, and honcho CLI restoration (#4623) 2026-04-02 15:33:51 -07:00