fix(web): preserve firecrawl crawl + website-policy gate after migration

Two regressions discovered by running the full tests/tools/ suite after
the dispatcher cutover, both fixed in this commit:

1. web_crawl_tool incorrectly errored "search-only" for firecrawl
---------------------------------------------------------------------
The cutover treated any provider with supports_crawl()==False as a
search-only backend and returned the typed search-only error. But
firecrawl can crawl via the legacy multi-page-extract path inside
web_crawl_tool — it just doesn't expose supports_crawl on the plugin
(adding native firecrawl crawl is a clean follow-up).

Fix: only emit the search-only error when the provider supports
NEITHER crawl NOR extract (brave-free / ddgs / searxng). When the
provider supports extract but not crawl (firecrawl), fall through to
the legacy firecrawl-via-extract path below.
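
The three-way decision described above can be sketched in isolation. This is
a minimal illustration, not the code in web_crawl_tool: FakeProvider and
resolve_crawl_provider are hypothetical names invented for the example, and
only supports_crawl(), supports_extract() and display_name come from this
commit.

```python
import json
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FakeProvider:
    """Hypothetical stand-in for a registered web provider plugin."""

    display_name: str
    can_crawl: bool
    can_extract: bool

    def supports_crawl(self) -> bool:
        return self.can_crawl

    def supports_extract(self) -> bool:
        return self.can_extract


def resolve_crawl_provider(
    provider: Optional[FakeProvider],
) -> Tuple[Optional[FakeProvider], Optional[str]]:
    """Return (provider, error_json).

    (None, None) means "no native crawl provider; use the legacy path".
    """
    if provider is not None and not provider.supports_crawl():
        if not provider.supports_extract():
            # Search-only backend: neither crawl nor extract is available.
            error = json.dumps(
                {
                    "success": False,
                    "error": (
                        f"{provider.display_name} is a search-only "
                        "backend and cannot crawl URLs."
                    ),
                },
                ensure_ascii=False,
            )
            return None, error
        # Extract-only backend (the firecrawl case): fall through.
        provider = None
    return provider, None
```

With this shape, a search-only provider yields an error payload, an
extract-only provider yields (None, None), and a crawl-capable provider is
returned unchanged.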

2. firecrawl plugin's check_website_access wasn't patchable
---------------------------------------------------------------------
The plugin ran `from tools.website_policy import check_website_access`
INSIDE the extract() function body, so monkeypatching the name on
plugins.web.firecrawl.provider had no effect: the inner import bound
the name as a function-local on every call, and extract() never looked
it up on the module.

Fix: hoist the import to module level. Cheap (website_policy itself
has no heavy deps) and makes the standard
monkeypatch.setattr(firecrawl_provider, "check_website_access", ...)
pattern work.
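
The difference between the two import placements can be reproduced without
the real plugin. The sketch below builds throwaway in-memory modules;
fake_policy, provider_local and provider_module are hypothetical names used
only for the demonstration, and plain setattr stands in for
monkeypatch.setattr.

```python
import sys
import types

# Hypothetical stand-in for tools.website_policy.
policy = types.ModuleType("fake_policy")
policy.check_website_access = lambda url: "real"
sys.modules["fake_policy"] = policy

# Variant 1: import inside the function body (the pre-fix shape).
SRC_LOCAL = '''
def extract(url):
    from fake_policy import check_website_access  # function-local binding
    return check_website_access(url)
'''

# Variant 2: import hoisted to module level (the post-fix shape).
SRC_MODULE = '''
from fake_policy import check_website_access  # module-level binding

def extract(url):
    return check_website_access(url)
'''


def load(name, src):
    """Build a module object from source text without touching disk."""
    mod = types.ModuleType(name)
    exec(src, mod.__dict__)
    return mod


provider_local = load("provider_local", SRC_LOCAL)
provider_module = load("provider_module", SRC_MODULE)


def stub(url):
    return "patched"


# Patch the name on each provider module, as a test would.
setattr(provider_local, "check_website_access", stub)
setattr(provider_module, "check_website_access", stub)

print(provider_local.extract("https://example.com"))   # real
print(provider_module.extract("https://example.com"))  # patched
```

Only the module-level variant sees the stub, because its extract() resolves
the name through the module's globals at call time.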

Test updates (tests/tools/test_website_policy.py — 4 tests):
  - test_web_extract_short_circuits_blocked_url
  - test_web_extract_blocks_redirected_final_url
    Both: patch the gate at plugins.web.firecrawl.provider (where it
    runs after migration) and force the firecrawl plugin to be the
    active extract provider via FIRECRAWL_API_KEY.
  - test_web_crawl_short_circuits_blocked_url
  - test_web_crawl_blocks_redirected_final_url
    Both: unchanged. The dispatcher-level gate in tools/web_tools.py
    (line 1651) still uses the imported `check_website_access` name, and
    the firecrawl-fallthrough path is exercised as before.

Verified: 22/22 tests in tests/tools/test_website_policy.py pass.
kshitijk4poor 2026-05-14 00:34:28 +05:30 committed by Teknium
parent b05253ceed
commit 5e54330e27
3 changed files with 39 additions and 23 deletions

@@ -1618,22 +1618,26 @@ async def web_crawl_tool(
         crawl_provider = _wsp_get_provider(backend) if backend else None
         if crawl_provider is not None and not crawl_provider.supports_crawl():
-            # Configured name IS registered but doesn't support crawl
-            # (search-only providers like brave-free / ddgs / searxng).
-            # Surface a typed error rather than silently switching to a
-            # different crawl backend.
-            return json.dumps(
-                {
-                    "success": False,
-                    "error": (
-                        f"{crawl_provider.display_name} is a search-only "
-                        "backend and cannot crawl URLs. "
-                        "Set FIRECRAWL_API_KEY for crawling, or use "
-                        "web_search instead."
-                    ),
-                },
-                ensure_ascii=False,
-            )
+            # When the configured provider is search-only AND cannot
+            # extract URLs either (brave-free / ddgs / searxng), surface a
+            # typed "search-only" error rather than silently switching to
+            # a different crawl backend. When the provider supports extract
+            # but not crawl (e.g. firecrawl), fall through to the legacy
+            # firecrawl-via-extract path below.
+            if not crawl_provider.supports_extract():
+                return json.dumps(
+                    {
+                        "success": False,
+                        "error": (
+                            f"{crawl_provider.display_name} is a search-only "
+                            "backend and cannot crawl URLs. "
+                            "Set FIRECRAWL_API_KEY for crawling, or use "
+                            "web_search instead."
+                        ),
+                    },
+                    ensure_ascii=False,
+                )
+            crawl_provider = None  # let legacy firecrawl path handle it
         if crawl_provider is None:
             crawl_provider = get_active_crawl_provider()