feat(web): extend ABC with supports_crawl and async-extract semantics

Two ABC additions to cover the surface area of the remaining four
providers (exa, parallel, tavily, firecrawl) which were untouched by the
initial spike:

1. supports_crawl() + crawl() — Tavily natively crawls a seed URL via
   its /crawl endpoint. Exposing supports_crawl=True lets the crawl
   tool's dispatcher route to Tavily when configured, falling back to
   the auxiliary-model summarization path otherwise. Firecrawl could
   add this in a follow-up (the SDK supports it; we just don't surface
   it as a tool today).

2. Async-or-sync extract() — Parallel's SDK is natively async
   (AsyncParallel.beta.extract); Exa and Tavily are sync; Firecrawl is
   sync but called inside asyncio.to_thread() with a 60s timeout. The
   ABC docstring now permits either shape: implementations declare
   their own sync/async signature and the dispatcher uses
   inspect.iscoroutinefunction to detect and await.

Also adds get_active_crawl_provider() to web_search_registry mirroring
the search/extract resolvers, with web.crawl_backend as the explicit
override config key.

No behavior change on its own — these are scaffolds for the four
remaining provider migrations.
This commit is contained in:
kshitijk4poor 2026-05-14 00:08:03 +05:30 committed by Teknium
parent 0a7cbd3342
commit e3f0a88891
2 changed files with 84 additions and 4 deletions

View file

@ -114,7 +114,7 @@ _LEGACY_PREFERENCE = ("brave-free", "firecrawl", "searxng", "ddgs")
def _resolve(configured: Optional[str], *, capability: str) -> Optional[WebSearchProvider]:
"""Resolve the active provider for a capability ("search" | "extract").
"""Resolve the active provider for a capability ("search" | "extract" | "crawl").
Resolution rules (in order):
@ -147,6 +147,8 @@ def _resolve(configured: Optional[str], *, capability: str) -> Optional[WebSearc
return bool(p.supports_search())
if capability == "extract":
return bool(p.supports_extract())
if capability == "crawl":
return bool(p.supports_crawl())
return False
def _is_available_safe(p: WebSearchProvider) -> bool:
@ -218,6 +220,20 @@ def get_active_extract_provider() -> Optional[WebSearchProvider]:
return _resolve(explicit, capability="extract")
def get_active_crawl_provider() -> Optional[WebSearchProvider]:
"""Resolve the currently-active web crawl provider.
Reads ``web.crawl_backend`` (preferred) or ``web.backend`` (shared
fallback) from config.yaml; falls back per the module docstring.
Crawl is a niche capability only Tavily implements it among built-in
providers. Most callers should expect ``None`` and fall back to a
different strategy (e.g. summarize-via-LLM).
"""
explicit = _read_config_key("web", "crawl_backend") or _read_config_key("web", "backend")
return _resolve(explicit, capability="crawl")
def _reset_for_tests() -> None:
"""Clear the registry. **Test-only.**"""
with _lock: