fix(web): align _LEGACY_PREFERENCE with legacy 7-provider order + doc cleanup

Self-review of the plugin migration surfaced one warning and a handful of
doc/dead-code cleanups. None affect production behaviour through the main
dispatcher (which always calls `tools.web_tools._get_backend()` first and
preserves the full 7-provider walk), but direct callers of
`agent.web_search_registry.get_active_*_provider()` previously diverged
from the legacy order and could return `None` for users with credentials
but no explicit `web.backend` config key.

Changes
-------
1. `_LEGACY_PREFERENCE` was shipped as a 4-tuple
   `("brave-free", "firecrawl", "searxng", "ddgs")` while the PR
   description and the legacy `_get_backend()` candidate order both
   call for the 7-tuple
   `(firecrawl, parallel, tavily, exa, searxng, brave-free, ddgs)`.
   Replaced with the 7-tuple. Verified empirically: with TAVILY+EXA keys
   and no config, `get_active_search_provider()` now returns tavily
   (was None); with EXA+PARALLEL it returns parallel (was None); with
   BRAVE+FIRECRAWL it returns firecrawl (was brave-free).

2. `agent/web_search_registry.py` — module docstring, `_resolve` step-3
   docstring, and inline comment all listed the old 4-tuple and claimed
   "brave-free first because it was the shipped default". The legacy
   default is `"firecrawl"`. Rewritten to match the new ordering and
   reference `tools.web_tools._get_backend()` as the source of truth.

3. `agent/web_search_registry.py` — `get_active_crawl_provider`
   docstring said "only Tavily implements it among built-in providers".
   Firecrawl also advertises `supports_crawl=True` after the previous
   commit. Updated to "Tavily and Firecrawl".

4. `plugins/web/tavily/provider.py` — module docstring said "Tavily is
   the only built-in backend that natively crawls". Updated.

5. `agent/web_search_provider.py` — ABC docstring mentioned only
   `search` / `extract` capabilities. Added `crawl` for accuracy.

6. `plugins/web/{firecrawl,parallel,exa}/provider.py` — dead plugin-level
   cache globals (`_firecrawl_client`, `_parallel_client`,
   `_async_parallel_client`, `_exa_client`) were declared but never read
   (all reads/writes go through `_wt.*` per the `extracting-inline-
   helpers-to-plugins` recipe). Removed the dead declarations; the
   reset-for-tests helpers in firecrawl + parallel now clear the
   canonical `_wt._<name>` slots, matching the pattern exa already used.

Tests
-----
218/218 web-targeted tests still pass (no test changes needed). 4910/4910
in `tests/tools/` still green.
This commit is contained in:
kshitijk4poor 2026-05-14 02:02:01 +05:30 committed by Teknium
parent 21e3a863bb
commit 657e6d87cc
6 changed files with 82 additions and 48 deletions

View file

@ -61,13 +61,14 @@ from typing import Any, Dict, List
class WebSearchProvider(abc.ABC): class WebSearchProvider(abc.ABC):
"""Abstract base class for a web search/extract backend. """Abstract base class for a web search/extract/crawl backend.
Subclasses must implement :meth:`is_available` and at least one of Subclasses must implement :meth:`is_available` and at least one of
:meth:`search` / :meth:`extract`. The :meth:`supports_search` and :meth:`search` / :meth:`extract` / :meth:`crawl`. The
:meth:`supports_extract` capability flags let the registry route each :meth:`supports_search` / :meth:`supports_extract` / :meth:`supports_crawl`
tool call to the right provider, and let multi-capability providers capability flags let the registry route each tool call to the right
(SearXNG, Firecrawl, Tavily, ) advertise both. provider, and let multi-capability providers (Firecrawl, Tavily, Exa,
) advertise multiple capabilities from a single class.
""" """
@property @property

View file

@ -11,17 +11,23 @@ Active selection
---------------- ----------------
The active provider is chosen by configuration with this precedence: The active provider is chosen by configuration with this precedence:
1. ``web.search_backend`` (for search) or ``web.extract_backend`` (for extract) 1. ``web.search_backend`` / ``web.extract_backend`` / ``web.crawl_backend``
2. ``web.backend`` (shared fallback) (per-capability override).
3. If exactly one capability-eligible provider is registered, use it. 2. ``web.backend`` (shared fallback).
4. Legacy preference order (``brave-free`` ``firecrawl`` ``searxng`` ``ddgs``) 3. If exactly one capability-eligible provider is registered AND available,
so installs that omitted the config key keep working. use it.
4. Legacy preference order ``firecrawl`` ``parallel`` ``tavily``
``exa`` ``searxng`` ``brave-free`` ``ddgs`` filtered by
availability. Matches the historic ``tools.web_tools._get_backend()``
candidate order so installs that never set a config key keep landing
on the same provider they did before the plugin migration.
5. Otherwise ``None`` the tool surfaces a helpful error pointing at 5. Otherwise ``None`` the tool surfaces a helpful error pointing at
``hermes tools``. ``hermes tools``.
The capability filter (``supports_search`` vs ``supports_extract``) is applied The capability filter (``supports_search`` / ``supports_extract`` /
at every step so a search-only provider (``brave-free``) configured as ``supports_crawl``) is applied at every step so a search-only provider
``web.extract_backend`` correctly falls through. (``brave-free``) configured as ``web.extract_backend`` correctly falls
through to an extract-capable backend.
""" """
from __future__ import annotations from __future__ import annotations
@ -107,10 +113,21 @@ def _read_config_key(*path: str) -> Optional[str]:
return None return None
# Legacy preference order — preserves behaviour for users who set no config # Legacy preference order — preserves behaviour for users who set no
# at all. brave-free first because it was the shipped default after the # ``web.backend`` / ``web.<capability>_backend`` config key at all. Matches
# Brave migration; firecrawl second for back-compat with older configs. # the historic candidate order in :func:`tools.web_tools._get_backend`
_LEGACY_PREFERENCE = ("brave-free", "firecrawl", "searxng", "ddgs") # (paid providers first so existing paid setups don't get downgraded to
# a free tier on upgrade). Filtered by ``is_available()`` at walk time so
# we don't surface a provider the user has no credentials for.
_LEGACY_PREFERENCE = (
"firecrawl",
"parallel",
"tavily",
"exa",
"searxng",
"brave-free",
"ddgs",
)
def _resolve(configured: Optional[str], *, capability: str) -> Optional[WebSearchProvider]: def _resolve(configured: Optional[str], *, capability: str) -> Optional[WebSearchProvider]:
@ -130,10 +147,14 @@ def _resolve(configured: Optional[str], *, capability: str) -> Optional[WebSearc
supports *capability* AND ``is_available()`` reports True, return it. supports *capability* AND ``is_available()`` reports True, return it.
3. **Legacy preference walk, filtered by availability.** Walk the 3. **Legacy preference walk, filtered by availability.** Walk the
:data:`_LEGACY_PREFERENCE` order looking for a provider whose :data:`_LEGACY_PREFERENCE` order (firecrawl parallel tavily
exa searxng brave-free ddgs) looking for a provider whose
``supports_<capability>()`` is True AND whose ``is_available()`` is ``supports_<capability>()`` is True AND whose ``is_available()`` is
True. This is the path that fires when no config key is set pick True. Matches the historic ``tools.web_tools._get_backend()``
the highest-priority backend the user actually has credentials for. candidate order so users with credentials but no explicit config
key keep landing on the same provider as pre-migration. This is
the path that fires when no config key is set pick the
highest-priority backend the user actually has credentials for.
Returns None when no provider is configured AND no available provider Returns None when no provider is configured AND no available provider
matches the legacy preference; the dispatcher then returns a "set up a matches the legacy preference; the dispatcher then returns a "set up a
@ -179,8 +200,8 @@ def _resolve(configured: Optional[str], *, capability: str) -> Optional[WebSearc
# 2. + 3. Fallback path — filter by availability so we don't surface # 2. + 3. Fallback path — filter by availability so we don't surface
# a provider the user has no credentials for. Without this filter, # a provider the user has no credentials for. Without this filter,
# brave-free's slot in the legacy preference order would make it # a registered-but-unconfigured provider could end up "active" on
# the "active" provider on a fresh install with no API keys at all. # a fresh install with no API keys at all.
eligible = [ eligible = [
p for p in snapshot.values() p for p in snapshot.values()
if _capable(p) and _is_available_safe(p) if _capable(p) and _is_available_safe(p)
@ -226,9 +247,10 @@ def get_active_crawl_provider() -> Optional[WebSearchProvider]:
Reads ``web.crawl_backend`` (preferred) or ``web.backend`` (shared Reads ``web.crawl_backend`` (preferred) or ``web.backend`` (shared
fallback) from config.yaml; falls back per the module docstring. fallback) from config.yaml; falls back per the module docstring.
Crawl is a niche capability only Tavily implements it among built-in Crawl is a niche capability among built-in providers only Tavily and
providers. Most callers should expect ``None`` and fall back to a Firecrawl implement it. Callers should expect ``None`` and fall back to
different strategy (e.g. summarize-via-LLM). a different strategy (e.g. summarize-via-LLM) when neither is
configured.
""" """
explicit = _read_config_key("web", "crawl_backend") or _read_config_key("web", "backend") explicit = _read_config_key("web", "crawl_backend") or _read_config_key("web", "backend")
return _resolve(explicit, capability="crawl") return _resolve(explicit, capability="crawl")

View file

@ -32,9 +32,10 @@ from agent.web_search_provider import WebSearchProvider
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
# Module-level cache for the Exa client so we don't reconstruct it per # Module-level note: the canonical ``_exa_client`` cache slot lives on
# call. Matches the legacy `_exa_client` pattern in tools/web_tools.py. # :mod:`tools.web_tools` so tests that do ``tools.web_tools._exa_client =
_exa_client: Any = None # None`` between cases see fresh state. The plugin reads/writes through
# that public module (see :func:`_get_exa_client`).
def _get_exa_client() -> Any: def _get_exa_client() -> Any:

View file

@ -112,9 +112,11 @@ Firecrawl = _FirecrawlProxy()
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
# Client construction (direct vs managed-gateway) # Client construction (direct vs managed-gateway)
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
#
_firecrawl_client: Any = None # The canonical cache slots live on :mod:`tools.web_tools` so tests that do
_firecrawl_client_config: Any = None # ``tools.web_tools._firecrawl_client = None`` between cases see fresh
# state. The plugin reads/writes through that public module — see
# :func:`_get_firecrawl_client` below.
def _get_direct_firecrawl_config() -> Optional[tuple]: def _get_direct_firecrawl_config() -> Optional[tuple]:
@ -257,10 +259,15 @@ def _get_firecrawl_client() -> Any:
def _reset_client_for_tests() -> None: def _reset_client_for_tests() -> None:
"""Drop the cached Firecrawl client so tests can re-instantiate cleanly.""" """Drop the cached Firecrawl client so tests can re-instantiate cleanly.
global _firecrawl_client, _firecrawl_client_config
_firecrawl_client = None Clears the canonical slots on :mod:`tools.web_tools` (where
_firecrawl_client_config = None :func:`_get_firecrawl_client` reads/writes them).
"""
import tools.web_tools as _wt
_wt._firecrawl_client = None
_wt._firecrawl_client_config = None
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------

View file

@ -36,13 +36,11 @@ from agent.web_search_provider import WebSearchProvider
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
# Module-level client caches mirroring the legacy `tools.web_tools._parallel_client` # Module-level note: the canonical cache slots ``_parallel_client`` and
# / `_async_parallel_client` pattern. For tests, the canonical cache lives on # ``_async_parallel_client`` live on :mod:`tools.web_tools` so tests that do
# tools.web_tools so existing setup_method() handlers that reset # ``tools.web_tools._parallel_client = None`` between cases see fresh state.
# ``tools.web_tools._parallel_client = None`` keep working — we read/write # The plugin reads/writes through that public module (see
# the cache via that module rather than these module-level globals. # :func:`_get_sync_client` / :func:`_get_async_client`).
_parallel_client: Any = None
_async_parallel_client: Any = None
def _ensure_parallel_sdk_installed() -> None: def _ensure_parallel_sdk_installed() -> None:
@ -117,10 +115,15 @@ def _get_async_client() -> Any:
def _reset_clients_for_tests() -> None: def _reset_clients_for_tests() -> None:
"""Drop both cached clients so tests can re-instantiate cleanly.""" """Drop both cached clients so tests can re-instantiate cleanly.
global _parallel_client, _async_parallel_client
_parallel_client = None Clears the canonical slots on :mod:`tools.web_tools` (where
_async_parallel_client = None :func:`_get_sync_client` / :func:`_get_async_client` read/write them).
"""
import tools.web_tools as _wt
_wt._parallel_client = None
_wt._async_parallel_client = None
# Backward-compatible aliases for the names that lived in tools.web_tools # Backward-compatible aliases for the names that lived in tools.web_tools

View file

@ -5,8 +5,8 @@ capabilities advertised:
- ``supports_search()`` -> True (Tavily ``/search``) - ``supports_search()`` -> True (Tavily ``/search``)
- ``supports_extract()`` -> True (Tavily ``/extract``) - ``supports_extract()`` -> True (Tavily ``/extract``)
- ``supports_crawl()`` -> True (Tavily ``/crawl``) Tavily is the only - ``supports_crawl()`` -> True (Tavily ``/crawl``) sync HTTP crawl;
built-in backend that natively crawls Firecrawl also advertises ``supports_crawl=True`` (async)
All three are sync the underlying call is ``httpx.post(...)``. The All three are sync the underlying call is ``httpx.post(...)``. The
dispatcher in :func:`tools.web_tools.web_crawl_tool` (which is itself dispatcher in :func:`tools.web_tools.web_crawl_tool` (which is itself