Commit graph

4 commits

Author SHA1 Message Date
kshitijk4poor
657e6d87cc fix(web): align _LEGACY_PREFERENCE with legacy 7-provider order + doc cleanup
Self-review of the plugin migration surfaced one warning and a handful of
doc/dead-code cleanups. None affect production behaviour through the main
dispatcher (which always calls `tools.web_tools._get_backend()` first and
preserves the full 7-provider walk), but direct callers of
`agent.web_search_registry.get_active_*_provider()` previously diverged
from the legacy order and could return `None` for users with credentials
but no explicit `web.backend` config key.

Changes
-------
1. `_LEGACY_PREFERENCE` was shipped as a 4-tuple
   `("brave-free", "firecrawl", "searxng", "ddgs")` while the PR
   description and the legacy `_get_backend()` candidate order both
   call for the 7-tuple
   `(firecrawl, parallel, tavily, exa, searxng, brave-free, ddgs)`.
   Replaced with the 7-tuple. Verified empirically: with TAVILY+EXA keys
   and no config, `get_active_search_provider()` now returns tavily
   (was None); with EXA+PARALLEL it returns parallel (was None); with
   BRAVE+FIRECRAWL it returns firecrawl (was brave-free).

2. `agent/web_search_registry.py` — module docstring, `_resolve` step-3
   docstring, and inline comment all listed the old 4-tuple and claimed
   "brave-free first because it was the shipped default". The legacy
   default is `"firecrawl"`. Rewritten to match the new ordering and
   reference `tools.web_tools._get_backend()` as the source of truth.

3. `agent/web_search_registry.py` — `get_active_crawl_provider`
   docstring said "only Tavily implements it among built-in providers".
   Firecrawl also advertises `supports_crawl=True` after the previous
   commit. Updated to "Tavily and Firecrawl".

4. `plugins/web/tavily/provider.py` — module docstring said "Tavily is
   the only built-in backend that natively crawls". Updated.

5. `agent/web_search_provider.py` — ABC docstring mentioned only
   `search` / `extract` capabilities. Added `crawl` for accuracy.

6. `plugins/web/{firecrawl,parallel,exa}/provider.py` — dead plugin-level
   cache globals (`_firecrawl_client`, `_parallel_client`,
   `_async_parallel_client`, `_exa_client`) were declared but never read
   (all reads/writes go through `_wt.*` per the `extracting-inline-
   helpers-to-plugins` recipe). Removed the dead declarations; the
   reset-for-tests helpers in firecrawl + parallel now clear the
   canonical `_wt._<name>` slots, matching the pattern exa already used.

Tests
-----
218/218 web-targeted tests still pass (no test changes needed). 4910/4910
in `tests/tools/` still green.
2026-05-13 22:31:28 -07:00
kshitijk4poor
e3f0a88891 feat(web): extend ABC with supports_crawl and async-extract semantics
Two ABC additions to cover the surface area of the remaining four
providers (exa, parallel, tavily, firecrawl) which were untouched by the
initial spike:

1. supports_crawl() + crawl() — Tavily natively crawls a seed URL via
   its /crawl endpoint. Exposing supports_crawl=True lets the crawl
   tool's dispatcher route to Tavily when configured, falling back to
   the auxiliary-model summarization path otherwise. Firecrawl could
   add this in a follow-up (the SDK supports it; we just don't surface
   it as a tool today).

2. Async-or-sync extract() — Parallel's SDK is natively async
   (AsyncParallel.beta.extract); Exa and Tavily are sync; Firecrawl is
   sync but called inside asyncio.to_thread() with a 60s timeout. The
   ABC docstring now permits either shape: implementations declare
   their own sync/async signature and the dispatcher uses
   inspect.iscoroutinefunction to detect and await.

Also adds get_active_crawl_provider() to web_search_registry mirroring
the search/extract resolvers, with web.crawl_backend as the explicit
override config key.

No behavior change on its own — these are scaffolds for the four
remaining provider migrations.
2026-05-13 22:31:28 -07:00
kshitijk4poor
0a7cbd3342 fix(plugins): filter resolution by is_available() in web + image_gen registries
Both web_search_registry._resolve() and image_gen_registry.get_active_provider()
walked their registered providers and returned the first one matching the
capability flag — without checking whether that provider was actually
usable. On a fresh install with no credentials at all, this meant
get_active_search_provider() returned `brave-free` (legacy preference
order) even though BRAVE_SEARCH_API_KEY was unset, leading the
dispatcher to surface a "BRAVE_SEARCH_API_KEY is not set" error for a
provider the user never chose. Same bug shape in image_gen for FAL.

Resolution semantics now match tools.web_tools._get_backend():

  1. Explicit config name wins, ignoring is_available() — the dispatcher
     surfaces a precise "X_API_KEY is not set" error rather than silently
     switching backends. Matches user expectation: "I configured X, tell
     me what's wrong with X."
  2. Fallback (no explicit config) walks the legacy preference order
     filtered by is_available() — pick the highest-priority backend the
     user actually has credentials for.

is_available() is wrapped in a try/except so a buggy provider doesn't
brick resolution.

E2E verified:
  - No creds + no config: get_active_search_provider() -> None
  - Explicit brave-free + no key: get_active_search_provider() -> brave-free
    (and .is_available() correctly reports False)

This fix was identified during the spike (#25182 finding #1) and is
fold-in to the same PR rather than a follow-up.
2026-05-13 22:31:28 -07:00
kshitijk4poor
007a630b16 feat(web): add web search provider registry mirroring image_gen pattern 2026-05-13 22:31:28 -07:00