Commit graph

8 commits

Author SHA1 Message Date
kshitijk4poor
143184e943 feat(web): firecrawl plugin — largest migration (search + async extract + dual auth)
Migrates Firecrawl from inline code in tools/web_tools.py to a bundled
plugin at plugins/web/firecrawl/. By line count this is the largest of
the seven provider migrations: the firecrawl path captured most of the
file's vendor-specific complexity.

What moved into the plugin (all previously in tools/web_tools.py):

  Lazy Firecrawl SDK proxy
    - _load_firecrawl_cls() — caches the imported SDK class
    - _FirecrawlProxy + Firecrawl singleton — defers ~200ms of SDK
      imports until first construction or isinstance check.

  Client construction (dual auth)
    - _get_direct_firecrawl_config()  — direct FIRECRAWL_API_KEY/URL path
    - _get_firecrawl_gateway_url()    — managed Nous tool-gateway URL
    - _is_tool_gateway_ready()        — gateway URL + Nous token check
    - _has_direct_firecrawl_config()  — direct config present?
    - _get_firecrawl_client()         — combined client construction
                                        honoring web.use_gateway
    - check_firecrawl_api_key()       — top-level "is firecrawl usable"
    - _firecrawl_backend_help_suffix() — managed-gateway help string
    - _raise_web_backend_configuration_error() — typed misconfig error

  Response shape normalization (vendor-specific)
    - _to_plain_object(), _normalize_result_list() — SDK→dict helpers
    - _extract_web_search_results() — handles SDK/direct/gateway shapes
    - _extract_scrape_payload()     — nested-data unwrap for scrape

  Per-URL extract loop
    - 60s asyncio.wait_for timeout per URL
    - Pre-scrape website-policy gate
    - Post-scrape redirect-aware SSRF re-check
    - Format-aware content selection (markdown / html / auto)
    - Per-URL errors returned as {"error": str} entries, no raises

Extract is declared `async def` — each URL is scraped in
asyncio.to_thread(...). This is the second async-extract plugin after
parallel.
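
A minimal sketch of the per-URL loop described above, assuming a blocking
`scrape(url)` callable (hypothetical; it stands in for the SDK call) and
assuming each result entry also carries its `url`. It is not the plugin's
actual code, only an illustration of combining asyncio.to_thread with a
per-URL asyncio.wait_for timeout and error entries instead of raises:

```python
import asyncio
from typing import Any, Callable

PER_URL_TIMEOUT_S = 60  # the 60s per-URL timeout mentioned in the commit message


async def extract_urls(
    urls: list[str],
    scrape: Callable[[str], dict[str, Any]],  # hypothetical blocking scrape function
    timeout: float = PER_URL_TIMEOUT_S,
) -> list[dict[str, Any]]:
    """Scrape each URL in a worker thread; failures become error entries."""
    results: list[dict[str, Any]] = []
    for url in urls:
        try:
            # Run the blocking call off the event loop, bounded by the timeout.
            page = await asyncio.wait_for(asyncio.to_thread(scrape, url), timeout)
            results.append({"url": url, **page})
        except asyncio.TimeoutError:
            results.append({"url": url, "error": f"timed out after {timeout}s"})
        except Exception as exc:  # per-URL errors are reported, never raised
            results.append({"url": url, "error": str(exc)})
    return results
```

Note that a timeout only stops waiting for the result; the worker thread
itself keeps running until the blocking call returns.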

The plugin exports `Firecrawl` (the lazy proxy) and
`check_firecrawl_api_key()`. tools/web_tools.py re-exports both names in
the next dispatcher-cutover commit, so existing tests doing
`patch("tools.web_tools.Firecrawl")` or
`monkeypatch.setattr(web_tools, "check_firecrawl_api_key", ...)` keep
working.
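
The lazy-proxy idea can be sketched roughly as follows. This is an
illustration under assumptions, not the real _FirecrawlProxy: the class name
`LazyClassProxy` is hypothetical, and `fractions.Fraction` stands in for the
vendor SDK class so the example runs without the SDK installed. The import is
deferred until the first construction or isinstance check:

```python
import importlib
from typing import Any, Optional


class LazyClassProxy:
    """Stands in for a class and imports it only on first use."""

    def __init__(self, module_name: str, class_name: str) -> None:
        self._module_name = module_name
        self._class_name = class_name
        self._cls: Optional[type] = None

    def _load(self) -> type:
        if self._cls is None:  # import once, then reuse the cached class
            module = importlib.import_module(self._module_name)
            self._cls = getattr(module, self._class_name)
        return self._cls

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        # Construction goes through the real class.
        return self._load()(*args, **kwargs)

    def __instancecheck__(self, instance: Any) -> bool:
        # Lets isinstance(obj, proxy) work against the real class.
        return isinstance(instance, self._load())


# Stand-in target: fractions.Fraction plays the role of the SDK class.
Fraction = LazyClassProxy("fractions", "Fraction")
```

Because the proxy is an ordinary module-level name, tests can still replace it
with `patch(...)` without ever triggering the import.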

Note: web_crawl_tool still has its own Firecrawl crawl path inline
(separate from extract); the Firecrawl SDK supports /crawl but we don't
expose supports_crawl=True on this plugin yet. Tavily handles crawl
today. Adding Firecrawl crawl is a clean follow-up.

Adds "firecrawl" to _WEB_PLUGIN_SKIPLIST.

E2E verified:
  - All 7 providers register: brave-free, ddgs, exa, firecrawl,
    parallel, searxng, tavily
  - inspect.iscoroutinefunction(firecrawl.extract) -> True
  - Firecrawl proxy is a callable lazy proxy at module level
  - check_firecrawl_api_key reflects FIRECRAWL_API_KEY presence
2026-05-13 22:31:28 -07:00
kshitijk4poor
31fcde876c feat(web): tavily plugin — first three-capability plugin (search + extract + crawl)
Migrates Tavily from inline _tavily_request() / _normalize_tavily_*
helpers in tools/web_tools.py to a bundled plugin at plugins/web/tavily/.

First plugin in the codebase to advertise supports_crawl=True. Tavily is
unique among built-in backends in offering a native /crawl endpoint that
walks linked pages from a seed URL with optional natural-language
instructions and depth ("basic" or "advanced").

Capabilities:
  - supports_search()  -> True (Tavily /search)
  - supports_extract() -> True (Tavily /extract)
  - supports_crawl()   -> True (Tavily /crawl)
  All sync (httpx.post under the hood).

The crawl method accepts forward-compat kwargs (instructions, depth,
limit) and is gated against unsafe URLs/policy by the dispatcher in
web_crawl_tool — exactly as before.

Behavior preserved:
  - TAVILY_API_KEY required (ValueError → typed error response)
  - TAVILY_BASE_URL env override honored
  - /crawl requires both body auth AND Bearer header — preserved
  - failed_results[] and failed_urls[] response keys mapped to per-URL
    items with error fields rather than raising
  - max_results capped at 20 server-side
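
A rough sketch of the failure mapping described above. Only the
`failed_results` and `failed_urls` keys come from this commit message; the
`results`, `url`, `raw_content`, `content`, and `error` field names and the
fallback error text are assumptions made for illustration:

```python
from typing import Any


def normalize_documents(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten successes and failures into one list of per-URL items."""
    docs: list[dict[str, Any]] = []
    for item in response.get("results", []):  # assumed success shape
        docs.append({"url": item.get("url", ""), "content": item.get("raw_content", "")})
    for failed in response.get("failed_results", []):  # assumed: dicts with url/error
        docs.append({
            "url": failed.get("url", ""),
            "error": failed.get("error", "extraction failed"),
        })
    for url in response.get("failed_urls", []):  # assumed: plain URL strings
        docs.append({"url": url, "error": "extraction failed"})
    return docs
```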

Adds "tavily" to _WEB_PLUGIN_SKIPLIST.

The legacy inline _tavily_request / _normalize_tavily_search_results /
_normalize_tavily_documents / _TAVILY_BASE_URL in tools/web_tools.py are
NOT deleted yet — search/extract dispatch and the entire web_crawl_tool
function still reference them. They go away when those dispatchers are
cut over to the registry.

E2E verified:
  - Tavily registers with all 3 capabilities
  - Provider list now: brave-free, ddgs, exa, parallel, searxng, tavily
2026-05-13 22:31:28 -07:00
kshitijk4poor
4816646109 feat(web): parallel plugin — first async-extract plugin
Migrates Parallel.ai from inline `_parallel_search()` / `_parallel_extract()`
in tools/web_tools.py to a bundled plugin at plugins/web/parallel/.

First plugin in the codebase to expose an async :meth:`extract`:

  - search() is sync — Parallel.beta.search
  - extract() is **async def** — AsyncParallel.beta.extract

The ABC's docstring on supports_extract() already permits sync-or-async;
this commit is the first to exercise the async path. The web_extract_tool
dispatcher (next commit) detects coroutines via
inspect.iscoroutinefunction and awaits accordingly.
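
A minimal sketch of that sync-or-async dispatch, assuming a helper named
`call_extract` (hypothetical). Running sync providers in a worker thread is
also an assumption here; the real dispatcher may simply call them inline:

```python
import asyncio
import inspect
from typing import Any, Callable


async def call_extract(extract: Callable[..., Any], urls: list[str]) -> Any:
    """Await async extract methods; run sync ones without blocking the loop."""
    if inspect.iscoroutinefunction(extract):
        return await extract(urls)
    # Assumption: sync providers are pushed to a thread to keep the loop free.
    return await asyncio.to_thread(extract, urls)
```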

Behavior preserved:
  - PARALLEL_API_KEY required (raises ValueError if missing → surfaced
    as {"success": False, "error": "..."} instead)
  - PARALLEL_SEARCH_MODE env var honored (agentic|fast|one-shot, default
    agentic), validated via _resolve_search_mode()
  - Limit capped at 20 server-side via min(limit, 20)
  - Per-URL failure mode preserved: response.errors[] each become a
    result dict with an "error" field rather than raising
  - Module-level _parallel_client / _async_parallel_client caches kept
    (mirrors legacy singleton pattern)
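
The mode validation and limit cap could look roughly like this. It is a
sketch, not the plugin's _resolve_search_mode(): falling back to the default
on an unrecognized value (rather than raising) and the case-insensitive
comparison are assumptions, as is the `clamp_limit` helper name:

```python
import os
from typing import Mapping, Optional

VALID_SEARCH_MODES = ("agentic", "fast", "one-shot")
DEFAULT_SEARCH_MODE = "agentic"
MAX_RESULTS = 20


def resolve_search_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Read PARALLEL_SEARCH_MODE; fall back to the default if unset or invalid."""
    env = os.environ if env is None else env
    raw = env.get("PARALLEL_SEARCH_MODE", "").strip().lower()
    return raw if raw in VALID_SEARCH_MODES else DEFAULT_SEARCH_MODE


def clamp_limit(limit: int) -> int:
    """Cap the requested result count at 20, as in min(limit, 20)."""
    return min(limit, MAX_RESULTS)
```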

Adds "parallel" to _WEB_PLUGIN_SKIPLIST in hermes_cli/tools_config.py so
the picker doesn't double-list.

The legacy inline _parallel_search, _parallel_extract, _get_parallel_client,
_get_async_parallel_client in tools/web_tools.py are NOT deleted yet — the
dispatcher still calls them. They go away when the dispatcher cuts over.

E2E verified:
  - inspect.iscoroutinefunction(p.search) -> False
  - inspect.iscoroutinefunction(p.extract) -> True
  - extract() returns a coroutine (not a list)
  - 5 providers register correctly (brave-free, ddgs, exa, parallel, searxng)
2026-05-13 22:31:28 -07:00
kshitijk4poor
ec8449e9c6 feat(web): exa plugin — first multi-capability migration (search + extract)
Migrates Exa from the inline `_exa_search()` / `_exa_extract()` helpers in
tools/web_tools.py to a bundled plugin at plugins/web/exa/.

This is the first plugin in this PR to advertise supports_extract=True,
exercising the multi-capability ABC path that the initial three migrations
(brave_free, ddgs, searxng — all search-only) did not cover.

Both Exa methods are sync — the SDK is sync-only. The web_extract_tool
dispatcher in tools/web_tools.py will continue to call them inline until
Task "dispatch-extract-all" cuts it over to the registry.

Behavior preserved bit-for-bit aside from the ABC method-name change:
  - is_configured()  -> is_available()
  - provider_name()  -> name (property)
  - "exa" stays as the registered name
  - Module-level `_exa_client` cache + lazy `from exa_py import Exa`
    preserved at the new location.
  - Errors (ValueError for missing API key, ImportError for missing SDK,
    generic Exception) caught and surfaced as {"success": False, "error": ...}
    instead of raising.
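
The error-surfacing behavior in the last bullet can be sketched as a thin
wrapper. The `safe_search` name, the success shape, and the error message
prefixes are assumptions; only the {"success": False, "error": ...} failure
shape and the three exception classes come from the text above:

```python
from typing import Any, Callable


def safe_search(search: Callable[[str], list], query: str) -> dict[str, Any]:
    """Call a provider's search and convert exceptions into error dicts."""
    try:
        return {"success": True, "results": search(query)}  # assumed success shape
    except ValueError as exc:  # e.g. missing API key
        return {"success": False, "error": str(exc)}
    except ImportError as exc:  # e.g. SDK package not installed
        return {"success": False, "error": f"SDK not installed: {exc}"}
    except Exception as exc:  # anything else from the vendor call
        return {"success": False, "error": f"search failed: {exc}"}
```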

Adds "exa" to _WEB_PLUGIN_SKIPLIST in hermes_cli/tools_config.py so the
hardcoded TOOL_CATEGORIES["web"] row and the plugin-injected row don't
duplicate during the spike. The skip-list goes away in the cleanup phase
along with the hardcoded row.

The legacy inline `_exa_search` / `_exa_extract` / `_get_exa_client` /
`_exa_client` in tools/web_tools.py are NOT deleted yet — the dispatcher
still references them. They go away in the next dispatcher-cutover commit.

E2E verified:
  - Plugin discovers + registers
  - .supports_search/.supports_extract/.supports_crawl = (True, True, False)
  - .get_setup_schema() returns the picker row shape
  - resolve(): explicit exa + EXA_API_KEY -> exa; without key -> exa (registered
    but unavailable, dispatcher surfaces "EXA_API_KEY not set" error)
2026-05-13 22:31:28 -07:00
kshitijk4poor
6b219f5af6 refactor(web): remove legacy in-tree provider modules
Deletes tools/web_providers/{brave_free,ddgs,searxng}.py — the three
providers that moved to plugins/web/ in prior commits. tools/web_tools.py
no longer imports them (registry dispatch as of d8735963f), so removing
them is purely a cleanup pass.

Also migrates the existing tests to the new import paths:
  tests/tools/test_web_providers_brave_free.py
  tests/tools/test_web_providers_ddgs.py
  tests/tools/test_web_providers_searxng.py

Mechanical rewrites:
  - `from tools.web_providers.X import YSearchProvider`
      -> `from plugins.web.X.provider import YWebSearchProvider`
  - `.is_configured()` -> `.is_available()`        (legacy method  -> new method)
  - `.provider_name()` -> `.name`                  (legacy method  -> new property)
  - `from tools.web_providers.base import WebSearchProvider`
      -> `from agent.web_search_provider import WebSearchProvider`
      (the subclass-check asserts membership in the new plugin-facing ABC)
  - the `sys.modules` cache-busting entry for "tools.web_providers.ddgs" now
    targets `plugins.web.ddgs.provider` (needed for lazy ddgs imports)

The TestXBackendWiring / TestXSearchOnlyErrors classes (covering
_is_backend_available, _get_backend, check_web_api_key, and the
"search-only" error paths in web_extract/web_crawl) are untouched —
those still test web_tools.py's backend-selection logic, which continues
to recognize the names "brave-free" / "ddgs" / "searxng" even after the
modules behind them moved to plugins.

tools/web_providers/base.py is intentionally NOT deleted by this commit
— it's the parent ABC of the legacy modules and shares its name with
agent/web_search_provider.py::WebSearchProvider. Removing it surfaces the
naming collision (see PR description Finding 0); the real migration PR
deletes it in the same commit that drops the _WEB_PLUGIN_SKIPLIST
guards in hermes_cli/tools_config.py.

Test results:
  bash scripts/run_tests.sh tests/tools/test_web_providers_*.py
  -> 65 passed in 3.41s (all rewritten unit tests + unchanged integration tests)
  bash scripts/run_tests.sh tests/tools/test_web_*.py
  -> 141 passed in 4.70s (full web test set, post-deletion)
2026-05-13 22:31:28 -07:00
kshitijk4poor
0d085d9454 feat(web): searxng plugin (search-only, third migration)
Adds plugins/web/searxng/. SearXNG aggregates results from upstream engines
via its JSON API (/search?format=json) — search-only, no extract capability
(supports_extract() returns False).

E2E verified — registry now has ['brave-free', 'ddgs', 'searxng'].
2026-05-13 22:31:28 -07:00
kshitijk4poor
5c7d098bee feat(web): ddgs plugin (second migration)
Adds plugins/web/ddgs/ following the same plugins/image_gen/ pattern as
brave_free. DuckDuckGo search via the community ddgs package; no API key,
package is an optional dep gated by is_available().
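
One way to gate an optional dependency without importing it is sketched
below. Whether the plugin's is_available() uses importlib.util.find_spec or a
try/except import is not stated here, so treat this as an assumption; the
`package_available` helper name is hypothetical:

```python
import importlib.util


def package_available(package: str) -> bool:
    """Return True if a top-level package can be found, without importing it."""
    return importlib.util.find_spec(package) is not None
```

An is_available() built on this would return package_available("ddgs"), so a
missing package shows the provider as unavailable instead of failing at
import time.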

E2E verified — registry now has ['brave-free', 'ddgs'].
2026-05-13 22:31:28 -07:00
kshitijk4poor
d403cf018c feat(web): brave_free plugin (first migration from tools/web_providers/)
Adds plugins/web/brave_free/ as the first plugin built against the new
WebSearchProvider ABC. Mirrors the plugins/image_gen/openai/ layout exactly:

  plugins/web/brave_free/
    plugin.yaml      kind: backend, provides_web_providers: [brave-free]
    __init__.py      register(ctx) -> ctx.register_web_search_provider(...)
    provider.py      BraveFreeWebSearchProvider(WebSearchProvider)

Behavior preserved: same name ("brave-free" with hyphen), same env var
(BRAVE_SEARCH_API_KEY), same HTTP request shape, same response normalization.
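
A self-contained sketch of how provider.py and __init__.py might fit
together. The `WebSearchProvider` class below is a simplified stand-in for
the real ABC, `FakeContext` is a hypothetical test double for the plugin
context, and passing a provider instance to register_web_search_provider is
an assumption about its signature:

```python
import os
from abc import ABC, abstractmethod


class WebSearchProvider(ABC):
    """Simplified stand-in for the plugin-facing ABC."""

    @property
    @abstractmethod
    def name(self) -> str: ...

    @abstractmethod
    def is_available(self) -> bool: ...

    def supports_search(self) -> bool:
        return True

    def supports_extract(self) -> bool:
        return False

    def supports_crawl(self) -> bool:
        return False


class BraveFreeWebSearchProvider(WebSearchProvider):
    @property
    def name(self) -> str:
        return "brave-free"  # hyphenated, matching the registered name

    def is_available(self) -> bool:
        return bool(os.environ.get("BRAVE_SEARCH_API_KEY"))


class FakeContext:
    """Hypothetical stand-in for the ctx object passed to register()."""

    def __init__(self) -> None:
        self.providers: dict[str, WebSearchProvider] = {}

    def register_web_search_provider(self, provider: WebSearchProvider) -> None:
        self.providers[provider.name] = provider


def register(ctx: FakeContext) -> None:
    ctx.register_web_search_provider(BraveFreeWebSearchProvider())
```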

The legacy tools/web_providers/brave_free.py is left in place — the
dispatcher in tools/web_tools.py still references it. Task 7 cuts over the
dispatcher to the new registry; Task 10 deletes the legacy file.

E2E verified:
  HERMES_PLUGINS_DEBUG=1 python -c "
  from hermes_cli.plugins import _ensure_plugins_discovered
  _ensure_plugins_discovered()
  from agent.web_search_registry import list_providers
  print([p.name for p in list_providers()])
  "
  # -> ['brave-free']
2026-05-13 22:31:28 -07:00