From 94016dd1aa7eac05765bdebf8de0838d76402dc0 Mon Sep 17 00:00:00 2001 From: kshitij <82637225+kshitijk4poor@users.noreply.github.com> Date: Wed, 6 May 2026 10:15:56 -0700 Subject: [PATCH] docs+skill: add searxng-search optional skill and documentation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes the remaining gaps from PR #11562 that weren't covered by the core SearXNG integration landed in #20823. - optional-skills/research/searxng-search/ — installable skill with SKILL.md (curl-based usage, category support, Python example) and searxng.sh helper script for health checks and instance queries - website/docs/user-guide/configuration.md — SearXNG added to the Web Search Backends section (5 backends, backend table, per-capability split config example, correct search-only note) - website/docs/reference/environment-variables.md — SEARXNG_URL row - website/docs/reference/optional-skills-catalog.md — searxng-search entry The core SearXNG code, OPTIONAL_ENV_VARS, hermes tools picker, and tests were already on main via #20823. This commit is purely additive docs + the optional skill scaffold. 
Credits from #11562 salvage: @w4rum — original _searxng_search structure @nathansdev — tools_config.py integration @moyomartin — category support and result formatting @0xMihai — config/env var approach @nicobailon — skill and documentation structure @searxng-fan — error handling patterns @local-first — self-hosted-first philosophy and docs --- .../research/searxng-search/SKILL.md | 211 ++++++++++++++++++ .../searxng-search/scripts/searxng.sh | 22 ++ .../docs/reference/environment-variables.md | 1 + .../docs/reference/optional-skills-catalog.md | 1 + website/docs/user-guide/configuration.md | 15 +- 5 files changed, 246 insertions(+), 4 deletions(-) create mode 100644 optional-skills/research/searxng-search/SKILL.md create mode 100755 optional-skills/research/searxng-search/scripts/searxng.sh diff --git a/optional-skills/research/searxng-search/SKILL.md b/optional-skills/research/searxng-search/SKILL.md new file mode 100644 index 0000000000..c2d170591b --- /dev/null +++ b/optional-skills/research/searxng-search/SKILL.md @@ -0,0 +1,211 @@ +--- +name: searxng-search +description: Free meta-search via SearXNG — aggregates results from 70+ search engines. Self-hosted or use a public instance. No API key needed. Falls back automatically when the web search toolset is unavailable. +version: 1.0.0 +author: hermes-agent +license: MIT +metadata: + hermes: + tags: [search, searxng, meta-search, self-hosted, free, fallback] + related_skills: [duckduckgo-search, domain-intel] + fallback_for_toolsets: [web] +--- + +# SearXNG Search + +Free meta-search using [SearXNG](https://searxng.org/) — a privacy-respecting, self-hosted search aggregator that queries 70+ search engines simultaneously. + +**No API key required** when using a public instance. Can also be self-hosted for full control. Automatically appears as a fallback when the main web search toolset (`FIRECRAWL_API_KEY`) is not configured. 
+ +## Configuration + +SearXNG requires a `SEARXNG_URL` environment variable pointing to your SearXNG instance: + +```bash +# Public instances (no setup required) +SEARXNG_URL=https://searxng.example.com + +# Self-hosted SearXNG +SEARXNG_URL=http://localhost:8888 +``` + +If no instance is configured, this skill is unavailable and the agent falls back to other search options. + +## Detection Flow + +Check what is actually available before choosing an approach: + +```bash +# Check if SEARXNG_URL is set and the instance is reachable
+curl -s --max-time 5 "${SEARXNG_URL}/search?q=test&format=json" | head -c 200 +``` + +Decision tree: +1. If `SEARXNG_URL` is set and the instance responds, use SearXNG +2. If `SEARXNG_URL` is unset or unreachable, fall back to other available search tools +3. If the user wants SearXNG specifically, help them set up an instance or find a public one + +## Method 1: CLI via curl (Preferred) + +Use `curl` via `terminal` to call the SearXNG JSON API. This avoids assuming any particular Python package is installed. + +```bash +# Text search (JSON output) +curl -s --max-time 10 \ + "${SEARXNG_URL}/search?q=python+async+programming&format=json&engines=google,bing" + +# With safe search off +curl -s --max-time 10 \ + "${SEARXNG_URL}/search?q=example&format=json&safesearch=0" + +# Specific categories (general, news, science, etc.)
+curl -s --max-time 10 \ + "${SEARXNG_URL}/search?q=AI+news&format=json&categories=news" +``` + +### Common CLI Flags + +| Flag | Description | Example | +|------|-------------|---------| +| `q` | Query string (URL-encoded) | `q=python+async` | +| `format` | Output format: `json`, `csv`, `rss` | `format=json` | +| `engines` | Comma-separated engine names | `engines=google,bing,duckduckgo` | +| `pageno` | Result page number (default 1) | `pageno=2` | +| `categories` | Filter by category | `categories=news,science` | +| `safesearch` | 0=none, 1=moderate, 2=strict | `safesearch=0` | +| `time_range` | Filter: `day`, `week`, `month`, `year` | `time_range=week` | + +Note: the number of results per page is set by the instance configuration, not by a query parameter; trim results client-side if you need fewer. + +### Parsing JSON Results + +```bash +# Extract titles and URLs from the first 5 JSON results +curl -s --max-time 10 "${SEARXNG_URL}/search?q=fastapi&format=json" \ + | python3 -c " +import json, sys +data = json.load(sys.stdin) +for r in data.get('results', [])[:5]: + print(r.get('title','')) + print(r.get('url','')) + print(r.get('content','')[:200]) + print() +" +``` + +Returns per result: `title`, `url`, `content` (snippet), `engine`, `parsed_url`, `img_src`, `thumbnail`, `author`, `published_date` + +## Method 2: Python API via `requests` + +Use the SearXNG REST API directly from Python with the `requests` library: + +```python +import os, requests + +base_url = os.environ.get("SEARXNG_URL", "") +if not base_url: + raise RuntimeError("SEARXNG_URL is not set") + +query = "fastapi deployment guide" +params = { + "q": query, + "format": "json", + "engines": "google,bing", +} + +resp = requests.get(f"{base_url}/search", params=params, timeout=10) +resp.raise_for_status() +data = resp.json() + +for r in data.get("results", [])[:5]: + print(r["title"]) + print(r["url"]) + print(r.get("content", "")[:200]) + print() +``` + +## Method 3: searxng-data Python Package + +For more structured access, install the `searxng-data` package: + +```bash +pip install searxng-data +``` + +```python +from searxng_data
import engines + +# List available engines +print(engines.list_engines()) +``` + +Note: This package only provides engine metadata, not the search API itself. + +## Self-Hosting SearXNG + +To run your own SearXNG instance: + +```bash +# Using Docker +docker run -d -p 8888:8080 \ + -v $(pwd)/searxng:/etc/searxng \ + searxng/searxng:latest + +# Then set +SEARXNG_URL=http://localhost:8888 +``` + +SearXNG is not distributed on PyPI; for a non-Docker setup, use the official installation scripts described in the SearXNG documentation. + +Public SearXNG instances are available at: +- `https://searxng.example.com` (replace with any public instance; a community-maintained list is at https://searx.space) + +## Workflow: Search then Extract + +SearXNG returns titles, URLs, and snippets — not full page content. To get full page content, search first and then extract the most relevant URL with `web_extract`, browser tools, or `curl`. + +```bash +# Search for relevant pages +curl -s --max-time 10 "${SEARXNG_URL}/search?q=fastapi+deployment&format=json" +# Output: list of results with titles and URLs + +# Then extract the best URL with web_extract +``` + +## Limitations + +- **Instance availability**: If the SearXNG instance is down or unreachable, search fails. Always check `SEARXNG_URL` is set and the instance is reachable. +- **No content extraction**: SearXNG returns snippets, not full page content. Use `web_extract`, browser tools, or `curl` for full articles. +- **Rate limiting**: Some public instances limit requests. Self-hosting avoids this. +- **Engine coverage**: Available engines depend on the SearXNG instance configuration. Some engines may be disabled. +- **Results freshness**: Meta-search aggregates external engines — result freshness depends on those engines.
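The search-then-extract workflow can also be driven from Python using only the standard library. This is an illustrative sketch, not part of the skill: `search_url` and `pick_top_urls` are hypothetical helper names, and the network call only runs when `SEARXNG_URL` is set.

```python
import json
import os
import urllib.parse
import urllib.request


def search_url(base_url: str, query: str, **params: str) -> str:
    """Build a SearXNG JSON-API search URL (handles URL-encoding)."""
    qs = urllib.parse.urlencode({"q": query, "format": "json", **params})
    return f"{base_url.rstrip('/')}/search?{qs}"


def pick_top_urls(data: dict, n: int = 3) -> list:
    """Return up to n result URLs from a SearXNG JSON response."""
    return [r["url"] for r in data.get("results", [])[:n] if r.get("url")]


if __name__ == "__main__":
    base = os.environ.get("SEARXNG_URL", "")
    if base:
        # Search, then hand the top URLs to web_extract, browser tools, or curl.
        with urllib.request.urlopen(search_url(base, "fastapi deployment"),
                                    timeout=10) as resp:
            data = json.load(resp)
        for url in pick_top_urls(data):
            print(url)
```

In the skill itself, the extraction step would use `web_extract` or browser tools rather than plain `urllib`, as described above.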
+ +## Troubleshooting + +| Problem | Likely Cause | What To Do | +|---------|--------------|------------| +| `SEARXNG_URL` not set | No instance configured | Use a public SearXNG instance or set up your own | +| Connection refused | Instance not running or wrong URL | Check the URL is correct and the instance is running | +| Empty results | Instance blocks the query | Try a different instance or self-host | +| Slow responses | Public instance under load | Self-host or use a less-loaded public instance | +| `json` format rejected (often HTTP 403) | Instance does not enable `json` in its `search.formats` setting | Enable `json` in the instance's `settings.yml`, or use an instance that allows it | + +## Pitfalls + +- **Always set `SEARXNG_URL`**: Without it, the skill cannot function. +- **URL-encode queries**: Spaces and special characters must be URL-encoded — use `curl -G --data-urlencode` in shell or `urllib.parse.quote()` in Python. +- **Use `format=json`**: The default format is HTML, which is not machine-readable. Always request JSON explicitly. +- **Set a timeout**: Always use `--max-time` or `timeout=` to avoid hanging on unreachable instances. +- **Self-hosting is best**: Public instances may go down, rate-limit, or block. A self-hosted instance is more reliable. + +## Instance Discovery + +If `SEARXNG_URL` is not set and the user asks about SearXNG, help them either: +1. Find a public SearXNG instance (search for "public searxng instance") +2.
Set up their own with Docker or the official installation scripts + +Public instances are listed at: https://searx.space/ diff --git a/optional-skills/research/searxng-search/scripts/searxng.sh b/optional-skills/research/searxng-search/scripts/searxng.sh new file mode 100755 index 0000000000..12fe792d09 --- /dev/null +++ b/optional-skills/research/searxng-search/scripts/searxng.sh @@ -0,0 +1,22 @@ +#!/bin/bash +# Usage: ./searxng.sh "query" [engines] +# Example: ./searxng.sh "python async" "google,bing" +# Note: results per page are controlled by the instance, not by a query parameter. + +QUERY="${1:-}" +ENGINES="${2:-google,bing}" + +if [ -z "$SEARXNG_URL" ]; then + echo "Error: SEARXNG_URL is not set" + exit 1 +fi + +if [ -z "$QUERY" ]; then + echo "Usage: $0 \"query\" [engines]" + exit 1 +fi + +# -G turns --data-urlencode values into URL query parameters, handling +# spaces and special characters safely. +curl -sG --max-time 10 \ + --data-urlencode "q=${QUERY}" \ + --data-urlencode "format=json" \ + --data-urlencode "engines=${ENGINES}" \ + "${SEARXNG_URL}/search" diff --git a/website/docs/reference/environment-variables.md b/website/docs/reference/environment-variables.md index 05206eb0c9..7aa635bd44 100644 --- a/website/docs/reference/environment-variables.md +++ b/website/docs/reference/environment-variables.md @@ -120,6 +120,7 @@ For native Anthropic auth, Hermes prefers Claude Code's own credential files whe | `FIRECRAWL_API_KEY` | Web scraping and cloud browser ([firecrawl.dev](https://firecrawl.dev/)) | | `FIRECRAWL_API_URL` | Custom Firecrawl API endpoint for self-hosted instances (optional) | | `TAVILY_API_KEY` | Tavily API key for AI-native web search, extract, and crawl ([app.tavily.com](https://app.tavily.com/home)) | +| `SEARXNG_URL` | SearXNG instance URL for free self-hosted web search — no API key required ([searxng.github.io](https://searxng.github.io/searxng/)) | | `TAVILY_BASE_URL` | Override the Tavily API endpoint. Useful for corporate proxies and self-hosted Tavily-compatible search backends. Same pattern as `GROQ_BASE_URL`.
| | `EXA_API_KEY` | Exa API key for AI-native web search and contents ([exa.ai](https://exa.ai/)) | | `BROWSERBASE_API_KEY` | Browser automation ([browserbase.com](https://browserbase.com/)) | diff --git a/website/docs/reference/optional-skills-catalog.md b/website/docs/reference/optional-skills-catalog.md index 9a9188a5b1..cec7454feb 100644 --- a/website/docs/reference/optional-skills-catalog.md +++ b/website/docs/reference/optional-skills-catalog.md @@ -143,6 +143,7 @@ hermes skills uninstall | [**domain-intel**](/docs/user-guide/skills/optional/research/research-domain-intel) | Passive domain reconnaissance using Python stdlib. Subdomain discovery, SSL certificate inspection, WHOIS lookups, DNS records, domain availability checks, and bulk multi-domain analysis. No API keys required. | | [**drug-discovery**](/docs/user-guide/skills/optional/research/research-drug-discovery) | Pharmaceutical research assistant for drug discovery workflows. Search bioactive compounds on ChEMBL, calculate drug-likeness (Lipinski Ro5, QED, TPSA, synthetic accessibility), look up drug-drug interactions via OpenFDA, interpret ADMET... | | [**duckduckgo-search**](/docs/user-guide/skills/optional/research/research-duckduckgo-search) | Free web search via DuckDuckGo — text, news, images, videos. No API key needed. Prefer the `ddgs` CLI when installed; use the Python DDGS library only after verifying that `ddgs` is available in the current runtime. | +| [**searxng-search**](/docs/user-guide/skills/optional/research/research-searxng-search) | Free meta-search via SearXNG — aggregates results from 70+ search engines. Self-hosted or use a public instance. No API key needed. Falls back automatically when the web search toolset is unavailable. | | [**gitnexus-explorer**](/docs/user-guide/skills/optional/research/research-gitnexus-explorer) | Index a codebase with GitNexus and serve an interactive knowledge graph via web UI + Cloudflare tunnel. 
| | [**parallel-cli**](/docs/user-guide/skills/optional/research/research-parallel-cli) | Optional vendor skill for Parallel CLI — agent-native web search, extraction, deep research, enrichment, FindAll, and monitoring. Prefer JSON output and non-interactive flows. | | [**qmd**](/docs/user-guide/skills/optional/research/research-qmd) | Search personal knowledge bases, notes, docs, and meeting transcripts locally using qmd — a hybrid retrieval engine with BM25, vector search, and LLM reranking. Supports CLI and MCP integration. | diff --git a/website/docs/user-guide/configuration.md b/website/docs/user-guide/configuration.md index 07f5ba0eed..3977c3c252 100644 --- a/website/docs/user-guide/configuration.md +++ b/website/docs/user-guide/configuration.md @@ -1425,23 +1425,30 @@ Environment scrubbing (strips `*_API_KEY`, `*_TOKEN`, `*_SECRET`, `*_PASSWORD`, ## Web Search Backends -The `web_search`, `web_extract`, and `web_crawl` tools support four backend providers. Configure the backend in `config.yaml` or via `hermes tools`: +The `web_search`, `web_extract`, and `web_crawl` tools support five backend providers. Configure the backend in `config.yaml` or via `hermes tools`: ```yaml web: - backend: firecrawl # firecrawl | parallel | tavily | exa + backend: firecrawl # firecrawl | searxng | parallel | tavily | exa + + # Or use per-capability keys to mix providers (e.g. free search + paid extract): + search_backend: "searxng" + extract_backend: "firecrawl" ``` | Backend | Env Var | Search | Extract | Crawl | |---------|---------|--------|---------|-------| | **Firecrawl** (default) | `FIRECRAWL_API_KEY` | ✔ | ✔ | ✔ | +| **SearXNG** | `SEARXNG_URL` | ✔ | — | — | | **Parallel** | `PARALLEL_API_KEY` | ✔ | ✔ | — | | **Tavily** | `TAVILY_API_KEY` | ✔ | ✔ | ✔ | | **Exa** | `EXA_API_KEY` | ✔ | ✔ | — | -**Backend selection:** If `web.backend` is not set, the backend is auto-detected from available API keys. If only `EXA_API_KEY` is set, Exa is used. 
If only `TAVILY_API_KEY` is set, Tavily is used. If only `PARALLEL_API_KEY` is set, Parallel is used. Otherwise Firecrawl is the default. +**Backend selection:** If `web.backend` is not set, the backend is auto-detected from available API keys. If only `SEARXNG_URL` is set, SearXNG is used. If only `EXA_API_KEY` is set, Exa is used. If only `TAVILY_API_KEY` is set, Tavily is used. If only `PARALLEL_API_KEY` is set, Parallel is used. Otherwise Firecrawl is the default. -**Self-hosted Firecrawl:** Set `FIRECRAWL_API_URL` to point at your own instance. When a custom URL is set, the API key becomes optional (set `USE_DB_AUTHENTICATION=false` on the server to disable auth). +**SearXNG** is a free, self-hosted, privacy-respecting metasearch engine that queries 70+ search engines. No API key needed — just set `SEARXNG_URL` to your instance (e.g., `http://localhost:8080`). SearXNG is search-only; `web_extract` and `web_crawl` require a separate extract provider (set `web.extract_backend`). + +**Self-hosted Firecrawl:** Set `FIRECRAWL_API_URL` to point at your own instance. When a custom URL is set, the API key becomes optional (set `USE_DB_AUTHENTICATION=false` on the server to disable auth). + +**Parallel search modes:** Set `PARALLEL_SEARCH_MODE` to control search behavior — `fast`, `one-shot`, or `agentic` (default: `agentic`).
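The auto-detection rule above (used when `web.backend` is not set) can be sketched in Python. This is an illustration of the documented behavior, not Hermes's actual code; `detect_web_backend` is a hypothetical name.

```python
def detect_web_backend(env: dict) -> str:
    """Pick a web backend from env vars: a provider is chosen only when
    exactly one provider credential is set; otherwise Firecrawl is the
    default, per the backend-selection rule documented above."""
    candidates = {
        "SEARXNG_URL": "searxng",
        "EXA_API_KEY": "exa",
        "TAVILY_API_KEY": "tavily",
        "PARALLEL_API_KEY": "parallel",
    }
    present = [backend for var, backend in candidates.items() if env.get(var)]
    return present[0] if len(present) == 1 else "firecrawl"
```

With multiple keys set (or none), this falls through to the Firecrawl default, which is why `web.backend` or the per-capability keys are the reliable way to pin a provider.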