fix(skills): pull full ClawHub catalog into the skills index (200 → 20k+) (#33748)

* fix(skills): pull full ClawHub catalog into the skills index

The website was showing 200 ClawHub skills out of 20k+ because `ClawHubSource.search("")` sent empty queries straight to a single unpaginated request. ClawHub's API caps any single page at 200 items and returns a `nextCursor`; we grabbed page 1 and stopped, so the cached index served from hermes-agent.nousresearch.com was silently truncated by 99%.

End users never hit clawhub.ai directly (the index is rebuilt twice daily by .github/workflows/skills-index.yml and served as static JSON on the docs site), so the cap-and-cache architecture is correct; it just wasn't being filled.

Changes:
- `ClawHubSource.search(query="")` now routes through the existing `_load_catalog_index()` paginating walker instead of the unpaginated listing fallback (non-empty queries still hit the fast catalog search).
- `_load_catalog_index()` max_pages 50 → 250 (50k-skill ceiling; live catalog is ~20k as of May 2026, with headroom for growth).
- `build_skills_index.py`: per-source crawl limits split out. ClawHub and LobeHub get 100k, others keep their effective caps.
- `EXPECTED_FLOORS["clawhub"]` 50 → 5000 so the next pagination regression hard-fails the CI build instead of silently shipping a degenerate index.

Test plan:
- New unit test `test_search_empty_query_paginates_full_catalog` exercises the cursor-following path with three mocked pages (450 total items) and asserts all pages are walked.
- Existing 9 ClawHub tests + 127 broader skills_hub tests all pass.
- E2E against live ClawHub API: walker reached 9700+ skills across 49 pages before this commit landed, paginating well past the previous 200-skill truncation.

* fix(skills): raise ClawHub ceilings (live catalog is 50k, not 20k)

E2E walk against the live ClawHub API hit my initial 250-page cap at 49,698 skills with a `nextCursor` still pending. The catalog is roughly 2.5x larger than the docstring estimate.

- max_pages 250 → 750 (150k ceiling; walks terminate on cursor=None well before this in practice)
- SOURCE_LIMITS['clawhub'] 100k → 200k
- EXPECTED_FLOORS['clawhub'] 5000 → 20000
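
For readers unfamiliar with the cursor-following pattern described above, here is a minimal sketch of a paginating walker with a page ceiling. It is not the actual `_load_catalog_index()` implementation: `walk_catalog`, `make_fake_fetch`, and the `items` response field are hypothetical names chosen for illustration; only the 200-item page cap, the `nextCursor` field, and the `max_pages` ceiling come from the message above.

```python
from typing import Callable, Optional

PAGE_SIZE = 200  # per-page cap stated in the commit message


def walk_catalog(fetch_page: Callable[[Optional[str]], dict], max_pages: int = 750):
    """Follow nextCursor until it is exhausted or max_pages is reached.

    Returns (items, complete). complete is False when the ceiling was hit
    while a cursor was still pending, i.e. the result is truncated.
    """
    items = []
    cursor = None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        items.extend(page.get("items", []))  # "items" is an assumed field name
        cursor = page.get("nextCursor")
        if not cursor:
            return items, True  # no further pages
    return items, False  # ceiling reached with a cursor still pending


def make_fake_fetch(total: int, page_size: int = PAGE_SIZE):
    """Hypothetical stand-in for the HTTP call; the cursor is just an offset."""
    def fetch(cursor: Optional[str]) -> dict:
        start = int(cursor) if cursor else 0
        end = min(start + page_size, total)
        return {
            "items": [{"slug": f"skill-{i}"} for i in range(start, end)],
            "nextCursor": str(end) if end < total else None,
        }
    return fetch


if __name__ == "__main__":
    # Three pages (200 + 200 + 50), mirroring the shape of the unit test.
    items, complete = walk_catalog(make_fake_fetch(450))
    print(len(items), complete)

    # The original bug in miniature: stop after page 1 and lose the rest.
    items, complete = walk_catalog(make_fake_fetch(450), max_pages=1)
    print(len(items), complete)
```

Returning a completeness flag alongside the items is one way to make a hit ceiling visible to the caller rather than silent; whether the real walker reports this is not stated in the commit.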
2026-07-13 14:02:16 +00:00 · 2026-05-28 01:42:19 -07:00 · 2026-05-28 01:42:19 -07:00 · fb9f3a4ef9
commit fb9f3a4ef9
parent 09a5cd8084
3 changed files with 93 additions and 5 deletions
--- a/scripts/build_skills_index.py
+++ b/scripts/build_skills_index.py
@@ -269,11 +269,28 @@ def main():
    # Crawl skills.sh
    all_skills.extend(crawl_skills_sh(skills_sh_source))

-    # Crawl other sources in parallel
+    # Crawl other sources in parallel.
+    # Per-source soft caps — sources stop returning when they run out, so these
+    # are ceilings, not targets.  ClawHub's catalog is far larger than the old
+    # flat limit of 500, so its ceiling sits well above the current catalog
+    # size and the full catalog lands in the index instead of being truncated.
+    SOURCE_LIMITS = {
+        # ClawHub had 49,698+ skills as of May 2026; 200k leaves headroom.
+        "clawhub": 200_000,
+        "lobehub": 100_000,
+        "browse-sh": 5_000,
+        "claude-marketplace": 5_000,
+        "github": 5_000,
+        "well-known": 5_000,
+        "official": 5_000,
+    }
+    DEFAULT_SOURCE_LIMIT = 500
+
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {}
        for name, source in sources.items():
-            futures[pool.submit(crawl_source, source, name, 500)] = name
+            limit = SOURCE_LIMITS.get(name, DEFAULT_SOURCE_LIMIT)
+            futures[pool.submit(crawl_source, source, name, limit)] = name
        for future in as_completed(futures):
            try:
                all_skills.extend(future.result())
@@ -330,7 +347,11 @@ def main():
    EXPECTED_FLOORS = {
        "skills.sh": 100,
        "lobehub": 100,
-        "clawhub": 50,
+        # ClawHub had 49,698+ skills as of May 2026 — anything under 20k means
+        # pagination broke or the API surface changed.  Fail loudly rather
+        # than ship a degenerate index (we shipped 200/50000 silently for
+        # weeks because the floor was 50).
+        "clawhub": 20000,
        "official": 50,
        "github": 30,        # collapsed across all GitHub taps
        "browse-sh": 50,