fix(skills): pull full ClawHub catalog into the skills index (200 → 20k+) (#33748)

* fix(skills): pull full ClawHub catalog into the skills index

The website was showing 200 ClawHub skills out of 20k+ because
`ClawHubSource.search("")` for empty queries went straight to a single
unpaginated request. ClawHub's API caps any single page at 200 items and
returns a `nextCursor`; we grabbed page 1 and stopped, so the cached
index served from hermes-agent.nousresearch.com had a silent 99%
truncation.
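
The truncation described above can be reproduced with a minimal sketch. `fetch_page`, `walk_catalog`, and the in-memory `FAKE_CATALOG` are hypothetical stand-ins, not the actual `ClawHubSource` code; only the 200-item page cap and the `nextCursor` field come from the description.

```python
from typing import Any, Callable, Dict, List, Optional

PAGE_CAP = 200  # per-page cap described above

# Hypothetical in-memory stand-in for the remote catalog (not the real API).
FAKE_CATALOG = [f"skill-{i}" for i in range(450)]


def fetch_page(cursor: Optional[str] = None, limit: int = PAGE_CAP) -> Dict[str, Any]:
    """Return one page; the fake server clamps `limit` to PAGE_CAP."""
    start = int(cursor) if cursor else 0
    end = min(start + min(limit, PAGE_CAP), len(FAKE_CATALOG))
    next_cursor = str(end) if end < len(FAKE_CATALOG) else None
    return {"items": FAKE_CATALOG[start:end], "nextCursor": next_cursor}


def walk_catalog(fetch: Callable[..., Dict[str, Any]], max_pages: int) -> List[str]:
    """Follow `nextCursor` until it is None or the safety cap is reached."""
    items: List[str] = []
    cursor: Optional[str] = None
    for _ in range(max_pages):
        page = fetch(cursor=cursor)
        items.extend(page["items"])
        cursor = page.get("nextCursor")
        if not cursor:
            break
    return items


# Old behaviour: one request, so a large `limit` still yields a single page.
single_page = fetch_page(limit=100_000)["items"]
# New behaviour: keep requesting pages until the cursor runs out.
full_walk = walk_catalog(fetch_page, max_pages=250)
print(len(single_page), len(full_walk))
```

In this sketch the single request returns 200 of the 450 fake entries, while the walker returns all of them.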

End users never hit clawhub.ai directly (the index is rebuilt twice
daily by .github/workflows/skills-index.yml and served as a static JSON
on the docs site), so the cap-and-cache architecture is correct — it
just wasn't being filled.

Changes:
- `ClawHubSource.search(query="")` now routes through the existing
  `_load_catalog_index()` paginating walker instead of the unpaginated
  listing fallback (non-empty queries still hit the fast catalog search).
- `_load_catalog_index()` max_pages 50 → 250 (50k-skill ceiling; live
  catalog is ~20k as of May 2026, with headroom for growth).
- `build_skills_index.py`: per-source crawl limits split out — ClawHub
  and LobeHub get 100k, others keep their effective caps.
- `EXPECTED_FLOORS["clawhub"]` 50 → 5000 so the next pagination
  regression hard-fails the CI build instead of silently shipping a
  degenerate index.
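
The floor check in the last item could look roughly like the sketch below. `check_floors` and the shape of `counts` are assumptions made for illustration, and the real `build_skills_index.py` may be structured differently; only the `EXPECTED_FLOORS["clawhub"]` value of 5000 is taken from this change.

```python
from typing import Dict, List

# Only the "clawhub" floor is taken from the change description.
EXPECTED_FLOORS: Dict[str, int] = {"clawhub": 5000}


def check_floors(counts: Dict[str, int], floors: Dict[str, int]) -> List[str]:
    """Return one message per source whose crawl fell below its floor."""
    failures = []
    for source, floor in floors.items():
        found = counts.get(source, 0)
        if found < floor:
            failures.append(f"{source}: {found} skills, expected at least {floor}")
    return failures


# A 200-item crawl (the old single-page behaviour) trips the new floor.
problems = check_floors({"clawhub": 200}, EXPECTED_FLOORS)
for line in problems:
    print("FLOOR VIOLATION:", line)
# A CI script would typically exit non-zero when `problems` is non-empty.
```

With the previous floor of 50, the same 200-item crawl would have passed, which is why the truncation shipped unnoticed.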

Test plan:
- New unit test `test_search_empty_query_paginates_full_catalog`
  exercises the cursor-following path with three mocked pages (450
  total items) and asserts all pages are walked.
- Existing 9 ClawHub tests + 127 broader skills_hub tests all pass.
- E2E against the live ClawHub API: the walker reached 9,700+ skills
  across 49 pages before this commit landed, paginating well past the
  previous single-page (200-item) truncation.
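
A rough sketch of what a three-page cursor test can look like, using `unittest.mock`. The `walk` helper, the `items` and `slug` field names, the cursor values, and the 200/200/50 split are assumptions for illustration; the actual test and walker may differ.

```python
from typing import Any, Dict, List, Optional
from unittest.mock import Mock


def walk(get_page, max_pages: int = 250) -> List[Dict[str, Any]]:
    """Hypothetical walker: follows `nextCursor` and dedupes by slug."""
    seen, results = set(), []
    cursor: Optional[str] = None
    for _ in range(max_pages):
        params: Dict[str, Any] = {"limit": 200}
        if cursor:
            params["cursor"] = cursor
        page = get_page(params)
        for item in page["items"]:
            if item["slug"] not in seen:
                seen.add(item["slug"])
                results.append(item)
        cursor = page.get("nextCursor")
        if not cursor:
            break
    return results


def make_page(start: int, count: int, next_cursor: Optional[str]) -> Dict[str, Any]:
    """Build one fake API page of `count` items starting at index `start`."""
    items = [{"slug": f"skill-{i}"} for i in range(start, start + count)]
    return {"items": items, "nextCursor": next_cursor}


# Three mocked pages: 200 + 200 + 50 = 450 items.
get_page = Mock(side_effect=[
    make_page(0, 200, "c1"),
    make_page(200, 200, "c2"),
    make_page(400, 50, None),
])

results = walk(get_page)
assert len(results) == 450
assert get_page.call_count == 3
# The second and third requests carry the cursor from the previous page.
assert get_page.call_args_list[1].args[0]["cursor"] == "c1"
assert get_page.call_args_list[2].args[0]["cursor"] == "c2"
```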

* fix(skills): raise ClawHub ceilings — live catalog is 50k, not 20k

An E2E walk against the live ClawHub API hit my initial 250-page cap at
49,698 skills with a `nextCursor` still pending. The catalog is at least
2.5x larger than the ~20k docstring estimate.

- max_pages 250 → 750 (150k ceiling, walks terminate on cursor=None
  well before this in practice)
- SOURCE_LIMITS['clawhub'] 100k → 200k
- EXPECTED_FLOORS['clawhub'] 5000 → 20000
Teknium 2026-05-28 01:42:19 -07:00 committed by GitHub
parent 09a5cd8084
commit fb9f3a4ef9
GPG key ID: B5690EEEBB952194
3 changed files with 93 additions and 5 deletions

@@ -1859,8 +1859,18 @@ class ClawHubSource(SkillSource):
             results = self._search_catalog(query, limit=limit)
             if results:
                 return results
+        else:
+            # Empty query: route through the paginating catalog walker so the
+            # full ClawHub catalog (20k+ skills) lands in the index. The
+            # single-request listing path below caps at one page (200 items)
+            # regardless of `limit`, which silently truncates the public
+            # skills index. The catalog walker follows `nextCursor`.
+            catalog = self._load_catalog_index()
+            if catalog:
+                return self._dedupe_results(catalog)[:limit] if limit > 0 else self._dedupe_results(catalog)
 
-        # Empty query or catalog fallback failure: use the lightweight listing API.
+        # Non-empty query catalog miss, or catalog walker failure: fall back to
+        # the lightweight listing API for a best-effort response.
         cache_key = f"clawhub_search_listing_v1_{hashlib.md5(query.encode()).hexdigest()}_{limit}"
         cached = _read_index_cache(cache_key)
         if cached is not None:
@@ -1989,7 +1999,12 @@ class ClawHubSource(SkillSource):
         cursor: Optional[str] = None
         results: List[SkillMeta] = []
         seen: set[str] = set()
-        max_pages = 50
+        # ClawHub has 50k+ skills as of May 2026 (live E2E walked 49,698 with
+        # an active cursor still pending); 750 pages * 200/page = 150k ceiling
+        # leaves room for catalog growth. Walk-to-exhaustion typically
+        # terminates well before this on `nextCursor` going None — the cap is
+        # a safety rail against an infinite-cursor loop.
+        max_pages = 750
         for _ in range(max_pages):
             params: Dict[str, Any] = {"limit": 200}