mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-29 06:31:32 +00:00
fix(skills): pull full ClawHub catalog into the skills index (200 → 20k+) (#33748)
* fix(skills): pull full ClawHub catalog into the skills index
The website was showing 200 ClawHub skills out of 20k+ because
`ClawHubSource.search("")` for empty queries went straight to a single
unpaginated request. ClawHub's API caps any single page at 200 items and
returns a `nextCursor`; we grabbed page 1 and stopped, so the cached
index served from hermes-agent.nousresearch.com had a silent 99%
truncation.
End users never hit clawhub.ai directly (the index is rebuilt twice
daily by .github/workflows/skills-index.yml and served as a static JSON
on the docs site), so the cap-and-cache architecture is correct — it
just wasn't being filled.
Changes:
- `ClawHubSource.search(query="")` now routes through the existing
`_load_catalog_index()` paginating walker instead of the unpaginated
listing fallback (non-empty queries still hit the fast catalog search).
- `_load_catalog_index()` max_pages 50 → 250 (50k-skill ceiling; live
catalog is ~20k as of May 2026, with headroom for growth).
- `build_skills_index.py`: per-source crawl limits split out — ClawHub
and LobeHub get 100k, others keep their effective caps.
- `EXPECTED_FLOORS["clawhub"]` 50 → 5000 so the next pagination
regression hard-fails the CI build instead of silently shipping a
degenerate index.
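The cursor-following walk described in the list above can be sketched as follows. This is a minimal illustration, not the actual `_load_catalog_index()` implementation, which is not shown on this page: `walk_catalog`, `fetch_page`, and the `items` key are hypothetical names. Only `nextCursor`, the 200-item page cap, and `max_pages` come from the commit message.

```python
from typing import Any, Callable, Dict, List, Optional

# Assumed page shape: {"items": [...], "nextCursor": str or None}.
Page = Dict[str, Any]


def walk_catalog(
    fetch_page: Callable[[Optional[str]], Page],
    max_pages: int = 750,  # final ceiling chosen in this commit
) -> List[Any]:
    """Follow nextCursor until it runs out or max_pages is reached."""
    items: List[Any] = []
    cursor: Optional[str] = None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        items.extend(page.get("items") or [])
        cursor = page.get("nextCursor")
        if not cursor:
            break  # last page: the API returned no further cursor
    return items


# Fake three-page catalog (200 + 200 + 50 items) standing in for the API.
FAKE_PAGES = {
    None: {"items": list(range(0, 200)), "nextCursor": "p2"},
    "p2": {"items": list(range(200, 400)), "nextCursor": "p3"},
    "p3": {"items": list(range(400, 450)), "nextCursor": None},
}

all_items = walk_catalog(lambda cursor: FAKE_PAGES[cursor])
print(len(all_items))  # 450
```

Stopping after the first page, which is what the old empty-query path effectively did, corresponds to calling the sketch with `max_pages=1`: it returns 200 items no matter how large the catalog is.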
Test plan:
- New unit test `test_search_empty_query_paginates_full_catalog`
exercises the cursor-following path with three mocked pages (450
total items) and asserts all pages are walked.
- Existing 9 ClawHub tests + 127 broader skills_hub tests all pass.
- E2E against live ClawHub API: walker reached 9700+ skills across 49
pages before this commit landed, paginating well past the previous
50-page cap.
* fix(skills): raise ClawHub ceilings — live catalog is 50k, not 20k
An E2E walk against the live ClawHub API hit my initial 250-page cap at
49,698 skills with a `nextCursor` still pending, so the catalog is roughly
2.5x larger than the docstring estimate of ~20k.
- max_pages 250 → 750 (150k ceiling, walks terminate on cursor=None
well before this in practice)
- SOURCE_LIMITS['clawhub'] 100k → 200k
- EXPECTED_FLOORS['clawhub'] 5000 → 20000
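The ceilings quoted in both commits follow from the 200-item page cap: ceiling = max_pages × 200. A quick check of that arithmetic, where the helper name `skill_ceiling` is illustrative only:

```python
PAGE_SIZE = 200  # per-page cap stated in the commit message


def skill_ceiling(max_pages: int, page_size: int = PAGE_SIZE) -> int:
    """Largest number of skills a walk of max_pages pages can return."""
    return max_pages * page_size


for max_pages in (50, 250, 750):
    print(f"max_pages={max_pages}: ceiling {skill_ceiling(max_pages):,}")
# max_pages=50: ceiling 10,000
# max_pages=250: ceiling 50,000
# max_pages=750: ceiling 150,000
```

A walk that stops at 49,698 skills with a cursor still pending is consistent with the 250-page ceiling of 50,000 being the binding limit rather than the catalog running out.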
parent 09a5cd8084
commit fb9f3a4ef9
3 changed files with 93 additions and 5 deletions
@@ -269,11 +269,28 @@ def main():
     # Crawl skills.sh
     all_skills.extend(crawl_skills_sh(skills_sh_source))
 
-    # Crawl other sources in parallel
+    # Crawl other sources in parallel.
+    # Per-source soft caps — sources stop returning when they run out, so these
+    # are ceilings, not targets. ClawHub has 20k+ skills; bumping to 100k
+    # (well above current catalog size) lets the full catalog land in the
+    # index instead of being truncated at an arbitrary build-time limit.
+    SOURCE_LIMITS = {
+        # ClawHub had 49,698+ skills as of May 2026; 200k leaves headroom.
+        "clawhub": 200_000,
+        "lobehub": 100_000,
+        "browse-sh": 5_000,
+        "claude-marketplace": 5_000,
+        "github": 5_000,
+        "well-known": 5_000,
+        "official": 5_000,
+    }
+    DEFAULT_SOURCE_LIMIT = 500
+
     with ThreadPoolExecutor(max_workers=4) as pool:
         futures = {}
         for name, source in sources.items():
-            futures[pool.submit(crawl_source, source, name, 500)] = name
+            limit = SOURCE_LIMITS.get(name, DEFAULT_SOURCE_LIMIT)
+            futures[pool.submit(crawl_source, source, name, limit)] = name
         for future in as_completed(futures):
             try:
                 all_skills.extend(future.result())
@@ -330,7 +347,11 @@ def main():
     EXPECTED_FLOORS = {
         "skills.sh": 100,
         "lobehub": 100,
-        "clawhub": 50,
+        # ClawHub had 49,698+ skills as of May 2026 — anything under 20k means
+        # pagination broke or the API surface changed. Fail loudly rather
+        # than ship a degenerate index (we shipped 200/50000 silently for
+        # weeks because the floor was 50).
+        "clawhub": 20000,
         "official": 50,
         "github": 30,  # collapsed across all GitHub taps
         "browse-sh": 50,
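
How a floor table like this can turn a truncated crawl into a hard build failure is sketched below. The actual check in `build_skills_index.py` is not part of this hunk, so `check_floors` and the `source` key on each skill entry are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List


def check_floors(skills: Iterable[dict], floors: Dict[str, int]) -> List[str]:
    """Return one message per source whose skill count is below its floor."""
    counts = Counter(skill.get("source") for skill in skills)
    return [
        f"{name}: {counts[name]} skills, expected at least {floor}"
        for name, floor in floors.items()
        if counts[name] < floor
    ]


# A crawl that stopped after one 200-item page trips the 20000 floor.
truncated = [{"source": "clawhub"}] * 200
problems = check_floors(truncated, {"clawhub": 20000})
print(problems)  # ['clawhub: 200 skills, expected at least 20000']
```

A build script could exit non-zero whenever the returned list is non-empty. With the old floor of 50, the same 200-item crawl would have passed.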