The skills.sh source was returning ~858 unique skills from a hardcoded
list of 28 popular keyword searches (each capped at 50 results). The
real catalog is ~20k — exposed via sitemap-skills-{1,2}.xml linked from
the site's sitemap index.
Switch the empty-query path in SkillsShSource.search() to walk the
sitemap instead of scraping the homepage's curated featured strip.
Falls back to the homepage scrape if the sitemap is unreachable.
build_skills_index.crawl_skills_sh() now just calls search("", limit=0)
instead of running 28 keyword searches — same result in one HTTP round
instead of 28.
Also handle a httpx + brotlicffi interaction: the per-skill sitemaps
are ~900 KB brotli-compressed and the cffi backend's streaming decode
chokes on them. Forcing Accept-Encoding to gzip dodges the bug without
requiring a brotli library upgrade.
E2E against live skills.sh: 19,932 unique skills walked in 0.7s.
Tests: 137 pass (+1 new regression test exercising the sitemap path).
Floor for skills.sh raised 100 → 10,000 in EXPECTED_FLOORS so a future
regression hard-fails the build.
* fix(skills): pull full ClawHub catalog into the skills index
The website was showing 200 ClawHub skills out of 20k+ because
`ClawHubSource.search("")` for empty queries went straight to a single
unpaginated request. ClawHub's API caps any single page at 200 items and
returns a `nextCursor`; we grabbed page 1 and stopped, so the cached
index served from hermes-agent.nousresearch.com had a silent 99%
truncation.
End users never hit clawhub.ai directly (the index is rebuilt twice
daily by .github/workflows/skills-index.yml and served as a static JSON
on the docs site), so the cap-and-cache architecture is correct — it
just wasn't being filled.
Changes:
- `ClawHubSource.search(query="")` now routes through the existing
`_load_catalog_index()` paginating walker instead of the unpaginated
listing fallback (non-empty queries still hit the fast catalog search).
- `_load_catalog_index()` max_pages 50 → 250 (50k-skill ceiling; live
catalog is ~20k as of May 2026, with headroom for growth).
- `build_skills_index.py`: per-source crawl limits split out — ClawHub
and LobeHub get 100k, others keep their effective caps.
- `EXPECTED_FLOORS["clawhub"]` 50 → 5000 so the next pagination
regression hard-fails the CI build instead of silently shipping a
degenerate index.
Test plan:
- New unit test `test_search_empty_query_paginates_full_catalog`
exercises the cursor-following path with three mocked pages (450
total items) and asserts all pages are walked.
- Existing 9 ClawHub tests + 127 broader skills_hub tests all pass.
- E2E against live ClawHub API: walker reached 9700+ skills across 49
pages before this commit landed, paginating well past the previous
50-page cap.
* fix(skills): raise ClawHub ceilings — live catalog is 50k, not 20k
E2E walk against live ClawHub API hit my initial 250-page cap at 49,698
skills with cursor=yes still pending. The catalog is roughly 2.5x larger
than the docstring estimate.
- max_pages 250 → 750 (150k ceiling, walks terminate on cursor=None
well before this in practice)
- SOURCE_LIMITS['clawhub'] 100k → 200k
- EXPECTED_FLOORS['clawhub'] 5000 → 20000
Layered safety so the Skills Hub at /docs/skills stays in sync without
silent rot. Three pieces:
1. build_skills_index.py — refuses to ship a degenerate index.
EXPECTED_FLOORS per source (skills.sh ≥100, lobehub ≥100, clawhub ≥50,
official ≥50, github ≥30, browse-sh ≥50) and MIN_TOTAL=1500. Any source
collapsing to zero (the silent OpenAI breakage that hid for weeks) now
fails the workflow loud — broken index never reaches the live site.
2. extract-skills.py + the React page — visible freshness signal.
Sidecar website/src/data/skills-meta.json carries the index's
generated_at timestamp, plus per-source counts. Skills Hub renders a
'Catalog refreshed N hours ago · auto-rebuilt twice daily' line under
the hero copy. If the cron stalls, users see the staleness immediately.
3. .github/workflows/skills-index-freshness.yml — watchdog cron.
Every 4 hours, fetches the live /docs/api/skills-index.json, validates
shape, checks age (>26h is stale), checks the same per-source floors,
and opens (or appends to) a GitHub issue when anything is off. The
issue is title-prefixed [skills-index-watchdog] so subsequent failures
append a comment instead of spamming new issues.
Net effect:
- A silent regression like 'OpenAI tap moved its skills' now fails the
build instead of shipping a quietly broken catalog.
- A stuck cron (like the landingpage breakage that ran red for weeks) now
files an issue within 4 hours.
- Users see how fresh the catalog is on the page itself.
Test plan:
- Local: built skills-meta.json from the live index → 'Catalog refreshed
N minutes ago' rendered correctly in the static HTML.
- Probe logic dry-run against the live index: total=2456, all 6 sources
above floor, age 0.1h — issues=NONE.
- Triggered skills-index.yml manually; both jobs green, deploy-site.yml
dispatch fired.
The Skills Hub page was stuck on a stale Feb 25 snapshot, showing only Built-in
+ Optional + Anthropic + LobeHub. The unified index already has 2078 skills
from skills.sh / ClawHub / LobeHub / GitHub taps / Claude Marketplace, and
BrowseShSource adds another ~330 — none of it was reaching the page.
Changes:
- website/scripts/extract-skills.py: read website/static/api/skills-index.json
(the unified multi-source catalog, rebuilt twice daily) as the canonical
external source. Keep the legacy skills/index-cache/ fallback for offline
builds. Add friendly per-source labels (skills.sh, ClawHub, browse.sh,
OpenAI, HuggingFace, Anthropic, LobeHub, etc.) and per-entry installCmd.
- website/src/pages/skills/index.tsx: add source pills + ordering for the 11
new sources; render installCmd from the index entry.
- website/scripts/prebuild.mjs: when no local skills-index.json exists, fetch
the live one from hermes-agent.nousresearch.com so local 'npm run build'
matches production without burning GitHub API quota.
- scripts/build_skills_index.py: crawl BrowseShSource so browse.sh entries
land in the unified index. Adjust source_order.
- tools/skills_hub.py: GitHubSource.DEFAULT_TAPS — openai/skills moved its
skills into skills/.curated/ and skills/.system/, so add both as explicit
taps (the listing code skips dotted dirs by design). Drop
VoltAgent/awesome-agent-skills (README-only, no SKILL.md files) and
MiniMax-AI/cli (singular skill, not a tap directory). Net effect: github
source jumps from 83 → 143 skills, with OpenAI properly included.
- .github/workflows/deploy-site.yml: build the unified index BEFORE running
extract-skills.py — previous order meant extract-skills always fell back
to the legacy cache. Drop the 'skip if file exists' guard; the file is
gitignored and must be rebuilt every deploy.
- .github/workflows/skills-index.yml: drop the broken 'deploy-with-index'
job (it cp'd 'landingpage/\*' which no longer exists, failing every cron
run since the landingpage move). Replace it with a workflow_dispatch
trigger of deploy-site.yml so the index refresh still reaches production
on schedule.
- website/docs/user-guide/features/skills.md: drop VoltAgent from the
default-taps doc list to match the code.
Before: 695 skills (Built-in 90, Optional 84, Anthropic 16, LobeHub 505).
After: 2168 skills across 9 source pills, including the 1212 skills.sh
entries the user expected to see.
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via `ruff --fix --unsafe-fixes`, 0 remaining.
133 files, +626/-626 (net zero).
Closes the last Python-on-Windows UTF-8 exposure by making every
text-mode open() call explicit about its encoding.
Before: on Windows, bare open(path, 'r') defaults to the system
locale encoding (cp1252 on US-locale installs). That means reading
any config/yaml/markdown/json file with non-ASCII content either
crashes with UnicodeDecodeError or silently mis-decodes bytes.
After: all 89 affected call sites in production code now pass
encoding='utf-8' explicitly. Works identically on every platform
and every locale, no surprise behavior.
Mechanical sweep via:
ruff check --preview --extend-select PLW1514 --unsafe-fixes --fix --exclude 'tests,venv,.venv,node_modules,website,optional-skills, skills,tinker-atropos,plugins' .
All 89 fixes have the same shape: open(x) or open(x, mode) became
open(x, encoding='utf-8') or open(x, mode, encoding='utf-8'). Nothing
else changed. Every modified file still parses and the Windows/sandbox
test suite is still green (85 passed, 14 skipped, 0 failed across
tests/tools/test_code_execution_windows_env.py +
tests/tools/test_code_execution_modes.py + tests/tools/test_env_passthrough.py +
tests/test_hermes_bootstrap.py).
Scope notes:
- tests/ excluded: test fixtures can use locale encoding intentionally
(exercising edge cases). If we want to tighten tests later that's
a separate PR.
- plugins/ excluded: plugin-specific conventions may differ; plugin
authors own their code.
- optional-skills/ and skills/ excluded: skill scripts are user-authored
and we don't want to mass-edit them.
- website/ and tinker-atropos/ excluded: vendored / generated content.
46 files touched, 89 +/- lines (symmetric replacement). No behavior
change on POSIX or on Windows when the file is ASCII; bug fix on
Windows when the file contains non-ASCII.
Add a CI-built skills index served from the docs site. The index is
crawled daily by GitHub Actions, resolves all GitHub paths upfront, and
is cached locally by the client. When the index is available:
- Search uses the cached index (0 GitHub API calls, was 23+)
- Install uses resolved paths from index (6 API calls for file
downloads only, was 31-45 for discovery + downloads)
Total: 68 → 6 GitHub API calls for a typical search + install flow.
Unauthenticated users (60 req/hr) can now search and install without
hitting rate limits.
Components:
- scripts/build_skills_index.py: Crawl all sources (skills.sh, GitHub
taps, official, clawhub, lobehub), batch-resolve GitHub paths via
tree API, output JSON index
- tools/skills_hub.py: HermesIndexSource class — search/fetch/inspect
backed by the index, with lazy GitHubSource for file downloads
- parallel_search_sources() skips external API sources when index is
available (0 GitHub calls for search)
- .github/workflows/skills-index.yml: twice-daily CI build + deploy
- .github/workflows/deploy-site.yml: also builds index during docs deploy
Graceful degradation: when the index is unavailable (first run, network
down, stale), all methods return empty/None and downstream sources
handle the request via direct API as before.