The skills directory was getting disorganized — mlops alone had 40 skills in a flat list, and 12 categories were singletons with just one skill each. Code change: - prompt_builder.py: Support sub-categories in skill scanner. skills/mlops/training/axolotl/SKILL.md now shows as category 'mlops/training' instead of just 'mlops'. Backwards-compatible with existing flat structure. Split mlops (40 skills) into 7 sub-categories: - mlops/training (12): accelerate, axolotl, flash-attention, grpo-rl-training, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, torchtitan, trl-fine-tuning, unsloth - mlops/inference (8): gguf, guidance, instructor, llama-cpp, obliteratus, outlines, tensorrt-llm, vllm - mlops/models (6): audiocraft, clip, llava, segment-anything, stable-diffusion, whisper - mlops/vector-databases (4): chroma, faiss, pinecone, qdrant - mlops/evaluation (5): huggingface-tokenizers, lm-evaluation-harness, nemo-curator, saelens, weights-and-biases - mlops/cloud (2): lambda-labs, modal - mlops/research (1): dspy Merged singleton categories: - gifs → media (gif-search joins youtube-content) - music-creation → media (heartmula, songsee) - diagramming → creative (excalidraw joins ascii-art) - ocr-and-documents → productivity - domain → research (domain-intel) - feeds → research (blogwatcher) - market-data → research (polymarket) Fixed misplaced skills: - mlops/code-review → software-development (not ML-specific) - mlops/ml-paper-writing → research (academic writing) Added DESCRIPTION.md files for all new/updated categories.
4.4 KiB
| name | description | version | author | license | metadata | |||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ocr-and-documents | Extract text from PDFs and scanned documents. Use web_extract for remote URLs, pymupdf for local text-based PDFs, marker-pdf for OCR/scanned docs. For DOCX use python-docx, for PPTX see the powerpoint skill. | 2.3.0 | Hermes Agent | MIT |
|
PDF & Document Extraction
For DOCX: use python-docx (parses actual document structure, far better than OCR).
For PPTX: see the powerpoint skill (uses python-pptx with full slide/notes support).
This skill covers PDFs and scanned documents.
Step 1: Remote URL Available?
If the document has a URL, always try web_extract first:
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])
This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.
Only use local extraction when: the file is local, web_extract fails, or you need batch processing.
Step 2: Choose Local Extractor
| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---|---|---|
| Text-based PDF | ✅ | ✅ |
| Scanned PDF (OCR) | ❌ | ✅ (90+ languages) |
| Tables | ✅ (basic) | ✅ (high accuracy) |
| Equations / LaTeX | ❌ | ✅ |
| Code blocks | ❌ | ✅ |
| Forms | ❌ | ✅ |
| Headers/footers removal | ❌ | ✅ |
| Reading order detection | ❌ | ✅ |
| Images extraction | ✅ (embedded) | ✅ (with context) |
| Images → text (OCR) | ❌ | ✅ |
| EPUB | ✅ | ✅ |
| Markdown output | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| Install size | ~25MB | ~3-5GB (PyTorch + models) |
| Speed | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |
Decision: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.
If the user needs marker capabilities but the system lacks ~5GB free disk:
"This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."
pymupdf (lightweight)
pip install pymupdf pymupdf4llm
Via helper script:
python scripts/extract_pymupdf.py document.pdf # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown # Markdown
python scripts/extract_pymupdf.py document.pdf --tables # Tables
python scripts/extract_pymupdf.py document.pdf --images out/ # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4 # Specific pages
Inline:
python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
print(page.get_text())
"
marker-pdf (high-quality OCR)
# Check disk space first
python scripts/extract_marker.py --check
pip install marker-pdf
Via helper script:
python scripts/extract_marker.py document.pdf # Markdown
python scripts/extract_marker.py document.pdf --json # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/ # Save images
python scripts/extract_marker.py scanned.pdf # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm # LLM-boosted accuracy
CLI (installed with marker-pdf):
marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4 # Batch
Arxiv Papers
# Abstract only (fast)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])
# Full paper
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
# Search
web_search(query="arxiv GRPO reinforcement learning 2026")
Notes
web_extractis always first choice for URLs- pymupdf is the safe default — instant, no models, works everywhere
- marker-pdf is for OCR, scanned docs, equations, complex layouts — install only when needed
- Both helper scripts accept
--helpfor full usage - marker-pdf downloads ~2.5GB of models to
~/.cache/huggingface/on first use - For Word docs:
pip install python-docx(better than OCR — parses actual structure) - For PowerPoint: see the
powerpointskill (uses python-pptx)