mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-14 04:02:26 +00:00
Completes the Windows-gating coverage for the built-in skills/ tree. Every
bundled SKILL.md now carries an explicit platforms: declaration so the
loader (agent.skill_utils.skill_matches_platform) can skip-load skills
that don't fit the current OS.
74 skills declared cross-platform (platforms: [linux, macos, windows]):
Creative (18): ascii-art, ascii-video, architecture-diagram, baoyu-comic,
baoyu-infographic, claude-design, creative-ideation, design-md,
excalidraw, humanizer, manim-video, p5js, pixel-art,
popular-web-designs, pretext, sketch, songwriting-and-ai-music,
touchdesigner-mcp
Autonomous agents: claude-code, codex, hermes-agent, opencode
Data/devops: jupyter-live-kernel, kanban-orchestrator, kanban-worker,
webhook-subscriptions, dogfood, codebase-inspection
GitHub: github-auth, github-code-review, github-issues,
github-pr-workflow, github-repo-management
Media: gif-search, heartmula, songsee, spotify, youtube-content
MCP / email / gaming / notes / smart-home: native-mcp, himalaya,
pokemon-player, obsidian, openhue
mlops (non-broken): weights-and-biases, huggingface-hub, llama-cpp,
outlines, segment-anything-model, dspy, trl-fine-tuning
Productivity: airtable, google-workspace, linear, maps, nano-pdf,
notion, ocr-and-documents, powerpoint
Red-teaming / research: godmode, arxiv, blogwatcher, llm-wiki,
polymarket
Software-dev: debugging-hermes-tui-commands, hermes-agent-skill-authoring,
node-inspect-debugger, plan, requesting-code-review, spike,
subagent-driven-development, systematic-debugging,
test-driven-development, writing-plans
Misc: yuanbao
5 skills gated from Windows (platforms: [linux, macos]):
mlops/inference/vllm (serving-llms-vllm)
vLLM is officially Linux-only; Windows requires WSL.
mlops/training/axolotl
Axolotl's flash-attn + deepspeed + bitsandbytes stack is Linux-first.
mlops/training/unsloth
Requires Triton + xformers + flash-attn — Linux only in practice.
mlops/models/audiocraft (audiocraft-audio-generation)
torchaudio ffmpeg backend + encodec dependencies are Linux-first.
mlops/inference/obliteratus
Research abliteration workflow; relies on Linux-focused pytorch
kernels and MLX — no first-class Windows path.
Same strict-over-lenient policy as the optional-skills sweep: when the
underlying tool's Windows support is rough, missing, or WSL-only, gate the
skill. Easier to un-gate after verified Windows support lands than to leak
partial support that manifests as mid-task failures.
Combined with prior commits in this branch, every bundled SKILL.md
(skills/ + optional-skills/) now has a platforms: declaration.
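
The gating described in this commit can be pictured with a small sketch. It is not the actual `agent.skill_utils.skill_matches_platform` implementation: the function names, the `sys.platform` mapping, and the choice to load skills that declare no platforms are all assumptions made for illustration.

```python
import sys
from typing import Iterable, Optional

# Hypothetical mapping from sys.platform prefixes to the names used in
# the platforms: declarations quoted above.
_PLATFORM_NAMES = {"linux": "linux", "darwin": "macos", "win32": "windows"}


def current_platform_name(sys_platform: str = sys.platform) -> Optional[str]:
    """Translate a sys.platform value into linux / macos / windows."""
    for prefix, name in _PLATFORM_NAMES.items():
        if sys_platform.startswith(prefix):
            return name
    return None  # unknown OS


def matches_platform(declared: Optional[Iterable[str]],
                     sys_platform: str = sys.platform) -> bool:
    """Return True when a skill with this declaration should load."""
    if not declared:
        # Assumption: a skill with no declaration loads everywhere.
        return True
    return current_platform_name(sys_platform) in {p.lower() for p in declared}


# A skill gated from Windows is skipped there but still loads on macOS.
print(matches_platform(["linux", "macos"], "win32"))   # False
print(matches_platform(["linux", "macos"], "darwin"))  # True
```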
172 lines
5.2 KiB
Markdown
---
name: ocr-and-documents
description: "Extract text from PDFs/scans (pymupdf, marker-pdf)."
version: 2.3.0
author: Hermes Agent
license: MIT
platforms: [linux, macos, windows]
metadata:
  hermes:
    tags: [PDF, Documents, Research, Arxiv, Text-Extraction, OCR]
    related_skills: [powerpoint]
---

# PDF & Document Extraction

For DOCX: use `python-docx` (parses actual document structure, far better than OCR).
For PPTX: see the `powerpoint` skill (uses `python-pptx` with full slide/notes support).
This skill covers **PDFs and scanned documents**.

## Step 1: Remote URL Available?

If the document has a URL, **always try `web_extract` first**:

```
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])
```

This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.

Only use local extraction when: the file is local, `web_extract` fails, or you need batch processing.

## Step 2: Choose Local Extractor

| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---------|-----------------|---------------------|
| **Text-based PDF** | ✅ | ✅ |
| **Scanned PDF (OCR)** | ❌ | ✅ (90+ languages) |
| **Tables** | ✅ (basic) | ✅ (high accuracy) |
| **Equations / LaTeX** | ❌ | ✅ |
| **Code blocks** | ❌ | ✅ |
| **Forms** | ❌ | ✅ |
| **Header/footer removal** | ❌ | ✅ |
| **Reading order detection** | ❌ | ✅ |
| **Image extraction** | ✅ (embedded) | ✅ (with context) |
| **Images → text (OCR)** | ❌ | ✅ |
| **EPUB** | ✅ | ✅ |
| **Markdown output** | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| **Install size** | ~25MB | ~3-5GB (PyTorch + models) |
| **Speed** | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |

**Decision**: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.

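
The decision rule and the install sizes in the table can be combined into a small chooser. This is only a sketch: `choose_extractor`, the capability labels, and the `insufficient-disk` return value are hypothetical names, and the 5GB threshold simply takes the upper end of the table's "~3-5GB" estimate.

```python
import shutil

# Capabilities the comparison table marks as marker-pdf only (labels are illustrative).
MARKER_ONLY = {"ocr", "equations", "code_blocks", "forms", "layout"}

# Upper end of the "~3-5GB" install size from the table, in bytes.
MARKER_DISK_BYTES = 5 * 1024**3


def choose_extractor(needs, free_bytes):
    """Return 'pymupdf', 'marker-pdf', or 'insufficient-disk'."""
    if not MARKER_ONLY & set(needs):
        return "pymupdf"            # the lightweight default is enough
    if free_bytes < MARKER_DISK_BYTES:
        return "insufficient-disk"  # caller should offer the alternatives
    return "marker-pdf"


if __name__ == "__main__":
    # Free space on the volume holding the current directory.
    free = shutil.disk_usage(".").free
    print(choose_extractor({"tables"}, free))
    print(choose_extractor({"ocr"}, free))
```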
If the user needs marker capabilities but the system lacks ~5GB free disk:
> "This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."

---

## pymupdf (lightweight)

```bash
pip install pymupdf pymupdf4llm
```

**Via helper script**:
```bash
python scripts/extract_pymupdf.py document.pdf                # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown     # Markdown
python scripts/extract_pymupdf.py document.pdf --tables       # Tables
python scripts/extract_pymupdf.py document.pdf --images out/  # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata     # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4    # Specific pages
```

**Inline**:
```bash
python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
    print(page.get_text())
"
```

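
When it is unclear whether a PDF is text-based or scanned, one option is to run the cheap pymupdf pass first and check how much text comes back. The heuristic below is a sketch, not part of the helper scripts: `looks_scanned` and both thresholds are illustrative assumptions.

```python
def looks_scanned(page_texts, min_chars_per_page=20, max_empty_ratio=0.5):
    """Guess whether a PDF is scanned from its per-page extracted text.

    page_texts: one string per page, e.g. collected with
        [page.get_text() for page in pymupdf.open(path)]
    The thresholds are illustrative assumptions, not tuned values.
    """
    if not page_texts:
        return False  # nothing to judge; leave the decision to the caller
    # Count pages whose text layer is missing or nearly empty.
    empty = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return empty / len(page_texts) > max_empty_ratio


# Two of three pages have no text layer, so this prints True.
pages = ["Quarterly revenue grew across all regions this year.", "", ""]
print(looks_scanned(pages))
```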
---

## marker-pdf (high-quality OCR)

```bash
# Check disk space first
python scripts/extract_marker.py --check

pip install marker-pdf
```

**Via helper script**:
```bash
python scripts/extract_marker.py document.pdf                    # Markdown
python scripts/extract_marker.py document.pdf --json             # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/  # Save images
python scripts/extract_marker.py scanned.pdf                     # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm          # LLM-boosted accuracy
```

**CLI** (installed with marker-pdf):
```bash
marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4    # Batch
```

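
Before starting a long marker-pdf run, it can help to turn the per-page speeds from the comparison table into a rough time budget. The arithmetic below is a sketch that assumes those quoted figures hold for the document at hand; `estimate_marker_seconds` is a hypothetical helper, not part of marker-pdf.

```python
# Per-page speeds quoted in the comparison table (seconds per page).
CPU_SECONDS_PER_PAGE = (1.0, 14.0)  # "~1-14s/page (CPU)"
GPU_SECONDS_PER_PAGE = 0.2          # "~0.2s/page (GPU)"


def estimate_marker_seconds(pages, gpu=False):
    """Return a (low, high) estimate in seconds for converting `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if gpu:
        seconds = pages * GPU_SECONDS_PER_PAGE
        return (seconds, seconds)
    low, high = CPU_SECONDS_PER_PAGE
    return (pages * low, pages * high)


# A 300-page scan on CPU: 300 s to 4200 s.
low, high = estimate_marker_seconds(300)
print(f"CPU: {low / 60:.0f}-{high / 60:.0f} min")  # CPU: 5-70 min
```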
---

## Arxiv Papers

```
# Abstract only (fast)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

# Search
web_search(query="arxiv GRPO reinforcement learning 2026")
```

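
The two URL shapes above differ only in the path segment, so they can be derived from a bare paper ID. A minimal sketch, assuming new-style identifiers such as `2402.03300` (optionally with a version suffix); `arxiv_urls` is a hypothetical helper name.

```python
import re

# Matches new-style arXiv identifiers such as 2402.03300 or 2402.03300v2.
_ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def arxiv_urls(arxiv_id):
    """Return the abstract and PDF URLs for a new-style arXiv identifier."""
    arxiv_id = arxiv_id.strip()
    if not _ARXIV_ID.match(arxiv_id):
        raise ValueError(f"not a new-style arXiv id: {arxiv_id!r}")
    return {
        "abs": f"https://arxiv.org/abs/{arxiv_id}",  # abstract page (fast)
        "pdf": f"https://arxiv.org/pdf/{arxiv_id}",  # full paper
    }


print(arxiv_urls("2402.03300")["pdf"])
```

Either URL can then be passed to `web_extract` as in the examples above.
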
## Split, Merge & Search

pymupdf handles these natively; use `execute_code` or inline Python:

```python
# Split: extract pages 1-5 to a new PDF
import pymupdf

doc = pymupdf.open("report.pdf")
new = pymupdf.open()  # empty in-memory PDF
# Page indices are zero-based and inclusive, so 0-4 covers pages 1-5.
new.insert_pdf(doc, from_page=0, to_page=4)
new.save("pages_1-5.pdf")
```

```python
# Merge multiple PDFs
import pymupdf

result = pymupdf.open()  # empty in-memory PDF
for path in ["a.pdf", "b.pdf", "c.pdf"]:
    # Close each source once its pages have been copied.
    with pymupdf.open(path) as src:
        result.insert_pdf(src)
result.save("merged.pdf")
```

```python
# Search for text across all pages
import pymupdf

doc = pymupdf.open("report.pdf")
for i, page in enumerate(doc):
    results = page.search_for("revenue")  # list of matching rectangles
    if results:
        print(f"Page {i+1}: {len(results)} match(es)")
        print(page.get_text("text"))
```

No extra dependencies needed: pymupdf covers split, merge, search, and text extraction in one package.

---

## Notes

- `web_extract` is always first choice for URLs
- pymupdf is the safe default: instant, no models, works everywhere
- marker-pdf is for OCR, scanned docs, equations, complex layouts; install only when needed
- Both helper scripts accept `--help` for full usage
- marker-pdf downloads ~2.5GB of models to `~/.cache/huggingface/` on first use
- For Word docs: `pip install python-docx` (better than OCR because it parses actual structure)
- For PowerPoint: see the `powerpoint` skill (uses python-pptx)