feat(ocr-and-documents): add OCR and document extraction skills

- Introduced new skills for extracting text from PDFs, scanned documents, and images using OCR and document parsing tools.
- Added detailed documentation for usage and installation of `pymupdf` and `marker-pdf` for local extraction.
- Implemented scripts for text extraction with both lightweight and high-quality options, including support for various document formats.
- Updated web extraction functionality to handle PDF URLs directly, enhancing usability for academic papers and documents.
2026-04-25 00:51:20 +00:00 · 2026-02-26 23:06:08 -08:00
commit 19abbfff96
parent 21cf339a85
5 changed files with 322 additions and 1 deletions
--- a/skills/ocr-and-documents/DESCRIPTION.md
+++ b/skills/ocr-and-documents/DESCRIPTION.md
@@ -0,0 +1,3 @@
+---
+description: Skills for extracting text from PDFs, scanned documents, images, and other file formats using OCR and document parsing tools.
+---
--- a/skills/ocr-and-documents/SKILL.md
+++ b/skills/ocr-and-documents/SKILL.md
@@ -0,0 +1,133 @@
+---
+name: ocr-and-documents
+description: Extract text from PDFs and scanned documents. Use web_extract for remote URLs, pymupdf for local text-based PDFs, marker-pdf for OCR/scanned docs. For DOCX use python-docx, for PPTX see the powerpoint skill.
+version: 2.3.0
+author: Hermes Agent
+license: MIT
+metadata:
+  hermes:
+    tags: [PDF, Documents, Research, Arxiv, Text-Extraction, OCR]
+    related_skills: [powerpoint]
+---
+
+# PDF & Document Extraction
+
+For DOCX: use `python-docx` (parses actual document structure, far better than OCR).
+For PPTX: see the `powerpoint` skill (uses `python-pptx` with full slide/notes support).
+This skill covers **PDFs and scanned documents**.
+
+## Step 1: Remote URL Available?
+
+If the document has a URL, **always try `web_extract` first**:
+
+```
+web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
+web_extract(urls=["https://example.com/report.pdf"])
+```
+
+This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.
+
+Only use local extraction when: the file is local, web_extract fails, or you need batch processing.
+
+## Step 2: Choose Local Extractor
+
+| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
+|---------|-----------------|---------------------|
+| **Text-based PDF** | ✅ | ✅ |
+| **Scanned PDF (OCR)** | ❌ | ✅ (90+ languages) |
+| **Tables** | ✅ (basic) | ✅ (high accuracy) |
+| **Equations / LaTeX** | ❌ | ✅ |
+| **Code blocks** | ❌ | ✅ |
+| **Forms** | ❌ | ✅ |
+| **Headers/footers removal** | ❌ | ✅ |
+| **Reading order detection** | ❌ | ✅ |
+| **Images extraction** | ✅ (embedded) | ✅ (with context) |
+| **Images → text (OCR)** | ❌ | ✅ |
+| **EPUB** | ✅ | ✅ |
+| **Markdown output** | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
+| **Install size** | ~25MB | ~3-5GB (PyTorch + models) |
+| **Speed** | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |
+
+**Decision**: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.
+
+If the user needs marker capabilities but the system lacks ~5GB free disk:
+> "This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."
+
+---
+
+## pymupdf (lightweight)
+
+```bash
+pip install pymupdf pymupdf4llm
+```
+
+**Via helper script**:
+```bash
+python scripts/extract_pymupdf.py document.pdf              # Plain text
+python scripts/extract_pymupdf.py document.pdf --markdown    # Markdown
+python scripts/extract_pymupdf.py document.pdf --tables      # Tables (needs pandas and tabulate)
+python scripts/extract_pymupdf.py document.pdf --images out/ # Extract images
+python scripts/extract_pymupdf.py document.pdf --metadata    # Title, author, pages
+python scripts/extract_pymupdf.py document.pdf --pages 0-4   # Pages 0-4 (0-based, inclusive)
+```
+
+**Inline**:
+```bash
+python3 -c "
+import pymupdf
+doc = pymupdf.open('document.pdf')
+for page in doc:
+    print(page.get_text())
+"
+```
+
+---
+
+## marker-pdf (high-quality OCR)
+
+```bash
+# Check disk space first
+python scripts/extract_marker.py --check
+
+pip install marker-pdf
+```
+
+**Via helper script**:
+```bash
+python scripts/extract_marker.py document.pdf                # Markdown
+python scripts/extract_marker.py document.pdf --json         # JSON with metadata
+python scripts/extract_marker.py document.pdf --output_dir out/  # Save images
+python scripts/extract_marker.py scanned.pdf                 # Scanned PDF (OCR)
+python scripts/extract_marker.py document.pdf --use_llm      # LLM-boosted accuracy
+```
+
+**CLI** (installed with marker-pdf):
+```bash
+marker_single document.pdf --output_dir ./output
+marker /path/to/folder --workers 4    # Batch
+```
+
+---
+
+## Arxiv Papers
+
+```
+# Abstract only (fast)
+web_extract(urls=["https://arxiv.org/abs/2402.03300"])
+
+# Full paper
+web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
+
+# Search
+web_search(query="arxiv GRPO reinforcement learning 2026")
+```
+
+## Notes
+
+- `web_extract` is always the first choice for URLs
+- pymupdf is the safe default — instant, no models, works everywhere
+- marker-pdf is for OCR, scanned docs, equations, complex layouts — install only when needed
+- Both helper scripts accept `--help` for full usage
+- marker-pdf downloads ~2.5GB of models to `~/.cache/huggingface/` on first use
+- For Word docs: `pip install python-docx` (better than OCR — parses actual structure)
+- For PowerPoint: see the `powerpoint` skill (uses python-pptx)
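
Review note on the SKILL.md hunk above: the Step 1 / Step 2 guidance amounts to a small decision procedure, which can be sketched in Python. This is only an illustration, not part of the commit. The function name `choose_extractor`, the `needs_marker_features` flag (standing in for "needs OCR, equations, forms, or complex layout analysis"), and the string return values are all hypothetical; the 5 GB threshold is the approximate figure quoted in SKILL.md.

```python
import shutil
from typing import Optional

# Approximate disk footprint for marker-pdf quoted in SKILL.md (assumption: treated as a hard threshold).
MARKER_DISK_GB = 5.0


def choose_extractor(url: Optional[str],
                     needs_marker_features: bool,
                     free_gb: Optional[float] = None) -> str:
    """Return 'web_extract', 'pymupdf', 'marker-pdf', or 'insufficient-disk'."""
    # Step 1: a remote URL always goes to web_extract first.
    if url:
        return "web_extract"
    # Step 2: pymupdf is the default unless OCR/equations/forms/layout analysis is needed.
    if not needs_marker_features:
        return "pymupdf"
    # marker-pdf is only viable with enough free disk for PyTorch and models.
    if free_gb is None:
        free_gb = shutil.disk_usage("/").free / (1024 ** 3)
    return "marker-pdf" if free_gb >= MARKER_DISK_GB else "insufficient-disk"
```

A result of `"insufficient-disk"` corresponds to the case where the skill asks the user to free up space, provide a URL, or fall back to pymupdf.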
--- a/skills/ocr-and-documents/scripts/extract_marker.py
+++ b/skills/ocr-and-documents/scripts/extract_marker.py
@@ -0,0 +1,87 @@
+#!/usr/bin/env python3
+"""Extract text from documents using marker-pdf. High-quality OCR + layout analysis.
+
+Requires ~3-5GB disk (PyTorch + models downloaded on first use).
+Supports: PDF, DOCX, PPTX, XLSX, HTML, EPUB, images (non-PDF formats need `pip install marker-pdf[full]`).
+
+Usage:
+    python extract_marker.py document.pdf
+    python extract_marker.py document.pdf --output_dir ./output
+    python extract_marker.py presentation.pptx
+    python extract_marker.py spreadsheet.xlsx
+    python extract_marker.py scanned_doc.pdf           # OCR works here
+    python extract_marker.py document.pdf --json        # Structured output
+    python extract_marker.py document.pdf --use_llm     # LLM-boosted accuracy
+"""
+import sys
+import os
+
+def convert(path, output_dir=None, output_format="markdown", use_llm=False):
+    from marker.converters.pdf import PdfConverter
+    from marker.models import create_model_dict
+    from marker.config.parser import ConfigParser
+
+    config_dict = {}
+    if use_llm:
+        config_dict["use_llm"] = True
+
+    config_parser = ConfigParser(config_dict)
+    models = create_model_dict()
+    converter = PdfConverter(config=config_parser.generate_config_dict(), artifact_dict=models)
+    rendered = converter(path)
+
+    if output_format == "json":
+        import json
+        print(json.dumps({
+            "markdown": rendered.markdown,
+            "metadata": rendered.metadata if hasattr(rendered, "metadata") else {},
+        }, indent=2, ensure_ascii=False))
+    else:
+        print(rendered.markdown)
+
+    # Save images if output_dir specified
+    if output_dir and hasattr(rendered, "images") and rendered.images:
+        from pathlib import Path
+        Path(output_dir).mkdir(parents=True, exist_ok=True)
+        for name, img_data in rendered.images.items():
+            img_path = os.path.join(output_dir, name)
+            # marker returns PIL images here, so save through PIL instead of writing raw bytes
+            img_data.save(img_path)
+        print(f"\nSaved {len(rendered.images)} image(s) to {output_dir}/", file=sys.stderr)
+
+
+def check_requirements():
+    """Check disk space before installing."""
+    import shutil
+    free_gb = shutil.disk_usage("/").free / (1024**3)
+    if free_gb < 5:
+        print(f"⚠️  Only {free_gb:.1f}GB free. marker-pdf needs ~5GB for PyTorch + models.")
+        print("Use pymupdf instead (scripts/extract_pymupdf.py) or free up disk space.")
+        sys.exit(1)
+    print(f"✓ {free_gb:.1f}GB free — sufficient for marker-pdf")
+
+
+if __name__ == "__main__":
+    args = sys.argv[1:]
+    if not args or args[0] in ("-h", "--help"):
+        print(__doc__)
+        sys.exit(0)
+
+    if args[0] == "--check":
+        check_requirements()
+        sys.exit(0)
+
+    path = args[0]
+    output_dir = None
+    output_format = "markdown"
+    use_llm = False
+
+    if "--output_dir" in args:
+        idx = args.index("--output_dir")
+        output_dir = args[idx + 1]
+    if "--json" in args:
+        output_format = "json"
+    if "--use_llm" in args:
+        use_llm = True
+
+    convert(path, output_dir=output_dir, output_format=output_format, use_llm=use_llm)
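
Review note on extract_marker.py: the hand-rolled flag handling above raises `IndexError` when `--output_dir` is the last argument and only honours `--help` and `--check` in first position. A minimal sketch of the same interface using `argparse` from the standard library is shown below. It is an alternative for illustration, not the commit's code; `build_parser` is a hypothetical name, and the flag names simply mirror the ones in the script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags that extract_marker.py accepts."""
    parser = argparse.ArgumentParser(description="Extract text with marker-pdf")
    # The path is optional so that `--check` can be used on its own.
    parser.add_argument("path", nargs="?", help="document to convert")
    parser.add_argument("--check", action="store_true",
                        help="check free disk space and exit")
    parser.add_argument("--output_dir", default=None,
                        help="directory for extracted images")
    # --json switches output_format from the default 'markdown' to 'json'.
    parser.add_argument("--json", dest="output_format", action="store_const",
                        const="json", default="markdown",
                        help="emit JSON with markdown and metadata")
    parser.add_argument("--use_llm", action="store_true",
                        help="enable marker's LLM-assisted mode")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["document.pdf", "--json", "--output_dir", "out/"])
    print(args.path, args.output_format, args.output_dir, args.use_llm)
```

With this shape, a missing value after `--output_dir` produces a usage error instead of a traceback, and `--help` works in any position.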
--- a/skills/ocr-and-documents/scripts/extract_pymupdf.py
+++ b/skills/ocr-and-documents/scripts/extract_pymupdf.py
@@ -0,0 +1,98 @@
+#!/usr/bin/env python3
+"""Extract text from documents using pymupdf. Lightweight (~25MB), no models.
+
+Usage:
+    python extract_pymupdf.py document.pdf
+    python extract_pymupdf.py document.pdf --markdown
+    python extract_pymupdf.py document.pdf --pages 0-4
+    python extract_pymupdf.py document.pdf --images output_dir/
+    python extract_pymupdf.py document.pdf --tables
+    python extract_pymupdf.py document.pdf --metadata
+"""
+import sys
+import json
+
+def extract_text(path, pages=None):
+    import pymupdf
+    doc = pymupdf.open(path)
+    page_range = range(len(doc)) if pages is None else pages
+    for i in page_range:
+        if i < len(doc):
+            print(f"\n--- Page {i+1}/{len(doc)} ---\n")
+            print(doc[i].get_text())
+
+def extract_markdown(path, pages=None):
+    import pymupdf4llm
+    md = pymupdf4llm.to_markdown(path, pages=pages)
+    print(md)
+
+def extract_tables(path):
+    import pymupdf
+    doc = pymupdf.open(path)
+    for i, page in enumerate(doc):
+        tables = page.find_tables()
+        for j, table in enumerate(tables.tables):
+            print(f"\n--- Page {i+1}, Table {j+1} ---\n")
+            df = table.to_pandas()
+            print(df.to_markdown(index=False))
+
+def extract_images(path, output_dir):
+    import pymupdf
+    from pathlib import Path
+    Path(output_dir).mkdir(parents=True, exist_ok=True)
+    doc = pymupdf.open(path)
+    count = 0
+    for i, page in enumerate(doc):
+        for img_idx, img in enumerate(page.get_images(full=True)):
+            xref = img[0]
+            pix = pymupdf.Pixmap(doc, xref)
+            if pix.n - pix.alpha > 3:  # CMYK (n=4 without alpha) must become RGB before PNG save
+                pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
+            out_path = f"{output_dir}/page{i+1}_img{img_idx+1}.png"
+            pix.save(out_path)
+            count += 1
+    print(f"Extracted {count} images to {output_dir}/")
+
+def show_metadata(path):
+    import pymupdf
+    doc = pymupdf.open(path)
+    print(json.dumps({
+        "pages": len(doc),
+        "title": doc.metadata.get("title", ""),
+        "author": doc.metadata.get("author", ""),
+        "subject": doc.metadata.get("subject", ""),
+        "creator": doc.metadata.get("creator", ""),
+        "producer": doc.metadata.get("producer", ""),
+        "format": doc.metadata.get("format", ""),
+    }, indent=2))
+
+if __name__ == "__main__":
+    args = sys.argv[1:]
+    if not args or args[0] in ("-h", "--help"):
+        print(__doc__)
+        sys.exit(0)
+
+    path = args[0]
+    pages = None
+
+    if "--pages" in args:
+        idx = args.index("--pages")
+        p = args[idx + 1]
+        if "-" in p:
+            start, end = p.split("-")
+            pages = list(range(int(start), int(end) + 1))
+        else:
+            pages = [int(p)]
+
+    if "--metadata" in args:
+        show_metadata(path)
+    elif "--tables" in args:
+        extract_tables(path)
+    elif "--images" in args:
+        idx = args.index("--images")
+        output_dir = args[idx + 1] if idx + 1 < len(args) else "./images"
+        extract_images(path, output_dir)
+    elif "--markdown" in args:
+        extract_markdown(path, pages=pages)
+    else:
+        extract_text(path, pages=pages)
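
Review note on extract_pymupdf.py: the `--pages` handling above accepts `0-4` or a single index but does not reject reversed or negative ranges, and a spec such as `1-2-3` fails with an unpacking error. A small helper along the following lines could make the behaviour explicit. This is a sketch under the assumption that page specs stay 0-based and inclusive, as in the script; `parse_pages` is a hypothetical name, not something the commit defines.

```python
from typing import List


def parse_pages(spec: str) -> List[int]:
    """Parse '3' or '0-4' (0-based, inclusive) into a list of page indices."""
    # partition() splits on the first '-' only; sep is '' when there is no dash.
    start_text, sep, end_text = spec.partition("-")
    start = int(start_text)  # raises ValueError for '' (e.g. a spec like '-3')
    end = int(end_text) if sep else start
    if start < 0 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    return list(range(start, end + 1))
```

Indices beyond the end of the document are still left to the caller, matching the `if i < len(doc)` guard in `extract_text`.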