hermes-agent/website/docs/user-guide/skills/bundled/research/research-arxiv.md
Teknium 0f6eabb890
docs(website): dedicated page per bundled + optional skill (#14929)
Generates a full dedicated Docusaurus page for every one of the 132 skills
(73 bundled + 59 optional) under website/docs/user-guide/skills/{bundled,optional}/<category>/.
Each page carries the skill's description, metadata (version, author, license,
dependencies, platform gating, tags, related skills cross-linked to their own
pages), and the complete SKILL.md body that Hermes loads at runtime.

Previously the two catalog pages just listed skills with a one-line blurb and
no way to see what the skill actually did — users had to go read the source
repo. Now every skill has a browsable, searchable, cross-linked reference in
the docs.

- website/scripts/generate-skill-docs.py — generator that reads skills/ and
  optional-skills/, writes per-skill pages, regenerates both catalog indexes,
  and rewrites the Skills section of sidebars.ts. Handles MDX escaping
  (outside fenced code blocks: curly braces, unsafe HTML-ish tags) and
  rewrites relative references/*.md links to point at the GitHub source.
- website/docs/reference/skills-catalog.md — regenerated; each row links to
  the new dedicated page.
- website/docs/reference/optional-skills-catalog.md — same.
- website/sidebars.ts — Skills section now has Bundled / Optional subtrees
  with one nested category per skill folder.
- .github/workflows/{docs-site-checks,deploy-site}.yml — run the generator
  before docusaurus build so CI stays in sync with the source SKILL.md files.

Build verified locally with `npx docusaurus build`. Only remaining warnings
are pre-existing broken link/anchor issues in unrelated pages.
2026-04-23 22:22:11 -07:00


title: "Arxiv — Search and retrieve academic papers from arXiv using their free REST API"
sidebar_label: "Arxiv"
description: "Search and retrieve academic papers from arXiv using their free REST API"

{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}

Arxiv

Search and retrieve academic papers from arXiv using their free REST API. No API key needed. Search by keyword, author, category, or ID. Combine with web_extract or the ocr-and-documents skill to read full paper content.

Skill metadata

Source Bundled (installed by default)
Path skills/research/arxiv
Version 1.0.0
Author Hermes Agent
License MIT
Tags Research, Arxiv, Papers, Academic, Science, API
Related skills ocr-and-documents

Reference: full SKILL.md

:::info
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
:::

arXiv Research

Search and retrieve academic papers from arXiv via their free REST API. No API key, no dependencies — just curl.

Quick Reference

| Action | Command |
|--------|---------|
| Search papers | `curl "https://export.arxiv.org/api/query?search_query=all:QUERY&max_results=5"` |
| Get specific paper | `curl "https://export.arxiv.org/api/query?id_list=2402.03300"` |
| Read abstract (web) | `web_extract(urls=["https://arxiv.org/abs/2402.03300"])` |
| Read full paper (PDF) | `web_extract(urls=["https://arxiv.org/pdf/2402.03300"])` |

Searching Papers

The API returns Atom XML. Parse with grep/sed or pipe through python3 for clean output.

curl -s "https://export.arxiv.org/api/query?search_query=all:GRPO+reinforcement+learning&max_results=5"

Clean output (parse XML to readable format)

curl -s "https://export.arxiv.org/api/query?search_query=all:GRPO+reinforcement+learning&max_results=5&sortBy=submittedDate&sortOrder=descending" | python3 -c "
import sys, xml.etree.ElementTree as ET
ns = {'a': 'http://www.w3.org/2005/Atom'}
root = ET.parse(sys.stdin).getroot()
for i, entry in enumerate(root.findall('a:entry', ns)):
    title = entry.find('a:title', ns).text.strip().replace('\n', ' ')
    arxiv_id = entry.find('a:id', ns).text.strip().split('/abs/')[-1]
    published = entry.find('a:published', ns).text[:10]
    authors = ', '.join(a.find('a:name', ns).text for a in entry.findall('a:author', ns))
    summary = entry.find('a:summary', ns).text.strip()[:200]
    cats = ', '.join(c.get('term') for c in entry.findall('a:category', ns))
    print(f'{i+1}. [{arxiv_id}] {title}')
    print(f'   Authors: {authors}')
    print(f'   Published: {published} | Categories: {cats}')
    print(f'   Abstract: {summary}...')
    print(f'   PDF: https://arxiv.org/pdf/{arxiv_id}')
    print()
"

Search Query Syntax

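| Prefix | Searches | Example |
|--------|----------|---------|
| `all:` | All fields | `all:transformer+attention` |
| `ti:` | Title | `ti:large+language+models` |
| `au:` | Author | `au:vaswani` |
| `abs:` | Abstract | `abs:reinforcement+learning` |
| `cat:` | Category | `cat:cs.AI` |
| `co:` | Comment | `co:accepted+NeurIPS` |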

Boolean operators

# AND (default when using +)
search_query=all:transformer+attention

# OR
search_query=all:GPT+OR+all:BERT

# AND NOT
search_query=all:language+model+ANDNOT+all:vision

# Exact phrase (inside a double-quoted curl URL, encode the quotes as %22)
search_query=ti:%22chain+of+thought%22

# Combined
search_query=au:hinton+AND+cat:cs.LG
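
The prefixes and operators above combine into a single search_query value, and hand-encoding spaces and quotes is easy to get wrong. A minimal sketch, assuming Python's standard urllib.parse, that builds the encoded URL from a readable query string. The helper name build_search_url is hypothetical and is not part of the skill's scripts.

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"


def build_search_url(query: str, max_results: int = 5, start: int = 0) -> str:
    """Return an encoded arXiv API search URL for a human-readable query.

    Hypothetical helper for illustration; not part of the skill itself.
    """
    params = {"search_query": query, "start": start, "max_results": max_results}
    # urlencode uses quote_plus: spaces become '+', double quotes become %22,
    # and ':' becomes %3A, which the server decodes back to ':'.
    return f"{ARXIV_API}?{urlencode(params)}"


# Exact-phrase title search combined with a category filter.
print(build_search_url('ti:"chain of thought" AND cat:cs.CL'))
```

Writing the query with plain spaces and quotes and letting the library encode it avoids shell-quoting problems when the URL is later passed to curl.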

Sort and Pagination

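| Parameter | Options |
|-----------|---------|
| `sortBy` | `relevance`, `lastUpdatedDate`, `submittedDate` |
| `sortOrder` | `ascending`, `descending` |
| `start` | Result offset (0-based) |
| `max_results` | Number of results (default 10; up to 30000 in total, fetched in slices of at most 2000 per request) |
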
# Latest 10 papers in cs.AI
curl -s "https://export.arxiv.org/api/query?search_query=cat:cs.AI&sortBy=submittedDate&sortOrder=descending&max_results=10"
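
Larger result sets are fetched by advancing start in steps of max_results. A minimal sketch of generating the page URLs, assuming newest-first ordering. The function name page_urls and the page size of 100 are illustrative choices, not something the skill prescribes.

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"


def page_urls(query, total, page_size=100):
    """Yield one query URL per page until `total` results are covered.

    Hypothetical helper: it only builds URLs, it performs no requests.
    """
    for start in range(0, total, page_size):
        params = {
            "search_query": query,
            "sortBy": "submittedDate",
            "sortOrder": "descending",
            "start": start,
            # The last page may be shorter than page_size.
            "max_results": min(page_size, total - start),
        }
        yield f"{ARXIV_API}?{urlencode(params)}"


# 250 results in pages of 100 gives offsets 0, 100 and 200.
for url in page_urls("cat:cs.AI", total=250):
    print(url)
```

When actually fetching these URLs, pause between requests as described under Rate Limits.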

Fetching Specific Papers

# By arXiv ID
curl -s "https://export.arxiv.org/api/query?id_list=2402.03300"

# Multiple papers
curl -s "https://export.arxiv.org/api/query?id_list=2402.03300,2401.12345,2403.00001"

BibTeX Generation

After fetching metadata for a paper, generate a BibTeX entry:

{% raw %}

curl -s "https://export.arxiv.org/api/query?id_list=1706.03762" | python3 -c "
import sys, xml.etree.ElementTree as ET
ns = {'a': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}
root = ET.parse(sys.stdin).getroot()
entry = root.find('a:entry', ns)
if entry is None: sys.exit('Paper not found')
title = entry.find('a:title', ns).text.strip().replace('\n', ' ')
authors = ' and '.join(a.find('a:name', ns).text for a in entry.findall('a:author', ns))
year = entry.find('a:published', ns).text[:4]
raw_id = entry.find('a:id', ns).text.strip().split('/abs/')[-1]
cat = entry.find('arxiv:primary_category', ns)
primary = cat.get('term') if cat is not None else 'cs.LG'
last_name = entry.find('a:author', ns).find('a:name', ns).text.split()[-1]
print(f'@article{{{last_name}{year}_{raw_id.replace(\".\", \"\")},')
print(f'  title     = {{{title}}},')
print(f'  author    = {{{authors}}},')
print(f'  year      = {{{year}}},')
print(f'  eprint    = {{{raw_id}}},')
print(f'  archivePrefix = {{arXiv}},')
print(f'  primaryClass  = {{{primary}}},')
print(f'  url       = {{https://arxiv.org/abs/{raw_id}}}')
print('}')
"

{% endraw %}

Reading Paper Content

After finding a paper, read it:

# Abstract page (fast, metadata + abstract)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper (PDF → markdown via Firecrawl)
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

For local PDF processing, see the ocr-and-documents skill.

Common Categories

| Category | Field |
|----------|-------|
| `cs.AI` | Artificial Intelligence |
| `cs.CL` | Computation and Language (NLP) |
| `cs.CV` | Computer Vision |
| `cs.LG` | Machine Learning |
| `cs.CR` | Cryptography and Security |
| `stat.ML` | Machine Learning (Statistics) |
| `math.OC` | Optimization and Control |
| `physics.comp-ph` | Computational Physics |

Full list: https://arxiv.org/category_taxonomy

Helper Script

The scripts/search_arxiv.py script handles XML parsing and provides clean output:

python scripts/search_arxiv.py "GRPO reinforcement learning"
python scripts/search_arxiv.py "transformer attention" --max 10 --sort date
python scripts/search_arxiv.py --author "Yann LeCun" --max 5
python scripts/search_arxiv.py --category cs.AI --sort date
python scripts/search_arxiv.py --id 2402.03300
python scripts/search_arxiv.py --id 2402.03300,2401.12345

No dependencies — uses only Python stdlib.


Semantic Scholar

arXiv doesn't provide citation data or recommendations. Use the Semantic Scholar API for that: it is free, needs no key for basic use (1 req/sec), and returns JSON.

Get paper details + citations

# By arXiv ID
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300?fields=title,authors,citationCount,referenceCount,influentialCitationCount,year,abstract" | python3 -m json.tool

# By Semantic Scholar paper ID or DOI
curl -s "https://api.semanticscholar.org/graph/v1/paper/DOI:10.1234/example?fields=title,citationCount"

Get citations OF a paper (who cited it)

curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300/citations?fields=title,authors,year,citationCount&limit=10" | python3 -m json.tool

Get references FROM a paper (what it cites)

curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300/references?fields=title,authors,year,citationCount&limit=10" | python3 -m json.tool

Search papers (alternative to arXiv search, returns JSON)

curl -s "https://api.semanticscholar.org/graph/v1/paper/search?query=GRPO+reinforcement+learning&limit=5&fields=title,authors,year,citationCount,externalIds" | python3 -m json.tool

Get paper recommendations

curl -s -X POST "https://api.semanticscholar.org/recommendations/v1/papers/" \
  -H "Content-Type: application/json" \
  -d '{"positivePaperIds": ["arXiv:2402.03300"], "negativePaperIds": []}' | python3 -m json.tool

Author profile

curl -s "https://api.semanticscholar.org/graph/v1/author/search?query=Yann+LeCun&fields=name,hIndex,citationCount,paperCount" | python3 -m json.tool

Useful Semantic Scholar fields

title, authors, year, abstract, citationCount, referenceCount, influentialCitationCount, isOpenAccess, openAccessPdf, fieldsOfStudy, publicationVenue, externalIds (contains arXiv ID, DOI, etc.)
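
The endpoints above share one URL shape: a paper identified by its arXiv ID, an optional citations or references suffix, and a comma-separated fields list. A minimal sketch of assembling those URLs, using only the endpoint paths shown in this section. The helper name s2_paper_url is hypothetical.

```python
from urllib.parse import urlencode

S2_GRAPH = "https://api.semanticscholar.org/graph/v1"


def s2_paper_url(arxiv_id, fields, relation=None, limit=None):
    """Build a Graph API URL for a paper or for its citations/references.

    Hypothetical helper for illustration; it does not send the request.
    """
    path = f"{S2_GRAPH}/paper/arXiv:{arxiv_id}"
    if relation is not None:
        if relation not in ("citations", "references"):
            raise ValueError(f"unsupported relation: {relation!r}")
        path += f"/{relation}"
    params = {"fields": ",".join(fields)}
    if limit is not None:
        params["limit"] = limit
    # safe=',' keeps the commas in the fields list readable.
    return f"{path}?{urlencode(params, safe=',')}"


print(s2_paper_url("2402.03300", ["title", "citationCount"]))
print(s2_paper_url("2402.03300", ["title", "year"], relation="citations", limit=10))
```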


Complete Research Workflow

  1. Discover: python scripts/search_arxiv.py "your topic" --sort date --max 10
  2. Assess impact: curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:ID?fields=citationCount,influentialCitationCount"
  3. Read abstract: web_extract(urls=["https://arxiv.org/abs/ID"])
  4. Read full paper: web_extract(urls=["https://arxiv.org/pdf/ID"])
  5. Find related work: curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:ID/references?fields=title,citationCount&limit=20"
  6. Get recommendations: POST to Semantic Scholar recommendations endpoint
  7. Track authors: curl -s "https://api.semanticscholar.org/graph/v1/author/search?query=NAME"

Rate Limits

| API | Rate | Auth |
|-----|------|------|
| arXiv | ~1 req / 3 seconds | None needed |
| Semantic Scholar | 1 req / second | None (100/sec with API key) |
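
One way to stay under these limits in a script is to enforce a minimum gap between consecutive requests. A minimal sketch under that assumption. The class name MinIntervalLimiter is hypothetical, and the intervals are taken from the table above.

```python
import time


class MinIntervalLimiter:
    """Block so that consecutive wait() calls are at least `interval` seconds apart.

    Hypothetical helper; clock and sleep are injectable so it can be tested
    without real delays.
    """

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# One limiter per API, using the rates listed in the table.
arxiv_limiter = MinIntervalLimiter(3.0)
s2_limiter = MinIntervalLimiter(1.0)
```

Call arxiv_limiter.wait() immediately before each arXiv request and s2_limiter.wait() before each Semantic Scholar request. The first call returns at once, and later calls sleep only for the time still owed.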

Notes

  • arXiv returns Atom XML — use the helper script or parsing snippet for clean output
  • Semantic Scholar returns JSON — pipe through python3 -m json.tool for readability
  • arXiv IDs: old format (hep-th/0601001) vs new (2402.03300)
  • PDF: https://arxiv.org/pdf/{id} — Abstract: https://arxiv.org/abs/{id}
  • HTML (when available): https://arxiv.org/html/{id}
  • For local PDF processing, see the ocr-and-documents skill
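
The two ID formats and the URL patterns in these notes can be checked and expanded mechanically. A minimal sketch, assuming approximate regular expressions for the two formats; they are illustrative, not an official grammar, and the helper names id_style and paper_urls are hypothetical.

```python
import re

# New-style IDs: YYMM.NNNN or YYMM.NNNNN, with an optional vN version suffix.
NEW_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
# Old-style IDs: archive[.XX]/YYMMNNN, with an optional vN version suffix.
# This pattern is an approximation made for this example.
OLD_ID = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")


def id_style(arxiv_id):
    """Return 'new' or 'old' for a recognised arXiv ID, else raise ValueError."""
    if NEW_ID.match(arxiv_id):
        return "new"
    if OLD_ID.match(arxiv_id):
        return "old"
    raise ValueError(f"not a recognised arXiv ID: {arxiv_id!r}")


def paper_urls(arxiv_id):
    """Return the abs, pdf and html URLs for an ID (html may not exist)."""
    id_style(arxiv_id)  # validate before building URLs
    return {kind: f"https://arxiv.org/{kind}/{arxiv_id}" for kind in ("abs", "pdf", "html")}


print(id_style("hep-th/0601001"), id_style("2402.03300"))
print(paper_urls("2402.03300")["pdf"])
```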

ID Versioning

  • arxiv.org/abs/1706.03762 always resolves to the latest version
  • arxiv.org/abs/1706.03762v1 points to a specific immutable version
  • When generating citations, preserve the version suffix you actually read to prevent citation drift (a later version may substantially change content)
  • The API <id> field returns the versioned URL (e.g., http://arxiv.org/abs/1706.03762v7)
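
Because the API id field carries the version, it can be split into a base ID and a version number so that a citation records exactly what was read. A minimal sketch of that split. The function name split_versioned_id is hypothetical.

```python
import re

# Matches the part after /abs/ and separates a trailing vN suffix if present.
ID_URL = re.compile(r"arxiv\.org/abs/(?P<base>.+?)(?:v(?P<version>\d+))?$")


def split_versioned_id(id_url):
    """Split an API <id> URL into (base_id, version).

    version is an int, or None when the URL has no version suffix.
    Hypothetical helper for illustration.
    """
    match = ID_URL.search(id_url.strip())
    if match is None:
        raise ValueError(f"not an arXiv abs URL: {id_url!r}")
    version = match.group("version")
    return match.group("base"), (int(version) if version else None)


base, version = split_versioned_id("http://arxiv.org/abs/1706.03762v7")
# Rebuild the pinned ID for a citation.
print(f"{base}v{version}")
```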

Withdrawn Papers

Papers can be withdrawn after submission. When this happens:

  • The <summary> field contains a withdrawal notice (look for "withdrawn" or "retracted")
  • Metadata fields may be incomplete
  • Always check the summary before treating a result as a valid paper
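
The summary check above can be applied while parsing results. A minimal sketch that filters entries whose summary mentions a withdrawal. The sample feed is made up for this example, and looks_withdrawn is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = {"a": "http://www.w3.org/2005/Atom"}
WITHDRAWAL_MARKERS = ("withdrawn", "retracted")


def looks_withdrawn(entry):
    """Return True if the entry's <summary> contains a withdrawal marker."""
    summary = entry.findtext("a:summary", default="", namespaces=NS)
    return any(marker in summary.lower() for marker in WITHDRAWAL_MARKERS)


# Made-up feed for illustration; real responses carry many more fields.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>Paper A</title><summary>We study X.</summary></entry>
  <entry><title>Paper B</title><summary>This paper has been withdrawn by the author.</summary></entry>
</feed>"""

root = ET.fromstring(SAMPLE)
valid = [
    entry.findtext("a:title", namespaces=NS)
    for entry in root.findall("a:entry", NS)
    if not looks_withdrawn(entry)
]
print(valid)
```

A plain substring match can flag a paper that merely discusses retractions, so treat a hit as a prompt to read the summary, not as a final verdict.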