mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-26 01:01:40 +00:00
docs(website): dedicated page per bundled + optional skill (#14929)
Generates a full dedicated Docusaurus page for every one of the 132 skills
(73 bundled + 59 optional) under website/docs/user-guide/skills/{bundled,optional}/<category>/.
Each page carries the skill's description, metadata (version, author, license,
dependencies, platform gating, tags, related skills cross-linked to their own
pages), and the complete SKILL.md body that Hermes loads at runtime.
Previously the two catalog pages just listed skills with a one-line blurb and
no way to see what the skill actually did — users had to go read the source
repo. Now every skill has a browsable, searchable, cross-linked reference in
the docs.
- website/scripts/generate-skill-docs.py — generator that reads skills/ and
optional-skills/, writes per-skill pages, regenerates both catalog indexes,
and rewrites the Skills section of sidebars.ts. Handles MDX escaping
(outside fenced code blocks: curly braces, unsafe HTML-ish tags) and
rewrites relative references/*.md links to point at the GitHub source.
- website/docs/reference/skills-catalog.md — regenerated; each row links to
the new dedicated page.
- website/docs/reference/optional-skills-catalog.md — same.
- website/sidebars.ts — Skills section now has Bundled / Optional subtrees
with one nested category per skill folder.
- .github/workflows/{docs-site-checks,deploy-site}.yml — run the generator
before docusaurus build so CI stays in sync with the source SKILL.md files.
Build verified locally with `npx docusaurus build`. Only remaining warnings
are pre-existing broken link/anchor issues in unrelated pages.
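
The MDX-escaping pass described above can be sketched in a few lines (a simplified illustration with a hypothetical `escape_mdx` helper — the real logic lives in `website/scripts/generate-skill-docs.py` and also handles unsafe HTML-ish tags):

```python
FENCE = "`" * 3  # fenced-code delimiter

def escape_mdx(text: str) -> str:
    """Escape curly braces outside fenced code blocks so MDX does not
    treat them as JSX expressions (simplified sketch of the rule above)."""
    out, in_fence = [], False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence   # entering or leaving a fenced block
            out.append(line)
        elif in_fence:
            out.append(line)          # code blocks pass through untouched
        else:
            out.append(line.replace("{", "\\{").replace("}", "\\}"))
    return "\n".join(out)
```

Braces inside fences survive unescaped, which is why the generator has to track fence state rather than run a single global substitution.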
This commit is contained in:
parent eb93f88e1d
commit 0f6eabb890
139 changed files with 43523 additions and 306 deletions

@@ -0,0 +1,252 @@

---
title: "Bioinformatics — Gateway to 400+ bioinformatics skills from bioSkills and ClawBio"
sidebar_label: "Bioinformatics"
description: "Gateway to 400+ bioinformatics skills from bioSkills and ClawBio"
---

{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}

# Bioinformatics

Gateway to 400+ bioinformatics skills from bioSkills and ClawBio. Covers genomics, transcriptomics, single-cell, variant calling, pharmacogenomics, metagenomics, structural biology, and more. Fetches domain-specific reference material on demand.

## Skill metadata

| | |
|---|---|
| Source | Optional — install with `hermes skills install official/research/bioinformatics` |
| Path | `optional-skills/research/bioinformatics` |
| Version | `1.0.0` |
| Platforms | linux, macos |
| Tags | `bioinformatics`, `genomics`, `sequencing`, `biology`, `research`, `science` |

## Reference: full SKILL.md

:::info
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
:::

# Bioinformatics Skills Gateway

Use when asked about bioinformatics, genomics, sequencing, variant calling, gene expression, single-cell analysis, protein structure, pharmacogenomics, metagenomics, phylogenetics, or any computational biology task.

This skill is a gateway to two open-source bioinformatics skill libraries. Instead of bundling hundreds of domain-specific skills, it indexes them and fetches what you need on demand.

## Sources

◆ **bioSkills** — 385 reference skills (code patterns, parameter guides, decision trees)
Repo: https://github.com/GPTomics/bioSkills
Format: SKILL.md per topic with code examples. Python/R/CLI.

◆ **ClawBio** — 33 runnable pipeline skills (executable scripts, reproducibility bundles)
Repo: https://github.com/ClawBio/ClawBio
Format: Python scripts with demos. Each analysis exports report.md + commands.sh + environment.yml.

## How to fetch and use a skill

1. Identify the domain and skill name from the index below.
2. Clone the relevant repo (shallow clone to save time):
```bash
# bioSkills (reference material)
git clone --depth 1 https://github.com/GPTomics/bioSkills.git /tmp/bioSkills

# ClawBio (runnable pipelines)
git clone --depth 1 https://github.com/ClawBio/ClawBio.git /tmp/ClawBio
```
3. Read the specific skill:
```bash
# bioSkills — each skill is at: <category>/<skill-name>/SKILL.md
cat /tmp/bioSkills/variant-calling/gatk-variant-calling/SKILL.md

# ClawBio — each skill is at: skills/<skill-name>/
cat /tmp/ClawBio/skills/pharmgx-reporter/README.md
```
4. Follow the fetched skill as reference material. These are NOT Hermes-format skills — treat them as expert domain guides. They contain correct parameters, proper tool flags, and validated pipelines.

## Skill Index by Domain

### Sequence Fundamentals
bioSkills:
sequence-io/ — read-sequences, write-sequences, format-conversion, batch-processing, compressed-files, fastq-quality, filter-sequences, paired-end-fastq, sequence-statistics
sequence-manipulation/ — seq-objects, reverse-complement, transcription-translation, motif-search, codon-usage, sequence-properties, sequence-slicing
ClawBio:
seq-wrangler — Sequence QC, alignment, and BAM processing (wraps FastQC, BWA, SAMtools)

### Read QC & Alignment
bioSkills:
read-qc/ — quality-reports, fastp-workflow, adapter-trimming, quality-filtering, umi-processing, contamination-screening, rnaseq-qc
read-alignment/ — bwa-alignment, star-alignment, hisat2-alignment, bowtie2-alignment
alignment-files/ — sam-bam-basics, alignment-sorting, alignment-filtering, bam-statistics, duplicate-handling, pileup-generation

### Variant Calling & Annotation
bioSkills:
variant-calling/ — gatk-variant-calling, deepvariant, variant-calling (bcftools), joint-calling, structural-variant-calling, filtering-best-practices, variant-annotation, variant-normalization, vcf-basics, vcf-manipulation, vcf-statistics, consensus-sequences, clinical-interpretation
ClawBio:
vcf-annotator — VEP + ClinVar + gnomAD annotation with ancestry-aware context
variant-annotation — Variant annotation pipeline

### Differential Expression (Bulk RNA-seq)
bioSkills:
differential-expression/ — deseq2-basics, edger-basics, batch-correction, de-results, de-visualization, timeseries-de
rna-quantification/ — alignment-free-quant (Salmon/kallisto), featurecounts-counting, tximport-workflow, count-matrix-qc
expression-matrix/ — counts-ingest, gene-id-mapping, metadata-joins, sparse-handling
ClawBio:
rnaseq-de — Full DE pipeline with QC, normalization, and visualization
diff-visualizer — Rich visualization and reporting for DE results

### Single-Cell RNA-seq
bioSkills:
single-cell/ — preprocessing, clustering, batch-integration, cell-annotation, cell-communication, doublet-detection, markers-annotation, trajectory-inference, multimodal-integration, perturb-seq, scatac-analysis, lineage-tracing, metabolite-communication, data-io
ClawBio:
scrna-orchestrator — Full Scanpy pipeline (QC, clustering, markers, annotation)
scrna-embedding — scVI-based latent embedding and batch integration

### Spatial Transcriptomics
bioSkills:
spatial-transcriptomics/ — spatial-data-io, spatial-preprocessing, spatial-domains, spatial-deconvolution, spatial-communication, spatial-neighbors, spatial-statistics, spatial-visualization, spatial-multiomics, spatial-proteomics, image-analysis

### Epigenomics
bioSkills:
chip-seq/ — peak-calling, differential-binding, motif-analysis, peak-annotation, chipseq-qc, chipseq-visualization, super-enhancers
atac-seq/ — atac-peak-calling, atac-qc, differential-accessibility, footprinting, motif-deviation, nucleosome-positioning
methylation-analysis/ — bismark-alignment, methylation-calling, dmr-detection, methylkit-analysis
hi-c-analysis/ — hic-data-io, tad-detection, loop-calling, compartment-analysis, contact-pairs, matrix-operations, hic-visualization, hic-differential
ClawBio:
methylation-clock — Epigenetic age estimation

### Pharmacogenomics & Clinical
bioSkills:
clinical-databases/ — clinvar-lookup, gnomad-frequencies, dbsnp-queries, pharmacogenomics, polygenic-risk, hla-typing, variant-prioritization, somatic-signatures, tumor-mutational-burden, myvariant-queries
ClawBio:
pharmgx-reporter — PGx report from 23andMe/AncestryDNA (12 genes, 31 SNPs, 51 drugs)
drug-photo — Photo of medication → personalized PGx dosage card (via vision)
clinpgx — ClinPGx API for gene-drug data and CPIC guidelines
gwas-lookup — Federated variant lookup across 9 genomic databases
gwas-prs — Polygenic risk scores from consumer genetic data
nutrigx_advisor — Personalized nutrition from consumer genetic data

### Population Genetics & GWAS
bioSkills:
population-genetics/ — association-testing (PLINK GWAS), plink-basics, population-structure, linkage-disequilibrium, scikit-allel-analysis, selection-statistics
causal-genomics/ — mendelian-randomization, fine-mapping, colocalization-analysis, mediation-analysis, pleiotropy-detection
phasing-imputation/ — haplotype-phasing, genotype-imputation, imputation-qc, reference-panels
ClawBio:
claw-ancestry-pca — Ancestry PCA against SGDP reference panel

### Metagenomics & Microbiome
bioSkills:
metagenomics/ — kraken-classification, metaphlan-profiling, abundance-estimation, functional-profiling, amr-detection, strain-tracking, metagenome-visualization
microbiome/ — amplicon-processing, diversity-analysis, differential-abundance, taxonomy-assignment, functional-prediction, qiime2-workflow
ClawBio:
claw-metagenomics — Shotgun metagenomics profiling (taxonomy, resistome, functional pathways)

### Genome Assembly & Annotation
bioSkills:
genome-assembly/ — hifi-assembly, long-read-assembly, short-read-assembly, metagenome-assembly, assembly-polishing, assembly-qc, scaffolding, contamination-detection
genome-annotation/ — eukaryotic-gene-prediction, prokaryotic-annotation, functional-annotation, ncrna-annotation, repeat-annotation, annotation-transfer
long-read-sequencing/ — basecalling, long-read-alignment, long-read-qc, clair3-variants, structural-variants, medaka-polishing, nanopore-methylation, isoseq-analysis

### Structural Biology & Chemoinformatics
bioSkills:
structural-biology/ — alphafold-predictions, modern-structure-prediction, structure-io, structure-navigation, structure-modification, geometric-analysis
chemoinformatics/ — molecular-io, molecular-descriptors, similarity-searching, substructure-search, virtual-screening, admet-prediction, reaction-enumeration
ClawBio:
struct-predictor — Local AlphaFold/Boltz/Chai structure prediction with comparison

### Proteomics
bioSkills:
proteomics/ — data-import, peptide-identification, protein-inference, quantification, differential-abundance, dia-analysis, ptm-analysis, proteomics-qc, spectral-libraries
ClawBio:
proteomics-de — Proteomics differential expression

### Pathway Analysis & Gene Networks
bioSkills:
pathway-analysis/ — go-enrichment, gsea, kegg-pathways, reactome-pathways, wikipathways, enrichment-visualization
gene-regulatory-networks/ — scenic-regulons, coexpression-networks, differential-networks, multiomics-grn, perturbation-simulation

### Immunoinformatics
bioSkills:
immunoinformatics/ — mhc-binding-prediction, epitope-prediction, neoantigen-prediction, immunogenicity-scoring, tcr-epitope-binding
tcr-bcr-analysis/ — mixcr-analysis, scirpy-analysis, immcantation-analysis, repertoire-visualization, vdjtools-analysis

### CRISPR & Genome Engineering
bioSkills:
crispr-screens/ — mageck-analysis, jacks-analysis, hit-calling, screen-qc, library-design, crispresso-editing, base-editing-analysis, batch-correction
genome-engineering/ — grna-design, off-target-prediction, hdr-template-design, base-editing-design, prime-editing-design

### Workflow Management
bioSkills:
workflow-management/ — snakemake-workflows, nextflow-pipelines, cwl-workflows, wdl-workflows
ClawBio:
repro-enforcer — Export any analysis as reproducibility bundle (Conda env + Singularity + checksums)
galaxy-bridge — Access 8,000+ Galaxy tools from usegalaxy.org

### Specialized Domains
bioSkills:
alternative-splicing/ — splicing-quantification, differential-splicing, isoform-switching, sashimi-plots, single-cell-splicing, splicing-qc
ecological-genomics/ — edna-metabarcoding, landscape-genomics, conservation-genetics, biodiversity-metrics, community-ecology, species-delimitation
epidemiological-genomics/ — pathogen-typing, variant-surveillance, phylodynamics, transmission-inference, amr-surveillance
liquid-biopsy/ — cfdna-preprocessing, ctdna-mutation-detection, fragment-analysis, tumor-fraction-estimation, methylation-based-detection, longitudinal-monitoring
epitranscriptomics/ — m6a-peak-calling, m6a-differential, m6anet-analysis, merip-preprocessing, modification-visualization
metabolomics/ — xcms-preprocessing, metabolite-annotation, normalization-qc, statistical-analysis, pathway-mapping, lipidomics, targeted-analysis, msdial-preprocessing
flow-cytometry/ — fcs-handling, gating-analysis, compensation-transformation, clustering-phenotyping, differential-analysis, cytometry-qc, doublet-detection, bead-normalization
systems-biology/ — flux-balance-analysis, metabolic-reconstruction, gene-essentiality, context-specific-models, model-curation
rna-structure/ — secondary-structure-prediction, ncrna-search, structure-probing

### Data Visualization & Reporting
bioSkills:
data-visualization/ — ggplot2-fundamentals, heatmaps-clustering, volcano-customization, circos-plots, genome-browser-tracks, interactive-visualization, multipanel-figures, network-visualization, upset-plots, color-palettes, specialized-omics-plots, genome-tracks
reporting/ — rmarkdown-reports, quarto-reports, jupyter-reports, automated-qc-reports, figure-export
ClawBio:
profile-report — Analysis profile reporting
data-extractor — Extract numerical data from scientific figure images (via vision)
lit-synthesizer — PubMed/bioRxiv search, summarization, citation graphs
pubmed-summariser — Gene/disease PubMed search with structured briefing

### Database Access
bioSkills:
database-access/ — entrez-search, entrez-fetch, entrez-link, blast-searches, local-blast, sra-data, geo-data, uniprot-access, batch-downloads, interaction-databases, sequence-similarity
ClawBio:
ukb-navigator — Semantic search across 12,000+ UK Biobank fields
clinical-trial-finder — Clinical trial discovery

### Experimental Design
bioSkills:
experimental-design/ — power-analysis, sample-size, batch-design, multiple-testing

### Machine Learning for Omics
bioSkills:
machine-learning/ — omics-classifiers, biomarker-discovery, survival-analysis, model-validation, prediction-explanation, atlas-mapping
ClawBio:
claw-semantic-sim — Semantic similarity index for disease literature (PubMedBERT)
omics-target-evidence-mapper — Aggregate target-level evidence across omics sources

## Environment Setup

These skills assume a bioinformatics workstation. Common dependencies:

```bash
# Python
pip install biopython pysam cyvcf2 pybedtools pyBigWig scikit-allel anndata scanpy mygene

# R/Bioconductor
Rscript -e 'BiocManager::install(c("DESeq2","edgeR","Seurat","clusterProfiler","methylKit"))'

# CLI tools (Ubuntu/Debian)
sudo apt install samtools bcftools ncbi-blast+ minimap2 bedtools

# CLI tools (macOS)
brew install samtools bcftools blast minimap2 bedtools

# Or via Conda (recommended for reproducibility)
conda install -c bioconda samtools bcftools blast minimap2 bedtools fastp kraken2
```

## Pitfalls

- The fetched skills are NOT in Hermes SKILL.md format. They use their own structure (bioSkills: code pattern cookbooks; ClawBio: README + Python scripts). Read them as expert reference material.
- bioSkills are reference guides — they show correct parameters and code patterns but aren't executable pipelines.
- ClawBio skills are executable — many have `--demo` flags and can be run directly.
- Both repos assume bioinformatics tools are installed. Check prerequisites before running pipelines.
- For ClawBio, run `pip install -r requirements.txt` in the cloned repo first.
- Genomic data files can be very large. Be mindful of disk space when downloading reference genomes, SRA datasets, or building indices.

@@ -0,0 +1,116 @@

---
title: "Domain Intel — Passive domain reconnaissance using Python stdlib"
sidebar_label: "Domain Intel"
description: "Passive domain reconnaissance using Python stdlib"
---

{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}

# Domain Intel

Passive domain reconnaissance using Python stdlib. Subdomain discovery, SSL certificate inspection, WHOIS lookups, DNS records, domain availability checks, and bulk multi-domain analysis. No API keys required.

## Skill metadata

| | |
|---|---|
| Source | Optional — install with `hermes skills install official/research/domain-intel` |
| Path | `optional-skills/research/domain-intel` |

## Reference: full SKILL.md

:::info
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
:::

# Domain Intelligence — Passive OSINT

Passive domain reconnaissance using only Python stdlib.
**Zero dependencies. Zero API keys. Works on Linux, macOS, and Windows.**

## Helper script

This skill includes `scripts/domain_intel.py` — a complete CLI tool for all domain intelligence operations.

```bash
# Subdomain discovery via Certificate Transparency logs
python3 SKILL_DIR/scripts/domain_intel.py subdomains example.com

# SSL certificate inspection (expiry, cipher, SANs, issuer)
python3 SKILL_DIR/scripts/domain_intel.py ssl example.com

# WHOIS lookup (registrar, dates, name servers — 100+ TLDs)
python3 SKILL_DIR/scripts/domain_intel.py whois example.com

# DNS records (A, AAAA, MX, NS, TXT, CNAME)
python3 SKILL_DIR/scripts/domain_intel.py dns example.com

# Domain availability check (passive: DNS + WHOIS + SSL signals)
python3 SKILL_DIR/scripts/domain_intel.py available coolstartup.io

# Bulk analysis — multiple domains, multiple checks in parallel
python3 SKILL_DIR/scripts/domain_intel.py bulk example.com github.com google.com
python3 SKILL_DIR/scripts/domain_intel.py bulk example.com github.com --checks ssl,dns
```

`SKILL_DIR` is the directory containing this SKILL.md file. All output is structured JSON.

## Available commands

| Command | What it does | Data source |
|---------|-------------|-------------|
| `subdomains` | Find subdomains from certificate logs | crt.sh (HTTPS) |
| `ssl` | Inspect TLS certificate details | Direct TCP:443 to target |
| `whois` | Registration info, registrar, dates | WHOIS servers (TCP:43) |
| `dns` | A, AAAA, MX, NS, TXT, CNAME records | System DNS + Google DoH |
| `available` | Check if domain is registered | DNS + WHOIS + SSL signals |
| `bulk` | Run multiple checks on multiple domains | All of the above |
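
crt.sh returns one JSON record per certificate, and a record's `name_value` field can contain several newline-separated names, so subdomain discovery needs a flatten-and-dedupe pass. A minimal sketch of that step, using made-up sample data (this is not the `domain_intel.py` source):

```python
import json

# Hypothetical excerpt of a crt.sh JSON response (made-up data)
raw = json.loads('''[
  {"name_value": "www.example.com\\napi.example.com"},
  {"name_value": "api.example.com"},
  {"name_value": "*.example.com"}
]''')

subdomains = sorted({
    name.lstrip("*.")  # drop wildcard prefixes
    for record in raw
    for name in record["name_value"].splitlines()
})
print(subdomains)  # → ['api.example.com', 'example.com', 'www.example.com']
```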

## When to use this vs built-in tools

- **Use this skill** for infrastructure questions: subdomains, SSL certs, WHOIS, DNS records, availability
- **Use `web_search`** for general research about what a domain/company does
- **Use `web_extract`** to get the actual content of a webpage
- **Use `terminal` with `curl -I`** for a simple "is this URL reachable" check

| Task | Better tool | Why |
|------|-------------|-----|
| "What does example.com do?" | `web_extract` | Gets page content, not DNS/WHOIS data |
| "Find info about a company" | `web_search` | General research, not domain-specific |
| "Is this website safe?" | `web_search` | Reputation checks need web context |
| "Check if a URL is reachable" | `terminal` with `curl -I` | Simple HTTP check |
| "Find subdomains of X" | **This skill** | Only passive source for this |
| "When does the SSL cert expire?" | **This skill** | Built-in tools can't inspect TLS |
| "Who registered this domain?" | **This skill** | WHOIS data not in web search |
| "Is coolstartup.io available?" | **This skill** | Passive availability via DNS+WHOIS+SSL |

## Platform compatibility

Pure Python stdlib (`socket`, `ssl`, `urllib`, `json`, `concurrent.futures`).
Works identically on Linux, macOS, and Windows with no dependencies.

- **crt.sh queries** use HTTPS (port 443) — works behind most firewalls
- **WHOIS queries** use TCP port 43 — may be blocked on restrictive networks
- **DNS queries** use Google DoH (HTTPS) for MX/NS/TXT — firewall-friendly
- **SSL checks** connect to the target on port 443 — the only "active" operation

## Data sources

All queries are **passive** — no port scanning, no vulnerability testing:

- **crt.sh** — Certificate Transparency logs (subdomain discovery, HTTPS only)
- **WHOIS servers** — Direct TCP to 100+ authoritative TLD registrars
- **Google DNS-over-HTTPS** — MX, NS, TXT, CNAME resolution (firewall-friendly)
- **System DNS** — A/AAAA record resolution
- **SSL check** is the only "active" operation (TCP connection to target:443)

## Notes

- WHOIS queries use TCP port 43 — may be blocked on restrictive networks
- Some WHOIS servers redact registrant info (GDPR) — mention this to the user
- crt.sh can be slow for very popular domains (thousands of certs) — set reasonable expectations
- The availability check is heuristic-based (3 passive signals) — not authoritative like a registrar API
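
The heuristic can be pictured as three independent registered-domain signals; a minimal sketch of the combination logic (hypothetical helper, not the actual `domain_intel.py` implementation):

```python
def availability_verdict(has_dns: bool, has_whois_record: bool, has_ssl_cert: bool) -> str:
    """Combine three passive signals into a heuristic verdict.
    Any positive signal means the domain is registered; none means it is
    *probably* available — heuristic only, not a registrar-grade answer."""
    if any([has_dns, has_whois_record, has_ssl_cert]):
        return "registered"
    return "probably-available"

print(availability_verdict(True, True, False))    # registered
print(availability_verdict(False, False, False))  # probably-available
```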

---

*Contributed by [@FurkanL0](https://github.com/FurkanL0)*

@@ -0,0 +1,236 @@

---
title: "Drug Discovery — Pharmaceutical research assistant for drug discovery workflows"
sidebar_label: "Drug Discovery"
description: "Pharmaceutical research assistant for drug discovery workflows"
---

{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}

# Drug Discovery

Pharmaceutical research assistant for drug discovery workflows. Search bioactive compounds on ChEMBL, calculate drug-likeness (Lipinski Ro5, QED, TPSA, synthetic accessibility), look up drug-drug interactions via OpenFDA, interpret ADMET profiles, and assist with lead optimization. Use for medicinal chemistry questions, molecule property analysis, clinical pharmacology, and open-science drug research.

## Skill metadata

| | |
|---|---|
| Source | Optional — install with `hermes skills install official/research/drug-discovery` |
| Path | `optional-skills/research/drug-discovery` |
| Version | `1.0.0` |
| Author | bennytimz |
| License | MIT |
| Tags | `science`, `chemistry`, `pharmacology`, `research`, `health` |

## Reference: full SKILL.md

:::info
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
:::

# Drug Discovery & Pharmaceutical Research

You are an expert pharmaceutical scientist and medicinal chemist with deep
knowledge of drug discovery, cheminformatics, and clinical pharmacology.
Use this skill for all pharma/chemistry research tasks.

## Core Workflows

### 1 — Bioactive Compound Search (ChEMBL)

Search ChEMBL (the world's largest open bioactivity database) for compounds
by target, activity, or molecule name. No API key required.

```bash
# Search compounds by target name (e.g. "EGFR", "COX-2", "ACE")
TARGET="$1"
ENCODED=$(python3 -c "import urllib.parse,sys; print(urllib.parse.quote(sys.argv[1]))" "$TARGET")
curl -s "https://www.ebi.ac.uk/chembl/api/data/target/search?q=${ENCODED}&format=json" \
| python3 -c "
import json,sys
data=json.load(sys.stdin)
targets=data.get('targets',[])[:5]
for t in targets:
    print(f\"ChEMBL ID : {t.get('target_chembl_id')}\")
    print(f\"Name : {t.get('pref_name')}\")
    print(f\"Type : {t.get('target_type')}\")
    print()
"
```

```bash
# Get bioactivity data for a ChEMBL target ID
TARGET_ID="$1" # e.g. CHEMBL203
curl -s "https://www.ebi.ac.uk/chembl/api/data/activity?target_chembl_id=${TARGET_ID}&pchembl_value__gte=6&limit=10&format=json" \
| python3 -c "
import json,sys
data=json.load(sys.stdin)
acts=data.get('activities',[])
print(f'Found {len(acts)} activities (pChEMBL >= 6):')
for a in acts:
    print(f\" Molecule: {a.get('molecule_chembl_id')} | {a.get('standard_type')}: {a.get('standard_value')} {a.get('standard_units')} | pChEMBL: {a.get('pchembl_value')}\")
"
```

```bash
# Look up a specific molecule by ChEMBL ID
MOL_ID="$1" # e.g. CHEMBL25 (aspirin)
curl -s "https://www.ebi.ac.uk/chembl/api/data/molecule/${MOL_ID}?format=json" \
| python3 -c "
import json,sys
m=json.load(sys.stdin)
props=m.get('molecule_properties',{}) or {}
print(f\"Name : {m.get('pref_name','N/A')}\")
print(f\"SMILES : {m.get('molecule_structures',{}).get('canonical_smiles','N/A') if m.get('molecule_structures') else 'N/A'}\")
print(f\"MW : {props.get('full_mwt','N/A')} Da\")
print(f\"LogP : {props.get('alogp','N/A')}\")
print(f\"HBD : {props.get('hbd','N/A')}\")
print(f\"HBA : {props.get('hba','N/A')}\")
print(f\"TPSA : {props.get('psa','N/A')} Å²\")
print(f\"Ro5 violations: {props.get('num_ro5_violations','N/A')}\")
print(f\"QED : {props.get('qed_weighted','N/A')}\")
"
```

### 2 — Drug-Likeness Calculation (Lipinski Ro5 + Veber)

Assess any molecule against established oral bioavailability rules using
PubChem's free property API — no RDKit install needed.

```bash
COMPOUND="$1"
ENCODED=$(python3 -c "import urllib.parse,sys; print(urllib.parse.quote(sys.argv[1]))" "$COMPOUND")
curl -s "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/${ENCODED}/property/MolecularWeight,XLogP,HBondDonorCount,HBondAcceptorCount,RotatableBondCount,TPSA,InChIKey/JSON" \
| python3 -c "
import json,sys
data=json.load(sys.stdin)
props=data['PropertyTable']['Properties'][0]
mw = float(props.get('MolecularWeight', 0))
logp = float(props.get('XLogP', 0))
hbd = int(props.get('HBondDonorCount', 0))
hba = int(props.get('HBondAcceptorCount', 0))
rot = int(props.get('RotatableBondCount', 0))
tpsa = float(props.get('TPSA', 0))
print('=== Lipinski Rule of Five (Ro5) ===')
print(f' MW {mw:.1f} Da {\"✓\" if mw<=500 else \"✗ VIOLATION (>500)\"}')
print(f' LogP {logp:.2f} {\"✓\" if logp<=5 else \"✗ VIOLATION (>5)\"}')
print(f' HBD {hbd} {\"✓\" if hbd<=5 else \"✗ VIOLATION (>5)\"}')
print(f' HBA {hba} {\"✓\" if hba<=10 else \"✗ VIOLATION (>10)\"}')
viol = sum([mw>500, logp>5, hbd>5, hba>10])
print(f' Violations: {viol}/4 {\"→ Likely orally bioavailable\" if viol<=1 else \"→ Poor oral bioavailability predicted\"}')
print()
print('=== Veber Oral Bioavailability Rules ===')
print(f' TPSA {tpsa:.1f} Å² {\"✓\" if tpsa<=140 else \"✗ VIOLATION (>140)\"}')
print(f' Rot. bonds {rot} {\"✓\" if rot<=10 else \"✗ VIOLATION (>10)\"}')
print(f' Both rules met: {\"Yes → good oral absorption predicted\" if tpsa<=140 and rot<=10 else \"No → reduced oral absorption\"}')
"
```
### 3 — Drug Interaction & Safety Lookup (OpenFDA)

```bash
DRUG="$1"
ENCODED=$(python3 -c "import urllib.parse,sys; print(urllib.parse.quote(sys.argv[1]))" "$DRUG")
curl -s "https://api.fda.gov/drug/label.json?search=drug_interactions:\"${ENCODED}\"&limit=3" \
| python3 -c "
import json,sys
data=json.load(sys.stdin)
results=data.get('results',[])
if not results:
    print('No interaction data found in FDA labels.')
    sys.exit()
for r in results[:2]:
    brand=r.get('openfda',{}).get('brand_name',['Unknown'])[0]
    generic=r.get('openfda',{}).get('generic_name',['Unknown'])[0]
    interactions=r.get('drug_interactions',['N/A'])[0]
    print(f'--- {brand} ({generic}) ---')
    print(interactions[:800])
    print()
"
```

```bash
DRUG="$1"
ENCODED=$(python3 -c "import urllib.parse,sys; print(urllib.parse.quote(sys.argv[1]))" "$DRUG")
curl -s "https://api.fda.gov/drug/event.json?search=patient.drug.medicinalproduct:\"${ENCODED}\"&count=patient.reaction.reactionmeddrapt.exact&limit=10" \
| python3 -c "
import json,sys
data=json.load(sys.stdin)
results=data.get('results',[])
if not results:
    print('No adverse event data found.')
    sys.exit()
print('Top adverse events reported:')
for r in results[:10]:
    print(f\" {r['count']:>5}x {r['term']}\")
"
```

### 4 — PubChem Compound Search

```bash
COMPOUND="$1"
ENCODED=$(python3 -c "import urllib.parse,sys; print(urllib.parse.quote(sys.argv[1]))" "$COMPOUND")
CID=$(curl -s "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/${ENCODED}/cids/TXT" | head -1 | tr -d '[:space:]')
echo "PubChem CID: $CID"
curl -s "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/${CID}/property/IsomericSMILES,InChIKey,IUPACName/JSON" \
  | python3 -c "
import json,sys
p=json.load(sys.stdin)['PropertyTable']['Properties'][0]
print(f\"IUPAC Name : {p.get('IUPACName','N/A')}\")
print(f\"SMILES     : {p.get('IsomericSMILES','N/A')}\")
print(f\"InChIKey   : {p.get('InChIKey','N/A')}\")
"
```

### 5 — Target & Disease Literature (OpenTargets)

```bash
GENE="$1"
curl -s -X POST "https://api.platform.opentargets.org/api/v4/graphql" \
  -H "Content-Type: application/json" \
  -d "{\"query\":\"{ search(queryString: \\\"${GENE}\\\", entityNames: [\\\"target\\\"], page: {index: 0, size: 1}) { hits { id score object { ... on Target { id approvedSymbol approvedName associatedDiseases(page: {index: 0, size: 5}) { count rows { score disease { id name } } } } } } } }\"}" \
  | python3 -c "
import json,sys
data=json.load(sys.stdin)
hits=data.get('data',{}).get('search',{}).get('hits',[])
if not hits:
    print('Target not found.')
    sys.exit()
obj=hits[0]['object']
print(f\"Target: {obj.get('approvedSymbol')} — {obj.get('approvedName')}\")
assoc=obj.get('associatedDiseases',{})
print(f\"Associated with {assoc.get('count',0)} diseases. Top associations:\")
for row in assoc.get('rows',[]):
    print(f\" Score {row['score']:.3f} | {row['disease']['name']}\")
"
```

## Reasoning Guidelines

When analysing drug-likeness or molecular properties, always:

1. **State raw values first** — MW, LogP, HBD, HBA, TPSA, RotBonds
2. **Apply rule sets** — Ro5 (Lipinski), Veber, Ghose filter where relevant
3. **Flag liabilities** — metabolic hotspots, hERG risk, high TPSA for CNS penetration
4. **Suggest optimizations** — bioisosteric replacements, prodrug strategies, ring truncation
5. **Cite the source API** — ChEMBL, PubChem, OpenFDA, or OpenTargets
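
Steps 1 and 2 can be sketched as a small helper that takes the raw descriptors and reports rule-set compliance. The thresholds follow the standard Lipinski Ro5 and Veber definitions; the function name and example values are illustrative, not part of any API above:

```python
def drug_likeness(mw, logp, hbd, hba, tpsa, rot_bonds):
    """Report Lipinski (Ro5) and Veber violations from raw descriptors."""
    ro5 = {
        "MW <= 500": mw <= 500,
        "LogP <= 5": logp <= 5,
        "HBD <= 5": hbd <= 5,
        "HBA <= 10": hba <= 10,
    }
    veber = {
        "TPSA <= 140": tpsa <= 140,
        "RotBonds <= 10": rot_bonds <= 10,
    }
    return {
        "ro5_violations": [rule for rule, ok in ro5.items() if not ok],
        "veber_violations": [rule for rule, ok in veber.items() if not ok],
    }

# Ibuprofen-like descriptors: passes both rule sets (empty violation lists)
print(drug_likeness(mw=206.3, logp=3.5, hbd=1, hba=2, tpsa=37.3, rot_bonds=4))
```

Stating the raw values before the verdict keeps the reasoning auditable: a reader can re-check each threshold against the printed descriptors.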

For ADMET questions, reason through Absorption, Distribution, Metabolism, Excretion, Toxicity systematically. See references/ADMET_REFERENCE.md for detailed guidance.

## Important Notes

- All APIs are free, public, and require no authentication
- ChEMBL rate limits: add `sleep 1` between batch requests
- FDA data reflects reported adverse events, not necessarily causation
- Always recommend consulting a licensed pharmacist or physician for clinical decisions

## Quick Reference

| Task | API | Endpoint |
|------|-----|----------|
| Find target | ChEMBL | `/api/data/target/search?q=` |
| Get bioactivity | ChEMBL | `/api/data/activity?target_chembl_id=` |
| Molecule properties | PubChem | `/rest/pug/compound/name/{name}/property/` |
| Drug interactions | OpenFDA | `/drug/label.json?search=drug_interactions:` |
| Adverse events | OpenFDA | `/drug/event.json?search=...&count=reaction` |
| Gene-disease | OpenTargets | GraphQL POST `/api/v4/graphql` |
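
As a minimal sketch of how the ChEMBL rows compose into request URLs, the two endpoints can be wrapped in builders. The `www.ebi.ac.uk` host is the standard ChEMBL API base; the `format=json` parameter and the function names are assumptions for illustration:

```python
from urllib.parse import quote

CHEMBL_BASE = "https://www.ebi.ac.uk/chembl/api/data"

def target_search_url(query):
    """URL for the 'Find target' endpoint."""
    return f"{CHEMBL_BASE}/target/search?q={quote(query)}&format=json"

def activity_url(target_chembl_id, limit=20):
    """URL for the 'Get bioactivity' endpoint."""
    return (f"{CHEMBL_BASE}/activity?target_chembl_id={quote(target_chembl_id)}"
            f"&limit={limit}&format=json")

print(target_search_url("EGFR"))
print(activity_url("CHEMBL203"))
```

Building the URLs through `quote()` keeps gene symbols or IDs with unusual characters from breaking the query string.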

@@ -0,0 +1,254 @@

---
title: "Duckduckgo Search — Free web search via DuckDuckGo — text, news, images, videos"
sidebar_label: "Duckduckgo Search"
description: "Free web search via DuckDuckGo — text, news, images, videos"
---

{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}

# Duckduckgo Search

Free web search via DuckDuckGo — text, news, images, videos. No API key needed. Prefer the `ddgs` CLI when installed; use the Python DDGS library only after verifying that `ddgs` is available in the current runtime.

## Skill metadata

| | |
|---|---|
| Source | Optional — install with `hermes skills install official/research/duckduckgo-search` |
| Path | `optional-skills/research/duckduckgo-search` |
| Version | `1.3.0` |
| Author | gamedevCloudy |
| License | MIT |
| Tags | `search`, `duckduckgo`, `web-search`, `free`, `fallback` |
| Related skills | [`arxiv`](/docs/user-guide/skills/bundled/research/research-arxiv) |

## Reference: full SKILL.md

:::info
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
:::

# DuckDuckGo Search

Free web search using DuckDuckGo. **No API key required.**

Preferred when `web_search` is unavailable or unsuitable (for example when `FIRECRAWL_API_KEY` is not set). Can also be used as a standalone search path when DuckDuckGo results are specifically desired.

## Detection Flow

Check what is actually available before choosing an approach:

```bash
# Check CLI availability
command -v ddgs >/dev/null && echo "DDGS_CLI=installed" || echo "DDGS_CLI=missing"
```

Decision tree:
1. If `ddgs` CLI is installed, prefer `terminal` + `ddgs`
2. If `ddgs` CLI is missing, do not assume `execute_code` can import `ddgs`
3. If the user wants DuckDuckGo specifically, install `ddgs` first in the relevant environment
4. Otherwise fall back to built-in web/browser tools
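
The decision tree collapses into a small helper; the function name and the two marker strings are illustrative:

```shell
# Pick a search path based on whether a given CLI exists on PATH
choose_search_path() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "use-ddgs-cli"
  else
    echo "fallback-web-tools"
  fi
}

choose_search_path ddgs   # prints use-ddgs-cli only when ddgs is installed
```

Branching on `command -v` up front avoids discovering the missing CLI halfway through a multi-step research task.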

Important runtime note:
- Terminal and `execute_code` are separate runtimes
- A successful shell install does not guarantee `execute_code` can import `ddgs`
- Never assume third-party Python packages are preinstalled inside `execute_code`

## Installation

Install `ddgs` only when DuckDuckGo search is specifically needed and the runtime does not already provide it.

```bash
# Python package + CLI entrypoint
pip install ddgs

# Verify CLI
ddgs --help
```

If a workflow depends on Python imports, verify that same runtime can import `ddgs` before using `from ddgs import DDGS`.

## Method 1: CLI Search (Preferred)

Use the `ddgs` command via `terminal` when it exists. This is the preferred path because it avoids assuming the `execute_code` sandbox has the `ddgs` Python package installed.

```bash
# Text search
ddgs text -q "python async programming" -m 5

# News search
ddgs news -q "artificial intelligence" -m 5

# Image search
ddgs images -q "landscape photography" -m 10

# Video search
ddgs videos -q "python tutorial" -m 5

# With region filter
ddgs text -q "best restaurants" -m 5 -r us-en

# Recent results only (d=day, w=week, m=month, y=year)
ddgs text -q "latest AI news" -m 5 -t w

# JSON output for parsing
ddgs text -q "fastapi tutorial" -m 5 -o json
```

### CLI Flags

| Flag | Description | Example |
|------|-------------|---------|
| `-q` | Query — **required** | `-q "search terms"` |
| `-m` | Max results | `-m 5` |
| `-r` | Region | `-r us-en` |
| `-t` | Time limit | `-t w` (week) |
| `-s` | Safe search | `-s off` |
| `-o` | Output format | `-o json` |
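
With `-o json`, the output can be piped straight into a short parser. The sketch below works on a captured sample; it assumes the JSON is an array of result objects with the `title`/`href`/`body` keys described for `text()` results — verify the real output shape of your `ddgs` version before relying on it:

```python
import json

# Sample of the assumed `ddgs text -o json` shape (not live output)
raw = '''[
  {"title": "FastAPI", "href": "https://fastapi.tiangolo.com/", "body": "FastAPI framework..."},
  {"title": "FastAPI Tutorial", "href": "https://fastapi.tiangolo.com/tutorial/", "body": "Step by step..."}
]'''

results = json.loads(raw)
lines = [f"{r['title']} -> {r['href']}" for r in results]
print("\n".join(lines))
```

In a real pipeline, `raw` would come from `sys.stdin.read()` after `ddgs text -q "..." -o json | python3 parse.py`.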

## Method 2: Python API (Only After Verification)

Use the `DDGS` class in `execute_code` or another Python runtime only after verifying that `ddgs` is installed there. Do not assume `execute_code` includes third-party packages by default.

Safe wording:
- "Use `execute_code` with `ddgs` after installing or verifying the package if needed"

Avoid saying:
- "`execute_code` includes `ddgs`"
- "DuckDuckGo search works by default in `execute_code`"

**Important:** `max_results` must always be passed as a **keyword argument** — positional usage raises an error on all methods.

### Text Search

Best for: general research, companies, documentation.

```python
from ddgs import DDGS

with DDGS() as ddgs:
    for r in ddgs.text("python async programming", max_results=5):
        print(r["title"])
        print(r["href"])
        print(r.get("body", "")[:200])
        print()
```

Returns: `title`, `href`, `body`

### News Search

Best for: current events, breaking news, latest updates.

```python
from ddgs import DDGS

with DDGS() as ddgs:
    for r in ddgs.news("AI regulation 2026", max_results=5):
        print(r["date"], "-", r["title"])
        print(r.get("source", ""), "|", r["url"])
        print(r.get("body", "")[:200])
        print()
```

Returns: `date`, `title`, `body`, `url`, `image`, `source`

### Image Search

Best for: visual references, product images, diagrams.

```python
from ddgs import DDGS

with DDGS() as ddgs:
    for r in ddgs.images("semiconductor chip", max_results=5):
        print(r["title"])
        print(r["image"])
        print(r.get("thumbnail", ""))
        print(r.get("source", ""))
        print()
```

Returns: `title`, `image`, `thumbnail`, `url`, `height`, `width`, `source`

### Video Search

Best for: tutorials, demos, explainers.

```python
from ddgs import DDGS

with DDGS() as ddgs:
    for r in ddgs.videos("FastAPI tutorial", max_results=5):
        print(r["title"])
        print(r.get("content", ""))
        print(r.get("duration", ""))
        print(r.get("provider", ""))
        print(r.get("published", ""))
        print()
```

Returns: `title`, `content`, `description`, `duration`, `provider`, `published`, `statistics`, `uploader`

### Quick Reference

| Method | Use When | Key Fields |
|--------|----------|------------|
| `text()` | General research, companies | title, href, body |
| `news()` | Current events, updates | date, title, source, body, url |
| `images()` | Visuals, diagrams | title, image, thumbnail, url |
| `videos()` | Tutorials, demos | title, content, duration, provider |

## Workflow: Search then Extract

DuckDuckGo returns titles, URLs, and snippets — not full page content. To get full page content, search first and then extract the most relevant URL with `web_extract`, browser tools, or curl.

CLI example:

```bash
ddgs text -q "fastapi deployment guide" -m 3 -o json
```

Python example, only after verifying `ddgs` is installed in that runtime:

```python
from ddgs import DDGS

with DDGS() as ddgs:
    results = list(ddgs.text("fastapi deployment guide", max_results=3))
    for r in results:
        print(r["title"], "->", r["href"])
```

Then extract the best URL with `web_extract` or another content-retrieval tool.
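
One way to choose which hit to extract is a small ranking helper over the returned result dicts; the helper name and the preferred-domain list are illustrative:

```python
def pick_url(results, preferred=("tiangolo.com", "readthedocs.io")):
    """Return the first result URL on a preferred domain, else the first result."""
    for r in results:
        if any(domain in r["href"] for domain in preferred):
            return r["href"]
    return results[0]["href"] if results else None

sample = [
    {"title": "Random blog", "href": "https://example.com/fastapi"},
    {"title": "Official docs", "href": "https://fastapi.tiangolo.com/deployment/"},
]
print(pick_url(sample))  # -> https://fastapi.tiangolo.com/deployment/
```

Preferring known-good documentation domains usually beats blindly extracting the top-ranked hit.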

## Limitations

- **Rate limiting**: DuckDuckGo may throttle after many rapid requests. Add a short delay between searches if needed.
- **No content extraction**: `ddgs` returns snippets, not full page content. Use `web_extract`, browser tools, or curl for the full article/page.
- **Results quality**: Generally good but less configurable than Firecrawl's search.
- **Availability**: DuckDuckGo may block requests from some cloud IPs. If searches return empty, try different keywords or wait a few seconds.
- **Field variability**: Return fields may vary between results or `ddgs` versions. Use `.get()` for optional fields to avoid `KeyError`.
- **Separate runtimes**: A successful `ddgs` install in terminal does not automatically mean `execute_code` can import it.
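
The rate-limiting and empty-result caveats can be handled with a small retry wrapper around whichever search callable is in use. The wrapper name and delay are illustrative; pass it any function that maps a query to a result list (e.g. a closure over `ddgs.text`):

```python
import time

def search_with_retry(search_fn, query, attempts=3, delay=2.0):
    """Call search_fn(query); pause and retry when it returns no results."""
    for attempt in range(attempts):
        results = search_fn(query)
        if results:
            return results
        if attempt < attempts - 1:
            time.sleep(delay)  # back off before retrying a possibly rate-limited search
    return []

# Demo with a stub that fails once, then succeeds
calls = {"n": 0}
def flaky(query):
    calls["n"] += 1
    return [] if calls["n"] == 1 else [{"title": query}]

print(search_with_retry(flaky, "python async", delay=0.01))
```

Injecting the search function keeps the retry logic runtime-agnostic: the same wrapper works for the CLI path (via a subprocess closure) or the Python API.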

## Troubleshooting

| Problem | Likely Cause | What To Do |
|---------|--------------|------------|
| `ddgs: command not found` | CLI not installed in the shell environment | Install `ddgs`, or use built-in web/browser tools instead |
| `ModuleNotFoundError: No module named 'ddgs'` | Python runtime does not have the package installed | Do not use Python DDGS there until that runtime is prepared |
| Search returns nothing | Temporary rate limiting or poor query | Wait a few seconds, retry, or adjust the query |
| CLI works but `execute_code` import fails | Terminal and `execute_code` are different runtimes | Keep using CLI, or separately prepare the Python runtime |

## Pitfalls

- **`max_results` is keyword-only**: `ddgs.text("query", 5)` raises an error. Use `ddgs.text("query", max_results=5)`.
- **Do not assume the CLI exists**: Check `command -v ddgs` before using it.
- **Do not assume `execute_code` can import `ddgs`**: `from ddgs import DDGS` may fail with `ModuleNotFoundError` unless that runtime was prepared separately.
- **Package name**: The package is `ddgs` (previously `duckduckgo-search`). Install with `pip install ddgs`.
- **Don't confuse `-q` and `-m`** (CLI): `-q` is for the query, `-m` is for max results count.
- **Empty results**: If `ddgs` returns nothing, it may be rate-limited. Wait a few seconds and retry.

## Validated With

Validated examples against `ddgs==9.11.2` semantics. Skill guidance now treats CLI availability and Python import availability as separate concerns so the documented workflow matches actual runtime behavior.

@@ -0,0 +1,231 @@

---
title: "Gitnexus Explorer"
sidebar_label: "Gitnexus Explorer"
description: "Index a codebase with GitNexus and serve an interactive knowledge graph via web UI + Cloudflare tunnel"
---

{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}

# Gitnexus Explorer

Index a codebase with GitNexus and serve an interactive knowledge graph via web UI + Cloudflare tunnel.

## Skill metadata

| | |
|---|---|
| Source | Optional — install with `hermes skills install official/research/gitnexus-explorer` |
| Path | `optional-skills/research/gitnexus-explorer` |
| Version | `1.0.0` |
| Author | Hermes Agent + Teknium |
| License | MIT |
| Tags | `gitnexus`, `code-intelligence`, `knowledge-graph`, `visualization` |
| Related skills | [`native-mcp`](/docs/user-guide/skills/bundled/mcp/mcp-native-mcp), [`codebase-inspection`](/docs/user-guide/skills/bundled/github/github-codebase-inspection) |

## Reference: full SKILL.md

:::info
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
:::

# GitNexus Explorer

Index any codebase into a knowledge graph and serve an interactive web UI for exploring symbols, call chains, clusters, and execution flows. Tunneled via Cloudflare for remote access.

## When to Use

- User wants to visually explore a codebase's architecture
- User asks for a knowledge graph / dependency graph of a repo
- User wants to share an interactive codebase explorer with someone

## Prerequisites

- **Node.js** (v18+) — required for GitNexus and the proxy
- **git** — repo must have a `.git` directory
- **cloudflared** — for tunneling (auto-installed to ~/.local/bin if missing)

## Size Warning

The web UI renders all nodes in the browser. Repos under ~5,000 files work well. Large repos (30k+ nodes) will be sluggish or crash the browser tab. The CLI/MCP tools work at any scale — only the web visualization has this limit.
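
A quick pre-check before opening the web UI is to count files under the repo; the helper name is illustrative, and the 5,000 cutoff mirrors the warning above:

```shell
# Count regular files under a directory, skipping .git metadata
file_count() {
  find "$1" -path '*/.git' -prune -o -type f -print | wc -l | tr -d ' '
}

n=$(file_count .)
if [ "$n" -gt 5000 ]; then
  echo "large repo ($n files): web UI may struggle, CLI/MCP tools still fine"
else
  echo "ok ($n files)"
fi
```

File count is only a proxy for node count, but it is cheap enough to run before committing to a full index-and-serve cycle.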

## Steps

### 1. Clone and Build GitNexus (one-time setup)

```bash
GITNEXUS_DIR="${GITNEXUS_DIR:-$HOME/.local/share/gitnexus}"

if [ ! -d "$GITNEXUS_DIR/gitnexus-web/dist" ]; then
  git clone https://github.com/abhigyanpatwari/GitNexus.git "$GITNEXUS_DIR"
  cd "$GITNEXUS_DIR/gitnexus-shared" && npm install && npm run build
  cd "$GITNEXUS_DIR/gitnexus-web" && npm install
fi
```

### 2. Patch the Web UI for Remote Access

The web UI defaults to `localhost:4747` for API calls. Patch it to use same-origin so it works through a tunnel/proxy:

**File: `$GITNEXUS_DIR/gitnexus-web/src/config/ui-constants.ts`**
Change:
```typescript
export const DEFAULT_BACKEND_URL = 'http://localhost:4747';
```
To:
```typescript
export const DEFAULT_BACKEND_URL = typeof window !== 'undefined' && window.location.hostname !== 'localhost' ? window.location.origin : 'http://localhost:4747';
```

**File: `$GITNEXUS_DIR/gitnexus-web/vite.config.ts`**
Add `allowedHosts: true` inside the `server: { }` block (only needed if running dev mode instead of production build):
```typescript
server: {
  allowedHosts: true,
  // ... existing config
},
```

Then build the production bundle:
```bash
cd "$GITNEXUS_DIR/gitnexus-web" && npx vite build
```

### 3. Index the Target Repo

```bash
cd /path/to/target-repo
npx gitnexus analyze --skip-agents-md
rm -rf .claude/  # remove Claude Code-specific artifacts
```

Add `--embeddings` for semantic search (slower — minutes instead of seconds).

The index lives in `.gitnexus/` inside the repo (auto-gitignored).

### 4. Create the Proxy Script

Write this to a file (e.g., `$GITNEXUS_DIR/proxy.mjs`). It serves the production web UI and proxies `/api/*` to the GitNexus backend — same origin, no CORS issues, no sudo, no nginx.

```javascript
import http from 'node:http';
import fs from 'node:fs';
import path from 'node:path';

const API_PORT = parseInt(process.env.API_PORT || '4747');
const DIST_DIR = process.argv[2] || './dist';
const PORT = parseInt(process.argv[3] || '8888');

const MIME = {
  '.html': 'text/html', '.js': 'application/javascript', '.css': 'text/css',
  '.json': 'application/json', '.png': 'image/png', '.svg': 'image/svg+xml',
  '.ico': 'image/x-icon', '.woff2': 'font/woff2', '.woff': 'font/woff',
  '.wasm': 'application/wasm',
};

function proxyToApi(req, res) {
  const opts = {
    hostname: '127.0.0.1', port: API_PORT,
    path: req.url, method: req.method, headers: req.headers,
  };
  const proxy = http.request(opts, (upstream) => {
    res.writeHead(upstream.statusCode, upstream.headers);
    upstream.pipe(res, { end: true });
  });
  proxy.on('error', () => { res.writeHead(502); res.end('Backend unavailable'); });
  req.pipe(proxy, { end: true });
}

function serveStatic(req, res) {
  let filePath = path.join(DIST_DIR, req.url === '/' ? 'index.html' : req.url.split('?')[0]);
  if (!fs.existsSync(filePath)) filePath = path.join(DIST_DIR, 'index.html');
  const ext = path.extname(filePath);
  const mime = MIME[ext] || 'application/octet-stream';
  try {
    const data = fs.readFileSync(filePath);
    res.writeHead(200, { 'Content-Type': mime, 'Cache-Control': 'public, max-age=3600' });
    res.end(data);
  } catch { res.writeHead(404); res.end('Not found'); }
}

http.createServer((req, res) => {
  if (req.url.startsWith('/api')) proxyToApi(req, res);
  else serveStatic(req, res);
}).listen(PORT, () => console.log(`GitNexus proxy on http://localhost:${PORT}`));
```

### 5. Start the Services

```bash
# Terminal 1: GitNexus backend API
npx gitnexus serve &

# Terminal 2: Proxy (web UI + API on one port)
node "$GITNEXUS_DIR/proxy.mjs" "$GITNEXUS_DIR/gitnexus-web/dist" 8888 &
```

Verify: `curl -s http://localhost:8888/api/repos` should return the indexed repo(s).

### 6. Tunnel with Cloudflare (optional — for remote access)

```bash
# Install cloudflared if needed (no sudo)
if ! command -v cloudflared &>/dev/null; then
  mkdir -p ~/.local/bin
  curl -sL https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 \
    -o ~/.local/bin/cloudflared
  chmod +x ~/.local/bin/cloudflared
  export PATH="$HOME/.local/bin:$PATH"
fi

# Start tunnel (--config /dev/null avoids conflicts with existing named tunnels)
cloudflared tunnel --config /dev/null --url http://localhost:8888 --no-autoupdate --protocol http2
```

The tunnel URL (e.g., `https://random-words.trycloudflare.com`) is printed to stderr. Share it — anyone with the link can explore the graph.
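
Since the URL goes to stderr, a pipeline like `cloudflared ... 2>&1 | grep -oE 'https://[a-z0-9-]+\.trycloudflare\.com'` can capture it for scripting. A self-contained sketch of just the extraction step, where the log line is a sample of the expected format rather than live output:

```shell
# Extract the quick-tunnel hostname from a captured cloudflared log line
LOG_LINE="2026-01-01T00:00:00Z INF |  https://random-words.trycloudflare.com  |"
TUNNEL_URL=$(echo "$LOG_LINE" | grep -oE 'https://[a-z0-9-]+\.trycloudflare\.com')
echo "$TUNNEL_URL"
```

Capturing the URL into a variable makes it easy to paste into a chat message or a follow-up `curl` health check.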

### 7. Cleanup

```bash
# Stop services
pkill -f "gitnexus serve"
pkill -f "proxy.mjs"
pkill -f cloudflared

# Remove index from the target repo
cd /path/to/target-repo
npx gitnexus clean
rm -rf .claude/
```
## Pitfalls

- **`--config /dev/null` is required for cloudflared** if the user has an existing named tunnel config at `~/.cloudflared/config.yml`. Without it, the catch-all ingress rule in the config returns 404 for all quick tunnel requests.

- **Production build is mandatory for tunneling.** The Vite dev server blocks non-localhost hosts by default (`allowedHosts`). The production build + Node proxy avoids this entirely.

- **The web UI does NOT create `.claude/` or `CLAUDE.md`.** Those are created by `npx gitnexus analyze`. Use `--skip-agents-md` to suppress the markdown files, then `rm -rf .claude/` for the rest. These are Claude Code integrations that hermes-agent users don't need.

- **Browser memory limit.** The web UI loads the entire graph into browser memory. Repos with 5k+ files may be sluggish. 30k+ files will likely crash the tab.

- **Embeddings are optional.** `--embeddings` enables semantic search but takes minutes on large repos. Skip it for quick exploration; add it if you want natural language queries via the AI chat panel.

- **Multiple repos.** `gitnexus serve` serves ALL indexed repos. Index several repos, start serve once, and the web UI lets you switch between them.

@@ -0,0 +1,408 @@

---
title: "Parallel Cli"
sidebar_label: "Parallel Cli"
description: "Optional vendor skill for Parallel CLI — agent-native web search, extraction, deep research, enrichment, FindAll, and monitoring"
---

{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}

# Parallel Cli

Optional vendor skill for Parallel CLI — agent-native web search, extraction, deep research, enrichment, FindAll, and monitoring. Prefer JSON output and non-interactive flows.

## Skill metadata

| | |
|---|---|
| Source | Optional — install with `hermes skills install official/research/parallel-cli` |
| Path | `optional-skills/research/parallel-cli` |
| Version | `1.1.0` |
| Author | Hermes Agent |
| License | MIT |
| Tags | `Research`, `Web`, `Search`, `Deep-Research`, `Enrichment`, `CLI` |
| Related skills | [`duckduckgo-search`](/docs/user-guide/skills/optional/research/research-duckduckgo-search), [`mcporter`](/docs/user-guide/skills/optional/mcp/mcp-mcporter) |

## Reference: full SKILL.md

:::info
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
:::

# Parallel CLI

Use `parallel-cli` when the user explicitly wants Parallel, or when a terminal-native workflow would benefit from Parallel's vendor-specific stack for web search, extraction, deep research, enrichment, entity discovery, or monitoring.

This is an optional third-party workflow, not a Hermes core capability.

Important expectations:
- Parallel is a paid service with a free tier, not a fully free local tool.
- It overlaps with Hermes native `web_search` / `web_extract`, so do not prefer it by default for ordinary lookups.
- Prefer this skill when the user mentions Parallel specifically or needs capabilities like Parallel's enrichment, FindAll, or monitor workflows.

`parallel-cli` is designed for agents:
- JSON output via `--json`
- Non-interactive command execution
- Async long-running jobs with `--no-wait`, `status`, and `poll`
- Context chaining with `--previous-interaction-id`
- Search, extract, research, enrichment, entity discovery, and monitoring in one CLI
## When to use it

Prefer this skill when:
- The user explicitly mentions Parallel or `parallel-cli`
- The task needs richer workflows than a simple one-shot search/extract pass
- You need async deep research jobs that can be launched and polled later
- You need structured enrichment, FindAll entity discovery, or monitoring

Prefer Hermes native `web_search` / `web_extract` for quick one-off lookups when Parallel is not specifically requested.
## Installation

Try the least invasive install path available for the environment.

### Homebrew

```bash
brew install parallel-web/tap/parallel-cli
```

### npm

```bash
npm install -g parallel-web-cli
```

### Python package

```bash
pip install "parallel-web-tools[cli]"
```

### Standalone installer

```bash
curl -fsSL https://parallel.ai/install.sh | bash
```

If you want an isolated Python install, `pipx` can also work:

```bash
pipx install "parallel-web-tools[cli]"
pipx ensurepath
```
## Authentication

Interactive login:

```bash
parallel-cli login
```

Headless / SSH / CI:

```bash
parallel-cli login --device
```

API key environment variable:

```bash
export PARALLEL_API_KEY="***"
```

Verify current auth status:

```bash
parallel-cli auth
```

If auth requires browser interaction, run with `pty=true`.
## Core rule set

1. Always prefer `--json` when you need machine-readable output.
2. Prefer explicit arguments and non-interactive flows.
3. For long-running jobs, use `--no-wait` and then `status` / `poll`.
4. Cite only URLs returned by the CLI output.
5. Save large JSON outputs to a temp file when follow-up questions are likely.
6. Use background processes only for genuinely long-running workflows; otherwise run in foreground.
7. Prefer Hermes native tools unless the user wants Parallel specifically or needs Parallel-only workflows.
## Quick reference

```text
parallel-cli
├── auth
├── login
├── logout
├── search
├── extract / fetch
├── research run|status|poll|processors
├── enrich run|status|poll|plan|suggest|deploy
├── findall run|ingest|status|poll|result|enrich|extend|schema|cancel
└── monitor create|list|get|update|delete|events|event-group|simulate
```
## Common flags and patterns

Commonly useful flags:
- `--json` for structured output
- `--no-wait` for async jobs
- `--previous-interaction-id <id>` for follow-up tasks that reuse earlier context
- `--max-results <n>` for search result count
- `--mode one-shot|agentic` for search behavior
- `--include-domains domain1.com,domain2.com`
- `--exclude-domains domain1.com,domain2.com`
- `--after-date YYYY-MM-DD`

Read from stdin when convenient:

```bash
echo "What is the latest funding for Anthropic?" | parallel-cli search - --json
echo "Research question" | parallel-cli research run - --json
```
## Search

Use for current web lookups with structured results.

```bash
parallel-cli search "What is Anthropic's latest AI model?" --json
parallel-cli search "SEC filings for Apple" --include-domains sec.gov --json
parallel-cli search "bitcoin price" --after-date 2026-01-01 --max-results 10 --json
parallel-cli search "latest browser benchmarks" --mode one-shot --json
parallel-cli search "AI coding agent enterprise reviews" --mode agentic --json
```

Useful constraints:
- `--include-domains` to narrow trusted sources
- `--exclude-domains` to strip noisy domains
- `--after-date` for recency filtering
- `--max-results` when you need broader coverage

If you expect follow-up questions, save output:

```bash
parallel-cli search "latest React 19 changes" --json -o /tmp/react-19-search.json
```

When summarizing results:
- lead with the answer
- include dates, names, and concrete facts
- cite only returned sources
- avoid inventing URLs or source titles
## Extraction

Use to pull clean content or markdown from a URL.

```bash
parallel-cli extract https://example.com --json
parallel-cli extract https://company.com --objective "Find pricing info" --json
parallel-cli extract https://example.com --full-content --json
parallel-cli fetch https://example.com --json
```

Use `--objective` when the page is broad and you only need one slice of information.
## Deep research

Use for deeper multi-step research tasks that may take time.

Common processor tiers:
- `lite` / `base` for faster, cheaper passes
- `core` / `pro` for more thorough synthesis
- `ultra` for the heaviest research jobs

### Synchronous

```bash
parallel-cli research run \
  "Compare the leading AI coding agents by pricing, model support, and enterprise controls" \
  --processor core \
  --json
```

### Async launch + poll

```bash
parallel-cli research run \
  "Compare the leading AI coding agents by pricing, model support, and enterprise controls" \
  --processor ultra \
  --no-wait \
  --json

parallel-cli research status trun_xxx --json
parallel-cli research poll trun_xxx --json
parallel-cli research processors --json
```

### Context chaining / follow-up

```bash
parallel-cli research run "What are the top AI coding agents?" --json
parallel-cli research run \
  "What enterprise controls does the top-ranked one offer?" \
  --previous-interaction-id trun_xxx \
  --json
```

Recommended Hermes workflow:
1. launch with `--no-wait --json`
2. capture the returned run/task ID
3. if the user wants to continue other work, keep moving
4. later call `status` or `poll`
5. summarize the final report with citations from the returned sources
|
||||
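The launch-and-poll workflow can be sketched in Python. The commands are taken from the examples above, but the JSON field names (`run_id`, `status`) and terminal status values are assumptions — inspect the real `--json` output before relying on them:

```python
import json
import subprocess
import time

def parse_run_id(payload: str) -> str:
    """Pull the run/task ID out of the launch response.
    The "run_id" key is an assumption -- check the actual --json output."""
    return json.loads(payload)["run_id"]

def launch_and_poll(topic: str, processor: str = "ultra", interval: int = 30) -> dict:
    # 1. Launch without waiting and capture the ID.
    launch = subprocess.run(
        ["parallel-cli", "research", "run", topic,
         "--processor", processor, "--no-wait", "--json"],
        capture_output=True, text=True, check=True,
    )
    run_id = parse_run_id(launch.stdout)
    # 2-4. Periodically poll until the run reaches a terminal state.
    while True:
        status = subprocess.run(
            ["parallel-cli", "research", "status", run_id, "--json"],
            capture_output=True, text=True, check=True,
        )
        result = json.loads(status.stdout)
        if result.get("status") in ("completed", "failed"):  # status values assumed
            return result
        time.sleep(interval)
```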
## Enrichment

Use when the user has CSV/JSON/tabular inputs and wants additional columns inferred from web research.

### Suggest columns

```bash
parallel-cli enrich suggest "Find the CEO and annual revenue" --json
```

### Plan a config

```bash
parallel-cli enrich plan -o config.yaml
```

### Inline data

```bash
parallel-cli enrich run \
  --data '[{"company": "Anthropic"}, {"company": "Mistral"}]' \
  --intent "Find headquarters and employee count" \
  --json
```

### Non-interactive file run

```bash
parallel-cli enrich run \
  --source-type csv \
  --source companies.csv \
  --target enriched.csv \
  --source-columns '[{"name": "company", "description": "Company name"}]' \
  --intent "Find the CEO and annual revenue"
```

### YAML config run

```bash
parallel-cli enrich run config.yaml
```

### Status / polling

```bash
parallel-cli enrich status <task_group_id> --json
parallel-cli enrich poll <task_group_id> --json
```

Use explicit JSON arrays for column definitions when operating non-interactively.
Validate the output file before reporting success.
## FindAll

Use for web-scale entity discovery when the user wants a discovered dataset rather than a short answer.

```bash
parallel-cli findall run "Find AI coding agent startups with enterprise offerings" --json
parallel-cli findall run "AI startups in healthcare" -n 25 --json
parallel-cli findall status <run_id> --json
parallel-cli findall poll <run_id> --json
parallel-cli findall result <run_id> --json
parallel-cli findall schema <run_id> --json
```

This is a better fit than ordinary search when the user wants a discovered set of entities that can be reviewed, filtered, or enriched later.

## Monitor

Use for ongoing change detection over time.

```bash
parallel-cli monitor list --json
parallel-cli monitor get <monitor_id> --json
parallel-cli monitor events <monitor_id> --json
parallel-cli monitor delete <monitor_id> --json
```

Creation is usually the sensitive part because cadence and delivery matter:

```bash
parallel-cli monitor create --help
```

Use this when the user wants recurring tracking of a page or source rather than a one-time fetch.

## Recommended Hermes usage patterns

### Fast answer with citations
1. Run `parallel-cli search ... --json`
2. Parse titles, URLs, dates, excerpts
3. Summarize with inline citations from the returned URLs only
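Step 2 can be sketched as a small parsing helper. The `results`, `title`, `url`, `date`, and `excerpt` keys are assumptions about the payload shape — verify them against the real `--json` output:

```python
import json

def summarize_sources(payload: str, limit: int = 5) -> list[dict]:
    """Extract citation-ready fields from search output.
    Field names are assumed; missing fields come back as None."""
    results = json.loads(payload).get("results", [])
    return [
        {k: r.get(k) for k in ("title", "url", "date", "excerpt")}
        for r in results[:limit]
    ]
```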
### URL investigation
1. Run `parallel-cli extract URL --json`
2. If needed, rerun with `--objective` or `--full-content`
3. Quote or summarize the extracted markdown

### Long research workflow
1. Run `parallel-cli research run ... --no-wait --json`
2. Store the returned ID
3. Continue other work or periodically poll
4. Summarize the final report with citations

### Structured enrichment workflow
1. Inspect the input file and columns
2. Use `enrich suggest` or provide explicit enriched columns
3. Run `enrich run`
4. Poll for completion if needed
5. Validate the output file before reporting success

## Error handling and exit codes

The CLI documents these exit codes:
- `0` success
- `2` bad input
- `3` auth error
- `4` API error
- `5` timeout
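A small helper makes the documented codes actionable in scripts. Treating timeouts as the only retryable failure is a policy choice here, not CLI behavior:

```python
# Exit-code meanings documented by the CLI.
EXIT_CODES = {
    0: "success",
    2: "bad input",
    3: "auth error",
    4: "API error",
    5: "timeout",
}

def classify_exit(returncode: int) -> str:
    """Map a parallel-cli return code to its documented meaning."""
    return EXIT_CODES.get(returncode, f"unknown exit code {returncode}")

def should_retry(returncode: int) -> bool:
    """Only retry timeouts automatically; auth and input errors need a human."""
    return returncode == 5
```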
If you hit auth errors:
1. check `parallel-cli auth`
2. confirm `PARALLEL_API_KEY` or run `parallel-cli login` / `parallel-cli login --device`
3. verify `parallel-cli` is on `PATH`

## Maintenance

Check current auth / install state:

```bash
parallel-cli auth
parallel-cli --help
```

Update commands:

```bash
parallel-cli update
pip install --upgrade parallel-web-tools
parallel-cli config auto-update-check off
```

## Pitfalls

- Do not omit `--json` unless the user explicitly wants human-formatted output.
- Do not cite sources not present in the CLI output.
- `login` may require PTY/browser interaction.
- Prefer foreground execution for short tasks; do not overuse background processes.
- For large result sets, save JSON to `/tmp/*.json` instead of stuffing everything into context.
- Do not silently choose Parallel when Hermes native tools are already sufficient.
- Remember this is a vendor workflow that usually requires account auth and paid usage beyond the free tier.

website/docs/user-guide/skills/optional/research/research-qmd.md
---
title: "Qmd"
sidebar_label: "Qmd"
description: "Search personal knowledge bases, notes, docs, and meeting transcripts locally using qmd — a hybrid retrieval engine with BM25, vector search, and LLM reranking"
---

{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}

# Qmd

Search personal knowledge bases, notes, docs, and meeting transcripts locally using qmd — a hybrid retrieval engine with BM25, vector search, and LLM reranking. Supports CLI and MCP integration.

## Skill metadata

| | |
|---|---|
| Source | Optional — install with `hermes skills install official/research/qmd` |
| Path | `optional-skills/research/qmd` |
| Version | `1.0.0` |
| Author | Hermes Agent + Teknium |
| License | MIT |
| Platforms | macos, linux |
| Tags | `Search`, `Knowledge-Base`, `RAG`, `Notes`, `MCP`, `Local-AI` |
| Related skills | [`obsidian`](/docs/user-guide/skills/bundled/note-taking/note-taking-obsidian), [`native-mcp`](/docs/user-guide/skills/bundled/mcp/mcp-native-mcp), [`arxiv`](/docs/user-guide/skills/bundled/research/research-arxiv) |

## Reference: full SKILL.md

:::info
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
:::

# QMD — Query Markup Documents

Local, on-device search engine for personal knowledge bases. Indexes markdown notes, meeting transcripts, documentation, and any text-based files, then provides hybrid search combining keyword matching, semantic understanding, and LLM-powered reranking — all running locally with no cloud dependencies.

Created by [Tobi Lütke](https://github.com/tobi/qmd). MIT licensed.

## When to Use

- User asks to search their notes, docs, knowledge base, or meeting transcripts
- User wants to find something across a large collection of markdown/text files
- User wants semantic search ("find notes about X concept") not just keyword grep
- User has already set up qmd collections and wants to query them
- User asks to set up a local knowledge base or document search system
- Keywords: "search my notes", "find in my docs", "knowledge base", "qmd"

## Prerequisites

### Node.js >= 22 (required)

```bash
# Check version
node --version  # must be >= 22

# macOS — install or upgrade via Homebrew
brew install node@22

# Linux — use NodeSource or nvm
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt-get install -y nodejs
# or with nvm:
nvm install 22 && nvm use 22
```

### SQLite with Extension Support (macOS only)

macOS system SQLite lacks extension loading. Install via Homebrew:

```bash
brew install sqlite
```

### Install qmd

```bash
npm install -g @tobilu/qmd
# or with Bun:
bun install -g @tobilu/qmd
```

First run auto-downloads 3 local GGUF models (~2GB total):

| Model | Purpose | Size |
|-------|---------|------|
| embeddinggemma-300M-Q8_0 | Vector embeddings | ~300MB |
| qwen3-reranker-0.6b-q8_0 | Result reranking | ~640MB |
| qmd-query-expansion-1.7B | Query expansion | ~1.1GB |

### Verify Installation

```bash
qmd --version
qmd status
```

## Quick Reference

| Command | What It Does | Speed |
|---------|-------------|-------|
| `qmd search "query"` | BM25 keyword search (no models) | ~0.2s |
| `qmd vsearch "query"` | Semantic vector search (1 model) | ~3s |
| `qmd query "query"` | Hybrid + reranking (all 3 models) | ~2-3s warm, ~19s cold |
| `qmd get <docid>` | Retrieve full document content | instant |
| `qmd multi-get "glob"` | Retrieve multiple files | instant |
| `qmd collection add <path> --name <n>` | Add a directory as a collection | instant |
| `qmd context add <path> "description"` | Add context metadata to improve retrieval | instant |
| `qmd embed` | Generate/update vector embeddings | varies |
| `qmd status` | Show index health and collection info | instant |
| `qmd mcp` | Start MCP server (stdio) | persistent |
| `qmd mcp --http --daemon` | Start MCP server (HTTP, warm models) | persistent |

## Setup Workflow

### 1. Add Collections

Point qmd at directories containing your documents:

```bash
# Add a notes directory
qmd collection add ~/notes --name notes

# Add project docs
qmd collection add ~/projects/myproject/docs --name project-docs

# Add meeting transcripts
qmd collection add ~/meetings --name meetings

# List all collections
qmd collection list
```

### 2. Add Context Descriptions

Context metadata helps the search engine understand what each collection contains. This significantly improves retrieval quality:

```bash
qmd context add qmd://notes "Personal notes, ideas, and journal entries"
qmd context add qmd://project-docs "Technical documentation for the main project"
qmd context add qmd://meetings "Meeting transcripts and action items from team syncs"
```

### 3. Generate Embeddings

```bash
qmd embed
```

This processes all documents in all collections and generates vector embeddings. Re-run after adding new documents or collections.

### 4. Verify

```bash
qmd status  # shows index health, collection stats, model info
```

## Search Patterns

### Fast Keyword Search (BM25)

Best for: exact terms, code identifiers, names, known phrases. No models loaded — near-instant results.

```bash
qmd search "authentication middleware"
qmd search "handleError async"
```

### Semantic Vector Search

Best for: natural language questions, conceptual queries. Loads embedding model (~3s first query).

```bash
qmd vsearch "how does the rate limiter handle burst traffic"
qmd vsearch "ideas for improving onboarding flow"
```

### Hybrid Search with Reranking (Best Quality)

Best for: important queries where quality matters most. Uses all 3 models — query expansion, parallel BM25+vector, reranking.

```bash
qmd query "what decisions were made about the database migration"
```

### Structured Multi-Mode Queries

Combine different search types in a single query for precision:

```bash
# BM25 for exact term + vector for concept
qmd query $'lex: rate limiter\nvec: how does throttling work under load'

# With query expansion
qmd query $'expand: database migration plan\nlex: "schema change"'
```

### Query Syntax (lex/BM25 mode)

| Syntax | Effect | Example |
|--------|--------|---------|
| `term` | Prefix match | `perf` matches "performance" |
| `"phrase"` | Exact phrase | `"rate limiter"` |
| `-term` | Exclude term | `performance -sports` |

### HyDE (Hypothetical Document Embeddings)

For complex topics, write what you expect the answer to look like:

```bash
qmd query $'hyde: The migration plan involves three phases. First, we add the new columns without dropping the old ones. Then we backfill data. Finally we cut over and remove legacy columns.'
```

### Scoping to Collections

```bash
qmd search "query" --collection notes
qmd query "query" --collection project-docs
```

### Output Formats

```bash
qmd search "query" --json              # JSON output (best for parsing)
qmd search "query" --limit 5           # Limit results
qmd get "#abc123"                      # Get by document ID
qmd get "path/to/file.md"              # Get by file path
qmd get "file.md:50" -l 100            # Get specific line range
qmd multi-get "journals/*.md" --json   # Batch retrieve by glob
```

## MCP Integration (Recommended)

qmd exposes an MCP server that provides search tools directly to Hermes Agent via the native MCP client. This is the preferred integration — once configured, the agent gets qmd tools automatically without needing to load this skill.

### Option A: Stdio Mode (Simple)

Add to `~/.hermes/config.yaml`:

```yaml
mcp_servers:
  qmd:
    command: "qmd"
    args: ["mcp"]
    timeout: 30
    connect_timeout: 45
```

This registers tools: `mcp_qmd_search`, `mcp_qmd_vsearch`, `mcp_qmd_deep_search`, `mcp_qmd_get`, `mcp_qmd_status`.

**Tradeoff:** Models load on first search call (~19s cold start), then stay warm for the session. Acceptable for occasional use.

### Option B: HTTP Daemon Mode (Fast, Recommended for Heavy Use)

Start the qmd daemon separately — it keeps models warm in memory:

```bash
# Start daemon (persists across agent restarts)
qmd mcp --http --daemon

# Runs on http://localhost:8181 by default
```

Then configure Hermes Agent to connect via HTTP:

```yaml
mcp_servers:
  qmd:
    url: "http://localhost:8181/mcp"
    timeout: 30
```

**Tradeoff:** Uses ~2GB RAM while running, but every query is fast (~2-3s). Best for users who search frequently.

### Keeping the Daemon Running

#### macOS (launchd)

```bash
cat > ~/Library/LaunchAgents/com.qmd.daemon.plist << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.qmd.daemon</string>
  <key>ProgramArguments</key>
  <array>
    <string>qmd</string>
    <string>mcp</string>
    <string>--http</string>
    <string>--daemon</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>StandardOutPath</key>
  <string>/tmp/qmd-daemon.log</string>
  <key>StandardErrorPath</key>
  <string>/tmp/qmd-daemon.log</string>
</dict>
</plist>
EOF

launchctl load ~/Library/LaunchAgents/com.qmd.daemon.plist
```

#### Linux (systemd user service)

```bash
mkdir -p ~/.config/systemd/user

cat > ~/.config/systemd/user/qmd-daemon.service << 'EOF'
[Unit]
Description=QMD MCP Daemon
After=network.target

[Service]
ExecStart=qmd mcp --http --daemon
Restart=on-failure
RestartSec=10
Environment=PATH=/usr/local/bin:/usr/bin:/bin

[Install]
WantedBy=default.target
EOF

systemctl --user daemon-reload
systemctl --user enable --now qmd-daemon
systemctl --user status qmd-daemon
```

### MCP Tools Reference

Once connected, these tools are available as `mcp_qmd_*`:

| MCP Tool | Maps To | Description |
|----------|---------|-------------|
| `mcp_qmd_search` | `qmd search` | BM25 keyword search |
| `mcp_qmd_vsearch` | `qmd vsearch` | Semantic vector search |
| `mcp_qmd_deep_search` | `qmd query` | Hybrid search + reranking |
| `mcp_qmd_get` | `qmd get` | Retrieve document by ID or path |
| `mcp_qmd_status` | `qmd status` | Index health and stats |

The MCP tools accept structured JSON queries for multi-mode search:

```json
{
  "searches": [
    {"type": "lex", "query": "authentication middleware"},
    {"type": "vec", "query": "how user login is verified"}
  ],
  "collections": ["project-docs"],
  "limit": 10
}
```

## CLI Usage (Without MCP)

When MCP is not configured, use qmd directly via terminal:

```
terminal(command="qmd query 'what was decided about the API redesign' --json", timeout=30)
```

For setup and management tasks, always use terminal:

```
terminal(command="qmd collection add ~/Documents/notes --name notes")
terminal(command="qmd context add qmd://notes 'Personal research notes and ideas'")
terminal(command="qmd embed")
terminal(command="qmd status")
```

## How the Search Pipeline Works

Understanding the internals helps choose the right search mode:

1. **Query Expansion** — A fine-tuned 1.7B model generates 2 alternative queries. The original gets 2x weight in fusion.
2. **Parallel Retrieval** — BM25 (SQLite FTS5) and vector search run simultaneously across all query variants.
3. **RRF Fusion** — Reciprocal Rank Fusion (k=60) merges results. Top-rank bonus: #1 gets +0.05, #2-3 get +0.02.
4. **LLM Reranking** — qwen3-reranker scores top 30 candidates (0.0-1.0).
5. **Position-Aware Blending** — Ranks 1-3: 75% retrieval / 25% reranker. Ranks 4-10: 60/40. Ranks 11+: 40/60 (trusts reranker more for long tail).
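Steps 3 and 5 can be sketched in Python to make the scoring concrete. This is an illustration of the documented constants, not qmd's actual implementation:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Merge several ranked lists with Reciprocal Rank Fusion,
    plus the documented top-rank bonus (+0.05 for #1, +0.02 for #2-3)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
            if rank == 1:
                scores[doc_id] += 0.05
            elif rank <= 3:
                scores[doc_id] += 0.02
    return scores

def blend(rank: int, retrieval_score: float, rerank_score: float) -> float:
    """Position-aware blending: trust retrieval near the top,
    the reranker for the long tail."""
    if rank <= 3:
        w = 0.75
    elif rank <= 10:
        w = 0.60
    else:
        w = 0.40
    return w * retrieval_score + (1 - w) * rerank_score
```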
**Smart Chunking:** Documents are split at natural break points (headings, code blocks, blank lines) targeting ~900 tokens with 15% overlap. Code blocks are never split mid-block.
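A greedy illustration of that chunking idea, approximating tokens as words. This is a sketch of the general technique, not qmd's actual chunker:

```python
def chunk_markdown(text: str, target_tokens: int = 900, overlap: float = 0.15) -> list[str]:
    """Accumulate blocks until the token budget or a heading boundary,
    then start the next chunk with an overlapping tail of blocks."""
    blocks = [b for b in text.split("\n\n") if b.strip()]
    chunks, current, count = [], [], 0
    for block in blocks:
        words = len(block.split())
        if current and (count + words > target_tokens or block.startswith("#")):
            chunks.append("\n\n".join(current))
            keep = max(1, int(len(current) * overlap))
            current = current[-keep:]  # carry overlap into the next chunk
            count = sum(len(b.split()) for b in current)
        current.append(block)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```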
## Best Practices

1. **Always add context descriptions** — `qmd context add` dramatically improves retrieval accuracy. Describe what each collection contains.
2. **Re-embed after adding documents** — `qmd embed` must be re-run when new files are added to collections.
3. **Use `qmd search` for speed** — when you need fast keyword lookup (code identifiers, exact names), BM25 is instant and needs no models.
4. **Use `qmd query` for quality** — when the question is conceptual or the user needs the best possible results, use hybrid search.
5. **Prefer MCP integration** — once configured, the agent gets native tools without needing to load this skill each time.
6. **Daemon mode for frequent users** — if the user searches their knowledge base regularly, recommend the HTTP daemon setup.
7. **First query in structured search gets 2x weight** — put the most important/certain query first when combining lex and vec.

## Troubleshooting

### "Models downloading on first run"
Normal — qmd auto-downloads ~2GB of GGUF models on first use. This is a one-time operation.

### Cold start latency (~19s)
This happens when models aren't loaded in memory. Solutions:
- Use HTTP daemon mode (`qmd mcp --http --daemon`) to keep warm
- Use `qmd search` (BM25 only) when models aren't needed
- MCP stdio mode loads models on first search, stays warm for session

### macOS: "unable to load extension"
Install Homebrew SQLite: `brew install sqlite`
Then ensure it's on PATH before system SQLite.

### "No collections found"
Run `qmd collection add <path> --name <name>` to add directories, then `qmd embed` to index them.

### Embedding model override (CJK/multilingual)
Set `QMD_EMBED_MODEL` environment variable for non-English content:
```bash
export QMD_EMBED_MODEL="your-multilingual-model"
```

## Data Storage

- **Index & vectors:** `~/.cache/qmd/index.sqlite`
- **Models:** Auto-downloaded to local cache on first run
- **No cloud dependencies** — everything runs locally

## References

- [GitHub: tobi/qmd](https://github.com/tobi/qmd)
- [QMD Changelog](https://github.com/tobi/qmd/blob/main/CHANGELOG.md)
---
|
||||
title: "Scrapling"
|
||||
sidebar_label: "Scrapling"
|
||||
description: "Web scraping with Scrapling - HTTP fetching, stealth browser automation, Cloudflare bypass, and spider crawling via CLI and Python"
|
||||
---
|
||||
|
||||
{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}
|
||||
|
||||
# Scrapling
|
||||
|
||||
Web scraping with Scrapling - HTTP fetching, stealth browser automation, Cloudflare bypass, and spider crawling via CLI and Python.
|
||||
|
||||
## Skill metadata
|
||||
|
||||
| | |
|
||||
|---|---|
|
||||
| Source | Optional — install with `hermes skills install official/research/scrapling` |
|
||||
| Path | `optional-skills/research/scrapling` |
|
||||
| Version | `1.0.0` |
|
||||
| Author | FEUAZUR |
|
||||
| License | MIT |
|
||||
| Tags | `Web Scraping`, `Browser`, `Cloudflare`, `Stealth`, `Crawling`, `Spider` |
|
||||
| Related skills | [`duckduckgo-search`](/docs/user-guide/skills/optional/research/research-duckduckgo-search), [`domain-intel`](/docs/user-guide/skills/optional/research/research-domain-intel) |
|
||||
|
||||
## Reference: full SKILL.md
|
||||
|
||||
:::info
|
||||
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
|
||||
:::
|
||||
|
||||
# Scrapling
|
||||
|
||||
[Scrapling](https://github.com/D4Vinci/Scrapling) is a web scraping framework with anti-bot bypass, stealth browser automation, and a spider framework. It provides three fetching strategies (HTTP, dynamic JS, stealth/Cloudflare) and a full CLI.
|
||||
|
||||
**This skill is for educational and research purposes only.** Users must comply with local/international data scraping laws and respect website Terms of Service.
|
||||
|
||||
## When to Use
|
||||
|
||||
- Scraping static HTML pages (faster than browser tools)
|
||||
- Scraping JS-rendered pages that need a real browser
|
||||
- Bypassing Cloudflare Turnstile or bot detection
|
||||
- Crawling multiple pages with a spider
|
||||
- When the built-in `web_extract` tool does not return the data you need
|
||||
|
||||
## Installation
|
||||
|
||||
```bash
|
||||
pip install "scrapling[all]"
|
||||
scrapling install
|
||||
```
|
||||
|
||||
Minimal install (HTTP only, no browser):
|
||||
```bash
|
||||
pip install scrapling
|
||||
```
|
||||
|
||||
With browser automation only:
|
||||
```bash
|
||||
pip install "scrapling[fetchers]"
|
||||
scrapling install
|
||||
```
|
||||
|
||||
## Quick Reference
|
||||
|
||||
| Approach | Class | Use When |
|
||||
|----------|-------|----------|
|
||||
| HTTP | `Fetcher` / `FetcherSession` | Static pages, APIs, fast bulk requests |
|
||||
| Dynamic | `DynamicFetcher` / `DynamicSession` | JS-rendered content, SPAs |
|
||||
| Stealth | `StealthyFetcher` / `StealthySession` | Cloudflare, anti-bot protected sites |
|
||||
| Spider | `Spider` | Multi-page crawling with link following |
|
||||
|
||||
## CLI Usage
|
||||
|
||||
### Extract Static Page
|
||||
|
||||
```bash
|
||||
scrapling extract get 'https://example.com' output.md
|
||||
```
|
||||
|
||||
With CSS selector and browser impersonation:
|
||||
|
||||
```bash
|
||||
scrapling extract get 'https://example.com' output.md \
|
||||
--css-selector '.content' \
|
||||
--impersonate 'chrome'
|
||||
```
|
||||
|
||||
### Extract JS-Rendered Page
|
||||
|
||||
```bash
|
||||
scrapling extract fetch 'https://example.com' output.md \
|
||||
--css-selector '.dynamic-content' \
|
||||
--disable-resources \
|
||||
--network-idle
|
||||
```
|
||||
|
||||
### Extract Cloudflare-Protected Page
|
||||
|
||||
```bash
|
||||
scrapling extract stealthy-fetch 'https://protected-site.com' output.html \
|
||||
--solve-cloudflare \
|
||||
--block-webrtc \
|
||||
--hide-canvas
|
||||
```
|
||||
|
||||
### POST Request
|
||||
|
||||
```bash
|
||||
scrapling extract post 'https://example.com/api' output.json \
|
||||
--json '{"query": "search term"}'
|
||||
```
|
||||
|
||||
### Output Formats
|
||||
|
||||
The output format is determined by the file extension:
|
||||
- `.html` -- raw HTML
|
||||
- `.md` -- converted to Markdown
|
||||
- `.txt` -- plain text
|
||||
- `.json` / `.jsonl` -- JSON
|
||||
|
||||
## Python: HTTP Scraping
|
||||
|
||||
### Single Request
|
||||
|
||||
```python
|
||||
from scrapling.fetchers import Fetcher
|
||||
|
||||
page = Fetcher.get('https://quotes.toscrape.com/')
|
||||
quotes = page.css('.quote .text::text').getall()
|
||||
for q in quotes:
|
||||
print(q)
|
||||
```
|
||||
|
||||
### Session (Persistent Cookies)
|
||||
|
||||
```python
|
||||
from scrapling.fetchers import FetcherSession
|
||||
|
||||
with FetcherSession(impersonate='chrome') as session:
|
||||
page = session.get('https://example.com/', stealthy_headers=True)
|
||||
links = page.css('a::attr(href)').getall()
|
||||
for link in links[:5]:
|
||||
sub = session.get(link)
|
||||
print(sub.css('h1::text').get())
|
||||
```
|
||||
|
||||
### POST / PUT / DELETE
|
||||
|
||||
```python
|
||||
page = Fetcher.post('https://api.example.com/data', json={"key": "value"})
|
||||
page = Fetcher.put('https://api.example.com/item/1', data={"name": "updated"})
|
||||
page = Fetcher.delete('https://api.example.com/item/1')
|
||||
```
|
||||
|
||||
### With Proxy
|
||||
|
||||
```python
|
||||
page = Fetcher.get('https://example.com', proxy='http://user:pass@proxy:8080')
|
||||
```
|
||||
|
||||
## Python: Dynamic Pages (JS-Rendered)
|
||||
|
||||
For pages that require JavaScript execution (SPAs, lazy-loaded content):
|
||||
|
||||
```python
|
||||
from scrapling.fetchers import DynamicFetcher
|
||||
|
||||
page = DynamicFetcher.fetch('https://example.com', headless=True)
|
||||
data = page.css('.js-loaded-content::text').getall()
|
||||
```
|
||||
|
||||
### Wait for Specific Element
|
||||
|
||||
```python
|
||||
page = DynamicFetcher.fetch(
|
||||
'https://example.com',
|
||||
wait_selector=('.results', 'visible'),
|
||||
network_idle=True,
|
||||
)
|
||||
```
|
||||
|
||||
### Disable Resources for Speed
|
||||
|
||||
Blocks fonts, images, media, stylesheets (~25% faster):
|
||||
|
||||
```python
|
||||
from scrapling.fetchers import DynamicSession
|
||||
|
||||
with DynamicSession(headless=True, disable_resources=True, network_idle=True) as session:
|
||||
page = session.fetch('https://example.com')
|
||||
items = page.css('.item::text').getall()
|
||||
```
|
||||
|
||||
### Custom Page Automation
|
||||
|
||||
```python
|
||||
from playwright.sync_api import Page
|
||||
from scrapling.fetchers import DynamicFetcher
|
||||
|
||||
def scroll_and_click(page: Page):
|
||||
page.mouse.wheel(0, 3000)
|
||||
page.wait_for_timeout(1000)
|
||||
page.click('button.load-more')
|
||||
page.wait_for_selector('.extra-results')
|
||||
|
||||
page = DynamicFetcher.fetch('https://example.com', page_action=scroll_and_click)
|
||||
results = page.css('.extra-results .item::text').getall()
|
||||
```
|
||||
|
||||
## Python: Stealth Mode (Anti-Bot Bypass)
|
||||
|
||||
For Cloudflare-protected or heavily fingerprinted sites:
|
||||
|
||||
```python
|
||||
from scrapling.fetchers import StealthyFetcher
|
||||
|
||||
page = StealthyFetcher.fetch(
|
||||
'https://protected-site.com',
|
||||
headless=True,
|
||||
solve_cloudflare=True,
|
||||
block_webrtc=True,
|
||||
hide_canvas=True,
|
||||
)
|
||||
content = page.css('.protected-content::text').getall()
|
||||
```
|
||||
|
||||
### Stealth Session
|
||||
|
||||
```python
|
||||
from scrapling.fetchers import StealthySession
|
||||
|
||||
with StealthySession(headless=True, solve_cloudflare=True) as session:
|
||||
page1 = session.fetch('https://protected-site.com/page1')
|
||||
page2 = session.fetch('https://protected-site.com/page2')
|
||||
```
|
||||
|
||||
## Element Selection
|
||||
|
||||
All fetchers return a `Selector` object with these methods:
|
||||
|
||||
### CSS Selectors
|
||||
|
||||
```python
|
||||
page.css('h1::text').get() # First h1 text
|
||||
page.css('a::attr(href)').getall() # All link hrefs
|
||||
page.css('.quote .text::text').getall() # Nested selection
|
||||
```
|
||||
|
||||
### XPath
|
||||
|
||||
```python
|
||||
page.xpath('//div[@class="content"]/text()').getall()
|
||||
page.xpath('//a/@href').getall()
|
||||
```
|
||||
|
||||
### Find Methods
|
||||
|
||||
```python
|
||||
page.find_all('div', class_='quote') # By tag + attribute
|
||||
page.find_by_text('Read more', tag='a') # By text content
|
||||
page.find_by_regex(r'\$\d+\.\d{2}') # By regex pattern
|
||||
```
|
||||
|
||||
### Similar Elements
|
||||
|
||||
Find elements with similar structure (useful for product listings, etc.):
|
||||
|
||||
```python
|
||||
first_product = page.css('.product')[0]
|
||||
all_similar = first_product.find_similar()
|
||||
```
|
||||
|
||||
### Navigation
|
||||
|
||||
```python
|
||||
el = page.css('.target')[0]
|
||||
el.parent # Parent element
|
||||
el.children # Child elements
|
||||
el.next_sibling # Next sibling
|
||||
el.prev_sibling # Previous sibling
|
||||
```

## Python: Spider Framework

For multi-page crawling with link following:

```python
from scrapling.spiders import Spider, Request, Response


class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    concurrent_requests = 10
    download_delay = 1

    async def parse(self, response: Response):
        for quote in response.css('.quote'):
            yield {
                "text": quote.css('.text::text').get(),
                "author": quote.css('.author::text').get(),
                "tags": quote.css('.tag::text').getall(),
            }

        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)


result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")
```
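
When `response.follow` is given a relative link like the `.next a` href, it resolves it against the current page URL. The resolution rules are the standard ones from `urllib.parse.urljoin`, sketched here outside the library (URLs are illustrative, and this is not Scrapling's actual implementation):

```python
from urllib.parse import urljoin

# How a relative next-page link resolves against the current page URL.
# A leading slash makes the path root-relative; without it, the path is
# joined onto the current page's directory.
base = "https://quotes.toscrape.com/page/1/"
print(urljoin(base, "/page/2/"))  # https://quotes.toscrape.com/page/2/
print(urljoin(base, "page/2/"))   # https://quotes.toscrape.com/page/1/page/2/
```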

### Multi-Session Spider

Route requests to different fetcher types:

```python
from scrapling.fetchers import FetcherSession, AsyncStealthySession


class SmartSpider(Spider):
    name = "smart"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "protected" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast", callback=self.parse)
```

### Pause/Resume Crawling

```python
spider = QuotesSpider(crawldir="./crawl_checkpoint")
spider.start()  # Ctrl+C to pause, re-run to resume from checkpoint
```

## Pitfalls

- **Browser install required**: run `scrapling install` after `pip install` -- without it, `DynamicFetcher` and `StealthyFetcher` will fail
- **Timeouts**: the `DynamicFetcher`/`StealthyFetcher` timeout is in **milliseconds** (default 30000); the `Fetcher` timeout is in **seconds**
- **Cloudflare bypass**: `solve_cloudflare=True` adds 5-15 seconds to fetch time -- only enable it when needed
- **Resource usage**: `StealthyFetcher` runs a real browser -- limit concurrent usage
- **Legal**: always check robots.txt and the website's ToS before scraping. This library is for educational and research purposes
- **Python version**: requires Python 3.10+
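
One hypothetical way to guard against the seconds-vs-milliseconds mixup noted above is to keep a single unit in your own code and convert at the call site (`to_ms` is not part of Scrapling, just a local convention):

```python
# Hypothetical helper: hold timeouts in seconds everywhere, convert only
# for the fetchers that expect milliseconds.
def to_ms(seconds: float) -> int:
    return int(seconds * 1000)

# Fetcher.get(url, timeout=30)                   # seconds
# StealthyFetcher.fetch(url, timeout=to_ms(30))  # 30000 milliseconds
print(to_ms(30))  # 30000
```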