docs(website): dedicated page per bundled + optional skill (#14929)

Generates a full dedicated Docusaurus page for every one of the 132 skills (73 bundled + 59 optional) under website/docs/user-guide/skills/{bundled,optional}/<category>/. Each page carries the skill's description, metadata (version, author, license, dependencies, platform gating, tags, related skills cross-linked to their own pages), and the complete SKILL.md body that Hermes loads at runtime.

Previously the two catalog pages just listed skills with a one-line blurb and no way to see what the skill actually did; users had to go read the source repo. Now every skill has a browsable, searchable, cross-linked reference in the docs.

- website/scripts/generate-skill-docs.py: generator that reads skills/ and optional-skills/, writes per-skill pages, regenerates both catalog indexes, and rewrites the Skills section of sidebars.ts. Handles MDX escaping (outside fenced code blocks: curly braces, unsafe HTML-ish tags) and rewrites relative references/*.md links to point at the GitHub source.
- website/docs/reference/skills-catalog.md: regenerated; each row links to the new dedicated page.
- website/docs/reference/optional-skills-catalog.md: same.
- website/sidebars.ts: Skills section now has Bundled / Optional subtrees with one nested category per skill folder.
- .github/workflows/{docs-site-checks,deploy-site}.yml: run the generator before docusaurus build so CI stays in sync with the source SKILL.md files.

Build verified locally with `npx docusaurus build`. Only remaining warnings are pre-existing broken link/anchor issues in unrelated pages.
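
The MDX escaping and link rewriting described for generate-skill-docs.py can be pictured with a minimal sketch. This is not the generator's actual code: `render_skill_body`, the entity-based brace escaping, and the base URL constant are assumptions made for illustration (the URL shape is taken from the rewritten links in the generated page). Handling of unsafe HTML-ish tags and inline code spans is left out.

```python
import re

# Assumed base URL, modelled on the rewritten links in the generated page.
GITHUB_BLOB_BASE = "https://github.com/NousResearch/hermes-agent/blob/main"

# Fence markers are built at runtime so this example can itself sit in a fence.
FENCE_MARKERS = ("`" * 3, "~" * 3)

# Matches the target of a relative markdown link such as (references/fsdp.md).
RELATIVE_REF = re.compile(r"\]\((references/[^)\s]+\.md)\)")


def render_skill_body(body: str, skill_path: str) -> str:
    """Escape braces and rewrite relative links, skipping fenced code."""
    rendered = []
    in_fence = False
    for line in body.splitlines():
        if line.lstrip().startswith(FENCE_MARKERS):
            in_fence = not in_fence  # toggle on every opening or closing fence
            rendered.append(line)
            continue
        if not in_fence:
            line = RELATIVE_REF.sub(
                lambda m: f"]({GITHUB_BLOB_BASE}/{skill_path}/{m.group(1)})",
                line,
            )
            # MDX would parse bare braces as JSX expressions.
            line = line.replace("{", "&#123;").replace("}", "&#125;")
        rendered.append(line)
    return "\n".join(rendered)


fence = "`" * 3
sample = "\n".join([
    "Set {name} first.",
    fence + "bash",
    "echo ${HOME}",
    fence,
    "See [fsdp](references/fsdp.md).",
])
print(render_skill_body(sample, "optional-skills/mlops/torchtitan"))
```

The line inside the fence keeps its `${HOME}` untouched, while the prose line is escaped and the relative link becomes an absolute GitHub URL.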
2026-04-25 00:51:20 +00:00 · 2026-04-23 22:22:11 -07:00 · 2026-04-23 22:22:11 -07:00 · 0f6eabb890
commit 0f6eabb890
parent eb93f88e1d
139 changed files with 43523 additions and 306 deletions
--- a/website/docs/user-guide/skills/optional/mlops/mlops-torchtitan.md
+++ b/website/docs/user-guide/skills/optional/mlops/mlops-torchtitan.md
@@ -0,0 +1,377 @@
+---
+title: "Distributed LLM Pretraining Torchtitan"
+sidebar_label: "Distributed LLM Pretraining Torchtitan"
+description: "Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP)"
+---
+
+{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}
+
+# Distributed LLM Pretraining Torchtitan
+
+Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.
+
+## Skill metadata
+
+| | |
+|---|---|
+| Source | Optional — install with `hermes skills install official/mlops/torchtitan` |
+| Path | `optional-skills/mlops/torchtitan` |
+| Version | `1.0.0` |
+| Author | Orchestra Research |
+| License | MIT |
+| Dependencies | `torch>=2.6.0`, `torchtitan>=0.2.0`, `torchao>=0.5.0` |
+| Tags | `Model Architecture`, `Distributed Training`, `TorchTitan`, `FSDP2`, `Tensor Parallel`, `Pipeline Parallel`, `Context Parallel`, `Float8`, `Llama`, `Pretraining` |
+
+## Reference: full SKILL.md
+
+:::info
+The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
+:::
+
+# TorchTitan - PyTorch Native Distributed LLM Pretraining
+
+## Quick start
+
+TorchTitan is PyTorch's official platform for large-scale LLM pretraining with composable 4D parallelism (FSDP2, TP, PP, CP). The TorchTitan paper reports training speedups of up to 65% over optimized baselines on H100 GPUs.
+
+**Installation**:
+```bash
+# From PyPI (stable)
+pip install torchtitan
+
+# From source (latest features, requires PyTorch nightly)
+git clone https://github.com/pytorch/torchtitan
+cd torchtitan
+pip install -r requirements.txt
+```
+
+**Download tokenizer**:
+```bash
+# Get HF token from https://huggingface.co/settings/tokens
+python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=...
+```
+
+**Start training on 8 GPUs**:
+```bash
+CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
+```
+
+## Common workflows
+
+### Workflow 1: Pretrain Llama 3.1 8B on single node
+
+Copy this checklist:
+
+```
+Single Node Pretraining:
+- [ ] Step 1: Download tokenizer
+- [ ] Step 2: Configure training
+- [ ] Step 3: Launch training
+- [ ] Step 4: Monitor and checkpoint
+```
+
+**Step 1: Download tokenizer**
+
+```bash
+python scripts/download_hf_assets.py \
+  --repo_id meta-llama/Llama-3.1-8B \
+  --assets tokenizer \
+  --hf_token=YOUR_HF_TOKEN
+```
+
+**Step 2: Configure training**
+
+Edit or create a TOML config file:
+
+```toml
+# llama3_8b_custom.toml
+[job]
+dump_folder = "./outputs"
+description = "Llama 3.1 8B training"
+
+[model]
+name = "llama3"
+flavor = "8B"
+hf_assets_path = "./assets/hf/Llama-3.1-8B"
+
+[optimizer]
+name = "AdamW"
+lr = 3e-4
+
+[lr_scheduler]
+warmup_steps = 200
+
+[training]
+local_batch_size = 2
+seq_len = 8192
+max_norm = 1.0
+steps = 1000
+dataset = "c4"
+
+[parallelism]
+data_parallel_shard_degree = -1  # Use all GPUs for FSDP
+
+[activation_checkpoint]
+mode = "selective"
+selective_ac_option = "op"
+
+[checkpoint]
+enable = true
+folder = "checkpoint"
+interval = 500
+```
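As a rough sanity check on the config above, the token budget it implies can be worked out by hand. The sketch below assumes a pure data-parallel (FSDP-only) run in which every GPU processes `local_batch_size` sequences per step; `token_budget` is a hypothetical helper, not part of torchtitan.

```python
# Values copied from the [training] and [checkpoint] tables above.
LOCAL_BATCH_SIZE = 2
SEQ_LEN = 8192
STEPS = 1000
CHECKPOINT_INTERVAL = 500


def token_budget(num_gpus: int) -> dict:
    """Token counts for a pure data-parallel (FSDP-only) run on num_gpus ranks."""
    tokens_per_rank = LOCAL_BATCH_SIZE * SEQ_LEN
    tokens_per_step = tokens_per_rank * num_gpus
    return {
        "tokens_per_rank_per_step": tokens_per_rank,
        "tokens_per_step": tokens_per_step,
        "total_tokens": tokens_per_step * STEPS,
        "interval_checkpoints": STEPS // CHECKPOINT_INTERVAL,
    }


budget = token_budget(num_gpus=8)
for key, value in budget.items():
    print(f"{key}: {value:,}")
```

On 8 GPUs this comes to 131,072 tokens per step and about 131 million tokens over the 1000-step run, with the 500-step interval reached twice.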
+
+**Step 3: Launch training**
+
+```bash
+# 8 GPUs on single node
+CONFIG_FILE="./llama3_8b_custom.toml" ./run_train.sh
+
+# Or explicitly with torchrun
+torchrun --nproc_per_node=8 \
+  -m torchtitan.train \
+  --job.config_file ./llama3_8b_custom.toml
+```
+
+**Step 4: Monitor and checkpoint**
+
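+With the `[checkpoint]` settings above, a checkpoint is written every 500 steps under `./outputs/checkpoint/`.
+
+When TensorBoard logging is enabled in the config (`enable_tensorboard = true` under `[metrics]`), logs are saved to `./outputs/tb/`: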
+```bash
+tensorboard --logdir ./outputs/tb
+```
+
+### Workflow 2: Multi-node training with SLURM
+
+```
+Multi-Node Training:
+- [ ] Step 1: Configure parallelism for scale
+- [ ] Step 2: Set up SLURM script
+- [ ] Step 3: Submit job
+- [ ] Step 4: Resume from checkpoint
+```
+
+**Step 1: Configure parallelism for scale**
+
+For a 70B model on 256 GPUs (32 nodes × 8 GPUs):
+```toml
+[parallelism]
+data_parallel_shard_degree = 32  # FSDP across 32 ranks
+tensor_parallel_degree = 8        # TP within node
+pipeline_parallel_degree = 1      # No PP for 70B
+context_parallel_degree = 1       # Increase for long sequences
+```
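The degrees in a `[parallelism]` table have to multiply out to the total number of ranks. A small checker makes that arithmetic explicit. This is a sketch: `resolve_dp_shard` is a hypothetical helper, and the handling of `-1` reflects the assumption that it means "use all remaining ranks for FSDP", as the comment in Workflow 1 suggests.

```python
from math import prod


def resolve_dp_shard(world_size: int, dp_shard: int = -1, dp_replicate: int = 1,
                     tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Return the effective FSDP shard degree, or raise if the layout does not fit."""
    other = prod([dp_replicate, tp, pp, cp])
    if dp_shard == -1:
        # -1 means "use whatever ranks remain after the other dimensions".
        if world_size % other != 0:
            raise ValueError(f"{world_size} ranks not divisible by {other}")
        return world_size // other
    if dp_shard * other != world_size:
        raise ValueError(
            f"degrees multiply to {dp_shard * other}, expected {world_size}"
        )
    return dp_shard


print(resolve_dp_shard(256, dp_shard=32, tp=8))  # 70B layout above
print(resolve_dp_shard(8))                       # single node with dp_shard = -1
```

The 405B layout in Workflow 4 (8 × 8 × 8 × 1 = 512) passes the same check, whereas a layout such as `dp_shard=32, tp=4` on 256 GPUs raises because only 128 ranks are accounted for.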
+
+**Step 2: Set up SLURM script**
+
+```bash
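+#!/bin/bash
+#SBATCH --job-name=llama70b
+#SBATCH --nodes=32
+# One torchrun launcher per node; torchrun itself spawns the 8 workers
+#SBATCH --ntasks-per-node=1
+#SBATCH --gpus-per-node=8
+
+# Use the first allocated node as the rendezvous host (port choice is arbitrary)
+export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
+export MASTER_PORT=29500
+
+srun torchrun \
+  --nnodes=32 \
+  --nproc_per_node=8 \
+  --rdzv_backend=c10d \
+  --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
+  -m torchtitan.train \
+  --job.config_file ./llama3_70b.toml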
+
+**Step 3: Submit job**
+
+```bash
+sbatch multinode_trainer.slurm
+```
+
+**Step 4: Resume from checkpoint**
+
+Training resumes automatically if a checkpoint exists in the configured checkpoint folder, so resubmitting the same job continues from the latest saved step.
+
+### Workflow 3: Enable Float8 training for H100s
+
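+Float8 can provide a 30-50% throughput gain on H100 GPUs; the benchmark table below shows +48% for Llama 8B when FSDP is combined with compile and FP8.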
+
+```
+Float8 Training:
+- [ ] Step 1: Install torchao
+- [ ] Step 2: Configure Float8
+- [ ] Step 3: Launch with compile
+```
+
+**Step 1: Install torchao**
+
+```bash
+USE_CPP=0 pip install git+https://github.com/pytorch/ao.git
+```
+
+**Step 2: Configure Float8**
+
+Add to your TOML config:
+```toml
+[model]
+converters = ["quantize.linear.float8"]
+
+[quantize.linear.float8]
+enable_fsdp_float8_all_gather = true
+precompute_float8_dynamic_scale_for_fsdp = true
+filter_fqns = ["output"]  # Exclude output layer
+
+[compile]
+enable = true
+components = ["model", "loss"]
+```
+
+**Step 3: Launch with compile**
+
+```bash
+CONFIG_FILE="./llama3_8b.toml" ./run_train.sh \
+  --model.converters="quantize.linear.float8" \
+  --quantize.linear.float8.enable_fsdp_float8_all_gather \
+  --compile.enable
+```
+
+### Workflow 4: 4D parallelism for 405B models
+
+```
+4D Parallelism (FSDP + TP + PP + CP):
+- [ ] Step 1: Create seed checkpoint
+- [ ] Step 2: Configure 4D parallelism
+- [ ] Step 3: Launch on 512 GPUs
+```
+
+**Step 1: Create seed checkpoint**
+
+Required for consistent initialization across PP stages:
+```bash
+NGPU=1 CONFIG_FILE=./llama3_405b.toml ./run_train.sh \
+  --checkpoint.enable \
+  --checkpoint.create_seed_checkpoint \
+  --parallelism.data_parallel_shard_degree 1 \
+  --parallelism.tensor_parallel_degree 1 \
+  --parallelism.pipeline_parallel_degree 1
+```
+
+**Step 2: Configure 4D parallelism**
+
+```toml
+[parallelism]
+data_parallel_shard_degree = 8   # FSDP
+tensor_parallel_degree = 8       # TP within node
+pipeline_parallel_degree = 8     # PP across nodes
+context_parallel_degree = 1      # CP off; raise for long sequences and lower another degree so the product stays 512
+
+[training]
+local_batch_size = 32
+seq_len = 8192
+```
+
+**Step 3: Launch on 512 GPUs**
+
+```bash
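+# 64 nodes x 8 GPUs = 512 GPUs; MASTER_ADDR and MASTER_PORT must point at one reachable node
+srun torchrun --nnodes=64 --nproc_per_node=8 \
+  --rdzv_backend=c10d --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
+  -m torchtitan.train \
+  --job.config_file ./llama3_405b.toml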
+```
+
+## When to use vs alternatives
+
+**Use TorchTitan when:**
+- Pretraining LLMs from scratch (8B to 405B+)
+- Need PyTorch-native solution without third-party dependencies
+- Require composable 4D parallelism (FSDP2, TP, PP, CP)
+- Training on H100s with Float8 support
+- Want interoperable checkpoints with torchtune/HuggingFace
+
+**Use alternatives instead:**
+- **Megatron-LM**: Maximum performance for NVIDIA-only deployments
+- **DeepSpeed**: Broader ZeRO optimization ecosystem, inference support
+- **Axolotl/TRL**: Fine-tuning rather than pretraining
+- **LitGPT**: Educational, smaller-scale training
+
+## Common issues
+
+**Issue: Out of memory on large models**
+
+Switch to full activation checkpointing and reduce the batch size:
+```toml
+[activation_checkpoint]
+mode = "full"  # Instead of "selective"
+
+[training]
+local_batch_size = 1
+```
+
+Or keep a larger effective batch size through gradient accumulation:
+```toml
+[training]
+local_batch_size = 1
+global_batch_size = 32  # Accumulates gradients
+```
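How many micro-batches get accumulated depends on the data-parallel degree as well as the two batch sizes. The sketch below assumes the usual relationship, global batch = local batch × data-parallel degree × accumulation steps; `accumulation_steps` is a hypothetical helper for the arithmetic, not a torchtitan function.

```python
def accumulation_steps(global_batch_size: int, local_batch_size: int,
                       dp_degree: int) -> int:
    """Micro-batches each rank processes before one optimizer step."""
    per_pass = local_batch_size * dp_degree  # sequences consumed per forward/backward
    if global_batch_size % per_pass != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} is not a multiple of {per_pass}"
        )
    return global_batch_size // per_pass


print(accumulation_steps(32, 1, dp_degree=8))   # 4
print(accumulation_steps(32, 1, dp_degree=32))  # 1, i.e. no accumulation
```

With the settings above, 8 data-parallel ranks accumulate over 4 micro-batches, while 32 ranks already reach the global batch in a single pass.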
+
+**Issue: TP causes high memory with async collectives**
+
+Set environment variable:
+```bash
+export TORCH_NCCL_AVOID_RECORD_STREAMS=1
+```
+
+**Issue: Float8 training not faster**
+
+Float8 only benefits large GEMMs. Filter small layers:
+```toml
+[quantize.linear.float8]
+filter_fqns = ["attention.wk", "attention.wv", "output", "auto_filter_small_kn"]
+```
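To see which layers such a filter list would leave in Float8, the matching can be simulated on a list of module names. This is only a sketch under two assumptions: an entry excludes a module when it occurs as a substring of the module's fully qualified name, and `auto_filter_small_kn` is a flag instead of a name (its size-based rule is not modelled here). The module names are hypothetical.

```python
# Filter list from the config above; "auto_filter_small_kn" is treated as a flag
# and ignored by this name-based sketch.
FILTER_FQNS = ["attention.wk", "attention.wv", "output", "auto_filter_small_kn"]
FLAGS = {"auto_filter_small_kn"}

# Hypothetical module names for one transformer block plus the output head.
MODULE_FQNS = [
    "layers.0.attention.wq",
    "layers.0.attention.wk",
    "layers.0.attention.wv",
    "layers.0.attention.wo",
    "layers.0.feed_forward.w1",
    "output",
]


def converted_to_float8(fqn: str, filter_fqns: list[str]) -> bool:
    """True if no name-based filter entry occurs in the module's FQN."""
    names = [entry for entry in filter_fqns if entry not in FLAGS]
    return not any(name in fqn for name in names)


kept = [fqn for fqn in MODULE_FQNS if converted_to_float8(fqn, FILTER_FQNS)]
print(kept)
```

Under these assumptions only `wq`, `wo`, and the feed-forward projection stay in Float8; the `wk` and `wv` projections and the output head are left in higher precision.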
+
+**Issue: Checkpoint loading fails after parallelism change**
+
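+DCP can reshard a checkpoint when it is loaded under a different parallelism layout. If loading still fails, convert the sharded checkpoint to a single `torch.save` file and reload from that: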
+```bash
+# Convert sharded checkpoint to single file
+python -m torch.distributed.checkpoint.format_utils \
+  dcp_to_torch checkpoint/step-1000 checkpoint.pt
+```
+
+**Issue: Pipeline parallelism initialization**
+
+Create seed checkpoint first (see Workflow 4, Step 1).
+
+## Supported models
+
+| Model | Sizes | Status |
+|-------|-------|--------|
+| Llama 3.1 | 8B, 70B, 405B | Production |
+| Llama 4 | Various | Experimental |
+| DeepSeek V3 | 16B, 236B, 671B (MoE) | Experimental |
+| GPT-OSS | 20B, 120B (MoE) | Experimental |
+| Qwen 3 | Various | Experimental |
+| Flux | Diffusion | Experimental |
+
+## Performance benchmarks (H100)
+
+| Model | GPUs | Configuration | Tokens/sec/GPU | Notes |
+|-------|------|-------------|---------|------------|
+| Llama 8B | 8 | FSDP | 5,762 | Baseline |
+| Llama 8B | 8 | FSDP+compile+FP8 | 8,532 | +48% |
+| Llama 70B | 256 | FSDP+TP+AsyncTP | 876 | 2D parallel |
+| Llama 405B | 512 | FSDP+TP+PP | 128 | 3D parallel |
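The relative gain and the aggregate throughput implied by these rows can be recomputed directly. The sketch below copies the numbers from the table and assumes the throughput column is tokens per second per GPU.

```python
# (configuration, gpus, tokens/sec/GPU) rows copied from the table above.
ROWS = [
    ("Llama 8B, FSDP", 8, 5762),
    ("Llama 8B, FSDP+compile+FP8", 8, 8532),
    ("Llama 70B, FSDP+TP+AsyncTP", 256, 876),
    ("Llama 405B, FSDP+TP+PP", 512, 128),
]

baseline, optimized = ROWS[0][2], ROWS[1][2]
speedup_pct = (optimized / baseline - 1) * 100
print(f"8B speedup from compile+FP8: {speedup_pct:.1f}%")

# Cluster-wide throughput is the per-GPU figure times the GPU count.
aggregate = {name: gpus * tps_per_gpu for name, gpus, tps_per_gpu in ROWS}
for name, total in aggregate.items():
    print(f"{name}: {total:,} tokens/sec in aggregate")
```

The 8B rows work out to a 48.1% gain, matching the +48% in the table, and the 70B run totals 224,256 tokens per second across its 256 GPUs.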
+
+## Advanced topics
+
+**FSDP2 configuration**: See [references/fsdp.md](https://github.com/NousResearch/hermes-agent/blob/main/optional-skills/mlops/torchtitan/references/fsdp.md) for detailed FSDP2 vs FSDP1 comparison and ZeRO equivalents.
+
+**Float8 training**: See [references/float8.md](https://github.com/NousResearch/hermes-agent/blob/main/optional-skills/mlops/torchtitan/references/float8.md) for tensorwise vs rowwise scaling recipes.
+
+**Checkpointing**: See [references/checkpoint.md](https://github.com/NousResearch/hermes-agent/blob/main/optional-skills/mlops/torchtitan/references/checkpoint.md) for HuggingFace conversion and async checkpointing.
+
+**Adding custom models**: See [references/custom-models.md](https://github.com/NousResearch/hermes-agent/blob/main/optional-skills/mlops/torchtitan/references/custom-models.md) for TrainSpec protocol.
+
+## Resources
+
+- GitHub: https://github.com/pytorch/torchtitan
+- Paper: https://arxiv.org/abs/2410.06511
+- ICLR 2025: https://iclr.cc/virtual/2025/poster/29620
+- PyTorch Forum: https://discuss.pytorch.org/c/distributed/torchtitan/44