docs(website): dedicated page per bundled + optional skill (#14929)
Generates a full dedicated Docusaurus page for every one of the 132 skills
(73 bundled + 59 optional) under website/docs/user-guide/skills/{bundled,optional}/<category>/.
Each page carries the skill's description, metadata (version, author, license,
dependencies, platform gating, tags, related skills cross-linked to their own
pages), and the complete SKILL.md body that Hermes loads at runtime.
Previously the two catalog pages just listed skills with a one-line blurb and
no way to see what the skill actually did — users had to go read the source
repo. Now every skill has a browsable, searchable, cross-linked reference in
the docs.
- website/scripts/generate-skill-docs.py — generator that reads skills/ and
optional-skills/, writes per-skill pages, regenerates both catalog indexes,
and rewrites the Skills section of sidebars.ts. Handles MDX escaping
(outside fenced code blocks: curly braces, unsafe HTML-ish tags) and
rewrites relative references/*.md links to point at the GitHub source.
- website/docs/reference/skills-catalog.md — regenerated; each row links to
the new dedicated page.
- website/docs/reference/optional-skills-catalog.md — same.
- website/sidebars.ts — Skills section now has Bundled / Optional subtrees
with one nested category per skill folder.
- .github/workflows/{docs-site-checks,deploy-site}.yml — run the generator
before docusaurus build so CI stays in sync with the source SKILL.md files.
Build verified locally with `npx docusaurus build`. Only remaining warnings
are pre-existing broken link/anchor issues in unrelated pages.
---
title: "Lambda Labs GPU Cloud — Reserved and on-demand GPU cloud instances for ML training and inference"
sidebar_label: "Lambda Labs GPU Cloud"
description: "Reserved and on-demand GPU cloud instances for ML training and inference"
---

{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}

# Lambda Labs GPU Cloud

Reserved and on-demand GPU cloud instances for ML training and inference. Use when you need dedicated GPU instances with simple SSH access, persistent filesystems, or high-performance multi-node clusters for large-scale training.

## Skill metadata

| Field | Value |
|---|---|
| Source | Optional — install with `hermes skills install official/mlops/lambda-labs` |
| Path | `optional-skills/mlops/lambda-labs` |
| Version | `1.0.0` |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | `lambda-cloud-client>=1.0.0` |
| Tags | `Infrastructure`, `GPU Cloud`, `Training`, `Inference`, `Lambda Labs` |

## Reference: full SKILL.md

:::info
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
:::

# Lambda Labs GPU Cloud

Comprehensive guide to running ML workloads on Lambda Labs GPU cloud with on-demand instances and 1-Click Clusters.

## When to use Lambda Labs

**Use Lambda Labs when:**
- Need dedicated GPU instances with full SSH access
- Running long training jobs (hours to days)
- Want simple pricing with no egress fees
- Need persistent storage across sessions
- Require high-performance multi-node clusters (16-512 GPUs)
- Want pre-installed ML stack (Lambda Stack with PyTorch, CUDA, NCCL)

**Key features:**
- **GPU variety**: B200, H100, GH200, A100, A10, A6000, V100
- **Lambda Stack**: Pre-installed PyTorch, TensorFlow, CUDA, cuDNN, NCCL
- **Persistent filesystems**: Keep data across instance restarts
- **1-Click Clusters**: 16-512 GPU Slurm clusters with InfiniBand
- **Simple pricing**: Pay-per-minute, no egress fees
- **Global regions**: 12+ regions worldwide

**Use alternatives instead:**
- **Modal**: For serverless, auto-scaling workloads
- **SkyPilot**: For multi-cloud orchestration and cost optimization
- **RunPod**: For cheaper spot instances and serverless endpoints
- **Vast.ai**: For GPU marketplace with lowest prices

## Quick start

### Account setup

1. Create account at https://lambda.ai
2. Add payment method
3. Generate API key from dashboard (used by the verification sketch below)
4. Add SSH key (required before launching instances)

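Once the key is generated, a quick way to confirm it works is a read-only API call. A minimal sketch using the same `lambda-cloud-client` configuration shown in the Python API section below; it assumes the key is exported as `LAMBDA_API_KEY` in your shell:

```python
import os

import lambda_cloud_client

# Configure the client with the key from the dashboard
# (assumed to be exported as LAMBDA_API_KEY).
configuration = lambda_cloud_client.Configuration(
    host="https://cloud.lambdalabs.com/api/v1",
    access_token=os.environ["LAMBDA_API_KEY"],
)

with lambda_cloud_client.ApiClient(configuration) as api_client:
    api = lambda_cloud_client.DefaultApi(api_client)
    # A read-only call: succeeds only if the key is valid.
    print(api.list_ssh_keys())
```
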
### Launch via console

1. Go to https://cloud.lambda.ai/instances
2. Click "Launch instance"
3. Select GPU type and region
4. Choose SSH key
5. Optionally attach filesystem
6. Launch and wait 3-15 minutes

### Connect via SSH

```bash
# Get instance IP from console
ssh ubuntu@<INSTANCE-IP>

# Or with specific key
ssh -i ~/.ssh/lambda_key ubuntu@<INSTANCE-IP>
```

## GPU instances

### Available GPUs

| GPU | VRAM | Price/GPU/hr | Best For |
|-----|------|--------------|----------|
| B200 SXM6 | 180 GB | $4.99 | Largest models, fastest training |
| H100 SXM | 80 GB | $2.99-3.29 | Large model training |
| H100 PCIe | 80 GB | $2.49 | Cost-effective H100 |
| GH200 | 96 GB | $1.49 | Single-GPU large models |
| A100 80GB | 80 GB | $1.79 | Production training |
| A100 40GB | 40 GB | $1.29 | Standard training |
| A10 | 24 GB | $0.75 | Inference, fine-tuning |
| A6000 | 48 GB | $0.80 | Good VRAM/price ratio |
| V100 | 16 GB | $0.55 | Budget training |

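Availability varies by region, so it is worth checking where a given type currently has capacity before launching. A minimal sketch against the instance-types endpoint; the `regions_with_capacity_available` field follows the REST API schema, but treat the exact attribute names as assumptions for this client version. It assumes the configured `api` object from the Python API section below:

```python
# List instance types that currently have capacity, and in which regions.
# Assumes `api` is a configured DefaultApi (see "Python API" below).
types = api.instance_types()
for name, entry in types.data.items():
    regions = [region.name for region in entry.regions_with_capacity_available]
    if regions:
        print(f"{name}: available in {', '.join(regions)}")
```
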
### Instance configurations

```
8x GPU: Best for distributed training (DDP, FSDP)
4x GPU: Large models, multi-GPU training
2x GPU: Medium workloads
1x GPU: Fine-tuning, inference, development
```

### Launch times

- Single-GPU: 3-5 minutes
- Multi-GPU: 10-15 minutes (see the polling sketch below)

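Rather than watching the console, you can poll the API until the instance reports `active`. A minimal sketch, assuming the `api` object and an `instance_id` returned by `launch_instance` as shown in the Python API section below; the `id`/`status`/`ip` fields mirror the REST API's instance object:

```python
import time

# Poll until the launched instance reports "active".
# Assumes `api` and `instance_id` from the Python API examples below.
while True:
    matches = [i for i in api.list_instances().data if i.id == instance_id]
    status = matches[0].status if matches else "unknown"
    print(f"status: {status}")
    if status == "active":
        print(f"ready at {matches[0].ip}")
        break
    time.sleep(30)  # launches typically take 3-15 minutes
```
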
## Lambda Stack

All instances come with Lambda Stack pre-installed. Included software:

- Ubuntu 22.04 LTS
- NVIDIA drivers (latest)
- CUDA 12.x
- cuDNN 8.x
- NCCL (for multi-GPU)
- PyTorch (latest)
- TensorFlow (latest)
- JAX
- JupyterLab

### Verify installation

```bash
# Check GPU
nvidia-smi

# Check PyTorch
python -c "import torch; print(torch.cuda.is_available())"

# Check CUDA version
nvcc --version
```

## Python API

### Installation

```bash
pip install lambda-cloud-client
```

### Authentication

```python
import os
import lambda_cloud_client

# Configure with API key
configuration = lambda_cloud_client.Configuration(
    host="https://cloud.lambdalabs.com/api/v1",
    access_token=os.environ["LAMBDA_API_KEY"]
)
```

### List available instances

```python
with lambda_cloud_client.ApiClient(configuration) as api_client:
    api = lambda_cloud_client.DefaultApi(api_client)

    # Get available instance types
    types = api.instance_types()
    for name, info in types.data.items():
        print(f"{name}: {info.instance_type.description}")
```

### Launch instance

```python
from lambda_cloud_client.models import LaunchInstanceRequest

request = LaunchInstanceRequest(
    region_name="us-west-1",
    instance_type_name="gpu_1x_h100_sxm5",
    ssh_key_names=["my-ssh-key"],
    file_system_names=["my-filesystem"],  # Optional
    name="training-job"
)

response = api.launch_instance(request)
instance_id = response.data.instance_ids[0]
print(f"Launched: {instance_id}")
```

### List running instances

```python
instances = api.list_instances()
for instance in instances.data:
    print(f"{instance.name}: {instance.ip} ({instance.status})")
```

### Terminate instance

```python
from lambda_cloud_client.models import TerminateInstanceRequest

request = TerminateInstanceRequest(
    instance_ids=[instance_id]
)
api.terminate_instance(request)
```

### SSH key management

```python
from lambda_cloud_client.models import AddSshKeyRequest

# Add SSH key
request = AddSshKeyRequest(
    name="my-key",
    public_key="ssh-rsa AAAA..."
)
api.add_ssh_key(request)

# List keys
keys = api.list_ssh_keys()

# Delete a key by ID (IDs come from the list_ssh_keys response)
api.delete_ssh_key(key_id)
```

## CLI with curl

### List instance types

```bash
curl -u $LAMBDA_API_KEY: \
  https://cloud.lambdalabs.com/api/v1/instance-types | jq
```

### Launch instance

```bash
curl -u $LAMBDA_API_KEY: \
  -X POST https://cloud.lambdalabs.com/api/v1/instance-operations/launch \
  -H "Content-Type: application/json" \
  -d '{
    "region_name": "us-west-1",
    "instance_type_name": "gpu_1x_h100_sxm5",
    "ssh_key_names": ["my-key"]
  }' | jq
```

### Terminate instance

```bash
curl -u $LAMBDA_API_KEY: \
  -X POST https://cloud.lambdalabs.com/api/v1/instance-operations/terminate \
  -H "Content-Type: application/json" \
  -d '{"instance_ids": ["<INSTANCE-ID>"]}' | jq
```

## Persistent storage

### Filesystems

Filesystems persist data across instance restarts:

```bash
# Mount location
/lambda/nfs/<FILESYSTEM_NAME>

# Example: save checkpoints
python train.py --checkpoint-dir /lambda/nfs/my-storage/checkpoints
```

### Create filesystem

1. Go to Storage in Lambda console
2. Click "Create filesystem"
3. Select region (must match instance region)
4. Name and create

### Attach to instance

Filesystems must be attached at instance launch time:
- Via console: Select filesystem when launching
- Via API: Include `file_system_names` in the launch request (see the sketch below)

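For the API route, this is just the optional `file_system_names` field from the launch example earlier; a minimal sketch, reusing the `api` object and instance type from that section:

```python
from lambda_cloud_client.models import LaunchInstanceRequest

# Launch with a filesystem attached; it appears on the instance
# under /lambda/nfs/<FILESYSTEM_NAME>.
request = LaunchInstanceRequest(
    region_name="us-west-1",  # must match the filesystem's region
    instance_type_name="gpu_1x_h100_sxm5",
    ssh_key_names=["my-ssh-key"],
    file_system_names=["my-filesystem"],
)
response = api.launch_instance(request)
```
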
### Best practices

```
# Store on filesystem (persists)
/lambda/nfs/storage/
├── datasets/
├── checkpoints/
├── models/
└── outputs/

# Local SSD (faster, ephemeral)
/home/ubuntu/
└── working/    # Temporary files
```

## SSH configuration

### Add SSH key

```bash
# Generate key locally
ssh-keygen -t ed25519 -f ~/.ssh/lambda_key

# Add public key to Lambda console
# Or via API (see the sketch below)
```

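The API route is the `add_ssh_key` call from the SSH key management section above; a minimal sketch that uploads the freshly generated public key (the path assumes the `ssh-keygen` invocation above, and `api` is the configured client):

```python
from pathlib import Path

from lambda_cloud_client.models import AddSshKeyRequest

# Read the public half of the key generated above and register it.
public_key = (Path.home() / ".ssh" / "lambda_key.pub").read_text().strip()
api.add_ssh_key(AddSshKeyRequest(name="lambda-key", public_key=public_key))
```
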
### Multiple keys

```bash
# On instance, add more keys
echo 'ssh-rsa AAAA...' >> ~/.ssh/authorized_keys
```

### Import from GitHub

```bash
# On instance
ssh-import-id gh:username
```

### SSH tunneling

```bash
# Forward Jupyter
ssh -L 8888:localhost:8888 ubuntu@<IP>

# Forward TensorBoard
ssh -L 6006:localhost:6006 ubuntu@<IP>

# Multiple ports
ssh -L 8888:localhost:8888 -L 6006:localhost:6006 ubuntu@<IP>
```

## JupyterLab

### Launch from console

1. Go to Instances page
2. Click "Launch" in Cloud IDE column
3. JupyterLab opens in browser

### Manual access

```bash
# On instance
jupyter lab --ip=0.0.0.0 --port=8888

# From local machine with tunnel
ssh -L 8888:localhost:8888 ubuntu@<IP>
# Open http://localhost:8888
```

## Training workflows

### Single-GPU training

```bash
# SSH to instance
ssh ubuntu@<IP>

# Clone repo
git clone https://github.com/user/project
cd project

# Install dependencies
pip install -r requirements.txt

# Train
python train.py --epochs 100 --checkpoint-dir /lambda/nfs/storage/checkpoints
```

### Multi-GPU training (single node)

```python
# train_ddp.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    device = rank % torch.cuda.device_count()
    torch.cuda.set_device(device)

    model = MyModel().to(device)  # MyModel: your torch.nn.Module
    model = DDP(model, device_ids=[device])

    # Training loop...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

```bash
# Launch with torchrun (8 GPUs)
torchrun --nproc_per_node=8 train_ddp.py
```

### Checkpoint to filesystem

```python
import os

checkpoint_dir = "/lambda/nfs/my-storage/checkpoints"
os.makedirs(checkpoint_dir, exist_ok=True)

# Save checkpoint (epoch, model, optimizer, loss come from your training loop)
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}, f"{checkpoint_dir}/checkpoint_{epoch}.pt")
```

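The other half is resuming after a restart or an interrupted job; a minimal sketch under the same assumptions (the checkpoint format above, with `model` and `optimizer` already constructed):

```python
import glob
import os

import torch

# Pick the newest checkpoint written by the loop above, if any exists.
checkpoints = glob.glob(f"{checkpoint_dir}/checkpoint_*.pt")
start_epoch = 0
if checkpoints:
    latest = max(checkpoints, key=os.path.getmtime)
    state = torch.load(latest)
    model.load_state_dict(state["model_state_dict"])
    optimizer.load_state_dict(state["optimizer_state_dict"])
    start_epoch = state["epoch"] + 1
    print(f"resumed from {latest} at epoch {start_epoch}")
```
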
## 1-Click Clusters

### Overview

High-performance Slurm clusters with:
- 16-512 NVIDIA H100 or B200 GPUs
- NVIDIA Quantum-2 400 Gb/s InfiniBand
- GPUDirect RDMA at 3200 Gb/s
- Pre-installed distributed ML stack

### Included software

- Ubuntu 22.04 LTS + Lambda Stack
- NCCL, Open MPI
- PyTorch with DDP and FSDP
- TensorFlow
- OFED drivers

### Storage

- 24 TB NVMe per compute node (ephemeral)
- Lambda filesystems for persistent data

### Multi-node training

```bash
# On Slurm cluster: one torchrun launcher per node, 8 workers per node
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun --nodes=4 --ntasks-per-node=1 --gpus-per-node=8 \
    torchrun --nnodes=4 --nproc_per_node=8 \
    --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
    train.py
```

## Networking

### Bandwidth

- Inter-instance (same region): up to 200 Gbps
- Internet outbound: 20 Gbps max

### Firewall

- Default: Only port 22 (SSH) open
- Configure additional ports in Lambda console
- ICMP traffic allowed by default

### Private IPs

```bash
# Find private IP
ip addr show | grep 'inet '
```

## Common workflows

### Workflow 1: Fine-tuning LLM

```bash
# 1. Launch 8x H100 instance with filesystem

# 2. SSH and setup
ssh ubuntu@<IP>
pip install transformers accelerate peft

# 3. Download model to filesystem
python -c "
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf')
model.save_pretrained('/lambda/nfs/storage/models/llama-2-7b')
"

# 4. Fine-tune with checkpoints on filesystem
accelerate launch --num_processes 8 train.py \
  --model_path /lambda/nfs/storage/models/llama-2-7b \
  --output_dir /lambda/nfs/storage/outputs \
  --checkpoint_dir /lambda/nfs/storage/checkpoints
```

### Workflow 2: Batch inference

```bash
# 1. Launch A10 instance (cost-effective for inference)

# 2. Run inference
python inference.py \
  --model /lambda/nfs/storage/models/fine-tuned \
  --input /lambda/nfs/storage/data/inputs.jsonl \
  --output /lambda/nfs/storage/data/outputs.jsonl
```

## Cost optimization

### Choose right GPU

| Task | Recommended GPU |
|------|-----------------|
| LLM fine-tuning (7B) | A100 40GB |
| LLM fine-tuning (70B) | 8x H100 |
| Inference | A10, A6000 |
| Development | V100, A10 |
| Maximum performance | B200 |

### Reduce costs

1. **Use filesystems**: Avoid re-downloading data
2. **Checkpoint frequently**: Resume interrupted training
3. **Right-size**: Don't over-provision GPUs
4. **Terminate idle**: There is no auto-stop, so terminate instances manually (see the sketch below)

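Since nothing stops an idle instance for you, an end-of-session cleanup script is a cheap safeguard. A minimal sketch that terminates every running instance; the `id` attribute mirrors the REST API's instance object, so treat it as an assumption for this client version:

```python
from lambda_cloud_client.models import TerminateInstanceRequest

# Terminate every instance on the account -- run at the end of a work
# session so nothing keeps billing overnight.
instance_ids = [inst.id for inst in api.list_instances().data]
if instance_ids:
    api.terminate_instance(TerminateInstanceRequest(instance_ids=instance_ids))
    print(f"terminated: {instance_ids}")
```
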
### Monitor usage

- Dashboard shows real-time GPU utilization
- API for programmatic monitoring (see the cost sketch below)

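For example, instance objects carry their type's hourly price, so a rough burn-rate monitor takes a few lines (the `price_cents_per_hour` field follows the REST API's instance-type schema; treat the exact attribute path as an assumption for this client version):

```python
# Rough cost monitor: sums the hourly price of all running instances.
instances = api.list_instances()
total_cents = sum(
    inst.instance_type.price_cents_per_hour for inst in instances.data
)
print(f"current burn rate: ${total_cents / 100:.2f}/hour")
```
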
## Common issues

| Issue | Solution |
|-------|----------|
| Instance won't launch | Check region availability, try different GPU |
| SSH connection refused | Wait for instance to initialize (3-15 min) |
| Data lost after terminate | Use persistent filesystems |
| Slow data transfer | Use filesystem in same region |
| GPU not detected | Reboot instance, check drivers |

## References

- **[Advanced Usage](https://github.com/NousResearch/hermes-agent/blob/main/optional-skills/mlops/lambda-labs/references/advanced-usage.md)** - Multi-node training, API automation
- **[Troubleshooting](https://github.com/NousResearch/hermes-agent/blob/main/optional-skills/mlops/lambda-labs/references/troubleshooting.md)** - Common issues and solutions

## Resources

- **Documentation**: https://docs.lambda.ai
- **Console**: https://cloud.lambda.ai
- **Pricing**: https://lambda.ai/instances
- **Support**: https://support.lambdalabs.com
- **Blog**: https://lambda.ai/blog