Mirror of https://github.com/NousResearch/hermes-agent.git (synced 2026-04-25 00:51:20 +00:00)
refactor: reorganize skills into sub-categories
The skills directory was getting disorganized: mlops alone had 40 skills in a flat list, and 12 categories were singletons with just one skill each.

Code change:
- prompt_builder.py: support sub-categories in the skill scanner. skills/mlops/training/axolotl/SKILL.md now shows as category 'mlops/training' instead of just 'mlops'. Backwards-compatible with the existing flat structure.

Split mlops (40 skills) into 7 sub-categories:
- mlops/training (12): accelerate, axolotl, flash-attention, grpo-rl-training, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, torchtitan, trl-fine-tuning, unsloth
- mlops/inference (8): gguf, guidance, instructor, llama-cpp, obliteratus, outlines, tensorrt-llm, vllm
- mlops/models (6): audiocraft, clip, llava, segment-anything, stable-diffusion, whisper
- mlops/vector-databases (4): chroma, faiss, pinecone, qdrant
- mlops/evaluation (5): huggingface-tokenizers, lm-evaluation-harness, nemo-curator, saelens, weights-and-biases
- mlops/cloud (2): lambda-labs, modal
- mlops/research (1): dspy

Merged singleton categories:
- gifs → media (gif-search joins youtube-content)
- music-creation → media (heartmula, songsee)
- diagramming → creative (excalidraw joins ascii-art)
- ocr-and-documents → productivity
- domain → research (domain-intel)
- feeds → research (blogwatcher)
- market-data → research (polymarket)

Fixed misplaced skills:
- mlops/code-review → software-development (not ML-specific)
- mlops/ml-paper-writing → research (academic writing)

Added DESCRIPTION.md files for all new/updated categories.
This commit is contained in: parent d6c710706f, commit 732c66b0f3
217 changed files with 39 additions and 4 deletions
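The scanner change is easy to picture. What follows is not the actual prompt_builder.py diff, just a minimal sketch of the path-to-category mapping the message describes; the function and variable names are illustrative:

```python
from pathlib import Path

def scan_skills(skills_root: str) -> dict[str, str]:
    """Map each skill name to its category, e.g. 'axolotl' -> 'mlops/training'.

    Hypothetical sketch; the real prompt_builder.py may differ.
    """
    categories = {}
    for skill_md in Path(skills_root).rglob("SKILL.md"):
        # e.g. skills/mlops/training/axolotl/SKILL.md
        rel = skill_md.relative_to(skills_root)
        parts = rel.parts[:-1]        # ('mlops', 'training', 'axolotl')
        skill_name = parts[-1]
        # Everything between the root and the skill dir is the category path;
        # a single level ('mlops', 'axolotl') keeps the flat structure working.
        categories[skill_name] = "/".join(parts[:-1])
    return categories
```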

skills/mlops/cloud/lambda-labs/references/advanced-usage.md (new file, +611 lines)

# Lambda Labs Advanced Usage Guide

## Multi-Node Distributed Training

### PyTorch DDP across nodes

```python
# train_multi_node.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_distributed():
    # Environment variables are set by the launcher (torchrun)
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
    )

    torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank


def main():
    rank, world_size, local_rank = setup_distributed()

    # MyModel, num_epochs, dataloader, and train_one_epoch are
    # project-specific placeholders
    model = MyModel().cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Training loop with synchronized gradients
    for epoch in range(num_epochs):
        train_one_epoch(model, dataloader)

        # Save checkpoint on rank 0 only
        if rank == 0:
            torch.save(model.module.state_dict(), f"checkpoint_{epoch}.pt")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

### Launch on multiple instances

```bash
# On Node 0 (master)
export MASTER_ADDR=<NODE0_PRIVATE_IP>
export MASTER_PORT=29500

torchrun \
    --nnodes=2 \
    --nproc_per_node=8 \
    --node_rank=0 \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train_multi_node.py

# On Node 1
export MASTER_ADDR=<NODE0_PRIVATE_IP>
export MASTER_PORT=29500

torchrun \
    --nnodes=2 \
    --nproc_per_node=8 \
    --node_rank=1 \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train_multi_node.py
```

### FSDP for large models

```python
import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# Wrap policy for transformer models
auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)

model = FSDP(
    model,
    auto_wrap_policy=auto_wrap_policy,
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    ),
    device_id=local_rank,  # local_rank from setup_distributed() above
)
```

### DeepSpeed ZeRO

Save the configuration as `ds_config.json`:

```json
{
    "train_batch_size": 64,
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": true},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"}
    }
}
```

```bash
# Launch with DeepSpeed
deepspeed --num_nodes=2 \
    --num_gpus=8 \
    --hostfile=hostfile.txt \
    train.py --deepspeed ds_config.json
```

### Hostfile for multi-node

```bash
# hostfile.txt
node0_ip slots=8
node1_ip slots=8
```
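
DeepSpeed's default launcher reaches the worker nodes over SSH, so the hostfile only works once passwordless SSH between the instances is in place. A minimal sketch, with the key path and IPs as placeholders:

```bash
# On node 0: create a key without a passphrase
ssh-keygen -t ed25519 -f ~/.ssh/cluster_key -N ""

# Append the public key to node 1's authorized_keys
# (assumes you can already reach node 1 with your Lambda key)
ssh-copy-id -i ~/.ssh/cluster_key.pub ubuntu@<NODE1_PRIVATE_IP>

# Verify
ssh -i ~/.ssh/cluster_key ubuntu@<NODE1_PRIVATE_IP> hostname
```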

## API Automation

### Auto-launch training jobs

```python
import os
import time

import lambda_cloud_client
from lambda_cloud_client.models import LaunchInstanceRequest


class LambdaJobManager:
    def __init__(self, api_key: str):
        self.config = lambda_cloud_client.Configuration(
            host="https://cloud.lambdalabs.com/api/v1",
            access_token=api_key,
        )

    def find_available_gpu(self, gpu_types: list[str],
                           regions: list[str] | None = None):
        """Find the first available GPU type across regions."""
        with lambda_cloud_client.ApiClient(self.config) as client:
            api = lambda_cloud_client.DefaultApi(client)
            types = api.instance_types()

            for gpu_type in gpu_types:
                if gpu_type in types.data:
                    info = types.data[gpu_type]
                    for region in info.regions_with_capacity_available:
                        if regions is None or region.name in regions:
                            return gpu_type, region.name

        return None, None

    def launch_and_wait(self, instance_type: str, region: str,
                        ssh_key: str, filesystem: str | None = None,
                        timeout: int = 900) -> dict:
        """Launch an instance and wait for it to become ready."""
        with lambda_cloud_client.ApiClient(self.config) as client:
            api = lambda_cloud_client.DefaultApi(client)

            request = LaunchInstanceRequest(
                region_name=region,
                instance_type_name=instance_type,
                ssh_key_names=[ssh_key],
                file_system_names=[filesystem] if filesystem else [],
            )

            response = api.launch_instance(request)
            instance_id = response.data.instance_ids[0]

            # Poll until ready
            start = time.time()
            while time.time() - start < timeout:
                instance = api.get_instance(instance_id)
                if instance.data.status == "active":
                    return {
                        "id": instance_id,
                        "ip": instance.data.ip,
                        "status": "active",
                    }
                time.sleep(30)

        raise TimeoutError(f"Instance {instance_id} not ready after {timeout}s")

    def terminate(self, instance_ids: list[str]):
        """Terminate instances."""
        from lambda_cloud_client.models import TerminateInstanceRequest

        with lambda_cloud_client.ApiClient(self.config) as client:
            api = lambda_cloud_client.DefaultApi(client)
            request = TerminateInstanceRequest(instance_ids=instance_ids)
            api.terminate_instance(request)


# Usage
manager = LambdaJobManager(os.environ["LAMBDA_API_KEY"])

# Find an available H100 or A100
gpu_type, region = manager.find_available_gpu(
    ["gpu_8x_h100_sxm5", "gpu_8x_a100_80gb_sxm4"],
    regions=["us-west-1", "us-east-1"],
)

if gpu_type:
    instance = manager.launch_and_wait(
        gpu_type, region,
        ssh_key="my-key",
        filesystem="training-data",
    )
    print(f"Ready: ssh ubuntu@{instance['ip']}")
```

### Batch job submission

```python
import os

import paramiko


def run_remote_job(ip: str, ssh_key_path: str, commands: list[str]):
    """Execute commands on a remote instance."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    # paramiko does not expand "~", so do it explicitly
    client.connect(ip, username="ubuntu",
                   key_filename=os.path.expanduser(ssh_key_path))

    for cmd in commands:
        stdin, stdout, stderr = client.exec_command(cmd)
        print(stdout.read().decode())
        err = stderr.read().decode()  # read stderr once; a second read returns nothing
        if err:
            print(f"Error: {err}")

    client.close()


# Submit a training job.
# Each exec_command runs in a fresh shell, so chain the steps with &&
job = " && ".join([
    "cd /lambda/nfs/storage/project",
    "git pull",
    "pip install -r requirements.txt",
    "nohup torchrun --nproc_per_node=8 train.py > train.log 2>&1 &",
])

run_remote_job(instance["ip"], "~/.ssh/lambda_key", [job])
```

### Monitor training progress

```python
import os

import paramiko


def monitor_job(ip: str, ssh_key_path: str, log_file: str = "train.log"):
    """Stream training logs from a remote instance."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(ip, username="ubuntu",
                   key_filename=os.path.expanduser(ssh_key_path))

    # Tail the log file
    stdin, stdout, stderr = client.exec_command(f"tail -f {log_file}")

    try:
        for line in stdout:
            print(line.strip())
    except KeyboardInterrupt:
        pass
    finally:
        client.close()
```

## 1-Click Cluster Workflows

### Slurm job submission

```bash
#!/bin/bash
#SBATCH --job-name=llm-training
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --time=24:00:00
#SBATCH --output=logs/%j.out
#SBATCH --error=logs/%j.err

# Set up the distributed environment
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500

# Launch training
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=$SLURM_GPUS_PER_NODE \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train.py \
    --config config.yaml
```

### Interactive cluster session

```bash
# Request an interactive session
srun --nodes=1 --ntasks=1 --gpus=8 --time=4:00:00 --pty bash

# Now on a compute node with 8 GPUs
nvidia-smi
python train.py
```

### Monitoring cluster jobs

```bash
# View the job queue
squeue

# View job details
scontrol show job <JOB_ID>

# Cancel a job
scancel <JOB_ID>

# View node status
sinfo

# View GPU usage across the cluster
srun --nodes=4 nvidia-smi --query-gpu=name,utilization.gpu --format=csv
```

## Advanced Filesystem Usage

### Data staging workflow

```bash
# Stage data from S3 to the filesystem (one-time)
aws s3 sync s3://my-bucket/dataset /lambda/nfs/storage/datasets/

# Or use rclone
rclone sync s3:my-bucket/dataset /lambda/nfs/storage/datasets/
```

### Shared filesystem across instances

```python
# Instance 1: write checkpoints
checkpoint_path = "/lambda/nfs/shared/checkpoints/model_step_1000.pt"
torch.save(model.state_dict(), checkpoint_path)

# Instance 2: read checkpoints
model.load_state_dict(torch.load(checkpoint_path))
```

### Filesystem best practices

```bash
# Organize for ML workflows
/lambda/nfs/storage/
├── datasets/
│   ├── raw/            # Original data
│   └── processed/      # Preprocessed data
├── models/
│   ├── pretrained/     # Base models
│   └── fine-tuned/     # Your trained models
├── checkpoints/
│   └── experiment_1/   # Per-experiment checkpoints
├── logs/
│   └── tensorboard/    # Training logs
└── outputs/
    └── inference/      # Inference results
```
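
If it helps, the whole tree can be stamped out in one go with bash brace expansion:

```bash
mkdir -p /lambda/nfs/storage/{datasets/{raw,processed},models/{pretrained,fine-tuned},checkpoints/experiment_1,logs/tensorboard,outputs/inference}
```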

## Environment Management

### Custom Python environments

```bash
# Don't modify the system Python; create a venv
python -m venv ~/myenv
source ~/myenv/bin/activate

# Install packages
pip install torch transformers accelerate

# Save to the filesystem for reuse
# (venv scripts hardcode absolute paths, so restore to the same path, e.g. ~/myenv)
cp -r ~/myenv /lambda/nfs/storage/envs/myenv
```

### Conda environments

```bash
# Install Miniconda (if not present)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p ~/miniconda3

# Create an environment
~/miniconda3/bin/conda create -n ml python=3.10 pytorch pytorch-cuda=12.1 -c pytorch -c nvidia -y

# Activate
source ~/miniconda3/bin/activate ml
```

### Docker containers

```bash
# Pull and run the NVIDIA PyTorch container
docker run --gpus all -it --rm \
    -v /lambda/nfs/storage:/data \
    nvcr.io/nvidia/pytorch:24.01-py3

# Run training in a container (detached)
docker run --gpus all -d \
    -v /lambda/nfs/storage:/data \
    -v $(pwd):/workspace \
    nvcr.io/nvidia/pytorch:24.01-py3 \
    python /workspace/train.py
```

## Monitoring and Observability

### GPU monitoring

```bash
# Real-time GPU stats
watch -n 1 nvidia-smi

# GPU utilization over time
nvidia-smi dmon -s u -d 1

# Detailed GPU info
nvidia-smi -q
```

### System monitoring

```bash
# CPU and memory
htop

# Disk I/O
iostat -x 1

# Network
iftop

# All resources
glances
```

### TensorBoard integration

```bash
# Start TensorBoard on the instance
tensorboard --logdir /lambda/nfs/storage/logs --port 6006 --bind_all

# SSH tunnel from your local machine
ssh -L 6006:localhost:6006 ubuntu@<IP>

# Access at http://localhost:6006
```

### Weights & Biases integration

```python
import os

import wandb

# Initialize with your API key
wandb.login(key=os.environ["WANDB_API_KEY"])

# Start a run
wandb.init(
    project="lambda-training",
    config={"learning_rate": 1e-4, "epochs": 100},
)

# Log metrics
wandb.log({"loss": loss, "accuracy": acc})

# Save artifacts to the filesystem + W&B
wandb.save("/lambda/nfs/storage/checkpoints/best_model.pt")
```

## Cost Optimization Strategies

### Checkpointing for interruption recovery

```python
import os

import torch


def save_checkpoint(model, optimizer, epoch, loss, path):
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }, path)


def load_checkpoint(path, model, optimizer):
    if os.path.exists(path):
        checkpoint = torch.load(path)
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        return checkpoint['epoch'], checkpoint['loss']
    return 0, float('inf')


# Save every N steps to the filesystem
checkpoint_path = "/lambda/nfs/storage/checkpoints/latest.pt"
if step % 1000 == 0:
    save_checkpoint(model, optimizer, epoch, loss, checkpoint_path)
```

### Instance selection by workload

```python
def recommend_instance(model_params: int, batch_size: int, task: str) -> str:
    """Recommend a Lambda instance type based on workload."""

    if task == "inference":
        if model_params < 7e9:
            return "gpu_1x_a10"          # $0.75/hr
        elif model_params < 13e9:
            return "gpu_1x_a6000"        # $0.80/hr
        else:
            return "gpu_1x_h100_pcie"    # $2.49/hr

    elif task == "fine-tuning":
        if model_params < 7e9:
            return "gpu_1x_a100"         # $1.29/hr
        elif model_params < 13e9:
            return "gpu_4x_a100"         # $5.16/hr
        else:
            return "gpu_8x_h100_sxm5"    # $23.92/hr

    elif task == "pretraining":
        return "gpu_8x_h100_sxm5"        # Maximum performance

    return "gpu_1x_a100"                 # Default
```

### Auto-terminate idle instances

```python
from datetime import datetime, timedelta, timezone

import lambda_cloud_client


def auto_terminate_idle(api_key: str, idle_threshold_hours: float = 2):
    """Terminate instances that have been up longer than the threshold."""
    manager = LambdaJobManager(api_key)  # defined in "Auto-launch training jobs"

    with lambda_cloud_client.ApiClient(manager.config) as client:
        api = lambda_cloud_client.DefaultApi(client)
        instances = api.list_instances()

        for instance in instances.data:
            # Uptime alone is a crude proxy for idleness; real idle
            # detection needs activity tracking (one option follows below)
            launch_time = instance.launched_at
            if datetime.now(timezone.utc) - launch_time > timedelta(hours=idle_threshold_hours):
                print(f"Terminating idle instance: {instance.id}")
                manager.terminate([instance.id])
```
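
The activity tracking alluded to above can be as simple as sampling `nvidia-smi` over SSH and treating sustained near-zero utilization as idle. A sketch under that assumption; the 5% threshold and 10-second spacing are arbitrary choices, and the connection details mirror the batch-job example:

```python
import os
import time

import paramiko


def gpu_is_idle(ip: str, ssh_key_path: str, samples: int = 5) -> bool:
    """True if every sample shows ~0% utilization on all GPUs."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(ip, username="ubuntu",
                   key_filename=os.path.expanduser(ssh_key_path))
    try:
        for _ in range(samples):
            _, stdout, _ = client.exec_command(
                "nvidia-smi --query-gpu=utilization.gpu "
                "--format=csv,noheader,nounits"
            )
            utils = [int(v) for v in stdout.read().decode().split()]
            if any(u > 5 for u in utils):  # >5% on any GPU counts as busy
                return False
            time.sleep(10)                 # space the samples out
        return True
    finally:
        client.close()
```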

## Security Best Practices

### SSH key rotation

```bash
# Generate a new key pair
ssh-keygen -t ed25519 -f ~/.ssh/lambda_key_new -C "lambda-$(date +%Y%m)"

# Add the new key via the Lambda console or API
# Update authorized_keys on running instances
ssh ubuntu@<IP> "echo '$(cat ~/.ssh/lambda_key_new.pub)' >> ~/.ssh/authorized_keys"

# Test the new key
ssh -i ~/.ssh/lambda_key_new ubuntu@<IP>

# Remove the old key from the Lambda console
```

### Firewall configuration

```bash
# Lambda console: only open the ports you need
# Recommended:
# - 22 (SSH)                    - always needed
# - 6006 (TensorBoard)          - if using
# - 8888 (Jupyter)              - if using
# - 29500 (PyTorch distributed) - multi-node only
```

### Secrets management

```bash
# Don't hardcode API keys in code
# Use environment variables
export HF_TOKEN="hf_..."
export WANDB_API_KEY="..."

# Or use a .env file (add it to .gitignore)
source .env

# On the instance, store them in ~/.bashrc
echo 'export HF_TOKEN="..."' >> ~/.bashrc
```

skills/mlops/cloud/lambda-labs/references/troubleshooting.md (new file, +530 lines)

# Lambda Labs Troubleshooting Guide

## Instance Launch Issues

### No instances available

**Error**: "No capacity available", or the instance type is not listed

**Solutions**:
```bash
# Check availability via the API
curl -u $LAMBDA_API_KEY: \
    https://cloud.lambdalabs.com/api/v1/instance-types | jq '.data | to_entries[] | select(.value.regions_with_capacity_available | length > 0) | .key'

# Try different regions
# US regions: us-west-1, us-east-1, us-south-1
# International: eu-west-1, asia-northeast-1, etc.

# Try alternative GPU types
# H100 not available? Try A100
# A100 not available? Try A10 or A6000
```

### Instance stuck launching

**Problem**: Instance shows "booting" for over 20 minutes

**Solutions**:
```bash
# Single-GPU: should be ready in 3-5 minutes
# Multi-GPU (8x): may take 10-15 minutes

# If stuck longer:
# 1. Terminate the instance
# 2. Try a different region
# 3. Try a different instance type
# 4. Contact Lambda support if the problem persists
```

### API authentication fails

**Error**: `401 Unauthorized` or `403 Forbidden`

**Solutions**:
```bash
# Verify the API key is set (watch for quotes or stray whitespace)
echo $LAMBDA_API_KEY

# Test the API key
curl -u $LAMBDA_API_KEY: \
    https://cloud.lambdalabs.com/api/v1/instance-types

# Generate a new API key from the Lambda console if needed
# Settings > API keys > Generate
```

### Quota limits reached

**Error**: "Instance limit reached" or "Quota exceeded"

**Solutions**:
- Check currently running instances in the console
- Terminate unused instances (as shown below)
- Contact Lambda support to request a quota increase
- Use 1-Click Clusters for large-scale needs
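
The first two can be done from the command line. The terminate endpoint below is my reading of the Lambda Cloud API; double-check against the current API docs before scripting around it:

```bash
# List running instances
curl -u $LAMBDA_API_KEY: \
    https://cloud.lambdalabs.com/api/v1/instances | jq '.data[] | {id, name, status}'

# Terminate the ones you no longer need
curl -u $LAMBDA_API_KEY: -X POST \
    -H "Content-Type: application/json" \
    -d '{"instance_ids": ["<INSTANCE_ID>"]}' \
    https://cloud.lambdalabs.com/api/v1/instance-operations/terminate
```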

## SSH Connection Issues

### Connection refused

**Error**: `ssh: connect to host <IP> port 22: Connection refused`

**Solutions**:
```bash
# Wait for the instance to fully initialize
# Single-GPU: 3-5 minutes
# Multi-GPU: 10-15 minutes

# Check the instance status in the console (should be "active")

# Verify you have the correct IP address
curl -u $LAMBDA_API_KEY: \
    https://cloud.lambdalabs.com/api/v1/instances | jq '.data[].ip'
```

### Permission denied

**Error**: `Permission denied (publickey)`

**Solutions**:
```bash
# Verify the SSH key matches
ssh -v -i ~/.ssh/lambda_key ubuntu@<IP>

# Check key permissions
chmod 600 ~/.ssh/lambda_key
chmod 644 ~/.ssh/lambda_key.pub

# Verify the key was added to the Lambda console before launch
# Keys must be added BEFORE launching the instance

# Check authorized_keys on the instance (if you have another way in)
cat ~/.ssh/authorized_keys
```

### Host key verification failed

**Error**: `WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!`

**Solutions**:
```bash
# This happens when an IP is reused by a different instance
# Remove the old key
ssh-keygen -R <IP>

# Then connect again
ssh ubuntu@<IP>
```

### Timeout during SSH

**Error**: `ssh: connect to host <IP> port 22: Operation timed out`

**Solutions**:
```bash
# Check that the instance is in the "active" state

# Verify the firewall allows SSH (port 22)
# Lambda console > Firewall

# Check that your local network allows outbound SSH

# Try from a different network/VPN
```

## GPU Issues

### GPU not detected

**Error**: `nvidia-smi: command not found`, or no GPUs shown

**Solutions**:
```bash
# Reboot the instance
sudo reboot

# Reinstall NVIDIA drivers (if needed)
wget -nv -O- https://lambdalabs.com/install-lambda-stack.sh | sh -
sudo reboot

# Check driver status
nvidia-smi
lsmod | grep nvidia
```

### CUDA out of memory

**Error**: `torch.cuda.OutOfMemoryError: CUDA out of memory`

**Solutions**:
```python
# Check GPU memory
import torch
print(torch.cuda.get_device_properties(0).total_memory / 1e9, "GB")

# Clear the cache
torch.cuda.empty_cache()

# Reduce the batch size
batch_size = batch_size // 2

# Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Use mixed precision
from torch.cuda.amp import autocast
with autocast():
    outputs = model(**inputs)

# Or move to a larger GPU instance
# A100-40GB → A100-80GB → H100
```

### CUDA version mismatch

**Error**: `CUDA driver version is insufficient for CUDA runtime version`

**Solutions**:
```bash
# Check versions
nvidia-smi        # Shows the driver's CUDA version
nvcc --version    # Shows the toolkit version

# Lambda Stack ships compatible versions
# If they have drifted apart, reinstall Lambda Stack
wget -nv -O- https://lambdalabs.com/install-lambda-stack.sh | sh -
sudo reboot

# Or install a PyTorch build that matches the driver
pip install torch==2.1.0+cu121 -f https://download.pytorch.org/whl/torch_stable.html
```

### Multi-GPU not working

**Problem**: Only one GPU is being used

**Solutions**:
```python
# Check that all GPUs are visible
import torch
print(f"GPUs available: {torch.cuda.device_count()}")

# Verify CUDA_VISIBLE_DEVICES is not set restrictively
import os
print(os.environ.get("CUDA_VISIBLE_DEVICES", "not set"))

# Use DataParallel or DistributedDataParallel
model = torch.nn.DataParallel(model)
# or
model = torch.nn.parallel.DistributedDataParallel(model)
```

## Filesystem Issues

### Filesystem not mounted

**Error**: `/lambda/nfs/<name>` doesn't exist

**Solutions**:
```bash
# Filesystems must be attached at launch time;
# they cannot be attached to a running instance

# Verify the filesystem was selected during launch

# Check mount points
df -h | grep lambda

# If missing, terminate and relaunch with the filesystem attached
```

### Slow filesystem performance

**Problem**: Reading from or writing to the filesystem is slow

**Solutions**:
```bash
# Use the local SSD for temporary/intermediate files
# /home/ubuntu has fast NVMe storage

# Copy frequently accessed data to local storage
cp -r /lambda/nfs/storage/dataset /home/ubuntu/dataset

# Use the filesystem for checkpoints and final outputs only

# Check network bandwidth
iperf3 -c <filesystem_server>
```

### Data lost after termination

**Problem**: Files disappeared after the instance terminated

**Solutions**:
```bash
# The root volume (/home/ubuntu) is EPHEMERAL;
# data there is lost on termination

# ALWAYS use the filesystem for persistent data:
# /lambda/nfs/<filesystem_name>/

# Sync important local files before terminating
rsync -av /home/ubuntu/outputs/ /lambda/nfs/storage/outputs/
```

### Filesystem full

**Error**: `No space left on device`

**Solutions**:
```bash
# Check filesystem usage
df -h /lambda/nfs/storage

# Find large files
du -sh /lambda/nfs/storage/* | sort -h

# Clean up old checkpoints (here: older than 7 days)
find /lambda/nfs/storage/checkpoints -mtime +7 -delete

# Increase the filesystem size in the Lambda console
# (may require a support request)
```

## Network Issues

### Port not accessible

**Error**: Cannot connect to a service (TensorBoard, Jupyter, etc.)

**Solutions**:
```bash
# Lambda default: only port 22 is open
# Configure the firewall in the Lambda console

# Or use SSH tunneling (recommended)
ssh -L 6006:localhost:6006 ubuntu@<IP>
# Access at http://localhost:6006

# For Jupyter
ssh -L 8888:localhost:8888 ubuntu@<IP>
```

### Slow data download

**Problem**: Downloading datasets is slow

**Solutions**:
```bash
# Check available bandwidth
speedtest-cli

# Use a multi-threaded downloader
aria2c -x 16 <URL>

# For Hugging Face models
pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1

# For S3, use parallel transfer
aws s3 sync s3://bucket/data /local/data --quiet
```

### Inter-node communication fails

**Error**: Distributed training can't connect between nodes

**Solutions**:
```bash
# Verify the nodes are in the same region (required)

# Check that the private IPs can communicate
ping <other_node_private_ip>

# Verify NCCL settings
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=0  # Enable InfiniBand if available

# Check the firewall allows the distributed ports
# Needed: 29500 (PyTorch), or whatever MASTER_PORT is configured
```

## Software Issues

### Package installation fails

**Error**: `pip install` errors

**Solutions**:
```bash
# Use a virtual environment (don't modify the system Python)
python -m venv ~/myenv
source ~/myenv/bin/activate
pip install <package>

# For CUDA packages, match the CUDA version
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Clear the pip cache if it is corrupted
pip cache purge
```

### Python version issues

**Error**: A package requires a different Python version

**Solutions**:
```bash
# Install an alternate Python (don't replace the system Python)
sudo apt install python3.11 python3.11-venv python3.11-dev

# Create a venv with the specific Python
python3.11 -m venv ~/py311env
source ~/py311env/bin/activate
```

### ImportError or ModuleNotFoundError

**Error**: Module not found despite installation

**Solutions**:
```bash
# Verify you are in the correct Python environment
which python
pip list | grep <module>

# Ensure the virtual environment is activated
source ~/myenv/bin/activate

# Reinstall in the correct environment
pip uninstall <package>
pip install <package>
```

## Training Issues

### Training hangs

**Problem**: Training stops progressing, with no output

**Solutions**:
```bash
# Check GPU utilization
watch -n 1 nvidia-smi

# If the GPUs sit at 0%, the bottleneck is likely data loading;
# increase num_workers in the DataLoader (see the snippet below)

# Check for deadlocks in distributed training
export NCCL_DEBUG=INFO

# Add a timeout when initializing the process group (in Python):
# dist.init_process_group(..., timeout=timedelta(minutes=30))
```
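
For the data-loading case, a typical DataLoader tune-up looks like this; `dataset` stands in for your dataset object, and the worker count is a starting point, not a rule:

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,            # roughly CPU cores / GPUs; tune empirically
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # avoid respawning workers every epoch
    prefetch_factor=4,        # batches prefetched per worker
)
```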

### Checkpoint corruption

**Error**: `RuntimeError: storage has wrong size` or similar

**Solutions**:
```python
import os

import torch

# Use a safe saving pattern
checkpoint_path = "/lambda/nfs/storage/checkpoint.pt"
temp_path = checkpoint_path + ".tmp"

# Keep the previous good checkpoint as a backup
if os.path.exists(checkpoint_path):
    os.replace(checkpoint_path, checkpoint_path + ".backup")

# Save to a temp file first
torch.save(state_dict, temp_path)
# Then rename atomically
os.rename(temp_path, checkpoint_path)

# When loading, fall back to the backup if the latest file is corrupted
try:
    state = torch.load(checkpoint_path)
except (RuntimeError, EOFError):
    state = torch.load(checkpoint_path + ".backup")
```

### Memory leak

**Problem**: Memory usage grows over time

**Solutions**:
```python
# Clear the CUDA cache periodically
torch.cuda.empty_cache()

# Detach tensors when logging
loss_value = loss.detach().cpu().item()

# Don't accumulate gradients unintentionally
optimizer.zero_grad(set_to_none=True)

# Use gradient accumulation properly
if (step + 1) % accumulation_steps == 0:
    optimizer.step()
    optimizer.zero_grad()
```

## Billing Issues

### Unexpected charges

**Problem**: The bill is higher than expected

**Solutions**:
```bash
# Check for forgotten running instances
curl -u $LAMBDA_API_KEY: \
    https://cloud.lambdalabs.com/api/v1/instances | jq '.data[].id'

# Terminate all instances
# Lambda console > Instances > Terminate all

# Lambda charges by the minute
# There is no "stop" feature: instances can only be terminated,
# and terminated instances incur no charges
```

### Instance terminated unexpectedly

**Problem**: The instance disappeared without manual termination

**Possible causes**:
- Payment issue (card declined)
- Account suspension
- Instance health check failure

**Solutions**:
- Check email for Lambda notifications
- Verify the payment method in the console
- Contact Lambda support
- Always checkpoint to the filesystem

## Common Error Messages

| Error | Cause | Solution |
|-------|-------|----------|
| `No capacity available` | Region/GPU sold out | Try a different region or GPU type |
| `Permission denied (publickey)` | SSH key mismatch | Re-add the key, check permissions |
| `CUDA out of memory` | Model too large | Reduce batch size, use a larger GPU |
| `No space left on device` | Disk full | Clean up or use the filesystem |
| `Connection refused` | Instance not ready | Wait 3-15 minutes for boot |
| `Module not found` | Wrong Python env | Activate the correct virtualenv |

## Getting Help

1. **Documentation**: https://docs.lambda.ai
2. **Support**: https://support.lambdalabs.com
3. **Email**: support@lambdalabs.com
4. **Status**: Check the Lambda status page for outages

### Information to Include

When contacting support, include:
- Instance ID
- Region
- Instance type
- Error message (full traceback)
- Steps to reproduce
- Time of occurrence