refactor: reorganize skills into sub-categories

The skills directory was getting disorganized: mlops alone had 40 skills in a flat list, and 12 categories were singletons with just one skill each.

Code change:
- prompt_builder.py: Support sub-categories in skill scanner. skills/mlops/training/axolotl/SKILL.md now shows as category 'mlops/training' instead of just 'mlops'. Backwards-compatible with existing flat structure.

Split mlops (40 skills) into 7 sub-categories:
- mlops/training (12): accelerate, axolotl, flash-attention, grpo-rl-training, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, torchtitan, trl-fine-tuning, unsloth
- mlops/inference (8): gguf, guidance, instructor, llama-cpp, obliteratus, outlines, tensorrt-llm, vllm
- mlops/models (6): audiocraft, clip, llava, segment-anything, stable-diffusion, whisper
- mlops/vector-databases (4): chroma, faiss, pinecone, qdrant
- mlops/evaluation (5): huggingface-tokenizers, lm-evaluation-harness, nemo-curator, saelens, weights-and-biases
- mlops/cloud (2): lambda-labs, modal
- mlops/research (1): dspy

Merged singleton categories:
- gifs → media (gif-search joins youtube-content)
- music-creation → media (heartmula, songsee)
- diagramming → creative (excalidraw joins ascii-art)
- ocr-and-documents → productivity
- domain → research (domain-intel)
- feeds → research (blogwatcher)
- market-data → research (polymarket)

Fixed misplaced skills:
- mlops/code-review → software-development (not ML-specific)
- mlops/ml-paper-writing → research (academic writing)

Added DESCRIPTION.md files for all new/updated categories.
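
The sub-category rule described for prompt_builder.py can be sketched roughly as below. This is an illustration only, not the actual scanner code: `skill_category` and its `root` parameter are hypothetical names, and the sketch assumes the category is every directory between the skills root and the skill's own folder.

```python
from pathlib import PurePosixPath


def skill_category(skill_md: str, root: str = "skills") -> str:
    """Return the category for a SKILL.md path (hypothetical helper).

    Assumption: the category is every directory between `root` and the
    skill's own directory, so flat and nested layouts both work.
    """
    parts = PurePosixPath(skill_md).relative_to(root).parts
    # parts looks like (<category dirs...>, <skill name>, "SKILL.md")
    return "/".join(parts[:-2])


# Nested layout introduced by this commit
print(skill_category("skills/mlops/training/axolotl/SKILL.md"))  # mlops/training
# Existing flat layout keeps its single-level category
print(skill_category("skills/mlops/axolotl/SKILL.md"))  # mlops
```

Under this rule no special case is needed for the flat layout, which is one way the backwards compatibility mentioned above could fall out naturally.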
2026-04-25 00:51:20 +00:00 · 2026-03-09 03:35:53 -07:00 · 2026-03-09 03:35:53 -07:00 · 732c66b0f3
commit 732c66b0f3
parent d6c710706f
217 changed files with 39 additions and 4 deletions
--- a/skills/mlops/cloud/lambda-labs/references/troubleshooting.md
+++ b/skills/mlops/cloud/lambda-labs/references/troubleshooting.md
@@ -0,0 +1,530 @@
+# Lambda Labs Troubleshooting Guide
+
+## Instance Launch Issues
+
+### No instances available
+
+**Error**: "No capacity available" or instance type not listed
+
+**Solutions**:
+```bash
+# Check availability via API
+curl -u $LAMBDA_API_KEY: \
+  https://cloud.lambdalabs.com/api/v1/instance-types | jq '.data | to_entries[] | select(.value.regions_with_capacity_available | length > 0) | .key'
+
+# Try different regions
+# US regions: us-west-1, us-east-1, us-south-1
+# International: eu-west-1, asia-northeast-1, etc.
+
+# Try alternative GPU types
+# H100 not available? Try A100
+# A100 not available? Try A10 or A6000
+```
+
+### Instance stuck launching
+
+**Problem**: Instance shows "booting" for over 20 minutes
+
+**Solutions**:
+```bash
+# Single-GPU: Should be ready in 3-5 minutes
+# Multi-GPU (8x): May take 10-15 minutes
+
+# If stuck longer:
+# 1. Terminate the instance
+# 2. Try a different region
+# 3. Try a different instance type
+# 4. Contact Lambda support if persistent
+```
+
+### API authentication fails
+
+**Error**: `401 Unauthorized` or `403 Forbidden`
+
+**Solutions**:
+```bash
+# Verify the API key is set and has the expected prefix (prints only the first characters, not the whole secret)
+echo "${LAMBDA_API_KEY:0:8}..."
+
+# Test API key
+curl -u $LAMBDA_API_KEY: \
+  https://cloud.lambdalabs.com/api/v1/instance-types
+
+# Generate new API key from Lambda console if needed
+# Settings > API keys > Generate
+```
+
+### Quota limits reached
+
+**Error**: "Instance limit reached" or "Quota exceeded"
+
+**Solutions**:
+- Check current running instances in console
+- Terminate unused instances
+- Contact Lambda support to request quota increase
+- Use 1-Click Clusters for large-scale needs
+
+## SSH Connection Issues
+
+### Connection refused
+
+**Error**: `ssh: connect to host <IP> port 22: Connection refused`
+
+**Solutions**:
+```bash
+# Wait for instance to fully initialize
+# Single-GPU: 3-5 minutes
+# Multi-GPU: 10-15 minutes
+
+# Check instance status in console (should be "active")
+
+# Verify correct IP address
+curl -u $LAMBDA_API_KEY: \
+  https://cloud.lambdalabs.com/api/v1/instances | jq '.data[].ip'
+```
+
+### Permission denied
+
+**Error**: `Permission denied (publickey)`
+
+**Solutions**:
+```bash
+# Verify SSH key matches
+ssh -v -i ~/.ssh/lambda_key ubuntu@<IP>
+
+# Check key permissions
+chmod 600 ~/.ssh/lambda_key
+chmod 644 ~/.ssh/lambda_key.pub
+
+# Verify key was added to Lambda console before launch
+# Keys must be added BEFORE launching instance
+
+# Check authorized_keys on instance (if you have another way in)
+cat ~/.ssh/authorized_keys
+```
+
+### Host key verification failed
+
+**Error**: `WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!`
+
+**Solutions**:
+```bash
+# This happens when an IP address is reused by a different instance
+# Remove the old host key
+ssh-keygen -R <IP>
+
+# Then connect again
+ssh ubuntu@<IP>
+```
+
+### Timeout during SSH
+
+**Error**: `ssh: connect to host <IP> port 22: Operation timed out`
+
+**Solutions**:
+```bash
+# Check if instance is in "active" state
+
+# Verify firewall allows SSH (port 22)
+# Lambda console > Firewall
+
+# Check your local network allows outbound SSH
+
+# Try from different network/VPN
+```
+
+## GPU Issues
+
+### GPU not detected
+
+**Error**: `nvidia-smi: command not found` or no GPUs shown
+
+**Solutions**:
+```bash
+# Reboot instance
+sudo reboot
+
+# Reinstall NVIDIA drivers (if needed)
+wget -nv -O- https://lambdalabs.com/install-lambda-stack.sh | sh -
+sudo reboot
+
+# Check driver status
+nvidia-smi
+lsmod | grep nvidia
+```
+
+### CUDA out of memory
+
+**Error**: `torch.cuda.OutOfMemoryError: CUDA out of memory`
+
+**Solutions**:
+```python
+# Check GPU memory
+import torch
+print(torch.cuda.get_device_properties(0).total_memory / 1e9, "GB")
+
+# Clear cache
+torch.cuda.empty_cache()
+
+# Reduce batch size
+batch_size = batch_size // 2
+
+# Enable gradient checkpointing
+model.gradient_checkpointing_enable()
+
+# Use mixed precision
+# (torch.autocast supersedes the older torch.cuda.amp.autocast)
+with torch.autocast(device_type="cuda"):
+    outputs = model(**inputs)
+
+# Use larger GPU instance
+# A100-40GB → A100-80GB → H100
+```
+
+### CUDA version mismatch
+
+**Error**: `CUDA driver version is insufficient for CUDA runtime version`
+
+**Solutions**:
+```bash
+# Check versions
+nvidia-smi  # Shows the driver version and the highest CUDA version it supports
+nvcc --version  # Shows the installed CUDA toolkit version
+
+# Lambda Stack should have compatible versions
+# If mismatch, reinstall Lambda Stack
+wget -nv -O- https://lambdalabs.com/install-lambda-stack.sh | sh -
+sudo reboot
+
+# Or install a PyTorch build that matches the CUDA version
+pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu121
+```
+
+### Multi-GPU not working
+
+**Error**: Only one GPU being used
+
+**Solutions**:
+```python
+# Check all GPUs visible
+import torch
+print(f"GPUs available: {torch.cuda.device_count()}")
+
+# Verify CUDA_VISIBLE_DEVICES not set restrictively
+import os
+print(os.environ.get("CUDA_VISIBLE_DEVICES", "not set"))
+
+# Use DataParallel (single process, simplest to enable)
+model = torch.nn.DataParallel(model)
+# or DistributedDataParallel (launch with torchrun and call init_process_group first)
+model = torch.nn.parallel.DistributedDataParallel(model)
+```
+
+## Filesystem Issues
+
+### Filesystem not mounted
+
+**Error**: `/lambda/nfs/<name>` doesn't exist
+
+**Solutions**:
+```bash
+# Filesystem must be attached at launch time
+# Cannot attach to running instance
+
+# Verify filesystem was selected during launch
+
+# Check mount points
+df -h | grep lambda
+
+# If missing, terminate and relaunch with filesystem
+```
+
+### Slow filesystem performance
+
+**Problem**: Reading/writing to filesystem is slow
+
+**Solutions**:
+```bash
+# Use local SSD for temporary/intermediate files
+# /home/ubuntu has fast NVMe storage
+
+# Copy frequently accessed data to local storage
+cp -r /lambda/nfs/storage/dataset /home/ubuntu/dataset
+
+# Use filesystem for checkpoints and final outputs only
+
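+# Measure read throughput from the filesystem (replace <large_file> with a real file)
+dd if=/lambda/nfs/storage/<large_file> of=/dev/null bs=1M status=progress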
+```
+
+### Data lost after termination
+
+**Problem**: Files disappeared after instance terminated
+
+**Solutions**:
+```bash
+# Root volume (/home/ubuntu) is EPHEMERAL
+# Data there is lost on termination
+
+# ALWAYS use the filesystem for persistent data:
+#   /lambda/nfs/<filesystem_name>/
+
+# Sync important local files before terminating
+rsync -av /home/ubuntu/outputs/ /lambda/nfs/storage/outputs/
+```
+
+### Filesystem full
+
+**Error**: `No space left on device`
+
+**Solutions**:
+```bash
+# Check filesystem usage
+df -h /lambda/nfs/storage
+
+# Find large files
+du -sh /lambda/nfs/storage/* | sort -h
+
+# Delete checkpoint files older than 7 days (preview with -print before using -delete)
+find /lambda/nfs/storage/checkpoints -type f -mtime +7 -delete
+
+# Increase filesystem size in Lambda console
+# (may require support request)
+```
+
+## Network Issues
+
+### Port not accessible
+
+**Error**: Cannot connect to service (TensorBoard, Jupyter, etc.)
+
+**Solutions**:
+```bash
+# Lambda default: Only port 22 is open
+# Configure firewall in Lambda console
+
+# Or use SSH tunneling (recommended)
+ssh -L 6006:localhost:6006 ubuntu@<IP>
+# Access at http://localhost:6006
+
+# For Jupyter
+ssh -L 8888:localhost:8888 ubuntu@<IP>
+```
+
+### Slow data download
+
+**Problem**: Downloading datasets is slow
+
+**Solutions**:
+```bash
+# Check available bandwidth
+speedtest-cli
+
+# Use multi-threaded download
+aria2c -x 16 <URL>
+
+# For HuggingFace models
+pip install hf_transfer
+export HF_HUB_ENABLE_HF_TRANSFER=1
+
+# For S3, raise the CLI's concurrent request limit (default 10), then sync
+aws configure set default.s3.max_concurrent_requests 32 && aws s3 sync s3://bucket/data /local/data
+```
+
+### Inter-node communication fails
+
+**Error**: Distributed training can't connect between nodes
+
+**Solutions**:
+```bash
+# Verify nodes in same region (required)
+
+# Check private IPs can communicate
+ping <other_node_private_ip>
+
+# Verify NCCL settings
+export NCCL_DEBUG=INFO
+export NCCL_IB_DISABLE=0  # Enable InfiniBand if available
+
+# Check firewall allows distributed ports
+# Need: 29500 (PyTorch), or configured MASTER_PORT
+```
+
+## Software Issues
+
+### Package installation fails
+
+**Error**: `pip install` errors
+
+**Solutions**:
+```bash
+# Use a virtual environment (don't modify system Python)
+python3 -m venv ~/myenv
+source ~/myenv/bin/activate
+pip install <package>
+
+# For CUDA packages, match CUDA version
+pip install torch --index-url https://download.pytorch.org/whl/cu121
+
+# Clear pip cache if corrupted
+pip cache purge
+```
+
+### Python version issues
+
+**Error**: Package requires different Python version
+
+**Solutions**:
+```bash
+# Install alternate Python (don't replace system Python)
+sudo apt install python3.11 python3.11-venv python3.11-dev
+
+# Create venv with specific Python
+python3.11 -m venv ~/py311env
+source ~/py311env/bin/activate
+```
+
+### ImportError or ModuleNotFoundError
+
+**Error**: Module not found despite installation
+
+**Solutions**:
+```bash
+# Verify correct Python environment
+which python
+pip list | grep <module>
+
+# Ensure virtual environment is activated
+source ~/myenv/bin/activate
+
+# Reinstall in correct environment
+pip uninstall <package>
+pip install <package>
+```
+
+## Training Issues
+
+### Training hangs
+
+**Problem**: Training stops progressing, no output
+
+**Solutions**:
+```bash
+# Check GPU utilization
+watch -n 1 nvidia-smi
+
+# If GPUs at 0%, likely data loading bottleneck
+# Increase num_workers in DataLoader
+
+# Check for deadlocks in distributed training
+export NCCL_DEBUG=INFO
+
+# Add a timeout in the training script (Python, not shell):
+#   dist.init_process_group(..., timeout=timedelta(minutes=30))
+```
+
+### Checkpoint corruption
+
+**Error**: `RuntimeError: storage has wrong size` or similar
+
+**Solutions**:
+```python
+import os
+import torch
+
+# Safe saving pattern: write to a temp file, then rename over the target
+checkpoint_path = "/lambda/nfs/storage/checkpoint.pt"
+temp_path = checkpoint_path + ".tmp"
+torch.save(state_dict, temp_path)  # state_dict: e.g. model.state_dict()
+os.replace(temp_path, checkpoint_path)  # atomic when both paths are on the same filesystem
+
+# Loading: fall back to an earlier copy if the latest file is corrupted
+try:
+    state = torch.load(checkpoint_path)
+except Exception:
+    # Assumes a previous checkpoint was kept at this path
+    state = torch.load(checkpoint_path + ".backup")
+```
+
+### Memory leak
+
+**Problem**: Memory usage grows over time
+
+**Solutions**:
+```python
+# Clear CUDA cache periodically
+torch.cuda.empty_cache()
+
+# Detach tensors when logging
+loss_value = loss.detach().cpu().item()
+
+# Don't accumulate gradients unintentionally
+optimizer.zero_grad(set_to_none=True)
+
+# Use gradient accumulation properly
+if (step + 1) % accumulation_steps == 0:
+    optimizer.step()
+    optimizer.zero_grad()
+```
+
+## Billing Issues
+
+### Unexpected charges
+
+**Problem**: Bill higher than expected
+
+**Solutions**:
+```bash
+# Check for forgotten running instances
+curl -u $LAMBDA_API_KEY: \
+  https://cloud.lambdalabs.com/api/v1/instances | jq '.data[].id'
+
+# Terminate all instances
+# Lambda console > Instances > Terminate all
+
+# Lambda charges by the minute for as long as an instance exists
+# Instances cannot be stopped, only terminated, so charges end only at termination
+```
+
+### Instance terminated unexpectedly
+
+**Problem**: Instance disappeared without manual termination
+
+**Possible causes**:
+- Payment issue (card declined)
+- Account suspension
+- Instance health check failure
+
+**Solutions**:
+- Check email for Lambda notifications
+- Verify payment method in console
+- Contact Lambda support
+- Always checkpoint to filesystem
+
+## Common Error Messages
+
+| Error | Cause | Solution |
+|-------|-------|----------|
+| `No capacity available` | Region/GPU sold out | Try different region or GPU type |
+| `Permission denied (publickey)` | SSH key mismatch | Re-add key, check permissions |
+| `CUDA out of memory` | Model too large | Reduce batch size, use larger GPU |
+| `No space left on device` | Disk full | Clean up or use filesystem |
+| `Connection refused` | Instance not ready | Wait 3-15 minutes for boot |
+| `ModuleNotFoundError` | Wrong Python env | Activate correct virtualenv |
+
+## Getting Help
+
+1. **Documentation**: https://docs.lambda.ai
+2. **Support**: https://support.lambdalabs.com
+3. **Email**: support@lambdalabs.com
+4. **Status**: Check Lambda status page for outages
+
+### Information to Include
+
+When contacting support, include:
+- Instance ID
+- Region
+- Instance type
+- Error message (full traceback)
+- Steps to reproduce
+- Time of occurrence