refactor: reorganize skills into sub-categories

The skills directory was getting disorganized: mlops alone had 40 skills in a flat list, and 12 categories were singletons with just one skill each.

Code change:
- prompt_builder.py: Support sub-categories in skill scanner. skills/mlops/training/axolotl/SKILL.md now shows as category 'mlops/training' instead of just 'mlops'. Backwards-compatible with existing flat structure.

Split mlops (40 skills) into 7 sub-categories:
- mlops/training (12): accelerate, axolotl, flash-attention, grpo-rl-training, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, torchtitan, trl-fine-tuning, unsloth
- mlops/inference (8): gguf, guidance, instructor, llama-cpp, obliteratus, outlines, tensorrt-llm, vllm
- mlops/models (6): audiocraft, clip, llava, segment-anything, stable-diffusion, whisper
- mlops/vector-databases (4): chroma, faiss, pinecone, qdrant
- mlops/evaluation (5): huggingface-tokenizers, lm-evaluation-harness, nemo-curator, saelens, weights-and-biases
- mlops/cloud (2): lambda-labs, modal
- mlops/research (1): dspy

Merged singleton categories:
- gifs → media (gif-search joins youtube-content)
- music-creation → media (heartmula, songsee)
- diagramming → creative (excalidraw joins ascii-art)
- ocr-and-documents → productivity
- domain → research (domain-intel)
- feeds → research (blogwatcher)
- market-data → research (polymarket)

Fixed misplaced skills:
- mlops/code-review → software-development (not ML-specific)
- mlops/ml-paper-writing → research (academic writing)

Added DESCRIPTION.md files for all new/updated categories.
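
The sub-category behaviour described for prompt_builder.py can be illustrated with a minimal sketch. It is not the actual scanner code: the helper name `skill_category`, the `skills_root` parameter, and the rule "everything between the skills root and the skill's own directory is the category" are assumptions chosen to match the examples in this message.

```python
from pathlib import PurePosixPath


def skill_category(skill_md: str, skills_root: str = "skills") -> str:
    """Return the category for a SKILL.md path (hypothetical helper).

    Every directory between the skills root and the skill's own directory
    is treated as part of the category, so nested layouts
    (mlops/training/axolotl) and flat layouts (media/gif-search) both work.
    """
    relative = PurePosixPath(skill_md).relative_to(skills_root)
    # Drop the trailing "<skill-name>/SKILL.md" components.
    category_parts = relative.parts[:-2]
    return "/".join(category_parts)


print(skill_category("skills/mlops/training/axolotl/SKILL.md"))
print(skill_category("skills/media/gif-search/SKILL.md"))
```

Deriving the category from the path depth, rather than from a fixed directory level, is one way a scanner could stay backwards-compatible with the existing flat structure while also reporting nested categories.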
2026-04-26 01:01:40 +00:00 · 2026-03-09 03:35:53 -07:00 · 2026-03-09 03:35:53 -07:00 · 732c66b0f3
commit 732c66b0f3
parent d6c710706f
217 changed files with 39 additions and 4 deletions
--- a/skills/mlops/training/slime/references/troubleshooting.md
+++ b/skills/mlops/training/slime/references/troubleshooting.md
@@ -0,0 +1,386 @@
+# slime Troubleshooting Guide
+
+## Common Issues and Solutions
+
+### SGLang Issues
+
+#### Issue: SGLang Engine Crash
+
+**Symptoms**: Inference engine dies mid-training, connection errors
+
+**Solutions**:
+
+1. **Enable fault tolerance**:
+```bash
+--use-fault-tolerance
+```
+
+2. **Increase memory allocation**:
+```bash
+--sglang-mem-fraction-static 0.85  # Increase from 0.8
+```
+
+3. **Reduce batch size**:
+```bash
+--rollout-batch-size 16  # Reduce from 32
+```
+
+4. **Disable CUDA graphs** (for debugging):
+```bash
+--sglang-disable-cuda-graph
+```
+
+#### Issue: SGLang Router Load Imbalance
+
+**Symptoms**: Some SGLang engines overloaded while others idle
+
+**Solutions**:
+
+1. **Adjust routing strategy**:
+```bash
+--sglang-router-strategy round_robin
+```
+
+2. **Increase number of engines**:
+```bash
+--rollout-num-gpus-per-engine 1  # More engines, fewer GPUs each
+```
+
+### Weight Synchronization Issues
+
+#### Issue: Weight Sync Timeout
+
+**Symptoms**: Training hangs after rollout, timeout errors
+
+**Solutions**:
+
+1. **Increase sync interval** (async mode):
+```bash
+--update-weights-interval 5  # Increase from 2
+```
+
+2. **Use colocated mode** (eliminates network transfer):
+```bash
+--colocate
+```
+
+3. **Check network bandwidth**:
+```bash
+# Verify InfiniBand ports are Active and check the reported link rate
+ibstat
+```
+
+#### Issue: Weight Sync Failures in Multi-Node
+
+**Symptoms**: Nodes fail to receive updated weights
+
+**Solutions**:
+
+1. **Set NCCL environment**:
+```bash
+export NCCL_DEBUG=INFO
+export NCCL_SOCKET_IFNAME=eth0
+export NCCL_IB_DISABLE=0
+```
+
+2. **Increase timeout**:
+```bash
+export NCCL_TIMEOUT=1800
+```
+
+### Memory Issues
+
+#### Issue: OOM During Training
+
+**Symptoms**: CUDA OOM in backward pass
+
+**Solutions**:
+
+1. **Enable gradient checkpointing**:
+```bash
+--recompute-activations
+```
+
+2. **Reduce micro-batch size**:
+```bash
+--micro-batch-size 1
+```
+
+3. **Enable sequence parallelism**:
+```bash
+--sequence-parallel
+```
+
+4. **Reduce global batch size**:
+```bash
+--global-batch-size 128  # Reduce from 256
+```
+
+#### Issue: OOM in Colocated Mode
+
+**Symptoms**: OOM when both training and inference run on same GPUs
+
+**Solutions**:
+
+1. **Reduce SGLang memory**:
+```bash
+--sglang-mem-fraction-static 0.4  # Reduce from 0.8
+```
+
+2. **Enable offloading**:
+```bash
+--offload-optimizer-states
+```
+
+3. **Use smaller sequence length**:
+```bash
+--seq-length 2048  # Reduce from 4096
+```
+
+### Data Loading Issues
+
+#### Issue: Slow Data Loading
+
+**Symptoms**: GPU idle during data fetch, low GPU utilization
+
+**Solutions**:
+
+1. **Increase data workers**:
+```bash
+--num-data-workers 4
+```
+
+2. **Use streaming dataset**:
+```bash
+--streaming-data
+```
+
+3. **Pre-tokenize data**:
+```python
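+# Pre-process data offline. Sketch only: "model_path", the file names, and a
+# plain-text "prompt" field are placeholders for your own setup.
+from datasets import load_dataset
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained("model_path")
+dataset = load_dataset("json", data_files="data.jsonl", split="train")
+tokenized = dataset.map(lambda row: tokenizer(row["prompt"]))
+tokenized.save_to_disk("data_tokenized")  # Save tokenized data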
+```
+
+#### Issue: Data Format Errors
+
+**Symptoms**: KeyError, missing fields, parsing failures
+
+**Solutions**:
+
+1. **Verify data format**:
+```python
+import json
+with open("data.jsonl") as f:
+    for lineno, line in enumerate(f, start=1):
+        data = json.loads(line)
+        assert "prompt" in data, f"Line {lineno}: missing prompt field"
+        assert "label" in data, f"Line {lineno}: missing label field"
+```
+
+2. **Check key names**:
+```bash
+--input-key prompt  # Must match your data
+--label-key label   # Must match your data
+```
+
+### Training Stability Issues
+
+#### Issue: Loss Explosion / NaN
+
+**Symptoms**: Loss becomes NaN or explodes
+
+**Solutions**:
+
+1. **Reduce learning rate**:
+```bash
+--lr 1e-6  # Reduce from 5e-6
+```
+
+2. **Enable gradient clipping**:
+```bash
+--clip-grad 1.0
+```
+
+3. **Check for data issues**:
+```python
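+# Verify no empty prompts (loads the same JSONL file used for training)
+import json
+with open("data.jsonl") as f:
+    dataset = [json.loads(line) for line in f if line.strip()]
+for i, sample in enumerate(dataset):
+    assert len(sample["prompt"]) > 0, f"Empty prompt in sample {i}"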
+```
+
+4. **Use BF16 instead of FP16**:
+```bash
+--bf16  # More numerically stable
+```
+
+#### Issue: Reward Collapse
+
+**Symptoms**: Reward drops to zero, model outputs garbage
+
+**Solutions**:
+
+1. **Increase KL penalty**:
+```bash
+--kl-loss-coef 0.01  # Increase from 0.001
+```
+
+2. **Reduce number of samples**:
+```bash
+--n-samples-per-prompt 4  # Reduce from 8
+```
+
+3. **Verify reward function**:
+```python
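+# Test reward function independently
+import asyncio
+from argparse import Namespace
+from slime.utils.types import Sample  # assumed import path; adjust to your slime version
+from custom_rm import reward_func
+
+args = Namespace()  # stub; add any fields your reward function reads
+sample = Sample(prompt="test", response="test response")
+reward = asyncio.run(reward_func(args, sample))  # assumes reward_func is async, as in Debugging Tips
+print(f"Reward: {reward}")  # Should be reasonable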
+```
+
+### Async Training Issues
+
+#### Issue: Async Training Not Supported with Colocate
+
+**Symptoms**: Error when using `--colocate` with `train_async.py`
+
+**Solution**: Colocated mode is NOT supported for async training. Use separate GPUs:
+```bash
+# Run without the --colocate flag
+python train_async.py \
+    --actor-num-gpus-per-node 4 \
+    --rollout-num-gpus 4
+```
+
+#### Issue: Stale Weights in Async Mode
+
+**Symptoms**: Policy divergence, inconsistent behavior
+
+**Solutions**:
+
+1. **Reduce async buffer size**:
+```bash
+--async-buffer-size 2  # Reduce from 4
+```
+
+2. **Increase weight update frequency**:
+```bash
+--update-weights-interval 1  # Sync every rollout
+```
+
+### Multi-Turn Training Issues
+
+#### Issue: Tool Responses Included in Loss
+
+**Symptoms**: Model learns to output tool responses verbatim
+
+**Solution**: Properly set loss mask in custom generate function:
+```python
+def build_loss_mask(sample):
+    """Create loss mask that excludes tool responses."""
+    # is_tool_response is a helper you must supply; it is not defined in this
+    # guide. It should report whether a token came from a tool response.
+    mask = []
+    for token in sample.tokens:
+        if is_tool_response(token, sample.metadata):
+            mask.append(0)  # Don't compute loss
+        else:
+            mask.append(1)  # Compute loss
+    return mask
+```
+
+#### Issue: Multi-Turn Context Too Long
+
+**Symptoms**: OOM or truncation in multi-turn conversations
+
+**Solutions**:
+
+1. **Limit conversation history**:
+```python
+# In custom generate function (assumes sample.prompt is a list of chat messages)
+conversation = sample.prompt[-10:]  # Keep last 10 messages
+```
+
+2. **Increase context length**:
+```bash
+--sglang-context-length 16384
+```
+
+### Checkpoint Issues
+
+#### Issue: Checkpoint Loading Fails
+
+**Symptoms**: Cannot load saved checkpoint
+
+**Solutions**:
+
+1. **Verify checkpoint path**:
+```bash
+ls -la /path/to/checkpoint/
+```
+
+2. **Check parallelism matches**:
+```bash
+# Checkpoint was saved with TP=2, must load with TP=2
+--tensor-model-parallel-size 2
+```
+
+3. **Convert HuggingFace to Megatron** (if needed):
+```bash
+python tools/convert_hf_to_megatron.py \
+    --hf_model_path /path/to/hf/model \
+    --save_path /path/to/megatron/checkpoint
+```
+
+### Debugging Tips
+
+#### Enable Verbose Logging
+
+```bash
+--log-level DEBUG
+export SLIME_DEBUG=1
+```
+
+#### Check GPU Utilization
+
+```bash
+watch -n 1 nvidia-smi
+```
+
+#### Monitor Training
+
+```bash
+tensorboard --logdir outputs/
+```
+
+#### Test Custom Functions Independently
+
+```python
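+# Test reward function
+import asyncio
+from argparse import Namespace
+from slime.utils.types import Sample  # assumed import path; adjust to your slime version
+from custom_rm import reward_func
+
+async def test():
+    args = Namespace()  # stub; add any fields your reward function reads
+    sample = Sample(prompt="test", response="test", label="expected")
+    reward = await reward_func(args, sample)
+    print(f"Reward: {reward}")
+
+asyncio.run(test())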
+```
+
+## Constraint Reference
+
+Key constraint to remember:
+
+```
+rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout
+```
+
+Example: `32 × 8 = 256 × 1`
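+
+As a worked check, assuming `num_steps_per_rollout` must come out as a positive whole number, here is how the batch-size changes suggested earlier in this guide interact with the constraint. Solving for the step count:
+
+```
+num_steps_per_rollout = (rollout_batch_size × n_samples_per_prompt) / global_batch_size
+
+baseline                     (32 × 8) / 256 = 1
+--global-batch-size 128      (32 × 8) / 128 = 2
+--rollout-batch-size 16      (16 × 8) / 256 = 0.5  (not a whole number)
+--n-samples-per-prompt 4     (32 × 4) / 256 = 0.5  (not a whole number)
+```
+
+Under that assumption, halving `--rollout-batch-size` or `--n-samples-per-prompt` on its own breaks the equality; lowering `--global-batch-size` to 128 at the same time restores it (`16 × 8 = 128 × 1`, `32 × 4 = 128 × 1`).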
+
+## Resources
+
+- GitHub Issues: https://github.com/THUDM/slime/issues
+- Documentation: https://thudm.github.io/slime/
+- Examples: `examples/` directory