slime Troubleshooting Guide
Common Issues and Solutions
SGLang Issues
Issue: SGLang Engine Crash
Symptoms: Inference engine dies mid-training, connection errors
Solutions:
- Enable fault tolerance:
--use-fault-tolerance
- Increase memory allocation:
--sglang-mem-fraction-static 0.85 # Increase from 0.8
- Reduce batch size:
--rollout-batch-size 16 # Reduce from 32
- Disable CUDA graphs (for debugging):
--sglang-disable-cuda-graph
Issue: SGLang Router Load Imbalance
Symptoms: Some SGLang engines overloaded while others idle
Solutions:
- Adjust routing strategy:
--sglang-router-strategy round_robin
- Increase number of engines:
--rollout-num-gpus-per-engine 1 # More engines, fewer GPUs each
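The trade-off above is simple arithmetic: the number of SGLang engines is the total rollout GPU count divided by GPUs per engine. A minimal sketch (the helper name is illustrative, not slime API; it assumes `--rollout-num-gpus` divides evenly):

```python
# Illustrative arithmetic only: engines = total rollout GPUs / GPUs per engine.
def num_engines(rollout_num_gpus: int, gpus_per_engine: int) -> int:
    assert rollout_num_gpus % gpus_per_engine == 0, "GPUs must divide evenly across engines"
    return rollout_num_gpus // gpus_per_engine

print(num_engines(8, 2))  # 4 engines
print(num_engines(8, 1))  # 8 engines: finer-grained load balancing across the router
```

More, smaller engines give the router more targets to spread requests over, at the cost of per-engine throughput.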
Weight Synchronization Issues
Issue: Weight Sync Timeout
Symptoms: Training hangs after rollout, timeout errors
Solutions:
- Increase sync interval (async mode):
--update-weights-interval 5 # Increase from 2
- Use colocated mode (eliminates network transfer):
--colocate
- Check network bandwidth:
# Verify InfiniBand is enabled
ibstat
Issue: Weight Sync Failures in Multi-Node
Symptoms: Nodes fail to receive updated weights
Solutions:
- Set NCCL environment:
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0
- Increase timeout:
export NCCL_TIMEOUT=1800
Memory Issues
Issue: OOM During Training
Symptoms: CUDA OOM in backward pass
Solutions:
- Enable gradient checkpointing:
--recompute-activations
- Reduce micro-batch size:
--micro-batch-size 1
- Enable sequence parallelism:
--sequence-parallel
- Reduce global batch size:
--global-batch-size 128 # Reduce from 256
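These batch-size knobs interact through the usual Megatron-style relation global_batch = micro_batch × gradient_accumulation_steps × data_parallel_size; peak activation memory is driven mainly by the micro-batch size, while the global batch size controls how many accumulation steps are needed. A sketch under that assumption (helper name and relation are illustrative, not slime API):

```python
# Assumed Megatron-style relation between batch sizes (illustrative helper).
def grad_accum_steps(global_batch: int, micro_batch: int, dp_size: int) -> int:
    assert global_batch % (micro_batch * dp_size) == 0, "batch sizes must divide evenly"
    return global_batch // (micro_batch * dp_size)

# Peak memory tracks --micro-batch-size; --global-batch-size sets total work per step.
print(grad_accum_steps(256, 2, 8))  # 16 accumulation steps
print(grad_accum_steps(128, 1, 8))  # 16 steps, but smaller micro-batches per step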
Issue: OOM in Colocated Mode
Symptoms: OOM when both training and inference run on same GPUs
Solutions:
- Reduce SGLang memory:
--sglang-mem-fraction-static 0.4 # Reduce from 0.8
- Enable offloading:
--offload-optimizer-states
- Use smaller sequence length:
--seq-length 2048 # Reduce from 4096
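In colocated mode the static SGLang fraction and the training process share one GPU's memory, so lowering `--sglang-mem-fraction-static` directly frees memory for training. A back-of-envelope sketch with illustrative numbers (an 80 GB GPU is assumed; the split ignores framework overhead):

```python
# Back-of-envelope memory split in colocated mode (illustrative numbers only).
def memory_split_gb(gpu_mem_gb: float, sglang_fraction: float) -> tuple:
    sglang = gpu_mem_gb * sglang_fraction  # static pool reserved for SGLang
    training = gpu_mem_gb - sglang         # what remains for Megatron training
    return sglang, training

print(memory_split_gb(80, 0.8))  # (64.0, 16.0) -> training side easily OOMs
print(memory_split_gb(80, 0.4))  # (32.0, 48.0) -> room for optimizer states
```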
Data Loading Issues
Issue: Slow Data Loading
Symptoms: GPU idle during data fetch, low GPU utilization
Solutions:
- Increase data workers:
--num-data-workers 4
- Use streaming dataset:
--streaming-data
- Pre-tokenize data:
# Pre-process data offline (sketch: adjust "model_path" and field names to your data)
import json
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("model_path")
with open("data.jsonl") as fin, open("data.tokenized.jsonl", "w") as fout:
    for line in fin:
        sample = json.loads(line)
        fout.write(json.dumps({**sample, "input_ids": tokenizer(sample["prompt"])["input_ids"]}) + "\n")
Issue: Data Format Errors
Symptoms: KeyError, missing fields, parsing failures
Solutions:
- Verify data format:
import json
with open("data.jsonl") as f:
    for line in f:
        data = json.loads(line)
        assert "prompt" in data, "Missing prompt field"
        assert "label" in data, "Missing label field"
- Check key names:
--input-key prompt # Must match your data
--label-key label # Must match your data
Training Stability Issues
Issue: Loss Explosion / NaN
Symptoms: Loss becomes NaN or explodes
Solutions:
- Reduce learning rate:
--lr 1e-6 # Reduce from 5e-6
- Enable gradient clipping:
--clip-grad 1.0
- Check for data issues:
# Verify no empty prompts or responses
for sample in dataset:
    assert len(sample["prompt"]) > 0
- Use BF16 instead of FP16:
--bf16 # More numerically stable
Issue: Reward Collapse
Symptoms: Reward drops to zero, model outputs garbage
Solutions:
- Increase KL penalty:
--kl-loss-coef 0.01 # Increase from 0.001
- Reduce number of samples:
--n-samples-per-prompt 4 # Reduce from 8
- Verify reward function:
# Test reward function independently
from custom_rm import reward_func
sample = Sample(prompt="test", response="test response")
reward = reward_func(args, sample)
print(f"Reward: {reward}") # Should be reasonable
Async Training Issues
Issue: Async Training Not Supported with Colocate
Symptoms: Error when using --colocate with train_async.py
Solution: Colocated mode is NOT supported for async training. Use separate GPUs:
# Remove --colocate flag
python train_async.py \
--actor-num-gpus-per-node 4 \
--rollout-num-gpus 4 \
# No --colocate
Issue: Stale Weights in Async Mode
Symptoms: Policy divergence, inconsistent behavior
Solutions:
- Reduce async buffer size:
--async-buffer-size 2 # Reduce from 4
- Increase weight update frequency:
--update-weights-interval 1 # Sync every rollout
Multi-Turn Training Issues
Issue: Tool Responses Included in Loss
Symptoms: Model learns to output tool responses verbatim
Solution: Properly set loss mask in custom generate function:
def build_loss_mask(sample):
    """Create a loss mask that excludes tool responses."""
    mask = []
    for token in sample.tokens:
        if is_tool_response(token, sample.metadata):
            mask.append(0)  # Tool output: don't compute loss
        else:
            mask.append(1)  # Model output: compute loss
    return mask
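The idea can be shown self-contained with a toy version in which each token carries a role tag; the tagging scheme here is illustrative, not slime's actual data format:

```python
# Toy loss-mask builder: tokens tagged "tool" are excluded from the loss.
# The role-tag scheme is an assumption for illustration only.
def build_loss_mask(token_roles):
    return [0 if role == "tool" else 1 for role in token_roles]

roles = ["user", "assistant", "assistant", "tool", "tool", "assistant"]
print(build_loss_mask(roles))  # [1, 1, 1, 0, 0, 1]
```

With the mask zeroed over tool spans, the model is never rewarded for parroting tool output verbatim.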
Issue: Multi-Turn Context Too Long
Symptoms: OOM or truncation in multi-turn conversations
Solutions:
- Limit conversation history:
# In custom generate function
conversation = sample.prompt[-10:] # Keep last 10 turns
- Increase context length:
--sglang-context-length 16384
Checkpoint Issues
Issue: Checkpoint Loading Fails
Symptoms: Cannot load saved checkpoint
Solutions:
- Verify checkpoint path:
ls -la /path/to/checkpoint/
- Check parallelism matches:
# Checkpoint was saved with TP=2, must load with TP=2
--tensor-model-parallel-size 2
- Convert HuggingFace to Megatron (if needed):
python tools/convert_hf_to_megatron.py \
--hf_model_path /path/to/hf/model \
--save_path /path/to/megatron/checkpoint
Debugging Tips
Enable Verbose Logging
--log-level DEBUG
export SLIME_DEBUG=1
Check GPU Utilization
watch -n 1 nvidia-smi
Monitor Training
tensorboard --logdir outputs/
Test Custom Functions Independently
# Test reward function
import asyncio
from custom_rm import reward_func
async def test():
    sample = Sample(prompt="test", response="test", label="expected")
    reward = await reward_func(args, sample)
    print(f"Reward: {reward}")
asyncio.run(test())
Constraint Reference
Key constraint to remember:
rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout
Example: 32 × 8 = 256 × 1
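It is cheap to verify this constraint before launching a run. A minimal sketch (the helper name is illustrative, not part of slime):

```python
# Sanity-check the rollout/global batch constraint before launching training.
def check_batch_constraint(rollout_batch_size, n_samples_per_prompt,
                           global_batch_size, num_steps_per_rollout):
    generated = rollout_batch_size * n_samples_per_prompt   # samples per rollout
    consumed = global_batch_size * num_steps_per_rollout    # samples per training phase
    assert generated == consumed, f"{generated} samples generated but {consumed} consumed"
    return generated

print(check_batch_constraint(32, 8, 256, 1))  # 256: the example above checks out
```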
Resources
- GitHub Issues: https://github.com/THUDM/slime/issues
- Documentation: https://thudm.github.io/slime/
- Examples: examples/ directory in the slime repository