# slime Troubleshooting Guide

## Common Issues and Solutions

### SGLang Issues

#### Issue: SGLang Engine Crash

**Symptoms:** The inference engine dies mid-training; connection errors.

**Solutions:**

- Enable fault tolerance: `--use-fault-tolerance`
- Increase the memory allocation: `--sglang-mem-fraction-static 0.85` (up from 0.8)
- Reduce the rollout batch size: `--rollout-batch-size 16` (down from 32)
- Disable CUDA graphs (for debugging): `--sglang-disable-cuda-graph`
#### Issue: SGLang Router Load Imbalance

**Symptoms:** Some SGLang engines are overloaded while others sit idle.

**Solutions:**

- Adjust the routing strategy: `--sglang-router-strategy round_robin`
- Increase the number of engines: `--rollout-num-gpus-per-engine 1` (more engines, fewer GPUs each)
### Weight Synchronization Issues

#### Issue: Weight Sync Timeout

**Symptoms:** Training hangs after rollout; timeout errors.

**Solutions:**

- Increase the sync interval (async mode): `--update-weights-interval 5` (up from 2)
- Use colocated mode, which eliminates the network transfer: `--colocate`
- Check network bandwidth:

  ```bash
  # Verify InfiniBand is enabled
  ibstat
  ```
#### Issue: Weight Sync Failures in Multi-Node Setups

**Symptoms:** Nodes fail to receive updated weights.

**Solutions:**

- Set the NCCL environment:

  ```bash
  export NCCL_DEBUG=INFO
  export NCCL_SOCKET_IFNAME=eth0
  export NCCL_IB_DISABLE=0
  ```

- Increase the timeout:

  ```bash
  export NCCL_TIMEOUT=1800
  ```
### Memory Issues

#### Issue: OOM During Training

**Symptoms:** CUDA OOM in the backward pass.

**Solutions:**

- Enable gradient checkpointing: `--recompute-activations`
- Reduce the micro-batch size: `--micro-batch-size 1`
- Enable sequence parallelism: `--sequence-parallel`
- Reduce the global batch size: `--global-batch-size 128` (down from 256)
#### Issue: OOM in Colocated Mode

**Symptoms:** OOM when training and inference share the same GPUs.

**Solutions:**

- Reduce SGLang's memory share: `--sglang-mem-fraction-static 0.4` (down from 0.8)
- Enable offloading: `--offload-optimizer-states`
- Use a smaller sequence length: `--seq-length 2048` (down from 4096)
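As a back-of-the-envelope check (an illustrative sketch only — the numbers and the 2 GiB headroom are assumptions, not slime defaults), the static SGLang fraction plus the training process's peak usage must fit on one GPU:

```python
# Rough budget check for colocated mode: how large can
# --sglang-mem-fraction-static be, given the training side's peak usage?
# All figures here are illustrative assumptions, not slime defaults.

def max_sglang_fraction(gpu_mem_gib: float,
                        train_peak_gib: float,
                        headroom_gib: float = 2.0) -> float:
    """Largest memory fraction for SGLang that still leaves room
    for training plus some headroom (fragmentation, CUDA context)."""
    free = gpu_mem_gib - train_peak_gib - headroom_gib
    return max(0.0, round(free / gpu_mem_gib, 2))

# Example: 80 GiB GPU, training peaks at 42 GiB.
print(max_sglang_fraction(80.0, 42.0))  # 0.45
```

If the computed fraction is well below what you pass to `--sglang-mem-fraction-static`, OOM in colocated mode is expected.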
### Data Loading Issues

#### Issue: Slow Data Loading

**Symptoms:** GPUs idle during data fetch; low GPU utilization.

**Solutions:**

- Increase the number of data workers: `--num-data-workers 4`
- Use a streaming dataset: `--streaming-data`
- Pre-tokenize the data offline:

  ```python
  import json
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("model_path")
  # Tokenize each sample offline and save the result alongside the raw text
  with open("data.jsonl") as fin, open("tokenized.jsonl", "w") as fout:
      for line in fin:
          sample = json.loads(line)
          sample["input_ids"] = tokenizer(sample["prompt"])["input_ids"]
          fout.write(json.dumps(sample) + "\n")
  ```
#### Issue: Data Format Errors

**Symptoms:** `KeyError`, missing fields, parsing failures.

**Solutions:**

- Verify the data format:

  ```python
  import json

  with open("data.jsonl") as f:
      for line in f:
          data = json.loads(line)
          assert "prompt" in data, "Missing prompt field"
          assert "label" in data, "Missing label field"
  ```

- Check that the key names match your data:

  ```bash
  --input-key prompt  # Must match your data
  --label-key label   # Must match your data
  ```
### Training Stability Issues

#### Issue: Loss Explosion / NaN

**Symptoms:** The loss becomes NaN or explodes.

**Solutions:**

- Reduce the learning rate: `--lr 1e-6` (down from 5e-6)
- Enable gradient clipping: `--clip-grad 1.0`
- Check for data issues:

  ```python
  # Verify no empty prompts or responses
  for sample in dataset:
      assert len(sample["prompt"]) > 0
      assert len(sample["response"]) > 0
  ```

- Use BF16 instead of FP16: `--bf16` (more numerically stable)
#### Issue: Reward Collapse

**Symptoms:** The reward drops to zero; the model outputs garbage.

**Solutions:**

- Increase the KL penalty: `--kl-loss-coef 0.01` (up from 0.001)
- Reduce the number of samples per prompt: `--n-samples-per-prompt 4` (down from 8)
- Verify the reward function:

  ```python
  # Test the reward function independently
  from custom_rm import reward_func

  sample = Sample(prompt="test", response="test response")
  reward = reward_func(args, sample)
  print(f"Reward: {reward}")  # Should be a reasonable value
  ```
### Async Training Issues

#### Issue: Async Training Not Supported with Colocate

**Symptoms:** Error when using `--colocate` with `train_async.py`.

**Solution:** Colocated mode is NOT supported for async training. Use separate GPUs:

```bash
# Remove the --colocate flag
python train_async.py \
    --actor-num-gpus-per-node 4 \
    --rollout-num-gpus 4
```
#### Issue: Stale Weights in Async Mode

**Symptoms:** Policy divergence; inconsistent behavior.

**Solutions:**

- Reduce the async buffer size: `--async-buffer-size 2` (down from 4)
- Increase the weight update frequency: `--update-weights-interval 1` (sync every rollout)
### Multi-Turn Training Issues

#### Issue: Tool Responses Included in Loss

**Symptoms:** The model learns to output tool responses verbatim.

**Solution:** Set the loss mask properly in your custom generate function:

```python
def build_loss_mask(sample):
    """Create a loss mask that excludes tool responses."""
    mask = []
    for token in sample.tokens:
        if is_tool_response(token, sample.metadata):
            mask.append(0)  # Don't compute loss on tool output
        else:
            mask.append(1)  # Compute loss on model output
    return mask
```
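A minimal, self-contained sketch of the same idea. Here the per-token role tags and the `(role)` representation are assumptions for illustration, not slime's actual sample format: train only on tokens the assistant produced, never on tool output.

```python
# Hypothetical representation: one role tag per token position.
# Only assistant tokens contribute to the loss.

def build_loss_mask(token_roles):
    """Return a 0/1 mask over token positions: 1 for assistant
    tokens (compute loss), 0 for tool/user/system tokens (skip)."""
    return [1 if role == "assistant" else 0 for role in token_roles]

roles = ["user", "assistant", "assistant", "tool", "tool", "assistant"]
print(build_loss_mask(roles))  # [0, 1, 1, 0, 0, 1]
```

The key invariant is that the mask has exactly one entry per token, so it can be multiplied element-wise against the per-token loss.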
#### Issue: Multi-Turn Context Too Long

**Symptoms:** OOM or truncation in multi-turn conversations.

**Solutions:**

- Limit the conversation history:

  ```python
  # In your custom generate function
  conversation = sample.prompt[-10:]  # Keep the last 10 turns
  ```

- Increase the context length: `--sglang-context-length 16384`
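If a plain tail slice would drop the system prompt, a hedged sketch of history truncation that always preserves it (the message-dict format is an assumption, not slime's API):

```python
def truncate_history(messages, max_turns=10):
    """Keep any system messages plus the last `max_turns`
    non-system messages, preserving order."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]

convo = [{"role": "system", "content": "be brief"}] + [
    {"role": "user", "content": f"turn {i}"} for i in range(20)
]
print(len(truncate_history(convo)))  # 11: system + last 10 turns
```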
### Checkpoint Issues

#### Issue: Checkpoint Loading Fails

**Symptoms:** A saved checkpoint cannot be loaded.

**Solutions:**

- Verify the checkpoint path:

  ```bash
  ls -la /path/to/checkpoint/
  ```

- Check that the parallelism settings match:

  ```bash
  # A checkpoint saved with TP=2 must be loaded with TP=2
  --tensor-model-parallel-size 2
  ```

- Convert HuggingFace to Megatron format (if needed):

  ```bash
  python tools/convert_hf_to_megatron.py \
      --hf_model_path /path/to/hf/model \
      --save_path /path/to/megatron/checkpoint
  ```
## Debugging Tips

### Enable Verbose Logging

```bash
--log-level DEBUG
export SLIME_DEBUG=1
```

### Check GPU Utilization

```bash
watch -n 1 nvidia-smi
```

### Monitor Training

```bash
tensorboard --logdir outputs/
```

### Test Custom Functions Independently

```python
# Test an async reward function outside the training loop
import asyncio

from custom_rm import reward_func

async def test():
    sample = Sample(prompt="test", response="test", label="expected")
    reward = await reward_func(args, sample)
    print(f"Reward: {reward}")

asyncio.run(test())
```
## Constraint Reference

Key constraint to remember:

```
rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout
```

Example: 32 × 8 = 256 × 1
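The constraint can be checked before launch; a small sketch (the function is illustrative, its arguments simply mirror the CLI flag names):

```python
def check_batch_constraint(rollout_batch_size, n_samples_per_prompt,
                           global_batch_size, num_steps_per_rollout):
    """Assert rollout_batch_size * n_samples_per_prompt
    == global_batch_size * num_steps_per_rollout; return the total."""
    lhs = rollout_batch_size * n_samples_per_prompt
    rhs = global_batch_size * num_steps_per_rollout
    assert lhs == rhs, f"batch constraint violated: {lhs} != {rhs}"
    return lhs

print(check_batch_constraint(32, 8, 256, 1))  # 256
```

Running this with your launch values before submitting a job catches the mismatch immediately instead of at startup.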
## Resources

- GitHub Issues: https://github.com/THUDM/slime/issues
- Documentation: https://thudm.github.io/slime/
- Examples: the `examples/` directory