# slime Troubleshooting Guide

## Common Issues and Solutions

### SGLang Issues
#### Issue: SGLang Engine Crash

**Symptoms:** Inference engine dies mid-training; connection errors.

**Solutions:**

- Enable fault tolerance: `--use-fault-tolerance`
- Increase memory allocation: `--sglang-mem-fraction-static 0.85` (increase from 0.8)
- Reduce the batch size: `--rollout-batch-size 16` (reduce from 32)
- Disable CUDA graphs (for debugging): `--sglang-disable-cuda-graph`
#### Issue: SGLang Router Load Imbalance

**Symptoms:** Some SGLang engines are overloaded while others sit idle.

**Solutions:**

- Adjust the routing strategy: `--sglang-router-strategy round_robin`
- Run more engines with fewer GPUs each: `--rollout-num-gpus-per-engine 1`
### Weight Synchronization Issues

#### Issue: Weight Sync Timeout

**Symptoms:** Training hangs after rollout; timeout errors.

**Solutions:**

- Increase the sync interval (async mode): `--update-weights-interval 5` (increase from 2)
- Use colocated mode, which eliminates the network transfer: `--colocate`
- Check network bandwidth:

  ```bash
  # Verify InfiniBand is enabled
  ibstat
  ```
#### Issue: Weight Sync Failures in Multi-Node

**Symptoms:** Nodes fail to receive updated weights.

**Solutions:**

- Set the NCCL environment:

  ```bash
  export NCCL_DEBUG=INFO
  export NCCL_SOCKET_IFNAME=eth0   # use the interface name on your nodes
  export NCCL_IB_DISABLE=0
  ```

- Increase the timeout:

  ```bash
  export NCCL_TIMEOUT=1800
  ```
### Memory Issues

#### Issue: OOM During Training

**Symptoms:** CUDA OOM in the backward pass.

**Solutions:**

- Enable gradient checkpointing: `--recompute-activations`
- Reduce the micro-batch size: `--micro-batch-size 1`
- Enable sequence parallelism: `--sequence-parallel`
- Reduce the global batch size: `--global-batch-size 128` (reduce from 256)
#### Issue: OOM in Colocated Mode

**Symptoms:** OOM when training and inference run on the same GPUs.

**Solutions:**

- Reduce SGLang's memory share: `--sglang-mem-fraction-static 0.4` (reduce from 0.8)
- Enable offloading: `--offload-optimizer-states`
- Use a smaller sequence length: `--seq-length 2048` (reduce from 4096)
### Data Loading Issues

#### Issue: Slow Data Loading

**Symptoms:** GPUs idle during data fetch; low GPU utilization.

**Solutions:**

- Increase the number of data workers: `--num-data-workers 4`
- Use a streaming dataset: `--streaming-data`
- Pre-tokenize the data offline so tokenization is off the training critical path (output filename here is illustrative):

  ```python
  import json
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("model_path")
  # Tokenize once offline and save the token ids alongside each sample
  with open("data.jsonl") as fin, open("data.tokenized.jsonl", "w") as fout:
      for line in fin:
          sample = json.loads(line)
          sample["input_ids"] = tokenizer(sample["prompt"])["input_ids"]
          fout.write(json.dumps(sample) + "\n")
  ```
#### Issue: Data Format Errors

**Symptoms:** KeyError, missing fields, parsing failures.

**Solutions:**

- Verify the data format:

  ```python
  import json

  with open("data.jsonl") as f:
      for line in f:
          data = json.loads(line)
          assert "prompt" in data, "Missing prompt field"
          assert "label" in data, "Missing label field"
  ```

- Check that the key names match your data: `--input-key prompt`, `--label-key label`
### Training Stability Issues

#### Issue: Loss Explosion / NaN

**Symptoms:** Loss becomes NaN or explodes.

**Solutions:**

- Reduce the learning rate: `--lr 1e-6` (reduce from 5e-6)
- Enable gradient clipping: `--clip-grad 1.0`
- Check for data issues:

  ```python
  # Verify there are no empty prompts or responses
  for sample in dataset:
      assert len(sample["prompt"]) > 0
  ```

- Use BF16 instead of FP16, which is more numerically stable: `--bf16`
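When the loss first goes non-finite, it helps to pinpoint which parameter's gradient broke. A minimal framework-agnostic sketch; `named_gradients` is a hypothetical stand-in for iterating your model's parameter gradients, not a slime API:

```python
import math

def find_nonfinite(named_gradients):
    """Return the names whose gradient values contain NaN or Inf.

    named_gradients: iterable of (name, list-of-floats) pairs -- a
    stand-in for walking your model's named parameter gradients.
    """
    return [
        name
        for name, values in named_gradients
        if any(not math.isfinite(v) for v in values)
    ]
```

Running a check like this right after the first bad loss value usually narrows the problem to one layer, which in turn hints at whether the cause is data, precision, or learning rate.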
#### Issue: Reward Collapse

**Symptoms:** Reward drops to zero; model outputs garbage.

**Solutions:**

- Increase the KL penalty: `--kl-loss-coef 0.01` (increase from 0.001)
- Reduce the number of samples per prompt: `--n-samples-per-prompt 4` (reduce from 8)
- Verify the reward function:

  ```python
  # Test the reward function independently of training
  from custom_rm import reward_func

  sample = Sample(prompt="test", response="test response")
  reward = reward_func(args, sample)
  print(f"Reward: {reward}")  # Should be a reasonable value
  ```
### Async Training Issues

#### Issue: Async Training Not Supported with Colocate

**Symptoms:** Error when using `--colocate` with `train_async.py`.

**Solution:** Colocated mode is NOT supported for async training. Remove the `--colocate` flag and use separate GPUs:

```bash
python train_async.py \
    --actor-num-gpus-per-node 4 \
    --rollout-num-gpus 4
```
#### Issue: Stale Weights in Async Mode

**Symptoms:** Policy divergence; inconsistent behavior.

**Solutions:**

- Reduce the async buffer size: `--async-buffer-size 2` (reduce from 4)
- Increase the weight-update frequency: `--update-weights-interval 1` (sync every rollout)
### Multi-Turn Training Issues

#### Issue: Tool Responses Included in Loss

**Symptoms:** Model learns to output tool responses verbatim.

**Solution:** Properly set the loss mask in your custom generate function:

```python
def build_loss_mask(sample):
    """Create a per-token loss mask that excludes tool responses."""
    mask = []
    for token in sample.tokens:
        # is_tool_response is a user-provided helper that marks tokens
        # produced by tool calls rather than by the model
        if is_tool_response(token, sample.metadata):
            mask.append(0)  # tool output: don't compute loss
        else:
            mask.append(1)  # model output: compute loss
    return mask
```
#### Issue: Multi-Turn Context Too Long

**Symptoms:** OOM or truncation in multi-turn conversations.

**Solutions:**

- Limit the conversation history:

  ```python
  # In the custom generate function
  conversation = sample.prompt[-10:]  # keep the last 10 turns
  ```

- Increase the context length: `--sglang-context-length 16384`
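Truncating by a fixed turn count is crude; keeping as many recent turns as fit a token budget is usually safer. A minimal sketch, not a slime API: `count_tokens` is a hypothetical tokenizer call, approximated here by word count:

```python
def truncate_to_budget(turns, max_tokens, count_tokens=lambda t: len(t.split())):
    """Keep the most recent turns whose combined token count fits max_tokens."""
    kept, total = [], 0
    for turn in reversed(turns):      # walk from newest to oldest
        cost = count_tokens(turn)
        if total + cost > max_tokens:
            break                     # older turns are dropped wholesale
        kept.append(turn)
        total += cost
    return list(reversed(kept))       # restore chronological order
```

In practice you would pass your real tokenizer's length function as `count_tokens` and leave headroom for the response.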
### Checkpoint Issues

#### Issue: Checkpoint Loading Fails

**Symptoms:** Cannot load a saved checkpoint.

**Solutions:**

- Verify the checkpoint path:

  ```bash
  ls -la /path/to/checkpoint/
  ```

- Check that the parallelism matches; a checkpoint saved with TP=2 must be loaded with TP=2: `--tensor-model-parallel-size 2`
- Convert HuggingFace to Megatron format (if needed):

  ```bash
  python tools/convert_hf_to_megatron.py \
      --hf_model_path /path/to/hf/model \
      --save_path /path/to/megatron/checkpoint
  ```
## Debugging Tips

### Enable Verbose Logging

```bash
--log-level DEBUG
export SLIME_DEBUG=1
```

### Check GPU Utilization

```bash
watch -n 1 nvidia-smi
```

### Monitor Training

```bash
tensorboard --logdir outputs/
```

### Test Custom Functions Independently

```python
# Test the reward function outside the training loop
import asyncio
from custom_rm import reward_func

async def test():
    sample = Sample(prompt="test", response="test", label="expected")
    reward = await reward_func(args, sample)
    print(f"Reward: {reward}")

asyncio.run(test())
```
## Constraint Reference

Key constraint to remember:

```
rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout
```

Example: 32 × 8 = 256 × 1
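A quick arithmetic check of this constraint can catch a bad configuration before launch. A minimal sketch (the function name is illustrative, not part of slime):

```python
def steps_per_rollout(rollout_batch_size, n_samples_per_prompt, global_batch_size):
    """Solve the constraint for num_steps_per_rollout; raise if it has no integer solution."""
    total = rollout_batch_size * n_samples_per_prompt
    steps, rem = divmod(total, global_batch_size)
    if rem or steps == 0:
        raise ValueError(
            f"{rollout_batch_size} x {n_samples_per_prompt} = {total} "
            f"is not a positive multiple of global_batch_size={global_batch_size}"
        )
    return steps
```

For the example above, `steps_per_rollout(32, 8, 256)` returns 1, matching 32 × 8 = 256 × 1.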
## Resources

- GitHub Issues: https://github.com/THUDM/slime/issues
- Documentation: https://thudm.github.io/slime/
- Examples: the `examples/` directory in the repository