# slime Troubleshooting Guide

## Common Issues and Solutions

### SGLang Issues

#### Issue: SGLang Engine Crash

**Symptoms**: Inference engine dies mid-training, connection errors

**Solutions**:

1. **Enable fault tolerance**:

   ```bash
   --use-fault-tolerance
   ```

2. **Increase memory allocation**:

   ```bash
   --sglang-mem-fraction-static 0.85 # Increase from 0.8
   ```

3. **Reduce batch size**:

   ```bash
   --rollout-batch-size 16 # Reduce from 32
   ```

4. **Disable CUDA graphs** (for debugging):

   ```bash
   --sglang-disable-cuda-graph
   ```

#### Issue: SGLang Router Load Imbalance

**Symptoms**: Some SGLang engines overloaded while others idle

**Solutions**:

1. **Adjust routing strategy**:

   ```bash
   --sglang-router-strategy round_robin
   ```

2. **Increase number of engines**:

   ```bash
   --rollout-num-gpus-per-engine 1 # More engines, fewer GPUs each
   ```

### Weight Synchronization Issues

#### Issue: Weight Sync Timeout

**Symptoms**: Training hangs after rollout, timeout errors

**Solutions**:

1. **Increase sync interval** (async mode):

   ```bash
   --update-weights-interval 5 # Increase from 2
   ```

2. **Use colocated mode** (eliminates network transfer):

   ```bash
   --colocate
   ```

3. **Check network bandwidth**:

   ```bash
   # Verify InfiniBand is enabled
   ibstat
   ```

#### Issue: Weight Sync Failures in Multi-Node

**Symptoms**: Nodes fail to receive updated weights

**Solutions**:

1. **Set NCCL environment**:

   ```bash
   export NCCL_DEBUG=INFO
   export NCCL_SOCKET_IFNAME=eth0
   export NCCL_IB_DISABLE=0
   ```

2. **Increase timeout**:

   ```bash
   export NCCL_TIMEOUT=1800
   ```

### Memory Issues

#### Issue: OOM During Training

**Symptoms**: CUDA OOM in backward pass

**Solutions**:

1. **Enable gradient checkpointing**:

   ```bash
   --recompute-activations
   ```

2. **Reduce micro-batch size**:

   ```bash
   --micro-batch-size 1
   ```

3. **Enable sequence parallelism**:

   ```bash
   --sequence-parallel
   ```

4. **Reduce global batch size**:

   ```bash
   --global-batch-size 128 # Reduce from 256
   ```

#### Issue: OOM in Colocated Mode

**Symptoms**: OOM when both training and inference run on the same GPUs

**Solutions**:

1. **Reduce SGLang memory**:

   ```bash
   --sglang-mem-fraction-static 0.4 # Reduce from 0.8
   ```

2. **Enable offloading**:

   ```bash
   --offload-optimizer-states
   ```

3. **Use a smaller sequence length**:

   ```bash
   --seq-length 2048 # Reduce from 4096
   ```

### Data Loading Issues

#### Issue: Slow Data Loading

**Symptoms**: GPU idle during data fetch, low GPU utilization

**Solutions**:

1. **Increase data workers**:

   ```bash
   --num-data-workers 4
   ```

2. **Use streaming dataset**:

   ```bash
   --streaming-data
   ```

3. **Pre-tokenize data** (offline, so workers skip tokenization at training time; the file names here are illustrative):

   ```python
   import json
   from transformers import AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained("model_path")
   with open("data.jsonl") as fin, open("data.tokenized.jsonl", "w") as fout:
       for line in fin:
           record = json.loads(line)
           record["input_ids"] = tokenizer(record["prompt"])["input_ids"]
           fout.write(json.dumps(record) + "\n")
   ```

#### Issue: Data Format Errors

**Symptoms**: KeyError, missing fields, parsing failures

**Solutions**:

1. **Verify data format**:

   ```python
   import json

   with open("data.jsonl") as f:
       for line in f:
           data = json.loads(line)
           assert "prompt" in data, "Missing prompt field"
           assert "label" in data, "Missing label field"
   ```

2. **Check key names**:

   ```bash
   --input-key prompt # Must match your data
   --label-key label # Must match your data
   ```

### Training Stability Issues

#### Issue: Loss Explosion / NaN

**Symptoms**: Loss becomes NaN or explodes

**Solutions**:

1. **Reduce learning rate**:

   ```bash
   --lr 1e-6 # Reduce from 5e-6
   ```

2. **Enable gradient clipping**:

   ```bash
   --clip-grad 1.0
   ```

3. **Check for data issues**:

   ```python
   # Verify no empty prompts or responses
   for sample in dataset:
       assert len(sample["prompt"]) > 0
   ```

4. **Use BF16 instead of FP16**:

   ```bash
   --bf16 # More numerically stable
   ```

#### Issue: Reward Collapse

**Symptoms**: Reward drops to zero, model outputs garbage

**Solutions**:

1. **Increase KL penalty**:

   ```bash
   --kl-loss-coef 0.01 # Increase from 0.001
   ```

2. **Reduce number of samples**:

   ```bash
   --n-samples-per-prompt 4 # Reduce from 8
   ```

3. **Verify reward function**:

   ```python
   # Test the reward function independently; Sample and args come from
   # your training setup (await the call if reward_func is async)
   from custom_rm import reward_func

   sample = Sample(prompt="test", response="test response")
   reward = reward_func(args, sample)
   print(f"Reward: {reward}") # Should be reasonable
   ```

### Async Training Issues

#### Issue: Async Training Not Supported with Colocate

**Symptoms**: Error when using `--colocate` with `train_async.py`

**Solution**: Colocated mode is NOT supported for async training. Use separate GPUs:

```bash
# Remove the --colocate flag and give training and rollout their own GPUs
python train_async.py \
    --actor-num-gpus-per-node 4 \
    --rollout-num-gpus 4
```

#### Issue: Stale Weights in Async Mode

**Symptoms**: Policy divergence, inconsistent behavior

**Solutions**:

1. **Reduce async buffer size**:

   ```bash
   --async-buffer-size 2 # Reduce from 4
   ```

2. **Increase weight update frequency**:

   ```bash
   --update-weights-interval 1 # Sync every rollout
   ```

### Multi-Turn Training Issues

#### Issue: Tool Responses Included in Loss

**Symptoms**: Model learns to output tool responses verbatim

**Solution**: Properly set the loss mask in your custom generate function:

```python
def build_loss_mask(sample):
    """Create a loss mask that excludes tool responses."""
    # is_tool_response is your own helper that flags tokens
    # originating from tool output
    mask = []
    for i, token in enumerate(sample.tokens):
        if is_tool_response(token, sample.metadata):
            mask.append(0) # Don't compute loss
        else:
            mask.append(1) # Compute loss
    return mask
```
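The same idea can be exercised as a self-contained sketch. Here `Sample`, `is_tool_response`, and the `tool_spans` metadata layout are stand-ins invented for illustration, not slime's actual types:

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    # Stand-in for the training sample type: token ids plus metadata
    # marking which index ranges came from tool output (hypothetical layout)
    tokens: list
    metadata: dict = field(default_factory=dict)

def is_tool_response(index, metadata):
    """Return True if the token at `index` falls inside a tool-output span."""
    return any(start <= index < end for start, end in metadata.get("tool_spans", []))

def build_loss_mask(sample):
    """0 for tool-response tokens (no loss), 1 for model-generated tokens."""
    return [0 if is_tool_response(i, sample.metadata) else 1
            for i in range(len(sample.tokens))]

# Tokens at indices 2-3 are tool output and should be excluded from the loss
sample = Sample(tokens=[101, 7592, 2000, 2001, 102],
                metadata={"tool_spans": [(2, 4)]})
print(build_loss_mask(sample))  # [1, 1, 0, 0, 1]
```

Running this mask-builder on a hand-made sample like the one above is a quick way to catch off-by-one errors in span boundaries before a full training run.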

#### Issue: Multi-Turn Context Too Long

**Symptoms**: OOM or truncation in multi-turn conversations

**Solutions**:

1. **Limit conversation history**:

   ```python
   # In the custom generate function
   conversation = sample.prompt[-10:] # Keep the last 10 turns
   ```

2. **Increase context length**:

   ```bash
   --sglang-context-length 16384
   ```

### Checkpoint Issues

#### Issue: Checkpoint Loading Fails

**Symptoms**: Cannot load a saved checkpoint

**Solutions**:

1. **Verify the checkpoint path**:

   ```bash
   ls -la /path/to/checkpoint/
   ```

2. **Check that parallelism matches**:

   ```bash
   # A checkpoint saved with TP=2 must be loaded with TP=2
   --tensor-model-parallel-size 2
   ```

3. **Convert HuggingFace to Megatron** (if needed):

   ```bash
   python tools/convert_hf_to_megatron.py \
       --hf_model_path /path/to/hf/model \
       --save_path /path/to/megatron/checkpoint
   ```

### Debugging Tips

#### Enable Verbose Logging

```bash
--log-level DEBUG
export SLIME_DEBUG=1
```

#### Check GPU Utilization

```bash
watch -n 1 nvidia-smi
```

#### Monitor Training

```bash
tensorboard --logdir outputs/
```

#### Test Custom Functions Independently

```python
# Test an async reward function end to end; Sample and args come from
# your training setup
import asyncio

from custom_rm import reward_func


async def test():
    sample = Sample(prompt="test", response="test", label="expected")
    reward = await reward_func(args, sample)
    print(f"Reward: {reward}")


asyncio.run(test())
```

## Constraint Reference

Key constraint to remember:

```
rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout
```

Example: `32 × 8 = 256 × 1`
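A quick way to sanity-check a configuration before launching is to verify the identity in plain Python. The function below is a hypothetical helper, not part of slime; its parameter names simply mirror the CLI flags:

```python
def check_rollout_constraint(rollout_batch_size, n_samples_per_prompt,
                             global_batch_size, num_steps_per_rollout):
    """Assert rollout_batch_size * n_samples_per_prompt
    == global_batch_size * num_steps_per_rollout."""
    produced = rollout_batch_size * n_samples_per_prompt   # samples per rollout
    consumed = global_batch_size * num_steps_per_rollout   # samples trained on
    assert produced == consumed, (
        f"Constraint violated: rollout produces {produced} samples "
        f"but training consumes {consumed}"
    )

check_rollout_constraint(32, 8, 256, 1)  # 32 × 8 = 256 × 1, passes
```

For instance, halving `--global-batch-size` to 128 while keeping `--rollout-batch-size 32` and `--n-samples-per-prompt 8` requires two training steps per rollout to keep the identity balanced.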

## Resources

- GitHub Issues: https://github.com/THUDM/slime/issues
- Documentation: https://thudm.github.io/slime/
- Examples: `examples/` directory