# slime Troubleshooting Guide

## Common Issues and Solutions

### SGLang Issues

#### Issue: SGLang Engine Crash

**Symptoms:** Inference engine dies mid-training, connection errors

**Solutions:**

1. Enable fault tolerance:

   ```bash
   --use-fault-tolerance
   ```

2. Increase memory allocation:

   ```bash
   --sglang-mem-fraction-static 0.85  # Increase from 0.8
   ```

3. Reduce batch size:

   ```bash
   --rollout-batch-size 16  # Reduce from 32
   ```

4. Disable CUDA graphs (for debugging):

   ```bash
   --sglang-disable-cuda-graph
   ```
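If the engine still drops connections intermittently, wrapping rollout requests in a small retry loop can keep transient engine restarts from killing the run. This is a generic sketch, not a slime API; `flaky_request` is a hypothetical stand-in for whatever actually issues the request:

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0):
    """Retry a flaky call with exponential backoff.

    `fn` stands in for the call that talks to the SGLang engine;
    this wrapper is illustrative, not part of slime.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt)

# Simulate an engine that fails twice while restarting, then recovers
calls = {"n": 0}
def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("engine restarting")
    return "ok"

print(with_retries(flaky_request, base_delay=0.01))  # "ok" after two retries
```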

#### Issue: SGLang Router Load Imbalance

**Symptoms:** Some SGLang engines are overloaded while others sit idle

**Solutions:**

1. Adjust the routing strategy:

   ```bash
   --sglang-router-strategy round_robin
   ```

2. Run more engines with fewer GPUs each:

   ```bash
   --rollout-num-gpus-per-engine 1  # More engines, fewer GPUs each
   ```

### Weight Synchronization Issues

#### Issue: Weight Sync Timeout

**Symptoms:** Training hangs after rollout, timeout errors

**Solutions:**

1. Increase the sync interval (async mode):

   ```bash
   --update-weights-interval 5  # Increase from 2
   ```

2. Use colocated mode (eliminates network transfer):

   ```bash
   --colocate
   ```

3. Check network bandwidth:

   ```bash
   # Verify InfiniBand is enabled
   ibstat
   ```

#### Issue: Weight Sync Failures in Multi-Node

**Symptoms:** Nodes fail to receive updated weights

**Solutions:**

1. Set the NCCL environment:

   ```bash
   export NCCL_DEBUG=INFO
   export NCCL_SOCKET_IFNAME=eth0
   export NCCL_IB_DISABLE=0
   ```

2. Increase the timeout:

   ```bash
   export NCCL_TIMEOUT=1800
   ```

### Memory Issues

#### Issue: OOM During Training

**Symptoms:** CUDA OOM in the backward pass

**Solutions:**

1. Enable gradient checkpointing:

   ```bash
   --recompute-activations
   ```

2. Reduce the micro-batch size:

   ```bash
   --micro-batch-size 1
   ```

3. Enable sequence parallelism:

   ```bash
   --sequence-parallel
   ```

4. Reduce the global batch size:

   ```bash
   --global-batch-size 128  # Reduce from 256
   ```
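To reason about how these knobs interact, it helps to keep the standard Megatron-style relation in mind: `global_batch_size = micro_batch_size × gradient_accumulation_steps × data_parallel_size`. A minimal sketch of that arithmetic (a hypothetical helper, not a slime API):

```python
def grad_accum_steps(global_batch_size, micro_batch_size, data_parallel_size):
    """Accumulation steps needed per optimizer step, assuming the
    Megatron-style relation global = micro x accum x data_parallel."""
    per_step = micro_batch_size * data_parallel_size
    if global_batch_size % per_step:
        raise ValueError("global batch size must be divisible by micro x DP")
    return global_batch_size // per_step

# Halving the global batch (256 -> 128) at micro-batch 1 on 8 DP ranks
# halves the accumulation steps; the micro-batch size is what bounds
# per-step activation memory.
print(grad_accum_steps(128, 1, 8))  # 16
print(grad_accum_steps(256, 1, 8))  # 32
```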

#### Issue: OOM in Colocated Mode

**Symptoms:** OOM when training and inference share the same GPUs

**Solutions:**

1. Reduce SGLang's memory share:

   ```bash
   --sglang-mem-fraction-static 0.4  # Reduce from 0.8
   ```

2. Enable offloading:

   ```bash
   --offload-optimizer-states
   ```

3. Use a smaller sequence length:

   ```bash
   --seq-length 2048  # Reduce from 4096
   ```

### Data Loading Issues

#### Issue: Slow Data Loading

**Symptoms:** GPU idle during data fetch, low GPU utilization

**Solutions:**

1. Increase the number of data workers:

   ```bash
   --num-data-workers 4
   ```

2. Use a streaming dataset:

   ```bash
   --streaming-data
   ```

3. Pre-tokenize the data offline:

   ```python
   # Pre-process data offline and cache the token IDs
   import json
   from transformers import AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained("model_path")
   with open("data.jsonl") as fin, open("data.tokenized.jsonl", "w") as fout:
       for line in fin:
           row = json.loads(line)
           row["input_ids"] = tokenizer(row["prompt"])["input_ids"]
           fout.write(json.dumps(row) + "\n")
   ```

#### Issue: Data Format Errors

**Symptoms:** KeyError, missing fields, parsing failures

**Solutions:**

1. Verify the data format:

   ```python
   import json

   with open("data.jsonl") as f:
       for line in f:
           data = json.loads(line)
           assert "prompt" in data, "Missing prompt field"
           assert "label" in data, "Missing label field"
   ```

2. Check the key names:

   ```bash
   --input-key prompt  # Must match your data
   --label-key label   # Must match your data
   ```

### Training Stability Issues

#### Issue: Loss Explosion / NaN

**Symptoms:** Loss becomes NaN or explodes

**Solutions:**

1. Reduce the learning rate:

   ```bash
   --lr 1e-6  # Reduce from 5e-6
   ```

2. Enable gradient clipping:

   ```bash
   --clip-grad 1.0
   ```

3. Check for data issues:

   ```python
   # Verify there are no empty prompts or responses
   for sample in dataset:
       assert len(sample["prompt"]) > 0
   ```

4. Use BF16 instead of FP16:

   ```bash
   --bf16  # More numerically stable
   ```
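The reason BF16 is the safer choice: FP16's largest finite value is 65504, so large gradients or activations overflow to `inf` and then propagate NaNs, while BF16 keeps FP32's 8-bit exponent (max ~3.4e38) at the cost of mantissa precision. The overflow boundary can be demonstrated with the standard library's half-precision packing:

```python
import struct

def fits_fp16(x):
    """True if x packs into IEEE half precision without overflowing."""
    try:
        struct.pack('<e', x)  # 'e' is the IEEE 754 half-precision format
        return True
    except OverflowError:
        return False

# FP16 tops out at 65504; anything beyond becomes inf in real FP16 math
print(fits_fp16(60000.0))  # True
print(fits_fp16(70000.0))  # False
```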

#### Issue: Reward Collapse

**Symptoms:** Reward drops to zero, model outputs garbage

**Solutions:**

1. Increase the KL penalty:

   ```bash
   --kl-loss-coef 0.01  # Increase from 0.001
   ```

2. Reduce the number of samples per prompt:

   ```bash
   --n-samples-per-prompt 4  # Reduce from 8
   ```

3. Verify the reward function:

   ```python
   # Test the reward function independently
   from custom_rm import reward_func

   sample = Sample(prompt="test", response="test response")
   reward = reward_func(args, sample)
   print(f"Reward: {reward}")  # Should be reasonable
   ```

### Async Training Issues

#### Issue: Async Training Not Supported with Colocate

**Symptoms:** Error when using `--colocate` with `train_async.py`

**Solution:** Colocated mode is NOT supported for async training. Use separate GPUs:

```bash
# Remove the --colocate flag and give rollout its own GPUs
python train_async.py \
    --actor-num-gpus-per-node 4 \
    --rollout-num-gpus 4
```

#### Issue: Stale Weights in Async Mode

**Symptoms:** Policy divergence, inconsistent behavior

**Solutions:**

1. Reduce the async buffer size:

   ```bash
   --async-buffer-size 2  # Reduce from 4
   ```

2. Increase the weight update frequency:

   ```bash
   --update-weights-interval 1  # Sync every rollout
   ```

### Multi-Turn Training Issues

#### Issue: Tool Responses Included in Loss

**Symptoms:** Model learns to output tool responses verbatim

**Solution:** Properly set the loss mask in your custom generate function:

```python
def build_loss_mask(sample):
    """Create a loss mask that excludes tool responses."""
    mask = []
    for token in sample.tokens:
        if is_tool_response(token, sample.metadata):
            mask.append(0)  # Tool output: don't compute loss
        else:
            mask.append(1)  # Model output: compute loss
    return mask
```

#### Issue: Multi-Turn Context Too Long

**Symptoms:** OOM or truncation in multi-turn conversations

**Solutions:**

1. Limit the conversation history:

   ```python
   # In the custom generate function
   conversation = sample.prompt[-10:]  # Keep the last 10 turns
   ```

2. Increase the context length:

   ```bash
   --sglang-context-length 16384
   ```
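Keeping a fixed number of turns can still blow the budget when individual turns are long; trimming by token count is more robust. A minimal sketch, where `turns` is a list of message strings and `count_tokens` is a hypothetical stand-in for your tokenizer-based counter (whitespace split here):

```python
def trim_history(turns, max_tokens, count_tokens=lambda t: len(t.split())):
    """Keep the most recent turns whose total token count fits the budget."""
    kept, total = [], 0
    for turn in reversed(turns):  # Walk newest to oldest
        cost = count_tokens(turn)
        if total + cost > max_tokens:
            break  # Oldest turns beyond the budget are dropped
        kept.append(turn)
        total += cost
    return list(reversed(kept))  # Restore chronological order

turns = ["a b c", "d e", "f g h i", "j"]
print(trim_history(turns, 6))  # ['f g h i', 'j'] -> 5 tokens, newest kept
```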

### Checkpoint Issues

#### Issue: Checkpoint Loading Fails

**Symptoms:** Cannot load a saved checkpoint

**Solutions:**

1. Verify the checkpoint path:

   ```bash
   ls -la /path/to/checkpoint/
   ```

2. Check that parallelism matches:

   ```bash
   # A checkpoint saved with TP=2 must be loaded with TP=2
   --tensor-model-parallel-size 2
   ```

3. Convert HuggingFace to Megatron (if needed):

   ```bash
   python tools/convert_hf_to_megatron.py \
       --hf_model_path /path/to/hf/model \
       --save_path /path/to/megatron/checkpoint
   ```

## Debugging Tips

### Enable Verbose Logging

```bash
--log-level DEBUG
export SLIME_DEBUG=1
```

### Check GPU Utilization

```bash
watch -n 1 nvidia-smi
```

### Monitor Training

```bash
tensorboard --logdir outputs/
```

### Test Custom Functions Independently

```python
# Test the reward function
import asyncio
from custom_rm import reward_func

async def test():
    sample = Sample(prompt="test", response="test", label="expected")
    reward = await reward_func(args, sample)
    print(f"Reward: {reward}")

asyncio.run(test())
```

## Constraint Reference

Key constraint to remember:

```
rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout
```

Example: 32 × 8 = 256 × 1
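The constraint above says the number of samples a rollout generates must equal the number the trainer consumes. A quick sanity check you can run against your own flags (illustrative helper, not a slime API):

```python
def check_rollout_constraint(rollout_batch_size, n_samples_per_prompt,
                             global_batch_size, num_steps_per_rollout):
    """Samples generated per rollout must equal samples trained on."""
    generated = rollout_batch_size * n_samples_per_prompt
    consumed = global_batch_size * num_steps_per_rollout
    return generated == consumed

print(check_rollout_constraint(32, 8, 256, 1))  # True: 256 == 256
print(check_rollout_constraint(16, 8, 256, 1))  # False: 128 != 256
```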

## Resources