hermes-agent/skills/mlops/training/slime/references/troubleshooting.md

slime Troubleshooting Guide

Common Issues and Solutions

SGLang Issues

Issue: SGLang Engine Crash

Symptoms: Inference engine dies mid-training, connection errors

Solutions:

  1. Enable fault tolerance:
--use-fault-tolerance
  2. Increase memory allocation:
--sglang-mem-fraction-static 0.85  # Increase from 0.8
  3. Reduce batch size:
--rollout-batch-size 16  # Reduce from 32
  4. Disable CUDA graphs (for debugging):
--sglang-disable-cuda-graph

Issue: SGLang Router Load Imbalance

Symptoms: Some SGLang engines overloaded while others idle

Solutions:

  1. Adjust routing strategy:
--sglang-router-strategy round_robin
  2. Increase the number of engines:
--rollout-num-gpus-per-engine 1  # More engines, fewer GPUs each
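The round_robin strategy above dispatches requests in strict rotation regardless of load. A minimal sketch of the idea (engine addresses are illustrative; the real implementation lives in the sglang-router package):

```python
from itertools import cycle

# Hypothetical engine endpoints; in practice the router discovers these.
engines = ["http://engine-0:30000", "http://engine-1:30000", "http://engine-2:30000"]
_next_engine = cycle(engines)

def route(prompt: str) -> str:
    """Pick the next engine in strict rotation, ignoring current load."""
    return next(_next_engine)

# Six requests land evenly: two per engine, in order.
targets = [route(f"prompt-{i}") for i in range(6)]
```

Round-robin fixes imbalance caused by uneven request assignment, but not imbalance caused by uneven request *lengths*; more, smaller engines (the second fix) helps with that.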

Weight Synchronization Issues

Issue: Weight Sync Timeout

Symptoms: Training hangs after rollout, timeout errors

Solutions:

  1. Increase sync interval (async mode):
--update-weights-interval 5  # Increase from 2
  2. Use colocated mode (eliminates network transfer):
--colocate
  3. Check network bandwidth:
# Verify InfiniBand is enabled
ibstat

Issue: Weight Sync Failures in Multi-Node

Symptoms: Nodes fail to receive updated weights

Solutions:

  1. Set NCCL environment:
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=0
  2. Increase timeout:
export NCCL_TIMEOUT=1800

Memory Issues

Issue: OOM During Training

Symptoms: CUDA OOM in backward pass

Solutions:

  1. Enable gradient checkpointing:
--recompute-activations
  2. Reduce micro-batch size:
--micro-batch-size 1
  3. Enable sequence parallelism:
--sequence-parallel
  4. Reduce global batch size:
--global-batch-size 128  # Reduce from 256
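Micro-batch and global batch size interact through gradient accumulation. A sketch of the usual decomposition (generic names, not slime's internals), which explains why reducing micro-batch size lowers peak activation memory without changing the effective batch:

```python
def accumulation_steps(global_batch_size: int, micro_batch_size: int, dp_size: int) -> int:
    """global_batch_size = micro_batch_size * dp_size * accumulation_steps.
    Peak activation memory scales with micro_batch_size, not global_batch_size."""
    per_step = micro_batch_size * dp_size
    assert global_batch_size % per_step == 0, "global batch must divide evenly"
    return global_batch_size // per_step

# With 8 data-parallel ranks and micro-batch 1:
steps_256 = accumulation_steps(256, 1, 8)  # global batch 256
steps_128 = accumulation_steps(128, 1, 8)  # after reducing to 128
```

Halving the global batch halves accumulation steps (and wall-clock per optimizer step), while micro-batch size alone determines how large each forward/backward pass is.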

Issue: OOM in Colocated Mode

Symptoms: OOM when both training and inference run on same GPUs

Solutions:

  1. Reduce SGLang memory:
--sglang-mem-fraction-static 0.4  # Reduce from 0.8
  2. Enable offloading:
--offload-optimizer-states
  3. Use a smaller sequence length:
--seq-length 2048  # Reduce from 4096

Data Loading Issues

Issue: Slow Data Loading

Symptoms: GPU idle during data fetch, low GPU utilization

Solutions:

  1. Increase data workers:
--num-data-workers 4
  2. Use a streaming dataset:
--streaming-data
  3. Pre-tokenize data offline (sketch; paths and field names are illustrative):
import json
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("model_path")
with open("data.jsonl") as fin, open("data.tokenized.jsonl", "w") as fout:
    for line in fin:
        record = json.loads(line)
        record["input_ids"] = tokenizer(record["prompt"])["input_ids"]
        fout.write(json.dumps(record) + "\n")

Issue: Data Format Errors

Symptoms: KeyError, missing fields, parsing failures

Solutions:

  1. Verify data format:
import json
with open("data.jsonl") as f:
    for line in f:
        data = json.loads(line)
        assert "prompt" in data, "Missing prompt field"
        assert "label" in data, "Missing label field"
  2. Check key names:
--input-key prompt  # Must match your data
--label-key label   # Must match your data

Training Stability Issues

Issue: Loss Explosion / NaN

Symptoms: Loss becomes NaN or explodes

Solutions:

  1. Reduce learning rate:
--lr 1e-6  # Reduce from 5e-6
  2. Enable gradient clipping:
--clip-grad 1.0
  3. Check for data issues:
# Verify no empty prompts or responses
for sample in dataset:
    assert len(sample["prompt"]) > 0
  4. Use BF16 instead of FP16:
--bf16  # More numerically stable
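What --clip-grad 1.0 does conceptually: rescale gradients whenever their global L2 norm exceeds the threshold, which caps the size of any single update and prevents one bad batch from blowing up the loss. A pure-Python sketch (frameworks operate on tensors, but the math is the same):

```python
import math

def clip_grad_norm(grads: list[float], max_norm: float = 1.0) -> tuple[list[float], float]:
    """Rescale grads so their global L2 norm is at most max_norm.
    Returns (possibly rescaled grads, original norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

clipped, norm = clip_grad_norm([3.0, 4.0], max_norm=1.0)  # original norm is 5.0
```

Clipping changes only the magnitude of the update, never its direction, so training signal is preserved while spikes are bounded.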

Issue: Reward Collapse

Symptoms: Reward drops to zero, model outputs garbage

Solutions:

  1. Increase KL penalty:
--kl-loss-coef 0.01  # Increase from 0.001
  2. Reduce number of samples:
--n-samples-per-prompt 4  # Reduce from 8
  3. Verify the reward function:
# Test reward function independently
from custom_rm import reward_func
sample = Sample(prompt="test", response="test response")
reward = reward_func(args, sample)
print(f"Reward: {reward}")  # Should be reasonable
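Why raising the KL coefficient helps: the effective objective trades task reward against divergence from the reference policy, so a larger coefficient makes drift (and hence collapse into degenerate outputs) more expensive. A schematic of the trade-off, not slime's exact loss formulation:

```python
def kl_penalized_reward(task_reward: float, kl_divergence: float, kl_coef: float) -> float:
    """Schematic RLHF-style objective: reward minus a KL penalty
    that charges the policy for drifting from the reference model."""
    return task_reward - kl_coef * kl_divergence

# Same task reward (1.0) and same drift (KL = 5.0) under the two coefficients:
weak = kl_penalized_reward(1.0, 5.0, kl_coef=0.001)   # penalty barely registers
strong = kl_penalized_reward(1.0, 5.0, kl_coef=0.01)  # 10x stronger pull back
```

Moving the coefficient from 0.001 to 0.01 makes the same amount of drift cost ten times more, anchoring the policy closer to the reference.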

Async Training Issues

Issue: Async Training Not Supported with Colocate

Symptoms: Error when using --colocate with train_async.py

Solution: Colocated mode is NOT supported for async training. Use separate GPUs:

# Remove the --colocate flag
python train_async.py \
    --actor-num-gpus-per-node 4 \
    --rollout-num-gpus 4

Issue: Stale Weights in Async Mode

Symptoms: Policy divergence, inconsistent behavior

Solutions:

  1. Reduce async buffer size:
--async-buffer-size 2  # Reduce from 4
  2. Increase weight update frequency:
--update-weights-interval 1  # Sync every rollout

Multi-Turn Training Issues

Issue: Tool Responses Included in Loss

Symptoms: Model learns to output tool responses verbatim

Solution: Properly set loss mask in custom generate function:

def build_loss_mask(sample):
    """Create a 0/1 loss mask that excludes tool responses."""
    mask = []
    for token in sample.tokens:
        if is_tool_response(token, sample.metadata):
            mask.append(0)  # Tool output: exclude from loss
        else:
            mask.append(1)  # Model output: train on it
    return mask
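Downstream, a 0/1 mask like this is typically applied by averaging per-token losses over unmasked positions only. A minimal pure-Python sketch of that step (not slime's actual loss code):

```python
def masked_mean_loss(token_losses: list[float], loss_mask: list[int]) -> float:
    """Average per-token losses over positions where the mask is 1;
    masked (tool-response) tokens contribute nothing to the gradient."""
    kept = [loss for loss, m in zip(token_losses, loss_mask) if m == 1]
    return sum(kept) / len(kept) if kept else 0.0

# Middle token is a tool response: only positions 0 and 2 are averaged.
loss = masked_mean_loss([2.0, 4.0, 6.0], [1, 0, 1])
```

If tool tokens were included (mask all 1s), the model would be directly rewarded for reproducing tool output verbatim, which is exactly the symptom described above.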

Issue: Multi-Turn Context Too Long

Symptoms: OOM or truncation in multi-turn conversations

Solutions:

  1. Limit conversation history:
# In custom generate function
conversation = sample.prompt[-10:]  # Keep last 10 turns
  2. Increase context length:
--sglang-context-length 16384
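The plain `[-10:]` slice above drops the system message once the conversation grows past ten turns. A hypothetical helper that preserves it while truncating the rest (assumes chat-style turns with a "role" key):

```python
def truncate_history(turns: list[dict], max_turns: int = 10) -> list[dict]:
    """Keep the leading system message (if present) plus the last
    max_turns conversation turns. Illustrative; adapt to your turn schema."""
    if turns and turns[0].get("role") == "system":
        return [turns[0]] + turns[1:][-max_turns:]
    return turns[-max_turns:]

# 1 system message + 14 user/assistant turns -> system + last 10 turns.
history = [{"role": "system", "content": "You are helpful."}]
history += [{"role": "user", "content": f"turn {i}"} for i in range(14)]
trimmed = truncate_history(history)
```

Truncating history trades recall of early turns for memory headroom; raising --sglang-context-length trades the reverse.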

Checkpoint Issues

Issue: Checkpoint Loading Fails

Symptoms: Cannot load saved checkpoint

Solutions:

  1. Verify the checkpoint path:
ls -la /path/to/checkpoint/
  2. Check that parallelism matches:
# Checkpoint was saved with TP=2, must load with TP=2
--tensor-model-parallel-size 2
  3. Convert HuggingFace to Megatron (if needed):
python tools/convert_hf_to_megatron.py \
    --hf_model_path /path/to/hf/model \
    --save_path /path/to/megatron/checkpoint

Debugging Tips

Enable Verbose Logging

--log-level DEBUG
export SLIME_DEBUG=1

Check GPU Utilization

watch -n 1 nvidia-smi

Monitor Training

tensorboard --logdir outputs/

Test Custom Functions Independently

# Test reward function
import asyncio
from custom_rm import reward_func

async def test():
    sample = Sample(prompt="test", response="test", label="expected")
    reward = await reward_func(args, sample)
    print(f"Reward: {reward}")

asyncio.run(test())

Constraint Reference

Key constraint to remember:

rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout

Example: 32 × 8 = 256 × 1
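A small preflight check of this constraint (a sketch; slime validates it internally at startup) catches mismatched batch arguments before a job is launched:

```python
def check_batch_constraint(rollout_batch_size: int, n_samples_per_prompt: int,
                           global_batch_size: int, num_steps_per_rollout: int) -> bool:
    """True iff rollout_batch_size * n_samples_per_prompt ==
    global_batch_size * num_steps_per_rollout, i.e. every generated
    sample is consumed by exactly the training steps of one rollout."""
    return (rollout_batch_size * n_samples_per_prompt
            == global_batch_size * num_steps_per_rollout)

ok = check_batch_constraint(32, 8, 256, 1)   # the worked example: 256 == 256
bad = check_batch_constraint(32, 8, 256, 2)  # 256 != 512
```

Both sides count samples per rollout: the left is how many the inference engines produce, the right is how many the trainer consumes, so they must match exactly.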

Resources