hermes-agent/optional-skills/mlops/slime/SKILL.md
Teknium db22efbe88 feat(optional-skills): declare platforms frontmatter for all 63 undeclared skills
Extends the Windows-gating work to the optional-skills/ tree. Every
SKILL.md that previously omitted the platforms: field now carries an
explicit declaration, which Hermes's loader (agent.skill_utils.
skill_matches_platform) honors to skip-load on incompatible OSes.

58 skills declared cross-platform (platforms: [linux, macos, windows]):
  autonomous-ai-agents/blackbox, autonomous-ai-agents/honcho
  blockchain/base, blockchain/solana
  communication/one-three-one-rule
  creative/blender-mcp, creative/concept-diagrams, creative/hyperframes,
  creative/kanban-video-orchestrator, creative/meme-generation
  devops/cli (inference-sh-cli), devops/docker-management
  dogfood/adversarial-ux-test
  email/agentmail
  finance/3-statement-model, finance/comps-analysis, finance/dcf-model,
  finance/excel-author, finance/lbo-model, finance/merger-model,
  finance/pptx-author
  health/fitness-nutrition, health/neuroskill-bci
  mcp/fastmcp, mcp/mcporter
  migration/openclaw-migration
  mlops/accelerate, mlops/chroma, mlops/clip, mlops/guidance,
  mlops/hermes-atropos-environments, mlops/huggingface-tokenizers,
  mlops/instructor, mlops/lambda-labs, mlops/llava, mlops/modal,
  mlops/peft, mlops/pinecone, mlops/pytorch-lightning, mlops/qdrant,
  mlops/saelens, mlops/simpo, mlops/stable-diffusion
  productivity/canvas, productivity/shop-app, productivity/shopify,
  productivity/siyuan, productivity/telephony
  research/domain-intel, research/drug-discovery, research/duckduckgo-search,
  research/gitnexus-explorer, research/parallel-cli, research/scrapling
  security/1password, security/oss-forensics, security/sherlock
  web-development/page-agent

5 skills gated from Windows (platforms: [linux, macos]):
  mlops/flash-attention   - Flash Attention wheels are Linux-first; Windows
                            install requires building from source with CUDA
  mlops/faiss             - faiss-gpu has no Windows wheel; gate rather than
                            leak partial (faiss-cpu) support
  mlops/nemo-curator      - NVIDIA NeMo ecosystem has no first-class Windows path
  mlops/slime             - Megatron+SGLang RL stack is Linux-only in practice
  mlops/whisper           - openai-whisper + ffmpeg setup on Windows is
                            non-trivial; gate until Windows install stanza lands

Methodology: scanned every SKILL.md for Windows-hostile signals
(apt-get, brew, systemd, osascript, ptrace, X11 binaries, POSIX-only
Python APIs, Docker POSIX $(pwd) bind-mounts, explicit 'linux-only' /
'macos-only' text). 3 skills flagged as having hard signals on review:
docker-management and qdrant only had POSIX $(pwd) docker examples and
the tools themselves (Docker Desktop, Qdrant) run fine on Windows —
declared ALL. whisper had an apt/brew ffmpeg install path and nothing
else but the openai-whisper Windows install story is rough enough to
warrant gating.

Strict-over-lenient policy: when in doubt, gate. Easier to un-gate after
verified Windows support lands than to leak partial support that
manifests as mid-task failures for Windows users.
2026-05-08 14:27:40 -07:00

11 KiB
Raw Blame History

name description version author license dependencies platforms metadata
slime-rl-training Provides guidance for LLM post-training with RL using slime, a Megatron+SGLang framework. Use when training GLM models, implementing custom data generation workflows, or needing tight Megatron-LM integration for RL scaling. 1.0.0 Orchestra Research MIT
sglang-router>=0.2.3
ray
torch>=2.0.0
transformers>=4.40.0
linux
macos
hermes
tags
Reinforcement Learning
Megatron-LM
SGLang
GRPO
Post-Training
GLM

slime: LLM Post-Training Framework for RL Scaling

slime is an LLM post-training framework from Tsinghua's THUDM team, powering GLM-4.5, GLM-4.6, and GLM-4.7. It connects Megatron-LM for training with SGLang for high-throughput rollout generation.

When to Use slime

Choose slime when you need:

  • Megatron-LM native training with SGLang inference
  • Custom data generation workflows with flexible data buffers
  • Training GLM, Qwen3, DeepSeek V3, or Llama 3 models
  • Research-grade framework with production backing (Z.ai)

Consider alternatives when:

  • You need enterprise-grade stability features → use miles
  • You want flexible backend swapping → use verl
  • You need PyTorch-native abstractions → use torchforge

Key Features

  • Training: Megatron-LM with full parallelism support (TP, PP, DP, SP)
  • Rollout: SGLang-based high-throughput generation with router
  • Data Buffer: Flexible prompt management and sample storage
  • Models: GLM-4.x, Qwen3, DeepSeek V3/R1, Llama 3

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                    Data Buffer                          │
│ - Prompt initialization and management                  │
│ - Custom data generation and filtering                  │
│ - Rollout sample storage                                │
└─────────────┬───────────────────────────┬───────────────┘
              │                           │
┌─────────────▼───────────┐ ┌─────────────▼───────────────┐
│ Training (Megatron-LM)  │ │ Rollout (SGLang + Router)   │
│ - Actor model training  │ │ - Response generation       │
│ - Critic (optional)     │ │ - Reward/verifier output    │
│ - Weight sync to rollout│ │ - Multi-turn support        │
└─────────────────────────┘ └─────────────────────────────┘

Installation

# Recommended: Docker
docker pull slimerl/slime:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  -it slimerl/slime:latest /bin/bash

# Inside container
cd /root/slime && pip install -e . --no-deps

From Source

git clone https://github.com/THUDM/slime.git
cd slime
pip install -r requirements.txt
pip install -e .

Quick Start: GRPO Training

# Source model configuration
source scripts/models/qwen3-4B.sh

# Launch training
python train.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 4 \
    --rollout-num-gpus 4 \
    --advantage-estimator grpo \
    --use-kl-loss --kl-loss-coef 0.001 \
    --rollout-batch-size 32 \
    --n-samples-per-prompt 8 \
    --global-batch-size 256 \
    --num-rollout 3000 \
    --prompt-data /path/to/data.jsonl \
    ${MODEL_ARGS[@]} ${CKPT_ARGS[@]}

Workflow 1: Standard GRPO Training

Use this workflow for training reasoning models with group-relative advantages.

Prerequisites Checklist

  • Docker environment or Megatron-LM + SGLang installed
  • Model checkpoint (HuggingFace or Megatron format)
  • Training data in JSONL format

Step 1: Prepare Data

# data.jsonl format
{"prompt": "What is 2 + 2?", "label": "4"}
{"prompt": "Solve: 3x = 12", "label": "x = 4"}

Or with chat format:

{
    "prompt": [
        {"role": "system", "content": "You are a math tutor."},
        {"role": "user", "content": "What is 15 + 27?"}
    ],
    "label": "42"
}

Step 2: Configure Model

Choose a pre-configured model script:

# List available models
ls scripts/models/
# glm4-9B.sh, qwen3-4B.sh, qwen3-30B-A3B.sh, deepseek-v3.sh, llama3-8B.sh, ...

# Source your model
source scripts/models/qwen3-4B.sh

Step 3: Launch Training

python train.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --advantage-estimator grpo \
    --use-kl-loss \
    --kl-loss-coef 0.001 \
    --prompt-data /path/to/train.jsonl \
    --input-key prompt \
    --label-key label \
    --apply-chat-template \
    --rollout-batch-size 32 \
    --n-samples-per-prompt 8 \
    --global-batch-size 256 \
    --num-rollout 3000 \
    --save-interval 100 \
    --eval-interval 50 \
    ${MODEL_ARGS[@]}

Step 4: Monitor Training

  • Check TensorBoard: tensorboard --logdir outputs/
  • Verify reward curves are increasing
  • Monitor GPU utilization across nodes

Workflow 2: Asynchronous Training

Use async mode for higher throughput by overlapping rollout and training.

When to Use Async

  • Large models with long generation times
  • High GPU idle time in synchronous mode
  • Sufficient memory for buffering

Launch Async Training

python train_async.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --advantage-estimator grpo \
    --async-buffer-size 4 \
    --prompt-data /path/to/train.jsonl \
    ${MODEL_ARGS[@]}

Async-Specific Parameters

--async-buffer-size 4        # Number of rollouts to buffer
--update-weights-interval 2  # Sync weights every N rollouts

Workflow 3: Multi-Turn Agentic Training

Use this workflow for training agents with tool use or multi-step reasoning.

Prerequisites

  • Custom generate function for multi-turn logic
  • Tool/environment interface

Step 1: Define Custom Generate Function

# custom_generate.py
async def custom_generate(args, samples, evaluation=False):
    """Multi-turn generation with tool calling."""
    for sample in samples:
        conversation = sample.prompt

        for turn in range(args.max_turns):
            # Generate response
            response = await generate_single(conversation)

            # Check for tool call
            tool_call = extract_tool_call(response)
            if tool_call:
                tool_result = execute_tool(tool_call)
                conversation.append({"role": "assistant", "content": response})
                conversation.append({"role": "tool", "content": tool_result})
            else:
                break

        sample.response = response
        sample.reward = compute_reward(sample)

    return samples

Step 2: Launch with Custom Function

python train.py \
    --custom-generate-function-path custom_generate.py \
    --max-turns 5 \
    --prompt-data /path/to/agent_data.jsonl \
    ${MODEL_ARGS[@]}

See examples/search-r1/ for a complete multi-turn search example.


Configuration Reference

Three Argument Categories

slime uses three types of arguments:

1. Megatron Arguments (passed directly):

--tensor-model-parallel-size 2
--pipeline-model-parallel-size 1
--num-layers 32
--hidden-size 4096

2. SGLang Arguments (prefixed with --sglang-):

--sglang-mem-fraction-static 0.8
--sglang-context-length 8192
--sglang-log-level INFO

3. slime Arguments:

# Resource allocation
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--colocate  # Share GPUs between training/inference

# Data
--prompt-data /path/to/data.jsonl
--input-key prompt
--label-key label

# Training loop
--num-rollout 3000
--rollout-batch-size 32
--n-samples-per-prompt 8
--global-batch-size 256

# Algorithm
--advantage-estimator grpo  # or: gspo, ppo, reinforce_plus_plus
--use-kl-loss
--kl-loss-coef 0.001

Key Constraints

rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout

Example: 32 × 8 = 256 × 1


Data Buffer System

slime's data buffer enables flexible data management:

Basic Data Source

class RolloutDataSource:
    def get_samples(self, num_samples):
        """Fetch prompts from dataset."""
        return self.dataset.sample(num_samples)

    def add_samples(self, samples):
        """Called after generation (no-op by default)."""
        pass

Buffered Data Source (Off-Policy)

class RolloutDataSourceWithBuffer(RolloutDataSource):
    def __init__(self):
        self.buffer = []

    def add_samples(self, samples):
        """Store generated samples for reuse."""
        self.buffer.extend(samples)

    def buffer_filter(self, args, buffer, num_samples):
        """Custom selection logic (prioritized, stratified, etc.)."""
        return select_best(buffer, num_samples)

Common Issues and Solutions

Issue: SGLang Engine Crash

Symptoms: Inference engine dies mid-training

Solutions:

# Enable fault tolerance
--use-fault-tolerance

# Increase memory allocation
--sglang-mem-fraction-static 0.85

# Reduce batch size
--rollout-batch-size 16

Issue: Weight Sync Timeout

Symptoms: Training hangs after rollout

Solutions:

# Increase sync interval
--update-weights-interval 5

# Use colocated mode (no network transfer)
--colocate

Issue: OOM During Training

Symptoms: CUDA OOM in backward pass

Solutions:

# Enable gradient checkpointing
--recompute-activations

# Reduce micro-batch size
--micro-batch-size 1

# Enable sequence parallelism
--sequence-parallel

Issue: Slow Data Loading

Symptoms: GPU idle during data fetch

Solutions:

# Increase data workers
--num-data-workers 4

# Use streaming dataset
--streaming-data

Supported Models

Model Family Configurations
GLM GLM-4.5, GLM-4.6, GLM-4.7, GLM-Z1-9B
Qwen Qwen3 (4B, 8B, 30B-A3B), Qwen3-MoE, Qwen2.5
DeepSeek V3, V3.1, R1
Llama Llama 3 (8B, 70B)
Others Kimi K2, Moonlight-16B

Each model has pre-configured scripts in scripts/models/.


Advanced Topics

Co-location Mode

Share GPUs between training and inference to reduce memory:

python train.py \
    --colocate \
    --actor-num-gpus-per-node 8 \
    --sglang-mem-fraction-static 0.4 \
    ${MODEL_ARGS[@]}

Custom Reward Model

# custom_rm.py
class CustomRewardModel:
    def __init__(self, model_path):
        self.model = load_model(model_path)

    def compute_reward(self, prompts, responses):
        inputs = self.tokenize(prompts, responses)
        scores = self.model(inputs)
        return scores.tolist()
--custom-rm-path custom_rm.py

Evaluation Multi-Task

--eval-prompt-data aime /path/to/aime.jsonl \
--eval-prompt-data gsm8k /path/to/gsm8k.jsonl \
--n-samples-per-eval-prompt 16

Resources