Megatron Integration with Accelerate

Overview

Accelerate supports Megatron-LM for massive model training with tensor parallelism and pipeline parallelism.

Megatron capabilities:

  • Tensor Parallelism (TP): Split layers across GPUs
  • Pipeline Parallelism (PP): Split model depth across GPUs
  • Data Parallelism (DP): Replicate model across GPU groups
  • Sequence Parallelism: Split sequences for long contexts

Setup

Install Megatron-LM

# Clone Megatron-LM repository
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
pip install -e .

# Install Apex (NVIDIA optimizations)
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
  --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

Accelerate Configuration

accelerate config

Questions:

In which compute environment are you running?
> This machine

Which type of machine are you using?
> Multi-GPU

How many different machines will you use?
> 1

Do you want to use DeepSpeed/FSDP?
> No

Do you want to use Megatron-LM?
> Yes

What is the Tensor Parallelism degree? [1-8]
> 2

Do you want to enable Sequence Parallelism?
> No

What is the Pipeline Parallelism degree? [1-8]
> 2

What is the Data Parallelism degree? [1-8]
> 2

Where to perform activation checkpointing? ['SELECTIVE', 'FULL', 'NONE']
> SELECTIVE

Where to perform activation partitioning? ['SEQUENTIAL', 'UNIFORM']
> SEQUENTIAL

Generated config (~/.cache/huggingface/accelerate/default_config.yaml):

compute_environment: LOCAL_MACHINE
distributed_type: MEGATRON_LM
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
megatron_lm_config:
  megatron_lm_gradient_clipping: 1.0
  megatron_lm_learning_rate_decay_iters: 320000
  megatron_lm_num_micro_batches: 1
  megatron_lm_pp_degree: 2
  megatron_lm_recompute_activations: true
  megatron_lm_sequence_parallelism: false
  megatron_lm_tp_degree: 2
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Parallelism Strategies

Tensor Parallelism (TP)

Splits each transformer layer across GPUs:

# Layer split across 2 GPUs
# GPU 0: First half of attention heads
# GPU 1: Second half of attention heads

# Each GPU computes partial outputs
# All-reduce combines results
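
To make the split concrete, here is a minimal single-process sketch of a column-parallel linear layer, the pattern Megatron uses for attention and MLP projections. The two shards stand in for two GPUs; a real implementation replaces the final concatenation with collective communication.

import torch

torch.manual_seed(0)
x = torch.randn(4, 1024)        # input activations, replicated on both "GPUs"
w = torch.randn(2048, 1024)     # full weight of a Linear(1024 -> 2048)

# Column-parallel split: each shard holds half of the output features
w_gpu0, w_gpu1 = w.chunk(2, dim=0)

y_gpu0 = x @ w_gpu0.T           # partial output on "GPU 0"
y_gpu1 = x @ w_gpu1.T           # partial output on "GPU 1"

# Gathering the shards reproduces the unsharded layer exactly
y_full = torch.cat([y_gpu0, y_gpu1], dim=-1)
assert torch.allclose(y_full, x @ w.T, atol=1e-5)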

TP degree recommendations:

  • TP=1: No tensor parallelism (single GPU per layer)
  • TP=2: 2 GPUs per layer (good for 7-13B models)
  • TP=4: 4 GPUs per layer (good for 20-40B models)
  • TP=8: 8 GPUs per layer (good for 70B+ models)

Benefits:

  • Reduces memory per GPU
  • All-reduce communication (fast)

Drawbacks:

  • Requires fast inter-GPU bandwidth (NVLink)
  • Communication overhead per layer

Pipeline Parallelism (PP)

Splits model depth across GPUs:

# 12-layer model, PP=4
# GPU 0: Layers 0-2
# GPU 1: Layers 3-5
# GPU 2: Layers 6-8
# GPU 3: Layers 9-11
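
A rough sketch of the stage-to-layer mapping, assuming the layer count divides evenly by the PP degree (Megatron's real partitioning also accounts for embedding and output layers):

def stage_layers(n_layers: int, pp_degree: int, stage_rank: int) -> range:
    # Contiguous block of transformer layers owned by one pipeline stage
    per_stage = n_layers // pp_degree
    return range(stage_rank * per_stage, (stage_rank + 1) * per_stage)

print(list(stage_layers(12, 4, 2)))  # [6, 7, 8] -- GPU 2 in the example above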

PP degree recommendations:

  • PP=1: No pipeline parallelism
  • PP=2: 2 pipeline stages (good for 20-40B models)
  • PP=4: 4 pipeline stages (good for 70B+ models)
  • PP=8: 8 pipeline stages (good for 175B+ models)

Benefits:

  • Near-linear memory reduction (PP=4 → roughly 4× less parameter memory per GPU)
  • Works across nodes (slower interconnect OK)

Drawbacks:

  • Pipeline bubbles (idle time)
  • Requires micro-batching

Data Parallelism (DP)

Replicates model across GPU groups:

# 8 GPUs, TP=2, PP=2, DP=2
# Group 0 (GPUs 0-3): Full model replica
# Group 1 (GPUs 4-7): Full model replica

DP degree:

  • DP = total_gpus / (TP × PP)
  • Example: 8 GPUs, TP=2, PP=2 → DP=2
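
A quick way to sanity-check a layout before launching (a hypothetical helper, not part of Accelerate): the three degrees must tile the GPU count exactly.

def data_parallel_degree(world_size: int, tp: int, pp: int) -> int:
    # DP = total_gpus / (TP × PP); raise early if the layout does not fit
    assert world_size % (tp * pp) == 0, "TP × PP must divide the GPU count"
    return world_size // (tp * pp)

print(data_parallel_degree(8, tp=2, pp=2))  # 2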

Benefits:

  • Increases throughput
  • Scales batch size

Sequence Parallelism

Splits long sequences across GPUs (extends TP):

# 8K sequence, TP=2, Sequence Parallel=True
# GPU 0: Tokens 0-4095
# GPU 1: Tokens 4096-8191
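
A single-process sketch of the split; torch.chunk along the sequence dimension stands in for the scatter that sequence parallelism performs between tensor-parallel regions.

import torch

hidden = torch.randn(1, 8192, 1024)   # (batch, seq_len, hidden) activations
shard_gpu0, shard_gpu1 = hidden.chunk(2, dim=1)

print(shard_gpu0.shape, shard_gpu1.shape)  # two shards of shape (1, 4096, 1024)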

Benefits:

  • Enables very long sequences (100K+ tokens)
  • Reduces activation memory

Requirements:

  • Must use with TP > 1
  • RoPE/ALiBi position encodings work best

Accelerate Code Example

Basic Setup

from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin

# Configure Megatron
megatron_plugin = MegatronLMPlugin(
    tp_degree=2,              # Tensor parallelism degree
    pp_degree=2,              # Pipeline parallelism degree
    num_micro_batches=4,      # Micro-batches for pipeline
    gradient_clipping=1.0,    # Gradient clipping value
    sequence_parallelism=False,  # Sequence parallelism (disabled here)
    recompute_activations=True,  # Activation checkpointing
    use_distributed_optimizer=True,  # Distributed optimizer
    custom_prepare_model_function=None,  # Custom model prep
)

# Initialize accelerator
accelerator = Accelerator(
    mixed_precision='bf16',
    megatron_lm_plugin=megatron_plugin
)

# Prepare model, optimizer, and dataloader (defined earlier, as in any PyTorch script)
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

# Training loop (same as DDP!)
for batch in train_dataloader:
    optimizer.zero_grad()
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()

Full Training Script

import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin
from transformers import GPT2Config, GPT2LMHeadModel

def main():
    # Megatron configuration
    megatron_plugin = MegatronLMPlugin(
        tp_degree=2,
        pp_degree=2,
        num_micro_batches=4,
        gradient_clipping=1.0,
    )

    accelerator = Accelerator(
        mixed_precision='bf16',
        gradient_accumulation_steps=8,
        megatron_lm_plugin=megatron_plugin
    )

    # Model
    config = GPT2Config(
        n_layer=24,
        n_head=16,
        n_embd=1024,
    )
    model = GPT2LMHeadModel(config)

    # Optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)

    # Toy data for illustration: random token ids; swap in a real tokenized dataset
    train_data = [
        {"input_ids": ids, "labels": ids}
        for ids in torch.randint(0, config.vocab_size, (512, 128))
    ]
    train_loader = DataLoader(train_data, batch_size=4, shuffle=True)

    # Prepare
    model, optimizer, train_loader = accelerator.prepare(
        model, optimizer, train_loader
    )

    # Training loop
    num_epochs = 3
    for epoch in range(num_epochs):
        for batch in train_loader:
            with accelerator.accumulate(model):
                outputs = model(**batch)
                loss = outputs.loss
                accelerator.backward(loss)
                optimizer.step()
                optimizer.zero_grad()

        # Save checkpoint
        accelerator.wait_for_everyone()
        accelerator.save_state(f'checkpoint-epoch-{epoch}')

if __name__ == '__main__':
    main()

Launch Command

# 8 GPUs, TP=2, PP=2, DP=2
accelerate launch --multi_gpu --num_processes 8 train.py

# Multi-node (2 nodes, 8 GPUs each)
# Node 0
accelerate launch --multi_gpu --num_processes 16 \
  --num_machines 2 --machine_rank 0 \
  --main_process_ip $MASTER_ADDR \
  --main_process_port 29500 \
  train.py

# Node 1
accelerate launch --multi_gpu --num_processes 16 \
  --num_machines 2 --machine_rank 1 \
  --main_process_ip $MASTER_ADDR \
  --main_process_port 29500 \
  train.py

Activation Checkpointing

Reduces memory by recomputing activations:

megatron_plugin = MegatronLMPlugin(
    recompute_activations=True,      # Enable checkpointing
    checkpoint_num_layers=1,         # Checkpoint every N layers
    distribute_checkpointed_activations=True,  # Distribute across TP
    partition_activations=True,      # Partition in PP
    check_for_nan_in_loss_and_grad=True,  # Stability check
)

Strategies:

  • SELECTIVE: Checkpoint transformer blocks only
  • FULL: Checkpoint all layers
  • NONE: No checkpointing

Memory savings: 30-50% with 10-15% slowdown
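
The mechanism is the same as plain PyTorch activation checkpointing, sketched below on a toy MLP block; Megatron applies it per transformer layer, and SELECTIVE restricts recomputation to the most memory-hungry attention operations.

import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)
x = torch.randn(8, 1024, requires_grad=True)

# Forward does not keep the block's intermediate activations;
# they are recomputed on the fly during backward
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()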

Distributed Optimizer

Shards optimizer state across DP ranks:

megatron_plugin = MegatronLMPlugin(
    use_distributed_optimizer=True,  # Enable sharded optimizer
)

Benefits:

  • Reduces optimizer memory by DP degree
  • Example: DP=4 → 4× less optimizer memory per GPU
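
A back-of-the-envelope estimate of that saving, assuming mixed-precision AdamW keeps roughly 12 bytes of optimizer state per parameter (fp32 master weights plus momentum and variance):

def optimizer_state_gb(n_params: float, dp_degree: int, bytes_per_param: int = 12) -> float:
    # 12 bytes/param assumes fp32 master weights + AdamW momentum + variance
    return n_params * bytes_per_param / dp_degree / 1e9

print(optimizer_state_gb(7e9, dp_degree=1))  # ~84 GB replicated on every GPU
print(optimizer_state_gb(7e9, dp_degree=4))  # ~21 GB per GPU with the distributed optimizer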

Compatible with:

  • AdamW, Adam, SGD
  • Mixed precision training

Performance Tuning

Micro-Batch Size

# Pipeline parallelism requires micro-batching
megatron_plugin = MegatronLMPlugin(
    pp_degree=4,
    num_micro_batches=16,  # 16 micro-batches per pipeline
)

# Effective batch = num_micro_batches × micro_batch_size × DP
# Example: 16 × 2 × 4 = 128

Recommendations:

  • More micro-batches → less pipeline bubble
  • Typical: 4-16 micro-batches
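
The usual first-order estimate for the idle "bubble" fraction of a GPipe-style schedule is (PP − 1) / (micro_batches + PP − 1), which is why more micro-batches help:

def bubble_fraction(pp_degree: int, num_micro_batches: int) -> float:
    # Fraction of each step spent idle while the pipeline fills and drains
    return (pp_degree - 1) / (num_micro_batches + pp_degree - 1)

print(round(bubble_fraction(4, 4), 2))   # 0.43 -- almost half the step is bubble
print(round(bubble_fraction(4, 16), 2))  # 0.16 -- far better utilization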

Sequence Length

# For long sequences, enable sequence parallelism
megatron_plugin = MegatronLMPlugin(
    tp_degree=4,
    sequence_parallelism=True,  # Requires tp_degree > 1
)

# Activation memory in the sequence-parallel regions scales down with TP,
# allowing roughly TP × longer sequences
# Example: TP=4, 8K baseline → ~32K with sequence parallelism

GPU Topology

NVLink required for TP:

# Check NVLink topology
nvidia-smi topo -m

# Good topology (NVLink between all GPUs)
# GPU0 - GPU1: NV12 (fast)
# GPU0 - GPU2: NV12 (fast)

# Bad topology (PCIe only)
# GPU0 - GPU4: PHB (slow, avoid TP across these)

Recommendations:

  • TP: Within same node (NVLink)
  • PP: Across nodes (slower interconnect OK)
  • DP: Any topology

Model Size Guidelines

Model Size   GPUs   TP   PP   DP   Micro-Batches
7B           8      1    1    8    1
13B          8      2    1    4    1
20B          16     4    1    4    1
40B          32     4    2    4    4
70B          64     8    2    4    8
175B         128    8    4    4    16

Assumptions: BF16, 2K sequence length, A100 80GB
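
The layouts roughly track the bf16 weight footprint alone, which a small estimate makes visible (activations, gradients, and optimizer state add substantially more on top):

def weight_gb_per_gpu(n_params: float, tp: int, pp: int, bytes_per_param: int = 2) -> float:
    # bf16 weights are sharded across the TP × PP grid; each DP replica holds a full shard
    return n_params * bytes_per_param / (tp * pp) / 1e9

print(weight_gb_per_gpu(70e9, tp=8, pp=2))   # ~8.75 GB of weights per GPU
print(weight_gb_per_gpu(175e9, tp=8, pp=4))  # ~10.9 GB of weights per GPU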

Checkpointing

Save Checkpoint

# Save full model state
accelerator.save_state('checkpoint-1000')

# Megatron saves separate files per rank
# checkpoint-1000/
#   pytorch_model_tp_0_pp_0.bin
#   pytorch_model_tp_0_pp_1.bin
#   pytorch_model_tp_1_pp_0.bin
#   pytorch_model_tp_1_pp_1.bin
#   optimizer_tp_0_pp_0.bin
#   ...

Load Checkpoint

# Resume training
accelerator.load_state('checkpoint-1000')

# Automatically loads correct shard per rank

Convert to Standard PyTorch

# Merge Megatron checkpoint to single file
python merge_megatron_checkpoint.py \
  --checkpoint-dir checkpoint-1000 \
  --output pytorch_model.bin

Common Issues

Issue: OOM with Pipeline Parallelism

Solution: Increase micro-batches

megatron_plugin = MegatronLMPlugin(
    pp_degree=4,
    num_micro_batches=16,  # Increase from 4
)

Issue: Slow Training

Check 1: Pipeline bubbles (PP too high)

# Reduce PP, increase TP
tp_degree=4  # Increase
pp_degree=2  # Decrease

Check 2: Too few micro-batches

num_micro_batches=8  # Increase

Check 3: No NVLink between TP ranks

# Verify NVLink
nvidia-smi nvlink -s

# If no NVLink, avoid TP > 1
# Use PP or DP instead

Resources