Megatron Integration with Accelerate
Overview
Hugging Face Accelerate integrates with NVIDIA Megatron-LM for training very large models, combining tensor, pipeline, and data parallelism behind the familiar Accelerator API.
Megatron capabilities:
- Tensor Parallelism (TP): Split layers across GPUs
- Pipeline Parallelism (PP): Split model depth across GPUs
- Data Parallelism (DP): Replicate model across GPU groups
- Sequence Parallelism: Split sequences for long contexts
Setup
Install Megatron-LM
# Clone Megatron-LM repository
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
pip install -e .
# Install Apex (NVIDIA optimizations)
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
--config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
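To confirm both packages built correctly, a quick import check (a minimal sketch; "megatron" and "apex" are the top-level module names these repos install):

# Sanity check: verify Megatron-LM and Apex are importable after install
import importlib

for pkg in ("megatron", "apex"):
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: OK")
    except ImportError as e:
        print(f"{pkg}: FAILED ({e})")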
Accelerate Configuration
accelerate config
Questions:
In which compute environment are you running?
> This machine
Which type of machine are you using?
> Multi-GPU
How many different machines will you use?
> 1
Do you want to use DeepSpeed/FSDP?
> No
Do you want to use Megatron-LM?
> Yes
What is the Tensor Parallelism degree? [1-8]
> 2
Do you want to enable Sequence Parallelism?
> No
What is the Pipeline Parallelism degree? [1-8]
> 2
What is the Data Parallelism degree? [1-8]
> 2
Where to perform activation checkpointing? ['SELECTIVE', 'FULL', 'NONE']
> SELECTIVE
Where to perform activation partitioning? ['SEQUENTIAL', 'UNIFORM']
> SEQUENTIAL
Generated config (~/.cache/huggingface/accelerate/default_config.yaml):
compute_environment: LOCAL_MACHINE
distributed_type: MEGATRON_LM
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
megatron_lm_config:
  megatron_lm_gradient_clipping: 1.0
  megatron_lm_learning_rate_decay_iters: 320000
  megatron_lm_num_micro_batches: 1
  megatron_lm_pp_degree: 2
  megatron_lm_recompute_activations: true
  megatron_lm_sequence_parallelism: false
  megatron_lm_tp_degree: 2
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Parallelism Strategies
Tensor Parallelism (TP)
Splits each transformer layer across GPUs:
# Layer split across 2 GPUs
# GPU 0: First half of attention heads
# GPU 1: Second half of attention heads
# Each GPU computes partial outputs
# All-reduce combines results
TP degree recommendations:
- TP=1: No tensor parallelism (single GPU per layer)
- TP=2: 2 GPUs per layer (good for 7-13B models)
- TP=4: 4 GPUs per layer (good for 20-40B models)
- TP=8: 8 GPUs per layer (good for 70B+ models)
Benefits:
- Reduces memory per GPU
- All-reduce communication (fast)
Drawbacks:
- Requires fast inter-GPU bandwidth (NVLink)
- Communication overhead per layer
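To make the split concrete, here is a minimal single-process PyTorch sketch of a row-parallel matmul, where the sum over partial products stands in for the all-reduce (illustrative only, not Megatron's implementation):

import torch

x = torch.randn(4, 512)      # (batch, hidden)
w = torch.randn(512, 512)    # full weight

# Row-parallel: split the input dimension across 2 "ranks"
x0, x1 = x.chunk(2, dim=1)   # each rank sees half the input features
w0, w1 = w.chunk(2, dim=0)   # each rank holds the matching weight rows

y = x0 @ w0 + x1 @ w1        # the sum is what the all-reduce performs
assert torch.allclose(y, x @ w, atol=1e-4)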
Pipeline Parallelism (PP)
Splits model depth across GPUs:
# 12-layer model, PP=4
# GPU 0: Layers 0-2
# GPU 1: Layers 3-5
# GPU 2: Layers 6-8
# GPU 3: Layers 9-11
PP degree recommendations:
- PP=1: No pipeline parallelism
- PP=2: 2 pipeline stages (good for 20-40B models)
- PP=4: 4 pipeline stages (good for 70B+ models)
- PP=8: 8 pipeline stages (good for 175B+ models)
Benefits:
- Near-linear memory reduction (PP=4 → each GPU holds ~1/4 of the layers)
- Works across nodes (slower interconnect OK)
Drawbacks:
- Pipeline bubbles (idle time)
- Requires micro-batching
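The idle-time cost can be estimated with the standard bubble formula for a GPipe-style schedule, bubble = (PP − 1) / (num_micro_batches + PP − 1); a quick sketch:

def bubble_fraction(pp: int, num_micro_batches: int) -> float:
    """Fraction of time pipeline stages sit idle (GPipe-style schedule)."""
    return (pp - 1) / (num_micro_batches + pp - 1)

print(bubble_fraction(4, 4))   # ~0.43 — stages idle almost half the time
print(bubble_fraction(4, 16))  # ~0.16 — more micro-batches shrink the bubble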
Data Parallelism (DP)
Replicates model across GPU groups:
# 8 GPUs, TP=2, PP=2, DP=2
# Group 0 (GPUs 0-3): Full model replica
# Group 1 (GPUs 4-7): Full model replica
DP degree:
DP = total_gpus / (TP × PP)
- Example: 8 GPUs, TP=2, PP=2 → DP=2 (see the sketch after this list)
Benefits:
- Increases throughput
- Scales batch size
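A small sketch of the arithmetic: the three degrees must multiply to the total GPU count, and each global rank maps to a (DP, PP, TP) coordinate. The rank ordering below is one common layout; the real assignment depends on the framework:

def parallel_grid(total_gpus: int, tp: int, pp: int):
    assert total_gpus % (tp * pp) == 0, "TP × PP must divide total GPUs"
    dp = total_gpus // (tp * pp)
    # Map each global rank to (dp, pp, tp) coordinates, TP fastest-varying
    coords = {rank: (rank // (tp * pp), (rank // tp) % pp, rank % tp)
              for rank in range(total_gpus)}
    return dp, coords

dp, coords = parallel_grid(8, tp=2, pp=2)
print(dp)          # 2
print(coords[5])   # (1, 0, 1)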
Sequence Parallelism
Splits long sequences across GPUs (extends TP):
# 8K sequence, TP=2, Sequence Parallel=True
# GPU 0: Tokens 0-4095
# GPU 1: Tokens 4096-8191
Benefits:
- Enables very long sequences (100K+ tokens)
- Reduces activation memory
Requirements:
- Must use with TP > 1
- RoPE/ALiBi position encodings work best
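For intuition, the token split above is just a chunk along the sequence dimension (single-process, illustrative):

import torch

hidden = torch.randn(1, 8192, 1024)      # (batch, seq_len, hidden)
shard0, shard1 = hidden.chunk(2, dim=1)  # TP=2: tokens 0-4095 / 4096-8191
print(shard0.shape)                      # torch.Size([1, 4096, 1024])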
Accelerate Code Example
Basic Setup
from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin
# Configure Megatron
megatron_plugin = MegatronLMPlugin(
    tp_degree=2,                        # Tensor parallelism degree
    pp_degree=2,                        # Pipeline parallelism degree
    num_micro_batches=4,                # Micro-batches for pipeline
    gradient_clipping=1.0,              # Gradient clipping value
    sequence_parallelism=False,         # Enable sequence parallelism
    recompute_activations=True,         # Activation checkpointing
    use_distributed_optimizer=True,     # Distributed optimizer
    custom_prepare_model_function=None, # Custom model prep
)

# Initialize accelerator
accelerator = Accelerator(
    mixed_precision='bf16',
    megatron_lm_plugin=megatron_plugin
)

# Prepare model and optimizer (model, optimizer, and train_dataloader
# are defined as in the full script below)
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

# Training loop (same as DDP!)
for batch in train_dataloader:
    optimizer.zero_grad()
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
Full Training Script
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin
from transformers import GPT2Config, GPT2LMHeadModel

def main():
    # Megatron configuration
    megatron_plugin = MegatronLMPlugin(
        tp_degree=2,
        pp_degree=2,
        num_micro_batches=4,
        gradient_clipping=1.0,
    )

    accelerator = Accelerator(
        mixed_precision='bf16',
        gradient_accumulation_steps=8,
        megatron_lm_plugin=megatron_plugin
    )

    # Model
    config = GPT2Config(
        n_layer=24,
        n_head=16,
        n_embd=1024,
    )
    model = GPT2LMHeadModel(config)

    # Optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)

    # Data: train_dataset is your tokenized dataset (not defined here);
    # batches must be dicts the model accepts, e.g. input_ids and labels
    train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)

    # Prepare
    model, optimizer, train_loader = accelerator.prepare(
        model, optimizer, train_loader
    )

    # Training loop
    num_epochs = 3
    for epoch in range(num_epochs):
        for batch in train_loader:
            with accelerator.accumulate(model):
                outputs = model(**batch)
                loss = outputs.loss
                accelerator.backward(loss)
                optimizer.step()
                optimizer.zero_grad()

        # Save checkpoint
        accelerator.wait_for_everyone()
        accelerator.save_state(f'checkpoint-epoch-{epoch}')

if __name__ == '__main__':
    main()
Launch Command
# 8 GPUs, TP=2, PP=2, DP=2
accelerate launch --multi_gpu --num_processes 8 train.py
# Multi-node (2 nodes, 8 GPUs each)
# Node 0
accelerate launch --multi_gpu --num_processes 16 \
--num_machines 2 --machine_rank 0 \
--main_process_ip $MASTER_ADDR \
--main_process_port 29500 \
train.py
# Node 1
accelerate launch --multi_gpu --num_processes 16 \
--num_machines 2 --machine_rank 1 \
--main_process_ip $MASTER_ADDR \
--main_process_port 29500 \
train.py
Activation Checkpointing
Reduces memory by recomputing activations:
megatron_plugin = MegatronLMPlugin(
    recompute_activations=True,                # Enable checkpointing
    checkpoint_num_layers=1,                   # Checkpoint every N layers
    distribute_checkpointed_activations=True,  # Distribute across TP
    partition_activations=True,                # Partition in PP
    check_for_nan_in_loss_and_grad=True,       # Stability check
)
Strategies:
- SELECTIVE: Checkpoint transformer blocks only
- FULL: Checkpoint all layers
- NONE: No checkpointing
Memory savings: 30-50% with 10-15% slowdown
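The underlying mechanism is the same as PyTorch's generic activation checkpointing; a minimal standalone sketch (not Megatron's internal implementation):

import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)
x = torch.randn(8, 1024, requires_grad=True)

# Activations inside `block` are discarded after the forward pass and
# recomputed during backward, trading compute for memory
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()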
Distributed Optimizer
Shards optimizer state across DP ranks:
megatron_plugin = MegatronLMPlugin(
    use_distributed_optimizer=True,  # Enable sharded optimizer
)
Benefits:
- Reduces optimizer memory by DP degree
- Example: DP=4 → 4× less optimizer memory per GPU
Compatible with:
- AdamW, Adam, SGD
- Mixed precision training
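Back-of-the-envelope math for the savings, assuming AdamW with FP32 master weights plus two FP32 moment tensors (~12 bytes of optimizer state per parameter):

def optimizer_state_gb(num_params: float, dp_degree: int,
                       bytes_per_param: int = 12) -> float:
    """Per-GPU optimizer state with sharding across DP ranks (rough)."""
    return num_params * bytes_per_param / dp_degree / 1e9

print(optimizer_state_gb(13e9, dp_degree=1))  # ~156 GB — won't fit on one GPU
print(optimizer_state_gb(13e9, dp_degree=4))  # ~39 GB per GPU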
Performance Tuning
Micro-Batch Size
# Pipeline parallelism requires micro-batching
megatron_plugin = MegatronLMPlugin(
    pp_degree=4,
    num_micro_batches=16,  # 16 micro-batches per pipeline
)
# Effective batch = num_micro_batches × micro_batch_size × DP
# Example: 16 × 2 × 4 = 128
Recommendations:
- More micro-batches → less pipeline bubble
- Typical: 4-16 micro-batches
Sequence Length
# For long sequences, enable sequence parallelism
megatron_plugin = MegatronLMPlugin(
    tp_degree=4,
    sequence_parallelism=True,  # Requires TP > 1
)
# Activation memory in the parallel regions drops by roughly TP,
# allowing roughly TP × longer sequences
# Example: TP=4, 8K baseline → ~32K with sequence parallelism
GPU Topology
NVLink required for TP:
# Check NVLink topology
nvidia-smi topo -m
# Good topology (NVLink between all GPUs)
# GPU0 - GPU1: NV12 (fast)
# GPU0 - GPU2: NV12 (fast)
# Bad topology (PCIe only)
# GPU0 - GPU4: PHB (slow, avoid TP across these)
Recommendations:
- TP: Within same node (NVLink)
- PP: Across nodes (slower interconnect OK)
- DP: Any topology
Model Size Guidelines
| Model Size | GPUs | TP | PP | DP | Micro-Batches |
|---|---|---|---|---|---|
| 7B | 8 | 1 | 1 | 8 | 1 |
| 13B | 8 | 2 | 1 | 4 | 1 |
| 20B | 16 | 4 | 1 | 4 | 1 |
| 40B | 32 | 4 | 2 | 4 | 4 |
| 70B | 64 | 8 | 2 | 4 | 8 |
| 175B | 128 | 8 | 4 | 4 | 16 |
Assumptions: BF16, 2K sequence length, A100 80GB
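A rough way to sanity-check the table: BF16 weights take 2 bytes per parameter, split across the TP × PP GPUs of each replica (activations, gradients, and optimizer state add substantially more on top):

def weight_gb_per_gpu(num_params: float, tp: int, pp: int) -> float:
    """BF16 parameter memory per GPU (weights only, rough estimate)."""
    return num_params * 2 / (tp * pp) / 1e9

print(weight_gb_per_gpu(70e9, tp=8, pp=2))   # ~8.8 GB of weights per GPU
print(weight_gb_per_gpu(175e9, tp=8, pp=4))  # ~10.9 GB of weights per GPU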
Checkpointing
Save Checkpoint
# Save full model state
accelerator.save_state('checkpoint-1000')
# Megatron saves separate files per rank
# checkpoint-1000/
# pytorch_model_tp_0_pp_0.bin
# pytorch_model_tp_0_pp_1.bin
# pytorch_model_tp_1_pp_0.bin
# pytorch_model_tp_1_pp_1.bin
# optimizer_tp_0_pp_0.bin
# ...
Load Checkpoint
# Resume training
accelerator.load_state('checkpoint-1000')
# Automatically loads correct shard per rank
Convert to Standard PyTorch
# Merge Megatron checkpoint to single file
python merge_megatron_checkpoint.py \
--checkpoint-dir checkpoint-1000 \
--output pytorch_model.bin
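Before merging, it can help to inspect the per-rank shards; a hedged sketch, noting that the tensor names and the dimension each rank splits on depend on the Megatron version:

import glob
import torch

# Walk the shard files named as in the checkpoint layout shown above
for path in sorted(glob.glob('checkpoint-1000/pytorch_model_tp_*_pp_*.bin')):
    shard = torch.load(path, map_location='cpu')
    print(path, len(shard), 'entries')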
Common Issues
Issue: OOM with Pipeline Parallelism
Solution: Increase micro-batches
megatron_plugin = MegatronLMPlugin(
    pp_degree=4,
    num_micro_batches=16,  # Increase from 4
)
Issue: Slow Training
Check 1: Pipeline bubbles (PP too high)
# Reduce PP, increase TP
tp_degree=4 # Increase
pp_degree=2 # Decrease
Check 2: Micro-batch size too small
num_micro_batches=8 # Increase
Issue: NVLink Not Detected
# Verify NVLink
nvidia-smi nvlink -s
# If no NVLink, avoid TP > 1
# Use PP or DP instead
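From Python, you can also check whether direct GPU peer access is available before choosing TP > 1 (a minimal sketch):

import torch

# P2P (NVLink or PCIe peer access) between GPU 0 and each other GPU;
# False means traffic is staged through host memory — avoid TP across those
for dev in range(1, torch.cuda.device_count()):
    ok = torch.cuda.can_device_access_peer(0, dev)
    print(f"GPU0 <-> GPU{dev}: peer access = {ok}")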
Resources
- Megatron-LM: https://github.com/NVIDIA/Megatron-LM
- Accelerate Megatron docs: https://huggingface.co/docs/accelerate/usage_guides/megatron_lm
- Paper: "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism"
- NVIDIA Apex: https://github.com/NVIDIA/apex