# Float8 Training in TorchTitan

Float8 training provides substantial speedups for models whose GEMMs are large enough that the FP8 Tensor Core speedup outweighs the dynamic quantization overhead.
## Hardware Requirements
- NVIDIA H100 or newer GPUs (FP8 Tensor Cores)
- Blackwell GPUs for MXFP8 training
## Installation

```bash
USE_CPP=0 pip install git+https://github.com/pytorch/ao.git
```
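
Before launching a run, a quick environment check like the sketch below (assuming CUDA is available and torchao was installed with the command above) confirms the torchao version and the GPU's compute capability; H100 reports (9, 0).

```python
import torch
import torchao

# Sanity check: torchao imports and the GPU has FP8 Tensor Cores
# (H100-class hardware reports compute capability (9, 0)).
print("torchao:", torchao.__version__)
print("compute capability:", torch.cuda.get_device_capability())
```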
## Usage: Tensorwise Scaling

Standard Float8 with tensorwise dynamic scaling:

```bash
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh \
  --model.converters="quantize.linear.float8" \
  --quantize.linear.float8.enable_fsdp_float8_all_gather \
  --quantize.linear.float8.precompute_float8_dynamic_scale_for_fsdp \
  --compile.enable
```
## Key Arguments

| Argument | Description |
|---|---|
| `--model.converters="quantize.linear.float8"` | Swap `nn.Linear` with `Float8Linear` |
| `--quantize.linear.float8.enable_fsdp_float8_all_gather` | Communicate in float8 to save bandwidth |
| `--quantize.linear.float8.precompute_float8_dynamic_scale_for_fsdp` | Single all-reduce for all AMAX/scales |
| `--compile.enable` | Required; fuses float8 scaling/casting kernels |
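
Under the hood, the converter swaps eligible `nn.Linear` modules for torchao's `Float8Linear`. The sketch below shows the equivalent standalone torchao call for reference (illustrative only; the torchao API surface may differ between releases):

```python
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.ffn = nn.Linear(4096, 8192)
        self.output = nn.Linear(8192, 4096)

    def forward(self, x):
        return self.output(torch.relu(self.ffn(x)))

model = Block().to(device="cuda", dtype=torch.bfloat16)

# Mirror filter_fqns: keep "output" in bf16, convert everything else.
convert_to_float8_training(model, module_filter_fn=lambda mod, fqn: fqn != "output")
print(model)  # ffn is swapped to Float8Linear, output stays nn.Linear
```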
## Usage: Rowwise Scaling

Higher accuracy than tensorwise scaling:

```bash
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh \
  --model.converters="quantize.linear.float8" \
  --quantize.linear.float8.recipe_name rowwise \
  --compile.enable
```
## Filtering Layers

Not all layers benefit from Float8. Filter small layers:

```bash
--quantize.linear.float8.filter_fqns="attention.wk,attention.wv,output"
```
### Auto-filtering

Automatically skip layers too small to benefit:

```bash
--quantize.linear.float8.filter_fqns="auto_filter_small_kn"
```
Thresholds are based on H100 microbenchmarks, chosen so that the FP8 speedup exceeds the quantization overhead.
## TOML Configuration

```toml
[model]
converters = ["quantize.linear.float8"]

[quantize.linear.float8]
enable_fsdp_float8_all_gather = true
precompute_float8_dynamic_scale_for_fsdp = true
filter_fqns = ["output", "auto_filter_small_kn"]

[compile]
enable = true
components = ["model", "loss"]
```
## How Float8 Works with Distributed Training

### Single Device

Cast the input and weight to float8 inside the forward pass before calling `torch._scaled_mm`:

```python
# Float8 matmul requires explicit scales for both operands
torch._scaled_mm(input_fp8, weight_fp8, scale_a=scale_input, scale_b=scale_weight)
```
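
To make the single-device path concrete, here is a self-contained sketch of tensorwise dynamic scaling (an illustration, not TorchTitan's implementation; it assumes an FP8-capable GPU and a recent PyTorch where `torch._scaled_mm` returns a single tensor):

```python
import torch

E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def to_float8(t: torch.Tensor):
    # Tensorwise dynamic scaling: one scale derived from the current max(abs).
    amax = t.abs().max().clamp(min=1e-12)
    scale = E4M3_MAX / amax
    t_fp8 = (t * scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    # _scaled_mm takes the dequantization scale, i.e. the reciprocal.
    return t_fp8, scale.reciprocal().float()

x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)    # activations
w = torch.randn(8192, 4096, device="cuda", dtype=torch.bfloat16)  # weight (N, K)

x_fp8, scale_x = to_float8(x)
w_fp8, scale_w = to_float8(w)

# The second operand must be column-major, hence the transpose of the
# row-major (N, K) weight; the result is produced in bf16.
y = torch._scaled_mm(x_fp8, w_fp8.t(), scale_a=scale_x, scale_b=scale_w,
                     out_dtype=torch.bfloat16)
print(y.shape)  # torch.Size([16, 8192])
```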
### FSDP + Float8

- Cast the sharded high-precision weights (1/N per rank) to float8
- Perform the all-gather in float8 (saves bandwidth vs bf16/fp32)
- Communicate `max(abs)` across ranks for scale computation (see the sketch after this list)
- At forward start, have the unsharded float8 weights ready

Net benefit: the float8 all-gather plus amax communication can beat a bf16/fp32 all-gather, depending on world size and message size.
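
The scale-synchronization step can be sketched as follows (conceptual only, not the torchao implementation): each rank reduces a local max(abs) over its weight shard, one max all-reduce yields the global amax, and every rank then casts its shard with the same scale before the float8 all-gather. The `precompute_float8_dynamic_scale_for_fsdp` flag batches this step so a single all-reduce covers the scales of all Float8 layers.

```python
import torch
import torch.distributed as dist

def global_float8_scale(local_shard: torch.Tensor) -> torch.Tensor:
    # Assumes torch.distributed is already initialized.
    # Local amax over this rank's 1/N shard of the weight.
    amax = local_shard.abs().max().float()
    # One max all-reduce gives every rank the same global amax...
    dist.all_reduce(amax, op=dist.ReduceOp.MAX)
    # ...so every rank casts its shard with an identical scale.
    return torch.finfo(torch.float8_e4m3fn).max / amax.clamp(min=1e-12)
```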
### TP + Float8

- Input: cast the sharded input to float8, all-gather in float8
- Weights: communicate `max(abs)` for the sharded weights
- Matmul: float8 input (unsharded) x float8 weight (sharded) with global scales
## Scaling Strategies

| Strategy | Status | Description |
|---|---|---|
| Tensorwise dynamic | Stable | Single scale per tensor |
| Rowwise dynamic | Alpha | Scale per row, higher accuracy |
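
The difference in granularity, in a tiny illustrative sketch: tensorwise scaling derives one scale from the whole tensor, so a single outlier compresses the dynamic range of every value, while rowwise scaling derives one scale per row.

```python
import torch

E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max

x = torch.randn(8, 4096)
x[0, 0] = 100.0  # a single outlier

# Tensorwise: one scale for the whole tensor; the outlier shrinks it for every row.
scale_tensorwise = E4M3_MAX / x.abs().max()

# Rowwise: one scale per row, so rows 1..7 keep their full dynamic range.
scale_rowwise = E4M3_MAX / x.abs().amax(dim=1, keepdim=True)  # shape (8, 1)
```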
## Performance Gains

From benchmarks on H100:

| Configuration | TPS/GPU | vs Baseline |
|---|---|---|
| FSDP only | 5,762 | - |
| FSDP + compile | 6,667 | +16% |
| FSDP + compile + Float8 | 8,532 | +48% |
## Determining Float8 Benefit

Check the torchao microbenchmarks for forward+backward speedups of a "layer norm => linear => sigmoid" pattern at different M, N, K sizes.

Rule of thumb: GEMMs with K, N > 4096 typically benefit from Float8.
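
To apply the rule of thumb to a specific model, a small helper like the hypothetical one below lists each `nn.Linear`'s GEMM dimensions so you can decide which FQNs to pass to `filter_fqns`:

```python
import torch.nn as nn

def report_linear_shapes(model: nn.Module, threshold: int = 4096) -> None:
    # K = in_features, N = out_features for each Linear's forward GEMM.
    for fqn, mod in model.named_modules():
        if isinstance(mod, nn.Linear):
            k, n = mod.in_features, mod.out_features
            verdict = "float8 candidate" if k > threshold and n > threshold else "consider filtering"
            print(f"{fqn}: K={k}, N={n} -> {verdict}")
```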
## MXFP8 Training (Blackwell)

For NVIDIA Blackwell GPUs, TorchTitan supports MXFP8 (Microscaling FP8) for both dense and MoE models. See docs/mxfp8.md for details.