Float8 Training in TorchTitan
Float8 training provides substantial speedups for models whose GEMMs are large enough that the FP8 Tensor Core speedup outweighs the dynamic quantization overhead.
Hardware Requirements
- NVIDIA H100 or newer GPUs (FP8 Tensor Cores)
- Blackwell GPUs for MXFP8 training
Installation
USE_CPP=0 pip install git+https://github.com/pytorch/ao.git
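A quick sanity check before training (not part of the upstream instructions): confirm torchao is installed and the GPU reports compute capability 9.0 or newer.
pip show torchao
python -c "import torch; print(torch.cuda.get_device_capability())"  # expect (9, 0) or newer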
Usage: Tensorwise Scaling
Standard Float8 with tensorwise dynamic scaling:
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh \
--model.converters="quantize.linear.float8" \
--quantize.linear.float8.enable_fsdp_float8_all_gather \
--quantize.linear.float8.precompute_float8_dynamic_scale_for_fsdp \
--compile.enable
Key Arguments
| Argument | Description |
|---|---|
| --model.converters="quantize.linear.float8" | Swap nn.Linear with Float8Linear |
| --quantize.linear.float8.enable_fsdp_float8_all_gather | Communicate in float8 to save bandwidth |
| --quantize.linear.float8.precompute_float8_dynamic_scale_for_fsdp | Single all-reduce for all AMAX/scales |
| --compile.enable | Required; fuses float8 scaling/casting kernels |
Usage: Rowwise Scaling
Higher accuracy than tensorwise scaling:
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh \
--model.converters="quantize.linear.float8" \
--quantize.linear.float8.recipe_name rowwise \
--compile.enable
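The recipe can also be set in the training TOML instead of on the command line; a sketch following the same flag-to-TOML mapping used in the configuration section below:
[model]
converters = ["quantize.linear.float8"]
[quantize.linear.float8]
recipe_name = "rowwise"
[compile]
enable = true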
Filtering Layers
Not all layers benefit from Float8. Filter small layers:
--quantize.linear.float8.filter_fqns="attention.wk,attention.wv,output"
Auto-filtering
Automatically skip layers too small to benefit:
--quantize.linear.float8.filter_fqns="auto_filter_small_kn"
Thresholds are derived from H100 microbenchmarks, keeping only the shapes where the float8 speedup exceeds the quantization overhead.
TOML Configuration
[model]
converters = ["quantize.linear.float8"]
[quantize.linear.float8]
enable_fsdp_float8_all_gather = true
precompute_float8_dynamic_scale_for_fsdp = true
filter_fqns = ["output", "auto_filter_small_kn"]
[compile]
enable = true
components = ["model", "loss"]
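With these options in the TOML, the launch command no longer needs the float8 flags. The config filename below is hypothetical; point CONFIG_FILE at wherever the modified TOML lives:
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b_float8.toml" ./run_train.sh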
How Float8 Works with Distributed Training
Single Device
Cast the input and weight to float8 inside the forward pass before calling torch._scaled_mm:
# Float8 matmul requires scales
torch._scaled_mm(input_fp8, weight_fp8, scale_a=scale_input, scale_b=scale_weight)
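Expanding the snippet above into a minimal end-to-end sketch of tensorwise scaling on a single device. It assumes an H100-class GPU and a recent PyTorch where torch._scaled_mm takes scale_a/scale_b dequantization scales and returns a single tensor; the helper and shapes are illustrative, not TorchTitan code:
import torch

def to_fp8_tensorwise(x, dtype=torch.float8_e4m3fn):
    # One scale per tensor: map max(abs(x)) onto the fp8 format's max value.
    finfo = torch.finfo(dtype)
    amax = x.abs().max().clamp(min=1e-12)
    scale = finfo.max / amax
    x_fp8 = (x * scale).clamp(min=finfo.min, max=finfo.max).to(dtype)
    # _scaled_mm expects the dequantization scale (reciprocal of the cast scale).
    return x_fp8, scale.float().reciprocal()

input_bf16 = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)
weight_bf16 = torch.randn(4096, 8192, device="cuda", dtype=torch.bfloat16)

input_fp8, scale_input = to_fp8_tensorwise(input_bf16)
weight_fp8, scale_weight = to_fp8_tensorwise(weight_bf16)

# Float8 matmul requires scales; the second operand must be column-major.
out = torch._scaled_mm(
    input_fp8,
    weight_fp8.t().contiguous().t(),  # column-major layout of the same matrix
    scale_a=scale_input,
    scale_b=scale_weight,
    out_dtype=torch.bfloat16,
)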
FSDP + Float8
- Cast sharded high-precision weights (1/N per rank) to float8
- Perform float8 all-gather (saves bandwidth vs bf16/fp32)
- Communicate max(abs) across ranks for scale computation
- At forward start, have unsharded float8 weights ready
Net benefit: Float8 all-gather + amax communication can beat bf16/fp32 all-gather, depending on world size and message size.
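A conceptual sketch of the scale exchange and float8 all-gather described above, runnable under torchrun on multiple GPUs; tensor names and shapes are illustrative, not TorchTitan internals:
import os
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
world_size = dist.get_world_size()

# Each rank holds its 1/N shard of a weight in high precision
local_shard = torch.randn(1024, 4096, device="cuda", dtype=torch.bfloat16)

# Local max(abs), then a MAX all-reduce so every rank derives the same scale
amax = local_shard.abs().max().float()
dist.all_reduce(amax, op=dist.ReduceOp.MAX)
scale = torch.finfo(torch.float8_e4m3fn).max / amax.clamp(min=1e-12)

# Cast the local shard with the global scale, then all-gather in float8
# (viewed as raw uint8 bytes for the collective) to save bandwidth vs bf16/fp32
local_fp8 = (local_shard * scale).to(torch.float8_e4m3fn)
chunks = [torch.empty_like(local_fp8.view(torch.uint8)) for _ in range(world_size)]
dist.all_gather(chunks, local_fp8.view(torch.uint8))
full_weight_fp8 = torch.cat(chunks, dim=0).view(torch.float8_e4m3fn)  # unsharded, ready for forward

dist.destroy_process_group()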
TP + Float8
- Input: Cast sharded input to float8, all-gather in float8
- Weights: Communicate max(abs) for sharded weights
- Matmul: Float8 input (unsharded) x float8 weight (sharded) with global scales
Scaling Strategies
| Strategy | Status | Description |
|---|---|---|
| Tensorwise dynamic | Stable | Single scale per tensor |
| Rowwise dynamic | Alpha | Scale per row, higher accuracy |
Performance Gains
From benchmarks on H100:
| Configuration | TPS/GPU | vs Baseline |
|---|---|---|
| FSDP only | 5,762 | - |
| FSDP + compile | 6,667 | +16% |
| FSDP + compile + Float8 | 8,532 | +48% |
Determining Float8 Benefit
Check torchao microbenchmarks for forward+backward pass speedups on "layer norm => linear => sigmoid" for different M,N,K sizes.
Rule of thumb: GEMMs with K,N > 4096 typically benefit from Float8.
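When calling torchao directly instead of going through TorchTitan's filter_fqns, the same rule of thumb can be expressed as a module filter; a sketch assuming torchao's convert_to_float8_training and its module_filter_fn callback:
import torch
from torchao.float8 import convert_to_float8_training

def float8_worthwhile(mod: torch.nn.Module, fqn: str) -> bool:
    # Convert only Linear layers whose K (in_features) and N (out_features)
    # both exceed 4096; smaller GEMMs don't amortize the quantization overhead.
    if isinstance(mod, torch.nn.Linear):
        return mod.in_features > 4096 and mod.out_features > 4096
    return False

model = torch.nn.Sequential(
    torch.nn.Linear(8192, 8192, bias=False),  # converted to Float8Linear
    torch.nn.Linear(8192, 1024, bias=False),  # left in high precision (N too small)
).cuda().to(torch.bfloat16)

convert_to_float8_training(model, module_filter_fn=float8_worthwhile)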
MXFP8 Training (Blackwell)
For NVIDIA Blackwell GPUs, TorchTitan supports MXFP8 (Microscaling FP8) for both dense and MoE models. See docs/mxfp8.md for details.