mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
181 lines
4.1 KiB
Markdown
# Checkpointing in TorchTitan

TorchTitan uses PyTorch Distributed Checkpoint (DCP) for fault-tolerant, interoperable checkpointing.

## Basic Configuration

```toml
[checkpoint]
enable = true
folder = "checkpoint"
interval = 500
```

## Save Model Only (Smaller Checkpoints)

Exclude optimizer state and training metadata:

```toml
[checkpoint]
enable = true
last_save_model_only = true
export_dtype = "bfloat16" # Optional: export in lower precision
```

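To get a feel for how much a model-only, lower-precision export can save, here is a rough back-of-the-envelope sketch. It assumes float32 training weights and an Adam-style optimizer that keeps two float32 buffers per parameter; neither assumption comes from TorchTitan itself, and `estimate_checkpoint_gib` is a hypothetical helper, not part of any library.

```python
# Bytes per element for a few common dtypes.
BYTES_PER_DTYPE = {"float32": 4, "bfloat16": 2, "float16": 2}


def estimate_checkpoint_gib(
    num_params: int,
    dtype: str = "float32",
    include_optimizer: bool = True,
    optimizer_states_per_param: int = 2,
) -> float:
    """Rough checkpoint size in GiB (ignores metadata and sharding overhead)."""
    weight_bytes = num_params * BYTES_PER_DTYPE[dtype]
    # Assumption: each optimizer state buffer is stored in float32 (4 bytes).
    optimizer_bytes = (
        num_params * 4 * optimizer_states_per_param if include_optimizer else 0
    )
    return (weight_bytes + optimizer_bytes) / 2**30


if __name__ == "__main__":
    n = 8_000_000_000  # roughly an 8B-parameter model
    full = estimate_checkpoint_gib(n, "float32", include_optimizer=True)
    model_only = estimate_checkpoint_gib(n, "bfloat16", include_optimizer=False)
    print(f"full checkpoint:     {full:.1f} GiB")
    print(f"model-only bfloat16: {model_only:.1f} GiB")
```

Under these assumptions the model-only bfloat16 export is about one sixth the size of a full checkpoint; real numbers depend on the optimizer and the dtypes actually in use.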
## Excluding Keys from Loading

Skip selected checkpoint components when loading, for example after changing the data loader or learning-rate scheduler settings:

```toml
[checkpoint]
enable = true
exclude_from_loading = ["data_loader", "lr_scheduler"]
```

CLI equivalent:

```bash
--checkpoint.exclude_from_loading data_loader,lr_scheduler
```

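Conceptually, excluding keys means the listed top-level components are simply not restored, so they keep whatever values the new run initialized them with. The sketch below illustrates that idea on a plain dictionary. It is not TorchTitan's implementation; `select_states_to_load` is a hypothetical helper, and the `model` and `optimizer` key names are assumptions used only for illustration.

```python
def select_states_to_load(states: dict, exclude: list[str]) -> dict:
    """Return the subset of top-level states that should be restored."""
    unknown = set(exclude) - set(states)
    if unknown:
        # Fail loudly on typos instead of silently loading everything.
        raise ValueError(f"unknown keys in exclude list: {sorted(unknown)}")
    return {name: state for name, state in states.items() if name not in exclude}


if __name__ == "__main__":
    # Stand-ins for real state objects; only the key names matter here.
    states = {
        "model": "model-state",
        "optimizer": "optimizer-state",
        "lr_scheduler": "scheduler-state",
        "data_loader": "loader-state",
    }
    to_load = select_states_to_load(states, ["data_loader", "lr_scheduler"])
    print(sorted(to_load))
```

Rejecting unknown names is a design choice of this sketch: a misspelled key would otherwise be ignored and the component would be loaded anyway.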
## Creating Seed Checkpoints

A seed checkpoint is required for Pipeline Parallelism to ensure consistent initialization:

```bash
NGPU=1 CONFIG_FILE=<path_to_config> ./run_train.sh \
  --checkpoint.enable \
  --checkpoint.create_seed_checkpoint \
  --parallelism.data_parallel_replicate_degree 1 \
  --parallelism.data_parallel_shard_degree 1 \
  --parallelism.tensor_parallel_degree 1 \
  --parallelism.pipeline_parallel_degree 1 \
  --parallelism.context_parallel_degree 1 \
  --parallelism.expert_parallel_degree 1
```

This initializes the model on a single CPU, so the initialization is reproducible regardless of GPU count.

## Async Checkpointing

Reduce checkpoint overhead with async writes:

```toml
[checkpoint]
enable = true
async_mode = "async" # Options: "disabled", "async", "async_with_pinned_mem"
```

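The general idea behind overlapping checkpoint writes with training can be shown with a small standard-library sketch: take a snapshot of the state (the only blocking part), then hand the slow write to a background thread. This is a conceptual illustration only and not how TorchTitan or DCP implement async checkpointing; `AsyncSaver` and the JSON file layout are hypothetical.

```python
import copy
import json
import tempfile
import threading
from pathlib import Path


class AsyncSaver:
    """Writes snapshots in a background thread, one save in flight at a time."""

    def __init__(self) -> None:
        self._thread = None

    def save(self, state: dict, path: Path) -> None:
        self.wait()  # finish the previous write before starting a new one
        snapshot = copy.deepcopy(state)  # blocking: freeze the state to be saved
        self._thread = threading.Thread(target=self._write, args=(snapshot, path))
        self._thread.start()  # the slow write now overlaps with the caller's work

    @staticmethod
    def _write(snapshot: dict, path: Path) -> None:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(snapshot))

    def wait(self) -> None:
        if self._thread is not None:
            self._thread.join()
            self._thread = None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        saver = AsyncSaver()
        state = {"step": 0, "weights": [0.0, 0.0]}
        for step in range(1, 4):
            state["step"] = step
            state["weights"] = [w + 0.1 for w in state["weights"]]  # "training"
            saver.save(state, Path(tmp) / f"step-{step}" / "state.json")
        saver.wait()  # make sure the last checkpoint is on disk before exiting
        print(sorted(p.name for p in Path(tmp).iterdir()))
```

Because the snapshot is taken before the thread starts, later updates to `state` cannot leak into a checkpoint that is still being written.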
## HuggingFace Conversion

### During Training

Save directly in HuggingFace format:

```toml
[checkpoint]
last_save_in_hf = true
last_save_model_only = true
```

Load from HuggingFace:

```toml
[checkpoint]
initial_load_in_hf = true

[model]
hf_assets_path = "./path/to/hf/checkpoint"
```

### Offline Conversion

Convert without running training:

```bash
# HuggingFace -> TorchTitan
python ./scripts/checkpoint_conversion/convert_from_hf.py \
  <input_dir> <output_dir> \
  --model_name llama3 \
  --model_flavor 8B

# TorchTitan -> HuggingFace
python ./scripts/checkpoint_conversion/convert_to_hf.py \
  <input_dir> <output_dir> \
  --hf_assets_path ./assets/hf/Llama3.1-8B \
  --model_name llama3 \
  --model_flavor 8B
```

### Example

```bash
# Script path matches the one used under "Offline Conversion"
python ./scripts/checkpoint_conversion/convert_from_hf.py \
  ~/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/8cde5ca8380496c9a6cc7ef3a8b46a0372a1d920/ \
  ./initial_load_path/ \
  --model_name llama3 \
  --model_flavor 8B
```

## Converting to Single .pt File

Convert a sharded DCP checkpoint to a single PyTorch file:

```bash
python -m torch.distributed.checkpoint.format_utils \
  dcp_to_torch \
  torchtitan/outputs/checkpoint/step-1000 \
  checkpoint.pt
```

## Checkpoint Structure

DCP saves sharded checkpoints that can be resharded for different parallelism configurations:

```
checkpoint/
├── step-500/
│   ├── .metadata
│   ├── __0_0.distcp
│   ├── __0_1.distcp
│   └── ...
└── step-1000/
    └── ...
```

## Resume Training

Training auto-resumes from the latest checkpoint in the configured folder. To resume from a specific step:

```toml
[checkpoint]
load_step = 500 # Resume from step 500
```

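To check which step an auto-resume would likely pick up, the `step-N` folder layout can be scanned with a few lines of Python. This is a minimal sketch, not TorchTitan's own lookup logic; `find_latest_step` is a hypothetical helper, and treating a step folder without a `.metadata` file as incomplete is an assumption.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

STEP_DIR = re.compile(r"^step-(\d+)$")


def find_latest_step(folder, require_metadata: bool = True) -> Optional[int]:
    """Return the highest N among `step-N` subfolders, or None if none exist."""
    root = Path(folder)
    if not root.is_dir():
        return None
    steps = []
    for entry in root.iterdir():
        match = STEP_DIR.match(entry.name)
        if match is None or not entry.is_dir():
            continue
        # Assumption: a step folder without `.metadata` is an incomplete save.
        if require_metadata and not (entry / ".metadata").exists():
            continue
        steps.append(int(match.group(1)))
    return max(steps, default=None)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for step in (500, 1000):
            step_dir = Path(tmp) / f"step-{step}"
            step_dir.mkdir()
            (step_dir / ".metadata").touch()
        print(find_latest_step(tmp))
```

Step numbers are compared as integers rather than strings, so `step-1000` correctly sorts after `step-500`.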
## Interoperability with TorchTune

Checkpoints saved with `last_save_model_only = true` can be loaded directly into [torchtune](https://github.com/pytorch/torchtune) for fine-tuning.

## Full Configuration Example

```toml
[checkpoint]
enable = true
folder = "checkpoint"
interval = 500
load_step = -1 # -1 = latest, or specify step number
last_save_model_only = true
export_dtype = "bfloat16"
async_mode = "async"
exclude_from_loading = []
last_save_in_hf = false
initial_load_in_hf = false
create_seed_checkpoint = false
```

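The CLI examples in this document (`--checkpoint.enable`, `--checkpoint.exclude_from_loading data_loader,lr_scheduler`, `--parallelism.tensor_parallel_degree 1`) suggest a simple mapping from TOML keys to command-line overrides. The sketch below renders a config section in that style. `to_cli_overrides` is a hypothetical helper, and omitting `false` booleans and empty lists is an assumption of this sketch rather than documented TorchTitan behavior.

```python
def to_cli_overrides(section: str, options: dict) -> list[str]:
    """Render one config section as `--section.key value` style arguments."""
    args = []
    for key, value in options.items():
        flag = f"--{section}.{key}"
        if isinstance(value, bool):
            # True becomes a bare flag; False is omitted (assumption).
            if value:
                args.append(flag)
        elif isinstance(value, (list, tuple)):
            # Lists are comma-joined; empty lists are omitted (assumption).
            if value:
                args += [flag, ",".join(str(item) for item in value)]
        else:
            args += [flag, str(value)]
    return args


if __name__ == "__main__":
    checkpoint = {
        "enable": True,
        "interval": 500,
        "exclude_from_loading": ["data_loader", "lr_scheduler"],
        "last_save_in_hf": False,
    }
    print(" ".join(to_cli_overrides("checkpoint", checkpoint)))
```

The boolean check comes before the generic branch because `bool` is a subclass of `int` in Python and would otherwise be rendered as `True` or `False`.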
## Best Practices

1. **Large models**: Use `async_mode = "async"` to overlap checkpoint saves with training
2. **Fine-tuning export**: Enable `last_save_model_only` and set `export_dtype = "bfloat16"` for smaller files
3. **Pipeline parallelism**: Always create a seed checkpoint first
4. **Debugging**: Save checkpoints frequently during development; lower the frequency for production runs
5. **HF interop**: Use the conversion scripts for offline conversion, and direct save/load for training workflows