# Checkpointing in TorchTitan

TorchTitan uses PyTorch Distributed Checkpoint (DCP) for fault-tolerant, interoperable checkpointing.
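
For orientation, here is a minimal sketch of the underlying DCP API that TorchTitan drives internally. This is illustrative PyTorch (>= 2.2), not TorchTitan code; the tiny model is a stand-in, and depending on the PyTorch version DCP may expect an initialized process group rather than the single-process use shown here.

```python
import torch
import torch.distributed.checkpoint as dcp

# Stand-in model; in TorchTitan this is the (sharded) training model.
model = torch.nn.Linear(8, 8)

# Each rank writes its own shards plus shared metadata, producing a
# directory layout that can be resharded on load.
dcp.save({"model": model.state_dict()}, checkpoint_id="checkpoint/step-0")

# Loading restores in place into the provided state_dict.
state = {"model": model.state_dict()}
dcp.load(state, checkpoint_id="checkpoint/step-0")
```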

## Basic Configuration

```toml
[checkpoint]
enable = true
folder = "checkpoint"
interval = 500
```
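
With this configuration, a checkpoint is written every 500 steps. Any of these keys can also be overridden at launch using the `--checkpoint.<key>` flag form shown later in this guide, e.g.:

```bash
./run_train.sh --checkpoint.enable --checkpoint.folder checkpoint --checkpoint.interval 500
```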

## Save Model Only (Smaller Checkpoints)

Exclude optimizer state and training metadata:

```toml
[checkpoint]
enable = true
last_save_model_only = true
export_dtype = "bfloat16"  # Optional: export in lower precision
```

## Excluding Keys from Loading

Load a checkpoint partially, skipping state that should be re-initialized after a settings change:

```toml
[checkpoint]
enable = true
exclude_from_loading = ["data_loader", "lr_scheduler"]
```

CLI equivalent:

```bash
--checkpoint.exclude_from_loading data_loader,lr_scheduler
```

## Creating Seed Checkpoints

Required for pipeline parallelism to ensure consistent initialization across stages:

```bash
NGPU=1 CONFIG_FILE=<path_to_config> ./run_train.sh \
  --checkpoint.enable \
  --checkpoint.create_seed_checkpoint \
  --parallelism.data_parallel_replicate_degree 1 \
  --parallelism.data_parallel_shard_degree 1 \
  --parallelism.tensor_parallel_degree 1 \
  --parallelism.pipeline_parallel_degree 1 \
  --parallelism.context_parallel_degree 1 \
  --parallelism.expert_parallel_degree 1
```

This initializes the model on a single CPU, so the same initial weights can be loaded reproducibly at any GPU count.
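
Once the seed checkpoint exists in the checkpoint folder, a subsequent multi-GPU run with checkpointing enabled picks it up through the auto-resume behavior described under Resume Training below. A sketch, with illustrative parallelism degrees:

```bash
NGPU=8 CONFIG_FILE=<path_to_config> ./run_train.sh \
  --checkpoint.enable \
  --parallelism.pipeline_parallel_degree 2 \
  --parallelism.data_parallel_shard_degree 4
```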

## Async Checkpointing

Reduce checkpoint overhead with asynchronous writes:

```toml
[checkpoint]
enable = true
async_mode = "async"  # Options: "disabled", "async", "async_with_pinned_mem"
```

With `"async"`, the checkpoint write overlaps with continued training; `"async_with_pinned_mem"` additionally stages tensors in pinned host memory to speed up the GPU-to-CPU copy.

## HuggingFace Conversion

### During Training

Save directly in HuggingFace format:

```toml
[checkpoint]
last_save_in_hf = true
last_save_model_only = true
```
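
Because the final save is model-only and in HF format, the export can typically be opened with the `transformers` library. A hedged sketch, assuming the export directory contains the usual HF config files; the step path is hypothetical and depends on your `folder` and `interval` settings:

```python
from transformers import AutoModelForCausalLM

# Hypothetical path: the last-step directory written by TorchTitan.
model = AutoModelForCausalLM.from_pretrained("./outputs/checkpoint/step-1000")
```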

Load from HuggingFace:

```toml
[checkpoint]
initial_load_in_hf = true

[model]
hf_assets_path = "./path/to/hf/checkpoint"
```

### Offline Conversion

Convert without running training:

```bash
# HuggingFace -> TorchTitan
python ./scripts/checkpoint_conversion/convert_from_hf.py \
  <input_dir> <output_dir> \
  --model_name llama3 \
  --model_flavor 8B

# TorchTitan -> HuggingFace
python ./scripts/checkpoint_conversion/convert_to_hf.py \
  <input_dir> <output_dir> \
  --hf_assets_path ./assets/hf/Llama3.1-8B \
  --model_name llama3 \
  --model_flavor 8B
```

#### Example

```bash
python ./scripts/checkpoint_conversion/convert_from_hf.py \
  ~/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/8cde5ca8380496c9a6cc7ef3a8b46a0372a1d920/ \
  ./initial_load_path/ \
  --model_name llama3 \
  --model_flavor 8B
```

## Converting to a Single .pt File

Convert a DCP sharded checkpoint into a single PyTorch file:

```bash
python -m torch.distributed.checkpoint.format_utils \
  dcp_to_torch \
  torchtitan/outputs/checkpoint/step-1000 \
  checkpoint.pt
```
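
The result is an ordinary `torch.load`-able file. A quick inspection sketch; the top-level structure mirrors whatever was saved, so entries may be nested under keys such as `model`:

```python
import torch

# Model-only checkpoints hold just weights; full checkpoints also
# carry optimizer state and training metadata.
state = torch.load("checkpoint.pt", map_location="cpu")
print(list(state)[:10])  # top-level keys of the converted checkpoint
```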

## Checkpoint Structure

DCP saves sharded checkpoints that can be resharded for different parallelism configurations:

```text
checkpoint/
├── step-500/
│   ├── .metadata
│   ├── __0_0.distcp
│   ├── __0_1.distcp
│   └── ...
└── step-1000/
    └── ...
```

## Resume Training

Training auto-resumes from the latest checkpoint in the configured folder. To resume from a specific step:

```toml
[checkpoint]
load_step = 500  # Resume from step 500
```
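
By the same TOML-to-CLI mapping used for `exclude_from_loading` above, this can presumably also be passed as a flag:

```bash
--checkpoint.load_step 500
```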

## Interoperability with torchtune

Checkpoints saved with `last_save_model_only = true` can be loaded directly into torchtune for fine-tuning.

## Full Configuration Example

```toml
[checkpoint]
enable = true
folder = "checkpoint"
interval = 500
load_step = -1  # -1 = latest, or specify a step number
last_save_model_only = true
export_dtype = "bfloat16"
async_mode = "async"
exclude_from_loading = []
last_save_in_hf = false
initial_load_in_hf = false
create_seed_checkpoint = false
```

## Best Practices

1. Large models: Use `async_mode = "async"` to overlap checkpoint saves with training.
2. Fine-tuning export: Enable `last_save_model_only` and set `export_dtype = "bfloat16"` for smaller files.
3. Pipeline parallelism: Always create a seed checkpoint first.
4. Debugging: Checkpoint frequently during development; raise `interval` for production runs.
5. HF interop: Use the conversion scripts for offline conversion, and the direct save/load options (`last_save_in_hf`, `initial_load_in_hf`) within training workflows.