chore(skills): move heavy training skills + outlines to optional-skills (#22912)

These skills require heavy GPU/CUDA stacks or are niche enough that they shouldn't be active by default. Moved to optional-skills/ where users opt-in via `hermes skills install official/...`.

Moved:
- mlops/training/axolotl
- mlops/training/trl-fine-tuning
- mlops/training/unsloth
- mlops/inference/outlines

Counts: 91 -> 87 built-in, 72 -> 76 optional.

Auto-regenerated docs (per-skill pages + catalogs) reflect the move.
2026-05-28 06:21:33 +00:00 · 2026-05-09 18:44:12 -07:00 · 2026-05-09 18:44:12 -07:00 · ded194eb6a
commit ded194eb6a
parent 4375b82cd9
27 changed files with 18 additions and 18 deletions
--- a/skills/mlops/training/axolotl/SKILL.md
+++ b/skills/mlops/training/axolotl/SKILL.md
@@ -1,166 +0,0 @@
---
-name: axolotl
-description: "Axolotl: YAML LLM fine-tuning (LoRA, DPO, GRPO)."
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [axolotl, torch, transformers, datasets, peft, accelerate, deepspeed]
-platforms: [linux, macos]
-metadata:
-  hermes:
-    tags: [Fine-Tuning, Axolotl, LLM, LoRA, QLoRA, DPO, KTO, ORPO, GRPO, YAML, HuggingFace, DeepSpeed, Multimodal]
-
---
-
-# Axolotl Skill
-
-## What's inside
-
-Expert guidance for fine-tuning LLMs with Axolotl — YAML configs, 100+ models, LoRA/QLoRA, DPO/KTO/ORPO/GRPO, multimodal support.
-
-Comprehensive assistance with axolotl development, generated from official documentation.
-
-## When to Use This Skill
-
-This skill should be triggered when:
- Working with axolotl
- Asking about axolotl features or APIs
- Implementing axolotl solutions
- Debugging axolotl code
- Learning axolotl best practices
-
-## Quick Reference
-
-### Common Patterns
-
-**Pattern 1:** To validate that acceptable data transfer speeds exist for your training job, running NCCL Tests can help pinpoint bottlenecks, for example:
-
-```
-./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
-```
-
-**Pattern 2:** Configure your model to use FSDP in the Axolotl yaml. For example:
-
-```
-fsdp_version: 2
-fsdp_config:
-  offload_params: true
-  state_dict_type: FULL_STATE_DICT
-  auto_wrap_policy: TRANSFORMER_BASED_WRAP
-  transformer_layer_cls_to_wrap: LlamaDecoderLayer
-  reshard_after_forward: true
-```
-
-**Pattern 3:** The context_parallel_size should be a divisor of the total number of GPUs. For example:
-
-```
-context_parallel_size
-```
-
-**Pattern 4:** For example:
- With 8 GPUs and no sequence parallelism: 8 different batches processed per step
- With 8 GPUs and context_parallel_size=4: only 2 different batches processed per step (each split across 4 GPUs)
- If your per-GPU micro_batch_size is 2, the global batch size decreases from 16 to 4
-
-```
-context_parallel_size=4
-```
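The batch arithmetic in Pattern 4 can be checked with a minimal sketch. The function name `batches_per_step` is hypothetical and not part of Axolotl; the sketch assumes no gradient accumulation and that every group of `context_parallel_size` GPUs shares one batch.

```python
def batches_per_step(num_gpus: int, context_parallel_size: int,
                     micro_batch_size: int) -> tuple[int, int]:
    """Return (distinct batches per step, global batch size).

    Hypothetical helper for illustration only; gradient accumulation
    is ignored.
    """
    if num_gpus % context_parallel_size != 0:
        raise ValueError("context_parallel_size must divide the number of GPUs")
    # GPUs inside one context-parallel group split a single batch,
    # so only the number of groups contributes distinct batches.
    groups = num_gpus // context_parallel_size
    return groups, groups * micro_batch_size


print(batches_per_step(8, 1, 2))  # (8, 16)
print(batches_per_step(8, 4, 2))  # (2, 4)
```

The two calls reproduce the numbers in the text: the global batch size drops from 16 to 4 when `context_parallel_size` goes from 1 to 4.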
-
-**Pattern 5:** Setting `save_compressed: true` in your configuration enables saving models in a compressed format, which:
- Reduces disk space usage by approximately 40%
- Maintains compatibility with vLLM for accelerated inference
- Maintains compatibility with llmcompressor for further optimization (example: quantization)
-
-```
-save_compressed: true
-```
-
-**Pattern 6:** Note: It is not necessary to place your integration in the integrations folder. It can be in any location, so long as it’s installed in a package in your Python environment. See this repo for an example: https://github.com/axolotl-ai-cloud/diff-transformer
-
-```
-integrations
-```
-
-**Pattern 7:** Handle both single-example and batched data:
- single example: `sample['input_ids']` is a `list[int]`
- batched data: `sample['input_ids']` is a `list[list[int]]`
-
-```
-utils.trainer.drop_long_seq(sample, sequence_len=2048, min_sequence_len=2)
-```
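To make the single-example versus batched distinction in Pattern 7 concrete, here is a minimal sketch of a length filter that accepts both shapes. `keep_by_length` is a hypothetical stand-in written for illustration; it is not Axolotl's `drop_long_seq` implementation and only borrows its parameter names.

```python
def keep_by_length(sample, sequence_len=2048, min_sequence_len=2):
    """Return True/False for one example, or a list of bools for a batch.

    Hypothetical illustration, not the Axolotl implementation.
    """
    input_ids = sample["input_ids"]

    def in_range(ids):
        return min_sequence_len <= len(ids) <= sequence_len

    # Batched data: input_ids is a list of token-id lists.
    if input_ids and isinstance(input_ids[0], list):
        return [in_range(ids) for ids in input_ids]
    # Single example: input_ids is a flat list of token ids.
    return in_range(input_ids)


print(keep_by_length({"input_ids": [1, 2, 3]}, sequence_len=4))
print(keep_by_length({"input_ids": [[1], [1, 2, 3], [1, 2, 3, 4, 5]]}, sequence_len=4))
```

Returning a list for batched input matches how a filter function is typically used with batched dataset mapping; whether Axolotl does exactly this is an assumption.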
-
-### Example Code Patterns
-
-**Example 1** (python):
-```python
-cli.cloud.modal_.ModalCloud(config, app=None)
-```
-
-**Example 2** (python):
-```python
-cli.cloud.modal_.run_cmd(cmd, run_folder, volumes=None)
-```
-
-**Example 3** (python):
-```python
-core.trainers.base.AxolotlTrainer(
-    *_args,
-    bench_data_collator=None,
-    eval_data_collator=None,
-    dataset_tags=None,
-    **kwargs,
-)
-```
-
-**Example 4** (python):
-```python
-core.trainers.base.AxolotlTrainer.log(logs, start_time=None)
-```
-
-**Example 5** (python):
-```python
-prompt_strategies.input_output.RawInputOutputPrompter()
-```
-
-## Reference Files
-
-This skill includes comprehensive documentation in `references/`:
-
- **api.md** - Api documentation
- **dataset-formats.md** - Dataset-Formats documentation
- **other.md** - Other documentation
-
-Use `view` to read specific reference files when detailed information is needed.
-
-## Working with This Skill
-
-### For Beginners
-Start with the getting_started or tutorials reference files for foundational concepts.
-
-### For Specific Features
-Use the appropriate category reference file (api, guides, etc.) for detailed information.
-
-### For Code Examples
-The quick reference section above contains common patterns extracted from the official docs.
-
-## Resources
-
-### references/
-Organized documentation extracted from official sources. These files contain:
- Detailed explanations
- Code examples with language annotations
- Links to original documentation
- Table of contents for quick navigation
-
-### scripts/
-Add helper scripts here for common automation tasks.
-
-### assets/
-Add templates, boilerplate, or example projects here.
-
-## Notes
-
- This skill was automatically generated from official documentation
- Reference files preserve the structure and examples from source docs
- Code examples include language detection for better syntax highlighting
- Quick reference patterns are extracted from common usage examples in the docs
-
-## Updating
-
-To refresh this skill with updated documentation:
-1. Re-run the scraper with the same configuration
-2. The skill will be rebuilt with the latest information
-
-
--- a/skills/mlops/training/axolotl/references/api.md
+++ b/skills/mlops/training/axolotl/references/api.md
--- a/skills/mlops/training/axolotl/references/dataset-formats.md
+++ b/skills/mlops/training/axolotl/references/dataset-formats.md
--- a/skills/mlops/training/axolotl/references/index.md
+++ b/skills/mlops/training/axolotl/references/index.md
@@ -1,15 +0,0 @@
-# Axolotl Documentation Index
-
-## Categories
-
-### Api
-**File:** `api.md`
-**Pages:** 150
-
-### Dataset-Formats
-**File:** `dataset-formats.md`
-**Pages:** 9
-
-### Other
-**File:** `other.md`
-**Pages:** 26
--- a/skills/mlops/training/axolotl/references/other.md
+++ b/skills/mlops/training/axolotl/references/other.md
--- a/skills/mlops/training/trl-fine-tuning/SKILL.md
+++ b/skills/mlops/training/trl-fine-tuning/SKILL.md
@@ -1,463 +0,0 @@
---
-name: fine-tuning-with-trl
-description: "TRL: SFT, DPO, PPO, GRPO, reward modeling for LLM RLHF."
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [trl, transformers, datasets, peft, accelerate, torch]
-platforms: [linux, macos, windows]
-metadata:
-  hermes:
-    tags: [Post-Training, TRL, Reinforcement Learning, Fine-Tuning, SFT, DPO, PPO, GRPO, RLHF, Preference Alignment, HuggingFace]
-
---
-
-# TRL - Transformer Reinforcement Learning
-
-## Quick start
-
-TRL provides post-training methods for aligning language models with human preferences.
-
-**Installation**:
-```bash
-pip install trl transformers datasets peft accelerate
-```
-
-**Supervised Fine-Tuning** (instruction tuning):
-```python
-from trl import SFTTrainer
-
-trainer = SFTTrainer(
-    model="Qwen/Qwen2.5-0.5B",
-    train_dataset=dataset,  # Prompt-completion pairs
-)
-trainer.train()
-```
-
-**DPO** (align with preferences):
-```python
-from trl import DPOTrainer, DPOConfig
-
-config = DPOConfig(output_dir="model-dpo", beta=0.1)
-trainer = DPOTrainer(
-    model=model,
-    args=config,
-    train_dataset=preference_dataset,  # chosen/rejected pairs
-    processing_class=tokenizer
-)
-trainer.train()
-```
-
-## Common workflows
-
-### Workflow 1: Full RLHF pipeline (SFT → Reward Model → PPO)
-
-Complete pipeline from base model to human-aligned model.
-
-Copy this checklist:
-
-```
-RLHF Training:
- [ ] Step 1: Supervised fine-tuning (SFT)
- [ ] Step 2: Train reward model
- [ ] Step 3: PPO reinforcement learning
- [ ] Step 4: Evaluate aligned model
-```
-
-**Step 1: Supervised fine-tuning**
-
-Train base model on instruction-following data:
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-from trl import SFTTrainer, SFTConfig
-from datasets import load_dataset
-
-# Load model
-model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
-tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
-
-# Load instruction dataset
-dataset = load_dataset("trl-lib/Capybara", split="train")
-
-# Configure training
-training_args = SFTConfig(
-    output_dir="Qwen2.5-0.5B-SFT",
-    per_device_train_batch_size=4,
-    num_train_epochs=1,
-    learning_rate=2e-5,
-    logging_steps=10,
-    save_strategy="epoch"
-)
-
-# Train
-trainer = SFTTrainer(
-    model=model,
-    args=training_args,
-    train_dataset=dataset,
-    processing_class=tokenizer  # older TRL releases used tokenizer=...
-)
-trainer.train()
-trainer.save_model()
-```
-
-**Step 2: Train reward model**
-
-Train model to predict human preferences:
-
-```python
-from datasets import load_dataset
-from transformers import AutoModelForSequenceClassification, AutoTokenizer
-from trl import RewardTrainer, RewardConfig
-
-# Load SFT model as base
-model = AutoModelForSequenceClassification.from_pretrained(
-    "Qwen2.5-0.5B-SFT",
-    num_labels=1  # Single reward score
-)
-tokenizer = AutoTokenizer.from_pretrained("Qwen2.5-0.5B-SFT")
-
-# Load preference data (chosen/rejected pairs)
-dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
-
-# Configure training
-training_args = RewardConfig(
-    output_dir="Qwen2.5-0.5B-Reward",
-    per_device_train_batch_size=2,
-    num_train_epochs=1,
-    learning_rate=1e-5
-)
-
-# Train reward model
-trainer = RewardTrainer(
-    model=model,
-    args=training_args,
-    processing_class=tokenizer,
-    train_dataset=dataset
-)
-trainer.train()
-trainer.save_model()
-```
-
-**Step 3: PPO reinforcement learning**
-
-Optimize policy using reward model:
-
-```bash
-python -m trl.scripts.ppo \
-    --model_name_or_path Qwen2.5-0.5B-SFT \
-    --reward_model_path Qwen2.5-0.5B-Reward \
-    --dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
-    --output_dir Qwen2.5-0.5B-PPO \
-    --learning_rate 3e-6 \
-    --per_device_train_batch_size 64 \
-    --total_episodes 10000
-```
-
-**Step 4: Evaluate**
-
-```python
-from transformers import pipeline
-
-# Load aligned model
-generator = pipeline("text-generation", model="Qwen2.5-0.5B-PPO")
-
-# Test
-prompt = "Explain quantum computing to a 10-year-old"
-output = generator(prompt, max_length=200)[0]["generated_text"]
-print(output)
-```
-
-### Workflow 2: Simple preference alignment with DPO
-
-Align model with preferences without reward model.
-
-Copy this checklist:
-
-```
-DPO Training:
- [ ] Step 1: Prepare preference dataset
- [ ] Step 2: Configure DPO
- [ ] Step 3: Train with DPOTrainer
- [ ] Step 4: Evaluate alignment
-```
-
-**Step 1: Prepare preference dataset**
-
-Dataset format:
-```json
-{
-  "prompt": "What is the capital of France?",
-  "chosen": "The capital of France is Paris.",
-  "rejected": "I don't know."
-}
-```
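Before training, it can help to check that each record really has the three fields shown above. The following is a minimal sketch; `validate_preference_record` is a hypothetical helper (not part of TRL) and it assumes the plain-string format shown here rather than a conversational (list-of-messages) format.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def validate_preference_record(record: dict) -> list[str]:
    """Return a list of problems found in one preference record."""
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    # A pair with identical responses carries no preference signal.
    if not problems and record["chosen"] == record["rejected"]:
        problems.append("chosen and rejected are identical")
    return problems


good = {
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "I don't know.",
}
print(validate_preference_record(good))            # []
print(validate_preference_record({"prompt": "Hi", "chosen": "Hello"}))
```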
-
-Load dataset:
-```python
-from datasets import load_dataset
-
-dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
-# Or load your own
-# dataset = load_dataset("json", data_files="preferences.json", split="train")
-```
-
-**Step 2: Configure DPO**
-
-```python
-from trl import DPOConfig
-
-config = DPOConfig(
-    output_dir="Qwen2.5-0.5B-DPO",
-    per_device_train_batch_size=4,
-    num_train_epochs=1,
-    learning_rate=5e-7,
-    beta=0.1,  # KL penalty strength
-    max_prompt_length=512,
-    max_length=1024,
-    logging_steps=10
-)
-```
-
-**Step 3: Train with DPOTrainer**
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-from trl import DPOTrainer
-
-model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
-tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
-
-trainer = DPOTrainer(
-    model=model,
-    args=config,
-    train_dataset=dataset,
-    processing_class=tokenizer
-)
-
-trainer.train()
-trainer.save_model()
-```
-
-**CLI alternative**:
-```bash
-trl dpo \
-    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
-    --dataset_name argilla/Capybara-Preferences \
-    --output_dir Qwen2.5-0.5B-DPO \
-    --per_device_train_batch_size 4 \
-    --learning_rate 5e-7 \
-    --beta 0.1
-```
-
-### Workflow 3: Memory-efficient online RL with GRPO
-
-Train with reinforcement learning using minimal memory.
-
-For in-depth GRPO guidance — reward function design, critical training insights (loss behavior, mode collapse, tuning), and advanced multi-stage patterns — see **[references/grpo-training.md](references/grpo-training.md)**. A production-ready training script is in **[templates/basic_grpo_training.py](templates/basic_grpo_training.py)**.
-
-Copy this checklist:
-
-```
-GRPO Training:
- [ ] Step 1: Define reward function
- [ ] Step 2: Configure GRPO
- [ ] Step 3: Train with GRPOTrainer
-```
-
-**Step 1: Define reward function**
-
-```python
-def reward_function(completions, **kwargs):
-    """
-    Compute rewards for completions.
-
-    Args:
-        completions: List of generated texts
-
-    Returns:
-        List of reward scores (floats)
-    """
-    rewards = []
-    for completion in completions:
-        # Example: reward based on length and unique words
-        score = len(completion.split())  # Favor longer responses
-        score += len(set(completion.lower().split()))  # Reward unique words
-        rewards.append(score)
-    return rewards
-```
-
-Or use a reward model:
-```python
-from transformers import pipeline
-
-reward_model = pipeline("text-classification", model="reward-model-path")
-
-def reward_from_model(completions, prompts, **kwargs):
-    # Combine prompt + completion
-    full_texts = [p + c for p, c in zip(prompts, completions)]
-    # Get reward scores
-    results = reward_model(full_texts)
-    return [r["score"] for r in results]
-```
-
-**Step 2: Configure GRPO**
-
-```python
-from trl import GRPOConfig
-
-config = GRPOConfig(
-    output_dir="Qwen2-GRPO",
-    per_device_train_batch_size=4,
-    num_train_epochs=1,
-    learning_rate=1e-5,
-    num_generations=4,  # Generate 4 completions per prompt
-    max_completion_length=128  # Max tokens generated per completion
-)
-```
-
-**Step 3: Train with GRPOTrainer**
-
-```python
-from datasets import load_dataset
-from trl import GRPOTrainer
-
-# Load prompt-only dataset
-dataset = load_dataset("trl-lib/tldr", split="train")
-
-trainer = GRPOTrainer(
-    model="Qwen/Qwen2-0.5B-Instruct",
-    reward_funcs=reward_function,  # Your reward function
-    args=config,
-    train_dataset=dataset
-)
-
-trainer.train()
-```
-
-**CLI**:
-```bash
-trl grpo \
-    --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
-    --dataset_name trl-lib/tldr \
-    --output_dir Qwen2-GRPO \
-    --num_generations 4
-```
-
-## When to use vs alternatives
-
-**Use TRL when:**
- Need to align model with human preferences
- Have preference data (chosen/rejected pairs)
- Want to use reinforcement learning (PPO, GRPO)
- Need reward model training
- Doing RLHF (full pipeline)
-
-**Method selection**:
- **SFT**: Have prompt-completion pairs, want basic instruction following
- **DPO**: Have preferences, want simple alignment (no reward model needed)
- **PPO**: Have reward model, need maximum control over RL
- **GRPO**: Memory-constrained, want online RL
- **Reward Model**: Building RLHF pipeline, need to score generations
-
-**Use alternatives instead:**
- **HuggingFace Trainer**: Basic fine-tuning without RL
- **Axolotl**: YAML-based training configuration
- **LitGPT**: Educational, minimal fine-tuning
- **Unsloth**: Fast LoRA training
-
-## Common issues
-
-**Issue: OOM during DPO training**
-
-Reduce batch size and sequence length:
-```python
-config = DPOConfig(
-    per_device_train_batch_size=1,  # Reduce from 4
-    max_length=512,  # Reduce from 1024
-    gradient_accumulation_steps=8  # Maintain effective batch
-)
-```
-
-Or use gradient checkpointing:
-```python
-model.gradient_checkpointing_enable()
-```
-
-**Issue: Poor alignment quality**
-
-Tune beta parameter:
-```python
-# Higher beta = more conservative (stays closer to reference)
-config = DPOConfig(beta=0.5)  # Default 0.1
-
-# Lower beta = more aggressive alignment
-config = DPOConfig(beta=0.01)
-```
-
-**Issue: Reward model not learning**
-
-Try a different learning rate or train for more epochs:
-```python
-config = RewardConfig(
-    learning_rate=1e-5,  # Try different LR
-    num_train_epochs=3  # Train longer
-)
-```
-
-Ensure preference dataset has clear winners:
-```python
-# Verify dataset
-print(dataset[0])
-# Should have clear chosen > rejected
-```
-
-**Issue: PPO training unstable**
-
-Adjust KL coefficient:
-```python
-config = PPOConfig(
-    kl_coef=0.1,  # Increase from 0.05
-    cliprange=0.1  # Reduce from 0.2
-)
-```
-
-## Advanced topics
-
-**SFT training guide**: See [references/sft-training.md](references/sft-training.md) for dataset formats, chat templates, packing strategies, and multi-GPU training.
-
-**DPO variants**: See [references/dpo-variants.md](references/dpo-variants.md) for IPO, cDPO, RPO, and other DPO loss functions with recommended hyperparameters.
-
-**Reward modeling**: See [references/reward-modeling.md](references/reward-modeling.md) for outcome vs process rewards, Bradley-Terry loss, and reward model evaluation.
-
-**Online RL methods**: See [references/online-rl.md](references/online-rl.md) for PPO, GRPO, RLOO, and OnlineDPO with detailed configurations.
-
-**GRPO deep dive**: See [references/grpo-training.md](references/grpo-training.md) for expert-level GRPO patterns — reward function design philosophy, training insights (why loss increases, mode collapse detection), hyperparameter tuning, multi-stage training, and troubleshooting. Production-ready template in [templates/basic_grpo_training.py](templates/basic_grpo_training.py).
-
-## Hardware requirements
-
- **GPU**: NVIDIA (CUDA required)
- **VRAM**: Depends on model and method
-  - SFT 7B: 16GB (with LoRA)
-  - DPO 7B: 24GB (stores reference model)
-  - PPO 7B: 40GB (policy + reward model)
-  - GRPO 7B: 24GB (more memory efficient)
- **Multi-GPU**: Supported via `accelerate`
- **Mixed precision**: BF16 recommended (A100/H100)
-
-**Memory optimization**:
- Use LoRA/QLoRA for all methods
- Enable gradient checkpointing
- Use smaller batch sizes with gradient accumulation
-
-## Resources
-
- Docs: https://huggingface.co/docs/trl/
- GitHub: https://github.com/huggingface/trl
- Papers:
-  - "Training language models to follow instructions with human feedback" (InstructGPT, 2022)
-  - "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (DPO, 2023)
-  - "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (introduces GRPO, 2024)
- Examples: https://github.com/huggingface/trl/tree/main/examples/scripts
-
-
-
--- a/skills/mlops/training/trl-fine-tuning/references/dpo-variants.md
+++ b/skills/mlops/training/trl-fine-tuning/references/dpo-variants.md
@@ -1,227 +0,0 @@
-# DPO Variants
-
-Complete guide to Direct Preference Optimization loss variants in TRL.
-
-## Overview
-
-DPO optimizes models using preference data (chosen/rejected pairs). TRL supports 10+ loss variants for different scenarios.
-
-## Loss Types
-
-### 1. Sigmoid (Standard DPO)
-
-**Formula**: `-log(sigmoid(β * logits))`
-
-**When to use**: Default choice, general preference alignment
-
-**Config**:
-```python
-DPOConfig(
-    loss_type="sigmoid",
-    beta=0.1,  # KL penalty
-    per_device_train_batch_size=64,
-    learning_rate=1e-6
-)
-```
-
-### 2. IPO (Identity Preference Optimization)
-
-**Formula**: `(logits - 1/(2β))²`
-
-**When to use**: Better theoretical foundation, reduce overfitting
-
-**Config**:
-```python
-DPOConfig(
-    loss_type="ipo",
-    beta=0.1,
-    per_device_train_batch_size=90,
-    learning_rate=1e-2
-)
-```
-
-### 3. Hinge (SLiC)
-
-**Formula**: `ReLU(1 - β * logits)`
-
-**When to use**: Margin-based objective
-
-**Config**:
-```python
-DPOConfig(
-    loss_type="hinge",
-    beta=0.1,
-    per_device_train_batch_size=512,
-    learning_rate=1e-4
-)
-```
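The three formulas above (sigmoid, IPO, hinge) can be compared on plain numbers. This is a minimal sketch in pure Python, assuming `logits` means the policy's chosen-minus-rejected log-ratio minus the reference model's; the function names are illustrative and are not TRL APIs.

```python
import math


def sigmoid_loss(logits: float, beta: float) -> float:
    # -log(sigmoid(beta * logits)), written as log(1 + exp(-x))
    return math.log1p(math.exp(-beta * logits))


def ipo_loss(logits: float, beta: float) -> float:
    # (logits - 1/(2*beta))^2: pulls the margin toward a fixed target
    return (logits - 1.0 / (2.0 * beta)) ** 2


def hinge_loss(logits: float, beta: float) -> float:
    # ReLU(1 - beta * logits): zero once the scaled margin reaches 1
    return max(0.0, 1.0 - beta * logits)


for margin in (0.0, 5.0, 10.0):
    print(margin, sigmoid_loss(margin, 0.1), ipo_loss(margin, 0.1), hinge_loss(margin, 0.1))
```

With `beta=0.1`, the IPO loss reaches zero at a margin of 5 and grows again beyond it, the hinge loss reaches zero at a margin of 10, and the sigmoid loss keeps decreasing without ever reaching zero.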
-
-### 4. Robust DPO
-
-**Formula**: Sigmoid with label smoothing for noise robustness
-
-**When to use**: Noisy preference labels
-
-**Config**:
-```python
-DPOConfig(
-    loss_type="robust",
-    beta=0.01,
-    label_smoothing=0.1,  # Noise probability
-    per_device_train_batch_size=16,
-    learning_rate=1e-3,
-    max_prompt_length=128,
-    max_length=512
-)
-```
-
-### 5. BCO Pair (Binary Classification)
-
-**Formula**: Train binary classifier (chosen=1, rejected=0)
-
-**When to use**: Pairwise preference data
-
-**Config**:
-```python
-DPOConfig(
-    loss_type="bco_pair",
-    beta=0.01,
-    per_device_train_batch_size=128,
-    learning_rate=5e-7,
-    max_prompt_length=1536,
-    max_completion_length=512
-)
-```
-
-### 6. SPPO Hard
-
-**Formula**: Push chosen→0.5, rejected→-0.5
-
-**When to use**: Nash equilibrium, sparse data
-
-**Config**:
-```python
-DPOConfig(
-    loss_type="sppo_hard",
-    beta=0.1
-)
-```
-
-### 7. DiscoPOP
-
-**Formula**: Log-Ratio Modulated Loss
-
-**When to use**: Automated loss discovery
-
-**Config**:
-```python
-DPOConfig(
-    loss_type="discopop",
-    beta=0.05,
-    discopop_tau=0.05,
-    per_device_train_batch_size=64,
-    learning_rate=5e-7
-)
-```
-
-### 8. APO Zero
-
-**Formula**: Increase chosen, decrease rejected likelihood
-
-**When to use**: Model worse than winning outputs
-
-**Config**:
-```python
-DPOConfig(
-    loss_type="apo_zero",
-    beta=0.1,
-    per_device_train_batch_size=64,
-    learning_rate=2e-7,
-    max_prompt_length=512,
-    max_completion_length=512
-)
-```
-
-### 9. APO Down
-
-**Formula**: Decrease both, emphasize rejected reduction
-
-**When to use**: Model better than winning outputs
-
-**Config**:
-```python
-DPOConfig(
-    loss_type="apo_down",
-    beta=0.1,
-    # Same hyperparameters as apo_zero
-)
-```
-
-### 10. AOT & AOT Pair
-
-**Formula**: Distributional alignment via stochastic dominance
-
-**When to use**:
- `aot_pair`: Paired preference data
- `aot`: Unpaired data
-
-**Config**:
-```python
-DPOConfig(
-    loss_type="aot_pair",  # or "aot"
-    beta=0.1,
-    label_smoothing=0.0
-)
-```
-
-## Multi-Loss Training
-
-Combine multiple losses:
-
-```python
-DPOConfig(
-    loss_type=["sigmoid", "ipo"],
-    loss_weights=[0.7, 0.3],  # Weighted combination
-    beta=0.1
-)
-```
-
-## Key Parameters
-
-### Beta (β)
-
-Controls deviation from reference model:
- **Higher** (0.5): More conservative, stays close to reference
- **Lower** (0.01): More aggressive alignment
- **Default**: 0.1
-
-### Label Smoothing
-
-For robust DPO:
- **0.0**: No smoothing (default)
- **0.1-0.3**: Moderate noise robustness
- **Approaching 0.5**: Labels are treated as almost random; keep the value below 0.5
-
-### Max Lengths
-
- `max_prompt_length`: 128-1536
- `max_completion_length`: 128-512
- `max_length`: Total sequence (1024-2048)
-
-## Comparison Table
-
-| Loss | Speed | Stability | Best For |
-|------|-------|-----------|----------|
-| Sigmoid | Fast | Good | **General use** |
-| IPO | Fast | Better | Overfitting issues |
-| Hinge | Fast | Good | Margin objectives |
-| Robust | Fast | Best | Noisy data |
-| BCO | Medium | Good | Binary classification |
-| DiscoPOP | Fast | Good | New architectures |
-| APO | Fast | Good | Model quality matching |
-
-## References
-
- DPO paper: https://arxiv.org/abs/2305.18290
- IPO paper: https://arxiv.org/abs/2310.12036
- TRL docs: https://huggingface.co/docs/trl/dpo_trainer
--- a/skills/mlops/training/trl-fine-tuning/references/grpo-training.md
+++ b/skills/mlops/training/trl-fine-tuning/references/grpo-training.md
@@ -1,504 +0,0 @@
-# GRPO (Group Relative Policy Optimization) — Deep Guide
-
-Expert-level patterns, critical insights, and production-ready workflows for fine-tuning language models with custom reward functions using TRL's `GRPOTrainer`. This is the deep reference for the GRPO workflow summarized in the main skill.
-
-## When to use GRPO
-
-Use GRPO when you need to:
- **Enforce specific output formats** (XML tags, JSON, structured reasoning)
- **Teach verifiable tasks** with objective correctness metrics (math, coding, fact-checking)
- **Improve reasoning capabilities** by rewarding chain-of-thought patterns
- **Align models to domain-specific behaviors** without labeled preference data
- **Optimize for multiple objectives** simultaneously (format + correctness + style)
-
-**Do NOT use GRPO for:**
- Simple supervised fine-tuning tasks → use SFT
- Tasks without clear reward signals
- When you already have high-quality preference pairs → use DPO/PPO
-
-## Core concepts
-
-### 1. GRPO algorithm fundamentals
-
-**Key mechanism:**
- Generates **multiple completions** per prompt (group size: 4–16)
- Compares completions within each group using reward functions
- Updates policy to favor higher-rewarded responses relative to the group
-
-**Critical differences from PPO:**
- No value (critic) model, and rewards can come from plain Python functions instead of a trained reward model
- More sample-efficient (learns from within-group comparisons)
- Simpler to implement and debug
-
-**Mathematical intuition:**
-```
-For each prompt p:
-  1. Generate N completions: {c₁, c₂, ..., cₙ}
-  2. Compute rewards: {r₁, r₂, ..., rₙ}
-  3. Learn to increase probability of high-reward completions
-     relative to low-reward ones in the same group
-```
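Step 3 of the intuition above is usually realized by turning each reward into an advantage relative to its own group. The sketch below shows one common formulation (subtract the group mean, divide by the group standard deviation). The function name and the `eps` value are assumptions for illustration, and TRL's exact normalization may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one prompt's group of completions.

    Illustrative sketch; eps avoids division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: above-average rewards get a
# positive advantage, below-average rewards a negative one.
print(group_relative_advantages([0.0, 1.0, 2.0, 3.0]))
# Identical rewards give zero advantage, so the group contributes no signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

The second call shows why reward diversity within a group matters: when all completions score the same, there is nothing to learn from that prompt.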
-
-### 2. Reward function design philosophy
-
-**Golden rules:**
-1. **Compose multiple reward functions** — each handles one aspect (format, correctness, style)
-2. **Scale rewards appropriately** — higher weight = stronger signal
-3. **Use incremental rewards** — partial credit for partial compliance
-4. **Test rewards independently** — debug each reward function in isolation
-
-**Reward function types:**
-
-| Type | Use Case | Example Weight |
-|------|----------|----------------|
-| **Correctness** | Verifiable tasks (math, code) | 2.0 (highest) |
-| **Format** | Strict structure enforcement | 0.5–1.0 |
-| **Length** | Encourage verbosity/conciseness | 0.1–0.5 |
-| **Style** | Penalize unwanted patterns | −0.5 to 0.5 |
-
-## Implementation workflow
-
-### Step 1: Dataset preparation
-
-**Critical requirements:**
- Prompts in chat format (list of dicts with `role` and `content`)
- Include system prompts to set expectations
- For verifiable tasks, include ground truth answers as additional columns
-
-```python
-from datasets import load_dataset, Dataset
-
-SYSTEM_PROMPT = """
-Respond in the following format:
-<reasoning>
-[Your step-by-step thinking]
-</reasoning>
-<answer>
-[Final answer]
-</answer>
-"""
-
-def prepare_dataset(raw_data):
-    """Transform raw data into GRPO-compatible format.
-
-    Returns: Dataset with columns:
-    - 'prompt': List[Dict] with role/content (system + user messages)
-    - 'answer': str (ground truth, optional but recommended)
-    """
-    return raw_data.map(lambda x: {
-        'prompt': [
-            {'role': 'system', 'content': SYSTEM_PROMPT},
-            {'role': 'user', 'content': x['question']}
-        ],
-        'answer': extract_answer(x['raw_answer'])
-    })
-```
-
-**Pro tips:**
- Use one-shot or few-shot examples in the system prompt for complex formats
- Keep prompts concise (max_prompt_length: 256–512 tokens)
- Validate data quality before training (garbage in = garbage out)
-
-### Step 2: Reward function implementation
-
-**Template structure:**
-```python
-def reward_function_name(
-    prompts,        # List[List[Dict]]: Original prompts
-    completions,    # List[List[Dict]]: Model generations
-    answer=None,    # Optional: Ground truth from dataset
-    **kwargs        # Additional dataset columns
-) -> list[float]:
-    """Evaluate completions and return rewards (one per completion)."""
-    responses = [comp[0]['content'] for comp in completions]
-    rewards = []
-    for response in responses:
-        score = compute_score(response)
-        rewards.append(score)
-    return rewards
-```
-
-**Example 1: correctness reward (math/coding)**
-```python
-def correctness_reward(prompts, completions, answer, **kwargs):
-    """Reward correct answers with high score."""
-    responses = [comp[0]['content'] for comp in completions]
-    extracted = [extract_final_answer(r) for r in responses]
-    return [2.0 if ans == gt else 0.0
-            for ans, gt in zip(extracted, answer)]
-```
-
-**Example 2: format reward (structured output)**
-```python
-import re
-
-def format_reward(completions, **kwargs):
-    """Reward XML-like structured format."""
-    pattern = r'<reasoning>.*?</reasoning>\s*<answer>.*?</answer>'
-    responses = [comp[0]['content'] for comp in completions]
-    return [1.0 if re.search(pattern, r, re.DOTALL) else 0.0
-            for r in responses]
-```
-
-**Example 3: incremental format reward (partial credit)**
-```python
-def incremental_format_reward(completions, **kwargs):
-    """Award partial credit for format compliance."""
-    responses = [comp[0]['content'] for comp in completions]
-    rewards = []
-
-    for r in responses:
-        score = 0.0
-        if '<reasoning>' in r:  score += 0.25
-        if '</reasoning>' in r: score += 0.25
-        if '<answer>' in r:     score += 0.25
-        if '</answer>' in r:    score += 0.25
-        # Penalize extra text after closing tag
-        if r.count('</answer>') == 1:
-            extra_text = r.split('</answer>')[-1].strip()
-            score -= len(extra_text) * 0.001
-        rewards.append(score)
-
-    return rewards
-```
-
-**Critical insight:** Combine 3–5 reward functions for robust training. Order matters less than diversity of signals.
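To inspect how several reward signals add up before starting a run, a weighted sum can be computed offline. This is a minimal sketch: `combine_rewards`, `has_answer_tags`, and `brevity` are hypothetical helpers written for this example, and the weights are arbitrary. It assumes completions in the chat format used by the examples above.

```python
def has_answer_tags(completions, **kwargs):
    """1.0 when both <answer> tags are present, else 0.0."""
    texts = [comp[0]["content"] for comp in completions]
    return [1.0 if "<answer>" in t and "</answer>" in t else 0.0 for t in texts]


def brevity(completions, **kwargs):
    """1.0 for responses of at most 50 whitespace-separated words."""
    texts = [comp[0]["content"] for comp in completions]
    return [1.0 if len(t.split()) <= 50 else 0.0 for t in texts]


def combine_rewards(reward_funcs, weights, completions, **kwargs):
    """Weighted sum of several reward functions, one total per completion."""
    if len(reward_funcs) != len(weights):
        raise ValueError("need exactly one weight per reward function")
    totals = [0.0] * len(completions)
    for func, weight in zip(reward_funcs, weights):
        scores = func(completions=completions, **kwargs)
        for i, score in enumerate(scores):
            totals[i] += weight * score
    return totals


sample = [
    [{"role": "assistant", "content": "<answer>4</answer>"}],
    [{"role": "assistant", "content": "four"}],
]
print(combine_rewards([has_answer_tags, brevity], [1.0, 0.25], sample))  # [1.25, 0.25]
```

Running each function alone on the same `sample` is a quick way to follow the "test rewards independently" rule before combining them.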
-
-### Step 3: Training configuration
-
-**Memory-optimized config (small GPU)**
-```python
-from trl import GRPOConfig
-
-training_args = GRPOConfig(
-    output_dir="outputs/grpo-model",
-
-    # Learning rate
-    learning_rate=5e-6,          # Lower = more stable
-    adam_beta1=0.9,
-    adam_beta2=0.99,
-    weight_decay=0.1,
-    warmup_ratio=0.1,
-    lr_scheduler_type='cosine',
-
-    # Batch settings
-    per_device_train_batch_size=1,
-    gradient_accumulation_steps=4,  # Effective batch = 4
-
-    # GRPO-specific
-    num_generations=8,            # Group size: 8–16 recommended
-    max_prompt_length=256,
-    max_completion_length=512,
-
-    # Training duration
-    num_train_epochs=1,
-    max_steps=-1,                 # -1 = run for num_train_epochs
-
-    # Optimization
-    bf16=True,                    # Faster on A100/H100
-    optim="adamw_8bit",          # Memory-efficient optimizer
-    max_grad_norm=0.1,
-
-    # Logging
-    logging_steps=1,
-    save_steps=100,
-    report_to="wandb",
-)
-```
-
-**High-performance config (large GPU)**
-```python
-training_args = GRPOConfig(
-    output_dir="outputs/grpo-model",
-    learning_rate=1e-5,
-    per_device_train_batch_size=4,
-    gradient_accumulation_steps=2,
-    num_generations=16,           # Larger groups = better signal
-    max_prompt_length=512,
-    max_completion_length=1024,
-    num_train_epochs=1,
-    bf16=True,
-    use_vllm=True,                # Fast generation with vLLM
-    logging_steps=10,
-)
-```
-
-**Critical hyperparameters:**
-
-| Parameter | Impact | Tuning Advice |
-|-----------|--------|---------------|
-| `num_generations` | Group size for comparison | Start 8, increase to 16 if GPU allows |
-| `learning_rate` | Convergence speed/stability | 5e-6 (safe), 1e-5 (faster, riskier) |
-| `max_completion_length` | Output verbosity | Match your task (512 reasoning, 256 short answers) |
-| `gradient_accumulation_steps` | Effective batch size | Increase if GPU memory limited |
-
-### Step 4: Model setup and training
-
-**Standard setup (Transformers + TRL)**
-```python
-import torch
-from transformers import AutoModelForCausalLM, AutoTokenizer
-from peft import LoraConfig
-from trl import GRPOTrainer
-
-model_name = "Qwen/Qwen2.5-1.5B-Instruct"
-model = AutoModelForCausalLM.from_pretrained(
-    model_name,
-    torch_dtype=torch.bfloat16,
-    attn_implementation="flash_attention_2",  # 2–3× faster
-    device_map="auto",
-)
-
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-tokenizer.pad_token = tokenizer.eos_token
-
-# Optional: LoRA for parameter-efficient training
-peft_config = LoraConfig(
-    r=16,
-    lora_alpha=32,
-    target_modules=[
-        "q_proj", "k_proj", "v_proj", "o_proj",
-        "gate_proj", "up_proj", "down_proj",
-    ],
-    task_type="CAUSAL_LM",
-    lora_dropout=0.05,
-)
-
-trainer = GRPOTrainer(
-    model=model,
-    processing_class=tokenizer,
-    reward_funcs=[
-        incremental_format_reward,
-        format_reward,
-        correctness_reward,
-    ],
-    args=training_args,
-    train_dataset=dataset,
-    peft_config=peft_config,   # Remove for full fine-tuning
-)
-
-trainer.train()
-trainer.save_model("final_model")
-```
-
-**Unsloth setup (2–3× faster)**
-```python
-from unsloth import FastLanguageModel
-
-model, tokenizer = FastLanguageModel.from_pretrained(
-    model_name="google/gemma-3-1b-it",
-    max_seq_length=1024,
-    load_in_4bit=True,
-    fast_inference=True,
-    max_lora_rank=32,
-)
-
-model = FastLanguageModel.get_peft_model(
-    model,
-    r=32,
-    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
-                    "gate_proj", "up_proj", "down_proj"],
-    lora_alpha=32,
-    use_gradient_checkpointing="unsloth",
-)
-
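-# The rest matches the standard setup; peft_config is omitted because
-# get_peft_model() has already attached the LoRA adapter
-from trl import GRPOTrainer
-
-trainer = GRPOTrainer(
-    model=model,
-    processing_class=tokenizer,
-    reward_funcs=[incremental_format_reward, format_reward, correctness_reward],
-    args=training_args,
-    train_dataset=dataset,
-)
-trainer.train()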
-```
-
-## Critical training insights
-
-### 1. Loss behavior (EXPECTED pattern)
- **Loss starts near 0 and INCREASES during training** — this is CORRECT
- Loss measures KL divergence from initial policy; the model is learning (diverging from original behavior to optimize rewards)
- **Monitor reward metrics, not loss, for progress**
-
-### 2. Reward tracking
-
-Key metrics to watch:
- `reward` — average across all completions
- `reward_std` — diversity within groups (should remain > 0)
- `kl` — KL divergence from reference (should grow moderately)
-
-**Healthy pattern:**
-```
-Step   Reward    Reward_Std   KL
-100    0.5       0.3          0.02
-200    0.8       0.25         0.05
-300    1.2       0.2          0.08  ← Good progression
-400    1.5       0.15         0.12
-```
-
-**Warning signs:**
- `reward_std` → 0 (model collapsing to a single response)
- `kl` exploding (> 0.5) — diverging too much, reduce LR
- Reward stuck — reward functions too harsh or model capacity issue
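
The warning signs above can be turned into a simple check over logged metrics. This is a minimal sketch, assuming the metrics for one step arrive as a plain dict with `reward_std` and `kl` keys; the function name `check_grpo_health` and the `1e-3` collapse floor are illustrative choices, not part of TRL.

```python
def check_grpo_health(metrics, kl_limit=0.5, std_floor=1e-3):
    """Return a list of warnings for one logged training step.

    `metrics` is assumed to be a dict with 'reward_std' and 'kl' keys;
    the thresholds mirror the warning signs listed above.
    """
    warnings = []
    if metrics["reward_std"] <= std_floor:
        warnings.append("reward_std near 0: completions may have collapsed")
    if metrics["kl"] > kl_limit:
        warnings.append("kl above limit: consider reducing the learning rate")
    return warnings


print(check_grpo_health({"reward_std": 0.2, "kl": 0.08}))  # healthy step, no warnings
print(check_grpo_health({"reward_std": 0.0, "kl": 0.7}))   # both warnings fire
```

A check like this could run inside a logging callback, but how the metrics are collected depends on your logging setup.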
-
-### 3. Common pitfalls and solutions
-
-| Problem | Symptom | Solution |
-|---------|---------|----------|
-| **Mode collapse** | All completions identical | Increase `num_generations`, add diversity penalty |
-| **No learning** | Flat rewards | Check reward function logic, increase LR |
-| **OOM errors** | GPU memory exceeded | Reduce `num_generations`, enable gradient checkpointing |
-| **Slow training** | < 1 it/s | Enable `use_vllm=True`, use Unsloth, reduce seq length |
-| **Format ignored** | Model doesn't follow structure | Increase format reward weight, add incremental rewards |
-
-## Advanced patterns
-
-### 1. Multi-stage training
-
-For complex tasks, train in stages:
-
-```python
-# Stage 1: Format compliance
-trainer_stage1 = GRPOTrainer(
-    model=model,
-    reward_funcs=[incremental_format_reward, format_reward],
-    # ...remaining arguments as in Step 4 (args, train_dataset, processing_class)
-)
-trainer_stage1.train()
-
-# Stage 2: Correctness (continues from the weights updated in stage 1)
-trainer_stage2 = GRPOTrainer(
-    model=model,
-    reward_funcs=[format_reward, correctness_reward],
-    # ...remaining arguments as in Step 4
-)
-trainer_stage2.train()
-```
-
-### 2. Adaptive reward scaling
-
-```python
-class AdaptiveReward:
-    def __init__(self, base_reward_func, initial_weight=1.0):
-        self.func = base_reward_func
-        self.weight = initial_weight
-
-    def __call__(self, *args, **kwargs):
-        rewards = self.func(*args, **kwargs)
-        return [r * self.weight for r in rewards]
-
-    def adjust_weight(self, success_rate):
-        """Increase weight if model struggling, decrease if succeeding."""
-        if success_rate < 0.3:
-            self.weight *= 1.2
-        elif success_rate > 0.8:
-            self.weight *= 0.9
-```
-
-### 3. Custom dataset integration
-
-```python
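-import pandas as pd
-from datasets import Dataset
-
-def load_custom_knowledge_base(csv_path):
-    # CUSTOM_SYSTEM_PROMPT is assumed to be defined elsewhere in your script
-    df = pd.read_csv(csv_path)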
-    return Dataset.from_pandas(df).map(lambda x: {
-        'prompt': [
-            {'role': 'system', 'content': CUSTOM_SYSTEM_PROMPT},
-            {'role': 'user', 'content': x['question']}
-        ],
-        'answer': x['expert_answer']
-    })
-```
-
-## Deployment and inference
-
-### Save and merge LoRA
-```python
-if hasattr(trainer.model, 'merge_and_unload'):
-    merged_model = trainer.model.merge_and_unload()
-    merged_model.save_pretrained("production_model")
-    tokenizer.save_pretrained("production_model")
-```
-
-### Inference
-```python
-from transformers import pipeline
-
-generator = pipeline("text-generation", model="production_model", tokenizer=tokenizer)
-
-result = generator(
-    [
-        {'role': 'system', 'content': SYSTEM_PROMPT},
-        {'role': 'user', 'content': "What is 15 + 27?"},
-    ],
-    max_new_tokens=256,
-    do_sample=True,
-    temperature=0.7,
-    top_p=0.9,
-)
-print(result[0]['generated_text'])
-```
-
-## Best practices checklist
-
-**Before training:**
- [ ] Validate dataset format (prompts as List[Dict])
- [ ] Test reward functions on sample data
- [ ] Calculate expected `max_prompt_length` from data
- [ ] Choose `num_generations` based on GPU memory
- [ ] Set up logging (wandb recommended)
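
One way to cover the "test reward functions on sample data" item is to call a reward function on hand-written completions before any training run. This sketch assumes the conversational completion format used in this guide (each completion is a list holding one `{'role', 'content'}` message); `format_reward` here is a simplified stand-in, not the template's exact function.

```python
import re


def format_reward(completions, **kwargs):
    """Return 0.5 when a completion contains the full reasoning/answer structure."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [comp[0]["content"] for comp in completions]
    return [0.5 if re.search(pattern, r, re.DOTALL) else 0.0 for r in responses]


# One well-formed completion and one that ignores the format
samples = [
    [{"role": "assistant", "content": "<reasoning>15 + 27 = 42</reasoning>\n<answer>42</answer>"}],
    [{"role": "assistant", "content": "The answer is 42."}],
]
rewards = format_reward(samples)
assert rewards == [0.5, 0.0], rewards
print(rewards)
```

The same pattern works for any reward function that accepts `completions` and returns one score per completion.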
-
-**During training:**
- [ ] Monitor reward progression (should increase)
- [ ] Check `reward_std` (should stay > 0.1)
- [ ] Watch for OOM errors (reduce batch size if needed)
- [ ] Sample generations every 50–100 steps
- [ ] Validate format compliance on holdout set
-
-**After training:**
- [ ] Merge LoRA weights if using PEFT
- [ ] Test on diverse prompts
- [ ] Compare to baseline model
- [ ] Document reward weights and hyperparameters
- [ ] Save reproducibility config
-
-## Troubleshooting
-
-### Debugging workflow
-1. **Isolate reward functions** — test each independently
-2. **Check data distribution** — ensure diversity in prompts
-3. **Reduce complexity** — start with single reward, add gradually
-4. **Monitor generations** — print samples every N steps
-5. **Validate extraction logic** — ensure answer parsing works
-
-### Quick debug reward
-```python
-def debug_reward(completions, **kwargs):
-    responses = [comp[0]['content'] for comp in completions]
-    for i, r in enumerate(responses[:2]):
-        print(f"Response {i}: {r[:200]}...")
-    return [1.0] * len(responses)
-
-# Test without training: call the reward function on a hand-written completion
-sample = [[{'role': 'assistant', 'content': '<answer>42</answer>'}]]
-print(debug_reward(completions=sample))  # [1.0]
-```
-
-## Template
-
-A production-ready training script lives at **`../templates/basic_grpo_training.py`**. It uses Qwen 2.5-1.5B-Instruct with LoRA and three reward functions (incremental format, strict format, correctness) on GSM8K. Copy and adapt:
-1. `get_dataset()` — swap in your data loader
-2. Reward functions — tune to your task
-3. `SYSTEM_PROMPT` — match your output format
-4. `GRPOConfig` — adjust hyperparameters for your GPU
-
-## References and resources
-
- TRL GRPO Trainer: https://huggingface.co/docs/trl/grpo_trainer
- GRPO paper (DeepSeek): https://arxiv.org/abs/2402.03300
- DeepSeek R1 paper: https://arxiv.org/abs/2501.12948
- Open R1 implementation: https://github.com/huggingface/open-r1
- TRL examples: https://github.com/huggingface/trl/tree/main/examples
- Unsloth (faster training): https://docs.unsloth.ai/
-
-## Critical reminders
-
- **Loss goes UP during training** — this is normal (it's KL divergence)
- **Use 3–5 reward functions** — single rewards often fail
- **Test rewards before training** — debug each function independently
- **Monitor `reward_std`** — should stay > 0.1 (avoid mode collapse)
- **Start with `num_generations=4–8`** — scale up if GPU allows
--- a/skills/mlops/training/trl-fine-tuning/references/online-rl.md
+++ b/skills/mlops/training/trl-fine-tuning/references/online-rl.md
@ -1,82 +0,0 @@
-# Online RL Methods
-
-Guide to online reinforcement learning with PPO, GRPO, RLOO, and OnlineDPO.
-
-## Overview
-
-Online RL generates completions during training and optimizes based on rewards.
-
-## PPO (Proximal Policy Optimization)
-
-Classic RL algorithm for LLM alignment.
-
-### Basic Usage
-
-```bash
-python -m trl.scripts.ppo \
-    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
-    --reward_model_path reward-model \
-    --dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
-    --output_dir model-ppo \
-    --learning_rate 3e-6 \
-    --per_device_train_batch_size 64 \
-    --total_episodes 10000 \
-    --num_ppo_epochs 4 \
-    --kl_coef 0.05
-```
-
-### Key Parameters
-
- `kl_coef`: KL penalty (0.05-0.2)
- `num_ppo_epochs`: Epochs per batch (2-4)
- `cliprange`: PPO clip (0.1-0.3)
- `vf_coef`: Value function coef (0.1)
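
To make the role of `cliprange` concrete, here is the clipped surrogate objective from the PPO paper evaluated for a single sample. This is a plain-Python sketch of the formula, not TRL's implementation; the function name `clipped_surrogate` is illustrative.

```python
def clipped_surrogate(ratio, advantage, cliprange=0.2):
    """PPO clipped objective for one sample (a quantity to be maximized).

    ratio     = pi_new(a|s) / pi_old(a|s)
    advantage = estimated advantage of the sampled action
    """
    clipped_ratio = max(1.0 - cliprange, min(ratio, 1.0 + cliprange))
    # Taking the minimum removes any incentive to push the ratio outside the clip range
    return min(ratio * advantage, clipped_ratio * advantage)


print(clipped_surrogate(1.5, 1.0))   # positive advantage: ratio is capped at 1 + cliprange
print(clipped_surrogate(0.5, -1.0))  # negative advantage: ratio is floored at 1 - cliprange
```

A larger `cliprange` lets each update move the policy further from the one that generated the samples.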
-
-## GRPO (Group Relative Policy Optimization)
-
-Memory-efficient online RL.
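
The memory saving comes from dropping PPO's separate value model: GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below shows that group normalization in plain Python. It assumes the population standard deviation and a small `eps`; real implementations may differ in both, and the function name is illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions within their group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


print(group_relative_advantages([0.0, 0.5, 2.0, 2.5]))
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # identical rewards give zero advantages
```

When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.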
-
-### Basic Usage
-
-```python
-from trl import GRPOTrainer, GRPOConfig
-from datasets import load_dataset
-
-# Define reward function
-def reward_func(completions, **kwargs):
-    return [float(len(set(c.split()))) for c in completions]  # Reward = number of unique words
-
-config = GRPOConfig(
-    output_dir="model-grpo",
-    num_generations=4,  # Completions per prompt
-    max_completion_length=128
-)
-
-trainer = GRPOTrainer(
-    model="Qwen/Qwen2-0.5B-Instruct",
-    reward_funcs=reward_func,
-    args=config,
-    train_dataset=load_dataset("trl-lib/tldr", split="train")
-)
-trainer.train()
-```
-
-### Key Parameters
-
- `num_generations`: 2-8 completions
- `max_completion_length`: 64-256
- Learning rate: 1e-5 to 1e-4
-
-## Memory Comparison
-
-| Method | Memory (7B) | Speed | Use Case |
-|--------|-------------|-------|----------|
-| PPO | 40GB | Medium | Maximum control |
-| GRPO | 24GB | Fast | **Memory-constrained** |
-| OnlineDPO | 28GB | Fast | No reward model |
-
-## References
-
- PPO paper: https://arxiv.org/abs/1707.06347
- GRPO paper: https://arxiv.org/abs/2402.03300
- TRL docs: https://huggingface.co/docs/trl/
--- a/skills/mlops/training/trl-fine-tuning/references/reward-modeling.md
+++ b/skills/mlops/training/trl-fine-tuning/references/reward-modeling.md
@ -1,122 +0,0 @@
-# Reward Modeling
-
-Guide to training reward models with TRL for RLHF pipelines.
-
-## Overview
-
-Reward models score completions based on human preferences. Used in:
- PPO training (RL feedback)
- GRPO online RL
- Completion ranking
-
-## Basic Training
-
-```python
-from transformers import AutoModelForSequenceClassification, AutoTokenizer
-from trl import RewardTrainer, RewardConfig
-from datasets import load_dataset
-
-# Load model (num_labels=1 for single reward score)
-model = AutoModelForSequenceClassification.from_pretrained(
-    "Qwen/Qwen2.5-0.5B-Instruct",
-    num_labels=1
-)
-tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
-
-# Load preference dataset (chosen/rejected pairs)
-dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
-
-# Configure
-config = RewardConfig(
-    output_dir="Qwen2.5-Reward",
-    per_device_train_batch_size=2,
-    num_train_epochs=1,
-    learning_rate=1e-5
-)
-
-# Train
-trainer = RewardTrainer(
-    model=model,
-    args=config,
-    processing_class=tokenizer,
-    train_dataset=dataset
-)
-trainer.train()
-```
-
-## Dataset Format
-
-Required fields:
-```json
-{
-  "prompt": "Question or instruction",
-  "chosen": "Better response",
-  "rejected": "Worse response"
-}
-```
-
-## Bradley-Terry Loss
-
-Default loss function:
-```
-loss = -log(sigmoid(reward_chosen - reward_rejected))
-```
-
-Learns to score chosen > rejected.
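
A small numeric check shows how the loss behaves. This sketch evaluates the formula above with the standard library only; the function name `bradley_terry_loss` and the scalar rewards are illustrative.

```python
import math


def bradley_terry_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); fine for the moderate margins used here
    return math.log1p(math.exp(-margin))


for chosen, rejected in [(2.0, 0.0), (0.0, 0.0), (0.0, 2.0)]:
    print(chosen, rejected, round(bradley_terry_loss(chosen, rejected), 4))
```

The loss is about 0.127 when the chosen response leads by 2, log 2 (about 0.693) when the scores tie, and about 2.127 when the ranking is reversed, so minimizing it pushes the margin in favor of the chosen response.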
-
-## Using Reward Models
-
-### Inference
-
-```python
-from transformers import pipeline
-
-# Load trained reward model
-reward_pipe = pipeline("text-classification", model="Qwen2.5-Reward")
-
-# Score completions
-texts = ["Good answer", "Bad answer"]
-scores = reward_pipe(texts)
-print(scores)  # Higher score = better
-```
-
-### In PPO
-
-```python
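-from transformers import AutoModelForSequenceClassification
-from trl import PPOTrainer, PPOConfig
-
-config = PPOConfig(
-    reward_model_path="Qwen2.5-Reward"  # Path to the trained reward model
-)
-
-# The reward model is loaded and passed explicitly; the exact PPOTrainer signature varies by TRL version
-reward_model = AutoModelForSequenceClassification.from_pretrained("Qwen2.5-Reward", num_labels=1)
-
-# policy_model, ref_model, value_model, tokenizer, and dataset are assumed to be loaded beforehand
-trainer = PPOTrainer(
-    args=config,
-    processing_class=tokenizer,
-    model=policy_model,
-    ref_model=ref_model,
-    reward_model=reward_model,
-    value_model=value_model,
-    train_dataset=dataset,
-)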
-```
-
-## Hyperparameters
-
-| Model Size | Learning Rate | Batch Size | Epochs |
-|------------|---------------|------------|--------|
-| <1B | 2e-5 | 4-8 | 1-2 |
-| 1-7B | 1e-5 | 2-4 | 1 |
-| 7-13B | 5e-6 | 1-2 | 1 |
-
-## Evaluation
-
-Check reward separation:
-```python
-# Chosen should score higher than rejected
-chosen_rewards = model(**chosen_inputs).logits
-rejected_rewards = model(**rejected_inputs).logits
-
-accuracy = (chosen_rewards > rejected_rewards).float().mean()
-print(f"Accuracy: {accuracy:.2%}")  # Target: >80%
-```
-
-## References
-
- InstructGPT paper: https://arxiv.org/abs/2203.02155
- TRL docs: https://huggingface.co/docs/trl/reward_trainer
--- a/skills/mlops/training/trl-fine-tuning/references/sft-training.md
+++ b/skills/mlops/training/trl-fine-tuning/references/sft-training.md
@ -1,168 +0,0 @@
-# SFT Training Guide
-
-Complete guide to Supervised Fine-Tuning (SFT) with TRL for instruction tuning and task-specific fine-tuning.
-
-## Overview
-
-SFT trains models on input-output pairs to minimize cross-entropy loss. Use for:
- Instruction following
- Task-specific fine-tuning
- Chatbot training
- Domain adaptation
-
-## Dataset Formats
-
-### Format 1: Prompt-Completion
-
-```json
-[
-  {
-    "prompt": "What is the capital of France?",
-    "completion": "The capital of France is Paris."
-  }
-]
-```
-
-### Format 2: Conversational (ChatML)
-
-```json
-[
-  {
-    "messages": [
-      {"role": "user", "content": "What is Python?"},
-      {"role": "assistant", "content": "Python is a programming language."}
-    ]
-  }
-]
-```
-
-### Format 3: Text-only
-
-```json
-[
-  {"text": "User: Hello\nAssistant: Hi! How can I help?"}
-]
-```
-
-## Basic Training
-
-```python
-from trl import SFTTrainer, SFTConfig
-from transformers import AutoModelForCausalLM, AutoTokenizer
-from datasets import load_dataset
-
-# Load model
-model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
-tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
-
-# Load dataset
-dataset = load_dataset("trl-lib/Capybara", split="train")
-
-# Configure
-config = SFTConfig(
-    output_dir="Qwen2.5-SFT",
-    per_device_train_batch_size=4,
-    num_train_epochs=1,
-    learning_rate=2e-5,
-    save_strategy="epoch"
-)
-
-# Train
-trainer = SFTTrainer(
-    model=model,
-    args=config,
-    train_dataset=dataset,
-    processing_class=tokenizer  # Named `tokenizer` in older TRL releases
-)
-trainer.train()
-```
-
-## Chat Templates
-
-Apply chat templates automatically:
-
-```python
-trainer = SFTTrainer(
-    model=model,
-    args=config,
-    train_dataset=dataset,  # Messages format
-    processing_class=tokenizer,  # Named `tokenizer` in older TRL releases
-    # Chat template applied automatically
-)
-```
-
-Or manually:
-```python
-def format_chat(example):
-    messages = example["messages"]
-    text = tokenizer.apply_chat_template(messages, tokenize=False)
-    return {"text": text}
-
-dataset = dataset.map(format_chat)
-```
-
-## Packing for Efficiency
-
-Pack multiple sequences into one to maximize GPU utilization:
-
-```python
-config = SFTConfig(
-    packing=True,  # Enable packing
-    max_seq_length=2048,
-    dataset_text_field="text"
-)
-```
-
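-**Benefits**: Up to 2-3× faster training when most examples are much shorter than `max_seq_length`, because less compute is spent on padding
-
-**Trade-off**: Several examples share one sequence, so batching is slightly more complex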
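
To illustrate the idea behind packing, the sketch below concatenates tokenized examples with an EOS separator and cuts the stream into fixed-size blocks. It is a simplified illustration, not TRL's packing implementation; the function name `pack_sequences` and the toy token IDs are assumptions.

```python
def pack_sequences(sequences, max_seq_length, eos_token_id):
    """Concatenate token-id lists (EOS-separated) and cut them into fixed-size blocks."""
    stream = []
    for seq in sequences:
        stream.extend(seq)
        stream.append(eos_token_id)  # mark the boundary between examples
    # Drop the trailing remainder that does not fill a whole block
    n_blocks = len(stream) // max_seq_length
    return [stream[i * max_seq_length:(i + 1) * max_seq_length] for i in range(n_blocks)]


packed = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_seq_length=4, eos_token_id=0)
print(packed)
```

Every block is completely filled, so no positions are wasted on padding, but a block can contain the end of one example and the start of the next.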
-
-## Multi-GPU Training
-
-```bash
-accelerate launch --num_processes 4 train_sft.py
-```
-
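-Each process applies the per-device settings from the config, so the effective batch size scales with `--num_processes`: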
-```python
-config = SFTConfig(
-    output_dir="model-sft",
-    per_device_train_batch_size=4,
-    gradient_accumulation_steps=4,
-    num_train_epochs=1
-)
-```
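
As a quick arithmetic check, the effective batch size for the launch command and config above can be computed directly. This sketch assumes plain data parallelism with one process per GPU; the helper name is illustrative.

```python
def effective_batch_size(num_processes, per_device_train_batch_size, gradient_accumulation_steps):
    """Number of examples contributing to each optimizer step under data parallelism."""
    return num_processes * per_device_train_batch_size * gradient_accumulation_steps


# 4 processes x 4 examples per device x 4 accumulation steps
print(effective_batch_size(4, 4, 4))
```

With these settings each optimizer step sees 64 examples, so a learning rate tuned on a single GPU may need to be revisited.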
-
-## LoRA Fine-Tuning
-
-```python
-from peft import LoraConfig
-
-lora_config = LoraConfig(
-    r=16,
-    lora_alpha=32,
-    target_modules="all-linear",
-    lora_dropout=0.05,
-    task_type="CAUSAL_LM"
-)
-
-trainer = SFTTrainer(
-    model=model,
-    args=config,
-    train_dataset=dataset,
-    peft_config=lora_config  # Add LoRA
-)
-```
-
-## Hyperparameters
-
-| Model Size | Learning Rate | Batch Size | Epochs |
-|------------|---------------|------------|--------|
-| <1B | 5e-5 | 8-16 | 1-3 |
-| 1-7B | 2e-5 | 4-8 | 1-2 |
-| 7-13B | 1e-5 | 2-4 | 1 |
-| 13B+ | 5e-6 | 1-2 | 1 |
-
-## References
-
- TRL docs: https://huggingface.co/docs/trl/sft_trainer
- Examples: https://github.com/huggingface/trl/tree/main/examples/scripts
--- a/skills/mlops/training/trl-fine-tuning/templates/basic_grpo_training.py
+++ b/skills/mlops/training/trl-fine-tuning/templates/basic_grpo_training.py
@ -1,228 +0,0 @@
-"""
-Basic GRPO Training Template
-=============================
-
-A minimal, production-ready template for GRPO training with TRL.
-Adapt this for your specific task by modifying:
-1. Dataset loading (get_dataset function)
-2. Reward functions (*_reward_func)
-3. System prompt (SYSTEM_PROMPT)
-4. Hyperparameters (GRPOConfig)
-"""
-
-import torch
-import re
-from datasets import load_dataset
-from transformers import AutoModelForCausalLM, AutoTokenizer
-from peft import LoraConfig
-from trl import GRPOTrainer, GRPOConfig
-
-# ==================== CONFIGURATION ====================
-
-MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
-OUTPUT_DIR = "outputs/grpo-model"
-MAX_PROMPT_LENGTH = 256
-MAX_COMPLETION_LENGTH = 512
-
-SYSTEM_PROMPT = """
-Respond in the following format:
-<reasoning>
-[Your step-by-step thinking]
-</reasoning>
-<answer>
-[Final answer]
-</answer>
-"""
-
-# ==================== DATASET ====================
-
-def get_dataset(split="train"):
-    """
-    Load and prepare your dataset.
-
-    Returns: Dataset with columns:
-    - 'prompt': List[Dict] with role/content
-    - 'answer': str (ground truth, optional)
-    """
-    # Example: GSM8K math dataset
-    data = load_dataset('openai/gsm8k', 'main')[split]
-
-    def process_example(x):
-        # Extract ground truth answer
-        answer = x['answer'].split('####')[1].strip() if '####' in x['answer'] else None
-
-        return {
-            'prompt': [
-                {'role': 'system', 'content': SYSTEM_PROMPT},
-                {'role': 'user', 'content': x['question']}
-            ],
-            'answer': answer
-        }
-
-    return data.map(process_example)
-
-# ==================== HELPER FUNCTIONS ====================
-
-def extract_xml_tag(text: str, tag: str) -> str:
-    """Extract content between XML tags."""
-    pattern = f'<{tag}>(.*?)</{tag}>'
-    match = re.search(pattern, text, re.DOTALL)
-    return match.group(1).strip() if match else ""
-
-def extract_answer(text: str) -> str:
-    """Extract the final answer from structured output."""
-    return extract_xml_tag(text, 'answer')
-
-# ==================== REWARD FUNCTIONS ====================
-
-def correctness_reward_func(prompts, completions, answer, **kwargs):
-    """
-    Reward correct answers.
-    Weight: 2.0 (highest priority)
-    """
-    responses = [comp[0]['content'] for comp in completions]
-    extracted = [extract_answer(r) for r in responses]
-    return [2.0 if ans == gt else 0.0 for ans, gt in zip(extracted, answer)]
-
-def format_reward_func(completions, **kwargs):
-    """
-    Reward proper XML format.
-    Weight: 0.5
-    """
-    pattern = r'<reasoning>.*?</reasoning>\s*<answer>.*?</answer>'
-    responses = [comp[0]['content'] for comp in completions]
-    return [0.5 if re.search(pattern, r, re.DOTALL) else 0.0 for r in responses]
-
-def incremental_format_reward_func(completions, **kwargs):
-    """
-    Incremental reward for partial format compliance.
-    Weight: up to 0.5
-    """
-    responses = [comp[0]['content'] for comp in completions]
-    rewards = []
-
-    for r in responses:
-        score = 0.0
-        if '<reasoning>' in r:
-            score += 0.125
-        if '</reasoning>' in r:
-            score += 0.125
-        if '<answer>' in r:
-            score += 0.125
-        if '</answer>' in r:
-            score += 0.125
-
-        # Penalize extra content after closing tag
-        if '</answer>' in r:
-            extra = r.split('</answer>')[-1].strip()
-            score -= len(extra) * 0.001
-
-        rewards.append(score)
-
-    return rewards
-
-# ==================== MODEL SETUP ====================
-
-def setup_model_and_tokenizer():
-    """Load model and tokenizer with optimizations."""
-    model = AutoModelForCausalLM.from_pretrained(
-        MODEL_NAME,
-        torch_dtype=torch.bfloat16,
-        attn_implementation="flash_attention_2",
-        device_map="auto"
-    )
-
-    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
-    tokenizer.pad_token = tokenizer.eos_token
-
-    return model, tokenizer
-
-def get_peft_config():
-    """LoRA configuration for parameter-efficient training."""
-    return LoraConfig(
-        r=16,
-        lora_alpha=32,
-        target_modules=[
-            "q_proj", "k_proj", "v_proj", "o_proj",
-            "gate_proj", "up_proj", "down_proj"
-        ],
-        task_type="CAUSAL_LM",
-        lora_dropout=0.05,
-    )
-
-# ==================== TRAINING ====================
-
-def main():
-    """Main training function."""
-
-    # Load data
-    print("Loading dataset...")
-    dataset = get_dataset()
-    print(f"Dataset size: {len(dataset)}")
-
-    # Setup model
-    print("Loading model...")
-    model, tokenizer = setup_model_and_tokenizer()
-
-    # Training configuration
-    training_args = GRPOConfig(
-        output_dir=OUTPUT_DIR,
-        run_name="grpo-training",
-
-        # Learning rate
-        learning_rate=5e-6,
-        adam_beta1=0.9,
-        adam_beta2=0.99,
-        weight_decay=0.1,
-        warmup_ratio=0.1,
-        lr_scheduler_type='cosine',
-
-        # Batch settings
-        per_device_train_batch_size=1,
-        gradient_accumulation_steps=4,
-
-        # GRPO specific
-        num_generations=8,
-        max_prompt_length=MAX_PROMPT_LENGTH,
-        max_completion_length=MAX_COMPLETION_LENGTH,
-
-        # Training duration
-        num_train_epochs=1,
-
-        # Optimization
-        bf16=True,
-        optim="adamw_8bit",
-        max_grad_norm=0.1,
-
-        # Logging
-        logging_steps=1,
-        save_steps=100,
-        report_to="wandb",  # Change to "none" to disable logging
-    )
-
-    # Initialize trainer
-    trainer = GRPOTrainer(
-        model=model,
-        processing_class=tokenizer,
-        reward_funcs=[
-            incremental_format_reward_func,
-            format_reward_func,
-            correctness_reward_func,
-        ],
-        args=training_args,
-        train_dataset=dataset,
-        peft_config=get_peft_config(),
-    )
-
-    # Train
-    print("Starting training...")
-    trainer.train()
-
-    # Save final model
-    print(f"Saving model to {OUTPUT_DIR}/final")
-    trainer.save_model(f"{OUTPUT_DIR}/final")
-
-    print("Training complete!")
-
-if __name__ == "__main__":
-    main()
--- a/skills/mlops/training/unsloth/SKILL.md
+++ b/skills/mlops/training/unsloth/SKILL.md
@ -1,84 +0,0 @@
---
-name: unsloth
-description: "Unsloth: 2-5x faster LoRA/QLoRA fine-tuning, less VRAM."
-version: 1.0.0
-author: Orchestra Research
-license: MIT
-dependencies: [unsloth, torch, transformers, trl, datasets, peft]
-platforms: [linux, macos]
-metadata:
-  hermes:
-    tags: [Fine-Tuning, Unsloth, Fast Training, LoRA, QLoRA, Memory-Efficient, Optimization, Llama, Mistral, Gemma, Qwen]
-
---
-
-# Unsloth Skill
-
-Comprehensive assistance with unsloth development, generated from official documentation.
-
-## When to Use This Skill
-
-This skill should be triggered when:
- Working with unsloth
- Asking about unsloth features or APIs
- Implementing unsloth solutions
- Debugging unsloth code
- Learning unsloth best practices
-
-## Quick Reference
-
-### Common Patterns
-
-*Quick reference patterns will be added as you use the skill.*
-
-## Reference Files
-
-This skill includes comprehensive documentation in `references/`:
-
- **llms-txt.md** - Llms-Txt documentation
-
-Use `view` to read specific reference files when detailed information is needed.
-
-## Working with This Skill
-
-### For Beginners
-Start with the getting_started or tutorials reference files for foundational concepts.
-
-### For Specific Features
-Use the appropriate category reference file (api, guides, etc.) for detailed information.
-
-### For Code Examples
-The quick reference section above contains common patterns extracted from the official docs.
-
-## Resources
-
-### references/
-Organized documentation extracted from official sources. These files contain:
- Detailed explanations
- Code examples with language annotations
- Links to original documentation
- Table of contents for quick navigation
-
-### scripts/
-Add helper scripts here for common automation tasks.
-
-### assets/
-Add templates, boilerplate, or example projects here.
-
-## Notes
-
- This skill was automatically generated from official documentation
- Reference files preserve the structure and examples from source docs
- Code examples include language detection for better syntax highlighting
- Quick reference patterns are extracted from common usage examples in the docs
-
-## Updating
-
-To refresh this skill with updated documentation:
-1. Re-run the scraper with the same configuration
-2. The skill will be rebuilt with the latest information
-
--- a/skills/mlops/training/unsloth/references/index.md
+++ b/skills/mlops/training/unsloth/references/index.md
@ -1,7 +0,0 @@
-# Unsloth Documentation Index
-
-## Categories
-
-### Llms-Txt
-**File:** `llms-txt.md`
-**Pages:** 136
--- a/skills/mlops/training/unsloth/references/llms-full.md
+++ b/skills/mlops/training/unsloth/references/llms-full.md
--- a/skills/mlops/training/unsloth/references/llms-txt.md
+++ b/skills/mlops/training/unsloth/references/llms-txt.md
--- a/skills/mlops/training/unsloth/references/llms.md
+++ b/skills/mlops/training/unsloth/references/llms.md
@ -1,82 +0,0 @@
-# Unsloth Documentation
-
-## Unsloth Documentation
-
- [Unsloth Docs](/get-started/unsloth-docs.md): Train your own model with Unsloth, an open-source framework for LLM fine-tuning and reinforcement learning.
- [Beginner? Start here!](/get-started/beginner-start-here.md)
- [Unsloth Requirements](/get-started/beginner-start-here/unsloth-requirements.md): Here are Unsloth's requirements including system and GPU VRAM requirements.
- [FAQ + Is Fine-tuning Right For Me?](/get-started/beginner-start-here/faq-+-is-fine-tuning-right-for-me.md): If you're unsure whether fine-tuning is right for you, see here! Learn about fine-tuning misconceptions, how it compares to RAG, and more.
- [Unsloth Notebooks](/get-started/unsloth-notebooks.md): Explore our catalog of Unsloth notebooks:
- [All Our Models](/get-started/all-our-models.md)
- [Install & Update](/get-started/install-and-update.md): Learn to install Unsloth locally or online.
- [Updating](/get-started/install-and-update/updating.md): To update or use an old version of Unsloth, follow the steps below:
- [Pip Install](/get-started/install-and-update/pip-install.md): To install Unsloth locally via Pip, follow the steps below:
- [Docker](/get-started/install-and-update/docker.md): Install Unsloth using our official Docker container
- [Windows Installation](/get-started/install-and-update/windows-installation.md): See how to install Unsloth on Windows with or without WSL.
- [AMD](/get-started/install-and-update/amd.md): Fine-tune with Unsloth on AMD GPUs.
- [Conda Install](/get-started/install-and-update/conda-install.md): To install Unsloth locally on Conda, follow the steps below:
- [Google Colab](/get-started/install-and-update/google-colab.md): To install and run Unsloth on Google Colab, follow the steps below:
- [Fine-tuning LLMs Guide](/get-started/fine-tuning-llms-guide.md): Learn all the basics and best practices of fine-tuning. Beginner-friendly.
- [What Model Should I Use?](/get-started/fine-tuning-llms-guide/what-model-should-i-use.md)
- [Datasets Guide](/get-started/fine-tuning-llms-guide/datasets-guide.md): Learn how to create & prepare a dataset for fine-tuning.
- [LoRA Hyperparameters Guide](/get-started/fine-tuning-llms-guide/lora-hyperparameters-guide.md): Optimal LoRA rank, alpha, number of epochs, batch size & gradient accumulation, QLoRA vs LoRA, target modules and more!
- [Tutorial: How to Finetune Llama-3 and Use In Ollama](/get-started/fine-tuning-llms-guide/tutorial-how-to-finetune-llama-3-and-use-in-ollama.md): Beginner's Guide for creating a customized personal assistant (like ChatGPT) to run locally on Ollama
- [Reinforcement Learning (RL) Guide](/get-started/reinforcement-learning-rl-guide.md): Learn all about Reinforcement Learning (RL) and how to train your own DeepSeek-R1 reasoning model with Unsloth using GRPO. A complete guide from beginner to advanced.
- [Tutorial: Train your own Reasoning model with GRPO](/get-started/reinforcement-learning-rl-guide/tutorial-train-your-own-reasoning-model-with-grpo.md): Beginner's Guide to transforming a model like Llama 3.1 (8B) into a reasoning model by using Unsloth and GRPO.
- [Advanced RL Documentation](/get-started/reinforcement-learning-rl-guide/advanced-rl-documentation.md): Advanced documentation settings when using Unsloth with GRPO.
- [Memory Efficient RL](/get-started/reinforcement-learning-rl-guide/memory-efficient-rl.md)
- [RL Reward Hacking](/get-started/reinforcement-learning-rl-guide/rl-reward-hacking.md): Learn what Reward Hacking in Reinforcement Learning is and how to counter it.
- [GSPO Reinforcement Learning](/get-started/reinforcement-learning-rl-guide/gspo-reinforcement-learning.md): Train with GSPO (Group Sequence Policy Optimization) RL in Unsloth.
- [Reinforcement Learning - DPO, ORPO & KTO](/get-started/reinforcement-learning-rl-guide/reinforcement-learning-dpo-orpo-and-kto.md): To use the reward modelling functions for DPO, GRPO, ORPO or KTO with Unsloth, follow the steps below:
- [DeepSeek-OCR: How to Run & Fine-tune](/new/deepseek-ocr-how-to-run-and-fine-tune.md): Guide on how to run and fine-tune DeepSeek-OCR locally.
- [How to Fine-tune LLMs with Unsloth & Docker](/new/how-to-fine-tune-llms-with-unsloth-and-docker.md): Learn how to fine-tune LLMs or do Reinforcement Learning (RL) with Unsloth's Docker image.
- [Vision Reinforcement Learning (VLM RL)](/new/vision-reinforcement-learning-vlm-rl.md): Train Vision/multimodal models via GRPO and RL with Unsloth!
- [gpt-oss Reinforcement Learning](/new/gpt-oss-reinforcement-learning.md)
- [Tutorial: How to Train gpt-oss with RL](/new/gpt-oss-reinforcement-learning/tutorial-how-to-train-gpt-oss-with-rl.md): Learn to train OpenAI gpt-oss with GRPO to autonomously beat 2048 locally or on Colab.
- [Unsloth Dynamic GGUFs on Aider Polyglot](/new/unsloth-dynamic-ggufs-on-aider-polyglot.md): Performance of Unsloth Dynamic GGUFs on Aider Polyglot Benchmarks
- [Qwen3-VL: How to Run & Fine-tune](/models/qwen3-vl-how-to-run-and-fine-tune.md): Learn to fine-tune and run Qwen3-VL locally with Unsloth.
- [gpt-oss: How to Run & Fine-tune](/models/gpt-oss-how-to-run-and-fine-tune.md): Run & fine-tune OpenAI's new open-source models!
- [Tutorial: How to Fine-tune gpt-oss](/models/gpt-oss-how-to-run-and-fine-tune/tutorial-how-to-fine-tune-gpt-oss.md): Learn step-by-step how to train OpenAI gpt-oss locally with Unsloth.
- [Long Context gpt-oss Training](/models/gpt-oss-how-to-run-and-fine-tune/long-context-gpt-oss-training.md)
- [GLM-4.6: How to Run Locally](/models/glm-4.6-how-to-run-locally.md): A guide on how to run Z.ai's new GLM-4.6 model on your own local device!
- [IBM Granite 4.0](/models/ibm-granite-4.0.md): How to run IBM Granite-4.0 with Unsloth GGUFs on llama.cpp, Ollama and how to fine-tune!
- [DeepSeek-V3.1: How to Run Locally](/models/deepseek-v3.1-how-to-run-locally.md): A guide on how to run DeepSeek-V3.1 and Terminus on your own local device!
- [Qwen3-Coder: How to Run Locally](/models/qwen3-coder-how-to-run-locally.md): Run Qwen3-Coder-30B-A3B-Instruct and 480B-A35B locally with Unsloth Dynamic quants.
- [Gemma 3: How to Run & Fine-tune](/models/gemma-3-how-to-run-and-fine-tune.md): How to run Gemma 3 effectively with our GGUFs on llama.cpp, Ollama, Open WebUI and how to fine-tune with Unsloth!
- [Gemma 3n: How to Run & Fine-tune](/models/gemma-3-how-to-run-and-fine-tune/gemma-3n-how-to-run-and-fine-tune.md): Run Google's new Gemma 3n locally with Dynamic GGUFs on llama.cpp, Ollama, Open WebUI and fine-tune with Unsloth!
- [Qwen3: How to Run & Fine-tune](/models/qwen3-how-to-run-and-fine-tune.md): Learn to run & fine-tune Qwen3 locally with Unsloth + our Dynamic 2.0 quants
- [Qwen3-2507](/models/qwen3-how-to-run-and-fine-tune/qwen3-2507.md): Run Qwen3-30B-A3B-2507 and 235B-A22B Thinking and Instruct versions locally on your device!
- [Tutorials: How To Fine-tune & Run LLMs](/models/tutorials-how-to-fine-tune-and-run-llms.md): Learn how to run and fine-tune models for optimal performance 100% locally with Unsloth.
- [DeepSeek-R1-0528: How to Run Locally](/models/tutorials-how-to-fine-tune-and-run-llms/deepseek-r1-0528-how-to-run-locally.md): A guide on how to run DeepSeek-R1-0528 including Qwen3 on your own local device!
- [Magistral: How to Run & Fine-tune](/models/tutorials-how-to-fine-tune-and-run-llms/magistral-how-to-run-and-fine-tune.md): Meet Magistral - Mistral's new reasoning models.
- [Llama 4: How to Run & Fine-tune](/models/tutorials-how-to-fine-tune-and-run-llms/llama-4-how-to-run-and-fine-tune.md): How to run Llama 4 locally using our dynamic GGUFs, which recover accuracy compared to standard quantization.
- [Kimi K2: How to Run Locally](/models/tutorials-how-to-fine-tune-and-run-llms/kimi-k2-how-to-run-locally.md): Guide on running Kimi K2 and Kimi-K2-Instruct-0905 on your own local device!
- [Grok 2](/models/tutorials-how-to-fine-tune-and-run-llms/grok-2.md): Run xAI's Grok 2 model locally!
- [Devstral: How to Run & Fine-tune](/models/tutorials-how-to-fine-tune-and-run-llms/devstral-how-to-run-and-fine-tune.md): Run and fine-tune Mistral Devstral 1.1, including Small-2507 and 2505.
- [DeepSeek-V3-0324: How to Run Locally](/models/tutorials-how-to-fine-tune-and-run-llms/deepseek-v3-0324-how-to-run-locally.md): How to run DeepSeek-V3-0324 locally using our dynamic quants, which recover accuracy
- [DeepSeek-R1: How to Run Locally](/models/tutorials-how-to-fine-tune-and-run-llms/deepseek-r1-how-to-run-locally.md): A guide on how you can run our 1.58-bit Dynamic Quants for DeepSeek-R1 using llama.cpp.
- [DeepSeek-R1 Dynamic 1.58-bit](/models/tutorials-how-to-fine-tune-and-run-llms/deepseek-r1-how-to-run-locally/deepseek-r1-dynamic-1.58-bit.md): See performance comparison tables for Unsloth's Dynamic GGUF Quants vs Standard IMatrix Quants.
- [QwQ-32B: How to Run effectively](/models/tutorials-how-to-fine-tune-and-run-llms/qwq-32b-how-to-run-effectively.md): How to run QwQ-32B effectively with our bug fixes and without endless generations + GGUFs.
- [Phi-4 Reasoning: How to Run & Fine-tune](/models/tutorials-how-to-fine-tune-and-run-llms/phi-4-reasoning-how-to-run-and-fine-tune.md): Learn to run & fine-tune Phi-4 reasoning models locally with Unsloth + our Dynamic 2.0 quants
- [Running & Saving Models](/basics/running-and-saving-models.md): Learn how to save your finetuned model so you can run it in your favorite inference engine.
- [Saving to GGUF](/basics/running-and-saving-models/saving-to-gguf.md): Saving models to 16bit for GGUF so you can use them with Ollama, Jan AI, Open WebUI and more!
- [Saving to Ollama](/basics/running-and-saving-models/saving-to-ollama.md)
- [Saving to vLLM for deployment](/basics/running-and-saving-models/saving-to-vllm-for-deployment.md): Saving models to 16bit for vLLM deployment and serving
- [Saving to SGLang for deployment](/basics/running-and-saving-models/saving-to-sglang-for-deployment.md): Saving models to 16bit for SGLang for deployment and serving
- [Unsloth Inference](/basics/running-and-saving-models/unsloth-inference.md): Learn how to run your finetuned model with Unsloth's faster inference.
- [Troubleshooting Inference](/basics/running-and-saving-models/troubleshooting-inference.md): What to check if you're experiencing issues when running or saving your model.
- [vLLM Engine Arguments](/basics/running-and-saving-models/vllm-engine-arguments.md)
- [LoRA Hot Swapping Guide](/basics/running-and-saving-models/lora-hot-swapping-guide.md)
- [Text-to-Speech (TTS) Fine-tuning](/basics/text-to-speech-tts-fine-tuning.md): Learn how to fine-tune TTS & STT voice models with Unsloth.
- [Unsloth Dynamic 2.0 GGUFs](/basics/unsloth-dynamic-2.0-ggufs.md): A big new upgrade to our Dynamic Quants!
- [Vision Fine-tuning](/basics/vision-fine-tuning.md): Learn how to fine-tune vision/multimodal LLMs with Unsloth
- [Fine-tuning LLMs with NVIDIA DGX Spark and Unsloth](/basics/fine-tuning-llms-with-nvidia-dgx-spark-and-unsloth.md): Tutorial on how to fine-tune and do reinforcement learning (RL) with OpenAI gpt-oss on NVIDIA DGX Spark.
- [Fine-tuning LLMs with Blackwell, RTX 50 series & Unsloth](/basics/fine-tuning-llms-with-blackwell-rtx-50-series-and-unsloth.md): Learn how to fine-tune LLMs on NVIDIA's Blackwell RTX 50 series and B200 GPUs with our step-by-step guide.
- [Multi-GPU Training with Unsloth](/basics/multi-gpu-training-with-unsloth.md): Learn how to fine-tune LLMs on multiple GPUs and parallelism with Unsloth.
- [Finetuning from Last Checkpoint](/basics/finetuning-from-last-checkpoint.md): Checkpointing allows you to save your finetuning progress so you can pause it and then continue.
- [Troubleshooting & FAQs](/basics/troubleshooting-and-faqs.md): Tips to solve issues, and frequently asked questions.
- [Chat Templates](/basics/chat-templates.md): Learn the fundamentals and customization options of chat templates, including Conversational, ChatML, ShareGPT, Alpaca formats, and more!
- [Quantization-Aware Training (QAT)](/basics/quantization-aware-training-qat.md): Quantize models to 4-bit with Unsloth and PyTorch to recover accuracy.
- [Unsloth Environment Flags](/basics/unsloth-environment-flags.md): Advanced flags that might be useful if your finetunes break or you want to turn features off.
- [Continued Pretraining](/basics/continued-pretraining.md): Also known as Continued Finetuning. Unsloth allows you to continually pretrain so a model can learn a new language.
- [Unsloth Benchmarks](/basics/unsloth-benchmarks.md): Unsloth recorded benchmarks on NVIDIA GPUs.