refactor: reorganize skills into sub-categories

The skills directory was getting disorganized — mlops alone had 40 skills in a flat list, and 12 categories were singletons with just one skill each. Code change: - prompt_builder.py: Support sub-categories in skill scanner. skills/mlops/training/axolotl/SKILL.md now shows as category 'mlops/training' instead of just 'mlops'. Backwards-compatible with existing flat structure. Split mlops (40 skills) into 7 sub-categories: - mlops/training (12): accelerate, axolotl, flash-attention, grpo-rl-training, peft, pytorch-fsdp, pytorch-lightning, simpo, slime, torchtitan, trl-fine-tuning, unsloth - mlops/inference (8): gguf, guidance, instructor, llama-cpp, obliteratus, outlines, tensorrt-llm, vllm - mlops/models (6): audiocraft, clip, llava, segment-anything, stable-diffusion, whisper - mlops/vector-databases (4): chroma, faiss, pinecone, qdrant - mlops/evaluation (5): huggingface-tokenizers, lm-evaluation-harness, nemo-curator, saelens, weights-and-biases - mlops/cloud (2): lambda-labs, modal - mlops/research (1): dspy Merged singleton categories: - gifs → media (gif-search joins youtube-content) - music-creation → media (heartmula, songsee) - diagramming → creative (excalidraw joins ascii-art) - ocr-and-documents → productivity - domain → research (domain-intel) - feeds → research (blogwatcher) - market-data → research (polymarket) Fixed misplaced skills: - mlops/code-review → software-development (not ML-specific) - mlops/ml-paper-writing → research (academic writing) Added DESCRIPTION.md files for all new/updated categories.
2026-04-25 00:51:20 +00:00 · 2026-03-09 03:35:53 -07:00 · 2026-03-09 03:35:53 -07:00 · 732c66b0f3
commit 732c66b0f3
parent d6c710706f
217 changed files with 39 additions and 4 deletions
--- a/skills/mlops/training/pytorch-lightning/references/hyperparameter-tuning.md
+++ b/skills/mlops/training/pytorch-lightning/references/hyperparameter-tuning.md
@ -0,0 +1,556 @@
+# Hyperparameter Tuning with PyTorch Lightning
+
+## Integration with Tuning Frameworks
+
+Lightning integrates seamlessly with popular hyperparameter tuning libraries.
+
+### 1. Ray Tune Integration
+
+**Installation**:
+```bash
+pip install ray[tune]
+pip install lightning
+```
+
+**Basic Ray Tune example**:
+
+```python
+import lightning as L
+from ray import tune
+from ray.tune.integration.pytorch_lightning import TuneReportCallback
+
+class LitModel(L.LightningModule):
+    def __init__(self, lr, batch_size):
+        super().__init__()
+        self.lr = lr
+        self.batch_size = batch_size
+        self.model = nn.Sequential(nn.Linear(10, 128), nn.ReLU(), nn.Linear(128, 1))
+
+    def training_step(self, batch, batch_idx):
+        loss = self.model(batch).mean()
+        self.log('train_loss', loss)
+        return loss
+
+    def validation_step(self, batch, batch_idx):
+        val_loss = self.model(batch).mean()
+        self.log('val_loss', val_loss)
+
+    def configure_optimizers(self):
+        return torch.optim.Adam(self.parameters(), lr=self.lr)
+
+def train_fn(config):
+    """Training function for Ray Tune."""
+    model = LitModel(lr=config["lr"], batch_size=config["batch_size"])
+
+    # Add callback to report metrics to Tune
+    trainer = L.Trainer(
+        max_epochs=10,
+        callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")]
+    )
+
+    trainer.fit(model, train_loader, val_loader)
+
+# Define search space
+config = {
+    "lr": tune.loguniform(1e-5, 1e-1),
+    "batch_size": tune.choice([16, 32, 64, 128])
+}
+
+# Run hyperparameter search
+analysis = tune.run(
+    train_fn,
+    config=config,
+    num_samples=20,  # 20 trials
+    resources_per_trial={"gpu": 1}
+)
+
+# Best hyperparameters
+best_config = analysis.get_best_config(metric="loss", mode="min")
+print(f"Best config: {best_config}")
+```
+
+**Advanced: Population-Based Training (PBT)**:
+
+```python
+from ray.tune.schedulers import PopulationBasedTraining
+
+# PBT scheduler
+scheduler = PopulationBasedTraining(
+    time_attr='training_iteration',
+    metric='val_loss',
+    mode='min',
+    perturbation_interval=5,  # Perturb every 5 epochs
+    hyperparam_mutations={
+        "lr": tune.loguniform(1e-5, 1e-1),
+        "batch_size": [16, 32, 64, 128]
+    }
+)
+
+analysis = tune.run(
+    train_fn,
+    config=config,
+    num_samples=8,  # Population size
+    scheduler=scheduler,
+    resources_per_trial={"gpu": 1}
+)
+```
+
+### 2. Optuna Integration
+
+**Installation**:
+```bash
+pip install optuna
+pip install optuna-integration
+```
+
+**Optuna example**:
+
+```python
+import optuna
+from optuna.integration import PyTorchLightningPruningCallback
+
+def objective(trial):
+    # Suggest hyperparameters
+    lr = trial.suggest_loguniform('lr', 1e-5, 1e-1)
+    batch_size = trial.suggest_categorical('batch_size', [16, 32, 64, 128])
+    n_layers = trial.suggest_int('n_layers', 1, 3)
+    hidden_size = trial.suggest_int('hidden_size', 64, 512, step=64)
+
+    # Create model
+    model = LitModel(lr=lr, n_layers=n_layers, hidden_size=hidden_size)
+
+    # Pruning callback (early stopping for bad trials)
+    pruning_callback = PyTorchLightningPruningCallback(trial, monitor="val_loss")
+
+    trainer = L.Trainer(
+        max_epochs=20,
+        callbacks=[pruning_callback],
+        enable_progress_bar=False,
+        logger=False
+    )
+
+    trainer.fit(model, train_loader, val_loader)
+
+    return trainer.callback_metrics["val_loss"].item()
+
+# Create study
+study = optuna.create_study(
+    direction='minimize',
+    pruner=optuna.pruners.MedianPruner()  # Prune bad trials early
+)
+
+# Optimize
+study.optimize(objective, n_trials=50, timeout=3600)
+
+# Best params
+print(f"Best trial: {study.best_trial.params}")
+print(f"Best value: {study.best_value}")
+
+# Visualization
+optuna.visualization.plot_optimization_history(study).show()
+optuna.visualization.plot_param_importances(study).show()
+```
+
+**Optuna with distributed training**:
+
+```python
+import optuna
+
+# Shared database for distributed optimization
+storage = optuna.storages.RDBStorage(
+    url='postgresql://user:pass@localhost/optuna'
+)
+
+study = optuna.create_study(
+    study_name='distributed_study',
+    storage=storage,
+    load_if_exists=True,
+    direction='minimize'
+)
+
+# Run on multiple machines
+study.optimize(objective, n_trials=50)
+```
+
+### 3. Weights & Biases (WandB) Sweeps
+
+**Installation**:
+```bash
+pip install wandb
+```
+
+**WandB sweep config** (`sweep.yaml`):
+```yaml
+program: train.py
+method: bayes
+metric:
+  name: val_loss
+  goal: minimize
+parameters:
+  lr:
+    distribution: log_uniform_values
+    min: 0.00001
+    max: 0.1
+  batch_size:
+    values: [16, 32, 64, 128]
+  optimizer:
+    values: ['adam', 'sgd', 'adamw']
+  dropout:
+    distribution: uniform
+    min: 0.0
+    max: 0.5
+```
+
+**Training script** (`train.py`):
+```python
+import wandb
+import lightning as L
+from lightning.pytorch.loggers import WandbLogger
+
+def train():
+    # Initialize wandb
+    wandb.init()
+    config = wandb.config
+
+    # Create model with sweep params
+    model = LitModel(
+        lr=config.lr,
+        batch_size=config.batch_size,
+        optimizer=config.optimizer,
+        dropout=config.dropout
+    )
+
+    # WandB logger
+    wandb_logger = WandbLogger(project='hyperparameter-sweep')
+
+    trainer = L.Trainer(
+        max_epochs=20,
+        logger=wandb_logger
+    )
+
+    trainer.fit(model, train_loader, val_loader)
+
+if __name__ == '__main__':
+    train()
+```
+
+**Launch sweep**:
+```bash
+# Initialize sweep
+wandb sweep sweep.yaml
+# Output: wandb: Created sweep with ID: abc123
+
+# Run agent (can run on multiple machines)
+wandb agent your-entity/your-project/abc123
+```
+
+### 4. Hyperopt Integration
+
+**Installation**:
+```bash
+pip install hyperopt
+```
+
+**Hyperopt example**:
+
+```python
+from hyperopt import hp, fmin, tpe, Trials
+
+def objective(params):
+    model = LitModel(
+        lr=params['lr'],
+        batch_size=int(params['batch_size']),
+        hidden_size=int(params['hidden_size'])
+    )
+
+    trainer = L.Trainer(
+        max_epochs=10,
+        enable_progress_bar=False,
+        logger=False
+    )
+
+    trainer.fit(model, train_loader, val_loader)
+
+    # Return loss (minimize)
+    return trainer.callback_metrics["val_loss"].item()
+
+# Define search space
+space = {
+    'lr': hp.loguniform('lr', np.log(1e-5), np.log(1e-1)),
+    'batch_size': hp.quniform('batch_size', 16, 128, 16),
+    'hidden_size': hp.quniform('hidden_size', 64, 512, 64)
+}
+
+# Optimize
+trials = Trials()
+best = fmin(
+    fn=objective,
+    space=space,
+    algo=tpe.suggest,  # Tree-structured Parzen Estimator
+    max_evals=50,
+    trials=trials
+)
+
+print(f"Best hyperparameters: {best}")
+```
+
+## Built-In Lightning Tuning
+
+### Auto Learning Rate Finder
+
+```python
+class LitModel(L.LightningModule):
+    def __init__(self, lr=1e-3):
+        super().__init__()
+        self.lr = lr
+        self.model = nn.Linear(10, 1)
+
+    def configure_optimizers(self):
+        return torch.optim.Adam(self.parameters(), lr=self.lr)
+
+    def training_step(self, batch, batch_idx):
+        loss = self.model(batch).mean()
+        return loss
+
+# Find optimal learning rate
+model = LitModel()
+trainer = L.Trainer(auto_lr_find=True)
+
+# This runs LR finder before training
+trainer.tune(model, train_loader)
+
+# Or manually
+from lightning.pytorch.tuner import Tuner
+tuner = Tuner(trainer)
+lr_finder = tuner.lr_find(model, train_loader)
+
+# Plot results
+fig = lr_finder.plot(suggest=True)
+fig.show()
+
+# Get suggested LR
+suggested_lr = lr_finder.suggestion()
+print(f"Suggested LR: {suggested_lr}")
+
+# Update model
+model.lr = suggested_lr
+
+# Train with optimal LR
+trainer.fit(model, train_loader)
+```
+
+### Auto Batch Size Finder
+
+```python
+class LitModel(L.LightningModule):
+    def __init__(self, batch_size=32):
+        super().__init__()
+        self.batch_size = batch_size
+        self.model = nn.Linear(10, 1)
+
+    def train_dataloader(self):
+        return DataLoader(dataset, batch_size=self.batch_size)
+
+model = LitModel()
+trainer = L.Trainer(auto_scale_batch_size='binsearch')
+
+# Find optimal batch size
+trainer.tune(model)
+
+print(f"Optimal batch size: {model.batch_size}")
+
+# Train with optimal batch size
+trainer.fit(model, train_loader)
+```
+
+## Advanced Tuning Strategies
+
+### 1. Multi-Fidelity Optimization (Successive Halving)
+
+```python
+from ray.tune.schedulers import ASHAScheduler
+
+# ASHA: Asynchronous Successive Halving Algorithm
+scheduler = ASHAScheduler(
+    max_t=100,  # Max epochs
+    grace_period=10,  # Min epochs before stopping
+    reduction_factor=2  # Halve resources each round
+)
+
+analysis = tune.run(
+    train_fn,
+    config=config,
+    num_samples=64,
+    scheduler=scheduler,
+    resources_per_trial={"gpu": 1}
+)
+```
+
+**How it works**:
+- Start 64 trials
+- After 10 epochs, stop bottom 50% (32 trials remain)
+- After 20 epochs, stop bottom 50% (16 trials remain)
+- After 40 epochs, stop bottom 50% (8 trials remain)
+- After 80 epochs, stop bottom 50% (4 trials remain)
+- Run remaining 4 trials to completion (100 epochs)
+
+### 2. Bayesian Optimization
+
+```python
+from ray.tune.search.bayesopt import BayesOptSearch
+
+search = BayesOptSearch(
+    metric="val_loss",
+    mode="min"
+)
+
+analysis = tune.run(
+    train_fn,
+    config=config,
+    num_samples=50,
+    search_alg=search,
+    resources_per_trial={"gpu": 1}
+)
+```
+
+### 3. Grid Search
+
+```python
+from ray import tune
+
+# Exhaustive grid search
+config = {
+    "lr": tune.grid_search([1e-5, 1e-4, 1e-3, 1e-2]),
+    "batch_size": tune.grid_search([16, 32, 64, 128]),
+    "optimizer": tune.grid_search(['adam', 'sgd', 'adamw'])
+}
+
+# Total trials: 4 × 4 × 3 = 48
+analysis = tune.run(train_fn, config=config)
+```
+
+### 4. Random Search
+
+```python
+config = {
+    "lr": tune.loguniform(1e-5, 1e-1),
+    "batch_size": tune.choice([16, 32, 64, 128]),
+    "dropout": tune.uniform(0.0, 0.5),
+    "hidden_size": tune.randint(64, 512)
+}
+
+# Random sampling
+analysis = tune.run(
+    train_fn,
+    config=config,
+    num_samples=100  # 100 random samples
+)
+```
+
+## Best Practices
+
+### 1. Start Simple
+
+```python
+# Phase 1: Coarse search (fast)
+coarse_config = {
+    "lr": tune.loguniform(1e-5, 1e-1),
+    "batch_size": tune.choice([32, 64])
+}
+coarse_analysis = tune.run(train_fn, config=coarse_config, num_samples=10, max_epochs=5)
+
+# Phase 2: Fine-tune around best (slow)
+best_lr = coarse_analysis.best_config["lr"]
+fine_config = {
+    "lr": tune.uniform(best_lr * 0.5, best_lr * 2),
+    "batch_size": tune.choice([16, 32, 64, 128])
+}
+fine_analysis = tune.run(train_fn, config=fine_config, num_samples=20, max_epochs=20)
+```
+
+### 2. Use Checkpointing
+
+```python
+def train_fn(config, checkpoint_dir=None):
+    model = LitModel(lr=config["lr"])
+
+    trainer = L.Trainer(
+        max_epochs=100,
+        callbacks=[
+            TuneReportCheckpointCallback(
+                metrics={"loss": "val_loss"},
+                filename="checkpoint",
+                on="validation_end"
+            )
+        ]
+    )
+
+    # Resume from checkpoint if exists
+    ckpt_path = None
+    if checkpoint_dir:
+        ckpt_path = os.path.join(checkpoint_dir, "checkpoint")
+
+    trainer.fit(model, train_loader, val_loader, ckpt_path=ckpt_path)
+```
+
+### 3. Monitor Resource Usage
+
+```python
+import GPUtil
+
+def train_fn(config):
+    # Before training
+    GPUs = GPUtil.getGPUs()
+    print(f"GPU memory before: {GPUs[0].memoryUsed} MB")
+
+    # Train
+    model = LitModel(lr=config["lr"], batch_size=config["batch_size"])
+    trainer.fit(model, train_loader)
+
+    # After training
+    GPUs = GPUtil.getGPUs()
+    print(f"GPU memory after: {GPUs[0].memoryUsed} MB")
+```
+
+## Common Issues
+
+### Issue: Trials Running Out of Memory
+
+**Solution**: Reduce concurrent trials or batch size
+```python
+analysis = tune.run(
+    train_fn,
+    config=config,
+    resources_per_trial={"gpu": 0.5},  # 2 trials per GPU
+    max_concurrent_trials=2  # Limit concurrent trials
+)
+```
+
+### Issue: Slow Hyperparameter Search
+
+**Solution**: Use early stopping scheduler
+```python
+from ray.tune.schedulers import ASHAScheduler
+
+scheduler = ASHAScheduler(
+    max_t=100,
+    grace_period=5,  # Stop bad trials after 5 epochs
+    reduction_factor=3
+)
+```
+
+### Issue: Can't Reproduce Best Trial
+
+**Solution**: Set seeds in training function
+```python
+def train_fn(config):
+    L.seed_everything(42, workers=True)
+    # Rest of training...
+```
+
+## Resources
+
+- Ray Tune + Lightning: https://docs.ray.io/en/latest/tune/examples/tune-pytorch-lightning.html
+- Optuna: https://optuna.readthedocs.io/
+- WandB Sweeps: https://docs.wandb.ai/guides/sweeps
+- Lightning Tuner: https://lightning.ai/docs/pytorch/stable/tuning.html