mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
feat(gateway): skill-aware slash commands, paginated /commands, Telegram 100-cap (#3934)
* feat(gateway): skill-aware slash commands, paginated /commands, Telegram 100-cap

  Map active skills to Telegram's slash command menu so users can discover and invoke skills directly. Three changes:

  1. The Telegram menu now includes active skill commands alongside built-in commands, capped at 100 entries (the Telegram Bot API limit). Overflow commands remain callable but hidden from the picker. Logged at startup when the cap is hit.
  2. New /commands [page] gateway command for paginated browsing of all commands + skills. /help now shows the first 10 skill commands and points to /commands for the full list.
  3. When a user types a slash command that matches a disabled or uninstalled skill, they get actionable guidance:
     - Disabled: 'Enable it with: hermes skills config'
     - Optional (not installed): 'Install with: hermes skills install official/<path>'

  Built on ideas from PR #3921 by @kshitijk4poor.

* chore: move 21 niche skills to optional-skills

  Move specialized/niche skills from built-in (skills/) to optional (optional-skills/) to reduce the default skill count. Users can install them with: hermes skills install official/<category>/<name>

  Moved skills (21):
  - mlops: accelerate, chroma, faiss, flash-attention, hermes-atropos-environments, huggingface-tokenizers, instructor, lambda-labs, llava, nemo-curator, pinecone, pytorch-lightning, qdrant, saelens, simpo, slime, tensorrt-llm, torchtitan
  - research: domain-intel, duckduckgo-search
  - devops: inference-sh cli

  Built-in skills: 96 → 75
  Optional skills: 22 → 43

* fix: only include repo built-in skills in Telegram menu, not user-installed

  User-installed skills (from the hub or manually added) stay accessible via /skills and by typing the command directly, but don't get registered in the Telegram slash command picker. Only skills whose SKILL.md is under the repo's skills/ directory are included in the menu.

  This keeps the Telegram menu focused on the curated built-in set while user-installed skills remain discoverable through /skills and /commands.
This commit is contained in:
parent
97d6813f51
commit
5ceed021dc
73 changed files with 163 additions and 4 deletions
optional-skills/mlops/pytorch-lightning/references/callbacks.md (new file, 436 lines)
@@ -0,0 +1,436 @@
# PyTorch Lightning Callbacks

## Overview

Callbacks add functionality to training without modifying the LightningModule. They capture **non-essential logic** like checkpointing, early stopping, and logging.
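
Conceptually, the Trainer owns the list of callbacks and invokes the matching hook on each of them at every stage of the loop. A rough illustrative sketch of that dispatch (not Lightning's actual implementation):

```python
# Illustrative only: at each hook point the Trainer loops over its
# registered callbacks and calls the corresponding method on each one.
for callback in trainer_callbacks:
    callback.on_train_epoch_end(trainer, pl_module)
```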

## Built-In Callbacks

### 1. ModelCheckpoint

**Saves the best models during training**:

```python
import lightning as L
from lightning.pytorch.callbacks import ModelCheckpoint

# Save top 3 models based on validation loss
checkpoint = ModelCheckpoint(
    dirpath='checkpoints/',
    filename='model-{epoch:02d}-{val_loss:.2f}',
    monitor='val_loss',
    mode='min',
    save_top_k=3,
    save_last=True,  # Also save the last epoch
    verbose=True
)

trainer = L.Trainer(callbacks=[checkpoint])
trainer.fit(model, train_loader, val_loader)
```

**Configuration options**:

```python
checkpoint = ModelCheckpoint(
    monitor='val_acc',                      # Metric to monitor
    mode='max',                             # 'max' for accuracy, 'min' for loss
    save_top_k=5,                           # Keep the best 5 models
    save_last=True,                         # Save the last epoch separately
    every_n_epochs=1,                       # Save every N epochs
    save_on_train_epoch_end=False,          # Save on validation end instead
    filename='best-{epoch}-{val_acc:.3f}',  # Naming pattern
    auto_insert_metric_name=False           # Don't auto-add metric name to filename
)
```

**Load checkpoint**:

```python
# Load the best model
best_model_path = checkpoint.best_model_path
model = LitModel.load_from_checkpoint(best_model_path)

# Resume training
trainer = L.Trainer(callbacks=[checkpoint])
trainer.fit(model, train_loader, val_loader, ckpt_path='checkpoints/last.ckpt')
```
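
Besides `best_model_path`, the `ModelCheckpoint` instance also tracks the best score and the full set of kept checkpoints, which is convenient for reporting after `fit`:

```python
# Inspect what the callback tracked during training
print(checkpoint.best_model_path)   # Path to the best checkpoint
print(checkpoint.best_model_score)  # Monitored metric value at that checkpoint
print(checkpoint.best_k_models)     # Dict mapping checkpoint path -> score
```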

### 2. EarlyStopping

**Stops training when a metric stops improving**:

```python
import lightning as L
from lightning.pytorch.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=5,                      # Wait 5 epochs without improvement
    mode='min',
    min_delta=0.001,                 # Minimum change to qualify as improvement
    verbose=True,
    strict=True,                     # Crash if the monitored metric is not found
    check_on_train_epoch_end=False   # Check on validation end instead
)

trainer = L.Trainer(callbacks=[early_stop])
trainer.fit(model, train_loader, val_loader)
# Stops automatically if no improvement for 5 epochs
```

**Advanced usage**:

```python
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,
    min_delta=0.0,
    verbose=True,
    mode='min',
    stopping_threshold=0.1,     # Stop once val_loss < 0.1
    divergence_threshold=5.0,   # Stop if val_loss > 5.0
    check_finite=True           # Stop on NaN/Inf
)
```

### 3. LearningRateMonitor

**Logs the learning rate**:

```python
import lightning as L
from lightning.pytorch.callbacks import LearningRateMonitor

lr_monitor = LearningRateMonitor(
    logging_interval='epoch',  # Or 'step'
    log_momentum=True          # Also log momentum
)

trainer = L.Trainer(callbacks=[lr_monitor])
# Learning rate is automatically logged to TensorBoard/WandB
```

### 4. TQDMProgressBar

**Customizes the progress bar**:

```python
import lightning as L
from lightning.pytorch.callbacks import TQDMProgressBar

progress_bar = TQDMProgressBar(
    refresh_rate=10,    # Update every 10 batches
    process_position=0  # Position when multiple bars are shown
)

trainer = L.Trainer(callbacks=[progress_bar])
```

### 5. GradientAccumulationScheduler

**Dynamic gradient accumulation**:

```python
import lightning as L
from lightning.pytorch.callbacks import GradientAccumulationScheduler

# Accumulate fewer gradients as training progresses
accumulator = GradientAccumulationScheduler(
    scheduling={
        0: 8,   # Epochs 0-4: accumulate 8 batches
        5: 4,   # Epochs 5-9: accumulate 4 batches
        10: 2   # Epochs 10+: accumulate 2 batches
    }
)

trainer = L.Trainer(callbacks=[accumulator])
```
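
If the accumulation factor does not need to change over training, the same effect is available directly from the Trainer flag, without a callback:

```python
import lightning as L

# Fixed accumulation: the optimizer steps every 8 batches,
# so the effective batch size is 8x the dataloader batch size
trainer = L.Trainer(accumulate_grad_batches=8)
```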

### 6. StochasticWeightAveraging (SWA)

**Averages weights for better generalization**:

```python
import lightning as L
from lightning.pytorch.callbacks import StochasticWeightAveraging

swa = StochasticWeightAveraging(
    swa_lrs=1e-2,              # SWA learning rate
    swa_epoch_start=0.8,       # Start at 80% of training
    annealing_epochs=10,       # Annealing period
    annealing_strategy='cos'   # 'cos' or 'linear'
)

trainer = L.Trainer(callbacks=[swa])
```

## Custom Callbacks

### Basic Custom Callback

```python
import lightning as L
from lightning.pytorch.callbacks import Callback

class PrintingCallback(Callback):
    def on_train_start(self, trainer, pl_module):
        print("Training is starting!")

    def on_train_end(self, trainer, pl_module):
        print("Training is done!")

    def on_train_epoch_end(self, trainer, pl_module):
        print(f"Epoch {trainer.current_epoch} ended")

# Use it
trainer = L.Trainer(callbacks=[PrintingCallback()])
```

### Advanced Custom Callback

```python
class MetricsCallback(Callback):
    """Logs custom metrics every N batches."""

    def __init__(self, log_every_n_batches=100):
        self.log_every_n_batches = log_every_n_batches
        self.metrics = []

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx % self.log_every_n_batches == 0:
            # Compute custom metric
            metric = self.compute_metric(outputs)
            self.metrics.append(metric)

            # Log to Lightning
            pl_module.log('custom_metric', metric)

    def compute_metric(self, outputs):
        # Your custom logic (assumes training_step returns a dict with 'loss')
        return outputs['loss'].item()

    def state_dict(self):
        """Save callback state in the checkpoint."""
        return {'metrics': self.metrics}

    def load_state_dict(self, state_dict):
        """Restore callback state from the checkpoint."""
        self.metrics = state_dict['metrics']
```
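
Because the callback implements `state_dict`/`load_state_dict`, its collected metrics are saved inside Trainer checkpoints and restored on resume. A minimal usage sketch (the checkpoint path is illustrative):

```python
import lightning as L

metrics_cb = MetricsCallback(log_every_n_batches=50)
trainer = L.Trainer(callbacks=[metrics_cb], max_epochs=10)
trainer.fit(model, train_loader, val_loader)

# Resuming restores metrics_cb.metrics along with the model weights
trainer = L.Trainer(callbacks=[MetricsCallback(log_every_n_batches=50)])
trainer.fit(model, train_loader, val_loader, ckpt_path='checkpoints/last.ckpt')
```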

### Gradient Monitoring Callback

```python
class GradientMonitorCallback(Callback):
    """Monitor gradient norms."""

    def on_after_backward(self, trainer, pl_module):
        # Compute the global L2 gradient norm
        total_norm = 0.0
        for p in pl_module.parameters():
            if p.grad is not None:
                param_norm = p.grad.detach().norm(2)
                total_norm += param_norm.item() ** 2
        total_norm = total_norm ** 0.5

        # Log
        pl_module.log('grad_norm', total_norm)

        # Warn if exploding
        if total_norm > 100:
            print(f"Warning: large gradient norm: {total_norm:.2f}")
```

### Model Inspection Callback

```python
class ModelInspectionCallback(Callback):
    """Inspect model activations during training."""

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        if batch_idx == 0:  # First batch of each epoch
            # Register hooks
            self.activations = {}
            self.handles = []

            def get_activation(name):
                def hook(module, args, output):
                    self.activations[name] = output.detach()
                return hook

            # Attach to specific layers (assumes pl_module.model has layer1/layer2)
            self.handles.append(
                pl_module.model.layer1.register_forward_hook(get_activation('layer1')))
            self.handles.append(
                pl_module.model.layer2.register_forward_hook(get_activation('layer2')))

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx == 0:
            # Remove the hooks so they don't accumulate across epochs
            for handle in self.handles:
                handle.remove()

            # Log activation statistics
            for name, activation in self.activations.items():
                pl_module.log(f'{name}_mean', activation.mean().item())
                pl_module.log(f'{name}_std', activation.std().item())
```

## Callback Hooks

**All available hooks**:

```python
class MyCallback(Callback):
    # Setup/Teardown
    def setup(self, trainer, pl_module, stage):
        """Called at the beginning of fit/test/predict."""
        pass

    def teardown(self, trainer, pl_module, stage):
        """Called at the end of fit/test/predict."""
        pass

    # Training
    def on_train_start(self, trainer, pl_module):
        pass

    def on_train_epoch_start(self, trainer, pl_module):
        pass

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        pass

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        pass

    def on_train_epoch_end(self, trainer, pl_module):
        pass

    def on_train_end(self, trainer, pl_module):
        pass

    # Validation
    def on_validation_start(self, trainer, pl_module):
        pass

    def on_validation_epoch_start(self, trainer, pl_module):
        pass

    def on_validation_batch_start(self, trainer, pl_module, batch, batch_idx, dataloader_idx=0):
        pass

    def on_validation_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=0):
        pass

    def on_validation_epoch_end(self, trainer, pl_module):
        pass

    def on_validation_end(self, trainer, pl_module):
        pass

    # Test (same structure as validation)
    def on_test_start(self, trainer, pl_module):
        pass
    # ... (on_test_epoch_start, on_test_batch_start, etc.)

    # Predict
    def on_predict_start(self, trainer, pl_module):
        pass
    # ... (on_predict_epoch_start, on_predict_batch_start, etc.)

    # Backward
    def on_before_backward(self, trainer, pl_module, loss):
        pass

    def on_after_backward(self, trainer, pl_module):
        pass

    # Optimizer
    def on_before_optimizer_step(self, trainer, pl_module, optimizer):
        pass

    # Checkpointing
    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        """Add data to the checkpoint."""
        pass

    def on_load_checkpoint(self, trainer, pl_module, checkpoint):
        """Restore data from the checkpoint."""
        pass
```

## Combining Multiple Callbacks

```python
import lightning as L
from lightning.pytorch.callbacks import ModelCheckpoint, EarlyStopping, LearningRateMonitor

# Create all callbacks
checkpoint = ModelCheckpoint(monitor='val_loss', mode='min', save_top_k=3)
early_stop = EarlyStopping(monitor='val_loss', patience=5)
lr_monitor = LearningRateMonitor(logging_interval='epoch')
custom_callback = MyCustomCallback()

# Add them all to the Trainer
trainer = L.Trainer(
    callbacks=[checkpoint, early_stop, lr_monitor, custom_callback]
)

trainer.fit(model, train_loader, val_loader)
```

**Execution order**: Callbacks execute in the order they are added to the list.
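
A quick way to see the ordering (one caveat: Lightning moves checkpoint callbacks such as `ModelCheckpoint` to the end of the list so they run after all others):

```python
import lightning as L
from lightning.pytorch.callbacks import Callback

class First(Callback):
    def on_train_start(self, trainer, pl_module):
        print("first")

class Second(Callback):
    def on_train_start(self, trainer, pl_module):
        print("second")

# Prints "first" then "second" when training starts
trainer = L.Trainer(callbacks=[First(), Second()])
```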

## Best Practices

### 1. Keep Callbacks Independent

**Bad** (depends on another callback):

```python
class BadCallback(Callback):
    def on_train_end(self, trainer, pl_module):
        # Assumes a ModelCheckpoint is present
        best_path = trainer.checkpoint_callback.best_model_path  # Fragile!
```

**Good** (self-contained):

```python
class GoodCallback(Callback):
    def on_train_end(self, trainer, pl_module):
        # Find the checkpoint callback if one is present
        best_path = None
        for callback in trainer.callbacks:
            if isinstance(callback, ModelCheckpoint):
                best_path = callback.best_model_path
                break
```

### 2. Use State Dict for Persistence

```python
class StatefulCallback(Callback):
    def __init__(self):
        self.counter = 0
        self.history = []

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        self.counter += 1
        self.history.append(outputs['loss'].item())

    def state_dict(self):
        """Save state."""
        return {
            'counter': self.counter,
            'history': self.history
        }

    def load_state_dict(self, state_dict):
        """Restore state."""
        self.counter = state_dict['counter']
        self.history = state_dict['history']
```

### 3. Handle Distributed Training

```python
class DistributedCallback(Callback):
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Only run on the main process
        if trainer.is_global_zero:
            print("This only prints once in distributed training")

        # Runs on all processes
        loss = outputs['loss']
        # ... do something with the loss on each GPU
```
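
To aggregate a value across processes instead of only acting on rank zero, `LightningModule.all_gather` can be used from inside a callback. A minimal sketch (the per-rank value here is a stand-in for a real metric):

```python
import torch
from lightning.pytorch.callbacks import Callback

class AggregatingCallback(Callback):
    def on_validation_epoch_end(self, trainer, pl_module):
        # Stand-in per-rank value; replace with a metric you computed locally
        local_value = torch.tensor(1.0, device=pl_module.device)

        # all_gather collects the value from every process
        gathered = pl_module.all_gather(local_value)

        if trainer.is_global_zero:
            print(f"Mean across ranks: {gathered.float().mean().item():.4f}")
```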

## Resources

- Callback API: https://lightning.ai/docs/pytorch/stable/extensions/callbacks.html
- Built-in callbacks: https://lightning.ai/docs/pytorch/stable/api_references.html#callbacks
- Examples: https://github.com/Lightning-AI/pytorch-lightning/tree/master/examples/callbacks