mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
feat(gateway): skill-aware slash commands, paginated /commands, Telegram 100-cap (#3934)
* feat(gateway): skill-aware slash commands, paginated /commands, Telegram 100-cap

  Map active skills to Telegram's slash command menu so users can discover and invoke skills directly. Three changes:

  1. The Telegram menu now includes active skill commands alongside built-in commands, capped at 100 entries (the Telegram Bot API limit). Overflow commands remain callable but hidden from the picker. Logged at startup when the cap is hit.
  2. New /commands [page] gateway command for paginated browsing of all commands + skills. /help now shows the first 10 skill commands and points to /commands for the full list.
  3. When a user types a slash command that matches a disabled or uninstalled skill, they get actionable guidance:
     - Disabled: 'Enable it with: hermes skills config'
     - Optional (not installed): 'Install with: hermes skills install official/<path>'

  Built on ideas from PR #3921 by @kshitijk4poor.

* chore: move 21 niche skills to optional-skills

  Move specialized/niche skills from built-in (skills/) to optional (optional-skills/) to reduce the default skill count. Users can install them with: hermes skills install official/<category>/<name>

  Moved skills (21):
  - mlops: accelerate, chroma, faiss, flash-attention, hermes-atropos-environments, huggingface-tokenizers, instructor, lambda-labs, llava, nemo-curator, pinecone, pytorch-lightning, qdrant, saelens, simpo, slime, tensorrt-llm, torchtitan
  - research: domain-intel, duckduckgo-search
  - devops: inference-sh cli

  Built-in skills: 96 → 75
  Optional skills: 22 → 43

* fix: only include repo built-in skills in Telegram menu, not user-installed

  User-installed skills (from the hub or manually added) stay accessible via /skills and by typing the command directly, but don't get registered in the Telegram slash command picker. Only skills whose SKILL.md is under the repo's skills/ directory are included in the menu.

  This keeps the Telegram menu focused on the curated built-in set while user-installed skills remain discoverable through /skills and /commands.
This commit is contained in:
parent
97d6813f51
commit
5ceed021dc
73 changed files with 163 additions and 4 deletions
optional-skills/mlops/pytorch-lightning/references/callbacks.md (new file, 436 lines)
@@ -0,0 +1,436 @@
# PyTorch Lightning Callbacks

## Overview

Callbacks add functionality to training without modifying the LightningModule. They capture **non-essential logic** like checkpointing, early stopping, and logging.
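
Conceptually, the Trainer owns the list of callbacks and invokes the matching hook on each of them at every stage of the loop. A rough illustrative sketch of that dispatch (not Lightning's actual implementation):

```python
# Illustrative only: at each hook point the Trainer loops over its
# registered callbacks and calls the corresponding method on each one.
for callback in trainer_callbacks:
    callback.on_train_epoch_end(trainer, pl_module)
```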

## Built-In Callbacks

### 1. ModelCheckpoint

**Saves the best models during training**:

```python
import lightning as L
from lightning.pytorch.callbacks import ModelCheckpoint

# Save top 3 models based on validation loss
checkpoint = ModelCheckpoint(
    dirpath='checkpoints/',
    filename='model-{epoch:02d}-{val_loss:.2f}',
    monitor='val_loss',
    mode='min',
    save_top_k=3,
    save_last=True,  # Also save the last epoch
    verbose=True
)

trainer = L.Trainer(callbacks=[checkpoint])
trainer.fit(model, train_loader, val_loader)
```

**Configuration options**:

```python
checkpoint = ModelCheckpoint(
    monitor='val_acc',                      # Metric to monitor
    mode='max',                             # 'max' for accuracy, 'min' for loss
    save_top_k=5,                           # Keep the best 5 models
    save_last=True,                         # Save the last epoch separately
    every_n_epochs=1,                       # Save every N epochs
    save_on_train_epoch_end=False,          # Save on validation end instead
    filename='best-{epoch}-{val_acc:.3f}',  # Naming pattern
    auto_insert_metric_name=False           # Don't auto-add metric name to filename
)
```

**Load checkpoint**:

```python
# Load the best model
best_model_path = checkpoint.best_model_path
model = LitModel.load_from_checkpoint(best_model_path)

# Resume training
trainer = L.Trainer(callbacks=[checkpoint])
trainer.fit(model, train_loader, val_loader, ckpt_path='checkpoints/last.ckpt')
```
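
Besides `best_model_path`, the `ModelCheckpoint` instance also tracks the best score and the full set of kept checkpoints, which is convenient for reporting after `fit`:

```python
# Inspect what the callback tracked during training
print(checkpoint.best_model_path)   # Path to the best checkpoint
print(checkpoint.best_model_score)  # Monitored metric value at that checkpoint
print(checkpoint.best_k_models)     # Dict mapping checkpoint path -> score
```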

### 2. EarlyStopping

**Stops training when a metric stops improving**:

```python
import lightning as L
from lightning.pytorch.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=5,                      # Wait 5 epochs without improvement
    mode='min',
    min_delta=0.001,                 # Minimum change to qualify as improvement
    verbose=True,
    strict=True,                     # Crash if the monitored metric is not found
    check_on_train_epoch_end=False   # Check on validation end instead
)

trainer = L.Trainer(callbacks=[early_stop])
trainer.fit(model, train_loader, val_loader)
# Stops automatically if no improvement for 5 epochs
```

**Advanced usage**:

```python
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,
    min_delta=0.0,
    verbose=True,
    mode='min',
    stopping_threshold=0.1,     # Stop once val_loss < 0.1
    divergence_threshold=5.0,   # Stop if val_loss > 5.0
    check_finite=True           # Stop on NaN/Inf
)
```

### 3. LearningRateMonitor

**Logs the learning rate**:

```python
import lightning as L
from lightning.pytorch.callbacks import LearningRateMonitor

lr_monitor = LearningRateMonitor(
    logging_interval='epoch',  # Or 'step'
    log_momentum=True          # Also log momentum
)

trainer = L.Trainer(callbacks=[lr_monitor])
# Learning rate is automatically logged to TensorBoard/WandB
```

### 4. TQDMProgressBar

**Customizes the progress bar**:

```python
import lightning as L
from lightning.pytorch.callbacks import TQDMProgressBar

progress_bar = TQDMProgressBar(
    refresh_rate=10,    # Update every 10 batches
    process_position=0  # Position when multiple bars are shown
)

trainer = L.Trainer(callbacks=[progress_bar])
```

### 5. GradientAccumulationScheduler

**Dynamic gradient accumulation**:

```python
import lightning as L
from lightning.pytorch.callbacks import GradientAccumulationScheduler

# Accumulate fewer gradients as training progresses
accumulator = GradientAccumulationScheduler(
    scheduling={
        0: 8,   # Epochs 0-4: accumulate 8 batches
        5: 4,   # Epochs 5-9: accumulate 4 batches
        10: 2   # Epochs 10+: accumulate 2 batches
    }
)

trainer = L.Trainer(callbacks=[accumulator])
```
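
If the accumulation factor does not need to change over training, the same effect is available directly from the Trainer flag, without a callback:

```python
import lightning as L

# Fixed accumulation: the optimizer steps every 8 batches,
# so the effective batch size is 8x the dataloader batch size
trainer = L.Trainer(accumulate_grad_batches=8)
```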

### 6. StochasticWeightAveraging (SWA)

**Averages weights for better generalization**:

```python
import lightning as L
from lightning.pytorch.callbacks import StochasticWeightAveraging

swa = StochasticWeightAveraging(
    swa_lrs=1e-2,              # SWA learning rate
    swa_epoch_start=0.8,       # Start at 80% of training
    annealing_epochs=10,       # Annealing period
    annealing_strategy='cos'   # 'cos' or 'linear'
)

trainer = L.Trainer(callbacks=[swa])
```

## Custom Callbacks

### Basic Custom Callback

```python
import lightning as L
from lightning.pytorch.callbacks import Callback

class PrintingCallback(Callback):
    def on_train_start(self, trainer, pl_module):
        print("Training is starting!")

    def on_train_end(self, trainer, pl_module):
        print("Training is done!")

    def on_train_epoch_end(self, trainer, pl_module):
        print(f"Epoch {trainer.current_epoch} ended")

# Use it
trainer = L.Trainer(callbacks=[PrintingCallback()])
```

### Advanced Custom Callback

```python
class MetricsCallback(Callback):
    """Logs custom metrics every N batches."""

    def __init__(self, log_every_n_batches=100):
        self.log_every_n_batches = log_every_n_batches
        self.metrics = []

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx % self.log_every_n_batches == 0:
            # Compute custom metric
            metric = self.compute_metric(outputs)
            self.metrics.append(metric)

            # Log to Lightning
            pl_module.log('custom_metric', metric)

    def compute_metric(self, outputs):
        # Your custom logic (assumes training_step returns a dict with 'loss')
        return outputs['loss'].item()

    def state_dict(self):
        """Save callback state in the checkpoint."""
        return {'metrics': self.metrics}

    def load_state_dict(self, state_dict):
        """Restore callback state from the checkpoint."""
        self.metrics = state_dict['metrics']
```
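
Because the callback implements `state_dict`/`load_state_dict`, its collected metrics are saved inside Trainer checkpoints and restored on resume. A minimal usage sketch (the checkpoint path is illustrative):

```python
import lightning as L

metrics_cb = MetricsCallback(log_every_n_batches=50)
trainer = L.Trainer(callbacks=[metrics_cb], max_epochs=10)
trainer.fit(model, train_loader, val_loader)

# Resuming restores metrics_cb.metrics along with the model weights
trainer = L.Trainer(callbacks=[MetricsCallback(log_every_n_batches=50)])
trainer.fit(model, train_loader, val_loader, ckpt_path='checkpoints/last.ckpt')
```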

### Gradient Monitoring Callback

```python
class GradientMonitorCallback(Callback):
    """Monitor gradient norms."""

    def on_after_backward(self, trainer, pl_module):
        # Compute the global L2 gradient norm
        total_norm = 0.0
        for p in pl_module.parameters():
            if p.grad is not None:
                param_norm = p.grad.detach().norm(2)
                total_norm += param_norm.item() ** 2
        total_norm = total_norm ** 0.5

        # Log
        pl_module.log('grad_norm', total_norm)

        # Warn if exploding
        if total_norm > 100:
            print(f"Warning: large gradient norm: {total_norm:.2f}")
```

### Model Inspection Callback

```python
class ModelInspectionCallback(Callback):
    """Inspect model activations during training."""

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        if batch_idx == 0:  # First batch of each epoch
            # Register hooks
            self.activations = {}
            self.handles = []

            def get_activation(name):
                def hook(module, args, output):
                    self.activations[name] = output.detach()
                return hook

            # Attach to specific layers (assumes pl_module.model has layer1/layer2)
            self.handles.append(
                pl_module.model.layer1.register_forward_hook(get_activation('layer1')))
            self.handles.append(
                pl_module.model.layer2.register_forward_hook(get_activation('layer2')))

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx == 0:
            # Remove the hooks so they don't accumulate across epochs
            for handle in self.handles:
                handle.remove()

            # Log activation statistics
            for name, activation in self.activations.items():
                pl_module.log(f'{name}_mean', activation.mean().item())
                pl_module.log(f'{name}_std', activation.std().item())
```

## Callback Hooks

**All available hooks**:

```python
class MyCallback(Callback):
    # Setup/Teardown
    def setup(self, trainer, pl_module, stage):
        """Called at the beginning of fit/test/predict."""
        pass

    def teardown(self, trainer, pl_module, stage):
        """Called at the end of fit/test/predict."""
        pass

    # Training
    def on_train_start(self, trainer, pl_module):
        pass

    def on_train_epoch_start(self, trainer, pl_module):
        pass

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        pass

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        pass

    def on_train_epoch_end(self, trainer, pl_module):
        pass

    def on_train_end(self, trainer, pl_module):
        pass

    # Validation
    def on_validation_start(self, trainer, pl_module):
        pass

    def on_validation_epoch_start(self, trainer, pl_module):
        pass

    def on_validation_batch_start(self, trainer, pl_module, batch, batch_idx, dataloader_idx=0):
        pass

    def on_validation_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=0):
        pass

    def on_validation_epoch_end(self, trainer, pl_module):
        pass

    def on_validation_end(self, trainer, pl_module):
        pass

    # Test (same structure as validation)
    def on_test_start(self, trainer, pl_module):
        pass
    # ... (on_test_epoch_start, on_test_batch_start, etc.)

    # Predict
    def on_predict_start(self, trainer, pl_module):
        pass
    # ... (on_predict_epoch_start, on_predict_batch_start, etc.)

    # Backward
    def on_before_backward(self, trainer, pl_module, loss):
        pass

    def on_after_backward(self, trainer, pl_module):
        pass

    # Optimizer
    def on_before_optimizer_step(self, trainer, pl_module, optimizer):
        pass

    # Checkpointing
    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        """Add data to the checkpoint."""
        pass

    def on_load_checkpoint(self, trainer, pl_module, checkpoint):
        """Restore data from the checkpoint."""
        pass
```

## Combining Multiple Callbacks

```python
import lightning as L
from lightning.pytorch.callbacks import ModelCheckpoint, EarlyStopping, LearningRateMonitor

# Create all callbacks
checkpoint = ModelCheckpoint(monitor='val_loss', mode='min', save_top_k=3)
early_stop = EarlyStopping(monitor='val_loss', patience=5)
lr_monitor = LearningRateMonitor(logging_interval='epoch')
custom_callback = MyCustomCallback()

# Add them all to the Trainer
trainer = L.Trainer(
    callbacks=[checkpoint, early_stop, lr_monitor, custom_callback]
)

trainer.fit(model, train_loader, val_loader)
```

**Execution order**: Callbacks execute in the order they are added to the list.
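
A quick way to see the ordering (one caveat: Lightning moves checkpoint callbacks such as `ModelCheckpoint` to the end of the list so they run after all others):

```python
import lightning as L
from lightning.pytorch.callbacks import Callback

class First(Callback):
    def on_train_start(self, trainer, pl_module):
        print("first")

class Second(Callback):
    def on_train_start(self, trainer, pl_module):
        print("second")

# Prints "first" then "second" when training starts
trainer = L.Trainer(callbacks=[First(), Second()])
```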

## Best Practices

### 1. Keep Callbacks Independent

**Bad** (depends on another callback):

```python
class BadCallback(Callback):
    def on_train_end(self, trainer, pl_module):
        # Assumes a ModelCheckpoint is present
        best_path = trainer.checkpoint_callback.best_model_path  # Fragile!
```

**Good** (self-contained):

```python
class GoodCallback(Callback):
    def on_train_end(self, trainer, pl_module):
        # Find the checkpoint callback if one is present
        best_path = None
        for callback in trainer.callbacks:
            if isinstance(callback, ModelCheckpoint):
                best_path = callback.best_model_path
                break
```

### 2. Use State Dict for Persistence

```python
class StatefulCallback(Callback):
    def __init__(self):
        self.counter = 0
        self.history = []

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        self.counter += 1
        self.history.append(outputs['loss'].item())

    def state_dict(self):
        """Save state."""
        return {
            'counter': self.counter,
            'history': self.history
        }

    def load_state_dict(self, state_dict):
        """Restore state."""
        self.counter = state_dict['counter']
        self.history = state_dict['history']
```

### 3. Handle Distributed Training

```python
class DistributedCallback(Callback):
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Only run on the main process
        if trainer.is_global_zero:
            print("This only prints once in distributed training")

        # Runs on all processes
        loss = outputs['loss']
        # ... do something with the loss on each GPU
```
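
To aggregate a value across processes instead of only acting on rank zero, `LightningModule.all_gather` can be used from inside a callback. A minimal sketch (the per-rank value here is a stand-in for a real metric):

```python
import torch
from lightning.pytorch.callbacks import Callback

class AggregatingCallback(Callback):
    def on_validation_epoch_end(self, trainer, pl_module):
        # Stand-in per-rank value; replace with a metric you computed locally
        local_value = torch.tensor(1.0, device=pl_module.device)

        # all_gather collects the value from every process
        gathered = pl_module.all_gather(local_value)

        if trainer.is_global_zero:
            print(f"Mean across ranks: {gathered.float().mean().item():.4f}")
```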

## Resources

- Callback API: https://lightning.ai/docs/pytorch/stable/extensions/callbacks.html
- Built-in callbacks: https://lightning.ai/docs/pytorch/stable/api_references.html#callbacks
- Examples: https://github.com/Lightning-AI/pytorch-lightning/tree/master/examples/callbacks