Comprehensive Hyperparameter Sweeps Guide
Complete guide to hyperparameter optimization with W&B Sweeps.
Table of Contents
- Sweep Configuration
- Search Strategies
- Parameter Distributions
- Early Termination
- Training Function
- Parallel Execution
- Advanced Patterns
- Real-World Examples
- Best Practices
Sweep Configuration
Basic Sweep Config
sweep_config = {
    'method': 'bayes',  # Search strategy
    'metric': {
        'name': 'val/accuracy',
        'goal': 'maximize'  # or 'minimize'
    },
    'parameters': {
        'learning_rate': {
            'distribution': 'log_uniform_values',  # min/max given as actual values
            'min': 1e-5,
            'max': 1e-1
        },
        'batch_size': {
            'values': [16, 32, 64, 128]
        }
    }
}
# Initialize sweep
sweep_id = wandb.sweep(sweep_config, project="my-project")
Complete Config Example
sweep_config = {
    # Required: Search method
    'method': 'bayes',
    # Required: Optimization metric
    'metric': {
        'name': 'val/f1_score',
        'goal': 'maximize'
    },
    # Required: Parameters to search
    'parameters': {
        # Continuous parameter (min/max are actual values)
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },
        # Discrete values
        'batch_size': {
            'values': [16, 32, 64, 128]
        },
        # Categorical
        'optimizer': {
            'values': ['adam', 'sgd', 'rmsprop', 'adamw']
        },
        # Uniform distribution
        'dropout': {
            'distribution': 'uniform',
            'min': 0.1,
            'max': 0.5
        },
        # Integer range
        'num_layers': {
            'distribution': 'int_uniform',
            'min': 2,
            'max': 10
        },
        # Fixed value (constant across runs)
        'epochs': {
            'value': 50
        }
    },
    # Optional: Early termination (specify min_iter OR max_iter, not both)
    'early_terminate': {
        'type': 'hyperband',
        'min_iter': 5,  # First bracket after 5 iterations
        'eta': 3        # Keep the top 1/eta runs at each bracket
    }
}
Search Strategies
1. Grid Search
Exhaustively search all combinations.
sweep_config = {
    'method': 'grid',
    'parameters': {
        'learning_rate': {
            'values': [0.001, 0.01, 0.1]
        },
        'batch_size': {
            'values': [16, 32, 64]
        },
        'optimizer': {
            'values': ['adam', 'sgd']
        }
    }
}
# Total runs: 3 × 3 × 2 = 18 runs
Pros:
- Comprehensive search
- Reproducible results
- No randomness
Cons:
- Exponential growth with parameters
- Inefficient for continuous parameters
- Not scalable beyond 3-4 parameters
When to use:
- Few parameters (< 4)
- All discrete values
- Need complete coverage
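Because the run count grows multiplicatively, it is worth sizing a grid before launching it. A quick stdlib sketch (parameter values mirror the config above):

```python
from itertools import product

# Discrete values from the grid config above
param_values = {
    'learning_rate': [0.001, 0.01, 0.1],
    'batch_size': [16, 32, 64],
    'optimizer': ['adam', 'sgd'],
}

# Every combination becomes one sweep run
combos = list(product(*param_values.values()))
print(len(combos))  # 18 runs: 3 * 3 * 2
```

Adding a fourth parameter with five values would multiply this to 90 runs, which is why grid search stops scaling quickly.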
2. Random Search
Randomly sample parameter combinations.
sweep_config = {
    'method': 'random',
    'parameters': {
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },
        'batch_size': {
            'values': [16, 32, 64, 128, 256]
        },
        'dropout': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 0.5
        },
        'num_layers': {
            'distribution': 'int_uniform',
            'min': 2,
            'max': 8
        }
    }
}
# Run 100 random trials
wandb.agent(sweep_id, function=train, count=100)
Pros:
- Scales to many parameters
- Can run indefinitely
- Often finds good solutions quickly
Cons:
- No learning from previous runs
- May miss optimal region
- Results vary with random seed
When to use:
- Many parameters (> 4)
- Quick exploration
- Limited budget
3. Bayesian Optimization (Recommended)
Learn from previous trials to sample promising regions.
sweep_config = {
    'method': 'bayes',
    'metric': {
        'name': 'val/loss',
        'goal': 'minimize'
    },
    'parameters': {
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },
        'weight_decay': {
            'distribution': 'log_uniform_values',
            'min': 1e-6,
            'max': 1e-2
        },
        'dropout': {
            'distribution': 'uniform',
            'min': 0.1,
            'max': 0.5
        },
        'num_layers': {
            'values': [2, 3, 4, 5, 6]
        }
    }
}
Pros:
- Most sample-efficient
- Learns from past trials
- Focuses on promising regions
Cons:
- Initial random exploration phase
- May get stuck in local optima
- Slower per iteration
When to use:
- Expensive training runs
- Need best performance
- Limited compute budget
Parameter Distributions
Continuous Distributions
# Log-uniform: Good for learning rates, regularization
# ('log_uniform_values' takes min/max as actual values;
#  plain 'log_uniform' expects exponents instead)
'learning_rate': {
    'distribution': 'log_uniform_values',
    'min': 1e-6,
    'max': 1e-1
}
# Uniform: Good for dropout, momentum
'dropout': {
    'distribution': 'uniform',
    'min': 0.0,
    'max': 0.5
}
# Normal distribution
'parameter': {
    'distribution': 'normal',
    'mu': 0.5,
    'sigma': 0.1
}
# Log-normal distribution
'parameter': {
    'distribution': 'log_normal',
    'mu': 0.0,
    'sigma': 1.0
}
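Log-uniform sampling can be reproduced in a few lines of stdlib Python, which makes the behavior concrete (a sketch for intuition, not W&B's actual sampler):

```python
import math
import random

def sample_log_uniform(low, high):
    """Sample so that each decade in [low, high] is equally likely."""
    return math.exp(random.uniform(math.log(low), math.log(high)))

samples = [sample_log_uniform(1e-6, 1e-1) for _ in range(10_000)]

# Every sample stays inside the requested range
assert all(1e-6 <= s <= 1e-1 for s in samples)
```

With a plain uniform distribution over the same range, almost every sample would land above 1e-2; the log transform spreads samples evenly across orders of magnitude.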
Discrete Distributions
# Fixed values
'batch_size': {
    'values': [16, 32, 64, 128, 256]
}
# Integer uniform
'num_layers': {
    'distribution': 'int_uniform',
    'min': 2,
    'max': 10
}
# Quantized uniform (step size)
'layer_size': {
    'distribution': 'q_uniform',
    'min': 32,
    'max': 512,
    'q': 32  # Step by 32: 32, 64, 96, 128...
}
# Quantized log-uniform (min/max as actual values)
'hidden_size': {
    'distribution': 'q_log_uniform_values',
    'min': 32,
    'max': 1024,
    'q': 32
}
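Since a quantized distribution rounds samples to multiples of `q`, the reachable values can be enumerated directly (a small sketch using the `layer_size` settings above):

```python
# q_uniform with min=32, max=512, q=32 can only produce these values
q, low, high = 32, 32, 512
reachable = list(range(low, high + 1, q))
print(reachable[:4], '...', reachable[-1])  # [32, 64, 96, 128] ... 512
print(len(reachable))  # 16 distinct values
```

This is useful for sanity-checking that `q` divides the range into the grid you intended before launching the sweep.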
Categorical Parameters
# Optimizers
'optimizer': {
    'values': ['adam', 'sgd', 'rmsprop', 'adamw']
}
# Model architectures
'model': {
    'values': ['resnet18', 'resnet34', 'resnet50', 'efficientnet_b0']
}
# Activation functions
'activation': {
    'values': ['relu', 'gelu', 'silu', 'leaky_relu']
}
Early Termination
Stop underperforming runs early to save compute.
Hyperband
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'val/accuracy', 'goal': 'maximize'},
    'parameters': {...},
    # Hyperband early termination (alternatively, set 'max_iter' and 's')
    'early_terminate': {
        'type': 'hyperband',
        'min_iter': 3,  # First bracket after 3 iterations
        'eta': 3        # Downsampling rate: keep the top 1/eta each bracket
    }
}
How it works:
- Runs trials in brackets
- Keeps top 1/eta performers each round
- Eliminates bottom performers early
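The bracket boundaries follow a geometric schedule: with `min_iter=3` and `eta=3`, runs are compared at iterations 3, 9, 27, and so on. A sketch of the schedule (for intuition, not W&B's internal code):

```python
min_iter, eta = 3, 3

# Iterations at which Hyperband compares runs and culls the bottom performers
brackets = [min_iter * eta**i for i in range(3)]
print(brackets)  # [3, 9, 27]

# After each bracket, roughly the top 1/eta of runs survive
survivors = 27
for b in brackets:
    survivors = max(1, survivors // eta)
print(survivors)  # 27 -> 9 -> 3 -> 1
```

So of 27 runs started, only about one run trains all the way through, and the rest stop after 3 or 9 iterations, saving most of the compute.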
Custom Termination
def train():
    run = wandb.init()
    best_acc = 0.0
    for epoch in range(MAX_EPOCHS):
        loss = train_epoch()
        val_acc = validate()
        wandb.log({'val/accuracy': val_acc, 'epoch': epoch})
        # Custom early stopping: abandon clearly poor runs
        if epoch > 5 and val_acc < 0.5:
            print("Early stop: Poor performance")
            break
        # Stop once accuracy has plateaued
        if val_acc > best_acc + 0.01:
            best_acc = val_acc
        elif epoch > 10:
            print("Early stop: No improvement")
            break
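The plateau check can be factored into a small reusable helper (an illustrative sketch; the class name and thresholds are our own, not a W&B API):

```python
class EarlyStopper:
    """Stop when the monitored metric hasn't improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.01):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('-inf')
        self.stale_epochs = 0

    def should_stop(self, metric):
        if metric > self.best + self.min_delta:
            self.best = metric
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience

stopper = EarlyStopper(patience=2)
history = [0.50, 0.60, 0.62, 0.62, 0.62]
stops = [stopper.should_stop(acc) for acc in history]
print(stops)  # [False, False, False, False, True]
```

Inside the training loop, `if stopper.should_stop(val_acc): break` replaces the inline bookkeeping.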
Training Function
Basic Template
def train():
    # Initialize W&B run
    run = wandb.init()
    # Get hyperparameters
    config = wandb.config
    # Build model with config
    model = build_model(
        hidden_size=config.hidden_size,
        num_layers=config.num_layers,
        dropout=config.dropout
    )
    # Create optimizer
    optimizer = create_optimizer(
        model.parameters(),
        name=config.optimizer,
        lr=config.learning_rate,
        weight_decay=config.weight_decay
    )
    # Training loop
    for epoch in range(config.epochs):
        # Train
        train_loss, train_acc = train_epoch(
            model, optimizer, train_loader, config.batch_size
        )
        # Validate
        val_loss, val_acc = validate(model, val_loader)
        # Log metrics
        wandb.log({
            'train/loss': train_loss,
            'train/accuracy': train_acc,
            'val/loss': val_loss,
            'val/accuracy': val_acc,
            'epoch': epoch
        })
    # Log final model
    torch.save(model.state_dict(), 'model.pth')
    wandb.save('model.pth')
    # Finish run
    wandb.finish()
With PyTorch
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import wandb

def train():
    run = wandb.init()
    config = wandb.config
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # Data
    train_loader = DataLoader(
        train_dataset,
        batch_size=config.batch_size,
        shuffle=True
    )
    # Model
    model = ResNet(
        num_classes=config.num_classes,
        dropout=config.dropout
    ).to(device)
    # Optimizer
    if config.optimizer == 'adam':
        optimizer = torch.optim.Adam(
            model.parameters(),
            lr=config.learning_rate,
            weight_decay=config.weight_decay
        )
    elif config.optimizer == 'sgd':
        optimizer = torch.optim.SGD(
            model.parameters(),
            lr=config.learning_rate,
            momentum=config.momentum,
            weight_decay=config.weight_decay
        )
    # Scheduler
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=config.epochs
    )
    criterion = nn.CrossEntropyLoss()
    # Training
    for epoch in range(config.epochs):
        model.train()
        train_loss = 0.0
        for data, target in train_loader:
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        # Validation
        model.eval()
        val_loss, val_acc = validate(model, val_loader)
        # Step scheduler
        scheduler.step()
        # Log
        wandb.log({
            'train/loss': train_loss / len(train_loader),
            'val/loss': val_loss,
            'val/accuracy': val_acc,
            'learning_rate': scheduler.get_last_lr()[0],
            'epoch': epoch
        })
Parallel Execution
Multiple Agents
Run sweep agents in parallel to speed up search.
# Initialize sweep once
sweep_id = wandb.sweep(sweep_config, project="my-project")
# Run multiple agents in parallel
# Agent 1 (Terminal 1)
wandb.agent(sweep_id, function=train, count=20)
# Agent 2 (Terminal 2)
wandb.agent(sweep_id, function=train, count=20)
# Agent 3 (Terminal 3)
wandb.agent(sweep_id, function=train, count=20)
# Total: 60 runs across 3 agents
Multi-GPU Execution
def train():
    run = wandb.init()
    config = wandb.config
    # CUDA_VISIBLE_DEVICES restricts the process to one GPU and
    # remaps it to index 0, so always address it as 'cuda:0'
    device = torch.device('cuda:0')
    model = model.to(device)
    # ... rest of training ...
# Run agents on different GPUs
# Terminal 1
# CUDA_VISIBLE_DEVICES=0 wandb agent sweep_id
# Terminal 2
# CUDA_VISIBLE_DEVICES=1 wandb agent sweep_id
# Terminal 3
# CUDA_VISIBLE_DEVICES=2 wandb agent sweep_id
Advanced Patterns
Nested Parameters
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'val/accuracy', 'goal': 'maximize'},
    'parameters': {
        'model': {
            'parameters': {
                'type': {
                    'values': ['resnet', 'efficientnet']
                },
                'size': {
                    'values': ['small', 'medium', 'large']
                }
            }
        },
        'optimizer': {
            'parameters': {
                'type': {
                    'values': ['adam', 'sgd']
                },
                'lr': {
                    'distribution': 'log_uniform_values',
                    'min': 1e-5,
                    'max': 1e-1
                }
            }
        }
    }
}
# Access nested config
def train():
    run = wandb.init()
    # Nested entries come back as plain dicts, so index by key
    model_type = wandb.config['model']['type']
    model_size = wandb.config['model']['size']
    opt_type = wandb.config['optimizer']['type']
    lr = wandb.config['optimizer']['lr']
Conditional Parameters
sweep_config = {
    'method': 'bayes',
    'parameters': {
        'optimizer': {
            'values': ['adam', 'sgd']
        },
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-1
        },
        # Only used if optimizer == 'sgd'
        'momentum': {
            'distribution': 'uniform',
            'min': 0.5,
            'max': 0.99
        }
    }
}
def train():
    run = wandb.init()
    config = wandb.config
    if config.optimizer == 'adam':
        optimizer = torch.optim.Adam(
            model.parameters(),
            lr=config.learning_rate
        )
    elif config.optimizer == 'sgd':
        optimizer = torch.optim.SGD(
            model.parameters(),
            lr=config.learning_rate,
            momentum=config.momentum  # Conditional parameter
        )
Real-World Examples
Image Classification
sweep_config = {
    'method': 'bayes',
    'metric': {
        'name': 'val/top1_accuracy',
        'goal': 'maximize'
    },
    'parameters': {
        # Model
        'architecture': {
            'values': ['resnet50', 'resnet101', 'efficientnet_b0', 'efficientnet_b3']
        },
        'pretrained': {
            'values': [True, False]
        },
        # Training
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-5,
            'max': 1e-2
        },
        'batch_size': {
            'values': [16, 32, 64, 128]
        },
        'optimizer': {
            'values': ['adam', 'sgd', 'adamw']
        },
        'weight_decay': {
            'distribution': 'log_uniform_values',
            'min': 1e-6,
            'max': 1e-2
        },
        # Regularization
        'dropout': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 0.5
        },
        'label_smoothing': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 0.2
        },
        # Data augmentation
        'mixup_alpha': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 1.0
        },
        'cutmix_alpha': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 1.0
        }
    },
    'early_terminate': {
        'type': 'hyperband',
        'min_iter': 5
    }
}
NLP Fine-Tuning
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'eval/f1', 'goal': 'maximize'},
    'parameters': {
        # Model
        'model_name': {
            'values': ['bert-base-uncased', 'roberta-base', 'distilbert-base-uncased']
        },
        # Training
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 1e-6,
            'max': 1e-4
        },
        'per_device_train_batch_size': {
            'values': [8, 16, 32]
        },
        'num_train_epochs': {
            'values': [3, 4, 5]
        },
        'warmup_ratio': {
            'distribution': 'uniform',
            'min': 0.0,
            'max': 0.1
        },
        'weight_decay': {
            'distribution': 'log_uniform_values',
            'min': 1e-4,
            'max': 1e-1
        },
        # Optimizer
        'adam_beta1': {
            'distribution': 'uniform',
            'min': 0.8,
            'max': 0.95
        },
        'adam_beta2': {
            'distribution': 'uniform',
            'min': 0.95,
            'max': 0.999
        }
    }
}
Best Practices
1. Start Small
# Initial exploration: Random search, 20 runs
sweep_config_v1 = {
    'method': 'random',
    'parameters': {...}
}
wandb.agent(sweep_id_v1, train, count=20)

# Refined search: Bayes over narrowed ranges
sweep_config_v2 = {
    'method': 'bayes',
    'metric': {'name': 'val/accuracy', 'goal': 'maximize'},
    'parameters': {
        'learning_rate': {
            'distribution': 'log_uniform_values',
            'min': 5e-5,  # Narrowed from the initial 1e-6 to 1e-4 range
            'max': 1e-4
        }
    }
}
2. Use Log Scales
# ✅ Good: Log scale for learning rate
'learning_rate': {
    'distribution': 'log_uniform_values',
    'min': 1e-6,
    'max': 1e-2
}
# ❌ Bad: Linear scale — most samples land near the top of the range
'learning_rate': {
    'distribution': 'uniform',
    'min': 0.000001,
    'max': 0.01
}
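A quick arithmetic check of why the linear version is bad: almost all of the uniform range [1e-6, 1e-2] sits above 1e-3, so small learning rates are barely explored:

```python
import math

low, high = 1e-6, 1e-2

# Fraction of uniform samples expected below 1e-3
frac_small = (1e-3 - low) / (high - low)
print(round(frac_small, 3))  # ~0.1: only ~10% of trials try lr < 1e-3

# With log-uniform sampling, decades are weighted equally
frac_small_log = (math.log(1e-3) - math.log(low)) / (math.log(high) - math.log(low))
print(round(frac_small_log, 3))  # 0.75: three of the four decades lie below 1e-3
```

The log scale spends 75% of its samples where the linear scale spends 10%, which is usually where the best learning rates live.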
3. Set Reasonable Ranges
# Base ranges on prior knowledge
'learning_rate': {'min': 1e-5, 'max': 1e-3}, # Typical for Adam
'batch_size': {'values': [16, 32, 64]}, # GPU memory limits
'dropout': {'min': 0.1, 'max': 0.5} # Too high hurts training
4. Monitor Resource Usage
def train():
    run = wandb.init()
    # Log system metrics
    wandb.log({
        'system/gpu_memory_allocated': torch.cuda.memory_allocated(),
        'system/gpu_memory_reserved': torch.cuda.memory_reserved()
    })
5. Save Best Models
def train():
    run = wandb.init()
    config = wandb.config
    best_acc = 0.0
    for epoch in range(config.epochs):
        val_acc = validate(model)
        if val_acc > best_acc:
            best_acc = val_acc
            # Save best checkpoint
            torch.save(model.state_dict(), 'best_model.pth')
            wandb.save('best_model.pth')
Resources
- Sweeps Documentation: https://docs.wandb.ai/guides/sweeps
- Configuration Reference: https://docs.wandb.ai/guides/sweeps/configuration
- Examples: https://github.com/wandb/examples/tree/master/examples/wandb-sweeps