mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-26 01:01:40 +00:00
Add skills tools and enhance model integration
- Introduced new skills tools `skills_categories`, `skills_list`, and `skill_view` in `model_tools.py`, allowing better organization of and access to skill-related functionality.
- Updated `toolsets.py` to include a new `skills` toolset, providing a dedicated space for skill tools.
- Enhanced `batch_runner.py` to recognize and validate skills tools during batch processing.
- Added comprehensive tool definitions for skills tools, ensuring compatibility with OpenAI's expected format.
- Created a new shell script, `test_skills_kimi.sh`, for testing skills tool functionality with Kimi K2.5.
- Added example skill files demonstrating the structure and usage of skills within the Hermes-Agent framework, including `SKILL.md` files for the example and audiocraft skills.
- Improved documentation for skills tools and their integration into the existing tool framework, ensuring clarity for future development and usage.
parent 8e8b6be690
commit f172f7d4aa
189 changed files with 116214 additions and 2 deletions
89
skills/mlops/llama-cpp/references/optimization.md
Normal file
@@ -0,0 +1,89 @@
# Performance Optimization Guide

Maximize llama.cpp inference speed and efficiency.
## CPU Optimization

### Thread tuning

```bash
# Set threads (default: physical cores)
./llama-cli -m model.gguf -t 8

# For an AMD Ryzen 9 7950X (16 cores, 32 threads)
-t 16   # best: physical cores

# Avoid hyperthreading (slower for matrix ops)
```
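
Since the sweet spot is usually the physical core count, it is worth checking what the machine actually has before pinning `-t`. A small sketch using standard Linux tools (on macOS, `sysctl -n hw.physicalcpu` reports the same number):

```bash
# Logical CPUs (includes hyperthreads, so usually too high for -t)
nproc

# Physical cores = "Core(s) per socket" x "Socket(s)"
lscpu | grep -E '^(Core\(s\) per socket|Socket\(s\))'
```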

### BLAS acceleration

```bash
# OpenBLAS (faster matrix ops)
make LLAMA_OPENBLAS=1

# BLAS typically gives a 2-3x speedup, mainly on prompt processing
```
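
The `make` flag above comes from the classic Makefile build; newer llama.cpp checkouts have moved to CMake, where the equivalent (assuming upstream's `GGML_BLAS` options) looks roughly like:

```bash
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release
```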

## GPU Offloading

### Layer offloading

```bash
# Offload 35 layers to GPU (hybrid mode)
./llama-cli -m model.gguf -ngl 35

# Offload all layers
./llama-cli -m model.gguf -ngl 999

# Find the optimal value:
#   start with -ngl 999;
#   if you hit OOM, reduce by 5 until it fits
```
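
That search is easy to script. A throwaway sketch (it assumes an out-of-memory failure surfaces as a non-zero exit code; `model.gguf` and the step size are placeholders):

```bash
#!/usr/bin/env bash
# Walk -ngl down from 999 in steps of 5 until a short test run succeeds.
for ngl in $(seq 999 -5 0); do
  if ./llama-cli -m model.gguf -ngl "$ngl" -p "hi" -n 16 >/dev/null 2>&1; then
    echo "highest working -ngl: $ngl"
    break
  fi
done
```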

### Memory usage

```bash
# Check VRAM usage
nvidia-smi dmon

# Reduce context if needed
./llama-cli -m model.gguf -c 2048   # 2K context instead of 4K
```
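
`dmon` streams readings continuously; for a one-shot number, nvidia-smi's query mode works too:

```bash
# Single CSV reading of used vs. total VRAM
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```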

## Batch Processing

```bash
# Increase batch size for prompt throughput (default: 512)
./llama-cli -m model.gguf -b 1024

# Physical batch size (GPU)
--ubatch-size 128   # Process 128 tokens per micro-batch
```
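
The two interact: `-b` caps the logical batch, while `--ubatch-size` (alias `-ub`) sets the physical micro-batch that is actually pushed through the backend, so keep `-b` at or above `-ub`. A combined sketch:

```bash
# Full GPU offload with a large logical batch split into 512-token micro-batches
./llama-cli -m model.gguf -ngl 999 -b 2048 -ub 512
```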

## Context Management

```bash
# Default context (512 tokens)
-c 512

# Longer context (slower, more memory)
-c 4096

# Very long context (if the model supports it)
-c 32768
```
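
Most of that extra memory is KV cache, which grows linearly with `-c`. A back-of-envelope sketch for a Llama-2-7B-shaped model (assumed shapes: 32 layers, 4096 embedding width, f16 cache, no grouped-query attention):

```bash
# KV cache bytes = 2 (K and V) * n_layers * n_ctx * n_embd * 2 bytes (f16)
n_layers=32 n_embd=4096 n_ctx=4096
echo "$(( 2 * n_layers * n_ctx * n_embd * 2 / 1024 / 1024 )) MiB"   # prints: 2048 MiB
```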

## Benchmarks

### CPU Performance (Llama 2-7B Q4_K_M)

| Setup | Speed | Notes |
|-------|-------|-------|
| Apple M3 Max | 50 tok/s | Metal acceleration |
| AMD 7950X (16c) | 35 tok/s | OpenBLAS |
| Intel i9-13900K | 30 tok/s | AVX2 |

### GPU Offloading (RTX 4090)

| GPU layers (-ngl) | Speed | VRAM |
|-------------------|-------|------|
| 0 (CPU only) | 30 tok/s | 0 GB |
| 20 (hybrid) | 80 tok/s | 8 GB |
| 35 (all) | 120 tok/s | 12 GB |
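
To reproduce numbers like these on your own hardware, llama.cpp ships a dedicated `llama-bench` tool; a minimal invocation (pp = prompt processing, tg = token generation):

```bash
# Benchmark a 512-token prompt and 128 generated tokens with 35 layers on GPU
./llama-bench -m model.gguf -p 512 -n 128 -ngl 35
```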