Mirror of https://github.com/NousResearch/hermes-agent.git (synced 2026-04-25 00:51:20 +00:00)

docs: add 11 new pages + expand 4 existing pages (26 → 37 total)

New pages (sourced from the actual codebase):
- Security: command approval, DM pairing, container isolation, production checklist
- Session Management: resume, export, prune, search, per-platform tracking
- Context Files: AGENTS.md project context, discovery, size limits, security
- Personality: SOUL.md, 14 built-in personalities, custom definitions
- Browser Automation: Browserbase setup, 10 browser tools, stealth mode
- Image Generation: FLUX 2 Pro via FAL, aspect ratios, auto-upscaling
- Provider Routing: OpenRouter sort/only/ignore/order config
- Honcho: AI-native memory integration, setup, peer config
- Home Assistant: HASS setup, 4 HA tools, WebSocket gateway
- Batch Processing: trajectory generation, dataset format, checkpointing
- RL Training: Atropos/Tinker integration, environments, workflow

Expanded pages:
- code-execution: 51 → 195 lines (examples, limits, security, comparison table)
- delegation: 60 → 216 lines (context tips, batch mode, model override)
- cron: 88 → 273 lines (real-world examples, delivery options, expression cheat sheet)
- memory: 98 → 249 lines (best practices, capacity management, examples)
This commit (d50e9bcef7, parent c4e520fd6e) changed 17 files with 3116 additions and 41 deletions.
website/docs/user-guide/features/rl-training.md (new file, 238 lines)
---
sidebar_position: 13
title: "RL Training"
description: "Reinforcement learning on agent behaviors with Tinker-Atropos — environment discovery, training, and evaluation"
---

# RL Training

Hermes Agent includes an integrated RL (Reinforcement Learning) training pipeline built on **Tinker-Atropos**. This enables training language models on environment-specific tasks using GRPO (Group Relative Policy Optimization) with LoRA adapters, orchestrated entirely through the agent's tool interface.
## Overview

The RL training system consists of three components:

1. **Atropos** — A trajectory API server that coordinates environment interactions, manages rollout groups, and computes advantages
2. **Tinker** — A training service that handles model weights, LoRA training, sampling/inference, and optimizer steps
3. **Environments** — Python classes that define tasks, scoring, and reward functions (e.g., GSM8K math problems)

The agent can discover environments, configure training parameters, launch training runs, and monitor metrics — all through a set of `rl_*` tools.
## Requirements

RL training requires:

- **Python >= 3.11** (Tinker package requirement)
- **TINKER_API_KEY** — API key for the Tinker training service
- **WANDB_API_KEY** — API key for Weights & Biases metrics tracking
- The `tinker-atropos` submodule (at `tinker-atropos/` relative to the Hermes root)

```bash
# Set up API keys
hermes config set TINKER_API_KEY your-tinker-key
hermes config set WANDB_API_KEY your-wandb-key
```

When both keys are present and Python >= 3.11 is available, the `rl` toolset is automatically enabled.
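The gating logic described above amounts to a version check plus two environment lookups. The sketch below is a hypothetical illustration of that check (the function name and exact mechanism are assumptions; the real gate lives in the Hermes codebase):

```python
import os
import sys

def rl_toolset_available() -> bool:
    """Sketch of the enablement gate: Python >= 3.11 plus both API keys set."""
    return (
        sys.version_info >= (3, 11)
        and bool(os.environ.get("TINKER_API_KEY"))
        and bool(os.environ.get("WANDB_API_KEY"))
    )
```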
## Available Tools

| Tool | Description |
|------|-------------|
| `rl_list_environments` | Discover available RL environments |
| `rl_select_environment` | Select an environment and load its config |
| `rl_get_current_config` | View configurable and locked fields |
| `rl_edit_config` | Modify configurable training parameters |
| `rl_start_training` | Launch a training run (spawns 3 processes) |
| `rl_check_status` | Monitor training progress and WandB metrics |
| `rl_stop_training` | Stop a running training job |
| `rl_get_results` | Get final metrics and model weights path |
| `rl_list_runs` | List all active and completed runs |
| `rl_test_inference` | Quick inference test using OpenRouter |
## Workflow

### 1. Discover Environments

```
List the available RL environments
```

The agent calls `rl_list_environments()`, which scans `tinker-atropos/tinker_atropos/environments/` using AST parsing to find Python classes inheriting from `BaseEnv`. Each environment defines:

- **Dataset loading** — where training data comes from (e.g., HuggingFace datasets)
- **Prompt construction** — how to format items for the model
- **Scoring/verification** — how to evaluate model outputs and assign rewards
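AST-based discovery like this can be done with the standard library's `ast` module. The sketch below is illustrative rather than the actual Hermes implementation; `find_environments` and its return shape are assumptions:

```python
import ast
from pathlib import Path

def find_environments(env_dir: str) -> dict[str, list[str]]:
    """Map each module stem to classes whose base list mentions BaseEnv."""
    found: dict[str, list[str]] = {}
    for path in sorted(Path(env_dir).glob("*.py")):
        tree = ast.parse(path.read_text())
        classes = [
            node.name
            for node in ast.walk(tree)
            if isinstance(node, ast.ClassDef)
            and any(
                (isinstance(base, ast.Name) and base.id == "BaseEnv")
                or (isinstance(base, ast.Attribute) and base.attr == "BaseEnv")
                for base in node.bases
            )
        ]
        if classes:
            found[path.stem] = classes
    return found
```

Because this parses source without importing it, discovery stays fast and safe even if an environment module has unmet dependencies.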
### 2. Select and Configure

```
Select the GSM8K environment and show me the configuration
```

The agent calls `rl_select_environment("gsm8k_tinker")`, then `rl_get_current_config()` to see all parameters.

Configuration fields are divided into two categories:

**Configurable fields** (can be modified):
- `group_size` — Number of completions per item (default: 16)
- `batch_size` — Training batch size (default: 128)
- `wandb_name` — WandB run name (auto-set to `{env}-{timestamp}`)
- Other environment-specific parameters

**Locked fields** (infrastructure settings, cannot be changed):
- `tokenizer_name` — Model tokenizer (e.g., `Qwen/Qwen3-8B`)
- `rollout_server_url` — Atropos API URL (`http://localhost:8000`)
- `max_token_length` — Maximum token length (8192)
- `max_num_workers` — Maximum parallel workers (2048)
- `total_steps` — Total training steps (2500)
- `lora_rank` — LoRA adapter rank (32)
- `learning_rate` — Learning rate (4e-5)
- `max_token_trainer_length` — Max tokens for trainer (9000)
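The configurable/locked split can be modeled as a guarded merge. The locked values below mirror the defaults listed above, but `edit_config` itself is a hypothetical sketch, not the real tool implementation:

```python
# Locked infrastructure settings (values as documented above).
LOCKED = {
    "tokenizer_name": "Qwen/Qwen3-8B",
    "rollout_server_url": "http://localhost:8000",
    "max_token_length": 8192,
    "max_num_workers": 2048,
    "total_steps": 2500,
    "lora_rank": 32,
    "learning_rate": 4e-5,
    "max_token_trainer_length": 9000,
}

def edit_config(config: dict, **overrides) -> dict:
    """Apply overrides, refusing to touch locked infrastructure fields."""
    for key, value in overrides.items():
        if key in LOCKED:
            raise ValueError(f"'{key}' is locked and cannot be changed")
        config[key] = value
    return config
```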
### 3. Start Training

```
Start the training run
```

The agent calls `rl_start_training()`, which:

1. Generates a YAML config file merging locked settings with configurable overrides
2. Creates a unique run ID
3. Spawns three processes:
   - **Atropos API server** (`run-api`) — trajectory coordination
   - **Tinker trainer** (`launch_training.py`) — LoRA training + FastAPI inference server on port 8001
   - **Environment** (`environment.py serve`) — the selected environment connecting to Atropos

The processes start with staggered delays (5s for API, 30s for trainer, 90s more for environment) to ensure proper initialization order.
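The staggered launch can be sketched as below. This is a hypothetical illustration: the injectable `spawn` hook and parameterized delays are added so the sequence can be exercised without waiting, and the exact command-line arguments are assumptions beyond the entry points named above:

```python
import time
import uuid

def start_training(spawn, config_path: str, delays=(5, 30, 90)) -> dict:
    """Spawn API server, trainer, then environment, with staggered delays."""
    run_id = uuid.uuid4().hex[:8]
    procs = {}
    procs["api"] = spawn(["run-api"])
    time.sleep(delays[0])   # 5s: let the Atropos API bind port 8000
    procs["trainer"] = spawn(["python", "launch_training.py", "--config", config_path])
    time.sleep(delays[1])   # 30s: trainer loads weights, starts FastAPI on 8001
    time.sleep(delays[2])   # 90s more before the environment connects
    procs["env"] = spawn(["python", "environment.py", "serve", "--config", config_path])
    return {"run_id": run_id, "processes": procs}
```

In production `spawn` would be `subprocess.Popen`; injecting it keeps the orchestration logic testable.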
### 4. Monitor Progress

```
Check the status of training run abc12345
```

The agent calls `rl_check_status(run_id)`, which reports:

- Process status (running/exited for each of the 3 processes)
- Running time
- WandB metrics (step, reward mean, percent correct, eval accuracy)
- Log file locations for debugging

:::note Rate Limiting
Status checks are rate-limited to once every **30 minutes** per run ID. This prevents excessive polling during long-running training jobs that take hours.
:::
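A per-run-ID rate limit like this is typically a timestamp check keyed by run ID. A minimal sketch (hypothetical, with an explicit clock argument for testability):

```python
RATE_LIMIT_SECONDS = 30 * 60  # one status check per run ID per 30 minutes
_last_check: dict[str, float] = {}

def allow_status_check(run_id: str, now: float) -> bool:
    """Return True and record the check, or False if checked too recently."""
    last = _last_check.get(run_id)
    if last is not None and now - last < RATE_LIMIT_SECONDS:
        return False
    _last_check[run_id] = now
    return True
```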
### 5. Stop or Get Results

```
Stop the training run
# or
Get the final results for run abc12345
```

`rl_stop_training()` terminates all three processes in reverse order (environment → trainer → API). `rl_get_results()` retrieves final WandB metrics and training history.
## Inference Testing

Before committing to a full training run, you can test whether an environment works correctly using `rl_test_inference`. This runs a few steps of inference and scoring using OpenRouter — no Tinker API key needed, just an `OPENROUTER_API_KEY`.

```
Test the selected environment with inference
```

Default configuration:
- **3 steps × 16 completions = 48 rollouts per model**
- Tests 3 models at different scales for robustness:
  - `qwen/qwen3-8b` (small)
  - `z-ai/glm-4.7-flash` (medium)
  - `minimax/minimax-m2.1` (large)
- Total: ~144 rollouts

This validates that:
- The environment loads correctly
- Prompt construction works
- Inference response parsing is robust across model scales
- Verifier/scoring logic produces valid rewards
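The rollout budget is just the product of steps, group size, and model count:

```python
STEPS = 3
GROUP_SIZE = 16
MODELS = ["qwen/qwen3-8b", "z-ai/glm-4.7-flash", "minimax/minimax-m2.1"]

rollouts_per_model = STEPS * GROUP_SIZE            # 3 * 16 = 48
total_rollouts = rollouts_per_model * len(MODELS)  # 48 * 3 = 144
```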
## Tinker API Integration

The trainer uses the [Tinker](https://tinker.computer) API for model training operations:

- **ServiceClient** — Creates training and sampling clients
- **Training client** — Handles forward-backward passes with importance sampling loss, optimizer steps (Adam), and weight checkpointing
- **Sampling client** — Provides inference using the latest trained weights

The training loop:
1. Fetches a batch of rollouts from Atropos (prompt + completions + scores)
2. Converts them to Tinker Datum objects with padded logprobs and advantages
3. Runs a forward-backward pass with importance sampling loss
4. Takes an optimizer step (Adam: lr=4e-5, β1=0.9, β2=0.95)
5. Saves weights and creates a new sampling client for next-step inference
6. Logs metrics to WandB
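The "group relative" part of GRPO normalizes each completion's reward against its rollout group. A minimal sketch of that computation (illustrative; the actual advantage calculation happens inside Atropos):

```python
import math

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one rollout group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]
```

With `group_size=16`, each item's 16 completions form one group: above-average completions receive positive advantages and below-average ones negative, so training pushes probability mass toward the better completions of the same prompt.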
## Architecture Diagram

```
┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
│   Atropos API   │◄─────│   Environment    │─────►│  OpenAI/sglang  │
│   (run-api)     │      │  (BaseEnv impl)  │      │  Inference API  │
│   Port 8000     │      │                  │      │   Port 8001     │
└────────┬────────┘      └──────────────────┘      └────────┬────────┘
         │                                                  │
         │  Batches (tokens + scores + logprobs)            │
         │                                                  │
         ▼                                                  │
┌─────────────────┐                                         │
│  Tinker Trainer │◄────────────────────────────────────────┘
│ (LoRA training) │   Serves inference via FastAPI
│   + FastAPI     │   Trains via Tinker ServiceClient
└─────────────────┘
```
## Creating Custom Environments

To create a new RL environment:

1. Create a Python file in `tinker-atropos/tinker_atropos/environments/`
2. Define a class that inherits from `BaseEnv`
3. Implement the required methods:
   - `load_dataset()` — Load your training data
   - `get_next_item()` — Provide the next item to the model
   - `score_answer()` — Score model outputs and assign rewards
   - `collect_trajectories()` — Collect and return trajectories
4. Optionally define a custom config class inheriting from `BaseEnvConfig`

Study the existing `gsm8k_tinker.py` as a template. The agent can help you create new environments — it can read existing environment files, inspect HuggingFace datasets, and write new environment code.
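A toy skeleton of the shape described above. Everything here is hypothetical: `WordCountEnv` is invented, the `BaseEnv` stub stands in so the sketch is self-contained, and the real `BaseEnv` in tinker-atropos has richer signatures (including `collect_trajectories`):

```python
class BaseEnv:
    """Stand-in stub; the real base class lives in tinker-atropos."""

class WordCountEnv(BaseEnv):
    """Toy task: the model must report how many words an item contains."""

    def load_dataset(self):
        # Real environments would load from HuggingFace datasets here.
        self.items = ["the quick brown fox", "hello world"]

    def get_next_item(self):
        return self.items.pop(0) if self.items else None

    def score_answer(self, item: str, answer: str) -> float:
        # Binary reward: 1.0 for the correct word count, else 0.0.
        return 1.0 if answer.strip() == str(len(item.split())) else 0.0
```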
## WandB Metrics

Training runs log to Weights & Biases with these key metrics:

| Metric | Description |
|--------|-------------|
| `train/loss` | Training loss (importance sampling) |
| `train/learning_rate` | Current learning rate |
| `reward/mean` | Mean reward across groups |
| `logprobs/mean` | Mean reference logprobs |
| `logprobs/mean_training` | Mean training logprobs |
| `logprobs/diff` | Logprob drift (reference - training) |
| `advantages/mean` | Mean advantage values |
| `advantages/std` | Advantage standard deviation |
## Log Files

Each training run generates log files in `tinker-atropos/logs/`:

```
logs/
├── api_{run_id}.log       # Atropos API server logs
├── trainer_{run_id}.log   # Tinker trainer logs
├── env_{run_id}.log       # Environment process logs
└── inference_tests/       # Inference test results
    ├── test_{env}_{model}.jsonl
    └── test_{env}_{model}.log
```

These are invaluable for debugging when training fails or produces unexpected results.
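When a process exits early, the traceback is usually in the tail of its log. A small helper for pulling the last lines (hypothetical; not part of the Hermes toolset):

```python
from collections import deque

def tail(log_path: str, n: int = 20) -> list[str]:
    """Return the last n lines of a log file."""
    with open(log_path, errors="replace") as f:
        return list(deque(f, maxlen=n))
```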