mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-24 05:41:40 +00:00
chore: remove Atropos RL environments and tinker-atropos integration (#26106)
* chore: remove Atropos RL environments, tools, tests, skill, and tinker-atropos submodule Delete: - environments/ (43 files — base env, agent loop, tool call parsers, benchmarks) - rl_cli.py (standalone RL training CLI) - tools/rl_training_tool.py (all 10 rl_* tools) - tests: test_rl_training_tool, test_tool_call_parsers, test_managed_server_tool_support, test_agent_loop, test_agent_loop_vllm, test_agent_loop_tool_calling, test_terminalbench2_env_security - optional-skills/mlops/hermes-atropos-environments/ - tinker-atropos git submodule + .gitmodules * chore: remove RL/Atropos references from Python source - toolsets.py: remove rl toolset block + update comment - model_tools.py: remove rl_tools group + update async bridging comment - hermes_cli/tools_config.py: remove RL display entry, _DEFAULT_OFF_TOOLSETS, setup block, and rl_training post-setup handler - tools/budget_config.py: remove RL environment reference in docstring - tests/test_model_tools.py: remove rl_tools from expected groups - tests/run_agent/test_streaming_tool_call_repair.py: fix stale cross-reference * chore: remove rl/yc-bench extras and tinker-atropos refs from pyproject.toml - Remove rl extra (atroposlib, tinker, fastapi, uvicorn, wandb) - Remove yc-bench extra - Remove rl_cli from py-modules - Remove [tool.ty.src] exclude for tinker-atropos - Remove [tool.ruff] exclude for tinker-atropos - Regenerate uv.lock * chore: remove tinker-atropos from install/setup scripts - setup-hermes.sh: remove entire tinker-atropos submodule install block - scripts/install.sh: remove both tinker-atropos blocks (Termux + standard) - scripts/install.ps1: remove tinker-atropos block - nix/hermes-agent.nix: remove tinker-atropos pip install line * chore: remove RL references from cli-config.yaml.example * docs: remove Atropos/RL references from README, CONTRIBUTING, AGENTS.md * docs: remove RL/Atropos references from website - Delete: environments.md, rl-training.md, mlops-hermes-atropos-environments.md - sidebars.ts: remove rl-training and environments sidebar entries - optional-skills-catalog.md: remove hermes-atropos-environments row - tools-reference.md: remove entire rl toolset section - toolsets-reference.md: remove rl row + update example - integrations/index.md: remove RL Training bullet - architecture.md: remove environments/ from tree + RL section - contributing.md: remove tinker-atropos setup - updating.md: remove tinker-atropos install + stale submodule update * chore: remove remaining RL/Atropos stragglers - hermes_cli/config.py: remove TINKER_API_KEY + WANDB_API_KEY env var defs - hermes_cli/doctor.py: remove Submodules check section (tinker-atropos) - hermes_cli/setup.py: remove RL Training status check - hermes_cli/status.py: remove Tinker + WandB from API key status display - agent/display.py: remove both rl_* tool preview/activity blocks - website/docs: remove RL references from providers.md + env-variables.md - tests: remove TINKER_API_KEY from conftest, set_config_value, setup_script * chore: remove RL training section from .env.example
This commit is contained in:
parent
d364132114
commit
5af672c753
97 changed files with 18 additions and 15690 deletions
|
|
@ -127,7 +127,6 @@ hermes-agent/
|
|||
├── cron/ # Scheduler (jobs.py, scheduler.py)
|
||||
├── plugins/memory/ # Memory provider plugins
|
||||
├── plugins/context_engine/ # Context engine plugins
|
||||
├── environments/ # RL training environments (Atropos)
|
||||
├── skills/ # Bundled skills (always available)
|
||||
├── optional-skills/ # Official optional skills (install explicitly)
|
||||
├── website/ # Docusaurus documentation site
|
||||
|
|
@ -185,7 +184,6 @@ If you are new to the codebase:
|
|||
8. **[Gateway Internals](./gateway-internals.md)** — messaging platform gateway
|
||||
9. **[Context Compression & Prompt Caching](./context-compression-and-caching.md)** — compression and caching
|
||||
10. **[ACP Internals](./acp-internals.md)** — IDE integration
|
||||
11. **[Environments, Benchmarks & Data Generation](./environments.md)** — RL training
|
||||
|
||||
## Major Subsystems
|
||||
|
||||
|
|
@ -247,11 +245,11 @@ Exposes Hermes as an editor-native agent over stdio/JSON-RPC for VS Code, Zed, a
|
|||
|
||||
→ [ACP Internals](./acp-internals.md)
|
||||
|
||||
### RL / Environments / Trajectories
|
||||
### Trajectories
|
||||
|
||||
Full environment framework for evaluation and RL training. Integrates with Atropos, supports multiple tool-call parsers, and generates ShareGPT-format trajectories.
|
||||
Generates ShareGPT-format trajectories from agent sessions for training data generation.
|
||||
|
||||
→ [Environments, Benchmarks & Data Generation](./environments.md), [Trajectories & Training Format](./trajectory-format.md)
|
||||
→ [Trajectories & Training Format](./trajectory-format.md)
|
||||
|
||||
## Design Principles
|
||||
|
||||
|
|
|
|||
|
|
@ -50,9 +50,6 @@ export VIRTUAL_ENV="$(pwd)/venv"
|
|||
|
||||
# Install with all extras (messaging, cron, CLI menus, dev tools)
|
||||
uv pip install -e ".[all,dev]"
|
||||
# tinker-atropos is a git submodule — needs `git submodule update --init` first
|
||||
# if you didn't clone with `--recurse-submodules`
|
||||
uv pip install -e "./tinker-atropos"
|
||||
|
||||
# Optional: browser tools
|
||||
npm install
|
||||
|
|
|
|||
|
|
@ -1,520 +0,0 @@
|
|||
---
|
||||
sidebar_position: 5
|
||||
title: "Environments, Benchmarks & Data Generation"
|
||||
description: "Building RL training environments, running evaluation benchmarks, and generating SFT data with the Hermes-Agent Atropos integration"
|
||||
---
|
||||
|
||||
# Environments, Benchmarks & Data Generation
|
||||
|
||||
Hermes Agent includes a full environment framework that connects its tool-calling capabilities to the [Atropos](https://github.com/NousResearch/atropos) RL training framework. This enables three workflows:
|
||||
|
||||
1. **RL Training** — Train language models on multi-turn agentic tasks with GRPO
|
||||
2. **Benchmarks** — Evaluate models on standardised agentic benchmarks
|
||||
3. **Data Generation** — Generate SFT training data from agent rollouts
|
||||
|
||||
All three share the same core: an **environment** class that defines tasks, runs an agent loop, and scores the output.
|
||||
|
||||
:::info Repo environments vs RL training tools
|
||||
The Python environment framework documented here lives under the repo's `environments/` directory and is the implementation-level API for Hermes/Atropos integration. This is separate from the user-facing `rl_*` tools, which operate as an orchestration surface for remote RL training workflows.
|
||||
:::
|
||||
|
||||
:::tip Quick Links
|
||||
- **Want to run benchmarks?** Jump to [Available Benchmarks](#available-benchmarks)
|
||||
- **Want to train with RL?** See [RL Training Tools](/user-guide/features/rl-training) for the agent-driven interface, or [Running Environments](#running-environments) for manual execution
|
||||
- **Want to create a new environment?** See [Creating Environments](#creating-environments)
|
||||
:::
|
||||
|
||||
## Architecture
|
||||
|
||||
The environment system is built on a three-layer inheritance chain:
|
||||
|
||||
```mermaid
|
||||
classDiagram
|
||||
class BaseEnv {
|
||||
Server management
|
||||
Worker scheduling
|
||||
Wandb logging
|
||||
CLI: serve / process / evaluate
|
||||
}
|
||||
|
||||
class HermesAgentBaseEnv {
|
||||
Terminal backend configuration
|
||||
Tool resolution
|
||||
Agent loop engine
|
||||
ToolContext access
|
||||
}
|
||||
|
||||
class TerminalTestEnv {
|
||||
Stack testing
|
||||
}
|
||||
|
||||
class HermesSweEnv {
|
||||
SWE training
|
||||
}
|
||||
|
||||
class TerminalBench2EvalEnv {
|
||||
Benchmark evaluation
|
||||
}
|
||||
|
||||
class TBLiteEvalEnv {
|
||||
Fast benchmark
|
||||
}
|
||||
|
||||
class YCBenchEvalEnv {
|
||||
Long-horizon benchmark
|
||||
}
|
||||
|
||||
BaseEnv <|-- HermesAgentBaseEnv
|
||||
HermesAgentBaseEnv <|-- TerminalTestEnv
|
||||
HermesAgentBaseEnv <|-- HermesSweEnv
|
||||
HermesAgentBaseEnv <|-- TerminalBench2EvalEnv
|
||||
TerminalBench2EvalEnv <|-- TBLiteEvalEnv
|
||||
TerminalBench2EvalEnv <|-- YCBenchEvalEnv
|
||||
```
|
||||
|
||||
### BaseEnv (Atropos)
|
||||
|
||||
The foundation from `atroposlib`. Provides:
|
||||
- **Server management** — connects to OpenAI-compatible APIs (VLLM, SGLang, OpenRouter)
|
||||
- **Worker scheduling** — parallel rollout coordination
|
||||
- **Wandb integration** — metrics logging and rollout visualisation
|
||||
- **CLI interface** — three subcommands: `serve`, `process`, `evaluate`
|
||||
- **Eval logging** — `evaluate_log()` saves results to JSON + JSONL
|
||||
|
||||
### HermesAgentBaseEnv
|
||||
|
||||
The hermes-agent layer (`environments/hermes_base_env.py`). Adds:
|
||||
- **Terminal backend configuration** — sets `TERMINAL_ENV` for sandboxed execution (local, Docker, Modal, Daytona, SSH, Singularity)
|
||||
- **Tool resolution** — `_resolve_tools_for_group()` calls hermes-agent's `get_tool_definitions()` to get the right tool schemas based on enabled/disabled toolsets
|
||||
- **Agent loop integration** — `collect_trajectory()` runs `HermesAgentLoop` and scores the result
|
||||
- **Two-phase operation** — Phase 1 (OpenAI server) for eval/SFT, Phase 2 (VLLM ManagedServer) for full RL with logprobs
|
||||
- **Async safety patches** — monkey-patches Modal backend to work inside Atropos's event loop
|
||||
|
||||
### Concrete Environments
|
||||
|
||||
Your environment inherits from `HermesAgentBaseEnv` and implements five methods:
|
||||
|
||||
| Method | Purpose |
|
||||
|--------|---------|
|
||||
| `setup()` | Load dataset, initialise state |
|
||||
| `get_next_item()` | Return the next item for rollout |
|
||||
| `format_prompt(item)` | Convert an item into the user message |
|
||||
| `compute_reward(item, result, ctx)` | Score the rollout (0.0–1.0) |
|
||||
| `evaluate()` | Periodic evaluation logic |
|
||||
|
||||
## Core Components
|
||||
|
||||
### Agent Loop
|
||||
|
||||
`HermesAgentLoop` (`environments/agent_loop.py`) is the reusable multi-turn agent engine. It runs the same tool-calling pattern as hermes-agent's main loop:
|
||||
|
||||
1. Send messages + tool schemas to the API via `server.chat_completion()`
|
||||
2. If the response contains `tool_calls`, dispatch each via `handle_function_call()`
|
||||
3. Append tool results to the conversation, go back to step 1
|
||||
4. If no `tool_calls`, the agent is done
|
||||
|
||||
Tool calls execute in a thread pool (`ThreadPoolExecutor(128)`) so that async backends (Modal, Docker) don't deadlock inside Atropos's event loop.
|
||||
|
||||
Returns an `AgentResult`:
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class AgentResult:
|
||||
messages: List[Dict[str, Any]] # Full conversation history
|
||||
turns_used: int # Number of LLM calls made
|
||||
finished_naturally: bool # True if model stopped on its own
|
||||
reasoning_per_turn: List[Optional[str]] # Extracted reasoning content
|
||||
tool_errors: List[ToolError] # Errors encountered during tool dispatch
|
||||
managed_state: Optional[Dict] # VLLM ManagedServer state (Phase 2)
|
||||
```
|
||||
|
||||
### Tool Context
|
||||
|
||||
`ToolContext` (`environments/tool_context.py`) gives reward functions direct access to the **same sandbox** the model used during its rollout. The `task_id` scoping means all state (files, processes, browser tabs) is preserved.
|
||||
|
||||
```python
|
||||
async def compute_reward(self, item, result, ctx: ToolContext):
|
||||
# Run tests in the model's terminal sandbox
|
||||
test = ctx.terminal("pytest -v")
|
||||
if test["exit_code"] == 0:
|
||||
return 1.0
|
||||
|
||||
# Check if a file was created
|
||||
content = ctx.read_file("/workspace/solution.py")
|
||||
if content.get("content"):
|
||||
return 0.5
|
||||
|
||||
# Download files for local verification
|
||||
ctx.download_file("/remote/output.bin", "/local/output.bin")
|
||||
return 0.0
|
||||
```
|
||||
|
||||
Available methods:
|
||||
|
||||
| Category | Methods |
|
||||
|----------|---------|
|
||||
| **Terminal** | `terminal(command, timeout)` |
|
||||
| **Files** | `read_file(path)`, `write_file(path, content)`, `search(query, path)` |
|
||||
| **Transfers** | `upload_file()`, `upload_dir()`, `download_file()`, `download_dir()` |
|
||||
| **Web** | `web_search(query)`, `web_extract(urls)` |
|
||||
| **Browser** | `browser_navigate(url)`, `browser_snapshot()` |
|
||||
| **Generic** | `call_tool(name, args)` — escape hatch for any hermes-agent tool |
|
||||
| **Cleanup** | `cleanup()` — release all resources |
|
||||
|
||||
### Tool Call Parsers
|
||||
|
||||
For **Phase 2** (VLLM ManagedServer), the server returns raw text without structured tool calls. Client-side parsers in `environments/tool_call_parsers/` extract `tool_calls` from raw output:
|
||||
|
||||
```python
|
||||
from environments.tool_call_parsers import get_parser
|
||||
|
||||
parser = get_parser("hermes") # or "mistral", "llama3_json", "qwen", "deepseek_v3", etc.
|
||||
content, tool_calls = parser.parse(raw_model_output)
|
||||
```
|
||||
|
||||
Available parsers: `hermes`, `mistral`, `llama3_json`, `llama4_json`, `qwen`, `qwen3_coder`, `deepseek_v3`, `deepseek_v3_1` (alias `deepseek_v31`), `kimi_k2`, `longcat`, `glm45`, `glm47`.
|
||||
|
||||
In Phase 1 (OpenAI server type), parsers are not needed — the server handles tool call parsing natively.
|
||||
|
||||
## Available Benchmarks
|
||||
|
||||
### TerminalBench2
|
||||
|
||||
**89 challenging terminal tasks** with per-task Docker sandbox environments.
|
||||
|
||||
| | |
|
||||
|---|---|
|
||||
| **What it tests** | Single-task coding/sysadmin ability |
|
||||
| **Scoring** | Binary pass/fail (test suite verification) |
|
||||
| **Sandbox** | Modal cloud sandboxes (per-task Docker images) |
|
||||
| **Tools** | `terminal` + `file` |
|
||||
| **Tasks** | 89 tasks across multiple categories |
|
||||
| **Cost** | ~$50–200 for full eval (parallel execution) |
|
||||
| **Time** | ~2–4 hours |
|
||||
|
||||
```bash
|
||||
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
|
||||
--config environments/benchmarks/terminalbench_2/default.yaml
|
||||
|
||||
# Run specific tasks
|
||||
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
|
||||
--config environments/benchmarks/terminalbench_2/default.yaml \
|
||||
--env.task_filter fix-git,git-multibranch
|
||||
```
|
||||
|
||||
Dataset: [NousResearch/terminal-bench-2](https://huggingface.co/datasets/NousResearch/terminal-bench-2) on HuggingFace.
|
||||
|
||||
### TBLite (OpenThoughts Terminal Bench Lite)
|
||||
|
||||
**100 difficulty-calibrated tasks** — a faster proxy for TerminalBench2.
|
||||
|
||||
| | |
|
||||
|---|---|
|
||||
| **What it tests** | Same as TB2 (coding/sysadmin), calibrated difficulty tiers |
|
||||
| **Scoring** | Binary pass/fail |
|
||||
| **Sandbox** | Modal cloud sandboxes |
|
||||
| **Tools** | `terminal` + `file` |
|
||||
| **Tasks** | 100 tasks: Easy (40), Medium (26), Hard (26), Extreme (8) |
|
||||
| **Correlation** | r=0.911 with full TB2 |
|
||||
| **Speed** | 2.6–8× faster than TB2 |
|
||||
|
||||
```bash
|
||||
python environments/benchmarks/tblite/tblite_env.py evaluate \
|
||||
--config environments/benchmarks/tblite/default.yaml
|
||||
```
|
||||
|
||||
TBLite is a thin subclass of TerminalBench2 — only the dataset and timeouts differ. Created by the OpenThoughts Agent team (Snorkel AI + Bespoke Labs). Dataset: [NousResearch/openthoughts-tblite](https://huggingface.co/datasets/NousResearch/openthoughts-tblite).
|
||||
|
||||
### YC-Bench
|
||||
|
||||
**Long-horizon strategic benchmark** — the agent plays CEO of an AI startup.
|
||||
|
||||
| | |
|
||||
|---|---|
|
||||
| **What it tests** | Multi-turn strategic coherence over hundreds of turns |
|
||||
| **Scoring** | Composite: `0.5 × survival + 0.5 × normalised_funds` |
|
||||
| **Sandbox** | Local terminal (no Modal needed) |
|
||||
| **Tools** | `terminal` only |
|
||||
| **Runs** | 9 default (3 presets × 3 seeds), sequential |
|
||||
| **Cost** | ~$50–200 for full eval |
|
||||
| **Time** | ~3–6 hours |
|
||||
|
||||
```bash
|
||||
# Install yc-bench (optional dependency)
|
||||
pip install "hermes-agent[yc-bench]"
|
||||
|
||||
# Run evaluation
|
||||
bash environments/benchmarks/yc_bench/run_eval.sh
|
||||
|
||||
# Or directly
|
||||
python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
|
||||
--config environments/benchmarks/yc_bench/default.yaml
|
||||
|
||||
# Quick single-preset test
|
||||
python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
|
||||
--config environments/benchmarks/yc_bench/default.yaml \
|
||||
--env.presets '["fast_test"]' --env.seeds '[1]'
|
||||
```
|
||||
|
||||
YC-Bench uses [collinear-ai/yc-bench](https://github.com/collinear-ai/yc-bench) — a deterministic simulation with 4 skill domains (research, inference, data_environment, training), prestige system, employee management, and financial pressure. Unlike TB2's per-task binary scoring, YC-Bench measures whether an agent can maintain coherent strategy over hundreds of compounding decisions.
|
||||
|
||||
## Training Environments
|
||||
|
||||
### TerminalTestEnv
|
||||
|
||||
A minimal self-contained environment with inline tasks (no external dataset). Used for **validating the full stack** end-to-end. Each task asks the model to create a file at a known path; the verifier checks the content.
|
||||
|
||||
```bash
|
||||
# Process mode (saves rollouts to JSONL, no training server needed)
|
||||
python environments/terminal_test_env/terminal_test_env.py process \
|
||||
--env.data_path_to_save_groups terminal_test_output.jsonl
|
||||
|
||||
# Serve mode (connects to Atropos API for RL training)
|
||||
python environments/terminal_test_env/terminal_test_env.py serve
|
||||
```
|
||||
|
||||
### HermesSweEnv
|
||||
|
||||
SWE-bench style training environment. The model gets a coding task, uses terminal + file + web tools to solve it, and the reward function runs tests in the same Modal sandbox.
|
||||
|
||||
```bash
|
||||
python environments/hermes_swe_env/hermes_swe_env.py serve \
|
||||
--openai.model_name YourModel \
|
||||
--env.dataset_name bigcode/humanevalpack \
|
||||
--env.terminal_backend modal
|
||||
```
|
||||
|
||||
## Running Environments
|
||||
|
||||
Every environment is a standalone Python script with three CLI subcommands:
|
||||
|
||||
### `evaluate` — Run a benchmark
|
||||
|
||||
For eval-only environments (benchmarks). Runs all items, computes metrics, logs to wandb.
|
||||
|
||||
```bash
|
||||
python environments/benchmarks/tblite/tblite_env.py evaluate \
|
||||
--config environments/benchmarks/tblite/default.yaml \
|
||||
--openai.model_name anthropic/claude-sonnet-4.6
|
||||
```
|
||||
|
||||
No training server or `run-api` needed. The environment handles everything.
|
||||
|
||||
### `process` — Generate SFT data
|
||||
|
||||
Runs rollouts and saves scored trajectories to JSONL. Useful for generating training data without a full RL loop.
|
||||
|
||||
```bash
|
||||
python environments/terminal_test_env/terminal_test_env.py process \
|
||||
--env.data_path_to_save_groups output.jsonl \
|
||||
--openai.model_name anthropic/claude-sonnet-4.6
|
||||
```
|
||||
|
||||
Output format: each line is a scored trajectory with the full conversation history, reward, and metadata.
|
||||
|
||||
### `serve` — Connect to Atropos for RL training
|
||||
|
||||
Connects the environment to a running Atropos API server (`run-api`). Used during live RL training.
|
||||
|
||||
```bash
|
||||
# Terminal 1: Start the Atropos API
|
||||
run-api
|
||||
|
||||
# Terminal 2: Start the environment
|
||||
python environments/hermes_swe_env/hermes_swe_env.py serve \
|
||||
--openai.model_name YourModel
|
||||
```
|
||||
|
||||
The environment receives items from Atropos, runs agent rollouts, computes rewards, and sends scored trajectories back for training.
|
||||
|
||||
## Two-Phase Operation
|
||||
|
||||
### Phase 1: OpenAI Server (Eval / SFT)
|
||||
|
||||
Uses `server.chat_completion()` with `tools=` parameter. The server (VLLM, SGLang, OpenRouter, OpenAI) handles tool call parsing natively. Returns `ChatCompletion` objects with structured `tool_calls`.
|
||||
|
||||
- **Use for**: evaluation, SFT data generation, benchmarks, testing
|
||||
- **Placeholder tokens** are created for the Atropos pipeline (since real token IDs aren't available from the OpenAI API)
|
||||
|
||||
### Phase 2: VLLM ManagedServer (Full RL)
|
||||
|
||||
Uses ManagedServer for exact token IDs + logprobs via `/generate`. A client-side [tool call parser](#tool-call-parsers) reconstructs structured `tool_calls` from raw output.
|
||||
|
||||
- **Use for**: full RL training with GRPO/PPO
|
||||
- **Real tokens**, masks, and logprobs flow through the pipeline
|
||||
- Set `tool_call_parser` in config to match your model's format (e.g., `"hermes"`, `"qwen"`, `"mistral"`)
|
||||
|
||||
## Creating Environments
|
||||
|
||||
### Training Environment
|
||||
|
||||
```python
|
||||
from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
|
||||
from atroposlib.envs.server_handling.server_manager import APIServerConfig
|
||||
|
||||
class MyEnvConfig(HermesAgentEnvConfig):
|
||||
my_custom_field: str = "default_value"
|
||||
|
||||
class MyEnv(HermesAgentBaseEnv):
|
||||
name = "my-env"
|
||||
env_config_cls = MyEnvConfig
|
||||
|
||||
@classmethod
|
||||
def config_init(cls):
|
||||
env_config = MyEnvConfig(
|
||||
enabled_toolsets=["terminal", "file"],
|
||||
terminal_backend="modal",
|
||||
max_agent_turns=30,
|
||||
)
|
||||
server_configs = [APIServerConfig(
|
||||
base_url="https://openrouter.ai/api/v1",
|
||||
model_name="anthropic/claude-sonnet-4.6",
|
||||
server_type="openai",
|
||||
)]
|
||||
return env_config, server_configs
|
||||
|
||||
async def setup(self):
|
||||
from datasets import load_dataset
|
||||
self.dataset = list(load_dataset("my-dataset", split="train"))
|
||||
self.iter = 0
|
||||
|
||||
async def get_next_item(self):
|
||||
item = self.dataset[self.iter % len(self.dataset)]
|
||||
self.iter += 1
|
||||
return item
|
||||
|
||||
def format_prompt(self, item):
|
||||
return item["instruction"]
|
||||
|
||||
async def compute_reward(self, item, result, ctx):
|
||||
# ctx gives full tool access to the rollout's sandbox
|
||||
test = ctx.terminal("pytest -v")
|
||||
return 1.0 if test["exit_code"] == 0 else 0.0
|
||||
|
||||
async def evaluate(self, *args, **kwargs):
|
||||
# Periodic evaluation during training
|
||||
pass
|
||||
|
||||
if __name__ == "__main__":
|
||||
MyEnv.cli()
|
||||
```
|
||||
|
||||
### Eval-Only Benchmark
|
||||
|
||||
For benchmarks, follow the pattern used by TerminalBench2, TBLite, and YC-Bench:
|
||||
|
||||
1. **Create under** `environments/benchmarks/your-benchmark/`
|
||||
2. **Set eval-only config**: `eval_handling=STOP_TRAIN`, `steps_per_eval=1`, `total_steps=1`
|
||||
3. **Stub training methods**: `collect_trajectories()` returns `(None, [])`, `score()` returns `None`
|
||||
4. **Implement** `rollout_and_score_eval(eval_item)` — the per-item agent loop + scoring
|
||||
5. **Implement** `evaluate()` — orchestrates all runs, computes aggregate metrics
|
||||
6. **Add streaming JSONL** for crash-safe result persistence
|
||||
7. **Add cleanup**: `KeyboardInterrupt` handling, `cleanup_all_environments()`, `_tool_executor.shutdown()`
|
||||
8. **Run with** `evaluate` subcommand
|
||||
|
||||
See `environments/benchmarks/yc_bench/yc_bench_env.py` for a clean, well-documented reference implementation.
|
||||
|
||||
## Configuration Reference
|
||||
|
||||
### HermesAgentEnvConfig Fields
|
||||
|
||||
| Field | Type | Default | Description |
|
||||
|-------|------|---------|-------------|
|
||||
| `enabled_toolsets` | `List[str]` | `None` (all) | Which hermes toolsets to enable |
|
||||
| `disabled_toolsets` | `List[str]` | `None` | Toolsets to filter out |
|
||||
| `distribution` | `str` | `None` | Probabilistic toolset distribution name |
|
||||
| `max_agent_turns` | `int` | `30` | Max LLM calls per rollout |
|
||||
| `agent_temperature` | `float` | `1.0` | Sampling temperature |
|
||||
| `system_prompt` | `str` | `None` | System message for the agent |
|
||||
| `terminal_backend` | `str` | `"local"` | `local`, `docker`, `modal`, `daytona`, `ssh`, `singularity` |
|
||||
| `terminal_timeout` | `int` | `120` | Seconds per terminal command |
|
||||
| `terminal_lifetime` | `int` | `3600` | Max sandbox lifetime |
|
||||
| `dataset_name` | `str` | `None` | HuggingFace dataset identifier |
|
||||
| `tool_pool_size` | `int` | `128` | Thread pool size for tool execution |
|
||||
| `tool_call_parser` | `str` | `"hermes"` | Parser for Phase 2 raw output |
|
||||
| `extra_body` | `Dict` | `None` | Extra params for OpenAI API (e.g., OpenRouter provider prefs) |
|
||||
| `eval_handling` | `Enum` | `STOP_TRAIN` | `STOP_TRAIN`, `LIMIT_TRAIN`, `NONE` |
|
||||
|
||||
### YAML Configuration
|
||||
|
||||
Environments can be configured via YAML files passed with `--config`:
|
||||
|
||||
```yaml
|
||||
env:
|
||||
enabled_toolsets: ["terminal", "file"]
|
||||
max_agent_turns: 60
|
||||
max_token_length: 32000
|
||||
agent_temperature: 0.8
|
||||
terminal_backend: "modal"
|
||||
terminal_timeout: 300
|
||||
dataset_name: "NousResearch/terminal-bench-2"
|
||||
tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
|
||||
use_wandb: true
|
||||
wandb_name: "my-benchmark"
|
||||
|
||||
openai:
|
||||
base_url: "https://openrouter.ai/api/v1"
|
||||
model_name: "anthropic/claude-sonnet-4.6"
|
||||
server_type: "openai"
|
||||
health_check: false
|
||||
```
|
||||
|
||||
YAML values override `config_init()` defaults. CLI arguments override YAML values:
|
||||
|
||||
```bash
|
||||
python my_env.py evaluate \
|
||||
--config my_config.yaml \
|
||||
--openai.model_name anthropic/claude-opus-4.6 # overrides YAML
|
||||
```
|
||||
|
||||
## Prerequisites
|
||||
|
||||
### For all environments
|
||||
|
||||
- Python >= 3.11
|
||||
- `atroposlib`: `pip install git+https://github.com/NousResearch/atropos.git`
|
||||
- An LLM API key (OpenRouter, OpenAI, or self-hosted VLLM/SGLang)
|
||||
|
||||
### For Modal-sandboxed benchmarks (TB2, TBLite)
|
||||
|
||||
- [Modal](https://modal.com) account and CLI: `pip install "hermes-agent[modal]"`
|
||||
- `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` environment variables
|
||||
|
||||
### For YC-Bench
|
||||
|
||||
- `pip install "hermes-agent[yc-bench]"` (installs the yc-bench CLI + SQLAlchemy)
|
||||
- No Modal needed — runs with local terminal backend
|
||||
|
||||
### For RL training
|
||||
|
||||
- `TINKER_API_KEY` — API key for the [Tinker](https://tinker.computer) training service
|
||||
- `WANDB_API_KEY` — for Weights & Biases metrics tracking
|
||||
- The `tinker-atropos` submodule (at `tinker-atropos/` in the repo)
|
||||
|
||||
See [RL Training](/user-guide/features/rl-training) for the agent-driven RL workflow.
|
||||
|
||||
## Directory Structure
|
||||
|
||||
```
|
||||
environments/
|
||||
├── hermes_base_env.py # Abstract base class (HermesAgentBaseEnv)
|
||||
├── agent_loop.py # Multi-turn agent engine (HermesAgentLoop)
|
||||
├── tool_context.py # Per-rollout tool access for reward functions
|
||||
├── patches.py # Async-safety patches for Modal backend
|
||||
│
|
||||
├── tool_call_parsers/ # Phase 2 client-side parsers
|
||||
│ ├── hermes_parser.py # Hermes/ChatML <tool_call> format
|
||||
│ ├── mistral_parser.py # Mistral [TOOL_CALLS] format
|
||||
│ ├── llama_parser.py # Llama 3 JSON tool calling
|
||||
│ ├── qwen_parser.py # Qwen format
|
||||
│ ├── deepseek_v3_parser.py # DeepSeek V3 format
|
||||
│ └── ... # + kimi_k2, longcat, glm45/47, etc.
|
||||
│
|
||||
├── terminal_test_env/ # Stack validation (inline tasks)
|
||||
├── hermes_swe_env/ # SWE-bench training environment
|
||||
│
|
||||
└── benchmarks/ # Evaluation benchmarks
|
||||
├── terminalbench_2/ # 89 terminal tasks, Modal sandboxes
|
||||
├── tblite/ # 100 calibrated tasks (fast TB2 proxy)
|
||||
└── yc_bench/ # Long-horizon strategic benchmark
|
||||
```
|
||||
|
|
@ -123,13 +123,11 @@ If you installed manually (not via the quick installer):
|
|||
cd /path/to/hermes-agent
|
||||
export VIRTUAL_ENV="$(pwd)/venv"
|
||||
|
||||
# Pull latest code and submodules
|
||||
# Pull latest code
|
||||
git pull origin main
|
||||
git submodule update --init --recursive
|
||||
|
||||
# Reinstall (picks up new dependencies)
|
||||
uv pip install -e ".[all]"
|
||||
uv pip install -e "./tinker-atropos"
|
||||
|
||||
# Check for new config options
|
||||
hermes config check
|
||||
|
|
|
|||
|
|
@ -97,5 +97,4 @@ See the [Messaging Gateway overview](/docs/user-guide/messaging) for the platfor
|
|||
|
||||
## Training & Evaluation
|
||||
|
||||
- **[RL Training](/docs/user-guide/features/rl-training)** — Generate trajectory data from agent sessions for reinforcement learning and model fine-tuning. Supports Atropos environments with customizable reward functions.
|
||||
- **[Batch Processing](/docs/user-guide/features/batch-processing)** — Run the agent across hundreds of prompts in parallel, generating structured ShareGPT-format trajectory data for training data generation or evaluation.
|
||||
|
|
|
|||
|
|
@ -1355,7 +1355,6 @@ You can switch between providers at any time with `hermes model` — no restart
|
|||
| Premium TTS voices | [ElevenLabs](https://elevenlabs.io/) | `ELEVENLABS_API_KEY` |
|
||||
| OpenAI TTS + voice transcription | [OpenAI](https://platform.openai.com/api-keys) | `VOICE_TOOLS_OPENAI_KEY` |
|
||||
| Mistral TTS + voice transcription | [Mistral](https://console.mistral.ai/) | `MISTRAL_API_KEY` |
|
||||
| RL Training | [Tinker](https://tinker-console.thinkingmachines.ai/) + [WandB](https://wandb.ai/) | `TINKER_API_KEY`, `WANDB_API_KEY` |
|
||||
| Cross-session user modeling | [Honcho](https://honcho.dev/) | `HONCHO_API_KEY` |
|
||||
| Semantic long-term memory | [Supermemory](https://supermemory.ai) | `SUPERMEMORY_API_KEY` |
|
||||
|
||||
|
|
|
|||
|
|
@ -148,8 +148,6 @@ For native Anthropic auth, Hermes prefers Claude Code's own credential files whe
|
|||
| `HONCHO_BASE_URL` | Base URL for self-hosted Honcho instances (default: Honcho cloud). No API key required for local instances |
|
||||
| `HINDSIGHT_TIMEOUT` | Timeout in seconds for Hindsight memory-provider API calls (default: `60`). Bump this if your Hindsight instance is slow to respond during `/sync` or `on_session_switch` and you're seeing timeouts in `errors.log`. |
|
||||
| `SUPERMEMORY_API_KEY` | Semantic long-term memory with profile recall and session ingest ([supermemory.ai](https://supermemory.ai)) |
|
||||
| `TINKER_API_KEY` | RL training ([tinker-console.thinkingmachines.ai](https://tinker-console.thinkingmachines.ai/)) |
|
||||
| `WANDB_API_KEY` | RL training metrics ([wandb.ai](https://wandb.ai/)) |
|
||||
| `DAYTONA_API_KEY` | Daytona cloud sandboxes ([daytona.io](https://daytona.io/)) |
|
||||
| `VERCEL_TOKEN` | Vercel Sandbox access token ([vercel.com](https://vercel.com/)) |
|
||||
| `VERCEL_PROJECT_ID` | Vercel project ID (required with `VERCEL_TOKEN`) |
|
||||
|
|
|
|||
|
|
@ -120,7 +120,6 @@ hermes skills uninstall <skill-name>
|
|||
| [**faiss**](/docs/user-guide/skills/optional/mlops/mlops-faiss) | Facebook's library for efficient similarity search and clustering of dense vectors. Supports billions of vectors, GPU acceleration, and various index types (Flat, IVF, HNSW). Use for fast k-NN search, large-scale vector retrieval, or whe... |
|
||||
| [**optimizing-attention-flash**](/docs/user-guide/skills/optional/mlops/mlops-flash-attention) | Optimizes transformer attention with Flash Attention for 2-4x speedup and 10-20x memory reduction. Use when training/running transformers with long sequences (>512 tokens), encountering GPU memory issues with attention, or need faster in... |
|
||||
| [**guidance**](/docs/user-guide/skills/optional/mlops/mlops-guidance) | Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance - Microsoft Research's constrained generation framework |
|
||||
| [**hermes-atropos-environments**](/docs/user-guide/skills/optional/mlops/mlops-hermes-atropos-environments) | Build, test, and debug Hermes Agent RL environments for Atropos training. Covers the HermesAgentBaseEnv interface, reward functions, agent loop integration, evaluation with tools, wandb logging, and the three CLI modes (serve/process/eva... |
|
||||
| [**huggingface-tokenizers**](/docs/user-guide/skills/optional/mlops/mlops-huggingface-tokenizers) | Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integ... |
|
||||
| [**instructor**](/docs/user-guide/skills/optional/mlops/mlops-instructor) | Extract structured data from LLM responses with Pydantic validation, retry failed extractions automatically, parse complex JSON with type safety, and stream partial results with Instructor - battle-tested structured output library |
|
||||
| [**lambda-labs-gpu-cloud**](/docs/user-guide/skills/optional/mlops/mlops-lambda-labs) | Reserved and on-demand GPU cloud instances for ML training and inference. Use when you need dedicated GPU instances with simple SSH access, persistent filesystems, or high-performance multi-node clusters for large-scale training. |
|
||||
|
|
|
|||
|
|
@ -148,21 +148,6 @@ Registered only when the agent is spawned by the kanban dispatcher (`HERMES_KANB
|
|||
|------|-------------|----------------------|
|
||||
| `mixture_of_agents` | Route a hard problem through multiple frontier LLMs collaboratively. Makes 5 API calls (4 reference models + 1 aggregator) with maximum reasoning effort — use sparingly for genuinely difficult problems. Best for: complex math, advanced alg… | OPENROUTER_API_KEY |
|
||||
|
||||
## `rl` toolset
|
||||
|
||||
| Tool | Description | Requires environment |
|
||||
|------|-------------|----------------------|
|
||||
| `rl_check_status` | Get status and metrics for a training run. RATE LIMITED: enforces 30-minute minimum between checks for the same run. Returns WandB metrics: step, state, reward_mean, loss, percent_correct. | TINKER_API_KEY, WANDB_API_KEY |
|
||||
| `rl_edit_config` | Update a configuration field. Use rl_get_current_config() first to see all available fields for the selected environment. Each environment has different configurable options. Infrastructure settings (tokenizer, URLs, lora_rank, learning_ra… | TINKER_API_KEY, WANDB_API_KEY |
|
||||
| `rl_get_current_config` | Get the current environment configuration. Returns only fields that can be modified: group_size, max_token_length, total_steps, steps_per_eval, use_wandb, wandb_name, max_num_workers. | TINKER_API_KEY, WANDB_API_KEY |
|
||||
| `rl_get_results` | Get final results and metrics for a completed training run. Returns final metrics and path to trained weights. | TINKER_API_KEY, WANDB_API_KEY |
|
||||
| `rl_list_environments` | List all available RL environments. Returns environment names, paths, and descriptions. TIP: Read the file_path with file tools to understand how each environment works (verifiers, data loading, rewards). | TINKER_API_KEY, WANDB_API_KEY |
|
||||
| `rl_list_runs` | List all training runs (active and completed) with their status. | TINKER_API_KEY, WANDB_API_KEY |
|
||||
| `rl_select_environment` | Select an RL environment for training. Loads the environment's default configuration. After selecting, use rl_get_current_config() to see settings and rl_edit_config() to modify them. | TINKER_API_KEY, WANDB_API_KEY |
|
||||
| `rl_start_training` | Start a new RL training run with the current environment and config. Most training parameters (lora_rank, learning_rate, etc.) are fixed. Use rl_edit_config() to set group_size, batch_size, wandb_project before starting. WARNING: Training… | TINKER_API_KEY, WANDB_API_KEY |
|
||||
| `rl_stop_training` | Stop a running training job. Use if metrics look bad, training is stagnant, or you want to try different settings. | TINKER_API_KEY, WANDB_API_KEY |
|
||||
| `rl_test_inference` | Quick inference test for any environment. Runs a few steps of inference + scoring using OpenRouter. Default: 3 steps x 16 completions = 48 rollouts per model, testing 3 models = 144 total. Tests environment loading, prompt construction, in… | TINKER_API_KEY, WANDB_API_KEY |
|
||||
|
||||
## `session_search` toolset
|
||||
|
||||
| Tool | Description | Requires environment |
|
||||
|
|
|
|||
|
|
@ -45,7 +45,7 @@ Or in-session:
|
|||
```
|
||||
/tools list
|
||||
/tools disable browser
|
||||
/tools enable rl
|
||||
/tools enable homeassistant
|
||||
```
|
||||
|
||||
## Core Toolsets
|
||||
|
|
@ -71,7 +71,6 @@ Or in-session:
|
|||
| `memory` | `memory` | Persistent cross-session memory management. |
|
||||
| `messaging` | `send_message` | Send messages to other platforms (Telegram, Discord, etc.) from within a session. |
|
||||
| `moa` | `mixture_of_agents` | Multi-model consensus via Mixture of Agents. |
|
||||
| `rl` | `rl_check_status`, `rl_edit_config`, `rl_get_current_config`, `rl_get_results`, `rl_list_environments`, `rl_list_runs`, `rl_select_environment`, `rl_start_training`, `rl_stop_training`, `rl_test_inference` | RL training environment management (Atropos). |
|
||||
| `safe` | `image_generate`, `vision_analyze`, `web_extract`, `web_search` (via `includes`) | Read-only research + media generation. No file writes, no terminal, no code execution. |
|
||||
| `search` | `web_search` | Web search only (without extract). |
|
||||
| `session_search` | `session_search` | Search past conversation sessions. |
|
||||
|
|
|
|||
|
|
@ -1,234 +0,0 @@
|
|||
---
|
||||
sidebar_position: 13
|
||||
title: "RL Training"
|
||||
description: "Reinforcement learning on agent behaviors with Tinker-Atropos — environment discovery, training, and evaluation"
|
||||
---
|
||||
|
||||
# RL Training
|
||||
|
||||
Hermes Agent includes an integrated RL (Reinforcement Learning) training pipeline built on **Tinker-Atropos**. This enables training language models on environment-specific tasks using GRPO (Group Relative Policy Optimization) with LoRA adapters, orchestrated entirely through the agent's tool interface.
|
||||
|
||||
## Overview
|
||||
|
||||
The RL training system consists of three components:
|
||||
|
||||
1. **[Atropos](https://github.com/NousResearch/atropos)** — A trajectory API server that coordinates environment interactions, manages rollout groups, and computes advantages
|
||||
2. **[Tinker](https://thinkingmachines.ai/tinker/)** — A training service that handles model weights, LoRA training, sampling/inference, and optimizer steps
|
||||
3. **Environments** — Python classes that define tasks, scoring, and reward functions (e.g., GSM8K math problems)
|
||||
|
||||
The agent can discover environments, configure training parameters, launch training runs, and monitor metrics — all through a set of `rl_*` tools.
|
||||
|
||||
## Requirements
|
||||
|
||||
RL training requires:
|
||||
|
||||
- **Python >= 3.11** (Tinker package requirement)
|
||||
- **TINKER_API_KEY** — API key for the Tinker training service
|
||||
- **WANDB_API_KEY** — API key for [Weights & Biases](https://wandb.ai/) metrics tracking
|
||||
- The `tinker-atropos` submodule (at `tinker-atropos/` relative to the Hermes root)
|
||||
|
||||
```bash
|
||||
# Set up API keys
|
||||
hermes config set TINKER_API_KEY your-tinker-key
|
||||
hermes config set WANDB_API_KEY your-wandb-key
|
||||
```
|
||||
|
||||
When both keys are present and Python >= 3.11 is available, the `rl` toolset is automatically enabled.
|
||||
|
||||
## Available Tools
|
||||
|
||||
| Tool | Description |
|
||||
|------|-------------|
|
||||
| `rl_list_environments` | Discover available RL environments |
|
||||
| `rl_select_environment` | Select an environment and load its config |
|
||||
| `rl_get_current_config` | View configurable and locked fields |
|
||||
| `rl_edit_config` | Modify configurable training parameters |
|
||||
| `rl_start_training` | Launch a training run (spawns 3 processes) |
|
||||
| `rl_check_status` | Monitor training progress and WandB metrics |
|
||||
| `rl_stop_training` | Stop a running training job |
|
||||
| `rl_get_results` | Get final metrics and model weights path |
|
||||
| `rl_list_runs` | List all active and completed runs |
|
||||
| `rl_test_inference` | Quick inference test using OpenRouter |
|
||||
|
||||
## Workflow
|
||||
|
||||
### 1. Discover Environments
|
||||
|
||||
```
|
||||
List the available RL environments
|
||||
```
|
||||
|
||||
The agent calls `rl_list_environments()` which scans `tinker-atropos/tinker_atropos/environments/` using AST parsing to find Python classes inheriting from `BaseEnv`. Each environment defines:
|
||||
|
||||
- **Dataset loading** — where training data comes from (e.g., HuggingFace datasets)
|
||||
- **Prompt construction** — how to format items for the model
|
||||
- **Scoring/verification** — how to evaluate model outputs and assign rewards
|
||||
|
||||
### 2. Select and Configure
|
||||
|
||||
```
|
||||
Select the GSM8K environment and show me the configuration
|
||||
```
|
||||
|
||||
The agent calls `rl_select_environment("gsm8k_tinker")`, then `rl_get_current_config()` to see all parameters.
|
||||
|
||||
Configuration fields are divided into two categories:
|
||||
|
||||
**Configurable fields** (can be modified):
|
||||
- `group_size` — Number of completions per item (default: 16)
|
||||
- `batch_size` — Training batch size (default: 128)
|
||||
- `wandb_name` — WandB run name (auto-set to `{env}-{timestamp}`)
|
||||
- Other environment-specific parameters
|
||||
|
||||
**Locked fields** (infrastructure settings, cannot be changed):
|
||||
- `tokenizer_name` — Model tokenizer (e.g., `Qwen/Qwen3-8B`)
|
||||
- `rollout_server_url` — Atropos API URL (`http://localhost:8000`)
|
||||
- `max_token_length` — Maximum token length (8192)
|
||||
- `max_num_workers` — Maximum parallel workers (2048)
|
||||
- `total_steps` — Total training steps (2500)
|
||||
- `lora_rank` — LoRA adapter rank (32)
|
||||
- `learning_rate` — Learning rate (4e-5)
|
||||
- `max_token_trainer_length` — Max tokens for trainer (9000)
|
||||
|
||||
### 3. Start Training
|
||||
|
||||
```
|
||||
Start the training run
|
||||
```
|
||||
|
||||
The agent calls `rl_start_training()` which:
|
||||
|
||||
1. Generates a YAML config file merging locked settings with configurable overrides
|
||||
2. Creates a unique run ID
|
||||
3. Spawns three processes:
|
||||
- **Atropos API server** (`run-api`) — trajectory coordination
|
||||
- **Tinker trainer** (`launch_training.py`) — LoRA training + FastAPI inference server on port 8001
|
||||
- **Environment** (`environment.py serve`) — the selected environment connecting to Atropos
|
||||
|
||||
The processes start with staggered delays (5s for API, 30s for trainer, 90s more for environment) to ensure proper initialization order.
|
||||
|
||||
### 4. Monitor Progress
|
||||
|
||||
```
|
||||
Check the status of training run abc12345
|
||||
```
|
||||
|
||||
The agent calls `rl_check_status(run_id)` which reports:
|
||||
|
||||
- Process status (running/exited for each of the 3 processes)
|
||||
- Running time
|
||||
- WandB metrics (step, reward mean, percent correct, eval accuracy)
|
||||
- Log file locations for debugging
|
||||
|
||||
:::note Rate Limiting
|
||||
Status checks are rate-limited to once every **30 minutes** per run ID. This prevents excessive polling during long-running training jobs that take hours.
|
||||
:::
|
||||
|
||||
### 5. Stop or Get Results
|
||||
|
||||
```
|
||||
Stop the training run
|
||||
# or
|
||||
Get the final results for run abc12345
|
||||
```
|
||||
|
||||
`rl_stop_training()` terminates all three processes in reverse order (environment → trainer → API). `rl_get_results()` retrieves final WandB metrics and training history.
|
||||
|
||||
## Inference Testing
|
||||
|
||||
Before committing to a full training run, you can test if an environment works correctly using `rl_test_inference`. This runs a few steps of inference and scoring using OpenRouter — no Tinker API needed, just an `OPENROUTER_API_KEY`.
|
||||
|
||||
```
|
||||
Test the selected environment with inference
|
||||
```
|
||||
|
||||
Default configuration:
|
||||
- **3 steps × 16 completions = 48 rollouts per model**
|
||||
- Tests 3 models at different scales for robustness:
|
||||
- `qwen/qwen3-8b` (small)
|
||||
- `z-ai/glm-4.7-flash` (medium)
|
||||
- `minimax/minimax-m2.7` (large)
|
||||
- Total: ~144 rollouts
|
||||
|
||||
This validates:
|
||||
- Environment loads correctly
|
||||
- Prompt construction works
|
||||
- Inference response parsing is robust across model scales
|
||||
- Verifier/scoring logic produces valid rewards
|
||||
|
||||
## Tinker API Integration
|
||||
|
||||
The trainer uses the [Tinker](https://tinker.computer) API for model training operations:
|
||||
|
||||
- **ServiceClient** — Creates training and sampling clients
|
||||
- **Training client** — Handles forward-backward passes with importance sampling loss, optimizer steps (Adam), and weight checkpointing
|
||||
- **Sampling client** — Provides inference using the latest trained weights
|
||||
|
||||
The training loop:
|
||||
1. Fetches a batch of rollouts from Atropos (prompt + completions + scores)
|
||||
2. Converts to Tinker Datum objects with padded logprobs and advantages
|
||||
3. Runs forward-backward pass with importance sampling loss
|
||||
4. Takes an optimizer step (Adam: lr=4e-5, β1=0.9, β2=0.95)
|
||||
5. Saves weights and creates a new sampling client for next-step inference
|
||||
6. Logs metrics to WandB
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
api["Atropos API<br/>run-api<br/>port 8000"]
|
||||
env["Environment<br/>BaseEnv implementation"]
|
||||
infer["OpenAI / sglang<br/>inference API<br/>port 8001"]
|
||||
trainer["Tinker Trainer<br/>LoRA training + FastAPI"]
|
||||
|
||||
env <--> api
|
||||
env --> infer
|
||||
api -->|"batches: tokens, scores, logprobs"| trainer
|
||||
trainer -->|"serves inference"| infer
|
||||
```
|
||||
|
||||
## Creating Custom Environments
|
||||
|
||||
To create a new RL environment:
|
||||
|
||||
1. Create a Python file in `tinker-atropos/tinker_atropos/environments/`
|
||||
2. Define a class that inherits from `BaseEnv`
|
||||
3. Implement the required methods:
|
||||
- `load_dataset()` — Load your training data
|
||||
- `get_next_item()` — Provide the next item to the model
|
||||
- `score_answer()` — Score model outputs and assign rewards
|
||||
- `collect_trajectories()` — Collect and return trajectories
|
||||
4. Optionally define a custom config class inheriting from `BaseEnvConfig`
|
||||
|
||||
Study the existing `gsm8k_tinker.py` as a template. The agent can help you create new environments — it can read existing environment files, inspect HuggingFace datasets, and write new environment code.
|
||||
|
||||
## WandB Metrics
|
||||
|
||||
Training runs log to Weights & Biases with these key metrics:
|
||||
|
||||
| Metric | Description |
|
||||
|--------|-------------|
|
||||
| `train/loss` | Training loss (importance sampling) |
|
||||
| `train/learning_rate` | Current learning rate |
|
||||
| `reward/mean` | Mean reward across groups |
|
||||
| `logprobs/mean` | Mean reference logprobs |
|
||||
| `logprobs/mean_training` | Mean training logprobs |
|
||||
| `logprobs/diff` | Logprob drift (reference - training) |
|
||||
| `advantages/mean` | Mean advantage values |
|
||||
| `advantages/std` | Advantage standard deviation |
|
||||
|
||||
## Log Files
|
||||
|
||||
Each training run generates log files in `~/.hermes/logs/rl_training/`:
|
||||
|
||||
```
|
||||
logs/
|
||||
├── api_{run_id}.log # Atropos API server logs
|
||||
├── trainer_{run_id}.log # Tinker trainer logs
|
||||
├── env_{run_id}.log # Environment process logs
|
||||
└── inference_tests/ # Inference test results
|
||||
├── test_{env}_{model}.jsonl
|
||||
└── test_{env}_{model}.log
|
||||
```
|
||||
|
||||
These are invaluable for debugging when training fails or produces unexpected results.
|
||||
|
|
@ -1,323 +0,0 @@
|
|||
---
|
||||
title: "Hermes Atropos Environments — Build, test, and debug Hermes Agent RL environments for Atropos training"
|
||||
sidebar_label: "Hermes Atropos Environments"
|
||||
description: "Build, test, and debug Hermes Agent RL environments for Atropos training"
|
||||
---
|
||||
|
||||
{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}
|
||||
|
||||
# Hermes Atropos Environments
|
||||
|
||||
Build, test, and debug Hermes Agent RL environments for Atropos training. Covers the HermesAgentBaseEnv interface, reward functions, agent loop integration, evaluation with tools, wandb logging, and the three CLI modes (serve/process/evaluate). Use when creating, reviewing, or fixing RL environments in the hermes-agent repo.
|
||||
|
||||
## Skill metadata
|
||||
|
||||
| | |
|
||||
|---|---|
|
||||
| Source | Optional — install with `hermes skills install official/mlops/hermes-atropos-environments` |
|
||||
| Path | `optional-skills/mlops/hermes-atropos-environments` |
|
||||
| Version | `1.1.0` |
|
||||
| Author | Hermes Agent |
|
||||
| License | MIT |
|
||||
| Platforms | linux, macos, windows |
|
||||
| Tags | `atropos`, `rl`, `environments`, `training`, `reinforcement-learning`, `reward-functions` |
|
||||
| Related skills | [`axolotl`](/docs/user-guide/skills/optional/mlops/mlops-training-axolotl), [`fine-tuning-with-trl`](/docs/user-guide/skills/optional/mlops/mlops-training-trl-fine-tuning), `lm-evaluation-harness` |
|
||||
|
||||
## Reference: full SKILL.md
|
||||
|
||||
:::info
|
||||
The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
|
||||
:::
|
||||
|
||||
# Hermes Agent Atropos Environments
|
||||
|
||||
Guide for building RL environments in the hermes-agent repo that integrate with the Atropos training framework.
|
||||
|
||||
## Architecture Overview
|
||||
|
||||
<!-- ascii-guard-ignore -->
|
||||
```
|
||||
Atropos BaseEnv (atroposlib/envs/base.py)
|
||||
└── HermesAgentBaseEnv (environments/hermes_base_env.py)
|
||||
├── Handles agent loop orchestration
|
||||
├── Handles tool resolution per group
|
||||
├── Handles ToolContext for reward verification
|
||||
└── YOUR ENVIRONMENT (environments/your_env.py)
|
||||
Only implements: setup, get_next_item, format_prompt,
|
||||
compute_reward, evaluate, wandb_log
|
||||
```
|
||||
<!-- ascii-guard-ignore-end -->
|
||||
|
||||
Hermes environments are special because they run a **multi-turn agent loop with tool calling** — not just single-turn completions. The base env handles the loop; you implement the task and scoring.
|
||||
|
||||
## File Locations
|
||||
|
||||
| File | Purpose |
|
||||
|------|---------|
|
||||
| `environments/hermes_base_env.py` | Base class with agent loop + tool resolution |
|
||||
| `environments/agent_loop.py` | `HermesAgentLoop` + `AgentResult` dataclass |
|
||||
| `environments/tool_context.py` | `ToolContext` for reward verification |
|
||||
| `environments/tool_call_parsers.py` | Phase 2 tool call parsers (hermes, mistral, etc.) |
|
||||
| `environments/your_env.py` | Your environment implementation |
|
||||
|
||||
## Inference Setup — Ask the User First
|
||||
|
||||
**IMPORTANT:** Before running any test, evaluation, or data generation command, always ask the user how they want to handle inference. Do NOT assume OpenRouter or any specific endpoint. Present these options:
|
||||
|
||||
1. **OpenRouter** — Ask which model they want to use (e.g., `anthropic/claude-sonnet-4.5`, `google/gemini-2.5-pro`, `meta-llama/llama-3.3-70b-instruct`, etc.). Requires `OPENROUTER_API_KEY` in environment.
|
||||
2. **Self-hosted VLLM endpoint** — Ask for their base URL (e.g., `http://localhost:8000/v1`) and model name. Set `--openai.server_type vllm`.
|
||||
3. **Other OpenAI-compatible API** — Ask for the base URL, model name, and any required API key. Set `--openai.server_type openai` and `--openai.health_check false`.
|
||||
4. **Local Atropos training server** — For `serve` mode with a live training loop. Default `http://localhost:8000/v1`.
|
||||
|
||||
Once the user tells you their setup, use those values in all CLI commands for that session. Example prompts:
|
||||
|
||||
> "Before I run this, how would you like to handle inference?
|
||||
> 1. OpenRouter (I'll need your preferred model, e.g. claude-sonnet-4.5)
|
||||
> 2. A self-hosted VLLM endpoint (give me the URL and model name)
|
||||
> 3. Another OpenAI-compatible API (give me the URL, model, and any auth details)
|
||||
> 4. Local Atropos training server (serve mode)"
|
||||
|
||||
### Key flags by provider:
|
||||
|
||||
| Provider | `--openai.server_type` | `--openai.health_check` | `--openai.api_key` |
|
||||
|----------|----------------------|------------------------|-------------------|
|
||||
| OpenRouter | `openai` | `false` | `$OPENROUTER_API_KEY` |
|
||||
| VLLM (self-hosted) | `vllm` | (default) | (not needed) |
|
||||
| Other OpenAI-compatible | `openai` | `false` | As needed |
|
||||
| Local Atropos | (default) | (default) | (not needed) |
|
||||
|
||||
## Required Methods
|
||||
|
||||
### 1. `setup()` — Load dataset and initialize state
|
||||
|
||||
```python
|
||||
async def setup(self) -> None:
|
||||
"""Called once at startup. Load datasets, initialize state."""
|
||||
# Try HuggingFace first, fallback to built-in samples
|
||||
try:
|
||||
from datasets import load_dataset
|
||||
ds = load_dataset("your/dataset", split="test")
|
||||
self._items = [...]
|
||||
except Exception:
|
||||
self._items = BUILTIN_SAMPLES
|
||||
|
||||
# Always split into train/eval
|
||||
random.shuffle(self._items)
|
||||
eval_size = max(20, int(len(self._items) * 0.1))
|
||||
self._eval_items = self._items[:eval_size]
|
||||
self._items = self._items[eval_size:]
|
||||
```
|
||||
|
||||
### 2. `get_next_item()` — Return next training item
|
||||
|
||||
```python
|
||||
async def get_next_item(self) -> dict:
|
||||
"""Return next item, cycling through dataset."""
|
||||
item = self._items[self._index % len(self._items)]
|
||||
self._index += 1
|
||||
return item
|
||||
```
|
||||
|
||||
### 3. `format_prompt(item)` — Convert item to user message
|
||||
|
||||
```python
|
||||
def format_prompt(self, item: dict) -> str:
|
||||
"""Convert a dataset item into the user-facing prompt."""
|
||||
return f"Research this question: {item['question']}"
|
||||
```
|
||||
|
||||
### 4. `compute_reward(item, result, ctx)` — Score the rollout
|
||||
|
||||
**CRITICAL**: `result` is an `AgentResult`, NOT a dict. It has these attributes:
|
||||
- `result.messages` — List of message dicts (OpenAI format)
|
||||
- `result.turns_used` — Number of LLM calls made
|
||||
- `result.finished_naturally` — True if model stopped voluntarily
|
||||
- `result.tool_errors` — List of ToolError objects
|
||||
|
||||
**AgentResult does NOT have**: `final_response`, `tool_calls`, `tools_used`.
|
||||
You must extract these from `result.messages`:
|
||||
|
||||
```python
|
||||
async def compute_reward(self, item, result: AgentResult, ctx: ToolContext) -> float:
|
||||
# Extract final response (last assistant message with content)
|
||||
final_response = ""
|
||||
tools_used = []
|
||||
for msg in reversed(result.messages):
|
||||
if msg.get("role") == "assistant" and msg.get("content") and not final_response:
|
||||
final_response = msg["content"]
|
||||
if msg.get("role") == "assistant" and msg.get("tool_calls"):
|
||||
for tc in msg["tool_calls"]:
|
||||
fn = tc.get("function", {}) if isinstance(tc, dict) else {}
|
||||
name = fn.get("name", "")
|
||||
if name:
|
||||
tools_used.append(name)
|
||||
|
||||
# Score using LLM judge, heuristic, or ToolContext verification
|
||||
correctness = await self._llm_judge(item, final_response)
|
||||
return correctness
|
||||
```
|
||||
|
||||
`ctx` (ToolContext) gives you terminal/file access to the agent's sandbox for verification:
|
||||
```python
|
||||
# Run tests in the agent's sandbox
|
||||
result = ctx.terminal("pytest /workspace/test.py")
|
||||
return 1.0 if result["exit_code"] == 0 else 0.0
|
||||
```
|
||||
|
||||
### 5. `evaluate()` — Periodic evaluation with full agent loop
|
||||
|
||||
**MUST use the full agent loop with tools**, not single-turn chat_completion.
|
||||
The whole point of hermes-agent environments is agentic evaluation:
|
||||
|
||||
```python
|
||||
async def evaluate(self, *args, **kwargs) -> None:
|
||||
import time, uuid
|
||||
from environments.agent_loop import HermesAgentLoop
|
||||
from environments.tool_context import ToolContext
|
||||
|
||||
start_time = time.time()
|
||||
tools, valid_names = self._resolve_tools_for_group()
|
||||
samples = []
|
||||
|
||||
for item in self._eval_items[:self.config.eval_size]:
|
||||
task_id = str(uuid.uuid4())
|
||||
messages = []
|
||||
if self.config.system_prompt:
|
||||
messages.append({"role": "system", "content": self.config.system_prompt})
|
||||
messages.append({"role": "user", "content": self.format_prompt(item)})
|
||||
|
||||
agent = HermesAgentLoop(
|
||||
server=self.server,
|
||||
tool_schemas=tools,
|
||||
valid_tool_names=valid_names,
|
||||
max_turns=self.config.max_agent_turns,
|
||||
task_id=task_id,
|
||||
temperature=0.0, # Deterministic for eval
|
||||
max_tokens=self.config.max_token_length,
|
||||
extra_body=self.config.extra_body,
|
||||
)
|
||||
result = await agent.run(messages)
|
||||
|
||||
ctx = ToolContext(task_id)
|
||||
try:
|
||||
reward = await self.compute_reward(item, result, ctx)
|
||||
finally:
|
||||
ctx.cleanup()
|
||||
|
||||
samples.append({"prompt": ..., "response": ..., "reward": reward})
|
||||
|
||||
eval_metrics = {"eval/mean_reward": ...}
|
||||
await self.evaluate_log(metrics=eval_metrics, samples=samples,
|
||||
start_time=start_time, end_time=time.time())
|
||||
```
|
||||
|
||||
### 6. `wandb_log()` — Custom metrics logging
|
||||
|
||||
Always call `super().wandb_log()` at the end:
|
||||
|
||||
```python
|
||||
async def wandb_log(self, wandb_metrics=None):
|
||||
if wandb_metrics is None:
|
||||
wandb_metrics = {}
|
||||
if self._reward_buffer:
|
||||
n = len(self._reward_buffer)
|
||||
wandb_metrics["train/mean_reward"] = sum(self._reward_buffer) / n
|
||||
self._reward_buffer.clear()
|
||||
await super().wandb_log(wandb_metrics) # MUST call super
|
||||
```
|
||||
|
||||
**Pitfall**: `compute_reward` appends to metric buffers. During eval, this pollutes training metrics. Roll back buffer entries added during eval.
|
||||
|
||||
## Config Class
|
||||
|
||||
Always create a custom config subclass with Pydantic Field descriptors. Key inherited fields you can tune: `enabled_toolsets`, `max_agent_turns`, `agent_temperature`, `system_prompt`, `terminal_backend`, `group_size`, `steps_per_eval`, `total_steps`.
|
||||
|
||||
## config_init() — Default Configuration
|
||||
|
||||
Classmethod returning `(YourEnvConfig, [APIServerConfig(...)])`. Set server_type to "openai" for OpenRouter/external APIs. Load API key from environment variable.
|
||||
|
||||
## Three CLI Modes
|
||||
|
||||
```bash
|
||||
# SERVE — Full training loop (connects to Atropos API server)
|
||||
python environments/my_env.py serve --openai.base_url http://localhost:8000/v1
|
||||
|
||||
# PROCESS — Offline data generation (saves JSONL)
|
||||
python environments/my_env.py process --env.total_steps 10 --env.group_size 1 \
|
||||
--env.use_wandb false --env.data_path_to_save_groups output.jsonl \
|
||||
--openai.base_url "<USER_BASE_URL>" \
|
||||
--openai.model_name "<USER_MODEL>" \
|
||||
--openai.server_type <USER_SERVER_TYPE> --openai.health_check false
|
||||
|
||||
# EVALUATE — Standalone eval (runs setup + evaluate only)
|
||||
python environments/my_env.py evaluate --env.eval_size 20 \
|
||||
--env.data_dir_to_save_evals /tmp/eval_results \
|
||||
--openai.base_url "<USER_BASE_URL>" \
|
||||
--openai.model_name "<USER_MODEL>" \
|
||||
--openai.server_type <USER_SERVER_TYPE> --openai.health_check false
|
||||
```
|
||||
|
||||
Config priority: CLI args > YAML file > config_init() defaults.
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
1. **AgentResult has .messages, not .final_response** — Extract the final response by iterating reversed(result.messages) looking for the last assistant message with content.
|
||||
|
||||
2. **evaluate() must use HermesAgentLoop, not chat_completion** — Single-turn chat_completion has no tools. The whole point of hermes-agent benchmarks is agentic evaluation with tool use.
|
||||
|
||||
3. **Don't call _llm_judge twice** — If compute_reward already calls it, extract the score from the buffer instead of calling judge separately in evaluate().
|
||||
|
||||
4. **Eval pollutes training buffers** — compute_reward appends to metric buffers. During eval, roll back buffer entries to keep training metrics clean.
|
||||
|
||||
5. **Always set health_check=false for OpenRouter** — OpenRouter has no /health endpoint.
|
||||
|
||||
6. **Set data_dir_to_save_evals in evaluate mode** — Without it, results aren't saved.
|
||||
|
||||
7. **default_toolsets class variable vs enabled_toolsets config** — The class variable is a hint; the config field is what actually controls tool resolution.
|
||||
|
||||
8. **Tool call parsing in messages** — Tool calls are dicts with `{"function": {"name": ..., "arguments": ...}}`. Always check `isinstance(tc, dict)`.
|
||||
|
||||
9. **ToolContext.cleanup()** — Always call in a finally block to release sandbox resources.
|
||||
|
||||
10. **server_type must be "openai" for external APIs** — Without it, Atropos assumes a local VLLM server.
|
||||
|
||||
11. **Always ask the user for their inference setup** — Never hardcode or assume a specific provider/model. See the "Inference Setup" section above.
|
||||
|
||||
## Reward Function Patterns
|
||||
|
||||
### LLM Judge (for open-ended tasks)
|
||||
Use `self.server.chat_completion()` with a scoring prompt. Parse JSON response for score float. Always include a heuristic fallback (keyword overlap) for when the judge call fails.
|
||||
|
||||
### Binary Verification (for code/terminal tasks)
|
||||
Use `ctx.terminal("pytest test.py -q")` to run tests in the agent's sandbox. Return 1.0 for pass, 0.0 for fail.
|
||||
|
||||
### Multi-Signal (combine multiple indicators)
|
||||
Weight correctness (0.6) + tool usage (0.2) + efficiency (0.2) + optional bonuses. Clamp to [0, 1].
|
||||
|
||||
## Testing Your Environment
|
||||
|
||||
1. **Import test**: `python -c "from environments.my_env import MyEnv; print('OK')"`
|
||||
2. **Ask the user for inference setup** (see "Inference Setup" section above)
|
||||
3. **Process mode** (1 item): Verify JSONL output has valid tokens, masks, scores
|
||||
4. **Evaluate mode**: Verify full agent loop runs with tools, metrics logged correctly
|
||||
5. **Check reward range**: Scores should be in [0, 1], not all identical
|
||||
|
||||
## Minimum Implementation Checklist
|
||||
|
||||
```python
|
||||
class MyEnv(HermesAgentBaseEnv):
|
||||
name = "my-env"
|
||||
env_config_cls = MyEnvConfig
|
||||
|
||||
@classmethod
|
||||
def config_init(cls): ... # Default server + env config
|
||||
async def setup(self): ... # Load dataset + train/eval split
|
||||
async def get_next_item(self): ... # Cycle through training items
|
||||
def format_prompt(self, item): ... # Item → user message string
|
||||
async def compute_reward(self, item, result, ctx): ... # Score rollout
|
||||
async def evaluate(self, *args, **kwargs): ... # Full agent loop eval
|
||||
async def wandb_log(self, metrics=None): ... # Custom metrics + super()
|
||||
|
||||
if __name__ == "__main__":
|
||||
MyEnv.cli()
|
||||
```
|
||||
Loading…
Add table
Add a link
Reference in a new issue