mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-18 04:41:56 +00:00
Cross-checked 75 docs pages under user-guide/messaging/, developer-guide/,
guides/, and integrations/ against the live registries and gateway code.
messaging/
- index.md: API Server toolset is hermes-api-server (was 'hermes (default)');
Google Chat slug is hermes-google_chat (underscore — plugin name uses _).
- google_chat.md: drop bogus 'pip install hermes-agent[google_chat]' (no such
extra); list the actual deps (google-cloud-pubsub, google-api-python-client,
google-auth, google-auth-oauthlib).
- qqbot.md: config namespace is platforms.qqbot (was platforms.qq, which is
silently ignored by the adapter); QQ_STT_BASE_URL is not read directly —
baseUrl lives under platforms.qqbot.extra.stt.
- teams-meetings.md: 'hermes teams-pipeline' is plugin-gated (teams_pipeline
plugin must be enabled), not a built-in subcommand.
- sms.md: example log line 0.0.0.0:8080 -> 127.0.0.1:8080 (default
SMS_WEBHOOK_HOST).
- open-webui.md: API_SERVER_* are env vars, not YAML keys — write them to
per-profile .env, not 'hermes config set' (same pattern fixed in
api-server.md last round). Also bumped example ports to 8650+ to dodge the
default webhook (8644)/wecom-callback (8645)/msgraph-webhook (8646)
collision.
developer-guide/
- architecture.md: tool/toolset counts (61/52 -> 70+/~28); LOC stamps for
run_agent.py, cli.py, hermes_cli/main.py, setup.py, mcp_tool.py,
gateway/run.py replaced with 'large file' to stop drifting.
- agent-loop.md: same LOC drift (~13,700 -> 'a large file (15k+ lines)').
- gateway-internals.md: '14+ external messaging platforms' -> '20+'; gateway
platform tree updated (qqbot is a sub-package, not qqbot.py; added
yuanbao.py, feishu_comment.py, msgraph_webhook.py); 'gateway/builtin_hooks/
(always active)' was wrong — it's an empty extension point and
_register_builtin_hooks() is a no-op stub.
- acp-internals.md: drop fictional 'message_callback' from the bridged-
callbacks list; clarify thinking_callback is currently set to None.
- provider-runtime.md: provider list was missing AWS Bedrock, Azure Foundry,
NVIDIA NIM, xAI, Arcee, GMI Cloud, StepFun, Qwen OAuth, Xiaomi, Ollama
Cloud, LM Studio, Tencent TokenHub. Fallback section described only the
legacy single-pair model — corrected to the canonical list-form
fallback_providers chain.
- environments.md: parsers list missing llama4_json and the deepseek_v31
alias; both register via @register_parser.
- browser-supervisor.md: drop reference to scripts/browser_supervisor_e2e.py
which doesn't exist in-repo.
- contributing.md: tinker-atropos is a git submodule — note that
'git submodule update --init' is required if cloning without
--recurse-submodules.
guides/
- operate-teams-meeting-pipeline.md: cron flags were all wrong — schedule is
positional (not --schedule), the script-only flag is --no-agent (not
--script-only), and there's no --command flag. Replaced with a real example
that creates the script under ~/.hermes/scripts/ and uses the actual flags.
Also replaced fictional 'hermes cron show <name>' with 'hermes cron status'.
- automation-templates.md: 'cron create --skills "a,b"' doesn't work —
the flag is --skill (singular, repeatable). Fixed all 5 occurrences via AST
rewrite.
- minimax-oauth.md: 'hermes auth add minimax-oauth --region cn' silently
fails because --region isn't registered on the auth-add argparse spec.
Pointed users at the minimax-cn provider (or MINIMAX_CN_API_KEY env) for
China-region access.
- cron-script-only.md: 'hermes send' is fictional — replaced the comparison-
table mention with a webhook-subscription pointer; also fixed the dead link
to /guides/pipe-script-output (page doesn't exist).
- cron-troubleshooting.md: 'hermes serve' isn't a real subcommand. Pointed
at 'hermes gateway' (foreground) / 'hermes gateway start' (service).
- local-ollama-setup.md: 'agent.api_timeout' is not a config key. The right
knob is the HERMES_API_TIMEOUT env var.
- python-library.md: run_conversation() return dict has only final_response
and messages — task_id is stored on the agent instance, not echoed back.
- use-mcp-with-hermes.md: '--args /c "npx -y …"' wraps the npx command in
one quoted string, so cmd.exe gets a single arg instead of the multi-token
command line it needs. Removed the surrounding quotes — argparse nargs='*'
collects each token correctly.
integrations/
- providers.md: Bedrock guardrail YAML keys were 'id'/'version' (don't exist);
actual keys are guardrail_identifier/guardrail_version (matches DEFAULT_CONFIG
and the run_agent.py reader). GMI default base URL (api.gmi.ai/v1 ->
api.gmi-serving.com/v1) and portal URL (inference.gmi.ai -> www.gmicloud.ai)
refreshed. Fallback section rewritten to lead with the canonical
fallback_providers list form (was leading with the legacy fallback_model
single dict); supported-providers list extended to include azure-foundry,
alibaba-coding-plan, lmstudio.
index.md
- '68 built-in tools' -> '70+'; '15+ platforms' was both inconsistent with
integrations/index.md ('19+') and undercounted — bumped to 20+ and added
Weixin/QQ Bot/Yuanbao/Google Chat to the list.
Validation: 'npm run build' clean (exit 0); broken-link count unchanged at
155 (same as round-1 post-skill-regen baseline). 24 files, +132/-89.
520 lines
20 KiB
Markdown
520 lines
20 KiB
Markdown
---
|
||
sidebar_position: 5
|
||
title: "Environments, Benchmarks & Data Generation"
|
||
description: "Building RL training environments, running evaluation benchmarks, and generating SFT data with the Hermes-Agent Atropos integration"
|
||
---
|
||
|
||
# Environments, Benchmarks & Data Generation
|
||
|
||
Hermes Agent includes a full environment framework that connects its tool-calling capabilities to the [Atropos](https://github.com/NousResearch/atropos) RL training framework. This enables three workflows:
|
||
|
||
1. **RL Training** — Train language models on multi-turn agentic tasks with GRPO
|
||
2. **Benchmarks** — Evaluate models on standardised agentic benchmarks
|
||
3. **Data Generation** — Generate SFT training data from agent rollouts
|
||
|
||
All three share the same core: an **environment** class that defines tasks, runs an agent loop, and scores the output.
|
||
|
||
:::info Repo environments vs RL training tools
|
||
The Python environment framework documented here lives under the repo's `environments/` directory and is the implementation-level API for Hermes/Atropos integration. This is separate from the user-facing `rl_*` tools, which operate as an orchestration surface for remote RL training workflows.
|
||
:::
|
||
|
||
:::tip Quick Links
|
||
- **Want to run benchmarks?** Jump to [Available Benchmarks](#available-benchmarks)
|
||
- **Want to train with RL?** See [RL Training Tools](/user-guide/features/rl-training) for the agent-driven interface, or [Running Environments](#running-environments) for manual execution
|
||
- **Want to create a new environment?** See [Creating Environments](#creating-environments)
|
||
:::
|
||
|
||
## Architecture
|
||
|
||
The environment system is built on a three-layer inheritance chain:
|
||
|
||
```mermaid
|
||
classDiagram
|
||
class BaseEnv {
|
||
Server management
|
||
Worker scheduling
|
||
Wandb logging
|
||
CLI: serve / process / evaluate
|
||
}
|
||
|
||
class HermesAgentBaseEnv {
|
||
Terminal backend configuration
|
||
Tool resolution
|
||
Agent loop engine
|
||
ToolContext access
|
||
}
|
||
|
||
class TerminalTestEnv {
|
||
Stack testing
|
||
}
|
||
|
||
class HermesSweEnv {
|
||
SWE training
|
||
}
|
||
|
||
class TerminalBench2EvalEnv {
|
||
Benchmark evaluation
|
||
}
|
||
|
||
class TBLiteEvalEnv {
|
||
Fast benchmark
|
||
}
|
||
|
||
class YCBenchEvalEnv {
|
||
Long-horizon benchmark
|
||
}
|
||
|
||
BaseEnv <|-- HermesAgentBaseEnv
|
||
HermesAgentBaseEnv <|-- TerminalTestEnv
|
||
HermesAgentBaseEnv <|-- HermesSweEnv
|
||
HermesAgentBaseEnv <|-- TerminalBench2EvalEnv
|
||
TerminalBench2EvalEnv <|-- TBLiteEvalEnv
|
||
TerminalBench2EvalEnv <|-- YCBenchEvalEnv
|
||
```
|
||
|
||
### BaseEnv (Atropos)
|
||
|
||
The foundation from `atroposlib`. Provides:
|
||
- **Server management** — connects to OpenAI-compatible APIs (VLLM, SGLang, OpenRouter)
|
||
- **Worker scheduling** — parallel rollout coordination
|
||
- **Wandb integration** — metrics logging and rollout visualisation
|
||
- **CLI interface** — three subcommands: `serve`, `process`, `evaluate`
|
||
- **Eval logging** — `evaluate_log()` saves results to JSON + JSONL
|
||
|
||
### HermesAgentBaseEnv
|
||
|
||
The hermes-agent layer (`environments/hermes_base_env.py`). Adds:
|
||
- **Terminal backend configuration** — sets `TERMINAL_ENV` for sandboxed execution (local, Docker, Modal, Daytona, SSH, Singularity)
|
||
- **Tool resolution** — `_resolve_tools_for_group()` calls hermes-agent's `get_tool_definitions()` to get the right tool schemas based on enabled/disabled toolsets
|
||
- **Agent loop integration** — `collect_trajectory()` runs `HermesAgentLoop` and scores the result
|
||
- **Two-phase operation** — Phase 1 (OpenAI server) for eval/SFT, Phase 2 (VLLM ManagedServer) for full RL with logprobs
|
||
- **Async safety patches** — monkey-patches Modal backend to work inside Atropos's event loop
|
||
|
||
### Concrete Environments
|
||
|
||
Your environment inherits from `HermesAgentBaseEnv` and implements five methods:
|
||
|
||
| Method | Purpose |
|
||
|--------|---------|
|
||
| `setup()` | Load dataset, initialise state |
|
||
| `get_next_item()` | Return the next item for rollout |
|
||
| `format_prompt(item)` | Convert an item into the user message |
|
||
| `compute_reward(item, result, ctx)` | Score the rollout (0.0–1.0) |
|
||
| `evaluate()` | Periodic evaluation logic |
|
||
|
||
## Core Components
|
||
|
||
### Agent Loop
|
||
|
||
`HermesAgentLoop` (`environments/agent_loop.py`) is the reusable multi-turn agent engine. It runs the same tool-calling pattern as hermes-agent's main loop:
|
||
|
||
1. Send messages + tool schemas to the API via `server.chat_completion()`
|
||
2. If the response contains `tool_calls`, dispatch each via `handle_function_call()`
|
||
3. Append tool results to the conversation, go back to step 1
|
||
4. If no `tool_calls`, the agent is done
|
||
|
||
Tool calls execute in a thread pool (`ThreadPoolExecutor(128)`) so that async backends (Modal, Docker) don't deadlock inside Atropos's event loop.
|
||
|
||
Returns an `AgentResult`:
|
||
|
||
```python
|
||
@dataclass
|
||
class AgentResult:
|
||
messages: List[Dict[str, Any]] # Full conversation history
|
||
turns_used: int # Number of LLM calls made
|
||
finished_naturally: bool # True if model stopped on its own
|
||
reasoning_per_turn: List[Optional[str]] # Extracted reasoning content
|
||
tool_errors: List[ToolError] # Errors encountered during tool dispatch
|
||
managed_state: Optional[Dict] # VLLM ManagedServer state (Phase 2)
|
||
```
|
||
|
||
### Tool Context
|
||
|
||
`ToolContext` (`environments/tool_context.py`) gives reward functions direct access to the **same sandbox** the model used during its rollout. The `task_id` scoping means all state (files, processes, browser tabs) is preserved.
|
||
|
||
```python
|
||
async def compute_reward(self, item, result, ctx: ToolContext):
|
||
# Run tests in the model's terminal sandbox
|
||
test = ctx.terminal("pytest -v")
|
||
if test["exit_code"] == 0:
|
||
return 1.0
|
||
|
||
# Check if a file was created
|
||
content = ctx.read_file("/workspace/solution.py")
|
||
if content.get("content"):
|
||
return 0.5
|
||
|
||
# Download files for local verification
|
||
ctx.download_file("/remote/output.bin", "/local/output.bin")
|
||
return 0.0
|
||
```
|
||
|
||
Available methods:
|
||
|
||
| Category | Methods |
|
||
|----------|---------|
|
||
| **Terminal** | `terminal(command, timeout)` |
|
||
| **Files** | `read_file(path)`, `write_file(path, content)`, `search(query, path)` |
|
||
| **Transfers** | `upload_file()`, `upload_dir()`, `download_file()`, `download_dir()` |
|
||
| **Web** | `web_search(query)`, `web_extract(urls)` |
|
||
| **Browser** | `browser_navigate(url)`, `browser_snapshot()` |
|
||
| **Generic** | `call_tool(name, args)` — escape hatch for any hermes-agent tool |
|
||
| **Cleanup** | `cleanup()` — release all resources |
|
||
|
||
### Tool Call Parsers
|
||
|
||
For **Phase 2** (VLLM ManagedServer), the server returns raw text without structured tool calls. Client-side parsers in `environments/tool_call_parsers/` extract `tool_calls` from raw output:
|
||
|
||
```python
|
||
from environments.tool_call_parsers import get_parser
|
||
|
||
parser = get_parser("hermes") # or "mistral", "llama3_json", "qwen", "deepseek_v3", etc.
|
||
content, tool_calls = parser.parse(raw_model_output)
|
||
```
|
||
|
||
Available parsers: `hermes`, `mistral`, `llama3_json`, `llama4_json`, `qwen`, `qwen3_coder`, `deepseek_v3`, `deepseek_v3_1` (alias `deepseek_v31`), `kimi_k2`, `longcat`, `glm45`, `glm47`.
|
||
|
||
In Phase 1 (OpenAI server type), parsers are not needed — the server handles tool call parsing natively.
|
||
|
||
## Available Benchmarks
|
||
|
||
### TerminalBench2
|
||
|
||
**89 challenging terminal tasks** with per-task Docker sandbox environments.
|
||
|
||
| | |
|
||
|---|---|
|
||
| **What it tests** | Single-task coding/sysadmin ability |
|
||
| **Scoring** | Binary pass/fail (test suite verification) |
|
||
| **Sandbox** | Modal cloud sandboxes (per-task Docker images) |
|
||
| **Tools** | `terminal` + `file` |
|
||
| **Tasks** | 89 tasks across multiple categories |
|
||
| **Cost** | ~$50–200 for full eval (parallel execution) |
|
||
| **Time** | ~2–4 hours |
|
||
|
||
```bash
|
||
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
|
||
--config environments/benchmarks/terminalbench_2/default.yaml
|
||
|
||
# Run specific tasks
|
||
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
|
||
--config environments/benchmarks/terminalbench_2/default.yaml \
|
||
--env.task_filter fix-git,git-multibranch
|
||
```
|
||
|
||
Dataset: [NousResearch/terminal-bench-2](https://huggingface.co/datasets/NousResearch/terminal-bench-2) on HuggingFace.
|
||
|
||
### TBLite (OpenThoughts Terminal Bench Lite)
|
||
|
||
**100 difficulty-calibrated tasks** — a faster proxy for TerminalBench2.
|
||
|
||
| | |
|
||
|---|---|
|
||
| **What it tests** | Same as TB2 (coding/sysadmin), calibrated difficulty tiers |
|
||
| **Scoring** | Binary pass/fail |
|
||
| **Sandbox** | Modal cloud sandboxes |
|
||
| **Tools** | `terminal` + `file` |
|
||
| **Tasks** | 100 tasks: Easy (40), Medium (26), Hard (26), Extreme (8) |
|
||
| **Correlation** | r=0.911 with full TB2 |
|
||
| **Speed** | 2.6–8× faster than TB2 |
|
||
|
||
```bash
|
||
python environments/benchmarks/tblite/tblite_env.py evaluate \
|
||
--config environments/benchmarks/tblite/default.yaml
|
||
```
|
||
|
||
TBLite is a thin subclass of TerminalBench2 — only the dataset and timeouts differ. Created by the OpenThoughts Agent team (Snorkel AI + Bespoke Labs). Dataset: [NousResearch/openthoughts-tblite](https://huggingface.co/datasets/NousResearch/openthoughts-tblite).
|
||
|
||
### YC-Bench
|
||
|
||
**Long-horizon strategic benchmark** — the agent plays CEO of an AI startup.
|
||
|
||
| | |
|
||
|---|---|
|
||
| **What it tests** | Multi-turn strategic coherence over hundreds of turns |
|
||
| **Scoring** | Composite: `0.5 × survival + 0.5 × normalised_funds` |
|
||
| **Sandbox** | Local terminal (no Modal needed) |
|
||
| **Tools** | `terminal` only |
|
||
| **Runs** | 9 default (3 presets × 3 seeds), sequential |
|
||
| **Cost** | ~$50–200 for full eval |
|
||
| **Time** | ~3–6 hours |
|
||
|
||
```bash
|
||
# Install yc-bench (optional dependency)
|
||
pip install "hermes-agent[yc-bench]"
|
||
|
||
# Run evaluation
|
||
bash environments/benchmarks/yc_bench/run_eval.sh
|
||
|
||
# Or directly
|
||
python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
|
||
--config environments/benchmarks/yc_bench/default.yaml
|
||
|
||
# Quick single-preset test
|
||
python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
|
||
--config environments/benchmarks/yc_bench/default.yaml \
|
||
--env.presets '["fast_test"]' --env.seeds '[1]'
|
||
```
|
||
|
||
YC-Bench uses [collinear-ai/yc-bench](https://github.com/collinear-ai/yc-bench) — a deterministic simulation with 4 skill domains (research, inference, data_environment, training), prestige system, employee management, and financial pressure. Unlike TB2's per-task binary scoring, YC-Bench measures whether an agent can maintain coherent strategy over hundreds of compounding decisions.
|
||
|
||
## Training Environments
|
||
|
||
### TerminalTestEnv
|
||
|
||
A minimal self-contained environment with inline tasks (no external dataset). Used for **validating the full stack** end-to-end. Each task asks the model to create a file at a known path; the verifier checks the content.
|
||
|
||
```bash
|
||
# Process mode (saves rollouts to JSONL, no training server needed)
|
||
python environments/terminal_test_env/terminal_test_env.py process \
|
||
--env.data_path_to_save_groups terminal_test_output.jsonl
|
||
|
||
# Serve mode (connects to Atropos API for RL training)
|
||
python environments/terminal_test_env/terminal_test_env.py serve
|
||
```
|
||
|
||
### HermesSweEnv
|
||
|
||
SWE-bench style training environment. The model gets a coding task, uses terminal + file + web tools to solve it, and the reward function runs tests in the same Modal sandbox.
|
||
|
||
```bash
|
||
python environments/hermes_swe_env/hermes_swe_env.py serve \
|
||
--openai.model_name YourModel \
|
||
--env.dataset_name bigcode/humanevalpack \
|
||
--env.terminal_backend modal
|
||
```
|
||
|
||
## Running Environments
|
||
|
||
Every environment is a standalone Python script with three CLI subcommands:
|
||
|
||
### `evaluate` — Run a benchmark
|
||
|
||
For eval-only environments (benchmarks). Runs all items, computes metrics, logs to wandb.
|
||
|
||
```bash
|
||
python environments/benchmarks/tblite/tblite_env.py evaluate \
|
||
--config environments/benchmarks/tblite/default.yaml \
|
||
--openai.model_name anthropic/claude-sonnet-4.6
|
||
```
|
||
|
||
No training server or `run-api` needed. The environment handles everything.
|
||
|
||
### `process` — Generate SFT data
|
||
|
||
Runs rollouts and saves scored trajectories to JSONL. Useful for generating training data without a full RL loop.
|
||
|
||
```bash
|
||
python environments/terminal_test_env/terminal_test_env.py process \
|
||
--env.data_path_to_save_groups output.jsonl \
|
||
--openai.model_name anthropic/claude-sonnet-4.6
|
||
```
|
||
|
||
Output format: each line is a scored trajectory with the full conversation history, reward, and metadata.
|
||
|
||
### `serve` — Connect to Atropos for RL training
|
||
|
||
Connects the environment to a running Atropos API server (`run-api`). Used during live RL training.
|
||
|
||
```bash
|
||
# Terminal 1: Start the Atropos API
|
||
run-api
|
||
|
||
# Terminal 2: Start the environment
|
||
python environments/hermes_swe_env/hermes_swe_env.py serve \
|
||
--openai.model_name YourModel
|
||
```
|
||
|
||
The environment receives items from Atropos, runs agent rollouts, computes rewards, and sends scored trajectories back for training.
|
||
|
||
## Two-Phase Operation
|
||
|
||
### Phase 1: OpenAI Server (Eval / SFT)
|
||
|
||
Uses `server.chat_completion()` with `tools=` parameter. The server (VLLM, SGLang, OpenRouter, OpenAI) handles tool call parsing natively. Returns `ChatCompletion` objects with structured `tool_calls`.
|
||
|
||
- **Use for**: evaluation, SFT data generation, benchmarks, testing
|
||
- **Placeholder tokens** are created for the Atropos pipeline (since real token IDs aren't available from the OpenAI API)
|
||
|
||
### Phase 2: VLLM ManagedServer (Full RL)
|
||
|
||
Uses ManagedServer for exact token IDs + logprobs via `/generate`. A client-side [tool call parser](#tool-call-parsers) reconstructs structured `tool_calls` from raw output.
|
||
|
||
- **Use for**: full RL training with GRPO/PPO
|
||
- **Real tokens**, masks, and logprobs flow through the pipeline
|
||
- Set `tool_call_parser` in config to match your model's format (e.g., `"hermes"`, `"qwen"`, `"mistral"`)
|
||
|
||
## Creating Environments
|
||
|
||
### Training Environment
|
||
|
||
```python
|
||
from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
|
||
from atroposlib.envs.server_handling.server_manager import APIServerConfig
|
||
|
||
class MyEnvConfig(HermesAgentEnvConfig):
|
||
my_custom_field: str = "default_value"
|
||
|
||
class MyEnv(HermesAgentBaseEnv):
|
||
name = "my-env"
|
||
env_config_cls = MyEnvConfig
|
||
|
||
@classmethod
|
||
def config_init(cls):
|
||
env_config = MyEnvConfig(
|
||
enabled_toolsets=["terminal", "file"],
|
||
terminal_backend="modal",
|
||
max_agent_turns=30,
|
||
)
|
||
server_configs = [APIServerConfig(
|
||
base_url="https://openrouter.ai/api/v1",
|
||
model_name="anthropic/claude-sonnet-4.6",
|
||
server_type="openai",
|
||
)]
|
||
return env_config, server_configs
|
||
|
||
async def setup(self):
|
||
from datasets import load_dataset
|
||
self.dataset = list(load_dataset("my-dataset", split="train"))
|
||
self.iter = 0
|
||
|
||
async def get_next_item(self):
|
||
item = self.dataset[self.iter % len(self.dataset)]
|
||
self.iter += 1
|
||
return item
|
||
|
||
def format_prompt(self, item):
|
||
return item["instruction"]
|
||
|
||
async def compute_reward(self, item, result, ctx):
|
||
# ctx gives full tool access to the rollout's sandbox
|
||
test = ctx.terminal("pytest -v")
|
||
return 1.0 if test["exit_code"] == 0 else 0.0
|
||
|
||
async def evaluate(self, *args, **kwargs):
|
||
# Periodic evaluation during training
|
||
pass
|
||
|
||
if __name__ == "__main__":
|
||
MyEnv.cli()
|
||
```
|
||
|
||
### Eval-Only Benchmark
|
||
|
||
For benchmarks, follow the pattern used by TerminalBench2, TBLite, and YC-Bench:
|
||
|
||
1. **Create under** `environments/benchmarks/your-benchmark/`
|
||
2. **Set eval-only config**: `eval_handling=STOP_TRAIN`, `steps_per_eval=1`, `total_steps=1`
|
||
3. **Stub training methods**: `collect_trajectories()` returns `(None, [])`, `score()` returns `None`
|
||
4. **Implement** `rollout_and_score_eval(eval_item)` — the per-item agent loop + scoring
|
||
5. **Implement** `evaluate()` — orchestrates all runs, computes aggregate metrics
|
||
6. **Add streaming JSONL** for crash-safe result persistence
|
||
7. **Add cleanup**: `KeyboardInterrupt` handling, `cleanup_all_environments()`, `_tool_executor.shutdown()`
|
||
8. **Run with** `evaluate` subcommand
|
||
|
||
See `environments/benchmarks/yc_bench/yc_bench_env.py` for a clean, well-documented reference implementation.
|
||
|
||
## Configuration Reference
|
||
|
||
### HermesAgentEnvConfig Fields
|
||
|
||
| Field | Type | Default | Description |
|
||
|-------|------|---------|-------------|
|
||
| `enabled_toolsets` | `List[str]` | `None` (all) | Which hermes toolsets to enable |
|
||
| `disabled_toolsets` | `List[str]` | `None` | Toolsets to filter out |
|
||
| `distribution` | `str` | `None` | Probabilistic toolset distribution name |
|
||
| `max_agent_turns` | `int` | `30` | Max LLM calls per rollout |
|
||
| `agent_temperature` | `float` | `1.0` | Sampling temperature |
|
||
| `system_prompt` | `str` | `None` | System message for the agent |
|
||
| `terminal_backend` | `str` | `"local"` | `local`, `docker`, `modal`, `daytona`, `ssh`, `singularity` |
|
||
| `terminal_timeout` | `int` | `120` | Seconds per terminal command |
|
||
| `terminal_lifetime` | `int` | `3600` | Max sandbox lifetime |
|
||
| `dataset_name` | `str` | `None` | HuggingFace dataset identifier |
|
||
| `tool_pool_size` | `int` | `128` | Thread pool size for tool execution |
|
||
| `tool_call_parser` | `str` | `"hermes"` | Parser for Phase 2 raw output |
|
||
| `extra_body` | `Dict` | `None` | Extra params for OpenAI API (e.g., OpenRouter provider prefs) |
|
||
| `eval_handling` | `Enum` | `STOP_TRAIN` | `STOP_TRAIN`, `LIMIT_TRAIN`, `NONE` |
|
||
|
||
### YAML Configuration
|
||
|
||
Environments can be configured via YAML files passed with `--config`:
|
||
|
||
```yaml
|
||
env:
|
||
enabled_toolsets: ["terminal", "file"]
|
||
max_agent_turns: 60
|
||
max_token_length: 32000
|
||
agent_temperature: 0.8
|
||
terminal_backend: "modal"
|
||
terminal_timeout: 300
|
||
dataset_name: "NousResearch/terminal-bench-2"
|
||
tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
|
||
use_wandb: true
|
||
wandb_name: "my-benchmark"
|
||
|
||
openai:
|
||
base_url: "https://openrouter.ai/api/v1"
|
||
model_name: "anthropic/claude-sonnet-4.6"
|
||
server_type: "openai"
|
||
health_check: false
|
||
```
|
||
|
||
YAML values override `config_init()` defaults. CLI arguments override YAML values:
|
||
|
||
```bash
|
||
python my_env.py evaluate \
|
||
--config my_config.yaml \
|
||
--openai.model_name anthropic/claude-opus-4.6 # overrides YAML
|
||
```
|
||
|
||
## Prerequisites
|
||
|
||
### For all environments
|
||
|
||
- Python >= 3.11
|
||
- `atroposlib`: `pip install git+https://github.com/NousResearch/atropos.git`
|
||
- An LLM API key (OpenRouter, OpenAI, or self-hosted VLLM/SGLang)
|
||
|
||
### For Modal-sandboxed benchmarks (TB2, TBLite)
|
||
|
||
- [Modal](https://modal.com) account and CLI: `pip install "hermes-agent[modal]"`
|
||
- `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` environment variables
|
||
|
||
### For YC-Bench
|
||
|
||
- `pip install "hermes-agent[yc-bench]"` (installs the yc-bench CLI + SQLAlchemy)
|
||
- No Modal needed — runs with local terminal backend
|
||
|
||
### For RL training
|
||
|
||
- `TINKER_API_KEY` — API key for the [Tinker](https://tinker.computer) training service
|
||
- `WANDB_API_KEY` — for Weights & Biases metrics tracking
|
||
- The `tinker-atropos` submodule (at `tinker-atropos/` in the repo)
|
||
|
||
See [RL Training](/user-guide/features/rl-training) for the agent-driven RL workflow.
|
||
|
||
## Directory Structure
|
||
|
||
```
|
||
environments/
|
||
├── hermes_base_env.py # Abstract base class (HermesAgentBaseEnv)
|
||
├── agent_loop.py # Multi-turn agent engine (HermesAgentLoop)
|
||
├── tool_context.py # Per-rollout tool access for reward functions
|
||
├── patches.py # Async-safety patches for Modal backend
|
||
│
|
||
├── tool_call_parsers/ # Phase 2 client-side parsers
|
||
│ ├── hermes_parser.py # Hermes/ChatML <tool_call> format
|
||
│ ├── mistral_parser.py # Mistral [TOOL_CALLS] format
|
||
│ ├── llama_parser.py # Llama 3 JSON tool calling
|
||
│ ├── qwen_parser.py # Qwen format
|
||
│ ├── deepseek_v3_parser.py # DeepSeek V3 format
|
||
│ └── ... # + kimi_k2, longcat, glm45/47, etc.
|
||
│
|
||
├── terminal_test_env/ # Stack validation (inline tasks)
|
||
├── hermes_swe_env/ # SWE-bench training environment
|
||
│
|
||
└── benchmarks/ # Evaluation benchmarks
|
||
├── terminalbench_2/ # 89 terminal tasks, Modal sandboxes
|
||
├── tblite/ # 100 calibrated tasks (fast TB2 proxy)
|
||
└── yc_bench/ # Long-horizon strategic benchmark
|
||
```
|