Cross-checked 75 docs pages under user-guide/messaging/, developer-guide/,
guides/, and integrations/ against the live registries and gateway code.
messaging/
- index.md: API Server toolset is hermes-api-server (was 'hermes (default)');
Google Chat slug is hermes-google_chat (underscore — plugin name uses _).
- google_chat.md: drop bogus 'pip install hermes-agent[google_chat]' (no such
extra); list the actual deps (google-cloud-pubsub, google-api-python-client,
google-auth, google-auth-oauthlib).
- qqbot.md: config namespace is platforms.qqbot (was platforms.qq, which is
silently ignored by the adapter); QQ_STT_BASE_URL is not read directly —
baseUrl lives under platforms.qqbot.extra.stt.
- teams-meetings.md: 'hermes teams-pipeline' is plugin-gated (teams_pipeline
plugin must be enabled), not a built-in subcommand.
- sms.md: example log line 0.0.0.0:8080 -> 127.0.0.1:8080 (default
SMS_WEBHOOK_HOST).
- open-webui.md: API_SERVER_* are env vars, not YAML keys — write them to
per-profile .env, not 'hermes config set' (same pattern fixed in
api-server.md last round). Also bumped example ports to 8650+ to dodge the
default webhook (8644)/wecom-callback (8645)/msgraph-webhook (8646)
collision.
developer-guide/
- architecture.md: tool/toolset counts (61/52 -> 70+/~28); LOC stamps for
run_agent.py, cli.py, hermes_cli/main.py, setup.py, mcp_tool.py,
gateway/run.py replaced with 'large file' to stop drifting.
- agent-loop.md: same LOC drift (~13,700 -> 'a large file (15k+ lines)').
- gateway-internals.md: '14+ external messaging platforms' -> '20+'; gateway
platform tree updated (qqbot is a sub-package, not qqbot.py; added
yuanbao.py, feishu_comment.py, msgraph_webhook.py); 'gateway/builtin_hooks/
(always active)' was wrong — it's an empty extension point and
_register_builtin_hooks() is a no-op stub.
- acp-internals.md: drop fictional 'message_callback' from the bridged-
callbacks list; clarify thinking_callback is currently set to None.
- provider-runtime.md: provider list was missing AWS Bedrock, Azure Foundry,
NVIDIA NIM, xAI, Arcee, GMI Cloud, StepFun, Qwen OAuth, Xiaomi, Ollama
Cloud, LM Studio, Tencent TokenHub. Fallback section described only the
legacy single-pair model — corrected to the canonical list-form
fallback_providers chain.
- environments.md: parsers list missing llama4_json and the deepseek_v31
alias; both register via @register_parser.
- browser-supervisor.md: drop reference to scripts/browser_supervisor_e2e.py
which doesn't exist in-repo.
- contributing.md: tinker-atropos is a git submodule — note that
'git submodule update --init' is required if cloning without
--recurse-submodules.
guides/
- operate-teams-meeting-pipeline.md: cron flags were all wrong — schedule is
positional (not --schedule), the script-only flag is --no-agent (not
--script-only), and there's no --command flag. Replaced with a real example
that creates the script under ~/.hermes/scripts/ and uses the actual flags.
Also replaced fictional 'hermes cron show <name>' with 'hermes cron status'.
- automation-templates.md: 'cron create --skills "a,b"' doesn't work —
the flag is --skill (singular, repeatable). Fixed all 5 occurrences via AST
rewrite.
- minimax-oauth.md: 'hermes auth add minimax-oauth --region cn' silently
fails because --region isn't registered on the auth-add argparse spec.
Pointed users at the minimax-cn provider (or MINIMAX_CN_API_KEY env) for
China-region access.
- cron-script-only.md: 'hermes send' is fictional — replaced the comparison-
table mention with a webhook-subscription pointer; also fixed the dead link
to /guides/pipe-script-output (page doesn't exist).
- cron-troubleshooting.md: 'hermes serve' isn't a real subcommand. Pointed
at 'hermes gateway' (foreground) / 'hermes gateway start' (service).
- local-ollama-setup.md: 'agent.api_timeout' is not a config key. The right
knob is the HERMES_API_TIMEOUT env var.
- python-library.md: run_conversation() return dict has only final_response
and messages — task_id is stored on the agent instance, not echoed back.
- use-mcp-with-hermes.md: '--args /c "npx -y …"' wraps the npx command in
one quoted string, so cmd.exe gets a single arg instead of the multi-token
command line it needs. Removed the surrounding quotes — argparse nargs='*'
collects each token correctly.
integrations/
- providers.md: Bedrock guardrail YAML keys were 'id'/'version' (don't exist);
actual keys are guardrail_identifier/guardrail_version (matches DEFAULT_CONFIG
and the run_agent.py reader). GMI default base URL (api.gmi.ai/v1 ->
api.gmi-serving.com/v1) and portal URL (inference.gmi.ai -> www.gmicloud.ai)
refreshed. Fallback section rewritten to lead with the canonical
fallback_providers list form (was leading with the legacy fallback_model
single dict); supported-providers list extended to include azure-foundry,
alibaba-coding-plan, lmstudio.
index.md
- '68 built-in tools' -> '70+'; '15+ platforms' was both inconsistent with
integrations/index.md ('19+') and undercounted — bumped to 20+ and added
Weixin/QQ Bot/Yuanbao/Google Chat to the list.
Validation: 'npm run build' clean (exit 0); broken-link count unchanged at
155 (same as round-1 post-skill-regen baseline). 24 files, +132/-89.
20 KiB
| sidebar_position | title | description |
|---|---|---|
| 5 | Environments, Benchmarks & Data Generation | Building RL training environments, running evaluation benchmarks, and generating SFT data with the Hermes-Agent Atropos integration |
Environments, Benchmarks & Data Generation
Hermes Agent includes a full environment framework that connects its tool-calling capabilities to the Atropos RL training framework. This enables three workflows:
- RL Training — Train language models on multi-turn agentic tasks with GRPO
- Benchmarks — Evaluate models on standardised agentic benchmarks
- Data Generation — Generate SFT training data from agent rollouts
All three share the same core: an environment class that defines tasks, runs an agent loop, and scores the output.
:::info Repo environments vs RL training tools
The Python environment framework documented here lives under the repo's environments/ directory and is the implementation-level API for Hermes/Atropos integration. This is separate from the user-facing rl_* tools, which operate as an orchestration surface for remote RL training workflows.
:::
:::tip Quick Links
- Want to run benchmarks? Jump to Available Benchmarks
- Want to train with RL? See RL Training Tools for the agent-driven interface, or Running Environments for manual execution
- Want to create a new environment? See Creating Environments :::
Architecture
The environment system is built on a three-layer inheritance chain:
classDiagram
class BaseEnv {
Server management
Worker scheduling
Wandb logging
CLI: serve / process / evaluate
}
class HermesAgentBaseEnv {
Terminal backend configuration
Tool resolution
Agent loop engine
ToolContext access
}
class TerminalTestEnv {
Stack testing
}
class HermesSweEnv {
SWE training
}
class TerminalBench2EvalEnv {
Benchmark evaluation
}
class TBLiteEvalEnv {
Fast benchmark
}
class YCBenchEvalEnv {
Long-horizon benchmark
}
BaseEnv <|-- HermesAgentBaseEnv
HermesAgentBaseEnv <|-- TerminalTestEnv
HermesAgentBaseEnv <|-- HermesSweEnv
HermesAgentBaseEnv <|-- TerminalBench2EvalEnv
TerminalBench2EvalEnv <|-- TBLiteEvalEnv
TerminalBench2EvalEnv <|-- YCBenchEvalEnv
BaseEnv (Atropos)
The foundation from atroposlib. Provides:
- Server management — connects to OpenAI-compatible APIs (VLLM, SGLang, OpenRouter)
- Worker scheduling — parallel rollout coordination
- Wandb integration — metrics logging and rollout visualisation
- CLI interface — three subcommands:
serve,process,evaluate - Eval logging —
evaluate_log()saves results to JSON + JSONL
HermesAgentBaseEnv
The hermes-agent layer (environments/hermes_base_env.py). Adds:
- Terminal backend configuration — sets
TERMINAL_ENVfor sandboxed execution (local, Docker, Modal, Daytona, SSH, Singularity) - Tool resolution —
_resolve_tools_for_group()calls hermes-agent'sget_tool_definitions()to get the right tool schemas based on enabled/disabled toolsets - Agent loop integration —
collect_trajectory()runsHermesAgentLoopand scores the result - Two-phase operation — Phase 1 (OpenAI server) for eval/SFT, Phase 2 (VLLM ManagedServer) for full RL with logprobs
- Async safety patches — monkey-patches Modal backend to work inside Atropos's event loop
Concrete Environments
Your environment inherits from HermesAgentBaseEnv and implements five methods:
| Method | Purpose |
|---|---|
setup() |
Load dataset, initialise state |
get_next_item() |
Return the next item for rollout |
format_prompt(item) |
Convert an item into the user message |
compute_reward(item, result, ctx) |
Score the rollout (0.0–1.0) |
evaluate() |
Periodic evaluation logic |
Core Components
Agent Loop
HermesAgentLoop (environments/agent_loop.py) is the reusable multi-turn agent engine. It runs the same tool-calling pattern as hermes-agent's main loop:
- Send messages + tool schemas to the API via
server.chat_completion() - If the response contains
tool_calls, dispatch each viahandle_function_call() - Append tool results to the conversation, go back to step 1
- If no
tool_calls, the agent is done
Tool calls execute in a thread pool (ThreadPoolExecutor(128)) so that async backends (Modal, Docker) don't deadlock inside Atropos's event loop.
Returns an AgentResult:
@dataclass
class AgentResult:
messages: List[Dict[str, Any]] # Full conversation history
turns_used: int # Number of LLM calls made
finished_naturally: bool # True if model stopped on its own
reasoning_per_turn: List[Optional[str]] # Extracted reasoning content
tool_errors: List[ToolError] # Errors encountered during tool dispatch
managed_state: Optional[Dict] # VLLM ManagedServer state (Phase 2)
Tool Context
ToolContext (environments/tool_context.py) gives reward functions direct access to the same sandbox the model used during its rollout. The task_id scoping means all state (files, processes, browser tabs) is preserved.
async def compute_reward(self, item, result, ctx: ToolContext):
# Run tests in the model's terminal sandbox
test = ctx.terminal("pytest -v")
if test["exit_code"] == 0:
return 1.0
# Check if a file was created
content = ctx.read_file("/workspace/solution.py")
if content.get("content"):
return 0.5
# Download files for local verification
ctx.download_file("/remote/output.bin", "/local/output.bin")
return 0.0
Available methods:
| Category | Methods |
|---|---|
| Terminal | terminal(command, timeout) |
| Files | read_file(path), write_file(path, content), search(query, path) |
| Transfers | upload_file(), upload_dir(), download_file(), download_dir() |
| Web | web_search(query), web_extract(urls) |
| Browser | browser_navigate(url), browser_snapshot() |
| Generic | call_tool(name, args) — escape hatch for any hermes-agent tool |
| Cleanup | cleanup() — release all resources |
Tool Call Parsers
For Phase 2 (VLLM ManagedServer), the server returns raw text without structured tool calls. Client-side parsers in environments/tool_call_parsers/ extract tool_calls from raw output:
from environments.tool_call_parsers import get_parser
parser = get_parser("hermes") # or "mistral", "llama3_json", "qwen", "deepseek_v3", etc.
content, tool_calls = parser.parse(raw_model_output)
Available parsers: hermes, mistral, llama3_json, llama4_json, qwen, qwen3_coder, deepseek_v3, deepseek_v3_1 (alias deepseek_v31), kimi_k2, longcat, glm45, glm47.
In Phase 1 (OpenAI server type), parsers are not needed — the server handles tool call parsing natively.
Available Benchmarks
TerminalBench2
89 challenging terminal tasks with per-task Docker sandbox environments.
| What it tests | Single-task coding/sysadmin ability |
| Scoring | Binary pass/fail (test suite verification) |
| Sandbox | Modal cloud sandboxes (per-task Docker images) |
| Tools | terminal + file |
| Tasks | 89 tasks across multiple categories |
| Cost | ~$50–200 for full eval (parallel execution) |
| Time | ~2–4 hours |
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
--config environments/benchmarks/terminalbench_2/default.yaml
# Run specific tasks
python environments/benchmarks/terminalbench_2/terminalbench2_env.py evaluate \
--config environments/benchmarks/terminalbench_2/default.yaml \
--env.task_filter fix-git,git-multibranch
Dataset: NousResearch/terminal-bench-2 on HuggingFace.
TBLite (OpenThoughts Terminal Bench Lite)
100 difficulty-calibrated tasks — a faster proxy for TerminalBench2.
| What it tests | Same as TB2 (coding/sysadmin), calibrated difficulty tiers |
| Scoring | Binary pass/fail |
| Sandbox | Modal cloud sandboxes |
| Tools | terminal + file |
| Tasks | 100 tasks: Easy (40), Medium (26), Hard (26), Extreme (8) |
| Correlation | r=0.911 with full TB2 |
| Speed | 2.6–8× faster than TB2 |
python environments/benchmarks/tblite/tblite_env.py evaluate \
--config environments/benchmarks/tblite/default.yaml
TBLite is a thin subclass of TerminalBench2 — only the dataset and timeouts differ. Created by the OpenThoughts Agent team (Snorkel AI + Bespoke Labs). Dataset: NousResearch/openthoughts-tblite.
YC-Bench
Long-horizon strategic benchmark — the agent plays CEO of an AI startup.
| What it tests | Multi-turn strategic coherence over hundreds of turns |
| Scoring | Composite: 0.5 × survival + 0.5 × normalised_funds |
| Sandbox | Local terminal (no Modal needed) |
| Tools | terminal only |
| Runs | 9 default (3 presets × 3 seeds), sequential |
| Cost | ~$50–200 for full eval |
| Time | ~3–6 hours |
# Install yc-bench (optional dependency)
pip install "hermes-agent[yc-bench]"
# Run evaluation
bash environments/benchmarks/yc_bench/run_eval.sh
# Or directly
python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
--config environments/benchmarks/yc_bench/default.yaml
# Quick single-preset test
python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
--config environments/benchmarks/yc_bench/default.yaml \
--env.presets '["fast_test"]' --env.seeds '[1]'
YC-Bench uses collinear-ai/yc-bench — a deterministic simulation with 4 skill domains (research, inference, data_environment, training), prestige system, employee management, and financial pressure. Unlike TB2's per-task binary scoring, YC-Bench measures whether an agent can maintain coherent strategy over hundreds of compounding decisions.
Training Environments
TerminalTestEnv
A minimal self-contained environment with inline tasks (no external dataset). Used for validating the full stack end-to-end. Each task asks the model to create a file at a known path; the verifier checks the content.
# Process mode (saves rollouts to JSONL, no training server needed)
python environments/terminal_test_env/terminal_test_env.py process \
--env.data_path_to_save_groups terminal_test_output.jsonl
# Serve mode (connects to Atropos API for RL training)
python environments/terminal_test_env/terminal_test_env.py serve
HermesSweEnv
SWE-bench style training environment. The model gets a coding task, uses terminal + file + web tools to solve it, and the reward function runs tests in the same Modal sandbox.
python environments/hermes_swe_env/hermes_swe_env.py serve \
--openai.model_name YourModel \
--env.dataset_name bigcode/humanevalpack \
--env.terminal_backend modal
Running Environments
Every environment is a standalone Python script with three CLI subcommands:
evaluate — Run a benchmark
For eval-only environments (benchmarks). Runs all items, computes metrics, logs to wandb.
python environments/benchmarks/tblite/tblite_env.py evaluate \
--config environments/benchmarks/tblite/default.yaml \
--openai.model_name anthropic/claude-sonnet-4.6
No training server or run-api needed. The environment handles everything.
process — Generate SFT data
Runs rollouts and saves scored trajectories to JSONL. Useful for generating training data without a full RL loop.
python environments/terminal_test_env/terminal_test_env.py process \
--env.data_path_to_save_groups output.jsonl \
--openai.model_name anthropic/claude-sonnet-4.6
Output format: each line is a scored trajectory with the full conversation history, reward, and metadata.
serve — Connect to Atropos for RL training
Connects the environment to a running Atropos API server (run-api). Used during live RL training.
# Terminal 1: Start the Atropos API
run-api
# Terminal 2: Start the environment
python environments/hermes_swe_env/hermes_swe_env.py serve \
--openai.model_name YourModel
The environment receives items from Atropos, runs agent rollouts, computes rewards, and sends scored trajectories back for training.
Two-Phase Operation
Phase 1: OpenAI Server (Eval / SFT)
Uses server.chat_completion() with tools= parameter. The server (VLLM, SGLang, OpenRouter, OpenAI) handles tool call parsing natively. Returns ChatCompletion objects with structured tool_calls.
- Use for: evaluation, SFT data generation, benchmarks, testing
- Placeholder tokens are created for the Atropos pipeline (since real token IDs aren't available from the OpenAI API)
Phase 2: VLLM ManagedServer (Full RL)
Uses ManagedServer for exact token IDs + logprobs via /generate. A client-side tool call parser reconstructs structured tool_calls from raw output.
- Use for: full RL training with GRPO/PPO
- Real tokens, masks, and logprobs flow through the pipeline
- Set
tool_call_parserin config to match your model's format (e.g.,"hermes","qwen","mistral")
Creating Environments
Training Environment
from environments.hermes_base_env import HermesAgentBaseEnv, HermesAgentEnvConfig
from atroposlib.envs.server_handling.server_manager import APIServerConfig
class MyEnvConfig(HermesAgentEnvConfig):
my_custom_field: str = "default_value"
class MyEnv(HermesAgentBaseEnv):
name = "my-env"
env_config_cls = MyEnvConfig
@classmethod
def config_init(cls):
env_config = MyEnvConfig(
enabled_toolsets=["terminal", "file"],
terminal_backend="modal",
max_agent_turns=30,
)
server_configs = [APIServerConfig(
base_url="https://openrouter.ai/api/v1",
model_name="anthropic/claude-sonnet-4.6",
server_type="openai",
)]
return env_config, server_configs
async def setup(self):
from datasets import load_dataset
self.dataset = list(load_dataset("my-dataset", split="train"))
self.iter = 0
async def get_next_item(self):
item = self.dataset[self.iter % len(self.dataset)]
self.iter += 1
return item
def format_prompt(self, item):
return item["instruction"]
async def compute_reward(self, item, result, ctx):
# ctx gives full tool access to the rollout's sandbox
test = ctx.terminal("pytest -v")
return 1.0 if test["exit_code"] == 0 else 0.0
async def evaluate(self, *args, **kwargs):
# Periodic evaluation during training
pass
if __name__ == "__main__":
MyEnv.cli()
Eval-Only Benchmark
For benchmarks, follow the pattern used by TerminalBench2, TBLite, and YC-Bench:
- Create under
environments/benchmarks/your-benchmark/ - Set eval-only config:
eval_handling=STOP_TRAIN,steps_per_eval=1,total_steps=1 - Stub training methods:
collect_trajectories()returns(None, []),score()returnsNone - Implement
rollout_and_score_eval(eval_item)— the per-item agent loop + scoring - Implement
evaluate()— orchestrates all runs, computes aggregate metrics - Add streaming JSONL for crash-safe result persistence
- Add cleanup:
KeyboardInterrupthandling,cleanup_all_environments(),_tool_executor.shutdown() - Run with
evaluatesubcommand
See environments/benchmarks/yc_bench/yc_bench_env.py for a clean, well-documented reference implementation.
Configuration Reference
HermesAgentEnvConfig Fields
| Field | Type | Default | Description |
|---|---|---|---|
enabled_toolsets |
List[str] |
None (all) |
Which hermes toolsets to enable |
disabled_toolsets |
List[str] |
None |
Toolsets to filter out |
distribution |
str |
None |
Probabilistic toolset distribution name |
max_agent_turns |
int |
30 |
Max LLM calls per rollout |
agent_temperature |
float |
1.0 |
Sampling temperature |
system_prompt |
str |
None |
System message for the agent |
terminal_backend |
str |
"local" |
local, docker, modal, daytona, ssh, singularity |
terminal_timeout |
int |
120 |
Seconds per terminal command |
terminal_lifetime |
int |
3600 |
Max sandbox lifetime |
dataset_name |
str |
None |
HuggingFace dataset identifier |
tool_pool_size |
int |
128 |
Thread pool size for tool execution |
tool_call_parser |
str |
"hermes" |
Parser for Phase 2 raw output |
extra_body |
Dict |
None |
Extra params for OpenAI API (e.g., OpenRouter provider prefs) |
eval_handling |
Enum |
STOP_TRAIN |
STOP_TRAIN, LIMIT_TRAIN, NONE |
YAML Configuration
Environments can be configured via YAML files passed with --config:
env:
enabled_toolsets: ["terminal", "file"]
max_agent_turns: 60
max_token_length: 32000
agent_temperature: 0.8
terminal_backend: "modal"
terminal_timeout: 300
dataset_name: "NousResearch/terminal-bench-2"
tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
use_wandb: true
wandb_name: "my-benchmark"
openai:
base_url: "https://openrouter.ai/api/v1"
model_name: "anthropic/claude-sonnet-4.6"
server_type: "openai"
health_check: false
YAML values override config_init() defaults. CLI arguments override YAML values:
python my_env.py evaluate \
--config my_config.yaml \
--openai.model_name anthropic/claude-opus-4.6 # overrides YAML
Prerequisites
For all environments
- Python >= 3.11
atroposlib:pip install git+https://github.com/NousResearch/atropos.git- An LLM API key (OpenRouter, OpenAI, or self-hosted VLLM/SGLang)
For Modal-sandboxed benchmarks (TB2, TBLite)
- Modal account and CLI:
pip install "hermes-agent[modal]" MODAL_TOKEN_IDandMODAL_TOKEN_SECRETenvironment variables
For YC-Bench
pip install "hermes-agent[yc-bench]"(installs the yc-bench CLI + SQLAlchemy)- No Modal needed — runs with local terminal backend
For RL training
TINKER_API_KEY— API key for the Tinker training serviceWANDB_API_KEY— for Weights & Biases metrics tracking- The
tinker-atropossubmodule (attinker-atropos/in the repo)
See RL Training for the agent-driven RL workflow.
Directory Structure
environments/
├── hermes_base_env.py # Abstract base class (HermesAgentBaseEnv)
├── agent_loop.py # Multi-turn agent engine (HermesAgentLoop)
├── tool_context.py # Per-rollout tool access for reward functions
├── patches.py # Async-safety patches for Modal backend
│
├── tool_call_parsers/ # Phase 2 client-side parsers
│ ├── hermes_parser.py # Hermes/ChatML <tool_call> format
│ ├── mistral_parser.py # Mistral [TOOL_CALLS] format
│ ├── llama_parser.py # Llama 3 JSON tool calling
│ ├── qwen_parser.py # Qwen format
│ ├── deepseek_v3_parser.py # DeepSeek V3 format
│ └── ... # + kimi_k2, longcat, glm45/47, etc.
│
├── terminal_test_env/ # Stack validation (inline tasks)
├── hermes_swe_env/ # SWE-bench training environment
│
└── benchmarks/ # Evaluation benchmarks
├── terminalbench_2/ # 89 terminal tasks, Modal sandboxes
├── tblite/ # 100 calibrated tasks (fast TB2 proxy)
└── yc_bench/ # Long-horizon strategic benchmark