| name | description | version | author | license | metadata |
|---|---|---|---|---|---|
| hermes-atropos-environments | Build, test, and debug Hermes Agent RL environments for Atropos training. Covers the HermesAgentBaseEnv interface, reward functions, agent loop integration, evaluation with tools, wandb logging, and the three CLI modes (serve/process/evaluate). Use when creating, reviewing, or fixing RL environments in the hermes-agent repo. | 1.1.0 | Hermes Agent | MIT | |
# Hermes Agent Atropos Environments
Guide for building RL environments in the hermes-agent repo that integrate with the Atropos training framework.
## Architecture Overview

```
Atropos BaseEnv (atroposlib/envs/base.py)
└── HermesAgentBaseEnv (environments/hermes_base_env.py)
    ├── Handles agent loop orchestration
    ├── Handles tool resolution per group
    ├── Handles ToolContext for reward verification
    └── YOUR ENVIRONMENT (environments/your_env.py)
        Only implements: setup, get_next_item, format_prompt,
        compute_reward, evaluate, wandb_log
```
Hermes environments are special because they run a multi-turn agent loop with tool calling — not just single-turn completions. The base env handles the loop; you implement the task and scoring.
## File Locations

| File | Purpose |
|---|---|
| `environments/hermes_base_env.py` | Base class with agent loop + tool resolution |
| `environments/agent_loop.py` | `HermesAgentLoop` + `AgentResult` dataclass |
| `environments/tool_context.py` | `ToolContext` for reward verification |
| `environments/tool_call_parsers.py` | Phase 2 tool call parsers (hermes, mistral, etc.) |
| `environments/your_env.py` | Your environment implementation |
## Inference Setup — Ask the User First

**IMPORTANT:** Before running any test, evaluation, or data generation command, always ask the user how they want to handle inference. Do NOT assume OpenRouter or any specific endpoint. Present these options:

- **OpenRouter** — Ask which model they want to use (e.g., `anthropic/claude-sonnet-4.5`, `google/gemini-2.5-pro`, `meta-llama/llama-3.3-70b-instruct`). Requires `OPENROUTER_API_KEY` in the environment.
- **Self-hosted vLLM endpoint** — Ask for their base URL (e.g., `http://localhost:8000/v1`) and model name. Set `--openai.server_type vllm`.
- **Other OpenAI-compatible API** — Ask for the base URL, model name, and any required API key. Set `--openai.server_type openai` and `--openai.health_check false`.
- **Local Atropos training server** — For `serve` mode with a live training loop. Default `http://localhost:8000/v1`.
Once the user tells you their setup, use those values in all CLI commands for that session. An example prompt:

```
"Before I run this, how would you like to handle inference?
- OpenRouter (I'll need your preferred model, e.g. claude-sonnet-4.5)
- A self-hosted vLLM endpoint (give me the URL and model name)
- Another OpenAI-compatible API (give me the URL, model, and any auth details)
- Local Atropos training server (serve mode)"
```
Key flags by provider:

| Provider | `--openai.server_type` | `--openai.health_check` | `--openai.api_key` |
|---|---|---|---|
| OpenRouter | `openai` | `false` | `$OPENROUTER_API_KEY` |
| vLLM (self-hosted) | `vllm` | (default) | (not needed) |
| Other OpenAI-compatible | `openai` | `false` | As needed |
| Local Atropos | (default) | (default) | (not needed) |
## Required Methods
### 1. `setup()` — Load dataset and initialize state

```python
async def setup(self) -> None:
    """Called once at startup. Load datasets, initialize state."""
    # Try HuggingFace first, fall back to built-in samples
    try:
        from datasets import load_dataset
        ds = load_dataset("your/dataset", split="test")
        self._items = [...]
    except Exception:
        self._items = BUILTIN_SAMPLES
    # Always split into train/eval (assumes `import random` at module level)
    random.shuffle(self._items)
    eval_size = max(20, int(len(self._items) * 0.1))
    self._eval_items = self._items[:eval_size]
    self._items = self._items[eval_size:]
```
### 2. `get_next_item()` — Return next training item

```python
async def get_next_item(self) -> dict:
    """Return next item, cycling through dataset."""
    item = self._items[self._index % len(self._items)]
    self._index += 1
    return item
```
### 3. `format_prompt(item)` — Convert item to user message

```python
def format_prompt(self, item: dict) -> str:
    """Convert a dataset item into the user-facing prompt."""
    return f"Research this question: {item['question']}"
```
### 4. `compute_reward(item, result, ctx)` — Score the rollout

**CRITICAL:** `result` is an `AgentResult`, NOT a dict. It has these attributes:

- `result.messages` — List of message dicts (OpenAI format)
- `result.turns_used` — Number of LLM calls made
- `result.finished_naturally` — True if the model stopped voluntarily
- `result.tool_errors` — List of `ToolError` objects

`AgentResult` does NOT have `final_response`, `tool_calls`, or `tools_used`. You must extract these from `result.messages`:
```python
async def compute_reward(self, item, result: AgentResult, ctx: ToolContext) -> float:
    # Extract final response (last assistant message with content)
    final_response = ""
    tools_used = []
    for msg in reversed(result.messages):
        if msg.get("role") == "assistant" and msg.get("content") and not final_response:
            final_response = msg["content"]
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            for tc in msg["tool_calls"]:
                fn = tc.get("function", {}) if isinstance(tc, dict) else {}
                name = fn.get("name", "")
                if name:
                    tools_used.append(name)
    # Score using LLM judge, heuristic, or ToolContext verification
    correctness = await self._llm_judge(item, final_response)
    return correctness
```
`ctx` (a `ToolContext`) gives you terminal/file access to the agent's sandbox for verification:

```python
# Run tests in the agent's sandbox (avoid shadowing the AgentResult `result`)
proc = ctx.terminal("pytest /workspace/test.py")
return 1.0 if proc["exit_code"] == 0 else 0.0
```
### 5. `evaluate()` — Periodic evaluation with full agent loop

MUST use the full agent loop with tools, not single-turn `chat_completion`. The whole point of hermes-agent environments is agentic evaluation:

```python
async def evaluate(self, *args, **kwargs) -> None:
    import time, uuid
    from environments.agent_loop import HermesAgentLoop
    from environments.tool_context import ToolContext

    start_time = time.time()
    tools, valid_names = self._resolve_tools_for_group()
    samples = []
    for item in self._eval_items[:self.config.eval_size]:
        task_id = str(uuid.uuid4())
        messages = []
        if self.config.system_prompt:
            messages.append({"role": "system", "content": self.config.system_prompt})
        messages.append({"role": "user", "content": self.format_prompt(item)})
        agent = HermesAgentLoop(
            server=self.server,
            tool_schemas=tools,
            valid_tool_names=valid_names,
            max_turns=self.config.max_agent_turns,
            task_id=task_id,
            temperature=0.0,  # Deterministic for eval
            max_tokens=self.config.max_token_length,
            extra_body=self.config.extra_body,
        )
        result = await agent.run(messages)
        ctx = ToolContext(task_id)
        try:
            reward = await self.compute_reward(item, result, ctx)
        finally:
            ctx.cleanup()
        samples.append({"prompt": ..., "response": ..., "reward": reward})
    eval_metrics = {"eval/mean_reward": ...}
    await self.evaluate_log(metrics=eval_metrics, samples=samples,
                            start_time=start_time, end_time=time.time())
```
### 6. `wandb_log()` — Custom metrics logging

Always call `super().wandb_log()` at the end:

```python
async def wandb_log(self, wandb_metrics=None):
    if wandb_metrics is None:
        wandb_metrics = {}
    if self._reward_buffer:
        n = len(self._reward_buffer)
        wandb_metrics["train/mean_reward"] = sum(self._reward_buffer) / n
        self._reward_buffer.clear()
    await super().wandb_log(wandb_metrics)  # MUST call super
```

**Pitfall:** `compute_reward` appends to metric buffers. During eval, this pollutes training metrics. Roll back buffer entries added during eval.
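One way to implement that rollback is a snapshot/truncate pattern: record the buffer length before eval, then delete everything appended after that mark. This stand-alone sketch uses illustrative names (`BufferRollback`, `run_eval`), not the real base class:

```python
class BufferRollback:
    """Minimal sketch of the snapshot/truncate rollback pattern."""

    def __init__(self):
        self._reward_buffer = []  # shared by training and eval paths

    def compute_reward(self, score: float) -> float:
        self._reward_buffer.append(score)  # appends during train AND eval
        return score

    def run_eval(self, eval_scores):
        mark = len(self._reward_buffer)      # snapshot before eval starts
        try:
            for s in eval_scores:
                self.compute_reward(s)       # eval pollutes the buffer...
        finally:
            del self._reward_buffer[mark:]   # ...so truncate back afterwards
```

The `finally` block guarantees the training buffer is restored even if an eval rollout raises.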
## Config Class

Always create a custom config subclass with Pydantic `Field` descriptors. Key inherited fields you can tune: `enabled_toolsets`, `max_agent_turns`, `agent_temperature`, `system_prompt`, `terminal_backend`, `group_size`, `steps_per_eval`, `total_steps`.
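A sketch of the shape. The base config class here is a stand-in `BaseModel` (the real one lives in `environments/hermes_base_env.py`), and the field names and defaults are illustrative:

```python
from pydantic import BaseModel, Field

class HermesAgentBaseEnvConfig(BaseModel):
    # Stand-in for the real base config; these mirror two of the
    # inherited fields named above (defaults are illustrative)
    group_size: int = Field(default=8, description="Rollouts per training group")
    max_agent_turns: int = Field(default=10, description="Max LLM calls per rollout")

class MyEnvConfig(HermesAgentBaseEnvConfig):
    # Environment-specific knobs layered on top of the inherited ones
    dataset_name: str = Field(default="your/dataset",
                              description="HF dataset loaded in setup()")
    eval_size: int = Field(default=20,
                           description="Items scored per evaluation pass")
```

Subclassing keeps every inherited field overridable from the CLI (`--env.eval_size 5`) while the `Field` descriptions double as `--help` text.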
### `config_init()` — Default Configuration

Classmethod returning `(YourEnvConfig, [APIServerConfig(...)])`. Set `server_type` to `"openai"` for OpenRouter/external APIs. Load the API key from an environment variable.
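A sketch of an OpenRouter default. Since the real `APIServerConfig` lives in atroposlib, a stand-in dataclass is used here, with field names assumed from the flags table above:

```python
import os
from dataclasses import dataclass

@dataclass
class APIServerConfig:
    """Stand-in for atroposlib's APIServerConfig (field names assumed)."""
    model_name: str
    base_url: str
    api_key: str = ""
    server_type: str = "openai"
    health_check: bool = True

def default_server_config() -> APIServerConfig:
    """Sketch of the server half of a config_init() for OpenRouter."""
    return APIServerConfig(
        model_name="anthropic/claude-sonnet-4.5",
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ.get("OPENROUTER_API_KEY", ""),  # never hardcode keys
        server_type="openai",   # required for external OpenAI-compatible APIs
        health_check=False,     # OpenRouter has no /health endpoint
    )
```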
## Three CLI Modes

```bash
# SERVE — Full training loop (connects to Atropos API server)
python environments/my_env.py serve --openai.base_url http://localhost:8000/v1

# PROCESS — Offline data generation (saves JSONL)
python environments/my_env.py process --env.total_steps 10 --env.group_size 1 \
    --env.use_wandb false --env.data_path_to_save_groups output.jsonl \
    --openai.base_url "<USER_BASE_URL>" \
    --openai.model_name "<USER_MODEL>" \
    --openai.server_type <USER_SERVER_TYPE> --openai.health_check false

# EVALUATE — Standalone eval (runs setup + evaluate only)
python environments/my_env.py evaluate --env.eval_size 20 \
    --env.data_dir_to_save_evals /tmp/eval_results \
    --openai.base_url "<USER_BASE_URL>" \
    --openai.model_name "<USER_MODEL>" \
    --openai.server_type <USER_SERVER_TYPE> --openai.health_check false
```

Config priority: CLI args > YAML file > `config_init()` defaults.
## Common Pitfalls

- **`AgentResult` has `.messages`, not `.final_response`** — Extract the final response by iterating `reversed(result.messages)` looking for the last assistant message with content.
- **`evaluate()` must use `HermesAgentLoop`, not `chat_completion`** — Single-turn `chat_completion` has no tools. The whole point of hermes-agent benchmarks is agentic evaluation with tool use.
- **Don't call `_llm_judge` twice** — If `compute_reward` already calls it, extract the score from the buffer instead of calling the judge separately in `evaluate()`.
- **Eval pollutes training buffers** — `compute_reward` appends to metric buffers. During eval, roll back buffer entries to keep training metrics clean.
- **Always set `health_check=false` for OpenRouter** — OpenRouter has no `/health` endpoint.
- **Set `data_dir_to_save_evals` in evaluate mode** — Without it, results aren't saved.
- **`default_toolsets` class variable vs `enabled_toolsets` config** — The class variable is a hint; the config field is what actually controls tool resolution.
- **Tool call parsing in messages** — Tool calls are dicts with `{"function": {"name": ..., "arguments": ...}}`. Always check `isinstance(tc, dict)`.
- **`ToolContext.cleanup()`** — Always call it in a `finally` block to release sandbox resources.
- **`server_type` must be `"openai"` for external APIs** — Without it, Atropos assumes a local vLLM server.
- **Always ask the user for their inference setup** — Never hardcode or assume a specific provider/model. See the "Inference Setup" section above.
## Reward Function Patterns

### LLM Judge (for open-ended tasks)

Use `self.server.chat_completion()` with a scoring prompt. Parse the JSON response for a score float. Always include a heuristic fallback (keyword overlap) for when the judge call fails.
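A minimal sketch of the parsing half of this pattern, with the keyword-overlap fallback. The helper name and the `{"score": ...}` JSON shape are assumptions, not the repo's actual contract:

```python
import json
import re

def parse_judge_score(raw: str, reference: str, response: str) -> float:
    """Parse a judge's JSON reply; fall back to keyword overlap on failure."""
    try:
        # Judges often wrap JSON in prose; grab the outermost braces
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        score = float(json.loads(match.group(0))["score"])
        return max(0.0, min(1.0, score))  # clamp to [0, 1]
    except Exception:
        # Heuristic fallback: fraction of reference keywords in the response
        keywords = set(reference.lower().split())
        if not keywords:
            return 0.0
        hits = sum(1 for w in keywords if w in response.lower())
        return hits / len(keywords)
```

Catching broadly here is deliberate: a malformed judge reply should degrade to the heuristic, never crash a rollout.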
### Binary Verification (for code/terminal tasks)

Use `ctx.terminal("pytest test.py -q")` to run tests in the agent's sandbox. Return 1.0 for pass, 0.0 for fail.

### Multi-Signal (combine multiple indicators)

Weight correctness (0.6) + tool usage (0.2) + efficiency (0.2) + optional bonuses. Clamp to [0, 1].
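As a sketch, that weighting might look like the following (the signal definitions, e.g. efficiency as unused-turn fraction, are illustrative):

```python
def combine_signals(correctness: float, used_required_tool: bool,
                    turns_used: int, max_turns: int,
                    bonus: float = 0.0) -> float:
    """Weighted multi-signal reward, clamped to [0, 1]."""
    tool_score = 1.0 if used_required_tool else 0.0
    # Fewer turns -> higher efficiency (illustrative definition)
    efficiency = max(0.0, 1.0 - turns_used / max_turns)
    reward = 0.6 * correctness + 0.2 * tool_score + 0.2 * efficiency + bonus
    return max(0.0, min(1.0, reward))
```

Clamping last means bonuses can recover a weak signal but never push the reward outside the range Atropos expects.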
## Testing Your Environment

- **Import test:** `python -c "from environments.my_env import MyEnv; print('OK')"`
- **Ask the user** for their inference setup (see "Inference Setup" section above)
- **Process mode (1 item):** verify the JSONL output has valid tokens, masks, and scores
- **Evaluate mode:** verify the full agent loop runs with tools and metrics are logged correctly
- **Check reward range:** scores should fall in [0, 1] and not all be identical
## Minimum Implementation Checklist

```python
class MyEnv(HermesAgentBaseEnv):
    name = "my-env"
    env_config_cls = MyEnvConfig

    @classmethod
    def config_init(cls): ...                               # Default server + env config

    async def setup(self): ...                              # Load dataset + train/eval split
    async def get_next_item(self): ...                      # Cycle through training items
    def format_prompt(self, item): ...                      # Item -> user message string
    async def compute_reward(self, item, result, ctx): ...  # Score rollout
    async def evaluate(self, *args, **kwargs): ...          # Full agent loop eval
    async def wandb_log(self, metrics=None): ...            # Custom metrics + super()

if __name__ == "__main__":
    MyEnv.cli()
```