# Atropos BaseEnv Reference

Source: `atroposlib/envs/base.py` (~2124 lines)
## Abstract Methods (MUST implement)

| Method | Signature | Description |
|---|---|---|
| `get_next_item()` | `async def get_next_item(self) -> Item` | Return the next item for a trajectory; return `None` to pause. |
| `evaluate()` | `async def evaluate(self, *args, **kwargs)` | Called every `steps_per_eval` steps. |
| `setup()` | `async def setup(self)` | Called once at start. Load datasets, init models. |
| `collect_trajectory()` | `async def collect_trajectory(self, item) -> Tuple[Optional[ScoredDataItem], List[Item]]` | Single rollout; or override `collect_trajectories()` instead. |
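Put together, a minimal subclass looks roughly like the sketch below. Only the four signatures come from the table above; the dataset, the item shape, and the scoring are placeholders, and the import path is assumed from the source note at the top.

```python
from typing import List, Optional, Tuple

from atroposlib.envs.base import BaseEnv  # path assumed per the source note


class MyEnv(BaseEnv):
    async def setup(self):
        # Called once at start: load datasets, init models.
        # Hypothetical in-memory dataset of (prompt, gold_answer) pairs.
        self.dataset = [("2+2=?", "4"), ("3*3=?", "9")]
        self.index = 0

    async def get_next_item(self):
        # Hand out one item per call; returning None pauses the env.
        if self.index >= len(self.dataset):
            return None
        item = self.dataset[self.index]
        self.index += 1
        return item

    async def collect_trajectory(self, item) -> Tuple[Optional[dict], List]:
        # One rollout for one item. The base class runs this group_size
        # times in parallel via the default collect_trajectories().
        prompt, gold = item
        # Completion call and ScoredDataItem construction elided; they
        # depend on your server/tokenizer setup.
        scored_item = None  # build a ScoredDataItem here
        backlog: List = []  # extra items to enqueue, usually empty
        return scored_item, backlog

    async def evaluate(self, *args, **kwargs):
        # Runs every steps_per_eval steps; compute held-out metrics here.
        pass
```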
## Overridable Methods

| Method | Default Behavior | Override When |
|---|---|---|
| `collect_trajectories()` | Runs `collect_trajectory()` `group_size` times in parallel | Batch generation, MCTS, coupled rollouts |
| `wandb_log()` | Logs completion lengths, rollout table, perf stats | Adding custom metrics (always call `super()`) |
| `config_init()` | Returns `(env_config_cls(), ServerBaseline())` | Custom defaults and server configs |
| `postprocess_histories()` | Passthrough | Final processing before sending to the trainer |
| `save_checkpoint()` | Saves JSON to `checkpoint_dir` | Custom serialization |
| `cleanup()` | No-op | Releasing resources after each rollout |
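As a concrete case, a `wandb_log` override typically appends custom metrics and then delegates to the base implementation so the default stats still get logged. This is a sketch only: the keyword signature and the `percent_correct_buffer` attribute are assumptions, not confirmed API.

```python
# Sketch of a wandb_log override inside a BaseEnv subclass.
# percent_correct_buffer is a hypothetical list this env fills during
# rollouts; the exact signature may differ in base.py.
async def wandb_log(self, wandb_metrics: Optional[dict] = None):
    wandb_metrics = wandb_metrics or {}
    if self.percent_correct_buffer:
        wandb_metrics["train/percent_correct"] = sum(
            self.percent_correct_buffer
        ) / len(self.percent_correct_buffer)
        self.percent_correct_buffer = []
    # Always call super() so completion lengths, the rollout table,
    # and perf stats are still logged.
    await super().wandb_log(wandb_metrics)
```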
## ScoredDataGroup Structure

`ScoredDataGroup` is a `TypedDict`:

```python
class ScoredDataGroup(TypedDict):
    tokens: List[List[int]]            # Token IDs per rollout
    masks: List[List[int]]             # -100 = prompt, token_id = completion
    scores: List[float]                # Score per rollout
    advantages: Optional[...]          # Per-token advantages
    ref_logprobs: Optional[...]        # Reference-model logprobs
    messages: Optional[...]            # OpenAI-format messages
    inference_logprobs: Optional[...]  # Inference logprobs
```
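For illustration, a hand-built group for two rollouts might look like the following; the token IDs are dummies, and real values come from the configured tokenizer.

```python
# Illustrative ScoredDataGroup contents for a group of two rollouts.
group = {
    "tokens": [[1, 5, 9, 12], [1, 5, 7, 3]],
    # Prompt positions are masked with -100; completion positions
    # repeat the token id so they count toward the loss.
    "masks": [[-100, -100, 9, 12], [-100, -100, 7, 3]],
    "scores": [1.0, 0.0],  # one score per rollout
}
```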
## BaseEnvConfig Key Fields

| Field | Default | Description |
|---|---|---|
| `group_size` | `4` | Responses grouped together for scoring |
| `steps_per_eval` | `100` | Steps between evaluations |
| `max_token_length` | `2048` | Max token length for generations |
| `total_steps` | `1000` | Total training steps |
| `use_wandb` | `True` | Enable wandb logging |
| `tokenizer_name` | DeepHermes-3 | Tokenizer used for token encoding |
| `ensure_scores_are_not_same` | `True` | Skip groups where all scores are identical |
| `worker_timeout` | `600` | Task timeout in seconds |
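These fields are usually pinned through `config_init()`, listed in the overridable-methods table above. A hedged sketch: the config subclass, its extra field, and `ServerBaseline`'s import path are all assumptions.

```python
# ServerBaseline's import path is an assumption; check atroposlib.
from atroposlib.envs.base import BaseEnv, BaseEnvConfig, ServerBaseline


class MyEnvConfig(BaseEnvConfig):
    # Hypothetical extra field for an LLM-judge reward.
    judge_model: str = "gpt-4o-mini"


class MyEnv(BaseEnv):
    @classmethod
    def config_init(cls):
        env_config = MyEnvConfig(
            group_size=8,           # up from the default 4
            max_token_length=4096,  # up from the default 2048
            steps_per_eval=50,
        )
        return env_config, ServerBaseline()
```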
## Data Flow

```
env_manager() → add_train_workers() → handle_env()
  → collect_trajectories() → postprocess_histories()
  → handle_send_to_api() → training server
```
## Atropos Environment Statistics (82 environments analyzed)

- 95% implement `setup`, `collect_trajectories`, `evaluate`, and `get_next_item`
- 76% override `wandb_log`
- 54% define a custom config class
- Most use `collect_trajectories` (plural), not `collect_trajectory` (singular)
- Common reward patterns, by environment count: LLM-judge (~40), regex-extract (~35), code-exec (~12); see the sketch after this list
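A minimal version of the regex-extract pattern referenced in the last bullet. The `<answer>` tag format is an assumption; many environments match `\boxed{}` or an `Answer:` prefix instead.

```python
import re


def regex_extract_reward(completion: str, gold: str) -> float:
    """Score 1.0 if the text inside the last <answer>...</answer> tag
    matches the gold answer, else 0.0."""
    matches = re.findall(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == gold.strip() else 0.0
```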