hermes-agent/skills/mlops/training/hermes-atropos-environments/references/atropos-base-env.md
teknium1 6ab3ebf195 Add hermes-atropos-environments skill (bundled)
Add comprehensive skill for building, testing, and debugging Hermes Agent
RL environments for Atropos training. Includes:

- SKILL.md: Full guide covering HermesAgentBaseEnv interface, required
  methods, config class, CLI modes (serve/process/evaluate), reward
  function patterns, common pitfalls, and minimum implementation checklist
- New 'Inference Setup' section: instructs the agent to always ask the
  user for their inference provider (OpenRouter + model choice, self-hosted
  VLLM endpoint, or other OpenAI-compatible API) before running tests
- references/agentresult-fields.md: AgentResult dataclass field reference
- references/atropos-base-env.md: Atropos BaseEnv API reference
- references/usage-patterns.md: Step-by-step patterns for process,
  evaluate, serve, and smoke test modes

Will be auto-synced to ~/.hermes/skills/ via skills_sync.
2026-03-09 23:04:17 -07:00

65 lines
3 KiB
Markdown

# Atropos BaseEnv Reference
Source: `atroposlib/envs/base.py` (~2124 lines)
## Abstract Methods (MUST implement)
| Method | Signature | Description |
|--------|-----------|-------------|
| `get_next_item()` | `async def get_next_item(self) -> Item` | Return next item for trajectory. Return None to pause. |
| `evaluate()` | `async def evaluate(self, *args, **kwargs)` | Called every steps_per_eval steps. |
| `setup()` | `async def setup(self)` | Called once at start. Load datasets, init models. |
| `collect_trajectory()` | `async def collect_trajectory(self, item) -> Tuple[Optional[ScoredDataItem], List[Item]]` | Single rollout. Or override collect_trajectories instead. |
## Overridable Methods
| Method | Default Behavior | Override When |
|--------|-----------------|---------------|
| `collect_trajectories()` | Runs collect_trajectory group_size times in parallel | Batch generation, MCTS, coupled rollouts |
| `wandb_log()` | Logs completion lengths, rollout table, perf stats | Add custom metrics (always call super) |
| `config_init()` | Returns (env_config_cls(), ServerBaseline()) | Custom defaults + server configs |
| `postprocess_histories()` | Passthrough | Final processing before sending to trainer |
| `save_checkpoint()` | Saves JSON to checkpoint_dir | Custom serialization |
| `cleanup()` | No-op | Release resources after each rollout |
## ScoredDataGroup Structure
```python
ScoredDataGroup = TypedDict with:
tokens: List[List[int]] # Token IDs per rollout
masks: List[List[int]] # -100=prompt, token_id=completion
scores: List[float] # Score per rollout
advantages: Optional[...] # Per-token advantages
ref_logprobs: Optional[...] # Reference model logprobs
messages: Optional[...] # OpenAI-format messages
inference_logprobs: Optional[...] # Inference logprobs
```
## BaseEnvConfig Key Fields
| Field | Default | Description |
|-------|---------|-------------|
| `group_size` | 4 | Responses grouped for scoring |
| `steps_per_eval` | 100 | Steps between evaluations |
| `max_token_length` | 2048 | Max token length for generations |
| `total_steps` | 1000 | Total training steps |
| `use_wandb` | True | Enable wandb logging |
| `tokenizer_name` | DeepHermes-3 | Tokenizer for token encoding |
| `ensure_scores_are_not_same` | True | Skip groups with identical scores |
| `worker_timeout` | 600 | Task timeout seconds |
## Data Flow
```
env_manager() → add_train_workers() → handle_env()
→ collect_trajectories() → postprocess_histories()
→ handle_send_to_api() → training server
```
## Atropos Environment Statistics (82 environments analyzed)
- 95% implement setup, collect_trajectories, evaluate, get_next_item
- 76% override wandb_log
- 54% have custom config class
- Most use collect_trajectories (plural), not collect_trajectory (singular)
- Common reward patterns: LLM-judge (~40), regex-extract (~35), code-exec (~12)