chore: remove Atropos RL environments and tinker-atropos integration (#26106)

* chore: remove Atropos RL environments, tools, tests, skill, and tinker-atropos submodule

Delete:
- environments/ (43 files — base env, agent loop, tool call parsers, benchmarks)
- rl_cli.py (standalone RL training CLI)
- tools/rl_training_tool.py (all 10 rl_* tools)
- tests: test_rl_training_tool, test_tool_call_parsers, test_managed_server_tool_support, test_agent_loop, test_agent_loop_vllm, test_agent_loop_tool_calling, test_terminalbench2_env_security
- optional-skills/mlops/hermes-atropos-environments/
- tinker-atropos git submodule + .gitmodules

* chore: remove RL/Atropos references from Python source

- toolsets.py: remove rl toolset block + update comment
- model_tools.py: remove rl_tools group + update async bridging comment
- hermes_cli/tools_config.py: remove RL display entry, _DEFAULT_OFF_TOOLSETS, setup block, and rl_training post-setup handler
- tools/budget_config.py: remove RL environment reference in docstring
- tests/test_model_tools.py: remove rl_tools from expected groups
- tests/run_agent/test_streaming_tool_call_repair.py: fix stale cross-reference

* chore: remove rl/yc-bench extras and tinker-atropos refs from pyproject.toml

- Remove rl extra (atroposlib, tinker, fastapi, uvicorn, wandb)
- Remove yc-bench extra
- Remove rl_cli from py-modules
- Remove [tool.ty.src] exclude for tinker-atropos
- Remove [tool.ruff] exclude for tinker-atropos
- Regenerate uv.lock

* chore: remove tinker-atropos from install/setup scripts

- setup-hermes.sh: remove entire tinker-atropos submodule install block
- scripts/install.sh: remove both tinker-atropos blocks (Termux + standard)
- scripts/install.ps1: remove tinker-atropos block
- nix/hermes-agent.nix: remove tinker-atropos pip install line

* chore: remove RL references from cli-config.yaml.example

* docs: remove Atropos/RL references from README, CONTRIBUTING, AGENTS.md

* docs: remove RL/Atropos references from website

- Delete: environments.md, rl-training.md, mlops-hermes-atropos-environments.md
- sidebars.ts: remove rl-training and environments sidebar entries
- optional-skills-catalog.md: remove hermes-atropos-environments row
- tools-reference.md: remove entire rl toolset section
- toolsets-reference.md: remove rl row + update example
- integrations/index.md: remove RL Training bullet
- architecture.md: remove environments/ from tree + RL section
- contributing.md: remove tinker-atropos setup
- updating.md: remove tinker-atropos install + stale submodule update

* chore: remove remaining RL/Atropos stragglers

- hermes_cli/config.py: remove TINKER_API_KEY + WANDB_API_KEY env var defs
- hermes_cli/doctor.py: remove Submodules check section (tinker-atropos)
- hermes_cli/setup.py: remove RL Training status check
- hermes_cli/status.py: remove Tinker + WandB from API key status display
- agent/display.py: remove both rl_* tool preview/activity blocks
- website/docs: remove RL references from providers.md + env-variables.md
- tests: remove TINKER_API_KEY from conftest, set_config_value, setup_script

* chore: remove RL training section from .env.example
2026-05-24 05:41:40 +00:00 · 2026-05-15 10:36:38 +05:30 · 2026-05-15 10:36:38 +05:30 · 5af672c753
commit 5af672c753
parent d364132114
97 changed files with 18 additions and 15690 deletions
--- a/website/docs/user-guide/features/rl-training.md
+++ b/website/docs/user-guide/features/rl-training.md
@@ -1,234 +0,0 @@
---
-sidebar_position: 13
-title: "RL Training"
-description: "Reinforcement learning on agent behaviors with Tinker-Atropos — environment discovery, training, and evaluation"
---
-
-# RL Training
-
-Hermes Agent includes an integrated RL (Reinforcement Learning) training pipeline built on **Tinker-Atropos**. This enables training language models on environment-specific tasks using GRPO (Group Relative Policy Optimization) with LoRA adapters, orchestrated entirely through the agent's tool interface.
-
-## Overview
-
-The RL training system consists of three components:
-
-1. **[Atropos](https://github.com/NousResearch/atropos)** — A trajectory API server that coordinates environment interactions, manages rollout groups, and computes advantages
-2. **[Tinker](https://thinkingmachines.ai/tinker/)** — A training service that handles model weights, LoRA training, sampling/inference, and optimizer steps
-3. **Environments** — Python classes that define tasks, scoring, and reward functions (e.g., GSM8K math problems)
-
-The agent can discover environments, configure training parameters, launch training runs, and monitor metrics — all through a set of `rl_*` tools.
-
-## Requirements
-
-RL training requires:
-
- **Python >= 3.11** (Tinker package requirement)
- **TINKER_API_KEY** — API key for the Tinker training service
- **WANDB_API_KEY** — API key for [Weights & Biases](https://wandb.ai/) metrics tracking
- The `tinker-atropos` submodule (at `tinker-atropos/` relative to the Hermes root)
-
-```bash
-# Set up API keys
-hermes config set TINKER_API_KEY your-tinker-key
-hermes config set WANDB_API_KEY your-wandb-key
-```
-
-When both keys are present and Python >= 3.11 is available, the `rl` toolset is automatically enabled.
-
-## Available Tools
-
-| Tool | Description |
-|------|-------------|
-| `rl_list_environments` | Discover available RL environments |
-| `rl_select_environment` | Select an environment and load its config |
-| `rl_get_current_config` | View configurable and locked fields |
-| `rl_edit_config` | Modify configurable training parameters |
-| `rl_start_training` | Launch a training run (spawns 3 processes) |
-| `rl_check_status` | Monitor training progress and WandB metrics |
-| `rl_stop_training` | Stop a running training job |
-| `rl_get_results` | Get final metrics and model weights path |
-| `rl_list_runs` | List all active and completed runs |
-| `rl_test_inference` | Quick inference test using OpenRouter |
-
-## Workflow
-
-### 1. Discover Environments
-
-```
-List the available RL environments
-```
-
-The agent calls `rl_list_environments()` which scans `tinker-atropos/tinker_atropos/environments/` using AST parsing to find Python classes inheriting from `BaseEnv`. Each environment defines:
-
- **Dataset loading** — where training data comes from (e.g., HuggingFace datasets)
- **Prompt construction** — how to format items for the model
- **Scoring/verification** — how to evaluate model outputs and assign rewards
-
-### 2. Select and Configure
-
-```
-Select the GSM8K environment and show me the configuration
-```
-
-The agent calls `rl_select_environment("gsm8k_tinker")`, then `rl_get_current_config()` to see all parameters.
-
-Configuration fields are divided into two categories:
-
-**Configurable fields** (can be modified):
- `group_size` — Number of completions per item (default: 16)
- `batch_size` — Training batch size (default: 128)
- `wandb_name` — WandB run name (auto-set to `{env}-{timestamp}`)
- Other environment-specific parameters
-
-**Locked fields** (infrastructure settings, cannot be changed):
- `tokenizer_name` — Model tokenizer (e.g., `Qwen/Qwen3-8B`)
- `rollout_server_url` — Atropos API URL (`http://localhost:8000`)
- `max_token_length` — Maximum token length (8192)
- `max_num_workers` — Maximum parallel workers (2048)
- `total_steps` — Total training steps (2500)
- `lora_rank` — LoRA adapter rank (32)
- `learning_rate` — Learning rate (4e-5)
- `max_token_trainer_length` — Max tokens for trainer (9000)
-
-### 3. Start Training
-
-```
-Start the training run
-```
-
-The agent calls `rl_start_training()` which:
-
-1. Generates a YAML config file merging locked settings with configurable overrides
-2. Creates a unique run ID
-3. Spawns three processes:
-   - **Atropos API server** (`run-api`) — trajectory coordination
-   - **Tinker trainer** (`launch_training.py`) — LoRA training + FastAPI inference server on port 8001
-   - **Environment** (`environment.py serve`) — the selected environment connecting to Atropos
-
-The processes start with staggered delays (5s for API, 30s for trainer, 90s more for environment) to ensure proper initialization order.
-
-### 4. Monitor Progress
-
-```
-Check the status of training run abc12345
-```
-
-The agent calls `rl_check_status(run_id)` which reports:
-
- Process status (running/exited for each of the 3 processes)
- Running time
- WandB metrics (step, reward mean, percent correct, eval accuracy)
- Log file locations for debugging
-
-:::note Rate Limiting
-Status checks are rate-limited to once every **30 minutes** per run ID. This prevents excessive polling during long-running training jobs that take hours.
-:::
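A minimal sketch of the per-run rate limit described in the note above, assuming a simple in-memory timestamp map. The class name `StatusRateLimiter`, the `allow()` method, and the injectable `clock` are hypothetical and are not taken from the removed `tools/rl_training_tool.py`.

```python
import time


class StatusRateLimiter:
    """Allow at most one status check per run ID per interval (hypothetical helper)."""

    def __init__(self, min_interval_s: float = 30 * 60, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._last_check: dict[str, float] = {}

    def allow(self, run_id: str) -> bool:
        """Return True and record the check, or False if this run ID was checked too recently."""
        now = self._clock()
        last = self._last_check.get(run_id)
        if last is not None and now - last < self.min_interval_s:
            return False
        self._last_check[run_id] = now
        return True
```

Keying the map by run ID means a rejected check on one run does not block checks on another.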
-
-### 5. Stop or Get Results
-
-```
-Stop the training run
-# or
-Get the final results for run abc12345
-```
-
-`rl_stop_training()` terminates all three processes in reverse order (environment → trainer → API). `rl_get_results()` retrieves final WandB metrics and training history.
-
-## Inference Testing
-
-Before committing to a full training run, you can test if an environment works correctly using `rl_test_inference`. This runs a few steps of inference and scoring using OpenRouter — no Tinker API needed, just an `OPENROUTER_API_KEY`.
-
-```
-Test the selected environment with inference
-```
-
-Default configuration:
- **3 steps × 16 completions = 48 rollouts per model**
- Tests 3 models at different scales for robustness:
-  - `qwen/qwen3-8b` (small)
-  - `z-ai/glm-4.7-flash` (medium)
-  - `minimax/minimax-m2.7` (large)
- Total: ~144 rollouts
-
-This validates:
- Environment loads correctly
- Prompt construction works
- Inference response parsing is robust across model scales
- Verifier/scoring logic produces valid rewards
-
-## Tinker API Integration
-
-The trainer uses the [Tinker](https://tinker.computer) API for model training operations:
-
- **ServiceClient** — Creates training and sampling clients
- **Training client** — Handles forward-backward passes with importance sampling loss, optimizer steps (Adam), and weight checkpointing
- **Sampling client** — Provides inference using the latest trained weights
-
-The training loop:
-1. Fetches a batch of rollouts from Atropos (prompt + completions + scores)
-2. Converts to Tinker Datum objects with padded logprobs and advantages
-3. Runs forward-backward pass with importance sampling loss
-4. Takes an optimizer step (Adam: lr=4e-5, β1=0.9, β2=0.95)
-5. Saves weights and creates a new sampling client for next-step inference
-6. Logs metrics to WandB
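The training loop above consumes per-completion advantages. The page does not say how they are computed, so the following is only a sketch of the usual group-relative formulation associated with GRPO: each completion's reward is centered on its group's mean and scaled by the group's standard deviation. The function name and the `eps` term are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one rollout group (all completions of the same prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps keeps the division finite when every completion got the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A group whose completions all receive the same reward yields all-zero advantages, so it contributes no gradient signal under this formulation.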
-
-## Architecture Diagram
-
-```mermaid
-flowchart LR
-    api["Atropos API<br/>run-api<br/>port 8000"]
-    env["Environment<br/>BaseEnv implementation"]
-    infer["OpenAI / sglang<br/>inference API<br/>port 8001"]
-    trainer["Tinker Trainer<br/>LoRA training + FastAPI"]
-
-    env <--> api
-    env --> infer
-    api -->|"batches: tokens, scores, logprobs"| trainer
-    trainer -->|"serves inference"| infer
-```
-
-## Creating Custom Environments
-
-To create a new RL environment:
-
-1. Create a Python file in `tinker-atropos/tinker_atropos/environments/`
-2. Define a class that inherits from `BaseEnv`
-3. Implement the required methods:
-   - `load_dataset()` — Load your training data
-   - `get_next_item()` — Provide the next item to the model
-   - `score_answer()` — Score model outputs and assign rewards
-   - `collect_trajectories()` — Collect and return trajectories
-4. Optionally define a custom config class inheriting from `BaseEnvConfig`
-
-Study the existing `gsm8k_tinker.py` as a template. The agent can help you create new environments — it can read existing environment files, inspect HuggingFace datasets, and write new environment code.
-
-## WandB Metrics
-
-Training runs log to Weights & Biases with these key metrics:
-
-| Metric | Description |
-|--------|-------------|
-| `train/loss` | Training loss (importance sampling) |
-| `train/learning_rate` | Current learning rate |
-| `reward/mean` | Mean reward across groups |
-| `logprobs/mean` | Mean reference logprobs |
-| `logprobs/mean_training` | Mean training logprobs |
-| `logprobs/diff` | Logprob drift (reference - training) |
-| `advantages/mean` | Mean advantage values |
-| `advantages/std` | Advantage standard deviation |
-
-## Log Files
-
-Each training run generates log files in `~/.hermes/logs/rl_training/`:
-
-```
-logs/
-├── api_{run_id}.log        # Atropos API server logs
-├── trainer_{run_id}.log    # Tinker trainer logs
-├── env_{run_id}.log        # Environment process logs
-└── inference_tests/        # Inference test results
-    ├── test_{env}_{model}.jsonl
-    └── test_{env}_{model}.log
-```
-
-These are invaluable for debugging when training fails or produces unexpected results.
--- a/website/docs/user-guide/skills/optional/mlops/mlops-hermes-atropos-environments.md
+++ b/website/docs/user-guide/skills/optional/mlops/mlops-hermes-atropos-environments.md
@@ -1,323 +0,0 @@

---
-title: "Hermes Atropos Environments — Build, test, and debug Hermes Agent RL environments for Atropos training"
-sidebar_label: "Hermes Atropos Environments"
-description: "Build, test, and debug Hermes Agent RL environments for Atropos training"
---
-
-{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}
-
-# Hermes Atropos Environments
-
-Build, test, and debug Hermes Agent RL environments for Atropos training. Covers the HermesAgentBaseEnv interface, reward functions, agent loop integration, evaluation with tools, wandb logging, and the three CLI modes (serve/process/evaluate). Use when creating, reviewing, or fixing RL environments in the hermes-agent repo.
-
-## Skill metadata
-
-| | |
-|---|---|
-| Source | Optional — install with `hermes skills install official/mlops/hermes-atropos-environments` |
-| Path | `optional-skills/mlops/hermes-atropos-environments` |
-| Version | `1.1.0` |
-| Author | Hermes Agent |
-| License | MIT |
-| Platforms | linux, macos, windows |
-| Tags | `atropos`, `rl`, `environments`, `training`, `reinforcement-learning`, `reward-functions` |
-| Related skills | [`axolotl`](/docs/user-guide/skills/optional/mlops/mlops-training-axolotl), [`fine-tuning-with-trl`](/docs/user-guide/skills/optional/mlops/mlops-training-trl-fine-tuning), `lm-evaluation-harness` |
-
-## Reference: full SKILL.md
-
-:::info
-The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
-:::
-
-# Hermes Agent Atropos Environments
-
-Guide for building RL environments in the hermes-agent repo that integrate with the Atropos training framework.
-
-## Architecture Overview
-
-<!-- ascii-guard-ignore -->
-```
-Atropos BaseEnv (atroposlib/envs/base.py)
-    └── HermesAgentBaseEnv (environments/hermes_base_env.py)
-            ├── Handles agent loop orchestration
-            ├── Handles tool resolution per group
-            ├── Handles ToolContext for reward verification
-            └── YOUR ENVIRONMENT (environments/your_env.py)
-                    Only implements: setup, get_next_item, format_prompt,
-                                    compute_reward, evaluate, wandb_log
-```
-<!-- ascii-guard-ignore-end -->
-
-Hermes environments are special because they run a **multi-turn agent loop with tool calling** — not just single-turn completions. The base env handles the loop; you implement the task and scoring.
-
-## File Locations
-
-| File | Purpose |
-|------|---------|
-| `environments/hermes_base_env.py` | Base class with agent loop + tool resolution |
-| `environments/agent_loop.py` | `HermesAgentLoop` + `AgentResult` dataclass |
-| `environments/tool_context.py` | `ToolContext` for reward verification |
-| `environments/tool_call_parsers.py` | Phase 2 tool call parsers (hermes, mistral, etc.) |
-| `environments/your_env.py` | Your environment implementation |
-
-## Inference Setup — Ask the User First
-
-**IMPORTANT:** Before running any test, evaluation, or data generation command, always ask the user how they want to handle inference. Do NOT assume OpenRouter or any specific endpoint. Present these options:
-
-1. **OpenRouter** — Ask which model they want to use (e.g., `anthropic/claude-sonnet-4.5`, `google/gemini-2.5-pro`, `meta-llama/llama-3.3-70b-instruct`, etc.). Requires `OPENROUTER_API_KEY` in environment.
-2. **Self-hosted VLLM endpoint** — Ask for their base URL (e.g., `http://localhost:8000/v1`) and model name. Set `--openai.server_type vllm`.
-3. **Other OpenAI-compatible API** — Ask for the base URL, model name, and any required API key. Set `--openai.server_type openai` and `--openai.health_check false`.
-4. **Local Atropos training server** — For `serve` mode with a live training loop. Default `http://localhost:8000/v1`.
-
-Once the user tells you their setup, use those values in all CLI commands for that session. Example prompts:
-
-> "Before I run this, how would you like to handle inference?
-> 1. OpenRouter (I'll need your preferred model, e.g. claude-sonnet-4.5)
-> 2. A self-hosted VLLM endpoint (give me the URL and model name)
-> 3. Another OpenAI-compatible API (give me the URL, model, and any auth details)
-> 4. Local Atropos training server (serve mode)"
-
-### Key flags by provider:
-
-| Provider | `--openai.server_type` | `--openai.health_check` | `--openai.api_key` |
-|----------|----------------------|------------------------|-------------------|
-| OpenRouter | `openai` | `false` | `$OPENROUTER_API_KEY` |
-| VLLM (self-hosted) | `vllm` | (default) | (not needed) |
-| Other OpenAI-compatible | `openai` | `false` | As needed |
-| Local Atropos | (default) | (default) | (not needed) |
-
-## Required Methods
-
-### 1. `setup()` — Load dataset and initialize state
-
-```python
-async def setup(self) -> None:
-    """Called once at startup. Load datasets, initialize state."""
-    # Try HuggingFace first, fallback to built-in samples
-    try:
-        from datasets import load_dataset
-        ds = load_dataset("your/dataset", split="test")
-        self._items = [...]
-    except Exception:
-        self._items = BUILTIN_SAMPLES
-
-    # Always split into train/eval
-    random.shuffle(self._items)
-    eval_size = max(20, int(len(self._items) * 0.1))
-    self._eval_items = self._items[:eval_size]
-    self._items = self._items[eval_size:]
-```
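One edge case in the split above: with 20 or fewer items, the `max(20, ...)` floor assigns every item to evaluation and leaves the training list empty, and the modulo in `get_next_item()` would then divide by zero. The sketch below shows one way to guard against that. The function name, the `seed` parameter, and the cap are assumptions added for illustration.

```python
import random


def split_train_eval(items, eval_fraction=0.1, min_eval=20, seed=0):
    """Shuffle and split items into (train, eval), always keeping at least one training item."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, so the global random state is untouched
    eval_size = max(min_eval, int(len(shuffled) * eval_fraction))
    # Added safeguard (not in the snippet above): never let eval swallow the whole dataset.
    eval_size = min(eval_size, max(len(shuffled) - 1, 0))
    return shuffled[eval_size:], shuffled[:eval_size]
```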
-
-### 2. `get_next_item()` — Return next training item
-
-```python
-async def get_next_item(self) -> dict:
-    """Return next item, cycling through dataset."""
-    item = self._items[self._index % len(self._items)]
-    self._index += 1
-    return item
-```
-
-### 3. `format_prompt(item)` — Convert item to user message
-
-```python
-def format_prompt(self, item: dict) -> str:
-    """Convert a dataset item into the user-facing prompt."""
-    return f"Research this question: {item['question']}"
-```
-
-### 4. `compute_reward(item, result, ctx)` — Score the rollout
-
-**CRITICAL**: `result` is an `AgentResult`, NOT a dict. It has these attributes:
- `result.messages` — List of message dicts (OpenAI format)
- `result.turns_used` — Number of LLM calls made
- `result.finished_naturally` — True if model stopped voluntarily
- `result.tool_errors` — List of ToolError objects
-
-**AgentResult does NOT have**: `final_response`, `tool_calls`, `tools_used`.
-You must extract these from `result.messages`:
-
-```python
-async def compute_reward(self, item, result: AgentResult, ctx: ToolContext) -> float:
-    # Extract final response (last assistant message with content)
-    final_response = ""
-    tools_used = []
-    for msg in reversed(result.messages):
-        if msg.get("role") == "assistant" and msg.get("content") and not final_response:
-            final_response = msg["content"]
-        if msg.get("role") == "assistant" and msg.get("tool_calls"):
-            for tc in msg["tool_calls"]:
-                fn = tc.get("function", {}) if isinstance(tc, dict) else {}
-                name = fn.get("name", "")
-                if name:
-                    tools_used.append(name)
-
-    # Score using LLM judge, heuristic, or ToolContext verification
-    correctness = await self._llm_judge(item, final_response)
-    return correctness
-```
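The extraction logic above can be pulled into a standalone helper that is testable without a live rollout. This sketch iterates forward rather than in reverse, so tool names come out in call order (the reversed loop above collects them newest-first). The helper name `summarize_messages` is hypothetical.

```python
def summarize_messages(messages: list[dict]) -> tuple[str, list[str]]:
    """Return (last assistant content, tool names in call order) from OpenAI-format messages."""
    final_response = ""
    tools_used: list[str] = []
    for msg in messages:
        if msg.get("role") != "assistant":
            continue
        if msg.get("content"):
            final_response = msg["content"]  # later assistant messages overwrite earlier ones
        for tc in msg.get("tool_calls") or []:
            fn = tc.get("function", {}) if isinstance(tc, dict) else {}
            name = fn.get("name", "")
            if name:
                tools_used.append(name)
    return final_response, tools_used
```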
-
-`ctx` (ToolContext) gives you terminal/file access to the agent's sandbox for verification:
-```python
-# Run tests in the agent's sandbox
-result = ctx.terminal("pytest /workspace/test.py")
-return 1.0 if result["exit_code"] == 0 else 0.0
-```
-
-### 5. `evaluate()` — Periodic evaluation with full agent loop
-
-**MUST use the full agent loop with tools**, not single-turn chat_completion.
-The whole point of hermes-agent environments is agentic evaluation:
-
-```python
-async def evaluate(self, *args, **kwargs) -> None:
-    import time, uuid
-    from environments.agent_loop import HermesAgentLoop
-    from environments.tool_context import ToolContext
-
-    start_time = time.time()
-    tools, valid_names = self._resolve_tools_for_group()
-    samples = []
-
-    for item in self._eval_items[:self.config.eval_size]:
-        task_id = str(uuid.uuid4())
-        messages = []
-        if self.config.system_prompt:
-            messages.append({"role": "system", "content": self.config.system_prompt})
-        messages.append({"role": "user", "content": self.format_prompt(item)})
-
-        agent = HermesAgentLoop(
-            server=self.server,
-            tool_schemas=tools,
-            valid_tool_names=valid_names,
-            max_turns=self.config.max_agent_turns,
-            task_id=task_id,
-            temperature=0.0,  # Deterministic for eval
-            max_tokens=self.config.max_token_length,
-            extra_body=self.config.extra_body,
-        )
-        result = await agent.run(messages)
-
-        ctx = ToolContext(task_id)
-        try:
-            reward = await self.compute_reward(item, result, ctx)
-        finally:
-            ctx.cleanup()
-
-        samples.append({"prompt": ..., "response": ..., "reward": reward})
-
-    eval_metrics = {"eval/mean_reward": ...}
-    await self.evaluate_log(metrics=eval_metrics, samples=samples,
-                            start_time=start_time, end_time=time.time())
-```
-
-### 6. `wandb_log()` — Custom metrics logging
-
-Always call `super().wandb_log()` at the end:
-
-```python
-async def wandb_log(self, wandb_metrics=None):
-    if wandb_metrics is None:
-        wandb_metrics = {}
-    if self._reward_buffer:
-        n = len(self._reward_buffer)
-        wandb_metrics["train/mean_reward"] = sum(self._reward_buffer) / n
-        self._reward_buffer.clear()
-    await super().wandb_log(wandb_metrics)  # MUST call super
-```
-
-**Pitfall**: `compute_reward` appends to metric buffers. During eval, this pollutes training metrics. Roll back buffer entries added during eval.
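A minimal sketch of the rollback described in the pitfall above, assuming the metric buffers are plain lists that only grow while evaluation runs. The context manager name `rollback_appends` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def rollback_appends(buffer: list):
    """Discard anything appended to `buffer` inside the with-block."""
    start = len(buffer)
    try:
        yield buffer
    finally:
        del buffer[start:]  # drop entries added during the block, even if it raised
```

Under those assumptions, wrapping the `compute_reward` call inside `evaluate()` in `with rollback_appends(self._reward_buffer):` would keep eval scores out of `train/mean_reward`.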
-
-## Config Class
-
-Always create a custom config subclass with Pydantic Field descriptors. Key inherited fields you can tune: `enabled_toolsets`, `max_agent_turns`, `agent_temperature`, `system_prompt`, `terminal_backend`, `group_size`, `steps_per_eval`, `total_steps`.
-
-## config_init() — Default Configuration
-
-Classmethod returning `(YourEnvConfig, [APIServerConfig(...)])`. Set server_type to "openai" for OpenRouter/external APIs. Load API key from environment variable.
-
-## Three CLI Modes
-
-```bash
-# SERVE — Full training loop (connects to Atropos API server)
-python environments/my_env.py serve --openai.base_url http://localhost:8000/v1
-
-# PROCESS — Offline data generation (saves JSONL)
-python environments/my_env.py process --env.total_steps 10 --env.group_size 1 \
-    --env.use_wandb false --env.data_path_to_save_groups output.jsonl \
-    --openai.base_url "<USER_BASE_URL>" \
-    --openai.model_name "<USER_MODEL>" \
-    --openai.server_type <USER_SERVER_TYPE> --openai.health_check false
-
-# EVALUATE — Standalone eval (runs setup + evaluate only)
-python environments/my_env.py evaluate --env.eval_size 20 \
-    --env.data_dir_to_save_evals /tmp/eval_results \
-    --openai.base_url "<USER_BASE_URL>" \
-    --openai.model_name "<USER_MODEL>" \
-    --openai.server_type <USER_SERVER_TYPE> --openai.health_check false
-```
-
-Config priority: CLI args > YAML file > config_init() defaults.
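The precedence rule above amounts to layering three mappings, with later layers winning. The sketch below illustrates the ordering on flat dictionaries only. It is not how Atropos resolves configuration internally, and the function name is hypothetical.

```python
def resolve_config(defaults: dict, yaml_values: dict, cli_values: dict) -> dict:
    """Merge config layers so that CLI args > YAML file > config_init() defaults."""
    merged = dict(defaults)     # lowest priority
    merged.update(yaml_values)  # YAML overrides defaults
    merged.update(cli_values)   # CLI overrides everything
    return merged
```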
-
-## Common Pitfalls
-
-1. **AgentResult has .messages, not .final_response** — Extract the final response by iterating reversed(result.messages) looking for the last assistant message with content.
-
-2. **evaluate() must use HermesAgentLoop, not chat_completion** — Single-turn chat_completion has no tools. The whole point of hermes-agent benchmarks is agentic evaluation with tool use.
-
-3. **Don't call _llm_judge twice** — If compute_reward already calls it, extract the score from the buffer instead of calling judge separately in evaluate().
-
-4. **Eval pollutes training buffers** — compute_reward appends to metric buffers. During eval, roll back buffer entries to keep training metrics clean.
-
-5. **Always set health_check=false for OpenRouter** — OpenRouter has no /health endpoint.
-
-6. **Set data_dir_to_save_evals in evaluate mode** — Without it, results aren't saved.
-
-7. **default_toolsets class variable vs enabled_toolsets config** — The class variable is a hint; the config field is what actually controls tool resolution.
-
-8. **Tool call parsing in messages** — Tool calls are dicts with `{"function": {"name": ..., "arguments": ...}}`. Always check `isinstance(tc, dict)`.
-
-9. **ToolContext.cleanup()** — Always call in a finally block to release sandbox resources.
-
-10. **server_type must be "openai" for external APIs** — Without it, Atropos assumes a local VLLM server.
-
-11. **Always ask the user for their inference setup** — Never hardcode or assume a specific provider/model. See the "Inference Setup" section above.
-
-## Reward Function Patterns
-
-### LLM Judge (for open-ended tasks)
-Use `self.server.chat_completion()` with a scoring prompt. Parse JSON response for score float. Always include a heuristic fallback (keyword overlap) for when the judge call fails.
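A sketch of the parse-then-fall-back step described above, leaving out the judge call itself. The JSON key `"score"`, the function names, and the exact overlap metric are assumptions. The text only specifies a JSON score with a keyword-overlap fallback.

```python
import json
import re


def keyword_overlap(expected: str, response: str) -> float:
    """Fraction of the expected answer's words that also appear in the response."""
    expected_words = set(re.findall(r"\w+", expected.lower()))
    if not expected_words:
        return 0.0
    response_words = set(re.findall(r"\w+", response.lower()))
    return len(expected_words & response_words) / len(expected_words)


def parse_judge_score(raw: str, expected: str, response: str) -> float:
    """Parse a judge reply like '{"score": 0.8}', falling back to keyword overlap."""
    try:
        score = float(json.loads(raw)["score"])
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing key, or a non-numeric score: use the heuristic instead.
        score = keyword_overlap(expected, response)
    return min(1.0, max(0.0, score))  # clamp to [0, 1]
```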
-
-### Binary Verification (for code/terminal tasks)
-Use `ctx.terminal("pytest test.py -q")` to run tests in the agent's sandbox. Return 1.0 for pass, 0.0 for fail.
-
-### Multi-Signal (combine multiple indicators)
-Weight correctness (0.6) + tool usage (0.2) + efficiency (0.2) + optional bonuses. Clamp to [0, 1].
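The weighting above, written out as a sketch. The function and parameter names are hypothetical, and each signal is assumed to already lie in [0, 1].

```python
def multi_signal_reward(correctness: float, tool_usage: float,
                        efficiency: float, bonus: float = 0.0) -> float:
    """Combine signals as 0.6 / 0.2 / 0.2 plus an optional bonus, clamped to [0, 1]."""
    raw = 0.6 * correctness + 0.2 * tool_usage + 0.2 * efficiency + bonus
    return min(1.0, max(0.0, raw))
```

The clamp matters mainly once bonuses are added, since the three weighted terms alone cannot exceed 1.0.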
-
-## Testing Your Environment
-
-1. **Import test**: `python -c "from environments.my_env import MyEnv; print('OK')"`
-2. **Ask the user for inference setup** (see "Inference Setup" section above)
-3. **Process mode** (1 item): Verify JSONL output has valid tokens, masks, scores
-4. **Evaluate mode**: Verify full agent loop runs with tools, metrics logged correctly
-5. **Check reward range**: Scores should be in [0, 1], not all identical
-
-## Minimum Implementation Checklist
-
-```python
-class MyEnv(HermesAgentBaseEnv):
-    name = "my-env"
-    env_config_cls = MyEnvConfig
-
-    @classmethod
-    def config_init(cls): ...          # Default server + env config
-    async def setup(self): ...         # Load dataset + train/eval split
-    async def get_next_item(self): ... # Cycle through training items
-    def format_prompt(self, item): ... # Item → user message string
-    async def compute_reward(self, item, result, ctx): ...  # Score rollout
-    async def evaluate(self, *args, **kwargs): ...  # Full agent loop eval
-    async def wandb_log(self, metrics=None): ...    # Custom metrics + super()
-
-if __name__ == "__main__":
-    MyEnv.cli()
-```