Add GSM8k agent env using proper HermesAgentBaseEnv (not ICL)

- environments/gsm8k_agent_env.py: Math reasoning with Python REPL tool
  - Subclasses HermesAgentBaseEnv (proper tools= parameter, not ICL)
  - Uses ATROPOS_SERVER_* env vars from .env
  - Hermes tool call parser, configurable per model
  - Math verification via math_verify with string fallback
  - Tested: process mode works, both trajectories scored 1.0

- Updated memory bank with consolidation plan:
  - environments/ is the canonical env system (proper tool calling)
  - atropos/backends/ kept as sandbox infrastructure
  - atropos/agent/ and atropos/envs/agent_env.py marked for removal
Shannon Sands 2026-02-10 01:45:07 +00:00
parent 9dc27880cd
commit 975c849308
4 changed files with 555 additions and 155 deletions


@ -1,61 +1,99 @@
# Active Context
## Current Focus
Tinker RL training integration - pipeline fully wired up, waiting on Tinker billing to test.
Consolidating the two Atropos environment systems and switching tool calling from ICL (in-context learning) prompts to the proper OpenAI-spec `tools=` approach.
## Recently Completed (Feb 9, 2026)
## PR Feedback from Lead Dev (Feb 10, 2026)
### Tinker RL Training Integration
Created a complete agent training pipeline using Tinker (Thinking Machines) + Atropos:
The PR was rejected because our approach has three fundamental issues:
**New Files Created:**
1. `tinker-atropos/tinker_atropos/environments/gsm8k_agent.py` - Agent GSM8k environment with:
- Python REPL tool calling (Hermes-style `<tool_call>` format)
- Multi-step agent loop within `collect_trajectories()`
- Math answer verification via `math_verify`
- Subprocess-based Python execution
- WandB metrics (percent_correct, tool_use_rate)
2. `tinker-atropos/configs/gsm8k_agent.yaml` - Config for Qwen3-4B-Instruct training
### Issue 1: ManagedServer doesn't pass `tools={}` to `apply_chat_template()`
- When using Phase 2 (vLLM/SGLang for RL training), `ManagedServer` needs to pass tools to `tokenizer.apply_chat_template(tools=...)`
- This makes the system prompt include tool definitions in the format the models were trained on
- **Fix**: Atropos PR #366 adds `tool_call_parser` support to ManagedServer (branch: `tool_call_support`)
**Dependencies Updated:**
- `pyproject.toml` `[atropos]` extra now includes: tinker SDK, torch, wandb, math-verify
- Installed: tinker 0.12.0, tinker-atropos 0.1.0, torch (CPU)
### Issue 2: ICL prompt vs proper tool calling
- Our code embeds tools as XML in the system prompt (`<tools>...</tools>`)
- Proper approach: pass `tools=` parameter in `chat_completion()` calls and let the tokenizer's chat template handle formatting
- All Hermes datasets train on the proper format, not ICL
**README Updated:**
- Added comprehensive "RL Training with Tinker" section with architecture diagram, quick start, config docs
- Added TINKER_API_KEY and WANDB_API_KEY to optional keys table
### Issue 3: Only Hermes `<tool_call>` parser, no multi-model support
- Our code only handles Hermes-style `<tool_call>` XML parsing
- Proper approach: parser registry supporting 11+ model families (hermes, qwen, deepseek, llama, mistral, etc.)
**Verified Working:**
- Tinker SDK connection ✅
- All imports (tinker, tinker_atropos, trainer, environment) ✅
- Python REPL execution + tool call parsing ✅
- Math verification ✅
- Atropos run-api (port 8000) ✅
- Tinker trainer starts, loads config, creates inference server (port 8001) ✅
**Blocked:** Tinker billing (402 error) - user's payment didn't process (possibly regional card issue)
### Main Branch Merge (Feb 9, 2026)
Merged `origin/main` into `atropos-integrations` - 22,560 lines, 79 files, 5 conflicts resolved.
### Modal Backend (Feb 8, 2026)
Merged modal-integration branch, working with Modal Sandboxes.
### Singularity/Apptainer (Feb 6, 2026)
Completed and tested.
## Architecture: Training Pipeline
## Architecture: What Exists Now (Two Parallel Systems)
### `environments/` (Teknium's proper approach) ✅ CORRECT
```
Terminal 1: run-api (port 8000) - Atropos Rollout API
Terminal 2: launch_training.py (port 8001) - Tinker Trainer + FastAPI inference
Terminal 3: gsm8k_agent.py serve - Environment (generates trajectories)
environments/
├── agent_loop.py ← Uses tools= in chat_completion() (OpenAI spec)
├── hermes_base_env.py ← Phase 1 (OpenAI) + Phase 2 (ManagedServer + parser)
├── tool_context.py ← ToolContext for reward functions
├── tool_call_parsers/ ← 11 model parsers (hermes, qwen, deepseek, llama, etc.)
│ ├── __init__.py ← Registry with get_parser(), register_parser()
│ ├── hermes_parser.py
│ ├── qwen_parser.py
│ ├── deepseek_v3_parser.py
│ ├── llama_parser.py
│ ├── mistral_parser.py
│ └── ... (11 total)
├── terminal_test_env.py ← Working example: file creation tasks
├── hermes_swe_env.py ← SWE environment
└── patches.py ← Async-safe monkey patches
```
The agent env gets math problems → model calls Python REPL tool → scores answer → sends to Atropos → Tinker does LoRA training → updates sampling weights → repeat.
**How it works correctly:**
1. `HermesAgentLoop.run()` passes `tools=self.tool_schemas` to `chat_completion()`
2. ManagedServer passes tools to `tokenizer.apply_chat_template(tools=...)`
3. Parser registry reconstructs `tool_calls` from raw model output
4. Tool execution uses hermes-agent's `handle_function_call()` from `model_tools.py`
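Step 3 above can be sketched as a small registry plus one Hermes-style parser. This is a minimal illustration only: `register_parser()` and `get_parser()` are named in the tree above, but their signatures here, the `parse_hermes` function, the `python` tool name, and the `call_{index}` id format are all assumptions, not the actual code in `environments/tool_call_parsers/`.

```python
import json
import re
from typing import Callable, Dict, List

# Hypothetical registry; the real one lives in environments/tool_call_parsers/.
ToolCallParser = Callable[[str], List[dict]]
_PARSERS: Dict[str, ToolCallParser] = {}


def register_parser(name: str) -> Callable[[ToolCallParser], ToolCallParser]:
    """Decorator that stores a parser under a model-family name."""
    def decorator(fn: ToolCallParser) -> ToolCallParser:
        _PARSERS[name] = fn
        return fn
    return decorator


def get_parser(name: str) -> ToolCallParser:
    """Look up a parser, failing loudly for unknown model families."""
    try:
        return _PARSERS[name]
    except KeyError:
        raise ValueError(f"no tool call parser registered for {name!r}") from None


_TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


@register_parser("hermes")
def parse_hermes(raw_output: str) -> List[dict]:
    """Rebuild OpenAI-style tool_calls from Hermes <tool_call> JSON blocks."""
    calls = []
    for index, match in enumerate(_TOOL_CALL_RE.finditer(raw_output)):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the rollout
        if not isinstance(payload, dict) or "name" not in payload:
            continue  # a tool call without a name cannot be dispatched
        calls.append({
            "id": f"call_{index}",
            "type": "function",
            "function": {
                "name": payload["name"],
                # The OpenAI spec carries arguments as a JSON string.
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    return calls


raw = (
    "Let me compute.\n"
    '<tool_call>\n{"name": "python", "arguments": {"code": "print(6*7)"}}\n</tool_call>'
)
tool_calls = get_parser("hermes")(raw)
```

Keeping each model family behind a name lookup means an environment config only needs a parser name to switch formats.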
## Next Steps
- [ ] Resolve Tinker billing to test full training loop
- [ ] Run GSM8k agent training for ~20 steps (proof of concept)
- [ ] Monitor WandB for reward improvement
- [ ] Graduate to more complex agent envs (SWE tasks with Modal backend)
### `atropos/` (Our sandbox-optimized code) - PARTIALLY REDUNDANT
```
atropos/
├── agent/atropos_agent.py ← ICL-based agent (REDUNDANT with agent_loop.py)
├── envs/agent_env.py ← Environment with sandbox backends (PARTIALLY REDUNDANT)
├── envs/swe_smith_oracle_env.py ← SWE env using sandbox (KEEP - port to new base)
├── backends/ ← Sandbox backends (KEEP - valuable infrastructure)
│ ├── modal_backend.py ← Modal sandbox pool
│ ├── nomad_backend.py ← Nomad/Docker/Singularity
│ └── base.py ← ToolBackend protocol
├── slots/ ← Slot multiplexing (KEEP)
├── nomad/ ← Nomad client (KEEP)
├── tools/ ← Sandbox tool registry (PARTIALLY REDUNDANT)
└── sandbox_server.py ← HTTP server in containers (KEEP)
```
## Plan: Consolidate into `environments/`
### What to KEEP from `atropos/`:
- `backends/` - Modal, Nomad, Singularity backends (valuable infrastructure for scale)
- `slots/` - Slot multiplexing
- `nomad/` - Nomad client
- `sandbox_server.py` - Container HTTP server
- `Dockerfile` - Sandbox container image
### What to REMOVE/REPLACE:
- `atropos/agent/atropos_agent.py` → replaced by `environments/agent_loop.py`
- `atropos/envs/agent_env.py` → functionality merged into `environments/hermes_base_env.py`
- `atropos/tools/` → replaced by `model_tools.py` + `tools/` (hermes-agent's standard tools)
### What to CREATE:
- `environments/gsm8k_agent_env.py` → GSM8k with tool calling, subclasses `HermesAgentBaseEnv`
- Update `environments/hermes_base_env.py` to optionally use sandbox backends (Nomad/Modal) for terminal isolation when needed for scale
### Steps:
1. Install atropos `tool_call_support` branch (PR #366)
2. Create `environments/gsm8k_agent_env.py` using `HermesAgentBaseEnv`
3. Port `swe_smith_oracle_env.py` to use `HermesAgentBaseEnv`
4. Make sandbox backends accessible from `HermesAgentBaseEnv` (terminal_backend config)
5. Remove redundant `atropos/agent/` and `atropos/envs/agent_env.py`
6. Clean up `atropos/tools/` (keep only sandbox-specific tools)
7. Update tinker-atropos gsm8k env to use proper base class
8. Test everything end-to-end
## Previous Completed Work
- Modal backend integration (Feb 8) - KEEP backends, update integration point
- Main branch merge (Feb 9) - completed
- Singularity/Apptainer (Feb 6) - KEEP
- Memory Bank initialized (Feb 5)


@ -1,96 +1,85 @@
# Progress
## Current Sprint: Consolidate Environment Systems (Feb 10, 2026)
PR feedback from lead dev identified three fundamental issues with our approach:
1. Tool calling uses ICL (in-context learning) instead of proper `tools=` parameter
2. ManagedServer doesn't pass tools to `apply_chat_template()`
3. Only Hermes parser, no multi-model support
Teknium already built the correct approach in the `environments/` directory. Our task is to consolidate our code into it.
### Status
- [ ] Install atropos `tool_call_support` branch (PR #366)
- [ ] Create `environments/gsm8k_agent_env.py` using `HermesAgentBaseEnv`
- [ ] Port SWE env to `HermesAgentBaseEnv`
- [ ] Make sandbox backends accessible from `HermesAgentBaseEnv`
- [ ] Remove redundant `atropos/agent/` and `atropos/envs/agent_env.py`
- [ ] Clean up redundant `atropos/tools/`
- [ ] Test end-to-end with Tinker
## Completed Features
### ✅ Modal Backend Integration (Feb 8, 2026 - MERGED & TESTED)
Merged the `modal-integration` branch and fixed integration issues.
### ✅ Modal Backend Integration (Feb 8, 2026)
- `ModalToolBackend` with slot-based multiplexing
- Multi-profile support (CPU, GPU, high-memory)
- Auto-scaling sandbox pool via Modal Sandboxes
- **Status: KEEP backends, but change integration point from atropos/envs/ to environments/**
**What Works:**
- `ModalToolBackend` implements full `ToolBackend` interface (start, stop, acquire, release, execute_batch)
- Modal Sandboxes used for long-lived containers (not Functions)
- `sandbox.exec()` for direct command execution (no HTTP server needed)
- Slot-based multiplexing matching Nomad pattern
- Multi-profile support (`ModalSandboxConfig`, `_ModalMultiProfileManager`)
- YAML profile loading (`modal_profiles.yaml`)
- `AgentEnvConfig` fields for all Modal settings (`--env.modal_*`)
- `create_tool_backend()` supports `tool_pool_mode="modal"`
- Terminal tool (`tools/terminal_tool.py`) native Modal integration with pool management
- Named sandbox recovery via `Sandbox.from_name()`
- Auto-scaling sandbox pool per profile
- Artifact helpers (read, list, archive)
### ✅ Main Branch Merge (Feb 9, 2026)
- Merged 22,560 lines, 79 files, 5 conflicts resolved
- New: hermes_cli/, file_operations, RL training tools, gateway, cron
**CLI Usage:**
```bash
# Atropos backend
python -m atropos.envs.swe_smith_oracle_env process \
    --env.tool_pool_mode modal \
    --env.modal_image python:3.11
### ✅ Tinker RL Training Setup (Feb 9, 2026)
- tinker 0.12.0 + tinker-atropos installed
- GSM8k agent env created (needs rewrite to use proper base class)
- Config for Qwen3-4B created
- Pipeline verified: Tinker API connection works, all imports pass
- **Blocked on billing** (Tinker 402 error - regional payment issue)
# Terminal tool
TERMINAL_ENV=modal ./hermes
```
### ✅ Singularity/Apptainer Sandbox (Feb 6, 2026)
- Nomad raw_exec driver for HPC clusters
- All sandbox operations tested and working
**Files Modified/Created:**
- `atropos/backends/modal_backend.py` - Full implementation (~1200 lines)
- `atropos/backends/__init__.py` - `create_tool_backend()` updated
- `atropos/envs/agent_env.py` - 15 Modal config fields added
- `tools/terminal_tool.py` - Native Modal sandbox pool
- `docs/MODAL_BACKEND.md` - Documentation
- `modal_profiles.yaml.example` - Example profiles
- `tests/test_modal_integration.py` - Integration tests
- `tests/test_modal_stress.py` - Stress tests
- `tests/test_modal_terminal.py` - Terminal tool tests
### ✅ Memory Bank (Feb 5, 2026)
- Project documentation structure initialized
### ✅ Singularity/Apptainer Sandbox Integration (Feb 6, 2026 - FULLY TESTED)
Adapted the Atropos sandbox environment from Docker to Singularity/Apptainer for HPC clusters.
## What to KEEP vs REMOVE
**What Works:**
- `create_sandbox_job()` supports both `driver="docker"` and `driver="singularity"`
- SlotPoolConfig and NomadBackendConfig propagate driver settings
- Singularity container runs sandbox_server.py via Nomad's raw_exec driver
- All sandbox operations work: bash execution, file read/write
- **CLI arguments** `--env.driver` and `--env.singularity_image` for AgentEnvConfig
- **Static port binding** for Singularity (ReservedPorts vs DynamicPorts)
### KEEP (valuable infrastructure):
| Component | Location | Purpose |
|-----------|----------|---------|
| Modal backend | `atropos/backends/modal_backend.py` | Cloud sandbox pool |
| Nomad backend | `atropos/backends/nomad_backend.py` | Docker/Singularity sandboxes |
| Slot pool | `atropos/slots/` | Container multiplexing |
| Nomad client | `atropos/nomad/` | Nomad API |
| Sandbox server | `atropos/sandbox_server.py` | HTTP server in containers |
| Dockerfile | `atropos/Dockerfile` | Container image |
| Agent loop | `environments/agent_loop.py` | Proper OpenAI-spec tool calling |
| Base env | `environments/hermes_base_env.py` | Phase 1/2 with parsers |
| Tool parsers | `environments/tool_call_parsers/` | 11+ model parsers |
### ✅ Memory Bank Initialized (Feb 5, 2026)
Set up project documentation structure for context persistence.
## In Progress
None currently.
### REMOVE (redundant with environments/):
| Component | Location | Replaced By |
|-----------|----------|-------------|
| ICL agent | `atropos/agent/atropos_agent.py` | `environments/agent_loop.py` |
| AgentEnv | `atropos/envs/agent_env.py` | `environments/hermes_base_env.py` |
| Tool registry | `atropos/tools/` | `model_tools.py` + `tools/` |
| GSM8k ICL env | `tinker-atropos/.../gsm8k_agent.py` | New proper version |
## Known Issues
- Modal backend not yet live-tested with actual Modal cloud credentials
- Tinker billing (402 error) - user's payment didn't process
- `bwrap_available: false` in Singularity containers
- Health check timing - may need longer wait for container startup on slower systems
## What's Left to Build
### Modal Backend
- [ ] Live test with Modal credentials on actual cloud
- [ ] Test multi-profile GPU workflows
- [ ] Test sandbox recovery after restart
- [ ] Integrate with SWE-smith-oracle env for GRPO training loop
- [ ] Performance benchmarking vs Nomad backend
### HPC Deployment
- [ ] Test on actual HPC cluster with Slurm/PBS integration
- [ ] Document cluster-specific deployment procedures
### Documentation
- [ ] Add Singularity deployment to README
- [ ] Create HPC deployment skill in skills/mlops/
- atropos `tool_call_support` branch not yet installed (PR #366)
## Evolution of Decisions
### Container Runtime Selection
- **Initial**: Docker-only via Nomad docker driver
- **Problem**: HPC clusters don't allow Docker without sudo
- **Solution**: Added Singularity/Apptainer support via raw_exec driver
- **Result**: Both runtimes now supported with same API
### Agent Architecture
- **v1 (our branch)**: ICL-based agent with `<tool_call>` XML tags in system prompt
- **v2 (Teknium's)**: Proper OpenAI-spec tool calling with `tools=` parameter
- **Decision**: Adopt v2, consolidate into `environments/`, keep sandbox backends from v1
### Modal Backend Architecture
- **Initial**: Stub placeholder raising RuntimeError
- **Investigation**: Modal Sandboxes vs Functions - chose Sandboxes for long-lived containers
- **Design**: Direct `sandbox.exec()` instead of HTTP/sandbox_server.py (simpler, no networking needed)
- **Implementation**: Merged from `modal-integration` branch, fixed agent_env.py config fields
- **Result**: Three backends now supported: Nomad/Docker, Nomad/Singularity, Modal
### Environment Organization
- **Before**: Two parallel systems (`atropos/envs/` and `environments/`)
- **After**: Single system in `environments/`, using `HermesAgentBaseEnv` as base class
- Sandbox backends remain in `atropos/backends/` but integrate via terminal backend config


@ -148,11 +148,50 @@ The agent validates responses before accepting:
4. `AIAgent` reads env vars when initializing terminal tool
5. Terminal tool creates appropriate backend based on `TERMINAL_ENV`
## Atropos Backend Architecture
## RL Training Architecture (Consolidated)
### Environment System (`environments/`)
The canonical way to build agentic RL environments in Hermes-Agent:
### Backend Hierarchy
```
ToolBackend (Protocol - base.py)
environments/
├── agent_loop.py ← HermesAgentLoop: OpenAI-spec tool calling
├── hermes_base_env.py ← HermesAgentBaseEnv: base class for all envs
├── tool_context.py ← ToolContext: reward function tool access
├── tool_call_parsers/ ← 11+ model parsers (hermes, qwen, deepseek, etc.)
├── terminal_test_env.py ← Example: file creation tasks
├── hermes_swe_env.py ← SWE environment
└── gsm8k_agent_env.py ← GSM8k with Python REPL (TODO)
```
### Two-Phase Operation
- **Phase 1 (OpenAI server)**: Native tool_calls from vLLM/SGLang/OpenRouter
- Good for: SFT data gen, testing, evaluation
- **Phase 2 (ManagedServer)**: Client-side tool call parser + logprob tracking
- Required for: RL training
- Parser registry selects per-model parser (hermes, qwen, llama, etc.)
### Key Design: Proper Tool Calling (NOT ICL)
```python
# CORRECT: pass tools= to chat_completion()
response = await server.chat_completion(
    messages=messages,
    tools=tool_schemas,  # ← tokenizer.apply_chat_template(tools=...) formats these
    temperature=1.0,
)
# Response has response.choices[0].message.tool_calls (structured objects)
# WRONG (old approach): embed tools in system prompt as XML
system_prompt = f"<tools>{json.dumps(tools)}</tools>" # ← ICL, not proper training format
```
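For reference, here is a rough sketch of what one entry in `tool_schemas` might look like and how a structured tool call could be turned into a `role: tool` reply. The schema follows the OpenAI function-calling format. The `python` tool name, the `tool_result_message` helper, and the stand-in handler are hypothetical. Tool calls are shown as plain dicts for simplicity, whereas a client library may return objects with attributes instead.

```python
import json

# Hypothetical schema for a Python REPL tool, in the OpenAI function-calling format.
python_tool_schema = {
    "type": "function",
    "function": {
        "name": "python",
        "description": "Run Python code and return stdout.",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Code to execute."}
            },
            "required": ["code"],
        },
    },
}
tool_schemas = [python_tool_schema]


def tool_result_message(tool_call: dict, handlers: dict) -> dict:
    """Run one structured tool call and build the `role: tool` reply message."""
    name = tool_call["function"]["name"]
    # Arguments arrive as a JSON string, not a dict.
    arguments = json.loads(tool_call["function"]["arguments"])
    if name not in handlers:
        content = f"Error: unknown tool {name!r}"
    else:
        content = str(handlers[name](**arguments))
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": content}


# Stand-in handler; a real environment would execute the code in a sandbox.
handlers = {"python": lambda code: f"(would run {len(code)} chars of code)"}
call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "python", "arguments": json.dumps({"code": "print(6 * 7)"})},
}
message = tool_result_message(call, handlers)
```

The reply message is appended to `messages` before the next `chat_completion()` call, so the model sees the tool output keyed by `tool_call_id`.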
### Sandbox Backends (`atropos/backends/`)
Infrastructure for scaled sandbox execution (separate from the env system):
```
ToolBackend (Protocol)
├── NomadToolBackend → SlotPool → NomadClient + SandboxExecutor (HTTP)
│ ├── Docker driver (default)
│ └── Singularity driver (HPC)
@ -160,32 +199,16 @@ ToolBackend (Protocol - base.py)
└── _ModalMultiProfileManager (multi-profile support)
```
### Slot-Based Multiplexing Pattern
All backends share the same slot multiplexing concept:
- **Sandbox/Container**: Long-lived compute unit
- **Slot**: Isolated workspace directory within a sandbox (e.g., `/data/slot_0`)
- **Trajectory**: One agent task using one slot
- Multiple trajectories share a sandbox via different slots
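The slot concept can be illustrated with a toy model. This is a sketch under assumptions: the `Sandbox` class and its `acquire`/`release` methods are hypothetical and only mirror the idea described above, not the real `atropos/slots/` implementation. The `/data/slot_N` path comes from the example above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Sandbox:
    """One long-lived container that exposes a fixed number of slots."""
    name: str
    num_slots: int
    # slot index -> id of the trajectory currently using it
    in_use: Dict[int, str] = field(default_factory=dict)

    def acquire(self, trajectory_id: str) -> Optional[str]:
        """Return a free slot's workspace path, or None if the sandbox is full."""
        for slot in range(self.num_slots):
            if slot not in self.in_use:
                self.in_use[slot] = trajectory_id
                return f"/data/slot_{slot}"
        return None

    def release(self, trajectory_id: str) -> None:
        """Free whichever slot the trajectory was holding."""
        for slot, owner in list(self.in_use.items()):
            if owner == trajectory_id:
                del self.in_use[slot]


sandbox = Sandbox(name="sandbox-a", num_slots=2)
first = sandbox.acquire("traj-1")
second = sandbox.acquire("traj-2")
third = sandbox.acquire("traj-3")   # no free slot left
sandbox.release("traj-1")
reused = sandbox.acquire("traj-3")  # takes over the slot traj-1 freed
```

Two trajectories share one container here, and a third has to wait until a slot is released.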
Accessed via `HermesAgentBaseEnv.terminal_backend` config option:
- `local` - Direct execution (default, development)
- `docker` - Docker containers
- `modal` - Modal cloud sandboxes (production RL)
- `singularity` - HPC clusters
- `ssh` - Remote server
### Nomad Backend (HTTP-based)
- Deploys `sandbox_server.py` inside containers (Docker or Singularity)
- Uses `SandboxExecutor` for HTTP communication (POST /execute, POST /batch)
- Nomad manages container lifecycle (scaling, health checks)
- Tools: bash, bash_stateful, read_file, write_file, tmux
### Modal Backend (exec-based)
- Creates `modal.Sandbox` instances (long-lived containers)
- Uses `sandbox.exec("bash", "-c", command)` directly (no HTTP server)
- Modal manages container lifecycle (idle_timeout, max_lifetime)
- Multi-profile support: different resource configs (CPU, GPU, memory)
- Named sandboxes for recovery: `Sandbox.from_name(app_name, sandbox_name)`
- YAML config via `modal_profiles.yaml`
### Backend Selection
```python
# In agent_env.py / create_tool_backend()
if mode == "nomad":
    return NomadToolBackend(NomadBackendConfig.from_agent_env_config(cfg))
if mode == "modal":
    return ModalToolBackend(ModalSandboxConfig.from_agent_env_config(cfg))
### Training Pipeline (Tinker + Atropos)
```
Terminal 1: run-api (port 8000) ← Atropos Rollout API
Terminal 2: launch_training.py (port 8001) ← Tinker Trainer + inference
Terminal 3: environment.py serve ← Environment (rollouts)
```