alt-glitch
77c5bc9da9
feat(budget): make tool result persistence thresholds configurable
...
Add BudgetConfig dataclass to centralize and make overridable the
hardcoded constants (50K per-result, 200K per-turn, 2K preview) that
control when tool outputs get persisted to sandbox. Configurable at
the RL environment level via HermesAgentEnvConfig fields, threaded
through HermesAgentLoop to the storage layer.
Resolution: pinned (read_file=inf) > env config overrides > registry
per-tool > default. CLI override: --env.turn_budget_chars 80000
2026-04-08 02:24:32 -07:00
teknium1
1a5f31d631
feat: add agentic on-policy distillation (OPD) environment
...
First Atropos environment to populate distill_token_ids / distill_logprobs
on ScoredDataGroup, enabling on-policy distillation training.
Based on OpenClaw-RL (Princeton, arXiv:2603.10165):
- Extracts hindsight hints from next-state signals (tool results, errors)
- Uses LLM judge with majority voting for hint extraction
- Scores student tokens under hint-enhanced distribution via get_logprobs
- Packages teacher's top-K predictions as distillation targets
Architecture:
- AgenticOPDEnv extends HermesAgentBaseEnv
- Overrides collect_trajectories to add OPD pipeline after standard rollouts
- Uses Atropos's built-in get_logprobs (VLLM prompt_logprobs) for teacher scoring
- No external servers needed — same VLLM backend handles both rollouts and scoring
Task: Coding problems with test verification (8 built-in tasks, HF dataset support)
Reward: correctness (0.7) + efficiency (0.15) + tool usage (0.15)
OPD: Per-turn hint extraction → enhanced prompt → teacher top-K logprobs
Configurable: opd_enabled, distill_topk, prm_votes, hint truncation length
Metrics: opd/mean_hints_per_rollout, opd/mean_turns_scored, opd/hint_rate
2026-03-13 02:45:08 -07:00