hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-04-25 00:51:20 +00:00

Commit graph

Author

SHA1

Message

Date

alt-glitch

77c5bc9da9

feat(budget): make tool result persistence thresholds configurable

Add BudgetConfig dataclass to centralize and make overridable the
hardcoded constants (50K per-result, 200K per-turn, 2K preview) that
control when tool outputs get persisted to sandbox. Configurable at
the RL environment level via HermesAgentEnvConfig fields, threaded
through HermesAgentLoop to the storage layer.

Resolution: pinned (read_file=inf) > env config overrides > registry
per-tool > default. CLI override: --env.turn_budget_chars 80000

2026-04-08 02:24:32 -07:00
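
The resolution order described in commit 77c5bc9da9 (pinned > env config overrides > registry per-tool > default) could be sketched roughly as follows. This is a minimal illustration, not the repository's code: only the name BudgetConfig and the 50K / 200K / 2K defaults come from the commit message. The field names, the PINNED table, and resolve_threshold are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class BudgetConfig:
    # Defaults quoted in the commit message; the field names are assumptions.
    default_result_chars: float = 50_000   # per-result threshold
    turn_budget_chars: float = 200_000     # per-turn threshold
    preview_chars: int = 2_000             # preview length
    # Hypothetical: per-tool overrides supplied by the environment config.
    tool_overrides: Dict[str, float] = field(default_factory=dict)


# Hypothetical pinned table: read_file is never persisted (threshold = inf).
PINNED: Dict[str, float] = {"read_file": math.inf}


def resolve_threshold(
    tool_name: str,
    config: BudgetConfig,
    registry_limits: Optional[Dict[str, float]] = None,
) -> float:
    """Return the per-result threshold for a tool, highest priority first."""
    if tool_name in PINNED:                      # 1. pinned values always win
        return PINNED[tool_name]
    if tool_name in config.tool_overrides:       # 2. env config overrides
        return config.tool_overrides[tool_name]
    if registry_limits and tool_name in registry_limits:  # 3. registry per-tool
        return registry_limits[tool_name]
    return config.default_result_chars           # 4. global default
```

With this sketch, an env-level override for a tool beats a registry value for the same tool, while read_file stays unlimited regardless of what the config says.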

teknium1

1a5f31d631

feat: add agentic on-policy distillation (OPD) environment

First Atropos environment to populate distill_token_ids / distill_logprobs
on ScoredDataGroup, enabling on-policy distillation training.

Based on OpenClaw-RL (Princeton, arXiv:2603.10165):
- Extracts hindsight hints from next-state signals (tool results, errors)
- Uses LLM judge with majority voting for hint extraction
- Scores student tokens under hint-enhanced distribution via get_logprobs
- Packages teacher's top-K predictions as distillation targets
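
As an illustration of the majority-voting step in the list above, here is a minimal sketch. It assumes each judge call returns either a hint string or None (no hint), and that a hint is accepted only when a strict majority agrees on it. The function name and this aggregation rule are assumptions; the commit message does not say how votes are actually combined.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(votes: Sequence[Optional[str]]) -> Optional[str]:
    """Return the value a strict majority of judges agree on, else None.

    A majority of None votes (no hint) also yields None.
    """
    if not votes:
        return None
    winner, count = Counter(votes).most_common(1)[0]
    # Require more than half of all votes, not just a plurality.
    return winner if count > len(votes) / 2 else None
```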

Architecture:
- AgenticOPDEnv extends HermesAgentBaseEnv
- Overrides collect_trajectories to add OPD pipeline after standard rollouts
- Uses Atropos's built-in get_logprobs (vLLM prompt_logprobs) for teacher scoring
- No external servers needed: the same vLLM backend handles both rollouts and scoring

Task: Coding problems with test verification (8 built-in tasks, HF dataset support)
Reward: correctness (0.7) + efficiency (0.15) + tool usage (0.15)
OPD: Per-turn hint extraction → enhanced prompt → teacher top-K logprobs

Configurable: opd_enabled, distill_topk, prm_votes, hint truncation length
Metrics: opd/mean_hints_per_rollout, opd/mean_turns_scored, opd/hint_rate

2026-03-13 02:45:08 -07:00
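
The reward weighting stated in commit 1a5f31d631 (correctness 0.7, efficiency 0.15, tool usage 0.15) amounts to a weighted sum. The sketch below assumes each component is already normalized to [0, 1]; the function name and the range check are hypothetical, and how the environment computes each component is not given in the commit message.

```python
def combined_reward(correctness: float, efficiency: float, tool_usage: float) -> float:
    """Weighted sum using the weights quoted in the commit message."""
    for name, value in (
        ("correctness", correctness),
        ("efficiency", efficiency),
        ("tool_usage", tool_usage),
    ):
        # Assumption: components are normalized scores in [0, 1].
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return 0.7 * correctness + 0.15 * efficiency + 0.15 * tool_usage
```

Because the weights sum to 1.0, the combined reward also stays in [0, 1], and a rollout that fails its tests can score at most 0.3 however efficient it was.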

2 commits