hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-05 02:31:47 +00:00

History

Teknium 1e6285c53d feat: compression eval harness for agent/context_compressor.py Ships a complete offline eval harness at scripts/compression_eval/. Runs a real conversation fixture through ContextCompressor.compress(), asks the compressor model to answer probe questions from the compressed state, then has a judge model score each answer 0-5 on six dimensions (accuracy, context_awareness, artifact_trail, completeness, continuity, instruction_following). Methodology adapted from Factory's Dec 2025 write-up (https://factory.ai/news/evaluating-compression); the scoreboard framing is not adopted. Motivation: we edit context_compressor.py prompts and _template_sections by hand and ship with no automated check that compression still preserves file paths, error codes, or the active task. Until now there has been no signal between 'test suite green' and 'a user hits a bad summary in production.' What's shipped - DESIGN.md — full architecture, fixture/probe format, scrubber pipeline, grading rubric, open follow-ups - README.md — usage, cost expectations, when to run it - scrub_fixtures.py — reproducible pipeline that converts real sessions from ~/.hermes/sessions/*.jsonl into public-safe JSON fixtures. Applies agent.redact.redact_sensitive_text + username path normalisation + personal handle scrubbing + email/git-author normalisation + reasoning scratchpad stripping + platform-mention scrubbing + first-user paraphrase + system-prompt placeholder + orphan-message pruning + 2KB tool-output truncation - fixtures/ — three scrubbed session snapshots covering three session shapes: feature-impl-context-priority (75 msgs / ~17k tokens) debug-session-feishu-id-model (59 msgs / ~13k tokens) config-build-competitive-scouts (61 msgs / ~23k tokens) - probes/ — three probe banks (10-11 probes each) covering all four types (recall/artifact/continuation/decision) with expected_facts anchors (PR numbers, file paths, error codes, commands) - rubric.py — six-dimension grading rubric, judge-prompt builder, JSON-with-fallback response parser - compressor_driver.py — thin wrapper around ContextCompressor for forced single-shot compression (fixtures are below the default 100k threshold so we force compress() to attribute score deltas to prompt changes, not threshold-fire variance) - grader.py — two-phase continuation + grading calls via the OpenAI SDK directly against the resolved provider endpoint - report.py — markdown report renderer (paste-ready for PR bodies), --compare-to delta mode, per-run JSON dumper - run_eval.py — fire-style CLI (--fixtures, --runs, --judge-model, --compressor-model, --label, --focus-topic, --compare-to, --verbose) - tests/scripts/test_compression_eval.py — 33 hermetic unit tests covering rubric parsing edge cases, judge-prompt building, report rendering, summariser medians, per-run JSON roundtrip, fixture and probe loading, and a PII smoke check on the checked-in fixtures Non-LLM paths are covered by the 33-test suite that runs in CI. The LLM paths (continuation + grading) require credentials and real API calls, so they're exercised by running the eval itself — not by CI. Validation - 33/33 unit tests pass in 0.33s via scripts/run_tests.sh - 50/50 adjacent tests (tests/agent/test_context_compressor.py) still pass — no regression introduced - End-to-end dry run against debug-session-feishu-id-model with openai/gpt-5.4-mini via Nous Portal: Compression: 13081 -> 3055 tokens (76.6% ratio), 59 -> 10 messages Overall score: 3.25 (artifact_trail 1.50 is the weak spot, matching Factory's published observation) Specific probe misses surfaced with concrete judge notes Noise floor (one empirical data point) Same inputs re-run: overall 3.25 -> 3.17 (delta -0.08). Individual dimensions varied up to ±0.5 between two single-run medians. Confirms the DESIGN.md < 0.3 noise guidance is the right order of magnitude for single-run comparisons. Tighter noise measurement (N=10) is tracked as an open follow-up in DESIGN.md. Why scripts/ and not tests/ Requires API credentials, costs ~$0.50-1.50 per run, minutes to execute, LLM-graded (non-deterministic). Incompatible with scripts/run_tests.sh which is hermetic, parallel, credential-free. scripts/sample_and_compress.py is the existing precedent for offline credentialed tooling. Open follow-ups (tracked in DESIGN.md, not blocking this PR) 1. Iterative-merge fixture (two chained compressions on one session) 2. Precise noise-floor measurement at N=10 3. Scripted scrubber helpers to lower the cost of fixture #4+ 4. Judge model selection policy (pin vs. per-user)		2026-04-25 06:17:44 -07:00
..
acp	fix(acp): include MCP toolsets in ACP sessions	2026-04-24 03:04:42 -07:00
agent	fix(aux): force anthropic oauth refresh after 401	2026-04-24 07:14:00 -07:00
cli	fix(tests): resolve 17 persistent CI test failures (#15084 )	2026-04-24 03:46:46 -07:00
cron	feat(cron): per-job workdir for project-aware cron runs (#15110 )	2026-04-24 05:07:01 -07:00
e2e	refactor(commands): drop /provider, /plan handler, and clean up slash registry (#15047 )	2026-04-24 03:10:52 -07:00
environments/benchmarks	fix(security): consolidated security hardening — SSRF, timing attack, tar traversal, credential leakage (#5944 )	2026-04-07 17:28:37 -07:00
fakes
gateway	test(gateway): on_session_finalize fires on idle-expiry + AUTHOR_MAP	2026-04-24 05:40:52 -07:00
hermes_cli	fix: re-auth on stale OAuth token; read Claude Code credentials from macOS Keychain	2026-04-24 07:14:00 -07:00
hermes_state	fix(resume): redirect --resume to the descendant that actually holds the messages	2026-04-24 03:04:42 -07:00
honcho_plugin	feat(honcho): wizard cadence default 2, surface reasoning level, backwards-compat fallback	2026-04-18 22:50:55 -07:00
integration	fix(discord): strip RTP padding before DAVE/Opus decode (#11267 )	2026-04-16 16:50:15 -07:00
plugins	feat(hindsight): optional bank_id_template for per-agent / per-user banks	2026-04-24 03:38:17 -07:00
run_agent	fix(agent): only set rate-limit cooldown when leaving primary; add tests	2026-04-24 05:35:43 -07:00
scripts	feat: compression eval harness for agent/context_compressor.py	2026-04-25 06:17:44 -07:00
skills	fix(google-workspace): normalize authorized user token writes	2026-04-16 04:22:16 -07:00
tools	refactor(spotify): convert to built-in bundled plugin under plugins/spotify (#15174 )	2026-04-24 07:06:11 -07:00
tui_gateway	Merge branch 'main' into fix/tui-provider-resolution	2026-04-22 11:47:49 -07:00
__init__.py
conftest.py	test(conftest): reset module-level state + unset platform allowlists (#13400 )	2026-04-21 01:33:10 -07:00
run_interrupt_test.py
test_account_usage.py	feat(account-usage): add per-provider account limits module	2026-04-21 01:56:35 -07:00
test_base_url_hostname.py	security(runtime_provider): close OLLAMA_API_KEY substring-leak sweep miss (#13522 )	2026-04-21 06:06:16 -07:00
test_batch_runner_checkpoint.py	fix(batch_runner): mark discarded no-reasoning prompts as completed (#9950 )	2026-04-20 04:56:06 -07:00
test_cli_file_drop.py	fix(tui): improve macOS paste and shortcut parity	2026-04-21 08:00:00 -07:00
test_cli_skin_integration.py	fix: align status bar skin tests with upstream main	2026-04-22 13:20:02 -07:00
test_ctx_halving_fix.py	fix(tests): fix 78 CI test failures and remove dead test (#9036 )	2026-04-13 10:50:24 -07:00
test_empty_model_fallback.py	fix: fall back to provider's default model when model config is empty (#8303 )	2026-04-12 03:53:30 -07:00
test_evidence_store.py
test_hermes_constants.py	fix(gateway): harden Docker/container gateway pathway	2026-04-12 16:36:11 -07:00
test_hermes_logging.py	fix(tests): fix 78 CI test failures and remove dead test (#9036 )	2026-04-13 10:50:24 -07:00
test_hermes_state.py	feat(dashboard): track real API call count per session	2026-04-22 05:51:58 -07:00
test_honcho_client_config.py	feat(memory): pluggable memory provider interface with profile isolation, review fixes, and honcho CLI restoration (#4623 )	2026-04-02 15:33:51 -07:00
test_ipv4_preference.py	feat: add network.force_ipv4 config to fix IPv6 timeout issues (#8196 )	2026-04-11 23:12:11 -07:00
test_mcp_serve.py
test_mini_swe_runner.py	fix(kimi): omit temperature entirely for Kimi/Moonshot models (#13157 )	2026-04-20 12:23:05 -07:00
test_minimax_model_validation.py	fix(models): validate MiniMax models against static catalog (#12611 , #12460 , #12399 , #12547 )	2026-04-19 22:44:47 -07:00
test_minisweagent_path.py
test_model_picker_scroll.py	fix: CLI/UX batch — ChatConsole errors, curses scroll, skin-aware banner, git state banner (#5974 )	2026-04-07 17:59:42 -07:00
test_model_tools.py	feat(plugins): add transform_tool_result hook for generic tool-result rewriting (#12972 )	2026-04-20 03:48:08 -07:00
test_model_tools_async_bridge.py	fix(core): ensure non-blocking executor shutdown on async timeout	2026-04-22 14:42:32 -07:00
test_ollama_num_ctx.py	fix: provider/model resolution — salvage 4 PRs + MiniMax aux URL fix (#5983 )	2026-04-07 22:23:28 -07:00
test_packaging_metadata.py
test_plugin_skills.py	fix(tests): attach caplog to specific logger in 3 order-dependent tests (#11453 )	2026-04-17 00:20:40 -07:00
test_project_metadata.py	build(deps): add qrcode to dingtalk + feishu extras (parity with messaging) (#11627 )	2026-04-17 13:31:53 -07:00
test_retry_utils.py	feat(agent): add jittered retry backoff	2026-04-08 00:41:36 -07:00
test_sql_injection.py
test_subprocess_home_isolation.py	fix: per-profile subprocess HOME isolation (#4426 ) (#7357 )	2026-04-10 13:37:45 -07:00
test_timezone.py	test: speed up slow tests (backoff + subprocess + IMDS network) (#11797 )	2026-04-17 14:21:22 -07:00
test_toolset_distributions.py
test_toolsets.py	fix(ci): unblock test suite + cut ~2s of dead Z.AI probes from every AIAgent	2026-04-19 19:18:19 -07:00
test_trajectory_compressor.py	fix(kimi): omit temperature entirely for Kimi/Moonshot models (#13157 )	2026-04-20 12:23:05 -07:00
test_trajectory_compressor_async.py	fix(kimi): omit temperature entirely for Kimi/Moonshot models (#13157 )	2026-04-20 12:23:05 -07:00
test_transform_tool_result_hook.py	test: stop testing mutable data — convert change-detectors to invariants (#13363 )	2026-04-20 23:20:33 -07:00
test_tui_gateway_server.py	feat(tui): per-section visibility for the details accordion	2026-04-24 02:34:32 -05:00
test_utils_truthy_values.py