mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
Adds `scripts/compression_eval/` with a design doc, README, a placeholder `run_eval.py`, and three checked-in scrubbed session fixtures. No working eval yet — this PR is for design review before implementation.

Motivation: we edit `agent/context_compressor.py` prompts and `_template_sections` by hand and ship without any automated check that compression still preserves file paths, error codes, or the active task. Factory.ai's Dec 2025 write-up (https://factory.ai/news/evaluating-compression) documents a probe-based eval scored on six dimensions. We adopt the methodology; we do not publish scores.

Contents:

- `DESIGN.md` — fixture format, probe format (recall / artifact / continuation / decision), six grading dimensions, report format, cost expectations, scrubber pipeline, open questions, and staged follow-up PR plan.
- `README.md` — short "what this is / when to run it" page.
- `run_eval.py` — placeholder that prints "not implemented, see DESIGN.md" and exits 1.
- `scrub_fixtures.py` — reproducible pipeline that converts real sessions from `~/.hermes/sessions/*.jsonl` into public-safe JSON fixtures. Applies: redact_sensitive_text, username path normalization, personal handle scrubbing, email and git-author normalization, reasoning scratchpad / `<think>` stripping, platform user-mention scrubbing, first-user paraphrase, system-prompt placeholder, orphan-message pruning, and tool-output size truncation for fixture readability.
- `fixtures/feature-impl-context-priority.json` — 75 msgs / ~17k tokens. Investigate → patch → test → PR → merge.
- `fixtures/debug-session-feishu-id-model.json` — 59 msgs / ~13k tokens. PR triage + upstream docs + decision.
- `fixtures/config-build-competitive-scouts.json` — 61 msgs / ~23k tokens. Iterative config accumulation (11 cron jobs across 7 weekdays).

PII audit: zero matches across the three fixtures for the maintainer's handle (all case variants), personal email domains, and known contributor emails. Only the `contributor@example.com` placeholder remains.

Why `scripts/`: requires API credentials, costs ~$1 per run, LLM-graded (non-deterministic), must not run in CI. `scripts/sample_and_compress.py` is the existing precedent for offline credentialed tooling.
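The scrub passes listed above compose naturally as a pipeline of pure string transforms. A minimal sketch of three of them — the function names and regexes here are illustrative, not the actual `scrub_fixtures.py` implementation:

```python
import re

# Illustrative scrub passes -- the real scrub_fixtures.py passes and
# regexes may differ; this only demonstrates the composed-pipeline shape.
def normalize_username_paths(text: str) -> str:
    # /home/alice/... and /Users/alice/... -> /home/user/...
    return re.sub(r"/(?:home|Users)/[^/\s]+", "/home/user", text)

def normalize_emails(text: str) -> str:
    # Any email -> the single placeholder kept in the fixtures.
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "contributor@example.com", text)

def strip_think_blocks(text: str) -> str:
    # Remove reasoning scratchpads wrapped in <think>...</think>.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)

PASSES = [normalize_username_paths, normalize_emails, strip_think_blocks]

def scrub(text: str) -> str:
    for scrub_pass in PASSES:
        text = scrub_pass(text)
    return text
```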
59 lines
2 KiB
Markdown
# compression_eval

Offline eval harness for `agent/context_compressor.py`. Runs a real
conversation transcript through the compressor, then probes the
compressed state with targeted questions graded on six dimensions.

**Status:** design only. See `DESIGN.md` for the full proposal and
open questions. `run_eval.py` is a placeholder.

## When to run

Before merging changes to:

- `agent/context_compressor.py`
- `agent/auxiliary_client.py` routing for compression tasks
- `agent/prompt_builder.py` compression-note phrasing

## Not for CI

This harness makes real model calls, costs ~$1 per run on a mainstream
model, takes minutes, and is LLM-graded (non-deterministic). It lives
in `scripts/` and is invoked by hand. `tests/` and
`scripts/run_tests.sh` do not touch it.

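Since the harness needs real credentials and must never run in CI, the eventual `run_eval.py` could guard its entry point along these lines. The environment variable names here are assumptions — the design only says "API credentials" and "not in CI":

```python
import os
import sys

def require_offline_credentials() -> str:
    """Refuse to run inside CI or without an API key.

    HERMES_API_KEY is an assumed variable name for illustration; CI=1
    is the convention most CI providers set.
    """
    if os.environ.get("CI"):
        sys.exit("compression_eval must not run in CI")
    key = os.environ.get("HERMES_API_KEY", "")
    if not key:
        sys.exit("set HERMES_API_KEY to run the eval")
    return key
```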
## Usage (once implemented)

```
python scripts/compression_eval/run_eval.py
python scripts/compression_eval/run_eval.py --fixtures=debug-session-feishu-id-model
python scripts/compression_eval/run_eval.py --runs=5 --label=my-prompt-v2
python scripts/compression_eval/run_eval.py --compare-to=results/2026-04-24_baseline
```

Results land in `results/<label>/report.md` and are intended to be
pasted verbatim into PR descriptions.

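One plausible shape for turning per-probe grades into the per-dimension scores a report would carry — a sketch only; it deliberately does not assume what the six dimensions in `DESIGN.md` are called:

```python
from collections import defaultdict
from statistics import mean

def aggregate(grades: list[dict]) -> dict[str, float]:
    """Average per-probe 0-1 scores into one score per grading dimension.

    Each grade is assumed to look like {"dimension": str, "score": float};
    dimension names come from the grader output, so nothing here hardcodes
    the six dimensions defined in DESIGN.md.
    """
    by_dim: dict[str, list[float]] = defaultdict(list)
    for grade in grades:
        by_dim[grade["dimension"]].append(grade["score"])
    return {dim: round(mean(scores), 3) for dim, scores in sorted(by_dim.items())}
```

For example, `aggregate([{"dimension": "recall", "score": 1.0}, {"dimension": "recall", "score": 0.5}])` yields a single averaged `recall` score.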
## Fixtures

Three scrubbed session snapshots live under `fixtures/`:

- `feature-impl-context-priority.json` — 75 msgs, investigate →
  patch → test → PR → merge
- `debug-session-feishu-id-model.json` — 59 msgs, PR triage +
  upstream docs + decision
- `config-build-competitive-scouts.json` — 61 msgs, iterative
  config accumulation (11 cron jobs)

Regenerate them from the maintainer's `~/.hermes/sessions/*.jsonl`
with `python3 scripts/compression_eval/scrub_fixtures.py`. The
scrubber pipeline and PII-audit checklist are documented in
`DESIGN.md` under **Scrubber pipeline**.

## Related

- `agent/context_compressor.py` — the thing under test
- `tests/agent/test_context_compressor.py` — structural unit tests
  that do run in CI
- `scripts/sample_and_compress.py` — the closest existing script in
  shape (offline, credential-requiring, not in CI)