hermes-agent/scripts/compression_eval
Teknium 9f5c13f874
design: compression eval harness — add three scrubbed fixtures + scrubber
Adds scripts/compression_eval/ with a design doc, README, a placeholder
run_eval.py, and three checked-in scrubbed session fixtures. No working
eval yet — PR is for design review before implementation.

Motivation: we edit agent/context_compressor.py prompts and
_template_sections by hand and ship without any automated check that
compression still preserves file paths, error codes, or the active task.
Factory.ai's Dec 2025 write-up
(https://factory.ai/news/evaluating-compression) documents a probe-based
eval scored on six dimensions. We adopt the methodology; we do not publish
scores.

Contents:
- DESIGN.md — fixture format, probe format (recall / artifact /
  continuation / decision), six grading dimensions, report format,
  cost expectations, scrubber pipeline, open questions, and staged
  follow-up PR plan. A hypothetical probe entry is sketched after
  this list.
- README.md — short 'what this is / when to run it' page.
- run_eval.py — placeholder that prints 'not implemented, see
  DESIGN.md' and exits 1.
- scrub_fixtures.py — reproducible pipeline that converts real sessions
  from ~/.hermes/sessions/*.jsonl into public-safe JSON fixtures.
  Applies: redact_sensitive_text, username path normalization, personal
  handle scrubbing, email and git-author normalization, reasoning
  scratchpad / <think> stripping, platform user-mention scrubbing,
  first-user-message paraphrase, system-prompt placeholder,
  orphan-message pruning, and tool-output size truncation for fixture
  readability. A pipeline sketch also follows this list.
- fixtures/feature-impl-context-priority.json — 75 msgs / ~17k tokens.
  Investigate → patch → test → PR → merge.
- fixtures/debug-session-feishu-id-model.json — 59 msgs / ~13k tokens.
  PR triage + upstream docs + decision.
- fixtures/config-build-competitive-scouts.json — 61 msgs / ~23k tokens.
  Iterative config accumulation (11 cron jobs across 7 weekdays).
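
For review, a hypothetical probe entry is sketched below. The four
probe types (recall / artifact / continuation / decision) come from
the design; every field name here is an illustrative assumption, not
the schema DESIGN.md fixes:

  {
    "id": "feature-impl-01",
    "type": "recall",
    "question": "Which file was patched to fix the context-priority bug?",
    "expected": "agent/context_compressor.py",
    "graded_on": ["…"]
  }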
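
And a minimal sketch of the scrub_fixtures.py flow, assuming simple
regex helpers; only redact_sensitive_text is a real name from the
pipeline above, the rest are placeholder implementations:

  import json
  import re
  from pathlib import Path

  SESSIONS = Path.home() / ".hermes" / "sessions"
  OUT = Path(__file__).parent / "fixtures"
  MAX_TOOL_OUTPUT = 2000  # chars; size truncation for fixture readability

  def redact_sensitive_text(text: str) -> str:
      # Placeholder pattern; the real function covers far more.
      return re.sub(r"sk-[A-Za-z0-9]{20,}", "[REDACTED]", text)

  def scrub_identity(text: str) -> str:
      text = re.sub(r"/(home|Users)/[\w.-]+", r"/\1/user", text)  # path normalization
      return re.sub(r"[\w.+-]+@[\w.-]+", "contributor@example.com", text)

  def strip_think(text: str) -> str:
      return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)

  def scrub_message(msg: dict) -> dict:
      text = msg.get("content") or ""
      for step in (redact_sensitive_text, scrub_identity, strip_think):
          text = step(text)
      if msg.get("role") == "tool":
          text = text[:MAX_TOOL_OUTPUT]
      return {**msg, "content": text}

  if __name__ == "__main__":
      OUT.mkdir(exist_ok=True)
      for session in SESSIONS.glob("*.jsonl"):
          msgs = [json.loads(l) for l in session.read_text().splitlines() if l.strip()]
          fixture = [scrub_message(m) for m in msgs]  # orphan pruning omitted here
          (OUT / f"{session.stem}.json").write_text(json.dumps(fixture, indent=2))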

PII audit: zero matches across the three fixtures for the maintainer's
handle (all case variants), personal email domains, and known contributor
emails. Only the 'contributor@example.com' placeholder remains.
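
A minimal sketch of that audit, assuming a hand-maintained
forbidden-pattern list (the real checklist lives in DESIGN.md; the
patterns below are stand-ins, not the actual handle or domains):

  import re
  from pathlib import Path

  FORBIDDEN = [r"(?i)maintainer-handle", r"@personal-domain\.example"]
  leaks = []
  for fixture in Path("scripts/compression_eval/fixtures").glob("*.json"):
      text = fixture.read_text()
      leaks += [(fixture.name, p) for p in FORBIDDEN if re.search(p, text)]
  assert not leaks, f"PII leak(s): {leaks}"
  print("PII audit: zero matches")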

Why scripts/: requires API credentials, costs ~$1 per run, LLM-graded
(non-deterministic), must not run in CI. scripts/sample_and_compress.py
is the existing precedent for offline credentialed tooling.
2026-04-24 07:40:42 -07:00
fixtures/
probes/
results/
DESIGN.md
README.md
run_eval.py
scrub_fixtures.py

compression_eval

Offline eval harness for agent/context_compressor.py. Runs a real conversation transcript through the compressor, then probes the compressed state with targeted questions graded on six dimensions.

Status: design only. See DESIGN.md for the full proposal and open questions. run_eval.py is a placeholder.
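
In outline, run_eval.py will eventually do something like the sketch below. This is a hedged illustration only: compress_history, ask_model, and grade are stand-in names rather than final interfaces, and the probes/<fixture>.json layout is a guess.

  import json
  from pathlib import Path

  def compress_history(messages):       # will call agent/context_compressor.py
      raise NotImplementedError("see DESIGN.md")

  def ask_model(compressed, question):  # probe the compressed state
      raise NotImplementedError("see DESIGN.md")

  def grade(probe, answer):             # LLM grader, scored on six dimensions
      raise NotImplementedError("see DESIGN.md")

  def run_fixture(path: Path) -> list[dict]:
      messages = json.loads(path.read_text())
      compressed = compress_history(messages)
      probes = json.loads((path.parent.parent / "probes" / path.name).read_text())
      return [grade(p, ask_model(compressed, p["question"])) for p in probes]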

When to run

Before merging changes to:

  • agent/context_compressor.py
  • agent/auxiliary_client.py routing for compression tasks
  • agent/prompt_builder.py compression-note phrasing

Not for CI

This harness makes real model calls, costs ~$1 per run on a mainstream model, takes minutes, and is LLM-graded (non-deterministic). It lives in scripts/ and is invoked by hand. tests/ and scripts/run_tests.sh do not touch it.

Usage (once implemented)

python scripts/compression_eval/run_eval.py
python scripts/compression_eval/run_eval.py --fixtures=401-debug
python scripts/compression_eval/run_eval.py --runs=5 --label=my-prompt-v2
python scripts/compression_eval/run_eval.py --compare-to=results/2026-04-24_baseline

Results land in results/<label>/report.md and are intended to be pasted verbatim into PR descriptions.
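
DESIGN.md specifies the report format; purely as an illustration (not the actual format, and with no real scores), a report.md might look like:

  # compression eval - my-prompt-v2

  fixture: feature-impl-context-priority
    probes passed: … / …
    per-dimension scores: …
    delta vs --compare-to baseline: …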

Fixtures

Three scrubbed session snapshots live under fixtures/:

  • feature-impl-context-priority.json — 75 msgs, investigate → patch → test → PR → merge
  • debug-session-feishu-id-model.json — 59 msgs, PR triage + upstream docs + decision
  • config-build-competitive-scouts.json — 61 msgs, iterative config accumulation (11 cron jobs)

Regenerate them from the maintainer's ~/.hermes/sessions/*.jsonl with python3 scripts/compression_eval/scrub_fixtures.py. The scrubber pipeline and PII-audit checklist are documented in DESIGN.md under Scrubber pipeline.
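
Each fixture is a JSON snapshot of a scrubbed message list. The excerpt below is a hypothetical illustration of the shape (placeholder content; the exact field names are an assumption, not the checked-in schema):

  [
    {"role": "system", "content": "[system prompt placeholder]"},
    {"role": "user", "content": "Paraphrase of the original first request."},
    {"role": "assistant", "content": "…", "tool_calls": [{"name": "read_file", "arguments": {"path": "agent/context_compressor.py"}}]},
    {"role": "tool", "content": "… (output truncated for fixture readability)"}
  ]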

See also

  • agent/context_compressor.py — the thing under test
  • tests/agent/test_context_compressor.py — structural unit tests that do run in CI
  • scripts/sample_and_compress.py — the closest existing script in shape (offline, credential-requiring, not in CI)