Adds scripts/compression_eval/ with a design doc, README, a placeholder
run_eval.py, a reproducible fixture scrubber, and three checked-in scrubbed
session fixtures. No working eval yet; this PR is for design review before
implementation.
Motivation: we hand-edit the prompts and _template_sections in
agent/context_compressor.py and ship without any automated check that
compression still preserves file paths, error codes, or the active task.
Factory.ai's Dec 2025 write-up
(https://factory.ai/news/evaluating-compression) documents a probe-based
eval scored on six dimensions. We adopt the methodology; we do not publish
scores.
Contents:
- DESIGN.md — fixture format, probe format (recall / artifact /
continuation / decision), six grading dimensions, report format,
cost expectations, scrubber pipeline, open questions, and staged
follow-up PR plan.
- README.md — short 'what this is / when to run it' page.
- run_eval.py — placeholder that prints 'not implemented, see
DESIGN.md' and exits 1.
- scrub_fixtures.py — reproducible pipeline that converts real sessions
from ~/.hermes/sessions/*.jsonl into public-safe JSON fixtures.
Applies: redact_sensitive_text, username path normalization, personal
handle scrubbing, email and git-author normalization, reasoning
scratchpad / <think> stripping, platform user-mention scrubbing,
first-user paraphrase, system-prompt placeholder, orphan-message
pruning, and tool-output size truncation for fixture readability.
- fixtures/feature-impl-context-priority.json — 75 msgs / ~17k tokens.
Investigate → patch → test → PR → merge.
- fixtures/debug-session-feishu-id-model.json — 59 msgs / ~13k tokens.
PR triage + upstream docs + decision.
- fixtures/config-build-competitive-scouts.json — 61 msgs / ~23k tokens.
Iterative config accumulation (11 cron jobs across 7 weekdays).
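As a concrete illustration, a probe from DESIGN.md might look like the sketch below. Field names such as `expected` are assumptions for illustration, not the final schema; only the four probe types come from the design doc.

```python
# Hypothetical probe shape (illustrative only; DESIGN.md is authoritative).
PROBE_TYPES = {"recall", "artifact", "continuation", "decision"}

EXAMPLE_PROBE = {
    "id": "feature-impl-001",
    "type": "recall",  # one of PROBE_TYPES
    "question": "Which file was patched to fix the failing test?",
    # Substrings the graded answer must contain (assumed field, not final).
    "expected": ["agent/context_compressor.py"],
}

def validate_probe(probe: dict) -> list[str]:
    """Return a list of schema problems; an empty list means well-formed."""
    problems = []
    if probe.get("type") not in PROBE_TYPES:
        problems.append(f"unknown probe type: {probe.get('type')!r}")
    for field in ("id", "question", "expected"):
        if not probe.get(field):
            problems.append(f"missing field: {field}")
    return problems
```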
PII audit: zero matches across the three fixtures for the maintainer's
handle (all case variants), personal email domains, and known contributor
emails. Only the 'contributor@example.com' placeholder remains.
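For reference, the shape of the scrubbing passes can be sketched as follows. The regexes, replacement values, and pass ordering here are illustrative; the real scrub_fixtures.py applies the full pipeline listed above.

```python
import re

# Illustrative scrubbing passes (not the real scrub_fixtures.py regexes).
USERNAME_PATH = re.compile(r"/(?:home|Users)/[^/\s]+")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrub_text(text: str) -> str:
    text = USERNAME_PATH.sub("/home/user", text)      # username path normalization
    text = EMAIL.sub("contributor@example.com", text)  # email normalization
    # Reasoning scratchpad stripping: drop <think> blocks entirely.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    return text

def scrub_session(messages: list[dict], max_tool_output: int = 2000) -> list[dict]:
    """Scrub each message; truncate oversized tool output for readability."""
    out = []
    for msg in messages:
        msg = dict(msg)
        msg["content"] = scrub_text(msg.get("content", ""))
        if msg.get("role") == "tool" and len(msg["content"]) > max_tool_output:
            msg["content"] = msg["content"][:max_tool_output] + " …[truncated]"
        out.append(msg)
    return out
```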
Why scripts/: the eval requires API credentials, costs ~$1 per run, is
LLM-graded (non-deterministic), and must not run in CI.
scripts/sample_and_compress.py is the existing precedent for offline
credentialed tooling.
Regression test for #14981. Verifies that _session_expiry_watcher fires
on_session_finalize for each session swept out of the store, matching
the contract documented for /new, /reset, CLI shutdown, and gateway stop.
Verified the test fails cleanly on pre-fix code (hook call list missing
sess-expired) and passes with the fix applied.
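The pattern the test relies on can be sketched like this. `RecordingHooks` and `sweep_expired` are hypothetical stand-ins for illustration; the real test drives _session_expiry_watcher against the session store.

```python
# Hypothetical sketch of the hook-recording pattern (names are stand-ins;
# the real code path is _session_expiry_watcher -> on_session_finalize).
class RecordingHooks:
    def __init__(self) -> None:
        self.finalized: list[str] = []

    def on_session_finalize(self, session_id: str) -> None:
        self.finalized.append(session_id)

def sweep_expired(store: dict[str, float], now: float, hooks: RecordingHooks) -> None:
    """Toy expiry sweep: remove expired sessions and fire the finalize
    hook exactly once per swept session."""
    for session_id in [sid for sid, expires in store.items() if expires <= now]:
        del store[session_id]
        hooks.on_session_finalize(session_id)
```

On pre-fix code the analogous hook call list would be missing "sess-expired", which is exactly what the regression test asserts against.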
Follow-up on top of #15096 cherry-pick:
- Remove spotify_* from _HERMES_CORE_TOOLS (keep only in the 'spotify'
toolset, so the 9 Spotify tool schemas are not shipped to every user).
- Add 'spotify' to CONFIGURABLE_TOOLSETS + _DEFAULT_OFF_TOOLSETS so new
installs get it opt-in via 'hermes tools', matching homeassistant/rl.
- Wire a TOOL_CATEGORIES entry pointing at 'hermes auth spotify' for the
  actual PKCE login (with optional HERMES_SPOTIFY_CLIENT_ID /
  HERMES_SPOTIFY_REDIRECT_URI env vars).
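The intended membership after this change can be sketched as below. The constant names come from the PR text; the example tool names and the `enabled_tools` helper are illustrative assumptions.

```python
# Sketch of toolset membership after this PR (example values; only the
# constant names are from the PR text).
_HERMES_CORE_TOOLS = {"read_file", "write_file"}            # spotify_* removed
TOOLSETS = {"spotify": {"spotify_play", "spotify_search"}}  # example members
CONFIGURABLE_TOOLSETS = {"homeassistant", "rl", "spotify"}
_DEFAULT_OFF_TOOLSETS = {"homeassistant", "rl", "spotify"}

def enabled_tools(user_enabled_toolsets: set[str]) -> set[str]:
    """Core tools always ship; default-off toolsets only when opted in."""
    tools = set(_HERMES_CORE_TOOLS)
    for name in user_enabled_toolsets & CONFIGURABLE_TOOLSETS:
        tools |= TOOLSETS.get(name, set())
    return tools
```

The design point is that a fresh install never receives the Spotify schemas until the user opts in via 'hermes tools', matching how homeassistant/rl behave.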
- scripts/release.py: map contributor email to GitHub login.
- Add an AUTHOR_MAP entry for 130918800+devorun so #6636 is attributed
  correctly.
- test_moa_defaults: was a change-detector pinned to the exact frontier
  model list, so it flipped red on every OpenRouter catalog churn.
  Rewritten as an invariant check (list is non-empty, entries are valid
  vendor/model slugs).
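The invariant check might look roughly like this. The slug regex is an assumption about OpenRouter-style vendor/model slugs, and the real test reads the defaults from config rather than taking them as an argument.

```python
import re

# Assumed OpenRouter-style slug shape: "vendor/model" (illustrative regex).
SLUG = re.compile(r"^[a-z0-9-]+/[A-Za-z0-9._-]+$")

def check_moa_defaults(models: list[str]) -> None:
    """Invariant-style check: non-empty list of valid vendor/model slugs,
    without pinning the exact model roster."""
    assert models, "MoA default model list must be non-empty"
    for slug in models:
        assert SLUG.match(slug), f"invalid vendor/model slug: {slug!r}"
```

Because it asserts shape rather than contents, the test survives model roster churn while still catching an empty or malformed default list.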