hermes-agent/scripts/compression_eval/DESIGN.md
Teknium 9f5c13f874
design: compression eval harness — add three scrubbed fixtures + scrubber
Adds scripts/compression_eval/ with a design doc, README, a placeholder
run_eval.py, and three checked-in scrubbed session fixtures. No working
eval yet — PR is for design review before implementation.

Motivation: we edit agent/context_compressor.py prompts and
_template_sections by hand and ship without any automated check that
compression still preserves file paths, error codes, or the active task.
Factory.ai's Dec 2025 write-up
(https://factory.ai/news/evaluating-compression) documents a probe-based
eval scored on six dimensions. We adopt the methodology; we do not publish
scores.

Contents:
- DESIGN.md — fixture format, probe format (recall / artifact /
  continuation / decision), six grading dimensions, report format,
  cost expectations, scrubber pipeline, open questions, and staged
  follow-up PR plan.
- README.md — short 'what this is / when to run it' page.
- run_eval.py — placeholder that prints 'not implemented, see
  DESIGN.md' and exits 1.
- scrub_fixtures.py — reproducible pipeline that converts real sessions
  from ~/.hermes/sessions/*.jsonl into public-safe JSON fixtures.
  Applies: redact_sensitive_text, username path normalization, personal
  handle scrubbing, email and git-author normalization, reasoning
  scratchpad / <think> stripping, platform user-mention scrubbing,
  first-user paraphrase, system-prompt placeholder, orphan-message
  pruning, and tool-output size truncation for fixture readability.
- fixtures/feature-impl-context-priority.json — 75 msgs / ~17k tokens.
  Investigate → patch → test → PR → merge.
- fixtures/debug-session-feishu-id-model.json — 59 msgs / ~13k tokens.
  PR triage + upstream docs + decision.
- fixtures/config-build-competitive-scouts.json — 61 msgs / ~23k tokens.
  Iterative config accumulation (11 cron jobs across 7 weekdays).

PII audit: zero matches across the three fixtures for the maintainer's
handle (all case variants), personal email domains, and known contributor
emails. Only 'contributor@example.com' placeholder remains.

Why scripts/: requires API credentials, costs ~\$1 per run, LLM-graded
(non-deterministic), must not run in CI. scripts/sample_and_compress.py
is the existing precedent for offline credentialed tooling.
2026-04-24 07:40:42 -07:00

14 KiB
Raw Blame History

Compression Eval — Design

Status: proposal. Nothing under scripts/compression_eval/ runs in CI. This is an offline tool authors run before merging prompt or algorithm changes to agent/context_compressor.py.

Why

We tune the compressor prompt and the _template_sections checklist by hand, ship, and wait for the next real session to notice regressions. There is no automated check that a prompt edit still preserves file paths, error messages, or the active task across a compression.

Factory.ai's December 2025 write-up (https://factory.ai/news/evaluating-compression) describes a probe-based eval that scores compressed state on six dimensions. The methodology is the valuable part — the benchmarks in the post are a marketing piece. We adopt the methodology and discard the scoreboard.

Goal

Given a real session transcript and a bank of probe questions that exercise what the transcript contained, answer:

  1. After ContextCompressor.compress() runs, can the agent still answer each probe correctly from the compressed state?
  2. Which of the six dimensions (accuracy, context awareness, artifact trail, completeness, continuity, instruction following) is the prompt weakest on?
  3. Does a prompt change improve or regress any dimension vs. the previous run?

That is the full scope. No "compare against OpenAI and Anthropic" benchmarking, no public scoreboard, no marketing claims.

Non-goals

  • Not a pytest. Requires API credentials, costs money, takes minutes per fixture, and output is LLM-graded and non-deterministic.
  • Not part of scripts/run_tests.sh. Not invoked by CI.
  • Not a replacement for the existing compressor unit tests in tests/agent/test_context_compressor.py — those stay as the structural / boundary / tool-pair-sanitization guard.
  • Not a general trajectory eval. Scoped to context compaction only.

Where it lives

scripts/compression_eval/
├── DESIGN.md                 # this file
├── README.md                 # how to run, cost expectations, caveats
├── run_eval.py               # entry point (fire CLI, like sample_and_compress.py)
├── scrub_fixtures.py         # regenerate fixtures from ~/.hermes/sessions/*.jsonl
├── fixtures/                 # checked-in scrubbed session snapshots
│   ├── feature-impl-context-priority.json
│   ├── debug-session-feishu-id-model.json
│   └── config-build-competitive-scouts.json
├── probes/                   # probe banks paired with fixtures
│   └── <fixture>.probes.json
├── rubric.py                 # grading prompt + dimension definitions
├── grader.py                 # judge-model call + score parsing
├── compressor_driver.py      # thin wrapper over ContextCompressor
└── results/                  # gitignored; timestamped output per run
    └── .gitkeep

scripts/ is the right home: offline tooling, no CI involvement, precedent already set by sample_and_compress.py, contributor_audit.py, discord-voice-doctor.py.

environments/ is for Atropos RL training environments — wrong shape. tests/ is hermetic and credential-free — incompatible with a probe-based eval that needs a judge model.

Fixture format

A fixture is a single compressed-enough conversation captured from a real session. Stored as JSON (pretty-printed, reviewable in PRs):

{
  "name": "401-debug",
  "description": "178-turn session debugging a 401 on /api/auth/login",
  "model": "anthropic/claude-sonnet-4.6",
  "context_length": 200000,
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "...", "tool_calls": [...]},
    {"role": "tool", "tool_call_id": "...", "content": "..."}
  ],
  "notes": "Captured 2026-04-24 from session 20260424_*.jsonl; \
            PII scrubbed; secrets redacted via redact_sensitive_text."
}

Sourcing fixtures

Fixtures are scrubbed snapshots of real sessions from the maintainer's ~/.hermes/sessions/*.jsonl store, generated reproducibly by scrub_fixtures.py in this directory. Re-run the scrubber with python3 scripts/compression_eval/scrub_fixtures.py to regenerate them after a scrubber change.

Three shipped fixtures cover three different session shapes:

Fixture Source shape Messages Tokens (rough) Tests
feature-impl-context-priority investigate → patch → test → PR → merge 75 ~17k continuation, artifact trail (2 files modified, 1 PR)
debug-session-feishu-id-model PR triage + upstream docs + decision 59 ~13k recall (PR #, error shape), decision (outcome + reason)
config-build-competitive-scouts iterative config: 11 cron jobs across 7 weekdays 61 ~23k artifact trail (which jobs, which days), iterative-merge

The ~17k-23k token range is below the default 50%-of-200k compression threshold, so the eval will always force a compress() call rather than wait for the natural trigger. That is the intended shape — we want a controlled single-shot compression so score deltas are attributable to the prompt change, not to whether the threshold happened to fire at the same boundary twice.

Scrubber pipeline

scrub_fixtures.py applies, per message:

  1. agent.redact.redact_sensitive_text — API keys, tokens, connection strings
  2. Username paths: /home/teknium/home/user
  3. Personal handles: all case variants of the maintainer name → user
  4. Email addresses → contributor@example.com; git Author: Name <addr> header lines normalised
  5. <REASONING_SCRATCHPAD>...</REASONING_SCRATCHPAD> and <think>...</think> stripped from assistant content
  6. Messaging-platform user mentions (<@123456>, <@***>) → <@user>
  7. First user message paraphrased to remove personal voice; subsequent user turns kept verbatim after the redactions above
  8. System prompt replaced with a generic public-safe placeholder so we don't check in the maintainer's tuned soul/skills/memory system block
  9. Orphan empty-assistant messages (artifact of scratchpad-only turns) and trailing tool messages with no matching assistant are dropped
  10. Tool outputs longer than 2000 chars are truncated with a size annotation; the compressor sees that the tool was called and returned something but not the full 16KB skill_view or 5KB web_extract body (no signal loss for compression probes)

Before every fixture PR: grep the fixture for PII patterns. An audit is embedded at the bottom of the scrubber as comments.

Fixtures must stay small. Target <150 KB per fixture, <500 KB total for the directory. Current total: ~230 KB across three fixtures. Larger sessions are truncated with a truncated_to: <index> field in the fixture header so the cut is reviewable.

Probe format

One probe file per fixture, so reviewers can see the question bank evolve alongside the fixture:

{
  "fixture": "401-debug",
  "probes": [
    {
      "id": "recall-error-code",
      "type": "recall",
      "question": "What was the original error code and endpoint?",
      "expected_facts": ["401", "/api/auth/login"]
    },
    {
      "id": "artifact-files-modified",
      "type": "artifact",
      "question": "Which files have been modified in this session?",
      "expected_facts": ["session_store.py", "redis_client.py"]
    },
    {
      "id": "continuation-next-step",
      "type": "continuation",
      "question": "What should we do next?",
      "expected_facts": ["re-run the integration tests", "restart the worker"]
    },
    {
      "id": "decision-redis-approach",
      "type": "decision",
      "question": "What did we decide about the Redis issue?",
      "expected_facts": ["switch to redis-py 5.x", "pooled connection"]
    }
  ]
}

The four probe types come directly from Factory's methodology: recall, artifact, continuation, decision. expected_facts gives the grader concrete anchors instead of relying purely on LLM taste.

Authoring a probe bank is a one-time cost per fixture. 8-12 probes per fixture is the target — enough to cover all four types, few enough to grade in under a minute at reasonable cost.

Grading

Each probe gets scored 0-5 on six dimensions (Factory's six):

Dimension What it measures
accuracy File paths, function names, error codes are correct
context_awareness Reflects current state, not a mid-session snapshot
artifact_trail Knows which files were read / modified / created
completeness Addresses all parts of the probe
continuity Agent can continue without re-fetching
instruction_following Probe answered in the requested form

Grading is done by a single judge-model call per probe with a deterministic rubric prompt (see rubric.py). The rubric includes the expected_facts list so the judge has a concrete anchor. Default judge model: whatever the user has configured as their main model at run time (same resolution path as auxiliary_client.call_llm). A --judge-model flag allows overriding for consistency across runs.

Non-determinism caveat: two runs of the same fixture will produce different scores. A single run means nothing. Report medians over N=3 runs by default, and require an improvement of >=0.3 on any dimension before claiming a prompt change is a win.

Run flow

python scripts/compression_eval/run_eval.py [OPTIONS]

Options (fire-style, mirroring sample_and_compress.py):

Flag Default Purpose
--fixtures all Comma-separated fixture names
--runs 3 Runs per fixture (for median)
--judge-model auto Override judge model
--compressor-model auto Override model used inside the compressor
--label timestamp Subdirectory under results/
--focus-topic none Pass-through to compress(focus_topic=)
--compare-to none Path to a previous run for diff output

Steps per fixture per run:

  1. Load fixture JSON and probe bank.
  2. Construct a ContextCompressor against the fixture's model.
  3. Call compressor.compress(messages) — capture the compressed message list.
  4. For each probe: ask the judge model to role-play as the continuing agent with only the compressed state, then grade the answer on the six dimensions using rubric.py.
  5. Write a per-run JSON to results/<label>/<fixture>-run-N.json.
  6. After all runs, emit a markdown summary to results/<label>/report.md.

Report format

Pasted verbatim into PR descriptions that touch the compressor:

## Compression eval — label 2026-04-25_13-40-02

Main model: anthropic/claude-sonnet-4.6   Judge: same
3 runs per fixture, medians reported.

| Fixture        | Accuracy | Context | Artifact | Complete | Continuity | Instruction | Overall |
|----------------|----------|---------|----------|----------|------------|-------------|---------|
| 401-debug      | 4.1      | 4.0     | 2.5      | 4.3      | 3.8        | 5.0         | 3.95    |
| pr-review      | 3.9      | 3.8     | 3.1      | 4.2      | 3.9        | 5.0         | 3.98    |
| feature-impl   | 4.0      | 3.9     | 2.9      | 4.1      | 4.0        | 5.0         | 3.98    |

Per-probe misses (score < 3.0):
- 401-debug / artifact-files-modified: 1.7 — summary dropped redis_client.py
- pr-review / decision-auth-rewrite: 2.3 — outcome captured, reasoning dropped

Cost expectations

Dominated by the judge calls. For 3 fixtures × 10 probes × 3 runs = 90 judge calls per eval run. On Claude Sonnet 4.6 that is roughly $0.50-$1.50 per full eval depending on probe length. The compressor itself makes 1 call per fixture × 3 runs = 9 additional calls.

This is not a check to run after every commit. It is a before-merge check for PRs that touch:

  • agent/context_compressor.py — any change to _template_sections, _generate_summary, or compress().
  • agent/auxiliary_client.py — when changing how compression tasks are routed.
  • agent/prompt_builder.py — when the compression-note phrasing changes.

Open questions (to resolve before implementing)

  1. Fixture scrubbing: manual or scripted? A scripted scrub that also replaces project names / hostnames would lower the cost of contributing a new fixture. Risk: over-aggressive replacement destroys the signal the probe depends on. Propose: start manual, add scripted helpers once we have 3 fixtures and know the common PII shapes.

  2. Judge model selection. Factory uses GPT-5.2. We can't pin one — user's main model changes. Options: (a) grade with main model (cheap, inconsistent across users), (b) require a specific judge model (e.g. claude-sonnet-4.6), inconsistent for users without access. Propose (a) with a --judge-model override, and make the model name prominent in the report so comparisons across machines are legible.

  3. Noise floor. Before landing prompt changes, run the current prompt N=10 times to measure per-dimension stddev. That tells us the minimum delta to call a change significant. Suspect 0.2-0.3 on a 0-5 scale. Decision deferred until after the first fixture is landed.

  4. Iterative-merge coverage. The real Factory-vs-Anthropic difference is incremental merge vs. regenerate. A fixture that only compresses once doesn't exercise our iterative path. Add a fourth fixture that forces two compressions (manually chained), with probes that test whether information from the first compression survives the second. Deferred to a follow-up PR.

Implementation order

This PR: design doc + scaffolding + three checked-in fixtures + scrubber script. run_eval.py is still a placeholder that prints a pointer to DESIGN.md.

Follow-ups, each a separate PR:

  1. Probe banks for the three fixtures (~8-12 probes each), plus rubric.py + grader.py + compressor_driver.py. Enough to produce a full report.
  2. Wire results output, --compare-to diff mode, and the report markdown template.
  3. Iterative-merge fixture (two chained compressions) + follow-ups from the open questions.

Each follow-up is independently useful.