design: compression eval harness — add three scrubbed fixtures + scrubber

Adds scripts/compression_eval/ with a design doc, README, a placeholder run_eval.py, and three checked-in scrubbed session fixtures. No working eval yet; this PR is for design review before implementation.

Motivation: we edit agent/context_compressor.py prompts and _template_sections by hand and ship without any automated check that compression still preserves file paths, error codes, or the active task. Factory.ai's Dec 2025 write-up (https://factory.ai/news/evaluating-compression) documents a probe-based eval scored on six dimensions. We adopt the methodology; we do not publish scores.

Contents:

- DESIGN.md: fixture format, probe format (recall / artifact / continuation / decision), six grading dimensions, report format, cost expectations, scrubber pipeline, open questions, and staged follow-up PR plan.
- README.md: short 'what this is / when to run it' page.
- run_eval.py: placeholder that prints 'not implemented, see DESIGN.md' and exits 1.
- scrub_fixtures.py: reproducible pipeline that converts real sessions from ~/.hermes/sessions/*.jsonl into public-safe JSON fixtures. Applies: redact_sensitive_text, username path normalization, personal handle scrubbing, email and git-author normalization, reasoning scratchpad / <think> stripping, platform user-mention scrubbing, first-user paraphrase, system-prompt placeholder, orphan-message pruning, and tool-output size truncation for fixture readability.
- fixtures/feature-impl-context-priority.json: 75 msgs / ~17k tokens. Investigate → patch → test → PR → merge.
- fixtures/debug-session-feishu-id-model.json: 59 msgs / ~13k tokens. PR triage + upstream docs + decision.
- fixtures/config-build-competitive-scouts.json: 61 msgs / ~23k tokens. Iterative config accumulation (11 cron jobs across 7 weekdays).

PII audit: zero matches across the three fixtures for the maintainer's handle (all case variants), personal email domains, and known contributor emails. Only the 'contributor@example.com' placeholder remains.

Why scripts/: requires API credentials, costs ~$1 per run, LLM-graded (non-deterministic), must not run in CI. scripts/sample_and_compress.py is the existing precedent for offline credentialed tooling.
2026-04-25 00:51:20 +00:00 · 2026-04-24 07:21:09 -07:00 · 2026-04-24 07:21:09 -07:00 · 9f5c13f874
commit 9f5c13f874
parent c6b734e24d
10 changed files with 2642 additions and 0 deletions
--- a/scripts/compression_eval/DESIGN.md
+++ b/scripts/compression_eval/DESIGN.md
@@ -0,0 +1,341 @@
+# Compression Eval — Design
+
+Status: proposal. Nothing under `scripts/compression_eval/` runs in CI.
+This is an offline tool authors run before merging prompt or algorithm
+changes to `agent/context_compressor.py`.
+
+## Why
+
+We tune the compressor prompt and the `_template_sections` checklist by
+hand, ship, and wait for the next real session to notice regressions.
+There is no automated check that a prompt edit still preserves file
+paths, error messages, or the active task across a compression.
+
+Factory.ai's December 2025 write-up
+(https://factory.ai/news/evaluating-compression) describes a
+probe-based eval that scores compressed state on six dimensions. The
+methodology is the valuable part — the benchmarks in the post are a
+marketing piece. We adopt the methodology and discard the scoreboard.
+
+## Goal
+
+Given a real session transcript and a bank of probe questions that
+exercise what the transcript contained, answer:
+
+1. After `ContextCompressor.compress()` runs, can the agent still
+   answer each probe correctly from the compressed state?
+2. Which of the six dimensions (accuracy, context awareness, artifact
+   trail, completeness, continuity, instruction following) is the
+   prompt weakest on?
+3. Does a prompt change improve or regress any dimension vs. the
+   previous run?
+
+That is the full scope. No "compare against OpenAI and Anthropic"
+benchmarking, no public scoreboard, no marketing claims.
+
+## Non-goals
+
+- Not a pytest. Requires API credentials, costs money, takes minutes
+  per fixture, and output is LLM-graded and non-deterministic.
+- Not part of `scripts/run_tests.sh`. Not invoked by CI.
+- Not a replacement for the existing compressor unit tests in
+  `tests/agent/test_context_compressor.py` — those stay as the
+  structural / boundary / tool-pair-sanitization guard.
+- Not a general trajectory eval. Scoped to context compaction only.
+
+## Where it lives
+
+```
+scripts/compression_eval/
+├── DESIGN.md                 # this file
+├── README.md                 # how to run, cost expectations, caveats
+├── run_eval.py               # entry point (fire CLI, like sample_and_compress.py)
+├── scrub_fixtures.py         # regenerate fixtures from ~/.hermes/sessions/*.jsonl
+├── fixtures/                 # checked-in scrubbed session snapshots
+│   ├── feature-impl-context-priority.json
+│   ├── debug-session-feishu-id-model.json
+│   └── config-build-competitive-scouts.json
+├── probes/                   # probe banks paired with fixtures
+│   └── <fixture>.probes.json
+├── rubric.py                 # grading prompt + dimension definitions
+├── grader.py                 # judge-model call + score parsing
+├── compressor_driver.py      # thin wrapper over ContextCompressor
+└── results/                  # gitignored; timestamped output per run
+    └── .gitkeep
+```
+
+`scripts/` is the right home: offline tooling, no CI involvement,
+precedent already set by `sample_and_compress.py`,
+`contributor_audit.py`, `discord-voice-doctor.py`.
+
+`environments/` is for Atropos RL training environments — wrong shape.
+`tests/` is hermetic and credential-free — incompatible with a
+probe-based eval that needs a judge model.
+
+## Fixture format
+
+A fixture is a single conversation, long enough to be worth
+compressing, captured from a real session. Stored as JSON
+(pretty-printed, reviewable in PRs):
+
+```json
+{
+  "name": "401-debug",
+  "description": "178-turn session debugging a 401 on /api/auth/login",
+  "model": "anthropic/claude-sonnet-4.6",
+  "context_length": 200000,
+  "messages": [
+    {"role": "system", "content": "..."},
+    {"role": "user", "content": "..."},
+    {"role": "assistant", "content": "...", "tool_calls": [...]},
+    {"role": "tool", "tool_call_id": "...", "content": "..."}
+  ],
+  "notes": "Captured 2026-04-24 from session 20260424_*.jsonl; PII scrubbed; secrets redacted via redact_sensitive_text."
+}
+```
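
A minimal sketch of a structural check for the fixture shape above, not part of this PR. It assumes each entry in `tool_calls` carries an `id` field (the common chat-completions shape) and that `validate_fixture` is a hypothetical helper name; the real loader may check more or different things.

```python
import json

REQUIRED_KEYS = {"name", "description", "model", "context_length", "messages"}
VALID_ROLES = {"system", "user", "assistant", "tool"}


def validate_fixture(fixture: dict) -> list[str]:
    """Return a list of problems; an empty list means the fixture looks well-formed."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(fixture))]
    seen_call_ids = set()
    for index, message in enumerate(fixture.get("messages", [])):
        role = message.get("role")
        if role not in VALID_ROLES:
            problems.append(f"message {index}: unknown role {role!r}")
        # Remember every tool call issued so far (assumed to carry an "id").
        for call in message.get("tool_calls") or []:
            seen_call_ids.add(call.get("id"))
        # A tool result must answer a call that appeared earlier.
        if role == "tool" and message.get("tool_call_id") not in seen_call_ids:
            problems.append(f"message {index}: tool result without a matching call")
    return problems


sample = json.loads("""
{
  "name": "401-debug",
  "description": "example",
  "model": "anthropic/claude-sonnet-4.6",
  "context_length": 200000,
  "messages": [
    {"role": "user", "content": "why 401?"},
    {"role": "assistant", "content": "", "tool_calls": [{"id": "call_1"}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "HTTP 401"},
    {"role": "tool", "tool_call_id": "call_9", "content": "orphan"}
  ]
}
""")
print(validate_fixture(sample))
```

The last message in the sample is deliberately orphaned, which is the same shape the scrubber's orphan-pruning step is meant to remove.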
+
+### Sourcing fixtures
+
+Fixtures are scrubbed snapshots of real sessions from the
+maintainer's `~/.hermes/sessions/*.jsonl` store, generated
+reproducibly by `scrub_fixtures.py` in this directory. Re-run the
+scrubber with `python3 scripts/compression_eval/scrub_fixtures.py`
+to regenerate them after a scrubber change.
+
+Three shipped fixtures cover three different session shapes:
+
+| Fixture | Source shape | Messages | Tokens (rough) | Tests |
+|---|---|---|---|---|
+| `feature-impl-context-priority` | investigate → patch → test → PR → merge | 75 | ~17k | continuation, artifact trail (2 files modified, 1 PR) |
+| `debug-session-feishu-id-model` | PR triage + upstream docs + decision | 59 | ~13k | recall (PR #, error shape), decision (outcome + reason) |
+| `config-build-competitive-scouts` | iterative config: 11 cron jobs across 7 weekdays | 61 | ~23k | artifact trail (which jobs, which days), iterative-merge |
+
+The `~13k-23k` token range is below the default 50%-of-200k
+compression threshold, so the eval will always **force** a
+`compress()` call rather than wait for the natural trigger. That is
+the intended shape — we want a controlled single-shot compression so
+score deltas are attributable to the prompt change, not to whether
+the threshold happened to fire at the same boundary twice.
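
The arithmetic behind the forced call, as a small sketch. The token counts are the rough figures from the table above; the constant names are illustrative and not taken from `agent/context_compressor.py`.

```python
CONTEXT_LENGTH = 200_000
THRESHOLD_FRACTION = 0.5  # the "default 50%-of-200k" threshold from the text

# Rough token counts copied from the fixture table.
FIXTURE_TOKENS = {
    "feature-impl-context-priority": 17_000,
    "debug-session-feishu-id-model": 13_000,
    "config-build-competitive-scouts": 23_000,
}

threshold = int(CONTEXT_LENGTH * THRESHOLD_FRACTION)
would_trigger = {name: tokens >= threshold for name, tokens in FIXTURE_TOKENS.items()}
largest_share = max(FIXTURE_TOKENS.values()) / threshold

print(threshold)       # tokens needed before the natural trigger fires
print(would_trigger)   # per-fixture: would the natural trigger fire?
print(f"{largest_share:.0%} of the threshold for the largest fixture")
```

None of the three fixtures reaches the threshold, so an eval that waited for the natural trigger would never compress anything.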
+
+### Scrubber pipeline
+
+`scrub_fixtures.py` applies, per message:
+
+1. `agent.redact.redact_sensitive_text` — API keys, tokens,
+   connection strings
+2. Username paths: `/home/teknium` → `/home/user`
+3. Personal handles: all case variants of the maintainer name → `user`
+4. Email addresses → `contributor@example.com`; git
+   `Author: Name <addr>` header lines normalised
+5. `<REASONING_SCRATCHPAD>...</REASONING_SCRATCHPAD>` and
+   `<think>...</think>` stripped from assistant content
+6. Messaging-platform user mentions (`<@123456>`, `<@***>`) →
+   `<@user>`
+7. First user message paraphrased to remove personal voice;
+   subsequent user turns kept verbatim after the redactions above
+8. System prompt replaced with a generic public-safe placeholder so
+   we don't check in the maintainer's tuned soul/skills/memory system
+   block
+9. Orphan empty-assistant messages (artifact of scratchpad-only
+   turns) and trailing tool messages with no matching assistant are
+   dropped
+10. Tool outputs longer than 2000 chars are truncated with a size
+    annotation; the compressor sees that the tool was called and
+    returned something but not the full 16KB skill_view or 5KB
+    web_extract body (no signal loss for compression probes)
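
A rough sketch of steps 2, 4, 5, 6 and 10 from the list above, to make the per-message shape concrete. It is not the code in `scrub_fixtures.py`: the regexes, the function names, and the truncation annotation format are all assumptions, and step 1 (`redact_sensitive_text`) is left out because its signature is not shown here.

```python
import re

MAX_TOOL_OUTPUT_CHARS = 2000  # limit from step 10

# Hypothetical patterns; the real scrubber may match more shapes.
HOME_PATH = re.compile(r"/home/[A-Za-z0-9_.-]+")
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
REASONING = re.compile(r"<(REASONING_SCRATCHPAD|think)>.*?</\1>", re.DOTALL)
MENTION = re.compile(r"<@[^>\s]+>")


def scrub_text(text: str) -> str:
    """Apply the role-independent replacements (steps 2, 4, 6)."""
    text = HOME_PATH.sub("/home/user", text)
    text = EMAIL.sub("contributor@example.com", text)
    return MENTION.sub("<@user>", text)


def scrub_message(message: dict) -> dict:
    """Return a scrubbed copy of one message; the input is not mutated."""
    content = message.get("content") or ""
    if message.get("role") == "assistant":
        # Step 5: drop reasoning blocks before anything else sees them.
        content = REASONING.sub("", content).strip()
    content = scrub_text(content)
    if message.get("role") == "tool" and len(content) > MAX_TOOL_OUTPUT_CHARS:
        # Step 10: keep a prefix and annotate how much was cut.
        omitted = len(content) - MAX_TOOL_OUTPUT_CHARS
        content = content[:MAX_TOOL_OUTPUT_CHARS] + f"\n[truncated {omitted} chars]"
    return {**message, "content": content}


example = scrub_message({
    "role": "assistant",
    "content": "<think>secret</think>Edited /home/alice/app.py, "
               "ping <@123456> or alice@corp.io",
})
print(example["content"])
```

Every replacement maps to a value the same pattern would leave in an equivalent state, so re-running the sketch over already-scrubbed text is a no-op.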
+
+Before every fixture PR: grep the fixture for PII patterns. An
+audit is embedded at the bottom of the scrubber as comments.
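
The pre-PR grep could look roughly like the sketch below. The handle list, the email pattern, and the `audit_text` name are placeholders for illustration; the actual audit patterns live in the scrubber's comments and may differ.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ALLOWED_EMAILS = {"contributor@example.com"}  # the only placeholder allowed to remain


def audit_text(text: str, handles: list[str]) -> list[str]:
    """Return sorted findings; an empty list means no known PII pattern matched."""
    findings = []
    for handle in handles:
        # Case-insensitive so all case variants of a handle are caught.
        if re.search(re.escape(handle), text, re.IGNORECASE):
            findings.append(f"handle: {handle}")
    for email in set(EMAIL.findall(text)) - ALLOWED_EMAILS:
        findings.append(f"email: {email}")
    return sorted(findings)


dirty = "Author: Jane <jane@corp.io>, cc contributor@example.com, path /home/ExampleHandle"
print(audit_text(dirty, handles=["examplehandle"]))
print(audit_text("Author: contributor@example.com", handles=["examplehandle"]))
```

Running it over each file in `fixtures/` and failing on any non-empty result would be one way to make the audit repeatable.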
+
+**Fixtures must stay small.** Target <150 KB per fixture, <500 KB
+total for the directory. Current total: ~230 KB across three
+fixtures. Larger sessions are truncated with a
+`truncated_to: <index>` field in the fixture header so the cut is
+reviewable.
+
+## Probe format
+
+One probe file per fixture, so reviewers can see the question bank
+evolve alongside the fixture:
+
+```json
+{
+  "fixture": "401-debug",
+  "probes": [
+    {
+      "id": "recall-error-code",
+      "type": "recall",
+      "question": "What was the original error code and endpoint?",
+      "expected_facts": ["401", "/api/auth/login"]
+    },
+    {
+      "id": "artifact-files-modified",
+      "type": "artifact",
+      "question": "Which files have been modified in this session?",
+      "expected_facts": ["session_store.py", "redis_client.py"]
+    },
+    {
+      "id": "continuation-next-step",
+      "type": "continuation",
+      "question": "What should we do next?",
+      "expected_facts": ["re-run the integration tests", "restart the worker"]
+    },
+    {
+      "id": "decision-redis-approach",
+      "type": "decision",
+      "question": "What did we decide about the Redis issue?",
+      "expected_facts": ["switch to redis-py 5.x", "pooled connection"]
+    }
+  ]
+}
+```
+
+The four probe types come directly from Factory's methodology:
+**recall, artifact, continuation, decision**. `expected_facts` gives
+the grader concrete anchors instead of relying purely on LLM taste.
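
To illustrate what a concrete anchor buys, here is a hypothetical deterministic pre-check that could sit next to the judge call. It is not the grading method proposed in this doc (grading stays with the judge model); `fact_coverage` and its substring matching are assumptions for the sketch.

```python
def fact_coverage(answer: str, expected_facts: list[str]) -> dict:
    """Case-insensitive substring check of each expected fact against an answer."""
    lowered = answer.lower()
    found = [fact for fact in expected_facts if fact.lower() in lowered]
    missing = [fact for fact in expected_facts if fact.lower() not in lowered]
    ratio = len(found) / len(expected_facts) if expected_facts else 1.0
    return {"found": found, "missing": missing, "ratio": ratio}


probe = {
    "id": "artifact-files-modified",
    "expected_facts": ["session_store.py", "redis_client.py"],
}
answer = "So far only session_store.py has been modified."
print(fact_coverage(answer, probe["expected_facts"]))
```

Substring matching works for identifiers like file names and error codes but would be too strict for paraphrasable facts such as "restart the worker", which is one reason the judge model still does the scoring.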
+
+Authoring a probe bank is a one-time cost per fixture. 8-12 probes per
+fixture is the target — enough to cover all four types, few enough to
+grade in under a minute at reasonable cost.
+
+## Grading
+
+Each probe gets scored 0-5 on **six dimensions** (Factory's six):
+
+| Dimension             | What it measures                                    |
+|-----------------------|-----------------------------------------------------|
+| accuracy              | File paths, function names, error codes are correct |
+| context_awareness     | Reflects current state, not a mid-session snapshot  |
+| artifact_trail        | Knows which files were read / modified / created    |
+| completeness          | Addresses all parts of the probe                    |
+| continuity            | Agent can continue without re-fetching              |
+| instruction_following | Probe answered in the requested form                |
+
+Grading is done by a single judge-model call per probe with a
+deterministic rubric prompt (see `rubric.py`). The rubric includes the
+`expected_facts` list so the judge has a concrete anchor. Default
+judge model: whatever the user has configured as their main model at
+run time (same resolution path as `auxiliary_client.call_llm`). A
+`--judge-model` flag allows overriding for consistency across runs.
+
+Non-determinism caveat: two runs of the same fixture will produce
+different scores. A single run means nothing. Report medians over
+N=3 runs by default, and require an improvement of >=0.3 on any
+dimension before claiming a prompt change is a win.
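
The median-and-threshold rule can be sketched as follows. The score values are made up for illustration, and treating a drop of 0.3 or more as a regression is an assumption; the text above only defines the threshold for claiming a win.

```python
from statistics import median

DIMENSIONS = [
    "accuracy", "context_awareness", "artifact_trail",
    "completeness", "continuity", "instruction_following",
]
MIN_DELTA = 0.3  # improvement required before claiming a win


def median_scores(runs: list[dict]) -> dict:
    """Per-dimension median across runs of the same fixture."""
    return {dim: median(run[dim] for run in runs) for dim in DIMENSIONS}


def compare(baseline_runs: list[dict], candidate_runs: list[dict]) -> dict:
    base, cand = median_scores(baseline_runs), median_scores(candidate_runs)
    verdicts = {}
    for dim in DIMENSIONS:
        # Round first so float noise cannot flip a borderline verdict.
        delta = round(cand[dim] - base[dim], 2)
        if delta >= MIN_DELTA:
            verdict = "improved"
        elif delta <= -MIN_DELTA:  # symmetric threshold is an assumption
            verdict = "regressed"
        else:
            verdict = "within noise"
        verdicts[dim] = (delta, verdict)
    return verdicts


def run(**overrides) -> dict:
    """Made-up run: 4.0 on every dimension unless overridden."""
    return {**{dim: 4.0 for dim in DIMENSIONS}, **overrides}


baseline = [run(artifact_trail=2.5), run(artifact_trail=2.0), run(artifact_trail=3.0)]
candidate = [run(artifact_trail=3.0), run(artifact_trail=3.5), run(artifact_trail=2.9)]
verdicts = compare(baseline, candidate)
print(verdicts["artifact_trail"], verdicts["accuracy"])
```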
+
+## Run flow
+
+```
+python scripts/compression_eval/run_eval.py [OPTIONS]
+```
+
+Options (fire-style, mirroring `sample_and_compress.py`):
+
+| Flag                   | Default    | Purpose                                   |
+|------------------------|------------|-------------------------------------------|
+| `--fixtures`           | all        | Comma-separated fixture names             |
+| `--runs`               | 3          | Runs per fixture (for median)             |
+| `--judge-model`        | auto       | Override judge model                      |
+| `--compressor-model`   | auto       | Override model used *inside* the compressor |
+| `--label`              | timestamp  | Subdirectory under `results/`             |
+| `--focus-topic`        | none       | Pass-through to `compress(focus_topic=)`  |
+| `--compare-to`         | none       | Path to a previous run for diff output    |
+
+Steps per fixture per run:
+
+1. Load fixture JSON and probe bank.
+2. Construct a `ContextCompressor` against the fixture's model.
+3. Call `compressor.compress(messages)` — capture the compressed
+   message list.
+4. For each probe: ask the judge model to role-play as the continuing
+   agent with only the compressed state, then grade the answer on the
+   six dimensions using `rubric.py`.
+5. Write a per-run JSON to `results/<label>/<fixture>-run-N.json`.
+6. After all runs, emit a markdown summary to
+   `results/<label>/report.md`.
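
Steps 3 to 5 above could be wired roughly like this. The sketch takes the compressor and the judge as injected callables so it does not guess at the `ContextCompressor` constructor or the judge-model client; `run_fixture` and the per-run JSON layout are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def run_fixture(
    fixture: dict,
    probes: list[dict],
    compress: Callable[[list[dict]], list[dict]],
    judge: Callable[[list[dict], dict], dict],
    results_dir: Path,
    label: str,
    runs: int = 3,
) -> list[Path]:
    """Compress once per run, score every probe, write one JSON file per run."""
    out_dir = Path(results_dir) / label
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for run_index in range(1, runs + 1):
        compressed = compress(fixture["messages"])
        scores = {probe["id"]: judge(compressed, probe) for probe in probes}
        path = out_dir / f"{fixture['name']}-run-{run_index}.json"
        path.write_text(json.dumps({"run": run_index, "scores": scores}, indent=2))
        written.append(path)
    return written


# Stubs so the sketch runs offline; a real run would call the compressor and a judge model.
fixture = {"name": "401-debug", "messages": [{"role": "user", "content": "hi"}]}
probes = [{"id": "recall-error-code", "question": "?", "expected_facts": ["401"]}]
with tempfile.TemporaryDirectory() as tmp:
    paths = run_fixture(
        fixture,
        probes,
        compress=lambda messages: messages[-1:],
        judge=lambda compressed, probe: {"accuracy": 4},
        results_dir=Path(tmp),
        label="demo",
    )
    names = [path.name for path in paths]
    first = json.loads(paths[0].read_text())
print(names)
```

Injecting the two callables also keeps the loop testable without credentials, even though the eval as a whole is not.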
+
+## Report format
+
+Pasted verbatim into PR descriptions that touch the compressor:
+
+```
+## Compression eval — label 2026-04-25_13-40-02
+
+Main model: anthropic/claude-sonnet-4.6   Judge: same
+3 runs per fixture, medians reported.
+
+| Fixture        | Accuracy | Context | Artifact | Complete | Continuity | Instruction | Overall |
+|----------------|----------|---------|----------|----------|------------|-------------|---------|
+| 401-debug      | 4.1      | 4.0     | 2.5      | 4.3      | 3.8        | 5.0         | 3.95    |
+| pr-review      | 3.9      | 3.8     | 3.1      | 4.2      | 3.9        | 5.0         | 3.98    |
+| feature-impl   | 4.0      | 3.9     | 2.9      | 4.1      | 4.0        | 5.0         | 3.98    |
+
+Per-probe misses (score < 3.0):
+- 401-debug / artifact-files-modified: 1.7 — summary dropped redis_client.py
+- pr-review / decision-auth-rewrite: 2.3 — outcome captured, reasoning dropped
+```
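
A sketch of how one table row could be produced. It assumes the Overall column is the unweighted mean of the six dimension medians, which is consistent with the sample rows above but is not stated anywhere in this doc; `report_row` is a hypothetical name.

```python
from statistics import fmean

COLUMNS = [
    "accuracy", "context_awareness", "artifact_trail",
    "completeness", "continuity", "instruction_following",
]


def report_row(fixture: str, medians: dict) -> str:
    """Render one markdown row: six medians to 1 decimal, overall to 2."""
    overall = fmean(medians[column] for column in COLUMNS)  # assumed: unweighted mean
    cells = [f"{medians[column]:.1f}" for column in COLUMNS] + [f"{overall:.2f}"]
    return "| " + " | ".join([fixture] + cells) + " |"


medians = {
    "accuracy": 4.1, "context_awareness": 4.0, "artifact_trail": 2.5,
    "completeness": 4.3, "continuity": 3.8, "instruction_following": 5.0,
}
print(report_row("401-debug", medians))
```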
+
+## Cost expectations
+
+Dominated by the judge calls: 3 fixtures × 10 probes × 3 runs =
+90 judge calls per eval run. On Claude Sonnet 4.6 that is roughly
+$0.50-$1.50 per full eval depending on probe length. The compressor
+itself makes 1 call per fixture × 3 runs = 9 additional calls.
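
The call counts as a parameterised sketch, useful for seeing how the bill scales when a probe bank grows. It assumes one judge call per probe per run and one compressor call per fixture per run, as stated above; `call_counts` is an illustrative name.

```python
def call_counts(fixtures: int, probes_per_fixture: int, runs: int) -> dict:
    """Count model calls for one full eval."""
    judge_calls = fixtures * probes_per_fixture * runs  # one per probe per run
    compressor_calls = fixtures * runs                  # one per fixture per run
    return {
        "judge": judge_calls,
        "compressor": compressor_calls,
        "total": judge_calls + compressor_calls,
    }


print(call_counts(fixtures=3, probes_per_fixture=10, runs=3))
print(call_counts(fixtures=3, probes_per_fixture=12, runs=3))
```

At the upper end of the 8-12 probe target the judge-call count rises from 90 to 108 for the same three fixtures.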
+
+**This is not a check to run after every commit.** It is a
+before-merge check for PRs that touch:
+
+- `agent/context_compressor.py` — any change to `_template_sections`,
+  `_generate_summary`, or `compress()`.
+- `agent/auxiliary_client.py` — when changing how compression tasks
+  are routed.
+- `agent/prompt_builder.py` — when the compression-note phrasing
+  changes.
+
+## Open questions (to resolve before implementing)
+
+1. **Fixture scrubbing: manual or scripted?** A scripted scrub that
+   also replaces project names / hostnames would lower the cost of
+   contributing a new fixture. Risk: over-aggressive replacement
+   destroys the signal the probe depends on. Propose: start manual,
+   add scripted helpers once we have 3 fixtures and know the common
+   PII shapes.
+
+2. **Judge model selection.** Factory uses GPT-5.2. We can't pin one
+   because the user's main model changes. Options: (a) grade with the
+   main model (cheap, inconsistent across users), (b) require a
+   specific judge model (e.g. `claude-sonnet-4.6`; consistent, but
+   unavailable to users without access). Propose (a) with a
+   `--judge-model` override, and make the model name prominent in the
+   report so comparisons across machines are legible.
+
+3. **Noise floor.** Before landing prompt changes, run the current
+   prompt N=10 times to measure per-dimension stddev. That tells us
+   the minimum delta to call a change significant. Suspect 0.2-0.3 on
+   a 0-5 scale. Decision deferred until after the first fixture is
+   landed.
+
+4. **Iterative-merge coverage.** The real Factory-vs-Anthropic
+   difference is incremental merge vs. regenerate. A fixture that
+   only compresses once doesn't exercise our iterative path. Add a
+   fourth fixture that forces two compressions (manually chained),
+   with probes that test whether information from the first
+   compression survives the second. Deferred to a follow-up PR.
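
For the noise-floor measurement in question 3, the per-dimension standard deviation over repeated runs of an unchanged prompt could be computed as below. The ten scores are invented for illustration and say nothing about the real spread; `noise_floor` is a hypothetical helper.

```python
from statistics import stdev


def noise_floor(runs: list[dict]) -> dict:
    """Sample standard deviation per dimension across runs of the same prompt."""
    dimensions = list(runs[0])
    return {dim: round(stdev([run[dim] for run in runs]), 2) for dim in dimensions}


# Made-up scores for N=10 runs of one fixture; two dimensions shown for brevity.
artifact = [4.0, 4.2, 3.8, 4.1, 3.9, 4.0, 4.3, 3.7, 4.0, 4.0]
runs = [{"accuracy": 4.0, "artifact_trail": score} for score in artifact]
floor = noise_floor(runs)
print(floor)
```

Comparing these values with the 0.3 threshold from the Grading section would show whether that threshold sits above the run-to-run spread for each dimension.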
+
+## Implementation order
+
+This PR: design doc + scaffolding + **three checked-in fixtures** +
+scrubber script. `run_eval.py` is still a placeholder that prints a
+pointer to DESIGN.md.
+
+Follow-ups, each a separate PR:
+
+1. Probe banks for the three fixtures (~8-12 probes each), plus
+   `rubric.py` + `grader.py` + `compressor_driver.py`. Enough to
+   produce a full report.
+2. Wire results output, `--compare-to` diff mode, and the report
+   markdown template.
+3. Iterative-merge fixture (two chained compressions) + follow-ups
+   from the open questions.
+
+Each follow-up is independently useful.