From 9f5c13f874589d22a9f63eb2ac3d45a7bfcebb27 Mon Sep 17 00:00:00 2001 From: Teknium Date: Fri, 24 Apr 2026 07:21:09 -0700 Subject: [PATCH] =?UTF-8?q?design:=20compression=20eval=20harness=20?= =?UTF-8?q?=E2=80=94=20add=20three=20scrubbed=20fixtures=20+=20scrubber?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds scripts/compression_eval/ with a design doc, README, a placeholder run_eval.py, and three checked-in scrubbed session fixtures. No working eval yet — this PR is for design review before implementation. Motivation: we edit agent/context_compressor.py prompts and _template_sections by hand and ship without any automated check that compression still preserves file paths, error codes, or the active task. Factory.ai's Dec 2025 write-up (https://factory.ai/news/evaluating-compression) documents a probe-based eval scored on six dimensions. We adopt the methodology; we do not publish scores. Contents: - DESIGN.md — fixture format, probe format (recall / artifact / continuation / decision), six grading dimensions, report format, cost expectations, scrubber pipeline, open questions, and staged follow-up PR plan. - README.md — short 'what this is / when to run it' page. - run_eval.py — placeholder that prints 'not implemented, see DESIGN.md' and exits 1. - scrub_fixtures.py — reproducible pipeline that converts real sessions from ~/.hermes/sessions/*.jsonl into public-safe JSON fixtures. Applies: redact_sensitive_text, username path normalization, personal handle scrubbing, email and git-author normalization, reasoning scratchpad stripping, platform user-mention scrubbing, first-user paraphrase, system-prompt placeholder, orphan-message pruning, and tool-output size truncation for fixture readability. - fixtures/feature-impl-context-priority.json — 75 msgs / ~17k tokens. Investigate → patch → test → PR → merge. 
- fixtures/config-build-competitive-scouts.json — 61 msgs / ~23k tokens. Iterative config accumulation (11 cron jobs across 7 weekdays). PII audit: zero matches across the three fixtures for the maintainer's handle (all case variants), personal email domains, and known contributor emails. Only the 'contributor@example.com' placeholder remains. Why scripts/: requires API credentials, costs ~$1 per run, is LLM-graded (non-deterministic), and must not run in CI. scripts/sample_and_compress.py is the existing precedent for offline credentialed tooling. --- .gitignore | 4 + scripts/compression_eval/DESIGN.md | 341 +++++++++ scripts/compression_eval/README.md | 59 ++ .../config-build-competitive-scouts.json | 536 ++++++++++++++ .../debug-session-feishu-id-model.json | 626 ++++++++++++++++ .../feature-impl-context-priority.json | 678 ++++++++++++++++++ scripts/compression_eval/probes/.gitkeep | 0 scripts/compression_eval/results/.gitkeep | 0 scripts/compression_eval/run_eval.py | 28 + scripts/compression_eval/scrub_fixtures.py | 370 ++++++++++ 10 files changed, 2642 insertions(+) create mode 100644 scripts/compression_eval/DESIGN.md create mode 100644 scripts/compression_eval/README.md create mode 100644 scripts/compression_eval/fixtures/config-build-competitive-scouts.json create mode 100644 scripts/compression_eval/fixtures/debug-session-feishu-id-model.json create mode 100644 scripts/compression_eval/fixtures/feature-impl-context-priority.json create mode 100644 scripts/compression_eval/probes/.gitkeep create mode 100644 scripts/compression_eval/results/.gitkeep create mode 100755 scripts/compression_eval/run_eval.py create mode 100755 scripts/compression_eval/scrub_fixtures.py diff --git a/.gitignore b/.gitignore index 72f3bd17f..08585c9ce 100644 --- a/.gitignore +++ b/.gitignore @@ -52,6 +52,10 @@ ignored/ .worktrees/ environments/benchmarks/evals/ +# Compression eval run outputs (harness lives in scripts/compression_eval/) +scripts/compression_eval/results/* 
+!scripts/compression_eval/results/.gitkeep + # Web UI build output hermes_cli/web_dist/ diff --git a/scripts/compression_eval/DESIGN.md b/scripts/compression_eval/DESIGN.md new file mode 100644 index 000000000..a2d47ae2d --- /dev/null +++ b/scripts/compression_eval/DESIGN.md @@ -0,0 +1,341 @@ +# Compression Eval — Design + +Status: proposal. Nothing under `scripts/compression_eval/` runs in CI. +This is an offline tool authors run before merging prompt or algorithm +changes to `agent/context_compressor.py`. + +## Why + +We tune the compressor prompt and the `_template_sections` checklist by +hand, ship, and wait for the next real session to notice regressions. +There is no automated check that a prompt edit still preserves file +paths, error messages, or the active task across a compression. + +Factory.ai's December 2025 write-up +(https://factory.ai/news/evaluating-compression) describes a +probe-based eval that scores compressed state on six dimensions. The +methodology is the valuable part — the benchmarks in the post are a +marketing piece. We adopt the methodology and discard the scoreboard. + +## Goal + +Given a real session transcript and a bank of probe questions that +exercise what the transcript contained, answer: + +1. After `ContextCompressor.compress()` runs, can the agent still + answer each probe correctly from the compressed state? +2. Which of the six dimensions (accuracy, context awareness, artifact + trail, completeness, continuity, instruction following) is the + prompt weakest on? +3. Does a prompt change improve or regress any dimension vs. the + previous run? + +That is the full scope. No "compare against OpenAI and Anthropic" +benchmarking, no public scoreboard, no marketing claims. + +## Non-goals + +- Not a pytest. Requires API credentials, costs money, takes minutes + per fixture, and output is LLM-graded and non-deterministic. +- Not part of `scripts/run_tests.sh`. Not invoked by CI. 
+- Not a replacement for the existing compressor unit tests in + `tests/agent/test_context_compressor.py` — those stay as the + structural / boundary / tool-pair-sanitization guard. +- Not a general trajectory eval. Scoped to context compaction only. + +## Where it lives + +``` +scripts/compression_eval/ +├── DESIGN.md # this file +├── README.md # how to run, cost expectations, caveats +├── run_eval.py # entry point (fire CLI, like sample_and_compress.py) +├── scrub_fixtures.py # regenerate fixtures from ~/.hermes/sessions/*.jsonl +├── fixtures/ # checked-in scrubbed session snapshots +│ ├── feature-impl-context-priority.json +│ ├── debug-session-feishu-id-model.json +│ └── config-build-competitive-scouts.json +├── probes/ # probe banks paired with fixtures +│ └── <fixture>.probes.json +├── rubric.py # grading prompt + dimension definitions +├── grader.py # judge-model call + score parsing +├── compressor_driver.py # thin wrapper over ContextCompressor +└── results/ # gitignored; timestamped output per run + └── .gitkeep +``` + +`scripts/` is the right home: offline tooling, no CI involvement, +precedent already set by `sample_and_compress.py`, +`contributor_audit.py`, `discord-voice-doctor.py`. + +`environments/` is for Atropos RL training environments — wrong shape. +`tests/` is hermetic and credential-free — incompatible with a +probe-based eval that needs a judge model. + +## Fixture format + +A fixture is a single complete conversation captured from a +real session. 
Stored as JSON (pretty-printed, reviewable in PRs): + +```json +{ + "name": "401-debug", + "description": "178-turn session debugging a 401 on /api/auth/login", + "model": "anthropic/claude-sonnet-4.6", + "context_length": 200000, + "messages": [ + {"role": "system", "content": "..."}, + {"role": "user", "content": "..."}, + {"role": "assistant", "content": "...", "tool_calls": [...]}, + {"role": "tool", "tool_call_id": "...", "content": "..."} + ], + "notes": "Captured 2026-04-24 from session 20260424_*.jsonl; PII scrubbed; secrets redacted via redact_sensitive_text." +} +``` + +### Sourcing fixtures + +Fixtures are scrubbed snapshots of real sessions from the +maintainer's `~/.hermes/sessions/*.jsonl` store, generated +reproducibly by `scrub_fixtures.py` in this directory. Re-run the +scrubber with `python3 scripts/compression_eval/scrub_fixtures.py` +to regenerate them after a scrubber change. + +Three shipped fixtures cover three different session shapes: + +| Fixture | Source shape | Messages | Tokens (rough) | Tests | +|---|---|---|---|---| +| `feature-impl-context-priority` | investigate → patch → test → PR → merge | 75 | ~17k | continuation, artifact trail (2 files modified, 1 PR) | +| `debug-session-feishu-id-model` | PR triage + upstream docs + decision | 59 | ~13k | recall (PR #, error shape), decision (outcome + reason) | +| `config-build-competitive-scouts` | iterative config: 11 cron jobs across 7 weekdays | 61 | ~23k | artifact trail (which jobs, which days), iterative-merge | + +The `~17k-23k` token range is below the default 50%-of-200k +compression threshold, so the eval will always **force** a +`compress()` call rather than wait for the natural trigger. That is +the intended shape — we want a controlled single-shot compression so +score deltas are attributable to the prompt change, not to whether +the threshold happened to fire at the same boundary twice. + +### Scrubber pipeline + +`scrub_fixtures.py` applies, per message: + +1. 
`agent.redact.redact_sensitive_text` — API keys, tokens, + connection strings +2. Username paths: `/home/teknium` → `/home/user` +3. Personal handles: all case variants of the maintainer name → `user` +4. Email addresses → `contributor@example.com`; git + `Author: Name <email>` header lines normalised +5. Reasoning scratchpad blocks stripped from assistant content +6. Messaging-platform user mentions (`<@123456>`, `<@***>`) → + `<@user>` +7. First user message paraphrased to remove personal voice; + subsequent user turns kept verbatim after the redactions above +8. System prompt replaced with a generic public-safe placeholder so + we don't check in the maintainer's tuned soul/skills/memory system + block +9. Orphan empty-assistant messages (artifact of scratchpad-only + turns) and trailing tool messages with no matching assistant are + dropped +10. Tool outputs longer than 2000 chars are truncated with a size + annotation; the compressor sees that the tool was called and + returned something but not the full 16KB skill_view or 5KB + web_extract body (no signal loss for compression probes) + +Before every fixture PR: grep the fixture for PII patterns. An +audit is embedded at the bottom of the scrubber as comments. + +**Fixtures must stay small.** Target <150 KB per fixture, <500 KB +total for the directory. Current total: ~230 KB across three +fixtures. Larger sessions are truncated with a +`truncated_to` field in the fixture header so the cut is +reviewable. 
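The per-message scrub chain above can be sketched roughly as follows. This is an illustrative standalone sketch, not the real `scrub_fixtures.py`: the regexes, the `username` parameter, and `scrub_message` itself are assumptions, and the secret-redaction step (which the real pipeline delegates to `agent.redact.redact_sensitive_text`) is omitted.

```python
import re

MAX_TOOL_OUTPUT = 2000  # chars; longer tool outputs get truncated


def scrub_message(msg: dict, username: str = "teknium") -> dict:
    """Apply path, handle, email, and mention scrubbing to one message."""
    text = msg.get("content") or ""
    # Username paths: /home/<name> -> /home/user
    text = text.replace(f"/home/{username}", "/home/user")
    # Personal handles, any case variant -> "user"
    text = re.sub(re.escape(username), "user", text, flags=re.IGNORECASE)
    # Email addresses -> placeholder
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "contributor@example.com", text)
    # Platform user mentions -> <@user>
    text = re.sub(r"<@[^>]*>", "<@user>", text)
    # Oversized tool outputs: truncate, keep a size annotation
    if msg.get("role") == "tool" and len(text) > MAX_TOOL_OUTPUT:
        text = text[:MAX_TOOL_OUTPUT] + f"\n[truncated: {len(text)} chars total]"
    return {**msg, "content": text}
```

Order matters: the path rewrite runs before the handle rewrite so `/home/teknium` becomes `/home/user` rather than `/home/user`-plus-a-stray-handle hit, and truncation runs last so the size annotation reflects the scrubbed text.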
+ +## Probe format + +One probe file per fixture, so reviewers can see the question bank +evolve alongside the fixture: + +```json +{ + "fixture": "401-debug", + "probes": [ + { + "id": "recall-error-code", + "type": "recall", + "question": "What was the original error code and endpoint?", + "expected_facts": ["401", "/api/auth/login"] + }, + { + "id": "artifact-files-modified", + "type": "artifact", + "question": "Which files have been modified in this session?", + "expected_facts": ["session_store.py", "redis_client.py"] + }, + { + "id": "continuation-next-step", + "type": "continuation", + "question": "What should we do next?", + "expected_facts": ["re-run the integration tests", "restart the worker"] + }, + { + "id": "decision-redis-approach", + "type": "decision", + "question": "What did we decide about the Redis issue?", + "expected_facts": ["switch to redis-py 5.x", "pooled connection"] + } + ] +} +``` + +The four probe types come directly from Factory's methodology: +**recall, artifact, continuation, decision**. `expected_facts` gives +the grader concrete anchors instead of relying purely on LLM taste. + +Authoring a probe bank is a one-time cost per fixture. 8-12 probes per +fixture is the target — enough to cover all four types, few enough to +grade in under a minute at reasonable cost. 
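A probe bank in this format also lends itself to a cheap structural pre-check before spending any judge-model calls. A minimal sketch, assuming a hypothetical `validate_probe_bank` helper that is not part of the proposed file layout:

```python
# Illustrative pre-flight check for a probe bank: all four probe types
# covered, 8-12 probes total, and every probe carries expected_facts.
PROBE_TYPES = {"recall", "artifact", "continuation", "decision"}


def validate_probe_bank(bank: dict) -> list[str]:
    """Return a list of problems; an empty list means the bank looks sane."""
    problems = []
    probes = bank.get("probes", [])
    if not 8 <= len(probes) <= 12:
        problems.append(f"expected 8-12 probes, got {len(probes)}")
    missing = PROBE_TYPES - {p.get("type") for p in probes}
    if missing:
        problems.append(f"probe types not covered: {sorted(missing)}")
    for p in probes:
        if not p.get("expected_facts"):
            problems.append(f"probe {p.get('id')!r} has no expected_facts")
    return problems
```

Running this in `run_eval.py` before compression would turn a malformed probe bank into a fast failure instead of a wasted graded run.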
+ +## Grading + +Each probe gets scored 0-5 on **six dimensions** (Factory's six): + +| Dimension | What it measures | +|-----------------------|-----------------------------------------------------| +| accuracy | File paths, function names, error codes are correct | +| context_awareness | Reflects current state, not a mid-session snapshot | +| artifact_trail | Knows which files were read / modified / created | +| completeness | Addresses all parts of the probe | +| continuity | Agent can continue without re-fetching | +| instruction_following | Probe answered in the requested form | + +Grading is done by a single judge-model call per probe with a +deterministic rubric prompt (see `rubric.py`). The rubric includes the +`expected_facts` list so the judge has a concrete anchor. Default +judge model: whatever the user has configured as their main model at +run time (same resolution path as `auxiliary_client.call_llm`). A +`--judge-model` flag allows overriding for consistency across runs. + +Non-determinism caveat: two runs of the same fixture will produce +different scores. A single run means nothing. Report medians over +N=3 runs by default, and require an improvement of >=0.3 on any +dimension before claiming a prompt change is a win. + +## Run flow + +``` +python scripts/compression_eval/run_eval.py [OPTIONS] +``` + +Options (fire-style, mirroring `sample_and_compress.py`): + +| Flag | Default | Purpose | +|------------------------|------------|-------------------------------------------| +| `--fixtures` | all | Comma-separated fixture names | +| `--runs` | 3 | Runs per fixture (for median) | +| `--judge-model` | auto | Override judge model | +| `--compressor-model` | auto | Override model used *inside* the compressor | +| `--label` | timestamp | Subdirectory under `results/` | +| `--focus-topic` | none | Pass-through to `compress(focus_topic=)` | +| `--compare-to` | none | Path to a previous run for diff output | + +Steps per fixture per run: + +1. 
Load fixture JSON and probe bank. +2. Construct a `ContextCompressor` against the fixture's model. +3. Call `compressor.compress(messages)` — capture the compressed + message list. +4. For each probe: ask the judge model to role-play as the continuing + agent with only the compressed state, then grade the answer on the + six dimensions using `rubric.py`. +5. Write a per-run JSON to `results/