Adds `scripts/compression_eval/` with a design doc, README, a placeholder `run_eval.py`, and three checked-in scrubbed session fixtures. No working eval yet — this PR is for design review before implementation.

Motivation: we edit `agent/context_compressor.py` prompts and `_template_sections` by hand and ship without any automated check that compression still preserves file paths, error codes, or the active task. Factory.ai's Dec 2025 write-up (https://factory.ai/news/evaluating-compression) documents a probe-based eval scored on six dimensions. We adopt the methodology; we do not publish scores.

Contents:
- DESIGN.md — fixture format, probe format (recall / artifact / continuation / decision), six grading dimensions, report format, cost expectations, scrubber pipeline, open questions, and staged follow-up PR plan.
- README.md — short "what this is / when to run it" page.
- run_eval.py — placeholder that prints "not implemented, see DESIGN.md" and exits 1.
- scrub_fixtures.py — reproducible pipeline that converts real sessions from ~/.hermes/sessions/*.jsonl into public-safe JSON fixtures. Applies: redact_sensitive_text, username path normalization, personal handle scrubbing, email and git-author normalization, reasoning scratchpad / `<think>` stripping, platform user-mention scrubbing, first-user paraphrase, system-prompt placeholder, orphan-message pruning, and tool-output size truncation for fixture readability.
- fixtures/feature-impl-context-priority.json — 75 msgs / ~17k tokens. Investigate → patch → test → PR → merge.
- fixtures/debug-session-feishu-id-model.json — 59 msgs / ~13k tokens. PR triage + upstream docs + decision.
- fixtures/config-build-competitive-scouts.json — 61 msgs / ~23k tokens. Iterative config accumulation (11 cron jobs across 7 weekdays).

PII audit: zero matches across the three fixtures for the maintainer's handle (all case variants), personal email domains, and known contributor emails. Only the 'contributor@example.com' placeholder remains.

Why scripts/: requires API credentials, costs ~$1 per run, LLM-graded (non-deterministic), must not run in CI. scripts/sample_and_compress.py is the existing precedent for offline credentialed tooling.
# Compression Eval — Design
Status: proposal. Nothing under scripts/compression_eval/ runs in CI.
This is an offline tool authors run before merging prompt or algorithm
changes to agent/context_compressor.py.
## Why
We tune the compressor prompt and the _template_sections checklist by
hand, ship, and wait for the next real session to notice regressions.
There is no automated check that a prompt edit still preserves file
paths, error messages, or the active task across a compression.
Factory.ai's December 2025 write-up (https://factory.ai/news/evaluating-compression) describes a probe-based eval that scores compressed state on six dimensions. The methodology is the valuable part — the benchmarks in the post are a marketing piece. We adopt the methodology and discard the scoreboard.
## Goal
Given a real session transcript and a bank of probe questions that exercise what the transcript contained, answer:
- After `ContextCompressor.compress()` runs, can the agent still answer each probe correctly from the compressed state?
- Which of the six dimensions (accuracy, context awareness, artifact trail, completeness, continuity, instruction following) is the prompt weakest on?
- Does a prompt change improve or regress any dimension vs. the previous run?
That is the full scope. No "compare against OpenAI and Anthropic" benchmarking, no public scoreboard, no marketing claims.
## Non-goals

- Not a pytest. Requires API credentials, costs money, takes minutes per fixture, and output is LLM-graded and non-deterministic.
- Not part of `scripts/run_tests.sh`. Not invoked by CI.
- Not a replacement for the existing compressor unit tests in `tests/agent/test_context_compressor.py` — those stay as the structural / boundary / tool-pair-sanitization guard.
- Not a general trajectory eval. Scoped to context compaction only.
## Where it lives

```
scripts/compression_eval/
├── DESIGN.md             # this file
├── README.md             # how to run, cost expectations, caveats
├── run_eval.py           # entry point (fire CLI, like sample_and_compress.py)
├── scrub_fixtures.py     # regenerate fixtures from ~/.hermes/sessions/*.jsonl
├── fixtures/             # checked-in scrubbed session snapshots
│   ├── feature-impl-context-priority.json
│   ├── debug-session-feishu-id-model.json
│   └── config-build-competitive-scouts.json
├── probes/               # probe banks paired with fixtures
│   └── <fixture>.probes.json
├── rubric.py             # grading prompt + dimension definitions
├── grader.py             # judge-model call + score parsing
├── compressor_driver.py  # thin wrapper over ContextCompressor
└── results/              # gitignored; timestamped output per run
    └── .gitkeep
```
scripts/ is the right home: offline tooling, no CI involvement,
precedent already set by sample_and_compress.py,
contributor_audit.py, discord-voice-doctor.py.
environments/ is for Atropos RL training environments — wrong shape.
tests/ is hermetic and credential-free — incompatible with a
probe-based eval that needs a judge model.
## Fixture format

A fixture is a single conversation captured from a real session, substantial enough to be worth compressing. Stored as JSON (pretty-printed, reviewable in PRs):
```json
{
  "name": "401-debug",
  "description": "178-turn session debugging a 401 on /api/auth/login",
  "model": "anthropic/claude-sonnet-4.6",
  "context_length": 200000,
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "...", "tool_calls": [...]},
    {"role": "tool", "tool_call_id": "...", "content": "..."}
  ],
  "notes": "Captured 2026-04-24 from session 20260424_*.jsonl; PII scrubbed; secrets redacted via redact_sensitive_text."
}
```
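Not part of this PR, but as a rough sketch of what loading and validating a fixture of this shape could look like (the `load_fixture` helper and its checks are hypothetical, not an existing module):

```python
# Hypothetical fixture loader; field names mirror the format above,
# but this helper is an illustration, not code that ships in this PR.
import json
from pathlib import Path

REQUIRED_KEYS = {"name", "description", "model", "context_length", "messages"}
VALID_ROLES = {"system", "user", "assistant", "tool"}

def load_fixture(path: Path) -> dict:
    fixture = json.loads(path.read_text())
    missing = REQUIRED_KEYS - fixture.keys()
    if missing:
        raise ValueError(f"{path.name}: missing keys {sorted(missing)}")
    for i, msg in enumerate(fixture["messages"]):
        if msg.get("role") not in VALID_ROLES:
            raise ValueError(f"{path.name}: message {i} has unexpected role {msg.get('role')!r}")
    return fixture
```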
## Sourcing fixtures

Fixtures are scrubbed snapshots of real sessions from the maintainer's `~/.hermes/sessions/*.jsonl` store, generated reproducibly by `scrub_fixtures.py` in this directory. Re-run the scrubber with `python3 scripts/compression_eval/scrub_fixtures.py` to regenerate them after a scrubber change.
Three shipped fixtures cover three different session shapes:
| Fixture | Source shape | Messages | Tokens (rough) | Tests |
|---|---|---|---|---|
| `feature-impl-context-priority` | investigate → patch → test → PR → merge | 75 | ~17k | continuation, artifact trail (2 files modified, 1 PR) |
| `debug-session-feishu-id-model` | PR triage + upstream docs + decision | 59 | ~13k | recall (PR #, error shape), decision (outcome + reason) |
| `config-build-competitive-scouts` | iterative config: 11 cron jobs across 7 weekdays | 61 | ~23k | artifact trail (which jobs, which days), iterative-merge |
The ~17k-23k token range is below the default 50%-of-200k
compression threshold, so the eval will always force a
compress() call rather than wait for the natural trigger. That is
the intended shape — we want a controlled single-shot compression so
score deltas are attributable to the prompt change, not to whether
the threshold happened to fire at the same boundary twice.
## Scrubber pipeline

`scrub_fixtures.py` applies, per message:

- `agent.redact.redact_sensitive_text` — API keys, tokens, connection strings
- Username paths: `/home/teknium` → `/home/user`
- Personal handles: all case variants of the maintainer name → `user`
- Email addresses → `contributor@example.com`; git `Author: Name <addr>` header lines normalised
- `<REASONING_SCRATCHPAD>...</REASONING_SCRATCHPAD>` and `<think>...</think>` stripped from assistant content
- Messaging-platform user mentions (`<@123456>`, `<@***>`) → `<@user>`
- First user message paraphrased to remove personal voice; subsequent user turns kept verbatim after the redactions above
- System prompt replaced with a generic public-safe placeholder so we don't check in the maintainer's tuned soul/skills/memory system block
- Orphan empty-assistant messages (artifact of scratchpad-only turns) and trailing tool messages with no matching assistant are dropped
- Tool outputs longer than 2000 chars are truncated with a size annotation; the compressor sees that the tool was called and returned something but not the full 16 KB skill_view or 5 KB web_extract body (no signal loss for compression probes)
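A minimal sketch of how the per-message pass could compose, assuming `agent.redact.redact_sensitive_text` as named above; the regex patterns and helper names here are simplified stand-ins, not the actual scrub_fixtures.py implementation:

```python
# Illustrative per-message scrub pass. Only redact_sensitive_text is a real
# project function (agent.redact); the patterns below are simplified stand-ins.
import re

from agent.redact import redact_sensitive_text

MAX_TOOL_OUTPUT = 2000  # chars; longer tool outputs get truncated with an annotation

def scrub_text(text: str) -> str:
    text = redact_sensitive_text(text)                      # keys, tokens, connection strings
    text = text.replace("/home/teknium", "/home/user")      # username path normalisation
    text = re.sub(r"(?i)teknium", "user", text)             # handle, all case variants
    text = re.sub(r"[\w.+-]+@[\w.-]+\.\w+", "contributor@example.com", text)
    text = re.sub(r"<REASONING_SCRATCHPAD>.*?</REASONING_SCRATCHPAD>", "", text, flags=re.S)
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.S)
    text = re.sub(r"<@[^>]*>", "<@user>", text)             # platform user mentions
    return text

def scrub_message(msg: dict) -> dict:
    out = dict(msg)
    content = out.get("content")
    if isinstance(content, str):
        content = scrub_text(content)
        if out.get("role") == "tool" and len(content) > MAX_TOOL_OUTPUT:
            dropped = len(content) - MAX_TOOL_OUTPUT
            content = content[:MAX_TOOL_OUTPUT] + f"\n[truncated {dropped} chars for fixture size]"
        out["content"] = content
    return out
```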
Before every fixture PR: grep the fixture for PII patterns. An audit is embedded at the bottom of the scrubber as comments.
Fixtures must stay small. Target <150 KB per fixture, <500 KB
total for the directory. Current total: ~230 KB across three
fixtures. Larger sessions are truncated with a
truncated_to: <index> field in the fixture header so the cut is
reviewable.
## Probe format
One probe file per fixture, so reviewers can see the question bank evolve alongside the fixture:
```json
{
  "fixture": "401-debug",
  "probes": [
    {
      "id": "recall-error-code",
      "type": "recall",
      "question": "What was the original error code and endpoint?",
      "expected_facts": ["401", "/api/auth/login"]
    },
    {
      "id": "artifact-files-modified",
      "type": "artifact",
      "question": "Which files have been modified in this session?",
      "expected_facts": ["session_store.py", "redis_client.py"]
    },
    {
      "id": "continuation-next-step",
      "type": "continuation",
      "question": "What should we do next?",
      "expected_facts": ["re-run the integration tests", "restart the worker"]
    },
    {
      "id": "decision-redis-approach",
      "type": "decision",
      "question": "What did we decide about the Redis issue?",
      "expected_facts": ["switch to redis-py 5.x", "pooled connection"]
    }
  ]
}
```
The four probe types come directly from Factory's methodology:
recall, artifact, continuation, decision. expected_facts gives
the grader concrete anchors instead of relying purely on LLM taste.
Authoring a probe bank is a one-time cost per fixture. 8-12 probes per fixture is the target — enough to cover all four types, few enough to grade in under a minute at reasonable cost.
## Grading
Each probe gets scored 0-5 on six dimensions (Factory's six):
| Dimension | What it measures |
|---|---|
| accuracy | File paths, function names, error codes are correct |
| context_awareness | Reflects current state, not a mid-session snapshot |
| artifact_trail | Knows which files were read / modified / created |
| completeness | Addresses all parts of the probe |
| continuity | Agent can continue without re-fetching |
| instruction_following | Probe answered in the requested form |
Grading is done by a single judge-model call per probe with a
deterministic rubric prompt (see rubric.py). The rubric includes the
expected_facts list so the judge has a concrete anchor. Default
judge model: whatever the user has configured as their main model at
run time (same resolution path as auxiliary_client.call_llm). A
--judge-model flag allows overriding for consistency across runs.
Non-determinism caveat: two runs of the same fixture will produce different scores. A single run means nothing. Report medians over N=3 runs by default, and require an improvement of >=0.3 on any dimension before claiming a prompt change is a win.
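As a sketch of what that aggregation could look like (the function names and `{dimension: score}` layout are assumptions, not the final grader.py API):

```python
# Illustrative aggregation: per-dimension medians across N runs of one fixture,
# plus the >=0.3 rule from above for calling a prompt change a win.
# Function names and the score layout are assumptions, not the final grader.py API.
from statistics import median

DIMENSIONS = [
    "accuracy", "context_awareness", "artifact_trail",
    "completeness", "continuity", "instruction_following",
]
MIN_DELTA = 0.3  # minimum median improvement on a dimension before claiming a win

def median_scores(runs: list[dict[str, float]]) -> dict[str, float]:
    """runs: one {dimension: score} dict per eval run of the same fixture."""
    return {dim: median(run[dim] for run in runs) for dim in DIMENSIONS}

def is_win(before: dict[str, float], after: dict[str, float]) -> bool:
    """True if at least one dimension improves by >= MIN_DELTA.
    (A symmetric no-regression guard could be added; not specified above.)"""
    return any(after[d] - before[d] >= MIN_DELTA for d in DIMENSIONS)
```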
## Run flow

```
python scripts/compression_eval/run_eval.py [OPTIONS]
```
Options (fire-style, mirroring sample_and_compress.py):
| Flag | Default | Purpose |
|---|---|---|
| `--fixtures` | all | Comma-separated fixture names |
| `--runs` | 3 | Runs per fixture (for median) |
| `--judge-model` | auto | Override judge model |
| `--compressor-model` | auto | Override model used inside the compressor |
| `--label` | timestamp | Subdirectory under `results/` |
| `--focus-topic` | none | Pass-through to `compress(focus_topic=)` |
| `--compare-to` | none | Path to a previous run for diff output |
Steps per fixture per run:
- Load fixture JSON and probe bank.
- Construct a `ContextCompressor` against the fixture's model.
- Call `compressor.compress(messages)` — capture the compressed message list.
- For each probe: ask the judge model to role-play as the continuing agent with only the compressed state, then grade the answer on the six dimensions using `rubric.py`.
- Write a per-run JSON to `results/<label>/<fixture>-run-N.json`.
- After all runs, emit a markdown summary to `results/<label>/report.md`.
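A compressed sketch of those steps as one function; only `compress(messages)` is a call named above, while the compressor/judge/grader callables and the result layout stand in for code that lands in the follow-up PRs:

```python
# Sketch of one fixture run. The compress/answer_probe/grade_answer callables
# are placeholders for compressor_driver.py and grader.py (follow-up PRs).
import json
from pathlib import Path
from typing import Callable

def run_fixture(
    fixture: dict,
    probes: list[dict],
    results_dir: Path,
    run_idx: int,
    compress: Callable[[list[dict]], list[dict]],    # wraps ContextCompressor.compress(messages)
    answer_probe: Callable[[list[dict], str], str],  # judge role-plays the continuing agent
    grade_answer: Callable[[str, dict], dict],       # rubric.py scoring, 0-5 per dimension
) -> dict:
    compressed = compress(fixture["messages"])        # single forced compression
    scores = [
        {"probe": p["id"], **grade_answer(answer_probe(compressed, p["question"]), p)}
        for p in probes
    ]
    result = {"fixture": fixture["name"], "run": run_idx, "scores": scores}
    results_dir.mkdir(parents=True, exist_ok=True)
    out_path = results_dir / f"{fixture['name']}-run-{run_idx}.json"
    out_path.write_text(json.dumps(result, indent=2))
    return result
```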
## Report format
Pasted verbatim into PR descriptions that touch the compressor:
```
## Compression eval — label 2026-04-25_13-40-02

Main model: anthropic/claude-sonnet-4.6. Judge: same.
3 runs per fixture, medians reported.

| Fixture | Accuracy | Context | Artifact | Complete | Continuity | Instruction | Overall |
|----------------|----------|---------|----------|----------|------------|-------------|---------|
| 401-debug | 4.1 | 4.0 | 2.5 | 4.3 | 3.8 | 5.0 | 3.95 |
| pr-review | 3.9 | 3.8 | 3.1 | 4.2 | 3.9 | 5.0 | 3.98 |
| feature-impl | 4.0 | 3.9 | 2.9 | 4.1 | 4.0 | 5.0 | 3.98 |

Per-probe misses (score < 3.0):
- 401-debug / artifact-files-modified: 1.7 — summary dropped redis_client.py
- pr-review / decision-auth-rewrite: 2.3 — outcome captured, reasoning dropped
```
## Cost expectations
Dominated by the judge calls. For 3 fixtures × 10 probes × 3 runs = 90 judge calls per eval run. On Claude Sonnet 4.6 that is roughly $0.50-$1.50 per full eval depending on probe length. The compressor itself makes 1 call per fixture × 3 runs = 9 additional calls.
This is not a check to run after every commit. It is a before-merge check for PRs that touch:

- `agent/context_compressor.py` — any change to `_template_sections`, `_generate_summary`, or `compress()`.
- `agent/auxiliary_client.py` — when changing how compression tasks are routed.
- `agent/prompt_builder.py` — when the compression-note phrasing changes.
## Open questions (to resolve before implementing)

- Fixture scrubbing: manual or scripted? A scripted scrub that also replaces project names / hostnames would lower the cost of contributing a new fixture. Risk: over-aggressive replacement destroys the signal the probe depends on. Propose: start manual, add scripted helpers once we have 3 fixtures and know the common PII shapes.
- Judge model selection. Factory uses GPT-5.2. We can't pin one — the user's main model changes. Options: (a) grade with the main model (cheap, inconsistent across users); (b) require a specific judge model (e.g. `claude-sonnet-4.6`), inconsistent for users without access. Propose (a) with a `--judge-model` override, and make the model name prominent in the report so comparisons across machines are legible.
- Noise floor. Before landing prompt changes, run the current prompt N=10 times to measure per-dimension stddev. That tells us the minimum delta to call a change significant. Suspect 0.2-0.3 on a 0-5 scale. Decision deferred until after the first fixture is landed.
- Iterative-merge coverage. The real Factory-vs-Anthropic difference is incremental merge vs. regenerate. A fixture that only compresses once doesn't exercise our iterative path. Add a fourth fixture that forces two compressions (manually chained), with probes that test whether information from the first compression survives the second. Deferred to a follow-up PR.
## Implementation order
This PR: design doc + scaffolding + three checked-in fixtures +
scrubber script. run_eval.py is still a placeholder that prints a
pointer to DESIGN.md.
Follow-ups, each a separate PR:

- Probe banks for the three fixtures (~8-12 probes each), plus `rubric.py` + `grader.py` + `compressor_driver.py`. Enough to produce a full report.
- Wire results output, `--compare-to` diff mode, and the report markdown template.
- Iterative-merge fixture (two chained compressions) + follow-ups from the open questions.
Each follow-up is independently useful.