Mirror of https://github.com/NousResearch/hermes-agent.git, synced 2026-04-25 00:51:20 +00:00.
Adds scripts/compression_eval/ with a design doc, README, a placeholder run_eval.py, and three checked-in scrubbed session fixtures. No working eval yet — this PR is for design review before implementation.

Motivation: we edit agent/context_compressor.py prompts and _template_sections by hand and ship without any automated check that compression still preserves file paths, error codes, or the active task. Factory.ai's Dec 2025 write-up (https://factory.ai/news/evaluating-compression) documents a probe-based eval scored on six dimensions. We adopt the methodology; we do not publish scores.

Contents:
- DESIGN.md — fixture format, probe format (recall / artifact / continuation / decision), six grading dimensions, report format, cost expectations, scrubber pipeline, open questions, and staged follow-up PR plan.
- README.md — short "what this is / when to run it" page.
- run_eval.py — placeholder that prints "not implemented, see DESIGN.md" and exits 1.
- scrub_fixtures.py — reproducible pipeline that converts real sessions from ~/.hermes/sessions/*.jsonl into public-safe JSON fixtures. Applies: redact_sensitive_text, username path normalization, personal handle scrubbing, email and git-author normalization, reasoning scratchpad / <think> stripping, platform user-mention scrubbing, first-user paraphrase, system-prompt placeholder, orphan-message pruning, and tool-output size truncation for fixture readability.
- fixtures/feature-impl-context-priority.json — 75 msgs / ~17k tokens. Investigate → patch → test → PR → merge.
- fixtures/debug-session-feishu-id-model.json — 59 msgs / ~13k tokens. PR triage + upstream docs + decision.
- fixtures/config-build-competitive-scouts.json — 61 msgs / ~23k tokens. Iterative config accumulation (11 cron jobs across 7 weekdays).

PII audit: zero matches across the three fixtures for the maintainer's handle (all case variants), personal email domains, and known contributor emails. Only the 'contributor@example.com' placeholder remains.
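To make the fixture and probe concepts concrete, here is a rough sketch of what a fixture with its four probe kinds could look like. Every field name below is an illustrative assumption; the actual schema is whatever DESIGN.md specifies.

```python
# Hypothetical sketch of a fixture carrying probes. The real schema lives in
# DESIGN.md; all field names here are assumptions for illustration only.
fixture = {
    "name": "debug-session-feishu-id-model",
    "messages": [
        # scrubbed transcript (59 messages, ~13k tokens in the real fixture)
        {"role": "user", "content": "Triage the incoming PR about the Feishu id mapping."},
    ],
    "probes": [
        # recall: is a fact stated earlier in the session still answerable?
        {"type": "recall", "question": "Which module did the bug report point at?"},
        # artifact: does a concrete file path or error code survive compression?
        {"type": "artifact", "expect": "agent/context_compressor.py"},
        # continuation: can the agent resume the active task?
        {"type": "continuation", "question": "What is the next step on the PR?"},
        # decision: is the recorded decision still retrievable?
        {"type": "decision", "question": "What was decided after reading the upstream docs?"},
    ],
}

probe_types = [p["type"] for p in fixture["probes"]]
```

Each probe is asked against the compressed context and the answer is LLM-graded on the six dimensions, which is what makes a run non-deterministic and credentialed.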
Why scripts/: the harness requires API credentials, costs ~$1 per run, is LLM-graded (non-deterministic), and must not run in CI. scripts/sample_and_compress.py is the existing precedent for offline credentialed tooling.
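As a rough illustration of how the scrub_fixtures.py passes listed above compose: the helper names, regexes, and truncation limit below are hypothetical, and the real pipeline also performs handle scrubbing, git-author normalization, first-user paraphrase, system-prompt placeholder substitution, and orphan-message pruning.

```python
import json
import re

# Hypothetical sketch of a scrubbing pipeline over session JSONL messages.
# Helper names, regexes, and the limit are assumptions, not the real code.
MAX_TOOL_OUTPUT = 2000  # chars; illustrative truncation limit

def normalize_user_paths(text: str) -> str:
    """Replace /home/<user> and /Users/<user> prefixes with a neutral placeholder."""
    return re.sub(r"/(?:home|Users)/[^/\s]+", "/home/user", text)

def normalize_emails(text: str) -> str:
    """Collapse any real address down to the single public placeholder."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "contributor@example.com", text)

def strip_think_blocks(text: str) -> str:
    """Drop <think>...</think> reasoning scratchpads entirely."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)

def truncate_tool_output(msg: dict) -> dict:
    """Cap tool-output size for fixture readability."""
    if msg.get("role") == "tool" and len(msg.get("content", "")) > MAX_TOOL_OUTPUT:
        msg["content"] = msg["content"][:MAX_TOOL_OUTPUT] + "\n[truncated]"
    return msg

def scrub_message(msg: dict) -> dict:
    content = msg.get("content", "")
    for scrub in (normalize_user_paths, normalize_emails, strip_think_blocks):
        content = scrub(content)
    return truncate_tool_output({**msg, "content": content})

def scrub_session(jsonl_lines):
    """Convert raw ~/.hermes/sessions/*.jsonl lines into fixture messages."""
    return [scrub_message(json.loads(line)) for line in jsonl_lines if line.strip()]
```

Chaining small pure text transforms like this is what makes the pipeline reproducible: re-running it over the same raw session yields byte-identical fixtures, which is how the PII audit above can be re-verified.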
75 lines
1.3 KiB
Text
.DS_Store
/venv/
/__pycache__/
*.pyc*
__pycache__/
.venv/
.vscode/
.env
.env.local
.env.development.local
.env.test.local
.env.production.local
.env.development
.env.test
export*
__pycache__/model_tools.cpython-310.pyc
__pycache__/web_tools.cpython-310.pyc
logs/
data/
.pytest_cache/
tmp/
temp_vision_images/
hermes-*/*
examples/
tests/quick_test_dataset.jsonl
tests/sample_dataset.jsonl
run_datagen_kimik2-thinking.sh
run_datagen_megascience_glm4-6.sh
run_datagen_sonnet.sh
source-data/*
run_datagen_megascience_glm4-6.sh
data/*
node_modules/
browser-use/
agent-browser/
# Private keys
*.ppk
*.pem
privvy*
images/
__pycache__/
hermes_agent.egg-info/
wandb/
testlogs

# CLI config (may contain sensitive SSH paths)
cli-config.yaml

# Skills Hub state (lives in ~/.hermes/skills/.hub/ at runtime, but just in case)
skills/.hub/
ignored/
.worktrees/
environments/benchmarks/evals/

# Compression eval run outputs (harness lives in scripts/compression_eval/)
scripts/compression_eval/results/*
!scripts/compression_eval/results/.gitkeep

# Web UI build output
hermes_cli/web_dist/

# Web UI assets — synced from @nous-research/ui at build time via
# `npm run sync-assets` (see web/package.json).
web/public/fonts/
web/public/ds-assets/

# Release script temp files
.release_notes.md
mini-swe-agent/

# Nix
.direnv/
.nix-stamps/
result
website/static/api/skills-index.json