Mirror of https://github.com/NousResearch/hermes-agent.git (synced 2026-05-10 03:22:05 +00:00)
feat(checkpoints): v2 single-store rewrite with real pruning + disk guardrails (#20709)
Replaces the per-directory shadow-repo design with a single shared shadow
git store at ~/.hermes/checkpoints/store/. Object DB is now deduplicated
across every working directory the agent has ever touched; a dozen
worktrees of the same project cost near-zero in additional disk.
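Why a shared object database deduplicates: git names each blob by a hash of its content, not its path. The sketch below is an illustration only, not code from this commit, and it assumes git's default SHA-1 object format. It shows that the same bytes in two working directories map to a single object ID.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Two hypothetical projects that both contain an identical shared.py.
shared_py = b"def helper():\n    return 42\n"
project_a = {"shared.py": shared_py, "a.py": b"print('a')\n"}
project_b = {"shared.py": shared_py, "b.py": b"print('b')\n"}

# Collect the distinct blob IDs a single shared store would need to hold.
unique_blobs = {
    git_blob_id(data)
    for project in (project_a, project_b)
    for data in project.values()
}
print(len(unique_blobs))  # 3 distinct blobs for 4 files
```

With one repo per directory, shared.py would be stored once in each repo. In one store the second copy costs nothing beyond the tree entry that points at it.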
Why
---
The pre-v2 design had three compounding problems that let ~/.hermes/checkpoints/
grow to multiple gigabytes on active machines:
1. Each working directory got its own full shadow git repo, with no object
   dedup across projects or across worktrees of the same project.
2. _prune() was a documented no-op: max_snapshots only limited the
   /rollback listing. Loose objects accumulated forever.
3. Defaults were enabled=True, auto_prune=False, so users paid the disk cost
   without ever asking for /rollback.
Field report on a single workstation: 847 MB across 47 shadow repos,
mostly redundant clones of the hermes-agent source tree.
Changes
-------
- tools/checkpoint_manager.py: full rewrite. Single bare store, per-project
  refs (refs/hermes/<hash>), per-project indexes (store/indexes/<hash>),
  per-project metadata (store/projects/<hash>.json with workdir +
  created_at + last_touch). On first v2 init, any pre-v2 per-directory
  shadow repos are auto-migrated into legacy-<timestamp>/ so the new
  store starts clean. _prune() now actually rewrites the per-project ref
  to the last max_snapshots commits and runs git gc --prune=now. New
  _enforce_size_cap() drops oldest commits round-robin across projects
  when the store exceeds max_total_size_mb. _drop_oversize_from_index()
  filters any single file larger than max_file_size_mb out of the snapshot.
- hermes_cli/checkpoints.py: new 'hermes checkpoints' CLI
  (status / list / prune / clear / clear-legacy) for managing the store
  outside a session.
- hermes_cli/config.py: flipped defaults: enabled=False, max_snapshots=20,
  auto_prune=True. Added max_total_size_mb=500, max_file_size_mb=10.
  Tightened DEFAULT_EXCLUDES (added target/, *.so/*.dylib/*.dll,
  *.mp4/*.mov, *.zip/*.tar.gz, .worktrees/, .mypy_cache/, etc.).
- run_agent.py / cli.py / gateway/run.py: thread the new kwargs through
  AIAgent and the startup auto_prune hooks.
- Tests rewritten to match v2 storage while keeping backwards-compat
  coverage for the pre-v2 prune path (per-directory shadow repos under
  base/ are still swept correctly for anyone mid-migration).
- Docs updated: user-guide/checkpoints-and-rollback.md explains the
  shared store, new defaults, migration, and the new CLI;
  reference/cli-commands.md documents 'hermes checkpoints'.
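The per-project layout described for tools/checkpoint_manager.py can be pictured with a small sketch. This is not the commit's code: the function name project_paths and the hashing scheme (a truncated SHA-256 of the workdir path) are assumptions made for illustration, since the message only says each project is keyed by a <hash>.

```python
import hashlib
from pathlib import Path


def project_paths(workdir: str, store: Path) -> dict:
    """Map a working directory to its ref, index, and metadata locations."""
    # ASSUMPTION: the real key derivation is not shown in the commit message;
    # a truncated SHA-256 of the path string stands in for <hash> here.
    key = hashlib.sha256(str(Path(workdir)).encode("utf-8")).hexdigest()[:16]
    return {
        "ref": f"refs/hermes/{key}",                     # branch-like pointer in the shared store
        "index": store / "indexes" / key,                # per-project git index file
        "metadata": store / "projects" / f"{key}.json",  # workdir, created_at, last_touch
    }


store = Path.home() / ".hermes" / "checkpoints" / "store"
for name, location in project_paths("/home/user/src/demo", store).items():
    print(name, location)
```

Because every project writes to its own ref and index while sharing one object database, snapshots stay isolated per project but their objects are stored once.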
E2E validated
-------------
- Legacy migration: pre-v2 shadow repos auto-archived into legacy-<ts>/.
- Object dedup: two projects with an identical shared.py blob resolve to
  7 total objects in the store (v1 would have stored the blob twice).
- max_snapshots=3 actually enforced: after 6 commits, list shows 3.
- Orphan prune: deleting a project's workdir + 'hermes checkpoints prune
  --retention-days 0' removes its ref, index, and metadata; GC reclaims
  the objects.
- max_file_size_mb=1 excludes a 2 MB weights.bin while keeping the
  tracked source code files.
- hermes checkpoints {status,prune,clear,clear-legacy} all work from the
  CLI without an agent running.
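The max_file_size_mb check validated above can be sketched as a plain filter over file sizes. This is a simplified stand-in, not the commit's _drop_oversize_from_index(): the name drop_oversize, the dict input, and the choice of 1 MB = 1024 * 1024 bytes are assumptions for illustration.

```python
def drop_oversize(file_sizes: dict[str, int], max_file_size_mb: int) -> dict[str, int]:
    """Return only the files that are small enough to be snapshotted."""
    # A cap of 0 disables the filter, matching the config comment.
    if max_file_size_mb <= 0:
        return dict(file_sizes)
    # ASSUMPTION: binary megabytes; the real implementation may use 10**6.
    limit = max_file_size_mb * 1024 * 1024
    # "Larger than" the cap is excluded, so a file exactly at the cap stays.
    return {path: size for path, size in file_sizes.items() if size <= limit}


staged = {"main.py": 4096, "weights.bin": 2 * 1024 * 1024}
print(drop_oversize(staged, max_file_size_mb=1))  # {'main.py': 4096}
```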
Breaking / migration
--------------------
No in-place data migration — legacy per-directory shadow repos are moved
into legacy-<timestamp>/ on first run. Old /rollback history is still
accessible by inspecting the archive with git; run
'hermes checkpoints clear-legacy' to reclaim the space when ready. Users
relying on /rollback must now set checkpoints.enabled=true (or pass
--checkpoints) explicitly.
parent b045e7a2ba
commit a0fedfbb1b
10 changed files with 1965 additions and 715 deletions
@@ -574,21 +574,39 @@ DEFAULT_CONFIG = {
     },
 
     # Filesystem checkpoints — automatic snapshots before destructive file ops.
-    # When enabled, the agent takes a snapshot of the working directory once per
-    # conversation turn (on first write_file/patch call). Use /rollback to restore.
+    # When enabled, the agent takes a snapshot of the working directory once
+    # per conversation turn (on first write_file/patch call). Use /rollback
+    # to restore.
+    #
+    # Defaults changed in v2 (single shared shadow store, real pruning):
+    #   - enabled: True -> False (opt-in; most users never use /rollback)
+    #   - max_snapshots: 50 -> 20 (now actually enforced via ref rewrite)
+    #   - auto_prune: False -> True (orphans/stale pruned automatically)
+    # Opt in via ``hermes chat --checkpoints`` or set enabled=True here.
     "checkpoints": {
-        "enabled": True,
-        "max_snapshots": 50,  # Max checkpoints to keep per directory
-        # Auto-maintenance: shadow repos accumulate forever under
-        # ~/.hermes/checkpoints/ (one per cd'd working directory). Field
-        # reports put the typical offender at 1000+ repos / ~12 GB. When
-        # auto_prune is on, hermes sweeps at startup (at most once per
-        # min_interval_hours) and deletes:
-        #   * orphan repos: HERMES_WORKDIR no longer exists on disk
-        #   * stale repos: newest mtime older than retention_days
-        # Opt-in so users who rely on /rollback against long-ago sessions
-        # never lose data silently.
-        "auto_prune": False,
+        "enabled": False,
+        # Max checkpoints to keep per working directory. Pre-v2 this only
+        # limited the `/rollback` listing; v2 actually rewrites the ref and
+        # garbage-collects older commits.
+        "max_snapshots": 20,
+        # Hard ceiling on total ``~/.hermes/checkpoints/`` size (MB). When
+        # exceeded, the oldest checkpoint per project is dropped in a
+        # round-robin pass until total size falls under the cap.
+        # 0 disables the size cap.
+        "max_total_size_mb": 500,
+        # Skip any single file larger than this when staging a checkpoint.
+        # Prevents accidental snapshotting of datasets, model weights, and
+        # other large generated assets. 0 disables the filter.
+        "max_file_size_mb": 10,
+        # Auto-maintenance: hermes sweeps the checkpoint base at startup
+        # (at most once per ``min_interval_hours``) and:
+        #   * deletes project entries whose workdir no longer exists (orphan)
+        #   * deletes project entries whose last_touch is older than
+        #     ``retention_days``
+        #   * GCs the single shared store to reclaim unreachable objects
+        #   * enforces ``max_total_size_mb`` across remaining projects
+        #   * deletes ``legacy-*`` archives older than ``retention_days``
+        "auto_prune": True,
         "retention_days": 7,
         "delete_orphans": True,
         "min_interval_hours": 24,
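The round-robin behaviour described in the max_total_size_mb comment can be sketched as follows. This is a simplified model, not the commit's _enforce_size_cap(): the name enforce_size_cap and the per-checkpoint byte sizes are assumptions. In a deduplicated store, checkpoint sizes are not truly additive, and the real space reclaimed is only known after garbage collection.

```python
def enforce_size_cap(
    projects: dict[str, list[int]], max_total_size_mb: int
) -> dict[str, list[int]]:
    """Drop the oldest checkpoint of each project in turn until under the cap.

    `projects` maps a project key to checkpoint sizes in bytes, oldest first.
    """
    remaining = {key: list(sizes) for key, sizes in projects.items()}
    # A cap of 0 disables the size limit, matching the config comment.
    if max_total_size_mb <= 0:
        return remaining
    cap = max_total_size_mb * 1024 * 1024  # ASSUMPTION: binary megabytes
    total = sum(sum(sizes) for sizes in remaining.values())
    # Each pass removes at most one (the oldest) checkpoint per project.
    while total > cap and any(remaining.values()):
        for sizes in remaining.values():
            if total <= cap:
                break
            if sizes:
                total -= sizes.pop(0)
    return remaining


MB = 1024 * 1024
before = {"a": [300 * MB, 200 * MB], "b": [100 * MB, 100 * MB]}
after = enforce_size_cap(before, max_total_size_mb=500)
print({key: [size // MB for size in sizes] for key, sizes in after.items()})
# {'a': [200], 'b': [100, 100]}
```

Taking one checkpoint from each project per pass spreads the loss of history across projects instead of emptying whichever project happens to come first.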