feat(checkpoints): v2 single-store rewrite with real pruning + disk guardrails (#20709)

Replaces the per-directory shadow-repo design with a single shared shadow
git store at ~/.hermes/checkpoints/store/. Object DB is now deduplicated
across every working directory the agent has ever touched; a dozen
worktrees of the same project cost near-zero in additional disk.

Why
---
Pre-v2 design had three compounding problems that let ~/.hermes/checkpoints/
grow to multi-GB on active machines:

1. Each working directory got its own full shadow git repo — no object
   dedup across projects or across worktrees of the same project.
2. _prune() was a documented no-op: max_snapshots only limited the
   /rollback listing. Loose objects accumulated forever.
3. Defaults: enabled=True, auto_prune=False — users paid the disk cost
   without ever asking for /rollback.

Field report on a single workstation: 847 MB across 47 shadow repos,
mostly redundant clones of the hermes-agent source tree.

Changes
-------
- tools/checkpoint_manager.py: full rewrite. Single bare store, per-project
  refs (refs/hermes/<hash>), per-project indexes (store/indexes/<hash>),
  per-project metadata (store/projects/<hash>.json with workdir +
  created_at + last_touch). On first v2 init, any pre-v2 per-directory
  shadow repos are auto-migrated into legacy-<timestamp>/ so the new
  store starts clean. _prune() now actually rewrites the per-project ref
  to the last max_snapshots commits and runs git gc --prune=now. New
  _enforce_size_cap() drops oldest commits round-robin across projects
  when the store exceeds max_total_size_mb. _drop_oversize_from_index()
  filters any single file larger than max_file_size_mb out of the snapshot.
- hermes_cli/checkpoints.py: new 'hermes checkpoints' CLI
  (status / list / prune / clear / clear-legacy) for managing the store
  outside a session.
- hermes_cli/config.py: flipped defaults — enabled=False, max_snapshots=20,
  auto_prune=True. Added max_total_size_mb=500, max_file_size_mb=10.
  Tightened DEFAULT_EXCLUDES (added target/, *.so/*.dylib/*.dll,
  *.mp4/*.mov, *.zip/*.tar.gz, .worktrees/, .mypy_cache/, etc.).
- run_agent.py / cli.py / gateway/run.py: thread the new kwargs through
  AIAgent and the startup auto_prune hooks.
- Tests rewritten to match v2 storage while keeping backwards-compat
  coverage for the pre-v2 prune path (per-directory shadow repos under
  base/ are still swept correctly for anyone mid-migration).
- Docs updated: user-guide/checkpoints-and-rollback.md explains the
  shared store, new defaults, migration, and the new CLI;
  reference/cli-commands.md documents 'hermes checkpoints'.
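
The two pruning behaviours described above (truncating each project to its last max_snapshots commits, then dropping oldest commits round-robin until the store fits under max_total_size_mb) can be sketched in memory. This is a minimal illustration, not the code in tools/checkpoint_manager.py: the `History` model and the names `prune_to_max`, `total_size`, and `enforce_size_cap` are hypothetical, and per-commit sizes are treated as additive, which a real deduplicated git store would not guarantee.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory model: project hash -> [(commit_id, size_bytes), ...],
# ordered oldest -> newest. Real sizes are not additive in a deduplicated store.
History = Dict[str, List[Tuple[str, int]]]


def prune_to_max(history: History, max_snapshots: int) -> None:
    """Keep only the newest max_snapshots commits of every project."""
    for project in list(history):
        commits = history[project]
        if max_snapshots > 0 and len(commits) > max_snapshots:
            history[project] = commits[-max_snapshots:]


def total_size(history: History) -> int:
    """Sum of all commit sizes across all projects."""
    return sum(size for commits in history.values() for _, size in commits)


def enforce_size_cap(history: History, cap_bytes: int) -> List[str]:
    """Drop the oldest commit of each project in turn until under the cap."""
    dropped: List[str] = []
    if cap_bytes <= 0:  # 0 disables the cap, as in the config comment
        return dropped
    while total_size(history) > cap_bytes:
        progressed = False
        for project in sorted(history):
            if total_size(history) <= cap_bytes:
                break
            commits = history[project]
            if len(commits) > 1:  # assumption: always keep the newest commit
                dropped.append(commits.pop(0)[0])
                progressed = True
        if not progressed:  # nothing left to drop; give up rather than loop
            break
    return dropped
```

Taking one commit per project per pass spreads the loss across projects instead of emptying whichever project happens to be visited first.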

E2E validated
-------------
- Legacy migration: pre-v2 shadow repos auto-archived into legacy-<ts>/.
- Object dedup: two projects with an identical shared.py blob resolve to
  7 total objects in the store (v1 would have stored the blob twice).
- max_snapshots=3 actually enforced: after 6 commits, list shows 3.
- Orphan prune: deleting a project's workdir + 'hermes checkpoints prune
  --retention-days 0' removes its ref, index, and metadata; GC reclaims
  the objects.
- max_file_size_mb=1 excludes a 2 MB weights.bin while keeping the
  tracked source code files.
- hermes checkpoints {status,prune,clear,clear-legacy} all work from the
  CLI without an agent running.
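
The object-dedup result above follows from git's content addressing: a blob's id depends only on its bytes, so the same file in two projects maps to one object. A minimal sketch, assuming the default SHA-1 object format; `git_blob_id` and the `store` dict are illustrative names, not part of the checkpoint manager.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the id git assigns to a blob with this content (SHA-1 repos)."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


# Two hypothetical projects that both contain an identical shared.py.
shared = b"def helper():\n    return 42\n"
store = {}
for project in ("project-a", "project-b"):
    # Same bytes -> same key, so the second project adds nothing to the store.
    store.setdefault(git_blob_id(shared), shared)
```

After the loop `store` holds a single entry, which is why a shared object database grows with distinct content rather than with the number of projects.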

Breaking / migration
--------------------
No in-place data migration — legacy per-directory shadow repos are moved
into legacy-<timestamp>/ on first run. Old /rollback history is still
accessible by inspecting the archive with git; run
'hermes checkpoints clear-legacy' to reclaim the space when ready. Users
relying on /rollback must now set checkpoints.enabled=true (or pass
--checkpoints) explicitly.
Teknium 2026-05-06 05:44:35 -07:00 committed by GitHub
parent b045e7a2ba
commit a0fedfbb1b
10 changed files with 1965 additions and 715 deletions

cli.py (7 lines changed)

@@ -987,6 +987,7 @@ def _run_checkpoint_auto_maintenance() -> None:
             retention_days=int(cfg.get("retention_days", 7)),
             min_interval_hours=int(cfg.get("min_interval_hours", 24)),
             delete_orphans=bool(cfg.get("delete_orphans", True)),
+            max_total_size_mb=int(cfg.get("max_total_size_mb", 500)),
         )
     except Exception as exc:
         logger.debug("checkpoint auto-maintenance skipped: %s", exc)
@@ -2273,7 +2274,9 @@ class HermesCLI:
         if isinstance(cp_cfg, bool):
             cp_cfg = {"enabled": cp_cfg}
         self.checkpoints_enabled = checkpoints or cp_cfg.get("enabled", False)
-        self.checkpoint_max_snapshots = cp_cfg.get("max_snapshots", 50)
+        self.checkpoint_max_snapshots = cp_cfg.get("max_snapshots", 20)
+        self.checkpoint_max_total_size_mb = cp_cfg.get("max_total_size_mb", 500)
+        self.checkpoint_max_file_size_mb = cp_cfg.get("max_file_size_mb", 10)
         self.pass_session_id = pass_session_id
         # --ignore-rules: honor either the constructor flag or the env var set
         # by `hermes chat --ignore-rules` in hermes_cli/main.py. When true we
@@ -3845,6 +3848,8 @@ class HermesCLI:
                 thinking_callback=self._on_thinking,
                 checkpoints_enabled=self.checkpoints_enabled,
                 checkpoint_max_snapshots=self.checkpoint_max_snapshots,
+                checkpoint_max_total_size_mb=self.checkpoint_max_total_size_mb,
+                checkpoint_max_file_size_mb=self.checkpoint_max_file_size_mb,
                 pass_session_id=self.pass_session_id,
                 skip_context_files=self.ignore_rules,
                 skip_memory=self.ignore_rules,
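
The HermesCLI hunk above accepts either a bare boolean or a mapping for the checkpoints config and falls back to the v2 defaults. A minimal standalone sketch of that resolution, assuming nothing beyond what the hunk shows; `resolve_checkpoint_settings` and `DEFAULTS` are hypothetical names, not functions in cli.py.

```python
from typing import Any, Dict, Optional, Union

# v2 defaults as stated in the commit message.
DEFAULTS: Dict[str, Any] = {
    "enabled": False,
    "max_snapshots": 20,
    "max_total_size_mb": 500,
    "max_file_size_mb": 10,
}


def resolve_checkpoint_settings(
    cp_cfg: Union[bool, Dict[str, Any], None],
    cli_flag: bool = False,
) -> Dict[str, Any]:
    """Merge user config with defaults; a bare bool means {"enabled": <bool>}."""
    if isinstance(cp_cfg, bool):
        cp_cfg = {"enabled": cp_cfg}
    cfg: Dict[str, Any] = cp_cfg or {}
    settings = {key: cfg.get(key, default) for key, default in DEFAULTS.items()}
    # The --checkpoints flag wins over a disabled config entry.
    settings["enabled"] = bool(cli_flag or settings["enabled"])
    return settings
```

With this shape, `checkpoints: true` in the config and `--checkpoints` on the command line both enable snapshots without requiring the other keys to be set.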


@@ -1160,6 +1160,7 @@ class GatewayRunner:
                 retention_days=int(_ckpt_cfg.get("retention_days", 7)),
                 min_interval_hours=int(_ckpt_cfg.get("min_interval_hours", 24)),
                 delete_orphans=bool(_ckpt_cfg.get("delete_orphans", True)),
+                max_total_size_mb=int(_ckpt_cfg.get("max_total_size_mb", 500)),
             )
         except Exception as exc:
             logger.debug("checkpoint auto-maintenance skipped: %s", exc)
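
Both startup hooks pass min_interval_hours so that the sweep runs at most once per interval. One way such a throttle could work is a timestamp marker file. The sketch below is an assumption about the mechanism, not the actual implementation: `sweep_is_due` and `record_sweep` are hypothetical helpers, and storing the time as text inside the marker is a choice made here for testability.

```python
import time
from pathlib import Path
from typing import Optional


def sweep_is_due(marker: Path, min_interval_hours: float,
                 now: Optional[float] = None) -> bool:
    """True if no sweep was recorded or the last one is old enough."""
    now = time.time() if now is None else now
    try:
        last = float(marker.read_text().strip())
    except (OSError, ValueError):
        return True  # missing or unreadable marker: run the sweep
    return (now - last) >= min_interval_hours * 3600


def record_sweep(marker: Path, now: Optional[float] = None) -> None:
    """Write the current time into the marker after a completed sweep."""
    now = time.time() if now is None else now
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text(str(now))
```

A forced run such as `hermes checkpoints prune` would simply skip the `sweep_is_due` check and call the sweep directly.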

hermes_cli/checkpoints.py (new file, 244 lines added)

@@ -0,0 +1,244 @@
"""`hermes checkpoints` CLI subcommand.

Gives users direct visibility and control over the filesystem checkpoint
store at ``~/.hermes/checkpoints/``. Actions:

    hermes checkpoints                # same as `status`
    hermes checkpoints status         # total size, project count, breakdown
    hermes checkpoints list           # per-project checkpoint counts + workdir
    hermes checkpoints prune [opts]   # force a sweep (ignores the 24h marker)
    hermes checkpoints clear [-f]     # nuke the entire base (asks first)
    hermes checkpoints clear-legacy   # delete just the legacy-* archives

Examples::

    hermes checkpoints
    hermes checkpoints prune --retention-days 3 --max-size-mb 200
    hermes checkpoints clear -f

None of these require the agent to be running. Safe to call any time.
"""

from __future__ import annotations

import argparse
import time
from datetime import datetime
from pathlib import Path
from typing import Any, Dict


def _fmt_bytes(n: int) -> str:
    units = ("B", "KB", "MB", "GB", "TB")
    size = float(n or 0)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            if unit == "B":
                return f"{int(size)} {unit}"
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TB"


def _fmt_ts(ts: Any) -> str:
    try:
        return datetime.fromtimestamp(float(ts)).strftime("%Y-%m-%d %H:%M")
    except (TypeError, ValueError):
        return ""


def _fmt_age(ts: Any) -> str:
    try:
        age = time.time() - float(ts)
    except (TypeError, ValueError):
        return ""
    if age < 0:
        return "now"
    if age < 60:
        return f"{int(age)}s ago"
    if age < 3600:
        return f"{int(age / 60)}m ago"
    if age < 86400:
        return f"{int(age / 3600)}h ago"
    return f"{int(age / 86400)}d ago"


def cmd_status(args: argparse.Namespace) -> int:
    from tools.checkpoint_manager import store_status

    info = store_status()
    base = info["base"]
    print(f"Checkpoint base: {base}")
    print(f"Total size: {_fmt_bytes(info['total_size_bytes'])}")
    print(f" store/ {_fmt_bytes(info['store_size_bytes'])}")
    print(f" legacy-* {_fmt_bytes(info['legacy_size_bytes'])}")
    print(f"Projects: {info['project_count']}")

    projects = sorted(
        info["projects"],
        key=lambda p: (p.get("last_touch") or 0),
        reverse=True,
    )
    if projects:
        print()
        print(f" {'WORKDIR':<60} {'COMMITS':>7} {'LAST TOUCH':>12} STATE")
        for p in projects[: args.limit if hasattr(args, "limit") and args.limit else 20]:
            wd = p.get("workdir") or "(unknown)"
            if len(wd) > 60:
                # Keep the tail of long paths: one marker char + 59 chars = 60.
                wd = "…" + wd[-59:]
            exists = p.get("exists")
            state = "live" if exists else "orphan"
            commits = p.get("commits", 0)
            last = _fmt_age(p.get("last_touch"))
            print(f" {wd:<60} {commits:>7} {last:>12} {state}")

    legacy = info.get("legacy_archives", [])
    if legacy:
        print()
        print(f"Legacy archives ({len(legacy)}):")
        for arch in sorted(legacy, key=lambda a: a.get("mtime", 0), reverse=True):
            print(f" {arch['name']:<40} {_fmt_bytes(arch['size_bytes']):>10}")
        print()
        print("Clear with: hermes checkpoints clear-legacy")
    return 0


def cmd_list(args: argparse.Namespace) -> int:
    # `list` is just a terser status — already covered.
    return cmd_status(args)


def cmd_prune(args: argparse.Namespace) -> int:
    from tools.checkpoint_manager import prune_checkpoints

    retention_days = args.retention_days
    max_size_mb = args.max_size_mb
    print("Pruning checkpoint store…")
    print(f" retention_days: {retention_days}")
    print(f" delete_orphans: {not args.keep_orphans}")
    print(f" max_total_size_mb: {max_size_mb}")
    print()
    result = prune_checkpoints(
        retention_days=retention_days,
        delete_orphans=not args.keep_orphans,
        max_total_size_mb=max_size_mb,
    )
    print(f"Scanned: {result['scanned']}")
    print(f"Deleted orphan: {result['deleted_orphan']}")
    print(f"Deleted stale: {result['deleted_stale']}")
    print(f"Errors: {result['errors']}")
    print(f"Bytes reclaimed: {_fmt_bytes(result['bytes_freed'])}")
    return 0


def _confirm(prompt: str) -> bool:
    try:
        resp = input(f"{prompt} [y/N]: ").strip().lower()
    except (EOFError, KeyboardInterrupt):
        print()
        return False
    return resp in ("y", "yes")


def cmd_clear(args: argparse.Namespace) -> int:
    from tools.checkpoint_manager import CHECKPOINT_BASE, clear_all, store_status

    info = store_status()
    if info["total_size_bytes"] == 0 and not Path(CHECKPOINT_BASE).exists():
        print("Nothing to clear — checkpoint base does not exist.")
        return 0
    print(f"This will delete the ENTIRE checkpoint base at {info['base']}")
    print(f" size: {_fmt_bytes(info['total_size_bytes'])}")
    print(f" projects: {info['project_count']}")
    print(f" legacy dirs: {len(info.get('legacy_archives', []))}")
    print()
    print("All /rollback history for every working directory will be lost.")
    if not args.force and not _confirm("Proceed?"):
        print("Aborted.")
        return 1
    result = clear_all()
    if result["deleted"]:
        print(f"Cleared. Reclaimed {_fmt_bytes(result['bytes_freed'])}.")
        return 0
    print("Could not clear checkpoint base (see logs).")
    return 2


def cmd_clear_legacy(args: argparse.Namespace) -> int:
    from tools.checkpoint_manager import clear_legacy, store_status

    info = store_status()
    legacy = info.get("legacy_archives", [])
    if not legacy:
        print("No legacy archives to clear.")
        return 0
    total = sum(a.get("size_bytes", 0) for a in legacy)
    print(f"Found {len(legacy)} legacy archive(s), total {_fmt_bytes(total)}:")
    for arch in legacy:
        print(f" {arch['name']:<40} {_fmt_bytes(arch['size_bytes']):>10}")
    print()
    print("Legacy archives hold pre-v2 per-project shadow repos, moved aside")
    print("during the single-store migration. Delete when you're confident")
    print("you don't need the old /rollback history.")
    if not args.force and not _confirm("Delete all legacy archives?"):
        print("Aborted.")
        return 1
    result = clear_legacy()
    print(f"Deleted {result['deleted']} archive(s), reclaimed {_fmt_bytes(result['bytes_freed'])}.")
    return 0


def register_cli(parser: argparse.ArgumentParser) -> None:
    """Wire subcommands onto the ``hermes checkpoints`` parser."""
    parser.set_defaults(func=cmd_status)  # bare `hermes checkpoints` → status
    subs = parser.add_subparsers(dest="checkpoints_command", metavar="COMMAND")

    p_status = subs.add_parser(
        "status",
        help="Show total size, project count, and per-project breakdown",
    )
    p_status.add_argument("--limit", type=int, default=20,
                          help="Max projects to list (default 20)")
    p_status.set_defaults(func=cmd_status)

    p_list = subs.add_parser(
        "list",
        help="Alias for 'status'",
    )
    p_list.add_argument("--limit", type=int, default=20)
    p_list.set_defaults(func=cmd_list)

    p_prune = subs.add_parser(
        "prune",
        help="Delete orphan/stale checkpoints and GC the store",
    )
    p_prune.add_argument("--retention-days", type=int, default=7,
                         help="Drop projects whose last_touch is older than N days (default 7)")
    p_prune.add_argument("--max-size-mb", type=int, default=500,
                         help="After orphan/stale prune, drop oldest commits "
                              "per project until total size <= this (default 500)")
    p_prune.add_argument("--keep-orphans", action="store_true",
                         help="Skip deleting projects whose workdir no longer exists")
    p_prune.set_defaults(func=cmd_prune)

    p_clear = subs.add_parser(
        "clear",
        help="Delete the entire checkpoint base (all /rollback history)",
    )
    p_clear.add_argument("-f", "--force", action="store_true",
                         help="Skip confirmation prompt")
    p_clear.set_defaults(func=cmd_clear)

    p_legacy = subs.add_parser(
        "clear-legacy",
        help="Delete only the legacy-<ts>/ archives from v1 migration",
    )
    p_legacy.add_argument("-f", "--force", action="store_true",
                          help="Skip confirmation prompt")
    p_legacy.set_defaults(func=cmd_clear_legacy)
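
The dispatch pattern used by register_cli (a parser-level default `func` for the bare command, overridden by each subparser's own `set_defaults`) can be tried in isolation. The sketch below is a reduced stand-in, not the module above: the handlers return strings instead of touching a store, and `build_parser` and `dispatch` are hypothetical names.

```python
import argparse
from typing import List


def cmd_status(args: argparse.Namespace) -> str:
    return "status"


def cmd_prune(args: argparse.Namespace) -> str:
    return f"prune:{args.retention_days}"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="hermes checkpoints")
    # Used when no subcommand is given.
    parser.set_defaults(func=cmd_status)
    subs = parser.add_subparsers(dest="checkpoints_command", metavar="COMMAND")
    p_prune = subs.add_parser("prune")
    p_prune.add_argument("--retention-days", type=int, default=7)
    # A matched subcommand replaces the parser-level default.
    p_prune.set_defaults(func=cmd_prune)
    return parser


def dispatch(argv: List[str]) -> str:
    args = build_parser().parse_args(argv)
    return args.func(args)
```

Calling `dispatch([])` reaches `cmd_status`, while `dispatch(["prune", "--retention-days", "3"])` reaches `cmd_prune` with the parsed option, which is the behaviour the docstring promises for a bare `hermes checkpoints`.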


@@ -574,21 +574,39 @@ DEFAULT_CONFIG = {
     },

     # Filesystem checkpoints — automatic snapshots before destructive file ops.
-    # When enabled, the agent takes a snapshot of the working directory once per
-    # conversation turn (on first write_file/patch call). Use /rollback to restore.
+    # When enabled, the agent takes a snapshot of the working directory once
+    # per conversation turn (on first write_file/patch call). Use /rollback
+    # to restore.
+    #
+    # Defaults changed in v2 (single shared shadow store, real pruning):
+    # - enabled: True -> False (opt-in; most users never use /rollback)
+    # - max_snapshots: 50 -> 20 (now actually enforced via ref rewrite)
+    # - auto_prune: False -> True (orphans/stale pruned automatically)
+    # Opt in via ``hermes chat --checkpoints`` or set enabled=True here.
     "checkpoints": {
-        "enabled": True,
-        "max_snapshots": 50, # Max checkpoints to keep per directory
-        # Auto-maintenance: shadow repos accumulate forever under
-        # ~/.hermes/checkpoints/ (one per cd'd working directory). Field
-        # reports put the typical offender at 1000+ repos / ~12 GB. When
-        # auto_prune is on, hermes sweeps at startup (at most once per
-        # min_interval_hours) and deletes:
-        # * orphan repos: HERMES_WORKDIR no longer exists on disk
-        # * stale repos: newest mtime older than retention_days
-        # Opt-in so users who rely on /rollback against long-ago sessions
-        # never lose data silently.
-        "auto_prune": False,
+        "enabled": False,
+        # Max checkpoints to keep per working directory. Pre-v2 this only
+        # limited the `/rollback` listing; v2 actually rewrites the ref and
+        # garbage-collects older commits.
+        "max_snapshots": 20,
+        # Hard ceiling on total ``~/.hermes/checkpoints/`` size (MB). When
+        # exceeded, the oldest checkpoint per project is dropped in a
+        # round-robin pass until total size falls under the cap.
+        # 0 disables the size cap.
+        "max_total_size_mb": 500,
+        # Skip any single file larger than this when staging a checkpoint.
+        # Prevents accidental snapshotting of datasets, model weights, and
+        # other large generated assets. 0 disables the filter.
+        "max_file_size_mb": 10,
+        # Auto-maintenance: hermes sweeps the checkpoint base at startup
+        # (at most once per ``min_interval_hours``) and:
+        # * deletes project entries whose workdir no longer exists (orphan)
+        # * deletes project entries whose last_touch is older than
+        # ``retention_days``
+        # * GCs the single shared store to reclaim unreachable objects
+        # * enforces ``max_total_size_mb`` across remaining projects
+        # * deletes ``legacy-*`` archives older than ``retention_days``
+        "auto_prune": True,
         "retention_days": 7,
         "delete_orphans": True,
         "min_interval_hours": 24,


@@ -9379,6 +9379,20 @@ Examples:
     )
     backup_parser.set_defaults(func=cmd_backup)

+    # =========================================================================
+    # checkpoints command
+    # =========================================================================
+    checkpoints_parser = subparsers.add_parser(
+        "checkpoints",
+        help="Inspect / prune / clear ~/.hermes/checkpoints/",
+        description="Manage the filesystem checkpoint store — the shadow git "
+                    "repo hermes uses to snapshot working directories before "
+                    "write_file/patch/terminal calls. Lets you see how much "
+                    "space checkpoints occupy, force a prune, or wipe the base.",
+    )
+    from hermes_cli.checkpoints import register_cli as _register_checkpoints_cli
+    _register_checkpoints_cli(checkpoints_parser)
+
     # =========================================================================
     # import command
     # =========================================================================


@@ -966,7 +966,9 @@ class AIAgent:
         fallback_model: Dict[str, Any] = None,
         credential_pool=None,
         checkpoints_enabled: bool = False,
-        checkpoint_max_snapshots: int = 50,
+        checkpoint_max_snapshots: int = 20,
+        checkpoint_max_total_size_mb: int = 500,
+        checkpoint_max_file_size_mb: int = 10,
         pass_session_id: bool = False,
     ):
         """
@@ -1689,6 +1691,8 @@ class AIAgent:
         self._checkpoint_mgr = CheckpointManager(
             enabled=checkpoints_enabled,
             max_snapshots=checkpoint_max_snapshots,
+            max_total_size_mb=checkpoint_max_total_size_mb,
+            max_file_size_mb=checkpoint_max_file_size_mb,
         )

         # SQLite session store (optional -- provided by CLI or gateway)

File diff suppressed because it is too large

File diff suppressed because it is too large


@@ -54,6 +54,7 @@ hermes [global-options] <command> [subcommand/options]
 | `hermes dump` | Copy-pasteable setup summary for support/debugging. |
 | `hermes debug` | Debug tools — upload logs and system info for support. |
 | `hermes backup` | Back up Hermes home directory to a zip file. |
+| `hermes checkpoints` | Inspect / prune / clear `~/.hermes/checkpoints/` (the shadow store used by `/rollback`). Run with no args for a status overview. |
 | `hermes import` | Restore a Hermes backup from a zip file. |
 | `hermes logs` | View, tail, and filter agent/gateway/error log files. |
 | `hermes config` | Show, edit, migrate, and query configuration files. |
@@ -579,6 +580,44 @@ hermes backup --quick # Quick state-only snapshot
 hermes backup --quick --label "pre-upgrade" # Quick snapshot with label
 ```

+## `hermes checkpoints`
+
+```bash
+hermes checkpoints [COMMAND]
+```
+
+Inspect and manage the shadow git store at `~/.hermes/checkpoints/` — the storage layer behind the in-session `/rollback` command. Safe to run any time; does not require the agent to be running.
+
+| Subcommand | Description |
+|------------|-------------|
+| `status` (default) | Show total size, project count, and per-project breakdown. Bare `hermes checkpoints` is equivalent. |
+| `list` | Alias for `status`. |
+| `prune` | Force a cleanup sweep — delete orphan and stale projects, GC the store, enforce the size cap. Ignores the 24h idempotency marker. |
+| `clear` | Delete the entire checkpoint base. Irreversible; asks for confirmation unless `-f`. |
+| `clear-legacy` | Delete only the `legacy-<timestamp>/` archives produced by the v1→v2 migration. |
+
+### Options
+
+| Option | Subcommand | Description |
+|--------|------------|-------------|
+| `--limit N` | `status`, `list` | Max projects to list (default 20). |
+| `--retention-days N` | `prune` | Drop projects whose `last_touch` is older than N days (default 7). |
+| `--max-size-mb N` | `prune` | After the orphan/stale pass, drop the oldest commit per project until total store size ≤ N MB (default 500). |
+| `--keep-orphans` | `prune` | Skip deleting projects whose working directory no longer exists. |
+| `-f`, `--force` | `clear`, `clear-legacy` | Skip the confirmation prompt. |
+
+### Examples
+
+```bash
+hermes checkpoints # status overview
+hermes checkpoints prune --retention-days 3 # aggressive cleanup
+hermes checkpoints prune --max-size-mb 200 # tighten size cap once
+hermes checkpoints clear-legacy -f # drop v1 archive dirs
+hermes checkpoints clear -f # wipe everything
+```
+
+See [Checkpoints and `/rollback`](../user-guide/checkpoints-and-rollback.md) for the full architecture and the in-session commands.
+
 ## `hermes import`

 ```bash


@@ -7,9 +7,22 @@ description: "Filesystem safety nets for destructive operations using shadow git

 # Checkpoints and `/rollback`

-Hermes Agent automatically snapshots your project before **destructive operations** and lets you restore it with a single command. Checkpoints are **enabled by default** — there's zero cost when no file-mutating tools fire.
+Hermes Agent can automatically snapshot your project before **destructive operations** and restore it with a single command. Checkpoints are **opt-in** as of v2 — most users never use `/rollback`, and the shadow-store storage is non-trivial over time, so the default is off.

-This safety net is powered by an internal **Checkpoint Manager** that keeps a separate shadow git repository under `~/.hermes/checkpoints/` — your real project `.git` is never touched.
+Enable checkpoints per-session with `--checkpoints`:
+
+```bash
+hermes chat --checkpoints
+```
+
+Or enable globally in `~/.hermes/config.yaml`:
+
+```yaml
+checkpoints:
+  enabled: true
+```
+
+This safety net is powered by an internal **Checkpoint Manager** that keeps a single shared shadow git repository under `~/.hermes/checkpoints/store/` — your real project `.git` is never touched. Every project the agent works in shares the same store, so git's content-addressable object DB deduplicates across projects and across turns.

 ## What Triggers a Checkpoint

@@ -22,6 +35,8 @@ The agent creates **at most one checkpoint per directory per turn**, so long-run

 ## Quick Reference

+In-session slash commands:
+
 | Command | Description |
 |---------|-------------|
 | `/rollback` | List all checkpoints with change stats |
@@ -29,6 +44,17 @@ The agent creates **at most one checkpoint per directory per turn**, so long-run
 | `/rollback diff <N>` | Preview diff between checkpoint N and current state |
 | `/rollback <N> <file>` | Restore a single file from checkpoint N |

+CLI for inspecting and managing the store outside a session:
+
+| Command | Description |
+|---------|-------------|
+| `hermes checkpoints` | Show total size, project count, per-project breakdown |
+| `hermes checkpoints status` | Same as bare `checkpoints` |
+| `hermes checkpoints list` | Alias for `status` |
+| `hermes checkpoints prune` | Force a sweep: delete orphans/stale, GC, enforce size cap |
+| `hermes checkpoints clear` | Nuke the entire checkpoint base (asks first) |
+| `hermes checkpoints clear-legacy` | Delete only the `legacy-*` archives from v1 migration |
+
 ## How Checkpoints Work

 At a high level:
@@ -36,9 +62,9 @@ At a high level:
 - Hermes detects when tools are about to **modify files** in your working tree.
 - Once per conversation turn (per directory), it:
   - Resolves a reasonable project root for the file.
-  - Initialises or reuses a **shadow git repo** tied to that directory.
-  - Stages and commits the current state with a short, human-readable reason.
-- These commits form a checkpoint history that you can inspect and restore via `/rollback`.
+  - Initialises or reuses the **single shared shadow store** at `~/.hermes/checkpoints/store/`.
+  - Stages into a per-project index, builds a tree, and commits to a per-project ref (`refs/hermes/<project-hash>`).
+- These per-project refs form a checkpoint history that you can inspect and restore via `/rollback`.

 ```mermaid
 flowchart LR
@@ -46,44 +72,46 @@ flowchart LR
   agent["AIAgent\n(run_agent.py)"]
   tools["File & terminal tools"]
   cpMgr["CheckpointManager"]
-  shadowRepo["Shadow git repo\n~/.hermes/checkpoints/<hash>"]
+  store["Shared shadow store\n~/.hermes/checkpoints/store/"]

   user --> agent
   agent -->|"tool call"| tools
   tools -->|"before mutate\nensure_checkpoint()"| cpMgr
-  cpMgr -->|"git add/commit"| shadowRepo
+  cpMgr -->|"git add/commit-tree/update-ref"| store
   cpMgr -->|"OK / skipped"| tools
   tools -->|"apply changes"| agent
 ```

 ## Configuration

-Checkpoints are enabled by default. Configure in `~/.hermes/config.yaml`:
+Configure in `~/.hermes/config.yaml`:

 ```yaml
 checkpoints:
-  enabled: true # master switch (default: true)
-  max_snapshots: 50 # max checkpoints per directory
+  enabled: false # master switch (default: false — opt-in)
+  max_snapshots: 20 # max checkpoints per project (enforced via ref rewrite + gc)
+  max_total_size_mb: 500 # hard cap on total store size; oldest commits dropped
+  max_file_size_mb: 10 # skip any single file larger than this

-  # Auto-maintenance (opt-in): sweep ~/.hermes/checkpoints/ at startup
-  # and delete shadow repos whose working directory no longer exists
-  # (orphans) or whose newest commit is older than retention_days.
-  # Runs at most once per min_interval_hours, tracked via a
-  # .last_prune marker inside ~/.hermes/checkpoints/.
-  auto_prune: false # default off — enable to reclaim disk
+  # Auto-maintenance (on by default): sweep ~/.hermes/checkpoints/ at startup
+  # and delete project entries whose working directory no longer exists
+  # (orphans) or whose last_touch is older than retention_days. Runs at most
+  # once per min_interval_hours, tracked via a .last_prune marker.
+  auto_prune: true
   retention_days: 7
-  delete_orphans: true # delete repos whose workdir is gone
+  delete_orphans: true
   min_interval_hours: 24
 ```

-To disable:
+To disable everything:

 ```yaml
 checkpoints:
   enabled: false
+  auto_prune: false
 ```

-When disabled, the Checkpoint Manager is a no-op and never attempts git operations.
+When `enabled: false`, the Checkpoint Manager is a no-op and never attempts git operations. When `auto_prune: false`, the store grows until you run `hermes checkpoints prune` manually.

 ## Listing Checkpoints

@@ -107,12 +135,38 @@ Hermes responds with a formatted list showing change statistics:
 /rollback <N> <file> restore a single file from checkpoint N
 ```

-Each entry shows:
-
-- Short hash
-- Timestamp
-- Reason (what triggered the snapshot)
-- Change summary (files changed, insertions/deletions)
+## Inspecting the Store from the Shell
+
+```bash
+hermes checkpoints
+```
+
+Sample output:
+
+```text
+Checkpoint base: /home/you/.hermes/checkpoints
+Total size: 142.3 MB
+ store/ 138.1 MB
+ legacy-* 4.2 MB
+Projects: 12
+
+ WORKDIR COMMITS LAST TOUCH STATE
+ /home/you/code/hermes-agent 20 2h ago live
+ /home/you/code/experiments/rl-runner 8 1d ago live
+ /home/you/code/old-prototype 3 9d ago orphan
+ ...
+
+Legacy archives (1):
+ legacy-20260506-050616 4.2 MB
+
+Clear with: hermes checkpoints clear-legacy
+```
+
+Force a full sweep (ignores the 24h idempotency marker):
+
+```bash
+hermes checkpoints prune --retention-days 3 --max-size-mb 200
+```

 ## Previewing Changes with `/rollback diff`

@@ -122,49 +176,21 @@ Before committing to a restore, preview what has changed since a checkpoint:
 /rollback diff 1
 ```

-This shows a git diff stat summary followed by the actual diff:
-
-```text
-test.py | 2 +-
- 1 file changed, 1 insertion(+), 1 deletion(-)
-
-diff --git a/test.py b/test.py
---- a/test.py
-+++ b/test.py
-@@ -1 +1 @@
--print('original content')
-+print('modified content')
-```
-
-Long diffs are capped at 80 lines to avoid flooding the terminal.
+This shows a git diff stat summary followed by the actual diff.

 ## Restoring with `/rollback`

-Restore to a checkpoint by number:
-
 ```
 /rollback 1
 ```

 Behind the scenes, Hermes:

-1. Verifies the target commit exists in the shadow repo.
-2. Takes a **pre-rollback snapshot** of the current state so you can "undo the undo" later.
+1. Verifies the target commit exists in the shadow store.
+2. Takes a **pre-rollback snapshot** of the current state so you can "undo the undo" later.
 3. Restores tracked files in your working directory.
 4. **Undoes the last conversation turn** so the agent's context matches the restored filesystem state.

-On success:
-
-```text
-✅ Restored to checkpoint 4270a8c5: before patch
-A pre-rollback snapshot was saved automatically.
-(^_^)b Undid 4 message(s). Removed: "Now update test.py to ..."
- 4 message(s) remaining in history.
- Chat turn undone to match restored file state.
-```
-
-The conversation undo ensures the agent doesn't "remember" changes that have been rolled back, avoiding confusion on the next turn.
-
 ## Single-File Restore

 Restore just one file from a checkpoint without affecting the rest of the directory:
@ -173,42 +199,51 @@ Restore just one file from a checkpoint without affecting the rest of the direct
/rollback 1 src/broken_file.py /rollback 1 src/broken_file.py
``` ```
This is useful when the agent made changes to multiple files but only one needs to be reverted.
## Safety and Performance Guards
To keep checkpointing safe and fast, Hermes applies several guardrails:
- **Git availability** — if `git` is not found on `PATH`, checkpoints are transparently disabled.
- **Directory scope** — Hermes skips overly broad directories (root `/`, home `$HOME`).
- **Repository size** — directories with more than 50,000 files are skipped.
- **Per-file size cap** — files larger than `max_file_size_mb` (default 10 MB) are excluded from the snapshot. This prevents accidentally swallowing datasets, model weights, or generated media.
- **Total store size cap** — when the store exceeds `max_total_size_mb` (default 500 MB), the oldest commit of each project is dropped round-robin until the store is back under the cap.
- **Real pruning** — `max_snapshots` is enforced by rewriting the per-project ref and running `git gc --prune=now` afterwards, so loose objects don't accumulate.
- **No-change snapshots** — if there are no changes since the last snapshot, the checkpoint is skipped.
- **Non-fatal errors** — all errors inside the Checkpoint Manager are logged at debug level; your tools continue to run.
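The two size guardrails can be sketched as follows. This is a simplified model, not the Checkpoint Manager's code: the function names are hypothetical, and snapshot sizes are treated as additive, which ignores git's object deduplication.

```python
from typing import Dict, List

MB = 1024 * 1024


def drop_oversize(files: Dict[str, int], max_file_size_mb: int = 10) -> Dict[str, int]:
    """Keep only files whose size in bytes is within the per-file cap."""
    return {path: size for path, size in files.items() if size <= max_file_size_mb * MB}


def enforce_size_cap(projects: Dict[str, List[int]], max_total_size_mb: int = 500) -> List[str]:
    """Drop the oldest snapshot of each project in turn until the total fits.

    `projects` maps a project hash to snapshot sizes in MB, oldest first.
    The mapping is modified in place; the return value lists, in order,
    which project lost a snapshot at each step.
    """
    dropped: List[str] = []

    def total() -> int:
        return sum(sum(sizes) for sizes in projects.values())

    while total() > max_total_size_mb:
        progressed = False
        for name in sorted(projects):      # fixed order, one drop per project per pass
            if total() <= max_total_size_mb:
                break
            if projects[name]:
                projects[name].pop(0)      # oldest snapshot goes first
                dropped.append(name)
                progressed = True
        if not progressed:                 # nothing left to drop
            break
    return dropped
```

Dropping one snapshot per project per pass spreads the loss of history across projects instead of emptying a single project first.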
## Where Checkpoints Live
All checkpoint data lives under:
```text
~/.hermes/checkpoints/
├── store/                    # single shared bare git repo
│   ├── HEAD, objects/        # git internals (shared across projects)
│   ├── refs/hermes/<hash>    # per-project branch tip
│   ├── indexes/<hash>        # per-project git index
│   ├── projects/<hash>.json  # workdir + created_at + last_touch
│   └── info/exclude
├── .last_prune               # auto-prune idempotency marker
└── legacy-<ts>/              # archived pre-v2 per-project shadow repos
```
Each `<hash>` is derived from the absolute path of the working directory. You normally never need to touch these manually; use `hermes checkpoints status` / `prune` / `clear` instead.
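To illustrate how one hash can key the per-project ref, index, and metadata file, here is a sketch. The hash function is an assumption (SHA-256 of the absolute path, truncated to 16 hex characters), since the exact derivation is not stated here, and the names `project_hash` and `project_paths` are hypothetical.

```python
import hashlib
import os
from pathlib import Path

STORE = Path.home() / ".hermes" / "checkpoints" / "store"


def project_hash(workdir: str) -> str:
    # Assumption: SHA-256 of the absolute path, truncated to 16 hex characters.
    absolute = os.path.abspath(workdir)
    return hashlib.sha256(absolute.encode("utf-8")).hexdigest()[:16]


def project_paths(workdir: str) -> dict:
    """Return the per-project locations inside the shared store."""
    h = project_hash(workdir)
    return {
        "ref": f"refs/hermes/{h}",                     # per-project branch tip
        "index": STORE / "indexes" / h,                # per-project git index
        "metadata": STORE / "projects" / f"{h}.json",  # workdir + timestamps
    }
```

Because the key depends only on the path, two worktrees of the same project get separate refs and indexes while still sharing one object database.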
### Migration from v1
Before the v2 rewrite, each working directory got its own complete shadow git repo directly under `~/.hermes/checkpoints/<hash>/`. That layout couldn't deduplicate objects across projects, and its pruner was a documented no-op, so the store grew without bound.
On first v2 run, any pre-v2 shadow repos are moved into `~/.hermes/checkpoints/legacy-<timestamp>/` so the new single-store layout starts clean. Old `/rollback` history is still reachable by manually inspecting the legacy archive with `git`; once you're confident you don't need it, run:
```bash
hermes checkpoints clear-legacy
```
to reclaim the space. Legacy archives are also swept by `auto_prune` after `retention_days`.
## Best Practices
- **Enable checkpoints only when you need them** — they are off by default; turn them on with `hermes chat --checkpoints` or a per-profile `enabled: true`.
- **Use `/rollback diff` before restoring** — preview what will change to pick the right checkpoint.
- **Use `/rollback` instead of `git reset`** when you want to undo agent-driven changes only.
- **Check `hermes checkpoints status` occasionally** if you use checkpoints regularly — it shows which projects are active and how much disk space the store uses.
- **Combine with Git worktrees** for maximum safety — keep each Hermes session in its own worktree/branch, with checkpoints as an extra layer.
For running multiple agents in parallel on the same repo, see the guide on [Git worktrees](./git-worktrees.md).