feat(checkpoints): v2 single-store rewrite with real pruning + disk guardrails (#20709)

Replaces the per-directory shadow-repo design with a single shared shadow
git store at ~/.hermes/checkpoints/store/. The object DB is now
deduplicated across every working directory the agent has ever touched;
a dozen worktrees of the same project cost near-zero additional disk space.

Why
---
Pre-v2 design had three compounding problems that let ~/.hermes/checkpoints/
grow to multi-GB on active machines:

1. Each working directory got its own full shadow git repo — no object
   dedup across projects or across worktrees of the same project.
2. _prune() was a documented no-op: max_snapshots only limited the
   /rollback listing. Loose objects accumulated forever.
3. Defaults: enabled=True, auto_prune=False — users paid the disk cost
   without ever asking for /rollback.

Field report on a single workstation: 847 MB across 47 shadow repos,
mostly redundant clones of the hermes-agent source tree.

Changes
-------
- tools/checkpoint_manager.py: full rewrite. Single bare store, per-project
  refs (refs/hermes/<hash>), per-project indexes (store/indexes/<hash>),
  per-project metadata (store/projects/<hash>.json with workdir +
  created_at + last_touch). On first v2 init, any pre-v2 per-directory
  shadow repos are auto-migrated into legacy-<timestamp>/ so the new
  store starts clean. _prune() now actually rewrites the per-project ref
  to the last max_snapshots commits and runs git gc --prune=now. New
  _enforce_size_cap() drops oldest commits round-robin across projects
  when the store exceeds max_total_size_mb. _drop_oversize_from_index()
  filters any single file larger than max_file_size_mb out of the snapshot.
- hermes_cli/checkpoints.py: new 'hermes checkpoints' CLI
  (status / list / prune / clear / clear-legacy) for managing the store
  outside a session.
- hermes_cli/config.py: flipped defaults — enabled=False, max_snapshots=20,
  auto_prune=True. Added max_total_size_mb=500, max_file_size_mb=10.
  Tightened DEFAULT_EXCLUDES (added target/, *.so/*.dylib/*.dll,
  *.mp4/*.mov, *.zip/*.tar.gz, .worktrees/, .mypy_cache/, etc.).
- run_agent.py / cli.py / gateway/run.py: thread the new kwargs through
  AIAgent and the startup auto_prune hooks.
- Tests rewritten to match v2 storage while keeping backwards-compat
  coverage for the pre-v2 prune path (per-directory shadow repos under
  base/ are still swept correctly for anyone mid-migration).
- Docs updated: user-guide/checkpoints-and-rollback.md explains the
  shared store, new defaults, migration, and the new CLI;
  reference/cli-commands.md documents 'hermes checkpoints'.

E2E validated
-------------
- Legacy migration: pre-v2 shadow repos auto-archived into legacy-<ts>/.
- Object dedup: two projects with an identical shared.py blob resolve to
  7 total objects in the store (v1 would have stored the blob twice).
- max_snapshots=3 actually enforced: after 6 commits, list shows 3.
- Orphan prune: deleting a project's workdir + 'hermes checkpoints prune
  --retention-days 0' removes its ref, index, and metadata; GC reclaims
  the objects.
- max_file_size_mb=1 excludes a 2 MB weights.bin while keeping the
  tracked source code files.
- hermes checkpoints {status,prune,clear,clear-legacy} all work from the
  CLI without an agent running.

Breaking / migration
--------------------
No in-place data migration — legacy per-directory shadow repos are moved
into legacy-<timestamp>/ on first run. Old /rollback history is still
accessible by inspecting the archive with git; run
'hermes checkpoints clear-legacy' to reclaim the space when ready. Users
relying on /rollback must now set checkpoints.enabled=true (or pass
--checkpoints) explicitly.
Teknium 2026-05-06 05:44:35 -07:00 committed by GitHub
parent b045e7a2ba
commit a0fedfbb1b
GPG key ID: B5690EEEBB952194
10 changed files with 1965 additions and 715 deletions

cli.py

@ -987,6 +987,7 @@ def _run_checkpoint_auto_maintenance() -> None:
retention_days=int(cfg.get("retention_days", 7)),
min_interval_hours=int(cfg.get("min_interval_hours", 24)),
delete_orphans=bool(cfg.get("delete_orphans", True)),
max_total_size_mb=int(cfg.get("max_total_size_mb", 500)),
)
except Exception as exc:
logger.debug("checkpoint auto-maintenance skipped: %s", exc)
@ -2273,7 +2274,9 @@ class HermesCLI:
if isinstance(cp_cfg, bool):
cp_cfg = {"enabled": cp_cfg}
self.checkpoints_enabled = checkpoints or cp_cfg.get("enabled", False)
self.checkpoint_max_snapshots = cp_cfg.get("max_snapshots", 50)
self.checkpoint_max_snapshots = cp_cfg.get("max_snapshots", 20)
self.checkpoint_max_total_size_mb = cp_cfg.get("max_total_size_mb", 500)
self.checkpoint_max_file_size_mb = cp_cfg.get("max_file_size_mb", 10)
self.pass_session_id = pass_session_id
# --ignore-rules: honor either the constructor flag or the env var set
# by `hermes chat --ignore-rules` in hermes_cli/main.py. When true we
@ -3845,6 +3848,8 @@ class HermesCLI:
thinking_callback=self._on_thinking,
checkpoints_enabled=self.checkpoints_enabled,
checkpoint_max_snapshots=self.checkpoint_max_snapshots,
checkpoint_max_total_size_mb=self.checkpoint_max_total_size_mb,
checkpoint_max_file_size_mb=self.checkpoint_max_file_size_mb,
pass_session_id=self.pass_session_id,
skip_context_files=self.ignore_rules,
skip_memory=self.ignore_rules,


@ -1160,6 +1160,7 @@ class GatewayRunner:
retention_days=int(_ckpt_cfg.get("retention_days", 7)),
min_interval_hours=int(_ckpt_cfg.get("min_interval_hours", 24)),
delete_orphans=bool(_ckpt_cfg.get("delete_orphans", True)),
max_total_size_mb=int(_ckpt_cfg.get("max_total_size_mb", 500)),
)
except Exception as exc:
logger.debug("checkpoint auto-maintenance skipped: %s", exc)

hermes_cli/checkpoints.py Normal file

@ -0,0 +1,244 @@
"""`hermes checkpoints` CLI subcommand.

Gives users direct visibility and control over the filesystem checkpoint
store at ``~/.hermes/checkpoints/``. Actions:

    hermes checkpoints                 # same as `status`
    hermes checkpoints status          # total size, project count, breakdown
    hermes checkpoints list            # per-project checkpoint counts + workdir
    hermes checkpoints prune [opts]    # force a sweep (ignores the 24h marker)
    hermes checkpoints clear [-f]      # nuke the entire base (asks first)
    hermes checkpoints clear-legacy    # delete just the legacy-* archives

Examples::

    hermes checkpoints
    hermes checkpoints prune --retention-days 3 --max-size-mb 200
    hermes checkpoints clear -f

None of these require the agent to be running. Safe to call any time.
"""

from __future__ import annotations

import argparse
import time
from datetime import datetime
from pathlib import Path
from typing import Any, Dict


def _fmt_bytes(n: int) -> str:
    units = ("B", "KB", "MB", "GB", "TB")
    size = float(n or 0)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            if unit == "B":
                return f"{int(size)} {unit}"
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TB"


def _fmt_ts(ts: Any) -> str:
    try:
        return datetime.fromtimestamp(float(ts)).strftime("%Y-%m-%d %H:%M")
    except (TypeError, ValueError):
        return ""


def _fmt_age(ts: Any) -> str:
    try:
        age = time.time() - float(ts)
    except (TypeError, ValueError):
        return ""
    if age < 0:
        return "now"
    if age < 60:
        return f"{int(age)}s ago"
    if age < 3600:
        return f"{int(age / 60)}m ago"
    if age < 86400:
        return f"{int(age / 3600)}h ago"
    return f"{int(age / 86400)}d ago"


def cmd_status(args: argparse.Namespace) -> int:
    from tools.checkpoint_manager import store_status

    info = store_status()
    base = info["base"]
    print(f"Checkpoint base: {base}")
    print(f"Total size: {_fmt_bytes(info['total_size_bytes'])}")
    print(f" store/ {_fmt_bytes(info['store_size_bytes'])}")
    print(f" legacy-* {_fmt_bytes(info['legacy_size_bytes'])}")
    print(f"Projects: {info['project_count']}")

    projects = sorted(
        info["projects"],
        key=lambda p: (p.get("last_touch") or 0),
        reverse=True,
    )
    if projects:
        print()
        print(f" {'WORKDIR':<60} {'COMMITS':>7} {'LAST TOUCH':>12} STATE")
        # Bare `hermes checkpoints` has no --limit attribute, so fall back to 20.
        limit = getattr(args, "limit", None) or 20
        for p in projects[:limit]:
            wd = p.get("workdir") or "(unknown)"
            if len(wd) > 60:
                # Keep the tail of long paths; the leading ellipsis marks the cut.
                wd = "…" + wd[-59:]
            exists = p.get("exists")
            state = "live" if exists else "orphan"
            commits = p.get("commits", 0)
            last = _fmt_age(p.get("last_touch"))
            print(f" {wd:<60} {commits:>7} {last:>12} {state}")

    legacy = info.get("legacy_archives", [])
    if legacy:
        print()
        print(f"Legacy archives ({len(legacy)}):")
        for arch in sorted(legacy, key=lambda a: a.get("mtime", 0), reverse=True):
            print(f" {arch['name']:<40} {_fmt_bytes(arch['size_bytes']):>10}")
        print()
        print("Clear with: hermes checkpoints clear-legacy")
    return 0


def cmd_list(args: argparse.Namespace) -> int:
    # `list` is just a terser status — already covered.
    return cmd_status(args)


def cmd_prune(args: argparse.Namespace) -> int:
    from tools.checkpoint_manager import prune_checkpoints

    retention_days = args.retention_days
    max_size_mb = args.max_size_mb
    print("Pruning checkpoint store…")
    print(f" retention_days: {retention_days}")
    print(f" delete_orphans: {not args.keep_orphans}")
    print(f" max_total_size_mb: {max_size_mb}")
    print()

    result = prune_checkpoints(
        retention_days=retention_days,
        delete_orphans=not args.keep_orphans,
        max_total_size_mb=max_size_mb,
    )
    print(f"Scanned: {result['scanned']}")
    print(f"Deleted orphan: {result['deleted_orphan']}")
    print(f"Deleted stale: {result['deleted_stale']}")
    print(f"Errors: {result['errors']}")
    print(f"Bytes reclaimed: {_fmt_bytes(result['bytes_freed'])}")
    return 0


def _confirm(prompt: str) -> bool:
    try:
        resp = input(f"{prompt} [y/N]: ").strip().lower()
    except (EOFError, KeyboardInterrupt):
        print()
        return False
    return resp in ("y", "yes")


def cmd_clear(args: argparse.Namespace) -> int:
    from tools.checkpoint_manager import CHECKPOINT_BASE, clear_all, store_status

    info = store_status()
    if info["total_size_bytes"] == 0 and not Path(CHECKPOINT_BASE).exists():
        print("Nothing to clear — checkpoint base does not exist.")
        return 0

    print(f"This will delete the ENTIRE checkpoint base at {info['base']}")
    print(f" size: {_fmt_bytes(info['total_size_bytes'])}")
    print(f" projects: {info['project_count']}")
    print(f" legacy dirs: {len(info.get('legacy_archives', []))}")
    print()
    print("All /rollback history for every working directory will be lost.")
    if not args.force and not _confirm("Proceed?"):
        print("Aborted.")
        return 1

    result = clear_all()
    if result["deleted"]:
        print(f"Cleared. Reclaimed {_fmt_bytes(result['bytes_freed'])}.")
        return 0
    print("Could not clear checkpoint base (see logs).")
    return 2


def cmd_clear_legacy(args: argparse.Namespace) -> int:
    from tools.checkpoint_manager import clear_legacy, store_status

    info = store_status()
    legacy = info.get("legacy_archives", [])
    if not legacy:
        print("No legacy archives to clear.")
        return 0

    total = sum(a.get("size_bytes", 0) for a in legacy)
    print(f"Found {len(legacy)} legacy archive(s), total {_fmt_bytes(total)}:")
    for arch in legacy:
        print(f" {arch['name']:<40} {_fmt_bytes(arch['size_bytes']):>10}")
    print()
    print("Legacy archives hold pre-v2 per-project shadow repos, moved aside")
    print("during the single-store migration. Delete when you're confident")
    print("you don't need the old /rollback history.")
    if not args.force and not _confirm("Delete all legacy archives?"):
        print("Aborted.")
        return 1

    result = clear_legacy()
    print(f"Deleted {result['deleted']} archive(s), reclaimed {_fmt_bytes(result['bytes_freed'])}.")
    return 0


def register_cli(parser: argparse.ArgumentParser) -> None:
    """Wire subcommands onto the ``hermes checkpoints`` parser."""
    parser.set_defaults(func=cmd_status)  # bare `hermes checkpoints` → status
    subs = parser.add_subparsers(dest="checkpoints_command", metavar="COMMAND")

    p_status = subs.add_parser(
        "status",
        help="Show total size, project count, and per-project breakdown",
    )
    p_status.add_argument("--limit", type=int, default=20,
                          help="Max projects to list (default 20)")
    p_status.set_defaults(func=cmd_status)

    p_list = subs.add_parser(
        "list",
        help="Alias for 'status'",
    )
    p_list.add_argument("--limit", type=int, default=20)
    p_list.set_defaults(func=cmd_list)

    p_prune = subs.add_parser(
        "prune",
        help="Delete orphan/stale checkpoints and GC the store",
    )
    p_prune.add_argument("--retention-days", type=int, default=7,
                         help="Drop projects whose last_touch is older than N days (default 7)")
    p_prune.add_argument("--max-size-mb", type=int, default=500,
                         help="After orphan/stale prune, drop oldest commits "
                              "per project until total size <= this (default 500)")
    p_prune.add_argument("--keep-orphans", action="store_true",
                         help="Skip deleting projects whose workdir no longer exists")
    p_prune.set_defaults(func=cmd_prune)

    p_clear = subs.add_parser(
        "clear",
        help="Delete the entire checkpoint base (all /rollback history)",
    )
    p_clear.add_argument("-f", "--force", action="store_true",
                         help="Skip confirmation prompt")
    p_clear.set_defaults(func=cmd_clear)

    p_legacy = subs.add_parser(
        "clear-legacy",
        help="Delete only the legacy-<ts>/ archives from v1 migration",
    )
    p_legacy.add_argument("-f", "--force", action="store_true",
                          help="Skip confirmation prompt")
    p_legacy.set_defaults(func=cmd_clear_legacy)


@ -574,21 +574,39 @@ DEFAULT_CONFIG = {
},
# Filesystem checkpoints — automatic snapshots before destructive file ops.
# When enabled, the agent takes a snapshot of the working directory once per
# conversation turn (on first write_file/patch call). Use /rollback to restore.
# When enabled, the agent takes a snapshot of the working directory once
# per conversation turn (on first write_file/patch call). Use /rollback
# to restore.
#
# Defaults changed in v2 (single shared shadow store, real pruning):
# - enabled: True -> False (opt-in; most users never use /rollback)
# - max_snapshots: 50 -> 20 (now actually enforced via ref rewrite)
# - auto_prune: False -> True (orphans/stale pruned automatically)
# Opt in via ``hermes chat --checkpoints`` or set enabled=True here.
"checkpoints": {
"enabled": True,
"max_snapshots": 50, # Max checkpoints to keep per directory
# Auto-maintenance: shadow repos accumulate forever under
# ~/.hermes/checkpoints/ (one per cd'd working directory). Field
# reports put the typical offender at 1000+ repos / ~12 GB. When
# auto_prune is on, hermes sweeps at startup (at most once per
# min_interval_hours) and deletes:
# * orphan repos: HERMES_WORKDIR no longer exists on disk
# * stale repos: newest mtime older than retention_days
# Opt-in so users who rely on /rollback against long-ago sessions
# never lose data silently.
"auto_prune": False,
"enabled": False,
# Max checkpoints to keep per working directory. Pre-v2 this only
# limited the `/rollback` listing; v2 actually rewrites the ref and
# garbage-collects older commits.
"max_snapshots": 20,
# Hard ceiling on total ``~/.hermes/checkpoints/`` size (MB). When
# exceeded, the oldest checkpoint per project is dropped in a
# round-robin pass until total size falls under the cap.
# 0 disables the size cap.
"max_total_size_mb": 500,
# Skip any single file larger than this when staging a checkpoint.
# Prevents accidental snapshotting of datasets, model weights, and
# other large generated assets. 0 disables the filter.
"max_file_size_mb": 10,
# Auto-maintenance: hermes sweeps the checkpoint base at startup
# (at most once per ``min_interval_hours``) and:
# * deletes project entries whose workdir no longer exists (orphan)
# * deletes project entries whose last_touch is older than
# ``retention_days``
# * GCs the single shared store to reclaim unreachable objects
# * enforces ``max_total_size_mb`` across remaining projects
# * deletes ``legacy-*`` archives older than ``retention_days``
"auto_prune": True,
"retention_days": 7,
"delete_orphans": True,
"min_interval_hours": 24,


@ -9379,6 +9379,20 @@ Examples:
)
backup_parser.set_defaults(func=cmd_backup)
# =========================================================================
# checkpoints command
# =========================================================================
checkpoints_parser = subparsers.add_parser(
"checkpoints",
help="Inspect / prune / clear ~/.hermes/checkpoints/",
description="Manage the filesystem checkpoint store — the shadow git "
"repo hermes uses to snapshot working directories before "
"write_file/patch/terminal calls. Lets you see how much "
"space checkpoints occupy, force a prune, or wipe the base.",
)
from hermes_cli.checkpoints import register_cli as _register_checkpoints_cli
_register_checkpoints_cli(checkpoints_parser)
# =========================================================================
# import command
# =========================================================================


@ -966,7 +966,9 @@ class AIAgent:
fallback_model: Dict[str, Any] = None,
credential_pool=None,
checkpoints_enabled: bool = False,
checkpoint_max_snapshots: int = 50,
checkpoint_max_snapshots: int = 20,
checkpoint_max_total_size_mb: int = 500,
checkpoint_max_file_size_mb: int = 10,
pass_session_id: bool = False,
):
"""
@ -1689,6 +1691,8 @@ class AIAgent:
self._checkpoint_mgr = CheckpointManager(
enabled=checkpoints_enabled,
max_snapshots=checkpoint_max_snapshots,
max_total_size_mb=checkpoint_max_total_size_mb,
max_file_size_mb=checkpoint_max_file_size_mb,
)
# SQLite session store (optional -- provided by CLI or gateway)

File diff suppressed because it is too large

File diff suppressed because it is too large


@ -54,6 +54,7 @@ hermes [global-options] <command> [subcommand/options]
| `hermes dump` | Copy-pasteable setup summary for support/debugging. |
| `hermes debug` | Debug tools — upload logs and system info for support. |
| `hermes backup` | Back up Hermes home directory to a zip file. |
| `hermes checkpoints` | Inspect / prune / clear `~/.hermes/checkpoints/` (the shadow store used by `/rollback`). Run with no args for a status overview. |
| `hermes import` | Restore a Hermes backup from a zip file. |
| `hermes logs` | View, tail, and filter agent/gateway/error log files. |
| `hermes config` | Show, edit, migrate, and query configuration files. |
@ -579,6 +580,44 @@ hermes backup --quick # Quick state-only snapshot
hermes backup --quick --label "pre-upgrade" # Quick snapshot with label
```
## `hermes checkpoints`
```bash
hermes checkpoints [COMMAND]
```
Inspect and manage the shadow git store at `~/.hermes/checkpoints/` — the storage layer behind the in-session `/rollback` command. Safe to run any time; does not require the agent to be running.
| Subcommand | Description |
|------------|-------------|
| `status` (default) | Show total size, project count, and per-project breakdown. Bare `hermes checkpoints` is equivalent. |
| `list` | Alias for `status`. |
| `prune` | Force a cleanup sweep — delete orphan and stale projects, GC the store, enforce the size cap. Ignores the 24h idempotency marker. |
| `clear` | Delete the entire checkpoint base. Irreversible; asks for confirmation unless `-f`. |
| `clear-legacy` | Delete only the `legacy-<timestamp>/` archives produced by the v1→v2 migration. |
### Options
| Option | Subcommand | Description |
|--------|------------|-------------|
| `--limit N` | `status`, `list` | Max projects to list (default 20). |
| `--retention-days N` | `prune` | Drop projects whose `last_touch` is older than N days (default 7). |
| `--max-size-mb N` | `prune` | After the orphan/stale pass, drop the oldest commit per project until total store size ≤ N MB (default 500). |
| `--keep-orphans` | `prune` | Skip deleting projects whose working directory no longer exists. |
| `-f`, `--force` | `clear`, `clear-legacy` | Skip the confirmation prompt. |
### Examples
```bash
hermes checkpoints # status overview
hermes checkpoints prune --retention-days 3 # aggressive cleanup
hermes checkpoints prune --max-size-mb 200 # tighten size cap once
hermes checkpoints clear-legacy -f # drop v1 archive dirs
hermes checkpoints clear -f # wipe everything
```
See [Checkpoints and `/rollback`](../user-guide/checkpoints-and-rollback.md) for the full architecture and the in-session commands.
## `hermes import`
```bash


@ -7,9 +7,22 @@ description: "Filesystem safety nets for destructive operations using shadow git
# Checkpoints and `/rollback`
Hermes Agent automatically snapshots your project before **destructive operations** and lets you restore it with a single command. Checkpoints are **enabled by default** — there's zero cost when no file-mutating tools fire.
Hermes Agent can automatically snapshot your project before **destructive operations** and restore it with a single command. Checkpoints are **opt-in** as of v2: most users never use `/rollback`, and the shadow store's disk usage adds up over time, so the default is off.
This safety net is powered by an internal **Checkpoint Manager** that keeps a separate shadow git repository under `~/.hermes/checkpoints/` — your real project `.git` is never touched.
Enable checkpoints per-session with `--checkpoints`:
```bash
hermes chat --checkpoints
```
Or enable globally in `~/.hermes/config.yaml`:
```yaml
checkpoints:
enabled: true
```
This safety net is powered by an internal **Checkpoint Manager** that keeps a single shared shadow git repository under `~/.hermes/checkpoints/store/` — your real project `.git` is never touched. Every project the agent works in shares the same store, so git's content-addressable object DB deduplicates across projects and across turns.
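
The deduplication follows from how git names objects: a blob's id is a hash of its content, not of its path or project. The sketch below is illustrative only and not taken from `checkpoint_manager.py`; it assumes a SHA-1 object format, and the two project dictionaries are made-up sample data.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the id git gives a blob with this content (SHA-1 object format)."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical sample data: two projects that both contain the same shared.py.
project_a = {"shared.py": b"print('hi')\n", "a.py": b"A = 1\n"}
project_b = {"shared.py": b"print('hi')\n", "b.py": b"B = 2\n"}

# Four files, but the identical shared.py collapses to a single object id,
# so a shared object database stores that blob only once.
object_ids = {
    git_blob_id(data)
    for files in (project_a, project_b)
    for data in files.values()
}
```

With one store per working directory, the same blob would be written once per store; with a single store it is written once in total.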
## What Triggers a Checkpoint
@ -22,6 +35,8 @@ The agent creates **at most one checkpoint per directory per turn**, so long-run
## Quick Reference
In-session slash commands:
| Command | Description |
|---------|-------------|
| `/rollback` | List all checkpoints with change stats |
@ -29,6 +44,17 @@ The agent creates **at most one checkpoint per directory per turn**, so long-run
| `/rollback diff <N>` | Preview diff between checkpoint N and current state |
| `/rollback <N> <file>` | Restore a single file from checkpoint N |
CLI for inspecting and managing the store outside a session:
| Command | Description |
|---------|-------------|
| `hermes checkpoints` | Show total size, project count, per-project breakdown |
| `hermes checkpoints status` | Same as bare `checkpoints` |
| `hermes checkpoints list` | Alias for `status` |
| `hermes checkpoints prune` | Force a sweep: delete orphans/stale, GC, enforce size cap |
| `hermes checkpoints clear` | Nuke the entire checkpoint base (asks first) |
| `hermes checkpoints clear-legacy` | Delete only the `legacy-*` archives from v1 migration |
## How Checkpoints Work
At a high level:
@ -36,9 +62,9 @@ At a high level:
- Hermes detects when tools are about to **modify files** in your working tree.
- Once per conversation turn (per directory), it:
- Resolves a reasonable project root for the file.
- Initialises or reuses a **shadow git repo** tied to that directory.
- Stages and commits the current state with a short, human-readable reason.
- These commits form a checkpoint history that you can inspect and restore via `/rollback`.
- Initialises or reuses the **single shared shadow store** at `~/.hermes/checkpoints/store/`.
- Stages into a per-project index, builds a tree, and commits to a per-project ref (`refs/hermes/<project-hash>`).
- These per-project refs form a checkpoint history that you can inspect and restore via `/rollback`.
```mermaid
flowchart LR
@ -46,44 +72,46 @@ flowchart LR
agent["AIAgent\n(run_agent.py)"]
tools["File & terminal tools"]
cpMgr["CheckpointManager"]
shadowRepo["Shadow git repo\n~/.hermes/checkpoints/<hash>"]
store["Shared shadow store\n~/.hermes/checkpoints/store/"]
user --> agent
agent -->|"tool call"| tools
tools -->|"before mutate\nensure_checkpoint()"| cpMgr
cpMgr -->|"git add/commit"| shadowRepo
cpMgr -->|"git add/commit-tree/update-ref"| store
cpMgr -->|"OK / skipped"| tools
tools -->|"apply changes"| agent
```
## Configuration
Checkpoints are enabled by default. Configure in `~/.hermes/config.yaml`:
Configure in `~/.hermes/config.yaml`:
```yaml
checkpoints:
enabled: true # master switch (default: true)
max_snapshots: 50 # max checkpoints per directory
enabled: false # master switch (default: false — opt-in)
max_snapshots: 20 # max checkpoints per project (enforced via ref rewrite + gc)
max_total_size_mb: 500 # hard cap on total store size; oldest commits dropped
max_file_size_mb: 10 # skip any single file larger than this
# Auto-maintenance (opt-in): sweep ~/.hermes/checkpoints/ at startup
# and delete shadow repos whose working directory no longer exists
# (orphans) or whose newest commit is older than retention_days.
# Runs at most once per min_interval_hours, tracked via a
# .last_prune marker inside ~/.hermes/checkpoints/.
auto_prune: false # default off — enable to reclaim disk
# Auto-maintenance (on by default): sweep ~/.hermes/checkpoints/ at startup
# and delete project entries whose working directory no longer exists
# (orphans) or whose last_touch is older than retention_days. Runs at most
# once per min_interval_hours, tracked via a .last_prune marker.
auto_prune: true
retention_days: 7
delete_orphans: true # delete repos whose workdir is gone
delete_orphans: true
min_interval_hours: 24
```
To disable:
To disable everything:
```yaml
checkpoints:
enabled: false
auto_prune: false
```
When disabled, the Checkpoint Manager is a no-op and never attempts git operations.
When `enabled: false`, the Checkpoint Manager is a no-op and never attempts git operations. When `auto_prune: false`, the store grows until you run `hermes checkpoints prune` manually.
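
As a rough illustration of how these settings resolve, the sketch below merges a user's `checkpoints` value over the v2 defaults. The function name `resolve_checkpoint_config` and the `V2_DEFAULTS` table are hypothetical; only the default values, the bare-boolean shorthand, and the `--checkpoints` override come from this change.

```python
from typing import Any, Dict, Optional, Union

# v2 defaults as listed in the configuration block above.
V2_DEFAULTS: Dict[str, Any] = {
    "enabled": False,
    "max_snapshots": 20,
    "max_total_size_mb": 500,
    "max_file_size_mb": 10,
    "auto_prune": True,
    "retention_days": 7,
    "delete_orphans": True,
    "min_interval_hours": 24,
}


def resolve_checkpoint_config(
    raw: Optional[Union[bool, Dict[str, Any]]],
    cli_flag: bool = False,
) -> Dict[str, Any]:
    """Merge user config over the defaults; `cli_flag` models --checkpoints."""
    if isinstance(raw, bool):
        # `checkpoints: true` is shorthand for `checkpoints: {enabled: true}`.
        raw = {"enabled": raw}
    merged = {**V2_DEFAULTS, **(raw or {})}
    merged["enabled"] = bool(cli_flag or merged["enabled"])
    return merged
```

For example, `resolve_checkpoint_config({"max_snapshots": 5}, cli_flag=True)` keeps the custom snapshot limit while enabling checkpoints for that session only.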
## Listing Checkpoints
@ -107,12 +135,38 @@ Hermes responds with a formatted list showing change statistics:
/rollback <N> <file> restore a single file from checkpoint N
```
Each entry shows:
## Inspecting the Store from the Shell
- Short hash
- Timestamp
- Reason (what triggered the snapshot)
- Change summary (files changed, insertions/deletions)
```bash
hermes checkpoints
```
Sample output:
```text
Checkpoint base: /home/you/.hermes/checkpoints
Total size: 142.3 MB
store/ 138.1 MB
legacy-* 4.2 MB
Projects: 12
WORKDIR COMMITS LAST TOUCH STATE
/home/you/code/hermes-agent 20 2h ago live
/home/you/code/experiments/rl-runner 8 1d ago live
/home/you/code/old-prototype 3 9d ago orphan
...
Legacy archives (1):
legacy-20260506-050616 4.2 MB
Clear with: hermes checkpoints clear-legacy
```
Force a full sweep (ignores the 24h idempotency marker):
```bash
hermes checkpoints prune --retention-days 3 --max-size-mb 200
```
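
The orphan and stale rules applied by a sweep can be sketched as a small classifier. This is an illustration, not the real `prune_checkpoints` logic: the function name and the injectable `exists` and `now` parameters are hypothetical, and it assumes project metadata carries `workdir` and `last_touch` as described for `store/projects/<hash>.json`.

```python
import os
import time
from typing import Any, Callable, Dict, Optional


def classify_project(
    meta: Dict[str, Any],
    retention_days: int = 7,
    delete_orphans: bool = True,
    now: Optional[float] = None,
    exists: Callable[[str], bool] = os.path.isdir,
) -> str:
    """Return "orphan", "stale", or "keep" for one project metadata entry."""
    now = time.time() if now is None else now
    # Orphan: the working directory is gone (skipped with --keep-orphans).
    if delete_orphans and not exists(meta["workdir"]):
        return "orphan"
    # Stale: not touched within the retention window.
    age_seconds = now - float(meta.get("last_touch") or 0)
    if age_seconds > retention_days * 86400:
        return "stale"
    return "keep"
```

Under this reading, `--retention-days 0` marks every previously touched project as stale, which is why it is useful for forcing a full cleanup.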
## Previewing Changes with `/rollback diff`
@ -122,49 +176,21 @@ Before committing to a restore, preview what has changed since a checkpoint:
/rollback diff 1
```
This shows a git diff stat summary followed by the actual diff:
```text
test.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/test.py b/test.py
--- a/test.py
+++ b/test.py
@@ -1 +1 @@
-print('original content')
+print('modified content')
```
Long diffs are capped at 80 lines to avoid flooding the terminal.
This shows a git diff stat summary followed by the actual diff.
## Restoring with `/rollback`
Restore to a checkpoint by number:
```
/rollback 1
```
Behind the scenes, Hermes:
1. Verifies the target commit exists in the shadow repo.
2. Takes a **pre-rollback snapshot** of the current state so you can "undo the undo" later.
1. Verifies the target commit exists in the shadow store.
2. Takes a **pre-rollback snapshot** of the current state so you can "undo the undo" later.
3. Restores tracked files in your working directory.
4. **Undoes the last conversation turn** so the agent's context matches the restored filesystem state.
On success:
```text
✅ Restored to checkpoint 4270a8c5: before patch
A pre-rollback snapshot was saved automatically.
(^_^)b Undid 4 message(s). Removed: "Now update test.py to ..."
4 message(s) remaining in history.
Chat turn undone to match restored file state.
```
The conversation undo ensures the agent doesn't "remember" changes that have been rolled back, avoiding confusion on the next turn.
## Single-File Restore
Restore just one file from a checkpoint without affecting the rest of the directory:
@ -173,42 +199,51 @@ Restore just one file from a checkpoint without affecting the rest of the direct
/rollback 1 src/broken_file.py
```
This is useful when the agent made changes to multiple files but only one needs to be reverted.
## Safety and Performance Guards
To keep checkpointing safe and fast, Hermes applies several guardrails:
- **Git availability** — if `git` is not found on `PATH`, checkpoints are transparently disabled.
- **Directory scope** — Hermes skips overly broad directories (root `/`, home `$HOME`).
- **Repository size** — directories with more than 50,000 files are skipped to avoid slow git operations.
- **No-change snapshots** — if there are no changes since the last snapshot, the checkpoint is skipped.
- **Non-fatal errors** — all errors inside the Checkpoint Manager are logged at debug level; your tools continue to run.
- **Repository size** — directories with more than 50,000 files are skipped.
- **Per-file size cap** — files larger than `max_file_size_mb` (default 10 MB) are excluded from the snapshot. Prevents accidentally swallowing datasets, model weights, or generated media.
- **Total store size cap** — when the store exceeds `max_total_size_mb` (default 500 MB), the oldest commit per project is dropped round-robin until under the cap.
- **Real pruning** — `max_snapshots` is enforced by rewriting the per-project ref and running `git gc --prune=now` afterwards, so loose objects don't accumulate.
- **No-change snapshots** — if there are no changes since the last snapshot, the checkpoint is skipped.
- **Non-fatal errors** — all errors inside the Checkpoint Manager are logged at debug level; your tools continue to run.
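
The round-robin size cap can be pictured with a small simulation. It is a simplified model rather than the real `_enforce_size_cap()`: it assumes each commit has an independent size (in the real store, deduplication means dropping a commit may free less), visits projects in alphabetical order, and uses a hypothetical function name.

```python
from typing import Dict, List


def enforce_size_cap(projects: Dict[str, List[int]], cap: int) -> Dict[str, List[int]]:
    """Drop the oldest commit per project, round-robin, until total <= cap.

    `projects` maps a project name to its commit sizes, oldest first.
    A cap of 0 disables the limit, mirroring `max_total_size_mb: 0`.
    """
    remaining = {name: list(sizes) for name, sizes in projects.items()}
    if cap <= 0:
        return remaining

    def total() -> int:
        return sum(sum(sizes) for sizes in remaining.values())

    while total() > cap:
        for name in sorted(remaining):
            if total() <= cap:
                break
            if remaining[name]:
                remaining[name].pop(0)  # drop this project's oldest commit
    return remaining
```

Taking one commit from each project per pass spreads the loss of history across projects instead of emptying a single large project first.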
## Where Checkpoints Live
All shadow repos live under:
```text
~/.hermes/checkpoints/
├── <hash1>/ # shadow git repo for one working directory
├── <hash2>/
└── ...
├── store/ # single shared bare git repo
│ ├── HEAD, objects/ # git internals (shared across projects)
│ ├── refs/hermes/<hash> # per-project branch tip
│ ├── indexes/<hash> # per-project git index
│ ├── projects/<hash>.json # workdir + created_at + last_touch
│ └── info/exclude
├── .last_prune # auto-prune idempotency marker
└── legacy-<ts>/ # archived pre-v2 per-project shadow repos
```
Each `<hash>` is derived from the absolute path of the working directory. Inside each shadow repo you'll find:
Each `<hash>` is derived from the absolute path of the working directory. You normally never need to touch these manually — use `hermes checkpoints status` / `prune` / `clear` instead.
- Standard git internals (`HEAD`, `refs/`, `objects/`)
- An `info/exclude` file containing a curated ignore list
- A `HERMES_WORKDIR` file pointing back to the original project root
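
To make the v2 `<hash>` naming concrete, here is one way a path-derived key could map onto the store layout shown above. The hash function and its length are assumptions (the docs only say the hash is derived from the absolute path); SHA-256 truncated to 16 hex characters is used purely for illustration, and both function names are hypothetical.

```python
import hashlib
import os
from pathlib import Path
from typing import Dict, Union


def project_hash(workdir: str, length: int = 16) -> str:
    """ASSUMPTION: SHA-256 of the absolute path, truncated to `length` chars."""
    absolute = os.path.abspath(workdir)
    return hashlib.sha256(absolute.encode("utf-8")).hexdigest()[:length]


def project_paths(store: Path, workdir: str) -> Dict[str, Union[str, Path]]:
    """Locate the per-project ref, index, and metadata inside the store."""
    key = project_hash(workdir)
    return {
        "ref": f"refs/hermes/{key}",
        "index": store / "indexes" / key,
        "metadata": store / "projects" / f"{key}.json",
    }
```

Any stable function of the absolute path would work; what matters is that the same directory always resolves to the same ref, index, and metadata file.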
### Migration from v1
You normally never need to touch these manually.
Before the v2 rewrite, each working directory got its own complete shadow git repo directly under `~/.hermes/checkpoints/<hash>/`. That layout couldn't dedup objects across projects and had a documented no-op pruner — the store would grow without bound.
On first v2 run, any pre-v2 shadow repos are moved into `~/.hermes/checkpoints/legacy-<timestamp>/` so the new single-store layout starts clean. Old `/rollback` history is still reachable by manually inspecting the legacy archive with `git`; once you're confident you don't need it, run:
```bash
hermes checkpoints clear-legacy
```
to reclaim the space. Legacy archives are also swept by `auto_prune` after `retention_days`.
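
The move-aside step can be sketched as follows. This is not the shipped migration code: the function name is hypothetical, a top-level `HEAD` file is assumed to identify a pre-v2 shadow repo, and the timestamp format is inferred from the sample archive name `legacy-20260506-050616`.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def archive_legacy_repos(base: Path, now: Optional[float] = None) -> Optional[Path]:
    """Move pre-v2 per-directory shadow repos into base/legacy-<timestamp>/."""
    if not base.is_dir():
        return None
    legacy = [
        entry for entry in base.iterdir()
        if entry.is_dir()
        and entry.name != "store"
        and not entry.name.startswith("legacy-")
        and (entry / "HEAD").exists()  # assumption: marks an old shadow repo
    ]
    if not legacy:
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    archive = base / f"legacy-{stamp}"
    archive.mkdir(exist_ok=True)
    for repo in legacy:
        shutil.move(str(repo), str(archive / repo.name))
    return archive
```

Moving rather than converting keeps the old history intact and inspectable with plain `git`, at the cost of holding the disk space until the archive is cleared.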
## Best Practices
- **Leave checkpoints enabled** — they're on by default and have zero cost when no files are modified.
- **Enable checkpoints only when you need them** — `hermes chat --checkpoints` or per-profile `enabled: true`.
- **Use `/rollback diff` before restoring** — preview what will change to pick the right checkpoint.
- **Use `/rollback` instead of `git reset`** when you want to undo agent-driven changes only.
- **Check `hermes checkpoints status` occasionally** if you use checkpoints regularly — shows which projects are active and what the store costs you.
- **Combine with Git worktrees** for maximum safety — keep each Hermes session in its own worktree/branch, with checkpoints as an extra layer.
For running multiple agents in parallel on the same repo, see the guide on [Git worktrees](./git-worktrees.md).