The stale-code self-check (Issue #17648) used sentinel-file mtimes to
decide whether the gateway survived a `hermes update` with stale
`sys.modules`. That signal false-positives on any write to the
sentinel files — including agent-driven edits during Hermes-on-Hermes
dev sessions. Telling the agent to patch `run_agent.py` would flip
the check to True on the next user message and force a gateway
restart even though no update happened.
Switch the signal to `git rev-parse HEAD`. Agent file edits don't
move HEAD; `hermes update` (git pull) always does. Reading .git/HEAD
directly (no subprocess) with a 5s cache keeps the overhead negligible
on bursty chats. Non-git installs short-circuit to False — the
stale-modules class can't occur without a git-backed update path, so
there's nothing to detect.
The legacy `_compute_repo_mtime` helper is kept but unused by
detection, reserved as a fallback hook for future pip-install update
paths.
- _read_git_head_sha(): resolves HEAD across main checkout, worktree
(follows `gitdir:` + `commondir` pointers), and packed-refs layouts.
- _current_git_sha_cached(): per-runner 5s SHA cache.
- _detect_stale_code(): boot SHA vs current SHA, returns False when
either is unavailable.
- Tests cover all four layouts, the agent-edits-don't-trigger
regression, and cache behavior.
Refs #17648.
Long-running gateway processes that survive 'hermes update' keep
pre-update modules cached in sys.modules. When new tool files on
disk then try to 'from hermes_cli.config import cfg_get' (added in
PR #17304), the import resolves against the stale module object
and raises ImportError — hitting users on Matrix, Telegram, Feishu,
and other platforms.
Two defenses:
1. Gateway self-check (gateway/run.py). On __init__, snapshot the
newest mtime across sentinel source files (hermes_cli/config.py,
run_agent.py, gateway/run.py, etc.). On every inbound message,
re-read those mtimes; if any is newer than boot time + 2s slack,
request a graceful restart via the normal drain path and return
a one-line ack to the user. Idempotent, works regardless of how
the update happened (hermes update, manual git pull, installer).
2. Post-restart survivor sweep ('hermes update'). After the existing
restart loop, sleep 3s, rescan for gateway PIDs we already tried
to kill, and SIGKILL any survivors. The detached profile watchers
and systemd then relaunch with fresh code instead of waiting out
the 120s watcher timeout.
Closes#17648.