hermes-agent

mirrors/hermes-agent

Fork 0

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-25 11:02:03 +00:00

Commit graph

Author	SHA1	Message	Date
Teknium	e581740aa1	fix(kanban): single-writer dispatch lock to prevent orphan-dispatcher DB corruption (#50331 ) A shell-launched 'hermes gateway run --replace' / 'gateway restart' on a systemd/launchd host can leave an orphan gateway whose kanban dispatcher escapes the service cgroup, survives 'systemctl restart', and becomes a second long-lived writer on the shared kanban.db. Two dispatchers that each believe they own the file both pass SQLite busy_timeout and then race on WAL frames — the documented root cause of multi-writer corruption (issue #35240). The existing _guard_supervised_gateway_conflict startup guard blocks the common way an orphan is born, but does nothing once a second dispatcher already exists. This adds the defense-in-depth: dispatch_once now wraps every tick in a non-blocking, board-scoped flock (_dispatch_tick_lock). A losing dispatcher returns DispatchResult(skipped_locked=True) and does zero DB writes this tick — so two dispatchers can never run a reclaim/spawn/write sequence concurrently regardless of how the second one got there. - Non-blocking (LOCK_NB): never stalls the gateway's async watcher. - Board-scoped: lock file is a .dispatch.lock sibling of each board's kanban.db, so unrelated boards tick in parallel. - POSIX + Windows (fcntl / msvcrt LK_NBLCK), no-op degrade where neither exists — mirrors the existing _cross_process_init_lock pattern. Verified with a real two-process orphan repro: while a separate process holds the lock, dispatch_once skips; after release it runs.	2026-06-21 12:06:24 -07:00

Author

SHA1

Message

Date

Teknium

e581740aa1

fix(kanban): single-writer dispatch lock to prevent orphan-dispatcher DB corruption (#50331 )

A shell-launched 'hermes gateway run --replace' / 'gateway restart' on a
systemd/launchd host can leave an orphan gateway whose kanban dispatcher
escapes the service cgroup, survives 'systemctl restart', and becomes a
second long-lived writer on the shared kanban.db. Two dispatchers that each
believe they own the file both pass SQLite busy_timeout and then race on WAL
frames — the documented root cause of multi-writer corruption (issue #35240).

The existing _guard_supervised_gateway_conflict startup guard blocks the
common way an orphan is born, but does nothing once a second dispatcher
already exists. This adds the defense-in-depth: dispatch_once now wraps every
tick in a non-blocking, board-scoped flock (_dispatch_tick_lock). A losing
dispatcher returns DispatchResult(skipped_locked=True) and does zero DB writes
this tick — so two dispatchers can never run a reclaim/spawn/write sequence
concurrently regardless of how the second one got there.

- Non-blocking (LOCK_NB): never stalls the gateway's async watcher.
- Board-scoped: lock file is a .dispatch.lock sibling of each board's
  kanban.db, so unrelated boards tick in parallel.
- POSIX + Windows (fcntl / msvcrt LK_NBLCK), no-op degrade where neither
  exists — mirrors the existing _cross_process_init_lock pattern.

Verified with a real two-process orphan repro: while a separate process holds
the lock, dispatch_once skips; after release it runs.

2026-06-21 12:06:24 -07:00

1 commit