hermes-agent

mirrors/hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-25 11:02:03 +00:00

Commit graph

Author

SHA1

Message

Date

Teknium

84ba83b09a

fix(kanban): bound the cross-process init lock so connect() can't hang forever (#50353)

connect() wrapped its entire body in an unbounded blocking flock(LOCK_EX) on
every call (_cross_process_init_lock). A single process stalled inside the
critical section — or a stale lock held by a wedged worker — blocked every
other connect(), including the long-lived gateway dispatcher's next-tick
connect, forever. No timeout, no traceback, no recovery: the board silently
stopped being worked until a manual restart (issue #36644).

Two fixes:

1. Fast-path skip: once THIS process has initialized a path, the expensive
   first-open work (header validation, integrity probe, schema + additive
   migrations) is already cached in _INITIALIZED_PATHS. The steady-state
   connect has nothing for the cross-process lock to protect, so it now opens
   the connection (WAL + pragmas) under only the cheap in-process _INIT_LOCK
   and never touches the file lock. This removes the lock from the dispatcher's
   hot path entirely — a stalled external 'hermes kanban list' can no longer
   block ticks.

2. Bounded acquire: even on first-init, _cross_process_init_lock now retries a
   non-blocking acquire up to a 10s deadline, then logs a WARNING and proceeds
   WITHOUT the cross-process lock. Safe because the in-process _INIT_LOCK still
   serializes same-process threads and the init work is idempotent
   (CREATE TABLE IF NOT EXISTS + additive migrations) — worst case is redundant
   work, not corruption. A bounded 'proceed anyway' beats an unbounded hang.

Windows path switched LK_LOCK -> LK_NBLCK (non-blocking) to match.
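
Why the Windows switch matters: per the Python msvcrt documentation, LK_LOCK retries internally (once per second, up to 10 attempts) before raising OSError, while LK_NBLCK raises immediately, so only the latter composes with a caller-controlled deadline. A hedged sketch of a platform-neutral non-blocking primitive follows; the helper names try_lock and unlock are hypothetical and not taken from the commit.

```python
import sys

if sys.platform == "win32":
    import msvcrt

    def try_lock(f):
        """Try to lock byte 0 of the file without blocking; True on success."""
        try:
            f.seek(0)  # msvcrt.locking works from the current file position
            msvcrt.locking(f.fileno(), msvcrt.LK_NBLCK, 1)
            return True
        except OSError:
            return False

    def unlock(f):
        f.seek(0)
        msvcrt.locking(f.fileno(), msvcrt.LK_UNLCK, 1)
else:
    import fcntl

    def try_lock(f):
        """Try to take an exclusive flock without blocking; True on success."""
        try:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True
        except OSError:
            return False

    def unlock(f):
        fcntl.flock(f, fcntl.LOCK_UN)
```

A retry loop with its own deadline can then call try_lock() on either platform and sleep briefly between attempts.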

Closes #36644.

2026-06-21 12:43:41 -07:00

1 commit