connect() wrapped its entire body in an unbounded blocking flock(LOCK_EX) on
every call (_cross_process_init_lock). A single process stalled inside the
critical section — or a stale lock held by a wedged worker — blocked every
other connect(), including the long-lived gateway dispatcher's next-tick
connect, forever. No timeout, no traceback, no recovery: the board silently
stopped being worked until a manual restart (issue #36644).

Two fixes:

1. Fast-path skip: once THIS process has initialized a path, the expensive
first-open work (header validation, integrity probe, schema + additive
migrations) is already cached in _INITIALIZED_PATHS. The steady-state
connect has nothing for the cross-process lock to protect, so it now opens
the connection (WAL + pragmas) under only the cheap in-process _INIT_LOCK
and never touches the file lock. This removes the lock from the dispatcher's
hot path entirely — a stalled external 'hermes kanban list' can no longer
block ticks.
2. Bounded acquire: even on first-init, _cross_process_init_lock now retries a
non-blocking acquire up to a 10s deadline, then logs a WARNING and proceeds
WITHOUT the cross-process lock. Safe because the in-process _INIT_LOCK still
serializes same-process threads and the init work is idempotent
(CREATE TABLE IF NOT EXISTS + additive migrations) — worst case is redundant
work, not corruption. A bounded 'proceed anyway' beats an unbounded hang.
The Windows path switches from LK_LOCK to LK_NBLCK (non-blocking) to match.
Closes #36644.