Commit graph

2 commits

Author SHA1 Message Date
Stephen Chin
5c49cd0ed0 fix(state): never silently downgrade WAL to DELETE on transient EIO
apply_wal_with_fallback() treated "disk i/o error" as a permanent
WAL-incompatibility marker, identical to "locking protocol" (NFS) and
"not authorized" (FUSE). But EIO during PRAGMA journal_mode=WAL is
typically TRANSIENT — page-cache pressure, brief lock contention,
recoverable storage hiccups — not a permanent filesystem property.

Treating transient EIO as a permanent downgrade signal produces the
mixed-journal-mode-across-processes corruption pattern:

  1. Process A opens kanban.db, hits transient EIO on the WAL pragma,
     silently downgrades to journal_mode=DELETE.
  2. Process B (no EIO) opens the same file moments later and
     successfully sets journal_mode=WAL.
  3. A writes rollback-journal frames while B writes WAL frames. SQLite
     documents this as unsupported and corrupts the file:
     https://www.sqlite.org/wal.html ("all connections to the same
     database must use the same locking protocol").

This was the root cause of repeated kanban.db corruption on hosts with
multiple gateway processes plus CLI invocations against the same DB
(observed pattern: corruption shortly after gateway startup, after the
process logged "WAL journal_mode unsupported on this filesystem (disk
I/O error) — falling back to journal_mode=DELETE"). The fallback
warning told the truth — fallback DID happen — but the premise
("unsupported on this filesystem") was wrong; the EIO was a one-shot
event and sibling processes successfully used WAL.

Fix has two layers:

1. Remove "disk i/o error" from _WAL_INCOMPAT_MARKERS. EIO now re-raises
   so callers can retry instead of silently corrupting the DB. The two
   remaining markers ("locking protocol", "not authorized") are
   deterministic per filesystem so they remain safe permanent-downgrade
   signals.

2. Belt-and-suspenders: before downgrading on ANY marker match, peek the
   on-disk journal mode. If the header says WAL, refuse to downgrade and
   re-raise the original error. This guards against any future addition
   to _WAL_INCOMPAT_MARKERS turning out to be transient in some
   environment we haven't yet seen.

Tests:

- tests/test_hermes_state_wal_fallback.py:
  * Flipped test_falls_back_on_disk_io_error → test_reraises_on_disk_io_error
    asserting EIO is re-raised, not silently swallowed.
  * Added test_does_not_downgrade_when_disk_says_wal covering the
    on-disk-header safety guard for the existing legitimate markers.

- tests/hermes_cli/test_kanban_db.py:
  * test_connect_falls_back_to_delete_on_locking_protocol now uses a
    truly-fresh DB (instead of the kanban_home fixture which pre-inits
    in WAL). On NFS the very first process touching the file legitimately
    downgrades; on a file already in WAL the new guard correctly refuses.

A standalone reproducer lives at /tmp/kanban-stress/repro_bugD_eio_wal_downgrade.py
(not committed): without fix the DB silently flips from WAL to DELETE
mid-process; with fix the EIO surfaces and the file stays WAL.

Refs: Bug D in the kanban-corruption investigation series (Bugs A and C
shipped in ebe7374f3 and e02147d5e respectively). Bug D explains every
corruption incident this week including those that survived A's
single-dispatcher mitigation, because every CLI invocation is a
separate process whose WAL pragma can transiently fail.
2026-05-27 14:31:55 -07:00
kshitij
2a7047c2ed
fix(sqlite): fall back to journal_mode=DELETE on NFS/SMB/FUSE (#22043)
SQLite's WAL mode requires shared-memory (mmap) coordination and fcntl
byte-range locks that don't reliably work on network filesystems. Upstream
documents this explicitly:
  https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode

On NFS / SMB / some FUSE mounts / WSL1, 'PRAGMA journal_mode=WAL' raises
'sqlite3.OperationalError: locking protocol' (SQLITE_PROTOCOL). Before
this change, every feature backed by state.db or kanban.db broke silently:
  - /resume, /title, /history, /branch returned 'Session database not
    available.' with no cause
  - gateway logged the init failure at DEBUG (invisible in errors.log)
  - kanban dispatcher crashed every 60s, driving the known migration race
    (duplicate column name: consecutive_failures, #21708 / #21374)

Changes:
  - hermes_state.apply_wal_with_fallback(): shared helper that tries WAL
    and falls back to DELETE on SQLITE_PROTOCOL-style errors with one
    WARNING explaining why
  - hermes_state.get_last_init_error() + format_session_db_unavailable():
    capture the init failure cause and surface it in user-facing strings
    (with an NFS/SMB pointer for 'locking protocol')
  - hermes_cli/kanban_db.connect(): use the shared helper
  - gateway/run.py: bump SessionDB init failure log DEBUG -> WARNING
    (matches cli.py's existing correct behavior)
  - cli.py (4 sites) + gateway/run.py (5 sites): replace bare
    'Session database not available.' with format_session_db_unavailable()

Tests: 12 new tests in tests/test_hermes_state_wal_fallback.py + 1 new
test in tests/hermes_cli/test_kanban_db.py. Existing suites (state,
kanban, gateway, cli) remain green for all tests unrelated to pre-existing
failures on main.

Evidence: real-world user on NFSv3 mount (172.26.224.200:d2dfac12/home,
local_lock=none) reporting 'Session database not available.' on /resume;
'locking protocol' appears in 4 distinct log entries across backup,
kanban, TUI, and CLI paths in the same session.

closes #22032
2026-05-09 02:09:35 -07:00