Tool-role messages (44% of corpus) produce junk terms from structured
JSON output: field names like 'output', 'exit_code', 'error', 'status'
and numeric values like '0', '1', '42'. These have zero search value and
account for ~77% of index rows.

Changes:
- stop_words.py: add JSON key stop words (output, exit_code, error, null,
  true, false, status, content, message, cleared, success) and
  is_noise_term(), which also filters pure numeric tokens
- term_index.py: switch from is_stop_word to is_noise_term
- hermes_state.py: skip tool-role messages on the insert and reindex paths
- tests: 6 new TestNoiseReduction tests

Impact on live DB: 1.19M -> 278K rows (77% reduction), 1.5s reindex.
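
The noise filter described above can be sketched roughly as follows. This
is an illustration, not the actual stop_words.py: the English stop list is
a tiny stand-in, and folding the JSON keys into is_stop_word() is an
assumption about how the two checks are layered.

```python
# Hypothetical sketch of the filter; the real stop_words.py may differ.

# JSON keys and literals named in this change.
JSON_KEY_STOP_WORDS = frozenset({
    "output", "exit_code", "error", "null", "true", "false",
    "status", "content", "message", "cleared", "success",
})

# Assumption: tiny stand-in for the full English stop list.
ENGLISH_STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "is"})


def is_stop_word(term: str) -> bool:
    """True for English stop words and JSON-key stop words."""
    lowered = term.lower()
    return lowered in ENGLISH_STOP_WORDS or lowered in JSON_KEY_STOP_WORDS


def is_noise_term(term: str) -> bool:
    """True for stop words and for tokens made only of ASCII digits."""
    return is_stop_word(term) or (term.isascii() and term.isdigit())


tokens = ["exit_code", "0", "deploy", "status", "42", "rollback", "v2"]
kept = [t for t in tokens if not is_noise_term(t)]
print(kept)  # ['deploy', 'rollback', 'v2']
```

Mixed tokens such as 'v2' survive because only purely numeric tokens are
treated as noise.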

Adds a term-based inverted index (term_index table, schema v7) that
removes LLM summarization from the default search path. The fast
path returns session metadata and match counts in 1-4ms, vs 10-16s for
the full FTS5+LLM pipeline.
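
A minimal sketch of what a term-index lookup of this kind might look like,
using an in-memory SQLite database. The fast_search() helper, the column
types, and the query shape are assumptions for illustration; the real
_fast_search also returns session metadata, which is omitted here.

```python
import sqlite3


def fast_search(conn, terms):
    """Return (session_id, matching_message_count) rows, best match first."""
    if not terms:
        return []
    # One "?" per term; values are bound, never interpolated into the SQL.
    placeholders = ", ".join("?" for _ in terms)
    sql = (
        "SELECT session_id, COUNT(DISTINCT message_id) AS matches "
        "FROM term_index "
        f"WHERE term IN ({placeholders}) "
        "GROUP BY session_id "
        "ORDER BY matches DESC, session_id"
    )
    return conn.execute(sql, list(terms)).fetchall()


conn = sqlite3.connect(":memory:")
# Assumed column types; only the column names come from the change notes.
conn.execute(
    "CREATE TABLE term_index ("
    " term TEXT NOT NULL,"
    " message_id INTEGER NOT NULL,"
    " session_id TEXT NOT NULL,"
    " PRIMARY KEY (term, message_id, session_id)"
    ") WITHOUT ROWID"
)
conn.executemany(
    "INSERT INTO term_index VALUES (?, ?, ?)",
    [
        ("deploy", 1, "s1"),
        ("deploy", 2, "s1"),
        ("rollback", 2, "s1"),
        ("deploy", 3, "s2"),
    ],
)
results = fast_search(conn, ["deploy", "rollback"])
print(results)  # [('s1', 2), ('s2', 1)]
```

Because term is the leading primary-key column, the IN lookup is a range
scan on the table's own B-tree, with no FTS5 query or LLM call involved.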

Key changes:
- term_index table: (term, message_id, session_id) WITHOUT ROWID
  for clustered B-tree lookups. Populated at write time in
  append_message (best-effort, never blocks inserts).
- stop_words.py: 179-word NLTK English stop list, no stemming
- term_index.py: extract_terms() for term extraction
- session_search_tool.py: fast=True by default; _fast_search handles
  the term index path, _full_search preserves the original behavior,
  and CJK queries fall back to the slow path
- Auto-reindex on v7 migration: _init_schema returns a needs_reindex
  flag, and __init__ calls reindex_term_index() after migration
- Swap strategy for reindex: builds into a temp table, then swaps
  atomically in a single transaction (no empty-index window)
- get_child_session_ids(): public API replacing db._lock/db._conn
  access in _fast_search
- mode field in search results: 'fast' or 'full'
- Cascade deletes: clear_messages, delete_session, and prune_sessions
  all clean up term_index entries
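
The table layout and swap-style reindex listed above could look roughly
like this. It is a sketch under stated assumptions, not the code in
hermes_state.py: the staging table name term_index_new, the column types,
and the transaction() helper are all hypothetical.

```python
import sqlite3
from contextlib import contextmanager

# Assumed column types; the (term, message_id, session_id) key and
# WITHOUT ROWID come from the change notes.
SCHEMA = """
CREATE TABLE {name} (
    term TEXT NOT NULL,
    message_id INTEGER NOT NULL,
    session_id TEXT NOT NULL,
    PRIMARY KEY (term, message_id, session_id)
) WITHOUT ROWID
"""


@contextmanager
def transaction(conn):
    """Explicit transaction; expects a connection with isolation_level=None."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        yield
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    else:
        conn.execute("COMMIT")


def reindex(conn, rows):
    """Rebuild the index into a staging table, then swap it in atomically."""
    # Phase 1: build the replacement; readers still see the old index.
    with transaction(conn):
        conn.execute("DROP TABLE IF EXISTS term_index_new")
        conn.execute(SCHEMA.format(name="term_index_new"))
        conn.executemany(
            "INSERT OR IGNORE INTO term_index_new VALUES (?, ?, ?)", rows
        )
    # Phase 2: drop and rename in one transaction, so no reader ever
    # observes a missing or empty term_index.
    with transaction(conn):
        conn.execute("DROP TABLE IF EXISTS term_index")
        conn.execute("ALTER TABLE term_index_new RENAME TO term_index")


conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(SCHEMA.format(name="term_index"))
conn.execute("INSERT INTO term_index VALUES ('stale', 99, 's0')")
reindex(conn, [("deploy", 1, "s1"), ("rollback", 2, "s1")])
rows = conn.execute(
    "SELECT term, message_id, session_id FROM term_index ORDER BY term"
).fetchall()
print(rows)  # [('deploy', 1, 's1'), ('rollback', 2, 's1')]
```

SQLite DDL is transactional, so if the rename fails the DROP is rolled
back with it and the old index stays in place.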

Benchmarks on production DB (47.7 MB, 29,435 messages):
- Term index reindex: 1,152,587 entries from 29,435 messages in 4s
- Fast path: 1-4ms (no LLM)
- Slow path: 10,000-16,000ms (FTS5 + LLM summarization)
- Speedup: 4,000-15,000x on full round-trip

195 tests passing (48 term_index + 149 hermes_state).
12 regression tests from red-team QA cover parameter binding,
child session resolution, cascade deletes, and CJK fallback.