Commit graph

2 commits

Author SHA1 Message Date
AJ
94f1758742 feat: reduce term index noise — skip tool messages, filter numerics and JSON keys
Tool-role messages (44% of corpus) produce junk terms from structured
JSON output: field names like 'output', 'exit_code', 'error', 'status'
and numeric values like '0', '1', '42'. These have zero search value and
account for ~77% of index rows.

Changes:
- stop_words.py: add JSON key stop words (output, exit_code, error, null,
  true, false, status, content, message, cleared, success) and
  is_noise_term() that also filters pure numeric tokens
- term_index.py: switch from is_stop_word to is_noise_term
- hermes_state.py: skip tool-role messages at insert and reindex paths
- tests: 6 new TestNoiseReduction tests

Impact on live DB: 1.19M -> 278K rows (77% reduction), 1.5s reindex.
2026-04-24 19:32:36 -04:00
AJ
410456c599 feat: add term_index inverted index for instant session search
Adds a term-based inverted index (term_index table, schema v7) that
eliminates LLM summarization from the default search path. The fast
path returns session metadata and match counts in ~1ms vs 10-15s for
the full FTS5+LLM pipeline.

Key changes:
- term_index table: (term, message_id, session_id) WITHOUT ROWID
  for clustered B-tree lookups. Populated at write time in
  append_message (best-effort, never blocks inserts).
- stop_words.py: 179-word NLTK English stop list, no stemming
- term_index.py: extract_terms() for term extraction
- session_search_tool.py: fast=True default, _fast_search for term
  index path, _full_search preserves original behavior, CJK query
  fallback to slow path
- Auto-reindex on v7 migration: _init_schema returns needs_reindex
  flag, __init__ calls reindex_term_index() after migration
- Swap strategy for reindex: builds into temp table, then atomic
  swap in single transaction (no empty-index window)
- get_child_session_ids(): public API replacing db._lock/db._conn
  access in _fast_search
- mode field in search results: 'fast' or 'full'
- Cascade deletes: clear_messages, delete_session, prune_sessions
  all clean term_index entries

Benchmarks on production DB (47.7 MB, 29,435 messages):
  - Term index reindex: 1,152,587 entries from 29,435 messages in 4s
  - Fast path: 1-4ms (no LLM)
  - Slow path: 10,000-16,000ms (FTS5 + LLM summarization)
  - Speedup: 4,000-15,000x on full round-trip

195 tests passing (48 term_index + 149 hermes_state).
12 regression tests from red-team QA covering: param binding,
child session resolution, cascade deletes, CJK fallback.
2026-04-24 19:32:36 -04:00