Three closely-related fixes for shutdown / lifecycle hygiene.
1. _release_running_agent_state(session_key) helper
----------------------------------------------------
Per-running-agent state lived in three dicts that drifted out of sync
across cleanup sites:
self._running_agents — AIAgent per session_key
self._running_agents_ts — start timestamp per session_key
self._busy_ack_ts — last busy-ack timestamp per session_key
Inventory before this PR:
8 sites: del self._running_agents[key]
— only 1 (stale-eviction) cleaned all three
— 1 cleaned _running_agents + _running_agents_ts only
— 6 cleaned _running_agents only
Each missed entry was a (str, float) tuple per session per gateway
lifetime — small, persistent, accumulates across thousands of
sessions over months. Per-platform leaks compounded.
This change adds a single helper that pops all three dicts in
lockstep, and replaces every bare 'del self._running_agents[key]'
site with it. Per-session state that PERSISTS across turns
(_session_model_overrides, _voice_mode, _pending_approvals,
_update_prompt_pending) is intentionally NOT touched here — those
have their own lifecycles tied to user actions, not turn boundaries.
2. _running_agents_ts cleared in _stop_impl
----------------------------------------
Was being missed alongside _running_agents.clear(); now included.
3. SessionDB close() in _stop_impl
---------------------------------
The SQLite WAL write lock stayed held by the old gateway connection
until Python actually exited — causing 'database is locked' errors
when --replace launched a new gateway against the same file. We
now explicitly close both self._db and self.session_store._db
inside _stop_impl, with try/except so a flaky close on one doesn't
block the other.
Tests
-----
tests/gateway/test_session_state_cleanup.py — 10 cases covering:
* helper pops all three dicts atomically
* idempotent on missing/empty keys
* preserves other sessions
* tolerates older runners without _busy_ack_ts attribute
* thread-safe under concurrent release
* regression guard: scans gateway/run.py and fails if a future
contributor reintroduces 'del self._running_agents[...]'
outside docstrings
* SessionDB close called on both holders during shutdown
* shutdown tolerates missing session_store
* shutdown tolerates close() raising on one db (other still closes)
Broader gateway suite: 3108 passed (vs 3100 on baseline) — failure
delta is +8 net passes; the 10 remaining failures are pre-existing
cross-test pollution / missing optional deps (matrix needs olm,
signal/telegram approval flake, dingtalk Mock wiring), all reproduce
on stashed baseline.