Commit graph

2 commits

Author SHA1 Message Date
Teknium
5651a73331 fix(gateway): guard-match the finally-block _active_sessions delete
Before this, _process_message_background's finally did an unconditional
'del self._active_sessions[session_key]' — even if a /stop/ /new
command had already swapped in its own command_guard via
_dispatch_active_session_command and cancelled us.  The old task's
unwind would clobber the newer guard, opening a race for follow-ups.

Replace with _release_session_guard(session_key, guard=interrupt_event)
so the delete only fires when the guard we captured is still the one
installed.  The sibling _session_tasks pop already had equivalent
ownership matching via asyncio.current_task() identity; this closes the
asymmetry.

Adds two direct regressions in test_session_split_brain_11016:
- stale guard reference must not clobber a newer guard by identity
- guard=None default still releases unconditionally (for callers that
  don't have a captured guard to match against)

Refs #11016
2026-04-23 05:15:52 -07:00
Teknium
ec02d905c9 test(gateway): regressions for issue #11016 split-brain session locks
Covers all three layers of the salvaged fix:

1. Adapter-side cancellation: /stop, /new, /reset cancel the in-flight
   adapter task, release the guard, and let follow-up messages through;
   /new keeps the guard installed until the runner response lands, then
   drains the queued follow-up in order.

2. Adapter-side self-heal: a split-brain guard (done owner task, lock
   still live) is healed on the next inbound message and the user gets
   a reply instead of being trapped in infinite busy acks.  A guard
   with no recorded owner task is NOT auto-healed (protects fixtures
   that install guards directly).

3. Runner-side generation guard: stale async runs whose generation was
   bumped by /stop or /new cannot clear a newer run's _running_agents
   slot on the way out.

11 tests, all green.

Refs #11016
2026-04-23 05:15:52 -07:00