fix(mcp): reset circuit breaker on successful OAuth reconnect

Previously the breaker was only cleared when the post-reconnect retry
call itself succeeded (via _reset_server_error at the end of the try
block). If OAuth recovery succeeded but the retry call happened to
fail for a different reason, control fell through to the
needs_reauth path which called _bump_server_error — adding to an
already-tripped count instead of the fresh count the reconnect
justified. With fix #1 in place this would still self-heal on the
next cooldown, but we should not pay a 60s stall when we already
have positive evidence the server is viable.

Move _reset_server_error(server_name) up to immediately after the
reconnect-and-ready-wait block, before the retry_call. The
subsequent retry still goes through _bump_server_error on failure,
so a genuinely broken server re-trips the breaker as normal. The
retry now starts from a clean count, so a single failure leaves it at
1 rather than adding to a stale, already-tripped total.
Ben 2026-04-21 19:20:15 +10:00 committed by Teknium
parent 8cc3cebca2
commit 484d151e99

@@ -1429,6 +1429,16 @@ def _handle_auth_error_and_retry(
break
time.sleep(0.25)
# A successful OAuth recovery is independent evidence that the
# server is viable again, so close the circuit breaker here —
# not only on retry success. Without this, a reconnect
# followed by a failing retry would leave the breaker pinned
# above threshold forever (the retry-exception branch below
# bumps the count again). The post-reset retry still goes
# through _bump_server_error on failure, so a genuinely broken
# server will re-trip the breaker as normal.
_reset_server_error(server_name)
try:
result = retry_call()
try: