fix(desktop): recover stranded session windows when resume fails (#47655)

* fix(desktop): recover stranded session windows when resume fails

Opening a session in a new window (or any routed resume) could latch the
thread loader on "session" forever — the reported "stays stuck loading,
even after a nap" bug. Two compounding causes:

1. use-session-actions.resumeSession's catch ran the REST transcript
   fallback OUTSIDE its own try. When session.resume rejected AND the
   fallback also threw (the common case on a wedged/unreachable backend),
   the throw skipped setMessages and left activeSessionId null with an
   empty transcript — exactly the state the loader gates on
   (messagesEmpty && !activeSessionId), with no terminal/error state.

2. use-route-resume's self-heal could never re-fire: resumeSession sets
   selectedStoredSessionIdRef synchronously at entry (before failing), so
   stuckOnRoutedSession stays false, and on an already-open idle window
   neither pathnameChanged nor gatewayBecameOpen fire again. The window
   never retried — naps, focus, nothing recovered it.

Fix:
- Wrap the REST fallback in its own try so a fallback failure can't strand
  the loader.
- Add $resumeFailedSessionId: armed on terminal resume failure, cleared at
  the next resume's entry (and left clear on success).
- use-route-resume gains a bounded backoff auto-retry (4 attempts, 1s→8s)
  that re-resumes while the routed session matches the failure flag, with a
  fire-time liveness recheck so a recovered session isn't double-resumed.
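
The first two fix items can be sketched as follows. This is a minimal illustration, not the hook's actual code: `resumeWithFallback`, `ResumeDeps`, `resumeRpc`, and `fetchTranscript` are hypothetical names standing in for the session.resume RPC and the REST transcript fallback, and the empty-transcript check is an assumption about when a window counts as stranded.

```typescript
// Hypothetical shapes; the real hook talks to the gateway and a REST client.
type Message = { role: string; content: string }

interface ResumeDeps {
  resumeRpc: (sessionId: string) => Promise<Message[]> // stands in for session.resume
  fetchTranscript: (sessionId: string) => Promise<Message[]> // stands in for the REST fallback
  setMessages: (messages: Message[]) => void
  setResumeFailedSessionId: (sessionId: string | null) => void
}

async function resumeWithFallback(
  sessionId: string,
  deps: ResumeDeps
): Promise<'resumed' | 'fallback' | 'failed'> {
  // Clear the latch at entry: this is a fresh attempt.
  deps.setResumeFailedSessionId(null)

  try {
    deps.setMessages(await deps.resumeRpc(sessionId))
    return 'resumed'
  } catch {
    // The fallback gets its own try, so a second failure cannot escape the
    // outer catch and leave the window without a terminal state.
    try {
      const messages = await deps.fetchTranscript(sessionId)
      deps.setMessages(messages)
      if (messages.length > 0) {
        return 'fallback'
      }
    } catch {
      // Swallowed on purpose; the latch below records the failure.
    }
    // Nothing was painted: arm the latch so a retry layer can pick it up.
    deps.setResumeFailedSessionId(sessionId)
    return 'failed'
  }
}
```

The function always resolves, so callers never see a rejection from the fallback path; the latch is the only failure signal.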

Regression tests cover: fallback-wrap arming the flag without throwing,
flag cleared on success, retry fires on backoff, no retry for a
non-routed/recovered session, and the retry cap.

* feat(desktop): show error + manual Retry when resume retries exhaust

When a stranded session window's bounded auto-retry gave up (gateway
resume RPC + REST fallback failed through all MAX_RESUME_RETRIES attempts),
the loader stayed latched forever. Add a $resumeExhaustedSessionId atom armed at
the give-up point so the chat view swaps the perpetual spinner for an
explicit error state + manual Retry button. Retry / reconnect / reselect
clears the latch and resets the auto-retry counter for a fresh cycle; a
route-change away from the stranded session also clears it.

Distinct from $resumeFailedSessionId (armed during the backoff window) so
the error UI only appears once auto-recovery has actually given up, not
mid-retry. Adds i18n strings across en/ja/zh/zh-hant and 3 tests covering
latch-arms-on-exhaustion, stays-clear-while-retries-remain, and
clears-on-route-change.
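
One way to read the gating described above is as a pure function of the store values. The sketch below is illustrative only: `chatSurface`, `ChatSurface`, and `SurfaceInputs` are hypothetical names, and the real chat view computes these booleans inline instead of through a helper.

```typescript
type ChatSurface = 'error' | 'loading' | 'ready'

interface SurfaceInputs {
  routedSessionId: string | null
  resumeExhaustedSessionId: string | null
  routeSessionMismatch: boolean
  messagesEmpty: boolean
  activeSessionId: string | null
}

function chatSurface(input: SurfaceInputs): ChatSurface {
  const isRoutedSessionView = Boolean(input.routedSessionId)

  // The error state is gated on the route, so a stale latch left over from
  // another session cannot blank the one currently on screen.
  const resumeExhausted =
    isRoutedSessionView && input.resumeExhaustedSessionId === input.routedSessionId

  if (resumeExhausted) {
    return 'error'
  }

  const loading =
    isRoutedSessionView &&
    (input.routeSessionMismatch || (input.messagesEmpty && !input.activeSessionId))

  return loading ? 'loading' : 'ready'
}
```

Because the exhausted latch is checked first, a window that is both empty and exhausted shows the error with its Retry button and never the spinner.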

* fix(desktop): address review on stranded-resume recovery layer

Follow-up to review on #47655 (PR head 253bfc0e3). Four issues on the
recovery layer:

1. (blocking) Arm $resumeFailedSessionId only when the transcript is still
   empty after the REST fallback ($messages.get().length === 0), matching the
   atom's documented contract and the loader's messagesEmpty gate. Previously
   armed on any resume-RPC reject regardless of fallback outcome, so a window
   that recovered its history via REST still auto-retried and, on exhaustion,
   blanked the visible transcript behind the error overlay.

2. Reset the bounded-retry attempt counter on the $resumeExhaustedSessionId
   armed->cleared edge so a manual Retry / reconnect / reselect on the SAME
   stranded session gets a fresh backoff cycle, not a single one-shot attempt
   that immediately re-arms the error. (Keyed on the exhausted latch rather
   than the resumeFailedSessionId null->value transition the review suggested:
   the auto-retry loop itself toggles resumeFailedSessionId every cycle, so
   keying the reset there would defeat the MAX_RESUME_RETRIES cap. Only
   resumeSession clears the exhausted latch, making its clear edge the
   unambiguous manual-retry signal.)

3. Advance retryAttemptRef only when the timer actually dispatches a resume,
   not at schedule time. Prevents unrelated dep changes during the 1s-8s
   backoff window (transient gatewayState flip, non-stable resumeSession) from
   burning attempts and hitting MAX with fewer than 4 real resume attempts.

4. Drop unrelated blank-line-only insertions in store/session.ts and
   use-session-actions.ts to keep the diff tight.
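
Points 2 and 3 both concern when the attempt counter moves. The sketch below models that bookkeeping outside React; `RetryBudget` and its method names are hypothetical, and the hook itself keeps the same state in refs.

```typescript
const MAX_RESUME_RETRIES = 4 // the cap named in the commit message

class RetryBudget {
  private attempts = 0
  private prevExhausted: string | null = null

  // Call on every effect run with the current latch value and route.
  observeExhaustedLatch(exhaustedSessionId: string | null, routedSessionId: string | null): void {
    const wasExhausted = this.prevExhausted
    this.prevExhausted = exhaustedSessionId
    // Armed -> cleared edge for the routed session: the user asked for
    // another go, so the next cycle starts from zero (point 2).
    if (wasExhausted !== null && wasExhausted === routedSessionId && exhaustedSessionId !== wasExhausted) {
      this.attempts = 0
    }
  }

  // Scheduling a timer is free; only a dispatched resume costs an attempt (point 3).
  consumeOnDispatch(): void {
    this.attempts += 1
  }

  isExhausted(): boolean {
    return this.attempts >= MAX_RESUME_RETRIES
  }
}
```

Keying the reset on the exhausted latch, not on the failure flag, matters because the auto-retry loop toggles the failure flag every cycle and would otherwise reset its own cap.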

Tests: +3 (RPC-fails-REST-succeeds-no-arm; manual-retry-fresh-cycle;
no-attempts-burned-on-dep-churn). All 19 resume tests + full session-hook
suite (65) pass; tsc --noEmit clean.

---------

Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
Austin Pickett 2026-06-17 17:33:53 -04:00 committed by GitHub
parent fd674af47f
commit 016bce1a09
12 changed files with 624 additions and 9 deletions


@ -15,7 +15,9 @@ import { Backdrop } from '@/components/Backdrop'
import { PromptOverlays } from '@/components/prompt-overlays'
import { Button } from '@/components/ui/button'
import { Codicon } from '@/components/ui/codicon'
import { ErrorState } from '@/components/ui/error-state'
import { getGlobalModelOptions, type HermesGateway } from '@/hermes'
import { useI18n } from '@/i18n'
import type { ChatMessage } from '@/lib/chat-messages'
import { quickModelOptions, sessionTitle, toRuntimeMessage } from '@/lib/chat-runtime'
import { useIncrementalExternalStoreRuntime } from '@/lib/incremental-external-store-runtime'
@ -38,6 +40,7 @@ import {
$lastVisibleMessageIsUser,
$messages,
$messagesEmpty,
$resumeExhaustedSessionId,
$selectedStoredSessionId,
$sessions,
sessionPinId
@ -86,6 +89,7 @@ interface ChatViewProps extends Omit<React.ComponentProps<'div'>, 'onSubmit'> {
onEdit: (message: AppendMessage) => Promise<void>
onReload: (parentId: string | null) => Promise<void>
onRestoreToMessage?: (messageId: string) => Promise<void>
onRetryResume: (sessionId: string) => void
onTranscribeAudio?: (audio: Blob) => Promise<string>
onDismissError?: (messageId: string) => void
}
@ -273,10 +277,12 @@ export function ChatView({
onEdit,
onReload,
onRestoreToMessage,
onRetryResume,
onTranscribeAudio,
onDismissError
}: ChatViewProps) {
const location = useLocation()
const { t } = useI18n()
const activeSessionId = useStore($activeSessionId)
const awaitingResponse = useStore($awaitingResponse)
const busy = useStore($busy)
@ -298,6 +304,7 @@ export function ChatView({
const messagesEmpty = useStore($messagesEmpty)
const lastVisibleIsUser = useStore($lastVisibleMessageIsUser)
const selectedSessionId = useStore($selectedStoredSessionId)
const resumeExhaustedSessionId = useStore($resumeExhaustedSessionId)
const routedSessionId = routeSessionId(location.pathname)
const isRoutedSessionView = Boolean(routedSessionId)
@ -317,9 +324,21 @@ export function ChatView({
// session exists — even if it has zero messages (a brand-new routed
// session). The flicker where `busy` flips true briefly during hydrate
// is handled by `threadLoadingState`'s last-visible-user gate.
const loadingSession = isRoutedSessionView && (routeSessionMismatch || (messagesEmpty && !activeSessionId))
//
// resumeExhausted: the bounded auto-retry in use-route-resume gave up on this
// routed session (gateway RPC + REST fallback failed through every attempt).
// Suppress the loader and show an explicit error + manual Retry instead of
// spinning forever. Gated on the route matching so a stale latch from another
// session can't blank the current one.
const resumeExhausted = isRoutedSessionView && resumeExhaustedSessionId === routedSessionId
const loadingSession =
!resumeExhausted && isRoutedSessionView && (routeSessionMismatch || (messagesEmpty && !activeSessionId))
const threadLoading = threadLoadingState(loadingSession, busy, awaitingResponse, lastVisibleIsUser)
const showChatBar = !loadingSession
// Hide the composer in the exhausted error state too: there's no live runtime
// to send to until a retry rebinds one.
const showChatBar = !loadingSession && !resumeExhausted
const threadKey = selectedSessionId || activeSessionId || (isRoutedSessionView ? location.pathname : 'new')
const modelOptionsQuery = useQuery<ModelOptionsResponse>({
@ -468,6 +487,21 @@ export function ChatView({
</Suspense>
)}
</ChatRuntimeBoundary>
{resumeExhausted && routedSessionId && (
<div className="absolute inset-0 z-10 grid place-items-center bg-(--ui-chat-surface-background) px-8 py-10">
<ErrorState
className="max-w-sm"
description={t.desktop.resumeStrandedBody}
title={t.desktop.resumeStrandedTitle}
>
<div className="grid justify-items-center">
<Button onClick={() => onRetryResume(routedSessionId)} size="sm" variant="outline">
{t.desktop.resumeRetry}
</Button>
</div>
</ErrorState>
</div>
)}
{showChatBar && <ScrollToBottomButton />}
<ChatDropOverlay kind={dragKind} />
<ChatSwapOverlay profile={gatewaySwapTarget} />


@ -54,6 +54,8 @@ import {
$gatewayState,
$messages,
$messagingSessions,
$resumeFailedSessionId,
$resumeExhaustedSessionId,
$selectedStoredSessionId,
$sessions,
$workingSessionIds,
@ -200,6 +202,8 @@ export function DesktopController() {
const activeSessionId = useStore($activeSessionId)
const currentCwd = useStore($currentCwd)
const freshDraftReady = useStore($freshDraftReady)
const resumeFailedSessionId = useStore($resumeFailedSessionId)
const resumeExhaustedSessionId = useStore($resumeExhaustedSessionId)
const filePreviewTarget = useStore($filePreviewTarget)
const previewTarget = useStore($previewTarget)
const selectedStoredSessionId = useStore($selectedStoredSessionId)
@ -889,6 +893,8 @@ export function DesktopController() {
gatewayState,
locationPathname: location.pathname,
resumeSession,
resumeFailedSessionId,
resumeExhaustedSessionId,
routedSessionId,
runtimeIdByStoredSessionIdRef,
selectedStoredSessionId,
@ -1047,6 +1053,7 @@ export function DesktopController() {
onReload={reloadFromMessage}
onRemoveAttachment={id => void composer.removeAttachment(id)}
onRestoreToMessage={restoreToMessage}
onRetryResume={sessionId => void resumeSession(sessionId, true)}
onSteer={steerPrompt}
onSubmit={submitText}
onThreadMessagesChange={handleThreadMessagesChange}


@ -2,6 +2,8 @@ import { cleanup, render } from '@testing-library/react'
import type { MutableRefObject } from 'react'
import { afterEach, describe, expect, it, vi } from 'vitest'
import { $resumeExhaustedSessionId, setResumeExhaustedSessionId } from '@/store/session'
import { useRouteResume } from './use-route-resume'
interface HarnessProps {
@ -13,6 +15,8 @@ interface HarnessProps {
gatewayState: string
locationPathname: string
resumeSession: (sessionId: string, focus: boolean) => Promise<unknown>
resumeFailedSessionId?: null | string
resumeExhaustedSessionId?: null | string
routedSessionId: null | string
runtimeIdByStoredSessionIdRef: MutableRefObject<Map<string, string>>
selectedStoredSessionId: null | string
@ -20,8 +24,12 @@ interface HarnessProps {
startFreshSessionDraft: (focus: boolean) => unknown
}
function RouteResumeHarness(props: HarnessProps) {
useRouteResume(props)
function RouteResumeHarness({
resumeFailedSessionId = null,
resumeExhaustedSessionId = null,
...props
}: HarnessProps) {
useRouteResume({ ...props, resumeExhaustedSessionId, resumeFailedSessionId })
return null
}
@ -256,3 +264,212 @@ describe('useRouteResume', () => {
expect(resumeSession).toHaveBeenCalledWith('session-1', true)
})
})
describe('useRouteResume bounded auto-retry after a failed resume', () => {
afterEach(() => {
cleanup()
vi.useRealTimers()
vi.restoreAllMocks()
setResumeExhaustedSessionId(null)
})
// Common stranded-window props: gateway open, route on the session, no runtime
// yet, and the ref already synced to the route (resumeSession sets it at entry
// before failing) — the exact state that defeats the main effect's self-heal.
function strandedProps(resumeSession: (sid: string, focus: boolean) => Promise<unknown>) {
return {
activeSessionId: null,
activeSessionIdRef: { current: null } as MutableRefObject<null | string>,
creatingSessionRef: { current: false },
currentView: 'chat',
freshDraftReady: false,
gatewayState: 'open',
locationPathname: '/session-1',
resumeSession,
routedSessionId: 'session-1',
runtimeIdByStoredSessionIdRef: { current: new Map<string, string>() },
selectedStoredSessionId: 'session-1',
// Synced to the route by the failed resume's synchronous entry-write.
selectedStoredSessionIdRef: { current: 'session-1' } as MutableRefObject<null | string>,
startFreshSessionDraft: vi.fn()
}
}
it('retries the resume on backoff when the routed session is flagged as failed', () => {
vi.useFakeTimers()
const resumeSession = vi.fn(async () => undefined)
render(<RouteResumeHarness {...strandedProps(resumeSession)} resumeFailedSessionId="session-1" />)
// The main effect fires one resume on mount (pathname-changed). Clear it so
// we assert purely the bounded-retry effect's scheduled retry below.
resumeSession.mockClear()
// No immediate fire — the retry is scheduled behind the backoff timer.
expect(resumeSession).not.toHaveBeenCalled()
// First backoff window (1s) elapses → one retry.
vi.advanceTimersByTime(1_000)
expect(resumeSession).toHaveBeenCalledTimes(1)
expect(resumeSession).toHaveBeenCalledWith('session-1', true)
})
it('does NOT retry a failed session that is not the routed one', () => {
vi.useFakeTimers()
const resumeSession = vi.fn(async () => undefined)
// The failure flag points at a different session than the route.
render(<RouteResumeHarness {...strandedProps(resumeSession)} resumeFailedSessionId="other-session" />)
resumeSession.mockClear() // drop the mount resume
vi.advanceTimersByTime(10_000)
expect(resumeSession).not.toHaveBeenCalled()
})
it('skips the scheduled retry if the session already recovered when the timer fires', () => {
vi.useFakeTimers()
const resumeSession = vi.fn(async () => undefined)
const props = strandedProps(resumeSession)
render(<RouteResumeHarness {...props} resumeFailedSessionId="session-1" />)
resumeSession.mockClear() // drop the mount resume
// A resume landed while we waited: runtime is now bound.
props.activeSessionIdRef.current = 'runtime-1'
vi.advanceTimersByTime(8_000)
expect(resumeSession).not.toHaveBeenCalled()
})
it('stops retrying after MAX_RESUME_RETRIES consecutive failures', () => {
vi.useFakeTimers()
const resumeSession = vi.fn(async () => undefined)
const props = strandedProps(resumeSession)
// Model the real re-arm loop: resumeSession clears $resumeFailedSessionId at
// entry (null) and a repeat failure re-sets it ('session-1'). That null->id
// toggle is what re-runs the effect and advances the bounded counter. The
// routed session never changes, so the counter is NOT reset between cycles.
const { rerender } = render(<RouteResumeHarness {...props} resumeFailedSessionId="session-1" />)
resumeSession.mockClear() // drop the mount resume; count only the retries
for (let i = 0; i < 8; i += 1) {
vi.advanceTimersByTime(8_000) // fire the scheduled retry (if any)
rerender(<RouteResumeHarness {...props} resumeFailedSessionId={null} />) // cleared at entry
rerender(<RouteResumeHarness {...props} resumeFailedSessionId="session-1" />) // re-armed on failure
}
// Capped at MAX_RESUME_RETRIES (4): a persistently dead backend can't
// hot-loop the resume forever.
expect(resumeSession.mock.calls.length).toBe(4)
// Once auto-retry gives up, the exhausted latch is armed for the routed
// session so the chat view can swap the perpetual loader for an explicit
// error + manual Retry instead of spinning forever.
expect($resumeExhaustedSessionId.get()).toBe('session-1')
})
it('does not arm the exhausted latch while retries remain', () => {
vi.useFakeTimers()
const resumeSession = vi.fn(async () => undefined)
const props = strandedProps(resumeSession)
const { rerender } = render(<RouteResumeHarness {...props} resumeFailedSessionId="session-1" />)
resumeSession.mockClear()
// Two failure cycles — still under the 4-retry cap, so the latch must stay
// clear and the loader keeps spinning (auto-recovery hasn't given up yet).
for (let i = 0; i < 2; i += 1) {
vi.advanceTimersByTime(8_000)
rerender(<RouteResumeHarness {...props} resumeFailedSessionId={null} />)
rerender(<RouteResumeHarness {...props} resumeFailedSessionId="session-1" />)
}
expect($resumeExhaustedSessionId.get()).toBeNull()
})
it('clears a stale exhausted latch when the route moves off the stranded session', () => {
vi.useFakeTimers()
const resumeSession = vi.fn(async () => undefined)
const props = strandedProps(resumeSession)
// Pre-arm the latch as if this session had exhausted its retries.
setResumeExhaustedSessionId('session-1')
// Route is now on a different, healthy session that is not flagged as
// failed — the retry effect's "route moved off" branch clears the latch.
render(
<RouteResumeHarness
{...props}
activeSessionId="runtime-2"
activeSessionIdRef={{ current: 'runtime-2' }}
locationPathname="/session-2"
resumeFailedSessionId={null}
routedSessionId="session-2"
selectedStoredSessionId="session-2"
selectedStoredSessionIdRef={{ current: 'session-2' }}
/>
)
expect($resumeExhaustedSessionId.get()).toBeNull()
})
it('resets the retry counter for a fresh backoff cycle when the exhausted latch clears (manual retry, same session)', () => {
vi.useFakeTimers()
const resumeSession = vi.fn(async () => undefined)
const props = strandedProps(resumeSession)
// Phase A — exhaust the bounded auto-retry (counter → MAX) like a dead
// backend. The resumeExhaustedSessionId prop stays null here: the hook sets
// the store, which doesn't feed back into the prop in this harness.
const { rerender } = render(<RouteResumeHarness {...props} resumeFailedSessionId="session-1" />)
resumeSession.mockClear()
for (let i = 0; i < 8; i += 1) {
vi.advanceTimersByTime(8_000)
rerender(<RouteResumeHarness {...props} resumeFailedSessionId={null} />)
rerender(<RouteResumeHarness {...props} resumeFailedSessionId="session-1" />)
}
expect(resumeSession.mock.calls.length).toBe(4) // capped
expect($resumeExhaustedSessionId.get()).toBe('session-1')
// Phase B — user clicks Retry on the SAME stranded session. resumeSession
// clears both latches at entry; the exhausted latch's armed->cleared edge
// must reset the attempt counter so a fresh bounded cycle runs, not a single
// one-shot attempt that immediately re-arms the error. Model the prop
// transitions: reflect the armed latch, then clear it (retry), then re-arm
// the failure latch on the fresh failure.
resumeSession.mockClear()
rerender(<RouteResumeHarness {...props} resumeExhaustedSessionId="session-1" resumeFailedSessionId="session-1" />)
rerender(<RouteResumeHarness {...props} resumeExhaustedSessionId={null} resumeFailedSessionId={null} />)
rerender(<RouteResumeHarness {...props} resumeExhaustedSessionId={null} resumeFailedSessionId="session-1" />)
// A real retry fires again instead of staying pinned at MAX (which would
// dispatch nothing). Without the reset the counter stays >= MAX and this
// advance dispatches zero resumes.
vi.advanceTimersByTime(8_000)
expect(resumeSession.mock.calls.length).toBeGreaterThan(0)
})
it('does not burn retry attempts on unrelated re-renders during the backoff window', () => {
vi.useFakeTimers()
const props = strandedProps(vi.fn())
// Mount schedules the first backoff timer. Then re-render repeatedly with a
// fresh resumeSession identity (referential instability — a real dep change
// for the retry effect) WITHOUT ever letting the timer fire. The old code
// incremented the attempt counter at schedule time, so >= MAX re-renders
// armed the exhausted error with zero resumes actually dispatched. The fix
// only advances the counter when a timer truly fires, so the latch stays
// clear no matter how many spurious re-renders happen mid-backoff.
const { rerender } = render(
<RouteResumeHarness {...props} resumeFailedSessionId="session-1" resumeSession={vi.fn(async () => undefined)} />
)
for (let j = 0; j < 8; j += 1) {
rerender(
<RouteResumeHarness {...props} resumeFailedSessionId="session-1" resumeSession={vi.fn(async () => undefined)} />
)
}
expect($resumeExhaustedSessionId.get()).toBeNull()
})
})


@ -1,6 +1,7 @@
import { type MutableRefObject, useEffect, useRef } from 'react'
import { isNewChatRoute } from '@/app/routes'
import { setResumeExhaustedSessionId } from '@/store/session'
interface RouteResumeOptions {
activeSessionId: string | null
@ -11,6 +12,17 @@ interface RouteResumeOptions {
gatewayState: string | undefined
locationPathname: string
resumeSession: (sessionId: string, focus: boolean) => Promise<unknown>
// Stored-session id whose most recent resume failed terminally (set by
// useSessionActions, mirrored from $resumeFailedSessionId). While this equals
// routedSessionId the window would otherwise latch on the loader forever, so
// the bounded-retry effect below re-attempts the resume.
resumeFailedSessionId: string | null
// Stored-session id whose bounded auto-retry has EXHAUSTED (mirrored from
// $resumeExhaustedSessionId). Only resumeSession clears this latch (manual
// Retry / reconnect / reselect) — the auto-retry loop never does — so its
// armed->cleared edge is an unambiguous "give me a fresh backoff cycle"
// signal the effect below uses to reset the attempt counter.
resumeExhaustedSessionId: string | null
routedSessionId: string | null
runtimeIdByStoredSessionIdRef: MutableRefObject<Map<string, string>>
selectedStoredSessionId: string | null
@ -18,6 +30,19 @@ interface RouteResumeOptions {
startFreshSessionDraft: (focus: boolean) => unknown
}
// Bounded auto-retry for a stranded session window. A resume can fail terminally
// (gateway RPC reject + REST fallback failure) on a transiently wedged backend —
// dead provider key, a runaway turn hogging the dispatcher, flaky DNS. Without a
// retry the loader latches forever. We retry with backoff, capped, so a
// genuinely dead backend doesn't hot-loop the resume.
const MAX_RESUME_RETRIES = 4
const RESUME_RETRY_BASE_MS = 1_000
const RESUME_RETRY_MAX_MS = 8_000
function resumeRetryDelayMs(attempt: number): number {
return Math.min(RESUME_RETRY_MAX_MS, RESUME_RETRY_BASE_MS * 2 ** attempt)
}
// HashRouter boot edge case: pathname briefly reads `/` before the hash is
// parsed. If the hash references a real session, defer; resume picks it up
// next tick. Without this, ctrl+R on `#/:sessionId` flashes 5 loading states.
@ -49,6 +74,8 @@ export function useRouteResume({
gatewayState,
locationPathname,
resumeSession,
resumeFailedSessionId,
resumeExhaustedSessionId,
routedSessionId,
runtimeIdByStoredSessionIdRef,
selectedStoredSessionId,
@ -58,6 +85,16 @@ export function useRouteResume({
const lastPathnameRef = useRef<string | null>(null)
const seenGatewayStateRef = useRef(false)
const wasGatewayOpenRef = useRef(false)
// Per-session retry bookkeeping for the bounded auto-retry effect below. Keyed
// by the session id we're retrying so switching chats resets the counter.
const retrySessionIdRef = useRef<string | null>(null)
const retryAttemptRef = useRef(0)
// Tracks the previous exhausted-latch value so we can detect its armed->cleared
// edge. resumeSession clears $resumeExhaustedSessionId on a manual Retry /
// reconnect / reselect; that transition is our cue to reset the attempt counter
// for a fresh backoff cycle on the SAME session (the auto-retry loop itself
// never touches this latch, so it can't spuriously trigger the reset).
const prevResumeExhaustedRef = useRef<string | null>(null)
useEffect(() => {
const gatewayOpen = gatewayState === 'open'
@ -139,4 +176,111 @@ export function useRouteResume({
selectedStoredSessionIdRef,
startFreshSessionDraft
])
// Bounded auto-retry: when the routed session's resume failed terminally
// (resumeFailedSessionId matches the route), schedule a backoff retry so the
// window recovers on its own instead of latching the loader forever. This is
// the safety net the main effect above can't provide: after a failed resume,
// selectedStoredSessionIdRef.current already equals the route (resumeSession
// sets it synchronously at entry) and the pathname/gateway are unchanged, so
// none of stuckOnRoutedSession / pathnameChanged / gatewayBecameOpen fire
// again. resumeSession clears resumeFailedSessionId on its next attempt; a
// success keeps it clear (the effect's guard then no-ops), a repeat failure
// re-arms it and we back off further, capped at MAX_RESUME_RETRIES.
useEffect(() => {
// Detect the exhausted-latch armed->cleared edge for the current route. Only
// resumeSession clears $resumeExhaustedSessionId (manual Retry / reconnect /
// reselect) — the auto-retry loop never touches it — so this transition
// uniquely means "the user asked for another go." Reset the attempt counter
// for a fresh bounded backoff cycle on the SAME session. Without this,
// retryAttemptRef stays pinned at MAX after exhaustion (the !stranded reset
// below only fires on a route CHANGE to a different session), so a manual
// retry on the same stranded session would get exactly ONE attempt and then
// immediately re-arm the exhausted error — never the renewed backoff cycle
// the store/session.ts + use-session-actions.ts comments promise. (Point 2)
const wasExhausted = prevResumeExhaustedRef.current
prevResumeExhaustedRef.current = resumeExhaustedSessionId
if (wasExhausted && wasExhausted === routedSessionId && resumeExhaustedSessionId !== wasExhausted) {
retrySessionIdRef.current = routedSessionId
retryAttemptRef.current = 0
}
if (currentView !== 'chat' || gatewayState !== 'open') {
return
}
const stranded =
Boolean(routedSessionId) &&
resumeFailedSessionId === routedSessionId &&
!creatingSessionRef.current
if (!stranded) {
// Route moved off the stranded session (or it recovered) — reset the
// counter so a future failure on another session starts fresh, and clear
// any exhausted-latch armed for a session we're no longer viewing (never
// the current route: that's the error state we want to keep showing).
// resumeSession also clears it on a fresh attempt; this covers a plain
// route-change away from the stranded window.
if (retrySessionIdRef.current !== routedSessionId) {
retrySessionIdRef.current = null
retryAttemptRef.current = 0
setResumeExhaustedSessionId(current => (current && current !== routedSessionId ? null : current))
}
return
}
// New stranded session id → reset the attempt counter.
if (retrySessionIdRef.current !== routedSessionId) {
retrySessionIdRef.current = routedSessionId
retryAttemptRef.current = 0
}
if (retryAttemptRef.current >= MAX_RESUME_RETRIES) {
// Give up auto-retrying a persistently dead backend; the user can still
// reconnect / reselect (which resets the counter via the branch above).
// Surface an explicit error + manual Retry in the chat view instead of
// spinning the loader forever — resumeSession (manual Retry / reconnect /
// reselect) clears this latch and resets the counter for a fresh cycle.
setResumeExhaustedSessionId(routedSessionId)
return
}
const attempt = retryAttemptRef.current
const sessionId = routedSessionId as string
const timer = setTimeout(() => {
// Re-check liveness at fire time: a resume may have landed while we waited.
if (
creatingSessionRef.current ||
selectedStoredSessionIdRef.current !== sessionId ||
activeSessionIdRef.current !== null
) {
return
}
// Consume an attempt ONLY now that a resume is actually dispatching.
// Incrementing at schedule time (the old behavior) let unrelated dep
// changes during the 1s-8s backoff window (a transient gatewayState
// flip, a non-referentially-stable resumeSession) clear the pending
// timer and re-run the effect, burning an attempt without any resume
// having fired. A flapping backend could then hit MAX in a couple of
// re-renders with far fewer than MAX real attempts. (Point 3)
retryAttemptRef.current += 1
void resumeSession(sessionId, true)
}, resumeRetryDelayMs(attempt))
return () => clearTimeout(timer)
}, [
activeSessionIdRef,
creatingSessionRef,
currentView,
gatewayState,
resumeSession,
resumeFailedSessionId,
resumeExhaustedSessionId,
routedSessionId,
selectedStoredSessionIdRef
])
}
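
As a worked example of the backoff arithmetic, the constants and delay function from this file can be evaluated for each attempt index. The `schedule` and `worstCaseWaitMs` names are illustrative additions, and the total counts only timer delays, not the time each resume call itself takes.

```typescript
const MAX_RESUME_RETRIES = 4
const RESUME_RETRY_BASE_MS = 1_000
const RESUME_RETRY_MAX_MS = 8_000

function resumeRetryDelayMs(attempt: number): number {
  // Doubles per attempt, clamped to the maximum delay.
  return Math.min(RESUME_RETRY_MAX_MS, RESUME_RETRY_BASE_MS * 2 ** attempt)
}

// Attempt indices 0..3 map to delays of 1000, 2000, 4000, and 8000 ms.
const schedule = Array.from({ length: MAX_RESUME_RETRIES }, (_, attempt) => resumeRetryDelayMs(attempt))

// Sum of the four delays: 15000 ms of waiting before auto-retry gives up.
const worstCaseWaitMs = schedule.reduce((sum, ms) => sum + ms, 0)

console.log(schedule, worstCaseWaitMs)
```

With a cap of four attempts the clamp is only reached on the last attempt; it would start to matter if the cap were raised.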


@ -3,8 +3,9 @@ import type { MutableRefObject } from 'react'
import { useEffect } from 'react'
import { afterEach, describe, expect, it, vi } from 'vitest'
import { getSessionMessages } from '@/hermes'
import { $activeGatewayProfile, $newChatProfile } from '@/store/profile'
import { $currentCwd } from '@/store/session'
import { $currentCwd, $messages, $resumeFailedSessionId, setMessages, setResumeFailedSessionId } from '@/store/session'
import type { ClientSessionState } from '../../types'
@ -117,3 +118,142 @@ describe('createBackendSessionForSend profile routing', () => {
expect(params).toMatchObject({ profile: 'default' })
})
})
// ── Resume failure recovery (the "stuck loading session window" bug) ──────────
// When session.resume rejects AND the REST transcript fallback ALSO fails, the
// hook must (a) not throw out of the fallback (which stranded the loader), and
// (b) arm $resumeFailedSessionId so use-route-resume can retry. A resume that
// succeeds must NOT leave the flag armed.
function ResumeHarness({
onReady,
requestGateway
}: {
onReady: (resume: (storedSessionId: string, replaceRoute?: boolean) => Promise<unknown>) => void
requestGateway: <T>(method: string, params?: Record<string, unknown>) => Promise<T>
}) {
const ref = <T,>(value: T): MutableRefObject<T> => ({ current: value })
const actions = useSessionActions({
activeSessionId: null,
activeSessionIdRef: ref<string | null>(null),
busyRef: ref(false),
creatingSessionRef: ref(false),
ensureSessionState: () => ({}) as ClientSessionState,
getRouteToken: () => 'token',
navigate: vi.fn() as never,
requestGateway,
runtimeIdByStoredSessionIdRef: ref(new Map<string, string>()),
selectedStoredSessionId: null,
selectedStoredSessionIdRef: ref<string | null>(null),
sessionStateByRuntimeIdRef: ref(new Map<string, ClientSessionState>()),
syncSessionStateToView: vi.fn(),
updateSessionState: (_sessionId, updater) => updater({} as ClientSessionState)
})
useEffect(() => {
onReady(actions.resumeSession)
}, [actions.resumeSession, onReady])
return null
}
describe('resumeSession failure recovery', () => {
afterEach(() => {
cleanup()
setResumeFailedSessionId(null)
setMessages([])
vi.restoreAllMocks()
})
async function runResume(
requestGateway: <T>(method: string, params?: Record<string, unknown>) => Promise<T>
): Promise<void> {
let resume: ((storedSessionId: string, replaceRoute?: boolean) => Promise<unknown>) | null = null
render(<ResumeHarness onReady={r => (resume = r)} requestGateway={requestGateway} />)
await waitFor(() => expect(resume).not.toBeNull())
await resume!('stored-1', true)
}
it('arms $resumeFailedSessionId when resume RPC and REST fallback both fail', async () => {
// session.resume rejects (e.g. timeout against a wedged backend)...
const requestGateway = vi.fn(async (method: string) => {
if (method === 'session.resume') {
throw new Error('request timed out: session.resume')
}
return {} as never
})
// ...and the REST transcript fallback also rejects (backend unreachable).
vi.mocked(getSessionMessages).mockRejectedValue(new Error('network down'))
await runResume(requestGateway)
// The window is no longer silently stranded: the failure latch is armed for
// the stored session, which use-route-resume consumes to retry.
expect($resumeFailedSessionId.get()).toBe('stored-1')
})

  it('does NOT arm the failure latch when the resume RPC fails but the REST fallback paints history', async () => {
    // session.resume rejects, but the REST transcript fallback succeeds and
    // hydrates a readable transcript, so the window is NOT stranded.
    const requestGateway = vi.fn(async (method: string) => {
      if (method === 'session.resume') {
        throw new Error('request timed out: session.resume')
      }
      return {} as never
    })
    vi.mocked(getSessionMessages).mockResolvedValue({
      messages: [
        { content: 'hello', role: 'user', timestamp: 1 },
        { content: 'hi there', role: 'assistant', timestamp: 2 }
      ],
      session_id: 'stored-1'
    } as never)
    await runResume(requestGateway)
    // Arming here would auto-retry a window that already shows history and,
    // on exhaustion, blank that transcript behind the error overlay, a
    // regression vs. plain fallback-success. The latch must stay clear.
    expect($resumeFailedSessionId.get()).toBeNull()
    // The fallback transcript is visible.
    expect($messages.get().length).toBeGreaterThan(0)
  })

  it('does NOT throw out of the fallback when REST also fails (no unhandled rejection)', async () => {
    const requestGateway = vi.fn(async (method: string) => {
      if (method === 'session.resume') {
        throw new Error('request timed out: session.resume')
      }
      return {} as never
    })
    vi.mocked(getSessionMessages).mockRejectedValue(new Error('network down'))
    // resumeSession must resolve (swallow the fallback failure), not reject.
    await expect(runResume(requestGateway)).resolves.toBeUndefined()
  })

  it('leaves the failure latch clear when resume succeeds', async () => {
    // Pre-arm to prove a successful resume clears it (entry-clear path).
    setResumeFailedSessionId('stored-1')
    const requestGateway = vi.fn(async (method: string, params?: Record<string, unknown>) => {
      if (method === 'session.resume') {
        return { session_id: 'runtime-1', resumed: params?.session_id, messages: [], info: {} } as never
      }
      return {} as never
    })
    vi.mocked(getSessionMessages).mockResolvedValue({ messages: [] } as never)
    await runResume(requestGateway)
    expect($resumeFailedSessionId.get()).toBeNull()
  })
})


@ -38,6 +38,8 @@ import {
setFreshDraftReady,
setIntroSeed,
setMessages,
setResumeExhaustedSessionId,
setResumeFailedSessionId,
setSelectedStoredSessionId,
setSessions,
setSessionStartedAt,
@ -579,6 +581,15 @@ export function useSessionActions({
clearNotifications()
setSelectedStoredSessionId(storedSessionId)
selectedStoredSessionIdRef.current = storedSessionId
// Optimistically clear any prior resume-failure latch for this session:
// we're attempting a fresh resume, so the self-heal in use-route-resume
// must not keep treating it as stranded. It's re-armed below only if THIS
// attempt fails terminally (RPC reject + REST fallback failure).
setResumeFailedSessionId(current => (current === storedSessionId ? null : current))
// Also clear the exhausted-latch: a fresh attempt (manual Retry, reconnect,
// reselect) gives the bounded auto-retry counter a clean cycle, so the
// chat view drops the error state and shows the loader again.
setResumeExhaustedSessionId(current => (current === storedSessionId ? null : current))
const warmRuntimeId = runtimeIdByStoredSessionIdRef.current.get(storedSessionId)
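
The entry-clear above only resets a latch when it is armed for the session being resumed, so a failure recorded for a different session survives. A minimal sketch of that conditional-clear updater follows. `createLatch` and `clearLatchFor` are hypothetical stand-ins written for this illustration, not the store's actual `atom`/`updateAtom` helpers.

```typescript
// Hypothetical stand-in for the store's atom + updater helpers (assumption).
type Updater<T> = T | ((current: T) => T)

function createLatch(initial: string | null = null) {
  let value = initial
  return {
    get: (): string | null => value,
    set: (next: Updater<string | null>): void => {
      // Accept either a plain value or a function of the current value.
      value = typeof next === 'function' ? next(value) : next
    }
  }
}

// Clear the latch only when it is armed for the session being resumed.
function clearLatchFor(latch: ReturnType<typeof createLatch>, storedSessionId: string): void {
  latch.set(current => (current === storedSessionId ? null : current))
}

const failedLatch = createLatch('stored-1')

clearLatchFor(failedLatch, 'stored-2') // different session: latch is left armed
const afterOtherSession = failedLatch.get()

clearLatchFor(failedLatch, 'stored-1') // same session: latch is cleared
const afterSameSession = failedLatch.get()
```

Using a functional updater rather than an unconditional `set(null)` keeps a stale resume of one session from wiping the failure state of another.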
@ -769,13 +780,41 @@ export function useSessionActions({
        return
      }
      // The gateway resume RPC failed. Try the REST transcript as a fallback
      // so the window at least shows history. CRITICAL: this fallback must be
      // wrapped in its own try. If it ALSO throws (wedged/unreachable backend,
      // the common case when resume failed in the first place), an unguarded
      // throw here skips setMessages AND leaves activeSessionId null with an
      // empty transcript. That is the exact state the thread loader latches on
      // forever (messagesEmpty && !activeSessionId) with no recovery path:
      // the "open in new window stays stuck loading, even after a nap" bug.
      try {
        const fallback = await getSessionMessages(storedSessionId, sessionProfile)
        if (!isCurrentResume()) {
          return
        }
        setMessages(preserveLocalAssistantErrors(toChatMessages(fallback.messages), $messages.get()))
      } catch {
        // Fallback also failed: nothing to paint. Leave whatever messages are
        // already shown and fall through to arm the resume-failure latch so
        // use-route-resume's bounded backoff auto-retry re-attempts the resume
        // instead of stranding the loader.
      }
      if (isCurrentResume() && $messages.get().length === 0) {
        // Arm the self-heal ONLY when the window is still empty: the gateway
        // resume rejected AND the REST fallback failed to paint a transcript.
        // That is the exact stranded state the loader latches on
        // (messagesEmpty && !activeSessionId), and matches $resumeFailedSessionId's
        // documented contract. If the REST fallback DID paint history, the
        // window is readable, so arming here would needlessly auto-retry and,
        // once retries exhaust, blank that visible transcript behind the
        // exhausted-state error overlay (a regression vs. plain fallback success).
        setResumeFailedSessionId(storedSessionId)
      }
      notifyError(err, copy.resumeFailed)
    } finally {
      if (isCurrentResume()) {
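
The control flow of the hunk above can be sketched in isolation: the fallback gets its own `try`, and the failure latch is armed only when the window is still empty afterwards. This is a simplified illustration under stated assumptions, not the hook's code. `ResumeDeps`, `WindowState`, and `resumeWithGuardedFallback` are hypothetical names, and the `isCurrentResume()` staleness checks and `notifyError` call are omitted.

```typescript
type Message = { role: string; content: string }

// Hypothetical injected dependencies standing in for the gateway RPC and REST call.
interface ResumeDeps {
  resumeRpc: (id: string) => Promise<Message[]>
  fetchTranscript: (id: string) => Promise<Message[]>
}

// Hypothetical plain-object stand-in for the $messages / $resumeFailedSessionId atoms.
interface WindowState {
  messages: Message[]
  failedSessionId: string | null
}

async function resumeWithGuardedFallback(id: string, deps: ResumeDeps, state: WindowState): Promise<void> {
  // Entry-clear: a fresh attempt drops any prior failure latch for this session.
  if (state.failedSessionId === id) {
    state.failedSessionId = null
  }
  try {
    state.messages = await deps.resumeRpc(id)
  } catch {
    try {
      // Guarded fallback: a second failure must not escape this handler.
      state.messages = await deps.fetchTranscript(id)
    } catch {
      // Nothing to paint; keep whatever is already shown.
    }
    // Arm the latch only when the window is still empty (the stranded state).
    if (state.messages.length === 0) {
      state.failedSessionId = id
    }
  }
}
```

Calling it with both dependencies rejecting resolves normally and leaves `failedSessionId` set to the session id. With a fallback that returns messages, the latch stays `null` and the transcript is shown.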


@ -1843,6 +1843,9 @@ export const en: Translations = {
regenerateFailed: 'Regenerate failed',
editFailed: 'Edit failed',
resumeFailed: 'Resume failed',
resumeStrandedTitle: "Couldn't load this session",
resumeStrandedBody: 'The connection to this session failed and automatic retries gave up. Check that the gateway is running, then try again.',
resumeRetry: 'Retry',
nothingToBranch: 'Nothing to branch',
branchNeedsChat: 'Start or resume a chat before branching.',
sessionBusy: 'Session busy',


@ -1974,6 +1974,9 @@ export const ja = defineLocale({
regenerateFailed: '再生成に失敗しました',
editFailed: '編集に失敗しました',
resumeFailed: '再開に失敗しました',
resumeStrandedTitle: 'このセッションを読み込めませんでした',
resumeStrandedBody: 'このセッションへの接続に失敗し、自動再試行も停止しました。ゲートウェイが実行中か確認してから、もう一度お試しください。',
resumeRetry: '再試行',
nothingToBranch: 'ブランチするものがありません',
branchNeedsChat: 'ブランチする前にチャットを開始または再開してください。',
sessionBusy: 'セッションが使用中',


@ -1481,6 +1481,9 @@ export interface Translations {
regenerateFailed: string
editFailed: string
resumeFailed: string
resumeStrandedTitle: string
resumeStrandedBody: string
resumeRetry: string
nothingToBranch: string
branchNeedsChat: string
sessionBusy: string


@ -1914,6 +1914,9 @@ export const zhHant = defineLocale({
regenerateFailed: '重新生成失敗',
editFailed: '編輯失敗',
resumeFailed: '繼續失敗',
resumeStrandedTitle: '無法載入此工作階段',
resumeStrandedBody: '與此工作階段的連線失敗,自動重試已停止。請確認閘道正在執行,然後重試。',
resumeRetry: '重試',
nothingToBranch: '沒有可分支的內容',
branchNeedsChat: '分支前請先開始或繼續一個聊天。',
sessionBusy: '工作階段忙碌中',


@ -2021,6 +2021,9 @@ export const zh: Translations = {
regenerateFailed: '重新生成失败',
editFailed: '编辑失败',
resumeFailed: '恢复失败',
resumeStrandedTitle: '无法加载此会话',
resumeStrandedBody: '与此会话的连接失败,自动重试已停止。请确认网关正在运行,然后重试。',
resumeRetry: '重试',
nothingToBranch: '没有可分支的内容',
branchNeedsChat: '分支前请先开始或恢复一个对话。',
sessionBusy: '会话忙碌中',


@ -218,6 +218,23 @@ export const $lastVisibleMessageIsUser = computed($messages, lastVisibleMessageI
export const $freshDraftReady = atom(false)
export const $busy = atom(false)
export const $awaitingResponse = atom(false)
// Stored-session id whose most recent resume FAILED terminally (the gateway RPC
// rejected AND the REST transcript fallback also failed), leaving the window
// with no runtime and an empty transcript. Drives use-route-resume's self-heal:
// while this matches the routed session the loader would otherwise latch
// forever (messagesEmpty && !activeSessionId), so the hook re-attempts the
// resume on a bounded backoff instead of stranding the window.
// Null whenever the active route has a healthy (or in-flight) resume.
export const $resumeFailedSessionId = atom<string | null>(null)
// Stored-session id whose resume has EXHAUSTED its bounded auto-retries (the
// terminal-failure latch above kept failing through all MAX_RESUME_RETRIES
// attempts). Distinct from $resumeFailedSessionId, which is armed *during* the
// backoff window too: this fires only once auto-recovery has given up, so the
// chat view can swap the perpetual loader for an explicit error + manual Retry
// affordance. A fresh resumeSession() (manual Retry, reconnect, reselect)
// clears it and resets the retry counter. Null whenever the active route has a
// healthy, in-flight, or still-auto-retrying resume.
export const $resumeExhaustedSessionId = atom<string | null>(null)
export const $currentModel = atom(storedString(COMPOSER_MODEL_KEY) ?? '')
export const $currentProvider = atom(storedString(COMPOSER_PROVIDER_KEY) ?? '')
export const $currentReasoningEffort = atom(storedString(COMPOSER_EFFORT_KEY) ?? '')
@ -262,6 +279,8 @@ export const setActiveSessionId = (next: Updater<string | null>) => updateAtom($
export const setSelectedStoredSessionId = (next: Updater<string | null>) => updateAtom($selectedStoredSessionId, next)
export const setMessages = (next: Updater<ChatMessage[]>) => updateAtom($messages, next)
export const setFreshDraftReady = (next: Updater<boolean>) => updateAtom($freshDraftReady, next)
export const setResumeFailedSessionId = (next: Updater<string | null>) => updateAtom($resumeFailedSessionId, next)
export const setResumeExhaustedSessionId = (next: Updater<string | null>) => updateAtom($resumeExhaustedSessionId, next)
export const setBusy = (next: Updater<boolean>) => updateAtom($busy, next)
export const setAwaitingResponse = (next: Updater<boolean>) => updateAtom($awaitingResponse, next)
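
The two latches describe a bounded retry cycle: the failed latch stays armed while auto-retries run, and the exhausted latch fires once the cap is reached. A rough sketch of such a schedule follows. The commit message states 4 attempts spanning 1s to 8s; the doubling between attempts, the value shown for `MAX_RESUME_RETRIES`, and the names `BASE_DELAY_MS` and `retryDelayMs` are assumptions made for this illustration.

```typescript
const MAX_RESUME_RETRIES = 4 // "4 attempts" per the commit message
const BASE_DELAY_MS = 1000 // assumption: first retry waits 1s

// Delay before the given 1-based retry attempt, or null once retries are
// exhausted (the point at which the exhausted latch would be armed).
function retryDelayMs(attempt: number): number | null {
  if (attempt < 1 || attempt > MAX_RESUME_RETRIES) {
    return null
  }
  // Assumed exponential doubling: 1s, 2s, 4s, 8s.
  return BASE_DELAY_MS * 2 ** (attempt - 1)
}

const schedule = [1, 2, 3, 4, 5].map(retryDelayMs)
// schedule is [1000, 2000, 4000, 8000, null]
```

Returning `null` past the cap gives the caller a single signal for switching from the loader to the error view with a manual Retry.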