fix(desktop): recover chat after sleep/wake by revalidating a stale remote backend

After sleep/wake, a remote (global-remote) primary backend can become
unreachable, but it has no child process whose 'exit' clears the main
process's cached connectionPromise. The renderer then re-dials the same
dead remote forever and the composer stays stuck on "Starting Hermes…";
only a quit and reopen recovered it.

Fix: the renderer's existing backoff-paced reconnect loop now asks the
main process to revalidate the cached connection before re-dialing. The
main process liveness-probes the cached REMOTE backend's public
/api/status and, if unreachable, drops the cache (resetHermesConnection
only nulls connectionPromise for a remote — no child to SIGTERM) so the
next getConnection() rebuilds a reachable descriptor. Local backends are
never touched here; they self-heal via the child 'exit' handler. The
renderer's loop already provides retry pacing and rides out transient
blips, so no streak/episode bookkeeping is needed in the main process.
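
As a rough illustration of that ordering (revalidate first, then re-dial, with the loop itself supplying the backoff), here is a minimal sketch. The names `reconnectLoop`, `backoffDelayMs`, and `dial`, and the delay constants, are assumptions for illustration and are not the hook's actual code; only `revalidateConnection` and `getConnection` come from this commit.

```javascript
// Hypothetical backoff schedule: doubles per attempt, capped at capMs.
function backoffDelayMs(attempt, baseMs = 500, capMs = 15000) {
  return Math.min(capMs, baseMs * 2 ** attempt)
}

const defaultSleep = ms => new Promise(resolve => setTimeout(resolve, ms))

// Hypothetical reconnect loop. `desktop` is the preload bridge; `dial` is an
// assumed function that opens the socket for a descriptor and rejects when
// the endpoint is unreachable.
async function reconnectLoop(desktop, dial, { maxAttempts = 5, sleep = defaultSleep } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Ask the main process to drop a dead remote cache. A missing method or a
    // rejected call must never block the re-dial itself.
    await desktop.revalidateConnection?.().catch(() => undefined)
    try {
      const conn = await desktop.getConnection()
      await dial(conn)
      return conn
    } catch {
      // Transient blip or still-dead endpoint: wait, then try again.
      await sleep(backoffDelayMs(attempt))
    }
  }
  throw new Error('reconnect attempts exhausted')
}
```

Because the loop already spaces out retries, the main process only has to answer one question per tick (is the cached remote reachable right now?) and needs no failure history of its own.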

The boot hook dismisses the boot-progress overlay on the post-rebuild
'open' so an in-place rebuild can't leave it stuck at ~94%.

Reimplements #40135 by @AlchemistChaos on a smaller, more interpretable
path (63 added lines vs 555): no extracted helper module and no
failure-streak / episode-window state; the renderer's backoff loop is
the retry mechanism. Original diagnosis and fix by @AlchemistChaos.

Co-authored-by: AlchemistChaos <alchemistchaos@protonmail.com>
teknium1 2026-06-07 07:57:26 -07:00 committed by Teknium
parent c986377236
commit cadb74adad
4 changed files with 63 additions and 0 deletions


@@ -4737,6 +4737,45 @@ function createWindow() {
}
ipcMain.handle('hermes:connection', async (_event, profile) => ensureBackend(profile))
// Reconnect-after-wake recovery. A REMOTE primary backend has no child process,
// so the 'exit'/'error' handlers that would clear a dead connectionPromise never
// fire — once the remote becomes unreachable across a sleep/wake the renderer
// re-dials the same dead descriptor forever and the composer stays stuck on
// "Starting Hermes…". Before the renderer's backoff loop reconnects, it asks us
// to confirm the cached PRIMARY backend is still reachable; if a remote one is
// not, we drop the cache so the next getConnection() rebuilds it. Local backends
// self-heal via their child 'exit' handler, so we never touch them here.
ipcMain.handle('hermes:connection:revalidate', async () => {
if (!connectionPromise) {
return { ok: true, rebuilt: false }
}
let conn = null
try {
conn = await connectionPromise
} catch {
// The cached boot already rejected (its own catch nulls connectionPromise);
// nothing to revalidate — the next getConnection() builds fresh.
return { ok: true, rebuilt: false }
}
if (!conn || conn.mode !== 'remote' || !conn.baseUrl) {
return { ok: true, rebuilt: false }
}
const base = conn.baseUrl.replace(/\/+$/, '')
try {
await fetchPublicJson(`${base}/api/status`, { timeoutMs: 2_500 })
return { ok: true, rebuilt: false }
} catch {
// Unreachable remote: drop the stale cache so the renderer's next reconnect
// tick rebuilds a fresh, reachable descriptor. resetHermesConnection only
// nulls connectionPromise for a remote (no child to SIGTERM).
rememberLog('Cached remote Hermes backend failed liveness probe; dropping stale connection.')
resetHermesConnection()
return { ok: true, rebuilt: true }
}
})
ipcMain.handle('hermes:backend:touch', async (_event, profile) => {
touchPoolBackend(profile)
return { ok: true }
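
The body of `fetchPublicJson` is not part of this diff. A liveness probe with a hard timeout could look roughly like the sketch below. `probeStatus` and `fetchImpl` are hypothetical names, and the sketch assumes a runtime with global `fetch` and `AbortController` (Node 18+ or a recent Electron).

```javascript
// Hypothetical probe: resolves with the parsed status body when the backend
// answers in time, rejects on timeout, network failure, or a non-2xx reply.
async function probeStatus(baseUrl, { timeoutMs = 2500, fetchImpl = fetch } = {}) {
  // Same trailing-slash normalization as the handler above.
  const base = baseUrl.replace(/\/+$/, '')
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), timeoutMs)
  try {
    const res = await fetchImpl(`${base}/api/status`, { signal: controller.signal })
    if (!res.ok) {
      throw new Error(`status probe failed with HTTP ${res.status}`)
    }
    return await res.json()
  } finally {
    // Always clear the timer so a fast reply does not leave it pending.
    clearTimeout(timer)
  }
}
```

A short timeout matters here: the probe runs on every reconnect tick, so a hung request to a dead host should fail quickly and let the cache be dropped.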


@@ -2,6 +2,7 @@ const { contextBridge, ipcRenderer, webUtils } = require('electron')
contextBridge.exposeInMainWorld('hermesDesktop', {
getConnection: profile => ipcRenderer.invoke('hermes:connection', profile),
revalidateConnection: () => ipcRenderer.invoke('hermes:connection:revalidate'),
touchBackend: profile => ipcRenderer.invoke('hermes:backend:touch', profile),
getGatewayWsUrl: profile => ipcRenderer.invoke('hermes:gateway:ws-url', profile),
getBootProgress: () => ipcRenderer.invoke('hermes:boot-progress:get'),


@@ -120,6 +120,13 @@ export function useGatewayBoot({
reconnecting = true
try {
// Drop a stale REMOTE backend cache before re-dialing. After sleep/wake a
// remote backend can become unreachable, but it has no child process
// whose 'exit' would clear the main process's cached descriptor — without
// this the renderer re-dials the same dead endpoint forever and stays on
// "Starting Hermes…". The probe is a no-op for a healthy or local backend.
await desktop.revalidateConnection?.().catch(() => undefined)
const conn = await desktop.getConnection($activeGatewayProfile.get())
if (cancelled) {
@@ -218,6 +225,15 @@ export function useGatewayBoot({
reconnectAttempt = 0
reauthNotified = false
clearReconnectTimer()
// A revalidate-driven reconnect can rebuild the backend in place when the
// cached remote was found dead, which re-drives the boot-progress overlay.
// Unlike the initial boot, nothing calls completeDesktopBoot() afterwards,
// so dismiss it here once we're open again — otherwise the overlay sticks
// at ~94%. A no-op on a normal (non-rebuild) reconnect.
if (bootCompleted) {
completeDesktopBoot()
}
} else if (bootCompleted && (st === 'closed' || st === 'error')) {
// The socket dropped after a healthy boot (typically sleep/wake). Try
// to bring it back instead of leaving the composer stuck disabled.


@@ -7,6 +7,13 @@ declare global {
// the window's backend; pass a named profile to lazily spawn/reuse that
// profile's backend from the pool.
getConnection: (profile?: string | null) => Promise<HermesConnection>
// Reconnect-after-wake recovery: liveness-probe the cached PRIMARY backend
// and drop it if a remote one has gone unreachable, so the next
// getConnection() rebuilds a reachable descriptor instead of the renderer
// re-dialing a dead remote forever. No-op for local backends (they
// self-heal via the child 'exit' handler). `rebuilt` is true when a stale
// remote cache was dropped.
revalidateConnection: () => Promise<{ ok: boolean; rebuilt: boolean }>
// Keepalive: mark a pool profile backend as recently used so the idle
// reaper spares it while its chat is active.
touchBackend: (profile?: string | null) => Promise<{ ok: boolean }>
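
To exercise the `revalidateConnection` contract without a real backend, one option is an in-memory stand-in for the bridge. The sketch below is a hypothetical test double (`createFakeDesktop` and the `alive` flag are assumptions, not part of this commit). It mirrors the documented behaviour: only a cached, unreachable remote descriptor is dropped, and the rebuild happens on the next `getConnection()`.

```javascript
// Hypothetical test double for the bridge; `alive` stands in for the result
// of the real /api/status liveness probe.
function createFakeDesktop(initial) {
  let cached = { ...initial }
  let builds = 0
  return {
    async getConnection() {
      if (!cached) {
        // Cache was dropped: build a fresh, reachable descriptor.
        builds += 1
        cached = { mode: initial.mode, baseUrl: initial.baseUrl, alive: true }
      }
      return cached
    },
    async revalidateConnection() {
      // Local backends are never touched; healthy remotes are left in place.
      if (cached && cached.mode === 'remote' && !cached.alive) {
        cached = null
        return { ok: true, rebuilt: true }
      }
      return { ok: true, rebuilt: false }
    },
    get builds() {
      return builds
    },
  }
}
```

Usage: create it with `{ mode: 'remote', baseUrl: 'http://example.invalid', alive: false }`, call `revalidateConnection()` and expect `rebuilt: true`, then confirm that the next `getConnection()` returns a descriptor with `alive: true`.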