desktop+gateway: harden Slack socket recovery and Windows restart dedupe (#28873)

* desktop+gateway: harden Slack socket recovery and Windows restart dedupe

Fix Slack Socket Mode reliability by adding a watchdog/reconnect path so silent socket task drops no longer leave the adapter stuck. Harden Windows gateway lifecycle by avoiding desktop-binary path collisions, making gateway PID scans case/extension tolerant, and reusing in-flight restart actions to prevent duplicate gateway spawns.

* test(slack): add Socket Mode watchdog/reconnect behavioural coverage

Drive the new Slack Socket Mode self-healing logic through a fake AsyncSocketModeHandler so we can simulate the P0 silent-hang failure mode (task exit, transport disconnected, intentional shutdown, concurrent reconnect attempts) without touching real Slack.

* fix(slack,desktop): address Copilot review on watchdog races and path normalization

- connect(): explicitly cancel + await the prior socket watchdog before flipping _running, so an old monitor cannot exit between teardown and respawn (Copilot #1)
- _socket_watchdog_loop: wrap the body in try/except + add a done-callback that respawns on unexpected crash, so a transient bug cannot permanently disable self-healing (Copilot #2)
- normalizeExecutablePathForCompare: use the resolved path for realpathSync so non-string inputs cannot leak through (Copilot #3)
- Add tests for crash-recovery and atomic watchdog replacement across reconnects

* fix(slack): tighten connect() error path and clarify watchdog test intent

Address Copilot review round 2.

- connect(): wrap _start_socket_mode_handler/_ensure_socket_watchdog in a focused try/except so any failure rolls back partially-started handler/task state and leaves _running=False, ensuring the platform lock is always released by the outer finally
- Defer _running=True until after the handler is actually started so the watchdog observes a live socket task immediately and never spins against a half-built adapter
- Rename test_watchdog_self_restarts_after_unexpected_crash to test_watchdog_cancellation_does_not_respawn (matches what it actually asserts) and add test_watchdog_unexpected_exit_respawns_via_done_callback that drives a real RuntimeError through _on_socket_watchdog_done and verifies a fresh task replaces the crashed one

* fix(web_server): serialize action spawn check+store under a threading lock

Address Copilot review round 3.

FastAPI runs sync handlers on its threadpool, so two near-simultaneous /api/gateway/restart (or /api/hermes/update) requests could both observe "no live process" in _spawn_hermes_action's poll-based dedupe and double-spawn. Add a module-level _ACTION_SPAWN_LOCK around the entire check + Popen + _ACTION_PROCS store sequence so the dedupe is atomic across threads.

* fix: address Copilot review round 4

- slack.disconnect(): mirror connect()'s defensive cleanup — catch the broad Exception path on watchdog await so handler shutdown and lock release still run if the watchdog raised before cancellation took effect
- web_server._spawn_hermes_action: wrap subprocess.Popen in try/except so a missing executable / permission error closes the log file handle, writes a failure marker, and re-raises instead of leaking a file descriptor
- gateway._scan_gateway_pids: drop the over-broad "hermes.exe --profile" / "hermes.exe -p" patterns that would match any Hermes CLI subcommand using a profile flag (e.g. `hermes.exe --profile foo dashboard`); rely on the "hermes.exe gateway" + "hermes-gateway.exe" tokens instead
- tests: tighten _fake_create_task to assert coroutine input and return a real asyncio.Task that stays pending until pytest teardown, and update the three callsites whose mocked AsyncSocketModeHandler.start_async returned a non-coroutine value

* fix(slack): reset multi-workspace state on reconnect

Address Copilot review round 5.

connect() is reentrant (gateway restart, in-process reconnect), but it was leaving _bot_user_id / _team_clients / _team_bot_user_ids populated from the previous session. A reconnect that rotated the primary token or dropped a workspace would silently keep the stale bot user id and stale workspace client maps, leading to dispatch against gone workspaces.

Clear these three pieces of state right after _stop_socket_mode_handler() and before the auth_test loop, then let the loop repopulate from the current tokens. Add test_reconnect_refreshes_multi_workspace_state to lock it in.
This commit is contained in:
brooklyn! 2026-05-19 15:31:53 -05:00 committed by GitHub
parent c858484b45
commit 7f8b0dd1e0
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
6 changed files with 3639 additions and 1198 deletions

View file

@ -48,9 +48,7 @@ try {
const path = require('node:path')
const resourcesPath = process.resourcesPath
if (resourcesPath) {
nodePty = require(
path.join(resourcesPath, 'native-deps', 'node-pty')
)
nodePty = require(path.join(resourcesPath, 'native-deps', 'node-pty'))
}
} catch {
nodePty = null
@ -103,7 +101,9 @@ function loadInstallStamp() {
const parsed = JSON.parse(raw)
if (parsed && typeof parsed === 'object' && typeof parsed.commit === 'string' && parsed.commit.length >= 7) {
if (parsed.schemaVersion !== INSTALL_STAMP_SCHEMA_VERSION) {
console.warn(`[hermes] install-stamp.json schemaVersion ${parsed.schemaVersion} != expected ${INSTALL_STAMP_SCHEMA_VERSION}; ignoring`)
console.warn(
`[hermes] install-stamp.json schemaVersion ${parsed.schemaVersion} != expected ${INSTALL_STAMP_SCHEMA_VERSION}; ignoring`
)
continue
}
return Object.freeze({
@ -124,11 +124,15 @@ function loadInstallStamp() {
}
const INSTALL_STAMP = loadInstallStamp()
if (INSTALL_STAMP) {
console.log(`[hermes] install stamp: ${INSTALL_STAMP.commit.slice(0, 12)}${INSTALL_STAMP.branch ? ` (${INSTALL_STAMP.branch})` : ''}${INSTALL_STAMP.dirty ? ' [DIRTY]' : ''} from ${INSTALL_STAMP.source || 'unknown'}`)
console.log(
`[hermes] install stamp: ${INSTALL_STAMP.commit.slice(0, 12)}${INSTALL_STAMP.branch ? ` (${INSTALL_STAMP.branch})` : ''}${INSTALL_STAMP.dirty ? ' [DIRTY]' : ''} from ${INSTALL_STAMP.source || 'unknown'}`
)
} else if (IS_PACKAGED) {
// Dev builds without a stamp are normal; packaged builds without one
// mean the bootstrap won't know what to clone. Surface clearly.
console.error('[hermes] WARNING: no install-stamp.json found in packaged build. First-launch bootstrap will not have a pinned ref to install.')
console.error(
'[hermes] WARNING: no install-stamp.json found in packaged build. First-launch bootstrap will not have a pinned ref to install.'
)
}
// HERMES_HOME — the user-facing root for everything Hermes-related. Mirrors
@ -667,6 +671,41 @@ function isCommandScript(command) {
return IS_WINDOWS && /\.(cmd|bat)$/i.test(command || '')
}
function normalizeExecutablePathForCompare(commandPath) {
if (!commandPath) return null
let resolved = path.resolve(String(commandPath))
try {
resolved = fs.realpathSync.native ? fs.realpathSync.native(resolved) : fs.realpathSync(resolved)
} catch {
// Fallback to path.resolve() above.
}
return IS_WINDOWS ? resolved.toLowerCase() : resolved
}
function looksLikeDesktopAppBinary(commandPath) {
if (!IS_WINDOWS || !commandPath) return false
const normalizedCandidate = normalizeExecutablePathForCompare(commandPath)
const normalizedCurrentExec = normalizeExecutablePathForCompare(process.execPath)
if (normalizedCandidate && normalizedCurrentExec && normalizedCandidate === normalizedCurrentExec) {
return true
}
let resolved = path.resolve(String(commandPath))
try {
resolved = fs.realpathSync.native ? fs.realpathSync.native(resolved) : fs.realpathSync(resolved)
} catch {
// Keep resolved path fallback.
}
const resourcesDir = path.join(path.dirname(resolved), 'resources')
return (
fileExists(path.join(resourcesDir, 'app.asar')) || directoryExists(path.join(resourcesDir, 'app.asar.unpacked'))
)
}
function isHermesSourceRoot(root) {
return directoryExists(root) && fileExists(path.join(root, 'hermes_cli', 'main.py'))
}
@ -1303,6 +1342,13 @@ function resolveHermesBackend(dashboardArgs) {
hermesCommand = findOnPath('hermes')
}
if (hermesCommand) {
if (looksLikeDesktopAppBinary(hermesCommand)) {
rememberLog(`Ignoring desktop app executable on PATH while resolving Hermes CLI: ${hermesCommand}`)
hermesCommand = null
}
}
if (hermesCommand) {
return {
label: `existing Hermes CLI at ${hermesCommand}`,
@ -1498,8 +1544,7 @@ async function ensureRuntime(backend) {
// install.ps1 succeeds. If we hit this, the user (or a deleted venv)
// broke the invariant; tell them to re-run the install.
throw new Error(
`Hermes venv missing at ${VENV_ROOT}. Re-run the desktop installer or ` +
'`scripts/install.ps1` to rebuild it.'
`Hermes venv missing at ${VENV_ROOT}. Re-run the desktop installer or ` + '`scripts/install.ps1` to rebuild it.'
)
}

File diff suppressed because it is too large Load diff

File diff suppressed because it is too large Load diff

File diff suppressed because it is too large Load diff

File diff suppressed because it is too large Load diff

View file

@ -442,6 +442,28 @@ def test_find_gateway_pids_falls_back_to_pid_file_when_process_scan_fails(monkey
assert gateway.find_gateway_pids() == [321]
def test_scan_gateway_pids_detects_windows_hermes_exe_case_variants(monkeypatch):
monkeypatch.setattr(gateway, "is_windows", lambda: True)
monkeypatch.setattr(gateway, "_get_ancestor_pids", lambda: set())
monkeypatch.setattr(gateway.shutil, "which", lambda name: "wmic.exe" if name == "wmic" else None)
def fake_run(cmd, **kwargs):
if cmd[:4] == ["wmic.exe", "process", "get", "ProcessId,CommandLine"]:
return SimpleNamespace(
returncode=0,
stdout=(
"CommandLine=C:\\Program Files\\Hermes\\Hermes.EXE gateway run --replace\n"
"ProcessId=2468\n\n"
),
stderr="",
)
raise AssertionError(f"Unexpected command: {cmd}")
monkeypatch.setattr(gateway.subprocess, "run", fake_run)
assert gateway._scan_gateway_pids(set(), all_profiles=True) == [2468]
# ---------------------------------------------------------------------------
# _wait_for_gateway_exit
# ---------------------------------------------------------------------------