hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-01 12:02:05 +00:00

Author	SHA1	Message	Date
Teknium	0c2e6c0049	test: make active session cross-process race deterministic (#54248)	2026-06-28 05:49:21 -07:00
teknium1	1ffa01f35f	test(windows): cover no-window backend subprocess flags	2026-06-28 05:28:45 -07:00
Teknium	cb982ad997	fix(windows): hide console-window flash on backend git/gh/wmic/bash subprocess spawns The Windows desktop GUI runs its backend headless via pythonw.exe. Several auxiliary subprocess sites that run inside that windowless backend spawned console-subsystem children (git, gh, wmic, powershell, bash, rg, taskkill) WITHOUT CREATE_NO_WINDOW, so Windows allocated a fresh conhost per call and flashed a black window on screen — sometimes continuously (the dashboard Projects-tree git probe alone fired ~118 spawns in 60s on startup). The terminal tool, cron, browser, code_execution, and gateway-spawn paths already carry windows_hide_flags(); these auxiliary probe/scan/launcher legs were missed. Wire the existing helper into them: - tui_gateway/git_probe.py: run_git (+ encoding=utf-8/errors=replace, fixes the cp950 UnicodeDecodeError on CJK paths from the same site) - agent/coding_context.py: _git (per-turn git status/log/diff) - agent/context_references.py: _run_git + _rg_files (@file/@ref resolution) - hermes_cli/copilot_auth.py: gh auth token probe (auxiliary provider:auto) - hermes_cli/gateway.py: wmic + PowerShell Get-CimInstance PID scan - hermes_cli/main.py: wmic stale-dashboard PID scan - gateway/status.py: taskkill /T /F force-kill windows_hide_flags() returns 0 on POSIX, so every changed call is a no-op on Linux/macOS (verified: real git/rg probes still work; Windows-simulated calls all pass creationflags=CREATE_NO_WINDOW). Scoped to the windowless-backend paths that cause the reported flashing. The Electron updater-handoff leg (main.cjs windowsHide:false) and the interactive-CLI banner probes (cli.py) are intentionally NOT touched here — the former needs a Windows-tested change of its own, the latter runs in a visible console anyway. Tracking: #54220 Refs: #53178 #53631 #53781 #53957 #49602 #52982 #53424 #53053 #53016	2026-06-28 05:28:45 -07:00
teknium1	f25f235722	chore: map salvaged PR #49845 author email for AUTHOR_MAP	2026-06-28 04:47:39 -07:00
homelab-ha-agent	d05cc8f4d6	fix(mcp): skip preflight content-type probe for OAuth servers OAuth-protected MCP servers (e.g. Hospitable) return 200 text/html on an unauthenticated HEAD probe — a login/landing page the server cannot substitute for a real MCP response without a Bearer token. The preflight cannot distinguish this from a misconfigured URL, so it raises NonMcpEndpointError before the OAuth browser flow has a chance to run. Add `and self._auth_type != "oauth"` to the preflight condition in MCPServerTask.run(). The probe is inapplicable to OAuth servers: their URL legitimacy is established by .well-known/oauth-protected-resource during the OAuth handshake, not by a GET content-type check. Concrete repro: Hospitable (https://mcp.hospitable.com/mcp) returns `200 text/html` to an unauthenticated httpx HEAD. Without the guard: ✗ NonMcpEndpointError at `hermes mcp test` With the guard: ✓ Connected (1487ms) — 63 tools discovered Relation to open PRs: - #37598 adds a POST probe fallback for POST-only non-OAuth servers (e.g. DocuSeal), but only passes when POST returns 2xx + MCP content-type. Hospitable returns 401 on the POST probe (Bearer challenge), so #37598 does not cover this case. - #49463 extends the POST probe to also pass on non-2xx auth challenges (making it OAuth-aware), but is labeled duplicate of #37598 and may not land independently. This fix is complementary: it handles OAuth servers with zero extra round-trips rather than adding a POST probe step. Tests: - test_oauth_server_html_response_raises_without_skip: documents that _preflight_content_type raises NonMcpEndpointError for 200 text/html (the underlying issue), with an OAuth-server docstring. - test_run_skips_preflight_for_oauth: verifies that run() does NOT invoke _preflight_content_type when auth_type=="oauth", using class-level monkeypatching so the gate is exercised without a live MCP transport. 23 passed tests/tools/test_mcp_preflight_content_type.py	2026-06-28 04:47:39 -07:00
liuhao1024	9d919daf44	fix(gateway): mark platform lock failure as retryable instead of permanently fatal When a stale lock file survives a gateway crash, `acquire_scoped_lock()` may return `(False, existing_dict)` even after detecting and deleting the stale lock (e.g. if unlink fails or a race condition occurs). Previously, `_acquire_platform_lock()` called `_set_fatal_error(..., retryable=False)`, which permanently killed the platform — the reconnect watcher never retries a non-retryable fatal error. Change to `retryable=True` so the platform enters the "retrying" state and the reconnect watcher can attempt acquisition again after the standard backoff delay. Fixes #54167	2026-06-28 04:35:37 -07:00
teknium1	61622bb56a	fix(tui): use role=user for model switch marker to avoid HTTP 400 on strict providers (#48338) _append_model_switch_marker() appended the post-/model-switch context marker to session history as {"role": "system"}. The cached system prompt is prepended to the API message list (conversation_loop.py), so this marker became a SECOND system message mid-array after prior user/assistant turns. Strict OpenAI-compatible providers (vLLM, Qwen) reject any system message that is not at the beginning of the array, returning HTTP 400 and killing the conversation on the next turn. Flip the marker to role="user" (history entry + both session-DB persist sites), matching the existing personality-overlay marker which already uses role="user". repair_message_sequence() then coalesces it with adjacent user turns as needed. Co-authored-by: liuhao1024 <sunsky.lau@gmail.com> Co-authored-by: Lucas Nicolas <lucas.nicolas@proton.me>	2026-06-28 04:34:55 -07:00
Brad Hallett	376d021fee	fix(desktop): force app exit after update/uninstall handoff on macOS On macOS app.quit() closes windows but window-all-closed deliberately keeps the process alive (Dock convention). Every detached hand-off (update swap, relaunch, Windows bootstrap recovery, uninstall cleanup) waits for the desktop PID to exit before replacing/removing the bundle — so the process never dying means the script spins its full PID-wait and the user sees a blank app, or an uninstall that appears to do nothing. Add a module-level isQuittingForHandoff flag, set before every hand-off app.quit(); window-all-closed then quits on all platforms when it's set. Covers all five hand-off sites including the Linux relaunch path.	2026-06-28 04:30:14 -07:00
teknium1	e54bedd8ea	docs: add infographic for #42006 launchd bootout fix	2026-06-28 04:17:13 -07:00
izumi0uu	c4719aa51c	fix(gateway): boot out stale launchd registration before restart bootstrap launchd restart can leave the gateway job stopped but still registered after update-time drain logic, so a direct bootstrap hits exit 5 and falls back to a detached process. Booting the stale registration out before bootstrap keeps the launchd-managed restart path intact and locks it with a regression test. Constraint: Keep upstream-facing conventional commit style while preserving local decision context Rejected: Treat bootstrap exit 5 as expected | Leaves macOS launchd restart outside launchd supervision after update Confidence: high Scope-risk: narrow Directive: Keep launchd start/restart recovery flows aligned when changing launchctl handling Tested: pytest -q tests/hermes_cli/test_gateway_service.py -k "launchd_restart_boots_out_stale_registration_before_bootstrap or launchd_restart_falls_back_to_detached_on_error_5 or launchd_restart_drains_running_gateway_before_kickstart or launchd_restart_self_requests_graceful_restart_without_kickstart" Tested: pytest -q tests/hermes_cli/test_gateway_service.py -k launchd Not-tested: Manual macOS launchctl restart after hermes update	2026-06-28 04:17:13 -07:00
Teknium	52a853f5c3	fix(test): pin monotonic clock in spinner-elapsed test to fix CI flake (#54203) test_spinner_elapsed_format_is_fixed_width_to_reduce_wrap_jitter derived _tool_start_time from the live time.monotonic() clock (now - 65.2 / now - 9.2). monotonic()'s epoch is arbitrary — on a host where monotonic() < 65.2 (fresh subprocess on a freshly-booted CI runner) the start time went negative, the (t0 > 0) guard in _render_spinner_text() dropped the '(elapsed)' suffix, and short.split('(',1)[1] raised IndexError: list index out of range. Deterministic given a small clock, so it would keep flaking, not clear on rerun. Pin time.monotonic to a fixed 1000.0 and offset _tool_start_time from it so both the <60s and >=60s paths always render the elapsed suffix regardless of the runner's monotonic epoch. Pre-existing main flake (surfaced in CI test slice 1/8).	2026-06-28 04:16:25 -07:00
Teknium	8e356eccea	docs(readme): trim provider list to a few names plus docs link (#54169) The README line enumerated 11 providers inline, which dilutes the point and goes stale as providers come and go. Replace with Nous Portal, OpenRouter, OpenAI, your own endpoint, and a 'many others' link to the canonical AI Providers docs page that already lists them all.	2026-06-28 04:14:59 -07:00
teknium1	f22b9d3867	docs: add infographic for MCP WS discovery fix (#38945)	2026-06-28 04:14:12 -07:00
Cornna	5c2c85c545	fix(tui): start MCP discovery for websocket sessions The desktop app and dashboard chat reach the agent through the /api/ws JSON-RPC sidecar (tui_gateway.ws.handle_ws), NOT through tui_gateway.entry.main() — the stdio-TUI path that spawns the background MCP discovery thread. In the WS process discovery was therefore never started: _make_agent only waits (wait_for_mcp_discovery), which no-ops when the thread was never created, so the agent snapshotted an MCP-less tool list. The only discovery trigger reachable was a manual /reload-mcp, which is why tools appeared after a reload but vanished on restart. Start the shared, idempotent, config-gated background discovery in handle_ws right after accept() and before gateway.ready, so the first agent build picks up already-spawning servers (and the existing late-binding refresh handles slow ones). Fixes #38945.	2026-06-28 04:14:12 -07:00
teknium1	091ce825fe	test(redact): fix file_read regression-guard for current-main YAML collapse The salvaged #35519 regression guard asserted that default (non-file_read) mode keeps a head/tail `ghp_S1...Pn2T` mask for a `token: <key>` line. On current main the YAML config pass (`_YAML_ASSIGN_RE`, key `token`) re-masks the already-prefix-masked value to `***`, so the assertion was stale. Switch to a bare-token context so the guard isolates what it claims (prefix-mask head/tail shape in default mode) without depending on the YAML collapse.	2026-06-28 04:13:20 -07:00
kshitijk4poor	de928bccde	fix(redact): non-reusable sentinel for prefix secrets in file reads (#35519) When security.redact_secrets is on (default), read_file/search_files/cat applied redact_sensitive_text(code_file=True) to file content, which still ran prefix masking. An API key in config.yaml (ghp_..., sk-..., xai-..., etc.) came back as a head/tail mask like `ghp_S1...Pn2T` — a plausible-looking truncated key. When an agent read that and wrote it back to config, the masked value replaced the real credential, silently breaking auth (401). Production evidence: a config.yaml found containing the exact 13-char masked GitHub PAT. The two community PRs (#35529, #35534) fixed the corruption by NOT redacting prefixes for config reads — but that exposes the user's real keys to the agent context, model, and logs (a security regression). This takes the safer route: keep redacting, but for file content emit a NON-REUSABLE sentinel. - New `_mask_token_nonreusable`: prefix secrets -> `«redacted:ghp_…»` (vendor label preserved for debuggability; zero secret bytes; angle-bracket/ellipsis wrapper is syntactically invalid as a token so it can't be mistaken for or written back as a usable key). - New `redact_sensitive_text(file_read=True)` routes prefix matches through it (implies code_file=True). Default/log/display mode is UNCHANGED — `_mask_token` still keeps head/tail (fine for logs, never written back). - Wired the 3 file_tools.py call sites (read_file / search_files / cat) to file_read=True. Fixes both the corruption AND avoids the secret-exposure of the un-redact approach. 6 new tests (sentinel shape, no-leak, not-a-plausible-key, default mode unchanged, file_read implies code_file, sk- prefix); 88 redact tests pass; mutation-verified (reverting to the old mask fails the sentinel/leak tests). Co-authored-by: liuhao1024 <sunsky.lau@gmail.com> Co-authored-by: adammatski1972 <289282750+adammatski1972@users.noreply.github.com> Closes #35519. Supersedes #35529, #35534.	2026-06-28 04:13:20 -07:00
teknium1	19cbbe304a	docs: add infographic for clarify typed-replies fix	2026-06-28 04:13:19 -07:00
tymrtn	d7f655f370	fix: accept typed clarify choice replies	2026-06-28 04:13:19 -07:00
teknium1	9bb5a809b5	fix(gateway): make zombie check defensive against partial psutil stubs The zombie status probe referenced psutil.Process/NoSuchProcess/Error unconditionally, which raised AttributeError when psutil is a partial stub that only defines pid_exists (as in test_windows_native_support's fallback tests). Guard the probe so any failure to read status degrades to the authoritative pid_exists() instead of raising.	2026-06-28 04:11:14 -07:00
MorAlekss	acca526286	fix(gateway): treat zombie PIDs as dead in _pid_exists to unblock --replace (closes #42126) Under systemd Restart=always, the old gateway becomes a zombie (in the process table, awaiting reap) when the replacement starts. _pid_exists() reported the zombie as alive, so --replace waited on a PID that never dies, then aborted with exit 1 — a silent crash loop. Standalone runs are unaffected because nothing respawns the gateway into a zombie. The live path is psutil.pid_exists(), which returns True for zombies, so the check is added there (Process.status() == STATUS_ZOMBIE -> dead). The psutil-less POSIX fallback also reads /proc/<pid>/stat (state Z) with a ps state= fallback for macOS/BSD, before the os.kill(pid, 0) liveness probe. Diagnosis and the /proc + ps POSIX fallback by MorAlekss (PR #44898); extended to cover the psutil hot path so the fix applies on normal installs. Co-authored-by: MorAlekss <mor.aleksandr@yahoo.com>	2026-06-28 04:11:14 -07:00
teknium1	463225caf1	fix(gateway): bypass legacy-unit prompt in non-TTY systemd install Folds in PR #42124 (kyssta-exe): systemd_install gained a non_interactive flag so the 'Remove the legacy unit(s)?' prompt — the second hidden prompt not guarded by --start-now/--start-on-login — is also skipped in headless contexts. Updates systemd_install test mocks to accept the new kwarg and adds coverage for the legacy-unit-skip path.	2026-06-28 04:09:54 -07:00
liuhao1024	831d443b03	fix(gateway): honor --start-now/--start-on-login flags and support non-TTY headless installs When running `hermes gateway install` on Linux/systemd, the command unconditionally prompts with two `prompt_yes_no` questions, breaking headless installs (SSH, CI, provisioning scripts) and ignoring the existing --start-now / --start-on-login CLI flags that the Windows branch already respects. The fix mirrors the Windows path: read CLI flags first, prompt only when flags are not provided AND stdin is a TTY, and fall back to True defaults for non-TTY contexts. The argparse help strings are promoted from SUPPRESS to visible so users can discover the flags. Fixes #42065	2026-06-28 04:09:54 -07:00
Teknium	5e7bca95d9	fix(tui): coalesce render frames while stdout backpressure is unresolved (#31486) (#54171) When the previous frame's stdout.write has not drained (the outer terminal parser is overwhelmed by a wide CR+LF burst — CJK + ANSI tool output on a high-context session), the renderer kept writing a new frame every tick. That piled writes onto an already-backed-up pipe and kept the macrotask queue hot, starving the stdin 'readable' callback — the observed stdin freeze where the agent loop keeps running but keystrokes/Ctrl-C are dead. onRender now coalesces: while pendingWriteStart is non-null (prior write's drain callback hasn't fired) it skips the frame and retries on the drain tick instead of writing. A MAX_COALESCED_BACKPRESSURE_FRAMES ceiling forces a write through after N skips so a terminal whose drain callback never fires (OSError EIO on flush) self-heals once the pipe recovers rather than wedging forever. TTY-only; piped stdout has no flow control. Coalesce counter resets on every real write. This is the stdout-backpressure strand left open after #54046 fixed the swallowed-exception strand.	2026-06-28 04:00:22 -07:00
Teknium	a06d0198cd	fix(dashboard): reap PTY bridge on child EOF, not only in writer finally (#54190) The /api/pty handler only closed the PtyBridge in the writer loop's finally. On child EOF the reader task closes the WebSocket, but if the handler task is cancelled the instant the socket closes, the writer's finally can be skipped and the PTY fds leak (#54028) — the FD-leak the regression test guards. Under dashboard auto-reconnect this stacks orphaned PTYs until fds are exhausted. Reap the bridge in the reader's EOF finally too (close() is idempotent), so the PTY is reaped independently of the writer-loop cancellation race. Harden the regression test to poll for teardown instead of asserting on the same tick. Was flaky on main (2/20); now 25/25.	2026-06-28 03:58:18 -07:00
Teknium	7968c90318	test(install): track run_with_timeout extraction after #39219 refactor (#54185) PR #39219 split run_browser_install_with_timeout into a thin wrapper that delegates to a new run_with_timeout helper (and parameterized the timeout binary as $timeout_bin for macOS gtimeout support), but did not update tests/test_install_sh_browser_install.py. The behavioral harness extracted only the now-empty wrapper, so the install command never ran (runs==[]), failing all 8 behavioral cases; two text assertions also still expected the old literal 'timeout' invocation. Fix the stale test: extract run_with_timeout alongside the wrapper, and match the $timeout_bin-parameterized GNU-timeout strings. Behavior unchanged.	2026-06-28 03:58:01 -07:00
Christian Persico	135f235165	docs: fix incorrect web search instructions	2026-06-28 02:54:27 -07:00
kshitijk4poor	546193aa6d	fix(install): time-box desktop + node-deps installs so a stalled download self-heals (#39219) The desktop install step ran npm ci / npm run pack with no wall-clock cap, and the sibling browser-tools / TUI / agent-browser dependency installs had the same gap. The Electron binary (~150MB) is fetched from GitHub during the pack; on a throttled or region-blocked link that download can stall rather than fail — npm never errors and never exits, so the installer sits on "Build desktop app" (step 9/11) indefinitely with only harmless 'npm warn deprecated' lines visible. The existing self-heal escalation (cache purge -> dist restore -> npmmirror fallback) only fires when pack returns non-zero, so a stall bypassed it. - run_with_timeout (generalized from run_browser_install_with_timeout): GNU timeout --foreground -k 10 (Ctrl+C-aware, #35166) / gtimeout for external commands, else a pure-shell process-group watchdog so stock macOS (neither binary present) is protected. Shell functions (_desktop_pack) always take the pure-shell path — the timeout binary can't exec a function. Integer-normalized budget + a boundary recheck so a command finishing in the final poll second isn't mislabeled 124. The internal wait is guarded so set -e can't abort mid-function before the real exit code is computed. - Wrap the desktop npm ci/install (sharing ONE budget via a computed deadline so a stall can't cost 2x DESKTOP_BUILD_TIMEOUT) + all three _desktop_pack attempts (DESKTOP_BUILD_TIMEOUT, default 900s), and the browser-tools / TUI / agent-browser registry installs (NODE_DEPS_TIMEOUT, default 600s). A stall now converts to a bounded non-zero exit that feeds the existing mirror self-heal instead of hanging the whole install.	2026-06-28 02:47:47 -07:00
Teknium	c1c179a239	fix(security): redact secrets in background process + foreground env-dump output (#43025) (#54149) * fix(security): redact secrets in background process + foreground env-dump output Terminal-output redaction was incomplete (#43025): - Gap 1: process(action=poll/log/wait) returned background stdout verbatim — no redaction at all. A background printenv/server/test emitting a key leaked raw to the model, session.db, and CLI display. Same for the gateway background-process watcher's completion/progress notifications. - Gap 2: the foreground terminal path hardcoded code_file=True, which skips the ENV-assignment pass, so an opaque token (no vendor prefix) from env/printenv leaked even there. Adds agent.redact.redact_terminal_output(output, command) as the single policy for ALL terminal-output surfaces: env-dump commands (env/printenv/set/export/ declare) get the ENV-assignment pass (code_file=False) to mask opaque tokens; other commands stay on code_file=True to avoid false positives on source dumps. Wired into terminal_tool, process_registry (_handle_process boundary), and the gateway watcher. Respects security.redact_secrets (no force) — opt-out preserved. * docs: add infographic for #43025 terminal-output redaction fix	2026-06-28 02:44:21 -07:00
teknium1	d5ba374c03	fix(telegram): detect wedged getUpdates consumer via pending_update_count The merged CLOSE-WAIT heartbeat (#52744) only probes get_me(), which uses the general request path and stays healthy while PTB's getUpdates consumer is silently wedged (updater.running=True but the long-poll task is stuck, observed on WSL2). DMs then queue in the Bot API and never reach handlers (#42909). Augment the existing _polling_heartbeat_loop to also probe get_webhook_info().pending_update_count. After two consecutive probes that see a non-draining queue while the updater claims to be running, escalate into the existing _handle_polling_network_error recovery ladder — no new restart machinery. No-ops in webhook mode, when the updater is not running, or when a reconnect is already in flight. Credit to @gazzumatteo, whose PR #42959 identified the pending_update_count signal as the missing liveness probe. This reuses the existing heartbeat + recovery path rather than adding a parallel watchdog. Fixes #42909.	2026-06-28 02:44:17 -07:00
teknium1	822b71cbf8	docs: add infographic for #43083 secret-redaction fix	2026-06-28 02:44:06 -07:00
teknium1	bbe1bf4045	fix(agent): stop redacting tool-call args in history; fix auth-header quote-eating Two related redaction bugs from #43083: 1. build_assistant_message redacted tool-call arguments in-memory. That dict feeds both the replayed conversation history and state.db (which is itself replayed verbatim on session resume), so the model read back its own PGPASSWORD='***' psql call and copied the placeholder, breaking every credential-dependent command on the second turn. The masking gave no real protection either — the same secret still leaks through tool OUTPUT. Remove it. Keeping secrets out of the replayable store is a separate tokenization/vault concern (security.redact_secrets still governs storage-time redaction elsewhere). 2. _AUTH_HEADER_RE's greedy \S+ credential class ate a closing quote when the token sat flush against it (Authorization: Bearer sk-.."), turning value corruption into syntax corruption (unterminated quote -> shell EOF / SyntaxError). Exclude " and ' from the token class; real credentials never contain them. Closes #43083.	2026-06-28 02:44:06 -07:00
yoniebans	204a67f0c8	fix(kanban): retry write_txn on transient SQLITE_BUSY	2026-06-28 02:44:04 -07:00
yoniebans	90c1dc0493	test(kanban): cover write_txn BUSY retry (currently failing)	2026-06-28 02:44:04 -07:00
teknium1	9844243b18	fix(gateway): gate quick_commands through slash access policy Config-backed quick_commands bypassed the admin-only slash gate. The early gate in _handle_message only fires for registry-known commands (is_gateway_known_command), but quick_commands are never in the gateway registry, so they reached the type:exec dispatch sink unchecked. An allowlisted non-admin gateway user could invoke admin-only quick commands — including shell exec in the gateway process — even when the operator set allow_admin_from / user_allowed_commands to lock them out. Apply _check_slash_access(source, command) at the quick_commands dispatch site (the single exec chokepoint, cold-path only) using the raw typed name. Admins and users with the command in user_allowed_commands still run it; backward-compat (no policy set) is unaffected. Fixes #44727. Co-authored-by: maxpetrusenko <max.petrusenko.agent@gmail.com> Co-authored-by: zapabob <1920071390@campus.ouj.ac.jp>	2026-06-28 02:43:23 -07:00
Teknium	6d879d486b	fix(dashboard): close PTY WebSocket on child EOF to stop FD leak (#54028) (#54123) * fix(dashboard): close PTY WebSocket on child EOF to stop FD leak The /api/pty handler's reader task returns on child EOF, but the writer loop stayed blocked on ws.receive() until the browser sent a disconnect. When the browser socket is half-open (no FIN delivered — common on macOS/launchd), that disconnect never arrives, so the handler never reaches its finally and the PTY master fd + child process leak. With dashboard auto-reconnect (#52962), every dropped socket then spawns a fresh PTY on top of the orphaned one, exhausting file descriptors within hours (EMFILE / Errno 24). Fix: the reader task now closes the WebSocket in a finally when the child EOFs or the send side breaks, which unblocks ws.receive() so the existing finally runs bridge.close(). The writer loop also guards ws.receive() against the RuntimeError Starlette raises once the socket is closed. Reported by @fifteenzhang. Fixes #54028 * docs: add infographic for #54028 PTY FD leak fix	2026-06-28 02:42:21 -07:00
teknium1	7ef04ae7a7	fix(browser): close eval return-value SSRF bypass (sibling of #44731) The snapshot/vision guards re-check the page URL before returning content, but browser_console(expression=...) -> _browser_eval returns arbitrary JS results directly, leaving two same-class bypasses open: 1. Direct fetch: fetch('http://127.0.0.1/secret').then(r=>r.text()) reads a private endpoint and returns the body — the page URL stays public so the post-eval recheck never sees it. 2. Navigate-then-read: location.href='http://127.0.0.1/' then a later eval reads document.body.innerText. Guard _browser_eval on the same condition as navigate/snapshot/vision (not local backend, not local sidecar, not allow_private_urls): - pre-scan the expression for private/always-blocked URL literals - re-check window.location.href after the eval at both success-return sites (supervisor fast-path + subprocess fallback) Probe failures fail-open (matching the snapshot/vision guards).	2026-06-28 02:42:01 -07:00
liuhao1024	0ae6196087	fix(browser): allow local sidecar sessions to bypass SSRF guard The private-network guard in browser_snapshot() and browser_vision() blocked all private URLs, including those accessed via local sidecar sessions (hybrid routing). Local sidecar sessions intentionally access private URLs — the cloud provider never sees the URL in that case. Add `_is_local_sidecar_key(effective_task_id)` check to both guards, matching the existing pattern in browser_navigate(). Fixes #45101 review feedback from egilewski.	2026-06-28 02:42:01 -07:00
liuhao1024	48f5c42599	fix(browser): extend private-network guard to browser_vision The SSRF bypass in #44731 was only patched for browser_snapshot(), but browser_vision() exposes the same vulnerability — it takes a screenshot and sends it to the vision model without checking if eval-driven navigation moved the page to a private/internal URL. Add the same current-page URL safety check to browser_vision() before any screenshot is captured, encoded, or forwarded to the vision model. This covers both the normal screenshot path and the Lightpanda Chrome fallback path. 7 new tests: blocks private URL, allows public URL, skips in local backend, skips when private URLs allowed, handles eval failure/empty/exception.	2026-06-28 02:42:01 -07:00
liuhao1024	7a6fe9bbfa	fix(browser): block snapshot from eval-navigated private pages browser_snapshot() now checks the current page URL before returning content. When browser_console() changes location.href to a private or internal address (e.g., http://127.0.0.1:8080/), the snapshot returns an error instead of exposing the private page content. This closes the SSRF bypass where an attacker could: 1. Navigate to a public page 2. Use browser_console to eval location.href = 'http://127.0.0.1:port/' 3. Use browser_snapshot to read the private page content The fix reuses the existing _is_safe_url() and _allow_private_urls() infrastructure, and fails open if the URL check itself fails. Fixes #44731	2026-06-28 02:42:01 -07:00
Teknium	7c0a5def58	fix(memory/holographic): close DB connection on shutdown instead of leaking to GC (#54133) HolographicMemoryProvider.shutdown() dropped its MemoryStore reference without calling the existing MemoryStore.close(). Since the connection is opened check_same_thread=False (one per session), its fd was released by refcount/GC at a non-deterministic time on a non-deterministic thread, churning a DB fd through the kernel free pool on every session teardown. Call close() so the fd is released deterministically. Reported by @alfranli123 (#44037), who pinpointed the exact code location. Note: the report's TLS-fd-recycle corruption attribution could not be reproduced from the code — dropping a sqlite connection flushes valid SQLite pages via the VFS, never TLS framing, and the provider is at most a releaser of DB fds, not a TLS-flushing socket owner. This change is correct resource hygiene that removes per-session fd churn regardless.	2026-06-28 02:41:52 -07:00
Teknium	00d8c2c915	fix(gateway): prune stale sessions.json entries on startup A hard gateway crash (exit code 1) skips the graceful shutdown path, so sessions.json is never cleared and is left pointing at sessions already ended in state.db. On the next startup get_or_create_session() reuses those stale entries as long as the time/policy reset checks pass — it never consults end_reason — so every incoming message is silently routed into a closed session, with no log or error (#52804). SessionStore._ensure_loaded_locked() now calls a new _prune_stale_sessions_locked() that drops any entry whose session_id has end_reason IS NOT NULL in state.db. Idempotent, _db=None / legacy-absent safe, DB errors non-fatal, sessions.json rewritten only when something was pruned. Self-heals into a fresh session on the next message. Reported and diagnosed by @terry197913 (#52808).	2026-06-28 02:41:47 -07:00
teknium1	c38dfba3a7	docs: add infographic for #53175 gateway cleanup off-loop fix	2026-06-28 02:41:36 -07:00
teknium1	ea5aaa7a22	fix(gateway): offload remaining inline agent cleanup off the event loop (#53175) #35994 moved /new reset cleanup off the loop, but _cleanup_agent_resources (agent.close() subprocess teardown; shutdown_memory_provider() plugin IO) was still called INLINE on the event loop from three other sites: - _session_expiry_watcher (5-min idle sweep) — live loop - _handle_message_with_agent cache-hygiene re-eviction — live loop - _finalize_shutdown_agents / stop() idle-cache loop — shutdown A wedged memory provider on any of these froze the loop: bot goes silent, runtime-status updated_at heartbeat stops advancing, and SIGTERM can't be serviced (requires kill -9) — exactly the #53175 zombie pattern. Adds _cleanup_agent_resources_off_loop: a bounded (30s) worker-thread offload mirroring the #35994 reset fix, and routes all four sites through it.	2026-06-28 02:41:36 -07:00
teknium1	aa50c1ba5d	fix(prompt): repair backend probe import (get_environment never existed) The system-prompt backend probe imported a nonexistent symbol — `from tools.environments import get_environment` — which always raised ImportError: cannot import name 'get_environment'. The exception is caught and only drops the live backend description to a static fallback, so it is cosmetic, but it broke the live OS/user/cwd probe for every non-local backend (docker/singularity/modal/daytona/ssh). The real factory is `_create_environment` in tools.terminal_tool. Build the environment the same way the live terminal path does (select backend image, assemble ssh/container config from _get_env_config()), then run the probe. Note: this does NOT affect tool loading — tool selection runs each tool's check_fn and never consults this probe. Regression from #52147 (2026-06-25). Closes #53667 (probe import); the 'cronjob-only' tool-collapse symptom is not reproducible — tool selection has no probe dependency and memory's check_fn is unconditionally True.	2026-06-28 02:41:31 -07:00
Teknium	b508d4296e	test(ci): raise per-file timeout 140s → 300s to stop false timeouts (#54143) * test(ci): raise per-file timeout 140s to 300s to stop false timeouts The per-file parallel runner caps each test-file subprocess at a flat wall-clock budget. Combined with per-test subprocess isolation (a fresh Python process per test), a large-collection file pays N x (interpreter startup + import) of overhead before any test logic runs. That overhead dilates under load on shared CI runners, so a file that finishes in ~100s on a quiet box can blow the old 140s cap purely from scheduling jitter, surfacing as a false 'no tests ran' timeout (rc=124) with zero actual test failures. Raise the default to 300s (5 min). The Docker build matrix jobs already take 7-10 min, so this headroom costs nothing on total CI wall time while still bounding a genuinely hung file. * docs: add infographic for CI per-file timeout bump	2026-06-28 02:41:07 -07:00
teknium1	dcc6cd1b42	docs: add infographic for #52378 Windows update-loop salvage	2026-06-28 02:40:37 -07:00
teknium1	fe89ce0694	chore(release): map Cossackx in AUTHOR_MAP for #52528 salvage	2026-06-28 02:40:37 -07:00
Cossackx	ba37c910e0	fix(desktop/windows): resolve real hermes over extensionless shim + prefer --update on recovery Two Windows-only desktop boot bugs that caused spurious reinstall/repair loops: 1. findOnPath() searched the empty extension BEFORE PATHEXT, so an extensionless Git-Bash `hermes` shim shadowed the real hermes.cmd/.exe. The shim then failed the shell:false --version probe and the resolver fell through to bootstrap/repair even though a working CLI was on PATH. Fix: try PATHEXT extensions first, keep the empty entry LAST so callers that already include the extension (py.exe, pwsh.exe) still resolve. 2. handOffWindowsBootstrapRecovery() chose the destructive --repair over the gentle --update by checking only venv\Scripts\hermes.exe -- the setuptools console-script shim, written at the END of venv setup and absent in interrupted/quarantined states. Fix: take --update when ANY real-install signal is present (venv python, the shim, or .hermes-bootstrap-complete). Adds windows-hermes-resolution.test.cjs (source-assertion pattern, wired into test:desktop:platforms) guarding both regressions. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 02:40:37 -07:00
Cornna	0229246ab8	fix(desktop): probe venv runtime health before trusting bootstrap marker A broken/empty Windows launcher venv can see the source tree via PYTHONPATH but lack PyYAML, so 'import hermes_cli' succeeds while the first real CLI import dies — the desktop then trusts the bootstrap marker, spawns a dead backend, and loops on 'gateway offline' (#52378). - backend-probes.cjs: canImportHermesCli now runs 'import yaml; import hermes_cli.config' (extracted as hermesRuntimeImportProbe) and accepts an env override, so a dependency regression is caught without a real broken venv fixture. - main.cjs: isBootstrapComplete() routes through new isActiveRuntimeUsable(), which requires the venv python to pass the runtime import probe (with ACTIVE_HERMES_ROOT on PYTHONPATH) — not just exist on disk. Salvaged from PR #38179. The PR's install.ps1 reset/clean + autocrlf changes and their tests are dropped: current main already preserves dirty checkouts via stash (the data-loss-safe #38542 path) rather than the PR's older reset-based Repair-ManagedCheckoutBeforeUpdate approach.	2026-06-28 02:40:37 -07:00
teknium1	7c9cdad9fd	test(cli): cover Windows self-lock recovery guard + cmd-quote its hint Add two tests for the self-lock guard in _recover_from_interrupted_install: one asserting it clears the marker and skips install when hermes.exe is a process ancestor (breaking the #52378/#45542 loop), one asserting it falls through to a normal recovery install when the shim is NOT an ancestor. The guard's manual-recovery hint runs only inside the Windows branch, so quote it for cmd.exe (cd /d, double-quoted paths) — the cross-platform fallback hint at the end of the function is left POSIX-correct. Map Icather in scripts/release.py AUTHOR_MAP for the salvage.	2026-06-28 02:40:37 -07:00
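
The startup prune described in commit 00d8c2c915 can be illustrated with a minimal sketch. It is not the repository's code: the function name `prune_stale_sessions`, the `sessions` table layout, and the shape of the in-memory entries are all assumptions made for illustration. It only mirrors the stated properties: entries whose session has a non-NULL `end_reason` are dropped, a missing DB handle is tolerated, DB errors are non-fatal, and the caller is told whether anything changed so it can skip rewriting the file.

```python
import sqlite3


def prune_stale_sessions(entries, db):
    """Return (kept_entries, changed) for a sessions.json-like mapping.

    `entries` maps a routing key to a dict holding a "session_id".
    `db` is a sqlite3 connection or None. The table and column names
    used here are hypothetical, chosen only for this sketch.
    """
    if db is None or not entries:
        # No database (or nothing to check): keep everything untouched.
        return entries, False
    try:
        rows = db.execute(
            "SELECT id FROM sessions WHERE end_reason IS NOT NULL"
        ).fetchall()
    except sqlite3.Error:
        # A DB error must never block startup: keep everything untouched.
        return entries, False
    ended = {row[0] for row in rows}
    kept = {
        key: value
        for key, value in entries.items()
        if value.get("session_id") not in ended
    }
    # `changed` lets the caller rewrite the file only when something was pruned.
    return kept, len(kept) != len(entries)
```

Running the function a second time on its own output reports no change, which is the idempotence the commit message calls out.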


13366 commits
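
Commit ea5aaa7a22 describes moving blocking agent cleanup onto a worker thread with a 30s bound so a wedged memory provider cannot freeze the event loop. A rough sketch of that pattern follows, using only standard `asyncio` calls. The helper name `cleanup_off_loop` and its return convention are hypothetical; the real `_cleanup_agent_resources_off_loop` may differ in every detail.

```python
import asyncio
import time


async def cleanup_off_loop(cleanup, *args, timeout=30.0):
    """Run a blocking cleanup callable on a worker thread with a time bound.

    Returns True if the cleanup finished in time, False if it timed out
    or raised. On timeout the worker thread is NOT killed; the loop merely
    stops waiting for it, so the event loop stays responsive.
    """
    try:
        await asyncio.wait_for(asyncio.to_thread(cleanup, *args), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # Cleanup is wedged: give up waiting instead of blocking the loop.
        return False
    except Exception:
        # Cleanup failures are treated as non-fatal in this sketch.
        return False
```

A caller would `await cleanup_off_loop(some_blocking_cleanup, agent)` from the idle sweep or shutdown path. Python threads cannot be forcibly stopped, so the bound limits how long the loop waits, not how long the stuck thread lives.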