hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-01 12:02:05 +00:00

Author	SHA1	Message	Date
teknium1	822b71cbf8	docs: add infographic for #43083 secret-redaction fix	2026-06-28 02:44:06 -07:00
teknium1	bbe1bf4045	fix(agent): stop redacting tool-call args in history; fix auth-header quote-eating Two related redaction bugs from #43083: 1. build_assistant_message redacted tool-call arguments in-memory. That dict feeds both the replayed conversation history and state.db (which is itself replayed verbatim on session resume), so the model read back its own PGPASSWORD='***' psql call and copied the placeholder, breaking every credential-dependent command on the second turn. The masking gave no real protection either — the same secret still leaks through tool OUTPUT. Remove it. Keeping secrets out of the replayable store is a separate tokenization/vault concern (security.redact_secrets still governs storage-time redaction elsewhere). 2. _AUTH_HEADER_RE's greedy \S+ credential class ate a closing quote when the token sat flush against it (Authorization: Bearer sk-.."), turning value corruption into syntax corruption (unterminated quote -> shell EOF / SyntaxError). Exclude " and ' from the token class; real credentials never contain them. Closes #43083.	2026-06-28 02:44:06 -07:00
yoniebans	204a67f0c8	fix(kanban): retry write_txn on transient SQLITE_BUSY	2026-06-28 02:44:04 -07:00
yoniebans	90c1dc0493	test(kanban): cover write_txn BUSY retry (currently failing)	2026-06-28 02:44:04 -07:00
teknium1	9844243b18	fix(gateway): gate quick_commands through slash access policy Config-backed quick_commands bypassed the admin-only slash gate. The early gate in _handle_message only fires for registry-known commands (is_gateway_known_command), but quick_commands are never in the gateway registry, so they reached the type:exec dispatch sink unchecked. An allowlisted non-admin gateway user could invoke admin-only quick commands — including shell exec in the gateway process — even when the operator set allow_admin_from / user_allowed_commands to lock them out. Apply _check_slash_access(source, command) at the quick_commands dispatch site (the single exec chokepoint, cold-path only) using the raw typed name. Admins and users with the command in user_allowed_commands still run it; backward-compat (no policy set) is unaffected. Fixes #44727. Co-authored-by: maxpetrusenko <max.petrusenko.agent@gmail.com> Co-authored-by: zapabob <1920071390@campus.ouj.ac.jp>	2026-06-28 02:43:23 -07:00
Teknium	6d879d486b	fix(dashboard): close PTY WebSocket on child EOF to stop FD leak (#54028 ) (#54123 ) * fix(dashboard): close PTY WebSocket on child EOF to stop FD leak The /api/pty handler's reader task returns on child EOF, but the writer loop stayed blocked on ws.receive() until the browser sent a disconnect. When the browser socket is half-open (no FIN delivered — common on macOS/launchd), that disconnect never arrives, so the handler never reaches its finally and the PTY master fd + child process leak. With dashboard auto-reconnect (#52962), every dropped socket then spawns a fresh PTY on top of the orphaned one, exhausting file descriptors within hours (EMFILE / Errno 24). Fix: the reader task now closes the WebSocket in a finally when the child EOFs or the send side breaks, which unblocks ws.receive() so the existing finally runs bridge.close(). The writer loop also guards ws.receive() against the RuntimeError Starlette raises once the socket is closed. Reported by @fifteenzhang. Fixes #54028 * docs: add infographic for #54028 PTY FD leak fix	2026-06-28 02:42:21 -07:00
teknium1	7ef04ae7a7	fix(browser): close eval return-value SSRF bypass (sibling of #44731 ) The snapshot/vision guards re-check the page URL before returning content, but browser_console(expression=...) -> _browser_eval returns arbitrary JS results directly, leaving two same-class bypasses open: 1. Direct fetch: fetch('http://127.0.0.1/secret').then(r=>r.text()) reads a private endpoint and returns the body — the page URL stays public so the post-eval recheck never sees it. 2. Navigate-then-read: location.href='http://127.0.0.1/' then a later eval reads document.body.innerText. Guard _browser_eval on the same condition as navigate/snapshot/vision (not local backend, not local sidecar, not allow_private_urls): - pre-scan the expression for private/always-blocked URL literals - re-check window.location.href after the eval at both success-return sites (supervisor fast-path + subprocess fallback) Probe failures fail-open (matching the snapshot/vision guards).	2026-06-28 02:42:01 -07:00
liuhao1024	0ae6196087	fix(browser): allow local sidecar sessions to bypass SSRF guard The private-network guard in browser_snapshot() and browser_vision() blocked all private URLs, including those accessed via local sidecar sessions (hybrid routing). Local sidecar sessions intentionally access private URLs — the cloud provider never sees the URL in that case. Add `_is_local_sidecar_key(effective_task_id)` check to both guards, matching the existing pattern in browser_navigate(). Fixes #45101 review feedback from egilewski.	2026-06-28 02:42:01 -07:00
liuhao1024	48f5c42599	fix(browser): extend private-network guard to browser_vision The SSRF bypass in #44731 was only patched for browser_snapshot(), but browser_vision() exposes the same vulnerability — it takes a screenshot and sends it to the vision model without checking if eval-driven navigation moved the page to a private/internal URL. Add the same current-page URL safety check to browser_vision() before any screenshot is captured, encoded, or forwarded to the vision model. This covers both the normal screenshot path and the Lightpanda Chrome fallback path. 7 new tests: blocks private URL, allows public URL, skips in local backend, skips when private URLs allowed, handles eval failure/empty/exception.	2026-06-28 02:42:01 -07:00
liuhao1024	7a6fe9bbfa	fix(browser): block snapshot from eval-navigated private pages browser_snapshot() now checks the current page URL before returning content. When browser_console() changes location.href to a private or internal address (e.g., http://127.0.0.1:8080/), the snapshot returns an error instead of exposing the private page content. This closes the SSRF bypass where an attacker could: 1. Navigate to a public page 2. Use browser_console to eval location.href = 'http://127.0.0.1:port/' 3. Use browser_snapshot to read the private page content The fix reuses the existing _is_safe_url() and _allow_private_urls() infrastructure, and fails open if the URL check itself fails. Fixes #44731	2026-06-28 02:42:01 -07:00
Teknium	7c0a5def58	fix(memory/holographic): close DB connection on shutdown instead of leaking to GC (#54133 ) HolographicMemoryProvider.shutdown() dropped its MemoryStore reference without calling the existing MemoryStore.close(). Since the connection is opened check_same_thread=False (one per session), its fd was released by refcount/GC at a non-deterministic time on a non-deterministic thread, churning a DB fd through the kernel free pool on every session teardown. Call close() so the fd is released deterministically. Reported by @alfranli123 (#44037), who pinpointed the exact code location. Note: the report's TLS-fd-recycle corruption attribution could not be reproduced from the code — dropping a sqlite connection flushes valid SQLite pages via the VFS, never TLS framing, and the provider is at most a releaser of DB fds, not a TLS-flushing socket owner. This change is correct resource hygiene that removes per-session fd churn regardless.	2026-06-28 02:41:52 -07:00
Teknium	00d8c2c915	fix(gateway): prune stale sessions.json entries on startup A hard gateway crash (exit code 1) skips the graceful shutdown path, so sessions.json is never cleared and is left pointing at sessions already ended in state.db. On the next startup get_or_create_session() reuses those stale entries as long as the time/policy reset checks pass — it never consults end_reason — so every incoming message is silently routed into a closed session, with no log or error (#52804). SessionStore._ensure_loaded_locked() now calls a new _prune_stale_sessions_locked() that drops any entry whose session_id has end_reason IS NOT NULL in state.db. Idempotent, _db=None / legacy-absent safe, DB errors non-fatal, sessions.json rewritten only when something was pruned. Self-heals into a fresh session on the next message. Reported and diagnosed by @terry197913 (#52808).	2026-06-28 02:41:47 -07:00
teknium1	c38dfba3a7	docs: add infographic for #53175 gateway cleanup off-loop fix	2026-06-28 02:41:36 -07:00
teknium1	ea5aaa7a22	fix(gateway): offload remaining inline agent cleanup off the event loop (#53175 ) #35994 moved /new reset cleanup off the loop, but _cleanup_agent_resources (agent.close() subprocess teardown; shutdown_memory_provider() plugin IO) was still called INLINE on the event loop from three other sites: - _session_expiry_watcher (5-min idle sweep) — live loop - _handle_message_with_agent cache-hygiene re-eviction — live loop - _finalize_shutdown_agents / stop() idle-cache loop — shutdown A wedged memory provider on any of these froze the loop: bot goes silent, runtime-status updated_at heartbeat stops advancing, and SIGTERM can't be serviced (requires kill -9) — exactly the #53175 zombie pattern. Adds _cleanup_agent_resources_off_loop: a bounded (30s) worker-thread offload mirroring the #35994 reset fix, and routes all four sites through it.	2026-06-28 02:41:36 -07:00
teknium1	aa50c1ba5d	fix(prompt): repair backend probe import (get_environment never existed) The system-prompt backend probe imported a nonexistent symbol — `from tools.environments import get_environment` — which always raised ImportError: cannot import name 'get_environment'. The exception is caught and only drops the live backend description to a static fallback, so it is cosmetic, but it broke the live OS/user/cwd probe for every non-local backend (docker/singularity/modal/daytona/ssh). The real factory is `_create_environment` in tools.terminal_tool. Build the environment the same way the live terminal path does (select backend image, assemble ssh/container config from _get_env_config()), then run the probe. Note: this does NOT affect tool loading — tool selection runs each tool's check_fn and never consults this probe. Regression from #52147 (2026-06-25). Closes #53667 (probe import); the 'cronjob-only' tool-collapse symptom is not reproducible — tool selection has no probe dependency and memory's check_fn is unconditionally True.	2026-06-28 02:41:31 -07:00
Teknium	b508d4296e	test(ci): raise per-file timeout 140s → 300s to stop false timeouts (#54143 ) * test(ci): raise per-file timeout 140s to 300s to stop false timeouts The per-file parallel runner caps each test-file subprocess at a flat wall-clock budget. Combined with per-test subprocess isolation (a fresh Python process per test), a large-collection file pays N x (interpreter startup + import) of overhead before any test logic runs. That overhead dilates under load on shared CI runners, so a file that finishes in ~100s on a quiet box can blow the old 140s cap purely from scheduling jitter, surfacing as a false 'no tests ran' timeout (rc=124) with zero actual test failures. Raise the default to 300s (5 min). The Docker build matrix jobs already take 7-10 min, so this headroom costs nothing on total CI wall time while still bounding a genuinely hung file. * docs: add infographic for CI per-file timeout bump	2026-06-28 02:41:07 -07:00
teknium1	dcc6cd1b42	docs: add infographic for #52378 Windows update-loop salvage	2026-06-28 02:40:37 -07:00
teknium1	fe89ce0694	chore(release): map Cossackx in AUTHOR_MAP for #52528 salvage	2026-06-28 02:40:37 -07:00
Cossackx	ba37c910e0	fix(desktop/windows): resolve real hermes over extensionless shim + prefer --update on recovery Two Windows-only desktop boot bugs that caused spurious reinstall/repair loops: 1. findOnPath() searched the empty extension BEFORE PATHEXT, so an extensionless Git-Bash `hermes` shim shadowed the real hermes.cmd/.exe. The shim then failed the shell:false --version probe and the resolver fell through to bootstrap/repair even though a working CLI was on PATH. Fix: try PATHEXT extensions first, keep the empty entry LAST so callers that already include the extension (py.exe, pwsh.exe) still resolve. 2. handOffWindowsBootstrapRecovery() chose the destructive --repair over the gentle --update by checking only venv\Scripts\hermes.exe -- the setuptools console-script shim, written at the END of venv setup and absent in interrupted/quarantined states. Fix: take --update when ANY real-install signal is present (venv python, the shim, or .hermes-bootstrap-complete). Adds windows-hermes-resolution.test.cjs (source-assertion pattern, wired into test:desktop:platforms) guarding both regressions. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 02:40:37 -07:00
Cornna	0229246ab8	fix(desktop): probe venv runtime health before trusting bootstrap marker A broken/empty Windows launcher venv can see the source tree via PYTHONPATH but lack PyYAML, so 'import hermes_cli' succeeds while the first real CLI import dies — the desktop then trusts the bootstrap marker, spawns a dead backend, and loops on 'gateway offline' (#52378). - backend-probes.cjs: canImportHermesCli now runs 'import yaml; import hermes_cli.config' (extracted as hermesRuntimeImportProbe) and accepts an env override, so a dependency regression is caught without a real broken venv fixture. - main.cjs: isBootstrapComplete() routes through new isActiveRuntimeUsable(), which requires the venv python to pass the runtime import probe (with ACTIVE_HERMES_ROOT on PYTHONPATH) — not just exist on disk. Salvaged from PR #38179. The PR's install.ps1 reset/clean + autocrlf changes and their tests are dropped: current main already preserves dirty checkouts via stash (the data-loss-safe #38542 path) rather than the PR's older reset-based Repair-ManagedCheckoutBeforeUpdate approach.	2026-06-28 02:40:37 -07:00
teknium1	7c9cdad9fd	test(cli): cover Windows self-lock recovery guard + cmd-quote its hint Add two tests for the self-lock guard in _recover_from_interrupted_install: one asserting it clears the marker and skips install when hermes.exe is a process ancestor (breaking the #52378/#45542 loop), one asserting it falls through to a normal recovery install when the shim is NOT an ancestor. The guard's manual-recovery hint runs only inside the Windows branch, so quote it for cmd.exe (cd /d, double-quoted paths) — the cross-platform fallback hint at the end of the function is left POSIX-correct. Map Icather in scripts/release.py AUTHOR_MAP for the salvage.	2026-06-28 02:40:37 -07:00
灵越羽毛	b6f592dbdc	fix(cli): detect self-lock in update recovery to break infinite retry loop on Windows	2026-06-28 02:40:37 -07:00
liuhao1024	14baeefe1d	fix(matrix): record DM rooms in m.direct on invite to prevent group misclassification Rebase onto plugins/platforms/matrix/adapter.py (code moved from gateway/platforms/matrix.py). Same logic: _on_invite checks is_direct on invite events and calls _record_dm_room to persist in m.direct account data. Fixes #44679	2026-06-28 02:37:52 -07:00
Teknium	fde1c8570f	fix(tui_gateway): suppress WS peer-hangup teardown error flood (#50005 ) (#54126 ) When the Desktop forcibly closes its WebSocket mid-write, asyncio logs a full traceback for every pending connection-lost callback — 50+ identical WinError 10054 (ConnectionResetError) lines per disconnect on Windows, the equivalent ConnectionResetError/BrokenPipeError on POSIX. These are not actionable: they are the expected side effect of the peer hanging up before our writes drained. Install a loop exception handler on the gateway serving loop that collapses exactly this teardown class (ConnectionResetError/ConnectionAbortedError/ BrokenPipeError originating from _call_connection_lost) to a single debug line, forwarding every other loop error to the existing/default handler unchanged so genuine loop bugs still surface. Idempotent per loop.	2026-06-28 02:35:01 -07:00
teknium1	6eec0d4f08	docs: add infographic for #53107 gateway force-exit fix	2026-06-28 02:34:23 -07:00
LeonSGP43	9f0e64cedd	fix(gateway): force exit after graceful shutdown Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-06-28 02:34:23 -07:00
teknium1	dddaea0c98	chore(release): map yungchentang author for #53622 salvage	2026-06-28 02:34:17 -07:00
yungchentang	7e2ca7f68d	fix(telegram): reset send pool after pool timeouts	2026-06-28 02:34:17 -07:00
kshitij	f3d8f20a59	Merge pull request #54116 from kshitijk4poor/fix/36658-gateway-drain-microtask fix(tui): defer buffered gateway events to stop dashboard chat #301 (#36658)	2026-06-28 14:50:07 +05:30
Teknium	f646b82ff0	docs: add infographic for #38249 atomic env-snapshot fix	2026-06-28 02:08:57 -07:00
Teknium	9f17f16c66	fix(environments): use $BASHPID for atomic snapshot temp + harden failure path The atomic mv approach (kyssta-exe's commit) narrows but does not close the #38249 race: the temp name used $$ (parent shell PID), which is identical across &-launched concurrent subshells. Two concurrent writers pick the same temp file, clobber each other mid-write, and mv then publishes a torn snapshot — a reader sourcing it absorbs declare-x/export fragments into PATH. - Use $BASHPID (actual per-subshell PID) so concurrent writers never collide. - Chain mv on export success (&&) and rm the temp on failure so a partial dump never replaces a good snapshot; apply the same to the init_session bootstrap. - shlex-quote the static temp-path portion (Windows/spaces), $BASHPID outside. - LocalEnvironment.cleanup sweeps orphaned snap.tmp.* temps. - Regression tests: string-shape + a behavioral concurrent writers/readers test that proves the snapshot never tears (would still tear with $$).	2026-06-28 02:08:57 -07:00
kyssta-exe	6a2958a521	fix(environments): use atomic file replacement for snapshot writes Fix race condition in terminal environment snapshots that could corrupt PATH with declare -x entries. When concurrent terminal calls share the same snapshot file, the non-atomic 'export -p > snapshot.sh' write could be read mid-write by another process, causing partial/corrupted env vars to be sourced and mixed into PATH. The fix uses atomic file replacement: - Write to a temp file: export -p > snapshot.sh.tmp.303651 - Atomically replace: mv -f snapshot.sh.tmp.303651 snapshot.sh On POSIX, mv within the same filesystem is atomic, so source() will either see the old complete snapshot or the new complete one, never a partial/truncated file. Fixes #38249	2026-06-28 02:08:57 -07:00
teknium1	c23f394eb8	fix: satisfy ruff encoding + windows-footgun lints for cgroup reaper - read_text(encoding='utf-8') (PLW1514) - # windows-footgun: ok on signal.SIGKILL — module is Linux-only (reads /proc, /sys/fs/cgroup; runs from a systemd unit) - test lambda accepts the new encoding kwarg	2026-06-28 02:05:50 -07:00
teknium1	86ec979f66	chore(release): map PRATHAMESH75 author for #37550 salvage	2026-06-28 02:05:50 -07:00
PRATHAMESH75	e551da6ddb	fix(gateway): reap cgroup orphans via ExecStopPost to unblock restart Long-lived helpers spawned indirectly by tool calls (adb, platform bridges) were left in the service cgroup after the gateway's main process exited. When the kernel rejected the deferred cgroup-wide kill with EINVAL, systemd blocked Restart=always for 6+ minutes, taking down all platforms and cron windows (#37454). Add a small ExecStopPost helper (gateway.cgroup_cleanup) that walks cgroup.procs and sends per-PID SIGKILLs — a different kernel code path than cgroup.kill, so it succeeds where the cgroup-wide write failed. KillMode=mixed is preserved so the gateway still reaps its own tool-call children before systemd intervenes (#8202). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-28 02:05:50 -07:00
Coy Geek	d7a1052424	fix(env-passthrough): fail closed when provider blocklist import fails When tools.environments.local can't be imported (partial install, import-time error), _is_hermes_provider_credential() returned False — fail-open. A skill could then register a Hermes provider credential (ANTHROPIC_API_KEY, etc.) as env passthrough; _scrub_child_env lets passthrough vars bypass the secret-substring net (rule 1), so the operator's real key would land in the execute_code child. Reopens the GHSA-rhgp-j443-p4rf bypass. Fail closed instead: on import failure, treat the name as a protected provider credential and refuse passthrough. Regression test exercises the full register -> scrub path under a simulated import failure. Co-authored-by: Hermes Agent <noreply@nousresearch.com>	2026-06-28 02:05:43 -07:00
teknium1	58c36b1798	fix(api-server): widen error redaction to cron-endpoint + SSE sites Follow-up to the salvaged #37733 fix. The contributor centralized redaction at _openai_error and the chat/responses failure paths, which covers the OpenAI-compatible envelopes transitively. Two sibling classes crossed the same authenticated HTTP boundary unredacted: - 8x cron-management endpoints returning {"error": str(e)} on 500 - the session-chat SSE error event ({"message": str(exc)}) Route both through the same _redact_api_error_text(force=True) helper. Add AUTHOR_MAP entry for coygeek and a TestRedactApiErrorText guard covering mask/force/limit/passthrough behavior.	2026-06-28 02:05:38 -07:00
Coy Geek	5e774de76e	fix(api-server): redact provider errors at HTTP boundary Force API-server error text through the existing secret redactor before returning OpenAI-compatible errors, response fallback text, response snapshots, and run failure events. This prevents credential-shaped provider failure text from crossing the API-server boundary while preserving debuggable sanitized messages.	2026-06-28 02:05:38 -07:00
HexLab98	d2fda5925d	test(gateway): cover Discord/Slack compression status suppression (#39293 )	2026-06-28 14:35:32 +05:30
HexLab98	d2ea948bc0	fix(gateway): suppress compression status noise on Discord and other chats (#39293 ) Extend the gateway noisy-status filter beyond Telegram so internal compression lifecycle messages stay in logs instead of spamming Discord, Slack, and other messaging channels.	2026-06-28 14:35:32 +05:30
teknium1	9f7d520caf	docs: add infographic for #36664 WhatsApp LID session-path fix	2026-06-28 02:05:26 -07:00
teknium1	3aaa98dd01	test(whatsapp): cover LID allowlist match on modern session layout Add an _is_user_authorized E2E for the platforms/whatsapp/session layout on top of fesalfayed's resolver fix (#36665) — guards the actual silently-dropped-LID-sender path from #36664.	2026-06-28 02:05:26 -07:00
fesalfayed	263ffec1b0	fix(whatsapp): resolve LID aliases on modern platforms/ session layout expand_whatsapp_aliases hardcoded get_hermes_home()/whatsapp/session, but the adapter writes lid-mapping files via get_hermes_dir("platforms/whatsapp/ session", "whatsapp/session"). On installs without the legacy directory the two paths diverge, so the resolver finds no mappings and returns the bare LID, which misses the allowlist and silently drops the message. Resolve through the same helper so both sides stay in lockstep on new and legacy layouts.	2026-06-28 02:05:26 -07:00
teknium1	d0f087e7f9	docs: add infographic for #36109 empty-400 diagnostics	2026-06-28 02:05:20 -07:00
xxxigm	093f567f0d	fix(agent,cli): surface empty-body API errors and fail oneshot exit code When an LLM API call returns HTTP 4xx with an empty parsed SDK `body` ({}), `_summarize_api_error` fell through to a bare `str(error)`, so users saw only "HTTP 400" with no provider detail (reported on Windows in #36109). The SDK leaves `body` empty in this case, but the httpx `response` still carries the payload in `.text`. - run_agent.py `_summarize_api_error`: when `body` is empty, fall back to `response.text` — parse a JSON `error.message`/`message` when present, else surface the raw (truncated) body. Platform-agnostic diagnostics. - hermes_cli/oneshot.py: `hermes -z` now runs via `run_conversation` and returns exit code 2 when the run is failed/partial with no usable final response, so scripts can detect LLM failures (still 0 when a response — incl. an error summary as output — is produced). Tests: new tests/run_agent/test_summarize_api_error.py (empty-body JSON + raw text, RED/GREEN verified) + oneshot exit-code/`run_conversation` wiring tests. NOTE: #36109's original root cause (Windows "all providers return empty 400") is not reproducible on current main (heavy provider-transport churn since v0.15.1). This change does not claim to fix that root cause — it makes any empty-body API error LEGIBLE so a future occurrence shows the real provider message instead of a bare HTTP 400. Relates to #36109 (does not close it).	2026-06-28 02:05:20 -07:00
teknium1	c0b4a3438a	fix(install): scope Playwright override to too-new apt releases + keep step interruptible Follow-up on #54032 for #35166: - Gate the PLAYWRIGHT_HOST_PLATFORM_OVERRIDE retry on the host being an apt release newer than Playwright recognizes (Ubuntu >24.04 / Debian >13) via playwright_host_unrecognized(), instead of retrying on ANY install failure. A network/disk/permission failure on a supported host now surfaces unchanged rather than getting a mismatched-glibc build forced onto it. - detect_os() now captures DISTRO_VERSION from os-release. - Fold in the interruptibility fix (was PR #35304, self-closed): wrap the download in 'timeout --foreground -k 10' (probed, with plain-timeout fallback) so a terminal Ctrl+C reaches the child and a wedged download is force-killed after the deadline. - Add behavioral tests that source the helpers and assert the retry fires only on Ubuntu 26.04 / Debian 14, not on supported hosts, non-apt distros, native-success, operator-pinned override, or unsupported arch.	2026-06-28 02:05:18 -07:00
kshitijk4poor	a28fe788a6	fix(install): retry Playwright install with platform override on unrecognized host (#35166 ) On apt releases newer than the bundled Playwright recognizes (Ubuntu 26.04, Debian 14, and future distros), 'npx playwright install --with-deps chromium' hangs uninterruptibly at 'Installing Playwright Chromium with system dependencies' because Playwright's resolver maps the host to a platform with no download build (#35166). Wrap every installer Playwright call in run_playwright_install(), which tries the native install first and, only if it fails or times out, retries once with PLAYWRIGHT_HOST_PLATFORM_OVERRIDE pinned to the newest known build (ubuntu24.04-<arch>). This is the escape hatch Playwright's maintainers bless for unrecognized platforms (microsoft/playwright#33434). Try-native-first (not a hardcoded distro/version table) is deliberate: - Self-correcting — when Playwright already supports the host (e.g. Ubuntu 26.04 on Playwright >=1.61) the first attempt succeeds and the override is never applied, so we never force a mismatched-glibc build onto a release Playwright handles correctly (microsoft/playwright#35114). - Zero-maintenance — new distro releases work the moment Playwright adds them. - Covers Debian 14+ and future releases, not just Ubuntu 26.04. An operator-set PLAYWRIGHT_HOST_PLATFORM_OVERRIDE is always respected (applied to the first attempt; retry skipped). Non-x64/arm64 arches have no fallback build and skip the retry. Refs #35166	2026-06-28 02:05:18 -07:00
teknium1	64972b6403	fix(config): canonicalize model.name/model.model to model.default (#34500 ) A custom_providers config that names the model under model.name (or model.model) resolved to an empty model, so the API request went out with model= — HTTP 400 from OpenAI-compatible backends. Display paths (hermes status/dump) already read model.name and showed the model, making the failure silent. The model id was read via 'default or model' at ~14 independent sites (cli, gateway, cron, curator, oneshot, fallback, profiles, ...), none of which honored 'name'. Rather than patch every site, canonicalize at the single load/save chokepoint: _normalize_root_model_keys() now promotes model.model/model.name -> model.default (precedence default > model > name) and drops the stale alias, so every reader — present and future — sees a populated default and config.yaml is migrated canonical on next save. The gateway, which bypasses load_config(), replays the same normalization in _load_gateway_config(). Co-authored-by: Bartok9 <danielrpike9@gmail.com> Credit: root-cause analysis and fix direction from @Bartok9 (#34502, first) and @v86861062 (#34527).	2026-06-28 02:05:13 -07:00
kshitijk4poor	f64d15ccb7	fix(tui): defer buffered gateway events to stop dashboard chat #301 (#36658 ) Dashboard /chat spawns the TUI attached to the dashboard's in-memory gateway via HERMES_TUI_GATEWAY_URL. In that attach mode the already-running gateway replays `gateway.ready` (and `session.info`) the instant the socket connects, so those events land in GatewayClient.bufferedEvents before the consumer's mount-time subscribe effect (useMainApp.ts) calls drain(). drain() then emitted the buffered events synchronously, so the `gateway.ready` handler's patchUiState / setHistoryItems cascade ran while React was still inside the first commit — tripping "Too many re-renders" (Minified React error #301) and breaking Dashboard chat after `hermes update`. Spawn / inline / sidecar modes never hit this: their `gateway.ready` only arrives after the Python child boots, on a later async tick. Fix: drain() defers the replay to the next microtask AND keeps `subscribed` false until that microtask runs. Keeping `subscribed` false in the gap means any live event arriving before the flush keeps buffering (publish() pushes when !subscribed) instead of emitting synchronously and jumping ahead of the chronologically-earlier replayed events — the flush re-drains the buffer right after flipping `subscribed`, preserving FIFO order. A drainGeneration token (bumped in resetStartupState) makes a queued flush a no-op if the transport was reset/killed in the meantime, avoiding use-after-teardown and duplicate/reordered exits. Regression tests: (1) drain() does not dispatch buffered events synchronously; (2) a live event arriving in the post-drain / pre-microtask window still delivers BEHIND the earlier-buffered event (FIFO). Both are red against the old synchronous behavior, green with this fix. Same class of fix as #44528. Closes #36658	2026-06-28 14:18:47 +05:30
Teknium	2ecb6f7fe6	fix(telegram): clear send_path_degraded on successful reconnect (#35205 ) (#54076 ) * fix(telegram): clear send_path_degraded on successful reconnect _send_path_degraded was cleared only in _verify_polling_after_reconnect, 60s after reconnect and only if scheduled. A clean start_polling() reconnect left the flag stuck True, short-circuiting send() and blocking all outbound messages until the deferred probe ran (or forever if it never did). Clear the flag the moment start_polling() succeeds — that is the recovery signal. The deferred probe remains a defensive re-check that re-enters the reconnect ladder (re-setting the flag) if it detects a silent wedge. Fixes #35205. * docs: add infographic for #35205 telegram send-path fix	2026-06-28 01:38:17 -07:00

1 2 3 4 5 ...

13337 commits