hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-01 12:02:05 +00:00

Author	SHA1	Message	Date
teknium1	61622bb56a	fix(tui): use role=user for model switch marker to avoid HTTP 400 on strict providers (#48338) _append_model_switch_marker() appended the post-/model-switch context marker to session history as {"role": "system"}. The cached system prompt is prepended to the API message list (conversation_loop.py), so this marker became a SECOND system message mid-array after prior user/assistant turns. Strict OpenAI-compatible providers (vLLM, Qwen) reject any system message that is not at the beginning of the array, returning HTTP 400 and killing the conversation on the next turn. Flip the marker to role="user" (history entry + both session-DB persist sites), matching the existing personality-overlay marker which already uses role="user". repair_message_sequence() then coalesces it with adjacent user turns as needed. Co-authored-by: liuhao1024 <sunsky.lau@gmail.com> Co-authored-by: Lucas Nicolas <lucas.nicolas@proton.me>	2026-06-28 04:34:55 -07:00
izumi0uu	c4719aa51c	fix(gateway): boot out stale launchd registration before restart bootstrap launchd restart can leave the gateway job stopped but still registered after update-time drain logic, so a direct bootstrap hits exit 5 and falls back to a detached process. Booting the stale registration out before bootstrap keeps the launchd-managed restart path intact and locks it with a regression test. Constraint: Keep upstream-facing conventional commit style while preserving local decision context Rejected: Treat bootstrap exit 5 as expected | Leaves macOS launchd restart outside launchd supervision after update Confidence: high Scope-risk: narrow Directive: Keep launchd start/restart recovery flows aligned when changing launchctl handling Tested: pytest -q tests/hermes_cli/test_gateway_service.py -k "launchd_restart_boots_out_stale_registration_before_bootstrap or launchd_restart_falls_back_to_detached_on_error_5 or launchd_restart_drains_running_gateway_before_kickstart or launchd_restart_self_requests_graceful_restart_without_kickstart" Tested: pytest -q tests/hermes_cli/test_gateway_service.py -k launchd Not-tested: Manual macOS launchctl restart after hermes update	2026-06-28 04:17:13 -07:00
Teknium	52a853f5c3	fix(test): pin monotonic clock in spinner-elapsed test to fix CI flake (#54203) test_spinner_elapsed_format_is_fixed_width_to_reduce_wrap_jitter derived _tool_start_time from the live time.monotonic() clock (now - 65.2 / now - 9.2). monotonic()'s epoch is arbitrary — on a host where monotonic() < 65.2 (fresh subprocess on a freshly-booted CI runner) the start time went negative, the (t0 > 0) guard in _render_spinner_text() dropped the '(elapsed)' suffix, and short.split('(',1)[1] raised IndexError: list index out of range. Deterministic given a small clock, so it would keep flaking, not clear on rerun. Pin time.monotonic to a fixed 1000.0 and offset _tool_start_time from it so both the <60s and >=60s paths always render the elapsed suffix regardless of the runner's monotonic epoch. Pre-existing main flake (surfaced in CI test slice 1/8).	2026-06-28 04:16:25 -07:00
Cornna	5c2c85c545	fix(tui): start MCP discovery for websocket sessions The desktop app and dashboard chat reach the agent through the /api/ws JSON-RPC sidecar (tui_gateway.ws.handle_ws), NOT through tui_gateway.entry.main() — the stdio-TUI path that spawns the background MCP discovery thread. In the WS process discovery was therefore never started: _make_agent only waits (wait_for_mcp_discovery), which no-ops when the thread was never created, so the agent snapshotted an MCP-less tool list. The only discovery trigger reachable was a manual /reload-mcp, which is why tools appeared after a reload but vanished on restart. Start the shared, idempotent, config-gated background discovery in handle_ws right after accept() and before gateway.ready, so the first agent build picks up already-spawning servers (and the existing late-binding refresh handles slow ones). Fixes #38945.	2026-06-28 04:14:12 -07:00
teknium1	091ce825fe	test(redact): fix file_read regression-guard for current-main YAML collapse The salvaged #35519 regression guard asserted that default (non-file_read) mode keeps a head/tail `ghp_S1...Pn2T` mask for a `token: <key>` line. On current main the YAML config pass (`_YAML_ASSIGN_RE`, key `token`) re-masks the already-prefix-masked value to `***`, so the assertion was stale. Switch to a bare-token context so the guard isolates what it claims (prefix-mask head/tail shape in default mode) without depending on the YAML collapse.	2026-06-28 04:13:20 -07:00
kshitijk4poor	de928bccde	fix(redact): non-reusable sentinel for prefix secrets in file reads (#35519) When security.redact_secrets is on (default), read_file/search_files/cat applied redact_sensitive_text(code_file=True) to file content, which still ran prefix masking. An API key in config.yaml (ghp_..., sk-..., xai-..., etc.) came back as a head/tail mask like `ghp_S1...Pn2T` — a plausible-looking truncated key. When an agent read that and wrote it back to config, the masked value replaced the real credential, silently breaking auth (401). Production evidence: a config.yaml found containing the exact 13-char masked GitHub PAT. The two community PRs (#35529, #35534) fixed the corruption by NOT redacting prefixes for config reads — but that exposes the user's real keys to the agent context, model, and logs (a security regression). This takes the safer route: keep redacting, but for file content emit a NON-REUSABLE sentinel. - New `_mask_token_nonreusable`: prefix secrets -> `«redacted:ghp_…»` (vendor label preserved for debuggability; zero secret bytes; angle-bracket/ellipsis wrapper is syntactically invalid as a token so it can't be mistaken for or written back as a usable key). - New `redact_sensitive_text(file_read=True)` routes prefix matches through it (implies code_file=True). Default/log/display mode is UNCHANGED — `_mask_token` still keeps head/tail (fine for logs, never written back). - Wired the 3 file_tools.py call sites (read_file / search_files / cat) to file_read=True. Fixes both the corruption AND avoids the secret-exposure of the un-redact approach. 6 new tests (sentinel shape, no-leak, not-a-plausible-key, default mode unchanged, file_read implies code_file, sk- prefix); 88 redact tests pass; mutation-verified (reverting to the old mask fails the sentinel/leak tests). Co-authored-by: liuhao1024 <sunsky.lau@gmail.com> Co-authored-by: adammatski1972 <289282750+adammatski1972@users.noreply.github.com> Closes #35519. Supersedes #35529, #35534.	2026-06-28 04:13:20 -07:00
tymrtn	d7f655f370	fix: accept typed clarify choice replies	2026-06-28 04:13:19 -07:00
MorAlekss	acca526286	fix(gateway): treat zombie PIDs as dead in _pid_exists to unblock --replace (closes #42126) Under systemd Restart=always, the old gateway becomes a zombie (in the process table, awaiting reap) when the replacement starts. _pid_exists() reported the zombie as alive, so --replace waited on a PID that never dies, then aborted with exit 1 — a silent crash loop. Standalone runs are unaffected because nothing respawns the gateway into a zombie. The live path is psutil.pid_exists(), which returns True for zombies, so the check is added there (Process.status() == STATUS_ZOMBIE -> dead). The psutil-less POSIX fallback also reads /proc/<pid>/stat (state Z) with a ps state= fallback for macOS/BSD, before the os.kill(pid, 0) liveness probe. Diagnosis and the /proc + ps POSIX fallback by MorAlekss (PR #44898); extended to cover the psutil hot path so the fix applies on normal installs. Co-authored-by: MorAlekss <mor.aleksandr@yahoo.com>	2026-06-28 04:11:14 -07:00
teknium1	463225caf1	fix(gateway): bypass legacy-unit prompt in non-TTY systemd install Folds in PR #42124 (kyssta-exe): systemd_install gained a non_interactive flag so the 'Remove the legacy unit(s)?' prompt — the second hidden prompt not guarded by --start-now/--start-on-login — is also skipped in headless contexts. Updates systemd_install test mocks to accept the new kwarg and adds coverage for the legacy-unit-skip path.	2026-06-28 04:09:54 -07:00
liuhao1024	831d443b03	fix(gateway): honor --start-now/--start-on-login flags and support non-TTY headless installs When running `hermes gateway install` on Linux/systemd, the command unconditionally prompts with two `prompt_yes_no` questions, breaking headless installs (SSH, CI, provisioning scripts) and ignoring the existing --start-now / --start-on-login CLI flags that the Windows branch already respects. The fix mirrors the Windows path: read CLI flags first, prompt only when flags are not provided AND stdin is a TTY, and fall back to True defaults for non-TTY contexts. The argparse help strings are promoted from SUPPRESS to visible so users can discover the flags. Fixes #42065	2026-06-28 04:09:54 -07:00
Teknium	a06d0198cd	fix(dashboard): reap PTY bridge on child EOF, not only in writer finally (#54190) The /api/pty handler only closed the PtyBridge in the writer loop's finally. On child EOF the reader task closes the WebSocket, but if the handler task is cancelled the instant the socket closes, the writer's finally can be skipped and the PTY fds leak (#54028) — the FD-leak the regression test guards. Under dashboard auto-reconnect this stacks orphaned PTYs until fds are exhausted. Reap the bridge in the reader's EOF finally too (close() is idempotent), so the PTY is reaped independently of the writer-loop cancellation race. Harden the regression test to poll for teardown instead of asserting on the same tick. Was flaky on main (2/20); now 25/25.	2026-06-28 03:58:18 -07:00
Teknium	7968c90318	test(install): track run_with_timeout extraction after #39219 refactor (#54185) PR #39219 split run_browser_install_with_timeout into a thin wrapper that delegates to a new run_with_timeout helper (and parameterized the timeout binary as $timeout_bin for macOS gtimeout support), but did not update tests/test_install_sh_browser_install.py. The behavioral harness extracted only the now-empty wrapper, so the install command never ran (runs==[]), failing all 8 behavioral cases; two text assertions also still expected the old literal 'timeout' invocation. Fix the stale test: extract run_with_timeout alongside the wrapper, and match the $timeout_bin-parameterized GNU-timeout strings. Behavior unchanged.	2026-06-28 03:58:01 -07:00
Teknium	c1c179a239	fix(security): redact secrets in background process + foreground env-dump output (#43025) (#54149) * fix(security): redact secrets in background process + foreground env-dump output Terminal-output redaction was incomplete (#43025): - Gap 1: process(action=poll/log/wait) returned background stdout verbatim — no redaction at all. A background printenv/server/test emitting a key leaked raw to the model, session.db, and CLI display. Same for the gateway background-process watcher's completion/progress notifications. - Gap 2: the foreground terminal path hardcoded code_file=True, which skips the ENV-assignment pass, so an opaque token (no vendor prefix) from env/printenv leaked even there. Adds agent.redact.redact_terminal_output(output, command) as the single policy for ALL terminal-output surfaces: env-dump commands (env/printenv/set/export/declare) get the ENV-assignment pass (code_file=False) to mask opaque tokens; other commands stay on code_file=True to avoid false positives on source dumps. Wired into terminal_tool, process_registry (_handle_process boundary), and the gateway watcher. Respects security.redact_secrets (no force) — opt-out preserved. * docs: add infographic for #43025 terminal-output redaction fix	2026-06-28 02:44:21 -07:00
teknium1	d5ba374c03	fix(telegram): detect wedged getUpdates consumer via pending_update_count The merged CLOSE-WAIT heartbeat (#52744) only probes get_me(), which uses the general request path and stays healthy while PTB's getUpdates consumer is silently wedged (updater.running=True but the long-poll task is stuck, observed on WSL2). DMs then queue in the Bot API and never reach handlers (#42909). Augment the existing _polling_heartbeat_loop to also probe get_webhook_info().pending_update_count. After two consecutive probes that see a non-draining queue while the updater claims to be running, escalate into the existing _handle_polling_network_error recovery ladder — no new restart machinery. No-ops in webhook mode, when the updater is not running, or when a reconnect is already in flight. Credit to @gazzumatteo, whose PR #42959 identified the pending_update_count signal as the missing liveness probe. This reuses the existing heartbeat + recovery path rather than adding a parallel watchdog. Fixes #42909.	2026-06-28 02:44:17 -07:00
teknium1	bbe1bf4045	fix(agent): stop redacting tool-call args in history; fix auth-header quote-eating Two related redaction bugs from #43083: 1. build_assistant_message redacted tool-call arguments in-memory. That dict feeds both the replayed conversation history and state.db (which is itself replayed verbatim on session resume), so the model read back its own PGPASSWORD='***' psql call and copied the placeholder, breaking every credential-dependent command on the second turn. The masking gave no real protection either — the same secret still leaks through tool OUTPUT. Remove it. Keeping secrets out of the replayable store is a separate tokenization/vault concern (security.redact_secrets still governs storage-time redaction elsewhere). 2. _AUTH_HEADER_RE's greedy \S+ credential class ate a closing quote when the token sat flush against it (Authorization: Bearer sk-.."), turning value corruption into syntax corruption (unterminated quote -> shell EOF / SyntaxError). Exclude " and ' from the token class; real credentials never contain them. Closes #43083.	2026-06-28 02:44:06 -07:00
yoniebans	204a67f0c8	fix(kanban): retry write_txn on transient SQLITE_BUSY	2026-06-28 02:44:04 -07:00
yoniebans	90c1dc0493	test(kanban): cover write_txn BUSY retry (currently failing)	2026-06-28 02:44:04 -07:00
teknium1	9844243b18	fix(gateway): gate quick_commands through slash access policy Config-backed quick_commands bypassed the admin-only slash gate. The early gate in _handle_message only fires for registry-known commands (is_gateway_known_command), but quick_commands are never in the gateway registry, so they reached the type:exec dispatch sink unchecked. An allowlisted non-admin gateway user could invoke admin-only quick commands — including shell exec in the gateway process — even when the operator set allow_admin_from / user_allowed_commands to lock them out. Apply _check_slash_access(source, command) at the quick_commands dispatch site (the single exec chokepoint, cold-path only) using the raw typed name. Admins and users with the command in user_allowed_commands still run it; backward-compat (no policy set) is unaffected. Fixes #44727. Co-authored-by: maxpetrusenko <max.petrusenko.agent@gmail.com> Co-authored-by: zapabob <1920071390@campus.ouj.ac.jp>	2026-06-28 02:43:23 -07:00
Teknium	6d879d486b	fix(dashboard): close PTY WebSocket on child EOF to stop FD leak (#54028) (#54123) * fix(dashboard): close PTY WebSocket on child EOF to stop FD leak The /api/pty handler's reader task returns on child EOF, but the writer loop stayed blocked on ws.receive() until the browser sent a disconnect. When the browser socket is half-open (no FIN delivered — common on macOS/launchd), that disconnect never arrives, so the handler never reaches its finally and the PTY master fd + child process leak. With dashboard auto-reconnect (#52962), every dropped socket then spawns a fresh PTY on top of the orphaned one, exhausting file descriptors within hours (EMFILE / Errno 24). Fix: the reader task now closes the WebSocket in a finally when the child EOFs or the send side breaks, which unblocks ws.receive() so the existing finally runs bridge.close(). The writer loop also guards ws.receive() against the RuntimeError Starlette raises once the socket is closed. Reported by @fifteenzhang. Fixes #54028 * docs: add infographic for #54028 PTY FD leak fix	2026-06-28 02:42:21 -07:00
teknium1	7ef04ae7a7	fix(browser): close eval return-value SSRF bypass (sibling of #44731) The snapshot/vision guards re-check the page URL before returning content, but browser_console(expression=...) -> _browser_eval returns arbitrary JS results directly, leaving two same-class bypasses open: 1. Direct fetch: fetch('http://127.0.0.1/secret').then(r=>r.text()) reads a private endpoint and returns the body — the page URL stays public so the post-eval recheck never sees it. 2. Navigate-then-read: location.href='http://127.0.0.1/' then a later eval reads document.body.innerText. Guard _browser_eval on the same condition as navigate/snapshot/vision (not local backend, not local sidecar, not allow_private_urls): - pre-scan the expression for private/always-blocked URL literals - re-check window.location.href after the eval at both success-return sites (supervisor fast-path + subprocess fallback) Probe failures fail-open (matching the snapshot/vision guards).	2026-06-28 02:42:01 -07:00
liuhao1024	0ae6196087	fix(browser): allow local sidecar sessions to bypass SSRF guard The private-network guard in browser_snapshot() and browser_vision() blocked all private URLs, including those accessed via local sidecar sessions (hybrid routing). Local sidecar sessions intentionally access private URLs — the cloud provider never sees the URL in that case. Add `_is_local_sidecar_key(effective_task_id)` check to both guards, matching the existing pattern in browser_navigate(). Fixes #45101 review feedback from egilewski.	2026-06-28 02:42:01 -07:00
liuhao1024	48f5c42599	fix(browser): extend private-network guard to browser_vision The SSRF bypass in #44731 was only patched for browser_snapshot(), but browser_vision() exposes the same vulnerability — it takes a screenshot and sends it to the vision model without checking if eval-driven navigation moved the page to a private/internal URL. Add the same current-page URL safety check to browser_vision() before any screenshot is captured, encoded, or forwarded to the vision model. This covers both the normal screenshot path and the Lightpanda Chrome fallback path. 7 new tests: blocks private URL, allows public URL, skips in local backend, skips when private URLs allowed, handles eval failure/empty/exception.	2026-06-28 02:42:01 -07:00
liuhao1024	7a6fe9bbfa	fix(browser): block snapshot from eval-navigated private pages browser_snapshot() now checks the current page URL before returning content. When browser_console() changes location.href to a private or internal address (e.g., http://127.0.0.1:8080/), the snapshot returns an error instead of exposing the private page content. This closes the SSRF bypass where an attacker could: 1. Navigate to a public page 2. Use browser_console to eval location.href = 'http://127.0.0.1:port/' 3. Use browser_snapshot to read the private page content The fix reuses the existing _is_safe_url() and _allow_private_urls() infrastructure, and fails open if the URL check itself fails. Fixes #44731	2026-06-28 02:42:01 -07:00
Teknium	7c0a5def58	fix(memory/holographic): close DB connection on shutdown instead of leaking to GC (#54133) HolographicMemoryProvider.shutdown() dropped its MemoryStore reference without calling the existing MemoryStore.close(). Since the connection is opened check_same_thread=False (one per session), its fd was released by refcount/GC at a non-deterministic time on a non-deterministic thread, churning a DB fd through the kernel free pool on every session teardown. Call close() so the fd is released deterministically. Reported by @alfranli123 (#44037), who pinpointed the exact code location. Note: the report's TLS-fd-recycle corruption attribution could not be reproduced from the code — dropping a sqlite connection flushes valid SQLite pages via the VFS, never TLS framing, and the provider is at most a releaser of DB fds, not a TLS-flushing socket owner. This change is correct resource hygiene that removes per-session fd churn regardless.	2026-06-28 02:41:52 -07:00
Teknium	00d8c2c915	fix(gateway): prune stale sessions.json entries on startup A hard gateway crash (exit code 1) skips the graceful shutdown path, so sessions.json is never cleared and is left pointing at sessions already ended in state.db. On the next startup get_or_create_session() reuses those stale entries as long as the time/policy reset checks pass — it never consults end_reason — so every incoming message is silently routed into a closed session, with no log or error (#52804). SessionStore._ensure_loaded_locked() now calls a new _prune_stale_sessions_locked() that drops any entry whose session_id has end_reason IS NOT NULL in state.db. Idempotent, _db=None / legacy-absent safe, DB errors non-fatal, sessions.json rewritten only when something was pruned. Self-heals into a fresh session on the next message. Reported and diagnosed by @terry197913 (#52808).	2026-06-28 02:41:47 -07:00
teknium1	ea5aaa7a22	fix(gateway): offload remaining inline agent cleanup off the event loop (#53175) #35994 moved /new reset cleanup off the loop, but _cleanup_agent_resources (agent.close() subprocess teardown; shutdown_memory_provider() plugin IO) was still called INLINE on the event loop from three other sites: - _session_expiry_watcher (5-min idle sweep) — live loop - _handle_message_with_agent cache-hygiene re-eviction — live loop - _finalize_shutdown_agents / stop() idle-cache loop — shutdown A wedged memory provider on any of these froze the loop: bot goes silent, runtime-status updated_at heartbeat stops advancing, and SIGTERM can't be serviced (requires kill -9) — exactly the #53175 zombie pattern. Adds _cleanup_agent_resources_off_loop: a bounded (30s) worker-thread offload mirroring the #35994 reset fix, and routes all four sites through it.	2026-06-28 02:41:36 -07:00
teknium1	aa50c1ba5d	fix(prompt): repair backend probe import (get_environment never existed) The system-prompt backend probe imported a nonexistent symbol — `from tools.environments import get_environment` — which always raised ImportError: cannot import name 'get_environment'. The exception is caught and only drops the live backend description to a static fallback, so it is cosmetic, but it broke the live OS/user/cwd probe for every non-local backend (docker/singularity/modal/daytona/ssh). The real factory is `_create_environment` in tools.terminal_tool. Build the environment the same way the live terminal path does (select backend image, assemble ssh/container config from _get_env_config()), then run the probe. Note: this does NOT affect tool loading — tool selection runs each tool's check_fn and never consults this probe. Regression from #52147 (2026-06-25). Closes #53667 (probe import); the 'cronjob-only' tool-collapse symptom is not reproducible — tool selection has no probe dependency and memory's check_fn is unconditionally True.	2026-06-28 02:41:31 -07:00
teknium1	7c9cdad9fd	test(cli): cover Windows self-lock recovery guard + cmd-quote its hint Add two tests for the self-lock guard in _recover_from_interrupted_install: one asserting it clears the marker and skips install when hermes.exe is a process ancestor (breaking the #52378/#45542 loop), one asserting it falls through to a normal recovery install when the shim is NOT an ancestor. The guard's manual-recovery hint runs only inside the Windows branch, so quote it for cmd.exe (cd /d, double-quoted paths) — the cross-platform fallback hint at the end of the function is left POSIX-correct. Map Icather in scripts/release.py AUTHOR_MAP for the salvage.	2026-06-28 02:40:37 -07:00
liuhao1024	14baeefe1d	fix(matrix): record DM rooms in m.direct on invite to prevent group misclassification Rebase onto plugins/platforms/matrix/adapter.py (code moved from gateway/platforms/matrix.py). Same logic: _on_invite checks is_direct on invite events and calls _record_dm_room to persist in m.direct account data. Fixes #44679	2026-06-28 02:37:52 -07:00
Teknium	fde1c8570f	fix(tui_gateway): suppress WS peer-hangup teardown error flood (#50005) (#54126) When the Desktop forcibly closes its WebSocket mid-write, asyncio logs a full traceback for every pending connection-lost callback — 50+ identical WinError 10054 (ConnectionResetError) lines per disconnect on Windows, the equivalent ConnectionResetError/BrokenPipeError on POSIX. These are not actionable: they are the expected side effect of the peer hanging up before our writes drained. Install a loop exception handler on the gateway serving loop that collapses exactly this teardown class (ConnectionResetError/ConnectionAbortedError/BrokenPipeError originating from _call_connection_lost) to a single debug line, forwarding every other loop error to the existing/default handler unchanged so genuine loop bugs still surface. Idempotent per loop.	2026-06-28 02:35:01 -07:00
LeonSGP43	9f0e64cedd	fix(gateway): force exit after graceful shutdown Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-06-28 02:34:23 -07:00
yungchentang	7e2ca7f68d	fix(telegram): reset send pool after pool timeouts	2026-06-28 02:34:17 -07:00
Teknium	9f17f16c66	fix(environments): use $BASHPID for atomic snapshot temp + harden failure path The atomic mv approach (kyssta-exe's commit) narrows but does not close the #38249 race: the temp name used $$ (parent shell PID), which is identical across &-launched concurrent subshells. Two concurrent writers pick the same temp file, clobber each other mid-write, and mv then publishes a torn snapshot — a reader sourcing it absorbs declare-x/export fragments into PATH. - Use $BASHPID (actual per-subshell PID) so concurrent writers never collide. - Chain mv on export success (&&) and rm the temp on failure so a partial dump never replaces a good snapshot; apply the same to the init_session bootstrap. - shlex-quote the static temp-path portion (Windows/spaces), $BASHPID outside. - LocalEnvironment.cleanup sweeps orphaned snap.tmp.* temps. - Regression tests: string-shape + a behavioral concurrent writers/readers test that proves the snapshot never tears (would still tear with $$).	2026-06-28 02:08:57 -07:00
teknium1	c23f394eb8	fix: satisfy ruff encoding + windows-footgun lints for cgroup reaper - read_text(encoding='utf-8') (PLW1514) - # windows-footgun: ok on signal.SIGKILL — module is Linux-only (reads /proc, /sys/fs/cgroup; runs from a systemd unit) - test lambda accepts the new encoding kwarg	2026-06-28 02:05:50 -07:00
PRATHAMESH75	e551da6ddb	fix(gateway): reap cgroup orphans via ExecStopPost to unblock restart Long-lived helpers spawned indirectly by tool calls (adb, platform bridges) were left in the service cgroup after the gateway's main process exited. When the kernel rejected the deferred cgroup-wide kill with EINVAL, systemd blocked Restart=always for 6+ minutes, taking down all platforms and cron windows (#37454). Add a small ExecStopPost helper (gateway.cgroup_cleanup) that walks cgroup.procs and sends per-PID SIGKILLs — a different kernel code path than cgroup.kill, so it succeeds where the cgroup-wide write failed. KillMode=mixed is preserved so the gateway still reaps its own tool-call children before systemd intervenes (#8202). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-28 02:05:50 -07:00
Coy Geek	d7a1052424	fix(env-passthrough): fail closed when provider blocklist import fails When tools.environments.local can't be imported (partial install, import-time error), _is_hermes_provider_credential() returned False — fail-open. A skill could then register a Hermes provider credential (ANTHROPIC_API_KEY, etc.) as env passthrough; _scrub_child_env lets passthrough vars bypass the secret-substring net (rule 1), so the operator's real key would land in the execute_code child. Reopens the GHSA-rhgp-j443-p4rf bypass. Fail closed instead: on import failure, treat the name as a protected provider credential and refuse passthrough. Regression test exercises the full register -> scrub path under a simulated import failure. Co-authored-by: Hermes Agent <noreply@nousresearch.com>	2026-06-28 02:05:43 -07:00
teknium1	58c36b1798	fix(api-server): widen error redaction to cron-endpoint + SSE sites Follow-up to the salvaged #37733 fix. The contributor centralized redaction at _openai_error and the chat/responses failure paths, which covers the OpenAI-compatible envelopes transitively. Two sibling classes crossed the same authenticated HTTP boundary unredacted: - 8x cron-management endpoints returning {"error": str(e)} on 500 - the session-chat SSE error event ({"message": str(exc)}) Route both through the same _redact_api_error_text(force=True) helper. Add AUTHOR_MAP entry for coygeek and a TestRedactApiErrorText guard covering mask/force/limit/passthrough behavior.	2026-06-28 02:05:38 -07:00
Coy Geek	5e774de76e	fix(api-server): redact provider errors at HTTP boundary Force API-server error text through the existing secret redactor before returning OpenAI-compatible errors, response fallback text, response snapshots, and run failure events. This prevents credential-shaped provider failure text from crossing the API-server boundary while preserving debuggable sanitized messages.	2026-06-28 02:05:38 -07:00
HexLab98	d2fda5925d	test(gateway): cover Discord/Slack compression status suppression (#39293)	2026-06-28 14:35:32 +05:30
teknium1	3aaa98dd01	test(whatsapp): cover LID allowlist match on modern session layout Add an _is_user_authorized E2E for the platforms/whatsapp/session layout on top of fesalfayed's resolver fix (#36665) — guards the actual silently-dropped-LID-sender path from #36664.	2026-06-28 02:05:26 -07:00
fesalfayed	263ffec1b0	fix(whatsapp): resolve LID aliases on modern platforms/ session layout expand_whatsapp_aliases hardcoded get_hermes_home()/whatsapp/session, but the adapter writes lid-mapping files via get_hermes_dir("platforms/whatsapp/session", "whatsapp/session"). On installs without the legacy directory the two paths diverge, so the resolver finds no mappings and returns the bare LID, which misses the allowlist and silently drops the message. Resolve through the same helper so both sides stay in lockstep on new and legacy layouts.	2026-06-28 02:05:26 -07:00
xxxigm	093f567f0d	fix(agent,cli): surface empty-body API errors and fail oneshot exit code When an LLM API call returns HTTP 4xx with an empty parsed SDK `body` ({}), `_summarize_api_error` fell through to a bare `str(error)`, so users saw only "HTTP 400" with no provider detail (reported on Windows in #36109). The SDK leaves `body` empty in this case, but the httpx `response` still carries the payload in `.text`. - run_agent.py `_summarize_api_error`: when `body` is empty, fall back to `response.text` — parse a JSON `error.message`/`message` when present, else surface the raw (truncated) body. Platform-agnostic diagnostics. - hermes_cli/oneshot.py: `hermes -z` now runs via `run_conversation` and returns exit code 2 when the run is failed/partial with no usable final response, so scripts can detect LLM failures (still 0 when a response — incl. an error summary as output — is produced). Tests: new tests/run_agent/test_summarize_api_error.py (empty-body JSON + raw text, RED/GREEN verified) + oneshot exit-code/`run_conversation` wiring tests. NOTE: #36109's original root cause (Windows "all providers return empty 400") is not reproducible on current main (heavy provider-transport churn since v0.15.1). This change does not claim to fix that root cause — it makes any empty-body API error LEGIBLE so a future occurrence shows the real provider message instead of a bare HTTP 400. Relates to #36109 (does not close it).	2026-06-28 02:05:20 -07:00
teknium1	c0b4a3438a	fix(install): scope Playwright override to too-new apt releases + keep step interruptible Follow-up on #54032 for #35166: - Gate the PLAYWRIGHT_HOST_PLATFORM_OVERRIDE retry on the host being an apt release newer than Playwright recognizes (Ubuntu >24.04 / Debian >13) via playwright_host_unrecognized(), instead of retrying on ANY install failure. A network/disk/permission failure on a supported host now surfaces unchanged rather than getting a mismatched-glibc build forced onto it. - detect_os() now captures DISTRO_VERSION from os-release. - Fold in the interruptibility fix (was PR #35304, self-closed): wrap the download in 'timeout --foreground -k 10' (probed, with plain-timeout fallback) so a terminal Ctrl+C reaches the child and a wedged download is force-killed after the deadline. - Add behavioral tests that source the helpers and assert the retry fires only on Ubuntu 26.04 / Debian 14, not on supported hosts, non-apt distros, native-success, operator-pinned override, or unsupported arch.	2026-06-28 02:05:18 -07:00
kshitijk4poor	a28fe788a6	fix(install): retry Playwright install with platform override on unrecognized host (#35166 ) On apt releases newer than the bundled Playwright recognizes (Ubuntu 26.04, Debian 14, and future distros), 'npx playwright install --with-deps chromium' hangs uninterruptibly at 'Installing Playwright Chromium with system dependencies' because Playwright's resolver maps the host to a platform with no download build (#35166). Wrap every installer Playwright call in run_playwright_install(), which tries the native install first and, only if it fails or times out, retries once with PLAYWRIGHT_HOST_PLATFORM_OVERRIDE pinned to the newest known build (ubuntu24.04-<arch>). This is the escape hatch Playwright's maintainers bless for unrecognized platforms (microsoft/playwright#33434). Try-native-first (not a hardcoded distro/version table) is deliberate: - Self-correcting — when Playwright already supports the host (e.g. Ubuntu 26.04 on Playwright >=1.61) the first attempt succeeds and the override is never applied, so we never force a mismatched-glibc build onto a release Playwright handles correctly (microsoft/playwright#35114). - Zero-maintenance — new distro releases work the moment Playwright adds them. - Covers Debian 14+ and future releases, not just Ubuntu 26.04. An operator-set PLAYWRIGHT_HOST_PLATFORM_OVERRIDE is always respected (applied to the first attempt; retry skipped). Non-x64/arm64 arches have no fallback build and skip the retry. Refs #35166	2026-06-28 02:05:18 -07:00
teknium1	64972b6403	fix(config): canonicalize model.name/model.model to model.default (#34500 ) A custom_providers config that names the model under model.name (or model.model) resolved to an empty model, so the API request went out with model= — HTTP 400 from OpenAI-compatible backends. Display paths (hermes status/dump) already read model.name and showed the model, making the failure silent. The model id was read via 'default or model' at ~14 independent sites (cli, gateway, cron, curator, oneshot, fallback, profiles, ...), none of which honored 'name'. Rather than patch every site, canonicalize at the single load/save chokepoint: _normalize_root_model_keys() now promotes model.model/model.name -> model.default (precedence default > model > name) and drops the stale alias, so every reader — present and future — sees a populated default and config.yaml is migrated canonical on next save. The gateway, which bypasses load_config(), replays the same normalization in _load_gateway_config(). Co-authored-by: Bartok9 <danielrpike9@gmail.com> Credit: root-cause analysis and fix direction from @Bartok9 (#34502, first) and @v86861062 (#34527).	2026-06-28 02:05:13 -07:00
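The alias promotion described in 64972b6403 (precedence default > model > name, stale aliases dropped) might look something like the sketch below. The function name `normalize_model_keys` and the plain-dict config shape are assumptions for illustration; the real `_normalize_root_model_keys()` is not shown in this log.

```python
def normalize_model_keys(config: dict) -> dict:
    """Hypothetical sketch: promote model.model / model.name to model.default."""
    model = config.get("model")
    if not isinstance(model, dict):
        # A bare string or a missing section is left untouched.
        return config
    normalized = dict(model)
    # First non-empty value wins, in precedence order default > model > name.
    chosen = next(
        (normalized[key] for key in ("default", "model", "name") if normalized.get(key)),
        None,
    )
    # Drop the aliases so every reader sees only the canonical key.
    normalized.pop("model", None)
    normalized.pop("name", None)
    if chosen is not None:
        normalized["default"] = chosen
    return {**config, "model": normalized}
```

Normalizing once at load time means call sites that read only `model.default` keep working without each one learning about the aliases.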
Teknium	2ecb6f7fe6	fix(telegram): clear send_path_degraded on successful reconnect (#35205) (#54076) * fix(telegram): clear send_path_degraded on successful reconnect _send_path_degraded was cleared only in _verify_polling_after_reconnect, 60s after reconnect and only if scheduled. A clean start_polling() reconnect left the flag stuck True, short-circuiting send() and blocking all outbound messages until the deferred probe ran (or forever if it never did). Clear the flag the moment start_polling() succeeds, since that is the recovery signal. The deferred probe remains a defensive re-check that re-enters the reconnect ladder (re-setting the flag) if it detects a silent wedge. Fixes #35205. * docs: add infographic for #35205 telegram send-path fix	2026-06-28 01:38:17 -07:00
Teknium	674e16e7c6	fix(redact): stop DB-connstr redaction from corrupting code output (#33801 ) (#54061 ) Secret redaction is display/output-scoped on main — write_file writes content verbatim, terminal/execute_code redact only output not the command/source. The real bug is in displayed tool OUTPUT (read_file, terminal, execute_code): _DB_CONNSTR_RE's password group [^@]+ was greedy across newlines, so on a multi-line block it scanned past the DSN line to the next stray '@' (a Python @decorator), replacing every intervening character — including line breaks — with *. That dropped lines and concatenated the next line onto the f-string line, making read_file output look corrupted (the file on disk was always correct). Reported in #33801. Fix: - Forbid whitespace in the userinfo/password groups ([^:\s]+ / [^@\s]+) so the match can never span a line break. A real DSN password never contains whitespace. This alone kills the catastrophic line-dropping. - Under code_file=True, preserve a password group that is a pure {...} brace expression — f"postgresql://{user}:{pass}@{host}" is an f-string template, not a live credential. Literal passwords are still masked. - Pass code_file=True at the terminal and execute_code output redaction call sites (file_tools already did) so code-execution output isn't corrupted by ENV/JSON/template false positives. Real prefixes, auth headers, JWTs, and private keys are still redacted. Verified E2E against the reporter's exact pydantic-settings module: file written verbatim, read_file shows the DSN f-string + @model_validator intact with zero * corruption, while a literal postgresql://admin:pw@host DSN and a real sk- key are still masked. Reported-by: koishi70 Reported-by: pfrenssen	2026-06-28 01:15:39 -07:00
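The line-dropping failure mode in 674e16e7c6 can be reproduced with a simplified pair of patterns. Both regexes below are hypothetical stand-ins written for this example, not the project's `_DB_CONNSTR_RE`; the only point is the difference between `[^@]+` and `[^@\s]+` in the password group.

```python
import re

# Hypothetical patterns for illustration only.
LOOSE = re.compile(r"(postgresql://[^:]+:)([^@]+)(@)")       # password group may span newlines
STRICT = re.compile(r"(postgresql://[^:\s]+:)([^@\s]+)(@)")  # whitespace forbidden in both groups


def mask(pattern, text):
    """Replace every character of the password group with '*'."""
    return pattern.sub(
        lambda m: m.group(1) + "*" * len(m.group(2)) + m.group(3), text
    )


# A URL with no userinfo, followed by a decorator on the next line.
SOURCE = 'url = "postgresql://localhost:5432/app"\n@validator\ndef check(): pass\n'

loose_out = mask(LOOSE, SOURCE)    # runs on to the decorator's '@' and masks the newline
strict_out = mask(STRICT, SOURCE)  # finds no match and leaves the text unchanged
literal_out = mask(STRICT, "postgresql://admin:pw@host/db")  # real password is still masked
```

In this sketch the loose pattern swallows the line break between the URL and the decorator, which is how two source lines end up concatenated in displayed output, while the strict pattern still masks a literal `admin:pw` credential.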
teknium1	578e3989d4	fix(agent): route content-filter stream stalls to fallback chain (#32421 ) When a provider's output-layer safety filter (MiniMax "output new_sensitive (1027)", Azure content_filter, etc.) kills a streaming response after deltas were already sent, interruptible_streaming_api_call swallows the raw error into a finish_reason=length partial-stream stub. The conversation loop then burned 3 continuation retries against the SAME primary — re-hitting the content-deterministic filter every time — and gave up with "Response remained truncated after 3 continuation attempts", never consulting fallback_providers. Builds on @595650661's classifier change (cherry-picked) so error_classifier recognizes the filter; then: - chat_completion_helpers: run the swallowed error through error_classifier at the stub-creation point and stamp _content_filter_terminated on the stub (single source of truth — no parallel pattern list). - conversation_loop: read the tag and activate the fallback chain BEFORE burning any continuation retries; roll partial content back to the last clean turn and re-issue against the new provider (restart_with_rebuilt_messages). Plain network stalls are unaffected (only content_policy_blocked is tagged). Credits #32479 (@sweetcornna) and #33845 (@Tranquil-Flow) which fixed the same issue via the stub-tag and loop-escalation approaches respectively. Live E2E confirmed: before, _try_activate_fallback called 0x; after, fallback fires on the first stub and the fallback provider completes the turn.	2026-06-28 01:15:21 -07:00
595650661	b8e2268628	fix(agent): add MiniMax 'new_sensitive' to content_policy_blocked patterns The MiniMax output-layer safety filter surfaces the error verbatim as `output new_sensitive (1027)` (sometimes with additional provider wrapping like 'Stream stalled mid tool-call: output new_sensitive (1027)'). When the model emits a large tool-call argument block, the upstream filter trips and the SSE stream is truncated mid-flight, producing 'stream stalled mid tool-call' errors. Until now this case was misclassified and retried 3x on the same provider, reproducing the same refusal and burning paid attempts. Adding `new_sensitive` to `_CONTENT_POLICY_BLOCKED_PATTERNS` routes it through the existing is_client_error path: skip 3x retry, activate configured fallback model immediately, surface a clear provider-safety message to the user. Refs #32421	2026-06-28 01:15:21 -07:00
Teknium	c9df4bc094	fix(gateway): default restart_drain_timeout to 0 to kill systemd crash loop (#54066) A restart now interrupts in-flight agents immediately rather than holding the gateway open for a grace window. The previous 180s default coupled two independently-set timers: the gateway's own drain timer and systemd's TimeoutStopSec. On a stale unit where TimeoutStopSec < drain, systemd SIGKILLed the gateway mid-cleanup, leaving a stale lock that made the next startup exit immediately ('already running'), an infinite crash loop under Restart=on-failure (#31981). Setting drain to 0 makes the mismatch structurally impossible: with drain 0 the generated unit gets TimeoutStopSec=90 against a near-instant drain, so systemd never kills mid-cleanup. Contract: restart the gateway, in-flight work stops. A grace window large enough to 'save' a long agent turn would have to outlast an unbounded task, which is impossible. Also fixes the stale-unit warning's suggested command (hermes gateway service install --replace -> hermes gateway install --force); the former subcommand does not exist. Closes #31981	2026-06-28 01:14:34 -07:00


6498 commits