hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-01 12:02:05 +00:00

Author	SHA1	Message	Date
LIC99	dda3268d09	fix(approvals): warn and default to manual on unknown approvals.mode _normalize_approval_mode() previously accepted any string, so an unknown value like 'auto' fell through every downstream mode check (off/smart) and silently behaved like manual with no signal. Validate against the known modes (manual/smart/off), emit a warning for anything else, and default to manual to match the config default and the rest of the function. Bug 1 from the original PR (/approve & /deny bypassing the running-agent guard) already landed on main independently, so only the mode-validation fix is salvaged here. Fixes #4261 Co-authored-by: Hermes Agent <agent@nousresearch.com>	2026-06-28 19:04:18 -07:00
Teknium	11183e8332	fix(profiles): validate custom alias names to prevent path traversal `hermes profile alias <profile> --name <custom>` accepted arbitrary strings and used them verbatim as a filename under ~/.local/bin. Because normalize_profile_name only lowercases/strips (no regex gate), a value like `../../.bashrc` escaped the wrapper directory and clobbered arbitrary user-writable files. remove_wrapper_script had the same sink. Add validate_alias_name (reusing the profile-id regex, which forbids `/`, `.`, and `..`) and wire it into check_alias_collision, create_wrapper_script, remove_wrapper_script, and the CLI alias action so the rejection surfaces a clear "Invalid alias name" error instead of silently writing or unlinking outside the wrapper dir. Co-authored-by: Gutslabs <gutslabsxyz@gmail.com> Co-authored-by: Xowiek <xowiekk@gmail.com>	2026-06-28 18:53:33 -07:00
Teknium	490f215a19	test: cover export-prefix stripping in .env parsers (PR #6659 )	2026-06-28 18:53:00 -07:00
Teknium	3483424aaa	fix(security): redact bare-token credentials in URL userinfo (#6396 ) (#54475 ) git remote set-url with an embedded password (https://PASSWORD@github.com) leaked the credential into agent output — the redaction engine only masked user:pass@ DB connection strings, never the colon-less bare-token userinfo form a git remote uses. Add _URL_BARE_TOKEN_RE: scheme://TOKEN@host for web/transport schemes (http/https/wss/git/ssh/ftp), 8+ char floor to skip short usernames, token class forbidding /:@ so an @ in a path/query is never treated as userinfo. Deliberately scoped to the bare-token form only. The user:pass@ colon form and query-string tokens stay passing through (#34029, 'pass web URLs through unchanged') so magic-link / OAuth round-trip skills keep working — a bare credential in userinfo is never a workflow token (those live in the query string), so masking it can't break a skill.	2026-06-28 18:52:42 -07:00
Teknium	9860d93f2a	fix(terminal): require approval for host-bound Docker commands (#54483 ) * fix(terminal): require approval for host-bound Docker commands The Docker terminal backend blanket-skips dangerous-command approval on the assumption that the container is isolated from the host. That holds only when nothing is bind-mounted in. Once a host path is exposed (via TERMINAL_DOCKER_MOUNT_CWD_TO_WORKSPACE or a host-path entry in TERMINAL_DOCKER_VOLUMES), a command like `rm -rf /workspace` reaches real host files but is still auto-approved. Detect host bind mounts and route those sessions through the normal approval flow. Isolated Docker keeps the fast path. The same gating is applied to the execute_code guard, which had the identical blanket skip. Co-authored-by: Hermes Agent <agent@nousresearch.com> * chore: add AUTHOR_MAP entry for PR #6436 salvage (Kolektori) * test: accept has_host_access kwarg in _check_all_guards mocks The host-bound Docker approval fix adds a has_host_access kwarg to the _check_all_guards wrapper. Six pre-existing tests monkeypatch it with a fixed (command, env_type) / (cmd, env) lambda signature, which now raises TypeError when terminal_tool passes the new kwarg. Widen those mock signatures to accept **kwargs. --------- Co-authored-by: Kolektori <256073454+Kolektori@users.noreply.github.com> Co-authored-by: Hermes Agent <agent@nousresearch.com>	2026-06-29 11:35:41 +10:00
Ben Barclay	7cfa2fa13f	fix(docker): gate resource limit flags on cgroup controller availability (#54516 ) On hosts where the cgroup v2 cpu/memory/pids controllers are not delegated to the docker/podman process (unprivileged Proxmox LXCs, some rootless and nested setups), --pids-limit/--cpus/--memory cause every container start to fail with OCI runtime error / exit 126, breaking terminal + execute_code. - Add _cgroup_limits_available(image): one-shot, host-wide cached probe that spawns a throwaway container from the sandbox image itself (sleep 0) with all three flags together, mirroring the existing _storage_opt_supported probe-and-degrade pattern. - Remove --pids-limit from static _BASE_SECURITY_ARGS; apply it (default 256 via _DEFAULT_PIDS_LIMIT) in resource_args gated on the probe. - Gate --cpus and --memory on the same probe. Behavior unchanged on cgroup-capable hosts; graceful degradation with a one-time warning where controllers aren't delegated. Fixes #6568. (cherry picked from commit `c933880b7e`) Co-authored-by: angelos <angelos@oikos.lan.home.malaiwah.com>	2026-06-29 11:01:08 +10:00
Brooklyn Nicholson	f34cf7e3a4	test(gmi): stub profile fetch_models in static-fallback test The fallback test only mocked fetch_api_models; CI still hit the real GMI /v1/models endpoint via ProviderProfile.fetch_models and merged live models into the result.	2026-06-28 18:05:28 -05:00
Brooklyn Nicholson	cb1bb1a48d	refactor(windows): unify windowless spawn form across the touched sites windows_hide_flags() already returns 0 on POSIX (and creationflags=0 is the no-op default there, exactly how server.py::_list_repo_files does it), so drop the IS_WINDOWS import + ternary/one-use-dict gating and just pass creationflags=windows_hide_flags() directly. Tests lose the now-pointless IS_WINDOWS monkeypatch.	2026-06-28 17:44:47 -05:00
Brooklyn Nicholson	32087e4bc9	fix(windows): hide console flash on checkpoint git + skills_hub gh probes The #54236/#54417 backend git/gh sweep routed git_probe, the repo-file picker, coding_context, context_references, copilot_auth, and the gateway process scans through CREATE_NO_WINDOW, but two sibling spawn legs that also run inside the console-less desktop/gateway backend were missed: - tools/checkpoint_manager.py `_run_git` (and the one-shot `git init --bare` in `_init_store`) — when checkpoints are enabled, every file-mutating turn fires multiple bare `git` calls (status, add, write-tree/commit-tree, update-ref). Spawned from a parent with no console (Electron spawns the backend with windowsHide → CREATE_NO_WINDOW), each one allocates its own conhost window → a flurry of terminal popups. - tools/skills_hub.py `GitHubAuth._try_gh_cli` — `gh auth token`, the same bug class as the already-fixed copilot_auth gh probe. Route both through `windows_hide_flags()` (no-op on POSIX), matching the established per-site pattern. Tests added to tests/test_windows_subprocess_no_window_flags.py.	2026-06-28 17:41:47 -05:00
Teknium	980622d0ec	perf(startup): parse config + plugin manifests with libyaml CSafeLoader (#54486 ) The startup config/manifest reads used PyYAML's pure-Python SafeLoader, which is ~8x slower than the libyaml-backed CSafeLoader C extension. config.yaml is parsed several times during launch (cli config, raw config, early interface/redaction bridge, logging config) and every plugin manifest is parsed once — all on the slow path. Add utils.fast_safe_load (CSafeLoader-preferring, pure-Python fallback, true drop-in for safe_load) and route the hot startup parse sites through it: hermes_cli/config.py (config + manifest reads), hermes_cli/plugins.py (manifest parse), env_loader, cli.load_cli_config, hermes_logging, and the two pre-config early YAML bridges in main.py. Behavior is identical (same restricted safe tag set); only speed changes. safe_load calls on the startup path drop from ~79 to ~0, cutting the YAML parse cost from ~0.9s to ~0.15s under profiling. Adds tests/test_fast_safe_load.py asserting equivalence with safe_load across input shapes, empty-doc falsiness, C-loader preference, and that python/object tags are still rejected (safe, not full loader).	2026-06-28 15:38:39 -07:00
Teknium	d65468e7ff	fix(security): SSRF guard yuanbao media download_url (#54470 ) yuanbao_media.download_url() fetched model-supplied (outbound) and inbound image/file URLs server-side via httpx with follow_redirects=True and no SSRF check. A model response containing <img src="http://169.254.169.254/..."> routed through ImageUrlHandler -> download_url and would fetch cloud-metadata endpoints; same for inbound media. Add an is_safe_url() pre-flight plus an async redirect event-hook that re-validates every 30x target, matching the cache_image_from_url() guard in gateway/platforms/base.py. The other gateway adapters already guard their URL-fetch paths; this was the remaining unguarded one.	2026-06-28 15:29:59 -07:00
brooklyn!	16ff1a3b93	Merge pull request #54457 from NousResearch/bb/windows-console-launcher-repair fix(windows): repair missing console script launchers	2026-06-28 17:15:56 -05:00
奥森木	e7d4ade8cf	fix(anthropic): ignore stale non-Anthropic base_url across all resolution paths A config left with `provider: anthropic` but a leftover `base_url: https://openrouter.ai/api/v1` (e.g. after a provider switch) would route Anthropic OAuth/setup-token traffic to OpenRouter and 404. Add `_anthropic_base_url_override_ok()` and gate the three native-Anthropic resolution branches (pool, explicit, native) on it. The guard honors a configured `model.base_url` only when it plausibly speaks the Anthropic Messages protocol — official `.anthropic.com` / `.claude.com` hosts, Azure Foundry endpoints, and `/anthropic`-suffixed or Kimi `/coding` proxies — and falls back to `https://api.anthropic.com` otherwise. Aggregator URLs like openrouter.ai / api.openai.com are treated as stale. Reconstructed from @clovericbot's PR #3661 onto current main: the original patched one branch with an anthropic-only allow-list, which would have broken Azure-via-anthropic; widened to all three sites and made Azure/proxy-safe.	2026-06-28 15:12:03 -07:00
Mibayy	b0b7ff0d75	fix(provider): auto+base_url bypasses cloud API when custom endpoint configured (#3846 ) When config.yaml has `provider: auto` and a non-cloud `base_url` (e.g. Ollama at localhost:11434), requests were silently sent to https://api.anthropic.com whenever ANTHROPIC_API_KEY was present in the environment, ignoring the configured local endpoint and returning HTTP 401 / "credit balance too low". Root cause: resolve_provider("auto") scans env vars and returns "anthropic" when ANTHROPIC_API_KEY is set, before config.model.base_url is ever consulted. In resolve_runtime_provider(), before calling resolve_provider(), short-circuit to the OpenAI-compatible resolver when no explicit creds were passed, provider is "auto"/unset, and a non-cloud base_url is configured. Well-known cloud roots (openrouter.ai, anthropic.com, openai.com) are matched on HOST (not substring) so look-alike hosts can't evade the bypass and leak a cloud credential. Co-authored-by: Hermes Agent <hermes@nousresearch.com>	2026-06-28 15:11:55 -07:00
Teknium	86e64900b9	fix(gateway): preserve sessions across restarts (#54442 )	2026-06-28 15:10:39 -07:00
Teknium	4c2961c511	fix(curator): never archive cron-referenced skills + floor use=0 pruning (#54443 ) The curator's inactivity prune archived any non-pinned agent-created skill whose activity was older than archive_after_days (90d). A skill loaded only by a cron job had its usage bumped solely when the job fired, so paused jobs, infrequent (quarterly/annual) schedules, and far-future one-shots aged their skills out from under them — the next run then failed to load the now-archived skill. - cron/jobs.py: add referenced_skill_names() returning skills used by ANY job (incl. paused/disabled). - curator.apply_automatic_transitions(): skip cron-referenced skills like pinned; add a use=0 grace floor so a never-used skill is not marked stale/archived until it is at least stale_after_days old. - LLM review pass: candidate list marks cron=yes; prompt forbids pruning cron-referenced skills and never-used skills under 30 days. Tested E2E against a real cron job + real usage records and with 4 new unit tests.	2026-06-28 15:10:21 -07:00
Gille	df8e2523fa	fix(windows): verify launchers after primary install	2026-06-28 17:02:05 -05:00
HexLab98	76bb8f46a0	test(cli): cover Windows console script repair (#52931 ) Add unit tests for missing-shim detection and repair trigger in _verify_console_scripts_installed.	2026-06-28 17:01:31 -05:00
brooklyn!	28097d9cd9	Merge pull request #54385 from NousResearch/bb/project-folder-picker-remote feat(desktop): remote-gateway-aware folder picker + git cockpit (status, review, worktrees)	2026-06-28 16:35:57 -05:00
Teknium	e5d22ab80d	fix(daytona): quote single-upload mkdir parent path (#54440 ) * fix(daytona): quote single-upload mkdir parent path The single-file _daytona_upload() path shelled out 'mkdir -p {parent}' with the remote parent interpolated unquoted, so shell metacharacters in the path could break the command or inject arbitrary commands into the sandbox. The bulk-upload, bulk-download, and delete paths were already hardened with shlex-quoting helpers; this single-upload path was missed. Route it through the existing quoted_mkdir_command() helper and add a regression test covering a path with shell metacharacters. Reported by @Gutslabs (#3960); the original branch predated the file_sync refactor, so the fix is re-applied to the current code path. * docs(infographic): daytona quote-sync fix	2026-06-28 14:33:03 -07:00
Brooklyn Nicholson	f9b469d7de	test(web_git): assert default branch invariant, not hardcoded main CI git init defaults to master on some runners; compare branch to defaultBranch instead of pinning a branch name.	2026-06-28 16:29:52 -05:00
teknium1	c648ecdca5	fix(telegram): reject unauthorized users before event construction (#40863 ) Removed/unauthorized Telegram users could inject prompt content before the per-user auth gate fired. The adapter ran `_should_process_message`, `_build_message_event`, and text/photo batching — and dispatched to the runner — before `_is_user_authorized()` (gateway/authz_mixin.py) rejected the sender. Unmentioned group chatter from a removed user was also persisted into the session transcript via `_observe_unmentioned_group_message`, leaking into the agent's observed context independent of dispatch. Add `_is_user_authorized_from_message()` as an intake prefilter that runs in `_handle_text_message`, `_handle_command`, `_handle_location_message`, and `_handle_media_message` BEFORE batching, event construction, and the unmentioned-group observe branch. It reuses the runner's `_is_user_authorized()` with a correctly-shaped SessionSource (group vs forum vs dm, real chat_id for TELEGRAM_GROUP_ALLOWED_* allowlists), falls back to env allowlists, and only rejects when an allowlist actually exists — unknown DMs with no allowlist still reach the pairing flow. Channel posts authorize via `sender_chat` identity when `from_user` is absent. Co-authored-by: liuhao1024 <sunsky.lau@gmail.com> Co-authored-by: Carlos Manuel Cejas <carlosmcejas@gmail.com>	2026-06-28 14:25:15 -07:00
srojk34	61210097a5	fix(browser): extend private-network guard to browser_get_images The SSRF cluster (`7a6fe9bb`, `48f5c425`, `7ef04ae7`) sealed browser_snapshot, browser_vision, and _browser_eval against eval-navigated private pages, but browser_get_images bypasses _browser_eval and calls _run_browser_command("eval", ...) directly. An eval-driven navigation to a private address followed by browser_get_images would leak image src URLs and alt text from the private page. Add the same _eval_ssrf_guard_active + _current_page_private_url recheck before returning image data, matching the pattern established by the sibling guards. 5 new tests cover: block on private page, allow on public page, skip for local backend, skip when private URLs allowed, no guard needed on failed eval.	2026-06-28 14:25:10 -07:00
Teknium	9a0010fd46	fix(windows): cover remaining console-flash spawn legs (#54417 )	2026-06-28 13:49:08 -07:00
Brooklyn Nicholson	4e9439cc3b	fix(desktop): route composer context picking through remote-aware fs Second pass on the remote-project flow: the project dialog and git cockpit were remote-aware, but the composer's Add file/folder context picker still called the native Electron picker directly. Route it through selectDesktopPaths so remote sessions use the backend-aware picker instead of local disk paths; preserve local multi-select behavior and keep remote folder selection single because the in-app remote picker only supports one directory. Also use readDesktopFileDataUrl for image previews so an already-known backend image path can be read through /api/fs/read-data-url, and add focused coverage for backend file-diff routing plus the plain-folder git init/worktree path.	2026-06-28 14:35:23 -05:00
Brooklyn Nicholson	fc86e35764	feat(desktop): make the git cockpit work over a remote gateway After the folder picker fix, an added remote folder was still half-usable: the desktop's git GUI (coding-rail status, worktree lanes, review pane, branch switch, file diff) all ran Electron-local git on the USER's machine, so against a remote-gateway repo they silently degraded to empty. Mirror the whole surface over the dashboard REST API so it acts on the BACKEND repo where sessions actually run: - hermes_cli/web_git.py: git/gh logic (status, worktrees, branches, review list/diff/stage/unstage/revert/commit/commit-context/push/ship-info/ create-pr, file-diff, worktree add/remove, branch switch) shelling to the system git, mirroring the Electron ops' shapes. - web_server.py: /api/git/* routes (same auth gate + _fs_path hardening as /api/fs, executor-offloaded, mutations -> 400). - apps/desktop desktop-git.ts: remote-aware facade exposing the same shape as window.hermesDesktop.git; coding-status / review / projects / model / desktop-fs route through desktopGit() so local stays Electron, remote hits /api/git/*. Tests: tests/hermes_cli/test_web_server_git.py (real repo: status counts, review classification, diff incl. untracked all-add, stage+commit roundtrip, worktree/branch lifecycle, commit-context, gh-absent ship-info, auth) and desktop-git.test.ts (local vs remote routing, envelope unwrap, POST bodies).	2026-06-28 14:26:09 -05:00
Brooklyn Nicholson	70292596ef	feat(browser): auto-install Chromium binary on local cold-start failure When a local browser_navigate (or any browser command) fails fast because Chromium isn't on disk, attempt a one-shot binary download via `agent-browser install` and retry instead of only printing a hint. Scope is narrow on purpose: - binary only, never `--with-deps` (that shells apt/needs root, so missing system libraries stay a user action) - gated by `security.allow_lazy_installs` (same opt-out as every lazy install) - skipped in Docker (Chromium ships in the image) - attempted once per process Follow-up to #54353, which made the cold-start failure legible; this closes the "doesn't actually install the missing browser" gap for the common case.	2026-06-28 12:25:15 -05:00
infinitycrew39	7bb8aa3bd5	test(browser): cover open timeout diagnostics and failed navigate title Add regression tests for open-command timeout floors, sandbox bypass, stderr capture formatting, first-navigation timeout wiring, and desktop failed-navigate labeling.	2026-06-28 12:14:21 -05:00
ygd58	3e16176ba4	fix(tools): reconcile agent.disabled_toolsets when a toolset is enabled _get_platform_tools() applies agent.disabled_toolsets as a final override AFTER reading platform_toolsets.<platform>, so a toolset listed there stays permanently OFF no matter what the toggle write path saves. Blank Slate installs pre-populate this list with ~27 toolsets, making most of the desktop Toolsets UI un-enableable (issue #49995). Fix: _save_platform_tools() now removes any toolset the user just explicitly enabled FOR THIS PLATFORM from agent.disabled_toolsets. Toolsets the user did not touch, or that remain disabled on other platforms, are left alone -- disabled_toolsets keeps working as a cross-platform suppression list for anything not actively re-enabled. Disabling a toolset (unchecking it) does not touch disabled_toolsets at all -- only enables reconcile it. Verified end-to-end with the exact repro from the issue: Blank Slate config (disabled_toolsets=['todo','memory','browser'], cli=['file', 'terminal']) -> enable 'todo' via the toggle -> _get_platform_tools() now resolves 'todo' as enabled while 'memory'/'browser' (untouched) remain disabled. Added 4 regression tests. Full tools_config suite: 101 passed (97 existing + 4 new), no regressions. Fixes #49995	2026-06-28 21:59:03 +05:30
Brooklyn Nicholson	eeca59f489	fix(windows): hide remaining backend console-flash legs missed on main main (`cb982ad99`) wired windows_hide_flags() into the auxiliary git/gh/wmic/ bash/powershell/taskkill legs but left two it didn't reach, plus the Electron backend-launch leg it explicitly deferred. Cover them the same way: - apps/desktop/electron/main.cjs: getNoConsoleVenvPython resolves the BASE pythonw.exe instead of the venv Scripts\pythonw.exe shim, which re-execs a console python.exe and flashes a conhost the desktop backend can't suppress. Both backend creators put the venv site-packages on PYTHONPATH so imports still resolve under the base interpreter. (main's commit said this Electron leg "needs a Windows-tested change of its own".) - tools/tts_tool.py, tools/transcription_tools.py, plugins/platforms/discord: ffmpeg conversions (voice notes / TTS / STT) via windows_hide_flags(). - plugins/platforms/whatsapp: netstat + taskkill bridge-port cleanup via windows_hide_flags(). All no-ops on POSIX. Tests assert the base-pythonw preference and the ffmpeg legs pass CREATE_NO_WINDOW.	2026-06-28 10:19:21 -05:00
Teknium	0c2e6c0049	test: make active session cross-process race deterministic (#54248 )	2026-06-28 05:49:21 -07:00
teknium1	1ffa01f35f	test(windows): cover no-window backend subprocess flags	2026-06-28 05:28:45 -07:00
Teknium	cb982ad997	fix(windows): hide console-window flash on backend git/gh/wmic/bash subprocess spawns The Windows desktop GUI runs its backend headless via pythonw.exe. Several auxiliary subprocess sites that run inside that windowless backend spawned console-subsystem children (git, gh, wmic, powershell, bash, rg, taskkill) WITHOUT CREATE_NO_WINDOW, so Windows allocated a fresh conhost per call and flashed a black window on screen — sometimes continuously (the dashboard Projects-tree git probe alone fired ~118 spawns in 60s on startup). The terminal tool, cron, browser, code_execution, and gateway-spawn paths already carry windows_hide_flags(); these auxiliary probe/scan/launcher legs were missed. Wire the existing helper into them: - tui_gateway/git_probe.py: run_git (+ encoding=utf-8/errors=replace, fixes the cp950 UnicodeDecodeError on CJK paths from the same site) - agent/coding_context.py: _git (per-turn git status/log/diff) - agent/context_references.py: _run_git + _rg_files (@file/@ref resolution) - hermes_cli/copilot_auth.py: gh auth token probe (auxiliary provider:auto) - hermes_cli/gateway.py: wmic + PowerShell Get-CimInstance PID scan - hermes_cli/main.py: wmic stale-dashboard PID scan - gateway/status.py: taskkill /T /F force-kill windows_hide_flags() returns 0 on POSIX, so every changed call is a no-op on Linux/macOS (verified: real git/rg probes still work; Windows-simulated calls all pass creationflags=CREATE_NO_WINDOW). Scoped to the windowless-backend paths that cause the reported flashing. The Electron updater-handoff leg (main.cjs windowsHide:false) and the interactive-CLI banner probes (cli.py) are intentionally NOT touched here — the former needs a Windows-tested change of its own, the latter runs in a visible console anyway. Tracking: #54220 Refs: #53178 #53631 #53781 #53957 #49602 #52982 #53424 #53053 #53016	2026-06-28 05:28:45 -07:00
homelab-ha-agent	d05cc8f4d6	fix(mcp): skip preflight content-type probe for OAuth servers OAuth-protected MCP servers (e.g. Hospitable) return 200 text/html on an unauthenticated HEAD probe — a login/landing page the server cannot substitute for a real MCP response without a Bearer token. The preflight cannot distinguish this from a misconfigured URL, so it raises NonMcpEndpointError before the OAuth browser flow has a chance to run. Add `and self._auth_type != "oauth"` to the preflight condition in MCPServerTask.run(). The probe is inapplicable to OAuth servers: their URL legitimacy is established by .well-known/oauth-protected-resource during the OAuth handshake, not by a GET content-type check. Concrete repro: Hospitable (https://mcp.hospitable.com/mcp) returns `200 text/html` to an unauthenticated httpx HEAD. Without the guard: ✗ NonMcpEndpointError at `hermes mcp test` With the guard: ✓ Connected (1487ms) — 63 tools discovered Relation to open PRs: - #37598 adds a POST probe fallback for POST-only non-OAuth servers (e.g. DocuSeal), but only passes when POST returns 2xx + MCP content-type. Hospitable returns 401 on the POST probe (Bearer challenge), so #37598 does not cover this case. - #49463 extends the POST probe to also pass on non-2xx auth challenges (making it OAuth-aware), but is labeled duplicate of #37598 and may not land independently. This fix is complementary: it handles OAuth servers with zero extra round-trips rather than adding a POST probe step. Tests: - test_oauth_server_html_response_raises_without_skip: documents that _preflight_content_type raises NonMcpEndpointError for 200 text/html (the underlying issue), with an OAuth-server docstring. - test_run_skips_preflight_for_oauth: verifies that run() does NOT invoke _preflight_content_type when auth_type=="oauth", using class-level monkeypatching so the gate is exercised without a live MCP transport. 23 passed tests/tools/test_mcp_preflight_content_type.py	2026-06-28 04:47:39 -07:00
liuhao1024	9d919daf44	fix(gateway): mark platform lock failure as retryable instead of permanently fatal When a stale lock file survives a gateway crash, `acquire_scoped_lock()` may return `(False, existing_dict)` even after detecting and deleting the stale lock (e.g. if unlink fails or a race condition occurs). Previously, `_acquire_platform_lock()` called `_set_fatal_error(..., retryable=False)`, which permanently killed the platform — the reconnect watcher never retries a non-retryable fatal error. Change to `retryable=True` so the platform enters the "retrying" state and the reconnect watcher can attempt acquisition again after the standard backoff delay. Fixes #54167	2026-06-28 04:35:37 -07:00
teknium1	61622bb56a	fix(tui): use role=user for model switch marker to avoid HTTP 400 on strict providers (#48338 ) _append_model_switch_marker() appended the post-/model-switch context marker to session history as {"role": "system"}. The cached system prompt is prepended to the API message list (conversation_loop.py), so this marker became a SECOND system message mid-array after prior user/assistant turns. Strict OpenAI-compatible providers (vLLM, Qwen) reject any system message that is not at the beginning of the array, returning HTTP 400 and killing the conversation on the next turn. Flip the marker to role="user" (history entry + both session-DB persist sites), matching the existing personality-overlay marker which already uses role="user". repair_message_sequence() then coalesces it with adjacent user turns as needed. Co-authored-by: liuhao1024 <sunsky.lau@gmail.com> Co-authored-by: Lucas Nicolas <lucas.nicolas@proton.me>	2026-06-28 04:34:55 -07:00
izumi0uu	c4719aa51c	fix(gateway): boot out stale launchd registration before restart bootstrap launchd restart can leave the gateway job stopped but still registered after update-time drain logic, so a direct bootstrap hits exit 5 and falls back to a detached process. Booting the stale registration out before bootstrap keeps the launchd-managed restart path intact and locks it with a regression test. Constraint: Keep upstream-facing conventional commit style while preserving local decision context Rejected: Treat bootstrap exit 5 as expected \| Leaves macOS launchd restart outside launchd supervision after update Confidence: high Scope-risk: narrow Directive: Keep launchd start/restart recovery flows aligned when changing launchctl handling Tested: pytest -q tests/hermes_cli/test_gateway_service.py -k "launchd_restart_boots_out_stale_registration_before_bootstrap or launchd_restart_falls_back_to_detached_on_error_5 or launchd_restart_drains_running_gateway_before_kickstart or launchd_restart_self_requests_graceful_restart_without_kickstart" Tested: pytest -q tests/hermes_cli/test_gateway_service.py -k launchd Not-tested: Manual macOS launchctl restart after hermes update	2026-06-28 04:17:13 -07:00
Teknium	52a853f5c3	fix(test): pin monotonic clock in spinner-elapsed test to fix CI flake (#54203 ) test_spinner_elapsed_format_is_fixed_width_to_reduce_wrap_jitter derived _tool_start_time from the live time.monotonic() clock (now - 65.2 / now - 9.2). monotonic()'s epoch is arbitrary — on a host where monotonic() < 65.2 (fresh subprocess on a freshly-booted CI runner) the start time went negative, the (t0 > 0) guard in _render_spinner_text() dropped the '(elapsed)' suffix, and short.split('(',1)[1] raised IndexError: list index out of range. Deterministic given a small clock, so it would keep flaking, not clear on rerun. Pin time.monotonic to a fixed 1000.0 and offset _tool_start_time from it so both the <60s and >=60s paths always render the elapsed suffix regardless of the runner's monotonic epoch. Pre-existing main flake (surfaced in CI test slice 1/8).	2026-06-28 04:16:25 -07:00
Cornna	5c2c85c545	fix(tui): start MCP discovery for websocket sessions The desktop app and dashboard chat reach the agent through the /api/ws JSON-RPC sidecar (tui_gateway.ws.handle_ws), NOT through tui_gateway.entry.main() — the stdio-TUI path that spawns the background MCP discovery thread. In the WS process discovery was therefore never started: _make_agent only waits (wait_for_mcp_discovery), which no-ops when the thread was never created, so the agent snapshotted an MCP-less tool list. The only discovery trigger reachable was a manual /reload-mcp, which is why tools appeared after a reload but vanished on restart. Start the shared, idempotent, config-gated background discovery in handle_ws right after accept() and before gateway.ready, so the first agent build picks up already-spawning servers (and the existing late-binding refresh handles slow ones). Fixes #38945.	2026-06-28 04:14:12 -07:00
teknium1	091ce825fe	test(redact): fix file_read regression-guard for current-main YAML collapse The salvaged #35519 regression guard asserted that default (non-file_read) mode keeps a head/tail `ghp_S1...Pn2T` mask for a `token: <key>` line. On current main the YAML config pass (`_YAML_ASSIGN_RE`, key `token`) re-masks the already-prefix-masked value to `***`, so the assertion was stale. Switch to a bare-token context so the guard isolates what it claims (prefix-mask head/tail shape in default mode) without depending on the YAML collapse.	2026-06-28 04:13:20 -07:00
kshitijk4poor	de928bccde	fix(redact): non-reusable sentinel for prefix secrets in file reads (#35519 ) When security.redact_secrets is on (default), read_file/search_files/cat applied redact_sensitive_text(code_file=True) to file content, which still ran prefix masking. An API key in config.yaml (ghp_..., sk-..., xai-..., etc.) came back as a head/tail mask like `ghp_S1...Pn2T` — a plausible-looking truncated key. When an agent read that and wrote it back to config, the masked value replaced the real credential, silently breaking auth (401). Production evidence: a config.yaml found containing the exact 13-char masked GitHub PAT. The two community PRs (#35529, #35534) fixed the corruption by NOT redacting prefixes for config reads — but that exposes the user's real keys to the agent context, model, and logs (a security regression). This takes the safer route: keep redacting, but for file content emit a NON-REUSABLE sentinel. - New `_mask_token_nonreusable`: prefix secrets -> `«redacted:ghp_…»` (vendor label preserved for debuggability; zero secret bytes; angle-bracket/ellipsis wrapper is syntactically invalid as a token so it can't be mistaken for or written back as a usable key). - New `redact_sensitive_text(file_read=True)` routes prefix matches through it (implies code_file=True). Default/log/display mode is UNCHANGED — `_mask_token` still keeps head/tail (fine for logs, never written back). - Wired the 3 file_tools.py call sites (read_file / search_files / cat) to file_read=True. Fixes both the corruption AND avoids the secret-exposure of the un-redact approach. 6 new tests (sentinel shape, no-leak, not-a-plausible-key, default mode unchanged, file_read implies code_file, sk- prefix); 88 redact tests pass; mutation-verified (reverting to the old mask fails the sentinel/leak tests). Co-authored-by: liuhao1024 <sunsky.lau@gmail.com> Co-authored-by: adammatski1972 <289282750+adammatski1972@users.noreply.github.com> Closes #35519. Supersedes #35529, #35534.	2026-06-28 04:13:20 -07:00
tymrtn	d7f655f370	fix: accept typed clarify choice replies	2026-06-28 04:13:19 -07:00
MorAlekss	acca526286	fix(gateway): treat zombie PIDs as dead in _pid_exists to unblock --replace (closes #42126 ) Under systemd Restart=always, the old gateway becomes a zombie (in the process table, awaiting reap) when the replacement starts. _pid_exists() reported the zombie as alive, so --replace waited on a PID that never dies, then aborted with exit 1 — a silent crash loop. Standalone runs are unaffected because nothing respawns the gateway into a zombie. The live path is psutil.pid_exists(), which returns True for zombies, so the check is added there (Process.status() == STATUS_ZOMBIE -> dead). The psutil-less POSIX fallback also reads /proc/<pid>/stat (state Z) with a ps state= fallback for macOS/BSD, before the os.kill(pid, 0) liveness probe. Diagnosis and the /proc + ps POSIX fallback by MorAlekss (PR #44898); extended to cover the psutil hot path so the fix applies on normal installs. Co-authored-by: MorAlekss <mor.aleksandr@yahoo.com>	2026-06-28 04:11:14 -07:00
teknium1	463225caf1	fix(gateway): bypass legacy-unit prompt in non-TTY systemd install Folds in PR #42124 (kyssta-exe): systemd_install gained a non_interactive flag so the 'Remove the legacy unit(s)?' prompt — the second hidden prompt not guarded by --start-now/--start-on-login — is also skipped in headless contexts. Updates systemd_install test mocks to accept the new kwarg and adds coverage for the legacy-unit-skip path.	2026-06-28 04:09:54 -07:00
liuhao1024	831d443b03	fix(gateway): honor --start-now/--start-on-login flags and support non-TTY headless installs When running `hermes gateway install` on Linux/systemd, the command unconditionally prompts with two `prompt_yes_no` questions, breaking headless installs (SSH, CI, provisioning scripts) and ignoring the existing --start-now / --start-on-login CLI flags that the Windows branch already respects. The fix mirrors the Windows path: read CLI flags first, prompt only when flags are not provided AND stdin is a TTY, and fall back to True defaults for non-TTY contexts. The argparse help strings are promoted from SUPPRESS to visible so users can discover the flags. Fixes #42065	2026-06-28 04:09:54 -07:00
Teknium	a06d0198cd	fix(dashboard): reap PTY bridge on child EOF, not only in writer finally (#54190 ) The /api/pty handler only closed the PtyBridge in the writer loop's finally. On child EOF the reader task closes the WebSocket, but if the handler task is cancelled the instant the socket closes, the writer's finally can be skipped and the PTY fds leak (#54028) — the FD-leak the regression test guards. Under dashboard auto-reconnect this stacks orphaned PTYs until fds are exhausted. Reap the bridge in the reader's EOF finally too (close() is idempotent), so the PTY is reaped independently of the writer-loop cancellation race. Harden the regression test to poll for teardown instead of asserting on the same tick. Was flaky on main (2/20); now 25/25.	2026-06-28 03:58:18 -07:00
Teknium	7968c90318	test(install): track run_with_timeout extraction after #39219 refactor (#54185 ) PR #39219 split run_browser_install_with_timeout into a thin wrapper that delegates to a new run_with_timeout helper (and parameterized the timeout binary as $timeout_bin for macOS gtimeout support), but did not update tests/test_install_sh_browser_install.py. The behavioral harness extracted only the now-empty wrapper, so the install command never ran (runs==[]), failing all 8 behavioral cases; two text assertions also still expected the old literal 'timeout' invocation. Fix the stale test: extract run_with_timeout alongside the wrapper, and match the $timeout_bin-parameterized GNU-timeout strings. Behavior unchanged.	2026-06-28 03:58:01 -07:00
Teknium	c1c179a239	fix(security): redact secrets in background process + foreground env-dump output (#43025 ) (#54149 ) * fix(security): redact secrets in background process + foreground env-dump output Terminal-output redaction was incomplete (#43025): - Gap 1: process(action=poll/log/wait) returned background stdout verbatim — no redaction at all. A background printenv/server/test emitting a key leaked raw to the model, session.db, and CLI display. Same for the gateway background-process watcher's completion/progress notifications. - Gap 2: the foreground terminal path hardcoded code_file=True, which skips the ENV-assignment pass, so an opaque token (no vendor prefix) from env/printenv leaked even there. Adds agent.redact.redact_terminal_output(output, command) as the single policy for ALL terminal-output surfaces: env-dump commands (env/printenv/set/export/ declare) get the ENV-assignment pass (code_file=False) to mask opaque tokens; other commands stay on code_file=True to avoid false positives on source dumps. Wired into terminal_tool, process_registry (_handle_process boundary), and the gateway watcher. Respects security.redact_secrets (no force) — opt-out preserved. * docs: add infographic for #43025 terminal-output redaction fix	2026-06-28 02:44:21 -07:00
teknium1	d5ba374c03	fix(telegram): detect wedged getUpdates consumer via pending_update_count The merged CLOSE-WAIT heartbeat (#52744) only probes get_me(), which uses the general request path and stays healthy while PTB's getUpdates consumer is silently wedged (updater.running=True but the long-poll task is stuck, observed on WSL2). DMs then queue in the Bot API and never reach handlers (#42909). Augment the existing _polling_heartbeat_loop to also probe get_webhook_info().pending_update_count. After two consecutive probes that see a non-draining queue while the updater claims to be running, escalate into the existing _handle_polling_network_error recovery ladder — no new restart machinery. No-ops in webhook mode, when the updater is not running, or when a reconnect is already in flight. Credit to @gazzumatteo, whose PR #42959 identified the pending_update_count signal as the missing liveness probe. This reuses the existing heartbeat + recovery path rather than adding a parallel watchdog. Fixes #42909.	2026-06-28 02:44:17 -07:00
teknium1	bbe1bf4045	fix(agent): stop redacting tool-call args in history; fix auth-header quote-eating Two related redaction bugs from #43083: 1. build_assistant_message redacted tool-call arguments in-memory. That dict feeds both the replayed conversation history and state.db (which is itself replayed verbatim on session resume), so the model read back its own PGPASSWORD='***' psql call and copied the placeholder, breaking every credential-dependent command on the second turn. The masking gave no real protection either — the same secret still leaks through tool OUTPUT. Remove it. Keeping secrets out of the replayable store is a separate tokenization/vault concern (security.redact_secrets still governs storage-time redaction elsewhere). 2. _AUTH_HEADER_RE's greedy \S+ credential class ate a closing quote when the token sat flush against it (Authorization: Bearer sk-.."), turning value corruption into syntax corruption (unterminated quote -> shell EOF / SyntaxError). Exclude " and ' from the token class; real credentials never contain them. Closes #43083.	2026-06-28 02:44:06 -07:00

1 2 3 4 5 ...

6533 commits