hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-01 12:02:05 +00:00

Author	SHA1	Message	Date
Teknium	cb982ad997	fix(windows): hide console-window flash on backend git/gh/wmic/bash subprocess spawns The Windows desktop GUI runs its backend headless via pythonw.exe. Several auxiliary subprocess sites that run inside that windowless backend spawned console-subsystem children (git, gh, wmic, powershell, bash, rg, taskkill) WITHOUT CREATE_NO_WINDOW, so Windows allocated a fresh conhost per call and flashed a black window on screen — sometimes continuously (the dashboard Projects-tree git probe alone fired ~118 spawns in 60s on startup). The terminal tool, cron, browser, code_execution, and gateway-spawn paths already carry windows_hide_flags(); these auxiliary probe/scan/launcher legs were missed. Wire the existing helper into them: - tui_gateway/git_probe.py: run_git (+ encoding=utf-8/errors=replace, fixes the cp950 UnicodeDecodeError on CJK paths from the same site) - agent/coding_context.py: _git (per-turn git status/log/diff) - agent/context_references.py: _run_git + _rg_files (@file/@ref resolution) - hermes_cli/copilot_auth.py: gh auth token probe (auxiliary provider:auto) - hermes_cli/gateway.py: wmic + PowerShell Get-CimInstance PID scan - hermes_cli/main.py: wmic stale-dashboard PID scan - gateway/status.py: taskkill /T /F force-kill windows_hide_flags() returns 0 on POSIX, so every changed call is a no-op on Linux/macOS (verified: real git/rg probes still work; Windows-simulated calls all pass creationflags=CREATE_NO_WINDOW). Scoped to the windowless-backend paths that cause the reported flashing. The Electron updater-handoff leg (main.cjs windowsHide:false) and the interactive-CLI banner probes (cli.py) are intentionally NOT touched here — the former needs a Windows-tested change of its own, the latter runs in a visible console anyway. Tracking: #54220 Refs: #53178 #53631 #53781 #53957 #49602 #52982 #53424 #53053 #53016	2026-06-28 05:28:45 -07:00
izumi0uu	c4719aa51c	fix(gateway): boot out stale launchd registration before restart bootstrap launchd restart can leave the gateway job stopped but still registered after update-time drain logic, so a direct bootstrap hits exit 5 and falls back to a detached process. Booting the stale registration out before bootstrap keeps the launchd-managed restart path intact and locks it with a regression test. Constraint: Keep upstream-facing conventional commit style while preserving local decision context Rejected: Treat bootstrap exit 5 as expected \| Leaves macOS launchd restart outside launchd supervision after update Confidence: high Scope-risk: narrow Directive: Keep launchd start/restart recovery flows aligned when changing launchctl handling Tested: pytest -q tests/hermes_cli/test_gateway_service.py -k "launchd_restart_boots_out_stale_registration_before_bootstrap or launchd_restart_falls_back_to_detached_on_error_5 or launchd_restart_drains_running_gateway_before_kickstart or launchd_restart_self_requests_graceful_restart_without_kickstart" Tested: pytest -q tests/hermes_cli/test_gateway_service.py -k launchd Not-tested: Manual macOS launchctl restart after hermes update	2026-06-28 04:17:13 -07:00
teknium1	463225caf1	fix(gateway): bypass legacy-unit prompt in non-TTY systemd install Folds in PR #42124 (kyssta-exe): systemd_install gained a non_interactive flag so the 'Remove the legacy unit(s)?' prompt — the second hidden prompt not guarded by --start-now/--start-on-login — is also skipped in headless contexts. Updates systemd_install test mocks to accept the new kwarg and adds coverage for the legacy-unit-skip path.	2026-06-28 04:09:54 -07:00
liuhao1024	831d443b03	fix(gateway): honor --start-now/--start-on-login flags and support non-TTY headless installs When running `hermes gateway install` on Linux/systemd, the command unconditionally prompts with two `prompt_yes_no` questions, breaking headless installs (SSH, CI, provisioning scripts) and ignoring the existing --start-now / --start-on-login CLI flags that the Windows branch already respects. The fix mirrors the Windows path: read CLI flags first, prompt only when flags are not provided AND stdin is a TTY, and fall back to True defaults for non-TTY contexts. The argparse help strings are promoted from SUPPRESS to visible so users can discover the flags. Fixes #42065	2026-06-28 04:09:54 -07:00
Teknium	a06d0198cd	fix(dashboard): reap PTY bridge on child EOF, not only in writer finally (#54190 ) The /api/pty handler only closed the PtyBridge in the writer loop's finally. On child EOF the reader task closes the WebSocket, but if the handler task is cancelled the instant the socket closes, the writer's finally can be skipped and the PTY fds leak (#54028) — the FD-leak the regression test guards. Under dashboard auto-reconnect this stacks orphaned PTYs until fds are exhausted. Reap the bridge in the reader's EOF finally too (close() is idempotent), so the PTY is reaped independently of the writer-loop cancellation race. Harden the regression test to poll for teardown instead of asserting on the same tick. Was flaky on main (2/20); now 25/25.	2026-06-28 03:58:18 -07:00
yoniebans	204a67f0c8	fix(kanban): retry write_txn on transient SQLITE_BUSY	2026-06-28 02:44:04 -07:00
Teknium	6d879d486b	fix(dashboard): close PTY WebSocket on child EOF to stop FD leak (#54028 ) (#54123 ) * fix(dashboard): close PTY WebSocket on child EOF to stop FD leak The /api/pty handler's reader task returns on child EOF, but the writer loop stayed blocked on ws.receive() until the browser sent a disconnect. When the browser socket is half-open (no FIN delivered — common on macOS/launchd), that disconnect never arrives, so the handler never reaches its finally and the PTY master fd + child process leak. With dashboard auto-reconnect (#52962), every dropped socket then spawns a fresh PTY on top of the orphaned one, exhausting file descriptors within hours (EMFILE / Errno 24). Fix: the reader task now closes the WebSocket in a finally when the child EOFs or the send side breaks, which unblocks ws.receive() so the existing finally runs bridge.close(). The writer loop also guards ws.receive() against the RuntimeError Starlette raises once the socket is closed. Reported by @fifteenzhang. Fixes #54028 * docs: add infographic for #54028 PTY FD leak fix	2026-06-28 02:42:21 -07:00
teknium1	7c9cdad9fd	test(cli): cover Windows self-lock recovery guard + cmd-quote its hint Add two tests for the self-lock guard in _recover_from_interrupted_install: one asserting it clears the marker and skips install when hermes.exe is a process ancestor (breaking the #52378/#45542 loop), one asserting it falls through to a normal recovery install when the shim is NOT an ancestor. The guard's manual-recovery hint runs only inside the Windows branch, so quote it for cmd.exe (cd /d, double-quoted paths) — the cross-platform fallback hint at the end of the function is left POSIX-correct. Map Icather in scripts/release.py AUTHOR_MAP for the salvage.	2026-06-28 02:40:37 -07:00
灵越羽毛	b6f592dbdc	fix(cli): detect self-lock in update recovery to break infinite retry loop on Windows	2026-06-28 02:40:37 -07:00
Teknium	fde1c8570f	fix(tui_gateway): suppress WS peer-hangup teardown error flood (#50005 ) (#54126 ) When the Desktop forcibly closes its WebSocket mid-write, asyncio logs a full traceback for every pending connection-lost callback — 50+ identical WinError 10054 (ConnectionResetError) lines per disconnect on Windows, the equivalent ConnectionResetError/BrokenPipeError on POSIX. These are not actionable: they are the expected side effect of the peer hanging up before our writes drained. Install a loop exception handler on the gateway serving loop that collapses exactly this teardown class (ConnectionResetError/ConnectionAbortedError/ BrokenPipeError originating from _call_connection_lost) to a single debug line, forwarding every other loop error to the existing/default handler unchanged so genuine loop bugs still surface. Idempotent per loop.	2026-06-28 02:35:01 -07:00
PRATHAMESH75	e551da6ddb	fix(gateway): reap cgroup orphans via ExecStopPost to unblock restart Long-lived helpers spawned indirectly by tool calls (adb, platform bridges) were left in the service cgroup after the gateway's main process exited. When the kernel rejected the deferred cgroup-wide kill with EINVAL, systemd blocked Restart=always for 6+ minutes, taking down all platforms and cron windows (#37454). Add a small ExecStopPost helper (gateway.cgroup_cleanup) that walks cgroup.procs and sends per-PID SIGKILLs — a different kernel code path than cgroup.kill, so it succeeds where the cgroup-wide write failed. KillMode=mixed is preserved so the gateway still reaps its own tool-call children before systemd intervenes (#8202). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-28 02:05:50 -07:00
xxxigm	093f567f0d	fix(agent,cli): surface empty-body API errors and fail oneshot exit code When an LLM API call returns HTTP 4xx with an empty parsed SDK `body` ({}), `_summarize_api_error` fell through to a bare `str(error)`, so users saw only "HTTP 400" with no provider detail (reported on Windows in #36109). The SDK leaves `body` empty in this case, but the httpx `response` still carries the payload in `.text`. - run_agent.py `_summarize_api_error`: when `body` is empty, fall back to `response.text` — parse a JSON `error.message`/`message` when present, else surface the raw (truncated) body. Platform-agnostic diagnostics. - hermes_cli/oneshot.py: `hermes -z` now runs via `run_conversation` and returns exit code 2 when the run is failed/partial with no usable final response, so scripts can detect LLM failures (still 0 when a response — incl. an error summary as output — is produced). Tests: new tests/run_agent/test_summarize_api_error.py (empty-body JSON + raw text, RED/GREEN verified) + oneshot exit-code/`run_conversation` wiring tests. NOTE: #36109's original root cause (Windows "all providers return empty 400") is not reproducible on current main (heavy provider-transport churn since v0.15.1). This change does not claim to fix that root cause — it makes any empty-body API error LEGIBLE so a future occurrence shows the real provider message instead of a bare HTTP 400. Relates to #36109 (does not close it).	2026-06-28 02:05:20 -07:00
teknium1	64972b6403	fix(config): canonicalize model.name/model.model to model.default (#34500 ) A custom_providers config that names the model under model.name (or model.model) resolved to an empty model, so the API request went out with model= — HTTP 400 from OpenAI-compatible backends. Display paths (hermes status/dump) already read model.name and showed the model, making the failure silent. The model id was read via 'default or model' at ~14 independent sites (cli, gateway, cron, curator, oneshot, fallback, profiles, ...), none of which honored 'name'. Rather than patch every site, canonicalize at the single load/save chokepoint: _normalize_root_model_keys() now promotes model.model/model.name -> model.default (precedence default > model > name) and drops the stale alias, so every reader — present and future — sees a populated default and config.yaml is migrated canonical on next save. The gateway, which bypasses load_config(), replays the same normalization in _load_gateway_config(). Co-authored-by: Bartok9 <danielrpike9@gmail.com> Credit: root-cause analysis and fix direction from @Bartok9 (#34502, first) and @v86861062 (#34527).	2026-06-28 02:05:13 -07:00
Teknium	c9df4bc094	fix(gateway): default restart_drain_timeout to 0 to kill systemd crash loop (#54066 ) A restart now interrupts in-flight agents immediately rather than holding the gateway open for a grace window. The previous 180s default coupled two independently-set timers: the gateway's own drain timer and systemd's TimeoutStopSec. On a stale unit where TimeoutStopSec < drain, systemd SIGKILLed the gateway mid-cleanup, leaving a stale lock that made the next startup exit immediately ('already running') — an infinite crash loop under Restart=on-failure (#31981). Setting drain to 0 makes the mismatch structurally impossible: with drain 0 the generated unit gets TimeoutStopSec=90 against a near-instant drain, so systemd never kills mid-cleanup. Contract: restart the gateway, in-flight work stops. A grace window large enough to 'save' a long agent turn would have to outlast an unbounded task, which is impossible. Also fixes the stale-unit warning's suggested command (hermes gateway service install --replace -> hermes gateway install --force); the former subcommand does not exist. Closes #31981	2026-06-28 01:14:34 -07:00
teknium1	aacc15b2c9	fix(clarify): raise default clarify_timeout to 3600s (#32762 ) The 600s default evicted the gateway clarify entry while users were still away (meeting/AFK); a later button tap then landed on a dead entry and the agent hung on 'running: clarify'. Raise the default to 1h in DEFAULT_CONFIG and the get_clarify_timeout() code-level fallback, documenting the running-agent-guard tradeoff. User overrides still win.	2026-06-28 01:07:53 -07:00
teknium1	c918d42d88	feat(desktop): config-driven Electron launch flags + GPU policy Adds a desktop: section to config.yaml so headless/VM users can make `hermes desktop` launch correctly without a wrapper command: - desktop.electron_flags: extra Electron CLI flags (e.g. --ozone-platform=x11) appended to every launch. Accepts a list or a shell-split string. - desktop.disable_gpu: auto\|true\|false, bridged to the HERMES_DESKTOP_DISABLE_GPU env var the Electron app already reads. An explicit env var still wins. cmd_gui() reads these via _desktop_launch_options() and applies them. This is the config.yaml form of the capability proposed as a raw env var in #38934 (@1RB) — behavioral settings belong in config.yaml, not a new HERMES_* env var. Co-authored-by: ray <86501179+1RB@users.noreply.github.com>	2026-06-27 22:26:43 -07:00
Rafael Millan	54ea059919	fix: fall back to no-sandbox for desktop launch on restricted Linux hosts	2026-06-27 22:16:20 -07:00
Teknium	e3c9924b8b	fix(cli): correct stale `hermes auth login nous` hints to `hermes auth add nous` (#53929 ) * fix(cli): correct stale `hermes auth login nous` hints to `hermes auth add nous` There is no `hermes auth login` subcommand — valid auth verbs are add/list/remove/reset/status/logout/spotify. Six user-facing strings told users to run `hermes auth login nous`, which fails with `invalid choice: 'login'` — the same broken-hint class reported in #28089 for the proxy flow (already fixed there to `hermes auth add nous`). Sites corrected to `hermes auth add nous`: - hermes_cli/dashboard_register.py (401 retry hint, not-logged-in hint) - hermes_cli/gateway_enroll.py (401 retry hint, not-logged-in hint) - cli-config.yaml.example (two provider-requirement comments) * docs(infographic): auth login nous hint fix	2026-06-27 21:30:37 -07:00
Teknium	4626ceb747	fix(gateway): only offer system-scope gateway install to root sessions (#53975 ) Non-root users picking 'System service' in the setup wizard were handed a 'sudo hermes gateway install --system --run-as-user <you>' recipe that fails on most distros: sudo's secure_path strips ~/.local/bin (pipx/uv installs), so 'sudo hermes' is command-not-found. Worse, it funnels a non-root user toward a system install they shouldn't be doing from a user session. Now prompt_linux_gateway_install_scope() only offers system scope when os.geteuid()==0. Non-root sessions get user-service or skip, with a tip to re-run as root for a boot service. The non-root branch in install_linux_gateway_from_setup becomes a defensive guard that refuses without printing any self-elevation recipe. Gated the matching deferral hint in setup.py behind root too.	2026-06-27 21:24:08 -07:00
Priyanshu Sharma	f6deabca0d	fix(gateway): clear stale base_url on model switches	2026-06-27 21:23:25 -07:00
teknium1	f54c52800a	fix(models): scope live-first picker merge to opencode aggregators only Follow-up to the salvaged #49129 commit. The original change flipped the shared generic-provider merge in provider_model_ids() to live-first unconditionally, which regressed curated-first for single providers (kimi/zai, #46309) — and the PR encoded that regression by flipping the kimi-coding and zai test assertions to expect live-first. Gate live-first on an explicit _LIVE_FIRST_PICKER_PROVIDERS set ({opencode-zen, opencode-go}); every other provider keeps curated-first. Also widen the uncapped picker + live-first sets to opencode-go, which has the same 70+ model catalog problem as opencode-zen. Restore the kimi-coding curated-first test and rewrite the merge-order test to assert the per-provider contract.	2026-06-27 21:23:25 -07:00
Afnath Ahamed	f98ffbc246	fix(models): live-first merge + update opencode-zen catalog + uncap aggregator picker	2026-06-27 21:23:25 -07:00
Teknium	3b23a984b5	feat(kanban): stamp handoff freshness so workers don't read stale state as current (#53973 ) Multi-agent boards leak staleness: a sibling worker's parent handoff, comment, or prior-attempt summary gets read by the next worker as live truth even when it's a day old. build_worker_context surfaced the text with (at best) a bare absolute timestamp, which an LLM reads as fact regardless of age — parent results had no timestamp at all. Adds a coarse relative-age stamp (just now / 18h ago / 3d ago) to every recalled-state line and a one-line 'point-in-time snapshot, re-verify against source' frame on the parent-results section, so the worker sees when handoffs were produced and re-checks stale ones before acting.	2026-06-27 21:21:54 -07:00
Teknium	d43e0cf304	fix(agent): config-driven intent-ack continuation for all api_modes (#27881 ) (#53943 ) * fix(agent): config-driven intent-ack continuation for all api_modes (#27881) The agent could end a turn after only stating intent ('I will run a health check...') without executing the announced tool call, forcing the user to re-prompt. A continuation guard that catches this and nudges the model to proceed already existed but was hard-gated to the codex_responses api_mode, so Gemini/Claude/OpenRouter turns never benefited. - New agent.intent_ack_continuation config (default 'auto' = codex-only, byte-stable for existing conversations). 'true'/model-list opts every api_mode in; 'false' disables. Mirrors agent.tool_use_enforcement's shape. - looks_like_codex_intermediate_ack gains require_workspace (default True). The opted-in path drops the codebase/filesystem requirement so general autonomous workflows (server ops, deploys, API calls) are caught, not just coding tasks. Future-ack + action-verb + short-content + no-prior-tool guards still apply; the 2-nudge-per-turn cap is unchanged. - Resolution centralized in intent_ack_continuation_mode (off/codex_only/all). * docs(infographic): intent-ack continuation (#27881)	2026-06-27 20:46:00 -07:00
teknium1	9c6229ce24	fix(security): centralize credential-safe subprocess env (#29157 ) Subprocesses spawned outside the terminal/execute_code path (agent-browser, copilot ACP, dep-ensure, lazy_deps uv install, TUI Node host, cli.exec) inherited the operator's full credential environment via os.environ.copy(). The terminal path was already scrubbed by _HERMES_PROVIDER_ENV_BLOCKLIST (#1002/#1264/#32314); these spawn sites bypassed it. Adds hermes_subprocess_env(inherit_credentials=) in tools/environments/local.py reusing the existing dynamic blocklist as the single source of truth: - Tier 1 (_ALWAYS_STRIP_KEYS): gateway bot tokens, GitHub auth, infra secrets -- stripped even for credential-inheriting children. - Tier 2 (_HERMES_PROVIDER_ENV_BLOCKLIST): provider/tool keys -- stripped unless inherit_credentials=True. The opt-in is grep-able for audit. Browser worker keeps a _BROWSER_PASSTHROUGH_KEYS allowlist (BROWSERBASE/ FIRECRAWL) re-added after the strip. Model-driving children (ACP, TUI Node host, cli.exec) use inherit_credentials=True so they still get provider keys while losing Tier-1 secrets. Installers (dep-ensure, lazy_deps) inherit nothing sensitive. cua_backend already routed through _sanitize_subprocess_env on main -- left as-is. Gateway adapter utility spawns (gh pr comment, ffmpeg) are left inheriting env: gh needs GH_TOKEN by design, ffmpeg is a trusted system binary -- no untrusted-dependency exposure. This is defense-in-depth (personal-assistant trust model: same-user spawns), making the existing scrub policy uniform across the spawn surface; the main real payoff is shrinking the blast radius if a transitive npm dep in agent-browser is compromised. Reconstructed on current main from the design in #31959 (Tranquil-Flow); also credits #39003 (rodboev), #37843 (coygeek), #35769 (egilewski). Co-authored-by: Tranquil-Flow <tranquil_flow@protonmail.com> Co-authored-by: rodboev <rod.boev@gmail.com> Co-authored-by: egilewski <egilewski@egilewski.com>	2026-06-27 20:45:31 -07:00
kshitijk4poor	2af1678bfc	fix(auth): explicit provider intent beats stale OAuth active_provider (#29285 ) `resolve_provider("auto")` checked `auth.json` `active_provider` BEFORE the config.yaml `model.provider` and env-var API-key checks. So a user who was OAuth-logged-into one provider (e.g. Anthropic) but had set an explicit `model.provider` or exported an API key (e.g. `OPENAI_API_KEY`) was silently routed to the stale OAuth provider — the override was invisible and surprising. Reorder the auto-path so explicit intent wins (the order the issue asks for): 1. explicit CLI api_key/base_url 2. config.yaml `model.provider` (safety net — see below) 3. OPENAI_API_KEY / OPENROUTER_API_KEY env 4. OpenRouter credential pool 5. provider-specific API-key env vars 6. auth.json `active_provider` (OAuth) ← demoted to last-resort 7. AWS Bedrock credential chain 8. error `active_provider` is still honored — it's just a last-resort fallback chosen only when the user expressed no other preference, instead of overriding one. The normal chat/gateway/TUI/ACP/status path already resolves config.provider upstream in `resolve_requested_provider()` before "auto" is reached, so this duplicate config check is the safety net for the lone direct caller (`main.py` `resolve_provider("auto")`) and any future bypass. Because every surface funnels through this one resolver, the fix propagates everywhere with a single edit — no sibling path re-implements precedence. Also add a one-shot WARN when resolution lands on `active_provider` while a populated `model` config dict lacks a `provider` key — surfacing the silent override the issue reported without breaking first-install. Synthesizes the two competing PRs: #29615 (LifeJiggy — config-before-auth + the silent-override framing) and #29809 (Minksgo — the env-before-auth reorder). #29809 could not be merged directly (bundled unrelated, un-opt-in cost-tagging telemetry); its reorder idea is incorporated here and credited. Tests: tests/hermes_cli/test_provider_precedence.py — config/env beat stale OAuth, OAuth still used as last resort, explicit request short-circuits, WARN fires on silent fall-through. Full provider-resolution suites: 374 passed. Fixes #29285 Co-authored-by: LifeJiggy <141562589+LifeJiggy@users.noreply.github.com> Co-authored-by: Minksgo <153416856+Minksgo@users.noreply.github.com>	2026-06-27 19:49:02 -07:00
Teknium	45b2e4dd6b	fix(config): opt newer migrations out of default-stripping The salvaged #27354 fix made save_config strip schema-default leaves by default. Five migration sites added to main after the PR was authored still called bare save_config(config) and intentionally materialize a (often default-valued) key: model_catalog.ttl_hours, write_approval, curator.consolidate, agent.verify_on_stop, and the suspicious-MCP-server disable. Pass strip_defaults=False so those one-time deliberate writes survive, matching the opt-out the PR applied to the other migrations.	2026-06-27 19:38:11 -07:00
郝鹏宇	98488c4be4	fix(config): prevent save_config from materialising schema defaults Fixes #27354 Root cause: called during init (or by any code path that saves ) wrote injected schema defaults into config.yaml as if the user had authored them. Two fix layers: 1. now only injects when the user actually set somewhere (root or agent). A user who never set keeps it absent, so 's explicit-path detection won't treat it as user-authored. 2. gains a parameter and a new pass that removes keys matching unless those paths were explicitly present in the raw (pre-normalization) config on disk. Explicit-path detection uses on before any normalisation runs — preventing injected-in defaults from being mistaken for user-set values. All migration and edit-config call sites pass to preserve their intentional default-seeding behaviour. New helpers: - — collects leaf-key paths from a raw dict - — removes keys matching schema defaults Test coverage: 4 new regression tests (59 total, all passing).	2026-06-27 19:38:11 -07:00
konsisumer	8b4c29f0f0	fix(auth): preserve concurrently-added credentials on pool rewrite	2026-06-27 19:01:37 -07:00
Teknium	d3d621f7c3	revert(windows): roll back terminal-popup PRs #53791 #53810 #53829 (#53853 ) * Revert "fix(windows): capture is not a no-window boundary; route flashing spawns through chokepoint (#53829)" This reverts commit `2ecca1e7d3`. * Revert "fix(windows): stop terminal-window popups from background spawns (#53810)" This reverts commit `5db1430af9`. * Revert "fix(windows): stop subprocess console-window popups + add CI guard (#53791)" This reverts commit `ef17cd204d`.	2026-06-27 15:59:00 -07:00
Teknium	2ecca1e7d3	fix(windows): capture is not a no-window boundary; route flashing spawns through chokepoint (#53829 ) Follow-up to #53791 addressing review feedback: the footgun checker treated capture_output=/stdout=/stderr=/check_output as proof a subprocess can't pop a Windows console. That invariant is false — stream redirection controls where a child's output goes, not whether a console is allocated. From a console-less parent (Desktop/Electron, pythonw.exe, detached gateway/cron) a console-subsystem child still flashes a window even when fully captured. - check-windows-footguns.py: capture/redirect/check_output is no longer a blanket safe-pass. Added _WINDOWS_FLASHING_PROGRAMS (git/gh/npm/node/python/uv/ffmpeg/ docker/powershell/…); calls to those are flagged even when captured. Non-flashing programs keep the capture exemption (no 271-site noise). _subprocess_compat.run/ popen calls are inherently safe (wrapper injects CREATE_NO_WINDOW). - Routed the 35 genuine flashing git/gh/npm/uv/ffmpeg/docker spawns through the _subprocess_compat.run/popen chokepoint (Brooklyn's wrapper from #53810) — the durable fix, not per-site annotations. cmd.exe /c start stays # ok (intentional). - Updated tests + CONTRIBUTING.md rule #17 to the corrected invariant.	2026-06-27 14:49:41 -07:00
Gille	66aeda3550	fix(moa): keep virtual provider on MoA client	2026-06-27 14:20:51 -07:00
brooklyn!	5db1430af9	fix(windows): stop terminal-window popups from background spawns (#53810 ) * fix(windows): stop terminal-window popups from background spawns Native-Windows desktop/gateway users saw cmd/conhost windows flash on gateway restart, image paste, the dashboard Projects tree, voice notes, and ~5 min after closing the app (detached cron). Two root causes: - Console-subsystem exes (taskkill, schtasks, wmic, netstat, tasklist, agent-browser, git, ffmpeg, powershell, git-bash) spawned via raw subprocess allocate a fresh console when the launching process has none (pythonw desktop backend / detached gateway) - even with output captured. - uv venv pythonw shims re-exec console python.exe, so Python children get a console regardless of how they're launched. Fixes: - Single hidden-spawn primitive (_subprocess_compat.run/.popen) that ORs CREATE_NO_WINDOW on Windows, no-op on POSIX. Route every Hermes-owned console-exe spawn through it. - FreeConsole() catch-all in hermes_bootstrap: any Python child that exclusively owns an auto-allocated console detaches it at startup (GetConsoleProcessList()==1 gate leaves shared interactive consoles untouched). - Replace PowerShell/wmic gateway PID scans with in-process psutil. - Skip schtasks queries on non-interactive desktop restarts. - Prefer native agent-browser .exe over .cmd shims. - Guard test bans raw subprocess spawns of the Windows-only console tools repo-wide so the popup class can't regress. * fix(windows): scope FreeConsole to background entry points; fix merge fallout Console detach review (per #53810 feedback): GetConsoleProcessList()==1 can't tell a uv pythonw->python phantom console apart from a user opening the interactive CLI/TUI in its own fresh console (double-click, shortcut, ConPTY) — both report a single attached process with a tty. Running FreeConsole() in the import-time bootstrap therefore risked detaching a legitimately-interactive terminal. - Extract FreeConsole into explicit hermes_bootstrap.detach_orphan_console(); remove it from apply_windows_utf8_bootstrap() (import side effect). - Call it only from known background mains: gateway run, dashboard backend (start_server, what the desktop spawns), cron standalone, tui_gateway entry, slash worker. Interactive CLI/TUI never calls it. - Behavior-contract tests: frees only when solo owner, leaves shared console, no-op without console / on POSIX, and asserts it's not an import side effect. Merge fallout from origin/main (#53791): - local.py: 3-way merge left a dangling *_popen_kwargs (NameError crashing every terminal init). _subprocess_compat.popen already hides the window, so drop it. - discord adapter: merge stacked an undefined windows_hide_flags() onto the primitive call; drop the redundant arg. - test_gateway: scan now goes psutil-first (zero spawn); rewrite the case-variant test to drive that production path. test(claw): mock _subprocess_compat.run seam for Windows process scan claw.py's Windows tasklist/powershell scan routes through the hidden-spawn primitive; the tests still patched claw_mod.subprocess, so on win32 the mock was never hit and real spawns returned nothing. Patch the actual seam.	2026-06-27 14:02:24 -07:00
Teknium	ef17cd204d	fix(windows): stop subprocess console-window popups + add CI guard (#53791 ) * fix(windows): stop subprocess console-window popups + add CI guard The single biggest source of Windows 'terminal popup' bug reports was bare subprocess.run/Popen calls spawning a console window. The compat helpers (windows_hide_flags / windows_detach_popen_kwargs) already existed but the footgun checker had no rule to stop new bare calls from reintroducing the flash. - scripts/check-windows-footguns.py: new AST-based rule flagging subprocess calls that can create a new console — output-redirection-aware (capture/ redirect/check_output exempt) and POSIX-only-program-aware (launchctl/ systemctl/brew/etc. exempt). Comprehensive on real popups, no annotation burden on calls that can't flash. - Swept all genuine window-spawning sites through windows_hide_flags()/ windows_detach_popen_kwargs(); marked intentionally-visible launches (editor/terminal/foreground re-exec) with '# windows-footgun: ok'. - tests/scripts/test_windows_footgun_subprocess_rule.py: behavior-contract tests + full-repo cleanliness invariant. - CONTRIBUTING.md: documents the rule + the helper pattern. * test: accept creationflags kwarg in psutil_android fake_subprocess_run The Windows no-window sweep added creationflags=windows_hide_flags() to install_psutil_android.py's subprocess.run call; the test's fake stub had a fixed (cmd) signature and raised TypeError on the new kwarg.	2026-06-27 13:03:51 -07:00
ailthrim	25ec01f79f	fix(desktop): don't purge Electron cache / mirror-retry after a late build failure `hermes desktop` / `hermes update` recover from a corrupt Electron download by purging the cached zip + re-downloading and retrying the pack, and then by falling back to a public mirror. That recovery is only meaningful when the packaged executable is MISSING — the signature of a partial/corrupt unpack. A LATE failure such as macOS code signing (#40187) leaves `Hermes.app/Contents/MacOS/Hermes` (or the platform equivalent) in place. Re-downloading Electron can't repair a signing failure, so the purge + slow mirror retry just grind through another identical failure before the build finally errors out. Gate both recovery blocks on `_desktop_packaged_executable(desktop_dir) is None` so a build that already produced the executable fails fast instead of triggering the destructive download recovery. The corrupt-download path (executable missing) is unchanged. Salvage of #42782, re-applied onto current main (the surrounding recovery was refactored to `_electron_dist_ok` / `_redownload_electron_dist` since the PR was opened). Adds a regression test asserting no purge / mirror retry runs when the executable exists, and updates the existing retry/mirror tests to model the corrupt-download case (executable absent) the recovery is actually for. Related to #40187 (the residual cache-purge sub-issue; the signing failure itself is fixed by #52591).	2026-06-28 00:29:34 +05:30
teknium1	1ef19bad90	fix(model): show MoA preset picker on selection and label MoA in the banner Selecting 'Mixture of Agents' in the `hermes model` provider picker fell through silently — select_provider_and_model had no moa branch, so it just reprinted the current model/provider summary and exited. And the CLI session banner rendered the bare preset name (e.g. 'opus-gpt · Nous Research'), which is meaningless out of context. - Add _model_flow_moa: always lists the available presets (even one), then prints the full reference-models + aggregator breakdown for the selection and persists model.provider=moa / model.default=<preset> (dropping stale base_url + endpoint creds, since moa is a virtual local provider). - Wire the branch into select_provider_and_model. - build_welcome_banner takes provider; when 'moa' it renders 'MoA: <preset> · agg <aggregator>' instead of a bare slug. Both CLI call sites pass self.provider. Tests: 2 new banner tests (moa + non-moa unchanged); E2E verified the picker persists the preset and clears stale base_url/api_key.	2026-06-27 11:45:07 -07:00
Teknium	27322612b4	fix(update): route loud build/installer output to update.log instead of the terminal (#53616 ) * fix(update): route loud build/installer output to update.log instead of the terminal hermes update flooded the terminal with the full vite asset dump, electron-builder logs, npm deprecation warnings from the desktop build, and the cua-driver installer's 'Next steps' wall. All of that is low-signal noise the user doesn't need on a successful update. - Capture the desktop --build-only subprocess (vite + electron-builder) into ~/.hermes/logs/update.log; print a one-line status, and on failure surface the last 15 lines + a pointer to the full log. - Capture the cua-driver installer's output when verbose=False (the hermes update refresh path); concise upgrade line is unchanged. - Add _log_only_write() / _run_logged_subprocess() helpers that write to the update.log handle without echoing to the terminal. The repo-root npm install keeps streaming (capture_output=False) — that is the deliberate #18840 guard so a slow postinstall download doesn't look hung. The desktop npm install is a separate Electron process with no such progress concern and is captured. * fix(update): persist full cua-driver installer output to update.log The captured cua-driver installer output was only sent to logger.debug (agent.log) on failure, so the 'Next steps' wall was lost from update.log entirely on success. Write the full captured output straight to the update.log handle (sys.stdout._log) on both success and failure, matching the desktop-build capture, so update.log keeps the complete record of everything an update did.	2026-06-27 11:43:01 -07:00
Teknium	917f6bdb00	fix(tools): let vision pick any provider+model, not just OpenRouter (#53606 ) * fix(tools): let vision pick any provider+model, not just OpenRouter hermes tools → configure → vision no longer forces an OPENROUTER_API_KEY. It now offers the same any-provider surface as the model command: Auto (use main model / aggregator fallback), pick any authenticated provider + model, or a custom OpenAI-compatible endpoint. Selections persist to auxiliary.vision.{provider,model,base_url} — the keys the vision resolver already reads. Custom endpoint pins provider=custom so base_url routes correctly. Reconfigure path uses the same picker instead of re-prompting for OPENROUTER_API_KEY. * docs: add PR infographic for vision any-provider picker	2026-06-27 04:41:42 -07:00
ms-alan	16192103f4	fix(config): accept placeholder base_url in custom provider validation _normalize_custom_provider_entry() ran urlparse() on base_url and dropped any entry whose value was an un-expanded placeholder, so a caller reaching the normalizer with raw config (e.g. the Dockerized gateway path) silently skipped the provider with a 'not a valid URL' warning. Skip URL validation when the candidate contains a placeholder token — both ${ENV_VAR} env-refs and bare {region}-style templates — since those are expanded at runtime. Closes #14457	2026-06-27 04:15:27 -07:00
Teknium	5ab4136631	fix(webui): switch provider when Config-page model field changes (#53583 ) The dashboard Config tab's Model field is a flat string with no provider info. _denormalize_config_from_web only updated model.default and kept the stale provider, so picking an OpenRouter model while the default provider was ollama-local left provider=ollama-local and every call 404'd. When the model string actually changes, infer the serving provider — curated catalog first, then a vendor/model-slug heuristic for non-aggregator providers — and route the switch through the existing _normalize_main_model_assignment / _apply_main_model_assignment chokepoints so stale base_url/api_mode/api_key are cleared on a provider change and preserved on a same-provider re-pick. Saving an unchanged model never re-detects, so unrelated config saves keep an explicit provider. Closes #14058	2026-06-27 04:13:44 -07:00
blaryx	76af2456a2	fix(dashboard): merge PUT /api/config with existing on-disk config The dashboard form is built from CONFIG_SCHEMA, which doesn't enumerate every root-level key the YAML supports. Most visibly, `custom_providers` is in `_KNOWN_ROOT_KEYS` but is absent from the schema — so the frontend never sends it in the PUT body. The previous full-replace save() then silently wiped the key from disk every time the user clicked anything that triggered a save. Other casualties (less visible because defaults re-mask them on load) include `agent.personalities`, `agent.reasoning_effort`, `terminal.lifetime_seconds`, etc. Fix: read the raw on-disk config and deep-merge the incoming PUT body on top of it before saving. The frontend can only overwrite what it explicitly sends; everything else is preserved verbatim. Reuses the existing `_deep_merge` helper from `hermes_cli.config`. Tests: - `test_round_trip_preserves_custom_providers` exercises the exact bug: seed config with custom_providers, GET → drop the key → PUT, assert it's still on disk. - `test_round_trip_preserves_schema_invisible_nested_keys` covers the shallow-vs-deep-merge case for nested dicts under `agent` etc. Both fail on current main; both pass with this patch.	2026-06-27 03:48:18 -07:00
teknium1	a5d1f68c74	refactor(moa): share one virtual-provider row builder across pickers Follow-up on the gateway-picker salvage: the cherry-picked change added a second copy of the MoA virtual-provider row in model_switch.py, duplicating inventory._moa_provider_row (same slug/name/preset-models, identical extra fields). Make _moa_provider_row take a bare current_provider string and reuse it from the gateway picker path so the row shape lives in one place and the two surfaces can't drift.	2026-06-27 03:43:38 -07:00
dodo-reach	ed54469d06	fix(gateway): show MoA presets in model picker	2026-06-27 03:43:38 -07:00
briandevans	8dd4e576d0	fix(moa): tolerate non-list reference_models in hand-edited MoA preset config	2026-06-27 03:43:16 -07:00
Teknium	60f58a2b95	feat(verify-on-stop): default OFF, one-time migration, skip doc-only edits (#53552 ) The verify-on-stop guard fired too eagerly — including on doc/markdown/skill edits with nothing to verify, where it pushed a pointless /tmp verification script. Three changes: 1. Default OFF for new installs: agent.verify_on_stop defaults to false (was the "auto" surface-aware sentinel). _config_version bumped 30 -> 31. 2. One-time migration (v30 -> v31): existing installs are switched off once, but only when the value is missing or still the "auto" sentinel — an explicit true/false the user set is preserved. 3. Path filter: build_verify_on_stop_nudge() now drops documentation/prose paths (.md/.mdx/.rst/.txt/LICENSE/CHANGELOG/...) so even when explicitly enabled, a doc-only turn never nudges. Mixed doc+code turns still nudge on the code paths. The legacy "auto" sentinel is still honored when set explicitly (ON for interactive coding surfaces, OFF for messaging). HERMES_VERIFY_ON_STOP env override unchanged.	2026-06-27 03:23:22 -07:00
Versun	c655cdf2c1	feat(dashboard): expose cron job execution fields	2026-06-27 03:20:32 -07:00
teknium1	50f6855217	feat(moa): make /moa one-shot only; route preset switching through the model picker /moa no longer does a sticky model switch. It now always runs a single prompt through the default MoA preset and restores the prior model afterward; the whole argument is the prompt (no preset-name matching). To switch to a MoA preset for the session, select it from the model picker, where presets already surface under a virtual Mixture of Agents provider on every model-selection surface. Also fixes #53444: the TUI one-shot only set session[model_override], which the already-built cached agent ignored, so MoA silently never ran and the turn used the original model. The TUI now does a real in-place agent.switch_model() via _apply_model_switch() when a live agent exists (with a proper restore after the turn), and falls back to a model_override for lazy/unbuilt sessions. Removes the redundant sticky-switch branch from the CLI, gateway, and TUI /moa handlers; updates the command description, usage string, and docs.	2026-06-27 03:09:09 -07:00
Teknium	d712a7fd73	fix(model-picker): surface the current custom/uncurated model in picker rows (#53457 ) A model selected via the CLI (e.g. /model openrouter/<uncurated-name>) was absent from every model picker — the main picker AND the MoA reference/ aggregator slot pickers — because each provider row only carried its curated catalog. Inject the current model at the front of its provider's row so it is selectable and shown everywhere.	2026-06-27 00:06:34 -07:00
Nacho Avecilla	dbe734beff	fix(dashboard-auth): exclude non-interactive providers from interactive login surfaces (#53239 ) * Return None instead of erroring on drain login failure * Fix login on drain * Remove login for drained endpoints flow and clean the code * chore: drop unrelated credits changes from this PR * Remove extra comments that were not really necessary	2026-06-27 10:08:13 +10:00
zapabob	e55ddc3e33	fix(mcp): suppress interactive OAuth stdin prompts during background discovery (#35927 ) When an MCP server requires OAuth, the interactive `hermes` TUI froze on startup: background MCP discovery hit the OAuth flow, which on an interactive TTY spawns a daemon thread doing a blocking `sys.stdin.readline()` (the "paste the redirect URL" fallback in mcp_oauth._wait_for_callback). That thread competes with the TUI's own stdin reader for the same terminal, so keystrokes get swallowed and the TUI appears frozen (up to the 300s OAuth timeout). Reported symptom: "MCP OAuth: authorization required / Open this URL ... the tui is freezing, not respond to typing." Add a thread-local `suppress_interactive_oauth()` context manager in tools/mcp_oauth.py; `_is_interactive()` returns False while it's active, so the stdin paste-thread and prompt are never created. Background discovery (hermes_cli/mcp_startup.py, tui_gateway/entry.py) now runs discovery inside that context, so OAuth-requiring servers soft-skip (raise OAuthNonInteractiveError, already handled) instead of stealing the TUI's stdin. A real `hermes mcp login` on the main thread is unaffected (thread-local). Salvaged from #35945 by @zapabob (authorship preserved via cherry-pick; resolved a conflict against main's new mcp_discovery_timeout / wait_for_mcp_ discovery refactor, keeping both). Verified E2E: with suppression the paste prompt is NOT printed and no stdin thread spawns (raises OAuthNonInteractive soft-skip); without it the prompt shows (the freeze). Mutation-verified (removing the suppress check in _is_interactive fails the regression test). 76 tests pass, ruff clean. Closes #35927. SELF-REVIEW FIX: the original #35945 used threading.local(), which does NOT propagate to the dedicated mcp-event-loop thread where OAuth actually runs (discover_mcp_tools dispatches the connect via run_coroutine_threadsafe), so the suppression was a NO-OP in production (the tests passed only by stubbing out the cross-thread dispatch). Converted to a contextvars.ContextVar, which asyncio copies onto the scheduled coroutine — empirically verified suppression now holds on the mcp-event-loop thread through the real _run_on_mcp_loop path. Added a cross-thread regression test (fails on threading.local, passes on the ContextVar) so the no-op can't regress.	2026-06-27 04:59:23 +05:30

1 2 3 4 5 ...

3108 commits