hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-09 08:21:50 +00:00

Author	SHA1	Message	Date
teknium1	55b83c3d99	refactor(agent): extract run_conversation post-loop tail into finalize_turn (god-file Phase 1) Lift the post-loop finalization tail out of run_conversation into agent/turn_finalizer.py:finalize_turn. Behavior-neutral; run_conversation 4204 -> 3846 LOC, conversation_loop.py 4578 -> 4220. The region (everything after the main tool-calling while loop): budget-exhaustion summary, trajectory save, session persist, turn diagnostics, response transforms, result-dict assembly, steer drain, and the memory/skill review trigger. Lifted verbatim into a synchronous single-return free function; the 12 post-loop locals it reads are passed as keyword args and the assembled result dict is returned to run_conversation (which returns it to the caller). All agent.* side effects fire exactly as before. Imports: os + _summarize_user_message_for_log at module top; logger lazy from agent.conversation_loop (preserves the 'agent.conversation_loop' logger name, no import cycle). Validation: 1609/1609 tests/run_agent/ pass; live PTY agent turn PASS.	2026-06-08 09:42:23 -07:00
teknium1	a706a349b5	refactor(gateway): extract authorization cluster into GatewayAuthorizationMixin (god-file Phase 3) Lift the 4 inbound-message authorization methods out of GatewayRunner into gateway/authz_mixin.py:GatewayAuthorizationMixin. Behavior-neutral; gateway/run.py 16200 -> 15812 LOC. Methods moved (~389 LOC): _is_user_authorized, _get_unauthorized_dm_behavior, _adapter_dm_policy, _adapter_enforces_own_access_policy. The two adapter-policy helpers are private to _is_user_authorized, so the cluster is fully self-contained (zero outside-cluster self.method calls after the lift). All self.* calls resolve unchanged via the MRO (GatewayRunner(GatewayAuthorizationMixin, ...)). Import split: 6 neutral deps (os, Optional, Platform, SessionSource, the two whatsapp_identity helpers) at the mixin module top; the module-level logger is imported lazily inside _is_user_authorized (from gateway.run import logger) so the mixin never imports gateway.run at module scope -> no cycle. The lazy import preserves the exact logger name (gateway.run) so log records are unchanged.	2026-06-08 09:42:02 -07:00
teknium1	094aa85c37	refactor(cli): extract agent-construction cluster into CLIAgentSetupMixin (god-file Phase 4) Lift the 5 agent-construction/session-resume methods out of HermesCLI into hermes_cli/cli_agent_setup_mixin.py:CLIAgentSetupMixin. Behavior-neutral; cli.py 14139 -> 13492 LOC. Methods moved (~647 LOC): _ensure_runtime_credentials, _resolve_turn_agent_config, _init_agent, _preload_resumed_session, _display_resumed_history. All self.* calls resolve unchanged via the MRO (HermesCLI(CLIAgentSetupMixin, CLICommandsMixin)). Import split (same recipe as #41942): 2 neutral deps (sys, _escape) imported at the mixin module top; 12 cli.py-internal helpers/constants (AIAgent, ChatConsole, CLI_CONFIG, _cprint, _DIM, _RST, _accent_hex, ...) imported lazily per-method (from cli import ...) so the mixin never imports cli at module scope -> no cycle. Repointed one source-inspection change-detector (test_callable_api_key.py) to read the mixin file where the method now lives.	2026-06-08 09:41:34 -07:00
qWait	cef00ae602	fix(tui): handle Windows PTY stdin and detached WS frames (#41953 ) Two narrow Windows desktop fixes: 1. tools/process_registry.py — PTY stdin writes are now platform-aware. pywinpty (Windows) expects str; ptyprocess (POSIX) expects bytes. Previously bytes was unconditionally passed, producing a TypeError on Windows ("'bytes' object cannot be converted to 'PyString'"). 2. tui_gateway/server.py + ws.py — Detached WebSocket sessions now park on a _DropTransport sink instead of _stdio_transport. In the desktop the gateway runs in-process and stdout is captured by Electron into desktop.log, so falling back to stdio leaked raw JSON-RPC frames into the desktop log after WS disconnects. Orphan-reap semantics are preserved via _ws_session_is_orphaned. Verified on a Windows desktop install: - pywinpty 2.0.15 rejects bytes / accepts str — reproduced exactly - Focused suite green (write_stdin × 2, write_json_drops_detached_ws_frames, ws_orphan_reap × 2) - All 6 CI test shards green, e2e green, nix (ubuntu/macos) green Salvage commit (`21be7ca`) fixes the new test referencing an undefined _ThreadUnsafeStdout — uses the existing _ChunkyStdout helper.	2026-06-08 09:41:20 -07:00
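The platform split described in fix 1 of cef00ae602 can be sketched as below. This is a minimal illustration, not the code in tools/process_registry.py: the helper name `encode_pty_input` and its `platform` parameter are hypothetical, and UTF-8 is an assumed encoding.

```python
import sys


def encode_pty_input(data: str, platform: str = sys.platform):
    """Return the payload type the PTY backend expects (hypothetical helper).

    pywinpty (Windows) takes str; ptyprocess (POSIX) takes bytes.
    """
    if platform.startswith("win"):
        return data  # pass text through unchanged on Windows
    return data.encode("utf-8")  # assumed encoding for the POSIX side
```

Passing the platform as a parameter keeps the branch testable on any host; a real call site would presumably just rely on `sys.platform`.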
Teknium	74744795af	docs(tui): correct HERMES_TUI_GATEWAY_URL — dashboard-internal, not remote-attach (#42162 ) The TUI docs presented HERMES_TUI_GATEWAY_URL + /api/ws as a supported 'attach the TUI to a standalone running gateway' workflow. It isn't. /api/ws exists only inside the dashboard's FastAPI server (hermes_cli/web_server.py), which spawns its own embedded TUI child and injects the var as an internal wiring detail. The OpenAI-compat API server (api_server platform) deliberately does not serve /api/ws, so the documented ws://host:port/api/ws workflow 404s — the cause of #32882 and the two PRs (#32904, #32955) that tried to add the route to the wrong surface. Rewrites the section in en + zh-Hans to describe the var accurately and point users at shared state.db / dashboard embedded chat for multi-surface session sharing.	2026-06-08 09:37:03 -07:00
Teknium	399b8ee5f0	fix(anthropic): strip Responses-only kwargs before Messages SDK call (#31673) (#42155) A Responses-API-shaped payload carrying instructions=/input=/store=/parallel_tool_calls= can reach the native Anthropic messages.stream() / messages.create() call under a rare api_mode-flip race (e.g. a concurrent auxiliary vision call mutating a shared agent between the kwargs build and the stream dispatch). The Anthropic SDK rejects these with a non-retryable TypeError that kills the whole turn and propagates the entire fallback chain. Add sanitize_anthropic_kwargs() at both Anthropic dispatch sites: it drops the Responses-only keys in place and logs a WARNING (with #31673 breadcrumb) when one is present, so the underlying race stays visible in the wild instead of being silently papered over.	2026-06-08 09:36:38 -07:00
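A rough sketch of what a sanitizer like the one in 399b8ee5f0 might look like. Only the function name, the four key names, the in-place drop and the WARNING come from the commit message; the constant name, the log wording and the returned list are assumptions for illustration.

```python
import logging

logger = logging.getLogger(__name__)

# Keys named in the commit message as Responses-API-only.
RESPONSES_ONLY_KEYS = ("instructions", "input", "store", "parallel_tool_calls")


def sanitize_anthropic_kwargs(kwargs: dict) -> list:
    """Drop Responses-only keys from kwargs in place.

    Returns the dropped key names (the return value is an assumption).
    """
    dropped = [key for key in RESPONSES_ONLY_KEYS if key in kwargs]
    for key in dropped:
        del kwargs[key]
    if dropped:
        # Warn rather than stay silent so the underlying race remains visible.
        logger.warning("Dropped Responses-only kwargs (#31673): %s", dropped)
    return dropped


payload = {"model": "m", "messages": [], "input": "x", "store": True}
dropped_keys = sanitize_anthropic_kwargs(payload)
```

Mutating in place means the dispatch site can keep passing the same dict on to the SDK call without rebinding it.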
Teknium	47d5177a7d	fix(plugins): thread-safe lazy-singleton helpers; fix honcho TOCTOU (#24759 ) (#42150 ) * fix(plugins): add thread-safe lazy-singleton helpers, fix honcho TOCTOU (#24759) get_honcho_client() and fal's _load_fal_client() used unlocked check-then-init: racing threads both ran the expensive build and the loser's client (open connection) leaked. Rather than one-off locks, add plugins/plugin_utils.py with two reusable primitives every plugin author can drop in: - lazy_singleton: decorator for zero-arg accessors - SingletonSlot: manual slot for config-keyed accessors (first wins) Both use double-checked locking; factory runs at most once; failed builds aren't cached. honcho is the reference consumer; fal's sibling TOCTOU gets a matching double-checked lock. Plugin dev guide documents the pattern so future plugins don't reintroduce the race. Closes #24759 * test(honcho): update reset test for SingletonSlot internals test_reset_clears_singleton poked the removed _honcho_client module global directly. Assert through the slot's public peek() surface instead, matching the #24759 refactor.	2026-06-08 09:35:22 -07:00
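The double-checked locking that 47d5177a7d describes for `lazy_singleton` can be sketched as follows. This is an illustrative version under stated assumptions, not the contents of plugins/plugin_utils.py: the sentinel, the closure layout and the `get_client` demo are all hypothetical.

```python
import functools
import threading

_UNSET = object()  # sentinel, so a factory may legitimately return None


def lazy_singleton(factory):
    """Wrap a zero-arg factory so it runs at most once across threads."""
    lock = threading.Lock()
    value = _UNSET

    @functools.wraps(factory)
    def accessor():
        nonlocal value
        if value is _UNSET:            # fast path: no lock once built
            with lock:
                if value is _UNSET:    # re-check while holding the lock
                    value = factory()  # if this raises, nothing is cached
        return value

    return accessor


build_count = 0


@lazy_singleton
def get_client():
    """Hypothetical expensive accessor used only for the demo."""
    global build_count
    build_count += 1
    return object()


threads = [threading.Thread(target=get_client) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The second check under the lock is what closes the check-then-init window: a thread that lost the race sees the value already set and skips the build.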
yoniebans	74239b4942	i18n(desktop): translate backend update apply status messages Two independent reviewers flagged that applyBackendUpdate's in-progress and error messages were inline English while the rest of the update overlay is i18n'd. Move them into updates.applyStatus (preparing/pulling/restarting/notAvailable/failed/noReturn) across en, ja, zh, zh-hant + types.	2026-06-08 08:58:26 -07:00
yoniebans	b000e05b11	fix(desktop): don't claim the backend update succeeded when it never returns The no-return error said 'Backend updated but did not come back online' — but once the connection drops the client can't know the update's exit code, only that it was started and the backend is unreachable. Reword to not overclaim: the update may not have completed.	2026-06-08 08:58:26 -07:00
yoniebans	cd030f5f40	fix(desktop): close the backend update overlay on success; error on no-return Three rough edges in the remote backend apply flow: - On success the overlay dropped to IDLE, briefly re-rendering the pre-install 'update available' view and then the generic 'you're all set' before settling. Close the overlay outright once the backend is confirmed back instead of bouncing through the idle view. - If the backend never came back (a failed restart), the flow still reported success. waitForBackendReturn now returns whether the backend answered; finishBackendApply surfaces an error when it didn't. - The up-to-date copy said 'you're running the latest version', conflating client and backend. Backend target now reads 'the backend is running the latest version' — the client's own version is a separate pill.	2026-06-08 08:58:26 -07:00
yoniebans	81647458c7	fix(desktop): recover the backend update overlay after the remote restarts The backend Install path set stage:'restart' and stopped — in remote mode no boot-progress events arrive to carry the overlay to done, so it sat on the restarting spinner until a manual reload while the backend had already come back. Poll the backend until it answers again, then clear the overlay and refresh the backend status. Target-aware applying copy explains the remote restart + auto-reconnect instead of the local-updater-window wording. Also switch the apply poll sleeps from window.setTimeout to globalThis.setTimeout so the flow is exercisable off the renderer.	2026-06-08 08:58:26 -07:00
yoniebans	9b2a64fa6a	fix(desktop): reflect env-override remote in gateway connection state HERMES_DESKTOP_REMOTE_URL forces a remote connection but never writes connection.json, so the gateway panel read mode/url from persisted config and mislabelled an env-remote session as local with no url.	2026-06-08 08:58:26 -07:00
yoniebans	47518bc913	fix(desktop): check backend updates when the connection becomes remote The poller starts at mount, before the gateway connects, so its initial checkBackendUpdates() ran while mode was still unset and no-op'd via the remote-mode guard — leaving the backend button empty until the user clicked it. Subscribe to $connection and re-check the backend when mode resolves to remote.	2026-06-08 08:58:26 -07:00
yoniebans	cfaa46fcae	fix(desktop): pre-check backend updates in poller; client button first Two follow-ups from testing the two-button bar: - The background poller and focus handler only checked the client, so the backend behind-count and changelog stayed empty until the user opened the overlay — and the overlay's first render then hit the empty-commits fallback ('Improvements and fixes') instead of the real changelog. Check the backend alongside the client on poller start, interval, and focus so its state is ready before the button is clicked. - Order the status bar client-first, backend-second.	2026-06-08 08:58:26 -07:00
yoniebans	56be1a63a3	fix(desktop): split client and backend into two distinct update buttons The status bar merged both versions into one pill with a single click target, so there was no way to tell which artifact an update acted on — and the apply path was overloaded by connection mode. Separate them: - store: independent client (checkUpdates/applyUpdates) and backend (checkBackendUpdates/applyBackendUpdate) flows with their own status/apply atoms; openUpdateOverlayFor(target) drives the overlay. - status bar: two buttons — client vX (always) and backend vY (+N) (remote only), each with its own behind-count, opening the overlay for its target. - overlay: reads the active target's atoms; install/check route per target. Removes the version-bar merge helper (no longer merging the two versions).	2026-06-08 08:58:26 -07:00
yoniebans	9c264555b0	fix(desktop): name the update target in the overlay; honest no-changelog copy The updates overlay showed generic 'New update available / improvements and fixes' with no indication of whether it was updating the client or the backend. In remote mode it now reads 'Backend update available' and names the connected backend, and when there's no commit changelog (e.g. pip/non-git backend) it degrades to honest 'release notes aren't available for this install type' copy instead of filler. Copy selection extracted to a pure resolveUpdateCopy() helper (unit-tested); threads target ('client'\|'backend') from connection.mode through the overlay.	2026-06-08 08:58:26 -07:00
yoniebans	87ac7cac13	fix(dashboard): log update changelog against origin/main, not @{upstream} The behind-count (banner._check_via_local_git) measures HEAD..origin/main, but _recent_upstream_commits logged HEAD..@{upstream}. On a feature-branch checkout @{upstream} is the branch's own tip (0 commits), so the changelog came back empty while behind>0 — the overlay then showed generic filler instead of what changed. Pin the commit range to origin/main so count and changelog agree. Verified against a checkout 11 behind origin/main: now returns 11 commits.	2026-06-08 08:58:26 -07:00
yoniebans	64da518db4	feat(desktop): remote update overlay sourced from backend In remote mode, checkUpdates()/applyUpdates() branch on connection.mode and drive the existing updates overlay from the connected backend instead of the local Electron git bridge: - checkUpdates -> GET /api/hermes/update/check, mapped onto DesktopUpdateStatus (behind, commits, supported=can_apply, message). The overlay renders the commit list as 'what's changed' and shows guidance (not Install) when the backend install can't self-apply (docker/nix). - applyUpdates -> POST /api/hermes/update (the proven command-center path), polling the action to completion and handling the expected mid-update connection drop as the restart phase. Local mode is unchanged. Adds checkHermesUpdate() to hermes.ts and a BackendUpdateCheckResponse type.	2026-06-08 08:58:26 -07:00
yoniebans	ed1e2533b7	feat(desktop): show client and backend versions in status bar when remote In remote thin-client mode the Electron client and the backend it connects to are separate installs that drift independently. The status bar previously showed only the client version, hiding skew (e.g. client 0.15.1 talking to backend 0.16.0 looked fine). Add a pure resolveVersionBar() helper (unit-tested) that, gated on connection.mode === 'remote', renders both 'client vX · backend vY' from the desktop appVersion and StatusResponse.version, and flags skew. Local mode is byte-identical to before. Wire it into the status-bar version item.	2026-06-08 08:58:26 -07:00
yoniebans	2284147044	docs: document commits field on /api/hermes/update/check	2026-06-08 08:58:26 -07:00
yoniebans	9e360681f8	feat(dashboard): return recent commits from /api/hermes/update/check Add a best-effort `commits` list (sha/summary/author/at) to the update-check response for git/pip installs that are behind upstream, so the desktop's remote update overlay can show what's changed before applying. Additive and non-breaking: existing consumers (legacy dashboard, tests using subset assertions) ignore the new field. Leaves the shared check_for_updates() int contract untouched — commits come from a separate best-effort git call.	2026-06-08 08:58:26 -07:00
Teknium	fd1e7c2bc3	fix(tui): install the process.on('exit') terminal-mode backstop (#42165 ) #19194's fix added process.exit(0) to die()/dieWithCode() with a comment relying on a process.on('exit') handler in entry.tsx that resets terminal modes — but that handler was never installed. So /quit, Ctrl+C, Ctrl+D and every process.exit() path left DEC mouse tracking (?1000/1002/1003/1006) armed in the parent shell. The terminal then kept emitting mouse reports into stdin — read as keystrokes by the shell or a freshly relaunched TUI — surfacing as ...;...M garbage in the input box. Install the missing handler. 'exit' fires once on real termination and runs synchronous code only; resetTerminalModes() writes via writeSync, so the disable sequence lands before the process is gone. Fixes #28419	2026-06-08 08:21:19 -07:00
Siddharth Balyan	7230fcb7f2	revert(nix): drop the cp patchPhase workaround from #41867 (#42151 ) #41867 replaced mkNpmPassthru's patchPhase with `cp $npmDeps/package-lock.json package-lock.json`, on the theory that prefetch-npm-deps strips advisory fields (engines/os/cpu) from the cache lockfile. That diagnosis was wrong. prefetch-npm-deps copies the lockfile into the cache verbatim (prefetch-npm-deps/src/main.rs reads it and writes it unchanged). Building the cache fresh from the current root lockfile yields exactly the pinned npmDepsHash, and that cache's package-lock.json is byte-identical to the source (740 "engines" blocks on each side). With the hash correct, npmConfigHook's consistency check passes on its own — verified by building .#tui and .#default green with this (original) patchPhase. So the cp was unnecessary, and worse: it bypasses the consistency check wholesale, silently masking a genuinely stale npmDepsHash (a lockfile that changed without its hash being refreshed) instead of failing loudly. The original patchPhase keeps the check meaningful while still handling the one real cosmetic difference it was written for (trailing newlines); stale-hash drift is caught by the npmDepsHash itself plus the auto-fix workflow. Keeps the fix-lockfiles real-build verification and the nix-lockfile-fix.yml file-path fix from #41867 — only the patchPhase cp is reverted.	2026-06-08 20:29:41 +05:30
Siddharth Balyan	4219a91df5	fix(nix): make config.yaml group-writable under addToSystemPackages (#41940 ) addToSystemPackages exports HERMES_HOME system-wide and puts the hermes CLI on interactive users' PATH, so those users (in the hermes group) share the gateway's state — that's the option's whole purpose. But the activation script wrote config.yaml as 0640 (group read-only), so an interactive user saving a setting via the CLI/TUI hit: error: [Errno 13] Permission denied: '/var/lib/hermes/.hermes/config.yaml' Make the mode conditional: 0660 when addToSystemPackages is set (group hermes can write), else the previous 0640. .env stays 0640 either way — it holds secrets, not user-facing settings. The config merge already preserves user-added keys across rebuilds, so this simply lets interactive hermes-group users actually make those edits. Verified by evaluating the module's activation script for both option values: addToSystemPackages=true -> chmod 0660, false -> chmod 0640.	2026-06-08 20:10:47 +05:30
Teknium	a3fca26c56	fix(tui): close slash_worker inside _finalize_session (defense-in-depth, #38095 ) (#42149 ) Fold the slash-worker subprocess close into _finalize_session itself — the single _finalized-guarded session-end chokepoint — instead of relying on each caller (_teardown_session, _shutdown_sessions) to close it separately. A future code path that finalizes a session directly can no longer reintroduce the #38095 worker leak. Idempotent: _SlashWorker.close() is poll()-guarded and _finalize_session short-circuits on _finalized, so the existing teardown paths are unaffected. Drops the now-redundant separate close() in _shutdown_sessions. Note: the active leak this issue reported was already fixed on main (WS-orphan reaper #38591, _restart_slash_worker close, atexit shutdown). This addresses the residual defense-in-depth gap the reporter correctly identified in their follow-up comment.	2026-06-08 07:26:05 -07:00
Teknium	5e06c9ffef	fix(agent): clear _session_messages in AIAgent.close() (#42123) close() is the hard teardown for true session boundaries (/new, /reset, session expiry). It already closes the OpenAI client and child agents but left the conversation-history list intact. Mirror the soft-eviction path (_release_evicted_agent_soft clears _session_messages) so a held reference to a closed agent (e.g. a draining background task) doesn't pin tens of MB of tool outputs until the agent object itself is collected.	2026-06-08 07:03:39 -07:00
teknium1	cb13723f53	fix(pty-bridge): mark os.killpg/getpgid windows-footgun-ok (POSIX-only module)	2026-06-08 07:03:12 -07:00
teknium1	8cb1908e18	chore: map paulb26 in AUTHOR_MAP for #24135 salvage	2026-06-08 07:03:12 -07:00
firefly	8b6a8f667d	feat(slash-worker): self-terminate on parent death via create_time watchdog Daemon thread polls _is_orphaned (original ppid check + psutil create_time PID-reuse guard, no PR_SET_PDEATHSIG). On orphan, drains an in-flight command up to a grace window then os._exit(0). Started before the HermesCLI build to cover the spawn window. Task: swl-qrf.8	2026-06-08 07:03:12 -07:00
paulb26	b31c6c33b2	fix(pty-bridge): terminate PTY process groups on teardown	2026-06-08 07:03:12 -07:00
Teknium	e9c1e757fe	fix(gateway): release evicted agent clients to stop RSS leak (#29298 ) (#41974 ) _evict_cached_agent (the chokepoint for /new, /model, /undo, session resets — 17 call sites) only popped the cache entry, dropping the AIAgent reference without releasing its httpx client pool. AIAgent holds reference cycles (callbacks, tool state) so CPython refcounting does not free the client promptly; under steady gateway traffic the held sockets + buffers accumulate and RSS climbs (the leak class behind Now the chokepoint pops AND schedules a soft release_clients() on a daemon thread (mirrors the cap-enforcer / idle-sweeper). Soft release frees the client pool + per-turn child subagents but preserves the session's terminal sandbox / browser / bg processes for resumption. Mid-turn agents are skipped so a running request is never torn down. Also fixes the no-lock branch which previously never popped at all.	2026-06-08 06:44:51 -07:00
Michael Steuer	3d029a53ec	fix(gateway): close residual memory-leak sites under heavy scheduled workload Long-lived gateways under heavy cron/build workloads grow steadily (~18 MB/hr post-phantom-dispatch-fix) and eventually need a restart-or-OOM. Four retention sites, all confirmed live on current main: 1. _evict_cached_agent() (/model, /reasoning, codex-runtime, /undo, etc.) popped the cache entry without releasing the agent's OpenAI client, httpx transport, SSL context, or conversation history. Only /new cleaned up first. Now releases clients on a daemon thread, matching _enforce_agent_cache_cap. 2. _release_evicted_agent_soft() now clears _session_messages after release_clients() — tool outputs (file reads, terminal output, search results) can be tens of MB per 100+-tool-call session; the list is rebuilt from persisted session JSON on resume, so dropping it on soft eviction is safe. 3. The session-expiry watcher (permanent finalization) now drops the session's per-session control dicts (_session_model_overrides, _session_reasoning_overrides, _pending_approvals, _update_prompt_pending, _pending_model_notes). These leaked one entry per session per gateway lifetime. NOTE: this is the session-finalize path, NOT idle agent-cache eviction — an idle-evicted session is still alive and rebuilds its agent from these overrides, so pruning them there would silently reset a user's /model choice. 4. _tool_defs_cache is now bounded (_TOOL_DEFS_CACHE_MAX=8) with oldest-first eviction instead of growing unboundedly across the distinct toolset/config fingerprints a gateway sees over its lifetime. Salvaged from #25318 by Michael Steuer (@mssteuer); fix 3 redirected from the idle-sweep to the session-finalize lifecycle, magic number 8 lifted to a named constant, test ported. Fixes #19251 Co-authored-by: Michael Steuer <michael@make.software>	2026-06-08 06:32:42 -07:00
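Item 4 of 3d029a53ec (a bounded cache with oldest-first eviction) might look roughly like this. Only the constant name `_TOOL_DEFS_CACHE_MAX` and the value 8 come from the commit message; the `BoundedCache` class and its methods are hypothetical.

```python
from collections import OrderedDict

_TOOL_DEFS_CACHE_MAX = 8  # named constant from the commit message


class BoundedCache:
    """Insertion-ordered cache that evicts the oldest entry when full."""

    def __init__(self, max_entries: int = _TOOL_DEFS_CACHE_MAX):
        self._max = max_entries
        self._data = OrderedDict()

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        # Updating an existing key keeps its original position in the order.
        self._data[key] = value
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # drop the oldest insertion

    def __len__(self):
        return len(self._data)


cache = BoundedCache()
for fingerprint in range(10):
    cache.put(fingerprint, f"defs-{fingerprint}")
```

This sketch evicts by insertion age rather than by recency of use, which matches the "oldest-first" wording; an LRU policy would additionally move entries to the end on each read.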
teknium1	400e6e43ca	test(gateway): de-flake concurrent-compression lock test with a barrier test_concurrent_compressions_same_session_serialize relied on a time.sleep(0.25) inside the stubbed compressor to make the two threads overlap inside the per-session lock window. Under CI CPU starvation that sleep is insufficient: one thread can acquire -> compress -> rotate -> RELEASE the lock before the other reaches try_acquire, so both acquire on the shared session_id and both compress (the recurring 'Expected exactly one agent to compress, got 2' failure on shard test (1)). Replace the timing dependency with a threading.Barrier(2) wrapped around the shared db's try_acquire_compression_lock: both threads rendezvous immediately before the real (atomic) acquire, guaranteeing genuine simultaneous contention regardless of scheduling. The real lock logic is unchanged and still picks exactly one winner — this only fixes the test's overlap guarantee. Restored after join so the post-join lock-leak assertion hits the unwrapped method. Verified: 20/20 plain + 15/15 under all-core CPU stress (load avg ~4.6), where the old version flaked.	2026-06-08 06:32:23 -07:00
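The barrier technique from 400e6e43ca can be illustrated with a self-contained sketch. The plain `threading.Lock` stands in for try_acquire_compression_lock, and the second barrier is an assumption added here so the winner cannot release before the loser has tried.

```python
import threading

lock = threading.Lock()        # stand-in for the per-session compression lock
start = threading.Barrier(2)   # both threads meet just before the acquire
finish = threading.Barrier(2)  # nobody releases until both have tried
winners = []


def worker(name: str) -> None:
    start.wait(timeout=5)               # rendezvous: forces real contention
    got = lock.acquire(blocking=False)  # non-blocking try-acquire
    finish.wait(timeout=5)              # both attempts are done past here
    if got:
        winners.append(name)
        lock.release()


threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because neither thread can reach the acquire until the other has arrived, the overlap no longer depends on a sleep being long enough under CPU starvation.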
kshitij	b99c6c4277	Merge #42076: nested category plugin discovery + alias-normalized enable/disable (#41066) Lands the complete nested category plugin fix: - Discovery in `hermes plugins list` (from @islam666's #41076, carried in this PR) - Alias-normalized enable/disable mutation path so nested plugins can be toggled - Fixes the #41076 base breakages (web_server 6-tuple unpack + stale test fixtures) Co-authored work: discovery by @islam666 (#41076). Closes #41066.	2026-06-08 05:47:27 -07:00
kshitijk4poor	2b89afec79	fix(plugins): alias-normalize enable/disable for nested category plugins (follow-up to #41076 ) #41076 makes `hermes plugins list` discover nested category plugins (e.g. observability/nemo_relay). This adds the missing enable/disable mutation path so those plugins can actually be toggled, and fixes two incomplete-update breakages on the #41076 base. Before: `hermes plugins enable nemo_relay` -> "Plugin 'nemo_relay' is not installed or bundled." (exit 1), because cmd_enable/cmd_disable went through _plugin_exists(), which only checked top-level plugins/<name>/. Changes: - Add _resolve_plugin_key(): resolve a bare manifest/leaf name OR a full path-derived key (observability/nemo_relay) to the canonical key the runtime loader gates on, reusing #41076's _discover_all_plugins(). A bare leaf name ambiguous across two categories resolves to None rather than silently picking one. - cmd_enable/cmd_disable resolve first, persist the canonical key, and drop any stale legacy bare-name alias so the enabled/disabled lists can't drift into a contradictory state. _plugin_exists delegates to the same resolver. - Fix #41076 base breakages: _discover_all_plugins now returns 6-tuples, but web_server._merged_plugins_hub() still unpacked 5 (ValueError on the dashboard plugins-hub endpoint) and several test_plugins_cmd_list.py fixtures were still 5-tuples. Both updated; the hub status check is now key-aware. Verified e2e on the real CLI + runtime loader (isolated HERMES_HOME): `hermes plugins enable nemo_relay` writes observability/nemo_relay to config.yaml and the loader then loads it (enabled=True, error=None); a stale bare-name alias is cleared on disable; the dashboard _merged_plugins_hub() runs without crashing. Adds resolution + enable/disable tests; full tests/hermes_cli/test_plugins_cmd* + web_server plugin tests green. Follow-up to #41076 (#41066). Branched from that PR's head.	2026-06-08 17:57:37 +05:30
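The resolution rule described for _resolve_plugin_key in 2b89afec79 (full key wins, a unique leaf name resolves, an ambiguous leaf yields None) can be sketched as below. The standalone function, its signature, and the `audio/relay` and `video/relay` keys are hypothetical; the real resolver reuses _discover_all_plugins() and also handles manifest names, which this sketch omits.

```python
from typing import Iterable, Optional


def resolve_plugin_key(name: str, known_keys: Iterable[str]) -> Optional[str]:
    """Map a bare leaf name or a full path-derived key to one canonical key."""
    keys = list(known_keys)
    if name in keys:
        return name  # already a canonical key
    matches = [key for key in keys if key.rsplit("/", 1)[-1] == name]
    # An ambiguous leaf resolves to None instead of silently picking one.
    return matches[0] if len(matches) == 1 else None


KEYS = ["observability/nemo_relay", "audio/relay", "video/relay", "honcho"]
```

For example, `resolve_plugin_key("nemo_relay", KEYS)` yields the nested key, while `resolve_plugin_key("relay", KEYS)` yields `None` because two categories share that leaf.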
kshitij	c3055d6185	Merge pull request #41984 from kshitijk4poor/salvage/6600-stale-streaming-worker fix(gateway): transcribe voice messages during active agent runs (salvage #6600, voice half)	2026-06-08 02:51:25 -07:00
kshitijk4poor	f96eb857a5	chore: add kristianvast to AUTHOR_MAP	2026-06-08 15:16:20 +05:30
Kristian Vastveit	d55304c39f	fix(gateway): transcribe voice messages during active agent runs Salvaged from #6600 (@kristianvast) — re-scoped to the voice half only and rebased onto current main. The cascading-interrupt hang half of the original PR landed independently in `dd0d1222a`, so this carries ONLY Problem 1. When a voice/audio message arrives while the agent is busy on the same session, it hit the interrupt path with empty text because STT only ran after the running-agent guard — the voice was effectively lost. Now we transcribe audio BEFORE signaling the agent (and on the fresh-message path), echo the raw transcript back to the user (🎙️), and _enrich_message_with_transcription returns (text, transcripts) so callers can echo. A new _dequeue_pending_with_transcription drives the post-agent drain the same way. Reapplied onto _prepare_inbound_message_text (inbound enrichment was extracted from the inline dispatch block since the original PR). Co-authored-by: Kristian Vastveit <kristian@agrointel.no>	2026-06-08 15:16:20 +05:30
teknium1	00c46b8ff9	test(tui): cover heapdump opt-in gate + retention; add AUTHOR_MAP On-disk vitest coverage for the auto-heapdump disk-safety guard: opt-in gating (suppressed diagnostics-only path), truthy-spelling acceptance, manual-trigger passthrough, and the retention prune. Test approach adapted from #21780 (briandevans) and #21822 (LeonSGP43), reconciled to the merged gate semantics. Maps alarcritty into AUTHOR_MAP for CI.	2026-06-08 02:20:49 -07:00
alarcritty	8ae0d054f4	fix(tui): guard automatic heap dumps against disk fill Automatic heap dumps from the TUI memory monitor could write multi-GiB .heapsnapshot files on every threshold cross, growing ~/.hermes/heapdumps to tens of GiB. Add four layered safeguards: - Gate auto-high/auto-critical snapshots behind HERMES_AUTO_HEAPDUMP=1; manual dumps remain unchanged. - Always write the lightweight diagnostics JSON sidecar so users still get an actionable artifact when the snapshot is suppressed. - Cap total bytes in the dump dir (HERMES_HEAPDUMP_MAX_BYTES, default 2 GiB), evicting oldest first, retaining the newest. - Add a cooldown between auto dumps (HERMES_AUTO_HEAPDUMP_COOLDOWN_MS, default 10 min) so an oscillating heap can't re-trigger. Closes #21767	2026-06-08 02:20:49 -07:00
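The byte-cap retention rule in 8ae0d054f4 (evict oldest first, always retain the newest) can be expressed as a pure selection function. The actual guard lives in the TUI's TypeScript code; this Python sketch only mirrors the rule as stated, and the function name and tuple layout are assumptions.

```python
from typing import List, Tuple

DumpFile = Tuple[str, float, int]  # (name, mtime, size_in_bytes)


def select_evictions(files: List[DumpFile], max_bytes: int) -> List[str]:
    """Return names to delete, oldest first, never including the newest."""
    ordered = sorted(files, key=lambda f: f[1])  # oldest to newest
    total = sum(size for _, _, size in ordered)
    evict = []
    for name, _, size in ordered[:-1]:  # the newest file is always retained
        if total <= max_bytes:
            break
        evict.append(name)
        total -= size
    return evict


dumps = [("b.heapsnapshot", 2.0, 100), ("a.heapsnapshot", 1.0, 100),
         ("c.heapsnapshot", 3.0, 100)]
```

Keeping selection separate from deletion makes the policy easy to unit-test without touching the filesystem.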
teknium1	dd0d1222a2	fix(agent): don't retry interrupt-induced transport errors (cascading-interrupt hang) When agent.interrupt() fires during an active LLM call, the main poll loop force-closes the worker-local httpx client to stop token generation. That raises a transport error (RemoteProtocolError) on the worker thread, which is the EXPECTED consequence of our own close, not a network bug. The streaming retry loop misclassified it as a transient connection error and retried; each doomed retry stalled for the full stream-stale timeout (up to 300s). Because the gateway caches AIAgent instances per session, the stale worker outlived the interrupted turn and raced the next turn's request on shared client state, which was the root of the multi-minute cascading-interrupt hang reported in the wild. Fix: a request-local _request_cancelled token set by the poll loop right before the force-close, in both interruptible_api_call (non-streaming) and interruptible_streaming_api_call. The worker's exception handler checks the token and exits cleanly (no retry, no fallback, no 'reconnecting' status) instead of treating the forced error as transient. The token is request-local (not agent._interrupt_requested, which is cleared at turn boundaries) so a stale worker outliving its turn still recognizes its own forced close. Original diagnosis and fix by @kristianvast (PR #6600), against the then-inline methods in run_agent.py. Those were since extracted into agent/chat_completion_helpers.py, so the fix is reapplied there. Co-authored-by: Kristian Vastveit <kristianvast@users.noreply.github.com>	2026-06-08 02:19:13 -07:00
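The request-local cancellation token from dd0d1222a2 can be sketched as a retry loop that checks the token before retrying. This is a simplified illustration: `TransportError`, `call_with_retry` and `make_sender` are hypothetical stand-ins, and a `threading.Event` is assumed as the token type.

```python
import threading


class TransportError(Exception):
    """Stand-in for a transport failure such as httpx.RemoteProtocolError."""


def call_with_retry(send, cancelled: threading.Event, max_attempts: int = 3):
    """Retry transient transport errors unless this request was cancelled."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return send()
        except TransportError as exc:
            if cancelled.is_set():
                return None  # our own forced close: exit, do not retry
            last_error = exc
    raise last_error


def make_sender(failures: int):
    """Build a fake send() that fails `failures` times, then returns 'ok'."""
    state = {"calls": 0}

    def send():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransportError("connection closed")
        return "ok"

    return send, state


send_ok, ok_state = make_sender(2)
ok_result = call_with_retry(send_ok, threading.Event())

cancel_token = threading.Event()
cancel_token.set()  # the poll loop would set this just before force-closing
send_cancelled, cancelled_state = make_sender(5)
cancelled_result = call_with_retry(send_cancelled, cancel_token)
```

Scoping the token to a single request is the point: a worker that outlives its turn still sees its own token set, even after any agent-level interrupt flag has been cleared.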
Teknium	aa6f2775fa	fix(memory): run end-of-turn sync off the turn thread (#41945 ) A misconfigured/slow external memory provider could hold the agent in the 'running' state for minutes after the final response was delivered. MemoryManager.sync_all / queue_prefetch_all looped provider.sync_turn / queue_prefetch INLINE on the turn-completion path; a provider making a blocking network/daemon call (a broken Hindsight daemon was observed blocking ~298s before failing) blocked run_conversation from returning. Because every interface (CLI, TUI, gateway) marks the agent 'running' until run_conversation returns, the agent stayed busy for the full block and any follow-up message triggered an aggressive interrupt that dropped the message. Dispatch provider sync/prefetch to a lazily-created single-worker background executor. sync_all / queue_prefetch_all return immediately; work completes (or fails, logged) in the background. A single worker serializes writes so turn N lands before turn N+1. flush_pending() provides a barrier for session boundaries and deterministic tests. shutdown_all() drains the executor with a bounded timeout so a wedged provider can never hang teardown. Builtin-only / no-provider sessions spawn no executor (zero new threads in the common case).	2026-06-08 02:18:59 -07:00
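The dispatch model in aa6f2775fa (lazily created single-worker executor, ordered writes, a flush barrier, bounded shutdown) might be sketched like this. The method name flush_pending comes from the commit message; the `BackgroundSync` class, `submit`, `shutdown` and the demo are hypothetical, and the sketch assumes a single submitting thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


class BackgroundSync:
    """Run sync work off the caller's thread, one task at a time."""

    def __init__(self):
        self._executor = None  # created on first use, so idle sessions add no thread
        self._pending = []

    def submit(self, fn, *args):
        if self._executor is None:
            # A single worker runs tasks in submission order.
            self._executor = ThreadPoolExecutor(max_workers=1)
        self._pending.append(self._executor.submit(fn, *args))

    def flush_pending(self, timeout=None) -> bool:
        """Barrier: wait for queued work; True if everything finished."""
        pending, self._pending = self._pending, []
        _, not_done = wait(pending, timeout=timeout)
        return not not_done

    def shutdown(self, timeout: float = 5.0) -> None:
        if self._executor is None:
            return
        self.flush_pending(timeout)          # bounded wait
        self._executor.shutdown(wait=False)  # do not block the caller further
        self._executor = None


order = []


def slow_sync(turn: int) -> None:
    time.sleep(0.01 * (3 - turn))  # earlier turns deliberately take longer
    order.append(turn)


sync = BackgroundSync()
lazy_before_use = sync._executor is None
for turn in (1, 2, 3):
    sync.submit(slow_sync, turn)
all_done = sync.flush_pending(timeout=5)
sync.shutdown()
```

With more than one worker the longer-running turn 1 could finish after turn 3; the single worker is what preserves turn order.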
xxxigm	a5c12f5f59	fix(install): move broken checkout aside instead of deleting it Review feedback (#40998): `rm -rf` / `Remove-Item -Recurse -Force` on the install dir is destructive -- a user might still want whatever is there. Rename the broken checkout to a timestamped `<dir>.broken-<ts>` backup and re-clone fresh, so nothing is ever deleted. Transient cleanup of a clone attempt that fails within the same run is left as-is.	2026-06-08 02:18:21 -07:00
xxxigm	5d7abf9114	test(install): cover commit-less checkout handling (#40998) Behavioral coverage for install.sh's clone_repo() guard (removes a commit-less checkout, keeps a real one, ignores a non-repo dir) plus a contract check that install.ps1's repo-validity gate requires a resolvable HEAD.	2026-06-08 02:18:21 -07:00
xxxigm	fc0900d120	fix(install): re-clone interrupted (commit-less) checkout instead of failing An interrupted previous clone leaves the install dir's .git present but with no initial commit. rev-parse --is-inside-work-tree and git status both still succeed there, so the installer entered the update path and ran `git stash`, which aborts with "You do not have the initial commit yet" and failed the desktop install at the "Cloning Hermes repository" stage. - install.ps1: add a `git rev-parse --verify HEAD` probe to the repo-validity check so a commit-less checkout is treated as broken and re-cloned fresh. - install.sh: mirror it at the top of clone_repo() — drop a partial checkout with no resolvable HEAD so the fresh-clone path handles it (POSIX parity). Fixes #40998	2026-06-08 02:18:21 -07:00
teknium1	0904bc7ea2	refactor(cli): extract 32 slash-command handlers into CLICommandsMixin (god-file Phase 4) Lift the `_handle_*_command` cluster (2,077 LOC) out of HermesCLI into hermes_cli/cli_commands_mixin.py; HermesCLI now inherits CLICommandsMixin so every self.<handler> call resolves unchanged via the MRO. Behavior-neutral. Import discipline mirrors gateway/slash_commands.py (PR #41886): neutral deps imported at the mixin module top level; cli.py-internal helpers/constants (_cprint, _ACCENT, save_config_value, ...) imported lazily inside each handler via 'from cli import ...' so the mixin never imports cli at module scope. cli.py 16215 -> 14139 LOC. One test mock repointed (cli.is_browser_debug_ready -> hermes_cli.cli_commands_mixin.is_browser_debug_ready).	2026-06-08 02:13:07 -07:00
kshitij	4eb8972390	Merge pull request #33817 from sweetcornna/fix/28503-busy-input-fifo fix(gateway): use FIFO queue for busy_input_mode pending messages	2026-06-08 02:02:02 -07:00
Gille	039fbb41fc	fix(desktop): show newly configured model providers (#41545)	2026-06-08 01:39:37 -07:00
floory	15c99b437f	fix(cli): set PYTHON env for node-gyp native builds on NixOS (#40690) * fix(cli): set PYTHON env for node-gyp native builds on NixOS node-gyp (triggered by node-pty during npm ci) looks for python3 on PATH, which fails on NixOS because python3 lives in the nix store and is not on the system PATH. Add _nixos_build_env(), a two-tier helper that detects NixOS and: 1. Fast path: hermes venv python3 (~0s) 2. Fallback: nix-shell which python3 (~2-5s) Wire it into _run_npm_install_deterministic() via a new env= parameter, then pass it through cmd_gui() and _update_node_dependencies(). Non-NixOS systems: _nixos_build_env() returns None, behavior unchanged. * fix(cli): merge _nixos_build_env() with os.environ, fix NixOS detection, add explicit return None - Critical fix: both Tier 1 (venv) and Tier 2 (nix-shell) now return {**os.environ, "PYTHON": ...} instead of {"PYTHON": ...}, because subprocess.run with env= replaces the entire environment, so the old code wiped PATH and broke npm/node on NixOS entirely. - Uses re.search(r"^ID=nixos$", ...) for anchored NixOS detection instead of unanchored substring match (could match ID_LIKE=...nixos). - Removes redundant Path.exists() guard before read_text(); just catches OSError (one filesystem read instead of two). - Adds explicit return None at end of function for type-hint consistency.	2026-06-08 13:57:37 +05:30
teknium1	7a5827c8b0	test: repoint percentage-clamp source guard to gateway/slash_commands.py test_gateway_run_clamped read gateway/run.py asserting the /usage stats handler clamps pct with min(100, ...). That handler moved to gateway/slash_commands.py in this PR's extraction; repoint the guard so it still fires on clamp removal. tests/run_agent/ + tests/gateway/ 8024 passed / 0 failed.	2026-06-08 01:25:35 -07:00
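
The environment fix described in commit 15c99b437f above relies on two points: `subprocess.run(..., env=...)` replaces the child's whole environment rather than extending it, and an anchored pattern only matches a complete `ID=nixos` line. The following is a minimal sketch of both ideas, not the repository's `_nixos_build_env()`. The helper names `is_nixos`, `build_env` and `child_sees` are hypothetical, and `re.MULTILINE` is an assumption, since the commit message elides the remaining `re.search` arguments.

```python
import os
import re
import subprocess
import sys


def is_nixos(os_release_text: str) -> bool:
    """Return True only when some line is exactly `ID=nixos`.

    re.MULTILINE (an assumption here) lets ^ and $ anchor to each line,
    so a line such as `ID_LIKE=nixos` does not match.
    """
    return re.search(r"^ID=nixos$", os_release_text, flags=re.MULTILINE) is not None


def build_env(python_path: str) -> dict:
    """Copy the current environment and add PYTHON on top of it.

    Passing only {"PYTHON": ...} as env= would drop PATH and every other
    variable, because env= replaces the child's environment wholesale.
    """
    return {**os.environ, "PYTHON": python_path}


def child_sees(env: dict, name: str) -> str:
    """Run a child interpreter with `env` and report the value it sees for `name`."""
    code = f"import os; print(os.environ.get({name!r}, ''))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

With this sketch, `child_sees(build_env("/opt/py/bin/python3"), "PYTHON")` returns the injected path (a made-up example path), while the child still inherits `PATH` and the rest of the parent's variables.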


11040 commits
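
The background dispatch described in commit aa6f2775fa above (a lazily created single-worker executor, a flush barrier, and a bounded drain at shutdown) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the project's MemoryManager: the class `BackgroundSync` and its `submit` and `shutdown` methods are hypothetical names, and `cancel_futures=True` assumes Python 3.9 or newer.

```python
import logging
import threading
from concurrent.futures import Future, ThreadPoolExecutor, wait
from typing import Callable, List, Optional

logger = logging.getLogger(__name__)


class BackgroundSync:
    """Hypothetical stand-in for off-thread provider sync dispatch."""

    def __init__(self) -> None:
        self._executor: Optional[ThreadPoolExecutor] = None
        self._pending: List[Future] = []
        self._lock = threading.Lock()

    def submit(self, work: Callable[[], None]) -> None:
        """Queue work and return immediately; create the executor on first use."""
        with self._lock:
            if self._executor is None:
                # A single worker runs jobs in submission order (turn N before N+1).
                self._executor = ThreadPoolExecutor(max_workers=1)
            # Forget futures that already finished so the list stays small.
            self._pending = [f for f in self._pending if not f.done()]
            self._pending.append(self._executor.submit(self._run, work))

    @staticmethod
    def _run(work: Callable[[], None]) -> None:
        """Run one job; log failures instead of propagating them."""
        try:
            work()
        except Exception:
            logger.exception("background sync failed")

    def flush_pending(self, timeout: Optional[float] = None) -> bool:
        """Barrier: wait for queued work; True if all of it finished in time."""
        with self._lock:
            pending = list(self._pending)
        _, not_done = wait(pending, timeout=timeout)
        return not not_done

    def shutdown(self, timeout: float = 5.0) -> None:
        """Wait up to `timeout`, then release the executor without blocking."""
        self.flush_pending(timeout=timeout)
        with self._lock:
            executor, self._executor = self._executor, None
        if executor is not None:
            # Do not wait on a wedged job; cancel anything still queued.
            executor.shutdown(wait=False, cancel_futures=True)
```

In this sketch no thread exists until the first `submit` call, which mirrors the commit's point that sessions without an external provider spawn no executor.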