hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-09 08:21:50 +00:00

Author	SHA1	Message	Date
Teknium	fd1e7c2bc3	fix(tui): install the process.on('exit') terminal-mode backstop (#42165 ) #19194's fix added process.exit(0) to die()/dieWithCode() with a comment relying on a process.on('exit') handler in entry.tsx that resets terminal modes — but that handler was never installed. So /quit, Ctrl+C, Ctrl+D and every process.exit() path left DEC mouse tracking (?1000/1002/1003/1006) armed in the parent shell. The terminal then kept emitting mouse reports into stdin — read as keystrokes by the shell or a freshly relaunched TUI — surfacing as ...;...M garbage in the input box. Install the missing handler. 'exit' fires once on real termination and runs synchronous code only; resetTerminalModes() writes via writeSync, so the disable sequence lands before the process is gone. Fixes #28419	2026-06-08 08:21:19 -07:00
Siddharth Balyan	7230fcb7f2	revert(nix): drop the cp patchPhase workaround from #41867 (#42151 ) #41867 replaced mkNpmPassthru's patchPhase with `cp $npmDeps/package-lock.json package-lock.json`, on the theory that prefetch-npm-deps strips advisory fields (engines/os/cpu) from the cache lockfile. That diagnosis was wrong. prefetch-npm-deps copies the lockfile into the cache verbatim (prefetch-npm-deps/src/main.rs reads it and writes it unchanged). Building the cache fresh from the current root lockfile yields exactly the pinned npmDepsHash, and that cache's package-lock.json is byte-identical to the source (740 "engines" blocks on each side). With the hash correct, npmConfigHook's consistency check passes on its own — verified by building .#tui and .#default green with this (original) patchPhase. So the cp was unnecessary, and worse: it bypasses the consistency check wholesale, silently masking a genuinely stale npmDepsHash (a lockfile that changed without its hash being refreshed) instead of failing loudly. The original patchPhase keeps the check meaningful while still handling the one real cosmetic difference it was written for (trailing newlines); stale-hash drift is caught by the npmDepsHash itself plus the auto-fix workflow. Keeps the fix-lockfiles real-build verification and the nix-lockfile-fix.yml file-path fix from #41867 — only the patchPhase cp is reverted.	2026-06-08 20:29:41 +05:30
Siddharth Balyan	4219a91df5	fix(nix): make config.yaml group-writable under addToSystemPackages (#41940 ) addToSystemPackages exports HERMES_HOME system-wide and puts the hermes CLI on interactive users' PATH, so those users (in the hermes group) share the gateway's state — that's the option's whole purpose. But the activation script wrote config.yaml as 0640 (group read-only), so an interactive user saving a setting via the CLI/TUI hit: error: [Errno 13] Permission denied: '/var/lib/hermes/.hermes/config.yaml' Make the mode conditional: 0660 when addToSystemPackages is set (group hermes can write), else the previous 0640. .env stays 0640 either way — it holds secrets, not user-facing settings. The config merge already preserves user-added keys across rebuilds, so this simply lets interactive hermes-group users actually make those edits. Verified by evaluating the module's activation script for both option values: addToSystemPackages=true -> chmod 0660, false -> chmod 0640.	2026-06-08 20:10:47 +05:30
Teknium	a3fca26c56	fix(tui): close slash_worker inside _finalize_session (defense-in-depth, #38095 ) (#42149 ) Fold the slash-worker subprocess close into _finalize_session itself — the single _finalized-guarded session-end chokepoint — instead of relying on each caller (_teardown_session, _shutdown_sessions) to close it separately. A future code path that finalizes a session directly can no longer reintroduce the #38095 worker leak. Idempotent: _SlashWorker.close() is poll()-guarded and _finalize_session short-circuits on _finalized, so the existing teardown paths are unaffected. Drops the now-redundant separate close() in _shutdown_sessions. Note: the active leak this issue reported was already fixed on main (WS-orphan reaper #38591, _restart_slash_worker close, atexit shutdown). This addresses the residual defense-in-depth gap the reporter correctly identified in their follow-up comment.	2026-06-08 07:26:05 -07:00
Teknium	5e06c9ffef	fix(agent): clear _session_messages in AIAgent.close() (#42123 ) close() is the hard teardown for true session boundaries (/new, /reset, session expiry). It already closes the OpenAI client and child agents but left the conversation-history list intact. Mirror the soft-eviction path (_release_evicted_agent_soft clears _session_messages) so a held reference to a closed agent — e.g. a draining background task — doesn't pin tens of MB of tool outputs until the agent object itself is collected.	2026-06-08 07:03:39 -07:00
teknium1	cb13723f53	fix(pty-bridge): mark os.killpg/getpgid windows-footgun-ok (POSIX-only module)	2026-06-08 07:03:12 -07:00
teknium1	8cb1908e18	chore: map paulb26 in AUTHOR_MAP for #24135 salvage	2026-06-08 07:03:12 -07:00
firefly	8b6a8f667d	feat(slash-worker): self-terminate on parent death via create_time watchdog Daemon thread polls _is_orphaned (original ppid check + psutil create_time PID-reuse guard, no PR_SET_PDEATHSIG). On orphan, drains an in-flight command up to a grace window then os._exit(0). Started before the HermesCLI build to cover the spawn window. Task: swl-qrf.8	2026-06-08 07:03:12 -07:00
paulb26	b31c6c33b2	fix(pty-bridge): terminate PTY process groups on teardown	2026-06-08 07:03:12 -07:00
Teknium	e9c1e757fe	fix(gateway): release evicted agent clients to stop RSS leak (#29298 ) (#41974 ) _evict_cached_agent (the chokepoint for /new, /model, /undo, session resets — 17 call sites) only popped the cache entry, dropping the AIAgent reference without releasing its httpx client pool. AIAgent holds reference cycles (callbacks, tool state) so CPython refcounting does not free the client promptly; under steady gateway traffic the held sockets + buffers accumulate and RSS climbs (the leak class behind Now the chokepoint pops AND schedules a soft release_clients() on a daemon thread (mirrors the cap-enforcer / idle-sweeper). Soft release frees the client pool + per-turn child subagents but preserves the session's terminal sandbox / browser / bg processes for resumption. Mid-turn agents are skipped so a running request is never torn down. Also fixes the no-lock branch which previously never popped at all.	2026-06-08 06:44:51 -07:00
Michael Steuer	3d029a53ec	fix(gateway): close residual memory-leak sites under heavy scheduled workload Long-lived gateways under heavy cron/build workloads grow steadily (~18 MB/hr post-phantom-dispatch-fix) and eventually need a restart-or-OOM. Four retention sites, all confirmed live on current main: 1. _evict_cached_agent() (/model, /reasoning, codex-runtime, /undo, etc.) popped the cache entry without releasing the agent's OpenAI client, httpx transport, SSL context, or conversation history. Only /new cleaned up first. Now releases clients on a daemon thread, matching _enforce_agent_cache_cap. 2. _release_evicted_agent_soft() now clears _session_messages after release_clients() — tool outputs (file reads, terminal output, search results) can be tens of MB per 100+-tool-call session; the list is rebuilt from persisted session JSON on resume, so dropping it on soft eviction is safe. 3. The session-expiry watcher (permanent finalization) now drops the session's per-session control dicts (_session_model_overrides, _session_reasoning_overrides, _pending_approvals, _update_prompt_pending, _pending_model_notes). These leaked one entry per session per gateway lifetime. NOTE: this is the session-finalize path, NOT idle agent-cache eviction — an idle-evicted session is still alive and rebuilds its agent from these overrides, so pruning them there would silently reset a user's /model choice. 4. _tool_defs_cache is now bounded (_TOOL_DEFS_CACHE_MAX=8) with oldest-first eviction instead of growing unboundedly across the distinct toolset/config fingerprints a gateway sees over its lifetime. Salvaged from #25318 by Michael Steuer (@mssteuer); fix 3 redirected from the idle-sweep to the session-finalize lifecycle, magic number 8 lifted to a named constant, test ported. Fixes #19251 Co-authored-by: Michael Steuer <michael@make.software>	2026-06-08 06:32:42 -07:00
teknium1	400e6e43ca	test(gateway): de-flake concurrent-compression lock test with a barrier test_concurrent_compressions_same_session_serialize relied on a time.sleep(0.25) inside the stubbed compressor to make the two threads overlap inside the per-session lock window. Under CI CPU starvation that sleep is insufficient: one thread can acquire -> compress -> rotate -> RELEASE the lock before the other reaches try_acquire, so both acquire on the shared session_id and both compress (the recurring 'Expected exactly one agent to compress, got 2' failure on shard test (1)). Replace the timing dependency with a threading.Barrier(2) wrapped around the shared db's try_acquire_compression_lock: both threads rendezvous immediately before the real (atomic) acquire, guaranteeing genuine simultaneous contention regardless of scheduling. The real lock logic is unchanged and still picks exactly one winner — this only fixes the test's overlap guarantee. Restored after join so the post-join lock-leak assertion hits the unwrapped method. Verified: 20/20 plain + 15/15 under all-core CPU stress (load avg ~4.6), where the old version flaked.	2026-06-08 06:32:23 -07:00
kshitij	b99c6c4277	Merge #42076 : nested category plugin discovery + alias-normalized enable/disable (#41066 ) Merge #42076: nested category plugin discovery + alias-normalized enable/disable (#41066) Lands the complete nested category plugin fix: - Discovery in `hermes plugins list` (from @islam666's #41076, carried in this PR) - Alias-normalized enable/disable mutation path so nested plugins can be toggled - Fixes the #41076 base breakages (web_server 6-tuple unpack + stale test fixtures) Co-authored work: discovery by @islam666 (#41076). Closes #41066.	2026-06-08 05:47:27 -07:00
kshitijk4poor	2b89afec79	fix(plugins): alias-normalize enable/disable for nested category plugins (follow-up to #41076 ) #41076 makes `hermes plugins list` discover nested category plugins (e.g. observability/nemo_relay). This adds the missing enable/disable mutation path so those plugins can actually be toggled, and fixes two incomplete-update breakages on the #41076 base. Before: `hermes plugins enable nemo_relay` -> "Plugin 'nemo_relay' is not installed or bundled." (exit 1), because cmd_enable/cmd_disable went through _plugin_exists(), which only checked top-level plugins/<name>/. Changes: - Add _resolve_plugin_key(): resolve a bare manifest/leaf name OR a full path-derived key (observability/nemo_relay) to the canonical key the runtime loader gates on, reusing #41076's _discover_all_plugins(). A bare leaf name ambiguous across two categories resolves to None rather than silently picking one. - cmd_enable/cmd_disable resolve first, persist the canonical key, and drop any stale legacy bare-name alias so the enabled/disabled lists can't drift into a contradictory state. _plugin_exists delegates to the same resolver. - Fix #41076 base breakages: _discover_all_plugins now returns 6-tuples, but web_server._merged_plugins_hub() still unpacked 5 (ValueError on the dashboard plugins-hub endpoint) and several test_plugins_cmd_list.py fixtures were still 5-tuples. Both updated; the hub status check is now key-aware. Verified e2e on the real CLI + runtime loader (isolated HERMES_HOME): `hermes plugins enable nemo_relay` writes observability/nemo_relay to config.yaml and the loader then loads it (enabled=True, error=None); a stale bare-name alias is cleared on disable; the dashboard _merged_plugins_hub() runs without crashing. Adds resolution + enable/disable tests; full tests/hermes_cli/test_plugins_cmd* + web_server plugin tests green. Follow-up to #41076 (#41066). Branched from that PR's head.	2026-06-08 17:57:37 +05:30
kshitij	c3055d6185	Merge pull request #41984 from kshitijk4poor/salvage/6600-stale-streaming-worker fix(gateway): transcribe voice messages during active agent runs (salvage #6600, voice half)	2026-06-08 02:51:25 -07:00
kshitijk4poor	f96eb857a5	chore: add kristianvast to AUTHOR_MAP	2026-06-08 15:16:20 +05:30
Kristian Vastveit	d55304c39f	fix(gateway): transcribe voice messages during active agent runs Salvaged from #6600 (@kristianvast) — re-scoped to the voice half only and rebased onto current main. The cascading-interrupt hang half of the original PR landed independently in `dd0d1222a`, so this carries ONLY Problem 1. When a voice/audio message arrives while the agent is busy on the same session, it hit the interrupt path with empty text because STT only ran after the running-agent guard — the voice was effectively lost. Now we transcribe audio BEFORE signaling the agent (and on the fresh-message path), echo the raw transcript back to the user (🎙️), and _enrich_message_with_transcription returns (text, transcripts) so callers can echo. A new _dequeue_pending_with_transcription drives the post-agent drain the same way. Reapplied onto _prepare_inbound_message_text (inbound enrichment was extracted from the inline dispatch block since the original PR). Co-authored-by: Kristian Vastveit <kristian@agrointel.no>	2026-06-08 15:16:20 +05:30
teknium1	00c46b8ff9	test(tui): cover heapdump opt-in gate + retention; add AUTHOR_MAP On-disk vitest coverage for the auto-heapdump disk-safety guard: opt-in gating (suppressed diagnostics-only path), truthy-spelling acceptance, manual-trigger passthrough, and the retention prune. Test approach adapted from #21780 (briandevans) and #21822 (LeonSGP43), reconciled to the merged gate semantics. Maps alarcritty into AUTHOR_MAP for CI.	2026-06-08 02:20:49 -07:00
alarcritty	8ae0d054f4	fix(tui): guard automatic heap dumps against disk fill Automatic heap dumps from the TUI memory monitor could write multi-GiB .heapsnapshot files on every threshold cross, growing ~/.hermes/heapdumps to tens of GiB. Add four layered safeguards: - Gate auto-high/auto-critical snapshots behind HERMES_AUTO_HEAPDUMP=1; manual dumps remain unchanged. - Always write the lightweight diagnostics JSON sidecar so users still get an actionable artifact when the snapshot is suppressed. - Cap total bytes in the dump dir (HERMES_HEAPDUMP_MAX_BYTES, default 2 GiB), evicting oldest first, retaining the newest. - Add a cooldown between auto dumps (HERMES_AUTO_HEAPDUMP_COOLDOWN_MS, default 10 min) so an oscillating heap can't re-trigger. Closes #21767	2026-06-08 02:20:49 -07:00
teknium1	dd0d1222a2	fix(agent): don't retry interrupt-induced transport errors (cascading-interrupt hang) When agent.interrupt() fires during an active LLM call, the main poll loop force-closes the worker-local httpx client to stop token generation. That raises a transport error (RemoteProtocolError) on the worker thread — the EXPECTED consequence of our own close, not a network bug. The streaming retry loop misclassified it as a transient connection error and retried; each doomed retry stalled for the full stream-stale timeout (up to 300s). Because the gateway caches AIAgent instances per session, the stale worker outlived the interrupted turn and raced the next turn's request on shared client state — the root of the multi-minute cascading-interrupt hang reported in the wild. Fix: a request-local _request_cancelled token set by the poll loop right before the force-close, in both interruptible_api_call (non-streaming) and interruptible_streaming_api_call. The worker's exception handler checks the token and exits cleanly — no retry, no fallback, no 'reconnecting' status — instead of treating the forced error as transient. The token is request- local (not agent._interrupt_requested, which is cleared at turn boundaries) so a stale worker outliving its turn still recognizes its own forced close. Original diagnosis and fix by @kristianvast (PR #6600), against the then- inline methods in run_agent.py. Those were since extracted into agent/chat_completion_helpers.py, so the fix is reapplied there. Co-authored-by: Kristian Vastveit <kristianvast@users.noreply.github.com>	2026-06-08 02:19:13 -07:00
Teknium	aa6f2775fa	fix(memory): run end-of-turn sync off the turn thread (#41945 ) A misconfigured/slow external memory provider could hold the agent in the 'running' state for minutes after the final response was delivered. MemoryManager.sync_all / queue_prefetch_all looped provider.sync_turn / queue_prefetch INLINE on the turn-completion path; a provider making a blocking network/daemon call (a broken Hindsight daemon was observed blocking ~298s before failing) blocked run_conversation from returning. Because every interface (CLI, TUI, gateway) marks the agent 'running' until run_conversation returns, the agent stayed busy for the full block and any follow-up message triggered an aggressive interrupt that dropped the message. Dispatch provider sync/prefetch to a lazily-created single-worker background executor. sync_all / queue_prefetch_all return immediately; work completes (or fails, logged) in the background. A single worker serializes writes so turn N lands before turn N+1. flush_pending() provides a barrier for session boundaries and deterministic tests. shutdown_all() drains the executor with a bounded timeout so a wedged provider can never hang teardown. Builtin-only / no-provider sessions spawn no executor (zero new threads in the common case).	2026-06-08 02:18:59 -07:00
xxxigm	a5c12f5f59	fix(install): move broken checkout aside instead of deleting it Review feedback (#40998): `rm -rf` / `Remove-Item -Recurse -Force` on the install dir is destructive -- a user might still want whatever is there. Rename the broken checkout to a timestamped `<dir>.broken-<ts>` backup and re-clone fresh, so nothing is ever deleted. Transient cleanup of a clone attempt that fails within the same run is left as-is.	2026-06-08 02:18:21 -07:00
xxxigm	5d7abf9114	test(install): cover commit-less checkout handling (#40998 ) Behavioral coverage for install.sh's clone_repo() guard (removes a commit-less checkout, keeps a real one, ignores a non-repo dir) plus a contract check that install.ps1's repo-validity gate requires a resolvable HEAD.	2026-06-08 02:18:21 -07:00
xxxigm	fc0900d120	fix(install): re-clone interrupted (commit-less) checkout instead of failing An interrupted previous clone leaves the install dir's .git present but with no initial commit. rev-parse --is-inside-work-tree and git status both still succeed there, so the installer entered the update path and ran `git stash`, which aborts with "You do not have the initial commit yet" and failed the desktop install at the "Cloning Hermes repository" stage. - install.ps1: add a `git rev-parse --verify HEAD` probe to the repo-validity check so a commit-less checkout is treated as broken and re-cloned fresh. - install.sh: mirror it at the top of clone_repo() — drop a partial checkout with no resolvable HEAD so the fresh-clone path handles it (POSIX parity). Fixes #40998	2026-06-08 02:18:21 -07:00
teknium1	0904bc7ea2	refactor(cli): extract 32 slash-command handlers into CLICommandsMixin (god-file Phase 4) Lift the `_handle_*_command` cluster (2,077 LOC) out of HermesCLI into hermes_cli/cli_commands_mixin.py; HermesCLI now inherits CLICommandsMixin so every self.<handler> call resolves unchanged via the MRO. Behavior-neutral. Import discipline mirrors gateway/slash_commands.py (PR #41886): neutral deps imported at the mixin module top level; cli.py-internal helpers/constants (_cprint, _ACCENT, save_config_value, ...) imported lazily inside each handler via 'from cli import ...' so the mixin never imports cli at module scope. cli.py 16215 -> 14139 LOC. One test mock repointed (cli.is_browser_debug_ready -> hermes_cli.cli_commands_mixin.is_browser_debug_ready).	2026-06-08 02:13:07 -07:00
kshitij	4eb8972390	Merge pull request #33817 from sweetcornna/fix/28503-busy-input-fifo fix(gateway): use FIFO queue for busy_input_mode pending messages	2026-06-08 02:02:02 -07:00
Gille	039fbb41fc	fix(desktop): show newly configured model providers (#41545 )	2026-06-08 01:39:37 -07:00
floory	15c99b437f	fix(cli): set PYTHON env for node-gyp native builds on NixOS (#40690 ) * fix(cli): set PYTHON env for node-gyp native builds on NixOS node-gyp (triggered by node-pty during npm ci) looks for python3 on PATH, which fails on NixOS because python3 lives in the nix store and is not on the system PATH. Add _nixos_build_env() — a two-tier helper that detects NixOS and: 1. Fast path: hermes venv python3 (~0s) 2. Fallback: nix-shell which python3 (~2-5s) Wire it into _run_npm_install_deterministic() via a new env= parameter, then pass it through cmd_gui() and _update_node_dependencies(). Non-NixOS systems: _nixos_build_env() returns None, behavior unchanged. * fix(cli): merge _nixos_build_env() with os.environ, fix NixOS detection, add explicit return None - Critical fix: both Tier 1 (venv) and Tier 2 (nix-shell) now return {**os.environ, "PYTHON": ...} instead of {"PYTHON": ...} — subprocess.run with env= replaces the entire environment, so the old code wiped PATH and broke npm/node on NixOS entirely. - Uses re.search(r"^ID=nixos$", ...) for anchored NixOS detection instead of unanchored substring match (could match ID_LIKE=...nixos). - Removes redundant Path.exists() guard before read_text(); just catches OSError (one filesystem read instead of two). - Adds explicit return None at end of function for type-hint consistency.	2026-06-08 13:57:37 +05:30
teknium1	7a5827c8b0	test: repoint percentage-clamp source guard to gateway/slash_commands.py test_gateway_run_clamped read gateway/run.py asserting the /usage stats handler clamps pct with min(100, ...). That handler moved to gateway/slash_commands.py in this PR's extraction; repoint the guard so it still fires on clamp removal. tests/run_agent/ + tests/gateway/ 8024 passed / 0 failed.	2026-06-08 01:25:35 -07:00
teknium1	de5fe2fa7d	test(gateway): repoint slash-command mocks after mixin extraction Tests for the extracted handlers mocked symbols at gateway.run.*; the handlers now resolve top-level-imported deps (atomic_json_write, fetch_account_usage, render_account_usage_lines) and __file__ from gateway.slash_commands. Repoint those mocks. run.py-resident methods (_increment_restart_failure_counts, _clear_restart_failure_count) keep their gateway.run.atomic_json_write mock — only the moved handlers' mocks change. tests/gateway/ 6415 passed / 0 failed.	2026-06-08 01:25:35 -07:00
teknium1	619bd78273	refactor(gateway): extract 42 slash-command handlers into GatewaySlashCommandsMixin (god-file Phase 3b) The in-session slash commands (/model, /reset, /usage, /compress, /voice, ...) — 42 _handle__command handlers, ~3,200 LOC — move out of gateway/run.py into a mixin GatewayRunner inherits. self._handle__command dispatch + all test references resolve unchanged via the MRO. Neutral deps (MessageEvent, EphemeralReply, Platform, t, cfg_get, atomic_*_write, account-usage helpers, stdlib) imported at the mixin top level. The ~10 run.py- internal helpers (_hermes_home, _load_gateway_config, _resolve_gateway_model, _AGENT_PENDING_SENTINEL, ...) imported lazily inside the handlers that need them to avoid an import cycle. gateway/run.py 19157 -> 15870 LOC; GatewayRunner direct methods 214 -> 172. Behavior-neutral: voice/update/model/compress command test suites pass; all 42 resolve to the mixin via MRO.	2026-06-08 01:25:35 -07:00
teknium1	02a4d66951	fix(auxiliary): retry transient transport error once before fallback (#16587 ) Some checks failed Deploy Site / deploy-vercel (push) Waiting to run Details Deploy Site / deploy-docs (push) Waiting to run Details Docker Build and Publish / build-amd64 (push) Waiting to run Details Docker Build and Publish / build-arm64 (push) Waiting to run Details Docker Build and Publish / merge (push) Blocked by required conditions Details Lint (ruff + ty) / ruff + ty diff (push) Waiting to run Details Lint (ruff + ty) / ruff enforcement (blocking) (push) Waiting to run Details Lint (ruff + ty) / Windows footguns (blocking) (push) Waiting to run Details Nix / nix (macos-latest) (push) Waiting to run Details Nix / nix (ubuntu-latest) (push) Waiting to run Details Tests / test (1) (push) Waiting to run Details Tests / test (2) (push) Waiting to run Details Tests / test (3) (push) Waiting to run Details Tests / test (4) (push) Waiting to run Details Tests / test (5) (push) Waiting to run Details Tests / test (6) (push) Waiting to run Details Tests / save-durations (push) Blocked by required conditions Details Tests / e2e (push) Waiting to run Details Nix Lockfile Fix / auto-fix-main (push) Has been cancelled Details Nix Lockfile Fix / fix (push) Has been cancelled Details A one-off transient transport failure (streaming-close / incomplete chunked read / 5xx / 408) on an auxiliary LLM call escalated straight to provider/model fallback (or, for context compression, dropped the summary and entered cooldown), even when an immediate retry on the same provider would have succeeded. Add a single same-target retry at the top of call_llm() and async_call_llm() — before the existing except-chain — gated on a new _is_transient_transport_error() that reuses the canonical _is_connection_error() detector plus a 5xx/408 status check. A second failure (or any non-transient error: auth, other 4xx, malformed payload) falls through to first_err and the existing fallback handling unchanged. This lives in call_llm so every auxiliary task (compression, memory flush, title generation, session search, vision) shares one transient-retry surface, rather than each caller re-implementing it. The context compressor needs no change — it calls call_llm and inherits the retry; its existing fallback-to-main path (#18458) now composes naturally (retry the aux model once, then fall back to main only if the retry also fails). Co-authored-by: ARegalado1 <alberto.regalado@ymail.com>	2026-06-08 01:05:45 -07:00
kshitij	4107076128	Merge pull request #41155 from kshitijk4poor/fix/cli-modal-direct-invalidate-41098 fix(cli): paint approval/clarify/sudo/secret modal prompts directly, not via the throttle (#41098)	2026-06-08 01:01:51 -07:00
Teknium	4d18717b6c	fix(gateway): drop --replace from systemd unit templates (#41892 ) Under systemd's Restart=always, --replace turns every restart into a self-kill loop: the new instance reads gateway.pid, kills the previous process, writes its own PID, and on the next restart the cycle repeats. A process supervisor owns the lifecycle — --replace is for manual one-shot takeovers and fights the supervisor. Remove --replace from both the system-level and user-level systemd ExecStart lines. The --replace flag stays available for manual 'hermes gateway run --replace' and on the macOS launchd fallback path (#23387), which is a deliberate manual takeover, not a supervised unit. Also drop RestartMaxDelaySec / RestartSteps from the templates — they require systemd v255+ and are silently ignored on older versions. The _strip_optional_systemd_directives normalizer stays so existing installs whose on-disk unit still carries those directives aren't flagged as outdated. Credit: reported and diagnosed by @Skippy-the-Magnificent-one (PR #37145); reimplemented here under project authorship because the original commit was authored under a non-existent email.	2026-06-08 00:20:08 -07:00
Siddharth Balyan	d02a59b679	fix(nix): cold npm builds + fix-lockfiles real-build verification + auto-fix workflow (#41867 ) * fix(nix): fix-lockfiles real-build verification + point auto-fix at nix/lib.nix Two related fixes to the npm lockfile-hash tooling that, together, let a broken nix build slip onto main and stay there: 1. fix-lockfiles trusted prefetch-npm-deps. It computes the hash from the lockfile contents and early-exited "ok" whenever that matched the pin, never running the real fetchNpmDeps + npmConfigHook build. Those two can disagree (the --apply path already works around it), so `--check` reported "ok" while a cold build was actually broken (e.g. lockfile engines/os/cpu fields the pinned nixpkgs strips from the deps cache, tripping npmConfigHook's consistency diff). Now, when prefetch says the hash matches, confirm with `nix build .#<attr>` before believing it: adopt the real fetchNpmDeps hash if nix reports a 'got:' mismatch, surface non-hash failures honestly (exit 1) instead of claiming "ok", and keep the transient-cache-failure skip. 2. nix-lockfile-fix.yml's auto-fix-main (and the PR-fix job) whitelisted and staged nix/tui.nix + nix/web.nix, but the single npmDepsHash moved to nix/lib.nix. So fix-lockfiles --apply edited nix/lib.nix, the guard flagged it as an "unexpected modified file", and the job exited without committing — the auto-healer could never push a fix. Point the guard regex and both `git add` lines at nix/lib.nix. * fix(nix): fix cold npm builds — adopt the deps-cache lockfile in patchPhase hermes-tui/hermes-agent could not be built from source on the pinned nixpkgs: prefetch-npm-deps strips advisory lockfile fields (engines/os/cpu/funding/ bin/…) that newer npm writes into package-lock.json, then npmConfigHook byte-compares the source lockfile against the cache's stripped copy and fails on the difference. CI only stayed green because it substitutes the prebuilt hermes-tui from Cachix and never cold-builds it; anyone building cold (e.g. a local path: input, or a cache miss) hit the failure. mkNpmPassthru's patchPhase now copies the cache's own normalized package-lock.json over the source before npmConfigHook runs, so the consistency check is trivially satisfied. The resolved dependency set (version/resolved/integrity/dependencies) is identical — fetchNpmDeps derived the cache from this very lockfile — so `npm ci` installs the same tree; only advisory metadata is dropped. Genuine drift is still caught by the fixed-output npmDepsHash check, which runs before this phase. Verified by cold-building .#tui and .#default (full hermes-agent) from scratch on the pinned nixpkgs (6201e2) — both succeed where they previously failed at npmConfigHook.	2026-06-08 12:41:37 +05:30
Teknium	e45b745835	fix(file-tools): reject sentinel TERMINAL_CWD; anchor worktree edits before live cwd exists (#41861 ) Completes the worktree-misroute fix from #35399, which made misroutes visible (resolved_path) but did not prevent them: its divergence warning only fired once a terminal command had populated the live cwd registry. A fresh worktree session (registry still empty) with a stale TERMINAL_CWD='.' got neither a worktree anchor nor a warning, so a relative write_file/patch silently landed in the MAIN checkout. Two changes in tools/file_tools.py: - Treat sentinel TERMINAL_CWD values ('', '.', './', 'auto', 'cwd') and any relative value as UNSET rather than a literal anchor. Previously '.' was joined onto the process cwd, silently routing edits to wherever the process happened to be (the main repo, in a worktree session). The gateway already sanitizes the same set at import time; the file-tool layer now matches. - New _authoritative_workspace_root(): prefers the live terminal cwd, else a sentinel-free absolute TERMINAL_CWD (the worktree path cli.py/main.py set for -w). _resolve_base_dir() and _path_resolution_warning() both use it, so a worktree session resolves into — and warns about escaping — the worktree from the very first write, before any cd has run. Validation: 11 new/parametrized tests (sentinel handling, empty-registry anchoring, early divergence warning, live-cwd precedence). 32/32 pass under scripts/run_tests.sh. Live E2E: relative write in an empty-registry worktree session lands in the worktree, main untouched.	2026-06-07 23:58:47 -07:00
LeonSGP43	e02f4c03c3	fix(gateway): abort --replace when old PID survives SIGKILL When --replace force-kills an unresponsive old gateway, SIGKILL can fail to reap it (uninterruptible sleep, zombie-reaping parent, etc.). The old code unconditionally cleared the PID file and scoped locks and started a fresh instance anyway, leaving two live gateways fighting over the same bot token — a duplicate-gateway failure mode of #19471. Re-verify the process is actually gone (via the Windows-safe _pid_exists helper) after the force-kill; if it still appears alive, clear the takeover marker and abort the replacement instead of duplicating. Co-authored-by: Hermes <noreply@nousresearch.com>	2026-06-07 23:57:32 -07:00
konsisumer	3714caa1b9	fix(session): follow compression continuations for transcript reads	2026-06-07 23:57:20 -07:00
teknium1	329c33dac3	fix(terminal): read cwd overrides under raw task_id after container collapse PR #41822 collapsed CWD-only overrides to the shared 'default' container via _resolve_container_task_id, but three call sites kept routing the env/override lookup through that collapsed id: - the foreground exec path read _task_env_overrides[effective_task_id], yet register_task_env_overrides writes under the raw task_id, so a CWD-only override's cwd was silently dropped (env spun up at the wrong root, exit 126); - the get-or-create env lookup keyed solely on effective_task_id, so an env cached under the raw task_id was missed and duplicated; - register_task_env_overrides synced the new cwd onto the env under the collapsed id, missing a live env cached under the raw task_id. Container identity still collapses to 'default' (sharing preserved); only the per-session env/override lookup now prefers the raw task_id and falls back to the collapsed id. Fixes the 3 regressions in test_terminal_task_cwd.py left red by #41822.	2026-06-07 23:44:04 -07:00
teknium1	d759c13c09	chore(salvage): lint fix + AUTHOR_MAP for desktop source-folders PR #40272 eslint --fix (import sort + padding-line-between-statements) on sidebar/index.tsx after cherry-picking @dangelo352's commits; add release.py AUTHOR_MAP entry so CI doesn't block on the unmapped author email.	2026-06-07 23:44:04 -07:00
D'Angelo Rodriguez	694adec635	Smooth desktop sidebar drag sorting	2026-06-07 23:44:04 -07:00
D'Angelo Rodriguez	f0fcaa1e54	Preserve dragged order inside source folders	2026-06-07 23:44:04 -07:00
D'Angelo Rodriguez	0f500fc41d	Render grouped sessions when local list is empty	2026-06-07 23:44:04 -07:00
D'Angelo Rodriguez	3fc67b7333	Persist desktop sidebar drag order	2026-06-07 23:44:04 -07:00
D'Angelo Rodriguez	ede4f5a4a3	Show messaging source folders in desktop sessions	2026-06-07 23:44:04 -07:00
D'Angelo Rodriguez	9d6992ee8a	Show platform sources in desktop sessions	2026-06-07 23:44:04 -07:00
teknium1	1c68f6f81f	refactor(gateway): extract kanban watcher loops into GatewayKanbanWatchersMixin (god-file Phase 3) gateway/run.py is the largest god file (20k LOC, GatewayRunner with 220 methods). This lifts the cohesive kanban-watcher cluster — _kanban_notifier_watcher, _kanban_dispatcher_watcher, _kanban_advance/unsub/rewind, _deliver_kanban_artifacts (~1,035 LOC, 6 methods) — into gateway/kanban_watchers.py as a mixin that GatewayRunner inherits. Mixin (not free functions) because the methods use only self state: inheriting keeps every self._kanban_* call site working unchanged via the MRO, making this a behavior-neutral move. The methods' lazy imports (_kb, _decomp, _load_config, Platform) travel with them; the mixin needs only stdlib + a matching logging.getLogger('gateway.run'). run.py 20187 -> 19157 LOC; GatewayRunner direct methods 220 -> 214. Behavior-neutral: gateway test suite 6582 passed / 0 failed; start() still wires both watchers via self._kanban_*; MRO resolves all 6 to the mixin. One test (corrupt-board quarantine retry) keyed its time-travel mock on the caller's filename being gateway/run.py — updated to also accept gateway/kanban_watchers.py. Establishes the mixin-extraction pattern for further GatewayRunner decomposition (the 2406-LOC _run_agent and 1164-LOC _handle_message remain, but their callback closures need a context-object redesign — deferred).	2026-06-07 23:14:18 -07:00
liuhao1024	6459b3d991	fix(terminal): collapse CWD-only overrides to shared container When register_task_env_overrides is called with only a 'cwd' key (ACP adapter workspace tracking), the task_id should collapse to 'default' so all interactive surfaces (TUI, gateway, dashboard) share one long-lived container. Previously, any override registration — even CWD-only — caused _resolve_container_task_id to return the session key unchanged, spinning up a separate container per session. This made it impossible to authenticate into external services once and have that auth available across all surfaces. Now only overrides containing isolation keys (docker_image, modal_image, singularity_image, daytona_image, env_type) trigger per-task container isolation. Fixes #37361	2026-06-07 23:04:54 -07:00
teknium1	1a626470ca	refactor(cli): promote 9 closure handlers to top-level + extract their parsers (god-file Phase 2 follow-up) Subcommands whose handler was a closure defined inside main() — memory, acp, tools, insights, skills, pairing, plugins, mcp, claw — have their handler promoted to a top-level function and their parser block extracted into hermes_cli/subcommands/<name>.py (build_<name>_parser, injected handler). These 9 had zero closure-over-main-locals, so promotion is a pure relocation. acp/mcp parser blocks use the shared add_accept_hooks_flag helper. main() 1798 -> 954 LOC (71% below the 3297 Phase-2 starting point); add_parser calls in main.py 89 -> 28. Deferred: sessions, computer-use, secrets handlers reference <name>_parser (for a no-subcommand print_help fallback) — left in place to avoid the _self_parser indirection; minority, low value. Behavior-neutral: all 9 subcommands' --help (incl nested subactions) byte- identical to pre-extraction (diff-verified). tests/hermes_cli/ 6519 passed / 0 failed; new test_subcommands_followup.py covers the 9 builders.	2026-06-07 22:56:23 -07:00
teknium1	524453dab5	refactor(agent): consolidate inner-retry-loop recovery flags into TurnRetryState (god-file Phase 1b) run_conversation's inner retry loop tracked recovery state in ~15 scattered bare booleans (per-provider OAuth refresh guards, format-recovery guards, restart signals). They are now fields on a single TurnRetryState dataclass the loop mutates in place (_retry.<flag>), giving the recovery bookkeeping a named, testable home. Loop-control vars (retry_count, max_retries, max_compression_attempts) stay as plain locals — they're while-mechanics, not recovery bookkeeping. Behavior-neutral: pure local→attribute rewrite of 42 references; kwarg NAMES preserved (e.g. has_retried_429=_retry.has_retried_429). Live simple + tool turns OK. Validation: tests/run_agent/ 1615 passed / 0 failed under per-file process isolation; new test_turn_retry_state.py pins the field contract.	2026-06-07 22:42:05 -07:00

1 2 3 4 5 ...

11019 commits