hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-09 08:21:50 +00:00

Author	SHA1	Message	Date
Michael Steuer	3d029a53ec	fix(gateway): close residual memory-leak sites under heavy scheduled workload Long-lived gateways under heavy cron/build workloads grow steadily (~18 MB/hr post-phantom-dispatch-fix) and eventually need a restart-or-OOM. Four retention sites, all confirmed live on current main: 1. _evict_cached_agent() (/model, /reasoning, codex-runtime, /undo, etc.) popped the cache entry without releasing the agent's OpenAI client, httpx transport, SSL context, or conversation history. Only /new cleaned up first. Now releases clients on a daemon thread, matching _enforce_agent_cache_cap. 2. _release_evicted_agent_soft() now clears _session_messages after release_clients() — tool outputs (file reads, terminal output, search results) can be tens of MB per 100+-tool-call session; the list is rebuilt from persisted session JSON on resume, so dropping it on soft eviction is safe. 3. The session-expiry watcher (permanent finalization) now drops the session's per-session control dicts (_session_model_overrides, _session_reasoning_overrides, _pending_approvals, _update_prompt_pending, _pending_model_notes). These leaked one entry per session per gateway lifetime. NOTE: this is the session-finalize path, NOT idle agent-cache eviction — an idle-evicted session is still alive and rebuilds its agent from these overrides, so pruning them there would silently reset a user's /model choice. 4. _tool_defs_cache is now bounded (_TOOL_DEFS_CACHE_MAX=8) with oldest-first eviction instead of growing unboundedly across the distinct toolset/config fingerprints a gateway sees over its lifetime. Salvaged from #25318 by Michael Steuer (@mssteuer); fix 3 redirected from the idle-sweep to the session-finalize lifecycle, magic number 8 lifted to a named constant, test ported. Fixes #19251 Co-authored-by: Michael Steuer <michael@make.software>	2026-06-08 06:32:42 -07:00
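A minimal sketch of the bounded, oldest-first cache described in item 4 above. This is not the project's code: `BoundedCache` and its methods are hypothetical, and `TOOL_DEFS_CACHE_MAX` only mirrors the `_TOOL_DEFS_CACHE_MAX=8` value named in the commit message.

```python
from collections import OrderedDict

# Mirrors the value named in the commit message; the name is a stand-in.
TOOL_DEFS_CACHE_MAX = 8


class BoundedCache:
    """Insertion-ordered cache that evicts the oldest entry once full (hypothetical)."""

    def __init__(self, max_entries=TOOL_DEFS_CACHE_MAX):
        self._max = max_entries
        self._entries = OrderedDict()

    def get(self, key):
        return self._entries.get(key)

    def put(self, key, value):
        if key in self._entries:
            # Overwriting keeps the entry's original insertion position.
            self._entries[key] = value
            return
        while len(self._entries) >= self._max:
            # last=False pops the oldest insertion first.
            self._entries.popitem(last=False)
        self._entries[key] = value

    def __len__(self):
        return len(self._entries)
```

Eviction here follows insertion order rather than recency of use, which matches the "oldest-first" wording; an LRU policy would be a different design choice.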
teknium1	400e6e43ca	test(gateway): de-flake concurrent-compression lock test with a barrier test_concurrent_compressions_same_session_serialize relied on a time.sleep(0.25) inside the stubbed compressor to make the two threads overlap inside the per-session lock window. Under CI CPU starvation that sleep is insufficient: one thread can acquire -> compress -> rotate -> RELEASE the lock before the other reaches try_acquire, so both acquire on the shared session_id and both compress (the recurring 'Expected exactly one agent to compress, got 2' failure on shard test (1)). Replace the timing dependency with a threading.Barrier(2) wrapped around the shared db's try_acquire_compression_lock: both threads rendezvous immediately before the real (atomic) acquire, guaranteeing genuine simultaneous contention regardless of scheduling. The real lock logic is unchanged and still picks exactly one winner — this only fixes the test's overlap guarantee. Restored after join so the post-join lock-leak assertion hits the unwrapped method. Verified: 20/20 plain + 15/15 under all-core CPU stress (load avg ~4.6), where the old version flaked.	2026-06-08 06:32:23 -07:00
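The barrier technique described above can be sketched as follows. `FakeSessionDB` and `run_contended` are hypothetical stand-ins, not the project's test; only `threading.Barrier` and the method name `try_acquire_compression_lock` come from the commit message.

```python
import threading


class FakeSessionDB:
    """Hypothetical stand-in with an atomic try-acquire keyed by session id."""

    def __init__(self):
        self._guard = threading.Lock()
        self._held = set()

    def try_acquire_compression_lock(self, session_id):
        with self._guard:
            if session_id in self._held:
                return False
            self._held.add(session_id)
            return True


def run_contended(db, session_id, workers=2):
    """Make every worker reach the real acquire at the same moment."""
    barrier = threading.Barrier(workers)
    real_acquire = db.try_acquire_compression_lock

    def rendezvous_then_acquire(sid):
        barrier.wait(timeout=5)  # all threads meet here before the real call
        return real_acquire(sid)

    # Shadow the method on the instance only; the class method is untouched.
    db.try_acquire_compression_lock = rendezvous_then_acquire
    results = []
    results_guard = threading.Lock()

    def worker():
        won = db.try_acquire_compression_lock(session_id)
        with results_guard:
            results.append(won)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # Remove the instance-level wrapper so later checks hit the real method.
    del db.try_acquire_compression_lock
    return results
```

Because the overlap no longer depends on a sleep, the outcome does not change with scheduler load: exactly one caller wins.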
kshitij	b99c6c4277	Merge #42076: nested category plugin discovery + alias-normalized enable/disable (#41066) Lands the complete nested category plugin fix: - Discovery in `hermes plugins list` (from @islam666's #41076, carried in this PR) - Alias-normalized enable/disable mutation path so nested plugins can be toggled - Fixes the #41076 base breakages (web_server 6-tuple unpack + stale test fixtures) Co-authored work: discovery by @islam666 (#41076). Closes #41066.	2026-06-08 05:47:27 -07:00
kshitijk4poor	2b89afec79	fix(plugins): alias-normalize enable/disable for nested category plugins (follow-up to #41076 ) #41076 makes `hermes plugins list` discover nested category plugins (e.g. observability/nemo_relay). This adds the missing enable/disable mutation path so those plugins can actually be toggled, and fixes two incomplete-update breakages on the #41076 base. Before: `hermes plugins enable nemo_relay` -> "Plugin 'nemo_relay' is not installed or bundled." (exit 1), because cmd_enable/cmd_disable went through _plugin_exists(), which only checked top-level plugins/<name>/. Changes: - Add _resolve_plugin_key(): resolve a bare manifest/leaf name OR a full path-derived key (observability/nemo_relay) to the canonical key the runtime loader gates on, reusing #41076's _discover_all_plugins(). A bare leaf name ambiguous across two categories resolves to None rather than silently picking one. - cmd_enable/cmd_disable resolve first, persist the canonical key, and drop any stale legacy bare-name alias so the enabled/disabled lists can't drift into a contradictory state. _plugin_exists delegates to the same resolver. - Fix #41076 base breakages: _discover_all_plugins now returns 6-tuples, but web_server._merged_plugins_hub() still unpacked 5 (ValueError on the dashboard plugins-hub endpoint) and several test_plugins_cmd_list.py fixtures were still 5-tuples. Both updated; the hub status check is now key-aware. Verified e2e on the real CLI + runtime loader (isolated HERMES_HOME): `hermes plugins enable nemo_relay` writes observability/nemo_relay to config.yaml and the loader then loads it (enabled=True, error=None); a stale bare-name alias is cleared on disable; the dashboard _merged_plugins_hub() runs without crashing. Adds resolution + enable/disable tests; full tests/hermes_cli/test_plugins_cmd* + web_server plugin tests green. Follow-up to #41076 (#41066). Branched from that PR's head.	2026-06-08 17:57:37 +05:30
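A rough sketch of the resolution rule described above: a full key matches itself, a bare leaf name matches only when it is unique, and an ambiguous leaf resolves to `None`. The function name and the flat list of keys are assumptions for illustration; the real `_resolve_plugin_key()` works on the output of `_discover_all_plugins()`, whose shape is not shown here.

```python
def resolve_plugin_key(name, discovered_keys):
    """Return the canonical key for `name`, or None if unknown or ambiguous.

    `discovered_keys` is a hypothetical flat list such as
    ["observability/nemo_relay", "some_top_level_plugin"].
    """
    if name in discovered_keys:
        return name  # already a canonical (possibly path-derived) key
    # Otherwise treat `name` as a bare leaf and look for a unique match.
    matches = [key for key in discovered_keys if key.rsplit("/", 1)[-1] == name]
    if len(matches) == 1:
        return matches[0]
    return None  # unknown, or ambiguous across categories
```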
kshitij	c3055d6185	Merge pull request #41984 from kshitijk4poor/salvage/6600-stale-streaming-worker fix(gateway): transcribe voice messages during active agent runs (salvage #6600, voice half)	2026-06-08 02:51:25 -07:00
kshitijk4poor	f96eb857a5	chore: add kristianvast to AUTHOR_MAP	2026-06-08 15:16:20 +05:30
Kristian Vastveit	d55304c39f	fix(gateway): transcribe voice messages during active agent runs Salvaged from #6600 (@kristianvast) — re-scoped to the voice half only and rebased onto current main. The cascading-interrupt hang half of the original PR landed independently in `dd0d1222a`, so this carries ONLY Problem 1. When a voice/audio message arrives while the agent is busy on the same session, it hit the interrupt path with empty text because STT only ran after the running-agent guard — the voice was effectively lost. Now we transcribe audio BEFORE signaling the agent (and on the fresh-message path), echo the raw transcript back to the user (🎙️), and _enrich_message_with_transcription returns (text, transcripts) so callers can echo. A new _dequeue_pending_with_transcription drives the post-agent drain the same way. Reapplied onto _prepare_inbound_message_text (inbound enrichment was extracted from the inline dispatch block since the original PR). Co-authored-by: Kristian Vastveit <kristian@agrointel.no>	2026-06-08 15:16:20 +05:30
teknium1	00c46b8ff9	test(tui): cover heapdump opt-in gate + retention; add AUTHOR_MAP On-disk vitest coverage for the auto-heapdump disk-safety guard: opt-in gating (suppressed diagnostics-only path), truthy-spelling acceptance, manual-trigger passthrough, and the retention prune. Test approach adapted from #21780 (briandevans) and #21822 (LeonSGP43), reconciled to the merged gate semantics. Maps alarcritty into AUTHOR_MAP for CI.	2026-06-08 02:20:49 -07:00
alarcritty	8ae0d054f4	fix(tui): guard automatic heap dumps against disk fill Automatic heap dumps from the TUI memory monitor could write multi-GiB .heapsnapshot files on every threshold cross, growing ~/.hermes/heapdumps to tens of GiB. Add four layered safeguards: - Gate auto-high/auto-critical snapshots behind HERMES_AUTO_HEAPDUMP=1; manual dumps remain unchanged. - Always write the lightweight diagnostics JSON sidecar so users still get an actionable artifact when the snapshot is suppressed. - Cap total bytes in the dump dir (HERMES_HEAPDUMP_MAX_BYTES, default 2 GiB), evicting oldest first, retaining the newest. - Add a cooldown between auto dumps (HERMES_AUTO_HEAPDUMP_COOLDOWN_MS, default 10 min) so an oscillating heap can't re-trigger. Closes #21767	2026-06-08 02:20:49 -07:00
teknium1	dd0d1222a2	fix(agent): don't retry interrupt-induced transport errors (cascading-interrupt hang) When agent.interrupt() fires during an active LLM call, the main poll loop force-closes the worker-local httpx client to stop token generation. That raises a transport error (RemoteProtocolError) on the worker thread — the EXPECTED consequence of our own close, not a network bug. The streaming retry loop misclassified it as a transient connection error and retried; each doomed retry stalled for the full stream-stale timeout (up to 300s). Because the gateway caches AIAgent instances per session, the stale worker outlived the interrupted turn and raced the next turn's request on shared client state — the root of the multi-minute cascading-interrupt hang reported in the wild. Fix: a request-local _request_cancelled token set by the poll loop right before the force-close, in both interruptible_api_call (non-streaming) and interruptible_streaming_api_call. The worker's exception handler checks the token and exits cleanly — no retry, no fallback, no 'reconnecting' status — instead of treating the forced error as transient. The token is request-local (not agent._interrupt_requested, which is cleared at turn boundaries) so a stale worker outliving its turn still recognizes its own forced close. Original diagnosis and fix by @kristianvast (PR #6600), against the then-inline methods in run_agent.py. Those were since extracted into agent/chat_completion_helpers.py, so the fix is reapplied there. Co-authored-by: Kristian Vastveit <kristianvast@users.noreply.github.com>	2026-06-08 02:19:13 -07:00
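The request-local token idea above can be illustrated with a small sketch. Everything here is hypothetical (`FakeClient`, `TransportError`, `interruptible_call`); it only shows the ordering the commit describes: set the token first, then force-close, and let the worker classify its own error by checking the token.

```python
import threading


class TransportError(Exception):
    """Stand-in for a transport-level failure such as a closed connection."""


class FakeClient:
    """Hypothetical client whose stream blocks until the client is closed."""

    def __init__(self):
        self._closed = threading.Event()

    def close(self):
        self._closed.set()

    def stream(self):
        self._closed.wait(timeout=5)
        raise TransportError("connection closed mid-stream")


def interruptible_call(client, interrupt_requested):
    request_cancelled = threading.Event()  # local to this one request
    outcome = {}

    def worker():
        try:
            outcome["value"] = client.stream()
            outcome["status"] = "ok"
        except TransportError:
            # A forced close is expected; anything else may be transient.
            outcome["status"] = "cancelled" if request_cancelled.is_set() else "retry"

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    while thread.is_alive():
        if interrupt_requested.is_set() and not request_cancelled.is_set():
            request_cancelled.set()  # mark the request BEFORE closing
            client.close()
        thread.join(timeout=0.01)
    return outcome["status"]
```

Keeping the token inside the call means a worker that outlives its turn still sees its own cancellation, even if agent-level interrupt state has been reset in the meantime.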
Teknium	aa6f2775fa	fix(memory): run end-of-turn sync off the turn thread (#41945) A misconfigured/slow external memory provider could hold the agent in the 'running' state for minutes after the final response was delivered. MemoryManager.sync_all / queue_prefetch_all looped provider.sync_turn / queue_prefetch INLINE on the turn-completion path; a provider making a blocking network/daemon call (a broken Hindsight daemon was observed blocking ~298s before failing) blocked run_conversation from returning. Because every interface (CLI, TUI, gateway) marks the agent 'running' until run_conversation returns, the agent stayed busy for the full block and any follow-up message triggered an aggressive interrupt that dropped the message. Dispatch provider sync/prefetch to a lazily-created single-worker background executor. sync_all / queue_prefetch_all return immediately; work completes (or fails, logged) in the background. A single worker serializes writes so turn N lands before turn N+1. flush_pending() provides a barrier for session boundaries and deterministic tests. shutdown_all() drains the executor with a bounded timeout so a wedged provider can never hang teardown. Builtin-only / no-provider sessions spawn no executor (zero new threads in the common case).	2026-06-08 02:18:59 -07:00
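A minimal sketch of the dispatch pattern described above, assuming a provider object with a `sync_turn(turn)` method. `BackgroundSync` is a hypothetical class, not `MemoryManager`; it shows the lazily created single-worker executor, the ordering guarantee, and a flush barrier.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class BackgroundSync:
    """Hypothetical sketch: run provider syncs on one background worker."""

    def __init__(self, providers):
        self._providers = list(providers)
        self._executor = None  # created only when there is work to dispatch
        self._pending = []

    def _get_executor(self):
        if self._executor is None:
            # One worker serializes writes: turn N completes before turn N+1.
            self._executor = ThreadPoolExecutor(max_workers=1)
        return self._executor

    @staticmethod
    def _safe_sync(provider, turn):
        try:
            provider.sync_turn(turn)
        except Exception:
            pass  # real code would log the failure instead of swallowing it

    def sync_all(self, turn):
        """Queue the sync and return immediately."""
        for provider in self._providers:
            future = self._get_executor().submit(self._safe_sync, provider, turn)
            self._pending.append(future)

    def flush_pending(self, timeout=None):
        """Barrier: block until queued work finishes or the timeout expires."""
        pending, self._pending = self._pending, []
        wait(pending, timeout=timeout)

    def shutdown_all(self, timeout=5.0):
        """Drain with a bounded wait, then release the worker without blocking."""
        self.flush_pending(timeout=timeout)
        if self._executor is not None:
            self._executor.shutdown(wait=False)
            self._executor = None
```

With no providers configured, `sync_all` never touches the executor, so no thread is created.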
xxxigm	a5c12f5f59	fix(install): move broken checkout aside instead of deleting it Review feedback (#40998): `rm -rf` / `Remove-Item -Recurse -Force` on the install dir is destructive -- a user might still want whatever is there. Rename the broken checkout to a timestamped `<dir>.broken-<ts>` backup and re-clone fresh, so nothing is ever deleted. Transient cleanup of a clone attempt that fails within the same run is left as-is.	2026-06-08 02:18:21 -07:00
xxxigm	5d7abf9114	test(install): cover commit-less checkout handling (#40998) Behavioral coverage for install.sh's clone_repo() guard (removes a commit-less checkout, keeps a real one, ignores a non-repo dir) plus a contract check that install.ps1's repo-validity gate requires a resolvable HEAD.	2026-06-08 02:18:21 -07:00
xxxigm	fc0900d120	fix(install): re-clone interrupted (commit-less) checkout instead of failing An interrupted previous clone leaves the install dir's .git present but with no initial commit. rev-parse --is-inside-work-tree and git status both still succeed there, so the installer entered the update path and ran `git stash`, which aborts with "You do not have the initial commit yet" and failed the desktop install at the "Cloning Hermes repository" stage. - install.ps1: add a `git rev-parse --verify HEAD` probe to the repo-validity check so a commit-less checkout is treated as broken and re-cloned fresh. - install.sh: mirror it at the top of clone_repo() — drop a partial checkout with no resolvable HEAD so the fresh-clone path handles it (POSIX parity). Fixes #40998	2026-06-08 02:18:21 -07:00
teknium1	0904bc7ea2	refactor(cli): extract 32 slash-command handlers into CLICommandsMixin (god-file Phase 4) Lift the `_handle_*_command` cluster (2,077 LOC) out of HermesCLI into hermes_cli/cli_commands_mixin.py; HermesCLI now inherits CLICommandsMixin so every self.<handler> call resolves unchanged via the MRO. Behavior-neutral. Import discipline mirrors gateway/slash_commands.py (PR #41886): neutral deps imported at the mixin module top level; cli.py-internal helpers/constants (_cprint, _ACCENT, save_config_value, ...) imported lazily inside each handler via 'from cli import ...' so the mixin never imports cli at module scope. cli.py 16215 -> 14139 LOC. One test mock repointed (cli.is_browser_debug_ready -> hermes_cli.cli_commands_mixin.is_browser_debug_ready).	2026-06-08 02:13:07 -07:00
kshitij	4eb8972390	Merge pull request #33817 from sweetcornna/fix/28503-busy-input-fifo fix(gateway): use FIFO queue for busy_input_mode pending messages	2026-06-08 02:02:02 -07:00
Gille	039fbb41fc	fix(desktop): show newly configured model providers (#41545)	2026-06-08 01:39:37 -07:00
floory	15c99b437f	fix(cli): set PYTHON env for node-gyp native builds on NixOS (#40690 ) * fix(cli): set PYTHON env for node-gyp native builds on NixOS node-gyp (triggered by node-pty during npm ci) looks for python3 on PATH, which fails on NixOS because python3 lives in the nix store and is not on the system PATH. Add _nixos_build_env() — a two-tier helper that detects NixOS and: 1. Fast path: hermes venv python3 (~0s) 2. Fallback: nix-shell which python3 (~2-5s) Wire it into _run_npm_install_deterministic() via a new env= parameter, then pass it through cmd_gui() and _update_node_dependencies(). Non-NixOS systems: _nixos_build_env() returns None, behavior unchanged. * fix(cli): merge _nixos_build_env() with os.environ, fix NixOS detection, add explicit return None - Critical fix: both Tier 1 (venv) and Tier 2 (nix-shell) now return {**os.environ, "PYTHON": ...} instead of {"PYTHON": ...} — subprocess.run with env= replaces the entire environment, so the old code wiped PATH and broke npm/node on NixOS entirely. - Uses re.search(r"^ID=nixos$", ...) for anchored NixOS detection instead of unanchored substring match (could match ID_LIKE=...nixos). - Removes redundant Path.exists() guard before read_text(); just catches OSError (one filesystem read instead of two). - Adds explicit return None at end of function for type-hint consistency.	2026-06-08 13:57:37 +05:30
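The two fixes in the second commit message above can be sketched in a few lines. The function names are hypothetical, and the `re.MULTILINE` flag is an assumption (the message elides the remaining arguments to `re.search`); the point is the anchored match and the merge with the existing environment.

```python
import os
import re


def is_nixos(os_release_text):
    """Anchored check: only a line that is exactly ID=nixos counts."""
    # re.MULTILINE is an assumption so ^ and $ match per line.
    return re.search(r"^ID=nixos$", os_release_text, re.MULTILINE) is not None


def build_env(python_path, base_env=None):
    """Return a full environment with PYTHON added.

    subprocess.run(env=...) replaces the whole environment, so the new
    variable has to be merged into a copy of the existing one.
    """
    base = dict(os.environ if base_env is None else base_env)
    return {**base, "PYTHON": python_path}
```

An unanchored substring test would also match a line like `ID_LIKE=nixos`, which is the false positive the commit calls out.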
teknium1	7a5827c8b0	test: repoint percentage-clamp source guard to gateway/slash_commands.py test_gateway_run_clamped read gateway/run.py asserting the /usage stats handler clamps pct with min(100, ...). That handler moved to gateway/slash_commands.py in this PR's extraction; repoint the guard so it still fires on clamp removal. tests/run_agent/ + tests/gateway/ 8024 passed / 0 failed.	2026-06-08 01:25:35 -07:00
teknium1	de5fe2fa7d	test(gateway): repoint slash-command mocks after mixin extraction Tests for the extracted handlers mocked symbols at gateway.run.*; the handlers now resolve top-level-imported deps (atomic_json_write, fetch_account_usage, render_account_usage_lines) and __file__ from gateway.slash_commands. Repoint those mocks. run.py-resident methods (_increment_restart_failure_counts, _clear_restart_failure_count) keep their gateway.run.atomic_json_write mock — only the moved handlers' mocks change. tests/gateway/ 6415 passed / 0 failed.	2026-06-08 01:25:35 -07:00
teknium1	619bd78273	refactor(gateway): extract 42 slash-command handlers into GatewaySlashCommandsMixin (god-file Phase 3b) The in-session slash commands (/model, /reset, /usage, /compress, /voice, ...) — 42 _handle_*_command handlers, ~3,200 LOC — move out of gateway/run.py into a mixin GatewayRunner inherits. self._handle_*_command dispatch + all test references resolve unchanged via the MRO. Neutral deps (MessageEvent, EphemeralReply, Platform, t, cfg_get, atomic_*_write, account-usage helpers, stdlib) imported at the mixin top level. The ~10 run.py-internal helpers (_hermes_home, _load_gateway_config, _resolve_gateway_model, _AGENT_PENDING_SENTINEL, ...) imported lazily inside the handlers that need them to avoid an import cycle. gateway/run.py 19157 -> 15870 LOC; GatewayRunner direct methods 214 -> 172. Behavior-neutral: voice/update/model/compress command test suites pass; all 42 resolve to the mixin via MRO.	2026-06-08 01:25:35 -07:00
teknium1	02a4d66951	fix(auxiliary): retry transient transport error once before fallback (#16587) A one-off transient transport failure (streaming-close / incomplete chunked read / 5xx / 408) on an auxiliary LLM call escalated straight to provider/model fallback (or, for context compression, dropped the summary and entered cooldown), even when an immediate retry on the same provider would have succeeded. Add a single same-target retry at the top of call_llm() and async_call_llm() — before the existing except-chain — gated on a new _is_transient_transport_error() that reuses the canonical _is_connection_error() detector plus a 5xx/408 status check.
A second failure (or any non-transient error: auth, other 4xx, malformed payload) falls through to first_err and the existing fallback handling unchanged. This lives in call_llm so every auxiliary task (compression, memory flush, title generation, session search, vision) shares one transient-retry surface, rather than each caller re-implementing it. The context compressor needs no change — it calls call_llm and inherits the retry; its existing fallback-to-main path (#18458) now composes naturally (retry the aux model once, then fall back to main only if the retry also fails). Co-authored-by: ARegalado1 <alberto.regalado@ymail.com>	2026-06-08 01:05:45 -07:00
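A sketch of the single same-target retry described above, under stated assumptions: `call_with_single_retry` and the `is_transient` predicate are hypothetical stand-ins for the logic inside `call_llm()` and `_is_transient_transport_error()`, whose actual signatures are not shown in the message.

```python
def call_with_single_retry(call, is_transient):
    """Run `call`; on a transient failure, try the same target exactly once more.

    A non-transient first error, or any error from the second attempt,
    propagates so the caller's existing fallback handling takes over.
    """
    try:
        return call()
    except Exception as first_err:
        if not is_transient(first_err):
            raise
    # The first failure was transient: one more attempt, no further retries.
    return call()
```

Placing the retry in one shared entry point means every caller inherits it without adding its own retry loop.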
kshitij	4107076128	Merge pull request #41155 from kshitijk4poor/fix/cli-modal-direct-invalidate-41098 fix(cli): paint approval/clarify/sudo/secret modal prompts directly, not via the throttle (#41098)	2026-06-08 01:01:51 -07:00
Teknium	4d18717b6c	fix(gateway): drop --replace from systemd unit templates (#41892) Under systemd's Restart=always, --replace turns every restart into a self-kill loop: the new instance reads gateway.pid, kills the previous process, writes its own PID, and on the next restart the cycle repeats. A process supervisor owns the lifecycle — --replace is for manual one-shot takeovers and fights the supervisor. Remove --replace from both the system-level and user-level systemd ExecStart lines. The --replace flag stays available for manual 'hermes gateway run --replace' and on the macOS launchd fallback path (#23387), which is a deliberate manual takeover, not a supervised unit. Also drop RestartMaxDelaySec / RestartSteps from the templates — they require systemd v255+ and are silently ignored on older versions. The _strip_optional_systemd_directives normalizer stays so existing installs whose on-disk unit still carries those directives aren't flagged as outdated. Credit: reported and diagnosed by @Skippy-the-Magnificent-one (PR #37145); reimplemented here under project authorship because the original commit was authored under a non-existent email.	2026-06-08 00:20:08 -07:00
Siddharth Balyan	d02a59b679	fix(nix): cold npm builds + fix-lockfiles real-build verification + auto-fix workflow (#41867 ) * fix(nix): fix-lockfiles real-build verification + point auto-fix at nix/lib.nix Two related fixes to the npm lockfile-hash tooling that, together, let a broken nix build slip onto main and stay there: 1. fix-lockfiles trusted prefetch-npm-deps. It computes the hash from the lockfile contents and early-exited "ok" whenever that matched the pin, never running the real fetchNpmDeps + npmConfigHook build. Those two can disagree (the --apply path already works around it), so `--check` reported "ok" while a cold build was actually broken (e.g. lockfile engines/os/cpu fields the pinned nixpkgs strips from the deps cache, tripping npmConfigHook's consistency diff). Now, when prefetch says the hash matches, confirm with `nix build .#<attr>` before believing it: adopt the real fetchNpmDeps hash if nix reports a 'got:' mismatch, surface non-hash failures honestly (exit 1) instead of claiming "ok", and keep the transient-cache-failure skip. 2. nix-lockfile-fix.yml's auto-fix-main (and the PR-fix job) whitelisted and staged nix/tui.nix + nix/web.nix, but the single npmDepsHash moved to nix/lib.nix. So fix-lockfiles --apply edited nix/lib.nix, the guard flagged it as an "unexpected modified file", and the job exited without committing — the auto-healer could never push a fix. Point the guard regex and both `git add` lines at nix/lib.nix. * fix(nix): fix cold npm builds — adopt the deps-cache lockfile in patchPhase hermes-tui/hermes-agent could not be built from source on the pinned nixpkgs: prefetch-npm-deps strips advisory lockfile fields (engines/os/cpu/funding/ bin/…) that newer npm writes into package-lock.json, then npmConfigHook byte-compares the source lockfile against the cache's stripped copy and fails on the difference. 
CI only stayed green because it substitutes the prebuilt hermes-tui from Cachix and never cold-builds it; anyone building cold (e.g. a local path: input, or a cache miss) hit the failure. mkNpmPassthru's patchPhase now copies the cache's own normalized package-lock.json over the source before npmConfigHook runs, so the consistency check is trivially satisfied. The resolved dependency set (version/resolved/integrity/dependencies) is identical — fetchNpmDeps derived the cache from this very lockfile — so `npm ci` installs the same tree; only advisory metadata is dropped. Genuine drift is still caught by the fixed-output npmDepsHash check, which runs before this phase. Verified by cold-building .#tui and .#default (full hermes-agent) from scratch on the pinned nixpkgs (6201e2) — both succeed where they previously failed at npmConfigHook.	2026-06-08 12:41:37 +05:30
Teknium	e45b745835	fix(file-tools): reject sentinel TERMINAL_CWD; anchor worktree edits before live cwd exists (#41861 ) Completes the worktree-misroute fix from #35399, which made misroutes visible (resolved_path) but did not prevent them: its divergence warning only fired once a terminal command had populated the live cwd registry. A fresh worktree session (registry still empty) with a stale TERMINAL_CWD='.' got neither a worktree anchor nor a warning, so a relative write_file/patch silently landed in the MAIN checkout. Two changes in tools/file_tools.py: - Treat sentinel TERMINAL_CWD values ('', '.', './', 'auto', 'cwd') and any relative value as UNSET rather than a literal anchor. Previously '.' was joined onto the process cwd, silently routing edits to wherever the process happened to be (the main repo, in a worktree session). The gateway already sanitizes the same set at import time; the file-tool layer now matches. - New _authoritative_workspace_root(): prefers the live terminal cwd, else a sentinel-free absolute TERMINAL_CWD (the worktree path cli.py/main.py set for -w). _resolve_base_dir() and _path_resolution_warning() both use it, so a worktree session resolves into — and warns about escaping — the worktree from the very first write, before any cd has run. Validation: 11 new/parametrized tests (sentinel handling, empty-registry anchoring, early divergence warning, live-cwd precedence). 32/32 pass under scripts/run_tests.sh. Live E2E: relative write in an empty-registry worktree session lands in the worktree, main untouched.	2026-06-07 23:58:47 -07:00
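The anchoring rule above can be sketched as a small pure function. The sentinel set is taken from the commit message; the function name, its two-argument signature, and the `None` return for "unset" are assumptions, not the real `_authoritative_workspace_root()`.

```python
import os

# Sentinel values listed in the commit message; all are treated as "unset".
SENTINEL_CWD_VALUES = {"", ".", "./", "auto", "cwd"}


def authoritative_workspace_root(live_cwd, terminal_cwd):
    """Prefer the live terminal cwd, else an absolute, non-sentinel TERMINAL_CWD.

    Returns None when neither source gives a trustworthy anchor.
    """
    if live_cwd:
        return live_cwd
    value = (terminal_cwd or "").strip()
    if value in SENTINEL_CWD_VALUES or not os.path.isabs(value):
        return None  # never join a relative value onto the process cwd
    return value
```

Treating relative values as unset is what keeps a stale `.` from silently resolving to wherever the process happens to be running.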
LeonSGP43	e02f4c03c3	fix(gateway): abort --replace when old PID survives SIGKILL When --replace force-kills an unresponsive old gateway, SIGKILL can fail to reap it (uninterruptible sleep, zombie-reaping parent, etc.). The old code unconditionally cleared the PID file and scoped locks and started a fresh instance anyway, leaving two live gateways fighting over the same bot token — a duplicate-gateway failure mode of #19471. Re-verify the process is actually gone (via the Windows-safe _pid_exists helper) after the force-kill; if it still appears alive, clear the takeover marker and abort the replacement instead of duplicating. Co-authored-by: Hermes <noreply@nousresearch.com>	2026-06-07 23:57:32 -07:00
konsisumer	3714caa1b9	fix(session): follow compression continuations for transcript reads	2026-06-07 23:57:20 -07:00
teknium1	329c33dac3	fix(terminal): read cwd overrides under raw task_id after container collapse PR #41822 collapsed CWD-only overrides to the shared 'default' container via _resolve_container_task_id, but three call sites kept routing the env/override lookup through that collapsed id: - the foreground exec path read _task_env_overrides[effective_task_id], yet register_task_env_overrides writes under the raw task_id, so a CWD-only override's cwd was silently dropped (env spun up at the wrong root, exit 126); - the get-or-create env lookup keyed solely on effective_task_id, so an env cached under the raw task_id was missed and duplicated; - register_task_env_overrides synced the new cwd onto the env under the collapsed id, missing a live env cached under the raw task_id. Container identity still collapses to 'default' (sharing preserved); only the per-session env/override lookup now prefers the raw task_id and falls back to the collapsed id. Fixes the 3 regressions in test_terminal_task_cwd.py left red by #41822.	2026-06-07 23:44:04 -07:00
teknium1	d759c13c09	chore(salvage): lint fix + AUTHOR_MAP for desktop source-folders PR #40272 eslint --fix (import sort + padding-line-between-statements) on sidebar/index.tsx after cherry-picking @dangelo352's commits; add release.py AUTHOR_MAP entry so CI doesn't block on the unmapped author email.	2026-06-07 23:44:04 -07:00
D'Angelo Rodriguez	694adec635	Smooth desktop sidebar drag sorting	2026-06-07 23:44:04 -07:00
D'Angelo Rodriguez	f0fcaa1e54	Preserve dragged order inside source folders	2026-06-07 23:44:04 -07:00
D'Angelo Rodriguez	0f500fc41d	Render grouped sessions when local list is empty	2026-06-07 23:44:04 -07:00
D'Angelo Rodriguez	3fc67b7333	Persist desktop sidebar drag order	2026-06-07 23:44:04 -07:00
D'Angelo Rodriguez	ede4f5a4a3	Show messaging source folders in desktop sessions	2026-06-07 23:44:04 -07:00
D'Angelo Rodriguez	9d6992ee8a	Show platform sources in desktop sessions	2026-06-07 23:44:04 -07:00
teknium1	1c68f6f81f	refactor(gateway): extract kanban watcher loops into GatewayKanbanWatchersMixin (god-file Phase 3) gateway/run.py is the largest god file (20k LOC, GatewayRunner with 220 methods). This lifts the cohesive kanban-watcher cluster — _kanban_notifier_watcher, _kanban_dispatcher_watcher, _kanban_advance/unsub/rewind, _deliver_kanban_artifacts (~1,035 LOC, 6 methods) — into gateway/kanban_watchers.py as a mixin that GatewayRunner inherits. Mixin (not free functions) because the methods use only self state: inheriting keeps every self._kanban_* call site working unchanged via the MRO, making this a behavior-neutral move. The methods' lazy imports (_kb, _decomp, _load_config, Platform) travel with them; the mixin needs only stdlib + a matching logging.getLogger('gateway.run'). run.py 20187 -> 19157 LOC; GatewayRunner direct methods 220 -> 214. Behavior-neutral: gateway test suite 6582 passed / 0 failed; start() still wires both watchers via self._kanban_*; MRO resolves all 6 to the mixin. One test (corrupt-board quarantine retry) keyed its time-travel mock on the caller's filename being gateway/run.py — updated to also accept gateway/kanban_watchers.py. Establishes the mixin-extraction pattern for further GatewayRunner decomposition (the 2406-LOC _run_agent and 1164-LOC _handle_message remain, but their callback closures need a context-object redesign — deferred).	2026-06-07 23:14:18 -07:00
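A toy illustration of why the mixin move is behavior-neutral. The class and method names echo the commit message, but the bodies are invented for the example; the point is that `self.<method>` calls keep resolving through the MRO after the method moves to the mixin.

```python
class GatewayKanbanWatchersMixin:
    """Methods moved out of the big class; they use only state on self."""

    def _kanban_advance(self):
        # Invented body: the real method's logic is not shown in the log.
        self._kanban_position += 1
        return self._kanban_position


class GatewayRunner(GatewayKanbanWatchersMixin):
    def __init__(self):
        self._kanban_position = 0

    def start(self):
        # Unchanged call site: resolved on the mixin via the MRO.
        return self._kanban_advance()
```

The same lookup rule is why tests that reference `GatewayRunner._kanban_advance` keep working, while tests that patch module-level names need repointing to the new module.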
liuhao1024	6459b3d991	fix(terminal): collapse CWD-only overrides to shared container When register_task_env_overrides is called with only a 'cwd' key (ACP adapter workspace tracking), the task_id should collapse to 'default' so all interactive surfaces (TUI, gateway, dashboard) share one long-lived container. Previously, any override registration — even CWD-only — caused _resolve_container_task_id to return the session key unchanged, spinning up a separate container per session. This made it impossible to authenticate into external services once and have that auth available across all surfaces. Now only overrides containing isolation keys (docker_image, modal_image, singularity_image, daytona_image, env_type) trigger per-task container isolation. Fixes #37361	2026-06-07 23:04:54 -07:00
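The collapse rule above reduces to a membership test. The isolation keys are the ones listed in the commit message; the function below is a simplified, hypothetical version of `_resolve_container_task_id` that takes the overrides as an argument.

```python
# Keys that, per the commit message, require a dedicated container.
ISOLATION_KEYS = frozenset(
    {"docker_image", "modal_image", "singularity_image", "daytona_image", "env_type"}
)


def resolve_container_task_id(task_id, overrides):
    """Keep the per-task id only when an isolation key is present."""
    if overrides and not ISOLATION_KEYS.isdisjoint(overrides):
        return task_id
    # CWD-only (or no) overrides share the long-lived default container.
    return "default"
```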
teknium1	1a626470ca	refactor(cli): promote 9 closure handlers to top-level + extract their parsers (god-file Phase 2 follow-up) Subcommands whose handler was a closure defined inside main() — memory, acp, tools, insights, skills, pairing, plugins, mcp, claw — have their handler promoted to a top-level function and their parser block extracted into hermes_cli/subcommands/<name>.py (build_<name>_parser, injected handler). These 9 had zero closure-over-main-locals, so promotion is a pure relocation. acp/mcp parser blocks use the shared add_accept_hooks_flag helper. main() 1798 -> 954 LOC (71% below the 3297 Phase-2 starting point); add_parser calls in main.py 89 -> 28. Deferred: sessions, computer-use, secrets handlers reference <name>_parser (for a no-subcommand print_help fallback) — left in place to avoid the _self_parser indirection; minority, low value. Behavior-neutral: all 9 subcommands' --help (incl nested subactions) byte-identical to pre-extraction (diff-verified). tests/hermes_cli/ 6519 passed / 0 failed; new test_subcommands_followup.py covers the 9 builders.	2026-06-07 22:56:23 -07:00
teknium1	524453dab5	refactor(agent): consolidate inner-retry-loop recovery flags into TurnRetryState (god-file Phase 1b) run_conversation's inner retry loop tracked recovery state in ~15 scattered bare booleans (per-provider OAuth refresh guards, format-recovery guards, restart signals). They are now fields on a single TurnRetryState dataclass the loop mutates in place (_retry.<flag>), giving the recovery bookkeeping a named, testable home. Loop-control vars (retry_count, max_retries, max_compression_attempts) stay as plain locals — they're while-mechanics, not recovery bookkeeping. Behavior-neutral: pure local→attribute rewrite of 42 references; kwarg NAMES preserved (e.g. has_retried_429=_retry.has_retried_429). Live simple + tool turns OK. Validation: tests/run_agent/ 1615 passed / 0 failed under per-file process isolation; new test_turn_retry_state.py pins the field contract.	2026-06-07 22:42:05 -07:00
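A sketch of the flag consolidation described above. Only `TurnRetryState` and `has_retried_429` are named in the commit message; the other two fields are hypothetical placeholders for the OAuth-refresh guards and restart signals it mentions.

```python
from dataclasses import dataclass


@dataclass
class TurnRetryState:
    """Recovery bookkeeping for one turn's retry loop (fields partly hypothetical)."""

    has_retried_429: bool = False          # named in the commit message
    oauth_refresh_attempted: bool = False  # hypothetical example field
    restart_requested: bool = False        # hypothetical example field


def handle_rate_limit(retry: TurnRetryState) -> bool:
    """Allow one retry after a 429, recording it on the shared state."""
    if retry.has_retried_429:
        return False
    retry.has_retried_429 = True
    return True
```

Loop counters stay as plain locals in this design; only the recovery flags move onto the dataclass.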
teknium1	4d926f248d	chore(release): add AUTHOR_MAP entry for rodboev	2026-06-07 22:39:51 -07:00
Rod Boev	648706936d	test(gateway): add compression session_id rotation integration tests (#34089)	2026-06-07 22:39:51 -07:00
teknium1	39c4ac3af1	chore(release): add AUTHOR_MAP entry for JimStenstrom	2026-06-07 22:30:02 -07:00
JimStenstrom	cb5c24e37d	fix(agent): sync logging session context on compaction id rotation When context compaction rotates agent.session_id, it updates the gateway/tools session context (set_current_session_id -> HERMES_SESSION_ID env + ContextVar) but never updates the separate logging session context. The [session_id] tag on log lines comes from hermes_logging._session_context (set once per turn in conversation_loop.py), so post-compaction log lines in the same turn carry the STALE old id while the message/DB/gateway state carry the new one — breaking log correlation exactly at the compaction boundary. Call hermes_logging.set_session_context(agent.session_id) alongside the existing set_current_session_id, guarded so a logging failure can't regress the routing update. Logs-only; no runtime or caching impact. Refs #34089	2026-06-07 22:30:02 -07:00
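The guarded update in cb5c24e37d might look roughly like the sketch below. The two setters are passed in as parameters so the example is self-contained; `sync_session_ids` and its signature are hypothetical, and only the ordering and the guard follow the commit message.

```python
import logging

logger = logging.getLogger(__name__)


def sync_session_ids(new_id, set_current_session_id, set_logging_session_context):
    # Routing/tools context first: this update must always land.
    set_current_session_id(new_id)
    try:
        # Logging context second, guarded so a logging failure
        # cannot regress the routing update above.
        set_logging_session_context(new_id)
    except Exception:
        logger.debug("could not update logging session context", exc_info=True)
```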
Teknium	8e223b36ed	fix(curator): protect load-bearing built-in skills from archival/consolidation (#41817) The curator's idle-archival path (apply_automatic_transitions under prune_builtins) could archive the bundled `plan` skill, killing the /plan slash command silently — typing /plan then returned 'Unknown command' with no signal that a skill had vanished. The archived skill's hash stays in .bundled_manifest, so 'hermes update' wouldn't re-seed it. Add PROTECTED_BUILTIN_SKILLS ({plan}) enforced at the master gate is_curation_eligible() (covers archive_skill + the transition walk) and in the candidate enumerator (so the LLM consolidation pass never sees them). Immune to prune_builtins, pin state, and LLM judgment.	2026-06-07 22:23:29 -07:00
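A minimal sketch of the gate described in 8e223b36ed, assuming skills are identified by name. The names `PROTECTED_BUILTIN_SKILLS` and `is_curation_eligible` appear in the commit message; the signatures, the `pinned` and `prune_builtins` parameters, and `enumerate_candidates` are hypothetical stand-ins.

```python
PROTECTED_BUILTIN_SKILLS = frozenset({"plan"})


def is_curation_eligible(skill_name, *, pinned=False, prune_builtins=True):
    # Protection is checked first, so no other setting can override it.
    if skill_name in PROTECTED_BUILTIN_SKILLS:
        return False
    # Hypothetical remaining policy: pinned skills are never curated.
    return prune_builtins and not pinned


def enumerate_candidates(skill_names):
    # Protected skills are filtered out before any consolidation pass sees them.
    return [name for name in skill_names if is_curation_eligible(name)]
```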
Teknium	777dc9da62	feat(acp): emit session provenance metadata for compression rotation (#41724) Closes #33617. Adds additive _meta.hermes.sessionProvenance to ACP session surfaces so clients can detect compression-driven internal session rotation without parsing status text, guessing from token drops, or reading state.db. Derived on demand from the existing compression chain (parent_session_id / end_reason) — no new persisted state, no schema change, no ACP protocol change. ACP session_id stays the stable client handle. - acp_adapter/provenance.py: derive provenance from SessionDB - server.py: attach _meta to new/load/resume responses; emit a session_info_update when the internal head rotates during a prompt	2026-06-07 22:22:21 -07:00
teknium1	240c5d4543	chore: map martin.alca@gmail.com -> draix in AUTHOR_MAP Salvage follow-up for PR #33221 — the cherry-picked commit is authored under martin.alca@gmail.com (not the draixagent@gmail.com already mapped), which would fail the CI author-attribution gate.	2026-06-07 22:22:01 -07:00
Martín Alcalá Rubí	132d6fe6d6	fix(volcengine): strip XML attribute fragments from tool_use.name (#33007) VolcEngine's api/plan endpoint occasionally leaks raw XML attribute fragments into tool_use.name when its protocol-translation layer converts the model's native XML-style tool emission to Anthropic Messages tool_use blocks, producing names like: terminal" parameter="command" string="true; execute_code" parameter="code" string="true; session_search" parameter="session_id" string="true. The corruption happens server-side at the provider, but it breaks every tool call for affected users — no normalization rule in repair_tool_call can rescue them, so each request runs through three retries and then aborts as partial. Add an early sanitizer in agent_runtime_helpers.repair_tool_call that trims at the first ' " ', " ' ", '<', or '>' character (idx > 0 only) so the rest of the existing repair pipeline (lowercase / snake_case / fuzzy match) can resolve the cleaned name normally. Whitespace is deliberately NOT a separator — the legitimate "write file" -> write_file repair path (covered by test_space_to_underscore) must keep working. Tests: 11 new regression cases in TestVolcEngineXmlPollution covering all three observed polluted names, CamelCase + pollution mix, single-quote variants, angle-bracket variants, clean-name passthrough, and the whitespace-preservation guard. All 18 pre-existing repair tests still pass (29 total in the file).	2026-06-07 22:22:01 -07:00
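The trimming rule in 132d6fe6d6 can be illustrated with a short sketch. The function name `strip_attribute_fragments` is hypothetical, and reading "idx > 0 only" as "scan from the second character onward" is an assumption; the separator set and the whitespace exemption follow the commit message.

```python
_SEPARATORS = frozenset({'"', "'", "<", ">"})


def strip_attribute_fragments(name: str) -> str:
    # Cut at the first quote or angle bracket found after index 0.
    # Whitespace is deliberately not a separator, so "write file"
    # still reaches the later space-to-underscore repair step.
    for idx, ch in enumerate(name):
        if idx > 0 and ch in _SEPARATORS:
            return name[:idx]
    return name


cleaned = strip_attribute_fragments('terminal" parameter="command" string="true')
```

Here `cleaned` is `terminal`, which the existing lowercase / snake_case / fuzzy-match steps can then resolve as usual.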
teknium1	f5bd09af4b	refactor(acp): share interrupt-sentinel prefix, simplify guard Replace the ACP-local prefix/suffix matcher + helper with a single startswith() check against INTERRUPT_WAITING_FOR_MODEL_PREFIX, now defined once in conversation_loop.py where the sentinel is produced. Keeps the source of truth in one place so the guard cannot drift if the status string changes. Net -17 LOC in server.py. Also add lsaether to release.py AUTHOR_MAP.	2026-06-07 22:20:43 -07:00
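The simplified guard from f5bd09af4b amounts to a single prefix check against a shared constant. In the sketch below the constant's value and the helper name `is_interrupt_sentinel` are hypothetical; only the constant's name and the use of `startswith()` come from the commit message.

```python
# Hypothetical value; per the commit, the real constant is defined once
# in conversation_loop.py, where the sentinel is produced.
INTERRUPT_WAITING_FOR_MODEL_PREFIX = "Operation interrupted: waiting for model"


def is_interrupt_sentinel(text) -> bool:
    # One prefix check against the shared constant, so the guard
    # cannot drift if the rest of the status string changes.
    return isinstance(text, str) and text.startswith(INTERRUPT_WAITING_FOR_MODEL_PREFIX)
```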
lsaether	9b631e4ae1	fix(acp): suppress cancel interrupt sentinel	2026-06-07 22:20:43 -07:00

11009 commits