11040 commits

Author SHA1 Message Date
teknium1
55b83c3d99 refactor(agent): extract run_conversation post-loop tail into finalize_turn (god-file Phase 1)
Lift the post-loop finalization tail out of run_conversation into
agent/turn_finalizer.py:finalize_turn. Behavior-neutral; run_conversation
4204 -> 3846 LOC, conversation_loop.py 4578 -> 4220.

The region (everything after the main tool-calling while loop): budget-exhaustion
summary, trajectory save, session persist, turn diagnostics, response transforms,
result-dict assembly, steer drain, and the memory/skill review trigger. Lifted
verbatim into a synchronous single-return free function; the 12 post-loop locals
it reads are passed as keyword args and the assembled result dict is returned to
run_conversation (which returns it to the caller). All agent.* side effects fire
exactly as before.

Imports: os + _summarize_user_message_for_log at module top; logger imported lazily
from agent.conversation_loop (preserves the 'agent.conversation_loop' logger name,
no import cycle).

Validation: 1609/1609 tests/run_agent/ pass; live PTY agent turn PASS.
2026-06-08 09:42:23 -07:00
teknium1
a706a349b5 refactor(gateway): extract authorization cluster into GatewayAuthorizationMixin (god-file Phase 3)
Lift the 4 inbound-message authorization methods out of GatewayRunner into
gateway/authz_mixin.py:GatewayAuthorizationMixin. Behavior-neutral; gateway/run.py
16200 -> 15812 LOC.

Methods moved (~389 LOC): _is_user_authorized, _get_unauthorized_dm_behavior,
_adapter_dm_policy, _adapter_enforces_own_access_policy. The two adapter-policy
helpers are private to _is_user_authorized, so the cluster is fully self-contained
(zero outside-cluster self.method calls after the lift). All self.* calls resolve
unchanged via the MRO (GatewayRunner(GatewayAuthorizationMixin, ...)).

Import split: 6 neutral deps (os, Optional, Platform, SessionSource, the two
whatsapp_identity helpers) at the mixin module top; the module-level logger is
imported lazily inside _is_user_authorized (from gateway.run import logger) so
the mixin never imports gateway.run at module scope -> no cycle. The lazy import
preserves the exact logger name (gateway.run) so log records are unchanged.
2026-06-08 09:42:02 -07:00
teknium1
094aa85c37 refactor(cli): extract agent-construction cluster into CLIAgentSetupMixin (god-file Phase 4)
Lift the 5 agent-construction/session-resume methods out of HermesCLI into
hermes_cli/cli_agent_setup_mixin.py:CLIAgentSetupMixin. Behavior-neutral; cli.py
14139 -> 13492 LOC.

Methods moved (~647 LOC): _ensure_runtime_credentials, _resolve_turn_agent_config,
_init_agent, _preload_resumed_session, _display_resumed_history. All self.* calls
resolve unchanged via the MRO (HermesCLI(CLIAgentSetupMixin, CLICommandsMixin)).

Import split (same recipe as #41942): 2 neutral deps (sys, _escape) imported at
the mixin module top; 12 cli.py-internal helpers/constants (AIAgent, ChatConsole,
CLI_CONFIG, _cprint, _DIM, _RST, _accent_hex, ...) imported lazily per-method
(from cli import ...) so the mixin never imports cli at module scope -> no cycle.

Repointed one source-inspection change-detector (test_callable_api_key.py) to read
the mixin file where the method now lives.
2026-06-08 09:41:34 -07:00
qWait
cef00ae602
fix(tui): handle Windows PTY stdin and detached WS frames (#41953)
Two narrow Windows desktop fixes:

1. tools/process_registry.py — PTY stdin writes are now platform-aware.
   pywinpty (Windows) expects str; ptyprocess (POSIX) expects bytes.
   Previously bytes was unconditionally passed, producing a TypeError on
   Windows ("'bytes' object cannot be converted to 'PyString'").

2. tui_gateway/server.py + ws.py — Detached WebSocket sessions now park on
   a _DropTransport sink instead of _stdio_transport. In the desktop the
   gateway runs in-process and stdout is captured by Electron into
   desktop.log, so falling back to stdio leaked raw JSON-RPC frames into
   the desktop log after WS disconnects. Orphan-reap semantics are
   preserved via _ws_session_is_orphaned.

Verified on a Windows desktop install:
- pywinpty 2.0.15 rejects bytes / accepts str — reproduced exactly
- Focused suite green (write_stdin × 2, write_json_drops_detached_ws_frames,
  ws_orphan_reap × 2)
- All 6 CI test shards green, e2e green, nix (ubuntu/macos) green

Salvage commit (21be7ca) fixes the new test referencing an undefined
_ThreadUnsafeStdout — uses the existing _ChunkyStdout helper.
2026-06-08 09:41:20 -07:00
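The platform split in fix 1 of cef00ae602 can be sketched roughly as follows. This is an illustration, not the code in tools/process_registry.py: `encode_pty_input` and its `is_windows` parameter are hypothetical names. The only facts taken from the commit are that pywinpty expects str and ptyprocess expects bytes.

```python
import sys


def encode_pty_input(data, is_windows=None):
    """Return `data` in the type the platform's PTY backend accepts.

    Hypothetical helper: per the commit, pywinpty (Windows) wants str and
    ptyprocess (POSIX) wants bytes.
    """
    if is_windows is None:
        is_windows = sys.platform == "win32"
    if is_windows:
        # Decode bytes so the Windows backend always receives str.
        if isinstance(data, bytes):
            return data.decode("utf-8", errors="replace")
        return data
    # Encode str so the POSIX backend always receives bytes.
    if isinstance(data, str):
        return data.encode("utf-8")
    return data
```

Passing `is_windows` explicitly keeps both branches testable on a single platform.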
Teknium
74744795af
docs(tui): correct HERMES_TUI_GATEWAY_URL — dashboard-internal, not remote-attach (#42162)
The TUI docs presented HERMES_TUI_GATEWAY_URL + /api/ws as a supported
'attach the TUI to a standalone running gateway' workflow. It isn't.

/api/ws exists only inside the dashboard's FastAPI server
(hermes_cli/web_server.py), which spawns its own embedded TUI child and
injects the var as an internal wiring detail. The OpenAI-compat API
server (api_server platform) deliberately does not serve /api/ws, so the
documented ws://host:port/api/ws workflow 404s — the cause of #32882 and
the two PRs (#32904, #32955) that tried to add the route to the wrong
surface.

Rewrites the section in en + zh-Hans to describe the var accurately and
point users at shared state.db / dashboard embedded chat for multi-surface
session sharing.
2026-06-08 09:37:03 -07:00
Teknium
399b8ee5f0
fix(anthropic): strip Responses-only kwargs before Messages SDK call (#31673) (#42155)
A Responses-API-shaped payload carrying instructions=/input=/store=/
parallel_tool_calls= can reach the native Anthropic messages.stream() /
messages.create() call under a rare api_mode-flip race (e.g. a concurrent
auxiliary vision call mutating a shared agent between the kwargs build and
the stream dispatch). The Anthropic SDK rejects these with a non-retryable
TypeError that kills the whole turn and propagates through the entire fallback chain.

Add sanitize_anthropic_kwargs() at both Anthropic dispatch sites: it drops
the Responses-only keys in place and logs a WARNING (with #31673 breadcrumb)
when one is present, so the underlying race stays visible in the wild
instead of being silently papered over.
2026-06-08 09:36:38 -07:00
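A minimal sketch of the strip-and-warn behavior described in 399b8ee5f0. The function body, the `sanitize_kwargs_sketch` name, the logger name, and the returned list are assumptions for illustration. Only the four key names and the in-place drop plus WARNING come from the commit message.

```python
import logging

logger = logging.getLogger("sketch.anthropic")  # hypothetical logger name

# Keys the commit message names as Responses-only.
RESPONSES_ONLY_KEYS = ("instructions", "input", "store", "parallel_tool_calls")


def sanitize_kwargs_sketch(kwargs):
    """Drop Responses-only keys from `kwargs` in place; return the dropped names."""
    dropped = [key for key in RESPONSES_ONLY_KEYS if key in kwargs]
    for key in dropped:
        del kwargs[key]
    if dropped:
        # Warn rather than drop silently, so the underlying race stays visible.
        logger.warning("Stripped Responses-only kwargs %s (see #31673)", dropped)
    return dropped
```

Mutating the dict in place means a call site only needs one extra line before dispatch. Returning the dropped names is a convenience added here for testing.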
Teknium
47d5177a7d
fix(plugins): thread-safe lazy-singleton helpers; fix honcho TOCTOU (#24759) (#42150)
* fix(plugins): add thread-safe lazy-singleton helpers, fix honcho TOCTOU (#24759)

get_honcho_client() and fal's _load_fal_client() used unlocked
check-then-init: racing threads both ran the expensive build and the
loser's client (open connection) leaked.

Rather than one-off locks, add plugins/plugin_utils.py with two
reusable primitives every plugin author can drop in:
- lazy_singleton: decorator for zero-arg accessors
- SingletonSlot: manual slot for config-keyed accessors (first wins)

Both use double-checked locking; factory runs at most once; failed
builds aren't cached. honcho is the reference consumer; fal's sibling
TOCTOU gets a matching double-checked lock. Plugin dev guide documents
the pattern so future plugins don't reintroduce the race.

Closes #24759

* test(honcho): update reset test for SingletonSlot internals

test_reset_clears_singleton poked the removed _honcho_client module
global directly. Assert through the slot's public peek() surface
instead, matching the #24759 refactor.
2026-06-08 09:35:22 -07:00
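The double-checked locking pattern that 47d5177a7d describes for zero-arg accessors might look like the sketch below. It is not the contents of plugins/plugin_utils.py: `lazy_singleton_sketch` and `_UNSET` are hypothetical names. It only illustrates the properties the commit states: the factory runs at most once, and failed builds are not cached.

```python
import functools
import threading

_UNSET = object()  # sentinel, so a factory may legitimately return None


def lazy_singleton_sketch(factory):
    """Decorator for a zero-arg accessor: build once, then share the result."""
    lock = threading.Lock()
    value = _UNSET

    @functools.wraps(factory)
    def accessor():
        nonlocal value
        if value is _UNSET:          # first check, no lock (fast path)
            with lock:
                if value is _UNSET:  # second check, under the lock
                    # If factory() raises, value stays unset, so a failed
                    # build is not cached and the next call retries.
                    value = factory()
        return value

    return accessor
```

Racing threads that miss the fast path queue on the lock, and all but the first find the value already set on the second check, so no duplicate client is built and leaked.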
yoniebans
74239b4942 i18n(desktop): translate backend update apply status messages
Two independent reviewers flagged that applyBackendUpdate's in-progress and
error messages were inline English while the rest of the update overlay is
i18n'd. Move them into updates.applyStatus (preparing/pulling/restarting/
notAvailable/failed/noReturn) across en, ja, zh, zh-hant + types.
2026-06-08 08:58:26 -07:00
yoniebans
b000e05b11 fix(desktop): don't claim the backend update succeeded when it never returns
The no-return error said 'Backend updated but did not come back online' — but
once the connection drops the client can't know the update's exit code, only
that it was started and the backend is unreachable. Reword to not overclaim:
the update may not have completed.
2026-06-08 08:58:26 -07:00
yoniebans
cd030f5f40 fix(desktop): close the backend update overlay on success; error on no-return
Three rough edges in the remote backend apply flow:
- On success the overlay dropped to IDLE, briefly re-rendering the pre-install
  'update available' view and then the generic 'you're all set' before settling.
  Close the overlay outright once the backend is confirmed back instead of
  bouncing through the idle view.
- If the backend never came back (a failed restart), the flow still reported
  success. waitForBackendReturn now returns whether the backend answered;
  finishBackendApply surfaces an error when it didn't.
- The up-to-date copy said 'you're running the latest version', conflating
  client and backend. Backend target now reads 'the backend is running the
  latest version' — the client's own version is a separate pill.
2026-06-08 08:58:26 -07:00
yoniebans
81647458c7 fix(desktop): recover the backend update overlay after the remote restarts
The backend Install path set stage:'restart' and stopped — in remote mode no
boot-progress events arrive to carry the overlay to done, so it sat on the
restarting spinner until a manual reload while the backend had already come
back. Poll the backend until it answers again, then clear the overlay and
refresh the backend status. Target-aware applying copy explains the remote
restart + auto-reconnect instead of the local-updater-window wording.

Also switch the apply poll sleeps from window.setTimeout to globalThis.setTimeout
so the flow is exercisable off the renderer.
2026-06-08 08:58:26 -07:00
yoniebans
9b2a64fa6a fix(desktop): reflect env-override remote in gateway connection state
HERMES_DESKTOP_REMOTE_URL forces a remote connection but never writes
connection.json, so the gateway panel read mode/url from persisted config
and mislabelled an env-remote session as local with no url.
2026-06-08 08:58:26 -07:00
yoniebans
47518bc913 fix(desktop): check backend updates when the connection becomes remote
The poller starts at mount, before the gateway connects, so its initial
checkBackendUpdates() ran while mode was still unset and no-op'd via the
remote-mode guard — leaving the backend button empty until the user clicked it.
Subscribe to $connection and re-check the backend when mode resolves to remote.
2026-06-08 08:58:26 -07:00
yoniebans
cfaa46fcae fix(desktop): pre-check backend updates in poller; client button first
Two follow-ups from testing the two-button bar:

- The background poller and focus handler only checked the client, so the
  backend behind-count and changelog stayed empty until the user opened the
  overlay — and the overlay's first render then hit the empty-commits fallback
  ('Improvements and fixes') instead of the real changelog. Check the backend
  alongside the client on poller start, interval, and focus so its state is
  ready before the button is clicked.
- Order the status bar client-first, backend-second.
2026-06-08 08:58:26 -07:00
yoniebans
56be1a63a3 fix(desktop): split client and backend into two distinct update buttons
The status bar merged both versions into one pill with a single click target,
so there was no way to tell which artifact an update acted on — and the apply
path was overloaded by connection mode. Separate them:

- store: independent client (checkUpdates/applyUpdates) and backend
  (checkBackendUpdates/applyBackendUpdate) flows with their own status/apply
  atoms; openUpdateOverlayFor(target) drives the overlay.
- status bar: two buttons — client vX (always) and backend vY (+N) (remote
  only), each with its own behind-count, opening the overlay for its target.
- overlay: reads the active target's atoms; install/check route per target.

Removes the version-bar merge helper (no longer merging the two versions).
2026-06-08 08:58:26 -07:00
yoniebans
9c264555b0 fix(desktop): name the update target in the overlay; honest no-changelog copy
The updates overlay showed generic 'New update available / improvements and
fixes' with no indication of whether it was updating the client or the backend.
In remote mode it now reads 'Backend update available' and names the connected
backend, and when there's no commit changelog (e.g. pip/non-git backend) it
degrades to honest 'release notes aren't available for this install type' copy
instead of filler.

Copy selection extracted to a pure resolveUpdateCopy() helper (unit-tested);
threads target ('client'|'backend') from connection.mode through the overlay.
2026-06-08 08:58:26 -07:00
yoniebans
87ac7cac13 fix(dashboard): log update changelog against origin/main, not @{upstream}
The behind-count (banner._check_via_local_git) measures HEAD..origin/main, but
_recent_upstream_commits logged HEAD..@{upstream}. On a feature-branch checkout
@{upstream} is the branch's own tip (0 commits), so the changelog came back
empty while behind>0 — the overlay then showed generic filler instead of what
changed. Pin the commit range to origin/main so count and changelog agree.

Verified against a checkout 11 behind origin/main: now returns 11 commits.
2026-06-08 08:58:26 -07:00
yoniebans
64da518db4 feat(desktop): remote update overlay sourced from backend
In remote mode, checkUpdates()/applyUpdates() branch on connection.mode and
drive the existing updates overlay from the connected backend instead of the
local Electron git bridge:

- checkUpdates -> GET /api/hermes/update/check, mapped onto DesktopUpdateStatus
  (behind, commits, supported=can_apply, message). The overlay renders the
  commit list as 'what's changed' and shows guidance (not Install) when the
  backend install can't self-apply (docker/nix).
- applyUpdates -> POST /api/hermes/update (the proven command-center path),
  polling the action to completion and handling the expected mid-update
  connection drop as the restart phase.

Local mode is unchanged. Adds checkHermesUpdate() to hermes.ts and a
BackendUpdateCheckResponse type.
2026-06-08 08:58:26 -07:00
yoniebans
ed1e2533b7 feat(desktop): show client and backend versions in status bar when remote
In remote thin-client mode the Electron client and the backend it connects to
are separate installs that drift independently. The status bar previously showed
only the client version, hiding skew (e.g. client 0.15.1 talking to backend
0.16.0 looked fine).

Add a pure resolveVersionBar() helper (unit-tested) that, gated on
connection.mode === 'remote', renders both 'client vX · backend vY' from the
desktop appVersion and StatusResponse.version, and flags skew. Local mode is
byte-identical to before. Wire it into the status-bar version item.
2026-06-08 08:58:26 -07:00
yoniebans
2284147044 docs: document commits field on /api/hermes/update/check 2026-06-08 08:58:26 -07:00
yoniebans
9e360681f8 feat(dashboard): return recent commits from /api/hermes/update/check
Add a best-effort `commits` list (sha/summary/author/at) to the update-check
response for git/pip installs that are behind upstream, so the desktop's
remote update overlay can show what's changed before applying.

Additive and non-breaking: existing consumers (legacy dashboard, tests using
subset assertions) ignore the new field. Leaves the shared check_for_updates()
int contract untouched — commits come from a separate best-effort git call.
2026-06-08 08:58:26 -07:00
Teknium
fd1e7c2bc3
fix(tui): install the process.on('exit') terminal-mode backstop (#42165)
#19194's fix added process.exit(0) to die()/dieWithCode() with a comment
relying on a process.on('exit') handler in entry.tsx that resets terminal
modes — but that handler was never installed. So /quit, Ctrl+C, Ctrl+D and
every process.exit() path left DEC mouse tracking (?1000/1002/1003/1006)
armed in the parent shell. The terminal then kept emitting mouse reports
into stdin — read as keystrokes by the shell or a freshly relaunched TUI —
surfacing as ...;...M garbage in the input box.

Install the missing handler. 'exit' fires once on real termination and runs
synchronous code only; resetTerminalModes() writes via writeSync, so the
disable sequence lands before the process is gone.

Fixes #28419
2026-06-08 08:21:19 -07:00
Siddharth Balyan
7230fcb7f2
revert(nix): drop the cp patchPhase workaround from #41867 (#42151)
#41867 replaced mkNpmPassthru's patchPhase with
`cp $npmDeps/package-lock.json package-lock.json`, on the theory that
prefetch-npm-deps strips advisory fields (engines/os/cpu) from the cache
lockfile. That diagnosis was wrong.

prefetch-npm-deps copies the lockfile into the cache *verbatim*
(prefetch-npm-deps/src/main.rs reads it and writes it unchanged). Building the
cache fresh from the current root lockfile yields exactly the pinned
npmDepsHash, and that cache's package-lock.json is byte-identical to the source
(740 "engines" blocks on each side). With the hash correct, npmConfigHook's
consistency check passes on its own — verified by building .#tui and .#default
green with this (original) patchPhase.

So the cp was unnecessary, and worse: it bypasses the consistency check
wholesale, silently masking a genuinely stale npmDepsHash (a lockfile that
changed without its hash being refreshed) instead of failing loudly. The
original patchPhase keeps the check meaningful while still handling the one real
cosmetic difference it was written for (trailing newlines); stale-hash drift is
caught by the npmDepsHash itself plus the auto-fix workflow.

Keeps the fix-lockfiles real-build verification and the nix-lockfile-fix.yml
file-path fix from #41867 — only the patchPhase cp is reverted.
2026-06-08 20:29:41 +05:30
Siddharth Balyan
4219a91df5
fix(nix): make config.yaml group-writable under addToSystemPackages (#41940)
addToSystemPackages exports HERMES_HOME system-wide and puts the hermes CLI on
interactive users' PATH, so those users (in the hermes group) share the
gateway's state — that's the option's whole purpose. But the activation script
wrote config.yaml as 0640 (group read-only), so an interactive user saving a
setting via the CLI/TUI hit:

  error: [Errno 13] Permission denied: '/var/lib/hermes/.hermes/config.yaml'

Make the mode conditional: 0660 when addToSystemPackages is set (group hermes
can write), else the previous 0640. .env stays 0640 either way — it holds
secrets, not user-facing settings. The config merge already preserves
user-added keys across rebuilds, so this simply lets interactive hermes-group
users actually make those edits.

Verified by evaluating the module's activation script for both option values:
addToSystemPackages=true -> chmod 0660, false -> chmod 0640.
2026-06-08 20:10:47 +05:30
Teknium
a3fca26c56
fix(tui): close slash_worker inside _finalize_session (defense-in-depth, #38095) (#42149)
Fold the slash-worker subprocess close into _finalize_session itself —
the single _finalized-guarded session-end chokepoint — instead of
relying on each caller (_teardown_session, _shutdown_sessions) to close
it separately. A future code path that finalizes a session directly can
no longer reintroduce the #38095 worker leak.

Idempotent: _SlashWorker.close() is poll()-guarded and _finalize_session
short-circuits on _finalized, so the existing teardown paths are
unaffected. Drops the now-redundant separate close() in
_shutdown_sessions.

Note: the active leak this issue reported was already fixed on main
(WS-orphan reaper #38591, _restart_slash_worker close, atexit shutdown).
This addresses the residual defense-in-depth gap the reporter correctly
identified in their follow-up comment.
2026-06-08 07:26:05 -07:00
Teknium
5e06c9ffef
fix(agent): clear _session_messages in AIAgent.close() (#42123)
close() is the hard teardown for true session boundaries (/new, /reset,
session expiry).  It already closes the OpenAI client and child agents but
left the conversation-history list intact.  Mirror the soft-eviction path
(_release_evicted_agent_soft clears _session_messages) so a held reference
to a closed agent — e.g. a draining background task — doesn't pin tens of
MB of tool outputs until the agent object itself is collected.
2026-06-08 07:03:39 -07:00
teknium1
cb13723f53 fix(pty-bridge): mark os.killpg/getpgid windows-footgun-ok (POSIX-only module) 2026-06-08 07:03:12 -07:00
teknium1
8cb1908e18 chore: map paulb26 in AUTHOR_MAP for #24135 salvage 2026-06-08 07:03:12 -07:00
firefly
8b6a8f667d feat(slash-worker): self-terminate on parent death via create_time watchdog
Daemon thread polls _is_orphaned (original ppid check + psutil create_time PID-reuse
guard, no PR_SET_PDEATHSIG). On orphan, drains an in-flight command up to a grace
window then os._exit(0). Started before the HermesCLI build to cover the spawn window.

Task: swl-qrf.8
2026-06-08 07:03:12 -07:00
paulb26
b31c6c33b2 fix(pty-bridge): terminate PTY process groups on teardown 2026-06-08 07:03:12 -07:00
Teknium
e9c1e757fe
fix(gateway): release evicted agent clients to stop RSS leak (#29298) (#41974)
_evict_cached_agent (the chokepoint for /new, /model, /undo, session
resets — 17 call sites) only popped the cache entry, dropping the
AIAgent reference without releasing its httpx client pool. AIAgent
holds reference cycles (callbacks, tool state) so CPython refcounting
does not free the client promptly; under steady gateway traffic the
held sockets + buffers accumulate and RSS climbs (the leak class behind

Now the chokepoint pops AND schedules a soft release_clients() on a
daemon thread (mirrors the cap-enforcer / idle-sweeper). Soft release
frees the client pool + per-turn child subagents but preserves the
session's terminal sandbox / browser / bg processes for resumption.
Mid-turn agents are skipped so a running request is never torn down.
Also fixes the no-lock branch which previously never popped at all.
2026-06-08 06:44:51 -07:00
Michael Steuer
3d029a53ec fix(gateway): close residual memory-leak sites under heavy scheduled workload
Long-lived gateways under heavy cron/build workloads grow steadily (~18 MB/hr
post-phantom-dispatch-fix) and eventually need a restart-or-OOM. Four retention
sites, all confirmed live on current main:

1. _evict_cached_agent() (/model, /reasoning, codex-runtime, /undo, etc.) popped
   the cache entry without releasing the agent's OpenAI client, httpx transport,
   SSL context, or conversation history. Only /new cleaned up first. Now releases
   clients on a daemon thread, matching _enforce_agent_cache_cap.

2. _release_evicted_agent_soft() now clears _session_messages after
   release_clients() — tool outputs (file reads, terminal output, search results)
   can be tens of MB per 100+-tool-call session; the list is rebuilt from
   persisted session JSON on resume, so dropping it on soft eviction is safe.

3. The session-expiry watcher (permanent finalization) now drops the session's
   per-session control dicts (_session_model_overrides, _session_reasoning_overrides,
   _pending_approvals, _update_prompt_pending, _pending_model_notes). These leaked
   one entry per session per gateway lifetime. NOTE: this is the session-finalize
   path, NOT idle agent-cache eviction — an idle-evicted session is still alive and
   rebuilds its agent from these overrides, so pruning them there would silently
   reset a user's /model choice.

4. _tool_defs_cache is now bounded (_TOOL_DEFS_CACHE_MAX=8) with oldest-first
   eviction instead of growing unboundedly across the distinct toolset/config
   fingerprints a gateway sees over its lifetime.

Salvaged from #25318 by Michael Steuer (@mssteuer); fix 3 redirected from the
idle-sweep to the session-finalize lifecycle, magic number 8 lifted to a named
constant, test ported.

Fixes #19251
Co-authored-by: Michael Steuer <michael@make.software>
2026-06-08 06:32:42 -07:00
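Item 4 of 3d029a53ec (a bounded cache with oldest-first eviction) could be sketched as below. The class and method names are hypothetical, and the real _tool_defs_cache may be structured differently. Only the cap of 8 and the oldest-first policy come from the commit.

```python
from collections import OrderedDict

TOOL_DEFS_CACHE_MAX = 8  # cap value named in the commit


class BoundedCacheSketch:
    """Insertion-ordered cache that evicts the oldest entry when full."""

    def __init__(self, max_entries=TOOL_DEFS_CACHE_MAX):
        self._max = max_entries
        self._data = OrderedDict()

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        if key in self._data:
            self._data[key] = value  # update in place, keep original age
            return
        self._data[key] = value
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # drop the oldest insertion

    def __len__(self):
        return len(self._data)
```

This is FIFO rather than LRU: reads do not refresh an entry's age, which matches the "oldest-first" wording.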
teknium1
400e6e43ca test(gateway): de-flake concurrent-compression lock test with a barrier
test_concurrent_compressions_same_session_serialize relied on a
time.sleep(0.25) inside the stubbed compressor to make the two threads
overlap inside the per-session lock window. Under CI CPU starvation that
sleep is insufficient: one thread can acquire -> compress -> rotate ->
RELEASE the lock before the other reaches try_acquire, so both acquire on
the shared session_id and both compress (the recurring 'Expected exactly
one agent to compress, got 2' failure on shard test (1)).

Replace the timing dependency with a threading.Barrier(2) wrapped around
the shared db's try_acquire_compression_lock: both threads rendezvous
immediately before the real (atomic) acquire, guaranteeing genuine
simultaneous contention regardless of scheduling. The real lock logic is
unchanged and still picks exactly one winner — this only fixes the test's
overlap guarantee. Restored after join so the post-join lock-leak
assertion hits the unwrapped method.

Verified: 20/20 plain + 15/15 under all-core CPU stress (load avg ~4.6),
where the old version flaked.
2026-06-08 06:32:23 -07:00
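The barrier technique from 400e6e43ca, reduced to a standalone sketch. It is not the actual test: `run_contended_acquire` is a hypothetical name, and a plain threading.Lock stands in for the db's try_acquire_compression_lock. Unlike the real test, the winner here holds the lock until after join, which keeps the outcome deterministic.

```python
import threading


def run_contended_acquire():
    """Two threads rendezvous, then race for one non-blocking lock acquire."""
    lock = threading.Lock()
    barrier = threading.Barrier(2)
    outcomes = []  # list.append is thread-safe in CPython

    def worker():
        barrier.wait()  # neither thread proceeds until both have arrived
        outcomes.append(lock.acquire(blocking=False))
        # The winner deliberately keeps the lock held here; it is released
        # after join so the loser can never observe a free lock.

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    lock.release()  # release on behalf of the single winner
    return outcomes
```

The barrier replaces a sleep-based overlap with an explicit rendezvous, so the result no longer depends on how the scheduler interleaves the two threads.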
kshitij
b99c6c4277
Merge #42076: nested category plugin discovery + alias-normalized enable/disable (#41066)

Lands the complete nested category plugin fix:
- Discovery in `hermes plugins list` (from @islam666's #41076, carried in this PR)
- Alias-normalized enable/disable mutation path so nested plugins can be toggled
- Fixes the #41076 base breakages (web_server 6-tuple unpack + stale test fixtures)

Co-authored work: discovery by @islam666 (#41076).
Closes #41066.
2026-06-08 05:47:27 -07:00
kshitijk4poor
2b89afec79 fix(plugins): alias-normalize enable/disable for nested category plugins (follow-up to #41076)
#41076 makes `hermes plugins list` discover nested category plugins (e.g.
observability/nemo_relay). This adds the missing enable/disable mutation path
so those plugins can actually be toggled, and fixes two incomplete-update
breakages on the #41076 base.

Before: `hermes plugins enable nemo_relay` -> "Plugin 'nemo_relay' is not
installed or bundled." (exit 1), because cmd_enable/cmd_disable went through
_plugin_exists(), which only checked top-level plugins/<name>/.

Changes:
- Add _resolve_plugin_key(): resolve a bare manifest/leaf name OR a full
  path-derived key (observability/nemo_relay) to the canonical key the runtime
  loader gates on, reusing #41076's _discover_all_plugins(). A bare leaf name
  ambiguous across two categories resolves to None rather than silently picking
  one.
- cmd_enable/cmd_disable resolve first, persist the canonical key, and drop any
  stale legacy bare-name alias so the enabled/disabled lists can't drift into a
  contradictory state. _plugin_exists delegates to the same resolver.
- Fix #41076 base breakages: _discover_all_plugins now returns 6-tuples, but
  web_server._merged_plugins_hub() still unpacked 5 (ValueError on the
  dashboard plugins-hub endpoint) and several test_plugins_cmd_list.py fixtures
  were still 5-tuples. Both updated; the hub status check is now key-aware.

Verified e2e on the real CLI + runtime loader (isolated HERMES_HOME):
`hermes plugins enable nemo_relay` writes observability/nemo_relay to
config.yaml and the loader then loads it (enabled=True, error=None); a stale
bare-name alias is cleared on disable; the dashboard _merged_plugins_hub() runs
without crashing. Adds resolution + enable/disable tests; full
tests/hermes_cli/test_plugins_cmd* + web_server plugin tests green.

Follow-up to #41076 (#41066). Branched from that PR's head.
2026-06-08 17:57:37 +05:30
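The resolution rule described for _resolve_plugin_key in 2b89afec79 can be illustrated with a small sketch. The function name, its signature, and every plugin key other than observability/nemo_relay are hypothetical. The real resolver works from _discover_all_plugins() tuples, which are not modeled here.

```python
def resolve_plugin_key_sketch(name, discovered_keys):
    """Map a bare leaf name or a full key to one canonical key, else None."""
    if name in discovered_keys:
        return name  # already a full path-derived key (or a top-level plugin)
    # Otherwise treat `name` as a leaf and collect every key ending in it.
    matches = [key for key in discovered_keys if key.rsplit("/", 1)[-1] == name]
    # Exactly one match is unambiguous; zero or several resolve to None.
    return matches[0] if len(matches) == 1 else None
```

Returning None for an ambiguous leaf forces the caller to ask for the full key instead of silently toggling the wrong plugin.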
kshitij
c3055d6185
Merge pull request #41984 from kshitijk4poor/salvage/6600-stale-streaming-worker
fix(gateway): transcribe voice messages during active agent runs (salvage #6600, voice half)
2026-06-08 02:51:25 -07:00
kshitijk4poor
f96eb857a5 chore: add kristianvast to AUTHOR_MAP 2026-06-08 15:16:20 +05:30
Kristian Vastveit
d55304c39f fix(gateway): transcribe voice messages during active agent runs
Salvaged from #6600 (@kristianvast) — re-scoped to the voice half only and
rebased onto current main. The cascading-interrupt hang half of the original
PR landed independently in dd0d1222a, so this carries ONLY Problem 1.

When a voice/audio message arrives while the agent is busy on the same
session, it hit the interrupt path with empty text because STT only ran after
the running-agent guard — the voice was effectively lost. Now we transcribe
audio BEFORE signaling the agent (and on the fresh-message path), echo the raw
transcript back to the user (🎙️), and _enrich_message_with_transcription
returns (text, transcripts) so callers can echo. A new
_dequeue_pending_with_transcription drives the post-agent drain the same way.

Reapplied onto _prepare_inbound_message_text (inbound enrichment was extracted
from the inline dispatch block since the original PR).

Co-authored-by: Kristian Vastveit <kristian@agrointel.no>
2026-06-08 15:16:20 +05:30
teknium1
00c46b8ff9 test(tui): cover heapdump opt-in gate + retention; add AUTHOR_MAP
On-disk vitest coverage for the auto-heapdump disk-safety guard: opt-in
gating (suppressed diagnostics-only path), truthy-spelling acceptance,
manual-trigger passthrough, and the retention prune. Test approach
adapted from #21780 (briandevans) and #21822 (LeonSGP43), reconciled to
the merged gate semantics. Maps alarcritty into AUTHOR_MAP for CI.
2026-06-08 02:20:49 -07:00
alarcritty
8ae0d054f4 fix(tui): guard automatic heap dumps against disk fill
Automatic heap dumps from the TUI memory monitor could write multi-GiB
  .heapsnapshot files on every threshold cross, growing ~/.hermes/heapdumps
  to tens of GiB. Add four layered safeguards:

  - Gate auto-high/auto-critical snapshots behind HERMES_AUTO_HEAPDUMP=1;
    manual dumps remain unchanged.
  - Always write the lightweight diagnostics JSON sidecar so users still
    get an actionable artifact when the snapshot is suppressed.
  - Cap total bytes in the dump dir (HERMES_HEAPDUMP_MAX_BYTES, default
    2 GiB), evicting oldest first, retaining the newest.
  - Add a cooldown between auto dumps (HERMES_AUTO_HEAPDUMP_COOLDOWN_MS,
    default 10 min) so an oscillating heap can't re-trigger.

  Closes #21767
2026-06-08 02:20:49 -07:00
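The retention rule in 8ae0d054f4 (cap total bytes, evict oldest first, retain the newest) is sketched below in Python for consistency with the other examples. The TUI code itself is not Python, and `prune_plan_sketch` with its tuple layout is a hypothetical stand-in that plans deletions without touching the disk.

```python
def prune_plan_sketch(files, max_bytes):
    """Return names to delete so the total size fits within `max_bytes`.

    `files` is a list of (name, size_bytes, mtime) tuples. The newest file is
    always kept, even if it alone exceeds the cap.
    """
    by_age = sorted(files, key=lambda f: f[2])  # oldest first
    total = sum(size for _, size, _ in by_age)
    to_delete = []
    for name, size, _ in by_age[:-1]:  # never consider the newest file
        if total <= max_bytes:
            break
        to_delete.append(name)
        total -= size
    return to_delete
```

Separating the plan from the deletion makes the policy easy to unit-test against synthetic sizes instead of real multi-GiB snapshots.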
teknium1
dd0d1222a2 fix(agent): don't retry interrupt-induced transport errors (cascading-interrupt hang)
When agent.interrupt() fires during an active LLM call, the main poll loop
force-closes the worker-local httpx client to stop token generation. That
raises a transport error (RemoteProtocolError) on the worker thread — the
EXPECTED consequence of our own close, not a network bug.

The streaming retry loop misclassified it as a transient connection error
and retried; each doomed retry stalled for the full stream-stale timeout
(up to 300s). Because the gateway caches AIAgent instances per session, the
stale worker outlived the interrupted turn and raced the next turn's request
on shared client state — the root of the multi-minute cascading-interrupt
hang reported in the wild.

Fix: a request-local _request_cancelled token set by the poll loop right
before the force-close, in both interruptible_api_call (non-streaming) and
interruptible_streaming_api_call. The worker's exception handler checks the
token and exits cleanly — no retry, no fallback, no 'reconnecting' status —
instead of treating the forced error as transient. The token is request-
local (not agent._interrupt_requested, which is cleared at turn boundaries)
so a stale worker outliving its turn still recognizes its own forced close.

Original diagnosis and fix by @kristianvast (PR #6600), against the then-
inline methods in run_agent.py. Those were since extracted into
agent/chat_completion_helpers.py, so the fix is reapplied there.

Co-authored-by: Kristian Vastveit <kristianvast@users.noreply.github.com>
2026-06-08 02:19:13 -07:00
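A rough sketch of the request-local cancellation token from dd0d1222a2. All names here are hypothetical: `TransportError` stands in for the httpx error, a threading.Event plays the role of _request_cancelled, and the retry loop is far simpler than the one in agent/chat_completion_helpers.py.

```python
import threading


class TransportError(Exception):
    """Stand-in for the transport error raised by a forced client close."""


def worker_call_sketch(do_request, cancelled, max_retries=3):
    """Run `do_request`, retrying transport errors unless we cancelled it."""
    for attempt in range(1, max_retries + 1):
        try:
            return ("ok", do_request())
        except TransportError:
            if cancelled.is_set():
                # Our own force-close caused this error: exit, do not retry.
                return ("cancelled", attempt)
            # Otherwise treat it as transient and try again.
    return ("failed", max_retries)
```

Because the Event is created per request rather than stored on the agent, a worker that outlives its turn still sees its own token set, even after any agent-level interrupt flag has been cleared.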
Teknium
aa6f2775fa
fix(memory): run end-of-turn sync off the turn thread (#41945)
A misconfigured/slow external memory provider could hold the agent in
the 'running' state for minutes after the final response was delivered.
MemoryManager.sync_all / queue_prefetch_all looped provider.sync_turn /
queue_prefetch INLINE on the turn-completion path; a provider making a
blocking network/daemon call (a broken Hindsight daemon was observed
blocking ~298s before failing) blocked run_conversation from returning.
Because every interface (CLI, TUI, gateway) marks the agent 'running'
until run_conversation returns, the agent stayed busy for the full block
and any follow-up message triggered an aggressive interrupt that dropped
the message.

Dispatch provider sync/prefetch to a lazily-created single-worker
background executor. sync_all / queue_prefetch_all return immediately;
work completes (or fails, logged) in the background. A single worker
serializes writes so turn N lands before turn N+1. flush_pending()
provides a barrier for session boundaries and deterministic tests.
shutdown_all() drains the executor with a bounded timeout so a wedged
provider can never hang teardown.

Builtin-only / no-provider sessions spawn no executor (zero new threads
in the common case).
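
A rough sketch of the dispatch pattern described above, using only the standard library. `BackgroundSync` and its method names are hypothetical and do not mirror MemoryManager's real interface; the sketch only shows the lazy single-worker executor, the barrier, and the bounded drain.

```python
import logging
import threading
from concurrent.futures import ThreadPoolExecutor, wait

logger = logging.getLogger(__name__)


class BackgroundSync:
    """Hypothetical helper: runs provider calls on one background worker."""

    def __init__(self):
        self._lock = threading.Lock()
        self._executor = None  # created on first submit only
        self._pending = []

    def submit(self, fn, *args):
        """Queue fn(*args) and return immediately."""
        with self._lock:
            if self._executor is None:
                # One worker means queued calls run in submission order.
                self._executor = ThreadPoolExecutor(max_workers=1)
            self._pending.append(self._executor.submit(self._run, fn, *args))

    @staticmethod
    def _run(fn, *args):
        try:
            fn(*args)
        except Exception:
            # A failing provider is logged and does not stop later work.
            logger.exception("background sync failed")

    def flush_pending(self, timeout=None):
        """Barrier: wait for queued work; True if everything finished."""
        with self._lock:
            pending, self._pending = self._pending, []
        _, not_done = wait(pending, timeout=timeout)
        with self._lock:
            self._pending = list(not_done) + self._pending
        return not not_done

    def shutdown(self, timeout=5.0):
        """Drain with a bounded wait, then drop the executor."""
        drained = self.flush_pending(timeout)
        with self._lock:
            executor, self._executor = self._executor, None
            self._pending = []
        if executor is not None:
            # Requires Python 3.9+ for cancel_futures.
            executor.shutdown(wait=False, cancel_futures=True)
        return drained
```

This sketch bounds only the `shutdown()` call itself. A worker that is truly wedged can still delay interpreter exit, because the standard executor joins its threads at that point.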
2026-06-08 02:18:59 -07:00
xxxigm
a5c12f5f59 fix(install): move broken checkout aside instead of deleting it
Review feedback (#40998): `rm -rf` / `Remove-Item -Recurse -Force` on the
install dir is destructive -- a user might still want whatever is there.
Rename the broken checkout to a timestamped `<dir>.broken-<ts>` backup and
re-clone fresh, so nothing is ever deleted. Transient cleanup of a clone
attempt that fails within the same run is left as-is.
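
The move-aside step, translated to Python for illustration (the installers themselves are shell and PowerShell). `move_aside` is a hypothetical name, and the timestamp format is an assumption; the commit only specifies the `<dir>.broken-<ts>` shape.

```python
import os
import time
from typing import Optional


def move_aside(install_dir: str, now: Optional[float] = None) -> str:
    """Rename install_dir to '<dir>.broken-<ts>' and return the new path.

    Nothing is deleted: the old contents stay available under the backup
    name, and the original path is free for a fresh clone.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    backup = f"{install_dir.rstrip(os.sep)}.broken-{stamp}"
    os.rename(install_dir, backup)
    return backup
```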
2026-06-08 02:18:21 -07:00
xxxigm
5d7abf9114 test(install): cover commit-less checkout handling (#40998)
Behavioral coverage for install.sh's clone_repo() guard (removes a
commit-less checkout, keeps a real one, ignores a non-repo dir) plus a
contract check that install.ps1's repo-validity gate requires a resolvable
HEAD.
2026-06-08 02:18:21 -07:00
xxxigm
fc0900d120 fix(install): re-clone interrupted (commit-less) checkout instead of failing
An interrupted previous clone leaves the install dir's .git present but with
no initial commit. rev-parse --is-inside-work-tree and git status both still
succeed there, so the installer entered the update path and ran `git stash`,
which aborted with "You do not have the initial commit yet" and failed the
desktop install at the "Cloning Hermes repository" stage.

- install.ps1: add a `git rev-parse --verify HEAD` probe to the repo-validity
  check so a commit-less checkout is treated as broken and re-cloned fresh.
- install.sh: mirror it at the top of clone_repo() — drop a partial checkout
  with no resolvable HEAD so the fresh-clone path handles it (POSIX parity).
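
The validity probe from the list above, sketched in Python for illustration (the real checks live in install.ps1 and install.sh). `has_resolvable_head` is a hypothetical name; the git command is the one named in the commit.

```python
import subprocess


def has_resolvable_head(repo_dir: str) -> bool:
    """Return True only if `git rev-parse --verify HEAD` succeeds in repo_dir.

    A checkout left behind by an interrupted clone has a .git directory but
    no commit, so this probe fails there even though
    `git rev-parse --is-inside-work-tree` would succeed.
    """
    try:
        result = subprocess.run(
            ["git", "-C", repo_dir, "rev-parse", "--verify", "HEAD"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        # git itself is not installed; treat the directory as unusable.
        return False
    return result.returncode == 0
```

A caller would re-clone whenever this returns False and take the update path only when it returns True.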

Fixes #40998
2026-06-08 02:18:21 -07:00
teknium1
0904bc7ea2 refactor(cli): extract 32 slash-command handlers into CLICommandsMixin (god-file Phase 4)
Lift the `_handle_*_command` cluster (2,077 LOC) out of HermesCLI into
hermes_cli/cli_commands_mixin.py; HermesCLI now inherits CLICommandsMixin so
every self.<handler> call resolves unchanged via the MRO. Behavior-neutral.

Import discipline mirrors gateway/slash_commands.py (PR #41886): neutral deps
imported at the mixin module top level; cli.py-internal helpers/constants
(_cprint, _ACCENT, save_config_value, ...) imported lazily inside each handler
via 'from cli import ...' so the mixin never imports cli at module scope.
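
The two mechanisms in play, method resolution through the MRO and imports deferred to function scope, can be shown in isolation. The classes below are hypothetical, and `shlex` merely stands in for the host module that the real handlers import lazily.

```python
class CommandsMixin:
    """Holds handlers; expects the host class to provide self.emit()."""

    def _handle_echo_command(self, text: str) -> str:
        # Function-scope import: resolved when the handler runs, not when
        # the mixin module is imported, so no import cycle at module load.
        from shlex import quote

        return self.emit(quote(text))


class HostCLI(CommandsMixin):
    """Inherits the handler; self.emit resolves on the host via the MRO."""

    def emit(self, message: str) -> str:
        return f"> {message}"
```

Calling `HostCLI()._handle_echo_command("hi there")` finds the handler on the mixin and `emit` on the host class, which is why call sites of the form `self.<handler>` need no changes after such a lift.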

cli.py 16215 -> 14139 LOC. One test mock repointed (cli.is_browser_debug_ready
-> hermes_cli.cli_commands_mixin.is_browser_debug_ready).
2026-06-08 02:13:07 -07:00
kshitij
4eb8972390
Merge pull request #33817 from sweetcornna/fix/28503-busy-input-fifo
fix(gateway): use FIFO queue for busy_input_mode pending messages
2026-06-08 02:02:02 -07:00
Gille
039fbb41fc
fix(desktop): show newly configured model providers (#41545)
2026-06-08 01:39:37 -07:00
floory
15c99b437f
fix(cli): set PYTHON env for node-gyp native builds on NixOS (#40690)
* fix(cli): set PYTHON env for node-gyp native builds on NixOS

node-gyp (triggered by node-pty during npm ci) looks for python3 on
PATH, which fails on NixOS because python3 lives in the nix store and
is not on the system PATH.

Add _nixos_build_env() — a two-tier helper that detects NixOS and:
1. Fast path: hermes venv python3 (~0s)
2. Fallback: nix-shell which python3 (~2-5s)

Wire it into _run_npm_install_deterministic() via a new env= parameter,
then pass it through cmd_gui() and _update_node_dependencies().

Non-NixOS systems: _nixos_build_env() returns None, behavior unchanged.

* fix(cli): merge _nixos_build_env() with os.environ, fix NixOS detection, add explicit return None

- Critical fix: both Tier 1 (venv) and Tier 2 (nix-shell) now return
  {**os.environ, "PYTHON": ...} instead of {"PYTHON": ...} — subprocess.run
  with env= replaces the entire environment, so the old code wiped PATH
  and broke npm/node on NixOS entirely.
- Uses re.search(r"^ID=nixos$", ...) for anchored NixOS detection instead
  of unanchored substring match (could match ID_LIKE=...nixos).
- Removes redundant Path.exists() guard before read_text(); just catches
  OSError (one filesystem read instead of two).
- Adds explicit return None at end of function for type-hint consistency.
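
A small sketch of the two fixes above, under stated assumptions: `is_nixos` and `build_env` are hypothetical names rather than the body of `_nixos_build_env()`, and the `re.MULTILINE` flag is assumed, since the commit elides the remaining arguments to `re.search`.

```python
import os
import re
from typing import Mapping, Optional


def is_nixos(os_release_text: str) -> bool:
    """Anchored match: only a whole line 'ID=nixos' counts."""
    return re.search(r"^ID=nixos$", os_release_text, re.MULTILINE) is not None


def build_env(python_path: Optional[str],
              base: Optional[Mapping[str, str]] = None) -> Optional[dict]:
    """Return a full environment with PYTHON set, or None if no override.

    subprocess.run(env=...) replaces the whole environment, so the result
    must start from the existing variables (PATH included) and add to them.
    """
    if python_path is None:
        return None
    base = os.environ if base is None else base
    return {**base, "PYTHON": python_path}
```

Returning `None` in the no-override case lets the caller pass the value straight to `subprocess.run(env=...)`, where `None` means "inherit the current environment".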
2026-06-08 13:57:37 +05:30
teknium1
7a5827c8b0 test: repoint percentage-clamp source guard to gateway/slash_commands.py
test_gateway_run_clamped read gateway/run.py asserting the /usage stats handler
clamps pct with min(100, ...). That handler moved to gateway/slash_commands.py
in this PR's extraction; repoint the guard so it still fires on clamp removal.

tests/run_agent/ + tests/gateway/ 8024 passed / 0 failed.
2026-06-08 01:25:35 -07:00