hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-21 10:22:18 +00:00

Author	SHA1	Message	Date
kshitijk4poor	d4e7dd609d	refactor(windows): tidy managed-node resolver helpers Behavior-preserving cleanups on the managed-node resolver: - Hoist _candidate_node_command_names() out of the inner dir loop in find_hermes_node_executable (computed once, not per directory). - Drop redundant os.environ.copy() at the two with_hermes_node_path(os.environ.copy()) sites: the helper already copies os.environ when called with no argument (verified env-equivalent). - Add reciprocal keep-in-sync comments between iter_hermes_node_dirs() (hermes_constants.py) and hermesManagedNodePathEntries() (electron main.cjs), which mirror the same platform-ordering rule across the Python/Node boundary.	2026-06-20 02:12:16 +05:30
kshitijk4poor	fcc169057d	fix(windows): prefer managed npm for hermes update desktop-rebuild gate The `hermes update` desktop-rebuild gate still used a bare `shutil.which("npm")` presence check. On a Windows box where the only working npm is the Hermes-managed npm.cmd (not on PATH), the gate would skip the desktop rebuild even though _build_web_ui / cmd_gui can now find it via find_node_executable. Route the gate through the same resolver for full bug-class coverage. Surfaced during review of #49239.	2026-06-20 02:01:24 +05:30
helix4u	7a7b56d498	fix(windows): prefer managed node for whatsapp and desktop	2026-06-20 02:00:37 +05:30
hakanpak	38f1a923af	fix(gateway): rename the Telegram topic from /title, not only auto-titles Auto-generated session titles already rename the Telegram forum topic via the title_callback path, but the /title command only wrote the session title to the database. On a Telegram topic lane the visible topic kept its auto-assigned name, so a user who ran /title to override it saw no change. Propagate the user-chosen title to the topic by calling the existing _schedule_telegram_topic_title_rename helper on a successful /title set. It already no-ops off Telegram topic lanes and when auto-rename is disabled.	2026-06-20 01:54:16 +05:30
Teknium	866f1d65c4	chore(desktop): sync package.json version fallback to 0.17.0 (#49236)	2026-06-19 12:53:35 -07:00
teknium1	2bd1977d8f	chore: release v0.17.0 (2026.6.19)	2026-06-19 12:38:31 -07:00
emozilla	40722058e5	fix(mcp): keep short-TTL HTTP sessions alive with configurable ping keepalive MCP Streamable HTTP servers that garbage-collect idle sessions on a short TTL (e.g. Unreal Engine's editor MCP, ~15s) were unusable: the keepalive was hardcoded at 180s, so the session was always dead by the time it ran, and every idle tool call then landed on an expired session and paid the full reconnect path (observed hangs of 113-143s until interrupt, bounded only by the 300s tool_timeout). Two coordinated, backward-compatible changes: - Add per-server `keepalive_interval` (config.yaml, not an env var per the contribution rubric). Default 180s — byte-identical to the old hardcoded value when unset — floored at 5s. Servers with short session TTLs set it below their TTL so the session stays warm. - Switch the keepalive probe from `list_tools()` to `ping` (the MCP base protocol liveness primitive). On large servers `list_tools` pulled ~1 MB every cycle (830 tools = 1,068,041 bytes); `ping` is ~55 bytes and works uniformly across tool/prompt/resource servers. Tool-list changes still arrive out-of-band via notifications/tools/list_changed -> _refresh_tools. `ping` is an OPTIONAL utility, so to guarantee zero regression for a tool-capable server that doesn't implement it: the first -32601 latches `_ping_unsupported` and the probe falls back to the pre-ping `list_tools` path for that connection (no reconnect loop). The latch resets on each fresh connection (_discover_tools, all transport paths) so a server that gains ping support after a reconnect is re-probed with the cheap path. Non-(-32601) ping errors propagate as genuine liveness failures. Verified end-to-end against a live Unreal MCP server (idle 22s past the ~15s TTL -> post-idle tool call returns in 0.31s, no teardown) and with a simulated ping-less tool server driving the real keepalive loop (ping once, list_tools thereafter, no reconnect). 25/25 unit tests pass. 
Note: a separate upstream defect (modelcontextprotocol/python-sdk#2604) still tears down the whole session when one tool-call POST returns 4xx; that is not addressed here.	2026-06-19 12:16:33 -07:00
kshitij	4c5217b717	Merge pull request #49207 from kshitijk4poor/fix/cron-script-env-sanitize fix(cron): sanitize env for job script subprocesses	2026-06-20 00:36:26 +05:30
Teknium	ba49fb51a5	fix(discord): hydrate channel context when replying to a message (#49212 ) * fix(discord): hydrate channel context when replying to a message Replying to a message in a free-response (non-mention, threads-off) channel previously received only the 500-char "[Replying to: ...]" snippet — the history-backfill gate fired only for mention-gated channels and threads, so a reply got no surrounding channel context. Replies now route through the same _fetch_channel_context hydration that threads use. When the user replied to a specific (often older) message, a reply-anchored window is scanned ending at that message so the agent sees the exchange around what was pointed at, even when the target sits before the self-message partition. The two windows are merged chronologically and de-duplicated by message id. Also hardens the recent-window scan to skip non-conversational status bumps before the self-message partition check, and makes author-name resolution defensive against partial/deleted authors. * fix(discord): duck-type reply-target resolution instead of isinstance(discord.Message) The e2e suite stubs the discord module, so discord.Message is a MagicMock and isinstance(_resolved, discord.Message) raises 'isinstance() arg 2 must be a type'. Any object with an int .id works as a scan anchor, so resolve the reply target by duck-typing on .id and fall back to a _Snowflake from the reference message_id.	2026-06-19 12:03:08 -07:00
kshitijk4poor	f06508836d	docs(security): enumerate cron job scripts in §2.3 credential scoping The cron-script subprocess is now sanitized alongside shell/MCP/ code-exec children; §2.3 listed only the original three. Makes the _run_job_script docstring's §2.3 citation fully accurate. Follow-up to salvaged PR #49207.	2026-06-20 00:30:42 +05:30
kshitijk4poor	8dc0b18894	refactor(cron): copy os.environ before sanitizing for subprocess Matches the env= callsite convention at the other sanitized subprocess spawns (cua_backend dict(os.environ), gateway os.environ.copy()). Functionally equivalent — _sanitize_subprocess_env never mutates its input — but avoids handing the live mapping to the helper. Follow-up to salvaged PR #49207.	2026-06-20 00:29:46 +05:30
alt-glitch	16642e2769	fix(mcp): revert ACP rebuild to original; harden generation guard CI caught 3 ACP test failures (tests/acp/test_server.py, tests/acp/test_mcp_e2e.py). Root cause: routing ACP's tool-surface rebuild through the shared refresh_agent_mcp_tools helper (added in the round-2 pass) broke a deliberate, pre-existing ACP contract: - the ACP tests assert `agent.tools is <get_tool_definitions return>` (object identity) and an exact get_tool_definitions(enabled_toolsets=[...], disabled_toolsets=..., quiet_mode=True) call signature; the shared helper list()-copies and re-derives differently, breaking identity; and - the tests use a MagicMock agent whose _tool_snapshot_generation is a mock, so the new `int < published_gen` generation guard raised TypeError and the whole ACP refresh silently failed. ACP already preserves memory-provider tools (its own inject call) and excludes context_engine, so there was no bug to fix there — only over-reach. Reverted ACP to its original rebuild. (Same lesson as the gateway path: leave call sites that carry their own tested contract alone; a reviewer's "inert today, fragile" note meant leave-it, not change-it.) Also hardened the generation guard defensively: tolerate a non-int _tool_snapshot_generation (mock / partially-built agent) instead of throwing TypeError and silently failing the refresh.	2026-06-19 11:57:43 -07:00
alt-glitch	f3e967aae5	fix(mcp): round-3 polish — generation capture adjacency + gateway contract note Third review pass (Hermes subagent) declared convergence: no BLOCKING, the round-2 generation-aware publish / context-engine staging / CLI reload / ACP routing all verified correct by hand and by test. - agent_init: capture _tool_snapshot_generation immediately before the tool snapshot (was ~425 lines earlier); removes a harmless skew window so the recorded generation always matches the snapshot it describes. - gateway/run.py _execute_mcp_reload: keep preserving each cached agent's build-time enabled_toolsets EXACTLY (do NOT merge newly-connected servers like CLI/TUI do) and document WHY — gateway sessions can be deliberately locked down, and test_reload_mcp_preserves_per_agent_toolset_overrides asserts this. A reviewer suggested "parity" here; it would have violated that contract.	2026-06-19 11:57:43 -07:00
alt-glitch	88d523220f	fix(mcp): address adversarial review round 2 (stale-publish race, parity holes) Second review pass (Codex + Hermes subagent). Codex reproduced a real race with a two-thread harness; both converged on the remaining issues. - Generation-aware publish (fixes a lost-update race): two refresh callers (the late-refresh daemon and the between-turns prologue around turn 1) could each compute a snapshot outside the lock; a SLOWER caller holding an OLDER registry generation could acquire the publish lock after a newer caller and clobber it, deleting just-landed tools. refresh_agent_mcp_tools now captures registry._generation before computing and refuses to publish a stale set; agent._tool_snapshot_generation tracks the published generation. - Context-engine routing names (_context_engine_tool_names) are now staged on a local and published atomically with the snapshot, and only claimed when this rebuild actually appended the schema — matching agent_init's dedup so a registry/plugin tool of the same name keeps its own dispatch. (Previously mutated live, before the publish lock, and on no-change refreshes.) - CLI /reload-mcp: self.enabled_toolsets is resolved once at startup, so a server newly ENABLED in config mid-session wasn't picked up (TUI already re-resolved). Merge now-connected MCP server names into the override (unless the user pinned all/*), mirroring startup, and keep self.enabled_toolsets in sync. Closes the CLI/TUI parity hole. - ACP (acp_adapter/server.py) routed through the shared helper — it was a 5th sibling rebuild that re-injected memory tools but NOT context-engine tools and bypassed the atomic/name-diff path (inert today, fragile). - mcp_startup._resolve_discovery_timeout pulls its default from DEFAULT_CONFIG (single source of truth) instead of a stale hardcoded 5.0 literal. - Tests: stale-generation-no-clobber, _skip_mcp_refresh honored, timeout fallback uses DEFAULT_CONFIG.	2026-06-19 11:57:43 -07:00
alt-glitch	b6e2a54a94	fix(mcp): address adversarial review round 1 (cache parity, gates, races) Consolidated findings from three independent reviewers (Codex, Claude Code, a Hermes subagent w/ the hermes-agent-dev skill): - BLOCKING: refresh_agent_mcp_tools rebuilt only the registry subset, silently dropping post-build-injected memory-provider (mem0/honcho/…) and context- engine (lcm_) tools on every refresh. Now additive-preserving: re-applies the same injectors agent_init uses, staged on locals and published atomically. - Re-injection now honors the #5544 enabled_toolsets gate for context-engine tools, so a restricted-toolset platform can't get lcm_ leaked back in. - Atomic read-diff-publish under one lock: the returned `added` set and the (tools, valid_tool_names) pair are consistent even under concurrent callers (no half-swap, no TOCTOU). - background_review fork opts out (_skip_mcp_refresh) so its byte-identical tools[] cache parity with the parent is preserved. - CLI /reload-mcp routed through the shared helper (was a 4th divergent copy with the same clobber bug + missing disabled_toolsets). - Explicit reloads (TUI RPC + CLI) pass enabled_override so a server the user just enabled in config this session is picked up; automatic paths reuse the agent's build-time selection. - mcp_discovery_timeout default 5.0 -> 1.5s: correctness now comes from the between-turns refresh, so the startup wait is only a small turn-1 UX bump rather than a heavy dead-server latency penalty. - has_registered_mcp_tools checks registered TOOLS (not connected servers) so a zero-tool/prompt-only server doesn't make the per-turn hook fire forever. - Tests: rewrote the thread-safety test to actually exercise the write path (alternating tool sets), added the #5544-gate regression, the memory/context preservation regression, and a "callable next turn via valid_tool_names" contract; removed a dead monkeypatch line.	2026-06-19 11:57:43 -07:00
alt-glitch	3713483874	fix(mcp): refresh agent tool snapshot between turns (cache-safe late-binding) A slow MCP server (HTTP/OAuth, 2-6s cold connect) that finishes connecting after the agent's one-time tool snapshot was uncallable for the rest of the session. The merged pre-first-turn late-refresh only helps during the dead air before the user's first keystroke; once a turn starts it bails to protect the prompt cache, so a user who types before the server connects never gets the tools without a manual /reload-mcp. Refresh the snapshot in the per-turn prologue (build_turn_context), before this turn's first API call assembles tools=. This is cache-safe by construction: the refresh only ever extends a fresh request prefix at a turn boundary, never mutates the cached prefix of an in-flight turn. So late tools become callable on the user's NEXT turn automatically, with no /reload-mcp and no cache cost. - tools/mcp_tool.py: has_registered_mcp_tools() — cheap guard so sessions with no MCP servers (the common case) skip the rebuild entirely. - agent/turn_context.py: call the shared refresh_agent_mcp_tools() helper at the top of the prologue when MCP servers are registered. - tests: 3 contract tests through the real build_turn_context (adds late tool; skipped when no servers; no snapshot churn when unchanged). .hermes/plans/: SPEC + PLAN documenting the root cause, the cache-safety constraint, and why the existing fixes (#48403/#41630/#42802) don't close it.	2026-06-19 11:57:43 -07:00
alt-glitch	93d6e73028	fix(mcp): expose late-connecting MCP tools to the agent (TUI/CLI/gateway) MCP servers that connect after the agent's one-time tool snapshot were invisible for the whole session. Two root causes, fixed together: 1. The startup discovery wait was a flat 0.75s. HTTP/OAuth servers commonly take 2-6s on a cold connect, so they missed the window and their tools never entered the agent's snapshot. `thread.join(timeout)` already returns the instant discovery completes, so raising the bound costs ~0s for the common case (no MCP / fast servers) and only ever blocks for a genuinely-pending server, capped so a dead server can't freeze startup. The bound is now configurable via `mcp_discovery_timeout` (config.yaml, default 5.0s). 2. Three call sites duplicated the agent tool-snapshot rebuild (the TUI `reload.mcp` RPC, the gateway reload, and the TUI late-binding refresh thread), and the late-refresh detected changes by tool COUNT — missing an equal-size add/remove swap. Consolidated into one shared `tools.mcp_tool.refresh_agent_mcp_tools(agent)` helper that diffs by tool NAME, mutates the agent under a lock (thread-safe), and respects the agent's own enabled/disabled toolsets. The late-binding refresh keeps its pre-first-turn cache-safety guard: it never rebuilds the tool list once a turn has started, so the cached prompt prefix is never invalidated mid-conversation. Tests: new tests/tools/test_refresh_agent_mcp_tools.py covers the name-based diff, in-place mutation, agent-scoped filtering, thread safety, and the config-driven discovery bound (incl. instant-return when nothing is pending). 75 passed across the touched areas.	2026-06-19 11:57:43 -07:00
kshitijk4poor	2d978bf44a	test(cron): make env-sanitize probe var deterministic next(iter(frozenset)) picked a different blocklist var each run (PYTHONHASHSEED-dependent), hurting reproducibility. sorted()[0] keeps the invariant-style assertion (any real blocklisted var) while making failures reproducible. Follow-up to salvaged PR #49207.	2026-06-20 00:22:55 +05:30
teknium1	746c46d610	chore: add lgalabru to AUTHOR_MAP for PR #43112 salvage	2026-06-19 11:46:25 -07:00
Ludo Galabru	239740a19e	feat(tools): MCP elicitation handler with gateway-aware approval routing Wires support for the MCP `elicitation/create` request (Python SDK 1.11+) so MCP servers can ask the user to confirm sensitive operations mid-tool-call (payment authorization, OAuth confirmation, etc.) instead of failing closed or requiring out-of-band biometrics. Behavior: - `tools/mcp_tool.py` adds `ElicitationHandler`, attached per server task and passed to `ClientSession` as `elicitation_callback`. Form-mode requests route through the existing approval system; URL-mode requests decline cleanly (out of scope for this pass). - `tools/approval.py` adds `request_elicitation_consent()`, which dispatches to whichever surface owns the active session — `_await_gateway_decision` for Telegram / Slack / etc. (so the approval prompt lands on the right platform), `prompt_dangerous_approval` for CLI / TUI. Fails closed on timeout, missing notify_cb, or exception. - The MCP tool wrapper snapshots `contextvars.copy_context()` into `MCPServerTask._pending_call_context` before each `session.call_tool` and clears it after. The recv-loop task that dispatches incoming `elicitation/create` requests does not inherit the agent task's contextvars (HERMES_SESSION_PLATFORM and friends), so without the bridge `_is_gateway_approval_context()` returns False on every gateway session and the elicitation falls through to a CLI prompt that has no TTY → fail-closed decline. The handler now reads the snapshot via its `owner` back-reference and replays it through `Context.copy().run(...)` so attribution survives the task hop. 
Tests (`tests/tools/test_mcp_elicitation.py`): - form-mode accept / decline / cancel - URL-mode declined without prompting - exception in approval system → decline - timeout in approval → cancel - context-bridge regression tests (replay observed in consent call, missing-context fallback, multiple-replay safety, owner with cleared `_pending_call_context`) Verified end-to-end against pay's MCP server on macOS: agent message arrives via Telegram, agent calls `mcp_pay_curl` against a paid endpoint, pay returns 402, ElicitationHandler routes the approval prompt back to the originating Telegram chat, user replies in TG, the curl tool signs and completes. Platforms tested: macOS 14 (darwin/arm64). No Unix-only syscalls introduced; Windows footgun checker passes on the touched files.	2026-06-19 11:46:25 -07:00
0z1-ghb	da7253215d	fix(cron): sanitize env for job script subprocesses Cron no_agent and pre-check scripts ran with the full gateway/agent environment, allowing scripts under HERMES_HOME/scripts/ to read provider credentials. Apply _sanitize_subprocess_env like terminal and MCP paths (SECURITY.md section 2.3). Add regression test asserting blocklisted provider vars are absent in the child process.	2026-06-20 00:13:11 +05:30
Teknium	26e76a75e5	feat(telegram): opt-in Online/Offline bot status indicator (#49134 ) Sets the Telegram bot's short description (the line under its name) to "Online" on gateway connect and "Offline" on clean disconnect, gated behind extra.status_indicator (off by default). Telegram bots have no presence/online dot — that's a user-account feature the Bot API doesn't expose for bots. The short description is the closest available surface, so this gives users a way to tell whether the gateway is up from the bot's profile. - New extra.status_indicator flag (+ status_online/status_offline text overrides), read in __init__ via config.extra — no config-schema change. - _set_status_indicator() helper: best-effort, swallows API errors so it never blocks connect/disconnect; truncates to Telegram's 120-char cap. - Wired Online after _mark_connected(), Offline at top of disconnect() while the bot HTTP client is still alive. - 9 unit tests + Telegram docs section. Requested by @ilTrumpista, cc @Teknium.	2026-06-19 11:38:39 -07:00
alt-glitch	990273d90a	fix(agent): accept pixel-correct image downscale when bytes grow (#48013 ) The image-too-large reactive shrink (try_shrink_image_parts_in_messages) conflated two independent constraints: it always rejected a resize whose re-encoded bytes were >= the original, even when the shrink was driven by a PIXEL-DIMENSION cap (Anthropic many-image 2000px) rather than the byte budget. Downscaled screenshot PNGs routinely re-encode LARGER in bytes, so the dimension-correct result was discarded and the image left oversized -> the provider re-rejected on retry and the session wedged forever. Fix: track which constraint triggered the shrink (bytes vs dimension) and gate the accept on the SAME axis. * dimension path: accept the result as long as it is now within max_dimension, regardless of byte size (verify via Pillow; fall back to the byte gate only when the re-encode can't be decoded). * bytes path: still require bytes to shrink, but ALSO re-check the per-side cap when it's active — _resize_image_for_vision returns a best-effort, possibly over-cap blob when it exhausts its halving budget on a very-high-aspect image, so a byte-shrink alone can leave it over the dimension cap and re-brick on retry. Extend the unshrinkable-oversized guard to the pixel axis so a partial shrink doesn't burn the one-shot retry. Single shared agent path -> fixes CLI, TUI, and gateway alike. Adds a real-Pillow runnable proof (repro_48013_image_shrink_brick.py) that reproduces the issue's per-image table (bricks 3/5 before, passes 5/5 after) plus unit invariants for the dimension and bytes accept/reject paths, partial-progress accounting, and the bytes-path still-over-cap regression surfaced by adversarial review. Closes #48013	2026-06-19 11:37:51 -07:00
Teknium	ac00e73688	feat(dashboard): add a reasoning-effort picker to the chat sidebar (#49141 ) The web dashboard only showed a read-only "Reasoning" capability badge with no way to set the effort level — unlike the desktop app, which has an effort radio in its composer model menu. This adds a picker so the two surfaces reach parity. - ReasoningPicker: a Select rendered in the chat sidebar, gated on the effective model's supports_reasoning capability (from /api/model/info). Reads/writes agent.reasoning_effort via the existing config REST endpoints (read-modify-write, the dashboard's single-key save pattern), so the value lands in the config the agent boots a fresh chat from. Options mirror the desktop: Off/Minimal/Low/Medium/High/Max. - ChatSidebar: capture supports_reasoning from the model-info fetch and render the picker; on change, show the same 'apply on /new or reload' notice the model switch uses. - reasoning-effort.ts: DOM-free helpers (normalizeEffort + options) so the node-env vitest harness can cover the resolution logic, plus tests.	2026-06-19 11:37:40 -07:00
Teknium	c06898098b	fix(cli): clear viewport on width-change resize so the status bar can't duplicate (#49120 ) The classic CLI status bar could appear twice after a horizontal terminal resize — two bars at two widths with two different elapsed readings. Root cause: prompt_toolkit's Application._on_resize() calls renderer.erase(), which does cursor_up(_cursor_pos.y) + erase_down() using the _cursor_pos.y cached from the LAST render at the OLD width (renderer.py:745). On a column shrink the terminal reflows the already-painted full-width chrome into extra physical rows, so the cached y undershoots: cursor_up doesn't climb past the reflowed rows and erase_down leaves the old bar stranded ABOVE the live origin. The next paint stacks a fresh bar below it. The existing post-resize suppression hides the NEW bar for ~0.35s but never erases the already-reflowed OLD one, so the ghost survives the whole window. Ctrl+L / /redraw clears it, confirming a viewport wipe is the fix. Fix: on a WIDTH change, _recover_after_resize now routes through the same recovery as Ctrl+L — _clear_prompt_toolkit_screen(rebuild_scrollback=False) (CSI 2J, visible viewport only) + _replay_output_history() — BEFORE delegating to prompt_toolkit's resize. Banner-safe: 2J never touches scrollback history (that's CSI 3J, which we don't send here), so the startup banner is preserved. Rows-only resizes skip the clear (no reflow → no ghost) to avoid an extra repaint. Tracks _last_resize_width to distinguish the two. Tests: replace the now-obsolete 'never clears on resize' assertion with two tests — rows-only resize delegates without clearing; width change clears the viewport + replays and never wipes scrollback.	2026-06-19 08:43:42 -07:00
Teknium	b266ad748c	chore(deps): npm audit fix — bump transitive undici to clear advisories (#49113 ) Resolves the 2 npm audit advisories (1 high, 1 moderate), both from transitive undici: - undici 6.26.0 -> 6.27.0 (high: TLS bypass / header injection / response queue poisoning class, via node-gyp + ui-tui) - jsdom's undici 7.27.2 -> 7.28.0 (moderate, via jsdom test dep) Both are in-range bumps (no --force). Lockfile also reconciled two pre-existing manifest drifts during the install: dompurify 3.4.10 -> 3.4.11 (in-range patch) and the web workspace's already-declared vitest ^4.1.5 devDep. No package.json changes. npm audit reports 0 vulnerabilities in root, ui-tui, and apps/desktop after.	2026-06-19 08:20:03 -07:00
brooklyn!	0e8b76532e	fix(desktop): rename "Restart messaging" → "Restart gateway", surface restarts in the statusbar, make logs selectable (#49094 ) * fix(desktop): rename "Restart messaging" -> "Restart gateway" The Command Center control restarts the whole messaging gateway, yet was labelled "Restart messaging" while the status line above it reads "Messaging gateway running/stopped". Rename the i18n key to match what it does, across all 4 locales. * feat(desktop): restart the gateway from Cmd+K, with statusbar spinner feedback Add a shared runGatewayRestart() (store/system-actions.ts) and wire it to a new Cmd+K "Restart gateway" action. While a restart is in flight the statusbar "Gateway" item swaps its icon for the TUI glyph spinner and reads "restarting…", returning to its real state on completion — driven by a $gatewayRestarting atom, not a transient toast or the generic "Agents running" counter. The helper owns its error handling so fire-and-forget callers can't leak an unhandled rejection; only a failure toasts. * fix(desktop): offer a Restart gateway action on messaging save/toggle toasts The "setup saved" and "platform enabled/disabled" toasts told users their change needs a gateway restart but left it a separate hunt. Attach a "Restart gateway" action (the shared runGatewayRestart), and reword the copy to state the pending consequence ("...takes effect after a gateway restart") now that the button carries the verb. Updated all 4 locales. * fix(desktop): make rendered logs selectable so they can be copied The global body { user-select: none } left log surfaces unselectable. Opt them back in via the existing data-selectable-text convention — at the shared LogView primitive (boot-failure + bootstrap install overlays) plus Command Center recent logs, toolset post-setup output, notification detail, and subagent stream/file lines.	2026-06-19 10:09:15 -05:00
Brooklyn Nicholson	929dbf7778	fix(desktop): make rendered logs selectable so they can be copied The global body { user-select: none } left log surfaces unselectable. Opt them back in via the existing data-selectable-text convention — at the shared LogView primitive (boot-failure + bootstrap install overlays) plus Command Center recent logs, toolset post-setup output, notification detail, and subagent stream/file lines.	2026-06-19 10:03:46 -05:00
Brooklyn Nicholson	a1639921ac	fix(desktop): offer a Restart gateway action on messaging save/toggle toasts The "setup saved" and "platform enabled/disabled" toasts told users their change needs a gateway restart but left it a separate hunt. Attach a "Restart gateway" action (the shared runGatewayRestart), and reword the copy to state the pending consequence ("...takes effect after a gateway restart") now that the button carries the verb. Updated all 4 locales.	2026-06-19 10:03:24 -05:00
Brooklyn Nicholson	553cf4f977	feat(desktop): restart the gateway from Cmd+K, with statusbar spinner feedback Add a shared runGatewayRestart() (store/system-actions.ts) and wire it to a new Cmd+K "Restart gateway" action. While a restart is in flight the statusbar "Gateway" item swaps its icon for the TUI glyph spinner and reads "restarting…", returning to its real state on completion — driven by a $gatewayRestarting atom, not a transient toast or the generic "Agents running" counter. The helper owns its error handling so fire-and-forget callers can't leak an unhandled rejection; only a failure toasts.	2026-06-19 10:02:54 -05:00
Brooklyn Nicholson	6308d3416a	fix(desktop): rename "Restart messaging" -> "Restart gateway" The Command Center control restarts the whole messaging gateway, yet was labelled "Restart messaging" while the status line above it reads "Messaging gateway running/stopped". Rename the i18n key to match what it does, across all 4 locales.	2026-06-19 10:02:21 -05:00
Teknium	0d7abd555c	fix(dashboard): sort chat session switcher by most-recent activity (#49104) The Chat-tab session switcher rendered rows in the API's default order="created" (original start time) while each row displays last_active, so a session you just messaged in could sit below an older one, and the list looked unsorted against its own timestamps. Pass order="recent" from ChatSessionList so the switcher sorts by latest activity across the compression chain (most-recently-used at top, ChatGPT-style; long conversations that auto-compressed into a new continuation id stay on the first page). Adds an optional, defaulted `order` arg to api.getSessions; the paginated Sessions page keeps the stable created order.	2026-06-19 07:58:56 -07:00
Teknium	1b04e4ede5	fix(cli): status bar no longer stays hidden after resize during idle (#49105) The classic CLI status bar could vanish for the rest of a session: any terminal reflow (SIGWINCH from a tmux pane change, SSH window restore, font zoom) set _status_bar_suppressed_after_resize=True, but the flag was ONLY cleared on the next submitted user input. Resize, then sit idle, and the bottom chrome rendered at height 0 on every repaint (even with the refresh clock ticking), so the bar was gone until you typed and hit enter. Fix: _recover_after_resize now schedules a debounced unsuppress timer that clears the flag and repaints once the reflow settles (~0.35s), so the bar returns on its own during idle. The next-submit clear stays as a fast path. Fails open: any error in scheduling clears the flag immediately rather than leaving the bar stuck hidden.	2026-06-19 07:53:58 -07:00
teknium1	7d86178cf5	fix(raft): set stdin=DEVNULL on bridge subprocess Satisfies the repo-wide subprocess-stdin guard (tests/tools/test_subprocess_stdin_guard.py); the long-lived bridge child should not inherit the gateway's stdin.	2026-06-19 07:52:37 -07:00
teknium1	22ccb12c30	chore(release): map skyzh@mail.build to xxchan for Raft salvage CI blocks PRs with unmapped commit-author emails.	2026-06-19 07:52:37 -07:00
skyzh	9026a8c789	feat(gateway): add Raft bundled platform plugin with activity hooks Adds a Raft platform adapter as a bundled plugin (plugins/platforms/raft/) connecting Hermes to Raft as an external agent via a wake-channel bridge. The adapter starts a loopback HTTP endpoint, spawns 'raft agent bridge' as a child process, and injects content-free wake hints into the gateway session pipeline. The agent reads/sends messages through the Raft CLI; the adapter never touches message bodies or delivery cursors. Activity observer hooks report tool/LLM/session lifecycle events via a bounded at-most-once queue. Auto-enables when RAFT_PROFILE is set. Cherry-picked from PR #47629. Authored by skyzh (@xxchan).	2026-06-19 07:52:37 -07:00
Teknium	2a5e9d994a	Merge pull request #48275 from NousResearch/feat/cron-scheduler-provider-chronos feat(cron): pluggable CronScheduler interface + Chronos managed-cron provider (scale-to-zero)	2026-06-19 07:51:59 -07:00
Ben	1928aa0443	fix(managed-scope): honor managed scope in config→env bridges too Manual verification surfaced a second bypass class beyond the standalone config loaders: several code paths bridge config.yaml values into os.environ (HERMES_TIMEZONE, HERMES_REDACT_SECRETS, HERMES_MAX_ITERATIONS, TERMINAL_*, network.force_ipv4, ...) by reading the raw user YAML, so the env the whole process reads carried the USER's value even when an administrator pinned it — e.g. a managed timezone was overridden because gateway/run.py wrote the user's timezone into HERMES_TIMEZONE, and _resolve_timezone_name() checks the env var first. Wired the shared apply_managed_overlay() into every config→env bridge: - gateway/run.py module-level startup bridge (timezone, redact_secrets, max_turns, terminal, display, gateway.strict, ...) - gateway/run.py _reload_runtime_env_preserving_config_authority (the per-turn re-bridge that keeps config authoritative over reloaded .env — must keep MANAGED authoritative on every turn, not just startup) - hermes_cli/main.py early security.redact_secrets / network.force_ipv4 bridge (runs before load_config is usable, at import time) - hermes_cli/send_cmd.py top-level scalar config→env bridge Verified end-to-end against a writable managed dir (12/12 checks incl. timezone, logging, model, skin, gateway settings, write-guard) and in a clean process the gateway per-turn bridge writes HERMES_TIMEZONE=<managed>. Adds an order-independent regression test for the bridge overlay.	2026-06-19 07:46:33 -07:00
Ben	b0e47a98f9	fix(managed-scope): honor managed scope in all standalone config loaders The skin bug was one instance of a class: several subsystems build their config dict directly from config.yaml instead of routing through hermes_cli.config.load_config (which carries the managed merge), so they silently ignored administrator-pinned values. Audited every config.yaml reader and fixed the behavioral-read bypasses: - gateway/config.py load_gateway_config (messaging gateway: session_reset, quick_commands, stt, model, ...) - gateway/run.py _load_gateway_config (its read_raw_config fast path also skipped the merge — read_raw_config returns raw user YAML) - tui_gateway/server.py _load_cfg (new TUI + desktop backend: skin, reasoning_effort, service_tier, provider_routing) - cron/scheduler.py (scheduled-job model/reasoning/toolsets/provider_routing) - hermes_logging.py (logging.level/max_size_mb/backup_count) - hermes_time.py (timezone) - hermes_cli/doctor.py (memory-provider diagnostic reads effective config) All route through a new shared managed_scope.apply_managed_overlay() helper that mirrors _load_config_impl (env-only expansion so a user ${VAR} can't shadow a managed literal, root-model-string normalization, leaf-merge) and is fail-open. cli.py's earlier inline fix is refactored onto the same helper. Write-back paths (slash_commands, telegram/yuanbao dm_topics, profile distribution) are deliberately left reading raw user YAML — overlaying managed values there would persist them into the user file. The dashboard (web_server.py) already routes through load_config and needed no change. TUI loader caches the RAW config so _save_cfg never writes managed values to disk. Adds test_managed_scope_overlay.py (helper) and test_managed_scope_loaders.py (per-surface integration); mutation-checked.	2026-06-19 07:46:33 -07:00
Ben	732293cf87	fix(managed-scope): apply managed layer in cli.py's standalone config loader cli.py's load_cli_config() builds CLI_CONFIG independently of hermes_cli.config._load_config_impl (it reads config.yaml directly and merges into hardcoded defaults), so the Phase 2 managed merge never reached the interactive CLI/TUI surface. Symptom: a managed display.skin (and any other display/CLI pref read from CLI_CONFIG) was silently ignored by the TUI while `hermes config`/`doctor`/write-guards — which go through load_config — correctly honored it. Found via manual testing: the skin engine kept using 'default'. Fix: overlay the managed config last in load_cli_config(), mirroring _load_config_impl — expand against the process env only (so a user ${VAR} can't shadow a managed literal), normalize the root model key so a managed `model: x/y` string can't clobber the dict shape callers expect, then leaf-merge. Fail-open so managed scope can never block CLI startup. Adds tests/hermes_cli/test_managed_scope_cli_config.py locking that CLI_CONFIG honors managed values, preserves user siblings, and is inert with no scope.	2026-06-19 07:46:33 -07:00
Ben	9a24e41d0f	docs: add managed scope admin guide + cross-link from configuration	2026-06-19 07:46:33 -07:00
Ben	ddd519ea70	feat(managed-scope): surface managed scope in config show and doctor - show_config prints an administrator header naming the managed source and lists the pinned config/env keys when a scope is active (silent otherwise). - hermes doctor gains a managed_scope_check under Configuration Files that reports the resolved managed dir + pinned key counts, and flags a HERMES_MANAGED_DIR redirect (the documented foot-gun).	2026-06-19 07:46:33 -07:00
Ben	4f9e15df97	feat(managed-scope): guard writes to managed config/env keys - set_config_value hard-rejects a managed config key (D2) and names the source, exiting non-zero. - save_env_value / remove_env_value refuse a managed env key. - save_config strips managed leaves from a bulk write (mechanical safety net) with a warning, so the unmanaged remainder still persists. New _strip_dotted_keys helper drives the bulk-save pruning. All guards are distinct from and layered after the existing is_managed() package-manager write-lock.	2026-06-19 07:46:33 -07:00
Ben	81a663abea	feat(managed-scope): apply managed .env last with override load_hermes_dotenv now loads the managed-scope .env after user/project .env and external secret sources, with override=True, so managed env values beat the user .env and any pre-existing shell export. Reuses the existing dotenv fallback + credential-sanitization path. Fail-open: no managed dir/.env is a no-op and any error is swallowed so managed scope never blocks startup.	2026-06-19 07:46:33 -07:00
Ben	b5ddd6e719	feat(managed-scope): managed config layer wins over user config _load_config_impl now deep-merges the managed config.yaml on top of the expanded user config so managed leaves win while sibling keys stay user-controlled (leaf-level merge, D3). Managed values are expanded against the process env only, never user-defined ${VAR}, so a user can't shadow a managed literal. The managed file's (mtime,size) is folded into the load cache key so editing it invalidates the cache. This inverts the usual env-over-config precedence for pinned keys by design (see design doc §4.1).	2026-06-19 07:46:33 -07:00
Ben	9cbcc0c9c8	feat(managed-scope): add managed_scope module (resolver, loaders, key helpers) New hermes_cli/managed_scope.py resolves a system-level managed directory (HERMES_MANAGED_DIR override > /etc/hermes), parses managed config.yaml/.env with fail-open semantics, and exposes is_key_managed/is_env_managed helpers. The system default is ignored under pytest and HERMES_MANAGED_DIR is added to the conftest env scrub so a real managed scope can't leak into the suite. Not wired into the load paths yet (Phases 2-3).	2026-06-19 07:46:33 -07:00
Ben	bf9a0481fa	test(config): pin config/env load behavior before managed scope	2026-06-19 07:46:33 -07:00
teknium1	a58287afcb	Merge remote-tracking branch 'origin/main' into pr48275-rebase # Conflicts: # cron/scheduler.py	2026-06-19 07:40:29 -07:00
Teknium	35e7ca03d5	fix(kanban): treat already-gone worker as terminated, not survived _terminate_reclaimed_worker early-returned on ProcessLookupError with terminated=False. The new reclaim-defer guard reads that as 'worker survived the kill' and defers the reclaim forever, so a stale task whose worker is already dead never lands in result.stale. ProcessLookupError means the process is gone — that IS a successful termination. Split it from the generic OSError branch and set terminated=True.	2026-06-19 07:38:10 -07:00
Sahil Saghir	b9e521da23	fix(kanban): hold reclaim while the worker is still alive release_stale_claims and detect_stale_running call _terminate_reclaimed_worker and then release the task claim unconditionally, even when the termination did not actually kill the worker. _terminate_reclaimed_worker already reports this via its "terminated" flag, but the callers ignore it. When a worker is parked in uninterruptible (D) state — for example throttled by a cgroup memory.high limit — a pending SIGTERM/SIGKILL cannot be delivered until the throttle lifts, so the kill is a no-op. The dispatcher then frees the claim and spawns a fresh worker beside the still-alive one. Repeated every dispatch tick this accumulates duplicate workers without bound, deepening the memory pressure that caused the throttle in the first place — a self-reinforcing runaway. Fix: gate both automatic reclaim paths on _worker_survived_termination(). When we attempted to kill our own host-local worker and it is still alive, defer the reclaim (_defer_reclaim_for_live_worker extends the claim a short grace and emits a reclaim_deferred event) instead of releasing. This guarantees at most one live worker per task and is self-correcting: not spawning a duplicate is what relieves the pressure so the pending signal lands and the worker dies, and the next tick reclaims cleanly. Non-host-local claims and the operator-driven reclaim_task() path keep their existing force-release behaviour. Related: #41448 (concurrent dispatchers amplify this by doubling reclaim frequency); #42858 (kill the worker rather than orphan it on archive). Tests: defer-when-worker-survives, reclaim-when-killed, release-when-not-host-local, and the detect_stale_running path.	2026-06-19 07:38:10 -07:00
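The two kanban commits above (35e7ca03d5 and b9e521da23) describe a termination result that the reclaim paths must respect. The following is a minimal sketch of that logic under stated assumptions, not the repository's code: terminate_worker, pid_alive, and reclaim_action are hypothetical names standing in for _terminate_reclaimed_worker, _worker_survived_termination, and the defer/release decision, and the injectable kill parameter exists only to make the sketch testable. The real helpers likely wait for the process to exit and escalate to SIGKILL, which is omitted here.

```python
import os
import signal


def pid_alive(pid, kill=os.kill):
    """Return True if a process with this pid still exists (hypothetical helper)."""
    try:
        kill(pid, 0)  # signal 0 performs the existence check without delivering a signal
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but belongs to someone else
    return True


def terminate_worker(pid, kill=os.kill):
    """Try to stop a worker; return True only if it is gone afterwards."""
    try:
        kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        # ProcessLookupError is a subclass of OSError, so it must be caught
        # first. An already-gone worker counts as a successful termination.
        return True
    except OSError:
        # Any other failure: we could not confirm the worker stopped.
        return False
    # The signal was accepted, but a worker in uninterruptible (D) state may
    # not act on it yet, so check whether the process is still there.
    return not pid_alive(pid, kill)


def reclaim_action(terminated, host_local):
    """Decide what an automatic reclaim path does with the task claim."""
    if host_local and not terminated:
        return "defer"    # worker survived: keep the claim, retry next tick
    return "release"      # worker gone, or claim is not host-local


if __name__ == "__main__":
    def already_gone(pid, sig):
        raise ProcessLookupError()

    terminated = terminate_worker(12345, kill=already_gone)
    print(terminated, reclaim_action(terminated, host_local=True))
```

Catching ProcessLookupError before the generic OSError branch is the point of 35e7ca03d5: with the branches merged, a dead worker would be reported as a survivor and its reclaim deferred indefinitely.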

12197 commits