hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-08 03:01:47 +00:00

Author	SHA1	Message	Date
thelumiereguy	8a96fa48c1	fix(gateway): avoid duplicated responses history	2026-05-07 05:07:59 -07:00
teknium1	429e78589b	refactor(auth): dedupe file-lock helper; document Nous lock order Extract the shared flock/msvcrt boilerplate from _auth_store_lock and _nous_shared_store_lock into a single _file_lock(lock_path, holder, timeout, message) helper. Each caller keeps its own threading.local holder so reentrancy state stays per-lock. Also document the lock-ordering invariant on both wrappers: _auth_store_lock is OUTER, _nous_shared_store_lock is INNER for all runtime refresh paths. The one exception is _try_import_shared_nous_state, which holds the shared lock alone across the full HTTP refresh+mint cycle to prevent concurrent sibling imports from racing on the single- use shared refresh token; that helper must not be called with the auth lock already held.	2026-05-07 05:07:06 -07:00
Michael Nguyen	a84e56d4c6	fix(auth): sync shared Nous refresh tokens	2026-05-07 05:07:06 -07:00
Teknium	38b1c7dce5	refactor(gateway): simplify auto-resume + extend to crash recovery Follow-up on top of @kyan12's PR #20888 — same feature, cleaner shape, wider coverage. Changes: - Drop the synthetic '[System note: ...]' in the internal MessageEvent. The existing _is_resume_pending branch in _handle_message_with_agent (run.py ~L13738) already injects a reason-aware recovery system note on the next turn. With kyan's text in place the model saw two stacked system notes. Now the event text is empty and the existing injection path owns the wording. - Drop SessionStore.list_resume_pending() as a new public method. The filter is 8 lines inline in _schedule_resume_pending_sessions() — one caller, no other pluggability need. - Add 'restart_interrupted' to the auto-resume reason set. That's the reason SessionStore.suspend_recently_active() stamps on sessions recovered from a crash/OOM/SIGKILL (no .clean_shutdown marker). Previously those sessions had to wait for a real user message to auto-resume; now they continue automatically at startup like drain-timeout interruptions do. - Reasons live in a _AUTO_RESUME_REASONS frozenset at class scope so future reasons (e.g. 'manual_resume_request') can be opted in with one line. Test coverage added: - drain-timeout + crash-recovery both scheduled - stale entries skipped (outside freshness window) - suspended entries skipped (suspended > resume_pending) - originless entries skipped (no routing target) - disallowed reasons skipped (graceful forward-compat) E2E verified end-to-end with a real on-disk SessionStore: 2 eligible sessions scheduled, 2 ineligible skipped, empty-text internal events delivered to the adapter. Co-authored-by: Kevin Yan <kevyan1998@gmail.com>	2026-05-07 05:05:34 -07:00
Kevin Yan	961a3535fa	fix(gateway): preserve resume marker on interrupted restart	2026-05-07 05:05:34 -07:00
Kevin Yan	fad684b1f3	feat(gateway): auto-resume interrupted sessions after restart	2026-05-07 05:05:34 -07:00
Teknium	233bfd3621	chore(release): map mwnickerson noreply email	2026-05-07 05:05:20 -07:00
mwnickerson	411cfa26e3	fix: auto-block repeated kanban retries	2026-05-07 05:05:20 -07:00
Teknium	595e906698	chore(release): map sonic-netizen noreply email	2026-05-07 05:05:20 -07:00
Sonic Chang	b49a3f8474	fix(kanban): reap completed worker children in dispatch_once The gateway-embedded dispatcher (default since `kanban.dispatch_in_gateway = true`) is the parent of every spawned kanban worker. `_default_spawn` calls `subprocess.Popen(..., start_new_session=True)` and returns the pid — `start_new_session` detaches the controlling tty but does not reparent to init, so the gateway keeps each worker as a child until it `wait()`s for them. Nothing in the dispatch loop ever calls `waitpid`. Result: every completed worker becomes a `<defunct>` zombie that lingers until the gateway exits. We hit ~430 zombies on a single hermes-agent container after ~40 days of steady kanban traffic, approaching process-table exhaustion on the host. Fix: add a non-blocking reap loop at the top of `dispatch_once`, so every dispatcher tick (default 60s) drains zombies that accumulated since the last tick. WNOHANG keeps the call non-blocking; ChildProcessError means no children to reap. Why here, not a SIGCHLD handler: - signal.signal requires the main thread; gateway threading model makes that placement non-trivial. - Bounded staleness: at default interval=60s the maximum live zombie count is one tick's worth of worker completions. - No interaction with detect_crashed_workers: that function only inspects rows where status='running', and rows reach 'done' (and stop being inspected) before their workers exit.	2026-05-07 05:05:20 -07:00
LeonSGP43	06f24351c5	fix(kanban): stop reclaimed workers before retry	2026-05-07 05:05:20 -07:00
Teknium	63bd690a50	chore(release): map stephen0110 noreply email	2026-05-07 05:05:20 -07:00
stephen0110	40b51c93a2	fix(kanban): heartbeat tool extends claim TTL, not just last_heartbeat_at The kanban_heartbeat tool called heartbeat_worker but never heartbeat_claim, so a worker that loops the tool while a single tool call blocks the agent for >DEFAULT_CLAIM_TTL_SECONDS still got reclaimed by release_stale_claims. The function name and heartbeat_claim's own docstring imply otherwise: "Workers that know they'll exceed 15 minutes should call this every few minutes to keep ownership." But there was no caller in the worker tool path. Workers couldn't invoke heartbeat_claim themselves either — it isn't exposed as a tool. Fix: _handle_heartbeat now calls heartbeat_claim first, reading HERMES_KANBAN_CLAIM_LOCK from the worker env (the dispatcher pins this in _default_spawn). Falls back to _claimer_id() for locally- driven workers that didn't go through dispatcher spawn. Test: tests/tools/test_kanban_tools.py::test_heartbeat_extends_claim_expires rewinds claim_expires into the past, calls the tool, and asserts the new value is at least now + DEFAULT_CLAIM_TTL_SECONDS // 2. Verified to fail against the unfixed code (claim_expires stays at the rewound value). Closes the root cause underlying the symptom in #21141 (15-min respawns of long-running workers). #21141 separately addresses post-reclaim cleanup; this fixes the upstream "shouldn't have been reclaimed in the first place" half.	2026-05-07 05:05:20 -07:00
Teknium	bf843adf05	feat(gateway): opt-in cleanup of temporary progress bubbles (#21186 ) When display.cleanup_progress (or display.platforms.<plat>.cleanup_progress) is true, the gateway deletes tool-progress bubbles, long-running '⏳ Still working...' notices, and status-callback messages after the final response is delivered successfully. Currently effective on adapters that implement delete_message (Telegram); silently no-ops elsewhere. Off by default. Failed runs skip cleanup so bubbles stay as breadcrumbs. Minimal plumbing: base.py's existing post_delivery_callback slot now chains new registrations onto any existing callback (with per-callback exception isolation) rather than clobbering. Stale-generation registrations are rejected so they can't step on a fresher run's callbacks. This lets the cleanup callback coexist with the background-review release hook already registered on the same slot. Co-authored-by: mrcharlesiv <Mrcharlesiv@gmail.com>	2026-05-07 05:04:37 -07:00
ambition0802	7c0766e06a	fix(gateway): translate inbound document host paths to container paths for Docker backend When terminal.backend is docker, inbound documents uploaded via messaging platforms (Telegram, Slack, Discord, Feishu, Email, etc.) are cached at a host path under ~/.hermes/cache/documents, but the container sandbox only sees them at the auto-mounted /root/.hermes/cache/documents path. This PR adds to_agent_visible_cache_path() in tools/credential_files.py (the natural sibling to get_cache_directory_mounts()) and calls it at the document-context-injection site in gateway/run.py so the agent always receives a path it can open directly, matching the mount layout already established by get_cache_directory_mounts() (#4846). Scope: only Docker backend for now; other backends use different mount semantics and are left unchanged until verified. Fixes #18787	2026-05-07 05:02:26 -07:00
Tranquil-Flow	d4de7d4179	test(skills): cover additional rescan paths in skill_commands cache (#14536 ) The rescan-on-platform-change fix landed in #18739 ships one regression test that exercises the HERMES_PLATFORM env-var path. Three other code paths in get_skill_commands / _resolve_skill_commands_platform have no direct coverage; this commit adds a regression test for each. - Gateway session context (HERMES_SESSION_PLATFORM via ContextVar): the resolver consults get_session_env after HERMES_PLATFORM, and the gateway sets that variable through set_session_vars (a ContextVar), not os.environ. The test uses set_session_vars / clear_session_vars to drive the actual gateway signal, and the disabled-skill stub reads the same value via get_session_env. A regression that swapped get_session_env for plain os.getenv would still pass an env-var-based test but break concurrent gateway sessions, which is the bug the ContextVar plumbing exists to prevent. - Returning to no-platform-scope (CLI / cron / RL rollouts after a gateway session): the cached telegram view must be dropped and the unfiltered scan repopulated when HERMES_PLATFORM is unset again. - Same-platform cache hit: consecutive calls under the same platform scope must NOT rescan. The rescan trigger is change in scope, not "always re-resolve" — a gateway serving many consecutive telegram requests should pay the scan cost once, not per request. The third test wraps scan_skill_commands with a spy after the cache is primed, so the assertion is on call_count == 0 across three subsequent get_skill_commands() calls. All 39 tests in tests/agent/test_skill_commands.py pass under scripts/run_tests.sh.	2026-05-07 04:59:43 -07:00
Teknium	fce58cbe2e	feat(optional-skills): port Anthropic financial-services skills as optional finance bundle (#21180 ) Adds 7 optional skills under optional-skills/finance/ adapted from anthropics/financial-services (Apache-2.0): excel-author — openpyxl conventions: blue/black/green cells, formulas over hardcodes, named ranges, balance checks, sensitivity tables. Ships recalc.py. pptx-author — python-pptx for model-backed decks (pitch, IC memo, earnings note) that bind every number to a source workbook cell. dcf-model — institutional DCF (49KB skill): projections, WACC, terminal value, Bear/Base/Bull scenarios, 5x5 sensitivity tables. Ships validate_dcf.py. comps-analysis — comparable company analysis: operating metrics, multiples, statistical benchmarking. lbo-model — leveraged buyout: S&U, debt schedule, cash sweep, exit multiple, IRR/MOIC sensitivity. 3-statement-model — fully-integrated IS/BS/CF with balance-check plugs. Ships references/ for formatting, formulas, SEC filings. merger-model — accretion/dilution analysis for M&A. All seven are optional (not active by default). Users install via 'hermes skills install official/finance/<skill>'. Hermesification: - Stripped every Office JS / Office Add-in / mcp__office__* branch — skills assume headless openpyxl only. - Replaced Cowork MCP data-source instructions with 'MCP first (via native-mcp), fall back to web_search/web_extract against SEC EDGAR and user-provided data'. - Swapped Claude tool references (Bash, Read, Write, Edit, mcp__*) for Hermes-native equivalents and Python library calls. - Canonical Hermes frontmatter (name/description/version/author/ license/metadata.hermes.{tags,related_skills}). - Descriptions tightened to 187-238 chars, trigger-first. - Attribution preserved: author field credits 'Anthropic (adapted by Nous Research)', license: Apache-2.0, each SKILL.md links back to the upstream source directory. Verification: - All 7 discovered by OptionalSkillSource with source_id='official' - Bundle fetch includes support files (scripts, references, troubleshooting) - related_skills cross-refs all resolve within the bundle - No Claude product / Cowork / Office JS / /mnt/skills leakage remains in body text (bounded mentions only in attribution blocks) Source: https://github.com/anthropics/financial-services (Apache-2.0)	2026-05-07 04:58:39 -07:00
briandevans	11b9b146f1	fix(image-routing): expose attached image paths in native multimodal text part In native image mode (vision-capable models like gpt-4o, claude-sonnet-4), build_native_content_parts() previously emitted only the user's caption plus image_url parts. The local file path of each attached image never appeared in the conversation text, so the model could see the pixels but had no string handle for tools that take image_url: str (custom MCP tools, vision_analyze on a re-look, attach-to-tracker workflows). The text-mode path already injects an equivalent hint via Runner._enrich_message_with_vision ("...vision_analyze using image_url: <path>..."). This brings native mode to parity by appending one "[Image attached at: <path>]" line per successfully attached image to the user-text part of the multimodal turn. Skipped (unreadable) paths are NOT advertised, so the model is never told a non-existent file is attached. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 04:58:00 -07:00
Sanjay Santhanam	1f27ca638f	test(update): teach restart-mocks about the post-update survivor sweep Issue #17648 added a post-update SIGTERM-survivor sweep to `cmd_update`: ~3s after issuing graceful/SIGTERM restarts, the code re-queries `find_gateway_pids` and SIGKILLs anything still alive. That's the right fix for stuck-drain gateways in production, but it broke three unit tests that assumed `find_gateway_pids` would keep returning the same PIDs forever: FAILED ::TestCmdUpdateLaunchdRestart::test_update_restarts_profile_manual_gateways AssertionError: Expected 'kill' to not have been called. Called 1 times. Calls: [call(12345, <Signals.SIGKILL: 9>)]. FAILED ::TestCmdUpdateLaunchdRestart::test_update_profile_manual_gateway_falls_back_to_sigterm AssertionError: Expected 'kill' to have been called once. Called 2 times. Calls: [call(12345, SIGTERM), call(12345, SIGKILL)]. FAILED ::TestServicePidExclusion::test_update_kills_manual_pid_but_not_service_pid assert 2 == 1 manual_kills = [call(42999, SIGTERM), call(42999, SIGKILL)] In each test `os.kill` is mocked, so the simulated PID never actually exits \u2014 the sweep finds it again and escalates. The production code is correct; the tests just need to model OS behaviour properly. Two-test fix (profile-manual restart cases): use `side_effect=[[12345], []]` so the first `find_gateway_pids` call returns the live PID and the second (the sweep) returns nothing, as if the OS had reaped the process. Service-PID-exclusion fix: track which PIDs got killed in a closure set, and exclude them on subsequent `fake_find` calls. `os.kill` gets a `side_effect` that records the kill instead of swallowing it silently. Now the sweep doesn't re-find the manual PID, no SIGKILL escalation, `manual_kills == 1`. Validation: $ pytest tests/hermes_cli/test_update_gateway_restart.py -q 43 passed in 4.13s No production code change. Fixes the three failures observed on `main` (run 25250051126): test_update_restarts_profile_manual_gateways test_update_profile_manual_gateway_falls_back_to_sigterm test_update_kills_manual_pid_but_not_service_pid Refs: #17648 (post-update survivor sweep that the tests didn't model).	2026-05-07 04:56:25 -07:00
Teknium	aa5690342b	chore(release): add Gutslabs to AUTHOR_MAP for PR #21148 salvage	2026-05-07 04:56:13 -07:00
Gutslabs	7d36e8346b	fix(security): close TOCTOU window when saving MCP OAuth credentials _write_json (the persistence helper used by HermesTokenStorage for both tokens and client_info) created the temp file via Path.write_text and only chmod'd it to 0o600 afterward. Between create and chmod the file existed on disk at the process umask (commonly 0o644 = world-readable), briefly exposing MCP OAuth access/refresh tokens to other local users. Use os.open with O_WRONLY\|O_CREAT\|O_EXCL and an explicit S_IRUSR\|S_IWUSR mode so the file is created atomically at 0o600, plus tighten the parent dir to 0o700 so siblings can't traverse to the creds file. The temp name also gains a per-process random suffix to avoid collisions between concurrent writers and stale leftovers from a crashed prior write. Mirrors the fix shipped for agent/google_oauth.py in #19673. Adds a regression test asserting the resulting file mode is 0o600 and the parent directory is 0o700 (skipped on Windows where POSIX mode bits aren't enforced).	2026-05-07 04:56:13 -07:00
Harish Kukreja	a5c9c83b78	fix(web): force light color-scheme on docs iframe The Documentation tab embeds the public Hermes Agent docs site via an <iframe>. On any system where the browser's prefers-color-scheme resolves to dark — the default on macOS with system dark mode, and common on Linux/Windows too — the docs body text rendered nearly invisible against its own background. Cause: Docusaurus intentionally leaves <html> and <body> transparent and relies on the browser's Canvas color to fill the viewport. Inside our iframe, the iframe element had bg-background (the dashboard's dark canvas) AND inherited the dashboard's dark color-scheme, so the browser set the iframe's Canvas to a dark value. Docusaurus's transparent body exposed that dark Canvas, and the docs body text (tuned for a light Canvas) became near-illegible. Affects every built-in dashboard theme. Fix: replace bg-background on the iframe with [color-scheme:light] (spec-blessed cross-origin override of the inherited color-scheme; forces the iframe's Canvas to light) and bg-white (belt-and-suspenders fallback during the brief paint window before content loads). The docs site's own theme toggle keeps working — Docusaurus stores its choice in localStorage and applies opaque dark backgrounds to its layout elements that cover the white Canvas we forced.	2026-05-07 04:55:47 -07:00
Sanjay Santhanam	595bcc89fc	test(update): patch isatty on real streams to fix xdist-flaky --yes tests Two CI tests for the new `--yes` update flag (#18261) flaked under `pytest-xdist` on Linux/Python 3.11 even though they passed every local run on macOS/Python 3.14.4: FAILED tests/hermes_cli/test_update_yes_flag.py ::TestUpdateYesConfigMigration::test_no_yes_flag_still_prompts_in_tty `AssertionError: assert <MagicMock 'input'>.called is False` FAILED tests/hermes_cli/test_update_yes_flag.py ::TestUpdateYesStashRestore::test_yes_restores_stash_without_prompting `AssertionError: assert <MagicMock '_restore_stashed_changes'>.called is False` Captured stdout for the first failure shows `cmd_update` taking the "Non-interactive session \u2014 skipping config migration prompt." branch \u2014 i.e. the `sys.stdin.isatty() and sys.stdout.isatty()` check at `hermes_cli/main.py:7118` evaluated to `False` despite the test doing: with patch("hermes_cli.main.sys") as mock_sys: mock_sys.stdin.isatty.return_value = True mock_sys.stdout.isatty.return_value = True The whole-module mock is fragile under xdist worker reuse: a sibling test that imports `hermes_cli.main` first can leave another `sys` reference resolved inside the function (re-import in a helper, etc.), and the wholesale module replacement never gets consulted. Switch to `patch.object(_sys.stdin, "isatty", return_value=True)` (and the same for `stdout`). That patches the attribute on the real stream object \u2014 every call site, no matter how it reached `sys.stdin`, hits the patched method. Same fix applied to the stash-restore test (it took the "non-TTY \u2192 skip restore prompt" branch for the same reason). Validation: $ pytest tests/hermes_cli/test_update_yes_flag.py -q 3 passed in 5.47s No production code change. Fixes the two failures observed on `main` (run 25250051126): `tests/hermes_cli/test_update_yes_flag.py::TestUpdateYesConfigMigration::test_no_yes_flag_still_prompts_in_tty` `tests/hermes_cli/test_update_yes_flag.py::TestUpdateYesStashRestore::test_yes_restores_stash_without_prompting` Refs: #18261 (added the `--yes` flag + these tests).	2026-05-07 04:54:57 -07:00
Sanjay Santhanam	033e533d05	test(docker): align Dockerfile contract tests with simplified TUI flow The Dockerfile dropped the manual `@hermes/ink` materialisation gymnastics in favour of letting npm workspaces resolve the bundled package naturally. Two contract tests still asserted the older flow: `test_dockerfile_installs_tui_dependencies` required: 'ui-tui/packages/hermes-ink/package-lock.json' in dockerfile_text …but the lockfile is no longer COPIED individually \u2014 the entire `ui-tui/packages/hermes-ink/` tree is COPIED instead (the workspace reference from `ui-tui/package.json` is `file:` so npm needs the real source, not just a manifest stub). `test_dockerfile_materializes_local_tui_ink_package` required a 7-clause conjunction matching specific `rm -rf` / `npm install --omit=dev` `--prefix node_modules/@hermes/ink` / `rm -rf .../react` invocations that were stripped out when the workspace resolution was simplified. Update the assertions to pin the contract the image actually has to carry rather than the exact shell incantations the old flow used: * TUI deps install: ui-tui/package.json + ui-tui/package-lock.json + ui-tui/packages/hermes-ink/ tree are all COPIED, and an npm install/ci step runs in ui-tui. * Bundled hermes-ink: the workspace package source is COPIED (so `await import('@hermes/ink')` resolves at runtime). This keeps the spirit of #15012 / #16690 (zombie reaping + bundled workspace materialisation must continue to work) without locking the Dockerfile into one specific implementation flavour. Validation: $ pytest tests/tools/test_dockerfile_pid1_reaping.py -q 6 passed in 1.43s No production code change. Fixes the two failures observed on `main` (run 25250051126): `tests/tools/test_dockerfile_pid1_reaping.py::test_dockerfile_installs_tui_dependencies` `tests/tools/test_dockerfile_pid1_reaping.py::test_dockerfile_materializes_local_tui_ink_package`	2026-05-07 04:53:10 -07:00
Teknium	e7eb07cec7	chore: AUTHOR_MAP entry for mrcoferland	2026-05-07 04:51:46 -07:00
mrcoferland	bd0c54d171	fix: route Telegram image documents through photo handling	2026-05-07 04:51:46 -07:00
Teknium	51f9953e69	feat(profiles): --no-skills flag for empty profile creation (#20986 ) Adds `hermes profile create <name> --no-skills` to create a profile with zero bundled skills. Writes a `.no-bundled-skills` marker file in the profile root so `hermes update`'s all-profile skill sync loop also skips the profile — without the marker, every update would re-seed skills and the user would have to delete them again. Use case (from @hiut1u): orchestrator profiles and narrow-task profiles don't need 100+ bundled skills polluting their system prompt. - create_profile() gains a `no_skills` param, mutually exclusive with `--clone` / `--clone-all` (cloning explicitly copies skills). - seed_profile_skills() no-ops on opted-out profiles and returns `{skipped_opt_out: True}` so callers can report cleanly. - Web API (POST /api/profiles) accepts `no_skills: bool`. - Delete `.no-bundled-skills` to opt back in — next `hermes update` re-seeds normally. 6 new tests in TestNoSkillsOptOut cover marker write, mutual exclusion with clone, seed_profile_skills opt-out, fresh profile unaffected, and delete-marker-re-enables-seeding.	2026-05-07 04:34:38 -07:00
Teknium	49c3c2e0d3	docs(kanban): fix worker skill setup instructions too (#20960 ) Some checks are pending Deploy Site / deploy-vercel (push) Waiting to run Details Deploy Site / deploy-docs (push) Waiting to run Details Docker Build and Publish / build-and-push (push) Waiting to run Details Docker Build and Publish / move-latest (push) Blocked by required conditions Details Lint (ruff + ty) / ruff + ty diff (push) Waiting to run Details Nix / nix (macos-latest) (push) Waiting to run Details Nix / nix (ubuntu-latest) (push) Waiting to run Details OSV-Scanner / Scan lockfiles (push) Waiting to run Details Tests / test (push) Waiting to run Details Tests / e2e (push) Waiting to run Details Follow-up to #20958. The worker skill section had the same stale 'hermes skills install devops/kanban-worker' command — kanban-worker is also bundled, so that command fails with 'Could not fetch from any source.' Replace with bundled-skill verification + restore pattern, matching the orchestrator section. Uses <your-worker-profile> placeholder since assignees vary (researcher, writer, ops, linguist, reviewer, etc.) rather than a single fixed 'worker' profile.	2026-05-06 18:40:30 -07:00
Gille	45cbf93899	docs(kanban): fix orchestrator skill setup instructions (#20958 )	2026-05-06 18:14:30 -07:00
Teknium	5a3cadf6eb	fix(discord): narrow rate-limit catch and move sync state under gateway/ Two follow-ups on top of helix4u's slash-command sync hardening: - Only suppress exceptions that are actually Discord 429 rate limits (discord.RateLimited, HTTPException with status 429, or a clearly rate-limit-named duck type). Arbitrary failures that happen to expose a retry_after attribute now re-raise to the outer handler instead of silently swallowing a cooldown. - Move the sync-state JSON under $HERMES_HOME/gateway/ so the home root stops collecting ad-hoc runtime files. Added a test verifying unrelated exceptions don't get misclassified as rate limits.	2026-05-06 18:12:35 -07:00
helix4u	d797755a1c	fix(gateway): wait for systemd restart readiness	2026-05-06 18:12:35 -07:00
Teknium	3cdbf334d5	fix(gateway): don't dead-end setup wizard when only system-scope unit is installed The setup wizard dropped non-root users at a bare shell prompt when trying to start a system-scope gateway service. Previously _require_root_for_system_service called sys.exit(1), which the wizard's `except Exception` guards cannot catch (SystemExit is a BaseException). Users with a pre-existing /etc/systemd/system unit (e.g. from an earlier `sudo hermes setup` run) hit this whenever they re-ran `hermes setup` as a regular user. - Convert _require_root_for_system_service to raise a typed SystemScopeRequiresRootError (RuntimeError subclass) instead of sys.exit(1). The direct CLI path (`hermes gateway install\|start\|stop\| restart\|uninstall` without sudo) still exits 1 cleanly via a new catch at the top of gateway_command, matching the existing UserSystemdUnavailableError pattern. - Add _system_scope_wizard_would_need_root() pre-check and _print_system_scope_remediation() helper. Both setup wizards (hermes_cli/setup.py and hermes_cli/gateway.py::gateway_setup) now detect the dead-end before prompting and print actionable guidance: either `sudo systemctl start <service>` this time, or uninstall the system unit and install a per-user one. - Defense-in-depth: all 5 wizard prompt sites also catch SystemScopeRequiresRootError and fall back to the remediation helper if the pre-check is bypassed (race, etc.). Tests: 12 new tests in TestSystemScopeRequiresRootError, TestSystemScopeWizardPreCheck, TestSystemScopeRemediationOutput, and TestGatewayCommandCatchesSystemScopeError covering the exception contract, pre-check matrix (root vs non-root, system-only vs user-present vs none vs explicit system=True), remediation output for each action, and the direct-CLI exit-1 path.	2026-05-06 15:58:02 -07:00
brooklyn!	04cf4788cc	fix(tui): restore voice push-to-talk parity (#20897 ) * fix(tui): restore classic CLI voice push-to-talk parity (cherry picked from commit `93b9ae301b`) * fix(tui): harden voice push-to-talk stop flow Address review feedback from PR #16189 by stopping the active recorder before background transcription, documenting single-shot voice capture, and covering the TUI gateway flags with regression tests. * fix(tui): preserve silent voice strike tracking Keep single-shot voice recording's no-speech counter alive across starts so the TUI can still emit the three-strikes auto-disable event, and bind the auto-restart state at module scope for type checking. * fix(tui): clean up voice stop failure path Address follow-up review by naming the TUI flow as single-shot push-to-talk and cancelling the recorder when forced stop cannot produce a WAV. * fix(tui): report busy voice capture starts Return explicit start state from the voice wrapper so the TUI gateway does not report recording while forced-stop transcription is still cleaning up. * fix(tui): handle busy voice record responses Apply the gateway busy status immediately in the TUI and route forced-stop voice events to the session that sent the stop request. * fix(tui): clear voice recording on null response Treat a null voice.record RPC result as a failed optimistic start so the REC badge cannot stick after gateway-side errors. * fix(tui): count silent manual voice stops Preserve single-shot voice no-speech strikes through forced stop transcription so empty push-to-talk captures still trigger the three-strikes guard. --------- Co-authored-by: Montbra <montbra@gmail.com>	2026-05-06 15:49:59 -07:00
brooklyn!	5ccab51fa8	fix(tui): steady transcript scrollbar (#20917 ) * fix(tui): steady transcript scrollbar Keep the visible scrollbar tied to committed viewport position while virtual history can still prefetch against pending scroll targets, and preserve drag grab offset synchronously for native-feeling scrollbar drags. * fix(tui): smooth precision wheel scroll Replace the opt-scroll throttle with frame-sized coalescing so modifier wheel gestures stay line-precise without stepping.	2026-05-06 14:50:31 -07:00
ethernet	53a024994a	Merge pull request #20890 from NousResearch/fix/docker-push ci(docker): don't cancel overlapping builds, guard :latest	2026-05-06 17:38:21 -04:00
brooklyn!	f1a8e99942	fix(tui): honor skin highlight colors (#20895 )	2026-05-06 14:01:56 -07:00
brooklyn!	da6019820a	fix(tui): refresh virtual offsets after row resize (#20898 )	2026-05-06 13:54:46 -07:00
brooklyn!	5044e1cbf1	fix(cli): submit LF enter in thin PTYs (#20896 )	2026-05-06 13:51:13 -07:00
Teknium	d8b85bfd1c	chore: add guillaumemeyer to AUTHOR_MAP For cherry-picked commits in PR #20801.	2026-05-06 13:39:43 -07:00
Guillaume Meyer	7df6115199	feat(gateway): also gate pre-restart "Gateway restarting" notification Extend the gateway_restart_notification flag to cover _notify_active_sessions_of_shutdown — the message that fires just before drain ("⚠️ Gateway restarting — Your current task will be interrupted. Send any message after restart and I'll try to resume where you left off.") sent to active sessions and home channels. Same operator/end-user reasoning: on a Slack workspace shared with end users, "Gateway restarting" reads as "the bot is broken" — the operator should be able to suppress it consistently with the other two lifecycle pings rather than having a partial opt-out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 13:39:43 -07:00
Guillaume Meyer	b71f80e6ce	feat(gateway): per-platform gateway_restart_notification flag Adds an opt-out toggle on PlatformConfig that gates both restart lifecycle pings: the "♻ Gateway restarted" message sent to the chat that issued /restart, and the "♻️ Gateway online" home-channel startup notification. Defaults to True so existing deployments are unaffected. The motivating split is operator vs. end-user surfaces: a back-channel like Telegram should keep these pings, while a Slack workspace shared with end users should not surface gateway lifecycle noise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 13:39:43 -07:00
Teknium	33bf5f6292	fix(auth): fall back to global-root auth.json for providers missing in profile Profile processes (kanban workers, cron subprocesses, delegated subagents) read the profile's auth.json only. If a provider was authenticated at the global root but not inside the profile, the profile's credential_pool comes back empty and the process fails with 'No LLM provider configured' — even though the credentials are sitting in ~/.hermes/auth.json. #18594 propagated HERMES_HOME correctly, which is what surfaced this: workers now land in the right profile, and the profile turns out to shadow global with no fallback. Semantics (read-only, per-provider shadowing): * Profile has any entries for provider X → use profile only (global ignored). * Profile has zero entries for provider X → fall back to global. * Writes (write_credential_pool, _save_auth_store) still target the profile. * Classic mode (HERMES_HOME == global root) skips the fallback entirely — _global_auth_file_path() returns None. Also mirrors the fallback in get_provider_auth_state so OAuth singletons (nous, minimax-oauth, openai-codex, spotify) inherit cleanly — the Nous shared-token store (PR #19712) remains the authoritative path for Nous OAuth rotation, this just makes the read side consistent with it. Seat belt: _load_global_auth_store() refuses to read the real user's ~/.hermes/auth.json under PYTEST_CURRENT_TEST even when HERMES_HOME points to a profile-shaped path. Guard uses $HOME (stable across fixtures) rather than Path.home() (which fixtures often monkeypatch to a tmp root). Reported by @SeedsForbidden on Twitter as the credential_pool shadowing follow-up to the #18594 fix.	2026-05-06 13:29:54 -07:00
Teknium	d514dd4055	docs(tool-gateway): rewrite as pitch-first marketing page (#20827 ) Previous version read like internal API docs \u2014 leading with env var tables, config YAML, and 'precedence' rules before ever explaining the product. Complete rewrite inverts the structure so readers see value first, mechanics last. Structure now: - Lede: 'One subscription. Every tool built in.' + pitch paragraph - CTA: subscribe/manage button styled as a real call-to-action - What's included: emoji-led table with expanded descriptions per tool. Image gen lists all 9 models by name (FLUX 2 Klein/Pro, Z-Image Turbo, Nano Banana Pro, GPT Image 1.5/2, Ideogram V3, Recraft V4 Pro, Qwen) - Why it's here: value bullets \u2014 one bill, one signup, one key, same quality, bring-your-own anytime - Get started: two-command flow (hermes model \u2192 hermes status) - Eligibility: paid-tier note with upgrade link - Mix and match: three realistic usage patterns - Using individual image models: ID reference table for power users - --- separator --- - Configuration reference (demoted): use_gateway flag, disabling, self-hosted gateway env vars moved below the fold where they belong - FAQ: streamlined, removed redundant content Fact-checked against code: - 9 FAL models confirmed from tools/image_generation_tool.py FAL_MODELS - Status section output verified against hermes_cli/status.py - Portal subscription URL preserved - Self-hosted env vars (TOOL_GATEWAY_DOMAIN etc.) kept accurate Verified: docusaurus build SUCCESS, page renders, no new broken links.	2026-05-06 13:20:09 -07:00
ethernet	f4031df05d	ci(docker): don't cancel overlapping builds, guard :latest Switch top-level concurrency to cancel-in-progress=false so every push to main gets its own SHA-tagged image published — no more discarded builds when commits land back-to-back. Guard the :latest tag with a second job that has its own concurrency group with cancel-in-progress=true plus a git-ancestor check against the revision label on the current :latest. Together these guarantee :latest only ever moves forward in history: a slower run whose commit isn't a descendant of the current :latest refuses to clobber it, and a newer push mid-way through the move-latest job preempts the older one before it can retag. - Every main push publishes nousresearch/hermes-agent:sha-<commit> with an org.opencontainers.image.revision label embedded. - move-latest job reads that label off :latest, runs merge-base --is-ancestor, and only retags (via buildx imagetools create, registry-side, no rebuild) if our commit strictly descends. - fetch-depth bumped to 1000 so merge-base has the history it needs. - Release tag flow unchanged (unique tag, no race).	2026-05-06 15:53:47 -04:00
asheriif	946ef0ea19	fix(tui): bound virtual history offset searches	2026-05-06 11:57:01 -07:00
ethernet	a345f7b6e5	Merge pull request #19908 from NousResearch/typecheck change: enable ruff/ty	2026-05-06 14:43:14 -04:00
kshitijk4poor	a2ff193050	chore: follow-up cleanup for Kanban migration fix - Expand migration comment to name the primary failure mode (missing column OperationalError from #20842) ahead of the secondary SQLite schema-reparse concern; also document the stale-cols-snapshot invariant - Add clarifying comments on from_row() legacy fallback branches noting they are belt-and-suspenders dead code post-migration - Add task_events comment in existing test explaining why the table is required by the migrator - Add test_legacy_migration_no_legacy_columns_at_all: Scenario A — explicitly asserts the exact #20842 crash no longer occurs and that consecutive_failures defaults to 0 on a DB that never had spawn_failures - Add test_legacy_migration_both_columns_already_present: Scenario D — asserts the migration is a no-op when both columns already exist, preserving the existing counter value	2026-05-06 11:25:16 -07:00
helix4u	b1d420e75f	fix(kanban): avoid fragile failure-column renames	2026-05-06 11:25:16 -07:00
kshitijk4poor	28299afc21	chore: follow-up cleanup for Feishu topic thread fix - Remove dead metadata.get('reply_to') fallback in _send_raw_message; nothing in the codebase ever sets 'reply_to' inside a metadata dict — the key only appears as a top-level send_voice() keyword argument - Simplify _status_thread_metadata construction in run.py to use a single dict literal instead of create-then-mutate pattern; the or-{} guard was dead since source.thread_id implies _progress_thread_id is also set for Feishu - Add yuqian@zmetasoft.com to AUTHOR_MAP for contributor attribution	2026-05-06 10:52:51 -07:00
Yuqian	441ef75d15	fix(feishu): keep topic replies in threads Route Feishu topic progress, status, approval, stream, and fallback messages through threaded replies by preserving the originating message id as the reply target. Add regressions for tool progress topic metadata and Feishu metadata-driven reply routing.	2026-05-06 10:52:51 -07:00

... 2 3 4 5 6 ...

7625 commits