hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-25 17:18:11 +00:00

Author	SHA1	Message	Date
helix4u	591e6fb8f4	fix(computer_use): honor custom vision routing	2026-06-07 02:09:20 -07:00
Teknium	fe0b3f2338	fix(windows): retry watcher Popen without breakaway when parent job denies it, plus regression tests for the breakaway bit (#40956 ) #40909 added `CREATE_BREAKAWAY_FROM_JOB` to `windows_detach_flags()`, which fixed the headline bug (gateway dies after Desktop GUI update and never comes back). The flag's own docstring acknowledges that restrictive parent job objects can still refuse breakaway with `ERROR_ACCESS_DENIED`, surfacing as `OSError` on the `subprocess.Popen` call: "Callers in this codebase already wrap detached spawns in try/except OSError and fall back to a cmd.exe wrapper, so the breakaway-denied case degrades gracefully rather than crashing." That's true for `_spawn_detached` in `gateway_windows.py` (the `hermes gateway start` path), which has both the breakaway bit AND a retry-without-breakaway fallback. It's NOT true for the post-update watcher path in `launch_detached_profile_gateway_restart` (`hermes_cli/gateway.py`), which only has `except OSError: return False` and gives up entirely. If a user's shell/terminal/container wraps Hermes in a breakaway-denying job, the gateway-respawn watcher silently fails to launch instead of trying again without breakaway. This PR closes that gap and adds the regression tests that were missing from the original fix. ## Changes ### `hermes_cli/_subprocess_compat.py` Adds a sibling helper `windows_detach_flags_without_breakaway()` so callers can express the fallback symbolically (via the helper) rather than coding the magic `& ~0x01000000` mask at every site. Documented on `windows_detach_flags` and `windows_detach_flags_without_breakaway` with the recommended try/except pattern. ### `hermes_cli/gateway.py::launch_detached_profile_gateway_restart` Two changes, both aligned with the canonical pattern in `gateway_windows._spawn_detached`: 1. The outer watcher Popen now wraps in `try/except OSError`, and on failure retries with `windows_detach_flags_without_breakaway()` (POSIX never reaches this branch — `start_new_session=True` can't raise OSError). 2. The inlined respawn payload (the `python -c` watcher) also wraps its CreateProcess in try/except OSError and retries with `_flags & ~_CREATE_BREAKAWAY_FROM_JOB` on failure. This matters because the watcher's job-object inheritance is independent of the outer process's — even if the outer Popen succeeds with breakaway, the respawned gateway might inherit a job that doesn't. ### Regression tests in `tests/tools/test_windows_native_support.py` #40909 shipped the fix without any test that the breakaway bit is present (the existing `test_windows_detach_flags_has_expected_win32_bits` asserted only the three legacy bits). Four new tests close that: - `test_windows_detach_flags_includes_breakaway_from_job` — explicit assertion that the breakaway bit is in the default bundle, with the rationale spelled out in the docstring so a future maintainer staring at this test understands why removing it would resurrect the gateway-dies-after-GUI-update bug. - `test_windows_detach_flags_without_breakaway_drops_only_that_bit` — fallback payload keeps the other three detach bits intact. - `test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway` — static-text check on the stringified watcher payload. The inlined Python program isn't reachable via normal import-time inspection because it lives in a `textwrap.dedent("""...""")` literal that gets passed to a separate `python -c` interpreter. Asserting that both `_CREATE_BREAKAWAY_FROM_JOB` (symbolic) and `0x01000000` (hex literal) appear inside the dedent block is a sufficient regression guard against accidental refactors. - `test_launch_detached_profile_gateway_restart_outer_popen_has_access_denied_fallback` — static check that this PR's fallback retry is wired up symbolically. Without standing up a real Windows job object that refuses breakaway, we can't trigger the OSError in a unit test; the text guard catches the case where a future refactor removes the helper import or the `& ~_CREATE_BREAKAWAY_FROM_JOB` retry. Also extends `test_windows_detach_flags_has_expected_win32_bits` to include the breakaway bit assertion and updates `test_windows_flags_zero_on_posix` to cover the new helper. ## Tests Locally on Windows: 8/8 in the `-k "detach or breakaway or popen_kwargs or launch_detached or gateway_run_update or hermes_cli_gateway"` slice pass. Broader `tests/hermes_cli/test_gateway*.py + test_windows_native_support.py`: 172 passed, 10 failed. All 10 failures are pre-existing POSIX-only tests running on a Windows host (os.geteuid, SIGKILL fallback, is_linux fixture mismatches). Stashing this PR and re-running on bare post-#40909 main reproduces all 10 identically — none are regressions. POSIX paths unchanged: `windows_detach_flags()` and `windows_detach_flags_without_breakaway()` both return 0 off Windows, `windows_detach_popen_kwargs()` still yields `{"start_new_session": True}`. ## Out of scope - The other detached-spawn site in `hermes_cli/gateway.py` (around line 3068) also uses `windows_detach_popen_kwargs()` + `except OSError`. It deserves the same fallback treatment but the codepath is different enough (not the update-flow watcher) that it warrants a separate PR with its own scrutiny. - `gateway/run.py` has Windows branches with `windows_detach_popen_kwargs` too — same reasoning. ## Context Follow-up to #40909 (merged). I had a parallel PR (#40934, closed) that duplicated the core breakaway fix; the bits unique to that PR that #40909 didn't cover are the contents of this one. Closing #40934 and opening this slimmed-down version as the focused follow-up.	2026-06-07 01:21:58 -07:00
Teknium	887295ba54	fix(config): preserve custom-provider models maps and metadata through v11->v12 migration (#40573 ) Salvaged from #40410; cleaned up, re-verified against main, tests added. Co-authored-by: rodboev <rodboev@users.noreply.github.com>	2026-06-06 18:43:20 -07:00
Teknium	365437e4aa	fix(cua-driver): reconnect MCP stdio session once on ClosedResourceError after daemon restart (#40570 ) Salvaged from #40282; cleaned up, re-verified against main, tests added. Co-authored-by: jeeves-assistant <jeeves-assistant@users.noreply.github.com>	2026-06-06 18:35:12 -07:00
Teknium	5a36f76a00	fix(skill_manager): allow SKILL.md in _validate_file_path without weakening traversal guard (#40568 ) Salvaged from #40453; cleaned up, re-verified against main, tests added. Co-authored-by: l37525778-coder <l37525778-coder@users.noreply.github.com>	2026-06-06 18:32:37 -07:00
Teknium	c0424b06af	fix(osv_check): honor npx --package/-p install target when parsing package arg (#40567 ) Salvaged from #40461; cleaned up, re-verified against main, tests added. Co-authored-by: HeLLGURD <HeLLGURD@users.noreply.github.com>	2026-06-06 18:30:39 -07:00
Teknium	56f833efa4	fix(skills): block path traversal via skill_view name argument (#40566 ) Closes #38643. Salvaged from #40521; cleaned up, re-verified against main, tests added. Co-authored-by: xy200303 <xy200303@users.noreply.github.com>	2026-06-06 18:29:52 -07:00
kshitijk4poor	c79e3fd0ba	refactor(image_gen): delegate cache-path mapping to shared helper Follow-up on the backend-visible artifact-path fix. - Extract the cache-mount iteration loop into a reusable, backend-agnostic credential_files.map_cache_path_to_container(host_path, container_base) that returns the POSIX container path or None. to_agent_visible_cache_path() now delegates to it (keeping its Docker-only gate), and image_generation_tool's _agent_visible_cache_path() delegates to it too — eliminating the duplicated loop and the divergent path-join (posixpath vs Path) between the two. - Drop the now-unused posixpath/Path imports from image_generation_tool.py. - Document the agent_visible_cache_base getattr probe as a forward-looking optional hook (no producer yet) so it doesn't read as a typo'd attribute. - Add unit tests for map_cache_path_to_container.	2026-06-06 13:19:07 -07:00
Gille	7c4aa3e4da	fix(image_gen): expose backend-visible artifact paths	2026-06-06 13:19:07 -07:00
kshitijk4poor	c37c6eaf29	refactor(gateway): migrate Home Assistant adapter to bundled plugin Move gateway/platforms/homeassistant.py into plugins/platforms/homeassistant/ following the same shape as the Mattermost and Discord migrations. - Adapter file is renamed via git mv (history is preserved). - register() exposes the platform via the plugin system instead of the hardcoded Platform.HOMEASSISTANT elif in gateway/run.py::build_adapter(). - _standalone_send() replaces the legacy _send_homeassistant() helper in tools/send_message_tool.py. Out-of-process cron delivery (deliver=homeassistant from a cron process not co-located with the gateway) now flows through the registry's standalone_sender_fn path instead of the hardcoded elif. - _is_connected() probes HASS_TOKEN via hermes_cli.gateway.get_env_value so existing connected-platform checks behave identically. The HASS_TOKEN / HASS_URL env-to-PlatformConfig seeding in gateway/config.py stays in core — same pattern bluebubbles, mattermost, and discord migrations followed. No setup_fn or apply_yaml_config_fn is registered because Home Assistant has no _setup_homeassistant wizard in hermes_cli/setup.py and no homeassistant: YAML block in config.yaml today; setup runs through the existing hermes_cli/tools_config.py toolset wizard. Test imports were rewritten across tests/gateway/test_homeassistant.py, tests/integration/test_ha_integration.py, and tests/tools/test_send_message_missing_platforms.py; the legacy (token, extra, chat_id, message)-shaped _send_homeassistant call site is preserved via a small SimpleNamespace shim in test_send_message_missing_platforms.py (same approach used when mattermost moved). - Focused HA suites (64 tests across the three rewritten files) pass. - Broader gateway/cron sweep produces 10 failures identical to main baseline (telegram approval/model-picker xdist isolation flakes, wecom_callback defusedxml issue, cron script_timeout fixture issue). Zero net new failures.	2026-06-06 11:46:24 -07:00
Teknium	f8a241e105	fix(delegate): flatten content blocks in live overlay tail + AUTHOR_MAP Follow-up on the cherry-picked content-block fix. _extract_output_tail (the live subagent overlay) still used crude str(content), which renders a "[{'type': 'text'...}]" blob and — worse — mislabels a block-wrapped "Error: ..." result as is_error=False. Route it through the same _stringify_tool_content helper so error detection and previews work at both consumer sites. - delegate_tool.py: _extract_output_tail uses _stringify_tool_content - tests: add _extract_output_tail content-block test (error detection + clean preview) - release.py: AUTHOR_MAP entry for randomsnowflake (CI gate)	2026-06-05 23:34:00 -07:00
Alexander Lehmann	f83918c31d	fix(delegate): handle content-block tool results	2026-06-05 23:34:00 -07:00
helix4u	338c074336	fix(send-message): treat ntfy topic targets as explicit	2026-06-05 20:38:28 -07:00
Teknium	ea266f43e9	fix(file-ops): make rg/grep search error guard reachable and preserve partial matches (#39858 ) The error guard in _search_with_rg/_search_with_grep was unreachable and, if it had fired, would have discarded valid results. Two root causes: 1. Unreachable. Both methods pipe the search through `\| head` with no pipefail, so the pipeline reported head's exit code (0), masking rg/grep's error code (2). The guard never fired. Worse, because _exec merges stderr into stdout (stderr=subprocess.STDOUT), the error text was then parsed as bogus match lines instead of being surfaced — the user got garbage matches with no indication the search failed. 2. Latent results-dropping. The original `not result.stdout.strip()` check was always False on error (error text lives in stdout), and the `hasattr(result, 'stderr')` branch was dead code (ExecuteResult has no stderr field). A naive broadening to `exit_code == 2` would have nuked real matches whenever rg/grep also hit a non-fatal error (e.g. one unreadable file in a tree that otherwise matched), which both tools signal with exit 2. Fix: - Prefix the piped command with `set -o pipefail` so rg/grep's real exit status propagates. rg exits 0 on a truncating head; grep exits 141 (SIGPIPE), so the strict `== 2` guard ignores truncated-success. - Add _split_tool_diagnostics() to separate tool diagnostics from match output by tool prefix and output shape. Diagnostics never become matches; on a hard error they are the message to surface. - Only surface an error when exit==2 AND no usable match payload remains, so partial errors keep their real matches. Tests: tests/tools/test_search_error_guard.py drives both methods through the real local backend (hard error surfaced, partial error keeps matches, truncation no false error, files_only/count exclude diagnostics) plus unit coverage for the splitter. Supersedes #39710.	2026-06-05 17:44:52 -07:00
Teknium	d41427504e	feat(delegation): uncap max_spawn_depth (floor 1, no ceiling) (#39772 ) * fix: respect disabled auto-compaction on context overflow Port from anomalyco/opencode#30749. When compression.enabled is false, NO automatic compaction trigger may fire. The proactive token-threshold paths (preflight + post-response should_compress gate) already honoured the setting, but the three provider-overflow recovery paths in the agent loop — long-context-tier 429, 413 payload-too-large, and context-overflow — called _compress_context() unconditionally, silently compressing and rotating the session against the user's explicit choice. Add a single guard at the top of the overflow-recovery dispatch: when compression is disabled and the error is one of those three overflow classes, surface a terminal error (compaction_disabled: True) telling the user to /compress manually, /new, switch to a larger-context model, or reduce attachments. Manual /compress (force=True) is unaffected — it never enters this loop. Tests: new TestOverflowWithCompactionDisabled (413 + 400 overflow don't compress when disabled; control case still compresses when enabled). Existing overflow-recovery tests updated to enable compaction explicitly (they verify the recovery fires); fixture defaults flipped to True to match production (compression.enabled defaults to True). * feat(delegation): uncap max_spawn_depth to match max_concurrent_children Removed the hard ceiling of 3 on delegation.max_spawn_depth. Depth now has a floor of 1 and no upper limit, mirroring max_concurrent_children. Cost (each level multiplies API spend) is the practical limiter, not a constant. - delegate_tool.py: drop _MAX_SPAWN_DEPTH_CAP, _get_max_spawn_depth() floors at 1 instead of clamping to [1,3]; depth-limit error string reworded - config.py / cli-config.yaml.example: doc comments say floor 1, no ceiling - docs (configuration, delegation, delegation-patterns): range 1-3 -> >=1 - tests: convert clamp-above-3 change-detector into a no-ceiling invariant, drop the _MAX_SPAWN_DEPTH_CAP==3 snapshot assert, fix warning-text assert	2026-06-05 04:46:02 -07:00
Coy Geek	3278b423d5	fix(dashboard): strip session token from subprocess env Add HERMES_DASHBOARD_SESSION_TOKEN to the Hermes-managed subprocess environment blocklist so dashboard authorization material does not propagate into shell, PTY, or background process launches. Extend the local environment blocklist regression coverage to prove the dashboard session token is stripped like other Hermes-managed secrets.	2026-06-05 02:31:19 -07:00
Baris Sencan	ad69d3edc7	fix(terminal): guard os.getcwd() against a deleted CWD `os.getcwd()` raises FileNotFoundError when the process's working directory was removed out from under it (e.g. a scratch workspace cleaned up mid-session), crashing terminal env setup. Extract a `_safe_getcwd()` helper that falls back to TERMINAL_CWD, then the user's home, on FileNotFoundError, and route all three `os.getcwd()` call sites in terminal_tool.py through it (local default_cwd, the Docker cwd-passthrough source, and the debug-config print) so the same crash can't resurface at a sibling site. Adds unit tests for the real-cwd path and both fallback branches. Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>	2026-06-04 23:39:34 -07:00
kyssta-exe	25742372eb	fix(approval): check is_approved in execute_code guard (#39275 ) check_execute_code_guard() never called is_approved() before entering the approval flow, and never persisted session/permanent approvals from the gateway response. This meant 'Approve session' and 'Always' buttons had no effect — every execute_code call re-prompted the user. - Add is_approved() check after get_current_session_key(), matching check_all_command_guards() - Persist session ('approve_session') and permanent ('approve_permanent') approvals based on the gateway choice, same as terminal command guard - Add 3 regression tests for session persistence, permanent persistence, and short-circuit on pre-existing approval	2026-06-04 19:40:30 -07:00
Brooklyn Nicholson	89baf02919	Merge origin/main into bb/desktop-profile-support Resolve conflicts in desktop settings/cron/messaging/sidebar: adopt main's ListRow + actions-menu refactors for credential rows; keep our profileColor import on the sidebar. Drop the now-orphaned Tip-based helpers.	2026-06-04 20:17:07 -05:00
kewe63	46abf04012	fix(ssh): handle WinError 1314 symlink failure with shutil.copy2 fallback On Windows, os.symlink() raises OSError (WinError 1314) unless the process has Administrator rights or Developer Mode is enabled. The SSH bulk-upload staging logic used symlinks to mirror the remote layout before piping through tar; this caused all ssh_bulk_upload tests to fail on Windows. - ssh.py: wrap os.symlink() in try/except OSError and fall back to shutil.copy2() so staging works on every platform. shutil was already imported, no new dependency introduced. - file_sync.py: replace str(Path(remote).parent) with posixpath.dirname(remote) in unique_parent_dirs(). pathlib.Path uses the host separator (\ on Windows), but these paths are sent to a remote Linux host over SSH and must always use forward slashes. - test_ssh_bulk_upload.py: make test_staging_symlinks_mirror_remote_layout platform-agnostic — assert file existence and content instead of os.path.islink() + os.readlink(), since the staged entry may be a copy on Windows.	2026-06-04 18:06:21 -07:00
teknium1	93b5df3189	fix(test): patch async_is_safe_url in web-provider SSRF mocks web_tools.is_safe_url was replaced by async_is_safe_url, but three web-provider test files still monkeypatched the old sync name, raising AttributeError. Patch the async variant with an async lambda.	2026-06-04 18:04:47 -07:00
kewe63	c60952ba94	fix(web): run URL SSRF checks off the event loop in async paths Add async_is_safe_url() wrapping is_safe_url via asyncio.to_thread, and route all async SSRF call sites through it: web_extract_tool, the vision/video preflight checks, and both download redirect guards. socket.getaddrinfo blocks; calling it inline from async tool paths froze the event loop for the duration of DNS resolution. vision_tools: split _validate_image_url into _image_url_shape_ok (no DNS) + sync _validate_image_url (for sync callers/tests) + async _validate_image_url_async. Widened beyond the original PR #3691 to sibling async sites that also blocked the loop (second redirect guard, video preflight). Salvage of #3691 by @Kewe63 — surgically re-applied onto current main because the original branch was too stale to cherry-pick cleanly (would have reverted the web_crawl_tool refactor). Co-authored-by: Kewe63 <kewe.3217@gmail.com>	2026-06-04 18:04:47 -07:00
dirtyren	74e845c000	fix(slack): pass thread_ts in standalone send_message tool path The standalone `_send_slack()` function used by the send_message tool and cron delivery fallback was not passing `thread_ts` to the Slack API, causing messages to post to the top-level channel instead of inside threads. - Add `thread_ts` parameter to `_send_slack()` - Include `thread_ts` in the chat.postMessage payload when present - Pass `thread_id` from `_send_to_platform()` to `_send_slack()` Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-04 17:42:10 -07:00
Brooklyn Nicholson	9dbd3c57d7	feat(desktop): drag sessions into chat as @session links + spawn loader Drag a sidebar session into the composer to drop an @session:<profile>/<id> chip the agent resolves via session_search. New READ shape dumps a whole session by id (head+tail when large); a `profile` param reads another profile's DB read-only, and a cross-profile locate scan resolves bare ids when the model drops the owning profile from the link. Also: ASCII "waking up <profile>" overlay during lazy gateway swaps, global haptic rate-limit to kill the reconnect-storm "clickity" buzz, and reauth toasts surfaced once per disconnect instead of every backoff tick.	2026-06-04 19:41:51 -05:00
Ben Barclay	8a888441d7	fix(docker): recover from out-of-band container removal in persistent mode (salvage #36631 ) (#39415 ) Salvage of #36631 (@annguyenNous), rebased onto current main with regression tests added. Fixes #36266. When a persistent Docker sandbox container is removed out-of-band (idle reaper, `docker prune`, OOM kill, daemon restart), the gateway kept issuing `docker exec` against the dead container ID, returning "No such container" on every subsequent tool call — the agent was permanently blocked until the gateway process restarted. DockerEnvironment.execute() now detects the "No such container" / "is not running" error after a non-zero exit (gated on persist_across_processes) and calls _recreate_container(): it tries label-based reuse first, falls back to a fresh container replaying the same image + full all_run_args set, re-runs init_session(), and retries the command once. A genuine non-zero exit is NOT misclassified as container-gone. Differs from #36631 as submitted: adds the tests the original lacked. tests/tools/test_docker_environment.py covers _is_container_gone pattern matching (incl. the negative/control case), the recover-and-retry path, the persist_across_processes=False opt-out (no recovery), and the ordinary-failure passthrough (no spurious recreation). _make_dummy_env now forwards persist_across_processes. Verified: - Unit: 67/67 in test_docker_environment.py (4 new + existing). - Live E2E against the real docker daemon: started a persistent container, `docker rm -f`'d it out-of-band, and the next execute() transparently recreated a fresh container and succeeded; a follow-up command worked in the recovered container; a real `exit N` passed through without triggering recovery. Co-authored-by: annguyenNous <annguyenNous@users.noreply.github.com>	2026-06-05 10:33:44 +10:00
Ben Barclay	82c157b267	fix(docker): clean up orphaned container when docker run fails (salvage #7440 ) (#39412 ) When `docker run -d` fails after Docker has already created the container object (e.g. exit 125 when the daemon isn't ready, or a timeout mid image pull), the code raised before `self._container_id` was set — so the container leaked permanently in "Created" state. Reported in #7439: 110+ orphaned containers accumulated over 3 days from hourly cron- scheduled gateway sessions hitting a Docker Desktop startup race. The orphan reaper added in #33645 (reap_orphan_containers) does NOT cover this case: it filters `status=exited`, but a failed-create container is in `Created` state, so it slips through and is never reaped. Wrap the `docker run -d` call in try/except and `docker rm -f` the container by its known name before re-raising. Salvages #7440 by @Tranquil-Flow. Their branch predated the cross-process reuse + labels rework on `main`, so a cherry-pick conflicted; reconstructed the same intent (plus their two regression tests, adapted to mock the new reuse `docker ps` probe) against current `main`. Verified adversarially: reverted just the product change to origin/main's `docker.py`, ran the two new tests -> both FAIL with `assert 0 == 1 ("docker rm should be called once")`. With the fix applied, both pass; full test_docker_environment.py is 65/65 green. Closes #7440. Fixes #7439. Co-authored-by: Evi Nova <66773372+Tranquil-Flow@users.noreply.github.com>	2026-06-05 10:19:08 +10:00
Dusk	495c3733d8	fix(config): bridge docker_volumes and docker_forward_env in config set (#38611 ) Co-authored-by: Ben Barclay <ben@nousresearch.com>	2026-06-05 09:31:01 +10:00
teknium1	dd4ba4c2c4	fix(vision): cap pixel dimensions proactively at embed time + declare Pillow Follow-up to the salvaged #37727. That PR fixed the reactive recovery path (classifier + post-failure shrinker) but left the PROACTIVE embed-time guard in vision_tools byte-only — a tall small-byte screenshot (e.g. 1200x12000 at 0.06 MB) still baked into immutable history un-resized, relying on a failed round-trip to trigger reactive shrink. - vision_tools: add _image_exceeds_dimension() + _EMBED_MAX_DIMENSION (7900px); the embed-time cap now fires on bytes OR pixels and passes max_dimension to the resizer, so tall small-byte images are shrunk before they're embedded. - vision_tools: best-effort lazy-install of Pillow (tool.vision) in the resize ImportError fallback so the soft dep self-heals (respects allow_lazy_installs). - error_classifier: add two more Anthropic dimension-cap wording variants. - pyproject + lazy_deps: declare Pillow as the [vision] extra / tool.vision lazy dep (it was undeclared everywhere; without it ALL resize recovery no-ops). - tests: cover _image_exceeds_dimension (tall/small/edge/no-Pillow/corrupt). Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>	2026-06-04 06:16:45 -07:00
Teknium	38d3c49aaf	refactor(skills): clean up bundled skill set + add environments: relevance gate (#39028 ) * refactor(skills): clean up bundled skill set + add environments: relevance gate Bundled skills cleanup pass plus a new offer-time relevance gate. Removals (redundant / dead): - spotify (covered by the spotify plugin's 7 native tools) - linear (covered by `hermes mcp install linear`) - kanban-codex-lane, debugging-hermes-tui-commands - empty category markers: diagramming, gifs, inference-sh, mlops/training, mlops/vector-databases - domain (stale orphan dup of optional/research/domain-intel) Bundled -> optional: - baoyu-article-illustrator, baoyu-comic, creative-ideation, pixel-art - dspy, subagent-driven-development - minecraft-modpack-server, pokemon-player - hermes-s6-container-supervision (-> optional/devops) Consolidation: - webhook-subscriptions + native-mcp folded into the hermes-agent skill as references/webhooks.md + references/native-mcp.md with SKILL.md pointers - writing-plans merged into plan (v2.0.0); related_skills + prose refs updated New: environments: frontmatter gate (agent/skill_utils.skill_matches_environment) - Offer-time relevance filter (kanban / docker / s6), parallel to platforms:. - Wired into the 3 OFFER surfaces only (prompt_builder skills index, skills_tool.list_skills, skill_commands slash discovery). - Explicit loads (skill_view, --skills preload) intentionally BYPASS it, so load-bearing force-loads like the kanban dispatcher's `--skills kanban-worker` always resolve. Verified via E2E. - kanban-orchestrator/kanban-worker tagged environments: [kanban]; hermes-s6-container-supervision tagged environments: [s6] + platforms: [linux]. Validation: 8/8 E2E gating assertions (incl force-load invariant); 442 targeted tests green (agent, skills_tool, skill_commands, kanban worker). * docs: regenerate skill catalogs + pages for the bundled cleanup Regenerated per-skill doc pages, catalogs, and sidebar to match the skill moves/removals in the parent commit. Moved skills' pages relocate bundled -> optional (history preserved); removed skills' pages deleted; edited skills' pages refreshed (hermes-agent now embeds the webhook + native-mcp reference pointers). zh-Hans i18n mirror: stale bundled pages and catalog rows for moved/removed skills pruned (new optional translations land via the translation pipeline). * test: drop regression test for removed kanban-codex-lane skill The kanban-codex-lane skill was removed in the bundled-skills cleanup; its dedicated regression test read the now-deleted SKILL.md and failed with FileNotFoundError on CI shard 6.	2026-06-04 06:11:22 -07:00
Teknium	b04c6e95f6	fix(approval): catch perl/ruby -i as a separate flag token The salvaged pattern matched -i only inside the first flag token, so `perl -p -i -e '...' config.yaml` (the -i split out after -p) slipped through. Widen to match a -...i flag token anywhere in the args; still no false positive on `perl -e` code eval or config reads. Adds tests for the separate-token, backup-suffix, and read-safe forms.	2026-06-04 05:36:30 -07:00
AhmetArif0	a6a4e6f9d7	fix(approval): gate perl/ruby -i in-place edits of Hermes config/env sed -i coverage for ~/.hermes/config.yaml and .env was added in #14639, but perl -i and ruby -i — which perform the same direct file mutation — were not covered. The existing perl/ruby pattern only catches -e/-c (code evaluation), not -i (file mutation), so: perl -i -pe 's/approvals.mode: on/approvals.mode: off/' ~/.hermes/config.yaml bypasses the approval gate entirely, letting the agent flip approvals.mode off mid-session via the mtime-keyed config cache reload. Add a single pattern mirroring the sed -i lines: `\b(?:perl\|ruby)\s+-[^\s]*i` against both _HERMES_CONFIG_PATH and _HERMES_ENV_PATH. Three regression tests pin the new coverage.	2026-06-04 05:36:30 -07:00
Ben Barclay	03ba06ebfb	fix(docker): chown gateway install tree on UID remap (salvage #37928 ) (#38655 ) Salvage of #37928 (@sarvesh1327), reduced to the still-needed delta. `/opt/hermes/gateway` is a runtime-writable Python package: on first import the supervised gateway writes `__pycache__` beneath it, and the image does not set PYTHONDONTWRITEBYTECODE. When HERMES_UID/PUID is remapped at boot (e.g. Unraid 99), `usermod -u` only re-chowns the hermes home dir; the build trees under /opt/hermes keep the build-time UID (10000). main already chowns `.venv`, `ui-tui`, and `node_modules` on remap (#38556) but missed `gateway`, so the remapped gateway hits EACCES writing `__pycache__` (#27221). Add `/opt/hermes/gateway` to both chown sites — the Dockerfile build-time `chown -R hermes:hermes` line and the stage2-hook build-tree repair — so it tracks the remapped UID like the sibling trees. Differs from #37928 as submitted: dropped the `uid_gid_remapped` flag and the `\|\| [ "$uid_gid_remapped" = true ]` chown gate. main's #38556 already solved that half, and more correctly — it probes the actual tree ownership (`venv_owner != actual_hermes_uid`) rather than tracking same-boot remaps, which also catches pre-existing ownership drift and stays idempotent. Keeping #37928's flag would regress that. The salvage is the `gateway`-tree addition only. Verified end-to-end against a real image build: on baseline main a remap to UID 99 leaves `gateway` owned by 10000 and a write as uid 99 fails EACCES; with this change `gateway` is chowned to 99:100 and the write succeeds, while the default-uid (no-remap) path is unchanged. Fixes #27221. Co-authored-by: Sarvesh <sarveshagl1327@gmail.com>	2026-06-04 13:34:23 +10:00
cornna	7402706c5e	fix(docker): accept Unraid uid mappings (#38098 ) Co-authored-by: Cornna <96944678+ymylive@users.noreply.github.com>	2026-06-04 12:38:24 +10:00
Ben Barclay	04d620d91f	fix(docker): run config migrations during container boot (salvage #35508 ) (#36627 ) Salvage of #35508 (@dchenk), rebased onto current main. Resolved the tests/tools/test_stage2_hook_puid_pgid.py conflict (kept both the envdir-creation regression test on main and the new config-migration tests). Docker image upgrades replace code under $INSTALL_DIR but preserve $HERMES_HOME on the mounted volume, so the persisted config.yaml never received the schema migrations that non-Docker `hermes update` runs (#35406). This adds scripts/docker_config_migrate.py, invoked from stage2-hook after first-boot seeding and before gateway services start: it backs up config.yaml + .env, runs migrate_config(interactive=False), and honors HERMES_SKIP_CONFIG_MIGRATION=1 for manual control. Also fixes a latent bug in check_config_version(): it called load_config() which deep-merges DEFAULT_CONFIG, so a legacy config with no raw _config_version falsely reported as already-current. It now reads the raw on-disk file so legacy configs are correctly detected for migration. Differs from #35508 as submitted (Option B cleanup): dropped the `_config_version` line added to cli-config.yaml.example and removed the accompanying test_cli_config_example_declares_latest_version change-detector test. The example is a copy-template and has no business asserting a schema version; check_config_version() reads the user's real config.yaml, not the example. This removes a second sync point that drifts on every version bump. Closes #35508. Fixes #35406. Co-authored-by: Dmitriy Cherchenko <17372886+dchenk@users.noreply.github.com>	2026-06-04 11:11:27 +10:00
Ben Barclay	343c54e35b	fix(docker): reject unsupported --user <arbitrary-uid> start with clear guidance (#38579 ) `docker run --user $(id -u):$(id -g)` was a tini-era trick to make container-written files match the host user. Under s6-overlay it no longer works: the bootstrap (UID remap, volume + build-tree chown, config seeding) needs root, and the baked image dirs (/opt/data, /opt/hermes/.venv, ui-tui, node_modules) are owned by the hermes build UID (10000). A pinned arbitrary UID can't write them, so the runtime fails with EACCES on a bind mount or hard-crashes on a named volume (Docker inits the volume from the image as 10000; the non-root start can't even `cd /opt/data`, and the profile reconciler dies with PermissionError on gateway_state.json). Detect that start early in both the cont-init hook (stage2-hook.sh) and the CMD wrapper (main-wrapper.sh) and fail fast with actionable guidance pointing at the supported path: root start + HERMES_UID/HERMES_GID (or the PUID/PGID aliases), which remaps the hermes user and chowns the volume — the same host-UID-matching outcome --user was used for, without breaking s6. The guard fires only when the current UID is neither root NOR the hermes UID. This preserves the supported non-root start from #34648/#34837 (running with `--user 10000:10000`, i.e. pinned to the hermes UID itself), which is unaffected — only the arbitrary-UID variant that #34837 never actually made writable is rejected. Verified live across five scenarios (built image, bind + named volume): arbitrary --user on bind -> rejected with guidance, hermes does not run; arbitrary --user on named volume -> guidance shown, no raw 'can't cd' crash; --user 10000:10000 -> boots; root + HERMES_UID=4242 remap -> boots, guard not tripped; default root start -> boots. Pre-fix control reproduces the raw PermissionError + 'can't cd' crash with no guidance.	2026-06-04 10:51:51 +10:00
Ben Barclay	5446153c98	fix(docker): chown build trees on UID remap independently of $HERMES_HOME (#35027 regression) (#38556 ) The stage2 hook gates the recursive chown of the build trees under $INSTALL_DIR (.venv, ui-tui, node_modules) so a HERMES_UID/PUID remap leaves them writable by the new runtime UID — needed for lazy_deps 'uv pip install' of platform extras (#15012, #21100) and the TUI esbuild rebuild into ui-tui/dist (#28851). #35027 folded that chown under the $HERMES_HOME ownership check ('stat $HERMES_HOME != hermes_uid'). But 'usermod -u <new> hermes' re-chowns the hermes home dir ($HERMES_HOME == /opt/data) to the new UID as a side effect, so after any remap that stat is already satisfied and needs_chown is false — silently skipping the build-tree chown on the common PUID/NAS path. The venv stays owned by the build-time UID (10000), so lazy installs and TUI rebuilds fail with EACCES. Probe the build trees directly instead: chown only when /opt/hermes/.venv is not already owned by the runtime hermes UID. Independent of $HERMES_HOME ownership, idempotent across restarts. Verified live: built the image, booted with HERMES_UID/HERMES_GID on a fresh named volume, confirmed .venv/ui-tui/node_modules end up owned by the remapped UID and 'uv pip install' into the venv succeeds; confirmed the recursive chown fires once and is skipped on restart.	2026-06-04 10:17:55 +10:00
Ben Barclay	96f0ddc6a9	fix(docker): bake hindsight-client into the image (#38128 ) (#38530 ) The native Hindsight memory provider lazy-installs hindsight-client into /opt/hermes/.venv at first use (tools/lazy_deps.py: memory.hindsight). That venv lives inside the immutable image layer, not the mounted /opt/data volume, so the dependency is wiped on every container recreate / image update. After an update, profile config still points at Hindsight and the Hindsight server is healthy, but recall/retain fails with: ModuleNotFoundError: No module named 'hindsight_client' The manual workaround (uv pip install hindsight-client inside the running container) doesn't survive the next recreate, and pip-install-into-.venv is not an officially supported durable Docker workflow. Fix: add --extra hindsight to the image's uv sync line, same pattern as the --extra anthropic/bedrock/azure-identity providers (#30504) and --extra messaging (#24698) — bake the optional dependency into the build layer so it survives container recreate. The pyproject [hindsight] pin (hindsight-client==0.6.1) already matches tools/lazy_deps.py and uv.lock, so this is a pure additive --extra with no lockfile churn. Verified: 'uv sync --frozen --no-install-project --extra hindsight' against the committed uv.lock installs hindsight-client 0.6.1 and the module imports cleanly. Adds a regression test (mirrors test_dockerfile_preinstalls_gateway_ messaging_dependencies) so a future Dockerfile cleanup can't silently drop the extra.	2026-06-04 09:17:35 +10:00
kshitijk4poor	432325933a	test: restore unrelated trailing newlines in cwd/tool-search tests The salvaged PR incidentally stripped a trailing blank line from two unrelated test files (test_file_tools_cwd_resolution.py, test_tool_search.py). Restore them to keep the salvage diff scoped to the observability feature.	2026-06-03 06:36:46 -07:00
Bryan Bednarski	0d9b7132ff	feat(observability): observer-grade telemetry hooks + NeMo-Relay plugin Adds backend-neutral observer hooks for plugins: session, turn, API request, tool, approval, and subagent lifecycle events with stable correlation IDs (session_id, task_id, turn_id, api_request_id, tool_call_id, parent/child subagent ids). Extends VALID_HOOKS with api_request_error and subagent_start. Hot path is zero-cost when no plugin subscribes: has_hook()/presence checks gate all payload construction, request payloads are returned by reference when no middleware rewrites, and the sanitized response payload no longer embeds raw response objects. Bundles the optional NeMo-Relay observability plugin (plugins/observability/nemo_relay) as an in-repo consumer of the new hooks, peer to the existing langfuse plugin. Fails open when the optional nemo-relay package is not installed. Authored-by: Bryan Bednarski <bbednarski@nvidia.com> Salvaged from #29722 onto current main.	2026-06-03 06:36:46 -07:00
Ben Barclay	d9f7e7ac81	fix(docker): seed gateway_state.json from HERMES_GATEWAY_BOOTSTRAP_STATE on first boot (#37896 ) On a fresh volume there is no gateway_state.json, so the boot reconciler (cont-init.d/02-reconcile-profiles) registers the gateway-default s6 slot but leaves it down — it only auto-starts when the last recorded state was "running". A freshly-provisioned container therefore comes up with the gateway down until something starts it (e.g. the dashboard's start button). Add a generic, first-boot-only env-seed in stage2-hook.sh (which runs before 02-reconcile-profiles): when HERMES_GATEWAY_BOOTSTRAP_STATE=running and no gateway_state.json exists yet, seed {"gateway_state":"running"} so the reconciler brings the supervised slot up on the very first boot. This mirrors the existing HERMES_AUTH_JSON_BOOTSTRAP pattern: it seeds the same state file the reconciler already consults, guarded by [ ! -f ] so persisted runtime state always wins on later boots (a deliberately-stopped gateway stays stopped across restarts). Only the literal "running" is honoured (the sole value in the reconciler's _AUTOSTART_STATES). Generic container contract — no host-specific code. Useful to any orchestrator that provisions a blank volume and wants the gateway up from first boot (the supervised gateway/dashboard already work on such hosts; only the first-boot autostart was missing because the CLI lifecycle commands can't drive the s6 layer when container self-detection misses). Adds a shell-level contract test and documents the env var.	2026-06-03 15:11:15 +10:00
ethernet	a51a7b9b92	fix(node/nix): consolidate workspace lockfile + update all consumers Consolidate per-package package-lock.json files into a single root-level workspace lockfile. Update all consumers: - Nix: shared src/npmDeps/npmDepsHash in lib.nix; devshell hook stamps package.json paths then runs npm ci from root; individual .nix files use mkNpmPassthru attrs instead of per-package fetchNpmDeps. - Python CLI: new _workspace_root() helper so _tui_need_npm_install, _make_tui_argv, _build_web_ui resolve lockfile/node_modules from the workspace root. - Desktop: replace --force-build/mtime heuristic with content-hash build stamp (_compute_desktop_content_hash via pathspec). Remove --force-build flag. - Dockerfile: single root npm install; no per-directory lockfile copies. - CI: nix-lockfile-fix and osv-scanner reference root package-lock.json; apps/dashboard → apps/desktop. - Tests: new test_tui_npm_install.py; desktop stamp tests in test_gui_command.py; updated assertions in test_cmd_update.py, test_web_ui_build.py, test_dockerfile_pid1_reaping.py. - Docs: remove --force-build from desktop flag table. Deleted: apps/desktop/package-lock.json, ui-tui/package-lock.json, ui-tui/packages/hermes-ink/package-lock.json, web/package-lock.json.	2026-06-02 20:28:18 -04:00
Teknium	2c0d648397	fix(cron): sanitize invisible unicode in vetted skill content instead of hard-blocking (#37245 ) A stray zero-width space (U+200B), BOM, or bidi control in loaded skill markdown permanently killed any cron that loaded it. The skills-attached assembled-prompt scan hard-blocked on any invisible-unicode char, even though skill bodies are already install-time vetted by skills_guard.py and the chars commonly appear in copy-pasted unicode docs / code examples. The skills path now strips invisibles (logging the codepoints) and runs the cleaned prompt. The raw user-prompt path (_scan_cron_prompt) keeps the hard block — that is the actual #3968 injection surface, where a small directive prompt with a ZWSP is a smoking gun, not prose. Stripping does not let a real injection slip through: the directive still matches after sanitization. _scan_cron_skill_assembled now returns (cleaned_prompt, error).	2026-06-02 00:29:44 -07:00
Teknium	272c2f30aa	fix(kanban): kanban_create inherits the spawning worker's task workspace (#37182 ) When a dispatcher-spawned worker (HERMES_KANBAN_TASK set) calls kanban_create without an explicit workspace, the new child now inherits the worker's own running-task workspace_kind/workspace_path instead of defaulting to scratch. A worker editing a dir:/worktree project that spawns a follow-up child keeps it in that project. Orchestrators (kanban toolset, no HERMES_KANBAN_TASK) and CLI/dashboard callers still default to scratch. An explicit workspace arg always wins.	2026-06-01 21:26:29 -07:00
kyssta-exe	d4b533de4e	fix: batch of small robustness/correctness fixes from @kyssta-exe Salvages 8 distinct fixes from a batch of PRs by @kyssta-exe, reapplied onto current main (original branches were stale) with a few refinements. - cron(jobs.py): load_jobs() validates top-level JSON shape — a bare list auto-repairs into the {"jobs": [...]} dict; scalars/null raise a clear RuntimeError instead of an uncaught AttributeError that took down the whole cron subsystem (#37065, closes #36867). - web(web_server.py): close the per-action log file handle after Popen so the parent stops leaking one fd per spawned action (#36843). - web(web_server.py): DELETE /api/env returns 400 for invalid key names instead of a misleading 500, mirroring PUT /api/env (#36840). - gateway(gateway.py): read /proc/<pid>/cmdline inside a with-block so the fd is released immediately instead of relying on GC (#36804). - web-tools(web_tools.py): include "xai" in check_web_api_key() so a configured X.AI web backend reports as available (#36802). - compression(conversation_compression.py): mark the feasibility check done only after it completes, and default the gate to "not checked" if the attribute is missing (#36803). - completion(completion.py): replace `ls` with directory globbing in the generated bash/zsh/fish profile listers — handles names with spaces and skips non-directory entries (#36806). - terminal-tool(terminal_tool.py): drop a duplicate `import threading` (#36808). - claw(claw.py): the migrate recommendation now points at the real `hermes gateway stop` command instead of the non-existent `hermes stop` (#36795, #36796, closes #36771). - tests: guard against a leaked HERMES_CRON_SESSION breaking gateway approval tests — add it to the hermetic conftest unset list (root cause, protects every test) and pop it in the affected test's setup_method (#36796). Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>	2026-06-01 19:51:03 -07:00
teknium1	64f7f36713	fix(mcp): make non-MCP HTTP endpoint fast-fail robust and non-retryable Reworks the content-type preflight so a misconfigured HTTP MCP url (a web-app root serving HTML) fails in <1s instead of hanging the full 60s connect_timeout — and does so non-retryably, which neither original PR achieved. - Allow-list detection (application/json, text/event-stream) instead of a text/html-only denylist — catches text/plain, application/xml, etc. - New NonMcpEndpointError(ConnectionError); run() catches it in the same top-level fast-fail block as InvalidMcpUrlError, so it returns before the reconnect-backoff loop (truly non-retryable) and the probe runs once, not on every reconnect. - Probe runs on its own httpx client OUTSIDE the SDK anyio task group, so the error propagates as itself rather than wrapped in an ExceptionGroup (the trap that made the in-SDK event-hook approach a no-op). - Forwards ssl_verify + client_cert + headers; HEAD->GET fallback on 405/501; best-effort pass-through on missing content type, non-2xx, and network errors; skips SSE transport. CancelledError is never swallowed. - Replaces the malformed test file (which never imported the real method and failed CI) with 21 tests driving the actual _preflight_content_type against a real local HTTP server, plus full run() integration verifying <1s non-retryable failure. Co-authored-by: liuhao1024 <sunsky.lau@gmail.com> Co-authored-by: uzunkuyruk <egitimviscara@gmail.com>	2026-06-01 19:49:50 -07:00
liuhao1024	c914e4a371	fix(mcp): fail fast on HTML content-type instead of waiting full connect_timeout A misconfigured MCP server URL that returns text/html (e.g. pointing at a web app root instead of an MCP endpoint) causes the MCP SDK to block for the full connect_timeout (default 60 s) before surfacing CancelledError. Add a lightweight HEAD pre-flight check that detects text/html responses in ≤5 s and raises ConnectionError with an actionable message. Non-HTML responses, missing headers, and network errors pass through silently so the normal MCP handshake proceeds unaffected. Fixes #36052	2026-06-01 19:49:50 -07:00
Julien Talbot	8104b20269	fix(xai): route video models by modality	2026-06-01 19:00:30 -07:00
firefly	128da68823	test(tools): characterize tool-surface TERMINAL_CWD contract (#29265 ) Port PR #29365's tool-surface contract test: terminal/file/execute_code already honor TERMINAL_CWD (out of scope for the resolver cluster). Pinning the behavior makes the supersession of #29365 airtight and guards against a future refactor silently regressing the workspace contract.	2026-06-01 16:55:04 -07:00
teknium1	4e9d886d9d	fix(approval): pair terminal-side gate for ~/.hermes/config.yaml writes Subway2023's #14639 blocks write_file/patch to ~/.hermes/config.yaml, but the terminal side was only partially paired: echo>/tee/cp/mv to config.yaml already tripped the project-config pattern, while `sed -i` and direct edits slipped through with auto-approve. An unpaired write_file deny is theater per SECURITY.md — the agent could flip approvals.mode=off via `sed -i` and the mtime-keyed config cache reloads it mid-session. config.yaml IS the security policy (approvals.mode/yolo/permanent allowlist live there), so it warrants real pairing, not a half-door. Add a _HERMES_CONFIG_PATH fragment mirroring _HERMES_ENV_PATH, fold it into _SENSITIVE_WRITE_TARGET (covers tee/>/>>/cp/mv), and add sed -i coverage for both config.yaml and .env. Pins 9 regression tests including no-regression guards (reads pass, /tmp writes pass). Co-authored-by: sbw2025 <subw3@mail2.sysu.edu.cn>	2026-06-01 03:29:48 -07:00
sbw2025	8f2931e3ee	fix(file_tools): block agent writes to ~/.hermes/config.yaml to prevent silent approval bypass	2026-06-01 03:29:48 -07:00

1 2 3 4 5 ...

1072 commits