hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-18 04:41:56 +00:00

Author	SHA1	Message	Date
v1b3coder	4fdaf0b4d8	fix: use credential_pool for custom endpoint model listing probes Same-provider /model switches on a 'custom' endpoint kept stale credentials because (a) _resolve_named_custom_runtime's bare-custom + explicit_base_url path went straight to OPENAI_API_KEY/OPENROUTER_API_KEY env fallbacks without consulting the credential pool, and (b) switch_model() guarded against custom-provider re-resolution to preserve base_url, locking in the prior api_key. Now the bare-custom path queries the credential pool first (mirroring the named-custom-provider branch behavior), and the same-provider switch guard is removed since resolve_runtime_provider has since grown a robust custom-resolution path that preserves base_url from model_cfg. Refs #18681 (the gateway-side api_key wiring is still separate), #16254, #12919.	2026-05-09 17:54:58 -07:00
Wesley Simplicio	6bf7ac3185	fix(gateway): detect gateway process via /proc in Docker without procps Salvage of NousResearch/hermes-agent#7622. Docker images often lack procps so `ps` is unavailable. Try reading /proc/*/cmdline first (works in any Linux container) and fall back to `ps -A eww` only when /proc is not present. PermissionError on individual PIDs is silently skipped. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-09 17:54:17 -07:00
Teknium	2ffef15675	fix(test_gateway): stop run_gateway() tests from rewriting the dev's installed systemd unit (#22900 ) run_gateway() calls refresh_systemd_unit_if_needed() on every invocation so restart settings stay current after exit-code-75 respawns. The user-scope unit path resolves under Path.home() (NOT sandboxed by conftest, only HERMES_HOME is), and generate_systemd_unit() bakes the current HERMES_HOME into the unit's Environment= line. Result: any test that exercises run_gateway() end-to-end on a real Linux dev box silently rewrites the developer's installed ~/.config/systemd/user/hermes-gateway.service with a polluted HERMES_HOME pointing at /tmp/pytest-of-<user>/.../hermes_test. On the next reboot, systemd loads that unit, the gateway starts looking at an empty tmp dir, and Telegram/Discord/etc. all show as 'No messaging platforms enabled' even though the user's real config is fine. Three tests in tests/hermes_cli/test_gateway.py hit this path: test_run_gateway_exits_cleanly_on_keyboard_interrupt, test_run_gateway_exits_nonzero_when_start_gateway_reports_failure, and test_run_gateway_root_guard_has_escape_hatch. Two-layer fix: 1. _install_fake_gateway_run helper (covers all four run_gateway() call sites in test_gateway.py and any future ones) now also stubs supports_systemd_services and refresh_systemd_unit_if_needed. 2. refresh_systemd_unit_if_needed() itself sniffs the generated unit body for /pytest-of- and /hermes_test markers and refuses to write when present. Defense in depth so a future test that bypasses the helper still can't corrupt the dev's gateway. Tests that legitimately exercise the refresh flow (test_run_gateway_refreshes_outdated_unit_on_boot) patch generate_systemd_unit to return synthetic content that doesn't carry those markers, so they keep working. Adds test_refresh_refuses_to_bake_pytest_tmpdir_into_real_user_unit as a regression test for the source-side guard.	2026-05-09 17:54:09 -07:00
Wesley Simplicio	116a1446a4	fix(terminal): bridge docker_env config to TERMINAL_DOCKER_ENV Problem: terminal.docker_env set in config.yaml was silently ignored. Docker containers never received the user-specified env vars. Root cause: docker_env was missing from all three config→env bridging maps (cli.py env_mappings, gateway/run.py _terminal_env_map, hermes_cli/config.py _config_to_env_sync) and from the terminal_tool _get_env_config() reader. _create_environment() consumed the key from container_config correctly, but it was always {} because TERMINAL_DOCKER_ENV was never set. Also extend the list-serialisation branches in cli.py and gateway/run.py to handle dict values via json.dumps (lists already used json.dumps; plain str() on a dict produces undecodable output). Fix: - cli.py: add "docker_env": "TERMINAL_DOCKER_ENV" to env_mappings; serialise dict values with json.dumps alongside existing list path - gateway/run.py: same additions to _terminal_env_map and serialisation - hermes_cli/config.py: add "terminal.docker_env": "TERMINAL_DOCKER_ENV" to _config_to_env_sync so `hermes config set terminal.docker_env …` persists to .env correctly - tools/terminal_tool.py: add docker_env key to _get_env_config() reading TERMINAL_DOCKER_ENV via _parse_env_var with default "{}" Tests: add test_docker_env_is_bridged_everywhere to tests/tools/test_terminal_config_env_sync.py — stash-verified: fails on origin/main, passes with fix. Fixes #20537	2026-05-09 17:53:35 -07:00
Teknium	c179bdab3c	fix(install): also patch psutil on Termux fresh-install path The Termux update path (PR #22814) prebuilds psutil from a marker-patched sdist so 'platform android is not supported' doesn't kill it. The same psutil setup.py error blocks fresh installs via scripts/install.sh — only the update path was wired up. Without this, a brand-new Termux user can't get past the very first 'pip install -e .[termux-all]' call. - New scripts/install_psutil_android.py — standalone version of the same patcher hermes_cli/main.py uses, callable from bash. - scripts/install.sh detects sys.platform == 'android' and runs the patcher before pip install. - TODO note added to both copies pointing at upstream https://github.com/giampaolo/psutil/pull/2762; remove both when that ships. Note: we keep psutil as a base dep on Android (do not adopt the proposed sys_platform != 'android' marker in pyproject). Removing it would crash five unguarded 'import psutil' sites at runtime (tools/code_execution_tool.py, tools/tts_tool.py, tools/process_registry.py (2x), gateway/platforms/whatsapp.py).	2026-05-09 17:53:15 -07:00
adybag14-cyber	6d5d467d39	fix(update): use termux-all uv fallback path on Termux	2026-05-09 17:53:15 -07:00
adybag14-cyber	3863d6d344	fix(update): prebuild psutil on Termux Android via Linux path shim	2026-05-09 17:53:15 -07:00
Teknium	0bcc327cab	docs(openrouter): document auxiliary.<task>.extra_body for OR routing and Pareto (#22844 ) The plumbing for setting OpenRouter provider preferences and the Pareto Code router on auxiliary tasks already exists — auxiliary.<task>.extra_body is forwarded verbatim by call_llm() / async_call_llm(). It just wasn't documented, so users who wanted (e.g.) Pareto Code routing for compression but the strongest coder for the main agent had no way to discover the escape hatch. - hermes_cli/config.py: expand the auxiliary section header with a YAML example showing provider routing plus plugins under extra_body, and an explicit note that main-agent provider_routing / openrouter.min_coding_score do NOT propagate to aux calls (each task is independent by design) - website/docs/user-guide/configuration.md: new 'OpenRouter routing and Pareto Code for auxiliary tasks' subsection with worked example - website/docs/integrations/providers.md: cross-link from the Pareto Code Router section to the aux-side doc E2E verified that auxiliary.<task>.extra_body reaches the OpenRouter API with the configured provider routing and plugins blocks intact.	2026-05-09 14:51:20 -07:00
Teknium	c7f0aab949	feat(openrouter): wire Pareto Code router with min_coding_score knob (#22838 ) Pick openrouter/pareto-code as your model and OpenRouter auto-routes each request to the cheapest model meeting your coding-quality bar (ranked by Artificial Analysis). The new openrouter.min_coding_score config key (0.0-1.0, default 0.65) tunes the floor. - hermes_cli/models.py: add openrouter/pareto-code to OPENROUTER_MODELS so it shows up in the picker with a description - hermes_cli/config.py: add openrouter.min_coding_score (default 0.65 — lands on a mid-tier coder on the current Pareto frontier) - plugins/model-providers/openrouter: emit extra_body.plugins = [{id: pareto-router, min_coding_score: X}] when model is openrouter/pareto-code AND the score is a valid float in [0.0, 1.0] - agent/transports/chat_completions.py: same emission on the legacy flag path (when no provider profile is loaded) - run_agent.py: openrouter_min_coding_score kwarg + storage; plumbed into both build_kwargs() invocations and the context-summary extra_body path - cli.py: read openrouter.min_coding_score once at init, validate float in [0,1], pass to AIAgent constructions (CLI + background-task paths) - cron/scheduler.py, batch_runner.py, tools/delegate_tool.py, tui_gateway/server.py: propagate the kwarg (mirrors providers_order plumbing — subagents inherit, cron/batch read from config) - tests: profile-level + transport-level coverage of the model gating, unset/empty/out-of-range handling, and the legacy flag path - docs: new 'OpenRouter Pareto Code Router' section in providers.md Verified end-to-end against api.openrouter.ai: at score=0.65 we land on a mid-tier coder, at omission we get the strongest. Score is silently dropped on any model other than openrouter/pareto-code, so it's safe to leave set.	2026-05-09 14:47:00 -07:00
HenkDz	840ebe063e	fix: make session search initialize session db	2026-05-09 14:36:58 -07:00
helix4u	9c26297c80	fix(gateway): preserve Ctrl+C for Windows foreground runs	2026-05-09 14:34:18 -07:00
mbac	1508dcb9c2	fix(gateway): adopt unit's HERMES_HOME for --system CLI ops When systemd_restart / systemd_status / systemd_stop run under sudo, HERMES_HOME is stripped and HOME=/root, so get_hermes_home() resolves to /root/.hermes instead of the unit's pinned home. read_runtime_status and get_running_pid then look at the wrong gateway_state.json — the 60s status poll never sees "running", times out, and forces another systemctl restart that SIGTERMs the in-progress new gateway. Read the unit's pinned HERMES_HOME from `systemctl show -p Environment` and mirror it into os.environ before any HERMES_HOME-derived read. Early-out when system=False (user-scope inherits naturally). Errors swallowed so a transient systemctl failure doesn't break unrelated CLI ops. Closes #22035.	2026-05-09 13:38:38 -07:00
Sanjay Santhanam	fe61d95b44	fix(completion): use valid zsh _arguments exclusion-group syntax The generated zsh completion script used `(-h --help)` as the exclusion group for `_arguments`, which zsh rejects with: _arguments:comparguments: invalid argument: (-h --help){-h,--help}[...] Exclusion groups in `_arguments` cannot contain long options. Use the canonical `(-)` form (exclude all other options) which correctly handles flag pairs like `-h`/`--help`. Fixes NousResearch/hermes-agent#22686	2026-05-09 13:36:44 -07:00
Wesley Simplicio	6e848f60ef	fix(doctor): normalize provider name and aliases before dedicated-skip check	2026-05-09 13:36:33 -07:00
Wesley Simplicio	1dd0790654	fix(doctor): skip pluggable provider profiles when a dedicated check exists (#22346 ) Problem ------- `hermes doctor` ran two health checks for Anthropic: a dedicated one with the correct `x-api-key` + `anthropic-version` headers, and a generic Bearer-auth one driven by the pluggable `ProviderProfile` for "anthropic". The generic check called `https://api.anthropic.com/v1/models` with `Authorization: Bearer ...`, which Anthropic answers with HTTP 404, producing a noisy duplicate warning even when the dedicated check passed. Root cause ---------- `hermes_cli/doctor.py:_build_apikey_providers_list` deduplicated profiles against a `_known_canonical` set built from the static list (Z.AI/GLM, Kimi, DeepSeek, …). Providers with their own dedicated check above the generic loop (Anthropic, OpenRouter, Bedrock) were not in that set, so their profiles were appended and ran a second, broken check. Fix --- Add `{"anthropic", "openrouter", "bedrock"}` to the skip set, and also skip profiles whose aliases match any of those names (e.g. `claude`, `claude-oauth` → anthropic). Tests ----- tests/hermes_cli/test_doctor_dedicated_provider_skip.py: - test_build_apikey_providers_list_skips_dedicated_check_providers: asserts the assembled list does not contain anthropic, openrouter, or bedrock entries. - test_build_apikey_providers_list_includes_non_dedicated_providers: sanity guard that legitimate providers (DeepSeek, Z.AI/GLM) survive. Both confirmed via stash-verify (fail pre-fix with anthropic/openrouter leaking, pass post-fix). Fixes #22346	2026-05-09 13:36:33 -07:00
Wesley Simplicio	78698381af	fix(kanban): make _migrate_add_optional_columns idempotent on concurrent open ALTER TABLE calls inside _migrate_add_optional_columns were guarded by a snapshot of PRAGMA table_info taken at function entry. When the gateway dispatcher opens the kanban DB twice per tick (once in _tick_once_for_board and once via init_db's discard-and-reconnect path), a second connection can run the same migration before the first one commits, causing: sqlite3.OperationalError: duplicate column name: consecutive_failures This crashed the dispatcher on every first tick after a gateway restart (subsequent ticks succeeded because the columns were then present). Fix: introduce _add_column_if_missing() which wraps ALTER TABLE in a try/except that swallows OperationalError whose message contains 'duplicate column name'. All ALTER TABLE calls in _migrate_add_optional_columns are routed through this helper. Closes #21708	2026-05-09 13:36:23 -07:00
Teknium	e612c3d6f0	perf(doctor): parallelize API connectivity checks and disable IMDS (#22766 ) `hermes doctor` ran every connectivity probe sequentially and on a typical developer laptop spent ~2s of its ~5s wall time inside boto3's EC2 instance-metadata-service lookup (169.254.169.254) — the default AWS credential chain probes IMDS even when AWS_BEARER_TOKEN_BEDROCK or AWS_ACCESS_KEY_ID is the only legitimate source. Refactor the API Connectivity section so every probe (OpenRouter, Anthropic, ~16 static API-key providers + dynamic profiles, AWS Bedrock) is a pure function returning a structured result, then fan them out through a ThreadPoolExecutor(max_workers=8). Output order, glyphs, colours, padding, and issue strings stay byte-for-byte identical to the sequential implementation; results are gathered in submission order. Also disable IMDS for the parallel block by setting AWS_EC2_METADATA_DISABLED=true on the parent thread before submitting work (and restoring its prior value in a finally block). Bedrock's real-API call gets a Config(connect_timeout=5, read_timeout=10, retries={max_attempts:1}) so a transient regional failure can't pad the run by 30+ seconds. Measured impact (5-run medians, 9950X3D): hermes doctor: 5.07 → 2.16 s (-57%) Doctor tests: 48 passed (test_doctor.py + test_doctor_command_install.py). The remaining ~2s of wall is import overhead + a couple of one-off network calls outside the API Connectivity section (`fetch_models_dev` provider catalog refresh, Nous OAuth refresh in `Auth Providers`). Those are next-tier targets, not part of this change.	2026-05-09 13:03:20 -07:00
Teknium	8f711f79a4	fix(tools): install cua-driver when Computer Use is enabled via 'hermes tools' (#22765 ) Returning users who enabled '🖱️ Computer Use (macOS)' via 'hermes tools' saw '✓ Saved configuration' but no install — cua-driver was never on PATH and the toolset failed at first use. Two compounding causes: 1. _toolset_needs_configuration_prompt fell through to _toolset_has_keys, which returned True for any provider with empty env_vars. cua-driver has no env vars, so the gate skipped _configure_toolset entirely and _run_post_setup('cua_driver') never ran. 2. No stable CLI entry-point existed for re-running the install when the picker no-op'd it (e.g. when toggling the toolset off+on inside one picker session, where 'added' is empty). Changes: - hermes_cli/tools_config.py: add _POST_SETUP_INSTALLED registry mapping post_setup keys to installed-state predicates. The gate now returns True when any visible provider has a registered post_setup whose predicate fails. cua_driver is the only opt-in for now; other post_setup hooks keep their existing behaviour. - hermes_cli/main.py: add 'hermes computer-use install' and 'hermes computer-use status' as a stable docs target. install reuses the same _run_post_setup('cua_driver') path that the picker invokes; status reports whether cua-driver is on PATH. - tools/computer_use/cua_backend.py: install hint now points users at 'hermes computer-use install' first. - website/docs/user-guide/features/computer-use.md: document the new command as the primary install path. - website/docs/reference/cli-commands.md: catalog 'hermes computer-use' alongside 'hermes tools'. - tests/hermes_cli/test_post_setup_gating.py: regression coverage for the gate predicate (missing -> setup forced, installed -> setup skipped, broken predicate -> non-blocking, unregistered keys -> behaviour unchanged). Fixes #22737. Reported by @f-trycua.	2026-05-09 13:02:25 -07:00
Teknium	70bc52e408	fix(cli): make Ctrl+Enter insert newline on WSL/SSH/Windows Terminal (#22777 ) Native Windows, WSL, SSH sessions, and Windows Terminal all send Ctrl+Enter as bare LF (c-j). Hermes was binding c-j as submit on every POSIX platform, so Ctrl+Enter submitted instead of inserting a newline on those terminals. Reported in #22379. Add _preserve_ctrl_enter_newline() predicate that detects the environments where Ctrl+Enter must produce a newline (sys.platform == 'win32', SSH_CONNECTION/SSH_CLIENT/SSH_TTY env, WT_SESSION, WSL_DISTRO_NAME, /proc/version 'microsoft' marker). Gate the c-j-as-submit binding off in those environments and gate the c-j-as-newline handler on. Local POSIX TTYs without those markers (docker exec, plain ssh from a Mac) keep c-j as submit so plain Enter still works on thin PTYs. Add install_ctrl_enter_alias() in hermes_cli/pt_input_extras.py mapping the three CSI-u / modifyOtherKeys variants of Ctrl+Enter ('\x1b[13;5u', '\x1b[27;5;13~', '\x1b[27;5;13u') to the (Escape, ControlM) tuple Alt+Enter produces. This lets Kitty / mintty / xterm-with-modifyOtherKeys users over SSH get a Ctrl+Enter newline through the existing Alt+Enter handler. 9 new tests + extended existing test_lf_enter_binds_to_submit_handler_posix to cover bare-local vs SSH branches. Closes #22379.	2026-05-09 12:48:14 -07:00
Teknium	ade5981429	fix(kanban): sanitize comment author rendering in build_worker_context (#22769 ) Operator-controlled HERMES_PROFILE values were rendered as '${author} (${ts}):' — markdown bold with no provenance prefix. Worker comment bodies render directly underneath. A misleading profile name like 'hermes-system' or 'operator' could be misread by the next worker as a system directive above attacker-influenced content (confused-deputy primitive gated on operator misconfig). The LLM-controlled author-forgery surface was already closed in #22435 (author removed from KANBAN_COMMENT_SCHEMA). This is defense-in-depth: render with an explicit 'comment from worker `<author>` at <ts>:' prefix so even 'hermes-system' resolves to 'comment from worker `hermes-system` at ...' — parseable as worker-comment metadata, not a system directive. Strip backticks from author so they can't break out of the fence. Update test_build_worker_context_caps_comments to count by body regex since the rendered author line now also starts with 'comment '. Closes #22452.	2026-05-09 12:47:58 -07:00
kshitijk4poor	9aefa74a9f	feat(mcp): add codex preset for built-in MCP server discovery Adds 'codex' to the _MCP_PRESETS registry so users can add it via Connecting to 'codex'... ✓ Connected! Found 2 tool(s) from 'codex': codex Run a Codex session. Accepts configuration parameters matchi... codex-reply Continue a Codex conversation by providing the thread id and... Enable all 2 tools? [Y/n/select]: Cancelled. without manually specifying the command and args. Enables: codex mcp-server → Hermes native MCP client → Codex tools available as first-class Hermes tools.	2026-05-09 11:11:28 -07:00
Wesley Simplicio	a33c63b9f8	fix(profiles): honour active_profile when HERMES_HOME points to hermes root Problem: After `hermes profile use NAME`, the gateway (started via systemd with HERMES_HOME=/root/.hermes hardcoded) ignores the active profile and always runs as the Default profile. WebUI, Telegram, and all non-CLI platforms are affected. Root cause: _apply_profile_override() contained an early-return guard: if profile_name is None and os.environ.get("HERMES_HOME"): return # trust the inherited value The intent was to let child processes inherit their parent's profile via HERMES_HOME without redundantly re-reading active_profile. But systemd also sets HERMES_HOME — to the hermes root (/root/.hermes), not a profile directory — so the guard fired and silently skipped the active_profile check. The user's `hermes profile use NAME` write to ~/.hermes/active_profile was never seen by the gateway process. Fix: Only skip the active_profile check when HERMES_HOME is already a profile directory, identified by its immediate parent directory being named "profiles" (e.g. ~/.hermes/profiles/coder or /opt/data/profiles/coder). When HERMES_HOME points to a root directory (parent name != "profiles"), continue to read active_profile. Tests: - test_hermes_home_at_root_with_active_profile_is_redirected: the bug scenario — HERMES_HOME=/root/.hermes + active_profile=coder → HERMES_HOME must be redirected to .../profiles/coder. Stash-verified: FAILS without fix, PASSES with fix. - test_hermes_home_already_profile_dir_is_trusted: child-process inheritance contract unchanged — .../profiles/coder is trusted as-is. - test_hermes_home_unset_reads_active_profile: classic path unchanged. - test_hermes_home_unset_default_profile_no_redirect: "default" still produces no redirect. 4/4 tests green. Closes #22502.	2026-05-09 11:10:53 -07:00
xieNniu	c8ede8aa1b	fix(plugins): resolve Git binary for installs under minimal PATH Resolve git via shutil.which with POSIX and Git-for-Windows fallbacks before clone and pull so Dashboard/API installs do not misreport Git as missing. Add regression tests for the resolver and pull subprocess invocation.	2026-05-09 11:10:04 -07:00
JackJin	7d276bfbee	fix(cli): expand composite toolset when mixed with configurables in platform_toolsets When platform_toolsets[<platform>] contains both a composite (e.g. hermes-cli) and at least one configurable opt-in (e.g. spotify), the has_explicit_config branch in _get_platform_tools silently dropped the composite, leaving sessions with only the configurable + plugin tools and no native tools (terminal, file, web, browser, memory, etc.). Mirror the else-branch's subset inference for composites that sit alongside the configurables, but apply _DEFAULT_OFF_TOOLSETS only to the implicit expansion so user-listed default-off toolsets (spotify, discord) survive.	2026-05-09 11:08:05 -07:00
Matthew Cater	cda20eec0c	fix(kanban): gate claim + unblock on parent completion Enforce the parent-completion invariant at claim_task (the single ready->running chokepoint) and re-gate unblock_task so blocked->ready only fires when parents are done. Prevents child tasks from running ahead of in-progress parents under the create-then-link race. Also adds a stress test that races concurrent create+link against hammered claim_task and asserts no child runs while any parent is undone. Ref: kanban/boards/cookai/workspaces/t_a6acd07d/root-cause.md Refs: t_8d6af9d6	2026-05-09 11:07:37 -07:00
Teknium	79694018f8	feat(plugins): HERMES_PLUGINS_DEBUG=1 surfaces plugin discovery logs (#22684 ) Plugin authors had no easy way to figure out why their plugin wasn't loading — failures were buried in agent.log at WARNING and skip reasons (disabled, not enabled, depth cap, exclusive) were DEBUG-only and invisible by default. Set HERMES_PLUGINS_DEBUG=1 to attach a stderr handler at DEBUG to the hermes_cli.plugins logger only. Surfaces: - which directories were scanned + manifest counts per source - per manifest: resolved key, name, kind, source, on-disk path - skip reasons (disabled, not enabled, exclusive, depth cap, no register) - per load: tools/hooks/slash/CLI commands the plugin registered - full traceback on YAML parse failure (exc_info on the existing warning) - full traceback on register() exceptions, pointing at the plugin author's line Env var off (default) → zero new stderr output, same as before. Touches only hermes_cli/plugins.py + a doc section in the plugin-build guide + an entry in the env-vars reference. 3 new tests lock the attach/idempotent/no-attach behavior.	2026-05-09 11:07:12 -07:00
Wesley Simplicio	0c22434f03	fix(kanban): call recompute_ready after unlink_tasks removes a dependency Problem: unlink_tasks() removes a parent→child dependency edge but does not trigger recompute_ready(). A child whose last blocking parent is unlinked stays stuck in 'todo' indefinitely — it only promotes to 'ready' on the next dispatcher tick or a manual 'hermes kanban recompute'. For CLI-only users without a dispatcher, the child is permanently stuck. Root cause: complete_task() and unblock_task() both call recompute_ready() after their write transaction so downstream children are evaluated immediately. unlink_tasks() was missing this call — removing a dependency is semantically equivalent to completing one, so the same recompute is needed. Fix: Capture the rowcount result before the write_txn exits, then call recompute_ready(conn) outside the transaction when a row was actually deleted (so the child sees the updated task_links state). Tests: Added test_unlink_tasks_triggers_recompute_ready in tests/hermes_cli/test_kanban_db.py: creates parent A (done) + parent C (running), child B with both parents (todo), unlinks C→B, asserts B is ready immediately. Stash-verified: FAILS without fix (child stays todo), PASSES with fix. 62/62 tests green in tests/hermes_cli/test_kanban_db.py. Closes #22459.	2026-05-09 11:06:21 -07:00
Teknium	b9c001116e	feat: confirm prompt for destructive slash commands (#4069 ) (#22687 ) /clear, /new, /reset, and /undo now ask the user to confirm before discarding conversation state — three-option prompt routed through the existing tools.slash_confirm primitive. Native yes/no buttons render on Telegram, Discord, and Slack (their adapters already implement send_slash_confirm); other platforms get a text-fallback prompt and reply with /approve, /always, or /cancel. The classic prompt_toolkit CLI uses the same three-option flow via the established _prompt_text_input pattern (see _confirm_and_reload_mcp). TUI keeps its existing modal overlay (#12312). Gated by new config key approvals.destructive_slash_confirm (default true). Picking 'Always Approve' flips the gate to false so subsequent destructive commands run silently — matches the established mcp_reload_confirm UX. Out of scope: /cron remove (separate domain — scheduled jobs, not session history). Existing TUI overlay env-var (HERMES_TUI_NO_CONFIRM) left unchanged; cosmetic unification can come later. Closes #4069.	2026-05-09 11:04:46 -07:00
fahdad	cca2869d78	fix(banner): resolve update-check repo from running code, not profile-scoped path check_for_updates() and _resolve_repo_dir() were preferring $HERMES_HOME/hermes-agent/ over Path(__file__).parent.parent.resolve() when looking for a .git checkout. For profiles created with --clone-all, $HERMES_HOME/hermes-agent/ points to a stale copy with a frozen HEAD, causing persistent "N commits behind" banners that never resolved. Flip the resolution order: prefer the running code's location first, fall back to $HERMES_HOME/hermes-agent/ only when the live checkout doesn't have a .git (system-wide pip installs, distro packages). The embedded-rev branch (HERMES_REVISION env var, set by nix builds) is unaffected — it uses git ls-remote against upstream, never reads the local checkout's HEAD. Based on PR #21728 by @fahdad	2026-05-09 04:10:35 -07:00
donrhmexe	f7e514d4ad	fix(profiles): exclude infrastructure artifacts when cloning with --clone-all When the source profile is the default (~/.hermes), shutil.copytree() was copying multi-GB infrastructure alongside the ~40 MB of actual profile data: hermes-agent/ (repo checkout + 3 GB venv), .worktrees/, profiles/ (sibling profiles — recursive!), bin/ (installed binaries), node_modules/ (hundreds of MB). Add _CLONE_ALL_DEFAULT_EXCLUDE_ROOT frozenset with these five entries and pass an ignore callback to copytree(). Exclusions are gated on the source actually being the default profile (is_default_source) so named-profile sources are never affected. Also exclude at any depth: __pycache__/, .pyc, .pyo, .sock, .tmp. Profile data (config.yaml, .env, auth.json, state.db, sessions/, skills/, logs/) is preserved intact — clone-all means 'complete snapshot minus infrastructure'. Mirrors the approach already used by _default_export_ignore() and _DEFAULT_EXPORT_EXCLUDE_ROOT (the export-side exclusion set which is broader because it produces a portable archive, not a live clone). Co-authored-by: MustafaKara7 <karamusti912@gmail.com> Co-authored-by: fahdad <30740087+fahdad@users.noreply.github.com> Fixes #5022 Based on PRs #5025, #5026, and #21728	2026-05-09 04:10:35 -07:00
Zhekinmaksim	4a1840e683	fix(async): replace get_event_loop() with get_running_loop() in async contexts Follow-up to PR #21293 (cli.py), which fixed the same anti-pattern. `asyncio.get_event_loop()` is documented as effectively "always returns the running loop when called from a coroutine" and emits DeprecationWarning/RuntimeWarning in some interpreter configurations. The Python docs explicitly recommend get_running_loop() inside coroutines. Replaces the remaining 9 call sites that are unconditionally inside async def bodies: - tools/browser_cdp_tool.py — _cdp_call() (4 sites): deadline + remaining computations inside the async websockets.connect context manager. - hermes_cli/web_server.py — get_status, _start_device_code_flow, submit_oauth_code (3 sites): all FastAPI async endpoints offloading blocking httpx / PKCE work to run_in_executor. - environments/agent_loop.py — HermesAgentLoop (1 site): tool dispatch inside the async rollout loop. - environments/benchmarks/terminalbench_2/terminalbench2_env.py — rollout_and_score_eval (1 site): test verification thread offload. All 9 sites are unconditionally inside async def bodies, so a running loop is guaranteed and no try/except RuntimeError fallback is needed (unlike the cli.py case in #21293, which ran from a background thread). Behavior is identical on supported Python versions; aligns the codebase with the post-#21293 idiom and avoids future warnings as the deprecation hardens. Salvaged from PR #21930 by @Zhekinmaksim onto current main (the original branch was 109 commits behind and carried unintended stale-branch reverts of unrelated landed changes — _tail_lines encoding=utf-8 and the Windows PTY bridge guard). Only the 9 swaps from the PR's intended scope are applied here.	2026-05-09 02:34:19 -07:00
kshitij	2a7047c2ed	fix(sqlite): fall back to journal_mode=DELETE on NFS/SMB/FUSE (#22043 ) SQLite's WAL mode requires shared-memory (mmap) coordination and fcntl byte-range locks that don't reliably work on network filesystems. Upstream documents this explicitly: https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode On NFS / SMB / some FUSE mounts / WSL1, 'PRAGMA journal_mode=WAL' raises 'sqlite3.OperationalError: locking protocol' (SQLITE_PROTOCOL). Before this change, every feature backed by state.db or kanban.db broke silently: - /resume, /title, /history, /branch returned 'Session database not available.' with no cause - gateway logged the init failure at DEBUG (invisible in errors.log) - kanban dispatcher crashed every 60s, driving the known migration race (duplicate column name: consecutive_failures, #21708 / #21374) Changes: - hermes_state.apply_wal_with_fallback(): shared helper that tries WAL and falls back to DELETE on SQLITE_PROTOCOL-style errors with one WARNING explaining why - hermes_state.get_last_init_error() + format_session_db_unavailable(): capture the init failure cause and surface it in user-facing strings (with an NFS/SMB pointer for 'locking protocol') - hermes_cli/kanban_db.connect(): use the shared helper - gateway/run.py: bump SessionDB init failure log DEBUG -> WARNING (matches cli.py's existing correct behavior) - cli.py (4 sites) + gateway/run.py (5 sites): replace bare 'Session database not available.' with format_session_db_unavailable() Tests: 12 new tests in tests/test_hermes_state_wal_fallback.py + 1 new test in tests/hermes_cli/test_kanban_db.py. Existing suites (state, kanban, gateway, cli) remain green for all tests unrelated to pre-existing failures on main. Evidence: real-world user on NFSv3 mount (172.26.224.200:d2dfac12/home, local_lock=none) reporting 'Session database not available.' on /resume; 'locking protocol' appears in 4 distinct log entries across backup, kanban, TUI, and CLI paths in the same session. closes #22032	2026-05-09 02:09:35 -07:00
teknium1	78b0008f44	fix(gateway): also catch restart TimeoutExpired; friendly message Some checks failed Deploy Site / deploy-vercel (push) Waiting to run Details Deploy Site / deploy-docs (push) Waiting to run Details Docker Build and Publish / build-amd64 (push) Waiting to run Details Docker Build and Publish / build-arm64 (push) Waiting to run Details Docker Build and Publish / merge (push) Blocked by required conditions Details Docker Build and Publish / move-latest (push) Blocked by required conditions Details Lint (ruff + ty) / ruff + ty diff (push) Waiting to run Details Lint (ruff + ty) / ruff enforcement (blocking) (push) Waiting to run Details Lint (ruff + ty) / Windows footguns (blocking) (push) Waiting to run Details Nix / nix (macos-latest) (push) Waiting to run Details Nix / nix (ubuntu-latest) (push) Waiting to run Details OSV-Scanner / Scan lockfiles (push) Waiting to run Details Tests / test (push) Waiting to run Details Tests / e2e (push) Waiting to run Details uv.lock check / uv lock --check (push) Waiting to run Details Build Skills Index / build-index (push) Has been cancelled Details Build Skills Index / deploy-with-index (push) Has been cancelled Details Extends #19994 to the restart path. Dashboard spawns 'hermes gateway restart' in the background; when a wedged adapter websocket pushes drain past the 90s CLI timeout, the dashboard previously surfaced a raw subprocess.TimeoutExpired traceback. Mirror systemd_stop()'s TimeoutExpired catch onto both forcing-restart sites in systemd_restart(). Adds a test that exercises the no-active-pid branch end-to-end.	2026-05-08 18:50:25 -07:00
LeonSGP43	dccf1fb6e0	fix(gateway): cap adapter disconnect during stop	2026-05-08 18:50:25 -07:00
dante	24d3216175	fix(slack): enable writable app home DMs in manifest	2026-05-08 17:01:12 -07:00
adybag14-cyber	7c174e65f7	fix: harden termux update path with uv bootstrap and env guard	2026-05-08 16:49:37 -07:00
Syed Abdur Rehman Ali	f5b635f6ab	feat(cli): recognise Shift+Enter as a newline key Closes #5346. Most terminals send the same byte sequence for `Enter` and `Shift+Enter` by default, so the application can't tell them apart — this is a terminal protocol limitation, not something Hermes can paper over. But terminals that implement the Kitty keyboard protocol (Kitty / foot / WezTerm / Ghostty by default; iTerm2 / Alacritty / VS Code terminal / Warp once the protocol is enabled) DO emit a distinct sequence for `Shift+Enter`: - `\x1b[13;2u` — Kitty / CSI-u, modifier=2 - `\x1b[27;2;13~` — xterm modifyOtherKeys=2 Stock prompt_toolkit doesn't have the CSI-u sequence in its `ANSI_SEQUENCES` table at all, and it maps the modifyOtherKeys variant to plain `Keys.ControlM` (Enter) — i.e. it strips the Shift modifier, which is the bug users actually hit on iTerm2 and friends. This PR adds `hermes_cli/pt_input_extras.install_shift_enter_alias()`, called once at CLI startup from `cli.py`, which inserts/overwrites those sequences in `ANSI_SEQUENCES` so they decode to `(Keys.Escape, Keys.ControlM)` — the same key tuple `Alt+Enter` produces. The existing Alt+Enter newline handler (`@kb.add('escape', 'enter')` in `cli.py`) then fires unchanged, so there is no new keybinding to register and no behavioral change for terminals that don't emit the distinct sequences. Files ===== * `hermes_cli/pt_input_extras.py` — new module hosting the helper. Lives outside `cli.py` so it's importable in tests without dragging in the full CLI runtime (which depends on `fire`, `rich`, etc.). * `cli.py` — calls `install_shift_enter_alias()` once at module import. Wrapped in try/except so prompt_toolkit version drift can't break CLI startup. * `tests/cli/test_cli_shift_enter_newline.py` — 6 tests: - registration of all three byte sequences - overwrite of stock prompt_toolkit's broken modifyOtherKeys mapping - idempotency - parser equivalence: CSI-u Shift+Enter == Alt+Enter - parser equivalence: modifyOtherKeys Shift+Enter == Alt+Enter - plain Enter remains a single key (submit), distinct from the two-key Alt+Enter / Shift+Enter tuple * `website/docs/user-guide/cli.md` — keybinding table updated; new "Shift+Enter compatibility" subsection with a per-terminal status table noting macOS Terminal / stock Windows Terminal cannot distinguish the keystroke at the protocol level. * `website/docs/getting-started/quickstart.md`, `website/docs/guides/tips.md` — short mention pointing readers at the full compatibility note in `cli.md`. Tested ====== pytest tests/cli/test_cli_shift_enter_newline.py # 6 passed Live-tested by triggering `\x1b[13;2u` against the running Vt100Parser (see test). Not exercised in a real terminal end-to-end because that requires a Kitty-protocol-capable host; the test exercises the parser path that drives the live terminal too.	2026-05-08 16:26:51 -07:00
Teknium	d971b26bfd	fix(update): bypass systemd RestartSec after graceful drain (#22101 ) After a clean SIGUSR1 drain, cmd_update passively polled for systemd's auto-restart to fire. Our unit file sets RestartSec=60 (a crash-loop guard), so the voluntary-restart path waited a full minute of dead air before the gateway came back — the user saw 'draining (up to 75s)...' and stared at it. Change: after the drain exits with code 75, call 'reset-failed' + 'start' explicitly. Manual start bypasses RestartSec entirely (RestartSec only governs systemd's own auto-restart logic). Takes about as long as the gateway needs to come up (~1-3s on a warm box) instead of ~60s. The RestartSec=60 default stays — it's the right crash-loop guard for actual crashes. This only short-circuits the voluntary-restart path. Matches the pattern already used in 'hermes gateway restart' (systemd_restart() in hermes_cli/gateway.py, PR #20949). Tests: - tests/hermes_cli/test_update_gateway_restart.py: new test_update_bypasses_restartsec_after_graceful_drain asserts both 'reset-failed hermes-gateway' AND 'start hermes-gateway' (NOT 'restart') are issued after a successful graceful drain. - All existing tests in the affected classes still pass (TestCmdUpdateLaunchdRestart, TestCmdUpdateResetFailedBeforeRestart are green; one pre-existing flake in the latter is unrelated).	2026-05-08 16:11:07 -07:00
Teknium	5089596685	perf(cli): skip eager plugin discovery on known built-in subcommands (#22120 ) `hermes --help` drops from ~700ms to ~180ms; `hermes version` from ~950ms to ~240ms. ~4-5x startup speedup on inspection / diagnostic invocations. Changes: - hermes_cli/main.py: gate the argparse-setup `discover_plugins()` call behind `_plugin_cli_discovery_needed()`. Eager plugin imports (google.cloud.pubsub_v1, aiohttp, grpc, PIL) cost 500-650ms and are pure waste when the user is running a built-in subcommand that doesn't take plugin extensions (`--help`, `version`, `logs`, `config`, `sessions`, etc.). New `_BUILTIN_SUBCOMMANDS` frozenset + `_first_positional_argv` helper handle flag-value skipping (`-m gpt5 chat` → still fast). - hermes_cli/main.py: `cmd_version` now reads the OpenAI SDK version via `importlib.metadata` (~2ms) instead of `import openai` (~800ms of pydantic type-module loading). Agent-running paths (`hermes chat`, `hermes gateway run`) are unaffected — the second `discover_plugins()` call later in `main()` still runs so plugin hooks / tools wire up normally. Tests: - tests/hermes_cli/test_startup_plugin_gating.py: parity test guards the `_BUILTIN_SUBCOMMANDS` set against drift (every registered subparser must be declared; no phantom entries). Behavior tests for flag-value skipping, `--` terminator, inline `--flag=value` form. 37 tests.	2026-05-08 16:07:23 -07:00
Teknium	a54cae60d4	fix(setup): offer gateway service install on Windows (#22099 ) Both setup wizards (hermes setup and hermes gateway setup) gated the service install/start/restart prompts behind 'supports_systemd or is_macos()' and fell through to 'run in foreground' on Windows, even though _is_service_installed() / _is_service_running() already call gateway_windows.is_installed() and the Windows backend has a full install/start/stop/restart contract. Wire the Windows branch into both wizards: - supports_service_manager now includes is_windows(). - Install offer reads 'Scheduled Task service' on Windows. - install() on Windows starts the task inline via schtasks /Run (or direct-spawn fallback) so the separate 'Start the service now?' prompt is skipped. - Start and Restart delegate to gateway_windows.start() / .restart(). hermes_cli/setup.py +30 -4 hermes_cli/gateway.py +28 -4	2026-05-08 14:59:59 -07:00
Teknium	26bac67ef9	fix(entry-points): guard hermes_bootstrap import so partial updates don't brick hermes (#22091 ) teknium1 hit ModuleNotFoundError: No module named 'hermes_bootstrap' after a code update, on both his Windows machine AND his Linux workstation. The failure mode is real and affects every user who updates hermes by any path OTHER than a fully-successful ``hermes update``. ## What happens hermes_bootstrap.py is a top-level module registered via pyproject.toml's ``py-modules`` list (added by Brooklyn's Windows UTF-8 stdio work). It must be registered in the venv's editable-install .pth file before Python can find it as a bare ``import hermes_bootstrap``. ``hermes update`` handles this correctly: (1) git reset --hard, (2) clear __pycache__, (3) uv pip install -e . (re-registers the package including the new py-modules list), (4) restart. BUT if any step AFTER (1) fails — network blip during pip install, PEP 668 on a system Python, venv locked, uv not in PATH, a crash mid-update — the user is left with new code that references hermes_bootstrap and a venv that doesn't know about it. Every hermes invocation after that crashes with ModuleNotFoundError, including ``hermes update`` itself. No recovery path without manual `uv pip install -e .`. Also affects users who ``git pull`` the repo directly without running hermes update — relatively common for developers. ## Fix Wrap ``import hermes_bootstrap`` in a try/except ModuleNotFoundError across all 6 entry points (hermes_cli/main, run_agent, gateway/run, acp_adapter/entry, cli, batch_runner). On Windows, missing bootstrap means the UTF-8 stdio setup doesn't run — degraded behavior (Unicode chars may fail to print) but NOT a crash. POSIX is unaffected either way since the bootstrap is a no-op there. Once hermes is running again, the user can ``hermes update`` to fully recover. ## Test update tests/test_hermes_bootstrap.py::test_entry_point_imports_bootstrap scans for the first top-level import in each entry point and asserts it is hermes_bootstrap. Extended the check to accept a Try block whose body is a lone Import of hermes_bootstrap — that's the recovery-friendly form we just introduced. Verified behavior by ``mv hermes_bootstrap.py hermes_bootstrap.py.bak`` and confirming ``python -c "import hermes_cli.main"`` succeeds. 82/82 tests pass (hermes_bootstrap + windows-native + windows-compat).	2026-05-08 14:43:13 -07:00
Teknium	35fce7699e	feat(windows uninstall): clean up User env, PATH, Scheduled Task, and portable tooling `hermes uninstall` was POSIX-only. On Windows it would leave four classes of installer debris behind that the user had to scrub manually: 1. Scheduled Task and/or Startup-folder .cmd entry that installer.ps1 dropped for `hermes gateway install`. Left running at next logon even after uninstall, pointing at deleted code paths. 2. User-scope PATH entries for the Hermes venv, PortableGit (cmd, bin, usr\bin), and bundled Node, all written to HKCU\Environment\Path. 3. User-scope env vars HERMES_HOME and HERMES_GIT_BASH_PATH, same registry key. 4. PortableGit and Node copies under %LOCALAPPDATA%\hermes\ (~200MB), plus gateway-service/ scratch dir. Fixes: - `uninstall_gateway_service()` gets a Windows branch that calls into `gateway_windows.stop()` + `gateway_windows.uninstall()`, which already know how to remove both schtasks entries and Startup-folder .cmd files and how to stop any running detached pythonw gateway. - `remove_path_from_windows_registry(hermes_home)` reads HKCU\Environment via winreg, strips any PATH entry whose path-prefix matches the installer-owned markers (\hermes-agent, \git, \node, \venv under the current HERMES_HOME), and writes the cleaned value back. Preserves REG_EXPAND_SZ vs REG_SZ so unexpanded %VARS% in the user's PATH survive. No PowerShell subprocess, no fragile `reg query` parsing. - `remove_hermes_env_vars_windows()` deletes HERMES_HOME and HERMES_GIT_BASH_PATH from the same key. - `remove_portable_tooling_windows(hermes_home)` rmtree's `hermes_home/git`, `hermes_home/node`, `hermes_home/gateway-service` — they're installer artifacts, not user data, so they get removed in BOTH "keep data" and "full uninstall" modes. Wired these into `run_uninstall()` guarded by `_is_windows()` so POSIX paths are untouched. Also fixed the closing "Reload your shell" footer to point Windows users at opening a new terminal (PATH changes don't propagate into the current PowerShell session) with the PowerShell install one-liner instead of bash's curl-pipe. Verified on Delta-1 (Windows 10) via preview script: correctly identifies 4 Hermes-installed PATH entries out of 13 total to remove, leaves Python/LM Studio/ripgrep/ffmpeg/winget entries alone.	2026-05-08 14:27:40 -07:00
Teknium	0548facc50	fix(windows): gateway status dedup + install.ps1 platform-SDK bootstrap ## Two residual Windows fixes that were hanging from earlier commits. ### 1. `hermes gateway status` reported 2 PIDs per gateway — TWO bugs compounded Diagnosed with psutil parent/child walk against live gateway PIDs: Bug A (the real one): `_get_parent_pid` silently failed on Windows. The helper shelled out to `ps -o ppid= -p <pid>`, which doesn't exist on Windows — `FileNotFoundError` → returns `None` → the ancestor walk terminated at `os.getpid()` alone. Consequence: the PID table scan in `_scan_gateway_pids` couldn't filter out `hermes gateway status`'s own launcher stub (a venv `pythonw.exe`/`python.exe` that matches the same `-m hermes_cli.main gateway` pattern as the gateway). Every status call saw "itself" as a second gateway. Fix: `_get_parent_pid` now calls `psutil.Process(pid).ppid()` first (psutil is a core dependency since `3dfb35700`) and falls back to `ps` only when `shutil.which("ps")` succeeds — matching the Windows-footgun checker's "always guard `ps` / `wmic` / etc. with `shutil.which`" rule. Before: `Gateway process running (PID: 21952, 46880)` — 46880 changing on every call (the status invocation's own launcher, which died by the time the next status call looked). After (5 consecutive calls): ``` ✓ Gateway process running (PID: 21952) ✓ Gateway process running (PID: 21952) ✓ Gateway process running (PID: 21952) ✓ Gateway process running (PID: 21952) ✓ Gateway process running (PID: 21952) ``` Ancestor walk on the fix: 14 PIDs (full chain through bash/explorer) instead of the broken 1-PID set. Bug B (the cosmetic one): venv-launcher dedup. Standard Windows CPython venv behaviour is that `<venv>/Scripts/pythonw.exe` is a ~5 MB launcher stub that spawns the base Python (`C:\\Program Files\\Python311 \\pythonw.exe`) with the same command line and waits. Our process scanner sees two PIDs for every gateway: launcher + interpreter, same cmdline. Bug A masked this by accidentally counting the status call AS one of them; with Bug A fixed, we see both the real launcher and real interpreter for the gateway process itself. Fix: `_filter_venv_launcher_stubs` at the tail of `_scan_gateway_pids` walks each matched PID's ppid via psutil. Any PID that's the PARENT of another matched PID is a launcher stub — drop it, keep the child. Scoped to Windows (`is_windows() and len(pids) > 1`) and no-ops when psutil isn't importable. Net effect: `gateway status` now reports one PID per gateway — the interpreter — matching POSIX behaviour and user expectations. ### 2. `install.ps1`: bootstrap pip + auto-install platform SDKs New `Install-PlatformSdks` function wired between `Invoke-SetupWizard` and `Start-GatewayIfConfigured`. Fixes two related issues on fresh Windows installs: 1. The tiered `uv pip install` cascade (introduced in `87fca8342`) correctly falls through when tier 1 `.[all]` fails on the RL git deps, but the fallback tiers can silently skip SDKs from `[messaging]` when there's a partial-resolve. Result: user sets `DISCORD_BOT_TOKEN` in `.env`, fires up gateway, hits "discord module not installed". 2. `uv` creates venvs WITHOUT pip by default, so the user's escape hatch (`pip install discord.py` in the venv) doesn't exist either. The new function: - Skips if `-NoVenv` (nothing to bootstrap into). - Scans `~/.hermes/.env` for messaging tokens (TELEGRAM_BOT_TOKEN, DISCORD_BOT_TOKEN, SLACK_BOT_TOKEN, SLACK_APP_TOKEN, WHATSAPP_ENABLED), filtering placeholder values. - For each token that's set, runs `python -c "import <sdk>"` to verify. - If any import fails: runs `python -m ensurepip --upgrade` to bootstrap pip into the venv (idempotent — no-ops if pip is already present), then `pip install <spec>` for each missing SDK with specs mirroring pyproject.toml's `[messaging]` extra to avoid version drift. The `$ErrorActionPreference = "SilentlyContinue"` spans are not cosmetic — PowerShell wraps native-stderr from a non-zero-exit subprocess as a `NativeCommandError` that prints even through `*> $null` / `2>$null`. Save + restore EAP over the import-probe and pip-install blocks keeps the output clean. Verified on this Windows 10 box: - Initial state: telegram+fastapi+psutil present, discord+slack_sdk missing (tier 1 `.[all]` had failed — `.tirith-install-failed` marker in `%LOCALAPPDATA%\\hermes`). - First run with discord+slack tokens in .env: detects both missing, ensurepip (skipped — pip was already bootstrapped earlier this session for telegram), installs `discord.py[voice]==2.7.1` + `PyNaCl` + `davey`, installs `slack-sdk==3.41.0`. All imports succeed on verify. - Second run: all three SDKs report OK, function no-ops. Pip spec strings mirror pyproject.toml's `[messaging]` extra verbatim so a bump to the extra picks up here automatically — no drift. ### Files - `hermes_cli/gateway.py`: `_get_parent_pid` rewritten (psutil-first); `_filter_venv_launcher_stubs` added; `_scan_gateway_pids` dedups launchers on Windows when it finds >1 match. - `scripts/install.ps1`: new `Install-PlatformSdks` function (~85 lines); wired into the main flow at line 1438. ### Verification - `venv/Scripts/python.exe scripts/check-windows-footguns.py --all` → `✓ No Windows footguns found (380 file(s) scanned).` - `ast.parse` passes on gateway.py. - `[System.Management.Automation.Language.Parser]::ParseFile` passes on install.ps1. - Live gateway (PID 21952, running since 12:33 today) survived 5x stress loop of `hermes gateway status` without dying.	2026-05-08 14:27:40 -07:00
Teknium	cc38282b04	feat(cross-platform): psutil for PID/process management + Windows footgun checker ## Why Hermes supports Linux, macOS, and native Windows, but the codebase grew up POSIX-first and has accumulated patterns that silently break (or worse, silently kill!) on Windows: - `os.kill(pid, 0)` as a liveness probe — on Windows this maps to CTRL_C_EVENT and broadcasts Ctrl+C to the target's entire console process group (bpo-14484, open since 2012). - `os.killpg` — doesn't exist on Windows at all (AttributeError). - `os.setsid` / `os.getuid` / `os.geteuid` — same. - `signal.SIGKILL` / `signal.SIGHUP` / `signal.SIGUSR1` — module-attr errors at runtime on Windows. - `open(path)` / `open(path, "r")` without explicit encoding= — inherits the platform default, which is cp1252/mbcs on Windows (UTF-8 on POSIX), causing mojibake round-tripping between hosts. - `wmic` — removed from Windows 10 21H1+. This commit does three things: 1. Makes `psutil` a core dependency and migrates critical callsites to it. 2. Adds a grep-based CI gate (`scripts/check-windows-footguns.py`) that blocks new instances of any of the above patterns. 3. Fixes every existing instance in the codebase so the baseline is clean. ## What changed ### 1. psutil as a core dependency (pyproject.toml) Added `psutil>=5.9.0,<8` to core deps. psutil is the canonical cross-platform answer for "is this PID alive" and "kill this process tree" — its `pid_exists()` uses `OpenProcess + GetExitCodeProcess` on Windows (NOT a signal call), and its `Process.children(recursive=True)` + `.kill()` combo replaces `os.killpg()` portably. ### 2. `gateway/status.py::_pid_exists` Rewrote to call `psutil.pid_exists()` first, falling back to the hand-rolled ctypes `OpenProcess + WaitForSingleObject` dance on Windows (and `os.kill(pid, 0)` on POSIX) only if psutil is somehow missing — e.g. during the scaffold phase of a fresh install before pip finishes. ### 3. `os.killpg` migration to psutil (7 callsites, 5 files) - `tools/code_execution_tool.py` - `tools/process_registry.py` - `tools/tts_tool.py` - `tools/environments/local.py` (3 sites kept as-is, suppressed with `# windows-footgun: ok` — the pgid semantics psutil can't replicate, and the calls are already Windows-guarded at the outer branch) - `gateway/platforms/whatsapp.py` ### 4. `scripts/check-windows-footguns.py` (NEW, 500 lines) Grep-based checker with 11 rules covering every Windows cross-platform footgun we've hit so far: 1. `os.kill(pid, 0)` — the silent killer 2. `os.setsid` without guard 3. `os.killpg` (recommends psutil) 4. `os.getuid` / `os.geteuid` / `os.getgid` 5. `os.fork` 6. `signal.SIGKILL` 7. `signal.SIGHUP/SIGUSR1/SIGUSR2/SIGALRM/SIGCHLD/SIGPIPE/SIGQUIT` 8. `subprocess` shebang script invocation 9. `wmic` without `shutil.which` guard 10. Hardcoded `~/Desktop` (OneDrive trap) 11. `asyncio.add_signal_handler` without try/except 12. `open()` without `encoding=` on text mode Features: - Triple-quoted-docstring aware (won't flag prose inside docstrings) - Trailing-comment aware (won't flag mentions in `# os.kill(pid, 0)` comments) - Guard-hint aware (skips lines with `hasattr(os, ...)`, `shutil.which(...)`, `if platform.system() != 'Windows'`, etc.) - Inline suppression with `# windows-footgun: ok — <reason>` - `--list` to print all rules with fixes - `--all` / `--diff <ref>` / staged-files (default) modes - Scans 380 files in under 2 seconds ### 5. CI integration A GitHub Actions workflow that runs the checker on every PR and push is staged at `/tmp/hermes-stash/windows-footguns.yml` — not included in this commit because the GH token on the push machine lacks `workflow` scope. A maintainer with `workflow` permissions should add it as `.github/workflows/windows-footguns.yml` in a follow-up. Content: ```yaml name: Windows footgun check on: push: branches: [main] pull_request: branches: [main] jobs: check: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: actions/setup-python@v5 with: {python-version: "3.11"} - run: python scripts/check-windows-footguns.py --all ``` ### 6. CONTRIBUTING.md — "Cross-Platform Compatibility" expansion Expanded from 5 to 16 rules, each with message, example, and fix. Recommends psutil as the preferred API for PID / process-tree operations. ### 7. Baseline cleanup (91 → 0 findings) - 14 `open()` sites → added `encoding='utf-8'` (internal logs/caches) or `encoding='utf-8-sig'` (user-editable files that Notepad may BOM) - 23 POSIX-only callsites in systemd helpers, pty_bridge, and plugin tool subprocess management → annotated with `# windows-footgun: ok — <reason>` - 7 `os.killpg` sites → migrated to psutil (see §3 above) ## Verification ``` $ python scripts/check-windows-footguns.py --all ✓ No Windows footguns found (380 file(s) scanned). $ python -c "from gateway.status import _pid_exists; import os > print('self:', _pid_exists(os.getpid())); print('bogus:', _pid_exists(999999))" self: True bogus: False ``` Proof-of-repro that `os.kill(pid, 0)` was actually killing processes before this fix — see commit ``1cbe39914`` and bpo-14484. This commit removes the last hand-rolled ctypes path from the hot liveness-check path and defers to the best-maintained cross-platform answer.	2026-05-08 14:27:40 -07:00
Teknium	324567c936	fix(windows): os.kill(pid, 0) is NOT a no-op on Windows — route through new _pid_exists helper On Windows, Python's ``os.kill(pid, 0)`` is NOT a no-op. CPython's implementation (``Modules/posixmodule.c::os_kill_impl``) treats sig=0 as ``CTRL_C_EVENT`` because the two integer values collide at the C layer, and routes it through ``GenerateConsoleCtrlEvent(0, pid)`` — which sends a Ctrl+C to the ENTIRE console process group containing the target PID, not just the PID itself. Any caller that wanted to check "is PID X alive" via the classic POSIX ``os.kill(pid, 0)`` idiom was silently killing that process (and often unrelated processes in the same console group) on Windows. Long-standing Python Windows quirk; see bpo-14484 (open since 2012). This manifested in Hermes as: every ``hermes gateway status`` invocation would read the gateway's PID from the PID file, call ``os.kill(pid, 0)`` via ``gateway.status.get_running_pid()`` as a "liveness check", and instantly terminate the gateway it was trying to report on. No shutdown log, no traceback, no atexit hook fire, no exit-diag entry — just silent termination of the detached pythonw process. "Bot answered one message then stopped typing" was the characteristic end-user symptom because `os.kill(pid, 0)` fires mid-response-send and kills the gateway between logs. Reproduction (verified in this branch before the fix): $ hermes gateway start # gateway alive, PID 37520 $ hermes gateway status # reports "No gateway process detected" $ tasklist /FI "PID eq 37520" # INFO: No tasks are running # — gateway terminated silently Root-cause fix is a new ``gateway.status._pid_exists(pid)`` helper: - On Windows: Win32 ``OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION \| SYNCHRONIZE, False, pid)`` + ``WaitForSingleObject(handle, 0)`` via ctypes. Zero signal delivery, zero console-group side effects. Pins ctypes return types to avoid DWORD-vs-signed-int parse bugs on WAIT_TIMEOUT (0x102). Distinguishes ERROR_INVALID_PARAMETER (PID gone) from ERROR_ACCESS_DENIED (alive but another user). - On POSIX: the canonical ``os.kill(pid, 0)`` idiom that actually is a no-op there. Then patch every ``os.kill(pid, 0)`` liveness-check callsite to route through ``_pid_exists`` instead. Total 14 callsites across 11 files; every single one was a latent silent-kill on Windows: gateway/run.py:2810 — /restart watcher (inline subprocess) gateway/run.py:15195 — --replace wait loop gateway/status.py:572 — acquire_gateway_runtime_lock stale check gateway/status.py:828 — get_running_pid (THE killer for status) gateway/platforms/whatsapp.py:111 hermes_cli/gateway.py:228, 522, 1012 — gateway-related drain loops hermes_cli/kanban_db.py:2826 — _pid_alive was claiming to be cross-platform but used os.kill(pid, 0) on Windows hermes_cli/main.py:5792 — CLI process-kill polling hermes_cli/profiles.py:782 — profile stop wait loop plugins/google_meet/process_manager.py:74 tools/browser_tool.py:1215, 1255 — browser daemon ownership probes tools/mcp_tool.py:1255, 3374 — MCP stdio orphan tracking The watcher source in gateway/run.py:2810 is a multi-line string that gets spawned as an inline ``python -c "..."`` subprocess, so it can't import gateway.status. The fix for that callsite inlines the same ctypes probe directly into the watcher source. Tested on Windows 10 with the hermes gateway + Telegram bot: - gateway start → alive - 5 consecutive ``hermes gateway status`` invocations → gateway alive after every one, same PID reported each time (37520, 21952) - gateway.log shows uninterrupted operation; no spurious shutdown entries; cron ticker and kanban dispatcher still running on their 60-second cadence - bot continues answering Telegram messages throughout Ships alongside an exit-path diagnostic wrapper in ``hermes_cli/gateway.py::run_gateway()`` that captures every way ``asyncio.run(start_gateway(...))`` can return (success, SystemExit, KeyboardInterrupt, BaseException, atexit) with full traceback to ``logs/gateway-exit-diag.log``. This was used to prove the gateway was being hard-killed externally (no exit event fired) and should be kept for future Windows debugging. Refs: https://bugs.python.org/issue14484 See also: references/windows-subprocess-sigint-storm.md in the hermes-agent skill.	2026-05-08 14:27:40 -07:00
Teknium	9c263fbf8a	feat(windows): gateway as a Scheduled Task + Startup-folder fallback Hermes gateway now installs as a real Windows service via `hermes gateway install`, auto-starts on user logon, and stays running across reboots. Mirrors the launchd (macOS) / systemd (Linux) contract so the rest of the CLI dispatcher just plugs into the same `install / uninstall / start / stop / restart / status` entrypoints. Primary implementation is the new `hermes_cli/gateway_windows.py`: - `schtasks /Create /SC ONLOGON /RL LIMITED /RU <user> /NP /IT` creates a per-user Scheduled Task running as the current user at next logon, with no UAC prompt and no stored password. Same pattern OpenClaw uses. - When `schtasks /Create` returns "Access is denied" or times out (locked-down corporate boxes, 15s/30s hard + no-output cutoffs), fall back to writing a `.cmd` file into `%APPDATA%\Microsoft\Windows\Start Menu\Programs\Startup\`, which Windows Explorer fires at every logon. Either path produces the same end-user experience. - `_spawn_detached()` launches `pythonw.exe -m hermes_cli.main gateway run --replace` directly with `DETACHED_PROCESS \| CREATE_NEW_PROCESS_GROUP \| CREATE_NO_WINDOW \| CREATE_BREAKAWAY_FROM_JOB` + DEVNULL stdio + sidecar `logs/gateway-stdio.log`. Going through pythonw.exe (no console) instead of a cmd.exe shim is what lets the gateway survive the spawning shell's exit on Windows — documented in `references/windows-subprocess-sigint-storm.md`. - Two separate quoting helpers for cmd.exe vs schtasks (`/TR` argument) — they're different parsers and mixing breaks both. Same split OpenClaw documents in src/daemon/schtasks.ts. - `_wait_for_gateway_ready()` + `_report_gateway_start()` poll for a live gateway process after spawn and report the PID, so install doesn't lie about success. Dispatcher wiring in `hermes_cli/gateway.py`: - `_gateway_command_inner()` gets Windows branches for install / uninstall / start / stop / restart / status + `_is_service_installed` + `_is_service_running`. `gateway status` output + suggested commands now mention `hermes gateway install` instead of `sudo hermes gateway install --system` on Windows. Two separable Windows fixes that only matter for a working detached gateway, bundled here because shipping them independently leaves install broken: (1) Spurious CTRL_C_EVENT on detached pythonw runs. When the gateway is launched detached on Windows, something on the boot path (HTTPX / python-telegram-bot / asyncio ProactorEventLoop subprocess plumbing) synthesizes a Ctrl+C within ~60-90 seconds. Python 3.11 translates it into KeyboardInterrupt inside `asyncio.run(start_gateway(...))`, the outer `except KeyboardInterrupt: return` exits cleanly, and the process dies with no shutdown log — "bot started typing, then stopped" is the fingerprint because the interrupt fires mid-send. Fix in `run_gateway()`: when `is_windows()` and stdin is not a TTY, install `signal.signal(SIGINT, SIG_IGN)` + same for SIGBREAK. Real console runs have a TTY and skip the absorber, so user Ctrl+C still works interactively. Same family as commit 449ad952b's browser-tool SIGINT absorber; cross-referenced in the ref doc. (2) `wmic process get` is the process-list path used by `_scan_gateway_pids()` / `find_gateway_pids()`, which power status, stop, and restart on Windows. `C:\Windows\System32\wbem\WMIC.exe` has been deprecated since Windows 10 21H1 and is not installed on modern Win 10/11 boxes, so `find_gateway_pids()` silently returns [] — status sees no gateway even when one is running. Fix: `shutil.which("wmic")` first, fall back to PowerShell's `Get-CimInstance Win32_Process` emitting the same LIST-style `CommandLine=...` / `ProcessId=...` pairs the downstream parser already handles. Zero behavior change on boxes where wmic still works. Verified end-to-end on Windows 10 (Delta-1): - `hermes gateway install` → falls back to Startup folder (access denied on schtasks for this user) + detached pythonw spawn, PID reported correctly. - Gateway connects to Telegram, answers messages, stays alive past 2min (previously died at ~85s with no shutdown log). - `hermes gateway stop` + `uninstall` both clean up both tracks. Refs: openclaw/openclaw src/daemon/schtasks.ts for the ONLOGON + startup-folder-fallback pattern. skill hermes-agent references/windows-subprocess-sigint-storm.md for the deeper CTRL_C_EVENT / ProactorEventLoop background.	2026-05-08 14:27:40 -07:00
emozilla	62b4ebb7db	auth: use get_default_hermes_root() for shared nous_auth.json path Replace hardcoded ~/.hermes/shared/ references with get_default_hermes_root() / 'shared' so the cross-profile Nous auth store lands in the correct location on every platform: - Linux/macOS: ~/.hermes/shared/ - native Windows: %LOCALAPPDATA%\hermes\shared- Docker / custom HERMES_HOME: <root>/shared/ Updates _nous_shared_auth_dir(), the pytest seat-belt in _nous_shared_store_path(), and the auth_add_command comment to match. Previously Windows installs wrote to ~/.hermes/shared/ even though the rest of the CLI uses %LOCALAPPDATA%\hermes, so profiles couldn't see each other's shared credential.	2026-05-08 14:27:40 -07:00
Teknium	03566e5124	fix(windows): auto-install Playwright Chromium + surface it in doctor scripts/install.sh runs 'npx playwright install --with-deps chromium' on every Linux distro after the npm-install step, which is why browser tools Just Work on Linux. scripts/install.ps1 never did the equivalent step, so on native Windows installs check_browser_requirements() in tools/browser_tool.py would return False (no Chromium under %LOCALAPPDATA%\ms-playwright) and every browser_* tool got silently filtered out of the agent's tool schema — no error, no log entry, user just wondered why the tools didn't exist. Two-part fix: 1. scripts/install.ps1: after 'npm install' in InstallDir succeeds, run 'npx playwright install chromium'. Resolves npx via the same execution-policy-aware logic already used for npm (prefer npx.cmd next to npmExe, fall back to Get-Command). Surfaces a warning + manual-recovery hint when the install fails, matching install.sh behaviour for distros. 2. hermes_cli/doctor.py: after the agent-browser check, lazily import tools.browser_tool and reuse the exact same _chromium_installed() predicate check_browser_requirements() uses, so the doctor signal cannot drift from the runtime gate. Skip the check when Camofox / CDP override / a cloud provider / Lightpanda is configured (those bypass local Chromium). On missing Chromium, the hint is platform-correct: '--with-deps' on POSIX, plain 'install chromium' on win32. Verified on Windows 10: - 'npx playwright install chromium' completes successfully, drops Chrome Headless Shell under %LOCALAPPDATA%\ms-playwright - check_browser_requirements() flips from False -> True - 'hermes doctor' now prints either '✓ Playwright Chromium (browser engine)' or '⚠ Playwright Chromium not installed' + fix command - tests/hermes_cli/test_doctor.py: 38/38 pass - tests/tools/test_browser_chromium_check.py: 16/16 pass	2026-05-08 14:27:40 -07:00
Teknium	d1838041e5	feat: Ctrl+Enter inserts newline on Windows Terminal Windows Terminal intercepts Alt+Enter for its fullscreen shortcut, leaving Windows users with no Enter-involving way to insert a newline in the Hermes prompt. Fix it by reclaiming c-j on Windows only: - _bind_prompt_submit_keys now binds c-j (LF) to submit only on POSIX, where thin PTYs (docker exec, some SSH configs) deliver Enter as LF. On Windows plain Enter is always c-m, so c-j is free. - Windows-only prompt binding: c-j inserts a newline. Windows Terminal sends Ctrl+Enter as LF, so the user-facing keystroke is Ctrl+Enter — no terminal settings changes required. - Alt+Enter binding unchanged; still works on mac/Linux/WSL. - Test TestPromptToolkitTerminalCompatibility::test_lf_enter_binds_to_submit_handler split into platform-aware assertions for POSIX vs win32. - Fixed the Ctrl+J claim in hermes_cli/tips.py (was wrong before this commit even on POSIX) to point Windows users at Ctrl+Enter. Tradeoff: on Windows, raw Ctrl+J (without Enter) also inserts a newline, since WT collapses Ctrl+Enter and Ctrl+J to the same c-j keycode. No conflicting Hermes binding existed for Ctrl+J, so this is a harmless side effect.	2026-05-08 14:27:40 -07:00
Teknium	cbce5e93fc	codebase: add encoding='utf-8' to all bare open() calls (PLW1514) Closes the last Python-on-Windows UTF-8 exposure by making every text-mode open() call explicit about its encoding. Before: on Windows, bare open(path, 'r') defaults to the system locale encoding (cp1252 on US-locale installs). That means reading any config/yaml/markdown/json file with non-ASCII content either crashes with UnicodeDecodeError or silently mis-decodes bytes. After: all 89 affected call sites in production code now pass encoding='utf-8' explicitly. Works identically on every platform and every locale, no surprise behavior. Mechanical sweep via: ruff check --preview --extend-select PLW1514 --unsafe-fixes --fix --exclude 'tests,venv,.venv,node_modules,website,optional-skills, skills,tinker-atropos,plugins' . All 89 fixes have the same shape: open(x) or open(x, mode) became open(x, encoding='utf-8') or open(x, mode, encoding='utf-8'). Nothing else changed. Every modified file still parses and the Windows/sandbox test suite is still green (85 passed, 14 skipped, 0 failed across tests/tools/test_code_execution_windows_env.py + tests/tools/test_code_execution_modes.py + tests/tools/test_env_passthrough.py + tests/test_hermes_bootstrap.py). Scope notes: - tests/ excluded: test fixtures can use locale encoding intentionally (exercising edge cases). If we want to tighten tests later that's a separate PR. - plugins/ excluded: plugin-specific conventions may differ; plugin authors own their code. - optional-skills/ and skills/ excluded: skill scripts are user-authored and we don't want to mass-edit them. - website/ and tinker-atropos/ excluded: vendored / generated content. 46 files touched, 89 +/- lines (symmetric replacement). No behavior change on POSIX or on Windows when the file is ASCII; bug fix on Windows when the file contains non-ASCII.	2026-05-08 14:27:40 -07:00

1 2 3 4 5 ...

1825 commits