hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-06 02:41:48 +00:00

Author	SHA1	Message	Date
daixin1204	744079ffe6	fix(curator): prevent false-positive consolidation from substring matching _classify_removed_skills used naive 'in' substring matching to detect whether a removed skill's name appeared in skill_manage arguments. Short/common skill names (api, git, test, foo, etc.) matched incorrectly when they appeared as substrings of longer words in file paths (references/api-design.md) or content (latest, testing). Replace with field-aware matching: - file_path: needle must match a complete filename stem or directory name, with -/_ normalised for variant tolerance - content fields: word-boundary regex (\b) prevents embedding in longer words Also add 3 regression tests covering the false-positive scenarios.	2026-05-04 01:21:23 -07:00
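The false positive described in 744079ffe6 can be illustrated with a minimal sketch. The helper names (`naive_match`, `path_match`, `content_match`) are hypothetical and are not taken from the repository; only the general technique (whole stem or directory comparison with `-`/`_` normalised for paths, a `\b` word-boundary regex for content) comes from the commit message.

```python
import re
from pathlib import PurePosixPath


def naive_match(needle: str, haystack: str) -> bool:
    """The old behaviour: plain substring containment."""
    return needle in haystack


def _normalise(name: str) -> str:
    """Treat '-' and '_' as equivalent (hypothetical helper)."""
    return name.replace("_", "-")


def path_match(needle: str, file_path: str) -> bool:
    """True only if needle equals a whole directory name or the filename stem."""
    path = PurePosixPath(file_path)
    candidates = [_normalise(part) for part in path.parts[:-1]]
    candidates.append(_normalise(path.stem))
    return _normalise(needle) in candidates


def content_match(needle: str, text: str) -> bool:
    """True only if needle appears as a standalone word in text."""
    pattern = r"\b" + re.escape(needle) + r"\b"
    return re.search(pattern, text) is not None
```

With these helpers, `naive_match("api", "references/api-design.md")` is true, while `path_match` rejects the same input because `api` is neither the stem `api-design` nor the directory `references`.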
Clooooode	1964b0565b	test(kanban): add failing test for list_profiles_on_disk with custom HERMES_HOME list_profiles_on_disk() hardcodes Path.home() / ".hermes" / "profiles", ignoring HERMES_HOME when set to a custom root (e.g. /opt/data). Add test_list_profiles_on_disk_custom_root to cover this case. Related to #18442, #18985.	2026-05-04 01:21:14 -07:00
Siddharth Balyan	a11aed1acc	fix(cli): local backend CLI always uses launch directory, stops .env sync of TERMINAL_CWD (#19334) The old CWD heuristic was fooled by: 1. TERMINAL_CWD persisted to .env by `hermes config set terminal.cwd` 2. Inherited TERMINAL_CWD from parent hermes processes 3. Only resolved when config had a placeholder value (not explicit paths) Fix: - load_cli_config() unconditionally uses os.getcwd() for local backend - TERMINAL_CWD always force-exported in CLI mode (overrides stale values) - Gateway sets _HERMES_GATEWAY=1 marker so lazy cli.py imports don't clobber - Remove terminal.cwd from config-set .env sync map (prevents re-poisoning) - Clarify setup wizard label as 'Gateway working directory' Closes #19214	2026-05-04 11:36:19 +05:30
Ben	2f2998bb1b	fix(tui): tolerate npm's peer-flag drop in lockfile comparison `_tui_need_npm_install()` compares the canonical `package-lock.json` against the hidden `node_modules/.package-lock.json` to decide whether `npm install` needs to re-run. npm 9 drops the `"peer": true` field from the hidden lock on dev-deps that are also declared as peers (the canonical lock preserves the dual annotation). That made the check flag 16 packages (`@babel/core`, `@types/node`, `@types/react`, `@typescript-eslint/*`, `react`, `vite`, `tsx`, `typescript`, …) as mismatched on every launch, triggering a runtime `npm install`. Inside the Docker image, that runtime install then fails with EACCES because `/opt/hermes/ui-tui/node_modules/` is root-owned from build time, so `docker run … hermes-agent --tui` prints: Installing TUI dependencies… npm install failed. …and exits 1, with no preview. The empty preview is a second bug: the launcher captured only stderr, but npm 9 writes EACCES to stdout, which was DEVNULL'd. Fixes: - Add `"peer"` to `_NPM_LOCK_RUNTIME_KEYS` so the comparison ignores the non-deterministic field, alongside the existing `"ideallyInert"`. - Capture stdout as well as stderr in the install subprocess so future failures surface a useful preview instead of a bare "failed." line. Regression tests: - `test_no_install_when_only_peer_annotation_differs` — the exact scenario - `test_install_when_version_differs_even_with_peer_drop` — guards against the peer-drop tolerance masking a real version skew On-host impact: the same false-positive was firing on every `hermes --tui` invocation from a normal checkout, silently running a no-op `npm install` each time (it converged because the host's `node_modules/` is writable). Startup time on the TUI should drop noticeably.	2026-05-04 14:13:38 +10:00
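The peer-flag tolerance in 2f2998bb1b amounts to comparing two lockfiles after dropping keys that npm does not write deterministically. The sketch below is an assumption-laden illustration: `locks_differ` and `RUNTIME_KEYS` are hypothetical names, and it assumes both files have already been parsed into dicts with a top-level `packages` mapping; the real `_tui_need_npm_install()` may differ in structure.

```python
# Keys the commit message says are ignored during comparison.
RUNTIME_KEYS = {"peer", "ideallyInert"}


def _strip_runtime_keys(packages: dict) -> dict:
    """Return a copy of the packages mapping without the ignored keys."""
    return {
        name: {k: v for k, v in meta.items() if k not in RUNTIME_KEYS}
        for name, meta in packages.items()
    }


def locks_differ(canonical: dict, hidden: dict) -> bool:
    """True if the two lockfiles disagree on anything other than ignored keys."""
    left = _strip_runtime_keys(canonical.get("packages", {}))
    right = _strip_runtime_keys(hidden.get("packages", {}))
    return left != right
```

A dropped `"peer": true` annotation no longer counts as a mismatch, but a version difference on the same package still does, which is the pair of cases the two regression tests in the commit describe.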
Chris Danis	363cc93674	fix(cron): bump skill usage when cron jobs load skills Cron jobs that reference skills via their skills: config never bumped the usage counters in .usage.json, so the curator could auto-archive skills actively used by cron jobs based on stale timestamps. Now _build_job_prompt() calls bump_use(skill_name) for each successfully loaded skill so the curator sees them as active.	2026-05-03 17:06:48 -07:00
nftpoetrist	808fee151d	fix(auxiliary): propagate explicit_api_key to _try_anthropic() _try_anthropic() lacked the explicit_api_key parameter added to _try_openrouter() in #18768. When resolve_provider_client() is called with provider="anthropic" and an explicit key (e.g. from a fallback_model entry with api_key set), the key was silently ignored — _try_anthropic() always fell back to resolve_anthropic_token(), so the fallback returned None,None for users without a default Anthropic credential configured. Fix: add explicit_api_key: str = None to _try_anthropic() and use explicit_api_key or <pool/env fallback> in both the pool-present and no-pool paths. Pass explicit_api_key=explicit_api_key at the call site in resolve_provider_client(). Symmetric with the _try_openrouter() fix. No behavior change when explicit_api_key is None.	2026-05-03 17:00:55 -07:00
molvikar	74636f9c4a	fix(gateway): clear queued reload-skills notes on new/resume/branch	2026-05-03 17:00:31 -07:00
Kenny Wang	222767e5e8	fix: sanitize Telegram help command mentions	2026-05-03 17:00:09 -07:00
konsisumer	6fda92aa7f	fix(gateway): bridge top-level require_mention to Telegram config Users commonly place `require_mention: true` at the top level of config.yaml alongside `group_sessions_per_user`, expecting it to gate Telegram group messages. The key was silently ignored because the config loader only checked `yaml_cfg["telegram"]["require_mention"]`. When `require_mention` is found at the top level and no telegram-specific value is set, the fix now: - adds it to platforms_data["telegram"]["extra"] so _telegram_require_mention() picks it up via the primary config.extra path - sets TELEGRAM_REQUIRE_MENTION env var for the secondary fallback path A telegram-specific value (telegram.require_mention) still takes precedence over the top-level shorthand. Also corrects telegram.md: bare /cmd without @botname is rejected when require_mention is enabled; only /cmd@botname (bot-menu form) passes. Fixes #3979	2026-05-03 16:59:46 -07:00
clawbot	1bd975c0ba	fix(gateway): suppress duplicate voice transcripts Deduplicate exact and near-exact Discord voice STT transcripts per guild/user over a short window to avoid duplicate delayed agent replies. Adds regression tests for exact and near-duplicate voice transcript suppression.	2026-05-03 16:59:21 -07:00
LeonSGP43	6713274a42	fix(file): strip leaked terminal fences from reads	2026-05-03 16:58:50 -07:00
0xKingBack	3c42024539	fix(curator): pass auxiliary curator api_key/base_url into runtime resolution Curator review fork now forwards per-slot credentials from auxiliary.curator and legacy curator.auxiliary to resolve_runtime_provider, matching the canonical aux task schema. Add regression tests for binding and main fallback.	2026-05-03 16:55:16 -07:00
MrBob	86e64c1d3b	fix(gateway): hide required-arg commands from Telegram menu	2026-05-03 15:29:06 -07:00
Zyproth	a5cae16496	fix(api_server): fall back to default port on malformed API_SERVER_PORT	2026-05-03 15:27:03 -07:00
Zyproth	dfdd7b6e6f	fix(codex-transport): preserve request override headers for xai responses	2026-05-03 15:25:45 -07:00
LeonSGP43	4a2f822137	fix(mcp): reconnect on terminated sessions	2026-05-03 15:23:33 -07:00
teknium1	2658494e81	fix(kanban): add per-path env overrides + dispatcher env injection Layers defense-in-depth on top of the shared-root anchoring (base commit). Changes in hermes_cli/kanban_db.py: - kanban_db_path() now honours HERMES_KANBAN_DB first, then falls through to kanban_home()/kanban.db. - workspaces_root() now honours HERMES_KANBAN_WORKSPACES_ROOT first, then falls through to kanban_home()/kanban/workspaces. - All three overrides (HERMES_KANBAN_HOME, HERMES_KANBAN_DB, HERMES_KANBAN_WORKSPACES_ROOT) now call .expanduser() for consistency. - _default_spawn() injects HERMES_KANBAN_DB and HERMES_KANBAN_WORKSPACES_ROOT into the worker subprocess env. Even when the worker's get_default_hermes_root() resolution somehow disagrees with the dispatcher's (symlinks, unusual Docker layouts), the two processes still open the same SQLite file. Module docstring updated to describe all three overrides and the dispatcher env-injection contract. Tests (tests/hermes_cli/test_kanban_db.py, TestSharedBoardPaths): - test_hermes_kanban_db_pin_beats_kanban_home - test_hermes_kanban_workspaces_root_pin_beats_kanban_home - test_empty_per_path_overrides_fall_through - test_dispatcher_spawn_injects_kanban_db_and_workspaces_root (monkeypatches subprocess.Popen, asserts both env vars reach the child even after HERMES_HOME is rewritten by `hermes -p <profile>`.) Docs: website/docs/reference/environment-variables.md gets entries for the three kanban env vars. This fusion is built on the cleanest of the seven competing PRs that targeted issue #18442: * Base commit (from PR #19350 by @GodsBoy): add `kanban_home()` helper anchored at `get_default_hermes_root()`, reroute all 5 kanban path sites through it (including the 3 sibling log-dir sites that the other six PRs missed), 8-test regression class. * Dispatcher env-var injection approach drawn from PRs #18300 (@quocanh261997) and #19100 (@cg2aigc). * Per-path env overrides drawn from PR #19100 (@cg2aigc). 
* get_default_hermes_root() resolution direction first proposed in PR #18503 (@beibi9966) and PR #18985 (@Gosuj). Closes the duplicate/competing PRs: #18300, #18503, #18670, #18985, #19037, #19056, #19100. Fixes #18442 and #19348. Co-authored-by: quocanh261997 <17986614+quocanh261997@users.noreply.github.com> Co-authored-by: cg2aigc <232694053+cg2aigc@users.noreply.github.com> Co-authored-by: beibi9966 <beibei1988@proton.me> Co-authored-by: Gosuj <123411271+Gosuj@users.noreply.github.com> Co-authored-by: LeonSGP43 <154585401+LeonSGP43@users.noreply.github.com>	2026-05-03 15:13:39 -07:00
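The override order described in 2658494e81 (per-path pin first, then `HERMES_KANBAN_HOME`, then the shared default root) can be sketched as below. This is not the repository's code: the function takes the environment and default root as parameters for testability, and treating an empty value as unset is inferred from the test name `test_empty_per_path_overrides_fall_through`.

```python
from pathlib import Path
from typing import Mapping


def kanban_home(env: Mapping[str, str], default_root: Path) -> Path:
    """HERMES_KANBAN_HOME wins; otherwise use the shared default root."""
    override = env.get("HERMES_KANBAN_HOME", "")
    return Path(override).expanduser() if override else default_root


def kanban_db_path(env: Mapping[str, str], default_root: Path) -> Path:
    """HERMES_KANBAN_DB wins; empty or unset falls through to kanban_home()."""
    override = env.get("HERMES_KANBAN_DB", "")
    if override:
        return Path(override).expanduser()
    return kanban_home(env, default_root) / "kanban.db"
```

Because the dispatcher injects the resolved values into the worker's environment, both processes evaluate the first branch and open the same file even if their default roots disagree.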
GodsBoy	f5bd77b3e1	fix(kanban): anchor board, workspaces, and worker logs at the shared Hermes root The Kanban board is documented as shared across all Hermes profiles, but `kanban_db_path()` and `workspaces_root()` resolved through `get_hermes_home()`, which returns the active profile's HERMES_HOME. When the dispatcher spawned a worker with `hermes -p <profile> --skills kanban-worker chat -q "work kanban task <id>"`, the worker rewrote HERMES_HOME to the profile subdirectory before kanban_db.py imported, opening a profile-local `kanban.db` that did not contain the dispatcher's task. `kanban_show` and `kanban_complete` failed; the dispatcher's row stayed `running` and was retried/crashed. The same defect applied to `_default_spawn`'s log directory and `worker_log_path`, so `hermes kanban tail` did not see the worker's output. Add `kanban_home()` in `hermes_cli/kanban_db.py` that resolves through `HERMES_KANBAN_HOME` (explicit override) then `get_default_hermes_root()`, which already understands the `<root>/profiles/<name>` and Docker / custom HERMES_HOME shapes. Reroute `kanban_db_path`, `workspaces_root`, the `_default_spawn` log directory, `gc_worker_logs`, and `worker_log_path` through it. Profile-specific config, `.env`, memory, and sessions stay isolated as before; only the kanban surface is shared. Add a `TestSharedBoardPaths` regression class to `tests/hermes_cli/test_kanban_db.py` covering: default install, profile-worker convergence, Docker custom HERMES_HOME, Docker profile layout, explicit `HERMES_KANBAN_HOME` override, and a real SQLite round-trip across dispatcher and worker HERMES_HOME perspectives. The dispatcher/worker convergence tests fail on origin/main and pass after the fix. Update the `kanban.md` user-guide page and the misleading docstrings in `kanban_db.py` to describe the shared-root behavior. Fixes #19348	2026-05-03 15:13:39 -07:00
Siddharth Balyan	167b5648ea	Revert "fix(cli): CLI/TUI on local backend always uses launch directory, ignores terminal.cwd (#19242)" (#19329) This reverts commit `9eaddfafa3`.	2026-05-04 00:43:58 +05:30
Siddharth Balyan	9eaddfafa3	fix(cli): CLI/TUI on local backend always uses launch directory, ignores terminal.cwd (#19242) CLI/TUI sessions on the local backend now unconditionally use os.getcwd() as the working directory. The terminal.cwd config value is only consumed by gateway/cron/delegation modes (where there's no shell to cd from). Previously, 'hermes setup' would write an absolute path (e.g. $HOME) into terminal.cwd which then pinned the CLI to that directory regardless of where the user launched hermes from. This was a silent foot-gun: the user's 'cd' was being ignored. Changes: 1. cli.py: Restructured CWD resolution: if TERMINAL_CWD is not already set by the gateway, and the backend is local, always use os.getcwd(). Config terminal.cwd is irrelevant for interactive CLI/TUI sessions. 2. setup.py: Moved the cwd prompt from setup_terminal_backend() to setup_gateway(). It now only appears when configuring messaging platforms and is labeled 'Gateway working directory'. 3. Tests: Rewrote test_cwd_env_respect.py to validate the new behavior: explicit config paths are ignored for CLI, gateway pre-set values are preserved, non-local backends keep their config paths. 4. Docs: Updated configuration.md, profiles.md, and environment-variables.md to clarify that terminal.cwd only affects gateway/cron mode on local backend. Closes #19214	2026-05-04 00:14:36 +05:30
GodsBoy	b8ae8cc801	fix(debug): redact log content at upload time in hermes debug share Apply agent.redact.redact_sensitive_text with force=True to log content captured by _capture_log_snapshot before it reaches upload_to_pastebin. On-disk logs are untouched. Compatible with the off-by-default local redaction policy from #16794: this is upload-time-only and applies regardless of security.redact_secrets because the public paste service is the leak surface. A visible banner is prepended to each uploaded log paste so reviewers know redaction was applied. --no-redact preserves deliberate unredacted sharing for maintainer-coordinated cases. The bug-report, setup-help, and feature-request issue templates direct users to run hermes debug share and paste the resulting public URLs. With redaction off by default per #16794, those uploads have been carrying credentials onto paste.rs and dpaste.com. force=True is non-negotiable: without it, redact_sensitive_text short-circuits at agent/redact.py:322 when the env var is unset, so the fix would silently be a no-op for its target audience. A regression test pins this down. Fixes #19316	2026-05-03 11:42:20 -07:00
Siddharth Balyan	c9a3f36f56	feat: add video_analyze tool for native video understanding (#19301 ) * feat: add video_analyze tool for native video understanding Adds a video_analyze tool that sends video files to multimodal LLMs (e.g. Gemini) for analysis via the OpenRouter-compatible video_url content type. Mirrors vision_analyze in structure, error handling, and registration pattern. Key design: - Base64 encodes entire video (no frame extraction, no ffmpeg dep) - Uses 'video_url' content block type (OpenRouter standard) - Supports mp4, webm, mov, avi, mkv, mpeg formats - 50 MB hard cap, 20 MB warning threshold - 180s minimum timeout (videos take longer than images) - AUXILIARY_VIDEO_MODEL env override, falls back to AUXILIARY_VISION_MODEL - Same SSRF protection, retry logic, and cleanup as vision_analyze Default disabled: registered in 'video' toolset (not in _HERMES_CORE_TOOLS). Users opt in via: hermes tools enable video, or enabled_toolsets=['video']. * feat(video): add models.dev capability pre-check + CONFIGURABLE_TOOLSETS entry - Pre-checks model video capability via models.dev modalities.input before expensive base64 encoding. Fails early with helpful message suggesting video-capable alternatives (gemini, mimo-v2.5-pro). - Passes optimistically if model unknown or lookup fails. - Adds ModelInfo.supports_video_input() helper. - Adds 'video' to CONFIGURABLE_TOOLSETS and _DEFAULT_OFF_TOOLSETS so 'hermes tools enable video' works from CLI. - 8 new tests for the capability check (37 total). * refactor(video): remove models.dev capability pre-check Removes _check_video_model_capability and ModelInfo.supports_video_input. The vision_analyze tool doesn't pre-check image capability either — both tools rely on the same pattern: send request, handle API errors gracefully with categorized user-facing messages. The pre-check was inconsistent (only worked for some providers/models) so drop it for parity. 
* cleanup: compress comments, fix fragile timeout coupling - Replace _VISION_DOWNLOAD_TIMEOUT * 2 with hardcoded 60s (no silent breakage if vision timeout changes independently) - Strip verbose comments and redundant log lines throughout - No behavioral changes	2026-05-04 00:04:36 +05:30
Bartok9	e527240b27	fix(tools): write_file handler now rejects missing 'content'/'path' args instead of silently writing zero-byte files (#19096) Under context pressure, frontier models sometimes emit tool calls with required fields dropped. Previously _handle_write_file() used args.get('content', '') which substituted an empty string for the missing key, returned success with bytes_written=0, and created a zero-byte file on disk. The model had no way to detect the failure. Changes: - Reject calls where 'path' is absent or not a non-empty string - Reject calls where 'content' key is entirely absent (key-presence check, not truthiness), distinguishing a legitimately empty file from a dropped arg - Reject calls where 'content' is a non-string type - All error messages include guidance to re-emit the tool call or switch to execute_code with hermes_tools.write_file() for large payloads - Explicit empty string content (file truncation) continues to work Regression tests added for all four cases: missing path, missing content, explicit-empty content, and wrong content type. Fixes #19096	2026-05-03 08:52:41 -07:00
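The distinction e527240b27 draws between a missing `content` key and an explicitly empty string is a key-presence check rather than a truthiness check. A minimal sketch follows; `validate_write_args` and its error strings are hypothetical and only illustrate the four cases listed in the commit message.

```python
from typing import Optional


def validate_write_args(args: dict) -> Optional[str]:
    """Return an error message for a malformed call, or None if it is valid."""
    path = args.get("path")
    if not isinstance(path, str) or not path:
        return "write_file: 'path' must be a non-empty string"
    # Key presence, not truthiness: "" is a legitimate (truncating) write.
    if "content" not in args:
        return "write_file: 'content' is missing; re-emit the tool call"
    if not isinstance(args["content"], str):
        return "write_file: 'content' must be a string"
    return None
```

Using `args.get('content', '')` instead would make the second and fourth test cases below indistinguishable, which is the silent zero-byte write the commit removes.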
Tranquil-Flow	6b4fb9f878	fix(cron): treat non-dict origin as missing instead of crashing tick ``_resolve_origin`` called ``origin.get('platform')`` on whatever ``job.get('origin')`` returned. The leading ``if not origin: return None`` short-circuited the falsy cases (None, empty dict, "") but a non-empty string passed that guard and then crashed with ``AttributeError: 'str' object has no attribute 'get'`` on every fire attempt. Observed in the wild after a migration script tagged jobs with free-form provenance strings (e.g. ``"combined-digest-replaces-x-and-y-20260503"``). ``mark_job_run`` did record ``last_status: error, last_error: "'str' object has no attribute 'get'"`` once, but the next tick re-loaded the same poisoned origin and crashed identically. The job stayed enabled, fired every tick, and accumulated cascading errors in the log until ``origin`` was patched manually. Replace the falsy guard with ``isinstance(origin, dict)``. Non-dict origins (string, int, list, tuple, float — anything that survived a hand-edit, JSON-script write, or migration) are now treated the same as a missing origin: the job continues with ``deliver`` falling back through its normal home-channel path instead of crashing the scheduler loop. Test parametrises the non-dict shapes that can appear in jobs.json through external writers and asserts ``_resolve_origin`` returns None for each. Note: this fix scope is the non-dict-``origin`` crash only. The ``next_run_at: null`` recurring-job recovery (the second sub-bug in #18722) is independently addressed by the in-flight #18825, which extends the never-silently-disable defense from #16265 to ``get_due_jobs()`` — that approach is well-aligned with the existing recovery pattern and ships fine without a competing change here. Fixes #18722 (non-dict origin crash; recurring-job recovery covered by #18825)	2026-05-03 08:51:50 -07:00
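The guard change in 6b4fb9f878 replaces a falsy check with a type check. The sketch below shows why that matters; `resolve_origin` is a simplified stand-in, and what it returns for a valid dict (the dict itself when a `platform` key is set) is an assumption made for illustration, not the behaviour of the real `_resolve_origin`.

```python
from typing import Any, Optional


def resolve_origin(job: dict) -> Optional[dict]:
    """Treat any non-dict origin as missing instead of calling .get() on it."""
    origin: Any = job.get("origin")
    # `if not origin` would let a non-empty string through and then crash
    # on origin.get(...); isinstance() rejects every non-dict shape.
    if not isinstance(origin, dict):
        return None
    if not origin.get("platform"):
        return None
    return origin
```

A free-form provenance string such as the one quoted in the commit message now yields `None`, so delivery can fall back to its normal path instead of raising `AttributeError` on every tick.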
leprincep35700	b59bb4e351	fix(gateway): preserve home-channel thread targets across restart notifications	2026-05-03 08:47:49 -07:00
Teknium	d87fd9f039	fix(goals): make /goal work in TUI and fix gateway verdict delivery (#19209) /goal was silently broken outside the classic CLI. TUI: /goal was routed through the HermesCLI slash-worker subprocess, which set the goal row in SessionDB but then called _pending_input.put(state.goal); the subprocess has no reader for that queue, so the kickoff message was discarded. No post-turn judge was wired into prompt.submit either, so even a manual kickoff would not continue the goal loop. Intercept /goal in command.dispatch instead, drive GoalManager directly, and return {type: send, notice, message} so the TUI client renders the Goal-set notice and fires the kickoff. Run the judge in _run_prompt_submit after message.complete, surface the verdict via status.update {kind: goal}, and chain the continuation turn after the running guard is released. Gateway: _post_turn_goal_continuation was gated on hasattr(adapter, 'send_message'), but adapters only expose send(). That branch was dead on every platform; users never saw '✓ Goal achieved', 'Continuing toward goal', or budget-exhausted messages. Replace the dead call with adapter.send(chat_id, content, metadata) and drop a broken reference to self._loop. Tests: - tests/tui_gateway/test_goal_command.py: full /goal dispatch matrix (set / status / pause / resume / clear / stop / done / whitespace) plus regressions for slash.exec → 4018 and 'goal' staying in _PENDING_INPUT_COMMANDS. - tests/gateway/test_goal_verdict_send.py: locks in the adapter.send path for done / continue / budget-exhausted and verifies the hook no-ops when no goal is set or the adapter lacks send().	2026-05-03 05:49:12 -07:00
kshitijk4poor	6f2dab248a	fix: update tests for resume_pending semantics + add AUTHOR_MAP entries Tests updated to reflect suspend_recently_active now setting resume_pending=True (preserves session) instead of suspended=True (wipes session history). AUTHOR_MAP entries: millerc79 (#19033), shellybotmoyer (#18915)	2026-05-03 03:54:03 -07:00
nftpoetrist	6c1322b997	fix(slack): close previous handler in connect() to prevent zombie Socket Mode connections SlackAdapter.connect() overwrote self._handler, self._app, and self._socket_mode_task without closing the prior AsyncSocketModeHandler first. If connect() was called a second time on the same adapter (e.g. during a gateway restart or in-process reconnect attempt), the old Socket Mode websocket stayed alive. Both the old and new connections received every Slack event and dispatched it twice — producing double responses with different wording, the same bug that affected DiscordAdapter (#18187, fixed in #18758). Fix: add a close-before-reassign guard at the start of the connection setup path, mirroring the guard DiscordAdapter.connect() already has. When self._handler is None (fresh adapter, first connect()) the block is a harmless no-op. Scoped to the handler/app fields only — no behavior change for any path that does not call connect() twice. Fixes #18980	2026-05-03 03:47:49 -07:00
0xyg3n	19ba9e43b6	fix(gateway/discord): require allowlist auth on slash commands Slash commands (_run_simple_slash, _handle_thread_create_slash) bypassed every DISCORD_ALLOWED_* gate enforced by on_message. Any guild member could invoke /background (RCE via terminal), /restart, /model, /skill, etc. CVSS 9.8 Critical. - _evaluate_slash_authorization mirrors on_message gates (user, role, channel, ignored channel) with fail-closed semantics - _check_slash_authorization sends ephemeral reject + logs + admin alert - Auth gate runs before defer() so rejections are ephemeral - /skill autocomplete returns [] for unauthorized users (no catalog leak) - Component views (ExecApproval, SlashConfirm, UpdatePrompt, ModelPicker) now honor role allowlists via shared _component_check_auth helper - Optional DISCORD_HIDE_SLASH_COMMANDS defense-in-depth - Cross-platform admin alert (Telegram/Slack fallback) on unauthorized attempts Based on PR #18125 by @0xyg3n.	2026-05-03 03:44:55 -07:00
kshitijk4poor	5d5b8912be	test: add tests for cmd_key preservation through name clamping - TestClampCommandNamesTriples: unit tests for 3-tuple support in _clamp_command_names (short names, long names, collisions, multiple entries, backward compat with 2-tuples) - TestDiscordSkillCmdKeyDispatch: integration test through the full discord_skill_commands pipeline verifying long skill names retain their original cmd_key after clamping - Add contributor CharlieKerfoot to AUTHOR_MAP	2026-05-03 03:25:45 -07:00
kshitij	457c7b76cd	feat(openrouter): add response caching support (#19132) Enable OpenRouter's response caching feature (beta) via X-OpenRouter-Cache headers. When enabled, identical API requests return cached responses for free (zero billing), reducing both latency and cost. Configuration via config.yaml: openrouter: response_cache: true # default: on response_cache_ttl: 300 # 1-86400 seconds Changes: - Add openrouter config section to DEFAULT_CONFIG (response_cache + TTL) - Add build_or_headers() in auxiliary_client.py that builds attribution headers plus optional cache headers based on config - Replace inline _OR_HEADERS dicts with build_or_headers() at all 5 sites: run_agent.py __init__, _apply_client_headers_for_base_url(), and auxiliary_client.py _try_openrouter() + _to_async_client() - Add _check_openrouter_cache_status() method to AIAgent that reads X-OpenRouter-Cache-Status from streaming response headers and logs HIT/MISS status - Document in cli-config.yaml.example - Add 28 tests (22 unit + 6 integration) Ref: https://openrouter.ai/docs/guides/features/response-caching	2026-05-03 01:54:24 -07:00
Henkey	9987f3d824	fix(acp): compact Zed tool replay rendering	2026-05-03 01:44:23 -07:00
Henkey	19854c7cd2	Schedule ACP history replay and fence file output	2026-05-03 01:44:23 -07:00
Henkey	eb612f5574	fix(acp): keep web extract rendering compact	2026-05-03 01:44:23 -07:00
Henkey	b294d1d022	fix(acp): keep read-file starts compact	2026-05-03 01:44:23 -07:00
Henkey	72c8037a24	fix(acp): polish common tool rendering	2026-05-03 01:44:23 -07:00
Henkey	ef9a08a872	fix(acp): polish Zed context and tool rendering	2026-05-03 01:44:23 -07:00
Henkey	e26f9b2070	fix(acp): route Zed thoughts to reasoning callbacks	2026-05-03 01:44:23 -07:00
helix4u	4f37669170	fix(tools): reconfigure enabled unconfigured toolsets	2026-05-03 00:33:02 -07:00
helix4u	d409a4409c	fix(model): avoid bedrock credential probe in provider picker	2026-05-03 00:32:55 -07:00
liuhao1024	af98122793	fix(auxiliary): propagate explicit_api_key to _try_openrouter() When resolve_provider_client() passes explicit_api_key for OpenRouter auxiliary tasks, _try_openrouter() now accepts and honors this parameter instead of silently ignoring it and falling back to OPENROUTER_API_KEY env var. Root cause: _try_openrouter() had no explicit_api_key parameter, so even when callers wanted to pass a runtime credential pool key, it could not be used. Fix: - Add explicit_api_key: str = None parameter to _try_openrouter() - Prioritize explicit_api_key over pool key and env var - Update resolve_provider_client() call site to pass explicit_api_key Regression coverage: - Test that explicit_api_key is passed to OpenAI client when provided - Test that fallback to OPENROUTER_API_KEY still works when explicit_api_key is None Closes #18338	2026-05-02 02:27:49 -07:00
teknium1	762eb79f1e	fix(gateway): tighten httpx keepalive and close whatsapp typing-response leak (#18451) Two mitigations for the CLOSE_WAIT accumulation reported against QQ Bot + Feishu on macOS behind Cloudflare Warp. 1. Shared httpx.Limits helper (gateway/platforms/_http_client_limits.py). Every long-lived platform adapter now constructs httpx.AsyncClient with max_keepalive_connections=10 and keepalive_expiry=2.0, vs httpx's default of unbounded keepalive pool and 5.0s expiry. On macOS/Warp the default 5s window let idle keepalive sockets sit in CLOSE_WAIT long enough for seven persistent adapters (QQ Bot, WeCom, DingTalk, Signal, BlueBubbles, WeCom-callback, plus the transient Feishu helper) to compound to the 256-fd ulimit. Tunable via HERMES_GATEWAY_HTTPX_KEEPALIVE_EXPIRY and HERMES_GATEWAY_HTTPX_MAX_KEEPALIVE env vars. 2. whatsapp.send_typing aiohttp leak. The call was 'await self._http_session.post(...)' with no 'async with' and no variable capture, so the ClientResponse went out of scope unclosed, holding its TCP socket in CLOSE_WAIT until GC. Fixed by wrapping in 'async with'. This was the only bare-await aiohttp leak in the gateway/tools/plugins tree per audit; all other aiohttp sites use the context-manager pattern correctly. The underlying reporter also saw Feishu SDK (lark-oapi) connections in CLOSE_WAIT; those are inside the SDK and out of our direct control, but tightening httpx keepalive across adapters reduces the aggregate pool pressure regardless of which individual adapter leaks.	2026-05-02 02:23:37 -07:00
beibi9966	38dd057e91	fix(feishu): finalize remote document downloads inside httpx.AsyncClient context (#18502) Snapshot Content-Type and body while the client context is still active so pooled connections fully release on exit. Previously the read happened after `async with httpx.AsyncClient(...)` returned, which works today only because httpx eagerly buffers non-streaming responses; a future refactor to `.stream()` would silently read-after-close. Part of the #18451 connection-hygiene audit. Salvage of #18502.	2026-05-02 02:23:37 -07:00
luyao618	13f344c5ce	fix(agent): try fallback providers at init when primary credential pool is exhausted (#17929) When a provider's credential pool has a single entry in 429-cooldown, resolve_provider_client returns None and AIAgent.__init__ raises a misleading RuntimeError suggesting the API key is missing, even when valid fallback_providers are configured. This patch makes __init__ iterate the fallback chain before raising, mirroring the existing in-flight fallback logic in the request loop. If a fallback resolves, the agent initializes against it and sets _fallback_activated=True so _restore_primary_runtime can pick the primary back up after cooldown. Closes #17929	2026-05-02 02:09:46 -07:00
Teknium	1dce908930	fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log) (#18761) * fix(gateway): config.yaml wins over .env for agent/display/timezone settings Regression from the silent config→env bridge. The bridge at module import time is correct for max_turns (unconditional overwrite), but every other agent.*, display.*, timezone, and security bridge key was guarded by 'if X not in os.environ', so a stale .env entry from an old 'hermes setup' run would shadow the user's current config.yaml indefinitely. Symptom: agent.max_turns: 500 in config.yaml, HERMES_MAX_ITERATIONS=60 in .env from an old setup, and the gateway silently capped at 60 iterations per turn. Gateway logs confirmed api_calls never exceeded 60. Three changes: 1. gateway/run.py: drop the 'not in os.environ' guards for all agent.*, display.*, timezone, and security.* bridge keys. config.yaml is now authoritative for these settings, the same semantics already in place for max_turns, terminal.*, and auxiliary.*. Also surface the bridge failure (previously 'except Exception: pass') to stderr so operators see bridge errors instead of silently falling back to .env. 2. gateway/run.py: INFO-log the resolved max_iterations at gateway start so operators can verify the config→env bridge did the right thing instead of chasing a phantom budget ceiling. 3. hermes_cli/setup.py: stop writing HERMES_MAX_ITERATIONS to .env in the setup wizard. config.yaml is the single source of truth. Also clean up any stale .env entry left behind by pre-fix setups. Regression tests in tests/gateway/test_config_env_bridge_authority.py guard each config→env key against the 'stale .env shadows config' bug. * fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log) Three issues observed in production gateway.log during a rapid restart chain on 2026-05-02, all fixed here. 1. _send_restart_notification logged unconditional success adapter.send() catches provider errors (e.g. 
Telegram 'Chat not found') and returns SendResult(success=False); it never raises. The caller ignored the return value and always logged 'Sent restart notification to <chat>' at INFO, producing a misleading success line directly below the 'Failed to send Telegram message' traceback on every boot. Now inspects result.success and logs WARNING with the error otherwise. 2. WhatsApp bridge SIGTERM on shutdown classified as fatal error _check_managed_bridge_exit() saw the bridge's returncode -15 (our own SIGTERM from disconnect()) and fired the full fatal-error path, producing 'ERROR ... WhatsApp bridge process exited unexpectedly' plus 'Fatal whatsapp adapter error (whatsapp_bridge_exited)' on every planned shutdown, immediately before the normal '✓ whatsapp disconnected'. Adds a _shutting_down flag that disconnect() sets before the terminate, and _check_managed_bridge_exit() returns None for returncode in {0, -2, -15} while shutting down. OOM-kill (137) and other non-signal exits still hit the fatal path. 3. restart_drain_timeout default 60s → 180s On 2026-05-02 01:43:27 a user /restart fired while three agents were mid-API-call (82s, 112s, 154s into their turns). The 60s drain budget expired and all three were force-interrupted. 180s covers realistic in-flight agent turns; users on very-long-reasoning models can still raise it further via agent.restart_drain_timeout in config.yaml. Existing explicit user values are preserved by deep-merge. Tests - tests/gateway/test_restart_notification.py: two new tests assert INFO is only logged on SendResult(success=True) and WARNING with the error string is logged on SendResult(success=False). - tests/gateway/test_whatsapp_connect.py: parametrized test for returncode in {0, -2, -15} proves shutdown-time exits are suppressed; separate test proves returncode 137 (SIGKILL/OOM) still surfaces as fatal even when _shutting_down is set. 
- _check_managed_bridge_exit() reads _shutting_down via getattr-with- default so existing _make_adapter() test helpers that bypass __init__ (pitfall #17 in AGENTS.md) keep working unmodified.	2026-05-02 02:08:06 -07:00
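The shutdown-time exit classification from item 2 of the commit above can be sketched as follows. This is an illustration under assumptions: `classify_bridge_exit` is a hypothetical standalone function, while the real check lives in `_check_managed_bridge_exit()` and reads a `_shutting_down` attribute on the adapter.

```python
from typing import Optional

# Return codes treated as a clean stop while shutting down:
# 0 (normal exit), -2 (killed by SIGINT) and -15 (killed by SIGTERM).
EXPECTED_SHUTDOWN_CODES = {0, -2, -15}


def classify_bridge_exit(returncode: Optional[int], shutting_down: bool) -> Optional[str]:
    """Return an error string for an unexpected exit, or None if all is well."""
    if returncode is None:
        return None  # process is still running
    if shutting_down and returncode in EXPECTED_SHUTDOWN_CODES:
        return None  # our own terminate() during a planned shutdown
    # Anything else (e.g. 137 from an OOM kill) stays on the fatal path,
    # even while a shutdown is in progress.
    return f"bridge exited unexpectedly (returncode={returncode})"
```

Gating the suppression on both the flag and the return code keeps a SIGTERM received outside a planned shutdown visible as an error.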
Teknium	5eac6084bc	fix(discord): warn on 32-char clamp collisions in the /skill collector (#18759 ) Discord's per-command name limit is 32 chars. When two skill slugs share the same first 32 chars (or a skill slug clamps onto a reserved gateway command name), only the first seen wins — the second is dropped from the /skill autocomplete. The old behavior incremented a ``hidden`` counter silently, so skill authors had no way to discover the drop short of noticing their skill was missing from the picker. Not an actively-biting bug today (no collisions on the default catalog as of 2026-05), but a landmine the moment someone ships a skill with a long name. The earlier series in #18745 / #18753 / #18754 dropped the other silent data-loss paths in the Discord /skill collector; this one lights up the last remaining one. Fix: promote ``_names_used`` from a set to a dict keyed by the clamped name, mapping to the source cmd_key (or a ``"<reserved>"`` sentinel for names inherited via ``reserved_names``). On collision, log a WARNING naming both sides — the winner, the loser, the clamped name, and what to rename. Two phrasings: * skill-vs-skill — "both clamp to X on Discord's 32-char command-name limit; only the winner appears in /skill. Rename one skill's frontmatter ``name:`` to differ in its first 32 chars." * skill-vs-reserved — "collides with a reserved gateway command name; the skill will not appear in /skill. Rename the skill's frontmatter ``name:``." Tests: three cases in ``tests/hermes_cli/test_discord_skill_clamp_warning.py`` — skill-vs-skill collision (warning names both cmd_keys + clamped prefix), skill-vs-reserved collision (warning uses the distinct phrasing), and a no-collision negative (zero warnings emitted).	2026-05-02 02:05:01 -07:00
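A rough sketch of the set-to-dict promotion described in the commit above. `collect_skill_names` and its return shape are assumptions made for illustration. Only the 32-character limit, the `"<reserved>"` sentinel and the two warning cases come from the commit message.

```python
import logging

logger = logging.getLogger("skill_clamp_sketch")

DISCORD_NAME_LIMIT = 32
RESERVED = "<reserved>"  # sentinel for names inherited from reserved_names


def collect_skill_names(cmd_keys, reserved_names=()):
    """Map each clamped name to the cmd_key that owns it; report the dropped keys."""
    names_used = {name[:DISCORD_NAME_LIMIT]: RESERVED for name in reserved_names}
    dropped = []
    for cmd_key in cmd_keys:
        clamped = cmd_key[:DISCORD_NAME_LIMIT]
        winner = names_used.get(clamped)
        if winner is None:
            names_used[clamped] = cmd_key  # first seen wins
            continue
        dropped.append(cmd_key)
        if winner == RESERVED:
            logger.warning(
                "Skill %r collides with a reserved gateway command name (%r); "
                "it will not appear in /skill.",
                cmd_key, clamped,
            )
        else:
            logger.warning(
                "Skills %r and %r both clamp to %r on Discord's 32-char limit; "
                "only %r appears in /skill.",
                winner, cmd_key, clamped, winner,
            )
    return names_used, dropped
```

Keeping the owning `cmd_key` as the dict value is what lets the warning name both the winner and the loser, which a plain set of clamped names cannot do.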
teknium1	e363ced3c3	test(discord): regression coverage for zombie-websocket guard in connect() Covers PR #18224 fix for issue #18187 — when DiscordAdapter.connect() is called a second time without an intervening disconnect(), the previous commands.Bot must be closed before a new one is created. Otherwise both websockets stay connected to Discord's gateway and both fire on_message, producing double responses with different wording.	2026-05-02 02:04:14 -07:00
teknium1	0a6865b328	test(credential_pool): regression coverage for .env vs os.environ precedence Covers PR #18256 fix for issue #18254 — when OPENROUTER_API_KEY is set in BOTH os.environ (stale from parent shell) and ~/.hermes/.env (fresh), _seed_from_env must prefer the .env value. Also guards the fallback case where .env omits the key entirely (Docker/K8s/systemd deployments that only inject via runtime env).	2026-05-02 02:00:32 -07:00
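The precedence rule that this test guards can be sketched as follows. `pick_api_key` is a hypothetical helper, not `_seed_from_env` itself, and it takes both sources as plain mappings so the example does not touch the real environment.

```python
from typing import Mapping, Optional


def pick_api_key(
    name: str,
    dotenv_values: Mapping[str, str],
    environ: Mapping[str, str],
) -> Optional[str]:
    """Prefer the value from the .env file; fall back to the process environment."""
    value = dotenv_values.get(name)
    if value:
        return value  # fresh value from ~/.hermes/.env wins over a stale shell export
    # .env omits the key (Docker/K8s/systemd style injection): use runtime env.
    return environ.get(name) or None
```

For example, `pick_api_key("OPENROUTER_API_KEY", {"OPENROUTER_API_KEY": "fresh"}, {"OPENROUTER_API_KEY": "stale"})` returns `"fresh"`.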
Teknium	10297fa23c	fix(discord): `/reload-skills` now refreshes the `/skill` autocomplete live (#18754 ) `_register_skill_group` captured the skill catalog in closure variables (`entries` and `skill_lookup`) so the single `tree.add_command` call at startup owned the only live copy. The closure is never re-entered after startup, so `/reload-skills` — which rescans the on-disk skills dir and refreshes the in-process `_skill_commands` registry — had no way to propagate results into the `/skill` autocomplete on Discord. New skills stayed invisible in the dropdown, and deleted skills returned "Unknown skill" when the stale autocomplete entry was clicked. The fix is purely a dataflow change: promote `entries` and `skill_lookup` to instance attributes (`_skill_entries`, `_skill_lookup`), split the collector-driven rebuild into a helper (`_refresh_skill_catalog_state`), and add a public `refresh_skill_group()` method that re-runs the helper and is safe to call at any point after the initial registration. The gateway's `_handle_reload_skills_command` then iterates `self.adapters` and calls `refresh_skill_group()` on any adapter that exposes it (currently only Discord). Both sync and async implementations are supported; adapters that don't override the method (Telegram's BotCommand menu, Slack subcommand map, etc.) are silently skipped — the in-process `reload_skills()` call covers them. No `tree.sync()` is required because Discord fetches autocomplete options dynamically on every keystroke — mutating the instance state the callbacks already read from is sufficient. That sidesteps the per-app command-bucket rate limit (~5 writes / 20 s) that made the previous bulk-sync-on-reload approach unusable (#16713 context). 
Tests: tests/gateway/test_reload_skills_discord_resync.py — five cases covering (1) refresh replaces entries, (2) entries stay sorted after refresh, (3) collector exception leaves cached state intact, (4) `_refresh_skill_catalog_state` populates the instance attrs, (5) orchestrator calls `refresh_skill_group()` on sync + async adapters and skips adapters that don't expose it.	2026-05-02 02:00:11 -07:00
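The orchestration step in the commit above (call `refresh_skill_group()` on adapters that expose it, support sync and async implementations, skip the rest) might look roughly like this. The three adapter classes and `refresh_adapters` are hypothetical stand-ins. Only the method name `refresh_skill_group` comes from the commit message.

```python
import asyncio
import inspect


class SyncAdapter:
    def refresh_skill_group(self):
        self.refreshed = True


class AsyncAdapter:
    async def refresh_skill_group(self):
        self.refreshed = True


class PlainAdapter:
    """Exposes no refresh hook, so the orchestrator skips it."""


async def refresh_adapters(adapters):
    """Call refresh_skill_group() wherever it exists; return the adapters touched."""
    touched = []
    for adapter in adapters:
        refresh = getattr(adapter, "refresh_skill_group", None)
        if refresh is None:
            continue  # adapter has no live catalog to update
        result = refresh()
        if inspect.isawaitable(result):
            await result  # async implementation
        touched.append(type(adapter).__name__)
    return touched
```

Running `asyncio.run(refresh_adapters([SyncAdapter(), AsyncAdapter(), PlainAdapter()]))` would touch the first two adapters and leave the third alone.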
Teknium	6ec74aec07	fix(gateway): match disabled/optional skills by frontmatter slug, not dir name (#18753 ) _check_unavailable_skill is meant to turn a typed "/foo" command that doesn't resolve into a specific hint — "disabled, enable with hermes skills config" or "available but not installed, install with hermes skills install …" — instead of the generic "unknown command" reply. It was doing the match with `skill_md.parent.name.lower().replace("_", "-")`, comparing that to the typed command. For every skill whose directory name drifted from its declared frontmatter `name:`, that comparison failed and the user got the unhelpful generic path. On a standard install today 19 skills have this drift, e.g.: dir: mlops/stable-diffusion frontmatter: name: Stable Diffusion Image Generation registered slug (what the user types): /stable-diffusion-image-generation dir: mlops/qdrant frontmatter: name: Qdrant Vector Search registered slug: /qdrant-vector-search dir: mlops/flash-attention frontmatter: name: Optimizing Attention Flash registered slug: /optimizing-attention-flash In every case, _check_unavailable_skill would fall through because "stable-diffusion" != "stable-diffusion-image-generation", even with the skill sitting right there on disk. Fix: extract a small `_skill_slug_from_frontmatter` helper that reads the SKILL.md frontmatter and normalizes exactly like scan_skill_commands (lower, spaces/underscores → hyphens, strip non-[a-z0-9-], collapse runs of hyphens, strip edges). Use it in both the disabled-skills branch and the optional-skills branch. The disabled-set membership check now uses the declared frontmatter name (which is what `hermes skills config` writes into skills.disabled / platform_disabled), not the slug. Tests: five cases in tests/gateway/test_unavailable_skill_hint.py — the drift case for the disabled branch, unknown-command negative, matched-but-not-disabled negative, non-alnum stripping, and the drift case for the optional-skills branch. 
All five fail against main and pass with the fix.	2026-05-02 02:00:09 -07:00
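The normalization steps listed in the commit above (lowercase, spaces/underscores to hyphens, strip non-`[a-z0-9-]`, collapse hyphen runs, strip edges) can be sketched as below. `slug_from_name` is a hypothetical name, and the sketch covers only the name-to-slug step, not the reading of SKILL.md frontmatter. The actual `scan_skill_commands` logic is not shown in the log and may differ in detail.

```python
import re


def slug_from_name(name: str) -> str:
    """Normalize a frontmatter `name:` value into a command slug."""
    slug = name.lower()
    slug = re.sub(r"[\s_]+", "-", slug)     # spaces/underscores -> hyphens
    slug = re.sub(r"[^a-z0-9-]", "", slug)  # drop everything else
    slug = re.sub(r"-{2,}", "-", slug)      # collapse runs of hyphens
    return slug.strip("-")                  # strip leading/trailing hyphens


print(slug_from_name("Stable Diffusion Image Generation"))
# stable-diffusion-image-generation
```

With this in place, a skill in `mlops/stable-diffusion` is matched by its declared name rather than by the directory name `stable-diffusion`.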


3071 commits