hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-05 02:31:47 +00:00

Author	SHA1	Message	Date
Tranquil-Flow	6b4fb9f878	fix(cron): treat non-dict origin as missing instead of crashing tick ``_resolve_origin`` called ``origin.get('platform')`` on whatever ``job.get('origin')`` returned. The leading ``if not origin: return None`` short-circuited the falsy cases (None, empty dict, "") but a non-empty string passed that guard and then crashed with ``AttributeError: 'str' object has no attribute 'get'`` on every fire attempt. Observed in the wild after a migration script tagged jobs with free-form provenance strings (e.g. ``"combined-digest-replaces-x-and-y-20260503"``). ``mark_job_run`` did record ``last_status: error, last_error: "'str' object has no attribute 'get'"`` once, but the next tick re-loaded the same poisoned origin and crashed identically. The job stayed enabled, fired every tick, and accumulated cascading errors in the log until ``origin`` was patched manually. Replace the falsy guard with ``isinstance(origin, dict)``. Non-dict origins (string, int, list, tuple, float — anything that survived a hand-edit, JSON-script write, or migration) are now treated the same as a missing origin: the job continues with ``deliver`` falling back through its normal home-channel path instead of crashing the scheduler loop. Test parametrises the non-dict shapes that can appear in jobs.json through external writers and asserts ``_resolve_origin`` returns None for each. Note: this fix scope is the non-dict-``origin`` crash only. The ``next_run_at: null`` recurring-job recovery (the second sub-bug in #18722) is independently addressed by the in-flight #18825, which extends the never-silently-disable defense from #16265 to ``get_due_jobs()`` — that approach is well-aligned with the existing recovery pattern and ships fine without a competing change here. Fixes #18722 (non-dict origin crash; recurring-job recovery covered by #18825)	2026-05-03 08:51:50 -07:00
JasonOA888	69dd0f7cf1	fix(approval): extend sensitive write target to cover shell RC and credential files Terminal commands can write to shell RC files (~/.bashrc, ~/.zshrc, ~/.profile) and credential files (~/.netrc, ~/.pgpass, ~/.npmrc, ~/.pypirc) via redirection or tee without triggering approval, even though write_file already blocks these paths in file_safety.py. This creates an inconsistency: write_file protects these paths but terminal shell redirections bypass the same protection. An agent prompted via indirect injection could install persistent backdoors (e.g. PATH manipulation, alias overrides) or write credential entries without user approval. Extend _SENSITIVE_WRITE_TARGET with two new regex groups matching the same paths that file_safety.py's WRITE_DENIED_PATHS already covers: _SHELL_RC_FILES — ~/.bashrc, ~/.zshrc, ~/.profile, ~/.bash_profile, ~/.zprofile _CREDENTIAL_FILES — ~/.netrc, ~/.pgpass, ~/.npmrc, ~/.pypirc All 130 existing tests pass.	2026-05-03 08:49:13 -07:00
teknium1	3c59566cc5	chore(release): map leprincep35700 email for PR #18440 salvage	2026-05-03 08:47:49 -07:00
leprincep35700	b59bb4e351	fix(gateway): preserve home-channel thread targets across restart notifications	2026-05-03 08:47:49 -07:00
Teknium	d87fd9f039	fix(goals): make /goal work in TUI and fix gateway verdict delivery (#19209 ) /goal was silently broken outside the classic CLI. TUI: /goal was routed through the HermesCLI slash-worker subprocess, which set the goal row in SessionDB but then called _pending_input.put(state.goal) — the subprocess has no reader for that queue, so the kickoff message was discarded. No post-turn judge was wired into prompt.submit either, so even a manual kickoff would not continue the goal loop. Intercept /goal in command.dispatch instead, drive GoalManager directly, and return {type: send, notice, message} so the TUI client renders the Goal-set notice and fires the kickoff. Run the judge in _run_prompt_submit after message.complete, surface the verdict via status.update {kind: goal}, and chain the continuation turn after the running guard is released. Gateway: _post_turn_goal_continuation was gated on hasattr(adapter, 'send_message'), but adapters only expose send(). That branch was dead on every platform — users never saw '✓ Goal achieved', 'Continuing toward goal', or budget-exhausted messages. Replace the dead call with adapter.send(chat_id, content, metadata) and drop a broken reference to self._loop. Tests: - tests/tui_gateway/test_goal_command.py — full /goal dispatch matrix (set / status / pause / resume / clear / stop / done / whitespace) plus regressions for slash.exec → 4018 and 'goal' staying in _PENDING_INPUT_COMMANDS. - tests/gateway/test_goal_verdict_send.py — locks in the adapter.send path for done / continue / budget-exhausted and verifies the hook no-ops when no goal is set or the adapter lacks send().	2026-05-03 05:49:12 -07:00
Teknium	55647a5813	fix(whatsapp): pin protobufjs >=7.5.5 via npm overrides to clear 3 critical vulns (#19204 ) The whatsapp-bridge pulls @whiskeysockets/baileys at a pinned git commit whose transitive dep tree ships protobufjs <7.5.5, triggering GHSA-xq3m-2v4x-88gg (critical, arbitrary code execution). npm audit reported 3 cascading criticals: protobufjs, @whiskeysockets/libsignal-node (pulls protobufjs), and baileys itself (effect rollup). Fix: add npm overrides block pinning protobufjs to ^7.5.5. Deduplicates to a single 7.5.6 copy at node_modules/protobufjs that both libsignal-node and any other consumers resolve through normal module resolution. Why not bump baileys: npm-published baileys@6.17.16 is deprecated by the maintainers (wrong version), 7.0.0-rc.* still pulls the same vulnerable libsignal-node, and upstream Baileys HEAD adds a 4th vuln (music-metadata). The override is the minimal, behavior-preserving fix. Validation: - npm audit: 3 critical -> 0 vulnerabilities - node -e "import('@whiskeysockets/baileys')" -> all 5 named exports (makeWASocket, useMultiFileAuthState, DisconnectReason, fetchLatestBaileysVersion, downloadMediaMessage) resolve - node bridge.js loads all modules and reaches Express bind (exits only on EADDRINUSE because the live gateway owns :3000) - Single deduped protobufjs@7.5.6 in the tree	2026-05-03 05:22:30 -07:00
kshitijk4poor	6f2dab248a	fix: update tests for resume_pending semantics + add AUTHOR_MAP entries Tests updated to reflect suspend_recently_active now setting resume_pending=True (preserves session) instead of suspended=True (wipes session history). AUTHOR_MAP entries: millerc79 (#19033), shellybotmoyer (#18915)	2026-05-03 03:54:03 -07:00
charliekerfoot	1148c46241	fix(gateway): correct ws scheme conversion for https urls	2026-05-03 03:54:03 -07:00
kshitijk4poor	7a22c639dc	chore: add shellybotmoyer to AUTHOR_MAP	2026-05-03 03:54:03 -07:00
Hermes Agent	934103476f	fix(gateway): send /new response before cancel_session_processing to avoid race (#18912 ) When /new is issued while an agent is actively processing, the confirmation response was never sent to the user because cancel_session_processing() was called before _send_with_retry(). Task cancellation side effects could silently drop the response. Fix: reorder to send the response BEFORE cancelling the old task. Add logging at the send point (matching the pattern at line 2800 in _process_message_background) so future failures are visible. Closes: #18912	2026-05-03 03:54:03 -07:00
kshitijk4poor	bf3239472f	chore: add millerc79 to AUTHOR_MAP	2026-05-03 03:54:03 -07:00
millerc79	f1e0292517	fix(gateway): resume sessions after crash/restart instead of blanket suspend suspend_recently_active() was unconditionally setting suspended=True on startup, causing get_or_create_session() to wipe conversation history on every restart. Change to set resume_pending=True instead, so sessions auto-resume while still allowing stuck-loop escalation after 3 failures.	2026-05-03 03:54:03 -07:00
kshitijk4poor	0a97ce6bff	chore: add nftpoetrist to AUTHOR_MAP	2026-05-03 03:47:49 -07:00
nftpoetrist	6c1322b997	fix(slack): close previous handler in connect() to prevent zombie Socket Mode connections SlackAdapter.connect() overwrote self._handler, self._app, and self._socket_mode_task without closing the prior AsyncSocketModeHandler first. If connect() was called a second time on the same adapter (e.g. during a gateway restart or in-process reconnect attempt), the old Socket Mode websocket stayed alive. Both the old and new connections received every Slack event and dispatched it twice — producing double responses with different wording, the same bug that affected DiscordAdapter (#18187, fixed in #18758). Fix: add a close-before-reassign guard at the start of the connection setup path, mirroring the guard DiscordAdapter.connect() already has. When self._handler is None (fresh adapter, first connect()) the block is a harmless no-op. Scoped to the handler/app fields only — no behavior change for any path that does not call connect() twice. Fixes #18980	2026-05-03 03:47:49 -07:00
kshitijk4poor	c14bf441a3	chore: add 0xyg3n noreply email to AUTHOR_MAP	2026-05-03 03:44:55 -07:00
0xyg3n	19ba9e43b6	fix(gateway/discord): require allowlist auth on slash commands Slash commands (_run_simple_slash, _handle_thread_create_slash) bypassed every DISCORD_ALLOWED_* gate enforced by on_message. Any guild member could invoke /background (RCE via terminal), /restart, /model, /skill, etc. CVSS 9.8 Critical. - _evaluate_slash_authorization mirrors on_message gates (user, role, channel, ignored channel) with fail-closed semantics - _check_slash_authorization sends ephemeral reject + logs + admin alert - Auth gate runs before defer() so rejections are ephemeral - /skill autocomplete returns [] for unauthorized users (no catalog leak) - Component views (ExecApproval, SlashConfirm, UpdatePrompt, ModelPicker) now honor role allowlists via shared _component_check_auth helper - Optional DISCORD_HIDE_SLASH_COMMANDS defense-in-depth - Cross-platform admin alert (Telegram/Slack fallback) on unauthorized attempts Based on PR #18125 by @0xyg3n.	2026-05-03 03:44:55 -07:00
kshitijk4poor	5d5b8912be	test: add tests for cmd_key preservation through name clamping - TestClampCommandNamesTriples: unit tests for 3-tuple support in _clamp_command_names (short names, long names, collisions, multiple entries, backward compat with 2-tuples) - TestDiscordSkillCmdKeyDispatch: integration test through the full discord_skill_commands pipeline verifying long skill names retain their original cmd_key after clamping - Add contributor CharlieKerfoot to AUTHOR_MAP	2026-05-03 03:25:45 -07:00
charliekerfoot	c4c0e5abc2	fix: After _clamp_command_names truncates skill names to fit the 32-cha…	2026-05-03 03:25:45 -07:00
kshitij	457c7b76cd	feat(openrouter): add response caching support (#19132 ) Enable OpenRouter's response caching feature (beta) via X-OpenRouter-Cache headers. When enabled, identical API requests return cached responses for free (zero billing), reducing both latency and cost. Configuration via config.yaml: openrouter: response_cache: true # default: on response_cache_ttl: 300 # 1-86400 seconds Changes: - Add openrouter config section to DEFAULT_CONFIG (response_cache + TTL) - Add build_or_headers() in auxiliary_client.py that builds attribution headers plus optional cache headers based on config - Replace inline _OR_HEADERS dicts with build_or_headers() at all 5 sites: run_agent.py __init__, _apply_client_headers_for_base_url(), and auxiliary_client.py _try_openrouter() + _to_async_client() - Add _check_openrouter_cache_status() method to AIAgent that reads X-OpenRouter-Cache-Status from streaming response headers and logs HIT/MISS status - Document in cli-config.yaml.example - Add 28 tests (22 unit + 6 integration) Ref: https://openrouter.ai/docs/guides/features/response-caching	2026-05-03 01:54:24 -07:00
Teknium	9b5b88b5e0	chore: add MottledShadow to AUTHOR_MAP	2026-05-03 01:51:33 -07:00
MottledShadow	a22465e07a	fix(weixin): send_weixin_direct cross-loop session check When send_message tool is called from inside a running gateway, the _run_async bridge spawns a worker thread with a separate event loop. send_weixin_direct then reuses the live adapter's aiohttp session which was created on the gateway's main loop. aiohttp's TimerContext checks asyncio.current_task(loop=session._loop) and sees None because we're executing on the worker thread's loop → raises 'Timeout context manager should be used inside a task'. Fix: skip the live-adapter shortcut when the session belongs to a different event loop, falling through to the fresh-session path.	2026-05-03 01:51:33 -07:00
Henkey	9987f3d824	fix(acp): compact Zed tool replay rendering	2026-05-03 01:44:23 -07:00
Henkey	19854c7cd2	Schedule ACP history replay and fence file output	2026-05-03 01:44:23 -07:00
Henkey	eb612f5574	fix(acp): keep web extract rendering compact	2026-05-03 01:44:23 -07:00
Henkey	b294d1d022	fix(acp): keep read-file starts compact	2026-05-03 01:44:23 -07:00
Henkey	72c8037a24	fix(acp): polish common tool rendering	2026-05-03 01:44:23 -07:00
Henkey	ef9a08a872	fix(acp): polish Zed context and tool rendering	2026-05-03 01:44:23 -07:00
Henkey	e26f9b2070	fix(acp): route Zed thoughts to reasoning callbacks	2026-05-03 01:44:23 -07:00
helix4u	4f37669170	fix(tools): reconfigure enabled unconfigured toolsets	2026-05-03 00:33:02 -07:00
helix4u	d409a4409c	fix(model): avoid bedrock credential probe in provider picker	2026-05-03 00:32:55 -07:00
Siddharth Balyan	5d3be898a8	docs(tts): mention xAI custom voice support (#18776 ) Some checks are pending Deploy Site / deploy-vercel (push) Waiting to run Details Deploy Site / deploy-docs (push) Waiting to run Details Docker Build and Publish / build-and-push (push) Waiting to run Details Nix / nix (macos-latest) (push) Waiting to run Details Nix / nix (ubuntu-latest) (push) Waiting to run Details Tests / test (push) Waiting to run Details Tests / e2e (push) Waiting to run Details Point users to xAI's custom voices feature — clone your voice in the console, paste the voice_id into tts.xai.voice_id. No code changes needed; the existing TTS pipeline already handles arbitrary voice IDs. - config.py: link to xAI custom voices docs in voice_id comment - setup.py: prompt accepts custom voice IDs during xAI TTS setup - tts.md: short section linking to xAI console and docs	2026-05-02 16:08:01 +05:30
liuhao1024	af98122793	fix(auxiliary): propagate explicit_api_key to _try_openrouter() When resolve_provider_client() passes explicit_api_key for OpenRouter auxiliary tasks, _try_openrouter() now accepts and honors this parameter instead of silently ignoring it and falling back to OPENROUTER_API_KEY env var. Root cause: _try_openrouter() had no explicit_api_key parameter, so even when callers wanted to pass a runtime credential pool key, it could not be used. Fix: - Add explicit_api_key: str = None parameter to _try_openrouter() - Prioritize explicit_api_key over pool key and env var - Update resolve_provider_client() call site to pass explicit_api_key Regression coverage: - Test that explicit_api_key is passed to OpenAI client when provided - Test that fallback to OPENROUTER_API_KEY still works when explicit_api_key is None Closes #18338	2026-05-02 02:27:49 -07:00
teknium1	73bcd83dba	chore(release): map beibi9966 email for AUTHOR_MAP Follow-up for PR #18502 salvage.	2026-05-02 02:23:37 -07:00
teknium1	762eb79f1e	fix(gateway): tighten httpx keepalive and close whatsapp typing-response leak (#18451 ) Two mitigations for the CLOSE_WAIT accumulation reported against QQ Bot + Feishu on macOS behind Cloudflare Warp. 1. Shared httpx.Limits helper (gateway/platforms/_http_client_limits.py). Every long-lived platform adapter now constructs httpx.AsyncClient with max_keepalive_connections=10 and keepalive_expiry=2.0, vs httpx's default of unbounded keepalive pool and 5.0s expiry. On macOS/Warp the default 5s window let idle keepalive sockets sit in CLOSE_WAIT long enough for seven persistent adapters (QQ Bot, WeCom, DingTalk, Signal, BlueBubbles, WeCom-callback, plus the transient Feishu helper) to compound to the 256-fd ulimit. Tunable via HERMES_GATEWAY_HTTPX_KEEPALIVE_EXPIRY and HERMES_GATEWAY_HTTPX_MAX_KEEPALIVE env vars. 2. whatsapp.send_typing aiohttp leak. The call was 'await self._http_session.post(...)' with no 'async with' and no variable capture — the ClientResponse went out of scope unclosed, holding its TCP socket in CLOSE_WAIT until GC. Fixed by wrapping in 'async with'. This was the only bare-await aiohttp leak in the gateway/tools/plugins tree per audit; all other aiohttp sites use the context-manager pattern correctly. The underlying reporter also saw Feishu SDK (lark-oapi) connections in CLOSE_WAIT — those are inside the SDK and out of our direct control, but tightening httpx keepalive across adapters reduces the aggregate pool pressure regardless of which individual adapter leaks.	2026-05-02 02:23:37 -07:00
beibi9966	38dd057e91	fix(feishu): finalize remote document downloads inside httpx.AsyncClient context (#18502 ) Snapshot Content-Type and body while the client context is still active so pooled connections fully release on exit. Previously the read happened after `async with httpx.AsyncClient(...)` returned — which works today only because httpx eagerly buffers non-streaming responses; a future refactor to `.stream()` would silently read- after-close. Part of the #18451 connection-hygiene audit. Salvage of #18502.	2026-05-02 02:23:37 -07:00
Teknium	e444d8f29c	fix(gateway): config.yaml wins over .env for agent/display/timezone settings (#18764 ) Regression from the silent config→env bridge. The bridge at module import time is correct for max_turns (unconditional overwrite), but every other agent., display., timezone, and security bridge key was guarded by 'if X not in os.environ' — so a stale .env entry from an old 'hermes setup' run would shadow the user's current config.yaml indefinitely. Symptom: agent.max_turns: 500 in config.yaml, HERMES_MAX_ITERATIONS=60 in .env from an old setup, and the gateway silently capped at 60 iterations per turn. Gateway logs confirmed api_calls never exceeded 60. Three changes: 1. gateway/run.py: drop the 'not in os.environ' guards for all agent., display., timezone, and security.* bridge keys. config.yaml is now authoritative for these settings — same semantics already in place for max_turns, terminal., and auxiliary.. Also surface the bridge failure (previously 'except Exception: pass') to stderr so operators see bridge errors instead of silently falling back to .env. 2. gateway/run.py: INFO-log the resolved max_iterations at gateway start so operators can verify the config→env bridge did the right thing instead of chasing a phantom budget ceiling. 3. hermes_cli/setup.py: stop writing HERMES_MAX_ITERATIONS to .env in the setup wizard. config.yaml is the single source of truth. Also clean up any stale .env entry left behind by pre-fix setups. Regression tests in tests/gateway/test_config_env_bridge_authority.py guard each config→env key against the 'stale .env shadows config' bug.	2026-05-02 02:14:35 -07:00
luyao618	13f344c5ce	fix(agent): try fallback providers at init when primary credential pool is exhausted (#17929 ) When a provider's credential pool has a single entry in 429-cooldown, resolve_provider_client returns None and AIAgent.__init__ raises a misleading RuntimeError suggesting the API key is missing — even when valid fallback_providers are configured. This patch makes __init__ iterate the fallback chain before raising, mirroring the existing in-flight fallback logic in the request loop. If a fallback resolves, the agent initializes against it and sets _fallback_activated=True so _restore_primary_runtime can pick the primary back up after cooldown. Closes #17929	2026-05-02 02:09:46 -07:00
Teknium	1dce908930	fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log) (#18761 ) * fix(gateway): config.yaml wins over .env for agent/display/timezone settings Regression from the silent config→env bridge. The bridge at module import time is correct for max_turns (unconditional overwrite), but every other agent., display., timezone, and security bridge key was guarded by 'if X not in os.environ' — so a stale .env entry from an old 'hermes setup' run would shadow the user's current config.yaml indefinitely. Symptom: agent.max_turns: 500 in config.yaml, HERMES_MAX_ITERATIONS=60 in .env from an old setup, and the gateway silently capped at 60 iterations per turn. Gateway logs confirmed api_calls never exceeded 60. Three changes: 1. gateway/run.py: drop the 'not in os.environ' guards for all agent., display., timezone, and security.* bridge keys. config.yaml is now authoritative for these settings — same semantics already in place for max_turns, terminal., and auxiliary.. Also surface the bridge failure (previously 'except Exception: pass') to stderr so operators see bridge errors instead of silently falling back to .env. 2. gateway/run.py: INFO-log the resolved max_iterations at gateway start so operators can verify the config→env bridge did the right thing instead of chasing a phantom budget ceiling. 3. hermes_cli/setup.py: stop writing HERMES_MAX_ITERATIONS to .env in the setup wizard. config.yaml is the single source of truth. Also clean up any stale .env entry left behind by pre-fix setups. Regression tests in tests/gateway/test_config_env_bridge_authority.py guard each config→env key against the 'stale .env shadows config' bug. * fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log) Three issues observed in production gateway.log during a rapid restart chain on 2026-05-02, all fixed here. 1. _send_restart_notification logged unconditional success adapter.send() catches provider errors (e.g. Telegram 'Chat not found') and returns SendResult(success=False); it never raises. The caller ignored the return value and always logged 'Sent restart notification to <chat>' at INFO, producing a misleading success line directly below the 'Failed to send Telegram message' traceback on every boot. Now inspects result.success and logs WARNING with the error otherwise. 2. WhatsApp bridge SIGTERM on shutdown classified as fatal error _check_managed_bridge_exit() saw the bridge's returncode -15 (our own SIGTERM from disconnect()) and fired the full fatal-error path, producing 'ERROR ... WhatsApp bridge process exited unexpectedly' plus 'Fatal whatsapp adapter error (whatsapp_bridge_exited)' on every planned shutdown, immediately before the normal '✓ whatsapp disconnected'. Adds a _shutting_down flag that disconnect() sets before the terminate, and _check_managed_bridge_exit() returns None for returncode in {0, -2, -15} while shutting down. OOM-kill (137) and other non-signal exits still hit the fatal path. 3. restart_drain_timeout default 60s → 180s On 2026-05-02 01:43:27 a user /restart fired while three agents were mid-API-call (82s, 112s, 154s into their turns). The 60s drain budget expired and all three were force-interrupted. 180s covers realistic in-flight agent turns; users on very-long-reasoning models can still raise it further via agent.restart_drain_timeout in config.yaml. Existing explicit user values are preserved by deep-merge. Tests - tests/gateway/test_restart_notification.py: two new tests assert INFO is only logged on SendResult(success=True) and WARNING with the error string is logged on SendResult(success=False). - tests/gateway/test_whatsapp_connect.py: parametrized test for returncode in {0, -2, -15} proves shutdown-time exits are suppressed; separate test proves returncode 137 (SIGKILL/OOM) still surfaces as fatal even when _shutting_down is set. - _check_managed_bridge_exit() reads _shutting_down via getattr-with- default so existing _make_adapter() test helpers that bypass __init__ (pitfall #17 in AGENTS.md) keep working unmodified.	2026-05-02 02:08:06 -07:00
teknium1	50f9f389ec	chore(release): map ambition0802 email for AUTHOR_MAP Follow-up for PR #17939 salvage.	2026-05-02 02:07:14 -07:00
ambition0802	7696ddc59e	fix(cli): robust paste file expansion and process_loop error handling (#17666 ) Two narrow fixes for long pasted messages silently disappearing: 1. _expand_paste_references: replace path.exists() + read_text() with try/except (OSError, IOError). Closes the TOCTOU window where a paste file deleted between check and read raised FileNotFoundError, bubbled up through process_loop's outer except, and silently dropped the user's input. Failures now return the placeholder text and log a warning. 2. process_loop outer except: logger.warning() instead of print(). prompt_toolkit's TUI swallows stdout, so 'Error: …' was invisible to the user. Logged errors are discoverable via hermes logs. Dropped the larger interrupt_queue→pending_input drain that was part of the original PR — that's a separate class of input-drop (in-progress interrupt handling) unrelated to the paste-file TOCTOU reported in the issue, and worth its own review. Salvage of #17939.	2026-05-02 02:07:14 -07:00
Teknium	5eac6084bc	fix(discord): warn on 32-char clamp collisions in the /skill collector (#18759 ) Discord's per-command name limit is 32 chars. When two skill slugs share the same first 32 chars (or a skill slug clamps onto a reserved gateway command name), only the first seen wins — the second is dropped from the /skill autocomplete. The old behavior incremented a ``hidden`` counter silently, so skill authors had no way to discover the drop short of noticing their skill was missing from the picker. Not an actively-biting bug today (no collisions on the default catalog as of 2026-05), but a landmine the moment someone ships a skill with a long name. The earlier series in #18745 / #18753 / #18754 dropped the other silent data-loss paths in the Discord /skill collector; this one lights up the last remaining one. Fix: promote ``_names_used`` from a set to a dict keyed by the clamped name, mapping to the source cmd_key (or a ``"<reserved>"`` sentinel for names inherited via ``reserved_names``). On collision, log a WARNING naming both sides — the winner, the loser, the clamped name, and what to rename. Two phrasings: * skill-vs-skill — "both clamp to X on Discord's 32-char command-name limit; only the winner appears in /skill. Rename one skill's frontmatter ``name:`` to differ in its first 32 chars." * skill-vs-reserved — "collides with a reserved gateway command name; the skill will not appear in /skill. Rename the skill's frontmatter ``name:``." Tests: three cases in ``tests/hermes_cli/test_discord_skill_clamp_warning.py`` — skill-vs-skill collision (warning names both cmd_keys + clamped prefix), skill-vs-reserved collision (warning uses the distinct phrasing), and a no-collision negative (zero warnings emitted).	2026-05-02 02:05:01 -07:00
teknium1	e363ced3c3	test(discord): regression coverage for zombie-websocket guard in connect() Covers PR #18224 fix for issue #18187 — when DiscordAdapter.connect() is called a second time without an intervening disconnect(), the previous commands.Bot must be closed before a new one is created. Otherwise both websockets stay connected to Discord's gateway and both fire on_message, producing double responses with different wording.	2026-05-02 02:04:14 -07:00
luyao618	292d2fb42f	fix(discord): close old client before reconnect to prevent zombie websockets (#18187 ) When DiscordAdapter.connect() is called during reconnect, it creates a new commands.Bot client without closing the previous one. The old client's websocket remains connected to Discord's gateway, causing both to fire on_message for every incoming event — resulting in double responses. Fix: before creating a new Bot instance, check if a previous client exists and close it. This ensures only one websocket connection is active at any time. Closes #18187	2026-05-02 02:04:14 -07:00
teknium1	0a6865b328	test(credential_pool): regression coverage for .env vs os.environ precedence Covers PR #18256 fix for issue #18254 — when OPENROUTER_API_KEY is set in BOTH os.environ (stale from parent shell) and ~/.hermes/.env (fresh), _seed_from_env must prefer the .env value. Also guards the fallback case where .env omits the key entirely (Docker/K8s/systemd deployments that only inject via runtime env).	2026-05-02 02:00:32 -07:00
teknium1	9c626ef8ea	chore(release): map franksong2702 email for AUTHOR_MAP Follow-up for PR #18256 salvage.	2026-05-02 02:00:32 -07:00
Frank Song	2ef1ad280b	fix: prefer ~/.hermes/.env over os.environ when seeding credential pool When _seed_from_env() reads API keys to populate the credential pool, it should treat ~/.hermes/.env as the authoritative source — not os.environ. Stale env vars inherited from parent shell processes (Codex CLI, test scripts, etc.) can shadow deliberate changes to the .env file, causing auth.json to cache an outdated key that leads to silent 401 errors. This is especially visible with OpenRouter: if a parent process exported OPENROUTER_API_KEY=test-key-fresh and the user later updates .env with a valid key, restarting Hermes still picks up the stale os.environ value, writes it back to auth.json, and all API calls fail with 401. Fixes #18254	2026-05-02 02:00:32 -07:00
Teknium	10297fa23c	fix(discord): `/reload-skills` now refreshes the `/skill` autocomplete live (#18754 ) `_register_skill_group` captured the skill catalog in closure variables (`entries` and `skill_lookup`) so the single `tree.add_command` call at startup owned the only live copy. The closure is never re-entered after startup, so `/reload-skills` — which rescans the on-disk skills dir and refreshes the in-process `_skill_commands` registry — had no way to propagate results into the `/skill` autocomplete on Discord. New skills stayed invisible in the dropdown, and deleted skills returned "Unknown skill" when the stale autocomplete entry was clicked. The fix is purely a dataflow change: promote `entries` and `skill_lookup` to instance attributes (`_skill_entries`, `_skill_lookup`), split the collector-driven rebuild into a helper (`_refresh_skill_catalog_state`), and add a public `refresh_skill_group()` method that re-runs the helper and is safe to call at any point after the initial registration. The gateway's `_handle_reload_skills_command` then iterates `self.adapters` and calls `refresh_skill_group()` on any adapter that exposes it (currently only Discord). Both sync and async implementations are supported; adapters that don't override the method (Telegram's BotCommand menu, Slack subcommand map, etc.) are silently skipped — the in-process `reload_skills()` call covers them. No `tree.sync()` is required because Discord fetches autocomplete options dynamically on every keystroke — mutating the instance state the callbacks already read from is sufficient. That sidesteps the per-app command-bucket rate limit (~5 writes / 20 s) that made the previous bulk-sync-on-reload approach unusable (#16713 context). Tests: tests/gateway/test_reload_skills_discord_resync.py — five cases covering (1) refresh replaces entries, (2) entries stay sorted after refresh, (3) collector exception leaves cached state intact, (4) `_refresh_skill_catalog_state` populates the instance attrs, (5) orchestrator calls `refresh_skill_group()` on sync + async adapters and skips adapters that don't expose it.	2026-05-02 02:00:11 -07:00
Teknium	6ec74aec07	fix(gateway): match disabled/optional skills by frontmatter slug, not dir name (#18753 ) _check_unavailable_skill is meant to turn a typed "/foo" command that doesn't resolve into a specific hint — "disabled, enable with hermes skills config" or "available but not installed, install with hermes skills install …" — instead of the generic "unknown command" reply. It was doing the match with `skill_md.parent.name.lower().replace("_", "-")`, comparing that to the typed command. For every skill whose directory name drifted from its declared frontmatter `name:`, that comparison failed and the user got the unhelpful generic path. On a standard install today 19 skills have this drift, e.g.: dir: mlops/stable-diffusion frontmatter: name: Stable Diffusion Image Generation registered slug (what the user types): /stable-diffusion-image-generation dir: mlops/qdrant frontmatter: name: Qdrant Vector Search registered slug: /qdrant-vector-search dir: mlops/flash-attention frontmatter: name: Optimizing Attention Flash registered slug: /optimizing-attention-flash In every case, _check_unavailable_skill would fall through because "stable-diffusion" != "stable-diffusion-image-generation", even with the skill sitting right there on disk. Fix: extract a small `_skill_slug_from_frontmatter` helper that reads the SKILL.md frontmatter and normalizes exactly like scan_skill_commands (lower, spaces/underscores → hyphens, strip non-[a-z0-9-], collapse runs of hyphens, strip edges). Use it in both the disabled-skills branch and the optional-skills branch. The disabled-set membership check now uses the declared frontmatter name (which is what `hermes skills config` writes into skills.disabled / platform_disabled), not the slug. Tests: five cases in tests/gateway/test_unavailable_skill_hint.py — the drift case for the disabled branch, unknown-command negative, matched-but-not-disabled negative, non-alnum stripping, and the drift case for the optional-skills branch. All five fail against main and pass with the fix.	2026-05-02 02:00:09 -07:00
Teknium	8825e9044c	fix(discord): complete #18741 for /skill autocomplete and drop legacy 25x25 caps (#18745 ) ``discord_skill_commands_by_category`` was lagging the flat ``discord_skill_commands`` collector on two counts. Both were actively dropping skills from Discord's ``/skill`` autocomplete dropdown. 1. External-dir skills were filtered out. #18741 widened the flat collector to accept ``SKILLS_DIR + skills.external_dirs`` but left this sibling collector — the one ``_register_skill_group`` actually uses on Discord — still matching ``SKILLS_DIR`` only. External skills were visible in ``hermes skills list`` and the agent's ``/skill-name`` dispatch but silently absent from Discord's ``/skill`` picker. Widen the accepted roots to match, and derive categories from whichever root the skill lives under so ``<ext>/mlops/foo/SKILL.md`` still lands in the ``mlops`` group. 2. 25-group × 25-subcommand caps were still applied. PR #11580 refactored ``/skill`` to a flat autocomplete (whose options Discord fetches dynamically — no per-command payload concern) and its docstring promises "no hidden skills." The collector kept the old nested-layout caps anyway, silently dropping anything past the 25th alphabetical category. On installs with 29 category dirs today (real example: tail categories ``social-media``, ``software-development``, ``yuanbao`` going missing) this was biting immediately. Remove the caps; ``hidden`` now reports only 32-char name-clamp collisions against reserved names. Tests: guard both behaviors. ``test_no_legacy_25x25_cap`` builds 30 categories × 30 skills each and asserts all 900 are returned. ``test_external_dirs_skills_included`` monkeypatches ``get_external_skills_dirs`` and asserts an external-dir skill makes it into the result grouped under its own top-level directory.	2026-05-02 02:00:06 -07:00
Jacob Lizarraga	2470434d60	fix(telegram): probe polling liveness after reconnect to detect wedged Updater After a transient Telegram 502, _handle_polling_network_error's stop()+start_polling() cycle can leave PTB's Updater with `running=True` but a wedged consumer task that never makes progress. No error_callback fires in that state, so the reconnect ladder never advances past attempt 1, the MAX_NETWORK_RETRIES fatal-error path is never reached, and the gateway sits silent indefinitely. Schedule a heartbeat probe (60s after a successful reconnect) that verifies Updater.running is still True and bot.get_me() responds within a tight asyncio.wait_for timeout. Either failure feeds back into the reconnect ladder so the existing escalation path fires. No PTB-internal coupling, no Application rebuild — minimal additive defense inside the existing reconnect abstraction. Tests cover healthy / Updater non-running / probe timeout / probe network error / already-fatal cases, plus an integration check that the probe is actually scheduled after a successful start_polling(). Closes the silent-wedge case observed in the wild after a transient Telegram 502; existing reconnect tests updated to mock bot.get_me() now that the success path schedules a heartbeat probe.	2026-05-02 01:55:04 -07:00

1 2 3 4 5 ...

7075 commits