Commit graph

7075 commits

Author SHA1 Message Date
Tranquil-Flow
6b4fb9f878 fix(cron): treat non-dict origin as missing instead of crashing tick
``_resolve_origin`` called ``origin.get('platform')`` on whatever
``job.get('origin')`` returned. The leading ``if not origin: return None``
short-circuited the falsy cases (None, empty dict, "") but a non-empty
string passed that guard and then crashed with
``AttributeError: 'str' object has no attribute 'get'`` on every fire
attempt. Observed in the wild after a migration script tagged jobs with
free-form provenance strings (e.g.
``"combined-digest-replaces-x-and-y-20260503"``).

``mark_job_run`` did record ``last_status: error,
last_error: "'str' object has no attribute 'get'"`` once, but the next
tick re-loaded the same poisoned origin and crashed identically. The
job stayed enabled, fired every tick, and accumulated cascading errors
in the log until ``origin`` was patched manually.

Replace the falsy guard with ``isinstance(origin, dict)``. Non-dict
origins (string, int, list, tuple, float — anything that survived a
hand-edit, JSON-script write, or migration) are now treated the same
as a missing origin: the job continues with ``deliver`` falling back
through its normal home-channel path instead of crashing the scheduler
loop.

Test parametrises the non-dict shapes that can appear in jobs.json
through external writers and asserts ``_resolve_origin`` returns None
for each.

Note: this fix scope is the non-dict-``origin`` crash only. The
``next_run_at: null`` recurring-job recovery (the second sub-bug in
#18722) is independently addressed by the in-flight #18825, which
extends the never-silently-disable defense from #16265 to
``get_due_jobs()`` — that approach is well-aligned with the existing
recovery pattern and ships fine without a competing change here.

Fixes #18722 (non-dict origin crash; recurring-job recovery covered by #18825)
2026-05-03 08:51:50 -07:00
JasonOA888
69dd0f7cf1 fix(approval): extend sensitive write target to cover shell RC and credential files
Terminal commands can write to shell RC files (~/.bashrc, ~/.zshrc,
~/.profile) and credential files (~/.netrc, ~/.pgpass, ~/.npmrc,
~/.pypirc) via redirection or tee without triggering approval, even
though write_file already blocks these paths in file_safety.py.

This creates an inconsistency: write_file protects these paths but
terminal shell redirections bypass the same protection. An agent
prompted via indirect injection could install persistent backdoors
(e.g. PATH manipulation, alias overrides) or write credential entries
without user approval.

Extend _SENSITIVE_WRITE_TARGET with two new regex groups matching the
same paths that file_safety.py's WRITE_DENIED_PATHS already covers:
  _SHELL_RC_FILES  — ~/.bashrc, ~/.zshrc, ~/.profile, ~/.bash_profile,
                     ~/.zprofile
  _CREDENTIAL_FILES — ~/.netrc, ~/.pgpass, ~/.npmrc, ~/.pypirc

All 130 existing tests pass.
2026-05-03 08:49:13 -07:00
teknium1
3c59566cc5 chore(release): map leprincep35700 email for PR #18440 salvage 2026-05-03 08:47:49 -07:00
leprincep35700
b59bb4e351 fix(gateway): preserve home-channel thread targets across restart notifications 2026-05-03 08:47:49 -07:00
Teknium
d87fd9f039
fix(goals): make /goal work in TUI and fix gateway verdict delivery (#19209)
/goal was silently broken outside the classic CLI.

TUI: /goal was routed through the HermesCLI slash-worker subprocess,
which set the goal row in SessionDB but then called
_pending_input.put(state.goal) — the subprocess has no reader for that
queue, so the kickoff message was discarded. No post-turn judge was
wired into prompt.submit either, so even a manual kickoff would not
continue the goal loop. Intercept /goal in command.dispatch instead,
drive GoalManager directly, and return {type: send, notice, message}
so the TUI client renders the Goal-set notice and fires the kickoff.
Run the judge in _run_prompt_submit after message.complete, surface
the verdict via status.update {kind: goal}, and chain the continuation
turn after the running guard is released.

Gateway: _post_turn_goal_continuation was gated on
hasattr(adapter, 'send_message'), but adapters only expose send().
That branch was dead on every platform — users never saw
'✓ Goal achieved', 'Continuing toward goal', or budget-exhausted
messages. Replace the dead call with adapter.send(chat_id, content,
metadata) and drop a broken reference to self._loop.

Tests:
- tests/tui_gateway/test_goal_command.py — full /goal dispatch matrix
  (set / status / pause / resume / clear / stop / done / whitespace)
  plus regressions for slash.exec → 4018 and 'goal' staying in
  _PENDING_INPUT_COMMANDS.
- tests/gateway/test_goal_verdict_send.py — locks in the adapter.send
  path for done / continue / budget-exhausted and verifies the hook
  no-ops when no goal is set or the adapter lacks send().
2026-05-03 05:49:12 -07:00
Teknium
55647a5813
fix(whatsapp): pin protobufjs >=7.5.5 via npm overrides to clear 3 critical vulns (#19204)
The whatsapp-bridge pulls @whiskeysockets/baileys at a pinned git
commit whose transitive dep tree ships protobufjs <7.5.5, triggering
GHSA-xq3m-2v4x-88gg (critical, arbitrary code execution). npm audit
reported 3 cascading criticals: protobufjs, @whiskeysockets/libsignal-node
(pulls protobufjs), and baileys itself (effect rollup).

Fix: add npm overrides block pinning protobufjs to ^7.5.5. Deduplicates
to a single 7.5.6 copy at node_modules/protobufjs that both libsignal-node
and any other consumers resolve through normal module resolution.

Why not bump baileys: npm-published baileys@6.17.16 is deprecated by the
maintainers (wrong version), 7.0.0-rc.* still pulls the same vulnerable
libsignal-node, and upstream Baileys HEAD adds a 4th vuln (music-metadata).
The override is the minimal, behavior-preserving fix.

Validation:
- npm audit: 3 critical -> 0 vulnerabilities
- node -e "import('@whiskeysockets/baileys')" -> all 5 named exports
  (makeWASocket, useMultiFileAuthState, DisconnectReason,
  fetchLatestBaileysVersion, downloadMediaMessage) resolve
- node bridge.js loads all modules and reaches Express bind
  (exits only on EADDRINUSE because the live gateway owns :3000)
- Single deduped protobufjs@7.5.6 in the tree
2026-05-03 05:22:30 -07:00
kshitijk4poor
6f2dab248a fix: update tests for resume_pending semantics + add AUTHOR_MAP entries
Tests updated to reflect suspend_recently_active now setting
resume_pending=True (preserves session) instead of suspended=True
(wipes session history).

AUTHOR_MAP entries: millerc79 (#19033), shellybotmoyer (#18915)
2026-05-03 03:54:03 -07:00
charliekerfoot
1148c46241 fix(gateway): correct ws scheme conversion for https urls 2026-05-03 03:54:03 -07:00
kshitijk4poor
7a22c639dc chore: add shellybotmoyer to AUTHOR_MAP 2026-05-03 03:54:03 -07:00
Hermes Agent
934103476f fix(gateway): send /new response before cancel_session_processing to avoid race (#18912)
When /new is issued while an agent is actively processing, the confirmation response was never sent to the user because cancel_session_processing() was called before _send_with_retry(). Task cancellation side effects could silently drop the response.

Fix: reorder to send the response BEFORE cancelling the old task. Add logging at the send point (matching the pattern at line 2800 in _process_message_background) so future failures are visible.

Closes: #18912
2026-05-03 03:54:03 -07:00
kshitijk4poor
bf3239472f chore: add millerc79 to AUTHOR_MAP 2026-05-03 03:54:03 -07:00
millerc79
f1e0292517 fix(gateway): resume sessions after crash/restart instead of blanket suspend
suspend_recently_active() was unconditionally setting suspended=True on
startup, causing get_or_create_session() to wipe conversation history on
every restart. Change to set resume_pending=True instead, so sessions
auto-resume while still allowing stuck-loop escalation after 3 failures.
2026-05-03 03:54:03 -07:00
kshitijk4poor
0a97ce6bff chore: add nftpoetrist to AUTHOR_MAP 2026-05-03 03:47:49 -07:00
nftpoetrist
6c1322b997 fix(slack): close previous handler in connect() to prevent zombie Socket Mode connections
SlackAdapter.connect() overwrote self._handler, self._app, and
self._socket_mode_task without closing the prior AsyncSocketModeHandler
first. If connect() was called a second time on the same adapter (e.g.
during a gateway restart or in-process reconnect attempt), the old Socket
Mode websocket stayed alive. Both the old and new connections received
every Slack event and dispatched it twice — producing double responses
with different wording, the same bug that affected DiscordAdapter (#18187,
fixed in #18758).

Fix: add a close-before-reassign guard at the start of the connection
setup path, mirroring the guard DiscordAdapter.connect() already has.
When self._handler is None (fresh adapter, first connect()) the block is
a harmless no-op. Scoped to the handler/app fields only — no behavior
change for any path that does not call connect() twice.

Fixes #18980
2026-05-03 03:47:49 -07:00
kshitijk4poor
c14bf441a3 chore: add 0xyg3n noreply email to AUTHOR_MAP 2026-05-03 03:44:55 -07:00
0xyg3n
19ba9e43b6 fix(gateway/discord): require allowlist auth on slash commands
Slash commands (_run_simple_slash, _handle_thread_create_slash) bypassed
every DISCORD_ALLOWED_* gate enforced by on_message. Any guild member
could invoke /background (RCE via terminal), /restart, /model, /skill,
etc. CVSS 9.8 Critical.

- _evaluate_slash_authorization mirrors on_message gates (user, role,
  channel, ignored channel) with fail-closed semantics
- _check_slash_authorization sends ephemeral reject + logs + admin alert
- Auth gate runs before defer() so rejections are ephemeral
- /skill autocomplete returns [] for unauthorized users (no catalog leak)
- Component views (ExecApproval, SlashConfirm, UpdatePrompt, ModelPicker)
  now honor role allowlists via shared _component_check_auth helper
- Optional DISCORD_HIDE_SLASH_COMMANDS defense-in-depth
- Cross-platform admin alert (Telegram/Slack fallback) on unauthorized attempts

Based on PR #18125 by @0xyg3n.
2026-05-03 03:44:55 -07:00
kshitijk4poor
5d5b8912be test: add tests for cmd_key preservation through name clamping
- TestClampCommandNamesTriples: unit tests for 3-tuple support in
  _clamp_command_names (short names, long names, collisions, multiple
  entries, backward compat with 2-tuples)
- TestDiscordSkillCmdKeyDispatch: integration test through the full
  discord_skill_commands pipeline verifying long skill names retain
  their original cmd_key after clamping
- Add contributor CharlieKerfoot to AUTHOR_MAP
2026-05-03 03:25:45 -07:00
charliekerfoot
c4c0e5abc2 fix: After _clamp_command_names truncates skill names to fit the 32-cha… 2026-05-03 03:25:45 -07:00
kshitij
457c7b76cd
feat(openrouter): add response caching support (#19132)
Enable OpenRouter's response caching feature (beta) via X-OpenRouter-Cache
headers. When enabled, identical API requests return cached responses for
free (zero billing), reducing both latency and cost.

Configuration via config.yaml:
  openrouter:
    response_cache: true       # default: on
    response_cache_ttl: 300    # 1-86400 seconds

Changes:
- Add openrouter config section to DEFAULT_CONFIG (response_cache + TTL)
- Add build_or_headers() in auxiliary_client.py that builds attribution
  headers plus optional cache headers based on config
- Replace inline _OR_HEADERS dicts with build_or_headers() at all 5 sites:
  run_agent.py __init__, _apply_client_headers_for_base_url(), and
  auxiliary_client.py _try_openrouter() + _to_async_client()
- Add _check_openrouter_cache_status() method to AIAgent that reads
  X-OpenRouter-Cache-Status from streaming response headers and logs
  HIT/MISS status
- Document in cli-config.yaml.example
- Add 28 tests (22 unit + 6 integration)

Ref: https://openrouter.ai/docs/guides/features/response-caching
2026-05-03 01:54:24 -07:00
Teknium
9b5b88b5e0 chore: add MottledShadow to AUTHOR_MAP 2026-05-03 01:51:33 -07:00
MottledShadow
a22465e07a fix(weixin): send_weixin_direct cross-loop session check
When send_message tool is called from inside a running gateway, the
_run_async bridge spawns a worker thread with a separate event loop.
send_weixin_direct then reuses the live adapter's aiohttp session
which was created on the gateway's main loop.  aiohttp's TimerContext
checks asyncio.current_task(loop=session._loop) and sees None because
we're executing on the worker thread's loop → raises 'Timeout context
manager should be used inside a task'.

Fix: skip the live-adapter shortcut when the session belongs to a
different event loop, falling through to the fresh-session path.
2026-05-03 01:51:33 -07:00
Henkey
9987f3d824 fix(acp): compact Zed tool replay rendering 2026-05-03 01:44:23 -07:00
Henkey
19854c7cd2 Schedule ACP history replay and fence file output 2026-05-03 01:44:23 -07:00
Henkey
eb612f5574 fix(acp): keep web extract rendering compact 2026-05-03 01:44:23 -07:00
Henkey
b294d1d022 fix(acp): keep read-file starts compact 2026-05-03 01:44:23 -07:00
Henkey
72c8037a24 fix(acp): polish common tool rendering 2026-05-03 01:44:23 -07:00
Henkey
ef9a08a872 fix(acp): polish Zed context and tool rendering 2026-05-03 01:44:23 -07:00
Henkey
e26f9b2070 fix(acp): route Zed thoughts to reasoning callbacks 2026-05-03 01:44:23 -07:00
helix4u
4f37669170 fix(tools): reconfigure enabled unconfigured toolsets 2026-05-03 00:33:02 -07:00
helix4u
d409a4409c fix(model): avoid bedrock credential probe in provider picker 2026-05-03 00:32:55 -07:00
Siddharth Balyan
5d3be898a8
docs(tts): mention xAI custom voice support (#18776)
Some checks are pending
Deploy Site / deploy-vercel (push) Waiting to run
Deploy Site / deploy-docs (push) Waiting to run
Docker Build and Publish / build-and-push (push) Waiting to run
Nix / nix (macos-latest) (push) Waiting to run
Nix / nix (ubuntu-latest) (push) Waiting to run
Tests / test (push) Waiting to run
Tests / e2e (push) Waiting to run
Point users to xAI's custom voices feature — clone your voice in the
console, paste the voice_id into tts.xai.voice_id. No code changes
needed; the existing TTS pipeline already handles arbitrary voice IDs.

- config.py: link to xAI custom voices docs in voice_id comment
- setup.py: prompt accepts custom voice IDs during xAI TTS setup
- tts.md: short section linking to xAI console and docs
2026-05-02 16:08:01 +05:30
liuhao1024
af98122793 fix(auxiliary): propagate explicit_api_key to _try_openrouter()
When resolve_provider_client() passes explicit_api_key for OpenRouter auxiliary
tasks, _try_openrouter() now accepts and honors this parameter instead of
silently ignoring it and falling back to OPENROUTER_API_KEY env var.

Root cause: _try_openrouter() had no explicit_api_key parameter, so even
when callers wanted to pass a runtime credential pool key, it could not be used.

Fix:
- Add explicit_api_key: str = None parameter to _try_openrouter()
- Prioritize explicit_api_key over pool key and env var
- Update resolve_provider_client() call site to pass explicit_api_key

Regression coverage:
- Test that explicit_api_key is passed to OpenAI client when provided
- Test that fallback to OPENROUTER_API_KEY still works when explicit_api_key is None

Closes #18338
2026-05-02 02:27:49 -07:00
teknium1
73bcd83dba chore(release): map beibi9966 email for AUTHOR_MAP
Follow-up for PR #18502 salvage.
2026-05-02 02:23:37 -07:00
teknium1
762eb79f1e fix(gateway): tighten httpx keepalive and close whatsapp typing-response leak (#18451)
Two mitigations for the CLOSE_WAIT accumulation reported against QQ Bot
+ Feishu on macOS behind Cloudflare Warp.

1. Shared httpx.Limits helper (gateway/platforms/_http_client_limits.py).
   Every long-lived platform adapter now constructs httpx.AsyncClient
   with max_keepalive_connections=10 and keepalive_expiry=2.0, vs httpx's
   default of unbounded keepalive pool and 5.0s expiry. On macOS/Warp the
   default 5s window let idle keepalive sockets sit in CLOSE_WAIT long
   enough for seven persistent adapters (QQ Bot, WeCom, DingTalk, Signal,
   BlueBubbles, WeCom-callback, plus the transient Feishu helper) to
   compound to the 256-fd ulimit. Tunable via
   HERMES_GATEWAY_HTTPX_KEEPALIVE_EXPIRY and
   HERMES_GATEWAY_HTTPX_MAX_KEEPALIVE env vars.

2. whatsapp.send_typing aiohttp leak. The call was
   'await self._http_session.post(...)' with no 'async with' and no
   variable capture — the ClientResponse went out of scope unclosed,
   holding its TCP socket in CLOSE_WAIT until GC. Fixed by wrapping in
   'async with'. This was the only bare-await aiohttp leak in the
   gateway/tools/plugins tree per audit; all other aiohttp sites use
   the context-manager pattern correctly.

The underlying reporter also saw Feishu SDK (lark-oapi) connections in
CLOSE_WAIT — those are inside the SDK and out of our direct control, but
tightening httpx keepalive across adapters reduces the aggregate pool
pressure regardless of which individual adapter leaks.
2026-05-02 02:23:37 -07:00
beibi9966
38dd057e91 fix(feishu): finalize remote document downloads inside httpx.AsyncClient context (#18502)
Snapshot Content-Type and body while the client context is still
active so pooled connections fully release on exit. Previously the
read happened after `async with httpx.AsyncClient(...)` returned —
which works today only because httpx eagerly buffers non-streaming
responses; a future refactor to `.stream()` would silently read-
after-close.

Part of the #18451 connection-hygiene audit. Salvage of #18502.
2026-05-02 02:23:37 -07:00
Teknium
e444d8f29c
fix(gateway): config.yaml wins over .env for agent/display/timezone settings (#18764)
Regression from the silent config→env bridge. The bridge at module import
time is correct for max_turns (unconditional overwrite), but every other
agent.*, display.*, timezone, and security bridge key was guarded by
'if X not in os.environ' — so a stale .env entry from an old 'hermes setup'
run would shadow the user's current config.yaml indefinitely.

Symptom: agent.max_turns: 500 in config.yaml, HERMES_MAX_ITERATIONS=60
in .env from an old setup, and the gateway silently capped at 60
iterations per turn. Gateway logs confirmed api_calls never exceeded 60.

Three changes:

1. gateway/run.py: drop the 'not in os.environ' guards for all agent.*,
   display.*, timezone, and security.* bridge keys. config.yaml is now
   authoritative for these settings — same semantics already in place
   for max_turns, terminal.*, and auxiliary.*. Also surface the bridge
   failure (previously 'except Exception: pass') to stderr so operators
   see bridge errors instead of silently falling back to .env.

2. gateway/run.py: INFO-log the resolved max_iterations at gateway
   start so operators can verify the config→env bridge did the right
   thing instead of chasing a phantom budget ceiling.

3. hermes_cli/setup.py: stop writing HERMES_MAX_ITERATIONS to .env in
   the setup wizard. config.yaml is the single source of truth. Also
   clean up any stale .env entry left behind by pre-fix setups.

Regression tests in tests/gateway/test_config_env_bridge_authority.py
guard each config→env key against the 'stale .env shadows config' bug.
2026-05-02 02:14:35 -07:00
luyao618
13f344c5ce fix(agent): try fallback providers at init when primary credential pool is exhausted (#17929)
When a provider's credential pool has a single entry in 429-cooldown,
resolve_provider_client returns None and AIAgent.__init__ raises a
misleading RuntimeError suggesting the API key is missing — even when
valid fallback_providers are configured.

This patch makes __init__ iterate the fallback chain before raising,
mirroring the existing in-flight fallback logic in the request loop.
If a fallback resolves, the agent initializes against it and sets
_fallback_activated=True so _restore_primary_runtime can pick the
primary back up after cooldown.

Closes #17929
2026-05-02 02:09:46 -07:00
Teknium
1dce908930
fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log) (#18761)
* fix(gateway): config.yaml wins over .env for agent/display/timezone settings

Regression from the silent config→env bridge. The bridge at module import
time is correct for max_turns (unconditional overwrite), but every other
agent.*, display.*, timezone, and security bridge key was guarded by
'if X not in os.environ' — so a stale .env entry from an old 'hermes setup'
run would shadow the user's current config.yaml indefinitely.

Symptom: agent.max_turns: 500 in config.yaml, HERMES_MAX_ITERATIONS=60
in .env from an old setup, and the gateway silently capped at 60
iterations per turn. Gateway logs confirmed api_calls never exceeded 60.

Three changes:

1. gateway/run.py: drop the 'not in os.environ' guards for all agent.*,
   display.*, timezone, and security.* bridge keys. config.yaml is now
   authoritative for these settings — same semantics already in place
   for max_turns, terminal.*, and auxiliary.*. Also surface the bridge
   failure (previously 'except Exception: pass') to stderr so operators
   see bridge errors instead of silently falling back to .env.

2. gateway/run.py: INFO-log the resolved max_iterations at gateway
   start so operators can verify the config→env bridge did the right
   thing instead of chasing a phantom budget ceiling.

3. hermes_cli/setup.py: stop writing HERMES_MAX_ITERATIONS to .env in
   the setup wizard. config.yaml is the single source of truth. Also
   clean up any stale .env entry left behind by pre-fix setups.

Regression tests in tests/gateway/test_config_env_bridge_authority.py
guard each config→env key against the 'stale .env shadows config' bug.

* fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log)

Three issues observed in production gateway.log during a rapid restart
chain on 2026-05-02, all fixed here.

1. _send_restart_notification logged unconditional success
   adapter.send() catches provider errors (e.g. Telegram 'Chat not found')
   and returns SendResult(success=False); it never raises. The caller
   ignored the return value and always logged 'Sent restart notification
   to <chat>' at INFO, producing a misleading success line directly
   below the 'Failed to send Telegram message' traceback on every boot.
   Now inspects result.success and logs WARNING with the error otherwise.

2. WhatsApp bridge SIGTERM on shutdown classified as fatal error
   _check_managed_bridge_exit() saw the bridge's returncode -15 (our own
   SIGTERM from disconnect()) and fired the full fatal-error path,
   producing 'ERROR ... WhatsApp bridge process exited unexpectedly' plus
   'Fatal whatsapp adapter error (whatsapp_bridge_exited)' on every
   planned shutdown, immediately before the normal '✓ whatsapp
   disconnected'. Adds a _shutting_down flag that disconnect() sets
   before the terminate, and _check_managed_bridge_exit() returns None
   for returncode in {0, -2, -15} while shutting down. OOM-kill (137)
   and other non-signal exits still hit the fatal path.

3. restart_drain_timeout default 60s → 180s
   On 2026-05-02 01:43:27 a user /restart fired while three agents were
   mid-API-call (82s, 112s, 154s into their turns). The 60s drain budget
   expired and all three were force-interrupted. 180s covers realistic
   in-flight agent turns; users on very-long-reasoning models can still
   raise it further via agent.restart_drain_timeout in config.yaml.
   Existing explicit user values are preserved by deep-merge.

Tests
- tests/gateway/test_restart_notification.py: two new tests assert INFO
  is only logged on SendResult(success=True) and WARNING with the error
  string is logged on SendResult(success=False).
- tests/gateway/test_whatsapp_connect.py: parametrized test for
  returncode in {0, -2, -15} proves shutdown-time exits are suppressed;
  separate test proves returncode 137 (SIGKILL/OOM) still surfaces as
  fatal even when _shutting_down is set.
- _check_managed_bridge_exit() reads _shutting_down via getattr-with-
  default so existing _make_adapter() test helpers that bypass __init__
  (pitfall #17 in AGENTS.md) keep working unmodified.
2026-05-02 02:08:06 -07:00
teknium1
50f9f389ec chore(release): map ambition0802 email for AUTHOR_MAP
Follow-up for PR #17939 salvage.
2026-05-02 02:07:14 -07:00
ambition0802
7696ddc59e fix(cli): robust paste file expansion and process_loop error handling (#17666)
Two narrow fixes for long pasted messages silently disappearing:

1. _expand_paste_references: replace path.exists() + read_text() with
   try/except (OSError, IOError). Closes the TOCTOU window where a paste
   file deleted between check and read raised FileNotFoundError, bubbled
   up through process_loop's outer except, and silently dropped the
   user's input. Failures now return the placeholder text and log a
   warning.

2. process_loop outer except: logger.warning() instead of print().
   prompt_toolkit's TUI swallows stdout, so 'Error: …' was invisible
   to the user. Logged errors are discoverable via hermes logs.

Dropped the larger interrupt_queue→pending_input drain that was part of
the original PR — that's a separate class of input-drop (in-progress
interrupt handling) unrelated to the paste-file TOCTOU reported in the
issue, and worth its own review.

Salvage of #17939.
2026-05-02 02:07:14 -07:00
Teknium
5eac6084bc
fix(discord): warn on 32-char clamp collisions in the /skill collector (#18759)
Discord's per-command name limit is 32 chars. When two skill slugs
share the same first 32 chars (or a skill slug clamps onto a reserved
gateway command name), only the first seen wins — the second is
dropped from the /skill autocomplete. The old behavior incremented a
``hidden`` counter silently, so skill authors had no way to discover
the drop short of noticing their skill was missing from the picker.

Not an actively-biting bug today (no collisions on the default catalog
as of 2026-05), but a landmine the moment someone ships a skill with a
long name. The earlier series in #18745 / #18753 / #18754 dropped the
other silent data-loss paths in the Discord /skill collector; this one
lights up the last remaining one.

Fix: promote ``_names_used`` from a set to a dict keyed by the clamped
name, mapping to the source cmd_key (or a ``"<reserved>"`` sentinel
for names inherited via ``reserved_names``). On collision, log a
WARNING naming both sides — the winner, the loser, the clamped name,
and what to rename.

Two phrasings:

* skill-vs-skill — "both clamp to X on Discord's 32-char command-name
  limit; only the winner appears in /skill. Rename one skill's
  frontmatter ``name:`` to differ in its first 32 chars."
* skill-vs-reserved — "collides with a reserved gateway command name;
  the skill will not appear in /skill. Rename the skill's frontmatter
  ``name:``."

Tests: three cases in
``tests/hermes_cli/test_discord_skill_clamp_warning.py`` —
skill-vs-skill collision (warning names both cmd_keys + clamped prefix),
skill-vs-reserved collision (warning uses the distinct phrasing), and a
no-collision negative (zero warnings emitted).
2026-05-02 02:05:01 -07:00
teknium1
e363ced3c3 test(discord): regression coverage for zombie-websocket guard in connect()
Covers PR #18224 fix for issue #18187 — when DiscordAdapter.connect() is
called a second time without an intervening disconnect(), the previous
commands.Bot must be closed before a new one is created. Otherwise both
websockets stay connected to Discord's gateway and both fire on_message,
producing double responses with different wording.
2026-05-02 02:04:14 -07:00
luyao618
292d2fb42f fix(discord): close old client before reconnect to prevent zombie websockets (#18187)
When DiscordAdapter.connect() is called during reconnect, it creates a new
commands.Bot client without closing the previous one. The old client's
websocket remains connected to Discord's gateway, causing both to fire
on_message for every incoming event — resulting in double responses.

Fix: before creating a new Bot instance, check if a previous client exists
and close it. This ensures only one websocket connection is active at any
time.

Closes #18187
2026-05-02 02:04:14 -07:00
teknium1
0a6865b328 test(credential_pool): regression coverage for .env vs os.environ precedence
Covers PR #18256 fix for issue #18254 — when OPENROUTER_API_KEY is set in
BOTH os.environ (stale from parent shell) and ~/.hermes/.env (fresh),
_seed_from_env must prefer the .env value. Also guards the fallback case
where .env omits the key entirely (Docker/K8s/systemd deployments that
only inject via runtime env).
2026-05-02 02:00:32 -07:00
teknium1
9c626ef8ea chore(release): map franksong2702 email for AUTHOR_MAP
Follow-up for PR #18256 salvage.
2026-05-02 02:00:32 -07:00
Frank Song
2ef1ad280b fix: prefer ~/.hermes/.env over os.environ when seeding credential pool
When _seed_from_env() reads API keys to populate the credential pool, it
should treat ~/.hermes/.env as the authoritative source — not os.environ.
Stale env vars inherited from parent shell processes (Codex CLI, test
scripts, etc.) can shadow deliberate changes to the .env file, causing
auth.json to cache an outdated key that leads to silent 401 errors.

This is especially visible with OpenRouter: if a parent process exported
OPENROUTER_API_KEY=test-key-fresh and the user later updates .env with a
valid key, restarting Hermes still picks up the stale os.environ value,
writes it back to auth.json, and all API calls fail with 401.

Fixes #18254
2026-05-02 02:00:32 -07:00
Teknium
10297fa23c
fix(discord): /reload-skills now refreshes the /skill autocomplete live (#18754)
`_register_skill_group` captured the skill catalog in closure variables
(`entries` and `skill_lookup`) so the single `tree.add_command` call at
startup owned the only live copy. The closure is never re-entered after
startup, so `/reload-skills` — which rescans the on-disk skills dir and
refreshes the in-process `_skill_commands` registry — had no way to
propagate results into the `/skill` autocomplete on Discord. New skills
stayed invisible in the dropdown, and deleted skills returned
"Unknown skill" when the stale autocomplete entry was clicked.

The fix is purely a dataflow change: promote `entries` and `skill_lookup`
to instance attributes (`_skill_entries`, `_skill_lookup`), split the
collector-driven rebuild into a helper (`_refresh_skill_catalog_state`),
and add a public `refresh_skill_group()` method that re-runs the helper
and is safe to call at any point after the initial registration.

The gateway's `_handle_reload_skills_command` then iterates
`self.adapters` and calls `refresh_skill_group()` on any adapter that
exposes it (currently only Discord). Both sync and async implementations
are supported; adapters that don't override the method (Telegram's
BotCommand menu, Slack subcommand map, etc.) are silently skipped — the
in-process `reload_skills()` call covers them.

No `tree.sync()` is required because Discord fetches autocomplete
options dynamically on every keystroke — mutating the instance state the
callbacks already read from is sufficient. That sidesteps the per-app
command-bucket rate limit (~5 writes / 20 s) that made the previous
bulk-sync-on-reload approach unusable (#16713 context).

Tests: tests/gateway/test_reload_skills_discord_resync.py — five cases
covering (1) refresh replaces entries, (2) entries stay sorted after
refresh, (3) collector exception leaves cached state intact, (4)
`_refresh_skill_catalog_state` populates the instance attrs, (5)
orchestrator calls `refresh_skill_group()` on sync + async adapters and
skips adapters that don't expose it.
2026-05-02 02:00:11 -07:00
Teknium
6ec74aec07
fix(gateway): match disabled/optional skills by frontmatter slug, not dir name (#18753)
_check_unavailable_skill is meant to turn a typed "/foo" command that
doesn't resolve into a specific hint — "disabled, enable with hermes
skills config" or "available but not installed, install with hermes
skills install …" — instead of the generic "unknown command" reply.

It was doing the match with `skill_md.parent.name.lower().replace("_", "-")`,
comparing that to the typed command. For every skill whose directory name
drifted from its declared frontmatter `name:`, that comparison failed and
the user got the unhelpful generic path. On a standard install today 19
skills have this drift, e.g.:

  dir: mlops/stable-diffusion
  frontmatter: name: Stable Diffusion Image Generation
  registered slug (what the user types): /stable-diffusion-image-generation

  dir: mlops/qdrant
  frontmatter: name: Qdrant Vector Search
  registered slug: /qdrant-vector-search

  dir: mlops/flash-attention
  frontmatter: name: Optimizing Attention Flash
  registered slug: /optimizing-attention-flash

In every case, _check_unavailable_skill would fall through because
"stable-diffusion" != "stable-diffusion-image-generation", even with the
skill sitting right there on disk.

Fix: extract a small `_skill_slug_from_frontmatter` helper that reads the
SKILL.md frontmatter and normalizes exactly like scan_skill_commands
(lower, spaces/underscores → hyphens, strip non-[a-z0-9-], collapse
runs of hyphens, strip edges). Use it in both the
disabled-skills branch and the optional-skills branch. The disabled-set
membership check now uses the declared frontmatter name (which is what
`hermes skills config` writes into skills.disabled / platform_disabled),
not the slug.

Tests: five cases in tests/gateway/test_unavailable_skill_hint.py —
the drift case for the disabled branch, unknown-command negative,
matched-but-not-disabled negative, non-alnum stripping, and the drift
case for the optional-skills branch. All five fail against main and
pass with the fix.
2026-05-02 02:00:09 -07:00
Teknium
8825e9044c
fix(discord): complete #18741 for /skill autocomplete and drop legacy 25x25 caps (#18745)
``discord_skill_commands_by_category`` was lagging the flat
``discord_skill_commands`` collector on two counts. Both were actively
dropping skills from Discord's ``/skill`` autocomplete dropdown.

1. External-dir skills were filtered out. #18741 widened the flat
   collector to accept ``SKILLS_DIR + skills.external_dirs`` but left
   this sibling collector — the one ``_register_skill_group`` actually
   uses on Discord — still matching ``SKILLS_DIR`` only. External
   skills were visible in ``hermes skills list`` and the agent's
   ``/skill-name`` dispatch but silently absent from Discord's
   ``/skill`` picker. Widen the accepted roots to match, and derive
   categories from whichever root the skill lives under so
   ``<ext>/mlops/foo/SKILL.md`` still lands in the ``mlops`` group.

2. 25-group × 25-subcommand caps were still applied. PR #11580
   refactored ``/skill`` to a flat autocomplete (whose options Discord
   fetches dynamically — no per-command payload concern) and its
   docstring promises "no hidden skills." The collector kept the old
   nested-layout caps anyway, silently dropping anything past the 25th
   alphabetical category. On installs with 29 category dirs today (real
   example: tail categories ``social-media``, ``software-development``,
   ``yuanbao`` going missing) this was biting immediately. Remove the
   caps; ``hidden`` now reports only 32-char name-clamp collisions
   against reserved names.

Tests: guard both behaviors. ``test_no_legacy_25x25_cap`` builds 30
categories × 30 skills each and asserts all 900 are returned.
``test_external_dirs_skills_included`` monkeypatches
``get_external_skills_dirs`` and asserts an external-dir skill makes
it into the result grouped under its own top-level directory.
2026-05-02 02:00:06 -07:00
Jacob Lizarraga
2470434d60 fix(telegram): probe polling liveness after reconnect to detect wedged Updater
After a transient Telegram 502, _handle_polling_network_error's
stop()+start_polling() cycle can leave PTB's Updater with `running=True`
but a wedged consumer task that never makes progress. No error_callback
fires in that state, so the reconnect ladder never advances past attempt
1, the MAX_NETWORK_RETRIES fatal-error path is never reached, and the
gateway sits silent indefinitely.

Schedule a heartbeat probe (60s after a successful reconnect) that
verifies Updater.running is still True and bot.get_me() responds within
a tight asyncio.wait_for timeout. Either failure feeds back into the
reconnect ladder so the existing escalation path fires.

No PTB-internal coupling, no Application rebuild — minimal additive
defense inside the existing reconnect abstraction.

Tests cover healthy / Updater non-running / probe timeout / probe
network error / already-fatal cases, plus an integration check that the
probe is actually scheduled after a successful start_polling().

Closes the silent-wedge case observed in the wild after a transient
Telegram 502; existing reconnect tests updated to mock bot.get_me() now
that the success path schedules a heartbeat probe.
2026-05-02 01:55:04 -07:00