Commit graph

1127 commits

Author SHA1 Message Date
aaronagent
27ddd8fd80 fix(gateway): sanitize agent error messages, validate webhook gh args
Two of the three fixes from PR #6660 (the cli.py reopen_session change is
moot — that raw _conn.execute reopen block no longer exists on main).

- gateway/run.py: stop sending raw type(e).__name__ and str(e)[:300] to
  end users on chat platforms. Exception text from LLM providers can leak
  API URLs, file paths, and partial credentials. Return a generic message;
  keep curated status hints for known HTTP codes; full detail stays in logs.
- gateway/platforms/webhook.py: validate pr_number (positive int) and repo
  (owner/name regex) before passing to the 'gh pr comment' subprocess.
  Payload-controlled values could otherwise inject gh flags (--help, a
  different --repo). List-form subprocess means this is arg injection, not
  shell injection, but validation is still correct.

Co-authored-by: aaronagent <1115117931@qq.com>
2026-06-28 18:53:26 -07:00
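
A minimal sketch of the argument validation described in 27ddd8fd80, assuming a helper that builds the argv list for `gh pr comment`. The function name, the exact regex, and the error messages are illustrative assumptions, not the code in gateway/platforms/webhook.py.

```python
import re

# Hypothetical pattern: "owner/name" where neither part may start with "-",
# so a payload value can never be parsed by gh as a flag.
_REPO_RE = re.compile(r"[A-Za-z0-9_.][A-Za-z0-9_.-]*/[A-Za-z0-9_.][A-Za-z0-9_.-]*")


def build_gh_comment_args(repo, pr_number, body):
    """Return an argv list for 'gh pr comment', or raise ValueError."""
    if not isinstance(repo, str) or not _REPO_RE.fullmatch(repo):
        raise ValueError("repo must look like owner/name")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(pr_number, bool) or not isinstance(pr_number, int) or pr_number <= 0:
        raise ValueError("pr_number must be a positive integer")
    # List form: no shell is involved, so only argument injection needs guarding.
    return ["gh", "pr", "comment", str(pr_number), "--repo", repo, "--body", body]
```

Validating before the list is built keeps the subprocess call itself unchanged; the body is passed as the value of `--body`, so it is not interpreted as a separate flag.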
Teknium
f1cbe4308f
fix(gateway): log error-notification failures instead of silently swallowing (#54472)
* fix(gateway): log error-notification failures instead of silently swallowing

The last-resort exception handler in _process_message_background() that
sends an error notice to the user caught all exceptions with a bare pass,
leaving zero trace when the notification itself failed. Upgrade to
logger.error(..., exc_info=True) so a failed error-notification send is
debuggable post-mortem.

Salvaged from #6499 by @BongSuCHOI (the logging-upgrade portion only).

* docs: add PR infographic for gateway error-notify logging
2026-06-28 18:52:51 -07:00
Teknium
d65468e7ff
fix(security): SSRF guard yuanbao media download_url (#54470)
yuanbao_media.download_url() fetched model-supplied (outbound) and inbound
image/file URLs server-side via httpx with follow_redirects=True and no
SSRF check. A model response containing <img src="http://169.254.169.254/...">
routed through ImageUrlHandler -> download_url and would fetch cloud-metadata
endpoints; same for inbound media.

Add an is_safe_url() pre-flight plus an async redirect event-hook that
re-validates every 30x target, matching the cache_image_from_url() guard in
gateway/platforms/base.py. The other gateway adapters already guard their
URL-fetch paths; this was the remaining unguarded one.
2026-06-28 15:29:59 -07:00
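
A rough sketch of the kind of pre-flight check d65468e7ff describes, written with the standard library only. The name `is_safe_url_sketch` and the injectable `resolve` parameter are assumptions for illustration; the real `is_safe_url()` may differ. The commit also re-validates each redirect target, which this sketch leaves out, and it does not address DNS rebinding.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def _resolve(host):
    """Resolve a hostname to its IP address strings (performs DNS)."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_safe_url_sketch(url, resolve=_resolve):
    """Return True only for http(s) URLs whose host maps to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        # Literal IP in the URL: no DNS lookup needed.
        addresses = [ipaddress.ip_address(parts.hostname)]
    except ValueError:
        try:
            addresses = [ipaddress.ip_address(a) for a in resolve(parts.hostname)]
        except (OSError, ValueError):
            return False
    # is_global is False for private, loopback, and link-local ranges.
    return bool(addresses) and all(a.is_global for a in addresses)
```

With this check the metadata address from the commit message, 169.254.169.254, is rejected because it is link-local. The same function would need to run again on every redirect target to match the guard the commit describes.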
liuhao1024
9d919daf44 fix(gateway): mark platform lock failure as retryable instead of permanently fatal
When a stale lock file survives a gateway crash, `acquire_scoped_lock()`
may return `(False, existing_dict)` even after detecting and deleting
the stale lock (e.g. if unlink fails or a race condition occurs).

Previously, `_acquire_platform_lock()` called
`_set_fatal_error(..., retryable=False)`, which permanently killed the
platform — the reconnect watcher never retries a non-retryable fatal
error.

Change to `retryable=True` so the platform enters the "retrying"
state and the reconnect watcher can attempt acquisition again after the
standard backoff delay.

Fixes #54167
2026-06-28 04:35:37 -07:00
tymrtn
d7f655f370 fix: accept typed clarify choice replies 2026-06-28 04:13:19 -07:00
teknium1
58c36b1798 fix(api-server): widen error redaction to cron-endpoint + SSE sites
Follow-up to the salvaged #37733 fix. The contributor centralized
redaction at _openai_error and the chat/responses failure paths, which
covers the OpenAI-compatible envelopes transitively. Two sibling classes
crossed the same authenticated HTTP boundary unredacted:

- 8x cron-management endpoints returning {"error": str(e)} on 500
- the session-chat SSE error event ({"message": str(exc)})

Route both through the same _redact_api_error_text(force=True) helper.
Add AUTHOR_MAP entry for coygeek and a TestRedactApiErrorText guard
covering mask/force/limit/passthrough behavior.
2026-06-28 02:05:38 -07:00
Coy Geek
5e774de76e fix(api-server): redact provider errors at HTTP boundary
Force API-server error text through the existing secret redactor before returning OpenAI-compatible errors, response fallback text, response snapshots, and run failure events. This prevents credential-shaped provider failure text from crossing the API-server boundary while preserving debuggable sanitized messages.
2026-06-28 02:05:38 -07:00
Teknium
90d25adc9e
fix(gateway): deliver profile-scoped cache media on symlinked HERMES_HOME (#54060)
Generated images under a profile gateway's cache (profiles/<name>/cache/
images/...) were silently dropped from Telegram/Discord delivery when
HERMES_HOME is symlinked under a denied prefix (e.g. /opt/data ->
/root/.hermes) and $HOME is not that prefix. The resolved path lands
under /root (a system denylist prefix), the root-home exception only
fires when the denied prefix IS $HOME, and the static safe-roots list
only covers the active HERMES_HOME's top-level cache — not per-profile
cache dirs. Both gates fail, so validate_media_delivery_path returns
None and the gateway logs 'Skipping unsafe MEDIA directive path'.

_media_delivery_allowed_roots() now also enumerates per-profile cache
roots (<root>/profiles/*/cache/{images,audio,videos,documents,
screenshots}) at check time. Allowlist match runs before the denylist,
so the profile artifact delivers regardless of the /root interaction;
profile-dir credentials (auth.json) stay blocked since they aren't
under a cache subdir.

Reopened regression of #34485/#38108, neither of which covered the
profile-scoped symlink case. Fixes #31733.
2026-06-28 01:07:28 -07:00
Teknium
d3d621f7c3
revert(windows): roll back terminal-popup PRs #53791 #53810 #53829 (#53853)
* Revert "fix(windows): capture is not a no-window boundary; route flashing spawns through chokepoint (#53829)"

This reverts commit 2ecca1e7d3.

* Revert "fix(windows): stop terminal-window popups from background spawns (#53810)"

This reverts commit 5db1430af9.

* Revert "fix(windows): stop subprocess console-window popups + add CI guard (#53791)"

This reverts commit ef17cd204d.
2026-06-27 15:59:00 -07:00
Teknium
2ecca1e7d3
fix(windows): capture is not a no-window boundary; route flashing spawns through chokepoint (#53829)
Follow-up to #53791 addressing review feedback: the footgun checker treated
capture_output=/stdout=/stderr=/check_output as proof a subprocess can't pop a
Windows console. That invariant is false — stream redirection controls where a
child's output goes, not whether a console is allocated. From a console-less
parent (Desktop/Electron, pythonw.exe, detached gateway/cron) a console-subsystem
child still flashes a window even when fully captured.

- check-windows-footguns.py: capture/redirect/check_output is no longer a blanket
  safe-pass. Added _WINDOWS_FLASHING_PROGRAMS (git/gh/npm/node/python/uv/ffmpeg/
  docker/powershell/…); calls to those are flagged even when captured. Non-flashing
  programs keep the capture exemption (no 271-site noise). _subprocess_compat.run/
  popen calls are inherently safe (wrapper injects CREATE_NO_WINDOW).
- Routed the 35 genuine flashing git/gh/npm/uv/ffmpeg/docker spawns through the
  _subprocess_compat.run/popen chokepoint (Brooklyn's wrapper from #53810) — the
  durable fix, not per-site annotations. cmd.exe /c start stays # ok (intentional).
- Updated tests + CONTRIBUTING.md rule #17 to the corrected invariant.
2026-06-27 14:49:41 -07:00
Teknium
caf4dcc7ad
fix(whatsapp): resolve phone↔LID aliases in adapter DM/group allowlist (#53588)
The adapter-level intake gate (_is_dm_allowed / _is_group_allowed, reached
via _should_process_message) did a raw set-membership check against the
configured allowlist. WhatsApp now delivers inbound DM senders in LID form
(<id>@lid) while operators configure allowlists with phone numbers, so the
check never matched and every DM from an allowed contact was silently
dropped before the gateway authz layer ran.

Route both gates through the existing gateway.whatsapp_identity.
expand_whatsapp_aliases helper (already used by gateway authz and session
keys), which walks the bridge's lid-mapping-*.json session files. Phone and
LID forms now resolve to each other in both directions; exact JID matches,
wildcard, disabled/open policies, and empty-allowlist fail-closed behavior
are all preserved.

Fixes #14486
2026-06-27 04:17:12 -07:00
flobo3
b8b695e2cd fix(api): surface agent crash in SSE chat completions stream 2026-06-27 03:52:44 -07:00
Yashiel Sookdeo
70c834a740 refactor: extract shared GFM table→bullet helpers into helpers.py
Move table-detection regex, row-splitting, and table-to-bullet
conversion into gateway/platforms/helpers.py so both Discord and
Telegram adapters can share them.

Co-authored-by: Yashiel Sookdeo <yashiel@skyner.co.za>
2026-06-27 03:57:24 +05:30
Teknium
619dc4a561
fix(whatsapp_cloud): resolve reply-to text so the agent sees reply context (#52957)
Replies on WhatsApp Cloud arrived at the agent with reply_to_id set but
reply_to_text=None, so run.py never injected the "[Replying to: ...]"
disambiguation prefix (it gates on reply_to_text). Meta's webhook context
object carries only the quoted message's id, never its text.

Index (chat_id, wamid) -> text in rich_sent_store on every inbound message
and every outbound text send -- the same store that solved the identical
Telegram rich-send problem -- then look up the quoted text in
_build_message_event_from_cloud and populate reply_to_text plus
reply_to_is_own_message, derived from context.from versus the business
number.
2026-06-26 01:05:05 -07:00
teknium1
43b8ba4181 fix(telegram): preserve Bot API update queue on watcher reconnect
After a prolonged outage the in-process network-error ladder escalates to
fatal and GatewayRunner._platform_reconnect_watcher rebuilds a fresh adapter
that reconnects through the bootstrap path. That path called
start_polling(drop_pending_updates=True), discarding every update Telegram
queued during the outage — all messages sent while the bot was down were
silently lost. The in-process ladder and 409-conflict handler already passed
drop_pending_updates=False; only bootstrap did not distinguish a cold first
boot from a reconnect.

Thread an is_reconnect signal from the watcher through
_connect_adapter_with_timeout into adapter.connect(). The base
BasePlatformAdapter.connect() gains a keyword-only is_reconnect=False so every
adapter inherits a tolerant signature (no per-platform breakage when the
runner forwards the kwarg). Telegram translates is_reconnect into
drop_pending_updates=not is_reconnect on both the polling and webhook bootstrap
calls. Cold boot still drops the stale queue; a watcher reconnect preserves it.

Fixes #46621.

Co-authored-by: annguyenNous <annguyen@nousresearch.com>
Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>
Co-authored-by: Kewe63 <Kewe63@users.noreply.github.com>
2026-06-25 21:29:57 -07:00
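
The signal threading in 43b8ba4181 can be sketched roughly as follows. The class names and the `_start_polling` hook are hypothetical stand-ins; only the keyword-only `is_reconnect=False` parameter and the `drop_pending_updates=not is_reconnect` translation come from the commit message.

```python
import asyncio


class BaseAdapterSketch:
    """Stand-in for a base adapter; is_reconnect is keyword-only and optional."""

    async def connect(self, *, is_reconnect=False):
        return True


class TelegramAdapterSketch(BaseAdapterSketch):
    def __init__(self):
        self.polling_calls = []

    async def _start_polling(self, *, drop_pending_updates):
        # Hypothetical hook standing in for the real bootstrap call.
        self.polling_calls.append(drop_pending_updates)

    async def connect(self, *, is_reconnect=False):
        # Cold boot drops the stale queue; a watcher reconnect keeps it.
        await self._start_polling(drop_pending_updates=not is_reconnect)
        return True
```

Because the base signature already accepts the keyword, a runner can forward `is_reconnect=True` to any adapter without knowing whether that adapter uses it.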
Ben
d335164833 fix(relay): authorize relay inbound via connector-enforced upstream authz
A hosted instance fronted by the Team Gateway connector dropped EVERY relay
message as "Unauthorized user" and the agent never replied — despite the
message routing correctly through the connector to the instance.

Root cause: gateway authorization (_is_user_authorized) had no notion of
upstream-enforced authz. Platform.RELAY matches no {PLATFORM}_ALLOWED_USERS
allowlist and isn't in the HA/WEBHOOK always-authorized set, so a relay user
with no env allowlist configured hit the default-deny ("No user allowlists
configured. All unauthorized users will be denied."). The message was received,
then silently denied before reaching the agent.

This is incorrect for relay: the connector authenticates the gateway's WS with
a per-instance secret and performs owner-only author-binding resolution BEFORE
delivering. A message only reaches this gateway because the connector resolved
it to THIS instance's bound user (user_instance_binding), keyed on the author id
the connector OBSERVED off the event — never a gateway claim. The authorization
decision is already made by a trusted, authenticated upstream; there is no local
RELAY_ALLOWED_USERS allowlist to consult, and default-denying for its absence is
the bug.

Fix: add a generic BasePlatformAdapter.authorization_is_upstream capability
(default False) that the relay adapter overrides to True, plus a dedicated
trusted branch in _is_user_authorized that honors it. This is delegation to a
trusted upstream, NOT a fail-open: it fires only for an adapter that explicitly
declares the flag; every direct network-exposed adapter leaves it False and the
env-allowlist default-deny (SECURITY.md §2.6) is unchanged. Distinct from
enforces_own_access_policy, which mirrors a LOCAL config-driven allowlist —
this delegates to an authenticated upstream's decision.

Tests: behavior contract that the base defaults False, the relay adapter
declares True, a relay user (group + DM) is authorized with no env allowlist,
and crucially a non-upstream adapter with no allowlist still default-denies
(guards against the fix becoming a blanket fail-open). 6 new tests; relay +
authz + config-policy suites green (134 + 90).

Found via live staging debug of the Discord self-serve onboarding flow.
2026-06-25 10:06:21 +10:00
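
The capability flag from d335164833 might look roughly like this. The class names and the simplified `is_user_authorized` signature are assumptions; the real `_is_user_authorized` consults per-platform environment allowlists that are not modelled here.

```python
class BaseAdapterSketch:
    # Default: authorization is decided locally, so default-deny applies.
    authorization_is_upstream = False


class RelayAdapterSketch(BaseAdapterSketch):
    # The connector has already authenticated and resolved the author.
    authorization_is_upstream = True


def is_user_authorized(adapter, user_id, allowed_users):
    """Simplified stand-in for the gateway authorization check."""
    if adapter.authorization_is_upstream:
        return True  # delegation to a trusted upstream, opted into per adapter
    if not allowed_users:
        return False  # no allowlist configured: default-deny is unchanged
    return user_id in allowed_users
```

The second branch is the one the commit's tests guard: an adapter that does not declare the flag must still be denied when no allowlist exists.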
liuhao1024
404b06ac4f fix(gateway): honor server retry_after in _send_with_retry for Telegram flood control (#46762)
When Telegram's sendRichMessage returns a FloodWait/RetryAfter error,
_try_send_rich() now extracts the server-provided retry_after value and
propagates it through SendResult.retry_after. The base _send_with_retry()
layer honors this value instead of using its default short exponential
backoff (~2s, ~4s), preventing the retry budget from being exhausted
against a server that demands a 25-37s wait.

Salvaged from #46774 by @liuhao1024. Telegram adapter path moved from
gateway/platforms/telegram.py to plugins/platforms/telegram/adapter.py
since the original PR.

Closes #46762
2026-06-25 02:43:47 +05:30
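
One way to picture the delay selection in 404b06ac4f, assuming a helper that is called once per retry. The function name, the `base` and `cap` parameters, and their defaults are illustrative assumptions; only the idea of preferring the server-provided `retry_after` over the short exponential backoff comes from the commit message.

```python
def next_retry_delay(attempt, retry_after=None, base=2.0, cap=60.0):
    """Delay in seconds before retry number `attempt` (1-based)."""
    backoff = min(base ** attempt, cap)  # 2s, 4s, 8s, ... with the default base
    if retry_after is not None and retry_after > 0:
        # Never wait less than the server asked for.
        return max(float(retry_after), backoff)
    return backoff
```

With a server demand of 30 seconds, the first retry waits 30 seconds rather than 2, so the retry budget is not spent on attempts that are certain to be rejected.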
uperLu
0d4cecb352 fix(cron): avoid provider package shadowing core cron 2026-06-23 23:39:22 -07:00
islam666
0c79992db5 fix(gateway): preserve _session_tasks on guard mismatch to enable stale lock healing (#48300)
_session_task_is_stale() failed to detect a stale session lock when the owner
task completed and cleaned _session_tasks (del in _process_message_background's
finally) but _active_sessions was NOT released because _release_session_guard
skipped on a guard mismatch (a concurrent reset/new command or drain handoff
swapped _active_sessions[key] to a different guard). With no owner task left to
inspect, _session_task_is_stale reported 'not stale', the orphaned guard was
never healed, and the session deadlocked permanently — later messages received
but never dispatched.

Reorder the finally cleanup to release-then-conditional-delete: release the
guard first, then drop the _session_tasks entry ONLY if the guard was actually
released (session_key no longer in _active_sessions). On a guard mismatch the
done-task entry survives, so the on-entry self-heal (_session_task_is_stale ->
_heal_stale_session_lock) detects the stale lock and clears it on the next
inbound message.

Extracted the cleanup into a callable _cleanup_finished_session_task() helper so
the regression test drives the REAL production code path rather than a copy of
its logic (the original test inlined the fixed logic and passed regardless of
the production order — mutation-verified the rewritten tests now fail on the
buggy del-first order). Added a positive-path test (guard matches -> release +
delete) so both branches are pinned.

Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
2026-06-24 03:06:21 +05:30
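
The release-then-conditional-delete order from 0c79992db5 can be sketched with two plain dictionaries. This is a simplified model under stated assumptions: the real `_cleanup_finished_session_task()` operates on runner state and its signature is not shown in the commit message.

```python
def cleanup_finished_session_task(active_sessions, session_tasks, key, guard):
    """Release the guard first; delete the task entry only if release happened."""
    # Release only when the stored guard is the one this task owns.
    if active_sessions.get(key) is guard:
        del active_sessions[key]
    # On a guard mismatch the key is still present, so the finished task
    # entry survives and a later staleness check can inspect it.
    if key not in active_sessions:
        session_tasks.pop(key, None)
```

Deleting the task entry first, as the buggy order did, would leave a mismatched guard with no owner task to examine.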
kshitijk4poor
100e7be20e fix(security): deny root-level credential stores in media delivery
The media-delivery denylist in gateway/platforms/base.py enumerated only
.env/auth.json/credentials/config.yaml under HERMES_HOME, so other
credential stores that live at the root fell through and could be
auto-attached to chat replies. The reported case: the Google Workspace
skill's google_token.json refreshes every turn, bumping its mtime to
'now', which kept passing the strict-mode recency window and re-sent the
OAuth token on every reply.

Extend the explicit per-file denylist to mirror the canonical credential
set already enforced by the read/write guards in agent/file_safety.py:
google_token.json, google_oauth_pending.json, auth/google_oauth.json,
.anthropic_oauth.json, webhook_subscriptions.json, cache/bws_cache.json,
auth.lock, and the pairing/ token directory.

Targeted per-file additions (not a blanket ~/.hermes deny, which was
declined in #32090/#34425 because it would block skills/, logs/, and
ad-hoc agent-written deliverables). mcp-tokens/ (#37222) and
state.db/kanban.db (#41071) are left to their sibling targeted PRs.

Reported-by: xxxigm (#50912)
2026-06-23 02:56:48 +05:30
teknium
e9cd8c5bf3 fix(delivery): drop env-var knob, flag all chunking adapters
Follow-up to ScotterMonk's cron-truncation fix:

- Remove HERMES_DELIVERY_MAX_PLATFORM_OUTPUT env var. Behavioral config
  belongs in config.yaml, not a new HERMES_* env var (.env is secrets
  only). The actual bug is fixed entirely by the adapter-aware skip; the
  configurable cap was unneeded scope. MAX_PLATFORM_OUTPUT is a constant
  again, collapsing the max_output=0 disable branch and the
  audit-vs-truncation threshold divergence.
- Flag the remaining verified-chunking adapters (slack, matrix, feishu,
  mattermost, teams, whatsapp, whatsapp_cloud, weixin, bluebubbles,
  yuanbao) with splits_long_messages=True so the fix covers the whole
  bug class, not just Discord/Telegram. Each verified to chunk in its
  own send() via truncate_message().
- SMS deliberately left False: it chunks for normal replies but a
  multi-segment cron blast is cost-bearing; the 4000-cap + file save is
  the safer default there.
- Update tests: drop the two env-override tests, add a test asserting a
  save failure during truncation (non-chunking) propagates.
2026-06-22 05:41:22 -07:00
ScotterMonk
86e4521cb1 fix(delivery): make cron output truncation configurable + adapter-aware
Gateway-level truncation (MAX_PLATFORM_OUTPUT=4000) was pre-empting
adapter-side message splitting. Discord and Telegram both chunk long
content natively in their send() via truncate_message(), but the
delivery router truncated to 3800 chars + footer before the adapter
ever saw the full payload — so long cron output was cut short instead
of being delivered as multiple messages (issue #50126).

Changes:
- HERMES_DELIVERY_MAX_PLATFORM_OUTPUT env var makes the cap configurable
  (default 4000, backward compatible). Set to 0 to disable truncation.
- TRUNCATED_VISIBLE (3800) removed — visible portion now derived
  dynamically from max_output minus the actual footer length.
- New BasePlatformAdapter.splits_long_messages capability flag (default
  False). Adapters that chunk in send() set True; delivery skips
  truncation for them but still saves full output to disk as audit.
- Flagged Discord and Telegram (both verified to chunk in send()).

Fixes #50126
2026-06-22 05:41:22 -07:00
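
Taken together with the follow-up e9cd8c5bf3 (which makes the cap a constant again), the adapter-aware skip could be sketched like this. The function name and the footer text are hypothetical; the 4000 cap, the `splits_long_messages` flag, and deriving the visible portion from the footer length are from the commit messages. Saving the full output to disk is omitted.

```python
MAX_PLATFORM_OUTPUT = 4000  # constant cap, per the follow-up commit

# Hypothetical footer text, used only for this illustration.
FOOTER = "\n\n[truncated; full output saved]"


def prepare_for_delivery(text, splits_long_messages, footer=FOOTER):
    """Return the text to hand to the adapter's send()."""
    if splits_long_messages or len(text) <= MAX_PLATFORM_OUTPUT:
        return text  # the adapter chunks it, or it already fits
    # Visible portion is derived from the cap minus the actual footer length.
    visible = MAX_PLATFORM_OUTPUT - len(footer)
    return text[:visible] + footer
```

An adapter that chunks in its own `send()` receives the full payload, while a non-chunking adapter receives exactly the capped length including the footer.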
kshitij
1f28b1a9b9
fix(gateway): redact credentials from approval prompts before sending to clients (#48456) (#50767)
Tirith redacts its own findings, but the approval-request callbacks built the
operator prompt from the RAW command string, so a credential-shaped value
Tirith flagged was sent verbatim to clients, undoing the redaction one layer up.

Two egress transports carried the leak; both are fixed via a shared
module-level seam _redact_approval_command() (redact_sensitive_text force=True):
  1. chat platforms — _approval_notify_sync (gateway/run.py): redact before
     both the button path (send_exec_approval) and the plain-text /approve
     fallback.
  2. SSE/API stream — _approval_notify (gateway/platforms/api_server.py):
     redact event['command'] before it is enqueued to API/desktop clients.
     (whole-bug-class: sibling call path on a separate transport.)

force=True so the prompt — a hard secret-egress boundary — honors redaction
even when security.redact_secrets is off. Clean commands pass through unchanged.

Tests bind the seam (synthetic credential-format fixtures, force-when-disabled) AND assert
BOTH callbacks ASSIGN the redacted result before the send/enqueue sink, via an
AST contract that rejects a discarded-result call. All mutation-checked.
2026-06-22 11:39:45 +00:00
teknium1
4314d451ca fix(gateway): accept any inbound file type across all messaging platforms
Authorization to message the agent is the gate, not the file extension.
Previously the inbound-attachment allowlist (SUPPORTED_DOCUMENT_TYPES) was
opt-OUT on Discord (allow_any_attachment defaulted false) and had no bypass
at all on Telegram/Slack — so an .html (or any non-allowlisted type) was
dropped or hard-rejected before the agent saw it.

Now every authorized upload is cached and surfaced to the agent regardless
of type:
- base.cache_media_bytes(): unknown types cache as octet-stream (or the
  caller-supplied MIME) instead of returning None — fixes the chokepoint
  that Teams/Telegram-media route through.
- discord/telegram/slack adapters: removed the allowlist reject/skip; any
  non-media attachment is typed DOCUMENT and cached. Known types keep their
  precise MIME.
- Text inlining now gates on a shared _TEXT_INJECT_EXTENSIONS set (text +
  code + config + markup) instead of a blind UTF-8 decode, so binary formats
  (PDF/zip/docx) with ASCII headers are never inlined.
- gateway/run.py emits the path-pointing context note for every DOCUMENT,
  including non text/application MIME types.
- discord.allow_any_attachment is now a documented no-op kept for config
  back-compat.

Validation: 357 gateway tests pass; E2E confirms .html/.bin/custom types
cache, known types stay precise, PDFs are not inlined.
2026-06-21 22:43:45 -07:00
teknium1
7726ce3040 fix(security): close hermes-0day MCP-persistence attack surface
Remove the dashboard --insecure auth-bypass, add an MCP persistence guard +
IOC blocklist, and raise the API-server key entropy floor.

Driven by the June 2026 hermes-0day campaign (r/hermesagent, live 854.media
instance): scanners find exposed Hermes dashboards/API servers, drive the
root agent to plant a 'command: bash' MCP entry that appends an attacker SSH
key to authorized_keys, which cron + startup then re-execute every tick.

- dashboard: --insecure no longer disables the auth gate. should_require_auth
  returns True for every non-loopback bind; a public bind ALWAYS requires an
  auth provider (bundled password provider or OAuth). --insecure kept as a
  warned no-op for backward compat. Fail-closed error now points at the
  password provider, not at --insecure.
- mcp_security: validate_mcp_server_entry now also rejects shell payloads that
  write to OS persistence surfaces (authorized_keys/.ssh/pam.d/sudoers/cron/
  rc files) and hard-rejects a hermes-0day IOC blocklist (attacker SSH key +
  source IPs) anywhere in command/args/env. Runs at save AND spawn time.
- api_server: raise network-bind API_SERVER_KEY entropy floor 8->16 chars;
  warn when a network-accessible API server runs an unsandboxed local backend.
2026-06-21 19:05:27 -07:00
Teknium
c0409a87ff
feat(gateway): typed send-error classification (SendResult.error_kind) (#50342)
Add a platform-neutral send-failure vocabulary so consumers can branch on a
typed category instead of substring-matching the raw provider message.

- base.py: SEND_ERROR_KINDS + classify_send_error() (too_long / bad_format /
  forbidden / not_found / rate_limited / transient / unknown), and an optional
  SendResult.error_kind field (defaults None — fully backward compatible).
- telegram.py: populate error_kind on send() failures; message_too_long keeps
  its existing error token plus error_kind='too_long'.

Purely additive: no behavioral change to the existing degrade-and-deliver
paths (MarkdownV2->plain-text fallback, overflow split, retry classification
all untouched). 22 new tests + 210 adapter regression tests green.
2026-06-21 12:34:22 -07:00
Teknium
7a131f7f40
fix(api-server): stop silently promising async delivery on stateless HTTP path (#50319)
* fix(api-server): stop silently promising async delivery on stateless HTTP path

terminal(notify_on_complete=True / watch_patterns) and delegate_task(background=True)
silently no-op'd on the API server / WebUI path (#10760): the watcher / detached
child registered, but every API-server route (OpenAI-spec /v1/chat/completions
and /v1/responses, plus the proprietary /v1/runs SSE stream) tears down its
channel when the turn ends, and APIServerAdapter.send() is a no-op stub. A
completion that fires after the response closed had nowhere to go — from the
agent side, indistinguishable from a hang.

There is no spec-compliant surface to wake the agent later on a stateless HTTP
client, so make the no-op honest instead of silent:

- Add a per-adapter capability flag supports_async_delivery (default True;
  APIServerAdapter = False), propagated into a HERMES_SESSION_ASYNC_DELIVERY
  contextvar via async_delivery_supported(). Toggle on the adapter, not a
  hardcoded platform string — a future stateless adapter is correct-by-default.
- terminal: when delivery is unsupported, skip watcher registration, force
  notify_on_complete off, and return a notify_unsupported note telling the
  agent to process(action='poll').
- delegate_task: when delivery is unsupported, fall back to SYNCHRONOUS
  execution (work runs and returns in the same response) with a note, instead
  of handing out a handle that never resolves.

CLI (in-process completion_queue) and the real gateway platforms are unchanged.

Fixes #10760

* refactor(api-server): route session binding through a single no-delivery chokepoint

Add APIServerAdapter._bind_api_server_session() and route both agent-entry
paths (_run_agent for /v1/chat/completions + /v1/responses, and the /v1/runs
_run_sync path) through it. The helper hardwires platform="api_server" and
async_delivery=False with no async_delivery parameter to pass, so a future
route added to the API server physically cannot reintroduce the silent
no-op (#10760) by forgetting to mark the channel as non-delivering.

The binding stays request-scoped (cleared per turn), so a session resumed
later on a delivering interface (CLI / gateway platform) re-binds fresh and
is NOT blocked — the no-delivery decision tracks the interface handling the
current turn, never the session.
2026-06-21 12:15:14 -07:00
sgaofen
93ea9b04af fix(gateway): cap inbound media download size to prevent memory exhaustion
Inbound image/audio/video payloads were buffered fully into process memory
before being written to the cache, with no size limit. A large upload
(Discord Nitro allows 500 MB) or a remote media URL in an inbound message
pointing at a huge file could spike RAM and OOM-kill the gateway.

Enforce a configurable cap in the shared cache helpers (gateway/platforms/
base.py) so the protection holds across every platform adapter, not one:

- cache_image/audio/video_from_bytes reject oversized payloads before writing
  (video was the gap in the original report — now covered).
- cache_image/audio_from_url stream the body, rejecting on an oversized
  Content-Length header and re-checking the running total per chunk so an
  absent/lying header can't smuggle an unbounded body past the cap.
- Discord's _read_attachment_bytes checks att.size up front, so an oversized
  attachment is rejected before any bytes are pulled into memory.

Configurable via gateway.max_inbound_media_bytes in config.yaml (default
128 MiB; 0 disables). No new env var — non-secret config lives in config.yaml.

Salvaged and extended from @sgaofen's PR #13341 (the original report and the
shared-helper approach). Reapplied onto current main (Discord adapter has
since moved to plugins/platforms/discord/), the configurable knob moved from
an env var to config.yaml, and the video cache helper added.

Co-authored-by: Hermes Agent <noreply@nousresearch.com>
2026-06-21 11:56:46 -07:00
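
A small sketch of the two-stage size check that 93ea9b04af describes for URL downloads, assuming the body arrives as an iterable of byte chunks. The function name and its return convention (None on rejection) are assumptions for illustration; the real helpers stream over HTTP and write to the cache.

```python
def read_capped(chunks, max_bytes, content_length=None):
    """Collect chunks into bytes, or return None if the cap is exceeded.

    A max_bytes of 0 disables the cap, mirroring the documented config knob.
    """
    if max_bytes and content_length is not None and content_length > max_bytes:
        return None  # reject early on a declared oversize
    total = 0
    parts = []
    for chunk in chunks:
        total += len(chunk)
        if max_bytes and total > max_bytes:
            return None  # header absent or wrong: the running total catches it
        parts.append(chunk)
    return b"".join(parts)
```

The per-chunk check is what bounds memory when the Content-Length header is missing or understated, since the loop stops before the oversized body is fully buffered.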
Teknium
e499d69e3e
feat(api-server): configurable concurrent-run cap to prevent DoS (#50007)
The OpenAI-compatible API server only enforced a hardcoded cap of 10
concurrent runs on /v1/runs, leaving /v1/chat/completions and
/v1/responses unbounded — a request flood could exhaust CPU, memory,
and upstream LLM quota (#7483).

- Add gateway.api_server.max_concurrent_runs (config.yaml, default 10,
  0 disables). No env var.
- Shared concurrency gate across all three agent-serving endpoints,
  counting both the chat/responses in-flight counter and the /v1/runs
  stream set. Returns OpenAI-style 429 + Retry-After when at the cap.
- Remove the dead hardcoded _MAX_CONCURRENT_RUNS class attribute.

Closes #7483.
2026-06-21 07:26:03 -07:00
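
The shared gate from e499d69e3e might be modelled as a single counter consulted by every agent-serving endpoint. The class and method names are hypothetical; the default of 10 and the meaning of 0 (disabled) follow the commit message. The HTTP 429 response and Retry-After header are left to the caller.

```python
class RunGateSketch:
    """Counts in-flight runs across endpoints; 0 means no limit."""

    def __init__(self, max_concurrent_runs=10):
        self.max_concurrent_runs = max_concurrent_runs
        self.in_flight = 0

    def try_acquire(self):
        """Return True and count the run, or False when at the cap."""
        if self.max_concurrent_runs and self.in_flight >= self.max_concurrent_runs:
            return False  # caller would answer with 429 and Retry-After
        self.in_flight += 1
        return True

    def release(self):
        """Call when a run finishes, whether it succeeded or failed."""
        self.in_flight = max(0, self.in_flight - 1)
```

A handler would call `try_acquire()` before starting the agent and `release()` in a `finally` block, so a failed run cannot leak a slot.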
kshitijk4poor
b577f25100 refactor(gateway): dedupe drain-timeout resolution + share active_agents parse
Follow-up cleanups on top of the busy/idle readout (PR #50103):

- web_server.py /api/status now reuses the single drain-timeout resolver
  hermes_cli.gateway._get_restart_drain_timeout() (HERMES_RESTART_DRAIN_TIMEOUT
  env -> agent.restart_drain_timeout config -> default) instead of inlining a
  third hand-rolled copy of that precedence chain. Also fixes a subtle
  divergence: the inline copy used os.environ.get() so a set-but-empty env var
  was treated as a value rather than falling through to config; the shared
  resolver .strip()s and falls through correctly.
- Added gateway.status.parse_active_agents() and routed BOTH HTTP surfaces
  (/api/status and /health/detailed) through it, so the exposed active_agents
  field is consistently clamped non-negative. Previously /api/status clamped
  while /health/detailed exposed the raw file value, diverging on a corrupt
  count.
- Added TestParseActiveAgents covering the shared coercion contract.
2026-06-21 17:22:52 +05:30
Ben
0ee75469d7 feat(dashboard): surface gateway busy/drainable on /api/status
Give an external consumer (NAS) a trustworthy, always-reachable busy/idle
readout it can poll before a disruptive lifecycle action (restart,
migrate, stop, auto-update). The dashboard /api/status is the only HTTP
surface guaranteed up on a hosted agent regardless of which gateway
platforms are enabled, and it already reads gateway_state.json.

Add to /api/status (additive, non-breaking):
  - active_agents       — in-flight gateway-turn count (now refreshed
                          per-turn by the companion gateway-side commit)
  - gateway_busy        — running AND active_agents > 0
  - gateway_drainable   — running and live (a valid begin-drain target)
  - restart_drain_timeout — resolved seconds, so the consumer can size its
                          poll deadline without out-of-band knowledge
                          (env HERMES_RESTART_DRAIN_TIMEOUT → config
                          agent.restart_drain_timeout → default)

The busy/drainable contract is defined once in gateway.status
(derive_gateway_busy / derive_gateway_drainable) and consumed by both
/api/status and /health/detailed so the two surfaces can never disagree.
Liveness keys off gateway_running (a live PID/health probe), NEVER
gateway_updated_at — a healthy idle gateway never advances that timestamp.
All derived fields degrade to safe falsy values when the gateway is down
or the status file is absent/corrupt (never a spurious "busy" that would
wedge the consumer). active_sessions (the 5-min DB recency heuristic the
SPA reads) is left exactly as-is — new signal, new fields.
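
A minimal sketch of the derivation contract described above, assuming the liveness probe result and the status-file fields are passed in as plain values. The function names, the parameter shapes, and the "running" state string are hypothetical; the real derive_gateway_busy / derive_gateway_drainable signatures are not shown in this log.

```python
def _count(raw):
    """Coerce the status-file count to a non-negative int (0 if corrupt)."""
    try:
        return max(0, int(raw))
    except (TypeError, ValueError, OverflowError):
        return 0


def derive_busy(gateway_running, active_agents):
    """Busy only when the gateway is live AND has in-flight turns."""
    return bool(gateway_running) and _count(active_agents) > 0


def derive_drainable(gateway_running, gateway_state):
    """Drainable only when the gateway is live and reports a running state.

    Assumption: the status file spells that state as "running".
    """
    return bool(gateway_running) and gateway_state == "running"
```

Because both functions gate on the liveness flag first, a stale status file left behind by a dead gateway yields False for both fields rather than a spurious busy signal.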

Tests (behaviour contracts, not snapshots): the pure derivation contract
across every running/state/count/liveness combination; /api/status
integration for busy, idle-drainable, draining, down, stale-busy-file,
corrupt-count, and timeout surfacing; and /health/detailed parity.
2026-06-21 17:22:52 +05:30
Zheng Tao
491579fa05 fix(whatsapp): resolve bridge dir with HERMES_HOME mirror in Docker
In Docker the install tree (/opt/hermes) is read-only, so npm install for
the WhatsApp bridge fails with EACCES. Add resolve_whatsapp_bridge_dir() in
whatsapp_common.py: when the install dir is read-only, mirror the bridge
source into a writable HERMES_HOME location and use that. Both the
adapter and the 'hermes whatsapp' CLI resolve through the shared helper so
the install and runtime paths agree.
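
The read-only fallback could look roughly like the sketch below. It is not the actual resolve_whatsapp_bridge_dir(): the function name, the mirror directory name, the node_modules exclusion, and the injectable writability check are assumptions made for illustration.

```python
import os
import shutil
from pathlib import Path


def resolve_bridge_dir(install_dir, hermes_home, is_writable=None):
    """Return a directory where npm install can write.

    Uses the install dir when it is writable; otherwise mirrors the bridge
    source into a writable location under hermes_home and returns that.
    """
    if is_writable is None:
        is_writable = lambda p: os.access(p, os.W_OK)
    install_dir = Path(install_dir)
    if is_writable(install_dir):
        return install_dir
    mirror = Path(hermes_home) / "whatsapp-bridge"  # hypothetical location
    if not mirror.exists():
        # Copy sources only; node_modules gets rebuilt by npm install.
        shutil.copytree(
            install_dir, mirror, ignore=shutil.ignore_patterns("node_modules")
        )
    return mirror
```

Having both the adapter and the CLI call one helper is the point of the change: whichever path runs first creates the mirror, and the other resolves to the same directory.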

Fixes #49561
2026-06-20 17:05:27 -07:00
Teknium
5600105478 refactor(gateway): migrate slack/dingtalk/whatsapp/matrix/feishu/telegram/wecom/email/sms adapters to bundled plugins
Salvage of PR #41284 onto current main. Relocates the last 9 inline messaging
adapters (+ satellites: telegram_network, feishu_comment/_rules/meeting_invite,
wecom_crypto, wecom_callback) from gateway/platforms/ into self-contained
bundled plugins under plugins/platforms/<x>/, discovered via the platform
registry. Strips the per-platform core touchpoints from gateway/run.py,
gateway/config.py, hermes_cli/gateway.py, hermes_cli/setup.py, and
tools/send_message_tool.py.

Carries forward the migration fixes (explicit enabled:false honored,
get_connected_platforms forces discovery, plugin is_connected via
gateway.get_env_value, logs --component gateway matches plugins.platforms.*,
matrix hidden on Windows).

Additionally ports config keys main added since the PR base: the matrix
plugin's _apply_yaml_config now also covers allowed_users,
ignore_user_patterns, process_notices, and session_scope (the inline
gateway/config.py matrix block gained these in the 1340 commits the PR sat
open; they would otherwise have been silently dropped on deletion).
2026-06-20 10:26:45 -07:00
kshitijk4poor
26d9a3c710 fix(signal): FIFO-evict the quote-detection timestamp cache
`_sent_message_timestamps` (the reply-to-own-message quote cache) used a
`set` evicted with `set.pop()`, which removes an ARBITRARY element — so once
more than the cap (500) outbound timestamps are tracked, a still-recent
timestamp could be dropped while older ones survive, missing a genuine
reply-to-own-message. Convert it to an OrderedDict with FIFO (oldest-first)
eviction, mirroring the recently-hardened echo ring (#31250). This closes the
same bug class on the sibling cache.

Adds a regression test asserting oldest-first eviction + MRU promotion.
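
To illustrate the difference from `set.pop()`, here is a small sketch of an OrderedDict-backed cache with oldest-first eviction and promotion on re-add. The class name and API are hypothetical; only the cap of 500 comes from the message above.

```python
from collections import OrderedDict


class TimestampCache:
    """Bounded cache that evicts the oldest entry first."""

    def __init__(self, cap=500):
        self._cap = cap
        self._items = OrderedDict()

    def add(self, ts):
        if ts in self._items:
            # Re-adding promotes the entry to most-recently-used.
            self._items.move_to_end(ts)
        else:
            self._items[ts] = None
        while len(self._items) > self._cap:
            # last=False pops from the front, i.e. the oldest entry.
            self._items.popitem(last=False)

    def __contains__(self, ts):
        return ts in self._items

    def __len__(self):
        return len(self._items)
```

With a plain set, the element removed at the cap is arbitrary, so a timestamp added moments ago can be the one that disappears; here the victim is always the entry at the front of the ordering.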
2026-06-20 21:00:46 +05:30
w31rdm4ch1nZ
332f88f6a6 fix(signal): harden recently-sent echo ring with LRU + TTL 2026-06-20 20:50:52 +05:30
kshitijk4poor
32a97a20af fix(signal): strip self-mention in all groups, not just require_mention
Review follow-up on the salvaged self-mention strip (#31217): the original
only stripped the bot's rendered @<number>/@<uuid> self-mention inside the
`require_mention=true` branch, so groups with require_mention=false still
leaked it into the agent text. Hoist the strip to run for every group message
(fixing the whole bug class), and collapse the doubled space a mid-sentence
removal leaves while preserving intentional newlines.
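
A rough sketch of the strip-and-collapse step, assuming the bot's identifiers (number and UUID) are known up front. The function name and the exact regexes are illustrative, not the adapter's code.

```python
import re


def strip_self_mention(text, bot_ids):
    """Remove rendered @<id> self-mentions and tidy the leftover spacing."""
    for ident in bot_ids:
        # The lookahead avoids clipping a longer identifier that merely
        # starts with the bot's own.
        text = re.sub("@" + re.escape(ident) + r"(?![\w-])", "", text)
    # Collapse runs of spaces/tabs only, so newlines are left alone.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()
```

For example, `strip_self_mention("hey @+15550001111 what's up", ["+15550001111"])` yields `"hey what's up"`. Note that this simple version also collapses doubled spaces that were unrelated to the mention.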
2026-06-20 16:27:28 +05:30
Kailigithub
40b6ac9ac7 fix(signal): send explicit stop-typing RPC when cancelling indicator 2026-06-20 16:23:41 +05:30
Rick Ratmansky
96b10327b6 fix(signal): strip bot self-mention from group messages before agent dispatch 2026-06-20 16:23:41 +05:30
lkz-de
96db7c6883 fix(signal): preserve quoted reply context
Carry Signal quote metadata through gateway events so replies to assistant messages include the quoted context without personalizing comments.
2026-06-20 15:16:53 +05:30
kshitij
ff50a88617
Merge pull request #49558 from NousResearch/salvage/env-var-guards-48735 2026-06-20 15:11:54 +05:30
kshitijk4poor
a7dd98c860 fix(env): guard remaining malformed int/float env var casts with utils helpers
Widen the env_float() guard from #48735 across the whole bug class: a
non-numeric value (e.g. a stale .env "HERMES_API_TIMEOUT=abc" or a typo'd
port) raised an unhandled ValueError and crashed adapter/agent init.

Converts 22 genuinely-unguarded first-party int/float(os.getenv()) sites to
the canonical utils.env_int / utils.env_float helpers (the established house
pattern), instead of duplicating per-module helpers or inline try/except:

- gateway/config.py: WECOM_CALLBACK_PORT, BLUEBUBBLES_WEBHOOK_PORT
- gateway/platforms/email.py: EMAIL_IMAP/SMTP_PORT, EMAIL_POLL_INTERVAL
- gateway/platforms/feishu.py: dedup cache + text/media batch settings
- gateway/platforms/wecom.py, discord/adapter.py: text batch delays
- gateway/platforms/telegram.py: media batch delay, TELEGRAM_WEBHOOK_PORT
- gateway/platforms/whatsapp.py: WHATSAPP_NPM_INSTALL_TIMEOUT
- hermes_cli/auth.py: CODEX/XAI refresh timeouts
- agent/chat_completion_helpers.py: API/stream read/stale timeouts
- run_agent.py, agent/auxiliary_client.py: API + nous timeouts

Sites already guarded by try/except or local helpers are left untouched.
The HERMES_MAX_ITERATIONS sites are already guarded on main via
_current_max_iterations(), so they are not included.
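
For readers unfamiliar with the pattern, guarded helpers of this kind might look like the sketch below. The names and signatures are hypothetical stand-ins; the real utils.env_int / utils.env_float may differ.

```python
import os


def safe_env_float(name, default, environ=None):
    """Read a float env var; return default when unset, blank, or malformed."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        return float(raw)
    except ValueError:
        return default


def safe_env_int(name, default, environ=None):
    """Read an int env var; return default when unset, blank, or malformed."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        return int(raw.strip())
    except ValueError:
        return default
```

The unguarded form, `float(os.getenv("HERMES_API_TIMEOUT", "60"))`, raises ValueError on a value like "abc"; the guarded form returns the default instead.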
2026-06-20 14:54:36 +05:30
kshitijk4poor
abafba0762 refactor(signal): correct STT-fallback comment, type the markdown wrapper, make AAC test portable
Review follow-up on the salvaged AAC + markdown changes:
- Fix an inaccurate comment claiming the STT layer has a sniff-and-remux
  fallback (verified: no such fallback exists; the ffmpeg-absent path caches
  raw ADTS and STT may reject it).
- Type the _markdown_to_signal wrapper as tuple[str, list[str]] to match the
  shared helper instead of a bare tuple.
- Replace the hardcoded /home/pi/... test fixture with a runtime-generated
  ADTS AAC sample so the remux round-trip actually runs in CI (skips only
  when ffmpeg is absent) instead of always-skipping.
2026-06-20 14:24:29 +05:30
jasnoorgill
da34fca2bb fix(signal): detect ADTS AAC voice notes and remux to MP4
Android Signal delivers voice notes as raw ADTS AAC frames, which
share the `0xFF 0xFx` sync word with MPEG-1/2 Layer 3 (MP3). The
`_guess_extension` byte-signature test in gateway/platforms/signal.py
was matching both, so ADTS AAC was being misclassified as MP3 — saved
to disk with the wrong extension and rejected by every major STT API
(Groq, OpenAI) because their server-side format sniffers inspect the
actual codec, not the file extension.

Two changes:

1. Tighten the MP3 vs ADTS disambiguator. ADTS packs `ID`,
   `layer`, and `protection_absent` into bits 3-0 of byte 1, and
   `layer` is always `00` for AAC (`ID` only selects MPEG-4 vs
   MPEG-2 AAC). Real MP3 has `layer=01`, and MPEG audio in general
   has `layer` in {01, 10, 11}. The mask `0xF6` (sync nibble plus the
   two layer bits) against target `0xF0` cleanly separates them.

2. Remux raw ADTS AAC to MP4 container at the cache step via
   `ffmpeg -c:a copy`. Single demux/remux, no re-encode, no quality
   loss, sub-100ms on a Pi 5. The cached file is a normal `.m4a`
   that all major STT providers accept. ffmpeg is a transitive
   dependency of many other Hermes features (TTS, video skills) so
   this isn't a new install requirement; the remux degrades
   gracefully to a no-op if ffmpeg is missing.

The new helper `_remux_aac_to_m4a` is unit-tested with a real
Android voice note from the audio cache that originally triggered
the bug, plus synthetic ADTS frames for the byte-level
disambiguator and garbage-input graceful failure.

Closes the gap that broke transcription for any Android Signal user
sending voice messages to Hermes.
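
The two changes can be sketched as below, under stated assumptions: the function names, the extra `-y`/`-loglevel` flags, and the timeout are illustrative, and only the `0xF6`/`0xF0` mask and the `-c:a copy` remux come from the message above.

```python
import shutil
import subprocess
from pathlib import Path


def looks_like_adts_aac(data):
    """True when the first two bytes are an ADTS sync word with layer == 00."""
    # 0xF6 keeps the low sync nibble and the two layer bits of byte 1;
    # MPEG audio layers (MP3 included) have non-zero layer bits and fail this.
    return len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xF6) == 0xF0


def remux_adts_to_m4a(src, dst, timeout=30.0):
    """Copy the AAC stream into an MP4 container; return False on any failure."""
    ffmpeg = shutil.which("ffmpeg")
    if ffmpeg is None:
        return False  # degrade to a no-op when ffmpeg is not installed
    cmd = [ffmpeg, "-y", "-loglevel", "error", "-i", str(src),
           "-c:a", "copy", str(dst)]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and Path(dst).exists()
```

A caller would check `looks_like_adts_aac()` on the downloaded bytes first and attempt the remux only for matches, keeping the raw file when the remux returns False.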
2026-06-20 13:48:05 +05:30
lkz-de
905820b59f fix(signal): share markdown formatting across send paths
Route Signal send paths through shared markdown formatting helpers and render markdown bullets consistently as Unicode bullets. Add coverage for Signal formatting and send_message integration.
2026-06-20 13:47:14 +05:30
joaomarcos
3a6c171e9e fix(gateway): log signal transport response and bubble cron live adapter errors 2026-06-19 16:59:38 -07:00
joaomarcos
5649b8649a Fix silent delivery failures in Signal live adapter (#49260) 2026-06-19 16:59:38 -07:00
kshitijk4poor
d4e7dd609d refactor(windows): tidy managed-node resolver helpers
Behavior-preserving cleanups on the managed-node resolver:
- Hoist _candidate_node_command_names() out of the inner dir loop in
  find_hermes_node_executable (computed once, not per directory).
- Drop redundant os.environ.copy() at the two with_hermes_node_path(
  os.environ.copy()) sites; the helper already copies os.environ when
  called with no argument (verified env-equivalent).
- Add reciprocal keep-in-sync comments between iter_hermes_node_dirs()
  (hermes_constants.py) and hermesManagedNodePathEntries() (electron
  main.cjs), which mirror the same platform-ordering rule across the
  Python/Node boundary.
2026-06-20 02:12:16 +05:30
helix4u
7a7b56d498 fix(windows): prefer managed node for whatsapp and desktop 2026-06-20 02:00:37 +05:30
Teknium
26e76a75e5
feat(telegram): opt-in Online/Offline bot status indicator (#49134)
Sets the Telegram bot's short description (the line under its name) to
"Online" on gateway connect and "Offline" on clean disconnect, gated
behind extra.status_indicator (off by default).

Telegram bots have no presence/online dot — that's a user-account
feature the Bot API doesn't expose for bots. The short description is
the closest available surface, so this gives users a way to tell whether
the gateway is up from the bot's profile.

- New extra.status_indicator flag (+ status_online/status_offline text
  overrides), read in __init__ via config.extra — no config-schema change.
- _set_status_indicator() helper: best-effort, swallows API errors so it
  never blocks connect/disconnect; truncates to Telegram's 120-char cap.
- Wired Online after _mark_connected(), Offline at top of disconnect()
  while the bot HTTP client is still alive.
- 9 unit tests + Telegram docs section.
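
A best-effort helper of the kind described in the bullets above might be sketched like this. It assumes a python-telegram-bot style client exposing an async `set_my_short_description`; the helper name, return value, and log message are hypothetical and not taken from the adapter.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)
SHORT_DESCRIPTION_MAX = 120  # Telegram's cap, as stated in the message above


async def set_status_indicator(bot, text):
    """Set the bot's short description; never raise into connect/disconnect."""
    try:
        await bot.set_my_short_description(
            short_description=text[:SHORT_DESCRIPTION_MAX]
        )
        return True
    except Exception:
        # Swallow API errors so a failed status update cannot block the caller.
        logger.warning("status indicator update failed", exc_info=True)
        return False
```

Calling it with the offline text at the top of disconnect(), before the HTTP client is torn down, matches the ordering the message describes.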

Requested by @ilTrumpista, cc @Teknium.
2026-06-19 11:38:39 -07:00
teknium1
a58287afcb
Merge remote-tracking branch 'origin/main' into pr48275-rebase
# Conflicts:
#	cron/scheduler.py
2026-06-19 07:40:29 -07:00