Commit graph

2292 commits

Author SHA1 Message Date
fayenix
d6c53dcdcb fix(gateway): stop per-turn agent-cache eviction from model + message_id signature churn
Two independent bugs evicted the cached gateway AIAgent on every turn,
preventing the prompt cache from ever warming:

1. Model normalization mismatch: the post-run fallback-eviction check
   compared _agent.model (stripped in AIAgent.__init__) against the raw
   _resolve_gateway_model() config string. For vendor-prefixed config on
   native providers (e.g. 'deepseek/deepseek-v4-pro' vs 'deepseek-v4-pro')
   this was always unequal, so the agent was evicted after every
   successful run. Normalize _cfg_model the same way (skip aggregators).

2. Discord triggering message_id leaked into the cached system prompt via
   build_session_context_prompt()'s Discord IDs block. message_id changes
   every turn, so the agent-cache signature (computed from the ephemeral
   prompt) changed every Discord turn -> rebuild every message. The id is
   now injected per-turn into the user message (where per-turn content
   belongs and does not touch the cache signature); the cached IDs block
   carries a static pointer to it, preserving reply/react/pin via the
   discord tools.

Adapted from #28846. Bug #1 fix is the contributor's; bug #2 reworked to
be non-destructive (keeps the triggering-id capability instead of deleting
it). Redundant auto-reset eviction (already on main via #9893/#48031) and
the wrong-premise reset_context_note plumbing from the original PR were
dropped.

Co-authored-by: Hermes Agent <hermes@nousresearch.com>
2026-06-30 04:22:41 -07:00
jasonQin6
6dd188d786 fix(gateway): add session staleness guard to stream consumer
GatewayStreamConsumer.run() processed queued deltas in an infinite loop
with no check on whether the session was still current. On /new or /stop
mid-stream, the consumer kept editing and delivering stale response
fragments alongside the 'Session reset!' ack.

PR #11016 (b7bdf32d) fixed the runner side via sentinel promotion/release
but left the stream consumer unguarded. Every other async callback in
run.py already bails via _run_still_current(); the stream consumer was
the only one missing it.

- stream_consumer.py: optional run_still_current callback, checked at the
  top of the run() loop; returns early when the session is stale.
- run.py: pass the existing _run_still_current closure at both call sites
  (proxy path and agent path).
- tests: TestRunStillCurrentGuard — immediate staleness, mid-stream
  staleness, always-current, no-callback default, pending-finish.

Co-authored-by: jasonQin6 <39369769+jasonQin6@users.noreply.github.com>
2026-06-30 03:42:25 -07:00
keiravoss94
84f350efe0 feat(whatsapp): opt-in forwarding of owner-typed messages in bot mode
In `WHATSAPP_MODE=bot` the bridge currently drops every fromMe inbound
message — they are all assumed to be echoes of our own /send calls.
That makes it impossible for plugins / agents to detect when a human
owner has typed directly into a customer chat from the same WhatsApp
Business account (e.g. via a linked phone or WhatsApp Web).

This adds an opt-in `WHATSAPP_FORWARD_OWNER_MESSAGES` env var.  When
true, the bridge classifies fromMe inbound by looking up `key.id` in a
bounded LRU of recently-sent message IDs (the existing 50-entry echo
suppressor, bumped to 512 and extracted to a testable
`outbound_ids.js` helper).  Hits in the LRU are still dropped (echoes);
misses are forwarded to the Python adapter with `fromOwner: true`.

The Python adapter lifts that flag onto
`MessageEvent.metadata["whatsapp_from_owner"]`.  `metadata` is a new
free-form dict on the event so future per-platform signals don't each
need their own field.  Default behaviour is unchanged: with the env
flag unset, bot mode still drops every fromMe message exactly as
before.

Use cases for downstream consumers:
- Implicit handover activation when the owner replies manually
- Sliding TTL on owner activity (keep an active session alive while
  the owner is engaged)
- Audit trails of owner interventions
- Analytics on human-vs-bot reply ratios

Heuristic limitation (documented in code): the LRU is in-memory.  After
a bridge restart, in-flight delivery receipts of pre-restart sends will
briefly look like owner-typed for a few seconds until the set is
repopulated.  Persisting isn't worth the disk churn — downstream
consumers should treat the flag as best-effort.

Tests:
- tests/gateway/test_whatsapp_from_owner.py (new): adapter sets the
  metadata flag iff the bridge payload has `fromOwner: true`; absent
  otherwise.
- scripts/whatsapp-bridge/outbound_ids.test.mjs (new): LRU bounds,
  eviction order, falsy-id handling.

Backwards compatibility: with the env flag unset, every code path is
identical to before.  No existing deployment is affected.
2026-06-30 03:41:43 -07:00
UgwujaGeorge
cb9d18c759 fix(gateway): stop media-send fallbacks from leaking host paths into chat
The base BasePlatformAdapter implementations of send_voice, send_video,
send_document, and send_image_file forwarded their *_path argument
verbatim into the chat text (e.g. "🎬 Video: /home/.../hermes/cache/...").
Telegram, Discord, and Slack adapters all fall back to those base methods
when their native send raises — so a rejected video on Telegram surfaced
the host filesystem layout to the user instead of a useful message.

Replace the path-echo with a friendly notice, log the path for operator
diagnostics, and keep the user-supplied caption intact. The Slack adapter
had three identical sites that fell through to the same path-echo on its
own native upload failures; fix those too. send_document still surfaces
the caller-provided file_name (or the basename derived from it) since
that is the user-facing filename, not a host path.

Add regression tests asserting the *_path argument never appears in the
fallback content while caption text and explicit file_name still do.
2026-06-30 03:24:36 -07:00
Disaster-Terminator
f4a54b6292 fix(gateway): abort startup during restart 2026-06-30 03:22:18 -07:00
David Gutowsky
3a83b6bc5d fix(gateway): self-heal stale sessions.json routing at message time
Detect a routing key whose session is already ended in state.db
(end_reason set) inside get_or_create_session and drop the stale entry
instead of silently routing the message into a closed session.

Previously the only runtime cleanup of sessions.json was the startup
_prune_stale_sessions_locked (#52808/#54138), which requires a restart.
A session ended while the gateway stays alive — any path that finalizes
the DB row without clearing sessions.json — left a live routing key
pointing at a closed session. get_or_create_session never consulted
end_reason, so it returned that stale entry and every subsequent message
was silently dropped (no log, no error, no response) until the next
restart. This is the live-gateway variant of #52804/FM9, which needed an
actual gateway crash.

The guard drops the stale entry and falls through to
_recover_session_from_db, which reopens agent_close-ended rows and
resumes the SAME session_id (transcript preserved); if the row ended for
a non-recoverable reason (e.g. /new) it correctly starts a fresh
session. A warning is logged so the event is visible (the field
incident reported zero log output).

Adds tests/gateway/test_session_store_runtime_stale_guard.py covering
the _is_session_ended_in_db helper and the end-to-end routing self-heal
(recover-vs-fresh, live-entry untouched, stale-wins-over-suspended,
force_new short-circuit).

Closes #54878.

Co-authored-by: David Gutowsky <david.gutowsky@gmail.com>
2026-06-30 13:17:51 +05:30
Ben Barclay
05ac16778b feat(gateway): per-platform typing_indicator toggle
Add a generic per-platform PlatformConfig.typing_indicator flag (default
True) that gates the _keep_typing refresh loop in
_process_message_background. When false, the loop is never spawned, so no
typing/"is thinking…" status is shown on that platform — message delivery
is otherwise unchanged.

Mirrors the gateway_restart_notification contract exactly: dataclass field
+ to_dict/from_dict (with extra-fallback resolution) + shared-key bridge in
load_gateway_config, so 'slack: typing_indicator: false' under platforms
works without a separate block. Generic by design — the same key works for
every platform (Slack 'is thinking…', Telegram/Discord/Signal typing).

Motivated by users who find Slack's assistant 'is thinking…' status noisy
(it also briefly disables the compose box, via the Assistant API).
2026-06-29 21:12:57 -07:00
Teknium
6aefc9d925
feat(gateway): show per-category context breakdown in /usage (#55204)
Channel users get the same context split the desktop popover shows
(PR #54907) — system prompt, tools, rules, skills, MCP, subagents,
memory, conversation — under the existing Context line in /usage.

Reuses agent.context_breakdown.compute_session_context_breakdown, so
there is no new tool and no new engine. The slices are estimates
(chars/4) and the block is labelled _(estimated)_; the headline
Context line keeps using the provider-measured last_prompt_tokens.
Rendering is fail-open: any engine error returns no breakdown and the
rest of /usage is unaffected.

- gateway/slash_commands.py: _context_breakdown_lines() helper + wire
  into _handle_usage_command
- locales/*.yaml: breakdown_header, breakdown_line, and 8 category
  labels across all 16 locales (parity gate)
- tests/gateway/test_usage_command.py: render + fail-open coverage
2026-06-29 20:42:19 -07:00
Teknium
481caa66f2
feat(display): friendly human-phrased tool labels for built-in tools (#55166)
* feat(display): friendly human-phrased tool labels for built-in tools

Built-in tools now render ChatGPT-style status verbs ('Searching the web
for ...', 'Reading <file>', 'Browsing <url>') on the CLI spinner and
gateway/desktop tool-progress instead of the raw tool name.

- agent/display.py: _TOOL_VERBS map + build_tool_label() + set/get
  friendly-labels flag (default on). Custom/plugin/MCP tools fall back to
  the raw preview; verbose gateway mode left untouched (debug surface).
- tool_executor.py / tui_gateway / gateway: route the three spinner sites,
  the TUI _tool_ctx, and the gateway all/new progress line through the label.
- config: display.friendly_tool_labels (default True, per-platform aware).

Zero new core tool / schema footprint — pure display layer.

* docs: add PR infographic for friendly tool labels

* fix(display): preserve arg preview in gateway friendly labels + update tests

The first gateway pass re-derived the label from the callback's `args`, which
is empty ({}) at the gateway tool.started callsite — the command/query lives in
the `preview` string, so terminal rendered as a bare '💻 Running' and dedup
collapsed consecutive commands. Now the gateway prefixes the verb onto the
already-computed preview via get_tool_verb/tool_verb_connector/verb_drops_preview,
preserving the command/url/query. CLI spinner path (real args) keeps build_tool_label.

Tests: update test_run_progress_topics exact-format assertions to the friendly
form ('💻 Running pwd'), add a format-agnostic preview extractor for the
truncation tests (works for both quoted-legacy and verb-prefixed output).

* test(tui): update resume-display context to friendly tool label

_tool_ctx now uses build_tool_label, so the desktop resume-view context for a
search_files turn reads 'Searching files for resume' instead of the bare
'resume' preview — consistent with live tool-progress. Update the assertion.

* test(tui): harden no-race worker test against sibling shard leakage

test_session_create_no_race_keeps_worker_alive flaked under -j 8: a daemon
build thread leaked from a prior session.create test in the same shard process
fires close/unregister against its own (foreign) session_key after this test
patches the global approval hooks, polluting the captured lists. Scope the
assertions to this session's own session_key so the regression intent
(this session's worker/notify must survive) is preserved while the test
becomes immune to shard composition. Not related to friendly-tool-labels.
2026-06-29 20:31:17 -07:00
Ben Barclay
3a55f66602
refactor(relay): adopt scope_id wire key (guild_id → scope_id dual-read/write) (#55289)
Gateway half of relay-platform-parity Phase 2.5 (D-Q2.5). The relay wire's
platform-neutral scope discriminator is renamed guild_id → scope_id; this is the
hermes-agent side of the cross-repo wire-compatible migration.

- SessionSource: scope_id is canonical; guild_id kept as @deprecated alias.
  __post_init__ mirrors the two so all existing SessionSource(guild_id=...)
  constructors across native adapters keep working unchanged. to_dict dual-WRITES
  scope_id+guild_id; from_dict dual-READS scope_id ?? guild_id.
- relay/adapter.py: capture + outbound metadata dual-read/write scope_id.
- relay/ws_transport.py: _frame_to_event dual-reads scope_id ?? guild_id.
- docs/relay-connector-contract.md: document scope_id (canonical) + guild_id
  (deprecated alias) in the §3 SessionSource field table (conformance test).

250 relay+session+contract tests green. Solo lane (relay).
2026-06-30 11:16:53 +10:00
yoniebans
6735162531 fix(gateway): offload the Telegram topic-recovery helper tree off the loop
The topic-mode helpers (_telegram_topic_mode_enabled,
_recover_telegram_topic_thread_id, _record/_sync_telegram_topic_binding,
_is_telegram_topic_lane/_root_lobby, _normalize_source_for_session_key,
_telegram_topic_new_header, _schedule_telegram_topic_title_rename, and the
base.py _apply_topic_recovery hook) each run a synchronous SessionDB read or
write. They reach the event loop through async handlers, so a contended
state.db froze the loop the same way the handoff watcher did.

These helpers already run off-loop in the run_sync thread-pool closure, so
they are proven thread-safe there. Rather than colour them async, loop-side
callers now invoke them via asyncio.to_thread(...); the executor callers are
unchanged. Inside the helpers the SessionDB handle is unwrapped to the sync
door (getattr(db, '_db', db)) since they always run on a worker thread, and
AIAgent construction + query_session_listing are handed the sync SessionDB
directly. base.py wraps its single _apply_topic_recovery call in to_thread.

The guard is now alias-aware (catches db = getattr(self, '_session_db', None);
db.method(...)) and enforces the offload contract: the offloaded sync helpers
may never be called bare on the loop. Sibling test fixtures wrap their injected
SessionDB in AsyncSessionDB to match how the gateway holds it.
2026-06-29 15:51:57 -07:00
yoniebans
0a997aabbc fix(gateway): route aliased SessionDB calls through AsyncSessionDB
The migration's call-site sweep keyed on the literal self._session_db.
spelling and missed calls bound to a local first
(db = getattr(self, '_session_db', None); db.method(...)). Convert the
three in async contexts: get_telegram_topic_binding in the topic-rename
coroutine, and the two update_session_model sites on the model-switch path.
2026-06-29 15:51:57 -07:00
yoniebans
0896facce8 fix(gateway): route SessionDB calls through AsyncSessionDB 2026-06-29 15:51:57 -07:00
Teknium
290fa7fd2b
fix(gateway): skip confirmed-dead delivery targets (deleted groups, blocked bots) (#55115)
* fix(gateway): skip confirmed-dead delivery targets (deleted groups, blocked bots)

A deleted Telegram group, kicked/blocked bot, or deactivated user keeps
throwing Forbidden/not_found on every cron tick and fan-out delivery. Each
retry burns a send against the platform's flood-control envelope and spams
the logs, making the whole session feel broken even when the model call
completed.

Add a small persistent DeadTargetRegistry (per-profile JSON under
HERMES_HOME) that records a target the moment a send reports a whole-chat
death (forbidden / chat-level not_found), and have DeliveryRouter.deliver()
short-circuit it on subsequent attempts. Self-healing: any successful send
clears the flag, so a user re-adding the bot recovers with no manual cleanup.
Thread/topic-level not_found is NOT recorded (adapters already self-heal that
by retrying without reply_to). Transient/timeout errors are never marked dead.

* infographic: dead delivery target skipping
2026-06-29 13:23:29 -07:00
Ben Barclay
b963d3238b
feat(gateway): suppress home-channel shutdown broadcast on flagged drains (#54824)
Add a generic suppress_notification flag to the drain-request marker. When a
drain that ends in process exit (e.g. a NAS auto-update image migration on the
always-on Hermes Cloud fleet) is flagged, the gateway skips ONLY the
home-channel 'gateway shutting down' broadcast — the operator-flavoured ping
that would otherwise fire on every routine auto-update, dozens of times a day.

The per-active-session interrupt ping is ALWAYS kept: on a drained shutdown
it's empty by construction, and in the force-interrupt (deadline-exceeded) case
it carries the user-valuable 'your task was cut off, message me to resume' hint.

The gateway stays agnostic about WHY a drain is quiet (generic boolean, not a
kind enum); the policy of which drain causes set the flag lives in the caller
(NAS). Default-false so legacy/operator drains behave exactly as before. The
reader reuses the NS-570 epoch-staleness check so an orphaned marker on the
durable volume can never silence a fresh gateway's legitimate broadcast.

- drain_control.py: write_drain_request gains suppress_notification; new
  drain_notification_suppressed() reader (current-epoch + truthy flag).
- web_server.py: /api/gateway/drain reads + echoes the flag.
- run.py: _notify_active_sessions_of_shutdown skips the home-channel loop only.

Tests prove: flag round-trips; home-channel suppressed when set, kept when
unset; active-session ping always fires; stale/legacy/corrupt markers never
suppress.
2026-06-29 12:18:11 -07:00
Teknium
dbad6d47d3 fix(gateway): also neutralize untrusted Matrix room name in prompt
Widen #5961's _format_untrusted_prompt_value coverage to the Matrix
room display name (**Matrix Room:**), a sibling attacker-controllable
field the original fix missed. chat_name is user-settable, so an
injected room name could render as literal markdown in the system
prompt. Adds a regression test.
2026-06-29 04:25:51 -07:00
Xowiek
09666ceb76 fix(gateway): neutralize untrusted session metadata in prompts 2026-06-29 04:25:51 -07:00
teknium1
ea1372d2af fix(security): wire session-id sanitizer into artifact paths + API boundary
Defense-in-depth on top of _safe_session_filename_component (#5958):

Sink (makes the bad write impossible regardless of entry point):
- run_agent._save_session_log: sanitize session_id before building the
  session_{sid}.json snapshot path.
- agent_runtime_helpers.dump_api_request_debug: sanitize before building
  the request_dump_{sid}_{ts}.json path.

Boundary (clean 400 instead of a silently-hashed filename):
- api_server rejects path-traversal-shaped X-Hermes-Session-Id on the
  session-continuation path and the explicit /api/sessions create path,
  reusing gateway.session._is_path_unsafe (mirrors the native gateway's
  entry-boundary guard). Also enforces the session-header length cap on
  the continuation path.

Tests: traversal session_id stays contained at the write site; sanitizer
always yields a traversal-free segment; the API header rejects
../, absolute, and Windows-traversal IDs with 400.
2026-06-29 04:25:45 -07:00
Mibayy
0fe9755016 fix(gateway): use last_prompt_tokens for session-reset activity check
reset_had_activity gated on entry.total_tokens, which is never written
(token counts migrated to agent-direct persistence) so it was always 0.
That suppressed session-reset notifications for sessions that genuinely
had activity. Switch to last_prompt_tokens, which is updated on every
turn.
2026-06-29 04:25:37 -07:00
sgaofen
194bff0687 fix(gateway): confirm final delivery before suppressing send
Fixes #14238. During a compression/session split at the response
boundary, the interim callback delivered unrelated commentary, setting
response_previewed=True. The suppression logic treated that as proof the
final reply had been delivered and skipped the normal send — the response
was persisted to the child session but never sent to chat.

Only suppress the normal final send when the stream consumer confirms
final delivery (final_response_sent / final_content_delivered) or the
exact final response text was delivered as a preview.
2026-06-29 02:37:11 -07:00
teknium1
0b733a8418 test(gateway): pin auto-reset cached-agent eviction (#10710)
Relocate marco0158's eviction into the dedicated auto-reset cleanup block
(single source of truth for dropping session-scoped transient state) and
add an AST invariant pinning _evict_cached_agent into that block. Add
AUTHOR_MAP entry for marco0158.
2026-06-28 22:35:17 -07:00
marco0158
b4300f2d96 fix(gateway): evict cached agent on auto-reset to prevent stale context summary leak
When a session is auto-reset by daily schedule, idle timeout, or suspended
state, the agent cache was not being cleared. This caused the old agent's
context_compressor._previous_summary to leak into the new session, mixing
old conversation history into new compaction summaries.

This was the root cause of the "skin making history" appearing after
compaction in fresh sessions reported by the user.

Follow-up to #9893 which only handled compression_exhausted case.

Changes:
- Add _evict_cached_agent(session_key) call after was_auto_reset check
- Covers daily, idle, and suspended auto-reset scenarios
- Matches the behavior of manual /reset command

Related tests: test_session_boundary_hooks, test_async_memory_flush,
test_session_reset_notify, test_session_reset_fix - all passing.
2026-06-28 22:35:17 -07:00
Junass1
61a4526ac7 fix(gateway): clear session-scoped model overrides on /resume
/resume is a conversation boundary, but unlike /new it did not clear the
chat-keyed _session_model_overrides / _pending_model_notes. A /model switch
made in the previous session under the same chat session_key leaked into the
resumed conversation, running it on the wrong model.

Clear both maps for the session_key after the switch (mirroring /new), scoped
to that key so other chats' overrides are untouched. The cached-agent eviction
this leak also implied already landed via #6672.

Closes #10702.
2026-06-28 22:35:12 -07:00
aaronagent
27ddd8fd80 fix(gateway): sanitize agent error messages, validate webhook gh args
Two of the three fixes from PR #6660 (the cli.py reopen_session change is
moot — that raw _conn.execute reopen block no longer exists on main).

- gateway/run.py: stop sending raw type(e).__name__ and str(e)[:300] to
  end users on chat platforms. Exception text from LLM providers can leak
  API URLs, file paths, and partial credentials. Return a generic message;
  keep curated status hints for known HTTP codes; full detail stays in logs.
- gateway/platforms/webhook.py: validate pr_number (positive int) and repo
  (owner/name regex) before passing to the 'gh pr comment' subprocess.
  Payload-controlled values could otherwise inject gh flags (--help, a
  different --repo). List-form subprocess means this is arg injection, not
  shell injection, but validation is still correct.

Co-authored-by: aaronagent <1115117931@qq.com>
2026-06-28 18:53:26 -07:00
Teknium
f1cbe4308f
fix(gateway): log error-notification failures instead of silently swallowing (#54472)
* fix(gateway): log error-notification failures instead of silently swallowing

The last-resort exception handler in _process_message_background() that
sends an error notice to the user caught all exceptions with a bare pass,
leaving zero trace when the notification itself failed. Upgrade to
logger.error(..., exc_info=True) so a failed error-notification send is
debuggable post-mortem.

Salvaged from #6499 by @BongSuCHOI (the logging-upgrade portion only).

* docs: add PR infographic for gateway error-notify logging
2026-06-28 18:52:51 -07:00
Teknium
d65468e7ff
fix(security): SSRF guard yuanbao media download_url (#54470)
yuanbao_media.download_url() fetched model-supplied (outbound) and inbound
image/file URLs server-side via httpx with follow_redirects=True and no
SSRF check. A model response containing <img src="http://169.254.169.254/...">
routed through ImageUrlHandler -> download_url and would fetch cloud-metadata
endpoints; same for inbound media.

Add an is_safe_url() pre-flight plus an async redirect event-hook that
re-validates every 30x target, matching the cache_image_from_url() guard in
gateway/platforms/base.py. The other gateway adapters already guard their
URL-fetch paths; this was the remaining unguarded one.
2026-06-28 15:29:59 -07:00
Teknium
95f2919f91
perf(startup): lazy-load gateway platform adapters (#54448)
Bundled platform plugins (telegram, discord, feishu, teams, ...) were
eagerly imported at plugin-discovery time on every `hermes` invocation,
including plain `hermes chat` which never touches a gateway platform.
Their modules import heavy platform SDKs at module level (lark_oapi,
microsoft_teams, discord.py, slack_bolt, ...) — feishu alone pulled in
lark_oapi (~2.6s), teams pulled microsoft_teams (~1.9s).

Discovery now registers a cheap deferred loader per platform in the
platform_registry; the adapter module is imported only when the gateway
/ cron / setup / send_message path actually asks for that platform.
is_registered() and the iterate-all accessors stay correct (deferred
counts as registered; plugin_entries()/all_entries() materialize all
deferred loaders, since those paths genuinely need every adapter).

Cold start: ~4.4s -> ~2.45s to banner. discover_and_load: 2.0s -> 0.3s
(warm), and the heavy SDKs are no longer imported at all in CLI mode.
Every shipped platform remains available out of the box — it just loads
on first use.
2026-06-28 15:11:59 -07:00
Teknium
86e64900b9
fix(gateway): preserve sessions across restarts (#54442) 2026-06-28 15:10:39 -07:00
Teknium
9a0010fd46
fix(windows): cover remaining console-flash spawn legs (#54417) 2026-06-28 13:49:08 -07:00
Teknium
cb982ad997 fix(windows): hide console-window flash on backend git/gh/wmic/bash subprocess spawns
The Windows desktop GUI runs its backend headless via pythonw.exe. Several
auxiliary subprocess sites that run inside that windowless backend spawned
console-subsystem children (git, gh, wmic, powershell, bash, rg, taskkill)
WITHOUT CREATE_NO_WINDOW, so Windows allocated a fresh conhost per call and
flashed a black window on screen — sometimes continuously (the dashboard
Projects-tree git probe alone fired ~118 spawns in 60s on startup).

The terminal tool, cron, browser, code_execution, and gateway-spawn paths
already carry windows_hide_flags(); these auxiliary probe/scan/launcher legs
were missed. Wire the existing helper into them:

- tui_gateway/git_probe.py: run_git (+ encoding=utf-8/errors=replace, fixes the
  cp950 UnicodeDecodeError on CJK paths from the same site)
- agent/coding_context.py: _git (per-turn git status/log/diff)
- agent/context_references.py: _run_git + _rg_files (@file/@ref resolution)
- hermes_cli/copilot_auth.py: gh auth token probe (auxiliary provider:auto)
- hermes_cli/gateway.py: wmic + PowerShell Get-CimInstance PID scan
- hermes_cli/main.py: wmic stale-dashboard PID scan
- gateway/status.py: taskkill /T /F force-kill

windows_hide_flags() returns 0 on POSIX, so every changed call is a no-op on
Linux/macOS (verified: real git/rg probes still work; Windows-simulated calls
all pass creationflags=CREATE_NO_WINDOW).

Scoped to the windowless-backend paths that cause the reported flashing. The
Electron updater-handoff leg (main.cjs windowsHide:false) and the
interactive-CLI banner probes (cli.py) are intentionally NOT touched here —
the former needs a Windows-tested change of its own, the latter runs in a
visible console anyway.

Tracking: #54220
Refs: #53178 #53631 #53781 #53957 #49602 #52982 #53424 #53053 #53016
2026-06-28 05:28:45 -07:00
liuhao1024
9d919daf44 fix(gateway): mark platform lock failure as retryable instead of permanently fatal
When a stale lock file survives a gateway crash, `acquire_scoped_lock()`
may return `(False, existing_dict)` even after detecting and deleting
the stale lock (e.g. if unlink fails or a race condition occurs).

Previously, `_acquire_platform_lock()` called
`_set_fatal_error(..., retryable=False)`, which permanently killed the
platform — the reconnect watcher never retries a non-retryable fatal
error.

Change to `retryable=True` so the platform enters the "retrying"
state and the reconnect watcher can attempt acquisition again after the
standard backoff delay.

Fixes #54167
2026-06-28 04:35:37 -07:00
tymrtn
d7f655f370 fix: accept typed clarify choice replies 2026-06-28 04:13:19 -07:00
teknium1
9bb5a809b5 fix(gateway): make zombie check defensive against partial psutil stubs
The zombie status probe referenced psutil.Process/NoSuchProcess/Error
unconditionally, which raised AttributeError when psutil is a partial
stub that only defines pid_exists (as in test_windows_native_support's
fallback tests). Guard the probe so any failure to read status degrades
to the authoritative pid_exists() instead of raising.
2026-06-28 04:11:14 -07:00
MorAlekss
acca526286 fix(gateway): treat zombie PIDs as dead in _pid_exists to unblock --replace (closes #42126)
Under systemd Restart=always, the old gateway becomes a zombie (in the
process table, awaiting reap) when the replacement starts. _pid_exists()
reported the zombie as alive, so --replace waited on a PID that never
dies, then aborted with exit 1 — a silent crash loop. Standalone runs are
unaffected because nothing respawns the gateway into a zombie.

The live path is psutil.pid_exists(), which returns True for zombies, so
the check is added there (Process.status() == STATUS_ZOMBIE -> dead). The
psutil-less POSIX fallback also reads /proc/<pid>/stat (state Z) with a ps
state= fallback for macOS/BSD, before the os.kill(pid, 0) liveness probe.

Diagnosis and the /proc + ps POSIX fallback by MorAlekss (PR #44898);
extended to cover the psutil hot path so the fix applies on normal installs.

Co-authored-by: MorAlekss <mor.aleksandr@yahoo.com>
2026-06-28 04:11:14 -07:00
Teknium
c1c179a239
fix(security): redact secrets in background process + foreground env-dump output (#43025) (#54149)
* fix(security): redact secrets in background process + foreground env-dump output

Terminal-output redaction was incomplete (#43025):

- Gap 1: process(action=poll/log/wait) returned background stdout verbatim —
  no redaction at all. A background printenv/server/test emitting a key leaked
  raw to the model, session.db, and CLI display. Same for the gateway
  background-process watcher's completion/progress notifications.
- Gap 2: the foreground terminal path hardcoded code_file=True, which skips the
  ENV-assignment pass, so an opaque token (no vendor prefix) from env/printenv
  leaked even there.

Adds agent.redact.redact_terminal_output(output, command) as the single policy
for ALL terminal-output surfaces: env-dump commands (env/printenv/set/export/
declare) get the ENV-assignment pass (code_file=False) to mask opaque tokens;
other commands stay on code_file=True to avoid false positives on source dumps.
Wired into terminal_tool, process_registry (_handle_process boundary), and the
gateway watcher. Respects security.redact_secrets (no force) — opt-out preserved.

* docs: add infographic for #43025 terminal-output redaction fix
2026-06-28 02:44:21 -07:00
teknium1
9844243b18 fix(gateway): gate quick_commands through slash access policy
Config-backed quick_commands bypassed the admin-only slash gate. The
early gate in _handle_message only fires for registry-known commands
(is_gateway_known_command), but quick_commands are never in the gateway
registry, so they reached the type:exec dispatch sink unchecked. An
allowlisted non-admin gateway user could invoke admin-only quick
commands — including shell exec in the gateway process — even when the
operator set allow_admin_from / user_allowed_commands to lock them out.

Apply _check_slash_access(source, command) at the quick_commands
dispatch site (the single exec chokepoint, cold-path only) using the
raw typed name. Admins and users with the command in
user_allowed_commands still run it; backward-compat (no policy set)
is unaffected.

Fixes #44727.

Co-authored-by: maxpetrusenko <max.petrusenko.agent@gmail.com>
Co-authored-by: zapabob <1920071390@campus.ouj.ac.jp>
2026-06-28 02:43:23 -07:00
Teknium
00d8c2c915 fix(gateway): prune stale sessions.json entries on startup
A hard gateway crash (exit code 1) skips the graceful shutdown path, so
sessions.json is never cleared and is left pointing at sessions already
ended in state.db. On the next startup get_or_create_session() reuses
those stale entries as long as the time/policy reset checks pass — it
never consults end_reason — so every incoming message is silently routed
into a closed session, with no log or error (#52804).

SessionStore._ensure_loaded_locked() now calls a new
_prune_stale_sessions_locked() that drops any entry whose session_id has
end_reason IS NOT NULL in state.db. Idempotent, _db=None / legacy-absent
safe, DB errors non-fatal, sessions.json rewritten only when something
was pruned. Self-heals into a fresh session on the next message.

Reported and diagnosed by @terry197913 (#52808).
2026-06-28 02:41:47 -07:00
teknium1
ea5aaa7a22 fix(gateway): offload remaining inline agent cleanup off the event loop (#53175)
#35994 moved /new reset cleanup off the loop, but _cleanup_agent_resources
(agent.close() subprocess teardown; shutdown_memory_provider() plugin IO) was
still called INLINE on the event loop from three other sites:

  - _session_expiry_watcher (5-min idle sweep) — live loop
  - _handle_message_with_agent cache-hygiene re-eviction — live loop
  - _finalize_shutdown_agents / stop() idle-cache loop — shutdown

A wedged memory provider on any of these froze the loop: bot goes silent,
runtime-status updated_at heartbeat stops advancing, and SIGTERM can't be
serviced (requires kill -9) — exactly the #53175 zombie pattern.

Adds _cleanup_agent_resources_off_loop: a bounded (30s) worker-thread offload
mirroring the #35994 reset fix, and routes all four sites through it.
2026-06-28 02:41:36 -07:00
LeonSGP43
9f0e64cedd fix(gateway): force exit after graceful shutdown
Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-06-28 02:34:23 -07:00
teknium1
c23f394eb8 fix: satisfy ruff encoding + windows-footgun lints for cgroup reaper
- read_text(encoding='utf-8') (PLW1514)
- # windows-footgun: ok on signal.SIGKILL — module is Linux-only (reads
  /proc, /sys/fs/cgroup; runs from a systemd unit)
- test lambda accepts the new encoding kwarg
2026-06-28 02:05:50 -07:00
PRATHAMESH75
e551da6ddb fix(gateway): reap cgroup orphans via ExecStopPost to unblock restart
Long-lived helpers spawned indirectly by tool calls (adb, platform
bridges) were left in the service cgroup after the gateway's main
process exited. When the kernel rejected the deferred cgroup-wide kill
with EINVAL, systemd blocked Restart=always for 6+ minutes, taking
down all platforms and cron windows (#37454).

Add a small ExecStopPost helper (gateway.cgroup_cleanup) that walks
cgroup.procs and sends per-PID SIGKILLs — a different kernel code path
than cgroup.kill, so it succeeds where the cgroup-wide write failed.
KillMode=mixed is preserved so the gateway still reaps its own
tool-call children before systemd intervenes (#8202).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-28 02:05:50 -07:00
teknium1
58c36b1798 fix(api-server): widen error redaction to cron-endpoint + SSE sites
Follow-up to the salvaged #37733 fix. The contributor centralized
redaction at _openai_error and the chat/responses failure paths, which
covers the OpenAI-compatible envelopes transitively. Two sibling classes
crossed the same authenticated HTTP boundary unredacted:

- 8x cron-management endpoints returning {"error": str(e)} on 500
- the session-chat SSE error event ({"message": str(exc)})

Route both through the same _redact_api_error_text(force=True) helper.
Add AUTHOR_MAP entry for coygeek and a TestRedactApiErrorText guard
covering mask/force/limit/passthrough behavior.
2026-06-28 02:05:38 -07:00
Coy Geek
5e774de76e fix(api-server): redact provider errors at HTTP boundary
Force API-server error text through the existing secret redactor before returning OpenAI-compatible errors, response fallback text, response snapshots, and run failure events. This prevents credential-shaped provider failure text from crossing the API-server boundary while preserving debuggable sanitized messages.
2026-06-28 02:05:38 -07:00
HexLab98
d2ea948bc0 fix(gateway): suppress compression status noise on Discord and other chats (#39293)
Extend the gateway noisy-status filter beyond Telegram so internal
compression lifecycle messages stay in logs instead of spamming Discord,
Slack, and other messaging channels.
2026-06-28 14:35:32 +05:30
fesalfayed
263ffec1b0 fix(whatsapp): resolve LID aliases on modern platforms/ session layout
expand_whatsapp_aliases hardcoded get_hermes_home()/whatsapp/session, but
the adapter writes lid-mapping files via get_hermes_dir("platforms/whatsapp/
session", "whatsapp/session"). On installs without the legacy directory the
two paths diverge, so the resolver finds no mappings and returns the bare LID,
which misses the allowlist and silently drops the message. Resolve through the
same helper so both sides stay in lockstep on new and legacy layouts.
2026-06-28 02:05:26 -07:00
teknium1
64972b6403 fix(config): canonicalize model.name/model.model to model.default (#34500)
A custom_providers config that names the model under model.name (or
model.model) resolved to an empty model, so the API request went out
with model= — HTTP 400 from OpenAI-compatible backends. Display paths
(hermes status/dump) already read model.name and showed the model,
making the failure silent.

The model id was read via 'default or model' at ~14 independent sites
(cli, gateway, cron, curator, oneshot, fallback, profiles, ...), none
of which honored 'name'. Rather than patch every site, canonicalize at
the single load/save chokepoint: _normalize_root_model_keys() now
promotes model.model/model.name -> model.default (precedence
default > model > name) and drops the stale alias, so every reader —
present and future — sees a populated default and config.yaml is
migrated canonical on next save. The gateway, which bypasses
load_config(), replays the same normalization in _load_gateway_config().

Co-authored-by: Bartok9 <danielrpike9@gmail.com>

Credit: root-cause analysis and fix direction from @Bartok9 (#34502,
first) and @v86861062 (#34527).
2026-06-28 02:05:13 -07:00
Teknium
c9df4bc094
fix(gateway): default restart_drain_timeout to 0 to kill systemd crash loop (#54066)
A restart now interrupts in-flight agents immediately rather than holding
the gateway open for a grace window. The previous 180s default coupled two
independently-set timers: the gateway's own drain timer and systemd's
TimeoutStopSec. On a stale unit where TimeoutStopSec < drain, systemd
SIGKILLed the gateway mid-cleanup, leaving a stale lock that made the next
startup exit immediately ('already running') — an infinite crash loop under
Restart=on-failure (#31981).

Setting drain to 0 makes the mismatch structurally impossible: with drain 0
the generated unit gets TimeoutStopSec=90 against a near-instant drain, so
systemd never kills mid-cleanup. Contract: restart the gateway, in-flight
work stops. A grace window large enough to 'save' a long agent turn would
have to outlast an unbounded task, which is impossible.

Also fixes the stale-unit warning's suggested command
(hermes gateway service install --replace -> hermes gateway install --force);
the former subcommand does not exist.

Closes #31981
2026-06-28 01:14:34 -07:00
Teknium
90d25adc9e
fix(gateway): deliver profile-scoped cache media on symlinked HERMES_HOME (#54060)
Generated images under a profile gateway's cache (profiles/<name>/cache/
images/...) were silently dropped from Telegram/Discord delivery when
HERMES_HOME is symlinked under a denied prefix (e.g. /opt/data ->
/root/.hermes) and $HOME is not that prefix. The resolved path lands
under /root (a system denylist prefix), the root-home exception only
fires when the denied prefix IS $HOME, and the static safe-roots list
only covers the active HERMES_HOME's top-level cache — not per-profile
cache dirs. Both gates fail, so validate_media_delivery_path returns
None and the gateway logs 'Skipping unsafe MEDIA directive path'.

_media_delivery_allowed_roots() now also enumerates per-profile cache
roots (<root>/profiles/*/cache/{images,audio,videos,documents,
screenshots}) at check time. Allowlist match runs before the denylist,
so the profile artifact delivers regardless of the /root interaction;
profile-dir credentials (auth.json) stay blocked since they aren't
under a cache subdir.

Reopened regression of #34485/#38108, neither of which covered the
profile-scoped symlink case. Fixes #31733.
2026-06-28 01:07:28 -07:00
sweetcornna
fc70d023d8 fix(telegram): apply bot auth policy to Telegram sources
# Conflicts:
#	gateway/config.py
2026-06-28 00:57:03 -07:00
infinitycrew39
e860a40e14 fix(agent,gateway): surface partial-stream recovery and bound detached restart
Salvage of NousResearch/hermes-agent#41498 (0-CYBERDYNE-SYSTEMS-0).

- Leave response_previewed false on partial_stream_recovery so gateway
  fallback delivery can send the recovered fragment plus explanation.
- Always append the turn-completion explainer for partial_stream_recovery,
  not only for empty or very short fragments (#34452 gap).
- Launch the detached /restart helper before drain, idempotently, with a
  bounded wait of restart_drain_timeout + 5s.
2026-06-27 22:03:14 -07:00