hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-03 12:23:08 +00:00

Author	SHA1	Message	Date
fayenix	d6c53dcdcb	fix(gateway): stop per-turn agent-cache eviction from model + message_id signature churn Two independent bugs evicted the cached gateway AIAgent on every turn, preventing the prompt cache from ever warming: 1. Model normalization mismatch: the post-run fallback-eviction check compared _agent.model (stripped in AIAgent.__init__) against the raw _resolve_gateway_model() config string. For vendor-prefixed config on native providers (e.g. 'deepseek/deepseek-v4-pro' vs 'deepseek-v4-pro') this was always unequal, so the agent was evicted after every successful run. Normalize _cfg_model the same way (skip aggregators). 2. Discord triggering message_id leaked into the cached system prompt via build_session_context_prompt()'s Discord IDs block. message_id changes every turn, so the agent-cache signature (computed from the ephemeral prompt) changed every Discord turn -> rebuild every message. The id is now injected per-turn into the user message (where per-turn content belongs and does not touch the cache signature); the cached IDs block carries a static pointer to it, preserving reply/react/pin via the discord tools. Adapted from #28846. Bug #1 fix is the contributor's; bug #2 reworked to be non-destructive (keeps the triggering-id capability instead of deleting it). Redundant auto-reset eviction (already on main via #9893/#48031) and the wrong-premise reset_context_note plumbing from the original PR were dropped. Co-authored-by: Hermes Agent <hermes@nousresearch.com>	2026-06-30 04:22:41 -07:00
teknium1	af5cea04ab	fix(discord): split oversized final edits, truncate mid-stream previews (#27881) DiscordAdapter.edit_message clipped any formatted payload over the 2,000-char cap to [:1997]+"..." and returned success=True, so the stream consumer believed the full reply landed and stopped; the user lost everything past the boundary and perceived the agent as quitting mid-task. edit_message is now overflow-aware, mirroring Telegram's proven contract: - finalize=True: split-and-deliver via _edit_overflow_split: edit chunk 1 in place, send chunks 2..N as reply-threaded continuations, return the last visible id in message_id plus continuation_message_ids so the stream consumer keeps editing the most recent chunk and can clean them all up. - finalize=False (mid-stream): truncate a one-message preview in place, never split. A mid-stream split moves the edit target to a continuation and the next accumulated-token tick re-splits, looping forever (the Telegram #48648 lesson the original port predated). - Reactive 50035 '2000 or fewer in length' on edit runs the same branch logic. - Partial continuation failure still reports success with a partial_overflow raw_response so the consumer retries the tail instead of marking a clipped reply complete. Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com> Co-authored-by: AhmetArif0 <147827411+AhmetArif0@users.noreply.github.com>	2026-06-30 03:49:52 -07:00
jasonQin6	6dd188d786	fix(gateway): add session staleness guard to stream consumer GatewayStreamConsumer.run() processed queued deltas in an infinite loop with no check on whether the session was still current. On /new or /stop mid-stream, the consumer kept editing and delivering stale response fragments alongside the 'Session reset!' ack. PR #11016 (`b7bdf32d`) fixed the runner side via sentinel promotion/release but left the stream consumer unguarded. Every other async callback in run.py already bails via _run_still_current(); the stream consumer was the only one missing it. - stream_consumer.py: optional run_still_current callback, checked at the top of the run() loop; returns early when the session is stale. - run.py: pass the existing _run_still_current closure at both call sites (proxy path and agent path). - tests: TestRunStillCurrentGuard — immediate staleness, mid-stream staleness, always-current, no-callback default, pending-finish. Co-authored-by: jasonQin6 <39369769+jasonQin6@users.noreply.github.com>	2026-06-30 03:42:25 -07:00
Kong	24aa02179b	test(whatsapp): repoint owner test import after adapter relocation WhatsAppAdapter lives under plugins/platforms/whatsapp/adapter.py on current upstream; the owner-forward test still imported the removed gateway.platforms.whatsapp module.	2026-06-30 03:41:43 -07:00
Keira Voss	a61cf774ce	feat(whatsapp): tag owner-typed inbound text with [owner reply] prefix When WHATSAPP_FORWARD_OWNER_MESSAGES is enabled and the bridge marks an inbound message with fromOwner=true, also prefix MessageEvent.text with "[owner reply] " at construction time. This makes the disambiguation survive any downstream plugin failure (e.g. handover-rule errors that bypass silent_ingest), so transcripts never misattribute owner-typed text to the customer. Idempotent: re-applies are guarded so a future producer that pre-tags text won't be double-prefixed.	2026-06-30 03:41:43 -07:00
keiravoss94	84f350efe0	feat(whatsapp): opt-in forwarding of owner-typed messages in bot mode In `WHATSAPP_MODE=bot` the bridge currently drops every fromMe inbound message: they are all assumed to be echoes of our own /send calls. That makes it impossible for plugins / agents to detect when a human owner has typed directly into a customer chat from the same WhatsApp Business account (e.g. via a linked phone or WhatsApp Web). This adds an opt-in `WHATSAPP_FORWARD_OWNER_MESSAGES` env var. When true, the bridge classifies fromMe inbound by looking up `key.id` in a bounded LRU of recently-sent message IDs (the existing 50-entry echo suppressor, bumped to 512 and extracted to a testable `outbound_ids.js` helper). Hits in the LRU are still dropped (echoes); misses are forwarded to the Python adapter with `fromOwner: true`. The Python adapter lifts that flag onto `MessageEvent.metadata["whatsapp_from_owner"]`. `metadata` is a new free-form dict on the event so future per-platform signals don't each need their own field. Default behaviour is unchanged: with the env flag unset, bot mode still drops every fromMe message exactly as before. Use cases for downstream consumers: - Implicit handover activation when the owner replies manually - Sliding TTL on owner activity (keep an active session alive while the owner is engaged) - Audit trails of owner interventions - Analytics on human-vs-bot reply ratios Heuristic limitation (documented in code): the LRU is in-memory. After a bridge restart, in-flight delivery receipts of pre-restart sends will briefly look like owner-typed for a few seconds until the set is repopulated. Persisting isn't worth the disk churn; downstream consumers should treat the flag as best-effort. Tests: - tests/gateway/test_whatsapp_from_owner.py (new): adapter sets the metadata flag iff the bridge payload has `fromOwner: true`; absent otherwise. - scripts/whatsapp-bridge/outbound_ids.test.mjs (new): LRU bounds, eviction order, falsy-id handling. Backwards compatibility: with the env flag unset, every code path is identical to before. No existing deployment is affected.	2026-06-30 03:41:43 -07:00
UgwujaGeorge	cb9d18c759	fix(gateway): stop media-send fallbacks from leaking host paths into chat The base BasePlatformAdapter implementations of send_voice, send_video, send_document, and send_image_file forwarded their _path argument verbatim into the chat text (e.g. "🎬 Video: /home/.../hermes/cache/..."). Telegram, Discord, and Slack adapters all fall back to those base methods when their native send raises — so a rejected video on Telegram surfaced the host filesystem layout to the user instead of a useful message. Replace the path-echo with a friendly notice, log the path for operator diagnostics, and keep the user-supplied caption intact. The Slack adapter had three identical sites that fell through to the same path-echo on its own native upload failures; fix those too. send_document still surfaces the caller-provided file_name (or the basename derived from it) since that is the user-facing filename, not a host path. Add regression tests asserting the _path argument never appears in the fallback content while caption text and explicit file_name still do.	2026-06-30 03:24:36 -07:00
teknium1	fee3d4ed04	test(gateway): update startup-restart-race fixtures for current main The salvaged test double predated two main changes: - start() now connects via _connect_adapter_with_timeout, which forwards is_reconnect to adapter.connect(); the StartupRaceAdapter double didn't accept the kwarg. - stop() now awaits _finalize_shutdown_agents (async on main); the fixture stubbed it as a plain MagicMock. Accept is_reconnect in the double and use AsyncMock for the finalize stub.	2026-06-30 03:22:18 -07:00
Disaster-Terminator	f4a54b6292	fix(gateway): abort startup during restart	2026-06-30 03:22:18 -07:00
teknium1	b6045170bb	fix(discord): extend channel-name matching to slash-command auth; clamp flush deadline to disconnect budget Follow-up to the salvaged #8008 fix: - Sibling-site fix: _evaluate_slash_authorization gated DISCORD_ALLOWED_CHANNELS / DISCORD_IGNORED_CHANNELS on numeric IDs only, so name/#name config that now works for on_message still silently failed for slash-command interactions. Refactor the channel-key helper to _discord_channel_keys_from_channel(channel, parent) and reuse it at the interaction gate. Fail-closed on missing channel id is preserved. - The contributor's hardcoded 8s flush deadline could be hard-cancelled mid-flush: _teardown_adapter already wraps cancel_background_tasks() in the per-adapter disconnect budget (HERMES_GATEWAY_ADAPTER_DISCONNECT_TIMEOUT, default 5s). The flush deadline now derives from that budget with headroom so it always completes inside it. - AUTHOR_MAP: map cypher@augmentl.com -> Nickperillo for CI. - Tests: slash-auth name/#name allow + name ignore matching.	2026-06-30 02:48:42 -07:00
Cypher	cb9308f0a6	fix(discord): channel name matching and flush pending sends on shutdown Two related fixes to the Discord gateway adapter: 1. Channel name matching (free-response, allowed, ignored, no-thread channels) Previously these config values only matched against numeric channel IDs. If a user configured free_response_channels: cypher (by name), the adapter would silently ignore it because it only intersected against channel_ids. Now the adapter builds a channel_keys set that includes the channel ID, channel name, and #channel-name form, and checks all three for each gate. 2. Flush pending text-batch tasks before shutdown The Discord adapter uses _pending_text_batch_tasks (its own dict) for merging rapid successive message chunks. These tasks were NOT added to self._background_tasks (the base class list), so the base cancel_background_tasks() never awaited them on restart/shutdown. This caused a race: in-flight response deliveries were cancelled before Discord had a chance to send them, resulting in silent dropped messages visible to users as tool-log-only replies with no text body. Fix: override cancel_background_tasks() in DiscordAdapter to await all pending text-batch tasks (8s deadline) before delegating to the base class.	2026-06-30 02:48:42 -07:00
David Gutowsky	3a83b6bc5d	fix(gateway): self-heal stale sessions.json routing at message time Detect a routing key whose session is already ended in state.db (end_reason set) inside get_or_create_session and drop the stale entry instead of silently routing the message into a closed session. Previously the only runtime cleanup of sessions.json was the startup _prune_stale_sessions_locked (#52808/#54138), which requires a restart. A session ended while the gateway stays alive — any path that finalizes the DB row without clearing sessions.json — left a live routing key pointing at a closed session. get_or_create_session never consulted end_reason, so it returned that stale entry and every subsequent message was silently dropped (no log, no error, no response) until the next restart. This is the live-gateway variant of #52804/FM9, which needed an actual gateway crash. The guard drops the stale entry and falls through to _recover_session_from_db, which reopens agent_close-ended rows and resumes the SAME session_id (transcript preserved); if the row ended for a non-recoverable reason (e.g. /new) it correctly starts a fresh session. A warning is logged so the event is visible (the field incident reported zero log output). Adds tests/gateway/test_session_store_runtime_stale_guard.py covering the _is_session_ended_in_db helper and the end-to-end routing self-heal (recover-vs-fresh, live-entry untouched, stale-wins-over-suspended, force_new short-circuit). Closes #54878. Co-authored-by: David Gutowsky <david.gutowsky@gmail.com>	2026-06-30 13:17:51 +05:30
Ben Barclay	05ac16778b	feat(gateway): per-platform typing_indicator toggle Add a generic per-platform PlatformConfig.typing_indicator flag (default True) that gates the _keep_typing refresh loop in _process_message_background. When false, the loop is never spawned, so no typing/"is thinking…" status is shown on that platform — message delivery is otherwise unchanged. Mirrors the gateway_restart_notification contract exactly: dataclass field + to_dict/from_dict (with extra-fallback resolution) + shared-key bridge in load_gateway_config, so 'slack: typing_indicator: false' under platforms works without a separate block. Generic by design — the same key works for every platform (Slack 'is thinking…', Telegram/Discord/Signal typing). Motivated by users who find Slack's assistant 'is thinking…' status noisy (it also briefly disables the compose box, via the Assistant API).	2026-06-29 21:12:57 -07:00
Ben	184c10cf97	fix(slack): warn when configured token is a user token, not a bot token A Slack user/legacy token (xoxp-...) makes auth.test resolve to the installing human's member ID with no bot_id, so the adapter binds its identity (_bot_user_id / _team_bot_user_ids) to that human. Every "is this the bot?" check then misfires: that person's <@...> mentions wake the bot and are stripped as the bot's own mention, so the agent is genuinely told it was @mentioned and replies to messages merely addressed to that human (symptom: bot responds to "@trevor ..." and insists it was explicitly mentioned). There is no runtime API error to catch — a user token still sends/receives — so the only detectable moment is connect time. Add a warning-only nudge (_warn_if_not_bot_token) alongside the existing group-DM scope nudge: when auth.test resolves a user_id but no bot_id, log that the token is a user token and to use the xoxb-... Bot User OAuth Token. Warning-only: does not block a working-but-misconfigured install. Fires once per workspace per process.	2026-06-29 20:57:43 -07:00
Teknium	6aefc9d925	feat(gateway): show per-category context breakdown in /usage (#55204) Channel users get the same context split the desktop popover shows (PR #54907), covering system prompt, tools, rules, skills, MCP, subagents, memory, conversation, under the existing Context line in /usage. Reuses agent.context_breakdown.compute_session_context_breakdown, so there is no new tool and no new engine. The slices are estimates (chars/4) and the block is labelled _(estimated)_; the headline Context line keeps using the provider-measured last_prompt_tokens. Rendering is fail-open: any engine error returns no breakdown and the rest of /usage is unaffected. - gateway/slash_commands.py: _context_breakdown_lines() helper + wire into _handle_usage_command - locales/*.yaml: breakdown_header, breakdown_line, and 8 category labels across all 16 locales (parity gate) - tests/gateway/test_usage_command.py: render + fail-open coverage	2026-06-29 20:42:19 -07:00
Teknium	481caa66f2	feat(display): friendly human-phrased tool labels for built-in tools (#55166) * feat(display): friendly human-phrased tool labels for built-in tools Built-in tools now render ChatGPT-style status verbs ('Searching the web for ...', 'Reading <file>', 'Browsing <url>') on the CLI spinner and gateway/desktop tool-progress instead of the raw tool name. - agent/display.py: _TOOL_VERBS map + build_tool_label() + set/get friendly-labels flag (default on). Custom/plugin/MCP tools fall back to the raw preview; verbose gateway mode left untouched (debug surface). - tool_executor.py / tui_gateway / gateway: route the three spinner sites, the TUI _tool_ctx, and the gateway all/new progress line through the label. - config: display.friendly_tool_labels (default True, per-platform aware). Zero new core tool / schema footprint: pure display layer. * docs: add PR infographic for friendly tool labels * fix(display): preserve arg preview in gateway friendly labels + update tests The first gateway pass re-derived the label from the callback's `args`, which is empty ({}) at the gateway tool.started callsite; the command/query lives in the `preview` string, so terminal rendered as a bare '💻 Running' and dedup collapsed consecutive commands. Now the gateway prefixes the verb onto the already-computed preview via get_tool_verb/tool_verb_connector/verb_drops_preview, preserving the command/url/query. CLI spinner path (real args) keeps build_tool_label. Tests: update test_run_progress_topics exact-format assertions to the friendly form ('💻 Running pwd'), add a format-agnostic preview extractor for the truncation tests (works for both quoted-legacy and verb-prefixed output). * test(tui): update resume-display context to friendly tool label _tool_ctx now uses build_tool_label, so the desktop resume-view context for a search_files turn reads 'Searching files for resume' instead of the bare 'resume' preview, consistent with live tool-progress. Update the assertion. * test(tui): harden no-race worker test against sibling shard leakage test_session_create_no_race_keeps_worker_alive flaked under -j 8: a daemon build thread leaked from a prior session.create test in the same shard process fires close/unregister against its own (foreign) session_key after this test patches the global approval hooks, polluting the captured lists. Scope the assertions to this session's own session_key so the regression intent (this session's worker/notify must survive) is preserved while the test becomes immune to shard composition. Not related to friendly-tool-labels.	2026-06-29 20:31:17 -07:00
yoniebans	d2ce2c852d	test(gateway): assert interleaving safety of concurrent offloaded DB calls	2026-06-29 15:51:57 -07:00
yoniebans	6735162531	fix(gateway): offload the Telegram topic-recovery helper tree off the loop The topic-mode helpers (_telegram_topic_mode_enabled, _recover_telegram_topic_thread_id, _record/_sync_telegram_topic_binding, _is_telegram_topic_lane/_root_lobby, _normalize_source_for_session_key, _telegram_topic_new_header, _schedule_telegram_topic_title_rename, and the base.py _apply_topic_recovery hook) each run a synchronous SessionDB read or write. They reach the event loop through async handlers, so a contended state.db froze the loop the same way the handoff watcher did. These helpers already run off-loop in the run_sync thread-pool closure, so they are proven thread-safe there. Rather than colour them async, loop-side callers now invoke them via asyncio.to_thread(...); the executor callers are unchanged. Inside the helpers the SessionDB handle is unwrapped to the sync door (getattr(db, '_db', db)) since they always run on a worker thread, and AIAgent construction + query_session_listing are handed the sync SessionDB directly. base.py wraps its single _apply_topic_recovery call in to_thread. The guard is now alias-aware (catches db = getattr(self, '_session_db', None); db.method(...)) and enforces the offload contract: the offloaded sync helpers may never be called bare on the loop. Sibling test fixtures wrap their injected SessionDB in AsyncSessionDB to match how the gateway holds it.	2026-06-29 15:51:57 -07:00
yoniebans	0896facce8	fix(gateway): route SessionDB calls through AsyncSessionDB	2026-06-29 15:51:57 -07:00
yoniebans	89daacb454	test(gateway): cover AsyncSessionDB offload + raw-call guard (failing)	2026-06-29 15:51:57 -07:00
Teknium	290fa7fd2b	fix(gateway): skip confirmed-dead delivery targets (deleted groups, blocked bots) (#55115) * fix(gateway): skip confirmed-dead delivery targets (deleted groups, blocked bots) A deleted Telegram group, kicked/blocked bot, or deactivated user keeps throwing Forbidden/not_found on every cron tick and fan-out delivery. Each retry burns a send against the platform's flood-control envelope and spams the logs, making the whole session feel broken even when the model call completed. Add a small persistent DeadTargetRegistry (per-profile JSON under HERMES_HOME) that records a target the moment a send reports a whole-chat death (forbidden / chat-level not_found), and have DeliveryRouter.deliver() short-circuit it on subsequent attempts. Self-healing: any successful send clears the flag, so a user re-adding the bot recovers with no manual cleanup. Thread/topic-level not_found is NOT recorded (adapters already self-heal that by retrying without reply_to). Transient/timeout errors are never marked dead. * infographic: dead delivery target skipping	2026-06-29 13:23:29 -07:00
Ben Barclay	b963d3238b	feat(gateway): suppress home-channel shutdown broadcast on flagged drains (#54824) Add a generic suppress_notification flag to the drain-request marker. When a drain that ends in process exit (e.g. a NAS auto-update image migration on the always-on Hermes Cloud fleet) is flagged, the gateway skips ONLY the home-channel 'gateway shutting down' broadcast, the operator-flavoured ping that would otherwise fire on every routine auto-update, dozens of times a day. The per-active-session interrupt ping is ALWAYS kept: on a drained shutdown it's empty by construction, and in the force-interrupt (deadline-exceeded) case it carries the user-valuable 'your task was cut off, message me to resume' hint. The gateway stays agnostic about WHY a drain is quiet (generic boolean, not a kind enum); the policy of which drain causes set the flag lives in the caller (NAS). Default-false so legacy/operator drains behave exactly as before. The reader reuses the NS-570 epoch-staleness check so an orphaned marker on the durable volume can never silence a fresh gateway's legitimate broadcast. - drain_control.py: write_drain_request gains suppress_notification; new drain_notification_suppressed() reader (current-epoch + truthy flag). - web_server.py: /api/gateway/drain reads + echoes the flag. - run.py: _notify_active_sessions_of_shutdown skips the home-channel loop only. Tests prove: flag round-trips; home-channel suppressed when set, kept when unset; active-session ping always fires; stale/legacy/corrupt markers never suppress.	2026-06-29 12:18:11 -07:00
Teknium	dbad6d47d3	fix(gateway): also neutralize untrusted Matrix room name in prompt Widen #5961's _format_untrusted_prompt_value coverage to the Matrix room display name (Matrix Room:), a sibling attacker-controllable field the original fix missed. chat_name is user-settable, so an injected room name could render as literal markdown in the system prompt. Adds a regression test.	2026-06-29 04:25:51 -07:00
Xowiek	09666ceb76	fix(gateway): neutralize untrusted session metadata in prompts	2026-06-29 04:25:51 -07:00
teknium1	ea1372d2af	fix(security): wire session-id sanitizer into artifact paths + API boundary Defense-in-depth on top of _safe_session_filename_component (#5958): Sink (makes the bad write impossible regardless of entry point): - run_agent._save_session_log: sanitize session_id before building the session_{sid}.json snapshot path. - agent_runtime_helpers.dump_api_request_debug: sanitize before building the request_dump_{sid}_{ts}.json path. Boundary (clean 400 instead of a silently-hashed filename): - api_server rejects path-traversal-shaped X-Hermes-Session-Id on the session-continuation path and the explicit /api/sessions create path, reusing gateway.session._is_path_unsafe (mirrors the native gateway's entry-boundary guard). Also enforces the session-header length cap on the continuation path. Tests: traversal session_id stays contained at the write site; sanitizer always yields a traversal-free segment; the API header rejects ../, absolute, and Windows-traversal IDs with 400.	2026-06-29 04:25:45 -07:00
teknium1	cdd8e0a271	test(gateway): exercise last_prompt_tokens in reset-activity tests The reset-had-activity tests set total_tokens (dead state) to simulate activity; production records activity via last_prompt_tokens. Update the fixtures to match the field the fix and runtime actually use.	2026-06-29 04:25:37 -07:00
sgaofen	194bff0687	fix(gateway): confirm final delivery before suppressing send Fixes #14238. During a compression/session split at the response boundary, the interim callback delivered unrelated commentary, setting response_previewed=True. The suppression logic treated that as proof the final reply had been delivered and skipped the normal send — the response was persisted to the child session but never sent to chat. Only suppress the normal final send when the stream consumer confirms final delivery (final_response_sent / final_content_delivered) or the exact final response text was delivered as a preview.	2026-06-29 02:37:11 -07:00
teknium1	34e616e778	feat(slack): nudge stale installs to add mpim scopes; mark message.mpim required Follow-up to the group-DM manifest fix. The manifest change only helps NEW installs; existing apps keep their old (mpim-less) scopes until the admin reinstalls. Since a missing message.mpim event delivers nothing (no runtime API error to catch), detect stale installs at connect time from the auth.test x-oauth-scopes header and log an actionable reinstall nudge when im:history is granted but mpim:history is not. Also promote message.mpim from Recommended to Required in the docs event tables so the default setup path can't drop it.	2026-06-29 01:02:53 -07:00
Teknium	74541beb9c	fix(security): cap WeCom callback body size before pre-auth XML parse (#54615) The WeCom callback endpoint (internet-facing, 0.0.0.0) parsed untrusted request bodies before signature verification. defusedxml already guards the entity-expansion class on main, but there was no cap on raw body size, so an unauthenticated POST could still force unbounded read work pre-auth. Set client_max_size=64KB on the aiohttp app (413 at the framework layer) plus an explicit length guard in _handle_callback as defense in depth. WeCom callbacks are small encrypted XML envelopes (media is delivered out-of-band via MediaId, never inline), so 64KB is ample for legitimate traffic. Adds tests for oversized (413) and normal-sized (not 413) bodies. Salvaged from #10192 by @memosr (body-size limit half; defusedxml half already superseded on main).	2026-06-28 22:35:43 -07:00
teknium1	0b733a8418	test(gateway): pin auto-reset cached-agent eviction (#10710) Relocate marco0158's eviction into the dedicated auto-reset cleanup block (single source of truth for dropping session-scoped transient state) and add an AST invariant pinning _evict_cached_agent into that block. Add AUTHOR_MAP entry for marco0158.	2026-06-28 22:35:17 -07:00
Junass1	61a4526ac7	fix(gateway): clear session-scoped model overrides on /resume /resume is a conversation boundary, but unlike /new it did not clear the chat-keyed _session_model_overrides / _pending_model_notes. A /model switch made in the previous session under the same chat session_key leaked into the resumed conversation, running it on the wrong model. Clear both maps for the session_key after the switch (mirroring /new), scoped to that key so other chats' overrides are untouched. The cached-agent eviction this leak also implied already landed via #6672. Closes #10702.	2026-06-28 22:35:12 -07:00
Teknium	e20ff352b9	test(matrix): authorize inviter in DM-invite fixture for new invite-auth gate _on_invite now rejects auto-joins from users not on the allow-list. The DM-recording tests invite @alice and expect a join, so the shared _make_adapter fixture now puts @alice on _allowed_user_ids.	2026-06-28 20:47:33 -07:00
Teknium	d65468e7ff	fix(security): SSRF guard yuanbao media download_url (#54470) yuanbao_media.download_url() fetched model-supplied (outbound) and inbound image/file URLs server-side via httpx with follow_redirects=True and no SSRF check. A model response containing <img src="http://169.254.169.254/..."> routed through ImageUrlHandler -> download_url and would fetch cloud-metadata endpoints; same for inbound media. Add an is_safe_url() pre-flight plus an async redirect event-hook that re-validates every 30x target, matching the cache_image_from_url() guard in gateway/platforms/base.py. The other gateway adapters already guard their URL-fetch paths; this was the remaining unguarded one.	2026-06-28 15:29:59 -07:00
Teknium	86e64900b9	fix(gateway): preserve sessions across restarts (#54442)	2026-06-28 15:10:39 -07:00
teknium1	c648ecdca5	fix(telegram): reject unauthorized users before event construction (#40863) Removed/unauthorized Telegram users could inject prompt content before the per-user auth gate fired. The adapter ran `_should_process_message`, `_build_message_event`, and text/photo batching (and dispatched to the runner) before `_is_user_authorized()` (gateway/authz_mixin.py) rejected the sender. Unmentioned group chatter from a removed user was also persisted into the session transcript via `_observe_unmentioned_group_message`, leaking into the agent's observed context independent of dispatch. Add `_is_user_authorized_from_message()` as an intake prefilter that runs in `_handle_text_message`, `_handle_command`, `_handle_location_message`, and `_handle_media_message` BEFORE batching, event construction, and the unmentioned-group observe branch. It reuses the runner's `_is_user_authorized()` with a correctly-shaped SessionSource (group vs forum vs dm, real chat_id for TELEGRAM_GROUP_ALLOWED_* allowlists), falls back to env allowlists, and only rejects when an allowlist actually exists; unknown DMs with no allowlist still reach the pairing flow. Channel posts authorize via `sender_chat` identity when `from_user` is absent. Co-authored-by: liuhao1024 <sunsky.lau@gmail.com> Co-authored-by: Carlos Manuel Cejas <carlosmcejas@gmail.com>	2026-06-28 14:25:15 -07:00
Teknium	9a0010fd46	fix(windows): cover remaining console-flash spawn legs (#54417)	2026-06-28 13:49:08 -07:00
Teknium	cb982ad997	fix(windows): hide console-window flash on backend git/gh/wmic/bash subprocess spawns The Windows desktop GUI runs its backend headless via pythonw.exe. Several auxiliary subprocess sites that run inside that windowless backend spawned console-subsystem children (git, gh, wmic, powershell, bash, rg, taskkill) WITHOUT CREATE_NO_WINDOW, so Windows allocated a fresh conhost per call and flashed a black window on screen — sometimes continuously (the dashboard Projects-tree git probe alone fired ~118 spawns in 60s on startup). The terminal tool, cron, browser, code_execution, and gateway-spawn paths already carry windows_hide_flags(); these auxiliary probe/scan/launcher legs were missed. Wire the existing helper into them: - tui_gateway/git_probe.py: run_git (+ encoding=utf-8/errors=replace, fixes the cp950 UnicodeDecodeError on CJK paths from the same site) - agent/coding_context.py: _git (per-turn git status/log/diff) - agent/context_references.py: _run_git + _rg_files (@file/@ref resolution) - hermes_cli/copilot_auth.py: gh auth token probe (auxiliary provider:auto) - hermes_cli/gateway.py: wmic + PowerShell Get-CimInstance PID scan - hermes_cli/main.py: wmic stale-dashboard PID scan - gateway/status.py: taskkill /T /F force-kill windows_hide_flags() returns 0 on POSIX, so every changed call is a no-op on Linux/macOS (verified: real git/rg probes still work; Windows-simulated calls all pass creationflags=CREATE_NO_WINDOW). Scoped to the windowless-backend paths that cause the reported flashing. The Electron updater-handoff leg (main.cjs windowsHide:false) and the interactive-CLI banner probes (cli.py) are intentionally NOT touched here — the former needs a Windows-tested change of its own, the latter runs in a visible console anyway. Tracking: #54220 Refs: #53178 #53631 #53781 #53957 #49602 #52982 #53424 #53053 #53016	2026-06-28 05:28:45 -07:00
liuhao1024	9d919daf44	fix(gateway): mark platform lock failure as retryable instead of permanently fatal When a stale lock file survives a gateway crash, `acquire_scoped_lock()` may return `(False, existing_dict)` even after detecting and deleting the stale lock (e.g. if unlink fails or a race condition occurs). Previously, `_acquire_platform_lock()` called `_set_fatal_error(..., retryable=False)`, which permanently killed the platform — the reconnect watcher never retries a non-retryable fatal error. Change to `retryable=True` so the platform enters the "retrying" state and the reconnect watcher can attempt acquisition again after the standard backoff delay. Fixes #54167	2026-06-28 04:35:37 -07:00
tymrtn	d7f655f370	fix: accept typed clarify choice replies	2026-06-28 04:13:19 -07:00
MorAlekss	acca526286	fix(gateway): treat zombie PIDs as dead in _pid_exists to unblock --replace (closes #42126) Under systemd Restart=always, the old gateway becomes a zombie (in the process table, awaiting reap) when the replacement starts. _pid_exists() reported the zombie as alive, so --replace waited on a PID that never dies, then aborted with exit 1: a silent crash loop. Standalone runs are unaffected because nothing respawns the gateway into a zombie. The live path is psutil.pid_exists(), which returns True for zombies, so the check is added there (Process.status() == STATUS_ZOMBIE -> dead). The psutil-less POSIX fallback also reads /proc/<pid>/stat (state Z) with a ps state= fallback for macOS/BSD, before the os.kill(pid, 0) liveness probe. Diagnosis and the /proc + ps POSIX fallback by MorAlekss (PR #44898); extended to cover the psutil hot path so the fix applies on normal installs. Co-authored-by: MorAlekss <mor.aleksandr@yahoo.com>	2026-06-28 04:11:14 -07:00
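The `/proc/<pid>/stat` check in the POSIX fallback might look roughly like the sketch below. It covers only the parsing step, operates on a line of text rather than the live filesystem, and uses hypothetical function names; the psutil path and the `ps` fallback mentioned in the message are not shown.

```python
from typing import Optional


def proc_stat_state(stat_line: str) -> Optional[str]:
    """Return the one-letter process state from a /proc/<pid>/stat line.

    The second field (the command name) is parenthesised and may itself
    contain spaces or ')', so split at the LAST ')' instead of on whitespace.
    """
    _, sep, rest = stat_line.rpartition(")")
    fields = rest.split()
    if not sep or not fields:
        return None  # malformed line: state unknown
    return fields[0]


def is_zombie_stat(stat_line: str) -> bool:
    """True when the stat line reports state Z (zombie, awaiting reap)."""
    return proc_stat_state(stat_line) == "Z"
```

A caller would treat a zombie result as "dead" before falling through to the `os.kill(pid, 0)` probe, which still succeeds for zombies.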
teknium1	d5ba374c03	fix(telegram): detect wedged getUpdates consumer via pending_update_count The merged CLOSE-WAIT heartbeat (#52744) only probes get_me(), which uses the general request path and stays healthy while PTB's getUpdates consumer is silently wedged (updater.running=True but the long-poll task is stuck, observed on WSL2). DMs then queue in the Bot API and never reach handlers (#42909). Augment the existing _polling_heartbeat_loop to also probe get_webhook_info().pending_update_count. After two consecutive probes that see a non-draining queue while the updater claims to be running, escalate into the existing _handle_polling_network_error recovery ladder — no new restart machinery. No-ops in webhook mode, when the updater is not running, or when a reconnect is already in flight. Credit to @gazzumatteo, whose PR #42959 identified the pending_update_count signal as the missing liveness probe. This reuses the existing heartbeat + recovery path rather than adding a parallel watchdog. Fixes #42909.	2026-06-28 02:44:17 -07:00
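The "two consecutive probes" rule above can be sketched as a small counter. Everything here is an assumption layered on the commit text: the class name is hypothetical, and "non-draining" is interpreted as "pending count is positive and has not dropped since the previous probe", which the message does not spell out.

```python
from typing import Optional


class PendingQueueWatch:
    """Hypothetical strike counter for a wedged long-poll consumer."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.strikes = 0
        self.previous: Optional[int] = None

    def observe(self, pending: int, updater_running: bool) -> bool:
        """Record one probe; return True when recovery should be escalated."""
        # Assumed definition of "non-draining": updates are queued and the
        # queue did not shrink since the last probe.
        stuck = (
            updater_running
            and pending > 0
            and (self.previous is None or pending >= self.previous)
        )
        self.previous = pending
        if not stuck:
            self.strikes = 0
            return False
        self.strikes += 1
        if self.strikes >= self.threshold:
            self.strikes = 0  # hand off to the existing recovery path
            return True
        return False
```

In this sketch a single busy probe never escalates on its own; the queue has to stay stuck across consecutive probes while the updater claims to be running.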
teknium1	9844243b18	fix(gateway): gate quick_commands through slash access policy Config-backed quick_commands bypassed the admin-only slash gate. The early gate in _handle_message only fires for registry-known commands (is_gateway_known_command), but quick_commands are never in the gateway registry, so they reached the type:exec dispatch sink unchecked. An allowlisted non-admin gateway user could invoke admin-only quick commands — including shell exec in the gateway process — even when the operator set allow_admin_from / user_allowed_commands to lock them out. Apply _check_slash_access(source, command) at the quick_commands dispatch site (the single exec chokepoint, cold-path only) using the raw typed name. Admins and users with the command in user_allowed_commands still run it; backward-compat (no policy set) is unaffected. Fixes #44727. Co-authored-by: maxpetrusenko <max.petrusenko.agent@gmail.com> Co-authored-by: zapabob <1920071390@campus.ouj.ac.jp>	2026-06-28 02:43:23 -07:00
Teknium	00d8c2c915	fix(gateway): prune stale sessions.json entries on startup A hard gateway crash (exit code 1) skips the graceful shutdown path, so sessions.json is never cleared and is left pointing at sessions already ended in state.db. On the next startup get_or_create_session() reuses those stale entries as long as the time/policy reset checks pass — it never consults end_reason — so every incoming message is silently routed into a closed session, with no log or error (#52804). SessionStore._ensure_loaded_locked() now calls a new _prune_stale_sessions_locked() that drops any entry whose session_id has end_reason IS NOT NULL in state.db. Idempotent, _db=None / legacy-absent safe, DB errors non-fatal, sessions.json rewritten only when something was pruned. Self-heals into a fresh session on the next message. Reported and diagnosed by @terry197913 (#52808).	2026-06-28 02:41:47 -07:00
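A rough sketch of the pruning rule described above, using an in-memory mapping in place of sessions.json. The table name `sessions`, its `id` column, and the function name are assumptions for illustration; only the `end_reason IS NOT NULL` criterion and the "DB errors non-fatal" behavior come from the commit message.

```python
import sqlite3


def prune_stale(entries: dict[str, dict], db: sqlite3.Connection) -> dict[str, dict]:
    """Drop entries whose session has already ended according to the database."""
    kept: dict[str, dict] = {}
    for key, entry in entries.items():
        try:
            ended = db.execute(
                # Assumed schema: sessions(id, end_reason)
                "SELECT 1 FROM sessions WHERE id = ? AND end_reason IS NOT NULL",
                (entry["session_id"],),
            ).fetchone()
        except sqlite3.Error:
            ended = None  # DB errors are non-fatal: keep the entry
        if ended is None:
            kept[key] = entry
    return kept
```

Running it twice yields the same result, and a caller can compare `len(kept)` with `len(entries)` to decide whether the file needs rewriting at all.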
teknium1	ea5aaa7a22	fix(gateway): offload remaining inline agent cleanup off the event loop (#53175) #35994 moved /new reset cleanup off the loop, but _cleanup_agent_resources (agent.close() subprocess teardown; shutdown_memory_provider() plugin IO) was still called INLINE on the event loop from three other sites: - _session_expiry_watcher (5-min idle sweep): live loop - _handle_message_with_agent cache-hygiene re-eviction: live loop - _finalize_shutdown_agents / stop() idle-cache loop: shutdown A wedged memory provider on any of these froze the loop: bot goes silent, runtime-status updated_at heartbeat stops advancing, and SIGTERM can't be serviced (requires kill -9), exactly the #53175 zombie pattern. Adds _cleanup_agent_resources_off_loop: a bounded (30s) worker-thread offload mirroring the #35994 reset fix, and routes all four sites through it.	2026-06-28 02:41:36 -07:00
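The bounded worker-thread offload described above might be approximated with standard asyncio primitives as in this sketch. The function name is hypothetical and the repository's `_cleanup_agent_resources_off_loop` may be built differently; only the 30-second bound and the "run blocking cleanup off the event loop" idea come from the message.

```python
import asyncio
import time
from typing import Callable


async def cleanup_off_loop(cleanup: Callable[[], None], timeout: float = 30.0) -> bool:
    """Run blocking cleanup in a worker thread; give up waiting after `timeout`.

    Returns True if cleanup finished in time, False if the wait timed out.
    On timeout the thread itself keeps running (threads cannot be cancelled),
    but the event loop is free to keep serving messages and signals.
    """
    try:
        await asyncio.wait_for(asyncio.to_thread(cleanup), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        return False


def slow_cleanup() -> None:
    """Stand-in for a wedged provider shutdown (illustration only)."""
    time.sleep(0.3)
```

For example, `asyncio.run(cleanup_off_loop(slow_cleanup, timeout=0.05))` stops waiting after the short timeout instead of blocking the loop for the full sleep.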
liuhao1024	14baeefe1d	fix(matrix): record DM rooms in m.direct on invite to prevent group misclassification Rebase onto plugins/platforms/matrix/adapter.py (code moved from gateway/platforms/matrix.py). Same logic: _on_invite checks is_direct on invite events and calls _record_dm_room to persist in m.direct account data. Fixes #44679	2026-06-28 02:37:52 -07:00
LeonSGP43	9f0e64cedd	fix(gateway): force exit after graceful shutdown Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-06-28 02:34:23 -07:00
yungchentang	7e2ca7f68d	fix(telegram): reset send pool after pool timeouts	2026-06-28 02:34:17 -07:00
teknium1	c23f394eb8	fix: satisfy ruff encoding + windows-footgun lints for cgroup reaper - read_text(encoding='utf-8') (PLW1514) - # windows-footgun: ok on signal.SIGKILL — module is Linux-only (reads /proc, /sys/fs/cgroup; runs from a systemd unit) - test lambda accepts the new encoding kwarg	2026-06-28 02:05:50 -07:00
PRATHAMESH75	e551da6ddb	fix(gateway): reap cgroup orphans via ExecStopPost to unblock restart Long-lived helpers spawned indirectly by tool calls (adb, platform bridges) were left in the service cgroup after the gateway's main process exited. When the kernel rejected the deferred cgroup-wide kill with EINVAL, systemd blocked Restart=always for 6+ minutes, taking down all platforms and cron windows (#37454). Add a small ExecStopPost helper (gateway.cgroup_cleanup) that walks cgroup.procs and sends per-PID SIGKILLs — a different kernel code path than cgroup.kill, so it succeeds where the cgroup-wide write failed. KillMode=mixed is preserved so the gateway still reaps its own tool-call children before systemd intervenes (#8202). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-28 02:05:50 -07:00
teknium1	58c36b1798	fix(api-server): widen error redaction to cron-endpoint + SSE sites Follow-up to the salvaged #37733 fix. The contributor centralized redaction at _openai_error and the chat/responses failure paths, which covers the OpenAI-compatible envelopes transitively. Two sibling classes crossed the same authenticated HTTP boundary unredacted: - 8x cron-management endpoints returning {"error": str(e)} on 500 - the session-chat SSE error event ({"message": str(exc)}) Route both through the same _redact_api_error_text(force=True) helper. Add AUTHOR_MAP entry for coygeek and a TestRedactApiErrorText guard covering mask/force/limit/passthrough behavior.	2026-06-28 02:05:38 -07:00

1588 commits