hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-04-26 01:01:40 +00:00

Author	SHA1	Message	Date
Dylan Socolobsky	11369a78f9	fix(telegram): handle parentheses in URLs during MarkdownV2 link conversion The link regex in format_message used [^)]+ for the URL portion, which stopped at the first ) character. URLs with nested parentheses (e.g. Wikipedia links like Python_(programming_language)) were improperly parsed. Use a better regex, which is the same the Slack adapter uses.	2026-04-20 14:56:04 -07:00
Teknium	b65f6ca7fe	fix(telegram): actionable error for DM topics when Topics mode not enabled (#13162 ) When createForumTopic fails with 'not a forum' in a private chat, the error now tells the user exactly what to do: enable Topics in the DM chat settings from the Telegram app. Also adds a Prerequisites callout to the docs explaining this client-side requirement before the config section.	2026-04-20 12:29:22 -07:00
MassiveMassimo	7972ff2a2c	feat(whatsapp): add dm_policy and group_policy parity with WeCom/Weixin/QQ adapters Add dm_policy and group_policy to the WhatsApp adapter, bringing parity with WeCom/Weixin/QQ. Allows independent control of DM and group access: disable DMs entirely, allowlist specific senders/groups, or keep open. - dm_policy: open (default) \| allowlist \| disabled - group_policy: open (default) \| allowlist \| disabled - Config bridging for YAML → env vars - 22 tests covering all policy combinations Backward compatible — defaults preserve existing behavior. Cherry-picked from PR #11597 by @MassiveMassimo. Dropped the run.py group auth bypass (would have skipped user auth for ALL platforms, not just WhatsApp).	2026-04-20 11:56:19 -07:00
Teknium	c86915024e	fix(cron): run due jobs in parallel to prevent serial tick starvation (#13021 ) Replaces the serial for-loop in tick() with ThreadPoolExecutor so all jobs due in a single tick run concurrently. A slow job no longer blocks others from executing, fixing silent job skipping (issue #9086). Thread safety: - Session/delivery env vars migrated from os.environ to ContextVars (gateway/session_context.py) so parallel jobs can't clobber each other's delivery targets. Each thread gets its own copied context. - jobs.json read-modify-write cycles (advance_next_run, mark_job_run) protected by threading.Lock to prevent concurrent save clobber. - send_message_tool reads delivery vars via get_session_env() for ContextVar-aware resolution with os.environ fallback. Configuration: - cron.max_parallel_jobs in config.yaml (null = unbounded, 1 = serial) - HERMES_CRON_MAX_PARALLEL env var override Based on PR #9169 by @VenomMoth1. Fixes #9086	2026-04-20 11:53:07 -07:00
Ruzzgar	0613f10def	fix(gateway): use persisted session origin for shutdown notifications Prefer session_store origin over _parse_session_key() for shutdown notifications. Fixes misrouting when chat identifiers contain colons (e.g. Matrix room IDs like !room123:example.org). Falls back to session-key parsing when no persisted origin exists. Co-authored-by: Ruzzgar <ruzzgarcn@gmail.com> Ref: #12766	2026-04-20 05:15:54 -07:00
JP Lew	9fdfb09aed	fix(telegram): cache inbound videos and accept mp4 uploads	2026-04-20 05:10:23 -07:00
sprmn24	ed76185c15	feat(whatsapp): implement send_voice for audio message delivery WhatsApp already receives incoming voice messages (audio/ogg via the bridge) but lacked a send_voice implementation, so TTS and audio responses fell back to the base class send_image path instead of being delivered as native audio messages. Route send_voice through the existing _send_media_to_bridge helper with media_type='audio', matching the pattern used by send_video and send_document.	2026-04-20 05:00:30 -07:00
Teknium	f683132c1d	feat(api-server): inline image inputs on /v1/chat/completions and /v1/responses (#12969 ) OpenAI-compatible clients (Open WebUI, LobeChat, etc.) can now send vision requests to the API server. Both endpoints accept the canonical OpenAI multimodal shape: Chat Completions: {type: text\|image_url, image_url: {url, detail?}} Responses: {type: input_text\|input_image, image_url: <str>, detail?} The server validates and converts both into a single internal shape that the existing agent pipeline already handles (Anthropic adapter converts, OpenAI-wire providers pass through). Remote http(s) URLs and data:image/* URLs are supported. Uploaded files (file, input_file, file_id) and non-image data: URLs are rejected with 400 unsupported_content_type. Changes: - gateway/platforms/api_server.py - _normalize_multimodal_content(): validates + normalizes both Chat and Responses content shapes. Returns a plain string for text-only content (preserves prompt-cache behavior on existing callers) or a canonical [{type:text\|image_url,...}] list when images are present. - _content_has_visible_payload(): replaces the bare truthy check so a user turn with only an image no longer rejects as 'No user message'. - _handle_chat_completions and _handle_responses both call the new helper for user/assistant content; system messages continue to flatten to text. - Codex conversation_history, input[], and inline history paths all share the same validator. No duplicated normalizers. - run_agent.py - _summarize_user_message_for_log(): produces a short string summary ('[1 image] describe this') from list content for logging, spinner previews, and trajectory writes. Fixes AttributeError when list user_message hit user_message[:80] + '...' / .replace(). - _chat_content_to_responses_parts(): module-level helper that converts chat-style multimodal content to Responses 'input_text'/'input_image' parts. Used in _chat_messages_to_responses_input for Codex routing. - _preflight_codex_input_items() now validates and passes through list content parts for user/assistant messages instead of stringifying. - tests/gateway/test_api_server_multimodal.py (new, 38 tests) - Unit coverage for _normalize_multimodal_content, including both part formats, data URL gating, and all reject paths. - Real aiohttp HTTP integration on /v1/chat/completions and /v1/responses verifying multimodal payloads reach _run_agent intact. - 400 coverage for file / input_file / non-image data URL. - tests/run_agent/test_run_agent_multimodal_prologue.py (new) - Regression coverage for the prologue no-crash contract. - _chat_content_to_responses_parts round-trip coverage. - website/docs/user-guide/features/api-server.md - Inline image examples for both endpoints. - Updated Limitations: files still unsupported, images now supported. Validated live against openrouter/anthropic/claude-opus-4.6: POST /v1/chat/completions → 200, vision-accurate description POST /v1/responses → 200, same image, clean output_text POST /v1/chat/completions [file] → 400 unsupported_content_type POST /v1/responses [input_file] → 400 unsupported_content_type POST /v1/responses [non-image data URL] → 400 unsupported_content_type Closes #5621, #8253, #4046, #6632. Co-authored-by: Paul Bergeron <paul@gamma.app> Co-authored-by: zhangxicen <zhangxicen@example.com> Co-authored-by: Manuel Schipper <manuelschipper@users.noreply.github.com> Co-authored-by: pradeep7127 <pradeep7127@users.noreply.github.com>	2026-04-20 04:16:13 -07:00
haileymarshall	6b408e131c	fix(gateway): pass session_key (not session_id) to active-process check during prune SessionStore.prune_old_entries was calling self._has_active_processes_fn(entry.session_id) but the callback wired up in gateway/run.py is process_registry.has_active_for_session, which compares against session_key, not session_id. Every other caller in session.py (_is_session_expired, _should_reset) already passes session_key, so prune was the only outlier — and because session_id and session_key live in different namespaces, the guard never fired. Result in production: sessions with live background processes (queued cron output, detached agents, long-running Bash) were pruned out of _entries despite the docstring promising they'd be preserved. When the process finished and tried to deliver output, the session_key to session_id mapping was gone and the work was effectively orphaned. Also update the existing test_prune_skips_entries_with_active_processes, which was checking the wrong interface (its mock callback took session_id so it agreed with the buggy implementation). The test now uses a session_key-based mock, matching the production callback's real contract, and a new regression guard test pins the behaviour. Swallowed exceptions inside the prune loop now log at debug level instead of silently disappearing.	2026-04-20 03:10:19 -07:00
Teknium	eba7c869bb	fix(steer): drain /steer between individual tool calls, not at batch end (#12959 ) Previously, /steer text was only injected after an entire tool batch completed (_execute_tool_calls_sequential/concurrent returned). If the batch had a long-running tool (delegate_task, terminal build), the steer waited for ALL tools to finish before landing — functionally identical to /queue from the user's perspective. Now _apply_pending_steer_to_tool_results() is called after EACH individual tool result is appended to messages, in both the sequential and concurrent paths. A steer arriving during Tool 1 lands in Tool 1's result before Tool 2 starts executing. Also handles leftover steers in the gateway: if a steer arrives during the final API call (no tool batch to drain into), it's now delivered as the next user turn instead of being silently dropped. Fixes user report from Utku.	2026-04-20 03:08:04 -07:00
elkimek	afd08b76c5	fix(gateway): run /yolo and /verbose mid-agent instead of rejecting them /yolo and /verbose are safe to dispatch while an agent is running: /yolo can unblock a pending approval prompt, /verbose cycles the tool-progress display for the ongoing stream. Both modify session state without needing agent interaction. Previously they fell through to the running-agent catch-all (PR #12334) and returned the generic busy message. /fast and /reasoning stay on the catch-all — their handlers explicitly say 'takes effect on next message', so nothing is gained by dispatching them mid-turn. Salvaged from #10116 (elkimek), scoped down.	2026-04-20 03:03:07 -07:00
Roy-oss1	520edd3499	feat(feishu): show processing state via reactions on user messages Replaces the permanent "OK" receipt reaction with a 3-phase visual lifecycle: - Typing animation appears when the agent starts processing. - Cleared when processing succeeds — the reply message is the signal. - Replaced with CrossMark when processing fails. - Cleared when processing is cancelled or interrupted. When Feishu rejects the reaction-delete call, we keep the Typing in place and skip adding CrossMark. Showing both at once would leave the user seeing both "still working" and "done/failed" simultaneously, which is worse than a stuck Typing. A FEISHU_REACTIONS env var (default on) disables the whole lifecycle. User-added reactions with the same emoji still route through to the agent; only bot-origin reactions are filtered to break the feedback loop. Change-Id: I527081da31f0f9d59b451f45de59df4ddab522ba	2026-04-20 02:04:57 -07:00
Ruzzgar	f23123e7b4	fix(gateway): prevent scoped lock and resource leaks on connection failure	2026-04-20 01:44:36 -07:00
Junass1	4c50b4689e	fix(gateway): make Telegram DM topic config writes atomic	2026-04-20 00:57:53 -07:00
helix4u	e96758291b	fix(signal): normalize direct recipients to UUIDs	2026-04-20 00:35:55 -07:00
Teknium	e330112aa8	refactor(telegram): use entity-only mention detection Replaces the word-boundary regex scan with pure MessageEntity-based detection. Telegram's server emits MENTION entities for real @username mentions and TEXT_MENTION entities for @FirstName mentions; the text- scanning fallback was both redundant (entities are always present for real mentions) and broken (matched raw substrings like email addresses, URLs, code-block contents, and forwarded literal text). Entity-only detection: - Closes bug #12545 ("foo@hermes_bot.example" false positive). - Also fixes edge cases the regex fix would still miss: @handles inside URLs and code blocks, where Telegram does not emit mention entities. Tests rewritten to exercise realistic Telegram payloads (real mentions carry entities; substring false positives don't).	2026-04-20 00:10:22 -07:00
Tranquil-Flow	1e18e0503f	fix(telegram): use word-boundary matching for bot mention detection (#12545 )	2026-04-20 00:10:22 -07:00
JackJin	6c0c625952	fix(gateway): accept finalize kwarg in all platform edit_message overrides stream_consumer._send_or_edit unconditionally passes finalize= to adapter.edit_message(), but only DingTalk's override accepted the kwarg. Streaming on Telegram/Discord/Slack/Matrix/Mattermost/Feishu/ WhatsApp raised TypeError the first time a segment break or final edit fired. The REQUIRES_EDIT_FINALIZE capability flag only gates the redundant final edit (and the identical-text short-circuit), not the kwarg itself — so adapters that opt out of finalize still receive the keyword argument and must accept it. Add *, finalize: bool = False to the 7 non-DingTalk signatures; the body ignores the arg since those platforms treat edits as stateless (consistent with the base class contract in base.py). Add a parametrized signature check over every concrete adapter class so a future override cannot silently drop the kwarg — existing tests use MagicMock which swallows any kwarg and cannot catch this. Fixes #12579	2026-04-19 22:46:47 -07:00
Tranquil-Flow	6a228d52f7	fix(webhook): validate HMAC signature before rate limiting (#12544 )	2026-04-19 22:45:08 -07:00
Teknium	491cf25eef	test(voice): update existing voice_mode tests for platform-prefixed keys Follow-up to `40164ba1`. - _handle_voice_channel_join/leave now use event.source.platform instead of hardcoded Platform.DISCORD (consistent with other voice handlers). - Update tests/gateway/test_voice_command.py to use 'platform:chat_id' keys matching the new _voice_key() format. - Add platform isolation regression test for the bug in #12542. - Drop decorative test_legacy_key_collision_bug (the fix makes the collision impossible; the test mutated a single key twice, not a real scenario). - Adapter mocks in _sync_voice_mode_state_to_adapter tests now set adapter.platform = Platform.* (required by new isinstance check).	2026-04-19 22:36:00 -07:00
Tranquil-Flow	52a972e927	fix(gateway): namespace voice mode state by platform to prevent cross-platform collision (#12542 )	2026-04-19 22:36:00 -07:00
Teknium	1ee3b79f1d	fix(gateway): include QQBOT in allowlist-aware unauthorized DM map Follow-up to #9337: _is_user_authorized maps Platform.QQBOT to QQ_ALLOWED_USERS, but the new platform_env_map inside _get_unauthorized_dm_behavior omitted it. A QQ operator with a strict user allowlist would therefore still have the gateway send pairing codes to strangers. Adds QQBOT to the env map and a regression test.	2026-04-19 22:16:37 -07:00
draix	7282652655	fix(gateway): silence pairing codes when a user allowlist is configured (#9337 ) When SIGNAL_ALLOWED_USERS (or any platform-specific or global allowlist) is set, the gateway was still sending automated pairing-code messages to every unauthorized sender. This forced pairing-code spam onto personal contacts of anyone running Hermes on a primary personal account with a whitelist, and exposed information about the bot's existence. Root cause ---------- _get_unauthorized_dm_behavior() fell through to the global default ('pair') even when an explicit allowlist was configured. An allowlist signals that the operator has deliberately restricted access; offering pairing codes to unknown senders contradicts that intent. Fix --- Extend _get_unauthorized_dm_behavior() to inspect the active per-platform and global allowlist env vars. When any allowlist is set and the operator has not written an explicit per-platform unauthorized_dm_behavior override, the method now returns 'ignore' instead of 'pair'. Resolution order (highest → lowest priority): 1. Explicit per-platform unauthorized_dm_behavior in config — always wins. 2. Explicit global unauthorized_dm_behavior != 'pair' in config — wins. 3. Any platform or global allowlist env var present → 'ignore'. 4. No allowlist, no override → 'pair' (open-gateway default preserved). This fixes the spam for Signal, Telegram, WhatsApp, Slack, and all other platforms with per-platform allowlist env vars. Testing ------- 6 new tests added to tests/gateway/test_unauthorized_dm_behavior.py: - test_signal_with_allowlist_ignores_unauthorized_dm (primary #9337 case) - test_telegram_with_allowlist_ignores_unauthorized_dm (same for Telegram) - test_global_allowlist_ignores_unauthorized_dm (GATEWAY_ALLOWED_USERS) - test_no_allowlist_still_pairs_by_default (open-gateway regression guard) - test_explicit_pair_config_overrides_allowlist_default (operator opt-in) - test_get_unauthorized_dm_behavior_no_allowlist_returns_pair (unit) All 15 tests in the file pass. Fixes #9337	2026-04-19 22:16:37 -07:00
Teknium	424e9f36b0	refactor: remove smart_model_routing feature (#12732 ) Smart model routing (auto-routing short/simple turns to a cheap model across providers) was opt-in and disabled by default. This removes the feature wholesale: the routing module, its config keys, docs, tests, and the orchestration scaffolding it required in cli.py / gateway/run.py / cron/scheduler.py. The /fast (Priority Processing / Anthropic fast mode) feature kept its hooks into _resolve_turn_agent_config — those still build a route dict and attach request_overrides when the model supports it; the route now just always uses the session's primary model/provider rather than running prompts through choose_cheap_model_route() first. Also removed: - DEFAULT_CONFIG['smart_model_routing'] block and matching commented-out example sections in hermes_cli/config.py and cli-config.yaml.example - _load_smart_model_routing() / self._smart_model_routing on GatewayRunner - self._smart_model_routing / self._active_agent_route_signature on HermesCLI (signature kept; just no longer initialised through the smart-routing pipeline) - route_label parameter on HermesCLI._init_agent (only set by smart routing; never read elsewhere) - 'Smart Model Routing' section in website/docs/integrations/providers.md - tip in hermes_cli/tips.py - entries in hermes_cli/dump.py + hermes_cli/web_server.py - row in skills/autonomous-ai-agents/hermes-agent/SKILL.md Tests: - Deleted tests/agent/test_smart_model_routing.py - Rewrote tests/agent/test_credential_pool_routing.py to target the simplified _resolve_turn_agent_config directly (preserves credential pool propagation + 429 rotation coverage) - Dropped 'cheap model' test from test_cli_provider_resolution.py - Dropped resolve_turn_route patches from cli + gateway test_fast_command — they now exercise the real method end-to-end - Removed _smart_model_routing stub assignments from gateway/cron test helpers Targeted suites: 74/74 in the directly affected test files; tests/agent + tests/cron + tests/cli pass except 5 failures that already exist on main (cron silent-delivery + alias quick-command).	2026-04-19 18:12:55 -07:00
Teknium	014248567b	fix(feishu): hydrate bot open_id for manual-setup users Extends _hydrate_bot_identity() to also populate _bot_open_id (not just _bot_name) by probing /open-apis/bot/v3/info — the same endpoint the scan-to-create wizard uses. No extra scopes required beyond the tenant access token. Closes the manual-setup gap in #12450: users who configured Feishu without running the wizard, and never set FEISHU_BOT_OPEN_ID, now get a bot identity that _is_self_sent_bot_message() can actually use to filter the adapter's own bot-sent events. Each field is hydrated independently: - Env vars (FEISHU_BOT_OPEN_ID / FEISHU_BOT_USER_ID / FEISHU_BOT_NAME) still take precedence and skip their respective probe. - /bot/v3/info provides open_id + name. - Application-info endpoint remains as a best-effort fallback for bot_name only (needs admin:app.info:readonly scope). Tests: 5 new cases covering env-var precedence, probe success, probe failure fallback, and the end-to-end self-send filter gate after hydration.	2026-04-19 11:36:04 -07:00
Bingo	2d54e17b82	fix(feishu): allow bot-originated mentions from other bots	2026-04-19 11:36:04 -07:00
Teknium	7e3b356574	refactor(discord): slim down the race-polish fix (#12644 ) PR #12558 was heavy for what the fix actually is — essay-length comments, a dedicated helper method where a setdefault would do, and a source-inspection test with no real behavior coverage. The genuine code change is ~5 lines of new logic (1 field, 2 async with, an on_ready wait block). Trimmed: - Replaced the 12-line _voice_lock_for helper with a setdefault one-liner at each call site (join_voice_channel, leave_voice_channel). - Collapsed the 12-line comment on on_message's _ready_event wait to 3 lines. Dropped the warning log on timeout — pass-on-timeout is fine; if on_ready hangs that long, the bot is already broken and the log wouldn't help. - Dropped the source-inspection test (greps the module source for expected substrings). It was low-value scaffolding; the voice-serialization test covers actual behavior. Net: -73 lines vs PR #12558. Same two guarantees preserved, same test passes (verified by stashing the fix and confirming failure).	2026-04-19 11:08:10 -07:00
Teknium	a521005fe5	fix(discord): close two low-severity adapter races (#12558 ) Two small races in gateway/platforms/discord.py, bundled together since they're adjacent in the adapter and both narrow in impact. 1. on_message vs _resolve_allowed_usernames (startup window) DISCORD_ALLOWED_USERS accepts both numeric IDs and raw usernames. At connect-time, _resolve_allowed_usernames walks the bot's guilds (fetch_members can take multiple seconds) to swap usernames for IDs. on_message can fire during that window; _is_allowed_user compares the numeric author.id against a set that may still contain raw usernames — legitimate users get silently rejected for a few seconds after every reconnect. Fix: on_message awaits _ready_event (with a 30s timeout) when it isn't already set. on_ready sets the event after the resolve completes. In steady state this is a no-op (event already set); only the startup / reconnect window ever blocks. 2. join_voice_channel check-and-connect The existing-connection check at _voice_clients.get() and the channel.connect() call straddled an await boundary with no lock. Two concurrent /voice channel invocations could both see None and both call connect(); discord.py raises ClientException ("Already connected") on the loser. Same race class for leave running concurrently with _voice_timeout_handler. Fix: per-guild asyncio.Lock (_voice_locks dict with lazy alloc via _voice_lock_for). join_voice_channel and leave_voice_channel both run their body under the lock. Sequential within a guild, still fully concurrent across guilds. Both: LOW severity. The first only affects username-based allowlists on fast-follow-up messages at startup; the second is a narrow exception on simultaneous voice commands. Bundled so the adapter gets a single coherent polish pass. Tests (tests/gateway/test_discord_race_polish.py): 2 regression cases. - test_concurrent_joins_do_not_double_connect: two concurrent join_voice_channel calls on the same guild result in exactly one channel.connect() invocation. - test_on_message_blocks_until_ready_event_set: asserts the expected wait pattern is present in on_message (source inspection, since full discord.py client setup isn't practical here). Regression-guard validated: against unpatched gateway/platforms/discord.py both tests fail. With the fix they pass. Full Discord suite (118 tests) green.	2026-04-19 05:45:59 -07:00
Teknium	206a449b29	feat(webhook): direct delivery mode for zero-LLM push notifications (#12473 ) External services can now push plain-text notifications to a user's chat via the webhook adapter without invoking the agent. Set deliver_only=true on a route and the rendered prompt template becomes the literal message body — dispatched directly to the configured target (Telegram, Discord, Slack, GitHub PR comment, etc.). Reuses all existing webhook infrastructure: HMAC-SHA256 signature validation, per-route rate limiting, idempotency cache, body-size limits, template rendering with dot-notation, home-channel fallback. No new HTTP server, no new auth scheme, no new port. Use cases: Supabase/Firebase webhooks → user notifications, monitoring alert forwarding, inter-agent pings, background job completion alerts. Changes: - gateway/platforms/webhook.py: new _direct_deliver() helper + early dispatch branch in _handle_webhook when deliver_only=true. Startup validation rejects deliver_only with deliver=log. - hermes_cli/main.py + hermes_cli/webhook.go: --deliver-only flag on subscribe; list/show output marks direct-delivery routes. - website/docs/user-guide/messaging/webhooks.md: new Direct Delivery Mode section with config example, CLI example, response codes. - skills/devops/webhook-subscriptions/SKILL.md: document --deliver-only with use cases (bumped to v1.1.0). - tests/gateway/test_webhook_deliver_only.py: 14 new tests covering agent bypass, template rendering, status codes, HMAC still enforced, idempotency still applies, rate limit still applies, startup validation, and direct-deliver dispatch. Validation: 78 webhook tests pass (64 existing + 14 new). E2E verified with real aiohttp server + real urllib POST — agent not invoked, target adapter.send() called with rendered template, duplicate delivery_id suppressed. Closes the gap identified in PR #12117 (thanks to @H1an1 / Antenna team) without adding a second HTTP ingress server.	2026-04-19 05:18:19 -07:00
kshitijk4poor	957ca79e8e	fix(feishu): drop dead helper and cover repeated fenced blocks	2026-04-19 03:30:36 -07:00
kshitijk4poor	a9debf10ff	fix(feishu): harden fenced post row splitting	2026-04-19 03:30:36 -07:00
sgaofen	cc59d133dc	fix(feishu): split fenced code blocks in post payload	2026-04-19 03:30:36 -07:00
kshitijk4poor	4b6ff0eb7f	fix: tighten gateway interrupt salvage follow-ups Follow-up on top of the helix4u #12388 cherry-picks: - make deferred post-delivery callbacks generation-aware end-to-end so stale runs cannot clear callbacks registered by a fresher run for the same session - bind callback ownership to the active session event at run start and snapshot that generation inside base adapter processing so later event mutation cannot retarget cleanup - pass run_generation through proxy mode and drop stale proxy streams / final results the same way local runs are dropped - centralize stop/new interrupt cleanup into one helper and replace the open-coded branches with shared logic - unify internal control interrupt reason strings via shared constants - remove the return from base.py's finally block so cleanup no longer swallows cancellation/exception flow - add focused regressions for generation forwarding, proxy stale suppression, and newer-callback preservation This addresses all review findings from the initial #12388 review while keeping the fix scoped to stale-output/typing-loop interrupt handling.	2026-04-19 03:03:57 -07:00
helix4u	8466268ca5	fix(gateway): keep typing loop overrides backward-compatible	2026-04-19 03:03:57 -07:00
helix4u	150382e8b7	fix(gateway): stop typing loops on session interrupt	2026-04-19 03:03:57 -07:00
kshitijk4poor	ff63e2e005	fix: tighten telegram docker-media salvage follow-ups Follow-up on top of the helix4u #6392 cherry-pick: - reuse one helper for actionable Docker-local file-not-found errors across document/image/video/audio local-media send paths - include /outputs/... alongside /output/... in the container-local path hint - soften the gateway startup warning so it does not imply custom host-visible mounts are broken; the warning now targets the specific risky pattern of emitting container-local MEDIA paths without an explicit export mount - add focused regressions for /outputs/... and non-document media hint coverage This keeps the salvage aligned with the actual MEDIA delivery problem on current main while reducing false-positive operator messaging.	2026-04-19 01:55:33 -07:00
helix4u	588333908c	fix(telegram): warn on docker-only media paths	2026-04-19 01:55:33 -07:00
Tranquil-Flow	b668c09ab2	fix(gateway): strip cursor from frozen message on empty fallback continuation (#7183 ) When _send_fallback_final() is called with nothing new to deliver (the visible partial already matches final_text), the last edit may still show the cursor character because fallback mode was entered after a failed edit. Before this fix the early-return path left _already_sent = True without attempting to strip the cursor, so the message stayed frozen with a visible ▉ permanently. Adds a best-effort edit inside the empty-continuation branch to clean the cursor off the last-sent text. Harmless when fallback mode wasn't actually armed or when the cursor isn't present. If the strip edit itself fails (flood still active), we return without crashing and without corrupting _last_sent_text. Adapted from PR #7429 onto current main — the surrounding fallback block grew the #10807 stale-prefix handling since #7429 was written, so the cursor strip lives in the new else-branch where we still return early. 3 unit tests covering: cursor stripped on empty continuation, no edit attempted when cursor is not configured, cursor-strip edit failure handled without crash. Originally proposed as PR #7429.	2026-04-19 01:51:12 -07:00
Teknium	62ce6a38ae	fix(gateway): cancel_background_tasks must drain late-arrivals (#12471 ) During gateway shutdown, a message arriving while cancel_background_tasks is mid-await (inside asyncio.gather) spawns a fresh _process_message_background task via handle_message and adds it to self._background_tasks. The original implementation's _background_tasks.clear() at the end of cancel_background_tasks dropped the reference; the task ran untracked against a disconnecting adapter, logged send-failures, and lingered until it completed on its own. Fix: wrap the cancel+gather in a bounded loop (MAX_DRAIN_ROUNDS=5). If new tasks appeared during the gather, cancel them in the next round. The .clear() at the end is preserved as a safety net for any task that appeared after MAX_DRAIN_ROUNDS — but in practice the drain stabilizes in 1-2 rounds. Tests: tests/gateway/test_cancel_background_drain.py — 3 cases. - test_cancel_background_tasks_drains_late_arrivals: spawn M1, start cancel, inject M2 during M1's shielded cleanup, verify M2 is cancelled. - test_cancel_background_tasks_handles_no_tasks: no-op path still terminates cleanly. - test_cancel_background_tasks_bounded_rounds: baseline — single task cancels in one round, loop terminates. Regression-guard validated: against the unpatched implementation, the late-arrival test fails with exactly the expected message ('task leaked'). With the fix it passes. Blast radius is shutdown-only; the audit classified this as MED. Shipping because the fix is small and the hygiene is worth it. While investigating the audit's other MEDs (busy-handler double-ack, Discord ExecApprovalView double-resolve, UpdatePromptView double-resolve), I verified all three were false positives — the check-and-set patterns have no await between them, so they're atomic on single-threaded asyncio. No fix needed for those.	2026-04-19 01:48:42 -07:00
konsisumer	1d1e1277e4	fix(gateway): flush undelivered tail before segment reset to preserve streamed text (#8124 ) When a streaming edit fails mid-stream (flood control, transport error) and a tool boundary arrives before the fallback threshold is reached, the pre-boundary tail in `_accumulated` was silently discarded by `_reset_segment_state`. The user saw a frozen partial message and missing words on the other side of the tool call. Flush the undelivered tail as a continuation message before the reset, computed relative to the last successfully-delivered prefix so we don't duplicate content the user already saw.	2026-04-19 01:43:04 -07:00
Teknium	7c10761dd2	fix(discord): shield text-batch flush from follow-up cancel (#12444 ) When Discord splits a long message at 2000 chars, _enqueue_text_event buffers each chunk and schedules a _flush_text_batch task with a short delay. If another chunk lands while the prior flush task is already inside handle_message, _enqueue_text_event calls prior_task.cancel() — and without asyncio.shield, CancelledError propagates from the flush task into handle_message → the agent's streaming request, aborting the response the user was waiting on. Reproducer: user sends a 3000-char prompt (split by Discord into 2 messages). Chunk 1 lands, flush delay starts, chunk 2 lands during the brief window when chunk 1's flush has already committed to handle_message. Agent's current streaming response is cancelled with CancelledError, user sees a truncated or missing reply. Fix (gateway/platforms/discord.py): - Wrap the handle_message call in asyncio.shield so the inner dispatch is protected from the outer task's cancel. - Add an except asyncio.CancelledError clause so the outer task still exits cleanly when cancel lands during the sleep window (before the pop) — semantics for that path are unchanged. The new flush task spawned by the follow-up chunk still handles its own batch via the normal pending-message / active-session machinery in base.py, so follow-ups are not lost. Tests: tests/gateway/test_text_batching.py — test_shield_protects_handle_message_from_cancel. Tracks a distinct first_handle_cancelled event so the assertion fails cleanly when the shield is missing (verified by stashing the fix and re-running). Live E2E on the live-loaded DiscordAdapter: first_handle_cancelled: False (shield worked) first_handle_completed: True (handle_message ran to completion)	2026-04-19 00:09:38 -07:00
Teknium	3a6351454b	fix(gateway): close pending-drain and late-arrival races in base adapter (#12371 ) Two related race conditions in gateway/platforms/base.py that could produce duplicate agent runs or silently drop messages. Neither is specific to any one platform — all adapters inherit this logic. R5 (HIGH) — duplicate agent spawn on turn chain In _process_message_background, the pending-drain path deleted _active_sessions[session_key] before awaiting typing_task.cancel() and then recursively awaiting _process_message_background for the queued event. During the typing_task await, a fresh inbound message M3 could pass the Level-1 guard (entry now missing), set its own Event, and spawn a second _process_message_background for the same session_key — two agents running simultaneously, duplicate responses, duplicate tool calls. Fix: keep the _active_sessions entry populated and only clear() the Event. The guard stays live, so any concurrent inbound message takes the busy-handler path (queue + interrupt) as intended. R6 (MED-HIGH) — message dropped during finally cleanup The finally block has two await points (typing_task, stop_typing) before it deletes _active_sessions. A message arriving in that window passes the guard (entry still live), lands in _pending_messages via the busy-handler — and then the unconditional del removes the guard with that message still queued. Nothing drains it; the user never gets a reply. Fix: before deleting _active_sessions in finally, pop any late pending_messages entry and spawn a drain task for it. Only delete _active_sessions when no pending is waiting. Tests: tests/gateway/test_pending_drain_race.py — three regression cases. Validated: without the fix, two of the three fail exactly where the races manifest (duplicate-spawn guard loses identity, late-arrival 'LATE' message not in processed list).	2026-04-18 19:32:26 -07:00
Teknium	beabbd87ef	fix(gateway): close adapter resources when connect() fails or raises (#12339 ) Gateway startup leaks aiohttp.ClientSession (and other partial-init resources) when an adapter's connect() returns False or raises. The adapter is never added to self.adapters, so the shutdown path at gateway/run.py:2426 never calls disconnect() on it — Python GC later logs 'Unclosed client session' at process exit. Seen on 2026-04-18 18:08:16 during a double --replace takeover cycle: one of the partial-init sessions survived past shutdown and emitted the warning right before status=75/TEMPFAIL. Fix: - New GatewayRunner._safe_adapter_disconnect() helper — calls adapter.disconnect() and swallows any exception. Used on error paths. - Connect loop calls it in both failure branches: success=False and except Exception. - Adapter disconnect() implementations are already expected to be idempotent and tolerate partial-init state (they all guard on self._http_session / self._bridge_process before touching them). Tests: tests/gateway/test_safe_adapter_disconnect.py — 3 cases verify the helper forwards to disconnect, swallows exceptions, and tolerates platform=None.	2026-04-18 18:53:31 -07:00
Teknium	632a807a3e	fix(gateway): slash commands never interrupt a running agent (#12334 ) Any recognized slash command now bypasses the Level-1 active-session guard instead of queueing + interrupting. A mid-run /model (or /reasoning, /voice, /insights, /title, /resume, /retry, /undo, /compress, /usage, /provider, /reload-mcp, /sethome, /reset) used to interrupt the agent AND get silently discarded by the slash-command safety net — zero-char response, dropped tool calls. Root cause: - Discord registers 41 native slash commands via tree.command(). - Only 14 were in ACTIVE_SESSION_BYPASS_COMMANDS. - The other ~15 user-facing ones fell through base.py:handle_message to the busy-session handler, which calls running_agent.interrupt() AND queues the text. - After the aborted run, gateway/run.py:9912 correctly identifies the queued text as a slash command and discards it — but the damage (interrupt + zero-char response) already happened. Fix: - should_bypass_active_session() now returns True for any resolvable slash command. ACTIVE_SESSION_BYPASS_COMMANDS stays as the subset with dedicated Level-2 handlers (documentation + tests). - gateway/run.py adds a catch-all after the dedicated handlers that returns a user-visible "agent busy — wait or /stop first" response for any other resolvable command. - Unknown text / file-path-like messages are unchanged — they still queue. Also: - gateway/platforms/discord.py logs the invoker identity on every slash command (user id + name + channel + guild) so future ghost-command reports can be triaged without guessing. Tests: - 15 new parametrized cases in test_command_bypass_active_session.py cover every previously-broken Discord slash command. - Existing tests for /stop, /new, /approve, /deny, /help, /status, /agents, /background, /steer, /update, /queue still pass. - test_steer.py's ACTIVE_SESSION_BYPASS_COMMANDS check still passes. Fixes #5057. Related: #6252, #10370, #4665.	2026-04-18 18:53:22 -07:00
Nish	1a9a2d7fe8	fix(gateway/telegram): fall back to chat.id when from_user is None in DMs When `message.from_user` is None — which can happen for forwarded messages, anonymous admin mode in groups, or certain Telegram client edge cases — `_build_message_event` set `source.user_id` to None. This caused: 1. `_is_user_authorized()` to early-return False (`if not user_id: return False`) 2. The access check never compared against `TELEGRAM_ALLOWED_USERS` even when the user actually was in the allowlist 3. The pairing flow fired and generated a code for `user_id=None` 4. The pairing approval saved an entry under the literal string key "null" 5. The user was effectively locked out because their real user_id never matched the "null" key on subsequent messages For DMs (`chat_type == "dm"`), Telegram guarantees `chat.id == user.id` — they are the same numeric ID for private chats. Falling back to `chat.id` when `from_user` is None for DMs restores the expected access-control behavior without weakening it (group/channel chats correctly stay None). Also adds a parallel `user_name` fallback to `chat.full_name` so the display name still works in the same edge case.	2026-04-18 18:18:01 -07:00
Teknium	c49a58a6d0	fix(gateway): mark only still-running sessions resume_pending on drain timeout (#12332 ) Follow-up to #12301. The drain-timeout branch of _stop_impl() was iterating the drain-start snapshot (active_agents) when marking sessions resume_pending. That snapshot can include sessions that finished gracefully during the drain window — marking them would give their next turn a stray 'your previous turn was interrupted by a gateway restart' system note even though the prior turn actually completed cleanly. Iterate self._running_agents at timeout time instead, mirroring _interrupt_running_agents() exactly: - only sessions still blocking the shutdown get marked - pending sentinels (AIAgent construction not yet complete) are skipped Changes: - gateway/run.py: swap active_agents.keys() for filtered self._running_agents.items() iteration in the drain-timeout mark loop. - tests/gateway/test_restart_resume_pending.py: two regression tests — finisher-during-drain not marked, pending sentinel not marked.	2026-04-18 17:40:34 -07:00
Teknium	cb4addacab	fix(gateway): auto-resume sessions after drain-timeout restart (#11852 ) (#12301 ) The shutdown banner promised "send any message after restart to resume where you left off" but the code did the opposite: a drain-timeout restart skipped the .clean_shutdown marker, which made the next startup call suspend_recently_active(), which marked the session suspended, which made get_or_create_session() spawn a fresh session_id with a 'Session automatically reset. Use /resume...' notice — contradicting the banner. Introduce a resume_pending state on SessionEntry that is distinct from suspended. Drain-timeout shutdown flags active sessions resume_pending instead of letting startup-wide suspension destroy them. The next message on the same session_key preserves the session_id, reloads the transcript, and the agent receives a reason-aware restart-resume system note that subsumes the existing tool-tail auto-continue note (PR #9934). Terminal escalation still flows through the existing .restart_failure_counts stuck-loop counter (PR #7536, threshold 3) — no parallel counter on SessionEntry. suspended still wins over resume_pending in get_or_create_session() so genuinely stuck sessions converge to a clean slate. Spec: PR #11852 (BrennerSpear). Implementation follows the spec with the approved correction (reuse .restart_failure_counts rather than adding a resume_attempts field). Changes: - gateway/session.py: SessionEntry.resume_pending/resume_reason/ last_resume_marked_at + to_dict/from_dict; SessionStore .mark_resume_pending()/clear_resume_pending(); get_or_create_session() returns existing entry when resume_pending (suspended still wins); suspend_recently_active() skips resume_pending entries. - gateway/run.py: _stop_impl() drain-timeout branch marks active sessions resume_pending before _interrupt_running_agents(); _run_agent() injects reason-aware restart-resume system note that subsumes the tool-tail case; successful-turn cleanup also clears resume_pending next to _clear_restart_failure_count(); _notify_active_sessions_of_shutdown() softens the restart banner to 'I'll try to resume where you left off' (honest about stuck-loop escalation). - tests/gateway/test_restart_resume_pending.py: 29 new tests covering SessionEntry roundtrip, mark/clear helpers, get_or_create_session precedence (suspended > resume_pending), suspend_recently_active skip, drain-timeout mark reason (restart vs shutdown), system-note injection decision tree (including tool-tail subsumption), banner wording, and stuck-loop escalation override.	2026-04-18 17:32:17 -07:00
Teknium	2edebedc9e	feat(steer): /steer <prompt> injects a mid-run note after the next tool call (#12116 ) * feat(steer): /steer <prompt> injects a mid-run note after the next tool call Adds a new slash command that sits between /queue (turn boundary) and interrupt. /steer <text> stashes the message on the running agent and the agent loop appends it to the LAST tool result's content once the current tool batch finishes. The model sees it as part of the tool output on its next iteration. No interrupt is fired, no new user turn is inserted, and no prompt cache invalidation happens beyond the normal per-turn tool-result churn. Message-role alternation is preserved — we only modify an existing role:"tool" message's content. Wiring ------ - hermes_cli/commands.py: register /steer + add to ACTIVE_SESSION_BYPASS_COMMANDS. - run_agent.py: add _pending_steer state, AIAgent.steer(), _drain_pending_steer(), _apply_pending_steer_to_tool_results(); drain at end of both parallel and sequential tool executors; clear on interrupt; return leftover as result['pending_steer'] if the agent exits before another tool batch. - cli.py: /steer handler — route to agent.steer() when running, fall back to the regular queue otherwise; deliver result['pending_steer'] as next turn. - gateway/run.py: running-agent intercept calls running_agent.steer(); idle-agent path strips the prefix and forwards as a regular user message. - tui_gateway/server.py: new session.steer JSON-RPC method. - ui-tui: SessionSteerResponse type + local /steer slash command that calls session.steer when ui.busy, otherwise enqueues for the next turn. Fallbacks --------- - Agent exits mid-steer → surfaces in run_conversation result as pending_steer so CLI/gateway deliver it as the next user turn instead of silently dropping it. - All tools skipped after interrupt → re-stashes pending_steer for the caller. - No active agent → /steer reduces to sending the text as a normal message. Tests ----- - tests/run_agent/test_steer.py — accept/reject, concatenation, drain, last-tool-result injection, multimodal list content, thread safety, cleared-on-interrupt, registry membership, bypass-set membership. - tests/gateway/test_steer_command.py — running agent, pending sentinel, missing steer() method, rejected payload, empty payload. - tests/gateway/test_command_bypass_active_session.py — /steer bypasses the Level-1 base adapter guard. - tests/test_tui_gateway_server.py — session.steer RPC paths. 72/72 targeted tests pass under scripts/run_tests.sh. * feat(steer): register /steer in Discord's native slash tree Discord's app_commands tree is a curated subset of slash commands (not derived from COMMAND_REGISTRY like Telegram/Slack). /steer already works there as plain text (routes through handle_message → base adapter bypass → runner), but registering it here adds Discord's native autocomplete + argument hint UI so users can discover and type it like any other first-class command.	2026-04-18 04:17:18 -07:00
Teknium	9527707f80	fix(signal): back off sendTyping spam for unreachable recipients (#12118 ) base.py's _keep_typing refresh loop calls send_typing every ~2s while the agent is processing. If signal-cli returns NETWORK_FAILURE for the recipient (offline, unroutable, group membership lost), the unmitigated path was a WARNING log every 2 seconds for as long as the agent stayed busy — a user report showed 1048 warnings in 41 minutes for one offline contact, plus the matching volume of pointless RPC traffic to signal-cli. - _rpc() accepts log_failures=False so callers can route repeated expected failures (typing) to DEBUG while keeping send/receive at WARNING. - send_typing() tracks consecutive failures per chat. First failure still logs WARNING so transport issues remain visible; subsequent failures log at DEBUG. After three consecutive failures we skip the RPC during an exponential cooldown (16s, 32s, 60s cap) so we stop hammering signal-cli for a recipient it can't deliver to. A successful sendTyping resets the counters. - _stop_typing_indicator() clears the backoff state so the next agent turn starts fresh. E2E simulation against the reported 41-minute window: RPCs drop from 1230 to 45 (-96%), log lines from 1048 WARNINGs to 1 WARNING + 44 DEBUGs. Credits kshitijk4poor (#12056) for the _rpc log_failures kwarg idea; the broader restructure in that PR (nested per-chat loop inside send_typing) is avoided here in favour of stateful backoff that preserves base.py's existing _keep_typing architecture.	2026-04-18 04:13:32 -07:00
Teknium	45acd9beb5	fix(gateway): ignore redelivered /restart after PTB offset ACK fails (#11940 ) When a Telegram /restart fires and PTB's graceful-shutdown `get_updates` ACK call times out ("When polling for updates is restarted, updates may be received twice" in gateway.log), the new gateway receives the same /restart again and restarts a second time — a self-perpetuating loop. Record the triggering update_id in `.restart_last_processed.json` when handling /restart. On the next process, reject a /restart whose update_id <= the recorded one as a stale redelivery. 5-minute staleness guard so an orphaned marker can't block a legitimately new /restart. - gateway/platforms/base.py: add `platform_update_id` to MessageEvent - gateway/platforms/telegram.py: propagate `update.update_id` through _build_message_event for text/command/location/media handlers - gateway/run.py: write dedup marker in _handle_restart_command; _is_stale_restart_redelivery checks it before processing /restart - tests/gateway/test_restart_redelivery_dedup.py: 9 new tests covering fresh restart, redelivery, staleness window, cross-platform, malformed-marker resilience, and no-update_id (CLI) bypass Only active for Telegram today (the one platform with monotonic cross-session update ordering); other platforms return False from _is_stale_restart_redelivery and proceed normally.	2026-04-17 21:17:33 -07:00

1 2 3 4 5 ...

1110 commits