hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-04-25 00:51:20 +00:00

Author	SHA1	Message	Date
Siddharth Balyan	d38b73fa57	fix(matrix): E2EE and migration bugfixes (#10860 ) * - make buffered streaming - fix path naming to expand `~` for agent. - fix stripping of matrix ID to not remove other mentions / localports. * fix(matrix): register MembershipEventDispatcher for invite auto-join The mautrix migration (#7518) broke auto-join because InternalEventType.INVITE events are only dispatched when MembershipEventDispatcher is registered on the client. Without it, _on_invite is dead code and the bot silently ignores all room invites. Closes #10094 Closes #10725 Refs: PR #10135 (digging-airfare-4u), PR #10732 (fxfitz) * fix(matrix): preserve _joined_rooms reference for CryptoStateStore connect() reassigned self._joined_rooms = set(...) after initial sync, orphaning the reference captured by _CryptoStateStore at init time. find_shared_rooms() returned [] forever, breaking Megolm session rotation on membership changes. Mutate in place with clear() + update() so the CryptoStateStore reference stays valid. Refs #8174, PR #8215 * fix(matrix): remove dual ROOM_ENCRYPTED handler to fix dedup race mautrix auto-registers DecryptionDispatcher when client.crypto is set. The adapter also registered _on_encrypted_event for the same event type. _on_encrypted_event had zero awaits and won the race to mark event IDs in the dedup set, causing _on_room_message to drop successfully decrypted events from DecryptionDispatcher. The retry loop masked this by re-decrypting every message ~4 seconds later. Remove _on_encrypted_event entirely. DecryptionDispatcher handles decryption; genuinely undecryptable events are logged by mautrix and retried on next key exchange. Refs #8174, PR #8215 * fix(matrix): re-verify device keys after share_keys() upload Matrix homeservers treat ed25519 identity keys as immutable per device. share_keys() can return 200 but silently ignore new keys if the device already exists with different identity keys. The bot would proceed with shared=True while peers encrypt to the old (unreachable) keys. Now re-queries the server after share_keys() and fails closed if keys don't match, with an actionable error message. Refs #8174, PR #8215 * fix(matrix): encrypt outbound attachments in E2EE rooms _upload_and_send() uploaded raw bytes and used the 'url' key for all rooms. In E2EE rooms, media must be encrypted client-side with encrypt_attachment(), the ciphertext uploaded, and the 'file' key (with key/iv/hashes) used instead of 'url'. Now detects encrypted rooms via state_store.is_encrypted() and branches to the encrypted upload path. Refs: PR #9822 (charles-brooks) * fix(matrix): add stop_typing to clear typing indicator after response The adapter set a 30-second typing timeout but never cleared it. The base class stop_typing() is a no-op, so the typing indicator lingered for up to 30 seconds after each response. Closes #6016 Refs: PR #6020 (r266-tech) * fix(matrix): cache all media types locally, not just photos/voice should_cache_locally only covered PHOTO, VOICE, and encrypted media. Unencrypted audio/video/documents in plaintext rooms were passed as MXC URLs that require authentication the agent doesn't have, resulting in 401 errors. Refs #3487, #3806 * fix(matrix): detect stale OTK conflict on startup and fail closed When crypto state is wiped but the same device ID is reused, the homeserver may still hold one-time keys signed with the previous identity key. Identity key re-upload succeeds but OTK uploads fail with "already exists" and a signature mismatch. Peers cannot establish new Olm sessions, so all new messages are undecryptable. Now proactively flushes OTKs via share_keys() during connect() and catches the "already exists" error with an actionable log message telling the operator to purge the device from the homeserver or generate a fresh device ID. Also documents the crypto store recovery procedure in the Matrix setup guide. Refs #8174 * docs(matrix): improve crypto recovery docs per review - Put easy path (fresh access token) first, manual purge second - URL-encode user ID in Synapse admin API example - Note that device deletion may invalidate the access token - Add "stop Synapse first" caveat for direct SQLite approach - Mention the fail-closed startup detection behavior - Add back-reference from upgrade section to OTK warning * refactor(matrix): cleanup from code review - Extract _extract_server_ed25519() and _reverify_keys_after_upload() to deduplicate the re-verification block (was copy-pasted in two places, three copies of ed25519 key extraction total) - Remove dead code: _pending_megolm, _retry_pending_decryptions, _MAX_PENDING_EVENTS, _PENDING_EVENT_TTL — all orphaned after removing _on_encrypted_event - Remove tautological TestMediaCacheGate (tested its own predicate, not production code) - Remove dead TestMatrixMegolmEventHandling and TestMatrixRetryPendingDecryptions (tested removed methods) - Merge duplicate TestMatrixStopTyping into TestMatrixTypingIndicator - Trim comment to just the "why"	2026-04-17 04:03:02 +05:30
Teknium	f6179c5d5f	fix: bump debug share paste TTL from 1 hour to 6 hours (#11240 ) Users (Teknium) report missing debug reports before the 1-hour auto-delete fires. 6 hours gives enough window for async bug-report triage without leaving sensitive log data on public paste services indefinitely. Applies to both the CLI (hermes debug share) and gateway (/debug) paths.	2026-04-16 14:34:46 -07:00
jackjin1997	f5ac025714	fix(gateway): guard pending_event.channel_prompt against None in recursive _run_agent Initialize next_channel_prompt before the pending_event check and use getattr with None default, matching the existing pattern for next_source/next_message/next_message_id. Prevents AttributeError when pending_event is None (interrupt path). Cherry-picked from #10953 by @jackjin1997.	2026-04-16 07:45:27 -07:00
danieldoderlein	31a72bdbf2	fix: escape command content in Telegram exec approval prompt Switch from fragile Markdown V1 to HTML parse mode with html.escape() for exec approval messages. Add fallback to text-based approval when the formatted send fails. Cherry-picked from #10999 by @danieldoderlein.	2026-04-16 07:45:18 -07:00
Teknium	3c42064efc	fix: enforce config.yaml as sole CWD source + deprecate .env CWD vars + add hermes memory reset (#11029 ) config.yaml terminal.cwd is now the single source of truth for working directory. MESSAGING_CWD and TERMINAL_CWD in .env are deprecated with a migration warning. Changes: 1. config.py: Remove MESSAGING_CWD from OPTIONAL_ENV_VARS (setup wizard no longer prompts for it). Add warn_deprecated_cwd_env_vars() that prints a migration hint when deprecated env vars are detected. 2. gateway/run.py: Replace all MESSAGING_CWD reads with TERMINAL_CWD (which is bridged from config.yaml terminal.cwd). MESSAGING_CWD is still accepted as a backward-compat fallback with deprecation warning. Config bridge skips cwd placeholder values so they don't clobber the resolved TERMINAL_CWD. 3. cli.py: Guard against lazy-import clobbering — when cli.py is imported lazily during gateway runtime (via delegate_tool), don't let load_cli_config() overwrite an already-resolved TERMINAL_CWD with os.getcwd() of the service's working directory. (#10817) 4. hermes_cli/main.py: Add 'hermes memory reset' command with --target all/memory/user and --yes flags. Profile-scoped via HERMES_HOME. Migration path for users with .env settings: Remove MESSAGING_CWD / TERMINAL_CWD from .env Add to config.yaml: terminal: cwd: /your/project/path Addresses: #10225, #4672, #10817, #7663	2026-04-16 06:48:33 -07:00
LeonSGP43	465193b7eb	fix(gateway): close temporary agents after one-off tasks Add shared _cleanup_agent_resources() for temporary gateway AIAgent instances. Apply cleanup to memory flush, background tasks, /btw, manual /compress, and session-hygiene auto-compression. Prevents unclosed aiohttp client session leaks. Cherry-picked from #10899 by @LeonSGP43. Consolidates #10945 by @Lubrsy706. Fixes #10865. Co-authored-by: Lubrsy706 <Lubrsy706@users.noreply.github.com>	2026-04-16 06:31:23 -07:00
Mil Wang (from Dev Box)	f9714161f0	fix: stop leaking '(No response generated)' placeholder to users and cron targets When the LLM returns an empty completion, gateway/run.py replaced final_response with the literal string '(No response generated)'. This defeated cron/scheduler.py's empty-response skip guard, causing the placeholder to be delivered to home channels. Changes: - gateway/run.py: return empty string instead of placeholder when there is no error and no response content - cron/scheduler.py: defensively strip the placeholder text in case any upstream path still produces it Fixes NousResearch/hermes-agent#9270	2026-04-16 06:10:40 -07:00
Dave Tist	35bbc6851b	fix(gateway): honor previewed replies in queued follow-ups	2026-04-16 05:53:18 -07:00
Dave Tist	d67e602cc8	fix: only suppress gateway replies after confirmed final stream delivery (cherry picked from commit 675249085b383fff305cc84b8aeacd6dd20c7b14)	2026-04-16 05:53:18 -07:00
kshitij	92a78ffeee	chore(gateway): replace deprecated asyncio.get_event_loop() with get_running_loop() (#11005 ) All 10 call sites in gateway/run.py and gateway/platforms/api_server.py are inside async functions where a loop is guaranteed to be running. get_event_loop() is deprecated since Python 3.10 — it can silently create a new loop when none is running, masking bugs. get_running_loop() raises RuntimeError instead, which is safer. Surfaced during review of PRs #10533 and #10647. Co-authored-by: kshitijk4poor <kshitijk4poor@users.noreply.github.com>	2026-04-16 05:13:39 -07:00
Teknium	333cb8251b	fix: improve interrupt responsiveness during concurrent tool execution and follow-up turns (#10935 ) Three targeted fixes for the 'agent stuck on terminal command' report: 1. Concurrent tool wait loop now checks interrupts (run_agent.py) The sequential path checked _interrupt_requested before each tool call, but the concurrent path's wait loop just blocked with 30s timeouts. Now polls every 5s and cancels pending futures on interrupt, giving already-running tools 3s to notice the per-thread interrupt signal. 2. Cancelled concurrent tools get proper interrupt messages (run_agent.py) When a concurrent tool is cancelled or didn't return a result due to interrupt, the tool result message says 'skipped due to user interrupt' instead of a generic error. 3. Typing indicator fires before follow-up turn (gateway/run.py) After an interrupt is acknowledged and the pending message dequeued, the gateway now sends a typing indicator before starting the recursive _run_agent call. This gives the user immediate visual feedback that the system is processing their new message (closing the perceived 'dead air' gap between the interrupt ack and the response). Reported by @_SushantSays.	2026-04-16 02:44:56 -07:00
Peter Berthelsen	9a9b8cd1e4	fix: keep rapid telegram follow-ups from getting cut off	2026-04-16 02:44:00 -07:00
Teknium	e4cd62d07d	fix(tests): resolve remaining CI failures — commit_memory_session, already_sent, timezone leak, session env (#10785 ) Fixes 12 CI test failures: 1. test_cli_new_session (4): _FakeAgent missing commit_memory_session attribute added in the memory provider refactoring. Added MagicMock. 2. test_run_progress_topics (1): already_sent detection only checked stream consumer flags, missing the response_previewed path from interim_assistant_callback. Restructured guard to check both paths. 3. test_timezone (1): HERMES_TIMEZONE leaked into child processes via _SAFE_ENV_PREFIXES matching HERMES_*. The code correctly converts it to TZ but didn't remove the original. Added child_env.pop(). 4. test_session_env (1): contextvars baseline captured from a different context couldn't be restored after clear. Changed assertion to verify the test's value was removed rather than comparing to a fragile baseline. 5. test_discord_slash_commands (5): already fixed on current main.	2026-04-16 02:26:14 -07:00
helix4u	8021a735c2	fix(gateway): preserve notify context in executor threads Gateway executor work now inherits the active session contextvars via copy_context() so background process watchers retain the correct platform/chat/user/session metadata for routing completion events back to the originating chat. Cherry-picked from #10647 by @helix4u with: - Use asyncio.get_running_loop() instead of deprecated get_event_loop() - Strip trailing whitespace - Add *args forwarding test - Add exception propagation test	2026-04-16 02:05:59 -07:00
Teknium	cc6e8941db	feat(honcho): context injection overhaul, 5-tool surface, cost safety, session isolation (#10619 ) Salvaged from PR #9884 by erosika. Cherry-picked plugin changes onto current main with minimal core modifications. Plugin changes (plugins/memory/honcho/): - New honcho_reasoning tool (5th tool, splits LLM calls from honcho_context) - Two-layer context injection: base context (summary + representation + card) on contextCadence, dialectic supplement on dialecticCadence - Multi-pass dialectic depth (1-3 passes) with early bail-out on strong signal - Cold/warm prompt selection based on session state - dialecticCadence defaults to 3 (was 1) — ~66% fewer Honcho LLM calls - Session summary injection for conversational continuity - Bidirectional peer targeting on all 5 tools - Correctness fixes: peer param fallback, None guard on set_peer_card, schema validation, signal_sufficient anchored regex, mid->medium level fix Core changes (~20 lines across 3 files): - agent/memory_manager.py: Enhanced sanitize_context() to strip full <memory-context> blocks and system notes (prevents leak from saveMessages) - run_agent.py: gateway_session_key param for stable per-chat Honcho sessions, on_turn_start() call before prefetch_all() for cadence tracking, sanitize_context() on user messages to strip leaked memory blocks - gateway/run.py: skip_memory=True on 2 temp agents (prevents orphan sessions), gateway_session_key threading to main agent Tests: 509 passed (3 skipped — honcho SDK not installed locally) Docs: Updated honcho.md, memory-providers.md, tools-reference.md, SKILL.md Co-authored-by: erosika <erosika@users.noreply.github.com>	2026-04-15 19:12:19 -07:00
Roque	92a23479c0	fix(model-switch): normalize Unicode dashes from Telegram/iOS input Telegram on iOS auto-converts double hyphens (--) to em dashes (—) or en dashes (–) via autocorrect. This breaks /model flag parsing since parse_model_flags() only recognizes literal '--provider' and '--global'. When the flag isn't parsed, the entire string (e.g. 'glm-5.1 —provider zai') gets treated as the model name and fails with 'Model names cannot contain spaces.' Fix: normalize Unicode dashes (U+2012-U+2015) to '--' when they appear before flag keywords (provider, global), before flag extraction. The existing test suite in test_model_switch_provider_routing.py already covers all four dash variants — this commit adds the code that makes them pass.	2026-04-15 17:54:16 -07:00
Xowiek	21cd3a3fc0	fix(profile): use existing get_active_profile_name() for /profile command Replace inline Path.home() / '.hermes' / 'profiles' detection in both CLI and gateway /profile handlers with the existing get_active_profile_name() from hermes_cli.profiles — which already handles custom-root deployments, standard profiles, and Docker layouts. Fixes /profile incorrectly reporting 'default' when HERMES_HOME points to a custom-root profile path like /opt/data/profiles/coder. Based on PR #10484 by Xowiek.	2026-04-15 17:52:03 -07:00
Xowiek	77435c4f13	fix(gateway): use profile-aware Hermes paths in runtime hints	2026-04-15 17:52:03 -07:00
Greer Guthrie	33ff29dfae	fix(gateway): defer background review notifications until after main reply Background review notifications ("💾 Skill created", "💾 Memory updated") could race ahead of the main assistant reply in chat, making it look like the agent stopped after creating a skill. Gate bg-review notifications behind a threading.Event + pending queue. Register a release callback on the adapter's _post_delivery_callbacks dict so base.py's finally block fires it after the main response is delivered. The queued-message path in _run_agent pops and calls the callback directly to prevent double-fire. Co-authored-by: Hermes Agent <hermes@nousresearch.com> Closes #10541	2026-04-15 17:23:15 -07:00
Brenner Spear	90a6336145	fix: remove redundant key normalization and defensive getattr in channel_prompts - Remove double str() normalization in _resolve_channel_prompt since config bridging already handles numeric YAML key conversion - Remove dead prompts.get(str(key)) fallback that could never match after keys were already normalized to strings - Replace getattr(event, "channel_prompt", None) with direct attribute access since channel_prompt is a declared dataclass field - Update test to verify normalization responsibility lives in config bridging	2026-04-15 16:31:28 -07:00
Brenner Spear	2fbdc2c8fa	feat(discord): add channel_prompts config Add native Discord channel_prompts support with parent forum fallback, ephemeral runtime injection, config migration updates, docs, and tests.	2026-04-15 16:31:28 -07:00
Teknium	1d4b9c1a74	fix(gateway): don't treat group session user_id as thread_id in shutdown notifications (#10546 ) _parse_session_key() blindly assigned parts[5] as thread_id for all chat types. For group sessions with per-user isolation, parts[5] is a user_id, not a thread_id. This could cause shutdown notifications to route with incorrect thread metadata. Only return thread_id for chat types where the 6th element is unambiguous: dm and thread. For group/channel sessions, omit thread_id since the suffix may be a user_id. Based on the approach from PR #9938 by @Ruzzgar.	2026-04-15 15:09:23 -07:00
Teknium	e36c804bc2	fix: prevent already_sent from swallowing empty responses after tool calls (#10531 ) When a model (e.g. mimo-v2-pro) streams intermediate text alongside tool calls ("Let me search for that") but then returns empty after processing tool results, the stream consumer already_sent flag is True from the earlier text delivery. The gateway suppression check (already_sent=True, failed=False → return None) would swallow the final response, leaving the user staring at silence after the search. Two changes: 1. gateway/run.py return path: skip already_sent suppression when the final_response is "(empty)" or empty — the user needs to know the agent finished even if streaming sent partial content earlier. 2. gateway/run.py response handler: convert the internal "(empty)" sentinel to a user-friendly warning instead of delivering the raw sentinel string. Tests added for all empty/None/sentinel cases plus preserved existing suppression behavior for normal non-empty responses.	2026-04-15 14:26:45 -07:00
Teknium	19142810ed	fix: /debug privacy — auto-delete pastes after 1 hour, add privacy notices (#10510 ) - Pastes uploaded by /debug now auto-delete after 1 hour via a detached background process that sends DELETE to paste.rs - CLI: shows privacy notice listing what data will be uploaded - Gateway: only uploads summary report (system info + log tails), NOT full log files containing conversation content - Added 'hermes debug delete <url>' for immediate manual deletion - 16 new tests covering auto-delete scheduling, paste deletion, privacy notices, and the delete subcommand Addresses user privacy concern where /debug uploaded full conversation logs to a public paste service with no warning or expiry.	2026-04-15 13:40:27 -07:00
Teknium	2edbf15560	fix: enforce TTL in MessageDeduplicator + use yaml for gateway --config (#10306 , #10216 ) (#10509 ) Two gateway fixes: 1. MessageDeduplicator.is_duplicate() now checks TTL at query time (#10306) Previously, is_duplicate() returned True for any previously seen ID without checking its age — expired entries were only purged when cache size exceeded max_size. On normal workloads that never overflow, message IDs stayed deduplicated forever instead of expiring after the TTL. Fix: check `now - timestamp < ttl` before returning True. Expired entries are removed and treated as new messages. 2. Gateway --config flag now uses yaml.safe_load() (#10216) The --config CLI flag in gateway/run.py main() used json.load() to parse config files. YAML is the only documented config format and every other config loader uses yaml.safe_load(). A YAML config file passed via --config would crash with json.JSONDecodeError. Closes #10306 Closes #10216	2026-04-15 13:35:40 -07:00
Teknium	f61cc464f0	fix: include thread_id in _parse_session_key and fix stale parts reference _parse_session_key() now extracts the optional 6th part (thread_id) from session keys, and _notify_active_sessions_of_shutdown uses _parsed.get() instead of the removed 'parts' variable. Without this, shutdown notifications silently failed (NameError caught by try/except) and forum topic routing was lost.	2026-04-15 11:16:01 -07:00
kshitijk4poor	2276b72141	fix: follow-up improvements for watch notification routing (#9537 ) - Populate watcher_* routing fields for watch-only processes (not just notify_on_complete), so watch-pattern events carry direct metadata instead of relying solely on session_key parsing fallback - Extract _parse_session_key() helper to dedupe session key parsing at two call sites in gateway/run.py - Add negative test proving cross-thread leakage doesn't happen - Add edge-case tests for _build_process_event_source returning None (empty evt, invalid platform, short session_key) - Add unit tests for _parse_session_key helper	2026-04-15 11:16:01 -07:00
etcircle	dee592a0b1	fix(gateway): route synthetic background events by session	2026-04-15 11:16:01 -07:00
Teknium	2546b7acea	fix(gateway): suppress duplicate replies on interrupt and streaming flood control Three fixes for the duplicate reply bug affecting all gateway platforms: 1. base.py: Suppress stale response when the session was interrupted by a new message that hasn't been consumed yet. Checks both interrupt_event and _pending_messages to avoid false positives. (#8221, #2483) 2. run.py (return path): Remove response_previewed guard from already_sent check. Stream consumer's already_sent alone is authoritative — if content was delivered via streaming, the duplicate send must be suppressed regardless of the agent's response_previewed flag. (#8375) 3. run.py (queued-message path): Same fix — already_sent without response_previewed now correctly marks the first response as already streamed, preventing re-send before processing the queued message. The response_previewed field is still produced by the agent (run_agent.py) but is no longer required as a gate for duplicate suppression. The stream consumer's already_sent flag is the delivery-level truth about what the user actually saw. Concepts from PR #8380 (konsisumer). Closes #8375, #8221, #2483.	2026-04-15 03:42:24 -07:00
Teknium	50c35dcabe	fix: stale agent timeout, uv venv detection, empty response after tools (#9051 , #8620 , #9400 ) Three independent fixes: 1. Reset activity timestamp on cached agent reuse (#9051) When the gateway reuses a cached AIAgent for a new turn, the _last_activity_ts from the previous turn (possibly hours ago) carried over. The inactivity timeout handler immediately saw the agent as idle for hours and killed it. Fix: reset _last_activity_ts, _last_activity_desc, and _api_call_count when retrieving an agent from the cache. 2. Detect uv-managed virtual environments (#8620 sub-issue 1) The systemd unit generator fell back to sys.executable (uv's standalone Python) when running under 'uv run', because sys.prefix == sys.base_prefix (uv doesn't set up traditional venv activation). The generated ExecStart pointed to a Python binary without site-packages, crashing the service on startup. Fix: check VIRTUAL_ENV env var before falling back to sys.executable. uv sets VIRTUAL_ENV even when sys.prefix doesn't reflect the venv. 3. Nudge model to continue after empty post-tool response (#9400) Weaker models (GLM-5, mimo-v2-pro) sometimes return empty responses after tool calls instead of continuing to the next step. The agent silently abandoned the remaining work with '(empty)' or used prior-turn fallback text. Fix: when the model returns empty after tool calls AND there's no prior-turn content to fall back on, inject a one-time user nudge message telling the model to process the tool results and continue. The flag resets after each successful tool round so it can fire again on later rounds. Test plan: 97 gateway + CLI tests pass, 9 venv detection tests pass	2026-04-14 22:16:02 -07:00
Teknium	a8b7db35b2	fix: interrupt agent immediately when user messages during active run (#10068 ) When a user sends a message while the agent is executing a task on the gateway, the agent is now interrupted immediately — not silently queued. Previously, messages were stored in _pending_messages with zero feedback to the user, potentially leaving them waiting 1+ hours. Root cause: Level 1 guard (base.py) intercepted all messages for active sessions and returned with no response. Level 2 (gateway/run.py) which calls agent.interrupt() was never reached. Fix: Expand _handle_active_session_busy_message to handle the normal (non-draining) case: 1. Call running_agent.interrupt(text) to abort in-flight tool calls and exit the agent loop at the next check point 2. Store the message as pending so it becomes the next turn once the interrupted run returns 3. Send a brief ack: 'Interrupting current task (10 min elapsed, iteration 21/60, running: terminal). I'll respond shortly.' 4. Debounce acks to once per 30s to avoid spam on rapid messages Reported by @Lonely__MH.	2026-04-14 22:07:28 -07:00
Teknium	c5688e7c8b	fix(gateway): break compression-exhaustion infinite loop and auto-reset session (#9893 ) When compression fails after max attempts, the agent returns {completed: False, partial: True} but was missing the 'failed' flag. The gateway's agent_failed_early guard checked for 'failed' AND 'not final_response', but _run_agent_blocking always converts errors to final_response — making the guard dead code. This caused the oversized session to persist, creating an infinite fail loop where every subsequent message hits the same compression failure. Changes: - run_agent.py: add 'failed: True' and 'compression_exhausted: True' to all 5 compression-exhaustion return paths - gateway/run.py (_run_agent_blocking): forward 'failed' and 'compression_exhausted' flags through to the caller - gateway/run.py (_handle_message_with_agent): fix agent_failed_early to check bool(failed) without the broken 'not final_response' clause; auto-reset the session when compression is exhausted so the next message starts fresh - Update tests to match new guard logic and add TestCompressionExhaustedFlag test class Closes #9893	2026-04-14 21:18:17 -07:00
Teknium	ca0ae56ccb	fix: add 402 billing error hint to gateway error handler (#5220 ) (#10057 ) * fix: hermes gateway restart waits for service to come back up (#8260) Previously, systemd_restart() sent SIGUSR1 to the gateway, printed 'restart requested', and returned immediately. The gateway still needed to drain active agents, exit with code 75, wait for systemd's RestartSec=30, and start the new process. The user saw 'success' but the gateway was actually down for 30-60 seconds. Now the SIGUSR1 path blocks with progress feedback: Phase 1 — wait for old process to die: ⏳ User service draining active work... Polls os.kill(pid, 0) until ProcessLookupError (up to 90s) Phase 2 — wait for new process to become active: ⏳ Waiting for hermes-gateway to restart... Polls systemctl is-active + verifies new PID (up to 60s) Success: ✓ User service restarted (PID 12345) Timeout: ⚠ User service did not become active within 60s. Check status: hermes gateway status Check logs: journalctl --user -u hermes-gateway --since '2 min ago' The reload-or-restart fallback path (line 1189) already blocks because systemctl reload-or-restart is synchronous. Test plan: - Updated test to verify wait-for-restart behavior - All 118 gateway CLI tests pass * fix: add 402 billing error hint to gateway error handler (#5220) The gateway's exception handler for agent errors had specific hints for HTTP 401, 429, 529, 400, 500 — but not 402 (Payment Required / quota exhausted). Users hitting billing limits from custom proxy providers got a generic error with no guidance. Added: 'Your API balance or quota is exhausted. Check your provider dashboard.' The underlying billing classification (error_classifier.py) already correctly handles 402 as FailoverReason.billing with credential rotation and fallback. The original issue (#5220) where 402 killed the entire gateway was from an older version — on current main, 402 is excluded from the is_client_error abort path (line 9460) and goes through the proper retry/fallback/fail flow. Combined with PR #9875 (auto-recover from unexpected SIGTERM), even edge cases where the gateway dies are now survivable.	2026-04-14 21:03:05 -07:00
Teknium	6c89306437	fix: break stuck session resume loops after repeated restarts (#7536 ) When a session gets stuck (hung terminal, runaway tool loop) and the user restarts the gateway, the same session history loads and puts the agent right back in the stuck state. The user is trapped in a loop: restart → stuck → restart → stuck. Fix: track restart-failure counts per session using a simple JSON file (.restart_failure_counts). On each shutdown with active agents, the counter increments for those sessions. On startup, if any session has been active across 3+ consecutive restarts, it's auto-suspended — giving the user a clean slate on their next message. The counter resets to 0 when a session completes a turn successfully (response delivered), so normal sessions that happen to be active during planned restarts (/restart, hermes update) won't accumulate false counts. Implementation: - _increment_restart_failure_counts(): called during stop() when agents are active. Writes {session_key: count} to JSON file. Sessions NOT active are dropped (loop broken). - _suspend_stuck_loop_sessions(): called on startup. Reads the file, suspends sessions at threshold (3), clears the file. - _clear_restart_failure_count(): called after successful response delivery. Removes the session from the counter file. No SessionEntry schema changes. No database migration. Pure file-based tracking that naturally cleans up. Test plan: - 9 new stuck-loop tests (increment, accumulate, threshold, clear, suspend, file cleanup, edge cases) - All 28 gateway lifecycle tests pass (restart drain + auto-continue + stuck loop)	2026-04-14 17:08:35 -07:00
Teknium	e7475b1582	feat: auto-continue interrupted agent work after gateway restart (#4493 ) When the gateway restarts mid-agent-work, the session transcript ends on a tool result the agent never processed. Previously, the user had to type 'continue' or use /retry (which replays from scratch, losing all prior work). Now, when the next user message arrives and the loaded history ends with role='tool', a system note is prepended: [System note: Your previous turn was interrupted before you could process the last tool result(s). Please finish processing those results and summarize what was accomplished, then address the user's new message below.] This is injected in _run_agent()'s run_sync closure, right before calling agent.run_conversation(). The agent sees the full history (including the pending tool results) and the system note, so it can summarize what was accomplished and then handle the user's new input. Design decisions: - No new session flags or schema changes — purely detects trailing tool messages in the loaded history - Works for any restart scenario (clean, crash, SIGTERM, drain timeout) as long as the session wasn't suspended (suspended = fresh start) - The user's actual message is preserved after the note - If the session WAS suspended (unclean shutdown), the old history is abandoned and the user starts fresh — no false auto-continue Also updates the shutdown notification message from 'Use /retry after restart to continue' to 'Send any message after restart to resume where it left off' — which is now accurate. Test plan: - 6 new auto-continue tests (trailing tool detection, no false positives for assistant/user/empty history, multi-tool, message preservation) - All 13 restart drain tests pass (updated /retry assertion)	2026-04-14 16:56:49 -07:00
Teknium	039023f497	diag: log all hermes processes on unexpected gateway shutdown (#9905 ) When the gateway receives SIGTERM/SIGINT, the shutdown handler now runs 'ps aux' and logs every hermes/gateway-related process (excluding itself). This will show in agent.log as: WARNING: Shutdown diagnostic — other hermes processes running: hermes 1234 ... hermes update --gateway hermes 5678 ... hermes gateway restart This is the missing diagnostic for #5646 / #6666 — we can prove the restarts are from systemctl but can't determine WHO issues the systemctl command. Next time it happens, the agent.log will contain the evidence (the process that sent the signal or called systemctl should still be alive when the handler fires).	2026-04-14 16:26:36 -07:00
Teknium	397386cae2	fix: gateway auto-recovers from unexpected SIGTERM via systemd (#5646 ) Root cause: when the gateway received SIGTERM (from hermes update, external kill, WSL2 runtime, etc.), it exited with status 0. systemd's Restart=on-failure only restarts on non-zero exit, so the gateway stayed dead permanently. Users had to manually restart. Fix 1: Signal-initiated shutdown exits non-zero When SIGTERM/SIGINT is received and no restart was requested (via /restart, /update, or SIGUSR1), start_gateway() returns False which causes sys.exit(1). systemd sees a failure exit and auto-restarts after RestartSec=30. This is safe because systemctl stop tracks its own stop-requested state independently of exit code — Restart= never fires for a deliberate stop, regardless of exit code. Also logs 'Received SIGTERM/SIGINT — initiating shutdown' so the cause of unexpected shutdowns is visible in agent.log. Fix 2: PID file ownership guard remove_pid_file() now checks that the PID file belongs to the current process before removing it. During --replace handoffs, the old process's atexit handler could fire AFTER the new process wrote its PID file, deleting the new record. This left the gateway running but invisible to get_running_pid(), causing 'Another gateway already running' errors on next restart. Test plan: - All restart drain tests pass (13) - All gateway service tests pass (84) - All update gateway restart tests pass (34)	2026-04-14 15:35:58 -07:00
Teknium	fa8c448f7d	fix: notify active sessions on gateway shutdown + update health check Three fixes for gateway lifecycle stability: 1. Notify active sessions before shutdown (#new) When the gateway receives SIGTERM or /restart, it now sends a notification to every chat with an active agent BEFORE starting the drain. Users see: - Shutdown: 'Gateway shutting down — your task will be interrupted.' - Restart: 'Gateway restarting — use /retry after restart to continue.' Deduplicates per-chat so group sessions with multiple users get one notification. Best-effort: send failures are logged and swallowed. 2. Skip .clean_shutdown marker when drain timed out Previously, a graceful SIGTERM always wrote .clean_shutdown, even if agents were force-interrupted when the drain timed out. This meant the next startup skipped session suspension, leaving interrupted sessions in a broken state (trailing tool response, no final message). Now the marker is only written if the drain completed without timeout, so interrupted sessions get properly suspended on next startup. 3. Post-restart health check for hermes update (#6631) cmd_update() now verifies the gateway actually survived after systemctl restart (sleep 3s + is-active check). If the service crashed immediately, it retries once. If still dead, prints actionable diagnostics (journalctl command, manual restart hint). Also closes #8104 — already fixed on main (the /restart handler correctly detects systemd via INVOCATION_ID and uses via_service=True). Test plan: - 6 new tests for shutdown notifications (dedup, restart vs shutdown messaging, sentinel filtering, send failure resilience) - Existing restart drain + update tests pass (47 total)	2026-04-14 14:21:57 -07:00
Teknium	90c98345c9	feat: gateway proxy mode — forward messages to remote API server When GATEWAY_PROXY_URL (or gateway.proxy_url in config.yaml) is set, the gateway becomes a thin relay: it handles platform I/O (encryption, threading, media) and delegates all agent work to a remote Hermes API server via POST /v1/chat/completions with SSE streaming. This enables the primary use case of running a Matrix E2EE gateway in Docker on Linux while the actual agent runs on the host (e.g. macOS) with full access to local files, memory, skills, and a unified session store. Works for any platform adapter, not just Matrix. Configuration: - GATEWAY_PROXY_URL env var (Docker-friendly) - gateway.proxy_url in config.yaml - GATEWAY_PROXY_KEY env var for API auth (matches API_SERVER_KEY) - X-Hermes-Session-Id header for session continuity Architecture: - _get_proxy_url() checks env var first, then config.yaml - _run_agent_via_proxy() handles HTTP forwarding with SSE streaming - _run_agent() delegates to proxy path when URL is configured - Platform streaming (GatewayStreamConsumer) works through proxy - Returns compatible result dict for session store recording Files changed: - gateway/run.py: proxy mode implementation (~250 lines) - hermes_cli/config.py: GATEWAY_PROXY_URL + GATEWAY_PROXY_KEY env vars - tests/gateway/test_proxy_mode.py: 17 tests covering config resolution, dispatch, HTTP forwarding, error handling, message filtering, and result shape validation Closes discussion from Cars29 re: Matrix gateway mixed-mode issue.	2026-04-14 10:49:48 -07:00
dirtyfancy	e964cfc403	fix(gateway): trigger memory provider shutdown on /new and /reset The /new and /reset commands were not calling shutdown_memory_provider() on the cached agent before eviction. This caused OpenViking (and any memory provider that relies on session-end shutdown) to skip commit, leaving memories un-indexed until idle timeout or gateway shutdown. Add the missing shutdown_memory_provider() call in _handle_reset_command(), matching the behavior already present in the session expiry watcher. Fixes #7759	2026-04-14 10:49:35 -07:00
Teknium	4654f75627	fix: QQBot missing integration points, timestamp parsing, test fix - Add Platform.QQBOT to _UPDATE_ALLOWED_PLATFORMS (enables /update command) - Add 'qqbot' to webhook cross-platform delivery routing - Add 'qqbot' to hermes dump platform detection - Fix test_name_property casing: 'QQBot' not 'QQBOT' - Add _parse_qq_timestamp() for ISO 8601 + integer ms compatibility (QQ API changed timestamp format — from PR #2411 finding) - Wire timestamp parsing into all 4 message handlers	2026-04-14 00:11:49 -07:00
walli	884cd920d4	feat(gateway): unify QQBot branding, add PLATFORM_HINTS, fix streaming, restore missing setup functions - Rename platform from 'qq' to 'qqbot' across all integration points (Platform enum, toolset, config keys, import paths, file rename qq.py → qqbot.py) - Add PLATFORM_HINTS for QQBot in prompt_builder (QQ supports markdown) - Set SUPPORTS_MESSAGE_EDITING = False to skip streaming on QQ (prevents duplicate messages from non-editable partial + final sends) - Add _send_qqbot() standalone send function for cron/send_message tool - Add interactive _setup_qq() wizard in hermes_cli/setup.py - Restore missing _setup_signal/email/sms/dingtalk/feishu/wecom/wecom_callback functions that were lost during the original merge	2026-04-14 00:11:49 -07:00
Junjun Zhang	87bfc28e70	feat: add QQ Bot platform adapter (Official API v2) Add full QQ Bot integration via the Official QQ Bot API (v2): - WebSocket gateway for inbound events (C2C, group, guild, DM) - REST API for outbound text/markdown/media messages - Voice transcription (Tencent ASR + configurable STT provider) - Attachment processing (images, voice, files) - User authorization (allowlist + allow-all + DM pairing) Integration points: - gateway: Platform.QQ enum, adapter factory, allowlist maps - CLI: setup wizard, gateway config, status display, tools config - tools: send_message cross-platform routing, toolsets - cron: delivery platform support - docs: QQ Bot setup guide	2026-04-14 00:11:49 -07:00
Teknium	8d023e43ed	refactor: remove dead code — 1,784 lines across 77 files (#9180 ) Deep scan with vulture, pyflakes, and manual cross-referencing identified: - 41 dead functions/methods (zero callers in production) - 7 production-dead functions (only test callers, tests deleted) - 5 dead constants/variables - ~35 unused imports across agent/, hermes_cli/, tools/, gateway/ Categories of dead code removed: - Refactoring leftovers: _set_default_model, _setup_copilot_reasoning_selection, rebuild_lookups, clear_session_context, get_logs_dir, clear_session - Unused API surface: search_models_dev, get_pricing, skills_categories, get_read_files_summary, clear_read_tracker, menu_labels, get_spinner_list - Dead compatibility wrappers: schedule_cronjob, list_cronjobs, remove_cronjob - Stale debug helpers: get_debug_session_info copies in 4 tool files (centralized version in debug_helpers.py already exists) - Dead gateway methods: send_emote, send_notice (matrix), send_reaction (bluebubbles), _normalize_inbound_text (feishu), fetch_room_history (matrix), _start_typing_indicator (signal), parse_feishu_post_content - Dead constants: NOUS_API_BASE_URL, SKILLS_TOOL_DESCRIPTION, FILE_TOOLS, VALID_ASPECT_RATIOS, MEMORY_DIR - Unused UI code: _interactive_provider_selection, _interactive_model_selection (superseded by prompt_toolkit picker) Test suite verified: 609 tests covering affected files all pass. Tests for removed functions deleted. Tests using removed utilities (clear_read_tracker, MEMORY_DIR) updated to use internal APIs directly.	2026-04-13 16:32:04 -07:00
helix4u	f94f53cc22	fix(matrix): disable streaming cursor decoration on Matrix	2026-04-13 16:31:02 -07:00
Teknium	952a885fbf	fix(gateway): /stop no longer resets the session (#9224 ) /stop was calling suspend_session() which marked the session for auto-reset on the next message. This meant users lost their conversation history every time they stopped a running agent — especially painful for untitled sessions that can't be resumed by name. Now /stop just interrupts the agent and cleans the session lock. The session stays intact so users can continue the conversation. The suspend behavior was introduced in #7536 to break stuck session resume loops on gateway restart. That case is already handled by suspend_recently_active() which runs at gateway startup, so removing it from /stop doesn't regress the original fix.	2026-04-13 14:59:05 -07:00
墨綠BG	c449cd1af5	fix(config): restore custom providers after v11→v12 migration The v11→v12 migration converts custom_providers (list) into providers (dict), then deletes the list. But all runtime resolvers read from custom_providers — after migration, named custom endpoints silently stop resolving and fallback chains fail with AuthError. Add get_compatible_custom_providers() that reads from both config schemas (legacy custom_providers list + v12+ providers dict), normalizes entries, deduplicates, and returns a unified list. Update ALL consumers: - hermes_cli/runtime_provider.py: _get_named_custom_provider() + key_env - hermes_cli/auth_commands.py: credential pool provider names - hermes_cli/main.py: model picker + _model_flow_named_custom() - agent/auxiliary_client.py: key_env + custom_entry model fallback - agent/credential_pool.py: _iter_custom_providers() - cli.py + gateway/run.py: /model switch custom_providers passthrough - run_agent.py + gateway/run.py: per-model context_length lookup Also: use config.pop() instead of del for safer migration, fix stale _config_version assertions in tests, add pool mock to codex test. Co-authored-by: 墨綠BG <s5460703@gmail.com> Closes #8776, salvaged from PR #8814	2026-04-13 10:50:52 -07:00
twilwa	3a64348772	fix(discord): voice session continuity and signal handler thread safety - Store source metadata on /voice channel join so voice input shares the same session as the linked text channel conversation - Treat voice-linked text channels as free-response (skip @mention and auto-thread) while voice is active - Scope the voice-linked exemption to the exact bound channel, not sibling threads - Guard signal handler registration in start_gateway() for non-main threads (prevents RuntimeError when gateway runs in a daemon thread) - Clean up _voice_sources on leave_voice_channel Salvaged from PR #3475 by twilwa (Modal runtime portions excluded).	2026-04-13 04:49:21 -07:00
Teknium	964ef681cf	fix(gateway): improve /restart response with fallback instructions	2026-04-12 22:34:23 -07:00
Teknium	276d20e62c	fix(gateway): /restart uses service restart under systemd instead of detached subprocess The detached bash subprocess spawned by /restart gets killed by systemd's KillMode=mixed cgroup cleanup, leaving the gateway dead. Under systemd (detected via INVOCATION_ID env var), /restart now uses via_service=True which exits with code 75 — RestartForceExitStatus=75 in the unit file makes systemd auto-restart the service. The detached subprocess approach is preserved as fallback for non-systemd environments (Docker, tmux, foreground mode).	2026-04-12 22:32:19 -07:00

1 2 3 4 5 ...

536 commits