hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-04-30 01:41:43 +00:00

Author	SHA1	Message	Date
beliefanx	93feffbcfa	fix(gateway): avoid stale interrupted turn auto-continue	2026-04-28 05:20:35 -07:00
Teknium	b61d9b297a	refactor: consolidate symlink-safe atomic replace into shared helper Extract the islink/realpath guard from the 16743 fix into a single atomic_replace() helper in utils.py, then migrate every os.replace() call site in the codebase to use it. The original PR #16777 correctly identified and fixed the bug, but only patched 9 of ~24 call sites. The same bug class (managed deployments that symlink state files silently losing the link on every write) still existed at auth.json, sessions file, gateway config, env_loader, webhook subscriptions, debug store, model catalog, pairing, google OAuth, nous rate guard, and more. Rather than add another 10+ copies of the same three-line guard, consolidate into atomic_replace(tmp, target) which: - resolves symlinks via os.path.realpath before os.replace - returns the resolved real path so callers can re-apply permissions - is a drop-in replacement for os.replace at the use sites Changes: - utils.py: new atomic_replace() helper + atomic_json_write / atomic_yaml_write now call it instead of inlining the guard - 16 files: all os.replace() call sites migrated to atomic_replace() - agent/{google_oauth, nous_rate_guard, shell_hooks}.py - cron/jobs.py - gateway/{pairing, session, platforms/telegram}.py - hermes_cli/{auth, config, debug, env_loader, model_catalog, webhook}.py - tools/{memory_tool, skill_manager_tool, skills_sync}.py Tests: tests/test_atomic_replace_symlinks.py pins the invariant for atomic_replace + atomic_json_write + atomic_yaml_write, covers plain files, first-time creates, broken symlinks, and permission preservation. Refs #16743 Builds on #16777 by @vominh1919.	2026-04-28 04:58:22 -07:00
Teknium	4e5ebf07ea	fix(matrix): stop tagging the user on every reply (#16932) The mention_user_id injection from 38a6bada9 unconditionally attached an @user:server mention pill + MSC3952 m.mentions.user_ids payload to every outbound reply and every tool-progress status update. The stated intent was push notifications in muted rooms, but shipped as always-on in every room, DM or group, muted or not, so every reply pinged the user. - gateway/platforms/base.py: stop injecting mention_user_id into send metadata on every reply; restore the original _thread_metadata passthrough. - gateway/run.py: drop mention_user_id from status-thread metadata. - gateway/platforms/matrix.py: drop the mention-pill append block in _send_text that consumed the metadata. Keep the reaction-based exec approval half of 38a6bada9 and the inbound/outbound m.mentions handling (unrelated to the per-reply ping). Reported by Elkim [NOUS] on Discord. Co-authored-by: teknium1 <teknium@users.noreply.github.com>	2026-04-28 02:00:37 -07:00
ThomassJonax	2f9243c333	fix(session): make SQLite transcript rewrites transactional	2026-04-28 01:49:46 -07:00
crayfish-ai	f3371c39a4	fix(auxiliary): custom provider URL rewrite + main_runtime model for title gen - auxiliary_client: apply _to_openai_base_url() to custom base_url (fixes /anthropic → /v1 rewrite missing for provider="custom") - auxiliary_client: use main_runtime.get("model") instead of _read_main_model() so auxiliary tasks follow system default model changes - title_generator: thread main_runtime through generate_title → auto_title_session → maybe_auto_title - cli.py / gateway/run.py: pass main_runtime to maybe_auto_title - tests: update mock assertions for new main_runtime parameter	2026-04-28 01:47:25 -07:00
Surat Srichan	4d3e3ff8a2	fix(gateway): coerce plaintext "restart gateway" DMs to /restart Narrow plaintext shortcut that rewrites a tiny set of admin phrases ("restart gateway", "restart the gateway", "restart hermes") into the /restart slash command, but only in DMs. Scope is intentionally tight: - DM text messages only — group chats keep natural-language semantics - Exact restart-style phrases only - Skips anything already starting with "/" Without this, the LLM can receive "restart gateway" as a user turn and try to satisfy it via the terminal tool (systemctl restart ...). That kills the gateway while the originating agent is still running, which leaves systemd in "draining" state waiting on a process it's about to kill. Routing the phrase to the slash-command dispatcher bypasses the agent loop and uses the existing restart machinery (request_restart). Called once, at the adapter level in BasePlatformAdapter.handle_message, so every platform gets it for free and pending-message reinjection is covered by the same call site. Adds 2 Telegram-parametrized e2e tests: DM routes to request_restart, group chats fall through to the normal agent path.	2026-04-28 01:40:28 -07:00
Teknium	dd789a4fdf	fix(mcp): move discovery out of model_tools import side effect (#16856) (#16899) model_tools.py ran discover_mcp_tools() as a module-level side effect. discover_mcp_tools() uses a blocking 120s wait internally (via _run_on_mcp_loop -> future.result(timeout=120)). The gateway lazy-imports run_agent -> model_tools on the first user message, which happens inside the asyncio event loop thread. A slow or unreachable MCP server therefore froze Discord shard heartbeats and Telegram polling for up to 120s on the first message after gateway start. Fix: remove the module-level call. Every entry point now runs discovery explicitly at its own startup, using the context-appropriate blocking/non-blocking pattern: - gateway/run.py: loop.run_in_executor(None, discover_mcp_tools) before platforms start accepting traffic - hermes_cli/main.py: inline (no event loop at CLI startup) - tui_gateway/entry.py: inline (sync stdin loop, no event loop) - acp_adapter/entry.py: inline before asyncio.run() Closes #16856.	2026-04-28 01:17:58 -07:00
ztexydt-cqh	1d5e25f353	fix(gateway): persist /sethome home channel to .env across all platforms _handle_set_home_command wrote FEISHU_HOME_CHANNEL / DISCORD_HOME_CHANNEL / etc. as top-level keys into config.yaml, but load_gateway_config() only reads home channels from env vars. After every gateway restart the home channel was lost — on every platform, not just Feishu. Fix: switch /sethome to save_env_value(), which atomically writes to ~/.hermes/.env and updates the current process env in one shot. The handler builds the env key from platform_name.upper(), so one line change repairs /sethome for every platform that has a HOME_CHANNEL env var. Also widen _EXTRA_ENV_KEYS in hermes_cli/config.py so HOME_CHANNEL and HOME_CHANNEL_NAME for every platform are treated as managed env vars: SIGNAL, SLACK, SMS, DINGTALK, BLUEBUBBLES, FEISHU, WECOM, YUANBAO, plus the missing *_NAME variants for DISCORD/TELEGRAM/MATTERMOST. Closes #16806 Co-authored-by: teknium1 <screenmachine@gmail.com>	2026-04-28 01:17:17 -07:00
nbot	38a6bada92	feat(matrix): reaction-based exec approval + mention_user_id Add Matrix reaction-based exec approval (✅/❎) and mention_user_id support for push notifications in muted rooms. - matrix.py: _MatrixApprovalPrompt, send_exec_approval, reaction approval handling, bot seed reaction redaction, mention pill in send - base.py: inject mention_user_id into send metadata - run.py: inject mention_user_id into status thread metadata - tests for approval prompt registration and reaction resolution	2026-04-27 21:22:44 -07:00
Andrew Miller	d497387cec	matrix: auto-bootstrap cross-signing on first startup Without this, every Matrix bot started under hermes-agent shows the "Encrypted by a device not verified by its owner" badge in Element indefinitely, because the cross-signing chain (master → SSK → device) was never published. Operators currently have to write their own bootstrap script and remember to run it once per bot — and it's easy to get wrong (the obvious base64.b64encode().decode() produces padded keyids that matrix-rust-sdk silently rejects in /keys/query, so even correctly-signed keys fail to load identity in Element). mautrix already has the right primitive: generate_recovery_key() does the full flow — generate seeds, upload privates to SSSS, publish publics to the homeserver, sign the current device with the new SSK, and return the human-readable recovery key. We invoke it once on startup if the bot has no existing cross-signing identity, and log the recovery key with a clear instruction to save it for future restarts via MATRIX_RECOVERY_KEY (which the existing recovery-key path already consumes). Skipped when MATRIX_RECOVERY_KEY is set (existing path takes over) or when the bot already has cross-signing keys on the homeserver (get_own_cross_signing_public_keys returns non-None). Bootstrap failure is non-fatal — logged with hint about UIA; the bot continues without cross-signing and Element will show the warning that prompted this PR. That matches the existing soft-fail pattern for verify_with_recovery_key. Tested against Continuwuity 0.5.7 (no UIA required). Synapse with UIA enabled will need a follow-up PR to thread MATRIX_PASSWORD through to /keys/device_signing/upload.	2026-04-27 21:22:44 -07:00
konsisumer	32d4048c6b	fix: MatrixAdapter respects proxy configuration	2026-04-27 21:22:44 -07:00
Adam Rummer	1eab5960f0	feat(matrix): add dm_auto_thread config for DM auto-threading Adds MATRIX_DM_AUTO_THREAD env var (default: false) to control auto-threading in DM rooms independently from channel auto-threading. Closes #15398	2026-04-27 21:22:44 -07:00
LeonSGP43	74a4832b74	fix(matrix): normalize image-only filenames	2026-04-27 21:22:44 -07:00
Alexazhu	fbbcfa24c5	fix(matrix): preserve exception tracebacks on E2EE and auth failures Five ``except Exception as exc:`` blocks in the Matrix adapter logged only ``str(exc)`` without ``exc_info=True``: - _reverify_keys_after_upload → post-upload key verification failure - _upload_keys_if_needed → initial device-key query failure - _upload_keys_if_needed → re-upload device keys failure - _upload_keys_if_needed → initial device key upload failure - connect → whoami / access-token validation failure The E2EE key paths here are security-critical: a silent traceback- less failure during device-key verification or upload makes it hard for operators to tell whether their Matrix bot is failing because of a stale token, a federation timeout, or an olm state mismatch — all three fail with different tracebacks, which ``str(exc)`` alone flattens. The contributing guide asks for ``exc_info=True`` on error logs. Append it to each of the five call sites. Pure logging enrichment.	2026-04-27 21:22:44 -07:00
Heathley	f223346eb7	fix(matrix): add sync timeout, callback diagnostics, and mention-drop logging - Wrap _sync_loop sync() call with asyncio.wait_for(timeout=45s) to guard against TCP-level hangs that the Matrix long-poll timeout cannot catch - Add logger.debug at the top of _on_room_message so LOG_LEVEL=DEBUG confirms whether callbacks fire at all (diagnoses #5819, #7914, #12614) - Add logger.debug when MATRIX_REQUIRE_MENTION silently drops a message, pointing users to the env var to disable the filter Adapted for current mautrix-python adapter (PR was written against the legacy matrix-nio adapter). Closes #5819	2026-04-27 21:22:44 -07:00
Charles Brooks	57f8cf00e9	fix(matrix): reconcile pending invites from sync state	2026-04-27 21:22:44 -07:00
Angel Claw	32b78578e0	fix(matrix): strip only explicit @mentions in _strip_mention	2026-04-27 21:22:44 -07:00
Sami Rusani	6769a0aece	fix(matrix): add outbound mention payloads	2026-04-27 21:22:44 -07:00
Teknium	6ea5699e3f	fix(compression): notify users when configured aux model fails even if main-model fallback recovers (#16775) A misconfigured auxiliary.compression.model is a user-fixable problem that silent recovery would hide. The previous retry-on-main logic transparently swallowed aux-model failures whenever the fallback succeeded, leaving the user's broken config in place and racking up future failures. Track the aux-model failure on the compressor alongside the existing fallback-placeholder fields: - _last_aux_model_failure_model: str | None - _last_aux_model_failure_error: str | None Both are set at the moment the aux model errors (captured before summary_model is cleared for retry), regardless of whether the retry succeeds. Cleared at compress() start and on on_session_reset() so a clean run doesn't leak stale warnings. Surface at three places: - gateway hygiene auto-compress: ℹ note to the platform adapter (thread_id preserved) - gateway /compress command: ℹ line appended to the reply - CLI via _emit_warning: deduped on (model, error) so repeat compactions don't spam Distinct from the existing ⚠️ dropped-turns warning: different severity, different emoji, explicit 'context is intact' reassurance.	2026-04-27 20:08:23 -07:00
iamagenius00	c61bc3f72c	fix(compression): pass thread_id metadata + add gateway test for warning delivery Address review feedback on PR #16333: 1. The hygiene-path warning send was missing metadata=_hyg_meta. On Telegram topics / Slack threads / Discord threads the warning would land in the main channel instead of the originating thread. Now reuses the same _hyg_meta dict already computed for the hygiene compaction itself. 2. New gateway-level test test_session_hygiene_warns_user_when_summary_generation_fails verifies end-to-end: - When the compressor's _last_summary_fallback_used flag is True, the gateway invokes adapter.send() exactly once. - The warning message includes the dropped count and the underlying error string. - metadata={'thread_id': ...} is propagated so the warning lands in the originating topic/thread. Tests: 20 gateway hygiene + 54 context_compressor — all pass.	2026-04-27 19:18:13 -07:00
iamagenius00	dfdc4276e8	fix(compression): notify gateway users when summary generation fails When auxiliary compression's summary LLM call fails (e.g. model 404, auxiliary model misconfigured), the compressor still drops the selected turns and inserts a static fallback placeholder — the dropped context is unrecoverable. Previously the only signal of this was a WARNING in agent.log. Gateway users (Telegram/Discord/etc.) had no way to know context was lost because the existing _emit_warning path requires a status_callback, and the gateway hygiene path uses a temporary _hyg_agent with quiet_mode=True and no callback wired up. Changes: - ContextCompressor: track _last_summary_fallback_used and _last_summary_dropped_count on each compress() call. Cleared at the start of compress() and on session reset. - gateway/run.py hygiene: after auto-compress, inspect the temp agent's compressor; if fallback was used, send a visible ⚠️ warning to the user via the platform adapter (TG/Discord/etc.) including dropped count and the underlying error. - gateway/run.py /compress: append the same warning to the manual compress reply so users running /compress see the failure too. Acceptance: - Summary success: no user-visible warning (unchanged). - Summary failure on gateway hygiene: user receives a TG/Discord message with dropped count + error + remediation hint. - Summary failure on /compress: warning appended to the command reply. - CLI status_callback / _emit_warning path is untouched. - Test coverage: two new tests verify the tracking fields are set on failure and cleared on subsequent success.	2026-04-27 19:18:13 -07:00
Teknium	f40b20d13c	fix(gateway): keep typing indicator alive across slow send_typing calls (#16763) The typing-indicator refresh loop in BasePlatformAdapter._keep_typing awaited each send_typing call unconditionally. Each call is an HTTP round-trip to the platform API (Telegram/Discord), normally ~100ms. When the same network instability that causes upstream provider timeouts (e.g. Anthropic capacity blips slowing first-token latency past the 120s stream-read timeout) also slows the platform typing API to multi-second response times, the refresh loop stalls inside the await. Platform-side typing expires at ~5s, so the bubble dies and stays dead until the stuck send_typing call returns, right when the user most needs the 'still working' signal and instead sees a bot that looks dead, then asks 'wtf are you doing' which itself interrupts the eventually-recovering turn. Bound each send_typing with asyncio.wait_for (1.5s cap, derived from interval so it's always below the 2s cadence). Slow calls get abandoned so the next scheduled tick fires a fresh send_typing on schedule. As long as any one of them reaches the platform within its ~5s typing-expiry window, the bubble stays visible across the stall. Also catches non-timeout send_typing exceptions (transient HTTP errors) so one bad tick doesn't terminate the whole loop. Tests: 4 new in tests/gateway/test_keep_typing_timeout.py covering slow-send non-blocking, fast-send still-awaited, exception resilience, and paused-chat regression guard.	2026-04-27 19:09:32 -07:00
helix4u	49fb75463f	fix(gateway): keep env-token Slack enabled	2026-04-27 18:19:14 -07:00
Erosika	49e3a1d8ee	style: trim verbose comment blocks added by previous commit	2026-04-27 12:37:33 -07:00
Erosika	e553f6f3e4	fix(memory): narrow scrub surface to known wrapper boundaries Reviewer pushback on the original boundary-hardening commits — three overreach points pulled plugin-specific policy into shared core paths: 1. gateway/run.py hardcoded a '## Honcho Context' literal split for vision-LLM output. Plugin-format heading in framework code; could truncate legitimate output naturally containing that header. Drop the literal split; keep generic sanitize_context (the wrapper strip is plugin-agnostic). Plugin-specific cleanup belongs at the provider boundary, not the shared gateway path. 2. run_agent.run_conversation scrubbed user_message and persist_user_message before the conversation loop. User text is sacred — if a user types a literal <memory-context> tag we must not silently delete it. The producer (build_memory_context_block) is the only legitimate emitter; user input should never need the reverse op. 3. _build_assistant_message scrubbed model output before persistence. Same hazard: would silently mutate legitimate documentation/code the model emits containing the literal markers. The streaming scrubber catches real leaks delta-by-delta before content is concatenated; persist-time scrub was redundant belt-and-suspenders. 4. _fire_stream_delta stripped leading newlines from every delta unless a paragraph break flag was set. Mid-stream '\n' is legitimate markdown — lists, code fences, paragraph breaks — and chunk boundaries are arbitrary. Narrow lstrip to the very first delta of the stream only (so stale provider preamble still gets cleaned on turn start, but mid-stream formatting survives). Plus: build_memory_context_block now logs a warning when its defensive sanitize_context strips something — surfaces buggy providers returning pre-wrapped text instead of silently double-fencing. Net architectural change: scrub surface collapses from 8 sites to 3 (StreamingContextScrubber on output deltas, plugin→backend send, build_memory_context_block input-validation). 
Plugin-specific strings stay out of shared runtime paths. User input and persisted assistant output are no longer mutated. Tests: rescoped TestMemoryContextSanitization (helper-correctness only, no source-inspection of removed call sites), updated vision tests to drop '## Honcho Context' literal-split assertions, updated _build_assistant_message persistence test to assert preservation. Added: cross-turn scrubber reset, build_memory_context_block warn-on- violation, mid-stream newline preservation (plain + code fence).	2026-04-27 12:37:33 -07:00
Erosika	3b2edb347d	fix(gateway): scrub memory-context leaks from vision auto-analysis output fixes #5719 The auxiliary vision LLM called by gateway._enrich_message_with_vision can echo its injected Honcho system prompt back into the image description. That description gets embedded verbatim into the enriched user message, so recalled memory (personal facts, dialectic output) surfaces into a user-visible bubble. Strips both forms of leak before embedding: - <memory-context>...</memory-context> fenced blocks (sanitize_context) - trailing '## Honcho Context' sections (header + everything after) Plus regression tests: - tests/agent/test_streaming_context_scrubber.py — 13 tests on the stateful scrubber (whole block, split tags, false-positive partial tags, unterminated span, reset, case-insensitivity) - tests/run_agent/test_run_agent_codex_responses.py — 2 new tests on _fire_stream_delta covering the realistic 7-chunk leak scenario and the cross-turn scrubber reset - tests/gateway/test_vision_memory_leak.py — 4 tests covering the vision auto-analysis boundary (clean pass-through, '## Honcho Context' header, fenced block, both patterns together)	2026-04-27 12:37:33 -07:00
Brooklyn Nicholson	633f74504f	fix(ci): resolve follow-up title edge case and flaky checks Handle queued-title ValueError cleanup during session init, harden Discord message source building for test stubs, and fix the Dockerfile contract test syntax error. Also refresh the TUI lockfile and Nix build flags so nix ubuntu-latest no longer fails on npm lock/peer resolution drift.	2026-04-27 11:49:02 -05:00
Teknium	9b55365f6f	fix(gateway,cron): close ephemeral agents + reap stale aux clients (salvage #13979) (#16598) * fix: clean gateway auxiliary client caches on teardown * fix(gateway): recover from stale pid files and close cron agents Two issues were keeping the gateway from surviving long runs: 1. `_cleanup_invalid_pid_path` delegated to `remove_pid_file`, which refuses to unlink when the file's pid differs from our own. That safety check exists for the --replace atexit handoff, but it also applied to stale-record cleanup, so after a crashy exit the pid file was orphaned: `write_pid_file()`'s O_EXCL create then failed with `FileExistsError`, and systemd looped on "PID file race lost to another gateway instance". Unlink unconditionally from this helper since the caller has already verified the record is dead. 2. The cron scheduler never closed the ephemeral `AIAgent` it creates per tick, and never swept the process-global auxiliary-client cache. Over days of 10-minute ticks this leaked subprocesses and async httpx transports until the gateway hit EMFILE. Release the agent and call `cleanup_stale_async_clients()` in `run_job`'s outer `finally`, matching the gateway's own per-turn cleanup. * chore(release): map bloodcarter@gmail.com -> bloodcarter --------- Co-authored-by: bloodcarter <bloodcarter@gmail.com>	2026-04-27 07:41:42 -07:00
briandevans	500774e30e	fix(gateway): pass session messages to shutdown_memory_provider (#15165) ``_cleanup_agent_resources`` previously invoked ``agent.shutdown_memory_provider()`` with no arguments, so every memory provider's ``on_session_end`` hook received an empty list. Providers with an early-return guard on empty input (Holographic, Hindsight) never extracted facts from the conversation, and users hit "抱歉，找不到相關的對話記錄" on the first turn after any gateway restart, session reset, or idle expiry. Forward ``agent._session_messages`` (the transcript the agent itself maintains and refreshes every turn via ``_persist_session``) so providers see the actual conversation. Falls back to the legacy no-arg call whenever the attribute is absent or not a list (test stubs built via ``object.__new__`` or ``MagicMock``) to preserve backward compatibility with existing suites. ``AIAgent.shutdown_memory_provider`` already accepts ``messages: list = None`` (run_agent.py:4126), so this is a pure caller-side fix. Paths that use ``skip_memory=True`` temporary agents (memory flush, hygiene auto-compress, ``/compress``) are no-ops inside ``shutdown_memory_provider`` because ``self._memory_manager`` is None; no behaviour change for them. Covers Part A of the bug report. Part B (adding ``on_session_end`` to the Hindsight plugin) is a separate concern that would benefit from this fix landing first. Regression test added at ``tests/gateway/test_shutdown_memory_provider_messages.py`` covering: populated messages forwarded, empty list still forwarded, attribute missing falls back, non-list (MagicMock) falls back, provider exceptions don't block ``close()``, None agent no-op, and agent without ``shutdown_memory_provider`` tolerated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 06:41:16 -07:00
Christian Scheid	75b460bc94	fix(email): add required Date header to outbound mail	2026-04-27 06:41:11 -07:00
Teknium	ec671c4154	feat(image-input): native multimodal routing based on model vision capability (#16506) * feat(image-input): native multimodal routing based on model vision capability Attach user-sent images as OpenAI-style content parts on the user turn when the active model supports native vision, so vision-capable models see real pixels instead of a lossy text description from vision_analyze. Routing decision (agent/image_routing.py::decide_image_input_mode): agent.image_input_mode = auto | native | text (default: auto) In auto mode: - If auxiliary.vision.provider/model is explicitly configured, keep the text pipeline (user paid for a dedicated vision backend). - Else if models.dev reports supports_vision=True for the active provider/model, attach natively. - Else fall back to text (current behaviour). Call sites updated: gateway/run.py (all messaging platforms), tui_gateway (dashboard/Ink), cli.py (interactive /attach + drag-drop). run_agent.py changes: - _prepare_anthropic_messages_for_api now passes image parts through unchanged when the model supports vision; the Anthropic adapter translates them to native image blocks. Previous behaviour (vision_analyze → text) only runs for non-vision Anthropic models. - New _prepare_messages_for_non_vision_model mirrors the same contract for chat.completions and codex_responses paths, so non-vision models on any provider get text-fallback instead of failing at the provider. - New _model_supports_vision() helper reads models.dev caps. vision_analyze description rewritten: positions it as a tool for images NOT already visible in the conversation (URLs, tool output, deeper inspection). Prevents the model from redundantly calling it on images already attached natively. Config default: agent.image_input_mode = auto. Tests: 35 new (test_image_routing.py + test_vision_aware_preprocessing.py), all existing tests that reference _prepare_anthropic_messages_for_api still pass (198 targeted + new tests green). 
* feat(image-input): size-cap + resize oversized images, charge image tokens in compressor Two follow-ups that make the native image routing safer for long / heavy sessions: 1) Oversize handling in build_native_content_parts: - 20 MB ceiling per image (matches vision_tools._MAX_BASE64_BYTES, the most restrictive provider — Gemini inline data). - Delegates to vision_tools._resize_image_for_vision (Pillow-based, already battle-tested) to downscale to 5 MB first-try. - If Pillow is missing or resize still overshoots, the image is dropped and reported back in skipped[]; caller falls back to text enrichment for that image. 2) Image-token accounting in context_compressor: - New _IMAGE_TOKEN_ESTIMATE = 1600 (matches Claude Code's constant; within the realistic range for Anthropic/GPT-4o/Gemini billing). - _content_length_for_budget() helper: sums text-part lengths and charges _IMAGE_CHAR_EQUIVALENT (1600 * 4 chars) per image/image_url/ input_image part. Base64 payload inside image_url is NOT counted as chars — dimensions don't matter, only image-presence. - Both tail-cut sites (_prune_old_tool_results L527 and _find_tail_cut_by_tokens L1126) now call the helper so multi-image conversations don't slip past compression budget. Tests: 9 new in test_image_routing.py (oversize triggers resize, resize-fails-returns-None, oversize-skipped-reported), 11 new in test_compressor_image_tokens.py (flat charge per image, multiple images, Responses-API / Anthropic-native / OpenAI-chat shapes, no-inflation on raw base64, bounds-check on the constant, integration test that an image-heavy tail actually gets trimmed). * fix(image-input): replace blanket 20MB ceiling with empirically-verified per-provider limits The previous commit imposed a hardcoded 20 MB base64 ceiling on all providers, triggering auto-resize on anything larger. This was wrong in both directions: * Too loose for Anthropic — actual limit is 5 MB (returns HTTP 400 'image exceeds 5 MB maximum' above that). 
* Too strict for OpenAI / Codex / OpenRouter — accept 49 MB+ without complaint (empirically verified April 2026 with progressive PNG sizes). New behaviour: * _PROVIDER_BASE64_CEILING table: only anthropic and bedrock have a ceiling (5 MB, since bedrock-on-Claude shares Anthropic's decoder). * Providers NOT in the table get no ceiling — images attach at native size and we trust the provider to return its own error if it disagrees. A provider-specific 400 message is clearer than us guessing wrong and silently degrading image quality. * build_native_content_parts() gains a keyword-only provider arg; gateway/CLI/TUI pass the active provider so Anthropic users get auto-resize protection while OpenAI users don't pay it. * Resize target dropped from 5 MB to 4 MB to slide safely under Anthropic's boundary with header overhead. Empirical measurements (direct API, no Hermes in the loop): image b64 anthropic openrouter/gpt5.5 codex-oauth/gpt5.5 0.19 MB ✓ ✓ ✓ 12.37 MB ✗ 400 5MB ✓ ✓ 23.85 MB ✗ 400 5MB ✓ ✓ 49.46 MB ✗ 413 ✓ ✓ Tests: rewrote TestOversizeHandling (5 tests): no-ceiling pass-through, Anthropic resize fires, Anthropic skip on resize-fail, build_native_parts routes ceiling by provider, unknown provider gets no ceiling. All 52 targeted tests pass. * refactor(image-input): attempt native, shrink-and-retry on provider reject Replace proactive per-provider size ceilings with a reactive shrink path on the provider's actual rejection. All providers now attempt native full-size attachment first; if the provider returns an image-too-large error, the agent silently shrinks and retries once. Why the previous design was wrong: hardcoding provider ceilings (anthropic=5MB, others=unlimited) meant OpenAI users on a 10MB image paid no tax, but Anthropic users lost quality on anything >5MB even though the empirical behaviour at provider-reject time is the same (shrink + retry). Baking the table into the routing layer also requires updating Hermes every time a provider's limit changes. 
Reactive design: - image_routing.py: _file_to_data_url encodes native size, no ceiling. build_native_content_parts drops its provider kwarg. - error_classifier.py: new FailoverReason.image_too_large + pattern match ("image exceeds", "image too large", etc.) checked BEFORE context_overflow so Anthropic's 5MB rejection lands in the right bucket. - run_agent.py: new _try_shrink_image_parts_in_messages walks api messages in-place, re-encodes oversized data: URL image parts through vision_tools._resize_image_for_vision to fit under 4MB, handles both chat.completions (dict image_url) and Responses (string image_url) shapes, ignores http URLs (provider-fetched). New image_shrink_retry_attempted flag in the retry loop fires the shrink exactly once per turn after credential-pool recovery but before auth retries. E2E verified live against Anthropic claude-sonnet-4-6: - 17.9MB PNG (23.9MB b64) attached at native size - Anthropic returns 400 "image exceeds 5 MB maximum" - Agent logs '📐 Image(s) exceeded provider size limit — shrank and retrying...' - Retry succeeds, correct response delivered in 6.8s total. Tests: 12 new (8 shrink-helper shapes + 4 classifier signals), replaces 5 proactive-ceiling tests with 3 simpler 'native attach works' tests. 181 targeted tests pass. test_enum_members_exist in test_error_classifier.py updated for the new enum value.	2026-04-27 06:27:59 -07:00
Teknium	90a3e73daf	fix(debug): sweep expired paste.rs uploads on a real timer (#16431) Previously 'hermes debug share' uploads only got DELETEd when the user ran 'hermes debug share' again; opportunistic-sweep-on-invoke was the only cleanup path. A user who uploaded once and never ran debug again left pastes up until paste.rs's retention kicked in (which, empirically, never actually expires them). Hook _sweep_expired_pastes into the gateway cron ticker at the same hourly cadence as the image/document cache cleanups. The opportunistic sweep in 'hermes debug share' stays as a fallback for CLI-only users who never start the gateway.	2026-04-27 00:36:33 -07:00
alberto	3ff3dfb5ac	fix(telegram): accept /cmd@botname from bot menu in groups Telegram groups emit a single bot_command entity covering the whole /cmd@botname span with no accompanying mention entity, so the existing mention gate in _message_mentions_bot dropped slash commands sent via the bot-menu autocomplete whenever require_mention is enabled. Recognise bot_command entities whose @botname suffix matches the bot username (case-insensitive) as a direct mention, and keep rejecting commands addressed at other bots. Fixes #15415.	2026-04-26 22:00:18 -07:00
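Note: the matching rule in the commit above can be sketched as follows. This is an illustration, not the adapter's `_message_mentions_bot`: the function name and the plain-dict entity shape are assumptions. Telegram entity offsets count UTF-16 code units, so slicing a Python string by offset is only exact for text without astral-plane characters before the command.

```python
def command_addresses_bot(text: str, entities: list, bot_username: str) -> bool:
    """Return True if a bot_command entity carries an @suffix naming this bot."""
    for entity in entities:
        if entity.get("type") != "bot_command":
            continue
        start = entity["offset"]
        span = text[start:start + entity["length"]]  # e.g. "/help@HermesBot"
        _, sep, target = span.partition("@")
        # A bare "/help" has no suffix and is left to the existing mention gate.
        if sep and target.lower() == bot_username.lower():
            return True
    return False
```

Commands addressed to another bot (`/help@OtherBot`) fall through and return False, so they are still rejected when require_mention is enabled.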
Teknium	af3d5150c1	fix(matrix): close 'hall of mirrors' pairing + echo loop (#15763) (#16374)

Harden the Matrix adapter's sender-drop guards so bot-self events and appservice/bridge identities never reach the gateway's pairing flow or the agent loop. Two filters, applied as early as possible in _on_room_message (and _on_reaction for the self-filter):

1. _is_self_sender(sender): case-insensitive + whitespace-trimmed equality with self._user_id. When self._user_id is still empty (whoami has not resolved, or login failed), returns True defensively: an unidentified bot dropping its own events is always preferable to falling into an echo loop. The previous byte-for-byte equality check let differently-cased copies of the bot's MXID slip through, and an unresolved self-ID silently disabled the guard.
2. _is_system_or_bridge_sender(sender): drops appservice namespace puppets (conventional @_bridge_...:server form) and malformed senders with an empty localpart. These identities used to fall through to the gateway's unauthorized-user path, trigger a pairing code, and, once an operator approved the bridge, every outbound message the bridge relayed would loop back as an authorized user message. This was the root of the 'hall of mirrors' symptom.

Fixes #15763

Test plan
---------
scripts/run_tests.sh tests/gateway/test_matrix.py
scripts/run_tests.sh tests/gateway/test_matrix_mention.py tests/gateway/test_matrix_voice.py

All 182 tests pass. 14 new regression tests cover exact / case-insensitive / whitespace / unresolved-self-id matches, bridge prefix detection, empty sender, and the full _on_room_message drop path.	2026-04-26 21:50:28 -07:00
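Note: a minimal sketch of the two sender filters described in the commit above, written as free functions rather than adapter methods. Two points are assumptions, not taken from the commit: any localpart beginning with `_` is treated as an appservice puppet, and a sender without a leading `@` is treated as malformed.

```python
def is_self_sender(sender: str, own_user_id: str) -> bool:
    """Case-insensitive, whitespace-trimmed comparison with the bot's MXID."""
    own = (own_user_id or "").strip().lower()
    if not own:
        # Self-ID not resolved yet: drop defensively rather than risk an echo loop.
        return True
    return (sender or "").strip().lower() == own


def is_system_or_bridge_sender(sender: str) -> bool:
    """Drop appservice puppets and senders with an empty localpart."""
    mxid = (sender or "").strip()
    # "@localpart:server" -> "localpart"; anything else yields "".
    localpart = mxid[1:].partition(":")[0] if mxid.startswith("@") else ""
    if not localpart:
        return True
    return localpart.startswith("_")  # assumed puppet namespace convention
```

A message handler would check both before any pairing or authorization logic and return early when either is True.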
Teknium	4a2ee6c162	fix(title-gen): surface auxiliary failures via _emit_auxiliary_failure Closes #15775. Title generation swallowed exceptions at debug level and returned None, so a depleted auxiliary provider (e.g. OpenRouter 402) silently left sessions with NULL titles. Reporter observed 45 untitled sessions accumulated over 19 days with no user-visible indication. - agent/title_generator.py: accept optional failure_callback, bump log to WARNING, invoke callback on call_llm exception (swallowing callback errors so nothing can crash the fire-and-forget worker thread). - cli.py, gateway/run.py: pass agent._emit_auxiliary_failure as the callback so failures route through the existing user-visible warning channel. - tests: cover callback fires / errors are swallowed / no-callback legacy behavior / maybe_auto_title forwards kwarg to worker.	2026-04-26 21:49:34 -07:00
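Note: the failure-callback pattern from the commit above, sketched under assumptions. The function name, the `call_llm` parameter, and the callback's `(task, exc)` signature are hypothetical; the commit only states that the log level becomes WARNING, that the callback fires when call_llm raises, and that callback errors are swallowed.

```python
import logging

logger = logging.getLogger("title_generator_sketch")


def generate_title(call_llm, prompt, failure_callback=None):
    """Return a title, or None on failure after reporting it."""
    try:
        return call_llm(prompt)
    except Exception as exc:
        # WARNING rather than DEBUG so a depleted provider is visible in logs.
        logger.warning("Title generation failed: %s", exc)
        if failure_callback is not None:
            try:
                failure_callback("title generation", exc)
            except Exception:
                # Nothing may crash the fire-and-forget worker thread.
                logger.debug("failure_callback raised", exc_info=True)
        return None
```

Leaving `failure_callback` unset keeps the legacy behaviour of returning None, now with a warning in the log.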
Teknium	6993e566ba	fix(whatsapp_identity): pin identifier regex to ASCII, clarify it's defense-in-depth Follow-up on top of #16243. Two small tweaks: - Compile the regex once as `_SAFE_IDENTIFIER_RE` and pin it to `[A-Za-z0-9@.+\-]`. The previous `\w` accepts Unicode word chars (full-width digits, accented letters) which aren't valid WhatsApp identifiers and shouldn't reach the mapping-file lookup. - Add a comment clarifying this is defense-in-depth, not a live traversal. The hardcoded `lid-mapping-{current}{suffix}.json` prefix already prevents escape via pathlib's component split — with `current='../secrets'`, the first path component under `session/` is the literal directory name `lid-mapping-..`, which the attacker cannot create. E2E verified: legit mapping chains still resolve, all probed attack shapes (`../`, absolute paths, shell metacharacters, Unicode digit tricks) are rejected before any file access.	2026-04-26 20:48:31 -07:00
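Note: a short sketch of the two claims in the commit above: an ASCII-pinned character class rejects inputs that `\w` would accept, and the hardcoded filename prefix keeps `../` from becoming its own path component. The `+` quantifier, the use of `fullmatch`, and the helper names are assumptions; only the character class and the `lid-mapping-{current}{suffix}.json` template come from the commit message.

```python
import re
from pathlib import PurePosixPath

# Character class as given in the commit; quantifier and fullmatch are assumed.
_SAFE_IDENTIFIER_RE = re.compile(r"[A-Za-z0-9@.+\-]+")


def is_safe_identifier(identifier: str) -> bool:
    return _SAFE_IDENTIFIER_RE.fullmatch(identifier) is not None


def mapping_path(session_dir: str, current: str, suffix: str = "") -> PurePosixPath:
    # The literal "lid-mapping-" prefix is glued onto the first component.
    return PurePosixPath(session_dir) / f"lid-mapping-{current}{suffix}.json"


if __name__ == "__main__":
    fullwidth_digits = "\uff11\uff12\uff13"
    print(re.fullmatch(r"\w+", fullwidth_digits) is not None)  # \w accepts them
    print(is_safe_identifier(fullwidth_digits))                # ASCII class does not
    print(mapping_path("session", "../secrets").parts)
```

With `current='../secrets'` the path components are `session`, `lid-mapping-..`, and `secrets.json`: the `..` never stands alone as a component, which is why the commit calls the regex defense-in-depth.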
sprmn24	91512b8210	fix(whatsapp_identity): guard against path traversal and silent mapping errors expand_whatsapp_aliases() interpolated untrusted identifiers directly into filenames (lid-mapping-{current}.json) without validation. An identifier containing ../ or / could escape the session directory. Also replaced bare except Exception: continue with targeted (OSError, json.JSONDecodeError) and a debug log so mapping corruption is diagnosable instead of silently skipped. Fixes: - Reject identifiers with unsafe characters via re.match guard - Replace broad exception swallow with specific catch + debug log	2026-04-26 20:48:31 -07:00
Teknium	478444c262	feat(checkpoints): auto-prune orphan and stale shadow repos at startup (#16303) Every working dir hermes ever touches gets its own shadow git repo under ~/.hermes/checkpoints/{sha256(abs_dir)[:16]}/. The per-repo _prune is a no-op (comment in CheckpointManager._prune says so), so abandoned repos from deleted/moved projects or one-off tmp dirs pile up forever. Field reports put the typical offender at 1000+ repos / ~12 GB on active contributor machines. Adds an opt-in startup sweep that mirrors the sessions.auto_prune pattern from #13861 / #16286: - tools/checkpoint_manager.py: new prune_checkpoints() and maybe_auto_prune_checkpoints() helpers. Deletes shadow repos that are orphan (HERMES_WORKDIR marker points to a path that no longer exists) or stale (newest in-repo mtime older than retention_days). Idempotent via a CHECKPOINT_BASE/.last_prune marker file so it only runs once per min_interval_hours regardless of how many hermes processes start up. - hermes_cli/config.py: new checkpoints.auto_prune / retention_days / delete_orphans / min_interval_hours knobs. Default auto_prune: false so users who rely on /rollback against long-ago sessions never lose data silently. - cli.py / gateway/run.py: startup hooks gated on checkpoints.auto_prune, called right next to the existing state.db maintenance block. - Docs updated with the new config knobs. - 11 regression tests: orphan/stale deletion, precedence, byte-freed tracking, non-shadow dir skip, interval gating, corrupt marker recovery. Refs #3015 (session-file disk growth was fixed in #16286; this covers the checkpoint side noted out-of-scope there).	2026-04-26 19:05:52 -07:00
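Note: two mechanisms from the commit above, sketched under assumptions: the shadow-repo directory name derived from `sha256(abs_dir)[:16]`, and the interval gate backed by a `.last_prune` marker. The function names, the use of the hex digest, and the marker's on-disk format (a Unix timestamp as text) are guesses and may differ from tools/checkpoint_manager.py.

```python
import hashlib
import os
import time
from pathlib import Path


def shadow_repo_name(workdir: str) -> str:
    """First 16 hex characters of the SHA-256 of the absolute working dir."""
    abs_dir = os.path.abspath(workdir)
    return hashlib.sha256(abs_dir.encode("utf-8")).hexdigest()[:16]


def should_run_prune(marker: Path, min_interval_hours: float, now=None) -> bool:
    """Gate the sweep so it runs at most once per interval across processes."""
    now = time.time() if now is None else now
    try:
        last_run = float(marker.read_text().strip())
    except (OSError, ValueError):
        # Missing or corrupt marker: treat as "never ran" and allow the sweep.
        return True
    return now - last_run >= min_interval_hours * 3600


def record_prune(marker: Path, now=None) -> None:
    marker.write_text(str(time.time() if now is None else now))
```

A startup hook would call `should_run_prune` only when `checkpoints.auto_prune` is enabled, run the sweep, then call `record_prune`.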
Teknium	77d4766602	fix(gateway): clear pending model note on auto-reset paths too PR #16013 plugged the leak in `/new`, but two sibling session-boundary resets had the same bug: 1. Inactivity / suspended-session auto-reset (top of `_handle_message`) previously cleared only reasoning. Now drops model override and the queued "/model switched" note as well. 2. Compression-exhaustion auto-reset now also drops the pending note alongside the existing model/reasoning cleanup. All three session-boundary sites now use the identical cleanup idiom.	2026-04-26 19:01:50 -07:00
johnncenae	00c6480a05	fix(gateway): clear stale pending model note on session reset	2026-04-26 19:01:50 -07:00
simbam99	cebf95854b	Fix MessageDeduplicator max_size enforcement	2026-04-26 18:51:51 -07:00
Teknium	ab6879634e	yuanbao platform (#16298) Co-authored-by: loongzhao <loongzhao@tencent.com>	2026-04-26 18:50:49 -07:00
Teknium	90c84c6dba	fix(gateway): unblock update subprocess on recognized-command bypass When the gateway intercepts a pending /update prompt and the user sends a recognized slash command (/new, /help, ...), the command now dispatches normally AND the detached update subprocess is unblocked by writing a blank .update_response. _gateway_prompt reads '' → strips → returns the prompt's default (typically a safe 'n' / skip), so the update process exits cleanly instead of blocking on stdin until the 30-minute watcher timeout. Also clears _update_prompt_pending[session_key] on this path so stray future input for the same session isn't re-intercepted. Extends PR #15849 with tests for the new cancel-write + a regression test pinning the legacy behavior of unrecognized /foo slash commands still being consumed as the response.	2026-04-26 18:39:44 -07:00
Yukipukii1	bdaf56a94d	fix(gateway): bypass slash commands during pending update prompts	2026-04-26 18:39:44 -07:00
Badgerbees	55f212a7a2	fix(slack): honor NO_PROXY for Slack transport	2026-04-26 18:33:35 -07:00
Xnbi	7eaad06a87	fix(gateway): default Slack tool_progress to off Slack Bolt posts are not editable like CLI spinners; medium-tier new still emitted a permanent line per tool start (issue #14663). - Built-in slack default: off; other tier-2 platforms unchanged. - Adjust /verbose isolation test for off to new cycle. - Migration tests: read/write config.yaml as UTF-8 (Windows locale).	2026-04-26 18:33:35 -07:00
haru398801	a01e767b24	fix(gateway): respect config.yaml slack.enabled when SLACK_BOT_TOKEN env var is set Previously, setting SLACK_BOT_TOKEN in .env would unconditionally enable the Slack gateway adapter regardless of `slack.enabled: false` in config.yaml. This caused spurious "SLACK_APP_TOKEN not set" errors when the token was used only by skills (e.g. cron jobs that send Slack messages) rather than for the Hermes messaging gateway. Now, enabled: false in config.yaml is respected — the token is stored so skills can still use it, but the gateway adapter is not activated. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-26 18:33:35 -07:00
hharry11	fd474d0f00	fix(gateway): avoid cross-user mirror writes in per-user group sessions	2026-04-26 18:31:24 -07:00
Yang Zhi	3b60abb6bb	fix(sessions): delete on-disk transcript files during prune and delete (#3015) `delete_session()` and `prune_sessions()` only removed SQLite records, leaving .json/.jsonl transcript files on disk forever. Over time this causes unbounded disk growth (~27MB/day observed). Changes: - Add `_remove_session_files()` static helper that cleans up `{session_id}.json`, `.jsonl`, and `request_dump_{session_id}_*.json` - `delete_session()` accepts optional `sessions_dir` param and removes files for the deleted session and its children - `prune_sessions()` accepts optional `sessions_dir` param and removes files for all pruned sessions after the DB transaction - Wire up CLI `hermes sessions delete` and `hermes sessions prune` to pass `sessions_dir` - File cleanup is best-effort (OSError silenced) so DB operations are never blocked by filesystem issues - Fully backward-compatible: `sessions_dir=None` (default) preserves existing behavior	2026-04-26 18:31:07 -07:00
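Note: a sketch of the best-effort file cleanup described in the commit above. The three filename patterns come from the commit message; the function name, the returned count, and the assumption that session IDs contain no glob metacharacters are illustrative.

```python
from pathlib import Path


def remove_session_files(sessions_dir, session_id: str) -> int:
    """Delete transcript and request-dump files for one session.

    Best-effort: OSError is silenced so a filesystem problem never blocks
    the database operation. Returns the number of files removed.
    """
    base = Path(sessions_dir)
    candidates = [base / f"{session_id}.json", base / f"{session_id}.jsonl"]
    # Collect matches up front so deletion does not race the directory scan.
    candidates += list(base.glob(f"request_dump_{session_id}_*.json"))

    removed = 0
    for path in candidates:
        try:
            path.unlink()
            removed += 1
        except OSError:
            pass  # missing file or permission problem: skip silently
    return removed
```

Calling it a second time for the same session removes nothing and raises nothing, so it is safe to run after the DB transaction regardless of on-disk state.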
mewwts	8fb861ea6e	feat(gateway/slack): support channel_skill_bindings

Extends the existing channel_skill_bindings mechanism (previously Discord-only) to Slack, so a channel or DM can auto-load one or more skills at session start without relying on the model's skill selector for every short reply.

Motivation: Mats's German flashcards DM pushes a cron-driven card 5x/day; he responds with one-word guesses like 'work'. Previously each reply required the main agent to decide whether to load german-flashcards (full opus turn just to pick a skill). With the binding configured per Slack channel, the skill is injected at session start and grading runs directly.

Changes:
- Extract resolve_channel_skills() from DiscordAdapter._resolve_channel_skills into gateway.platforms.base (now shared across adapters).
- DiscordAdapter._resolve_channel_skills delegates to the shared helper (behavior preserved; existing test suite still passes unchanged).
- SlackAdapter: resolve channel_skill_bindings on each message and attach auto_skill to MessageEvent. gateway/run.py already handles auto-skill injection on new sessions; this just wires Slack through it.
- gateway/config.py: accept channel_skill_bindings in slack: block of config.yaml (was Discord-only).
- Tests: new tests/gateway/test_slack_channel_skills.py with 11 cases covering DM/thread/parent resolution, single-vs-list skills, dedup, malformed entries. Discord suite unchanged.
- Docs: add 'Per-Channel Skill Bindings' section to Slack user guide.

Config example:

    slack:
      channel_skill_bindings:
        - id: "D0ATH9TQ0G6"
          skills: ["german-flashcards"]	2026-04-26 18:25:41 -07:00
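Note: a rough sketch of how bindings like the config example above might be resolved for an incoming message. It is not the shared helper in gateway.platforms.base: the signature, the `parent_id` parameter, and the handling of malformed entries are assumptions based on the test cases the commit lists (parent resolution, single-vs-list skills, dedup, malformed entries).

```python
def resolve_channel_skills(bindings, channel_id, parent_id=None):
    """Return de-duplicated skill names bound to a channel or its parent."""
    wanted = {cid for cid in (channel_id, parent_id) if cid}
    skills = []
    for entry in bindings or []:
        if not isinstance(entry, dict):
            continue  # malformed entry: skip
        if str(entry.get("id", "")) not in wanted:
            continue
        raw = entry.get("skills")
        if isinstance(raw, str):
            raw = [raw]  # accept a single skill given as a plain string
        if not isinstance(raw, list):
            continue
        for name in raw:
            # Preserve first-seen order while dropping duplicates.
            if isinstance(name, str) and name and name not in skills:
                skills.append(name)
    return skills
```

An adapter would pass the result along on the message event only when the list is non-empty, leaving unbound channels on the normal skill-selection path.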

1274 commits