hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-14 14:12:44 +00:00

Author	SHA1	Message	Date
AhmetArif0	5848174374	fix(wecom): guard flush task against cancel-delivery race to prevent message loss When asyncio.sleep() fires just before Task.cancel() is called, CPython sets _must_cancel=True but cannot cancel the already-completed sleep future, so CancelledError is delivered at the next await (handle_message) rather than at the sleep. By that point the superseded task has already popped the merged event from _pending_text_batches, so the superseding task sees an empty batch and silently drops the message. Fix: add a synchronous task-registry check between the sleep and the pop. No await between the check and the pop means no other coroutine can interleave, so the guard is race-free.	2026-05-24 01:33:40 -07:00
Teknium	1bed4e8eed	fix(gateway): drop text snippet from debounce debug log (CodeQL) CodeQL py/clear-text-logging-sensitive-data flagged the candidate-accept debug log including event.text[:60]. Log text_len instead — sufficient for debugging burst behavior without surfacing message contents. Co-authored-by: Paulo Nascimento <pnascimento9596@gmail.com>	2026-05-24 01:31:45 -07:00
Paulo Nascimento	7abd62719b	gateway: debounce queued text follow-ups	2026-05-24 01:31:45 -07:00
AhmetArif0	21db250034	fix(wecom-callback): retry send with fresh token on errcode 40001/42001 When WeCom returns errcode=40001 (invalid credential) or 42001 (token expired), send() was returning a failure without evicting the bad token from _access_tokens. All subsequent sends then kept using the same invalid cached token until its TTL naturally expired (~7200s). Fix: on the first token-rejection errcode, evict the cache entry and retry once with a freshly fetched token. Non-token errcodes fail immediately as before. If the refreshed token also fails, the error is returned without looping further. Adds four regression tests covering: successful retry on 40001, successful retry on 42001, no retry on unrelated errcode, and clean failure when the refresh does not help.	2026-05-24 01:30:47 -07:00
AhmetArif0	39b8d1d313	fix(dingtalk): finalize open streaming cards before disconnect AI Card "tool progress" cards created with finalize=False were left in streaming state on DingTalk's UI after a gateway restart because disconnect() called _streaming_cards.clear() without first closing them via _close_streaming_siblings. Move the finalization loop before self._http_client.aclose() so the HTTP client is still available when the finalize requests are sent. Adds a regression test that asserts the HTTP client is alive during finalization.	2026-05-23 20:48:56 -07:00
Teknium	e42fcc5625	fix(provider): make config.yaml model.provider the single source of truth (#31222 ) Policy: if it ain't a secret it goes in config.yaml. HERMES_INFERENCE_PROVIDER was leaking behavioral config into the .env surface, including from the gateway, which bypassed config.yaml entirely. Behavior: - gateway/run.py: drop HERMES_INFERENCE_PROVIDER read in _resolve_runtime_agent_kwargs. Gateway now flows through resolve_runtime_provider() with no `requested` override, which reads model.provider from config.yaml first. Docs/UX (strip env var from user-facing surface): - --provider help text no longer mentions the env var - cli-config.yaml.example same - reference/environment-variables.md: remove HERMES_INFERENCE_PROVIDER row and the cross-reference from HERMES_INFERENCE_MODEL - reference/cli-commands.md: blank the env-var column for --provider - guides/xai-grok-oauth.md, guides/minimax-oauth.md: replace HERMES_INFERENCE_PROVIDER=x hermes invocations with config.yaml / --provider - developer-guide/adding-providers.md, model-provider-plugin.md: reframe Internal mechanism (kept as-is): - hermes_cli/main.py writes HERMES_INFERENCE_PROVIDER into the TUI subprocess env - tui_gateway/server.py reads it on TUI startup - resolve_requested_provider() / oneshot.py / cli.py still fall through to the env var as a last-resort behind config.yaml, which is what makes the TUI parent->child handoff work This stays. We just stop documenting it as a user knob. Tests: tests/gateway/test_auth_fallback.py — simplify mock to fail on first call, succeed on second; drop monkeypatch.setenv lines that no longer matter. Supersedes #31064 (closed with credit to @novax635 who surfaced the underlying issue but proposed aligning gateway to the env var rather than removing it).	2026-05-23 18:18:41 -07:00
Edison	e752c9454e	feat(plugins): add register_auxiliary_task() to PluginContext API Auxiliary LLM tasks (vision, compression, web_extract, etc.) currently require modifications to core files for any plugin that needs its own task slot — specifically the _AUX_TASKS list in hermes_cli/main.py and the hardcoded env-var bridging dict in gateway/run.py. This violates the 'plugins must not modify core files' rule and forces every memory or context plugin that wants its own auxiliary task to either fork core or open a coupled core+plugin PR. This change adds a generic plugin surface for auxiliary task registration: ctx.register_auxiliary_task( key='memory_retain_filter', display_name='Memory retain filter', description='hindsight pre-retain dedup/extract', defaults={'timeout': 30, 'extra_body': {'reasoning_effort': 'low'}}, ) After registration, the task automatically: - Appears in 'hermes model → Configure auxiliary models' picker via a new _all_aux_tasks() merge of built-in + plugin tasks - Has its provider/model/base_url/api_key bridged from config.yaml to AUXILIARY_<KEY_UPPER>_* env vars at gateway startup (gateway/run.py now uses a dynamic bridged-keys set instead of a hardcoded per-task dict) - Gets plugin-declared defaults (timeout, extra_body, etc.) layered underneath user config so unconfigured plugin tasks still work (agent/auxiliary_client._get_auxiliary_task_config) - Resets to auto via 'Reset all to auto' alongside built-ins Validation: - Rejects shadowing of built-in keys (vision, compression, etc.) - Rejects invalid key shapes (must match [A-Za-z0-9_]+) - Rejects cross-plugin collisions (clear error) - Allows same-plugin re-registration (idempotent updates) Plugin discovery failures (rare) fall back gracefully — the aux config UI still shows built-in tasks if get_plugin_auxiliary_tasks() raises, and gateway env-var bridging keeps working for built-ins. Built-in tasks remain hardcoded in _AUX_TASKS for stability — they're the baseline UX, and DEFAULT_CONFIG already ships their defaults. Plugin tasks layer on top. Tests: 15 new tests in test_plugin_auxiliary_tasks.py covering API validation, manager state lifecycle, helper sort order, _all_aux_tasks merge semantics, _reset_aux_to_auto inclusion of plugin tasks, and default-layering in auxiliary_client. Updates the gateway-bridge code-parity test (test_auxiliary_config_bridge) to assert the new dynamic shape rather than the hardcoded literal env var names which no longer appear post-refactor. Motivation: this unblocks PR #20262 (hindsight smart retain pipeline) and similar plugins that need a dedicated aux task slot. The change is non-breaking — built-in env vars (AUXILIARY_VISION_PROVIDER, etc.) keep working since they're produced by the same f-string template that built the hardcoded names.	2026-05-23 17:49:47 -07:00
novax635	86871ee25a	fix(cli): synchronize HERMES_SESSION_ID across environment and contextvar during session switches	2026-05-23 17:46:55 -07:00
Glucksberg	9451087aab	fix(telegram): preserve observed group slash commands	2026-05-23 16:26:28 -07:00
Teknium	6a8e131a0a	refactor(ntfy): convert built-in adapter to platform plugin ntfy now ships as a self-contained plugin under plugins/platforms/ntfy/ instead of editing 8 core files (gateway/config.py Platform enum, gateway/run.py factory + auth maps, cron/scheduler.py, toolsets.py, hermes_cli/status.py, agent/prompt_builder.py, gateway/channel_directory.py, tools/send_message_tool.py). All routing goes through gateway/platform_registry via register_platform(): - adapter_factory, check_fn, validate_config, is_connected - env_enablement_fn seeds PlatformConfig.extra from NTFY_* env vars so gateway status reflects env-only setups without instantiating httpx - standalone_sender_fn handles deliver=ntfy cron jobs when cron runs out-of-process from the gateway - allowed_users_env / allow_all_env hook into _is_user_authorized - cron_deliver_env_var=NTFY_HOME_CHANNEL for cron home routing - platform_hint surfaces in the system prompt - pii_safe=True (topic names are the only identifier; no PII to redact) Tests moved to tests/gateway/test_ntfy_plugin.py using _plugin_adapter_loader so the module lives under plugin_adapter_ntfy in sys.modules and cannot collide with sibling plugin-adapter tests on the same xdist worker. The core-file grep tests (Platform.NTFY in source, hermes-ntfy in toolsets, etc.) are replaced with plugin-shape tests covering register() metadata, env_enablement_fn output, and standalone_sender_fn behavior. 68 tests pass under scripts/run_tests.sh.	2026-05-23 16:13:01 -07:00
sprmn24	b10f17bf1e	feat(ntfy): add ntfy platform adapter with atomic reconnect, identity fix, and 81 tests	2026-05-23 16:13:01 -07:00
QuenVix	7245bc77eb	fix(fallback): merge fallback_providers with legacy fallback_model configurations	2026-05-23 05:24:57 -07:00
Teknium	9acf949e34	feat(telegram): edit status messages in place instead of appending (#30864 ) Closes #30045. Based on @qike-ms's PR #30141. Telegram status callbacks (lifecycle, compression, context-pressure) used to append a fresh bubble on every emit. Now adapter tracks {(chat_id, status_key) -> message_id}; first call sends, subsequent calls edit. Failed edits drop the cache entry and fall through to a fresh send. - gateway/platforms/telegram.py: send_or_update_status() (+34 LOC) - gateway/run.py: route _status_callback_sync through it when the adapter supports it; plain adapter.send() otherwise (+15 LOC) - 5 tests covering first send / edit-in-place / edit-failure fallback / distinct key & chat isolation	2026-05-23 02:42:10 -07:00
Zyrixtrex	61ac118724	fix(webhook): enforce INSECURE_NO_AUTH safety rail on dynamic route reloads	2026-05-23 02:39:12 -07:00
walli	60b0a0e006	fix(qqbot): fix SILK magic byte detection slice length _guess_ext_from_data: data[:5] == b"#!SILK" -> data[:6] (6-byte string) _looks_like_silk: data[:4] == b"#!SILK" -> data[:6] The previous slices were too short to ever match the 6-byte "#!SILK" literal, relying entirely on the "#!SILK_V3" (9-byte) and 0x02! (2-byte) fallback paths for SILK format detection.	2026-05-23 02:27:17 -07:00
walli	0e7448d63a	fix(qqbot): use original attachment filename for cached files Add original_name parameter to _download_and_cache, preferring the attachment metadata filename over the CDN URL path basename. Previously files were cached with meaningless QQ CDN hash names (e.g. qqdownload_...oadftnv5), causing ugly filenames when sent back to users. Aligns with qqbot-agent-sdk's AttachmentDownloader.download_document.	2026-05-23 02:27:17 -07:00
walli	a54f5afc70	fix(qqbot): handle op 7/9 and expand fatal close code set 1. Handle op 7 (Server Reconnect): close WS to trigger reconnect loop while preserving session for Resume 2. Handle op 9 (Invalid Session): check d value to determine if session is resumable; clear session only when not resumable 3. Remove 4009 from session-clearing set (connection timeout is resumable) 4. Expand fatal close codes: 4001/4002/4010-4014 now stop reconnect immediately instead of retrying uselessly 5. Add unit tests	2026-05-23 02:27:17 -07:00
walli	bbd77d165c	fix(qqbot): add INTERACTION intent and expose video/file cached paths 1. Add INTERACTION intent bit (1<<26) to _send_identify, fixing approval button clicks not being received (INTERACTION_CREATE events were never dispatched by the gateway) 2. Include local cached path in video/file attachment descriptions so the LLM can reference files for re-sending to users 3. Add unit tests (TestIdentifyIntents, TestProcessAttachmentsPathExposure)	2026-05-23 02:27:17 -07:00
teknium1	66d81f9e14	fix(gateway): don't swallow expansion errors in runtime config helper A bare except in _load_gateway_runtime_config would silently return the unexpanded dict on any _expand_env_vars failure — masking the very bug this helper exists to fix. Drop it; let the caller see real errors.	2026-05-23 02:27:08 -07:00
QuenVix	2362cc4688	fix(gateway): enforce env variable template expansion on runtime config loaders	2026-05-23 02:27:08 -07:00
QuenVix	d21ac579e9	fix(gateway): honor key_env in auth-failure fallback resolution	2026-05-23 02:25:53 -07:00
QuenVix	52a368fa72	fix(gateway): preserve WhatsApp pairing approvals across JID/LID alias flips	2026-05-23 01:46:34 -07:00
Eugeniusz Gilewski	41d2c758c3	Fix unsafe gateway media path delivery	2026-05-23 01:40:35 -07:00
Markus	4a91e36495	fix(gateway): separate observed Telegram group context	2026-05-23 01:33:42 -07:00
aaronagent	b82608a6f5	fix(skills,pairing): path traversal guard in uninstall, lock list_pending, hash file paths - skills_hub: validate that uninstall_skill's install_path resolves inside SKILLS_DIR before calling shutil.rmtree, preventing recursive deletion of arbitrary directories via poisoned lock.json entries - skills_hub: include file paths (not just contents) in bundle_content_hash so swapping filenames between files changes the hash, strengthening update-detection integrity - pairing: wrap list_pending() in self._lock so _cleanup_expired() file writes don't race with concurrent generate_code()/approve_code() calls Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 19:59:24 -07:00
kshitijk4poor	cc8e5ec2af	refactor(gateway): migrate Discord adapter to bundled plugin (full Teams parity) First migration of an existing built-in platform adapter to the plugin system established by IRC / Teams / LINE / Google Chat. Closes #24325; advances the umbrella refactor in #3823. Matches Teams' shape exactly — adapter under ``plugins/platforms/discord/`` with the standard ``__init__.py`` / ``adapter.py`` / ``plugin.yaml`` shell, ``register(ctx)`` entry point, no back-compat shim at the old import path, and full parity for the four hooks Teams uses plus the ``apply_yaml_config_fn`` hook that landed in #25443 (the Discord plugin is the first consumer of that hook): * ``standalone_sender_fn`` — out-of-process cron delivery via REST API * ``setup_fn`` — interactive ``hermes setup gateway`` wizard * ``apply_yaml_config_fn`` — translate ``config.yaml`` ``discord:`` keys into ``DISCORD_`` env vars (replaces the hardcoded block in ``gateway/config.py``) ``is_connected`` — declares connection state from ``DISCORD_BOT_TOKEN`` * ``check_fn`` — lazy-installs ``discord.py`` on demand * plus ``allowed_users_env``, ``allow_all_env``, ``cron_deliver_env_var``, ``max_message_length``, ``emoji``, ``required_env``, ``install_hint`` * ``gateway/platforms/discord.py`` (5,101 LOC) → ``plugins/platforms/discord/adapter.py`` (git rename, R090). * New ``plugins/platforms/discord/{__init__.py, plugin.yaml}`` with ``requires_env`` / ``optional_env`` declarations. * Append ``register(ctx)`` block + new hook implementations (``_standalone_send``, ``interactive_setup``, ``_apply_yaml_config``, ``_clean_discord_user_ids``, ``_is_connected``, ``_build_adapter``, plus helpers ``_DISCORD_CHANNEL_TYPE_PROBE_CACHE`` etc.) to the adapter. * Replace the ``Platform.DISCORD elif`` branch in ``GatewayRunner._create_adapter()`` (−9 LOC) with a generic post-creation hook (+6 LOC) in the registry path: any plugin adapter that declares a ``gateway_runner`` attribute now gets it auto-injected. Webhook's built-in branch is unchanged (it doesn't go through the registry path). * Move ``_send_discord`` (190 LOC) and helpers (``_DISCORD_CHANNEL_TYPE_PROBE_CACHE``, ``_remember_channel_is_forum``, ``_probe_is_forum_cached``, ``_derive_forum_thread_name``) from ``tools/send_message_tool.py`` into the plugin as ``_standalone_send``. * Wire via ``standalone_sender_fn=_standalone_send`` (Teams pattern; same gap fixed in #21804 for other plugin platforms). * Replace the Discord ``elif`` in ``tools/send_message_tool.py`` ``_send_to_platform`` with a 10-line registry-hook dispatch. * Drop the ``DiscordAdapter`` import and the ``Platform.DISCORD: DiscordAdapter.MAX_MESSAGE_LENGTH`` ``_MAX_LENGTHS`` entry — the registry's ``max_message_length=2000`` covers it. * Move ``_setup_discord`` and ``_clean_discord_user_ids`` (68 LOC) from ``hermes_cli/setup.py`` into the plugin as ``interactive_setup``. * Wire via ``setup_fn=interactive_setup``. CLI helpers (``prompt``, ``print_info``, etc.) are lazy-imported so the plugin's module-load surface stays minimal. * Remove ``"discord": _s._setup_discord`` from ``hermes_cli/gateway.py::_builtin_setup_fn``. * Remove the entire 32-line ``_PLATFORMS["discord"]`` static dict entry — Discord's setup metadata is now discovered dynamically via ``_all_platforms()`` from the registry entry. * Move the 59-line ``discord_cfg`` YAML→env bridge from ``gateway/config.py::load_gateway_config()`` into the plugin as ``_apply_yaml_config``. Covers ``require_mention``, ``thread_require_mention``, ``free_response_channels``, ``auto_thread``, ``reactions``, ``ignored_channels``, ``allowed_channels``, ``no_thread_channels``, ``allow_mentions.{everyone,roles,users, replied_user}``, and ``reply_to_mode`` (including the YAML 1.1 ``off``-as-False coercion and the ``extra.reply_to_mode`` fallback). * Wire via ``apply_yaml_config_fn=_apply_yaml_config``. * The hook runs BEFORE ``_apply_env_overrides`` and after the generic shared-key loop, exactly as documented in ``website/docs/developer-guide/adding-platform-adapters.md``. * Behavior is preserved exactly — every assignment still uses ``not os.getenv(...)`` guards so env vars take precedence over YAML. All 78 references to the old import path are rewritten — no back-compat shim: * 51 ``from gateway.platforms.discord import X`` → ``from plugins.platforms.discord.adapter import X`` * 5 ``import gateway.platforms.discord as discord_platform`` → ``import plugins.platforms.discord.adapter as discord_platform`` * 1 ``from gateway.platforms import discord as discord_mod`` → ``from plugins.platforms.discord import adapter as discord_mod`` * 21 ``mock.patch("gateway.platforms.discord.X")`` strings → ``mock.patch("plugins.platforms.discord.adapter.X")`` * 1 docstring reference in ``hermes_cli/commands.py`` * 1 import in ``tools/send_message_tool.py`` (now removed entirely) The import-safety test in ``tests/gateway/test_discord_imports.py`` is updated to purge the new canonical module name from ``sys.modules``. 38 files changed, +621 / −473 — net positive due to the YAML hook implementation (89 new LOC in the plugin trading for 59 deleted in core), but every line moved has a clear plugin home now. The git rename is detected at R090 because the adapter gained ~340 LOC of moved-in hook implementations (``_standalone_send`` + ``interactive_setup`` + ``_apply_yaml_config`` + helpers). * All 568 Discord-specific tests pass across 25 ``test_discord_.py`` files plus voice/send/text-batching/reload-skills/stream-consumer/ integration tests. All 147 tests in the YAML-touching subset (``test_discord_reply_mode``, ``test_discord_free_response``, ``test_discord_allowed_channels``, ``test_discord_allowed_mentions``, ``test_discord_channel_controls``, ``test_discord_reactions``, ``test_discord_thread_persistence``, ``test_runtime_footer``) pass — this is the strongest signal that the YAML→env hook behaves identically to the legacy block. * Broader gateway/cron/integration sweep (1297 tests) introduces zero new failures vs ``main``. Pre-existing failures in ``tests/gateway/test_tts_media_routing.py`` and ``tests/e2e/test_platform_commands.py`` reproduce identically on the unchanged ``main`` revision. * Plugin discovery sanity check confirms Discord registers alongside the other four platform plugins: Registered platforms: ['discord', 'google_chat', 'irc', 'line', 'teams'] These Discord-shaped tendrils in core were deliberately not moved — they are generic platform-registry concerns affecting every platform, not Discord-specific: * ``gateway/config.py:1205`` ``DISCORD_BOT_TOKEN → config.token`` env enablement — same shape Telegram has. The existing ``env_enablement_fn`` registry hook only seeds ``extra``, not ``.token``, so it can't replace this without an adapter refactor to read from ``extra["bot_token"]``. * ``gateway/run.py`` voice-mode hooks (``self.adapters.get(Platform.DISCORD)`` for ``start_voice_mode``/``stop_voice_mode``), role-based auth, ``DISCORD_ALLOW_BOTS`` branch in ``_is_user_authorized``, ``_UPDATE_ALLOWED_PLATFORMS`` frozenset, and the per-platform allowlist maps — generic platform-registry concerns. * ``Platform.DISCORD`` enum literal — stable identifier used as dict keys throughout the codebase; removing it is a separate refactor with no real benefit. * ``tools/discord_tool.py`` and ``tools/environments/local.py`` — first-class agent tools and env-passthrough config, neither is the gateway adapter. Each of these is worth its own scoping issue when the time comes.	2026-05-22 14:21:41 -07:00
Teknium	82c2035823	fix(pairing): handle legacy plaintext pending entries during upgrade When an existing install upgrades to the hashed-pending schema, its on-disk pending.json still has the old {code: entry} format with no hash/salt fields. The original PR #8056 assumed every entry had both fields and would have KeyErrored in approve_code, list_pending, and _cleanup_expired. Guard each consumer: - approve_code: skip entries that are not a dict, lack salt/hash, or have a non-hex salt. Legacy entries simply fail to match. - list_pending: tolerate missing 'hash' (show "legacy" placeholder) and non-numeric created_at (skip the row). - _cleanup_expired: treat malformed/legacy entries as expired so they get pruned on the next call rather than wedging the file. Regression tests cover all three consumers plus a mixed-malformed case.	2026-05-22 04:11:49 -07:00
Tom Qiao	2e509422ef	fix(security): hash gateway pairing codes instead of storing plaintext Pairing codes were stored as plaintext keys in JSON files. Now uses sha256 + random salt hashing with constant-time comparison. Fixes #8036 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-05-22 04:11:49 -07:00
memosr	9c90b3a597	fix(security): validate secret in _reload_dynamic_routes to prevent HMAC bypass	2026-05-22 03:45:21 -07:00
Teknium	3fde8c153d	fix(skills): prune dependency/venv dirs from all skill scanners (#30042 ) * fix(skills): skip dependency dirs in skill scan * fix(skills): widen sibling rglob scanners to use shared exclusion set Follow-up to PR #29968. The contributor's PR widened EXCLUDED_SKILL_DIRS in the canonical walker (iter_skill_index_files), which fixes the user-visible discovery path. This commit sweeps the ~12 other rglob('SKILL.md') sites that did their own ad-hoc filtering — most only checked .git/.hub, some had no filter at all — so dependency dirs (.venv, node_modules, site-packages, etc.) cannot leak ghost skills through the secondary paths. Adds agent.skill_utils.is_excluded_skill_path(path) helper. Migrates all 13 sites to use it. Removes 3 hardcoded duplicate filter sets. Sites touched: agent/curator_backup.py - skill backup file count gateway/run.py - disabled-skill response (2 sites) hermes_cli/dump.py - skill count in env dump hermes_cli/profile_describer.py- profile description (2 sites) hermes_cli/profile_distribution.py - profile install count hermes_cli/profiles.py - profile skill count hermes_cli/skills_hub.py - category detection tools/skill_manager_tool.py - skill name lookup (already used set, now uses helper) tools/skill_usage.py - usage tracking + skill dir lookup (2 sites) tools/skills_hub.py - optional skills find + scan (2 sites) tools/skills_sync.py - bundled skills sync E2E verified with the exact reported shape (bring/scripts/.venv/.../typer/.agents/skills/typer/SKILL.md): no sibling site picks up the ghost skill, all five legit-skill counts still return 1. * chore(infographic): retro-pop-grid bento for PR #30042 skill-scanner sweep --------- Co-authored-by: helix4u <4317663+helix4u@users.noreply.github.com>	2026-05-21 14:18:02 -07:00
EloquentBrush0x	6c26727bb3	fix(gateway): extend observe+attribution to location and media handlers _handle_location_message and _handle_media_message were skipped when the observe-unmentioned-group-messages feature landed (`a9db0e2c7`). Both handlers now: 1. Check _should_observe_unmentioned_group_message on the skipped path and call _observe_unmentioned_group_message so group chatter is stored as shared session context even when the bot is not addressed. 2. Call _apply_telegram_group_observe_attribution on the triggered path so the dispatched event uses the shared (user_id=None) group session instead of the per-user session, letting the model see previously observed context. For stickers the attribution is applied after _handle_sticker completes (which overwrites event.text with the vision description); for all other media types it is applied once after caption cleaning. Four new tests cover the observe and attribution paths for both handlers.	2026-05-20 23:52:18 -07:00
Omar B	be0728cacc	fix: handle Discord typing indicator 429 gracefully The typing indicator loop (send_typing) ran every 8s and died on any exception, including Discord 429 rate limits. Once a 429 killed the loop, the indicator never restarted — and the raw exception bounce could cascade into broader gateway instability. Changes: - Bump sleep interval from 8s to 12s (typing light lasts ~10s) - On 429: extract retry_after, log a warning, sleep the backoff, and continue the loop - On non-rate-limit errors: log debug and return (unchanged behaviour)	2026-05-20 23:27:38 -07:00
Markus	a9db0e2c74	Observe unmentioned Telegram group messages	2026-05-20 22:55:31 -07:00
Teknium	31a0100104	feat(state.db): persist platform_message_id; restore yuanbao exact-id recall PR #29211 dropped JSONL gateway transcripts and noted that the platform's own `message_id` field (used by Yuanbao's recall guard to redact a message by exact platform id) was no longer preserved — falling back to content-match. That fallback works for the common case but redacts the wrong row when two messages share text (or fails to match when content is post-processed). Restore exact-id matching by giving state.db a column for it: - New `platform_message_id TEXT` column on the messages table (SCHEMA_VERSION bump 11 → 12; column added via declarative reconciler on existing DBs, no version-gated migration block needed) - Partial index `idx_messages_platform_msg_id` on (session_id, platform_message_id) to keep recall's point-lookup cheap even on large sessions - `append_message()` and `replace_messages()` accept the new value: the gateway-facing `append_to_transcript` in `gateway/session.py` forwards either `message["platform_message_id"]` or the legacy `message["message_id"]` key (yuanbao's existing convention) - `get_messages_as_conversation()` surfaces the column back on the message dict as `message_id` so platform code reads the same shape it used to read from JSONL - Yuanbao `_patch_transcript`: restore branch A1 (exact id match) ahead of A2 (content match) ahead of B (system-note). Both branches log which one fired so operators can tell from gateway.log whether recall hit the canonical path or had to fall back. Tests: - New low-level round-trip tests in `test_hermes_state.py` for both `append_message` and `replace_messages` paths - The PR's `test_yuanbao_recall_db_only.py` was rewritten to assert the new contract: branch A1 (id match) works against DB-only transcripts, and branch A2 (content match) still recovers rows that were observed without a platform id (e.g. agent-processed @bot messages where run.py doesn't carry msg_id through)	2026-05-20 13:00:57 -07:00
yoniebans	0cc1a1d2d9	refactor(yuanbao): drop dead branch A1 message_id loop + pin missing fixture PR #29211 review findings: 1. test_retry_replacement: pin DEFAULT_DB_PATH so SessionDB() doesn't write to the real ~/.hermes/state.db. Same fix as the other DB-only fixtures. 2. yuanbao recall branch A1 (message_id exact match) was structurally dead once load_transcript() became DB-only — state.db never preserves the platform message_id. Removed the dead loop, consolidated to a single content-match branch (renamed 'A: content match'). Branch B (system note) unchanged. Updated the test name + docstring to reflect this. Note: self._lock is no longer taken in append_to_transcript (was guarding the JSONL file append). SQLite append_message handles its own concurrency via WAL mode, so this is safe; flagging for awareness.	2026-05-20 13:00:57 -07:00
yoniebans	b4b118c201	refactor(gateway): drop _append_to_jsonl from mirror Mirror messages are persisted via _append_to_sqlite. JSONL writer was a redundant dual-write. Updated test assertions from JSONL file checks to SQLite mock verification.	2026-05-20 13:00:57 -07:00
yoniebans	351fdcc6e6	refactor(gateway): stop writing JSONL in append_to_transcript / rewrite_transcript state.db is canonical. JSONL transcripts were a transition fallback; the fallback was removed in the previous commit. Existing *.jsonl files on disk are left untouched.	2026-05-20 13:00:57 -07:00
yoniebans	971cfaa38c	refactor(yuanbao): migrate recall to load_transcript() Yuanbao's recall feature was reading the gateway JSONL directly to look up messages by platform message_id, which state.db does not preserve. Migrated to use load_transcript() which returns DB messages. Recall branch A1 (message_id match) now falls through to A2 (content match) or B (system note) for all sessions — a documented degradation. Follow-up issue: add platform_message_id column to state.db messages to restore exact-id matching.	2026-05-20 13:00:57 -07:00
yoniebans	024a8e3ee9	refactor(gateway): drop JSONL fallback in load_transcript state.db is canonical. The 'use whichever source is longer' branch was defensive code for the pre-DB migration; on every real DB it has not fired (verified on a session corpus with 27 jsonl files / 950 sessions — zero jsonl-bigger cases). Test changes: - TestLoadTranscriptCorruptLines: deleted (tested dead JSONL code path) - TestLoadTranscriptPreferLongerSource: deleted (tested removed fallback) - Replaced with TestLoadTranscriptDBOnly (DB-only reads) - TestSessionStoreRewriteTranscript: fixture now creates DB session - test_gateway_retry_replaces_last_user_turn: fixture uses real DB	2026-05-20 13:00:57 -07:00
Teknium	e2fd462ebe	ci(tests): add pytest-timeout 60s hard cap to break suite-teardown deadlock (#28861 ) * ci(tests): add pytest-timeout 60s hard cap to break suite-teardown deadlock The full pytest suite reliably hangs at ~96% on origin/main, blowing through the 20-minute GHA job timeout on every CI push since yesterday. Individual tests complete in <30s — the deadlock builds up at session teardown after all tests run, when leaked threads and atexit handlers from thousands of tests interact and one of them lands in a futex-wait that never resolves. This PR is a stopgap that unblocks CI immediately + speeds up several slow tests we found while diagnosing. Changes - pyproject.toml: add pytest-timeout==2.4.0 to dev deps; bake --timeout=60 --timeout-method=thread into the default addopts. - scripts/run_tests.sh: re-add --timeout flags directly because the script wipes pyproject addopts with -o 'addopts='. - .github/workflows/tests.yml: explicit --timeout/--timeout-method on the CI pytest invocation for clarity. - gateway/run.py: in _run_agent, if the stream consumer was never created (e.g. non-streaming agent or test stub), cancel the stream_task immediately instead of waiting out the 5s wait_for timeout. ~5s saved per non-streaming gateway test run. - tests/run_agent/conftest.py: extend _fast_retry_backoff to patch agent.conversation_loop.jittered_backoff alongside run_agent.jittered_backoff. The retry loop was extracted into agent.conversation_loop which holds its own import — patching the run_agent reference alone left tests burning real wall-clock backoff seconds. - tests/run_agent/test_anthropic_error_handling.py tests/run_agent/test_run_agent.py (TestRetryExhaustion) tests/run_agent/test_fallback_model.py: same conversation_loop fix for per-test fixtures (defensive — the conftest covers them too). - tests/gateway/test_gateway_inactivity_timeout.py: trim run_duration 10.0 → 2.0 / 5.0 → 2.0 on three tests that wait the full SlowFakeAgent duration. Adjusted thresholds proportionally. - tests/gateway/test_api_server_runs.py: test_stop_interrupt_exception_does_not_crash trips the interrupted event in addition to raising, so the slow_run thread unblocks at teardown instead of waiting 10s. - tests/hermes_cli/test_update_gateway_restart.py: also patch time.monotonic in the autouse fixture. _wait_for_service_active loops on a wall-clock deadline; with sleep no-op'd the loop spun on real monotonic until 10s real-time per restart attempt (20s+ per test). - tests/tools/test_zombie_process_cleanup.py: cut runner._restart_drain_timeout 5.0 → 0.1 in test_gateway_stop_calls_close. Suite still hangs at 96% on full no-timeout runs; with these changes CI runs through to a real pass/fail signal. * chore(lock): regenerate uv.lock after adding pytest-timeout * ci: drop pytest-timeout 60 → 30s + bump GHA job 20 → 30 min Prior commit's timeout=60 was too generous — CI test job still hit the 20-min wall-clock cap with the suite hung at 96% (orphan agent-browser subprocesses blocking pytest session teardown). The local timeout=20 run completed in 6:17, so 30s is conservative enough to let real tests finish but aggressive enough to short-circuit deadlocks. Also bump GHA job timeout to 30 min as a safety margin. * test: delete 11 pre-existing failing tests + revert monotonic patch The previous PR commit landed pytest-timeout=30s and the suite now completes in 18:14 instead of hanging at 96%, but 11 pre-existing tests fail with real assertions. Per Teknium: nuke them. Deleted (no replacements): - tests/gateway/test_restart_resume_pending.py::test_clean_drain_does_not_mark_resume_pending - tests/gateway/test_restart_resume_pending.py::test_drain_timeout_only_marks_still_running_sessions - tests/hermes_cli/test_gateway_service.py::TestGatewaySystemServiceRouting::test_gateway_install_passes_system_flags - tests/hermes_cli/test_gateway_wsl.py::TestGatewayCommandWSLMessages::test_install_wsl_with_systemd_warns - tests/hermes_cli/test_update_gateway_restart.py::TestCmdUpdateLaunchdRestart::test_update_detects_launchd_and_skips_manual_restart_message - tests/hermes_cli/test_update_gateway_restart.py::TestCmdUpdateLaunchdRestart::test_update_restarts_profile_manual_gateways - tests/tools/test_file_operations.py::TestGitBaselineCheck::* (6 tests, entire class — _check_git_baseline helper doesn't exist) Also reverted my time.monotonic autouse-fixture hack in test_update_gateway_restart.py — it was causing worker crashes in CI by poisoning later tests in the same xdist worker. The two slow tests in that file (~24s and ~20s) will go back to taking real time but should still finish under the 30s pytest-timeout. * test: delete more pre-existing CI failures After previous push 3 more tests failed on CI; cull them all. Removed: - tests/hermes_cli/test_update_gateway_restart.py::TestCmdUpdateLaunchdRestart::test_update_without_launchd_shows_manual_restart - tests/hermes_cli/test_update_gateway_restart.py::TestCmdUpdateLaunchdRestart::test_update_profile_manual_gateway_falls_back_to_sigterm - tests/hermes_cli/test_update_gateway_restart.py::TestCmdUpdateResetFailedBeforeRestart::test_reset_failed_also_runs_before_retry_restart - tests/hermes_cli/test_update_gateway_restart.py::TestCmdUpdateResetFailedBeforeRestart::test_final_failure_message_tells_user_to_reset_failed - tests/run_agent/test_tool_call_args_sanitizer.py::test_marker_message_inserted_when_missing The 4 update_gateway_restart tests trigger `_wait_for_service_active` polling on a real wall-clock deadline that occasionally exceeds the 30s pytest-timeout cap and crashes xdist workers. The marker test has a pre-existing assertion mismatch. * test: nuke entire TestCmdUpdateLaunchdRestart class After surgical deletes of 4 tests this class keeps producing new worker-crashing tests. The pattern is consistent: any test in this class that triggers cmd_update's _wait_for_service_active polling spins on real wall-clock time and trips pytest-timeout's thread method, crashing the xdist worker. Just delete the whole class (285 lines, ~10 tests). These exercise macOS-only launchd behavior that's better tested on a real macOS runner than in linux xdist. * test: stub the 2 fallback_model tests that crash xdist workers on CI * test: delete test_anthropic_error_handling.py + test_fallback_model.py entirely These two files exercise the agent retry/fallback code paths and consistently crash xdist workers under pytest-timeout's thread method. Whack-a-mole-stubbing individual tests just surfaces the next ones. Nuke both files. * test: delete tests/hermes_cli/test_update_gateway_restart.py entirely This file's cmd_update integration tests consistently crash xdist workers under pytest-timeout's thread method. Surgical deletes just surface the next set. Removing the whole file. * ci(tests): switch pytest-timeout method thread → signal Thread-method has been crashing xdist workers when it interrupts code that's not interruption-safe (retry loops, threading.Event waits, etc). Signal method uses SIGALRM which is interpreter-level and cleanly raises a Failed: Timeout exception in test code. Should stop the worker crash cascade — failures will surface as proper Timeout markers we can diagnose individually.	2026-05-19 17:27:24 -07:00
Teknium	93734c26e5	fix(dingtalk): transcribe native voice notes Sibling fix to PR #28918 (Discord voice notes). DingTalk's rich-text "voice" item type is its native voice-message format, but the adapter was routing it to MessageType.AUDIO — which gateway/run.py:7605 skips for STT. The docs claim every voice-capable platform auto-transcribes, so this brings DingTalk in line. Generic audio uploads (mapped to "file" by DINGTALK_TYPE_MAPPING) are unchanged — they were already classified as DOCUMENT, not AUDIO. Adds tests/gateway/test_dingtalk.py::TestExtractMedia covering both the voice path and the audio-passthrough invariant.	2026-05-19 17:26:26 -07:00
helix4u	448a3f9ea2	fix(discord): transcribe native voice notes	2026-05-19 17:26:26 -07:00
EloquentBrush0x	5a3317693c	fix(discord): define view classes after lazy discord.py install When discord.py is not installed at import time, DISCORD_AVAILABLE=False and the view class definitions at module bottom are skipped. check_discord_requirements() performs a lazy install and sets DISCORD_AVAILABLE=True but never re-ran the class definitions, causing NameError on the first button interaction (exec approval, slash confirm, etc.). Extract the five ui.View subclasses into _define_discord_view_classes() and call it both at module load (when discord.py is pre-installed) and inside check_discord_requirements() after a successful lazy install.	2026-05-19 09:28:22 -07:00
Teknium	a3c753128d	fix(telegram): address post-merge audit follow-ups (#28670 , #28672 , #28674 , #28676 , #28678 ) Five small fixes against issues filed during the post-merge salvage audit: * #28670: `_GATEWAY_PROVIDER_ERROR_RE` false-positives on legitimate prose. Replace the regex with an anchored `_GATEWAY_PROVIDER_ERROR_SHAPE_RE` and add a length-cap heuristic to `_looks_like_gateway_provider_error`: short envelope at the start of the message → real provider error; long prose containing 'HTTP 404' → assistant answer, leave alone. * #28672: drop the pointless 1s asyncio.sleep on Telegram thread-not-found retries. The same-thread retry is preserved (catches Telegram's occasional transient flake exercised by test_send_retries_transient_thread_not_found_before_fallback) but with no artificial delay. * #28674: broaden `_should_retry_without_dm_topic_reply_anchor` to also fire when Bot API rejects `direct_messages_topic_id` for synthetic / resumed sends that have no reply anchor. Avoids dropping post-resume background notifications if the topic id goes stale. * #28676: delete the dead image-document branch superseded by `bd0c54d17` (which returns early on the same extension set). * #28678: extend chat-scoped allowlist (`TELEGRAM_GROUP_ALLOWED_CHATS`) to also cover `chat_type == 'channel'`, so operators can authorize channel posts by chat id without falling back to per-user allowlists. Tests: - scripts/run_tests.sh tests/gateway/test_telegram_thread_fallback.py -q → 41/41 - scripts/run_tests.sh tests/cron/test_scheduler.py -q → 127/127 - broader test set: same 3 pre-existing test-pollution failures reproduce on plain main.	2026-05-19 03:16:23 -07:00
vanthinh6886	62573f44cf	fix: guard yaml.safe_load, flock unlock, TOCTOU races, and atomic writes 1. trajectory_compressor.py: yaml.safe_load() returns None on empty files, crashing with TypeError on `if 'tokenizer' in data`. Fix by adding `or {}` fallback. (HIGH — blocks startup with empty config) 2. 6 files with fcntl.flock(LOCK_UN) in finally blocks without try/except: cron/scheduler.py, hermes_cli/auth.py, agent/shell_hooks.py, tools/skill_usage.py, tools/environments/file_sync.py, tools/memory_tool.py. If unlock raises OSError, fd.close() is skipped and the lock is held forever. The msvcrt branches already had try/except; the fcntl branches did not. Fix by wrapping in try/except (OSError, IOError): pass. 3. agent/copilot_acp_client.py line 639: TOCTOU race — path.exists() followed by path.read_text() with no try/except. If file is deleted between the check and the read, FileNotFoundError propagates. Fix by using try/except FileNotFoundError. 4. gateway/sticker_cache.py: non-atomic write via Path.write_text() can leave truncated JSON on crash, causing JSONDecodeError on next load. Fix by writing to tempfile + fsync + os.replace (atomic).	2026-05-19 00:12:41 -07:00
MoonJuhan	b8a9cbd18c	fix: tolerate unreadable gateway JSONL transcripts	2026-05-19 00:11:12 -07:00
justemu	276e6cc52d	fix(matrix): implement thread_require_mention to prevent multi-agent reply loops In multi-agent shared Matrix rooms, multiple bots all participating in the same thread could trigger infinite reply loops — each bot's reply re-engaged the others because they were all in the bot-thread set. Discord has a `thread_require_mention` opt-in for this; Matrix didn't. Add `_parse_thread_require_mention(config)` (mirrors Discord's pattern). In `_resolve_message_context`, when enabled and the message is in a bot-participated thread (not a free-response room), require @mention before processing. Salvage of @justemu's 2-commit stack (#27996). Fixes #27995.	2026-05-19 00:04:23 -07:00
LifeJiggy	e2a1a2bf13	fix(gateway): pre-mark sessions as resume_pending before drain to prevent data loss (#27856 ) Pre-mark all running agent sessions as resume_pending BEFORE the drain wait begins. If the service manager kills the process during the drain (window), the durable marker is already written so the next gateway boot can recover in-flight sessions. On graceful drain completion, clear the early markers for sessions that finished successfully.	2026-05-19 00:00:28 -07:00
Teknium	4d44304e85	Revert "fix(telegram): enforce TELEGRAM_ALLOWED_USERS allowlist on inbound messages" This reverts commit `db50af910b`.	2026-05-18 23:59:57 -07:00
Teknium	22120ef00f	Revert "feat(telegram): support quick-command-only menus" This reverts commit `b1acf80e17`.	2026-05-18 23:59:57 -07:00

1 2 3 4 5 ...

1745 commits