The Feishu adapter wrapped lark-oapi's Connect() callable to inject
ping_interval/ping_timeout overrides, but made the wrapper async. The
underlying library uses Connect() as an async context manager (async
with Connect(...) as ws:), which requires the call itself to be sync
and return an AsyncContextManager — making it async meant the wrapper
was awaited eagerly and ws never bound.
Restoring the sync wrapper preserves the protocol while still injecting
the overrides.
Salvage of #25388 by @pearjelly (manually re-applied — original branch
was severely stale against current main).
Discord introduced message_snapshots for forwarded messages — text and
attachments live inside snap.content / snap.attachments rather than on
the parent message. _handle_message wasn't reading them, so forwards
showed up empty.
Defensively extracts snapshot text (when raw_content is empty) and
appends snapshot attachments to the working all_attachments list used
for type detection and media routing. hasattr/getattr guards keep this
safe on older discord.py installs without the field.
Salvage of #25462 by @1RB (manually re-applied — original branch was
stale against current main).
By default, once Hermes participates in a Discord thread (auto-created on
@mention or replied in once) it auto-responds to every subsequent message
in that thread without requiring further @mentions. That's the right default
for one-on-one conversations and isolated channel threads.
But it's a confirmed footgun in multi-bot threads. When a user invokes one
bot per turn — addressing Codex first, then Hermes — every other bot in the
thread also fires on every message, burning credits and spamming the channel.
Author has hit this personally in active multi-bot research-team threads.
Add a new `discord.thread_require_mention` config key (env:
`DISCORD_THREAD_REQUIRE_MENTION`), default `false` to preserve existing
behavior. When `true`, the in-thread mention shortcut is disabled and
threads are gated the same way channels are. Explicit @mentions still pass
through as expected.
Mirrors the existing helper shape (config.extra > env > default) and the
existing yaml→env bridge pattern used by `require_mention`.
Changes:
- gateway/platforms/discord.py: new `_discord_thread_require_mention()`
helper; in_bot_thread shortcut now AND's with `not _discord_thread_require_mention()`
- gateway/config.py: bridge `discord.thread_require_mention` from config.yaml
to `DISCORD_THREAD_REQUIRE_MENTION` env var (mirrors the existing
`require_mention` bridge two lines above)
- hermes_cli/config.py: add `thread_require_mention: False` default to
DEFAULT_CONFIG['discord']
- tests/gateway/test_discord_free_response.py: 4 new tests covering default
behaviour (in-thread shortcut still works), enabled behaviour (mention
required in threads), enabled+mentioned (mention still passes through),
and yaml-via-config.extra path. Also clears DISCORD_* env vars in the
`adapter` fixture so process-env state from the contributor's shell
doesn't leak into per-test behaviour.
- tests/gateway/test_config.py: 2 new tests covering the yaml→env bridge
(both the apply-from-yaml and env-precedence-over-yaml paths)
- website/docs/user-guide/messaging/discord.md: document the new env var
+ config key with multi-bot rationale; cross-link from `auto_thread`
section
Tested on Ubuntu 24.04.
Free-response channels are intended as lightweight chat surfaces — the bot
responds to every message without requiring an @mention. But the auto-thread
gate only checked DISCORD_NO_THREAD_CHANNELS, not DISCORD_FREE_RESPONSE_CHANNELS,
so every message in a free-response channel still spawned a brand-new thread.
That turns a chat channel into a thread-spawning machine: 1 thread per message.
The user-facing docs at website/docs/user-guide/messaging/discord.md already
describe the intended behavior ("Free-response channels also skip auto-threading
— the bot replies inline rather than spinning off a new thread per message"),
so this is a code-vs-docs gap, not a design change.
Fix: OR is_free_channel into skip_thread alongside the existing no_thread_channels
check. One-line production change.
Regression test added at tests/gateway/test_discord_free_response.py:
test_discord_free_response_channel_skips_auto_thread asserts that a message
in a free-response channel never calls _auto_create_thread. Reverting the
one-line fix causes the test to fail with 'Expected mock to not have been
awaited. Awaited 1 times.' — i.e. the test demonstrates the bug concretely.
Lets platform plugins own their YAML→env config bridge instead of forcing
core gateway/config.py to know every platform's schema.
The hook receives the full parsed config.yaml and the platform's own
sub-dict, may mutate os.environ (env > YAML precedence preserved via the
standard `not os.getenv(...)` guards), and may return a dict to merge
into PlatformConfig.extra. It runs during load_gateway_config() after
the existing generic shared-key loop and before _apply_env_overrides(),
mirroring the env_enablement_fn dispatch pattern (#21306, #21331).
Pure addition — no behavior change for existing platforms. Each of the
eight platforms with hardcoded YAML→env blocks today (discord, telegram,
whatsapp, slack, dingtalk, mattermost, matrix, feishu, ~252 LOC in
gateway/config.py) can migrate in independent follow-up PRs; the
hardcoded blocks remain functional in the meantime, and their
`not os.getenv(...)` guards make them no-ops for any env var the hook
already set.
Test coverage: 10 new tests in tests/gateway/test_platform_registry.py
covering field default, callable acceptance, env mutation, extras
merge, both signature args, exception swallowing, missing/non-dict
sections, and env > YAML precedence.
Refs #3823, #24356.
Closes#24836.
Fixes#25028.
The lazy-install hooks added in #25014 installed packages correctly but
failed to rebind module-level globals after install:
- Slack: missing aiohttp rebind → NameError on file uploads
- Feishu: none of the ~25 lark_oapi symbols rebound → TypeError on
adapter instantiation
- Matrix: mautrix.types enums stayed as stubs → mismatched values at
runtime
Introduces tools.lazy_deps.ensure_and_bind() — a DRY helper that
combines ensure() + importer-callable + globals().update(). This
eliminates the error-prone pattern of manually listing every global
that needs updating after lazy-install. Each platform adapter now
defines a single _import() function returning all bindings.
Also fixes: pyproject.toml [slack] extra was missing aiohttp (needed
by slack-bolt's async path).
Slack platform-blocks native slash commands inside thread replies ("/queue
is not supported in threads. Sorry!") and there is no app-side setting to
re-enable them. As a workaround, rewrite a leading '!' to '/' for any known
gateway command before downstream processing — so '!queue', '!stop',
'!model gpt-5.4' etc. work inside Slack threads (and anywhere else).
Only the first token is checked against is_gateway_known_command(), so
casual messages like '!nice work' pass through to the agent unchanged.
Downstream pipeline (MessageType.COMMAND tagging, gateway dispatcher,
thread reply routing) is unchanged.
Adds 6 tests covering rewrite, args preservation, thread routing,
casual-message passthrough, '@bot' suffix, and plain '/' still-works.
Only Discord and Telegram had lazy-install hooks in their
check_*_requirements() functions. The remaining four platforms that were
moved to lazy_deps (Slack, Matrix, DingTalk, Feishu) would just return
False immediately if their packages weren't pre-installed — no attempt
to install them at runtime.
This means even with the .venv permissions fix (#24841), these four
platforms would still fail to load in Docker (or any fresh install)
unless the user manually ran pip install.
Add the same lazy_deps.ensure() pattern to all four, matching the
existing Discord/Telegram implementation.
Default timeout raised from 60s to 300s (5 minutes) to accommodate
slower systems like Unraid NAS. Configurable via WHATSAPP_NPM_INSTALL_TIMEOUT
environment variable.
The WeCom adapter's _listen_loop() automatically reconnects when the
WebSocket drops, but it never called _mark_connected() after a successful
reconnection. This left the runtime status file (gateway_state.json) stuck
in "disconnected" even though the adapter was fully operational again.
Add self._mark_connected() right after _open_connection() succeeds so
that the dashboard and health probes report the correct state.
Tested by forcing a WebSocket close via the heartbeat loop and verifying
that the status file updated from "disconnected" back to "connected".
PR #23458 introduced _send_message_with_thread_fallback() and applied it
to all control-style sends (send_update_prompt, send_approval_request,
send_model_picker_prompt), but the slash-confirm result message in
handle_callback_query still called self._bot.send_message directly.
In supergroups with stale message_thread_id on the callback's parent
message, this raises "Message thread not found" and silently swallows
the result text. Replace with the helper so the same retry-without-
thread-id logic applies.
Closes#23064
When Hermes connects to Signal via signal-cli in daemon mode (linked
device setup), group messages sent from the user's phone were silently
dropped. The syncMessage handler only processed events where
destinationNumber equals the bot's own number (Note to Self).
Group messages from linked devices carry a groupInfo.groupId instead of a
destinationNumber. Extend the condition to also pass through sync messages
that have a groupId, so group messages are promoted to dataMessage and
reach the agent.
Salvage of #21063 — adds 'Weixin, and more' to module-level docstrings
in gateway/__init__.py, gateway/config.py, gateway/platforms/base.py
and the 'hermes gateway' subparser description.
Co-authored-by: wuwuzhijing <chuang.guo@hopechart.com>
When the user runs /stop or a session is interrupted mid-flight, the
👀 in-progress reaction lingered on the user's message indefinitely.
Without another agent run to swap it for 👍/👎, the eyes stayed there
forever — visually misleading (looks like the agent is still working).
Fix: on ProcessingOutcome.CANCELLED, call set_message_reaction with
reaction=None to clear all reactions on the message. Documented Bot API
semantics (equivalent to Bot API 10.0's deleteMessageReaction, but works
on PTB 22.6 already without the version bump).
Test changes:
- Renamed test_on_processing_complete_cancelled_keeps_existing_reaction
→ test_on_processing_complete_cancelled_clears_reaction; updated
assertion to expect set_message_reaction(reaction=None).
- Added test_on_processing_complete_cancelled_skipped_when_disabled
(TELEGRAM_REACTIONS=false short-circuits).
- Added test_clear_reactions_handles_api_error_gracefully and
test_clear_reactions_returns_false_without_bot to cover the new
_clear_reactions helper.
The clarify tool returned 'not available in this execution context' for
every gateway-mode agent because gateway/run.py never passed
clarify_callback into the AIAgent constructor. Schema actively encouraged
calling it; users never saw the question.
Changes:
- tools/clarify_gateway.py — new event-based primitive mirroring
tools/approval.py: register/wait_for_response/resolve_gateway_clarify
with per-session FIFO, threading.Event blocking with 1s heartbeat
slices (so the inactivity watchdog keeps ticking), and
clear_session for boundary cleanup.
- gateway/platforms/base.py — abstract send_clarify with a numbered-text
fallback so every adapter (Discord, Slack, WhatsApp, Signal, Matrix,
etc.) gets a working clarify out of the box. Plus an active-session
bypass: when the agent is blocked on a text-awaiting clarify, the next
non-command message routes inline to the runner's intercept instead
of being queued + triggering an interrupt. Same shape as the /approve
deadlock fix from PR #4926.
- gateway/platforms/telegram.py — concrete send_clarify renders one
inline button per choice plus '✏️ Other (type answer)'. cl: callback
handler resolves numeric choices immediately, flips to text-capture
mode for Other, with the same authorization guards as exec/slash
approvals.
- gateway/run.py — clarify_callback wired at the cached-agent per-turn
callback assignment site (only the user-facing agent path; cron and
hygiene-compress agents have no human attached). Bridges sync→async
via run_coroutine_threadsafe, blocks with the configured timeout, and
returns a '[user did not respond within Xm]' sentinel on timeout so
the agent adapts rather than pinning the running-agent guard. Text-
intercept added to _handle_message before slash-confirm intercept
(skipping slash commands). clear_session called in the run's finally
to cancel any orphan entries.
- hermes_cli/config.py — agent.clarify_timeout default 600s.
- website/docs/user-guide/messaging/telegram.md — Interactive Prompts
section.
Tests:
- tests/tools/test_clarify_gateway.py (14 tests) — full primitive
coverage: button resolve, open-ended auto-await, Other flip, timeout
None, unknown-id idempotency, clear_session cancellation, FIFO
ordering, register/unregister notify, config default.
- tests/gateway/test_telegram_clarify_buttons.py (12 tests) — render
paths (multi-choice/open-ended/long-label/HTML-escape/not-connected),
callback dispatch (numeric resolve/Other flip/already-resolved/
unauthorized/invalid-token), and base-adapter text fallback.
Out of scope: bot-to-bot, guest mode, checklists, poll media, live
photos. Closes#24191.
* feat(security): supply-chain advisory checker + lazy-install framework + tiered install fallback
Three coordinated mitigations for the Mini Shai-Hulud worm hitting
mistralai 2.4.6 on PyPI (2026-05-12) and for the next single-package
compromise that follows.
# What this PR makes true
1. Users with the poisoned mistralai 2.4.6 in their venv get a loud
detection banner with copy-pasteable remediation steps the moment
they run hermes (and on every gateway startup).
2. One quarantined / yanked PyPI package can no longer silently demote
a fresh install to 'core only' — the installer keeps every other
extra and tells the user which tier landed.
3. Future opt-in backends (Mistral, ElevenLabs, Honcho, etc.) can
lazy-install on first use under a strict allowlist, instead of
eagerly pulling everything at install time.
# Detection: hermes_cli/security_advisories.py
- ADVISORIES catalog (one entry currently: shai-hulud-2026-05 for
mistralai==2.4.6). Adding the next one is a single dataclass.
- detect_compromised() uses importlib.metadata.version() — no pip
dependency, works in uv venvs that lack pip.
- Banner cache (~/.hermes/cache/advisory_banner_seen) rate-limits
the startup banner to once per 24h per advisory.
- Acks persisted to security.acked_advisories in config.yaml; never
re-banner after ack.
- Wired into:
* hermes doctor — runs first, prints full remediation block
* hermes doctor --ack <id> — dismisses an advisory
* cli.py interactive run() and single-query branches — short
stderr banner pointing at hermes doctor
* gateway/run.py startup — operator-visible warning in gateway.log
# Lazy-install framework: tools/lazy_deps.py
- LAZY_DEPS allowlist maps namespaced feature keys (tts.elevenlabs,
memory.honcho, provider.bedrock, etc.) to pip specs.
- ensure(feature) installs missing deps in the active venv via the
uv → pip → ensurepip ladder (matches tools_config._pip_install).
- Strict spec safety regex rejects URLs, file paths, shell metas,
pip flag injection, control chars — only PyPI-by-name accepted.
- Gated on security.allow_lazy_installs (default true) plus the
HERMES_DISABLE_LAZY_INSTALLS env var for restricted/audited envs.
- Migrated three backends as proof of pattern:
* tools/tts_tool.py — _import_elevenlabs() calls ensure first
* plugins/memory/honcho/client.py — get_honcho_client lazy-installs
* tts.mistral / stt.mistral entries pre-registered for when PyPI
restores mistralai
# Installer fallback tiers
scripts/install.sh, scripts/install.ps1, setup-hermes.sh:
- Centralised _BROKEN_EXTRAS list (currently: mistral). Edit one
array when a transitive breaks; users keep every other extra.
- New 'all minus known-broken' tier between [all] and the existing
PyPI-only-extras tier. Only kicks in when [all] fails resolve.
- All three tiers explicit: every fallback announces which tier
landed and prints a re-run hint when not on Tier 1.
- install.ps1 and install.sh both regenerate their tier specs from
the same _BROKEN_EXTRAS array so updates stay in sync.
Side effect: install.ps1 Tier 2 spec previously hardcoded 'mistral'
in its extra list — bug fixed by the refactor (mistral is filtered
out).
# Config
hermes_cli/config.py — DEFAULT_CONFIG.security gains:
- acked_advisories: [] (advisory IDs the user has dismissed)
- allow_lazy_installs: True (security gate for ensure())
No config version bump needed — both keys nest under existing
security: block, and load_config's deep-merge picks up DEFAULT_CONFIG
defaults for users with older configs.
# Tests
tests/hermes_cli/test_security_advisories.py — 23 tests covering:
- detect_compromised matches/non-matches, wildcard frozenset
- ack persistence, idempotence, blank rejection, config-failure path
- banner cache rate limiting + 24h re-banner + ack-stops-banner
- short_banner_lines / full_remediation_text / render_doctor_section /
gateway_log_message
- shipped catalog well-formedness invariant
tests/tools/test_lazy_deps.py — 40 tests covering:
- spec safety: 11 safe parametrized + 18 unsafe parametrized
- allowlist: unknown-feature rejection, namespace.name shape,
every shipped spec passes the safety regex
- security gating: config flag, env var, default, fail-open
- ensure() happy/sad paths: already-satisfied, install success,
pip stderr surfaced on failure, install-succeeds-but-still-missing
- is_available, feature_install_command
Combined: 63 new tests, all passing under scripts/run_tests.sh.
# Validation
- scripts/run_tests.sh tests/hermes_cli/test_security_advisories.py
tests/tools/test_lazy_deps.py → 63/63 passing
- scripts/run_tests.sh tests/hermes_cli/test_doctor.py
tests/hermes_cli/test_doctor_command_install.py
tests/tools/test_tts_mistral.py tests/tools/test_transcription_tools.py
tests/tools/test_transcription_dotenv_fallback.py → 165/165 passing
- scripts/run_tests.sh tests/hermes_cli/ tests/tools/ →
9191 passed, 8 pre-existing failures (verified on origin/main
before this change)
- bash -n on install.sh and setup-hermes.sh → OK
- py_compile on all modified .py files → OK
- End-to-end smoke test of detect_compromised + render_doctor_section
+ gateway_log_message with mocked installed version → produces
copy-pasteable remediation output
# Community
Full advisory + remediation steps:
website/docs/community/security-advisories/shai-hulud-mistralai-2026-05.md
Short-form post drafts (Discord, GitHub pinned issue, README banner):
scripts/community-announcement-shai-hulud.md
Refs: PR #24205 (mistral disabled), Socket Security advisory
<https://socket.dev/blog/mini-shai-hulud-worm-pypi>
* build(deps): pin every direct dep to ==X.Y.Z (no ranges)
Companion to the supply-chain advisory work: replace every >=/</~= range
in pyproject.toml's [project.dependencies] and [project.optional-dependencies]
with an exact ==X.Y.Z pin sourced from uv.lock.
Why: ranges allow PyPI to ship a fresh version of any direct dep at any
time without a code review on our side. With ranges, the malicious
mistralai 2.4.6 release would have been pulled by every fresh
'pip install -e .[all]' for the hours between upload and PyPI's
quarantine — exactly the install window we got hit on. Exact pins close
that window: the only way a new package version reaches a user is via
an intentional update on our end.
What the user-facing change is: nothing, behavior-wise. Every package
resolves to the same version it was already resolving to via uv.lock —
the pins just remove the resolver's freedom to pick a different one.
Cost: any user installing Hermes alongside another package that requires
a newer pin gets a resolver conflict. Acceptable for our isolated-venv
install path; documented in the new comment block.
Build-system requires line (setuptools>=61.0) is intentionally left
as a range — pinning the build backend would block fresh pip from
bootstrapping the build on architectures where that exact wheel isn't
available.
mistral extra (mistralai==2.3.0) is pinned but stays out of [all]
(per PR #24205). 'uv lock' regeneration will fail until PyPI restores
mistralai; lockfile regeneration is gated behind that, NOT on every PR.
LAZY_DEPS in tools/lazy_deps.py also moved to exact pins so the lazy-
install pathway can never resolve a different version than the one
declared in pyproject.toml.
Validation:
- Cross-checked all 77 pinned direct deps in pyproject.toml against
uv.lock — every pin matches the resolved version exactly.
- Cross-checked all LAZY_DEPS specs against uv.lock — same.
- 'uv pip install -e .[all] --dry-run' resolves 205 packages cleanly.
- tests/tools/test_lazy_deps.py + tests/hermes_cli/test_security_advisories.py
→ 63/63 passing (every shipped spec passes the safety regex).
- Doctor + TTS + transcription targeted suite → 146/146 passing.
* build(deps): hash-verify transitives via uv.lock; remove unresolvable [mistral] extra
You asked: 'what about the dependencies the dependencies rely on?' —
correctly noting that exact-pinning direct deps in pyproject.toml does
NOT cover the transitive graph. `pip install` and `uv pip install` both
re-resolve transitives fresh from PyPI at install time, so a compromised
transitive (e.g. `httpcore` if it got worm-poisoned tomorrow) would
still hit our users even with every direct dep exact-pinned.
# What this commit fixes
1. **Both real installer scripts now prefer `uv sync --locked` as Tier 0.**
uv.lock records SHA256 hashes for every transitive — a compromised
package with a different hash gets REJECTED. Falls through to the
existing `uv pip install` cascade if the lockfile is missing or
stale, with a loud warning that the fallback path does NOT
hash-verify transitives. Previously only `setup-hermes.sh` (the dev
path) used the lockfile; `scripts/install.sh` and `scripts/install.ps1`
(the paths fresh users actually run) skipped it.
2. **Removed the `[mistral]` extra entirely.** The `mistralai` PyPI
project is fully quarantined right now — every version returns 404,
so any pin we wrote was unresolvable, which broke `uv lock --check`
in CI. Restoration is documented in pyproject.toml as a 5-step
checklist (verify, re-add extra, re-enable in 4 modules, regenerate
lock, optionally re-add to [all]).
3. **Regenerated uv.lock.** 262 packages, mistralai/eval-type-backport/
jsonpath-python pruned. `uv lock --check` now passes.
# Defense-in-depth view
| Layer | Where | Protects against |
|----------------------------|-------------------|-------------------------------------------|
| Exact pins in pyproject | direct deps | new mistralai 2.4.6-style direct compromise |
| uv.lock + `--locked` install | transitive graph | transitive worm injection |
| Tier-0 hash-verified path | install.sh / .ps1 | actually USE the lockfile in fresh installs |
| `uv lock --check` CI gate | every PR | drift between pyproject and lockfile |
| `hermes_cli/security_advisories.py` | runtime | cleanup for users who already got hit |
The exact pinning + hash verification together close the supply-chain
gap. Without the lockfile path, exact pins alone are theater.
# Validation
- `uv lock --check` → passes (262 packages resolved, no drift).
- `bash -n` on install.sh + setup-hermes.sh → OK.
- 209/209 tests passing across new + adjacent test files
(test_lazy_deps.py, test_security_advisories.py, test_doctor.py,
test_tts_mistral.py, test_transcription_tools.py).
- TOML parse OK.
* chore: remove community announcement drafts (PR body covers it)
* build(deps): lazy-install every opt-in backend (anthropic, search, terminal, platforms, dashboard)
Extends the lazy-install framework to cover everything that's not used by
every hermes session. Base install drops from ~60 packages to 45.
Moved out of core dependencies = []:
- anthropic (only when provider=anthropic native, not via aggregators)
- exa-py, firecrawl-py, parallel-web (search backends; only when picked)
- fal-client (image gen; only when picked)
- edge-tts (default TTS but still optional)
New extras in pyproject.toml: [anthropic] [exa] [firecrawl] [parallel-web]
[fal] [edge-tts]. All added to [all].
New LAZY_DEPS entries: provider.anthropic, search.{exa,firecrawl,parallel},
tts.edge, image.fal, memory.hindsight, platform.{telegram,discord,matrix},
terminal.{modal,daytona,vercel}, tool.dashboard.
Each import site now calls ensure() before importing the SDK. Where the
module had a top-level try/except (telegram, discord, fastapi), the
graceful-fallback pattern was extended to lazy-install on first
check_*_requirements() call and re-bind module globals.
Updated test_windows_native_support.py tzdata check from snapshot
(>=2023.3 literal) to invariant (any version + win32 marker).
Validation:
- Base install: 45 packages (was ~60); 6 newly-extracted packages absent
- uv lock --check: passes (262 packages, no drift)
- 209/209 lazy_deps + advisory + doctor + tts/transcription tests passing
- py_compile clean on all 12 modified modules
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via `ruff --fix --unsafe-fixes`, 0 remaining.
133 files, +626/-626 (net zero).
When the Discord typing API call fails (rate limit, network error, 403),
_typing_loop returns early but the stale task remains in _typing_tasks.
Subsequent send_typing calls see the stale entry and skip, leaving no
typing indicator for the rest of the agent invocation.
Add finally block to _typing_loop to always remove the task from
_typing_tasks on exit, whether from cancellation, error, or normal
completion. This allows send_typing to create a fresh task.
3 new tests in test_discord_send.py:
- Task removed after API error
- Typing restartable after failure
- stop_typing cleans up
Re-authored against current main from PR #10388 by @wilsen0. The
original branch is 3800+ commits stale and could not be cherry-picked
without reverting unrelated work; this change carries only the perf
intent forward.
Tuning summary
==============
Text-batch ingress (gateway/platforms/telegram.py):
- HERMES_TELEGRAM_TEXT_BATCH_DELAY_SECONDS default 0.6 -> 0.3
- HERMES_TELEGRAM_TEXT_BATCH_SPLIT_DELAY_SECONDS default 2.0 -> 1.0
- Adaptive fast-path tiers in _flush_text_batch:
total <= 320 cp -> min(cap, 0.18)
total <= 1024 cp -> min(cap, 0.24)
else -> cap
A single short reply now reaches the agent in ~180ms instead of
600ms. Tier constants compose with the configured cap via min()
so an operator who tightens HERMES_TELEGRAM_TEXT_BATCH_DELAY_SECONDS
below 0.18 still wins on every tier.
- _env_float_clamped helper replaces bare float(os.getenv()).
Rejects NaN / Inf, applies optional min/max bounds. Used for
text-batch + media-batch knobs. Prevents asyncio.sleep(NaN)
crashes when an operator typos an env var.
Stream cadence (gateway/config.py + stream_consumer.py):
- StreamingConfig.edit_interval default 1.0s -> 0.8s
- StreamingConfig.buffer_threshold default 40 -> 24 chars
- DEFAULT_STREAMING_EDIT_INTERVAL / BUFFER_THRESHOLD / CURSOR are now
a single source of truth. StreamConsumerConfig imports them
instead of duplicating the literals; the prior dual-source drift
is fixed.
Tool progress (gateway/display_config.py):
- Telegram default tool_progress 'all' -> 'new'. Inside
Telegram's ~1 edit/s flood envelope the 'all' default would
accumulate edit pressure on busy chats; 'new' shows only the
leading bubble per tool batch and feels less spammy.
- Slack tier_low override (tool_progress='off') is preserved.
Composition with native draft streaming (#23512)
================================================
The mid-stream cadence (edit_interval, buffer_threshold) gates BOTH
the draft path (send_draft) and the edit path (edit_message), so the
tighter cadence helps native draft as much as edit-based. The
text-batch fast-path applies before the consumer starts, so it speeds
up the first-token latency on every transport. No conflict.
Stale-base avoidance
====================
Re-authored from scratch rather than cherry-picked. Dropped from the
original branch:
- Unrelated d2f043f9c 'fix(anthropic): preserve third-party thinking
continuity' commit
- boot_md.py builtin gateway hook (unrelated)
- Reverted Slack tool_progress='off' (#14663) restoration
- Reverted Platform plugin discovery, MSGRAPH_WEBHOOK, YUANBAO
members deletion
- 2300+ lines of run.py base-skew noise
Tests
=====
New tests/gateway/test_telegram_text_batch_perf.py:
- 7 tests for _env_float_clamped (NaN, Inf, garbage, bounds).
- 4 tests for the adaptive-tier composition rules.
Updated tests/gateway/test_display_config.py:
- test_platform_default_when_no_user_config: 'all' -> 'new' for
Telegram, with comment.
- test_high_tier_platforms: split into Telegram-overrides-to-new
and Discord-stays-all assertions.
Closes#10388.
Co-authored-by: wilsen0 <132184373+wilsen0@users.noreply.github.com>
When edit_message_text exceeded Telegram's 4096 UTF-16 codepoint limit,
the adapter caught the BadRequest, best-effort truncated the content
with '…', and returned SendResult(success=True). The stream consumer
believed the full edit was delivered and never recovered, silently
dropping everything past the truncation boundary on long replies.
Returning failure isn't safe either — the consumer's existing fallback
path can race against the next streaming tick, producing duplicate
sends or gaps. Instead, the adapter now SPLITS the oversized payload
across the existing message + new continuation messages, so the user
always gets the full reply in correct order.
How it works:
1. Pre-flight: if utf16_len(content) already exceeds MAX_MESSAGE_LENGTH,
call the new _edit_overflow_split helper directly — saves a doomed
round-trip + a Telegram error.
2. Reactive: if Telegram still returns 'message_too_long' after the
pre-flight (e.g. parse_mode formatting inflated the payload past
the limit via MarkdownV2 escapes), the same helper handles it.
3. _edit_overflow_split:
- Splits via truncate_message(len_fn=utf16_len) — same chunking the
non-streaming send() path uses; chunks get '(1/N)' suffixes.
- Edits the original message_id with chunk 1 (with parse_mode +
plain-fallback when finalize=True, mirroring the main edit path).
- Sends each remaining chunk via self._bot.send_message threaded as
a reply to the previous chunk so the user sees them as a
contiguous block. MarkdownV2-with-plain-fallback per chunk on
finalize.
- Returns SendResult(success=True, message_id=<last_chunk_id>,
continuation_message_ids=(<chunk2_id>, <chunk3_id>, ...)) so the
stream consumer can keep editing the most recent visible message
and the gateway has full visibility into every message id.
SendResult contract extension:
Added optional continuation_message_ids: tuple = () field. When
empty (the common case), behavior is unchanged. When populated, the
caller knows the adapter delivered across multiple platform messages.
Stream consumer integration:
GatewayStreamConsumer._send_or_edit advances _message_id to the
last-continuation id when it sees continuation_message_ids on a
successful edit result, resets _last_sent_text (the new visible
message holds only the final chunk's text), and fires
on_new_message so tool-progress bubbles linearize below the new
continuation rather than the original. Mirrors the openclaw #32535
inter-tool-leak guard.
Composes with what just landed:
- PR #23455 (UTF-16 length-aware splitting in stream consumer)
prevents most overflows upstream by measuring text in UTF-16
codeunits before deciding to split. This PR is the safety net at
the adapter boundary.
- PR #23512 (native draft streaming, default for DM Telegram) routes
DM streaming through send_draft, which has its own contract
unaffected by this change. So this fix narrows in scope to the
edit-based path: groups, supergroups, forum topics, every
non-Telegram platform, and the per-response fallback after a
draft failure.
Salvage notes:
- Cherry-picked from PR #19537 by @kjames2001. Original PR returned
failure on overflow; this evolves to split-and-deliver so users
never lose content and the consumer state stays consistent.
- Dropped an unrelated model-picker hunk (line 2114-2117) that
silently killed the 'X more available — type /model <name>
directly' hint by hardcoding total=len(models). Not in scope.
- Restored the timeout-aware retryable=not is_timeout signal in
send()'s fallthrough catch block.
Closes#19537.
Adds Telegram's native streaming-draft API as a streaming transport so DM
replies render with smooth animated previews as tokens arrive, dropping
the per-edit jitter of the legacy editMessageText polling path.
Adapter contract (gateway/platforms/base.py):
- supports_draft_streaming(chat_type, metadata) -> bool. Default False.
Telegram returns True only for DMs and only when the bound python-
telegram-bot version exposes Bot.send_message_draft (PTB 22.6+).
- send_draft(chat_id, draft_id, content, metadata) -> SendResult.
Default raises NotImplementedError. Telegram delegates to PTB's
send_message_draft. Drafts have no message_id (Bot API contract);
SendResult.message_id is None on success.
Telegram adapter (gateway/platforms/telegram.py):
- supports_draft_streaming gates on chat_type='dm' AND PTB capability.
- send_draft trims to MAX_MESSAGE_LENGTH using utf16_len, threads
message_thread_id through metadata, and routes failures back as
SendResult(success=False, error=...) so the consumer can fall back.
Stream consumer (gateway/stream_consumer.py):
- StreamConsumerConfig gains transport ('auto'|'draft'|'edit'|'off')
and chat_type fields.
- run() resolves _use_draft_streaming once via a probe at the top of
the run, allocating a fresh class-wide draft_id_counter so each
response animates as its own preview (no animation collision across
consecutive responses to the same chat).
- _send_or_edit gains a pre-edit branch: when drafts are active AND
not finalizing AND no edit-path message_id is established, the
frame routes through _send_draft_frame instead of edit_message.
Drafts intentionally do NOT set _already_sent so the gateway's
final sendMessage path still fires — drafts have no message_id and
the user needs a real message in their chat history.
- _reset_segment_state bumps the draft_id when the consumer is in
draft mode so each text block after a tool boundary animates as a
fresh preview below the tool-progress bubble (avoids the inter-
tool-call leak openclaw documented in their #32535).
- Per-response fallback: any send_draft failure (transient network,
server reject, capability gap) flips _use_draft_streaming to False
for the rest of the run, gracefully returning to the edit path.
Gateway config (gateway/config.py):
- StreamingConfig.transport default flips edit -> auto. The auto path
is identical to edit on every chat type that doesn't currently
support drafts (groups, supergroups, forum topics, every non-
Telegram platform), so the default is backwards-compatible for
non-DM users.
Lifecycle model (Telegram Bot API 9.5):
1. sendMessageDraft(chat_id, draft_id, text='') opens the bubble.
2. Repeated sendMessageDraft calls with the SAME draft_id animate
the preview as text grows.
3. Drafts have no message_id and cannot be edited or deleted.
4. When the response finishes the gateway's normal sendMessage path
delivers the final answer; the draft preview clears naturally on
the client and the user sees a real message in their history.
Inspired by PR #3412 by @NivOO5. Re-authored against current main
(stream_consumer.py is now ~4x larger than at #3412's branch base, with
new _NEW_SEGMENT/_COMMENTARY/finalize/_on_new_message machinery the
original PR didn't account for) but the design call (DM-only, edit-
fallback, transport=auto|draft|edit|off) is faithful to the original
proposal, with two improvements baked in:
1. Per-response draft_id (monotonic counter, not a time hash) — no
collision risk across consecutive responses on the same chat.
2. Tool-boundary draft_id bump — prevents the inter-tool-call leak
openclaw hit during their rollout (their #32535).
Closes#21439 (duplicate feature request).
Cherry-picked from PR #10371. Two-layer defense for the spurious-thread_id
issue (#3206):
1. _build_message_event filters DM thread_ids: only preserve thread_id
for real topic messages (is_topic_message=True). Telegram puts
message_thread_id on every DM that is a reply, but reply-chain ids
route to nonexistent threads on send.
2. _send_message_with_thread_fallback helper: control sends
(send_update_prompt, send_exec_approval / send_slash_confirm,
send_model_picker) retry once without message_thread_id when
Telegram returns BadRequest 'Message thread not found'. Mirrors
the pattern PR #3390 added for the streaming send path.
Salvage notes:
- Conflict 1 (line ~4099): merged the contributor's DM is_topic_message
filter with the existing forum General-topic default from #22423,
preserving both behaviors.
- Conflict 2 (line ~1664 / 1690): kept main's delete_message (PR #23416)
alongside the new helper. Tightened the helper's exception catch
from bare 'Exception' to use the existing _is_bad_request_error +
_is_thread_not_found_error helpers (line 484-496) for consistency
with the streaming send path.
- Widened the fix to send_update_prompt (was bare self._bot.send_message,
same bug class).
Authored by rahimsais via PR #10371 (re-attributed from donrhmexe@
local commit author).
The stream consumer measured message length using Python's len() (Unicode
code points), but Telegram's actual limit is in UTF-16 code units. This
caused messages with supplementary characters (emoji, CJK, etc.) to exceed
Telegram's 4096-character limit, resulting in truncated messages with
formatting artifacts.
Changes:
- Add message_len_fn property to BasePlatformAdapter (defaults to len)
- Override in TelegramAdapter to return utf16_len
- Stream consumer uses adapter.message_len_fn for:
- safe_limit calculation
- overflow detection
- truncate_message calls
- split point calculation (via _custom_unit_to_cp)
- fallback final send chunking
Fixes truncated messages with black square artifacts on Telegram when
the model generates responses containing multi-byte Unicode characters.
Cherry-picked from PR #13077 commits:
- 5500c7d8 fix(gateway): stream consumer first message drops thread context
- e84403b9 test(gateway): add regression tests for stream consumer thread routing
Fixes: Streaming first message drops thread/topic context in Feishu group
topics, Slack threads, Telegram forum topics. Adds initial_reply_to_id
ctor arg to GatewayStreamConsumer, threaded through _send_or_edit and
_send_new_chunk. Also fixes Feishu _send_raw_message fallback path
(reply -> create) to use receive_id_type='thread_id' so the new message
lands in the correct topic instead of the main channel.
Authored by hrygo via PR #13077 (re-attributed from the bot-authored
salvage commit on the original branch).
Builds on @kshitijk4poor's CLI handoff stub. The original PR's flow
deferred everything to whenever a real user happened to message the
target platform; this rewrites it so the gateway picks up handoffs
immediately and the destination chat just starts working.
State machine on sessions table replaces the boolean flag:
None -> 'pending' -> 'running' -> ('completed' | 'failed')
plus handoff_error for failure reasons. CLI request_handoff /
get_handoff_state / list_pending_handoffs / claim_handoff /
complete_handoff / fail_handoff helpers wrap the transitions.
CLI side (cli.py): /handoff <platform> validates the platform's home
channel via load_gateway_config, refuses if the agent is mid-turn,
flips the row to 'pending', and poll-blocks (60s) on terminal state.
On 'completed' it prints the /resume hint and exits the CLI like
/quit. On 'failed' or timeout it surfaces the reason and the CLI
session stays intact.
Gateway side (gateway/run.py): new _handoff_watcher background task
scans state.db every 2s, atomically claims pending rows, and runs
_process_handoff for each. _process_handoff:
1. Resolves the platform's home channel.
2. Asks the adapter for a fresh thread via the new
create_handoff_thread(parent_chat_id, name) capability so the
handed-off conversation gets its own scrollback. Adapters that
don't support threads (or fail) return None and the watcher
falls back to the home channel directly.
3. Constructs a SessionSource keyed as 'thread' when a thread was
created, 'dm' otherwise, then session_store.switch_session
re-binds the destination key to the CLI session_id. The full
role-aware transcript replays via load_transcript on the next
turn (no flat-text injection into context_prompt).
4. Forges a synthetic MessageEvent(internal=True) with the handoff
notice and dispatches through _handle_message; the agent runs
against the loaded transcript and adapter.send delivers the
reply.
5. Marks the row 'completed' on success, 'failed' (+error) on any
exception.
Adapter capability (gateway/platforms/base.py): create_handoff_thread
default returns None. Three overrides:
- Telegram (gateway/platforms/telegram.py): wraps _create_dm_topic
so DM topics (Bot API 9.4+) and forum supergroups both work.
- Discord (gateway/platforms/discord.py): parent.create_thread on
text channels with a seed-message + message.create_thread
fallback for permission edge cases. Skips DMs and other
non-thread-capable parents.
- Slack (gateway/platforms/slack.py): posts a seed message and
returns its ts as the thread anchor — Slack threads are
message-anchored.
In thread mode, build_session_key keys the destination without
user_id (thread_sessions_per_user defaults to False) so the synthetic
turn and any later real-user message in the thread share the same
session_key — seamless takeover without race.
CommandDef stays cli_only=True (handoff is initiated from the CLI;
gateway exposes /resume for the reverse direction).
Removed the original PR's _handle_message_with_agent handoff hook
(transcript-as-text injection into context_prompt) and the
send_message_tool notification — both replaced by the watcher path.
Tests rewritten around the new state machine: 13/13 pass.
E2E-validated thread + no-thread paths and the failure path against
real worktree imports with mocked adapters.
* feat(gateway): add LINE Messaging API platform plugin
Adds LINE as a bundled platform plugin under `plugins/platforms/line/`,
synthesized from the strongest pieces of seven open community PRs. The
adapter requires zero core edits — `Platform("line")` is auto-discovered
via the bundled-plugin scan in `gateway/config.py`, and all hooks
(setup, env-enablement, cron delivery, standalone send) are wired
through `register_platform()` kwargs the way IRC and Teams do it.
Highlights merged into one plugin:
- **Reply token preferred, Push fallback.** Try the free reply token
first (single-use, ~60s TTL); fall back to metered Push when the
token is absent, expired, or rejected. (PR #21023)
- **Slow-LLM Template Buttons postback.** When the LLM is still running
past `LINE_SLOW_RESPONSE_THRESHOLD` (default 45s), the adapter burns
the original reply token to send a "Get answer" button bubble. The
user taps it to fetch the cached answer via a fresh reply token —
also free. State machine: PENDING → READY → DELIVERED, ERROR for
cancelled runs (orphan resolves to `LINE_INTERRUPTED_TEXT` after
/stop). Set threshold to 0 to disable. (PR #18153)
- **Three-allowlist gating** — separate user / group / room allowlists
with `LINE_ALLOW_ALL_USERS=true` dev-only escape hatch. (PR #18153)
- **Markdown URL preservation.** Strip bold/italic/code-fence/heading
markers (LINE renders them literally) but keep `[label](url)` →
`label (url)` so URLs stay tappable. (PR #18153)
- **System-message bypass** for `⚡ Interrupting`, `⏳ Queued`, etc. —
busy-acks reach the user as visible bubbles instead of being
swallowed into the postback cache. (PR #18153)
- **Media via public HTTPS URLs.** LINE doesn't accept binary uploads;
images/audio/video must be HTTPS-reachable. The adapter serves
registered tempfiles under `/line/media/<token>/<filename>` from the
same aiohttp app. Allowed-roots traversal guard covers
`tempfile.gettempdir()`, `/tmp` (→ `/private/tmp` on macOS), and
`HERMES_HOME`. `LINE_PUBLIC_URL` overrides URL construction for
setups behind tunnels/proxies. (PR #8398)
- **5-message-per-call batching.** LINE rejects >5 messages per
Reply/Push; smart-chunker caps text at 4500 chars per bubble.
- **Inbound dedup** via `webhookEventId` LRU. (PR #21023)
- **Self-message filter** via `/v2/bot/info` userId lookup. (PR #21023)
- **Loading-animation indicator** wired to LINE's `chat/loading/start`
endpoint, DM-only (LINE rejects it for groups/rooms). (PR #21023)
- **Out-of-process cron delivery** via `_standalone_send`, so
`deliver: line` cron jobs work even when cron runs detached from
the gateway.
- **Webhook hardening** — 1 MiB body cap, constant-time HMAC-SHA256
signature verification, dedup, scoped lock so two profiles can't
bind the same channel.
Validation
----------
- `scripts/run_tests.sh tests/gateway/test_line_plugin.py` →
73 passed in 1.05s
- `scripts/run_tests.sh tests/gateway/test_line_plugin.py
tests/gateway/test_irc_adapter.py
tests/gateway/test_plugin_platform_interface.py
tests/gateway/test_platform_registry.py
tests/gateway/test_config.py` → 193 passed, 7 skipped
- E2E import + register + signature roundtrip + `Platform("line")`
bundled-plugin discovery verified against current `origin/main`.
Closes the seven open LINE PRs (#18153, #16832, #6676, #21023, #14942,
#14988, #8398) by superseding them with a single plugin-form
implementation that takes the best idea from each.
Co-authored-by: pwlee <32443648+leepoweii@users.noreply.github.com>
Co-authored-by: Jetha Chan <jetha@google.com>
Co-authored-by: Cattia <openclaw@liyangchen.me>
Co-authored-by: perng <charles@perng.com>
Co-authored-by: Soichiro Yoshimura <soichiro0111.dev@gmail.com>
Co-authored-by: David Zhou <77736378+David-0x221Eight@users.noreply.github.com>
Co-authored-by: Yu-ga <74749461+yuga-hashimoto@users.noreply.github.com>
* docs(platforms): document platform-specific slow-LLM UX pattern
Add a 'Platform-Specific Slow-LLM UX' section to the platform-adapter
developer guide covering the _keep_typing override pattern that LINE
uses for its Template Buttons postback flow.
Three subsections:
- Pattern: subclass _keep_typing to layer mid-flight UX (with code)
- Pattern: subclass send to route through a cache instead of sending
- When this pattern is appropriate (vs. always-Push fallback)
Plus a short pointer in gateway/platforms/ADDING_A_PLATFORM.md so
tree-readers find the prose walkthrough on the docsite.
Filed because the LINE plugin (PR #23197) was the first bundled
adapter to need this pattern — every prior plugin (irc, teams,
google_chat) handles slow responses with the default typing-loop and
a regular send_text. Documenting now while the rationale is fresh.
---------
Co-authored-by: pwlee <32443648+leepoweii@users.noreply.github.com>
Co-authored-by: Jetha Chan <jetha@google.com>
Co-authored-by: Cattia <openclaw@liyangchen.me>
Co-authored-by: perng <charles@perng.com>
Co-authored-by: Soichiro Yoshimura <soichiro0111.dev@gmail.com>
Co-authored-by: David Zhou <77736378+David-0x221Eight@users.noreply.github.com>
Co-authored-by: Yu-ga <74749461+yuga-hashimoto@users.noreply.github.com>
Per-tool-call push notifications on Telegram are noisy enough that
'all' is the wrong default — long agent runs spam the user's notification
shade with status messages they didn't ask to be pinged about. Final
responses, approval prompts, and slash confirmations still notify;
intermediate progress, streaming, and tool-progress messages now
deliver silently via disable_notification.
Users who want the legacy behavior can opt back in with:
display:
platforms:
telegram:
notifications: all
or HERMES_TELEGRAM_NOTIFICATIONS=all.
Add a configurable notifications mode for the Telegram platform adapter
that controls which messages trigger push notifications.
- display.platforms.telegram.notifications: "all" (default) | "important"
- HERMES_TELEGRAM_NOTIFICATIONS env var override
- In "important" mode, all sends use disable_notification=True except:
- Approvals (send_exec_approval) and slash confirmations
- Final response messages (metadata["notify"]=True)
- Zero overhead in default "all" mode
- Zero impact on non-Telegram platforms
Closes#22771
163/NetEase IMAP servers reject every UID SEARCH/FETCH with `BYE Unsafe
Login` unless the client first identifies itself via the RFC 2971 ID
command after LOGIN. Without this, the email gateway logs in OK but
then fails on the very first poll and the connection is torn down.
Send the ID payload best-effort after both `imap.login()` sites
(`EmailAdapter.connect` and `_fetch_new_messages`). Failures are
swallowed at debug level so non-supporting IMAP servers (Gmail,
Outlook, Fastmail, Yahoo, etc.) keep working unchanged.
Closes#22271
`gateway/platforms/__init__.py` eagerly imported `QQAdapter` and
`YuanbaoAdapter` at package-init time, which transitively pulled in
qqbot's chunked-upload + keyboards + onboard machinery and yuanbao's
websocket stack. About 84 ms wall and 23 MB RSS on every fresh process
that touched anything under `gateway.platforms` — including `hermes
chat` (via run_agent → cli's plugin discovery transitive import).
Nothing in the codebase actually consumes these symbols from the
package root; every real call site uses the long-form path
(`from gateway.platforms.qqbot import QQAdapter`,
`from gateway.platforms.yuanbao import YuanbaoAdapter` in gateway/run.py).
The eager re-export was only there for convenience.
Replace with a PEP 562 module-level `__getattr__` that lazily imports
on first attribute access. Public API stays identical:
`from gateway.platforms import QQAdapter` keeps working but only
pays the import cost when the symbol is actually touched. `__dir__`
preserves help() / autocomplete behavior.
Measured impact (7-run medians, 9950X3D):
import gateway.platforms 127 → 43 ms (-66%)
50 → 27 MB (-46%)
import gateway.platforms.base 127 → 44 ms (-65%)
50 → 27 MB (-46%)
import cli (full chat path) 745 → 710 ms ( -5%)
96 → 90 MB ( -6%)
hermes chat -q (cold) -5 MB
The per-import win is biggest because qqbot/yuanbao deps don't overlap
with anything on the gateway-platforms path — full `import cli`
already loads aiohttp/websockets transitively from other places, so
the marginal CLI win is smaller than the isolated import benchmark.
The `gateway.platforms.base` win is what matters most for long-lived
gateway processes: every gateway boot saves 23 MB resident.
All 144 qqbot tests pass; broader gateway suite (5132 tests) passes
modulo 4 pre-existing flakes also failing on main without this change.
Non-streaming /v1/chat/completions wrapped any AIAgent result \u2014 including
partial/failed runs \u2014 as a successful 200 with finish_reason='stop' and
the internal failure string substituted into message.content. API
clients had no way to distinguish 'agent answered: X' from
'agent crashed and the X you see is its error message'.
After the fix:
- completed: True \u2192 200 finish_reason='stop' (unchanged)
- partial + truncated text \u2192 200 finish_reason='length' + hermes extras
- partial + no text / failed \u2192 502 OpenAI error envelope (SDKs raise)
- other failures \u2192 200 finish_reason='error' + hermes extras
Adds X-Hermes-Completed / X-Hermes-Partial / X-Hermes-Error headers
plus a 'hermes' extras object on partial responses for clients that
want the full picture.
Closes#22496.
- Restore allowed_chats gate before thread_id check so ignored_threads
applies universally (even to guest mentions).
- Compute _message_mentions_bot once in _should_process_message to
eliminate redundant second entity scan when guest_mode=true and the
message does not mention the bot.
- Remove redundant _is_group_chat from _is_guest_mention (caller already
verified the message is a group chat).
- Update _telegram_allowed_chats docstring to note guest_mode exception.
- Add test coverage: bot_command entity, text_mention entity,
caption_entities, and ignored_threads + guest_mode interaction.
- Add nik1t7n to AUTHOR_MAP.
When a Telegram user replies using the native quote feature to select
only part of a prior message, _build_message_event was injecting the
ENTIRE replied-to message into reply_to_text via
message.reply_to_message.text/caption. python-telegram-bot exposes
the user-selected substring as message.quote (TextQuote.text); we now
prefer that and fall back to the full replied-to text only when no
native quote is present.
The agent-visible "[Replying to: \"...\"]" prefix can otherwise expand
the user's narrow quote into the full prior message, causing the agent
to act on unrelated actionable-looking text the user did not select
(e.g. multi-item briefings where the user quotes one bullet but the
prefix injects every bullet). Falls back cleanly when message.quote
is absent (PTB <21 or replies that don't quote a substring).
Fixes#22619
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a GFM table has a row-label column (first column with no header),
_render_table_block_for_telegram incorrectly included the row-label cell
in the bullet zip alongside the data cells, producing a spurious bullet
like '• 維度: 核心賣點' before the real data rows.
Detect the row-label column by comparing the first data row cell count
against the header count (has_row_label_col = len(first_data_row) ==
len(headers) + 1). When present, use cells[0] as the heading and
zip headers against cells[1:] only, correctly excluding the row-label
from the bullet list.
Fixes#22604
Plugin platforms (IRC, Teams, Google Chat) currently fail with
`No live adapter for platform '<name>'` when a `deliver=<plugin>` cron
job runs in a separate process from the gateway, even though the
platforms are eligible cron targets via `cron_deliver_env_var` (added
in #21306). Built-in platforms (Telegram, Discord, Slack, etc.) use
direct REST helpers in `tools/send_message_tool.py` so cron can deliver
without holding the gateway in the same process; plugin platforms
historically depended on `_gateway_runner_ref()` which returns `None`
out of process.
This change adds an optional `standalone_sender_fn` field to
`PlatformEntry` so plugins can register an ephemeral send path that
opens its own connection, sends, and closes without needing the live
adapter. The dispatch site in `_send_via_adapter` falls through to the
hook when the gateway runner is unavailable, with a descriptive error
when neither path applies. The hook is optional, so existing plugins
are unaffected.
Reference migrations land in the same change for IRC, Teams, and
Google Chat, exercising the hook across stdlib (asyncio + IRC protocol),
Bot Framework OAuth client_credentials, and Google service-account
flows respectively.
Security hardening on the new code paths:
* IRC: control-character stripping on chat_id and message body to
block CRLF command injection; bounded nick-collision retries; JOIN
before PRIVMSG so channels with the default `+n` mode accept the
delivery.
* Teams: TEAMS_SERVICE_URL validated against an allowlist of known
Bot Framework hosts (`smba.trafficmanager.net`,
`smba.infra.gov.teams.microsoft.us`) to block SSRF; chat_id and
tenant_id constrained to the documented Bot Framework character set;
per-request timeouts so a slow STS endpoint cannot starve the
activity POST.
* Google Chat: chat_id and thread_id validated against strict
resource-name regexes; service-account refresh wrapped in
`asyncio.wait_for` so a hung token endpoint cannot stall the
scheduler.
Test coverage: 20 new tests covering happy path, missing-config errors,
network failure modes, and each defensive validation. Existing tests
unchanged. `bash scripts/run_tests.sh tests/tools/test_send_message_tool.py
tests/gateway/test_irc_adapter.py tests/gateway/test_teams.py
tests/gateway/test_google_chat.py` reports 341 passed, 0 regressions.
Documentation: new "Out-of-process cron delivery" section in
website/docs/developer-guide/adding-platform-adapters.md and an entry
in gateway/platforms/ADDING_A_PLATFORM.md naming the hook.
SQLite's WAL mode requires shared-memory (mmap) coordination and fcntl
byte-range locks that don't reliably work on network filesystems. Upstream
documents this explicitly:
https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode
On NFS / SMB / some FUSE mounts / WSL1, 'PRAGMA journal_mode=WAL' raises
'sqlite3.OperationalError: locking protocol' (SQLITE_PROTOCOL). Before
this change, every feature backed by state.db or kanban.db broke silently:
- /resume, /title, /history, /branch returned 'Session database not
available.' with no cause
- gateway logged the init failure at DEBUG (invisible in errors.log)
- kanban dispatcher crashed every 60s, driving the known migration race
(duplicate column name: consecutive_failures, #21708 / #21374)
Changes:
- hermes_state.apply_wal_with_fallback(): shared helper that tries WAL
and falls back to DELETE on SQLITE_PROTOCOL-style errors with one
WARNING explaining why
- hermes_state.get_last_init_error() + format_session_db_unavailable():
capture the init failure cause and surface it in user-facing strings
(with an NFS/SMB pointer for 'locking protocol')
- hermes_cli/kanban_db.connect(): use the shared helper
- gateway/run.py: bump SessionDB init failure log DEBUG -> WARNING
(matches cli.py's existing correct behavior)
- cli.py (4 sites) + gateway/run.py (5 sites): replace bare
'Session database not available.' with format_session_db_unavailable()
Tests: 12 new tests in tests/test_hermes_state_wal_fallback.py + 1 new
test in tests/hermes_cli/test_kanban_db.py. Existing suites (state,
kanban, gateway, cli) remain green for all tests unrelated to pre-existing
failures on main.
Evidence: real-world user on NFSv3 mount (172.26.224.200:d2dfac12/home,
local_lock=none) reporting 'Session database not available.' on /resume;
'locking protocol' appears in 4 distinct log entries across backup,
kanban, TUI, and CLI paths in the same session.
closes#22032
The send path uses Hermes' reply-anchor fallback for DM topic lanes
(message_thread_id + reply_to_message_id), but send_chat_action only
accepts message_thread_id — Telegram's Bot API 10.0 rejects it for
these lanes. Without this short-circuit, every typing tick (~every 2s
during agent runs) makes a doomed API call that gets logged as a
'thread not found' debug warning. Skip the call entirely when the
metadata indicates a DM topic reply-fallback lane; the user-visible
behavior is unchanged (no typing indicator either way for these
lanes), but the logs stay clean.
Identified during salvage review of #22053.
## Why
Hermes supports Linux, macOS, and native Windows, but the codebase grew up
POSIX-first and has accumulated patterns that silently break (or worse,
silently kill!) on Windows:
- `os.kill(pid, 0)` as a liveness probe — on Windows this maps to
CTRL_C_EVENT and broadcasts Ctrl+C to the target's entire console
process group (bpo-14484, open since 2012).
- `os.killpg` — doesn't exist on Windows at all (AttributeError).
- `os.setsid` / `os.getuid` / `os.geteuid` — same.
- `signal.SIGKILL` / `signal.SIGHUP` / `signal.SIGUSR1` — module-attr
errors at runtime on Windows.
- `open(path)` / `open(path, "r")` without explicit encoding= — inherits
the platform default, which is cp1252/mbcs on Windows (UTF-8 on POSIX),
causing mojibake round-tripping between hosts.
- `wmic` — removed from Windows 10 21H1+.
This commit does three things:
1. Makes `psutil` a core dependency and migrates critical callsites to it.
2. Adds a grep-based CI gate (`scripts/check-windows-footguns.py`) that
blocks new instances of any of the above patterns.
3. Fixes every existing instance in the codebase so the baseline is clean.
## What changed
### 1. psutil as a core dependency (pyproject.toml)
Added `psutil>=5.9.0,<8` to core deps. psutil is the canonical
cross-platform answer for "is this PID alive" and "kill this process
tree" — its `pid_exists()` uses `OpenProcess + GetExitCodeProcess` on
Windows (NOT a signal call), and its `Process.children(recursive=True)`
+ `.kill()` combo replaces `os.killpg()` portably.
### 2. `gateway/status.py::_pid_exists`
Rewrote to call `psutil.pid_exists()` first, falling back to the
hand-rolled ctypes `OpenProcess + WaitForSingleObject` dance on Windows
(and `os.kill(pid, 0)` on POSIX) only if psutil is somehow missing —
e.g. during the scaffold phase of a fresh install before pip finishes.
### 3. `os.killpg` migration to psutil (7 callsites, 5 files)
- `tools/code_execution_tool.py`
- `tools/process_registry.py`
- `tools/tts_tool.py`
- `tools/environments/local.py` (3 sites kept as-is, suppressed with
`# windows-footgun: ok` — the pgid semantics psutil can't replicate,
and the calls are already Windows-guarded at the outer branch)
- `gateway/platforms/whatsapp.py`
### 4. `scripts/check-windows-footguns.py` (NEW, 500 lines)
Grep-based checker with 11 rules covering every Windows cross-platform
footgun we've hit so far:
1. `os.kill(pid, 0)` — the silent killer
2. `os.setsid` without guard
3. `os.killpg` (recommends psutil)
4. `os.getuid` / `os.geteuid` / `os.getgid`
5. `os.fork`
6. `signal.SIGKILL`
7. `signal.SIGHUP/SIGUSR1/SIGUSR2/SIGALRM/SIGCHLD/SIGPIPE/SIGQUIT`
8. `subprocess` shebang script invocation
9. `wmic` without `shutil.which` guard
10. Hardcoded `~/Desktop` (OneDrive trap)
11. `asyncio.add_signal_handler` without try/except
12. `open()` without `encoding=` on text mode
Features:
- Triple-quoted-docstring aware (won't flag prose inside docstrings)
- Trailing-comment aware (won't flag mentions in `# os.kill(pid, 0)` comments)
- Guard-hint aware (skips lines with `hasattr(os, ...)`,
`shutil.which(...)`, `if platform.system() != 'Windows'`, etc.)
- Inline suppression with `# windows-footgun: ok — <reason>`
- `--list` to print all rules with fixes
- `--all` / `--diff <ref>` / staged-files (default) modes
- Scans 380 files in under 2 seconds
### 5. CI integration
A GitHub Actions workflow that runs the checker on every PR and push is
staged at `/tmp/hermes-stash/windows-footguns.yml` — not included in this
commit because the GH token on the push machine lacks `workflow` scope.
A maintainer with `workflow` permissions should add it as
`.github/workflows/windows-footguns.yml` in a follow-up. Content:
```yaml
name: Windows footgun check
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with: {python-version: "3.11"}
- run: python scripts/check-windows-footguns.py --all
```
### 6. CONTRIBUTING.md — "Cross-Platform Compatibility" expansion
Expanded from 5 to 16 rules, each with message, example, and fix.
Recommends psutil as the preferred API for PID / process-tree operations.
### 7. Baseline cleanup (91 → 0 findings)
- 14 `open()` sites → added `encoding='utf-8'` (internal logs/caches) or
`encoding='utf-8-sig'` (user-editable files that Notepad may BOM)
- 23 POSIX-only callsites in systemd helpers, pty_bridge, and plugin
tool subprocess management → annotated with
`# windows-footgun: ok — <reason>`
- 7 `os.killpg` sites → migrated to psutil (see §3 above)
## Verification
```
$ python scripts/check-windows-footguns.py --all
✓ No Windows footguns found (380 file(s) scanned).
$ python -c "from gateway.status import _pid_exists; import os
> print('self:', _pid_exists(os.getpid())); print('bogus:', _pid_exists(999999))"
self: True
bogus: False
```
Proof-of-repro that `os.kill(pid, 0)` was actually killing processes
before this fix — see commit `1cbe39914` and bpo-14484. This commit
removes the last hand-rolled ctypes path from the hot liveness-check
path and defers to the best-maintained cross-platform answer.
On Windows, Python's ``os.kill(pid, 0)`` is NOT a no-op. CPython's
implementation (``Modules/posixmodule.c::os_kill_impl``) treats sig=0
as ``CTRL_C_EVENT`` because the two integer values collide at the C
layer, and routes it through ``GenerateConsoleCtrlEvent(0, pid)`` —
which sends a Ctrl+C to the ENTIRE console process group containing
the target PID, not just the PID itself. Any caller that wanted to
check "is PID X alive" via the classic POSIX ``os.kill(pid, 0)``
idiom was silently killing that process (and often unrelated
processes in the same console group) on Windows. Long-standing
Python Windows quirk; see bpo-14484 (open since 2012).
This manifested in Hermes as: every ``hermes gateway status``
invocation would read the gateway's PID from the PID file, call
``os.kill(pid, 0)`` via ``gateway.status.get_running_pid()`` as a
"liveness check", and instantly terminate the gateway it was trying
to report on. No shutdown log, no traceback, no atexit hook fire,
no exit-diag entry — just silent termination of the detached pythonw
process. "Bot answered one message then stopped typing" was the
characteristic end-user symptom because `os.kill(pid, 0)` fires
mid-response-send and kills the gateway between logs.
Reproduction (verified in this branch before the fix):
$ hermes gateway start # gateway alive, PID 37520
$ hermes gateway status # reports "No gateway process detected"
$ tasklist /FI "PID eq 37520" # INFO: No tasks are running
# — gateway terminated silently
Root-cause fix is a new ``gateway.status._pid_exists(pid)`` helper:
- On Windows: Win32 ``OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION |
SYNCHRONIZE, False, pid)`` + ``WaitForSingleObject(handle, 0)``
via ctypes. Zero signal delivery, zero console-group side effects.
Pins ctypes return types to avoid DWORD-vs-signed-int parse bugs
on WAIT_TIMEOUT (0x102). Distinguishes ERROR_INVALID_PARAMETER
(PID gone) from ERROR_ACCESS_DENIED (alive but another user).
- On POSIX: the canonical ``os.kill(pid, 0)`` idiom that actually is
a no-op there.
Then patch every ``os.kill(pid, 0)`` liveness-check callsite to
route through ``_pid_exists`` instead. Total 14 callsites across
11 files; every single one was a latent silent-kill on Windows:
gateway/run.py:2810 — /restart watcher (inline subprocess)
gateway/run.py:15195 — --replace wait loop
gateway/status.py:572 — acquire_gateway_runtime_lock stale check
gateway/status.py:828 — get_running_pid (THE killer for status)
gateway/platforms/whatsapp.py:111
hermes_cli/gateway.py:228, 522, 1012 — gateway-related drain loops
hermes_cli/kanban_db.py:2826 — _pid_alive was claiming to
be cross-platform but used
os.kill(pid, 0) on Windows
hermes_cli/main.py:5792 — CLI process-kill polling
hermes_cli/profiles.py:782 — profile stop wait loop
plugins/google_meet/process_manager.py:74
tools/browser_tool.py:1215, 1255 — browser daemon ownership probes
tools/mcp_tool.py:1255, 3374 — MCP stdio orphan tracking
The watcher source in gateway/run.py:2810 is a multi-line string
that gets spawned as an inline ``python -c "..."`` subprocess, so
it can't import gateway.status. The fix for that callsite inlines
the same ctypes probe directly into the watcher source.
Tested on Windows 10 with the hermes gateway + Telegram bot:
- gateway start → alive
- 5 consecutive ``hermes gateway status`` invocations → gateway
alive after every one, same PID reported each time (37520, 21952)
- gateway.log shows uninterrupted operation; no spurious shutdown
entries; cron ticker and kanban dispatcher still running on
their 60-second cadence
- bot continues answering Telegram messages throughout
Ships alongside an exit-path diagnostic wrapper in
``hermes_cli/gateway.py::run_gateway()`` that captures every way
``asyncio.run(start_gateway(...))`` can return (success, SystemExit,
KeyboardInterrupt, BaseException, atexit) with full traceback to
``logs/gateway-exit-diag.log``. This was used to prove the gateway
was being hard-killed externally (no exit event fired) and should
be kept for future Windows debugging.
Refs: https://bugs.python.org/issue14484
See also: references/windows-subprocess-sigint-storm.md in
the hermes-agent skill.