Commit graph

1506 commits

Author SHA1 Message Date
teknium1
ab1f9b94c5 fix(telegram): accept @username chat_id in delivery paths (#13206)
TELEGRAM_HOME_CHANNEL set to an @username (not a numeric chat ID) crashed
all webhook/cron->Telegram home-channel delivery with 'ValueError: invalid
literal for int()'. The Telegram Bot API accepts both a numeric chat_id and
an @username string; Hermes was force-coercing every chat_id with int().

Add normalize_telegram_chat_id() (returns int for numeric values, passes
@username strings through) and apply it at the Bot API send/edit sites in
the Telegram adapter and the send_message tool. Username targets are now
recognized as explicit targets in _parse_target_ref.

Reapplies the approach from #13274 (season179), whose branch predated the
gateway/platforms/telegram.py -> plugins/platforms/telegram/adapter.py
relocation. Dupes: #13535 (Tranquil-Flow), #37572 (chewkaah).

Co-authored-by: season179 <season.saw@gmail.com>
2026-06-27 04:01:58 -07:00
teknium1
f2ca3e3d84 fix(gateway): hold _run_restart on _restart_task + explicit cancel-loop skip
Follow-up on the cherry-picked #13173 fix. Holds the _run_restart task in
self._restart_task (a bare asyncio.create_task keeps only a weak reference,
so a still-pending task can be GC'd mid-flight) and explicitly skips it in
the _stop_impl cancel loop alongside _stop_task. Adds AUTHOR_MAP entry for
the contributor and a regression test that fails when the task is cancellable.

Refs #12875
2026-06-27 03:57:31 -07:00
zeapsu
1ce5d6d974 fix(gateway): exclude _run_restart from _background_tasks to prevent zombie on /restart
When request_restart() adds _run_restart to _background_tasks, _stop_impl
later cancels all entries in that set.  Since _run_restart is awaiting
_stop_task at that point, the CancelledError propagates into _stop_impl,
interrupting cleanup before _shutdown_event.set() and _exit_code = 75
execute.  This leaves the gateway as a zombie (alive but disconnected) or
exiting with code 0 instead of 75, preventing systemd Restart=on-failure
from restarting the service.

Fix: don't add _run_restart to _background_tasks — it self-terminates in
~50ms and needs no lifecycle management.

Fixes #12875
2026-06-27 03:57:31 -07:00
teknium1
08e131f77c test(telegram): cover bot self-message ingestion guard (#11905)
Regression tests for the self-author guard added in the salvaged fix:
- bot-authored DM-topic watcher echo is dropped (the exact #11905 symptom)
- bot self-messages dropped in groups/supergroups too
- other bots in the same chat are still processed (self-id, not is_bot)
- observe-unmentioned sibling path also rejects self-messages
- missing from_user does not crash

Test scaffolding ported from @cola-runner's PR #12817 and adapted to the
current plugins/platforms/telegram/adapter.py and _is_own_message().
2026-06-27 03:56:52 -07:00
teknium1
151ae1e937 test(api-server): cover SSE failure finish_reason for both failure modes
Lock the contract that a clean stream-queue termination followed by an
agent failure never reports finish_reason: "stop". Covers the raised-
exception case (#12422 repro), the flagged failed-result case, truncation
(length), and the success happy path.

Follow-up to the salvaged #12504 fix from @flobo3.
2026-06-27 03:52:44 -07:00
dodo-reach
ed54469d06 fix(gateway): show MoA presets in model picker 2026-06-27 03:43:38 -07:00
teknium1
4e0788783b refactor(gateway): extract MoA one-shot restore helper; restore #28686 comment; real-method tests
Follow-up on the salvaged MoA restore fix:
- Extract the finally-block restore into _restore_moa_one_shot() so the
  behavior is unit-testable without re-implementing it, and so the gateway
  /moa handler and the finally block share one implementation.
- Restore the load-bearing #28686 zombie-eviction comment above
  _release_running_agent_state that the original diff dropped.
- Rewrite the tests to call the real _restore_moa_one_shot helper (the
  originals re-implemented the restore logic inline, so they passed
  regardless of the production code).
2026-06-27 03:43:28 -07:00
srojk34
2f29e3cfc5 fix(gateway): restore MoA one-shot model override on failed turns
The MoA one-shot restore ran inside the try block after
_handle_message_with_agent returned. When that call raised an
exception (agent init failure, interpreter shutdown, OOM), the
restore was skipped and the MoA model override stayed permanently
on _session_model_overrides — silently routing all subsequent
messages through the MoA reference fan-out with no user-visible
indication.

Move the restore to the finally block so it fires on every exit
path (success, exception, interrupt). The restore data lives on
the per-turn event object and would be lost if not consumed here.
2026-06-27 03:43:28 -07:00
briandevans
c377e954fb test(gateway): isolate secret-redaction layer from provider-error rewrite
The existing test_chat_gateways_redact_secret_in_provider_error feeds a
provider-error envelope (HTTP 401), which _sanitize_gateway_final_response
rewrites wholesale to a generic category string. That rewrite strips the
secret regardless of whether the redaction layer works, so the test cannot
on its own prove _redact_gateway_user_facing_secrets is exercised.

Add test_chat_gateways_redact_secret_in_non_error_body: ordinary assistant
prose that echoes a bearer token but is NOT a provider-error envelope, so
the rewrite path does not fire and secret redaction is the only defense.
Verified fail-before (token leaks when _GATEWAY_SECRET_PATTERNS is emptied)
and pass-after across whatsapp/slack/signal/matrix, while non-secret prose
is preserved intact.
2026-06-27 04:47:10 +05:30
briandevans
57864d07ed fix(gateway): suppress operational status/error noise on all chat gateways, not just Telegram (#39293)
The Telegram noise/secret filter added in #28533 gated its work on
`_gateway_platform_value(platform) != "telegram"`, so
`_sanitize_gateway_final_response` and `_prepare_gateway_status_message`
only ran for Telegram. Every other human-facing chat surface
(WhatsApp, Discord, Slack, Signal, Matrix, plugin platforms, etc.)
received raw provider-error bodies verbatim — including any leaked
credentials the secret-redaction pass (`sk-…`, `Bearer …`, `gh[pousr]_…`,
`xox[baprs]-…`, `hf_…`, `glpat-…`) was meant to strip.

Invert the gate from a one-platform allowlist into a small
programmatic-surface denylist: only `local`, `api_server`, `webhook`,
and `msgraph_webhook` consume gateway text programmatically and keep raw
status/error text. Every other (chat) surface — including unknown/empty
platform values and on-demand plugin pseudo-members — fails closed to
the redacted, noise-filtered, sanitized path. This widens the same
root-cause fix to both call sites: status callbacks and final replies.
2026-06-27 04:47:10 +05:30
kshitijk4poor
b0f44d3fad fix(gateway): remove process-global HERMES_SESSION_KEY write that misroutes approval prompts across concurrent sessions
GatewayRunner._run_agent's run_sync() wrote the per-turn session key to
the process-global os.environ["HERMES_SESSION_KEY"]. Because os.environ
is shared across the whole process, concurrent gateway sessions (e.g.
two Discord threads) clobbered each other's value. A tool worker thread
whose approval contextvar was unset then fell back to os.environ via
get_current_session_key() and read whichever session ran run_sync()
last — routing "Command Approval Required" prompts to the wrong thread.

Session routing is already concurrency-safe via contextvars:
- gateway/session_context.py _SESSION_KEY (set in set_session_vars)
- tools/approval.py _approval_session_key (set via set_current_session_key
  right before the agent runs, inherited by tool worker threads)

The only non-test readers of HERMES_SESSION_KEY (tools/approval.py,
tools/terminal_tool.py, tools/kanban_tools.py) all prefer the contextvar
with os.environ as a mere fallback. CLI/cron/TUI set their own os.environ
via separate export paths (e.g. the TUI parent exporting it into the
agent subprocess), so removing this in-process write does not affect them.

Adds regression tests asserting the resolver prefers the contextvar and
does not leak a concurrent session's cleared/clobbered os.environ value.

Closes #24100

Co-authored-by: Yosapol Jitrak <yosapol@jitrak.dev>
2026-06-27 04:31:37 +05:30
Yashiel Sookdeo
cf7bf5bdc9 fix(discord): auto-convert markdown tables to bullet groups
Discord does not render GFM pipe tables — raw pipe characters display
as garbage text. format_message now rewrites tables into bold-heading +
bullet groups using the shared helpers.

Fixes #21168

Co-authored-by: Yashiel Sookdeo <yashiel@skyner.co.za>
2026-06-27 03:57:24 +05:30
Yashiel Sookdeo
70c834a740 refactor: extract shared GFM table→bullet helpers into helpers.py
Move table-detection regex, row-splitting, and table-to-bullet
conversion into gateway/platforms/helpers.py so both Discord and
Telegram adapters can share them.

Co-authored-by: Yashiel Sookdeo <yashiel@skyner.co.za>
2026-06-27 03:57:24 +05:30
Ben
8ab7246c45 fix(gateway): stamp drain marker with instantiation epoch so a durable-volume restart clears it (NS-570)
The external-drain marker .drain_request.json is written under HERMES_HOME,
which on Hermes Cloud is a persistent Fly volume (/opt/data). A begin-drain
marker therefore SURVIVES the post-update machine restart. But the disruptive
lifecycle actions a drain protects (auto-update / image migrate / env edit /
profile change) all restart the machine — which is exactly the signal the drain
is over. The freshly-restarted gateway re-read the orphaned marker on its
startup reconcile and parked itself back in 'draining', refusing every new turn
indefinitely (NS-570: ~52 min until manually cleared).

Fix: stamp the marker with an identity of THIS container/VM instantiation
(kernel boot_id + PID 1 start time, read from /proc) and treat a marker whose
epoch differs from the current instantiation as absent. A deliberate restart →
new PID 1 → new epoch → stale marker ignored → gateway boots 'running'. A marker
written during the current instantiation (the live drain) still matches; an s6
respawn of just the gateway (PID 1/init unchanged) keeps the same epoch, so an
in-flight drain is still honoured (D4a reversibility preserved).

The staleness check is lenient and never fail-closed: a legacy marker with no
epoch, a corrupt/contentless marker, or an environment with no /proc (epoch
unavailable) all degrade to the original presence-only behaviour. NAS is
untouched — it only ever POSTs begin/cancel-drain over HTTP; the marker file is
purely gateway-internal IPC.

The fix is entirely within gateway/drain_control.py; the watcher and the
dashboard endpoint go through the same drain_requested()/write_drain_request()
chokepoints and need no functional change.
2026-06-26 18:59:41 +05:30
Teknium
619dc4a561
fix(whatsapp_cloud): resolve reply-to text so the agent sees reply context (#52957)
Replies on WhatsApp Cloud arrived at the agent with reply_to_id set but
reply_to_text=None, so run.py never injected the "[Replying to: ...]"
disambiguation prefix (it gates on reply_to_text). Meta's webhook context
object carries only the quoted message's id, never its text.

Index (chat_id, wamid) -> text in rich_sent_store on every inbound message
and every outbound text send -- the same store that solved the identical
Telegram rich-send problem -- then look up the quoted text in
_build_message_event_from_cloud and populate reply_to_text plus
reply_to_is_own_message, derived from context.from versus the business
number.
2026-06-26 01:05:05 -07:00
Ben
19b2624404 feat(gateway): external drain trigger + accept-gating (begin/cancel + control channel)
Tasks 2.1 + 2.2 + 2.3 of the safe-shutdown plan — the reversible
quiesce-without-restart machinery NAS drives during a lifecycle action (D4a).
These ship together because the endpoint, the control channel, and the gateway
state machine are one coherent slice.

2.2 — control channel (gateway/drain_control.py, new):
The dashboard has no HTTP path into a running gateway (guardrails: "there is NO
external control channel into a running gateway"); restart/drain is driven only
by markers the gateway reacts to. So begin/cancel-drain writes/removes a
presence-based marker .drain_request.json (HERMES_HOME-scoped, atomic write,
never-raises read; a corrupt marker reads as present-contentless → fail-safe
toward quiescing). This is Q-B option A.

2.2 — gateway state machine (gateway/run.py):
- _external_drain_active flag, DISTINCT from the shutdown _draining flag: this
  one does NOT exit the process and is fully reversible.
- _enter_external_drain / _exit_external_drain: idempotent transitions that
  flip gateway_state→draining / →running via _update_runtime_status (preserving
  the live active_agents count). exit refuses to revert to running during a
  real shutdown or after the loop stops (shutdown wins).
- _drain_control_watcher: 1s background task (modelled on _handoff_watcher)
  reconciling accept-state with the marker; honours a marker that survived a
  restart on its first tick. Registered alongside the other watchers in start.
- New-turn accept gate in _handle_message, placed BEFORE the session-slot
  claim: when draining, refuse to START a new turn (so active_agents can only
  fall → no TOCTOU race), while in-flight turns finish untouched. Internal/
  system events (restart-recovery replays, bg-process completions) bypass it.

2.1 — endpoint (hermes_cli/web_server.py):
POST /api/gateway/drain {action: drain|cancel}. Authenticated by the Task-2.0a
token seam (the drain plugin registered this exact path as a token route);
attributes the request to the verified token principal. Begin writes the
marker, cancel removes it — the gateway process owns the actual transition.
Force-override (D6) is NOT here; it maps onto the existing immediate
/api/gateway/restart force path.

Tests (mocked — necessary-not-sufficient; the HARD live gate Q-B is next):
- tests/gateway/test_external_drain_control.py — marker contract (write/clear/
  read/corrupt/atomic), state machine (enter/exit/idempotency/shutdown-wins/
  loop-stopped), watcher reconcile-enter-then-exit, new-turn refusal, and
  in-flight-not-interrupted. 15 tests.
- tests/hermes_cli/test_web_server.py — /api/gateway/drain begin/default-begin/
  cancel/cancel-idempotent/bad-action-400. 6 tests.
- dashboard.drain_auth config section already added in 2.0b commit.

All touched suites green: 301 (gateway+auth) + 9 (web_server endpoints) passed.

Intentionally deferred:
- HARD live-validation gate (Q-B): real isolated `hermes gateway run`, drive a
  real begin-drain marker, prove the 5-point checklist a–e.
- Spec-doc status flip + Phase-2 PR.

Build status: external-drain, restart-drain, status, dashboard-auth, drain-plugin,
token-auth, and web_server-endpoint suites green.
2026-06-26 00:47:19 -07:00
Ben Barclay
6e4e5967f7
feat(relay): multi-platform-per-agent — list identity, provision-loop, N-hello, per-frame egress (Phase 1.5) (#52830)
Cut over the agent half of Shape A (D-Q1.5a/b.1/c) to front a SET of platforms on
one relay WS:

- relay_platform_identities() parses GATEWAY_RELAY_PLATFORMS (list) +
  GATEWAY_RELAY_BOT_IDS (JSON keyed map {platform:{botId,username?}}). Cut over
  from the scalar GATEWAY_RELAY_PLATFORM/_BOT_ID (no fallback, D-Q1.5c).
- self_provision_relay() loops one /relay/provision per platform under one
  gatewayId+secret, partial-failure-tolerant.
- WebSocketRelayTransport takes the identity SET, sends one hello per identity
  (connector accumulates the advertised set), and stamps the per-frame
  OutboundFrame.platform + its matching advertised botId on outbound.
- RelayAdapter remembers each chat's underlying source.platform (mirroring the
  existing guild/dm scope capture) and tags the reply's egress platform.
- send_relay_policy() declares one relevance policy per fronted platform (the
  connector keys policy by (tenant,platform,instanceId)).

Single-platform deploys are byte-identical on the wire (1-element list, no per-frame
tag -> connector session-default fallback). typecheck/ruff clean; relay unit 221 pass
(+10 new); all 15 cross-repo E2E drivers green vs connector origin/main.
2026-06-26 17:32:46 +10:00
Ben Barclay
cb7d1f68f8
fix(relay): accept is_reconnect kwarg in RelayAdapter.connect (#52911)
The gateway reconnect watcher (gateway/run.py) recovers a platform after a
fatal adapter error by building a fresh adapter and calling
connect(is_reconnect=True). Every BasePlatformAdapter implements
connect(*, is_reconnect: bool = False) for this — except RelayAdapter, whose
connect() was bare. So the watcher's recovery path raised:

    TypeError: connect() got an unexpected keyword argument 'is_reconnect'

Observed live on a hosted staging agent: after a fatal relay adapter error the
watcher could never re-establish relay, so the shared-bot inbound never reached
the gateway and Discord DMs stopped (dashboard surfaced the TypeError).

Relay deliberately ignores the flag: the #46621 server-side-queue-preservation
concern doesn't apply, because relay's outage buffer is the connector's durable
buffer (replayed on the transport's re-handshake), not a gateway-side queue the
adapter owns. Routine WS drops are already handled by the transport's own
reconnect supervisor (WebSocketRelayTransport, reconnect=True); the watcher path
is fatal-error recovery, and the fatal handler disconnect()s the old adapter
(cancelling its supervisor) before a fresh adapter+transport is built, so there
is no double-dial.

Adds two regression tests (both proven red without the fix): connect(is_reconnect=True)
reaches the same transport-less RuntimeError instead of TypeError, and the
signature matches BasePlatformAdapter.connect.
2026-06-26 16:46:09 +10:00
teknium1
43b8ba4181 fix(telegram): preserve Bot API update queue on watcher reconnect
After a prolonged outage the in-process network-error ladder escalates to
fatal and GatewayRunner._platform_reconnect_watcher rebuilds a fresh adapter
that reconnects through the bootstrap path. That path called
start_polling(drop_pending_updates=True), discarding every update Telegram
queued during the outage — all messages sent while the bot was down were
silently lost. The in-process ladder and 409-conflict handler already passed
drop_pending_updates=False; only bootstrap did not distinguish a cold first
boot from a reconnect.

Thread an is_reconnect signal from the watcher through
_connect_adapter_with_timeout into adapter.connect(). The base
BasePlatformAdapter.connect() gains a keyword-only is_reconnect=False so every
adapter inherits a tolerant signature (no per-platform breakage when the
runner forwards the kwarg). Telegram translates is_reconnect into
drop_pending_updates=not is_reconnect on both the polling and webhook bootstrap
calls. Cold boot still drops the stale queue; a watcher reconnect preserves it.

Fixes #46621.

Co-authored-by: annguyenNous <annguyen@nousresearch.com>
Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>
Co-authored-by: Kewe63 <Kewe63@users.noreply.github.com>
2026-06-25 21:29:57 -07:00
teknium1
85e084d60d fix(email): reject spoofed From: header for authorization (GHSA-rxqh-5572-8m77)
The email adapter authorized senders entirely off the From: header, which is
attacker-controlled and unauthenticated by IMAP. An attacker could forge
From: an-allowlisted-address and pass both the adapter's EMAIL_ALLOWED_USERS
pre-filter and the gateway's allowlist authz (both key on the same spoofable
sender_addr), getting unauthorized commands executed by the agent.

Verify the From: domain against the trusted Authentication-Results header the
receiving mail server stamps (SPF/DKIM/DMARC) before trusting it for
authorization. Enforced only when an allowlist is in effect and allow-all is
off — fail-closed. Operators whose server does not stamp the header can opt
out via platforms.email.require_authenticated_sender: false (or
EMAIL_TRUST_FROM_HEADER=true).
2026-06-25 21:11:02 -07:00
Ben Barclay
dedf5643d8
fix(gateway): scale-to-zero never armed — arm-gate counted disabled placeholder platforms (#52831)
The scale-to-zero idle watcher never started on a correctly-opted-in,
relay-only instance, so the gateway never ran its idle decision, never called
go_dormant(), and never sent going_idle to the connector. Fly's autostop still
suspended the machine on traffic-idle, but the connector never flipped the
instance to buffered-only — so an inbound DM took the live delivery path,
found no live session for the suspended machine, and was dropped fail-closed
with no wake poke. The machine slept and never woke.

Root cause: _scale_to_zero_should_arm() passed list(config.platforms.keys())
to messaging_is_relay_only_or_absent(). config.platforms is pre-seeded with a
DISABLED placeholder PlatformConfig for every known platform (telegram,
discord, slack, matrix, …), so the key set is always the full ~20-entry
catalog regardless of what the instance actually runs. The relay-only check
discarded "relay", saw the disabled placeholders as live direct-socket
platforms, and returned False — so should_arm() was False and the watcher was
never created. Verified live on a staging instance: config.platforms keys =
[telegram, discord, slack, mattermost, matrix, relay] with only relay
enabled=True; should_arm() = False.

Fix: filter config.platforms to ENABLED entries before the relay-only check,
mirroring the adapter-connect loop which already gates on
`if not platform_config.enabled: continue`. This arms off the same notion of
"active platform" the rest of start() already uses — no parallel concept.

Also add a one-line not-armed diagnostic: when an instance IS opted in (the
HERMES_SCALE_TO_ZERO stamp is set) but the watcher still doesn't arm, log why
(relay_only_or_absent, the enabled platforms, wake_url present/missing). A
non-opted instance stays silent. The arm path previously logged only on
success, so a failed arm was invisible.

Tests: the existing pure-helper tests passed bare names so they never
exercised the call site that feeds the placeholder-laden config. Add
behaviour-contract tests against the REAL _scale_to_zero_should_arm with a
realistic config.platforms (relay enabled + others disabled). The F25
regression test (relay-only + disabled placeholders must arm) and the
no-platform case are RED without this fix, GREEN with it; the
genuinely-enabled-direct-platform / not-opted-in / no-wake-url cases stay
correctly non-arming so the filter can't over-broaden.

Wake mechanism itself verified healthy independently (direct wakeUrl GET
resumed a suspended staging instance in 1.15s, clean resume signature).
2026-06-26 14:01:48 +10:00
Teknium
811df74a10
fix(gateway): defer cross-process cache cleanup off the cache lock (#52197) (#52761)
The #45966 cross-process coherence guard popped the stale cached agent
and then called the blocking _cleanup_agent_resources (memory-provider
shutdown, tool-resource teardown, async-client teardown) while still
holding _agent_cache_lock, on the gateway event-loop thread. While that
ran, _sweep_idle_cached_agents (driven by _session_expiry_watcher)
blocked acquiring the same lock and the asyncio loop stalled for minutes,
tripping repeated Discord 'heartbeat blocked' warnings.

Fix mirrors the cap-enforcer / idle-sweep paths: pop the stale entry
under the lock, release it, then schedule the SOFT release on a daemon
thread. The soft path (_release_evicted_agent_soft) is also more correct
here than the hard teardown the regression used — the same session
rebuilds a fresh agent immediately after invalidation, so its terminal
sandbox / browser / bg processes (keyed on task_id) must be preserved
for the rebuilt agent to inherit, not torn down.

Verified the cross-process site was the only cleanup-under-lock instance;
the other _cleanup_agent_resources call sites run outside the lock.
2026-06-25 18:58:47 -07:00
Teknium
ce802e932c fix(telegram): heartbeat loop exits cleanly when bot has no get_me
CI shard test_telegram_conflict.py timed out (140s) because the new
_polling_heartbeat_loop, started by connect(), busy-spun under those
tests: they monkeypatch asyncio.sleep to instant and pass a bot double
with no get_me(), so the probe raised AttributeError (swallowed) and the
loop re-entered immediately with no real pacing, starving the event loop.

Guard the loop to return when bot.get_me is not callable — a real PTB Bot
always exposes it, so this only triggers on a torn-down app or a test
double, where there is nothing to probe. Also cancel the heartbeat task in
the conflict tests that call connect() without disconnect(), matching the
production disconnect() teardown.

Verified: test_telegram_conflict.py now runs in ~4.5s; the 22
heartbeat/reconnect tests still pass; E2E confirms a hanging get_me still
fires the reconnect ladder while a missing get_me exits without spinning.
2026-06-25 18:50:11 -07:00
agt-user
8501caf51f fix(telegram): persistent heartbeat loop to detect CLOSE-WAIT polling sockets
When a Telegram long-poll TCP socket enters CLOSE-WAIT (remote sent FIN
but httpx hasn't noticed), epoll still reports it readable so no
exception is raised. PTB's error_callback never fires, the reconnect
ladder never engages, and the gateway silently stops receiving messages
while the process stays alive — until a manual systemctl restart.

The existing recovery only covers two cases: error_callback-driven
reconnects (which require an exception PTB never gets) and a one-shot
_verify_polling_after_reconnect probe (which runs only right after an
explicit reconnect). A socket that wedges during steady-state operation
is never detected.

Add _polling_heartbeat_loop: a background asyncio.Task started in
connect() (polling mode only) that probes get_me() every 90s on the
general request pool (not the getUpdates pool, so healthy long-polls are
never interrupted). On asyncio.TimeoutError/OSError it hands off to the
existing _handle_polling_network_error ladder; other errors are
swallowed. disconnect() cancels and awaits the task. Worst-case
detection window ~105s.

Complementary to #51541 (general-pool keepalive limits / fd leak) — that
recycles idle pooled connections; this detects a wedged active read.

Fixes #48495

Co-authored-by: agt-user <267614622+agt-user@users.noreply.github.com>
2026-06-25 18:50:11 -07:00
Teknium
fd2a35b169
fix: stop reporting cache-hit rate and cost across all UI surfaces (#52717)
* fix: stop reporting cache-hit rate and cost across all UI surfaces

Cost estimates and cache read/write token reporting are unreliable on
providers that don't surface cached_tokens (e.g. ollama-cloud, which doesn't
implement prompt_tokens_details.cached_tokens), producing misleading
near-zero 'cache hit' readouts and cost figures. Remove cost + cache-hit
reporting from every user-facing surface; keep input/output/total token
counts (provider-agnostic and accurate) and the Nous account billing UI
(real account money, separate from per-conversation estimates).

Surfaces:
- CLI /usage + model-info: drop cost lines + cache read/write token lines
- Gateway /usage + /model: drop cost + cache lines
- tui_gateway/server.py: stop emitting cost_usd / cache_read in usage and
  subagent.complete payloads
- TUI (Ink): drop cost from status bar (+ showCost plumbing), /usage panel,
  thinking rollup, agents overlay (incl. compare view); keep token counts
- Desktop Command Center: drop cost stat, per-model cost, actual-cost hint

Underlying estimate_usage_cost / format_cost / insights cost columns are
left intact but no longer surfaced (display-only change, reversible).

* test: update TUI + gateway + CLI tests for removed cost/cache-hit reporting

- CLI /usage test asserts cost/cache lines are absent, tokens present
- gateway /usage test drops cost + cache asserts; removes cost-included test
- TUI subagentTree summary expectation drops the cost segment
- useConfigSync + appChrome status-rule tests drop showCost prop/state
2026-06-25 15:21:22 -07:00
infinitycrew39
d40b5735a4 test(telegram): cover table auto-rich and topic routing
Assert bare tables upgrade to sendRichMessage under default/opt-out config,
DM-topic resumed sends without reply anchors, and rich finalize edits carry
forum topic routing metadata.
2026-06-25 13:10:54 -07:00
kshitijk4poor
d9bd7ce827 test(compression): pin rotation-fallback tests to in_place=False ahead of default flip
These 7 test sites assert rotation behavior (fork, child sessions, lock
contention, logging session-context follows id rotation, boundary hooks fire
on rotation). Pin each builder to in_place=False explicitly so they keep
exercising the retained rotation fallback regardless of the global default
(flipped to True in #38763). Rotation stays a working opt-out fallback and
deserves continued coverage — these are NOT deleted.

Pinned sites:
- test_compression_concurrent_fork._build_agent_with_db
- test_compression_logging_session_context._build_agent_with_db
- test_compression_rotation_state._build_agent_with_db
- test_compression_boundary_hook._make_agent (2 helpers: CompressionBoundaryHook + SessionCompressEvent)
- test_compression_concurrent_sessions._build_agent_with_db
2026-06-25 12:56:05 -07:00
srojk34
510bf40705 fix(gateway): read compaction result flag not config flag in hygiene guard (#50098)
Salvage of #50098 by @srojk34, cherry-picked onto current main.

The hygiene auto-compress guard and the /compress slash command both read
compression_in_place (config flag — is in-place mode enabled?) instead of
_last_compaction_in_place (result flag — did in-place compaction actually
succeed?). Both agents are built without a session_db, so archive_and_compact
always fails silently and _last_compaction_in_place stays False. Reading the
config flag makes the guard think in-place succeeded, triggering
rewrite_transcript() which replaces the original messages with only the
compressed summary — permanent data loss.

Co-authored-by: srojk34 <srojk34@users.noreply.github.com>
2026-06-25 12:56:05 -07:00
kshitijk4poor
73c8d5a1e7 fix: use self._session_db directly + add regression test
- Replace getattr(self.session_store, '_db', None) with self._session_db
  (the GatewayRunner's own SessionDB, consistent with existing usage in
  slash_commands.py L240/L499).
- Remove verbose comment referencing a branch name as an issue number.
- Update stale comment in run.py that said 'today it has no session_db'.
- Add regression test verifying session_db is passed and rotated session
  is persisted (adapted from #51624 by @LeonSGP43).
- Add _session_db=None to _make_runner fixtures in test_compress_command,
  test_compress_focus, and test_compress_plugin_engine.
2026-06-26 00:50:40 +05:30
kshitij
5de8a8fbe8
Merge pull request #52375 from NousResearch/salvage/47237-dedupe-user-turns
fix(gateway): dedupe user turns on transient failure (#47237)
2026-06-26 00:30:59 +05:30
davidgut1982
6208d6b3be fix(gateway): dedupe user turns on transient failure (#47237)
When the gateway persists a user message after a transient provider
failure (429/timeout/auth error), subsequent retries of the same
Telegram message could stack duplicate user turns in the transcript,
causing the agent to fall behind by 1-2 messages.

Add has_platform_message_id() to SessionDB (using the existing
idx_messages_platform_msg_id partial index) and a SessionStore wrapper.
The gateway's transient-failure path checks this before
append_to_transcript -- if the platform_message_id is already
persisted, the duplicate write is skipped.

Salvaged from #47869 by @davidgut1982. Adapted to current main which
has additional append sites and an existing content-based dedupe in
the exception handler path.

Closes #47237
2026-06-26 00:11:17 +05:30
Ben Barclay
d6269da7fd
fix(gateway): harden scale-to-zero dormancy guards (#52359)
Some checks are pending
CI / detect (push) Waiting to run
CI / tests (push) Blocked by required conditions
CI / lint (push) Blocked by required conditions
CI / typecheck (push) Blocked by required conditions
CI / docs-site (push) Blocked by required conditions
CI / history-check (push) Blocked by required conditions
CI / contributor-check (push) Blocked by required conditions
CI / uv-lockfile (push) Blocked by required conditions
CI / docker-lint (push) Blocked by required conditions
CI / supply-chain (push) Blocked by required conditions
CI / osv-scanner (push) Blocked by required conditions
CI / All required checks pass (push) Blocked by required conditions
Deploy Site / deploy-vercel (push) Waiting to run
Deploy Site / deploy-docs (push) Waiting to run
Docker Build and Publish / build-amd64 (push) Waiting to run
Docker Build and Publish / build-arm64 (push) Waiting to run
Docker Build and Publish / merge (push) Blocked by required conditions
Block scale-to-zero suspend while background async delegations are active, and restore runtime status to running on real inbound after a dormant wake.\n\nAdd regression coverage for both review findings.
2026-06-25 20:41:03 +10:00
Ben Barclay
72ae163250
fix(relay): authorize relay-delivered events by delivery, not source.platform (#52306)
* fix(relay): authorize relay-delivered events by delivery, not source.platform

The #52190 upstream-authz fix keyed _is_user_authorized off
source.platform via _adapter_authorization_is_upstream(source.platform).
But a relay *message* inbound carries the UNDERLYING platform
(source.platform == discord/telegram/...), NOT Platform.RELAY, because
ws_transport._event_from_wire maps the connector's wire payload
(platform="discord") straight onto SessionSource for session-keying and
egress. The relay adapter is registered only under Platform.RELAY, so
adapters.get(Platform.DISCORD) misses, the trusted-upstream branch is
skipped, and the user hits the env-allowlist default-deny:

    WARNING gateway.run: Unauthorized user: <id> (<name>) on discord

(Live staging bug: alpha tester linked successfully, then every
follow-up DM was silently dropped.)

Fix: the authentic trust signal is that the event was delivered over the
per-instance-authenticated relay WS, not which platform it underlies. Add
a wire-INVISIBLE SessionSource.delivered_via_upstream_relay flag, stamped
by the relay transport in _event_from_wire, and authorize on it. The flag
is excluded from to_dict/from_dict so a peer can neither forge it across
the wire nor have it restored from persistence. The existing adapter-flag
check is retained for events whose source.platform IS Platform.RELAY
(interaction-passthrough). A direct Discord event on a multiplexing
gateway (direct + relay adapters) is unmarked and still default-denies.

* fix(relay): use identity check on delivery marker to avoid MagicMock fail-open

A MagicMock() source (used by test_signal.py and other gateway tests) auto-
vivifies source.delivered_via_upstream_relay as a truthy Mock, which a bare
truthiness check would treat as authorized — flipping
test_signal_in_allowlist_maps from False to True. The marker is a real bool on
SessionSource, so check 'is True' explicitly: refuses to authorize any non-bool
stand-in, defensive against accidental fail-open.
2026-06-25 14:21:09 +10:00
Ben
0c3f197cff fix(relay): re-attach DM author user_id on outbound for connector egress
A DM reply carries no guild_id, so the connector's egress guard cannot
resolve the owning tenant from metadata.guild_id and declines the send
with "discord egress declined: target not routed to an onboarded tenant"
— the bug behind "the bot never replies in DMs". Guild replies are
unaffected (they carry guild_id), which is why the guild path worked
end-to-end while DMs looked broken.

The connector now resolves a DM reply's tenant from the recipient's
author binding (gateway-gateway #67, resolveByUser keyed on
metadata.user_id) — the outbound counterpart to inbound Phase 7a
author-first resolution. But it needs the recipient user_id ON the
outbound action, and the adapter only re-attached guild_id
(_capture_scope/_with_scope), no-op for DMs (the docstring even said so).

This extends the adapter's inbound-scope capture: for a DM (no guild_id)
remember chat_id -> the authentic author user_id we observed, and
re-attach it as metadata.user_id on outbound. Guild capture is unchanged
and wins when present; user_id is the DM-only fallback. The id is the one
the connector observed inbound (never gateway-asserted), so the trust
invariant holds.

+4 unit tests (DM reply re-attaches user_id + no guild_id; unknown chat
invents nothing; explicit user_id preserved; guild reply never carries
user_id). Proved load-bearing (reverting the re-attach fails the DM
test). 144 relay tests pass, ruff clean.

Pairs with gateway-gateway #67 (the connector-side resolver). Together
they close the DM-reply egress gap end-to-end.
2026-06-25 12:43:54 +10:00
Ben
d1cac0e5ef feat(gateway): scale-to-zero idle detection + dormant-quiesce (Phase 0)
The gateway-side BEHAVIOUR layer that consumes the relay scale-to-zero
primitives (gateway-gateway Phase 5): the gateway decides it is idle and
drives the relay transport dormant so the platform (Fly autostop:"suspend")
can suspend the now-traffic-idle machine, which wakes on the connector's
wakeUrl poke (decisions.md Q3=C', D1-D13).

- gateway/scale_to_zero.py: pure helpers — scale_to_zero_enabled (the NAS
  Labs HERMES_SCALE_TO_ZERO stamp, D11/Q8=A), parse_idle_timeout_seconds
  (config.yaml gateway.scale_to_zero.idle_timeout_minutes, D2),
  messaging_is_relay_only_or_absent (F6/D1), should_arm (D1/D11/§3.4(1)),
  is_idle (D2/D3/F7).
- gateway/run.py: _last_inbound_at clock stamped on user inbound in
  _handle_message (F13); the arm-gate + idle predicate + the
  _scale_to_zero_watcher dormant sequence (mark draining -> adapter
  go_dormant() -> cooldown), started only when armed. Deliberately NOT the
  stop path and NOT mark_resume_pending (F12/D13).
- tools/process_registry.py: has_any_active() for the bg-work guard (D3/F7).
- hermes_cli/config.py: gateway.scale_to_zero.idle_timeout_minutes default 5.

Tests: 38 pure-logic + 6 watcher (incl. bg-work regression guard proven RED).
Full relay + scale-to-zero suites: 184 passed. The 20 unrelated failures in
the broader run are PRE-EXISTING on origin/main (custom-provider/tools tests),
confirmed via a pristine baseline worktree.
2026-06-24 18:47:18 -07:00
Ben
96af4bec30 feat(relay): add go_dormant() transport mode for scale-to-zero (0.E0)
Net-new WebSocketRelayTransport.go_dormant() + RelayAdapter.go_dormant() —
the third transport mode the scale-to-zero behaviour layer needs, distinct
from both disconnect() and an unexpected close (decisions.md D12/F14):

- disconnect() sets _closing=True and CANCELS the reconnect supervisor
  (terminal "shutting down for good") -> a suspended machine never re-dials
  on wake, stranding its buffered backlog.
- an unexpected close re-dials IMMEDIATELY -> the socket never stays down,
  so the platform proxy never suspends the machine.

go_dormant(): going_idle->ack (reuse go_idle), then close the socket WITHOUT
setting _closing, so the reader's fall-through still arms the reconnect
supervisor (wake path stays live) but on the longer _dormant_redial_s
cadence so it doesn't fight the platform suspend window. A successful re-dial
clears _dormant. Honors the §3.4 wake->reconnect->drain contract.

Tests: 6 new in test_relay_going_idle.py incl. the F14 regression guard
(routing dormancy through disconnect() fails exactly the 4 wake-path tests).
Full relay suite 140 passed.
2026-06-24 18:47:18 -07:00
helix4u
17beb55e3c fix(telegram): gate rich draft previews separately 2026-06-24 18:11:14 -07:00
Ben
538c419d2e fix(gateway): scope dashboard liveness fallback to the profile
PR #52151 hardened the runtime-status liveness check to trust a readable
live process command line over stale gateway_state.json argv, so a recycled
PID now owned by an s6 supervisor no longer counts as a running gateway.

That fix is correct but incomplete for the reported symptom: the web
dashboard showed a named profile's gateway green while
`hermes -p <name> gateway status` showed it stopped. Two further issues:

1. Cross-profile PID reuse. In per-profile Docker supervision, one profile's
   stale `gateway_state.json` can record a PID the OS later recycled onto a
   DIFFERENT profile's live gateway. That PID's command line still
   `looks_like_gateway`, so the dead profile was reported running. The
   recorded argv has its `-p <name>` selector stripped in-process by
   `_apply_profile_override`, so it cannot disambiguate; the live `/proc`
   cmdline still carries it. `get_runtime_status_running_pid` now accepts an
   `expected_home` and validates the live command line belongs to THAT
   profile (mirroring `hermes_cli.gateway._matches_current_profile`, the
   logic the CLI scan path already uses — which is why the CLI was correct).
   `_check_gateway_running` passes the enumerated profile dir.

2. The existing regression test `test_gateway_running_check_falls_back_to_
   runtime_state` used the live pytest PID with a gateway-shaped record; once
   the live cmdline became authoritative it no longer looked like a gateway.
   Updated to mock the live cmdline to the real separate-process scenario it
   describes.

The active-profile path (`get_running_pid`) is intentionally left unscoped:
it is lock-verified and any live gateway cmdline is acceptable there. Multiplex
mode is unaffected — `running` state is only ever written to a gateway's own
home, never a secondary served profile's.

Adds coverage for: cross-profile PID reuse (named + default), matching
profile cmdline (`-p`, `--profile`, explicit HERMES_HOME=), the bare default
gateway, and the unreadable-cmdline cross-platform fallback. Each new
cross-profile assertion fails without the profile scope and passes with it.

Co-authored-by: helix4u <4317663+helix4u@users.noreply.github.com>
2026-06-25 10:25:54 +10:00
helix4u
f1617a7ebb fix(gateway): validate runtime status pid command line 2026-06-25 10:25:54 +10:00
Ben
d335164833 fix(relay): authorize relay inbound via connector-enforced upstream authz
A hosted instance fronted by the Team Gateway connector dropped EVERY relay
message as "Unauthorized user" and the agent never replied — despite the
message routing correctly through the connector to the instance.

Root cause: gateway authorization (_is_user_authorized) had no notion of
upstream-enforced authz. Platform.RELAY matches no {PLATFORM}_ALLOWED_USERS
allowlist and isn't in the HA/WEBHOOK always-authorized set, so a relay user
with no env allowlist configured hit the default-deny ("No user allowlists
configured. All unauthorized users will be denied."). The message was received,
then silently denied before reaching the agent.

This is incorrect for relay: the connector authenticates the gateway's WS with
a per-instance secret and performs owner-only author-binding resolution BEFORE
delivering. A message only reaches this gateway because the connector resolved
it to THIS instance's bound user (user_instance_binding), keyed on the author id
the connector OBSERVED off the event — never a gateway claim. The authorization
decision is already made by a trusted, authenticated upstream; there is no local
RELAY_ALLOWED_USERS allowlist to consult, and default-denying for its absence is
the bug.

Fix: add a generic BasePlatformAdapter.authorization_is_upstream capability
(default False) that the relay adapter overrides to True, plus a dedicated
trusted branch in _is_user_authorized that honors it. This is delegation to a
trusted upstream, NOT a fail-open: it fires only for an adapter that explicitly
declares the flag; every direct network-exposed adapter leaves it False and the
env-allowlist default-deny (SECURITY.md §2.6) is unchanged. Distinct from
enforces_own_access_policy, which mirrors a LOCAL config-driven allowlist —
this delegates to an authenticated upstream's decision.

Tests: behavior contract that the base defaults False, the relay adapter
declares True, a relay user (group + DM) is authorized with no env allowlist,
and crucially a non-upstream adapter with no allowlist still default-denies
(guards against the fix becoming a blanket fail-open). 6 new tests; relay +
authz + config-policy suites green (134 + 90).

Found via live staging debug of the Discord self-serve onboarding flow.
2026-06-25 10:06:21 +10:00
liuhao1024
404b06ac4f fix(gateway): honor server retry_after in _send_with_retry for Telegram flood control (#46762)
When Telegram's sendRichMessage returns a FloodWait/RetryAfter error,
_try_send_rich() now extracts the server-provided retry_after value and
propagates it through SendResult.retry_after. The base _send_with_retry()
layer honors this value instead of using its default short exponential
backoff (~2s, ~4s), preventing the retry budget from being exhausted
against a server that demands a 25-37s wait.

Salvaged from #46774 by @liuhao1024. Telegram adapter path moved from
gateway/platforms/telegram.py to plugins/platforms/telegram/adapter.py
since the original PR.

Closes #46762
2026-06-25 02:43:47 +05:30
kshitijk4poor
9c994377ed fix(gateway): skip non-dict entries in session loading (#46994)
Corrupted sessions.json entries (e.g. a bare bool where a dict is
expected) caused TypeError on 'origin' in data' which escaped the
(ValueError, KeyError) inner except and aborted loading ALL remaining
sessions, not just the corrupted one.

Two-layer fix:
- Loop level: isinstance(entry_data, dict) guard before from_dict
- from_dict: isinstance(data['origin'], dict) instead of bare truthiness
- Added TypeError to the inner except as defense-in-depth

Closes #46994
2026-06-25 01:26:13 +05:30
kshitijk4poor
e0272cfef2 Revert "fix(compression): make minimum context floor configurable (#31600)"
This reverts commit cae1ee44a7.
2026-06-25 01:04:44 +05:30
Tranquil-Flow
cae1ee44a7 fix(compression): make minimum context floor configurable (#31600)
Add compression.minimum_context_floor config key that allows users
to lower the compression threshold floor below the hardcoded 64K
default, preventing infinite tool-call loops on models whose
structured output degrades well before 64K tokens.

- agent/model_metadata.py: add get_configurable_minimum_context()
  helper with 16K hard safety limit
- agent/context_compressor.py: accept minimum_context_floor param,
  thread it through _compute_threshold_tokens
- agent/conversation_compression.py: use compressor's floor for
  aux model context validation
- agent/agent_init.py: read compression.minimum_context_floor from
  config and pass to ContextCompressor
- gateway/run.py: cache-busting includes new key

Salvaged from #31686 by @Tranquil-Flow onto current main.
Resolves conflicts with in-place compaction (#38763) and max_tokens
threshold computation (#43547) that landed after the original PR.

Closes #31600
2026-06-25 00:56:04 +05:30
kshitijk4poor
0225480369 fix(gateway): offload agent cleanup off the event loop in /new reset (#35994)
The /new (and /reset) confirmation-button callback runs the slash-confirm
handler on the asyncio event loop (see _request_slash_confirm). That handler
calls _handle_reset_command, which invoked the SYNCHRONOUS, potentially
long-blocking _cleanup_agent_resources inline: agent.close() tears down
terminal sandboxes, browser daemons and background processes (subprocess
waits), and shutdown_memory_provider() can make a network call. A slow
teardown wedged the entire event loop, so the bot went silent and stopped
processing all messages until a manual restart.

Offload _cleanup_agent_resources via the existing contextvar-preserving
_run_in_executor_with_context helper, bounded by asyncio.wait_for with a
named _RESET_CLEANUP_TIMEOUT_S (30s). The loop is never blocked; on timeout
the reset proceeds and the worker thread is left to finish on its own (it
cannot be cancelled). The text /new path is unaffected (already off-loop).

Tests (tests/gateway/test_35994_reset_button_deadlock.py): the loop keeps
ticking while close() blocks in its worker thread; a cleanup that raises is
swallowed (warning logged) and the reset still rotates the session; a
cleanup that times out degrades gracefully. All three are mutation-verified
to fail without their respective production branch.
2026-06-25 00:27:22 +05:30
kshitij
ae20c3fb90
Merge pull request #51025 from NousResearch/salvage/cron-autoreset-override
fix(gateway): consume was_auto_reset so /model survives session auto-reset (#48031)
2026-06-24 19:20:11 +05:30
x7peeps
6879d77d74 fix(gateway): consume was_auto_reset so /model survives session auto-reset
When `/model X` is the FIRST message after an idle/daily/suspended auto-reset,
the slash-command path stores a session model override but leaves
`session_entry.was_auto_reset = True` (it never passes through
`_handle_message_with_agent`, which is where the flag was consumed). On the
NEXT regular message, the auto-reset cleanup block pops the freshly-stored
model/reasoning override BEFORE the flag is consumed — so the switch is
silently lost and resolution falls back to the config default, while the
session DB still shows the switched model (a two-sources-of-truth divergence).

Consume the flag at both sites:
  1. gateway/run.py — capture `was_auto_reset` into a local and set the
     attribute False immediately at the top of the cleanup block, so the
     cleanup can't re-fire on a later message and wipe an override stored
     between turns. Downstream reads use the captured local.
  2. gateway/slash_commands.py — the model path consumes the flag before
     storing the override, so a /model-first-after-auto-reset isn't wiped by
     the next message's cleanup.

Salvaged from #48062 by x7peeps (authorship preserved).

Tests: tests/gateway/test_48031_model_switch_after_auto_reset.py — AST
invariants pinning both consume sites (load-bearing; verified they fail when
either consume is removed). Mirrors the AST-pin approach in
test_35809_auto_reset_clean_context.py. Gateway session/reset suite: 16 passed.

Fixes #48031
2026-06-24 19:12:44 +05:30
kshitij
d68a133458
Merge pull request #51890 from NousResearch/salvage/40695-handoff-watcher-async
fix(gateway): offload handoff-watcher SQLite calls to avoid blocking the async heartbeat (#40695)
2026-06-24 19:10:52 +05:30
liuhao1024
721cf54fb1 fix(gateway): offload /model provider-listing off the event loop (#41289)
The Discord/Telegram /model slash command listed providers synchronously
on the gateway's async event loop. list_picker_providers /
list_authenticated_providers are blocking and can fall through to a
synchronous urllib HTTP fetch when the on-disk provider cache is stale,
freezing the loop for 120-150s -> "application did not respond" and
delayed agent starts.

Port #41304's asyncio.to_thread offload to the current handler location.
The handler moved from gateway/run.py to gateway/slash_commands.py
(_handle_model_command); wrap BOTH blocking call sites so the whole bug
class is covered:

  - picker path        -> list_picker_providers
  - text-fallback path -> list_authenticated_providers

asyncio.to_thread is already idiomatic in this module (and asyncio is
imported), so the loop now stays responsive while the (possibly
network-bound) listing runs on a worker thread.

Adds tests/gateway/test_model_command_async_offload.py asserting the
offload contract at the real handler seam for both paths (mutation-
survivable: reverting either to_thread wrap fails the matching test).

Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
2026-06-24 18:40:52 +05:30
r266-tech
f0c5d812b0 fix(gateway): offload handoff watcher SessionDB polling off the event loop
The Discord gateway heartbeat stalled ('Shard ID None heartbeat blocked
for more than N seconds') because _handoff_watcher polled the synchronous,
blocking SQLite-backed SessionDB directly on the asyncio event loop every
2s. Each list_pending/claim/complete/fail call performed blocking disk I/O
on the loop thread, starving the Discord heartbeat coroutine.

Wrap every blocking SessionDB call inside the watcher loop in
asyncio.to_thread(...) so the SQLite work runs on a worker thread and the
event loop (and heartbeat) stays responsive. These four call sites are the
only synchronous self._session_db.* calls inside the watcher loop body.

Adds tests/gateway/test_handoff_watcher_async_db.py asserting the watcher
offloads its SessionDB calls via asyncio.to_thread (mutation-survivable:
reverting any to_thread wrap fails the corresponding assertion).

Fixes #40695

Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
2026-06-24 18:40:23 +05:30