Replies on WhatsApp Cloud arrived at the agent with reply_to_id set but
reply_to_text=None, so run.py never injected the "[Replying to: ...]"
disambiguation prefix (it gates on reply_to_text). Meta's webhook context
object carries only the quoted message's id, never its text.
Index (chat_id, wamid) -> text in rich_sent_store on every inbound message
and every outbound text send -- the same store that solved the identical
Telegram rich-send problem -- then look up the quoted text in
_build_message_event_from_cloud and populate reply_to_text plus
reply_to_is_own_message, derived from context.from versus the business
number.
Tasks 2.1 + 2.2 + 2.3 of the safe-shutdown plan — the reversible
quiesce-without-restart machinery NAS drives during a lifecycle action (D4a).
These ship together because the endpoint, the control channel, and the gateway
state machine are one coherent slice.
2.2 — control channel (gateway/drain_control.py, new):
The dashboard has no HTTP path into a running gateway (guardrails: "there is NO
external control channel into a running gateway"); restart/drain is driven only
by markers the gateway reacts to. So begin/cancel-drain writes/removes a
presence-based marker .drain_request.json (HERMES_HOME-scoped, atomic write,
never-raises read; a corrupt marker reads as present-contentless → fail-safe
toward quiescing). This is Q-B option A.
2.2 — gateway state machine (gateway/run.py):
- _external_drain_active flag, DISTINCT from the shutdown _draining flag: this
one does NOT exit the process and is fully reversible.
- _enter_external_drain / _exit_external_drain: idempotent transitions that
flip gateway_state→draining / →running via _update_runtime_status (preserving
the live active_agents count). exit refuses to revert to running during a
real shutdown or after the loop stops (shutdown wins).
- _drain_control_watcher: 1s background task (modelled on _handoff_watcher)
reconciling accept-state with the marker; honours a marker that survived a
restart on its first tick. Registered alongside the other watchers in start.
- New-turn accept gate in _handle_message, placed BEFORE the session-slot
claim: when draining, refuse to START a new turn (so active_agents can only
fall → no TOCTOU race), while in-flight turns finish untouched. Internal/
system events (restart-recovery replays, bg-process completions) bypass it.
2.1 — endpoint (hermes_cli/web_server.py):
POST /api/gateway/drain {action: drain|cancel}. Authenticated by the Task-2.0a
token seam (the drain plugin registered this exact path as a token route);
attributes the request to the verified token principal. Begin writes the
marker, cancel removes it — the gateway process owns the actual transition.
Force-override (D6) is NOT here; it maps onto the existing immediate
/api/gateway/restart force path.
Tests (mocked — necessary-not-sufficient; the HARD live gate Q-B is next):
- tests/gateway/test_external_drain_control.py — marker contract (write/clear/
read/corrupt/atomic), state machine (enter/exit/idempotency/shutdown-wins/
loop-stopped), watcher reconcile-enter-then-exit, new-turn refusal, and
in-flight-not-interrupted. 15 tests.
- tests/hermes_cli/test_web_server.py — /api/gateway/drain begin/default-begin/
cancel/cancel-idempotent/bad-action-400. 6 tests.
- dashboard.drain_auth config section already added in 2.0b commit.
All touched suites green: 301 (gateway+auth) + 9 (web_server endpoints) passed.
Intentionally deferred:
- HARD live-validation gate (Q-B): real isolated `hermes gateway run`, drive a
real begin-drain marker, prove the 5-point checklist a–e.
- Spec-doc status flip + Phase-2 PR.
Build status: external-drain, restart-drain, status, dashboard-auth, drain-plugin,
token-auth, and web_server-endpoint suites green.
Cut over the agent half of Shape A (D-Q1.5a/b.1/c) to front a SET of platforms on
one relay WS:
- relay_platform_identities() parses GATEWAY_RELAY_PLATFORMS (list) +
GATEWAY_RELAY_BOT_IDS (JSON keyed map {platform:{botId,username?}}). Cut over
from the scalar GATEWAY_RELAY_PLATFORM/_BOT_ID (no fallback, D-Q1.5c).
- self_provision_relay() loops one /relay/provision per platform under one
gatewayId+secret, partial-failure-tolerant.
- WebSocketRelayTransport takes the identity SET, sends one hello per identity
(connector accumulates the advertised set), and stamps the per-frame
OutboundFrame.platform + its matching advertised botId on outbound.
- RelayAdapter remembers each chat's underlying source.platform (mirroring the
existing guild/dm scope capture) and tags the reply's egress platform.
- send_relay_policy() declares one relevance policy per fronted platform (the
connector keys policy by (tenant,platform,instanceId)).
Single-platform deploys are byte-identical on the wire (1-element list, no per-frame
tag -> connector session-default fallback). typecheck/ruff clean; relay unit 221 pass
(+10 new); all 15 cross-repo E2E drivers green vs connector origin/main.
The gateway reconnect watcher (gateway/run.py) recovers a platform after a
fatal adapter error by building a fresh adapter and calling
connect(is_reconnect=True). Every BasePlatformAdapter implements
connect(*, is_reconnect: bool = False) for this — except RelayAdapter, whose
connect() was bare. So the watcher's recovery path raised:
TypeError: connect() got an unexpected keyword argument 'is_reconnect'
Observed live on a hosted staging agent: after a fatal relay adapter error the
watcher could never re-establish relay, so the shared-bot inbound never reached
the gateway and Discord DMs stopped (dashboard surfaced the TypeError).
Relay deliberately ignores the flag: the #46621 server-side-queue-preservation
concern doesn't apply, because relay's outage buffer is the connector's durable
buffer (replayed on the transport's re-handshake), not a gateway-side queue the
adapter owns. Routine WS drops are already handled by the transport's own
reconnect supervisor (WebSocketRelayTransport, reconnect=True); the watcher path
is fatal-error recovery, and the fatal handler disconnect()s the old adapter
(cancelling its supervisor) before a fresh adapter+transport is built, so there
is no double-dial.
Adds two regression tests (both proven red without the fix): connect(is_reconnect=True)
reaches the same transport-less RuntimeError instead of TypeError, and the
signature matches BasePlatformAdapter.connect.
After a prolonged outage the in-process network-error ladder escalates to
fatal and GatewayRunner._platform_reconnect_watcher rebuilds a fresh adapter
that reconnects through the bootstrap path. That path called
start_polling(drop_pending_updates=True), discarding every update Telegram
queued during the outage — all messages sent while the bot was down were
silently lost. The in-process ladder and 409-conflict handler already passed
drop_pending_updates=False; only bootstrap did not distinguish a cold first
boot from a reconnect.
Thread an is_reconnect signal from the watcher through
_connect_adapter_with_timeout into adapter.connect(). The base
BasePlatformAdapter.connect() gains a keyword-only is_reconnect=False so every
adapter inherits a tolerant signature (no per-platform breakage when the
runner forwards the kwarg). Telegram translates is_reconnect into
drop_pending_updates=not is_reconnect on both the polling and webhook bootstrap
calls. Cold boot still drops the stale queue; a watcher reconnect preserves it.
Fixes#46621.
Co-authored-by: annguyenNous <annguyen@nousresearch.com>
Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>
Co-authored-by: Kewe63 <Kewe63@users.noreply.github.com>
The email adapter authorized senders entirely off the From: header, which is
attacker-controlled and unauthenticated by IMAP. An attacker could forge
From: an-allowlisted-address and pass both the adapter's EMAIL_ALLOWED_USERS
pre-filter and the gateway's allowlist authz (both key on the same spoofable
sender_addr), getting unauthorized commands executed by the agent.
Verify the From: domain against the trusted Authentication-Results header the
receiving mail server stamps (SPF/DKIM/DMARC) before trusting it for
authorization. Enforced only when an allowlist is in effect and allow-all is
off — fail-closed. Operators whose server does not stamp the header can opt
out via platforms.email.require_authenticated_sender: false (or
EMAIL_TRUST_FROM_HEADER=true).
The scale-to-zero idle watcher never started on a correctly-opted-in,
relay-only instance, so the gateway never ran its idle decision, never called
go_dormant(), and never sent going_idle to the connector. Fly's autostop still
suspended the machine on traffic-idle, but the connector never flipped the
instance to buffered-only — so an inbound DM took the live delivery path,
found no live session for the suspended machine, and was dropped fail-closed
with no wake poke. The machine slept and never woke.
Root cause: _scale_to_zero_should_arm() passed list(config.platforms.keys())
to messaging_is_relay_only_or_absent(). config.platforms is pre-seeded with a
DISABLED placeholder PlatformConfig for every known platform (telegram,
discord, slack, matrix, …), so the key set is always the full ~20-entry
catalog regardless of what the instance actually runs. The relay-only check
discarded "relay", saw the disabled placeholders as live direct-socket
platforms, and returned False — so should_arm() was False and the watcher was
never created. Verified live on a staging instance: config.platforms keys =
[telegram, discord, slack, mattermost, matrix, relay] with only relay
enabled=True; should_arm() = False.
Fix: filter config.platforms to ENABLED entries before the relay-only check,
mirroring the adapter-connect loop which already gates on
`if not platform_config.enabled: continue`. This arms off the same notion of
"active platform" the rest of start() already uses — no parallel concept.
Also add a one-line not-armed diagnostic: when an instance IS opted in (the
HERMES_SCALE_TO_ZERO stamp is set) but the watcher still doesn't arm, log why
(relay_only_or_absent, the enabled platforms, wake_url present/missing). A
non-opted instance stays silent. The arm path previously logged only on
success, so a failed arm was invisible.
Tests: the existing pure-helper tests passed bare names so they never
exercised the call site that feeds the placeholder-laden config. Add
behaviour-contract tests against the REAL _scale_to_zero_should_arm with a
realistic config.platforms (relay enabled + others disabled). The F25
regression test (relay-only + disabled placeholders must arm) and the
no-platform case are RED without this fix, GREEN with it; the
genuinely-enabled-direct-platform / not-opted-in / no-wake-url cases stay
correctly non-arming so the filter can't over-broaden.
Wake mechanism itself verified healthy independently (direct wakeUrl GET
resumed a suspended staging instance in 1.15s, clean resume signature).
The #45966 cross-process coherence guard popped the stale cached agent
and then called the blocking _cleanup_agent_resources (memory-provider
shutdown, tool-resource teardown, async-client teardown) while still
holding _agent_cache_lock, on the gateway event-loop thread. While that
ran, _sweep_idle_cached_agents (driven by _session_expiry_watcher)
blocked acquiring the same lock and the asyncio loop stalled for minutes,
tripping repeated Discord 'heartbeat blocked' warnings.
Fix mirrors the cap-enforcer / idle-sweep paths: pop the stale entry
under the lock, release it, then schedule the SOFT release on a daemon
thread. The soft path (_release_evicted_agent_soft) is also more correct
here than the hard teardown the regression used — the same session
rebuilds a fresh agent immediately after invalidation, so its terminal
sandbox / browser / bg processes (keyed on task_id) must be preserved
for the rebuilt agent to inherit, not torn down.
Verified the cross-process site was the only cleanup-under-lock instance;
the other _cleanup_agent_resources call sites run outside the lock.
CI shard test_telegram_conflict.py timed out (140s) because the new
_polling_heartbeat_loop, started by connect(), busy-spun under those
tests: they monkeypatch asyncio.sleep to instant and pass a bot double
with no get_me(), so the probe raised AttributeError (swallowed) and the
loop re-entered immediately with no real pacing, starving the event loop.
Guard the loop to return when bot.get_me is not callable — a real PTB Bot
always exposes it, so this only triggers on a torn-down app or a test
double, where there is nothing to probe. Also cancel the heartbeat task in
the conflict tests that call connect() without disconnect(), matching the
production disconnect() teardown.
Verified: test_telegram_conflict.py now runs in ~4.5s; the 22
heartbeat/reconnect tests still pass; E2E confirms a hanging get_me still
fires the reconnect ladder while a missing get_me exits without spinning.
When a Telegram long-poll TCP socket enters CLOSE-WAIT (remote sent FIN
but httpx hasn't noticed), epoll still reports it readable so no
exception is raised. PTB's error_callback never fires, the reconnect
ladder never engages, and the gateway silently stops receiving messages
while the process stays alive — until a manual systemctl restart.
The existing recovery only covers two cases: error_callback-driven
reconnects (which require an exception PTB never gets) and a one-shot
_verify_polling_after_reconnect probe (which runs only right after an
explicit reconnect). A socket that wedges during steady-state operation
is never detected.
Add _polling_heartbeat_loop: a background asyncio.Task started in
connect() (polling mode only) that probes get_me() every 90s on the
general request pool (not the getUpdates pool, so healthy long-polls are
never interrupted). On asyncio.TimeoutError/OSError it hands off to the
existing _handle_polling_network_error ladder; other errors are
swallowed. disconnect() cancels and awaits the task. Worst-case
detection window ~105s.
Complementary to #51541 (general-pool keepalive limits / fd leak) — that
recycles idle pooled connections; this detects a wedged active read.
Fixes#48495
Co-authored-by: agt-user <267614622+agt-user@users.noreply.github.com>
Assert bare tables upgrade to sendRichMessage under default/opt-out config,
DM-topic resumed sends without reply anchors, and rich finalize edits carry
forum topic routing metadata.
These 7 test sites assert rotation behavior (fork, child sessions, lock
contention, logging session-context follows id rotation, boundary hooks fire
on rotation). Pin each builder to in_place=False explicitly so they keep
exercising the retained rotation fallback regardless of the global default
(flipped to True in #38763). Rotation stays a working opt-out fallback and
deserves continued coverage — these are NOT deleted.
Pinned sites:
- test_compression_concurrent_fork._build_agent_with_db
- test_compression_logging_session_context._build_agent_with_db
- test_compression_rotation_state._build_agent_with_db
- test_compression_boundary_hook._make_agent (2 helpers: CompressionBoundaryHook + SessionCompressEvent)
- test_compression_concurrent_sessions._build_agent_with_db
Salvage of #50098 by @srojk34, cherry-picked onto current main.
The hygiene auto-compress guard and the /compress slash command both read
compression_in_place (config flag — is in-place mode enabled?) instead of
_last_compaction_in_place (result flag — did in-place compaction actually
succeed?). Both agents are built without a session_db, so archive_and_compact
always fails silently and _last_compaction_in_place stays False. Reading the
config flag makes the guard think in-place succeeded, triggering
rewrite_transcript() which replaces the original messages with only the
compressed summary — permanent data loss.
Co-authored-by: srojk34 <srojk34@users.noreply.github.com>
- Replace getattr(self.session_store, '_db', None) with self._session_db
(the GatewayRunner's own SessionDB, consistent with existing usage in
slash_commands.py L240/L499).
- Remove verbose comment referencing a branch name as an issue number.
- Update stale comment in run.py that said 'today it has no session_db'.
- Add regression test verifying session_db is passed and rotated session
is persisted (adapted from #51624 by @LeonSGP43).
- Add _session_db=None to _make_runner fixtures in test_compress_command,
test_compress_focus, and test_compress_plugin_engine.
When the gateway persists a user message after a transient provider
failure (429/timeout/auth error), subsequent retries of the same
Telegram message could stack duplicate user turns in the transcript,
causing the agent to fall behind by 1-2 messages.
Add has_platform_message_id() to SessionDB (using the existing
idx_messages_platform_msg_id partial index) and a SessionStore wrapper.
The gateway's transient-failure path checks this before
append_to_transcript -- if the platform_message_id is already
persisted, the duplicate write is skipped.
Salvaged from #47869 by @davidgut1982. Adapted to current main which
has additional append sites and an existing content-based dedupe in
the exception handler path.
Closes#47237
Block scale-to-zero suspend while background async delegations are active, and restore runtime status to running on real inbound after a dormant wake.\n\nAdd regression coverage for both review findings.
* fix(relay): authorize relay-delivered events by delivery, not source.platform
The #52190 upstream-authz fix keyed _is_user_authorized off
source.platform via _adapter_authorization_is_upstream(source.platform).
But a relay *message* inbound carries the UNDERLYING platform
(source.platform == discord/telegram/...), NOT Platform.RELAY, because
ws_transport._event_from_wire maps the connector's wire payload
(platform="discord") straight onto SessionSource for session-keying and
egress. The relay adapter is registered only under Platform.RELAY, so
adapters.get(Platform.DISCORD) misses, the trusted-upstream branch is
skipped, and the user hits the env-allowlist default-deny:
WARNING gateway.run: Unauthorized user: <id> (<name>) on discord
(Live staging bug: alpha tester linked successfully, then every
follow-up DM was silently dropped.)
Fix: the authentic trust signal is that the event was delivered over the
per-instance-authenticated relay WS, not which platform it underlies. Add
a wire-INVISIBLE SessionSource.delivered_via_upstream_relay flag, stamped
by the relay transport in _event_from_wire, and authorize on it. The flag
is excluded from to_dict/from_dict so a peer can neither forge it across
the wire nor have it restored from persistence. The existing adapter-flag
check is retained for events whose source.platform IS Platform.RELAY
(interaction-passthrough). A direct Discord event on a multiplexing
gateway (direct + relay adapters) is unmarked and still default-denies.
* fix(relay): use identity check on delivery marker to avoid MagicMock fail-open
A MagicMock() source (used by test_signal.py and other gateway tests) auto-
vivifies source.delivered_via_upstream_relay as a truthy Mock, which a bare
truthiness check would treat as authorized — flipping
test_signal_in_allowlist_maps from False to True. The marker is a real bool on
SessionSource, so check 'is True' explicitly: refuses to authorize any non-bool
stand-in, defensive against accidental fail-open.
A DM reply carries no guild_id, so the connector's egress guard cannot
resolve the owning tenant from metadata.guild_id and declines the send
with "discord egress declined: target not routed to an onboarded tenant"
— the bug behind "the bot never replies in DMs". Guild replies are
unaffected (they carry guild_id), which is why the guild path worked
end-to-end while DMs looked broken.
The connector now resolves a DM reply's tenant from the recipient's
author binding (gateway-gateway #67, resolveByUser keyed on
metadata.user_id) — the outbound counterpart to inbound Phase 7a
author-first resolution. But it needs the recipient user_id ON the
outbound action, and the adapter only re-attached guild_id
(_capture_scope/_with_scope), no-op for DMs (the docstring even said so).
This extends the adapter's inbound-scope capture: for a DM (no guild_id)
remember chat_id -> the authentic author user_id we observed, and
re-attach it as metadata.user_id on outbound. Guild capture is unchanged
and wins when present; user_id is the DM-only fallback. The id is the one
the connector observed inbound (never gateway-asserted), so the trust
invariant holds.
+4 unit tests (DM reply re-attaches user_id + no guild_id; unknown chat
invents nothing; explicit user_id preserved; guild reply never carries
user_id). Proved load-bearing (reverting the re-attach fails the DM
test). 144 relay tests pass, ruff clean.
Pairs with gateway-gateway #67 (the connector-side resolver). Together
they close the DM-reply egress gap end-to-end.
The gateway-side BEHAVIOUR layer that consumes the relay scale-to-zero
primitives (gateway-gateway Phase 5): the gateway decides it is idle and
drives the relay transport dormant so the platform (Fly autostop:"suspend")
can suspend the now-traffic-idle machine, which wakes on the connector's
wakeUrl poke (decisions.md Q3=C', D1-D13).
- gateway/scale_to_zero.py: pure helpers — scale_to_zero_enabled (the NAS
Labs HERMES_SCALE_TO_ZERO stamp, D11/Q8=A), parse_idle_timeout_seconds
(config.yaml gateway.scale_to_zero.idle_timeout_minutes, D2),
messaging_is_relay_only_or_absent (F6/D1), should_arm (D1/D11/§3.4(1)),
is_idle (D2/D3/F7).
- gateway/run.py: _last_inbound_at clock stamped on user inbound in
_handle_message (F13); the arm-gate + idle predicate + the
_scale_to_zero_watcher dormant sequence (mark draining -> adapter
go_dormant() -> cooldown), started only when armed. Deliberately NOT the
stop path and NOT mark_resume_pending (F12/D13).
- tools/process_registry.py: has_any_active() for the bg-work guard (D3/F7).
- hermes_cli/config.py: gateway.scale_to_zero.idle_timeout_minutes default 5.
Tests: 38 pure-logic + 6 watcher (incl. bg-work regression guard proven RED).
Full relay + scale-to-zero suites: 184 passed. The 20 unrelated failures in
the broader run are PRE-EXISTING on origin/main (custom-provider/tools tests),
confirmed via a pristine baseline worktree.
Net-new WebSocketRelayTransport.go_dormant() + RelayAdapter.go_dormant() —
the third transport mode the scale-to-zero behaviour layer needs, distinct
from both disconnect() and an unexpected close (decisions.md D12/F14):
- disconnect() sets _closing=True and CANCELS the reconnect supervisor
(terminal "shutting down for good") -> a suspended machine never re-dials
on wake, stranding its buffered backlog.
- an unexpected close re-dials IMMEDIATELY -> the socket never stays down,
so the platform proxy never suspends the machine.
go_dormant(): going_idle->ack (reuse go_idle), then close the socket WITHOUT
setting _closing, so the reader's fall-through still arms the reconnect
supervisor (wake path stays live) but on the longer _dormant_redial_s
cadence so it doesn't fight the platform suspend window. A successful re-dial
clears _dormant. Honors the §3.4 wake->reconnect->drain contract.
Tests: 6 new in test_relay_going_idle.py incl. the F14 regression guard
(routing dormancy through disconnect() fails exactly the 4 wake-path tests).
Full relay suite 140 passed.
PR #52151 hardened the runtime-status liveness check to trust a readable
live process command line over stale gateway_state.json argv, so a recycled
PID now owned by an s6 supervisor no longer counts as a running gateway.
That fix is correct but incomplete for the reported symptom: the web
dashboard showed a named profile's gateway green while
`hermes -p <name> gateway status` showed it stopped. Two further issues:
1. Cross-profile PID reuse. In per-profile Docker supervision, one profile's
stale `gateway_state.json` can record a PID the OS later recycled onto a
DIFFERENT profile's live gateway. That PID's command line still
`looks_like_gateway`, so the dead profile was reported running. The
recorded argv has its `-p <name>` selector stripped in-process by
`_apply_profile_override`, so it cannot disambiguate; the live `/proc`
cmdline still carries it. `get_runtime_status_running_pid` now accepts an
`expected_home` and validates the live command line belongs to THAT
profile (mirroring `hermes_cli.gateway._matches_current_profile`, the
logic the CLI scan path already uses — which is why the CLI was correct).
`_check_gateway_running` passes the enumerated profile dir.
2. The existing regression test `test_gateway_running_check_falls_back_to_
runtime_state` used the live pytest PID with a gateway-shaped record; once
the live cmdline became authoritative it no longer looked like a gateway.
Updated to mock the live cmdline to the real separate-process scenario it
describes.
The active-profile path (`get_running_pid`) is intentionally left unscoped:
it is lock-verified and any live gateway cmdline is acceptable there. Multiplex
mode is unaffected — `running` state is only ever written to a gateway's own
home, never a secondary served profile's.
Adds coverage for: cross-profile PID reuse (named + default), matching
profile cmdline (`-p`, `--profile`, explicit HERMES_HOME=), the bare default
gateway, and the unreadable-cmdline cross-platform fallback. Each new
cross-profile assertion fails without the profile scope and passes with it.
Co-authored-by: helix4u <4317663+helix4u@users.noreply.github.com>
A hosted instance fronted by the Team Gateway connector dropped EVERY relay
message as "Unauthorized user" and the agent never replied — despite the
message routing correctly through the connector to the instance.
Root cause: gateway authorization (_is_user_authorized) had no notion of
upstream-enforced authz. Platform.RELAY matches no {PLATFORM}_ALLOWED_USERS
allowlist and isn't in the HA/WEBHOOK always-authorized set, so a relay user
with no env allowlist configured hit the default-deny ("No user allowlists
configured. All unauthorized users will be denied."). The message was received,
then silently denied before reaching the agent.
This is incorrect for relay: the connector authenticates the gateway's WS with
a per-instance secret and performs owner-only author-binding resolution BEFORE
delivering. A message only reaches this gateway because the connector resolved
it to THIS instance's bound user (user_instance_binding), keyed on the author id
the connector OBSERVED off the event — never a gateway claim. The authorization
decision is already made by a trusted, authenticated upstream; there is no local
RELAY_ALLOWED_USERS allowlist to consult, and default-denying for its absence is
the bug.
Fix: add a generic BasePlatformAdapter.authorization_is_upstream capability
(default False) that the relay adapter overrides to True, plus a dedicated
trusted branch in _is_user_authorized that honors it. This is delegation to a
trusted upstream, NOT a fail-open: it fires only for an adapter that explicitly
declares the flag; every direct network-exposed adapter leaves it False and the
env-allowlist default-deny (SECURITY.md §2.6) is unchanged. Distinct from
enforces_own_access_policy, which mirrors a LOCAL config-driven allowlist —
this delegates to an authenticated upstream's decision.
Tests: behavior contract that the base defaults False, the relay adapter
declares True, a relay user (group + DM) is authorized with no env allowlist,
and crucially a non-upstream adapter with no allowlist still default-denies
(guards against the fix becoming a blanket fail-open). 6 new tests; relay +
authz + config-policy suites green (134 + 90).
Found via live staging debug of the Discord self-serve onboarding flow.
When Telegram's sendRichMessage returns a FloodWait/RetryAfter error,
_try_send_rich() now extracts the server-provided retry_after value and
propagates it through SendResult.retry_after. The base _send_with_retry()
layer honors this value instead of using its default short exponential
backoff (~2s, ~4s), preventing the retry budget from being exhausted
against a server that demands a 25-37s wait.
Salvaged from #46774 by @liuhao1024. Telegram adapter path moved from
gateway/platforms/telegram.py to plugins/platforms/telegram/adapter.py
since the original PR.
Closes#46762
Corrupted sessions.json entries (e.g. a bare bool where a dict is
expected) caused TypeError on 'origin' in data' which escaped the
(ValueError, KeyError) inner except and aborted loading ALL remaining
sessions, not just the corrupted one.
Two-layer fix:
- Loop level: isinstance(entry_data, dict) guard before from_dict
- from_dict: isinstance(data['origin'], dict) instead of bare truthiness
- Added TypeError to the inner except as defense-in-depth
Closes#46994
Add compression.minimum_context_floor config key that allows users
to lower the compression threshold floor below the hardcoded 64K
default, preventing infinite tool-call loops on models whose
structured output degrades well before 64K tokens.
- agent/model_metadata.py: add get_configurable_minimum_context()
helper with 16K hard safety limit
- agent/context_compressor.py: accept minimum_context_floor param,
thread it through _compute_threshold_tokens
- agent/conversation_compression.py: use compressor's floor for
aux model context validation
- agent/agent_init.py: read compression.minimum_context_floor from
config and pass to ContextCompressor
- gateway/run.py: cache-busting includes new key
Salvaged from #31686 by @Tranquil-Flow onto current main.
Resolves conflicts with in-place compaction (#38763) and max_tokens
threshold computation (#43547) that landed after the original PR.
Closes#31600
The /new (and /reset) confirmation-button callback runs the slash-confirm
handler on the asyncio event loop (see _request_slash_confirm). That handler
calls _handle_reset_command, which invoked the SYNCHRONOUS, potentially
long-blocking _cleanup_agent_resources inline: agent.close() tears down
terminal sandboxes, browser daemons and background processes (subprocess
waits), and shutdown_memory_provider() can make a network call. A slow
teardown wedged the entire event loop, so the bot went silent and stopped
processing all messages until a manual restart.
Offload _cleanup_agent_resources via the existing contextvar-preserving
_run_in_executor_with_context helper, bounded by asyncio.wait_for with a
named _RESET_CLEANUP_TIMEOUT_S (30s). The loop is never blocked; on timeout
the reset proceeds and the worker thread is left to finish on its own (it
cannot be cancelled). The text /new path is unaffected (already off-loop).
Tests (tests/gateway/test_35994_reset_button_deadlock.py): the loop keeps
ticking while close() blocks in its worker thread; a cleanup that raises is
swallowed (warning logged) and the reset still rotates the session; a
cleanup that times out degrades gracefully. All three are mutation-verified
to fail without their respective production branch.
When `/model X` is the FIRST message after an idle/daily/suspended auto-reset,
the slash-command path stores a session model override but leaves
`session_entry.was_auto_reset = True` (it never passes through
`_handle_message_with_agent`, which is where the flag was consumed). On the
NEXT regular message, the auto-reset cleanup block pops the freshly-stored
model/reasoning override BEFORE the flag is consumed — so the switch is
silently lost and resolution falls back to the config default, while the
session DB still shows the switched model (a two-sources-of-truth divergence).
Consume the flag at both sites:
1. gateway/run.py — capture `was_auto_reset` into a local and set the
attribute False immediately at the top of the cleanup block, so the
cleanup can't re-fire on a later message and wipe an override stored
between turns. Downstream reads use the captured local.
2. gateway/slash_commands.py — the model path consumes the flag before
storing the override, so a /model-first-after-auto-reset isn't wiped by
the next message's cleanup.
Salvaged from #48062 by x7peeps (authorship preserved).
Tests: tests/gateway/test_48031_model_switch_after_auto_reset.py — AST
invariants pinning both consume sites (load-bearing; verified they fail when
either consume is removed). Mirrors the AST-pin approach in
test_35809_auto_reset_clean_context.py. Gateway session/reset suite: 16 passed.
Fixes#48031
The Discord/Telegram /model slash command listed providers synchronously
on the gateway's async event loop. list_picker_providers /
list_authenticated_providers are blocking and can fall through to a
synchronous urllib HTTP fetch when the on-disk provider cache is stale,
freezing the loop for 120-150s -> "application did not respond" and
delayed agent starts.
Port #41304's asyncio.to_thread offload to the current handler location.
The handler moved from gateway/run.py to gateway/slash_commands.py
(_handle_model_command); wrap BOTH blocking call sites so the whole bug
class is covered:
- picker path -> list_picker_providers
- text-fallback path -> list_authenticated_providers
asyncio.to_thread is already idiomatic in this module (and asyncio is
imported), so the loop now stays responsive while the (possibly
network-bound) listing runs on a worker thread.
Adds tests/gateway/test_model_command_async_offload.py asserting the
offload contract at the real handler seam for both paths (mutation-
survivable: reverting either to_thread wrap fails the matching test).
Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
The Discord gateway heartbeat stalled ('Shard ID None heartbeat blocked
for more than N seconds') because _handoff_watcher polled the synchronous,
blocking SQLite-backed SessionDB directly on the asyncio event loop every
2s. Each list_pending/claim/complete/fail call performed blocking disk I/O
on the loop thread, starving the Discord heartbeat coroutine.
Wrap every blocking SessionDB call inside the watcher loop in
asyncio.to_thread(...) so the SQLite work runs on a worker thread and the
event loop (and heartbeat) stays responsive. These four call sites are the
only synchronous self._session_db.* calls inside the watcher loop body.
Adds tests/gateway/test_handoff_watcher_async_db.py asserting the watcher
offloads its SessionDB calls via asyncio.to_thread (mutation-survivable:
reverting any to_thread wrap fails the corresponding assertion).
Fixes#40695
Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
Phase 7 Unit 7d-B. When an operator opts an instance OUT of the Team Gateway
relay (Unit 7b deprovision), the connector revokes the per-gateway secret and
closes the gateway's WS with 4401. The reconnect supervisor previously treated
EVERY close as retryable, so the live process spun "retrying 4401" forever and
the dashboard showed a red error — opt-out looked like a failure.
Now a 4401 close that arrives AFTER a successful handshake is recognized as a
terminal credential revocation:
- ws_transport.py: track `_handshake_succeeded` (set when a descriptor is
received); on a 4401 close after a prior success, latch `auth_revoked` and do
NOT spawn the reconnect supervisor. A 4401 BEFORE any successful handshake
stays retryable (cold-start / not-yet-provisioned race, not a revocation).
New `auth_revoked` property + a websockets-version-safe close-code reader
(prefers `.rcvd`/`.sent` Close frames; `.code` is deprecated in websockets 13+).
- adapter.py: a revocation monitor turns `transport.auth_revoked` into a clean,
NON-retryable `relay_disabled` fatal and notifies the gateway's fatal-error
handler (so the adapter is removed and NOT queued for reconnection — the
credential is dead until the instance is recreated). Monitor is cancelled on
disconnect; only started when the transport exposes `auth_revoked` (prod WS).
- run.py: `_handle_adapter_fatal_error` maps the `relay_disabled` code to a
`disabled` platform_state (not `fatal`/`retrying`).
- web: PlatformsCard renders the `disabled` state with a neutral outline badge,
a PowerOff icon, and muted (not destructive-red) text + message. New optional
`status.disabled` i18n string ("Disabled").
Also bundles the Phase 7 contract-doc update (this doc is authoritative in
hermes-agent): docs/relay-connector-contract.md gains an "Author-first
resolution + the account-link (DM) path" section documenting the
multi-tenant-guild rule (D-7.2 — route by authenticated author binding, never by
guild; unlinked → fail-closed), the `/link <code>` DM flow, and the
connector-authoritative opt-out + terminal-4401 behavior this PR implements.
Tests: +2 ws_transport (4401-after-handshake terminal / no-reconnect;
4401-before-handshake stays retryable) and +2 adapter (revocation → non-retryable
relay_disabled fatal + handler fired; no-revocation → no fatal). 138 relay tests
pass (incl. the contract-doc conformance test); ruff clean; web tsc clean.
Phase 7 Unit 7d-B (relay-adapter solo lane). Q17 → Option 2; Option 3 (live
de-register, no recreate) + the restart-re-provision hole deferred post-alpha.
On a launchd-managed gateway (macOS), /restart stopped the gateway but
never relaunched it: the handler's service detection checks only
INVOCATION_ID (systemd) and container markers, so under launchd it takes
the detached path and exits 0 — which KeepAlive.SuccessfulExit=false
treats as a deliberate stop. The gateway stays silently dead until a
manual launchctl kickstart.
Detect launchd via XPC_SERVICE_NAME, which launchd sets to the job label
for processes it spawns. The probe deliberately excludes the literal
"0": interactive macOS shells inherit XPC_SERVICE_NAME=0 (a truthy
string), and routing an unsupervised interactive gateway to the service
path would make it exit non-zero with nothing to revive it.
Routing through via_service=True (rather than forcing a non-zero exit
on the detached path) matters: the detached path also spawns a helper
that relaunches the gateway, so exiting non-zero there would have BOTH
the helper and launchd respawn it — two gateways racing for the same
bot tokens. The service path spawns no helper; launchd is the single
respawner.
Fixes#43475. Supersedes the run.py-era probes in #19940/#33393 (the
handler has since moved to gateway/slash_commands.py) and avoids the
double-spawn risk in the exit-code-site approaches (#43498, #43596).
Most Matrix clients auto-set a room name when creating a DM (e.g.
"Alice & Bot" from participant display names), so the old
`is_direct and not has_explicit_name` heuristic classified virtually
all client-created DM rooms as "room", forcing require_mention gating
in legitimate one-on-one DMs.
member_count is now the primary DM signal: <=2 members means the room
is necessarily a 1:1 conversation, regardless of m.direct or an explicit
name. A room that grew to 3+ members but is still in stale m.direct is
still classified as a room (conflict flag set). Falls back to the
m.direct + name heuristic when the count is unavailable.
Also hardens _get_room_member_count with a joined_members API fallback
when the cache-backed state_store is empty.
Salvaged from #48554 by @justemu onto the current plugin adapter path
(gateway/platforms/matrix.py -> plugins/platforms/matrix/adapter.py).
Fixes#48551
Users who inspect ~/.hermes/sessions/sessions.json see only gateway entries
(e.g. agent:main:whatsapp:dm:...) and mistake it for the session index that
hermes sessions list / /sessions read — which is actually state.db. Issue
#49361 reported CLI sessions as 'invisible' on this premise.
- gateway/session.py: write a self-documenting _README sentinel at the top of
sessions.json explaining it's the gateway routing index and that ALL sessions
(CLI/TUI/gateway) live in state.db; skip _-prefixed keys on load so the
sentinel never round-trips into a SessionEntry.
- Harden every sessions.json reader against the sentinel: mcp_serve loader,
gateway/mirror.py, gateway/channel_directory.py all skip _-prefixed keys.
- docs/user-guide/sessions.md: warning callout naming the exact symptom.
- tests: assert prune ignores metadata sentinels; add round-trip coverage.
Component button interactions (approve/deny, slash confirm, model
picker, clarify) were not checking the pairing store for authorization.
Users approved via `hermes pairing approve` could send messages and use
slash commands (which go through the gateway authz_mixin), but button
clicks were rejected because `_component_check_auth` only checked
env-var allowlists (DISCORD_ALLOWED_USERS, GATEWAY_ALLOW_ALL_USERS,
etc.) and not the pairing store.
This was a regression from commit f6f363662 which intentionally made
component auth fail-closed when no allowlist is set (security fix for
GHSA-mc26-p6fw-7pp6), but did not account for pairing-based auth.
Fix: add a `PairingStore.is_approved("discord", uid)` check to
`_component_check_auth`, mirroring `authz_mixin._check_authorization`.
The pairing store check runs after all allowlist checks, preserving the
fail-closed behavior for non-paired, non-allowed users.
Fixes#50627
uperLu's #50958 renamed plugins/cron → plugins/cron_providers but left
two test files patching the now-gone plugins.cron.chronos.verify path,
which would fail collection. Point them at plugins.cron_providers.*.
Add uperLu to release.py AUTHOR_MAP.
The contributor PR stamped runner._exit_code=78 on non-retryable startup
errors, but start_gateway()'s clean-exit branch returned True before the
SystemExit(runner.exit_code) site, so main() exited 0. The s6 finish
script's [ "$1" = "78" ] check never matched and s6 crash-looped the
gateway anyway — the fix was dead as shipped (#51228).
Honor runner.exit_code in the clean-exit branch: raise SystemExit(code)
when set, else return True (normal /restart clean exit). Add a
start_gateway()-level test that asserts process-level SystemExit(78)
propagation — the gap the PR's object-level test missed — plus exit_code
on the existing _CleanExitRunner mocks.
Profiles without their own messaging token inherit the default
profile's token via os.getenv, hit a token collision, and exit with
startup_failed. s6 restarts them immediately, creating ~30MB tirith
sandbox dirs in /tmp each cycle — filling the disk in hours (#51228).
Changes:
- gateway/restart.py: add GATEWAY_FATAL_CONFIG_EXIT_CODE = 78
- gateway/run.py: set exit_code=78 on non-retryable startup errors
(token collision, no platforms)
- hermes_cli/service_manager.py: add _render_finish_script() that
translates exit 78 → exit 125 (s6 permanent failure)
- hermes_cli/container_boot.py: write finish script alongside run
script during profile registration
The s6 finish script pattern follows docker/s6-rc.d/dashboard/finish.
Closes#51228
Register a per-instance wakeUrl and forward it to the connector at
self-provision so a suspended gateway can be poked awake when buffered
work arrives (pairs with the connector-side WakePoker).
- relay_wake_url() resolver (env GATEWAY_RELAY_WAKE_URL, then
gateway.relay_wake_url in config.yaml), mirroring relay_instance_id()
- thread wake_url through _post_provision (adds wakeUrl to the body only
when set) + self_provision_relay (resolve, forward, log)
- hermes gateway enroll --wake-url <url> persists GATEWAY_RELAY_WAKE_URL
- document the §5.2 wake poke in relay-connector-contract.md §3.3
- tests: relay_wake_url resolution (env/config/absent), provision
forwarding, body-only-when-set (6 new; 130 relay tests pass)
The actual reconnect+drain on wake is Unit B's loop; this unit only
wires the wake SIGNAL. Opt-in: absent wakeUrl => connector never pokes.
The gateway half of the going-idle/buffered-flip primitive (scale-to-zero
PRIMITIVE, not the behaviour). Integrates with the EXISTING drain transition:
- ws_transport: `go_idle()` sends `going_idle` + awaits the connector's
`going_idle_ack` (connector-authoritative flip-then-ack, Q-5.3c — stays
serving until the ack so nothing is lost in the flip window); acks a buffered
inbound (bufferId present) via `inbound_ack` after the handler runs
(drain-without-dup on the delivery leg); NET-NEW reconnect loop re-dials +
re-handshakes after an unexpected close (off by default, on in production).
- adapter: emits `going_idle` from its existing `disconnect()` drain seam before
tearing down the socket; best-effort + guarded (never blocks shutdown).
- transport Protocol + contract doc §3.2 document the 3 new frames.
+6 relay tests (124 pass). NOT in scope: the autonomous idle timer / machine
suspend / NAS health model (deferred behaviour). Ben's relay-adapter solo lane.
When _auto_create_thread() creates a thread from a user message via
message.create_thread(), Discord fires a second MESSAGE_CREATE event
for the 'thread starter message'. That starter message carries
message.id == thread.id and may arrive with type=default instead of
type=21 (thread_starter_message), so the existing type filter in
on_message does not catch it — triggering a second call into
_handle_message and thus a second agent run and response.
Fix: after _auto_create_thread succeeds and returns a thread, pre-seed
the dedup cache with str(thread.id) via self._dedup.is_duplicate().
The dedup cache is the same TTL-based MessageDeduplicator that already
guards against Discord RESUME event replays. Calling is_duplicate()
marks the ID as seen; when the duplicate thread-starter MESSAGE_CREATE
arrives, on_message's guard returns True and the event is dropped.
This is a minimal, targeted fix:
- No new state: reuses the existing _dedup instance
- No timing/race: the pre-seed happens synchronously inside the async
_handle_message, before the thread-starter event can be dispatched
- Scoped: only fires when auto-threading is enabled AND thread creation
succeeds (thread object is not None)
Also adds tests in tests/gateway/test_discord_double_dispatch.py
covering the pre-seed behaviour, failure modes (thread creation fails,
auto-thread disabled), and dedup cache integrity.
Closes#51057