Commit graph

9 commits

Author SHA1 Message Date
Ben Barclay
64a507da44
feat(relay): handle passthrough_forward over the WS (Phase 5 §5.1, gateway half) (#50702)
The connector half (gateway-gateway) moves the passthrough plane's post-ACK
forward off the HTTP gatewayEndpoint onto the gateway's outbound /relay WS via
a new passthrough_forward frame. This is the gateway side: the relay adapter
now RECEIVES and handles that frame, so a hosted gateway (no public IP) can
process forwarded Class-2/3 traffic (Discord interactions, Twilio) over the
socket it already holds — closing the "passthrough inbound doesn't work for
hosted gateways" gap.

- ws_transport.py: decode the passthrough_forward frame; PassthroughForward
  dataclass + _passthrough_from_wire (base64 body -> exact bytes, byte parity
  with the connector's toPassthroughForward); set_passthrough_handler mirrors
  set_interrupt_inbound_handler.
- transport.py: PassthroughHandler type + set_passthrough_handler on the
  RelayTransport protocol.
- adapter.py: connect() wires the passthrough handler; _on_passthrough decodes
  the (already-sanitized, token-free) forward and, for a Discord interaction,
  converts it to a MessageEvent routed through the normal agent path
  (handle_message) — the reply egresses over the outbound / token-less
  follow_up path, so the gateway never holds the interaction credential. Never
  raises (a bad forward can't kill the read loop). Non-discord forwards (Twilio)
  are logged + dropped for now.
- docs/relay-connector-contract.md: document the passthrough_forward frame +
  PassthroughForward shape + §3.1.

The interaction -> MessageEvent CONVERSION semantics (slash-command vs button
UX, option rendering) are the open sub-design flagged in the spec; the TRANSPORT
+ receive mechanism (this) is settled per Ben's Gate-2 decision: "the relay
adapter handles receiving these events over the WS."

Tests (tests/gateway/relay/test_relay_passthrough.py): byte-preservation
round-trip (+ malformed-body tolerance), connect() wiring, application-command
and message-component interactions route through handle_message with correct
session source + scope capture, malformed/non-discord forwards dropped cleanly.
100 relay tests green. Pairs with the connector PR (gateway-gateway).
2026-06-22 20:10:57 +10:00
Ben Barclay
a64fc490fe
fix(relay): make hosted gateways actually connect AND complete the inbound/outbound round-trip (#48828)
* fix(relay): enable RELAY platform + normalize dial URL so hosted gateways actually connect

Three bugs blocked a self-provisioned hosted gateway from ever establishing its
inbound relay WS (found while standing up the live staging end-to-end). Each
masked the next; all three are needed for inbound to work.

1. RELAY platform never enabled in config.platforms (gateway/config.py).
   register_relay_adapter() puts the adapter in the platform_registry, but
   start_gateway()'s connect loop iterates self.config.platforms — which never
   contained Platform.RELAY. So the adapter was "registered" but never connected
   (logs showed "relay adapter registered" then "No messaging platforms
   enabled"). Fix: _apply_env_overrides now enables Platform.RELAY (mirroring
   relay_url into extra for the connected-checker) when GATEWAY_RELAY_URL (env)
   or gateway.relay_url (yaml) is set. Absent -> no RELAY entry (direct/
   single-tenant gateways unaffected).

2. URL scheme not converted for the WS dial (gateway/relay/ws_transport.py).
   The relay URL is configured once as the http(s):// base (used as-is for the
   provision POST), but websockets.connect rejects http(s):// with "scheme isn't
   ws or wss". Fix: _ws_dial_url converts https->wss / http->ws.

3. /relay path not appended (same helper). The connector mounts its
   WebSocketServer at path "/relay" and returns HTTP 400 on an upgrade to any
   other path. GATEWAY_RELAY_URL is the base (no /relay), so the dial hit "/"
   -> 400. Fix: _ws_dial_url ensures the path ends in /relay. Idempotent — a URL
   already carrying ws(s):// and/or /relay is unchanged, so provision's
   _provision_url (which derives /relay/provision from either form) still works.

Why the cross-repo E2E missed #2/#3: the stub connector binds ws://host:port and
its websockets.serve accepts ANY path, so neither the scheme nor the /relay path
was exercised. Real connector needs both.

Verified live on staging hermes-agent-stg-automated-perception-5054: after the
fixes the gateway logs "Connecting to relay..." -> "✓ relay connected" ->
"Gateway running with 1 platform(s)" against
wss://gateway-gateway.staging-nousresearch.com/relay, stable.

Tests: added _ws_dial_url scheme+path+idempotency cases (test_ws_transport.py)
and RELAY-platform-enablement cases for env + yaml + absent (test_config.py).
Full gateway/relay + config suites green (191 passed).

Relay-adapter lane. EXPERIMENTAL.

* fix(relay): re-attach guild_id to outbound so connector egress resolves the tenant

The final bug in the hosted-relay round-trip. Inbound worked end to end (Discord
-> connector -> bus -> agent WS -> agent runs -> reply), but the reply's egress
was declined by the connector: "discord egress declined: target not routed to an
onboarded tenant".

Cause: the connector's routedEgressGuard resolves the owning tenant from the
OUTBOUND action's metadata.guild_id (Discord's routing discriminator). The
gateway's generic delivery path builds outbound metadata via
run.py _thread_metadata_for_source, which only carries thread_id (and returns
None entirely for a non-threaded message) — so guild_id never reached the
connector, tenant resolution failed, and the shared bot refused to post.

Fix (relay-adapter-local, no perturbation of the generic delivery path or other
platforms): RelayAdapter learns chat_id -> guild_id from each inbound event
(_capture_scope) and re-attaches it to the outbound action's metadata in send()
(_with_scope) when not already present. No-op for chats we never saw inbound
(e.g. DMs) and never overwrites an explicit guild_id.

Verified live on staging hermes-agent-stg-automated-perception-5054: an
@mention in #general now produces a visible bot reply — full multi-tenant relay
round-trip (real Discord -> shared connector bot -> tenant routing -> agent WS ->
reply egress -> Discord).

Tests: _capture_scope/_with_scope reattach, no-scope no-op, explicit-guild_id
preserved (test_relay_adapter.py). Full relay + config suites green (160 passed).

Relay-adapter lane. EXPERIMENTAL.
2026-06-19 16:30:24 +10:00
Ben Barclay
d2c53ff558
feat(relay): WS-only inbound on the gateway adapter (Phase 3) (#48294)
The connector now delivers inbound (messages + interrupts) over the gateway's
OUTBOUND /relay WebSocket, not a signed HTTP POST to an inbound endpoint. The
gateway needs no inbound HTTP port — which is what makes hosted gateways (no
public IP) able to receive inbound at all.

- gateway/relay/adapter.py: connect() wires set_interrupt_inbound_handler(
  self.on_interrupt) so connector->gateway interrupt_inbound frames bridge into
  the existing per-session interrupt path (the inbound message handler was
  already wired). Removed _maybe_start_inbound_receiver() + the _inbound_runner
  lifecycle — there is no HTTP receiver anymore.
- gateway/relay/inbound_receiver.py: deleted (the signed-HTTP InboundDelivery
  receiver).
- gateway/relay/__init__.py: removed relay_inbound_config() (dead with the
  receiver gone). The delivery key is still set in-process by self-provision for
  forward-compat but is no longer consumed for inbound.
- docs/relay-connector-contract.md: §3 rewritten — inbound is the WS back-channel
  routed cross-instance via the connector's relay bus; §5 interrupt + §6 auth
  table updated; the old signed-HTTP-POST + per-tenant-delivery-key-signing path
  is documented as superseded. gatewayEndpoint noted as passthrough-plane only.

Tests: stub_connector grows set_interrupt_inbound_handler + push_interrupt;
new test_relay_interrupt case proves connect() wires BOTH inbound handlers and an
interrupt_inbound frame over the WS cancels the right session. Removed the
HTTP-receiver test; updated the crypto-shedding scan + self-provision delivery-key
assertion. 88 relay tests pass.

EXPERIMENTAL. Pairs with gateway-gateway (relay bus + WsGatewayDelivery) and the
NAS GATEWAY_RELAY_URL stamp. The cross-repo E2E (connector repo) proves the full
multi-instance path against this production adapter code.
2026-06-19 09:33:15 +10:00
Ben Barclay
c276b017ad
feat(relay): connector⇄gateway channel auth + signed-HTTP inbound receiver + enroll CLI (#48147)
* feat(relay): authenticate the connector⇄gateway WS channel

The relay gateway may be customer-managed and internet-exposed, so the
connector⇄gateway channel is itself authenticated (distinct from the
platform crypto the relay path sheds). Add gateway/relay/auth.py — a
Python port of the connector's HMAC token + delivery-signature schemes
(relayAuthToken.ts / deliverySigning.ts), verified byte-for-byte against
the connector's compiled TypeScript via cross-language test vectors.

Present an Authorization bearer on the /relay WS upgrade keyed by the
per-gateway secret (resolved from GATEWAY_RELAY_ID / GATEWAY_RELAY_SECRET
in env or config). The connector rejects an unauthenticated/invalid/
revoked upgrade with close 4401.

* feat(relay): signed-HTTP inbound delivery receiver

The connector delivers normalized inbound events to a tenant's gateway
over a signed HTTP POST, not the outbound /relay WS: the connector
instance owning a platform socket is generally not the instance a given
gateway dialed out to, so inbound targets a tenant endpoint that may
load-balance across gateway instances.

Add gateway/relay/inbound_receiver.py — verifies x-relay-signature /
x-relay-timestamp over the EXACT raw request bytes (re-serializing would
break the HMAC: JS JSON.stringify is compact, Python json.dumps spaces)
against the per-tenant delivery key verify list within a 300s replay
window, then dispatches messages to handle_message and interrupts to the
interrupt handler. Wire it into the adapter lifecycle (start in connect()
when a delivery key + bind port are configured, tear down in disconnect();
a purely-outbound dev gateway runs without it).

Refine test_relay_sheds_crypto to distinguish PLATFORM crypto (Discord
ed25519, Twilio/WeCom HMAC — still shed) from the connector⇄gateway
CHANNEL auth (intended): auth.py / inbound_receiver.py are exempt from
the platform-symbol scan but still banned from importing platform-crypto
modules, plus a positive guard that auth.py uses only stdlib hmac/hashlib.

* feat(relay): hermes gateway enroll CLI

Add the gateway half of zero-touch enrollment. `hermes gateway enroll`
resolves a fresh Nous Portal access token (the tenant-proving identity),
POSTs {enrollmentToken, gatewayId} to the connector's /relay/enroll, and
persists GATEWAY_RELAY_ID / GATEWAY_RELAY_SECRET / GATEWAY_RELAY_DELIVERY_KEY
to ~/.hermes/.env. The per-gateway secret authenticates the WS upgrade;
the per-tenant delivery key verifies signed inbound deliveries.

Refuses under is_managed() (hosted installs get the secret stamped in by
the orchestrator). Added as an 'enroll' subcommand on the existing
gateway subparser — not a new top-level command.

* docs(relay): inbound is signed HTTP, not WS; document channel auth

Fix the stale contract: §3/§5 said inbound rode the WS socket (single-
instance only, predates the multi-instance socket-ownership + channel-auth
model). Inbound + connector→gateway interrupt are signed HTTP POSTs to the
tenant endpoint. Add §6.1 documenting the two channel-auth schemes (per-
gateway WS-upgrade secret, per-tenant inbound delivery key) and how they
differ from the platform crypto the relay path sheds.

* test(relay): update build_gateway_parser callers for cmd_gateway_enroll

The enroll subcommand added cmd_gateway_enroll as a required keyword-only
arg to build_gateway_parser, but two existing parser-extraction tests still
called it with only cmd_gateway/cmd_proxy — failing CI with TypeError.
Thread the new handler through both call sites and add a test asserting
`gateway enroll` dispatches to cmd_gateway_enroll with its flags parsed.
2026-06-18 12:01:54 +10:00
Ben
6b03874d07 feat(gateway): production WebSocketRelayTransport + descriptor negotiation
Adds the concrete transport behind the RelayTransport Protocol — the missing
'later-phase work' the relay scaffold deferred. The gateway dials OUT to the
connector over a WebSocket and speaks the newline-delimited JSON frame protocol
(docs/relay-connector-contract.md; connector src/relay/protocol.ts):

- connect(): opens the ws, sends hello{platform,botId}, starts a background
  read loop, and resolves handshake() when the connector's descriptor frame
  arrives.
- inbound frames -> the registered InboundHandler (rebuilt into a MessageEvent
  via _event_from_wire, mapping the snake_case SessionSource wire form back
  onto the gateway dataclasses).
- send_outbound / send_follow_up / get_chat_info: request/response correlated
  by a uuid requestId against a per-request future, with a timeout so a caller
  never hangs; send_interrupt is fire-and-forget.
- disconnect(): cancels the reader, closes the ws, and fails any in-flight
  outbound waiters with a structured error.

RelayAdapter.connect() now negotiates the real CapabilityDescriptor from the
transport and adopts it (_apply_descriptor updates MAX_MESSAGE_LENGTH +
markdown surface), replacing the construction-time placeholder. Lazy
'import websockets' mirrors gateway/platforms/feishu.py; WEBSOCKETS_AVAILABLE
gates construction.
2026-06-17 16:37:45 -07:00
Ben
3db9b3e616 feat(gateway): token-less follow_up outbound op (A2 capability action)
The relay outbound surface had send/edit/typing but no way to act on a
SHARED-identity capability (e.g. a Discord interaction follow-up token,
~15min) that the connector captured + stripped at the edge. Under A2 that
credential never reaches the gateway, so the gateway can't just 'send with
the token' — it needs a semantic op naming the session it's already in.

Adds the follow_up op end to end on the gateway side:
- RelayTransport.send_follow_up(action): protocol method. Action carries
  op='follow_up' + session_key + kind + content (+ metadata) and NO token.
- RelayAdapter.send_follow_up(session_key, kind, content, metadata): builds
  that action and returns a SendResult. The connector resolves the real
  capability (its resolveOutboundCapability), enforces the tenant match so
  tenant B can't wield tenant A's capability, and egresses; success=False
  when the capability is absent/expired/mismatched (nothing to retry — a
  leaked gateway holds zero capability material).
- StubConnector records follow_ups + a canned next_follow_up_result.

Tests: round-trips without a token; the wire action carries only session
refs (no credential value field — the 'kind' string is a type ref, not the
secret); failure surfaces when the connector can't resolve; no-transport
fails cleanly. 55 passed. §4 doc entry follows in the contract-rewrite commit.
2026-06-17 16:37:45 -07:00
Ben
a3cdd8c39d feat(relay): route mid-turn /stop over relay interrupt channel
RelayAdapter.on_interrupt(session_key, chat_id) bridges a connector-delivered
mid-turn /stop into the existing interrupt_session_activity path, setting the
per-session _active_sessions Event and clearing typing — cancelling exactly the
targeted session's turn without touching siblings (mirrors test_stop_thread_
sibling isolation). Transport.send_interrupt carries the gateway-side egress to
the connector for socket-owner routing.

Phase 1, Task 1.4 of the gateway-relay plan.
2026-06-17 16:37:45 -07:00
Ben
259e78e175 feat(relay): transport protocol + test-only stub connector
Defines RelayTransport (lifecycle/handshake/inbound/outbound/interrupt) as the
gateway<->connector wire contract; RelayAdapter.connect now registers an inbound
handler that bridges connector-delivered MessageEvents into handle_message.
Adds an in-memory StubConnector under tests/ and an E2E round-trip proving:
connect registers the handler, inbound events reach the adapter, guild_id drives
build_session_key isolation (two guilds -> two keys; same guild/channel/user ->
one), outbound send round-trips, get_chat_info is proxied.

Phase 1, Task 1.2 of the gateway-relay plan.
2026-06-17 16:37:45 -07:00
Ben
b0999c82f3 feat(relay): generic RelayAdapter advertising negotiated capabilities
One BasePlatformAdapter subclass that reads its capability profile from a
CapabilityDescriptor: MAX_MESSAGE_LENGTH attribute, message_len_fn (table-driven
by len_unit: chars=len, utf16=Telegram-style code units), supports_draft_streaming.
Implements the four abstract methods (connect/disconnect/send/get_chat_info) by
delegating to an injected RelayTransport (full protocol lands in Task 1.2). Adds
Platform.RELAY enum member. No per-platform gateway code.

Phase 1, Task 1.1 of the gateway-relay plan.
2026-06-17 16:37:45 -07:00