feat(relay): WS-only inbound on the gateway adapter (Phase 3) (#48294)

The connector now delivers inbound (messages + interrupts) over the gateway's
OUTBOUND /relay WebSocket, not a signed HTTP POST to an inbound endpoint. The
gateway needs no inbound HTTP port — which is what makes hosted gateways (no
public IP) able to receive inbound at all.

- gateway/relay/adapter.py: connect() wires set_interrupt_inbound_handler(
  self.on_interrupt) so connector->gateway interrupt_inbound frames bridge into
  the existing per-session interrupt path (the inbound message handler was
  already wired). Removed _maybe_start_inbound_receiver() + the _inbound_runner
  lifecycle — there is no HTTP receiver anymore.
- gateway/relay/inbound_receiver.py: deleted (the signed-HTTP InboundDelivery
  receiver).
- gateway/relay/__init__.py: removed relay_inbound_config() (dead with the
  receiver gone). The delivery key is still set in-process by self-provision for
  forward-compat but is no longer consumed for inbound.
- docs/relay-connector-contract.md: §3 rewritten — inbound is the WS back-channel
  routed cross-instance via the connector's relay bus; §5 interrupt + §6 auth
  table updated; the old signed-HTTP-POST + per-tenant-delivery-key-signing path
  is documented as superseded. gatewayEndpoint noted as passthrough-plane only.

Tests: stub_connector grows set_interrupt_inbound_handler + push_interrupt;
new test_relay_interrupt case proves connect() wires BOTH inbound handlers and an
interrupt_inbound frame over the WS cancels the right session. Removed the
HTTP-receiver test; updated the crypto-shedding scan + self-provision delivery-key
assertion. 88 relay tests pass.

EXPERIMENTAL. Pairs with gateway-gateway (relay bus + WsGatewayDelivery) and the
NAS GATEWAY_RELAY_URL stamp. The cross-repo E2E (connector repo) proves the full
multi-instance path against this production adapter code.
Ben Barclay 2026-06-19 09:33:15 +10:00 committed by GitHub
parent 03d9a95a74
commit d2c53ff558
GPG key ID: B5690EEEBB952194
9 changed files with 117 additions and 476 deletions

@ -62,33 +62,55 @@ live platform adapter's capability methods.
The connector normalizes each platform wire event into a `MessageEvent`
(`gateway/platforms/base.py`) and delivers it to the gateway. **Inbound is
delivered over a signed HTTP POST, not the outbound `/relay` WebSocket** (see
the transport note below). The gateway keys the session via `build_session_key()`
delivered over the gateway's OUTBOUND `/relay` WebSocket** (see the transport
note below) — the connector pushes an `inbound` frame down the socket the
gateway already dialed. The gateway keys the session via `build_session_key()`
from the embedded `SessionSource` — so populating the right discriminators is
the single highest-correctness responsibility of the connector.
### Inbound transport (signed HTTP POST, not the outbound WS)
### Inbound transport (WS back-channel, not HTTP)
The gateway dials **out** to the connector's `/relay` WebSocket for the
handshake + outbound actions (§4) + its own `/stop` egress (§5). Inbound,
however, is delivered the other way: the connector **POSTs** the normalized
event to the gateway's inbound endpoint (`HttpGatewayDelivery` on the connector;
`gateway/relay/inbound_receiver.py` on the gateway). The reason is
multi-instance: the connector instance that owns a platform's socket (and thus
produces inbound events) is generally **not** the instance a given gateway
dialed its outbound WS into, so inbound must target a tenant **endpoint** (which
may load-balance across gateway instances) rather than ride one gateway's
outbound socket. Each delivery is HMAC-signed with the per-tenant **delivery
key** (§6.1); the gateway verifies the signature over the exact raw bytes before
accepting the event. Two POST targets:
handshake + outbound actions (§4) + its own `/stop` egress (§5). Inbound rides
the **same socket** in the other direction: the connector pushes an `inbound`
frame (and `interrupt_inbound` for §5) down the gateway's outbound WS. There is
**no gateway-side inbound HTTP endpoint** — a gateway need not (and, when hosted,
cannot) expose any inbound port; everything flows over the connection it
initiated.
**Multi-instance routing.** The connector instance that owns a platform's socket
(and thus produces inbound events) is generally **not** the instance the gateway
dialed its outbound WS into. The producing instance therefore publishes the
event on the connector's internal **relay bus** (Redis pub/sub; `RelayBus` in
`src/core/relayBus.ts`) keyed by tenant. Every connector instance subscribes and
routes each message to its **local** sessions for that tenant
(`RelayServer.routeBusMessage`); the single instance that actually holds the
gateway's socket delivers it, and instances with no local session for the tenant
no-op. Cross-instance delivery is thus an in-cluster Redis hop, not a public
HTTP call.
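The routing described above can be sketched with an in-memory stand-in for the bus. This is only an illustration under stated assumptions: the real bus is Redis pub/sub in the TypeScript connector, and `InMemoryBus`, `ConnectorInstance`, and every method name below are hypothetical, not the connector's actual API.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryBus:
    """Hypothetical stand-in for the relay bus: every subscriber sees every message."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str, dict], None]] = []

    def subscribe(self, callback: Callable[[str, dict], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, tenant_id: str, frame: dict) -> None:
        for callback in list(self._subscribers):
            callback(tenant_id, frame)


class ConnectorInstance:
    """Hypothetical connector instance; a plain list plays the role of a gateway socket."""

    def __init__(self, name: str, bus: InMemoryBus) -> None:
        self.name = name
        self.bus = bus
        # tenant id -> sockets that gateways of this tenant dialed into THIS instance
        self.local_sessions: Dict[str, List[list]] = defaultdict(list)
        bus.subscribe(self.route_bus_message)

    def attach_gateway_socket(self, tenant_id: str) -> list:
        socket: list = []
        self.local_sessions[tenant_id].append(socket)
        return socket

    def produce_inbound(self, tenant_id: str, event: dict) -> None:
        # The producing instance publishes instead of delivering directly.
        self.bus.publish(tenant_id, {"type": "inbound", "event": event})

    def route_bus_message(self, tenant_id: str, frame: dict) -> None:
        # Deliver only to local sessions; an instance without one no-ops.
        for socket in self.local_sessions.get(tenant_id, []):
            socket.append(frame)


bus = InMemoryBus()
producer = ConnectorInstance("a", bus)  # owns the platform socket
holder = ConnectorInstance("b", bus)    # holds the gateway's outbound WS
gateway_socket = holder.attach_gateway_socket("tenant-1")
producer.produce_inbound("tenant-1", {"text": "hello"})
```

In this sketch the producing instance never touches the gateway socket; only the instance with a local session for the tenant appends the frame.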
Frames (connector → gateway, over the WS):
- `{"type":"inbound", "event": <MessageEvent>, "bufferId"?}`
- `{"type":"interrupt_inbound", "session_key", "chat_id"}` (§5)
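A minimal sketch of how a gateway-side reader might dispatch these two frame types follows. The `FrameDispatcher` class and its callback signatures are assumptions made for illustration; only the frame shapes and the `on_interrupt(session_key, chat_id)` argument order come from this contract.

```python
import json
from typing import Any, Callable, Optional


class FrameDispatcher:
    """Hypothetical dispatcher for connector -> gateway frames arriving on the WS."""

    def __init__(
        self,
        on_inbound: Callable[[Any, Optional[str]], None],
        on_interrupt: Callable[[str, str], None],
    ) -> None:
        self._on_inbound = on_inbound
        self._on_interrupt = on_interrupt

    def dispatch(self, raw: str) -> bool:
        """Return True if the frame was handled, False if it was ignored."""
        try:
            frame = json.loads(raw)
        except json.JSONDecodeError:
            return False
        if not isinstance(frame, dict):
            return False
        kind = frame.get("type")
        if kind == "inbound" and "event" in frame:
            # bufferId is optional in the frame, so it may be None here.
            self._on_inbound(frame["event"], frame.get("bufferId"))
            return True
        if kind == "interrupt_inbound" and "session_key" in frame and "chat_id" in frame:
            self._on_interrupt(frame["session_key"], frame["chat_id"])
            return True
        return False


messages: list = []
interrupts: list = []
dispatcher = FrameDispatcher(
    on_inbound=lambda event, buffer_id: messages.append((event, buffer_id)),
    on_interrupt=lambda session_key, chat_id: interrupts.append((session_key, chat_id)),
)
dispatcher.dispatch('{"type":"inbound","event":{"text":"hi"}}')
dispatcher.dispatch('{"type":"interrupt_inbound","session_key":"s1","chat_id":"c1"}')
```

Unknown or malformed frames are ignored rather than raised, which is a design choice of this sketch and not something the contract specifies.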
**Trust.** The WS upgrade is authenticated with the gateway's per-gateway secret
(§6.1), so the channel is trusted end to end — inbound frames are not separately
HMAC-signed (the authenticated socket subsumes the per-delivery origin proof the
old HTTP path needed). The relay-bus hop is inside the connector trust domain
(same as the lease/buffer/capability stores).
> Earlier drafts of this contract delivered inbound over a signed **HTTP POST**
> to a `gatewayEndpoint` (`HttpGatewayDelivery` + a gateway-side
> `inbound_receiver`), HMAC-signed with a per-tenant delivery key. That required
> every gateway to expose a reachable inbound URL — impossible for hosted
> gateways, which have no public IP. The WS back-channel above replaces it; the
> per-tenant delivery key is retained at provision for forward-compat but is no
> longer used for inbound. `gatewayEndpoint` remains only for the **passthrough
> plane** (Class-2/3 webhooks like Discord interactions / Twilio), which is a
> separate synchronous-forward path and out of scope for this section.
- `POST {gatewayEndpoint}` `{"type":"message", "event": <MessageEvent>}`
- `POST {gatewayEndpoint}/interrupt` `{"type":"interrupt", "session_key", "reason"?}` (§5)
> An earlier draft of this contract delivered inbound over the WS `inbound`
> frame. That only works single-instance and predates the multi-instance
> socket-ownership + channel-auth model; the signed-HTTP path above is the
> shipped design.
### SessionSource fields (the wire surface)
@ -178,13 +200,15 @@ gateway holds zero capability material). Source of truth:
mid-turn `/stop` over the outbound WS. The connector MUST forward it to the
gateway instance running that `session_key` (the routing invariant).
- **Connector → gateway:** an inbound interrupt for a `session_key` is delivered
as a **signed HTTP POST** to `{gatewayEndpoint}/interrupt` (§3 transport note),
and bridged by the adapter's `on_interrupt(session_key, chat_id)` into the
existing per-session interrupt mechanism, cancelling exactly that turn
as an `interrupt_inbound` frame down the gateway's outbound WS (§3 transport
note) — routed cross-instance via the relay bus to whichever instance holds
the socket — and bridged by the adapter's `on_interrupt(session_key, chat_id)`
into the existing per-session interrupt mechanism, cancelling exactly that turn
(siblings untouched).
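The "cancel exactly that turn" behaviour can be illustrated with one asyncio task per session. This is a sketch under assumptions: `SessionTurns` and its bookkeeping are hypothetical, and the adapter's real per-session interrupt mechanism may work differently; only the `on_interrupt(session_key, chat_id)` signature is taken from the text above.

```python
import asyncio
from typing import Coroutine, Dict


class SessionTurns:
    """Hypothetical registry mapping each session_key to its running turn."""

    def __init__(self) -> None:
        self._turns: Dict[str, asyncio.Task] = {}

    def start(self, session_key: str, coro: Coroutine) -> asyncio.Task:
        task = asyncio.create_task(coro)
        self._turns[session_key] = task
        return task

    def on_interrupt(self, session_key: str, chat_id: str) -> bool:
        # Cancel only the turn for this session; sibling sessions are untouched.
        task = self._turns.get(session_key)
        if task is None or task.done():
            return False
        task.cancel()
        return True


async def demo() -> tuple:
    turns = SessionTurns()
    target = turns.start("session:a", asyncio.sleep(60))
    sibling = turns.start("session:b", asyncio.sleep(0.01, result="done"))
    await asyncio.sleep(0)  # let both turns start running
    cancelled = turns.on_interrupt("session:a", "chat-1")
    results = await asyncio.gather(target, sibling, return_exceptions=True)
    return cancelled, target.cancelled(), results[1]


outcome = asyncio.run(demo())
```

Here the interrupted turn ends cancelled while the sibling turn runs to completion.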
The gateway→connector `/stop` rides the outbound WS; the connector→gateway
interrupt rides the same signed-HTTP inbound path as a normalized event.
Both directions ride the gateway's outbound WS: the gateway→connector `/stop`
egresses over it, and the connector→gateway interrupt rides the same `inbound`
back-channel as a normalized event.
---
@ -231,20 +255,21 @@ only in transport. See `docs/capability-trust-boundary.md` (connector repo:
A2 makes the connector the sole holder of platform secrets while the gateway may
be **customer-managed and internet-exposed**, so the connector⇄gateway channel
is itself authenticated. The gateway holds two enrollment-issued credentials
(`hermes gateway enroll` → connector `/relay/enroll`): a **per-gateway secret**
and a **per-tenant delivery key**. Both are HMAC-SHA256 schemes with a
multi-secret rotation verify list (gateway side: `gateway/relay/auth.py`;
connector side: `src/core/relayAuthToken.ts` + `src/core/deliverySigning.ts`).
is itself authenticated. The gateway holds an enrollment- or provision-issued
**per-gateway secret** (`hermes gateway enroll` → connector `/relay/enroll`, or
managed self-provision → `/relay/provision`) that authenticates its outbound WS
upgrade. It is an HMAC-SHA256 scheme with a multi-secret rotation verify list
(gateway side: `gateway/relay/auth.py`; connector side:
`src/core/relayAuthToken.ts`).
| Leg | Credential | Mechanism |
|-----|-----------|-----------|
| Gateway → connector WS upgrade | per-gateway secret | An `Authorization` bearer header on the `/relay` upgrade. The token is `base64url(payload:exp:sig)` where `payload = gatewayId` and `sig = HMAC(payload:exp, secret)`. Connector verifies and rejects the upgrade (**close 4401**) on mismatch/absence/revocation. The authenticated tenant comes from the connector's store, never the `hello` frame. |
| Connector → gateway inbound POST | per-tenant delivery key | Two headers: `x-relay-timestamp` (unix seconds) and `x-relay-signature` (hex `HMAC(ts.rawBody, deliveryKey)`). Gateway verifies over the **exact raw bytes** within a ±300s replay window before accepting the event; rejects **401** otherwise. |
| Connector → gateway inbound (`inbound` / `interrupt_inbound` frames) | — (rides the authenticated WS) | Inbound is pushed down the gateway's already-authenticated outbound socket (§3), so no per-message signature is needed. A **per-tenant delivery key** is still issued at enroll/provision and retained for forward-compat, but is no longer used to sign inbound. |
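The bearer token in the first row can be sketched as follows. This assumes details the table does not state: the signature is hex-encoded, `exp` is unix seconds, and the base64url padding is stripped. The function names are hypothetical and are not the API of `gateway/relay/auth.py` or `src/core/relayAuthToken.ts`.

```python
import base64
import hashlib
import hmac
import time
from typing import Iterable, Optional


def _b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def mint_token(gateway_id: str, secret: bytes, exp: int) -> str:
    """Build base64url(payload:exp:sig) with sig = HMAC-SHA256(payload:exp, secret)."""
    signed = f"{gateway_id}:{exp}"
    sig = hmac.new(secret, signed.encode("utf-8"), hashlib.sha256).hexdigest()
    return _b64url_encode(f"{signed}:{sig}".encode("utf-8"))


def verify_token(token: str, secrets: Iterable[bytes], now: Optional[int] = None) -> Optional[str]:
    """Return the gateway id if any secret in the rotation list verifies, else None."""
    try:
        decoded = _b64url_decode(token).decode("utf-8")
        payload, exp_text, sig = decoded.rsplit(":", 2)
        exp = int(exp_text)
    except (ValueError, UnicodeDecodeError):
        return None
    if exp < (int(time.time()) if now is None else now):
        return None  # expired
    signed = f"{payload}:{exp_text}".encode("utf-8")
    for secret in secrets:
        expected = hmac.new(secret, signed, hashlib.sha256).hexdigest()
        # Constant-time comparison on bytes avoids timing leaks.
        if hmac.compare_digest(expected.encode("ascii"), sig.encode("utf-8")):
            return payload
    return None


token = mint_token("gw-1", b"new-secret", exp=2000)
verified = verify_token(token, [b"old-secret", b"new-secret"], now=1000)
```

Trying each secret in the list is what lets an old and a new secret both verify during rotation. On a failed verification the connector would reject the upgrade (close 4401, per the table).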
This is the **channel** authenticator — distinct from platform crypto, which the
relay path still sheds entirely (§6). The gateway holds zero platform secrets;
these two keys authenticate only the connector link. Full threat model +
the per-gateway secret authenticates only the connector link. Full threat model +
enrollment/rotation/kill-switch design: `docs/connector-gateway-auth-design.md`
(connector repo).