hermes-agent/docs/relay-connector-contract.md
Ben c93b9f9057
Some checks are pending
CI / detect (push) Waiting to run
CI / tests (push) Blocked by required conditions
CI / lint (push) Blocked by required conditions
CI / typecheck (push) Blocked by required conditions
CI / docs-site (push) Blocked by required conditions
CI / history-check (push) Blocked by required conditions
CI / contributor-check (push) Blocked by required conditions
CI / uv-lockfile (push) Blocked by required conditions
CI / docker-lint (push) Blocked by required conditions
CI / supply-chain (push) Blocked by required conditions
CI / osv-scanner (push) Blocked by required conditions
CI / All required checks pass (push) Blocked by required conditions
Deploy Site / deploy-vercel (push) Waiting to run
Deploy Site / deploy-docs (push) Waiting to run
Docker Build and Publish / build-amd64 (push) Waiting to run
Docker Build and Publish / build-arm64 (push) Waiting to run
Docker Build and Publish / merge (push) Blocked by required conditions
feat(relay): terminal 4401 (opt-out) → clean "Relay disabled" state
Phase 7 Unit 7d-B. When an operator opts an instance OUT of the Team Gateway
relay (Unit 7b deprovision), the connector revokes the per-gateway secret and
closes the gateway's WS with 4401. The reconnect supervisor previously treated
EVERY close as retryable, so the live process spun "retrying 4401" forever and
the dashboard showed a red error — opt-out looked like a failure.

Now a 4401 close that arrives AFTER a successful handshake is recognized as a
terminal credential revocation:

- ws_transport.py: track `_handshake_succeeded` (set when a descriptor is
  received); on a 4401 close after a prior success, latch `auth_revoked` and do
  NOT spawn the reconnect supervisor. A 4401 BEFORE any successful handshake
  stays retryable (cold-start / not-yet-provisioned race, not a revocation).
  New `auth_revoked` property + a websockets-version-safe close-code reader
  (prefers `.rcvd`/`.sent` Close frames; `.code` is deprecated in websockets 13+).
- adapter.py: a revocation monitor turns `transport.auth_revoked` into a clean,
  NON-retryable `relay_disabled` fatal and notifies the gateway's fatal-error
  handler (so the adapter is removed and NOT queued for reconnection — the
  credential is dead until the instance is recreated). Monitor is cancelled on
  disconnect; only started when the transport exposes `auth_revoked` (prod WS).
- run.py: `_handle_adapter_fatal_error` maps the `relay_disabled` code to a
  `disabled` platform_state (not `fatal`/`retrying`).
- web: PlatformsCard renders the `disabled` state with a neutral outline badge,
  a PowerOff icon, and muted (not destructive-red) text + message. New optional
  `status.disabled` i18n string ("Disabled").

Also bundles the Phase 7 contract-doc update (this doc is authoritative in
hermes-agent): docs/relay-connector-contract.md gains an "Author-first
resolution + the account-link (DM) path" section documenting the
multi-tenant-guild rule (D-7.2 — route by authenticated author binding, never by
guild; unlinked → fail-closed), the `/link <code>` DM flow, and the
connector-authoritative opt-out + terminal-4401 behavior this PR implements.

Tests: +2 ws_transport (4401-after-handshake terminal / no-reconnect;
4401-before-handshake stays retryable) and +2 adapter (revocation → non-retryable
relay_disabled fatal + handler fired; no-revocation → no fatal). 138 relay tests
pass (incl. the contract-doc conformance test); ruff clean; web tsc clean.

Phase 7 Unit 7d-B (relay-adapter solo lane). Q17 → Option 2; Option 3 (live
de-register, no recreate) + the restart-re-provision hole deferred post-alpha.
2026-06-24 18:43:01 +10:00

567 lines
34 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Relay ↔ Connector Contract (v1, EXPERIMENTAL)
> **Status:** EXPERIMENTAL. This contract MAY CHANGE without a deprecation
> cycle until at least two real Class-1 platforms (Discord + Telegram) have
> validated it. Evolution during the experimental phase is **additive-only**,
> gated by `contract_version`. A breaking change updates both repos in lockstep.
This document is the formal interface between the **Hermes gateway** (Python,
`gateway/relay/`) and the **connector** (Node/TypeScript,
`NousResearch/gateway-gateway`). The connector implementer's first action is to
read this file.
The gateway runs a generic `RelayAdapter` that dials **out** to the connector,
receives a `CapabilityDescriptor` at handshake, then exchanges normalized
`MessageEvent`s (inbound) and actions (outbound) over a per-turn bidirectional
WebSocket. The gateway never learns which concrete platform is fronting it; the
connector owns all platform-specific socket/identity logic.
---
## 1. Handshake
1. Gateway opens the transport (`connect`).
2. Gateway calls `handshake()`; connector returns a `CapabilityDescriptor`
(section 2) describing the platform this adapter instance fronts.
3. Gateway configures the adapter from the descriptor (char limit, length unit,
draft/edit/thread/markdown capabilities) and registers an inbound handler.
4. Connector then streams inbound events and accepts outbound actions.
`contract_version` (currently `1`) is carried in the descriptor. The gateway
ignores unknown descriptor fields (forward-compat) and fills missing optional
fields from defaults.
---
## 2. CapabilityDescriptor (handshake payload)
JSON object. Source of truth: `gateway/relay/descriptor.py`.
| Field | Type | Required | Meaning |
| --- | --- | --- | --- |
| `contract_version` | int | yes | Contract version (additive-only within a version). |
| `platform` | string | yes | Platform name (e.g. `"discord"`, `"telegram"`). |
| `label` | string | yes | Human-readable label. |
| `max_message_length` | int | yes | Char limit; gateway exposes as `MAX_MESSAGE_LENGTH`. 0 → treat as 4096. |
| `supports_draft_streaming` | bool | yes | Native draft-streaming preview support. |
| `supports_edit` | bool | yes | Edit-based streaming possible; if false, consumer degrades to one-message-per-segment. |
| `supports_threads` | bool | yes | `create_handoff_thread` capability. |
| `markdown_dialect` | string | yes | `"plain"`, `"markdown_v2"`, `"discord"`, … (drives `supports_code_blocks`). |
| `len_unit` | string | yes | `"chars"` (builtin len) or `"utf16"` (Telegram UTF-16 code units). |
| `emoji` | string | no | Display emoji (default 🔌). |
| `platform_hint` | string | no | System-prompt platform hint. |
| `pii_safe` | bool | no | Redact PII in session descriptions. |
Most fields are a projection of the gateway's existing `PlatformEntry`; the
runtime-only fields (`len_unit`, `supports_*`, `markdown_dialect`) come from the
live platform adapter's capability methods.
---
## 3. Inbound: `MessageEvent` envelope
The connector normalizes each platform wire event into a `MessageEvent`
(`gateway/platforms/base.py`) and delivers it to the gateway. **Inbound is
delivered over the gateway's OUTBOUND `/relay` WebSocket** (see the transport
note below) — the connector pushes an `inbound` frame down the socket the
gateway already dialed. The gateway keys the session via `build_session_key()`
from the embedded `SessionSource` — so populating the right discriminators is
the single highest-correctness responsibility of the connector.
### Inbound transport (WS back-channel, not HTTP)
The gateway dials **out** to the connector's `/relay` WebSocket for the
handshake + outbound actions (§4) + its own `/stop` egress (§5). Inbound rides
the **same socket** in the other direction: the connector pushes an `inbound`
frame (and `interrupt_inbound` for §5) down the gateway's outbound WS. There is
**no gateway-side inbound HTTP endpoint** — a gateway need not (and, when hosted,
cannot) expose any inbound port; everything flows over the connection it
initiated.
**Multi-instance routing.** The connector instance that owns a platform's socket
(and thus produces inbound events) is generally **not** the instance the gateway
dialed its outbound WS into. The producing instance therefore publishes the
event on the connector's internal **relay bus** (Redis pub/sub; `RelayBus` in
`src/core/relayBus.ts`) keyed by tenant. Every connector instance subscribes and
routes each message to its **local** sessions for that tenant
(`RelayServer.routeBusMessage`); the single instance that actually holds the
gateway's socket delivers it, and instances with no local session for the tenant
no-op. Cross-instance delivery is thus an in-cluster Redis hop, not a public
HTTP call.
Frames (connector → gateway, over the WS):
- `{"type":"inbound", "event": <MessageEvent>, "bufferId"?}`
- `{"type":"interrupt_inbound", "session_key", "chat_id"}` (§5)
- `{"type":"passthrough_forward", "forward": <PassthroughForward>, "bufferId"?}` (§5.1)
`PassthroughForward` is the wire form of a forwarded passthrough-plane request
(Class-2/3 webhooks — Discord interactions, Twilio): `{platform, botId, method,
path, headers: [[k,v],…], bodyB64}`. The body is base64-encoded so arbitrary
bytes survive the newline-delimited-JSON transport; the gateway base64-decodes
back to the exact bytes the connector forwarded (the connector already verified
the provider signature and stripped any shared-identity credential at the edge —
§6 — so the gateway re-processes a sanitized, token-free body and acts on it via
the token-less `follow_up` path). See §3.1.
**Trust.** The WS upgrade is authenticated with the gateway's per-gateway secret
(§6.1), so the channel is trusted end to end — inbound frames are not separately
HMAC-signed (the authenticated socket subsumes the per-delivery origin proof the
old HTTP path needed). The relay-bus hop is inside the connector trust domain
(same as the lease/buffer/capability stores).
> Earlier drafts of this contract delivered inbound over a signed **HTTP POST**
> to a `gatewayEndpoint` (`HttpGatewayDelivery` + a gateway-side
> `inbound_receiver`), HMAC-signed with a per-tenant delivery key. That required
> every gateway to expose a reachable inbound URL — impossible for hosted
> gateways, which have no public IP. The WS back-channel above replaces it; the
> per-tenant delivery key is retained at provision for forward-compat but is no
> longer used for inbound. The **passthrough plane** (Class-2/3 webhooks like
> Discord interactions / Twilio) historically still used `gatewayEndpoint` for
> its post-ACK forward; Phase 5 §5.1 moves that forward onto the WS too (the
> `passthrough_forward` frame above), so a hosted gateway needs zero public
> inbound surface and `gatewayEndpoint` is retired once the cutover lands.
### 3.1 Passthrough-plane forward (§5.1)
The passthrough plane answers the provider's latency-critical ACK at the
connector EDGE (e.g. Discord's deferred interaction response within ~3s), then
does a **fire-and-forget** forward of the real request to the gateway. That
forward needs no response back (the provider was already satisfied), so it rides
the same outbound WS as `inbound` via a `passthrough_forward` frame rather than
an HTTP POST. The gateway processes the decoded request through its normal agent
path (a Discord interaction is decoded to a `MessageEvent` and handled like a
message; the reply egresses over the outbound / `follow_up` path). `bufferId` is
present when the forward was buffered (Phase 5 §5.3 buffered-only flip) and the
gateway acks it after durable handoff.
### SessionSource fields (the wire surface)
Source of truth: `SessionSource.to_dict()` in `gateway/session.py`. These are
every key the gateway accepts on the wire. `platform`, `chat_id`, `chat_type`,
`user_id`, `user_name`, `thread_id`, `chat_name`, and `chat_topic` are always
present (may be `null`); the rest are included only when set.
| Field | Type | Always sent | Meaning |
| --- | --- | --- | --- |
| `platform` | string | yes | Platform name (matches the descriptor's `platform`). |
| `chat_id` | string | yes | Primary conversation id (channel/chat). Session-key discriminator. |
| `chat_type` | string | yes | `dm` / `group` / `channel` / `thread` / `forum`. |
| `chat_name` | string\|null | yes | Human-readable chat name. |
| `user_id` | string\|null | yes | Message author id. Session-key discriminator. |
| `user_name` | string\|null | yes | Author display name. |
| `thread_id` | string\|null | yes | Thread/forum-topic id when in a thread. Session-key discriminator. |
| `chat_topic` | string\|null | yes | Channel topic/description (Discord, Slack). |
| `user_id_alt` | string | no | Platform-specific stable alt id (Signal UUID, Feishu union_id). |
| `chat_id_alt` | string | no | Alternate chat id (e.g. Signal group internal id). |
| `guild_id` | string | no | Discord guild / Slack workspace / Matrix server scope. **REQUIRED for Discord server isolation.** Session-key discriminator. |
| `parent_chat_id` | string | no | Parent channel when `chat_id` refers to a thread. |
| `message_id` | string | no | Id of the triggering message (for pin/reply/react). |
> `is_bot` (author-is-a-bot/webhook classification) exists on the gateway-side
> dataclass but is **intentionally NOT on the wire** in v1 — it is not part of
> `to_dict()`. Do not add it to the connector's `SessionSource` until it is
> first added here and to `to_dict()` (additive bump).
### SessionSource discriminators per platform
| Platform | chat_id | chat_type | user_id | thread_id | guild_id |
| --- | --- | --- | --- | --- | --- |
| **Discord** | channel id | `dm`/`group`/`thread` | author id | thread channel id (threads) | **guild id** (REQUIRED for server isolation) |
| **Telegram** | chat id | `dm`/`group`/`forum` | from id | forum topic id (forums) | — |
**Get Discord's `guild_id` wrong and two servers collide into one session.**
This is the #1 High-severity risk. The gateway's `build_session_key()` is the
conformance oracle: for a given `SessionSource`, the connector's normalization
must produce the same key the Python adapter would. (The Phase-1 stub tests
assert known-input → known-key.)
### Bot identity vs tenant (single-bot consolidation, Appendix A)
The envelope carries the **originating bot identity** as a field **distinct from
tenant**. Tenant is resolved from the event's own discriminator (Discord
`guild_id`, Telegram `chat_id`, webhook path/subdomain) — **never** from which
token/socket/process delivered it. This keeps one shared bot able to front many
tenants (Phase 6) without overloading an existing field.
### Author-first resolution + the account-link (DM) path (Phase 7)
Phase 7 adds **self-serve, per-user onboarding to a shared bot**, which changes
*which* discriminator resolves the instance for a routed inbound message — and
adds a management path for users to bind their own account.
**Author-first resolution (the multi-tenant-guild rule, D-7.2).** A single
Discord guild may hold **many** tenants — different members each linked to their
own agent. So for delivery the connector resolves the destination instance from
the **authenticated author binding** (`user_instance_binding`, keyed by
`(tenant, platform, platform_user_id)` via `resolveByUser`), **NOT** by a
guild→instance route. Concretely:
- A routed message authored by a **linked** user reaches **only that user's**
instance — even when a second linked user in the **same guild** is served by a
different instance (each reaches only their own).
- A message authored by an **unlinked** user resolves to **no** instance and is
dropped (**fail-closed** — never broadcast to the guild's other tenants).
- The author id used is the **authentic `user_id` off the observed event**, the
same `SessionSource.user_id` documented above — never a value asserted by a
gateway or carried in a management frame.
This is the per-`user_id` owner-only routing the connector enforces in
`WsGatewayDelivery` (the gateway-side multi-tenant-guild E2E driver
`gateway_multitenant_guild_driver.py` is the cross-repo oracle).
**The account-link (DM) path.** A user binds their account to an instance with a
one-time code, redeemed by DMing the shared bot:
1. The owner triggers a link from the Portal (or a self-hosted CLI). The
connector mints a short-lived **link code** for the **authenticated**
instance (`POST /manage/link`; instanceId comes from the caller's principal —
a NAS-signed `aud=agent:{instanceId}` token or the instance's own per-gateway
secret — **never** the request body).
2. The user sends `/link <code>` as a **direct message** to the shared bot from
the account they want to bind.
3. The connector's inbound observer **consumes** that DM (it is not routed to any
agent) and writes the `user_instance_binding` using the **authentic
`user_id`** off the observed DM event. From then on, author-first resolution
routes that user's messages to the bound instance.
**Opt-out is connector-authoritative.** Deprovisioning an instance
(`POST /manage/deprovision`) drops its author bindings (so its users stop
resolving to it) **and** revokes its per-gateway secret (so its socket can no
longer authenticate — the next WS upgrade is closed **4401**). A gateway that
sees a **4401 close after a previously-successful handshake** treats it as a
terminal revocation: it stops reconnecting and reports the relay platform as
**disabled** (not a retryable error). A 4401 *before* any successful handshake
stays retryable (a cold-start / not-yet-provisioned race, not a revocation).
### 3.2 Going-idle / buffered-flip primitive (§5.3)
A scale-to-zero PRIMITIVE (not the behaviour — nothing here decides to sleep or
suspends a machine; a later workstream consumes these frames). It lets a gateway
enter a drain/idle transition without losing inbound that arrives while it is
gone, by making the connector buffer for that instance and replay on reconnect.
Three frames (all keyed by the connection's **authenticated** per-instance id —
read off the stored secret record at the WS upgrade, never asserted in a frame):
- `{"type":"going_idle"}` (gateway → connector) — emitted as part of the
gateway's EXISTING drain transition (the adapter sends it before tearing down
the socket). Asks the connector to flip this instance to **buffered-only**.
- `{"type":"going_idle_ack"}` (connector → gateway) — the connector has flipped:
live delivery has stopped and subsequent inbound for this instance buffers
durably. The gateway **stays serving until this ack** (so an event landing in
the flip window is delivered live, not lost — the same SUBSCRIBE-before-serve
ordering discipline as the bus). Only after the ack is it safe to close.
- `{"type":"inbound_ack", "bufferId"}` (gateway → connector) — durable receipt of
a buffered `inbound` delivery (which carries its `bufferId`) replayed on
reconnect. The connector acks the buffer entry only after this, giving
drain-without-dup on the **delivery leg**: an instance that dies mid-drain
redelivers exactly the unacked tail; an acked entry never redelivers.
**Buffer + drain.** While flipped, the connector appends inbound to a durable
per-instance delivery-leg buffer (`delivery:<instanceId>`) instead of pushing it
live. On the gateway's **reconnect** (a NET-NEW reconnect loop re-dials +
re-handshakes after an unexpected close), the new handshake triggers the
connector to drain that backlog over the new socket **in order, ack-gated**,
then clear the flip so live delivery resumes. This reuses the same
`drainWithoutDup` machinery as the Discord→connector ingest leg, applied to the
connector→gateway delivery leg. Connector-authoritative throughout: a gateway can
only flip/drain ITS OWN instance.
> NOT in scope (deferred behaviour): the autonomous idle timer that DECIDES to
> drain, the actual machine suspend, and the NAS suspended-health model. The
> primitive is "when the gateway drains, relay flips to buffered + replays on
> reconnect, with no loss/dup"; WHAT triggers the drain is out of scope.
### 3.3 Wake poke (§5.2)
The other half of the sleep/wake loop: how a SUSPENDED gateway finds out it has
buffered work waiting. A PRIMITIVE — nothing here suspends a machine; it wires
the wake SIGNAL so a future scale-to-zero behaviour layer can rely on "buffered
⇒ wake poked."
- **Registration.** The gateway registers a **wake URL** at enroll/provision —
any reachable URL the connector can GET to wake it (a Fly autostart hostname,
a dashboard host). Self-hosted: `hermes gateway enroll --wake-url <url>` (or
`GATEWAY_RELAY_WAKE_URL` / `gateway.relay_wake_url`). Managed/NAS: stamped into
the container env beside `GATEWAY_RELAY_URL`. Forwarded in the
`/relay/provision` body as `wakeUrl` and stored per-instance on the connector's
secret record (gateway-asserted but safely scoped — same posture as
`instanceId`; the org/tenant stays token-verified, so a gateway can only
register a wake target for ITS OWN instance). DISTINCT from the retired
`gatewayEndpoint`: a **poke target**, not a delivery target.
- **The poke.** When a buffered-only (going-idle) destination receives its FIRST
buffered event, the connector issues a **payload-free, unsigned GET** to that
instance's registered `wakeUrl`, **directly** (NOT NAS-mediated — relay stays
NAS-independent). It carries no tenant data and no inbound: it only says "you
have buffered work, reconnect." Tenant authority is re-established the normal
way when the gateway re-dials (the authenticated WS upgrade), so a leaked/
guessed wake URL can at worst cause a spurious reconnect of ITS OWN instance.
Rate-limited per instance (one poke per cooldown window, not per event), and
best-effort — a failed poke is swallowed; the gateway still drains whenever it
next reconnects on its own. No new frame: the wake is an out-of-band HTTP GET,
not a relay-WS message (the socket is down — that's the whole point).
> NOT in scope (deferred behaviour): the actual machine suspend (Fly
> `autostop:"suspend"`) and the autonomous idle timer that decides to sleep. The
> primitive is "buffered event for a sleeping instance ⇒ its wakeUrl gets poked";
> WHAT makes the instance sleep (and wake-to-serve) is the behaviour layer.
### 3.4 Obligations on a future scale-to-zero behaviour layer
§3.2 and §3.3 ship the **primitives**; this section is the **contract a separate
scale-to-zero behaviour workstream must honour to consume them safely.** It owns
the *decision* to suspend, the actual machine suspend, and the platform/health
model — none of which live here — but it MUST hold these guarantees, which the
primitives assume:
1. **Register a `wakeUrl` before the instance can ever be suspended.** A
suspended instance with no registered `wakeUrl` is a black hole — buffered
inbound never triggers a poke, so it sleeps through its own traffic until
something else reconnects it. The behaviour layer MUST ensure a reachable
wake target is registered (self-hosted: `--wake-url`; managed: stamped) as a
precondition of allowing suspend. A wake URL that is unreachable while the
machine is suspended (e.g. points at the suspended machine itself with no
platform autostart in front) is equivalent to none.
2. **Drain through `going_idle` → await `going_idle_ack` BEFORE tearing down the
socket or suspending.** Never suspend with an un-acked flip in flight. The
ack is the connector's confirmation that delivery for this instance is now
buffered-only; a machine that suspends after sending `going_idle` but before
the ack can drop the inbound that races the flip. The gateway already gates
socket teardown on the ack (Q-5.3c); the suspend step MUST sit *after* a
clean drain completes, not race it.
3. **Keep the NET-NEW reconnect loop live as a precondition of suspend.** The
wake→drain contract is "poke ⇒ the gateway re-dials ⇒ the connector drains on
the reconnect handshake." If the reconnect loop is disabled, a poke lands on a
machine that never re-dials and the buffer strands. The behaviour layer must
not suspend an instance whose relay transport won't reconnect on wake.
4. **Treat suspended ≠ down in the health model (Q-5.3b).** A suspended instance
is healthy-asleep, not failed. The health/monitoring layer MUST distinguish
the two (e.g. via the platform machine-state) so a suspended instance is not
restarted, alerted on, or reaped as unhealthy — that would defeat the suspend
and can race the wake/drain.
5. **The wake poke is best-effort and rate-limited — do not assume exactly-once
or immediate wake.** At most one poke per cooldown window per instance, and a
failed poke is swallowed. The behaviour layer must not rely on the poke as a
guaranteed/prompt signal; correctness still rests on "the gateway drains
whenever it next reconnects." A belt-and-suspenders wake (e.g. a scheduled
job that also reconnects) is the behaviour layer's call, not the primitive's.
6. **Suspend only when genuinely idle — and idle is connector-observable, not
gateway-guessed.** WHAT counts as idle (no in-flight turn + no inbound for N
min) is the behaviour layer's policy, but it must compose with the existing
drain machinery (`gateway_state` running→draining) rather than introduce a
parallel relay-only idle path — the same integration constraint §3.2 places
on `going_idle`.
These are guarantees the behaviour layer OWES the primitives; the primitives owe
the behaviour layer only what §3.2/§3.3 already specify (a flip-on-going_idle,
a durable per-instance buffer + ack-gated reconnect drain, and a poke on the
first buffered event for a flipped instance).
---
## 4. Outbound: action set
The gateway calls the transport with action dicts. Source of truth:
`gateway/relay/transport.py` + `gateway/relay/adapter.py`.
| `op` | Fields | Result |
| --- | --- | --- |
| `send` | `chat_id`, `content`, `reply_to?`, `metadata?` | `{success: bool, message_id?, error?}` |
| `edit` | `chat_id`, `message_id`, `content`, `metadata?` | `{success: bool, error?}` |
| `typing` | `chat_id` | `{success: bool}` |
| `follow_up` | `session_key`, `kind`, `content`, `metadata?` | `{success: bool, message_id?, error?}` |
`get_chat_info(chat_id)` is a separate proxied call returning at least
`{name, type}`. Media actions follow the same envelope shape (deferred to a
later contract revision; additive).
**`follow_up` (A2 capability action).** Some inbound payloads carry a credential
that acts on the **shared** bot identity (e.g. a Discord interaction follow-up
token). Per §6 the connector strips that at the edge and binds it in its
capability vault keyed by the session; it **never reaches the gateway**. To use
it, the gateway issues `follow_up` naming the **session it is already in**
(`session_key`) plus the capability `kind` (e.g. `discord.interaction_token`) —
**never a token**. The connector resolves the real value from its vault,
enforces the tenant match (tenant B can never wield tenant A's capability), and
egresses. `success: false` when the capability is absent/expired or the tenant
doesn't match — the gateway has nothing to retry with, by design (a leaked
gateway holds zero capability material). Source of truth:
`gateway/relay/transport.py` (`send_follow_up`) + `gateway/relay/adapter.py`.
---
## 5. Interrupt (`/stop`) routing
- **Gateway → connector:** `send_interrupt(session_key, reason?)` egresses a
mid-turn `/stop` over the outbound WS. The connector MUST forward it to the
gateway instance running that `session_key` (the routing invariant).
- **Connector → gateway:** an inbound interrupt for a `session_key` is delivered
as an `interrupt_inbound` frame down the gateway's outbound WS (§3 transport
note) — routed cross-instance via the relay bus to whichever instance holds
the socket — and bridged by the adapter's `on_interrupt(session_key, chat_id)`
into the existing per-session interrupt mechanism, cancelling exactly that turn
(siblings untouched).
Both directions ride the gateway's outbound WS: the gateway→connector `/stop`
egresses over it, and the connector→gateway interrupt rides the same `inbound`
back-channel as a normalized event.
---
## 6. Trust boundary & signed-body handling (A2)
**The connector is the sole crypto/identity boundary. The gateway re-validates
nothing.**
Webhook signatures (Discord ed25519, Twilio HMAC, WeCom BizMsgCrypt) are
computed over exact raw bytes, and some payloads are *encrypted* with a shared
secret. The connector fronts a **shared** bot for many tenants and holds every
tenant's platform secrets, so it:
- **verifies / decrypts at the edge** (the only place the secrets live),
- **normalizes** the payload into a tenant-scoped `MessageEvent` (§3),
- **strips any shared-identity capability** out of the payload and binds it in
its capability vault, keyed by the session (see §4 `follow_up`),
- **forwards only the sanitized `MessageEvent`** — never the raw signed body.
The gateway therefore performs **no** platform signature/crypto verification on
the relay path; it trusts the normalized event. This is an enforced invariant on
the gateway side (`tests/gateway/relay/test_relay_sheds_crypto.py`: the relay
package imports/calls no platform-crypto).
**Why not "forward the signed body byte-for-byte so the gateway re-validates"?**
That earlier model is incoherent under an untrusted, disposable tenant gateway:
- Re-validating Twilio HMAC / WeCom crypto would require handing the gateway the
**shared signing secret** — which is itself the leak, and on a shared bot it's
a *cross-tenant* leak.
- WeCom payloads are encrypted with the shared secret; the connector must decrypt
at the edge just to route, so forwarding ciphertext would again require giving
the gateway the secret.
- A Discord interaction token lives **inside** the signed JSON body — you cannot
both preserve the bytes and strip the credential; they are the same bytes.
So byte-preservation is abandoned deliberately: the connector re-serializes the
sanitized event and the gateway trusts it. This also unifies the passthrough and
relay planes — both are "verify at the edge → emit a normalized event," differing
only in transport. See `docs/capability-trust-boundary.md` (connector repo:
`gateway-gateway`) for the full A2 rationale and the connector-side vault.
### 6.1 Channel authentication (the connector⇄gateway link itself)
A2 makes the connector the sole holder of platform secrets while the gateway may
be **customer-managed and internet-exposed**, so the connector⇄gateway channel
is itself authenticated. The gateway holds an enrollment- or provision-issued
**per-gateway secret** (`hermes gateway enroll` → connector `/relay/enroll`, or
managed self-provision → `/relay/provision`) that authenticates its outbound WS
upgrade. It is an HMAC-SHA256 scheme with a multi-secret rotation verify list
(gateway side: `gateway/relay/auth.py`; connector side:
`src/core/relayAuthToken.ts`).
| Leg | Credential | Mechanism |
|-----|-----------|-----------|
| Gateway → connector WS upgrade | per-gateway secret | An `Authorization` bearer header on the `/relay` upgrade. The token is `base64url(payload:exp:sig)` where `payload = gatewayId` and `sig = HMAC(payload:exp, secret)`. Connector verifies and rejects the upgrade (**close 4401**) on mismatch/absence/revocation. The authenticated tenant comes from the connector's store, never the `hello` frame. |
| Connector → gateway inbound (`inbound` / `interrupt_inbound` frames) | — (rides the authenticated WS) | Inbound is pushed down the gateway's already-authenticated outbound socket (§3), so no per-message signature is needed. A **per-tenant delivery key** is still issued at enroll/provision and retained for forward-compat, but is no longer used to sign inbound. |
This is the **channel** authenticator — distinct from platform crypto, which the
relay path still sheds entirely (§6). The gateway holds zero platform secrets;
the per-gateway secret authenticates only the connector link. Full threat model +
enrollment/rotation/kill-switch design: `docs/connector-gateway-auth-design.md`
(connector repo).
---
## 7. Per-instance delivery & the management plane (Phase 6)
Phases 15 treat the connector as a single-tenant front: inbound events for a
tenant fan out to that tenant's gateway socket(s). **Phase 6 makes delivery
per-INSTANCE** — a shared bot can front many users/agents in one tenant (one
Discord guild, one Telegram bot) without cross-delivery — and adds a small
**management plane** the agent (or a managed Portal) uses to declare who-sees-what
and what's-relevant. All of this lives **connector-side**; the gateway's only new
responsibility is to **declare its relevance policy** at boot (§7.3).
### 7.1 The delivery gate (connector-side, informational)
For each inbound event the connector decides which instances receive it by
composing three AND-ed filters. The gateway does not implement these — they run
in the connector — but they define the delivery semantics the gateway relies on:
| Layer | Question | Source of truth |
| --- | --- | --- |
| **owner / scope ∧ principal** | May this instance *see* this author here? | per-user `user_id → instance` bindings (the owner floor) + per-instance `(guild, channel)` scope grants + an `owner-only` / `allow-list` / `any` principal policy. |
| **visibility floor** | Can the instance's bound owner actually `VIEW_CHANNEL` this in Discord? | live Discord ACL (effective permissions), fail-closed. Narrows an over-broad scope grant downward. |
| **relevance** | *Given* it may see it, should the agent engage? | the relevance policy declared in §7.3 (address-gating / free-response / allow-bots). |
The composition only ever **narrows** delivery (`deliver ⇔ authorized ∧ visible
∧ relevant`); the **owner floor bypasses the relevance layer** (an author's own
message always reaches their own instance — you don't @mention your own agent).
A message authored by an unbound user reaches no instance (fail-closed). The
full design + invariants live in the connector repo
(`NousResearch/gateway-gateway`); this section is the gateway-facing summary.
### 7.2 Management routes (connector-side, authenticated)
The connector mounts authenticated management routes. They share the **same
dual-auth** as the WS upgrade: either a managed NAS-signed `aud=agent:{instanceId}`
RS256 JWT, **or** the gateway's own per-gateway secret bearer (§6.1
`make_upgrade_token`). In both cases the connector resolves the authoritative
`{tenant, instanceId}` from its **stored** record — **never** from the request
body (a body-asserted `instanceId` is ignored).
| Route | Purpose |
| --- | --- |
| `POST /manage/link` | Issue a short-lived code to bind a platform account to the authenticated instance (the `/link <code>` flow; the connector reads the authentic `user_id` off the inbound event). |
| `POST /manage/scope`, `/manage/scope/release` | Claim / release a `(guild, channel)` scope for the authenticated instance. A channel is owned by at most one instance (non-overlap is a PK constraint). |
| `POST /manage/principal` | Set the instance's principal policy (`owner-only` \| `allow-list` \| `any`). |
| `POST /manage/dm-default` | Set the user's DM-default instance (DM tie-break when a user linked more than one). |
| `POST /relay/policy` | Declare the instance's **relevance policy** (§7.3). |
These are connector-owned (the management plane is not part of the gateway's
agent path); the gateway only calls `POST /relay/policy` (§7.3). The others are
driven by the managed Portal / `hermes` CLI.
### 7.3 Relevance-policy declaration (the gateway's responsibility)
The relevance layer (§7.1) is the per-tenant parity for the gateway's own
behaviour knobs (`require_mention`, `free_response_channels`,
`{PLATFORM}_ALLOW_BOTS`). So the **same** behaviour governs relay delivery, the
gateway projects those knobs into a **platform-agnostic** policy and POSTs it to
`POST /relay/policy` at boot (after its per-gateway secret is resolved).
Body (`gateway/relay/__init__.py` `relay_relevance_policy()``send_relay_policy()`):
| Field | Type | Projected from | Meaning |
| --- | --- | --- | --- |
| `platform` | string | the fronted platform (`relay_platform_identity`) | which platform this policy applies to. |
| `requireAddress` | bool | `require_mention` | a non-owner message must @mention / reply-to the bot to be relevant. |
| `freeResponseScopes` | string[] | `free_response_channels` | scope (channel) ids where `requireAddress` is waived. Same scope vocabulary as §7.1's scope grants. |
| `allowOtherBots` | bool | `{PLATFORM}_ALLOW_BOTS ∈ {mentions, all}` | admit bot-authored messages (default off). |
Auth is the per-gateway upgrade token (§6.1), so the connector attaches the
policy to the authenticated instance. The gateway is the **source of truth** and
re-declares **every boot** (a full replace, mirroring the `routeKeys` upsert at
provision — self-healing). When the projected policy is all-default the gateway
sends nothing (the connector's absent-row default already matches). The POST is
**fail-soft**: a failure logs and boot proceeds — relevance is an optimization
layered on the authorization gate (§7.1), never a boot dependency. There is **no
new gateway inbound surface** and **no new credential** — it reuses the
per-gateway secret and the same host as `/relay/provision`.
> A relevance drop happens **before** the connector wakes a scaled-to-zero agent
> (Phase 5), so excluded chatter never spins an agent up — relevance is the
> primary scale-to-zero lever as well as a correctness filter.
---
## 8. Versioning policy
- `contract_version` is an int; bump **only** for additive changes during the
experimental phase (new optional fields, new `op`s).
- A breaking change (renamed/removed field, changed semantics) requires a
coordinated update of both repos and a version bump.
- The connector's first PR references the commit SHA of this file it implements
against.