mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-28 11:32:22 +00:00
Phase 7 Unit 7d-B. When an operator opts an instance OUT of the Team Gateway
relay (Unit 7b deprovision), the connector revokes the per-gateway secret and
closes the gateway's WS with 4401. The reconnect supervisor previously treated
EVERY close as retryable, so the live process spun "retrying 4401" forever and
the dashboard showed a red error — opt-out looked like a failure.
Now a 4401 close that arrives AFTER a successful handshake is recognized as a
terminal credential revocation:
- ws_transport.py: track `_handshake_succeeded` (set when a descriptor is
received); on a 4401 close after a prior success, latch `auth_revoked` and do
NOT spawn the reconnect supervisor. A 4401 BEFORE any successful handshake
stays retryable (cold-start / not-yet-provisioned race, not a revocation).
New `auth_revoked` property + a websockets-version-safe close-code reader
(prefers `.rcvd`/`.sent` Close frames; `.code` is deprecated in websockets 13+).
- adapter.py: a revocation monitor turns `transport.auth_revoked` into a clean,
NON-retryable `relay_disabled` fatal and notifies the gateway's fatal-error
handler (so the adapter is removed and NOT queued for reconnection — the
credential is dead until the instance is recreated). Monitor is cancelled on
disconnect; only started when the transport exposes `auth_revoked` (prod WS).
- run.py: `_handle_adapter_fatal_error` maps the `relay_disabled` code to a
`disabled` platform_state (not `fatal`/`retrying`).
- web: PlatformsCard renders the `disabled` state with a neutral outline badge,
a PowerOff icon, and muted (not destructive-red) text + message. New optional
`status.disabled` i18n string ("Disabled").
Also bundles the Phase 7 contract-doc update (this doc is authoritative in
hermes-agent): docs/relay-connector-contract.md gains an "Author-first
resolution + the account-link (DM) path" section documenting the
multi-tenant-guild rule (D-7.2 — route by authenticated author binding, never by
guild; unlinked → fail-closed), the `/link <code>` DM flow, and the
connector-authoritative opt-out + terminal-4401 behavior this PR implements.
Tests: +2 ws_transport (4401-after-handshake terminal / no-reconnect;
4401-before-handshake stays retryable) and +2 adapter (revocation → non-retryable
relay_disabled fatal + handler fired; no-revocation → no fatal). 138 relay tests
pass (incl. the contract-doc conformance test); ruff clean; web tsc clean.
Phase 7 Unit 7d-B (relay-adapter solo lane). Q17 → Option 2; Option 3 (live
de-register, no recreate) + the restart-re-provision hole deferred post-alpha.
567 lines
34 KiB
Markdown
# Relay ↔ Connector Contract (v1, EXPERIMENTAL)

> **Status:** EXPERIMENTAL. This contract MAY CHANGE without a deprecation
> cycle until at least two real Class-1 platforms (Discord + Telegram) have
> validated it. Evolution during the experimental phase is **additive-only**,
> gated by `contract_version`. A breaking change updates both repos in lockstep.

This document is the formal interface between the **Hermes gateway** (Python,
`gateway/relay/`) and the **connector** (Node/TypeScript,
`NousResearch/gateway-gateway`). The connector implementer's first action is to
read this file.

The gateway runs a generic `RelayAdapter` that dials **out** to the connector,
receives a `CapabilityDescriptor` at handshake, then exchanges normalized
`MessageEvent`s (inbound) and actions (outbound) over a per-turn bidirectional
WebSocket. The gateway never learns which concrete platform is fronting it; the
connector owns all platform-specific socket/identity logic.

---

## 1. Handshake

1. Gateway opens the transport (`connect`).
2. Gateway calls `handshake()`; connector returns a `CapabilityDescriptor`
   (section 2) describing the platform this adapter instance fronts.
3. Gateway configures the adapter from the descriptor (char limit, length unit,
   draft/edit/thread/markdown capabilities) and registers an inbound handler.
4. Connector then streams inbound events and accepts outbound actions.

`contract_version` (currently `1`) is carried in the descriptor. The gateway
ignores unknown descriptor fields (forward-compat) and fills missing optional
fields from defaults.

---

## 2. CapabilityDescriptor (handshake payload)

JSON object. Source of truth: `gateway/relay/descriptor.py`.

| Field | Type | Required | Meaning |
| --- | --- | --- | --- |
| `contract_version` | int | yes | Contract version (additive-only within a version). |
| `platform` | string | yes | Platform name (e.g. `"discord"`, `"telegram"`). |
| `label` | string | yes | Human-readable label. |
| `max_message_length` | int | yes | Char limit; gateway exposes as `MAX_MESSAGE_LENGTH`. 0 → treat as 4096. |
| `supports_draft_streaming` | bool | yes | Native draft-streaming preview support. |
| `supports_edit` | bool | yes | Edit-based streaming possible; if false, consumer degrades to one-message-per-segment. |
| `supports_threads` | bool | yes | `create_handoff_thread` capability. |
| `markdown_dialect` | string | yes | `"plain"`, `"markdown_v2"`, `"discord"`, … (drives `supports_code_blocks`). |
| `len_unit` | string | yes | `"chars"` (builtin len) or `"utf16"` (Telegram UTF-16 code units). |
| `emoji` | string | no | Display emoji (default 🔌). |
| `platform_hint` | string | no | System-prompt platform hint. |
| `pii_safe` | bool | no | Redact PII in session descriptions. |

Most fields are a projection of the gateway's existing `PlatformEntry`; the
runtime-only fields (`len_unit`, `supports_*`, `markdown_dialect`) come from the
live platform adapter's capability methods.

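
The forward-compat rule from §1 (ignore unknown fields, default the optional ones) and the table above can be sketched as a small parser. This is an illustrative sketch, not the code in `gateway/relay/descriptor.py`: the names `parse_descriptor` and `message_length` are hypothetical, and the defaults shown for `platform_hint` and `pii_safe` are assumptions, since the table states a default only for `emoji`.

```python
from dataclasses import dataclass, fields
from typing import Any

DEFAULT_MAX_MESSAGE_LENGTH = 4096  # the table's "0 -> treat as 4096" rule


@dataclass(frozen=True)
class CapabilityDescriptor:
    # Required fields (section 2 table).
    contract_version: int
    platform: str
    label: str
    max_message_length: int
    supports_draft_streaming: bool
    supports_edit: bool
    supports_threads: bool
    markdown_dialect: str
    len_unit: str
    # Optional fields, filled from defaults when absent.
    emoji: str = "🔌"
    platform_hint: str = ""   # assumed default; the table does not state one
    pii_safe: bool = False    # assumed default; the table does not state one


def parse_descriptor(payload: dict[str, Any]) -> CapabilityDescriptor:
    """Build a descriptor, dropping unknown keys (forward-compat)."""
    known = {f.name for f in fields(CapabilityDescriptor)}
    kwargs = {k: v for k, v in payload.items() if k in known}
    if kwargs.get("max_message_length") == 0:
        kwargs["max_message_length"] = DEFAULT_MAX_MESSAGE_LENGTH
    # Raises TypeError if a required field is missing.
    return CapabilityDescriptor(**kwargs)


def message_length(text: str, len_unit: str) -> int:
    """Measure text in the unit the descriptor names."""
    if len_unit == "utf16":
        return len(text.encode("utf-16-le")) // 2  # UTF-16 code units
    return len(text)  # "chars": builtin len
```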

---

## 3. Inbound: `MessageEvent` envelope

The connector normalizes each platform wire event into a `MessageEvent`
(`gateway/platforms/base.py`) and delivers it to the gateway. **Inbound is
delivered over the gateway's OUTBOUND `/relay` WebSocket** (see the transport
note below) — the connector pushes an `inbound` frame down the socket the
gateway already dialed. The gateway keys the session via `build_session_key()`
from the embedded `SessionSource` — so populating the right discriminators is
the single highest-correctness responsibility of the connector.

### Inbound transport (WS back-channel, not HTTP)

The gateway dials **out** to the connector's `/relay` WebSocket for the
handshake + outbound actions (§4) + its own `/stop` egress (§5). Inbound rides
the **same socket** in the other direction: the connector pushes an `inbound`
frame (and `interrupt_inbound` for §5) down the gateway's outbound WS. There is
**no gateway-side inbound HTTP endpoint** — a gateway need not (and, when hosted,
cannot) expose any inbound port; everything flows over the connection it
initiated.

**Multi-instance routing.** The connector instance that owns a platform's socket
(and thus produces inbound events) is generally **not** the instance the gateway
dialed its outbound WS into. The producing instance therefore publishes the
event on the connector's internal **relay bus** (Redis pub/sub; `RelayBus` in
`src/core/relayBus.ts`) keyed by tenant. Every connector instance subscribes and
routes each message to its **local** sessions for that tenant
(`RelayServer.routeBusMessage`); the single instance that actually holds the
gateway's socket delivers it, and instances with no local session for the tenant
no-op. Cross-instance delivery is thus an in-cluster Redis hop, not a public
HTTP call.

Frames (connector → gateway, over the WS):

- `{"type":"inbound", "event": <MessageEvent>, "bufferId"?}`
- `{"type":"interrupt_inbound", "session_key", "chat_id"}` (§5)
- `{"type":"passthrough_forward", "forward": <PassthroughForward>, "bufferId"?}` (§5.1)

`PassthroughForward` is the wire form of a forwarded passthrough-plane request
(Class-2/3 webhooks — Discord interactions, Twilio): `{platform, botId, method,
path, headers: [[k,v],…], bodyB64}`. The body is base64-encoded so arbitrary
bytes survive the newline-delimited-JSON transport; the gateway base64-decodes
back to the exact bytes the connector forwarded (the connector already verified
the provider signature and stripped any shared-identity credential at the edge —
§6 — so the gateway re-processes a sanitized, token-free body and acts on it via
the token-less `follow_up` path). See §3.1.

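
As a rough illustration of the three frame shapes above, the gateway-side handling could look like the following. This is a sketch under stated assumptions: `dispatch_frame` and `decode_passthrough` are hypothetical names, the standard base64 alphabet is assumed for `bodyB64` (the contract says only "base64-encoded"), and skipping unknown frame types is this sketch's choice rather than a documented rule.

```python
import base64
import json
from typing import Any


def decode_passthrough(forward: dict[str, Any]) -> dict[str, Any]:
    """Recover the exact forwarded bytes from the PassthroughForward wire form."""
    return {
        "platform": forward["platform"],
        "bot_id": forward["botId"],
        "method": forward["method"],
        "path": forward["path"],
        "headers": [(k, v) for k, v in forward["headers"]],
        "body": base64.b64decode(forward["bodyB64"]),  # standard alphabet assumed
    }


def dispatch_frame(line: str) -> tuple[str, Any]:
    """Classify one newline-delimited JSON frame pushed by the connector."""
    frame = json.loads(line)
    kind = frame.get("type")
    if kind == "inbound":
        return "inbound", frame["event"]
    if kind == "interrupt_inbound":
        return "interrupt", (frame["session_key"], frame["chat_id"])
    if kind == "passthrough_forward":
        return "passthrough", decode_passthrough(frame["forward"])
    return "ignored", kind  # unknown frame types are skipped in this sketch
```

Because the body travels as base64 inside a single JSON line, bytes such as `\n` or `\x00` in a webhook payload cannot break the line framing.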
**Trust.** The WS upgrade is authenticated with the gateway's per-gateway secret
(§6.1), so the channel is trusted end to end — inbound frames are not separately
HMAC-signed (the authenticated socket subsumes the per-delivery origin proof the
old HTTP path needed). The relay-bus hop is inside the connector trust domain
(same as the lease/buffer/capability stores).

> Earlier drafts of this contract delivered inbound over a signed **HTTP POST**
> to a `gatewayEndpoint` (`HttpGatewayDelivery` + a gateway-side
> `inbound_receiver`), HMAC-signed with a per-tenant delivery key. That required
> every gateway to expose a reachable inbound URL — impossible for hosted
> gateways, which have no public IP. The WS back-channel above replaces it; the
> per-tenant delivery key is retained at provision for forward-compat but is no
> longer used for inbound. The **passthrough plane** (Class-2/3 webhooks like
> Discord interactions / Twilio) historically still used `gatewayEndpoint` for
> its post-ACK forward; Phase 5 §5.1 moves that forward onto the WS too (the
> `passthrough_forward` frame above), so a hosted gateway needs zero public
> inbound surface and `gatewayEndpoint` is retired once the cutover lands.

### 3.1 Passthrough-plane forward (§5.1)

The passthrough plane answers the provider's latency-critical ACK at the
connector EDGE (e.g. Discord's deferred interaction response within ~3s), then
does a **fire-and-forget** forward of the real request to the gateway. That
forward needs no response back (the provider was already satisfied), so it rides
the same outbound WS as `inbound` via a `passthrough_forward` frame rather than
an HTTP POST. The gateway processes the decoded request through its normal agent
path (a Discord interaction is decoded to a `MessageEvent` and handled like a
message; the reply egresses over the outbound / `follow_up` path). `bufferId` is
present when the forward was buffered (Phase 5 §5.3 buffered-only flip) and the
gateway acks it after durable handoff.

### SessionSource fields (the wire surface)

Source of truth: `SessionSource.to_dict()` in `gateway/session.py`. These are
every key the gateway accepts on the wire. `platform`, `chat_id`, `chat_type`,
`user_id`, `user_name`, `thread_id`, `chat_name`, and `chat_topic` are always
present (may be `null`); the rest are included only when set.

| Field | Type | Always sent | Meaning |
| --- | --- | --- | --- |
| `platform` | string | yes | Platform name (matches the descriptor's `platform`). |
| `chat_id` | string | yes | Primary conversation id (channel/chat). Session-key discriminator. |
| `chat_type` | string | yes | `dm` / `group` / `channel` / `thread` / `forum`. |
| `chat_name` | string\|null | yes | Human-readable chat name. |
| `user_id` | string\|null | yes | Message author id. Session-key discriminator. |
| `user_name` | string\|null | yes | Author display name. |
| `thread_id` | string\|null | yes | Thread/forum-topic id when in a thread. Session-key discriminator. |
| `chat_topic` | string\|null | yes | Channel topic/description (Discord, Slack). |
| `user_id_alt` | string | no | Platform-specific stable alt id (Signal UUID, Feishu union_id). |
| `chat_id_alt` | string | no | Alternate chat id (e.g. Signal group internal id). |
| `guild_id` | string | no | Discord guild / Slack workspace / Matrix server scope. **REQUIRED for Discord server isolation.** Session-key discriminator. |
| `parent_chat_id` | string | no | Parent channel when `chat_id` refers to a thread. |
| `message_id` | string | no | Id of the triggering message (for pin/reply/react). |

> `is_bot` (author-is-a-bot/webhook classification) exists on the gateway-side
> dataclass but is **intentionally NOT on the wire** in v1 — it is not part of
> `to_dict()`. Do not add it to the connector's `SessionSource` until it is
> first added here and to `to_dict()` (additive bump).

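
The "always present, may be `null`" versus "included only when set" split can be sketched as below. This is a simplified stand-in for the real dataclass in `gateway/session.py`, written only from the table above; the field defaults and the truthiness test used for "set" are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Optional

ALWAYS_SENT = ("platform", "chat_id", "chat_type", "chat_name",
               "user_id", "user_name", "thread_id", "chat_topic")
OPTIONAL = ("user_id_alt", "chat_id_alt", "guild_id",
            "parent_chat_id", "message_id")


@dataclass
class SessionSource:
    platform: str
    chat_id: str
    chat_type: str
    chat_name: Optional[str] = None
    user_id: Optional[str] = None
    user_name: Optional[str] = None
    thread_id: Optional[str] = None
    chat_topic: Optional[str] = None
    user_id_alt: Optional[str] = None
    chat_id_alt: Optional[str] = None
    guild_id: Optional[str] = None
    parent_chat_id: Optional[str] = None
    message_id: Optional[str] = None
    is_bot: bool = False  # gateway-side only; deliberately not serialized

    def to_dict(self) -> dict[str, Any]:
        # Always-sent keys appear even when None (serialized as JSON null).
        wire: dict[str, Any] = {k: getattr(self, k) for k in ALWAYS_SENT}
        for key in OPTIONAL:
            value = getattr(self, key)
            if value:  # "included only when set"
                wire[key] = value
        return wire
```

A connector-side normalizer can be checked against this shape: the eight always-sent keys must be present on every event, and `is_bot` must never appear.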
### SessionSource discriminators per platform

| Platform | chat_id | chat_type | user_id | thread_id | guild_id |
| --- | --- | --- | --- | --- | --- |
| **Discord** | channel id | `dm`/`group`/`thread` | author id | thread channel id (threads) | **guild id** (REQUIRED for server isolation) |
| **Telegram** | chat id | `dm`/`group`/`forum` | from id | forum topic id (forums) | — |

**Get Discord's `guild_id` wrong and two servers collide into one session.**
This is the #1 High-severity risk. The gateway's `build_session_key()` is the
conformance oracle: for a given `SessionSource`, the connector's normalization
must produce the same key the Python adapter would. (The Phase-1 stub tests
assert known-input → known-key.)

### Bot identity vs tenant (single-bot consolidation, Appendix A)

The envelope carries the **originating bot identity** as a field **distinct from
tenant**. Tenant is resolved from the event's own discriminator (Discord
`guild_id`, Telegram `chat_id`, webhook path/subdomain) — **never** from which
token/socket/process delivered it. This keeps one shared bot able to front many
tenants (Phase 6) without overloading an existing field.

### Author-first resolution + the account-link (DM) path (Phase 7)

Phase 7 adds **self-serve, per-user onboarding to a shared bot**, which changes
*which* discriminator resolves the instance for a routed inbound message — and
adds a management path for users to bind their own account.

**Author-first resolution (the multi-tenant-guild rule, D-7.2).** A single
Discord guild may hold **many** tenants — different members each linked to their
own agent. So for delivery the connector resolves the destination instance from
the **authenticated author binding** (`user_instance_binding`, keyed by
`(tenant, platform, platform_user_id)` via `resolveByUser`), **NOT** by a
guild→instance route. Concretely:

- A routed message authored by a **linked** user reaches **only that user's**
  instance — even when a second linked user in the **same guild** is served by a
  different instance (each reaches only their own).
- A message authored by an **unlinked** user resolves to **no** instance and is
  dropped (**fail-closed** — never broadcast to the guild's other tenants).
- The author id used is the **authentic `user_id` off the observed event**, the
  same `SessionSource.user_id` documented above — never a value asserted by a
  gateway or carried in a management frame.

This is the per-`user_id` owner-only routing the connector enforces in
`WsGatewayDelivery` (the gateway-side multi-tenant-guild E2E driver
`gateway_multitenant_guild_driver.py` is the cross-repo oracle).

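
The three rules above reduce to a single lookup keyed by the observed author. The sketch below models that in Python for illustration only: the real logic lives in the TypeScript connector, `route_inbound` is a hypothetical name, and the `event["source"]` layout is an assumption about where the `SessionSource` sits in the envelope.

```python
from typing import Optional

# (tenant, platform, platform_user_id) -> instance id
Bindings = dict[tuple[str, str, str], str]


def resolve_by_user(bindings: Bindings, tenant: str, platform: str,
                    user_id: Optional[str]) -> Optional[str]:
    """Return the instance bound to the observed author, or None."""
    if user_id is None:
        return None
    return bindings.get((tenant, platform, user_id))


def route_inbound(bindings: Bindings, tenant: str, event: dict) -> list[str]:
    """Author-first routing: at most one destination, never a guild fan-out."""
    source = event["source"]  # assumed location of the SessionSource dict
    instance = resolve_by_user(
        bindings, tenant, source["platform"], source.get("user_id")
    )
    # guild_id is deliberately not consulted; unlinked authors are dropped.
    return [instance] if instance is not None else []
```

Two linked users in the same guild each get a one-element list naming their own instance, and an unlinked author gets an empty list, which is the fail-closed drop.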
**The account-link (DM) path.** A user binds their account to an instance with a
one-time code, redeemed by DMing the shared bot:

1. The owner triggers a link from the Portal (or a self-hosted CLI). The
   connector mints a short-lived **link code** for the **authenticated**
   instance (`POST /manage/link`; instanceId comes from the caller's principal —
   a NAS-signed `aud=agent:{instanceId}` token or the instance's own per-gateway
   secret — **never** the request body).
2. The user sends `/link <code>` as a **direct message** to the shared bot from
   the account they want to bind.
3. The connector's inbound observer **consumes** that DM (it is not routed to any
   agent) and writes the `user_instance_binding` using the **authentic
   `user_id`** off the observed DM event. From then on, author-first resolution
   routes that user's messages to the bound instance.

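
The mint-then-redeem flow in the steps above might be modelled as follows. Everything here is a hypothetical in-memory sketch: the class name, the 600-second TTL, the code format, and the choice to consume a `/link` DM even when its code is invalid are assumptions, not documented connector behaviour.

```python
import secrets
import time
from typing import Optional


class LinkCodes:
    """In-memory model of link-code mint plus one-time redeem."""

    def __init__(self, ttl_seconds: float = 600.0) -> None:
        self._ttl = ttl_seconds  # assumed lifetime; the contract says "short-lived"
        self._codes: dict[str, tuple[str, float]] = {}  # code -> (instance, expiry)
        self.bindings: dict[tuple[str, str, str], str] = {}

    def mint(self, authenticated_instance_id: str,
             now: Optional[float] = None) -> str:
        # The instance id comes from the verified principal, never a request body.
        now = time.time() if now is None else now
        code = secrets.token_urlsafe(8)
        self._codes[code] = (authenticated_instance_id, now + self._ttl)
        return code

    def observe_dm(self, tenant: str, platform: str, user_id: str, text: str,
                   now: Optional[float] = None) -> bool:
        """Return True when the DM was consumed as a /link command."""
        if not text.startswith("/link "):
            return False
        now = time.time() if now is None else now
        code = text[len("/link "):].strip()
        entry = self._codes.pop(code, None)  # pop makes the code one-time
        if entry is not None and now <= entry[1]:
            # Bind using the user_id observed on the DM event itself.
            self.bindings[(tenant, platform, user_id)] = entry[0]
        return True  # consumed either way in this sketch; never routed to an agent
```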
**Opt-out is connector-authoritative.** Deprovisioning an instance
(`POST /manage/deprovision`) drops its author bindings (so its users stop
resolving to it) **and** revokes its per-gateway secret (so its socket can no
longer authenticate — the next WS upgrade is closed **4401**). A gateway that
sees a **4401 close after a previously-successful handshake** treats it as a
terminal revocation: it stops reconnecting and reports the relay platform as
**disabled** (not a retryable error). A 4401 *before* any successful handshake
stays retryable (a cold-start / not-yet-provisioned race, not a revocation).

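
The terminal-versus-retryable decision comes down to two flags. A minimal sketch, assuming a tracker object that the transport updates (the `CloseTracker` class and its return strings are hypothetical; the flag names mirror the `_handshake_succeeded` and `auth_revoked` names used in the accompanying change description):

```python
from dataclasses import dataclass

AUTH_CLOSE_CODE = 4401


@dataclass
class CloseTracker:
    handshake_succeeded: bool = False  # set once a descriptor has been received
    auth_revoked: bool = False         # latched; never cleared in this sketch

    def on_descriptor(self) -> None:
        self.handshake_succeeded = True

    def on_close(self, code: int) -> str:
        """Return 'disabled' for a terminal revocation, else 'retry'."""
        if code == AUTH_CLOSE_CODE and self.handshake_succeeded:
            self.auth_revoked = True
        return "disabled" if self.auth_revoked else "retry"
```

A tracker that never completed a handshake keeps returning `"retry"` on 4401, which covers the cold-start race; one that has completed a handshake latches to `"disabled"` on its first 4401.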
### 3.2 Going-idle / buffered-flip primitive (§5.3)

A scale-to-zero PRIMITIVE (not the behaviour — nothing here decides to sleep or
suspends a machine; a later workstream consumes these frames). It lets a gateway
enter a drain/idle transition without losing inbound that arrives while it is
gone, by making the connector buffer for that instance and replay on reconnect.

Three frames (all keyed by the connection's **authenticated** per-instance id —
read off the stored secret record at the WS upgrade, never asserted in a frame):

- `{"type":"going_idle"}` (gateway → connector) — emitted as part of the
  gateway's EXISTING drain transition (the adapter sends it before tearing down
  the socket). Asks the connector to flip this instance to **buffered-only**.
- `{"type":"going_idle_ack"}` (connector → gateway) — the connector has flipped:
  live delivery has stopped and subsequent inbound for this instance buffers
  durably. The gateway **stays serving until this ack** (so an event landing in
  the flip window is delivered live, not lost — the same SUBSCRIBE-before-serve
  ordering discipline as the bus). Only after the ack is it safe to close.
- `{"type":"inbound_ack", "bufferId"}` (gateway → connector) — durable receipt of
  a buffered `inbound` delivery (which carries its `bufferId`) replayed on
  reconnect. The connector acks the buffer entry only after this, giving
  drain-without-dup on the **delivery leg**: an instance that dies mid-drain
  redelivers exactly the unacked tail; an acked entry never redelivers.

**Buffer + drain.** While flipped, the connector appends inbound to a durable
per-instance delivery-leg buffer (`delivery:<instanceId>`) instead of pushing it
live. On the gateway's **reconnect** (a NET-NEW reconnect loop re-dials +
re-handshakes after an unexpected close), the new handshake triggers the
connector to drain that backlog over the new socket **in order, ack-gated**,
then clear the flip so live delivery resumes. This reuses the same
`drainWithoutDup` machinery as the Discord→connector ingest leg, applied to the
connector→gateway delivery leg. Connector-authoritative throughout: a gateway can
only flip/drain ITS OWN instance.

> NOT in scope (deferred behaviour): the autonomous idle timer that DECIDES to
> drain, the actual machine suspend, and the NAS suspended-health model. The
> primitive is "when the gateway drains, relay flips to buffered + replays on
> reconnect, with no loss/dup"; WHAT triggers the drain is out of scope.

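
The flip, buffer, and replay cycle of this section can be modelled with a small in-memory class. This is a sketch, not the connector's `drainWithoutDup` code: the class and method names are hypothetical, a dict stands in for the durable buffer, the `b1`, `b2` id format is invented, and the replay returns the whole unacked tail at once rather than gating each send on the previous ack.

```python
from collections import OrderedDict
from typing import Any, Callable


class DeliveryBuffer:
    """In-memory stand-in for one instance's delivery-leg buffer."""

    def __init__(self) -> None:
        self.buffered_only = False
        self._entries: "OrderedDict[str, Any]" = OrderedDict()
        self._next_id = 0

    def on_going_idle(self) -> dict:
        self.buffered_only = True          # flip first...
        return {"type": "going_idle_ack"}  # ...then ack

    def deliver(self, event: Any, send_live: Callable[[dict], None]) -> None:
        if not self.buffered_only:
            send_live({"type": "inbound", "event": event})
            return
        self._next_id += 1
        self._entries[f"b{self._next_id}"] = event

    def on_reconnect(self) -> list[dict]:
        """Frames to send after a new handshake: the unacked tail, in order."""
        if not self._entries:
            self.buffered_only = False     # nothing to drain: resume live delivery
        return [{"type": "inbound", "event": e, "bufferId": b}
                for b, e in self._entries.items()]

    def on_inbound_ack(self, buffer_id: str) -> None:
        self._entries.pop(buffer_id, None)  # acked entries never redeliver
        if not self._entries:
            self.buffered_only = False      # backlog drained: clear the flip
```

If the gateway dies after acking only the first entry, the next `on_reconnect()` returns just the remaining entries, which is the "redelivers exactly the unacked tail" property.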
### 3.3 Wake poke (§5.2)

The other half of the sleep/wake loop: how a SUSPENDED gateway finds out it has
buffered work waiting. A PRIMITIVE — nothing here suspends a machine; it wires
the wake SIGNAL so a future scale-to-zero behaviour layer can rely on "buffered
⇒ wake poked."

- **Registration.** The gateway registers a **wake URL** at enroll/provision —
  any reachable URL the connector can GET to wake it (a Fly autostart hostname,
  a dashboard host). Self-hosted: `hermes gateway enroll --wake-url <url>` (or
  `GATEWAY_RELAY_WAKE_URL` / `gateway.relay_wake_url`). Managed/NAS: stamped into
  the container env beside `GATEWAY_RELAY_URL`. Forwarded in the
  `/relay/provision` body as `wakeUrl` and stored per-instance on the connector's
  secret record (gateway-asserted but safely scoped — same posture as
  `instanceId`; the org/tenant stays token-verified, so a gateway can only
  register a wake target for ITS OWN instance). DISTINCT from the retired
  `gatewayEndpoint`: a **poke target**, not a delivery target.
- **The poke.** When a buffered-only (going-idle) destination receives its FIRST
  buffered event, the connector issues a **payload-free, unsigned GET** to that
  instance's registered `wakeUrl`, **directly** (NOT NAS-mediated — relay stays
  NAS-independent). It carries no tenant data and no inbound: it only says "you
  have buffered work, reconnect." Tenant authority is re-established the normal
  way when the gateway re-dials (the authenticated WS upgrade), so a leaked/
  guessed wake URL can at worst cause a spurious reconnect of ITS OWN instance.
  Rate-limited per instance (one poke per cooldown window, not per event), and
  best-effort — a failed poke is swallowed; the gateway still drains whenever it
  next reconnects on its own. No new frame: the wake is an out-of-band HTTP GET,
  not a relay-WS message (the socket is down — that's the whole point).

> NOT in scope (deferred behaviour): the actual machine suspend (Fly
> `autostop:"suspend"`) and the autonomous idle timer that decides to sleep. The
> primitive is "buffered event for a sleeping instance ⇒ its wakeUrl gets poked";
> WHAT makes the instance sleep (and wake-to-serve) is the behaviour layer.

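
The rate-limited, best-effort nature of the poke can be sketched like this. The `WakePoker` class is hypothetical, the 30-second cooldown is an assumed value (the contract names no number), and the HTTP GET is injected as a callable so the sketch runs without a network. As a simplification it considers every buffered event and lets the cooldown suppress the repeats, rather than tracking "first buffered event" separately.

```python
from typing import Callable, Optional


class WakePoker:
    def __init__(self, http_get: Callable[[str], None],
                 cooldown_seconds: float = 30.0) -> None:
        self._get = http_get               # injected: no real network in this sketch
        self._cooldown = cooldown_seconds  # assumed value
        self._last_poke: dict[str, float] = {}

    def on_buffered_event(self, instance_id: str, wake_url: Optional[str],
                          now: float) -> bool:
        """Poke at most once per cooldown window; True if a GET was attempted."""
        if not wake_url:
            return False                   # no registered wake target
        last = self._last_poke.get(instance_id)
        if last is not None and now - last < self._cooldown:
            return False                   # still inside the cooldown window
        self._last_poke[instance_id] = now
        try:
            self._get(wake_url)            # payload-free, unsigned GET
        except Exception:
            pass                           # best-effort: a failed poke is swallowed
        return True
```

Because a failed GET still starts the cooldown, a dead wake target is retried at most once per window instead of on every buffered event.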
### 3.4 Obligations on a future scale-to-zero behaviour layer

§3.2 and §3.3 ship the **primitives**; this section is the **contract a separate
scale-to-zero behaviour workstream must honour to consume them safely.** It owns
the *decision* to suspend, the actual machine suspend, and the platform/health
model — none of which live here — but it MUST hold these guarantees, which the
primitives assume:

1. **Register a `wakeUrl` before the instance can ever be suspended.** A
   suspended instance with no registered `wakeUrl` is a black hole — buffered
   inbound never triggers a poke, so it sleeps through its own traffic until
   something else reconnects it. The behaviour layer MUST ensure a reachable
   wake target is registered (self-hosted: `--wake-url`; managed: stamped) as a
   precondition of allowing suspend. A wake URL that is unreachable while the
   machine is suspended (e.g. points at the suspended machine itself with no
   platform autostart in front) is equivalent to none.
2. **Drain through `going_idle` → await `going_idle_ack` BEFORE tearing down the
   socket or suspending.** Never suspend with an un-acked flip in flight. The
   ack is the connector's confirmation that delivery for this instance is now
   buffered-only; a machine that suspends after sending `going_idle` but before
   the ack can drop the inbound that races the flip. The gateway already gates
   socket teardown on the ack (Q-5.3c); the suspend step MUST sit *after* a
   clean drain completes, not race it.
3. **Keep the NET-NEW reconnect loop live as a precondition of suspend.** The
   wake→drain contract is "poke ⇒ the gateway re-dials ⇒ the connector drains on
   the reconnect handshake." If the reconnect loop is disabled, a poke lands on a
   machine that never re-dials and the buffer strands. The behaviour layer must
   not suspend an instance whose relay transport won't reconnect on wake.
4. **Treat suspended ≠ down in the health model (Q-5.3b).** A suspended instance
   is healthy-asleep, not failed. The health/monitoring layer MUST distinguish
   the two (e.g. via the platform machine-state) so a suspended instance is not
   restarted, alerted on, or reaped as unhealthy — that would defeat the suspend
   and can race the wake/drain.
5. **The wake poke is best-effort and rate-limited — do not assume exactly-once
   or immediate wake.** At most one poke per cooldown window per instance, and a
   failed poke is swallowed. The behaviour layer must not rely on the poke as a
   guaranteed/prompt signal; correctness still rests on "the gateway drains
   whenever it next reconnects." A belt-and-suspenders wake (e.g. a scheduled
   job that also reconnects) is the behaviour layer's call, not the primitive's.
6. **Suspend only when genuinely idle — and idle is connector-observable, not
   gateway-guessed.** WHAT counts as idle (no in-flight turn + no inbound for N
   min) is the behaviour layer's policy, but it must compose with the existing
   drain machinery (`gateway_state` running→draining) rather than introduce a
   parallel relay-only idle path — the same integration constraint §3.2 places
   on `going_idle`.

These are guarantees the behaviour layer OWES the primitives; the primitives owe
the behaviour layer only what §3.2/§3.3 already specify (a flip-on-going_idle,
a durable per-instance buffer + ack-gated reconnect drain, and a poke on the
first buffered event for a flipped instance).

---

## 4. Outbound: action set

The gateway calls the transport with action dicts. Source of truth:
`gateway/relay/transport.py` + `gateway/relay/adapter.py`.

| `op` | Fields | Result |
| --- | --- | --- |
| `send` | `chat_id`, `content`, `reply_to?`, `metadata?` | `{success: bool, message_id?, error?}` |
| `edit` | `chat_id`, `message_id`, `content`, `metadata?` | `{success: bool, error?}` |
| `typing` | `chat_id` | `{success: bool}` |
| `follow_up` | `session_key`, `kind`, `content`, `metadata?` | `{success: bool, message_id?, error?}` |

`get_chat_info(chat_id)` is a separate proxied call returning at least
`{name, type}`. Media actions follow the same envelope shape (deferred to a
later contract revision; additive).

**`follow_up` (A2 capability action).** Some inbound payloads carry a credential
that acts on the **shared** bot identity (e.g. a Discord interaction follow-up
token). Per §6 the connector strips that at the edge and binds it in its
capability vault keyed by the session; it **never reaches the gateway**. To use
it, the gateway issues `follow_up` naming the **session it is already in**
(`session_key`) plus the capability `kind` (e.g. `discord.interaction_token`) —
**never a token**. The connector resolves the real value from its vault,
enforces the tenant match (tenant B can never wield tenant A's capability), and
egresses. `success: false` when the capability is absent/expired or the tenant
doesn't match — the gateway has nothing to retry with, by design (a leaked
gateway holds zero capability material). Source of truth:
`gateway/relay/transport.py` (`send_follow_up`) + `gateway/relay/adapter.py`.

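
On the connector side, resolving a `follow_up` amounts to a vault lookup with a tenant check. The sketch below models that in Python for illustration; the real vault lives in the TypeScript connector, and `CapabilityVault`, `Capability`, `handle_follow_up`, and the error string are all hypothetical names. The actual platform egress is left as a comment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Capability:
    tenant: str
    value: str         # e.g. an interaction token; never leaves the connector
    expires_at: float


class CapabilityVault:
    def __init__(self) -> None:
        # Keyed by (session_key, kind), matching the fields of the action.
        self._store: dict[tuple[str, str], Capability] = {}

    def bind(self, session_key: str, kind: str, cap: Capability) -> None:
        self._store[(session_key, kind)] = cap

    def resolve(self, caller_tenant: str, session_key: str, kind: str,
                now: float) -> Optional[str]:
        cap = self._store.get((session_key, kind))
        if cap is None or now >= cap.expires_at or cap.tenant != caller_tenant:
            return None  # absent, expired, or another tenant's capability
        return cap.value


def handle_follow_up(vault: CapabilityVault, caller_tenant: str,
                     action: dict, now: float) -> dict:
    token = vault.resolve(caller_tenant, action["session_key"],
                          action["kind"], now)
    if token is None:
        return {"success": False, "error": "capability unavailable"}
    # A real connector would egress to the platform with `token` here.
    return {"success": True}
```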

---

## 5. Interrupt (`/stop`) routing

- **Gateway → connector:** `send_interrupt(session_key, reason?)` egresses a
  mid-turn `/stop` over the outbound WS. The connector MUST forward it to the
  gateway instance running that `session_key` (the routing invariant).
- **Connector → gateway:** an inbound interrupt for a `session_key` is delivered
  as an `interrupt_inbound` frame down the gateway's outbound WS (§3 transport
  note) — routed cross-instance via the relay bus to whichever instance holds
  the socket — and bridged by the adapter's `on_interrupt(session_key, chat_id)`
  into the existing per-session interrupt mechanism, cancelling exactly that turn
  (siblings untouched).

Both directions ride the gateway's outbound WS: the gateway→connector `/stop`
egresses over it, and the connector→gateway interrupt rides the same
back-channel as `inbound`, as an `interrupt_inbound` frame (§3).

---

## 6. Trust boundary & signed-body handling (A2)

**The connector is the sole crypto/identity boundary. The gateway re-validates
nothing.**

Webhook signatures (Discord ed25519, Twilio HMAC, WeCom BizMsgCrypt) are
computed over exact raw bytes, and some payloads are *encrypted* with a shared
secret. The connector fronts a **shared** bot for many tenants and holds every
tenant's platform secrets, so it:

- **verifies / decrypts at the edge** (the only place the secrets live),
- **normalizes** the payload into a tenant-scoped `MessageEvent` (§3),
- **strips any shared-identity capability** out of the payload and binds it in
  its capability vault, keyed by the session (see §4 `follow_up`),
- **forwards only the sanitized `MessageEvent`** — never the raw signed body.

The gateway therefore performs **no** platform signature/crypto verification on
the relay path; it trusts the normalized event. This is an enforced invariant on
the gateway side (`tests/gateway/relay/test_relay_sheds_crypto.py`: the relay
package imports/calls no platform-crypto).

**Why not "forward the signed body byte-for-byte so the gateway re-validates"?**
That earlier model is incoherent under an untrusted, disposable tenant gateway:

- Re-validating Twilio HMAC / WeCom crypto would require handing the gateway the
  **shared signing secret** — which is itself the leak, and on a shared bot it's
  a *cross-tenant* leak.
- WeCom payloads are encrypted with the shared secret; the connector must decrypt
  at the edge just to route, so forwarding ciphertext would again require giving
  the gateway the secret.
- A Discord interaction token lives **inside** the signed JSON body — you cannot
  both preserve the bytes and strip the credential; they are the same bytes.

So byte-preservation is abandoned deliberately: the connector re-serializes the
sanitized event and the gateway trusts it. This also unifies the passthrough and
relay planes — both are "verify at the edge → emit a normalized event," differing
only in transport. See `docs/capability-trust-boundary.md` (connector repo:
`gateway-gateway`) for the full A2 rationale and the connector-side vault.

### 6.1 Channel authentication (the connector⇄gateway link itself)

A2 makes the connector the sole holder of platform secrets while the gateway may
be **customer-managed and internet-exposed**, so the connector⇄gateway channel
is itself authenticated. The gateway holds an enrollment- or provision-issued
**per-gateway secret** (`hermes gateway enroll` → connector `/relay/enroll`, or
managed self-provision → `/relay/provision`) that authenticates its outbound WS
upgrade. It is an HMAC-SHA256 scheme with a multi-secret rotation verify list
(gateway side: `gateway/relay/auth.py`; connector side:
`src/core/relayAuthToken.ts`).

| Leg | Credential | Mechanism |
|-----|-----------|-----------|
| Gateway → connector WS upgrade | per-gateway secret | An `Authorization` bearer header on the `/relay` upgrade. The token is `base64url(payload:exp:sig)` where `payload = gatewayId` and `sig = HMAC(payload:exp, secret)`. Connector verifies and rejects the upgrade (**close 4401**) on mismatch/absence/revocation. The authenticated tenant comes from the connector's store, never the `hello` frame. |
| Connector → gateway inbound (`inbound` / `interrupt_inbound` frames) | — (rides the authenticated WS) | Inbound is pushed down the gateway's already-authenticated outbound socket (§3), so no per-message signature is needed. A **per-tenant delivery key** is still issued at enroll/provision and retained for forward-compat, but is no longer used to sign inbound. |

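The token layout in the first row can be sketched as below. This is a minimal illustration, not the code in `gateway/relay/auth.py`: the function names `mint_token` and `check_token` are hypothetical, and the hex encoding of `sig`, `exp` as Unix seconds, and the unpadded base64url output are assumptions the table does not specify. The loop over `accepted_secrets` illustrates the multi-secret rotation verify list: a token signed with either the old or the new secret is accepted while both are listed.

```python
import base64
import hashlib
import hmac
import time


def _sign(payload: str, exp: int, secret: bytes) -> str:
    """sig = HMAC-SHA256 over "payload:exp", keyed by the per-gateway secret."""
    message = f"{payload}:{exp}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def mint_token(gateway_id: str, secret: bytes, ttl_s: int = 60, now=None) -> str:
    """Build base64url(payload:exp:sig) with payload = gateway_id."""
    current = time.time() if now is None else now
    exp = int(current) + ttl_s
    raw = f"{gateway_id}:{exp}:{_sign(gateway_id, exp, secret)}"
    return base64.urlsafe_b64encode(raw.encode()).decode().rstrip("=")


def check_token(token: str, accepted_secrets, now=None):
    """Return the gateway id if the token verifies against any accepted secret."""
    try:
        padded = token + "=" * (-len(token) % 4)
        raw = base64.urlsafe_b64decode(padded).decode()
        gateway_id, exp_text, sig = raw.rsplit(":", 2)
        exp = int(exp_text)
    except ValueError:  # bad base64, bad UTF-8, wrong field count, non-int exp
        return None
    current = time.time() if now is None else now
    if exp < current:
        return None  # expired
    for secret in accepted_secrets:  # rotation: try each listed secret
        expected = _sign(gateway_id, exp, secret)
        if hmac.compare_digest(expected.encode(), sig.encode()):
            return gateway_id
    return None  # the caller would reject the upgrade here (close 4401)


# Usage: a token minted with the new secret verifies during rotation.
token = mint_token("gw-1", b"new-secret", ttl_s=60, now=1000)
verified = check_token(token, [b"old-secret", b"new-secret"], now=1010)
```
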
This is the **channel** authenticator — distinct from platform crypto, which the
relay path still sheds entirely (§6). The gateway holds zero platform secrets;
the per-gateway secret authenticates only the connector link. Full threat model +
enrollment/rotation/kill-switch design: `docs/connector-gateway-auth-design.md`
(connector repo).

---

## 7. Per-instance delivery & the management plane (Phase 6)

Phases 1–5 treat the connector as a single-tenant front: inbound events for a
tenant fan out to that tenant's gateway socket(s). **Phase 6 makes delivery
per-INSTANCE** — a shared bot can front many users/agents in one tenant (one
Discord guild, one Telegram bot) without cross-delivery — and adds a small
**management plane** the agent (or a managed Portal) uses to declare who-sees-what
and what's-relevant. All of this lives **connector-side**; the gateway's only new
responsibility is to **declare its relevance policy** at boot (§7.3).

### 7.1 The delivery gate (connector-side, informational)

For each inbound event the connector decides which instances receive it by
composing three AND-ed filters. The gateway does not implement these — they run
in the connector — but they define the delivery semantics the gateway relies on:

| Layer | Question | Source of truth |
| --- | --- | --- |
| **owner / scope ∧ principal** | May this instance *see* this author here? | per-user `user_id → instance` bindings (the owner floor) + per-instance `(guild, channel)` scope grants + an `owner-only` / `allow-list` / `any` principal policy. |
| **visibility floor** | Can the instance's bound owner actually `VIEW_CHANNEL` this in Discord? | live Discord ACL (effective permissions), fail-closed. Narrows an over-broad scope grant downward. |
| **relevance** | *Given* it may see it, should the agent engage? | the relevance policy declared in §7.3 (address-gating / free-response / allow-bots). |

The composition only ever **narrows** delivery (`deliver ⇔ authorized ∧ visible
∧ relevant`); the **owner floor bypasses the relevance layer** (an author's own
message always reaches their own instance — you don't @mention your own agent).
A message authored by an unbound user reaches no instance (fail-closed). The
full design + invariants live in the connector repo
(`NousResearch/gateway-gateway`); this section is the gateway-facing summary.

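The composition rule above can be written as a small predicate. This is a sketch of the semantics as summarized here, not the connector's code: `GateInputs` and `should_deliver` are hypothetical names, each layer is reduced to a precomputed boolean, and applying the visibility floor on the owner path is an assumption (the text only says the owner floor bypasses relevance).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateInputs:
    author_bound: bool        # author has a user_id -> instance binding
    is_owner: bool            # author is bound to *this* instance (owner floor)
    scope_principal_ok: bool  # scope grant and principal policy admit the author
    visible: bool             # visibility floor passed (live ACL, fail-closed)
    relevant: bool            # relevance policy says the agent should engage


def should_deliver(gate: GateInputs) -> bool:
    """deliver <=> authorized AND visible AND relevant, with the owner bypass."""
    # An unbound author reaches no instance (fail-closed).
    if not gate.author_bound:
        return False
    # Authorization: the owner floor, or a scope grant plus principal policy.
    authorized = gate.is_owner or gate.scope_principal_ok
    if not (authorized and gate.visible):
        return False
    # The owner floor bypasses only the relevance layer.
    return True if gate.is_owner else gate.relevant


# Usage: an owner's un-addressed message is still delivered to their instance.
owner_unaddressed = GateInputs(
    author_bound=True, is_owner=True, scope_principal_ok=False,
    visible=True, relevant=False,
)
delivered = should_deliver(owner_unaddressed)
```
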
### 7.2 Management routes (connector-side, authenticated)

The connector mounts authenticated management routes. They share the **same
dual-auth** as the WS upgrade: either a managed NAS-signed `aud=agent:{instanceId}`
RS256 JWT, **or** the gateway's own per-gateway secret bearer (§6.1
`make_upgrade_token`). In both cases the connector resolves the authoritative
`{tenant, instanceId}` from its **stored** record — **never** from the request
body (a body-asserted `instanceId` is ignored).

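The "stored record, never the body" rule can be illustrated with a short sketch. It is written in Python for consistency with the gateway side, although the connector itself is a separate codebase; `resolve_identity`, the dict-backed `store`, and the record field names are hypothetical, and the credential is assumed to have been verified already.

```python
def resolve_identity(authenticated_gateway_id, store, request_body):
    """Return the authoritative identity for an already-authenticated caller.

    `request_body` is accepted only to show that it is deliberately ignored.
    """
    record = store.get(authenticated_gateway_id)
    if record is None:
        return None  # unknown or revoked caller: fail closed
    # Identity comes from the stored record; a body-asserted instanceId
    # (or tenant) is never consulted.
    return {"tenant": record["tenant"], "instanceId": record["instanceId"]}


# Usage: a spoofed instanceId in the body has no effect.
store = {"gw-1": {"tenant": "tenant-a", "instanceId": "inst-1"}}
identity = resolve_identity("gw-1", store, {"instanceId": "inst-999"})
```
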
| Route | Purpose |
| --- | --- |
| `POST /manage/link` | Issue a short-lived code to bind a platform account to the authenticated instance (the `/link <code>` flow; the connector reads the authentic `user_id` off the inbound event). |
| `POST /manage/scope`, `/manage/scope/release` | Claim / release a `(guild, channel)` scope for the authenticated instance. A channel is owned by at most one instance (non-overlap is a PK constraint). |
| `POST /manage/principal` | Set the instance's principal policy (`owner-only` \| `allow-list` \| `any`). |
| `POST /manage/dm-default` | Set the user's DM-default instance (DM tie-break when a user linked more than one). |
| `POST /relay/policy` | Declare the instance's **relevance policy** (§7.3). |

These are connector-owned (the management plane is not part of the gateway's
agent path); the gateway only calls `POST /relay/policy` (§7.3). The others are
driven by the managed Portal / `hermes` CLI.

### 7.3 Relevance-policy declaration (the gateway's responsibility)

The relevance layer (§7.1) is the per-tenant counterpart of the gateway's own
behaviour knobs (`require_mention`, `free_response_channels`,
`{PLATFORM}_ALLOW_BOTS`). So that the **same** behaviour governs relay delivery,
the gateway projects those knobs into a **platform-agnostic** policy and POSTs
it to `/relay/policy` at boot (after its per-gateway secret is resolved).

Body (`gateway/relay/__init__.py` `relay_relevance_policy()` → `send_relay_policy()`):

| Field | Type | Projected from | Meaning |
| --- | --- | --- | --- |
| `platform` | string | the fronted platform (`relay_platform_identity`) | which platform this policy applies to. |
| `requireAddress` | bool | `require_mention` | a non-owner message must @mention / reply-to the bot to be relevant. |
| `freeResponseScopes` | string[] | `free_response_channels` | scope (channel) ids where `requireAddress` is waived. Same scope vocabulary as §7.1's scope grants. |
| `allowOtherBots` | bool | `{PLATFORM}_ALLOW_BOTS ∈ {mentions, all}` | admit bot-authored messages (default off). |

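The projection in the table can be sketched as a pure function. This is an illustration, not the body of `relay_relevance_policy()`: the name `project_relevance_policy` and its explicit parameters are hypothetical (the real function reads the gateway's configuration), and treating any `{PLATFORM}_ALLOW_BOTS` value other than `mentions` or `all` as "off" is an assumption.

```python
def project_relevance_policy(platform, require_mention, free_response_channels, allow_bots):
    """Project the gateway's behaviour knobs into the platform-agnostic body."""
    allow_bots_value = str(allow_bots or "").strip().lower()
    return {
        "platform": platform,
        "requireAddress": bool(require_mention),
        # Channel ids are sent as strings, matching the string[] type above.
        "freeResponseScopes": [str(channel) for channel in free_response_channels],
        # True only for the two values the table names; anything else is off.
        "allowOtherBots": allow_bots_value in {"mentions", "all"},
    }


# Usage: a Discord gateway that waives address-gating in one channel.
policy = project_relevance_policy(
    platform="discord",
    require_mention=True,
    free_response_channels=[123456789],
    allow_bots="mentions",
)
```
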
Auth is the per-gateway upgrade token (§6.1), so the connector attaches the
policy to the authenticated instance. The gateway is the **source of truth** and
re-declares **every boot** (a full replace, mirroring the `routeKeys` upsert at
provision — self-healing). When the projected policy is all-default the gateway
sends nothing (the connector's absent-row default already matches). The POST is
**fail-soft**: a failure logs and boot proceeds — relevance is an optimization
layered on the authorization gate (§7.1), never a boot dependency. There is **no
new gateway inbound surface** and **no new credential** — it reuses the
per-gateway secret and the same host as `/relay/provision`.

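The boot-time behaviour described above (skip when all-default, otherwise POST, and never let a failure block boot) can be sketched as follows. This is not `send_relay_policy()` itself: `declare_policy_at_boot`, the injected `post` callable, and the returned status strings are hypothetical, and the default knob values are supplied by the caller because this section does not state them.

```python
import logging

logger = logging.getLogger("relay.policy.sketch")


def declare_policy_at_boot(policy, default_knobs, post):
    """Declare the relevance policy once at boot, fail-soft.

    `post(path, body)` is an injected transport (assumption) that raises on
    failure. Returns "skipped", "sent", or "failed"; it never raises.
    """
    # Compare only the behaviour knobs; `platform` is not a knob.
    knobs = {key: value for key, value in policy.items() if key != "platform"}
    if knobs == default_knobs:
        return "skipped"  # the connector's absent-row default already matches
    try:
        post("/relay/policy", policy)  # full replace on every boot
    except Exception as exc:  # broad on purpose: relevance is not a boot dependency
        logger.warning("relay policy declaration failed: %s", exc)
        return "failed"
    return "sent"


# Usage with a fake transport that records what would be sent.
sent = []
assumed_defaults = {"requireAddress": True, "freeResponseScopes": [], "allowOtherBots": False}
custom = {"platform": "discord", "requireAddress": False,
          "freeResponseScopes": ["42"], "allowOtherBots": False}
status = declare_policy_at_boot(custom, assumed_defaults, lambda path, body: sent.append((path, body)))
```
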
> A relevance drop happens **before** the connector wakes a scaled-to-zero agent
> (Phase 5), so excluded chatter never spins an agent up — relevance is the
> primary scale-to-zero lever as well as a correctness filter.

---

## 8. Versioning policy

- `contract_version` is an int; bump **only** for additive changes during the
  experimental phase (new optional fields, new `op`s).
- A breaking change (renamed/removed field, changed semantics) requires a
  coordinated update of both repos and a version bump.
- The connector's first PR references the commit SHA of this file it implements
  against.