Commit graph

11 commits

Author SHA1 Message Date
Ben
c93b9f9057 feat(relay): terminal 4401 (opt-out) → clean "Relay disabled" state
Some checks are pending
CI / detect (push) Waiting to run
CI / tests (push) Blocked by required conditions
CI / lint (push) Blocked by required conditions
CI / typecheck (push) Blocked by required conditions
CI / docs-site (push) Blocked by required conditions
CI / history-check (push) Blocked by required conditions
CI / contributor-check (push) Blocked by required conditions
CI / uv-lockfile (push) Blocked by required conditions
CI / docker-lint (push) Blocked by required conditions
CI / supply-chain (push) Blocked by required conditions
CI / osv-scanner (push) Blocked by required conditions
CI / All required checks pass (push) Blocked by required conditions
Deploy Site / deploy-vercel (push) Waiting to run
Deploy Site / deploy-docs (push) Waiting to run
Docker Build and Publish / build-amd64 (push) Waiting to run
Docker Build and Publish / build-arm64 (push) Waiting to run
Docker Build and Publish / merge (push) Blocked by required conditions
Phase 7 Unit 7d-B. When an operator opts an instance OUT of the Team Gateway
relay (Unit 7b deprovision), the connector revokes the per-gateway secret and
closes the gateway's WS with 4401. The reconnect supervisor previously treated
EVERY close as retryable, so the live process spun "retrying 4401" forever and
the dashboard showed a red error — opt-out looked like a failure.

Now a 4401 close that arrives AFTER a successful handshake is recognized as a
terminal credential revocation:

- ws_transport.py: track `_handshake_succeeded` (set when a descriptor is
  received); on a 4401 close after a prior success, latch `auth_revoked` and do
  NOT spawn the reconnect supervisor. A 4401 BEFORE any successful handshake
  stays retryable (cold-start / not-yet-provisioned race, not a revocation).
  New `auth_revoked` property + a websockets-version-safe close-code reader
  (prefers `.rcvd`/`.sent` Close frames; `.code` is deprecated in websockets 13+).
- adapter.py: a revocation monitor turns `transport.auth_revoked` into a clean,
  NON-retryable `relay_disabled` fatal and notifies the gateway's fatal-error
  handler (so the adapter is removed and NOT queued for reconnection — the
  credential is dead until the instance is recreated). Monitor is cancelled on
  disconnect; only started when the transport exposes `auth_revoked` (prod WS).
- run.py: `_handle_adapter_fatal_error` maps the `relay_disabled` code to a
  `disabled` platform_state (not `fatal`/`retrying`).
- web: PlatformsCard renders the `disabled` state with a neutral outline badge,
  a PowerOff icon, and muted (not destructive-red) text + message. New optional
  `status.disabled` i18n string ("Disabled").

Also bundles the Phase 7 contract-doc update (this doc is authoritative in
hermes-agent): docs/relay-connector-contract.md gains an "Author-first
resolution + the account-link (DM) path" section documenting the
multi-tenant-guild rule (D-7.2 — route by authenticated author binding, never by
guild; unlinked → fail-closed), the `/link <code>` DM flow, and the
connector-authoritative opt-out + terminal-4401 behavior this PR implements.

Tests: +2 ws_transport (4401-after-handshake terminal / no-reconnect;
4401-before-handshake stays retryable) and +2 adapter (revocation → non-retryable
relay_disabled fatal + handler fired; no-revocation → no fatal). 138 relay tests
pass (incl. the contract-doc conformance test); ruff clean; web tsc clean.

Phase 7 Unit 7d-B (relay-adapter solo lane). Q17 → Option 2; Option 3 (live
de-register, no recreate) + the restart-re-provision hole deferred post-alpha.
2026-06-24 18:43:01 +10:00
Ben Barclay
935f2bc48d
docs(relay): add §3.4 — obligations on a future scale-to-zero behaviour layer (#51633)
The contract already documents the scale-to-zero PRIMITIVES (§3.2 going-idle/
buffered-flip, §3.3 wake poke) and what's out of scope. This adds the missing
half: the contract FROM the primitives TO the behaviour layer — the guarantees
a separate scale-to-zero workstream must honour to consume them safely (register
a wakeUrl before suspend; drain+ack before teardown; keep the reconnect loop
live; treat suspended != down in the health model; don't assume exactly-once/
prompt wake; suspend only when genuinely idle, composing with the existing drain
machine). Docs-only; lets the independent scale-to-zero stream build against a
written contract instead of re-reading the connector.
2026-06-24 12:27:19 +10:00
Ben Barclay
6e88f7b6f7
feat(relay): Phase 5 Unit C — wake primitive (gateway side) (#51595)
Register a per-instance wakeUrl and forward it to the connector at
self-provision so a suspended gateway can be poked awake when buffered
work arrives (pairs with the connector-side WakePoker).

- relay_wake_url() resolver (env GATEWAY_RELAY_WAKE_URL, then
  gateway.relay_wake_url in config.yaml), mirroring relay_instance_id()
- thread wake_url through _post_provision (adds wakeUrl to the body only
  when set) + self_provision_relay (resolve, forward, log)
- hermes gateway enroll --wake-url <url> persists GATEWAY_RELAY_WAKE_URL
- document the §5.2 wake poke in relay-connector-contract.md §3.3
- tests: relay_wake_url resolution (env/config/absent), provision
  forwarding, body-only-when-set (6 new; 130 relay tests pass)

The actual reconnect+drain on wake is Unit B's loop; this unit only
wires the wake SIGNAL. Opt-in: absent wakeUrl => connector never pokes.
2026-06-24 11:00:11 +10:00
Ben Barclay
40fddc9e4c
feat(relay): Phase 5 §5.3 going-idle / buffered-flip primitive (gateway side) (#51572)
The gateway half of the going-idle/buffered-flip primitive (scale-to-zero
PRIMITIVE, not the behaviour). Integrates with the EXISTING drain transition:

- ws_transport: `go_idle()` sends `going_idle` + awaits the connector's
  `going_idle_ack` (connector-authoritative flip-then-ack, Q-5.3c — stays
  serving until the ack so nothing is lost in the flip window); acks a buffered
  inbound (bufferId present) via `inbound_ack` after the handler runs
  (drain-without-dup on the delivery leg); NET-NEW reconnect loop re-dials +
  re-handshakes after an unexpected close (off by default, on in production).
- adapter: emits `going_idle` from its existing `disconnect()` drain seam before
  tearing down the socket; best-effort + guarded (never blocks shutdown).
- transport Protocol + contract doc §3.2 document the 3 new frames.

+6 relay tests (124 pass). NOT in scope: the autonomous idle timer / machine
suspend / NAS health model (deferred behaviour). Ben's relay-adapter solo lane.
2026-06-24 09:50:30 +10:00
Ben Barclay
45bc4fb37f
feat(relay): declare relevance policy to the connector + document the management plane (#51248)
The gateway half of Phase 6 Unit ζ: project the agent's existing relevance
knobs into the connector's platform-agnostic vocabulary and declare them at boot
over the /relay/policy route, so the SAME mention-gating / free-response /
allow-bots behavior the agent applies directly also governs relay delivery (and
excluded chatter never wakes a scaled-to-zero agent).

- gateway/relay/__init__.py:
  - relay_relevance_policy(): project require_mention -> requireAddress,
    free_response_channels -> freeResponseScopes, {PLATFORM}_ALLOW_BOTS in
    {mentions,all} -> allowOtherBots. Reads the fronted platform's config block
    + bridged top-level keys. Returns None when all-default (the connector's
    quiet default already matches) or no concrete platform is fronted.
  - send_relay_policy(): POST /relay/policy authenticated with the gateway's own
    per-gateway upgrade token (make_upgrade_token — same bearer as the WS
    upgrade), so the connector attaches it to the authenticated instance, never
    a body-asserted id. Re-declares every boot (self-healing, full replace).
    NEVER raises, NEVER blocks boot — relevance is an optimization layered on
    the δ/ε authorization gate. Reuses the per-gateway secret + the
    /relay/provision host; no new inbound surface, no new credential.
  - _policy_url(): ws(s)://…/relay -> http(s)://…/relay/policy.
- gateway/run.py: call send_relay_policy() after register_relay_adapter()
  succeeds (the secret is resolved by then).
- docs/relay-connector-contract.md: new §7 documenting per-instance delivery +
  the management plane (/manage/* + /relay/policy) + the relevance-declaration
  contract; versioning renumbered to §8. Contract conformance test stays green
  (§2/§3 tables untouched).

Tests: +12 (projection mapping incl. comma-string + top-level fallback; send
auth/skip/fail-soft/non-200). Full relay suite 118 pass. The connector route is
already E2E-proven (connector repo gateway_policy_driver.py); this adds the real
gateway send-path it pairs with.

This completes Phase 6 (Team Gateway per-user isolation) end to end.
2026-06-23 18:43:19 +10:00
Ben Barclay
64a507da44
feat(relay): handle passthrough_forward over the WS (Phase 5 §5.1, gateway half) (#50702)
The connector half (gateway-gateway) moves the passthrough plane's post-ACK
forward off the HTTP gatewayEndpoint onto the gateway's outbound /relay WS via
a new passthrough_forward frame. This is the gateway side: the relay adapter
now RECEIVES and handles that frame, so a hosted gateway (no public IP) can
process forwarded Class-2/3 traffic (Discord interactions, Twilio) over the
socket it already holds — closing the "passthrough inbound doesn't work for
hosted gateways" gap.

- ws_transport.py: decode the passthrough_forward frame; PassthroughForward
  dataclass + _passthrough_from_wire (base64 body -> exact bytes, byte parity
  with the connector's toPassthroughForward); set_passthrough_handler mirrors
  set_interrupt_inbound_handler.
- transport.py: PassthroughHandler type + set_passthrough_handler on the
  RelayTransport protocol.
- adapter.py: connect() wires the passthrough handler; _on_passthrough decodes
  the (already-sanitized, token-free) forward and, for a Discord interaction,
  converts it to a MessageEvent routed through the normal agent path
  (handle_message) — the reply egresses over the outbound / token-less
  follow_up path, so the gateway never holds the interaction credential. Never
  raises (a bad forward can't kill the read loop). Non-discord forwards (Twilio)
  are logged + dropped for now.
- docs/relay-connector-contract.md: document the passthrough_forward frame +
  PassthroughForward shape + §3.1.

The interaction -> MessageEvent CONVERSION semantics (slash-command vs button
UX, option rendering) are the open sub-design flagged in the spec; the TRANSPORT
+ receive mechanism (this) is settled per Ben's Gate-2 decision: "the relay
adapter handles receiving these events over the WS."

Tests (tests/gateway/relay/test_relay_passthrough.py): byte-preservation
round-trip (+ malformed-body tolerance), connect() wiring, application-command
and message-component interactions route through handle_message with correct
session source + scope capture, malformed/non-discord forwards dropped cleanly.
100 relay tests green. Pairs with the connector PR (gateway-gateway).
2026-06-22 20:10:57 +10:00
Ben Barclay
d2c53ff558
feat(relay): WS-only inbound on the gateway adapter (Phase 3) (#48294)
The connector now delivers inbound (messages + interrupts) over the gateway's
OUTBOUND /relay WebSocket, not a signed HTTP POST to an inbound endpoint. The
gateway needs no inbound HTTP port — which is what makes hosted gateways (no
public IP) able to receive inbound at all.

- gateway/relay/adapter.py: connect() wires set_interrupt_inbound_handler(
  self.on_interrupt) so connector->gateway interrupt_inbound frames bridge into
  the existing per-session interrupt path (the inbound message handler was
  already wired). Removed _maybe_start_inbound_receiver() + the _inbound_runner
  lifecycle — there is no HTTP receiver anymore.
- gateway/relay/inbound_receiver.py: deleted (the signed-HTTP InboundDelivery
  receiver).
- gateway/relay/__init__.py: removed relay_inbound_config() (dead with the
  receiver gone). The delivery key is still set in-process by self-provision for
  forward-compat but is no longer consumed for inbound.
- docs/relay-connector-contract.md: §3 rewritten — inbound is the WS back-channel
  routed cross-instance via the connector's relay bus; §5 interrupt + §6 auth
  table updated; the old signed-HTTP-POST + per-tenant-delivery-key-signing path
  is documented as superseded. gatewayEndpoint noted as passthrough-plane only.

Tests: stub_connector grows set_interrupt_inbound_handler + push_interrupt;
new test_relay_interrupt case proves connect() wires BOTH inbound handlers and an
interrupt_inbound frame over the WS cancels the right session. Removed the
HTTP-receiver test; updated the crypto-shedding scan + self-provision delivery-key
assertion. 88 relay tests pass.

EXPERIMENTAL. Pairs with gateway-gateway (relay bus + WsGatewayDelivery) and the
NAS GATEWAY_RELAY_URL stamp. The cross-repo E2E (connector repo) proves the full
multi-instance path against this production adapter code.
2026-06-19 09:33:15 +10:00
Ben Barclay
c276b017ad
feat(relay): connector⇄gateway channel auth + signed-HTTP inbound receiver + enroll CLI (#48147)
* feat(relay): authenticate the connector⇄gateway WS channel

The relay gateway may be customer-managed and internet-exposed, so the
connector⇄gateway channel is itself authenticated (distinct from the
platform crypto the relay path sheds). Add gateway/relay/auth.py — a
Python port of the connector's HMAC token + delivery-signature schemes
(relayAuthToken.ts / deliverySigning.ts), verified byte-for-byte against
the connector's compiled TypeScript via cross-language test vectors.

Present an Authorization bearer on the /relay WS upgrade keyed by the
per-gateway secret (resolved from GATEWAY_RELAY_ID / GATEWAY_RELAY_SECRET
in env or config). The connector rejects an unauthenticated/invalid/
revoked upgrade with close 4401.

* feat(relay): signed-HTTP inbound delivery receiver

The connector delivers normalized inbound events to a tenant's gateway
over a signed HTTP POST, not the outbound /relay WS: the connector
instance owning a platform socket is generally not the instance a given
gateway dialed out to, so inbound targets a tenant endpoint that may
load-balance across gateway instances.

Add gateway/relay/inbound_receiver.py — verifies x-relay-signature /
x-relay-timestamp over the EXACT raw request bytes (re-serializing would
break the HMAC: JS JSON.stringify is compact, Python json.dumps spaces)
against the per-tenant delivery key verify list within a 300s replay
window, then dispatches messages to handle_message and interrupts to the
interrupt handler. Wire it into the adapter lifecycle (start in connect()
when a delivery key + bind port are configured, tear down in disconnect();
a purely-outbound dev gateway runs without it).

Refine test_relay_sheds_crypto to distinguish PLATFORM crypto (Discord
ed25519, Twilio/WeCom HMAC — still shed) from the connector⇄gateway
CHANNEL auth (intended): auth.py / inbound_receiver.py are exempt from
the platform-symbol scan but still banned from importing platform-crypto
modules, plus a positive guard that auth.py uses only stdlib hmac/hashlib.

* feat(relay): hermes gateway enroll CLI

Add the gateway half of zero-touch enrollment. `hermes gateway enroll`
resolves a fresh Nous Portal access token (the tenant-proving identity),
POSTs {enrollmentToken, gatewayId} to the connector's /relay/enroll, and
persists GATEWAY_RELAY_ID / GATEWAY_RELAY_SECRET / GATEWAY_RELAY_DELIVERY_KEY
to ~/.hermes/.env. The per-gateway secret authenticates the WS upgrade;
the per-tenant delivery key verifies signed inbound deliveries.

Refuses under is_managed() (hosted installs get the secret stamped in by
the orchestrator). Added as an 'enroll' subcommand on the existing
gateway subparser — not a new top-level command.

* docs(relay): inbound is signed HTTP, not WS; document channel auth

Fix the stale contract: §3/§5 said inbound rode the WS socket (single-
instance only, predates the multi-instance socket-ownership + channel-auth
model). Inbound + connector→gateway interrupt are signed HTTP POSTs to the
tenant endpoint. Add §6.1 documenting the two channel-auth schemes (per-
gateway WS-upgrade secret, per-tenant inbound delivery key) and how they
differ from the platform crypto the relay path sheds.

* test(relay): update build_gateway_parser callers for cmd_gateway_enroll

The enroll subcommand added cmd_gateway_enroll as a required keyword-only
arg to build_gateway_parser, but two existing parser-extraction tests still
called it with only cmd_gateway/cmd_proxy — failing CI with TypeError.
Thread the new handler through both call sites and add a test asserting
`gateway enroll` dispatches to cmd_gateway_enroll with its flags parsed.
2026-06-18 12:01:54 +10:00
Ben
6e20c1992f docs(gateway): rewrite contract §6 to the A2 trust-boundary model
The contract's §6 still said the connector 'forwards the signed body
byte-for-byte so the gateway's existing crypto validates against unmodified
bytes.' That model is incoherent under an untrusted, disposable tenant
gateway on a shared bot:

- re-validating Twilio HMAC / WeCom crypto needs the shared signing secret
  (handing it over IS the cross-tenant leak),
- WeCom payloads are encrypted with that secret (the connector must decrypt
  at the edge just to route),
- a Discord interaction token lives inside the signed body — you can't both
  preserve the bytes and strip the credential.

Rewrites §6 to the actual model: the connector is the SOLE crypto/identity
boundary — verifies/decrypts at the edge, normalizes to a tenant-scoped
MessageEvent, strips shared-identity capabilities into its vault, and
forwards only the sanitized event. The gateway re-validates nothing (the
invariant test from the crypto-shed commit enforces this). Notes that this
unifies the passthrough + relay planes and points to the connector repo's
capability-trust-boundary.md.

Also documents the follow_up op in §4 (token-less capability action added
in the previous commit). The conformance test (§2/§3 tables) stays green;
contract is unpublished/EXPERIMENTAL so no version-bump ceremony. 55 passed.
2026-06-17 16:37:45 -07:00
Ben
5feec8b4cf test(gateway): enforce relay contract-doc ⟷ Python conformance
Add an invariant test pinning docs/relay-connector-contract.md to the
Python source of truth so the doc (which the connector repo mirrors by
hand) cannot silently drift:

- CapabilityDescriptor §2 table ⟷ dataclass fields + required/optional
- SessionSource wire keys (to_dict output) ⟷ §3 documented fields
- per-platform discriminator columns exist as real SessionSource fields
- guard that is_bot stays off the wire until deliberately promoted

Writing the test surfaced a real gap: §3 only enumerated 5 discriminators
in its per-platform table while to_dict() emits 12 keys. Seven wire keys
the connector must populate (chat_name, chat_topic, user_id_alt,
chat_id_alt, parent_chat_id, message_id, user_name) were undocumented —
a connector author reading the doc would never know to set them. Added a
complete SessionSource wire-field table to §3. The connector's existing
contract.ts already carries all 12, so no connector change is needed; the
doc was the lagging artifact.
2026-06-17 16:37:45 -07:00
Ben
ab1a42fcea docs: relay<->connector cross-repo contract (v1, experimental)
Formal interface between the Hermes gateway (RelayAdapter) and the Node
connector repo: handshake, CapabilityDescriptor field table, MessageEvent
inbound envelope with per-platform SessionSource discriminators (Discord
guild_id is REQUIRED for server isolation), outbound action set, /stop
interrupt routing, signed-body verify-at-edge/byte-preserving rule, and the
additive-only contract_version policy. Documents bot-identity-vs-tenant
separation so single-bot consolidation (Phase 6) stays open. Read-first
artifact for the connector implementer.

Phase 1, Task 1.5 of the gateway-relay plan.
2026-06-17 16:37:45 -07:00