mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-27 11:22:03 +00:00
feat(relay): terminal 4401 (opt-out) → clean "Relay disabled" state
Some checks are pending
CI / detect (push) Waiting to run
CI / tests (push) Blocked by required conditions
CI / lint (push) Blocked by required conditions
CI / typecheck (push) Blocked by required conditions
CI / docs-site (push) Blocked by required conditions
CI / history-check (push) Blocked by required conditions
CI / contributor-check (push) Blocked by required conditions
CI / uv-lockfile (push) Blocked by required conditions
CI / docker-lint (push) Blocked by required conditions
CI / supply-chain (push) Blocked by required conditions
CI / osv-scanner (push) Blocked by required conditions
CI / All required checks pass (push) Blocked by required conditions
Deploy Site / deploy-vercel (push) Waiting to run
Deploy Site / deploy-docs (push) Waiting to run
Docker Build and Publish / build-amd64 (push) Waiting to run
Docker Build and Publish / build-arm64 (push) Waiting to run
Docker Build and Publish / merge (push) Blocked by required conditions
Phase 7 Unit 7d-B. When an operator opts an instance OUT of the Team Gateway
relay (Unit 7b deprovision), the connector revokes the per-gateway secret and
closes the gateway's WS with 4401. The reconnect supervisor previously treated
EVERY close as retryable, so the live process spun "retrying 4401" forever and
the dashboard showed a red error — opt-out looked like a failure.
Now a 4401 close that arrives AFTER a successful handshake is recognized as a
terminal credential revocation:
- ws_transport.py: track `_handshake_succeeded` (set when a descriptor is
received); on a 4401 close after a prior success, latch `auth_revoked` and do
NOT spawn the reconnect supervisor. A 4401 BEFORE any successful handshake
stays retryable (cold-start / not-yet-provisioned race, not a revocation).
New `auth_revoked` property + a websockets-version-safe close-code reader
(prefers `.rcvd`/`.sent` Close frames; `.code` is deprecated in websockets 13+).
- adapter.py: a revocation monitor turns `transport.auth_revoked` into a clean,
NON-retryable `relay_disabled` fatal and notifies the gateway's fatal-error
handler (so the adapter is removed and NOT queued for reconnection — the
credential is dead until the instance is recreated). Monitor is cancelled on
disconnect; only started when the transport exposes `auth_revoked` (prod WS).
- run.py: `_handle_adapter_fatal_error` maps the `relay_disabled` code to a
`disabled` platform_state (not `fatal`/`retrying`).
- web: PlatformsCard renders the `disabled` state with a neutral outline badge,
a PowerOff icon, and muted (not destructive-red) text + message. New optional
`status.disabled` i18n string ("Disabled").
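A minimal sketch of the version-safe close-code reader described in the ws_transport.py bullet above. The function name `read_close_code` and the duck-typed fakes are illustrative assumptions, not the code in this commit; the only facts relied on are that a connection-closed exception may carry `.rcvd` / `.sent` Close frames and a legacy `.code` attribute.

```python
from types import SimpleNamespace
from typing import Any, Optional


def read_close_code(exc: Any) -> Optional[int]:
    """Return the close code carried by a connection-closed exception.

    Prefers the Close frames on `.rcvd` (from the peer) and `.sent`
    (from us); falls back to the legacy `.code` attribute only when
    neither frame carries a code.
    """
    for frame_attr in ("rcvd", "sent"):
        frame = getattr(exc, frame_attr, None)
        code = getattr(frame, "code", None)
        if isinstance(code, int):
            return code
    legacy = getattr(exc, "code", None)
    return legacy if isinstance(legacy, int) else None


# Hypothetical stand-in for an exception raised after the peer closed with 4401.
peer_closed = SimpleNamespace(rcvd=SimpleNamespace(code=4401, reason=""), sent=None)
print(read_close_code(peer_closed))  # 4401
```

Reading through `getattr` keeps the helper independent of the installed websockets version: the frame path is tried first, so the deprecated attribute is touched only as a last resort.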
Also bundles the Phase 7 contract-doc update (this doc is authoritative in
hermes-agent): docs/relay-connector-contract.md gains an "Author-first
resolution + the account-link (DM) path" section documenting the
multi-tenant-guild rule (D-7.2 — route by authenticated author binding, never by
guild; unlinked → fail-closed), the `/link <code>` DM flow, and the
connector-authoritative opt-out + terminal-4401 behavior this PR implements.
Tests: +2 ws_transport (4401-after-handshake terminal / no-reconnect;
4401-before-handshake stays retryable) and +2 adapter (revocation → non-retryable
relay_disabled fatal + handler fired; no-revocation → no fatal). 138 relay tests
pass (incl. the contract-doc conformance test); ruff clean; web tsc clean.
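The adapter-side behavior covered by the two adapter tests can be sketched roughly as follows. This is an illustration under assumptions: the polling loop, `FakeTransport`, `monitor_revocation`, and `platform_state_for` are hypothetical names, and the fallback to `retrying` / `fatal` for other codes is assumed. Only the `relay_disabled` code, its non-retryable nature, and the `disabled` platform state come from this commit message.

```python
import asyncio
from typing import Awaitable, Callable, List


class FakeTransport:
    """Stand-in transport exposing only the `auth_revoked` flag."""

    def __init__(self) -> None:
        self.auth_revoked = False


def platform_state_for(code: str, retryable: bool) -> str:
    """Map a fatal-error code to a dashboard state."""
    if code == "relay_disabled":
        return "disabled"
    # Assumption: every other code falls back to retrying or fatal.
    return "retrying" if retryable else "fatal"


async def monitor_revocation(
    transport: FakeTransport,
    on_fatal: Callable[[str, bool], Awaitable[None]],
    poll_interval: float = 0.01,
) -> None:
    """Wait until the transport latches `auth_revoked`, then report once."""
    while not transport.auth_revoked:
        await asyncio.sleep(poll_interval)
    await on_fatal("relay_disabled", False)  # non-retryable


async def demo() -> List[str]:
    transport = FakeTransport()
    states: List[str] = []

    async def on_fatal(code: str, retryable: bool) -> None:
        states.append(platform_state_for(code, retryable))

    task = asyncio.create_task(monitor_revocation(transport, on_fatal))
    await asyncio.sleep(0.03)
    assert states == []  # no revocation, so no fatal was reported
    transport.auth_revoked = True
    await task
    return states


print(asyncio.run(demo()))  # ['disabled']
```

A real adapter would also cancel the monitor task on disconnect, as the adapter.py bullet notes; that step is left out here for brevity.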
Phase 7 Unit 7d-B (relay-adapter solo lane). Q17 → Option 2; Option 3 (live
de-register, no recreate) + the restart-re-provision hole deferred post-alpha.
parent 3c75e11571
commit c93b9f9057
9 changed files with 367 additions and 8 deletions
@@ -186,6 +186,56 @@ tenant**. Tenant is resolved from the event's own discriminator (Discord
token/socket/process delivered it. This keeps one shared bot able to front many
tenants (Phase 6) without overloading an existing field.

### Author-first resolution + the account-link (DM) path (Phase 7)

Phase 7 adds **self-serve, per-user onboarding to a shared bot**, which changes
*which* discriminator resolves the instance for a routed inbound message, and
adds a management path for users to bind their own account.

**Author-first resolution (the multi-tenant-guild rule, D-7.2).** A single
Discord guild may hold **many** tenants: different members each linked to their
own agent. So for delivery the connector resolves the destination instance from
the **authenticated author binding** (`user_instance_binding`, keyed by
`(tenant, platform, platform_user_id)` via `resolveByUser`), **NOT** by a
guild→instance route. Concretely:

- A routed message authored by a **linked** user reaches **only that user's**
  instance, even when a second linked user in the **same guild** is served by a
  different instance (each reaches only their own).
- A message authored by an **unlinked** user resolves to **no** instance and is
  dropped (**fail-closed**: never broadcast to the guild's other tenants).
- The author id used is the **authentic `user_id` off the observed event**, the
  same `SessionSource.user_id` documented above, never a value asserted by a
  gateway or carried in a management frame.

This is the per-`user_id` owner-only routing the connector enforces in
`WsGatewayDelivery` (the gateway-side multi-tenant-guild E2E driver
`gateway_multitenant_guild_driver.py` is the cross-repo oracle).
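
The author-first rule above can be illustrated with a small sketch. `BindingStore`, `route_inbound`, and the sample ids are hypothetical names for illustration only; the connector's real `resolveByUser` is not shown in this document, and how the `tenant` component of the key is derived is outside the sketch (it is passed in as an opaque string).

```python
from typing import Dict, Optional, Tuple

# (tenant, platform, platform_user_id), the key shape named in the text above.
BindingKey = Tuple[str, str, str]


class BindingStore:
    """Hypothetical in-memory stand-in for `user_instance_binding`."""

    def __init__(self) -> None:
        self._bindings: Dict[BindingKey, str] = {}

    def bind(self, tenant: str, platform: str, platform_user_id: str,
             instance_id: str) -> None:
        self._bindings[(tenant, platform, platform_user_id)] = instance_id

    def resolve_by_user(self, tenant: str, platform: str,
                        platform_user_id: str) -> Optional[str]:
        return self._bindings.get((tenant, platform, platform_user_id))


def route_inbound(store: BindingStore, tenant: str, platform: str,
                  event_user_id: str, guild_id: str) -> Optional[str]:
    """Return the destination instance id, or None to drop the message.

    `guild_id` is accepted only to show that it plays no part in routing.
    """
    return store.resolve_by_user(tenant, platform, event_user_id)


store = BindingStore()
store.bind("t1", "discord", "alice", "instance-a")
store.bind("t1", "discord", "bob", "instance-b")

# Same guild, different authors: each reaches only their own instance.
print(route_inbound(store, "t1", "discord", "alice", guild_id="g1"))    # instance-a
print(route_inbound(store, "t1", "discord", "bob", guild_id="g1"))      # instance-b
# Unlinked author: no instance, so the caller drops the message (fail-closed).
print(route_inbound(store, "t1", "discord", "mallory", guild_id="g1"))  # None
```

Returning `None` rather than a default instance is what makes the lookup fail-closed: there is no code path that falls back to a guild-level route.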

**The account-link (DM) path.** A user binds their account to an instance with a
one-time code, redeemed by DMing the shared bot:

1. The owner triggers a link from the Portal (or a self-hosted CLI). The
   connector mints a short-lived **link code** for the **authenticated**
   instance (`POST /manage/link`; instanceId comes from the caller's principal,
   either a NAS-signed `aud=agent:{instanceId}` token or the instance's own
   per-gateway secret, and **never** from the request body).
2. The user sends `/link <code>` as a **direct message** to the shared bot from
   the account they want to bind.
3. The connector's inbound observer **consumes** that DM (it is not routed to any
   agent) and writes the `user_instance_binding` using the **authentic
   `user_id`** off the observed DM event. From then on, author-first resolution
   routes that user's messages to the bound instance.
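
The three steps above can be sketched as a mint/redeem pair. Everything named here (`LinkCodes`, `handle_dm`, the 600-second TTL, the plain dict of bindings, and the choice to consume a DM whose code fails to redeem) is an assumption made for illustration; the document only specifies that the code is short-lived and one-time, that the instance comes from the authenticated principal, and that the binding uses the user id observed on the DM.

```python
import secrets
import time
from typing import Callable, Dict, Optional, Tuple


class LinkCodes:
    """Hypothetical store of one-time link codes; the TTL value is an assumption."""

    def __init__(self, ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._pending: Dict[str, Tuple[str, float]] = {}

    def mint(self, instance_id: str) -> str:
        # instance_id must come from the authenticated principal, not a request body.
        code = secrets.token_urlsafe(8)
        self._pending[code] = (instance_id, self._clock() + self._ttl)
        return code

    def redeem(self, code: str) -> Optional[str]:
        entry = self._pending.pop(code, None)  # pop makes the code single-use
        if entry is None:
            return None
        instance_id, expires_at = entry
        return instance_id if self._clock() < expires_at else None


def handle_dm(text: str, event_user_id: str, codes: LinkCodes,
              bindings: Dict[str, str]) -> bool:
    """Return True when the DM was a `/link <code>` command and was consumed."""
    parts = text.strip().split()
    if len(parts) != 2 or parts[0] != "/link":
        return False  # not a link command; normal handling applies
    instance_id = codes.redeem(parts[1])
    if instance_id is not None:
        # Bind using the user id observed on the DM event itself.
        bindings[event_user_id] = instance_id
    return True  # assumption: a failed redeem is still consumed, never routed


codes = LinkCodes()
bindings: Dict[str, str] = {}
code = codes.mint("instance-a")
handle_dm(f"/link {code}", "user-1", codes, bindings)
handle_dm(f"/link {code}", "user-2", codes, bindings)  # code already spent
print(bindings)  # {'user-1': 'instance-a'}
```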

**Opt-out is connector-authoritative.** Deprovisioning an instance
(`POST /manage/deprovision`) drops its author bindings (so its users stop
resolving to it) **and** revokes its per-gateway secret (so its socket can no
longer authenticate; the next WS upgrade is closed **4401**). A gateway that
sees a **4401 close after a previously-successful handshake** treats it as a
terminal revocation: it stops reconnecting and reports the relay platform as
**disabled** (not a retryable error). A 4401 *before* any successful handshake
stays retryable (a cold-start / not-yet-provisioned race, not a revocation).
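
The gateway-side decision described above reduces to a small latch. The sketch below is illustrative: `ReconnectPolicy` and its method names are hypothetical, while `_handshake_succeeded` and `auth_revoked` are the names used in the commit message for ws_transport.py.

```python
AUTH_REVOKED_CLOSE_CODE = 4401


class ReconnectPolicy:
    """Decides whether a closed relay socket should be retried."""

    def __init__(self) -> None:
        self._handshake_succeeded = False
        self._auth_revoked = False

    @property
    def auth_revoked(self) -> bool:
        return self._auth_revoked

    def on_descriptor_received(self) -> None:
        # Receiving a descriptor marks the handshake as successful.
        self._handshake_succeeded = True

    def should_reconnect(self, close_code: int) -> bool:
        # 4401 after a prior successful handshake is a terminal revocation.
        if close_code == AUTH_REVOKED_CLOSE_CODE and self._handshake_succeeded:
            self._auth_revoked = True
        # The latch never resets: once revoked, no close is retried.
        return not self._auth_revoked


cold_start = ReconnectPolicy()
print(cold_start.should_reconnect(4401))  # True (no handshake yet, so retry)

live = ReconnectPolicy()
live.on_descriptor_received()
print(live.should_reconnect(1006))  # True (ordinary drop, retry)
print(live.should_reconnect(4401))  # False (revoked, stop)
print(live.auth_revoked)            # True
```

Keeping the handshake flag separate from the revocation latch is what lets a 4401 during a cold-start race stay retryable while a 4401 on a previously healthy connection is final.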
### 3.2 Going-idle / buffered-flip primitive (§5.3)
A scale-to-zero PRIMITIVE (not the behaviour — nothing here decides to sleep or