From 935f2bc48daa9c7fad73f80c95d4375da18a39c1 Mon Sep 17 00:00:00 2001
From: Ben Barclay
Date: Wed, 24 Jun 2026 12:27:19 +1000
Subject: [PATCH] =?UTF-8?q?docs(relay):=20add=20=C2=A73.4=20=E2=80=94=20ob?=
 =?UTF-8?q?ligations=20on=20a=20future=20scale-to-zero=20behaviour=20layer?=
 =?UTF-8?q?=20(#51633)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The contract already documents the scale-to-zero PRIMITIVES (§3.2 going-idle/
buffered-flip, §3.3 wake poke) and what's out of scope. This adds the missing
half: the contract FROM the primitives TO the behaviour layer — the guarantees
a separate scale-to-zero workstream must honour to consume them safely (register
a wakeUrl before suspend; drain+ack before teardown; keep the reconnect loop
live; treat suspended != down in the health model; don't assume exactly-once/
prompt wake; suspend only when genuinely idle, composing with the existing
drain machine).

Docs-only; lets the independent scale-to-zero stream build against a written
contract instead of re-reading the connector.
---
 docs/relay-connector-contract.md | 51 ++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/docs/relay-connector-contract.md b/docs/relay-connector-contract.md
index b5238ad75da..ee5ddff77db 100644
--- a/docs/relay-connector-contract.md
+++ b/docs/relay-connector-contract.md
@@ -259,6 +259,57 @@ the wake SIGNAL so a future scale-to-zero behaviour layer can rely on "buffered
 > primitive is "buffered event for a sleeping instance ⇒ its wakeUrl gets poked";
 > WHAT makes the instance sleep (and wake-to-serve) is the behaviour layer.
 
+### 3.4 Obligations on a future scale-to-zero behaviour layer
+
+§3.2 and §3.3 ship the **primitives**; this section is the **contract a separate
+scale-to-zero behaviour workstream must honour to consume them safely.** It owns
+the *decision* to suspend, the actual machine suspend, and the platform/health
+model — none of which live here — but it MUST hold these guarantees, which the
+primitives assume:
+
+1. **Register a `wakeUrl` before the instance can ever be suspended.** A
+   suspended instance with no registered `wakeUrl` is a black hole — buffered
+   inbound never triggers a poke, so it sleeps through its own traffic until
+   something else reconnects it. The behaviour layer MUST ensure a reachable
+   wake target is registered (self-hosted: `--wake-url`; managed: stamped) as a
+   precondition of allowing suspend. A wake URL that is unreachable while the
+   machine is suspended (e.g. points at the suspended machine itself with no
+   platform autostart in front) is equivalent to none.
+2. **Drain through `going_idle` → await `going_idle_ack` BEFORE tearing down the
+   socket or suspending.** Never suspend with an un-acked flip in flight. The
+   ack is the connector's confirmation that delivery for this instance is now
+   buffered-only; a machine that suspends after sending `going_idle` but before
+   the ack can drop the inbound that races the flip. The gateway already gates
+   socket teardown on the ack (Q-5.3c); the suspend step MUST sit *after* a
+   clean drain completes, not race it.
+3. **Keep the NET-NEW reconnect loop live as a precondition of suspend.** The
+   wake→drain contract is "poke ⇒ the gateway re-dials ⇒ the connector drains on
+   the reconnect handshake." If the reconnect loop is disabled, a poke lands on a
+   machine that never re-dials and the buffer strands. The behaviour layer must
+   not suspend an instance whose relay transport won't reconnect on wake.
+4. **Treat suspended ≠ down in the health model (Q-5.3b).** A suspended instance
+   is healthy-asleep, not failed. The health/monitoring layer MUST distinguish
+   the two (e.g. via the platform machine-state) so a suspended instance is not
+   restarted, alerted on, or reaped as unhealthy — that would defeat the suspend
+   and can race the wake/drain.
+5. **The wake poke is best-effort and rate-limited — do not assume exactly-once
+   or immediate wake.** At most one poke per cooldown window per instance, and a
+   failed poke is swallowed. The behaviour layer must not rely on the poke as a
+   guaranteed/prompt signal; correctness still rests on "the gateway drains
+   whenever it next reconnects." A belt-and-suspenders wake (e.g. a scheduled
+   job that also reconnects) is the behaviour layer's call, not the primitive's.
+6. **Suspend only when genuinely idle — and idle is connector-observable, not
+   gateway-guessed.** WHAT counts as idle (no in-flight turn + no inbound for N
+   min) is the behaviour layer's policy, but it must compose with the existing
+   drain machinery (`gateway_state` running→draining) rather than introduce a
+   parallel relay-only idle path — the same integration constraint §3.2 places
+   on `going_idle`.
+
+These are guarantees the behaviour layer OWES the primitives; the primitives owe
+the behaviour layer only what §3.2/§3.3 already specify (a flip-on-going_idle,
+a durable per-instance buffer + ack-gated reconnect drain, and a poke on the
+first buffered event for a flipped instance).
+
 ---
 
 ## 4. Outbound: action set
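
Reviewer note (illustrative only, not part of the patch): obligations 1, 2, 3 and 6 in §3.4 read naturally as a pre-suspend gate, and obligation 4 as a rule in the health model. The Python sketch below is one possible way a behaviour layer might encode them. Every name in it (`InstanceSnapshot`, `suspend_blockers`, `health_status`, the `"suspended"` machine-state string, the example URL, and the 600-second idle threshold) is a hypothetical stand-in and does not come from the connector or the contract.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class InstanceSnapshot:
    """Hypothetical view of one instance as a behaviour layer might track it."""
    wake_url: Optional[str]                   # obligation 1: registered wake target
    wake_url_reachable_while_suspended: bool  # obligation 1: not the sleeping machine itself
    going_idle_sent: bool                     # obligation 2: drain was started
    going_idle_acked: bool                    # obligation 2: going_idle_ack was received
    reconnect_loop_enabled: bool              # obligation 3: transport re-dials on wake
    in_flight_turns: int                      # obligation 6: idle policy input
    seconds_since_last_inbound: float         # obligation 6: idle policy input


def suspend_blockers(snap: InstanceSnapshot, idle_after_seconds: float) -> List[str]:
    """Return the reasons suspend must not proceed; an empty list means it may."""
    blockers: List[str] = []
    if not snap.wake_url or not snap.wake_url_reachable_while_suspended:
        blockers.append("no reachable wakeUrl")
    if not (snap.going_idle_sent and snap.going_idle_acked):
        blockers.append("going_idle not acked")
    if not snap.reconnect_loop_enabled:
        blockers.append("reconnect loop disabled")
    if snap.in_flight_turns > 0 or snap.seconds_since_last_inbound < idle_after_seconds:
        blockers.append("not idle")
    return blockers


def health_status(machine_state: str, heartbeat_ok: bool) -> str:
    """Obligation 4: a suspended machine is healthy-asleep, never 'down'."""
    if machine_state == "suspended":  # assumed platform machine-state value
        return "healthy-asleep"
    return "healthy" if heartbeat_ok else "down"


ready = InstanceSnapshot(
    wake_url="https://wake.example.invalid/instance-1",  # placeholder URL
    wake_url_reachable_while_suspended=True,
    going_idle_sent=True,
    going_idle_acked=True,
    reconnect_loop_enabled=True,
    in_flight_turns=0,
    seconds_since_last_inbound=900.0,
)

print(suspend_blockers(ready, idle_after_seconds=600))
# An un-acked flip blocks suspend even when everything else is in place.
print(suspend_blockers(replace(ready, going_idle_acked=False), idle_after_seconds=600))
print(health_status("suspended", heartbeat_ok=False))
```

Obligation 5 is deliberately absent from the gate: because the poke is best-effort and rate-limited, nothing here treats a sent poke as proof of wake, and any backup wake path would live outside this check.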