After a prolonged outage the in-process network-error ladder escalates to
fatal and GatewayRunner._platform_reconnect_watcher rebuilds a fresh adapter
that reconnects through the bootstrap path. That path called
start_polling(drop_pending_updates=True), discarding every update Telegram
queued during the outage — all messages sent while the bot was down were
silently lost. The in-process ladder and 409-conflict handler already passed
drop_pending_updates=False; only bootstrap did not distinguish a cold first
boot from a reconnect.
Thread an is_reconnect signal from the watcher through
_connect_adapter_with_timeout into adapter.connect(). The base
BasePlatformAdapter.connect() gains a keyword-only is_reconnect=False so every
adapter inherits a tolerant signature (no per-platform breakage when the
runner forwards the kwarg). Telegram translates is_reconnect into
drop_pending_updates=not is_reconnect on both the polling and webhook bootstrap
calls. Cold boot still drops the stale queue; a watcher reconnect preserves it.
Fixes#46621.
Co-authored-by: annguyenNous <annguyen@nousresearch.com>
Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>
Co-authored-by: Kewe63 <Kewe63@users.noreply.github.com>
Defense-in-depth polish on top of the webhook listener before it becomes
a real attack surface once the pipeline starts creating subscriptions
and Graph starts POSTing to the configured public URL.
- Timing-safe clientState comparison. Previously used `==` on strings;
switches to hmac.compare_digest so a mismatch does not leak how many
leading characters matched. client_state is documented as a strong
shared secret (openssl rand -hex 32 in the setup docs), so a
timing-safe primitive is the right call.
- Split GET and POST handlers. Graph validates a subscription by sending
GET with validationToken in the query; anything else on GET is now a
400 so the endpoint cannot be probed or mistakenly used for data
exfil. Previously a bare GET fell through to the POST path and blew
up on request.json() with a confusing 400.
- Empty response bodies on success. 202 is returned with no body so
internal counters (accepted / duplicates / scheduled) do not leak to
any caller that can reach the endpoint; counters remain observable
via /health for operators. 403 on every-item-bad-clientState batches
(so forged POSTs stop retrying), 400 on malformed / unknown-resource
batches (sender configuration issue).
- Optional source-IP allowlist. New `allowed_source_cidrs` extra field
(list or comma-separated string) and `MSGRAPH_WEBHOOK_ALLOWED_SOURCE_CIDRS`
env var let operators restrict the webhook to Microsoft Graph's
published webhook source ranges in production. Empty = allow all,
preserving dev-tunnel / localhost workflows. Invalid CIDRs are
logged and ignored rather than crashing. Also gates the handshake
endpoint so disallowed IPs cannot probe it.
- Tests updated for the new response contract (empty-body 202,
auth-only 403, config-error 400) and extended to cover: bare GET
rejection, POST-with-validationToken handshake tolerance,
timing-safe compare actually invoked via hmac.compare_digest spy,
malformed body / missing value array, IP allowlist accept/reject
paths, handshake IP allowlist, invalid CIDR entries, comma-string
CIDR list parsing. 52/52 passed (was 40).
Full gateway suite: 5049 passed / 1 pre-existing failure in
test_discord_free_response (unrelated, reproduces on clean origin/main).