Streamdown's per-Block parse cost grows with the live tail's length and
is unavoidable inside the block-memo pattern (industry standard, see
findings doc). The fix is to stop having that work block the main thread.
`<DeferStreamingText>` is a 12-line wrapper that reads message-part state
via `useMessagePartText`, runs it through `useDeferredValue`, and
re-publishes via assistant-ui's `<TextMessagePartProvider>`. The inner
`<StreamdownTextPrimitive>` reads the deferred value through the normal
`useMessagePartText` hook — no fork, no internal-path imports, fully on
assistant-ui's public API. React's concurrent scheduler then:
- abandons in-flight deferred renders when a newer token arrives, so
intermediate states get skipped under fast streams
- deprioritises the markdown render when the main thread has urgent
work (typing, scroll), so input stays responsive even while a
100ms parse is queued
Streamdown already uses `useTransition` for its block-array setState;
this lifts the deferral up to the consumer boundary so it covers the
whole pipeline (preprocess → split → repair → parse → render).
A/B on the 34 MB session, 300 tokens at 50 tok/sec, markdown chunks
(four trials each, with the 33ms flush throttle on for both):
| | avgFps | p99 frame | LTs/5s | max LT | typing-while-stream p95 |
|---|---|---|---|---|---|
| pre | 54.3 | 41 ms | 1.7 | 110 ms | ~17 ms |
| post | 58.5 | 31 ms | 2.0 | 117 ms | 14-18 ms |
Longtask count + max LT unchanged — useDeferredValue doesn't reduce
CPU, only its priority. The avgFps lift and p99 frame drop are the
proof that the existing CPU is no longer blocking 60 fps cadence. One
clean run logged MUTATIONS=0 — React skipped every intermediate text
state and only committed the final one (textbook deferred-value
behaviour).
The actually-reduce-CPU path is replacing the parser with a state
machine like Flowdown — left for a future PR; see
`apps/desktop/scripts/profile-typing-lag.md` for the full investigation.
`scheduleDeltaFlush` previously coalesced via `requestAnimationFrame`
only. The "at most one flush per frame" guarantee that gives you is fine
for fast streams (>~80 tok/sec) where multiple tokens arrive within a
single frame, but breaks down at typical LLM token rates (30-80 tok/sec)
where each token arrives slower than the rAF cadence and triggers its
own React commit + Streamdown markdown re-parse.
Track `lastFlushAt` and require at least 33 ms between two flushes.
React 18+ auto-batching probabilistically already collapsed some of
these, but the floor makes it deterministic.
A/B on the 34 MB session, 300 tokens at 50 tok/sec (markdown chunks):
| | avgFps | p99 frame | LTs / 5 s | max LT |
|---|---|---|---|---|
| no floor (current rAF) | 54.0 | 38 ms | 2.0 | 145 ms |
| 33 ms floor (this PR) | 54.3 | 41 ms | 1.7 | 110 ms |
`inter-mutation` p50 also tightens from 22-28 ms to a clean 33 ms,
which is the expected signature of a deterministic floor. Doesn't fully
solve the user's perceived hitches — Streamdown's per-Block parse cost
when the last block grows past ~2 k chars is still the elephant — but
it consistently shaves the worst-case longtask and makes the streaming
cadence visibly steadier.
Also threads a matching `flushMinMs` option through the synthetic
stream driver in `perf-probe.tsx` + `scripts/measure-synthetic-stream.mjs`
so the harness can A/B both regimes without spending LLM credits.
See `scripts/profile-typing-lag.md` for the full investigation.
Drops the React `<Profiler>` approach (no-op because Vite is currently
serving the production React build) in favor of an externally-observable
measurement stack: rAF frame intervals, `PerformanceObserver({entryTypes:
['longtask']})`, and a `MutationObserver` on the live streaming message.
Adds a synthetic stream driver — `window.__PERF_DRIVE__.stream({...})` —
that pushes tokens through the live `$messages` atom at a controlled rate,
so the assistant-ui runtime, incremental repository, and Streamdown
markdown pipeline see the same workload they'd see during a real LLM
stream, without the LLM cost.
The driver lives in `src/app/chat/perf-probe.tsx`; `main.tsx` side-imports
it under `import.meta.env.MODE !== 'production'` so it tree-shakes out of
prod builds. (Using `MODE` rather than `DEV` because our Vite setup
currently reports `DEV=false` even under `vite dev` — see the dev-build
note in `profile-typing-lag.md`.)
Scripts:
- measure-synthetic-stream.mjs drive synthetic + record frame/longtask/mutation
- profile-synth-stream.mjs CPU profile + top self-time during synthetic
- measure-real-stream.mjs same harness, real LLM stream
- profile-real-stream.mjs CPU profile bracketing the real stream window
- eval.mjs / reload.mjs small CDP helpers
A real-LLM measurement on Cloud Shadows (gpt-4o-mini, 39 s window) showed
12 longtasks in the same 75-127 ms range the synthetic predicted, so the
synthetic is a faithful proxy.
Re-ran the leak harness on a populated session (Phaser thread) for both
unpatched and patched builds. The original 'listener leak' was transient
warm-up cost, not a steady-state leak — both versions show 0 listener
growth/round in steady state.
The load-bearing number is forced layouts per character:
unpatched (HEAD~2): 7.02 layouts/char
patched (HEAD): 2.35 layouts/char (3× fewer)
The patches reduce per-char forced-layout work to Blink's natural floor.
Document node count and heap are flat in both builds.
The slowest user-felt path is typing into the composer while the
assistant is streaming. Profile (scripts/profile-under-stream.mjs):
FadeText measureOverflow self time: 35.8 ms → 18.1 ms (-50%)
total active CPU during 7s window: ~150 ms → ~50 ms
Two changes in src/components/ui/fade-text.tsx:
1. Drop the `useEffect([children])` that re-ran `measureOverflow`
(reads scrollWidth + clientWidth — forced layout) on every parent
re-render. `useResizeObserver` already fires the same callback on
mount and whenever the host span's box size changes; that covers
the only case where overflow state can legitimately change. The
previous explicit useEffect was a forced-layout flush on every
parent render, which during streaming meant every token tick.
2. Wrap the component in `memo` with a custom comparator that
short-circuits the entire render when scalar string `children` and
the className/fadeWidth/style props are unchanged. The hot path
was tool-fallback's title chips being re-rendered by parent
streaming updates even though their text was stable; memo+
comparator skips that.
Also adds two harness scripts under apps/desktop/scripts/:
- latency-under-stream.mjs (key→paint latency while a turn streams)
- profile-under-stream.mjs (CPU profile while a turn streams)
Updates profile-typing-lag.md with the streaming numbers and confirms
the Enter→paint submit path is already fast (≤320ms on the populated
session; the 2s "stall after Enter" the user noticed once was a
one-time cold-start, not reproducible at the UI layer).
I'd guess the felt jank in real use is fast-burst typing during a
long-form streaming reply (code blocks + markdown lists multiply the
per-token render cost). The CPU savings here scale linearly with
token volume.
Empirical work via CDP harnesses under apps/desktop/scripts/ (see
profile-typing-lag.md):
jsListeners growth (per round of 200 chars + GC):
before: +35 (verified leak — listeners stuck after 1st trigger popover use)
after: +0
Four narrow edits in src/app/chat/composer/index.tsx:
1. Drop the per-keystroke `editorRef.current.scrollHeight` read used to
decide composer expansion. Replace with `draft.length > 60` heuristic;
the existing ResizeObserver still catches edge cases. `scrollHeight`
is a forced-layout call and was firing on every char until the first
wrap.
2. Bucket measured composer height to 8px before writing
`--composer-measured-height` / `--composer-surface-measured-height`
on `documentElement`. Without this, the editor grows ~1px per char,
setProperty fires every keystroke, computed style is invalidated tree-
wide.
3. Remove the dead `$composerDraft` two-way sync. Nothing outside the
composer subscribed to that atom (verified via grep). Two useEffects
on `[draft]` were pushing draft→atom and atom→aui per keystroke for
no consumer. Also drop the per-keystroke
`reconcileComposerTerminalSelections` call; it was pruning stale
labels for `terminalContextBlocksFromDraft`, but that helper already
ignores labels not in the current submitted text, so pruning per
keystroke was just bookkeeping.
4. `refreshTrigger` fast-bails when the draft contains neither `@` nor
`/`. Previously `textBeforeCaret(editor)` ran on every input/keyup
regardless; `range.toString()` inside is O(n) over draft length.
Synthetic typing latency p50/p90/p99 is similar before vs after on a
freshly-loaded session (Blink can already handle ~30cps typing into a
contentEditable on its own); the real win is the listener leak being
gone and the global computed-style invalidations dropping ~8× when the
composer is sitting at a fixed height row.
The `Enter → stall` follow-up (see profile-typing-lag.md §"Submit /
TTFT stall") is unmeasured here — needs a throwaway session because
the harness fires a real prompt. Not blocking this commit.