Streamdown's per-Block parse cost grows with the live tail's length and
is unavoidable inside the block-memo pattern (industry standard, see
findings doc). The fix is to stop having that work block the main thread.
`<DeferStreamingText>` is a 12-line wrapper that reads message-part state
via `useMessagePartText`, runs it through `useDeferredValue`, and
re-publishes via assistant-ui's `<TextMessagePartProvider>`. The inner
`<StreamdownTextPrimitive>` reads the deferred value through the normal
`useMessagePartText` hook — no fork, no internal-path imports, fully on
assistant-ui's public API. React's concurrent scheduler then:
- abandons in-flight deferred renders when a newer token arrives, so
intermediate states get skipped under fast streams
- deprioritises the markdown render when the main thread has urgent
work (typing, scroll), so input stays responsive even while a
100ms parse is queued
Streamdown already uses `useTransition` for its block-array setState;
this lifts the deferral up to the consumer boundary so it covers the
whole pipeline (preprocess → split → repair → parse → render).
A/B on the 34 MB session, 300 tokens at 50 tok/sec, markdown chunks
(four trials each, with the 33ms flush throttle on for both):
| | avgFps | p99 frame | LTs/5s | max LT | typing-while-stream p95 |
|---|---|---|---|---|---|
| pre | 54.3 | 41 ms | 1.7 | 110 ms | ~17 ms |
| post | 58.5 | 31 ms | 2.0 | 117 ms | 14-18 ms |
Longtask count + max LT unchanged — useDeferredValue doesn't reduce
CPU, only its priority. The avgFps lift and p99 frame drop are the
proof that the existing CPU is no longer blocking 60 fps cadence. One
clean run logged MUTATIONS=0 — React skipped every intermediate text
state and only committed the final one (textbook deferred-value
behaviour).
The actually-reduce-CPU path is replacing the parser with a state
machine like Flowdown — left for a future PR; see
`apps/desktop/scripts/profile-typing-lag.md` for the full investigation.
Profiling renderer typing lag
Workflow for empirically measuring (and fixing) typing/submit lag in the desktop chat composer.
Quick boot for profiling
Vite 8 + plugin-react 6 has a known issue where the React Fast Refresh
preamble script isn't injected into index.html, so opening Electron at
http://127.0.0.1:5174 throws $RefreshReg$ is not defined on every TSX
module and the React tree never mounts. Workaround: run vite with HMR off.
# Terminal A — start dev server without HMR
cd apps/desktop
node scripts/dev-no-hmr.mjs
# Terminal B — start Electron with CDP exposed
cd apps/desktop
XCURSOR_SIZE=24 HERMES_DESKTOP_DEV_SERVER=http://127.0.0.1:5174 \
../../node_modules/.bin/electron --remote-debugging-port=9222 .
Terminal C is yours to run the harnesses.
Harnesses
All zero-dep — Node 24 built-in WebSocket + fetch.
Typing latency — measure-latency.mjs
Per-keystroke keypress → next paint latency, p50/p90/p99/max.
Synthesizes keystrokes via Input.dispatchKeyEvent so the run is
reproducible.
node apps/desktop/scripts/measure-latency.mjs --chars=120 --cps=20
Anything > 16ms is a dropped frame. On a freshly-loaded session
(scripts/click-session.mjs 'Phaser particle') we currently see:
| | unpatched | patched |
|---|---|---|
| p50 paint | 1.9 ms | 2.0 ms |
| p90 paint | 3.3 ms | 13.7 ms |
| p99 paint | 16.7 ms | 15.2 ms |
| max paint | 20.5 ms | 30.4 ms |
| >16ms drops | 2/120 | 1/120 |
Roughly even on a quick session: the baseline is already snappy on synthetic input, so the patches have little typing latency left to recover under benign conditions. The real wins are in the leak counters (see below). If the user reports typing jank, capture a profile and heap diff during their actual usage and compare against the synthetic baseline to identify which condition (long thread, popover open, paste, etc.) makes the path slow.
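As an illustration of how the p50/p90/p99/max and drop columns can be derived from raw per-keystroke samples, here is a minimal sketch. It assumes a nearest-rank percentile and a 16 ms drop threshold; `nearestRank` and `summarizeLatencies` are hypothetical names, and measure-latency.mjs may compute its percentiles differently.

```typescript
interface LatencySummary {
  p50: number;
  p90: number;
  p99: number;
  max: number;
  drops: number;
}

/** Nearest-rank percentile over an ascending-sorted array (assumed method). */
function nearestRank(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return Number.NaN;
  // Integer arithmetic first, so 90% of 100 samples is exactly rank 90.
  const rank = Math.ceil((p * sortedAsc.length) / 100);
  const index = Math.min(sortedAsc.length - 1, Math.max(0, rank - 1));
  return sortedAsc[index] ?? Number.NaN;
}

/** Summarize keypress-to-paint samples (in ms) the way the table above reports them. */
function summarizeLatencies(samplesMs: number[], dropThresholdMs = 16): LatencySummary {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  return {
    p50: nearestRank(sorted, 50),
    p90: nearestRank(sorted, 90),
    p99: nearestRank(sorted, 99),
    max: nearestRank(sorted, 100),
    // A sample above the threshold is counted as a dropped frame.
    drops: samplesMs.filter((ms) => ms > dropThresholdMs).length,
  };
}
```

With only 120 samples per run, p99 is effectively the second-slowest keystroke, which is one reason single-run p99 and max values swing between builds.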
Leak counters — leak-typing.mjs
Types N chars per round, clears, force-GCs, captures
Performance.getMetrics deltas. Reveals leaked event listeners, heap
drift, document node growth, and forced-layout counts.
# After clicking into a real session (e.g. via click-session.mjs):
node apps/desktop/scripts/leak-typing.mjs --rounds=8 --chars=200 --cps=50
Real-session numbers (Phaser thread, 8 rounds × 200 chars):
| | unpatched (HEAD~2) | patched (HEAD) |
|---|---|---|
| jsListeners growth/round | +0 | +0 |
| DOM nodes growth/round | +0 | +0 |
| heap growth/round | ~0 (V8 housekeeping) | ~0 |
| forced layouts/char | 7.02 | 2.35 (3× fewer) |
The forced-layout count is the load-bearing number — typing into a real session was triggering ~7 layouts per character on the unpatched build (scrollHeight reads + per-px CSS var writes + FadeText scrollWidth reads all stacking up). After the patches it's down to ~2.35/char, which is Blink's natural cost for a 1px/char-growing contentEditable and can't be lowered further without architectural changes.
The initial "+35 listeners/round leak" I called out on the first unpatched run turned out to be transient warm-up (popovers initializing, etc.); steady-state listener growth was 0 both before and after.
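The per-round and per-character figures above come from differencing metric snapshots. The sketch below shows one way to do that arithmetic, assuming each snapshot is the `metrics` array returned by CDP's `Performance.getMetrics` (entries of `{ name, value }`, with names such as `JSEventListeners` and `LayoutCount`). The helper names and the `warmupRounds` parameter are hypothetical and not taken from leak-typing.mjs.

```typescript
interface CdpMetric {
  name: string;
  value: number;
}

type Snapshot = Record<string, number>;

/** Flatten a Performance.getMetrics result into a name -> value map. */
function toSnapshot(metrics: CdpMetric[]): Snapshot {
  const out: Snapshot = {};
  for (const metric of metrics) out[metric.name] = metric.value;
  return out;
}

/**
 * Average growth of one metric per round. Skipping the first
 * `warmupRounds` snapshots separates one-off initialization
 * (popovers mounting, etc.) from a steady-state leak.
 */
function growthPerRound(snapshots: Snapshot[], name: string, warmupRounds = 0): number {
  const steady = snapshots.slice(warmupRounds);
  if (steady.length < 2) return 0;
  const first = steady[0]?.[name] ?? 0;
  const last = steady[steady.length - 1]?.[name] ?? 0;
  return (last - first) / (steady.length - 1);
}

/** Forced layouts per typed character between two snapshots. */
function layoutsPerChar(before: Snapshot, after: Snapshot, chars: number): number {
  return ((after["LayoutCount"] ?? 0) - (before["LayoutCount"] ?? 0)) / chars;
}
```

A jump in the first round followed by flat values reports zero growth once the warm-up round is skipped, which is the pattern described for the listener count.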
CPU profile + heap snapshot — profile-typing.mjs
Records a CPU profile while typing, plus before/after heap snapshots so you can do a comparison diff in Chrome DevTools Memory tab.
node apps/desktop/scripts/profile-typing.mjs \
--chars=400 --cps=30 --out=/tmp/hermes-typing
# → /tmp/hermes-typing.cpuprofile (open in Chrome DevTools Performance)
# → /tmp/hermes-typing.before.heapsnapshot
# → /tmp/hermes-typing.after.heapsnapshot
Loading the cpuprofile: Chrome DevTools → Performance tab → drag the file
in, or VS Code → open the .cpuprofile directly.
For heap diff: Chrome DevTools → Memory → Load snapshot → load "before",
then Comparison view → load "after". Sort by # Delta. Stay alert for
detached DOM, FiberNodes (unmounted), and listener growth.
Helpers
- `probe-renderer.mjs`: dump page state (URL, composer mounted?, body text)
- `click-session.mjs <title>`: click a sidebar session by partial title match
- `reload-renderer.mjs`: force Page.reload via CDP (no HMR available)
- `dump-state.mjs`: richer state dump (thread message count, sticky session, etc.)
- `probe-console.mjs`: dump recent console errors / exceptions
Findings
See the commit message for the apps/desktop/src/app/chat/composer/index.tsx
edits. Four changes:
1. Per-keystroke `scrollHeight` read removed. The expansion useEffect used to read `editorRef.current.scrollHeight` on every draft change (forces synchronous layout). Replaced with a `draft.length > 60` heuristic; the ResizeObserver catches anything the heuristic misses.
2. Bucketed CSS custom-property writes. `syncComposerMetrics` used to `setProperty('--composer-measured-height', height + 'px')` on every observed resize, invalidating computed style for the whole tree. Now writes only when the height crosses an 8 px bucket, so typing in a fixed-height row produces no style invalidation at all.
3. Removed dead `$composerDraft` → `aui.composer().setText` round-trip. Nothing outside the composer subscribed to `$composerDraft` (verified via grep). The two useEffects that pushed draft → store and store → composer were pure overhead per keystroke. `reconcileComposerTerminalSelections` was also called per keystroke; can be deferred to submit time (it's a stale-pruning step, not a correctness one: `terminalContextBlocksFromDraft` walks the current text directly at submit and ignores stale labels).
4. `refreshTrigger` fast-bails when no `@` or `/` in draft. Previously `textBeforeCaret()` did `range.toString()` (O(n)) on every keystroke even when no trigger char was present.
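The change in (3) is where the first unpatched run appeared to leak ~35 event listeners per round. As noted under "Leak counters", that growth turned out to be transient warm-up and steady-state listener growth was 0 before and after, so the measured win is the forced-layout reduction.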
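The bucketed write described in the findings can be sketched as follows. Only the 8 px bucket and the `--composer-measured-height` property come from the notes above; `bucketHeight`, `createBucketedHeightWriter`, and the choice to round up are assumptions for illustration, not the actual `syncComposerMetrics` code.

```typescript
const BUCKET_PX = 8;

/** Minimal slice of CSSStyleDeclaration so the sketch runs outside a browser. */
interface StyleTarget {
  setProperty(name: string, value: string): void;
}

/** Round a measured height up to the next multiple of the bucket size (assumed policy). */
function bucketHeight(heightPx: number, bucketPx: number = BUCKET_PX): number {
  return Math.ceil(heightPx / bucketPx) * bucketPx;
}

/**
 * Returns a sync function that writes the custom property only when the
 * bucketed height changes. The boolean result says whether a write happened.
 */
function createBucketedHeightWriter(style: StyleTarget): (heightPx: number) => boolean {
  let lastBucket: number | null = null;
  return (heightPx: number): boolean => {
    const bucket = bucketHeight(heightPx);
    if (bucket === lastBucket) return false; // same bucket: skip the style write
    lastBucket = bucket;
    style.setProperty("--composer-measured-height", `${bucket}px`);
    return true;
  };
}
```

In a browser the target would be something like `document.documentElement.style`; resizes that stay inside one 8 px bucket then produce no custom-property write, and therefore no tree-wide style invalidation.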
Submit / TTFT stall (open)
User reports a perceived stall after Enter, before the assistant starts
streaming. scripts/measure-submit.mjs measures
enter → composer-cleared → user-message-rendered → first-paint. The
script triggers a real prompt submission, so use it on a throwaway
session. Not enabled in CI.
Streaming "5fps" investigation (May 21, 2026)
User complaint: "the streaming must bring fps to like 5? lol" — felt hitches during assistant streaming on long threads.
Tooling added
- `src/app/chat/perf-probe.tsx`: dev-only side-effect import (guarded by `import.meta.env.MODE !== 'production'` in `main.tsx`). Attaches two helpers to `window`:
  - `__PERF_PROBE__`: React `<Profiler>` recorder. Currently inert because Vite is serving the production React build (see "Vite dev-build issue" below); kept for when that's fixed.
  - `__PERF_DRIVE__`: synthetic stream driver. Pushes tokens through the live `$messages` atom at a fixed cadence, so the assistant-ui runtime, incremental repository, Streamdown markdown renderer, and React commit pipeline all see the same workload they'd see from a real LLM stream, but with no LLM call (and no credit cost).
- `scripts/measure-synthetic-stream.mjs`: drives `__PERF_DRIVE__`, records rAF frame intervals, `PerformanceObserver({entryTypes:['longtask']})` entries, `MutationObserver` cadence on the live message, and optional type-while-streaming keystroke latency.
- `scripts/profile-synth-stream.mjs`: CPU profile during a synthetic stream; writes a `.cpuprofile` (open in Chrome DevTools Performance panel) and a top-30 self-time table.
- `scripts/measure-real-stream.mjs`: same harness as the synthetic one but fires a real LLM prompt. Use when you have credits and want to confirm the synthetic predictions hold.
- `scripts/profile-real-stream.mjs`: CPU profile over the duration of a real LLM stream.
Helpers: scripts/eval.mjs (one-shot CDP eval), scripts/reload.mjs
(hard reload renderer over CDP).
Findings
Measured on the Cloud Shadows session (7 turns, ~11k px scrollHeight) and
the 34 MB session session_20260514_215353_fe0ac8.json (110 FadeText
instances, lots of historical tool calls).
| metric | Cloud Shadows | 34 MB session |
|---|---|---|
| avgFps (60 tok/sec, 5s) | 60.0 | 58.6 |
| frame p50 / p95 / p99 (ms) | 16.7 / 18.0 / 21.1 | 16.6 / 25.6 / 31.4 |
| max frame (ms) | 31.1 | 97-127 (varies) |
| longtasks per 5s window | 0 | 1-2, 75-127 ms |
| type-while-stream p95 latency (ms) | 17 | — |
A single real-LLM stream on Cloud Shadows (gpt-4o-mini, 39s window) saw 12 longtasks totalling 1.26 s — same cadence the synthetic predicted (~1 hitch per 3.25 s, max 123 ms). So the synthetic stream is a faithful proxy for the real one and is fine for iterating on fixes without paying for tokens.
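The frame and longtask figures in this section reduce to simple arithmetic over recorded timestamps. The sketch below is one way to compute them; `frameStats` and `longtaskCadence` are hypothetical helpers, not functions from the measurement scripts. For the real-LLM run, 12 longtasks over a 39 s window works out to one hitch per 3.25 s, as quoted above.

```typescript
interface FrameStats {
  avgFps: number;
  maxFrameMs: number;
}

/** Derive average FPS and the longest frame from rAF callback timestamps (ms). */
function frameStats(rafTimestampsMs: number[]): FrameStats {
  const count = rafTimestampsMs.length;
  if (count < 2) return { avgFps: 0, maxFrameMs: 0 };
  let maxFrameMs = 0;
  for (let i = 1; i < count; i++) {
    const interval = (rafTimestampsMs[i] ?? 0) - (rafTimestampsMs[i - 1] ?? 0);
    if (interval > maxFrameMs) maxFrameMs = interval;
  }
  const elapsed = (rafTimestampsMs[count - 1] ?? 0) - (rafTimestampsMs[0] ?? 0);
  // count - 1 frame intervals were painted over `elapsed` milliseconds.
  return { avgFps: elapsed > 0 ? ((count - 1) * 1000) / elapsed : 0, maxFrameMs };
}

/** Normalize a longtask count to "per 5 s" and "seconds between hitches". */
function longtaskCadence(count: number, windowMs: number): { per5s: number; secondsPerHitch: number } {
  if (count === 0) return { per5s: 0, secondsPerHitch: Number.POSITIVE_INFINITY };
  return { per5s: (count * 5000) / windowMs, secondsPerHitch: windowMs / count / 1000 };
}
```

Normalizing to a 5 s window is what makes the 39 s real run comparable with the 5 s synthetic windows in the table.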
CPU profile during streaming (synthetic, markdown content)
Top self-time costs (5 s window, 400 tokens at 125 tok/s, markdown chunks):
| ms (self) | function | source |
|---|---|---|
| 260 | bn$1 | chunk-BO2N…js:20003 (micromark tokenize) |
| 249 | m$1 | chunk-BO2N…js:19949 (micromark) |
| 128 | compile | chunk-BO2N…js:21884 (mdast → hast compile) |
| 73 | FadeText body | components/ui/fade-text.tsx |
| 62 | parser | chunk-BO2N…js:22680 |
| 49 | fromThreadMessageLike | @assistant-ui/internal |
That chunk-BO2N2NFS is the vendored bundle containing micromark,
mdast-util-from-markdown, mdast-util-to-hast, rehype-raw,
hast-util-sanitize, etc. — i.e. Streamdown's markdown pipeline,
re-parsing the entire growing assistant message on every token append.
Cost scales linearly with message length.
Compare plain-text (no markdown) — the chunk-BO2N… entries drop out
of the top 30 entirely; total work per 5 s window halves.
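To see why per-token cost that grows linearly with message length matters, the sketch below counts how many characters pass through the parser over a whole stream under two models: re-parsing the full text on every append versus touching each character once. Both functions are illustrative accounting only, and the 4-characters-per-token figure in the note is an assumption, not a measurement.

```typescript
/** Total characters parsed if the whole message is re-parsed on every append. */
function charsParsedFullReparse(tokenLengths: number[]): number {
  let messageLength = 0;
  let total = 0;
  for (const tokenLength of tokenLengths) {
    messageLength += tokenLength;
    total += messageLength; // each append re-reads everything so far
  }
  return total;
}

/** Total characters parsed if each character is processed exactly once. */
function charsParsedOnce(tokenLengths: number[]): number {
  return tokenLengths.reduce((sum, tokenLength) => sum + tokenLength, 0);
}
```

For 400 tokens of 4 characters each, the full re-parse model reads 320,800 characters against 1,600 for the single-pass model: per-append cost is linear, so cost over the whole stream is quadratic.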
Fix landed: FadeText memo
FadeText is used in tool-fallback.tsx (110 instances on a tool-heavy
thread). Before: each parent re-render during streaming triggered a
useEffect([children]) that forced a scrollWidth layout read — even
when the title text was unchanged. The useResizeObserver already covers
the genuine resize case, so the effect was strictly redundant.
After: wrapped in React.memo with a custom comparator that compares
children (scalar fast-path), className, fadeWidth, and style
field-by-field. Verified via temporary render counter:
122 renders during a 2 s synthetic stream vs ~11 000 without memo
(110 instances × ~100 stream updates). Doesn't move the longtask needle
on its own — Streamdown dwarfs it — but eliminates a class of forced
layouts and removes a steady CPU floor.
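A comparator of the kind described above might look like the following sketch. The prop names (`children`, `className`, `fadeWidth`, `style`) come from the text; the types and the helper names `styleEqual` and `fadeTextPropsEqual` are hypothetical, and the real component's props may differ. A function of this shape is what `React.memo` accepts as its second argument, returning `true` to skip the re-render.

```typescript
type StyleSketch = Record<string, string | number | undefined>;

interface FadeTextPropsSketch {
  children?: unknown;
  className?: string;
  fadeWidth?: number;
  style?: StyleSketch;
}

/** Field-by-field comparison so a fresh but identical style object still matches. */
function styleEqual(a?: StyleSketch, b?: StyleSketch): boolean {
  if (a === b) return true;
  if (a === undefined || b === undefined) return false;
  const aKeys = Object.keys(a);
  if (aKeys.length !== Object.keys(b).length) return false;
  return aKeys.every((key) => Object.prototype.hasOwnProperty.call(b, key) && a[key] === b[key]);
}

/** Returns true when the two prop sets would render identically. */
function fadeTextPropsEqual(prev: FadeTextPropsSketch, next: FadeTextPropsSketch): boolean {
  // Scalar fast path: strings and numbers compare by value under Object.is.
  // Non-scalar children (elements, arrays) only match by reference here.
  if (!Object.is(prev.children, next.children)) return false;
  if (prev.className !== next.className) return false;
  if (prev.fadeWidth !== next.fadeWidth) return false;
  return styleEqual(prev.style, next.style);
}
```

The field-by-field `style` check matters because an inline `style={{ ... }}` literal is a new object on every parent render, so a plain reference comparison would never match.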
Also landed: MarkdownText plugins memo + upstream flush floor
Two smaller follow-ups in the same investigation:
- `MarkdownText` `plugins` object useMemo'd. The inline `plugins={{ math: mathPlugin, ...(isStreaming ? {} : { code }) }}` was constructing a new object on every render, which churns `<Streamdown>`'s outer memo and forces its internal `rehypePlugins`/`remarkPlugins` arrays to rebuild. CPU profile after the change shows `parser` self-time dropping out of the top 10, `compile` cut roughly in half, and `bn$1`/`m$1` (micromark internals) dropping off the top entries.
- `use-message-stream.scheduleDeltaFlush` got a real minimum floor. Previously the rAF-only path effectively meant "at most one flush per frame," but at typical LLM token rates of 30-80 tok/sec each token arrives slower than rAF cadence and gets its own React commit. With `STREAM_DELTA_FLUSH_MS = 33` (two frames) and a `lastFlushAt`-tracked floor, slower streams now coalesce ~2 tokens per commit, halving markdown re-parses. React's auto-batching already covers part of this probabilistically; the floor makes the batching deterministic so the max-longtask number tightens up.
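The minimum-floor idea can be sketched without React or rAF by injecting the clock. `STREAM_DELTA_FLUSH_MS` and `lastFlushAt` are named in the notes above; `createDeltaFlusher`, `push`, and `tick` are hypothetical, and the real `scheduleDeltaFlush` presumably drives the equivalent of `tick` from a frame callback or timer.

```typescript
const STREAM_DELTA_FLUSH_MS = 33;

interface DeltaFlusher {
  /** Buffer a token and flush if the floor has elapsed. Returns true on flush. */
  push(delta: string, nowMs: number): boolean;
  /** Flush buffered text if the floor has elapsed (stand-in for a trailing timer). */
  tick(nowMs: number): boolean;
}

function createDeltaFlusher(
  flush: (batch: string) => void,
  minIntervalMs: number = STREAM_DELTA_FLUSH_MS,
): DeltaFlusher {
  let pending = "";
  let lastFlushAt = Number.NEGATIVE_INFINITY; // the first token flushes immediately

  const tick = (nowMs: number): boolean => {
    if (pending === "" || nowMs - lastFlushAt < minIntervalMs) return false;
    const batch = pending;
    pending = "";
    lastFlushAt = nowMs;
    flush(batch); // one downstream commit for the whole batch
    return true;
  };

  const push = (delta: string, nowMs: number): boolean => {
    pending += delta;
    return tick(nowMs);
  };

  return { push, tick };
}
```

At 50 tok/sec (one token every 20 ms), six tokens pushed at 0, 20, ..., 100 ms followed by a trailing `tick(120)` yield the batches `a`, `bc`, `de`, `f`: roughly two tokens per commit, consistent with the coalescing described above.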
A/B on the 34 MB session, 300 tokens at 50 tok/sec, markdown chunks (3 trials each):
| | avgFps | p99 frame | LTs/5s | max LT | mutations |
|---|---|---|---|---|---|
| no throttle | 54.0 | 38 ms | 2.0 | 145 ms | varies (2-112) |
| 33 ms throttle | 54.3 | 41 ms | 1.7 | 110 ms | ~135 |
A modest change. Inter-mutation p50 tightens from 22-28 ms to a clean 33 ms,
which is what you'd expect from a deterministic floor.
Also landed: useDeferredValue at the streamdown-text boundary
The longtask CPU was unavoidable inside the block-memo pattern — the live tail re-parses every commit, scales linearly with current length, and nothing about Streamdown's architecture changes that without forking. The fix is to stop having that work block the main thread.
<DeferStreamingText> in markdown-text.tsx is a 12-line wrapper that
reads the message-part state via useMessagePartText, runs it through
useDeferredValue, and re-publishes via assistant-ui's
<TextMessagePartProvider>. The inner StreamdownTextPrimitive reads the
deferred value through the normal useMessagePartText hook — no fork,
no internal-path imports, fully on the assistant-ui public API.
What React's concurrent scheduler now does:
- When a new token arrives mid-render, the in-flight deferred render is abandoned and a fresh one starts with the latest text.
- When the main thread has urgent work (typing, scroll, layout), the Streamdown render gets deprioritized — input stays responsive even while a 100 ms parse is queued.
Streamdown already uses useTransition internally for its block-array
setState; useDeferredValue here just lifts the deferral all the way up
to the consumer text boundary, so the whole pipeline — preprocess,
block split, repair, parse, render — runs at low priority during streaming.
This is the industry-standard approach (see
Streamdown architecture analysis
and Chrome's LLM-response render best practices).
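The "abandon the in-flight render when a newer value arrives" behaviour can be illustrated with a deliberately simplified timing model. This is a toy, not React's scheduler: it ignores time slicing, urgent work, and everything else React does internally, and `simulateDeferredCommits` is a hypothetical name. It assumes each deferred render takes a fixed `renderMs` and is discarded if the next value lands before it finishes.

```typescript
interface Arrival {
  atMs: number;
  text: string;
}

/**
 * Toy model: a deferred render for each arriving value starts immediately
 * and commits only if it finishes before the next value arrives. The last
 * value always commits because nothing supersedes it.
 */
function simulateDeferredCommits(arrivals: Arrival[], renderMs: number): string[] {
  const committed: string[] = [];
  for (let i = 0; i < arrivals.length; i++) {
    const current = arrivals[i];
    if (current === undefined) continue;
    const next = arrivals[i + 1];
    const finishesAt = current.atMs + renderMs;
    if (next === undefined || finishesAt <= next.atMs) {
      committed.push(current.text);
    }
    // Otherwise the render is abandoned and restarted with the newer text.
  }
  return committed;
}
```

With tokens every 20 ms and a 100 ms render, the model commits only the final text, which is the same shape as a run that logs no intermediate mutations. With a 5 ms render, every value commits.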
A/B on the 34 MB session, 300 tokens at 50 tok/sec, markdown chunks (four trials each, prod-throttle (33 ms) on for both):
| | avgFps | p99 frame | LTs / 5 s | max LT | typing p95 |
|---|---|---|---|---|---|
| pre-defer | 54.3 | 41 ms | 1.7 | 110 ms | ~17 ms |
| post-defer | 58.5 | 31 ms | 2.0 | 117 ms | 14-18 ms |
Longtask count and max LT are unchanged — useDeferredValue doesn't
reduce CPU, only its priority. The avgFps lift and p99 frame drop are
the proof that the existing CPU is no longer blocking 60 fps cadence:
when React can defer the parse, frames stay clean. One particularly
clean run logged MUTATIONS=0 — React skipped every intermediate
text state and only committed the final one, the textbook
useDeferredValue behaviour.
Not fixed: Streamdown markdown re-parse cost (the elephant)
Total CPU spent in micromark/mdast/hast pipeline per 5 s window is still
the same ~700 ms. With useDeferredValue that work no longer blocks
input, but if you watch a CPU profile you'll see the same hot functions
(Tn$1, bn$1, m$1, parser, compile).
The path to actually reduce that cost (not just defer it) is to
replace the parser with a state machine like
Flowdown: process each
character exactly once, emit DOM ops directly, no re-parse of the prefix
on every token. The claimed speedup is ~2,000× over marked. Trade-offs: not a
react-markdown-compatible API, no rehype security pipeline, and it would
require replacing Streamdown wholesale. Worth investigating only if
even the deferred work shows up in user-perceptible ways (e.g.
trackpad-scrolling a stream-in-progress stutters).
The synthetic harness now mirrors the real upstream pipeline via the
flushMinMs option in __PERF_DRIVE__.stream({ flushMinMs: 33 }), so
future Streamdown / Flowdown experiments can A/B without LLM credit cost.
The synthetic numbers matched the one real-LLM run we captured to within
noise, so the harness is a reliable proxy.
Possible approaches (none implemented here):
- Coalesce/throttle Streamdown updates: render at most every 32 ms instead of every set-state. Reduces parses but doesn't reduce per-parse cost; trades latency for smoothness.
- Memoize per-prefix: diff the new text against the prior parsed version; only re-parse the changed suffix.
- Render in stable segments: close-form historical paragraphs as immutable React nodes; only the live tail goes through markdown each token. Probably the highest-impact change but requires forking or patching `@assistant-ui/react-streamdown`.
- Move parsing to a Web Worker: main thread no longer blocks on markdown. Largest surgery; requires double-buffered hast.
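The "stable segments" idea from the list above can be sketched in a few lines, with heavy caveats: splitting on blank lines is an assumption that breaks for fenced code blocks and loose lists, the cache here is unbounded, and `createSegmentedRenderer` is a hypothetical name unrelated to Streamdown's internals. The point is only that completed blocks are parsed once and reused, while the live tail is re-parsed on every update.

```typescript
interface ParsedBlock {
  html: string;
}

/**
 * Wraps a per-block parser so that every block except the last (the live
 * tail) is parsed at most once and then served from a cache keyed by its
 * source text.
 */
function createSegmentedRenderer(
  parseBlock: (source: string) => ParsedBlock,
): (text: string) => ParsedBlock[] {
  const cache = new Map<string, ParsedBlock>();
  return (text: string): ParsedBlock[] => {
    const blocks = text.split(/\n{2,}/); // naive block boundary: blank line
    return blocks.map((block, index) => {
      const isLiveTail = index === blocks.length - 1;
      if (isLiveTail) return parseBlock(block); // still growing: always re-parse
      let parsed = cache.get(block);
      if (parsed === undefined) {
        parsed = parseBlock(block);
        cache.set(block, parsed);
      }
      return parsed; // same object identity across updates
    });
  };
}
```

Because cached blocks keep their object identity, a memoized React child receiving one of them as a prop would skip re-rendering. The per-update parse cost is then bounded by the length of the live tail rather than the whole message.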
Vite dev-build issue (separate)
http://127.0.0.1:5174/node_modules/.vite/deps/react.js resolves to
react/cjs/react.production.js, and react-dom_client.js →
react-dom-client.production.js. As a result:
- `<React.Profiler>` `onRender` is never called (production build is a no-op).
- `import.meta.env.DEV` is `false`, `PROD` is `true` even under `vite dev` (hence `MODE !== 'production'` as the workaround in `main.tsx`).
- All the React 19 dev-only warnings/devtools backend hooks are absent.
Root cause likely sits in vite.config.ts aliasing + dedupe + Vite 8's
new optimizeDeps defaults. Worth a separate fix pass — when it's
resolved, the <PerfProbe> blocks in perf-probe.tsx become useful
(per-id commit timings) instead of inert.