docs(moa): document prompt-caching behavior for references and aggregator (#53218)

* docs(moa): document prompt-caching behavior for references and aggregator

* docs(moa): clarify references preserve cache, only aggregator trades reuse

* docs(moa): correct caching prose — tail-append preserves aggregator cache too
Teknium 2026-06-26 12:58:05 -07:00 committed by GitHub
parent 525e1e775d
commit 9b2af36d5a

@ -124,6 +124,17 @@ On HermesBench, a two-model MoA preset — `claude-opus-4.8` aggregating over a
The MoA configuration beats its strongest component (opus-4.8) by ~6 points, confirming that aggregating a second perspective lifts quality on hard tasks rather than just averaging the two.
## Prompt caching
MoA is built so the **main conversation's prompt cache is never broken**. Selecting a MoA preset is a normal model selection: it does not mutate past context, swap toolsets, or rebuild the system prompt mid-conversation. Your conversation history, system prompt, and tool schema stay byte-stable, so the cached prefix is preserved exactly as it would be for a plain model. Switching to or away from a MoA preset costs the same cache invalidation as any other `/model` switch, and nothing more.
Both internal call types cache normally:
- **Reference models** are short advisory calls with no tools. Each receives a trimmed, deterministic view of the conversation, with the system prompt and tool transcript stripped (see the loop above). Because that view is a stable function of the stable history, a reference model's prompt prefix repeats across iterations and caches normally.
- **The aggregator** is the acting model. The reference outputs are appended to the *end* of the latest user turn as private guidance. Because that text sits at the tail, below the entire stable prefix (system prompt + prior history), it does not invalidate any cached prefix: the aggregator gets a cache hit on everything above the injection, and only the freshly appended tail is new. Every normal turn behaves the same way, since each new user message also arrives as uncached tail tokens.
So MoA does not sacrifice prompt caching on either call type. Its only real cost is the extra reference calls per iteration — you pay for multiple model perspectives, not for broken caches. The long-lived conversation prefix shared with the rest of Hermes is fully intact.
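The tail-append argument can be illustrated with a toy model of prefix caching. This is a minimal sketch, not Hermes code: it assumes a cache that reuses the longest shared leading run of tokens, uses whitespace-split strings in place of real provider tokens, and every name in it (`reusable_prefix`, `cached`, `guidance`, and so on) is hypothetical.

```python
def reusable_prefix(cached: list[str], prompt: list[str]) -> int:
    """Count the leading tokens of `prompt` that match `cached` (toy prefix cache)."""
    n = 0
    for old, new in zip(cached, prompt):
        if old != new:
            break
        n += 1
    return n


# Hypothetical stand-ins for the stable prefix (system prompt + prior history).
cached = "SYSTEM tools turn1 reply1".split()
user_turn = "please fix the bug".split()
guidance = "[ref-a: check-null] [ref-b: add-test]".split()

# A plain model's next request: stable prefix, then the new user turn.
plain = cached + user_turn

# Tail append (the behavior described above): guidance follows the user turn.
moa_tail = cached + user_turn + guidance

# Counter-example: guidance injected right after the system prompt.
moa_mid = cached[:1] + guidance + cached[1:] + user_turn

plain_hit = reusable_prefix(cached, plain)
tail_hit = reusable_prefix(cached, moa_tail)
mid_hit = reusable_prefix(cached, moa_mid)

print(plain_hit, tail_hit, mid_hit)
```

In this toy model the tail-appended request reuses exactly as much of the cached prefix as the plain request, while the mid-prefix injection loses everything after the system prompt. Real provider caches add details this sketch ignores, such as minimum cacheable lengths and block granularity.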
## Notes
- MoA is no longer listed under `hermes tools`; there is no `moa` toolset to enable.