fix(agent): reset stale token calibration on model switch (#23767)

ContextCompressor.update_model() recomputed context_length/threshold/budgets
but kept the cross-call calibration state (last_real_prompt_tokens,
last_rough_tokens_when_real_prompt_fit, last_compression_rough_tokens,
awaiting_real_usage_after_compression, _ineffective_compression_count) from the
PREVIOUS model.

Those fields encode 'the provider proved this prompt fit' / 'preflight can be
deferred' decisions valid only for the model that produced them. Carried across
a switch to a smaller-context model, should_defer_preflight_to_real_usage() used
the old model's 'it fit' history to SKIP a preflight compression the new model
actually needed — sending an oversized prompt the provider rejects (#23767).

update_model() now clears that state; the new model's first response repopulates
it via update_from_response(). Verified E2E: after a 200K->65,536 context switch,
the defer check no longer suppresses preflight compression, and should_compress
fires on an over-threshold estimate.
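
A minimal sketch of the failure and the fix, for illustration only. ToyCompressor
is a hypothetical stand-in, not the real ContextCompressor: the defer rule, the
0.5 threshold ratio, and the reset_calibration flag are assumptions made so the
before/after behaviour can be compared in one run. Only the field and method
names are taken from this commit.

```python
from dataclasses import dataclass


@dataclass
class ToyCompressor:
    """Hypothetical stand-in for ContextCompressor (not the real class)."""

    context_length: int
    threshold_percent: float = 0.5  # assumed ratio, not taken from the source
    last_real_prompt_tokens: int = 0
    last_rough_tokens_when_real_prompt_fit: int = 0

    @property
    def threshold_tokens(self) -> int:
        return int(self.context_length * self.threshold_percent)

    def update_from_response(self, real_prompt_tokens: int, rough_tokens: int) -> None:
        # The provider accepted the prompt, so remember that an estimate
        # of this size fit under the current model.
        self.last_real_prompt_tokens = real_prompt_tokens
        self.last_rough_tokens_when_real_prompt_fit = rough_tokens

    def should_defer_preflight_to_real_usage(self, rough_tokens: int) -> bool:
        # Assumed rule: defer when an estimate at least this large already fit.
        return 0 < rough_tokens <= self.last_rough_tokens_when_real_prompt_fit

    def should_compress(self, rough_tokens: int) -> bool:
        if self.should_defer_preflight_to_real_usage(rough_tokens):
            return False
        return rough_tokens >= self.threshold_tokens

    def update_model(self, context_length: int, reset_calibration: bool = True) -> None:
        self.context_length = context_length
        if reset_calibration:
            # The fix: drop calibration that was only valid for the old model.
            self.last_real_prompt_tokens = 0
            self.last_rough_tokens_when_real_prompt_fit = 0


def run_switch(reset_calibration: bool) -> bool:
    """Return should_compress() for a 90K estimate after a 200K -> 65,536 switch."""
    compressor = ToyCompressor(context_length=200_000)
    compressor.update_from_response(real_prompt_tokens=88_000, rough_tokens=90_000)
    compressor.update_model(context_length=65_536, reset_calibration=reset_calibration)
    return compressor.should_compress(rough_tokens=90_000)
```

With reset_calibration=False the stale "it fit" record makes the toy defer, so
the oversized prompt would be sent as-is; with the reset, the same estimate
triggers compression under the smaller context window.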
kshitijk4poor 2026-06-21 17:32:08 +05:30
parent 44d552ea5a
commit 1e0b3a2bcc
2 changed files with 69 additions and 0 deletions


@@ -668,6 +668,28 @@ class ContextCompressor(ContextEngine):
            int(context_length * 0.05), _SUMMARY_TOKENS_CEILING,
        )

        # Reset cross-call calibration state captured under the PREVIOUS model.
        # These fields encode "the provider proved this prompt fit" / "preflight
        # can be deferred" decisions that are only valid for the model that
        # produced them. Carrying them across a switch to a smaller-context
        # model would let should_defer_preflight_to_real_usage() suppress a
        # preflight compression the new model actually needs — the exact
        # oversized-send-after-switch failure in #23767. The new model's first
        # response repopulates them via update_from_response(). Setting
        # last_prompt_tokens to 0 (NOT -1) is deliberate: 0 is the documented
        # "no real usage yet -> use the rough estimate" state, so the post-
        # response should_compress path falls back to estimate_request_tokens_rough
        # rather than skipping compression. -1 is a different sentinel
        # (#36718, "compression just ran, await real usage") and must not be set here.
        self.last_prompt_tokens = 0
        self.last_completion_tokens = 0
        self.last_total_tokens = 0
        self.last_real_prompt_tokens = 0
        self.last_rough_tokens_when_real_prompt_fit = 0
        self.last_compression_rough_tokens = 0
        self.awaiting_real_usage_after_compression = False
        self._ineffective_compression_count = 0

    def __init__(
        self,
        model: str,