fix: remove misleading model.max_tokens suggestion from thinking-exhausted error (#11626)

The 'Thinking Budget Exhausted' user-facing error message advised users to
'set model.max_tokens in config.yaml'. That config key is documented but
intentionally not wired through to the API call in CLI/gateway paths — we
omit max_tokens by default so the inference server applies its full output
budget (llama-server treats -1 as unlimited; vLLM allows max_model_len
minus the prompt length; etc.).
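To illustrate the behavior described above: the request builder simply never forwards the key, rather than passing a sentinel value. This is a minimal sketch with hypothetical helper and config names, not the actual CLI code:

```python
def build_completion_params(config: dict) -> dict:
    """Assemble the inference request body (hypothetical helper).

    `max_tokens` is deliberately omitted unless the caller opts in, so the
    server falls back to its own output budget instead of a client-side cap.
    """
    params = {
        "model": config["model_name"],
        "temperature": config.get("temperature", 0.7),
    }
    # The documented `model.max_tokens` config key is NOT consulted here,
    # which is why setting it in config.yaml has no visible effect.
    return params
```

A user who sets `model.max_tokens` therefore sees an unchanged request payload, which is exactly the misdirection the bug reports below describe.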

Users followed the suggestion, saw no change, and kept filing bugs (see
closed #4404, #10917, #6955 and PRs #5001/#6080/#6446/#6707/#7075/#8804/
#10924/#11173/#11268 — all reporting the same misdirection).

Replace the misleading suggestion with an actionable one: switch models
via /model. Lowering reasoning effort remains the primary remediation.
Teknium 2026-04-17 13:29:54 -07:00 committed by GitHub
parent d49126b987
commit 1229d8855c


@@ -9303,8 +9303,7 @@ class AIAgent:
     "and had none left for the actual response.\n\n"
     "To fix this:\n"
     "→ Lower reasoning effort: `/thinkon low` or `/thinkon minimal`\n"
-    "→ Increase the output token limit: "
-    "set `model.max_tokens` in config.yaml"
+    "→ Or switch to a larger/non-reasoning model with `/model`"
 )
 self._cleanup_task_resources(effective_task_id)
 self._persist_session(messages, conversation_history)