fix: remove misleading model.max_tokens suggestion from thinking-exhausted error (#11626)

The 'Thinking Budget Exhausted' user-facing error message advised users to
'set model.max_tokens in config.yaml'. That config key is documented but
intentionally not wired through to the API call in CLI/gateway paths — we
omit max_tokens by default so the inference server applies its full output
budget (llama-server treats -1 as unlimited; vLLM allows max_model_len
minus the prompt length; etc.).
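To illustrate the behavior described above: the request builder simply never forwards the key, rather than passing a sentinel value. This is a minimal sketch with hypothetical helper and config names, not the actual CLI code:

```python
def build_completion_params(config: dict) -> dict:
    """Assemble the inference request body (hypothetical helper).

    `max_tokens` is deliberately omitted unless the caller opts in, so the
    server falls back to its own output budget instead of a client-side cap.
    """
    params = {
        "model": config["model_name"],
        "temperature": config.get("temperature", 0.7),
    }
    # The documented `model.max_tokens` config key is NOT consulted here,
    # which is why setting it in config.yaml has no visible effect.
    return params
```

A user who sets `model.max_tokens` therefore sees an unchanged request payload, which is exactly the misdirection the bug reports below describe.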

Users followed the suggestion, saw no change, and kept filing bugs (see
closed #4404, #10917, #6955 and PRs #5001/#6080/#6446/#6707/#7075/#8804/
#10924/#11173/#11268 — all reporting the same misdirection).

Replace the misleading suggestion with an actionable one: switch models
via /model. Lowering reasoning effort remains the primary remediation.
Teknium 2026-04-17 13:29:54 -07:00 committed by GitHub
parent d49126b987
commit 1229d8855c


@@ -9303,8 +9303,7 @@ class AIAgent:
     "and had none left for the actual response.\n\n"
     "To fix this:\n"
     "→ Lower reasoning effort: `/thinkon low` or `/thinkon minimal`\n"
-    "→ Increase the output token limit: "
-    "set `model.max_tokens` in config.yaml"
+    "→ Or switch to a larger/non-reasoning model with `/model`"
 )
 self._cleanup_task_resources(effective_task_id)
 self._persist_session(messages, conversation_history)