fix: add vLLM/local server error patterns + MCP initial connection retry (#9281)

Port two improvements inspired by Kilo-Org/kilocode analysis:

1. Error classifier: add context overflow patterns for vLLM, Ollama,
   and llama.cpp/llama-server. These local inference servers return
   different error formats than cloud providers (e.g., 'exceeds the
   max_model_len', 'context length exceeded', 'slot context'). Without
   these patterns, context overflow errors from local servers are
   misclassified as format errors, causing infinite retries instead
   of triggering compression.

2. MCP initial connection retry: previously, if the very first
   connection attempt to an MCP server failed (e.g., transient DNS
   blip at startup), the server was permanently marked as failed with
   no retry. Post-connect reconnection had 5 retries with exponential
   backoff, but initial connection had zero. Now initial connections
   retry up to 3 times with backoff before giving up, matching the
   resilience of post-connect reconnection.
   (Inspired by Kilo Code's MCP server disappearing fix in v1.3.3)
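Point 1 can be sketched as a substring/regex match over the error message. This is a minimal illustration, not the actual classifier: `classify_api_error`'s real structure is not shown in this commit, and `_LOCAL_OVERFLOW_PATTERNS` / `looks_like_local_context_overflow` are hypothetical names built from the messages the new tests assert on.

```python
import re

# Illustrative pattern list covering the local-server messages from the
# commit; the real classifier may use different patterns or structure.
_LOCAL_OVERFLOW_PATTERNS = [
    re.compile(r"exceeds the max_model_len", re.IGNORECASE),    # vLLM
    re.compile(r"prompt length \d+ exceeds", re.IGNORECASE),    # vLLM
    re.compile(r"input is too long", re.IGNORECASE),            # vLLM
    re.compile(r"context length exceeded", re.IGNORECASE),      # Ollama
    re.compile(r"slot context", re.IGNORECASE),                 # llama.cpp / llama-server
]

def looks_like_local_context_overflow(message: str) -> bool:
    """True if the error text matches a known local-server overflow
    message, letting the caller classify it as context_overflow (and
    trigger compression) instead of retrying it as a format error."""
    return any(p.search(message) for p in _LOCAL_OVERFLOW_PATTERNS)
```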

Tests: 6 new error classifier tests, 4 new MCP retry tests, 1
updated existing test. All 276 affected tests pass.
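For point 2, the retry-with-backoff shape might look like the following. This is a hedged sketch under assumed names (`connect_with_retry` and its parameters are not the project's API); the actual change wraps the MCP client's first connection attempt.

```python
import time

def connect_with_retry(connect, max_attempts=3, base_delay=1.0):
    """Attempt an initial connection up to max_attempts times, sleeping
    with exponential backoff (base_delay, 2*base_delay, ...) between
    failures. Re-raises the last error once attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

A transient failure on the first attempt (e.g. a DNS blip at startup) now succeeds on a later attempt instead of permanently marking the server as failed.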
Teknium 2026-04-13 18:46:14 -07:00 committed by GitHub
parent 0a4cf5b3e1
commit f324222b79
5 changed files with 204 additions and 8 deletions


@@ -580,6 +580,48 @@ class TestClassifyApiError:
        result = classify_api_error(e)
        assert result.reason == FailoverReason.context_overflow

    # ── vLLM / local inference server error messages ──

    def test_vllm_max_model_len_overflow(self):
        """vLLM's 'exceeds the max_model_len' error → context_overflow."""
        e = MockAPIError(
            "The engine prompt length 1327246 exceeds the max_model_len 131072. "
            "Please reduce prompt.",
            status_code=400,
        )
        result = classify_api_error(e)
        assert result.reason == FailoverReason.context_overflow

    def test_vllm_prompt_length_exceeds(self):
        """vLLM prompt length error → context_overflow."""
        e = MockAPIError(
            "prompt length 200000 exceeds maximum model length 131072",
            status_code=400,
        )
        result = classify_api_error(e)
        assert result.reason == FailoverReason.context_overflow

    def test_vllm_input_too_long(self):
        """vLLM 'input is too long' error → context_overflow."""
        e = MockAPIError("input is too long for model", status_code=400)
        result = classify_api_error(e)
        assert result.reason == FailoverReason.context_overflow

    def test_ollama_context_length_exceeded(self):
        """Ollama 'context length exceeded' error → context_overflow."""
        e = MockAPIError("context length exceeded", status_code=400)
        result = classify_api_error(e)
        assert result.reason == FailoverReason.context_overflow

    def test_llamacpp_slot_context(self):
        """llama.cpp / llama-server 'slot context' error → context_overflow."""
        e = MockAPIError(
            "slot context: 4096 tokens, prompt 8192 tokens — not enough space",
            status_code=400,
        )
        result = classify_api_error(e)
        assert result.reason == FailoverReason.context_overflow

    # ── Result metadata ──

    def test_provider_and_model_in_result(self):