fix: detect context length for custom model endpoints via fuzzy matching + config override (#2051)

* fix: detect context length for custom model endpoints via fuzzy matching + config override Custom model endpoints (non-OpenRouter, non-known-provider) were silently falling back to 2M tokens when the model name didn't exactly match what the endpoint's /v1/models reported. This happened because: 1. Endpoint metadata lookup used exact match only — model name mismatches (e.g. 'qwen3.5:9b' vs 'Qwen3.5-9B-Q4_K_M.gguf') caused a miss 2. Single-model servers (common for local inference) required exact name match even though only one model was loaded 3. No user escape hatch to manually set context length Changes: - Add fuzzy matching for endpoint model metadata: single-model servers use the only available model regardless of name; multi-model servers try substring matching in both directions - Add model.context_length config override (highest priority) so users can explicitly set their model's context length in config.yaml - Log an informative message when falling back to 2M probe, telling users about the config override option - Thread config_context_length through ContextCompressor and AIAgent init Tests: 6 new tests covering fuzzy match, single-model fallback, config override (including zero/None edge cases). * fix: auto-detect local model name and context length for local servers Cherry-picked from PR #2043 by sudoingX. - Auto-detect model name from local server's /v1/models when only one model is loaded (no manual model name config needed) - Add n_ctx_train and n_ctx to context length detection keys for llama.cpp - Query llama.cpp /props endpoint for actual allocated context (not just training context from GGUF metadata) - Strip .gguf suffix from display in banner and status bar - _auto_detect_local_model() in runtime_provider.py for CLI init Co-authored-by: sudo <sudoingx@users.noreply.github.com> * fix: revert accidental summary_target_tokens change + add docs for context_length config - Revert summary_target_tokens from 2500 back to 500 (accidental change during patching) - Add 'Context Length Detection' section to Custom & Self-Hosted docs explaining model.context_length config override --------- Co-authored-by: Test <test@test.com> Co-authored-by: sudo <sudoingx@users.noreply.github.com>
2026-04-25 00:51:20 +00:00 · 2026-03-19 06:01:16 -07:00 · 2026-03-19 06:01:16 -07:00 · d76fa7fc37
commit d76fa7fc37
parent 7b6d14e62a
8 changed files with 208 additions and 5 deletions
--- a/cli.py
+++ b/cli.py
@ -1046,6 +1046,14 @@ class HermesCLI:
        _config_model = _model_config.get("default", "") if isinstance(_model_config, dict) else (_model_config or "")
        _FALLBACK_MODEL = "anthropic/claude-opus-4.6"
        self.model = model or _config_model or _FALLBACK_MODEL
+        # Auto-detect model from local server if still on fallback
+        if self.model == _FALLBACK_MODEL:
+            _base_url = _model_config.get("base_url", "") if isinstance(_model_config, dict) else ""
+            if "localhost" in _base_url or "127.0.0.1" in _base_url:
+                from hermes_cli.runtime_provider import _auto_detect_local_model
+                _detected = _auto_detect_local_model(_base_url)
+                if _detected:
+                    self.model = _detected
        # Track whether model was explicitly chosen by the user or fell back
        # to the global default.  Provider-specific normalisation may override
        # the default silently but should warn when overriding an explicit choice.
@ -1251,6 +1259,8 @@ class HermesCLI:
    def _get_status_bar_snapshot(self) -> Dict[str, Any]:
        model_name = self.model or "unknown"
        model_short = model_name.split("/")[-1] if "/" in model_name else model_name
+        if model_short.endswith(".gguf"):
+            model_short = model_short[:-5]
        if len(model_short) > 26:
            model_short = f"{model_short[:23]}..."