hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-04-25 00:51:20 +00:00

Author	SHA1	Message	Date
helix4u	8a2506af43	fix(aux): surface auxiliary failures in UI	2026-04-24 14:31:21 -07:00
Teknium	a9a4416c7c	fix(compress): don't reach into ContextCompressor privates from /compress (#15039 ) Manual /compress crashed with 'LCMEngine' object has no attribute '_align_boundary_forward' when any context-engine plugin was active. The gateway handler reached into _align_boundary_forward and _find_tail_cut_by_tokens on tmp_agent.context_compressor, but those are ContextCompressor-specific — not part of the generic ContextEngine ABC — so every plugin engine (LCM, etc.) raised AttributeError. - Add optional has_content_to_compress(messages) to ContextEngine ABC with a safe default of True (always attempt). - Override it in the built-in ContextCompressor using the existing private helpers — preserves exact prior behavior for 'compressor'. - Rewrite gateway /compress preflight to call the ABC method, deleting the private-helper reach-in. - Add focus_topic to the ABC compress() signature. Make _compress_context retry without focus_topic on TypeError so older strict-sig plugins don't crash on manual /compress <focus>. - Regression test with a fake ContextEngine subclass that only implements the ABC (mirrors LCM's surface). Reported by @selfhostedsoul (Discord, Apr 22).	2026-04-24 02:55:43 -07:00
sicnuyudidi	c03858733d	fix: pass correct arguments in summary model fallback retry _generate_summary() takes (turns_to_summarize, focus_topic) but the summary model fallback path passed (messages, summary_budget) — where 'messages' is not even in scope, causing a NameError. Fix the recursive call to pass the correct variables so the fallback to the main model actually works when the summary model is unavailable. Fixes: #10721	2026-04-22 17:57:13 -07:00
Yukipukii1	1e8254e599	fix(agent): guard context compressor against structured message content	2026-04-22 14:46:51 -07:00
alt-glitch	1010e5fa3c	refactor: remove redundant local imports already available at module level Sweep ~74 redundant local imports across 21 files where the same module was already imported at the top level. Also includes type fixes and lint cleanups on the same branch.	2026-04-21 00:50:58 -07:00
entropidelic	3368814a3d	fix(security): redact secrets from context compaction input and output Three-layer defense against secrets leaking into compaction summaries: 1. Input redaction: redact_sensitive_text() on message content and tool call arguments in _serialize_for_summary() before sending to summarizer 2. Prompt instructions: NEVER include API keys/tokens/passwords in the summarizer preamble, template Critical Context section, and focus topic 3. Output redaction: redact_sensitive_text() on the summary output and _previous_summary for iterative updates Reuses existing agent/redact.py patterns (sk-, ghp_, key=value, etc). Cherry-picked from PR #9200 by @entropidelic.	2026-04-20 16:07:13 -07:00
Teknium	13294c2d18	feat(compression): summaries now respect the conversation's language Context compaction summaries were always produced in English regardless of the conversation language, which injected English context into non-English conversations and muddied the continuation experience. Adds a one-sentence instruction to the shared `_summarizer_preamble` used by both the initial-compaction and iterative-update prompt paths. Placing it in the preamble (rather than adding it separately to each prompt) means both code paths stay in sync with one edit. Ported from anomalyco/opencode#20581. The original PR (#4670) landed before main's prompt templates were refactored to share the `_summarizer_preamble` and `_template_sections` blocks, so the cherry-pick conflicted on the now-obsolete inline sections; re-applied the essential one-line change on top of the current structure. Verified: 48/48 existing compressor tests pass.	2026-04-19 11:05:14 -07:00
Honghua Yang	3128d9fcd2	fix(context_compressor): keep tool-call arguments JSON valid when shrinking Pass 3 of `_prune_old_tool_results` previously shrunk long `function.arguments` blobs by slicing the raw JSON string at byte 200 and appending the literal text `...[truncated]`. That routinely produced payloads like:: {"path": "/foo.md", "content": "# Long markdown ...[truncated] — an unterminated string with no closing brace. Strict providers (observed on MiniMax) reject this as `invalid function arguments json string` with a non-retryable 400. Because the broken call survives in the session history, every subsequent turn re-sends the same malformed payload and gets the same 400, locking the session into a re-send loop until the call falls out of the window. Fix: parse the arguments first, shrink long string leaves inside the parsed structure, and re-serialise. Non-string values (paths, ints, booleans, lists) pass through intact. Arguments that are not valid JSON to begin with (rare, some backends use non-JSON tool args) are returned unchanged rather than replaced with something neither we nor the provider can parse. Observed in the wild: a `write_file` with ~800 chars of markdown `content` triggered this on a real session against MiniMax-M2.7; every turn after compression got rejected until the session was manually reset. Tests: - 7 direct tests of `_truncate_tool_call_args_json` covering valid-JSON output, non-JSON pass-through, nested structures, non-string leaves, scalar JSON, and Unicode preservation - 1 end-to-end test through `_prune_old_tool_results` Pass 3 that reproduces the exact failure payload shape from the incident Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 12:40:56 -07:00
sontianye	f19ca50cd9	fix(context_compressor): always keep last user message in tail to prevent active-task loss Ensure _align_boundary_backward never pushes the last user message into the compressed region. Without this, compression could delete the user active task instruction mid-session. Cherry-picked from #10969 by @sontianye. Fixes #10896.	2026-04-16 07:45:31 -07:00
Teknium	772cfb6c4e	fix: stale agent timeout, uv venv detection, empty response after tools, compression model fallback (#9051 , #8620 , #9400 ) (#10093 ) Four independent fixes: 1. Reset activity timestamp on cached agent reuse (#9051) When the gateway reuses a cached AIAgent for a new turn, the _last_activity_ts from the previous turn (possibly hours ago) carried over. The inactivity timeout handler immediately saw the agent as idle for hours and killed it. Fix: reset _last_activity_ts, _last_activity_desc, and _api_call_count when retrieving an agent from the cache. 2. Detect uv-managed virtual environments (#8620 sub-issue 1) The systemd unit generator fell back to sys.executable (uv's standalone Python) when running under 'uv run', because sys.prefix == sys.base_prefix. The generated ExecStart pointed to a Python binary without site-packages. Fix: check VIRTUAL_ENV env var before falling back to sys.executable. uv sets VIRTUAL_ENV even when sys.prefix doesn't reflect the venv. 3. Nudge model to continue after empty post-tool response (#9400) Weaker models sometimes return empty after tool calls. The agent silently abandoned the remaining work. Fix: append assistant('(empty)') + user nudge message and retry once. Resets after each successful tool round. 4. Compression model fallback on permanent errors (#8620 sub-issue 4) When the default summary model (gemini-3-flash) returns 503 'model_not_found' on custom proxies, the compressor entered a 600s cooldown, leaving context growing unbounded. Fix: detect permanent model-not-found errors (503, 404, 'model_not_found', 'no available channel') and fall back to using the main model for compression instead of entering cooldown. One-time fallback with immediate retry. Test plan: 40 compressor tests + 97 gateway/CLI tests + 9 venv tests pass	2026-04-14 22:38:17 -07:00
kshitijk4poor	9855190f23	feat(compressor): smart collapse, dedup, anti-thrashing, template upgrade, hardening Combined salvage of PRs #9661, #9663, #9674, #9677, #9678 by kshitijk4poor. - Smart tool output collapse: informative 1-line summaries replace generic placeholder - Dedup identical tool results via MD5 hash, truncate large tool_call arguments - Anti-thrashing: skip compression after 2 consecutive <10% savings passes - Structured action-log summary template with numbered actions and Active State - Hardening: max_tokens 1.3x cap, multimodal safety, note idempotency, adaptive cooldown Follow-up fixes applied during salvage: - web_extract: reads 'urls' (list) not 'url' (original PR bug) - Multimodal list content guards in dedup and prune passes - Kept 'Relevant Files' section in template (original PR removed it) Skipped PRs #9665 (user msg preservation — duplication risk) and #9675 (dead code).	2026-04-14 22:21:25 -07:00
Harish Kukreja	b1f13a8c5f	fix(agent): route compression aux through live session runtime	2026-04-12 01:34:52 -07:00
Teknium	1cec910b6a	fix: improve context compaction to prevent model answering stale questions (#8107 ) After compression, models (especially Kimi 2.5) would sometimes respond to questions from the summary instead of the latest user message. This happened ~30% of the time on Telegram. Root cause: the summary's 'Next Steps' section read as active instructions, and the SUMMARY_PREFIX didn't explicitly tell the model to ignore questions in the summary. When the summary merged into the first tail message, there was no clear separator between historical context and the actual user message. Changes inspired by competitor analysis (Claude Code, OpenCode, Codex): 1. SUMMARY_PREFIX rewritten with explicit 'Do NOT answer questions from this summary — respond ONLY to the latest user message AFTER it' 2. Summarizer preamble (shared by both prompts) adds: - 'Do NOT respond to any questions' (from OpenCode's approach) - 'Different assistant' framing (from Codex) to create psychological distance between summary content and active conversation 3. New summary sections: - '## Resolved Questions' — tracks already-answered questions with their answers, preventing re-answering (from Claude Code's 'Pending user asks' pattern) - '## Pending User Asks' — explicitly marks unanswered questions - '## Remaining Work' replaces '## Next Steps' — passive framing avoids reading as active instructions 4. merge-summary-into-tail path now inserts a clear separator: '--- END OF CONTEXT SUMMARY — respond to the message below ---' 5. Iterative update prompt now instructs: 'Move answered questions to Resolved Questions' to maintain the resolved/pending distinction across multiple compactions.	2026-04-11 19:43:58 -07:00
Teknium	a0a02c1bc0	feat: /compress <focus> — guided compression with focus topic (#8017 ) Adds an optional focus topic to /compress: `/compress database schema` guides the summariser to preserve information related to the focus topic (60-70% of summary budget) while compressing everything else more aggressively. Inspired by Claude Code's /compact <focus>. Changes: - context_compressor.py: focus_topic parameter on _generate_summary() and compress(); appends FOCUS TOPIC guidance block to the LLM prompt - run_agent.py: focus_topic parameter on _compress_context(), passed through to the compressor - cli.py: _manual_compress() extracts focus topic from command string, preserves existing manual_compression_feedback integration (no regression) - gateway/run.py: _handle_compress_command() extracts focus from event args and passes through — full gateway parity - commands.py: args_hint="[focus topic]" on /compress CommandDef Salvaged from PR #7459 (CLI /compress focus only — /context command deferred). 15 new tests across CLI, compressor, and gateway.	2026-04-11 19:23:29 -07:00
Teknium	c8aff74632	fix: prevent agent from stopping mid-task — compression floor, budget overhaul, activity tracking Three root causes of the 'agent stops mid-task' gateway bug: 1. Compression threshold floor (64K tokens minimum) - The 50% threshold on a 100K-context model fired at 50K tokens, causing premature compression that made models lose track of multi-step plans. Now threshold_tokens = max(50% * context, 64K). - Models with <64K context are rejected at startup with a clear error. 2. Budget warning removal — grace call instead - Removed the 70%/90% iteration budget warnings entirely. These injected '[BUDGET WARNING: Provide your final response NOW]' into tool results, causing models to abandon complex tasks prematurely. - Now: no warnings during normal execution. When the budget is actually exhausted (90/90), inject a user message asking the model to summarise, allow one grace API call, and only then fall back to _handle_max_iterations. 3. Activity touches during long terminal execution - _wait_for_process polls every 0.2s but never reported activity. The gateway's inactivity timeout (default 1800s) would fire during long-running commands that appeared 'idle.' - Now: thread-local activity callback fires every 10s during the poll loop, keeping the gateway's activity tracker alive. - Agent wires _touch_activity into the callback before each tool call. Also: docs update noting 64K minimum context requirement. Closes #7915 (root cause was agent-loop termination, not Weixin delivery limits).	2026-04-11 16:18:57 -07:00
Teknium	3fe6938176	fix: robust context engine interface — config selection, plugin discovery, ABC completeness Follow-up fixes for the context engine plugin slot (PR #5700): - Enhance ContextEngine ABC: add threshold_percent, protect_first_n, protect_last_n as class attributes; complete update_model() default with threshold recalculation; clarify on_session_end() lifecycle docs - Add ContextCompressor.update_model() override for model/provider/ base_url/api_key updates - Replace all direct compressor internal access in run_agent.py with ABC interface: switch_model(), fallback restore, context probing all use update_model() now; _context_probed guarded with getattr/ hasattr for plugin engine compatibility - Create plugins/context_engine/ directory with discovery module (mirrors plugins/memory/ pattern) — discover_context_engines(), load_context_engine() - Add context.engine config key to DEFAULT_CONFIG (default: compressor) - Config-driven engine selection in run_agent.__init__: checks config, then plugins/context_engine/<name>/, then general plugin system, falls back to built-in ContextCompressor - Wire on_session_end() in shutdown_memory_provider() at real session boundaries (CLI exit, /reset, gateway expiry)	2026-04-10 19:15:50 -07:00
Stephen Schoettler	92382fb00e	feat: wire context engine plugin slot into agent and plugin system - PluginContext.register_context_engine() lets plugins replace the built-in ContextCompressor with a custom ContextEngine implementation - PluginManager stores the registered engine; only one allowed - run_agent.py checks for a plugin engine at init before falling back to the default ContextCompressor - reset_session_state() now calls engine.on_session_reset() instead of poking internal attributes directly - ContextCompressor.on_session_reset() handles its own internals (_context_probed, _previous_summary, etc.) - 19 new tests covering ABC contract, defaults, plugin slot registration, rejection of duplicates/non-engines, and compressor reset behavior - All 34 existing compressor tests pass unchanged	2026-04-10 19:15:50 -07:00
Stephen Schoettler	fe7e6c156c	feat: add ContextEngine ABC, refactor ContextCompressor to inherit from it Introduces agent/context_engine.py — an abstract base class that defines the pluggable context engine interface. ContextCompressor now inherits from ContextEngine as the default implementation. No behavior change. All 34 existing compressor tests pass. This is the foundation for a context engine plugin slot, enabling third-party engines like LCM (Lossless Context Management) to replace the built-in compressor via the plugin system.	2026-04-10 19:15:50 -07:00
alt-glitch	96c060018a	fix: remove 115 verified dead code symbols across 46 production files Automated dead code audit using vulture + coverage.py + ast-grep intersection, confirmed by Opus deep verification pass. Every symbol verified to have zero production callers (test imports excluded from reachability analysis). Removes ~1,534 lines of dead production code across 46 files and ~1,382 lines of stale test code. 3 entire files deleted (agent/builtin_memory_provider.py, hermes_cli/checklist.py, tests/hermes_cli/test_setup_model_selection.py). Co-authored-by: alt-glitch <balyan.sid@gmail.com>	2026-04-10 03:44:43 -07:00
Teknium	97308707e9	fix: insert static fallback when compression summary fails When _generate_summary() failed (no provider, timeout, model error), the compressor silently dropped all middle turns with just a debug log. The agent would then see head + tail with no explanation of the gap, causing total context amnesia (generic greetings instead of continuing the conversation). Now generates a static fallback marker that tells the model context was lost and to continue from the recent tail messages. The fallback flows through the same role-alternation logic as a real summary so message structure stays valid.	2026-04-09 14:28:56 -07:00
BongSuCHOI	d12f8db0b8	fix(compaction): token-budget primary tail protection Tail protection was effectively message-count based despite having a token budget, because protect_last_n=20 acted as a hard floor. A single 50K-token tool output would cause all 20 recent messages to be preserved regardless of budget, leaving little room for summarization. Changes: - _find_tail_cut_by_tokens: min_tail reduced from protect_last_n (20) to 3; token budget is now the primary criterion - Soft ceiling at 1.5x budget to avoid cutting mid-oversized-message - _prune_old_tool_results: accepts optional protect_tail_tokens so pruning also respects the token budget instead of a fixed count - compress() minimum message check relaxed from protect_first_n + protect_last_n + 1 to protect_first_n + 3 + 1 - Tool group alignment (no splitting tool_call/result) preserved	2026-04-08 23:54:23 -07:00
SHL0MS	8567031433	fix: improve context compression quality — named constants, tool tracking, degradation warning Three targeted improvements to the compression system: 1. Replace hardcoded truncation limits with named class constants (_CONTENT_MAX=6000, _CONTENT_HEAD=4000, _CONTENT_TAIL=1500, _TOOL_ARGS_MAX=1500, _TOOL_ARGS_HEAD=1200). Previous limits (3000/500) heavily truncated the summarizer's input — a 200-line edit got cut to 3000 chars before the summarizer ever saw it. 2. Add '## Tools & Patterns' section to both compression prompt templates (first-pass and iterative). Preserves working tool invocations, preferred flags, and tool-specific discoveries across compaction boundaries. 3. Warn users on 2nd+ compression: 'Session compressed N times — accuracy may degrade. Consider /new to start fresh.' Ref #499	2026-04-08 20:54:23 -07:00
KangYu	b26e85bf9d	Fix compaction summary retries for temperature-restricted models	2026-04-06 16:49:57 -07:00
Teknium	9a364f2805	fix: cap percentage displays at 100% in stats, gateway, and memory tool (#3599 ) Salvage of PR #3533 (binhnt92). Follow-up to #3480 — applies min(100, ...) to 5 remaining unclamped percentage display sites in context_compressor, cli /stats, gateway /stats, and memory tool. Defensive clamps now that the root cause (estimation heuristic) was already removed in #3480. Co-Authored-By: binhnt92 <binhnt92@users.noreply.github.com>	2026-03-28 14:55:18 -07:00
Teknium	839d9d7471	feat(agent): configurable timeouts for auxiliary LLM calls via config.yaml (#3597 ) Add per-task timeout settings under auxiliary.{task}.timeout in config.yaml instead of hardcoded values. Users with slow local models (Ollama, llama.cpp) can now increase timeouts for compression, vision, session search, etc. Defaults: - auxiliary.compression.timeout: 120s (was hardcoded 45s) - auxiliary.vision.timeout: 30s (unchanged) - all other aux tasks: 30s (was hardcoded 30s) - title_generator: 30s (was hardcoded 15s) call_llm/async_call_llm now auto-resolve timeout from config when not explicitly passed. Callers can still override with an explicit timeout arg. Based on PR #3406 by alanfwilliams. Converted from env vars to config.yaml per project conventions. Co-authored-by: alanfwilliams <alanfwilliams@users.noreply.github.com>	2026-03-28 14:35:28 -07:00
Teknium	8bb1d15da4	chore: remove ~100 unused imports across 55 files (#3016 ) Automated cleanup via pyflakes + autoflake with manual review. Changes: - Removed unused stdlib imports (os, sys, json, pathlib.Path, etc.) - Removed unused typing imports (List, Dict, Any, Optional, Tuple, Set, etc.) - Removed unused internal imports (hermes_cli.auth, hermes_cli.config, etc.) - Fixed cli.py: removed 8 shadowed banner imports (imported from hermes_cli.banner then immediately redefined locally — only build_welcome_banner is actually used) - Added noqa comments to imports that appear unused but serve a purpose: - Re-exports (gateway/session.py SessionResetPolicy, tools/terminal_tool.py is_interrupted/_interrupt_event) - SDK presence checks in try/except (daytona, fal_client, discord) - Test mock targets (auxiliary_client.py Path, mcp_config.py get_hermes_home) Zero behavioral changes. Full test suite passes (6162/6162, 2 pre-existing streaming test failures unrelated to this change).	2026-03-25 15:02:03 -07:00
Teknium	7ca22ea11b	fix(compression): restore sane defaults and cap summary at 12K tokens - threshold: 0.80 → 0.50 (compress at 50%, not 80%) - target_ratio: 0.40 → 0.20, now relative to threshold not total context (20% of 50% = 10% of context as tail budget) - summary ceiling: 32K → 12K (Gemini can't output more than ~12K) - Updated DEFAULT_CONFIG, config display, example config, and tests	2026-03-24 18:48:47 -07:00
Teknium	9231a335d4	fix(compression): replace dead summary_target_tokens with ratio-based scaling (#2554 ) The summary_target_tokens parameter was accepted in the constructor, stored on the instance, and never used — the summary budget was always computed from hardcoded module constants (_SUMMARY_RATIO=0.20, _MAX_SUMMARY_TOKENS=8000). This caused two compounding problems: 1. The config value was silently ignored, giving users no control over post-compression size. 2. Fixed budgets (20K tail, 8K summary cap) didn't scale with context window size. Switching from a 1M-context model to a 200K model would trigger compression that nuked 350K tokens of conversation history down to ~30K. Changes: - Replace summary_target_tokens with summary_target_ratio (default 0.40) which sets the post-compression target as a fraction of context_length. Tail token budget and summary cap now scale proportionally: MiniMax 200K → ~80K post-compression GPT-5 1M → ~400K post-compression - Change threshold_percent default: 0.50 → 0.80 (don't fire until 80% of context is consumed) - Change protect_last_n default: 4 → 20 (preserve ~10 full turns) - Summary token cap scales to 5% of context (was fixed 8K), capped at 32K ceiling - Read target_ratio and protect_last_n from config.yaml compression section (both are now configurable) - Remove hardcoded summary_target_tokens=500 from run_agent.py - Add 5 new tests for ratio scaling, clamping, and new defaults	2026-03-24 17:45:49 -07:00
Teknium	292d12bed4	fix: case-insensitive model family matching + compressor init logging Two fixes for local model context detection: 1. Hardcoded DEFAULT_CONTEXT_LENGTHS matching was case-sensitive. 'qwen' didn't match 'Qwen3.5-9B-Q4_K_M.gguf' because of the capital Q. Now uses model.lower() for comparison. 2. Added compressor initialization logging showing the detected context_length, threshold, model, provider, and base_url. This makes turn-1 compression bugs diagnosable from logs — previously there was no log of what context length was detected.	2026-03-21 10:47:44 -07:00
Teknium	e75f58420c	feat(compressor): major context compaction improvements Six improvements to reduce information loss during context compression, informed by analysis of Cline, OpenCode, Pi-mono, Codex, and ClawdBot: 1. Structured summary template — sections for Goal, Progress (Done/ In Progress/Blocked), Key Decisions, Relevant Files, Next Steps, and Critical Context. Forces the summarizer to preserve each category instead of writing a vague paragraph. 2. Iterative summary updates — on re-compression, the prompt says 'PRESERVE existing info, ADD new progress, UPDATE done/in-progress status.' Previous summary is stored and fed back to the summarizer so accumulated context survives across multiple compactions. 3. Token-budget tail protection — instead of fixed protect_last_n=4, walks backward keeping ~20K tokens of recent context. Adapts to message density: sessions with big tool results protect fewer messages, short exchanges protect more. Falls back to protect_last_n for small conversations. 4. Tool output pruning (pre-pass) — before the expensive LLM summary, replaces old tool result contents with a placeholder. This is free (no LLM call) and can save 30%+ of context by itself. 5. Scaled summary budget — instead of fixed 2500 tokens, allocates 20% of compressed content tokens (clamped to 2000-8000). A 50-turn conversation gets more summary space than a 10-turn one. 6. Richer summarizer input — tool calls now include arguments (up to 500 chars) and tool results keep up to 3000 chars (was 1500). The summarizer sees 'terminal(git status) → M src/config.py' instead of just '[Tool calls: terminal]'.	2026-03-21 08:14:14 -07:00
Teknium	88643a1ba9	feat: overhaul context length detection with models.dev and provider-aware resolution (#2158 ) Replace the fragile hardcoded context length system with a multi-source resolution chain that correctly identifies context windows per provider. Key changes: - New agent/models_dev.py: Fetches and caches the models.dev registry (3800+ models across 100+ providers with per-provider context windows). In-memory cache (1hr TTL) + disk cache for cold starts. - Rewritten get_model_context_length() resolution chain: 0. Config override (model.context_length) 1. Custom providers per-model context_length 2. Persistent disk cache 3. Endpoint /models (local servers) 4. Anthropic /v1/models API (max_input_tokens, API-key only) 5. OpenRouter live API (existing, unchanged) 6. Nous suffix-match via OpenRouter (dot/dash normalization) 7. models.dev registry lookup (provider-aware) 8. Thin hardcoded defaults (broad family patterns) 9. 128K fallback (was 2M) - Provider-aware context: same model now correctly resolves to different context windows per provider (e.g. claude-opus-4.6: 1M on Anthropic, 128K on GitHub Copilot). Provider name flows through ContextCompressor. - DEFAULT_CONTEXT_LENGTHS shrunk from 80+ entries to ~16 broad patterns. models.dev replaces the per-model hardcoding. - CONTEXT_PROBE_TIERS changed from [2M, 1M, 512K, 200K, 128K, 64K, 32K] to [128K, 64K, 32K, 16K, 8K]. Unknown models no longer start at 2M. - hermes model: prompts for context_length when configuring custom endpoints. Supports shorthand (32k, 128K). Saved to custom_providers per-model config. - custom_providers schema extended with optional models dict for per-model context_length (backward compatible). - Nous Portal: suffix-matches bare IDs (claude-opus-4-6) against OpenRouter's prefixed IDs (anthropic/claude-opus-4.6) with dot/dash normalization. Handles all 15 current Nous models. - Anthropic direct: queries /v1/models for max_input_tokens. Only works with regular API keys (sk-ant-api*), not OAuth tokens. Falls through to models.dev for OAuth users. Tests: 5574 passed (18 new tests for models_dev + updated probe tiers) Docs: Updated configuration.md context length section, AGENTS.md Co-authored-by: Test <test@test.com>	2026-03-20 06:04:33 -07:00
Teknium	d76fa7fc37	fix: detect context length for custom model endpoints via fuzzy matching + config override (#2051 ) * fix: detect context length for custom model endpoints via fuzzy matching + config override Custom model endpoints (non-OpenRouter, non-known-provider) were silently falling back to 2M tokens when the model name didn't exactly match what the endpoint's /v1/models reported. This happened because: 1. Endpoint metadata lookup used exact match only — model name mismatches (e.g. 'qwen3.5:9b' vs 'Qwen3.5-9B-Q4_K_M.gguf') caused a miss 2. Single-model servers (common for local inference) required exact name match even though only one model was loaded 3. No user escape hatch to manually set context length Changes: - Add fuzzy matching for endpoint model metadata: single-model servers use the only available model regardless of name; multi-model servers try substring matching in both directions - Add model.context_length config override (highest priority) so users can explicitly set their model's context length in config.yaml - Log an informative message when falling back to 2M probe, telling users about the config override option - Thread config_context_length through ContextCompressor and AIAgent init Tests: 6 new tests covering fuzzy match, single-model fallback, config override (including zero/None edge cases). * fix: auto-detect local model name and context length for local servers Cherry-picked from PR #2043 by sudoingX. - Auto-detect model name from local server's /v1/models when only one model is loaded (no manual model name config needed) - Add n_ctx_train and n_ctx to context length detection keys for llama.cpp - Query llama.cpp /props endpoint for actual allocated context (not just training context from GGUF metadata) - Strip .gguf suffix from display in banner and status bar - _auto_detect_local_model() in runtime_provider.py for CLI init Co-authored-by: sudo <sudoingx@users.noreply.github.com> * fix: revert accidental summary_target_tokens change + add docs for context_length config - Revert summary_target_tokens from 2500 back to 500 (accidental change during patching) - Add 'Context Length Detection' section to Custom & Self-Hosted docs explaining model.context_length config override --------- Co-authored-by: Test <test@test.com> Co-authored-by: sudo <sudoingx@users.noreply.github.com>	2026-03-19 06:01:16 -07:00
Teknium	d132e344d7	fix(agent): prevent silent tool result loss during context compression (#1993 ) _align_boundary_backward only checked messages[idx-1] to decide if the compress-end boundary splits a tool_call/result group. When an assistant issues 3+ parallel tool calls, their results span multiple consecutive messages. If the boundary fell in the middle of that group, the parent assistant was summarized away and orphaned tool results were silently deleted by _sanitize_tool_pairs. Now walks backward through all consecutive tool results to find the parent assistant, then pulls the boundary before the entire group. 6 regression tests added in tests/test_compression_boundary.py. Co-authored-by: Guts <Gutslabs@users.noreply.github.com>	2026-03-18 15:22:51 -07:00
Teknium	a2440f72f6	feat: use endpoint metadata for custom model context and pricing (#1906 ) * perf: cache base_url.lower() via property, consolidate triple load_config(), hoist set constant run_agent.py: - Add base_url property that auto-caches _base_url_lower on every assignment, eliminating 12+ redundant .lower() calls per API cycle across __init__, _build_api_kwargs, _supports_reasoning_extra_body, and the main conversation loop - Consolidate three separate load_config() disk reads in __init__ (memory, skills, compression) into a single call, reusing the result dict for all three config sections model_tools.py: - Hoist _READ_SEARCH_TOOLS set to module level (was rebuilt inside handle_function_call on every tool invocation) * Use endpoint metadata for custom model context and pricing --------- Co-authored-by: kshitij <82637225+kshitijk4poor@users.noreply.github.com>	2026-03-18 03:04:07 -07:00
Test	45bad9771d	fix(context_compressor): replace print() calls with logger Replaces all remaining print() calls in compress() with logger.info() and logger.warning() for consistency with the rest of the module. Inspired by PR #1822.	2026-03-17 16:31:01 -07:00
Teknium	548cedb869	fix(context_compressor): prevent consecutive same-role messages after compression (#1743 ) compress() checks both the head and tail neighbors when choosing the summary message role. When only the tail collides, the role is flipped. When BOTH roles would create consecutive same-role messages (e.g. head=assistant, tail=user), the summary is merged into the first tail message instead of inserting a standalone message that breaks role alternation and causes API 400 errors. The previous code handled head-side collision but left the tail-side uncovered — long conversations would crash mid-reply with no useful error, forcing the user to /reset and lose session history. Based on PR #1186 by @alireza78a, with improved double-collision handling (merge into tail instead of unconditional 'user' fallback). Co-authored-by: alireza78a <alireza78.crypto@gmail.com>	2026-03-17 05:18:52 -07:00
teknium1	344f3771cb	fix(compressor): summary role can create consecutive same-role messages The summary message role was determined only by the last head message, ignoring the first tail message. This could create consecutive user messages (rejected by Anthropic) when the tail started with 'user'. Now checks both neighbors. Priority: avoid colliding with the head (already committed). If the chosen role also collides with the tail, flip it — but only if flipping wouldn't re-collide with the head.	2026-03-17 04:08:37 -07:00
Teknium	5c479eedf1	feat: improve context compaction handoff summaries (#1273 ) Adapt PR #916 onto current main by replacing the old context summary marker with a clearer handoff wrapper, updating the summarization prompt for resume-oriented summaries, and preserving the current call_llm-based compression path.	2026-03-14 02:33:31 -07:00
Teknium	07927f6bf2	feat(stt): add free local whisper transcription via faster-whisper (#1185 ) * fix: Home Assistant event filtering now closed by default Previously, when no watch_domains or watch_entities were configured, ALL state_changed events passed through to the agent, causing users to be flooded with notifications for every HA entity change. Now events are dropped by default unless the user explicitly configures: - watch_domains: list of domains to monitor (e.g. climate, light) - watch_entities: list of specific entity IDs to monitor - watch_all: true (new option — opt-in to receive all events) A warning is logged at connect time if no filters are configured, guiding users to set up their HA platform config. All 49 gateway HA tests + 52 HA tool tests pass. * docs: update Home Assistant integration documentation - homeassistant.md: Fix event filtering docs to reflect closed-by-default behavior. Add watch_all option. Replace Python dict config example with YAML. Fix defaults table (was incorrectly showing 'all'). Add required configuration warning admonition. - environment-variables.md: Add HASS_TOKEN and HASS_URL to Messaging section. - messaging/index.md: Add Home Assistant to description, architecture diagram, platform toolsets table, and Next Steps links. * fix(terminal): strip provider env vars from background and PTY subprocesses Extends the env var blocklist from #1157 to also cover the two remaining leaky paths in process_registry.py: - spawn_local() PTY path (line 156) - spawn_local() background Popen path (line 197) Both were still using raw os.environ, leaking provider vars to background processes and interactive PTY sessions. Now uses the same dynamic _HERMES_PROVIDER_ENV_BLOCKLIST from local.py. Explicit env_vars passed to spawn_local() still override the blocklist, matching the existing behavior for callers that intentionally need these. Gap identified by PR #1004 (@PeterFile). * feat(delegate): add observability metadata to subagent results Enrich delegate_task results with metadata from the child AIAgent: - model: which model the child used - exit_reason: completed \| interrupted \| max_iterations - tokens.input / tokens.output: token counts - tool_trace: per-tool-call trace with byte sizes and ok/error status Tool trace uses tool_call_id matching to correctly pair parallel tool calls with their results, with a fallback for messages without IDs. Cherry-picked from PR #872 by @omerkaz, with fixes: - Fixed parallel tool call trace pairing (was always updating last entry) - Removed redundant 'iterations' field (identical to existing 'api_calls') - Added test for parallel tool call trace correctness Co-authored-by: omerkaz <omerkaz@users.noreply.github.com> * feat(stt): add free local whisper transcription via faster-whisper Replace OpenAI-only STT with a dual-provider system mirroring the TTS architecture (Edge TTS free / ElevenLabs paid): STT: faster-whisper local (free, default) / OpenAI Whisper API (paid) Changes: - tools/transcription_tools.py: Full rewrite with provider dispatch, config loading, local faster-whisper backend, and OpenAI API backend. Auto-downloads model (~150MB for 'base') on first voice message. Singleton model instance reused across calls. - pyproject.toml: Add faster-whisper>=1.0.0 as core dependency - hermes_cli/config.py: Expand stt config to match TTS pattern with provider selection and per-provider model settings - agent/context_compressor.py: Fix .strip() crash when LLM returns non-string content (dict from llama.cpp, None). Fixes #1100 partially. - tests/: 23 new tests for STT providers + 2 for compressor fix - docs/: Updated Voice & TTS page with STT provider table, model sizes, config examples, and fallback behavior Fallback behavior: - Local not installed → OpenAI API (if key set) - OpenAI key not set → local whisper (if installed) - Neither → graceful error message to user Co-authored-by: Jah-yee <Jah-yee@users.noreply.github.com> --------- Co-authored-by: omerkaz <omerkaz@users.noreply.github.com> Co-authored-by: Jah-yee <Jah-yee@users.noreply.github.com>	2026-03-13 11:11:05 -07:00
Teknium	1bb8ed4495	chore: lower default compression threshold from 85% to 50% (#1096 ) * fix: ClawHub skill install — use /download ZIP endpoint The ClawHub API v1 version endpoint only returns file metadata (path, size, sha256, contentType) without inline content or download URLs. Our code was looking for inline content in the metadata, which never existed, causing all ClawHub installs to fail with: 'no inline/raw file content was available' Fix: Use the /api/v1/download endpoint (same as the official clawhub CLI) to download skills as ZIP bundles and extract files in-memory. Changes: - Add _download_zip() method that downloads and extracts ZIP bundles - Retry on 429 rate limiting with Retry-After header support - Path sanitization and binary file filtering for security - Keep _extract_files() as a fallback for inline/raw content - Also fix nested file lookup (version_data.version.files) * chore: lower default compression threshold from 85% to 50% Triggers context compression earlier — at 50% of the model's context window instead of 85%. Updated in all four places where the default is defined: context_compressor.py, cli.py, run_agent.py, config.py, and gateway/run.py.	2026-03-12 15:51:50 -07:00
teknium1	0aa31cd3cb	feat: call_llm/async_call_llm + config slots + migrate all consumers Add centralized call_llm() and async_call_llm() functions that own the full LLM request lifecycle: 1. Resolve provider + model from task config or explicit args 2. Get or create a cached client for that provider 3. Format request args (max_tokens handling, provider extra_body) 4. Make the API call with max_tokens/max_completion_tokens retry 5. Return the response Config: expanded auxiliary section with provider:model slots for all tasks (compression, vision, web_extract, session_search, skills_hub, mcp, flush_memories). Config version bumped to 7. Migrated all auxiliary consumers: - context_compressor.py: uses call_llm(task='compression') - vision_tools.py: uses async_call_llm(task='vision') - web_tools.py: uses async_call_llm(task='web_extract') - session_search_tool.py: uses async_call_llm(task='session_search') - browser_tool.py: uses call_llm(task='vision'/'web_extract') - mcp_tool.py: uses call_llm(task='mcp') - skills_guard.py: uses call_llm(provider='openrouter') - run_agent.py flush_memories: uses call_llm(task='flush_memories') Tests updated for context_compressor and MCP tool. Some test mocks still need updating (15 remaining failures from mock pattern changes, 2 pre-existing).	2026-03-11 20:52:19 -07:00
teknium1	8805e705a7	feat: centralized provider router + fix Codex vision bypass + vision error handling Three interconnected fixes for auxiliary client infrastructure: 1. CENTRALIZED PROVIDER ROUTER (auxiliary_client.py) Add resolve_provider_client(provider, model, async_mode) — a single entry point for creating properly configured clients. Given a provider name and optional model, it handles auth lookup (env vars, OAuth tokens, auth.json), base URL resolution, provider-specific headers, and API format differences (Chat Completions vs Responses API for Codex). All auxiliary consumers should route through this instead of ad-hoc env var lookups. Refactored get_text_auxiliary_client, get_async_text_auxiliary_client, and get_vision_auxiliary_client to use the router internally. 2. FIX CODEX VISION BYPASS (vision_tools.py) vision_tools.py was constructing a raw AsyncOpenAI client from the sync vision client's api_key/base_url, completely bypassing the Codex Responses API adapter. When the vision provider resolved to Codex, the raw client would hit chatgpt.com/backend-api/codex with chat.completions.create() which only supports the Responses API. Fix: Added get_async_vision_auxiliary_client() which properly wraps Codex into AsyncCodexAuxiliaryClient. vision_tools.py now uses this instead of manual client construction. 3. FIX COMPRESSION FALLBACK + VISION ERROR HANDLING - context_compressor.py: Removed _get_fallback_client() which blindly looked for OPENAI_API_KEY + OPENAI_BASE_URL (fails for Codex OAuth, API-key providers, users without OPENAI_BASE_URL set). Replaced with fallback loop through resolve_provider_client() for each known provider, with same-provider dedup. - vision_tools.py: Added error detection for vision capability failures. Returns clear message to the model when the configured model doesn't support vision, instead of a generic error. Addresses #886	2026-03-11 19:46:47 -07:00
teknium1	77da3bbc95	fix: use correct role for summary message in context compressor The summary message was always injected as 'user' role, which causes consecutive user messages when the last preserved head message is also 'user'. Some APIs reject this (400 error), and it produces malformed training data. Fix: check the role of the last head message and pick the opposite role for the summary — 'user' after assistant/tool, 'assistant' after user. Based on PR #328 by johnh4098. Closes #328.	2026-03-08 23:09:04 -07:00
teknium1	d9f373654b	feat: enhance auxiliary model configuration and environment variable handling - Added support for auxiliary model overrides in the configuration, allowing users to specify providers and models for vision and web extraction tasks. - Updated the CLI configuration example to include new auxiliary model settings. - Enhanced the environment variable mapping in the CLI to accommodate auxiliary model configurations. - Improved the resolution logic for auxiliary clients to support task-specific provider overrides. - Updated relevant documentation and comments for clarity on the new features and their usage.	2026-03-08 18:06:47 -07:00
teknium1	306d92a9d7	refactor(context_compressor): improve summary generation logic and error handling Updated the _generate_summary method to attempt summary generation using the auxiliary model first, with a fallback to the main model. If both attempts fail, the method now returns None instead of a placeholder, allowing the caller to handle missing summaries appropriately. This change enhances the robustness of context compression and improves logging for failure scenarios.	2026-03-07 11:54:51 -08:00
teknium1	5da55ea1e3	fix: sanitize orphaned tool-call/result pairs in message compression Enhance message compression by adding a method to clean up orphaned tool-call and tool-result pairs. This ensures that the API receives well-formed messages, preventing errors related to mismatched IDs. The new functionality includes removing orphaned results and adding stub results for missing calls, improving overall message integrity during compression.	2026-03-07 08:08:00 -08:00
teknium1	c886333d32	feat: smart context length probing with persistent caching + banner display Replaces the unsafe 128K fallback for unknown models with a descending probe strategy (2M → 1M → 512K → 200K → 128K → 64K → 32K). When a context-length error occurs, the agent steps down tiers and retries. The discovered limit is cached per model+provider combo in ~/.hermes/context_length_cache.yaml so subsequent sessions skip probing. Also parses API error messages to extract the actual context limit (e.g. 'maximum context length is 32768 tokens') for instant resolution. The CLI banner now displays the context window size next to the model name (e.g. 'claude-opus-4 · 200K context · Nous Research'). Changes: - agent/model_metadata.py: CONTEXT_PROBE_TIERS, persistent cache (save/load/get), parse_context_limit_from_error(), get_next_probe_tier() - agent/context_compressor.py: accepts base_url, passes to metadata - run_agent.py: step-down logic in context error handler, caches on success - cli.py + hermes_cli/banner.py: context length in welcome banner - tests: 22 new tests for probing, parsing, and caching Addresses #132. PR #319's approach (8K default) rejected — too conservative.	2026-03-05 16:09:57 -08:00
teknium1	3e2ed18ad0	fix: fallback to main model endpoint when auxiliary summary client fails When the auxiliary client (used for context compression summaries) fails — e.g. due to a stale OpenRouter API key after switching to a local LLM — fall back to the user's active endpoint (OPENAI_BASE_URL) instead of returning a useless static summary string. This handles the common scenario where a user switches providers via 'hermes model' but the old provider's API key remains in .env. The auxiliary client picks up the stale key, fails (402/auth error), and previously compression would produce garbage. Now it gracefully retries with the working endpoint. On successful fallback, the working client is cached for future compressions in the same session so the fallback cost is paid only once. Ref: #348	2026-03-04 17:58:09 -08:00
teknium1	25c65bc99e	fix(agent): handle None content in context compressor (fixes #211 ) The OpenAI API returns content: null on assistant messages that only contain tool calls. msg.get('content', '') returns None (not '') when the key exists with value None, causing TypeError on len() and string concatenation in _generate_summary and compress. Fix: msg.get('content') or '' — handles both missing keys and None. Tests from PR #216 (@Farukest). Fix also in PR #215 (@cutepawss). Both PRs had stale branches and couldn't be merged directly. Closes #211	2026-03-02 01:35:52 -08:00
teknium1	500f0eab4a	refactor(cli): Finalize OpenAI Codex Integration with OAuth - Enhanced Codex model discovery by fetching available models from the API, with fallback to local cache and defaults. - Updated the context compressor's summary target tokens to 2500 for improved performance. - Added external credential detection for Codex CLI to streamline authentication. - Refactored various components to ensure consistent handling of authentication and model selection across the application.	2026-02-28 21:47:51 -08:00

1 2

54 commits