hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-24 16:54:43 +00:00

Author	SHA1	Message	Date
Brooklyn Nicholson	e9a243ef78	fix(state): inherit and stamp profile_name across rotation and branch children profile_name was only written on the agent's initial lazy create (`e8b7ce8c1`); every parented child row — compression rotation, TUI /branch, desktop branch first-persist — was created without it. A non-default profile's lineage therefore turned NULL on its first compression or branch and aggregated as "default" in unified session lists, completing the cross-profile session-jump. Fix the class at the DB layer: _insert_session_row's parent backfill now COALESCEs profile_name from the parent alongside cwd/git_* (#64709 pattern), so any parented child inherits its lineage's owning profile. Stamp it explicitly at the three create sites as well — compression rotation (mirroring _ensure_db_session), TUI session.branch, and the TUI first-prompt row persist — so rows are self-describing even when the parent row predates the profile_name column.	2026-07-24 01:49:22 -05:00
Teknium	23476207bc	feat(moa): default advisor fanout to user_turn — the cheapest cadence Flips the default fan-out cadence from per_iteration (advisors re-run on every tool iteration, multiplying advisor spend by tool-loop depth) to user_turn (advisors run once on the first message of each user turn; the acting aggregator works the rest of the tool loop with that turn's advice). Until per-mode benchmarks justify a costlier default, MoA defaults to the cheapest, lowest-impact cadence (#67199). One default for everyone — no split legacy/new-preset semantics; presets that want per-step advising set fanout: per_iteration explicitly. All three modes (user_turn / per_iteration / every_n:N) remain selectable; every_n:1 still collapses to per_iteration (semantic identity), while unparseable values now fall to user_turn (the default). Docs updated with a default-change note; the per-iteration rerun test pins its mode explicitly. Co-authored-by: skyer-flyyy <188930297+skyer-flyyy@users.noreply.github.com>	2026-07-23 21:07:18 -07:00
Teknium	7b65073dc9	fix(moa): tolerate SDK-shaped tool_call entries in _render_tool_calls _render_tool_calls only handled dict-shaped entries; a SimpleNamespace- shaped tool_call (SDK-style stream-stitched responses) rendered as '[called tool: tool]', silently losing the function name and arguments from the advisory view. Handle both shapes (including a namespace-shaped nested function inside a dict entry). One-hunk hardening salvaged from closed #59712. Co-authored-by: SquabbyZ <601709253@qq.com>	2026-07-23 18:40:09 -07:00
Teknium	975eb3a365	fix(moa): trim reference messages to fit each model's context window Reference models may have a smaller context window than the aggregator (e.g. kimi-k2.7-code @ 262K advising a glm-5.2 @ 1M conversation). Without context-length protection, a reference whose window is exceeded gets a hard HTTP 400 from the provider, which _run_reference's try/except silently converts to a [failed: …] note — the MoA turn silently degrades to fewer references (#60345). Redesigned implementation of #60387: - Estimate AFTER the advisory system prompt is prepended, so the request that is actually sent is what gets budgeted. - Reserve output headroom: the preset's reference_max_tokens when set, else an 8192-token constant, plus a 10% estimator-error fraction. - Trim on advisory-view boundaries (text-only user/assistant turns; no tool-result frames to orphan), preserving the system prompt, the user-first invariant after every pop (never assistant-first), and the trailing synthetic user turn. - Cache get_model_context_length per (provider, model) in a per-fan-out dict shared across the worker threads, so a turn resolves each window once instead of probing metadata sources per-reference-per-iteration (failures are cached too). Co-authored-by: webtecnica <75556242+webtecnica@users.noreply.github.com>	2026-07-23 18:40:09 -07:00
Teknium	55f3826224	fix(moa): keep real accounting for interrupted-but-billed references; don't cache interrupted results Follow-ups for salvaged #56344: - A reference that completes between the interrupt check and the reap keeps its REAL output and accounting (the provider call billed) instead of being zeroed with a placeholder. - A reference still in flight at interrupt time gets a placeholder in the results, but its future now carries a done-callback that folds the eventual real usage/cost into the facade's pending accounting (late_accounting_sink -> _record_late_reference_accounting), so billed spend is never silently dropped. Pending totals are folded (not overwritten) and guarded by a lock since done-callbacks fire on executor worker threads. - Interrupted placeholder results are no longer written into the facade's turn-scoped reference cache: a cache HIT never re-runs references, so caching a partial snapshot would replay '[skipped: interrupted by user]' notes for the rest of the turn. The cache is left empty and the next create() re-runs the fan-out.	2026-07-23 18:40:09 -07:00
srojk34	68cd755731	fix(moa): allow a user interrupt to abort the reference fan-out wait agent/tool_executor.py's concurrent tool batch checks agent._interrupt_requested and aborts the wait early; agent/moa_loop.py's _run_references_parallel had no equivalent, so a MoA-enabled turn blocked on ThreadPoolExecutor.result() until every reference model finished or hit its own individual auxiliary.moa_reference timeout -- there was no way for the user to abort a live turn mid-fanout. Thread an optional `agent` parameter through aggregate_moa_context -> _run_references_parallel (used when MoA references run alongside the main model) and MoAClient/MoAChatCompletions (used when the MoA preset itself is the acting model), then poll concurrent.futures.wait() in _REFERENCE_POLL_INTERVAL_S slices instead of blocking on future.result() per reference, checking agent._interrupt_requested each cycle. Deliberately scoped to interrupt/cancel only -- no new or changed timeout value, so this doesn't overlap open PRs #53784/#53875 (which lower the per-reference timeout default but don't add interrupt support). `agent` is optional and defaults to None, so any caller that doesn't pass it keeps today's uninterruptible blocking behavior unchanged.	2026-07-23 18:40:09 -07:00
Teknium	62c2b299a3	fix(moa): act aggregator-alone on the facade path when all references fail Extends the all-references-failed short-circuit (#56975) to the persistent `provider: moa` facade path: MoAChatCompletions.create() previously attached 'use the reference responses below' guidance built entirely from failure sentinels and called the aggregator with it. Now an all-failed turn attaches either the sanitized unavailability notice (loud policy) or nothing (silent policy), and the aggregator — which IS the acting model — simply acts alone. Advisor accounting for the failed fan-out is still recorded. Co-authored-by: liuhao1024 <sunsky.lau@gmail.com>	2026-07-23 18:40:09 -07:00
liuhao1024	f0ed77b627	fix(moa): skip aggregator synthesis when all references fail When every MoA reference model returns a failure (HTTP error, timeout, etc.) or is skipped by the recursion guard, the one-shot aggregator synthesis call is now skipped entirely. Previously it would try to synthesise a wall of failure sentinels, which could block for the full provider timeout (observed ~6 min on SenseNova) before returning a non-retryable error that left the session hanging. The early return carries the sanitized unavailability notice (never raw provider error text, per the failed-reference containment) so the main agent loop can still act in single-model mode. Salvaged from #56975, reworked atop the _is_failed_reference helpers.	2026-07-23 18:40:09 -07:00
Teknium	d3fc27bbf8	fix(moa): make reference_timeout default inherit auxiliary config; filter recursion-guard skips Follow-ups for salvaged #53784: - reference_timeout now defaults to None = no per-preset override, so the reference fan-out inherits auxiliary.moa_reference.timeout (900s default) via call_llm's own per-task timeout resolution. The PR's 30.0s default would have cut off long-thinking advisors mid-response, and its 300s max cap capped legitimate explicit values — both removed. Explicit per-preset values are still honored as-is. - _is_failed_reference also treats '[skipped: …]' recursion-guard notes as internal sentinels, keeping them out of both aggregator prompts. - Dashboard/desktop TS types updated to number | null; web_server validator accepts null/empty as 'inherit'.	2026-07-23 18:40:09 -07:00
robbyczgw-cla	ccdf171bcd	fix(moa): contain failed reference details	2026-07-23 18:40:09 -07:00
oppenheimor	ca294d3e62	feat(moa): add reference model toggles	2026-07-23 18:11:57 -07:00
Teknium	f7b90e6f80	feat(moa): add privacy redaction filter with display/full modes Adds moa.privacy_filter ('' | display | full, default off — issue #59959): - display: redact user-visible surfaces only (reference blocks emitted to the UI + saved MoA trace records, including per-advisor full input/output and the aggregator-input copy); the aggregator sees raw advisor text so synthesis quality is unaffected. - full: additionally redact the advisor text injected into the aggregator prompt, on both the persistent facade path and the one-shot /moa synthesis path (the issue's literal ask). Legacy boolean true maps here. Secret/credential shapes (API-key prefixes, JWTs, private keys, DB connection strings) are delegated to the central redactor (agent.redact.redact_sensitive_text, force=True + code_file=True); the MoA filter adds only email and clearly delimited phone-number patterns. No bare 10-digit matching: line numbers, timestamps, epoch values, git SHAs, IPs, versions, and source-code assignments in code-review-shaped advisory text pass through byte-identical. The reference cache always holds raw text — redaction happens at each consuming surface, so a mid-session mode change never leaks or double-redacts. Reworked from PR #60463: replaced its hand-rolled pattern list (which matched bare digit runs and re-implemented key shapes) with central- redactor reuse + safe patterns, and split the single boolean into display/full modes. Credited for the feature framing. Co-authored-by: webtecnica <75556242+webtecnica@users.noreply.github.com>	2026-07-23 17:50:40 -07:00
Teknium	850f576f3d	feat(moa): add every_n fanout cadence with cached-guidance reuse Extends the fanout enum with 'every_n:<N>' (N >= 2): advisors run on the first iteration of each user turn and every Nth tool iteration after it; off-cadence iterations REUSE the cached guidance from the last on-cadence run via the same cache mechanism the user_turn fanout uses, so the aggregator still gets advice on every step. The cadence counter is scoped per user turn (resets on a new user message) and only advances when the advisory state actually changes, so streaming retries never consume a cadence slot. Mapping form {mode: every_n, n: N} normalizes to the canonical string. Unknown/degenerate values fall back to per_iteration. Addresses issue #63393 (advisor fan-out multiplies turn latency/cost by the tool-iteration count). Redesigned from PR #63448: the submitted shape skipped references entirely on off-cadence iterations (aggregator ran advice-less); this version keeps the last advice in play, credited for the idea and cadence framing. Config-gated, default-off (default fanout remains per_iteration). Co-authored-by: webtecnica <75556242+webtecnica@users.noreply.github.com>	2026-07-23 17:50:40 -07:00
Teknium	74a56b76b0	test(moa): regression for aggregator-model thought_signature resolution (#66212) Adds test_moa_gemini_aggregator_sanitize_uses_real_model: drives a full MoA tool-call turn (virtual-provider mode) with a Gemini aggregator and asserts the strict-API sanitize pass is invoked with the resolved aggregator model (gemini-3-pro-preview), never the virtual preset name once a slot is resolved — the exact path that stripped extra_content/thought_signature and made Gemini aggregators 400 (#65092). Writing the test surfaced a gap in the salvaged #66212 fix: in virtual- provider MoA mode (provider=moa, no moa_config threaded through run_conversation) the conversation-loop branch never fired because it only consulted moa_config. Extend it to fall back to the facade's last_aggregator_slot — the same source the handle_max_iterations fix uses — so both MoA entry modes resolve the real aggregator model. Also adds the contributors/emails mapping for the #15676 credit base.	2026-07-23 17:26:24 -07:00
Teknium	16950a4568	fix(moa): normalize Copilot aliases + carry x-initiator through retry rebuilds (#60293) Follow-ups to the salvaged core of #60293: - Gate the x-initiator header on _normalize_aux_provider() instead of a literal 'copilot' string compare, so slot configs spelled github / github-copilot / github-models / copilot-acp / mixed case all get the user-turn attribution. - Thread extra_headers through _retry_same_provider_sync/_async so the credential-refresh and pool-rotation retry rebuilds don't silently drop the header (the rebuilt kwargs previously started from scratch). - Add a transport-boundary test asserting the header reaches the SDK client's create() kwargs (no call_llm mocking), an alias-spelling matrix test, and a retry-rebuild preservation test.	2026-07-23 17:26:24 -07:00
dyreckt	4c66307c36	fix(moa): pass Copilot initiator header to advisors	2026-07-23 17:26:24 -07:00
Teknium	0749cac7a1	fix(moa): share facade factory so restore/recover keep reference relay (#53802) Follow-up to the salvaged core of #53802: a naive MoAClient(preset) rebuild restores a working facade but silently drops the reference_callback relay wired in agent_init, so moa.reference / moa.aggregating display events stop reaching every frontend for the rest of the session. Introduce agent.moa_loop.build_moa_facade(agent, preset) as the single construction point for the MoA facade and use it at: - initial client construction (agent_init.py) - turn-start fallback restore (restore_primary_runtime) - transient transport recovery (try_recover_primary_transport — previously fell through to _create_openai_client with MoA's empty client_kwargs and died with 'api_key client option must be set') - mid-session model switches (switch_model) The relay reads agent.tool_progress_callback at emit time, so callbacks attached after construction are picked up automatically. Adds test_moa_restored_facade_still_emits_reference_events covering event delivery through a restored facade.	2026-07-23 17:26:24 -07:00
kosta	5501187847	fix(moa): restore virtual runtime after fallback	2026-07-23 17:26:24 -07:00
Dineth Hettiarachchi	8d14e19f9a	fix(agent): close MoA stream on interrupt	2026-07-23 17:26:24 -07:00
Teknium	fa4800414c	feat(compression): prompt-cache reclaim gate + hardened wiring for proactive prune Follow-ups on top of the cherry-picked #62644 mechanism, porting it to current main and closing the salvage-review requirements: - proactive_prune_min_reclaim_tokens (default 4096): a prune only COMMITS when it reclaims a meaningful token batch, measured on the pruned output. A committed prune rewrites already-sent history and invalidates the provider prompt-cache prefix; this hysteresis gate keeps those breaks episodic/amortized (like a compression boundary) instead of firing every tool iteration. 0 disables the gate. (Design point credited to the #62389 review cycle's prune_minimum_tokens.) - Standard no-op caller contract: every skip path returns the INPUT list object; the loop commits only on 'result is not messages' + non-zero count. - Loop call is getattr+callable guarded (plugin engines predating the hook, SimpleNamespace test doubles) and exception-swallowed at debug level. - Config parse follows the compression.max_attempts hardened semantics: booleans rejected, fractional floats rejected, integral floats/numeric strings accepted; negative trigger = disabled. - cli-config.yaml.example documented (all three keys) and gateway _CACHE_BUSTING_CONFIG_KEYS extended so hot-reload rebuilds the agent. - Tests: min-reclaim gate both directions, input-object no-op contract, no-orphan tool_call_id pairing in BOTH directions (#69830 pin rule), default-off zero-behavior-change pin, config parse seam, and behavioral loop-wiring tests (consulted/commit/no-op/absent-method/raising).	2026-07-23 16:44:12 -07:00
izumi0uu	17a81ac89e	fix(context_compression): roll back interrupted preflight state pollution Interrupted turns can seed a speculative display token count before the provider receives the request. Restore that display-only seed when interruption wins the race, while preserving completed post-compaction state and treating a successful provider response independently of optional usage metadata. Constraint: #54776 remains reproducible on current main, while review #4702305384 identifies anti-thrashing rollback as stale and usage receipt as an unreliable response-completion signal. Rejected: Restore anti-thrashing counters from a preflight snapshot | current main derives their verdict from real provider usage after a completed compaction boundary. Confidence: high Scope-risk: narrow Directive: Keep interrupted preflight rollback display-only, and never infer provider completion from the presence of usage metadata. Tested: ./.venv/bin/python -m pytest -q tests/run_agent/test_413_compression.py (29 passed); turn-finalizer/conversation-loop tests (31 passed); context-compressor targeted tests (12 passed); infinite-compaction targeted tests (3 passed); ruff; git diff --check. Not-tested: End-to-end interactive interrupt through CLI or gateway transport.	2026-07-23 16:38:06 -07:00
Teknium	eebc2286fc	fix(gateway): retry-next-message semantics for compression_deferred + regression suite Gateway half of the #49874 salvage: pass compression_deferred through both _run_agent_inner result dicts and guard the compression-exhausted auto-reset block with it — a lock-contended defer keeps the session intact (the concurrent compressor is actively shrinking it) instead of wiping it via reset_session. Regression tests: - tests/run_agent/test_compression_lock_defer.py — provider-mock 413 and 400-overflow turns whose compression pass lost the lock end as compression_deferred (failed=False, no compression_exhausted); flag unset keeps the terminal exhaustion path byte-identical; type-pin tests vs MagicMock agents and junk flag values; cap=1 e2e proving the refunded pre-API defer leaves the budget for the provider-proven 413 retry. - tests/agent/test_preflight_lock_defer.py — a lock-skipped preflight pass stops the loop WITHOUT arming preflight_compression_blocked; plain no-op still arms it; MagicMock junk does not defer. - tests/gateway/test_compression_deferred_soft_result.py — AST pin that the deferred branch guards the auto-reset chain and performs no session mutation (mirrors test_35809_auto_reset_clean_context.py).	2026-07-23 16:23:57 -07:00
ethernet	a4bc1ca502	fix(timeline): persist typed display events (#69771) * fix(desktop): hide persisted agent-only history scaffolding Filter verification-stop nudges and context-compaction handoffs at the stored-history mapper boundary. Preserve a real reply when a compaction handoff shares its stored message. * test(desktop): build persisted E2E sessions through the real agent Drive tui_gateway.entry over its stdio JSON-RPC transport against the mock provider, wait for real completion events, and persist normal session history through AIAgent and SessionDB. Migrate resume and hidden-history coverage, including real compression and live verify-on-stop scaffolding, then remove the unused direct SessionDB import scripts. * fix(desktop): use the provisioned Python for real-session E2Es Run the stdio gateway through uv's synced project environment outside the Nix dev shell, while retaining the fully provisioned Nix Python when the shell advertises HERMES_PYTHON_SRC_ROOT. * fix(nix): expose the provisioned Python environment to uv Mark the Nix-built Python environment active in the dev shell so the shared E2E session builder can always run through `uv run --active --no-sync`. * fix(timeline): persist typed display events * fix(timeline): strip display-only fields from provider payloads, preserve through rewrites, fix /resume display history Three review findings from PR #69771: 1. Provider payload leak: display_kind and display_metadata were forwarded to the provider API as unknown message fields. Strict OpenAI-compatible backends can reject the next request after a model switch or resumed typed event. Strip both from the per-request api_msg copy in conversation_loop alongside the existing api_content pop. 2. Rewrite/import data loss: _insert_message_rows preserved display_kind but silently dropped display_metadata. After replace_messages, archive_and_compact, or session import, async-delegation completion events lost their task counts and fell back to generic display text. Add display_metadata to the INSERT columns and bind tuple. 3. CLI /resume stale recap: startup --resume A set _resume_display_history from A's lineage. A subsequent in-session /resume B loaded B only into conversation_history via get_messages_as_conversation, leaving the stale A display projection. _display_resumed_history preferentially read the stale attribute, showing A's recap for B. Switch /resume to get_resume_conversations and update _resume_display_history alongside conversation_history. Tests: 890 Python (5 files), 35 desktop TS — all green. * feat(tui): render typed display events as ◈ markers in the Ink TUI The TUI was not handling display_kind at all — model switch markers and async delegation completions rendered as opaque user messages with the full [System: ...] text, and hidden compaction handoffs were visible. Wire display_kind through the full TUI chain: - _history_to_messages (tui_gateway/server.py) forwards display_kind and display_metadata to the gateway transcript payload. - GatewayTranscriptMessage (gatewayTypes.ts) gains both fields. - Msg.kind (types.ts) gains 'event' value. - toTranscriptMessages (domain/messages.ts) maps: - hidden → skip entirely - model_switch → event "model changed" - async_delegation_complete → event "N background agents finished" (or "background agent work finished" without metadata) - messageGroup (blockLayout.ts) routes event to its own group, with SELF_SPACED + PAINTS_TRAILING_GAP so it owns its margins. - messageLine.tsx renders event-kind as a dim ◈ marker with no gutter, matching the CLI's ◈ event rendering. - 4 new TUI tests for hidden/model_switch/async_delegation mapping. TUI typecheck: clean. TUI lint: 0 errors (2 pre-existing warnings). TUI tests: 9 passed (1 pre-existing failure on main, unrelated).	2026-07-23 14:46:24 -04:00
Teknium	129b9f9d33	test: make interrupt-pool double's entries callable Follow-up to the #58738 salvage: the pre-exhausted check now enumerates pool.entries() to find the failing key, so the MagicMock pool double must expose entries as a callable, not a bare list.	2026-07-23 07:33:25 -07:00
schattenan	a9613d2e57	fix(credential-pool): refresh the failing entry, not current(), on auth recovery Review follow-up: the auth path called pool.try_refresh_current() before the hinted rotation, so a stale current() pointer could force-refresh a different, healthy entry — consuming its single-use refresh token, or (for non-OAuth entries, where a forced refresh marks the entry exhausted outright) killing it entirely before api_key_hint was ever consulted. Use try_refresh_matching(api_key_hint=...) to resolve and refresh the entry that supplied the failing key under the pool lock, falling back to the previous behavior when no key is known. Adds a regression test with current() deliberately pointed at the healthy entry: on the old code the healthy entry is exhausted by the forced refresh and the pool ends up fully offline; with the fix the failing entry is exhausted and recovery rotates to the healthy one. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-23 07:33:25 -07:00
schattenan	795bf4a9e6	fix(credential-pool): attribute failures to the key that failed, not the shared current() pointer recover_with_credential_pool identified "which credential failed" via pool.current(), a shared mutable pointer that is advanced by every select() (round-robin rotation, concurrent turns, and other processes reloading the pool reset it to None). By the time recovery ran, it routinely pointed at a different, healthy entry — mark_exhausted_and_rotate then stamped the failing request's error message and reset time onto that innocent entry. With round_robin and one hard-capped key this deterministically exhausted the healthy key too and took the entire pool offline ("no available entries") from a single rate-limited credential. mark_exhausted_and_rotate already supports api_key_hint for exactly this (the auxiliary-client path passes it); the main conversation-loop path never did. Pass agent.api_key — kept in sync with the entry in use by _swap_credential — as the hint on all four rotation call sites, and make the "already exhausted → rotate immediately" pre-check look up the failing entry by key with the same fallback to current(). Adds regression tests that fail on the old attribution logic: a fresh pool (current() is None) failing on key B must mark entry B, never entry A. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-23 07:33:25 -07:00
Teknium	49a8c61cac	fix(context-engine): route pre-API and idle compaction status through the quiet-engine resolver Follow-up for the salvaged #35191: the mid-turn pre-API pressure emit in conversation_loop.py and the idle-resume emit in turn_context.py were not routed through automatic_compaction_status_message, so an engine with emit_automatic_compaction_status=False still leaked those lines. Both now resolve through the hook (phases "pre_api" and "idle") while keeping the #69550 template constants as the default wording. Suppression also skips the #69546 structured 'compacted' terminal edge for compress-phase events that opened no visible phase; failure warnings (_emit_warning) remain never suppressible, pinned by test.	2026-07-23 07:26:15 -07:00
Stephen Schoettler	d81a3dbfbb	fix(context-engine): adapt quiet compaction status to turn-context refactor	2026-07-23 07:26:15 -07:00
Stephen Schoettler	28dced2440	test: isolate quiet compaction status assertions	2026-07-23 07:26:15 -07:00
Stephen Schoettler	4035d70bbe	fix(context-engine): honor quiet compaction status	2026-07-23 07:26:15 -07:00
Teknium	4fbfb26704	test(compression): audit abort paths for per-attempt in-place state reset Follow-up for salvaged #58629: thread the previous flush baseline through the idle-compaction caller in turn_context.py (the one caller the original PR predates), and add regression coverage that every early-return path in compress_context (breaker skip, no-progress, plus the completed-rotation boundary) resets the per-attempt in-place outcome so a stale _last_compaction_in_place from an earlier successful in-place compaction can never baseline unflushed turns as persisted.	2026-07-23 07:25:21 -07:00
Brett Bonner	79a83830ba	test(compression): verify abort persistence after restart	2026-07-23 07:25:21 -07:00
Brett Bonner	03c96b7ab5	test(compression): retain durable persistence regression	2026-07-23 07:25:21 -07:00
Brett Bonner	17b3a4bd41	fix(compression): preserve flush baseline after abort	2026-07-23 07:25:21 -07:00
Brandon Zarnitz	d1c0c33a8b	fix(run_agent): call should_compress_preflight() for sub-threshold engines (#20316) Context engines that override ``should_compress_preflight()`` (e.g. the hermes-lcm plugin's incremental leaf-chunk compaction) never had their hook fired by ``run_conversation`` because the preflight block exited early once the hardcoded ``>= threshold_tokens`` check failed. As a result, ``LCM_DEFERRED_MAINTENANCE_ENABLED=1`` and friends were inert and accumulated raw_backlog debt indefinitely. Add an ``elif`` branch that delegates to the engine's preflight hook when the legacy threshold check does not fire. The default ``ContextEngine.should_compress_preflight()`` returns ``False`` so the built-in ``ContextCompressor`` is unaffected; engines opting in get a chance to ingest messages and request a single ``compress()`` pass for deferred maintenance. Exceptions are swallowed at debug level so a buggy engine cannot break an otherwise-healthy turn. Closes #20316	2026-07-23 07:25:08 -07:00
Fangliquan	9220c0c0bb	fix(agent): close tool-result tails on invalid-tool and truncated-tool early returns Invalid-tool exhaustion and truncated-tool early returns skipped finalize_turn, leaving role=tool transcripts that become tool→user on the next turn for strict providers. Call close_interrupted_tool_sequence before persist on those paths (same as interrupt aborts).	2026-07-23 17:38:05 +05:30
Teknium	ad8c06047d	test: cover api_key_hint in strict pool doubles + real-pool routing regression Follow-up to the #43755 salvage: - Update the strict _Pool doubles in tests/run_agent/test_run_agent.py to accept api_key_hint and assert it carries the agent's failed key. - Add a real-CredentialPool regression (no mocks) proving the hint routes exhaustion to the entry whose key actually failed, not pool.current(), plus the no-hint baseline (#43747 wrong-entry marking).	2026-07-22 20:54:29 -07:00
Yingliang Zhang	8c745314b9	fix(desktop): synchronize context usage and compaction status	2026-07-22 16:58:58 -07:00
Brooklyn Nicholson	cbf5b05c70	feat(agent): add active-turn redirect core primitive A follow-up sent while the model is still generating previously ended the turn: Hermes kept only the visible partial text (reasoning was display-only), cleared the loop, and replayed the message as a fresh next turn. If the correction referred to something that only appeared in the thinking stream, the model no longer had that context. Add `AIAgent.redirect(text)`: a corrective interrupt distinct from a hard stop. It cancels only the in-flight model request (not tool workers or child agents), stashes the correction under a lock shared with `interrupt()` so a concurrent `/stop` always wins, and lets the loop rebuild the same logical iteration. `_apply_active_turn_redirect()` checkpoints the reasoning that was actually shown to the user plus any visible partial text as an ordinary assistant message, then appends the correction as a real user turn — never replaying incomplete signed/encrypted provider reasoning, and keeping strict role alternation and prompt-cache stability intact. During tool execution it degrades to `steer()` so a running tool finishes at a safe boundary. `_fire_reasoning_delta` now only records reasoning that a display callback actually consumed, so `show_reasoning: false` never leaks hidden provider thinking into the persisted transcript.	2026-07-22 12:02:40 -05:00
Teknium	f944e84858	fix: close review gaps for per-model threshold overrides (#63020) Follow-up to the salvaged contributor commit, closing the three gaps flagged in the sweeper review: 1. Init ordering: assign compression.model_thresholds to a selected plugin context engine BEFORE the initial update_model() call in agent_init.py, so the initial model's override applies from init (previously it only took effect after the first /model switch). Base-class ContextEngine.update_model() now snapshots the pre-override percent once so repeated switches fall back to the engine's configured threshold, not a previous model's override. 2. DEFAULT_CONFIG: add compression.model_thresholds (empty map) to hermes_cli/config.py — additive key, no _config_version bump. 3. Docs: document the key in website/docs/developer-guide/context-compression-and-caching.md (yaml example, parameter table, dedicated section) and update the plugin-boundary note in context-engine-plugin.md to state the explicit context-engine contract for model_thresholds. Adds tests/run_agent/test_per_model_threshold_init_ordering.py: plugin-engine AIAgent init regression (override applies at init, empty map unchanged), DEFAULT_CONFIG key presence, floor interaction on the model-switch path (override below the small-context floor is raised to the floor; above the floor wins), and base-class config snapshot across repeated switches. Also maps @bennybuoy in contributors/emails/.	2026-07-22 07:00:27 -07:00
Ben Kamholtz	5f2fdf66bf	feat: per-model compression threshold overrides (v2, rebased on main) Addresses teknium1 review feedback on PR #60781: 1. Gateway cache invalidation: added ('compression', 'model_thresholds') to _CACHE_BUSTING_CONFIG_KEYS so a live config edit to the map invalidates the cached compressor (previously kept stale thresholds). 2. Integrated resolver with small-context floor: per-model overrides are resolved FIRST, then the existing 75% floor for <512K models is applied on top. The floor is no longer replaced — it stacks. An override below 75% on a small-context model still gets floored to 75% (raise-only); an override above 75% wins. 3. Clean rebase on upstream main — no unrelated deletions or anti-thrashing changes. Only the per-model threshold feature is added. Changes: - resolve_model_threshold() module-level helper (longest substring match) - ContextCompressor.__init__ accepts model_thresholds dict - _base_threshold_percent stores the per-model resolved value - _config_threshold_percent stores the raw config value (fallback base) - update_model() re-resolves on /model switch, falls back to config value - ContextEngine base class update_model() applies overrides for plugin engines - agent_init.py reads compression.model_thresholds from config, passes to ctor - gateway/run.py cache busting key added - cli-config.yaml.example documents the feature - 17 tests covering resolve helper, compressor init (large/small context, override above/below floor), update_model (re-resolve, fallback), base class Co-authored-by: Copilot <copilot@github.com>	2026-07-22 07:00:27 -07:00
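The resolution order described in 5f2fdf66bf (override resolved first, then the raise-only small-context floor applied on top) can be sketched roughly as below. This is an illustration, not the repository's code: only the name `resolve_model_threshold` and the 75% / 512K figures come from the commit message. The signatures, the case-sensitive matching, the reading of "512K" as 512,000 tokens, and the helper `effective_threshold` are assumptions.

```python
from typing import Mapping, Optional

SMALL_CONTEXT_LIMIT = 512_000  # assumption: "<512K" read as 512,000 tokens
SMALL_CONTEXT_FLOOR = 0.75     # the 75% floor named in the commit message


def resolve_model_threshold(model: str, overrides: Mapping[str, float]) -> Optional[float]:
    """Return the override whose key is the longest substring of `model`, if any."""
    # Assumption: matching is case-sensitive; the real helper may normalize names.
    matches = [key for key in overrides if key and key in model]
    if not matches:
        return None
    return overrides[max(matches, key=len)]


def effective_threshold(model: str, context_length: int,
                        config_threshold: float,
                        overrides: Mapping[str, float]) -> float:
    """Hypothetical helper: resolve the override first, then stack the floor."""
    override = resolve_model_threshold(model, overrides)
    base = config_threshold if override is None else override
    if context_length < SMALL_CONTEXT_LIMIT:
        # Raise-only: the floor never lowers a value that is already above it.
        return max(base, SMALL_CONTEXT_FLOOR)
    return base


overrides = {"claude": 0.6, "claude-opus": 0.9}
print(effective_threshold("anthropic/claude-sonnet", 200_000, 0.5, overrides))
print(effective_threshold("anthropic/claude-opus-4", 200_000, 0.5, overrides))
```

With these illustrative overrides, the first call is floored because 0.6 is below 75% on a small window, while the second keeps 0.9 because an override above the floor wins.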
Sora-bluesky	19a59f7d7b	fix(compression): mirror the full trigger recomputation in the suggestion guard Review follow-up on #67431 (hermes-sweeper): - The viability check compared the floored percentage against the raw context window, but the built-in trigger recomputation also applies the output-token reservation, the 64K floor, and the degenerate-window guard (_compute_threshold_tokens). Mirror that math exactly, so e.g. a 200K window with max_tokens=120K recomputes to max(0.75*80K, 64K)=64K and the suggestion is correctly KEPT for an 80K aux model instead of being suppressed by the raw-window percentage. - Gate the built-in policy behind isinstance(ContextCompressor): external context engines own compaction policy (#44439), so plugin engines keep the plain suggestion untouched. - The non-viable explanation now names the recomputed trigger instead of hardcoding the 75%/512K wording, so it stays accurate when the reservation (not the percentage floor) is what makes the value unreachable. Tests: reservation-viability regression and plugin-engine passthrough, per the review; the floored-branch assertion updated to the recomputed number. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-22 06:58:58 -07:00
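The worked example in 19a59f7d7b (200K window, max_tokens=120K, 80K auxiliary model) can be reproduced with a minimal sketch. The function names and signatures here are hypothetical, "K" is read as 1,000 tokens, and the degenerate-window guard mentioned in the commit is left out because its behavior is not specified there.

```python
def compute_threshold_tokens(context_window: int, max_output_tokens: int,
                             percent: float, floor_tokens: int = 64_000) -> int:
    """Recompute the compression trigger in tokens (illustrative only)."""
    # Reserve the output budget first, then apply the percentage to what remains.
    usable = context_window - max_output_tokens
    # The 64K floor raises a trigger that would otherwise be too small.
    return max(int(percent * usable), floor_tokens)


def suggestion_is_viable(trigger_tokens: int, aux_context: int) -> bool:
    """Assumption: a threshold suggestion is worth keeping if the trigger fits the aux model."""
    return trigger_tokens <= aux_context


trigger = compute_threshold_tokens(200_000, 120_000, 0.75)
print(trigger)                                 # 64000
print(suggestion_is_viable(trigger, 80_000))   # True
```

Comparing the raw-window percentage instead (0.75 * 200K = 150K) would exceed the 80K auxiliary context and suppress the suggestion, which is the mismatch the commit describes.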
Sora-bluesky	dbc71fb6e4	fix(compression): don't recommend a threshold the small-context floor will ignore The auxiliary-compression feasibility warning computes its compression.threshold suggestion as aux_context / main_context, independently of ContextCompressor._effective_threshold_percent()'s raise-only small-context floor. For main windows under 512K the floor raises any configured value below 75% back up, so a suggestion like 'threshold: 0.40' is silently ignored and the same warning returns every session. Derive the suggestion's viability through the compressor's own floor logic: offer the 'lower the threshold' option only when the floored value still fits the auxiliary model's context; otherwise recommend only a larger compression model and explain the floor, so the guidance is always actionable. Tests: the updated auto-correct test pins the floored branch (no threshold suggestion, floor explained); two new tests pin the surviving suggestion at/above the floor on a small window and below 75% on a 512K+ window where no floor applies. The updated test fails against the previous code. Fixes #67422 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-22 06:58:58 -07:00
3ASiC	d46f0fb2d5	fix(compression): notify context engine after commit	2026-07-22 06:58:16 -07:00
Jakub Wolniewicz	4c2e34f07d	fix(moa): measure advisor guidance before compression	2026-07-22 06:57:54 -07:00
WeiYusc	928bcdde24	fix(compression): refresh gateway activity during compaction Refresh the agent activity tracker while context compression is blocked in the auxiliary summarizer so gateway watchdogs do not report inactivity during long compactions. Add regression coverage for successful heartbeats, exception cleanup, touch failures, and strict-signature compressor fallback. (cherry picked from commit c09e58b7709cc60c5b454701f0ecf840e759222f)	2026-07-22 06:57:33 -07:00
Teknium	1c2faedd88	fix(compression): unify the attempt cap across every compression site Follow-up to the salvaged #64010 (Kenmege) and #63870 (dombejar) commits, making one resolved compression.max_attempts cap govern ALL per-turn compression attempt sites: - conversation_loop: resolve max_compression_attempts ONCE at turn start (it was previously re-resolved inside the API-call loop) and route the pre-API pressure gate through it — that gate still hardcoded 'compression_attempts < 3' and logged 'attempt=%s/3'. - conversation_loop: the salvaged post-tool compaction gate now uses the resolved cap instead of a hardcoded 3. - turn_context: the preflight compaction loop was 'for _pass in range(3)'; it now sizes itself from the same resolved cap. - agent_init: harden the max_attempts parser — reject booleans (bool subclasses int; 'true' would coerce to 1), reject fractional floats instead of truncating them, keep accepting integral floats and numeric strings; anything else falls back to 3 (floor 1, ceiling 10 unchanged). - tests: replace #63870's inspect.getsource source-shape test with behavioral loop tests (post-tool compaction fires <= cap times per turn, shares its budget with the pre-API gate, resets between turns); add an e2e test proving a 4th preflight pass runs at config cap=6 while the unset default still stops at 3; extend the #64010 config tests with the bool/float parser semantics. Salvages #64010 by @Kenmege and #63870 by @dombejar.	2026-07-22 06:56:42 -07:00
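The hardened `max_attempts` parsing rules listed in 1c2faedd88 can be sketched as follows. The function name `parse_max_attempts` and its signature are hypothetical. The rules for booleans, fractional floats, integral floats, the fallback of 3, and the 1..10 clamp come from the commit message. Treating numeric strings as plain base-10 integers (so `"2.0"` falls back to the default) is an assumption.

```python
def parse_max_attempts(raw, default: int = 3, floor: int = 1, ceiling: int = 10) -> int:
    """Parse a compression.max_attempts config value (illustrative sketch)."""
    # bool is a subclass of int, so it must be rejected before the int check.
    if isinstance(raw, bool):
        return default
    if isinstance(raw, int):
        value = raw
    elif isinstance(raw, float):
        # Reject fractional floats (and nan/inf) instead of truncating them.
        if not raw.is_integer():
            return default
        value = int(raw)
    elif isinstance(raw, str):
        try:
            value = int(raw.strip())
        except ValueError:
            return default
    else:
        return default
    # Clamp into the documented range: floor 1, ceiling 10.
    return max(floor, min(ceiling, value))


for candidate in (True, 6, 6.0, 2.5, "4", "true", None, 0, 99):
    print(repr(candidate), "->", parse_max_attempts(candidate))
```

Checking `bool` before `int` is the key ordering detail: without it, `True` would pass the integer branch and become a cap of 1.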
Dom Bejar	fca883f06f	fix(compaction): cap post-tool attempts per turn	2026-07-22 06:56:42 -07:00
Teknium	4c64ff3aa0	feat(nous): send top-level session_id for provider sticky routing (#69253) * feat(nous): send top-level session_id for provider sticky routing The Nous Portal profile only embedded the session id inside portal tags, so Claude traffic through the portal had no sticky-routing key. Multi-turn sessions could reroute between upstream endpoints (Anthropic/Vertex/ Bedrock), cold-writing a fresh prompt cache on every reroute since each provider's cache is instance-local. Mirror the OpenRouter profile: emit extra_body.session_id whenever the agent has one, pinning every turn of a session to the same endpoint so explicit cache_control breakpoints stay warm. * test: expect top-level session_id in Nous max-iterations summary body Sibling site of the profile change — the max-iterations summary path builds its request through the same NousProfile.build_extra_body(), so its exact-shape assertion now includes the sticky-routing session_id when the agent has a session.	2026-07-22 04:39:20 -07:00
kshitijk4poor	9fa2906c18	fix: restore base_url rstrip, extract should_clear_context_pin helper Salvage follow-up for PR #68899: - Restore .rstrip('/') on base_url in _swap_credential (both anthropic and OpenAI paths) to match every other assignment site. The route identity comparison still uses normalize_route_base_url which handles trailing slash correctly. - Extract should_clear_context_pin() into hermes_cli/route_identity.py, consolidating 7 copy-pasted call sites across cli.py, gateway/run.py, gateway/slash_commands.py, and hermes_cli/model_switch.py into a single fail-closed helper. C1 (anthropic path TLS re-application): pre-existing gap — the Anthropic adapter (build_anthropic_client) has no TLS customization support at all, so this is out of scope for this salvage.	2026-07-22 11:19:37 +05:30

682 commits