hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-24 16:54:43 +00:00

Author	SHA1	Message	Date
flyingdoubleg	e9a7c18890	fix(memory): honor disabled toolsets for provider tools	2026-07-24 13:00:53 +05:30
Teknium	13fe08d7e9	fix(context-engine): short-circuit the inherited no-op select_context before any per-request work Verification follow-up for the #51226 salvage: the host call site guarded select_context with hasattr(), but the ABC defines a default on every engine, so the built-in ContextCompressor (and any non-implementing engine) still paid per-request shallow copies of the conversation history plus a hook call on every provider request. Identity-check the bound method against ContextEngine.select_context and return the request untouched — mirroring the existing base-method short-circuit in _notify_context_engine_turn_complete — so the default path does zero work, not just produces an identical result. Adds two pins: the base no-op is never invoked (patched-to-raise base stays silent), and ContextCompressor.__dict__ contains neither new verb. Also registers the contributor email mapping for @chaos-xxl.	2026-07-23 19:44:35 -07:00
xue xinglong	56e00f4ca1	docs+test(context-engine): sync public guide coverage note; pin finalization-seam observation contract - website guide: on_turn_complete() now carries the same best-effort coverage caveat as the ABC docstring (fires from the finalization seam; abnormal early-return paths bypass it) — removes the doc/code inconsistency. - test: finalization seam emits on_turn_complete with usage=None + the interrupted flag for an interrupted finalized turn. Docstring records that the negative early-return-bypass half is best-effort and deferred to a shared-seam follow-up rather than pinned via a full run_conversation harness.	2026-07-23 19:44:35 -07:00
xue xinglong	5f65f0b0f8	fix(context-engine): snapshot select_context read-only inputs; scope on_turn_complete coverage doc Addresses the hermes-sweeper review on #51226: - _apply_context_engine_selection now passes shallow copies of the read-only conversation_messages / incoming_message to the hook, so an engine mutating them in place cannot corrupt persisted transcript state (enforces the request-only contract, not just documents it). Adds a mutation-regression test asserting persisted history + incoming message are untouched. - on_turn_complete docstring: scope the coverage claim to the standard finalization seam. Some abnormal early-return paths (content-policy block, provider terminal failure) currently persist+return without finalization and don't emit the hook; documented as best-effort with a shared-seam follow-up, rather than over-promising a guaranteed callback for every early exit.	2026-07-23 19:44:35 -07:00
xue xinglong	915942935d	fix(context-engine): fail open on empty select_context() result + doc public hooks - _apply_context_engine_selection: reject an empty list. all([]) is True, so a [] returned by a failing/buggy engine previously replaced a valid request with an empty message list the downstream sanitizers can't restore; now it falls open to the unmodified request (honors the fail-open contract). Thanks @johnnykor82 for catching this on #41918's review. - test: empty list keeps the original request (fail-open regression). - docs: document select_context()/on_turn_complete() in the public context-engine plugin guide (were still describing only the old contract).	2026-07-23 19:44:35 -07:00
xue xinglong	71220cdf5b	docs+test(context-engine): document select_context ordering/cache contract; add cache-stability + downstream-sanitizer tests - context_engine.py: document that select_context() runs before cache-control and all request sanitizers, so (a) replacements still pass host validation and (b) the no-op default keeps the request byte-stable (AGENTS.md prompt-cache invariant). Note the hook is evaluated per provider request. - tests: no-op path is byte-stable for cache-control; a role-unusual replacement is passed through for the existing downstream sanitizers to normalize (select_context does structural validation only).	2026-07-23 19:44:35 -07:00
xue xinglong	589cbafb87	feat(context-engine): forward real usage to on_turn_complete() The on_turn_complete() observation hook is the engine's post-turn signal, so it should receive the completed turn's canonical token usage when the host has it, not a hardcoded None. Per @johnnykor82's #41918 contract: the engine uses prompt/completion + cache_read/write/reasoning buckets to judge how large/expensive the selected context was before the next select_context(). - conversation_loop.py: stash the most recent provider response's usage_dict (the same canonical shape fed to update_from_response) on the agent as _last_turn_usage; reset to None at turn start so turns that never reach a provider response (early failure / interrupt) forward None, not a stale prior turn's usage. - turn_finalizer.py: forward agent._last_turn_usage instead of usage=None. - context_engine.py: document the usage param contract on the ABC hook. - tests: cover both ends through the real finalize_turn path — completed turn forwards the full canonical bucket set intact; no-response turn forwards None. Co-authored-by: johnnykor82 <johnnykor82@users.noreply.github.com>	2026-07-23 19:44:35 -07:00
xue xinglong	bb9ef9d72c	feat(context-engine): add on_turn_complete() observation hook Adds the post-turn observation verb as the companion to select_context(): an optional, no-op-default on_turn_complete() called once after the assistant/tool loop finishes, with the finalized transcript snapshot. Lets an engine ingest/index/summarize the completed turn to inform the next select_context(). Wired via _notify_context_engine_turn_complete() from turn_finalizer.finalize_turn(); fail-open, base no-op short-circuited so non-implementing engines (incl. the built-in compressor) pay nothing. This is the request-assembly + observation pair from #41918; with this commit the PR fully subsumes #41918's two hooks (prepare_request_messages -> select_context, on_turn_complete) rather than only the selection half. Co-authored-by: johnnykor82 <johnnykor82@users.noreply.github.com>	2026-07-23 19:44:35 -07:00
xue xinglong	dec464c351	feat(context-engine): add select_context() per-turn selection hook Adds an optional, no-op-default select_context() hook to the ContextEngine ABC, called every turn after the request messages are assembled and before provider dispatch — independent of should_compress(). Lets an engine select or replace which context enters the prompt for a single request (retrieval, topic routing, role/branch switching) without mutating persisted history, removing the need to abuse should_compress()=True as a per-turn callback. The host call site (_apply_context_engine_selection) is fail-open: a missing hook, an exception, or an invalid return value leaves the assembled request untouched. Additive and non-breaking: the built-in compressor and every existing engine are unaffected. Consolidates the per-turn request-assembly surface proposed across #41918, Related: #36765 #41918 #24949 #47109 #50053 #23837 #25115 #29370	2026-07-23 19:44:35 -07:00
Teknium	3d693ae034	chore(moa): add trailing newline to reference-prompt test file Follow-up for salvaged #61454.	2026-07-23 18:40:09 -07:00
liuhao1024	6afbb33af1	fix(moa): add explicit warnings to reference prompt against claiming tool execution	2026-07-23 18:40:09 -07:00
sgtworkman	3dfe712384	fix(moa): scope quiet relay to machine-readable CLI Keep MoA reference display events off the machine-readable -Q stdout surface (platform=cli with tool_progress_mode=off) while preserving them everywhere else. Extracts the relay into module-level helpers so the policy is testable. Salvaged from #67334.	2026-07-23 18:40:09 -07:00
SquabbyZ	89e6f4c989	feat(agent): add MOA progress indicator (#59546) Adds per-reference progress events and a phase-transition marker to the MoA display pipeline so TUI / CLI / desktop surfaces can render a status bar like `MOA: 2/3 refs done` and surface which phase (reference vs aggregator) is currently active. - `moa.progress`: fired once per reference completion with `refs_done`, `refs_total`, and the source label - `moa.phase`: fired on phase transitions (currently the single `phase="aggregator"` transition once the fan-out finishes) Plumbed through the existing `reference_callback` → `tool_progress_callback` → gateway path; no new UI surface. The legacy `moa.reference` / `moa.aggregating` events are unchanged for backwards compatibility. AI-assisted fix by https://github.com/SquabbyZ/peaks-loop	2026-07-23 18:11:57 -07:00
Idris Almalki	8d119832b4	fix(gemini): emit thoughtSignature sentinel for cross-provider tool_calls in native adapter When Hermes fails over from a non-Gemini provider (xAI, Anthropic, etc.) to Gemini mid-conversation, the existing assistant tool_calls in history carry no Gemini ``extra_content.google.thought_signature`` (the originating provider never emits one). The native adapter's ``_translate_tool_call_to_gemini`` omitted ``thoughtSignature`` entirely in that case, so Gemini 3 thinking models rejected every replayed turn with:: HTTP 400 INVALID_ARGUMENT Function call is missing a thought_signature in functionCall parts. Additional data, function call default_api:<tool_name>, position N. The Cloud Code Assist sibling adapter already handles this exact case by emitting a sentinel ``"skip_thought_signature_validator"`` (see ``agent/gemini_cloudcode_adapter.py:106``, originally added in #11270 and documented as matching ``opencode-gemini-auth``'s approach). This change mirrors that fallback in the native adapter so the two paths behave identically when replaying cross-provider history. Verified live against ``generativelanguage.googleapis.com/v1beta`` with ``gemini-3-pro-preview``: synthetic 2-turn conversation with no real ``thoughtSignature`` returns 400 without the sentinel and 200 with it. Test added: ``test_build_native_request_emits_sentinel_for_cross_provider_tool_call``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-07-23 17:26:24 -07:00
Teknium	d43cc2ca80	fix(compress): gate N-user tail guarantee to actionable turns, behavior-preserving default Follow-up fixes on top of the salvaged #22566 mechanism: - N-collector now counts only REAL actionable user turns via _is_actionable_user_turn + _is_synthetic_compression_user_turn — the same filter pair _find_last_user_message_idx uses post-#69291. The contributor's bare role=='user' + _is_context_summary_content check let blank platform echoes and continuation/todo rows consume N slots, silently degrading the guarantee. - Default flipped 3 -> 1 (behavior-preserving): a default of 3 was measured to change the tail cut on transcripts whose budget covers only the last turn. min_tail_user_messages=1 delegates to the existing single-user anchor; N>1 is opt-in, and the call site is gated so the default path is byte-identical to main. - Hardened config parse in agent_init (bool rejected, fractional floats rejected, floor 1) matching the max_attempts parser shape. - Wired the recurring external-PR config gaps: hermes_cli/config.py DEFAULT_CONFIG + cli-config.yaml.example (PR only had cli.py). - Regression tests: blank echoes / synthetic rows don't count toward N; tool-call/result pairs never split by the N-boundary (no-orphan both directions); N-guarantee wins over tail_token_budget and the _MAX_TAIL_MESSAGE_FLOOR (floor is a minimum, not a cap); default parity pin; DEFAULT_CONFIG pin.	2026-07-23 17:03:49 -07:00
Jerry	a9c868225e	feat(compress): preserve recent N user messages during context compression Add _ensure_last_n_user_messages_in_tail to guarantee the last N user messages survive compression in the uncompressed tail, with surrounding assistant/tool context preserved. - Add min_tail_user_messages parameter (default 3) to ContextCompressor - New _ensure_last_n_user_messages_in_tail method generalizes single-user protection - Skip context-summary handoff banners when counting user messages - User messages are clean boundaries — skip _align_boundary_backward - Wire through cli.py, agent_init.py, and gateway cache busting keys Config: compression: min_tail_user_messages: 3 Co-Authored-By: Claude <noreply@anthropic.com>	2026-07-23 17:03:49 -07:00
Teknium	69365109b3	fix(compression): mark raw skill_view bodies summarized away, not only pre-pruned rows _collect_ghosted_skill_names() covers both ghost-skill shapes in the compressed middle window: rows already demoted to a [SKILL_PRUNED: ...] marker AND raw skill_view bodies (> _SKILL_VIEW_PRUNE_MIN_CHARS) that survived Phase-1 inside an earlier protected tail and then aged into the compression window — the summarizer paraphrases those instructions away too. Shared threshold constant between the emit site and the scan. Pinned by a live-probe-shaped test (real compress(), mocked aux LLM).	2026-07-23 16:58:06 -07:00
Teknium	28f73d32e9	test(compression): ghost-skill defense suite — marker round-trip, protected prune, real-compress survival 21 tests pinning the salvaged #44166 behavior: - marker emit + extractor round trip (patterns adapted from PR #32375 by @LeonSGP43, with credit) - no-duplicate re-injection when the canonical marker survived (the original PR's presence-check defect) - Phase-1 protection for just-loaded / user-referenced skills, and the Pass-4 pressure override that keeps #61932 fixed - deterministic marker survival through a REAL compress() with a mocked aux LLM: drop → re-injected, keep → not duplicated, static-fallback path, iterative re-compression via rehydrated handoff - markers never classify as handoff content (classify_summary_content / _strip_context_summary_handoff_message untouched) - SKILLS_GUIDANCE Skill Safety Rule renders with real newlines	2026-07-23 16:58:06 -07:00
srojk34	3ea35d6711	fix(vertex,moa): register vertex in PROVIDER_REGISTRY and HERMES_OVERLAYS The Vertex AI provider (added same-day, commit `c73e74386`) was never added to either of the two provider registries that agent/auxiliary_client.py and the MoA slot-resolution chain depend on, breaking Vertex outside the main conversation loop: 1. hermes_cli/auth.py::PROVIDER_REGISTRY had no "vertex" entry. The plugin-auto-extend loop that normally fills gaps explicitly skips non-api_key auth types (`if _pp.auth_type != "api_key": continue`), and Vertex was never hand-declared like "bedrock" is. Because resolve_provider_client() in agent/auxiliary_client.py gates everything on `pconfig = PROVIDER_REGISTRY.get(provider)` and returns (None, None) immediately when pconfig is None, its `elif pconfig.auth_type == "vertex"` branch was permanently dead code: every auxiliary Vertex call (vision, title generation, reflection, context compression, MoA reference/aggregator slots) failed outright, not just a MoA-specific edge case. 2. hermes_cli/providers.py::HERMES_OVERLAYS also had no "vertex" entry, so hermes_cli.providers.get_provider("vertex") returned None. This backs _preserve_provider_with_base_url() in agent/auxiliary_client.py, which a MoA slot's resolved (base_url, api_key) pair needs to keep its "vertex" identity instead of silently collapsing to "custom", losing the identity _refresh_provider_credentials() needs to re-mint an expired OAuth2 token (~1h lifetime) on a 401, and permanently breaking every subsequent call in that MoA preset for the rest of the session. Fix mirrors the existing "bedrock"/aws_sdk entries in both registries exactly, plus adds a "vertex" branch to _refresh_provider_credentials() (it had branches for openai-codex/nous/anthropic/xai-oauth but not vertex, so a 401 fell through to `return False` without evicting the stale cached client).
- hermes_cli/auth.py: hand-declared vertex ProviderConfig(auth_type="vertex") in PROVIDER_REGISTRY, matching bedrock's shape. - hermes_cli/providers.py: vertex HermesOverlay(auth_type="vertex") in HERMES_OVERLAYS + "Google Vertex AI" label override. - agent/auxiliary_client.py: vertex branch in _refresh_provider_credentials that re-mints the token via get_vertex_config() and evicts the stale cached client. - 8 new regression tests across tests/hermes_cli/test_vertex_provider.py and tests/agent/test_auxiliary_client.py: registry membership, end-to-end resolve_provider_client("vertex", ...) building a working client (proving the previously-dead branch is now reachable), and the 401-refresh/cache-eviction path.	2026-07-23 16:55:41 -07:00
Teknium	b7a05b6b6f	fix: re-anchor summary-input bound to current main + bound iterative path Follow-ups on top of the cherry-picked #27748 mechanism: - move the cap constant to module level with full rationale comment (class attribute aliases it so subclasses/tests can override) - bound the iterative-update path too: the PREVIOUS SUMMARY block is passed through _bound_summary_input so a pathological rehydrated handoff cannot blow up the prompt (previous summary + new turns each capped) - extra regression tests: byte-identical small-input passthrough (identity), direct bound+marker unit check, bound-after-per-message-truncation shape (hundreds of under-_CONTENT_MAX turns), iterative path bounded, marker vs classify_summary_content non-collision - contributor email mapping for @robgfl45	2026-07-23 16:44:53 -07:00
Cluster2	80ece3867b	fix: bound compression summary input	2026-07-23 16:44:53 -07:00
Teknium	fa4800414c	feat(compression): prompt-cache reclaim gate + hardened wiring for proactive prune Follow-ups on top of the cherry-picked #62644 mechanism, porting it to current main and closing the salvage-review requirements: - proactive_prune_min_reclaim_tokens (default 4096): a prune only COMMITS when it reclaims a meaningful token batch, measured on the pruned output. A committed prune rewrites already-sent history and invalidates the provider prompt-cache prefix; this hysteresis gate keeps those breaks episodic/amortized (like a compression boundary) instead of firing every tool iteration. 0 disables the gate. (Design point credited to the #62389 review cycle's prune_minimum_tokens.) - Standard no-op caller contract: every skip path returns the INPUT list object; the loop commits only on 'result is not messages' + non-zero count. - Loop call is getattr+callable guarded (plugin engines predating the hook, SimpleNamespace test doubles) and exception-swallowed at debug level. - Config parse follows the compression.max_attempts hardened semantics: booleans rejected, fractional floats rejected, integral floats/numeric strings accepted; negative trigger = disabled. - cli-config.yaml.example documented (all three keys) and gateway _CACHE_BUSTING_CONFIG_KEYS extended so hot-reload rebuilds the agent. - Tests: min-reclaim gate both directions, input-object no-op contract, no-orphan tool_call_id pairing in BOTH directions (#69830 pin rule), default-off zero-behavior-change pin, config parse seam, and behavioral loop-wiring tests (consulted/commit/no-op/absent-method/raising).	2026-07-23 16:44:12 -07:00
Kolektori	cb481e2f2b	feat(compression): proactive tool-result pruning for large-window models The phase-1 tool-result prune only runs inside compress(), which fires near 50% of the context window, so it never triggers on large-window models; old tool outputs then ride in history and are re-sent every turn. Add prune_tool_results_only(): the same no-LLM prune on a separate, low proactive_prune_tokens trigger, run as an elif to the compression branch. Opt-in (default 0), protects the recent tail by message count. Add the method to the ContextEngine base as a no-op default so pluggable engines inherit it safely (the post-tool-call path never AttributeErrors on a non-built-in engine); the built-in compressor supplies the real prune. Register both keys under the top-level compression config with defaults and document them.	2026-07-23 16:44:12 -07:00
root	34678d2f2e	fix(compression): skip empty post-handoff summary windows	2026-07-23 16:27:06 -07:00
Teknium	eebc2286fc	fix(gateway): retry-next-message semantics for compression_deferred + regression suite Gateway half of the #49874 salvage: pass compression_deferred through both _run_agent_inner result dicts and guard the compression-exhausted auto-reset block with it — a lock-contended defer keeps the session intact (the concurrent compressor is actively shrinking it) instead of wiping it via reset_session. Regression tests: - tests/run_agent/test_compression_lock_defer.py — provider-mock 413 and 400-overflow turns whose compression pass lost the lock end as compression_deferred (failed=False, no compression_exhausted); flag unset keeps the terminal exhaustion path byte-identical; type-pin tests vs MagicMock agents and junk flag values; cap=1 e2e proving the refunded pre-API defer leaves the budget for the provider-proven 413 retry. - tests/agent/test_preflight_lock_defer.py — a lock-skipped preflight pass stops the loop WITHOUT arming preflight_compression_blocked; plain no-op still arms it; MagicMock junk does not defer. - tests/gateway/test_compression_deferred_soft_result.py — AST pin that the deferred branch guards the auto-reset chain and performs no session mutation (mirrors test_35809_auto_reset_clean_context.py).	2026-07-23 16:23:57 -07:00
Rain	bc7212cf93	feat(moa): per-reference-model max_tokens override MoA reference_max_tokens is preset-level — one cap for all reference models. When mixing a verbose model with a terse one, a single cap is either too tight for the terse model or too loose for the verbose one. Now each reference slot can optionally carry its own max_tokens: reference_models: - provider: openrouter model: deepseek/deepseek-v4-pro max_tokens: *** # per-slot cap, overrides preset-level - provider: openai-codex model: gpt-5.5 # no max_tokens → falls back to preset-level reference_max_tokens _clean_slot (moa_config.py) preserves an optional max_tokens field on the slot dict, coerced via _coerce_int_or_none. _run_reference (moa_loop.py) reads slot-level max_tokens first, falling back to the preset-level cap passed by the caller. Slots without the field are unaffected — backward compatible. Type hints on slot-handling functions updated from dict[str, str] to dict[str, Any] to reflect the now-heterogeneous slot shape.	2026-07-23 16:17:27 -07:00
aui	ead9d7b256	test: cover gemini-native max_tokens forwarding in _build_call_kwargs Requested in review: builder-level assertions that the gemini-native branch forwards max_tokens (provider names and the native generativelanguage.googleapis.com base_url, max_tokens=600), plus a control showing gemini models on OpenAI-compatible endpoints — including Gemini's own /openai compatibility endpoint — keep the existing omission behavior (#34530). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-23 16:17:27 -07:00
Janig88	3dce1b967f	fix(auxiliary): scope max_tokens to moa_reference only (not aggregator) Per review feedback from teknium1: reference_max_tokens is an advisors-only contract. The aggregator is the acting model and must not be capped by the reference budget. Changed _is_moa from startswith('moa_') to exact match on 'moa_reference'. Added regression test proving aggregator does NOT receive max_tokens.	2026-07-23 16:17:27 -07:00
Janig88	3616ce006a	fix: use auxiliary_max_tokens_param for Copilot GPT-5 compat Copilot review pointed out that hardcoding kwargs['max_tokens'] would 400 on models requiring max_completion_tokens (GPT-5 family, Copilot). The existing auxiliary_max_tokens_param() helper already selects the correct parameter name per model — use it instead of hardcoding. Test updated to parametrize expected_key so the Copilot gpt-5.5 case correctly asserts max_completion_tokens instead of max_tokens. Addresses Copilot review comments on both files.	2026-07-23 16:17:27 -07:00
Janig88	32a4faa2d5	fix(auxiliary): honor max_tokens for MoA reference/aggregator tasks PR #56756 added reference_max_tokens to cap MoA advisor output and cut turn latency. The value is correctly threaded through five layers of MoA code (moa_config → conversation_loop → aggregate_moa_context → _run_references_parallel → _run_reference → call_llm(task='moa_reference', max_tokens=800, ...)). However, _build_call_kwargs() in auxiliary_client.py silently drops max_tokens for all OpenAI-compatible providers (PR #34845, which fixed endpoints and NVIDIA NIM keep it. This means reference_max_tokens never reached the API for the vast majority of providers. The bug affects every OpenAI-compatible MoA reference/aggregator slot: Z.AI (coding plan), OpenRouter, OpenAI, GitHub Copilot, and local providers. Only Anthropic-compat endpoints (MiniMax, /anthropic URLs) worked — by coincidence, not MoA-aware design. Fix: thread the 'task' parameter through all six _build_call_kwargs() call sites. When task starts with 'moa_', max_tokens is always included in the request kwargs regardless of provider. Non-MoA auxiliary tasks (compression, titles, vision, etc.) keep PR #34845 behavior unchanged. Verified end-to-end: - Z.AI GLM-5.2 with max_tokens=50 → returned exactly 50 tokens - Z.AI GLM-5.2 with max_tokens=20 → returned exactly 20 tokens - Z.AI GLM-5.2 uncapped → returned 315 tokens - 7 new regression tests covering 4 providers, Anthropic wire, non-MoA tasks, and prefix-matching boundary - 288 auxiliary_client tests pass (was 281, +7 new), 84 MoA tests pass - Zero regressions	2026-07-23 16:17:27 -07:00
srojk34	cc1725cbe5	fix(moa): stop reference_max_tokens from also capping the aggregator aggregate_moa_context's single max_tokens parameter was applied to both the reference fan-out (_run_references_parallel) and the aggregator's own synthesis call_llm. #53580 explicitly removed a hardcoded cap from the aggregator call because it truncated long aggregator syntheses; #56756 (reference_max_tokens, added to speed up the advisor fan-out) reintroduced the same shared cap by passing it to both calls, silently regressing #53580's fix. Rename the parameter to reference_max_tokens (matching the caller's own moa_config key) and stop forwarding it to the aggregator's call_llm invocation, which now always runs uncapped as intended.	2026-07-23 16:17:27 -07:00
Prathamesh Chaudhari	4c9628eab5	fix(anthropic): coerce empty/whitespace-only text blocks on the request path (#69512) (#69517) An assistant message with an empty or whitespace-only text content block (produced by context compression or certain tool-call flows) is rejected by the Anthropic Messages API with HTTP 400 "text content blocks must contain non-whitespace text". Because the blank block is stored in session history and replayed verbatim every turn, the session is permanently wedged behind the same 400. The Bedrock adapter already guards this via _safe_text() (#9486); the native Anthropic path never got the same treatment. _sanitize_replay_block() rebuilt text blocks with the raw stored text, and the _convert_assistant_message() guard only caught a fully empty block list, not a list still containing a whitespace-only text block. Add a _safe_text() helper mirroring the Bedrock one and apply it at both points: the ordered-blocks replay path and a final in-place walk of the converted content list. Both are self-healing: sessions that stored blank blocks recover on the next API call. Only text blocks are coerced; thinking/tool_use/image blocks are untouched. Fixes #69512	2026-07-23 16:36:37 -04:00
Teknium	cd6fb2b167	fix(prompt): scope api_server MEDIA: hint to actual interception behavior Correction to the previous commit (PR #68402): the claim that api_server never intercepts MEDIA: tags is inaccurate on current main. _resolve_media_to_data_urls() (gateway/platforms/api_server.py) DOES inline image MEDIA: tags (<=5MB, image extensions only) as base64 data URLs on the four main endpoints (_handle_session_chat, _handle_session_chat_stream, _handle_chat_completions, _handle_responses). The real gaps elphamale's PR points at are narrower: - the /v1/runs output path (_handle_runs) never calls the resolver; - non-image filetypes are never resolved anywhere (_MEDIA_IMG_EXT is image-only). Reword the hint to teach both halves: images via MEDIA: work on the chat/completions/responses endpoints; non-image files and anything on the runs endpoint must fall back to plain file paths in the response text. Update the test to pin the scoped guidance instead of a blanket prohibition.	2026-07-23 11:55:01 -07:00
elphamale	08abc5eba8	fix(prompt): forbid MEDIA: tags in the api_server platform hint Every PLATFORM_HINTS entry for a messaging platform (Telegram, WhatsApp, Discord, Slack, Signal, WebUI, desktop) teaches the model the MEDIA:/path convention because an interception mechanism actually resolves it there (native attachment delivery, or a validated/inlined data URL). The cli entry, which has no such mechanism, explicitly tells the model NOT to use it and to state the path in plain text instead. The api_server entry had neither instruction. Its /v1/runs handler never routes the final response through any MEDIA: resolver (confirmed against source: none of the four call sites of the api_server module's media-tag resolver are inside its runs-endpoint handler), so a MEDIA:/path tag there renders as inert literal text in the API response — exposing a raw host filesystem path to the caller with no delivery ever taking place. Nothing platform-specific told the model not to use a convention it's correctly taught for several sibling platforms in this same dict, so the general cross-platform habit could surface here too, unlike cli where an explicit prohibition already closes the gap. Mirrors cli's prohibition, adapted for api_server's actual constraint: no "state the path in plain text" fallback, since a typical API caller has no filesystem access to the host at all. Points at "a registered file-delivery tool" generically rather than naming any specific tool, since api_server toolsets are deployment-defined.	2026-07-23 11:55:01 -07:00
joaomarcos	4dccfcd9b7	feat(bedrock): add Converse API prompt caching (cachePoint) Claude-on-Bedrock already gets prompt caching via the AnthropicBedrock SDK path. This adds it to the raw Converse API path used for non-Claude models (Amazon Nova, and Claude when bearer-token auth forces Converse routing, #28156) — a conservative model allowlist inserts cachePoint blocks after tools, system, and the message before the newest turn, and extracts cacheReadInputTokens/cacheWriteInputTokens into usage so cost accounting picks them up through the existing Anthropic-style fallback. Ref: relatorio-cache-performance-provedores-ia.md, P0 item 1.	2026-07-23 11:45:07 -07:00
wjq990112	78312c192d	fix(moa): preserve custom provider context metadata Preserve compatible custom provider metadata through MoA aggregator context resolution and cover the resolver and compressor paths.	2026-07-23 11:21:04 -07:00
Teknium	4ab3bf66f1	test(moa): cover the one-shot /moa aggregator path for slot extra_body Follow-up to #60168's salvage: aggregate_moa_context() is the third independent MoA call path; assert its aggregator call receives the custom-provider request_overrides.extra_body via **agg_runtime.	2026-07-23 11:20:54 -07:00
panding	1d603fe822	fix(moa): pass custom extra_body to slots	2026-07-23 11:20:54 -07:00
srojk34	2962ba2b7b	fix(auxiliary): treat explicit model:auto sentinel, not just cfg_model 'auto' is a sentinel meaning "inherit from main runtime / auto-detect", not a literal model id -- already handled for cfg_model (config-derived) in _resolve_task_provider_model, but not for the explicit `model` kwarg. MoA reference/aggregator slots (agent/moa_loop.py's _slot_runtime) forward a preset's `model:` field as this explicit argument rather than through auxiliary.<task> config, so a MoA preset configured with `model: auto` (a natural thing to try given the existing auxiliary.*.model: auto convention) reached this function as the explicit `model` arg and took the `model or cfg_model` branch, bypassing the cfg_model-only sentinel check entirely -- sending the literal string "auto" to the wire as a model id. Normalize both the explicit `model` and `cfg_model` the same way, fixing this at the single chokepoint every caller (MoA included) already goes through, rather than patching moa_loop.py separately.	2026-07-23 11:20:43 -07:00
Teknium	d9165d7a67	fix: resolve current entry unlocked in try_refresh_matching no-hint branch Follow-up to the #62614 salvage: try_refresh_matching (added by the #69843 salvage after this PR's base) calls self.current() while already holding the now-locking non-reentrant pool lock — a guaranteed deadlock that git merges silently (no textual conflict). Use _current_unlocked() and cover the method in the no-deadlock test.	2026-07-23 09:31:58 -07:00
solyanviktor-star	769381fb3e	fix(credential_pool): complete the locking boundary across the public pool surface Follow-up to review feedback: - Acquire self._lock in the remaining public pool-state methods: has_credentials, reset_statuses, remove_index, resolve_target, and add_entry. All of them read or rebind self._entries (and the mutating ones persist auth.json), so they now hold the same lock as select() and the query methods. None are called from within the lock, so no unlocked helpers are needed. - Make the blocking test deterministic: an instrumented lock records the acquire attempt, and the test first waits for the worker to actually reach self._lock before asserting it blocks. Previously an unlocked method could pass if the worker thread was scheduled late. - Extend the lock test matrix to all nine public methods; the five newly locked ones fail the test without this fix. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-23 09:31:58 -07:00
solyanviktor-star	5b794c984e	fix(credential_pool): acquire the pool lock in has_available/peek/current/entries `has_available()`, `peek()`, `current()` and `entries()` read (and, via `_available_entries()`, mutate and persist) `self._entries` without holding `self._lock`, while every other entry point — `select()`, `mark_exhausted_and_rotate()`, `acquire_lease()`, `try_refresh_current()` — guards the exact same access with the lock. `_available_entries()` is not read-only: it prunes aged-out DEAD manual entries (rebinding `self._entries` at the prune step) and calls `_persist()` (writes auth.json). The gateway runs platform adapters in threads and cron runs jobs in a ThreadPoolExecutor, so a status probe via `has_available()` or `peek()` can race a concurrent `select()`/rotation: torn iteration of `self._entries`, interleaved auth.json writes, or a lost token rotation. Fix: take `self._lock` in all four query methods. Because the lock is non-reentrant and `peek()` composes `current()` + `_available_entries()`, add a lock-free `_current_unlocked()` helper and route the already-locked internal callers (`_select_unlocked`, `mark_exhausted_and_rotate`, `_try_refresh_current_unlocked`) through it to avoid self-deadlock. Added regression tests: a no-deadlock check (peek re-entrancy) and a lock-held-blocks-the-call check for each of the four methods.	2026-07-23 09:31:58 -07:00
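The locking pattern from 5b794c984e (public methods take a non-reentrant lock, already-locked callers use a lock-free helper) might look like the sketch below. `TinyPool` and `rotate` are hypothetical stand-ins for the real credential pool; only `_current_unlocked`, `current`, and `peek` mirror names from the commit message.

```python
import threading
from typing import List, Optional


class TinyPool:
    """Toy pool showing the locked-public / unlocked-helper split."""

    def __init__(self, entries: List[str]) -> None:
        self._lock = threading.Lock()  # non-reentrant, as in the commit message
        self._entries = list(entries)
        self._index = 0

    def _current_unlocked(self) -> Optional[str]:
        # Caller must already hold self._lock.
        if not self._entries:
            return None
        return self._entries[self._index % len(self._entries)]

    def current(self) -> Optional[str]:
        with self._lock:
            return self._current_unlocked()

    def peek(self) -> Optional[str]:
        with self._lock:
            # Calling self.current() here would block forever, because
            # threading.Lock cannot be re-acquired by the thread holding it.
            return self._current_unlocked()

    def rotate(self) -> Optional[str]:
        with self._lock:
            if self._entries:
                self._index = (self._index + 1) % len(self._entries)
            return self._current_unlocked()
```

Swapping in `threading.RLock` would also avoid the self-deadlock, but the commits keep the non-reentrant lock and add the helper instead, which is the choice this sketch follows.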
Teknium	547bf1ee9e	test: isolate unmatched-hint regression from live ~/.claude credentials Follow-up to the #65844 salvage: the new anthropic pool test must stub read_claude_code_credentials like the sibling tests, otherwise a dev machine's live claude_code singleton seeds a third entry and the no-benching assertion fails outside CI.	2026-07-23 09:22:02 -07:00
Blade	3d67f00fe1	fix(credential-pool): stop lost-update cooldown erasure and wrong-key quarantine Two related races in credential-pool cooldown state: 1. Lost update across processes: write_credential_pool merged only entries missing from the caller's snapshot; for entries present on both sides the caller's in-memory copy won wholesale. A process holding a snapshot taken before another process marked a key exhausted would, on its next persist (e.g. a round-robin rotation), write the key back as healthy — erasing the cooldown so every process resumes hammering a rate-limited key. Merge status fields by last_status_at recency: adopt the on-disk status only when it is strictly newer AND still binding (DEAD, or EXHAUSTED with an unexpired cooldown), and never onto re-authed (token-changed) entries, so legitimate expiry-clears and fresh logins are preserved. 2. Wrong-key quarantine: when mark_exhausted_and_rotate received an api_key_hint that matched no entry, it fell through to current()/_select_unlocked() — on a freshly loaded pool that selects the NEXT healthy key and benches it for the full cooldown TTL, punishing an innocent credential. When a hint is provided but unmatched, rotate without marking anything instead of guessing. Includes regression tests. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-23 09:22:02 -07:00
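The recency merge described in point 1 of 3d67f00fe1 can be approximated as below. This is a sketch under stated assumptions: the `PoolEntry` fields other than `last_status_at`, the lowercase status strings, and the `merge_status` helper are all hypothetical, and re-auth is modelled simply as a changed token.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PoolEntry:
    token: str
    status: str  # assumed values: "ok", "exhausted", "dead"
    last_status_at: float
    cooldown_until: Optional[float] = None


def still_binding(entry: PoolEntry, now: float) -> bool:
    """DEAD always binds; EXHAUSTED binds only while its cooldown is unexpired."""
    if entry.status == "dead":
        return True
    return (
        entry.status == "exhausted"
        and entry.cooldown_until is not None
        and entry.cooldown_until > now
    )


def merge_status(mem: PoolEntry, disk: PoolEntry, now: float) -> PoolEntry:
    """Return the entry to persist for one credential present on both sides."""
    if mem.token != disk.token:
        return mem  # re-authed entry: never inherit the old token's status
    if disk.last_status_at > mem.last_status_at and still_binding(disk, now):
        # On-disk status is strictly newer and still in force: adopt it.
        return PoolEntry(mem.token, disk.status, disk.last_status_at, disk.cooldown_until)
    return mem
```

A stale in-memory snapshot therefore cannot write a rate-limited key back as healthy, while expired cooldowns and fresh logins still take effect.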
李航	0e15805e25	fix(credential-pool): exhaust all entries sharing a failed API key on 402 A 402/429/401 is an API-key–level failure (account out of balance, rate-limited, or key rejected), but the same key can back more than one pool entry — e.g. an explicit pool entry plus a `model_config` entry auto-seeded from `model.api_key`, both carrying the identical `runtime_api_key`. `mark_exhausted_and_rotate(api_key_hint=...)` only marked the first matching entry, leaving the sibling OK. `_select_unlocked()` then kept handing back the same depleted key, so the billing-recovery `continue` loop in the conversation retry path never converged: the request hung until the client disconnected (~2.5min observed against DeepSeek), emitting only `response.created` with no 402 ever surfaced to the user. Mark every entry sharing the failed key so the pool can reach the "no available entries" state and let the error propagate immediately. Adds a regression test covering two entries backed by the same key.	2026-07-23 09:08:28 -07:00
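A rough illustration of the fix in 0e15805e25, marking every entry that shares the failed key instead of only the first match. `Entry`, `mark_exhausted_by_key`, and `available` are hypothetical names for this sketch; only the `runtime_api_key` field and the `api_key_hint` parameter come from the commit message.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    label: str
    runtime_api_key: str
    status: str = "ok"  # assumed values: "ok" or "exhausted"


def mark_exhausted_by_key(entries: List[Entry], api_key_hint: str) -> int:
    """Mark every entry backed by the failed key; return how many were marked."""
    marked = 0
    for entry in entries:
        if entry.runtime_api_key == api_key_hint:
            entry.status = "exhausted"
            marked += 1
    return marked


def available(entries: List[Entry]) -> List[Entry]:
    """Entries the selector may still hand out."""
    return [e for e in entries if e.status == "ok"]
```

For example, an explicit entry and an auto-seeded `model_config` entry that carry the same key are both benched, so a pool with no other keys reaches the "no available entries" state and the error can propagate.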
Teknium	1d1b670cb5	fix(compression): reset blocked-overflow dedup on every compression path + noise-filter survival pins Follow-up fixes for the #62625 salvage: - Dedup-reset gap (sweeper review): when the block clears while the context is STILL over threshold, execution enters the compression branch — the PR's 'else' reset never ran, so the warning stayed suppressed forever after the first block. _clear_context_overflow_warn() now fires on every automatic compression path: turn-context preflight, conversation_loop pre-API gate, and the post-tool loop-compaction gate. - should_compress_info on current main: main refactored should_compress into _automatic_compression_blocked()/_locally(); the tuple variant now derives its reason from the same in-memory state via _compression_block_reason(), keeping cooldown:<s>/ineffective shapes. - ContextEngine.should_compress_info ABC default now actually returns (should_compress(tokens), None) — the PR's default had a docstring but no return (returned None, would crash tuple-unpacking call sites). - Below-threshold guard: the turn-context persisted-cooldown branch and the conversation_loop pre-API cooldown branch no longer warn when the estimate is under threshold (should_compress_info returns a None reason; the preflight pre-check is not a threshold guarantee). The pre-API guard also honors compression.max_attempts instead of a hardcoded 3, and no longer fabricates a cooldown reason. - Noise-filter survival (#69550 composition): warning text is now a template constant (CONTEXT_OVERFLOW_BLOCKED_WARNING_TEMPLATE) marked FAILURE-CLASS, pinned un-swallowed in VISIBLE_COMPRESSION_MESSAGES and in new tests that execute the real _TELEGRAM_NOISY_STATUS_RE + _prepare_gateway_status_message. - Contributor mapping for stanislav@local -> sl4m3.	2026-07-23 08:43:21 -07:00
Stanislav	5c8d098eb3	Address sweeper review: safe should_compress_info + cover all guards - ContextEngine.should_compress_info() default impl so plugin engines (e.g. _StubEngine) don't raise AttributeError at the call site. - Centralise warning/reset in AIAgent._warn_context_overflow_blocked / _clear_context_overflow_warn so turn-context and conversation-loop guards share identical dedup logic and reset on the real compression boundary. - Cover conversation_loop.py pre-API (~L1007) and loop-compaction (~L4774) guards, not just the turn-context preflight. - _FakeAgent mirrors the two helpers; test suite green (219 passed). Fixes #62708	2026-07-23 08:43:21 -07:00
Stanislav	b0a88899bf	Surface warning when context exceeds compression threshold but compression is blocked Previously, when a session crossed the compression threshold but compression was skipped (summary-LLM cooldown, #11529, or anti-thrashing, #40803), the model kept accumulating context until it hit the hard provider token limit and silently stopped answering — with no signal to the user about why. Changes: - context_compressor.should_compress_info() returns a (should_compress, reason) tuple. reason is 'cooldown:<seconds>' or 'ineffective' when compression is needed but blocked. should_compress() keeps its bool contract so existing callers (conversation_loop.py) and regression #29335 are unaffected. - turn_context.build_turn_context() emits a deduped _emit_warning when the context is over threshold but compression is blocked, advising /new or /compress. Dedup keys on the block kind (cooldown/ineffective), not the ticking countdown, so a cooldown doesn't re-fire the warning every turn. - Adds tests/agent/test_turn_context_overflow_warning.py covering the tuple shape, both block kinds, dedup, and re-fire-after-clear.	2026-07-23 08:43:21 -07:00
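The tuple contract and the dedup-by-block-kind behaviour from b0a88899bf could be sketched like this. The function signature, the `OverflowWarner` class, and the choice to return `False` as the first element while compression is blocked are assumptions for illustration; the commit message only fixes the reason strings `'cooldown:<seconds>'` and `'ineffective'`.

```python
from typing import List, Optional, Tuple


def should_compress_info(
    tokens: int, threshold: int, cooldown_remaining: int, ineffective: bool
) -> Tuple[bool, Optional[str]]:
    """Return (should_compress, reason); reason is set only when blocked."""
    if tokens < threshold:  # assumption: the threshold itself counts as over
        return (False, None)
    if cooldown_remaining > 0:
        return (False, f"cooldown:{cooldown_remaining}")
    if ineffective:
        return (False, "ineffective")
    return (True, None)


class OverflowWarner:
    """Emit one warning per block kind, ignoring the ticking countdown."""

    def __init__(self) -> None:
        self._last_kind: Optional[str] = None
        self.emitted: List[str] = []

    def warn_if_blocked(self, reason: Optional[str]) -> bool:
        if reason is None:
            return False
        kind = reason.split(":", 1)[0]  # "cooldown:42" -> "cooldown"
        if kind == self._last_kind:
            return False  # same block kind as last time: stay quiet
        self._last_kind = kind
        self.emitted.append(kind)
        return True

    def clear(self) -> None:
        """Reset dedup state, e.g. once compression actually runs."""
        self._last_kind = None
```

Keying on the block kind rather than the full reason string is what keeps a cooldown from re-firing the warning every turn as its remaining seconds change.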
happy5318	6bd02ae1a6	feat(image_routing): accept vision alias for custom provider models Extend the existing candidate-name resolver in _supports_vision_override to accept 'vision' as an alias for 'supports_vision' on per-model config, for both the providers.<name>.models dict and the legacy list-style custom_providers form. Per review feedback on #31912: this extends the current resolver rather than replacing its candidate-name logic. Named custom providers resolve to the runtime value provider='custom' while the config keeps the user-declared name under model.provider; that lookup path is preserved. Adds regression tests covering model.provider=my-vllm with runtime provider='custom' for both config shapes.	2026-07-23 08:32:09 -07:00
Teknium	eb7be2edde	fix(compress): classify unconfirmed lock-acquire failures and cover all manual-compress surfaces Follow-up to the salvaged #57634 commits: - agent/manual_compression_feedback.py: new describe_compression_lock_skip() — single source of truth for lock-skip wording. A descriptive holder string means another compressor CONFIRMED holds the lock ('already in progress (holder: ...)'); True/None means acquisition failed without a confirmed holder (hermes_state.try_acquire_compression_lock catches sqlite3.Error internally and returns False), so the message says 'could not acquire ... the lock check failed' instead of falsely claiming a concurrent compression is running. - cli.py, gateway/slash_commands.py, tui_gateway/server.py (all three in-process consumers: session.compress RPC, command.dispatch compress branch, slash.exec mirror) now route through the shared helper. - tui_gateway/server.py command.dispatch compress branch: catch CompressionLockHeld explicitly — it previously fell into the generic 'compress failed' error handler. - Deferred-notify contract (#69324): lock-skip discards the pending context-engine notification (committed=False) in _compress_session_history and the CLI path before returning. - tests: lock-skip wording pins per surface, VISIBLE_COMPRESSION_MESSAGES noise-filter carve-outs for both wordings, MagicMock signal opt-outs for sibling tests added on main after the original PR.	2026-07-23 08:19:14 -07:00

1324 commits