Two-pronged fix for the WebUI "context compaction block in place of
last assistant response" regression.
Agent layer (the real fix). ``_find_tail_cut_by_tokens`` already had
``_ensure_last_user_message_in_tail`` to keep the most recent user
request out of the compressed middle (#10896), but no symmetric
anchor for the assistant side. When the conversation has an
oversized recent tool result or a long stretch of tool-call/result
pairs *after* the assistant's last visible reply, the token-budget
walk can stop with the previously-visible reply on the wrong side
of ``cut_idx``. The summariser then rolls it into the single
``[CONTEXT COMPACTION — REFERENCE ONLY]`` block persisted as
``role="user"`` or ``role="assistant"``, and from the operator's
perspective the WebUI session viewer
(``web/src/pages/SessionsPage.tsx``) and the TUI chat panel both
suddenly show the opaque "Context compaction" block in the slot
where they were just reading the actual answer:
User: "i cant see the output of the last message you sent,
i did see it previously, however now see 'context
compaction'"
Added ``_ensure_last_assistant_message_in_tail`` mirror of the
user-side anchor. It looks for the most recent assistant message
with non-empty text content (skipping tool-call-only assistant
"stubs" which the UI renders as small "calling tool X" indicators
rather than a readable bubble) and walks ``cut_idx`` back through
the standard ``_align_boundary_backward`` so we don't split a
tool_call/result group that immediately precedes it. The two
anchors are chained — each only walks ``cut_idx`` backward, so
the tail can only grow.
Falls back to "most recent assistant of any kind" only when no
content-bearing reply exists in the compressible region (fresh
multi-step tool sequence with no prior reply) — in that case the
agent-side fix is effectively a no-op and the existing
user-message anchor carries the load.
WebUI layer (clarity). Added ``isCompactionMessage`` detector that
recognises the ``[CONTEXT COMPACTION — REFERENCE ONLY]`` (current)
and ``[CONTEXT SUMMARY]:`` (legacy) prefixes from
``agent/context_compressor.py``, and a new ``compaction`` entry
in ``MessageBubble``'s ``ROLE_STYLES`` map. Compaction blocks
now render as muted, italicised system-style rows labelled
``Context handoff`` — clearly metadata, not the assistant's
actual reply — so an operator scrolling back through a long
session can't mistake the summary for a real answer.
Keeping the detected prefixes inline (rather than importing them)
because the WebUI bundle has no Python interop. A guardrail comment
points readers at the source-of-truth constants in
``agent/context_compressor.py``.
Follow-up to the #33346 cherry-pick:
- the marker string was duplicated at both insertion sites (standalone +
merged-into-tail); hoist to a module constant
- _strip_summary_prefix now also strips a trailing end marker so a
rehydrated handoff body doesn't leak the boundary directive into the
iterative-update summarizer prompt (it is re-appended on insertion)
When the compression summary lands as an assistant-role message (head ends
with user), the end marker was not appended. Models may regurgitate the
summary text as their own visible output when there's no clear boundary
signal (#33256).
The end marker was already appended for user-role summaries (#11475, #14521)
but the assistant-role path was missed in the original fix. This ensures ALL
standalone summary messages carry the boundary marker, preventing summary
text from leaking into user-visible chat output.
Legibility pass on the consolidated prefix: collapse the topic-overlap rule
from three overlapping sentences into one WINS sentence + one discard/no-wrap-up
sentence (same constraints, less dilution), fix the module docstring to
describe the headings that actually shipped, and correct the #10896 comment's
heading name (Historical Pending User Asks).
The prompt consolidation above retires the carveout-era prefix. Without a
frozen copy in _HISTORICAL_SUMMARY_PREFIXES, summaries persisted by
pre-upgrade builds would lose detection (_is_context_summary_content) and
renormalization (_strip_summary_prefix) — the exact regression class the
tuple exists to prevent. Adds contract tests covering every frozen prefix.
Refs #41607#38364#42812
ContextCompressor inherited a no-op on_session_end() from ContextEngine, so
per-session iterative-summary state (_previous_summary) survived a real session
boundary on a reused compressor instance. Override it to clear the summary the
moment the owning session ends, complementing the point-of-use guard in
compress(). Closes the cross-session contamination path in #38788.
Co-authored-by: dusterbloom <32869278+dusterbloom@users.noreply.github.com>
When a cron or background session compacts, it sets _previous_summary for
iterative updates. If that session ends without /new or /reset (which calls
on_session_reset()), the stale summary survives on the ContextCompressor
instance. A subsequent live messaging session's compaction then injects it as
'PREVIOUS SUMMARY:' into the summarizer prompt — contaminating the live
session with unrelated content from the prior session.
Add an else guard in compress(): when no handoff summary is found in the
current messages but _previous_summary is non-empty, discard it so
_generate_summary() starts fresh instead of iteratively updating a stale
cross-session summary.
Fixes#38788
When summary_target_ratio is large (e.g. 0.45) and the context_length is
moderate (e.g. 96000), the soft_ceiling (token_budget * 1.5) can exceed
the total transcript size. _find_tail_cut_by_tokens walks the entire
transcript without breaking early, and the resulting compress window is
either empty (compress_start >= compress_end) or a single message whose
summary-of-one overhead saves ~0 tokens.
Both outcomes cause a no-op compression that does not increment
_ineffective_compression_count, so should_compress() returns True on
every subsequent turn and the loop repeats endlessly.
Fix (two layers):
1. _find_tail_cut_by_tokens: when the backward walk consumed the entire
transcript without breaking (cut_idx <= head_end and accumulated <=
soft_ceiling), re-walk with the raw (non-inflated) token budget to
find a meaningful cut that gives the summarizer a useful middle window.
2. compress(): when compress_start >= compress_end, increment
_ineffective_compression_count and log a warning so the existing
anti-thrashing guard in should_compress() can break the loop.
Fixes#40803
Compaction summaries now receive the current date and instruct the
summarizer to rewrite completed actions as absolute, dated, past-tense
facts (e.g. "email John about the proposal" -> "Sent the proposal email
to John on 2026-06-07"). A resumed conversation no longer re-issues work
that already happened or treats a finished action as still pending.
The date is resolved via hermes_time.now() (date-only, user-configured
timezone) inside _generate_summary. The compaction summary is a
mid-conversation message that is never part of the cached prefix, so the
date does not affect prompt-cache stability. Date resolution is
best-effort: a clock failure omits the rule rather than blocking
compaction. The rule rides the shared template, so both first-compaction
and iterative-update prompts carry it.
Inspired by Poke's summarization (temporal anchoring + semantic
preservation).
A handoff persisted under an older SUMMARY_PREFIX can be inherited into a
resumed lineage. _strip_summary_prefix only matched the current/legacy
literal, so on re-compaction the old 'resume exactly from Active Task'
directive stayed embedded in the body and kept hijacking replies to new,
unrelated user messages.
- Add _HISTORICAL_SUMMARY_PREFIXES (pre-#35344 prefix) and strip/recognize
them in _strip_summary_prefix + _is_context_summary_content so resumed
stale handoffs are re-normalized to the current latest-message-wins prefix.
- Reconcile the overlapping Active Task template edits from the salvaged
#26290 (reverse-signal cancellation) and #32787 (capture open questions /
decisions, don't write None too eagerly) — both intents kept.
- Regression coverage in tests/agent/test_resume_stale_active_task.py.
- AUTHOR_MAP entries for both salvaged contributors.
The Active Task field in compression summaries is the single most important
field for task continuity across context boundaries. The previous template
described it narrowly as a 'task assignment' or 'request', which caused the
summary LLM to write 'None' whenever the user's most recent input was a
question, a decision request, or a discussion turn rather than an
imperative command. The assistant on the other side of the compaction then
treated the conversation as resolved and gave a generic recap instead of
answering the still-open question.
Expand the template guidance to cover:
* explicit task assignments
* questions awaiting an answer
* decisions awaiting input (A vs B)
* ongoing discussions where the assistant owes the next substantive reply
Reserve 'None' for the rare case where the last exchange was fully
resolved (e.g. user said 'thanks, that's all').
Also tighten the trailing CRITICAL instruction in the summary prompt so the
LLM cannot fall back to the old 'no imperative command → None' heuristic.
No behavioural code changes — template strings only. All 83 existing
compressor tests pass.
SUMMARY_PREFIX previously contained two contradictory directives:
1. "treat it as background reference, NOT as active instructions"
"Do NOT answer questions or fulfill requests mentioned in this summary"
"Respond ONLY to the latest user message that appears AFTER this summary"
2. "Your current task is identified in the '## Active Task' section of the
summary — resume exactly from there."
When the latest user message contradicted Active Task (e.g. 'stop the
i18n refactor', 'never mind, look at grafana instead'), models tended to
follow (2) anyway because 'resume exactly' is a strong, unambiguous
directive — leading to repeated re-surfacing of already-cancelled work
across turns, even after explicit 'stop'/'don't keep bringing that up'
messages from the user.
This change:
- Removes the conflicting 'resume exactly from Active Task' clause.
- Makes the precedence explicit: latest user message is the single source
of truth; it WINS on conflict; cancelled Active Task / In Progress /
Pending User Asks / Remaining Work must be discarded entirely (no
'wrap up the old task first').
- Names canonical reverse signals (stop, undo, roll back, never mind,
just verify, topic change) so the model recognizes them as cancellation
triggers, not background context.
- Updates the summarizer template instruction so the LLM doesn't
mechanically copy a cancelled task into Active Task on the next
compaction (it's instructed to copy the reverse signal verbatim).
- Preserves: REFERENCE ONLY framing, MEMORY.md/USER.md authority, and
the 'don't repeat work already reflected in session state' clause.
Adds tests/agent/test_summary_prefix_semantics.py to pin invariants so
the conflict can't regress.
Tested:
- All compaction tests pass: tests/agent/test_context_compressor.py,
tests/agent/test_context_compressor_summary_continuity.py,
tests/run_agent/test_413_compression.py,
tests/run_agent/test_compression_persistence.py,
tests/run_agent/test_compression_boundary_hook.py,
tests/cli/test_manual_compress.py — 117/117 passing.
- Tested on macOS.
PR #28102 made the summary-failure abort path the unconditional default,
changing established behavior. Gate it behind config.yaml flag
`compression.abort_on_summary_failure` (default False = historical
fallback-placeholder behavior).
- hermes_cli/config.py: new `compression.abort_on_summary_failure` key,
default False, documented inline.
- agent/agent_init.py: read the flag from compression config and pass to
ContextCompressor.
- agent/context_compressor.py: `__init__` accepts `abort_on_summary_failure`
(default False). `compress()` failure branch gates the abort on the
flag; when False, falls through to the restored legacy fallback path
(static "summary unavailable" placeholder + drop middle window).
- tests: restore original fallback expectations as default; add new
TestAbortOnSummaryFailure class for the opt-in mode.
Gateway/CLI plumbing (force=True on /compress, hygiene/handler abort
detection, locale `gateway.compress.aborted` key) from PR #28102 stays
intact — those paths only fire when `_last_compress_aborted` is True,
which now only happens when the flag is enabled.
When auxiliary compression's summary generation returns None (aux model
errored, returned non-JSON, timed out, etc.) the compressor previously
still dropped every middle message between compress_start..compress_end
and replaced them with a static 'Summary generation was unavailable'
placeholder. The session kept going but the user silently lost N turns
of context for nothing.
New behavior: on summary failure, compress() aborts entirely — returns
the input messages unchanged and sets _last_compress_aborted=True. The
existing _summary_failure_cooldown_until gate (30-60s) keeps the aux
model from being burned on every turn. Auto-compress callers detect
the no-op (len(after) == len(before)) and stop looping. The chat is
'frozen' at its current size until the next /compress or /new.
Manual /compress (CLI + gateway) now passes force=True which clears
the cooldown so users can retry immediately after an auto-abort. If
the manual retry also fails, the user gets a visible warning telling
them nothing was dropped and how to retry.
- agent/context_compressor.py: compress() gains force= kwarg; failure
branch sets _last_compress_aborted and returns messages unchanged
instead of inserting placeholder.
- run_agent.py: _compress_context() detects abort, surfaces warning,
skips session-rotation entirely, returns messages unchanged.
- cli.py + gateway/run.py: manual /compress paths pass force=True.
- gateway/run.py: hygiene + /compress handlers detect _last_compress_aborted
and emit the new 'Compression aborted' warning (gateway.compress.aborted)
instead of the old 'N historical messages were removed' message.
- locales/*.yaml: new gateway.compress.aborted key in all 16 locales.
- tests: updated to assert the abort contract (messages preserved,
compression_count not incremented, abort flag set, no placeholder
leaked). New test_force_true_bypasses_failure_cooldown covers the
manual-retry path.
After context compression, the protected tail messages retain their
original image parts. When those include multi-MB pasted screenshots,
every subsequent API request re-ships the same base-64 blobs forever —
which can push the request past provider body-size limits and wedge the
session even though compression 'succeeded'.
Add _strip_historical_media() to agent/context_compressor.py. After the
summary is built, find the newest user message that carries an image
part and replace image parts in every earlier message with a short
text placeholder ('[Attached image — stripped after compression]').
The newest image-bearing user turn keeps its media so the model can
still analyse what the user just sent.
Handles all three multimodal shapes:
- OpenAI chat.completions image_url
- OpenAI Responses API input_image
- Anthropic native {type: image, source: ...}
Includes 27 unit tests covering the helpers and the end-to-end
compress() integration, plus a manual E2E check confirming a ~4MB
two-image conversation shrinks to ~2MB after compression.
Follow-up on the salvaged feat commit:
- Keep the constructor / config / yaml-example default at 3 so existing
gateway and CLI users see no behavioural change. PR #13754 (which this
builds on) had lowered the default to 2 to chase pre-feature parity in
the system-prompt-present case, at the cost of quietly halving the
protected head for the gateway path (which strips the system prompt
before calling compress()). With the new "system prompt is implicit"
semantics, default 3 gives every caller a stable head shape.
- agent/context_engine.py: bring the ABC's protect_first_n docstring in
line with the new semantics so plugin context engines interpret the
config key the same way the built-in compressor does.
- tests: adjust the default-value test (3, not 2) and a stale comment;
per-test protect_first_n=2/3/1 values added in PR #13754 stay as-is
since those tests fix concrete head shapes.
The number of head messages preserved verbatim across context compactions
was previously hardcoded to 3 in AIAgent.__init__. Expose it as
`compression.protect_first_n` in config, matching the existing
`protect_last_n` pattern.
Motivation: users who rely on rolling compaction for long-running sessions
had the opening user/assistant exchange pinned as head forever, which
doesn't always match how they want the session framed after many
compactions. Lowering to 1 preserves the system prompt + first non-system
message; lowering to 0 preserves only the system prompt and lets the
entire first exchange age out naturally through the summary.
Semantics: `protect_first_n` counts non-system head messages protected
**in addition to** the system prompt, which is always implicitly protected
when present. Same meaning across both code paths:
protect_first_n=0 → system prompt only (or nothing if no system message)
protect_first_n=2 → system prompt + first 2 non-system messages (default)
This unifies the CLI path (which reads messages with the system prompt at
position 0) and the gateway path (where the gateway /compress handler
strips the system prompt before calling compress() — see
gateway/run.py L9150-9154 on the parent fork). Previously these two paths
disagreed:
CLI path: protect_first_n=1 → protect system prompt only
Gateway path: protect_first_n=1 → protect first USER turn forever
In practice on long-running gateway sessions the old semantics pinned
whatever stale aside happened to be the first user message, reinserting
it into every compaction summary indefinitely.
Default chosen as 2 (not 3) so that the effective protected head count
remains 3 messages in the common case — assuming a system prompt is
present, default protection becomes system + 2 non-system = 3 total,
matching the pre-feature behaviour where `protect_first_n` was hardcoded
to protect 3 messages total. Sessions without a system prompt will see a
small behaviour change (2 protected head messages instead of 3), but this
is the rare path and the new semantics make the system-prompt-present
case the well-defined one.
Changes:
- agent/context_compressor.py: redefine protect_first_n as the count of
non-system head messages protected beyond the implicit system-prompt
guarantee; both paths converge. Constructor default updated to 2.
- hermes_cli/config.py: add `compression.protect_first_n` default (2),
matching the new semantics. `show_config` label tweaked to
'Protect first: N non-system head messages' for clarity.
- run_agent.py: read protect_first_n from config; 0 is now valid (system
prompt is always implicitly protected).
- cli-config.yaml.example: document the new key and rationale.
- tests/agent/test_context_compressor.py: cover default, override, the
end-to-end `protect_first_n=0` and `protect_first_n=1` behaviour,
the no-system-prompt (gateway) path, and the new shared-semantics
regression test.
Fixes#13751
Tested on Ubuntu 24.04.
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via `ruff --fix --unsafe-fixes`, 0 remaining.
133 files, +626/-626 (net zero).
Problem:
When a provider or proxy drops a streaming response mid-flight (httpcore
raises RemoteProtocolError: "incomplete chunked read", "peer closed
connection", "response ended prematurely", etc.), _generate_summary
would not classify it as a transient error. Instead of retrying on the
main model, it entered the generic 60-second cooldown, leaving context
growing unbounded until the cooldown expired. Issue #18458.
Root cause:
_is_connection_error in auxiliary_client.py did not match httpcore's
streaming premature-close error substrings. context_compressor.py's
_generate_summary except block never called _is_connection_error, so
those errors fell through to the 60-second generic cooldown rather than
triggering the retry-on-main fallback path used for timeouts.
Fix:
1. auxiliary_client.py — extend _is_connection_error keyword list with:
"incomplete chunked read", "peer closed connection",
"response ended prematurely", "unexpected eof",
"remoteprotocolerror", "localprotocolerror".
Also guard the `from openai import ...` with try/except ImportError
so the function works in environments without the openai package.
2. context_compressor.py — import _is_connection_error and call it in
_generate_summary's except block as _is_streaming_closed. Include
_is_streaming_closed in the fallback-to-main condition (alongside
_is_model_not_found, _is_timeout, _is_json_decode) and use the
shorter 30s transient cooldown for streaming-closed errors.
Tests:
4 new regression tests in TestStreamingClosedFallback:
- test_incomplete_chunked_read_falls_back_to_main
- test_peer_closed_connection_falls_back_to_main
- test_streaming_closed_on_main_uses_short_cooldown (stash-verified)
- test_non_streaming_unknown_error_still_uses_long_cooldown
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When an auxiliary LLM provider (or an upstream proxy) returns a non-JSON
body with `Content-Type: application/json` — e.g. an HTML 502 page from a
misconfigured gateway — the OpenAI SDK's `response.json()` raises a raw
`json.JSONDecodeError` (or wraps it in `APIResponseValidationError` whose
message contains "expecting value"). Previously this fell through to the
unknown-error branch and entered a 60s cooldown without retrying on the
main model, dropping the middle conversation turns instead.
This change folds JSON-decode detection into the existing fast-path
fallback chain: detect by `isinstance(e, JSONDecodeError)` OR substring
match for "expecting value", retry once on the main model, and use a
shorter 30s cooldown when already on main (the body shape tends to flip
back to valid quickly when the upstream proxy recovers).
The three duplicated fallback bodies (model-not-found, unknown-error,
JSON-decode) are consolidated into a single `_fallback_to_main_for_compression`
helper that handles the shared bookkeeping (record aux-model failure for
`/usage`-style callers, clear summary_model, clear cooldown).
Also adds three unit tests covering: raw `JSONDecodeError` retries on main,
substring-match for wrapped exceptions, and the 30s cooldown when already
on main.
Salvage of #22248 by @0xharryriddle. Closes#22244.
Co-authored-by: Harry Riddle <ntconguit@gmail.com>
Background macOS desktop control via cua-driver MCP — does NOT steal the
user's cursor or keyboard focus, works with any tool-capable model.
Replaces the Anthropic-native `computer_20251124` approach from the
abandoned #4562 with a generic OpenAI function-calling schema plus SOM
(set-of-mark) captures so Claude, GPT, Gemini, and open models can all
drive the desktop via numbered element indices.
- `tools/computer_use/` package — swappable ComputerUseBackend ABC +
CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
- Universal `computer_use` tool with one schema for all providers.
Actions: capture (som/vision/ax), click, double_click, right_click,
middle_click, drag, scroll, type, key, wait, list_apps, focus_app.
- Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style
`content: [text, image_url]` parts) that flows through
handle_function_call into the tool message. Anthropic adapter converts
into native `tool_result` image blocks; OpenAI-compatible providers
get the parts list directly.
- Image eviction in convert_messages_to_anthropic: only the 3 most
recent screenshots carry real image data; older ones become text
placeholders to cap per-turn token cost.
- Context compressor image pruning: old multimodal tool results have
their image parts stripped instead of being skipped.
- Image-aware token estimation: each image counts as a flat 1500 tokens
instead of its base64 char length (~1MB would have registered as
~250K tokens before).
- COMPUTER_USE_GUIDANCE system-prompt block — injected when the toolset
is active.
- Session DB persistence strips base64 from multimodal tool messages.
- Trajectory saver normalises multimodal messages to text-only.
- `hermes tools` post-setup installs cua-driver via the upstream script
and prints permission-grant instructions.
- CLI approval callback wired so destructive computer_use actions go
through the same prompt_toolkit approval dialog as terminal commands.
- Hard safety guards at the tool level: blocked type patterns
(curl|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash,
force delete, lock screen, log out).
- Skill `apple/macos-computer-use/SKILL.md` — universal (model-agnostic)
workflow guide.
- Docs: `user-guide/features/computer-use.md` plus reference catalog
entries.
44 new tests in tests/tools/test_computer_use.py covering schema
shape (universal, not Anthropic-native), dispatch routing, safety
guards, multimodal envelope, Anthropic adapter conversion, screenshot
eviction, context compressor pruning, image-aware token estimation,
run_agent helpers, and universality guarantees.
469/469 pass across tests/tools/test_computer_use.py + the affected
agent/ test suites.
- `model_tools.py` provider-gating: the tool is available to every
provider. Providers without multi-part tool message support will see
text-only tool results (graceful degradation via `text_summary`).
- Anthropic server-side `clear_tool_uses_20250919` — deferred;
client-side eviction + compressor pruning cover the same cost ceiling
without a beta header.
- macOS only. cua-driver uses private SkyLight SPIs
(SLEventPostToPid, SLPSPostEventRecordTo,
_AXObserverAddNotificationAndCheckRemote) that can break on any macOS
update. Pin with HERMES_CUA_DRIVER_VERSION.
- Requires Accessibility + Screen Recording permissions — the post-setup
prints the Settings path.
Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic-
native schema). Credit @0xbyt4 for the original #3816 groundwork whose
context/eviction/token design is preserved here in generic form.
- Fix /compact → /compress in context-overflow tips (closes#20020)
- Evict cached agent after session hygiene and /compress so system
prompt refreshes with current SOUL.md, memory, and skills
- Restore memory authority across compaction: change 'informational
background data' to 'authoritative reference data' in memory block
and SUMMARY_PREFIX, with backward-compatible regex
Based on:
- PR #20027 by @LeonSGP43
- PR #18767 by @MacroAnarchy
- PR #17380 by @vominh1919
PR #17121 boundary marker fix already merged to main (2eef395e1).
PR #9262 user-message anchoring already on main via _ensure_last_user_message_in_tail().
When the head ends with assistant/tool and the tail starts with assistant,
the summary is inserted as a standalone role="user" message. The body's
verbatim "## Active Task" quote then gets read as fresh user input by
weak/local models (#11475, #14521).
The merge-into-tail path already appends an explicit end-of-summary marker
for this reason. Mirror it on the standalone path so both insertion routes
give the model the same "summary above, not new input" signal.
Commit 408dd8aa added a non-string guard for Pass 1 (dedup), but the same
pattern exists in Pass 2 (summarization/pruning) where content.startswith()
and len() are called on potentially non-string tool content.
When a provider returns tool results with non-string content (e.g. dict or
int from llama.cpp or similar), the pruning pass crashes with AttributeError.
Add the same isinstance(content, str) guard to Pass 2 for consistency.
Previously only HTTP 404/503 and specific error strings triggered a fallback
to the main model when the summary model was unavailable. Timeout errors
(HTTP 408/429/502/504, or error strings containing 'timeout') entered a
short cooldown instead, leaving context to grow unbounded for the rest of
the session.
Add _is_timeout detection alongside _is_model_not_found so that transient
timeout errors on the summary model also trigger immediate fallback to the
main model, preventing compression failure from cascading.
Closes#15935
on_session_reset() cleared _previous_summary, _last_summary_error, and
_ineffective_compression_count but left _summary_failure_cooldown_until
intact. When a transient summary error sets a 60 s cooldown (or 600 s
for a missing-provider RuntimeError) and the user immediately runs /reset
or /new, the cooldown carries into the new session. If the new session
reaches the compression threshold before the cooldown expires,
_generate_summary() returns None early, middle turns are silently dropped
without a summary, and the agent continues with no indication that
compaction was skipped.
Fix: set _summary_failure_cooldown_until = 0.0 in on_session_reset(),
matching the value assigned in __init__ and symmetric with the other
per-session fields already cleared there.
Fixes#15547
Background macOS desktop control via cua-driver MCP — does NOT steal the
user's cursor or keyboard focus, works with any tool-capable model.
Replaces the Anthropic-native `computer_20251124` approach from the
abandoned #4562 with a generic OpenAI function-calling schema plus SOM
(set-of-mark) captures so Claude, GPT, Gemini, and open models can all
drive the desktop via numbered element indices.
- `tools/computer_use/` package — swappable ComputerUseBackend ABC +
CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
- Universal `computer_use` tool with one schema for all providers.
Actions: capture (som/vision/ax), click, double_click, right_click,
middle_click, drag, scroll, type, key, wait, list_apps, focus_app.
- Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style
`content: [text, image_url]` parts) that flows through
handle_function_call into the tool message. Anthropic adapter converts
into native `tool_result` image blocks; OpenAI-compatible providers
get the parts list directly.
- Image eviction in convert_messages_to_anthropic: only the 3 most
recent screenshots carry real image data; older ones become text
placeholders to cap per-turn token cost.
- Context compressor image pruning: old multimodal tool results have
their image parts stripped instead of being skipped.
- Image-aware token estimation: each image counts as a flat 1500 tokens
instead of its base64 char length (~1MB would have registered as
~250K tokens before).
- COMPUTER_USE_GUIDANCE system-prompt block — injected when the toolset
is active.
- Session DB persistence strips base64 from multimodal tool messages.
- Trajectory saver normalises multimodal messages to text-only.
- `hermes tools` post-setup installs cua-driver via the upstream script
and prints permission-grant instructions.
- CLI approval callback wired so destructive computer_use actions go
through the same prompt_toolkit approval dialog as terminal commands.
- Hard safety guards at the tool level: blocked type patterns
(curl|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash,
force delete, lock screen, log out).
- Skill `apple/macos-computer-use/SKILL.md` — universal (model-agnostic)
workflow guide.
- Docs: `user-guide/features/computer-use.md` plus reference catalog
entries.
44 new tests in tests/tools/test_computer_use.py covering schema
shape (universal, not Anthropic-native), dispatch routing, safety
guards, multimodal envelope, Anthropic adapter conversion, screenshot
eviction, context compressor pruning, image-aware token estimation,
run_agent helpers, and universality guarantees.
469/469 pass across tests/tools/test_computer_use.py + the affected
agent/ test suites.
- `model_tools.py` provider-gating: the tool is available to every
provider. Providers without multi-part tool message support will see
text-only tool results (graceful degradation via `text_summary`).
- Anthropic server-side `clear_tool_uses_20250919` — deferred;
client-side eviction + compressor pruning cover the same cost ceiling
without a beta header.
- macOS only. cua-driver uses private SkyLight SPIs
(SLEventPostToPid, SLPSPostEventRecordTo,
_AXObserverAddNotificationAndCheckRemote) that can break on any macOS
update. Pin with HERMES_CUA_DRIVER_VERSION.
- Requires Accessibility + Screen Recording permissions — the post-setup
prints the Settings path.
Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic-
native schema). Credit @0xbyt4 for the original #3816 groundwork whose
context/eviction/token design is preserved here in generic form.
A misconfigured auxiliary.compression.model is a user-fixable problem that silent recovery would hide. The previous retry-on-main logic transparently swallowed aux-model failures whenever the fallback succeeded, leaving the user's broken config in place and racking up future failures.
Track the aux-model failure on the compressor alongside the existing fallback-placeholder fields:
- _last_aux_model_failure_model: str | None
- _last_aux_model_failure_error: str | None
Both are set at the moment the aux model errors (captured before summary_model is cleared for retry), regardless of whether the retry succeeds. Cleared at compress() start and on on_session_reset() so a clean run doesn't leak stale warnings.
Surface at three places:
- gateway hygiene auto-compress: ℹ note to the platform adapter (thread_id preserved)
- gateway /compress command: ℹ line appended to the reply
- CLI via _emit_warning: deduped on (model, error) so repeat compactions don't spam
Distinct from the existing ⚠️ dropped-turns warning — different severity, different emoji, explicit 'context is intact' reassurance.
The existing retry-on-main path in _generate_summary only fires for errors that match the _is_model_not_found heuristic (404/503, 'model_not_found', 'does not exist', 'no available channel'). Other misconfiguration errors — 400s from aggregators, provider-specific 'no route' strings, opaque rejections — fall straight through to the transient-cooldown branch, which drops N turns of context and inserts a static placeholder.
Losing context is almost always worse than one extra summary attempt. Add a best-effort retry-on-main for the unknown-error branch, guarded by the same invariants as the existing fast-path retry: only when summary_model differs from main, and only once per compressor (_summary_model_fallen_back).
Tests cover: 404 fast-path fallback still works, unknown 400 now falls back, same-model aux skips retry (no infinite loop), and a double-failure (aux + main) stops at 2 calls.