Commit graph

5481 commits

Author SHA1 Message Date
Teknium
a1f51feb72
fix(telegram): avoid rich final duplicate previews (#46206) 2026-06-14 11:13:38 -07:00
kshitij
6c34088a17
Merge pull request #46237 from kshitijk4poor/salvage/46095-cross-process-cache
fix(gateway): cross-process agent-cache coherence (#45966) + preserve prompt caching
2026-06-14 23:05:17 +05:30
kshitijk4poor
3bc4a2ff78 fix(gateway): re-baseline agent-cache message_count after each turn
The #45966 cross-process coherence guard snapshots a session's on-disk
message_count next to the cached agent and rebuilds the agent when the
count changes.  But the snapshot is taken at agent-BUILD time — before
the turn writes its own user + assistant (+ tool) rows — and the cache
entry is never rewritten on a reuse.  So this process's OWN turn grows
message_count, and the very next turn sees a mismatch and rebuilds the
agent.  That happens every turn, for every conversation, silently
destroying the per-conversation prompt caching the cache exists to
protect (AGENTS.md: prompt caching is sacred).

Add _refresh_agent_cache_message_count(): after a turn completes and the
agent has flushed its rows to the SessionDB, re-baseline the stored count
to the now-current value.  The guard then fires ONLY when a DIFFERENT
process changes the transcript — preserving the #45966 fix while keeping
the cache warm for normal single-process operation.

Tests drive the real SessionDB + the real guard condition: 5 consecutive
same-process turns now all REUSE the cached agent (0 before the fix); a
cross-process append still invalidates; and the re-baseline is fail-safe
(no DB, falsy session_id, raising probe, legacy 2-tuple, pending sentinel
all no-op).
2026-06-14 22:58:55 +05:30
kshitijk4poor
ce19fdb7ce fix(skills): apply global|platform disabled union to all resolution sites
The platform-disabled fix landed only in agent.skill_utils.get_disabled_skill_names
(the system-prompt path). Two sibling resolvers still used the old
replace-not-union semantics, so the same skill could be hidden from the
<available_skills> prompt yet reported enabled elsewhere:

- hermes_cli/skills_config.get_disabled_skills (the 'hermes skills config' UI)
  returned only the platform list, so a globally-disabled skill showed as
  enabled (unchecked) on any platform with a platform_disabled entry.
- tools/skills_tool._is_skill_disabled (gates whether skill_view loads a skill)
  ignored the global list when a platform list existed, so a globally-disabled
  skill could still be loaded on such a platform.

Both now union the global list with the platform list, matching
get_disabled_skill_names. An explicit empty platform list no longer re-enables
a globally-disabled skill — global disables hold on every platform (#46201).

Also: fix the now-stale get_disabled_skill_names docstring and drop a stray
blank line. Regression tests added for both sites (proven to fail on the old
replace semantics).
2026-06-14 22:54:54 +05:30
ibrahim özsaraç
7bbe7024c2 fix: filter platform-disabled skills from <available_skills> prompt (#46201)
build_skills_system_prompt() already resolved _platform_hint but called
get_disabled_skill_names() with no argument, so the resolved platform never
reached the filter and the prompt cache_key varied by platform while the
disabled set did not. Pass _platform_hint or None.

get_disabled_skill_names() also fully ignored the global 'disabled' list once
a platform-specific list was found. Return the union (global | platform) so a
globally-disabled skill stays disabled on every platform.

Salvaged from #46203 by @iborazzi; the unrelated apps/shared/tsconfig.json
ES2023 bump is intentionally dropped (one concern per PR).
2026-06-14 22:52:57 +05:30
Teknium
7433d5f0eb fix(gateway): scope early duplicate guard to pid file 2026-06-14 08:42:06 -07:00
konsisumer
1436793051 fix(gateway): block shell gateway run when a service supervises the profile 2026-06-14 08:42:06 -07:00
Teknium
2c174bce24 fix(gateway): preserve new input on interrupted replay cleanup 2026-06-14 05:10:39 -07:00
Diyon18
288f7026e3 fix(messaging): correct Weixin personal account labeling 2026-06-14 04:52:54 -07:00
Teknium
efbe1635dd
fix(gateway): include replied-to media attachments (#46107) 2026-06-14 04:51:50 -07:00
Teknium
a27d7e68cc
fix(mcp): block suspicious stdio configs before probe (#46112) 2026-06-14 04:46:54 -07:00
Teknium
13a1bd0f83
perf(model-metadata): persist OpenRouter metadata cache (#46114) 2026-06-14 04:45:46 -07:00
Teknium
972a9885ee
fix(mcp): block exfil-shaped stdio server configs (#46083) 2026-06-14 04:24:14 -07:00
Teknium
9459057d7f
fix(telegram): guard rich details math crash (#46102) 2026-06-14 04:22:22 -07:00
Teknium
cf7d5932f8 fix(email): make IPv4 SMTP fallback use supported sockets 2026-06-14 04:16:26 -07:00
liuhao1024
04d4471d79 fix(email): use SMTP_SSL for port 465 and fall back to IPv4 on timeout
Port 465 expects implicit TLS (SMTP_SSL) from the first byte. The email
adapter always used SMTP() + starttls(), which is correct for port 587
but hangs/fails on port 465 providers (e.g., Swiss ISPs).

Additionally, when the SMTP host has AAAA DNS records but IPv6 is
unreachable, socket.create_connection() tries IPv6 first and hangs
until timeout. Add an IPv4 fallback via AF_INET socket.

Extract _connect_smtp() helper to consolidate the 4 duplicate SMTP
connection sites into a single method with correct protocol selection
and IPv6 fallback logic.
2026-06-14 04:16:26 -07:00
Teknium
5105c3651a
perf(api-server): normalize chat content linearly (#46079) 2026-06-14 03:25:49 -07:00
Aldo
293c04fef6 fix(gateway): suppress exact silence tokens without mutating history 2026-06-14 03:25:08 -07:00
Teknium
10bad2faf1
fix(gateway): serialize startup auto-resume before inbound (#46074)
Gateway startup now queues real inbound messages until restart-interrupted auto-resume turns have completed, preventing duplicate agents for the same session after a restart.
2026-06-14 03:21:06 -07:00
Teknium
2b4873f7fb
fix(agent): persist repaired-turn responses (#46071) 2026-06-14 03:20:25 -07:00
Teknium
723c2331bd fix: make profile subprocess HOME policy explicit 2026-06-14 03:20:21 -07:00
zccyman
b00060ce54 fix(agent): expose HERMES_REAL_HOME in subprocess envs for profile isolation
When profile isolation activates ({HERMES_HOME}/home/ exists), child
processes receive HOME={HERMES_HOME}/home/ for tool config isolation
(git, ssh, gh). However, scripts using Path.home() to locate
~/.hermes/ would incorrectly resolve to the isolated profile home,
breaking helpers that rely on the real user home directory.

New get_real_home() helper in hermes_constants resolves the actual
user home independently of profile isolation. All four subprocess
spawners now inject HERMES_REAL_HOME alongside the profile HOME:

- tools/code_execution_tool.py (execute_code)
- tools/environments/local.py (terminal background, run_env)
- agent/copilot_acp_client.py (Copilot ACP)

Child scripts can now use:
  Path(os.environ.get("HERMES_REAL_HOME", os.environ.get("HOME", "")))

to reliably find the real user home regardless of profile isolation.

Closes #25114
2026-06-14 03:20:21 -07:00
Teknium
0428945b5b
fix(desktop): keep profile homes out of bootstrap (#46073) 2026-06-14 03:08:52 -07:00
Teknium
afc8615509
perf(webhook): prune request caches incrementally (#46065) 2026-06-14 02:40:54 -07:00
LeonSGP43
89bdb1e546 fix: read dashboard spa assets as utf-8
Co-Authored-By: Paperclip <noreply@paperclip.ing>
2026-06-14 02:31:04 -07:00
Teknium
7b9dc7cd0a test(gateway): align web profile wrapper expectation 2026-06-14 02:20:55 -07:00
helix4u
d76a58bd15 fix(gateway): resolve sudo profile system installs 2026-06-14 02:20:55 -07:00
Teknium
1f5eef8093 test(tui): tolerate resume init kwargs in protocol tests 2026-06-14 02:15:33 -07:00
Teknium
9f33d673e9 fix(tui): persist resumed profile cwd updates to profile db 2026-06-14 02:15:33 -07:00
dsad
d842155da1 Keep resumed profile cwd scoped to profile DB 2026-06-14 02:15:33 -07:00
helix4u
4936a49a0c fix(mcp): preserve loop during probes
Some checks failed
Deploy Site / deploy-vercel (push) Waiting to run
Deploy Site / deploy-docs (push) Waiting to run
Docker Build and Publish / build-amd64 (push) Waiting to run
Docker Build and Publish / build-arm64 (push) Waiting to run
Docker Build and Publish / merge (push) Blocked by required conditions
Lint (ruff + ty) / ruff + ty diff (push) Waiting to run
Lint (ruff + ty) / ruff enforcement (blocking) (push) Waiting to run
Lint (ruff + ty) / Windows footguns (blocking) (push) Waiting to run
Nix / nix (macos-latest) (push) Waiting to run
Nix / nix (ubuntu-latest) (push) Waiting to run
OSV-Scanner / Scan lockfiles (push) Waiting to run
Tests / test (1) (push) Waiting to run
Tests / test (2) (push) Waiting to run
Tests / test (3) (push) Waiting to run
Tests / test (4) (push) Waiting to run
Tests / test (5) (push) Waiting to run
Tests / test (6) (push) Waiting to run
Tests / save-durations (push) Blocked by required conditions
Tests / e2e (push) Waiting to run
Typecheck / typecheck (apps/bootstrap-installer) (push) Waiting to run
Typecheck / typecheck (apps/desktop) (push) Waiting to run
Typecheck / typecheck (apps/shared) (push) Waiting to run
Typecheck / typecheck (ui-tui) (push) Waiting to run
Typecheck / typecheck (web) (push) Waiting to run
uv.lock check / uv lock --check (push) Waiting to run
Nix Lockfile Fix / auto-fix-main (push) Has been cancelled
Nix Lockfile Fix / fix (push) Has been cancelled
Build Skills Index / build-index (push) Has been cancelled
Build Skills Index / trigger-deploy (push) Has been cancelled
2026-06-14 02:09:45 -07:00
helix4u
85e6232a07 fix(providers): support anthropic proxy v1 endpoints 2026-06-14 02:09:16 -07:00
Teknium
81e42335a1
fix(file-safety): relax user-write deny policy (#45947)
Allow file tools to edit shell startup files, user package-manager configs, and Hermes control files that the user can already modify directly. Keep hard blocks for SSH keys, .env/OAuth token stores, mcp-tokens, pairing files, and system privilege files.
2026-06-14 02:07:32 -07:00
Brooklyn Nicholson
715b691723 fix(desktop): show summarizing indicator during auto-compaction
Auto-compression rewrites history mid-turn, which made long threads look
like they reset. Re-tag the gateway lifecycle status as compacting and
surface it in the desktop thread loading indicators.
2026-06-14 02:28:07 -05:00
kshitijk4poor
12c84d6c77 fix(transports): only treat a refusal as terminal when it is the sole payload
A chat-completions response that carries real text or tool calls *alongside*
a `message.refusal` note is a normal, usable turn — the model did work. The
prior logic flipped finish_reason to `content_filter` whenever a refusal
string was present, so the conversation loop reframed a content-bearing turn
as a *failed* safety refusal (failed=True) and buried the model's actual
output inside the "model declined" template, or dropped tool calls entirely.

Only promote to a terminal `content_filter` when the refusal is the sole
payload (no visible text AND no tool calls). The refusal explanation is still
recorded in provider_data in every case for observability. Refusal-only
responses (the bug this feature targets) are unaffected and still surface
terminally; the empty+refusal, bare content_filter passthrough, and no-refusal
common cases are byte-identical to before.

Updates the partial-content test to the corrected contract and adds a
tool_calls-alongside-refusal regression guard.
2026-06-14 12:12:52 +05:30
SHL0MS
ab26541b9a test(transports): lock in content_filter passthrough for OpenRouter
OpenRouter (and every other OpenAI-compatible provider) uses the default
chat_completions transport, so it is already covered by the refusal fix:
an upstream Claude / moderation refusal arrives as
finish_reason="content_filter" (often empty content, no message.refusal).
Add a regression test asserting the transport passes that finish reason
straight through to the loop's content_filter handler.

(cherry picked from commit 60168a513b)
2026-06-14 12:10:08 +05:30
SHL0MS
bb46bf8ce4 fix(agent): surface model refusals instead of retrying them as errors
A Claude refusal (HTTP 200, stop_reason="refusal", empty content) was
laundered into a generic retry loop and surfaced as a misleading
"rate limited / invalid response" or "no content after retries" error,
burning paid attempts reproducing a deterministic refusal.

This hit two distinct paths:

- Direct Anthropic (anthropic_messages): validate_response rejected the
  empty-content refusal *before* normalize_response mapped refusal ->
  content_filter, so it fell into the invalid-response retry loop.
- Nous Portal / OpenAI-compatible (chat_completions): the portal surfaces
  a Claude refusal via message.refusal with empty content, which sailed
  past validation and died in the empty-response retry loop.

Fix (one unified content_filter dispatch for all backends):
- AnthropicTransport.validate_response: accept empty content when
  stop_reason == "refusal" so it flows to normalize_response.
- ChatCompletionsTransport.normalize_response: promote message.refusal to
  content + a content_filter finish reason.
- conversation_loop: handle finish_reason == "content_filter" - fire the
  api_request_error hook (content_policy_blocked), try a configured
  fallback once, else return a clear terminal refusal message. Never retry
  a deterministic refusal.

Supersedes #43084, which fixed only the direct-Anthropic path and could
not reach the chat_completions/portal path.

Tests: transport-level (validate_response refusal, message.refusal
promotion) + end-to-end loop (refusal surfaced, exactly one API call).

(cherry picked from commit 01f546f92c)
2026-06-14 12:10:08 +05:30
brooklyn!
4b5ba112ad
fix: shrink images to reported provider dimension limit (#45979)
Parse provider-reported image pixel ceilings so many-image Anthropic requests can recover by shrinking Retina screenshots below the stricter limit instead of retrying the same rejected payload.
2026-06-14 01:07:43 -05:00
Teknium
8f278403d1
perf(execute-code): stop waiting on idle RPC accept (#45948) 2026-06-13 21:57:15 -07:00
Teknium
1b16c48170 fix: guard OAuth account removal 2026-06-13 21:47:13 -07:00
Justin Sunseri
12682d96b9 feat(telegram): restore rich messages opt-out
Salvages PR #45840's client-compatibility opt-out while keeping rich messages enabled by default via telegram.extra.rich_messages: true.
2026-06-13 21:45:49 -07:00
aimable100
8d5d36d793
fix(dispatch): forward session_id into registry.dispatch (#28479)
Both the regular and execute_code dispatch paths forward task_id into
registry.dispatch via middleware _dispatch lambdas but silently dropped
session_id. Dispatch-layer hooks (e.g. set_enforcement_fn) that correlate
calls with the active session received "" for every invocation.

Pass session_id=session_id at both _dispatch call sites inside
handle_function_call, matching the existing task_id pattern. Hooks
already received session_id; this closes the registry.dispatch gap.

Rebased onto current main where dispatch is wrapped by
run_tool_execution_middleware — the old direct-dispatch sites from
#28479 no longer exist.

test(dispatch): add tests for session_id forwarding (NousResearch#28479)

Covers standard and execute_code paths through the middleware wrapper.
Verifies task_id forwarding is not broken by the change.
2026-06-14 00:27:59 -04:00
Teknium
7aaae7acd0 fix(ssl): align guard docs and escape hatch 2026-06-13 21:14:32 -07:00
Teknium
af5b526472 fix(ssl): validate CA bundle paths before provider calls 2026-06-13 21:14:32 -07:00
chromalinx
b42c5bf652 test(ssl_guard): fix macOS fallback test that passed for the wrong reason
The previous test patched ssl.create_default_context globally with a bare
SSLContext that has zero CA certs. Both verify_ca_bundle() and the macOS
fallback got the same mocked context, so the test verified nothing useful:
both paths produced empty get_ca_certs() and the assertion that no
exception escaped was vacuously satisfied.

Only mock the fallback call (no cafile) — let the certifi call hit the
real SSL stack and fail with SSLError on the broken PEM. The mock
fallback returns a context with load_default_certs() so the test now
verifies the real scenario: broken certifi → SSLConfigurationError,
macOS system trust store → success.

Also pads the broken PEM past the 1 KB size guard so the size check
doesn't short-circuit before ssl.create_default_context(cafile=...) runs.

Reported by @liuhao1024 in PR review.
2026-06-13 21:14:32 -07:00
chromalinx
a218a0f156 fix(agent,gateway,doctor): add SSL CA cert bundle fail-fast guard
A stale certifi CA bundle after a partial `hermes update` used to crash
the agent on the first outbound HTTPS call with a raw traceback and
trap the gateway in a retry loop.

This patch:

* Adds `agent/errors.py` with a typed `SSLConfigurationError`
* Adds `agent/ssl_guard.py` with a `verify_ca_bundle()` pre-flight
  that asserts the bundle exists, is non-trivial in size, and can build
  a working SSLContext. On macOS, it falls back to the system trust
  store when the bundle is empty but the system store is healthy
  (covers corporate proxies / MDM setups).
* Wires the guard into `run_agent.py` and `gateway/run.py` right
  after the `hermes_bootstrap` import, inside a try/except so a bug
  in the guard itself can never prevent startup.
* Adds a `SSL / CA Certificates` section to `hermes_cli doctor` so
  users can detect the failure with one command.
* Adds unit tests covering the healthy, missing, empty, skip-env, and
  macOS-fallback paths.
* Adds an RCA document describing the failure mode and the recovery
  path (`pip install -e .`).

When the bundle is broken the user sees:

    \u26a0\ufe0f SSL certificate bundle issue detected.
       Run: pip install -e .

`HERMES_SKIP_SSL_GUARD=1` disables the check for sandboxed
environments that ship their own trust store.
2026-06-13 21:14:32 -07:00
Teknium
1106879147
perf(process): wake waiters on background completion (#45831) 2026-06-13 21:11:19 -07:00
Max Pollard
9a2b976326 test(skills): add regression tests for bundled-update backup recovery
Three tests covering: a stale .bak poisoning a failed update's move/restore, an orphaned .bak misread as a user deletion, and a partially written dest blocking restore-on-failure. All three fail on current main without the fix.

Refs #44942
2026-06-13 15:01:42 -07:00
Teknium
bf8effad02
fix(utils): copy fallback for atomic replace across devices (#43852)
Fallback from `os.replace` on EXDEV/EBUSY using copy+fsync+unlink while preserving symlink target semantics and metadata.
2026-06-13 14:50:05 -07:00
Teknium
817f392311
feat(read): extract notebook and office documents (#37082)
Add stdlib-only extraction for `.ipynb`, `.docx`, and `.xlsx` in read_file with lazy integration and malformed-document fallback.
2026-06-13 14:42:51 -07:00