The prior assertion `all("turn1" in k or "turn2" in k for k in keys)` was
weak on two counts: it passes vacuously when keys is empty (a regression
that lost all state would slip through), and after turn 2 finalizes only
turn 1 lingers, so it only ever inspected turn 1 anyway. Replace it with an
exact check that one key survives, it is turn 1, and turn 2 never merged
into it — the real isolation invariant the test name claims.
Scoping the trace key by turn_id (the prior commit) fixed cross-turn
collisions but introduced a slow leak: _finish_trace only pops a key when a
turn ends cleanly (final response has content and no tool calls), so any
turn that is interrupted, ends on a tool call, or has empty final content
now leaves its uniquely-keyed entry in _TRACE_STATE forever. Previously the
constant per-session key was overwritten by the next turn, capping growth at
~1 entry per session.
Add an LRU cap (_MAX_TRACE_STATE) enforced by _evict_stale_locked, called
under _STATE_LOCK immediately before each insert. It evicts the
least-recently-updated entries (using the previously-dead last_updated_at
field) and ends their root span so nothing dangles. Regression test drives
50 non-finalizing turns against a cap of 8 and asserts the dict stays bounded
with the most-recent turns surviving.
The turn- and api-scoped branches each repeated the same
task/session/thread fallback ladder with only the infix differing. Extract
the shared prefix into _scope_prefix so a future scope dimension touches one
ladder instead of three. The legacy branch still returns a bare task_id (not
the task: prefix) for backward compatibility, so it stays separate.
Output key strings are unchanged; a new test pins them across every
task/session/turn/api combination since the keys are matched across hooks
and any drift would silently break trace finalization.
Cleanup pass on the salvage (behavior-preserving):
- diff_bundled_skill now uses the existing _skill_file_list() helper
instead of reimplementing the rglob/is_file/relative_to file-set
enumeration inline (twice).
- Extract _is_tracked_user_modification(origin_hash, user_hash) and use
it in BOTH the sync loop and list_user_modified_bundled_skills() so the
'kept user edit' rule can't drift between the two sites.
- _read_text_for_diff -> _read_for_diff returns (bytes, text); the binary
branch now compares the bytes it already read instead of re-reading
both files from disk.
- Drop the unused 'user_present' key from diff_bundled_skill's return
contract (no consumer or test ever read it).
- test_update_modified_notice: drop the brittle '>= 2 sites' count-floor
so consolidating the two print paths into a shared helper stays a
welcome refactor; keep the per-site 'count notice => discovery hint'
invariant (still mutation-tested).
The PR added helper-level tests for _trace_key but nothing exercised the
keys through the real hooks. This adds TestTurnTraceIsolation, which drives
on_pre_llm_request / on_post_llm_call across two turns of one gateway
session (task_id == session_id, unique turn_id, api_call_count reset per
turn) and asserts each turn opens its own root trace when the first turn
fails to finalize (tool-only final step). This test fails on the pre-fix
code (only one trace opened, turn 2 absorbed into turn 1) and passes with
the scoping fix.
Also pins the turn_id-over-api_request_id key precedence: the turn-scoped
post_llm_call carries no api_request_id, so it must still resolve to the
same key as the request-scoped hooks or finalization breaks.
Salvage follow-up to the cherry-picked feat/test commits:
- W1: the unpack/install update path in main.py printed the
'~ N user-modified (kept)' notice without the new
'hermes skills list-modified' hint that the git-pull path got.
Mirror the hint to both sites so the count is actionable
regardless of which update path runs.
- W2: 'hermes skills diff <name>' (bundled-vs-stock) now shares the
verb with the gateway write-approval 'diff <id>'. The gateway
handler's docstring + truncation message pointed users to
'/skills diff <id>' on the CLI, which now resolves a bundled skill
by that name instead. Point at the pending JSON file and note the
two diff commands are distinct.
- Add an invariant test asserting every 'user-modified (kept)' notice
in main.py carries the discovery hint (guards sibling drift).
Exercises the real sync pipeline (no mocked comparison logic): a pristine
synced skill is not flagged; an edited one is listed and diffed (modified +
added files); an unknown skill returns not-ok; and `reset --restore` clears
the modified state so revert and discovery stay consistent.
`hermes update` keeps (won't overwrite) bundled skills the user edited
locally, but only printed a count — "~ N user-modified (kept)" — with no way
to learn which skills, or see what changed. Reverting already existed
(`hermes skills reset <name> [--restore]`); discovery and inspection did not.
Add two CLI commands (zero model-tool footprint), reusing the manifest
origin-hash that sync already maintains:
- `hermes skills list-modified [--json]` — list the bundled skills whose
on-disk copy diverges from the last-synced origin hash (the exact test the
sync loop uses to decide what to skip).
- `hermes skills diff <name>` — unified diff between the user's copy and the
current bundled (stock) version, so the user can confirm what changed
before reverting.
Both are mirrored as `/skills list-modified` and `/skills diff`. The
`hermes update` notice now points at `hermes skills list-modified`. Core
helpers `list_user_modified_bundled_skills()` and `diff_bundled_skill()` live
in tools/skills_sync.py alongside the existing reset logic.
Follow-up cleanup on the OpenViking setup path merged in #48262:
- _write_ovcli_config now uses utils.atomic_json_write(path, data, mode=0o600)
instead of the local _precreate_secret_file + write_text + chmod sequence.
The shared helper (already used by honcho/mem0/supermemory/hindsight) writes
via temp-file + fchmod(0600) + fsync + os.replace, so the ovcli.conf is
written atomically (no half-written secret file on crash) and with no
chmod-after-write TOCTOU window. _precreate_secret_file stays for the .env
writer path.
- Remove dead _DEFAULT_ACCOUNT/_DEFAULT_USER constants (0 references; the
empty->'default' tenant fallback lives in the _VikingClient constructor).
Tests: tests/plugins/memory/test_openviking_provider.py + test_memory_setup.py
+ openviking_plugin/test_openviking.py -> 130 passed; ruff clean.
PR infographics are decorative visual hooks for a PR body, not repo
artifacts. The established convention (commit 5772e638c, "chore: drop
in-repo infographic/ directory; keep PR-body URLs only", #30854) is to
hotlink an externally-hosted image so GitHub camo-proxies it inline,
leaving zero binary footprint in the tree.
Two such assets had been committed anyway and are referenced nowhere in
the codebase:
- docs/assets/ns504-chat-session-reconnect.png (1024-equiv, NS-504 PR
infographic, added in #47674 alongside the ChatPage.tsx fix)
- infographic/kanban-db-corruption-defense/infographic.png (re-added a
directory #30854 had explicitly removed, in #30952)
Both are unreferenced decorative infographics, so removing them has no
effect on docs, website, or app builds. Removing the latter also clears
the stray top-level infographic/ directory that #30854 had retired.
These blobs remain in history (the commits that introduced them are
already on main and bundled with real code, so they can't be dropped);
this just removes them from the working tree going forward.
Resolves conflicts from the OpenViking churn that merged after #32445 was
opened (#48042/#47662 session-switch + write hardening, #47311/#47973):
- plugins/memory/openviking/__init__.py: keep both __init__ field groups
(the PR's _runtime_start_* alongside main's _prefetch_threads/_shutting_down).
- tests/plugins/memory/test_openviking_provider.py: keep BOTH the PR's new
setup-validation tests and main's session-switch/concurrency tests (disjoint
additions to the same region).
Two fixes layered while reconciling (contributor work otherwise preserved):
- Restore the merged tenant-header contract (#22414/#21232). The PR had changed
_VikingClient defaults to '' and made empty account/user OMIT the tenant
headers; main's contract is that empty falls back to 'default' and the
X-OpenViking-Account/User headers are ALWAYS sent (ROOT API keys need them).
Reverted the constructor to 'account or os.environ.get(..., "default")' and
updated the two PR tests that asserted the omit-when-empty behavior.
- Close a secret-file TOCTOU in the setup writers. _write_env_vars and
_write_ovcli_config wrote the api_key/root_api_key file and chmod 0600
AFTERWARD, leaving a world-readable window on newly-created files. Added
_precreate_secret_file() to create with 0600 before any secret bytes land.
* fix(dashboard): stream file uploads via multipart instead of base64 JSON
The dashboard file manager uploaded files (including backup/restore zip
archives) by reading them client-side with FileReader.readAsDataURL and
POSTing a base64 data URL inside a JSON body to /api/files/upload. For a
large backup this (a) inflates the payload ~33%, (b) buffers the whole
file plus its decoded copy in memory, and (c) reliably trips an upstream
proxy body-size/timeout limit, surfacing as a 502 with the upload
appearing to hang indefinitely (NS-501). Dashboard-only hosted users have
no shell fallback to place the archive, so backup restore was unusable.
Add a streaming multipart endpoint POST /api/files/upload-stream
(UploadFile + Form) that reads the request body in 1 MiB chunks straight
to a sibling temp file, enforces the existing 100 MB size cap as it
streams (413 on overflow, before buffering the whole file), and
atomically renames into place so a partial/aborted/over-limit upload
never clobbers an existing file. The frontend api.uploadFile now sends
multipart/form-data (raw bytes, no base64, browser-set boundary) and
FilesPage passes the File object directly; the dead readAsDataUrl helper
is removed. The legacy base64 JSON endpoint stays for backward compat.
FastAPI's UploadFile/Form require python-multipart, which is NOT pulled in
by fastapi itself, so it is added to the base deps, the [web] extra, and
the tool.dashboard lazy-install set (kept in sync).
Validated: 5 new endpoint tests (roundtrip, multi-chunk >1 MiB,
over-limit 413 without clobbering + no temp-file leak, overwrite=false
conflict, forced-root traversal containment); existing base64 tests still
pass; web typecheck + vite build clean; and a real uvicorn server E2E
(5 MB multipart upload -> HTTP 200 in 0.21s, exact byte match) plus a
30 MB TestClient roundtrip confirm constant-memory streaming end to end.
Reported via beta (NS-501).
* build(deps): regenerate uv.lock for python-multipart (NS-501)
CI ran uv lock --check / uv sync --locked which failed because the
python-multipart dependency add was not reflected in uv.lock. Regenerate
the lockfile (resolves to 0.0.20, matching the [web] extra pin) after
merging current main.
Importing a backup wrote every file from the zip over the target home
wholesale. On a hosted instance this clobbered gateway_state.json with the
source machine's last recorded run/desired state — driving the container-boot
reconciler (container_boot._read_desired_state, which only auto-starts a
gateway whose state is "running") off stale/foreign state and leaving the
gateway stuck "starting", disconnected from the Nous portal.
Add _IMPORT_SKIP_NAMES (gateway_state.json, gateway.pid, cron.pid,
gateway.lock, processes.json) and skip them by basename in run_import, so both
the root profile and named profiles preserve the target's own runtime state.
This mirrors what container_boot._STALE_RUNTIME_FILES already sweeps on every
container boot, and protects against older backups that predate the
backup-side exclusions. The import summary reports which files were preserved.
This is the second half of NS-501 (filed separately as NS-508): the upload
502 was fixed in #47663; this fixes the import-breaks-the-instance half.
The gateway half of relay Phase 3. On a MANAGED boot with relay configured and
no secret pinned, the runtime self-provisions its relay credentials IN-PROCESS:
resolve the agent's own Nous access token (resolve_nous_access_token) -> POST
the connector's /relay/provision asserting its own endpoint + route keys ->
set GATEWAY_RELAY_ID/SECRET/DELIVERY_KEY into os.environ so the immediately-
following register_relay_adapter() reads them and dials out authenticated.
No human, no enrollment token, no disk write — the creds live only in process
memory (save_env_value refuses under managed anyway, and keeping the secret off
any volume is the stronger posture). Stateless: process-env creds don't survive
a restart, so a managed container re-provisions every boot; the connector's
rotation window covers a still-connected prior instance. An explicitly-pinned
GATEWAY_RELAY_SECRET is respected (skip). Self-hosted is unchanged: humans keep
using `hermes gateway enroll`.
Endpoint provenance is gateway-asserted (GATEWAY_RELAY_ENDPOINT +
GATEWAY_RELAY_ROUTE_KEYS, env or gateway.relay_* config) — uniform code path
whether the operator sets it (self-hosted) or NAS stamps it (hosted, the only
case NAS knows the public URL). Both absent -> outbound-only provisioning
(credentials, no inbound routes). The connector scopes the asserted endpoint to
the verified tenant, so it stays within the security model.
- gateway/relay/__init__.py: relay_endpoint(), relay_route_keys(),
_provision_url(), _post_provision(), self_provision_if_managed() (never
raises — a provision failure logs and boots without relay auth).
- gateway/run.py: call self_provision_if_managed() immediately before
register_relay_adapter() in the startup path.
Tests: 12 unit (trigger logic, respect-pinned-secret, in-process env wiring,
endpoint+routes vs outbound-only, fail-soft on token/connector failure);
mutation-checked (drop is_managed guard / pinned-secret guard -> tests fail).
Cross-repo live E2E driver lands on the connector side (depends on this).
EXPERIMENTAL: relay auth scheme may change until >=2 Class-1 platforms validate.
The install method (docker/git/pip/...) describes the *running binary*, but
detect_install_method() read it from $HERMES_HOME/.install_method — a shared
DATA directory. The Docker docs deliberately bind-mount $HERMES_HOME
(~/.hermes:/opt/data) so config/sessions/memory persist and can be shared with
a host-side Desktop/CLI install.
When a containerized gateway and a host install share one $HERMES_HOME, the
home-scoped stamp is a single slot describing two installs: the published image
stamps 'docker' on every boot, the host install then reads 'docker' and the
in-app updater refuses to run 'hermes update' ("doesn't apply inside the Docker
container"). Reinstalling the Desktop app from the DMG doesn't help because the
contaminated stamp is re-read every time.
Fix (option 1 — code-scoped stamp):
- detect_install_method() reads <install tree>/.install_method first (next to
the running code, immune to the shared data dir). It falls back to the legacy
$HERMES_HOME stamp for back-compat, but IGNORES a 'docker' home stamp when
not actually containerized — so already-poisoned shared homes self-heal.
- stamp_install_method() writes the code-scoped stamp.
- install.sh stamps $INSTALL_DIR instead of $HERMES_HOME.
- Dockerfile bakes 'docker' into /opt/hermes/.install_method at build time
(inside the immutable block); stage2-hook.sh no longer writes the home stamp
and proactively removes a stale 'docker' one to heal existing shared homes.
Genuine containers still resolve to 'docker' (baked stamp, or legacy home stamp
honored when containerized). Unstamped installs in generic containers still fall
through to git/pip (preserves the #34397 fix).
* feat(relay): authenticate the connector⇄gateway WS channel
The relay gateway may be customer-managed and internet-exposed, so the
connector⇄gateway channel is itself authenticated (distinct from the
platform crypto the relay path sheds). Add gateway/relay/auth.py — a
Python port of the connector's HMAC token + delivery-signature schemes
(relayAuthToken.ts / deliverySigning.ts), verified byte-for-byte against
the connector's compiled TypeScript via cross-language test vectors.
Present an Authorization bearer on the /relay WS upgrade keyed by the
per-gateway secret (resolved from GATEWAY_RELAY_ID / GATEWAY_RELAY_SECRET
in env or config). The connector rejects an unauthenticated/invalid/
revoked upgrade with close 4401.
* feat(relay): signed-HTTP inbound delivery receiver
The connector delivers normalized inbound events to a tenant's gateway
over a signed HTTP POST, not the outbound /relay WS: the connector
instance owning a platform socket is generally not the instance a given
gateway dialed out to, so inbound targets a tenant endpoint that may
load-balance across gateway instances.
Add gateway/relay/inbound_receiver.py — verifies x-relay-signature /
x-relay-timestamp over the EXACT raw request bytes (re-serializing would
break the HMAC: JS JSON.stringify is compact, Python json.dumps spaces)
against the per-tenant delivery key verify list within a 300s replay
window, then dispatches messages to handle_message and interrupts to the
interrupt handler. Wire it into the adapter lifecycle (start in connect()
when a delivery key + bind port are configured, tear down in disconnect();
a purely-outbound dev gateway runs without it).
Refine test_relay_sheds_crypto to distinguish PLATFORM crypto (Discord
ed25519, Twilio/WeCom HMAC — still shed) from the connector⇄gateway
CHANNEL auth (intended): auth.py / inbound_receiver.py are exempt from
the platform-symbol scan but still banned from importing platform-crypto
modules, plus a positive guard that auth.py uses only stdlib hmac/hashlib.
* feat(relay): hermes gateway enroll CLI
Add the gateway half of zero-touch enrollment. `hermes gateway enroll`
resolves a fresh Nous Portal access token (the tenant-proving identity),
POSTs {enrollmentToken, gatewayId} to the connector's /relay/enroll, and
persists GATEWAY_RELAY_ID / GATEWAY_RELAY_SECRET / GATEWAY_RELAY_DELIVERY_KEY
to ~/.hermes/.env. The per-gateway secret authenticates the WS upgrade;
the per-tenant delivery key verifies signed inbound deliveries.
Refuses under is_managed() (hosted installs get the secret stamped in by
the orchestrator). Added as an 'enroll' subcommand on the existing
gateway subparser — not a new top-level command.
* docs(relay): inbound is signed HTTP, not WS; document channel auth
Fix the stale contract: §3/§5 said inbound rode the WS socket (single-
instance only, predates the multi-instance socket-ownership + channel-auth
model). Inbound + connector→gateway interrupt are signed HTTP POSTs to the
tenant endpoint. Add §6.1 documenting the two channel-auth schemes (per-
gateway WS-upgrade secret, per-tenant inbound delivery key) and how they
differ from the platform crypto the relay path sheds.
* test(relay): update build_gateway_parser callers for cmd_gateway_enroll
The enroll subcommand added cmd_gateway_enroll as a required keyword-only
arg to build_gateway_parser, but two existing parser-extraction tests still
called it with only cmd_gateway/cmd_proxy — failing CI with TypeError.
Thread the new handler through both call sites and add a test asserting
`gateway enroll` dispatches to cmd_gateway_enroll with its flags parsed.
* fix(docker): supervised gateway uses --replace to take over stale holder
Inside the s6 container image the per-profile gateway service rendered a
bare `hermes gateway run` (no --replace). When a gateway is started
OUTSIDE s6 — a stray shell `hermes gateway run`, an agent action, or the
Open WebUI helper (scripts/setup_open_webui.sh) — it grabs the
per-HERMES_HOME PID lock first. The supervised slot then execs the bare
`gateway run`, hits the "Another gateway instance is already running"
guard, exits non-zero, and s6 restarts it: a restart loop that floods the
log every ~12s and never binds. The container looks up but the gateway is
permanently down, and dashboard-only users (no shell) cannot recover.
Render the supervised run script as `gateway run --replace` so s6 is
authoritative for its slot: it reaps the stale holder via the hardened
takeover path (takeover marker + SIGTERM->SIGKILL-with-confirmation +
scoped-lock cleanup in gateway/run.py) and binds. This matches the
systemd service path, which already builds its argv with --replace
(_build_gateway_argv / 'nohup hermes gateway run --replace'), and the
intent already documented in _maybe_redirect_run_to_s6_supervision. The
existing HERMES_S6_SUPERVISED_CHILD sentinel still prevents the
run->start->run redirect recursion. Each profile is scoped to its own
HERMES_HOME and s6 guarantees one supervised instance per slot, so there
is no legitimate supervised sibling for --replace to clobber.
Reported via beta (NS-505): gateway.log showed PID 17907 'running
(manual process)' with the guard error repeating every ~12s on
v2026.6.5.
Adds a regression test asserting every gateway-run exec line in the
rendered script (default + named profile, both privilege branches)
carries --replace, and updates the existing render-script assertion.
* fix(ci): remove stray .venv symlink committed into repo
The PR's commit accidentally tracked a .venv symlink pointing at the
developer's local venv (mode 120000 -> /home/ben/nous/hermes-agent/.venv).
The CI test/e2e/build jobs run `uv venv` to create .venv and failed with
`failed to create directory .venv: File exists (os error 17)` because the
checkout already contained the symlink. All test shards aborted in <15s
during setup, before any test ran.
Untrack the symlink and add a bare `.venv` entry to .gitignore (the
existing `.venv/` rule only matches a directory, so a symlink slipped
through).
Salvage corrections on top of @XVVH's #44341:
- Make native web_search injection a 1:1 swap for an already-present client
web_search function, NOT an additive grant. The original unconditionally
appended {"type":"web_search"} on every is_xai_responses turn with any
tools, force-enabling Grok server-side search even when the user never
enabled the web toolset (bypassing Hermes web-provider config + tool-trace
plumbing). Now gated on a client web_search actually being present.
- Reconcile grok-composer context to 200000 (merged in #47908) rather than
262144; 200k is xAI's published usable context window for Composer 2.5,
262144 is the /v1/responses input+output budget.
- Update tests to match scoped behavior + add a no-web-toolset guard test.
- AUTHOR_MAP entry for #44341 salvage.
Incomplete-guard (server-side *_call items at in_progress no longer flip
has_incomplete_items) and preflight built-in-tool allowlist kept as-is.
- model_metadata: grok-composer-2.5-fast → 262144 (OAuth slug not in /v1/models)
- codex transport: inject native {"type":"web_search"} for is_xai_responses;
drop client web_search to avoid duplicate-name 400s
- codex adapter: do not treat in-progress server-side *_call items as incomplete
- tests: adapter, transport build_kwargs, model_metadata, oauth recovery
The desktop self-update runs `hermes update` then `hermes desktop
--build-only`, and only relaunches if the rebuild returns 0. The first
`--build-only` can exit nonzero on a still-settling post-update tree or a
network-blocked Electron fetch that the installer's self-heal repaired
mid-run — so both updaters (the Tauri setup binary and the in-app POSIX
path) bailed before the relaunch step. The update landed but the app
never restarted; a manual launch worked because the heal had completed.
Retry `--build-only` once in both paths before failing, mirroring the
retry-once `hermes update` already does (and the CLI `hermes update`'s
own desktop rebuild). A second run builds clean off the healed dist and
is a near-no-op when the first actually succeeded (content-hash stamp).
- update.rs: retry stage 2; add rebuild_needs_retry() + test
- main.cjs: retry via new update-rebuild.cjs helper (behavior-tested)
Weak open models (mimo, nemotron-class) that see tool-call XML/JSON sitting in
file contents or tool output get primed and emit their own structured tool
calls mimicking the payload — usually with an empty/whitespace name. Those
calls can't be fuzzy-repaired toward a real tool, so the dispatch loop returns
an error and the model retries. Before this fix, every empty-name error dumped
the full tool catalog back to the model, which fed the priming loop more names
to mimic and inflated context 3-4x across the retry budget.
A blank/whitespace-only tool name now gets a terse anti-priming error that
tells the model in-context tool-call syntax is DATA, with no catalog dump. A
genuinely-wrong-but-nonempty name (a real typo) still gets the full catalog so
the model can self-correct.
Not a sandbox/auth boundary issue: Hermes never parses tool-call text from
content into executable calls (structured tool_calls only; the lone text->call
parser is the Copilot ACP transport and it also rejects empty names). The
reporter's own debug dump confirms the injection never executed.
Behavior-contract test added: empty-name -> terse error, no catalog; nonempty
unknown -> catalog preserved. Exercised end-to-end via run_conversation against
an in-process mock provider.
* fix(dashboard): recover the Chat tab when the agent session ends (NS-504)
In the dashboard Chat tab, when the agent process exits — the user types
`/exit`, or starts a new session that ends the current PTY child — the
`/api/pty` WebSocket closes with a normal code (not one of the
4401/4403/4404/4408/1011 rejection codes the server emits). The frontend
handled only those rejection codes; the normal-exit fallback just printed
"[session ended]" into the dead terminal and stopped, with `wsRef` nulled
and no respawn path. The only recovery was a full page refresh — exactly
the beta report ("typing /exit breaks functionality, no way to restart
without refreshing"; "starting a new session completely breaks the
agent").
On a clean/normal close the Chat tab now flips `sessionEnded` and renders
an in-place "Start new session" overlay (mirroring ChatSidebar's existing
reconnect affordance). Clicking it bumps a `reconnectNonce` that is a
dependency of the connect effect, so the effect tears down and re-runs,
spawning a fresh PTY in place — no page refresh. `onopen` clears the
flag so a successful reconnect dismisses the overlay.
An explicit button (rather than auto-respawn) is deliberate: if the agent
is crash-looping, auto-respawn would hide the failure and spin; the user
stays in control.
Verified against a live uvicorn `/api/pty` socket: a child that exits
closes with a non-rejection code (client sees close_code None / 1000-class),
which is precisely the branch that now sets sessionEnded=true. web
typecheck + vite build clean.
Reported via beta (NS-504).
* docs(assets): add NS-504 chat session recovery infographic
* feat(mcp): raise default tool-call timeout 120s -> 300s
Port from openai/codex#28234. Long-running MCP tools (web fetches,
sandboxed builds, deep-research servers) routinely exceed 120s, causing
spurious timeout failures. Codex bumped its default MCP tool timeout from
120 to 300 for the same reason.
- _DEFAULT_TOOL_TIMEOUT 120 -> 300 in tools/mcp_tool.py (per-server
'timeout' config override unchanged)
- update test_default_timeout assertion
- document the default in mcp-config-reference.md
* fix(dump): show commit date instead of release date in hermes dump
The version line in `hermes dump` (the top of the /debug report) appended
the package release date in parentheses, which reads like a wall-clock
"generated at" timestamp and confuses support triage. Replace it with the
date the HEAD commit was actually made, resolved live via
`git log -1 --format=%cd --date=short`, kept next to the commit SHA.
On Docker/wheel installs with no .git the date resolves to '' and the
suffix is simply omitted (the baked SHA still identifies the build).
* fix(desktop): resolve electronDist dynamically + self-heal blocked installs
Supersedes the static-path approach (#48081) and the install-step self-heal
(#48082) with a fix that removes the whole failure class instead of chasing each
symptom. Three distinct faults converged into the June desktop-build outage; this
closes all three.
Root cause (the part #48081 left open — "Gap B"):
build.electronDist was a static relative path in apps/desktop/package.json, but
npm workspace hoisting is NOT deterministic — depending on the npm version and
what else is installed, npm nests the workspace-only electron devDep under
apps/desktop/node_modules/electron OR hoists it to the repo root. A static path
matches only one layout, so a clean install intermittently fails with "The
specified electronDist does not exist". #48081 re-pointed the path at the
nested layout (correct today) but electron-builder reads electronDist
STATICALLY, so any future hoist change silently breaks it again — only caught
by a CI invariant, never self-corrected.
Fix:
- scripts/run-electron-builder.cjs: resolve electron the way Node's runtime does
— require.resolve("electron/package.json") walks node_modules from the desktop
project upward and finds electron wherever npm actually put it. The path can
never drift out of sync with the install layout again, on any OS/npm version.
* dist present -> pass -c.electronDist=<abs>/dist so electron-builder reuses
the unpacked runtime (keeps the #38673 fast path that dodges the 26.8.x
missing-binary re-unpack bug).
* dist absent -> omit electronDist; electron-builder fetches Electron itself
via @electron/get honoring electronVersion + ELECTRON_MIRROR.
package.json: builder script now runs the wrapper; the static build.electronDist
is removed (the resolver owns it).
- main.py / install.sh / install.ps1: on a dependency-install failure where the
electron package staged but its dist is missing (electron's install.js
process.exit(1) on a blocked/throttled binary download — #47266/#47917/#48021),
repopulate the dist via electron's downloader (canonical, then npmmirror.com)
and CONTINUE to the build instead of aborting. npm runs postinstall LAST, so
the only casualty is electron/dist; bailing here is what made the pack-time
mirror self-heal unreachable on a blocked network. Hard-fail only when electron
never staged at all (a genuine dependency error).
- The pack-time mirror fallback now retries the build even when the pre-fetch
can't populate the dist: the wrapper lets electron-builder download Electron
itself via the mirror, so the retry is no longer a no-op (it was, when
electronDist was a static path).
The exact 40.10.2 pin (already on main) keeps the third mode — the native
@electron-internal/extract-zip win32 binding that 40.10.3/40.10.4 ship without a
published prebuild — from recurring.
Tests:
- test_desktop_electron_pin.py: replace the static-path-matches-lockfile
invariant with contracts that there is no hardcoded electronDist to drift, the
builder script routes through the resolver, and the resolver uses Node module
resolution + injects -c.electronDist.
- test_gui_command.py: install-failure self-heal continues to build; genuine
(electron-never-staged) install failure still hard-fails; pack retries under
the mirror even when the pre-fetch is blocked.
Salvages/supersedes the overlapping community work in #48003 (sitkarev),
#48012 (omegazheng), #48033 (james47kjv), and #48082.
Co-authored-by: sitkarev <59806492+sitkarev@users.noreply.github.com>
Co-authored-by: omegazheng <zheng@omegasys.eu>
Co-authored-by: james47kjv <220877172+james47kjv@users.noreply.github.com>
* fix(desktop): narrow Electron self-heal to real missing-dist failures
Follow-up on #48091 to remove the remaining misdiagnosis risk from the
installer/build fallback path (#46785 concern): only take the Electron
repair/retry path when Electron's package files are staged and dist is actually
missing/corrupt.
- main.py: add _electron_pkg_staged_missing_dist() and use it to gate install
failure recovery; fail fast for unrelated npm install errors.
- main.py/install.sh/install.ps1: run cache purge + retry only when dist is
missing; do not retry unrelated tsc/vite/build failures under an
Electron-specific narrative.
- install.sh/install.ps1: tighten install-stage self-heal guard to require both
package.json + install.js and missing dist.
- tests: add coverage that install failure hard-fails when Electron dist already
exists, and update retry test to reflect the tightened recovery condition.
Validation:
- Python tests: 64 passed
- install.sh-related tests included in the run
- Real mac build on this machine:
- npm ci at repo root: success
- cd apps/desktop && npm run pack: success
- electron-builder packaged darwin arm64 and used custom unpacked Electron dist
* refactor(desktop): trim electron self-heal helpers and comments
Deduplicate mirror-retry into _try_redownload_electron_dist / shell
counterparts; shorten wrapper and install-script commentary without
changing recovery semantics.
---------
Co-authored-by: sitkarev <59806492+sitkarev@users.noreply.github.com>
Co-authored-by: omegazheng <zheng@omegasys.eu>
Co-authored-by: james47kjv <220877172+james47kjv@users.noreply.github.com>
- test_ws_transport.py: drives WebSocketRelayTransport against a REAL in-process
websockets server (not a mock socket): handshake (hello->descriptor), inbound
frame -> handler, outbound request/response correlation, follow_up routing,
and clean disconnect failing pending waiters. Skips if websockets is absent.
- test_relay_registration.py: rewritten for the config-driven gate — registers
when GATEWAY_RELAY_URL is set / an explicit url is passed / force=True; no-op
without a URL; trailing slash stripped; adapter constructs through the registry.
Full relay suite: 57 passed.
Wire the relay adapter into gateway startup and make activation config-driven
instead of a dark-launch flag.
- gateway/relay/__init__.py: replace relay_enabled()/HERMES_GATEWAY_RELAY with
relay_url() (GATEWAY_RELAY_URL env or gateway.relay_url in config.yaml) — the
same shape as gateway.proxy_url. register_relay_adapter() registers when a URL
is configured and builds a live WebSocketRelayTransport; with no URL it's a
no-op (direct/single-tenant deployments unaffected). force=True keeps the
transport-less adapter for unit tests. relay_platform_identity() reads the
hello platform/botId from GATEWAY_RELAY_PLATFORM/GATEWAY_RELAY_BOT_ID.
- gateway/run.py: call register_relay_adapter() during GatewayRunner.start(),
right after plugin discovery, so a configured connector relay is registered
on every boot. Failures are logged, never block startup.
This removes the dark-launch posture: the relay is on whenever it's configured,
shipping the production end state rather than hiding it behind a flag.
Adds the concrete transport behind the RelayTransport Protocol — the missing
'later-phase work' the relay scaffold deferred. The gateway dials OUT to the
connector over a WebSocket and speaks the newline-delimited JSON frame protocol
(docs/relay-connector-contract.md; connector src/relay/protocol.ts):
- connect(): opens the ws, sends hello{platform,botId}, starts a background
read loop, and resolves handshake() when the connector's descriptor frame
arrives.
- inbound frames -> the registered InboundHandler (rebuilt into a MessageEvent
via _event_from_wire, mapping the snake_case SessionSource wire form back
onto the gateway dataclasses).
- send_outbound / send_follow_up / get_chat_info: request/response correlated
by a uuid requestId against a per-request future, with a timeout so a caller
never hangs; send_interrupt is fire-and-forget.
- disconnect(): cancels the reader, closes the ws, and fails any in-flight
outbound waiters with a structured error.
RelayAdapter.connect() now negotiates the real CapabilityDescriptor from the
transport and adopts it (_apply_descriptor updates MAX_MESSAGE_LENGTH +
markdown surface), replacing the construction-time placeholder. Lazy
'import websockets' mirrors gateway/platforms/feishu.py; WEBSOCKETS_AVAILABLE
gates construction.
The contract's §6 still said the connector 'forwards the signed body
byte-for-byte so the gateway's existing crypto validates against unmodified
bytes.' That model is incoherent under an untrusted, disposable tenant
gateway on a shared bot:
- re-validating Twilio HMAC / WeCom crypto needs the shared signing secret
(handing it over IS the cross-tenant leak),
- WeCom payloads are encrypted with that secret (the connector must decrypt
at the edge just to route),
- a Discord interaction token lives inside the signed body — you can't both
preserve the bytes and strip the credential.
Rewrites §6 to the actual model: the connector is the SOLE crypto/identity
boundary — verifies/decrypts at the edge, normalizes to a tenant-scoped
MessageEvent, strips shared-identity capabilities into its vault, and
forwards only the sanitized event. The gateway re-validates nothing (the
invariant test from the crypto-shed commit enforces this). Notes that this
unifies the passthrough + relay planes and points to the connector repo's
capability-trust-boundary.md.
Also documents the follow_up op in §4 (token-less capability action added
in the previous commit). The conformance test (§2/§3 tables) stays green;
contract is unpublished/EXPERIMENTAL so no version-bump ceremony. 55 passed.
The relay outbound surface had send/edit/typing but no way to act on a
SHARED-identity capability (e.g. a Discord interaction follow-up token,
~15min) that the connector captured + stripped at the edge. Under A2 that
credential never reaches the gateway, so the gateway can't just 'send with
the token' — it needs a semantic op naming the session it's already in.
Adds the follow_up op end to end on the gateway side:
- RelayTransport.send_follow_up(action): protocol method. Action carries
op='follow_up' + session_key + kind + content (+ metadata) and NO token.
- RelayAdapter.send_follow_up(session_key, kind, content, metadata): builds
that action and returns a SendResult. The connector resolves the real
capability (its resolveOutboundCapability), enforces the tenant match so
tenant B can't wield tenant A's capability, and egresses; success=False
when the capability is absent/expired/mismatched (nothing to retry — a
leaked gateway holds zero capability material).
- StubConnector records follow_ups + a canned next_follow_up_result.
Tests: round-trips without a token; the wire action carries only session
refs (no credential value field — the 'kind' string is a type ref, not the
secret); failure surfaces when the connector can't resolve; no-transport
fails cleanly. 55 passed. §4 doc entry follows in the contract-rewrite commit.
Under the A2 trust model the connector is the SOLE crypto/identity
boundary: it verifies/decrypts every inbound platform payload at the edge
(it holds the tenant secrets), normalizes to a tenant-scoped MessageEvent,
and forwards only the sanitized event. The gateway re-validates nothing —
it cannot without being handed the shared signing secret, which on a
shared bot is itself the cross-tenant leak.
The relay path already imports no platform-crypto today; this locks that
in as an enforced invariant so nobody bolts re-validation (Discord
ed25519, Twilio HMAC, WeCom BizMsgCrypt, generic webhook signature checks)
onto the relay later and silently re-couples the gateway to platform
secrets it must never hold. Verification stays in the direct platform
adapters (gateway/platforms/*) which serve non-relay deployments.
- test_relay_package_imports_no_platform_crypto: AST-walks gateway/relay/*
and fails on any import of a platform-crypto/verification module.
- test_relay_package_calls_no_signature_verification: fails on any
verification-symbol reference (ed25519/hmac/bizmsg/verify_*).
Invariants (assert the relation 'relay re-validates nothing'), not frozen
snapshots. Verified the guard bites: injecting a wecom_crypto import makes
it fail, removing it goes green. docs §6 rewrite follows in a later commit.
The Phase 1 exit gate requires BOTH Discord and Telegram to round-trip
through the relay stub, but test_relay_roundtrip.py only covered Discord.
Add the Telegram companion exercising its distinct discriminator profile:
- no guild_id — two chats isolate on chat_id alone
- forum topics share one chat_id and isolate by thread_id (the Telegram
analog of Discord per-guild isolation), shared across participants by
default (thread_sessions_per_user=False)
- DM isolation by chat_id
- utf16 len_unit + markdown_v2 dialect round-trip and configure the adapter
- outbound send round-trips through the stub
Proves the CapabilityDescriptor + build_session_key generalize beyond
Discord, not just the struct (which the descriptor unit tests already
covered).
Add an invariant test pinning docs/relay-connector-contract.md to the
Python source of truth so the doc (which the connector repo mirrors by
hand) cannot silently drift:
- CapabilityDescriptor §2 table ⟷ dataclass fields + required/optional
- SessionSource wire keys (to_dict output) ⟷ §3 documented fields
- per-platform discriminator columns exist as real SessionSource fields
- guard that is_bot stays off the wire until deliberately promoted
Writing the test surfaced a real gap: §3 only enumerated 5 discriminators
in its per-platform table while to_dict() emits 12 keys. Seven wire keys
the connector must populate (chat_name, chat_topic, user_id_alt,
chat_id_alt, parent_chat_id, message_id, user_name) were undocumented —
a connector author reading the doc would never know to set them. Added a
complete SessionSource wire-field table to §3. The connector's existing
contract.ts already carries all 12, so no connector change is needed; the
doc was the lagging artifact.
The platform-connected-checker invariant test requires every built-in
Platform enum member to have either a generic token path or a bespoke
entry in _PLATFORM_CONNECTED_CHECKERS. Platform.RELAY was added without
one, so test_all_builtins_have_checker_or_generic_token_path failed.
Relay dials OUT to a connector and is 'connected' once an endpoint URL
is configured (extra['relay_url'] or extra['url']); the capability
descriptor is negotiated at handshake time, so the URL is the only
config-level signal in the experimental phase. Add the checker plus a
synthetic-config case exercising its True path.
CI guard: fails if gateway/ or plugins/ ever imports the test-only stub
connector or defines StubConnector. Matches code leaks (imports / class defs),
not prose mentions, so the transport.py docstring reference to the stub's path
is allowed.
Phase 1 complete. Task 1.6 of the gateway-relay plan.
Formal interface between the Hermes gateway (RelayAdapter) and the Node
connector repo: handshake, CapabilityDescriptor field table, MessageEvent
inbound envelope with per-platform SessionSource discriminators (Discord
guild_id is REQUIRED for server isolation), outbound action set, /stop
interrupt routing, signed-body verify-at-edge/byte-preserving rule, and the
additive-only contract_version policy. Documents bot-identity-vs-tenant
separation so single-bot consolidation (Phase 6) stays open. Read-first
artifact for the connector implementer.
Phase 1, Task 1.5 of the gateway-relay plan.
RelayAdapter.on_interrupt(session_key, chat_id) bridges a connector-delivered
mid-turn /stop into the existing interrupt_session_activity path, setting the
per-session _active_sessions Event and clearing typing — cancelling exactly the
targeted session's turn without touching siblings (mirrors test_stop_thread_
sibling isolation). Transport.send_interrupt carries the gateway-side egress to
the connector for socket-owner routing.
Phase 1, Task 1.4 of the gateway-relay plan.
register_relay_adapter() registers the generic 'relay' platform via the same
PlatformRegistry path as plugin adapters — no core dispatch changes. OFF by
default (dark-launch): only registers when HERMES_GATEWAY_RELAY is truthy (or
force=True for tests), so existing single-tenant/direct deployments are
unaffected. Factory builds a transport-less RelayAdapter with a placeholder
descriptor; the real descriptor is negotiated at handshake.
Phase 1, Task 1.3 of the gateway-relay plan.
Defines RelayTransport (lifecycle/handshake/inbound/outbound/interrupt) as the
gateway<->connector wire contract; RelayAdapter.connect now registers an inbound
handler that bridges connector-delivered MessageEvents into handle_message.
Adds an in-memory StubConnector under tests/ and an E2E round-trip proving:
connect registers the handler, inbound events reach the adapter, guild_id drives
build_session_key isolation (two guilds -> two keys; same guild/channel/user ->
one), outbound send round-trips, get_chat_info is proxied.
Phase 1, Task 1.2 of the gateway-relay plan.
One BasePlatformAdapter subclass that reads its capability profile from a
CapabilityDescriptor: MAX_MESSAGE_LENGTH attribute, message_len_fn (table-driven
by len_unit: chars=len, utf16=Telegram-style code units), supports_draft_streaming.
Implements the four abstract methods (connect/disconnect/send/get_chat_info) by
delegating to an injected RelayTransport (full protocol lands in Task 1.2). Adds
Platform.RELAY enum member. No per-platform gateway code.
Phase 1, Task 1.1 of the gateway-relay plan.
CapabilityDescriptor.from_platform_entry() projects an existing PlatformEntry
(label, max_message_length, emoji, platform_hint, pii_safe, name) into a
descriptor, proving the descriptor is a projection of existing config rather
than a parallel concept. Runtime-only capabilities (len_unit, draft/edit/
thread/markdown) are caller-supplied. max_message_length==0 ('no limit') maps
to the stream_consumer 4096 default.
Phase 0 complete. Task 0.3 of the gateway-relay plan.
Behavioral regression harness locking the capability surface that the future
RelayAdapter must reproduce: the abstract-method set (connect/disconnect/send/
get_chat_info), message_len_fn default, supports_draft_streaming default, and
the stream_consumer MAX_MESSAGE_LENGTH attribute read. Passes on main before
any RelayAdapter exists.
Phase 0, Task 0.1 of the gateway-relay plan.