A GPT-5 model rejecting max_tokens returns a 400 whose message contains the
literal substring 'max_tokens' — one of the _CONTEXT_OVERFLOW_PATTERNS. The 400
path in _classify_400 checked overflow patterns before any request-validation
check (which only existed on the 5xx path), so the parameter error was routed
into the compression loop, re-sent with the same bad param, and ended in
'Cannot compress further' on a tiny context.
Hoist a request-validation guard (unsupported/unknown parameter) above the
context-overflow check in _classify_400. Deliberately excludes the generic
invalid_request_error code, which OpenAI also stamps on real overflow 400s, so
genuine overflows still compress. Pairs with the max_completion_tokens param
fix that stops the bad request at the source.
Also adds AUTHOR_MAP entry for the salvaged PR #13902 commit.
The per-file test runner re-runs a file once when pytest exits 4 ("file or
directory not found") while the file exists on disk — a transient seen on
loaded shared CI runners where the planner collects a file (--collect-only
counts its tests) but the per-file subprocess fails to stat it moments later.
A single immediate retry could land in the same brief high-load window and
fail again, and the retry was gated on one Path.exists() check that can itself
be a flaky stat under that load — so a freshly-added test file that LPT pins to
one shard would deterministically red that shard on every run (no actual test
failure; the file just never executes).
- Extract the subprocess spawn/communicate/process-tree-kill logic into a
shared _spawn_pytest_once() helper (removes ~90 lines of duplication between
the primary run and the retry).
- Replace the single-shot retry with a bounded backoff loop
(_EXIT4_RETRY_ATTEMPTS, escalating sleep) that re-runs while the file is
present on disk.
- Add _file_present() which re-checks existence across a few spaced stats, so a
single flaky negative stat doesn't wrongly conclude the file is missing. A
genuinely-missing file (typo/deleted) still fails fast — exit 4 is not
swallowed when the file truly does not exist.
- Tests: transient-then-pass recovery, genuinely-missing fails fast with no
retry, give-up after max attempts, and _file_present transient/missing cases.
* docs(windows): correct native data dir to %LOCALAPPDATA%\hermes
The Windows-native guide claimed a deliberate split where config, auth,
skills, and sessions live under %USERPROFILE%\.hermes. That is not what
the installer does: scripts/install.ps1 sets HERMES_HOME=%LOCALAPPDATA%\hermes,
so data actually lives in %LOCALAPPDATA%\hermes alongside the disposable
install (the hermes-agent\, git\, node\, bin\ subdirectories) — `hermes
config` confirms config.yaml/.env resolve there, not under %USERPROFILE%.
Update the data-layout table, the "split is deliberate" note, the env-var
and uninstall sections to describe the real layout: data and install share
the %LOCALAPPDATA%\hermes root, reinstall only replaces hermes-agent\, and
a full wipe targets %LOCALAPPDATA%\hermes (with %USERPROFILE%\.hermes kept
only as a legacy/WSL cleanup). Mention HERMES_HOME as the override knob.
* docs(windows): fix PATH + bin layout to match installer
The installer adds hermes-agent\venv\Scripts (where hermes.exe lives) to
User PATH and sets HERMES_HOME — not %LOCALAPPDATA%\hermes\bin. The \bin
dir holds Hermes's managed uv.exe, not a hermes.cmd shim. Correct the
install-step list and the data-layout table accordingly.
* fix(install): show real HERMES_HOME path in setup messages
The native Windows installer wrote config/env/skills under $HermesHome
(%LOCALAPPDATA%\hermes) but its success messages claimed ~/.hermes,
which doesn't exist on native Windows. Print the actual paths so a new
user can find their config, .env, and skills.
* fix(desktop): rebind sessions after websocket reconnect
* docs(desktop): explain the reconnect-resume guard in use-route-resume
The reconnect fix turns on two subtle conditions with no inline rationale:
`seenGatewayStateRef` suppresses a spurious "became open" on the first effect
run (so a session mounting with the gateway already open doesn't double-resume),
and the `gatewayBecameOpen ||` arm forces a re-resume even when the route looks
`alreadyActive` because the cached runtime id can be stale after the gateway
rebinds/reaps the session. Comment both so the next reader doesn't "simplify"
them back into the original bug. No behavior change.
---------
Co-authored-by: Josh Dow <josh.dow@prepad.io>
* fix(install): self-heal a stuck Electron download on the desktop build
The desktop build downloads Electron (~114MB) from GitHub. A corrupt cached
zip, or a blocked/throttled GitHub release host (the repeating "retrying" log),
hard-failed the install — and install.sh had no recovery at all while
install.ps1 / `hermes desktop` only purged the cache.
All three build paths now escalate on a failed `npm run pack`:
GitHub → purge corrupt electron-*.zip + stale *-unpacked and retry → one retry
via a public Electron mirror (npmmirror.com). @electron/get SHASUM-verifies the
download, and a user-pinned ELECTRON_MIRROR is always respected (never
overridden). Adds a bash clear_electron_build_cache()/_desktop_pack() to mirror
the existing PowerShell/Python helpers.
* test(install): cover the Electron mirror fallback
Verify `hermes desktop` falls back to a mirror when the cache purge finds
nothing, and that a user-pinned ELECTRON_MIRROR is respected (no extra attempt,
not overridden).
* docs(desktop): troubleshoot a stuck Electron download
Document the automatic cache-purge + mirror fallback, how to pin your own
ELECTRON_MIRROR, and how to clear a corrupt cached zip by hand.
* docs(install): correct the Electron mirror trust framing
The mirror-fallback comments and the desktop troubleshooting doc implied
`@electron/get`'s SHASUM check makes the npmmirror.com download safe against
tampering. It doesn't: the SHASUMS256.txt is fetched from the same mirror, so
the check guards against a corrupt/partial download, not a compromised mirror.
Reframe all four surfaces (install.sh, install.ps1, `hermes desktop`, and the
docs) to state the trust trade-off honestly — npmmirror.com is the de-facto
Electron community mirror, we only fall back to it after the canonical GitHub
download fails, and a user-pinned ELECTRON_MIRROR is never overridden. No
behavior change.
---------
Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com>
* fix(photon): override transitive CVEs in the sidecar deps
`npm audit` flagged 7 high-severity transitive CVEs (protobufjs code injection
GHSA-66ff-xgx4-vchm + outdated @opentelemetry OTLP exporters) pulled in via
spectrum-ts -> @photon-ai/otel. npm's suggested fix downgrades spectrum-ts to a
version that targets the decommissioned spectrum host, so instead pin patched
versions via `overrides` (protobufjs 8.6.1, @opentelemetry/* 0.218.0) without
touching spectrum-ts. `npm audit` -> 0; spectrum-ts + provider still import.
* fix(photon): harden the sidecar bridge + bound the dedup cache
- constant-time sidecar control-token comparison (was `!==`, timing-attackable).
- cap the control-channel request body (2 MiB) so a compromised local peer can't
OOM the sidecar.
- wrap the inbound gRPC stream consumer in a re-subscribe loop with capped
exponential backoff + jitter — if the async iterator throws/ends it would
otherwise stop inbound forever (the adapter dedupes any replay).
- add an unhandledRejection handler so a stray rejection logs instead of killing
the process.
- dedup cache (adapter) was a true bounded LRU only for expired entries; a burst
of unique ids within the window grew it without limit. Evict oldest at the cap.
* chore: add AUTHOR_MAP entry for PhilipAD
---------
Co-authored-by: PhilipAD <philipadsouza@gmail.com>
Adds three vitest cases for the recovery path: resume+retry on
"session not found", no-resume passthrough on other errors, and
no-resume when there is no stored session id. Also maps the
contributor's commit email in release.py AUTHOR_MAP.
Follow-up to the salvaged fallback-chain fix:
- Replace the hand-rolled fallback loader with the shared
hermes_cli.fallback_config.get_fallback_chain() helper so the TUI path
matches HermesCLI and gateway/run.py exactly: fallback_providers stays
first and keeps order, with distinct legacy fallback_model entries
merged in after (deduped). Previously the TUI loader picked one key OR
the other, diverging from CLI/gateway when both were set.
- Update the test to assert the merged canonical semantics.
- Add psionic73 to scripts/release.py AUTHOR_MAP (CI gate).
The blanket DEVNULL pass muzzled run_oauth_setup_token()'s interactive
'claude setup-token' login, which needs inherited stdin to prompt the
user. Revert that one call and replace the guard's brittle file:line
whitelist with an inline 'noqa: subprocess-stdin' marker that travels
with the code.
scripts/check_subprocess_stdin.py scans agent/, tools/, plugins/, and
tui_gateway/ for subprocess.run() and subprocess.Popen() calls that
don't explicitly set stdin=. Missing stdin= means the child inherits the
parent's fd, which in TUI mode is the JSON-RPC pipe — causing gateway
crashes on stdin EOF.
Exits 0 (pass) or 1 (violations found). Can be run manually or added to
CI. Skips comments, docstring references, and calls that use input= (which
creates its own pipe).
Usage: python scripts/check_subprocess_stdin.py
Photon now allowlists registered device clients on the device-code
endpoint; the old client_id "hermes-agent" is rejected with
400 invalid_client, breaking the entire login flow. Switch to Photon's
published "photon-cli" device client and send the standard scope.
Also validate the device-flow token against /api/auth/get-session and
/api/projects/ before persisting it, and extract token candidates from
every response shape Photon has used (access_token, accessToken,
data.*, set-auth-token header) so a token that authenticates the
session lookup but is rejected by the project API fails loudly at
login instead of 404ing downstream.
Verified live: request_device_code() now returns 200 + a valid
user_code where "hermes-agent" returned 400 invalid_client.
Salvaged from #34467 by @yanxue06.
The Skills Hub lost every api.github.com-backed source — the OpenAI,
Anthropic, HuggingFace, NVIDIA, gstack, Claude Marketplace and Well-Known
tabs all vanished — while ClawHub/skills.sh/LobeHub/browse.sh survived. A
GitHub API rate limit during the docs-deploy crawl zeroed all three
api.github.com sources (github / claude-marketplace / well-known) at once.
Two compounding bugs let the broken index reach the live site:
1. build_skills_index.py wrote the output file BEFORE the health check, so
even when the github floor (30) tripped and the script exited 2, the
degenerate file was already on disk. deploy-site.yml then swallowed the
exit code with `|| echo non-fatal` and extract-skills.py read the partial
index. Fix: run the health check first, write the file only when healthy,
exit without writing on failure. Removed the non-fatal swallow in
deploy-site.yml so a collapse fails the deploy and the last good site
stays live (Pages serves the previous build).
2. The build-time GitHub listing path returned [] on a 403 rate-limit without
retrying or flagging it, so a rate-limited crawl looked identical to an
empty source. Fix: a shared _github_get() helper on GitHubSource with
retry/backoff (honors Retry-After / X-RateLimit-Reset on 403/429, backs
off on 5xx + transport errors) and flags is_rate_limited. Routed
_list_skills_in_repo and _fetch_file_content through it; gave
ClaudeMarketplaceSource a persistent GitHubSource + is_rate_limited so the
builder can name the rate limit as the cause instead of '0 results'.
Added tests/scripts/test_build_skills_index_health.py pinning both contracts:
a degenerate crawl exits non-zero and writes no file; a healthy crawl writes
the index with github/claude-marketplace/well-known all present.
The parallel test runner sharded a present, tracked test file
(tests/plugins/platforms/photon/test_inbound.py) onto a slice that then
reported 'file or directory not found' (pytest exit 4) at exec time —
even though the planner had just enumerated the file via --collect-only
('5269 passed, 0 failed' in the same run). On loaded shared CI runners
the per-file subprocess can fail to stat a file the planner already saw;
the deterministic LPT slicer then reproduces it on every rerun because
the same file set lands on the same shard.
Fix: when a per-file run exits 4 AND the file still exists on disk, retry
the subprocess once before surfacing it as a hard failure. This kills the
shard-flake class for everyone, not just this PR.
Does NOT widen the exit-5-is-pass rule — exit 4 on a genuinely missing
file still fails (verified). Retry reuses the same pgroup-kill cleanup as
the primary run so no grandchildren orphan.
Validation: photon dir runs green through scripts/run_tests_parallel.py;
unit-level negative case confirms a nonexistent file still returns rc=4.
A bare `git fetch origin` (and `git fetch upstream`) pulls every ref. The
repo carries thousands of auto-generated branches, so on any
non-single-branch checkout the installer's update path and `hermes update`
spend minutes downloading the full branch list — long enough to stall the
desktop installer or trip the follow-up `git pull --ff-only`.
Scope every update-path fetch to the branch we actually compare/merge
against:
- scripts/install.sh: collapse the remote to single-branch and fetch only
$BRANCH on the "existing install, updating" path.
- hermes_cli/main.py: fetch the resolved branch in the apply path, the
--check path (upstream + origin), and the fork upstream-sync.
Tracking-ref updates still happen via git's opportunistic refspec, so the
later origin/<branch> rev-parse/rev-list checks are unaffected.
Tests assert the apply-path fetch is branch-scoped and never bare.
hermes auth add openai-codex now creates an independent
manual:device_code pool entry per account instead of routing through
the singleton _save_codex_tokens save path, which collapsed every
added account into the latest login (the second add overwrote the
first account's singleton-mirrored device_code entry). This is the
add-path half of #39236; PR #39243 (already on this branch) fixes the
re-auth half.
manual:device_code entries refresh from their own token pair
(_sync_codex_entry_from_auth_store only adopts the singleton for
source=="device_code"), so they need no providers.openai-codex
shadow. Adding the first credential marks openai-codex active (the
singleton path did this implicitly) so the setup wizard's
get_active_provider() check still passes; subsequent adds leave the
active provider untouched.
Adds SOURCE_MANUAL_DEVICE_CODE constant and a regression test that two
distinct accounts keep distinct token pairs. Updates two existing add
tests to the pool-only behavior.
Co-authored-by: glesperance <info@glesperance.com>
* feat(windows): enable dashboard chat tab via ConPTY (win_pty_bridge)
Add hermes_cli/win_pty_bridge.py — a pywinpty-backed drop-in for
PtyBridge with the same spawn/read/write/resize/close surface — and
wire it into the web_server PTY import block so Windows picks it up
instead of falling back to None.
pywinpty is already a declared win32 dependency (pyproject.toml).
The ConPTY read path runs inside run_in_executor so the event loop
is never blocked. Spawn/read/write/terminate call shapes are taken
directly from tools/process_registry.py which already exercises the
same pywinpty version.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs: remove WSL2-only caveat for dashboard chat tab
The chat pane now works on native Windows via the ConPTY bridge added
in the previous commit.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* test(windows): cover ConPTY bridge + web_server platform-branched import
Companion to the bridge added in the previous commits. Verified live on
native Windows 11 (pywinpty 2.0.15) against `hermes dashboard`'s
`/api/pty` WebSocket: the spawned `hermes --tui` (node entry.js) renders
through ConPTY, resize escapes reach `setwinsize`, and closing the WS
reaps both the node child and the pywinpty agent with zero orphans.
tests/hermes_cli/test_win_pty_bridge.py
Mirrors the layout of the existing POSIX test_pty_bridge.py:
spawn/io/resize/close/env coverage against cmd.exe and python -c,
plus the cross-platform fallback surface (PtyUnavailableError, the
off-Windows `spawn -> raises PtyUnavailableError` guard, and the
load-bearing _clamp() helper that protects setwinsize from garbage
winsize values out of xterm.js).
tests/hermes_cli/test_web_server_pty_import.py
Asserts that web_server.PtyBridge resolves to WinPtyBridge on win32
and to the POSIX PtyBridge on POSIX, that PtyUnavailableError is the
matching class on each side (so isinstance checks in /api/pty's
spawn fallback path work), and a source-text check that pins the
platform-branched import shape so a future refactor can't quietly
collapse it back to a POSIX-only import.
scripts/release.py
AUTHOR_MAP entries so CI release-note generation can resolve both
authors' plain (non-noreply) emails to their GitHub logins.
Co-Authored-By: JoelJJohnson <josephjohnson.joel@gmail.com>
Co-Authored-By: Nea74 <andreas@schwarz-ketsch.de>
---------
Co-authored-by: JoelJJohnson <josephjohnson.joel@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Nea74 <andreas@schwarz-ketsch.de>
On-disk vitest coverage for the auto-heapdump disk-safety guard: opt-in
gating (suppressed diagnostics-only path), truthy-spelling acceptance,
manual-trigger passthrough, and the retention prune. Test approach
adapted from #21780 (briandevans) and #21822 (LeonSGP43), reconciled to
the merged gate semantics. Maps alarcritty into AUTHOR_MAP for CI.
Review feedback (#40998): `rm -rf` / `Remove-Item -Recurse -Force` on the
install dir is destructive -- a user might still want whatever is there.
Rename the broken checkout to a timestamped `<dir>.broken-<ts>` backup and
re-clone fresh, so nothing is ever deleted. Transient cleanup of a clone
attempt that fails within the same run is left as-is.
An interrupted previous clone leaves the install dir's .git present but with
no initial commit. rev-parse --is-inside-work-tree and git status both still
succeed there, so the installer entered the update path and ran `git stash`,
which aborts with "You do not have the initial commit yet" and failed the
desktop install at the "Cloning Hermes repository" stage.
- install.ps1: add a `git rev-parse --verify HEAD` probe to the repo-validity
check so a commit-less checkout is treated as broken and re-cloned fresh.
- install.sh: mirror it at the top of clone_repo() — drop a partial checkout
with no resolvable HEAD so the fresh-clone path handles it (POSIX parity).
Fixes#40998
A one-off transient transport failure (streaming-close / incomplete
chunked read / 5xx / 408) on an auxiliary LLM call escalated straight to
provider/model fallback (or, for context compression, dropped the summary
and entered cooldown), even when an immediate retry on the same provider
would have succeeded.
Add a single same-target retry at the top of call_llm() and
async_call_llm() — before the existing except-chain — gated on a new
_is_transient_transport_error() that reuses the canonical
_is_connection_error() detector plus a 5xx/408 status check. A second
failure (or any non-transient error: auth, other 4xx, malformed payload)
falls through to first_err and the existing fallback handling unchanged.
This lives in call_llm so every auxiliary task (compression, memory flush,
title generation, session search, vision) shares one transient-retry
surface, rather than each caller re-implementing it. The context
compressor needs no change — it calls call_llm and inherits the retry; its
existing fallback-to-main path (#18458) now composes naturally (retry the
aux model once, then fall back to main only if the retry also fails).
Co-authored-by: ARegalado1 <alberto.regalado@ymail.com>
eslint --fix (import sort + padding-line-between-statements) on sidebar/index.tsx
after cherry-picking @dangelo352's commits; add release.py AUTHOR_MAP entry so
CI doesn't block on the unmapped author email.
Salvage follow-up for PR #33221 — the cherry-picked commit is authored
under martin.alca@gmail.com (not the draixagent@gmail.com already mapped),
which would fail the CI author-attribution gate.
Replace the ACP-local prefix/suffix matcher + helper with a single
startswith() check against INTERRUPT_WAITING_FOR_MODEL_PREFIX, now
defined once in conversation_loop.py where the sentinel is produced.
Keeps the source of truth in one place so the guard cannot drift if
the status string changes. Net -17 LOC in server.py.
Also add lsaether to release.py AUTHOR_MAP.
The auxiliary Codex adapter maintained its own chat->Responses conversion
loop that forwarded every non-system message's role verbatim into
Responses input[]. When flush_memories()/compression replayed session
history containing assistant tool_calls + role=tool results, those tool
messages leaked into the request and the Responses API rejected them with
HTTP 400: Invalid value: 'tool'.
Route _CodexCompletionsAdapter.create() through the same shared converter
the main agent transport uses (_chat_messages_to_responses_input), so tool
calls become function_call items and tool results become function_call_output
items with a valid call_id. Single conversion path means no future drift.
Also remove the now-dead _convert_content_for_responses() helper — its only
caller was the private conversion loop this change deletes.
Co-authored-by: ProgramCaiCai <techxacm@gmail.com>
When apps/desktop's `npm ci`/`npm install` fails, install_desktop printed a
single "Desktop workspace npm install failed" line and aborted, leaving the
user with a wall of raw npm output. A common trigger is a root-owned ~/.npm
cache left by an earlier `sudo npm`/`sudo npx`: the non-root install then
cannot write the shared cache, and npm reports it as EEXIST / "File exists"
while the real errno is EACCES (-13) -- so it reads like an installer bug.
Add a targeted remediation hint on that failure path pointing at:
sudo chown -R "$(id -un)" ~/.npm && npm cache verify
followed by the manual rebuild command. The stage stays a hard failure by
design (a silent skip yields a "complete" install with no app); only the
failure output changes.
Desktop connected to a remote gateway can now attach images and PDFs and
display agent-written images. Previously the desktop passed a LOCAL file path
to image.attach; on a remote gateway that path doesn't exist, so the image was
silently dropped ("skipped unreadable path") and the vision model never saw it.
The reverse direction was also broken — images the agent wrote on the gateway
rendered as dead links in the remote client.
Gateway (tui_gateway/server.py):
- image.attach_bytes: base64 byte upload written into the gateway's own images
dir and queued via the existing native-image-attach pipeline. Magic-byte
extension sniffing, data-URL prefix + whitespace tolerance, 25 MB cap,
structured error codes. Accepts content_base64/filename (canonical) and
data/ext (older-desktop aliases).
- pdf.attach: renders each page to PNG via pdftoppm (poppler-utils) at 150 DPI
and queues the pages as images; 50 MB / 25-page caps. Accepts host path or
base64 upload.
- Shared helpers (_decode_attach_base64, _sniff_image_ext, _queue_attached_image)
so the two methods and the existing image.attach don't duplicate logic.
Gateway (hermes_cli/web_server.py):
- GET /api/media: returns a gateway-local image as a base64 data URL so remote
clients can display it. Auth-gated like every /api route, extension
allowlist + size cap, AND confined to the gateway's own media roots
(images/screenshots/cache, resolved symlink-safe) so an authed caller can't
read image-extension files anywhere on disk.
Desktop (apps/desktop):
- syncImageAttachmentsForSubmit uploads bytes via image.attach_bytes when the
connection mode is 'remote'; the local fast path is unchanged.
- media.ts gains isRemoteGateway() + gatewayMediaDataUrl(); directive-text and
markdown-text fetch images over /api/media in remote mode.
Consolidates the competing remote-media PRs (#38876, #40317, #21908, #39437)
into one coherent implementation, taking the strongest parts of each and adding
shared-helper cleanup plus the /api/media root-confinement hardening on top.
The per-profile gateway switching from #38876 is intentionally left out as a
separable feature. TUI file uploads (#40492) remain a separate surface.
Tested: 11 new tui_gateway tests + 5 /api/media endpoint tests + desktop
media.remote unit tests; full tui_gateway + web_server suites green (472
passed); tsc -b clean; E2E verified the full attach→disk→queue and
gateway-path→data-URL display round-trip plus the out-of-root security block.
Co-authored-by: Max Mitcham <maxmitcham@mac.home>
Co-authored-by: Justlrnal4 <Justlrnal4@users.noreply.github.com>
Co-authored-by: Chris Cook <ccook@nvms.com>
Co-authored-by: Thomas Paquette <thomas.paquette@gmail.com>
SIMPLEX_ALLOWED_USERS silently denied every contact when operators
listed display names instead of numeric contactIds. The SimpleX UI
never surfaces the numeric id, so display names are what operators
naturally put in the env var. _is_user_authorized only compared
source.user_id (the contactId), so the allowlist never matched.
Expand check_ids to include source.user_name for the simplex platform,
mirroring the existing WhatsApp phone-LID aliasing pattern. Adds doc +
setup-prompt clarification and three regression tests.
Salvaged from PR #40393. Adds manishbyatroy to release.py AUTHOR_MAP.
Adds the AUTHOR_MAP entry for the #40403 salvage (model.default_headers
for custom OpenAI-compatible providers, fixes#40033) so contributor_audit
passes when the salvage PR lands.
Adds regression tests for the SSH cwd fix: local backend keeps
host-validated session cwd; non-local backend uses TERMINAL_CWD (or
terminal.cwd config) verbatim without host isdir() validation; sentinel
values fall back to session cwd.