The /api/pty handler only closed the PtyBridge in the writer loop's finally.
On child EOF the reader task closes the WebSocket, but if the handler task is
cancelled the instant the socket closes, the writer's finally can be skipped
and the PTY fds leak (#54028) — the FD-leak the regression test guards. Under
dashboard auto-reconnect this stacks orphaned PTYs until fds are exhausted.
Reap the bridge in the reader's EOF finally too (close() is idempotent), so
the PTY is reaped independently of the writer-loop cancellation race. Harden
the regression test to poll for teardown instead of asserting on the same
tick. Was flaky on main (2/20); now 25/25.
PR #39219 split run_browser_install_with_timeout into a thin wrapper that
delegates to a new run_with_timeout helper (and parameterized the timeout
binary as $timeout_bin for macOS gtimeout support), but did not update
tests/test_install_sh_browser_install.py. The behavioral harness extracted
only the now-empty wrapper, so the install command never ran (runs==[]),
failing all 8 behavioral cases; two text assertions also still expected the
old literal 'timeout' invocation.
Fix the stale test: extract run_with_timeout alongside the wrapper, and match
the $timeout_bin-parameterized GNU-timeout strings. Behavior unchanged.
* fix(security): redact secrets in background process + foreground env-dump output
Terminal-output redaction was incomplete (#43025):
- Gap 1: process(action=poll/log/wait) returned background stdout verbatim —
no redaction at all. A background printenv/server/test emitting a key leaked
raw to the model, session.db, and CLI display. Same for the gateway
background-process watcher's completion/progress notifications.
- Gap 2: the foreground terminal path hardcoded code_file=True, which skips the
ENV-assignment pass, so an opaque token (no vendor prefix) from env/printenv
leaked even there.
Adds agent.redact.redact_terminal_output(output, command) as the single policy
for ALL terminal-output surfaces: env-dump commands (env/printenv/set/export/
declare) get the ENV-assignment pass (code_file=False) to mask opaque tokens;
other commands stay on code_file=True to avoid false positives on source dumps.
Wired into terminal_tool, process_registry (_handle_process boundary), and the
gateway watcher. Respects security.redact_secrets (no force) — opt-out preserved.
* docs: add infographic for #43025 terminal-output redaction fix
The merged CLOSE-WAIT heartbeat (#52744) only probes get_me(), which uses the
general request path and stays healthy while PTB's getUpdates consumer is
silently wedged (updater.running=True but the long-poll task is stuck, observed
on WSL2). DMs then queue in the Bot API and never reach handlers (#42909).
Augment the existing _polling_heartbeat_loop to also probe
get_webhook_info().pending_update_count. After two consecutive probes that see a
non-draining queue while the updater claims to be running, escalate into the
existing _handle_polling_network_error recovery ladder — no new restart
machinery. No-ops in webhook mode, when the updater is not running, or when a
reconnect is already in flight.
Credit to @gazzumatteo, whose PR #42959 identified the pending_update_count
signal as the missing liveness probe. This reuses the existing heartbeat +
recovery path rather than adding a parallel watchdog.
Fixes#42909.
Two related redaction bugs from #43083:
1. build_assistant_message redacted tool-call arguments in-memory. That dict
feeds both the replayed conversation history and state.db (which is itself
replayed verbatim on session resume), so the model read back its own
PGPASSWORD='***' psql call and copied the placeholder, breaking every
credential-dependent command on the second turn. The masking gave no real
protection either — the same secret still leaks through tool OUTPUT. Remove
it. Keeping secrets out of the replayable store is a separate
tokenization/vault concern (security.redact_secrets still governs
storage-time redaction elsewhere).
2. _AUTH_HEADER_RE's greedy \S+ credential class ate a closing quote when the
token sat flush against it (Authorization: Bearer sk-.."), turning value
corruption into syntax corruption (unterminated quote -> shell EOF /
SyntaxError). Exclude " and ' from the token class; real credentials never
contain them.
Closes#43083.
Config-backed quick_commands bypassed the admin-only slash gate. The
early gate in _handle_message only fires for registry-known commands
(is_gateway_known_command), but quick_commands are never in the gateway
registry, so they reached the type:exec dispatch sink unchecked. An
allowlisted non-admin gateway user could invoke admin-only quick
commands — including shell exec in the gateway process — even when the
operator set allow_admin_from / user_allowed_commands to lock them out.
Apply _check_slash_access(source, command) at the quick_commands
dispatch site (the single exec chokepoint, cold-path only) using the
raw typed name. Admins and users with the command in
user_allowed_commands still run it; backward-compat (no policy set)
is unaffected.
Fixes#44727.
Co-authored-by: maxpetrusenko <max.petrusenko.agent@gmail.com>
Co-authored-by: zapabob <1920071390@campus.ouj.ac.jp>
* fix(dashboard): close PTY WebSocket on child EOF to stop FD leak
The /api/pty handler's reader task returns on child EOF, but the writer
loop stayed blocked on ws.receive() until the browser sent a disconnect.
When the browser socket is half-open (no FIN delivered — common on
macOS/launchd), that disconnect never arrives, so the handler never
reaches its finally and the PTY master fd + child process leak. With
dashboard auto-reconnect (#52962), every dropped socket then spawns a
fresh PTY on top of the orphaned one, exhausting file descriptors within
hours (EMFILE / Errno 24).
Fix: the reader task now closes the WebSocket in a finally when the child
EOFs or the send side breaks, which unblocks ws.receive() so the existing
finally runs bridge.close(). The writer loop also guards ws.receive()
against the RuntimeError Starlette raises once the socket is closed.
Reported by @fifteenzhang.
Fixes#54028
* docs: add infographic for #54028 PTY FD leak fix
The snapshot/vision guards re-check the page URL before returning content,
but browser_console(expression=...) -> _browser_eval returns arbitrary JS
results directly, leaving two same-class bypasses open:
1. Direct fetch: fetch('http://127.0.0.1/secret').then(r=>r.text()) reads
a private endpoint and returns the body — the page URL stays public so
the post-eval recheck never sees it.
2. Navigate-then-read: location.href='http://127.0.0.1/' then a later eval
reads document.body.innerText.
Guard _browser_eval on the same condition as navigate/snapshot/vision
(not local backend, not local sidecar, not allow_private_urls):
- pre-scan the expression for private/always-blocked URL literals
- re-check window.location.href after the eval at both success-return
sites (supervisor fast-path + subprocess fallback)
Probe failures fail-open (matching the snapshot/vision guards).
The private-network guard in browser_snapshot() and browser_vision()
blocked all private URLs, including those accessed via local sidecar
sessions (hybrid routing). Local sidecar sessions intentionally access
private URLs — the cloud provider never sees the URL in that case.
Add `_is_local_sidecar_key(effective_task_id)` check to both guards,
matching the existing pattern in browser_navigate().
Fixes#45101 review feedback from egilewski.
The SSRF bypass in #44731 was only patched for browser_snapshot(), but
browser_vision() exposes the same vulnerability — it takes a screenshot
and sends it to the vision model without checking if eval-driven
navigation moved the page to a private/internal URL.
Add the same current-page URL safety check to browser_vision() before
any screenshot is captured, encoded, or forwarded to the vision model.
This covers both the normal screenshot path and the Lightpanda Chrome
fallback path.
7 new tests: blocks private URL, allows public URL, skips in local
backend, skips when private URLs allowed, handles eval failure/empty/exception.
browser_snapshot() now checks the current page URL before returning
content. When browser_console() changes location.href to a private or
internal address (e.g., http://127.0.0.1:8080/), the snapshot returns
an error instead of exposing the private page content.
This closes the SSRF bypass where an attacker could:
1. Navigate to a public page
2. Use browser_console to eval location.href = 'http://127.0.0.1:port/'
3. Use browser_snapshot to read the private page content
The fix reuses the existing _is_safe_url() and _allow_private_urls()
infrastructure, and fails open if the URL check itself fails.
Fixes#44731
HolographicMemoryProvider.shutdown() dropped its MemoryStore reference
without calling the existing MemoryStore.close(). Since the connection is
opened check_same_thread=False (one per session), its fd was released by
refcount/GC at a non-deterministic time on a non-deterministic thread,
churning a DB fd through the kernel free pool on every session teardown.
Call close() so the fd is released deterministically.
Reported by @alfranli123 (#44037), who pinpointed the exact code location.
Note: the report's TLS-fd-recycle corruption attribution could not be
reproduced from the code — dropping a sqlite connection flushes valid
SQLite pages via the VFS, never TLS framing, and the provider is at most a
releaser of DB fds, not a TLS-flushing socket owner. This change is correct
resource hygiene that removes per-session fd churn regardless.
A hard gateway crash (exit code 1) skips the graceful shutdown path, so
sessions.json is never cleared and is left pointing at sessions already
ended in state.db. On the next startup get_or_create_session() reuses
those stale entries as long as the time/policy reset checks pass — it
never consults end_reason — so every incoming message is silently routed
into a closed session, with no log or error (#52804).
SessionStore._ensure_loaded_locked() now calls a new
_prune_stale_sessions_locked() that drops any entry whose session_id has
end_reason IS NOT NULL in state.db. Idempotent, _db=None / legacy-absent
safe, DB errors non-fatal, sessions.json rewritten only when something
was pruned. Self-heals into a fresh session on the next message.
Reported and diagnosed by @terry197913 (#52808).
#35994 moved /new reset cleanup off the loop, but _cleanup_agent_resources
(agent.close() subprocess teardown; shutdown_memory_provider() plugin IO) was
still called INLINE on the event loop from three other sites:
- _session_expiry_watcher (5-min idle sweep) — live loop
- _handle_message_with_agent cache-hygiene re-eviction — live loop
- _finalize_shutdown_agents / stop() idle-cache loop — shutdown
A wedged memory provider on any of these froze the loop: bot goes silent,
runtime-status updated_at heartbeat stops advancing, and SIGTERM can't be
serviced (requires kill -9) — exactly the #53175 zombie pattern.
Adds _cleanup_agent_resources_off_loop: a bounded (30s) worker-thread offload
mirroring the #35994 reset fix, and routes all four sites through it.
The system-prompt backend probe imported a nonexistent symbol —
`from tools.environments import get_environment` — which always raised
ImportError: cannot import name 'get_environment'. The exception is caught
and only drops the live backend description to a static fallback, so it is
cosmetic, but it broke the live OS/user/cwd probe for every non-local
backend (docker/singularity/modal/daytona/ssh).
The real factory is `_create_environment` in tools.terminal_tool. Build the
environment the same way the live terminal path does (select backend image,
assemble ssh/container config from _get_env_config()), then run the probe.
Note: this does NOT affect tool loading — tool selection runs each tool's
check_fn and never consults this probe. Regression from #52147 (2026-06-25).
Closes#53667 (probe import); the 'cronjob-only' tool-collapse symptom is
not reproducible — tool selection has no probe dependency and memory's
check_fn is unconditionally True.
Add two tests for the self-lock guard in _recover_from_interrupted_install:
one asserting it clears the marker and skips install when hermes.exe is a
process ancestor (breaking the #52378/#45542 loop), one asserting it falls
through to a normal recovery install when the shim is NOT an ancestor.
The guard's manual-recovery hint runs only inside the Windows branch, so
quote it for cmd.exe (cd /d, double-quoted paths) — the cross-platform
fallback hint at the end of the function is left POSIX-correct.
Map Icather in scripts/release.py AUTHOR_MAP for the salvage.
Rebase onto plugins/platforms/matrix/adapter.py (code moved from
gateway/platforms/matrix.py). Same logic: _on_invite checks is_direct
on invite events and calls _record_dm_room to persist in m.direct
account data.
Fixes#44679
When the Desktop forcibly closes its WebSocket mid-write, asyncio logs a
full traceback for every pending connection-lost callback — 50+ identical
WinError 10054 (ConnectionResetError) lines per disconnect on Windows, the
equivalent ConnectionResetError/BrokenPipeError on POSIX. These are not
actionable: they are the expected side effect of the peer hanging up before
our writes drained.
Install a loop exception handler on the gateway serving loop that collapses
exactly this teardown class (ConnectionResetError/ConnectionAbortedError/
BrokenPipeError originating from _call_connection_lost) to a single debug
line, forwarding every other loop error to the existing/default handler
unchanged so genuine loop bugs still surface. Idempotent per loop.
The atomic mv approach (kyssta-exe's commit) narrows but does not close the
#38249 race: the temp name used $$ (parent shell PID), which is identical
across &-launched concurrent subshells. Two concurrent writers pick the same
temp file, clobber each other mid-write, and mv then publishes a torn snapshot
— a reader sourcing it absorbs declare-x/export fragments into PATH.
- Use $BASHPID (actual per-subshell PID) so concurrent writers never collide.
- Chain mv on export success (&&) and rm the temp on failure so a partial dump
never replaces a good snapshot; apply the same to the init_session bootstrap.
- shlex-quote the static temp-path portion (Windows/spaces), $BASHPID outside.
- LocalEnvironment.cleanup sweeps orphaned snap.tmp.* temps.
- Regression tests: string-shape + a behavioral concurrent writers/readers test
that proves the snapshot never tears (would still tear with $$).
- read_text(encoding='utf-8') (PLW1514)
- # windows-footgun: ok on signal.SIGKILL — module is Linux-only (reads
/proc, /sys/fs/cgroup; runs from a systemd unit)
- test lambda accepts the new encoding kwarg
Long-lived helpers spawned indirectly by tool calls (adb, platform
bridges) were left in the service cgroup after the gateway's main
process exited. When the kernel rejected the deferred cgroup-wide kill
with EINVAL, systemd blocked Restart=always for 6+ minutes, taking
down all platforms and cron windows (#37454).
Add a small ExecStopPost helper (gateway.cgroup_cleanup) that walks
cgroup.procs and sends per-PID SIGKILLs — a different kernel code path
than cgroup.kill, so it succeeds where the cgroup-wide write failed.
KillMode=mixed is preserved so the gateway still reaps its own
tool-call children before systemd intervenes (#8202).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When tools.environments.local can't be imported (partial install,
import-time error), _is_hermes_provider_credential() returned False —
fail-open. A skill could then register a Hermes provider credential
(ANTHROPIC_API_KEY, etc.) as env passthrough; _scrub_child_env lets
passthrough vars bypass the secret-substring net (rule 1), so the
operator's real key would land in the execute_code child. Reopens the
GHSA-rhgp-j443-p4rf bypass.
Fail closed instead: on import failure, treat the name as a protected
provider credential and refuse passthrough. Regression test exercises
the full register -> scrub path under a simulated import failure.
Co-authored-by: Hermes Agent <noreply@nousresearch.com>
Follow-up to the salvaged #37733 fix. The contributor centralized
redaction at _openai_error and the chat/responses failure paths, which
covers the OpenAI-compatible envelopes transitively. Two sibling classes
crossed the same authenticated HTTP boundary unredacted:
- 8x cron-management endpoints returning {"error": str(e)} on 500
- the session-chat SSE error event ({"message": str(exc)})
Route both through the same _redact_api_error_text(force=True) helper.
Add AUTHOR_MAP entry for coygeek and a TestRedactApiErrorText guard
covering mask/force/limit/passthrough behavior.
Force API-server error text through the existing secret redactor before returning OpenAI-compatible errors, response fallback text, response snapshots, and run failure events. This prevents credential-shaped provider failure text from crossing the API-server boundary while preserving debuggable sanitized messages.
Add an _is_user_authorized E2E for the platforms/whatsapp/session layout
on top of fesalfayed's resolver fix (#36665) — guards the actual
silently-dropped-LID-sender path from #36664.
expand_whatsapp_aliases hardcoded get_hermes_home()/whatsapp/session, but
the adapter writes lid-mapping files via get_hermes_dir("platforms/whatsapp/
session", "whatsapp/session"). On installs without the legacy directory the
two paths diverge, so the resolver finds no mappings and returns the bare LID,
which misses the allowlist and silently drops the message. Resolve through the
same helper so both sides stay in lockstep on new and legacy layouts.
When an LLM API call returns HTTP 4xx with an empty parsed SDK `body` ({}),
`_summarize_api_error` fell through to a bare `str(error)`, so users saw only
"HTTP 400" with no provider detail (reported on Windows in #36109). The SDK
leaves `body` empty in this case, but the httpx `response` still carries the
payload in `.text`.
- run_agent.py `_summarize_api_error`: when `body` is empty, fall back to
`response.text` — parse a JSON `error.message`/`message` when present, else
surface the raw (truncated) body. Platform-agnostic diagnostics.
- hermes_cli/oneshot.py: `hermes -z` now runs via `run_conversation` and returns
exit code 2 when the run is failed/partial with no usable final response, so
scripts can detect LLM failures (still 0 when a response — incl. an error
summary as output — is produced).
Tests: new tests/run_agent/test_summarize_api_error.py (empty-body JSON + raw
text, RED/GREEN verified) + oneshot exit-code/`run_conversation` wiring tests.
NOTE: #36109's original root cause (Windows "all providers return empty 400")
is not reproducible on current main (heavy provider-transport churn since
v0.15.1). This change does not claim to fix that root cause — it makes any
empty-body API error LEGIBLE so a future occurrence shows the real provider
message instead of a bare HTTP 400. Relates to #36109 (does not close it).
Follow-up on #54032 for #35166:
- Gate the PLAYWRIGHT_HOST_PLATFORM_OVERRIDE retry on the host being an apt
release newer than Playwright recognizes (Ubuntu >24.04 / Debian >13) via
playwright_host_unrecognized(), instead of retrying on ANY install failure.
A network/disk/permission failure on a supported host now surfaces unchanged
rather than getting a mismatched-glibc build forced onto it.
- detect_os() now captures DISTRO_VERSION from os-release.
- Fold in the interruptibility fix (was PR #35304, self-closed): wrap the
download in 'timeout --foreground -k 10' (probed, with plain-timeout
fallback) so a terminal Ctrl+C reaches the child and a wedged download is
force-killed after the deadline.
- Add behavioral tests that source the helpers and assert the retry fires only
on Ubuntu 26.04 / Debian 14, not on supported hosts, non-apt distros,
native-success, operator-pinned override, or unsupported arch.
On apt releases newer than the bundled Playwright recognizes (Ubuntu 26.04,
Debian 14, and future distros), 'npx playwright install --with-deps chromium'
hangs uninterruptibly at 'Installing Playwright Chromium with system
dependencies' because Playwright's resolver maps the host to a platform with
no download build (#35166).
Wrap every installer Playwright call in run_playwright_install(), which tries
the native install first and, only if it fails or times out, retries once with
PLAYWRIGHT_HOST_PLATFORM_OVERRIDE pinned to the newest known build
(ubuntu24.04-<arch>). This is the escape hatch Playwright's maintainers bless
for unrecognized platforms (microsoft/playwright#33434).
Try-native-first (not a hardcoded distro/version table) is deliberate:
- Self-correcting — when Playwright already supports the host (e.g. Ubuntu
26.04 on Playwright >=1.61) the first attempt succeeds and the override is
never applied, so we never force a mismatched-glibc build onto a release
Playwright handles correctly (microsoft/playwright#35114).
- Zero-maintenance — new distro releases work the moment Playwright adds them.
- Covers Debian 14+ and future releases, not just Ubuntu 26.04.
An operator-set PLAYWRIGHT_HOST_PLATFORM_OVERRIDE is always respected (applied
to the first attempt; retry skipped). Non-x64/arm64 arches have no fallback
build and skip the retry.
Refs #35166
A custom_providers config that names the model under model.name (or
model.model) resolved to an empty model, so the API request went out
with model= — HTTP 400 from OpenAI-compatible backends. Display paths
(hermes status/dump) already read model.name and showed the model,
making the failure silent.
The model id was read via 'default or model' at ~14 independent sites
(cli, gateway, cron, curator, oneshot, fallback, profiles, ...), none
of which honored 'name'. Rather than patch every site, canonicalize at
the single load/save chokepoint: _normalize_root_model_keys() now
promotes model.model/model.name -> model.default (precedence
default > model > name) and drops the stale alias, so every reader —
present and future — sees a populated default and config.yaml is
migrated canonical on next save. The gateway, which bypasses
load_config(), replays the same normalization in _load_gateway_config().
Co-authored-by: Bartok9 <danielrpike9@gmail.com>
Credit: root-cause analysis and fix direction from @Bartok9 (#34502,
first) and @v86861062 (#34527).
* fix(telegram): clear send_path_degraded on successful reconnect
_send_path_degraded was cleared only in _verify_polling_after_reconnect,
60s after reconnect and only if scheduled. A clean start_polling() reconnect
left the flag stuck True, short-circuiting send() and blocking all outbound
messages until the deferred probe ran (or forever if it never did).
Clear the flag the moment start_polling() succeeds — that is the recovery
signal. The deferred probe remains a defensive re-check that re-enters the
reconnect ladder (re-setting the flag) if it detects a silent wedge.
Fixes#35205.
* docs: add infographic for #35205 telegram send-path fix
Secret redaction is display/output-scoped on main — write_file writes
content verbatim, terminal/execute_code redact only output not the
command/source. The real bug is in displayed tool OUTPUT (read_file,
terminal, execute_code):
_DB_CONNSTR_RE's password group [^@]+ was greedy across newlines, so on a
multi-line block it scanned past the DSN line to the next stray '@' (a
Python @decorator), replacing every intervening character — including line
breaks — with ***. That dropped lines and concatenated the next line onto
the f-string line, making read_file output look corrupted (the file on disk
was always correct). Reported in #33801.
Fix:
- Forbid whitespace in the userinfo/password groups ([^:\s]+ / [^@\s]+) so
the match can never span a line break. A real DSN password never contains
whitespace. This alone kills the catastrophic line-dropping.
- Under code_file=True, preserve a password group that is a pure {...} brace
expression — f"postgresql://{user}:{pass}@{host}" is an f-string template,
not a live credential. Literal passwords are still masked.
- Pass code_file=True at the terminal and execute_code output redaction call
sites (file_tools already did) so code-execution output isn't corrupted by
ENV/JSON/template false positives. Real prefixes, auth headers, JWTs, and
private keys are still redacted.
Verified E2E against the reporter's exact pydantic-settings module: file
written verbatim, read_file shows the DSN f-string + @model_validator intact
with zero *** corruption, while a literal postgresql://admin:pw@host DSN and
a real sk- key are still masked.
Reported-by: koishi70
Reported-by: pfrenssen
When a provider's output-layer safety filter (MiniMax "output new_sensitive
(1027)", Azure content_filter, etc.) kills a streaming response after deltas
were already sent, interruptible_streaming_api_call swallows the raw error
into a finish_reason=length partial-stream stub. The conversation loop then
burned 3 continuation retries against the SAME primary — re-hitting the
content-deterministic filter every time — and gave up with "Response remained
truncated after 3 continuation attempts", never consulting fallback_providers.
Builds on @595650661's classifier change (cherry-picked) so error_classifier
recognizes the filter; then:
- chat_completion_helpers: run the swallowed error through error_classifier at
the stub-creation point and stamp _content_filter_terminated on the stub
(single source of truth — no parallel pattern list).
- conversation_loop: read the tag and activate the fallback chain BEFORE
burning any continuation retries; roll partial content back to the last
clean turn and re-issue against the new provider (restart_with_rebuilt_messages).
Plain network stalls are unaffected (only content_policy_blocked is tagged).
Credits #32479 (@sweetcornna) and #33845 (@Tranquil-Flow) which fixed the
same issue via the stub-tag and loop-escalation approaches respectively.
Live E2E confirmed: before, _try_activate_fallback called 0x; after, fallback
fires on the first stub and the fallback provider completes the turn.
The MiniMax output-layer safety filter surfaces the error verbatim as
`output new_sensitive (1027)` (sometimes with additional provider
wrapping like 'Stream stalled mid tool-call: output new_sensitive (1027)').
When the model emits a large tool-call argument block, the upstream
filter trips and the SSE stream is truncated mid-flight, producing
'stream stalled mid tool-call' errors. Until now this case was
misclassified and retried 3x on the same provider, reproducing the same
refusal and burning paid attempts.
Adding `new_sensitive` to `_CONTENT_POLICY_BLOCKED_PATTERNS` routes
it through the existing is_client_error path: skip 3x retry, activate
configured fallback model immediately, surface a clear provider-safety
message to the user.
Refs #32421
A restart now interrupts in-flight agents immediately rather than holding
the gateway open for a grace window. The previous 180s default coupled two
independently-set timers: the gateway's own drain timer and systemd's
TimeoutStopSec. On a stale unit where TimeoutStopSec < drain, systemd
SIGKILLed the gateway mid-cleanup, leaving a stale lock that made the next
startup exit immediately ('already running') — an infinite crash loop under
Restart=on-failure (#31981).
Setting drain to 0 makes the mismatch structurally impossible: with drain 0
the generated unit gets TimeoutStopSec=90 against a near-instant drain, so
systemd never kills mid-cleanup. Contract: restart the gateway, in-flight
work stops. A grace window large enough to 'save' a long agent turn would
have to outlast an unbounded task, which is impossible.
Also fixes the stale-unit warning's suggested command
(hermes gateway service install --replace -> hermes gateway install --force);
the former subcommand does not exist.
Closes#31981
Pull the #33271 post-interrupt recovery (flush_stdin + _force_full_redraw)
out of process_loop's finally block into _recover_terminal_after_interrupt(),
and replace the inline-logic-copy tests with ones that exercise the real
helper plus a source guard that process_loop still invokes it behind the
_last_turn_interrupted gate.
When the agent is interrupted during processing, prompt_toolkit's
renderer and VT100 input parser can be left in an inconsistent state.
CSI 6n cursor position report responses leak as literal text
(^[[19;1R) and the terminal stops accepting keyboard input.
Fix: in process_loop's finally block, after an interrupted turn:
- flush_stdin() to drain stray escape bytes from the OS input buffer
- _force_full_redraw() to reset prompt_toolkit's renderer cache
Closes#33271
The 600s default evicted the gateway clarify entry while users were
still away (meeting/AFK); a later button tap then landed on a dead
entry and the agent hung on 'running: clarify'. Raise the default to
1h in DEFAULT_CONFIG and the get_clarify_timeout() code-level fallback,
documenting the running-agent-guard tradeoff. User overrides still win.
Generated images under a profile gateway's cache (profiles/<name>/cache/
images/...) were silently dropped from Telegram/Discord delivery when
HERMES_HOME is symlinked under a denied prefix (e.g. /opt/data ->
/root/.hermes) and $HOME is not that prefix. The resolved path lands
under /root (a system denylist prefix), the root-home exception only
fires when the denied prefix IS $HOME, and the static safe-roots list
only covers the active HERMES_HOME's top-level cache — not per-profile
cache dirs. Both gates fail, so validate_media_delivery_path returns
None and the gateway logs 'Skipping unsafe MEDIA directive path'.
_media_delivery_allowed_roots() now also enumerates per-profile cache
roots (<root>/profiles/*/cache/{images,audio,videos,documents,
screenshots}) at check time. Allowlist match runs before the denylist,
so the profile artifact delivers regardless of the /root interaction;
profile-dir credentials (auth.json) stay blocked since they aren't
under a cache subdir.
Reopened regression of #34485/#38108, neither of which covered the
profile-scoped symlink case. Fixes#31733.
On Darwin, synchronous=FULL (the WAL default) only issues a plain
fsync(), which Apple documents does NOT guarantee writes reach stable
storage or stay ordered. SQLite's WAL corruption-safety guarantee
assumes the OS honors the fsync barrier; macOS does not unless the app
uses F_FULLFSYNC. During a launchd *system* shutdown the page cache is
dropped (effectively power-loss for in-flight pages), so a WAL
checkpoint whose fsync 'reported' durable may never hit the platter —
corrupting state.db with a malformed image. That is the trigger in
#30636 ('SIGTERM during launchd shutdown under high load').
Apply PRAGMA checkpoint_fullfsync=1 (macOS-guarded) in
apply_wal_with_fallback. It forces the F_FULLFSYNC barrier only at
checkpoint boundaries (where WAL frames land in the main DB), so cost
amortizes to ~+0.1ms/commit vs ~+4ms for the broader fullfsync=1.
No-op off Darwin (F_FULLFSYNC is macOS-only).
Root-cause analysis by @catapreta on #30636. Supersedes #30654, whose
synchronous=FULL is a no-op (already FULL in WAL mode) and whose
TRUNCATE-on-close is already on main.
Co-authored-by: catapreta <catapreta@users.noreply.github.com>
The advisory reference view stripped all tool calls and tool results, so
reference models judged a task whose actions and results they never saw — and
references only fired once per user turn, never re-running as the agent's
state advanced through the tool loop.
Two fixes:
- _reference_messages() now PRESERVES the agent's tool calls and tool results,
rendering them inline as text ([called tool: ...] / [tool result: ...]) so a
reference gives an informed judgement on the real current state. Still emits
zero tool-role messages and zero tool_calls arrays (strict providers reject
those), and large tool results are previewed head+tail (4000-char budget).
The required end-on-user shape is met by APPENDING a synthetic advisory user
turn — not by deleting the agent's latest context (which the prior fix did).
- References now re-run on every state change — each new user message AND each
new tool result — instead of once per user turn. The state-sensitive advisory
signature drives the cache: new tool result = miss (re-run), identical-state
re-call = hit (no re-run, no re-emit).
The acting aggregator still receives the full, untrimmed transcript.