The connector half (gateway-gateway) moves the passthrough plane's post-ACK
forward off the HTTP gatewayEndpoint onto the gateway's outbound /relay WS via
a new passthrough_forward frame. This is the gateway side: the relay adapter
now RECEIVES and handles that frame, so a hosted gateway (no public IP) can
process forwarded Class-2/3 traffic (Discord interactions, Twilio) over the
socket it already holds — closing the "passthrough inbound doesn't work for
hosted gateways" gap.
- ws_transport.py: decode the passthrough_forward frame; PassthroughForward
dataclass + _passthrough_from_wire (base64 body -> exact bytes, byte parity
with the connector's toPassthroughForward); set_passthrough_handler mirrors
set_interrupt_inbound_handler.
- transport.py: PassthroughHandler type + set_passthrough_handler on the
RelayTransport protocol.
- adapter.py: connect() wires the passthrough handler; _on_passthrough decodes
the (already-sanitized, token-free) forward and, for a Discord interaction,
converts it to a MessageEvent routed through the normal agent path
(handle_message) — the reply egresses over the outbound / token-less
follow_up path, so the gateway never holds the interaction credential. Never
raises (a bad forward can't kill the read loop). Non-discord forwards (Twilio)
are logged + dropped for now.
- docs/relay-connector-contract.md: document the passthrough_forward frame +
PassthroughForward shape + §3.1.
The interaction -> MessageEvent CONVERSION semantics (slash-command vs button
UX, option rendering) are the open sub-design flagged in the spec; the TRANSPORT
+ receive mechanism (this) is settled per Ben's Gate-2 decision: "the relay
adapter handles receiving these events over the WS."
Tests (tests/gateway/relay/test_relay_passthrough.py): byte-preservation
round-trip (+ malformed-body tolerance), connect() wiring, application-command
and message-component interactions route through handle_message with correct
session source + scope capture, malformed/non-discord forwards dropped cleanly.
100 relay tests green. Pairs with the connector PR (gateway-gateway).
Adds test_413_retries_on_token_only_compression: same message count but
materially fewer tokens after compaction must count as progress and retry,
not abort. Fails on main without the salvaged fix, passes with it.
display.timestamps already drove the [HH:MM] suffix on live submitted and
streamed message labels, but there was no runtime command to toggle it and
/history ignored the setting entirely. Add /timestamps [on|off|status]
(alias /ts) and render [HH:MM] in /history for turns that carry a stored
unix timestamp (resumed sessions). Live unsaved turns without a stored time
are never given a fabricated one. Uses the existing sanctioned non-wire
'timestamp' message key (stripped before the API call in chat_completions),
so message-alternation and prompt-cache invariants are untouched.
Ctrl+G already opened $EDITOR with the current draft, but used
open_in_editor(validate_and_handle=False), which only loaded the saved text
back into the input area — the user still had to press Enter. The TUI's
Ctrl+G (openEditor) submits the draft on a clean exit. Since CLI submission
is driven by the custom Enter keybinding (not the buffer accept_handler),
validate_and_handle can't route through it; instead chain a done-callback on
the editor Task that calls the new _submit_editor_buffer(), which mirrors the
Enter handler's idle/queue/slash branches and drops an empty save.
Follow-up to the accept-any-file-type change. The observe-unmentioned and
replied-media paths relied on cache_media_bytes() returning None for
unsupported document types to emit an 'unsupported, not cached' note. Now
that any file type is always cached, those docs are cached and surfaced with
a path-pointing note — consistent with the main document path. The
remaining cached-is-None branch is image-validation-failure only; its note
is reworded accordingly. Updates the group-gating test to the new contract.
Authorization to message the agent is the gate, not the file extension.
Previously the inbound-attachment allowlist (SUPPORTED_DOCUMENT_TYPES) was
opt-OUT on Discord (allow_any_attachment defaulted false) and had no bypass
at all on Telegram/Slack — so an .html (or any non-allowlisted type) was
dropped or hard-rejected before the agent saw it.
Now every authorized upload is cached and surfaced to the agent regardless
of type:
- base.cache_media_bytes(): unknown types cache as octet-stream (or the
caller-supplied MIME) instead of returning None — fixes the chokepoint
that Teams/Telegram-media route through.
- discord/telegram/slack adapters: removed the allowlist reject/skip; any
non-media attachment is typed DOCUMENT and cached. Known types keep their
precise MIME.
- Text inlining now gates on a shared _TEXT_INJECT_EXTENSIONS set (text +
code + config + markup) instead of a blind UTF-8 decode, so binary formats
(PDF/zip/docx) with ASCII headers are never inlined.
- gateway/run.py emits the path-pointing context note for every DOCUMENT,
including non text/application MIME types.
- discord.allow_any_attachment is now a documented no-op kept for config
back-compat.
Validation: 357 gateway tests pass; E2E confirms .html/.bin/custom types
cache, known types stay precise, PDFs are not inlined.
terminal.docker_extra_args passes flags verbatim to `docker run` (e.g.
--gpus=all, --shm-size=16g). It was wired into DEFAULT_CONFIG,
TERMINAL_CONFIG_ENV_MAP (so `hermes config set` bridged it),
terminal_tool._get_env_config (reads TERMINAL_DOCKER_EXTRA_ARGS), and
DockerEnvironment (applies extra_args) -- but it was MISSING from cli.py's
env_mappings and gateway/run.py's _terminal_env_map.
Consequence: a user who hand-edits config.yaml (rather than running
`hermes config set`) has docker_extra_args silently dropped on the CLI and
gateway/desktop startup paths, while docker_image / docker_volumes (which
ARE in those maps) bridge correctly -- producing the reported 'Hermes
partially reads the Docker config' symptom where --gpus=all and
--shm-size=16g never reach docker run.
This is the same bridge-coverage bug class that shipped before for
docker_run_as_host_user (cli + gateway) and docker_mount_cwd_to_workspace
(gateway). Fix by adding the key to both maps, plus a dedicated regression
pin in test_terminal_config_env_sync.py mirroring the existing
test_docker_*_is_bridged_everywhere guards.
* fix(gateway): walk /proc/*/cmdline to find main-wrapper.sh under s6-overlay v3 (#49196)
(cherry picked from commit 3a108c2df0)
* fix(container): peel s6-v3 rc.init prefix so dashboard role is detected
kyssta-exe's preceding commit (#49238) fixed _read_container_argv() to
locate the rc.init-launched main-wrapper.sh process under s6-overlay v3,
but the skip still never fired: _strip_container_argv_prefix() only peeled
a prefix when args[0] was init/main-wrapper.sh/hermes. Under s6 v3 the
matched argv is
/bin/sh -e /run/s6/basedir/scripts/rc.init top
/opt/hermes/docker/main-wrapper.sh dashboard ...
so args[0] stayed /bin/sh, _is_dashboard_container() returned False, and
the dashboard container reconciled + started its own gateway-default —
the exact dual Telegram getUpdates 409 in issue #49196.
Fix: strip everything up to and including the main-wrapper.sh token (the
stable boundary the image owns), covering both the v2 (/init ...) and v3
(/bin/sh ... rc.init top ...) shapes with one rule, instead of matching
launcher tokens positionally. This also repairs _is_legacy_gateway_run_request()
under v3, which shares the same strip helper (the issue called this out).
Tests: extend the dashboard true/false parametrize sets with the s6-v3
argv shape, and add test_main_skips_reconcile_in_dashboard_container_s6v3
exercising main() end-to-end with the v3 argv. Verified via mutation that
both new v3 assertions fail under the old positional strip and pass with
the fix.
---------
Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>
* feat(cli): /prompt — compose your next prompt in $EDITOR
Adds /prompt (alias /compose): opens $VISUAL/$EDITOR on a temp markdown
file so you can hand-edit a multi-line prompt, then sends the saved buffer
as the next agent turn. Text after the command pre-seeds the buffer; an
empty save cancels. Reuses the one-shot _pending_agent_seed the interactive
loop already consumes (same mechanism as /blueprint), so no changes to the
input event loop or message pipeline. CLI-only.
* feat(tui): /prompt slash command opens $EDITOR (parity with CLI)
The TUI already opens $EDITOR via Ctrl+G (openEditor), but had no /prompt
slash command like the classic CLI. Wire openEditor into the slash handler
context and register /prompt (alias /compose) to call it; inline text after
the command is dropped into the composer first so it carries into the editor,
matching the CLI's /prompt <text>.
* feat(cli): /reasoning full to show complete thinking, not 10-line clamp
The post-response Reasoning recap box hard-clamped long thinking to the
first 10 lines, so there was no way to see the full reasoning trace after
a turn (live streaming already shows it in full). Add display.reasoning_full
(default off) plus /reasoning full|clamp to toggle it at runtime; the clamp
truncation note now points at the command. Addresses repeated user requests
to show all thinking tokens.
* test(gateway): de-snapshot /reasoning help assertion
The test froze the exact args-hint literal '/reasoning [level|show|hide]',
which the new full/clamp args change to '[level|show|hide|full|clamp]'.
Convert to an invariant: assert /reasoning is in help and carries its core
args, not the exact hint string.
* feat(tui): /reasoning full|clamp parity in tui_gateway
The classic-CLI reasoning_full toggle had no TUI equivalent — typing
/reasoning full in the TUI fell through to parse_reasoning_effort and
errored. The TUI renders thinking as an expand/collapse section (no fixed
10-line recap), so map full -> sections.thinking=expanded (raw, uncapped
via thinkingPreview mode='full') and clamp -> collapsed, persisting
display.reasoning_full for cross-surface config consistency.
Plugins shelling out to bare `hermes` via the terminal tool hit
`command not found` (exit 127) when the gateway was launched without the
hermes install dir on PATH (systemd, service managers, cron, desktop
launchers) — even though `hermes` works in the user's own interactive
terminal, which sources the shell rc that exports that dir.
The terminal tool's subshell PATH was the agent process PATH plus a
static set of system dirs (_SANE_PATH); it never included wherever the
hermes console-script actually lives (~/.local/bin, the venv bin/Scripts,
pipx, nix). Resolve that dir once (which/argv0/sys.executable) and
prepend-if-missing it so bare `hermes` resolves regardless of launch
method.
* feat(providers): remove google-gemini-cli + google-antigravity OAuth providers
Google now actively bans accounts for third-party tools that piggyback on
Gemini CLI / Antigravity / Code Assist OAuth, and because abuse prevention
sits at a backend layer the ban can extend to the entire Google account
(Gmail/Drive), with a second violation being permanent.
Ref: https://github.com/google-gemini/gemini-cli/discussions/20632
Removes both OAuth inference providers entirely (modules, provider profiles,
auth/runtime/config/models wiring, the /gquota Code Assist quota command,
the antigravity-cli optional skill, desktop + docs surface in en + zh-Hans).
The API-key 'gemini' provider (GOOGLE_API_KEY/GEMINI_API_KEY against
generativelanguage.googleapis.com) is unaffected and stays fully supported.
* fix(skills): keep the antigravity-cli skill — only the OAuth provider is removed
The antigravity-cli optional skill orchestrates the external `agy` binary as
a coding-agent tool via the terminal tool — it does NOT wrap Hermes inference
through the banned google-antigravity OAuth provider, so it carries none of
the account-ban risk that motivated removing that provider. Restore the skill,
its docs page, the sidebar entry, and the optional-skills catalog row. The
google-antigravity / google-gemini-cli inference providers stay fully removed.
The welcome banner's 'Available Tools' merged in every toolset from the
global check_tool_availability() registry walk, regardless of whether it
was enabled for the current platform. On a Blank Slate CLI (file +
terminal only) that surfaced discord / feishu / kanban tools the agent
was never actually given — they are not in the agent's tool schema, but
the banner displayed them, making it look like they were exposed.
- Filter the unavailable-toolset merge to toolsets actually in
enabled_toolsets (a toolset that's enabled but has unmet deps still
legitimately shows as disabled/lazy).
- Gate the 'Available Skills' section on the skills toolset being
enabled — when it's off, the agent can't load any skill, so show
'Skills toolset disabled' instead of the on-disk catalog.
When enabled_toolsets is empty (older callers), behavior is unchanged.
Validation: blank-slate banner now shows only file + terminal and
'Skills toolset disabled'; a skills-enabled banner still lists the
catalog. Added regression tests; full banner suite green (15/15).
Live testing against a real SIGTERM-ignoring process TREE (parent + children,
the agent-browser daemon + renderer shape) revealed psutil.wait_procs's
gone/alive partition mis-handles a parent/child tree: it reaps via
Process.wait() and could mark targets gone/alive inconsistently across the
tree, leaving survivors un-killed (flaky — sometimes the parent lived,
sometimes a child). Replace it with: sleep out the grace window, then
directly re-probe every captured target (_proc_alive, treating zombies as
dead) and SIGKILL any that's still running. Add a multi-child-tree regression
test. 6/6 escalation tests green across repeated runs; the real-tree E2E now
kills the full tree 6/6 runs.
A daemon that ignores or stalls in its SIGTERM handler currently survives the
process-registry reap and leaks until reboot (observed as agent-browser
daemons accumulating to EMFILE on long-running gateways). _terminate_host_pid
now snapshots the tree, SIGTERMs it, waits a bounded grace window
(terminal.daemon_term_grace_seconds, default 2.0s, 0 disables), then SIGKILLs
any survivor. The recycled-PID identity guard still gates the whole path, so
escalation never reaches a stranger; Windows is unchanged (taskkill /F is
already a hard kill).
Config lives in config.yaml (terminal.daemon_term_grace_seconds), NOT an env
var, per the .env-secrets-only policy.
Implements the SIGKILL-escalation idea from @tkwong's #15008, reworked onto the
current _terminate_host_pid tree-kill path (the original predated it) and
config-gated instead of env-var-gated.
Co-authored-by: Benjamin Wong <tkwong@inspiresynergy.com>
Surface dangerous host/deployment posture at gateway startup so operators get
the 'you're exposed' signal the June 2026 MCP-config persistence campaign
victims never had. Warn-only — never blocks startup, never raises.
Checks (each independently fail-safe):
- Running as root (POSIX uid 0)
- SSH daemon with PasswordAuthentication enabled (incl. the 'yes' default)
- Running in a container with no persistent volume mount over HERMES_HOME
- Network-accessible API server with no API_SERVER_KEY
New module hermes_cli/security_audit_startup.py; invoked once per process from
start_gateway() right after setup_logging(). Cross-platform (root/SSH checks
no-op on Windows). Idea: @Cthulhu.
The s6 dashboard entrypoint and docker integration tests relied on
HERMES_DASHBOARD_INSECURE=1 to bring up a 0.0.0.0 dashboard with no auth
provider. With --insecure now a no-op (auth gate mandatory on non-loopback
binds), that path fails closed.
- s6 dashboard/run: drop --insecure derivation; warn that the env is a no-op
and point operators at HERMES_DASHBOARD_BASIC_AUTH_* / OAuth.
- docker tests: supervision tests now register the bundled basic password
provider (HERMES_DASHBOARD_BASIC_AUTH_USERNAME/_PASSWORD) so the gate has a
provider and the dashboard binds. Rewrote the insecure-opt-out test to
assert fail-closed (dashboard does NOT serve) instead of gate-bypass.
- docs (en + zh-Hans): HERMES_DASHBOARD_INSECURE documented as deprecated
no-op; basic-auth is the zero-infra way to authenticate a containerized
public dashboard.
Remove the dashboard --insecure auth-bypass, add an MCP persistence guard +
IOC blocklist, and raise the API-server key entropy floor.
Driven by the June 2026 hermes-0day campaign (r/hermesagent, live 854.media
instance): scanners find exposed Hermes dashboards/API servers, drive the
root agent to plant a 'command: bash' MCP entry that appends an attacker SSH
key to authorized_keys, which cron + startup then re-execute every tick.
- dashboard: --insecure no longer disables the auth gate. should_require_auth
returns True for every non-loopback bind; a public bind ALWAYS requires an
auth provider (bundled password provider or OAuth). --insecure kept as a
warned no-op for backward compat. Fail-closed error now points at the
password provider, not at --insecure.
- mcp_security: validate_mcp_server_entry now also rejects shell payloads that
write to OS persistence surfaces (authorized_keys/.ssh/pam.d/sudoers/cron/
rc files) and hard-rejects a hermes-0day IOC blocklist (attacker SSH key +
source IPs) anywhere in command/args/env. Runs at save AND spawn time.
- api_server: raise network-bind API_SERVER_KEY entropy floor 8->16 chars;
warn when a network-accessible API server runs an unsandboxed local backend.
* fix(agent): strip stale reasoning_content when falling back to a strict provider
A reasoning primary (DeepSeek/Kimi/MiMo thinking mode) pins reasoning_content
on every assistant tool-call turn (a single space " " pad). api_messages is
built once under the primary; on a mid-session fallback to a strict
OpenAI-compatible provider (Mistral, Cerebras, Groq, SambaNova), those stale
pads were replayed verbatim and rejected with HTTP 400/422:
body.messages.2.assistant.reasoning_content: Extra inputs are not
permitted (input: ' ')
reapply_reasoning_echo_for_provider() only ever ADDED pads, so it never
reconciled history built under a reasoning primary against a strict fallback.
copy_reasoning_content_for_api() also leaked empty-string and 'reasoning'-only
shapes to non-pad providers.
Fix both sites: when the active provider does not enforce echo-back, strip
reasoning_content (empty, space-pad, or non-empty) entirely. Re-padding when
switching TO a reasoning provider is preserved. Covers the Cerebras 400 from
#45655 and the DeepSeek->Mistral 422 fallback report.
Refs #45655.
* test: update reasoning-replay tests for strict-provider stripping
test_explicit_reasoning_content_beats_normalized_reasoning_on_replay was
implicitly running on the OpenRouter fixture (non-pad); pin it to a reasoning
provider so the precedence it checks is observable. Add a positive
strict-provider test asserting reasoning_content is stripped on replay.
The PID-reuse guard (#43846) reads /proc/<pid>/stat field 22, which only
exists on Linux — on macOS/Windows it returned None and the guard silently
degraded to a bare liveness check (a no-op, safety-wise). Add a
psutil.create_time() fallback (psutil is a hard dep, cross-platform),
quantized to centiseconds for stable equality, so the recycled-PID guard
actually protects macOS/Windows too. /proc always wins first on Linux and
always misses on macOS/Windows, so the two sources never mix on one host and
same-source equality is all the guard needs.
The salvaged test spawned a listener subprocess that printed its port
immediately after bind() but BEFORE listen(), so under CI's loaded 8-worker
box the parent connected before the socket was listening -> ConnectionRefused
(flaked on test slice 2/6). Reorder the child to listen() then print the port,
and make the client connect with a short bounded retry to absorb scheduler
jitter. 15/15 green locally including direct hammering.
Follow-up to the salvaged #43846 commits: the WhatsApp adapter moved from
gateway/platforms/whatsapp.py to plugins/platforms/whatsapp/adapter.py since the
PR was authored. The cherry-pick brought _listener_pids_on_port's `re.finditer`
ss-fallback and the new test's import, but the new module location doesn't import
`re` (latent NameError on the lsof-absent fallback path) and the test imported the
old module path. Add `import re` to the adapter and repoint the test import.
This is the bug that was actually closing Firefox. `_kill_port_process`, run on
every bridge (re)start to free the port, used `lsof -ti :PORT` / `fuser PORT/tcp`
— both of which match a process whose socket merely *involves* that port number
in ANY state, including ESTABLISHED client connections. It then SIGTERMed every
match.
The bridge defaults to port 3000 — a ubiquitous local dev-server port. With a
browser tab open on localhost:3000, `lsof -ti :3000` returned Firefox's PID, so
each restart of the (crash-looping) WhatsApp bridge SIGTERMed Firefox, closing
the whole browser at irregular intervals with no crash and no coredump.
Proven live with the kernel `signal:signal_generate` tracepoint:
hermes-gateway(3396516) -> sig=15 (code=0/SI_USER) -> comm=firefox pid=3371585
captured immediately after a gateway start, while Firefox held a socket on the
bridge port. Demonstrated over-match: `lsof -ti :8080` returns the listener AND
the gateway's own client connection; `lsof -ti tcp:8080 -sTCP:LISTEN` returns
only the listener.
Fix: `_listener_pids_on_port` resolves only LISTEN-state sockets
(`lsof -ti tcp:PORT -sTCP:LISTEN`, with an `ss -ltnp` fallback) and
`_kill_port_process` signals just those. A client whose connection happens to
involve the port number is never touched — which is also more correct, since a
client never blocks the new bridge from binding. Windows already filtered
LISTENING; the broad `fuser -k` path is removed.
Adds TestKillPortProcess: real-socket tests proving a separate client process
is excluded from the listener lookup and survives port cleanup. 9 tests green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`_kill_stale_bridge_by_pidfile` SIGTERMed the PID recorded in `bridge.pid`
after only a bare liveness check. Once the bridge exits and is reaped the
kernel recycles that PID onto an unrelated process; because the WhatsApp bridge
crash-loops ("Bridge process died (exit code 1)" repeating), this cleanup ran
on every restart and could SIGTERM a recycled PID that had landed on the user's
browser — closing Firefox at irregular intervals with no crash and no coredump
(a clean kill of a stranger).
Same PID-recycling class as the MCP reaper (7bd1f8a2d) and the process-registry
host-PID guard (e6a99cef2); this was the third, and most actively-fired, path.
Fix: `_write_bridge_pidfile` now also records the leader's kernel start time
(line 2). `_kill_stale_bridge_by_pidfile` re-validates identity via
`_bridge_pid_is_ours` before signalling — the (pid, start time) pair must match,
or for legacy single-line pidfiles the live cmdline must name `node` + this
session's unique path. A recycled PID (different start time / cmdline) is logged
and skipped, never signalled. Legacy pidfiles stay readable.
Adds TestWhatsappBridgePidfile: real-process tests proving a genuine bridge is
reaped while a recycled PID (start-time mismatch, or non-bridge cmdline) is
spared. 7 new + 108 gateway/registry tests green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The background-process registry signalled host PIDs (recovery adoption,
detached-session kill, tree-kill) using a number captured at spawn, guarded
only by a bare liveness check. Once a session's process exits and is reaped the
kernel recycles that PID onto an unrelated process, so an alive-but-different
PID passed the check and got tree-killed.
Observed in the wild: a recycled background-session PID landed on Firefox's
session leader; a later kill/refresh walked its process tree and SIGTERMed
every tab — Firefox "closing" at irregular intervals with no crash/coredump.
This is the same PID/PGID-recycling class fixed for the MCP orphan reaper in
7bd1f8a2d, but the process_registry subsystem was never guarded — so the bug
persisted.
Fix: record each host process's kernel start time (/proc/<pid>/stat field 22)
at spawn, persist it in the checkpoint, and re-validate it before every signal
via `_host_pid_is_ours`. A PID whose start time no longer matches — or that is
gone — is never signalled:
- recover_from_checkpoint: a recycled PID is not adopted as a session.
- _refresh_detached_session: a recycled detached PID is marked exited.
- kill_process / _terminate_host_pid: refuse to tree-kill a stranger.
Legacy checkpoints and platforms without /proc (no baseline) degrade to the
prior best-effort liveness behaviour, so nothing else changes.
Adds TestPidReuseGuard: real-process tests proving a mismatched start time
refuses termination while a matching one still kills, plus recovery/refresh
recycling paths. 74 registry + 22 MCP-stability tests green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The kanban-worker and kanban-orchestrator bundled skills existed only to
be force-loaded into dispatcher-spawned workers, gated by
environments:[kanban] so they wouldn't leak into normal CLI listings.
That gating was fragile (the leak that #50443 patched) and the
--skills auto-load was already best-effort — most workers ran without it
because the bundled skill isn't present in profile-scoped skills dirs.
Remove the skills entirely and promote their load-bearing content
(workspace kinds, deliverable artifacts, created-card integrity, profile
discovery) into KANBAN_GUIDANCE, which is already injected into every
kanban worker's system prompt. Net result: every worker reliably gets
the guidance, nothing can leak into a CLI/blank-slate session, and the
gating machinery is gone.
- agent/prompt_builder.py: promote the 4 load-bearing rules into KANBAN_GUIDANCE
- hermes_cli/kanban_db.py: drop --skills kanban-worker auto-injection + _kanban_worker_skill_available probe
- hermes_cli/kanban_swarm.py: drop skills=[kanban-orchestrator] on the root card
- hermes_cli/kanban.py: drop kanban-init skill seeding; fix help text
- delete skills/devops/kanban-{worker,orchestrator}
- docs: delete the two skill pages (EN+zh), fix sidebars/catalog/kanban.md/kanban-worker-lanes.md and the video-orchestrator + codex-lane references
- tests: update spawn-argv expectations; re-bound the guidance-size guard
Supersedes the skill-leak half of #50443 (credit @helix4u for flagging the area).
Salvage follow-up on top of @pmos69's #29474. The PR resolved the
Antigravity OAuth client purely by discovering it from an installed `agy`
binary or HERMES_ANTIGRAVITY_CLIENT_ID/SECRET env vars, so users without
agy installed hit a hard 'client ID not available' error.
Antigravity's desktop OAuth client is a public, non-confidential installed-app
client (PKCE provides the security), baked into every copy of the Antigravity
CLI — same posture as the gemini-cli credentials Hermes already ships in
google_oauth.py. Bake it in as the final fallback (env -> discovery -> public
default) and add the public default Code Assist project as the discovery
fallback, matching the reference Antigravity flow. Now consumers can
authenticate directly without agy installed.
Pin the contract that ``_apply_env_overrides`` consults ``is_connected``
before the install-triggering ``check_fn``: an unconfigured platform is
skipped without calling ``check_fn`` (no lazy install), while a configured
platform still has ``check_fn`` run and is auto-enabled. The first assertion
fails on the pre-fix unconditional sweep.
The browser orphan reaper reads a daemon PID from a `.pid` file in a
world-writable, predictably-named temp dir (`/tmp/agent-browser-h_*`) it
does not write itself, then tree-kills that PID via `_terminate_host_pid`
after only a liveness check. A same-user actor could plant a fake socket
dir whose `.pid` points at an arbitrary victim process, and OS PID reuse
after the real daemon exits could land the recorded PID on an unrelated
process — either way an arbitrary same-user process (and its whole tree)
gets SIGTERMed. Local DoS.
Add `_verify_reapable_browser_daemon()`, gated before the kill: via psutil
(a hard dep, fine cross-platform for the same-user processes the reaper can
signal) require both (1) identity — `agent-browser` in the process
name/cmdline — and (2) binding — the live process references *this* session's
socket dir in its cmdline or `AGENT_BROWSER_SOCKET_DIR`. The binding check is
the real spoof defense: a planted/recycled PID won't embed our exact socket
path. Fail-closed on any ambiguity (unreadable cmdline, no match), leaving the
process and its socket dir untouched for a later sweep.
Builds on @sgaofen's fix in #14394 (cmdline identity check); rewritten to use
psutil instead of `/proc`+`ps` (cross-platform, Windows-covered) and to add
the session-socket-dir binding check for recycled-PID / spoof resistance.
Co-authored-by: sgaofen <135070653+sgaofen@users.noreply.github.com>
Follow-up to the salvaged #9560 fix:
- Replace the _TRAVERSAL_RE regex with an explicit _is_path_unsafe() helper
(drops the now-unused `import re`); catches a path separator ANYWHERE,
not just leading, so a non-leading Windows backslash can't slip through.
- Switch the per-entry skip in _ensure_loaded_locked from print() to
logger.warning to match the module's logging conventions.
- Add AUTHOR_MAP entry for the contributor.
- Add regression tests for the non-leading-separator case.
Extends the CWE-22 path traversal guard to cover Windows absolute paths
of the form C:/... and D:\... — previously only leading / and \ were
checked, which missed drive-letter prefixes. Replaces the inline
startswith check with a compiled module-level regex (_TRAVERSAL_RE) that
covers all three attack patterns: .., leading /\, and leading X: drives.
Adds two regression tests for C:/windows/system32 and D:\\path\\to\\file.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Addresses PR #9560 review comments: applies the CWE-22 fix to current main
(post-PR #458 rebase) and adds the requested regression tests.
- SessionEntry.from_dict now raises ValueError for session_key or session_id
containing '..' or starting with '/' or '\' (directory traversal guard)
- SessionStore._ensure_loaded moves per-entry validation inside the loop so
one malicious/corrupt entry is skipped with a warning instead of aborting
the entire sessions.json load
- Adds TestSessionEntryFromDictTraversalValidation (5 cases) and
TestEnsureLoadedSkipsInvalidEntries covering the skip-not-abort behavior
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Secret redaction only matched `Authorization: Bearer <token>`. Other auth
headers passed through verbatim into logs, tool output, and transcripts:
- `Authorization: Basic <base64>` — leaks base64(user:password)
- `Authorization: token <pat>` / any non-Bearer scheme
- `Proxy-Authorization: ...`
- `x-api-key: <key>` (Anthropic and many providers) and `api-key`,
`x-goog-api-key`, `x-auth-token`, `x-access-token`, ... — opaque values with
no known vendor prefix were caught by nothing
A logged request or an echoed `curl -H "x-api-key: ..."` command therefore
leaked live credentials.
Generalize the Authorization rule to mask the credential for any scheme (and
Proxy-Authorization) while preserving the header name and scheme word for
debuggability, and add an api-key header rule for the single-opaque-value
headers. Bearer behavior is unchanged; plain prose containing the word
"authorization" (no colon-delimited value) is left untouched.
Adds regression tests for Basic/token/Proxy auth and the x-api-key/api-key
headers, including inside a curl command.
Follow-up to the salvaged #25961 fix: regression tests asserting that
scope-bearing IPv6 addresses (fe80::1%eth0, ::1%lo) are blocked by
is_safe_url after the scope is stripped, that a still-unparseable address
fails closed, and that a scoped IPv4-mapped IMDS address is caught by the
always-blocked floor.
When a streamed Telegram reply finalizes, the stream consumer could take
the fresh-final path (send a new sendRichMessage + best-effort delete the
preview) purely because the time-based _should_send_fresh_final()
threshold elapsed — even though Telegram's prefers_fresh_final_streaming
returns False. The fresh Rich Message then overlapped the legacy
MarkdownV2 preview already on screen, leaving both visible (the #47048
table + bullet double-render).
Honor the adapter's decision: when prefers_fresh_final_streaming exists
on the adapter (checked on the class + instance __dict__ so MagicMock
auto-attrs don't false-positive) and declines, the time threshold no
longer overrides it. Adapters without the hook keep the time-based
fresh-final for backward compat.
Fixes#47048
Fold in the #40715 blank-env OOM fix on top of the host-resolution change:
- connect() now sets a non-retryable fatal error when required settings are
missing, so the gateway stops reconnecting against an empty host instead of
looping forever and leaking memory until the host OOM-kills.
- check_email_requirements() treats blank/whitespace-only EMAIL_* values as
missing, so an abandoned setup with empty keys no longer enables the platform.
Credits the parallel fixes by zerone0x (#40745) and liuhao1024 (#40829).
The email adapter read address/host purely from env vars and never stripped
them, so a missing or whitespace-padded EMAIL_IMAP_HOST reached
imaplib.IMAP4_SSL("") and surfaced as the misleading
"[Errno 8] nodename nor servname provided, or not known" — sending users down a
DNS rabbit hole when the real problem was an empty/dirty host string. A
config.yaml-only setup also left the host empty because __init__ ignored
PlatformConfig.extra, even though the "connected" check, the send helper, and
`hermes config show` already read address/imap_host/smtp_host from it.
Resolve address/imap_host/smtp_host from the env var first, then fall back to
config.extra, and strip surrounding whitespace — matching the send helper's
existing pattern. Validate the required settings at the start of connect() and
return False with an actionable message instead of attempting a connection with
an empty host.
Adds regression tests for whitespace stripping, config.extra fallback, and the
no-IMAP-attempt-on-missing-host path.
- Add thread-scoped regression test: interrupt on the waiting thread resolves
the approval as deny well under the 300s timeout; a foreign-thread interrupt
does NOT release the wait (interrupts are per-thread).
- Add panghuer023 to AUTHOR_MAP for the salvaged #37994 fix.
A bare custom provider configured via `model.api_base` (the intuitive name
OpenAI-SDK / LiteLLM users reach for) was silently ignored: `hermes config set`
accepts any dotted key, so `model.api_base` got written and confirmed, but the
runtime resolver reads only `model.base_url`. Requests fell back to OpenRouter
with an empty key -> 401, zero hits to the custom endpoint (issue #8919).
Now api_base is migrated to base_url at load time (fixes existing broken
configs) and at set time (with a notice), never overriding an explicit
base_url. Closes#8919.