Commit graph

5969 commits

Author SHA1 Message Date
kshitij
2649f7360c
Merge pull request #50062 from NousResearch/salvage/cron-missed-grace-runonce
fix(cron): run missed-grace jobs once instead of deferring forever
2026-06-22 15:50:54 +05:30
kshitij
33efff0d8c
Merge pull request #50726 from NousResearch/salvage/compression-token-progress
fix(agent): count tokens, not just message rows, as compression progress (#23767, #39550)
2026-06-22 15:44:38 +05:30
Ben Barclay
64a507da44
feat(relay): handle passthrough_forward over the WS (Phase 5 §5.1, gateway half) (#50702)
The connector half (gateway-gateway) moves the passthrough plane's post-ACK
forward off the HTTP gatewayEndpoint onto the gateway's outbound /relay WS via
a new passthrough_forward frame. This is the gateway side: the relay adapter
now RECEIVES and handles that frame, so a hosted gateway (no public IP) can
process forwarded Class-2/3 traffic (Discord interactions, Twilio) over the
socket it already holds — closing the "passthrough inbound doesn't work for
hosted gateways" gap.

- ws_transport.py: decode the passthrough_forward frame; PassthroughForward
  dataclass + _passthrough_from_wire (base64 body -> exact bytes, byte parity
  with the connector's toPassthroughForward); set_passthrough_handler mirrors
  set_interrupt_inbound_handler.
- transport.py: PassthroughHandler type + set_passthrough_handler on the
  RelayTransport protocol.
- adapter.py: connect() wires the passthrough handler; _on_passthrough decodes
  the (already-sanitized, token-free) forward and, for a Discord interaction,
  converts it to a MessageEvent routed through the normal agent path
  (handle_message) — the reply egresses over the outbound / token-less
  follow_up path, so the gateway never holds the interaction credential. Never
  raises (a bad forward can't kill the read loop). Non-discord forwards (Twilio)
  are logged + dropped for now.
- docs/relay-connector-contract.md: document the passthrough_forward frame +
  PassthroughForward shape + §3.1.

The interaction -> MessageEvent CONVERSION semantics (slash-command vs button
UX, option rendering) are the open sub-design flagged in the spec; the TRANSPORT
+ receive mechanism (this) is settled per Ben's Gate-2 decision: "the relay
adapter handles receiving these events over the WS."

Tests (tests/gateway/relay/test_relay_passthrough.py): byte-preservation
round-trip (+ malformed-body tolerance), connect() wiring, application-command
and message-component interactions route through handle_message with correct
session source + scope capture, malformed/non-discord forwards dropped cleanly.
100 relay tests green. Pairs with the connector PR (gateway-gateway).
2026-06-22 20:10:57 +10:00
kshitijk4poor
ebd38e1280 test(agent): regression for token-only compression progress (#39550, #23767)
Adds test_413_retries_on_token_only_compression: same message count but
materially fewer tokens after compaction must count as progress and retry,
not abort. Fails on main without the salvaged fix, passes with it.
2026-06-22 15:26:29 +05:30
Teknium
5ff11a689b
feat(cli): /timestamps command + timestamps in /history (#50506)
display.timestamps already drove the [HH:MM] suffix on live submitted and
streamed message labels, but there was no runtime command to toggle it and
/history ignored the setting entirely. Add /timestamps [on|off|status]
(alias /ts) and render [HH:MM] in /history for turns that carry a stored
unix timestamp (resumed sessions). Live unsaved turns without a stored time
are never given a fabricated one. Uses the existing sanctioned non-wire
'timestamp' message key (stripped before the API call in chat_completions),
so message-alternation and prompt-cache invariants are untouched.
2026-06-21 22:44:25 -07:00
Shannon Sands
b9b4756ab4 fix dashboard chat session titles 2026-06-21 22:44:02 -07:00
Shannon Sands
5dae502b86 Address email pairing review feedback 2026-06-21 22:43:57 -07:00
Shannon Sands
2455e1801b Make email pairing opt-in 2026-06-21 22:43:57 -07:00
Teknium
74f0dd62e8
feat(cli): Ctrl+G submits the edited draft on save (TUI parity) (#50560)
Ctrl+G already opened $EDITOR with the current draft, but used
open_in_editor(validate_and_handle=False), which only loaded the saved text
back into the input area — the user still had to press Enter. The TUI's
Ctrl+G (openEditor) submits the draft on a clean exit. Since CLI submission
is driven by the custom Enter keybinding (not the buffer accept_handler),
validate_and_handle can't route through it; instead chain a done-callback on
the editor Task that calls the new _submit_editor_buffer(), which mirrors the
Enter handler's idle/queue/slash branches and drops an empty save.
2026-06-21 22:43:55 -07:00
Shannon Sands
4b09903de5 fix Nous auth refresh for idle agents 2026-06-21 22:43:48 -07:00
teknium1
b5bd66eac9 fix(telegram): observed/replied group docs of any type are cached too
Follow-up to the accept-any-file-type change. The observe-unmentioned and
replied-media paths relied on cache_media_bytes() returning None for
unsupported document types to emit an 'unsupported, not cached' note. Now
that any file type is always cached, those docs are cached and surfaced with
a path-pointing note — consistent with the main document path. The
remaining cached-is-None branch is image-validation-failure only; its note
is reworded accordingly. Updates the group-gating test to the new contract.
2026-06-21 22:43:45 -07:00
teknium1
4314d451ca fix(gateway): accept any inbound file type across all messaging platforms
Authorization to message the agent is the gate, not the file extension.
Previously the inbound-attachment allowlist (SUPPORTED_DOCUMENT_TYPES) was
opt-OUT on Discord (allow_any_attachment defaulted false) and had no bypass
at all on Telegram/Slack — so an .html (or any non-allowlisted type) was
dropped or hard-rejected before the agent saw it.

Now every authorized upload is cached and surfaced to the agent regardless
of type:
- base.cache_media_bytes(): unknown types cache as octet-stream (or the
  caller-supplied MIME) instead of returning None — fixes the chokepoint
  that Teams/Telegram-media route through.
- discord/telegram/slack adapters: removed the allowlist reject/skip; any
  non-media attachment is typed DOCUMENT and cached. Known types keep their
  precise MIME.
- Text inlining now gates on a shared _TEXT_INJECT_EXTENSIONS set (text +
  code + config + markup) instead of a blind UTF-8 decode, so binary formats
  (PDF/zip/docx) with ASCII headers are never inlined.
- gateway/run.py emits the path-pointing context note for every DOCUMENT,
  including non text/application MIME types.
- discord.allow_any_attachment is now a documented no-op kept for config
  back-compat.

Validation: 357 gateway tests pass; E2E confirms .html/.bin/custom types
cache, known types stay precise, PDFs are not inlined.
2026-06-21 22:43:45 -07:00
Ben Barclay
de6b3ae377
fix(terminal): bridge docker_extra_args to TERMINAL_DOCKER_EXTRA_ARGS in CLI + gateway (#50631)
terminal.docker_extra_args passes flags verbatim to `docker run` (e.g.
--gpus=all, --shm-size=16g). It was wired into DEFAULT_CONFIG,
TERMINAL_CONFIG_ENV_MAP (so `hermes config set` bridged it),
terminal_tool._get_env_config (reads TERMINAL_DOCKER_EXTRA_ARGS), and
DockerEnvironment (applies extra_args) -- but it was MISSING from cli.py's
env_mappings and gateway/run.py's _terminal_env_map.

Consequence: a user who hand-edits config.yaml (rather than running
`hermes config set`) has docker_extra_args silently dropped on the CLI and
gateway/desktop startup paths, while docker_image / docker_volumes (which
ARE in those maps) bridge correctly -- producing the reported 'Hermes
partially reads the Docker config' symptom where --gpus=all and
--shm-size=16g never reach docker run.

This is the same bridge-coverage bug class that shipped before for
docker_run_as_host_user (cli + gateway) and docker_mount_cwd_to_workspace
(gateway). Fix by adding the key to both maps, plus a dedicated regression
pin in test_terminal_config_env_sync.py mirroring the existing
test_docker_*_is_bridged_everywhere guards.
2026-06-22 15:41:23 +10:00
Ben Barclay
6202fdfc35
fix(container): detect dashboard role under s6-overlay v3 (#49196) (#50600)
* fix(gateway): walk /proc/*/cmdline to find main-wrapper.sh under s6-overlay v3 (#49196)

(cherry picked from commit 3a108c2df0)

* fix(container): peel s6-v3 rc.init prefix so dashboard role is detected

kyssta-exe's preceding commit (#49238) fixed _read_container_argv() to
locate the rc.init-launched main-wrapper.sh process under s6-overlay v3,
but the skip still never fired: _strip_container_argv_prefix() only peeled
a prefix when args[0] was init/main-wrapper.sh/hermes. Under s6 v3 the
matched argv is

    /bin/sh -e /run/s6/basedir/scripts/rc.init top
        /opt/hermes/docker/main-wrapper.sh dashboard ...

so args[0] stayed /bin/sh, _is_dashboard_container() returned False, and
the dashboard container reconciled + started its own gateway-default —
the exact dual Telegram getUpdates 409 in issue #49196.

Fix: strip everything up to and including the main-wrapper.sh token (the
stable boundary the image owns), covering both the v2 (/init ...) and v3
(/bin/sh ... rc.init top ...) shapes with one rule, instead of matching
launcher tokens positionally. This also repairs _is_legacy_gateway_run_request()
under v3, which shares the same strip helper (the issue called this out).

Tests: extend the dashboard true/false parametrize sets with the s6-v3
argv shape, and add test_main_skips_reconcile_in_dashboard_container_s6v3
exercising main() end-to-end with the v3 argv. Verified via mutation that
both new v3 assertions fail under the old positional strip and pass with
the fix.

---------

Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>
2026-06-22 15:35:38 +10:00
Teknium
9e96e70995
feat(cli): /prompt — compose your next prompt in $EDITOR (#50509)
* feat(cli): /prompt — compose your next prompt in $EDITOR

Adds /prompt (alias /compose): opens $VISUAL/$EDITOR on a temp markdown
file so you can hand-edit a multi-line prompt, then sends the saved buffer
as the next agent turn. Text after the command pre-seeds the buffer; an
empty save cancels. Reuses the one-shot _pending_agent_seed the interactive
loop already consumes (same mechanism as /blueprint), so no changes to the
input event loop or message pipeline. CLI-only.

* feat(tui): /prompt slash command opens $EDITOR (parity with CLI)

The TUI already opens $EDITOR via Ctrl+G (openEditor), but had no /prompt
slash command like the classic CLI. Wire openEditor into the slash handler
context and register /prompt (alias /compose) to call it; inline text after
the command is dropped into the composer first so it carries into the editor,
matching the CLI's /prompt <text>.
2026-06-21 20:21:33 -07:00
Teknium
95d53c3bcb
feat(cli): /reasoning full — show complete thinking, not 10-line clamp (#50499)
* feat(cli): /reasoning full to show complete thinking, not 10-line clamp

The post-response Reasoning recap box hard-clamped long thinking to the
first 10 lines, so there was no way to see the full reasoning trace after
a turn (live streaming already shows it in full). Add display.reasoning_full
(default off) plus /reasoning full|clamp to toggle it at runtime; the clamp
truncation note now points at the command. Addresses repeated user requests
to show all thinking tokens.

* test(gateway): de-snapshot /reasoning help assertion

The test froze the exact args-hint literal '/reasoning [level|show|hide]',
which the new full/clamp args change to '[level|show|hide|full|clamp]'.
Convert to an invariant: assert /reasoning is in help and carries its core
args, not the exact hint string.

* feat(tui): /reasoning full|clamp parity in tui_gateway

The classic-CLI reasoning_full toggle had no TUI equivalent — typing
/reasoning full in the TUI fell through to parse_reasoning_effort and
errored. The TUI renders thinking as an expand/collapse section (no fixed
10-line recap), so map full -> sections.thinking=expanded (raw, uncapped
via thinkingPreview mode='full') and clamp -> collapsed, persisting
display.reasoning_full for cross-surface config consistency.
2026-06-21 20:21:11 -07:00
Teknium
b0a25980f8
fix(terminal): make hermes install dir reachable in subshell PATH (#50534)
Plugins shelling out to bare `hermes` via the terminal tool hit
`command not found` (exit 127) when the gateway was launched without the
hermes install dir on PATH (systemd, service managers, cron, desktop
launchers) — even though `hermes` works in the user's own interactive
terminal, which sources the shell rc that exports that dir.

The terminal tool's subshell PATH was the agent process PATH plus a
static set of system dirs (_SANE_PATH); it never included wherever the
hermes console-script actually lives (~/.local/bin, the venv bin/Scripts,
pipx, nix). Resolve that dir once (which/argv0/sys.executable) and
prepend-if-missing it so bare `hermes` resolves regardless of launch
method.
2026-06-21 20:00:06 -07:00
Teknium
7130d60861
feat(providers): remove google-gemini-cli + google-antigravity OAuth providers (#50492)
* feat(providers): remove google-gemini-cli + google-antigravity OAuth providers

Google now actively bans accounts for third-party tools that piggyback on
Gemini CLI / Antigravity / Code Assist OAuth, and because abuse prevention
sits at a backend layer the ban can extend to the entire Google account
(Gmail/Drive), with a second violation being permanent.
Ref: https://github.com/google-gemini/gemini-cli/discussions/20632

Removes both OAuth inference providers entirely (modules, provider profiles,
auth/runtime/config/models wiring, the /gquota Code Assist quota command,
the antigravity-cli optional skill, desktop + docs surface in en + zh-Hans).
The API-key 'gemini' provider (GOOGLE_API_KEY/GEMINI_API_KEY against
generativelanguage.googleapis.com) is unaffected and stays fully supported.

* fix(skills): keep the antigravity-cli skill — only the OAuth provider is removed

The antigravity-cli optional skill orchestrates the external `agy` binary as
a coding-agent tool via the terminal tool — it does NOT wrap Hermes inference
through the banned google-antigravity OAuth provider, so it carries none of
the account-ban risk that motivated removing that provider. Restore the skill,
its docs page, the sidebar entry, and the optional-skills catalog row. The
google-antigravity / google-gemini-cli inference providers stay fully removed.
2026-06-21 19:53:27 -07:00
Teknium
5bf23ff251
fix(banner): don't advertise toolsets/skills the agent wasn't given (#50497)
The welcome banner's 'Available Tools' merged in every toolset from the
global check_tool_availability() registry walk, regardless of whether it
was enabled for the current platform. On a Blank Slate CLI (file +
terminal only) that surfaced discord / feishu / kanban tools the agent
was never actually given — they are not in the agent's tool schema, but
the banner displayed them, making it look like they were exposed.

- Filter the unavailable-toolset merge to toolsets actually in
  enabled_toolsets (a toolset that's enabled but has unmet deps still
  legitimately shows as disabled/lazy).
- Gate the 'Available Skills' section on the skills toolset being
  enabled — when it's off, the agent can't load any skill, so show
  'Skills toolset disabled' instead of the on-disk catalog.

When enabled_toolsets is empty (older callers), behavior is unchanged.

Validation: blank-slate banner now shows only file + terminal and
'Skills toolset disabled'; a skills-enabled banner still lists the
catalog. Added regression tests; full banner suite green (15/15).
2026-06-21 19:08:54 -07:00
teknium1
8cfcbd327d fix(process): SIGKILL the whole tree on escalation, not just wait_procs survivors
Live testing against a real SIGTERM-ignoring process TREE (parent + children,
the agent-browser daemon + renderer shape) revealed psutil.wait_procs's
gone/alive partition mis-handles a parent/child tree: it reaps via
Process.wait() and could mark targets gone/alive inconsistently across the
tree, leaving survivors un-killed (flaky — sometimes the parent lived,
sometimes a child). Replace it with: sleep out the grace window, then
directly re-probe every captured target (_proc_alive, treating zombies as
dead) and SIGKILL any that's still running. Add a multi-child-tree regression
test. 6/6 escalation tests green across repeated runs; the real-tree E2E now
kills the full tree 6/6 runs.
2026-06-21 19:08:52 -07:00
teknium1
8cecaf0b29 feat(process): escalate SIGTERM->SIGKILL on host-pid termination after grace
A daemon that ignores or stalls in its SIGTERM handler currently survives the
process-registry reap and leaks until reboot (observed as agent-browser
daemons accumulating to EMFILE on long-running gateways). _terminate_host_pid
now snapshots the tree, SIGTERMs it, waits a bounded grace window
(terminal.daemon_term_grace_seconds, default 2.0s, 0 disables), then SIGKILLs
any survivor. The recycled-PID identity guard still gates the whole path, so
escalation never reaches a stranger; Windows is unchanged (taskkill /F is
already a hard kill).

Config lives in config.yaml (terminal.daemon_term_grace_seconds), NOT an env
var, per the .env-secrets-only policy.

Implements the SIGKILL-escalation idea from @tkwong's #15008, reworked onto the
current _terminate_host_pid tree-kill path (the original predated it) and
config-gated instead of env-var-gated.

Co-authored-by: Benjamin Wong <tkwong@inspiresynergy.com>
2026-06-21 19:08:52 -07:00
teknium1
f45ace9318 feat(security): startup security posture audit (warn-on-load)
Surface dangerous host/deployment posture at gateway startup so operators get
the 'you're exposed' signal the June 2026 MCP-config persistence campaign
victims never had. Warn-only — never blocks startup, never raises.

Checks (each independently fail-safe):
- Running as root (POSIX uid 0)
- SSH daemon with PasswordAuthentication enabled (incl. the 'yes' default)
- Running in a container with no persistent volume mount over HERMES_HOME
- Network-accessible API server with no API_SERVER_KEY

New module hermes_cli/security_audit_startup.py; invoked once per process from
start_gateway() right after setup_logging(). Cross-platform (root/SSH checks
no-op on Windows). Idea: @Cthulhu.
2026-06-21 19:05:27 -07:00
teknium1
eb51c180e6 fix(docker): replace dashboard --insecure with basic-auth provider
The s6 dashboard entrypoint and docker integration tests relied on
HERMES_DASHBOARD_INSECURE=1 to bring up a 0.0.0.0 dashboard with no auth
provider. With --insecure now a no-op (auth gate mandatory on non-loopback
binds), that path fails closed.

- s6 dashboard/run: drop --insecure derivation; warn that the env is a no-op
  and point operators at HERMES_DASHBOARD_BASIC_AUTH_* / OAuth.
- docker tests: supervision tests now register the bundled basic password
  provider (HERMES_DASHBOARD_BASIC_AUTH_USERNAME/_PASSWORD) so the gate has a
  provider and the dashboard binds. Rewrote the insecure-opt-out test to
  assert fail-closed (dashboard does NOT serve) instead of gate-bypass.
- docs (en + zh-Hans): HERMES_DASHBOARD_INSECURE documented as deprecated
  no-op; basic-auth is the zero-infra way to authenticate a containerized
  public dashboard.
2026-06-21 19:05:27 -07:00
teknium1
7726ce3040 fix(security): close hermes-0day MCP-persistence attack surface
Remove the dashboard --insecure auth-bypass, add an MCP persistence guard +
IOC blocklist, and raise the API-server key entropy floor.

Driven by the June 2026 hermes-0day campaign (r/hermesagent, live 854.media
instance): scanners find exposed Hermes dashboards/API servers, drive the
root agent to plant a 'command: bash' MCP entry that appends an attacker SSH
key to authorized_keys, which cron + startup then re-execute every tick.

- dashboard: --insecure no longer disables the auth gate. should_require_auth
  returns True for every non-loopback bind; a public bind ALWAYS requires an
  auth provider (bundled password provider or OAuth). --insecure kept as a
  warned no-op for backward compat. Fail-closed error now points at the
  password provider, not at --insecure.
- mcp_security: validate_mcp_server_entry now also rejects shell payloads that
  write to OS persistence surfaces (authorized_keys/.ssh/pam.d/sudoers/cron/
  rc files) and hard-rejects a hermes-0day IOC blocklist (attacker SSH key +
  source IPs) anywhere in command/args/env. Runs at save AND spawn time.
- api_server: raise network-bind API_SERVER_KEY entropy floor 8->16 chars;
  warn when a network-accessible API server runs an unsandboxed local backend.
2026-06-21 19:05:27 -07:00
Teknium
2b3a4f0af8
fix(agent): strip stale reasoning_content when falling back to a strict provider (#50480)
* fix(agent): strip stale reasoning_content when falling back to a strict provider

A reasoning primary (DeepSeek/Kimi/MiMo thinking mode) pins reasoning_content
on every assistant tool-call turn (a single space " " pad). api_messages is
built once under the primary; on a mid-session fallback to a strict
OpenAI-compatible provider (Mistral, Cerebras, Groq, SambaNova), those stale
pads were replayed verbatim and rejected with HTTP 400/422:

    body.messages.2.assistant.reasoning_content: Extra inputs are not
    permitted  (input: ' ')

reapply_reasoning_echo_for_provider() only ever ADDED pads, so it never
reconciled history built under a reasoning primary against a strict fallback.
copy_reasoning_content_for_api() also leaked empty-string and 'reasoning'-only
shapes to non-pad providers.

Fix both sites: when the active provider does not enforce echo-back, strip
reasoning_content (empty, space-pad, or non-empty) entirely. Re-padding when
switching TO a reasoning provider is preserved. Covers the Cerebras 400 from
#45655 and the DeepSeek->Mistral 422 fallback report.

Refs #45655.

* test: update reasoning-replay tests for strict-provider stripping

test_explicit_reasoning_content_beats_normalized_reasoning_on_replay was
implicitly running on the OpenRouter fixture (non-pad); pin it to a reasoning
provider so the precedence it checks is observable. Add a positive
strict-provider test asserting reasoning_content is stripped on replay.
2026-06-21 18:05:07 -07:00
teknium1
012f40c98c fix(status): cross-platform start-time fingerprint via psutil fallback
The PID-reuse guard (#43846) reads /proc/<pid>/stat field 22, which only
exists on Linux — on macOS/Windows it returned None and the guard silently
degraded to a bare liveness check (a no-op, safety-wise). Add a
psutil.create_time() fallback (psutil is a hard dep, cross-platform),
quantized to centiseconds for stable equality, so the recycled-PID guard
actually protects macOS/Windows too. /proc always wins first on Linux and
always misses on macOS/Windows, so the two sources never mix on one host and
same-source equality is all the guard needs.
2026-06-21 17:23:33 -07:00
teknium1
1cefc2a24e test(whatsapp): fix port-spares-client test race (listen before announce + retry connect)
The salvaged test spawned a listener subprocess that printed its port
immediately after bind() but BEFORE listen(), so under CI's loaded 8-worker
box the parent connected before the socket was listening -> ConnectionRefused
(flaked on test slice 2/6). Reorder the child to listen() then print the port,
and make the client connect with a short bounded retry to absorb scheduler
jitter. 15/15 green locally including direct hammering.
2026-06-21 17:23:33 -07:00
teknium1
615a8e6516 fix(whatsapp): add missing re import + fix test import path after adapter relocation
Follow-up to the salvaged #43846 commits: the WhatsApp adapter moved from
gateway/platforms/whatsapp.py to plugins/platforms/whatsapp/adapter.py since the
PR was authored. The cherry-pick brought _listener_pids_on_port's `re.finditer`
ss-fallback and the new test's import, but the new module location doesn't import
`re` (latent NameError on the lsof-absent fallback path) and the test imported the
old module path. Add `import re` to the adapter and repoint the test import.
2026-06-21 17:23:33 -07:00
valentt
069ab40c5f fix(whatsapp): only kill LISTENers when freeing the bridge port, never clients
This is the bug that was actually closing Firefox. `_kill_port_process`, run on
every bridge (re)start to free the port, used `lsof -ti :PORT` / `fuser PORT/tcp`
— both of which match a process whose socket merely *involves* that port number
in ANY state, including ESTABLISHED client connections. It then SIGTERMed every
match.

The bridge defaults to port 3000 — a ubiquitous local dev-server port. With a
browser tab open on localhost:3000, `lsof -ti :3000` returned Firefox's PID, so
each restart of the (crash-looping) WhatsApp bridge SIGTERMed Firefox, closing
the whole browser at irregular intervals with no crash and no coredump.

Proven live with the kernel `signal:signal_generate` tracepoint:
  hermes-gateway(3396516) -> sig=15 (code=0/SI_USER) -> comm=firefox pid=3371585
captured immediately after a gateway start, while Firefox held a socket on the
bridge port. Demonstrated over-match: `lsof -ti :8080` returns the listener AND
the gateway's own client connection; `lsof -ti tcp:8080 -sTCP:LISTEN` returns
only the listener.

Fix: `_listener_pids_on_port` resolves only LISTEN-state sockets
(`lsof -ti tcp:PORT -sTCP:LISTEN`, with an `ss -ltnp` fallback) and
`_kill_port_process` signals just those. A client whose connection happens to
involve the port number is never touched — which is also more correct, since a
client never blocks the new bridge from binding. Windows already filtered
LISTENING; the broad `fuser -k` path is removed.

Adds TestKillPortProcess: real-socket tests proving a separate client process
is excluded from the listener lookup and survives port cleanup. 9 tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 17:23:33 -07:00
valentt
77fdbbfe81 fix(whatsapp): validate bridge PID identity before killing stale pidfile entry
`_kill_stale_bridge_by_pidfile` SIGTERMed the PID recorded in `bridge.pid`
after only a bare liveness check. Once the bridge exits and is reaped the
kernel recycles that PID onto an unrelated process; because the WhatsApp bridge
crash-loops ("Bridge process died (exit code 1)" repeating), this cleanup ran
on every restart and could SIGTERM a recycled PID that had landed on the user's
browser — closing Firefox at irregular intervals with no crash and no coredump
(a clean kill of a stranger).

Same PID-recycling class as the MCP reaper (7bd1f8a2d) and the process-registry
host-PID guard (e6a99cef2); this was the third, and most actively-fired, path.

Fix: `_write_bridge_pidfile` now also records the leader's kernel start time
(line 2). `_kill_stale_bridge_by_pidfile` re-validates identity via
`_bridge_pid_is_ours` before signalling — the (pid, start time) pair must match,
or for legacy single-line pidfiles the live cmdline must name `node` + this
session's unique path. A recycled PID (different start time / cmdline) is logged
and skipped, never signalled. Legacy pidfiles stay readable.

Adds TestWhatsappBridgePidfile: real-process tests proving a genuine bridge is
reaped while a recycled PID (start-time mismatch, or non-bridge cmdline) is
spared. 7 new + 108 gateway/registry tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 17:23:33 -07:00
valentt
e447723149 fix(process-registry): re-validate PID identity before killing host processes
The background-process registry signalled host PIDs (recovery adoption,
detached-session kill, tree-kill) using a number captured at spawn, guarded
only by a bare liveness check. Once a session's process exits and is reaped the
kernel recycles that PID onto an unrelated process, so an alive-but-different
PID passed the check and got tree-killed.

Observed in the wild: a recycled background-session PID landed on Firefox's
session leader; a later kill/refresh walked its process tree and SIGTERMed
every tab — Firefox "closing" at irregular intervals with no crash/coredump.

This is the same PID/PGID-recycling class fixed for the MCP orphan reaper in
7bd1f8a2d, but the process_registry subsystem was never guarded — so the bug
persisted.

Fix: record each host process's kernel start time (/proc/<pid>/stat field 22)
at spawn, persist it in the checkpoint, and re-validate it before every signal
via `_host_pid_is_ours`. A PID whose start time no longer matches — or that is
gone — is never signalled:
  - recover_from_checkpoint: a recycled PID is not adopted as a session.
  - _refresh_detached_session: a recycled detached PID is marked exited.
  - kill_process / _terminate_host_pid: refuse to tree-kill a stranger.
Legacy checkpoints and platforms without /proc (no baseline) degrade to the
prior best-effort liveness behaviour, so nothing else changes.

Adds TestPidReuseGuard: real-process tests proving a mismatched start time
refuses termination while a matching one still kills, plus recovery/refresh
recycling paths. 74 registry + 22 MCP-stability tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 17:23:33 -07:00
Teknium
84e1d31e54
refactor(kanban): fold worker/orchestrator skills into injected guidance (#50473)
The kanban-worker and kanban-orchestrator bundled skills existed only to
be force-loaded into dispatcher-spawned workers, gated by
environments:[kanban] so they wouldn't leak into normal CLI listings.
That gating was fragile (the leak that #50443 patched) and the
--skills auto-load was already best-effort — most workers ran without it
because the bundled skill isn't present in profile-scoped skills dirs.

Remove the skills entirely and promote their load-bearing content
(workspace kinds, deliverable artifacts, created-card integrity, profile
discovery) into KANBAN_GUIDANCE, which is already injected into every
kanban worker's system prompt. Net result: every worker reliably gets
the guidance, nothing can leak into a CLI/blank-slate session, and the
gating machinery is gone.

- agent/prompt_builder.py: promote the 4 load-bearing rules into KANBAN_GUIDANCE
- hermes_cli/kanban_db.py: drop --skills kanban-worker auto-injection + _kanban_worker_skill_available probe
- hermes_cli/kanban_swarm.py: drop skills=[kanban-orchestrator] on the root card
- hermes_cli/kanban.py: drop kanban-init skill seeding; fix help text
- delete skills/devops/kanban-{worker,orchestrator}
- docs: delete the two skill pages (EN+zh), fix sidebars/catalog/kanban.md/kanban-worker-lanes.md and the video-orchestrator + codex-lane references
- tests: update spawn-argv expectations; re-bound the guidance-size guard

Supersedes the skill-leak half of #50443 (credit @helix4u for flagging the area).
2026-06-21 17:06:48 -07:00
Dusk1e
84fcbbf6a9 fix(security): quote HERMES_TIMEZONE in remote code execution to prevent shell injection 2026-06-21 16:55:12 -07:00
Teknium
b7a912ea45 fix(antigravity): bake in public OAuth client + default project fallback
Salvage follow-up on top of @pmos69's #29474. The PR resolved the
Antigravity OAuth client purely by discovering it from an installed `agy`
binary or HERMES_ANTIGRAVITY_CLIENT_ID/SECRET env vars, so users without
agy installed hit a hard 'client ID not available' error.

Antigravity's desktop OAuth client is a public, non-confidential installed-app
client (PKCE provides the security), baked into every copy of the Antigravity
CLI — same posture as the gemini-cli credentials Hermes already ships in
google_oauth.py. Bake it in as the final fallback (env -> discovery -> public
default) and add the public default Code Assist project as the discovery
fallback, matching the reference Antigravity flow. Now consumers can
authenticate directly without agy installed.
2026-06-21 16:41:30 -07:00
pmos69
8baa4e9976 feat(cli): add native Antigravity OAuth provider 2026-06-21 16:41:30 -07:00
xxxigm
29176ffecf test(gateway): cover no eager platform install on startup sweep
Pin the contract that ``_apply_env_overrides`` consults ``is_connected``
before the install-triggering ``check_fn``: an unconfigured platform is
skipped without calling ``check_fn`` (no lazy install), while a configured
platform still has ``check_fn`` run and is auto-enabled. The first assertion
fails on the pre-fix unconditional sweep.
2026-06-21 16:41:17 -07:00
Dusk1e
8fcb8136bb fix(security): harden smart approval guard against prompt injection
# Conflicts:
#	tools/approval.py
2026-06-21 16:39:48 -07:00
JP Lew
c11ae8261b fix(codex): seed app-server sessions with configured cwd 2026-06-21 16:39:02 -07:00
teknium1
624580e836 fix(browser): verify daemon identity before orphan reaper kills a PID (#14073)
The browser orphan reaper reads a daemon PID from a `.pid` file in a
world-writable, predictably-named temp dir (`/tmp/agent-browser-h_*`) it
does not write itself, then tree-kills that PID via `_terminate_host_pid`
after only a liveness check. A same-user actor could plant a fake socket
dir whose `.pid` points at an arbitrary victim process, and OS PID reuse
after the real daemon exits could land the recorded PID on an unrelated
process — either way an arbitrary same-user process (and its whole tree)
gets SIGTERMed. Local DoS.

Add `_verify_reapable_browser_daemon()`, gated before the kill: via psutil
(a hard dep, fine cross-platform for the same-user processes the reaper can
signal) require both (1) identity — `agent-browser` in the process
name/cmdline — and (2) binding — the live process references *this* session's
socket dir in its cmdline or `AGENT_BROWSER_SOCKET_DIR`. The binding check is
the real spoof defense: a planted/recycled PID won't embed our exact socket
path. Fail-closed on any ambiguity (unreadable cmdline, no match), leaving the
process and its socket dir untouched for a later sweep.

Builds on @sgaofen's fix in #14394 (cmdline identity check); rewritten to use
psutil instead of `/proc`+`ps` (cross-platform, Windows-covered) and to add
the session-socket-dir binding check for recycled-PID / spoof resistance.

Co-authored-by: sgaofen <135070653+sgaofen@users.noreply.github.com>
2026-06-21 15:23:47 -07:00
teknium1
4d4ba0831e refactor(session): simplify traversal guard to a helper + logger, harden non-leading separators
Follow-up to the salvaged #9560 fix:
- Replace the _TRAVERSAL_RE regex with an explicit _is_path_unsafe() helper
  (drops the now-unused `import re`); catches a path separator ANYWHERE,
  not just leading, so a non-leading Windows backslash can't slip through.
- Switch the per-entry skip in _ensure_loaded_locked from print() to
  logger.warning to match the module's logging conventions.
- Add AUTHOR_MAP entry for the contributor.
- Add regression tests for the non-leading-separator case.
2026-06-21 15:23:36 -07:00
OrbisAI Security
aa2aac68b0 fix(V-009): reject Windows drive-letter paths in session field validation
Extends the CWE-22 path traversal guard to cover Windows absolute paths
of the form C:/... and D:\... — previously only leading / and \ were
checked, which missed drive-letter prefixes. Replaces the inline
startswith check with a compiled module-level regex (_TRAVERSAL_RE) that
covers all three attack patterns: .., leading /\, and leading X: drives.
Adds two regression tests for C:/windows/system32 and D:\\path\\to\\file.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-21 15:23:36 -07:00
OrbisAI Security
3a6a43cb81 fix(V-009): reject path traversal in SessionEntry.from_dict and harden _ensure_loaded
Addresses PR #9560 review comments: applies the CWE-22 fix to current main
(post-PR #458 rebase) and adds the requested regression tests.

- SessionEntry.from_dict now raises ValueError for session_key or session_id
  containing '..' or starting with '/' or '\' (directory traversal guard)
- SessionStore._ensure_loaded moves per-entry validation inside the loop so
  one malicious/corrupt entry is skipped with a warning instead of aborting
  the entire sessions.json load
- Adds TestSessionEntryFromDictTraversalValidation (5 cases) and
  TestEnsureLoadedSkipsInvalidEntries covering the skip-not-abort behavior

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-21 15:23:36 -07:00
ethernet
bb59075b25
Merge pull request #50398 from helix4u/fix/windows-npm-path-fallback
fix(windows): prefer cmd npm shim on PATH fallback
2026-06-21 18:55:02 -03:00
devorun
6f0ecf37da fix(redact): mask all Authorization schemes and x-api-key style headers
Secret redaction only matched `Authorization: Bearer <token>`. Other auth
headers passed through verbatim into logs, tool output, and transcripts:

- `Authorization: Basic <base64>` — leaks base64(user:password)
- `Authorization: token <pat>` / any non-Bearer scheme
- `Proxy-Authorization: ...`
- `x-api-key: <key>` (Anthropic and many providers) and `api-key`,
  `x-goog-api-key`, `x-auth-token`, `x-access-token`, ... — opaque values with
  no known vendor prefix were caught by nothing

A logged request or an echoed `curl -H "x-api-key: ..."` command therefore
leaked live credentials.

Generalize the Authorization rule to mask the credential for any scheme (and
Proxy-Authorization) while preserving the header name and scheme word for
debuggability, and add an api-key header rule for the single-opaque-value
headers. Bearer behavior is unchanged; plain prose containing the word
"authorization" (no colon-delimited value) is left untouched.

Adds regression tests for Basic/token/Proxy auth and the x-api-key/api-key
headers, including inside a curl command.
2026-06-21 14:08:06 -07:00
teknium1
87ab373381 test(url-safety): cover IPv6 scope-ID strip + fail-closed in URL guards
Follow-up to the salvaged #25961 fix: regression tests asserting that
scope-bearing IPv6 addresses (fe80::1%eth0, ::1%lo) are blocked by
is_safe_url after the scope is stripped, that a still-unparseable address
fails closed, and that a scoped IPv4-mapped IMDS address is caught by the
always-blocked floor.
2026-06-21 13:56:35 -07:00
liuhao1024
b5b8a4cd56 fix(gateway): respect adapter decline of fresh-final to prevent double delivery
When a streamed Telegram reply finalizes, the stream consumer could take
the fresh-final path (send a new sendRichMessage + best-effort delete the
preview) purely because the time-based _should_send_fresh_final()
threshold elapsed — even though Telegram's prefers_fresh_final_streaming
returns False. The fresh Rich Message then overlapped the legacy
MarkdownV2 preview already on screen, leaving both visible (the #47048
table + bullet double-render).

Honor the adapter's decision: when prefers_fresh_final_streaming exists
on the adapter (checked on the class + instance __dict__ so MagicMock
auto-attrs don't false-positive) and declines, the time threshold no
longer overrides it. Adapters without the hook keep the time-based
fresh-final for backward compat.

Fixes #47048
2026-06-21 13:55:50 -07:00
teknium1
f79e0a7060 fix(email): mark missing-config as non-retryable + reject blank env vars (#40715)
Fold in the #40715 blank-env OOM fix on top of the host-resolution change:
- connect() now sets a non-retryable fatal error when required settings are
  missing, so the gateway stops reconnecting against an empty host instead of
  looping forever and leaking memory until the host OOM-kills.
- check_email_requirements() treats blank/whitespace-only EMAIL_* values as
  missing, so an abandoned setup with empty keys no longer enables the platform.

Credits the parallel fixes by zerone0x (#40745) and liuhao1024 (#40829).
2026-06-21 13:33:52 -07:00
devorun
b7f6cb9c8b fix(email): resolve IMAP/SMTP host from config and validate before connecting
The email adapter read address/host purely from env vars and never stripped
them, so a missing or whitespace-padded EMAIL_IMAP_HOST reached
imaplib.IMAP4_SSL("") and surfaced as the misleading
"[Errno 8] nodename nor servname provided, or not known" — sending users down a
DNS rabbit hole when the real problem was an empty/dirty host string. A
config.yaml-only setup also left the host empty because __init__ ignored
PlatformConfig.extra, even though the "connected" check, the send helper, and
`hermes config show` already read address/imap_host/smtp_host from it.

Resolve address/imap_host/smtp_host from the env var first, then fall back to
config.extra, and strip surrounding whitespace — matching the send helper's
existing pattern. Validate the required settings at the start of connect() and
return False with an actionable message instead of attempting a connection with
an empty host.

Adds regression tests for whitespace stripping, config.extra fallback, and the
no-IMAP-attempt-on-missing-host path.
2026-06-21 13:33:52 -07:00
teknium1
4cff0360ea test(approval): regression for interrupt-unblocks-approval; AUTHOR_MAP
- Add thread-scoped regression test: interrupt on the waiting thread resolves
  the approval as deny well under the 300s timeout; a foreign-thread interrupt
  does NOT release the wait (interrupts are per-thread).
- Add panghuer023 to AUTHOR_MAP for the salvaged #37994 fix.
2026-06-21 13:33:48 -07:00
Teknium
824c9d3812
fix(config): alias model.api_base -> model.base_url for custom providers (#50385)
A bare custom provider configured via `model.api_base` (the intuitive name
OpenAI-SDK / LiteLLM users reach for) was silently ignored: `hermes config set`
accepts any dotted key, so `model.api_base` got written and confirmed, but the
runtime resolver reads only `model.base_url`. Requests fell back to OpenRouter
with an empty key -> 401, zero hits to the custom endpoint (issue #8919).

Now api_base is migrated to base_url at load time (fixes existing broken
configs) and at set time (with a notice), never overriding an explicit
base_url. Closes #8919.
2026-06-21 13:33:41 -07:00