hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-24 10:52:21 +00:00

Author	SHA1	Message	Date
Teknium	7130d60861	feat(providers): remove google-gemini-cli + google-antigravity OAuth providers (#50492) * feat(providers): remove google-gemini-cli + google-antigravity OAuth providers Google now actively bans accounts for third-party tools that piggyback on Gemini CLI / Antigravity / Code Assist OAuth, and because abuse prevention sits at a backend layer the ban can extend to the entire Google account (Gmail/Drive), with a second violation being permanent. Ref: https://github.com/google-gemini/gemini-cli/discussions/20632 Removes both OAuth inference providers entirely (modules, provider profiles, auth/runtime/config/models wiring, the /gquota Code Assist quota command, the antigravity-cli optional skill, desktop + docs surface in en + zh-Hans). The API-key 'gemini' provider (GOOGLE_API_KEY/GEMINI_API_KEY against generativelanguage.googleapis.com) is unaffected and stays fully supported. * fix(skills): keep the antigravity-cli skill — only the OAuth provider is removed The antigravity-cli optional skill orchestrates the external `agy` binary as a coding-agent tool via the terminal tool — it does NOT wrap Hermes inference through the banned google-antigravity OAuth provider, so it carries none of the account-ban risk that motivated removing that provider. Restore the skill, its docs page, the sidebar entry, and the optional-skills catalog row. The google-antigravity / google-gemini-cli inference providers stay fully removed.	2026-06-21 19:53:27 -07:00
Teknium	5bf23ff251	fix(banner): don't advertise toolsets/skills the agent wasn't given (#50497) The welcome banner's 'Available Tools' merged in every toolset from the global check_tool_availability() registry walk, regardless of whether it was enabled for the current platform. On a Blank Slate CLI (file + terminal only) that surfaced discord / feishu / kanban tools the agent was never actually given — they are not in the agent's tool schema, but the banner displayed them, making it look like they were exposed. - Filter the unavailable-toolset merge to toolsets actually in enabled_toolsets (a toolset that's enabled but has unmet deps still legitimately shows as disabled/lazy). - Gate the 'Available Skills' section on the skills toolset being enabled — when it's off, the agent can't load any skill, so show 'Skills toolset disabled' instead of the on-disk catalog. When enabled_toolsets is empty (older callers), behavior is unchanged. Validation: blank-slate banner now shows only file + terminal and 'Skills toolset disabled'; a skills-enabled banner still lists the catalog. Added regression tests; full banner suite green (15/15).	2026-06-21 19:08:54 -07:00
teknium1	8cfcbd327d	fix(process): SIGKILL the whole tree on escalation, not just wait_procs survivors Live testing against a real SIGTERM-ignoring process TREE (parent + children, the agent-browser daemon + renderer shape) revealed psutil.wait_procs's gone/alive partition mis-handles a parent/child tree: it reaps via Process.wait() and could mark targets gone/alive inconsistently across the tree, leaving survivors un-killed (flaky — sometimes the parent lived, sometimes a child). Replace it with: sleep out the grace window, then directly re-probe every captured target (_proc_alive, treating zombies as dead) and SIGKILL any that's still running. Add a multi-child-tree regression test. 6/6 escalation tests green across repeated runs; the real-tree E2E now kills the full tree 6/6 runs.	2026-06-21 19:08:52 -07:00
teknium1	8cecaf0b29	feat(process): escalate SIGTERM->SIGKILL on host-pid termination after grace A daemon that ignores or stalls in its SIGTERM handler currently survives the process-registry reap and leaks until reboot (observed as agent-browser daemons accumulating to EMFILE on long-running gateways). _terminate_host_pid now snapshots the tree, SIGTERMs it, waits a bounded grace window (terminal.daemon_term_grace_seconds, default 2.0s, 0 disables), then SIGKILLs any survivor. The recycled-PID identity guard still gates the whole path, so escalation never reaches a stranger; Windows is unchanged (taskkill /F is already a hard kill). Config lives in config.yaml (terminal.daemon_term_grace_seconds), NOT an env var, per the .env-secrets-only policy. Implements the SIGKILL-escalation idea from @tkwong's #15008, reworked onto the current _terminate_host_pid tree-kill path (the original predated it) and config-gated instead of env-var-gated. Co-authored-by: Benjamin Wong <tkwong@inspiresynergy.com>	2026-06-21 19:08:52 -07:00
teknium1	f45ace9318	feat(security): startup security posture audit (warn-on-load) Surface dangerous host/deployment posture at gateway startup so operators get the 'you're exposed' signal the June 2026 MCP-config persistence campaign victims never had. Warn-only — never blocks startup, never raises. Checks (each independently fail-safe): - Running as root (POSIX uid 0) - SSH daemon with PasswordAuthentication enabled (incl. the 'yes' default) - Running in a container with no persistent volume mount over HERMES_HOME - Network-accessible API server with no API_SERVER_KEY New module hermes_cli/security_audit_startup.py; invoked once per process from start_gateway() right after setup_logging(). Cross-platform (root/SSH checks no-op on Windows). Idea: @Cthulhu.	2026-06-21 19:05:27 -07:00
teknium1	eb51c180e6	fix(docker): replace dashboard --insecure with basic-auth provider The s6 dashboard entrypoint and docker integration tests relied on HERMES_DASHBOARD_INSECURE=1 to bring up a 0.0.0.0 dashboard with no auth provider. With --insecure now a no-op (auth gate mandatory on non-loopback binds), that path fails closed. - s6 dashboard/run: drop --insecure derivation; warn that the env is a no-op and point operators at HERMES_DASHBOARD_BASIC_AUTH_* / OAuth. - docker tests: supervision tests now register the bundled basic password provider (HERMES_DASHBOARD_BASIC_AUTH_USERNAME/_PASSWORD) so the gate has a provider and the dashboard binds. Rewrote the insecure-opt-out test to assert fail-closed (dashboard does NOT serve) instead of gate-bypass. - docs (en + zh-Hans): HERMES_DASHBOARD_INSECURE documented as deprecated no-op; basic-auth is the zero-infra way to authenticate a containerized public dashboard.	2026-06-21 19:05:27 -07:00
teknium1	7726ce3040	fix(security): close hermes-0day MCP-persistence attack surface Remove the dashboard --insecure auth-bypass, add an MCP persistence guard + IOC blocklist, and raise the API-server key entropy floor. Driven by the June 2026 hermes-0day campaign (r/hermesagent, live 854.media instance): scanners find exposed Hermes dashboards/API servers, drive the root agent to plant a 'command: bash' MCP entry that appends an attacker SSH key to authorized_keys, which cron + startup then re-execute every tick. - dashboard: --insecure no longer disables the auth gate. should_require_auth returns True for every non-loopback bind; a public bind ALWAYS requires an auth provider (bundled password provider or OAuth). --insecure kept as a warned no-op for backward compat. Fail-closed error now points at the password provider, not at --insecure. - mcp_security: validate_mcp_server_entry now also rejects shell payloads that write to OS persistence surfaces (authorized_keys/.ssh/pam.d/sudoers/cron/ rc files) and hard-rejects a hermes-0day IOC blocklist (attacker SSH key + source IPs) anywhere in command/args/env. Runs at save AND spawn time. - api_server: raise network-bind API_SERVER_KEY entropy floor 8->16 chars; warn when a network-accessible API server runs an unsandboxed local backend.	2026-06-21 19:05:27 -07:00
Teknium	2b3a4f0af8	fix(agent): strip stale reasoning_content when falling back to a strict provider (#50480) * fix(agent): strip stale reasoning_content when falling back to a strict provider A reasoning primary (DeepSeek/Kimi/MiMo thinking mode) pins reasoning_content on every assistant tool-call turn (a single space " " pad). api_messages is built once under the primary; on a mid-session fallback to a strict OpenAI-compatible provider (Mistral, Cerebras, Groq, SambaNova), those stale pads were replayed verbatim and rejected with HTTP 400/422: body.messages.2.assistant.reasoning_content: Extra inputs are not permitted (input: ' ') reapply_reasoning_echo_for_provider() only ever ADDED pads, so it never reconciled history built under a reasoning primary against a strict fallback. copy_reasoning_content_for_api() also leaked empty-string and 'reasoning'-only shapes to non-pad providers. Fix both sites: when the active provider does not enforce echo-back, strip reasoning_content (empty, space-pad, or non-empty) entirely. Re-padding when switching TO a reasoning provider is preserved. Covers the Cerebras 400 from #45655 and the DeepSeek->Mistral 422 fallback report. Refs #45655. * test: update reasoning-replay tests for strict-provider stripping test_explicit_reasoning_content_beats_normalized_reasoning_on_replay was implicitly running on the OpenRouter fixture (non-pad); pin it to a reasoning provider so the precedence it checks is observable. Add a positive strict-provider test asserting reasoning_content is stripped on replay.	2026-06-21 18:05:07 -07:00
teknium1	012f40c98c	fix(status): cross-platform start-time fingerprint via psutil fallback The PID-reuse guard (#43846) reads /proc/<pid>/stat field 22, which only exists on Linux — on macOS/Windows it returned None and the guard silently degraded to a bare liveness check (a no-op, safety-wise). Add a psutil.create_time() fallback (psutil is a hard dep, cross-platform), quantized to centiseconds for stable equality, so the recycled-PID guard actually protects macOS/Windows too. /proc always wins first on Linux and always misses on macOS/Windows, so the two sources never mix on one host and same-source equality is all the guard needs.	2026-06-21 17:23:33 -07:00
teknium1	1cefc2a24e	test(whatsapp): fix port-spares-client test race (listen before announce + retry connect) The salvaged test spawned a listener subprocess that printed its port immediately after bind() but BEFORE listen(), so under CI's loaded 8-worker box the parent connected before the socket was listening -> ConnectionRefused (flaked on test slice 2/6). Reorder the child to listen() then print the port, and make the client connect with a short bounded retry to absorb scheduler jitter. 15/15 green locally including direct hammering.	2026-06-21 17:23:33 -07:00
teknium1	615a8e6516	fix(whatsapp): add missing re import + fix test import path after adapter relocation Follow-up to the salvaged #43846 commits: the WhatsApp adapter moved from gateway/platforms/whatsapp.py to plugins/platforms/whatsapp/adapter.py since the PR was authored. The cherry-pick brought _listener_pids_on_port's `re.finditer` ss-fallback and the new test's import, but the new module location doesn't import `re` (latent NameError on the lsof-absent fallback path) and the test imported the old module path. Add `import re` to the adapter and repoint the test import.	2026-06-21 17:23:33 -07:00
valentt	069ab40c5f	fix(whatsapp): only kill LISTENers when freeing the bridge port, never clients This is the bug that was actually closing Firefox. `_kill_port_process`, run on every bridge (re)start to free the port, used `lsof -ti :PORT` / `fuser PORT/tcp` — both of which match a process whose socket merely involves that port number in ANY state, including ESTABLISHED client connections. It then SIGTERMed every match. The bridge defaults to port 3000 — a ubiquitous local dev-server port. With a browser tab open on localhost:3000, `lsof -ti :3000` returned Firefox's PID, so each restart of the (crash-looping) WhatsApp bridge SIGTERMed Firefox, closing the whole browser at irregular intervals with no crash and no coredump. Proven live with the kernel `signal:signal_generate` tracepoint: hermes-gateway(3396516) -> sig=15 (code=0/SI_USER) -> comm=firefox pid=3371585 captured immediately after a gateway start, while Firefox held a socket on the bridge port. Demonstrated over-match: `lsof -ti :8080` returns the listener AND the gateway's own client connection; `lsof -ti tcp:8080 -sTCP:LISTEN` returns only the listener. Fix: `_listener_pids_on_port` resolves only LISTEN-state sockets (`lsof -ti tcp:PORT -sTCP:LISTEN`, with an `ss -ltnp` fallback) and `_kill_port_process` signals just those. A client whose connection happens to involve the port number is never touched — which is also more correct, since a client never blocks the new bridge from binding. Windows already filtered LISTENING; the broad `fuser -k` path is removed. Adds TestKillPortProcess: real-socket tests proving a separate client process is excluded from the listener lookup and survives port cleanup. 9 tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 17:23:33 -07:00
valentt	77fdbbfe81	fix(whatsapp): validate bridge PID identity before killing stale pidfile entry `_kill_stale_bridge_by_pidfile` SIGTERMed the PID recorded in `bridge.pid` after only a bare liveness check. Once the bridge exits and is reaped the kernel recycles that PID onto an unrelated process; because the WhatsApp bridge crash-loops ("Bridge process died (exit code 1)" repeating), this cleanup ran on every restart and could SIGTERM a recycled PID that had landed on the user's browser — closing Firefox at irregular intervals with no crash and no coredump (a clean kill of a stranger). Same PID-recycling class as the MCP reaper (`7bd1f8a2d`) and the process-registry host-PID guard (e6a99cef2); this was the third, and most actively-fired, path. Fix: `_write_bridge_pidfile` now also records the leader's kernel start time (line 2). `_kill_stale_bridge_by_pidfile` re-validates identity via `_bridge_pid_is_ours` before signalling — the (pid, start time) pair must match, or for legacy single-line pidfiles the live cmdline must name `node` + this session's unique path. A recycled PID (different start time / cmdline) is logged and skipped, never signalled. Legacy pidfiles stay readable. Adds TestWhatsappBridgePidfile: real-process tests proving a genuine bridge is reaped while a recycled PID (start-time mismatch, or non-bridge cmdline) is spared. 7 new + 108 gateway/registry tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 17:23:33 -07:00
valentt	e447723149	fix(process-registry): re-validate PID identity before killing host processes The background-process registry signalled host PIDs (recovery adoption, detached-session kill, tree-kill) using a number captured at spawn, guarded only by a bare liveness check. Once a session's process exits and is reaped the kernel recycles that PID onto an unrelated process, so an alive-but-different PID passed the check and got tree-killed. Observed in the wild: a recycled background-session PID landed on Firefox's session leader; a later kill/refresh walked its process tree and SIGTERMed every tab — Firefox "closing" at irregular intervals with no crash/coredump. This is the same PID/PGID-recycling class fixed for the MCP orphan reaper in `7bd1f8a2d`, but the process_registry subsystem was never guarded — so the bug persisted. Fix: record each host process's kernel start time (/proc/<pid>/stat field 22) at spawn, persist it in the checkpoint, and re-validate it before every signal via `_host_pid_is_ours`. A PID whose start time no longer matches — or that is gone — is never signalled: - recover_from_checkpoint: a recycled PID is not adopted as a session. - _refresh_detached_session: a recycled detached PID is marked exited. - kill_process / _terminate_host_pid: refuse to tree-kill a stranger. Legacy checkpoints and platforms without /proc (no baseline) degrade to the prior best-effort liveness behaviour, so nothing else changes. Adds TestPidReuseGuard: real-process tests proving a mismatched start time refuses termination while a matching one still kills, plus recovery/refresh recycling paths. 74 registry + 22 MCP-stability tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-21 17:23:33 -07:00
Teknium	84e1d31e54	refactor(kanban): fold worker/orchestrator skills into injected guidance (#50473) The kanban-worker and kanban-orchestrator bundled skills existed only to be force-loaded into dispatcher-spawned workers, gated by environments:[kanban] so they wouldn't leak into normal CLI listings. That gating was fragile (the leak that #50443 patched) and the --skills auto-load was already best-effort — most workers ran without it because the bundled skill isn't present in profile-scoped skills dirs. Remove the skills entirely and promote their load-bearing content (workspace kinds, deliverable artifacts, created-card integrity, profile discovery) into KANBAN_GUIDANCE, which is already injected into every kanban worker's system prompt. Net result: every worker reliably gets the guidance, nothing can leak into a CLI/blank-slate session, and the gating machinery is gone. - agent/prompt_builder.py: promote the 4 load-bearing rules into KANBAN_GUIDANCE - hermes_cli/kanban_db.py: drop --skills kanban-worker auto-injection + _kanban_worker_skill_available probe - hermes_cli/kanban_swarm.py: drop skills=[kanban-orchestrator] on the root card - hermes_cli/kanban.py: drop kanban-init skill seeding; fix help text - delete skills/devops/kanban-{worker,orchestrator} - docs: delete the two skill pages (EN+zh), fix sidebars/catalog/kanban.md/kanban-worker-lanes.md and the video-orchestrator + codex-lane references - tests: update spawn-argv expectations; re-bound the guidance-size guard Supersedes the skill-leak half of #50443 (credit @helix4u for flagging the area).	2026-06-21 17:06:48 -07:00
Dusk1e	84fcbbf6a9	fix(security): quote HERMES_TIMEZONE in remote code execution to prevent shell injection	2026-06-21 16:55:12 -07:00
Teknium	b7a912ea45	fix(antigravity): bake in public OAuth client + default project fallback Salvage follow-up on top of @pmos69's #29474. The PR resolved the Antigravity OAuth client purely by discovering it from an installed `agy` binary or HERMES_ANTIGRAVITY_CLIENT_ID/SECRET env vars, so users without agy installed hit a hard 'client ID not available' error. Antigravity's desktop OAuth client is a public, non-confidential installed-app client (PKCE provides the security), baked into every copy of the Antigravity CLI — same posture as the gemini-cli credentials Hermes already ships in google_oauth.py. Bake it in as the final fallback (env -> discovery -> public default) and add the public default Code Assist project as the discovery fallback, matching the reference Antigravity flow. Now consumers can authenticate directly without agy installed.	2026-06-21 16:41:30 -07:00
pmos69	8baa4e9976	feat(cli): add native Antigravity OAuth provider	2026-06-21 16:41:30 -07:00
xxxigm	29176ffecf	test(gateway): cover no eager platform install on startup sweep Pin the contract that ``_apply_env_overrides`` consults ``is_connected`` before the install-triggering ``check_fn``: an unconfigured platform is skipped without calling ``check_fn`` (no lazy install), while a configured platform still has ``check_fn`` run and is auto-enabled. The first assertion fails on the pre-fix unconditional sweep.	2026-06-21 16:41:17 -07:00
Dusk1e	8fcb8136bb	fix(security): harden smart approval guard against prompt injection # Conflicts: # tools/approval.py	2026-06-21 16:39:48 -07:00
JP Lew	c11ae8261b	fix(codex): seed app-server sessions with configured cwd	2026-06-21 16:39:02 -07:00
teknium1	624580e836	fix(browser): verify daemon identity before orphan reaper kills a PID (#14073) The browser orphan reaper reads a daemon PID from a `.pid` file in a world-writable, predictably-named temp dir (`/tmp/agent-browser-h_`) it does not write itself, then tree-kills that PID via `_terminate_host_pid` after only a liveness check. A same-user actor could plant a fake socket dir whose `.pid` points at an arbitrary victim process, and OS PID reuse after the real daemon exits could land the recorded PID on an unrelated process — either way an arbitrary same-user process (and its whole tree) gets SIGTERMed. Local DoS. Add `_verify_reapable_browser_daemon()`, gated before the kill: via psutil (a hard dep, fine cross-platform for the same-user processes the reaper can signal) require both (1) identity — `agent-browser` in the process name/cmdline — and (2) binding — the live process references this session's socket dir in its cmdline or `AGENT_BROWSER_SOCKET_DIR`. The binding check is the real spoof defense: a planted/recycled PID won't embed our exact socket path. Fail-closed on any ambiguity (unreadable cmdline, no match), leaving the process and its socket dir untouched for a later sweep. Builds on @sgaofen's fix in #14394 (cmdline identity check); rewritten to use psutil instead of `/proc`+`ps` (cross-platform, Windows-covered) and to add the session-socket-dir binding check for recycled-PID / spoof resistance. Co-authored-by: sgaofen <135070653+sgaofen@users.noreply.github.com>	2026-06-21 15:23:47 -07:00
teknium1	4d4ba0831e	refactor(session): simplify traversal guard to a helper + logger, harden non-leading separators Follow-up to the salvaged #9560 fix: - Replace the _TRAVERSAL_RE regex with an explicit _is_path_unsafe() helper (drops the now-unused `import re`); catches a path separator ANYWHERE, not just leading, so a non-leading Windows backslash can't slip through. - Switch the per-entry skip in _ensure_loaded_locked from print() to logger.warning to match the module's logging conventions. - Add AUTHOR_MAP entry for the contributor. - Add regression tests for the non-leading-separator case.	2026-06-21 15:23:36 -07:00
OrbisAI Security	aa2aac68b0	fix(V-009): reject Windows drive-letter paths in session field validation Extends the CWE-22 path traversal guard to cover Windows absolute paths of the form C:/... and D:\... — previously only leading / and \ were checked, which missed drive-letter prefixes. Replaces the inline startswith check with a compiled module-level regex (_TRAVERSAL_RE) that covers all three attack patterns: .., leading /\, and leading X: drives. Adds two regression tests for C:/windows/system32 and D:\\path\\to\\file. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-21 15:23:36 -07:00
OrbisAI Security	3a6a43cb81	fix(V-009): reject path traversal in SessionEntry.from_dict and harden _ensure_loaded Addresses PR #9560 review comments: applies the CWE-22 fix to current main (post-PR #458 rebase) and adds the requested regression tests. - SessionEntry.from_dict now raises ValueError for session_key or session_id containing '..' or starting with '/' or '\' (directory traversal guard) - SessionStore._ensure_loaded moves per-entry validation inside the loop so one malicious/corrupt entry is skipped with a warning instead of aborting the entire sessions.json load - Adds TestSessionEntryFromDictTraversalValidation (5 cases) and TestEnsureLoadedSkipsInvalidEntries covering the skip-not-abort behavior Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-21 15:23:36 -07:00
ethernet	bb59075b25	Merge pull request #50398 from helix4u/fix/windows-npm-path-fallback fix(windows): prefer cmd npm shim on PATH fallback	2026-06-21 18:55:02 -03:00
devorun	6f0ecf37da	fix(redact): mask all Authorization schemes and x-api-key style headers Secret redaction only matched `Authorization: Bearer <token>`. Other auth headers passed through verbatim into logs, tool output, and transcripts: - `Authorization: Basic <base64>` — leaks base64(user:password) - `Authorization: token <pat>` / any non-Bearer scheme - `Proxy-Authorization: ...` - `x-api-key: <key>` (Anthropic and many providers) and `api-key`, `x-goog-api-key`, `x-auth-token`, `x-access-token`, ... — opaque values with no known vendor prefix were caught by nothing A logged request or an echoed `curl -H "x-api-key: ..."` command therefore leaked live credentials. Generalize the Authorization rule to mask the credential for any scheme (and Proxy-Authorization) while preserving the header name and scheme word for debuggability, and add an api-key header rule for the single-opaque-value headers. Bearer behavior is unchanged; plain prose containing the word "authorization" (no colon-delimited value) is left untouched. Adds regression tests for Basic/token/Proxy auth and the x-api-key/api-key headers, including inside a curl command.	2026-06-21 14:08:06 -07:00
teknium1	87ab373381	test(url-safety): cover IPv6 scope-ID strip + fail-closed in URL guards Follow-up to the salvaged #25961 fix: regression tests asserting that scope-bearing IPv6 addresses (fe80::1%eth0, ::1%lo) are blocked by is_safe_url after the scope is stripped, that a still-unparseable address fails closed, and that a scoped IPv4-mapped IMDS address is caught by the always-blocked floor.	2026-06-21 13:56:35 -07:00
liuhao1024	b5b8a4cd56	fix(gateway): respect adapter decline of fresh-final to prevent double delivery When a streamed Telegram reply finalizes, the stream consumer could take the fresh-final path (send a new sendRichMessage + best-effort delete the preview) purely because the time-based _should_send_fresh_final() threshold elapsed — even though Telegram's prefers_fresh_final_streaming returns False. The fresh Rich Message then overlapped the legacy MarkdownV2 preview already on screen, leaving both visible (the #47048 table + bullet double-render). Honor the adapter's decision: when prefers_fresh_final_streaming exists on the adapter (checked on the class + instance __dict__ so MagicMock auto-attrs don't false-positive) and declines, the time threshold no longer overrides it. Adapters without the hook keep the time-based fresh-final for backward compat. Fixes #47048	2026-06-21 13:55:50 -07:00
teknium1	f79e0a7060	fix(email): mark missing-config as non-retryable + reject blank env vars (#40715) Fold in the #40715 blank-env OOM fix on top of the host-resolution change: - connect() now sets a non-retryable fatal error when required settings are missing, so the gateway stops reconnecting against an empty host instead of looping forever and leaking memory until the host OOM-kills. - check_email_requirements() treats blank/whitespace-only EMAIL_* values as missing, so an abandoned setup with empty keys no longer enables the platform. Credits the parallel fixes by zerone0x (#40745) and liuhao1024 (#40829).	2026-06-21 13:33:52 -07:00
devorun	b7f6cb9c8b	fix(email): resolve IMAP/SMTP host from config and validate before connecting The email adapter read address/host purely from env vars and never stripped them, so a missing or whitespace-padded EMAIL_IMAP_HOST reached imaplib.IMAP4_SSL("") and surfaced as the misleading "[Errno 8] nodename nor servname provided, or not known" — sending users down a DNS rabbit hole when the real problem was an empty/dirty host string. A config.yaml-only setup also left the host empty because __init__ ignored PlatformConfig.extra, even though the "connected" check, the send helper, and `hermes config show` already read address/imap_host/smtp_host from it. Resolve address/imap_host/smtp_host from the env var first, then fall back to config.extra, and strip surrounding whitespace — matching the send helper's existing pattern. Validate the required settings at the start of connect() and return False with an actionable message instead of attempting a connection with an empty host. Adds regression tests for whitespace stripping, config.extra fallback, and the no-IMAP-attempt-on-missing-host path.	2026-06-21 13:33:52 -07:00
teknium1	4cff0360ea	test(approval): regression for interrupt-unblocks-approval; AUTHOR_MAP - Add thread-scoped regression test: interrupt on the waiting thread resolves the approval as deny well under the 300s timeout; a foreign-thread interrupt does NOT release the wait (interrupts are per-thread). - Add panghuer023 to AUTHOR_MAP for the salvaged #37994 fix.	2026-06-21 13:33:48 -07:00
Teknium	824c9d3812	fix(config): alias model.api_base -> model.base_url for custom providers (#50385) A bare custom provider configured via `model.api_base` (the intuitive name OpenAI-SDK / LiteLLM users reach for) was silently ignored: `hermes config set` accepts any dotted key, so `model.api_base` got written and confirmed, but the runtime resolver reads only `model.base_url`. Requests fell back to OpenRouter with an empty key -> 401, zero hits to the custom endpoint (issue #8919). Now api_base is migrated to base_url at load time (fixes existing broken configs) and at set time (with a notice), never overriding an explicit base_url. Closes #8919.	2026-06-21 13:33:41 -07:00
Teknium	bb77a8b0d5	fix(gateway): respawn unmapped Windows gateways after update (#50090) (#50373) On Windows, _pause_windows_gateways_for_update() force-kills every running gateway before mutating the venv. Gateways mapped to a profile (via profile.path/gateway.pid) were respawned afterward, but gateways with NO profile mapping — e.g. a Windows Scheduled Task running "pythonw.exe -m hermes_cli.main gateway run" — were force-killed and only told to restart manually. After an auto-update/bootstrap the Telegram bot stayed dead until manual intervention. Now we snapshot each unmapped gateway's argv (psutil, guarded by looks_like_gateway_command_line) before the kill and replay it through the same detached watcher used for profile gateways, so unmapped gateways come back automatically too. Co-authored-by: Hermes Agent <agent@nousresearch.com>	2026-06-21 13:33:26 -07:00
Teknium	99f3072aa0	fix(model-switch): a failed in-place swap must be a no-op, not a dead session (#50375) When a /model switch resolves a valid model but the in-place agent swap fails mid-conversation (expired key, unreachable base_url), the agent rolls itself back to the old working model+client and re-raises. The callers caught that re-raise, logged a warning, then committed the broken switch anyway: wrote the failed model to the session DB, set _session_model_overrides to the broken model/provider/key, and (gateway direct path) evicted the working cached agent. The next message then rebuilt a dead agent from the broken override -> permanently unusable conversation (#50163). Fix the whole caller class so a failed swap aborts the commit entirely: - gateway/slash_commands.py (picker + direct /model paths): on swap failure, early-return an error message; skip DB persist, session override, cache eviction, and config write. - cli.py (both /model handlers): snapshot CLI-level credential/runtime fields before mutating, restore them on swap failure, and abort the note + success print. - tui_gateway/server.py: wrap the previously-unguarded swap; on failure raise a clean error and skip worker restart, runtime persist, switch marker, session model_override, and config persist. The no-cached-agent path (apply-on-next-session) is unaffected. Adds a gateway regression test that fails on the pre-fix behavior.	2026-06-21 13:33:23 -07:00
memosr	ed3d12a762	fix(security): fail-closed when WebSocket peer is empty in loopback mode Per @egilewski's audit on this PR (#15544), the original fix was correct but the file has refactored since: the four endpoint-local empty-peer checks have been consolidated into _ws_client_is_allowed and _ws_client_reason, but the helpers were left fail-open ('no peer host known means allow' / 'no reason to block'). On a loopback-bound dashboard with auth disabled, an ASGI server behind a misconfigured proxy or a unix-socket transport can deliver ws.client == None or ws.client.host == ''. The helpers were treating that as 'allowed', so the loopback-only peer gate could be bypassed by anything that suppressed the client tuple in transit. All four WebSocket endpoints (/api/pty, /api/ws, /api/pub, /api/events) route through _ws_request_is_allowed -> _ws_client_is_allowed, so the gap applied uniformly. Fix: * _ws_client_is_allowed: return False when client_host is empty instead of True. Only reached on loopback bind with auth disabled (auth_required=True and explicit non-loopback binds short-circuit earlier), so the fail-closed behavior is scoped to the surface that needs it. * _ws_client_reason: return a 'missing_or_empty_peer bound=...' block reason instead of None, so the dispatcher's existing reason-based rejection path picks it up and the close gets logged with a machine-parseable token for diagnosability. Behavior unchanged for: * gated mode (auth_required=True) — early-returns True before the empty-peer check runs. The OAuth ticket is the auth at that point. * explicit non-loopback bind (--host 0.0.0.0/::, or a specific LAN address, always with --insecure) — early-returns True before the empty-peer check runs. DNS-rebinding is still blocked by the Host/Origin guard in _ws_host_origin_is_allowed. * legitimate loopback peers (client_host == '127.0.0.1' / '::1') — not affected by the empty-peer branch. 
Regression tests added in tests/hermes_cli/test_dashboard_auth_ws_auth.py: * test_empty_client_host_rejected_in_loopback_mode * test_missing_client_object_rejected_in_loopback_mode * test_empty_client_host_reason_is_block Plus two regression guards to ensure the fix does not over-reach: * test_empty_client_host_still_allowed_in_insecure_public_mode * test_empty_client_host_still_allowed_in_gated_mode All three new fail-closed tests fail without this patch (the helpers return True / None for an empty peer) and pass with it. The 45 pre-existing tests in test_dashboard_auth_ws_auth.py continue to pass.	2026-06-21 13:33:18 -07:00
sgaofen	a4b1554c73	fix(whatsapp): normalize bare phone targets to JIDs before bridge send Baileys' jidDecode crashes ("Cannot destructure property 'user' of jidDecode(...) as it is undefined") when handed a bare phone number, so sending a WhatsApp message to +50766715226 / 50766715226 returned HTTP 500 and never delivered (#8637). Add to_whatsapp_jid() to gateway/whatsapp_identity.py — the outbound inverse of normalize_whatsapp_identifier: it builds the JID a send must use (bare phone -> <digits>@s.whatsapp.net) and passes through already qualified JIDs (@g.us, @lid, status@broadcast, @newsletter) unchanged. Wire it at every outbound bridge call site in the WhatsApp adapter (send, edit, media, typing, get_chat_info, and the standalone cron / send_message sender). Co-authored-by: Hermes Agent <noreply@nousresearch.com>	2026-06-21 13:32:22 -07:00
LeonSGP43	09a96ba0f6	fix(gateway): pause Telegram typing before stream finalize In Telegram streaming, the typing indicator persisted through the slow final rich-text/MarkdownV2 finalize edit, so the '...typing' bubble lingered for seconds after the last streamed token. Add a one-shot on_before_finalize hook to GatewayStreamConsumer, fired once when the stream transitions into its finalization path, and wire it on both Telegram streaming call sites to call pause_typing_for_chat() before the final edit. Cover hook ordering and once-only behavior in tests. Fixes #49712	2026-06-21 13:10:25 -07:00
teknium1	6902eb3913	fix(cli): make ZIP-update directory replace atomic so it can't delete ui-tui Root cause of #49145: the Windows ZIP-update path did rmtree(dst) then copytree(src, dst). If the copy failed partway — common on that path, which only runs because file I/O is already flaky on the machine — the directory was left deleted with nothing copied back. ui-tui/ vanishing is what broke 'hermes --tui' (WinError 267), but the bug hit every top-level directory. _atomic_replace_dir stages the new copy into a sibling temp dir and only swaps it in on full success, restoring the original on failure. A failed update now leaves the live tree untouched instead of half-deleted.	2026-06-21 13:10:22 -07:00
teknium1	db097fb088	fix(cli): auto-restore a deleted ui-tui workspace from git before TUI launch The Windows update path can leave tracked ui-tui/ files deleted in the working tree (HEAD intact). The guard now self-heals: when ui-tui/ is missing in a git checkout, run `git restore -- ui-tui` and continue, falling back to the printed manual-recovery steps only when git can't recover it (no checkout / restore failed). Builds on konsisumer's missing-workspace guard.	2026-06-21 13:10:22 -07:00
konsisumer	537ad9ea9a	fix(cli): guard missing ui-tui workspace before TUI launch	2026-06-21 13:10:22 -07:00
峯岸　亮	5b45fb269a	fix(security): sanitize kanban markdown html	2026-06-21 13:10:17 -07:00
helix4u	7502d38bf9	fix(windows): prefer cmd npm shim on PATH fallback	2026-06-21 14:06:39 -06:00
Teknium	8e4d2fd23f	docs(plugins): document acting from hooks via ctx.profile_name + dispatch_tool (#50352) Answers a recurring plugin-author question: how to read the active profile and drive Hermes from inside a hook callback when ctx._cli_ref is None (gateway, hermes chat -q, and kanban-spawned worker sessions). - Adds a 'Act from inside a hook' section to the plugin guide covering ctx.profile_name and ctx.dispatch_tool as the session-agnostic APIs, with a kanban_task_blocked example, and notes there is no in-process slash-command bridge for headless workers (shell out via the terminal tool instead). - Adds the three kanban lifecycle hooks to the hook reference table with their process semantics. - Pins the contract with a regression test: ctx.dispatch_tool invokes a tool handler with _cli_ref=None (worker/hook context). Requested by @Smithangshu on Discord.	2026-06-21 12:54:40 -07:00
Teknium	d164ed0326	fix(kanban): make reclaim claim-lock-aware to stop task/run status desync (#50366) After a worker crash + reclaim + respawn, the board could show a task in the Ready lane while its task_run was 'running' and the new worker was actively executing (#36910). The dispatcher could then treat live work as available and double-assign. Root cause: the three reclaim paths (detect_crashed_workers, release_stale_claims heartbeat-stale backstop, enforce_max_runtime) each snapshot a task's worker_pid/claim_lock, do liveness work, then reset tasks.status back to 'ready' with only a 'WHERE status=running' guard. If the task was reclaimed AND re-claimed by a NEW worker in between (new run, new claim_lock, live pid), the stale UPDATE clobbered the live task: status flipped to 'ready' while the fresh run stayed 'running'. claim_task is the only writer that sets status='running', so nothing put it back: permanent desync. Fix: gate each reset on the snapshot's claim_lock (and worker_pid where available) so it only fires when the task is still owned by the worker the reclaim was computed for. A stale reclaim now no-ops (rowcount 0) instead of desyncing a re-claimed task. Genuine crashes (lock still matches) reclaim exactly as before. This is the same race class the in-gateway dispatch lock (single-writer ticks) mitigates, closed at the row level so a single dispatcher's fast reclaim->respawn across two ticks is also safe. Closes #36910.	2026-06-21 12:49:07 -07:00
memosr	87615f47b9	test(backup): add regression tests for restore_quick_snapshot path traversal Per @egilewski's audit on this PR, the security fix is behaviorally correct but lacks focused regression coverage for the two traversal vectors it closes. Adding tests now so the path-traversal guard cannot silently regress. * test_restore_rejects_snapshot_id_traversal -- exercises the snapshot_id input guard with seven hostile values (parent traversal, single parent, bare '.', bare '..', forward slash, backslash, empty string). Each must return False without touching the filesystem. * test_restore_rejects_manifest_rel_traversal -- exercises the manifest rel guard by injecting '../../outside.txt' into a real snapshot's manifest.json, seeding a source payload at the escaped path, and asserting the destination outside HERMES_HOME does not exist after restore. This is the higher-value test of the pair -- verified locally that it fails without the fix in restore_quick_snapshot (the escape destination gets written) and passes with the fix in place. The 67 pre-existing tests in test_backup.py continue to pass.	2026-06-21 12:44:22 -07:00
Teknium	1f4c5aed6d	fix(kanban): honor kanban.auto_decompose toggle live, without a gateway restart (#50358) The gateway dispatcher captured kanban.auto_decompose ONCE at boot, so a user who flipped it to false to STOP auto-decompose had no way to make that take effect short of restarting the gateway. Reported (#49638): auto-decompose created and launched tasks the user never intended (while they were still typing the task description), and 'even Hermes Agent couldn't disable this feature', because the live config edit was silently ignored. Auto-decompose is a safety toggle; turning it off must halt fan-out on the next tick. The dispatcher now re-reads the flag (and auto_decompose_per_tick) from config every tick via the extracted _resolve_auto_decompose_settings(), which fails SAFE (disabled) on a config read error so a transient failure can never re-enable a feature the user turned off. Closes #49638.	2026-06-21 12:43:44 -07:00
Teknium	84ba83b09a	fix(kanban): bound the cross-process init lock so connect() can't hang forever (#50353) connect() wrapped its entire body in an unbounded blocking flock(LOCK_EX) on every call (_cross_process_init_lock). A single process stalled inside the critical section, or a stale lock held by a wedged worker, blocked every other connect(), including the long-lived gateway dispatcher's next-tick connect, forever. No timeout, no traceback, no recovery: the board silently stopped being worked until a manual restart (issue #36644). Two fixes: 1. Fast-path skip: once THIS process has initialized a path, the expensive first-open work (header validation, integrity probe, schema + additive migrations) is already cached in _INITIALIZED_PATHS. The steady-state connect has nothing for the cross-process lock to protect, so it now opens the connection (WAL + pragmas) under only the cheap in-process _INIT_LOCK and never touches the file lock. This removes the lock from the dispatcher's hot path entirely: a stalled external 'hermes kanban list' can no longer block ticks. 2. Bounded acquire: even on first-init, _cross_process_init_lock now retries a non-blocking acquire up to a 10s deadline, then logs a WARNING and proceeds WITHOUT the cross-process lock. Safe because the in-process _INIT_LOCK still serializes same-process threads and the init work is idempotent (CREATE TABLE IF NOT EXISTS + additive migrations); worst case is redundant work, not corruption. A bounded 'proceed anyway' beats an unbounded hang. Windows path switched LK_LOCK -> LK_NBLCK (non-blocking) to match. Closes #36644.	2026-06-21 12:43:41 -07:00
Teknium	9630ec6c19	fix(kanban): pin worker TERMINAL_CWD to the task workspace (#50348) _default_spawn launched the worker subprocess with cwd=workspace and set HERMES_KANBAN_WORKSPACE, but never set TERMINAL_CWD, so the worker inherited the dispatching gateway's TERMINAL_CWD. That value takes precedence over the process cwd in two places: - tools/file_tools.py::_resolve_base_dir: a relative write_file path resolved against the gateway user's home instead of the workspace, so artifacts silently landed outside the workspace (#41312). - agent_init's context-file loader: AGENTS.md was discovered relative to the gateway's cwd, so under multi-profile dispatch a worker loaded the AGENTS.md of whichever gateway won the claim race, not the task's (#34619). Both are the same root cause. Pinning TERMINAL_CWD to the workspace (where the task's work actually happens) fixes both. Guarded on an existing absolute dir because file_tools rejects relative/sentinel TERMINAL_CWD values; a non-dir workspace leaves the inherited value rather than writing a meaningless one. Closes #34619, closes #41312.	2026-06-21 12:43:37 -07:00
Teknium	b6d1072408	fix(cli): branch new worktrees from the fresh remote tip, not stale local HEAD (#50355) hermes -w created the worktree branch from the standalone clone's HEAD, which lags origin when the clone isn't freshly updated (it's only refreshed by hermes update, not per session). Every worktree branch then rooted on a stale base, so the PR diff GitHub computes against current main ballooned with unrelated changes and the agent had to discover the staleness at push time and rebase. _resolve_worktree_base() now fetches and branches from the freshest available ref: the current branch's upstream if it tracks one (so a deliberate feature-branch worktree tracks its own remote), else the remote's default branch (origin/HEAD), else local HEAD as a fail-soft fallback (offline / no remote / detached). A bogus 'origin/(unknown)' default is guarded, and worktree creation retries from HEAD if branching off the remote ref fails, so this is never worse than the old behavior. Gated by worktree_sync (default true); set worktree_sync: false to keep the old branch-from-local-HEAD behavior. The resolved base is printed in the session banner. This is the follow-up to the #50319 session, where the standalone clone was 213 commits behind origin and the worktree inherited that stale base.	2026-06-21 12:42:11 -07:00
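
The base-ref fallback order described in b6d1072408 can be sketched as a pure function. This is a minimal illustration, not the repository's code: `pick_worktree_base` and its parameters are hypothetical names, the caller is assumed to have already fetched and looked up the two candidate refs, and the retry-from-HEAD step on a failed branch creation is omitted.

```python
from typing import Optional


def pick_worktree_base(
    upstream: Optional[str],
    remote_default: Optional[str],
    worktree_sync: bool = True,
) -> str:
    """Choose the ref a new worktree branch should start from.

    Hypothetical helper. `upstream` is the current branch's tracking ref
    (for example "origin/feature-x") or None; `remote_default` is the
    remote's default branch as resolved from origin/HEAD, or None.
    """
    # worktree_sync: false keeps the old branch-from-local-HEAD behavior.
    if not worktree_sync:
        return "HEAD"

    # Preference order from the commit message: upstream first, then the
    # remote default branch.
    for candidate in (upstream, remote_default):
        # Skip missing refs and the bogus "origin/(unknown)" placeholder.
        if candidate and "(unknown)" not in candidate:
            return candidate

    # Fail-soft fallback: offline, no remote, or detached HEAD.
    return "HEAD"
```

With these assumptions, a branch that tracks a remote starts from its own upstream, a non-tracking checkout starts from the remote default, and anything unresolvable falls back to local HEAD, so the result is never worse than the pre-fix behavior.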


5951 commits